# 01 DATA COLLECTION

# <center>Evaluation of Gradient Boosted Models for Hard Drive Failure Prediction in Data Centers</center>

In recent years, the expansion of internet of things (IoT) and industrial internet of things (IIoT) connectivity and sensors has contributed to advancements in predictive maintenance (Market Research Future, 2022). In fact, the predictive maintenance market is forecast to reach $111.34 billion USD by 2030 (Market Research Future, 2022). Historically, maintenance consisted primarily of corrective and preventative maintenance (Maintenance (technical), 2022). Of the two maintenance types, corrective maintenance is often more expensive due to consequential part damage or downtime (Maintenance (technical), 2022). Preventative maintenance can also be costly due to the unwarranted replacement of parts (Maintenance (technical), 2022). In contrast, predictive maintenance uses sensors to monitor equipment health conditions and combines the sensor data with analytics to predict and prevent unexpected equipment failures (Predictive maintenance, 2022). Predictive maintenance offers benefits such as cost savings, improved reliability, and reduced downtime (Predictive maintenance, 2022).  
<br>
One industry that can benefit, and already is benefiting, from predictive maintenance is the data center industry. Depending on the type of data center, various levels of fault tolerance and redundancy exist, which correspond to uptime requirements ranging from 99.671% (28.8 hours annual downtime) to 99.995% (26.3 minutes annual downtime) (Hewlett Packard Enterprise Development LP, 2022). To maintain this high level of uptime and availability, predictive maintenance is being leveraged to forecast failures of equipment including, but not limited to, generators, power distribution units (PDU), transfer switches, transformers, and uninterruptible power supplies (UPS) (Tyrrell, 2022). Beyond this ancillary equipment, data center computing equipment such as hard drives might also benefit from predictive maintenance.  
<br>
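
The downtime figures quoted above follow directly from the uptime percentages. The short sketch below reproduces that arithmetic, assuming a 365-day year; the function name `annual_downtime_hours` is illustrative and not part of the project code.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def annual_downtime_hours(uptime_percent):
    """Return the yearly downtime, in hours, allowed by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# The two uptime requirements cited in the text above.
low_tier_hours = annual_downtime_hours(99.671)
high_tier_minutes = annual_downtime_hours(99.995) * 60

print(f"99.671% uptime: {low_tier_hours:.1f} hours of downtime per year")
print(f"99.995% uptime: {high_tier_minutes:.1f} minutes of downtime per year")
```

<br>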
Accurately predicting hard drive failures within a data center can ensure operational readiness, improve reliability, and reduce costs. Although there are protective measures in place to distribute files or objects over different hard drives and locations, hard drive failures still present a risk of data loss to customers (Wilson, 2018).  
<br>
The project will determine whether predictive maintenance using machine learning can be leveraged to proactively identify and replace failing hard drives to mitigate these risks. Furthermore, the project will evaluate three gradient boosted classifier models (histogram-based gradient boosting, CatBoost, and XGBoost) to determine which model achieves the best evaluation metrics, specifically the F1 and F2 scores. The F1 score is the harmonic mean of precision and recall (F-score, 2022). The F2 score is a weighted variant of the same measure that places more importance on recall than on precision (F-score, 2022).  
<br>
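
Both scores are special cases of the general F-beta score, where beta controls how much more weight recall receives than precision. The sketch below is a minimal illustration using the standard F-beta formula; the precision and recall values are hypothetical and do not come from the project's models.

```python
def f_beta(precision, recall, beta=1.0):
    """Return the F-beta score for a given precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical classifier results, chosen only to illustrate the weighting.
precision, recall = 0.5, 0.8

f1 = f_beta(precision, recall, beta=1)  # harmonic mean of precision and recall
f2 = f_beta(precision, recall, beta=2)  # recall weighted more heavily

print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

Because recall is higher than precision in this example, the F2 score comes out higher than the F1 score. For hard drive failures, a missed failure (low recall) is typically the costlier mistake, which is one reason to report F2 alongside F1.

<br>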
To answer this question, hard drive data from within a data center environment must be collected for analysis. Backblaze is a cloud company that provides data storage services. In 2015, Backblaze started publishing daily hard drive snapshots for each hard drive in Backblaze data centers (Beach, 2015). The daily hard drive snapshots provide specific information for each hard drive, such as date, serial number, model, capacity (bytes), failure (0 - operational, 1 - failed), and S.M.A.R.T. attributes (normalized and raw) in a comma-separated values (CSV) file (Beach, 2015). The S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) or SMART attributes provide health indicators and statistics for the hard drive (S.M.A.R.T., 2022). These data fields will be explored in greater depth throughout the project. For the years 2013 through 2015, the daily hard drive snapshots were published in a yearly ZIP file consisting of 365 CSV files. Starting in 2016, the daily hard drive snapshots were published in quarterly ZIP files consisting of 90-92 CSV files depending on the quarter and year. Backblaze makes the ZIP files available for download through HTTPS download URLs.  
<br>
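
The range of 90-92 CSV files per quarterly ZIP file follows from the number of calendar days in each quarter, assuming one snapshot file per day. The sketch below computes that count; the helper name `days_in_quarter` is illustrative only.

```python
from datetime import date


def days_in_quarter(year, quarter):
    """Return the number of calendar days in a quarter (1-4) of a year."""
    start = date(year, 3 * quarter - 2, 1)
    # The quarter ends where the next quarter (or the next year) begins.
    if quarter == 4:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, 3 * quarter + 1, 1)
    return (end - start).days


q1_2022_days = days_in_quarter(2022, 1)
print(f"Expected daily CSV files for Q1 2022: {q1_2022_days}")
```

<br>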
The project will focus on the Backblaze hard drive data for the first quarter of 2022.  

## Overview of the Jupyter Notebook  

The notebook creates a data directory and then uses functions to download the hard drive snapshots for the calendar quarter from the Backblaze website as a ZIP file. The hard drive snapshots are stored in CSV files, which are extracted from the quarterly ZIP file and moved to a corresponding directory for the quarter within the data directory. Lastly, the quarter of CSV files is converted to a Parquet file, which is then examined.  

## Import modules and libraries

In [1]:
import os
import re
import shutil
import zipfile
from pathlib import Path

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import requests

## Create data directory  

To maintain file organization and structure, a `data` directory is created (if it does not already exist) and its path is assigned to the `data_path` variable.  

In [2]:
# References
# https://docs.python.org/3/library/pathlib.html
# https://docs.python.org/3/reference/compound_stmts.html
# https://docs.python.org/3/library/exceptions.html
# https://docs.python.org/3/tutorial/errors.html
# https://docs.python.org/3/library/os.html

# The current working directory is assigned to 
# the 'cwd_path' variable.
cwd_path = Path.cwd()

# Try to make the 'data' directory, but if the directory 
# already exists then catch the exception and print a message.
try:
    os.makedirs(cwd_path.joinpath('data'), exist_ok=False)
except FileExistsError:
    print("Directory already exists")

# The 'data' directory is joined to the 'cwd_path' and assigned 
# to the 'data_path' variable.
data_path = cwd_path.joinpath('data')
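
The same result can be reached with `pathlib` alone. The sketch below is one possible alternative, not the notebook's method: `Path.mkdir` with `exist_ok=True` silently accepts an existing directory, so no `try`/`except` block is needed. A temporary directory stands in for the current working directory so that the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

# A temporary directory is used here in place of the working directory.
with tempfile.TemporaryDirectory() as base:
    data_dir = Path(base) / "data"

    # 'parents=True' also creates missing parent directories and
    # 'exist_ok=True' makes a repeated call a no-op.
    data_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)

    created = data_dir.is_dir()

print(f"Directory created: {created}")
```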

## Download the quarterly ZIP file from Backblaze website  

A function is defined and used to download the ZIP file for the first quarter of 2022 from the Backblaze website.  

In [3]:
# References:
# https://requests.readthedocs.io/en/latest/
# https://docs.python.org/3/library/shutil.html

def download_file(url, file):
    """
    Download a file from a website using the URL path
    and file name. Move the downloaded file to the 
    '/data' directory of the current working 
    directory.
    
    Keyword arguments:
    url -- the URL path location of the file to be downloaded
    file -- the file name to be downloaded
    """
    
    # Using a concatenation of the URL path ('url') 
    # and file name ('file'), the 'url_file' variable is created.
    url_file = url + file
    
    # A HTTP request to get the file ('url_file') is made 
    # and a Response object ('r') is received.
    # If necessary, redirects are allowed.
    r = requests.get(url_file, allow_redirects=True)
    
    # If the HTTP request was unsuccessful, an exception is raised 
    # so that an error page is not saved as the ZIP file.
    r.raise_for_status()
    
    # Using the 'with' statement, the file to be downloaded ('file')
    # is opened with the writing binary ('wb') mode.
    # The content from the HTTP request response object ('r') is 
    # written to the opened file ('fd').
    with open(file, 'wb') as fd:
        fd.write(r.content)
    
    # After the file is closed, the downloaded file ('file') is moved 
    # from the current working directory to the 'data_path' directory.
    # If the file already exists in the 'data_path' directory, the 
    # file is overwritten. 
    # In the event there is an exception, the error is caught and 
    # printed.
    src = str(file)
    dst = str(data_path.joinpath(file))
    try:
        shutil.move(src, dst)
        print(f"File downloaded: {file}")
    except Exception as err:
        print(f"Error: {err}")

In [4]:
# Assign the 'dl_file' variable to the zip file ('data_Q1_2022.zip') to 
# be downloaded.
dl_file = 'data_Q1_2022.zip'

# Assign the 'url' variable to the URL path where the zip files 
# are located on the website.
url = 'https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/'

In [5]:
# The URL location ('url') and the file ('dl_file') to be downloaded 
# are passed to the 'download_file' function.
download_file(url, dl_file)

File downloaded: data_Q1_2022.zip
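
The `download_file` function holds the whole response in memory before writing it to disk, which can be costly for a large quarterly ZIP file. The sketch below shows one way to stream a download to disk in chunks instead. It deliberately swaps in the standard library's `urllib.request` rather than `requests`, and the function name `stream_download` and the chunk size are assumptions made for illustration. A local `file://` URL is used so the example runs without network access.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def stream_download(url, dest_path, chunk_size=1024 * 1024):
    """Copy the resource at 'url' to 'dest_path' in fixed-size chunks."""
    # 'urlopen' raises an exception for HTTP error responses, and
    # 'copyfileobj' reads and writes 'chunk_size' bytes at a time.
    with urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest_path)


# Demonstration with a local placeholder file instead of the real ZIP file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.zip"
    source.write_bytes(b"placeholder bytes" * 1000)

    target = stream_download(source.as_uri(), Path(tmp) / "copy.zip")
    copied_ok = target.read_bytes() == source.read_bytes()

print(f"Copy matches source: {copied_ok}")
```

In the notebook, the concatenated `url + dl_file` string would take the place of the `file://` URL.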


## Unzip the quarterly ZIP file to a quarterly directory within the data directory  

A function is defined and used to uncompress the ZIP file and extract the CSV files to a quarterly directory for the first quarter of 2022.  

In [6]:
# References
# https://realpython.com/working-with-files-in-python/
# https://docs.python.org/3/library/zipfile.html
# https://docs.python.org/3/library/re.html
# https://regex101.com/
# https://stackoverflow.com/questions/6773584/how-are-glob-globs-return-values-ordered
# https://docs.python.org/3/library/pathlib.html
# https://docs.python.org/3/library/shutil.html
# https://stackoverflow.com/questions/31813504/move-and-replace-if-same-file-name-already-exists

def unzip_file(zip_file_path, zip_file):
    """
    Unzip a ZIP file from a path where the ZIP files
    exist to expose the CSV files. If the CSV file matches a 
    regular expression, move the CSV file to the 
    '/data/q#_202#' directory of the current working 
    directory.
    
    Keyword arguments:
    zip_file_path -- the ZIP file path of the zipped file
    zip_file -- the ZIP file name to be unzipped
    """
    
    # Using the 'with' statement, the ZIP file ('zip_file') is read 
    # and a ZipFile object ('zip_file_object') is created.
    with zipfile.ZipFile(zip_file_path.joinpath(zip_file)) \
    as zip_file_object:
        
        # Try to make the 'temp' directory joined to 'zip_file_path' 
        # path, but if the directory already exists then pass.
        try:
            os.makedirs(zip_file_path.joinpath('temp'), 
                        exist_ok=False)
        except FileExistsError:
            pass
        
        # Try to make the sliced 'zip_file' directory joined to 
        # 'zip_file_path' path, but if the directory already exists 
        # then pass.
        try:
            os.makedirs(zip_file_path.joinpath((zip_file[5:12]).lower()), 
                        exist_ok=False)
        except FileExistsError:
            pass
        
        # The 'temp' directory is joined to the 'zip_file_path' path 
        # and assigned to the 'temp_path' variable.
        temp_path = zip_file_path.joinpath('temp')
        
        # The sliced 'zip_file' directory is joined to the 
        # 'zip_file_path' path and assigned to the 'qtr_path' variable.
        qtr_path =  zip_file_path.joinpath((zip_file[5:12].lower()))
        
        # Extract all folders and files from the ZipFile object 
        # ('zip_file_object') to the 'temp' directory ('temp_path').
        zip_file_object.extractall(path=temp_path)
        
        # Assign 'csv_re_pattern' variable using a raw-string
        # regular expression pattern to 
        # match the '20yy-mm-dd.csv' format.
        csv_re_pattern = r'20(\d{2})[/.-](\d{2})[/.-](\d{2})\.csv$'
        
        # A 'for' loop iterates over the 'csv_file' CSV files in  
        # the 'temp_path' directory and if the 'csv_file' matches 
        # the 'csv_re_pattern' regular expression pattern,
        # the 'csv_file' is moved from the 'temp_path' directory to
        # the 'qtr_path' directory.
        # If the 'csv_file' CSV file already exists in 
        # the 'qtr_path' directory, the file is overwritten. 
        # In the event there is an exception, the error is caught
        # and printed.
        for csv_file in (csv_file for csv_file \
                         in sorted(temp_path.rglob('*.csv')) \
                         if re.match(csv_re_pattern, csv_file.name)):
            src = str(csv_file)
            dst = str(qtr_path) + '/' + str(csv_file.name)
            try:
                shutil.move(src, dst)
            except Exception as err:
                print(f"Error: {err}")
                
    # Removes the 'zip_file' file after the ZipFile object 
    # ('zip_file_object') has been closed.
    zip_file_path.joinpath(zip_file).unlink()
    
    # Removes the 'temp_path' directory and contents.
    # Since files and directories exist in the 'temp_path' directory, 
    # the 'shutil.rmtree' function is used to remove the whole 
    # directory tree.
    shutil.rmtree(temp_path)

In [7]:
# The path ('data_path') where the downloaded file ('dl_file') exists
# and the downloaded file ('dl_file') are passed to the 'unzip_file'
# function.
unzip_file(data_path, dl_file)
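
As an alternative to extracting everything into a `temp` directory and then moving files, the matching CSV members can be copied straight out of the archive. The sketch below illustrates that idea under stated assumptions: the function name `extract_daily_csvs` is hypothetical, and the sample archive layout (a nested folder, a metadata file, and a text file) is invented for the demonstration rather than taken from the Backblaze ZIP file.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path, PurePosixPath

# Matches file names such as '2022-01-01.csv'.
CSV_NAME = re.compile(r"20\d{2}-\d{2}-\d{2}\.csv")


def extract_daily_csvs(zip_path, out_dir):
    """Copy only the daily CSV members of a ZIP file into 'out_dir'."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in sorted(archive.namelist()):
            # ZIP member names always use '/' as the separator.
            name = PurePosixPath(member).name
            if CSV_NAME.fullmatch(name):
                with archive.open(member) as src, \
                        open(out_dir / name, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(name)
    return extracted


# Demonstration with a small, made-up archive.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("data_Q1_2022/2022-01-01.csv",
                         "date,failure\n2022-01-01,0\n")
        archive.writestr("__MACOSX/data_Q1_2022/._2022-01-01.csv", "metadata")
        archive.writestr("data_Q1_2022/readme.txt", "notes")

    names = extract_daily_csvs(zip_path, Path(tmp) / "q1_2022")

print(names)
```

This approach avoids the temporary directory and its cleanup step, at the cost of flattening any folder structure inside the archive.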

## Convert the quarterly grouped CSV files into a Parquet file format

To improve performance, the data is converted from the CSV file format to the Parquet file format. CSV files have the benefits of being a common file format accepted by many applications, being human-readable, and typically being fast to write (Staubli, 2017). However, CSV files are not as efficient for complex data processing (Spicer, 2017). Due to the columnar storage format design, Parquet files are optimized for analytical workloads (Staubli, 2017). Parquet files include a schema, which avoids the computational expense of inferring a schema from a CSV file. In one comparison, queries on Parquet files were 34 times faster than on CSV files (Yowakim, 2021). Another benefit of the Parquet file format is a reduced file size (Spicer, 2017).  

A function is defined and used to read and write the CSV files into a Parquet file for the first quarter of 2022.  

In [8]:
# References
# https://arrow.apache.org/docs/python/dataset.html
# https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow
# https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
# https://arrow.apache.org/docs/python/generated/pyarrow.dataset.CsvFileFormat.html
# https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
# https://stackoverflow.com/questions/70478984/how-to-correct-csv-file-mixed-types-if-using-pyarrow-write-dataset-to-parquet
# https://stackoverflow.com/questions/67071323/how-to-control-whether-pyarrow-dataset-write-dataset-will-overwrite-previous-dat

def csv_to_parquet(csv_file_path):
    """
    Accepts a path to a location of CSV files. Reads the CSV files into
    a PyArrow Dataset and interprets the schema. The fields matching 
    'smart' are assigned to an int64 data type and the schema is 
    updated. The PyArrow Dataset is written to a Parquet file.
    
    Keyword arguments:
    csv_file_path --  the path location of the CSV files
    """
    # Try to make 'parquet' directory joined to 'data_path' path, 
    # but if the directory already exists then pass.
    try:
        os.makedirs(data_path.joinpath('parquet'), 
                    exist_ok=False)
    except FileExistsError:
        pass
    
    # The 'parquet' directory is joined to the 'data_path' and assigned 
    # to the 'pq_path' variable.
    pq_path = data_path.joinpath('parquet')
    
    # To create a unique Parquet file name, the 'csv_file_path' string
    # is sliced and appended with '{i}' for the iterator and '.parquet'
    # file suffix.
    pq_name_string = str(csv_file_path)[-7:] + '-' + '{i}' + '.parquet'
    
    # The 'dataset' function from PyArrow 'dataset' module scans
    # the 'csv_file_path' directory for all the CSV files
    # A 'dataset' object is created and assigned to the 'dataset' 
    # variable.
    # The schema is inferred from the first CSV file, which has 
    # missing values, resulting in some 'fields' having the 'null' 
    # data type.  
    dataset = ds.dataset(csv_file_path, format='csv')
    
    # Create 'column_types' empty dictionary to store the column names 
    # and column types from the 'dataset' schema.
    column_types = {}
    
    # Assign 'field_smart_pattern' variable using a regular expression 
    # pattern to match field names that begin with 'smart_'.
    field_smart_pattern = r'^smart_'
    
    # A 'for' loop iterates through each 'field' in the 'dataset' 
    # schema and if the 'field.name' matches the 'field_smart_pattern' 
    # regular expression, the 'field.name' is assigned PyArrow 'int64' 
    # data type and collected in the 'column_types' dictionary.
    for field in dataset.schema:
        if re.search(field_smart_pattern, field.name):
            column_types[field.name] = pa.int64()
        else:
            pass
    
    # The 'column_types' dictionary with the 'field.name' keys and 
    # PyArrow 'int64' data types is passed to the 'column_types'
    # argument of the 'ConvertOptions' class from PyArrow 'csv' module.
    convert_options = csv.ConvertOptions(column_types=column_types)
    
    # The 'convert_options' ConvertOptions object is passed to 
    # 'csv_file_format' CsvFileFormat object.
    csv_file_format = ds.CsvFileFormat(convert_options=convert_options)
    
    # Again, the 'dataset' function from PyArrow 'dataset' module scans
    # the 'csv_file_path' directory for all the CSV files.
    # However, the 'csv_file_format' CsvFileFormat object is used to 
    # map the column names to the column types.
    dataset = ds.dataset(csv_file_path, format=csv_file_format)
    
    # The 'write_dataset' function from PyArrow 'dataset' module writes
    # the 'dataset' object using a Parquet file format to the 'pq_path' 
    # directory and uses the 'pq_name_string' as the file name.
    # In the event the file exists, the file will be overwritten.
    ds.write_dataset(dataset, base_dir=pq_path,
                     basename_template=pq_name_string,
                     format='parquet',
                     existing_data_behavior='overwrite_or_ignore')
    
    # Removes the 'csv_file_path' directory and contents.
    # Since files exist in the 'csv_file_path' directory, the
    # 'shutil.rmtree' function is used to remove the whole directory 
    # tree.
    shutil.rmtree(csv_file_path)

In [9]:
# The 'q1_2022' path is joined to the 'data' path to create the 
# 'q1_2022_path' path as the location of quarter of CSV files.
q1_2022_path = data_path.joinpath('q1_2022')

In [10]:
# The 'q1_2022_path' path is passed to the 'csv_to_parquet' function, 
# where the CSV files will be read and converted to a Parquet file.
csv_to_parquet(q1_2022_path)

## Inspect the Parquet file metadata

The metadata exposes how the Parquet file was created, as well as the number of columns, number of rows, number of row groups, format version, and serialized size.  

In [11]:
# The 'parquet' directory is assigned to the 'pq_path' variable.
pq_path = data_path.joinpath('parquet')

# The name of the Parquet file to be examined is assigned to the 
# 'pq_file' variable.
pq_file = 'q1_2022-0.parquet'

In [12]:
# References
# https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_metadata.html

# The Parquet path ('pq_path') and Parquet file ('pq_file') are 
# combined and passed to the PyArrow 'read_metadata' function to 
# create 'pq_md'. The Parquet file ('pq_file') and Parquet metadata 
# ('pq_md') are displayed.
pq_md = pq.read_metadata(str(pq_path) + '/' + pq_file)
print(pq_file)
print(pq_md)
print('\n')

q1_2022-0.parquet
<pyarrow._parquet.FileMetaData object at 0x7f20a13e7950>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 179
  num_rows: 18845260
  num_row_groups: 6366
  format_version: 1.0
  serialized_size: 123756203




The number of columns and rows provide some insight into the dimensions of the data. The Parquet file consists of 179 columns and 18,845,260 rows; a sizeable amount of data.  
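
A few derived figures help put the metadata in perspective. The sketch below is simple arithmetic on the values printed above; the count of 90 daily files is an assumption based on the number of calendar days in the first quarter of 2022.

```python
# Values copied from the Parquet metadata output above.
num_columns = 179
num_rows = 18_845_260
num_row_groups = 6_366

# Assumption: one CSV file per calendar day in Q1 2022.
daily_files = 90

rows_per_row_group = num_rows / num_row_groups
rows_per_daily_file = num_rows / daily_files
total_cells = num_rows * num_columns

print(f"Average rows per row group:  {rows_per_row_group:,.0f}")
print(f"Average rows per daily file: {rows_per_daily_file:,.0f}")
print(f"Total cells (rows x columns): {total_cells:,}")
```

Under that assumption, the average rows per daily file approximates the number of hard drives reporting each day.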

## Inspect the Parquet file schema

The schema reveals the Parquet file field names and data types.  

In [13]:
# References
# https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_schema.html

# The Parquet path ('pq_path') and Parquet file ('pq_file') are 
# combined and passed to the PyArrow 'read_schema' function to 
# create 'pq_rs'. The Parquet file ('pq_file') and Parquet schema 
# ('pq_rs') are displayed.
pq_rs = pq.read_schema(str(pq_path) + '/' + pq_file)
print(pq_file)
print(pq_rs)
print('\n')

q1_2022-0.parquet
date: date32[day]
serial_number: string
model: string
capacity_bytes: int64
failure: int64
smart_1_normalized: int64
smart_1_raw: int64
smart_2_normalized: int64
smart_2_raw: int64
smart_3_normalized: int64
smart_3_raw: int64
smart_4_normalized: int64
smart_4_raw: int64
smart_5_normalized: int64
smart_5_raw: int64
smart_7_normalized: int64
smart_7_raw: int64
smart_8_normalized: int64
smart_8_raw: int64
smart_9_normalized: int64
smart_9_raw: int64
smart_10_normalized: int64
smart_10_raw: int64
smart_11_normalized: int64
smart_11_raw: int64
smart_12_normalized: int64
smart_12_raw: int64
smart_13_normalized: int64
smart_13_raw: int64
smart_15_normalized: int64
smart_15_raw: int64
smart_16_normalized: int64
smart_16_raw: int64
smart_17_normalized: int64
smart_17_raw: int64
smart_18_normalized: int64
smart_18_raw: int64
smart_22_normalized: int64
smart_22_raw: int64
smart_23_normalized: int64
smart_23_raw: int64
smart_24_normalized: int64
smart_24_raw: int64
smart_160_norm

The field names and data types appear to be correct and there are no fields that indicate a null value for the data type.  

## Summary

The hard drive snapshots in CSV files for the first quarter of 2022 were downloaded in a ZIP file from the Backblaze website. The CSV files were extracted from the ZIP file to a quarterly directory. Using PyArrow, the CSV files were read and written to a Parquet file for improved performance. Lastly, the metadata and schema of the Parquet file were examined to understand the structure of the data.  

## To prevent memory issues with other notebooks, please shut down the kernel to free up memory, or uncomment and run the cell below.  

In [14]:
#quit()

## <center>References</center>  

Backblaze. (2022). <i>Hard drive data and stats.</i> https://www.backblaze.com/b2/hard-drive-test-data.html  
<br>
Beach, B. (2015, February 4). <i>Reliability data set for 41,000 hard drives now open-source.</i> Backblaze Blog. https://www.backblaze.com/blog/hard-drive-data-feb2015/  
<br>
F-score. (2022, September 18). In <i>Wikipedia</i>. https://en.wikipedia.org/w/index.php?title=F-score&oldid=1110996374  
<br>
Hewlett Packard Enterprise Development LP. (2022). <i>What are data center tiers?</i> https://www.hpe.com/uk/en/what-is/data-center-tiers.html  
<br>
Maintenance (technical). (2022, September 18). In <i>Wikipedia</i>. https://en.wikipedia.org/w/index.php?title=Maintenance_(technical)&oldid=1110994458  
<br>
Market Research Future. (2022, June 16). <i>Predictive maintenance market to hit USD 111.34 billion by 2030, at a CAGR of 26.2% - report by Market Research Future (MRFR).</i> GlobeNewswire. https://www.globenewswire.com/en/news-release/2022/06/16/2463729/0/en/Predictive-Maintenance-Market-to-Hit-USD-111-34-Billion-by-2030-at-a-CAGR-of-26-2-Report-by-Market-Research-Future-MRFR.html  
<br>
Predictive maintenance. (2022, September 21). In <i>Wikipedia</i>. https://en.wikipedia.org/w/index.php?title=Predictive_maintenance&oldid=1111479330  
<br>
S.M.A.R.T. (2022, September 9). In <i>Wikipedia</i>. https://en.wikipedia.org/w/index.php?title=S.M.A.R.T.&oldid=1109378579  
<br>
Spicer, T. (2017, June 14). <i>Apache Parquet: How to be a hero with the open-source columnar data format.</i> Openbridge. https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04  
<br>
Staubli, G. (2017, October 9). <i>Spark file format showdown – CSV vs JSON vs Parquet</i>. LinkedIn. https://www.linkedin.com/pulse/spark-file-format-showdown-csv-vs-json-parquet-garren-staubli/  
<br>
Tyrrell, J. (2022, July 28). <i>Efficiency gains: predictive maintenance supports data center operations.</i> TechHQ. https://techhq.com/2022/07/machine-learning-data-center-maintenance/  
<br>
Wilson, B. (2018, July 17). <i>Backblaze durability calculates at 99.999999999% — and why it doesn’t matter.</i> Backblaze. https://www.backblaze.com/blog/cloud-storage-durability/  
<br>
Yowakim, E. (2021, December 7). <i>Difference between Parquet and CSV.</i> LinkedIn. https://www.linkedin.com/pulse/difference-between-parquet-csv-emad-yowakim/  