
Exax: Import the Backblaze Dataset

This repository contains the Exax import script for the Backblaze dataset. The script imports the data, types the columns, and fixes some problems with model naming.

It also contains a script to compute Backblaze's AFR (Annual Failure Rate) metric.

This script is mentioned in the PyData Global 2021 presentation "Computations as Assets - a New Approach to Reproducibility and Transparency" by Anders Berkeman, Carl Drougge, and Sofia Hörberg.

Install

git clone https://github.com/exaxorg/import_backblaze
cd import_backblaze

You might want to have a look at the file accelerator.conf, and set slices to the number of CPUs you want to use.

Download Data

cd data

# backblaze data
wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q1_2021.zip
...

You'll find all the data files at Backblaze. At least one file is needed to run the script. (Pick "data_Q1_2021.zip" to run the AFR calculations below!)

Run!

In a terminal

cd import_backblaze
python3 -m venv venv
source venv/bin/activate
pip install accelerator
ax server

In a second terminal (with ax server running)

cd import_backblaze
source venv/bin/activate

# Run this once to import the data
ax run import
# You can press CTRL-T while waiting for more verbose progress indication.

# Test by calculating the AFR values
ax run afr

On a three-year-old Lenovo T490 laptop, importing the latest zip file takes less than 6 minutes, and calculating the AFR takes 5 seconds.

Here's a selected and sorted pick from the output:

model                 #drives    #days   #fails       AFR
HGST_HMS5C4040ALE640     3168   281692        5     0.65%
HGST_HMS5C4040BLE640    12748  1146496       10     0.32%
HGST_HUH728080ALE600     1081    97027        4     1.50%
HGST_HUH721212ALE600     2605   233948        5     0.78%
HGST_HUH721212ALE604     5691   308793        6     0.71%
HGST_HUH721212ALN604    10834   974310        9     0.34%
ST4000DM000             18941  1701967       59     1.27%
ST6000DX000               886    79740        0     0.00%
ST8000DM002              9770   878106       26     1.08%
ST8000NM0055            14450  1297674       31     0.87%
ST10000NM0086            1206   108057        6     2.03%
ST12000NM0007           23036  1732307       66     1.39%
ST12000NM0008           20132  1764318       41     0.85%
ST12000NM001G            9044   704446       12     0.62%
ST14000NM001G            5990   538401       13     0.88%
ST14000NM0138            1684   135157        9     2.43%
ST16000NM001G            2460    54177        1     0.67%
TOSHIBA_MD04ABA400V        99     8910        0     0.00%
TOSHIBA_MG07ACA14TA     27372  2165421       34     0.57%
TOSHIBA_MG07ACA14TEY      406    33831        1     1.08%
TOSHIBA_MG08ACA16TEY     1014    91260        0     0.00%
WDC_WUH721414ALE6L4      8410   640767       10     0.57%
WDC_WUH721816ALE6L0       520     4680        0     0.00%
...

The AFR, drive-day, and failure values are the same as those published by Backblaze, but there are differences in the drive count column.
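The AFR column can be reproduced from the #days and #fails columns. Here is a minimal sketch, assuming the formula Backblaze publishes (failures per drive-year, expressed as a percentage). The function name `afr` is illustrative; this is not the repository's afr script.

```python
def afr(fails, drive_days):
    """Failure rate in percent: failures per drive-year, times 100."""
    if drive_days == 0:
        return 0.0
    return fails / (drive_days / 365) * 100

# (model, #days, #fails) rows copied from the table above
rows = [
    ("HGST_HMS5C4040ALE640", 281692, 5),
    ("ST4000DM000", 1701967, 59),
    ("ST6000DX000", 79740, 0),
]

for model, days, fails in rows:
    print(f"{model:<22}{afr(fails, days):6.2f}%")
```

Rounded to two decimals, the three rows give 0.65, 1.27 and 0.00, which matches the AFR column.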

In a web browser

Exax runs a simple web server. Set a port in accelerator.conf like this:

board listen: localhost:2020

Then restart the server and point a browser to:

http://localhost:2020

(Select another port or socket in the accelerator.conf file if this one is already in use.)

Data Quality

The Backblaze dataset is of high quality. All files in the collection use the same file format, column names, and header. Exax can import all data directly from the zip-archives, which simplifies things a lot. (Each zip-archive contains two hidden directories generated by OSX, but they can be filtered out using an option to the import function.)
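To illustrate the filtering idea outside Exax, here is a rough sketch using Python's standard zipfile module. The directory names it skips (`__MACOSX` and dot-prefixed entries) and the member paths in the demo archive are assumptions, not taken from the actual archives, and `data_members` is a hypothetical helper, not the import function's option.

```python
import io
import zipfile

def data_members(zf):
    """Return CSV member names, skipping macOS metadata entries."""
    keep = []
    for name in zf.namelist():
        parts = [p for p in name.split("/") if p]
        # Assumption: OSX-generated entries sit under "__MACOSX"
        # or have a path component starting with a dot.
        if any(p == "__MACOSX" or p.startswith(".") for p in parts):
            continue
        if name.endswith(".csv"):
            keep.append(name)
    return keep

# Build a tiny in-memory archive with a made-up layout for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/2021-01-01.csv", "date,serial_number\n")
    zf.writestr("__MACOSX/data/._2021-01-01.csv", "")
    zf.writestr("data/.DS_Store", "")

with zipfile.ZipFile(buf) as zf:
    members = data_members(zf)

print(members)
```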

Data Anomalies

There are a few anomalies in the data. The capacity_bytes column cannot always be trusted. Sometimes it contains huge numbers, sometimes it is negative (!). The command below lists all columns in the import_type dataset along with minimum and maximum values:

$ ax ds -c :import_type: | head -11

import-3277/default
    Parent: import-2891
    Method: modelcleaner
    Previous: import-3276
    Columns:
          capacity_bytes        int64       [-9116022715867848704, 600332565813390450]
          cleanmodel            ascii
          date                  date        [   2013-04-10,    2020-12-31]
          failure               bool        [        False,          True]
          model                 ascii
          serial_number         ascii
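A simple plausibility check is enough to flag both extremes shown in the capacity_bytes range. The sketch below is illustrative only: the 100 TB upper bound is an assumption, `plausible_capacity` is a hypothetical helper, and the third sample is a made-up value of roughly 4 TB.

```python
# Assumption: no single drive in the dataset is larger than 100 TB.
MAX_PLAUSIBLE_BYTES = 100 * 10**12

def plausible_capacity(capacity_bytes):
    """True if the value is positive and below the assumed upper bound."""
    return 0 < capacity_bytes <= MAX_PLAUSIBLE_BYTES

samples = [
    -9116022715867848704,  # minimum reported by ax ds
    600332565813390450,    # maximum reported by ax ds
    4000787030016,         # illustrative value of about 4 TB
]
flags = [plausible_capacity(v) for v in samples]
print(flags)
```

Only the last sample passes the check.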

The model column also has problems; we've seen two issues. The model WDC WUH721414ALE6L4 appears with both one and two spaces in the string, and there is a model named 00MD00, which is clearly incorrect.
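The spacing issue can be handled by normalising whitespace before comparing model names. This is a minimal sketch of that idea, not the repository's modelcleaner method. Joining with an underscore mirrors the names in the AFR output, but whether the real method works this way is an assumption.

```python
def clean_model(model):
    """Collapse runs of whitespace and join the parts with underscores."""
    # str.split() with no argument splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return "_".join(model.split())

one_space = clean_model("WDC WUH721414ALE6L4")
two_spaces = clean_model("WDC  WUH721414ALE6L4")
print(one_space, two_spaces)
```

Both variants map to the same cleaned name, so they are counted as one model.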

This is an Exax Accelerator project

This means that

  • processing is carried out in parallel, where possible,
  • the project is completely traceable and reproducible, and
  • everything is written in Python.

The Accelerator is an open source project (Apache License 2.0). See https://exax.org for more information.
