# Data Preprocessing

This notebook walks through the preprocessing steps that turn the raw hard-drive data into datasets ready for survival model training.

## Steps

1. **Download data** 
2. **Convert to Parquet** - `datastats2parquet.py` 
3. **Truncate data** - `truncate.py`  
4. **Split data** - `Splitter.py`

In [1]:
import sys
from pathlib import Path

sys.path.append('../src')

raw_data_dir = Path('../RawData')
data_dir = Path('../Data')

# Create both directories if they don't exist
raw_data_dir.mkdir(exist_ok=True)
data_dir.mkdir(exist_ok=True)

print(f"Raw data directory: {raw_data_dir}")
print(f"Processed data directory: {data_dir}")

Raw data directory: ../RawData
Processed data directory: ../Data


## Step 1: Download Dataset

This step downloads the raw dataset archives into `../RawData` with `wget`. As written, the cell fetches only `data_2015.zip` and `data_Q4_2017.zip`; the other URLs are commented out. Make sure you have sufficient disk space available before running it.

In [None]:
# Switch to the raw data directory (IPython magic; the change persists for later cells)
%cd ../RawData

# Download commands (the leading "!" runs each line in the system shell)
!wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_2015.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q1_2016.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q2_2016.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q3_2016.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q4_2016.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q1_2017.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q2_2017.zip
# !wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q3_2017.zip
!wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q4_2017.zip
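
Since the archives are large, it can help to check free disk space before downloading. The following is a minimal sketch using only the standard library. The helper name `has_free_space` and the 50 GiB threshold are illustrative assumptions, not measured requirements of this dataset.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# required_gb=50 is a placeholder; adjust it to the archives you plan to download
target = Path('.')
if not has_free_space(target, required_gb=50):
    print(f"Less than 50 GiB free under {target.resolve()}")
```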


## Step 2: Convert Data Statistics to Parquet

The `datastats2parquet.py` script converts the raw CSV data files into Parquet format: it unzips the downloaded archives and concatenates their contents. Set `unique_ids` higher than the actual number of drives so that every drive is included in the final file.

In [2]:
from preprocessing.datastats2parquet import main as unzip_and_agg

unzip_and_agg(data_folder='.', unique_ids=20000, frequency=1, sample_file='2016.parquet')

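Note: the recorded run of this cell failed with `ModuleNotFoundError: No module named 'pandas'`. Install `pandas` in the notebook environment before running it.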
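
To pick a safe value for `unique_ids`, one option is to count the distinct drives in the downloaded archives first. The sketch below is a standard-library illustration, not part of `datastats2parquet.py`. It assumes the CSVs identify drives in a `serial_number` column and skips `__MACOSX` metadata entries if the archives contain any. The function name `count_unique_drives` is hypothetical.

```python
import csv
import io
import zipfile
from pathlib import Path


def count_unique_drives(data_folder, id_column='serial_number'):
    """Count distinct drive IDs across all CSV files inside the ZIP archives in data_folder."""
    seen = set()
    for archive in sorted(Path(data_folder).glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                # Skip non-CSV entries and macOS metadata files, if present
                if not name.lower().endswith('.csv') or '__MACOSX' in name:
                    continue
                with zf.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
                    for row in csv.DictReader(text):
                        seen.add(row[id_column])
    return len(seen)


# Example (assumes the archives are in the current directory):
# n_drives = count_unique_drives('.')
# then pass a unique_ids value larger than n_drives to the conversion step
```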

## Step 3: Truncate Data

The `truncate.py` script handles truncation in the converted Parquet data. A drive is truncated when it is censored only because the observation period ended, that is, when it is a right-censored observation: it was still running without a recorded failure at the end of the data.

In [None]:
from preprocessing.truncate import truncate_observations

truncate_observations(input_file='merged.parquet', output_file='2016_2018_trunc.parquet')
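
The idea of truncation described above can be illustrated with a small sketch. This is not the logic of `truncate.py`; the function `to_survival_record`, its field names, and the rule "no failure and still observed on the last day of the study" are assumptions made for illustration.

```python
from datetime import date


def to_survival_record(observation_dates, failed, study_end):
    """Summarize one drive's observations as a survival record.

    A drive counts as truncated here when it never failed and was still
    being observed on (or after) the last day of the study.
    """
    first, last = min(observation_dates), max(observation_dates)
    return {
        'duration_days': (last - first).days + 1,  # inclusive of both end days
        'event': 1 if failed else 0,               # 1 = failure observed
        'truncated': (not failed) and last >= study_end,
    }


# A healthy drive seen up to the end of the observation period is truncated
record = to_survival_record(
    [date(2017, 12, 30), date(2017, 12, 31)],
    failed=False,
    study_end=date(2017, 12, 31),
)
print(record)
```

A drive that stops reporting before `study_end` without a failure is also censored, but not because the observation period ended, so this sketch does not flag it as truncated.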

## Step 4: Split Dataset

The `Splitter.py` script is the final preprocessing step. It applies the complete preprocessing pipeline and splits the data:

- **Stratified sampling**: Creates train/test splits that preserve the failure rate distribution (no drive appears in both the train and test sets)
- **Feature engineering**: Applies time transformations, aggregation, and scaling
- **Multiple sample sizes**: Generates datasets with different numbers of observations per drive
- **Quality control**: Removes drives with insufficient data or anomalous patterns

This step produces the final preprocessed datasets ready for survival model training.

In [None]:
from preprocessing.Splitter import main as split_data

split_data(input_file='2016_2018_trunc.parquet')
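
A drive-level stratified split of the kind described above could look roughly like the following. This is a standard-library sketch under stated assumptions, not the implementation in `Splitter.py`: the function `split_drives`, the `test_fraction` parameter, and the input mapping from drive ID to a failure flag are all hypothetical.

```python
import random
from collections import defaultdict


def split_drives(drive_failed, test_fraction=0.2, seed=0):
    """Split drive IDs into disjoint train/test sets, stratified by failure.

    drive_failed maps each drive ID to True if the drive ever failed.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for drive_id, failed in sorted(drive_failed.items()):
        by_label[bool(failed)].append(drive_id)

    train, test = set(), set()
    for ids in by_label.values():
        rng.shuffle(ids)
        # Take the same fraction from failed and healthy drives
        n_test = round(len(ids) * test_fraction)
        test.update(ids[:n_test])
        train.update(ids[n_test:])
    return train, test


# Toy example: 100 drives, every tenth one failed
drives = {f'drive_{i}': i % 10 == 0 for i in range(100)}
train_ids, test_ids = split_drives(drives)
```

Splitting by drive ID rather than by row keeps all observations of one drive on the same side of the split, which is what prevents overlap between train and test.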

## Done

Preprocessing completed. Generated files:
- `{sample_size}_train_preprocessed.csv` - training data
- `{sample_size}_{test_size}_test_preprocessed.csv` - test data

Next: use `training_demo.ipynb` for model training.