# Data Versioning

In the previous Model Development notebook, we transformed the data and saved the model, but we did not save the processed data. The preprocessing used in the Model Development notebook also differs from the version in the Data Preprocessing notebook. To close this gap, this notebook saves the latest version of the processed data.

Versioning the processed data alongside the model matters in production machine learning: it makes it possible to trace which data a model was trained on and to reproduce that training later.

In [1]:
from datetime import datetime

import numpy as np
import os
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

In [2]:
def preprocessing_pipeline(numeric_features_to_extract, passthrough_features, drop_features):
    """Return a ColumnTransformer that imputes Levy, extracts numeric values from string columns, passes the selected columns through unchanged, and drops the rest."""

    def extract_numeric_features(X, columns_to_extract):
        X_copy = X.copy()
        for col in columns_to_extract:
            X_copy[col] = pd.to_numeric(X_copy[col].str.split(' ').str[0], downcast='float', errors='coerce')
        return X_copy[columns_to_extract]

    def preprocess_levy_and_fillna(X):
    
        X_copy = X.copy()
        # Mark the "-" placeholder as missing; assigning back avoids a chained in-place replace
        X_copy['Levy'] = X_copy['Levy'].replace('-', np.nan)

        X_copy['Levy'] = pd.to_numeric(X_copy['Levy'], errors='coerce')
        mean_levy_by_year = X_copy.groupby('Prod. year')['Levy'].mean()
        mean_levy_by_year.fillna(0, inplace=True)
        
        for year in X_copy['Prod. year'].unique():
            mask = (X_copy['Prod. year'] == year) & X_copy['Levy'].isnull()
            X_copy.loc[mask, 'Levy'] = mean_levy_by_year[year]
        
        X_copy['Levy'] = X_copy['Levy'].astype(int)
        
        return X_copy
    
    column_transformer = make_column_transformer(
            (FunctionTransformer(preprocess_levy_and_fillna, validate=False), ['Prod. year', 'Levy']),
            (FunctionTransformer(extract_numeric_features, kw_args={'columns_to_extract': numeric_features_to_extract}), numeric_features_to_extract),
            ("passthrough", passthrough_features),
            ("drop", drop_features)
    )

    return column_transformer
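
To make the numeric extraction in `extract_numeric_features` concrete, here is a minimal sketch on toy values. The three strings are assumptions chosen for illustration rather than rows from the dataset, and the `downcast` argument is left out for brevity. The text before the first space is kept and converted, and anything that cannot be converted becomes `NaN`.

```python
import pandas as pd

# Toy values, assumed for illustration only
raw = pd.Series(["186005 km", "2.0 Turbo", "-"])

# Keep the text before the first space in each string
first_token = raw.str.split(" ").str[0]

# Convert to numbers; values that cannot be parsed are coerced to NaN
numeric = pd.to_numeric(first_token, errors="coerce")
print(numeric.tolist())
```

The first two values convert cleanly, while the `-` placeholder ends up as a missing value.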

Loading the raw data

In [3]:
# Load and process raw data
train_data_dir = os.path.join('.','data','raw','train.csv')
raw_train_data = pd.read_csv(train_data_dir)
raw_train_data.drop('ID', axis=1, inplace=True)

In [4]:
raw_train_data.head()

|  | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 13328 | 1399 | LEXUS | RX 450 | 2010 | Jeep | Yes | Hybrid | 3.5 | 186005 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Silver | 12 |
| 1 | 16621 | 1018 | CHEVROLET | Equinox | 2011 | Jeep | No | Petrol | 3.0 | 192000 km | 6.0 | Tiptronic | 4x4 | 04-May | Left wheel | Black | 8 |
| 2 | 8467 | - | HONDA | FIT | 2006 | Hatchback | No | Petrol | 1.3 | 200000 km | 4.0 | Variator | Front | 04-May | Right-hand drive | Black | 2 |
| 3 | 3607 | 862 | FORD | Escape | 2011 | Jeep | Yes | Hybrid | 2.5 | 168966 km | 4.0 | Automatic | 4x4 | 04-May | Left wheel | White | 0 |
| 4 | 11726 | 446 | HONDA | FIT | 2014 | Hatchback | Yes | Petrol | 1.3 | 91901 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |


In [5]:
y = raw_train_data["Price"]
X = raw_train_data.drop("Price", axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.2, random_state=42)

train_df = pd.concat([X_train, y_train], axis=1)
val_df = pd.concat([X_val, y_val], axis=1)

In [6]:
numeric_features = train_df.select_dtypes(np.number).columns.to_list()
extract_num_feats = ['Mileage', 'Engine volume']
passthrough_features = ['Manufacturer', 'Model', 'Fuel type', 
                        'Leather interior', 'Gear box type', 'Category','Price']
drop_features = list(set(X_train.columns) - set(numeric_features + extract_num_feats +
                                         ['Levy'] + passthrough_features))
ct = preprocessing_pipeline(numeric_features_to_extract=extract_num_feats,
                             passthrough_features=passthrough_features,
                             drop_features=drop_features)

Below is a look at the training data frame before transformation:

In [7]:
train_df.head(2)

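|  | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags | Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7808 | 779 | TOYOTA | Camry | 2013 | Sedan | Yes | Hybrid | 2.5 | 225510 km | 4.0 | Automatic | Front | 04-May | Left wheel | White | 12 | 314 |
| 16766 | 1282 | CHEVROLET | Captiva | 2007 | Jeep | Yes | Diesel | 2.0 | 76198 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 | 6429 |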


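After transformation, the first two rows match the two rows shown above, but the columns are in a different order, because the `ColumnTransformer` concatenates the outputs of its transformers in the order they were listed. The next step is to recover the column names from the pipeline. Usually `get_feature_names_out` would be sufficient, but it fails here because the `FunctionTransformer` steps do not define output feature names, so a different approach is needed.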

In [8]:
transformed_data = ct.fit_transform(train_df)
transformed_data[:2]

array([[2013, 779, 225510.0, 2.5, 'TOYOTA', 'Camry', 'Hybrid', 'Yes',
        'Automatic', 'Sedan', 314],
       [2007, 1282, 76198.0, 2.0, 'CHEVROLET', 'Captiva', 'Diesel',
        'Yes', 'Automatic', 'Jeep', 6429]], dtype=object)

Below, the column names are read from the transformer list in order, skipping the columns assigned to `drop`.

In [9]:
transformers_info = ct.get_params()['transformers']
column_names = [col for info in transformers_info for col in info[2] if info[1] != 'drop']
column_names

['Prod. year',
 'Levy',
 'Mileage',
 'Engine volume',
 'Manufacturer',
 'Model',
 'Fuel type',
 'Leather interior',
 'Gear box type',
 'Category',
 'Price']

In [10]:
cleaned_train_data = pd.DataFrame(transformed_data, columns=column_names)
cleaned_train_data.head()

|  | Prod. year | Levy | Mileage | Engine volume | Manufacturer | Model | Fuel type | Leather interior | Gear box type | Category | Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2013 | 779 | 225510.0 | 2.5 | TOYOTA | Camry | Hybrid | Yes | Automatic | Sedan | 314 |
| 1 | 2007 | 1282 | 76198.0 | 2.0 | CHEVROLET | Captiva | Diesel | Yes | Automatic | Jeep | 6429 |
| 2 | 2010 | 1399 | 189530.0 | 3.5 | MERCEDES-BENZ | E 350 | Diesel | Yes | Automatic | Sedan | 12388 |
| 3 | 2012 | 642 | 218525.0 | 2.0 | CHEVROLET | Orlando | Diesel | Yes | Automatic | Jeep | 14834 |
| 4 | 2000 | 2146 | 25000.0 | 3.0 | BMW | X5 | LPG | Yes | Tiptronic | Jeep | 10036 |


### Streamlining the steps above

In typical machine learning projects, preprocessing steps are fitted on the training data only, and the fitted transformation is then applied to holdout samples such as the validation and/or test sets.

In this pipeline, however, nothing is stored at fit time: the `FunctionTransformer` steps are stateless, and the remaining steps only pass columns through or drop them. Each split can therefore be transformed independently with `fit_transform`. One caveat is that the `Levy` imputation uses per-year means computed from whichever data frame is passed in, so the validation set is imputed with its own means rather than those of the training set.
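
If the validation set should instead be imputed with per-year `Levy` means learned from the training data, the imputation can be split into a fit step and an apply step. The following is a minimal sketch, not the notebook's method: `fit_levy_means` and `apply_levy_means` are hypothetical helpers, and the two small frames are toy data.

```python
import pandas as pd


def fit_levy_means(train):
    """Learn the mean Levy per production year from the training frame."""
    return train.groupby("Prod. year")["Levy"].mean()


def apply_levy_means(df, means, default=0.0):
    """Fill missing Levy values with the learned per-year mean (default for unseen years)."""
    out = df.copy()
    fill_values = out["Prod. year"].map(means).fillna(default)
    out["Levy"] = out["Levy"].fillna(fill_values)
    return out


# Toy data, assumed for illustration only
train = pd.DataFrame({"Prod. year": [2010, 2010, 2011], "Levy": [100.0, 300.0, 500.0]})
val = pd.DataFrame({"Prod. year": [2010, 2011, 2012], "Levy": [float("nan")] * 3})

means = fit_levy_means(train)
val_filled = apply_levy_means(val, means)
print(val_filled)
```

Years that never appear in the training data fall back to `default`, mirroring the `fillna(0)` used in the notebook's function.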

In [11]:
def clean_and_save_data(df, file_prefix):
    numeric_features = df.select_dtypes(np.number).columns.to_list()
    extract_num_feats = ['Mileage', 'Engine volume']
    passthrough_features = ['Manufacturer', 'Model', 'Fuel type', 
                            'Leather interior', 'Gear box type', 'Category','Price']
    # Use the columns of the frame passed in rather than the global X_train
    drop_features = list(set(df.columns) - set(numeric_features + extract_num_feats +
                                             ['Levy'] + passthrough_features))
    ct = preprocessing_pipeline(numeric_features_to_extract=extract_num_feats,
                                 passthrough_features=passthrough_features,
                                 drop_features=drop_features)

    # transformation
    transformed_data = ct.fit_transform(df)
    transformers_info = ct.get_params()['transformers']
    column_names = [col for info in transformers_info for col in info[2] if info[1] != 'drop']
    
    cleaned_data = pd.DataFrame(transformed_data, columns=column_names)
    
    # saving data
    data_dir = os.path.join('.','data','cleaned')
    os.makedirs(data_dir, exist_ok=True)
    
    version = datetime.now().strftime("%Y%m%d")
    cleaned_filename = os.path.join(data_dir, f'{file_prefix}_v{version}.csv')
    
    cleaned_data.to_csv(cleaned_filename, index=False)

In [12]:
clean_and_save_data(train_df, 'train')
clean_and_save_data(val_df, 'val')
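
Because the version tag is a fixed-width `YYYYMMDD` date, the file names sort chronologically, which makes it easy to find the most recent file later. The sketch below assumes that naming scheme; `latest_version_path` is a hypothetical helper that is not part of the notebook.

```python
from pathlib import Path
from typing import Optional


def latest_version_path(data_dir, file_prefix) -> Optional[Path]:
    """Return the newest '<prefix>_vYYYYMMDD.csv' in data_dir, or None if there is none."""
    # Fixed-width date tags sort lexicographically in chronological order
    candidates = sorted(Path(data_dir).glob(f"{file_prefix}_v*.csv"))
    return candidates[-1] if candidates else None
```

For example, `latest_version_path(os.path.join('.', 'data', 'cleaned'), 'train')` would return the most recently dated training file. Note that running the notebook twice on the same day overwrites that day's file, since the tag has daily resolution.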

## Future Considerations

* Save the data in cloud storage and read it from there
* Implement a more scalable data versioning system (think petabytes of data)
* Add automated data validation
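
As one small step toward stronger versioning, a content hash can be recorded next to each saved file, so that two files with the same date tag but different contents can be told apart. This is a minimal sketch using only the standard library; `file_sha256` is a hypothetical helper, and dedicated data versioning tools would be the more realistic choice at scale.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded fully into memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest changes whenever the file contents change, so storing it with the model metadata gives a simple check that a model is being reproduced from the same data.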