# Auto ML Pipeline
Using AutoGluon for prediction with a little bit of pre-cleaning

In [1]:
import pandas as pd
import numpy as np

from autogluon.tabular import TabularDataset, TabularPredictor


## Format Data

Read the data and do basic cleaning of NaN values.

We only replace a missing value when there is a meaningful replacement; otherwise we leave it for AutoGluon to handle.

In [2]:
# read data and log-transform the target variable
# the log reduces the skew in sale prices, and RMSE is measured on the log scale
df_train = pd.read_csv('../data/raw/train.csv')
df_train['SalePrice'] = np.log(df_train['SalePrice'])
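
A minimal sketch of what measuring RMSE on the log scale means, written with the standard library so the arithmetic is visible. The prices are hypothetical values chosen for illustration, not rows from `train.csv`, and `rmse` is a helper defined here rather than part of any library. On the log scale the same relative miss costs the same for a cheap and an expensive house, and `exp` converts predictions back to prices.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical prices for illustration only (not taken from train.csv)
actual = [100000.0, 1000000.0]
predicted = [110000.0, 1100000.0]  # both predictions are 10% too high

log_actual = [math.log(p) for p in actual]
log_predicted = [math.log(p) for p in predicted]

# Each house contributes the same error, log(1.1), despite the price gap
log_rmse = rmse(log_actual, log_predicted)
print('RMSE on log scale:', log_rmse)

# exp undoes the log transform when real prices are needed again
recovered = [math.exp(v) for v in log_predicted]
```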

In [3]:
# Find columns containing NaN values
columns_with_nan = df_train.columns[df_train.isna().any()].tolist()

# Print the columns with NaN values
print("Columns with NaN values:", columns_with_nan)
print('                   ')
print('that is ' + str(len(columns_with_nan)) + ' columns')

# Count NaN values in each column
nan_counts = df_train.isna().sum()

# Print the counts of NaN values in each column
print("NaN Value Counts in Each Column:")
print(nan_counts)

Columns with NaN values: ['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
                   
that is 19 columns
NaN Value Counts in Each Column:
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64


In [4]:
# Several features are easy to fill:
# categoricals that appear to be NaN because the property lacks that feature.
# Other columns often confirm this assumption (such as area = 0)

easy_fix_cols = ['Alley', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',  'BsmtFinType2', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']

# Replace NaN values in the categorical column with 'noFeature'
replacement_value = 'noFeature'
for col in easy_fix_cols:
    df_train[col] = df_train[col].fillna(replacement_value)
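
The assumption that a missing category means the feature is absent can be cross-checked against a related numeric column before filling. The sketch below uses a toy frame with made-up rows; the `PoolQC` / `PoolArea` pairing is one example of such a check, and on the real data the same filter would run on `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the real check would run on df_train
toy = pd.DataFrame({
    'PoolQC': [np.nan, 'Gd', np.nan, 'Ex'],
    'PoolArea': [0, 512, 0, 648],
})

# Rows where the quality is missing but the area says the feature exists
conflicts = toy[toy['PoolQC'].isna() & (toy['PoolArea'] > 0)]
print(len(conflicts), 'conflicting rows')

# With no conflicts, filling with the placeholder is consistent with the data
toy['PoolQC'] = toy['PoolQC'].fillna('noFeature')
```

If `conflicts` is not empty, those rows are genuinely missing a quality rating and may deserve separate handling.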

In [5]:
# missing values in 'MasVnrArea' are because there is no masonry veneer

# Find the indices of NaN entries in the 'MasVnrArea' column
column_name = 'MasVnrArea'
nan_indices = df_train[df_train[column_name].isna()].index

# Set the values in the 'MasVnrArea' column for the specified indices to 0
df_train.loc[nan_indices, column_name] = 0

In [6]:
# Assume missing 'LotFrontage' values mean the property has no lot frontage
df_train['LotFrontage'] = df_train['LotFrontage'].fillna(0)


### Things We Won't Fix

We leave missing garage build years empty because there is no obviously meaningful replacement, so we leave them to AutoGluon.

There is one missing 'Electrical' value; we leave that to AutoGluon as well.
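
A quick sanity check after the cleaning steps is to list the columns that still contain NaN. The sketch below runs on a toy frame with made-up rows; on the real data the same expression applied to `df_train` would be expected to return only the two columns discussed above.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; on the real data this would be df_train
toy = pd.DataFrame({
    'LotFrontage': [65.0, 0.0, 80.0],
    'GarageYrBlt': [2003.0, np.nan, 1976.0],
    'Electrical': ['SBrkr', 'SBrkr', np.nan],
})

# Columns that still contain at least one NaN after cleaning
remaining = toy.columns[toy.isna().any()].tolist()
print(remaining)
```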

## AutoGluon

### CV

Run a manual 10-fold cross-validation to get an estimate of out-of-sample performance

In [8]:
# CV

# Define the number of folds for cross-validation
num_folds = 10

# Calculate the number of samples and the size of each fold
num_samples = len(df_train)
fold_size = num_samples // num_folds

# Initialize a list to store the RMSE of each fold
rmse_list = []

# Iterate through the folds
for fold in range(num_folds):
    # Calculate the start and end indices for the test set
    start = fold * fold_size
    end = (fold + 1) * fold_size if fold < num_folds - 1 else num_samples

    # Use the current fold for testing and the rest for training
    tempTest = df_train.iloc[start:end, :]  # Slice the DataFrame
    tempTrain = pd.concat([df_train.iloc[:start, :], df_train.iloc[end:, :]])  # Concatenate DataFrames

    # convert to Tabular Datasets
    tempTest = TabularDataset(tempTest)
    tempTrain = TabularDataset(tempTrain)

    # fit predictor
    tempPredictor = TabularPredictor(label='SalePrice').fit(tempTrain)

    # find RMSE on the log scale (AutoGluon reports it negated, so higher is better)
    rmse = tempPredictor.evaluate(tempTest, silent=True)['root_mean_squared_error']
    
    rmse_list.append(rmse)

# find CV mean of rmse
print('CV mean is ' + str(sum(rmse_list) / len(rmse_list)))

No path specified. Models will be saved in: "AutogluonModels/ag-20231107_194312/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231107_194312/"
AutoGluon Version:  0.8.2
Python Version:     3.9.16
Operating System:   Darwin
Platform Machine:   x86_64
Platform Version:   Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
Disk Space Avail:   19.28 GB / 250.69 GB (7.7%)
Train Data Rows:    1314
Train Data Columns: 80
Label Column: SalePrice
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (13.534473028231162, 10.460242108190519, 12.02686, 0.4017)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feat

[1000]	valid_set's rmse: 0.14704


	-0.147	 = Validation score   (-root_mean_squared_error)
	7.89s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	-0.1276	 = Validation score   (-root_mean_squared_error)
	0.35s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 55.31s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231107_194312/")
No path specified. Models will be saved in: "AutogluonModels/ag-20231107_194408/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231107_194408/"
AutoGluon Version:  0.8.2
Python Version:     3.9.16
Operating System:   Darwin
Platform Machine:   x86_64
Platform Version:   Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
Disk Space Avail:   19.20 GB / 250.69 GB (7.7%)
Train Data Rows:    1314
Train Data Columns: 80
Label Column: SalePrice
Pr

[1000]	valid_set's rmse: 0.136732
[2000]	valid_set's rmse: 0.136671
[3000]	valid_set's rmse: 0.136669


	-0.1367	 = Validation score   (-root_mean_squared_error)
	15.66s	 = Training   runtime
	0.24s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	-0.1082	 = Validation score   (-root_mean_squared_error)
	0.33s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 86.41s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231107_195528/")


CV mean is -0.12087880090216571
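
The fold boundaries used by the loop can be isolated into a small helper to see exactly which rows each fold holds out. This is a sketch: `fold_bounds` is a hypothetical helper name, and 1460 is the row count implied by the log above (1314 training rows plus a 146-row test fold). The folds are contiguous and unshuffled, which assumes the row order of the file carries no information about the target.

```python
def fold_bounds(num_samples, num_folds):
    """Return (start, end) row positions for contiguous, unshuffled folds."""
    fold_size = num_samples // num_folds
    bounds = []
    for fold in range(num_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder rows
        end = (fold + 1) * fold_size if fold < num_folds - 1 else num_samples
        bounds.append((start, end))
    return bounds

bounds = fold_bounds(1460, 10)
print(bounds[0], bounds[-1])

# With a row count that does not divide evenly, the last fold is larger
uneven = fold_bounds(1463, 10)
print(uneven[-1])
```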


[-0.11477600584917631,
 -0.09665098761177439,
 -0.10251422370823247,
 -0.15676306688418543,
 -0.14315737231602807,
 -0.09812600747053027,
 -0.12014073494694316,
 -0.100005239753442,
 -0.13929900121853114,
 -0.13735536926281383]
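
The scores above are negative because AutoGluon reports error metrics with the sign flipped so that higher is always better. A small sketch of summarizing them as ordinary positive RMSE values, using the ten fold scores printed above and the standard library `statistics` module:

```python
import statistics

# Fold scores copied from the output above (negated RMSE on the log scale)
rmse_list = [
    -0.11477600584917631, -0.09665098761177439, -0.10251422370823247,
    -0.15676306688418543, -0.14315737231602807, -0.09812600747053027,
    -0.12014073494694316, -0.100005239753442, -0.13929900121853114,
    -0.13735536926281383,
]

# Flip the sign to get conventional, positive RMSE values
rmse_positive = [-v for v in rmse_list]

mean_rmse = statistics.mean(rmse_positive)
std_rmse = statistics.stdev(rmse_positive)  # spread across folds
print(f'CV RMSE: {mean_rmse:.4f} +/- {std_rmse:.4f}')
```

Reporting the spread alongside the mean shows how much the estimate depends on which rows are held out.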

In [None]:
td_train = TabularDataset(df_train)

# fit the final predictor on the full training set
predictor = TabularPredictor(label='SalePrice').fit(td_train)