# Steps Involved in Selecting a Model for a Larger Dataset

In my article on the [Steps Involved in Selecting a Model](#), I talked about the difference between small and large datasets in relation to model selection. I also talked about how to perform model selection and evaluation on them. This notebook addresses how to select a model on a larger dataset.

The dataset we'll be using is the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) dataset. It isn't a particularly large dataset itself, but it will serve as a good example. The steps I described in my article are:

1. Transform Categorical Columns to Numeric (if any)
2. Scale Continuous Columns (if necessary)
3. Split the Dataset
4. Elect Candidate Model
5. Perform Model Evaluation
6. Model Selection

I have already worked with this dataset, so I will create a function to clean up and prepare it, which lets us move straight on to the third step. The code for the Exploratory Data Analysis (EDA), data cleaning and transformation can be found on my Kaggle page at [House Prices Prediction (Beginner)](https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner/notebook).

In [1]:
import numpy as np
import pandas as pd

from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
dataset = pd.read_csv("data/house_prices/train.csv", index_col='Id')
dataset.head()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |


In [3]:
dataset.shape

(1460, 80)

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [5]:
def clean_house_price_dataset(data):
    """
    Cleans and transforms the dataset. Details at:
    https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner/notebook
    """
    
    target = data["SalePrice"].to_numpy()
    
    # Drop the target and the sparsely populated columns.
    # Working on a copy leaves the caller's DataFrame untouched.
    features = data.drop(columns=[
        "Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature", "SalePrice"
    ])
    
    # Numeric columns are treated as continuous, everything else as categorical
    continuous_col = features.select_dtypes(include="number").columns.tolist()
    
    categorical_col = [
        col 
        for col in features.columns 
        if col not in continuous_col
    ]
    
    # Fill numeric gaps with the median, then standardize
    continuous_data_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('num_scaler', StandardScaler()),
    ])
    
    # Fill categorical gaps with the most frequent value, then encode as integers
    categorical_data_pipeline = Pipeline([
        ('freq_imputer', SimpleImputer(strategy='most_frequent')),
        ('cat_encoder', OrdinalEncoder())
    ])
    
    housing_price_pipeline = ColumnTransformer([
        ("continuous", continuous_data_pipeline, continuous_col),
        ("categorical", categorical_data_pipeline, categorical_col),
    ])
    
    # The result is a NumPy array: continuous columns first, then categorical
    transformed_dataset = housing_price_pipeline.fit_transform(features)
    
    return target, transformed_dataset

In [6]:
target, transformed_dataset = clean_house_price_dataset(dataset)

In [7]:
transformed_dataset.shape, type(transformed_dataset), target.shape

((1460, 74), numpy.ndarray, (1460,))

With the `clean_house_price_dataset` function, we're able to clean and transform the dataset. We can then proceed to split the dataset and evaluate the models.
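To make the transformation easier to picture, here is a minimal sketch of the same imputing, scaling and encoding steps applied to a tiny made-up frame. The `toy` DataFrame and its values are assumptions for illustration only; just the column names are borrowed from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy frame: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave", np.nan], dtype=object),
})

numeric_cols = ["LotArea"]
categorical_cols = ["Street"]

preprocess = ColumnTransformer([
    # Median imputation followed by standardization for numeric columns
    ("continuous", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    # Most-frequent imputation followed by integer encoding for categorical columns
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder()),
    ]), categorical_cols),
])

toy_transformed = preprocess.fit_transform(toy)
print(toy_transformed)
```

The missing `LotArea` is filled with the median before scaling, and the missing `Street` is filled with the most frequent category before each category is mapped to an integer.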

### Split the Dataset

The reason we perform an evaluation on machine learning (ML) models is to ensure they don't under-fit or over-fit.

We were able to evaluate the iris dataset (a small dataset) using cross-validation alone, but since this dataset isn't as small, naively cross-validating on the whole of it would be computationally expensive.

Therefore, we have to split the dataset into a train set and a test set. The entire dataset has a shape of `(1460, 80)`, and `(1460, 74)` after cleaning and transformation. We can perform cross-validation on the train set and evaluate our model's performance on the test set.
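As a rough sketch of that idea, the snippet below holds back a test set and cross-validates only on the train portion. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the house-prices data), and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the house-prices dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hold back 30% of the rows as a test set that cross-validation never sees
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold cross-validation on the train portion only.
# scikit-learn reports the negated MAE, so flip the sign to get errors.
scores = cross_val_score(
    LinearRegression(), X_tr, y_tr, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores
print("MAE per fold:", cv_mae.round(2))
print("Mean CV MAE:", cv_mae.mean().round(2))
```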

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
round(1460 - (1460 * .3)), round(1460 * .3)

(1022, 438)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(transformed_dataset, target, test_size=.3, shuffle=True, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1022, 74), (438, 74), (1022,), (438,))

### Elect Candidate Model

Now that we've split the dataset into train and test sets, we can proceed to elect models that can solve this task.

First, we have to understand the dataset. I talked about it in my notebook [House Prices Prediction (Beginner)](https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner/notebook) where I gave an [overview of the dataset](https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner#2.1.-Overview-of-the-data).

We're dealing with a regression task with lots of categorical features, so models with linear and decision-making abilities would be useful, such as the Decision Tree Regressor or the Random Forest Regressor. Let's go for the Random Forest Regressor, since it is an ensemble of Decision Trees.

We should also pick models like the Support Vector Regressor, Linear Regression, and the K-Neighbors Regressor, so that we have several candidates to compare during evaluation.

> XGBoost will prove to be a vital tool in your ML journey, and I suggest examining its usage in the notebook [XGBoost](https://www.kaggle.com/dansbecker/xgboost) by Kaggle grandmaster [Dan Becker](https://www.kaggle.com/dansbecker). More resources on XGBoost in the **further reading** section.

In [11]:
names = [
    "Linear Regression",
    "K-Neighbors Regressor",
    "Support Vector Regressor",
    "Random Forest Regressor"
]

models = [
    LinearRegression(),
    KNeighborsRegressor(),
    SVR(),
    RandomForestRegressor()
]

### Perform Model Evaluation

Now that we've split our dataset and elected the models we want to use, it's time to see how the individual models perform on the training dataset.

In [12]:
def compare_models(data, target, names_models):
    """
    Performs model comparison by analyzing the different models
    along with their performance.
    """
    
    record = {}
    
    avg_model_performance = []
    
    for name, model in names_models.items():
        # Fit and score on the same data, so these are training errors
        model.fit(data, target)
        predictions = model.predict(data)
        
        model_mse = mean_squared_error(target, predictions)
        model_mae = mean_absolute_error(target, predictions)
        
        model_record = {
            "model": name,
            "mean_squared_error": model_mse,
            "mean_absolute_error": model_mae,
        }
        
        record[name] = model_record
        
        avg_model_performance.append((round(model_mae, 2), name))
    
    # Rank the models from lowest to highest MAE
    record['Model Performance Rating'] = sorted(avg_model_performance)
    
    return record

In [13]:
record = compare_models(
    X_train, y_train, 
    {
        name: model
        for name, model in zip(names, models)
    }
)

In [14]:
for name in names:
    print(record[name])

{'model': 'Linear Regression', 'mean_squared_error': 947208848.3487878, 'mean_absolute_error': 18860.18959703269}
{'model': 'K-Neighbors Regressor', 'mean_squared_error': 1116571996.207319, 'mean_absolute_error': 20864.65264187867}
{'model': 'Support Vector Regressor', 'mean_squared_error': 6282322967.635323, 'mean_absolute_error': 54896.07183847585}
{'model': 'Random Forest Regressor', 'mean_squared_error': 136855332.17973176, 'mean_absolute_error': 6791.2177201565555}


In [15]:
record['Model Performance Rating']

[(6791.22, 'Random Forest Regressor'),
 (18860.19, 'Linear Regression'),
 (20864.65, 'K-Neighbors Regressor'),
 (54896.07, 'Support Vector Regressor')]

The Random Forest Regressor performed best, with an MAE roughly 3x lower than the Linear Regression model's (6791.22 vs 18860.19). Keep in mind that these errors were measured on the same data the models were fitted on. Since our focus is on model selection, I avoided cross-validating and fine-tuning the models.
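Training error on its own can flatter a flexible model, so it is worth checking the held-out test set too. The sketch below compares train and test MAE for a Random Forest on synthetic stand-in data (an assumption, not the house-prices data); `n_estimators=50` is chosen only to keep it quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the house-prices dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Error on the rows the model was fitted on versus rows it has never seen
train_mae = mean_absolute_error(y_tr, forest.predict(X_tr))
test_mae = mean_absolute_error(y_te, forest.predict(X_te))
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```

A large gap between the two numbers is a sign of over-fitting, which is what the evaluation step is meant to catch.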
 
> In most cases, I would fine-tune and cross-validate each model (using grid search), looking for the best score each model can produce before making a decision. But the models' default parameters are decent enough for this task, so let's keep it simple.
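For reference, a grid search with cross-validation could look roughly like the sketch below. The parameter grid, the fold count and the synthetic data are all assumptions picked to keep the example small; they are not tuned values for the house-prices task.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the house-prices dataset)
X, y = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=1)

# Hypothetical, deliberately tiny grid of hyperparameters to try
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    cv=3,  # 3-fold cross-validation for every parameter combination
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # flip the sign back to a positive MAE
print(best_params, round(best_cv_mae, 2))
```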

### Model Selection

After splitting the dataset, electing the candidate models, and performing model evaluation, we can conclude that the Random Forest Regressor is best suited for deployment, with a mean absolute error (MAE) of 6791.22 on the training set.

We didn't fine-tune the model, and we could likely get a better MAE by fine-tuning the Random Forest Regressor, but the point has been established.

> You could try out XGBoost and compare it to see if it performs better. What if you fine-tune the XGBoost model as well?
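If XGBoost isn't installed, scikit-learn's `GradientBoostingRegressor` can serve as a rough stand-in for a gradient-boosted tree model. To be clear, this is a different implementation from XGBoost, and the synthetic data below is an assumption, so treat it only as a sketch of how such a comparison might be set up.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the house-prices dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

candidates = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=2),
    # Stand-in for XGBoost: scikit-learn's own gradient-boosted trees
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=2),
}

# Fit each candidate on the train set and score it on the held-out test set
test_mae = {
    name: mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in candidates.items()
}

# Model names ordered from lowest to highest test MAE
ranking = sorted(test_mae, key=test_mae.get)
print(ranking)
```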

### Conclusion

A larger dataset generally lets your model become more accurate and opens up more possibilities, but it also demands more computing power and adds complexity.

With a smaller dataset, though, splitting the dataset is less essential: the whole dataset can be processed cheaply, so cross-validating it with a predefined number of folds is enough.

### Further Reading

Data Cleaning

- [The Ultimate Guide to Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)
- [Data Cleaning with Python and Pandas](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b)

Encoding Categorical Columns

- [Encoding Categorical data in Machine Learning](https://medium.com/bycodegarage/encoding-categorical-data-in-machine-learning-def03ccfbf40)
- [Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning](https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79)

Scikit-Learn Models

- [Support Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine)
- [Random Forest](https://en.wikipedia.org/wiki/Random_forest)
- [K-Nearest Neighbor](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
- [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)

Further Reading On Model Selection

- [A “short” introduction to model selection](https://towardsdatascience.com/a-short-introduction-to-model-selection-bb1bb9c73376)
- [A Gentle Introduction to Model Selection for Machine Learning](https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/)

Associated Notebooks

- [Steps Involved in Selecting a Model For a Small Data-set](https://www.kaggle.com/ganiyuolalekan/model-selection-for-small-dataset)

Book

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)