# Stage 4: Base machine-learning model

In this stage, we prepare and train a very basic machine-learning model. Its results serve as a reference point for further ML research: later we can compare the results of any other model against this baseline to track progress.

In [1]:
%pylab inline

import pandas as pd
import numpy as np

Populating the interactive namespace from numpy and matplotlib


# Preparing the data

In [2]:
dataset = pd.read_csv('data/house_prices.csv').set_index('Id')
target_y = dataset['SalePrice']

In [3]:
#
# We've saved base data processing/cleaning and splitting routines from Stage #3 to the common_data.py module
#
from common_data import data_train_test_split, data_process_and_clean_basic

### Cleaning the data using pre-saved data pipeline

In [4]:
processed_dataset = data_process_and_clean_basic(dataset)

In [5]:
processed_dataset.head()

Id,Fireplaces,GarageArea,MoSold,PoolArea,ScreenPorch,BsmtFullBath,GarageCars,EnclosedPorch,MiscVal,KitchenAbvGr,...,SaleType_ConLw,SaleType_ConLI,SaleType_ConLD,SaleType_Oth,SaleCondition_Normal,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Partial
1,0.0,6.308098,1.098612,0.0,0.0,0.693147,1.098612,0.0,0.0,0.693147,...,0,0,0,0,1,0,0,0,0,0
2,0.693147,6.133398,1.791759,0.0,0.0,0.0,1.098612,0.0,0.0,0.693147,...,0,0,0,0,1,0,0,0,0,0
3,0.693147,6.411818,2.302585,0.0,0.0,0.693147,1.098612,0.0,0.0,0.693147,...,0,0,0,0,1,0,0,0,0,0
4,0.693147,6.466145,1.098612,0.0,0.0,0.693147,1.386294,5.609472,0.0,0.693147,...,0,0,0,0,0,1,0,0,0,0
5,0.693147,6.729824,2.564949,0.0,0.0,0.693147,1.386294,0.0,0.0,0.693147,...,0,0,0,0,1,0,0,0,0,0


### Do train/test splitting

In [6]:
X_train, X_test, y_train, y_test = data_train_test_split(processed_dataset, target_y)

# Share of each subset as a percentage with one decimal place
print(f"Train dataset length: {len(X_train):<5} ({len(X_train)/len(processed_dataset)*100:.1f}%)")
print(f"Test  dataset length: {len(X_test):<5} ({len(X_test)/len(processed_dataset)*100:.1f}%)")

Train dataset length: 1095  (75.0%)
Test  dataset length: 365   (25.0%)
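
The helper `data_train_test_split` lives in `common_data.py` and is not shown in this stage. As a rough illustration only, the sketch below assumes it behaves like scikit-learn's `train_test_split` with a 25% test share; the function name `split_75_25_sketch`, the `random_state` value, and the demo data are hypothetical and not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_75_25_sketch(features, target, random_state=0):
    """Hypothetical stand-in for data_train_test_split: shuffled 75/25 hold-out split."""
    return train_test_split(features, target, test_size=0.25, random_state=random_state)

# Demo data with the same number of rows as the outputs above imply (1095 + 365 = 1460)
demo_features = pd.DataFrame({"x1": np.arange(1460), "x2": np.arange(1460) % 7})
demo_target = pd.Series(np.arange(1460, dtype=float))

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = split_75_25_sketch(demo_features, demo_target)
print(len(X_tr_demo), len(X_te_demo))  # 1095 365
```

Fixing `random_state` keeps the split identical between runs, which matters when later models are compared against the same baseline.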


## Model estimation

For the initial model estimation, we use the cross-validated RMSE (root mean squared error) score. Cross-validation is an essential technique for detecting overfitting before the model meets unseen data.

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.
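
To make the idea concrete, here is a minimal sketch of what k-fold cross-validation does under the hood: fit on all folds but one, score on the held-out fold, and repeat for every fold. It uses synthetic data and a plain `LinearRegression` (both are assumptions for illustration, not part of this project) and compares the manual loop with `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustration only, not the house-price dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def manual_cv_rmse(model, X, y, n_splits=5):
    """Fit on k-1 folds, compute RMSE on the held-out fold, repeat for every fold."""
    fold_rmse = []
    for train_idx, valid_idx in KFold(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[valid_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[valid_idx], predictions)))
    return np.array(fold_rmse)

manual_scores = manual_cv_rmse(LinearRegression(), X_demo, y_demo)

# scikit-learn reports negated MSE so that greater is always better;
# negate it back before taking the square root
library_scores = np.sqrt(-cross_val_score(
    LinearRegression(), X_demo, y_demo, scoring="neg_mean_squared_error", cv=5))

print(manual_scores.mean(), library_scores.mean())
```

Every sample is used for validation exactly once, so the mean and spread of the per-fold scores give a more stable estimate than a single hold-out split.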

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def calc_cross_val_scores(model, X, y, display=True):
    """
    Computes 10-fold cross-validated RMSE scores for the model,
    optionally prints them, and returns their mean
    """
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    rmse_scores = np.sqrt(-scores)

    if display:
        print(rmse_scores)
        print(f'Mean: {np.mean(rmse_scores)}')
        print(f'StDev: {np.std(rmse_scores)}')
        
    return np.mean(rmse_scores)

def calc_score(model, X_train, y_train, X_test, y_test, display=True):
    """
    Trains model on X_train dataset and displays its score on X_test dataset
    """
    model.fit(X_train, y_train)
    
    predicted_y = model.predict(X_test)
    
    rmse_score = np.sqrt(mean_squared_error(y_test, predicted_y))
    
    if display:
        print(f"RMSE Score for test dataset: {rmse_score}")
    
    return rmse_score
    

# Setting up a simple RandomForest regression model

In [8]:
from sklearn.ensemble import RandomForestRegressor

def model_base():
    """
    Base model which will be used for results comparison and progress tracking
    """
    model = RandomForestRegressor()
    return model

In [9]:
base_model = model_base()

print('Average cross-validation score\n')
calc_cross_val_scores(base_model, X_train, y_train)

Average cross-validation score

[ 0.1870072   0.15468002  0.13999742  0.1512344   0.18055104  0.12990012
  0.17344829  0.17515075  0.14832779  0.16994788]
Mean: 0.16102449092265897
StDev: 0.017883189822264015


0.16102449092265897

# Evaluating the base model on the test data (final step)


Evaluating the model on the test data is the last step of the ML research. It is important to avoid repeated evaluations on the test data: every modelling decision made after looking at a test score leaks information from the test set, so the final score is no longer an honest estimate of performance on unseen data. For this reason, we use cross-validation scores for model fine-tuning and reserve the test dataset score for a final check of model stability.
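
The workflow described above can be sketched as follows: compare candidate settings using cross-validation on the training data only, then touch the test set once with the chosen setting. The data here is synthetic, and the candidate `max_depth` values and `n_estimators=30` are hypothetical choices for illustration, not the project's tuning plan.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (illustration only)
rng = np.random.default_rng(42)
X_all = rng.normal(size=(300, 4))
y_all = 2.0 * X_all[:, 0] + np.sin(X_all[:, 1]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=42)

# Step 1: choose between candidates using cross-validation on the training data only
candidate_depths = [2, 8]  # hypothetical values
cv_rmse = {}
for depth in candidate_depths:
    model = RandomForestRegressor(n_estimators=30, max_depth=depth, random_state=42)
    scores = cross_val_score(model, X_tr, y_tr, scoring="neg_mean_squared_error", cv=5)
    cv_rmse[depth] = np.sqrt(-scores).mean()

best_depth = min(cv_rmse, key=cv_rmse.get)

# Step 2: a single evaluation on the test set with the chosen setting
final_model = RandomForestRegressor(n_estimators=30, max_depth=best_depth, random_state=42)
final_model.fit(X_tr, y_tr)
test_rmse = np.sqrt(mean_squared_error(y_te, final_model.predict(X_te)))

print(f"CV RMSE per depth: {cv_rmse}")
print(f"Chosen depth: {best_depth}, test RMSE: {test_rmse}")
```

Because the test set plays no part in choosing `best_depth`, the final RMSE remains an unbiased check rather than a number that was optimized for.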

In [10]:
calc_score(base_model, X_train, y_train, X_test, y_test);

RMSE Score for test dataset: 0.15165921337038135


# Conclusion

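We have finished the first part of the research framework and prepared the baseline for future model improvement and feature engineering tasks. The base model reaches a mean cross-validation RMSE of about 0.161 (standard deviation about 0.018) and a test RMSE of about 0.152, so the test score lies within one standard deviation of the cross-validation mean.

In the next stages of the project, we are going to improve on the base score by using feature engineering methods and fine-tuning model parameters. We will also train other models built on different mathematical principles than the base model.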