# Kaggle: Intermediate Machine Learning

This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course. The tutorial can be referenced [here](https://www.kaggle.com/code/alexisbcook/introduction/tutorial). The aim of this notebook is to provide a preliminary analysis of a Random Forest Regressor for the [Housing Prices Competition on Kaggle](https://www.kaggle.com/competitions/home-data-for-ml-course/overview).

## Set Up

The cells below import the required libraries and load the competition datasets from the working directory.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

In [2]:
# Read the data
xTrainOg = pd.read_csv('home-data-for-ml-course/train.csv', index_col = 'Id')
xTestOg = pd.read_csv('home-data-for-ml-course/test.csv', index_col = 'Id')

## Data Preparation

In [3]:
# Remove rows with missing target, separate target from predictors
xTrainOg.dropna(axis = 0, subset = ['SalePrice'], inplace = True)
y = xTrainOg.SalePrice
xTrainOg.drop(['SalePrice'], axis = 1, inplace = True)

# To keep things simple, we'll use only numerical predictors
X = xTrainOg.select_dtypes(exclude=['object'])
X_test = xTestOg.select_dtypes(exclude=['object'])

## Creating Validation Set

A validation set is then held out from the training data using a commonly used 80/20 split.

In [4]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 42)
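
As a quick sanity check on what this split produces, here is a minimal sketch on stand-in data. The 1460-row frame and the names `X_demo`, `y_demo`, `X_tr`, and `X_va` are assumptions made for illustration; they are not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the predictors and target; 1460 rows is an assumption
n_rows = 1460
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows), name="target")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=42
)

# Sizes of the training and validation parts
print(len(X_tr), len(X_va))
# Predictors and target are shuffled together, so their indices still match
print(X_tr.index.equals(y_tr.index))
```

Passing both `train_size` and `test_size` is redundant when they sum to 1, but it makes the intended proportions explicit.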

Examining the first few rows of the resulting training set:

In [5]:
X_train.head()

| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 255 | 20 | 70.0 | 8400 | 5 | 6 | 1957 | 1957 | 0.0 | 922 | 0 | ... | 294 | 250 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2010 |
| 1067 | 60 | 59.0 | 7837 | 6 | 7 | 1993 | 1994 | 0.0 | 0 | 0 | ... | 380 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 5 | 2009 |
| 639 | 30 | 67.0 | 8777 | 5 | 7 | 1910 | 1950 | 0.0 | 0 | 0 | ... | 0 | 328 | 0 | 164 | 0 | 0 | 0 | 0 | 5 | 2008 |
| 800 | 50 | 60.0 | 7200 | 5 | 7 | 1937 | 1950 | 252.0 | 569 | 0 | ... | 240 | 0 | 0 | 264 | 0 | 0 | 0 | 0 | 6 | 2007 |
| 381 | 50 | 50.0 | 5000 | 5 | 6 | 1924 | 1950 | 0.0 | 218 | 0 | ... | 308 | 0 | 0 | 242 | 0 | 0 | 0 | 0 | 5 | 2010 |


## Preliminary Investigation

A preliminary look at the training data shows how large it is and which columns have missing values, which informs how those missing values could be handled.

In [6]:
# Obtaining shape of data
print(X_train.shape)

# Finding the number of missing values in each column of the training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(1168, 36)
LotFrontage    217
MasVnrArea       6
GarageYrBlt     64
dtype: int64


## Observations

From the above, the training set has 1168 rows and 36 columns. Three of the 36 columns contain missing data: <br>
1. LotFrontage <br>
2. MasVnrArea <br>
3. GarageYrBlt <br>

These columns have 217, 6, and 64 missing data points respectively, bringing the total number of missing entries to <b>287</b>.

The LotFrontage column has the most missing values, but even so it is missing fewer than 20% of its entries (217 of 1168). Dropping the column would therefore discard mostly valid data, along with any useful relationship between lot frontage and sale price, so it is unlikely to help. For the sake of this exercise, we will nevertheless compare the MAE scores obtained with imputation against those obtained after dropping the columns with missing values.
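
The share of missing entries per column can be worked out directly from the counts printed above. This is a small arithmetic sketch; the dictionary simply restates the reported counts, and the variable names are chosen here for illustration.

```python
# Row count and missing-value counts as reported for the training split above
n_rows = 1168
missing_counts = {"LotFrontage": 217, "MasVnrArea": 6, "GarageYrBlt": 64}

# Fraction of rows with a missing value in each column
missing_share = {col: count / n_rows for col, count in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
```

On the real data, the same fractions should come from `X_train.isnull().mean()`, which averages the boolean missing-value mask per column.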

## Imputation


In [7]:
# Function for comparing different approaches
def scoreDat(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

In [8]:
# Imputation
myImputer = SimpleImputer()
imputed_xTrain = pd.DataFrame(myImputer.fit_transform(X_train))
imputed_xValid = pd.DataFrame(myImputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_xTrain.columns = X_train.columns
imputed_xValid.columns = X_valid.columns
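
Wrapping the imputer output in `pd.DataFrame` resets the row index as well as the column names. That does not affect the MAE computed here, but it matters if the `Id` index is needed later. A minimal sketch on a small made-up frame (`demo` is hypothetical, not the housing data) passes both back explicitly:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a non-default index, standing in for X_train
demo = pd.DataFrame(
    {"LotFrontage": [70.0, np.nan, 50.0], "LotArea": [8400, 7837, 5000]},
    index=pd.Index([255, 1067, 381], name="Id"),
)

imputer = SimpleImputer()  # the default strategy fills with the column mean
imputed = pd.DataFrame(
    imputer.fit_transform(demo),
    columns=demo.columns,  # restore the column names
    index=demo.index,      # restore the original Id index
)
print(imputed)
```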

In [9]:
print("MAE (Imputation):")
print(scoreDat(imputed_xTrain, imputed_xValid, y_train, y_valid))

MAE (Imputation):
18237.925182648403


## Dropping Columns with Missing Values

In [10]:
# Get names of columns with missing values
missingValCols = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_xTrain = X_train.drop(missingValCols, axis = 1)
reduced_xValid = X_valid.drop(missingValCols, axis = 1)

In [11]:
print("MAE (Drop columns with missing values):")
print(scoreDat(reduced_xTrain, reduced_xValid, y_train, y_valid))

MAE (Drop columns with missing values):
18023.26128995434


## Observations

Imputation was expected to perform better than dropping columns, since relatively few values are missing in the training data. The MAE values show the opposite, although the gap is small (about 18,238 with imputation versus 18,023 with the columns dropped). This may be due to noise in the dataset, or mean imputation may simply be poorly suited to these columns. Other fill values, such as zeros instead of the means, might do better.
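
One way to test this is to score several fill strategies side by side. The sketch below uses synthetic stand-in data (an assumption, not the housing data), and names such as `X_demo` and `scores` are chosen for illustration. It only shows the mechanics; its numbers say nothing about the competition dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_demo = 3 * X_demo["a"] + X_demo["b"] + rng.normal(scale=0.1, size=300)
# Blank out roughly 20% of column "a" to mimic missing entries
X_demo.loc[rng.random(300) < 0.2, "a"] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "zero": SimpleImputer(strategy="constant", fill_value=0),
}

scores = {}
for name, imputer in imputers.items():
    # Fit the imputer on the training split only, then reuse it on validation
    X_tr_imp = imputer.fit_transform(X_tr)
    X_va_imp = imputer.transform(X_va)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr_imp, y_tr)
    scores[name] = mean_absolute_error(y_va, model.predict(X_va_imp))

for name, mae in scores.items():
    print(f"MAE ({name}): {mae:.3f}")
```

Keeping the model and its `random_state` fixed across strategies means any difference in MAE comes from the imputation alone.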

## Generating Test Predictions

In [12]:
# Performing imputation
finalImputer = SimpleImputer(strategy='median')
final_xTrain = pd.DataFrame(finalImputer.fit_transform(X_train))
final_xValid = pd.DataFrame(finalImputer.transform(X_valid))

# Imputation removed the column names; put them back
final_xTrain.columns = X_train.columns
final_xValid.columns = X_valid.columns

## Evaluating a Random Forest Model

In [13]:
# Define and fit model
model = RandomForestRegressor(n_estimators = 100, random_state = 42)
model.fit(final_xTrain, y_train)

# Get validation predictions and MAE
predValid = model.predict(final_xValid)
print("MAE (Median imputation):")
print(mean_absolute_error(y_valid, predValid))

MAE (Median imputation):
18123.418618721462


## Working on the Test Data

In [14]:
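# Preprocessing the test data with the imputer fitted on the training set,
# keeping the column names and Id index so they match the training features
final_xTest = pd.DataFrame(finalImputer.transform(X_test), columns = X_test.columns, index = X_test.index)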

In [15]:
# Obtaining test predictions
testPred = model.predict(final_xTest)



In [16]:
# Saving test predictions to CSV
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': testPred})
output.to_csv('submission.csv', index = False)

## Areas of Improvement

This was a preliminary study of the Random Forest Regressor model. The model was run with mostly default settings, and tuning hyperparameters such as the number of trees or the maximum depth may yield better MAE scores. Other models, such as SVM and Linear Regression, could also be tried. As previously discussed, the missing values could instead be replaced with zeros, with the most frequently occurring value, or by some other technique.
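
A comparison of these alternatives could be set up as in the sketch below. It is a rough outline on synthetic stand-in data, not a result for the housing dataset. The data and the names `candidates` and `cv_mae` are assumptions made here, as are the choices of cross-validation, median imputation, and feature scaling before the SVM.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the housing data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 2 * X_demo["a"] - X_demo["b"] + rng.normal(scale=0.1, size=200)
X_demo.loc[rng.random(200) < 0.1, "b"] = np.nan

# Each pipeline imputes inside the cross-validation folds, so the imputer
# never sees the held-out rows
candidates = {
    "random_forest": make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestRegressor(n_estimators=50, random_state=0),
    ),
    "linear_regression": make_pipeline(
        SimpleImputer(strategy="median"), LinearRegression()
    ),
    "svr": make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler(), SVR()
    ),
}

cv_mae = {}
for name, pipeline in candidates.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign
    neg_mae = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -neg_mae.mean()

for name, mae in cv_mae.items():
    print(f"Cross-validated MAE ({name}): {mae:.3f}")
```

Averaging over five folds gives a steadier estimate than a single 80/20 split, which helps when the differences between approaches are as small as those seen above.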