# Kaggle: Intermediate Machine Learning

This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course. The accompanying tutorial is available [here](https://www.kaggle.com/code/alexisbcook/introduction/tutorial). The aim of this notebook is to provide a preliminary analysis of a Random Forest Regressor for the [Housing Prices Competition on Kaggle](https://www.kaggle.com/competitions/home-data-for-ml-course/overview).

## Set Up

The cells below import the required libraries and read the training and test datasets from the working directory.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [2]:
# Read the data
xTrainOg = pd.read_csv('home-data-for-ml-course/train.csv', index_col = 'Id')
xTestOg = pd.read_csv('home-data-for-ml-course/test.csv', index_col = 'Id')

## Obtaining the target and predictors 

The target (`SalePrice`) and a set of numeric predictors are selected first. A validation set is then split off from the training data using a common 80/20 split.

In [4]:
y = xTrainOg.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = xTrainOg[features].copy()
xTest = xTestOg[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 42)
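
As a minimal sketch of what the 80/20 split does, the example below applies the same `train_test_split` call to a small toy frame. The frame `toy` and its values are made up for illustration and are not part of the housing data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the training data
toy = pd.DataFrame({"LotArea": range(10), "SalePrice": range(100, 110)})
X_toy = toy[["LotArea"]]
y_toy = toy["SalePrice"]

# Same split settings as above: 80% for training, 20% for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=42
)

# 8 rows go to training and 2 to validation; no row appears in both sets
print(len(X_tr), len(X_va))
print(set(X_tr.index).isdisjoint(X_va.index))
```

The predictors and target keep matching indices after the split, so each row in `X_tr` still lines up with its value in `y_tr`.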

Examining the first few rows of the training predictors:

In [5]:
X_train.head()

| Id | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 255 | 8400 | 1957 | 1314 | 0 | 1 | 3 | 5 |
| 1067 | 7837 | 1993 | 799 | 772 | 2 | 3 | 7 |
| 639 | 8777 | 1910 | 796 | 0 | 1 | 2 | 4 |
| 800 | 7200 | 1937 | 981 | 787 | 1 | 3 | 7 |
| 381 | 5000 | 1924 | 1026 | 665 | 2 | 3 | 6 |


## Defining the models to be studied

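Random Forest Regressor models are studied in this section of the notebook. A random forest is a meta-estimator that fits a number of decision trees on different sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. The five models defined here vary the number of trees, the split criterion, the minimum number of samples required to split a node, and the maximum tree depth.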
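
The averaging step can be checked directly. The sketch below is illustrative only: it uses synthetic data from `make_regression` rather than the housing dataset, and the names `X_demo`, `y_demo`, and `forest` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is exposed through forest.estimators_
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's prediction
manual_avg = tree_preds.mean(axis=0)
print(np.allclose(manual_avg, forest.predict(X_demo)))
```

Because the forest output is a plain average, adding trees mainly reduces the variance of the prediction rather than changing what any single tree learns.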

In [6]:
model_1 = RandomForestRegressor(n_estimators = 50, random_state = 42)
model_2 = RandomForestRegressor(n_estimators = 100, random_state = 42)
model_3 = RandomForestRegressor(n_estimators = 100, criterion = 'absolute_error', random_state = 42)
model_4 = RandomForestRegressor(n_estimators = 200, min_samples_split = 20, random_state = 42)
model_5 = RandomForestRegressor(n_estimators = 100, max_depth = 7, random_state = 42)

models = [model_1, model_2, model_3, model_4, model_5]

## Scoring

The best model is the one with the lowest mean absolute error (MAE) on the validation set. The function below fits a model on the training set and returns its MAE on the validation set.
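
For reference, MAE is the mean of the absolute differences between actual and predicted values. The short sketch below computes it by hand and with `mean_absolute_error`. The prices are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions, chosen only for illustration
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 130000, 330000])

# Absolute errors are 10000, 20000, and 30000, so their mean is 20000
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because MAE is expressed in the same units as the target, a value of 20000 here would mean the predictions are off by 20,000 in sale price on average.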

In [10]:
# Function for comparing different models
def scoreMod(model, xTrain = X_train, xValid = X_valid, yTrain = y_train, yValid = y_valid):
    model.fit(xTrain, yTrain)
    predictions = model.predict(xValid)
    return mean_absolute_error(yValid, predictions)

In [11]:
# Obtaining scores for each model
for i, model in enumerate(models, start=1):
    mae = scoreMod(model)
    print("Model %d MAE: %d" % (i, mae))

Model 1 MAE: 22411
Model 2 MAE: 22537
Model 3 MAE: 22569
Model 4 MAE: 22719
Model 5 MAE: 23044


Based on the MAE values obtained, model 1 performs best on this validation split, although the gap between the best and worst models is small (22411 versus 23044).

## Generating Predictions

Predictions are generated for the test dataset and stored in a CSV file for submission to the Kaggle competition.

In [13]:
# Storing the best model
bestMod = model_1

# Refit the model on the full training data (training and validation rows)
bestMod.fit(X, y)

RandomForestRegressor(n_estimators=50, random_state=42)

In [16]:
# Generate test predictions
testPred = bestMod.predict(xTest)

In [17]:
# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': xTest.index, 'SalePrice': testPred})
output.to_csv('submission.csv', index=False)

## Areas of Improvement

This was a preliminary study of the Random Forest Regressor model. A number of other hyperparameters could have been tuned and may have yielded lower MAE scores. Other models, such as SVM and linear regression, could also have been tried. Techniques such as outlier removal, reducing data skewness, and stacking several models together may have provided more favorable results as well.
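
One possible way to explore other parameter values is a cross-validated grid search. The sketch below was not run as part of this notebook: it uses synthetic data from `make_regression`, and the grid values, the names `X_demo` and `y_demo`, and the choice of 3 folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=7, noise=15.0, random_state=42
)

# Hypothetical grid; the values are illustrative, not tuned recommendations
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign back to report a regular (positive) MAE
best_mae = -search.best_score_
print(search.best_params_, round(best_mae, 2))
```

Scoring each candidate across several folds, rather than on a single validation split, makes the comparison less sensitive to which rows happen to land in the validation set.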