# DAB200 -- Lab 1

In this lab, you will gain some experience in creating and evaluating random forest models and in manually tuning some hyper-parameters.
See each Part below for specific instructions but overall, keep the following in mind:
 - the code for each Part of this lab should be self-contained, that is, each of **Part 1, 2,** and **3** should contain all the necessary code and not rely on code from another **Part** of the lab in order to run (excluding import statements);
 - all parts of the lab should be done using python, sklearn, pandas, numpy, and matplotlib. 

**Grading:** 

45% of the grade will come from error-free code that accomplishes all the steps outlined in the instructions above and for each part of this lab. Another 45% will come from the comments associated with that code and answers to any questions noted, where the comments explain what the code is doing and why it is important to the overall objective. Thus, comments like "split the data" or "train the model" would receive a grade of 0 as they do not indicate any understanding. For this lab, the comments should be provided within each code cell, using the `#` character. The remaining 10% of the grade will be for formatting and clarity of comments/answers, that is, the lab should be easy to read and understand.

**What to submit**
You should submit the following:
 - a **zip** file containing:
     - a completed version of this notebook with **all cells executed** and **all output visible**;
     - an html/PDF version of this notebook with all cells executed and all output visible.

**Independent research will most likely be required in order to complete the lab properly.**

### Importing all necessary libraries

In [6]:
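# Importing the libraries used in this lab: pandas and numpy for data handling, matplotlib for plotting,
# and scikit-learn for splitting the data, building the random forest, and computing the MAE.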

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

## Part 1 - Creating and evaluating a random forest model

In this part of the lab, you should:
 - read in the data;
 - verify that all the data is numeric and that there are no missing values;
 - split the data into training and validation sets with a ratio of 80:20 for training:testing;
 - create a random forest model using the data with the default hyper-parameters;
 - evaluate the model on both the training and validation sets using MAE and % error.

In [7]:
# We are using read_csv from pandas to read the data stored in csv format.

house_data = pd.read_csv('house_data.csv') 

In [8]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bedrooms          4600 non-null   int64  
 1   bathrooms         4600 non-null   float64
 2   m2_living         4600 non-null   float64
 3   floors            4600 non-null   float64
 4   m2_above          4600 non-null   float64
 5   m2_basement       4600 non-null   float64
 6   m2_lot            4600 non-null   float64
 7   view              4600 non-null   int64  
 8   quality           4600 non-null   int64  
 9   yr_built          4600 non-null   int64  
 10  renovated_last_5  4600 non-null   int64  
 11  city              4600 non-null   int64  
 12  statezip          4600 non-null   int64  
 13  price             4600 non-null   float64
dtypes: float64(7), int64(7)
memory usage: 503.2 KB


The `info()` output shows 4600 non-null entries in each of the 14 columns, and every column has dtype `int64` or `float64`, so there are no missing values and all the data is numeric.
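
The same check can also be made programmatically, so that it does not depend on reading the `info()` table by eye. This is a minimal sketch on a small stand-in DataFrame; the helper name `check_clean_numeric` and the values in `sample` are made up for illustration and are not part of the lab data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clean_numeric(df):
    """Return (number of missing cells, list of non-numeric column names)."""
    n_missing = int(df.isna().sum().sum())
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    return n_missing, non_numeric


# Illustrative stand-in for house_data (values are made up)
sample = pd.DataFrame({"bedrooms": [3, 4, 2], "price": [100.0, 200.0, 300.0]})

n_missing, non_numeric = check_clean_numeric(sample)
print(f"{n_missing} missing values; non-numeric columns: {non_numeric}")
```

With the real data, `check_clean_numeric(house_data)` would be expected to report zero missing values and an empty list if the reading of `info()` above is right.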

In [9]:
# X holds the features (every column except the last) and y holds the target, price.

# The data is split into training and test sets with a ratio of 80:20.

# random_state is fixed so that the split, and therefore the results, do not change every time the code is run.

X=house_data.iloc[: , :-1]
y=house_data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=123)

In [10]:
# Using RandomForestRegressor() from sklearn.ensemble to create a random forest model with the default hyper-parameters

# random_state is fixed so that the results do not vary every time we run the model.

# Using fit() to fit the model on X_train and y_train.

rf1 = RandomForestRegressor(random_state=123) 
rf1.fit(X_train, y_train)

RandomForestRegressor(random_state=123)

In [11]:
# We use predict() to generate predictions for X_train (the training features)
# and store them in the predictions variable.

# We then compute the mean absolute error between those predictions and the actual training prices, and print it
# both as a dollar amount and as a percentage of the mean price of the full dataset (y.mean()).

predictions = rf1.predict(X_train)
e = mean_absolute_error(y_train, predictions)
ep = e*100 / y.mean()
print(f"${e:.0f} average error; {ep:.2f}% error")

$52531 average error; 9.52% error


In [12]:
# We use predict() to generate predictions for X_test (the validation features)
# and store them in the validation_predictions variable.

# We then compute the mean absolute error between those predictions and the actual validation prices, and print it
# both as a dollar amount and as a percentage of the mean price of the full dataset (y.mean()).

validation_predictions = rf1.predict(X_test)
validation_e = mean_absolute_error(y_test, validation_predictions)
validation_ep = validation_e * 100 / y.mean() 
print(f"${validation_e:.0f} average error; {validation_ep:.2f}% error")

$131978 average error; 23.91% error


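On the training data the model's MAE is 52531 dollars (9.52% of the mean price), but on the validation data it rises to 131978 dollars (23.91%). The validation error is about 2.5 times the training error, which indicates that the model with default hyper-parameters overfits the training data.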
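
To make the two metrics concrete, here is a minimal sketch of how MAE and the percentage error used in this lab are computed, written in plain Python. The helper names and the toy values are illustrative assumptions; the lab itself uses `mean_absolute_error` from sklearn.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percent_error(mae_value, mean_target):
    """MAE expressed as a percentage of the mean target value."""
    return mae_value * 100 / mean_target


# Toy values (made up) to show the arithmetic
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

toy_mae = mae(actual, predicted)                              # (10 + 10 + 30) / 3
toy_pct = percent_error(toy_mae, sum(actual) / len(actual))   # relative to a mean of 200

# Ratio of validation MAE to training MAE, using the outputs recorded in Part 1
overfit_ratio = 131978 / 52531
print(f"toy MAE {toy_mae:.2f}, toy % error {toy_pct:.2f}, validation/training ratio {overfit_ratio:.2f}")
```

A ratio well above 1 means the model does much better on data it has already seen than on unseen data.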

## Part 2 - Exploring the `n_estimators` hyper-parameter

In this part of the lab you should: 
 - use a `for` loop to create a random forest model for each value of `n_estimators` from 1 to 30;
 - evaluate each model on the validation set only using MAE;

After that you should answer the following questions:
 - Which value of `n_estimators` gives the best results? 
 - Explain how you decided that this value for `n_estimators` gave the best results;
 - Was the result here better than the result of Part 1? What % better or worse was it?

In [13]:
# We are using read_csv from pandas to read the data stored in csv format.

house_data = pd.read_csv('house_data.csv') 

In [14]:
# X holds the features (every column except the last) and y holds the target, price.

# The data is split into training and test sets with a ratio of 80:20.

# random_state is fixed so that the split, and therefore the results, do not change every time the code is run.

X=house_data.iloc[: , :-1]
y=house_data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=123)

In [15]:
# Using RandomForestRegressor() from sklearn.ensemble to create one random forest model for each
# n_estimators value from 1 to 30.

# The empty list testing_errors collects the validation error of each model, in loop order.

# Using fit() to train each model on X_train and y_train.

# random_state is fixed so that the results do not vary every time we run the model.

# We use predict() to generate predictions for X_test (the validation features)
# and store them in the validation_predictions variable.

# We compute the mean absolute error between those predictions and the actual prices and append it,
# as a percentage of the mean price, to the testing_errors list.

# At the end we find the position in testing_errors of the smallest percentage error and
# store it in the n_est variable for later use.

testing_errors=[]

for i in range(1,31):
    rf2 = RandomForestRegressor(n_estimators=i,random_state=123) 
    rf2.fit(X_train, y_train)
    validation_predictions = rf2.predict(X_test)
    validation_e = mean_absolute_error(y_test, validation_predictions)
    validation_ep = validation_e * 100 / y.mean()
    testing_errors.append(validation_ep)

n_est = testing_errors.index(min(testing_errors))
print(f"We get best results from {n_est} estimators")

We get best results from 26 estimators


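The value for `n_estimators` is chosen as the one with the lowest validation error. Note that `testing_errors.index(...)` returns a zero-based position while the loop starts at `n_estimators=1`, so the printed 26 is a list position and the lowest error was actually obtained with `n_estimators=27`. The cells below use `n_est = 26`.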
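
When the candidate values start at 1 but the errors are stored in a plain list, the list position of the smallest error is not the same number as the candidate value. The sketch below shows one way to map the position back to the value; the error values are hypothetical and chosen only to make the difference visible.

```python
candidates = list(range(1, 6))             # hyper-parameter values tried, starting at 1
errors = [30.0, 27.5, 26.0, 24.0, 25.0]    # hypothetical validation errors, one per candidate

position = errors.index(min(errors))       # zero-based position of the smallest error
best_value = candidates[position]          # the hyper-parameter value at that position

# Equivalent: pair each candidate with its error and take the pair with the smallest error
best_value_alt, best_error = min(zip(candidates, errors), key=lambda pair: pair[1])

print(f"position {position}, best value {best_value}, error {best_error}")
```

Keeping the candidate values alongside the errors avoids having to remember whether an offset of 1 is needed.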

In [16]:
# Using RandomForestRegressor() from sklearn.ensemble to create a random forest model with n_estimators
# set to the value stored in n_est by the previous cell.

# Using fit() to train the model on X_train and y_train.

# random_state is fixed so that the results do not vary every time we run the model.

# We use predict() to generate predictions for X_test (the validation features)
# and store them in the validation_predictions variable.

# We then compute the mean absolute error between those predictions and the actual validation prices, and print it
# both as a dollar amount and as a percentage of the mean price of the full dataset (y.mean()).

rf3=RandomForestRegressor(n_estimators=n_est,random_state=123)
rf3.fit(X_train,y_train)
validation_predictions = rf3.predict(X_test)
validation_e = mean_absolute_error(y_test, validation_predictions)
validation_ep = validation_e * 100 / y.mean() 
print(f"${validation_e:.0f} average error; {validation_ep:.2f}% error")

$130375 average error; 23.62% error


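- The results here show that the model in Part 2 is slightly better than the Part 1 model with default hyper-parameters: the validation error falls from 23.91% to 23.62%.
- That is 0.29 percentage points less error, or about 1.2% lower MAE in relative terms (130375 versus 131978). This figure was obtained with `n_estimators=26`.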
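
A difference between two percentage errors can be reported in percentage points or as a relative change in MAE, and the two numbers are not the same. The sketch below computes both from the Part 1 and Part 2 outputs recorded above; the helper name `compare_errors` is an illustrative assumption.

```python
def compare_errors(old_mae, new_mae, old_pct, new_pct):
    """Return (difference in percentage points, relative MAE improvement in %)."""
    point_diff = old_pct - new_pct
    relative_improvement = (old_mae - new_mae) / old_mae * 100
    return point_diff, relative_improvement


# Part 1 validation result versus Part 2 validation result, as printed in the notebook
points, relative = compare_errors(131978, 130375, 23.91, 23.62)
print(f"{points:.2f} percentage points; {relative:.1f}% relative improvement")
```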

## Part 3 - Exploring the `max_features` hyper-parameter

In this part of the lab you should: 
 - use a `for` loop to create a random forest model for each value of `max_features` from 1 to the total number of features in the data;
 - for each model, use the value for `n_estimators` as determined in Part 2;
 - evaluate each model on the testing set using MAE;
 
After that you should answer the following questions:
 - Which value of `max_features` gives the best results?
 - Explain how you decided that this value for `max_features` gave the best results;
 - Was the result here better than the result of Part 2? Why is that? What % better or worse was it?

In [17]:
# We are using read_csv from pandas to read the data stored in csv format.

house_data = pd.read_csv('house_data.csv') 

In [18]:
# X holds the features (every column except the last) and y holds the target, price.

# The data is split into training and test sets with a ratio of 80:20.

# random_state is fixed so that the split, and therefore the results, do not change every time the code is run.

X=house_data.iloc[: , :-1]
y=house_data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=123)

In [19]:
# Using RandomForestRegressor() from sklearn.ensemble to create one random forest model for each max_features
# value from 1 to the number of feature columns, with n_estimators set to the value stored in n_est.

# The empty list testing_errors collects the validation error of each model, in loop order.

# Using fit() to train each model on X_train and y_train.

# random_state is fixed so that the results do not vary every time we run the model.

# We compute the mean absolute error between the predictions and the actual prices and append it,
# as a percentage of the mean price, to the testing_errors list.

# At the end we find the position in testing_errors of the smallest percentage error and
# store it in the max_features_result variable for later use.

testing_errors=[]

for i in range(1,len(X.columns)+1):
    rf4 = RandomForestRegressor(n_estimators=n_est,max_features=i,random_state=123) 
    rf4.fit(X_train, y_train)
    validation_predictions = rf4.predict(X_test)
    validation_e = mean_absolute_error(y_test, validation_predictions)
    validation_ep = validation_e * 100 / y.mean()
    testing_errors.append(validation_ep)

max_features_result = testing_errors.index(min(testing_errors))
print(f"We get best results from {max_features_result} features.")

We get best results from 12 features.


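The value for `max_features` is chosen as the one with the lowest validation error. Because `testing_errors.index(...)` returns a zero-based position and the loop starts at `max_features=1`, the printed 12 is a list position; the lowest error was obtained with `max_features=13`, that is, with all features available at each split.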

In [20]:
# Using RandomForestRegressor() from sklearn.ensemble to create a random forest model with n_estimators
# taken from n_est and max_features taken from max_features_result, both computed in earlier cells.

# Using fit() to train the model on X_train and y_train.

# random_state is fixed so that the results do not vary every time we run the model.

# We use predict() to generate predictions for X_test (the validation features)
# and store them in the validation_predictions variable.

# We then compute the mean absolute error between those predictions and the actual validation prices, and print it
# both as a dollar amount and as a percentage of the mean price of the full dataset (y.mean()).

rf5=RandomForestRegressor(max_features=max_features_result,n_estimators=n_est,random_state=123)
rf5.fit(X_train,y_train)
validation_predictions = rf5.predict(X_test)
validation_e = mean_absolute_error(y_test, validation_predictions)
validation_ep = validation_e * 100 / y.mean() 
print(f"${validation_e:.0f} average error; {validation_ep:.2f}% error")

$132851 average error; 24.07% error


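- The results here show that the model in Part 2 performs slightly better than this one: the validation error is 24.07% here versus 23.62% in Part 2.
- That is 0.45 percentage points more error, or about 1.9% higher MAE in relative terms (132851 versus 130375). The likely cause is an off-by-one in the selection step: `testing_errors.index(...)` returns a zero-based position, so position 12 corresponds to `max_features=13`, but the final model was fitted with `max_features=12`.
  With 13 features in the data, `max_features=13` lets every split consider all features, which is also the default behaviour of `RandomForestRegressor`, so the best configuration found in the loop should match the Part 2 model rather than improve on it.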
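
One way to see how `max_features` equal to the number of features relates to the default setting is to fit both configurations on synthetic data and compare their predictions. The sketch below assumes a scikit-learn version in which `RandomForestRegressor` considers all features at each split by default; the data from `make_regression` and the names `X_demo`, `default_rf`, and `all_features_rf` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with 13 features, mirroring the number of features in the lab
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# Model with the default max_features
default_rf = RandomForestRegressor(n_estimators=26, random_state=123)
default_rf.fit(X_demo, y_demo)

# Model with max_features set explicitly to the number of features
all_features_rf = RandomForestRegressor(
    n_estimators=26, max_features=X_demo.shape[1], random_state=123
)
all_features_rf.fit(X_demo, y_demo)

# With the same random_state, both models are expected to produce the same predictions
same_predictions = np.allclose(default_rf.predict(X_demo), all_features_rf.predict(X_demo))
print(f"Identical predictions: {same_predictions}")
```

If the two models agree, searching over `max_features` can only improve on the default when a value below the total number of features gives a lower validation error.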