<a href="https://colab.research.google.com/github/XiaonaZhou/data_analytics_2/blob/main/Python/Unit_4/Random_forest_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random forest regression

In this notebook, I re-predict Boston housing prices with a random forest regressor, then work through a second random forest regression example from [Towards Data Science](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0).

## **Re-predict Boston housing prices**

## 1. Set up and import data

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [17]:
# import the Boston dataset
# (note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
from sklearn.datasets import load_boston
boston_dataset = load_boston()
# create a pandas dataframe and store the data
df_boston = pd.DataFrame(boston_dataset.data)
df_boston.columns = boston_dataset.feature_names
# append the target, Price, as a new column to the dataset
df_boston['Price'] = boston_dataset.target

In [18]:
df_boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Price
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


## 2. Missing values and EDA

Skipped here. Both steps were done in `Linear_Regression_Boston_Housing_Guided_Project.ipynb`.

## 3. Build random forest 

In [19]:
# split data into train and test set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df_boston.drop('Price', axis=1), df_boston['Price'], test_size=0.30, random_state=101)

In [20]:
# Import the model 
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 100 decision trees
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)
# Train the model on training data
rf.fit(x_train, y_train);

## 4. Evaluate model

In [21]:
# Use the forest's predict method on the test data
predictions = rf.predict(x_test)
# Calculate the absolute errors
errors = abs(predictions - y_test)
# Print out the mean absolute error (mae)
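# Price is the median home value in thousands of dollars, so the error is in the same unit
print('Mean Absolute Error:', round(np.mean(errors), 2), 'thousand dollars.')

Mean Absolute Error: 2.68 thousand dollars.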


In [22]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_test)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 87.04 %.
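The two metrics above can be written out by hand to make the arithmetic explicit. This is a minimal sketch in plain Python; the function names and the three sample values are made up for illustration, and the percentage error is only defined when every actual value is nonzero (positive targets are assumed here, as with house prices).

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |prediction - actual| over all samples."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape_accuracy(y_true, y_pred):
    """100 minus the mean absolute percentage error (assumes positive actual values)."""
    mape = 100 * sum(abs(p - t) / t for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100 - mape


# Hypothetical values, chosen only to keep the arithmetic easy to follow
actual = [20.0, 25.0, 40.0]
predicted = [22.0, 24.0, 36.0]

mae = mean_absolute_error(actual, predicted)  # (2 + 1 + 4) / 3
acc = mape_accuracy(actual, predicted)        # 100 - 100 * (0.10 + 0.04 + 0.10) / 3
print('MAE:', round(mae, 2))
print('Accuracy:', round(acc, 2), '%')
```

Note that this "accuracy" is simply 100 minus a percentage error, not a classification accuracy.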


In [23]:
from sklearn import metrics
print('R-squared score: ', metrics.r2_score(y_test, predictions))

R-squared score:  0.8652237890346178
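For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. The sketch below computes it by hand in plain Python; the function name and sample values are hypothetical and are not taken from the Boston data.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only
actual_prices = [20.0, 25.0, 40.0]
predicted_prices = [22.0, 24.0, 36.0]

r2 = r2_score_manual(actual_prices, predicted_prices)
print('R-squared:', round(r2, 3))
```

A model that always predicts the mean of the actual values scores 0 under this definition, so a score of about 0.87 means the forest explains most of the variance that the mean alone leaves unexplained.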


Summary 
* $R^2$ = 0.69 with linear regression
* $R^2$ = 0.78 with decision trees
* $R^2$ = 0.87 with random forest

Therefore, the random forest model performs best of the three when predicting Boston housing prices.

## **Predicting temperature**

Predict the actual temperature using the `temps.csv` dataset.

[Source](https://drive.google.com/file/d/1OcDfTKPZcnqFCzhTgaZtXi-LUWHSEg8a/view)

## 1. Import data


In [49]:
import pandas as pd

# Read in data as pandas dataframe and display first 5 rows
df = pd.read_csv('https://raw.githubusercontent.com/XiaonaZhou/data_analytics_2/main/Python/Unit_4/temps.csv')
df.head(5)

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend
0,2016,1,1,Fri,45,45,45.6,45,43,50,44,29
1,2016,1,2,Sat,44,45,45.7,44,41,50,44,61
2,2016,1,3,Sun,45,44,45.8,41,43,46,47,56
3,2016,1,4,Mon,44,41,45.9,40,44,48,46,53
4,2016,1,5,Tues,41,40,46.0,44,46,46,46,41


Following are explanations of the columns:

* year: 2016 for all data points
* month: number for month of the year
* day: number for day of the year
* week: day of the week as a character string
* temp_2: max temperature 2 days prior
* temp_1: max temperature 1 day prior
* average: historical average max temperature
* actual: max temperature measurement
* friend: your friend’s prediction, a random number between 20 below the average and 20 above the average

In [50]:
print('The shape of our features is:', df.shape)

The shape of our features is: (348, 12)


## 2. Data Wrangling

Convert non-numerical values to numerical values (one-hot encoding).


In [51]:
df_num = pd.get_dummies(df)
df_num.head(5)

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,2016,1,1,45,45,45.6,45,43,50,44,29,1,0,0,0,0,0,0
1,2016,1,2,44,45,45.7,44,41,50,44,61,0,0,1,0,0,0,0
2,2016,1,3,45,44,45.8,41,43,46,47,56,0,0,0,1,0,0,0
3,2016,1,4,44,41,45.9,40,44,48,46,53,0,1,0,0,0,0,0
4,2016,1,5,41,40,46.0,44,46,46,46,41,0,0,0,0,0,1,0


In [52]:
print('Shape of features after one-hot encoding:', df_num.shape)

Shape of features after one-hot encoding: (348, 18)
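The jump from 12 to 18 columns follows from the encoding: the single `week` column is replaced by seven 0/1 indicator columns, one per day of the week (12 - 1 + 7 = 18). The idea can be sketched in plain Python as follows; `one_hot` is a hypothetical helper written for this illustration, not part of pandas.

```python
def one_hot(values, prefix):
    """Return {column_name: [0 or 1, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# The `week` values from the first five rows shown above
week = ["Fri", "Sat", "Sun", "Mon", "Tues"]
encoded = one_hot(week, prefix="week")

for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the indicator columns. Depending on the pandas version, `pd.get_dummies` may return `True`/`False` instead of 1/0; passing `dtype=int` gives integer columns.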


## 3. Build a model

Train and test split

In [53]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(df_num.drop('actual',axis=1), df_num['actual'], test_size = 0.25,
                                                                           random_state = 42)
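Conceptually, the split shuffles the row indices and sets aside a fraction of them for testing. The sketch below shows that idea in plain Python. It is not scikit-learn's implementation and will not reproduce the same rows; `split_indices` is a hypothetical helper, and rounding the test size up is a choice made for this sketch.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


# 348 rows, as reported by df.shape above, with a 25% test set
train_idx, test_idx = split_indices(n_samples=348, test_size=0.25, seed=42)
print('Training rows:', len(train_idx))
print('Testing rows:', len(test_idx))
```

With 348 rows and `test_size = 0.25`, this leaves 261 rows for training and 87 for testing.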

*Establish Baseline (optional)*

A sensible baseline is not always available. Here, we use the historical average as the baseline, since it is an intuitive prediction to beat.


In [54]:
baseline_preds = x_test['average']

In [55]:
baseline_errors = abs(baseline_preds - y_test)
print('Average baseline error: ', round(np.mean(baseline_errors), 2), 'degrees.')

Average baseline error:  5.06 degrees.


If we can’t beat an average error of 5 degrees, then we need to rethink our approach.

In [56]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model 
rf = RandomForestRegressor(n_estimators= 1000, random_state=42)

# Train the model on training data
rf.fit(x_train, y_train);

## 4. Make predictions and evaluate the model

In [57]:
# Use the forest's predict method on the test data
predictions = rf.predict(x_test)
# Calculate the absolute errors
errors = abs(predictions - y_test)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
mape = np.mean(100 * (errors / y_test))
accuracy = 100 - mape

print('Accuracy:', round(accuracy, 2), '%.')

Mean Absolute Error: 3.87 degrees.
Accuracy: 93.93 %.
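To put the result in context, the model's MAE can be compared with the baseline error computed earlier. This short sketch only redoes the arithmetic on the two numbers already reported in this notebook; the variable names are chosen for illustration.

```python
baseline_mae = 5.06  # average baseline error reported earlier (historical average)
model_mae = 3.87     # random forest MAE reported above

# Relative reduction in error compared with the baseline, in percent
improvement = 100 * (baseline_mae - model_mae) / baseline_mae
print(f'MAE reduced by {improvement:.1f}% relative to the baseline')
```

The forest beats the baseline by more than a degree on average, so the approach does not need rethinking.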


In [58]:
from sklearn import metrics
print('R-squared score: ', metrics.r2_score(y_test, predictions))

R-squared score:  0.8128487257488989


## 5. Feature selection via feature importance

In [59]:
import pandas as pd
feature_imp = pd.Series(rf.feature_importances_,index=list(x_test.columns)).sort_values(ascending=False)
feature_imp

temp_1            0.655553
average           0.150330
forecast_noaa     0.045382
forecast_acc      0.034859
forecast_under    0.023190
day               0.021119
temp_2            0.020993
friend            0.020685
month             0.010330
week_Sat          0.003613
week_Fri          0.003525
week_Mon          0.002588
week_Tues         0.002303
week_Sun          0.002289
week_Wed          0.001974
week_Thurs        0.001266
year              0.000000
dtype: float64

In [60]:
round(feature_imp,2)

temp_1            0.66
average           0.15
forecast_noaa     0.05
forecast_acc      0.03
forecast_under    0.02
day               0.02
temp_2            0.02
friend            0.02
month             0.01
week_Sat          0.00
week_Fri          0.00
week_Mon          0.00
week_Tues         0.00
week_Sun          0.00
week_Wed          0.00
week_Thurs        0.00
year              0.00
dtype: float64

**The two most important features are `temp_1` and `average`. Let's make another model using only these two features**
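The same choice can be made programmatically by sorting the importances and keeping the top `k`. The sketch below uses the importance values printed above, stored in a plain dictionary so that it runs on its own; `k = 2` mirrors the manual choice, and the variable names are illustrative.

```python
# Feature importances copied from the output above
importances = {
    'temp_1': 0.655553, 'average': 0.150330, 'forecast_noaa': 0.045382,
    'forecast_acc': 0.034859, 'forecast_under': 0.023190, 'day': 0.021119,
    'temp_2': 0.020993, 'friend': 0.020685, 'month': 0.010330,
    'week_Sat': 0.003613, 'week_Fri': 0.003525, 'week_Mon': 0.002588,
    'week_Tues': 0.002303, 'week_Sun': 0.002289, 'week_Wed': 0.001974,
    'week_Thurs': 0.001266, 'year': 0.000000,
}

k = 2  # number of features to keep
top_features = sorted(importances, key=importances.get, reverse=True)[:k]
share = sum(importances[name] for name in top_features)

print('Top features:', top_features)
print('Share of total importance:', round(share, 2))
```

Together, `temp_1` and `average` account for roughly 81% of the total importance, which is why the reduced model below is a reasonable experiment.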

In [61]:
# Split the data into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(df_num[['temp_1', 'average']], df_num['actual'], test_size = 0.25,
                                                                           random_state = 42)

In [62]:
# Instantiate model 
rf = RandomForestRegressor(n_estimators= 1000, random_state=42)

# Train the model on training data
rf.fit(x_train, y_train);

In [63]:
# Use the forest's predict method on the test data
predictions = rf.predict(x_test)
# Calculate the absolute errors
errors = abs(predictions - y_test)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
mape = np.mean(100 * (errors / y_test))
accuracy = 100 - mape

print('Accuracy:', round(accuracy, 2), '%.')

Mean Absolute Error: 3.92 degrees.
Accuracy: 93.76 %.


In [64]:
from sklearn import metrics
print('R-squared score: ', metrics.r2_score(y_test, predictions))

R-squared score:  0.8079504720718287


All the performance measures ($R^2$, MAE, accuracy) are very similar whether the model uses all 17 features or only the two most important ones. Feature selection is therefore worthwhile here: it shrinks the input from 17 columns to 2, which reduces run time, at almost no cost in predictive performance.
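The size of the difference can be checked directly from the numbers reported in this notebook. The sketch below computes the relative change of each metric when moving from the full model to the two-feature model; the dictionary names are illustrative.

```python
# Metrics reported above for the two models
full_model = {'mae': 3.87, 'accuracy': 93.93, 'r2': 0.8128487257488989}
reduced_model = {'mae': 3.92, 'accuracy': 93.76, 'r2': 0.8079504720718287}

# Relative change in percent, going from all features to two features
changes = {
    name: 100 * (reduced_model[name] - full_model[name]) / full_model[name]
    for name in full_model
}

for name, change in changes.items():
    print(f'{name}: {full_model[name]:.4g} -> {reduced_model[name]:.4g} ({change:+.2f}%)')
```

Every metric moves by less than about 1.3% in relative terms, which supports treating the two models as practically equivalent on this test set.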