# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [10]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [21]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data". The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [32]:
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
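
Three of the techniques listed above (mean imputation, winsorization, and one-hot encoding) can be sketched on a small example. The DataFrame `toy`, its column names, and its values are hypothetical and are not taken from any of the lab data sets; the 5th/95th percentile cutoffs are one possible choice, not a prescribed one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value and an
# extreme value, and one categorical column.
toy = pd.DataFrame({
    "income": [30.0, 35.0, np.nan, 40.0, 1000.0],
    "region": ["north", "south", "south", "east", "north"],
})

# 1. Address missingness: replace missing values with the column mean.
toy["income"] = toy["income"].fillna(toy["income"].mean())

# 2. Winsorize: clip values to the 5th and 95th percentiles of the column.
lower, upper = toy["income"].quantile([0.05, 0.95])
toy["income_wins"] = toy["income"].clip(lower=lower, upper=upper)

# 3. One-hot encode the categorical column.
toy_encoded = pd.get_dummies(toy, columns=["region"])

print(toy_encoded.columns.tolist())
```

Note that the mean used for imputation is itself pulled upward by the extreme value, which is one reason to inspect outliers before choosing an imputation strategy.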


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
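
As a rough illustration of these inspection techniques, the sketch below uses a hypothetical DataFrame `toy` (made-up columns and values). It uses the 1.5 × IQR rule as a numeric stand-in for the outliers a box plot would show, and `value_counts(normalize=True)` as one possible way to check class balance.

```python
import pandas as pd

# Hypothetical toy data: a numeric feature and a binary label.
toy = pd.DataFrame({
    "hours": [38, 40, 42, 40, 39, 41, 99],
    "label": ["no", "no", "no", "no", "no", "yes", "no"],
})

summary = toy.describe()    # key statistics for each numeric column
column_types = toy.dtypes   # data type of each column

# Flag outliers with the 1.5 * IQR rule (the same rule a box plot uses
# to draw points beyond its whiskers).
q1, q3 = toy["hours"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["hours"] < q1 - 1.5 * iqr) | (toy["hours"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# Share of each class in the label; a heavily skewed split suggests imbalance.
class_share = toy["label"].value_counts(normalize=True)

print(summary)
print(column_types)
print(outliers)
print(class_share)
```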


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [33]:
df.isnull().sum()

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

In [34]:
# Replace missing values in each numeric column with that column's mean
df.fillna(df.mean(numeric_only=True), inplace=True)

In [35]:
df = df.drop('country', axis=1)

In [36]:
df = df.drop('year', axis=1)

In [42]:
df.head()

|   | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Confidence in national government | Democratic Quality | Delivery Quality | Standard deviation of ladder by country-year | Standard deviation/Mean of ladder by country-year | GINI index (World Bank estimate) | GINI index (World Bank estimate), average 2000-15 | gini of household income reported in Gallup, by wp5-year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.72359 | 7.16869 | 0.450662 | 49.209663 | 0.718114 | 0.181819 | 0.881686 | 0.517637 | 0.258195 | 0.612072 | -1.92969 | -1.655084 | 1.774662 | 0.4766 | 0.372846 | 0.386948 | 0.445204 |
| 1 | 4.401778 | 7.33379 | 0.552308 | 49.624432 | 0.678896 | 0.203614 | 0.850035 | 0.583926 | 0.237092 | 0.611545 | -2.044093 | -1.635025 | 1.722688 | 0.391362 | 0.372846 | 0.386948 | 0.441906 |
| 2 | 4.758381 | 7.386629 | 0.539075 | 50.008961 | 0.600127 | 0.13763 | 0.706766 | 0.618265 | 0.275324 | 0.299357 | -1.99181 | -1.617176 | 1.878622 | 0.394803 | 0.372846 | 0.386948 | 0.327318 |
| 3 | 3.831719 | 7.415019 | 0.521104 | 50.367298 | 0.495901 | 0.175329 | 0.731109 | 0.611387 | 0.267175 | 0.307386 | -1.919018 | -1.616221 | 1.78536 | 0.465942 | 0.372846 | 0.386948 | 0.336764 |
| 4 | 3.782938 | 7.517126 | 0.520637 | 50.709263 | 0.530935 | 0.247159 | 0.77562 | 0.710385 | 0.267919 | 0.43544 | -1.842996 | -1.404078 | 1.798283 | 0.475367 | 0.372846 | 0.386948 | 0.34454 |


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [37]:
# Label: the column to predict. Features: all remaining columns.
y = df['Positive affect']
X = df.drop('Positive affect', axis=1)

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1234)

In [39]:
models = [
    ('Linear Regression', LinearRegression()),
    ('Decision Tree Regression', DecisionTreeRegressor()),
    ('Random Forest Regression', RandomForestRegressor()),
    ('Gradient Boosting Regression', GradientBoostingRegressor())
]

In [43]:
for name, model in models:
    rmse_scores = np.sqrt(-cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    
    avg_rmse = np.mean(rmse_scores)
    avg_r2 = np.mean(r2_scores)
    
    print(f"{name} -> RMSE: {avg_rmse:.4f}, R2: {avg_r2:.4f}")
   

Linear Regression -> RMSE: 0.0648, R2: 0.6274
Decision Tree Regression -> RMSE: 0.0722, R2: 0.5444
Random Forest Regression -> RMSE: 0.0518, R2: 0.7601
Gradient Boosting Regression -> RMSE: 0.0561, R2: 0.7207


In [44]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4],
}
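
The size of this search can be checked before running it. The sketch below counts the hyperparameter combinations in the grid with scikit-learn's `ParameterGrid` and multiplies by the number of folds, assuming 5-fold cross-validation as used elsewhere in this notebook. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3

# Each candidate is fit once per cross-validation fold (cv=5 assumed).
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 36 180
```

Because the number of fits grows multiplicatively with each added hyperparameter, it can be worth checking this count before starting a long search.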

In [45]:
# Tune the random forest (index 2 in `models`), which had the lowest cross-validated RMSE
best_model = models[2][1]
best_model

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [46]:
grid_search = GridSearchCV(best_model, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)

In [47]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] max_depth=None, min_samples_leaf=1, n_estimators=100 ............


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  max_depth=None, min_samples_leaf=1, n_estimators=100, total=   1.4s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=100 ............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s


[CV]  max_depth=None, min_samples_leaf=1, n_estimators=100, total=   1.4s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=100 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=100, total=   1.5s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=100 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=100, total=   1.4s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=100 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=100, total=   1.4s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=200 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=200, total=   2.9s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=200 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=200, total=   2.8s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=200 ............
[CV]  max_depth=None, min_samples_leaf=1, n_estimators=200, total=   2.9s
[CV] max_depth=None, min_samples_leaf=1, n_estimators=20

[CV]  max_depth=10, min_samples_leaf=1, n_estimators=300, total=   3.5s
[CV] max_depth=10, min_samples_leaf=1, n_estimators=300 ..............
[CV]  max_depth=10, min_samples_leaf=1, n_estimators=300, total=   3.4s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=100 ..............
[CV]  max_depth=10, min_samples_leaf=2, n_estimators=100, total=   1.1s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=100 ..............
[CV]  max_depth=10, min_samples_leaf=2, n_estimators=100, total=   1.1s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=100 ..............
[CV]  max_depth=10, min_samples_leaf=2, n_estimators=100, total=   1.2s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=100 ..............
[CV]  max_depth=10, min_samples_leaf=2, n_estimators=100, total=   1.1s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=100 ..............
[CV]  max_depth=10, min_samples_leaf=2, n_estimators=100, total=   1.1s
[CV] max_depth=10, min_samples_leaf=2, n_estimators=200 ..............

[CV]  max_depth=20, min_samples_leaf=2, n_estimators=300, total=   3.7s
[CV] max_depth=20, min_samples_leaf=2, n_estimators=300 ..............
[CV]  max_depth=20, min_samples_leaf=2, n_estimators=300, total=   3.7s
[CV] max_depth=20, min_samples_leaf=2, n_estimators=300 ..............
[CV]  max_depth=20, min_samples_leaf=2, n_estimators=300, total=   3.7s
[CV] max_depth=20, min_samples_leaf=2, n_estimators=300 ..............
[CV]  max_depth=20, min_samples_leaf=2, n_estimators=300, total=   3.8s
[CV] max_depth=20, min_samples_leaf=4, n_estimators=100 ..............
[CV]  max_depth=20, min_samples_leaf=4, n_estimators=100, total=   1.1s
[CV] max_depth=20, min_samples_leaf=4, n_estimators=100 ..............
[CV]  max_depth=20, min_samples_leaf=4, n_estimators=100, total=   1.1s
[CV] max_depth=20, min_samples_leaf=4, n_estimators=100 ..............
[CV]  max_depth=20, min_samples_leaf=4, n_estimators=100, total=   1.2s
[CV] max_depth=20, min_samples_leaf=4, n_estimators=100 ..............

[CV]  max_depth=30, min_samples_leaf=4, n_estimators=200, total=   2.2s
[CV] max_depth=30, min_samples_leaf=4, n_estimators=300 ..............
[CV]  max_depth=30, min_samples_leaf=4, n_estimators=300, total=   3.3s
[CV] max_depth=30, min_samples_leaf=4, n_estimators=300 ..............
[CV]  max_depth=30, min_samples_leaf=4, n_estimators=300, total=   3.2s
[CV] max_depth=30, min_samples_leaf=4, n_estimators=300 ..............
[CV]  max_depth=30, min_samples_leaf=4, n_estimators=300, total=   3.3s
[CV] max_depth=30, min_samples_leaf=4, n_estimators=300 ..............
[CV]  max_depth=30, min_samples_leaf=4, n_estimators=300, total=   3.3s
[CV] max_depth=30, min_samples_leaf=4, n_estimators=300 ..............
[CV]  max_depth=30, min_samples_leaf=4, n_estimators=300, total=   3.3s


[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  7.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jo

In [49]:
best_estimator = grid_search.best_estimator_
best_estimator

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=20, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
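
Besides `best_estimator_`, a fitted `GridSearchCV` object exposes `best_params_` and `best_score_`. The sketch below shows how these might be read, using synthetic data from `make_regression` and a deliberately small grid so it runs quickly. The data, the grid values, and the names `search`, `X_demo`, and `y_demo` are assumptions for illustration and are not the notebook's actual search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Winning hyperparameter combination, as a dict.
best_params = search.best_params_

# best_score_ is the mean cross-validated *negative* MSE of the winner,
# so negate it before taking the square root to get an RMSE.
cv_rmse = np.sqrt(-search.best_score_)

print(best_params)
print(f"Cross-validated RMSE: {cv_rmse:.4f}")
```

Comparing this cross-validated RMSE with the RMSE on the held-out test set gives a rough sense of whether the tuned model generalizes.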

In [50]:
y_pred = best_estimator.predict(X_test)

In [51]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

In [52]:
print("Best Random Forest Model:")
print(best_estimator)
print(f"RMSE: {rmse:.4f}")
print(f"R2 Score: {r2:.4f}")

Best Random Forest Model:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=20, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
RMSE: 0.0552
R2 Score: 0.7536
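
The task list also mentions feature selection. One possible approach, sketched here on synthetic data, is `SelectFromModel` with a random forest's feature importances. The data, the `threshold="median"` setting, and the names `selector`, `X_demo`, and `y_demo` are assumptions for illustration; this step was not run on the WHR data in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Keep features whose importance is at least the median importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold="median",
)
selector.fit(X_demo, y_demo)

X_selected = selector.transform(X_demo)  # reduced feature matrix
kept = selector.get_support()            # boolean mask over the original columns

print(X_demo.shape, "->", X_selected.shape)
```

When the features live in a DataFrame, the boolean mask from `get_support()` can be used to index its columns and recover the names of the retained features.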
