# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename, header =0)

df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K
5,37.0,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40.0,United-States,<=50K
6,49.0,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,45.0,United-States,>50K
8,31.0,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50.0,United-States,>50K
9,42.0,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,5178,0,40.0,United-States,>50K


In [3]:

np.sum(df.isnull(), axis = 0)

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

Answers: 
1) I have chosen the census ("adult") data set, `censusData.csv`.
2) I will be predicting a person's income bracket. The label is `income_binary`, which is either `<=50K` or `>50K`.
3) This is a supervised learning problem, specifically a binary classification problem.
4) The features can be every column other than the label and `fnlwgt`. The `fnlwgt` column seems to be some kind of ID rather than a numerical value with a meaningful relationship to the label.
5) Predicting income is valuable, for example, for insurance companies that want to know which product to target to which customers.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

Part 3

1) The data preparation techniques I will use are removing irrelevant features, one-hot encoding the categorical features, replacing missing numerical values with the column's mean, and deleting rows where a categorical column contains NaN. Additionally, I will need to convert the label column from an object to a numerical data type (1 for >50K and 0 for <=50K). I will probably remove the native-country column to reduce potential racial or national-origin bias, and the fnlwgt column since it doesn't seem useful.
2) I want to use an ensemble method based on bagging, so the model I will use is a random forest. Bagging reduces variance, which leads to less overfitting, and we want our model to perform well on both training and test data.
3) To evaluate my model, I will use the accuracy score, ROC AUC, and a confusion matrix. Additionally, I want to use grid search to try different hyperparameter values for the random forest and find the best ones.
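
The label conversion and the class-imbalance check mentioned above could look roughly like the sketch below. It runs on a small made-up frame rather than the census file; the name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the census label column (values are illustrative only).
toy = pd.DataFrame({"income_binary": ["<=50K", "<=50K", ">50K", "<=50K"]})

# Relative frequency of each class; a large gap signals class imbalance.
class_share = toy["income_binary"].value_counts(normalize=True)
print(class_share)

# Map the string label to integers: 1 for ">50K", 0 for "<=50K".
toy["income_binary"] = (toy["income_binary"] == ">50K").astype(int)
print(toy["income_binary"].tolist())
```

The same two steps should apply to `df["income_binary"]` once the data set is loaded.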

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Part 4
1) My feature list will contain every feature except native-country, which I am removing to reduce potential bias, and fnlwgt, which doesn't seem to have any predictive value. Removing these columns reduces the computation needed and makes for a more ethical model. Of the remaining features, those with the object data type will be one-hot encoded.
2) I will use one-hot encoding to turn string columns into numerical data. I will also replace NaN values in numerical columns with the column's mean. For text columns with NaN values, I will delete the affected rows. Since the data has few NaN values relative to the number of rows, this should be fine.
3) My model will be a random forest, which uses bagging.
4) I will use grid search to find the best values for max_depth and n_estimators, then use those values to train a random forest model. I will split my data so that the test set is 25% of the data set. Then I will evaluate the accuracy and ROC AUC scores on the test set.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [4]:
# Metrics, model selection utilities, and the random forest classifier
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
# Drop the two columns excluded in the project plan (columns= already implies the axis)
df = df.drop(columns=["fnlwgt", "native-country"])
df.columns

Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex_selfID', 'capital-gain',
       'capital-loss', 'hours-per-week', 'income_binary'],
      dtype='object')

In [7]:
df = df.dropna(subset=['workclass', 'occupation'])

In [8]:
df.describe()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,30565.0,30718.0,30718.0,30718.0,30418.0
mean,38.448062,10.130314,630.954587,88.910216,40.960583
std,13.125337,2.562469,2453.058671,405.657203,11.994215
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,47.0,13.0,0.0,0.0,45.0
max,90.0,16.0,14084.0,4356.0,99.0


In [9]:
np.sum(df.isnull(), axis = 0)

age               153
workclass           0
education           0
education-num       0
marital-status      0
occupation          0
relationship        0
race                0
sex_selfID          0
capital-gain        0
capital-loss        0
hours-per-week    300
income_binary       0
dtype: int64

In [10]:
df[['age', 'hours-per-week']].dtypes

age               float64
hours-per-week    float64
dtype: object

In [11]:
# Fill in NaN values for the two columns above with their mean.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)

In [12]:
mean_hours_per_week = df['hours-per-week'].mean()
df['hours-per-week'] = df['hours-per-week'].fillna(mean_hours_per_week)

In [13]:
np.sum(df.isnull(), axis = 0)


age               0
workclass         0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex_selfID        0
capital-gain      0
capital-loss      0
hours-per-week    0
income_binary     0
dtype: int64
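
The cells above compute each mean over the full data set, before the train/test split, so test rows contribute to the imputed value. One alternative, sketched here as an assumption rather than what this notebook does, is to fit scikit-learn's `SimpleImputer` on the training rows only and reuse it on the test rows. The frames `train` and `test` are made-up toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric columns with gaps (values are illustrative, not from the census file).
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "hours-per-week": [40.0, 50.0, np.nan]})
test = pd.DataFrame({"age": [np.nan], "hours-per-week": [35.0]})

imputer = SimpleImputer(strategy="mean")
# Learn the column means from the training rows only ...
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# ... and reuse those same means for the test rows.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```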

In [14]:
# Inspect the data types: columns with the object data type will be one-hot encoded
df.dtypes

age               float64
workclass          object
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
income_binary      object
dtype: object

In [15]:
df = pd.get_dummies(df, prefix=['workclass_'], drop_first = True, columns= ["workclass"])

In [16]:
df = pd.get_dummies(df, prefix=['education_'], drop_first = True, columns= ["education"])

In [17]:
df = pd.get_dummies(df, prefix=['marital-status_'], drop_first = True, columns= ["marital-status"])

In [18]:
df = pd.get_dummies(df, prefix=['occupation_'], drop_first = True, columns= ["occupation"])

In [19]:
df = pd.get_dummies(df, prefix=['relationship_'], drop_first = True, columns= ["relationship"])

In [20]:
df = pd.get_dummies(df, prefix=['race_'], drop_first = True, columns= ["race"])

In [21]:
df = pd.get_dummies(df, prefix=['sex_selfID_'], drop_first = True, columns= ["sex_selfID"])
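
The seven `pd.get_dummies` calls above can probably be collapsed into one call that lists every categorical column. The sketch below uses a made-up frame (`toy`) with two categorical columns. Note that `get_dummies` already joins prefix and value with `_` by default, which is why passing a prefix such as `'workclass_'` produces double underscores in the column names.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (illustrative values).
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "race": ["White", "Black", "White"],
    "age": [39.0, 50.0, 38.0],
})

# Columns to encode; in the notebook this would be the seven object-typed features.
categorical_cols = ["workclass", "race"]

# The default prefix is the column name and the default separator is "_",
# so the new columns are named like "workclass_State-gov".
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, the first category of each column (in sorted order) is dropped, as in the cells above.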

In [22]:
df.columns

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'income_binary', 'workclass__Local-gov',
       'workclass__Private', 'workclass__Self-emp-inc',
       'workclass__Self-emp-not-inc', 'workclass__State-gov',
       'workclass__Without-pay', 'education__11th', 'education__12th',
       'education__1st-4th', 'education__5th-6th', 'education__7th-8th',
       'education__9th', 'education__Assoc-acdm', 'education__Assoc-voc',
       'education__Bachelors', 'education__Doctorate', 'education__HS-grad',
       'education__Masters', 'education__Preschool', 'education__Prof-school',
       'education__Some-college', 'marital-status__Married-AF-spouse',
       'marital-status__Married-civ-spouse',
       'marital-status__Married-spouse-absent',
       'marital-status__Never-married', 'marital-status__Separated',
       'marital-status__Widowed', 'occupation__Armed-Forces',
       'occupation__Craft-repair', 'occupation__Exec-managerial',
       'occupation

In [24]:
df['income_binary'].unique()

array(['<=50K', '>50K'], dtype=object)

In [25]:
#check for null values, all 0s so we can continue
df.isnull().sum()

age                                      0
education-num                            0
capital-gain                             0
capital-loss                             0
hours-per-week                           0
income_binary                            0
workclass__Local-gov                     0
workclass__Private                       0
workclass__Self-emp-inc                  0
workclass__Self-emp-not-inc              0
workclass__State-gov                     0
workclass__Without-pay                   0
education__11th                          0
education__12th                          0
education__1st-4th                       0
education__5th-6th                       0
education__7th-8th                       0
education__9th                           0
education__Assoc-acdm                    0
education__Assoc-voc                     0
education__Bachelors                     0
education__Doctorate                     0
education__HS-grad                       0
education__

In [26]:
#define X and y as features and label
X = df.drop(columns="income_binary")
y = df["income_binary"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [28]:
print(X_train.shape)
print(X_test.shape)

(23038, 55)
(7680, 55)
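
The split above has no `random_state`, so the rows that land in each set change from run to run. A reproducible variant that also preserves the class ratio might look like the sketch below. The seed value `1234`, the use of `stratify`, and the toy arrays are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy label (75% / 25%), for illustration only.
X_toy = np.arange(32).reshape(16, 2)
y_toy = np.array(["<=50K"] * 12 + [">50K"] * 4)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1234, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print((y_te == ">50K").sum(), "positive rows in the test split")
```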


In [29]:
#make random forest model 
rf_model = RandomForestClassifier()

In [30]:
param_grid = {"max_depth" : [50,100,150, 200], "n_estimators" : [20,40,60,80]}

In [31]:
#use GridSearch of 5 folds of cross validation to find the best value for max_depth and n_estimators
rf_grid = GridSearchCV(rf_model, param_grid = param_grid, scoring = "accuracy", cv=5)

In [32]:
rf_grid_search = rf_grid.fit(X_train, y_train)

In [33]:
best_params = rf_grid_search.best_params_
best_params

{'max_depth': 50, 'n_estimators': 40}

In [34]:
rf_grid_search.best_score_

0.8384409933253745
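
After a grid search like the one above, `GridSearchCV` (with its default `refit=True`) already holds a model retrained with the best parameters, available as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately small grid. The data, grid values, and variable names are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the census features (illustrative only).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so the sketch runs quickly.
toy_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=toy_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already retrained on all
# of the data passed to fit(), using best_params_.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Reusing `best_estimator_` keeps the final model's settings identical to the ones that were cross-validated.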

In [35]:
# GridSearch selected max_depth = 50 and n_estimators = 40 (see best_params above).
# Build a random forest model with these params. Note that criterion = "entropy" differs
# from the default "gini" criterion that was used during the grid search.
best_rf_model = RandomForestClassifier(criterion = "entropy", max_depth = best_params["max_depth"], n_estimators = best_params["n_estimators"])

In [36]:
best_rf_model.fit(X_train, y_train)

In [37]:
#make predictions 
predict = best_rf_model.predict(X_test)

# accuracy_score expects the true labels first, then the predictions
acc_score = accuracy_score(y_test, predict)
acc_score

The accuracy score falls in the high 80s, so the model seems to perform fairly well on the test data.
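
The project plan also lists a confusion matrix, which shows how the errors split between the two classes. A minimal sketch using scikit-learn's `confusion_matrix` on made-up labels follows. The lists `y_true` and `y_pred` are illustrative and are not results from this model.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels (illustrative only).
y_true = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
y_pred = ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

In the notebook, the equivalent call would take `y_test` and `predict` in that order.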


In [39]:
# Make predictions on the test data using the predict_proba() method
prob_predictions = list(best_rf_model.predict_proba(X_test)[:,1])

In [42]:
#find roc-auc score
roc_auc = roc_auc_score(y_test, prob_predictions)
roc_auc

0.8894506347147311

Since the ROC AUC score is about 0.89, the model seems to perform reasonably well. A score of 1 means the model ranks every positive example above every negative one, while a score of 0.5 is no better than random guessing, so values close to 1 indicate good performance on the test data.
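
The ROC AUC score is the area under the ROC curve, which can also be computed explicitly with the `roc_curve` and `auc` functions imported earlier. The sketch below uses made-up 0/1 labels and scores. With string labels such as those in `y_test`, `roc_curve` would additionally need `pos_label=">50K"`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted probabilities of the positive class (illustrative only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the false and true positive rates at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve via the trapezoidal rule; it matches roc_auc_score.
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, y_score))
```

The `fpr` and `tpr` arrays are also what would be plotted to draw the curve itself.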