# Lab 8: Define and Solve an ML Problem of Your Choosing

In [631]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [632]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(WHRDataSet_filename)
print(df.shape)
print(list(df.columns))

#df.head(20)

# Class imbalance example: with a binary label (0/1) for whether a patient has a rare cancer, almost every row belongs to one class.
# A label that is skewed this heavily toward 0 or 1 may need to be rebalanced (for example, by augmenting the rare class).

(1562, 19)
['country', 'year', 'Life Ladder', 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect', 'Confidence in national government', 'Democratic Quality', 'Delivery Quality', 'Standard deviation of ladder by country-year', 'Standard deviation/Mean of ladder by country-year', 'GINI index (World Bank estimate)', 'GINI index (World Bank estimate), average 2000-15', 'gini of household income reported in Gallup, by wp5-year']


In [633]:
#find columns that have missing values
nan_count = np.sum(df.isnull(), axis = 0)
nan_count



country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

# My ML Problem
1. World Happiness Report (WHR) data set: WHR2018Chapter2OnlineData.csv
2. I will be predicting whether positive affect is above average. The label is "positive affect is above average" (1 = above the mean, 0 = not above the mean).
3. This is a supervised learning problem. This is a binary classification problem.
5. A company, like the general public, may want to know which features contribute to a higher-than-average positive affect in a population, regardless of the year. The country each row comes from is less relevant because the goal is to compare features, not countries.




## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [634]:
#print(df.describe())
#The mean of Positive affect is .708969
print(df.shape)

#Drop columns list
drop_cols = [col for col in df.columns if 'GINI' in col or 'Quality' in col] + [
    'Generosity', 'Standard deviation of ladder by country-year',
    'Standard deviation/Mean of ladder by country-year',
    'gini of household income reported in Gallup, by wp5-year',
    'Confidence in national government', 'country', 'year'
]

(1562, 19)


In [635]:
df_clean = df.drop(columns=drop_cols)

In [636]:
#Do the impute method--> Find columns that have missing values
#Look at lab 2 for reference
cols_to_impute =['Log GDP per capita','Social support','Healthy life expectancy at birth','Freedom to make life choices','Perceptions of corruption','Positive affect','Negative affect']
cols_to_impute

['Log GDP per capita',
 'Social support',
 'Healthy life expectancy at birth',
 'Freedom to make life choices',
 'Perceptions of corruption',
 'Positive affect',
 'Negative affect']

In [637]:
for col in cols_to_impute:
    df_clean[f"{col}_missing"] = df_clean[col].isnull().astype(int)

In [638]:
#Replace missing values with the mean of the cols
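for colname in cols_to_impute:
    mean_val = df_clean[colname].mean()  # Series.mean() skips NaN by default
    # Assign the result back instead of using inplace=True on a column selection
    df_clean[colname] = df_clean[colname].fillna(mean_val)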
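The two steps above (flag missing values, then fill them with the column mean) can be checked on a small example. The sketch below uses a made-up two-column frame, not the WHR data, and compares the manual approach with scikit-learn's `SimpleImputer(add_indicator=True)`, which is one possible alternative. Fitting an imputer on the training rows only would keep test-set information out of the means, which the cells above do not do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for two WHR columns (values are made up for illustration)
toy = pd.DataFrame({
    "Log GDP per capita": [9.0, np.nan, 10.0, 11.0],
    "Social support": [0.8, 0.6, np.nan, 0.9],
})

# Manual version: flag missing entries, then fill them with the column mean
imputed = toy.copy()
for col in toy.columns:
    imputed[f"{col}_missing"] = toy[col].isnull().astype(int)
    imputed[col] = toy[col].fillna(toy[col].mean())

# scikit-learn version: imputed columns first, then one indicator column
# for each feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
sk_result = imputer.fit_transform(toy)

print(imputed)
print(sk_result.shape)
```

Both versions produce the same filled values and the same 0/1 indicator columns on this toy input.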

In [639]:
#df_cleaned = df.dropna(axis='columns')

          country  year  Life Ladder  \
0     Afghanistan  2008     3.723590   
1     Afghanistan  2009     4.401778   
2     Afghanistan  2010     4.758381   
3     Afghanistan  2011     3.831719   
4     Afghanistan  2012     3.782938   
...           ...   ...          ...   
1557     Zimbabwe  2013     4.690188   
1558     Zimbabwe  2014     4.184451   
1559     Zimbabwe  2015     3.703191   
1560     Zimbabwe  2016     3.735400   
1561     Zimbabwe  2017     3.638300   

      Standard deviation of ladder by country-year  \
0                                         1.774662   
1                                         1.722688   
2                                         1.878622   
3                                         1.785360   
4                                         1.798283   
...                                            ...   
1557                                      1.964805   
1558                                      2.079248   
1559                             

In [640]:
for colname in cols_to_impute:
    # Assign the result back instead of using inplace=True on a column selection
    df_clean[colname] = df_clean[colname].fillna(df_clean[colname].mean())

In [655]:
#Check that no missing values remain in the imputed columns
#for colname in cols_to_impute:
   # print("{} missing values count: {}".format(colname, df_clean[colname].isnull().sum()))

In [642]:
#Create Binary label
#The label will communicate if the positive affect is above the mean or not. 1 = above average. 0= not above average
mean_pa = df_clean['Positive affect'].mean()
df_clean['positive_affect_label'] = (df_clean['Positive affect'] > mean_pa).astype(int)
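Because the label is defined by a threshold at the mean, it is worth checking how evenly the two classes are split before modeling. A minimal sketch, using a hypothetical handful of scores in place of the real 'Positive affect' column:

```python
import pandas as pd

# Hypothetical scores standing in for the 'Positive affect' column
scores = pd.Series([0.55, 0.62, 0.70, 0.71, 0.80, 0.85])
label = (scores > scores.mean()).astype(int)

# Absolute counts and proportions per class
counts = label.value_counts()
shares = label.value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two `value_counts()` calls on `df_clean['positive_affect_label']` would show whether one class dominates.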


In [643]:
# Define features X and label y
X = df_clean.drop(columns=['Positive affect', 'positive_affect_label'])
y = df_clean['positive_affect_label']


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1. Feature list is: 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Perceptions of corruption', 'Negative affect'
2. I chose to remove "country" and "year". I also removed the columns with "GINI" or "Quality" in their names because those have negative values or missing values.
3. I created a binary label y = (Positive affect > mean_positive_affect)
4. I am going to use: missingness indicators, mean imputation, scaling

5. My model is: Logistic Regression with L2 regularization
   There will be two model variants: model_default has C=1.0 and model_best has C optimized from GridSearchCV 

6. I will define my label, split data, train default and tuned models. 

7. I will evaluate using ROC curves and AUC values, confusion matrices, precision–recall curves

8. I will improve by tuning C, visualizing curves, and comparing AUCs to ensure the final model generalizes well

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [644]:
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, auc, confusion_matrix, precision_recall_curve


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [645]:
#Split into train/test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)
print(X_train.shape)
print(X_test.shape)
X_train.head()

(1046, 14)
(516, 14)


Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Perceptions of corruption,Negative affect,Log GDP per capita_missing,Social support_missing,Healthy life expectancy at birth_missing,Freedom to make life choices_missing,Perceptions of corruption_missing,Positive affect_missing,Negative affect_missing
1326,7.776209,10.935776,0.946864,72.734001,0.945428,0.323241,0.176007,0,0,0,0,0,0,0
1069,6.89414,9.542232,0.937078,66.400909,0.640219,0.915287,0.149341,0,0,0,0,0,0,0
520,5.148242,10.117517,0.7529,71.780342,0.4383,0.872239,0.332831,0,0,0,0,0,0,0
643,7.060155,11.066487,0.943482,71.709785,0.905341,0.337085,0.212784,0,0,0,0,0,0,0
446,7.670627,10.659014,0.95134,69.745049,0.934179,0.216568,0.143539,0,0,0,0,0,0,0


In [646]:
model_default = LogisticRegression(max_iter=1000) #default C = 1.0


# fit(X, y) expects X as an array-like of shape (n_samples, n_features) and y as an array-like of shape (n_samples,)
model_default.fit(X_train,y_train)
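The project plan lists scaling as a preparation step, but the model above is fit on unscaled features. One way to add it, shown here as a sketch on synthetic data rather than the WHR features, is to chain `StandardScaler` and `LogisticRegression` in a pipeline so the scaler is fit on the training split only. The names `X_demo`, `y_demo` and `scaled_model` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the WHR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1234)

# The scaler learns its mean and variance from the training split only,
# then applies the same transformation to the test split
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
print(test_accuracy)
```

Scaling matters more for regularized logistic regression when features sit on very different ranges, as 'Healthy life expectancy at birth' and 'Social support' do here.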


In [647]:
#make predictions on the test data using predict_proba()
proba_predictions = model_default.predict_proba(X_test)
# Keep the second column: the predicted probability of class 1
proba_predictions_default = proba_predictions[:, 1]

In [648]:
#make predictions on the test data using the predict method
class_label_predictions_default = model_default.predict(X_test)
print(class_label_predictions_default)

[1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 0 1 0 1 0 0
 0 0 1 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0
 0 1 1 1 1 1 0 1 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0
 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 0 0 1 1
 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 0 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0
 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0
 1 0 1 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1
 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 1 1
 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 1
 1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1
 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 1 0 0 0 1 1
 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1
 0 0 1 0 1 0 0 1 0 1 0 0 

In [649]:
# The label is stored as integers, so list the classes as 1 and 0 (class 1 first)
c_m = confusion_matrix(y_test, class_label_predictions_default, labels=[1, 0])
c_m

array([[211,  58],
       [ 57, 190]])
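With class 1 listed first, the rows of the matrix above are the actual classes and the columns are the predicted classes. Accuracy, precision and recall follow directly from the four counts, as this short worked example shows:

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes (1, then 0),
# columns are predicted classes (1, then 0)
c_m = np.array([[211, 58],
                [57, 190]])
tp, fn = c_m[0]
fp, tn = c_m[1]

accuracy = (tp + tn) / c_m.sum()   # 401 / 516
precision = tp / (tp + fp)         # 211 / 268
recall = tp / (tp + fn)            # 211 / 269

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.777 precision=0.787 recall=0.784
```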

In [650]:
cs = [ 10 ** i for i in range(-5,5) ]
cs

param_grid = dict(C = list(cs))
param_grid

{'C': [1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}

In [651]:
print('Running Grid Search...')

# 1. Create a LogisticRegression model object with the argument max_iter=1000. 
#    Save the model object to the variable 'model'

model = LogisticRegression(max_iter = 1000)


# 2. Run a grid search with 5-fold cross-validation and assign the output to the 
# object 'grid'.

grid = GridSearchCV(model, param_grid, cv=5 )


# 3. Fit the model on the training data and assign the fitted model to the 
#    variable 'grid_search'

grid_search = grid.fit(X_train, y_train)

print('Done')

Running Grid Search...
Done
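The grid search above uses the default scoring (accuracy) and an integer `cv=5`. Since the plan is to compare models by AUC, a possible variation is to score the search with `scoring="roc_auc"` and pass an explicit `StratifiedKFold`. The sketch below runs on synthetic data with a shorter, illustrative grid and shows how `cv_results_` can be inspected for each candidate value of C.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training split (not the WHR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1234)

param_grid = {"C": [10 ** i for i in range(-3, 4)]}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=folds, scoring="roc_auc"
)
search.fit(X_demo, y_demo)

# One row per candidate C with its mean and spread of cross-validated AUC
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score"]
]
print(results)
print(search.best_params_)
```

Looking at the whole table, not just `best_params_`, shows whether the best C wins clearly or whether several values perform about the same.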


In [652]:
best_C = grid_search.best_params_['C']
best_C

1

In [653]:
# Use the value selected by the grid search instead of a hard-coded C
model_best = LogisticRegression(C=best_C, max_iter=1000)
model_best.fit(X_train, y_train)

In [654]:
# Make predictions on the test data using the predict_proba() method

proba_predictions_array = model_best.predict_proba(X_test)
proba_predictions_best = []
for i in proba_predictions_array:
    proba_predictions_best.append(i[1])

#print(proba_predictions_best)
# Make predictions on the test data using the predict() method

class_label_predictions_best = model_best.predict(X_test)
print(class_label_predictions_best)

In [None]:
confusion_matrix(y_test, class_label_predictions_best, labels=[1, 0])

In [None]:
precision_default, recall_default, thresholds_default = precision_recall_curve(y_test, proba_predictions_default)
precision_best, recall_best, thresholds_best = precision_recall_curve(y_test, proba_predictions_best)

In [None]:
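fig = plt.figure()
ax = fig.add_subplot(111)

# Plot both models on the same axes, with labels so the curves can be told apart
sns.lineplot(x=recall_default, y=precision_default, color='g', label='default hyperparameter')
sns.lineplot(x=recall_best, y=precision_best, color='r', label='best hyperparameter')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve")
plt.legend()
plt.show()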


In [None]:
#Record the true positive and false positive rates for both models
fpr_default, tpr_default, thresholds_default = roc_curve(y_test, proba_predictions_default)
fpr_best, tpr_best, thresholds_best = roc_curve(y_test, proba_predictions_best)



In [None]:
#plot ROC Curve for Default Hyperparameter
fig = plt.figure()
ax = fig.add_subplot(111)

sns.lineplot(x=fpr_default, y=tpr_default, color = 'g')
plt.title("Receiver operating characteristic (ROC) curve")
plt.xlabel("False positive rate (fpr)")
plt.ylabel("True positive rate (tpr)")
plt.legend(['default hyperparameter'])
plt.show()

In [None]:
#Plot ROC Curve for Best Hyperparameter
fig = plt.figure()
ax = fig.add_subplot(111)

sns.lineplot(x=fpr_best, y=tpr_best, color = 'r')
plt.title("Receiver operating characteristic (ROC) curve")
plt.xlabel("False positive rate (fpr)")
plt.ylabel("True positive rate (tpr)")
plt.legend(['best hyperparameter'])
plt.show()

In [None]:
#Compute the area under the ROC curve for both models
auc_default =auc(fpr_default, tpr_default)
auc_best = auc(fpr_best, tpr_best)

print(auc_default)
print(auc_best)
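The AUC computed above from `roc_curve` followed by `auc` can also be obtained in one step with `roc_auc_score`, which is already imported in this notebook. A small check on hypothetical labels and probabilities (made up for illustration) shows the two routes agree:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Route 1: build the ROC curve, then integrate it
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the same quantity directly
auc_direct = roc_auc_score(y_true, y_score)

print(auc_from_curve, auc_direct)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so both values equal 8/9.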

In [None]:
#Extract the best 2 features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Note that k=2 is specifying that we want the top 2 features
selector = SelectKBest(f_classif, k=2)
selector.fit(X, y)
feature_mask = selector.get_support()  # renamed to avoid shadowing the built-in filter()
top_2_features = X.columns[feature_mask]

print("Best 2 features:")
print(top_2_features)

# Create new training and test data for features
new_X_train = X_train[top_2_features]
new_X_test = X_test[top_2_features]


# Initialize a LogisticRegression model object with the best value of hyperparameter C 
# Note: Supply max_iter=1000 as an argument when creating the model object

model = LogisticRegression(C= best_C, max_iter = 1000)
# Fit the model to the new training data

model.fit(new_X_train,y_train)


# Use the predict_proba() method to use your model to make predictions on the new test data 
# Save the values of the second column to a list called 'proba_predictions'

new_proba = model.predict_proba(new_X_test)
proba_predictions = []
for i in new_proba:
    proba_predictions.append(i[1])

# Compute the auc-roc
fpr, tpr, thresholds = roc_curve(y_test, proba_predictions)
auc_result = auc(fpr, tpr)
print(auc_result)
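The selector above is fit on all of `X` and `y`, so the test rows influence which features are chosen. A possible alternative, sketched here on synthetic data with illustrative names, places `SelectKBest` inside a pipeline so feature scores come from the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for X and y (not the WHR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1234)

# Selection happens inside the pipeline, so the F-scores are computed
# from training rows only
pipe = make_pipeline(SelectKBest(f_classif, k=2), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

kept = pipe.named_steps["selectkbest"].get_support()  # boolean mask of kept features
test_auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])

print(kept, test_auc)
```

Whether this changes the selected features on the WHR data is not known without running it; the point is only that the test split stays untouched until evaluation.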



Evaluation:

The AUC is 0.80, which indicates reasonably strong classification performance. The label is binary: 1 means positive affect is above the mean, and 0 means it is not.
The best two features for predicting the label were 'Freedom to make life choices' and 'Life Ladder'. I chose to keep only the top two features because there were only 6 features in total.

The two models produce nearly identical precision-recall curves, so across all decision thresholds both models make similar trade-offs between precision and recall.
They behave very similarly in how many true positives they find versus how many false positives they produce.
The grid search selected C = 1, which is also the default value, so tuning C does not appear to make much of a difference for this problem.