# Machine Learning: Predict Job Quality Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10257464.svg)](https://doi.org/10.5281/zenodo.10257464)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

In [1]:
#Data Analysis Libraries
import pandas as pd
import numpy as np

#Visualization Library
import matplotlib.pyplot as plt

#Machine Learning Libraries

#Logistic Regression Model
import statsmodels.api as sm 

#Decision Tree Model
from sklearn.tree import DecisionTreeClassifier 
from sklearn.tree import plot_tree #Package used to plot decision tree

#Random Forest Model 
from sklearn.ensemble import RandomForestClassifier 

#Model Evaluation Metrics
from sklearn.metrics import (confusion_matrix,accuracy_score,precision_score,recall_score,mean_squared_error) 

Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> to your username or your own file path.

In [2]:
#Run the notebook in which we store functions
%run C:/Users/YOUR USERNAME/Documents/Functions.ipynb

Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> to your username or your own file path. Also make sure that you have <font color='red'> **run all the code in "05 Machine Learning Data Preparation.ipynb"** </font> and have saved the cleaned data in your own folder.

In [3]:
#Load the data we cleaned in "05 Machine Learning Data Preparation.ipynb"
#Make sure you have run all the code and saved the cleaned data in your own folder
df_comb = pd.read_csv(r"C:\Users\YOUR USERNAME\Documents\ML_dataset.csv")

#We define the label only for students who had positive earnings in year 7 and whose Pell grant status is not null
#So we only need to drop students whose "label_high_earnings" is null
df_comb = df_comb[df_comb['label_high_earnings'].notnull()]

In [7]:
#Get the training data: 2013 and 2014 cohorts
df_training = df_comb[(df_comb['cohort_acadyr'] == 2013) | (df_comb['cohort_acadyr'] == 2014) ]

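#Get the testing data: 2015 cohort
#Use .copy() so that adding the score column later does not trigger a SettingWithCopyWarning
df_testing = df_comb[df_comb['cohort_acadyr'] == 2015].copy()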

#Save the training and testing features (DataFrames) and labels (NumPy arrays) separately
#Columns that are identifiers or labels rather than features
non_feature_cols = ['id', 'cohort_acadyr', 'label_high_earnings', 'label_no_missing_earnings']

X_train = df_training.drop(columns = non_feature_cols)
Y_train = df_training[['label_high_earnings']].values

X_test = df_testing.drop(columns = non_feature_cols)
Y_test = df_testing[['label_high_earnings']].values
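
The cohort-based split above can be sketched on a small toy table. This is only an illustration: the helper name `split_by_cohort` and the toy rows are invented here and are not part of the notebook or of `Functions.ipynb`.

```python
import pandas as pd

def split_by_cohort(df, train_years, test_years, cohort_col="cohort_acadyr"):
    """Return (train, test) copies of df split on cohort year (hypothetical helper)."""
    train = df[df[cohort_col].isin(train_years)].copy()
    test = df[df[cohort_col].isin(test_years)].copy()
    return train, test

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "cohort_acadyr": [2013, 2013, 2014, 2014, 2015, 2015],
    "label_high_earnings": [1, 0, 1, 1, 0, 1],
})

toy_train, toy_test = split_by_cohort(toy, [2013, 2014], [2015])

# No student should appear in both the training and the testing set
overlap = set(toy_train["id"]) & set(toy_test["id"])
print(len(toy_train), len(toy_test), len(overlap))
```

Splitting on cohort year rather than at random means the model is always evaluated on a later cohort than the ones it was trained on, which is closer to how it would be used in practice.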

#### **Checkpoint 1: Evaluate the Logistic Regression Model with a K of 0.7**

Assume that funding increases so that we can now provide interventions to 30% of the students. Calculate the accuracy, precision, and recall at K = 70%. If you need the logistic regression model to predict students with high earnings more accurately, which metric should you use to select the model?

In [8]:
#Define the reference categories (one per categorical variable)
cols_to_drop = ['gender_Female', 'race_group_White', 'first_enroll_other', 'cohort_degree_pursuit_type_Bachelor']

#Drop the reference categories from training and testing features
X_train.drop(columns = cols_to_drop, inplace = True)
X_test.drop(columns = cols_to_drop, inplace = True)
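
A short sketch of why one dummy column per categorical variable is dropped before fitting a logistic regression with a constant. The toy `gender` column below is invented for illustration. With a constant and every dummy column present, the columns are linearly dependent, so the design matrix has fewer independent columns than it has columns.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"gender": ["Female", "Male", "Female", "Male", "Male"]})
dummies = pd.get_dummies(toy["gender"], prefix="gender", dtype=float)

# Constant plus ALL dummy columns: gender_Female + gender_Male equals the constant
full = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])

# Constant plus dummies with the reference category removed
reduced_df = dummies.drop(columns=["gender_Female"])
reduced = np.column_stack([np.ones(len(toy)), reduced_df.to_numpy()])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full:", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

After dropping the reference category, each remaining coefficient is read relative to that category.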

In [9]:
#Fit the logistic regression model
log_reg = sm.Logit(Y_train, sm.add_constant(X_train)).fit()

#See weights (coefficients) and significance
print(log_reg.summary())

Optimization terminated successfully.
         Current function value: 0.676958
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                57822
Model:                          Logit   Df Residuals:                    57809
Method:                           MLE   Df Model:                           12
Date:                Wed, 29 Nov 2023   Pseudo R-squ.:                 0.01997
Time:                        21:39:08   Log-Likelihood:                -39143.
converged:                       True   LL-Null:                       -39941.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                    0.1926      0.0
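
The summary output above is cut off after the intercept, so only `const` can be read from it. As a hedged illustration of how a logit coefficient is interpreted, the sketch below converts that intercept from log-odds to odds and to a probability. It describes a student for whom every feature equals zero, which for the dummy variables means the reference categories. The function name `logit_to_probability` is introduced here for illustration only.

```python
import math

def logit_to_probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

const_coef = 0.1926  # intercept reported in the summary output above

baseline_odds = math.exp(const_coef)
baseline_probability = logit_to_probability(const_coef)

print(round(baseline_odds, 3), round(baseline_probability, 3))
```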

In [10]:
#Predict outcomes of students in the testing data
df_testing.loc[:, 'lr_y_scores'] = log_reg.predict(sm.add_constant(X_test))

#Accuracy
lr_accuracy_70 = accuracy_at_k(df_testing['label_high_earnings'], df_testing['lr_y_scores'], 0.7)
#Precision
lr_precision_70 = precision_at_k(df_testing['label_high_earnings'], df_testing['lr_y_scores'], 0.7)
#Recall
lr_recall_70 = recall_at_k(df_testing['label_high_earnings'], df_testing['lr_y_scores'], 0.7)
#Write results to a DataFrame
lr_metrics_70 = pd.DataFrame([['Accuracy', lr_accuracy_70],
                          ['Precision', lr_precision_70],
                          ['Recall', lr_recall_70]],
                          columns = ['Metric', 'K = 70%'])

#See results
lr_metrics_70



| | Metric | K = 70% |
|---|---|---|
| 0 | Accuracy | 0.572989 |
| 1 | Precision | 0.584168 |
| 2 | Recall | 0.750643 |


<font color='red'> **To predict students with high earnings more accurately, we want to look at precision, which is the share of students predicted to have high earnings who actually have high earnings.**</font>
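
The functions `accuracy_at_k`, `precision_at_k`, and `recall_at_k` are defined in `Functions.ipynb`, which is not shown here. The sketch below is one plausible way to compute such metrics, assuming that K is the share of students, ranked by predicted score, who are labeled as high earners. That assumption is consistent with the random baseline's recall of about 0.80 at K = 80% and 0.90 at K = 90% in Checkpoint 2, but the helper names and toy data are invented for illustration and may differ from the actual implementation.

```python
import numpy as np

def predict_top_k(y_scores, k):
    """Label the top k share of scores as 1 and the rest as 0 (hypothetical helper)."""
    y_scores = np.asarray(y_scores, dtype=float).ravel()
    n_positive = int(round(k * len(y_scores)))
    # Sort from highest to lowest score; a stable sort keeps ties in input order
    order = np.argsort(-y_scores, kind="stable")
    y_pred = np.zeros(len(y_scores), dtype=int)
    y_pred[order[:n_positive]] = 1
    return y_pred

def metrics_at_k_sketch(y_true, y_scores, k):
    """Accuracy, precision, and recall when the top k share is predicted positive."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = predict_top_k(y_scores, k)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }

# Toy labels and scores, invented for illustration only
toy_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

toy_metrics = metrics_at_k_sketch(toy_true, toy_scores, 0.7)
print(toy_metrics)
```

In the toy example, 4 of the 7 students predicted to have high earnings actually do (precision), and 4 of the 5 actual high earners are found (recall).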

#### **Checkpoint 2: Use the Evaluation Metrics to Select A Model**

Assume the funding for the policy intervention can only cover 10% of the students. Use the evaluation metrics in the table below to select the model that allows you to avoid misallocating resources as much as possible.

In [11]:
#Define the list of models we will train
#Set the seed before drawing the random baseline scores so that they are reproducible
np.random.seed(20)
models = { 'Baseline_random': np.random.uniform(0,1,len(Y_test)),
           'Baseline_allones': np.ones(len(Y_test)),
           'RandomForest': RandomForestClassifier(n_estimators = 500, max_depth = 3,n_jobs = 2),
           'LogisticRegression': sm.Logit(Y_train,sm.add_constant(X_train)),
           'DecisionTree': DecisionTreeClassifier(max_depth = 7, min_samples_split = 10)
         }

#Define a list of model names for looping
model_list = ['Baseline_random','Baseline_allones','RandomForest','LogisticRegression','DecisionTree']

In [12]:
#Define an empty DataFrame for storing results
df_results = pd.DataFrame()

#Loop through all the models
for m in model_list:

    #Fit models
    if (m == 'Baseline_random' or m =='Baseline_allones'):
        np.random.seed(20)
        y_scores = models[m]
    elif (m == 'RandomForest' or m == 'DecisionTree'):
        clf = models[m].fit(X_train,Y_train.ravel())
        y_scores = clf.predict_proba(X_test)[:,1]
    elif m == 'LogisticRegression':
        clf = models[m].fit()
        y_scores = clf.predict(sm.add_constant(X_test))

    #Calculate metrics at different thresholds
    #If you need additional metrics and/or thresholds, add them to this list
    a_at_80 = accuracy_at_k(Y_test, y_scores, 0.8) 
    a_at_90 = accuracy_at_k(Y_test, y_scores, 0.9)
    p_at_80 = precision_at_k(Y_test, y_scores, 0.8)
    p_at_90 = precision_at_k(Y_test, y_scores, 0.9)
    r_at_80 = recall_at_k(Y_test, y_scores, 0.8)
    r_at_90 = recall_at_k(Y_test, y_scores, 0.9)

    #Add the results to the df_results DataFrame
    #If you added metrics and/or thresholds above, add them to this list as well
    #pd.concat is the public replacement for the private DataFrame._append method
    df_results = pd.concat([df_results, pd.DataFrame([{
        'Model':m,
        'accuracy_at_80':a_at_80,
        'precision_at_80':p_at_80,
        'recall_at_80':r_at_80,
        'accuracy_at_90':a_at_90,
        'precision_at_90':p_at_90,
        'recall_at_90':r_at_90,
    }])])

#Show results
df_results

Optimization terminated successfully.
         Current function value: 0.676958
         Iterations 4


| | Model | accuracy_at_80 | precision_at_80 | recall_at_80 | accuracy_at_90 | precision_at_90 | recall_at_90 |
|---|---|---|---|---|---|---|---|
| 0 | Baseline_random | 0.527269 | 0.545086 | 0.800352 | 0.536273 | 0.545078 | 0.900379 |
| 0 | Baseline_allones | 0.544871 | 0.544871 | 1.0 | 0.544871 | 0.544871 | 1.0 |
| 0 | RandomForest | 0.565756 | 0.568535 | 0.842137 | 0.555092 | 0.553191 | 0.954016 |
| 0 | LogisticRegression | 0.566642 | 0.56951 | 0.838413 | 0.559373 | 0.557529 | 0.927062 |
| 0 | DecisionTree | 0.565203 | 0.56878 | 0.835297 | 0.553469 | 0.552016 | 0.957673 |


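<font color='red'> **We may want to select the Decision Tree model, as it has the highest recall at K = 90% (0.958) among the trained models. The all-ones baseline has a recall of 1.0 only because it labels every student as a high earner, so it cannot identify anyone for the intervention.**</font>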
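
The selection can also be done programmatically. The sketch below copies the K = 90% precision and recall columns from the results table above into a small DataFrame and picks the best trained model for a given metric, leaving out the two baselines. The helper name `best_trained_model` is introduced here for illustration only.

```python
import pandas as pd

# Values copied from the K = 90% columns of the results table above
results_90 = pd.DataFrame({
    "Model": ["Baseline_random", "Baseline_allones", "RandomForest",
              "LogisticRegression", "DecisionTree"],
    "precision_at_90": [0.545078, 0.544871, 0.553191, 0.557529, 0.552016],
    "recall_at_90": [0.900379, 1.0, 0.954016, 0.927062, 0.957673],
})

def best_trained_model(df, metric):
    """Return the non-baseline model with the highest value of metric (hypothetical helper)."""
    trained = df[~df["Model"].str.startswith("Baseline")]
    return trained.loc[trained[metric].idxmax(), "Model"]

best_by_recall = best_trained_model(results_90, "recall_at_90")
best_by_precision = best_trained_model(results_90, "precision_at_90")

print("Highest recall at 90%:", best_by_recall)
print("Highest precision at 90%:", best_by_precision)
```

The two metrics point to different models here, which is why the choice of metric has to follow from the policy goal before the models are compared.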