<p>
<center>
<font size="4">
Predicting Depression - Preliminary Models (Imbalanced Data)
</font>
</center>
</p>



In [1]:
path = ('/Users/carolinesklaver/Desktop/Capstone/NHANES/data/csv_data/')

import os
os.chdir(path)

In [2]:
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
import numpy as np

# Data Preprocessing

In [4]:
# Importing the data

df_raw = pd.read_csv('df_raw_v2.csv')

# bring year and target col to the beginning of df
year = df_raw.pop('year')
df_raw.insert(1, 'year', year)

dep = df_raw.pop('depressed')
df_raw.insert(2, 'depressed', dep)



# drop marijuana use
df_raw.drop(['used_marijuana'],axis=1, inplace=True)
# help!
df_raw.drop(['year'],axis=1, inplace=True)

df_raw.drop(['SEQN'],axis=1, inplace=True)

In [5]:
#continuous features
cont = ['#_ppl_household', 'age', 'triglyceride','caffeine', 'lifetime_partners',
       'glycohemoglobin', 'CRP', 'tot_cholesterol','systolic_BP','diastolic_BP', 'BMI', 'waist_C', '#meals_fast_food',
       'min_sedetary', 'bone_mineral_density']

# categorical features
cat = ['race_ethnicity', 'edu_level', 'gender', 'marital_status', 'annual_HI',
       'doc_diabetes', 'how_healthy_diet', 'used_CMH',
       'health_insurance', 'doc_asthma', 'doc_overweight', 'doc_arthritis',
       'doc_CHF', 'doc_CHD', 'doc_heart_attack', 'doc_stroke',
       'doc_chronic_bronchitis', 'doc_liver_condition', 'doc_thyroid_problem',
       'doc_cancer', 'difficult_seeing', 'doc_kidney', 'broken_hip',
       'doc_osteoporosis', 'vigorous_activity', 'moderate_activity',
       'doc_sleeping_disorder', 'smoker', 'sexual_orientation',
       'alcoholic','herpes_2', 'HIV', 'doc_HPV','difficult_hearing', 'doc_COPD']

# target binary feature
target = 'depressed'

# multi-class features
cat_encode = ['race_ethnicity', 'edu_level', 'gender', 'marital_status', 'annual_HI','how_healthy_diet',
              'sexual_orientation']


In [6]:
def nan_helper(df):
    """
    The NaN helper

    Parameters
    ----------
    df : dataframe
    
    Returns
    ----------
    The dataframe of variables with NaN (index), 
    raw number missing, and their proportion
    
    """
    
    
    # get the raw number of missing values & sort
    missing = df.isnull().sum().sort_values(ascending=True)
    
    # get the proportion of missing values (%)
    proportion = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=True)
    
    # create table of missing data
    nan_data = pd.concat([missing, proportion], axis=1, keys=['missing', 'proportion'])
    
    return nan_data


def missing_values(df, threshold_col, threshold_row, impute_type):
    """
    Handle Missing Values

    Parameters
    ----------
    df : dataframe
    threshold_col: the proportion of missing values at which  to drop whole column
    threshold_row: the proportion of missing values at which to drop rows
    impute_type: mean or median imputation for continuous variables
    
    Returns
    ----------
    The dataframe without missing values
    
    """
    
    # Dropping Cols and Rows
    # call NaN helper function
    df_nan = nan_helper(df)
        
    # drop columns with higher proportion missing than threshold col
    df = df.drop((df_nan[df_nan['proportion'] > threshold_col]).index,1)
    
    # drop rows with higher proportion missing than threshold row
    df_nan_2 = df_nan[df_nan['proportion']>threshold_row]
    df = df.dropna(subset=np.intersect1d(df_nan_2.index, df.columns),
                           inplace=False)
    

    
    # Imputing values
    # Impute continuous variables with mean 
    if impute_type == 'mean':
        for col in cont:
            if col in df.columns:
                df[col].fillna(df[col].mean(), inplace=True)
    # Impute continuous variables with median
    elif impute_type == 'median':
        for col in cont:
            if col in df.columns:
                df[col].fillna(df[col].median(), inplace=True)
    
    
    # Impute categorical variables with most frequent/mode
    for col in cat:
        if col in df.columns:
            df[col].fillna(df[col].value_counts().index[0], inplace=True)
    

    return df


df_mean = missing_values(df_raw, 0.65, 0.65, "mean")
df_median = missing_values(df_raw, 0.65, 0.65, "median")



In [7]:
nan_data = nan_helper(df_raw)
nan_data.head()

Unnamed: 0,missing,proportion
depressed,0,0.0
race_ethnicity,0,0.0
#_ppl_household,0,0.0
age,0,0.0
gender,0,0.0


# Running Preliminary Models Compare Imputation Methods

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#from sklearn.ensemble import HistGradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
# now you can import normally from ensemble
#from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

## Function to compare different types of imputation and their results

In [9]:
# read in the knn and mlp imputed data so we do not have to run the function every time
# code for knn and mlp imputation can be found at https://github.com/csklaver/Capstone-Group6/tree/master/Code
knn_df = pd.read_csv('df_progressive_knn.csv')
knn_df.drop(['SEQN'],axis=1,inplace=True)
knn_df.drop(['year'],axis=1,inplace=True)

mlp_df = pd.read_csv('df_progressive_mlp.csv')
mlp_df.drop(['SEQN'],axis=1,inplace=True)
mlp_df.drop(['year'],axis=1,inplace=True)
mlp_df.head()

Unnamed: 0,depressed,race_ethnicity,edu_level,#_ppl_household,age,gender,marital_status,annual_HI,caffeine,doc_diabetes,...,systolic_BP,diastolic_BP,BMI,waist_C,#meals_fast_food,min_sedetary,doc_HPV,bone_mineral_density,difficult_hearing,doc_COPD
0,0.0,4.0,4.0,4.0,44.0,2.0,1.0,11.0,13.0,0.0,...,144.0,74.0,30.9,96.0,2.093681,398.557696,0.0,0.845891,0.0,0.0
1,0.0,3.0,5.0,2.0,70.0,1.0,1.0,11.0,260.0,1.0,...,138.0,60.0,24.74,96.5,2.093681,384.781692,0.0,0.845891,0.0,0.0
2,0.0,3.0,3.0,2.0,73.0,1.0,1.0,6.0,142.0,0.0,...,130.0,68.0,30.63,117.1,2.093681,382.287784,0.0,0.845891,0.0,0.0
3,0.0,2.0,4.0,3.0,18.0,2.0,5.0,11.0,5.397605e-79,0.0,...,110.0,64.0,29.45,84.0,2.093681,387.8057,0.0,0.845891,0.0,0.0
4,0.0,3.0,4.0,3.0,19.0,1.0,5.0,11.0,5.397605e-79,0.0,...,108.0,62.0,22.57,84.2,2.093681,409.963013,0.0,0.845891,0.0,0.0


In [10]:

def impute_data(df_cleaned, impute_strategy=None, cols_to_standardize=None):
    """
    Impute Data

    Parameters
    ----------
    df_cleaned : dataframe without identifiers
    impute_strategy: mean, median, or progressive_knn/mlp imputation
    cols_to_standardize: continous variables
    
    Returns
    ----------
    The dataframe without missing values from chosen imputation method
    
    """
    
    
    df = df_cleaned.copy()
    if impute_strategy == 'mean':
        df = missing_values(df, 0.75, 0.75, 'mean')
    elif impute_strategy == 'median':
        df = missing_values(df, 0.75, 0.75, 'mean')
    elif impute_strategy == 'progressive_knn':
        df = knn_df
    elif impute_strategy == 'progressive_mlp':
        df = mlp_df
    else:
        arr = SimpleImputer(missing_values=np.nan,strategy=impute_strategy).fit(
          df.values).transform(df.values)
        df = pd.DataFrame(data=arr, index=df.index.values, columns=df.columns.values)
    
    if cols_to_standardize != None:
        cols_to_standardize = list(set(cols_to_standardize) & set(df.columns.values))
        df[cols_to_standardize] = df[cols_to_standardize].astype('float')
        df[cols_to_standardize] = pd.DataFrame(data=MinMaxScaler().fit(
        df[cols_to_standardize]).transform(df[cols_to_standardize]), 
                                             index=df[cols_to_standardize].index.values,
                                             columns=df[cols_to_standardize].columns.values)
    
    return df


In [44]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from timeit import default_timer as timer
from sklearn.preprocessing import MinMaxScaler

# function for handling missing values 
# and fitting logistic regression on clean data
def log_reg(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42):
    """
    Logistic Regression

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: f1-score on training and testing set
    reports time elapsed
    
    """
    

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # prepare tensors
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    lg = LogisticRegression().fit(X_train, y_train)
    
    # model evaluation
    y_pred = lg.predict(X_test)
    train_score = f1_score(y_train, lg.predict(X_train))
    
    test_score = f1_score(y_test, y_pred)
    print(confusion_matrix(y_test, y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': lg,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
lg_results = []

# prepare data
df = df_raw
cols_to_standardize = cont

# fit logistic regression for each imputation strategy
# with and without standardizing features
for impute_strategy in ['mean', 'median', 'progressive_knn','progressive_mlp']: 
    result = log_reg(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    lg_results.append(result)

# display logistic regression performance
lg_results_df = pd.DataFrame(lg_results)
lg_results_df.drop(['model'], axis=1).drop_duplicates()

[[9510   71]
 [ 704   63]]
[[9510   71]
 [ 704   63]]
[[9517   64]
 [ 703   64]]
[[9506   75]
 [ 695   72]]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.151005,0.139845
1,median,0.151005,0.139845
2,progressive_knn,0.154264,0.143017
3,progressive_mlp,0.178133,0.157549


In [23]:
# function for handling missing values 
# and fitting random forest on clean data
def decision_tree(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42): 
    """
    Decision Tree

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: F1-score on training and testing set
    
    """

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # feature matrix
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    dt = DecisionTreeClassifier(class_weight='balanced', random_state=42).fit(
        X_train, y_train)
    
    # model evaluation
    y_pred = rf.predict(X_test)
    train_score = f1_score(y_train, dt.predict(X_train))
    test_score = f1_score(y_test, dt.predict(X_test))
    print(confusion_matrix(y_test, y_pred))
    print(f1_score(y_test, y_pred))
    print(np.unique(y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': dt,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
dt_results = []

# prepare data
df = df_raw.copy()
cols_to_standardize = cont

# fit logistic regression for each imputation strategy
# with and without standardizing features
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']:  
    result = random_forest(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    dt_results.append(result)

# display random forest regression performance
dt_results_df = pd.DataFrame(dt_results)
dt_results_df.drop(['model'], axis=1).drop_duplicates()

[[9576    5]
 [ 761    6]]
0.015424164524421595
[0. 1.]
[[9576    5]
 [ 761    6]]
0.015424164524421595
[0. 1.]
[[9575    6]
 [ 764    3]]
0.007731958762886597
[0. 1.]
[[9577    4]
 [ 762    5]]
0.012886597938144331
[0. 1.]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.999368,0.015424
1,median,0.999368,0.015424
2,progressive_knn,0.999684,0.007732
3,progressive_mlp,0.999684,0.012887


In [45]:
# function for handling missing values 
# and fitting random forest on clean data
def random_forest(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42): 
    """
    Random Forest

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: F1-score on training and testing set
    
    """

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # feature matrix
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    rf = RandomForestClassifier(class_weight='balanced', random_state=42).fit(
        X_train, y_train)
    
    # model evaluation
    y_pred = rf.predict(X_test)
    train_score = f1_score(y_train, rf.predict(X_train))
    test_score = f1_score(y_test, rf.predict(X_test))
    print(confusion_matrix(y_test, y_pred))
    print(f1_score(y_test, y_pred))
    print(np.unique(y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': rf,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
rf_results = []

# prepare data
df = df_raw.copy()
cols_to_standardize = cont

# fit logistic regression for each imputation strategy
# with and without standardizing features
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']:  
    result = random_forest(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    rf_results.append(result)

# display random forest regression performance
rf_results_df = pd.DataFrame(rf_results)
rf_results_df.drop(['model'], axis=1).drop_duplicates()

[[9576    5]
 [ 761    6]]
0.015424164524421595
[0. 1.]
[[9576    5]
 [ 761    6]]
0.015424164524421595
[0. 1.]
[[9575    6]
 [ 764    3]]
0.007731958762886597
[0. 1.]
[[9577    4]
 [ 762    5]]
0.012886597938144331
[0. 1.]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.999368,0.015424
1,median,0.999368,0.015424
2,progressive_knn,0.999684,0.007732
3,progressive_mlp,0.999684,0.012887


In [25]:
# function for handling missing values 
# and fitting random forest on clean data
def xgboost(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42): 
    """
    XGBoost

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: F1-score on training and testing set
    
    """

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # feature matrix
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    xgbc = XGBClassifier(random_state=42).fit(
        X_train, y_train)
    
    # model evaluation
    y_pred = xgbc.predict(X_test)
    train_score = f1_score(y_train, xgbc.predict(X_train))
    test_score = f1_score(y_test, y_pred)
    print(confusion_matrix(y_test, y_pred))
    print(f1_score(y_test, y_pred))
    print(np.unique(y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': xgbc,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
xgbc_results = []

# prepare data
df = df_raw.copy()
cols_to_standardize = cont

# fit logistic regression for each imputation strategy
# with and without standardizing features
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']:  
    result = xgboost(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    xgbc_results.append(result)

# display random forest regression performance
xgbc_results_df = pd.DataFrame(xgbc_results)
xgbc_results_df.drop(['model'], axis=1).drop_duplicates()

[[9466  115]
 [ 678   89]]
0.18331616889804328
[0. 1.]
[[9466  115]
 [ 678   89]]
0.18331616889804328
[0. 1.]
[[9480  101]
 [ 690   77]]
0.16296296296296295
[0. 1.]
[[9469  112]
 [ 678   89]]
0.18388429752066118
[0. 1.]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.799547,0.183316
1,median,0.799547,0.183316
2,progressive_knn,0.83755,0.162963
3,progressive_mlp,0.842836,0.183884


In [21]:
# function for handling missing values 
# and fitting knn on clean data
def knn_model(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42):
    """
    K-Nearest Neighbors

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: F1-score on training and testing set
    reports time elapsed
    
    """
    
    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    
    # prepare tensors
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    knn = KNeighborsClassifier(n_neighbors=3, p=2, metric='minkowski').fit(
        X_train, y_train)
    
    # model evaluation
    y_pred = knn.predict(X_test)
    train_score = f1_score(y_train, knn.predict(X_train))
    test_score = f1_score(y_test, y_pred)
    print(confusion_matrix(y_test, y_pred))

    return {
        'imputation strategy': impute_strategy,
        'model': knn,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
knn_results = []

# prepare data
df = df_raw
cols_to_standardize = cont

# fit logistic regression for each imputation strategy
# with and without standardizing features
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']: 
    result = knn_model(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    knn_results.append(result)

# display knn performance
knn_results_df = pd.DataFrame(knn_results)
knn_results_df.drop(['model'], axis=1).drop_duplicates()

[[9489   92]
 [ 724   43]]
[[9489   92]
 [ 724   43]]
[[9459  122]
 [ 723   44]]
[[9481  100]
 [ 724   43]]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.342289,0.095344
1,median,0.342289,0.095344
2,progressive_knn,0.35119,0.094319
3,progressive_mlp,0.359961,0.094505


In [22]:

def NB_model(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42):
    """
    Naive Bayes

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: F1-score on training and testing set
    reports time elapsed
    
    """

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # feature matrix
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    nbc = GaussianNB().fit(
        X_train, y_train)
    
    # model evaluation
    y_pred = nbc.predict(X_test)
    train_score = f1_score(y_train, nbc.predict(X_train))
    test_score = f1_score(y_test, y_pred)
    print(confusion_matrix(y_test, y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': nbc,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
nbc_results = []

# prepare data
df = df_raw
cols_to_standardize = cont

# fit nb for each imputation strategy
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']:  
    result = NB_model(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    nbc_results.append(result)

# display nb performance
nbc_results_df = pd.DataFrame(nbc_results)
nbc_results_df.drop(['model'], axis=1).drop_duplicates()

[[7945 1636]
 [ 428  339]]
[[7945 1636]
 [ 428  339]]
[[7944 1637]
 [ 430  337]]
[[7958 1623]
 [ 431  336]]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.268444,0.247265
1,median,0.268444,0.247265
2,progressive_knn,0.265976,0.245896
3,progressive_mlp,0.267815,0.246515


In [26]:
from sklearn.linear_model import Perceptron

def ppn_model(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.33,
                        random_state=42):
    """
    Simple Perceptron Model

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: Accuracy on training and testing set
    reports time elapsed
    
    """

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)

    
    # feature matrix
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # model training
    ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0).fit(X_train, y_train)
    
    # model evaluation
    train_score = f1_score(y_train, ppn.predict(X_train))
    test_score = f1_score(y_test, ppn.predict(X_test))
    y_pred = ppn.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    
    return {
        'imputation strategy': impute_strategy,
        'model': ppn,
        'train score': train_score,
        'test score': test_score,
    }
  
# list to store models' performance  
ppn_results = []

# prepare data
df = df_raw
cols_to_standardize = cont

# fit mlp for each imputation strategy
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']:   
    result = ppn_model(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    ppn_results.append(result)

# display mlp performance
ppn_results_df = pd.DataFrame(ppn_results)
ppn_results_df.drop(['model'], axis=1).drop_duplicates()

[[9512   69]
 [ 715   52]]
[[9512   69]
 [ 715   52]]
[[9569   12]
 [ 755   12]]
[[8662  919]
 [ 490  277]]


Unnamed: 0,imputation strategy,train score,test score
0,mean,0.134287,0.117117
1,median,0.134287,0.117117
2,progressive_knn,0.030864,0.030341
3,progressive_mlp,0.328172,0.282221


In [27]:
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

Using TensorFlow backend.


In [35]:
from sklearn.linear_model import Perceptron

def keras_model(data, impute_strategy=None,
                        cols_to_standardize=None,
                        test_size=0.2,
                        random_state=42):
    """
    Keras MLP

    Parameters
    ----------
    data: dataframe
    impute_strategy: call impute_data() function for mean, median, or progressive_knn imputation
    cols_to_standardize: continous variables
    test_size: train-test split proportion
    
    Returns
    ----------
    prints confusion matrix
    train_score, test_score: Accuracy on training and testing set
    reports time elapsed
    
    """

    batch_size = 128
    epochs = 10

    df_imputed = impute_data(data, impute_strategy, cols_to_standardize)
    train_data, test_data = train_test_split(df_imputed, test_size=test_size,
                                             random_state=random_state, shuffle=True)
    
    # note which predictor columns were dropped or kept
    kept_columns = df_imputed.columns.difference(['depressed'])
    
    # prepare tensors
    X_train = train_data.drop(columns=['depressed'])
    y_train = train_data['depressed']
    X_test = test_data.drop(columns=['depressed'])
    y_test = test_data['depressed']
    
    # convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, 2)
    y_test = keras.utils.to_categorical(y_test, 2)

    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(len(kept_columns),)))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy', keras.metrics.Precision(name='precision'),
                                        keras.metrics.Recall(name='recall')])

    history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.2)
    
    score = model.evaluate(X_test, y_test, verbose=0)
    
    y_pred = model.predict(X_test)
    # model evaulation
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
    print('Test precision:',score[2])
    print('Test recall:',score[3])
    f1 = f1_score(y_test.argmax(axis=1), y_pred.argmax(axis=1))
    print('F1-score', f1)

    
    return {
        'imputation strategy': impute_strategy,
        'standardized': cols_to_standardize!=None,
        'model': keras_model,
        'Test loss' :  score[0],
        'Test accuracy' : score[1],
        'Test F1-score': f1
#         'Test precision': score[2],
#         'Test recall': score[3]
        
    }
  
# list to store models' performance  
keras_results = []

# prepare data
df = df_raw
cols_to_standardize = cont

# fit keras mlp for each imputation strategy
for impute_strategy in ['mean', 'median', 'progressive_knn', 'progressive_mlp']: 
    result = keras_model(df, impute_strategy=impute_strategy, cols_to_standardize=cont)
    keras_results.append(result)

# display keras mlp performance
keras_results_df = pd.DataFrame(keras_results)
keras_results_df.drop(['model'], axis=1).drop_duplicates()

Train on 20068 samples, validate on 5017 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.23966926142421305
Test accuracy: 0.9188456535339355
Test precision: 0.9188456535339355
Test recall: 0.9188456535339355
F1-score 0.23458646616541348
Train on 20068 samples, validate on 5017 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.23124232487182836
Test accuracy: 0.9226722121238708
Test precision: 0.9226722121238708
Test recall: 0.9226722121238708
F1-score 0.08317580340264649
Train on 20068 samples, validate on 5017 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.23598425474245938
Test accuracy: 0.9239476919174194
Test precision: 0.9239476919174194
Test recall: 0.9239476919174194
F1-score 0.1675392670157068
Train on 20068 samples, validate on 50

Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.23311173714393255
Test accuracy: 0.9234693646430969
Test precision: 0.9234693646430969
Test recall: 0.9234693646430969
F1-score 0.14893617021276598


Unnamed: 0,imputation strategy,standardized,Test loss,Test accuracy,Test F1-score
0,mean,True,0.239669,0.918846,0.234586
1,median,True,0.231242,0.922672,0.083176
2,progressive_knn,True,0.235984,0.923948,0.167539
3,progressive_mlp,True,0.233112,0.923469,0.148936


# Conclusions

With an imbalanced target variable of depression (7.28% positive), these models are exposing the issue with imbalanced data. Accuracy for training and testing is high, but F1-score is extremely low. The simple keras neural network is not robust enough to detect the positive depression instances. Next, build a deeper model as well as explore resampling techniques and k-fold cross-validation to get an accurate model.