# Predicting if Someone has Tried Cocaine
## Model Building

In [1]:
import pandas as pd
import numpy as np
import pickle

# Data Preprocessing

Before we can use our data to train models, we need to do a few things:

1. Select our desired features

2. Apply standard scaling to our numerical variables

3. Dummify/One Hot Encode our categorical variables

We begin by removing the cocever, crkever, and year columns. These columns were only used for EDA or should not be used as features when predicting the target.

In [2]:
df = pd.read_pickle("./pickle/NSDUH_cleaned_dropna_2016-2019.pkl")
df = df.drop(['cocever', 'crkever', 'year'], axis=1)
df

Unnamed: 0,cig30use,iralcfy,irmjfy,irherfy,irmethamyfq,health,irsex,ireduhighst2,catag3,newrace2,wrkdhrswk2,irhhsiz2,irki17_2,irpinc3,coccrkever
0,0.0,5,0,0,0,2.0,1,7,1,1,0.0,1,2,1,0.0
1,0.0,52,364,0,0,1.0,1,8,4,7,40.0,4,3,2,1.0
2,0.0,48,0,0,0,2.0,0,11,3,1,0.0,1,1,1,0.0
4,22.0,6,0,0,0,3.0,0,9,2,1,0.0,4,3,1,0.0
5,0.0,120,0,0,0,3.0,1,9,2,5,0.0,2,1,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282760,0.0,0,0,0,0,1.0,1,8,4,1,0.0,3,1,1,0.0
282762,0.0,0,0,0,0,3.0,1,1,4,7,0.0,3,1,3,0.0
282763,0.0,104,2,0,0,2.0,0,9,2,7,40.0,2,1,3,0.0
282764,0.0,10,0,0,0,2.0,0,11,3,5,26.0,2,1,2,0.0


Now, we must separate our numerical and categorical variables. Numerical variables will be adjusted per column by StandardScaler(), which rescales each column so that its mean is 0 and its standard deviation is 1. Putting the numerical variables on a common scale generally helps scale-sensitive models such as logistic regression and linear SVC; tree-based models like random forests are largely unaffected by it.
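
To make the transformation concrete, here is a minimal pure-Python sketch of the per-column arithmetic that StandardScaler() performs: subtract the column mean, then divide by the population standard deviation. The sample values and the helper name `standardize` are made up for illustration and are not part of the notebook.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), which is what StandardScaler uses
    return [(x - mean) / std for x in values]

# Hypothetical values for one numerical column, only for illustration
sample = [5, 52, 48, 6, 120]
scaled = standardize(sample)

# After scaling, the column has mean ~0 and standard deviation ~1
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

The relative ordering and spacing of the values is unchanged; only the units are, which is why the step matters mostly for models that compare features by magnitude.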

As for categorical variables, each unique value in a categorical column must be given its own separate, binary column indicating if that observation fits that unique value or not. We do this because keeping them in one column implies some kind of order. Something like race (newrace2 column) makes no sense to order, and is therefore a candidate to be separated into different columns (dummified). 
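
As a rough illustration of dummifying, the sketch below encodes a small list of category codes into 0/1 indicator columns and drops the first category, mirroring the idea behind OneHotEncoder(drop='first') used later in this notebook. The helper `one_hot_drop_first` and the sample codes are hypothetical.

```python
def one_hot_drop_first(values):
    """Encode values as 0/1 indicator columns, dropping the first category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zeros baseline
    rows = [[1 if v == cat else 0 for cat in kept] for v in values]
    return kept, rows

# Hypothetical category codes, only for illustration
race_codes = [1, 7, 1, 5, 2]
kept, rows = one_hot_drop_first(race_codes)

print(kept)  # categories that received their own column
for row in rows:
    print(row)
```

Dropping one category avoids a redundant column: an observation with zeros everywhere must belong to the dropped category.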

In [3]:
# Continuous variables
num_cols = [
    "iralcfy",
    "catag3",
    "health",
    "ireduhighst2",
    "irpinc3",
    "irki17_2",
    "irmjfy",
    "wrkdhrswk2",
    'irhhsiz2',
    'cig30use',
    'irherfy',
    'irmethamyfq'
]

# Categorical variables. irsex is already binary, so encoding it with drop='first' leaves a single column
cat_cols = [
    "newrace2",
    "irsex"
]

# Data Preprocessing with Pipeline

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [5]:
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

# Model Building

Now that our data is properly processed, we can test several models across different hyperparameters using GridSearchCV. We will be testing the following models:

1. RandomForestClassifier()

2. LogisticRegression()

3. svm.LinearSVC()

*We will only test a narrow range of C values for our LinearSVC due to the length of time required to train.

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import train_test_split

In [30]:
# Define feature and target columns
features = num_cols+cat_cols
target = "coccrkever"

# Standard naming conventions for feature/target datasets
X = df[features]
y = df[target]

In [8]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV

In [9]:
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

# Parameter grid for GridSearchCV
model_grid = {
    'random_forest': {
        'model':RandomForestClassifier(random_state=15, n_jobs=3, n_estimators=500),
        'params': {
            'estimator__max_depth': [11, 12, 13, 14],
            'estimator__criterion':['gini', 'entropy']
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(random_state=15, n_jobs=3),
        'params': {
            'estimator__C': [0.085, 0.09, 0.092],
            'estimator__solver':['lbfgs', 'liblinear'],
        }
    },
    'svm': {
        'model': svm.LinearSVC(random_state=15, max_iter=100000),
        'params': {
            'estimator__C':[0.52, 0.55, 0.6, 0.65]
        }
    }
}
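
To get a feel for the size of the search, the sketch below expands the random forest entry of the grid into its individual candidates using only the standard library. The `estimator__` prefix is how scikit-learn routes a parameter to the pipeline step named 'estimator'. The fold count of 4 is taken from the GridSearchCV call in the next cell; the variable names here are illustrative.

```python
from itertools import product

# Copied from the random forest entry of model_grid
rf_params = {
    "estimator__max_depth": [11, 12, 13, 14],
    "estimator__criterion": ["gini", "entropy"],
}

# Every combination of values is one candidate configuration
names = list(rf_params)
candidates = [dict(zip(names, combo)) for combo in product(*rf_params.values())]

cv_folds = 4  # matches cv=4 in the GridSearchCV call
total_fits = len(candidates) * cv_folds  # excludes the final refit on all data

print(len(candidates), total_fits)  # prints: 8 32
```

With 500 trees per forest, each of those fits is expensive, which is why the grids are kept small.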

In [10]:
# List to hold model scores
scores = []

for model_name, model_params in model_grid.items():
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('estimator', model_params['model'])
    ])

    model = GridSearchCV(estimator=pipe, param_grid=model_params['params'], cv=4, return_train_score=False, refit=True)
    model.fit(X, y)
    scores.append({
        'model': model_name,
        'best_score:': model.best_score_,
        'best_params': model.best_params_
    })



In [11]:
# Store scores as a DataFrame, then display the raw list of results
scores_df = pd.DataFrame(scores)
scores

[{'model': 'random_forest',
  'best_score:': 0.8858342496707732,
  'best_params': {'estimator__criterion': 'entropy',
   'estimator__max_depth': 14}},
 {'model': 'logistic_regression',
  'best_score:': 0.8781065916890805,
  'best_params': {'estimator__C': 0.085, 'estimator__solver': 'liblinear'}},
 {'model': 'svm',
  'best_score:': 0.877463356611295,
  'best_params': {'estimator__C': 0.52}}]

# Choosing a Model

Although our random forest model has the highest accuracy, the accuracies are very similar. Let's train a model using the best hyperparameters of each, then look at the classification report of each model to gain better insight into the performance of each model.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.25, random_state=12)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
    ])

X_train = pipe.fit_transform(X_train)

In [33]:
# Train a model for each model type using our best hyperparameters. This is so we can analyze each
# in a classification report

rf = RandomForestClassifier(random_state=15, n_jobs=3, n_estimators=500, max_depth=14, criterion='entropy')
rf.fit(X_train, y_train)

lg = LogisticRegression(random_state=15, solver='liblinear',C=0.085)
lg.fit(X_train, y_train)

lsvc = svm.LinearSVC(random_state=15, max_iter=100000, C=0.52)
lsvc.fit(X_train, y_train)

LinearSVC(C=0.52, max_iter=100000, random_state=15)

In [34]:
X_test = pipe.transform(X_test)

rf_predict = rf.predict(X_test)
lg_predict = lg.predict(X_test)
lsvc_predict = lsvc.predict(X_test)

In [35]:
output_cols = [
    "iralcfy",
    "catag3",
    "health",
    "ireduhighst2",
    "irpinc3",
    "irki17_2",
    "irmjfy",
    "wrkdhrswk2",
    'irhhsiz2',
    'cig30use',
    'irherfy',
    'irmethamyfq',
    'newrace2_2',
    'newrace2_3',
    'newrace2_4',
    'newrace2_5',
    'newrace2_6',
    'newrace2_7',
    "irsex"
]

X_test = pd.DataFrame(X_test, columns=output_cols)
X_test

Unnamed: 0,iralcfy,catag3,health,ireduhighst2,irpinc3,irki17_2,irmjfy,wrkdhrswk2,irhhsiz2,cig30use,irherfy,irmethamyfq,newrace2_2,newrace2_3,newrace2_4,newrace2_5,newrace2_6,newrace2_7,irsex
0,-0.661241,-1.516697,-1.287575,-1.658407,-0.913660,1.001951,-0.333690,-1.034050,0.491784,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-0.042358,0.725999,-0.262482,1.209393,1.500373,1.001951,-0.321278,1.176180,0.491784,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.576525,1.473564,2.812798,1.209393,-0.430854,-0.874318,-0.333690,-1.034050,-0.908570,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.042358,0.725999,1.787705,0.253460,-0.430854,1.940085,3.948377,-1.034050,0.491784,-0.428314,-0.046153,-0.060157,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.232784,-0.769131,0.762612,-0.224507,-0.913660,-0.874318,-0.333690,-0.641120,-0.208393,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56740,-0.089965,-0.021566,1.787705,1.209393,1.500373,-0.874318,-0.333690,0.930599,-1.608748,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,1.0,0.0,0.0,0.0
56741,3.052056,-0.769131,-1.287575,0.253460,-0.913660,-0.874318,-0.184748,-0.739353,-0.208393,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56742,-0.661241,1.473564,-1.287575,-0.224507,-0.913660,-0.874318,-0.333690,-1.034050,-0.908570,-0.428314,-0.046153,-0.060157,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56743,-0.232784,-0.769131,0.762612,1.209393,0.534760,-0.874318,-0.333690,0.930599,-1.608748,-0.125412,-0.046153,-0.060157,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [17]:
from sklearn.metrics import classification_report, accuracy_score

In [36]:
print("Random Forest Score: %f\nLogistic Regression Score: %f\nLinear SVC Score: %f\n" %(accuracy_score(y_test, rf_predict), accuracy_score(y_test, lg_predict), accuracy_score(y_test, lsvc_predict)))
print("Random Forest:\n", classification_report(y_test, rf_predict))
print("Logistic Regression:\n", classification_report(y_test, lg_predict))
print("Linear SVC:\n", classification_report(y_test, lsvc_predict))

Random Forest Score: 0.885276
Logistic Regression Score: 0.877187
Linear SVC Score: 0.876764

Random Forest:
               precision    recall  f1-score   support

         0.0       0.90      0.98      0.94     49154
         1.0       0.68      0.27      0.39      7591

    accuracy                           0.89     56745
   macro avg       0.79      0.63      0.66     56745
weighted avg       0.87      0.89      0.86     56745

Logistic Regression:
               precision    recall  f1-score   support

         0.0       0.89      0.98      0.93     49154
         1.0       0.63      0.20      0.31      7591

    accuracy                           0.88     56745
   macro avg       0.76      0.59      0.62     56745
weighted avg       0.85      0.88      0.85     56745

Linear SVC:
               precision    recall  f1-score   support

         0.0       0.88      0.99      0.93     49154
         1.0       0.66      0.17      0.26      7591

    accuracy                         

## Accuracy vs Precision vs Recall

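Although the random forest performed the best in terms of total accuracy, accuracy alone hides how each model treats the two classes. In the reports above, precision for the positive class (1.0) is 0.68 for the random forest, 0.66 for the linear SVC, and 0.63 for logistic regression, while recall for that class ranges from only 0.17 to 0.27. Recall the differences between accuracy, precision, and recall: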

1. **Accuracy**: Proportion of correct predictions from total observations

2. **Precision**: For a given class, the proportion of correct predictions from total predictions

3. **Recall**: For a given class, proportion of correct predictions from the total number of true observations for that class
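
The three definitions above can be sketched directly from confusion-matrix counts. The counts below are hypothetical and chosen only to resemble an imbalanced problem like ours; `classification_metrics` is an illustrative helper, not part of the notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, plus precision and recall for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything truly positive, how much was found
    return accuracy, precision, recall

# Hypothetical counts: 100 true positives in the data, 900 true negatives
accuracy, precision, recall = classification_metrics(tp=20, fp=10, fn=80, tn=890)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.91 precision=0.67 recall=0.20
```

Note how accuracy stays high even though most true positives are missed, because the negative class dominates the total.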

Our models have low recall for the positive class. That means we miss a large number of people who have actually used cocaine. However, positive-class precision is considerably higher than recall (0.63 to 0.68 in the reports above). This means that when we *do* predict that someone has used cocaine, we are correct roughly two times out of three. This is important to consider. If your goal is to either help people using cocaine or prevent people from becoming addicted to cocaine, it would be very bad to wrongly approach someone believing they've tried cocaine when they actually have not. **To prevent false positives, we will choose our Linear SVC model, whose positive-class precision (0.66) is close to the highest we measured (0.68).**

# Saving our Data
Although we've decided on using the linear SVC model, we will save all the models regardless, just in case we want them in the future.

In [37]:
# Pickle models for later
for model, name in zip([lg, lsvc], ["logreg_model", "lsvc_model"]):
    with open("model/" + name + ".pickle", 'wb') as f:
        pickle.dump(model, f)

In [18]:
import gzip, pickletools

# The output of a regular pickle.dump for our random forest is quite large,
# we can compress it using gzip
with gzip.open("model/randforest_model.pickle", "wb") as f:
    pickled = pickle.dumps(rf)
    optimized_pickle = pickletools.optimize(pickled)
    f.write(optimized_pickle)

"""Code for loading from a gzipped pickle file"""
# with gzip.open("model/randforest_model.pickle", 'rb') as f:
#     p = pickle.Unpickler(f)
#     rf = p.load()

'Code for loading from a gzipped pickle file'
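
As a self-contained check of the save and load steps, the sketch below round-trips an object through a gzipped, optimized pickle. It uses a plain dictionary as a stand-in for the model and a temporary directory with a made-up file name, so none of the paths here are the notebook's.

```python
import gzip
import os
import pickle
import pickletools
import tempfile

# Stand-in for a fitted model; any picklable object works the same way
model_stand_in = {"name": "example", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pickle.gz")  # hypothetical file name

    # Save: pickle to bytes, drop unused PUT opcodes, then gzip-compress
    with gzip.open(path, "wb") as f:
        f.write(pickletools.optimize(pickle.dumps(model_stand_in)))

    # Load: gzip.open decompresses transparently, so pickle.load can read it directly
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)  # prints: True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.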

Next, we need to save our fitted pipeline to transform future data.

In [39]:
# Pickle the fitted preprocessing pipeline for later
with open("model/pipeline.pickle", 'wb') as f:
    pickle.dump(pipe, f)

Finally, let's save our columns as a JSON file for future reference.

In [40]:
import json
column_info = {
    'data_columns': num_cols + cat_cols
}
col_desc = pd.read_csv('model/col_desc.txt', header=0, sep='\t')
for row in range(col_desc.shape[0]):
    column_info[col_desc.iloc[row, 0]] = col_desc.iloc[row, 1]

with open("model/data_columns.json", "w") as f:
    f.write(json.dumps(column_info))

Now, we can move on to creating a server that makes our model easy to interact with.