## CSC 8515 - Machine Learning Project  
**Topic:** Predicting success in rehabilitation  
**Author:** James Fung

### Checkpoint 2 Progress:

For this checkpoint, I began handling missing values with imputation. So far the main method is mode imputation, which I applied to every feature with less than 10% of its values missing. I have not yet decided how to handle features with more than 10% missing.

Manual feature selection was also performed using the codebook provided by the CDC. Many features were nearly uniform (99% in one category, 1% in another), so I dropped them manually because they carried almost no useful information. This reduced the dataset from the original 65 features to 45.
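The near-uniform check described above could also be automated. The following is only a minimal sketch on a toy frame: the helper name `near_constant_columns`, the column names `FLAG_A`/`FLAG_B`, and the 0.99 cut-off are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers at least
    `threshold` of the non-missing rows (hypothetical helper)."""
    flagged = []
    for col in df.columns:
        # value_counts sorts descending, so the first entry is the dominant share.
        shares = df[col].value_counts(normalize=True, dropna=True)
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

# Toy frame: FLAG_A is 99.5% one value, FLAG_B is evenly split.
toy = pd.DataFrame({
    "FLAG_A": ["NO"] * 199 + ["YES"],
    "FLAG_B": ["NO"] * 100 + ["YES"] * 100,
})
print(near_constant_columns(toy))
```

A list produced this way would still need to be reviewed against the codebook before dropping anything, since a rare category can be meaningful.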

For testing so far, I implemented a baseline model (a random forest, since most of my features are categorical) after converting the dataset into numeric form. Other models will probably need one-hot encoding. As of right now the accuracy is about 66%, which isn't great, so more work will need to be done.

In terms of checkpoint progress, I am once again a little behind my original goal. Most of my time has been dedicated to studying for our department's comprehensive examinations for graduation this Saturday, so I hope to make more progress over the next 25 or so days now that I'll have more time.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

#One hot encoder.
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

## Import Data, Feature Exploration

In [2]:
#Import the rehab file as a pandas DataFrame.

#Read the data.
rehab = pd.read_csv('Rehab.csv', header=0)

In [3]:
#Quick look at the data.
rehab.head()

Unnamed: 0.1,Unnamed: 0,CASEID,DISYR,AGE,GENDER,RACE,ETHNIC,MARSTAT,EDUC,EMPLOY,...,BARBFLG,SEDHPFLG,INHFLG,OTCFLG,OTHERFLG,ALCDRUG,DSMCRIT,PSYPROB,HLTHINS,PRIMPAY
0,1,20140000000.0,2014,18-20,MALE,WHITE,NOT OF HISPANIC ORIGIN,NEVER MARRIED,9-11,UNEMPLOYED,...,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,OTHER DRUGS ONLY,CANNABIS DEPENDENCE,YES,NONE,OTHER GOVERNMENT PAYMENTS
1,2,20140000000.0,2014,50-54,MALE,WHITE,NOT OF HISPANIC ORIGIN,SEPARATED,12,UNEMPLOYED,...,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,ALCOHOL ONLY,ALCOHOL DEPENDENCE,NO,MISSING/UNKNOWN/NOT COLLECTED/INVALID,MISSING/UNKNOWN/NOT COLLECTED/INVALID
2,3,20140000000.0,2014,21-24,FEMALE,WHITE,NOT OF HISPANIC ORIGIN,NEVER MARRIED,12,UNEMPLOYED,...,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,OTHER DRUGS ONLY,OTHER SUBSTANCE DEPENDENCE,NO,NONE,"NO CHARGE (FREE, CHARITY, SPECIAL RESEARCH, TE..."
3,4,20140000000.0,2014,50-54,MALE,WHITE,NOT OF HISPANIC ORIGIN,"DIVORCED, WIDOWED",12,UNEMPLOYED,...,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,OTHER DRUGS ONLY,MISSING/UNKNOWN/NOT COLLECTED/INVALID,NO,MISSING/UNKNOWN/NOT COLLECTED/INVALID,MEDICAID
4,5,20140000000.0,2014,25-29,MALE,WHITE,NOT OF HISPANIC ORIGIN,NEVER MARRIED,12,NOT IN LABOR FORCE,...,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,SUBSTANCE NOT REPORTED,OTHER DRUGS ONLY,OTHER SUBSTANCE DEPENDENCE,NO,NONE,OTHER


In [4]:
#Drop the first three columns as they provide no information. Also drop some "FLG" columns, as they are nearly uniform.
rehabclean = rehab.drop(['Unnamed: 0','CASEID','DISYR','METHFLG','PCPFLG','HALLFLG','AMPHFLG','STIMFLG','TRNQFLG','BARBFLG','SEDHPFLG','INHFLG','OTCFLG'],axis=1)

In [5]:
#What are the column names?
print(rehabclean.columns)

Index(['AGE', 'GENDER', 'RACE', 'ETHNIC', 'MARSTAT', 'EDUC', 'EMPLOY',
       'DETNLF', 'PREG', 'VET', 'LIVARAG', 'PRIMINC', 'ARRESTS', 'STFIPS',
       'CBSA', 'REGION', 'DIVISION', 'SERVSETD', 'METHUSE', 'DAYWAIT',
       'REASON', 'LOS', 'PSOURCE', 'DETCRIM', 'NOPRIOR', 'SUB1', 'ROUTE1',
       'FREQ1', 'FRSTUSE1', 'SUB2', 'ROUTE2', 'FREQ2', 'FRSTUSE2', 'SUB3',
       'ROUTE3', 'FREQ3', 'FRSTUSE3', 'NUMSUBS', 'IDU', 'ALCFLG', 'COKEFLG',
       'MARFLG', 'HERFLG', 'OPSYNFLG', 'MTHAMFLG', 'BENZFLG', 'OTHERFLG',
       'ALCDRUG', 'DSMCRIT', 'PSYPROB', 'HLTHINS', 'PRIMPAY'],
      dtype='object')


In [6]:
#How do the frequencies of the label feature look?
rehabclean.groupby('REASON').size()

#What about the other features?
for i in rehabclean.columns:
    print('Information for '+i+':')
    print('')
    print(rehabclean.groupby(i).size())
    print('----------------------------------')

Information for AGE:

AGE
12-14           13494
15-17           64541
18-20           71755
21-24          179731
25-29          253879
30-34          221241
35-39          158836
40-44          139523
45-49          140760
50-54          124465
55 AND OVER    111588
dtype: int64
----------------------------------
Information for GENDER:

GENDER
FEMALE                                   499079
MALE                                     980447
MISSING/UNKNOWN/NOT COLLECTED/INVALID       287
dtype: int64
----------------------------------
Information for RACE:

RACE
ALASKA NATIVE (ALEUT, ESKIMO, INDIAN)           3916
AMERICAN INDIAN (OTHER THAN ALASKA NATIVE)     35063
ASIAN                                          10350
ASIAN OR PACIFIC ISLANDER                       2014
BLACK OR AFRICAN AMERICAN                     255706
MISSING/UNKNOWN/NOT COLLECTED/INVALID          28457
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER       7623
OTHER SINGLE RACE                             128005
TWO OR M

DIVISION
EAST NORTH CENTRAL           144592
EAST SOUTH CENTRAL            58394
MID-ATLANTIC                 366688
MOUNTAIN                     120945
NEW ENGLAND                  168903
PACIFIC                      272170
SOUTH ATLANTIC               128049
US JURISDICTION/TERRITORY      3637
WEST NORTH CENTRAL           148307
WEST SOUTH CENTRAL            68128
dtype: int64
----------------------------------
Information for SERVSETD:

SERVSETD
AMBULATORY, DETOXIFICATION                   16701
AMBULATORY, INTENSIVE OUTPATIENT            198706
AMBULATORY, NON-INTENSIVE OUTPATIENT        693242
DETOX, 24 HR, FREE-STANDING RESIDENTIAL     253084
DETOX, 24 HR, HOSPITAL INPATIENT             42309
MISSING/UNKNOWN/NOT COLLECTED/INVALID          666
REHAB/RES, HOSPITAL (NON-DETOX)               2721
REHAB/RES, LONG TERM (MORE THAN 30 DAYS)    114257
REHAB/RES, SHORT TERM (30 DAYS OR FEWER)    158127
dtype: int64
----------------------------------
Information for METHUSE:

METHUSE
MISSIN

ROUTE2
INHALATION                                85318
INJECTION (IV OR INTRAMUSCULAR)           76767
MISSING/UNKNOWN/NOT COLLECTED/INVALID    647689
ORAL                                     297676
OTHER                                      6725
SMOKING                                  365638
dtype: int64
----------------------------------
Information for FREQ2:

FREQ2
1-2 TIMES IN THE PAST WEEK                84430
1-3 TIMES IN THE PAST MONTH              125133
3-6 TIMES IN THE PAST WEEK                85503
DAILY                                    234485
MISSING/UNKNOWN/NOT COLLECTED/INVALID    651091
NO USE IN THE PAST MONTH                 299171
dtype: int64
----------------------------------
Information for FRSTUSE2:

FRSTUSE2
11 AND UNDER                              56744
12-14                                    195740
15-17                                    238646
18-20                                    141881
21-24                                     81453
25-29          

Nearly all the columns look pretty good in terms of distribution of classes. Many contain missing data, however.

In [7]:
#We're interested in whether they completed treatment, or did not for some personal reason.
#Filter rows to only those with interesting outcomes.
#Take a copy so later column assignments modify this frame rather than a view of the original.

rehabclean = rehabclean.query('REASON in ["TREATMENT COMPLETED","TERMINATED BY FACILITY","LEFT AGAINST PROFESSIONAL ADVICE","INCARCERATED","DEATH"]').copy()

## Missing Value Imputation

In [8]:
#Using the codebook provided by the CDC, missing values exist as "MISSING/UNKNOWN/NOT COLLECTED/INVALID"
#Convert these values into NaN.
rehabclean = rehabclean.replace("MISSING/UNKNOWN/NOT COLLECTED/INVALID",np.nan)

In [9]:
#How many missing values are in each column? What is the proportion?
missing = ((rehabclean.isnull().sum()/len(rehabclean))*100).to_dict()
missingsort = sorted(missing.items(),key=lambda kv: kv[1])
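The same percentages can be kept as a pandas Series instead of a sorted list of tuples, which makes the later 10% filter a one-line selection. This is a sketch on a toy frame under that assumption. The toy values and the name `auto_impute` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: AGE is complete, VET is 5% missing, MARSTAT is 25% missing.
toy = pd.DataFrame({
    "AGE": ["18-20"] * 20,
    "VET": ["NO"] * 19 + [np.nan],
    "MARSTAT": ["SEPARATED"] * 15 + [np.nan] * 5,
})

# isna().mean() is the fraction of missing rows per column.
missing_pct = toy.isna().mean().mul(100).sort_values()
print(missing_pct)

# Columns with some, but less than 10%, missing values.
auto_impute = missing_pct[(missing_pct > 0) & (missing_pct < 10)].index.tolist()
print(auto_impute)
```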

In [10]:
#Print only the columns with missing values.
#for i in missingsort:
#    if int(i[1]) > 0:
#        print(i)

missingsort

[('AGE', 0.0),
 ('STFIPS', 0.0),
 ('CBSA', 0.0),
 ('REGION', 0.0),
 ('DIVISION', 0.0),
 ('DAYWAIT', 0.0),
 ('REASON', 0.0),
 ('NUMSUBS', 0.0),
 ('ALCFLG', 0.0),
 ('COKEFLG', 0.0),
 ('MARFLG', 0.0),
 ('HERFLG', 0.0),
 ('OPSYNFLG', 0.0),
 ('MTHAMFLG', 0.0),
 ('BENZFLG', 0.0),
 ('OTHERFLG', 0.0),
 ('ALCDRUG', 0.0),
 ('LOS', 8.6012404709007127e-05),
 ('GENDER', 0.02038493991603469),
 ('SERVSETD', 0.043178227163921584),
 ('SUB1', 0.58798079859077279),
 ('ETHNIC', 1.0777354310038594),
 ('PSOURCE', 1.5204412780411192),
 ('ROUTE1', 1.7676409291748056),
 ('FREQ1', 1.8053143624373509),
 ('RACE', 1.9262478034582147),
 ('METHUSE', 1.9465467309695406),
 ('EMPLOY', 1.9933374791312404),
 ('LIVARAG', 2.276920377456837),
 ('FRSTUSE1', 2.8314423506158057),
 ('EDUC', 4.4443469637191075),
 ('SUB2', 5.2305003427594334),
 ('VET', 7.4122049882033982),
 ('ARRESTS', 8.9351406259810791),
 ('NOPRIOR', 9.0708682006118924),
 ('SUB3', 18.518384721444527),
 ('PSYPROB', 21.506455661035435),
 ('MARSTAT', 21.9962102934

In [11]:
#Length of stay variable is sparse, must impute manually.
#LOS values of 1 to 30 days are sparse and categorical, so recombine them into one "LESS THAN 30" category.
LOSrecode = [str(day) for day in range(1,31)]
rehabclean['LOS'] = rehabclean['LOS'].replace(LOSrecode,'LESS THAN 30')

#Replace missing values in LOS with the most common value.
#Assign the result back; fillna(inplace=True) on a selected column may not update the frame.
LOSmode = rehabclean['LOS'].value_counts().idxmax()
rehabclean['LOS'] = rehabclean['LOS'].fillna(LOSmode)

In [12]:
#For features with less than 10% of missing values, append them to a list.
autoimputelist = []

for i in missingsort:
    if i[1] > 0 and i[1] < 10:
        autoimputelist.append(i[0])

#For those feature columns, replace missing values with the mode as it may be most reliable.
for column in autoimputelist:
    mode = rehabclean[column].value_counts().idxmax()
    rehabclean[column] = rehabclean[column].fillna(mode)

### Preliminary Model Testing

There are about 17 features that have a high amount of missing values. Let's try running a model without
these features to test how well it performs.

In [13]:
#Recompute the missing-value percentages after imputation;
#any feature that still has missing values is dropped for this test.

cleanmissing = ((rehabclean.isnull().sum()/len(rehabclean))*100).to_dict()
cleanmissingsort = sorted(cleanmissing.items(),key=lambda kv: kv[1])

dropfeatures = []

for i in cleanmissingsort:
    if i[1]>0:
        dropfeatures.append(i[0])

#Drop the columns.
rehabtest = rehabclean.drop(dropfeatures,axis=1)

#Drop the census data, except for division.
rehabtest = rehabtest.drop(['STFIPS','CBSA','REGION'],axis=1)

In [14]:
rehabtest.columns

Index(['AGE', 'GENDER', 'RACE', 'ETHNIC', 'EDUC', 'EMPLOY', 'VET', 'LIVARAG',
       'ARRESTS', 'DIVISION', 'SERVSETD', 'METHUSE', 'DAYWAIT', 'REASON',
       'LOS', 'PSOURCE', 'NOPRIOR', 'SUB1', 'ROUTE1', 'FREQ1', 'FRSTUSE1',
       'SUB2', 'NUMSUBS', 'ALCFLG', 'COKEFLG', 'MARFLG', 'HERFLG', 'OPSYNFLG',
       'MTHAMFLG', 'BENZFLG', 'OTHERFLG', 'ALCDRUG'],
      dtype='object')

#### Encoding section:

In [15]:
#Manual encoding of the columns.

#Ordinal/binary variables.
def recat(colnames):
    for col in colnames:
        rehabtest[col] = rehabtest[col].astype('category')
        rehabtest[col] = rehabtest[col].cat.codes

recat(['AGE','EDUC','ARRESTS','NOPRIOR','FREQ1','FRSTUSE1','SUB2','GENDER','ALCFLG','COKEFLG','MARFLG','HERFLG',
       'OPSYNFLG','MTHAMFLG','BENZFLG','OTHERFLG','ALCDRUG'])
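One caveat with `cat.codes` is that pandas assigns codes in sorted (alphabetical) order of the labels, which may not match the real order of an ordinal feature such as EDUC. A minimal sketch of an explicit ordering follows. The list `EDUC_ORDER` and the helper `encode_ordinal` are assumptions for illustration, and the category labels other than "9-11" and "12" are guesses that should be checked against the codebook.

```python
import pandas as pd

# Assumed order for EDUC, lowest to highest (verify against the codebook).
EDUC_ORDER = ["8 OR LESS", "9-11", "12", "13-15", "16 OR MORE"]

def encode_ordinal(series, order):
    """Map a column to integer codes that follow `order`.
    Values not listed in `order` (including NaN) become -1."""
    return pd.Categorical(series, categories=order, ordered=True).codes

toy = pd.Series(["12", "9-11", "16 OR MORE", "8 OR LESS"])

alphabetical = list(toy.astype("category").cat.codes)  # default: sorted labels
explicit = list(encode_ordinal(toy, EDUC_ORDER))       # follows EDUC_ORDER
print(alphabetical)
print(explicit)
```

For tree-based models the difference may be small, but distance-based models such as kNN depend directly on the numeric spacing of the codes.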

In [16]:
#One hot encode multicategory.
rehabtest = pd.get_dummies(rehabtest,columns=['RACE','ETHNIC','EMPLOY','VET',
                                                     'LIVARAG','DIVISION','SERVSETD','METHUSE',
                                                     'LOS','PSOURCE','SUB1','ROUTE1'])

In [17]:
rehabtest.dtypes

AGE                                                      int8
GENDER                                                   int8
EDUC                                                     int8
ARRESTS                                                  int8
DAYWAIT                                                 int64
REASON                                                 object
NOPRIOR                                                  int8
FREQ1                                                    int8
FRSTUSE1                                                 int8
SUB2                                                     int8
NUMSUBS                                                 int64
ALCFLG                                                   int8
COKEFLG                                                  int8
MARFLG                                                   int8
HERFLG                                                   int8
OPSYNFLG                                                 int8
MTHAMFLG

In [18]:
#Split the data into features and labels, and split into training and testing data.
from sklearn.model_selection import train_test_split

#Select the label as a 1-D Series; passing a one-column DataFrame as y triggers a scikit-learn warning.
X=rehabtest.drop(columns='REASON')
Y=rehabtest['REASON']

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=.3)
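Because the outcome classes are unlikely to be equally frequent, a stratified split with a fixed seed may give more stable, repeatable accuracy numbers. The sketch below uses a toy frame as a stand-in for `rehabtest`. The toy values and `random_state=0` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded data: 14 rows of one outcome, 6 of another.
toy = pd.DataFrame({
    "AGE": list(range(10)) * 2,
    "REASON": ["TREATMENT COMPLETED"] * 14 + ["INCARCERATED"] * 6,
})

X_toy = toy.drop(columns="REASON")
y_toy = toy["REASON"]  # 1-D Series, as scikit-learn expects

# stratify keeps class proportions similar in both parts;
# random_state makes the split repeatable between runs.
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)
print(yte.value_counts())
```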

In [19]:
#Train on a random forest.
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

from sklearn import metrics
print('Accuracy:',metrics.accuracy_score(y_test,y_pred))


Accuracy: 0.671447043611
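To judge whether an accuracy in this range is meaningful, it can help to compare it with a model that always predicts the most frequent outcome. The following is a sketch on synthetic data only: the 60/40 class split is invented for illustration and says nothing about the real distribution of REASON.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; the features are ignored by the dummy model.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array(
    ["TREATMENT COMPLETED"] * 120 + ["LEFT AGAINST PROFESSIONAL ADVICE"] * 80
)

# Always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("Majority-class accuracy:", baseline_acc)
```

On the real data the same comparison would use `X_train`, `y_train`, `X_test`, and `y_test` from the split above.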


In [20]:
#Train on a decision tree.
#Import the class directly so the variable name does not shadow the sklearn.tree module.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)

print('Accuracy:',metrics.accuracy_score(y_test,y_pred))

Accuracy: 0.559834512181


In [21]:
#Train on a neural network.
from sklearn.neural_network import MLPClassifier
neural = MLPClassifier(solver='lbfgs',alpha=1e-5,hidden_layer_sizes=(15,),random_state=1)
neural = neural.fit(X_train,y_train)
y_pred = neural.predict(X_test)

print('Accuracy:',metrics.accuracy_score(y_test,y_pred))


Accuracy: 0.646821699203


In [23]:
#Train on kNN.
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh = neigh.fit(X_train,y_train)
y_pred = neigh.predict(X_test)

print('Accuracy:',metrics.accuracy_score(y_test,y_pred))


Accuracy: 0.562532433835
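The four accuracies above each come from a single random split, so some of the gap between models may be noise. One possible way to compare them more evenly is k-fold cross-validation in a loop. This is a sketch on synthetic data from `make_classification`, not the rehab data, and the model settings are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix and labels.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

# Score every model on the same 5 folds and report mean and spread.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With a dataset of this size, cross-validating every model may be slow, so a subsample could be used for the comparison.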
