#### CANB8347 Machine Learning Project
Given an annotated dataset, develop a supervised machine learning method to predict live births from multi-dimensional data.
##### 0) Preprocessing and Characterization

In [1]:
import numpy as np
import pandas as pd
import scipy as sc

# sklearn tools
from sklearn.preprocessing import normalize
from sklearn.impute import SimpleImputer

---
#### Read in dataset and look at feature labels

In [2]:
# read in raw training data
vlbw = pd.read_csv('data/vlbw_train.csv')

In [3]:
# see how many observations and features we are working with
vlbw.shape

(537, 27)

In [4]:
# get rid of index axis
vlbw.drop('Unnamed: 0', axis=1, inplace=True)
# drop the rows that have NaN in the 'twn' column; these rows seem to be missing a lot of other info
vlbw.dropna(subset=['twn'], inplace=True)
# look at number of missing observations in each feature
vlbw.isnull().sum()

birth         1
exit          9
hospstay      9
lowph        32
pltct        39
race          2
bwt           0
gest          1
inout         1
twn           0
lol         288
magsulf     182
meth         70
toc          70
delivery      3
apg1         11
vent          6
pneumo        3
pda           5
cld          38
pvh         109
ivh         108
ipe         108
year          1
sex           1
dead          0
dtype: int64

In [5]:
vlbw.shape

(519, 26)


Features with a lot of missing observations are concerning:
* lol
* magsulf
* meth
* toc
* pvh
* ivh
* ipe
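
One way to make "a lot of missing observations" concrete is to rank features by the fraction of rows that are missing. The following is a minimal sketch on a made-up toy frame; the helper name `missing_fraction` and the 10% cutoff are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def missing_fraction(df, threshold=0.10):
    """Fraction of missing rows per column, keeping only columns above `threshold`."""
    frac = df.isnull().mean()  # mean of a boolean mask = fraction missing
    return frac[frac > threshold].sort_values(ascending=False)

# toy data standing in for vlbw (values are made up)
toy = pd.DataFrame({
    'bwt': [1000.0, 1200.0, 900.0, 1100.0],
    'lol': [np.nan, np.nan, 3.0, np.nan],
    'ivh': ['absent', None, 'definite', 'absent'],
})

flagged = missing_fraction(toy)
print(flagged)
```

Applied to `vlbw`, the same call would list the concerning features in order of severity, which can help decide where imputation deserves the most care.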

---
#### Investigate missing values and decide how to impute/what to keep

In [6]:
# for the numeric features with many missing values, see if they correlate with the outcome ('dead')
print('lol: {}'.format(vlbw.lol.corr(vlbw.dead)))
print('magsulf: {}'.format(vlbw.magsulf.corr(vlbw.dead)))
print('meth: {}'.format(vlbw.meth.corr(vlbw.dead)))
print('toc: {}'.format(vlbw.toc.corr(vlbw.dead)))

lol: 0.16361321302009335
magsulf: -0.06183780159526858
meth: -0.1651503386980673
toc: 0.010423079240993036
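
The repeated print statements could be collapsed into a loop. Worth keeping in mind: `Series.corr` computes a Pearson coefficient by default and only uses rows where both values are present, so each number is based on the subset of rows where that feature was recorded. A small sketch on made-up toy data (the frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# toy data standing in for vlbw (values are made up)
toy = pd.DataFrame({
    'meth': [1.0, 0.0, np.nan, 1.0, 0.0, 0.0],
    'toc':  [0.0, 0.0, 1.0, np.nan, 1.0, 0.0],
    'dead': [0, 1, 1, 0, 1, 0],
})

# one correlation per feature against the outcome
corrs = {col: toy[col].corr(toy['dead']) for col in ['meth', 'toc']}
for col, r in corrs.items():
    print('{}: {:.3f}'.format(col, r))

# the same coefficient computed after explicitly dropping rows missing 'meth'
complete = toy.dropna(subset=['meth'])
meth_complete_case = complete['meth'].corr(complete['dead'])
```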


In [7]:
# add labor length of 0 for any abdominal births without any value already assigned
vlbw.loc[(vlbw.delivery=='abdominal') & (vlbw.lol.isnull()), 'lol'] = 0

In [8]:
vlbw.magsulf.value_counts()

0.0    292
1.0     45
Name: magsulf, dtype: int64

In [9]:
# this one's probably okay to impute zero for the missing values
vlbw.loc[vlbw.magsulf.isnull(), 'magsulf'] = 0

In [10]:
vlbw.meth.value_counts()

0.0    254
1.0    195
Name: meth, dtype: int64

In [11]:
# again, for 70 observations with a low correlation to death, 
# this one's probably okay to impute zero for the missing values
vlbw.loc[vlbw.meth.isnull(), 'meth'] = 0

In [12]:
vlbw.toc.value_counts()

0.0    347
1.0    102
Name: toc, dtype: int64

In [13]:
# again, for 70 observations with a low correlation to death, 
# this one's probably okay to impute zero for the missing values
vlbw.loc[vlbw.toc.isnull(), 'toc'] = 0
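
Since `magsulf`, `meth`, and `toc` all receive the same treatment, the three assignments could also be written as one `fillna` call over a list of columns. A minimal sketch on made-up toy data; the list name `binary_cols` and the extra `lowph` column are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

# toy data standing in for vlbw (values are made up)
toy = pd.DataFrame({
    'magsulf': [0.0, np.nan, 1.0],
    'meth':    [np.nan, 1.0, 0.0],
    'toc':     [0.0, 0.0, np.nan],
    'lowph':   [7.2, np.nan, 7.1],  # not in the list below, so left untouched
})

binary_cols = ['magsulf', 'meth', 'toc']
# replace missing values with 0 in the binary treatment columns only
toy[binary_cols] = toy[binary_cols].fillna(0)
print(toy)
```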

---
Now look at `pvh`, `ivh`, and `ipe`, which are missing a bunch of values and have more than two levels.

In [14]:
# see potential 'pvh' values and their counts
vlbw.pvh.value_counts()

absent      277
definite    102
possible     31
Name: pvh, dtype: int64

In [15]:
# replace categories with numeric levels based on confidence of pvh diagnosis
vlbw.loc[vlbw.pvh=='absent', 'pvh'] = 0
vlbw.loc[vlbw.pvh=='possible', 'pvh'] = 1
vlbw.loc[vlbw.pvh=='definite', 'pvh'] = 2

In [16]:
# now levels should be 0, 1, 2
vlbw.pvh.value_counts()

0    277
2    102
1     31
Name: pvh, dtype: int64

In [17]:
# correlate death to new numeric pvh values
vlbw.dead.corr(vlbw.pvh.astype('float'))

0.1532637193896322

At about 0.15, this is a weak correlation.  Do the same for IVH and IPE.
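
Because `pvh`, `ivh`, and `ipe` share the same three levels, the recoding could also be done with a single dictionary and `Series.map`. This is a sketch on made-up toy data, not the approach used in the cells of this notebook; the dictionary name `levels` is an assumption.

```python
import numpy as np
import pandas as pd

# ordinal coding by confidence of diagnosis
levels = {'absent': 0, 'possible': 1, 'definite': 2}

# toy data standing in for vlbw (values are made up)
toy = pd.DataFrame({
    'pvh': ['absent', 'definite', np.nan, 'possible'],
    'ivh': ['absent', np.nan, 'definite', 'absent'],
    'ipe': [np.nan, 'absent', 'possible', 'definite'],
})

for col in ['pvh', 'ivh', 'ipe']:
    # map() leaves missing entries as NaN, so the result is already float64
    toy[col] = toy[col].map(levels)

print(toy.dtypes)
```

One side effect of this variant is that the columns come out numeric straight away, so no separate dtype conversion is needed before computing correlations.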

In [18]:
# see potential 'ivh' values and counts
vlbw.ivh.value_counts()

absent      345
definite     58
possible      8
Name: ivh, dtype: int64

In [19]:
# replace categories with numeric levels based on confidence of ivh diagnosis
vlbw.loc[vlbw.ivh=='absent', 'ivh'] = 0
vlbw.loc[vlbw.ivh=='possible', 'ivh'] = 1
vlbw.loc[vlbw.ivh=='definite', 'ivh'] = 2

In [20]:
# now levels should be 0, 1, 2
vlbw.ivh.value_counts()

0    345
2     58
1      8
Name: ivh, dtype: int64

In [21]:
# correlate death to new numeric ivh values
vlbw.dead.corr(vlbw.ivh.astype('float'))

0.3966245771655225

At about 0.40, that's a much larger correlation than for PVH.

In [22]:
# see how many of each category
vlbw.ipe.value_counts()

absent      368
definite     29
possible     14
Name: ipe, dtype: int64

In [23]:
# replace categories with numeric levels based on confidence of ipe diagnosis
vlbw.loc[vlbw.ipe=='absent', 'ipe'] = 0
vlbw.loc[vlbw.ipe=='possible', 'ipe'] = 1
vlbw.loc[vlbw.ipe=='definite', 'ipe'] = 2

In [24]:
# now levels should be 0, 1, 2
vlbw.ipe.value_counts()

0    368
2     29
1     14
Name: ipe, dtype: int64

In [25]:
# correlate death to new numeric ipe values
vlbw.dead.corr(vlbw.ipe.astype('float'))

0.13357262144554805

Smaller for IPE (about 0.13), but still there.

Now the problem with these values (`pvh`, `ivh`, `ipe`) will be imputation.  How do we impute roughly 110 values per feature in a dataset of about 520?  Random sampling from the observed values introduces noise, but assuming absence everywhere could yield false negatives. See [`imputation.ipynb`](imputation.ipynb) for next steps.
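
To illustrate the trade-off, here is a minimal sketch of the two options on a made-up toy column. It is only an illustration of the idea, not the method chosen in `imputation.ipynb`; the seed, the toy values, and the variable names are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# toy ordinal column standing in for ivh (values are made up)
ivh = pd.Series([0, 0, 0, 2, np.nan, 0, 1, np.nan, 0, 2], dtype='float64')
missing = ivh.isnull()

# option 1: assume absence (0) wherever the value is missing
assume_absent = ivh.fillna(0)

# option 2: draw replacements from the observed values, following their frequencies
observed = ivh.dropna().to_numpy()
sampled = ivh.copy()
sampled[missing] = rng.choice(observed, size=int(missing.sum()))

print(assume_absent.value_counts().sort_index())
print(sampled.value_counts().sort_index())
```

Option 1 can only ever add zeros, which shifts the column toward "absent"; option 2 keeps the observed level frequencies on average but assigns levels to individual infants at random.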

In [26]:
# make sure the columns are numeric datatype before moving on
# (astype returns a new Series and takes no `inplace` argument, so assign the result back)
vlbw['pvh'] = vlbw.pvh.astype('float64')
vlbw['ivh'] = vlbw.ivh.astype('float64')
vlbw['ipe'] = vlbw.ipe.astype('float64')

---
#### Check out other categorical variables and see how to convert to numeric

In [27]:
# features that are still categorical
vlbw.dtypes[vlbw.dtypes=='object']

race        object
inout       object
delivery    object
sex         object
dtype: object

In [28]:
def numerize(df, col, drop=True):
    '''
    make categorical data numeric from 0 - n categories
        df = dataframe
        col = column to numerize into n_categories columns
        drop = drop original column or retain in df?
    '''
    temp = df.copy(deep=True) # copy df so you don't affect it
    
    for cat in temp[col].unique():
        # for each categorical value, create a new column with binary values for T/F
        temp[col+'_'+str(cat)] = (temp[col]==cat)*1
        
    if drop:
        return temp.drop(col, axis=1)
    
    else:
        return temp
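
One subtlety of `numerize`: `unique()` includes `NaN` as a category, but `NaN == NaN` evaluates to False, so the resulting `_nan` column is all zeros and has to be filled in separately. A possible alternative, not used in this notebook, is `pd.get_dummies` with `dummy_na=True`, which sets the missing-value indicator directly. A small sketch on made-up toy data:

```python
import numpy as np
import pandas as pd

# toy data standing in for a categorical column of vlbw (values are made up)
toy = pd.DataFrame({'sex': ['female', 'male', np.nan, 'male']})

# the comparison used inside numerize: NaN never equals NaN, so this is all zeros
manual_nan = (toy['sex'] == np.nan) * 1

# get_dummies with dummy_na=True adds an indicator column for missing values
dummies = pd.get_dummies(toy['sex'], prefix='sex', dummy_na=True).astype(int)
print(dummies)
```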

In [29]:
# perform numerization on whole dataset
for feature, datatype in zip(vlbw.dtypes.index, vlbw.dtypes):
    if datatype == 'object':
        vlbw = numerize(vlbw, feature)

In [30]:
# look at resulting features
# should be more than we started with, as each category now has its own binary (one-hot) column
vlbw.dtypes

birth                   float64
exit                    float64
hospstay                float64
lowph                   float64
pltct                   float64
bwt                     float64
gest                    float64
twn                     float64
lol                     float64
magsulf                 float64
meth                    float64
toc                     float64
apg1                    float64
vent                    float64
pneumo                  float64
pda                     float64
cld                     float64
pvh                     float64
ivh                     float64
ipe                     float64
year                    float64
dead                      int64
race_white                int64
race_black                int64
race_native American      int64
race_oriental             int64
race_nan                  int64
inout_born at Duke        int64
inout_transported         int64
inout_nan                 int64
delivery_abdominal        int64
delivery

In [31]:
vlbw.loc[(vlbw['inout_born at Duke']==0)&(vlbw.inout_transported==0)]

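| | birth | exit | hospstay | lowph | pltct | bwt | gest | twn | lol | magsulf | ... | race_nan | inout_born at Duke | inout_transported | inout_nan | delivery_abdominal | delivery_vaginal | delivery_nan | sex_female | sex_male | sex_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 369 | 85.624 | 85.768997 | 53.0 | NaN | NaN | 1145.0 | 28.0 | 0.0 | NaN | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |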


There's only one observation where we don't know whether the infant was born at Duke or transported. Let's say they are a Dukie.

In [32]:
vlbw.loc[(vlbw['inout_born at Duke']==0)&(vlbw.inout_transported==0), 'inout_born at Duke'] = 1
vlbw.drop('inout_nan', axis=1, inplace=True)

In [33]:
# make unknown sex 1 in 'sex_nan' column
vlbw.loc[(vlbw.sex_female==0)&(vlbw.sex_male==0), 'sex_nan'] = 1
# make unknown delivery 1 in 'delivery_nan' column
vlbw.loc[(vlbw.delivery_abdominal==0)&(vlbw.delivery_vaginal==0), 'delivery_nan'] = 1
# make unknown race 1 in 'race_nan' column
vlbw.loc[(vlbw.race_black==0)&(vlbw.race_white==0)&(vlbw.race_oriental==0)&(vlbw['race_native American']==0), 'race_nan'] = 1

In [34]:
vlbw.sex_nan.value_counts()

0    518
1      1
Name: sex_nan, dtype: int64

In [35]:
vlbw.delivery_nan.value_counts()

0    516
1      3
Name: delivery_nan, dtype: int64

In [36]:
vlbw.race_nan.value_counts()

0    517
1      2
Name: race_nan, dtype: int64

In [37]:
# drop rows with unknown race, sex, or delivery because there are only a few of them
# (.copy() makes the filtered frame independent, so the in-place drop below is safe)
vlbw = vlbw[(vlbw.race_nan!=1) & (vlbw.sex_nan!=1) & (vlbw.delivery_nan!=1)].copy()
vlbw.drop(['race_nan','delivery_nan','sex_nan'], axis=1, inplace=True)

In [38]:
# final obs x features matrix for training after preprocessing
vlbw.shape

(513, 32)

In [39]:
# save the 'numerified' data as .csv file
vlbw.to_csv('data/vlbw_train_numeric.csv', index=False)

---
Next, we need to figure out how to impute the rest of the missing values.  
See [`imputation.ipynb`](imputation.ipynb) for next steps.