#Merging, Cleaning Data Sets and Adding Variables#

This notebook explores merging the data sets, cleaning them (recoding missing-value codes to np.nan and converting fields to the appropriate type), and adding calculated variables. The same code is repeated in the exploratory and modeling notebooks.

In [2]:
import pandas as pd
import numpy as np

cross = pd.read_csv('000_Cross-Sectional.csv', low_memory=False)
base = pd.read_csv('00_Baseline.csv', low_memory=False)
visit1 = pd.read_csv('01_Visit1.csv', low_memory=False)
visit2 = pd.read_csv('02_Visit2.csv', low_memory=False)
visit3 = pd.read_csv('03_Visit3.csv', low_memory=False)
visit4 = pd.read_csv('04_Visit4.csv', low_memory=False)
visit5 = pd.read_csv('05_Visit5.csv', low_memory=False)
visit6 = pd.read_csv('06_Visit6.csv', low_memory=False)
visit7 = pd.read_csv('07_Visit7.csv', low_memory=False)
visit8 = pd.read_csv('08_Visit8.csv', low_memory=False)
visit9 = pd.read_csv('09_Visit9.csv', low_memory=False)
visit10 = pd.read_csv('10_Visit10.csv', low_memory=False)

pd.set_option('display.max_rows', 120)
cross.rename(columns={'ID':'SWANID'}, inplace=True)

##Merging Data Sets##

In [3]:
data = pd.merge(cross, base)
data = pd.merge(data, visit1, on='SWANID', how='outer')
data = pd.merge(data, visit2, on='SWANID', how='outer')
data = pd.merge(data, visit3, on='SWANID', how='outer')
data = pd.merge(data, visit4, on='SWANID', how='outer')
data = pd.merge(data, visit5, on='SWANID', how='outer')
data = pd.merge(data, visit6, on='SWANID', how='outer')
data = pd.merge(data, visit7, on='SWANID', how='outer')
data = pd.merge(data, visit8, on='SWANID', how='outer')
data = pd.merge(data, visit9, on='SWANID', how='outer')
data = pd.merge(data, visit10, on='SWANID', how='outer')
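
The chain of outer merges above follows a single pattern, so it can be folded into one reduction. The sketch below is only an illustration: `left`, `v1` and `v2` are tiny made-up frames standing in for the real files, and it shows how an outer merge on `SWANID` keeps every ID that appears in any frame.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the baseline and visit files (hypothetical values).
left = pd.DataFrame({'SWANID': [1, 2, 3], 'AGE0': [45, 47, 50]})
v1 = pd.DataFrame({'SWANID': [1, 2], 'AGE1': [46, 48]})
v2 = pd.DataFrame({'SWANID': [2, 4], 'AGE2': [49, 52]})

# Fold the frames together with successive outer merges on the shared key.
merged = reduce(
    lambda acc, frame: pd.merge(acc, frame, on='SWANID', how='outer'),
    [v1, v2],
    left,
)

# Every ID seen in any frame survives; visits a participant missed are NaN.
print(merged.sort_values('SWANID'))
```

With the real data, the list passed to `reduce` would hold `visit1` through `visit10` and the starting value would be the merged cross-sectional and baseline frame.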

In [4]:
data.shape

(3302, 9214)

That's a lot of fields.

##Cleaning Data Sets##

**Changing coded values for missing/null data to np.nan**

In [5]:
# Blank strings and the codes -9, -1, -7, -8 (stored as text) all mark missing data.
data.replace([' ', '-9', '-1', '-7', '-8'], np.nan, inplace=True)
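
One thing worth checking with this approach: the codes are matched as strings, so a column that pandas parsed as numeric keeps its `-9` values. The sketch below uses a made-up two-column frame (`toy`, `as_text` and `as_int` are hypothetical names) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column read as text, one read as integers.
toy = pd.DataFrame({
    'as_text': ['1', '-9', ' ', '2'],
    'as_int': [1, -9, 2, -7],
})

missing_codes = [' ', '-9', '-1', '-7', '-8']
cleaned = toy.replace(missing_codes, np.nan)

# The text codes become NaN; the integer -9 and -7 are left untouched.
print(cleaned)
```

If any numeric columns carry the same codes, adding the numeric versions (for example `-9`) to the list would cover them, assuming those numbers are never valid values in those columns.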

**Correcting data types**

Discrimination scores

In [22]:
disc_items = ['COURTES', 'RESPECT', 'POORSER', 'NOTSMAR', 'AFRAIDO', 'DISHONS', 'BETTER', 'INSULTE', 'HARASSE', 'IGNORED']
# Convert the ten discrimination items to float for visits 0, 1, 2, 3, 7 and 10.
for visit in [0, 1, 2, 3, 7, 10]:
    disc_cols = [item + str(visit) for item in disc_items]
    data[disc_cols] = data[disc_cols].astype(float)

Ages

In [28]:
data[['AGE0', 'AGE1', 'AGE2', 'AGE3', 'AGE4', 'AGE5', 'AGE6', 'AGE7', 'AGE8', 'AGE9', 'AGE10']] = data[['AGE0', 'AGE1', 'AGE2', 'AGE3', 'AGE4', 'AGE5', 'AGE6', 'AGE7', 'AGE8', 'AGE9', 'AGE10']].astype(float)
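
`astype(float)` works as long as every remaining value is numeric text or NaN. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` converts anything it cannot parse to NaN instead of raising. The frame below is made up for illustration, including the stray `'unknown'` entry.

```python
import numpy as np
import pandas as pd

# Hypothetical age columns stored as text, with one missing value
# and one stray non-numeric entry.
toy = pd.DataFrame({
    'AGE0': ['45', '47', np.nan],
    'AGE1': ['46', 'unknown', '51'],
})

# astype(float) handles numeric text and NaN.
age0 = toy['AGE0'].astype(float)

# to_numeric with errors='coerce' turns unparseable text into NaN
# rather than raising a ValueError.
age1 = pd.to_numeric(toy['AGE1'], errors='coerce')

print(age0)
print(age1)
```

Coercion hides unexpected values, so it is probably best used only after inspecting what the unparseable entries are.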

**Adding Calculated Variables**

*Average Discrimination Score*

In [30]:
data.loc[:, 'DISC_SCORE0'] = 5 - data[['COURTES0', 'RESPECT0', 'POORSER0', 'NOTSMAR0', 'AFRAIDO0', 'DISHONS0', 
                                   'BETTER0', 'INSULTE0', 'HARASSE0', 'IGNORED0']].mean(axis=1)
data.loc[:, 'DISC_SCORE1'] = 5 - data[['COURTES1', 'RESPECT1', 'POORSER1', 'NOTSMAR1', 'AFRAIDO1', 'DISHONS1', 
                                   'BETTER1', 'INSULTE1', 'HARASSE1', 'IGNORED1']].mean(axis=1)
data.loc[:, 'DISC_SCORE2'] = 5 - data[['COURTES2', 'RESPECT2', 'POORSER2', 'NOTSMAR2', 'AFRAIDO2', 'DISHONS2', 
                                    'BETTER2', 'INSULTE2', 'HARASSE2', 'IGNORED2']].mean(axis=1)
data.loc[:, 'DISC_SCORE3'] = 5 - data[['COURTES3', 'RESPECT3', 'POORSER3', 'NOTSMAR3', 'AFRAIDO3', 'DISHONS3', 
                                    'BETTER3', 'INSULTE3', 'HARASSE3', 'IGNORED3']].mean(axis=1)
data.loc[:, 'DISC_SCORE7'] = 5 - data[['COURTES7', 'RESPECT7', 'POORSER7', 'NOTSMAR7', 'AFRAIDO7', 'DISHONS7', 
                                    'BETTER7', 'INSULTE7', 'HARASSE7', 'IGNORED7']].mean(axis=1)
data.loc[:, 'DISC_SCORE10'] = 5 - data[['COURTES10', 'RESPECT10', 'POORSER10', 'NOTSMAR10', 'AFRAIDO10', 'DISHONS10', 
                                    'BETTER10', 'INSULTE10', 'HARASSE10', 'IGNORED10']].mean(axis=1)
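
The `5 - mean` expression reverses the item scale so that a higher score means more reported discrimination. This assumes the items are coded 1 to 4 with lower codes meaning more frequent experiences, which is an assumption about the codebook rather than something shown in this notebook. The sketch below uses made-up values for three of the items and also shows that `mean(axis=1)` skips NaN by default, so a participant with partial answers still gets a score.

```python
import numpy as np
import pandas as pd

# Made-up answers for three participants on three items.
items = pd.DataFrame({
    'COURTES0': [1.0, 4.0, np.nan],
    'RESPECT0': [2.0, 4.0, 3.0],
    'POORSER0': [3.0, 4.0, np.nan],
})

# Row-wise mean ignores NaN; subtracting from 5 flips a 1-4 scale.
score = 5 - items.mean(axis=1)
print(score)
```

The third participant answered only one item, so the score rests on that single answer. Whether such rows should be kept is a judgment call for the analysis.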

*Reason for Discrimination*

In [44]:
def convert_binary(col):
    # Recode a string-coded column: '1' -> 0, '2' -> 1.
    # Unmapped and missing values become 0.
    data[col] = data[col].map({'1': 0, '2': 1}).fillna(0)
    print(data[col].value_counts(dropna=False))

def convert_binary_float(col):
    # Same recode for a float-coded column: 1 -> 0, 2 -> 1.
    data[col] = data[col].map({1: 0, 2: 1}).fillna(0)  # the fillna may need to come out for certain variables
    print(data[col].value_counts(dropna=False))

# Template for applying the converters, where `conversion` is a
# placeholder for a list of column names:
#
# for x in conversion:
#     if data[x].dtype == np.float64:
#         convert_binary_float(x)
#     else:
#         convert_binary(x)

In [46]:
data['RACE_REASON0'] = data.MAINREA0.map({'1':1, '2':1, '3':0, '4':0, '5':0, '6':0, '7':0, '8':0, '9':0})
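
Unlike the binary converters, this mapping leaves missing values as NaN, and any code outside '1' to '9' also becomes NaN. The sketch below reproduces the same mapping on a made-up series (`main_reason` and the stray `'10'` are illustrative only) to make that behavior visible.

```python
import numpy as np
import pandas as pd

# Made-up main-reason codes, including a missing value and an unexpected code.
main_reason = pd.Series(['1', '2', '5', '9', np.nan, '10'])

# Same mapping as above: codes '1' and '2' -> 1, codes '3' to '9' -> 0.
mapping = {str(code): int(code <= 2) for code in range(1, 10)}
race_reason = main_reason.map(mapping)

# Missing and unmapped codes stay NaN instead of being forced to 0.
print(race_reason)
```

Keeping NaN here separates participants who gave another reason (0) from those with no recorded reason at all.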