#Merging, Cleaning Data Sets and Adding Variables#

This notebook merges the data sets, cleans them (recoding the coded missing values as np.nan and converting fields to the appropriate value type), and adds calculated variables. The cleaned data set is exported as a CSV file for use in other data analysis.

In [21]:
import pandas as pd
import numpy as np

cross = pd.read_csv('000_Cross-Sectional.csv', low_memory=False)
base = pd.read_csv('00_Baseline.csv', low_memory=False)

pd.set_option('display.max_rows', 120)
cross.rename(columns={'ID':'SWANID'}, inplace=True)

##Merging Data Sets##

In [22]:
data = pd.merge(cross, base)

In [23]:
data.shape

(3302, 852)

That's a lot of fields.
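
Called without an `on` argument, `pd.merge` joins on every column name the two frames have in common, so it can help to name the key explicitly. The sketch below uses toy frames (`demo_cross`, `demo_base`, and their values are invented for illustration) and assumes `SWANID` is the intended key with one row per participant in each file.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are invented for illustration.
demo_cross = pd.DataFrame({'SWANID': [1, 2, 3], 'ETHNIC': [1, 8, 10]})
demo_base = pd.DataFrame({'SWANID': [1, 2, 3], 'AGE0': ['46', '48', '51']})

# Name the key explicitly and have pandas check that it is unique on both sides.
demo_merged = pd.merge(demo_cross, demo_base, on='SWANID',
                       how='inner', validate='one_to_one')

print(demo_merged.shape)
```

If the key were duplicated in either frame, `validate='one_to_one'` would raise a `MergeError` instead of silently multiplying rows.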

##Cleaning Data Sets##

###Changing coded values for missing/null data to np.nan###

In [24]:
# A blank and the codes -1, -7, -8, -9 (stored as strings) all mark missing data
data.replace([' ', '-9', '-1', '-7', '-8'], np.nan, inplace=True)
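
The replacements above match string values only. If pandas parsed a column as numeric, a value of `-9` in it would not match `'-9'` and would survive the cleaning. The following is a minimal sketch on a toy frame, under the assumption (not confirmed by the data files) that the same codes mean "missing" in numeric columns too. The names `demo`, `string_codes`, and `numeric_codes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: one column read as strings, one read as integers.
demo = pd.DataFrame({'as_text': ['1', '-9', ' ', '3'],
                     'as_number': [2, -9, 4, -1]})

# Hypothetical code lists: the string codes used in the notebook, plus
# their numeric counterparts (an assumption about the numeric columns).
string_codes = [' ', '-1', '-7', '-8', '-9']
numeric_codes = [-1, -7, -8, -9]

cleaned = demo.replace(string_codes, np.nan).replace(numeric_codes, np.nan)

# Count the missing values per column after cleaning.
print(cleaned.isnull().sum())
```

This is only safe when none of the numeric columns use -1, -7, -8, or -9 as genuine values.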

###Correcting data types###

Discrimination scores

In [25]:
disc_cols = ['COURTES0', 'RESPECT0', 'POORSER0', 'NOTSMAR0', 'AFRAIDO0', 'DISHONS0', 'BETTER0', 'INSULTE0',
             'HARASSE0', 'IGNORED0']
data[disc_cols] = data[disc_cols].astype(float)

Ages

In [26]:
data['AGE0'] = data['AGE0'].astype(float)

BMI

In [27]:
data['BMI0'] = data['BMI0'].astype(float)
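
`astype(float)` raises an error as soon as one value cannot be parsed as a number. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN instead. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy column with a parseable integer, a decimal, a missing value, and stray text.
demo = pd.DataFrame({'AGE0': ['46', '48.5', None, 'unknown']})

# errors='coerce' converts anything unparseable to NaN instead of raising.
demo['AGE0'] = pd.to_numeric(demo['AGE0'], errors='coerce')

print(demo['AGE0'].dtype)
```

The trade-off is that coercion hides unexpected values, so it is worth counting the NaNs before and after the conversion.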

###Adding Calculated Variables###

*Average Discrimination Score*

In [28]:
data.loc[:, 'DISC_SCORE0'] = 5 - data[['COURTES0', 'RESPECT0', 'POORSER0', 'NOTSMAR0', 'AFRAIDO0', 'DISHONS0', 
                                   'BETTER0', 'INSULTE0', 'HARASSE0', 'IGNORED0']].mean(axis=1)
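
To make the calculation concrete, here is a small worked example on invented values, using only three of the ten items for brevity. `mean(axis=1)` skips NaN by default, so a row with some unanswered items is averaged over the answered ones, and subtracting the mean from 5 reverses the direction of the scale (a mean of 4 becomes 1, a mean of 2 becomes 3).

```python
import numpy as np
import pandas as pd

items = ['COURTES0', 'RESPECT0', 'POORSER0']  # subset of the ten items, for brevity

# Invented responses: a complete row, a partly answered row, an empty row.
demo = pd.DataFrame([[1.0, 2.0, 3.0],
                     [4.0, np.nan, 4.0],
                     [np.nan, np.nan, np.nan]], columns=items)

# Row means ignore NaN; a row with no answers at all stays NaN.
demo['DISC_SCORE0'] = 5 - demo[items].mean(axis=1)

print(demo['DISC_SCORE0'].tolist())
```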

*Reason for Discrimination*

In [29]:
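def convert_binary(cols):
    # Recode one string-coded column of the global `data` frame: '1' -> 0, '2' -> 1.
    # Values that are not in the mapping (including NaN) become NaN.
    data[cols] = data[cols].map({'1':0, '2':1})
    # data[cols].replace(np.nan, 0, inplace=True)
    print(data[cols].value_counts(dropna=False))

def convert_binary_float(cols):
    # Same recoding for a column that has already been converted to numbers.
    data[cols] = data[cols].map({1:0, 2:1})
    # data[cols].replace(np.nan, 0, inplace=True) # this may need to come out for certain variables
    print(data[cols].value_counts(dropna=False))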

In [30]:
data['RACE_REASON0'] = data.MAINREA0.map({'1':1, '2':1, '3':0, '4':0, '5':0, '6':0, '7':0, '8':0, '9':0})
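
`Series.map` with a dictionary returns NaN for every value that is not one of the dictionary's keys, so missing responses stay missing and the result is a float column. A short sketch on invented values, where `mainrea` and `race_map` are hypothetical names and `'12'` is a made-up code that is absent from the mapping:

```python
import numpy as np
import pandas as pd

# Invented responses; '12' is a made-up code that the mapping does not cover.
mainrea = pd.Series(['1', '2', '5', np.nan, '12'])

# Same mapping as in the notebook: codes '1' and '2' -> 1, codes '3'..'9' -> 0.
race_map = {str(code): int(code <= 2) for code in range(1, 10)}

race_reason = mainrea.map(race_map)

print(race_reason.tolist())
```

Comparing `value_counts(dropna=False)` before and after the mapping is a quick way to spot codes that fell through.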

*Outcome measures*

In [31]:
# Difference between Lipid Vascular Age and actual age

data['LV_AGE_DIFF0'] = data.LV_AGE0 - data.AGE0

# Difference between BMI Vascular Age and actual age

data['BV_AGE_DIFF0'] = data.BV_AGE0 - data.AGE0


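# Relative difference between Lipid Vascular Age and actual age,
# as a fraction of actual age (0.10 means 10%; not multiplied by 100)

data['LV_AGE_PCT0'] = (data.LV_AGE0 - data.AGE0) / data.AGE0

# Relative difference between BMI Vascular Age and actual age,
# as a fraction of actual age (0.10 means 10%; not multiplied by 100)

data['BV_AGE_PCT0'] = (data.BV_AGE0 - data.AGE0) / data.AGE0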
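
A quick worked example of the two kinds of outcome measure, on invented ages. It shows that the `_PCT0` variables hold a fraction of actual age rather than a value already multiplied by 100.

```python
import pandas as pd

# Invented ages: one vascular age above actual age, one below.
demo = pd.DataFrame({'AGE0': [50.0, 40.0], 'LV_AGE0': [55.0, 38.0]})

# Absolute difference in years.
demo['LV_AGE_DIFF0'] = demo.LV_AGE0 - demo.AGE0

# Difference relative to actual age, expressed as a fraction.
demo['LV_AGE_PCT0'] = (demo.LV_AGE0 - demo.AGE0) / demo.AGE0

print(demo[['LV_AGE_DIFF0', 'LV_AGE_PCT0']])
```

A vascular age of 55 at an actual age of 50 gives a difference of 5 years and a relative difference of 0.1.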


###Replacing category numbers with names###

Race

In [32]:
data['RACE'] = data.ETHNIC.map({1:'Black', 8:'Asian', 9:'Asian', 10:'Caucasian', 13:'Hispanic'})

BMI (putting into categories)

In [33]:
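# Compare against the numeric BMI0 column. BMI_CAT holds a mix of strings and
# numbers after the first assignment, so comparing it with a number is unreliable.
data['BMI_CAT'] = data.BMI0.astype(object)

data.loc[data.BMI0 < 18, 'BMI_CAT'] = 'Underweight'
data.loc[(data.BMI0 >= 18) & (data.BMI0 < 25), 'BMI_CAT'] = 'HealthyWeight'
data.loc[(data.BMI0 >= 25) & (data.BMI0 < 30), 'BMI_CAT'] = 'Overweight'
data.loc[(data.BMI0 >= 30) & (data.BMI0 < 100), 'BMI_CAT'] = 'Obese'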
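
The same binning can also be written with `pd.cut`, sketched here on invented BMI values with the notebook's cut points (18, 25, 30, 100). One difference to be aware of: `pd.cut` returns NaN for anything outside the bins, whereas the cell above leaves a BMI of 100 or more as its original number.

```python
import numpy as np
import pandas as pd

# Invented BMI values, including the boundaries 18.0 and 25.0 and a missing value.
bmi = pd.Series([17.2, 18.0, 24.9, 25.0, 31.4, np.nan])

# right=False makes each bin closed on the left: [0, 18), [18, 25), [25, 30), [30, 100).
bmi_cat = pd.cut(bmi, bins=[0, 18, 25, 30, 100], right=False,
                 labels=['Underweight', 'HealthyWeight', 'Overweight', 'Obese'])

print(bmi_cat.tolist())
```

The result is an ordered categorical, which keeps the categories in a sensible order in tables and plots.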

###Writing to a new CSV###

In [34]:
data.to_csv('991_CleanedData.csv', index = False)

In [35]:
data.shape

(3302, 860)