# **Data Exploration**

## Importing Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('99053_AllData.csv')

pd.set_option('display.max_rows', 120)

## Some data cleaning

Replace the values this notebook treats as missing (a single space, `-9`, and `-1`) with NumPy `NaN`.

In [2]:
data.replace(' ', np.nan, inplace=True)
data.replace('-9', np.nan, inplace=True)
data.replace('-1', np.nan, inplace=True)
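A minimal sketch of what the replacement above does, run on a made-up frame rather than the real file. The values and the `VISIT` column are hypothetical. It also illustrates one caveat: replacing the string `'-9'` only touches text cells, so a column pandas already parsed as numeric keeps its `-9` values.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; VISIT is a hypothetical numeric column.
toy = pd.DataFrame({
    'AGE0': ['46', ' ', '-1'],      # object column holding text
    'P_STRESS': ['8', '-9', '12'],  # object column holding text
    'VISIT': [0, -9, 7],            # integer column
})

# Same idea as the three replace() calls above, written as a single call.
cleaned = toy.replace([' ', '-9', '-1'], np.nan)

# Text sentinels become NaN, but the integer -9 is not equal to the string '-9'.
print(cleaned.isnull().sum())
print(cleaned['VISIT'].tolist())
```

If numeric columns can also carry these codes, the numeric forms (`-9`, `-1`) would need to be added to the list.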

Converting object type to float

In [3]:
data.dtypes

# Columns that need to change from object to float
float_cols = ['AGE0', 'INTDAY7', 'AGE7', 'INTDAY10', 'AGE10', 'P_STRESS', 'BMI0', 'BMI7', 'BMI10']
data[float_cols] = data[float_cols].astype(float)
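`astype(float)` raises an error if any unexpected text survives the cleaning step. The sketch below, on made-up values, contrasts it with `pd.to_numeric(..., errors='coerce')`, which turns unparseable cells into `NaN` instead. Whether silent coercion is desirable for this dataset is an open question; this is only an illustration of the two behaviors.

```python
import numpy as np
import pandas as pd

# Made-up values; 'n/a?' stands in for any leftover non-numeric text.
toy = pd.DataFrame({
    'AGE0': ['46', np.nan, '52'],
    'BMI0': ['23.5', '31.2', 'n/a?'],
})

# Works: every non-null cell is a numeric string.
age = toy['AGE0'].astype(float)

# Fails loudly on the leftover text.
try:
    toy['BMI0'].astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

# Alternative: coerce anything unparseable to NaN.
bmi = pd.to_numeric(toy['BMI0'], errors='coerce')

print(age.dtype, astype_failed, bmi.isnull().sum())
```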

Add a `RACE` column that combines Japanese and Chinese into Asian, and an `ETHNIC_CAT` column that shows each ethnicity's name instead of its data code.

In [4]:
data['RACE'] = data.ETHNIC.map({1:'Black', 8:'Asian', 9:'Asian', 10:'Caucasian', 13:'Hispanic'})
data['ETHNIC_CAT'] = data.ETHNIC.map({1:'Black', 8:'Japanese', 9:'Chinese', 10:'Caucasian', 13:'Hispanic'})
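One thing worth checking after a dictionary `map` is whether any code fell outside the dictionary, since `Series.map` returns `NaN` for keys it does not find. A small sketch on made-up codes follows; the value `99` is hypothetical and is not claimed to occur in the dataset.

```python
import pandas as pd

race_map = {1: 'Black', 8: 'Asian', 9: 'Asian', 10: 'Caucasian', 13: 'Hispanic'}

# Made-up codes; 99 is a hypothetical value that is not in the mapping.
ethnic = pd.Series([1, 8, 9, 10, 13, 99])
race = ethnic.map(race_map)

# Codes that did not map to a label show up as NaN in the result.
unmapped_codes = ethnic[race.isnull()].unique().tolist()
print(unmapped_codes)
```

On the real data, the same check would be `data.loc[data.RACE.isnull(), 'ETHNIC'].unique()`.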

## Handling Null Values

Count the null values in each column.

In [5]:
data.isnull().sum()

SWANID                0
AGE0                  5
COURTES0             12
RESPECT0             13
POORSER0              8
NOTSMAR0             10
AFRAIDO0              9
DISHONS0             10
BETTER0               8
INSULTE0              9
HARASSE0              8
IGNORED0              8
DISC_AVGSCORE0        6
DISC_SUMSCORE0        0
MAINREA0           1791
RACE_REASON0       1788
INTDAY7             982
AGE7                982
COURTES7           1146
RESPECT7           1147
POORSER7           1146
NOTSMAR7           1146
AFRAIDO7           1146
DISHONS7           1146
BETTER7            1146
INSULTE7           1146
HARASSE7           1146
IGNORED7           1146
DISC_AVGSCORE7     1147
DISC_SUMSCORE7     1147
BCRACE7            2330
BCETHN7            2330
BCGENDR7           2329
BCAGE7             2329
BCINCML7           2329
BCLANG7            2330
BCWGHT7            2330
BCPHAPP7           2329
BCORIEN7           2329
OTHEREX7           2329
RACE_REASON7       2297
INTDAY10        

In [None]:
# Need to figure out how to handle null values
## data.dropna()            drops a row if any values are null
## data.dropna(how='all')   drops a row only if all values are null
## data.dropna(subset=['col1', 'col2', 'col4'])       # drops a row if any of col1, col2, or col4 is null
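The `dropna` options noted above can be compared on a small made-up frame. This is only a sketch of how the options behave, not a recommendation for which one to apply to the survey data.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 2 is entirely null.
toy = pd.DataFrame({
    'col1': [1.0, np.nan, np.nan, 4.0],
    'col2': [1.0, 2.0, np.nan, 4.0],
    'col4': [1.0, 2.0, np.nan, np.nan],
})

any_null = toy.dropna()                                      # drop rows with any null
all_null = toy.dropna(how='all')                             # drop rows that are all null
subset_any = toy.dropna(subset=['col1', 'col2'])             # any null within the subset
subset_all = toy.dropna(subset=['col1', 'col2'], how='all')  # all null within the subset

for name, frame in [('any', any_null), ('all', all_null),
                    ('subset any', subset_any), ('subset all', subset_all)]:
    print(name, frame.index.tolist())
```

Because the follow-up visits (columns ending in `7` and `10`) have far more nulls than the baseline columns, a plain `dropna()` would discard most rows; restricting `subset` to the columns used in a given analysis keeps more data.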

# Descriptive Analysis

## Perceived Stress

Examining frequencies of Perceived Stress values. P_STRESS ranges from 4 to 20, with 20 indicating high perceived stress. There are 108 null values.

In [None]:
data.P_STRESS.value_counts(sort=False, dropna=False)

The average perceived stress score is 8.6. Perceived stress scores are skewed to the left, as seen in the histogram, bar plot, and box plot below.

In [None]:
data.P_STRESS.describe()
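The direction of skew can also be checked numerically rather than only by eye. The sketch below uses made-up scores on the 4 to 20 scale; `Series.skew()` returns a positive number when the longer tail extends toward high values (conventionally called right-skewed) and a negative number when it extends toward low values (left-skewed).

```python
import pandas as pd

# Made-up scores for illustration: most are low, a few are high.
scores = pd.Series([4, 5, 5, 6, 6, 6, 7, 8, 12, 18])

skewness = scores.skew()   # sample skewness; NaN values are skipped by default
mean = scores.mean()
median = scores.median()

# A mean above the median usually accompanies a positive skew.
print(skewness > 0, mean > median)
```

On the real data the equivalent call would be `data.P_STRESS.skew()`, compared against the `mean` and `50%` rows of `describe()`.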

In [None]:
data.P_STRESS.plot(kind = 'hist', bins = 16, title='Histogram of Perceived Stress')
plt.xlabel('Perceived Stress Score')
plt.ylabel('Count')

In [None]:
data.P_STRESS.value_counts(sort=False, dropna=False).plot(kind='bar', title='Bar Plot of Perceived Stress (includes null)')
plt.xlabel('Perceived Stress Score')
plt.ylabel('Count')

In [None]:
data.P_STRESS.plot(kind='box', title='Boxplot of Perceived Stress')

## Ethnicity
*note: the next couple cells will be repeated for race*

In [None]:
data.ETHNIC_CAT.value_counts(ascending=True, dropna=False)

In [None]:
data.ETHNIC_CAT.value_counts(ascending=True, sort=True, dropna=False).plot(kind='bar', title='Bar Plot of Ethnicity')
plt.xlabel('Ethnicity')
plt.ylabel('Count')

**Ethnicity and Perceived Stress**

In [None]:
data.groupby('ETHNIC_CAT').P_STRESS.describe()

In [None]:
data.boxplot(column='P_STRESS', by = 'ETHNIC_CAT')
plt.xlabel('Ethnicity')
plt.ylabel('Perceived Stress')

**Ethnicity and Age**

In [None]:
data.ETHNIC_CAT.isnull().sum()     # no null values hurrah!

In [None]:
data.groupby('ETHNIC_CAT').AGE0.describe()

In [None]:
data.boxplot(column='AGE0', by = 'ETHNIC_CAT')
plt.xlabel('Ethnicity')
plt.ylabel('Age at Baseline')

**Ethnicity and Education**

In [None]:
data.groupby('ETHNIC_CAT').DEGREE.value_counts(ascending=True, sort=False, dropna=False)

In [None]:
data.groupby('DEGREE').ETHNIC_CAT.value_counts(ascending=True, sort=True, dropna=False)

## Race

Of the 3,302 participants, nearly half are Caucasian. The next largest racial group is Black (28.3%), then Asians (16.1%), and Hispanics (8.6%).

In [None]:
data.RACE.value_counts(ascending=True, dropna=False)

In [None]:
data.RACE.value_counts(ascending=True, dropna=False) / len(data) * 100
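Percentages like these can also come straight from `value_counts` via `normalize=True`, which avoids dividing by a separately computed length. A short sketch on made-up labels, showing that the two approaches agree when nulls are included:

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration, including one missing value.
race = pd.Series(['Caucasian', 'Caucasian', 'Black', 'Asian', None])

# Shares as percentages; dropna=False keeps the missing group in the denominator.
pct = race.value_counts(normalize=True, dropna=False) * 100

# The manual version: counts divided by the number of rows.
manual = race.value_counts(dropna=False) / len(race) * 100

print(pct.round(1))
```

With `dropna=True` (the default) the denominator would exclude missing values, so the shares would no longer match a division by `len(race)`.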

In [None]:
data.RACE.value_counts(ascending=True, sort=True, dropna=False).plot(kind='bar', title='Bar Plot of Race')
plt.xlabel('Race')
plt.ylabel('Count')

**Race and Perceived Stress**

Overall, Hispanics have the highest average perceived stress score (10.1). Blacks and Caucasians have similar average perceived stress scores, but Black perceived stress scores skew to the right. 

In [None]:
data.groupby('RACE').P_STRESS.describe()

In [None]:
data.boxplot(column='P_STRESS', by = 'RACE')
plt.xlabel('Race')
plt.ylabel('Perceived Stress')

**Race and Age**

In [None]:
data.groupby('RACE').AGE0.describe()

In [None]:
data.boxplot(column='AGE0', by = 'RACE')
plt.xlabel('Race')
plt.ylabel('Age at Baseline')

**Race and Education**

In [None]:
data.groupby('RACE').DEGREE.value_counts(ascending=True, sort=False, dropna=False)

In [None]:
data.groupby('DEGREE').RACE.value_counts(ascending=True, sort=True, dropna=False)
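The grouped counts above can be hard to compare across groups of different sizes. One possible alternative, sketched here on made-up data, is `pd.crosstab`, which lays the same counts out as a table and can convert each row to within-group shares. The `DEGREE` codes below are hypothetical placeholders, not the dataset's actual coding.

```python
import numpy as np
import pandas as pd

# Made-up rows; DEGREE codes 1-3 are placeholders for illustration.
toy = pd.DataFrame({
    'RACE': ['Black', 'Black', 'Asian', 'Asian', 'Asian', 'Caucasian'],
    'DEGREE': [1, 2, 2, 3, 3, 1],
})

# Raw counts: one row per race, one column per degree code.
counts = pd.crosstab(toy['RACE'], toy['DEGREE'])

# Row-normalized shares: each row sums to 1, so groups of different sizes are comparable.
shares = pd.crosstab(toy['RACE'], toy['DEGREE'], normalize='index')

print(counts)
print(shares.round(2))
```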