# 5. Data Analysis

We worked with summary statistics in the previous section. In this section, we will use NumPy to compute them for the NHANES dataset.

## Load dataset

By now, you should be comfortable with what this code is doing.

In [2]:
import pandas as pd
df = pd.read_csv('Data/NHANES.csv')

In [4]:
df.head()

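|  | ID | SurveyYr | Gender | Age | AgeDecade | AgeMonths | Race1 | Race3 | Education | MaritalStatus | ... | RegularMarij | AgeRegMarij | HardDrugs | SexEver | SexAge | SexNumPartnLife | SexNumPartYear | SameSex | SexOrientation | PregntNow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 1 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 2 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 3 | 51625 | 2009_10 | male | 4 | 0-9 | 49.0 | Other | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 51630 | 2009_10 | female | 49 | 40-49 | 596.0 | White | NaN | Some College | LivePartner | ... | No | NaN | Yes | Yes | 12.0 | 10.0 | 1.0 | Yes | Heterosexual | NaN |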


Delete the rows where `Height` is NaN:

In [3]:
df = df.dropna(subset=['Height'])
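
To make the effect of `subset` concrete, here is a minimal sketch on a small made-up table (the `toy` dataframe and its values are hypothetical, not NHANES data). It compares dropping rows by one column with dropping rows that have a NaN anywhere.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from NHANES
toy = pd.DataFrame({
    'Height': [164.7, np.nan, 168.4, 105.4],
    'Education': ['High School', 'Some College', np.nan, np.nan],
})

# Drop only the rows whose Height is missing
by_height = toy.dropna(subset=['Height'])

# Drop every row that has a missing value in any column
any_missing = toy.dropna()

print(len(toy), len(by_height), len(any_missing))
```

With `subset=['Height']`, rows that are missing only `Education` are kept, which is why the notebook uses it here: we lose no height measurements just because another column is empty.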

## A note: why do I use the NHANES dataset so often?

Much social science research draws its conclusions from observational data, because running an experiment with human subjects requires a lot of time and money. The NHANES dataset is the largest dataset for understanding the health of Americans, and the U.S. government spends more than $120M/year to gather this data.

Many data analysis exercises and examples in upper-level statistics textbooks actually come from the social sciences, and I believe it's very important for people in the social sciences to learn statistical concepts and apply them to real-world datasets from their own field.

I hope this NHANES dataset serves as an introduction to data analysis in the social sciences.

## Get some information about people's height

People's height is stored in the `Height` column.

Remember the following from the previous section:

- Mean: np.mean(a)
- Variance: np.var(a)
- Standard Deviation: np.std(a)
- Maximum: np.max(a)
- Minimum: np.min(a)
- Median: np.median(a)
- qth percentile: np.percentile(a, q, interpolation='midpoint') (in NumPy 1.22 and later, the `interpolation` argument is named `method`)
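
As a quick refresher, the sketch below applies these functions to a tiny made-up array (the values in `a` are hypothetical and chosen so the results are easy to check by hand). The percentile calls use NumPy's default rule only to keep the example short.

```python
import numpy as np

# Hypothetical heights in cm, chosen so the results are easy to verify
a = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean = np.mean(a)          # 170.0
variance = np.var(a)       # 200.0
std = np.std(a)            # square root of the variance
maximum = np.max(a)        # 190.0
minimum = np.min(a)        # 150.0
median = np.median(a)      # 170.0
q1 = np.percentile(a, 25)  # 160.0
q3 = np.percentile(a, 75)  # 180.0

print(mean, variance, std, maximum, minimum, median, q3 - q1)
```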

## Do not forget to run this code:

In [13]:
import numpy as np

## 1. Get mean, variance and standard deviation

These are especially important in inference problems that we will cover in the future.

In [13]:
height = df['Height']

mean = np.mean(height)
variance = np.var(height)
std = np.std(height)

print("The average height is:", mean)
print("The variance of height is:", variance)
print("The standard deviation of height is:", std)

The average height is: 161.877837669743
The variance of height is: 407.45522978063894
The standard deviation of height is: 20.18552029997342
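
One detail that matters for the inference problems mentioned above: `np.var()` and `np.std()` divide by n by default (`ddof=0`), while the sample variance divides by n - 1 (`ddof=1`). The sketch below shows the difference on hypothetical values. It is an illustration of the `ddof` parameter, not a change to the analysis in this section.

```python
import numpy as np
import pandas as pd

sample = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical values

population_var = np.var(sample)       # divides by n      -> 200.0
sample_var = np.var(sample, ddof=1)   # divides by n - 1  -> 250.0

# pandas uses ddof=1 by default when you call the method directly
pandas_var = pd.Series(sample).var()  # 250.0

print(population_var, sample_var, pandas_var)
```

With thousands of rows, as in NHANES, the two versions are nearly identical, but the distinction becomes visible with small samples.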


## 2. Get max, min, median and IQR (Interquartile range)

These are important for understanding the distribution of the data.

In [14]:
maximum = np.max(height)
minimum = np.min(height)
median = np.median(height)

q1 = np.percentile(height, 25, interpolation='midpoint')
q3 = np.percentile(height, 75, interpolation='midpoint')

print("The maximum is:", maximum)
print("The minimum is:", minimum)
print("The median is:", median)
print("The interquartile range is:", q3-q1)

The maximum is: 200.4
The minimum is: 83.6
The median is: 166.0
The interquartile range is: 17.69999999999999
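
The `'midpoint'` rule averages the two data points that surround the requested percentile, whereas NumPy's default interpolates linearly between them. The sketch below compares the two on hypothetical data. `midpoint_percentile` is a made-up helper name, written on the assumption that you may need to run on NumPy versions both before and after 1.22, where the keyword was renamed from `interpolation` to `method`.

```python
import numpy as np

def midpoint_percentile(a, q):
    """Percentile with the 'midpoint' rule on both old and new NumPy."""
    try:
        # NumPy 1.22 and later name the argument `method`
        return np.percentile(a, q, method='midpoint')
    except TypeError:
        # Older NumPy only accepts the `interpolation` keyword
        return np.percentile(a, q, interpolation='midpoint')

values = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data

q1_linear = np.percentile(values, 25)          # 1.75, linear interpolation
q1_midpoint = midpoint_percentile(values, 25)  # 1.5, average of 1.0 and 2.0

print(q1_linear, q1_midpoint)
```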


## Reminder:

`np.percentile()` will not give you a usable number (it returns `nan`) if you don't delete the NaNs first.
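
In recent NumPy versions the problem typically shows up as a `nan` result rather than a number. A minimal sketch on hypothetical values, together with two ways around it:

```python
import numpy as np

# Hypothetical heights with one missing value
heights = np.array([160.0, np.nan, 170.0, 180.0])

with_nan = np.percentile(heights, 50)         # nan: the missing value propagates
ignoring_nan = np.nanpercentile(heights, 50)  # 170.0: NaNs are skipped

# Same result after removing the NaNs yourself
cleaned = np.percentile(heights[~np.isnan(heights)], 50)

print(with_nan, ignoring_nan, cleaned)
```

In this notebook the NaNs are removed once with `dropna()`, so the plain `np.percentile()` call is enough.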

## Conclusions:

The mean was about 162 cm, while the median was 166 cm. A median greater than the mean suggests that the height data is negatively skewed.

<img src="Image/skew.png" width=600>

However, I think it is incorrect to conclude that height itself is negatively skewed. Here, we are including height data from all people, including toddlers. It's not a good idea to pool the height data from children and adults into one distribution.

Also, it is well known that the average heights of males and females differ. We will therefore create separate datasets for adult males and adult females.
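
To see how mixing children and adults can pull the mean below the median, here is a small sketch with made-up heights (none of these numbers come from NHANES):

```python
import numpy as np

adults = np.array([160.0, 165.0, 170.0, 175.0, 180.0])  # hypothetical adults
children = np.array([100.0, 110.0])                      # hypothetical children
mixed = np.concatenate([children, adults])

# Adults only: symmetric, so mean and median agree
print(np.mean(adults), np.median(adults))

# Mixed: the short children drag the mean down more than the median
print(np.mean(mixed), np.median(mixed))
```

The adult-only data is symmetric, yet the mixed data looks negatively skewed. The skew comes from combining two different groups, not from the shape of either group.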

## Getting data with conditions

Step 1:

Create a variable that holds the condition.

Example:
```Python
condition1 = df['Age'] >= 20 # Get data of people who are 20 years old or older
```

Step 2:

Put that variable in square brackets after the dataframe variable.

Example:
```Python
adult = df[condition1]
```

If you have multiple conditions, you can combine them with `&` (and) to create a new dataframe:
```Python
df_new = df[condition1 & condition2]
```
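
The two steps can be tried end to end on a small made-up table. In this sketch the `toy` dataframe and the choice of `condition2` are assumptions for illustration only. It also shows `|` (or) and `~` (not), which work the same way as `&`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Gender': ['male', 'female', 'female', 'male'],
    'Age': [34, 49, 12, 4],
})

condition1 = toy['Age'] >= 20           # True, True, False, False
condition2 = toy['Gender'] == 'female'  # False, True, True, False

both = toy[condition1 & condition2]     # adult AND female
either = toy[condition1 | condition2]   # adult OR female
not_adult = toy[~condition1]            # NOT adult

# Written inline, each condition needs its own parentheses,
# because & binds more tightly than >= and ==
inline = toy[(toy['Age'] >= 20) & (toy['Gender'] == 'female')]

print(len(both), len(either), len(not_adult), len(inline))
```

Use `&`, `|` and `~` here rather than Python's `and`, `or` and `not`, which do not work element by element on a pandas Series.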

## Get adult male and female data

We are only interested in the `Gender`, `Age` and `Height` columns, so we will select only those from the original dataframe.

In [4]:
df = pd.read_csv('Data/NHANES.csv')

In [5]:
df = df[['Gender', 'Age', 'Height']]

In [9]:
df = df.dropna()

In [10]:
df

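|  | Gender | Age | Height |
|---|---|---|---|
| 0 | male | 34 | 164.7 |
| 1 | male | 34 | 164.7 |
| 2 | male | 34 | 164.7 |
| 3 | male | 4 | 105.4 |
| 4 | female | 49 | 168.4 |
| ... | ... | ... | ... |
| 9994 | male | 28 | 177.3 |
| 9995 | male | 28 | 177.3 |
| 9997 | male | 27 | 175.8 |
| 9998 | male | 60 | 168.8 |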


Looks great.

## Create the adult male and female dataframes

In [12]:
adult = df['Age'] >= 20

# Remember that == is a comparison that checks whether two values are equal
male = df['Gender'] == 'male'
female = df['Gender'] == 'female'

# We will create two separate dataframes for male and female

male_df = df[adult & male]
female_df = df[adult & female]

In [19]:
male_df.head()

|  | Gender | Age | Height |
|---|---|---|---|
| 0 | male | 34 | 164.7 |
| 1 | male | 34 | 164.7 |
| 2 | male | 34 | 164.7 |
| 10 | male | 66 | 169.5 |
| 11 | male | 58 | 181.9 |


In [20]:
female_df.head()

|  | Gender | Age | Height |
|---|---|---|---|
| 4 | female | 49 | 168.4 |
| 7 | female | 45 | 166.7 |
| 8 | female | 45 | 166.7 |
| 9 | female | 45 | 166.7 |
| 14 | female | 58 | 148.1 |


## Let's see if the conditions are satisfied

In [14]:
# Minimum age of male dataset
print("Min age is:", np.min(male_df['Age']))

Min age is: 20


In [15]:
# Gender
male_df['Gender'].value_counts()

male    3524
Name: Gender, dtype: int64

In [16]:
# Minimum age of female dataset
print("Min age is:", np.min(female_df['Age']))

Min age is: 20


In [17]:
# Gender
female_df['Gender'].value_counts()

female    3658
Name: Gender, dtype: int64
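
The same checks can also be written as `assert` statements, which stop with an error if a condition is violated instead of relying on you to read the printed output. A minimal sketch on hypothetical data (`toy` and `toy_male` are made-up names standing in for `df` and `male_df`):

```python
import pandas as pd

# Hypothetical toy data standing in for the NHANES columns
toy = pd.DataFrame({
    'Gender': ['male', 'male', 'female', 'male'],
    'Age': [34, 4, 49, 66],
    'Height': [164.7, 105.4, 168.4, 169.5],
})
toy_male = toy[(toy['Age'] >= 20) & (toy['Gender'] == 'male')]

# Every remaining row must be an adult
assert (toy_male['Age'] >= 20).all()

# 'male' must be the only gender left
assert set(toy_male['Gender']) == {'male'}

print(len(toy_male))
```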

## This looks perfectly fine

Next, I will find the summary statistics of males and females.

Here, I can copy and paste the code that I wrote before to find the summary statistics of the overall data. I DO NOT NEED to write the code again.

That is why I stored `df['Height']` in a variable in the first code cell.

In [21]:
# Only change df['Height'] to be male_df['Height']

height = male_df['Height']

mean = np.mean(height)
variance = np.var(height)
std = np.std(height)

print("Male adult average height:", mean)
print("Variance of male adult height:", variance)
print("Standard deviation of male adult height:", std)

Male adult average height: 175.78927355278043
Variance of male adult height: 55.92104271858525
Standard deviation of male adult height: 7.478037357394335


In [22]:
maximum = np.max(height)
minimum = np.min(height)
median = np.median(height)

q1 = np.percentile(height, 25, interpolation='midpoint')
q3 = np.percentile(height, 75, interpolation='midpoint')

print("Max male adult height:", maximum)
print("Min male adult height:", minimum)
print("The median is:", median)
print("The interquartile range is:", q3-q1)

Max male adult height: 200.4
Min male adult height: 149.4
The median is: 175.6
The interquartile range is: 9.599999999999994
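
If you find yourself copying the same cells for several groups, one option is to wrap them in a function. The sketch below is only one possible way to do it: `summarize` is a hypothetical helper, the data is made up, and for brevity it uses NumPy's default percentile rule instead of `'midpoint'`, so its quartiles can differ slightly from the cells above.

```python
import numpy as np
import pandas as pd

def summarize(height):
    """Collect the summary statistics from this section in one dict."""
    # Default percentile rule (linear interpolation), not 'midpoint'
    q1, q3 = np.percentile(height, [25, 75])
    return {
        'mean': np.mean(height),
        'variance': np.var(height),
        'std': np.std(height),
        'max': np.max(height),
        'min': np.min(height),
        'median': np.median(height),
        'iqr': q3 - q1,
    }

# Hypothetical toy data
toy = pd.DataFrame({
    'Gender': ['male', 'female', 'male', 'female', 'male', 'female'],
    'Height': [170.0, 150.0, 180.0, 160.0, 190.0, 170.0],
})

male_stats = summarize(toy.loc[toy['Gender'] == 'male', 'Height'])
female_stats = summarize(toy.loc[toy['Gender'] == 'female', 'Height'])

print(male_stats['mean'], female_stats['mean'])
```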


## Conclusions:

There are a few notable results here:

1. The mean and median of male adult height are extremely close (175.79 vs 175.6). This is consistent with male adult height following a roughly symmetric distribution.

2. The interquartile range became smaller (9.6 vs 17.7).

3. The variance of male adult height is 55.92, while the variance of the original height data is 407.46. This difference is important to notice, because it will be relevant to the ANOVA test that we will cover in the future.
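
One way to see why the variance drops so much inside a single group: the variance of the combined data can be split into the spread within each group plus the spread between the group means. The sketch below checks this on two made-up groups, using the default `np.var()` (which divides by n). It is meant as intuition for point 3, not as the ANOVA procedure itself.

```python
import numpy as np

group_a = np.array([1.0, 2.0, 3.0])  # hypothetical group
group_b = np.array([7.0, 8.0, 9.0])  # hypothetical group with a larger mean
combined = np.concatenate([group_a, group_b])

n_a, n_b = len(group_a), len(group_b)
n = n_a + n_b

# Average spread inside the groups, weighted by group size
within = (n_a * np.var(group_a) + n_b * np.var(group_b)) / n

# Spread of the group means around the overall mean, weighted by group size
overall_mean = np.mean(combined)
between = (n_a * (np.mean(group_a) - overall_mean) ** 2
           + n_b * (np.mean(group_b) - overall_mean) ** 2) / n

total = np.var(combined)

print(within, between, total)
```

Here the two groups are tight on their own, but their means are far apart, so most of the total variance comes from the between-group part. The same thing happens when children, adult males and adult females are pooled in one height column.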

## Practice problems:

- Get the summary statistics for the female dataset
- Get the summary statistics for children (Age <= 15). Compare the summary statistics of males and females.