## Univariate Data Analysis - NHANES Case Study
Here we work with the NHANES data to brush up on our data analysis skills.

First, we import the Python packages required for this project.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load the data into a pandas DataFrame.

In [2]:
da = pd.read_csv('./data/nhanes_2015_2016.csv')
# Check that the data was imported by inspecting its shape (rows, columns)
da.shape

(5735, 28)

Let's look at the first few rows of the data and its columns.

In [3]:
da.head()

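| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXSY2 | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83732 | 1.0 | NaN | 1.0 | 1 | 1 | 62 | 3 | 1.0 | 5.0 | ... | 124.0 | 64.0 | 94.8 | 184.5 | 27.8 | 43.3 | 43.6 | 35.9 | 101.1 | 2.0 |
| 1 | 83733 | 1.0 | NaN | 6.0 | 1 | 1 | 53 | 3 | 2.0 | 3.0 | ... | 140.0 | 88.0 | 90.4 | 171.4 | 30.8 | 38.0 | 40.0 | 33.2 | 107.9 | NaN |
| 2 | 83734 | 1.0 | NaN | NaN | 1 | 1 | 78 | 3 | 1.0 | 3.0 | ... | 132.0 | 44.0 | 83.4 | 170.1 | 28.8 | 35.6 | 37.0 | 31.0 | 116.5 | 2.0 |
| 3 | 83735 | 2.0 | 1.0 | 1.0 | 2 | 2 | 56 | 3 | 1.0 | 5.0 | ... | 134.0 | 68.0 | 109.8 | 160.9 | 42.4 | 38.5 | 37.7 | 38.3 | 110.1 | 2.0 |
| 4 | 83736 | 2.0 | 1.0 | 1.0 | 2 | 2 | 42 | 4 | 1.0 | 4.0 | ... | 114.0 | 54.0 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2.0 |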


In [5]:
# There are 28 columns; list all of them
da.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

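The **value_counts** method counts how often each distinct value occurs in a column. By default it leaves out missing values, so summing its counts gives the number of non-missing entries.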


In [13]:
da["DMDEDUC2"].value_counts().values.sum()

5474

The data has 5735 rows in total, but the counts for this column add up to only 5474, so the remaining 261 values are null.
We can count the null values more directly with the **isnull** method.

In [17]:
pd.isnull(da["DMDEDUC2"]).sum()

261
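The relationship between these three numbers can be checked on a small example. The sketch below uses a made-up toy column (`educ` is a hypothetical stand-in for `da["DMDEDUC2"]`, not the real data) and also shows the `dropna=False` option of `value_counts`, which lists missing values as their own row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for da["DMDEDUC2"]
educ = pd.Series([1.0, 3.0, 3.0, np.nan, 5.0, np.nan])

n_total = len(educ)                      # all rows, including missing ones
n_observed = educ.value_counts().sum()   # value_counts skips NaN by default
n_missing = educ.isnull().sum()          # direct count of NaN entries

# dropna=False keeps NaN as a separate row in the frequency table
counts_with_nan = educ.value_counts(dropna=False)

print(n_total, n_observed, n_missing)
print(counts_with_nan)
```

The observed and missing counts always add up to the total number of rows, which is the same arithmetic as 5474 + 261 = 5735 above.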

In some cases it is useful to replace integer codes with a text label that reflects the code's meaning. Below we create a new variable called 'DMDEDUC2x' that is recoded with text labels, then we generate its frequency distribution.

In [19]:
da["DMDEDUC2x"] = da["DMDEDUC2"].replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 
                                       7: "Refused", 9: "Don't know"})
da["DMDEDUC2x"].value_counts()

Some college/AA    1621
College            1366
HS/GED             1186
<9                  655
9-11                643
Don't know            3
Name: DMDEDUC2x, dtype: int64
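One detail of `replace` worth knowing: codes that do not appear in the dictionary are left unchanged, whereas `Series.map` with the same dictionary turns them into NaN. The sketch below illustrates the difference on toy data; the code `8.0` is invented purely for illustration and is not an NHANES code.

```python
import numpy as np
import pandas as pd

# Toy codes; 8.0 is a made-up value with no entry in the label dictionary
codes = pd.Series([1.0, 3.0, 5.0, 9.0, np.nan, 8.0])
labels = {1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA",
          5: "College", 7: "Refused", 9: "Don't know"}

replaced = codes.replace(labels)  # unmapped values (8.0) are kept as-is
mapped = codes.map(labels)        # unmapped values (8.0) become NaN

print(pd.DataFrame({"code": codes, "replace": replaced, "map": mapped}))
```

Which behaviour is preferable depends on the goal: `replace` never loses information, while `map` makes unexpected codes easy to spot as missing values.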

We will also want to have a relabeled version of the gender variable, so we will construct that now as well. We will follow a convention here of appending an 'x' to the end of a categorical variable's name when it has been recoded from numeric to string (text) values.

In [20]:
da["RIAGENDRx"] = da.RIAGENDR.replace({1 : "Male", 2: "Female"})
da["RIAGENDRx"].value_counts()

Female    2976
Male      2759
Name: RIAGENDRx, dtype: int64

In some cases we will want to treat the missing response category as another category of observed response, rather than ignoring it when creating summaries. Below we create a new category called "Missing" and assign all missing values to it using **fillna**. Then we recalculate the frequency distribution. The 261 missing responses make up about 4.6% of the 5735 rows.

In [22]:
da["DMDEDUC2x"] = da["DMDEDUC2x"].fillna("Missing")
da["DMDEDUC2x"].value_counts()

Some college/AA    1621
College            1366
HS/GED             1186
<9                  655
9-11                643
Missing             261
Don't know            3
Name: DMDEDUC2x, dtype: int64

For many purposes it is more relevant to consider the proportion of the sample with each of the possible category values, rather than the number of people in each category.  We can do this as follows:

In [23]:
x = da["DMDEDUC2x"].value_counts()
x /= x.sum()
x

Some college/AA    0.282650
College            0.238187
HS/GED             0.206800
<9                 0.114211
9-11               0.112119
Missing            0.045510
Don't know         0.000523
Name: DMDEDUC2x, dtype: float64
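The same proportions can also be obtained in one step with the `normalize=True` option of `value_counts`. The sketch below compares both approaches on a small made-up column (`educ` is a hypothetical stand-in for `da["DMDEDUC2x"]`).

```python
import pandas as pd

# Hypothetical toy column standing in for da["DMDEDUC2x"]
educ = pd.Series(["College", "HS/GED", "College", "Missing",
                  "<9", "College", "HS/GED", "<9"])

# Manual approach: counts divided by their total
manual = educ.value_counts()
manual = manual / manual.sum()

# Built-in approach: value_counts returns proportions directly
direct = educ.value_counts(normalize=True)

print(direct)
```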

To generate a numerical summary of a column, we can use the **describe()** method, which reports the count, mean, standard deviation, minimum, quartiles, and maximum.
This method already skips missing (NaN) values, so the ***dropna()*** call below is not strictly required; it makes the handling of missing values explicit.

In [24]:
# Summarize standing height in centimeters (BMXHT)
da["BMXHT"].dropna().describe()

count    5673.000000
mean      166.142834
std        10.079264
min       129.700000
25%       158.700000
50%       166.000000
75%       173.500000
max       202.700000
Name: BMXHT, dtype: float64
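As a quick check of how `describe()` treats missing values, the sketch below runs it with and without `dropna()` on a toy series (`height` is made-up data, not the NHANES column). Both calls give the same summary, and `count` reports only the non-missing entries.

```python
import numpy as np
import pandas as pd

# Toy heights with one missing value (hypothetical data)
height = pd.Series([150.0, 160.0, np.nan, 170.0, 180.0])

with_nan = height.describe()              # NaN is skipped automatically
without_nan = height.dropna().describe()  # NaN removed explicitly first

print(with_nan)
print(with_nan.equals(without_nan))
```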

We can also calculate these values individually using NumPy and pandas.

In [26]:
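x = da["BMXHT"].dropna()       # drop missing heights first
print(x.mean())                # mean
print(x.median())              # median
print(np.percentile(x,50))     # 50th percentile, equal to the median
print(np.percentile(x,75))     # 75th percentile with NumPy
print(x.quantile(0.75))        # 75th percentile with pandas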

166.14283447911131
166.0
166.0
173.5
173.5
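Dropping missing values first matters more for the NumPy functions than for the pandas methods. The sketch below, on made-up toy data, shows that `Series.quantile` skips NaN on its own, `np.percentile` returns NaN if any input is NaN, and `np.nanpercentile` is the NaN-aware NumPy alternative.

```python
import numpy as np
import pandas as pd

# Toy heights with one missing value (hypothetical data)
height = pd.Series([150.0, 160.0, np.nan, 170.0, 180.0])

q_pandas = height.quantile(0.75)       # pandas skips NaN
p_numpy = np.percentile(height, 75)    # NaN in the input gives NaN
p_nan = np.nanpercentile(height, 75)   # NumPy variant that ignores NaN

print(q_pandas, p_numpy, p_nan)
```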
