# Can the health and nutritional status of adults and children be used to classify age group?

### Data set: National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset

# 1. Summary

link: https://archive.ics.uci.edu/dataset/887/national+health+and+nutrition+health+survey+2013-2014+(nhanes)+age+prediction+subset

# 2. Introduction

# 3. Methods & Results

### 3.1 Describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

### 3.2 Loading the Data

In [2]:
nhanes = pd.read_csv("data/NHANES_age_prediction.csv")
nhanes.head()

Unnamed: 0,SEQN,age_group,RIDAGEYR,RIAGENDR,PAQ605,BMXBMI,LBXGLU,DIQ010,LBXGLT,LBXIN
0,73564.0,Adult,61.0,2.0,2.0,35.7,110.0,2.0,150.0,14.91
1,73568.0,Adult,26.0,2.0,2.0,20.3,89.0,2.0,80.0,3.85
2,73576.0,Adult,16.0,1.0,2.0,23.2,89.0,2.0,68.0,6.14
3,73577.0,Adult,32.0,1.0,2.0,28.9,104.0,2.0,84.0,16.15
4,73580.0,Adult,38.0,2.0,1.0,35.9,103.0,2.0,81.0,10.92


### 3.3 Cleaning the data

#### Renaming columns
We first renamed the columns of the data set so that they are more meaningful and easier to understand. Below is a short description of each original column.

- SEQN: Respondent Sequence Number
- age_group: Respondent's Age Group (Senior / Adult, where Adult means non-senior)
- RIDAGEYR: Respondent's Age
- RIAGENDR: Respondent's Gender (1 is Male / 2 is Female)
- PAQ605: Whether the respondent engages in weekly moderate or vigorous-intensity physical activity (1 is yes / 2 is no)
- BMXBMI: Respondent's Body Mass Index
- LBXGLU: Respondent's Blood Glucose after fasting
- DIQ010: Whether the respondent is diabetic (1 is yes / 2 is no)
- LBXGLT: Respondent's Oral Glucose Tolerance Test result (two-hour glucose)
- LBXIN: Respondent's Blood Insulin Levels

In [3]:
# Rename the columns to descriptive names (same order as the original columns)
nhanes.columns = ["sequence_number", 
                  "age_group", 
                  "age", 
                  "gender", 
                  "physical_activity", 
                  "bmi", 
                  "blood_glucose", 
                  "diabetic", 
                  "oral", 
                  "blood_insulin"]

#### Checking for strange values
We know that "gender", "physical_activity", and "diabetic" are binary features. However, "physical_activity" and "diabetic" each contain three unique values instead of two.

#### physical_activity
According to the dataset's documentation, 'physical_activity' should only take the values 1 or 2. Entries equal to 7 are therefore replaced with NaN and treated as missing.

#### diabetic
According to the dataset's documentation, 'diabetic' should only take the values 1 or 2. Entries equal to 3 are therefore replaced with NaN and treated as missing.

In [4]:
nhanes.nunique()

sequence_number      2278
age_group               2
age                    69
gender                  2
physical_activity       3
bmi                   340
blood_glucose         101
diabetic                3
oral                  232
blood_insulin        1424
dtype: int64

In [5]:
display(nhanes['physical_activity'].unique())
display(nhanes['diabetic'].unique())

array([2., 1., 7.])

array([2., 1., 3.])

In [6]:
nhanes['physical_activity'] = nhanes['physical_activity'].replace(7, np.nan)
nhanes['diabetic'] = nhanes['diabetic'].replace(3, np.nan)
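
The same cleanup can be written more generally by keeping only the documented codes and treating every other value as missing. The following is a minimal sketch on a small made-up frame; the helper `keep_valid_codes` and the frame `toy` are hypothetical names introduced for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def keep_valid_codes(series, valid=(1, 2)):
    """Return a copy of `series` where any value outside `valid` becomes NaN."""
    return series.where(series.isin(valid), np.nan)


# Made-up values for illustration only (not rows from NHANES).
toy = pd.DataFrame({
    "physical_activity": [2.0, 1.0, 7.0],
    "diabetic": [2.0, 3.0, 1.0],
})

for col in ["physical_activity", "diabetic"]:
    toy[col] = keep_valid_codes(toy[col])

# Number of values per column that were turned into NaN
n_missing = toy.isna().sum()
```

Compared with replacing one specific code at a time, this approach would also catch any other undocumented code that might appear in the column.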

#### Checking for missing values
Using the following code, we count the missing values in each column of the data set. We then drop every row that contains a NaN.

In [7]:
missing_values = nhanes.isnull().sum()
missing_values

sequence_number       0
age_group             0
age                   0
gender                0
physical_activity     1
bmi                   0
blood_glucose         0
diabetic             58
oral                  0
blood_insulin         0
dtype: int64

In [8]:
nhanes_cleaned = nhanes.dropna()
nhanes_cleaned.head()

Unnamed: 0,sequence_number,age_group,age,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
0,73564.0,Adult,61.0,2.0,2.0,35.7,110.0,2.0,150.0,14.91
1,73568.0,Adult,26.0,2.0,2.0,20.3,89.0,2.0,80.0,3.85
2,73576.0,Adult,16.0,1.0,2.0,23.2,89.0,2.0,68.0,6.14
3,73577.0,Adult,32.0,1.0,2.0,28.9,104.0,2.0,84.0,16.15
4,73580.0,Adult,38.0,2.0,1.0,35.9,103.0,2.0,81.0,10.92


### Splitting the data set

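We split the data set before conducting EDA so that we do not break the golden rule of machine learning: the test data must not influence the training of our classification model in any way. Exploring the test data would leak information into our modelling decisions, so all EDA is done on the training set only.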

In [9]:
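# Hold out 30% of the cleaned data as the test set; random_state makes the split reproducible
train_df, test_df = train_test_split(nhanes_cleaned, test_size=0.3, random_state=123)
# Separate the features (X) from the target column (y)
X_train, y_train = train_df.drop(columns=["age_group"]), train_df["age_group"]
X_test, y_test = test_df.drop(columns=["age_group"]), test_df["age_group"]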
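
Because the split above is not stratified, the share of each age group can differ somewhat between the training and test sets. One optional alternative, which the original analysis does not use, is to pass `stratify` to `train_test_split`. The sketch below shows the idea on made-up, imbalanced labels; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels for illustration only (not NHANES rows).
toy = pd.DataFrame({
    "bmi": [20.0 + i for i in range(20)],
    "age_group": ["Adult"] * 16 + ["Senior"] * 4,
})

# stratify asks for the Adult/Senior ratio (80/20 here) to be kept in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=123, stratify=toy["age_group"]
)

# Class proportions in each split, for comparison
train_share = toy_train["age_group"].value_counts(normalize=True)
test_share = toy_test["age_group"].value_counts(normalize=True)
```

Comparing `train_share` and `test_share` is a quick check that both splits reflect the class balance of the full data.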

In [10]:
X_train.head()

Unnamed: 0,sequence_number,age,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
2161,83153.0,20.0,1.0,2.0,18.6,94.0,2.0,70.0,6.37
1063,78404.0,44.0,1.0,2.0,34.5,98.0,2.0,84.0,13.42
2118,82978.0,41.0,2.0,2.0,21.7,97.0,2.0,105.0,6.75
158,74293.0,27.0,2.0,2.0,47.8,105.0,2.0,120.0,16.61
860,77442.0,17.0,2.0,2.0,21.9,83.0,2.0,112.0,21.85


### 3.4 Conducting EDA on the Training Set

In [11]:
nhanes_summary = X_train.describe()
nhanes_summary

Unnamed: 0,sequence_number,age,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
count,1553.0,1553.0,1553.0,1553.0,1553.0,1553.0,1553.0,1553.0,1553.0
mean,78716.967804,41.441082,1.5132,1.820348,27.75235,99.094656,1.990341,114.1217,11.544482
std,2955.479062,20.220289,0.499987,0.384021,7.147187,17.079202,0.097835,46.059072,9.52038
min,73568.0,12.0,1.0,1.0,14.5,63.0,1.0,40.0,1.02
25%,76094.0,23.0,1.0,2.0,22.6,91.0,2.0,87.0,5.69
50%,78822.0,40.0,2.0,2.0,26.7,97.0,2.0,104.0,8.83
75%,81349.0,58.0,2.0,2.0,31.1,103.0,2.0,129.0,14.1
max,83727.0,80.0,2.0,2.0,70.1,405.0,2.0,604.0,102.29


### 3.5 Visualization for EDA

In [23]:
# Plot one histogram per feature column, coloured by age group
features = X_train.columns.tolist()

alt.Chart(train_df).mark_bar(opacity=1).encode(
    x=alt.X(alt.repeat()).type('quantitative').bin(maxbins=40).stack(False),
    y='count()',
    color='age_group'
).repeat(
    features,
    columns=3
).properties(
    title="Fig 1: Feature Distributions by Age Group (EDA)"
)

### 3.6 Classification Analysis

#### Identifying different feature types and transformations

| Feature | Transformation | Explanation |
| --- | --- | --- |
| sequence_number | drop | A numeric feature with no missing values. It is an ID with no predictive meaning, so it should be dropped. |
| age | drop | A numeric feature with no missing values. Using age as a feature would defeat the purpose of our question, since we want to classify age group from health indicators alone. We should drop this feature. |
| gender | one-hot encoding with `drop="if_binary"` | A binary feature with no missing values. 1 is Male, 2 is Female. |
| physical_activity | one-hot encoding with `drop="if_binary"` | A binary feature with no missing values after cleaning. 1 is Yes, 2 is No. |
| bmi | scaling with `StandardScaler` | A numeric feature with no missing values. |
| blood_glucose | scaling with `StandardScaler` | A numeric feature with no missing values. |
| diabetic | one-hot encoding with `drop="if_binary"` | A binary feature with no missing values after cleaning. 1 is Yes, 2 is No. |
| oral | scaling with `StandardScaler` | A numeric feature with no missing values. |
| blood_insulin | scaling with `StandardScaler` | A numeric feature with no missing values. |

#### Identify feature types

In [24]:
numeric_features = ["bmi", "blood_glucose", "oral", "blood_insulin"]
binary_features = ["gender", "physical_activity", "diabetic"]
drop_features = ["sequence_number", "age"]
target = "age_group"