# Can the health and nutritional status of adults and children be used to classify age group?

### Data set: National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset

# 1. Summary

link: https://archive.ics.uci.edu/dataset/887/national+health+and+nutrition+health+survey+2013-2014+(nhanes)+age+prediction+subset

# 2. Introduction

# 3. Methods & Results

### 3.1 Describe in written English the methods used to perform the analysis from beginning to end, narrating the code that does the analysis.

In [25]:
import pandas as pd
import numpy as np
import altair as alt
from ucimlrepo import fetch_ucirepo 
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


### 3.2 Loading the Data

In [2]:
nhanes = fetch_ucirepo(id=887) 

In [3]:
print(nhanes.variables)

        name     role         type demographic  \
0       SEQN       ID   Continuous        None   
1  age_group   Target  Categorical         Age   
2   RIDAGEYR    Other   Continuous         Age   
3   RIAGENDR  Feature   Continuous      Gender   
4     PAQ605  Feature   Continuous        None   
5     BMXBMI  Feature   Continuous        None   
6     LBXGLU  Feature   Continuous        None   
7     DIQ010  Feature   Continuous        None   
8     LBXGLT  Feature   Continuous        None   
9      LBXIN  Feature   Continuous        None   

                                         description units missing_values  
0                         Respondent Sequence Number  None             no  
1         Respondent's Age Group (senior/non-senior)  None             no  
2                                   Respondent's Age  None             no  
3                                Respondent's Gender  None             no  
4  If the respondent engages in moderate or vigor...  None           

### 3.3 Cleaning the data

#### Renaming columns
We first rename the columns of the data set so that they are more meaningful and easier to understand. Below is a short description of each original column.

- RIDAGEYR: Respondent's Age
- RIAGENDR: Respondent's Gender (1 is Male / 2 is Female)
- PAQ605: Does the respondent engage in weekly moderate or vigorous-intensity physical activity (1 is yes / 2 is no)
- BMXBMI: Respondent's Body Mass Index
- LBXGLU: Respondent's Blood Glucose after fasting
- DIQ010: If the Respondent is diabetic (1 is yes / 2 is no)
- LBXGLT: Respondent's Oral Glucose Tolerance Test result (two-hour glucose)
- LBXIN: Respondent's Blood Insulin Levels
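Assigning a list to `X.columns` relies on the columns arriving in a fixed order. A minimal sketch of an alternative, assuming the same short names, maps each NHANES code to its new name explicitly with `DataFrame.rename`. The small `demo` frame and the `column_names` dictionary are illustrative stand-ins, not part of the analysis.

```python
import pandas as pd

# Small stand-in for nhanes.data.features (two rows, same column codes)
demo = pd.DataFrame({
    "RIAGENDR": [2.0, 1.0],
    "PAQ605": [2.0, 2.0],
    "BMXBMI": [35.7, 23.2],
    "LBXGLU": [110.0, 89.0],
    "DIQ010": [2.0, 2.0],
    "LBXGLT": [150.0, 68.0],
    "LBXIN": [14.91, 6.14],
})

# Explicit code-to-name mapping, so the result does not depend on column order
column_names = {
    "RIAGENDR": "gender",
    "PAQ605": "physical_activity",
    "BMXBMI": "bmi",
    "LBXGLU": "blood_glucose",
    "DIQ010": "diabetic",
    "LBXGLT": "oral",
    "LBXIN": "blood_insulin",
}

# rename returns a new frame and leaves `demo` untouched
renamed = demo.rename(columns=column_names)
print(renamed.columns.tolist())
```

Because `rename` returns a new object, the original frame keeps its NHANES codes, which can help when cross-checking against the survey documentation.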

In [4]:
X = nhanes.data.features
X

Unnamed: 0,RIAGENDR,PAQ605,BMXBMI,LBXGLU,DIQ010,LBXGLT,LBXIN
0,2.0,2.0,35.7,110.0,2.0,150.0,14.91
1,2.0,2.0,20.3,89.0,2.0,80.0,3.85
2,1.0,2.0,23.2,89.0,2.0,68.0,6.14
3,1.0,2.0,28.9,104.0,2.0,84.0,16.15
4,2.0,1.0,35.9,103.0,2.0,81.0,10.92
...,...,...,...,...,...,...,...
2273,2.0,2.0,33.5,100.0,2.0,73.0,6.53
2274,1.0,2.0,30.0,93.0,2.0,208.0,13.02
2275,1.0,2.0,23.7,103.0,2.0,124.0,21.41
2276,2.0,2.0,27.4,90.0,2.0,108.0,4.99


In [5]:
#re-naming the columns
X.columns = ["gender", 
             "physical_activity", 
             "bmi", 
             "blood_glucose", 
             "diabetic", 
             "oral", 
             "blood_insulin"]
X.head()

Unnamed: 0,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
0,2.0,2.0,35.7,110.0,2.0,150.0,14.91
1,2.0,2.0,20.3,89.0,2.0,80.0,3.85
2,1.0,2.0,23.2,89.0,2.0,68.0,6.14
3,1.0,2.0,28.9,104.0,2.0,84.0,16.15
4,2.0,1.0,35.9,103.0,2.0,81.0,10.92


In [6]:
y = nhanes.data.targets
y

Unnamed: 0,age_group
0,Adult
1,Adult
2,Adult
3,Adult
4,Adult
...,...
2273,Adult
2274,Adult
2275,Adult
2276,Adult


#### Checking for strange values
"gender", "physical_activity", and "diabetic" are documented as binary features. However, "physical_activity" and "diabetic" each contain three unique values instead of two.

#### physical_activity
According to the dataset's documentation, 'physical_activity' should only have 1 or 2 as values. Rows containing 7 are replaced with NaN and treated as missing.

#### diabetic
According to the dataset's documentation, 'diabetic' should only have 1 or 2 as values. Rows containing 3 are replaced with NaN and treated as missing.
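The replacement of undocumented codes described above can be sketched on made-up data. This is an illustration under assumptions, not the exact cell used in the analysis: the `demo` frame is hypothetical, and the nested-dictionary form of `DataFrame.replace` is one of several ways to restrict each replacement to a single column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the feature table; 7 and 3 play the role of the unexpected codes
demo = pd.DataFrame({
    "physical_activity": [2.0, 1.0, 7.0, 2.0],
    "diabetic": [2.0, 3.0, 1.0, 2.0],
})

# Copy first so the edit is made on an independent frame
cleaned = demo.copy()

# Per-column mapping: only the listed code is replaced in each column
cleaned = cleaned.replace({
    "physical_activity": {7: np.nan},
    "diabetic": {3: np.nan},
})

print(cleaned.isna().sum())
```

Working on an explicit copy keeps the source frame unchanged and avoids assigning into an object that pandas may regard as a slice of another frame.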

In [7]:
X.nunique()

gender                  2
physical_activity       3
bmi                   340
blood_glucose         101
diabetic                3
oral                  232
blood_insulin        1424
dtype: int64

In [8]:
display(X['physical_activity'].unique())
display(X['diabetic'].unique())

array([2., 1., 7.])

array([2., 1., 3.])

In [9]:
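# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
X = X.copy()
X['physical_activity'] = X['physical_activity'].replace(7, np.nan)
X['diabetic'] = X['diabetic'].replace(3, np.nan)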



#### Checking for missing values
Using the following code, we count the missing values in each column. We will drop all rows containing NaN.

In [10]:
missing_values = X.isnull().sum()
missing_values

gender                0
physical_activity     1
bmi                   0
blood_glucose         0
diabetic             58
oral                  0
blood_insulin         0
dtype: int64
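Dropping incomplete rows, as described above, has to remove the same rows from the features and the target. A minimal sketch on made-up data, assuming both frames share an index; `X_demo`, `y_demo` and `complete` are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for X and y with matching indexes
X_demo = pd.DataFrame({
    "physical_activity": [2.0, np.nan, 1.0, 2.0],
    "diabetic": [2.0, 2.0, np.nan, 1.0],
    "bmi": [35.7, 20.3, 23.2, 28.9],
})
y_demo = pd.DataFrame({"age_group": ["Adult", "Adult", "Senior", "Adult"]})

# Boolean mask that is True only for rows with no missing feature
complete = X_demo.notna().all(axis=1)

# Apply the same mask to both frames so features and labels stay aligned
X_complete = X_demo.loc[complete]
y_complete = y_demo.loc[complete]

print(X_complete.shape, y_complete.shape)
```

Calling `X_demo.dropna()` alone would shorten the features but leave the target at its original length, which is why the mask is built once and reused.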

### Splitting the data set

We split the data set before conducting EDA to respect the golden rule of machine learning: the test data must not influence model training in any way. Exploring only the training portion prevents information from the test set from leaking into our modelling decisions.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [12]:
X_train.head()

Unnamed: 0,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
1236,1.0,2.0,26.7,116.0,,103.0,4.19
756,1.0,2.0,19.3,96.0,2.0,80.0,2.95
1331,1.0,2.0,22.5,85.0,2.0,75.0,4.84
1756,2.0,2.0,32.6,97.0,2.0,95.0,18.98
1773,1.0,2.0,27.7,100.0,2.0,126.0,6.24
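The golden rule mentioned above also applies to preprocessing: statistics such as a scaler's mean and standard deviation should be learned from the training rows only. A minimal sketch on randomly generated values (the `values` array is made up, and the split settings simply mirror the ones used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up single numeric feature; the values are illustrative only
rng = np.random.default_rng(123)
values = rng.normal(loc=27.0, scale=7.0, size=(100, 1))

train, test = train_test_split(values, test_size=0.2, random_state=123)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

# Reuse those training statistics on the held-out rows
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(), test_scaled.mean())
```

The scaled test values are generally not centred exactly at zero, which is expected: the test rows played no part in estimating the scaling parameters.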


### 3.4 Conducting EDA on the Training Set

In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1822 entries, 1236 to 1346
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             1822 non-null   float64
 1   physical_activity  1821 non-null   float64
 2   bmi                1822 non-null   float64
 3   blood_glucose      1822 non-null   float64
 4   diabetic           1782 non-null   float64
 5   oral               1822 non-null   float64
 6   blood_insulin      1822 non-null   float64
dtypes: float64(7)
memory usage: 113.9 KB


In [13]:
nhanes_summary = X_train.describe()
nhanes_summary

Unnamed: 0,gender,physical_activity,bmi,blood_glucose,diabetic,oral,blood_insulin
count,1822.0,1821.0,1822.0,1822.0,1782.0,1822.0,1822.0
mean,1.51921,1.814937,27.836992,99.30955,1.989899,114.500549,11.834402
std,0.499768,0.388455,7.272448,16.046737,0.100023,45.837951,9.903283
min,1.0,1.0,14.6,63.0,1.0,41.0,0.14
25%,1.0,2.0,22.625,91.0,2.0,86.0,5.875
50%,2.0,2.0,26.7,97.0,2.0,105.0,8.925
75%,2.0,2.0,31.2,104.0,2.0,130.0,14.2475
max,2.0,2.0,70.1,333.0,2.0,510.0,102.29


### 3.5 Visualization for EDA

In [32]:
features = X_train.columns.tolist()

# One binned histogram per feature, coloured by age group
alt.Chart(pd.concat([X_train, y_train], axis=1)).mark_bar(opacity=1).encode(
    x=alt.X(alt.repeat()).type('quantitative').bin(maxbins=40).stack(False),
    y='count()',
    color='age_group'
).repeat(
    features,
    columns=2
).properties(
    title="Fig 1: Feature Distributions by Age Group (EDA)"
)

### 3.6 Classification Analysis

#### Identifying different feature types and transformations

| Feature | Transformation | Explanation |
| --- | --- | --- |
| gender | one-hot encoding with `drop="if_binary"` | A binary feature with no missing values. 1 is Male, 2 is Female. |
| physical_activity | one-hot encoding with `drop="if_binary"` | A binary feature with 1 missing value in the training set. 1 is Yes, 2 is No. |
| bmi | scaling with `StandardScaler` | A numeric feature with no missing values. |
| blood_glucose | scaling with `StandardScaler` | A numeric feature with no missing values. |
| diabetic | one-hot encoding with `drop="if_binary"` | A binary feature with 40 missing values in the training set. 1 is Yes, 2 is No. |
| oral | scaling with `StandardScaler` | A numeric feature with no missing values. |
| blood_insulin | scaling with `StandardScaler` | A numeric feature with no missing values. |

#### Identify feature types

In [18]:
numeric_features = ["bmi", "blood_glucose", "oral", "blood_insulin"]
binary_features = ["gender", "physical_activity", "diabetic"]
target = "age_group"

#### Preprocessing

In [19]:
preprocessor = make_column_transformer(
    (OneHotEncoder(sparse_output = False,
                   drop='if_binary',dtype = int), binary_features),
    (StandardScaler(), numeric_features)
)


In [21]:
transformed_df = preprocessor.fit_transform(X_train)
n_new_cols = transformed_df.shape[1] - X_train.shape[1]
n_new_cols

4

#### Dummy Model

In [28]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(random_state = 123)
pipe = make_pipeline(preprocessor, dummy)
results_df= cross_validate(
    pipe, X_train, y_train, cv=5, return_train_score=True
)
results_df

Traceback (most recent call last):
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 455, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/pipeline.py", line 1000, in score
    Xt = transform.transform(Xt)
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/utils/_set_output.py", line 316, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/minif

{'fit_time': array([0.00849414, 0.0034492 , 0.00339103, 0.00271201, 0.00314808]),
 'score_time': array([0.004915  , 0.00330806, 0.00232077, 0.00130796, 0.00240684]),
 'test_score': array([0.83561644, 0.83561644, 0.83791209,        nan, 0.83516484]),
 'train_score': array([0.83665065, 0.83665065, 0.83607682, 0.83607682, 0.83676269])}

#### Logistic regression

In [29]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state = 123)
pipe = make_pipeline(preprocessor, lr)
lr_results_df= cross_validate(
    pipe, X_train, y_train, cv=5, return_train_score=True
)
lr_results_df

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
Traceback (most recent call last):
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 455, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/pipeline.py", line 1000, in score
    Xt = transform.transform(Xt)
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/utils/_set_output.py", line 316, in wra

{'fit_time': array([0.00810027, 0.00565672, 0.00456691, 0.00412607, 0.00440025]),
 'score_time': array([0.00251269, 0.00180817, 0.00161004, 0.00096583, 0.00170588]),
 'test_score': array([0.83835616, 0.82465753, 0.84615385,        nan, 0.83241758]),
 'train_score': array([0.83527797, 0.83733699, 0.82853224, 0.83744856, 0.83333333])}
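The repeated `y = column_or_1d(y, warn=True)` lines above are the tail of a scikit-learn warning that appears when the target is passed as a single-column table rather than a one-dimensional array. A minimal sketch of the difference, on a made-up target frame (`y_demo` is hypothetical):

```python
import pandas as pd

# Made-up one-column target frame, shaped like nhanes.data.targets
y_demo = pd.DataFrame({"age_group": ["Adult", "Senior", "Adult"]})

# Selecting the column by name yields a one-dimensional Series
y_series = y_demo["age_group"]

print(y_demo.shape, y_series.shape)
```

Passing the Series (for example `y_train["age_group"]`) to `cross_validate` gives the estimator a target in the shape it expects.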

#### SVC

In [30]:
from sklearn.svm import SVC

svc = SVC(random_state = 123)
pipe = make_pipeline(preprocessor, svc)
svc_results_df= cross_validate(
    pipe, X_train, y_train, cv=5, return_train_score=True
)
svc_results_df

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
Traceback (most recent call last):
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 455, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/pipeline.py", line 1000, in score
    Xt = transform.transform(Xt)
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akosuanewday/miniforge3/envs/DSCI522-39-FMJ/lib/python3.12/site-packages/sklearn/utils/_set_output.py", line 316, in wra

{'fit_time': array([0.03722692, 0.02426291, 0.02049804, 0.02013993, 0.01931477]),
 'score_time': array([0.00937819, 0.007092  , 0.00670791, 0.00093913, 0.00653434]),
 'test_score': array([0.83287671, 0.83561644, 0.84065934,        nan, 0.83516484]),
 'train_score': array([0.84694578, 0.84420041, 0.84430727, 0.84636488, 0.84430727])}
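To compare the three models side by side, the validation scores printed above can be gathered into one table. The sketch below copies those scores by hand and relies on pandas skipping NaN by default, so the failed fold is left out of each average. The `summary` layout is an illustrative choice, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Validation scores copied from the three cross_validate outputs above
test_scores = {
    "dummy": [0.83561644, 0.83561644, 0.83791209, np.nan, 0.83516484],
    "logistic_regression": [0.83835616, 0.82465753, 0.84615385, np.nan, 0.83241758],
    "svc": [0.83287671, 0.83561644, 0.84065934, np.nan, 0.83516484],
}

scores_df = pd.DataFrame(test_scores)

# DataFrame.mean, std and count all ignore NaN, so only the four valid folds are used
summary = pd.DataFrame({
    "mean_test_score": scores_df.mean(),
    "std_test_score": scores_df.std(),
    "n_valid_folds": scores_df.count(),
})

print(summary.round(4))
```

Averages over four folds are not directly comparable with a full five-fold result, so the `n_valid_folds` column is kept alongside the means.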