# Modeling and Analysis of Total Fitness Factor Score for CSUF College Students

## Group Members
- Paul Anthony Bagabaldo
- Joksan Hernandez
- Huyen Nguyen

## Acknowledgements
We thank Dr. Bill Beam (Department of Kinesiology), Dr. Archana McEligot (Department of Public Health), and Dr. Sinjini Mitra (Department of Information Systems and Decision Sciences) at California State University, Fullerton, for providing the data used in this project.

## Table of Contents 
<a id="TOC"></a>
1. [Introduction](#Introduction)
2. [Data Description](#Data-Description)
3. [Data Preprocessing](#Data-Preprocessing)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
5. [Modeling and Analysis](#Modeling-and-Analysis)
6. [Results](#Results)
7. [Conclusion](#Conclusion)
8. [References](#References)

## Introduction
<a id="Introduction"></a>

Physical fitness plays a large role in the overall health and well-being of college students. Understanding the factors that influence fitness can help in designing targeted interventions to promote healthier lifestyles. Researchers from several departments at California State University, Fullerton (CSUF) have collected data on health and fitness-related variables for a long-term study.

The goal of this project is to estimate the **Total Fitness Factor Score (FFTotal)** from the other variables in the dataset. We use **exploratory data analysis**, **linear regression modeling**, and **performance evaluation** to better understand the data.

## Data Description

The dataset consists of various health and fitness-related variables collected as part of fitness testing among students at California State University, Fullerton. Below is a summary of the key variables used in the analysis:

### Demographic Information
- **Idnum**: A random identification number assigned at the conclusion of the semester.
- **Date**: The date or semester during which the test was conducted.
- **Phone**: Last four digits of the self-reported phone number for matching data.
- **Sex**: Gender of the subject (Female = F, Male = M).
- **Age**: Self-reported age in years.
- **Ethnicity**: Self-reported ethnicity (categories include Caucasian, Hispanic/Latino, African American, Native American, Asian, Pacific Islander, or Other).

### Anthropometric Measurements
- **Height (Ht)**: Height measured using a stadiometer to the closest 0.25 inches.
- **Weight (Wt)**: Weight measured using an electronic scale to the closest 0.1 lb.
- **BIA % Fat**: Body fat percentage measured using bioelectrical impedance analysis (BIA).
- **Waist Girth**: Measured at the "minimal" natural waist in centimeters.
- **Skinfold Measurements (SF 1, SF 2, SF 3)**: Skinfold thickness (in millimeters) at various sites:
  - SF1: Chest (male) or triceps (female).
  - SF2: Abdomen (male) or suprailium (female).
  - SF3: Thigh (both genders).

### Fitness and Physical Performance
- **Forward Flexion (FF)**: Sit-and-reach test result, best of three trials, measured to the closest 0.5 inch.
- **Right Grip Max (RGM)** and **Left Grip Max (LGM)**: Maximal grip strength for each hand (in kilograms).
- **Vital Capacity (VC)**: Lung capacity measured using a Vitalometer (in liters).
- **Stages**: Number of stages completed on a cycle ergometer test (range: 2–4 stages).
- **Power and Heart Rate per Stage (PL 1-4, HR 1-4)**: Power (in watts) and heart rate (in bpm) for each stage of the cycle ergometer test.
- **Rate of Perceived Exertion (RPE 1-4)**: Self-reported exertion at the end of each stage.

### Cardiovascular and Environmental Data
- **Resting HR**: Resting heart rate (in bpm), self-reported or measured.
- **Systolic BP (SBP)** and **Diastolic BP (DBP)**: Brachial systolic and diastolic blood pressure (in mmHg), measured in a seated position.
- **Ambient Temperature (TA)**: Temperature of the testing environment in degrees Celsius.
- **Barometric Pressure (PB)**: Measured barometric pressure (in mmHg).

### Risk Factors
- **RF 2**: Gender-based risk factor (Female = 1, Male = 5).
- **RF 3**: Stress score based on 11 questions (range: 11–55).
- **RF 4**: Family history of cardiovascular disease (CVD) categorized into severity levels.
- **RF 5**: Smoking history, self-reported (categories based on number of cigarettes/day).
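
Because RF 2 is described as a direct recoding of Sex, the two columns can be cross-checked against each other. The sketch below illustrates one way to do this on a small synthetic frame (not the study data); the column names `Sex` and `RF 2` are taken from the dataset description, while the frame `toy` and the names `expected_rf2` and `mismatches` are hypothetical.

```python
import pandas as pd

# Synthetic example rows (NOT the study data); the last row is deliberately inconsistent.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "RF 2": [1, 5, 1, 1],
})

# Recode Sex using the documented mapping (Female = 1, Male = 5).
expected_rf2 = toy["Sex"].map({"F": 1, "M": 5})

# Rows where the recorded RF 2 disagrees with the value implied by Sex.
mismatches = toy[toy["RF 2"] != expected_rf2]
print(mismatches)
```

Any rows returned by such a check would be candidates for manual review before modeling.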

### Outcome Variable
- **Total Fitness Factor Score (FFTotal)**: The primary outcome variable, representing an overall fitness metric calculated using a proprietary formula.


[Back to Top](#TOC)

## Data Preprocessing

In [75]:
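import pandas as pd
from lets_plot import *

# Enable lets-plot rendering inside the notebook
LetsPlot.setup_html()

# Load the raw fitness-testing data
df = pd.read_csv('totalFitnessFactor.csv')

# Drop columns in which every value is missing
df = df.dropna(axis=1, how='all')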

The initial dataset was loaded from `totalFitnessFactor.csv`. Columns containing only NA values were then removed, since they carry no information for the analysis.
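
To make the effect of `dropna(axis=1, how='all')` concrete, here is a minimal sketch on a synthetic frame (not the study data). It shows that only columns that are entirely missing are removed, while partly missing columns survive. The frame `toy` and the column name `Unused` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic example (NOT the study data).
toy = pd.DataFrame({
    "Age": [19, 22, 20],
    "Waist": [np.nan, 81.0, np.nan],     # partly missing: kept
    "Unused": [np.nan, np.nan, np.nan],  # entirely missing: dropped
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Same call as in the notebook cell: drop columns that are all-NA.
trimmed = toy.dropna(axis=1, how="all")

print("Dropped:", empty_cols)
print("Kept:", list(trimmed.columns))
```

Recording `empty_cols` before dropping keeps a trace of which columns were removed.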

In [None]:
# Calculate the missing statistics
missing_stats = (
   df.isnull()
       .sum()
       .sort_values(ascending=False)
       .to_frame('Missing Count')
       .join(
           (df.isnull().mean() * 100)
           .round(2)
           .to_frame('Missing %')
       )
       [lambda x: x['Missing Count'] > 0]
)

# Display the missing statistics
(
missing_stats
    .style
    .format({
        'Missing Count': '{:,.0f}',
        'Missing %': '{:.2f}%'
    })
    .background_gradient(cmap='Blues')
)

| Column | Missing Count | Missing % |
|---|---|---|
| Waist | 5528 | 88.08% |
| BIA_percent_Fat | 4680 | 74.57% |
| SF 2 | 1596 | 25.43% |
| SF 3 | 1596 | 25.43% |
| SF 1 | 1596 | 25.43% |
| RPE 3 | 393 | 6.26% |
| HR 3 | 380 | 6.05% |
| PL 3 | 380 | 6.05% |


The missing-value summary shows that Waist and BIA % Fat (`BIA_percent_Fat`) are missing 88.08% and 74.57% of their values, respectively. Because so few observations remain, we excluded both columns from the analysis.

In contrast, SF 1, SF 2, and SF 3 are each missing 25.43% of their values. While this is not ideal, roughly three quarters of the observations are still available, so these columns were retained for further exploration.

Lastly, several other columns, such as PL 3 (6.05%), HR 3 (6.05%), and RPE 3 (6.26%), are missing only a small proportion of their values. Because these gaps are small, we retained these columns as well.
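
The three-way reasoning above (exclude, retain with caution, retain) can be sketched as a small helper that labels each column by its share of missing values. This is only an illustration on synthetic data, not the procedure used in the project: the function `triage_missing` and the 10% "review" threshold are hypothetical, and the 30% drop threshold is chosen to match the cutoff applied in the next cell.

```python
import numpy as np
import pandas as pd

def triage_missing(frame, drop_above=30.0, review_above=10.0):
    """Label each column 'drop', 'review', or 'keep' by its missing percentage."""
    pct = frame.isnull().mean() * 100

    def decide(p):
        if p > drop_above:
            return "drop"
        if p > review_above:
            return "review"
        return "keep"

    return pd.DataFrame({"Missing %": pct.round(2), "Decision": pct.map(decide)})

# Synthetic example with 10 rows (NOT the study data).
toy = pd.DataFrame({
    "Age": list(range(10)),                # 0% missing
    "SF 1": [np.nan] * 2 + [12.0] * 8,     # 20% missing
    "Waist": [np.nan] * 9 + [80.0],        # 90% missing
})

decisions = triage_missing(toy)
print(decisions)
```

Keeping the thresholds as parameters makes it easy to see how sensitive the retained column set is to the chosen cutoffs.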

In [84]:
# Calculate the missing percentage for each column
missing_percentage = df.isnull().mean() * 100

# Clean the data by removing columns with more than 30% missing values
df_cleaned = df.loc[:, missing_percentage <= 30]

# Display the retained columns
retained_cols = pd.DataFrame({
   'Column Name': df_cleaned.columns.tolist(),
   'Data Type': df_cleaned.dtypes.values,
   'Non-null Count': df_cleaned.count().values
}).set_index('Column Name')

(
retained_cols
   .style
   .set_table_styles([
       {'selector': 'thead',
        'props': [('background-color', '#2c3e50'), 
                 ('color', 'white'),
                 ('font-weight', 'bold')]},
   ])
   .format({'Non-null Count': '{:,d}'})
)

| Column Name | Data Type | Non-null Count |
|---|---|---|
| Idnum | object | 6276 |
| Date | object | 6276 |
| Sex | object | 6276 |
| Age | int64 | 6276 |
| Ht | float64 | 6276 |
| Wt | float64 | 6276 |
| RF 2 | int64 | 6276 |
| RF 3 | int64 | 6276 |
| RF 4 | int64 | 6276 |
| RF 5 | int64 | 6276 |
