## Term Project - Biswajit Sharma

### Prediction of Obesity risk based on eating habits and physical activity

#### Introduction

Obesity is a medical condition characterized by the excessive accumulation of body fat. It is not merely a cosmetic concern but a medical problem that increases the risk of other health problems and diseases such as heart disease, diabetes, high cholesterol, high blood pressure, liver disease, musculoskeletal disorders, and certain cancers. Since 1997, the WHO has considered obesity a global epidemic and a significant health problem. To prevent obesity, various organizations, both governmental and non-governmental, are promoting campaigns that target two main risk factors: eating habits and physical activity (Gozukara et al., 2023).

Although excessive calorie intake is a known cause of obesity, nutritional factors such as a low-quality or unbalanced diet, processed foods, and alcohol consumption can also increase the risk. Physical activity is also very influential in controlling or preventing obesity. The frequency, duration, and intensity of physical activity and exercise play an essential role in the effective prevention and reduction of obesity (Gozukara et al., 2023). Eating healthy and nutritious food helps in maintaining a proper body weight and preventing obesity. Therefore, it is important to invest resources in investigating the relationship of obesity with eating habits and physical activity.

Healthcare plans can perform data mining and build a model to detect obesity risk early for their members, which can yield substantial benefits for both the health plans and their members. Early detection helps individuals become aware of their risk and take preventive measures so that they do not develop obesity-related health conditions and diseases later in life. It also helps health plans to intervene and offer incentives that motivate members to practice obesity management, such as healthy eating habits, exercise, and increased physical activity. This reduces the number of health claims caused by obesity-related diseases and allows health plans to reduce the cost of care, become more competitive, and generate more revenue.

This study will generate and evaluate a model to _predict obesity risk based on eating habits and physical activity_.

#### Dataset

_[UC Irvine Machine Learning Repository - Obesity levels, Eating Habits and Physical activity dataset.][1]_

This dataset includes data about the eating habits, physical activity, weight, height, and obesity levels of individuals from Mexico, Peru, and Colombia. The data includes the eating habits and physical activity levels of 498 participants aged between 14 and 61 years (UCI, 2019).

The originally collected data was preprocessed, which included removing missing values and normalizing the data. It was also balanced to reduce the skewness of the obesity levels. 23% of the source data consists of actual responses collected over a 30-day survey, while the remaining 77% was synthetically generated using SMOTE (Palechor & de la Hoz Manotas, 2019).

There are 17 variables in the dataset.

 - Gender - male or female
 - Age - age in years
 - Height - height in meters
 - Weight - weight in kilograms
 - Family History of overweight - yes or no
 - Frequently consume high caloric food (FAVC) - yes or no
 - Consumption of vegetables in meals (FCVC) - never, sometimes, or always
 - Number of main meals (NCP) - 1 = between 1 and 2, 2 = three, 3 = more than three, 4 = no answer
 - Consumption of food between meals (CAEC) - no, sometimes, frequently, or always
 - Smoking (SMOKE) - yes or no
 - Daily consumption of water (CH2O) - 1 = less than a liter, 2 = between 1 and 2 L, 3 = more than 2 L
 - Consumption of alcohol (CALC) - no, sometimes, frequently or always
 - Calorie consumption monitoring done (SCC) - yes or no
 - Frequency of Physical activity (FAF) - 1 = never, 2 = once or twice a week, 3 = two or three times a week, 4 = four or five times a week
 - Use of electronic devices (TUE) - 0 = none, 1 = less than an hour, 2 = between one and three hours, 3 = more than three hours
 - Mode of transportation used (MTRANS) - automobile, motorbike, bike, public transportation, walking
 - Obesity Level - the obesity level of the individual

Obesity level is labelled in the source data with classes as given below. 

 - Insufficient Weight 
 - Normal Weight
 - Overweight I
 - Overweight II
 - Obesity I
 - Obesity II
 - Obesity III

[1]: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

#### Exploratory Data Analysis

_Exploratory Data Analysis_ was performed on the dataset with the aim of identifying relationships between obesity, eating habits, and physical activity.

![image.png](attachment:1b47691c-1389-449a-b127-ce77024744ab.png)

Fig 1 shows that the average frequency of physical activity among obese and overweight people is lower than among those who are normal weight or underweight. This may indicate a relationship between obesity and the frequency of physical activity.

![image.png](attachment:6176ddc2-6b5f-41fb-b02d-b399a4884348.png)

Fig 2 shows that most of the obese and overweight people have a family history of overweight. This indicates that family history may play a vital role in increasing the risk of obesity, which may be due to genetic factors.

![image.png](attachment:5bdff0ce-60ff-4956-9b83-4af538d00f43.png)

Fig 3 shows that the majority of obese people frequently consume high-calorie food. This indicates a potential relationship between obesity and frequent consumption of high-calorie foods.

![image.png](attachment:55a9d94e-c6ba-49f5-a70f-317865d92d37.png)

Fig 4 shows that age may also be important in increasing the obesity risk. The boxplots show that the median age of obese and overweight people is higher than that of those who are underweight or normal weight. It appears that as people become older, their risk of obesity also increases.

![image.png](attachment:09d6103f-4a25-421f-b56b-e7101cea6e64.png)

We already know that weight is one of the most important factors that contribute to obesity. Fig 5 aligns with this understanding and shows that people with obesity weigh much more than those with normal weight. The median weight for the type II and type III obesity levels is much higher than that of the normal weight and overweight levels.


From the plots above, we observe that there is a relationship between eating habits, physical activity, and obesity. The majority of obese people frequently consume high-calorie foods and are less physically active. Family history may also play a vital role in increasing the risk of obesity, because most obese and overweight people have a family history of overweight. Obese and overweight people are also generally older, so obesity risk may increase with age. Therefore, eating habits, physical activity, family history, age, and body weight appear to be significant factors in predicting an individual's obesity level.
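The group-level comparisons behind Figs 1, 2, and 4 can be sketched with a few pandas aggregations. This is a minimal sketch on hypothetical toy rows; the column names follow the notebook's `df`, while the values and class spellings are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook works on the UCI dataset loaded into `df`.
toy = pd.DataFrame({
    "nobeyesdad": ["Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"],
    "faf": [2.0, 3.0, 1.0, 0.0],
    "age": [21, 23, 30, 34],
    "family_history_with_overweight": ["no", "yes", "yes", "yes"],
})

# Mean activity frequency and median age per obesity level (compare Fig 1 and Fig 4).
summary = toy.groupby("nobeyesdad").agg(
    mean_faf=("faf", "mean"),
    median_age=("age", "median"),
)

# Share of each family-history answer within every obesity level (compare Fig 2).
family_share = pd.crosstab(
    toy["nobeyesdad"],
    toy["family_history_with_overweight"],
    normalize="index",
)

print(summary)
print(family_share)
```

Numeric summaries like these complement the plots, because they make the size of each difference explicit.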

### Term Project Milestone 2

#### Data Preparation

Check for Imbalance in target classes

![image.png](attachment:21375598-be96-4aae-99b6-984fdaa94796.png)

From the plot above, we do not see any significant imbalance among the target classes.
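The balance check can also be done numerically. The sketch below uses a hypothetical label series; in the notebook the labels would come from the `nobeyesdad` column, and the ratio shown is only one possible way to summarize imbalance.

```python
import pandas as pd

# Hypothetical labels; in the notebook these would be df["nobeyesdad"].
labels = pd.Series(
    ["Normal_Weight"] * 3 + ["Obesity_Type_I"] * 3 + ["Overweight_Level_I"] * 2,
    name="nobeyesdad",
)

# Relative frequency of every class.
class_share = labels.value_counts(normalize=True)

# Ratio of the largest to the smallest class; values near 1 indicate balance.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```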

Check for missing values

In [15]:
df.isna().sum()

gender                            0
age                               0
height                            0
weight                            0
family_history_with_overweight    0
favc                              0
fcvc                              0
ncp                               0
caec                              0
smoke                             0
ch2o                              0
scc                               0
faf                               0
tue                               0
calc                              0
mtrans                            0
nobeyesdad                        0
dtype: int64

From the output above, we see that there are no missing values in the dataset.

###### For categorical features whose categories have a natural order, it is better to transform the categories to _ordinal_ values.

###### Categories in `consumption of food between meals` have some natural order. Based on the nature of this study, _no consumption of food between meals_ has a lower order than _frequent consumption of food between meals_, because _frequent consumption of food between meals_ can cause intake of more calories than our body needs and increase the obesity risk.

###### Categories in `consumption of alcohol` have some natural order. Based on the nature of this study, _no consumption of alcohol_ has a lower order than _frequent or always consumption of alcohol_, because _consumption of alcohol_ has been medically associated with _obesity_ related problems.

###### Categories in `mode of transportation` have some natural order. Based on the nature of this study, _using automobile for transportation_ has a lower order than _walking or biking_. This is because _walking or biking_ will burn more calories and we know that burning more calories helps in preventing or managing _obesity_.
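The ordinal encoding described above can be sketched with explicit mappings. The category spellings and the full ordering of the transportation modes are assumptions; the text only fixes that _no_ ranks below _frequent_ consumption and that _automobile_ ranks below _walking or biking_.

```python
import pandas as pd

# Hypothetical toy rows with the three ordered categorical features.
toy = pd.DataFrame({
    "caec": ["no", "Sometimes", "Frequently", "Always"],
    "calc": ["no", "Sometimes", "Frequently", "no"],
    "mtrans": ["Automobile", "Walking", "Bike", "Public_Transportation"],
})

# Assumed order: higher value = more frequent consumption.
frequency_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}

# Assumed order: higher value = more physical effort while commuting.
transport_order = {
    "Automobile": 0,
    "Motorbike": 1,
    "Public_Transportation": 2,
    "Bike": 3,
    "Walking": 4,
}

encoded = toy.assign(
    caec=toy["caec"].map(frequency_order),
    calc=toy["calc"].map(frequency_order),
    mtrans=toy["mtrans"].map(transport_order),
)
print(encoded)
```

An explicit dictionary keeps the chosen order visible in the notebook, and any category missing from the mapping shows up as `NaN` instead of being silently assigned a rank.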

###### Encode the remaining categorical features with _dummy_ variables, because the categorical classes must be converted to numerical values for modeling. Sklearn models expect features to be numerical. We also drop the first level of each dummy variable to prevent multicollinearity.
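A minimal sketch of the dummy encoding, assuming `pandas.get_dummies` is used; the toy columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with two unordered categorical features.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "smoke": ["no", "yes", "no"],
    "age": [21, 30, 25],
})

# drop_first=True removes one level per feature, so the remaining dummy
# columns are not perfectly collinear with each other.
dummies = pd.get_dummies(toy, columns=["gender", "smoke"], drop_first=True, dtype=int)
print(dummies)
```

With two levels per feature, each feature collapses to a single 0/1 column (`gender_Male`, `smoke_yes`), and the dropped level becomes the reference category.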

<div class="alert alert-block alert-info">
<b>Note:</b> At this point it is better to separate the training and test sets, because we are going to apply feature selection methods. We must fit the feature selection methods on the training set only, not on the test set, in order to prevent data leakage.
</div>
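The split described in the note can be sketched as follows. The 80/20 ratio, the stratification, and the random seed are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the target classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.repeat(np.arange(4), 25)

# Stratifying on y keeps the class proportions the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Every later step (feature selection, scaling) is then fitted on `X_train` and only applied to `X_test`.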

<div class="alert alert-block alert-info">
<b>Note:</b> Because the target in this study is categorical, we use the chi-square ($\chi^2$) statistic to select the categorical features. The chi-square statistic measures the association, or dependence, between categorical variables. A feature that has no significant association with the target is unlikely to be informative or important in predicting it.
</div>

<div class="alert alert-block alert-success">
<b>Note:</b> Out of 8 categorical features, the chi-square selector above selected 7 and dropped `favc`.
</div>
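A chi-square selection of this kind can be sketched with scikit-learn. The notebook's actual selector class and settings are not shown, so `SelectKBest` with `k=1` and the synthetic features below are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic training data: one encoded feature that depends on the target
# and one that is pure noise. chi2 requires non-negative feature values.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 3, size=300)
informative = (y_train == 2).astype(int)
noise = rng.integers(0, 2, size=300)
X_train_cat = np.column_stack([informative, noise])

# Fit on the training set only, then reuse the fitted selector on the test set.
selector = SelectKBest(score_func=chi2, k=1).fit(X_train_cat, y_train)
selected_mask = selector.get_support()

print(selector.scores_, selected_mask)
```

The fitted selector keeps the column with the strongest association with the target; calling `selector.transform` on the test features then applies the same column choice without refitting.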

<div class="alert alert-block alert-info">
<b>Note:</b> Because the target in this study is categorical, we use the ANOVA F-value to select the numerical features that are related to the target. The ANOVA F-test checks whether the mean of a numerical variable differs significantly across the target classes by comparing the variance between the classes with the variance within them. If there is no significant difference, the feature is unlikely to be informative or important in predicting the target.
</div>

In [36]:
# selected feature names
selected_numerical_features

['age', 'weight', 'height', 'fcvc', 'ncp', 'ch2o', 'faf']

<div class="alert alert-block alert-success">
<b>Note:</b> Out of 8 numerical features, the F-value selector above selected 7 and dropped `tue`.
</div>
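The ANOVA-based selection can be sketched in the same way. As before, `SelectKBest` with `k=1` and the synthetic features are assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic training data: the first feature's mean shifts with the class,
# the second feature has the same distribution in every class.
rng = np.random.default_rng(0)
y_train = np.repeat([0, 1, 2], 100)
shifted = rng.normal(loc=y_train * 2.0, scale=1.0)
flat = rng.normal(size=300)
X_train_num = np.column_stack([shifted, flat])

# f_classif computes one ANOVA F-value (and p-value) per feature.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_train_num, y_train)
f_scores, p_values = selector.scores_, selector.pvalues_

print(f_scores, p_values, selector.get_support())
```

A large F-value with a small p-value means the class means differ by more than the within-class spread would explain, which is the signal used to keep a feature.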

So, now $7$ categorical and $7$ numerical features remain after _feature selection_.

In [41]:
# generate scaled features using scaler
scaled_features_train = scaler.transform(selected_features_train)
scaled_features_test = scaler.transform(selected_features_test)


<div class="alert alert-block alert-info">
<b>Note:</b> At this point we have the train and test set features transformed and scaled to be used in modeling.
</div>
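The scaling cell above assumes a `scaler` that was already fitted. A minimal sketch of that step follows, assuming `StandardScaler` (the notebook does not show which scaler was used) and toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features; in the notebook these are the selected features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Learn the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(train)

# Apply the same training statistics to both sets to avoid data leakage.
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

print(scaled_train)
print(scaled_test)
```

Fitting on the training set only means the test set never influences the scaling parameters, which matches the leakage concern raised for feature selection.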

### Term Project Milestone 3

As our target variable is **categorical** in nature, we will use the models below and then perform _model evaluation_ to check the performance of each model.
- Logistic Regression
- KNN
- Decision Trees
- Random Forest


We use grid search as the model selection technique to identify the best hyperparameters for each of the models. _Model evaluation_ is performed together with _hyperparameter selection_ using _Nested Cross-Validation_, which wraps the cross-validation for the hyperparameter search (selection) within another cross-validation for model evaluation. The _inner_ cross-validation searches for the best hyperparameters, while the _outer_ cross-validation evaluates the performance of the model selected by the _inner_ cross-validation.
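Nested cross-validation can be sketched by passing a grid search object to `cross_val_score`. The data, the fold counts, and the parameter grid below are assumptions chosen to keep the example small; they are not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic multiclass data standing in for the prepared features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Inner loop: hyperparameter search. Outer loop: performance estimate.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [3, None]},  # hypothetical grid
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer fold refits the whole grid search on its own training portion.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

Because the hyperparameters are chosen without ever seeing the outer test fold, the outer scores are a less optimistic estimate than the best score reported by a single grid search.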

We are using **accuracy**, **precision**, **recall**, and **F1-score** to evaluate model performance. Our data does not suffer from significant _class imbalance_ in the target variable, so _accuracy_ is a usable metric. However, accuracy alone does not show which kinds of errors the model makes for each target class, so we also evaluate _precision_, _recall_, and _F1-score_. Given the nature of the problem, it is important to identify as many at-risk individuals as possible while also keeping the predictions for each class correct, so that obesity management efforts and resources are directed to the right people. _Precision_ is the proportion of observations predicted as positive that are actually positive. Models with high precision are pessimistic because they predict a class only when they are very certain. _Recall_ is the proportion of actual positive observations that the model correctly identifies. Models with high recall are optimistic and try to capture as many of the actual positives as possible. Finally, the _F1-score_ is the harmonic mean of precision and recall and summarizes the balance between the two.
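For a multiclass target, these metrics are computed per class and then averaged. The sketch below uses hypothetical labels and macro averaging; the averaging scheme is an assumption, since the text does not state which one was used.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted classes for six observations.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# Per-class values: one entry per class for precision, recall, F1, and support.
per_class = precision_recall_fscore_support(y_true, y_pred, average=None)

# Macro average: unweighted mean of the per-class values.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here class 1 has perfect recall but lower precision (one class-0 observation was predicted as class 1), which is exactly the kind of trade-off that accuracy alone hides.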

Additionally, we are using a **Confusion Matrix** to evaluate the overall performance of the selected model. A confusion matrix is a table that shows the number of observations in each actual class against each predicted class, so it displays the number of correct and incorrect predictions.
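A small confusion matrix can be built as follows. The labels and predictions are hypothetical and use only three of the seven obesity levels.

```python
from sklearn.metrics import confusion_matrix

# Fixing the label order makes the rows and columns easy to read.
labels = ["Normal_Weight", "Overweight_Level_I", "Obesity_Type_I"]

y_true = ["Normal_Weight", "Normal_Weight", "Overweight_Level_I",
          "Overweight_Level_I", "Obesity_Type_I", "Obesity_Type_I"]
y_pred = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_I",
          "Overweight_Level_I", "Obesity_Type_I", "Obesity_Type_I"]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal holds the correct predictions; the single off-diagonal entry in the first row is a Normal Weight observation predicted as Overweight Level I.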

We are creating a `DummyClassifier` as a baseline model against which we can compare the trained models. It helps us check whether a trained model is better than random guessing.
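A baseline of this kind can be sketched as below. The `strategy="uniform"` setting is an assumption that matches "random guessing"; the data is synthetic with seven balanced classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic features and seven balanced target classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 3))
y = np.repeat(np.arange(7), 100)

# "uniform" ignores the features and picks each class with equal probability.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")

print(baseline_scores.mean())
```

With seven balanced classes, the expected accuracy of uniform guessing is about 1/7, which gives a concrete floor that any trained model should clearly exceed.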

![image.png](attachment:c4414649-c1ae-452f-bb54-f9e132c4ee11.png)

![image.png](attachment:456a1f87-2b7a-4599-953c-bfaf6c4938ca.png)

![image.png](attachment:641925fb-836c-48b8-9c3d-63790db11f21.png)

![image.png](attachment:6d8de2a9-7c98-4a5f-b07e-293a4510a6e2.png)

### Summary

The target variable in this study has multiple categories; therefore, it is a _multiclass classification_ problem. We evaluated the performance of four different models, Logistic Regression, K Nearest Neighbor (KNN), Decision Trees, and Random Forest, to predict the target. A nested cross-validation technique was used to evaluate each model's performance, and grid search was used to select the hyperparameters for each of the models.
Model performance was measured using _accuracy, precision, recall, and F1-score_.

We created a _dummy classifier_ that randomly selects a target class for prediction, and from Fig 6, we observe that all of the trained models perform better than random selection. Fig 6 also shows that _Random Forest_ is the _best-performing_ model across all the metrics. However, the _Decision Tree_ model's performance is very close to that of the Random Forest model. The confusion matrix (Fig 8) for the Random Forest shows that, overall, the model performs very well, with only a few incorrect predictions in classes such as Overweight_Level_I and Normal Weight. A learning curve (Fig 9) was used to check whether adding more training data would further improve the performance of the Random Forest model. It shows that the test (cross-validation) score increases as the number of training instances increases and then starts to plateau, which means that adding more observations is unlikely to provide a large benefit.
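A curve like the one in Fig 9 can be computed with scikit-learn's `learning_curve`. This is a sketch on synthetic data with assumed settings (five training sizes, 5-fold cross-validation, a small forest), not the configuration used for the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic multiclass data standing in for the prepared features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Train on growing subsets and score each one with cross-validation.
train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# One mean cross-validation score per training size.
mean_cv_scores = cv_scores.mean(axis=1)
for size, score in zip(train_sizes, mean_cv_scores):
    print(size, round(score, 3))
```

If the mean cross-validation score stops improving between the last few training sizes, collecting more data is unlikely to help much, which is how the plateau in Fig 9 is read.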

#### References

Estimation of obesity levels based on eating habits and physical condition. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z

Gozukara Bag, H.G., Yagin, F.H., Gormez, Y., González, P.P., Colak, C., Gülü, M., Badicu, G., & Ardigò, L.P. (2023). Estimation of Obesity Levels through the Proposed Predictive Approach Based on Physical Activity and Nutritional Habits. Diagnostics, 13(18), 2949. https://doi.org/10.3390/diagnostics13182949

Palechor, F.M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344. https://doi.org/10.1016/j.dib.2019.104344