# Introduction: Data set and background information

**Our data set: Estimation of obesity levels based on eating habits and physical condition**

We have chosen to explore the ObesityDataSet, which gathered survey information from individuals in Colombia, Peru, and Mexico (Palechor & Manotas, 2019). Below is the list of questions asked in the survey:

1. What is your gender? (Male/Female)
2. What is your age?
3. What is your height (in meters)?
4. What is your weight (in kilograms)?
5. Has a family member suffered or suffers from overweight? (Yes/No)
6. Do you eat high caloric food frequently? (Yes/No)
7. Do you usually eat vegetables in your meals? (Never/Sometimes/Always)
8. How many main meals do you have daily? (1-2, 3, 3+)
9. Do you eat any food between meals? (No/Sometimes/Frequently/Always)
10. Do you smoke? (Yes/No)
11. How much water do you drink daily (in liters)? (<1L, 1-2L, >2L)
12. Do you monitor the calories you eat daily? (Yes/No)
13. How often do you have physical activity (days)? (0, 1-2, 2-4, 4-5)
14. How much time do you use technological devices such as cell phone, videogames, television, computer, and others (hours)? (0-2, 3-5, >5)
15. How often do you drink alcohol? (Do not drink/Sometimes/Frequently/Always)
16. Which transportation do you usually use? (Automobile/Motorbike/Bike/Public Transportation/Walking)

*From Palechor & Manotas (2019)*

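The data set consists of 17 columns: one for each of the 16 survey questions above, plus `NObeyesdad`, the obesity level we want to predict. The table below lists each column name alongside its corresponding question number: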

In [50]:
library(kableExtra)
library(tibble)  # provides tibble()

# Question numbers 1-16; the outcome column NObeyesdad has no survey question
questions <- c(seq(from = 1, to = 16), NA)
variables <- c("Gender", "Age", "Height", "Weight", "family_history_with_overweight", 
               "FAVC", "FCVC", "NCP", "CAEC", "SMOKE", "CH2O", "SCC", "FAF", "TUE",
               "CALC", "MTRANS", "NObeyesdad")
# One description per variable, in the same order as `variables`
values <- c("Female or Male", "double: ages 14-61", "double: in m", "double: in kg", "yes or no", 
            "yes or no", "1 = never, 2 = sometimes, 3 = always", "1-4 meals",
            "no, Sometimes, Frequently, or Always", "yes or no",
            "1 = less than 1 liter, 2 = 1-2 liters, 3 = more than 2 liters", "yes or no", 
            "0 = none, 1 = 1-2 days, 2 = 2-4 days, 3 = 4-5 days", "0 = 0-2 hours, 1 = 3-5 hours, 2 = more than 5 hours", 
            "no, Sometimes, Frequently, or Always", "automobile, motorbike, bike, public transportation, or walking",
            "Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III")
table_obesity_values <- tibble("Question Number" = questions, "Variable Name" = variables, "Value information" = values)
table_obesity_values

Question Number,Variable Name,Value information
<int>,<chr>,<chr>
1,Gender,Female or Male
2,Age,double: ages 14-61
3,Height,double: in m
4,Weight,double: in kg
5,family_history_with_overweight,yes or no
6,FAVC,yes or no
7,FCVC,"1 = never, 2 = sometimes, 3 = always"
8,NCP,1-4 meals
9,CAEC,"no, Sometimes, Frequently, or Always"
10,SMOKE,yes or no
11,CH2O,"1 = less than 1 liter, 2 = 1-2 liters, 3 = more than 2 liters"
12,SCC,yes or no
13,FAF,"0 = none, 1 = 1-2 days, 2 = 2-4 days, 3 = 4-5 days"
14,TUE,"0 = 0-2 hours, 1 = 3-5 hours, 2 = more than 5 hours"
15,CALC,"no, Sometimes, Frequently, or Always"
16,MTRANS,"automobile, motorbike, bike, public transportation, or walking"
NA,NObeyesdad,"Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III"

#### **Some important notes about the data:**

As noted by the authors, the original data set was unbalanced (Palechor & Manotas, 2019). They balanced it by generating synthetic records. Figure 1 and Figure 2 show the number of records per obesity level before and after balancing.

### Figure 1: Barplot of number of records for each obesity level category of the unbalanced data set (Palechor & Manotas, 2019)

### Figure 2: Barplot of number of records for each obesity level category of the balanced data set (Palechor & Manotas, 2019)

![balanced.jpg](attachment:2868ca51-eaf2-4afa-8bce-ba15fc410068.jpg)
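
A quick way to check how balanced the classes are is to count the records in each obesity level. The sketch below is illustrative only: `toy` is a small made-up tibble, not part of the data set, and the same `count()` call would be applied to the `NObeyesdad` column of the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the NObeyesdad column
toy <- tibble(NObeyesdad = c("Normal_Weight", "Normal_Weight",
                             "Obesity_Type_I", "Overweight_Level_I",
                             "Obesity_Type_I", "Insufficient_Weight"))

# Count records per class and compute each class's share of the total
class_counts <- toy |>
    count(NObeyesdad, name = "n_records") |>
    mutate(proportion = n_records / sum(n_records))

class_counts
```

In a balanced data set, the `proportion` column would be roughly equal across classes.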

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [24]:
obesity <- read_csv("data/obesity_balanced.csv")

Rows: 2111 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [25]:
# Convert the categorical columns CAEC and Gender to factors
obesity <- obesity |>
    mutate(CAEC = as_factor(CAEC),
           Gender = as_factor(Gender))

obesity_names <- obesity |>
    rename("obesity_level" = NObeyesdad, 
           "gender" = Gender, 
           "age" = Age, 
           "height" = Height, 
           "weight" = Weight,
           "high_caloric_freq" = FAVC, 
           "eat_veg_w_meal" = FCVC,
           "main_meals_daily" = NCP, 
           "food_btw_meals" = CAEC, 
           "smoker" = SMOKE, 
           "water" = CH2O, 
           "monitor_calories" = SCC, 
           "physical_freq" = FAF, 
           "screen_time" = TUE,
           "alcohol" = CALC, 
           "transportation_mode" = MTRANS)

Since we are only interested in predicting normal weight, overweight, and obesity, we transform the data in the following ways:

1. Combine all obesity types into a single category, Obese, and both overweight levels into a single category, Overweight.
2. Filter out the underweight category (Insufficient_Weight), as it is outside our scope.

In [51]:
# Drop the underweight class, which is outside our scope
obesity_filtered <- obesity_names |>
    filter(obesity_level != "Insufficient_Weight")

# Original labels to merge into the "Obese" and "Overweight" classes
obese <- c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
overweight <- c("Overweight_Level_I", "Overweight_Level_II")

# Collapse the original labels into the two merged classes
obesity_mutated <- obesity_filtered |>
    mutate(obesity_level = ifelse(obesity_level %in% obese, "Obese", obesity_level)) |>
    mutate(obesity_level = ifelse(obesity_level %in% overweight, "Overweight", obesity_level))

glimpse(obesity_mutated)

Rows: 1,839
Columns: 17
$ gender                         <fct> Female, Female, Male, Male, Male, Male,…
$ age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
$ height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
$ weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
$ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no", …
$ high_caloric_freq              <chr> "no", "no", "no", "no", "no", "yes", "y…
$ eat_veg_w_meal                 <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
$ main_meals_daily               <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
$ food_btw_meals                 <fct> Sometimes, Sometimes, Sometimes, Someti…
$ smoker                         <chr> "no", "yes", "no", "no", "n
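
The filter-and-merge transformation described above can also be written as a single factor-collapsing step, which returns a factor rather than a character column. The following is a minimal sketch and not the method used in this analysis: it assumes the forcats function `fct_collapse()`, and it uses a hypothetical toy tibble `toy` in place of `obesity_names`.

```r
library(dplyr)
library(forcats)
library(tibble)

# Hypothetical toy data with one record per original obesity level
toy <- tibble(obesity_level = c("Insufficient_Weight", "Normal_Weight",
                                "Overweight_Level_I", "Overweight_Level_II",
                                "Obesity_Type_I", "Obesity_Type_II",
                                "Obesity_Type_III"))

toy_recoded <- toy |>
    # Drop the underweight class first
    filter(obesity_level != "Insufficient_Weight") |>
    # Merge the remaining labels; labels not listed keep their names
    mutate(obesity_level = fct_collapse(
        obesity_level,
        Obese = c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"),
        Overweight = c("Overweight_Level_I", "Overweight_Level_II")
    ))

count(toy_recoded, obesity_level)
```

Storing the outcome as a factor is usually convenient for classification, since tidymodels classification models expect a factor outcome.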

# References

Palechor, F. M., & Manotas, A. de. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344. https://doi.org/10.1016/j.dib.2019.104344