# LAB | BMI Survey 


## Part 1: Descriptive Analysis

## Problem Description

Overweight and obesity, measured by Body Mass Index (BMI), are increasing health concerns in Denmark and globally. This project analyzes BMI data from a Danish survey to provide an overview of BMI distributions and investigate potential factors influencing BMI, such as gender, age, and fast food consumption. The analysis aims to summarize and visualize the data using descriptive statistics and graphical methods.

---

## Instructions

- Complete each section below using Python and appropriate libraries (e.g., pandas, numpy, matplotlib, seaborn).
- Provide code, tables, and figures as needed.
- Write brief explanations for your findings after each analysis step.
- Do **not** include code in your final report; code should be submitted separately as an appendix.
- Ensure all figures and tables are clearly labeled and referenced in your explanations.

---

## 1. Data Overview

**a) Short Description of the Data**

- List all variables in the dataset.
- Classify each variable as quantitative or categorical.
- State the number of observations.
- Check for missing values.


**Instructions:**  
- Summarize the dataset variables and their types.
- Report the number of observations and any missing values.
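
The overview steps above can be sketched as follows. This is a minimal illustration on a small made-up table, not the survey itself; the column names (`height`, `weight`, `gender`, `urbanity`, `fastfood`) and units are assumptions and may differ from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data (names, units, and values are assumed).
df = pd.DataFrame({
    "height": [170.0, 182.0, 165.0, 178.0],   # assumed to be in cm
    "weight": [65.0, 85.0, 58.0, np.nan],     # assumed to be in kg; one gap planted for illustration
    "gender": ["female", "male", "female", "male"],
    "urbanity": [1, 3, 5, 2],                 # assumed to be a category code
    "fastfood": [6.0, 24.0, 1.0, 12.0],
})

n_obs = len(df)                 # number of observations (rows)
missing = df.isna().sum()       # missing values per variable

# The storage type is only a hint: a number-coded variable can still be categorical.
is_numeric = df.dtypes.apply(pd.api.types.is_numeric_dtype)

overview = pd.DataFrame({"dtype": df.dtypes, "numeric": is_numeric, "missing": missing})
print(n_obs)
print(overview)
```

Whether a numerically stored variable is quantitative or categorical still has to be decided from its meaning, not from `dtype` alone.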

---

### Answer:
- The variables in the dataset are: Height, Weight, Gender, Urbanity, and Fastfood consumption.
- Height, Weight, and Fastfood consumption are quantitative variables.
- Gender and Urbanity are categorical variables; Urbanity is compared by category in Part 2.
- The dataset contains 145 observations.
- There are no missing values in the dataset.


## 2. Calculating BMI

- Compute BMI for each respondent using the formula:

  $$\text{BMI} = \frac{\text{weight (kg)}}{\left[\text{height (m)}\right]^2}$$

- Add BMI as a new column to the dataset.
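
A minimal sketch of the calculation, assuming height is recorded in centimeters and weight in kilograms, and that the columns are named `height` and `weight` (both names are hypothetical). If height is already in meters, the division by 100 should be dropped.

```python
import pandas as pd

# Made-up respondents; column names and units are assumptions.
df = pd.DataFrame({
    "height": [170.0, 180.0, 160.0],   # cm (assumed)
    "weight": [65.0, 81.0, 64.0],      # kg (assumed)
})

height_m = df["height"] / 100.0             # convert cm to m
df["bmi"] = df["weight"] / height_m ** 2    # BMI = weight (kg) / height (m)^2

print(df)
```

A quick sanity check on the result is that adult BMI values usually fall roughly between 15 and 40; values far outside that band often point to a unit mix-up.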

---



## 3. Empirical Distribution of BMI

**b) Density Histogram of BMI Scores**

- Plot a density histogram of BMI.
- Describe the distribution: symmetry, skewness, possible negative values, and variation.

**Instructions:**  
- Comment on the shape and spread of the BMI distribution.
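
The quantities asked for here can be computed along these lines. The sketch uses synthetic right-skewed values as a stand-in for BMI, so the numbers it prints are not the survey results; the bin count of 20 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed values standing in for BMI (not the survey data).
bmi = pd.Series(18 + rng.gamma(shape=4.0, scale=1.8, size=500), name="bmi")

# A density histogram rescales bar heights so that the total bar area equals 1.
density, edges = np.histogram(bmi, bins=20, density=True)
area = float(np.sum(density * np.diff(edges)))

skewness = float(bmi.skew())            # > 0 indicates a longer right tail
std = float(bmi.std())                  # sample standard deviation (ddof=1)
has_negative = bool((bmi < 0).any())    # BMI is a ratio of positive quantities

print(area, skewness, std, has_negative)
print(bmi.min(), bmi.max())
```

With matplotlib, `plt.hist(bmi, bins=20, density=True)` draws the same bars.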

---


The density histogram of BMI is shown below.

![image.png](attachment:image.png)

### Answer:
Distribution description:
- Symmetry/Skewness: The distribution is not symmetric; it is right-skewed (skewness = 0.67), with a longer tail toward high BMI values.
- Negative values present: No
- Variation (standard deviation): 3.83
- Range: 17.58 to 39.52



## 4. Gender Subsets

**c) Separate Density Histograms for Women and Men**

- Create subsets for women and men.
- Plot density histograms for each group.
- Compare and describe the distributions.


**Instructions:**  
- Discuss any gender differences in the BMI distributions.

---


The density histograms of BMI for women and men are shown below.

![image.png](attachment:image.png)

### Answer:
The density histograms show that men's BMI values are centered higher than women's (median 25.73 vs. 23.69), while women's values are more spread out (standard deviation 4.05 vs. 3.33). Both distributions are right-skewed, and the right tail is longer for women (skewness 1.03 vs. 0.70), which points to a few high BMI values among women. Overall, men tend to have a higher BMI, but women's BMI values vary more.


## 5. Boxplot by Gender

**d) Box Plot of BMI by Gender**

- Create a box plot of BMI scores grouped by gender.
- Describe the distribution, symmetry/skewness, differences, and outliers.

**Instructions:**  
- Interpret the box plot and compare distributions.

---


The box plot of BMI by gender is shown below.

![image.png](attachment:image.png)

### Answer:
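- Women: mean = 24.22, standard deviation = 4.05, skewness = 1.03, median = 23.69, IQR = 5.03.
- Men: mean = 26.27, standard deviation = 3.33, skewness = 0.70, median = 25.73, IQR = 4.48.
- Men have a higher median BMI than women.
- BMI values for women are more spread out (larger IQR) than for men.
- Both distributions are right-skewed (mean above median, positive skewness), more strongly for women.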

## 6. Summary Statistics

**e) Key Summary Statistics for BMI**

- Calculate and report the following for everyone, women, and men:
  - Number of observations (n)
  - Sample mean ($\bar{x}$)
  - Sample variance ($s^2$)
  - Sample standard deviation ($s$)
  - Lower quartile (Q1)
  - Median (Q2)
  - Upper quartile (Q3)

**Instructions:**  
- Present the summary statistics in a table.
- Discuss what additional insights are provided by the table compared to the box plot.
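
One way to build such a table is sketched below on made-up values. The column names `gender` and `bmi` and the labels `female`/`male` are assumptions. Note that pandas computes the variance and standard deviation with the sample denominator (n − 1) and quartiles by linear interpolation by default, so other software may report slightly different quartiles.

```python
import pandas as pd

# Made-up BMI values; column names and gender labels are assumptions.
df = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "bmi": [20.0, 22.0, 24.0, 30.0, 23.0, 25.0, 27.0, 29.0],
})

def summarize(bmi: pd.Series) -> pd.Series:
    """Return n, mean, variance, standard deviation, and quartiles for one group."""
    return pd.Series({
        "n": bmi.count(),
        "mean": bmi.mean(),
        "var": bmi.var(),            # sample variance (ddof=1)
        "sd": bmi.std(),             # sample standard deviation (ddof=1)
        "Q1": bmi.quantile(0.25),
        "median": bmi.median(),
        "Q3": bmi.quantile(0.75),
    })

summary = pd.DataFrame({
    "everyone": summarize(df["bmi"]),
    "women": summarize(df.loc[df["gender"] == "female", "bmi"]),
    "men": summarize(df.loc[df["gender"] == "male", "bmi"]),
}).T   # one row per group, one column per statistic

print(summary.round(2))
```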

---


### Answer:
Compared to the box plot, the summary table provides precise numerical values for central tendency (mean, median),
spread (variance, standard deviation, IQR), and sample size. While the box plot visually highlights the median and quartiles,
the table quantifies these and adds the mean and variance, allowing for more detailed comparisons between groups.

## Part 2: Missing Values, Outliers, and Bivariate EDA

## 1. Missing Values

- List the number of missing values for each variable in the dataset.
- Choose and apply appropriate strategies for handling missing values (e.g., removal, imputation).
- Justify your chosen approach.
- Show the number of observations before and after handling missing values.
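
The two common strategies can be compared with a sketch like the one below. The table is made up and has gaps planted on purpose; median imputation is shown only as one possible choice, not as the recommended one.

```python
import numpy as np
import pandas as pd

# Made-up data with planted gaps; column names are assumptions.
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 178.0, 181.0],
    "weight": [65.0, 85.0, 58.0, np.nan, 90.0],
    "gender": ["female", "male", "female", "male", "male"],
})

missing_per_variable = df.isna().sum()
n_before = len(df)

# Strategy A: listwise deletion (drop every row with at least one missing value).
dropped = df.dropna()
n_after_drop = len(dropped)

# Strategy B: median imputation for the numeric columns (keeps all rows).
numeric_cols = ["height", "weight"]
imputed = df.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
n_after_impute = len(imputed)

print(missing_per_variable)
print(n_before, n_after_drop, n_after_impute)
```

Deletion is simple but shrinks the sample, while imputation keeps every row at the cost of understating variability.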

### Answer:
No variable has missing values, so no removal or imputation was needed.<br>
Number of observations before handling missing values: 145.<br>
Number of observations after handling missing values: 145.<br>

## 2. Outlier Detection and Handling

### a) Identifying Outliers

- Use visual (boxplots, scatterplots) and statistical methods (e.g., IQR rule, z-scores) to detect outliers in BMI, height, and weight.
- List any extreme values found and discuss whether they are plausible or likely errors.
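
Both statistical rules can be sketched as small helper functions. The helper names `iqr_outliers` and `zscore_outliers` are hypothetical, the cut-offs (1.5 × IQR and |z| > 3) are the conventional defaults rather than values taken from this lab, and the data are synthetic with one extreme value planted.

```python
import numpy as np
import pandas as pd

def iqr_outliers(x: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` sample standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return z.abs() > threshold

rng = np.random.default_rng(1)
bmi = pd.Series(rng.normal(25.0, 3.0, size=200))   # synthetic values, not the survey data
bmi.iloc[0] = 60.0                                 # planted extreme value

flag_iqr = iqr_outliers(bmi)
flag_z = zscore_outliers(bmi)

print(bmi[flag_iqr | flag_z])
```

The z-score rule depends on the mean and standard deviation, which outliers themselves inflate, so the IQR rule is often the more robust of the two for skewed data.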



### Answer:
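- Outliers were identified with both the IQR rule and z-scores; the flagged rows are listed in the accompanying code output.
- Extreme values were checked for plausibility against these limits: height below 140 cm or above 200 cm, weight below 40 kg or above 150 kg, BMI below 15 or above 40.
- The observed BMI range (17.58 to 39.52) lies within the BMI limits, so no BMI value is implausible by this criterion.
- Implausible values may indicate data entry errors and should be reviewed.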


### b) Handling Outliers

- Decide how to handle detected outliers (e.g., keep, remove, or correct).
- Justify your approach and show the effect on the dataset.
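
The effect of removal can be shown with a before/after comparison such as the sketch below. It uses ten made-up BMI values with one extreme value and the 1.5 × IQR rule as an assumed criterion; the helper name `describe` is hypothetical.

```python
import pandas as pd

# Made-up BMI values with one extreme observation.
df = pd.DataFrame({"bmi": [21.0, 22.5, 23.0, 24.0, 24.5, 25.0, 26.0, 27.5, 28.0, 55.0]})

# 1.5 * IQR fences (assumed criterion).
q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = df[df["bmi"].between(lower, upper)]   # keep rows inside the fences

def describe(x: pd.Series) -> dict:
    """Collect the statistics most sensitive to outliers, plus the median for contrast."""
    return {"n": len(x), "mean": x.mean(), "sd": x.std(), "median": x.median()}

effect = pd.DataFrame({
    "before": describe(df["bmi"]),
    "after": describe(cleaned["bmi"]),
}).T

print(effect.round(2))
```

In this toy example the mean and standard deviation shift noticeably while the median barely moves, which is the usual argument for reporting robust statistics alongside any removal.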



### Answer:
BMI and weight outliers were removed for further analysis, which dropped two rows and left 143 of the 145 observations.
Justification: Outliers can disproportionately affect statistical analyses (mean, standard deviation, correlations).
Removing them gives a summary that better reflects the typical population.

## 3. Bivariate Exploratory Data Analysis (EDA)

### a) BMI and Fast Food Consumption

- Create a scatter plot of BMI vs. fast food consumption.
- Calculate and interpret the correlation coefficient.
- Comment on any patterns or associations observed.
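
A sketch of the correlation calculation on synthetic data follows; the variable names and the simulated relationship are assumptions, not survey results. It computes the Pearson coefficient with pandas and repeats the calculation from the definition as a cross-check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic data with a deliberately weak relationship (not the survey data).
fastfood = pd.Series(rng.uniform(0, 50, size=150))
bmi = pd.Series(24 + 0.02 * fastfood + rng.normal(0, 3.5, size=150))

r = bmi.corr(fastfood)   # Pearson correlation (pandas default)

# Same quantity from the definition: sample covariance / (s_x * s_y).
cov = ((bmi - bmi.mean()) * (fastfood - fastfood.mean())).sum() / (len(bmi) - 1)
r_manual = cov / (bmi.std() * fastfood.std())

r_squared = r ** 2       # share of variance explained by a straight-line fit

print(round(r, 3), round(r_squared, 3))
```

As a rule of thumb for interpretation, a coefficient of 0.15 corresponds to r² ≈ 0.02, meaning a straight line would account for only about 2% of the variation in BMI.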

The scatter plot of BMI vs. fast food consumption is shown below.

![image.png](attachment:image.png)

### Answer:
- Correlation coefficient between BMI and fast food consumption: 0.15.
- Interpretation: The correlation is positive but weak, so there is little to no linear association between BMI and fast food consumption.

### b) BMI by Gender

- Use boxplots or violin plots to compare BMI distributions between genders.
- Test for significant differences (e.g., t-test or Mann-Whitney U test).
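
Both tests are available in SciPy and can be run roughly as follows. The two samples are synthetic, with arbitrary group sizes and parameters, and the 5% significance level and the use of Welch's unequal-variance t-test are assumptions rather than requirements of this lab.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic BMI samples for two groups (arbitrary sizes and parameters).
bmi_women = rng.normal(24.0, 4.0, size=80)
bmi_men = rng.normal(26.0, 3.5, size=80)

# Welch's t-test: compares means without assuming equal variances.
t_stat, t_p = stats.ttest_ind(bmi_men, bmi_women, equal_var=False)

# Mann-Whitney U test: rank-based, does not assume normality.
u_stat, u_p = stats.mannwhitneyu(bmi_men, bmi_women, alternative="two-sided")

alpha = 0.05   # assumed significance level
print(f"t = {t_stat:.2f}, p = {t_p:.4f}, significant: {t_p < alpha}")
print(f"U = {u_stat:.1f}, p = {u_p:.4f}, significant: {u_p < alpha}")
```

Reporting the test statistic and p-value for each test, not only the verdict, lets the reader judge the strength of the evidence.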

The box plot and violin plot below compare the BMI distributions of the two genders.

![image.png](attachment:image.png)

### Answer:
Both the t-test and the Mann-Whitney U test indicate a significant difference in BMI between genders, with men having the higher mean BMI.

### c) BMI by Urbanity

- Visualize BMI across different urbanity categories using boxplots or bar plots.
- Discuss any differences or trends.

The box plot below shows BMI across the urbanity categories.

![image.png](attachment:image.png)

### Answer:
- The box plot shows the central tendency and spread of BMI for each urbanity category.
- BMI varies moderately across the categories; medians, spreads, and outliers can be compared directly from the boxes.
- The summary table gives the mean and median BMI for each category and supports the discussion of trends.

### d) Additional Bivariate Relationships

- Explore other pairs of variables as relevant (e.g., weight vs. height, fast food vs. gender).
- Use appropriate plots and statistics.

### Answer:
Both the t-test and the Mann-Whitney U test indicate a significant difference in fast food consumption between genders, with men reporting higher consumption than women.

## 4. Summary

- Summarize the key findings from your missing value analysis, outlier handling, and bivariate EDA.
- Discuss how these steps improve the quality and reliability of your subsequent analyses.

### Answer:
The dataset underwent thorough cleaning and exploratory data analysis (EDA). Key steps included confirming the absence of missing values, identifying and removing outliers in BMI and weight using both the IQR and z-score methods, and ensuring all variables were in appropriate formats. After removing two outlier rows, the dataset contained 143 observations, improving the reliability of subsequent analyses.

Bivariate EDA revealed several insights: there is little to no linear association between BMI and fast food consumption, while men have a significantly higher mean BMI and fast food consumption than women, as confirmed by both t-tests and Mann-Whitney U tests. BMI distributions also showed moderate variation across urbanity categories, and a moderate positive correlation was found between weight and height.

Overall, these data cleaning and EDA steps enhanced the dataset's quality, ensuring robust and interpretable results for further statistical analysis and modeling. Removing outliers and confirming data integrity were crucial for drawing valid conclusions from the data.
