### Jupyter Homework 7: Exploratory Data Analysis and Linear Regression

### Childhood Obesity Data

According to the *[World Health Organization](https://www.who.int/health-topics/obesity#tab=tab_1)*, "Overweight and obesity are defined as abnormal or excessive fat accumulation that presents a risk to health. A body mass index (BMI) over 25 is considered overweight, and over 30 is obese. The issue has grown to epidemic proportions, with over 4 million people dying each year as a result of being overweight or obese in 2017 according to the global burden of disease." This is particularly concerning for children, among whom the prevalence of obesity has increased worldwide over the past 50 years (Abarca-Gomez et al.).

To understand this phenomenon, we will look at some representative data compiled by the CDC through NHANES (**National Health and Nutrition Examination Survey**). We will apply linear regression to examine the relationships between `Height`, `Weight`, and `Waist Size` in our dataset.

### Setup

Import the Python libraries needed for this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### NHANES Data

The data we will use is a cleaned-up version of a data set downloaded from the **[National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm)**.

The particular data set chosen was the [2017-March 2020 Pre-Pandemic Examination Data](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm).

In [None]:
# Read in the data set and take a quick look at it.
df = pd.read_csv('data/NHANESChildObesity.csv', index_col = 0)
df = df.reset_index(drop = True)
df.head()

Let's *rename the columns* to make the data more readable.

In [None]:
cols = {"BMDBMIC":"Category",
        "BMXBMI":"BMI (kg/m**2)",
        "BMXHT":"Height (cm)",
        "BMXWAIST":"Waist Size (cm)",
        "BMXWT":"Weight (kg)"}
df = df.rename(columns = cols)

In [None]:
df.head()

According to the website description, the `Category` column contains the codes $1$, $2$, $3$, or $4$. These represent `Underweight`, `Normal Weight`, `Overweight`, and `Obese`, respectively. Let's replace the category codes with their names.

In [None]:
category_codes = {1:'Underweight',
                  2:'Normal Weight',
                  3:'Overweight',
                  4:'Obese'}

In [None]:
# Replace each numeric code with its name; codes not in the dictionary become NaN
df['Category'] = df['Category'].map(category_codes)

In [None]:
df.head()

Now let's look at those in the category `Normal Weight` only.

In [None]:
# Keep only the rows in the 'Normal Weight' category
df = df[df['Category'] == 'Normal Weight']

# pairplot creates its own figure, so resize that figure directly
g = sns.pairplot(df, kind="scatter")
g.fig.set_size_inches(7,7)

### Questions

#### Question 1:

What do you notice about the plot? Which two variables seem to be the most correlated?

In [None]:
# Put your answer to Question 1 here



#### Question 2a:

For two columns of *your choice*, find the *slope* and $y$*-intercept* of the *line of best fit* using `scipy.stats.linregress`. You can look at the [documentation here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html), or ask ChatGPT for help.
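
As a hedged illustration of the call pattern only (not a solution for the NHANES columns), the sketch below runs `scipy.stats.linregress` on synthetic points scattered around a known line. The arrays `x_demo` and `y_demo`, and the slope and intercept used to generate them, are made up for this example.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): points near y = 2x + 1
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.1, size=x_demo.size)

# linregress returns a result whose fields include slope, intercept, and rvalue
result = stats.linregress(x_demo, y_demo)
print(result.slope, result.intercept, result.rvalue)
```

The fitted slope and intercept should land close to the values used to generate the data, which is a useful sanity check before applying the same call to real columns.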

In [None]:
# Put your answer to Question 2a here

# Example choices of X and Y:
X = df['Height (cm)']
Y = df['Weight (kg)']



slope = ...
y_intercept = ...

#### Question 2b:

In class, we saw the following formulas for the line of best fit $y = \alpha + \beta x$. Given a set of data points $(x_i, y_i)$,
$$\beta = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
and
$$\alpha = \overline{y}_n - \beta \overline{x}_n.$$
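
To see the arithmetic on a small made-up example (not taken from the NHANES data), consider the three points $(1,2)$, $(2,3)$, $(3,5)$. Here $n = 3$, $\sum x_i = 6$, $\sum y_i = 10$, $\sum x_i y_i = 23$, and $\sum x_i^2 = 14$, so
$$\beta = \frac{3 \cdot 23 - 6 \cdot 10}{3 \cdot 14 - 6^2} = \frac{9}{6} = 1.5, \qquad \alpha = \frac{10}{3} - 1.5 \cdot \frac{6}{3} = \frac{1}{3}.$$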


Verify your answers to **Question 2a** by creating two functions that calculate alpha and beta.

In [None]:
# Put your answer to Question 2b here

def beta(X,Y):
    # calculate beta 
    ...


def alpha(X,Y):
    b = beta(X,Y)
    # calculate alpha
    ...


X = ...
Y = ...
print(alpha(X,Y),beta(X,Y))

#### Question 3:

Using the slope and intercept you found in **Question 2**, plot the scatter points and the line on the same figure. Again, feel free to ask ChatGPT for help, or search for `plt.scatter` (or `sns.scatterplot`) for more information on how to plot using matplotlib (or seaborn).
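
One possible way to overlay a line on a scatter plot is sketched below with matplotlib. The data points and the values `slope_demo` and `intercept_demo` are hypothetical placeholders chosen for illustration; they are not fitted results for the homework data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data and line parameters, for illustration only
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope_demo, intercept_demo = 2.0, 0.0  # hypothetical values, not a fitted result

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(x_demo, y_demo, label="data")

# Draw the line across the observed range of x
x_line = np.linspace(x_demo.min(), x_demo.max(), 100)
ax.plot(x_line, intercept_demo + slope_demo * x_line, color="red", label="line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Drawing the line only over the observed range of $x$ avoids suggesting that the fit extends beyond the data.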

In [None]:
# Put your answer to Question 3 here

plot = ...

#### Question 4:

Use your line of best fit to *estimate* the $y$-value of someone at the 25th, 50th, and 75th percentiles of your chosen $x$ variable.
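
A minimal sketch of the idea, using made-up numbers: `np.percentile` gives the percentile values of $x$, and the line $y = \alpha + \beta x$ turns each of them into a predicted $y$. The array `x_demo` and the parameters `slope_demo` and `intercept_demo` are hypothetical and stand in for your own column and fitted values.

```python
import numpy as np

# Made-up x-values and line parameters, for illustration only
x_demo = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
slope_demo, intercept_demo = 0.5, 3.0  # hypothetical, not fitted values

# 25th, 50th, and 75th percentiles of x
x_q = np.percentile(x_demo, [25, 50, 75])

# Predicted y at each percentile: y = alpha + beta * x
y_q = intercept_demo + slope_demo * x_q
print(x_q, y_q)
```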

In [None]:
# Put your answer to Question 4 here



x25 = ...
y25 = ...
print(x25,y25)


x50 = ...
y50 = ...
print(x50,y50)


x75 = ...
y75 = ...
print(x75,y75)

### References
1. Abarca-Gómez et al. (NCD Risk Factor Collaboration). Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128·9 million children, adolescents, and adults. Lancet. 2017 Dec 16;390(10113):2627-2642. doi: 10.1016/S0140-6736(17)32129-3. Epub 2017 Oct 10. PMID: 29029897; PMCID: PMC5735219.
2. Chan and Lewis. Computing standard deviations: accuracy. Communications of the ACM, Vol. 22, No. 9, September 1979.

## -----------------------------------------------------------------------

**Submission** : Go to `File -> Save and Export Notebook As... -> PDF` and upload it onto CatCourses.

## -----------------------------------------------------------------------

*Math 032 - Probability and Statistics*

*UC Merced: Adrien Peltzer, July 20, 2024*