# Lab 6: Correlation Analysis

In this lab we will create scatter plots for several pairs of variables and examine the correlations they show.

For this section, we will use the NHANES demographic dataset on a variety of body measurements.

***

### Load packages and dataset

In [None]:
# import packages

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.patches as patches

In [None]:
# load dataset

bmx_data = pd.read_csv("datasets/BMX_data.csv")
bmx_data.head()

## Generating Scatter Plots

### Task 1

**Create a scatter plot** with an individual's *height on the x-axis* and an individual's *weight on the y-axis*.

All you need are two lists of values to pass to the `scatter()` function. Height is stored in the "BMXHT" column and weight is stored in the "BMXWT" column. Label the plot and the axes.

In [None]:
# pull out the height and weight values from the dataframe
# these values are stored in the columns "BMXHT" and "BMXWT"

height_values = 
weight_values = 

# complete the code below to create the scatter plot

fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(x_values, y_values, c="SkyBlue")
plt.title("Weight vs Height", fontsize=20)
axs.set_xlabel("Height (cm)", fontsize=15)
axs.set_ylabel("Weight (kg)", fontsize=15)
axs.tick_params(labelsize=10)
plt.show()

<p>
<details><summary>Click to show correct plot</summary><br>

![](img/scatter-weight-height.jpeg)

</details>
</p>


<p>
<details><summary>Click to show solution</summary><br>

```python
# pull out the height and weight values from the dataframe
# these values are stored in the columns "BMXHT" and "BMXWT"

height_values = bmx_data["BMXHT"]
weight_values = bmx_data["BMXWT"]

# complete the code below to create the scatter plot

fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(height_values, weight_values, c="SkyBlue")
plt.title("Weight vs Height", fontsize=20)
axs.set_xlabel("Height (cm)", fontsize=15)
axs.set_ylabel("Weight (kg)", fontsize=15)
axs.tick_params(labelsize=10)
plt.show()
```

</details>
</p>

It appears that there is a relatively strong, positive linear correlation between an individual's height and weight: heavier weights are associated with taller heights. *Do you agree?*

## Analyzing Correlations: Strong, Moderate, Weak, Positive, Negative

### Task 2: Strong, weak and moderate *positive* correlation

Let us now load a new dataset to find some variables that might not necessarily have such a strong, positive correlation. For this section we will use a U.S. Census Bureau 2015 American Community Survey dataset, which collects a variety of demographic and socioeconomic information from households around the US.

In [None]:
# load dataset

comm_data = pd.read_csv("datasets/bureau_survey.csv")
display(comm_data.head())

Here are some of the column names of interesting data points recorded in this dataset:
- **NP**: number of people in household
- **BDSP**: number of bedrooms in household
- **RMSP**: number of rooms in household
- **HINCP**: household income (in past 12 months)
- **RNTP**: monthly rent

Here are some of the data points recorded in the `bmx_data` dataset:
- **BMXHT**: height
- **BMXWT**: weight
- **BMXARMC**: arm circumference
- **BMXBMI**: body mass index
- **BMXWAIST**: waist circumference

Now **pick three pairs of variables from either dataset and create a scatter plot for each pair**. Here are the requirements:
> 1. Pair 1 should contain variables that have a *strong positive correlation*.
> 2. Pair 2 should contain variables that have a *weak positive correlation*.
> 3. Pair 3 should contain variables that have a *moderate positive correlation* (something in between strong and weak).
>
> For each pair, you must pick two variables from the same dataset, because each row of values corresponds to a single person (in `bmx_data`) or to a single household (in `comm_data`).

<p>
<details><summary>Click to show scatter plot template code</summary><br>
    
```python
fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(x_values, y_values, c="SkyBlue")
plt.title("Plot Title", fontsize=20)
axs.set_xlabel("X-axis label", fontsize=15)
axs.set_ylabel("Y-axis label", fontsize=15)
axs.tick_params(labelsize=10)
plt.show()
```

</details>
</p>

In [None]:
## scatter plot of variables with a STRONG POSITIVE CORRELATION

# pull out the x- and y-values from dataframe of choice

x_values = 
y_values = 

# place scatter plot code below



<p>
<details><summary>Click to show example of correct plot</summary><br>
    
An individual's waist circumference and weight are strongly, positively correlated. If you chose different variables, make sure the resulting scatterplot still has the same shape.

![](img/scatter-strong-positive.jpeg)

</details>
</p>

In [None]:
## scatter plot of variables with a WEAK POSITIVE CORRELATION

# pull out the x- and y-values from dataframe of choice

x_values = 
y_values = 

# place scatter plot code below



<p>
<details><summary>Click to show example of correct plot</summary><br>
    
The number of people in a household is very weakly, positively correlated (practically uncorrelated) with the number of bedrooms in that household. If you chose different variables, make sure the resulting scatterplot still has the same shape.

![](img/scatter-weak-positive.jpeg)

</details>
</p>

In [None]:
## scatter plot of variables with a MODERATE POSITIVE CORRELATION

# pull out the x- and y-values from dataframe of choice

x_values = 
y_values = 

# place scatter plot code below



<p>
<details><summary>Click to show example of correct plot</summary><br>
    
A household's income is only moderately, positively correlated with that household's monthly rent. If you chose different variables, make sure the resulting scatterplot still has the same shape.

![](img/scatter-moderate-positive.jpeg)

</details>
</p>

### Task 3: *Negative* correlation

For this section, we will use two NHANES datasets that record data on body measurements and cardiovascular fitness for the *same* subjects (which we know because some of the subject identifiers, the "SEQN" attribute, in each dataset are the same).

We will create a scatter plot to display **the relationship between a subject's body mass index and maximal endurance time**.

First, we will load each of these datasets and combine them into one dataset which captures the body mass index and maximal endurance time values for subjects who had both values recorded.

***How will we do this?***
> Each dataset has a "SEQN" column containing subject identifiers. The "SEQN" number 71918, for example, in both datasets will correspond to the same person.
> ![](img/seqn.png)
> So we will need to combine the datasets *using the "SEQN" column*.
> To do this we use the code:
>```python
> dataset1.merge(dataset2, on="column_to_merge_on")
>```
> where "dataset1" and "dataset2" are place-holders for the datasets we want to merge, and "column_to_merge_on" is a place-holder for the column we want to use to combine the datasets.
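
To make the merge step concrete, here is a minimal sketch on two tiny made-up tables. The table names `left` and `right` and all values are hypothetical and only stand in for the real datasets; the point is that `merge()` keeps, by default, only the "SEQN" values that appear in both tables and places their columns side by side.

```python
import pandas as pd

# Two toy tables that share a "SEQN" identifier column (all values are made up)
left = pd.DataFrame({"SEQN": [1, 2, 3], "BMXBMI": [21.5, 30.2, 25.0]})
right = pd.DataFrame({"SEQN": [2, 3, 4], "CEDEXLEN": [610, 540, 720]})

# merge() performs an inner join by default:
# only SEQN values found in BOTH tables are kept
merged = left.merge(right, on="SEQN")

print(merged)
print("rows:", len(merged), "columns:", len(merged.columns))
```

Subjects 1 and 4 appear in only one of the toy tables, so they are absent from `merged`. The same thing happens with the real data: only subjects with a record in both datasets remain after the merge.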

**Complete the cell below** in which we load the datasets and then combine them using the "SEQN" column.

In [None]:
# create combined dataframe on BMI and maximal endurance time

# load datasets

cardio_data = pd.read_csv("datasets/cardio_end_data.csv")
fit_bmx_data = pd.read_csv("datasets/fit_bmx_data.csv")

# print the number of columns in each dataset
print("cardio_data columns:", len(cardio_data.columns))
print("fit_bmx_data columns:", len(fit_bmx_data.columns))

# combine the datasets
### TO-DO: replace the place-holders in the line of code below
#   with the correct variable names and column name ###

combined_data = dataset1.merge(dataset2, on="column_to_merge_on")

# print the number of columns in the combined dataset
print("combined_data columns:", len(combined_data.columns))

display(combined_data.head())

<p>
<details><summary>Click to show "merge" solution</summary><br>

```python
combined_data = cardio_data.merge(fit_bmx_data, on="SEQN")
```

</details>
</p>

Now that we have all of the necessary information from both datasets for each of our subjects combined into one dataset, we can pull out the datapoints we need in order to create a scatter plot.
> Body mass index is stored in the **"BMXBMI"** column and maximal endurance time is stored in the **"CEDEXLEN"** column.

In the cell below, **create a scatter plot** with *maximal endurance time on the y-axis* and *BMI on the x-axis* to visualize the relationship between these variables.

In [None]:
## scatter plot of variables with a NEGATIVE CORRELATION

# pull out values from the "combined_data" dataframe
x_values = 
y_values = 

# fill in scatter plot code below

fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(x_values, y_values, c="SkyBlue")
plt.title("Max Endurance Time vs BMI", fontsize=20)
axs.set_xlabel("BMI", fontsize=15)
axs.set_ylabel("Max Endurance Time", fontsize=15)
axs.tick_params(labelsize=10)
plt.show()

<p>
<details><summary>Click to show correct plot</summary><br>

![](img/scatter-negative.jpeg)

</details>
</p>

There is a moderate, negative linear correlation between an individual's maximal endurance time and BMI: a higher max endurance time is associated with a lower BMI. *Is this what we would expect?*


### Displaying a subset of a larger dataframe
Our `combined_data` dataset is very large. If we wanted to show someone only the datapoints we had used in our scatter plot, we could create a new simple dataframe containing only the relevant columns.

**Read the code in the cell below which creates this simple dataframe to display. Run the cell.**

In [None]:
# select only the 3 relevant columns from "combined_data"

simple_combined = combined_data[["SEQN", "BMXBMI", "CEDEXLEN"]]

display(simple_combined.head())

## Effect of Outliers on Correlation Analysis

Let's observe the effect of outliers on correlation. One easy way to see this is by looking at the correlation coefficient associated with our data.

We will use the `corrcoef()` function in the `numpy` package:
> The function takes 2 arguments: the *x-values* and the *y-values* between which we are analyzing the correlation.
>
> It will return a [*correlation matrix*](https://www.displayr.com/what-is-a-correlation-matrix/) which we will store in the `corr_matrix` variable.
>
> Since we are comparing only two variables, there is only one number in the correlation matrix that we care about, which we can access with the following code: `corr_matrix[0][1]` to extract the value in the 1st row and 2nd column.
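
As a quick illustration of what the correlation matrix looks like, here is a small sketch using made-up values (not taken from any of the lab datasets). It shows that the matrix for two variables is 2x2, and that the off-diagonal entries hold the coefficient we are after.

```python
import numpy as np

# Made-up values purely for illustration
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 7, 4, 1]

corr_matrix = np.corrcoef(x_values, y_values)

print(corr_matrix.shape)   # (2, 2)
print(corr_matrix[0][0])   # correlation of x with itself
print(corr_matrix[0][1])   # correlation of x with y
print(corr_matrix[1][0])   # same value as [0][1]; the matrix is symmetric
```

Because `y_values` decreases as `x_values` increases, the off-diagonal value is negative here.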

**Requirements for the `corrcoef()` function**:
```python
corr_matrix = np.corrcoef(x_values, y_values)
```
> 1. The x-values and y-values between which we are computing the correlation, *cannot contain any NaN values* (unknown/missing values).
> 2. In this lab, the x-values and y-values are passed to the function as *explicit lists* (we wrap each column in `list()`).
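
The following sketch, on made-up values, shows why the first requirement matters: a single NaN in either list makes the resulting coefficient NaN as well. The filtering step shown here is just one possible way to remove incomplete pairs.

```python
import numpy as np

# Made-up values; the third x-value is missing
x_values = [1.0, 2.0, np.nan, 4.0]
y_values = [2.0, 4.0, 6.0, 8.0]

with_nan = np.corrcoef(x_values, y_values)[0][1]
print(with_nan)  # nan

# keep only the positions where both values are present
pairs = [(x, y) for x, y in zip(x_values, y_values)
         if not (np.isnan(x) or np.isnan(y))]
clean_x = [x for x, _ in pairs]
clean_y = [y for _, y in pairs]

without_nan = np.corrcoef(clean_x, clean_y)[0][1]
print(without_nan)
```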

### Task 4: Calculate a correlation coefficient

We will again be looking at the correlation between maximal endurance time and BMI, which we can see in the above scatter plot.

**Take a look at the code below in which we compute the correlation coefficient for the data shown in our scatter plot above.**
>**Explanation**:<br>
> (1) We loop through all of our entries in the dataset.
>
> (2) For each entry, we access the BMI and max endurance time associated with that entry. If either of these values is NaN (meaning "not a number"), we treat it as missing/unknown and drop the associated entry.<br>
>
> To check whether a value is NaN, we use the code `np.isnan(value)`. This function returns `True` if the value is NaN and `False` otherwise.
>
> **We have now handled all missing values! So you will not have to worry about these anymore.**
>
> (3) We calculate the correlation matrix, ensuring that we make our lists of values explicit lists by wrapping them in the function `list()`.
>
> (4) We extract the correlation coefficient from the correlation matrix.

In [None]:
## (1)-(2) ensure that our value lists do not contain NaN values

# run through all entries in "simple_combined"
for i in simple_combined.index:
    
    # check if either the BMXBMI value or the CEDEXLEN value is NaN
    if np.isnan(simple_combined.loc[i, "BMXBMI"]) or np.isnan(simple_combined.loc[i, "CEDEXLEN"]):
        
        # if either is NaN, drop/remove this entry in the dataset
        simple_combined = simple_combined.drop(i)
      
        
## (3) calculate correlation matrix

corr_matrix = np.corrcoef(list(simple_combined["BMXBMI"]), list(simple_combined["CEDEXLEN"]))

## (4) extract correlation coefficient from correlation matrix

corr_coeff = corr_matrix[0][1]

print('Correlation coefficient:', corr_coeff)
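
For comparison, the row-by-row loop can also be written as a single pandas call. This is an alternative sketch, not the method used in this lab: it relies on `DataFrame.dropna()` with the `subset` argument, and the `toy` dataframe below is a made-up stand-in for `simple_combined`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for "simple_combined" (all values are made up)
toy = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "BMXBMI": [18.0, np.nan, 25.0, 31.0, 22.0],
    "CEDEXLEN": [700.0, 650.0, np.nan, 400.0, 600.0],
})

# drop every row where either of the two columns is NaN, in one call
toy_clean = toy.dropna(subset=["BMXBMI", "CEDEXLEN"])

corr_matrix = np.corrcoef(list(toy_clean["BMXBMI"]), list(toy_clean["CEDEXLEN"]))
corr_coeff = corr_matrix[0][1]

print(len(toy), "rows before,", len(toy_clean), "rows after")
print("Correlation coefficient:", corr_coeff)
```

Both approaches should leave the same rows; the loop makes each check explicit, while `dropna()` is shorter and usually faster on large dataframes.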

### Task 5: Add additional *non-outlier* datapoints to the dataset

Let's see what happens to the correlation coefficient when we add more data points that are close to the majority of the other data points (i.e. not outliers).

We modified the original `simple_combined` dataset so that it contains these additional non-outlier datapoints. We then plot this modified dataset in a scatter plot, with the added data points shown in red circles:
![](img/data-2.png)

<p>
<details><summary>Expand this text if you would like to see how we created this modified dataset and made the plot.</summary><br>

```python
# add more data points that are NOT OUTLIERS

last_i = len(simple_combined.index)
bmx_cardio_data_2 = pd.DataFrame(simple_combined)
bmx_cardio_data_2.loc[last_i, "BMXBMI"] = 17
bmx_cardio_data_2.loc[last_i, "CEDEXLEN"] = 700
bmx_cardio_data_2.loc[last_i+1, "BMXBMI"] = 18
bmx_cardio_data_2.loc[last_i+1, "CEDEXLEN"] = 710
bmx_cardio_data_2.loc[last_i+2, "BMXBMI"] = 16
bmx_cardio_data_2.loc[last_i+2, "CEDEXLEN"] = 720
bmx_cardio_data_2.loc[last_i+3, "BMXBMI"] = 15
bmx_cardio_data_2.loc[last_i+3, "CEDEXLEN"] = 690
bmx_cardio_data_2.loc[last_i+4, "BMXBMI"] = 16
bmx_cardio_data_2.loc[last_i+4, "CEDEXLEN"] = 680
bmx_cardio_data_2.loc[last_i+5, "BMXBMI"] = 18
bmx_cardio_data_2.loc[last_i+5, "CEDEXLEN"] = 700

bmx_cardio_data_2.to_csv("datasets/bmx_cardio_data_2.csv", index=False)

# plot new data in scatter plot

fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(bmx_cardio_data_2["BMXBMI"], bmx_cardio_data_2["CEDEXLEN"], c="SkyBlue")
plt.title("Max Endurance Time vs BMI", fontsize=20)
axs.set_xlabel("BMI", fontsize=15)
axs.set_ylabel("Max Endurance Time", fontsize=15)
axs.tick_params(labelsize=10)

# create circles
ellipse1 = patches.Ellipse((17, 700), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse1)
ellipse2 = patches.Ellipse((18, 710), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse2)
ellipse3 = patches.Ellipse((16, 720), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse3)
ellipse4 = patches.Ellipse((15, 690), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse4)
ellipse5 = patches.Ellipse((16, 680), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse5)
ellipse6 = patches.Ellipse((18, 700), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse6)

plt.show()
```

</details>
</p>

Now, **compute the correlation coefficient for this newly modified dataset which is stored in the variable `bmx_cardio_data_2`**. Use the same method we used above; just change the name of the dataset variable in the code.

In [None]:
# load "bmx_cardio_data_2"

bmx_cardio_data_2 = pd.read_csv("datasets/bmx_cardio_data_2.csv")

# calculate correlation coefficient for modified dataset

corr_matrix = 
corr_coeff = 

print('Correlation coefficient:', corr_coeff)

<p>
<details><summary>Click to show answer</summary><br>

`Correlation coefficient: -0.4562893277490532`

</details>
</p>

<p>
    
<details><summary>Click to show solution</summary><br>

```python
corr_matrix = np.corrcoef(list(bmx_cardio_data_2["BMXBMI"]), list(bmx_cardio_data_2["CEDEXLEN"]))
corr_coeff = corr_matrix[0][1]
```

</details>
</p>

The correlation coefficient barely changes: it was -0.4558 before and is -0.4563 now. What if we instead add data points to the dataset that *are* outliers and lie far away from the rest of the data?

### Task 6:  Add additional *outlier* datapoints to the dataset

Now, in the cell below we load a dataset which contains additional outlier datapoints into the variable `bmx_cardio_data_3`. We then plot this modified dataset in a scatter plot, with the added data points shown in red circles:
![](img/data-3.png)

<p>
<details><summary>Expand this text if you would like to see how we created this modified dataset and made the plot.</summary><br>

```python
# add more data points that ARE OUTLIERS

last_i = len(simple_combined.index)
bmx_cardio_data_3 = pd.DataFrame(simple_combined)
bmx_cardio_data_3.loc[last_i, "BMXBMI"] = 33
bmx_cardio_data_3.loc[last_i, "CEDEXLEN"] = 850
bmx_cardio_data_3.loc[last_i+1, "BMXBMI"] = 35
bmx_cardio_data_3.loc[last_i+1, "CEDEXLEN"] = 1000
bmx_cardio_data_3.loc[last_i+2, "BMXBMI"] = 30
bmx_cardio_data_3.loc[last_i+2, "CEDEXLEN"] = 1000
bmx_cardio_data_3.loc[last_i+3, "BMXBMI"] = 21
bmx_cardio_data_3.loc[last_i+3, "CEDEXLEN"] = 150
bmx_cardio_data_3.loc[last_i+4, "BMXBMI"] = 19
bmx_cardio_data_3.loc[last_i+4, "CEDEXLEN"] = 200
bmx_cardio_data_3.loc[last_i+5, "BMXBMI"] = 22
bmx_cardio_data_3.loc[last_i+5, "CEDEXLEN"] = 50

bmx_cardio_data_3.to_csv("datasets/bmx_cardio_data_3.csv", index=False)

# plot new data in scatter plot

fig, axs = plt.subplots(figsize=(8,5))
axs.scatter(bmx_cardio_data_3["BMXBMI"], bmx_cardio_data_3["CEDEXLEN"], c="SkyBlue")
plt.title("Max Endurance Time vs BMI", fontsize=20)
axs.set_xlabel("BMI", fontsize=15)
axs.set_ylabel("Max Endurance Time", fontsize=15)
axs.tick_params(labelsize=10)

# create circles
ellipse1 = patches.Ellipse((33, 850), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse1)
ellipse2 = patches.Ellipse((35, 1000), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse2)
ellipse3 = patches.Ellipse((30, 1000), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse3)
ellipse4 = patches.Ellipse((21, 150), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse4)
ellipse5 = patches.Ellipse((19, 200), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse5)
ellipse6 = patches.Ellipse((22, 50), 0.5, 25, linewidth=1, linestyle='--', edgecolor='r', facecolor='none')
axs.add_artist(ellipse6)

plt.show()
```

</details>
</p>

Now, **compute the correlation coefficient for this newly modified dataset which is stored in the variable `bmx_cardio_data_3`**.

In [None]:
# load "bmx_cardio_data_3"

bmx_cardio_data_3 = pd.read_csv("datasets/bmx_cardio_data_3.csv")

# calculate correlation coefficient for modified dataset

corr_matrix = 
corr_coeff = 

print('Correlation coefficient:', corr_coeff)

<p>
<details><summary>Click to show answer</summary><br>

`Correlation coefficient: -0.4077529252711053`

</details>
</p>

<p>
<details><summary>Click to show solution</summary><br>

```python
corr_matrix = np.corrcoef(list(bmx_cardio_data_3["BMXBMI"]), list(bmx_cardio_data_3["CEDEXLEN"]))
corr_coeff = corr_matrix[0][1]
```

</details>
</p>

With the addition of just a few outliers, our correlation coefficient is less negative (by about 0.05), indicating that the negative correlation between max endurance time and BMI now appears to be *weaker* than before.
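
The same effect can be reproduced on a small synthetic dataset. The sketch below uses made-up numbers (a downward trend with a small wiggle) and a hypothetical helper `corr()`. It is not the NHANES data, so the coefficients will differ from the ones above, but the pattern is the same: a point near the trend leaves the coefficient almost unchanged, while a point far from the trend weakens it.

```python
import numpy as np

def corr(x, y):
    """Pearson correlation coefficient computed with np.corrcoef."""
    return np.corrcoef(list(x), list(y))[0][1]

# synthetic data: a downward trend with a small alternating wiggle
x_base = np.arange(20, dtype=float)
y_base = 100 - 2 * x_base + 3 * (-1) ** np.arange(20)

r_base = corr(x_base, y_base)

# add one point that lies on the trend line (not an outlier)
r_inlier = corr(np.append(x_base, 10.0), np.append(y_base, 80.0))

# add one point that lies far from the trend (an outlier)
r_outlier = corr(np.append(x_base, 30.0), np.append(y_base, 100.0))

print("original:     ", r_base)
print("with inlier:  ", r_inlier)
print("with outlier: ", r_outlier)
```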