In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 6.1 Part 1: Correlation

Correlation measures the strength of a linear association between two numeric variables. In this notebook, you'll learn how
to calculate and interpret it.

3 videos make up this notebook, for a total run time of 46:27.

1. [Visualizing Correlation](#section1) *1 video, total runtime 15:12*
2. [Calculating Correlation](#section2) *1 video, total runtime 19:54*
3. [Interpreting Correlation](#section3) *1 video, total runtime 11:21*
4. [Check for Understanding](#section4)

Textbook readings:
- [Chapter 15: Prediction](https://www.inferentialthinking.com/chapters/15/Prediction.html)
- [Chapter 15.1: Correlation](https://www.inferentialthinking.com/chapters/15/1/Correlation.html)

<a id='section1'></a>
## 1. Visualizing Correlation

In the following video you'll be introduced to correlation, a measure of linear association between two numeric variables.
You'll also see how scatterplots can be used to identify linear trends in data, and how these trends relate to correlation.

In [None]:
YouTubeVideo('k9-rzXYH11Q')

Run the cell below to load `pokemon`, a table that contains attributes for each Pokemon. 

In [None]:
# follow along here
pokemon = Table.read_table("pokemon.csv")
pokemon.show(5)

In the cell below, generate a scatterplot that visualizes the relationship between `Attack` and `Defense` values.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    pokemon.scatter("Attack", "Defense")    
</details>
<br>

Create a table `standardized_attack_defense` that contains two columns:
- `standardized_attack`: the values from the `Attack` column of `pokemon` in standard units
- `standardized_defense`: the values from the `Defense` column of `pokemon` in standard units

In [None]:
def convert_to_standard_units(arr):
    ...

standardized_attack_defense = ...
standardized_attack_defense.show(5)

<details>
    <summary>Solution</summary>
    
    def convert_to_standard_units(arr):
        return (arr - np.average(arr)) / np.std(arr)

    standardized_attack_defense = Table().with_columns(
        "standardized_attack", convert_to_standard_units(pokemon.column("Attack")),
        "standardized_defense", convert_to_standard_units(pokemon.column("Defense"))
    )
</details>
<br>
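
As a quick sanity check on what standard units mean, the sketch below applies the same formula to a small made-up array. The `heights` values and the helper name `to_standard_units` are hypothetical and are not part of the `pokemon` data. After conversion, the values should have a mean of 0 and an SD of 1, and each value reads as the number of SDs above or below average.

```python
import numpy as np

def to_standard_units(arr):
    """Convert an array to standard units: (value - mean) / SD."""
    return (arr - np.mean(arr)) / np.std(arr)

# Hypothetical toy data, used only for illustration
heights = np.array([60.0, 62.0, 65.0, 70.0, 73.0])
heights_su = to_standard_units(heights)

print(np.round(heights_su, 2))           # each value in SDs from the mean
print("mean:", np.mean(heights_su))      # should be (very close to) 0
print("SD:", np.std(heights_su))         # should be (very close to) 1
```

If your own `convert_to_standard_units` does not produce a mean near 0 and an SD near 1, that is a sign something in the formula is off.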

Now generate a scatterplot that visualizes the relationship between `Attack` and `Defense` values, in standard units.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    standardized_attack_defense.scatter("standardized_attack", "standardized_defense") 
</details>
<br>

Is there a positive or negative association between the two variables?

<details>
    <summary>Solution</summary>
    Positive association
</details>
<br>

<a id='section2'></a>

## 2. Calculating Correlation

In the next video, you'll learn how to quantify the linear relationship between two numeric variables.

In [None]:
YouTubeVideo('uBN0NyAb8GU')

Use the table `standardized_attack_defense` to calculate the correlation coefficient between the attack and defense values in `pokemon`.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    np.mean(standardized_attack_defense.column("standardized_attack") * standardized_attack_defense.column("standardized_defense"))
</details>
<br>
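
One way to check a hand-computed correlation is to compare it against NumPy's built-in `np.corrcoef`. The sketch below does this on two small made-up arrays. The arrays `x` and `y` and the helper names `standard_units` and `correlation` are illustrative assumptions, not something this notebook provides.

```python
import numpy as np

def standard_units(arr):
    """Convert an array to standard units: (value - mean) / SD."""
    return (arr - np.mean(arr)) / np.std(arr)

def correlation(x, y):
    """Correlation as the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Hypothetical toy data, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix

print(r, r_numpy)  # the two values should agree up to floating-point error
```

The same comparison should work on the `Attack` and `Defense` columns, assuming they are passed in as NumPy arrays.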

<a id='section3'></a>

## 3. Interpreting Correlation

In the upcoming video, you'll see how non-linear relationships and outliers affect correlation. You'll also learn
which precautions you should take when calculating the correlation coefficient of aggregated data.
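
Both effects can be previewed on made-up data. In the sketch below, every array and the `correlation` helper are hypothetical. The first example has `y` completely determined by `x`, but not linearly. The second starts from perfectly linear data and adds a single outlier.

```python
import numpy as np

def correlation(x, y):
    """Mean of the products of x and y in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

# Non-linear relationship: y = x^2 on values symmetric around 0
x = np.arange(-3.0, 4.0)          # -3, -2, ..., 3
y = x ** 2
r_parabola = correlation(x, y)    # approximately 0 despite a perfect relationship

# Outlier: perfectly linear data, then the same data with one extra point
x_line = np.arange(1.0, 11.0)     # 1, 2, ..., 10
y_line = 2 * x_line
r_line = correlation(x_line, y_line)

x_out = np.append(x_line, 20.0)   # one point far from the line
y_out = np.append(y_line, 0.0)
r_with_outlier = correlation(x_out, y_out)

print(r_parabola, r_line, r_with_outlier)
```

A correlation near 0 therefore does not mean the variables are unrelated, and a single point can change the coefficient substantially, which is why it helps to look at the scatterplot before interpreting the number.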

In [None]:
YouTubeVideo('-n8LgiYXoXU')

Define `pokemon_grouped` to be a table that contains the average attributes for each combination of `Type 1` and `Type 2` pokemon in the `pokemon` table. Add the columns `standardized_attack` and `standardized_defense` to the table to represent the attack and defense values in standard units.

In [None]:
pokemon_means = pokemon.group(["Type 1", "Type 2"], np.mean)
pokemon_grouped = ...
pokemon_grouped.show(5)

<details>
    <summary>Solution</summary>
    
    pokemon_means = pokemon.group(["Type 1", "Type 2"], np.mean)
    pokemon_grouped = pokemon_means.with_columns(
        "standardized_attack", convert_to_standard_units(pokemon_means.column("Attack mean")),
        "standardized_defense", convert_to_standard_units(pokemon_means.column("Defense mean"))
    )

</details>
<br>

Generate a scatterplot to visualize the relationship between the attack and defense values in standard units.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    pokemon_grouped.scatter("standardized_attack", "standardized_defense")

</details>
<br>

How is this similar or different from the scatterplot you created using data from individual pokemon?

<details>
    <summary>Solution</summary>
    <br>Both scatterplots show a slightly positive linear association between attack and defense. We see similar trends in both scatterplots, even though one represents individual pokemon and the other represents groups of many pokemon.
    <br><br>
    The scatterplot using aggregated data is less "fuzzy" since each dot now represents a unique type of pokemon, not a unique pokemon.
</details>

Using `pokemon_grouped`, calculate the correlation coefficient between the average attack and average defense values across the type combinations.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    np.mean(pokemon_grouped.column("standardized_attack") * pokemon_grouped.column("standardized_defense"))
</details>
<br>
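
Correlations computed from group averages tend to look stronger than correlations computed from the underlying individuals, because averaging removes much of the within-group spread. The sketch below illustrates this on simulated data. The group structure, the noise level, and the seed are all assumptions chosen for illustration, and nothing here uses the `pokemon` table.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the simulation is repeatable

# Hypothetical setup: 10 groups of 50 individuals each.
# Group g is centered at (g, g); individuals are scattered widely around it.
n_groups, per_group = 10, 50
group_ids = np.repeat(np.arange(n_groups), per_group)
x = group_ids + rng.normal(0, 5, size=group_ids.size)
y = group_ids + rng.normal(0, 5, size=group_ids.size)

# Correlation across all individuals
r_individual = np.corrcoef(x, y)[0, 1]

# Correlation across the group averages
x_means = np.array([x[group_ids == g].mean() for g in range(n_groups)])
y_means = np.array([y[group_ids == g].mean() for g in range(n_groups)])
r_grouped = np.corrcoef(x_means, y_means)[0, 1]

print("individuals:", r_individual)
print("group means:", r_grouped)
```

In this simulation the group-level correlation is noticeably higher than the individual-level one, so a correlation between averages should not be read as a statement about individuals.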

<a id='section4'></a>
## 4. Check for Understanding

**A. True or False? Suppose the variables x and y have a correlation coefficient of 0.89. This means that an increase in x will cause an increase in y.**
 
<details>
    <summary>Solution</summary>
    <b>False.</b> A high correlation indicates that the two variables have a strong, positive linear association. For example, on average,
    we expect observations with relatively large values of x to also possess relatively large values of y. However, it does not mean that
    either variable causes the other! Correlation does not imply causation.
</details>
<br>

**B. In the data visualized below, how would the correlation coefficient of x and y change if the circled data point were removed?**

<img src="correlation_question.png" width=300 height=300 />

<details>
    <summary>Solution</summary>
    The correlation would increase. The circled point is an outlier, so removing it would leave a stronger linear association among the remaining points.
</details>
<br>