1\. Correlation
---------------

00:00 - 00:07

Welcome to the final chapter of the course, where we'll talk about correlation and experimental design.

2\. Relationships between two variables
---------------------------------------

00:07 - 00:35

Before we dive in, let's talk about relationships between numeric variables. We can visualize these kinds of relationships with scatterplots. In this scatterplot, we can see the relationship between the total amount of sleep mammals get and the amount of REM sleep they get. The variable on the x-axis is called the explanatory or independent variable, and the variable on the y-axis is called the response or dependent variable.

3\. Correlation coefficient
---------------------------

00:35 - 00:56

We can also examine relationships between two numeric variables using a number called the correlation coefficient. This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship.

4\. Magnitude = strength of relationship
----------------------------------------

00:56 - 01:18

Here's a scatterplot of 2 variables, x and y, that have a correlation coefficient of 0-point-99. Since the data points are closely clustered around a line, we can describe this as a near-perfect or very strong relationship. If we know what x is, we'll have a pretty good idea of what the value of y could be.

5\. Magnitude = strength of relationship
----------------------------------------

01:18 - 01:26

Here, x and y have a correlation coefficient of 0-point-75, and the data points are a bit more spread out.

6\. Magnitude = strength of relationship
----------------------------------------

01:26 - 01:34

In this plot, x and y have a correlation of 0-point-56 and are therefore moderately correlated.

7\. Magnitude = strength of relationship
----------------------------------------

01:34 - 01:40

A correlation coefficient around 0-point-2 would be considered a weak relationship.

8\. Magnitude = strength of relationship
----------------------------------------

01:40 - 01:55

When the correlation coefficient is close to 0, x and y have no linear relationship, and in this plot the scatterplot looks completely random. Here, knowing the value of x doesn't tell us anything about the value of y.

9\. Sign = direction
--------------------

01:55 - 02:14

The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.
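To make magnitude and sign concrete, here is a minimal sketch on simulated data. The variables, noise levels, and seed are all made up for illustration and are not the data behind the plots in this video; the idea is only that more scatter around the line lowers the magnitude, and a falling line flips the sign.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
n = 1000

# Simulated (hypothetical) data: x plus varying amounts of noise
x = pd.Series(rng.normal(size=n))
noise = pd.Series(rng.normal(size=n))

y_strong = x + 0.1 * noise               # points hug a rising line
y_moderate = x + 1.5 * noise             # same direction, much more scatter
y_none = pd.Series(rng.normal(size=n))   # generated independently of x
y_negative = -x + 0.1 * noise            # points hug a falling line

r_strong = x.corr(y_strong)
r_moderate = x.corr(y_moderate)
r_none = x.corr(y_none)
r_negative = x.corr(y_negative)

for label, r in [("strong", r_strong), ("moderate", r_moderate),
                 ("none", r_none), ("negative", r_negative)]:
    print(f"{label}: {r:.2f}")
```

The exact values depend on the seed, but the ordering of the magnitudes and the sign of the last coefficient should match the descriptions above.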

10\. Visualizing relationships
------------------------------

02:14 - 02:48

To visualize relationships between two variables, we can use a scatterplot. We'll use seaborn, which is a plotting package built on top of matplotlib. We import seaborn as sns, which is the alias commonly used for seaborn. We create a scatterplot using sns-dot-scatterplot, passing it the name of the variable for the x-axis, the name of the variable for the y-axis, as well as the msleep DataFrame to the data argument. Finally, we call plt-dot-show.

11\. Adding a trendline
-----------------------

02:48 - 03:11

We can add a linear trendline to the scatterplot using seaborn's lmplot() function. It takes the same arguments as sns-dot-scatterplot, but we'll set ci to None so that there aren't any confidence interval margins around the line. Trendlines like this can be helpful to more easily see a relationship between two variables.
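The two plotting calls described above can be sketched as follows. The small `msleep` DataFrame here is a hypothetical stand-in with made-up values, since the real dataset isn't included in this transcript, and the `matplotlib.use("Agg")` line is only there so the sketch runs without a display (it can be dropped in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the msleep DataFrame (values are made up)
msleep = pd.DataFrame({
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 10.1, 8.7, 3.0],
    "sleep_rem": [2.0, 1.8, 2.4, 2.3, 0.7, 2.9, 1.4, 0.4],
})

# Plain scatterplot: explanatory variable on x, response on y
ax = sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

# Same variables with a linear trendline; ci=None hides the confidence band
g = sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
```

Note that `sns.scatterplot` draws onto a matplotlib Axes, while `sns.lmplot` builds its own figure, so the two calls produce separate plots.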

12\. Computing correlation
--------------------------

03:11 - 03:44

To calculate the correlation coefficient between two Series, we can use the dot-corr method. If we want the correlation between the sleep_total and sleep_rem columns of msleep, we can take the sleep_total column and call dot-corr on it, passing in the other Series we're interested in. Note that it doesn't matter which Series the method is invoked on and which is passed in since the correlation between x and y is the same thing as the correlation between y and x.
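As a quick check of that symmetry, here is a small sketch. The `msleep` DataFrame below is a hypothetical stand-in with made-up values, not the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for the msleep DataFrame (values are made up)
msleep = pd.DataFrame({
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 10.1, 8.7, 3.0],
    "sleep_rem": [2.0, 1.8, 2.4, 2.3, 0.7, 2.9, 1.4, 0.4],
})

# Correlation of x with y, and of y with x
r_xy = msleep["sleep_total"].corr(msleep["sleep_rem"])
r_yx = msleep["sleep_rem"].corr(msleep["sleep_total"])

print(r_xy, r_yx)
```

Both calls should print the same coefficient, whichever Series the method is invoked on.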

13\. Many ways to calculate correlation
---------------------------------------

03:44 - 04:25

There's more than one way to calculate correlation, but the method we've been using in this video is called the Pearson product-moment correlation, which is also written as r. This is the most commonly used measure of correlation. Mathematically, it's calculated using this formula, where x bar and y bar are the means of x and y, and sigma x and sigma y are the standard deviations of x and y. The formula itself isn't important to memorize. Just know that there are variations of this formula that measure correlation a bit differently, such as Kendall's tau and Spearman's rho, which are beyond the scope of this course.
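The slide's formula isn't reproduced in this transcript, so the sketch below uses the standard sample form of Pearson's r as an assumption: the sum of (x - x bar)(y - y bar) over all points, divided by (n - 1) times sigma x times sigma y, with sample standard deviations. It computes r by hand on made-up values and compares the result with the dot-corr method.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for two numeric columns
x = pd.Series([12.1, 17.0, 14.4, 14.9, 4.0, 10.1, 8.7, 3.0])
y = pd.Series([2.0, 1.8, 2.4, 2.3, 0.7, 2.9, 1.4, 0.4])

n = len(x)

# Sum of the products of each point's deviations from the two means
cross = ((x - x.mean()) * (y - y.mean())).sum()

# pandas' .std() is the sample standard deviation (ddof=1) by default
r_manual = cross / ((n - 1) * x.std() * y.std())

# Pearson is the default method for Series.corr
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

The two numbers should agree up to floating-point rounding.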

14\. Let's practice!
--------------------

04:25 - 04:30

Okay, time to practice calculating correlations.

#### Guess the correlation

On the right, use the scatterplot to estimate what the correlation is between the variables `x` and `y`. Once you've guessed it correctly, use the **New Plot** button to try out a few more scatterplots. When you're ready, answer the question below to continue to the next exercise.

Which of the following statements is NOT true about correlation?

##### Instructions

-   If the correlation between `x` and `y` has a high magnitude, the data points will be clustered closely around a line.

-   Correlation can be written as *r*.

-   If `x` and `y` are negatively correlated, values of `y` decrease as values of `x` increase.

-   [x] Correlation cannot be 0.

Relationships between variables
===============================

In this chapter, you'll be working with a dataset `world_happiness` containing results from the [2019 World Happiness Report](https://worldhappiness.report/ed/2019/). The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.

In this exercise, you'll examine the relationship between a country's life expectancy (`life_exp`) and happiness score (`happiness_score`) both visually and quantitatively. `seaborn` as `sns`, `matplotlib.pyplot` as `plt`, and `pandas` as `pd` are loaded and `world_happiness` is available.

Instructions 1/4
----------------

-   Create a scatterplot of `happiness_score` vs. `life_exp` (without a trendline) using `seaborn`.
-   Show the plot.

In [None]:
# Create a scatterplot of happiness_score vs. life_exp and show
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)

Instructions 2/4
----------------

-   Create a scatterplot of `happiness_score` vs. `life_exp` **with a linear trendline** using `seaborn`, setting `ci` to `None`. 
-   Show the plot.

In [None]:
# Create a scatterplot of happiness_score vs. life_exp and show
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)

Instructions 3/4
----------------

Question
--------

Based on the scatterplot, which is most likely the correlation between `life_exp` and `happiness_score`?

### Possible answers

0.3

-0.3

[x] 0.8

-0.8

Instructions 4/4
----------------

-   Calculate the correlation between `life_exp` and `happiness_score`. Save this as `cor`.

In [None]:
# Create a scatterplot of happiness_score vs. life_exp and show
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)

1\. Correlation caveats
-----------------------

00:00 - 00:06

While correlation is a useful way to quantify relationships, there are some caveats.

2\. Non-linear relationships
----------------------------

00:06 - 00:16

Consider this data. There is clearly a relationship between x and y, but when we calculate the correlation, we get 0-point-18.

3\. Non-linear relationships
----------------------------

00:16 - 00:30

This is because the relationship between the two variables is a quadratic relationship, not a linear relationship. The correlation coefficient measures the strength of linear relationships, and linear relationships only.
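Here is a minimal sketch of the same effect on made-up data. It uses a perfectly symmetric parabola rather than the data on the slide, so the coefficient comes out at essentially 0 instead of 0-point-18, even though y is completely determined by x.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x is symmetric around 0 and y depends on x exactly
x = pd.Series(np.linspace(-3, 3, 61))
y = x ** 2

# Pearson correlation only picks up the linear part of the relationship
r = x.corr(y)

print(round(r, 4))
```

The left half of the parabola slopes down and the right half slopes up, so the linear trends cancel out.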

4\. Correlation only accounts for linear relationships
------------------------------------------------------

00:30 - 00:38

Just like any summary statistic, correlation shouldn't be used blindly, and you should always visualize your data when possible.

5\. Mammal sleep data
---------------------

00:38 - 00:41

Let's return to the mammal sleep data.

6\. Body weight vs. awake time
------------------------------

00:41 - 01:00

Here's a scatterplot of each mammal's body weight versus the time they spend awake each day. The relationship between these variables is definitely not a linear one. The correlation between body weight and awake time is only about 0-point-3, which is a weak linear relationship.

7\. Distribution of body weight
-------------------------------

01:00 - 01:12

If we take a closer look at the distribution for bodywt, it's highly skewed. There are lots of lower weights and a few weights that are much higher than the rest.

8\. Log transformation
----------------------

01:12 - 01:44

When data is highly skewed like this, we can apply a log transformation. We'll create a new column called log_bodywt which holds the log of each body weight. We can do this using np-dot-log. If we plot the log of body weight versus awake time, the relationship looks much more linear than the one between regular body weight and awake time. The correlation between the log of body weight and awake time is about 0-point-57, which is much higher than the 0-point-3 we had before.
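The steps above can be sketched on simulated data. The `mammals` DataFrame, its values, and the seed below are invented for illustration (awake time is generated as a linear function of log body weight plus noise), so the coefficients won't match the 0-point-3 and 0-point-57 from the real mammal sleep data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical, highly skewed body weights: evenly spaced on the log scale
log_wt = np.linspace(-3, 8, 50)
mammals = pd.DataFrame({
    "bodywt": np.exp(log_wt),
    # Assumed relationship: awake time is linear in log body weight, plus noise
    "awake": 10 + 0.8 * log_wt + rng.normal(0, 1, size=50),
})

# New column holding the natural log of each body weight
mammals["log_bodywt"] = np.log(mammals["bodywt"])

r_raw = mammals["bodywt"].corr(mammals["awake"])
r_log = mammals["log_bodywt"].corr(mammals["awake"])

print(f"raw: {r_raw:.2f}, log: {r_log:.2f}")
```

Because the simulated relationship is linear on the log scale, the correlation after the transformation should be noticeably higher than before it.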

9\. Other transformations
-------------------------

01:44 - 02:13

In addition to the log transformation, there are lots of other transformations that can be used to make a relationship more linear, like taking the square root or reciprocal of a variable. The choice of transformation will depend on the data and how skewed it is. These can be applied in different combinations to x and y. For example, you could apply a log transformation to both x and y, or a square root transformation to x and a reciprocal transformation to y.
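As a rough illustration, the sketch below applies a reciprocal transformation to x in one case and a square root transformation to y in another. The data is constructed (not real) so that each transformation exactly straightens the relationship, which is rarely the case in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two known non-linear relationships
x = pd.Series(np.linspace(1, 50, 50))
y_recip = 1 / x      # y falls off as the reciprocal of x
y_square = x ** 2    # y grows as the square of x

# Reciprocal transformation applied to x
r_recip_raw = x.corr(y_recip)
r_recip_fixed = (1 / x).corr(y_recip)

# Square root transformation applied to y
r_square_raw = x.corr(y_square)
r_square_fixed = x.corr(np.sqrt(y_square))

print(f"reciprocal: {r_recip_raw:.2f} -> {r_recip_fixed:.2f}")
print(f"square root: {r_square_raw:.2f} -> {r_square_fixed:.2f}")
```

With real data, plotting each candidate transformation is usually the quickest way to judge which one gives the most linear relationship.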

10\. Why use a transformation?
------------------------------

02:13 - 02:34

So why use a transformation? Certain statistical methods rely on variables having a linear relationship, like calculating a correlation coefficient. Linear regression is another statistical technique that requires variables to be related in a linear manner, which you can learn all about in this course.

11\. Correlation does not imply causation
-----------------------------------------

02:34 - 03:14

Let's talk about one more important caveat of correlation that you may have heard about before: correlation does not imply causation. This means that if x and y are correlated, x doesn't necessarily cause y. For example, here's a scatterplot of the per capita margarine consumption in the US each year and the divorce rate in the state of Maine. The correlation between these two variables is 0-point-99, which is nearly perfect. However, this doesn't mean that consuming more margarine will cause more divorces. This kind of correlation is often called a spurious correlation.

12\. Confounding
----------------

03:14 - 03:34

A phenomenon called confounding can lead to spurious correlations. Let's say we want to know if drinking coffee causes lung cancer. Looking at the data, we find that coffee drinking and lung cancer are correlated, which may lead us to think that drinking more coffee will give you lung cancer.

13\. Confounding
----------------

03:34 - 03:40

However, there is a third, hidden variable at play, which is smoking.

14\. Confounding
----------------

03:40 - 03:44

Smoking is known to be associated with coffee consumption.

15\. Confounding
----------------

03:44 - 03:48

It is also known that smoking causes lung cancer.

16\. Confounding
----------------

03:48 - 04:36

In reality, it turns out that coffee does not cause lung cancer and is only associated with it, but it appeared causal due to the third variable, smoking. This third variable is called a confounder, or lurking variable. This means that the relationship of interest between coffee and lung cancer is a spurious correlation. Another example of this is the relationship between holidays and retail sales. While it might be that people buy more around holidays as a way of celebrating, it's hard to tell how much of the increased sales is due to holidays, and how much is due to the special deals and promotions that often run around holidays. Here, special deals confound the relationship between holidays and sales.
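A small simulation can show how a confounder produces this pattern. Everything below is invented for illustration: the column names, the effect sizes, and the 30% smoking rate are assumptions, and `cancer_risk` is just a made-up numeric score. Coffee has no effect on the risk score in this setup; smoking raises both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical confounder: about 30% of people smoke
smoker = rng.random(n) < 0.3

people = pd.DataFrame({
    "smoker": smoker,
    # Smoking raises coffee consumption and the risk score independently
    "coffee_cups": 2 + 2 * smoker + rng.normal(size=n),
    "cancer_risk": 1 + 3 * smoker + rng.normal(size=n),
})

# Correlation across everyone, ignoring smoking
r_overall = people["coffee_cups"].corr(people["cancer_risk"])

# Correlation within each smoking group
smokers = people[people["smoker"]]
non_smokers = people[~people["smoker"]]
r_smokers = smokers["coffee_cups"].corr(smokers["cancer_risk"])
r_non_smokers = non_smokers["coffee_cups"].corr(non_smokers["cancer_risk"])

print(f"overall: {r_overall:.2f}")
print(f"smokers only: {r_smokers:.2f}, non-smokers only: {r_non_smokers:.2f}")
```

Across everyone, coffee and the risk score look clearly correlated, but once smokers and non-smokers are examined separately the correlation should all but vanish.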

17\. Let's practice!
--------------------

04:36 - 04:43

Now that you've learned how to use correlation responsibly, time to practice.

What can't correlation measure?
===============================

While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it's far from perfect. In this exercise, you'll explore one of the caveats of the correlation coefficient by examining the relationship between a country's GDP per capita (`gdp_per_cap`) and life expectancy (`life_exp`).

`pandas` as `pd`, `matplotlib.pyplot` as `plt`, and `seaborn` as `sns` are imported, and `world_happiness` is loaded.

Instructions 1/3
----------------

-   Create a `seaborn` scatterplot (without a trendline) showing the relationship between `gdp_per_cap` (on the x-axis) and `life_exp` (on the y-axis).
-   Show the plot.

In [None]:
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()

# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)

Instructions 2/3
----------------

-   Calculate the correlation between `gdp_per_cap` and `life_exp` and store as `cor`.

In [None]:
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()

# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)

Instructions 3/3
----------------

Question
--------

The correlation between GDP per capita and life expectancy is 0.7. Why is correlation ***not*** the best way to measure the relationship between these two variables?

### Possible answers

Correlation measures how one variable affects another.

[x] Correlation only measures linear relationships.

Correlation cannot properly measure relationships between numeric variables.

Transforming variables
======================

When variables have skewed distributions, they often require a transformation in order to form a linear relationship with another variable so that correlation can be computed. In this exercise, you'll perform a transformation yourself.

`pandas` as `pd`, `numpy` as `np`, `matplotlib.pyplot` as `plt`, and `seaborn` as `sns` are imported, and `world_happiness` is loaded.

Instructions 1/2
----------------

-   Create a scatterplot of `happiness_score` versus `gdp_per_cap` and calculate the correlation between them.

In [None]:
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)

Instructions 2/2
----------------

-   Add a new column to `world_happiness` called `log_gdp_per_cap` that contains the log of `gdp_per_cap`.
-   Create a `seaborn` scatterplot of `happiness_score` versus `log_gdp_per_cap`.
-   Calculate the correlation between `log_gdp_per_cap` and `happiness_score`.

In [None]:
# Create log_gdp_per_cap column
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Scatterplot of happiness_score vs. log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)

Does sugar improve happiness?
=============================

A new column has been added to `world_happiness` called `grams_sugar_per_day`, which contains the average amount of sugar eaten per person per day in each country. In this exercise, you'll examine the relationship between a country's average sugar consumption and its happiness score.

`pandas` as `pd`, `matplotlib.pyplot` as `plt`, and `seaborn` as `sns` are imported, and `world_happiness` is loaded.

Instructions 1/2
----------------

-   Create a `seaborn` scatterplot showing the relationship between `grams_sugar_per_day` (on the x-axis) and `happiness_score` (on the y-axis).
-   Calculate the correlation between `grams_sugar_per_day` and `happiness_score`.

In [None]:
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness)
plt.show()

# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)


Instructions 2/2
----------------

Question
--------

Based on this data, which statement about sugar consumption and happiness scores is true?

### Possible answers

Increased sugar consumption leads to a higher happiness score.

Lower sugar consumption results in a lower happiness score.

[x] Increased sugar consumption is associated with a higher happiness score.

Sugar consumption is not related to happiness.