In [None]:
# Run this cell by hitting CTRL + ENTER
# Support embedding YouTube Videos in Notebooks
from IPython.display import YouTubeVideo

# Module 1.1 Part 2: Cause and Effect

In this notebook, you'll learn about the different ways that data can be generated and how that affects the types of inferences that can be made. 

This notebook includes 4 videos with a total runtime of 26:02.

- [Questions](#section1) *1 video, total runtime 4:10*
- [Association](#section2) *1 video, total runtime 9:29*
- [Causation](#section3) *1 video, total runtime 7:19*
- [Confounding](#section4) *1 video, total runtime 5:04*
- [Check for Understanding](#section5)

Textbook Readings:
- [Chapter 2: Causality and Experiments](https://www.inferentialthinking.com/chapters/02/causality-and-experiments.html)

<a id='section1'></a>
## 1. Questions

Often, we're interested in determining the relationship between two variables. Showing an association between the variables is fairly straightforward, but we have to be careful before declaring a cause-and-effect relationship.

This video gives a brief overview of the difference between association and causality, and of how we can formulate scientific questions to explore both types of relationships.

In [None]:
YouTubeVideo("mKT6tJTwwL0")

**Key Definitions**

   - **Study participants**: the population that is being explicitly represented in the study. This population can be defined by characteristics like age, gender, geography, or occupation.
   
   - **Treatment/Exposure**: the factor that we suspect has an effect on the outcome.
   
   - **Outcome**: the measurement of interest that we suspect to be affected by the treatment.
   
   - **Association**: a relationship or link between the treatment and the outcome.
   
   - **Causality**: a specific relationship in which a change in the treatment will directly result in a change to the outcome 

<a id='section2'></a>
## 2. Association

In the mid-1800s, thousands of people were dying in London, but no one knew why.

This video discusses the work of John Snow, who used early analysis tools to test for an association between a suspected exposure and death from the unknown disease.

In [None]:
YouTubeVideo("esDCoUrT0t8")

Snow used a map to visualize the relationship between contaminated drinking water and cholera deaths. Next week, you'll learn how to create charts to conduct similar explorations of data.

<a id='section3'></a>
## 3. Causation

After Snow established an association between contaminated drinking water and cholera deaths, he wanted to determine whether contaminated water actually caused the disease.

This video explains how John Snow used a natural experiment to draw a conclusion about the causal relationship between exposure and outcome.

In [None]:
YouTubeVideo("Vu23eyOBrnE")

Snow found two groups that were similar except for their water source. One group had access to clean drinking water and experienced very few cholera deaths. The other group received contaminated drinking water and had substantially more cholera deaths.

Since the only dissimilarity between the groups was their water source, Snow was able to conclude that the difference in cholera death rates was caused by the different levels of contamination in each group's drinking water.

Generally speaking, the key to establishing causality is as follows: if the treatment and control groups are similar apart from the treatment, then differences between the outcomes in the two groups can be ascribed to the treatment.
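
As a rough sketch of this comparison, the snippet below computes the outcome rate in each group and the ratio between them. The counts are invented for illustration and are not Snow's data, and the helper name `rate_per_10000` is hypothetical.

```python
def rate_per_10000(deaths, houses):
    """Return the number of deaths per 10,000 houses."""
    return 10000 * deaths / houses

# Hypothetical counts, chosen only for illustration (not Snow's actual data).
groups = {
    "contaminated water": {"houses": 40000, "deaths": 1200},
    "clean water": {"houses": 25000, "deaths": 100},
}

# Outcome rate for each group, on a common scale so the groups are comparable.
rates = {
    name: rate_per_10000(counts["deaths"], counts["houses"])
    for name, counts in groups.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} deaths per 10,000 houses")

# How many times higher the death rate is in the contaminated-water group.
rate_ratio = rates["contaminated water"] / rates["clean water"]
print(f"Rate ratio: {rate_ratio:.1f}")
```

The rates are expressed per 10,000 houses because the two groups differ in size, so raw death counts would not be directly comparable. A large gap like this one supports a causal conclusion only when the groups are similar apart from the treatment.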

<a id='section4'></a>
## 4. Confounding

While John Snow was able to find two very similar groups for his natural experiment, he wouldn't have been able to draw the same causal conclusions if the two groups had differed in ways other than their water source. For example, if the group drinking contaminated water had been older on average, old age rather than the water might have been what actually caused the additional cholera deaths. In this case, age would be considered a *confounding* factor in the relationship between contaminated water and cholera deaths.
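
The age example can be made concrete with a small table of counts. In this minimal sketch, every number is hypothetical and constructed so that the treatment has no effect at all: within each age band, both groups die at the same rate. Because the treatment group contains more older people, it still shows a higher overall death rate.

```python
# Hypothetical counts, invented for illustration only.
# Each entry is (age_band, group, people, deaths).
records = [
    ("older", "treatment", 800, 80),
    ("older", "control", 200, 20),
    ("younger", "treatment", 200, 4),
    ("younger", "control", 800, 16),
]

def death_rate(rows):
    """Return total deaths divided by total people for the given rows."""
    people = sum(r[2] for r in rows)
    deaths = sum(r[3] for r in rows)
    return deaths / people

# Overall comparison, ignoring age: the treatment group looks worse off.
overall = {
    group: death_rate([r for r in records if r[1] == group])
    for group in ("treatment", "control")
}
print("Overall:", overall)

# Comparison within each age band: the two groups have identical rates.
by_age = {
    (age, group): death_rate(
        [r for r in records if r[0] == age and r[1] == group]
    )
    for age in ("older", "younger")
    for group in ("treatment", "control")
}
for key, rate in by_age.items():
    print(key, rate)
```

An analysis that looked only at the overall rates would wrongly blame the treatment, even though age alone accounts for the difference in these made-up counts.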

This video explains how confounding factors might mislead data scientists into making incorrect conclusions. It also discusses how experiments can be designed to eliminate the effect of confounding factors. 

In [None]:
YouTubeVideo("zQKuNDEkKTM")

Data scientists must always watch for confounding factors when doing any kind of data analysis. The best way to minimize the effect of confounding factors is to conduct an experiment in which the treatment is randomly assigned to study participants.
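
Random assignment can be sketched in a few lines of Python. The participant list, the `randomly_assign` and `mean_age` helpers, and the seed value below are all assumptions made for illustration; the point is only that a chance-based procedure, not the experimenter, decides who receives the treatment.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)      # separate generator so results are reproducible
    shuffled = list(participants)  # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participants: (id, age) pairs with ages from 20 to 79.
participants = [(i, 20 + i % 60) for i in range(1000)]

treatment, control = randomly_assign(participants, seed=0)

def mean_age(group):
    """Return the average age of a group of (id, age) pairs."""
    return sum(age for _, age in group) / len(group)

print(f"Treatment: n={len(treatment)}, mean age={mean_age(treatment):.1f}")
print(f"Control:   n={len(control)}, mean age={mean_age(control):.1f}")
```

Because age plays no role in who ends up in which group, the two groups tend to have similar age profiles, and the same holds for characteristics that were never measured. This is also why randomness here is a deliberate, well-defined procedure rather than a haphazard one.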

<a id='section5'></a>
## 5. Check for Understanding
**A. What is the key to establishing causality?**
<details>
    <summary>Solution</summary>
    If the treatment and control groups are similar apart from the treatment, then differences between the outcomes in the two groups can be ascribed to the treatment.
</details>
<br>

**B. Which type of study is more likely to be affected by confounding factors?**
<details>
    <summary>Solution</summary>
    Observational studies
</details>
<br>

**C. How can a study be designed to reduce the effect of confounding factors?**
<details>
    <summary>Solution</summary>
    Randomize treatment assignment among study participants
</details>
<br>

**D. True or False? In data science, randomness implies that a study was conducted haphazardly.**
<details>
    <summary>Solution</summary>
    False
</details>
<br>