

# <center><font color = #2E5266>Correlation and Causation</font></center>

"Correlation doesn't imply causation." It's a cautionary phrase passed down to nearly every new scientist as they notice patterns in the world: just because two phenomena are related does not mean that one causes the other. 

In the age of "Big Data", advances in storage technology and machine learning have allowed us to see more patterns than ever. What does this mean for establishing causality?

In this notebook, we will explore how modern thinkers approach questions of causality using the lens of data science. Then, we will critique a case study where scientists use large-scale data to answer a very topical question: is too much screen time corrupting America's youth?

***

First we will load some Python libraries to help us run the code we'll be using for this notebook! No need to edit or alter anything, just hit `shift` + `enter` to run it!

In [43]:
# This cell block just loads in helpful libraries that will help us run the code we need to!
from datascience import * # The data science library provides helpful tools for loading tables!
import matplotlib.pyplot as plt # This library will help us create plots!
%matplotlib inline
import ipywidgets as widgets # Allows for us to create interactive user interfaces where you won't have to code!
from ipywidgets import interact, fixed, Layout
import numpy as np # Mathematical functions
import string # Text functions
import random # Random number generators!
import sklearn.metrics as sk # Metrics for evaluating machine learning models!
import os
import pandas as pd
from scipy.stats import pearsonr

***
### <font color = #2E5266>Correlation and Causation: Review</font>


![XKCD correlation causation](https://imgs.xkcd.com/comics/correlation.png)
<center><i><a href='https://xkcd.com/552/'>"Correlation" by Randall Munroe</a></i></center>
    
<br>

When we talk about **correlation**, we're referring to an association or relationship between two variables. In statistics, correlation is also often used to refer to the _Pearson correlation coefficient_, or _r_, which measures the linear relationship between two variables.
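As a minimal sketch of what _r_ measures, the cell below computes it by hand as the average product of the two variables in standard units, then compares the result with `scipy.stats.pearsonr`. The helper name `pearson_r` and the numbers are made up for illustration; they are not part of the case study data.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_r(x, y):
    """Pearson correlation: the mean product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()  # x in standard units
    y_su = (y - y.mean()) / y.std()  # y in standard units
    return np.mean(x_su * y_su)

# Made-up illustrative values, not real data
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 58, 61, 70, 74, 79]

r_by_hand = pearson_r(hours_studied, exam_score)
r_scipy = pearsonr(hours_studied, exam_score)[0]
print("r by hand =", r_by_hand)
print("r (scipy) =", r_scipy)

# r only captures LINEAR association: here y depends entirely on x (y = x**2),
# yet the linear correlation is zero because the relationship is symmetric.
x_sym = [-2, -1, 0, 1, 2]
y_sym = [4, 1, 0, 1, 4]
r_nonlinear = pearson_r(x_sym, y_sym)
print("r for y = x**2 =", r_nonlinear)
```

The second example is a reminder that an _r_ near zero rules out a linear relationship, not every relationship.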

**Causation**, on the other hand, is when one variable has an effect on another. In technical terms:
> X causes Y if and only if X and Y are correlated under interventions on X

That is, one variable causes another when a change in the first results in a change in the second while all other factors are held the same.

The traditional method for establishing causality is the **randomized controlled trial (RCT)**, which aims to do just that: subjects are randomly assigned to groups so that, on average, the groups differ only in one variable, the _treatment_, and we can see whether changing it has a measurable effect on the _outcome_. However, the randomized controlled trial isn't the answer to all our problems.
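To make the idea of "correlated under interventions" concrete, here is a small simulation sketch. Everything in it is hypothetical: a hidden factor `z` drives both `x` and `y`, and `x` has no effect on `y` at all. In the observational data the two are strongly correlated; once we assign `x` ourselves at random, the correlation should vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: a hidden factor z drives both x and y.
# By construction, x has NO causal effect on y.
z = rng.normal(size=n)
x_observed = z + rng.normal(scale=0.5, size=n)
y_observed = z + rng.normal(scale=0.5, size=n)
r_observational = np.corrcoef(x_observed, y_observed)[0, 1]

# Intervention: we set x ourselves with a coin flip,
# so it can no longer depend on z.
x_assigned = rng.integers(0, 2, size=n)
y_after = z + rng.normal(scale=0.5, size=n)  # y still ignores x
r_interventional = np.corrcoef(x_assigned, y_after)[0, 1]

print("correlation in observational data:", r_observational)
print("correlation under intervention:   ", r_interventional)
```

Random assignment is what breaks the link between the treatment and everything else that might influence the outcome.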

<div class="alert alert-info"><b>DISCUSSION (2 minutes)</b> Two situations in which an RCT cannot be used to establish causality are when:
    <ul>
        <li> an RCT is impractical</li>
        <li> an RCT is unethical </li>
    </ul>
    
With a partner, see if you can think of at least one real-life scenario for each of these cases. 
    </div>
    

***
### <font color = #2E5266>Causality in the Age of Big Data</font>

Let's now turn to the "big data" approach to establishing causality. Data science has two potentially game-changing tools at its disposal here.

First, data. In the last few decades, data storage has become very cheap and very accessible in large amounts, allowing individuals and corporations to collect and store vast amounts of data.

Second, machine learning methods enable data scientists to find patterns in these vast quantities of data that would be otherwise nearly impossible for a human to discover.

Consequences:
- Correlations can be found relatively quickly due to the sheer amount of data available.

Side note:
- There is a difference between how computers see patterns and how humans see patterns.
- Computers can only see the patterns we tell them to look for.
- Humans can see many sorts of patterns.
- Humans tend to ascribe causality to the patterns we see.
- Computers cannot ascribe causal agents or identify likely causes (unless we tell them specifically what to look for, which is just another pattern to find, i.e. more correlations).

In [75]:
# set figure size
plt.rcParams["figure.figsize"] = [6,6]

In [74]:
import seaborn as sns  # provides the scatterplot function used below

# create the widget
def analyze_data(dataset_num):
    data = pd.read_csv("https://raw.githubusercontent.com/lockedata/datasauRus/master/inst/extdata/DatasaurusDozen.tsv",
                       index_col="dataset", sep='\t')
    dataset_list = data.index[::-1].unique()
    data = data.loc[dataset_list[dataset_num - 1], :]
    # create the figure first so the scatter plot is drawn on it
    plt.figure(figsize=(8, 8))
    sns.scatterplot(data=data, x="x", y="y")
    plt.show()
    # compute the correlation and its p-value once
    r, p_value = pearsonr(data.x, data.y)
    print("x mean = ", data.x.mean())
    print("y mean = ", data.y.mean())
    print("standard deviation x = ", np.std(data.x))
    print("standard deviation y = ", np.std(data.y))
    print("correlation =  ", r)
    print("p-value = ", p_value)

# create the toggle buttons for the widget (one button per dataset)
dataset_selection_widget = widgets.ToggleButtons(options=range(1, 14))

# create the widget to view plots for the different datasets
interact(analyze_data, dataset_num=dataset_selection_widget);


Here is an example of an extremely high correlation found via brute force:


In [102]:
food = pd.read_excel("https://www2.census.gov/library/publications/2011/compendia/statab/131ed/tables/12s0217.xls",
                     header=3)
# select row 32, columns 6 onward (the series `moz`, compared with the degree counts below)
moz = food.iloc[32, 6:]

In [99]:
nsf = pd.read_excel("https://wayback.archive-it.org/5902/20181003231608/https://www.nsf.gov/statistics/infbrief/nsf11305/tab1.xls",
                   header=1)
ce = nsf[nsf.Field == "Civil engineering "].iloc[:, 2:]
ce

Unnamed: 0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009
24,480,501,540,552,547,622,655,701,712,708


In [113]:
pearsonr(moz.tolist(), ce.loc[24, :].tolist())

(0.9509548098376948, 2.38545595655013e-05)
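How surprising is a correlation that high? The sketch below is a rough illustration on synthetic data, not a reanalysis of the tables above. It generates one "target" series and several thousand unrelated candidate series (all independent random walks of length 10, matching the ten years used here), then searches for the candidate that correlates most strongly with the target. The names `target` and `candidates` and the choice of random walks are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_years = 10         # same length as the 2000-2009 series above
n_candidates = 5000  # number of unrelated series to search through

# A target series and many candidate series, all independent random walks.
target = rng.normal(size=n_years).cumsum()
candidates = rng.normal(size=(n_candidates, n_years)).cumsum(axis=1)

# Correlate every candidate with the target and keep the strongest match.
r_values = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(r_values).argmax()

print("strongest correlation found:", r_values[best])
print("share of candidates with |r| > 0.9:", np.mean(np.abs(r_values) > 0.9))
```

None of these series has anything to do with the others, yet with enough candidates and only a handful of data points, a near-perfect match tends to turn up.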

### <font color = #2E5266>What if the data is really big?</font>

Can we do without an RCT if the data set is big enough?

https://www.textbook.ds100.org/ch/02/design_srs_vs_big_data.html

https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf
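One way to read this question is that more rows do not fix problems in how the rows were collected. The sketch below is a toy simulation under assumed numbers: a hypothetical population of one million people, a "big data" set in which people with some yes/no trait are somewhat more likely to be recorded, and a much smaller simple random sample. None of these figures come from the readings linked above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 1,000,000 people; about half have some yes/no trait.
population = rng.random(1_000_000) < 0.5
true_rate = population.mean()

# "Big data": people with the trait are more likely to be recorded
# (55% chance vs. 45% chance), giving roughly half a million rows.
p_recorded = np.where(population, 0.55, 0.45)
big_sample = population[rng.random(population.size) < p_recorded]

# A simple random sample of only 2,500 people.
srs = rng.choice(population, size=2_500, replace=False)

big_error = abs(big_sample.mean() - true_rate)
srs_error = abs(srs.mean() - true_rate)

print("true rate:", true_rate)
print("big biased sample:", big_sample.size, "rows, error =", big_error)
print("small random sample:", srs.size, "rows, error =", srs_error)
```

In this setup the big data set is off by about five percentage points no matter how many rows it has, because the error comes from who ends up in the data, not from how many people do.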

### <font color = #2E5266>The Implications of Error</font>


### <font color = #2E5266>Measuring Accuracy</font>


## <font color = #2E5266>Case Study: Adolescent Health vs. Screen Time</font>

background + intro

hotly debated, lots of strong feelings

lots of assumptions. what are yours?



### <font color = #2E5266>Getting Started with the Data</font>

source

method (read about it)

what steps were taken to try to make it easier to establish causality? What factors weren't they able to control for?

In [115]:
# The command below takes in a .csv file and converts it into a table!
raw_data = Table.read_table("data/SADCQ.csv")
# This command allows for us to see only a sneak-peek of the entries.
# Instead of seeing hundreds of rows, we will see only five.
raw_data.show(5)

year,age,sex,race7,bmi,sexid,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17,q18,q19,q20,q21,q22,q23,q24,q25,q26,q27,q28,q29,q30,q31,q32,q33,q34,q35,q36,q37,q38,q39,q40,q41,q42,q43,q44,q45,q46,q47,q48,q49,q50,q51,q52,q53,q54,q55,q56,q57,q58,q59,q60,q61,q62,q63,q64,q65,q68,q69,q70,q71,q72,q73,q74,q75,q76,q77,q78,q79,q80,q81,q82,q83,q84,q85,q86,q87,q88,q89,qbikehelmet,qdrivemarijuana,qcelldriving,qpropertydamage,qbullyweight,qbullygender,qbullygay,qchokeself,qcigschool,qchewtobschool,qalcoholschool,qtypealcohol,qhowmarijuana,qmarijuanaschool,qcurrentcocaine,qcurrentheroin,qcurrentmeth,qhallucdrug,qprescription30d,qgenderexp,qtaughtHIV,qtaughtsexed,qtaughtstd,qtaughtcondom,qtaughtbc,qdietpop,qcoffeetea,qsportsdrink,qenergydrink,qsugardrink,qwater,qfastfood,qfoodallergy,qwenthungry,qmusclestrength,qsunscreenuse,qindoortanning,qsunburn,qconcentrating,qcurrentasthma,qwheresleep,qspeakenglish,qtransgender
2017,5,2,6,18.0484,1.0,5,1,2.0,2.0,1,1,1,1,1,1,1,2,1,1,1,2,2,2,2,2,1.0,1.0,2.0,1.0,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,1,1,1,1,1,1,2.0,2,2.0,3.0,2.0,1.0,1.0,2.0,3.0,3,8,6,2,4,4,4,1,2,1,2,4,2,,2,,,,,,,,,,,,,,,,1,,,,,,,,,,2,,,6.0,,2,,4,,1,6,1,,,1,
2017,5,2,3,30.4836,1.0,5,2,2.0,2.0,1,1,1,2,1,2,1,2,3,3,2,2,1,1,2,2,,,2.0,1.0,1,1,2,1,1,1,1,2,2,6,1,1,1,1,2,6,1,1,1,1,1,1,1,1,1,1,2,1,4,3,2,3,2,4,4.0,1,2.0,7.0,2.0,7.0,3.0,3.0,2.0,7,5,1,4,7,4,1,3,2,2,2,3,3,,2,,,,,,,,,,,,,,,,1,,,,,,,,,,2,,,7.0,,1,,4,,1,1,2,,,1,
2017,5,2,2,14.6645,,5,1,1.0,1.0,1,1,1,1,1,1,1,2,1,1,1,1,2,1,2,2,1.0,1.0,,,1,1,2,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,1,1,1,1,1,1,,2,,,,,,,,2,6,4,5,7,6,1,1,2,5,2,5,7,,1,,,,,,,,,,,,,,,,1,,,,,,,,,,3,,,,,3,,4,,1,1,1,,,2,
2017,5,2,7,20.8936,1.0,4,2,1.0,1.0,3,1,1,1,2,3,2,2,1,1,1,2,2,2,2,2,1.0,1.0,1.0,2.0,1,1,2,1,1,1,1,1,3,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,1,1,1,1,1,1,3.0,2,2.0,2.0,1.0,3.0,1.0,2.0,2.0,2,4,6,1,7,1,1,1,2,5,3,4,2,,1,,,,,,,,,,,,,,,,1,,,,,,,,,,2,,,7.0,,2,,4,,1,2,2,,,2,
2017,5,2,4,,1.0,5,1,,,1,1,1,1,1,1,1,2,1,2,2,2,2,2,2,2,1.0,1.0,2.0,1.0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,6,5,1,1,1,1,1,1,1,1,1,1,2,1,5,7,2,3,2,4,3.0,4,3.0,6.0,1.0,2.0,3.0,3.0,5.0,4,1,8,5,4,6,1,1,2,1,2,4,3,,1,,,,,,,,,,,,,,,,1,,,,,,,,,,6,,,7.0,,2,,8,,1,1,2,,,2,


This is some fairly big (for us) data, with 89,848 rows in total.


explain what a row, each column represents
explain scaling issues

***
assumptions

To determine what we can or can't say about causality, it's important t

* For simplicity, our calculations will NOT take the sample weights into account (results won't match papers)
* assume screen watching is treatment (could it be the reverse?)
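On the second assumption, note that a correlation coefficient cannot tell us which variable is the treatment. The sketch below uses synthetic stand-in variables (`screen_hours` and `wellbeing` are hypothetical names, not columns from the survey) to show that _r_ is identical whichever variable we list first.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)

# Synthetic stand-ins, NOT columns from the survey data
screen_hours = rng.uniform(0, 8, size=500)
wellbeing = 10 - 0.5 * screen_hours + rng.normal(scale=2, size=500)

# Correlation treats the two variables symmetrically.
r_xy = pearsonr(screen_hours, wellbeing)[0]
r_yx = pearsonr(wellbeing, screen_hours)[0]

print("r(screen_hours, wellbeing) =", r_xy)
print("r(wellbeing, screen_hours) =", r_yx)
```

Because the number is the same in both directions, deciding which variable is the cause has to come from the study design or from outside knowledge, not from the correlation itself.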

### Vis and metrics

- bar chart widget
- scatter widget with fit line
- correlation metric


### Conclusion
- reflect: how does DS

## Citations

- https://www.autodeskresearch.com/publications/samestats
- data https://www.cdc.gov/healthyyouth/data/yrbs/data.htm
- study 1 https://journals.sagepub.com/doi/10.1177/2167702617723376
- https://www.tylervigen.com/spurious-correlations
- study 2 https://www-nature-com.libproxy.berkeley.edu/articles/s41562-018-0506-1

Notebook author: Keeley Takimoto