

Correlation vs Causation
1. Learning outcomes
   1. State some of the basic problems in establishing causation.      
   2. Recognize some of the basic problems in establishing causation and use them to identify situations in which claims of causation are and are not warranted.     
   4. Identify the flaw in an argument in which correlation is inappropriately being substituted for causation.  
   5. Address the argument, “Science can only establish correlations; it can’t determine causality.”    
   6. Use the definition of causation to identify situations in which claims of causation are and are not warranted.    
   8. Identify flaws in experimental designs aimed at testing causality and explain how the flaws could be addressed.  
2. Notebook outline
   1. Overview: correlation, causality, and big data
      1. Big Data tools can be used to show correlations, but proving causality is much harder
      2. There are cases where a big data approach may be better than an RCT (e.g., examining whether smoking causes cancer: we don’t want to randomly make half of the subjects smoke)
      3. Obstacles to proving causality include non-representative sampling and confounding factors
      4. Tools to lessen the impact of these obstacles include different sampling techniques and sample weighting
      5. Potentially relevant: can big data make up for not doing an RCT?
         1. Answer: depends on how skewed the sample is
         2. Example from DS100 textbook 
         3. Prof Xiao-Li Meng’s paper on the topic (more technical) and lecture
   2. Case study: adolescent health vs screen usage
      1. Overview: Orben & Przybylski (2019) have a well-documented article on using multiple, large data sources to examine this topic. In this notebook, we will do a simplified version of their analysis.
         1. Original article
         2. Two of the three original data sets: the UMich data and the Youth Risk Behavior Survey (note: these are LARGE and MESSY; it is probably best to use the Youth Risk Behavior Survey, as detailed in the “Our data” section below)
      2. Problem: what behaviors cause poor health outcomes in adolescents?
         1. Discussion: how can we study this? What makes this a difficult topic to study? Can this be studied using an RCT?
      3. Our data: Youth Risk Behavior Survey
         1. Introduce data, embed or link to relevant data dictionaries/source websites/etc
         2. Discussion:
            1. How was the survey designed?
            2. Who was included in the survey?
            3. What aspects of how the survey was designed and implemented could be problematic for testing causality? How could these aspects be addressed?
               1. Important points to touch on: 
                  1. Survey methodology: question phrasing, belief in confidentiality of results, non-serious responders (trolls)
                  2. Survey subjects: are they a representative sample of the population of interest?
                  3. Re: question phrasing, some survey questions touch on topics that have been linked to poor health outcomes in RCTs; other questions do not


      4. Experiment + Results
         1. Calculate correlation between screen usage and health outcome variable/s. Show the correlation coefficient, scatter plot with best-fit regression line, and p-value.
      5. Discussion
         1. What did our analysis show?
         2. What can and can’t we say about causality?
            1. How are your conclusions affected by your prior knowledge and biases about the topic?
         3. What would your recommendation be for policymakers legislating on this topic?
         
         
         Data that might invite causation

A widget to plot different variables and compare correlations between the questions.
If you could redo the study, what would you change?
change x axis y axis to 

Use the course material as much as possible. 

# LS22 Correlation vs Causation

In this module, you will:

1. Learn some of the basic problems in establishing causation.
2. Recognize some of the basic problems in establishing causation and use them to identify situations in which claims of causation are and are not warranted.     
3. Discover how a randomized controlled trial can help rule out spurious correlations.  
4. Identify the flaw in an argument in which correlation is inappropriately being substituted for causation.  
5. Address the argument, “Science can only establish correlations; it can’t determine causality.”    
6. Use the definition of causation to identify situations in which claims of causation are and are not warranted.    
7. Design RCTs for sample problems.    
8. Identify flaws in experimental designs aimed at testing causality and explain how the flaws could be addressed.  


**Correlation** - Two variables are correlated if, when the value of one goes up or down, the value of the other tends to change as well (possibly in the opposite direction). Visually, this appears as points clustered around a straight line. The tightness of the cluster indicates how strongly the variables are correlated. This is also called **linear association**.
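To make "tightness of the cluster" concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper name `pearson_r` and all of the data are made up for illustration; none of it comes from the survey.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each variable on its own mean
    xc = x - x.mean()
    yc = y - y.mean()
    # Sum of co-movements divided by the product of the two spreads
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
tight = 2 * x + rng.normal(scale=0.2, size=200)     # points hug a rising line
loose = 2 * x + rng.normal(scale=5.0, size=200)     # same slope, far more scatter
falling = -2 * x + rng.normal(scale=0.2, size=200)  # points hug a falling line

r_tight = pearson_r(x, tight)
r_loose = pearson_r(x, loose)
r_falling = pearson_r(x, falling)
print(r_tight, r_loose, r_falling)
```

The coefficient lies between -1 and 1. Its sign gives the direction of the line, and its magnitude reflects how tightly the points cluster around it, not how steep the line is. NumPy's built-in `np.corrcoef(x, y)[0, 1]` should give the same number.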

**Causation** - If a change in one variable is the result of a change in the other, this indicates causation.

**Correlation** only measures association. Correlation does not imply **causation**. 
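A small simulation can illustrate why. In the sketch below, a hidden third variable `z` (a hypothetical confounder) drives both `x` and `y`, while `x` has no effect on `y` at all. The observational data still show a strong correlation. When `x` is instead assigned at random, as in a randomized controlled trial, the correlation disappears. All variable names and numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hidden common cause (confounder); purely hypothetical
z = rng.normal(size=n)

# Observational setting: z influences both x and y; x does NOT influence y
x_obs = z + rng.normal(scale=0.5, size=n)
y_obs = z + rng.normal(scale=0.5, size=n)
r_obs = np.corrcoef(x_obs, y_obs)[0, 1]

# Randomized setting: x is set by a coin flip, so it cannot depend on z
x_rand = rng.integers(0, 2, size=n).astype(float)
y_rand = z + rng.normal(scale=0.5, size=n)  # y is generated exactly as before
r_rand = np.corrcoef(x_rand, y_rand)[0, 1]

print(f"observational r = {r_obs:.2f}")
print(f"randomized r    = {r_rand:.2f}")
```

Randomization breaks the link between `x` and the confounder, so any association that survives it is much harder to explain away as spurious.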

In [5]:
# Run this cell
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

From the Nature article:

"The idea that digital devices and the Internet have an enduring influence on how humans develop, socialize and thrive is a compelling one [1]. As the time spent by young people online has doubled in the past decade [2], the debate about whether this shift negatively impacts children and adolescents is becoming increasingly heated [3]. A number of professional and governmental organizations have therefore called for more research into digital screen-time [4,5], which has led to household panel surveys [6,7] and large-scale social datasets adding measures of digital technology use to those already assessing psychological well-being [8]. Unfortunately, findings derived from the cross-sectional analysis of these datasets are conflicting; in some cases negative associations between digital technology use and well-being are found [9,10], often receiving much attention even when correlations are small. Yet other results are mixed [11] or contest previously discovered negative effects when re-analysing identical data [12]. One high-quality, pre-registered analysis of UK adolescents found that moderate digital engagement does not correlate with well-being, but very high levels of usage possibly have small negative associations."

All datasets contained a wide range of different questions that concern adolescents’ psychological well-being and functioning. We reversed selected measures so that these are all in the same direction, with higher scores indicating higher well-being.

In [6]:
df = pd.read_csv("XXHqn.csv")
df.head(5)

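| Unnamed: 0 | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | QN8 | QN9 | QN10 | ... | BMIPCT | RACEETH | Q6ORIG | Q7ORIG | QNDAYEVP | QNFREVP | QNDAYSKL | QNFRSKL | QNDAYCGR | QNFRCGR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 2.0 | 3.0 | 2.0 | E | 1.9 | 108.86 | 2.0 | 2.0 | 2.0 | ... | 97.44 | 5.0 | 603 | 240 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 1 | 7.0 | 2.0 | 3.0 | 1.0 |  | 1.6 | 58.97 | 2.0 | 2.0 | 2.0 | ... | 60.88 | 6.0 | 503 | 130 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2 | 5.0 | 1.0 | 3.0 | 1.0 | A | 1.65 | 64.41 | 2.0 | 2.0 | 2.0 | ... | 78.5 | 7.0 | 505 | 142 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 3 | 6.0 | 1.0 | 3.0 | 1.0 |  | 1.6 | 64.86 | 2.0 | 2.0 | 2.0 | ... | 84.61 | 6.0 | 503 | 143 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 4 | 7.0 | 2.0 | 3.0 | 2.0 | E | 1.75 | 65.77 | 2.0 | 2.0 | 2.0 | ... | 40.08 | 5.0 | 509 | 145 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |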
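Once two numeric columns have been chosen from the data dictionary, the correlation coefficient, best-fit line, and a p-value could be computed along the following lines. This is only a sketch under stated assumptions: `correlation_summary` is a hypothetical helper, the demo data are synthetic stand-ins rather than survey responses, and the p-value comes from a simple permutation test (an analytic alternative is `scipy.stats.pearsonr`).

```python
import numpy as np

def correlation_summary(x, y, n_permutations=2000, seed=0):
    """Pearson r, least-squares line, and permutation p-value for two variables."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop respondents with a missing answer to either question
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    r = np.corrcoef(x, y)[0, 1]
    # Degree-1 polynomial fit gives the best-fit regression line
    slope, intercept = np.polyfit(x, y, deg=1)
    # Permutation test: shuffle x to break any real link with y,
    # then count how often a shuffled |r| is at least as large as the observed |r|
    rng = np.random.default_rng(seed)
    perm_r = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                       for _ in range(n_permutations)])
    p_value = (np.sum(np.abs(perm_r) >= abs(r)) + 1) / (n_permutations + 1)
    return {"n": int(x.size), "r": float(r), "slope": float(slope),
            "intercept": float(intercept), "p_value": float(p_value)}

# Synthetic stand-in data (NOT survey responses), for illustration only
rng = np.random.default_rng(2)
hours = rng.integers(0, 8, size=300).astype(float)          # made-up "screen hours"
wellbeing = 5 - 0.3 * hours + rng.normal(scale=1.0, size=300)  # made-up outcome score
hours[:10] = np.nan                                          # simulate skipped questions

summary = correlation_summary(hours, wellbeing)
print(summary)
```

With the notebook's `df`, the same helper could be called on any two numeric columns, for example `correlation_summary(df[col_a], df[col_b])`, where `col_a` and `col_b` are placeholders for column names taken from the data dictionary. A small p-value only says the association is unlikely to be due to chance; it says nothing about which variable, if either, causes the other.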
