# Correlation vs Causation 
Correlation says that two quantities $X$ and $Y$ tend to vary together, while causation says that a change in $X$ produces a change in $Y$. These are not the same thing! Check out [this webpage](http://www.tylervigen.com/spurious-correlations) for funny examples of things that are strongly correlated even though it makes no sense for one to cause the other, like the number of people who drown in pools vs. the number of movies Nicolas Cage appears in!
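
One common way such correlations arise is a shared trend over time. The sketch below uses made-up data (the series names, slopes, and noise ranges are all assumptions chosen for illustration): two yearly series that never influence each other, but both drift upward.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
years = range(20)

# Two hypothetical yearly series built from independent noise draws.
# Neither depends on the other; they only share an upward trend.
series_a = [t + rng.uniform(-1, 1) for t in years]
series_b = [5 * t + rng.uniform(-5, 5) for t in years]

r_levels = pearson(series_a, series_b)

# Year-over-year changes remove the shared trend.
changes_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
changes_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]
r_changes = pearson(changes_a, changes_b)

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The correlation of the raw levels is very high simply because both series grow with time. The correlation of the year-over-year changes is usually much weaker, which hints that the relationship is driven by the trend rather than by one series acting on the other.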

Inference problems postulate that there is a set of features (let's call them $W_i$ with $i=1,\dots,n$) that cause $Y$. However, most of the time the $W_i$ features that cause $Y$ are very abstract and cannot be directly observed. For example, assume $Y$ is quality of life in a country. The $W_i$ features in this case might be things like income inequality, well-being, safety, community and social relationships, etc., which are very hard to measure because they are not well-defined quantities.

What can we do then if we cannot measure the $W_i$? The answer is to measure some other set of features $X_i$ that we believe are good approximations for the $W_i$. For example, a value for income inequality can be approximated by measuring the average difference between the highest and lowest wages across corporations, and a value for safety can be approximated by measuring the number of reported robberies in an area. 

Exercise: Can you describe why these quantities are just approximations? Do wages contain all the information you need to know about income inequality? How about vacation days? How about employer-provided health insurance?

Once we have some $X_i$ features that we can measure, we can collect data and build a model, such as a linear regression model, to try to understand the effect of the $X_i$ on $Y$. To the extent that the $X_i$ are good approximations for the $W_i$, the results can then be used to understand the relationship between the $W_i$ and $Y$.
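
How much the quality of the approximation matters can be illustrated with simulated data. The sketch below is not a real analysis: the true slope of 2, the noise levels, and the single-feature setup are all assumptions. It fits a one-variable least-squares line twice, once on the unobservable $W$ and once on a noisy proxy $X$.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


rng = random.Random(1)
n = 5000

# Simulated world (assumed): W causes Y with a true slope of 2.
w = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * wi + rng.gauss(0, 0.5) for wi in w]

# We can only measure X, a proxy equal to W plus measurement noise.
x = [wi + rng.gauss(0, 1) for wi in w]

slope_w = ols_slope(w, y)  # what we would get if W were observable
slope_x = ols_slope(x, y)  # what we actually get from the proxy

print(f"slope using W: {slope_w:.2f}")
print(f"slope using X: {slope_x:.2f}")
```

With these settings the slope on $W$ comes out close to 2, while the slope on $X$ comes out close to 1, because half of the variance of $X$ is measurement noise. The noisier the proxy, the more the estimated effect shrinks toward zero.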

Note that picking a model gives rise to a similar problem. We do not know the "true model" (let's call it $F$) that describes the relationship between $Y$ and the $W_i$ (i.e. $Y = F(W_i)$), so we try to estimate a model (let's call it $f$) of the form $Y = f(X_i)$ that approximates $F(W_i)$, such that the error $F(W_i)-f(X_i)$ is small.

On the other hand, the prediction problem does not care about what causes $Y$. All it cares about is whether we can predict what $Y$ will be based on the available data $X_i$. In the example above, the prediction problem does not care about income inequality per se, but it cares about whether we can use the average difference in wages in corporations to predict quality of life. 
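
The gap between predicting and explaining can be sketched with simulated data. In the hypothetical setup below, a hidden quantity `z` drives both $X$ and $Y$, and $X$ has no effect on $Y$ at all; the coefficients and noise levels are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)
n = 5000

# Hidden common cause (assumed): Z drives both X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [zi + rng.gauss(0, 0.3) for zi in z]
y = [3.0 * zi + rng.gauss(0, 0.3) for zi in z]  # Y depends on Z only

# For a one-feature linear fit, R^2 equals the squared correlation.
r2_observed = pearson(x_observed, y) ** 2

# "Policy": set X by hand, independently of Z. Y is generated by the
# same mechanism as before, so it does not respond to the new X.
x_set = [rng.gauss(0, 1) for _ in range(n)]
r2_intervened = pearson(x_set, y) ** 2

print(f"R^2 with observed X:      {r2_observed:.2f}")
print(f"R^2 after intervening on X: {r2_intervened:.2f}")
```

In the observed data $X$ is an excellent predictor of $Y$, which is all the prediction problem asks for. Once $X$ is set independently of the hidden cause, the relationship disappears, so the same model would be a poor guide for deciding what to change.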

### Model Results

Summarize and report your findings:
- What was the question you were trying to answer?
- What model did you use to answer the question?
- What features did you use?
- What procedure did you follow to build your model? (training/testing percentage split, number of k-folds, etc.)
- What was the performance of your model? (ROC curve, AUC, $R^2$ value)
- What are the model parameter values and what do they mean? (positive/negative correlations between $X_i$ and $Y$)
- What is the interpretation of the results? 
    - Do your results make sense?
    - Do you think the correlations you found imply causation? Why or why not?
        - Do you think that the features you used are directly related to the outcome?
        - Were there other features you wanted to include that were not in the data? How would these missing "ghost" features change your results?
        - Were the features you used collected in a thorough and unbiased manner?
        - Did you have enough data to build an accurate model?
        - Do your model parameters change much when you train the model on different data?
- What could have improved your methodology and results?
    - Better features?
    - More data?
    - Better models?
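
One way to approach the question about parameters changing with different training data is to refit the model on resampled copies of the data and look at the spread of the fitted parameters. The sketch below uses bootstrap resampling on simulated data; the true slope of 1.5, the sample size, and the number of resamples are assumptions, and repeated train/test splits or k-fold refits would serve the same purpose.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


rng = random.Random(3)
n = 200

# Simulated data set (assumed): y = 1.5 * x + noise.
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1.5 * xi + rng.gauss(0, 2) for xi in x]

# Refit the slope on 200 bootstrap resamples (rows drawn with replacement).
slopes = []
for _ in range(200):
    idx = [rng.randrange(n) for _ in range(n)]
    slopes.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))

slope_mean = statistics.mean(slopes)
slope_sd = statistics.stdev(slopes)

print(f"slope across resamples: {slope_mean:.2f} +/- {slope_sd:.2f}")
```

A small spread relative to the size of the parameter suggests the estimate is stable. A large spread, or a sign that flips between resamples, is a reason to be cautious about interpreting that parameter at all.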

### Policy Implications

- Can you use your results to propose a policy that would help prevent or increase the probability of certain outcomes? 
    - Be very careful about correlations vs. causation when thinking about this. 
        - Do you think the correlations you found imply causation? Why or why not?
        - Is more data needed before drawing conclusions?