# Week 6 - Prediction & Causal Inference

Last week, we covered the classification of text, such as classifying Reddit posts by thread topic. Classification often relies on a representative sample of the texts we want to make inferences about: for example, we might human-code only a random sample of texts and then use machine learning to classify the rest.

This week, we talk about two different types of inference to out-of-sample populations. _Text as Data_ defines _prediction_ by the question: "What value of the outcome do we expect for a unit or units out of a distinct population of units?" Often this means predicting the future. We don't expect tomorrow's weather to be identical to today's, but today's weather should contain some useful information about it. The authors define _causal inference_ by the question: "How do outcomes differ if we intervene in the world?" Causality is a deeply contested notion in science and philosophy, but it usually involves an "if": a difference between two counterfactual worlds, one where an event occurs and one where it doesn't.

For this notebook we will be using the following packages:

In [1]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git

#All these packages need to be installed from pip
import numpy as np #For arrays
import scipy as sp #For some stats
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics


#This 'magic' command makes the plots work better
#in the notebook, don't use it outside of a notebook.
#Also you can ignore the warning
%matplotlib inline

import os #For looking through files
import os.path #For managing file paths

# Prediction
We can make predictions about a range of different populations of texts. We can use texts in English to predict their translated version in French. We can use newspaper articles from 2012 to 2022 to predict 2023 newspaper articles (e.g., a [time series](https://en.wikipedia.org/wiki/Time_series)). Instead of forecasting, we can "nowcast" by using real-time social information such as Tweets to predict when an important event is happening, such as a riot.

If we don't have any information about how the new population will vary from the population we learned from, then prediction is implemented in the same way as in-sample inference. For example, if you have a categorization of 2022 emails as spam or not, you could predict whether 2023 emails are spam the same way you predicted 2022 emails. On the other hand, if you have new information, such as a trend beginning in December 2022 for spam emails to have "Urgent:" in the subject line, your 2023 prediction may differ by putting more weight on that indicator relative to others.
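As a minimal sketch of the spam example, the code below fits smoothed per-word log-odds on a handful of made-up 2022 emails, scores 2023 emails the same way, and then raises the weight on "urgent" by hand to reflect the new trend. The emails, the helper names (`tokenize`, `fit_log_odds`, `score`), and the weight of 2.0 are all illustrative assumptions. The scoring rule is a simplified Naive Bayes-style score that ignores class priors, not the classifiers from last week.

```python
import math
from collections import Counter

# Made-up toy emails for illustration; label 1 = spam, 0 = not spam
train_2022 = [
    ("win a free prize now", 1),
    ("free money click now", 1),
    ("meeting agenda for monday", 0),
    ("lunch on friday", 0),
]
new_2023 = ["urgent: confirm your account", "agenda for friday meeting"]

def tokenize(text):
    """Lowercase, treat ':' as a separator, and split on whitespace."""
    return text.lower().replace(":", " ").split()

def fit_log_odds(examples, alpha=1.0):
    """Smoothed log-odds of each word appearing in spam vs. non-spam."""
    spam, ham = Counter(), Counter()
    for text, label in examples:
        (spam if label == 1 else ham).update(tokenize(text))
    vocab = set(spam) | set(ham)
    n_spam = sum(spam.values()) + alpha * len(vocab)
    n_ham = sum(ham.values()) + alpha * len(vocab)
    return {
        w: math.log((spam[w] + alpha) / n_spam) - math.log((ham[w] + alpha) / n_ham)
        for w in vocab
    }

def score(text, weights):
    """Sum of word weights; words never seen in training contribute 0."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))

# No new information: score 2023 emails exactly as we would 2022 emails
weights = fit_log_odds(train_2022)
baseline = [score(t, weights) for t in new_2023]

# New information: "urgent" has become a spam signal, so we add a
# hand-chosen (hypothetical) positive weight for it
adjusted_weights = dict(weights, urgent=2.0)
adjusted = [score(t, adjusted_weights) for t in new_2023]

for text, b, a in zip(new_2023, baseline, adjusted):
    print(f"{text!r}: baseline={b:.2f}, adjusted={a:.2f}")
```

A positive score leans toward spam and a negative score leans away from it. The 2022 model has never seen "urgent", so it is only the adjusted weights that push the first 2023 email toward spam.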

We currently don't have code for this, because it's so similar to the classification last week, but we encourage you to think more about this if you're interested in predicting the future of your corpus!

# Text as treatment

In causal inference, we are interested in the effect of a _treatment_ on an _outcome_. There are five sorts of variables that could be directly involved in our causal model, and any of them could be a text variable. This figure from [Keith et al. 2020](https://aclanthology.org/2020.acl-main.474.pdf) concisely shows the five variables: treatment, mediator, outcome, confounder, and collider.

<img src="https://raw.githubusercontent.com/UChicago-Computational-Content-Analysis/Homework-Notebooks/main/week-6/img/causal_diagram.png" alt="https://raw.githubusercontent.com/UChicago-Computational-Content-Analysis/Homework-Notebooks/main/week-6/img/causal_diagram.png" style="width:500px">

"Text as treatment" means studying the effect of text on other variables. For example, how does the news coverage of a politician affect their chances of being elected? How does the sentiment of a Reddit post affect its upvotes?
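To make the upvote question concrete, here is a minimal sketch that estimates the effect of positive sentiment as a simple difference in mean upvotes. It assumes the sentiment of each post was randomly assigned, as in an experiment. With observational Reddit data that assumption would generally not hold. The posts and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical posts: "positive" is the text treatment, "upvotes" the outcome
posts = [
    {"positive": 1, "upvotes": 12},
    {"positive": 1, "upvotes": 15},
    {"positive": 1, "upvotes": 9},
    {"positive": 0, "upvotes": 7},
    {"positive": 0, "upvotes": 10},
    {"positive": 0, "upvotes": 4},
]

treated = [p["upvotes"] for p in posts if p["positive"] == 1]
control = [p["upvotes"] for p in posts if p["positive"] == 0]

# Under random assignment, the difference in means estimates the
# average treatment effect of positive sentiment on upvotes
effect = mean(treated) - mean(control)
print(f"Estimated effect of positive sentiment: {effect:.1f} upvotes")
```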

In [None]:
# internet arguments data

# Text as mediator

What if text is instead the _mediator_, meaning it is affected by the treatment and in turn affects the outcome?
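One common way to reason about a mediator is to split the total effect into a direct part and an indirect part that flows through the mediator. The sketch below does this with two least-squares regressions (the "product of coefficients" approach). It assumes linear relationships and no unmeasured confounding. The data are hypothetical and noise-free, and the helper `ols` is defined here only for illustration.

```python
import numpy as np

# Hypothetical toy data: a binary treatment, a text-derived mediator score,
# and an outcome built without noise so the decomposition is exact
treatment = np.array([0, 0, 0, 1, 1, 1], dtype=float)
mediator = np.array([0, 1, 2, 2, 3, 4], dtype=float)
outcome = 3 + 0.5 * treatment + 1.5 * mediator

def ols(predictors, y):
    """Least-squares coefficients, with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Step 1: effect of the treatment on the mediator
a = ols([treatment], mediator)[1]

# Step 2: effect of the mediator on the outcome, holding treatment fixed,
# alongside the direct effect of the treatment
_, direct, b = ols([treatment, mediator], outcome)

indirect = a * b                          # effect flowing through the text
total = ols([treatment], outcome)[1]      # equals direct + indirect here

print(f"direct={direct:.2f}, indirect={indirect:.2f}, total={total:.2f}")
```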

# Text as outcome
Text can also be the _outcome_ that is being affected by other variables.
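When the outcome is a measure derived from text, standard causal designs still apply. As one illustration, the sketch below computes a difference-in-differences (DiD) estimate on hypothetical average sentiment scores of posts in two communities, before and after one of them receives some intervention. It assumes the two communities would have followed parallel trends without the intervention. All numbers and names are made up.

```python
from statistics import mean

# Hypothetical per-post sentiment scores (the text-derived outcome)
sentiment = {
    ("treated", "before"): [0.2, 0.4],
    ("treated", "after"): [0.6, 0.8],
    ("control", "before"): [0.1, 0.3],
    ("control", "after"): [0.2, 0.4],
}

# Change over time within each community
treated_change = mean(sentiment[("treated", "after")]) - mean(sentiment[("treated", "before")])
control_change = mean(sentiment[("control", "after")]) - mean(sentiment[("control", "before")])

# DiD: the treated community's change beyond the control community's change
did = treated_change - control_change
print(f"Difference-in-differences estimate: {did:.2f}")
```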

In [None]:
# Jacy add some very brief text on matching, DiD, RD, IV, etc.

# Text as confounder
The association we observe between a treatment and an outcome might not reflect the causal relationship we are interested in. Instead, it could be produced by another variable that affects both our treatment and our outcome, known as a _confounder_. Why do we need to control for this? If we didn't, we might correctly find that the treatment and outcome are correlated, but rather than one causing the other, they could both be caused by a third variable. For example, if we are studying the effect of the journal a paper is published in on the citations of the paper, we may be worried that the text of the article affects both whether it is published by the journal and whether people cite it.
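The journal example can be sketched with a simple stratified comparison. Suppose the article text has been reduced to a single topic label (an assumption made here for illustration), and that topic is the only confounder. Comparing papers within each topic and then averaging across topics removes the part of the association that comes from topic. The papers, labels, and citation counts below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical papers: (topic derived from text, published in top journal?, citations)
papers = [
    ("hot", 1, 30), ("hot", 1, 32), ("hot", 1, 34), ("hot", 0, 28),
    ("niche", 1, 10), ("niche", 0, 6), ("niche", 0, 8), ("niche", 0, 10),
]

# Naive comparison: ignores that "hot" papers are both more likely to be
# in the top journal and more likely to be cited
naive = mean(c for _, t, c in papers if t == 1) - mean(c for _, t, c in papers if t == 0)

# Adjusted comparison: difference within each topic, weighted by topic size
by_topic = defaultdict(list)
for topic, top_journal, citations in papers:
    by_topic[topic].append((top_journal, citations))

adjusted = 0.0
for topic, rows in by_topic.items():
    diff = mean(c for t, c in rows if t == 1) - mean(c for t, c in rows if t == 0)
    adjusted += diff * len(rows) / len(papers)

print(f"naive={naive:.1f}, adjusted={adjusted:.1f}")
```

In this toy data the naive difference is much larger than the adjusted one, because most of the gap comes from topic rather than from the journal.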

In [None]:
# complaints data

# Customization of text to fit causal inference methods

See the 2020 work on causal text embeddings by Veitch and Blei: https://github.com/blei-lab/causal-text-embeddings

In [None]:
# start with data from the paper