# Week 6 - Prediction & Causal Inference

Last week, we covered the classification of text, such as classifying Reddit posts by thread topic. Classification often relies on a representative sample of the texts we want to make inferences about: for example, we human-code only a random sample of texts and then use ML to classify the rest.

This week, we talk about two different types of inference to out-of-sample populations. _Text as Data_ defines _prediction_ by the question: "What value of the outcome do we expect for a unit or units out of a distinct population of units?" Often this is prediction about the future. We don't expect the weather tomorrow to be identical to the weather today, but today's weather should contain some useful information about it. The authors define _causal inference_ by the question: "How do outcomes differ if we intervene in the world?" Causality is a deeply contested notion in science and philosophy, but it usually involves an "if": a difference between two counterfactual worlds, one where an event occurs and one where it doesn't.

For this notebook, we will be using the following packages:

In [1]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git

#All these packages need to be installed from pip
import requests #For getting files
import zipfile #For managing zips
import numpy as np #For arrays
import scipy as sp #For some stats
import pandas as pd #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics

from transformers import pipeline

#This 'magic' command makes the plots work better
#in the notebook, don't use it outside of a notebook.
#Also you can ignore the warning
%matplotlib inline

import os #For looking through files
import os.path #For managing file paths

pd.set_option('display.max_columns', None)

# Prediction
We can make predictions about a range of different populations of texts. We can use texts in English to predict their translated version in French. We can use newspaper articles from 2012 to 2022 to predict 2023 newspaper articles (e.g., a [time series](https://en.wikipedia.org/wiki/Time_series)). Instead of forecasting, we can "nowcast" by using real-time social information such as Tweets to predict when an important event is happening, such as a riot.

If we don't have any information about how the new population will vary from the population we learned from, then prediction is implemented in the same way as in-sample inference. For example, if you have a categorization of 2022 emails as spam or not, you could predict whether 2023 emails are spam the same way you predicted 2022 emails. On the other hand, if you have new information, such as a trend beginning in December 2022 for spam emails to have "Urgent:" in the subject line, your 2023 prediction may differ by putting more weight on that indicator relative to others.

We currently don't have code for this, because it's so similar to the classification last week, but we encourage you to think more about this if you're interested in predicting the future of your corpus!
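
As a starting point, the time-ordered split itself can be sketched in a few lines. This is only an illustration under assumptions: the emails are hand-written toy examples, and `urgent_rate` and `cutoff` are hypothetical names that do not come from the course code. The idea is to train on the earlier period, hold out the later one, and check whether an indicator such as "Urgent:" has become more common among spam.

```python
from datetime import date

# Toy, hand-written examples (hypothetical data, for illustration only)
emails = [
    {"date": date(2022, 3, 1), "subject": "Meeting notes", "spam": False},
    {"date": date(2022, 6, 9), "subject": "You won a prize", "spam": True},
    {"date": date(2022, 12, 20), "subject": "Urgent: verify account", "spam": True},
    {"date": date(2023, 1, 5), "subject": "Urgent: claim refund", "spam": True},
    {"date": date(2023, 2, 11), "subject": "Lunch on Friday?", "spam": False},
]

# Split by time rather than at random: learn from the past, predict the future
cutoff = date(2023, 1, 1)
train = [e for e in emails if e["date"] < cutoff]
test = [e for e in emails if e["date"] >= cutoff]

def urgent_rate(rows):
    """Share of spam emails whose subject starts with 'Urgent:'."""
    spam = [r for r in rows if r["spam"]]
    if not spam:
        return 0.0
    return sum(r["subject"].startswith("Urgent:") for r in spam) / len(spam)

print(len(train), len(test))
print(urgent_rate(train), urgent_rate(test))
```

A large gap between the two rates would suggest that the new population differs from the one we learned from, and that the classifier's weights may need updating.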

# Text in causal inference

In causal inference, we are interested in the effect of a _treatment_ on an _outcome_. There are five sorts of variables that could be directly involved in our causal model, and any of them could be a text variable. This figure from [Keith et al. 2020](https://aclanthology.org/2020.acl-main.474.pdf) concisely shows the five positions for variables in an acyclic causal diagram (i.e., one where no path of arrows leads from a variable back to itself): treatment, mediator, outcome, confounder, and collider.

<img src="https://raw.githubusercontent.com/UChicago-Computational-Content-Analysis/Homework-Notebooks/main/week-6/img/causal_diagram.png" alt="https://raw.githubusercontent.com/UChicago-Computational-Content-Analysis/Homework-Notebooks/main/week-6/img/causal_diagram.png" style="width:500px">

"Text as treatment" means the effect of text on other variables. For example, how does the news coverage of a politician affect their election chance? How does the sentiment of a Reddit post affect its upvotes?

Whether we're interested in text as treatment, mediator, outcome, or confounder, we have at our disposal the same causal inference strategies used with other forms of data, such as matching, difference-in-differences, regression discontinuity, and instrumental variables. Each of these methods usually gives you a more credible identification of the causal effect than a plain regression. For example, one of the readings for this week, [Saha 2019](https://doi.org/10.1145/3292522.3326032), uses propensity score matching, which is a straightforward method that works on most datasets but has relatively weak identification. Some scholars, such as Gary King, advocate for an improved method, [coarsened exact matching](https://www.youtube.com/watch?v=tvMyjDi4dyg). However, for this assignment, we do not go into these methods; we only use simple regressions. There are several courses at UChicago that introduce these methods, as well as online textbooks like Scott Cunningham's [Causal Inference: The Mixtape](https://mixtape.scunning.com/), which is geared towards economists.

You can do causal inference on any sort of text data as long as you have a plausible _identification_ strategy, meaning an argument that your data and analysis can correctly identify a causal effect if one exists. For example, if you have data from a randomized controlled trial (RCT) where you assign some treatment at random, you can identify a causal effect with relative ease.
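
To see why randomization makes identification easy, here is a minimal simulation sketch. Everything in it is synthetic and assumed for illustration (the sample size, the noise model, and the name `TRUE_EFFECT`): because a coin flip decides who is treated, the treated and control groups are comparable on average, so a simple difference in mean outcomes should land close to the true effect.

```python
import random
import statistics

random.seed(0)

# Simulated (synthetic) experiment: the true effect of treatment is +1.0
TRUE_EFFECT = 1.0
units = []
for _ in range(2000):
    treated = random.random() < 0.5        # coin-flip assignment
    baseline = random.gauss(0, 1)          # unit-level variation unrelated to treatment
    outcome = baseline + TRUE_EFFECT * treated
    units.append((treated, outcome))

# Under randomization, the difference in group means estimates the causal effect
treated_mean = statistics.mean(y for t, y in units if t)
control_mean = statistics.mean(y for t, y in units if not t)
estimate = treated_mean - control_mean
print(round(estimate, 2))
```

With observational data, nothing guarantees that the two groups are comparable, which is why the rest of this notebook has to argue for its identification.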

# Text as treatment and outcome

To show text as treatment and outcome, we will analyze a dataset of internet arguments. We have 8,8895 pairs of comments, where one person makes a statement and the other responds. Our research question is thus: _How does the text of the first commenter affect the text of the respondent?_

The data comes from the [Internet Argument Corpus](https://nlds.soe.ucsc.edu/iac). Let's load the data and take a look.

In [2]:
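url = 'http://nldslab.soe.ucsc.edu/iac/iac_v1.1.zip'

# Download the zip archive and stop early if the request failed
req = requests.get(url)
req.raise_for_status()

# Save it under the last component of the URL (iac_v1.1.zip)
filename = url.split('/')[-1]
with open(filename,'wb') as output_file:
    output_file.write(req.content)
print('Downloaded file: ' + url)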

Downloaded file: http://nldslab.soe.ucsc.edu/iac/iac_v1.1.zip


In [3]:
# Read the Mechanical Turk annotation averages and the metadata straight from the zip
with zipfile.ZipFile('iac_v1.1.zip') as z:
    with z.open('iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_averages.csv') as f:
        qr = pd.read_csv(f)

    with z.open('iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_meta.csv') as f:
        md = pd.read_csv(f)

# Join annotations to metadata and keep pairs where both post IDs are known
pairs = qr.merge(md, how='inner', on='key')
pairs = pairs[pairs.quote_post_id.notnull() & pairs.response_post_id.notnull()]
pairs


That's a lot of variables! Variables like "agree-disagree" are the averages of annotations of the data made by workers on Mechanical Turk. The workers were asked questions like:

* __agree-disagree__ (Boolean): Does the respondent agree or disagree with the prior post?
* __fact-feeling__  (-5 to 5): Is the respondent attempting to make a fact based argument or appealing to feelings and emotions?
* __attack__ (-5 to 5): Is the respondent being supportive/respectful or are they attacking/insulting in their writing?
* __sarcasm__ (-5 to 5): Is the respondent using sarcasm?

Unfortunately the dataset only has the "response" annotated, not the original "quote." However, some "responses" in this dataset are also "quotes," meaning we can form triples of quote-response-response. Let's self-merge this dataframe to get these "r1" and "r2" pairs where both texts have annotations.

In [None]:
# Self-merge where the 'response' matches another 'quote' in the DataFrame
triples = pairs.merge(pairs,left_on='response',right_on='quote',how='inner',suffixes=('_r1','_r2'))

# Rename and reorder columns
triples = triples.rename(columns={'quote_r1':'quote', 'quote_r2':'response1', 'response_r2':'response2'})
triples = triples.drop(columns=['response_r1'])
front_columns = [
                 'quote','response1','response2','attack_r1','fact-feeling_r1','nicenasty_r1','sarcasm_r1',
                 'agreement_r2'
                ]
triples = triples.dropna(subset=front_columns)
triples = triples[front_columns].join(triples.drop(columns=front_columns))

# Display triples
triples

Now we have 1,346 triples of quote-response1-response2, along with several text variables of response1 (e.g., "Is the respondent using sarcasm?") that may predict the agreement of response2. In other words: _Does a sarcastic comment lead to more agreement?_ Of course, as with almost all observational data, there are a number of confounders that make our identification difficult, but for now, let's see how to run a simple regression in Python of agreement_r2 (the dependent variable, commonly known as Y) on sarcasm_r1. Fortunately, we do have a strong case for identifying the direction of causality: Because response1 comes before response2, we can rule out the possibility that response2 affects response1.

In [None]:
# statsmodels is a popular Python statistics package
import statsmodels.api as sm

# We build an Ordinary Least Squares (OLS) model of agreement_r2 on sarcasm_r1.
# The function sm.add_constant() adds an intercept term to the regression (e.g., b in y = ax + b)
y = triples['agreement_r2']
X_cols = ['sarcasm_r1']
X = sm.add_constant(triples[X_cols])

lm1 = sm.OLS(y,X).fit()
lm1.summary()

The p-value for sarcasm_r1 is 0.855, which means we fail to reject the null hypothesis that there is no effect of sarcasm on agreement. However, we have other variables that may be confounding the effect of pure "attack" or pure "sarcasm." Let's try adding 3 other annotations to the regression model.

In [None]:
y = triples['agreement_r2']
X_cols = ['attack_r1','fact-feeling_r1','nicenasty_r1','sarcasm_r1']
X = sm.add_constant(triples[X_cols])

lm2 = sm.OLS(y,X).fit()
lm2.summary()

With this new regression model, we see a significant effect from both attack_r1 and sarcasm_r1, indicating that both affect whether response2 agrees with response1. Note that the coefficients are both positive: For attack_r1, this means that a more "supportive/respectful" comment led to more agreement, and for sarcasm_r1, this means that a more sarcastic comment led to more agreement! Is that surprising?

For good measure, we can add other variables ourselves, such as sentiment and the character length of the comment. The length may be particularly important because of how it affects the annotations of Mechanical Turk workers. For example, as I was skimming through the data, it seemed like shorter comments were being rated as more supportive/respectful, even if this was not the case on a per-word basis. For sentiment, let's use the convenient BERT pipeline we used last week.

In [None]:
# Character length of each comment
triples['length_r1'] = triples['response1'].str.len()
triples['length_r2'] = triples['response2'].str.len()

In [None]:
sentiment = pipeline("sentiment-analysis")
result = sentiment("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

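This version of BERT is built only for texts of up to 512 tokens, so we truncate longer comments. For simplicity, the code below cuts each comment to its first 512 characters rather than 512 tokens. This is a rough proxy that in practice keeps the input under the token limit, at the cost of discarding more text than strictly necessary.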

In [None]:
%%time

triples['sentiment_r1'] = triples['response1'].apply(lambda x: sentiment(x[:512])[0]['score'])
triples['sentiment_r2'] = triples['response2'].apply(lambda x: sentiment(x[:512])[0]['score'])
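
One caveat about this measure: the `score` field returned by the pipeline is the model's confidence in whichever label it predicted, so a confidently negative comment and a confidently positive comment both score close to 1. If a signed polarity is wanted instead, one option is to fold the label into the score. The sketch below assumes the default sentiment pipeline's output format (a dict with a `label` of `POSITIVE` or `NEGATIVE` and a `score`); `signed_score` is a hypothetical helper, not part of `transformers` or `lucem_illud`.

```python
def signed_score(result):
    """Map a pipeline result to [-1, 1]: negative labels get a negative sign."""
    sign = 1.0 if result["label"].upper() == "POSITIVE" else -1.0
    return sign * result["score"]

# Hand-written examples in the pipeline's output format (not real model output)
print(signed_score({"label": "NEGATIVE", "score": 0.9}))  # -0.9
print(signed_score({"label": "POSITIVE", "score": 0.8}))  # 0.8
```

In this notebook it could be applied as `triples['response1'].apply(lambda x: signed_score(sentiment(x[:512])[0]))`. Doing so changes the regression input, so any results based on the unsigned score would need to be re-checked.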

In [None]:
y = triples['agreement_r2']
X_cols = ['attack_r1','fact-feeling_r1','nicenasty_r1','sarcasm_r1','length_r1','sentiment_r1']
X = sm.add_constant(triples[X_cols])

# Use a new name so the earlier four-variable model (lm2) is not overwritten
lm3 = sm.OLS(y,X).fit()
lm3.summary()

Our finding of significant effects for attack_r1 and sarcasm_r1 persists even with these new controls! This sort of robustness or sensitivity analysis is important for making sure your finding is compelling to yourself and to your audience.

In [None]:
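# Pairwise correlations between the regressors.
# np.corrcoef treats each ROW as a variable by default, so we use DataFrame.corr(),
# which correlates the columns instead.
triples[X_cols].corr()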

## <font color="red">*Exercise 1*</font>

<font color="red">Propose a simple causal model in your data, or a different causal model in the annotated Internet Argument Corpus (e.g., a different treatment, a different outcome), and test it using a linear regression. If you are using social media data for your final project, we encourage you to classify or annotate that data (either computationally or with human annotators) and look at the effect of texts on replies to that text (e.g., Reddit posts on Reddit comments, Tweets on Twitter replies, YouTube video transcripts on YouTube comments).</font>

## Splitting training and test text
Above, we used a number of external measures of text, meaning that the measures were developed without any influence from this dataset. For the annotations, it was Mechanical Turk workers measuring the text. For length, that is just a mathematical count of characters. For sentiment, that BERT model was not based on this Internet Arguments Corpus.

However, this is not always the case. Suppose we want a measure of the text based on topic modeling: we build an LDA topic model of these comments, then measure how many words from Topic 1 each comment uses. Can we put that measure in the regression? Unfortunately, it would lead to a biased estimate of the true effect size because our measure is no longer external. The measure and the regression are double-dipping into the same textual information. The remedy proposed by Egami et al. is to split the corpus: discover the measure on a training set, then estimate the effect only on a held-out test set. This is important to keep in mind for your final projects, and for a more thorough explanation and justification, you can read [Egami et al. 2018](https://arxiv.org/pdf/1802.02163.pdf).
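
The mechanics of such a split can be sketched as follows. This is a simplified illustration, not the procedure from the paper: the comments are toy strings, and the "topic" is just the three most frequent words in the discovery half, standing in for a real topic model such as LDA. The point is only that the measure is learned from one set of documents and then applied, unchanged, to a different set.

```python
import random
from collections import Counter

random.seed(0)

# Toy comments (hypothetical stand-ins for the corpus)
comments = [
    "guns laws rights", "rights court laws", "evolution science facts",
    "science facts theory", "laws guns court", "theory evolution science",
    "court rights guns", "facts theory evolution",
]

# 1. Randomly split documents into a discovery half and an estimation half
indices = list(range(len(comments)))
random.shuffle(indices)
half = len(indices) // 2
discover_idx, estimate_idx = indices[:half], indices[half:]

# 2. Learn the measure using ONLY the discovery half.
#    Stand-in for a topic model: the 3 most frequent words play the role of "Topic 1".
counts = Counter(w for i in discover_idx for w in comments[i].split())
topic_words = {w for w, _ in counts.most_common(3)}

# 3. Apply the frozen measure to the estimation half.
#    Only these rows (and their outcomes) would enter the regression.
measure = {i: sum(w in topic_words for w in comments[i].split()) for i in estimate_idx}
print(measure)
```

The cost is statistical power, since only part of the data is used for the regression, but the measure is external to the documents it is tested on.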

# Text as mediator

What if text is instead the _mediator_, meaning it is affected by the treatment and in turn affects the outcome?
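
One common way to reason about a mediator is the product-of-coefficients approach: regress the mediator on the treatment, regress the outcome on both, and multiply the two relevant slopes to get the indirect effect. The sketch below is an assumption-laden illustration on synthetic data, not a method from this course: the effect sizes are made up, `ols` is a hypothetical helper, and the approach relies on linear relationships and on there being no unobserved confounder between mediator and outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: T -> M -> Y, plus a direct T -> Y path
T = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
M = 0.8 * T + rng.normal(size=n)               # mediator (e.g., a measure of the text)
Y = 0.5 * M + 0.3 * T + rng.normal(size=n)     # outcome

def ols(y, *cols):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a = ols(M, T)[1]               # effect of T on M
_, direct, b = ols(Y, T, M)    # direct effect of T, and effect of M holding T fixed
indirect = a * b               # effect of T on Y that flows through M
print(round(direct, 2), round(indirect, 2))
```

In the simulation the true direct effect is 0.3 and the true indirect effect is 0.8 × 0.5 = 0.4, so the estimates should land near those values.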

# Text as confounder
The text might not be part of the causal relationship we're interested in estimating. Instead, it could be a variable that affects both our treatment and our outcome, known as a _confounder_. Why do we need to control for this? If we didn't, we might correctly find that the treatment and outcome are correlated, but rather than one causing the other, they could both be caused by a third variable. For example, if we are studying the effect of the journal a paper is published in on the citations of the paper, we may be worried that the text of the article affects both whether it is published by the journal and whether people cite it.
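
A small simulation makes the problem concrete. The data below are synthetic and the coefficients are arbitrary assumptions; `ols` is a hypothetical helper. The treatment has no effect on the outcome at all, yet a regression that omits the confounder finds a clearly positive slope, while a regression that controls for it recovers a slope near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: C drives both T and Y; the true effect of T on Y is zero
C = rng.normal(size=n)               # confounder (e.g., some measure of the article's text)
T = 0.7 * C + rng.normal(size=n)     # treatment depends on C
Y = 1.0 * C + rng.normal(size=n)     # outcome depends on C only

def ols(y, *cols):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(Y, T)[1]         # omits C: picks up the spurious association
adjusted = ols(Y, T, C)[1]   # controls for C: close to the true effect of zero
print(round(naive, 2), round(adjusted, 2))
```

When the confounder is text, the hard part is the step this sketch takes for granted: turning the text into a variable like `C` that captures what matters for both treatment and outcome.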

In [None]:
# complaints data

# Customization of text to fit causal inference methods

See Veitch, Sridhar, and Blei (2020), "Adapting Text Embeddings for Causal Inference," with code at https://github.com/blei-lab/causal-text-embeddings

In [None]:
# start with data from the paper