# Week 7 - Prediction & Causal Inference

Last week, we explored (supervised) text classification, where we train a model to learn associations between text and some classification or value connected with it (e.g., what distinguishes a winning argument before the Supreme Court; can we extend our judgment regarding what documents are relevant to my thesis project to all of Google News; etc.). Classification often starts from a representative sample of the text about which we want to make inferences: we use machine learning to learn the "true" assignments on that sample and then classify the rest.

This week, we explore two different types of inference to out-of-sample populations. _Prediction_ involves our reasoned expectation regarding an unobserved state of the world, given the world in which we live and on which we have trained our prediction algorithm. Often this prediction is about the future world. We don't expect the U.S. Congress to talk about identical things today and tomorrow, but today should contain some useful information about tomorrow. By contrast, _causal inference_ poses the related but distinct challenge of forming reasoned expectations regarding an unobserved state of the world IF we intervene in some way. In other words, what does the intervention cause, and how can we predict it will change the world? Causality has a deeply contested history in social science and philosophy, but it usually involves an "if," a difference between two counterfactual worlds, one where an event occurs and one where it doesn't.

Causal questions in text analysis may place the text in one or more of many positions we explore below: as cause, effect, confounder, mediator (or moderator), or collider. For example, assuming that everything spoken can be transcribed into text, saying something mean might hurt someone's feelings (text as cause). Doing something mean might cause someone to say something angry (text as effect). Apologizing might change the influence of doing something mean (text as mediator/moderator). A compliment might obscure the effect of doing something mean (text as confounder). And yelling something audaciously mean might yield a loud, emotional response, both of which influence the likelihood that the interaction was recorded and subjected to analysis (text as collider). As you can see, in a single conversation, text can play all of these roles. Why do we care about cause and effect with text? Because words appear to exert power in the world, but which words, spoken under what circumstances, and by whom? Causal analysis attempts to answer questions of the form: if _X_ were written or spoken, would _Y_ happen?
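
To make the confounder role concrete, here is a minimal simulation sketch on synthetic data (not the corpus used below). The variable names `c`, `x`, and `y`, the helper `ols`, and the effect sizes are all assumptions chosen for illustration: `c` drives both the treatment `x` and the outcome `y`, so a regression of `y` on `x` alone overstates the effect of `x`, while adjusting for `c` should recover something close to the true value of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: c confounds the x -> y relationship.
c = rng.normal(size=n)                        # confounder
x = 0.8 * c + rng.normal(size=n)              # treatment, partly driven by c
y = 0.5 * x + 1.0 * c + rng.normal(size=n)    # outcome; true effect of x is 0.5


def ols(columns, target):
    """Return least-squares coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


naive_effect = ols([x], y)[1]        # ignores c, so it absorbs part of c's effect
adjusted_effect = ols([x, c], y)[1]  # conditions on c

print(f"naive: {naive_effect:.2f}, adjusted: {adjusted_effect:.2f}")
```

The same logic applies when the confounder is a property of text (for example, the topic of a post) that has to be measured before it can be adjusted for.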

## Set Up

### Imports

In [None]:
# installs if necessary
%pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git
%pip install statsmodels
%pip install transformers

In [None]:
import lucem_illud

import os
import requests
import zipfile
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

# statsmodels is a popular Python statistics package
import statsmodels.api as sm
import statsmodels.graphics.api as smg
from statsmodels.stats.mediation import Mediation

# Pipelines to add text-based quantitative variables for regressions
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")

# We have a lot of features, so let's set Pandas to show all of them.
pd.set_option('display.max_columns', None)

### Helper Functions

### Data Loading

In [None]:
# Download data
url = "http://nldslab.soe.ucsc.edu/iac/iac_v1.1.zip"
req = requests.get(url)
req.raise_for_status()  # stop here if the download failed

# save data
data_directory = "/Users/shaymilner/Library/Mobile Documents/com~apple~CloudDocs/Harris/Winter24/Content Analysis/assignments/soci40133-homeworks/data"
filepath = os.path.join(data_directory, url.split("/")[-1])
os.makedirs(os.path.dirname(filepath), exist_ok=True)

with open(filepath, "wb") as output_file:
    output_file.write(req.content)
print("Downloaded file: " + url)

In [12]:
# Unzip data
with zipfile.ZipFile(filepath) as z:
    with z.open(
        "iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_averages.csv"
    ) as f:
        qr = pd.read_csv(f)

    with z.open(
        "iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_meta.csv"
    ) as f:
        md = pd.read_csv(f)

In [16]:
# get pairs
pairs = qr.merge(md, how='inner', on='key')
pairs = pairs[~pairs.quote_post_id.isnull() & ~pairs.response_post_id.isnull()]
pairs

Unnamed: 0,key,discussion_id_x,agree-disagree,agreement,agreement_unsure,attack,attack_unsure,defeater-undercutter,defeater-undercutter_unsure,fact-feeling,fact-feeling_unsure,negotiate-attack,negotiate-attack_unsure,nicenasty,nicenasty_unsure,personal-audience,personal-audience_unsure,questioning-asserting,questioning-asserting_unsure,sarcasm,sarcasm_unsure,discussion_id_y,response_post_id,quote_post_id,term,task1 num annot,task2 num annot,task2 num disagree,quote,response
0,"(731, 1)",6032,0.333333,-1.333333,0.333333,0.333333,0.000000,0.500000,0.000000,0.333333,0.333333,3.000000,0.250000,0.666667,0.166667,-2.250000,0.250000,-4.250000,0.000000,0.200000,0.166667,6032,149609,149552.0,,6,6,4,I remember looking at the classic evolutionary...,Why do you find it necessary to fit observatio...
1,"(660, 3)",10217,0.600000,0.285714,0.000000,0.714286,0.000000,-2.500000,0.000000,1.000000,0.000000,-2.000000,0.000000,1.142857,0.000000,-1.500000,0.000000,0.500000,0.000000,0.142857,0.000000,10217,277697,277459.0,yes,7,5,2,So they (pro-life peeps) say abortion is murde...,"Yes, you are missing something. How come age d..."
2,"(114, 5)",3462,0.600000,-1.500000,0.000000,1.333333,0.000000,1.000000,0.000000,1.500000,0.000000,-1.500000,0.000000,2.166667,0.000000,-4.000000,0.000000,-1.500000,0.000000,0.000000,0.000000,3462,76012,75976.0,No terms in first 10,6,5,2,'If the solar system was brought about by an a...,"C.S.Lewis believes things on faith, yet we are..."
3,"(43, 3)",9930,0.166667,-0.833333,0.333333,1.500000,0.000000,0.400000,0.000000,1.500000,0.166667,-2.000000,0.000000,1.666667,0.000000,-2.800000,0.000000,0.000000,0.000000,0.000000,0.000000,9930,264824,264697.0,well,6,6,5,...to ToE because it means genetic evolution i...,"Well, it might help if you could propose a mec..."
4,"(1314, 0)",5352,0.142857,-1.666667,0.166667,0.000000,0.166667,-1.166667,0.166667,-0.833333,0.333333,0.833333,0.333333,0.166667,0.166667,-3.333333,0.166667,-0.166667,0.166667,0.600000,0.166667,5352,128326,128325.0,you,6,7,6,Sir Issac Newton was an idiot and you are a ge...,"You really think so? Im flattered, but I think..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,"(580, 4)",821,0.800000,-2.000000,0.166667,-0.500000,0.166667,1.000000,0.000000,1.333333,0.166667,-1.000000,0.000000,0.500000,0.166667,0.000000,0.000000,0.000000,0.000000,0.200000,0.166667,821,67788,67785.0,oh,6,5,1,Why do some of you guys insist on being rabid ...,oh because for the past decade or so they have...
9997,"(694, 4)",9258,0.000000,-3.200000,0.200000,0.200000,0.200000,-1.800000,0.000000,1.600000,0.400000,-1.200000,0.200000,0.600000,0.200000,-1.600000,0.000000,1.800000,0.000000,0.000000,0.200000,9258,241951,241848.0,but,5,5,5,But I see two people involved here. Whether th...,But the embryo is a mere clump of flesh inside...
9998,"(916, 6)",10301,0.000000,-3.000000,0.000000,-2.400000,0.000000,-2.400000,0.000000,-2.400000,0.000000,2.800000,0.000000,-2.400000,0.000000,-4.200000,0.000000,1.800000,0.000000,0.000000,0.200000,10301,281530,281509.0,,5,5,5,I disagree with you because the logic you have...,**\n Sez u. Your problem being that when you a...
9999,"(1348, 1)",6032,0.857143,0.800000,0.600000,0.600000,0.200000,3.000000,0.000000,-0.600000,0.400000,-3.000000,0.000000,0.600000,0.200000,3.000000,0.000000,3.000000,0.000000,0.000000,0.200000,6032,149609,149552.0,No terms in first 10,5,7,1,What I don't understand is why YEC's want to L...,That's what faith does. It limits your options.


In [17]:
# get triples
# Self-merge where the 'response' matches another 'quote' in the DataFrame
triples = pairs.merge(pairs,left_on='response',right_on='quote',how='inner',suffixes=('_r1','_r2'))

# Rename and reorder columns
triples = triples.rename(columns={'quote_r1':'quote', 'quote_r2':'response1', 'response_r2':'response2'})
triples = triples.drop(columns=['response_r1'])
front_columns = [
                 'quote','response1','response2','attack_r1','fact-feeling_r1','nicenasty_r1','sarcasm_r1',
                 'agreement_r2'
                ]
triples = triples.dropna(subset=front_columns)
triples = triples[front_columns].join(triples.drop(columns=front_columns))

# add length variables (number of characters)
triples['length_r1'] = triples['response1'].str.len()
triples['length_r2'] = triples['response2'].str.len()
triples['length_q'] = triples['quote'].str.len()

# add sentiment
# Note: 'score' is the classifier's confidence in its predicted label
# (POSITIVE or NEGATIVE), not a signed polarity; x[:512] keeps the first 512 characters.
triples['sentiment_r1'] = triples['response1'].apply(lambda x: sentiment(x[:512])[0]['score'])
triples['sentiment_r2'] = triples['response2'].apply(lambda x: sentiment(x[:512])[0]['score'])
triples['sentiment_q'] = triples['quote'].apply(lambda x: sentiment(x[:512])[0]['score'])

# Display triples
triples

Unnamed: 0,quote,response1,response2,attack_r1,fact-feeling_r1,nicenasty_r1,sarcasm_r1,agreement_r2,key_r1,discussion_id_x_r1,agree-disagree_r1,agreement_r1,agreement_unsure_r1,attack_unsure_r1,defeater-undercutter_r1,defeater-undercutter_unsure_r1,fact-feeling_unsure_r1,negotiate-attack_r1,negotiate-attack_unsure_r1,nicenasty_unsure_r1,personal-audience_r1,personal-audience_unsure_r1,questioning-asserting_r1,questioning-asserting_unsure_r1,sarcasm_unsure_r1,discussion_id_y_r1,response_post_id_r1,quote_post_id_r1,term_r1,task1 num annot_r1,task2 num annot_r1,task2 num disagree_r1,key_r2,discussion_id_x_r2,agree-disagree_r2,agreement_unsure_r2,attack_r2,attack_unsure_r2,defeater-undercutter_r2,defeater-undercutter_unsure_r2,fact-feeling_r2,fact-feeling_unsure_r2,negotiate-attack_r2,negotiate-attack_unsure_r2,nicenasty_r2,nicenasty_unsure_r2,personal-audience_r2,personal-audience_unsure_r2,questioning-asserting_r2,questioning-asserting_unsure_r2,sarcasm_r2,sarcasm_unsure_r2,discussion_id_y_r2,response_post_id_r2,quote_post_id_r2,term_r2,task1 num annot_r2,task2 num annot_r2,task2 num disagree_r2,length_r1,length_r2,length_q,sentiment_r1,sentiment_r2,sentiment_q
0,I remember looking at the classic evolutionary...,Why do you find it necessary to fit observatio...,"Evolution has no goals, it is merely a beautif...",0.333333,0.333333,0.666667,0.200000,-2.833333,"(731, 1)",6032,0.333333,-1.333333,0.333333,0.000000,0.500000,0.0,0.333333,3.000000,0.25,0.166667,-2.250000,0.25,-4.250000,0.0,0.166667,6032,149609,149552.0,,6,6,4,"(610, 2)",6032,0.600000,0.166667,0.333333,0.166667,-3.50,0.0,1.333333,0.166667,3.500000,0.0,0.500000,0.333333,-4.000000,0.0,1.500000,0.0,0.0,0.166667,6032,149673,149609.0,,6,5,2,263,117,265,0.997491,0.972950,0.998637
1,What is the fun in that?,"Seriously? Well, I come here hoping for someth...","nah, I was just poking fun because I can! Pers...",-0.600000,-2.200000,0.000000,0.000000,-2.166667,"(697, 2)",5205,0.833333,-2.400000,0.000000,0.000000,-5.000000,0.0,0.000000,2.000000,0.00,0.000000,0.000000,0.00,-2.000000,0.0,0.000000,5205,122800,122780.0,,5,6,1,"(1267, 0)",5205,0.600000,0.333333,0.833333,0.166667,-1.50,0.0,-1.333333,0.500000,2.000000,0.0,0.500000,0.166667,-3.000000,0.0,-1.500000,0.0,0.2,0.166667,5205,123129,122800.0,,6,5,2,356,152,24,0.990721,0.994051,0.999512
2,"First off, the scientific method goes:\n \n 1)...",You guys know me. Always happy to correct anyo...,"Ah, thanks for the correction, although there ...",2.400000,2.800000,2.200000,0.000000,-0.400000,"(9, 0)",9449,0.400000,0.600000,0.200000,0.200000,-2.666667,0.0,0.200000,-3.666667,0.00,0.200000,0.333333,0.00,3.666667,0.0,0.200000,9449,247240,247225.0,you,5,5,3,"(1393, 1)",9449,1.000000,0.200000,0.800000,0.200000,,,1.200000,0.200000,,,1.000000,0.400000,,,,,0.0,0.400000,9449,247243,247240.0,No terms in first 10,5,7,0,1544,198,169,0.998007,0.843735,0.996069
3,You can ignore the obvious question. This is w...,Actually what they are really doing is ignorin...,"Really, then show me how I'm wrong - without d...",-3.500000,-3.166667,-3.166667,0.166667,-1.833333,"(1077, 1)",3467,0.400000,-4.333333,0.000000,0.000000,-4.666667,0.0,0.000000,-0.666667,0.00,0.000000,-2.000000,0.00,3.000000,0.0,0.000000,3467,73741,73738.0,actually,6,5,3,"(622, 1)",3467,0.600000,0.166667,-1.333333,0.166667,2.50,0.0,0.333333,0.166667,1.500000,0.0,0.000000,0.166667,0.000000,0.0,-1.500000,0.0,0.2,0.166667,3467,73783,73741.0,really,6,5,2,131,853,257,0.998448,0.987749,0.993888
4,Its really sad what these gay predator priests...,Homosexuals are attracted to adults of the sam...,Homosexuals are attracted to people of the sam...,0.166667,2.166667,-0.166667,0.400000,-2.666667,"(611, 0)",4337,0.333333,-2.000000,0.166667,0.166667,-0.750000,0.0,0.166667,-2.000000,0.00,0.166667,0.000000,0.00,3.750000,0.0,0.166667,4337,112008,111931.0,No terms in first 10,6,6,4,"(1350, 0)",4337,0.428571,0.000000,1.166667,0.000000,-2.50,0.0,0.666667,0.000000,-1.750000,0.0,1.166667,0.000000,-1.250000,0.0,3.250000,0.0,0.0,0.000000,4337,112012,112008.0,No terms in first 10,6,7,4,817,183,131,0.992456,0.910152,0.999480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,1. Did she do anything wrong?,"Yes, she walked down the wrong alley, alone. W...",So it was wrong for her to walk down a public ...,0.166667,0.166667,0.500000,0.333333,-3.333333,"(27, 4)",13306,0.800000,0.666667,0.166667,0.166667,0.000000,0.0,0.166667,-1.000000,0.00,0.166667,0.000000,0.00,0.000000,0.0,0.500000,13306,370461,370385.0,yes,6,5,1,"(83, 5)",13306,0.200000,0.166667,-3.166667,0.166667,-2.75,0.0,-3.500000,0.166667,1.500000,0.0,-2.666667,0.166667,-2.250000,0.0,-0.750000,0.0,0.2,0.166667,13306,370646,370461.0,so,6,5,4,92,186,29,0.998961,0.999195,0.981565
1342,this is also the reason why atheism is more li...,It is curious how lately christians have been ...,"Well, the type of atheism you are talking abou...",1.200000,0.800000,0.600000,0.250000,-2.250000,"(1382, 6)",3982,0.000000,-1.000000,0.200000,0.200000,-1.285714,0.0,0.200000,-0.428571,0.00,0.200000,0.857143,0.00,1.714286,0.0,0.200000,3982,83610,83594.0,No terms in first 10,5,7,7,"(974, 6)",3982,0.600000,0.250000,0.750000,0.250000,2.50,0.0,1.000000,0.250000,-3.000000,0.0,1.750000,0.250000,-3.000000,0.0,2.500000,0.0,0.0,0.000000,3982,83636,83610.0,well,4,5,2,613,366,309,0.891050,0.991559,0.989538
1343,"Assault weapons? first of all, what are assaul...",Thanks Patriot for the well researched respons...,The ban on assault weapons stands on such polt...,2.000000,2.000000,2.500000,0.000000,0.200000,"(897, 4)",50,0.833333,0.500000,0.000000,0.000000,1.000000,0.0,0.000000,1.000000,0.00,0.000000,4.000000,0.00,1.000000,0.0,0.000000,50,33286,173.0,,4,6,1,"(398, 4)",50,0.000000,0.600000,2.200000,0.400000,-2.60,0.0,1.800000,0.400000,-1.200000,0.0,1.200000,0.400000,-0.800000,0.0,2.000000,0.0,0.0,0.200000,50,33347,33286.0,No terms in first 10,5,5,5,2412,4323,1614,0.984845,0.922969,0.997080
1344,"And without a gun you're defenseless, why can'...","Hardly any criminals go armed in England, and ...",All of the British papers keep going on about ...,2.400000,2.000000,2.600000,0.000000,-2.800000,"(1231, 1)",11999,0.285714,-1.600000,0.000000,0.000000,1.200000,0.0,0.000000,-1.000000,0.00,0.000000,0.000000,0.00,-0.200000,0.0,0.000000,11999,333578,333574.0,,5,7,5,"(1393, 2)",11999,0.571429,0.200000,0.200000,0.200000,-3.00,0.0,0.400000,0.200000,-0.333333,0.0,0.200000,0.200000,-2.666667,0.0,-0.333333,0.0,0.0,0.400000,11999,333583,333578.0,No terms in first 10,5,7,3,340,376,323,0.999583,0.999548,0.996570
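
The `sentiment_*` columns above all sit close to 1 because the pipeline's `score` is the confidence attached to whichever label was predicted, so a strongly negative text and a strongly positive text can both score 0.99. If a signed measure is wanted for regression, one option is to fold the label into the score. The sketch below is an assumption about how that could be done, not part of the original analysis: the helper name `signed_sentiment` is hypothetical, and it assumes the `POSITIVE`/`NEGATIVE` labels of the default English sentiment model.

```python
def signed_sentiment(result: dict) -> float:
    """Map a {'label': ..., 'score': ...} dict to a value in [-1, 1]."""
    score = float(result["score"])
    # Assumes the default POSITIVE/NEGATIVE labels; any other label raises
    # so that it is not silently miscoded.
    if result["label"] == "POSITIVE":
        return score
    if result["label"] == "NEGATIVE":
        return -score
    raise ValueError(f"unexpected label: {result['label']!r}")


# Hand-written examples in the shape the pipeline returns (not real model output).
examples = [
    {"label": "POSITIVE", "score": 0.9975},
    {"label": "NEGATIVE", "score": 0.9987},
]
signed = [signed_sentiment(r) for r in examples]
print(signed)
```

In the notebook this could be applied as `signed_sentiment(sentiment(x[:512])[0])` in place of `sentiment(x[:512])[0]['score']`.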


## *Exercise 1*

Describe 2 separate predictions relevant to your project and associated texts, which involve predicting text that has not been observed based on patterns that have. Then, in a single, short paragraph, describe a research design through which you could use textual features and the tools of classification and regression to evaluate these predictions.

Two potential predictions:
1. Predict the effect of SCOTUS opinions on subsequent congressional legislation.
2. Predict the effect of SCOTUS opinions on media coverage of abortion.

SCOTUS opinions often include novel arguments for or against a given social issue. In the 1973 *Roe v. Wade* case, SCOTUS argued for the constitutional right to abortion based on the concept of privacy, an argument that had not been made before. Conversely, the 2022 *Dobbs v. Jackson* opinion made the case against a constitutional right to abortion based on the lack of constitutional precedent for the practice. It is reasonable to assume that these novel arguments will influence the way congressional legislation following the opinions frames pro- or anti-abortion policies. Additionally, media coverage of abortion access across the US will begin to adopt these arguments, and the outcome of a case could also cause a rise in counter-argumentative media articles (e.g., after *Roe*, an increased presence of anti-abortion media coverage). To assess the first prediction, we could use SCOTUS opinions as bookmarks in time and assess the language used in subsequent congressional legislation up to the next major SCOTUS opinion. With this, we can extract and weight key terms (nouns, adverbs, etc.) in the SCOTUS opinion, and use these key terms to predict the type of congressional legislation to follow based on its own key terms and phrases. For the second prediction, we can adopt a similar model using media articles instead of congressional legislation.
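
One simple way to "extract and weight key terms" is TF-IDF, sketched below in plain Python. This is an assumption about how the weighting could be done, not a method the design above commits to, and the three short documents are made-up stand-ins for an opinion and two later bills.

```python
import math
from collections import Counter

# Made-up snippets standing in for an opinion and later bills (hypothetical text).
docs = {
    "opinion": "privacy right privacy liberty precedent",
    "bill_a": "privacy right funding clinic",
    "bill_b": "funding highway budget precedent",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))


def tfidf(tokens):
    """Term frequency times log inverse document frequency for one document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


opinion_weights = tfidf(tokenized["opinion"])
# Terms that are frequent in the opinion but rare elsewhere rank highest.
top_terms = sorted(opinion_weights, key=opinion_weights.get, reverse=True)[:2]
print(top_terms)
```

The weighted terms from each opinion could then serve as features in a classifier or regression that predicts properties of the legislation or coverage that follows.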

## *Exercise 2*

Propose a simple causal model in your data, or a different causal model in the annotated Internet Arguments Corpus (e.g., a different treatment, a different outcome), and test it using a linear or logistic regression model. If you are using social media data for your final project, we encourage you to classify or annotate a sample of that data (either computationally or with human annotators) and examine the effect of texts on replies to that text (e.g., Reddit posts on Reddit comments, Tweets on Twitter replies, YouTube video transcripts on YouTube comments or ratings). You do not need to make a graph of the causal model, but please make it clear (e.g., "X affects Y, and C affects both X and Y.").
    
Also consider using the [ConvoKit datasets](https://convokit.cornell.edu/documentation/datasets.html)! Anytime there is conversation, there is an opportunity to explore the effects of early parts of the conversation on later parts. We will explore this further in Week 8 on Text Generation and Conversation.
    
***Stretch*** (not required): Propose a more robust identification strategy using either matching, difference in differences, regression discontinuity, or an instrumental variable. Each of these methods usually gives you a more precise identification of the causal effect than an unconditional regression. Scott Cunningham's [Causal Inference: The Mixtape](https://mixtape.scunning.com/) is a free textbook on these topics, and all have good YouTube video explanations.

### Fact v. Feeling
Test whether a person's lean toward using fact or feeling based arguments in their text affected the lean of their respondent.

### Run OLS

In [18]:
# Run OLS regression
ex_y = triples["agreement_r2"]
ex_x_cols = ["fact-feeling_unsure_r1"]
ex_x = sm.add_constant(triples[ex_x_cols])

ex_lm = sm.OLS(ex_y, ex_x).fit()
print(ex_lm.summary())

                            OLS Regression Results                            
Dep. Variable:           agreement_r2   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     17.95
Date:                Wed, 21 Feb 2024   Prob (F-statistic):           2.43e-05
Time:                        11:38:11   Log-Likelihood:                -2572.2
No. Observations:                1340   AIC:                             5148.
Df Residuals:                    1338   BIC:                             5159.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -1

### Reflection
In the OLS regression above, we tested whether the first response was associated with the second response's agreement score (`agreement_r2`). The model is statistically significant (F = 17.95, p < .001), but it explains little of the variation in `agreement_r2`: the adjusted R-squared is 0.012, so the regressor accounts for only about 1.2% of the variance. Two caveats apply. First, the regressor used, `fact-feeling_unsure_r1`, appears to capture annotator uncertainty about the fact/feeling rating rather than the lean itself (`fact-feeling_r1`), so this result does not directly test the fact-versus-feeling lean. Second, an unconditional regression on observational data shows an association; reading it as a causal effect requires assuming there are no unmeasured confounders.
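
A significant coefficient alongside a tiny R-squared is common at this sample size. The sketch below uses synthetic data (the effect size of 0.2 and the variable names are assumptions, not estimates from the corpus) to show how a weak relationship can clear conventional significance thresholds while explaining only a few percent of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1340  # same order of magnitude as the triples sample; the data are synthetic

x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)  # hypothetical weak true effect

design = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coefs

# R^2: share of outcome variance captured by the fitted line.
r_squared = 1 - residuals.var() / y.var()

# Classical (nonrobust) standard error and t-statistic for the slope.
sigma2 = residuals @ residuals / (n - 2)
cov = sigma2 * np.linalg.inv(design.T @ design)
t_slope = coefs[1] / np.sqrt(cov[1, 1])

print(f"slope={coefs[1]:.3f}, t={t_slope:.2f}, R^2={r_squared:.3f}")
```

Significance speaks to whether the slope is distinguishable from zero; R-squared speaks to how much of the outcome the model accounts for. The two answer different questions.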

## *Exercise 3*

Propose a measure you could generate to fill in or improve upon the simple causal model you proposed above and how you would split the data (e.g., a % of your main data, a separate-but-informative dataset). You do not have to produce the measure.
    
***Stretch*** (not required): Produce the measure and integrate it into your statistical analysis. This could be a great approach for your final project!

We can use the example above to improve our causal model by incorporating the length of the first response, measured in words with `len(r1.split())`. Adding it lets us check the assumption that feeling-based statements tend to be shorter than factual ones, and we expect it to increase the share of variation our model explains ($R^2$). Because there are many pairs that do not belong to any triple, we could use them as our split: test the proposed model on the pairs-only split and validate it on the triples.
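
A minimal sketch of both steps, assuming pandas and toy stand-ins for the `pairs` and `triples` frames (the rows, the `n_words` column, and the `holdout_pairs` name are hypothetical). It computes a word-count length and selects the pairs whose `key` never appears in a triple as either `key_r1` or `key_r2`.

```python
import pandas as pd

# Toy stand-ins for the notebook's `pairs` and `triples` frames (hypothetical data).
pairs = pd.DataFrame({
    "key": ["a", "b", "c", "d"],
    "response": [
        "Short reply.",
        "A much longer reply with several words.",
        "Ok",
        "Facts and figures follow here",
    ],
})
triples = pd.DataFrame({"key_r1": ["a"], "key_r2": ["c"]})

# Word-count length of each response (vectorized version of len(r1.split())).
pairs["n_words"] = pairs["response"].str.split().str.len()

# Keys of every pair that participates in some triple, in either position.
keys_in_triples = set(triples["key_r1"]) | set(triples["key_r2"])

# Pairs outside the triples collection form the separate split.
holdout_pairs = pairs[~pairs["key"].isin(keys_in_triples)]
print(holdout_pairs[["key", "n_words"]])
```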


## *Exercise 4*
Propose a mediation model related to the simple causal model you proposed above (ideally on the dataset you're using for your final project). If you have measures for each variable in the model, run the analysis: You can just copy the "Mediation analysis" cell above and replace with your variables. If you do not have measures, do not run the analysis, but be clear as to the effect(s) you would like to estimate and the research design you would use to test them.

### Moderating Fact/Feeling
We can't use the fact-feeling lean here because the original quotes weren't coded for it. Instead, we can estimate the effect of the quote's sentiment and length on agreement. Specifically, do the sentiment and length of the original quote affect agreement later in the conversation through agreement in the first response?

In [25]:
triples.head(3)

Unnamed: 0,quote,response1,response2,attack_r1,fact-feeling_r1,nicenasty_r1,sarcasm_r1,agreement_r2,key_r1,discussion_id_x_r1,agree-disagree_r1,agreement_r1,agreement_unsure_r1,attack_unsure_r1,defeater-undercutter_r1,defeater-undercutter_unsure_r1,fact-feeling_unsure_r1,negotiate-attack_r1,negotiate-attack_unsure_r1,nicenasty_unsure_r1,personal-audience_r1,personal-audience_unsure_r1,questioning-asserting_r1,questioning-asserting_unsure_r1,sarcasm_unsure_r1,discussion_id_y_r1,response_post_id_r1,quote_post_id_r1,term_r1,task1 num annot_r1,task2 num annot_r1,task2 num disagree_r1,key_r2,discussion_id_x_r2,agree-disagree_r2,agreement_unsure_r2,attack_r2,attack_unsure_r2,defeater-undercutter_r2,defeater-undercutter_unsure_r2,fact-feeling_r2,fact-feeling_unsure_r2,negotiate-attack_r2,negotiate-attack_unsure_r2,nicenasty_r2,nicenasty_unsure_r2,personal-audience_r2,personal-audience_unsure_r2,questioning-asserting_r2,questioning-asserting_unsure_r2,sarcasm_r2,sarcasm_unsure_r2,discussion_id_y_r2,response_post_id_r2,quote_post_id_r2,term_r2,task1 num annot_r2,task2 num annot_r2,task2 num disagree_r2,length_r1,length_r2,length_q,sentiment_r1,sentiment_r2,sentiment_q
0,I remember looking at the classic evolutionary...,Why do you find it necessary to fit observatio...,"Evolution has no goals, it is merely a beautif...",0.333333,0.333333,0.666667,0.2,-2.833333,"(731, 1)",6032,0.333333,-1.333333,0.333333,0.0,0.5,0.0,0.333333,3.0,0.25,0.166667,-2.25,0.25,-4.25,0.0,0.166667,6032,149609,149552.0,,6,6,4,"(610, 2)",6032,0.6,0.166667,0.333333,0.166667,-3.5,0.0,1.333333,0.166667,3.5,0.0,0.5,0.333333,-4.0,0.0,1.5,0.0,0.0,0.166667,6032,149673,149609.0,,6,5,2,263,117,265,0.997491,0.97295,0.998637
1,What is the fun in that?,"Seriously? Well, I come here hoping for someth...","nah, I was just poking fun because I can! Pers...",-0.6,-2.2,0.0,0.0,-2.166667,"(697, 2)",5205,0.833333,-2.4,0.0,0.0,-5.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,-2.0,0.0,0.0,5205,122800,122780.0,,5,6,1,"(1267, 0)",5205,0.6,0.333333,0.833333,0.166667,-1.5,0.0,-1.333333,0.5,2.0,0.0,0.5,0.166667,-3.0,0.0,-1.5,0.0,0.2,0.166667,5205,123129,122800.0,,6,5,2,356,152,24,0.990721,0.994051,0.999512
2,"First off, the scientific method goes:\n \n 1)...",You guys know me. Always happy to correct anyo...,"Ah, thanks for the correction, although there ...",2.4,2.8,2.2,0.0,-0.4,"(9, 0)",9449,0.4,0.6,0.2,0.2,-2.666667,0.0,0.2,-3.666667,0.0,0.2,0.333333,0.0,3.666667,0.0,0.2,9449,247240,247225.0,you,5,5,3,"(1393, 1)",9449,1.0,0.2,0.8,0.2,,,1.2,0.2,,,1.0,0.4,,,,,0.0,0.4,9449,247243,247240.0,No terms in first 10,5,7,0,1544,198,169,0.998007,0.843735,0.996069


In [24]:
# Mediation analysis
y = triples['agreement_r1']
X_cols = ['sentiment_q','length_q']
X = sm.add_constant(triples[X_cols])
mediator_model = sm.OLS(y,X)

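# For the second step of the mediation model, we can add in other predictors.
# NOTE: as written, the outcome `y` is again `agreement_r1`, which is also the
# mediator, so this model regresses `agreement_r1` on itself. The intended
# outcome is presumably `agreement_r2`; the table below reflects the code as run.
y = triples['agreement_r1']
X_cols = ['sentiment_q','length_q', 'agreement_r1']
X = sm.add_constant(triples[X_cols])
outcome_model = sm.OLS(y,X)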

med = Mediation(outcome_model=outcome_model, mediator_model=mediator_model,
                exposure='length_q', mediator='agreement_r1').fit()
med.summary()

Unnamed: 0,Estimate,Lower CI bound,Upper CI bound,P-value
ACME (control),0.001299882,-0.1263222,0.123353,0.974
ACME (treated),0.001299882,-0.1263222,0.123353,0.974
ADE (control),-3.6110430000000005e-17,-1.451433e-16,-6.447078e-18,0.0
ADE (treated),-3.5927200000000004e-17,-1.428095e-16,-6.734708e-18,0.0
Total effect,0.001299882,-0.1263222,0.123353,0.974
Prop. mediated (control),1.0,1.0,1.0,0.0
Prop. mediated (treated),1.0,1.0,1.0,0.0
ACME (average),0.001299882,-0.1263222,0.123353,0.974
ADE (average),-3.601881e-17,-1.435026e-16,-6.7303250000000004e-18,0.0
Prop. mediated (average),1.0,1.0,1.0,0.0


### Reflection
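As in the example, the Average Causal Mediated Effect (ACME) is not distinguishable from zero (estimate 0.0013, 95% CI from -0.126 to 0.123, p = 0.974), so there is no evidence of a causal link running through the first response. The ADE is reported as significant, but its estimate is on the order of 1e-17, which is numerically zero, and the proportion mediated is exactly 1.0. Both are symptoms of the outcome model as written, which uses `agreement_r1` as both the outcome and a predictor; the intended outcome is presumably `agreement_r2`. Note also that the exposure passed to `Mediation` is `length_q`, so these estimates concern quote length rather than quote sentiment. The ADE should therefore not be read as evidence of a direct effect.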

The lack of a causal link could be because respondents don't necessarily respond to previous comments; instead, each respondent could be independently responding to the original quote itself. We would need a data structure of responses to responses to quotes (like on Twitter/X, where you can respond to other responses) in order to see if such a mediated effect exists.

Current data structure:<br>
```plain text
original quote
|__ response 1
|__ response 2
```

Ideal data structure for mediation analysis:<br>
```plain text
original quote
|__ response 1
|  |__ response 1.1
|  |__ response 1.2
|__ response 2
```
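
A mediation setup needs three distinct variables: an exposure, a mediator, and an outcome that is different from the mediator. The sketch below illustrates this on synthetic data using the product-of-coefficients approach for linear models, which is a simpler stand-in for the simulation-based `Mediation` class used above, not a reproduction of it. The path strengths (0.5, 0.8, 0.3) and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000

# Hypothetical chain: exposure -> mediator -> outcome, plus a direct path.
exposure = rng.normal(size=n)
mediator = 0.5 * exposure + rng.normal(size=n)                  # a = 0.5
outcome = 0.3 * exposure + 0.8 * mediator + rng.normal(size=n)  # direct = 0.3, b = 0.8


def ols(columns, target):
    """Return least-squares coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


a = ols([exposure], mediator)[1]                   # mediator model: exposure -> mediator
_, direct, b = ols([exposure, mediator], outcome)  # outcome model includes both predictors
indirect = a * b                                   # mediated (indirect) effect
total = ols([exposure], outcome)[1]                # total effect of exposure on outcome

print(f"indirect={indirect:.2f}, direct={direct:.2f}, total={total:.2f}")
```

In a linear model without interactions, the total effect equals the direct effect plus the indirect effect, which gives a quick sanity check on any mediation output.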


## *Exercise 5*
Pick one other paper on causal inference with text from the ["Papers about Causal Inference and Language" GitHub repository](https://github.com/causaltext/causal-text-papers). Write at least three sentences summarizing the paper and its logic of design in your own words.
    
***Stretch*** (not required): Skim a few more papers. The causal world is your textual oyster!

### Article
Veitch, Victor, Dhanya Sridhar, and David M. Blei. 2020. "Adapting Text Embeddings for Causal Inference." *Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)*, vol. 124. [https://arxiv.org/pdf/1905.12741.pdf](https://arxiv.org/pdf/1905.12741.pdf). 

### Summary
This research article introduces a method to understand the causal impact of certain features in text documents, such as the inclusion of a theorem in a paper or mentioning an author's gender in a post, on outcomes like paper acceptance or post popularity. It tackles the problem of texts being too complex and high-dimensional for traditional causal inference methods by developing "causally sufficient embeddings". These are low-dimensional representations of documents that maintain crucial information for identifying causal effects while disregarding irrelevant details. The method combines language modeling techniques with supervised dimensionality reduction, focusing only on text aspects predictive of both the intervention (like adding a theorem) and the outcome (such as paper acceptance). The approach is validated through semi-synthetic datasets, showing improvements in causal estimation, and the article discusses potential future improvements and the challenges in assessing the assumptions behind these black box models.