In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
data = pd.read_csv('daily_tw_totals_2017.csv')
parties = ["Conservative and Unionist Party", "Labour Party", "Liberal Democrats", "Green Party",
           "UK Independence Party (UKIP)", "Scottish National Party (SNP)"]
data = data[data['Party'].isin(parties)].copy()  # subset to only the major parties; .copy() avoids SettingWithCopyWarning
data['Gender'] = data['Gender'].str.lower()  # data cleaning: make the gender labels consistently lower case

# Lab 2

In this lab we are going to run a difference-in-differences study to test the same question we saw in Lab 1: does being mentioned make you more likely to tweet?

We'll use June 2nd as the treatment date (though feel free to change it if you like), so this is our t2. t1 will be June 1st and t3 will be June 3rd. Our treated group is everyone who was mentioned at least once on t2; our control group is everyone who wasn't. Our outcome variable is the number of tweets sent.

In [None]:
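#this long chunk of code transforms our daily tweet data into the data format we need.
#I will talk through it after the lab

# treatment indicator: was the candidate mentioned at least once on June 2nd (t2)?
data_jun2 = data[data['day']=='2017-06-02'].copy()  # .copy() avoids SettingWithCopyWarning
data_jun2['mentioned'] = data_jun2['mentions'] > 0
data_jun2 = data_jun2[['mentioned', 'tw_screenname']]

# t1: tweets sent on June 1st; the right merge keeps every candidate who has a June 2nd row
before = data[data['day']=='2017-06-01'].copy()
before['time'] = '1.before'
before = pd.merge(before, data_jun2, how='right', on='tw_screenname')
before = before.fillna(value=0)  # candidates with no June 1st row get zero tweets
before.loc[before['time']==0,'time'] = '1.before'  # fillna also set the missing time label to 0; restore it
before = before[['tw_screenname', 'candidate_tweets', 'time', 'mentioned']]

# t3: the same steps for June 3rd
after = data[data['day']=='2017-06-03'].copy()
after['time'] = '2.after'
after = pd.merge(after, data_jun2, how='right', on='tw_screenname')
after = after.fillna(value=0)
after.loc[after['time']==0,'time'] = '2.after'
after = after[['tw_screenname', 'candidate_tweets', 'time', 'mentioned']]

# stack the two periods into one long table: one row per candidate per period
total = pd.concat([before, after])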

Now that we have prepared the data in the right format, we can run our regression. You should know how to run an OLS regression in Python by now! Recall that the equation we want OLS to estimate coefficients for is:

$outcome = \beta_0 + \beta_1 \cdot time + \beta_2 \cdot treatment + \beta_3 \cdot (time \times treatment)$

Run the regression and interpret the findings. 
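
As a rough sketch of the mechanics, the cell below fits the same kind of model on a tiny made-up table rather than on `total`. The `toy` frame and its numbers are invented for illustration; only the column names are borrowed from the data preparation above. In a statsmodels formula, `time * mentioned` expands to both main effects plus their interaction, and the interaction coefficient is the difference-in-differences estimate.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration only (these are not the election tweets).
toy = pd.DataFrame({
    "candidate_tweets": [2, 4, 3, 5, 3, 5, 8, 10],
    "time": ["1.before"] * 4 + ["2.after"] * 4,
    "mentioned": [False, False, True, True] * 2,
})

# `time * mentioned` is shorthand for time + mentioned + time:mentioned.
model = smf.ols("candidate_tweets ~ time * mentioned", data=toy).fit()

# The interaction term is the only coefficient whose name contains a colon.
interaction_name = [name for name in model.params.index if ":" in name][0]
did_estimate = model.params[interaction_name]

print(model.params)
```

On the real data, `model.summary()` also reports standard errors and p-values, which you will need for the interpretation.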

To double-check that the regression worked, calculate the differences and the difference-in-differences manually. One approach is to use the `groupby` and `diff` methods from Lab 1. Note that you can tell `diff` how many rows back to go when differencing, so to go two rows back we would use `.diff(2)`.
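
A minimal sketch of the manual check, again on an invented `toy` table rather than the real data. It assumes the group means come out of `groupby` ordered by time first and treatment second, which is why stepping two rows back lines each group up with itself one period earlier.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "candidate_tweets": [2, 4, 3, 5, 3, 5, 8, 10],
    "time": ["1.before"] * 4 + ["2.after"] * 4,
    "mentioned": [False, False, True, True] * 2,
})

# Mean outcome in each of the four time/treatment cells.
means = toy.groupby(["time", "mentioned"])["candidate_tweets"].mean()

# Row order is (before, False), (before, True), (after, False), (after, True),
# so diff(2) gives the after-minus-before change within each group.
changes = means.diff(2).dropna()

# Difference-in-differences: change in the treated group minus change in the control group.
did_manual = changes.iloc[1] - changes.iloc[0]

print(means)
print(did_manual)
```

If the regression worked, this number should match the interaction coefficient from the same data.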

Produce a graphic (either a line plot or a bar plot) which visualises the before and after values in the treatment and control groups. Hint: pandas is happier plotting 'wide' data, i.e. where each column contains a distinct data series. To get this format from the data produced by groupby you can make use of the `.unstack()` method.
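
One possible way to get from grouped means to a plot, sketched on an invented `toy` table. Passing `"mentioned"` to `.unstack()` moves the treatment indicator into the columns, so each group becomes its own line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "candidate_tweets": [2, 4, 3, 5, 3, 5, 8, 10],
    "time": ["1.before"] * 4 + ["2.after"] * 4,
    "mentioned": [False, False, True, True] * 2,
})

means = toy.groupby(["time", "mentioned"])["candidate_tweets"].mean()

# Long to wide: rows are time periods, one column per treatment group.
wide = means.unstack("mentioned")

ax = wide.plot(kind="line", marker="o")
ax.set_xlabel("time")
ax.set_ylabel("mean tweets sent")
ax.set_title("Before and after, by treatment group (toy data)")
```

Swapping `kind="line"` for `kind="bar"` gives the bar-plot version of the same picture.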

Do a little investigation of the data. How many people were mentioned on June 1st? Break them down by gender and political party. Check week 1 lab 2 if you need some of the commands. Discuss potential causes of the observed differences with a neighbour. 
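
A small sketch of one way to do the breakdown, using an invented single-day table (`toy_day`) whose column names mirror those in the dataset. `pd.crosstab` is just one option here; a `groupby` on both columns followed by `.size()` would give the same counts.

```python
import pandas as pd

# Toy single-day data, invented for illustration only.
toy_day = pd.DataFrame({
    "tw_screenname": ["a", "b", "c", "d", "e"],
    "Party": ["Labour Party", "Labour Party", "Green Party",
              "Green Party", "Liberal Democrats"],
    "Gender": ["female", "male", "female", "male", "female"],
    "mentions": [3, 0, 1, 2, 0],
})

# Keep only the accounts that were mentioned at least once that day.
mentioned = toy_day[toy_day["mentions"] > 0]
n_mentioned = len(mentioned)

# Count the mentioned accounts in each party/gender combination.
by_group = pd.crosstab(mentioned["Party"], mentioned["Gender"])

print(n_mentioned)
print(by_group)
```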

Try the same DID regression on another date. Do you get the same results?

(Optional) Write a loop which performs the regression on all possible dates and counts the number of positive, negative and statistically insignificant coefficients. When we look at all possible regressions, what conclusion do we come to?
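
One way the loop could be structured, sketched on a randomly generated panel rather than the real data. The helper `did_for_date`, the synthetic `panel` and the 0.05 significance threshold are all assumptions made for this illustration. The helper is a simplified version of the data preparation above, and each treatment date needs a day on either side, so the first and last dates are skipped.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel, invented for illustration: 40 accounts observed over 6 days.
rng = np.random.default_rng(0)
days = list(pd.date_range("2017-06-01", periods=6).strftime("%Y-%m-%d"))
panel = pd.DataFrame(
    [(name, day) for name in range(40) for day in days],
    columns=["tw_screenname", "day"],
)
panel["mentions"] = rng.poisson(0.7, len(panel))
panel["candidate_tweets"] = rng.poisson(3, len(panel))

def did_for_date(df, all_days, i):
    """Hypothetical helper: DiD estimate and p-value for treatment day all_days[i]."""
    treated = df.loc[df["day"] == all_days[i], ["tw_screenname", "mentions"]]
    treated = treated.assign(mentioned=treated["mentions"] > 0)
    treated = treated[["tw_screenname", "mentioned"]]
    frames = []
    for label, day in [("1.before", all_days[i - 1]), ("2.after", all_days[i + 1])]:
        part = df.loc[df["day"] == day, ["tw_screenname", "candidate_tweets"]]
        part = part.merge(treated, on="tw_screenname", how="right")
        part["candidate_tweets"] = part["candidate_tweets"].fillna(0)
        part["time"] = label
        frames.append(part)
    total = pd.concat(frames)
    fit = smf.ols("candidate_tweets ~ time * mentioned", data=total).fit()
    name = [n for n in fit.params.index if ":" in n][0]  # the interaction term
    return fit.params[name], fit.pvalues[name]

counts = {"positive": 0, "negative": 0, "insignificant": 0}
for i in range(1, len(days) - 1):  # skip the first and last day
    coef, p_value = did_for_date(panel, days, i)
    if p_value >= 0.05:  # assumed significance threshold
        counts["insignificant"] += 1
    elif coef > 0:
        counts["positive"] += 1
    else:
        counts["negative"] += 1

print(counts)
```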

NB: a more advanced version of this would be to then combine all the studies in a meta-analysis. This isn't covered on this course but would be something to think about if you are using this type of approach for your thesis.