# Week 7 - Lab 1 - First Difference Models

In this lab we are going to look at how to implement a first difference model. 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

The data are aggregate daily totals of Twitter activity of candidates during the buildup to the UK's 2017 general election. The data cover a few weeks before the election in the months of May and June. 

In [2]:
data = pd.read_csv('daily_tw_totals_2017.csv')
parties = ["Conservative and Unionist Party","Labour Party","Liberal Democrats","Green Party",
           "UK Independence Party (UKIP)","Scottish National Party (SNP)"]
data = data[data['Party'].isin(parties)].copy()  # subset to only the major parties (copy avoids SettingWithCopyWarning)
data['Gender'] = data['Gender'].str.lower()  # do a bit of data cleaning

The first thing you should do is produce appropriate descriptive statistics for the dataset, as we did in Week 1. 

In [3]:
data.describe()

Unnamed: 0,twid,followers,friends,candidate_tweets,candidate_retweets,mentions,mentions_retweets,replies
count,34761.0,28048.0,28048.0,34761.0,34761.0,34761.0,34761.0,34761.0
mean,6.40884e+16,8876.13,1506.926982,8.388395,61.159978,63.018383,73.621472,1.292454
std,2.206928e+17,40215.83,3187.889384,16.51343,1257.659052,967.025047,1448.223154,4.866654
min,1622761.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,82467930.0,541.0,385.0,1.0,0.0,1.0,0.0,0.0
50%,344737200.0,1497.0,849.0,3.0,2.0,4.0,1.0,0.0
75%,2245759000.0,6901.25,1812.0,9.0,10.0,16.0,9.0,1.0
max,8.634661e+17,1135920.0,83648.0,737.0,102262.0,44541.0,84987.0,501.0


Now we are going to tackle the following research question:

**Does being mentioned on Twitter make you more likely to tweet?**

Our theory is that people are more likely to engage with a platform when other people on the platform engage with them (which seems fairly obvious). Remember that your project should have a question, or a set of questions, along these lines.

We're going to start by running a simple OLS regression on the full dataset, treating it as a cross section (i.e. imagining that all we did was measure the number of tweets sent and the number of mentions received during the campaign period).

To do that we need to aggregate the dataframe at the person level, and sum up all the tweets produced and mentions received during that period. We can do that with the following code.

In [4]:
total_mp_data = data.groupby('tw_screenname').agg({
    'candidate_tweets':'sum', 
    'mentions':'sum'
})

In [5]:
display(total_mp_data.head())

Unnamed: 0_level_0,candidate_tweets,mentions
tw_screenname,Unnamed: 1_level_1,Unnamed: 2_level_1
1tomcorbin,51,138
ABridgen,32,96
ACPayton,105,415
ACunninghamMP,188,390
AGarcarz,28,8


Now, run a regression with mentions as the independent variable and the number of tweets sent as the dependent variable. Check the exercises from Week 3 if you can't remember how to run a regression in Python. Interpret the results: what can we say about the model and our research question on this basis?

In [10]:
result = smf.ols(formula="candidate_tweets ~ mentions", data=total_mp_data).fit()

print(result.params)
print(result.summary())

Intercept    171.115179
mentions       0.000473
dtype: float64
                            OLS Regression Results                            
Dep. Variable:       candidate_tweets   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.489
Date:                Mon, 19 Nov 2018   Prob (F-statistic):              0.223
Time:                        13:39:37   Log-Likelihood:                -11277.
No. Observations:                1698   AIC:                         2.256e+04
Df Residuals:                    1696   BIC:                         2.257e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------

There are strong reasons to be suspicious of this simple OLS approach for the purposes of addressing this research question. What do you think some of the problems are?

Now let's look at how to do a first difference regression. This would allow us to make a stronger causal claim because we can resolve some (but not all) of the problems with the simple OLS design. 
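
To see why differencing helps, here is a minimal sketch of the two-period case. The notation is an assumption introduced for illustration and does not come from the lab materials: $y_{it}$ is the number of tweets sent by candidate $i$ in period $t$, $x_{it}$ is the number of mentions received, $a_i$ collects everything about the candidate that does not change over time, and $\varepsilon_{it}$ is the remaining error.

$$
\begin{aligned}
y_{it} &= \beta x_{it} + a_i + \varepsilon_{it}, \qquad t = 1, 2 \\
y_{i2} - y_{i1} &= \beta (x_{i2} - x_{i1}) + (a_i - a_i) + (\varepsilon_{i2} - \varepsilon_{i1}) \\
\Delta y_i &= \beta \, \Delta x_i + \Delta \varepsilon_i
\end{aligned}
$$

The candidate-specific term $a_i$ cancels, so it no longer needs to be measured. Anything that changes between the two periods stays in $\Delta \varepsilon_i$, which is why the design resolves some, but not all, of the problems.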

The first thing we need to do is produce a new aggregation of our data preserving time periods per person. Let's work at the month level to start off with, so we will produce two observations per person in the dataset, one for each month. 

First we need to split up our date variable to get it at the month level.

In [11]:
#split the 'YYYY-MM-DD' strings into separate year, month and day columns
#(expand=True returns one column per piece). see a good explanation here: 
#https://stackoverflow.com/questions/44866225/pandas-dataframe-splitting-series-strings-into-multiple-columns
data[['y', 'm', 'd']] = data['day'].str.split('-', expand=True)
data['m'] = data['m'].astype(int)

Explore the month variable a little bit. How many observations are there per month? Is Twitter activity generally higher in May (month 5) or in June (month 6)?

In [23]:
display(data.groupby('m').candidate_tweets.sum())

m
5    200583
6     91006
Name: candidate_tweets, dtype: int64

Now let's aggregate activity by person-month. Note that Gender and Party are constant within each group, so any function that selects one of the values (e.g. max, min) would work in this context because they are all the same.

In [24]:
mp_data = data.groupby(['tw_screenname', 'm']).agg({
    'candidate_tweets':'sum', 
    'mentions':'sum', 
    'Gender' : 'max',
    'Party' : 'max'
})
mp_data.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,candidate_tweets,mentions,Gender,Party
tw_screenname,m,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1tomcorbin,5,32,129,male,Labour Party
1tomcorbin,6,19,9,male,Labour Party
ABridgen,5,29,66,male,Conservative and Unionist Party
ABridgen,6,3,30,male,Conservative and Unionist Party
ACPayton,5,60,278,male,Liberal Democrats


This dataset contains a count of tweets and mentions per candidate for each month, i.e. May and June. Check the size of the dataset and compare it to the number of candidates. What do you notice?

In [38]:
print(mp_data.candidate_tweets.sum())
print(mp_data.mentions.sum())
print(len(mp_data.index.value_counts()))

291589
2190582
3375


The number of observations should be twice the number of candidates (i.e. one observation per candidate per month). It isn't, because some candidates had no observed data in one of the two months.

We need to fill in these blank values. We can do this with the following (this is a common problem with time series data; make sure you document all of these kinds of data transformation details for your assignment):

In [39]:
(mps, months) = mp_data.index.levels
new_index = pd.MultiIndex.from_product([mps, months])
mp_data = mp_data.reindex(new_index)
mp_data[['candidate_tweets', 'mentions']] = mp_data[['candidate_tweets', 'mentions']].fillna(0).astype(int)
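
The same steps can be sketched on a toy table, which makes it easier to see what the reindexing does. The names `toy`, `obs_per_person`, `full_index` and `balanced`, and all of the values, are hypothetical and not part of the lab data. Treating a missing row as zero activity is also an assumption you should justify for your own data.

```python
import pandas as pd

# Toy person-month table: "bob" has no row for month 6 (hypothetical data).
toy = pd.DataFrame(
    {"candidate_tweets": [5, 2, 7], "mentions": [10, 4, 3]},
    index=pd.MultiIndex.from_tuples(
        [("alice", 5), ("alice", 6), ("bob", 5)],
        names=["tw_screenname", "m"],
    ),
)

# Count how many months each person has; fewer than 2 means a gap.
obs_per_person = toy.groupby(level="tw_screenname").size()
print(obs_per_person[obs_per_person < 2])

# Build the full person x month grid and reindex onto it.
full_index = pd.MultiIndex.from_product(toy.index.levels, names=toy.index.names)
balanced = toy.reindex(full_index)

# Rows created by the reindex hold NaN; here we treat them as zero activity.
balanced = balanced.fillna(0).astype(int)
print(balanced)
```

Note that in the lab's `mp_data` only the two count columns are filled, so `Gender` and `Party` remain missing on the rows that the reindex adds.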

Now we want to take the differences of tweets and mentions. We can do this with the following:

In [40]:
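#difference only the count columns; Gender and Party are strings and cannot be differenced
mp_data[['candidate_tweets_diffed', 'mentions_diffed']] = mp_data.groupby(level=0)[['candidate_tweets', 'mentions']].diff()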
mp_data.head()

Unnamed: 0,Unnamed: 1,candidate_tweets,mentions,Gender,Party,candidate_tweets_diffed,mentions_diffed
1tomcorbin,5,32,129,male,Labour Party,,
1tomcorbin,6,19,9,male,Labour Party,-13.0,-120.0
ABridgen,5,29,66,male,Conservative and Unionist Party,,
ABridgen,6,3,30,male,Conservative and Unionist Party,-26.0,-36.0
ACPayton,5,60,278,male,Liberal Democrats,,


Now you can run an OLS regression on the differenced data. Run it and interpret the output. What do you find?

In [48]:
result = smf.ols(formula="candidate_tweets_diffed ~ mentions_diffed+Gender+Party", data=mp_data).fit()

print(result.params)
print(result.summary())

Intercept                                 -70.843345
Gender[T.female, transgender]            -102.161062
Gender[T.male]                             -5.015959
Gender[T.non-binary]                     -285.197630
Party[T.Green Party]                       18.041450
Party[T.Labour Party]                       1.006780
Party[T.Liberal Democrats]                 12.606696
Party[T.Scottish National Party (SNP)]     -1.417113
Party[T.UK Independence Party (UKIP)]      71.350258
mentions_diffed                             0.000475
dtype: float64
                               OLS Regression Results                              
Dep. Variable:     candidate_tweets_diffed   R-squared:                       0.025
Model:                                 OLS   Adj. R-squared:                  0.020
Method:                      Least Squares   F-statistic:                     4.742
Date:                     Mon, 19 Nov 2018   Prob (F-statistic):           2.99e-06
Time:                             1

As a next step, add our two time-invariant variables (Gender and Party) to the regression and rerun the model (the cell above shows this version, with both variables already included). How can we interpret these coefficients in a first difference context?

A first difference design also works when there are more than two time periods in the data. In this case, the difference between each pair of successive time periods is taken (i.e. the t3 - t2 and t2 - t1 differences will both be in the dataset). We then also need to account for dependence in the data, because t2 appears in both differences. We can do this using our simple OLS tool, but we may as well switch to a package specifically designed for this type of data, called `linearmodels` (see the reference here: https://bashtage.github.io/linearmodels/doc/panel/index.html; if you prefer to work in R, check the `plm` package).
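
As a rough illustration of successive differencing, the sketch below uses a toy panel with three periods. The column names `person`, `t` and `tweets`, and the values, are hypothetical. It shows that each person contributes T - 1 differences and that the period 2 value enters two of them.

```python
import pandas as pd

# Toy panel: two people observed over three periods (hypothetical values).
panel = pd.DataFrame(
    {
        "person": ["a", "a", "a", "b", "b", "b"],
        "t": [1, 2, 3, 1, 2, 3],
        "tweets": [4, 6, 9, 10, 7, 7],
    }
).set_index(["person", "t"])

# Difference within each person, so that one person's first period is never
# subtracted from another person's last period.
panel["tweets_diffed"] = panel.groupby(level="person")["tweets"].diff()
print(panel)

# Each person loses their first period, leaving T - 1 differences each.
diffs = panel["tweets_diffed"].dropna()
print(diffs)
```

For person `a`, the period 2 value (6) is used in both 6 - 4 and 9 - 6, so the two differences are not independent observations.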

We can bring in a new package using the following: `!pip install linearmodels`


Let's first fit our existing model with the `linearmodels` package.

In [50]:
from linearmodels import FirstDifferenceOLS
mod = FirstDifferenceOLS.from_formula('candidate_tweets ~ mentions', mp_data)
res = mod.fit()
print(res)

                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:       candidate_tweets   R-squared:                        0.0031
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.0066
No. Observations:                1698   R-squared (Within):               0.0031
Date:                Mon, Nov 19 2018   R-squared (Overall):              0.0058
Time:                        14:08:19   Log-likelihood                -1.083e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      5.3313
Entities:                        1698   P-value                           0.0211
Avg Obs:                       2.0000   Distribution:                  F(1,1697)
Min Obs:                       2.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             5.3313
                            

The model is a little different to our previous model. What is the cause of the difference?

Now let's look at the multiple period case. Let's work with differences at the day level. Use the code below to get the data ready.

In [44]:
#convert the 'YYYY-MM-DD' date string to the day of the year (1 to 366)
data['day_of_year'] = pd.to_datetime(data['day'], format="%Y-%m-%d").dt.dayofyear

mp_data_days = data.groupby(['tw_screenname', 'day_of_year']).agg({
    'candidate_tweets':'sum', 
    'mentions':'sum', 
})
(mps, days) = mp_data_days.index.levels
new_index = pd.MultiIndex.from_product([mps, days])
mp_data_days = mp_data_days.reindex(new_index)
mp_data_days = mp_data_days.fillna(0).astype(int)
mp_data_days.head()

Unnamed: 0,Unnamed: 1,candidate_tweets,mentions
1tomcorbin,136,4,13
1tomcorbin,137,3,11
1tomcorbin,138,0,8
1tomcorbin,139,7,19
1tomcorbin,140,0,12


The regression is then given by the following:

In [45]:
from linearmodels import FirstDifferenceOLS
mod = FirstDifferenceOLS.from_formula('candidate_tweets ~ mentions', mp_data_days)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)

                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:       candidate_tweets   R-squared:                        0.0020
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.0065
No. Observations:               39054   R-squared (Within):               0.0010
Date:                Mon, Nov 19 2018   R-squared (Overall):              0.0031
Time:                        13:57:45   Log-likelihood                 -1.65e+05
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      79.158
Entities:                        1698   P-value                           0.0000
Avg Obs:                       24.000   Distribution:                 F(1,39053)
Min Obs:                       24.000                                           
Max Obs:                       24.000   F-statistic (robust):             8.8722
                            

NB: When we have more than two time periods, many people would also recommend using a multilevel design (with either fixed or random effects) where the group is the individual themselves (and indeed there are a couple of different ways of implementing this). A fixed effects design follows a very similar logic to first difference: what you are assessing is how individuals differ from themselves at different time points, rather than how they differ from each other. It is often recommended over first difference because it preserves more data (when you difference data you inevitably lose one time period from the analysis). Fixed effects and first difference produce identical coefficient estimates when there are only two time periods.

In [46]:
from linearmodels import PanelOLS
mod = PanelOLS.from_formula('candidate_tweets ~ mentions + EntityEffects', mp_data)
res = mod.fit()
print(res)

                          PanelOLS Estimation Summary                           
Dep. Variable:       candidate_tweets   R-squared:                        0.0031
Estimator:                   PanelOLS   R-squared (Between):              0.0066
No. Observations:                3396   R-squared (Within):               0.0031
Date:                Mon, Nov 19 2018   R-squared (Overall):              0.0058
Time:                        13:57:50   Log-likelihood                 -1.93e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      5.3313
Entities:                        1698   P-value                           0.0211
Avg Obs:                       2.0000   Distribution:                  F(1,1697)
Min Obs:                       2.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             5.3313
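
The two-period equivalence can also be checked by hand. The sketch below uses simulated data rather than the lab dataset, and every name and parameter value in it (`n`, `alpha`, the true slope of 0.5) is an arbitrary assumption chosen for illustration. It computes the first difference slope and the fixed effects (within) slope directly with NumPy, with no intercept in either model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                      # number of simulated people (arbitrary)
alpha = rng.normal(size=n)   # unobserved, time-invariant person effect

# Two periods of a regressor x and an outcome y with a true slope of 0.5.
# x is correlated with alpha, so pooled OLS would be biased.
x = rng.normal(size=(n, 2)) + alpha[:, None]
y = alpha[:, None] + 0.5 * x + rng.normal(size=(n, 2))

# First difference: regress (y2 - y1) on (x2 - x1) without an intercept.
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
beta_fd = (dx @ dy) / (dx @ dx)

# Fixed effects (within): subtract each person's own mean, then pool.
x_within = (x - x.mean(axis=1, keepdims=True)).ravel()
y_within = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (x_within @ y_within) / (x_within @ x_within)

print(beta_fd, beta_fe)
```

With two periods, each demeaned value is plus or minus half of the difference, so the two slope formulas reduce to the same ratio. With more periods they generally differ.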
                            

You probably noticed that the R-squared for all of these models is pretty poor! Investigate the distribution of the input variables and see if you can find an appropriate transformation that brings them closer to normality. Rerun the first difference regression with this transformation and interpret the results.
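
One candidate transformation for skewed counts is log(1 + x); whether it is appropriate here is for you to judge from the distributions. The sketch below uses a toy table with hypothetical values and names (`toy`, `logged`, `logged_diffed`). It applies the transformation before differencing, because differences can be negative and the logarithm of a negative number is undefined.

```python
import numpy as np
import pandas as pd

# Toy person-month counts (hypothetical values, heavily right-skewed).
toy = pd.DataFrame(
    {"candidate_tweets": [0, 3, 40, 900], "mentions": [1, 0, 250, 12000]},
    index=pd.MultiIndex.from_product(
        [["alice", "bob"], [5, 6]], names=["tw_screenname", "m"]
    ),
)

# log(1 + x) is defined at zero, unlike log(x), so zero counts are kept.
logged = np.log1p(toy)

# Difference the transformed values within each person.
logged_diffed = logged.groupby(level="tw_screenname").diff().dropna()
print(logged_diffed)
```

A difference of logged values is roughly a proportional change, so the coefficient in the resulting regression has a different interpretation from the one in the untransformed model.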

The point of working with the first difference design is to eliminate unmeasured time-invariant effects. What do you think the time-invariant effects might be in this particular case? Are there any hidden time-varying effects we should be concerned about? Discuss with your neighbour.

There are several other variables in the dataset (e.g. followers, retweets). Run another first difference regression using some of these other variables. What do you find?