# Difference-in-differences Regression

## Setup
First, import the Python modules and libraries we need.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

## 1.1 Simple Regression
*Load the data and regress (log) ratings of each show onto the (log) number of tweets per
episode. Do you think this regression gives you the causal effect of tweets on show viewership? If not, do
you think your estimate will be biased upwards or downwards?*

First, we load the relevant data into our dataframe.

In [12]:
weibo_data = pd.read_csv('weibo_data.csv')
weibo_data

| Unnamed: 0 | location | show_id | episode_num | censor_dummy | log_rating | log_tweet | av_tweets | day_id | mainland_dummy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mainland China | 1 | 1 | 1 | 0.475764 | 0.000000 | 3.692308 | 33 | 1 |
| 1 | Mainland China | 1 | 2 | 0 | 0.468479 | 0.000000 | 3.692308 | 34 | 1 |
| 2 | Mainland China | 1 | 3 | 0 | 0.581327 | 1.386294 | 3.692308 | 35 | 1 |
| 3 | Mainland China | 1 | 4 | 0 | 0.547851 | 0.000000 | 3.692308 | 36 | 1 |
| 4 | Mainland China | 1 | 5 | 0 | 0.483728 | 1.386294 | 3.692308 | 37 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11422 | hongkong | 342 | 33 | 0 | 0.476110 |  | 0.000000 | 51 | 0 |
| 11423 | hongkong | 342 | 34 | 0 | 0.432756 |  | 0.000000 | 54 | 0 |
| 11424 | hongkong | 342 | 35 | 0 | 0.303211 |  | 0.000000 | 55 | 0 |
| 11425 | hongkong | 342 | 36 | 0 | 0.436511 |  | 0.000000 | 56 | 0 |


Next, let's regress the (log) ratings of each show onto the (log) number of tweets per episode.

In [3]:
mod_simple = smf.ols(formula = 'log_rating ~ log_tweet',data = weibo_data)
res_simple = mod_simple.fit()
print(res_simple.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.111
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     987.9
Date:                Fri, 11 Feb 2022   Prob (F-statistic):          2.00e-204
Time:                        14:05:10   Log-Likelihood:                -87.734
No. Observations:                7899   AIC:                             179.5
Df Residuals:                    7897   BIC:                             193.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2664      0.003     81.566      0.0

This regression does not identify the causal effect of tweets on show viewership. A causal interpretation requires `log_tweet` to be uncorrelated with every other determinant of a show's rating, and that is unlikely here: omitted factors such as a show's underlying popularity plausibly drive both tweet volume and ratings, and higher ratings may themselves generate more tweets.

The direction of the bias depends on two signs: how the omitted variable affects `log_rating`, and how it correlates with `log_tweet`. If an omitted variable raises ratings and is positively correlated with tweets (or lowers ratings and is negatively correlated with tweets), the coefficient on `log_tweet` is biased upward, and adding that variable as a control would lower it. If the two signs differ, the coefficient is biased downward. Because more popular shows likely attract both more tweets and higher ratings, upward bias seems the more likely case.
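The direction of omitted-variable bias can be illustrated with a small simulation. The sketch below does not use the Weibo data: it assumes a hypothetical unobserved `popularity` variable and made-up coefficients, and compares the slope on `log_tweet` with and without that control.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical unobserved confounder (not a column in weibo_data).
popularity = rng.normal(size=n)

# Assumed data-generating process: popularity raises both tweets and ratings.
log_tweet = popularity + rng.normal(size=n)
true_effect = 0.05  # assumed causal effect of log_tweet on log_rating
log_rating = (true_effect * log_tweet + 0.30 * popularity
              + rng.normal(scale=0.1, size=n))


def ols(y, *regressors):
    """Return OLS coefficients, intercept first."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


short_fit = ols(log_rating, log_tweet)              # popularity omitted
long_fit = ols(log_rating, log_tweet, popularity)   # popularity controlled for

print(f"true effect:            {true_effect:.3f}")
print(f"popularity omitted:     {short_fit[1]:.3f}")
print(f"popularity included:    {long_fit[1]:.3f}")
```

Under these assumptions the short regression converges to 0.05 + 0.30 × Cov(popularity, log_tweet) / Var(log_tweet) = 0.05 + 0.30 × 1/2 = 0.20, four times the assumed true effect.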

## 1.2 Geographic Diff-in-diff

### Part a)

First, let's obtain data from mainland China only.

In [9]:
weibo_data_main = weibo_data[weibo_data['mainland_dummy'] == 1]
weibo_data_main

| Unnamed: 0 | location | show_id | episode_num | censor_dummy | log_rating | log_tweet | av_tweets | day_id | mainland_dummy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mainland China | 1 | 1 | 1 | 0.475764 | 0.000000 | 3.692308 | 33 | 1 |
| 1 | Mainland China | 1 | 2 | 0 | 0.468479 | 0.000000 | 3.692308 | 34 | 1 |
| 2 | Mainland China | 1 | 3 | 0 | 0.581327 | 1.386294 | 3.692308 | 35 | 1 |
| 3 | Mainland China | 1 | 4 | 0 | 0.547851 | 0.000000 | 3.692308 | 36 | 1 |
| 4 | Mainland China | 1 | 5 | 0 | 0.483728 | 1.386294 | 3.692308 | 37 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7894 | Mainland China | 193 | 57 | 0 | 0.232501 | 1.791759 | 11.517241 | 57 | 1 |
| 7895 | Mainland China | 193 | 58 | 0 | 0.231280 | 2.564949 | 11.517241 | 58 | 1 |
| 7896 | Mainland China | 193 | 59 | 0 | 0.262297 | 2.079442 | 11.517241 | 59 | 1 |
| 7897 | Mainland China | 193 | 60 | 0 | 0.202024 | 2.772589 | 11.517241 | 60 | 1 |


Next, let's run a regression of episode-level (log) ratings on the censorship dummy, with show fixed effects and episode-number fixed effects.

In [22]:
mod_main = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id) + C(episode_num)',data = weibo_data)
res_main = mod_main.fit()
print(res_main.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.965
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     576.4
Date:                Fri, 11 Feb 2022   Prob (F-statistic):               0.00
Time:                        16:00:39   Log-Likelihood:                 9190.0
No. Observations:               11427   AIC:                        -1.733e+04
Df Residuals:                   10904   BIC:                        -1.349e+04
Df Model:                         522                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 0.56

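The coefficient on the censorship dummy is -0.0059, which indicates that censorship is associated with lower ratings in this regression. The sign is what we would expect: censorship reduced the number of tweets on the affected days, and fewer tweets should translate into lower ratings. One caveat is that the cell above passes `data = weibo_data`, and the summary reports 11,427 observations, so this estimate (like the one in part b) below) is based on the full sample of mainland China and Hong Kong rather than on `weibo_data_main`.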

### Part b)

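We need to control for show fixed effects because they absorb every show-level characteristic that is constant across episodes, such as genre or baseline popularity, whether or not we observe it. Without them, the censorship coefficient would also pick up differences in average ratings between shows that aired more episodes during the censorship period and shows that aired fewer. To see how much this matters, we rerun the regression without any fixed effects.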

In [18]:
mod_main_1 = smf.ols(formula = 'log_rating ~ censor_dummy',data = weibo_data)
res_main_1 = mod_main_1.fit()
print(res_main_1.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.02322
Date:                Fri, 11 Feb 2022   Prob (F-statistic):              0.879
Time:                        15:38:16   Log-Likelihood:                -9968.2
No. Observations:               11427   AIC:                         1.994e+04
Df Residuals:                   11425   BIC:                         1.996e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.4500      0.006     80.365   

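As seen above, once we drop the show and episode-number fixed effects, the coefficient on the censorship dummy moves from -0.0059 to -0.0034. With a single regressor, the Prob (F-statistic) of 0.879 also tells us that this coefficient is not statistically distinguishable from zero. Therefore, without controlling for show fixed effects, we would have underestimated the magnitude of the negative effect of censorship on show ratings.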
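The role of show fixed effects can be sketched on synthetic data. The example below is an illustration under assumed values, not a reanalysis of the Weibo data: a hypothetical time-invariant `quality` trait raises a show's ratings and, by construction, lowers the chance that its episodes fall on censored days. It compares pooled OLS, the within (demeaned) estimator, and explicit show dummies of the kind `C(show_id)` generates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_shows, n_eps = 50, 40
show = np.repeat(np.arange(n_shows), n_eps)   # show index for every episode

# Hypothetical time-invariant show trait. In this toy setup, higher-quality
# shows have higher ratings and are less likely to air on a censored day.
quality = rng.normal(size=n_shows)
p_censor = 1 / (1 + np.exp(quality[show]))
censor = (rng.random(show.size) < p_censor).astype(float)

true_effect = -0.02                            # assumed, for illustration
log_rating = (0.4 + 0.2 * quality[show] + true_effect * censor
              + rng.normal(scale=0.03, size=show.size))


def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def demean_by_show(v):
    """Subtract each show's own mean (the 'within' transformation)."""
    means = np.bincount(show, weights=v) / np.bincount(show)
    return v - means[show]


pooled = slope(log_rating, censor)                                   # no fixed effects
within = slope(demean_by_show(log_rating), demean_by_show(censor))   # show fixed effects

# Same coefficient from one dummy column per show.
dummies = (show[:, None] == np.arange(n_shows)).astype(float)
X = np.column_stack([censor, dummies])
lsdv = float(np.linalg.lstsq(X, log_rating, rcond=None)[0][0])

print(f"true effect: {true_effect:.4f}")
print(f"pooled OLS:  {pooled:.4f}")
print(f"within:      {within:.4f}")
print(f"show dummies:{lsdv:.4f}")
```

The within estimator and the dummy-variable regression give the same coefficient, while the pooled estimate also absorbs the between-show difference in `quality`. The direction and size of that distortion depend entirely on the assumed data-generating process.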

### Part c)

Below, we obtain data only from Hong Kong.

In [16]:
weibo_data_hk = weibo_data[weibo_data['mainland_dummy'] == 0]
weibo_data_hk

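| Unnamed: 0 | location | show_id | episode_num | censor_dummy | log_rating | log_tweet | av_tweets | day_id | mainland_dummy |
|---|---|---|---|---|---|---|---|---|---|
| 7899 | hongkong | 194 | 1 | 0 | 0.087186 |  | 0.0 | 1 | 0 |
| 7900 | hongkong | 194 | 2 | 0 | 0.254720 |  | 0.0 | 2 | 0 |
| 7901 | hongkong | 194 | 3 | 0 | 0.156063 |  | 0.0 | 3 | 0 |
| 7902 | hongkong | 194 | 4 | 0 | 0.179484 |  | 0.0 | 5 | 0 |
| 7903 | hongkong | 194 | 5 | 0 | 0.184153 |  | 0.0 | 6 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11422 | hongkong | 342 | 33 | 0 | 0.476110 |  | 0.0 | 51 | 0 |
| 11423 | hongkong | 342 | 34 | 0 | 0.432756 |  | 0.0 | 54 | 0 |
| 11424 | hongkong | 342 | 35 | 0 | 0.303211 |  | 0.0 | 55 | 0 |
| 11425 | hongkong | 342 | 36 | 0 | 0.436511 |  | 0.0 | 56 | 0 |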


Next, let's run a regression of episode-level (log) ratings on the censorship dummy, with show fixed effects and episode-number fixed effects.

In [23]:
mod_hk = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id) + C(episode_num)',data = weibo_data_hk)
res_hk = mod_hk.fit()
print(res_hk.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.977
Model:                            OLS   Adj. R-squared:                  0.975
Method:                 Least Squares   F-statistic:                     510.3
Date:                Fri, 11 Feb 2022   Prob (F-statistic):               0.00
Time:                        16:01:12   Log-Likelihood:                 2000.8
No. Observations:                3528   AIC:                            -3460.
Df Residuals:                    3257   BIC:                            -1788.
Df Model:                         270                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 0.19

The coefficient on the censorship dummy is 0.0148, which is positive, but its p-value is relatively high, so we cannot reject the hypothesis that censorship had no effect on ratings in Hong Kong. This is what we would expect: the censorship applied only to the Sina Weibo platform, which has little presence in Hong Kong. Tweet activity there should therefore not have been affected, and neither should ratings.

### Part d)

In [24]:
DinD_reg_geo = smf.ols(formula = 'log_rating ~ mainland_dummy + censor_dummy + C(show_id) + C(episode_num) + mainland_dummy:censor_dummy', data = weibo_data).fit()
print(DinD_reg_geo.summary())
print("mainland_dummy:censor_dummy:", DinD_reg_geo.params["mainland_dummy:censor_dummy"])

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.965
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     575.4
Date:                Fri, 11 Feb 2022   Prob (F-statistic):               0.00
Time:                        16:01:47   Log-Likelihood:                 9192.2
No. Observations:               11427   AIC:                        -1.734e+04
Df Residuals:                   10903   BIC:                        -1.349e+04
Df Model:                         523                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

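The coefficient on the interaction term `mainland_dummy:censor_dummy` is the difference-in-differences estimate: the change in ratings under censorship in the treatment group (mainland China) minus the corresponding change in the control group (Hong Kong). The estimate is negative, which indicates that censorship lowered ratings in mainland China relative to Hong Kong. Note that Df Model rises by only one (from 522 to 523) even though the formula adds two terms. This is consistent with `mainland_dummy` being collinear with the show fixed effects, since each show airs in only one location.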
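The logic of the interaction term can be checked on a textbook two-group, two-period example. The sketch below uses synthetic data with assumed effect sizes and no fixed effects, so it is simpler than the regression above. It shows that the interaction coefficient in a saturated model equals the difference of the two group-level differences in means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
mainland = rng.integers(0, 2, size=n).astype(float)   # 1 = treatment group
censor = rng.integers(0, 2, size=n).astype(float)     # 1 = censorship period

true_did = -0.03   # assumed treatment effect, for illustration only
log_rating = (0.20 + 0.10 * mainland + 0.015 * censor
              + true_did * mainland * censor
              + rng.normal(scale=0.03, size=n))


def cell_mean(m, c):
    """Mean log rating for one (group, period) cell."""
    return log_rating[(mainland == m) & (censor == c)].mean()


# Difference-in-differences computed directly from the four cell means.
did_means = ((cell_mean(1, 1) - cell_mean(1, 0))
             - (cell_mean(0, 1) - cell_mean(0, 0)))

# The same quantity as the interaction coefficient in OLS.
X = np.column_stack([np.ones(n), mainland, censor, mainland * censor])
beta = np.linalg.lstsq(X, log_rating, rcond=None)[0]
did_ols = beta[3]

print(f"DiD from cell means:      {did_means:.4f}")
print(f"interaction coefficient:  {did_ols:.4f}")
```

The two numbers agree up to floating-point error because the model with both main effects and the interaction fits each of the four cell means exactly.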

## 1.3 Across-show Diff-in-diff

Below, we have observations from shows in mainland China only.

In [28]:
weibo_data_main

| Unnamed: 0 | location | show_id | episode_num | censor_dummy | log_rating | log_tweet | av_tweets | day_id | mainland_dummy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mainland China | 1 | 1 | 1 | 0.475764 | 0.000000 | 3.692308 | 33 | 1 |
| 1 | Mainland China | 1 | 2 | 0 | 0.468479 | 0.000000 | 3.692308 | 34 | 1 |
| 2 | Mainland China | 1 | 3 | 0 | 0.581327 | 1.386294 | 3.692308 | 35 | 1 |
| 3 | Mainland China | 1 | 4 | 0 | 0.547851 | 0.000000 | 3.692308 | 36 | 1 |
| 4 | Mainland China | 1 | 5 | 0 | 0.483728 | 1.386294 | 3.692308 | 37 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7894 | Mainland China | 193 | 57 | 0 | 0.232501 | 1.791759 | 11.517241 | 57 | 1 |
| 7895 | Mainland China | 193 | 58 | 0 | 0.231280 | 2.564949 | 11.517241 | 58 | 1 |
| 7896 | Mainland China | 193 | 59 | 0 | 0.262297 | 2.079442 | 11.517241 | 59 | 1 |
| 7897 | Mainland China | 193 | 60 | 0 | 0.202024 | 2.772589 | 11.517241 | 60 | 1 |


In addition, we generate dummy variables for three activity levels, based on each show's average number of tweets per episode (`av_tweets`).

In [50]:
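weibo_data_main = weibo_data_main.copy()  # explicit copy avoids SettingWithCopyWarning on the filtered frame
# Mutually exclusive activity bins based on each show's average tweets per episode.
weibo_data_main["fewer_than_5"] = (weibo_data_main.av_tweets < 5).astype(int)
# Despite its name, at_least_5 covers only 5 <= av_tweets < 100.
weibo_data_main["at_least_5"] = ((weibo_data_main.av_tweets >= 5) & (weibo_data_main.av_tweets < 100)).astype(int)
weibo_data_main["at_least_100"] = (weibo_data_main.av_tweets >= 100).astype(int)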

weibo_data_main

| Unnamed: 0 | location | show_id | episode_num | censor_dummy | log_rating | log_tweet | av_tweets | day_id | mainland_dummy | fewer_than_5 | at_least_5 | at_least_100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mainland China | 1 | 1 | 1 | 0.475764 | 0.000000 | 3.692308 | 33 | 1 | 1 | 0 | 0 |
| 1 | Mainland China | 1 | 2 | 0 | 0.468479 | 0.000000 | 3.692308 | 34 | 1 | 1 | 0 | 0 |
| 2 | Mainland China | 1 | 3 | 0 | 0.581327 | 1.386294 | 3.692308 | 35 | 1 | 1 | 0 | 0 |
| 3 | Mainland China | 1 | 4 | 0 | 0.547851 | 0.000000 | 3.692308 | 36 | 1 | 1 | 0 | 0 |
| 4 | Mainland China | 1 | 5 | 0 | 0.483728 | 1.386294 | 3.692308 | 37 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7894 | Mainland China | 193 | 57 | 0 | 0.232501 | 1.791759 | 11.517241 | 57 | 1 | 0 | 1 | 0 |
| 7895 | Mainland China | 193 | 58 | 0 | 0.231280 | 2.564949 | 11.517241 | 58 | 1 | 0 | 1 | 0 |
| 7896 | Mainland China | 193 | 59 | 0 | 0.262297 | 2.079442 | 11.517241 | 59 | 1 | 0 | 1 | 0 |
| 7897 | Mainland China | 193 | 60 | 0 | 0.202024 | 2.772589 | 11.517241 | 60 | 1 | 0 | 1 | 0 |


### Part a)

Regression for shows with fewer than 5 tweets per episode:

In [55]:
weibo_data_main_1 = weibo_data_main[weibo_data_main['fewer_than_5'] == 1]

mod_av_1 = smf.ols(formula = 'log_rating ~ censor_dummy',data = weibo_data_main_1)
res_av_1 = mod_av_1.fit()
print(res_av_1.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     6.461
Date:                Fri, 11 Feb 2022   Prob (F-statistic):             0.0111
Time:                        18:15:27   Log-Likelihood:                 978.52
No. Observations:                3405   AIC:                            -1953.
Df Residuals:                    3403   BIC:                            -1941.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.2473      0.003     76.823   

Regression for shows with at least 5 but fewer than 100 tweets per episode:

In [56]:
weibo_data_main_2 = weibo_data_main[weibo_data_main['at_least_5'] == 1]

mod_av_2 = smf.ols(formula = 'log_rating ~ censor_dummy',data = weibo_data_main_2)
res_av_2 = mod_av_2.fit()
print(res_av_2.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.386
Date:                Fri, 11 Feb 2022   Prob (F-statistic):              0.239
Time:                        18:16:36   Log-Likelihood:                -302.06
No. Observations:                2945   AIC:                             608.1
Df Residuals:                    2943   BIC:                             620.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.3172      0.005     62.080   

Regression for shows with at least 100 tweets per episode:

In [57]:
weibo_data_main_3 = weibo_data_main[weibo_data_main['at_least_100'] == 1]

mod_av_3 = smf.ols(formula = 'log_rating ~ censor_dummy',data = weibo_data_main_3)
res_av_3 = mod_av_3.fit()
print(res_av_3.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                   0.06752
Date:                Fri, 11 Feb 2022   Prob (F-statistic):              0.795
Time:                        18:18:04   Log-Likelihood:                -380.47
No. Observations:                1549   AIC:                             764.9
Df Residuals:                    1547   BIC:                             775.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.4865      0.008     59.247   

Across the three regressions, the coefficient on the censorship dummy falls as the average number of tweets per episode rises. In other words, censorship appears to reduce ratings more for shows with more tweet activity, while it seems to have the opposite effect for shows with little tweet activity. Only the estimate for the lowest-activity group is statistically significant at the 5% level (Prob (F-statistic) of 0.0111, against 0.239 and 0.795 for the other two groups), so the comparison should be read with caution.

### Part b)

Now, we run the diff-in-diff regression below for the three sets of shows with three different activity levels.

In [70]:
DinD_reg_three = smf.ols(formula = 'log_rating ~ censor_dummy + fewer_than_5:censor_dummy + at_least_5:censor_dummy + at_least_100:censor_dummy', data = weibo_data_main).fit()
print(DinD_reg_three.summary())
print("fewer_than_5:censor_dummy:", DinD_reg_three.params["fewer_than_5:censor_dummy"])
print("at_least_5:censor_dummy:", DinD_reg_three.params["at_least_5:censor_dummy"])
print("at_least_100:censor_dummy:", DinD_reg_three.params["at_least_100:censor_dummy"])

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     18.29
Date:                Fri, 11 Feb 2022   Prob (F-statistic):           8.01e-12
Time:                        18:56:15   Log-Likelihood:                -525.91
No. Observations:                7899   AIC:                             1060.
Df Residuals:                    7895   BIC:                             1088.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             

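Our coefficients for the interaction terms between `censor_dummy` and `fewer_than_5`, `at_least_5`, and `at_least_100` are -0.0755, -0.0138, and 0.1244, respectively. These results appear to be the opposite of what we saw in part a), where we ran three separate regressions. Two features of the specification explain the difference. First, the three activity dummies are mutually exclusive and exhaustive, so the three interaction terms sum to `censor_dummy`. The regressors are therefore perfectly collinear (Df Model is 3 even though the formula has four terms), and the individual coefficients are not separately identified. Second, the formula omits the main effects of the activity dummies, so all groups share one intercept, whereas each separate regression in part a) had its own (0.2473, 0.3172, and 0.4865). The interaction coefficients therefore also absorb differences in average ratings across the groups, not only the differential effect of censorship.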
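How the specification shapes these coefficients can be explored on synthetic data. The sketch below uses assumed baseline ratings and censorship effects, not the Weibo data. It first checks the rank of a design matrix with `censor_dummy` plus all three interactions, then fits one possible alternative (an assumption, not the assignment's required form) that adds group main effects and treats the lowest-activity group as the baseline.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6_000
group = rng.integers(0, 3, size=n)            # 0: <5, 1: 5 to <100, 2: >=100 tweets
censor = rng.integers(0, 2, size=n).astype(float)
g = (group[:, None] == np.arange(3)).astype(float)   # one dummy column per group

base = np.array([0.25, 0.32, 0.49])           # assumed baseline log ratings
effect = np.array([0.00, -0.01, -0.03])       # assumed censorship effects
log_rating = (base[group] + effect[group] * censor
              + rng.normal(scale=0.03, size=n))

ones = np.ones(n)
inter = g * censor[:, None]                   # group-by-censorship interactions

# Intercept, censor, and all three interactions: the interactions sum to
# censor, so one of the five columns is redundant.
X_collinear = np.column_stack([ones, censor, inter])
rank_collinear = np.linalg.matrix_rank(X_collinear)

# Alternative: group main effects and interactions, baseline group omitted.
# Columns: const, g1, g2, censor, g1:censor, g2:censor
X_alt = np.column_stack([ones, g[:, 1:], censor, inter[:, 1:]])
beta = np.linalg.lstsq(X_alt, log_rating, rcond=None)[0]
did_mid, did_high = beta[4], beta[5]

print(f"columns / rank of collinear design: {X_collinear.shape[1]} / {rank_collinear}")
print(f"censorship effect, baseline group:  {beta[3]:.4f}")
print(f"differential effect, 5 to <100:     {did_mid:.4f}")
print(f"differential effect, >=100:         {did_high:.4f}")
```

In the alternative specification each interaction coefficient measures how much the censorship effect in that group differs from the effect in the baseline group, so it can be compared directly with the differences between the three separate regressions.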

### Part c)

I believe the geographic regression was more informative about the impact of censorship on ratings because it compares a treatment group (mainland China) against a control group (Hong Kong) that was not exposed to the censorship. Differencing out the change in the control group removes shocks common to both regions during the censorship period, which isolates the effect of the treatment more credibly than comparing mainland shows with different activity levels.