In [None]:
import glob
import pandas as pd
import numpy as np
import re

# Project idea and description...

The original paper tests whether chilling effects appear in people's behaviour on Wikipedia, that is, whether people become less active in sensitive domains once they know that the government is surveilling them. It concludes that such an effect does exist.

We aim in our extension to see whether those effects extend to Twitter.

Our reasoning is that Twitter is a platform for discussion more than for learning, so people's behaviour there should differ from their behaviour on Wikipedia.
People also tend to be very vocal on Twitter, so we expect them to talk more, not less, about these hot topics as time progresses, especially since the years 2013-2014 saw a lot of terrorism-related activity throughout the world.

# Data acquisition

First, we need to obtain tweets containing the same keywords as those chosen in the original paper. To do so, we use the Python module `Twint` (not to be confused with the Swiss payment method) to retrieve those tweets for our selected dates.
We focus only on tweets in English to match the scope of the original paper.

We need to retrieve tweets for all of the following keywords from Twitter:

* abu sayyaf
* afghanistan
* agro
* al-qaeda
* al-qaeda in the arabian peninsula
* al-qaeda in the islamic maghreb
* al-shabaab
* ammonium nitrate
* attack
* biological weapon
* car bomb
* chemical weapon
* conventional weapon
* dirty bomb
* eco-terrorism
* environmental terrorism
* euskadi ta askatasuna
* extremism
* farc
* fundamentalism
* hamas
* hezbollah
* improvised explosive device
* iran
* iraq
* irish republican army
* islamist
* jihad
* nationalism
* nigeria
* nuclear
* nuclear enrichment
* pakistan
* palestine liberation front
* pirates
* plo
* political radicalism
* recruitment
* somalia
* suicide attack
* suicide bomber
* taliban
* tamil tigers
* tehrik-i-taliban pakistan
* terror
* terrorism
* weapons-grade
* yemen

## Scraping script

Below is the script that we used to retrieve the data with the `twint` module. Because the script was run from the command line, we used the `argparse` module to pass arguments such as the start/end dates and the keyword. The script could then be executed on multiple machines to get the full 47 keyword datasets.

Example command: `python3 scraper.py -q "abu sayyaf"` or, with an end date, `python3 scraper.py -q "abu sayyaf" -e 2012-05-06`

```python
import twint
import argparse

# get arguments from command line
# only the keyword is required, the others have default values
parser = argparse.ArgumentParser()
parser.add_argument('-q', '--query', type=str, required=True)
parser.add_argument('-s', '--start', type=str, default='2012-01-01')
parser.add_argument('-e', '--end', type=str, default='2014-09-01')
args = parser.parse_args()

config = twint.Config()
config.Limit = 5
config.Hide_output = False
config.Lang = "en"
config.Since = args.start
config.Until = args.end
config.Store_csv = True
config.Search = args.query
config.Output = "_".join([args.query, args.start, args.end]) + ".csv"
# make search
print(f'Running search for "{args.query}" between {args.start} and {args.end}.')
twint.run.Search(config)
```
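
Since each keyword needs its own run, the command lines can be generated and dealt out to the available machines. The following is a minimal sketch of that idea, not the procedure we actually followed: `build_commands` and `split_round_robin` are hypothetical helpers, and only the `-q`, `-s` and `-e` flags come from the script above.

```python
import shlex

def build_commands(keywords, start="2012-01-01", end="2014-09-01"):
    """Return one scraper command line per keyword, with the query safely quoted."""
    return [
        f"python3 scraper.py -q {shlex.quote(kw)} -s {start} -e {end}"
        for kw in keywords
    ]

def split_round_robin(items, n_machines):
    """Deal the items out to n_machines lists, one at a time."""
    return [items[i::n_machines] for i in range(n_machines)]

# A few keywords from the list above, split over two (hypothetical) machines
keywords = ["abu sayyaf", "afghanistan", "al-qaeda", "car bomb", "iran"]
for machine, batch in enumerate(split_round_robin(build_commands(keywords), 2)):
    print(f"# machine {machine}")
    print("\n".join(batch))
```

Quoting with `shlex.quote` matters here because many keywords contain spaces.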

We often had to restart the scripts, because the network sometimes went down or the scraper hit an error. Conveniently, the scraped results are saved automatically to the specified output file, so a restart does not throw away 2 days of computing.
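
A restart does not have to begin from scratch. The sketch below is an assumption about how a resume point could be found, not the script we ran: it reads a partial output file and returns the oldest date already scraped, which could then be passed as the new `-e` value. It assumes that the scraper walks backward in time from the end date and that the `date` column holds `YYYY-MM-DD` strings; `oldest_scraped_date` is a hypothetical helper name and the sample rows are invented.

```python
import csv
import io

def oldest_scraped_date(csv_file):
    """Return the earliest value of the `date` column, or None if there are no rows."""
    reader = csv.DictReader(csv_file)
    dates = [row["date"] for row in reader if row.get("date")]
    # ISO-formatted dates sort correctly as plain strings
    return min(dates) if dates else None

# Invented sample standing in for a partially scraped file;
# a real file would be opened with open(path, newline="", encoding="utf-8")
partial = io.StringIO(
    "id,date,tweet\n"
    "1,2014-08-30,a\n"
    "2,2013-02-11,b\n"
    "3,2013-07-01,c\n"
)
resume_until = oldest_scraped_date(partial)
print(f'python3 scraper.py -q "abu sayyaf" -e {resume_until}')
```

Tweets from the boundary day may end up scraped twice, so deduplicating afterwards would be prudent.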

## Monitoring the scraping process

Scraping took a lot of time, and the dataset we collected quickly became huge. Here are some insights and explanations.

## Analysis of the tweets

### Interrupted time series with regression
We will now focus on getting an understanding of the tweet distribution over time, and how the massive revelations of online surveillance in June 2013 might have caused a chilling effect. We will follow the paper's original way of doing the interrupted time series, with regression analysis.

First, let's see what those big files yield by reading one. We need to first find all the archives.

In [None]:
archive_pathnames = glob.glob('./data/*.gz')
print(f"Found {len(archive_pathnames)} archives")

Now, what is in the first archive ?

In [None]:
df = pd.read_csv(archive_pathnames[0])
print(df.shape)
df.head()

We see a lot of information there. What we are interested in, however, is user interaction around topics that users might expect the government to track. We can therefore count the tweets themselves, ignoring their content and metadata, as well as the number of likes and replies. Each of these actions can make a user fear such surveillance. Retweets count too, but because we collect the retweets themselves, they are already included. We will therefore first count the number of tweets per month and add the number of likes and replies to it.
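
To make the counting concrete, here is a toy sketch on three invented tweets. It groups by calendar month with `dt.to_period("M")`, which is a swapped-in equivalent of the grouping used in the cells that follow, and the values are made up for illustration.

```python
import pandas as pd

# Three invented tweets: two in May 2013, one in June 2013
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-05-03", "2013-05-20", "2013-06-07"]),
    "likes_count": [2, 0, 5],
    "replies_count": [1, 0, 3],
})
toy["tweet_count"] = 1  # every row is one tweet

# Sum likes, replies and tweets per calendar month
columns = ["likes_count", "replies_count", "tweet_count"]
monthly = toy.groupby(toy["date"].dt.to_period("M"))[columns].sum()

# One "action" is a tweet, a like or a reply
monthly["user_interactions"] = monthly[columns].sum(axis=1)
print(monthly)
```

May gets 2 tweets, 2 likes and 1 reply, so 5 interactions; June gets 1 tweet, 5 likes and 3 replies, so 9.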

In [None]:
# Keep only interesting columns and sum them all together by month. We set to parse dates so we can group by month
df = pd.read_csv(archive_pathnames[0], usecols=["date", "likes_count", "replies_count"], parse_dates=["date"], lineterminator='\n')
df["tweet_count"] = 1
grouped_df = df.set_index('date').groupby(pd.Grouper(freq='M')).sum()
grouped_df["user_interactions"] = grouped_df["likes_count"] +  grouped_df["tweet_count"] +  grouped_df["replies_count"]

grouped_df.head()

We will want to merge all Twitter keywords into the same dataframe, so we will name the column holding the summed values after the keyword instead of just `user_interactions`. Let's extract the keyword from the archive path.

In [None]:
print(archive_pathnames[0])

In [None]:
# Knowing what the file path looks like, we can extract the name
name = re.search(r"(?<=data/).*?(?=_full)", archive_pathnames[0]).group(0)
name

In [None]:
# Just set this name as the column name and move on to the next archive!
grouped_df.rename(columns={"user_interactions": name}, inplace=True)
grouped_df.head()

Let's now implement a for loop to aggregate all the data.

In [None]:
# print(pd.read_csv(archive_pathnames[0], usecols=["date", "likes_count"]))
monthly_counts = pd.DataFrame([])

for archive_pathname in archive_pathnames:
    print(f"Reading {archive_pathname} file")
    df = pd.read_csv(archive_pathname, usecols=["date", "likes_count", "replies_count"], parse_dates=["date"], lineterminator='\n')
    print(f"Shape is : {df.shape} \n")

    df["tweet_count"] = 1
    df = df.set_index('date').groupby(pd.Grouper(freq='M')).sum()
    df["user_interactions"] = df["likes_count"] +  df["tweet_count"] +  df["replies_count"]
    name = re.search(r"(?<=data/).*?(?=_full)", archive_pathname).group(0)

    df.rename(columns={"user_interactions": name}, inplace=True)
    
    monthly_counts = pd.concat([monthly_counts, df[name]], axis=1)
monthly_counts.index = pd.to_datetime(monthly_counts.index) # Make sure it is datetime

In [None]:
print(f"We have {np.round(monthly_counts.sum().sum()/1000000, 2)} million actions !")

In [None]:
# Temporary to get working faster !
# monthly_counts.to_csv("./monthly_actions.csv")
# monthly_counts = pd.read_csv("./monthly_actions.csv")

We will now have a quick glance at all the values we have.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# define figure
fig = plt.figure(figsize=(20, 16))

# big frame for main labels
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False, pad=60)
plt.grid(False)
plt.xlabel("Months", fontsize=22)
plt.ylabel("Actions", fontsize=22)

# define number of columns and rows of plot
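col = 4
# ceiling division, so that every keyword gets a subplot even if the count is not a multiple of col
row = -(-len(monthly_counts.columns) // col)

# plot all topics (squeeze=False keeps ax two-dimensional even with a single row)
ax = fig.subplots(row, col, sharey=False, sharex=True, squeeze=False)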

for i, article_name in enumerate(monthly_counts):
    axis = ax[i//col, i%col]
    sns.lineplot(data=monthly_counts[f"{article_name}"], ax=axis)
    axis.set_title(article_name)
    plt.setp(axis.get_xticklabels(), rotation=30, horizontalalignment='right')
    axis.set_ylabel("")
fig.tight_layout()

With this graph, we can look for missing values or anything that looks abnormal. TODO: discuss the weird things once we have complete data
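
Missing months can also be listed programmatically instead of being read off the plots. The sketch below is one possible check, assuming a frame shaped like `monthly_counts` (one column per keyword, a datetime index); `missing_months` is a hypothetical helper and the demo values are invented.

```python
import pandas as pd

def missing_months(counts, start="2012-01", end="2014-08"):
    """Map each column to the months in [start, end] for which it has no value."""
    expected = pd.period_range(start, end, freq="M")
    by_month = counts.copy()
    by_month.index = pd.DatetimeIndex(by_month.index).to_period("M")
    by_month = by_month.reindex(expected)  # absent months become NaN rows
    report = {}
    for col in by_month.columns:
        mask = by_month[col].isna().to_numpy()
        if mask.any():
            report[col] = [str(month) for month in by_month.index[mask]]
    return report

# Invented demo: March 2012 is absent for both keywords, February only for "farc"
demo = pd.DataFrame(
    {"iran": [10.0, 12.0, 9.0], "farc": [3.0, None, 4.0]},
    index=pd.to_datetime(["2012-01-31", "2012-02-29", "2012-04-30"]),
)
print(missing_months(demo, "2012-01", "2012-04"))
```

Only months without any value are reported; a month whose count is zero would need a separate check.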

The original article studies the 32-month period from January 2012 to the end of August 2014. We will therefore restrict our data to the same period.

In [None]:
studied_article_actions = monthly_counts["2012-01-01":"2014-08-31"]

And now we will create the interrupted time series plot, without regression first :

In [None]:
import numpy as np
all_actions = pd.DataFrame(studied_article_actions.sum(axis=1), columns=["actions"])
all_actions["month_nb"] = range(1, 33)

after_revelations_month = 17 # Positional index of June 2013 (month 18): the first 17 rows, January 2012 to May 2013, precede the revelations

# define figure
fig = plt.figure(figsize=(10, 5))

# big frame for main labels
plt.title("Monthly actions on all topics")
plt.xlabel("Months")
plt.ylabel("Total Actions")
plt.xticks(np.arange(0, 33, 2.0))
plt.scatter(x=all_actions["month_nb"], y=all_actions.actions)
plt.axvline(after_revelations_month+0.5, color='orange', label='Start of June 2013') # Vertical line between May (month 17) and June 2013 (month 18)
plt.legend()
plt.show()

Now we will do the regression, to compare the trend before and after the studied interruption.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Do linear regression with ols
mod_before = smf.ols(formula='actions ~ month_nb',
              data=all_actions[:after_revelations_month])
res_before = mod_before.fit()

mod_after = smf.ols(formula='actions ~ month_nb',
              data=all_actions[after_revelations_month:])
res_after = mod_after.fit()

# Access the coefficients by name rather than by position
before_intercept = res_before.params["Intercept"]
before_slope = res_before.params["month_nb"]

after_intercept = res_after.params["Intercept"]
after_slope = res_after.params["month_nb"]

print(f"Before period has intercept={before_intercept} and slope={before_slope}")

print(f"After period has intercept={after_intercept} and slope={after_slope}")

And now, we will plot both regressions, on each side of the interruption :

In [None]:
# define figure
fig = plt.figure(figsize=(10, 7))

# big frame for main labels
plt.title("Interrupted regression of twitter actions across keywords")
plt.xlabel("Months")
plt.ylabel("Total actions")
plt.xticks(np.arange(0, 33, 2.0))
plt.scatter(x=range(1, 33), y=all_actions.actions)
plt.axvline(after_revelations_month+0.5, color='orange', label='Start of June 2013') # Vertical line between May (month 17) and June 2013 (month 18)

# Now we'll add the before period regression line
plt.plot(all_actions[:after_revelations_month].month_nb, all_actions[:after_revelations_month].month_nb*before_slope+before_intercept, label="Trend Pre-June 2013")

# And the after period
plt.plot(all_actions[after_revelations_month:].month_nb, all_actions[after_revelations_month:].month_nb*after_slope+after_intercept, label="Trend Post-June 2013")

plt.legend()
plt.show()
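
The two separate fits can also be written as one segmented regression, which estimates the change in level and the change in slope at the interruption directly. The sketch below is an alternative formulation under stated assumptions, not the analysis run in this notebook: it uses plain NumPy least squares on a synthetic, noise-free series, takes month 18 (June 2013) as the first post-revelation month, and `segmented_fit` is a hypothetical helper name.

```python
import numpy as np

def segmented_fit(t, y, first_post_month):
    """Least-squares fit of y = b0 + b1*t + b2*post + b3*since."""
    t = np.asarray(t, dtype=float)
    post = (t >= first_post_month).astype(float)  # 1 from the interruption onward
    since = post * (t - first_post_month)         # months elapsed since the interruption
    X = np.column_stack([np.ones_like(t), t, post, since])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    names = ["intercept", "trend", "level_change", "trend_change"]
    return dict(zip(names, coefs))

# Synthetic series: upward trend, then a drop of 40 and a slope reduced by 3 from month 18
t = np.arange(1, 33)
y = 100 + 5 * t + np.where(t >= 18, -40 - 3 * (t - 18), 0)

fit = segmented_fit(t, y, first_post_month=18)
print({name: round(float(value), 2) for name, value in fit.items()})
```

On this noise-free series the fit recovers the values used to generate it. With real data, the same model could probably be expressed through the formula interface used above, for example `actions ~ month_nb + post + since` after adding the two extra columns, which would also give standard errors for the level and trend changes.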