# HK PROTESTS: Visualising Chinese State Troll Tweets, Part 2 (Using Scattertext)

Visualising unstructured text is hard. While token frequency charts and tree maps give quick insights into a messy body of text, they are neither particularly exciting visually nor much use when you want to inspect how certain key words were used in the original text.

Enter [Scattertext](https://github.com/JasonKessler/scattertext), which describes itself as a "sexy, interactive" tool for "distinguishing terms in small-to-medium-sized corpora". It has some quirks, such as the hard requirement for a binary category for the text you want to analyse. But otherwise it works great out of the box, including for Chinese text (in Part 4), and the detailed tutorials will enable you to try out a wide range of visualisations.

The interactive features are particularly useful in this case: I can search for a key word in the chart and see which tweets and retweets used that word or phrase. I've uploaded the interactive files to [Dropbox](https://www.dropbox.com/sh/jmb1oy0kak18cwy/AABfHXYoA_P8d6Tw-scNpDVia?dl=0) so you don't have to run this notebook to try out the Scattertext charts.

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import scattertext as st
import spacy
import string

from IPython.display import HTML, IFrame, display
from scipy.stats import rankdata, hmean, norm
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

mpl.rcParams["figure.dpi"] = 300
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
pd.set_option('display.max_columns', 40)

# 1. PRE-WORK
The CSV files are still pretty huge after the initial rounds of pre-processing in Part 1, and will likely bust the file size limits on GitHub. To keep things simple, I opted to repeat the pre-processing steps so that this notebook can be run as a standalone for those who wish to skip Part 1. As long as you've downloaded the two original CSV files from Twitter, the data processing steps below should work fine.

## 1.1 DATA PROCESSING
Download the original CSV files from [Twitter](https://blog.twitter.com/en_us/topics/company/2019/information_operations_directed_at_Hong_Kong.html). 

In [3]:
# Reminder that the raw1/2 CSV files are NOT in this repo. Download directly from Twitter; link above
raw1 = pd.read_csv('../data/china_082019_1_tweets_csv_hashed.csv')
raw2 = pd.read_csv('../data/china_082019_2_tweets_csv_hashed.csv')
raw = pd.concat([raw1, raw2])

In [4]:
# Dropping unnecessary columns
raw = raw.drop(
    columns=[
        "user_profile_url",
        "tweet_client_name",
        "in_reply_to_tweetid",
        "in_reply_to_userid",
        "quoted_tweet_tweetid",
        "is_retweet",
        "retweet_userid",
        "retweet_tweetid",
        "latitude",
        "longitude",
        "quote_count",
        "reply_count",
        "like_count",
        "retweet_count",
        "urls",
        "user_mentions",
        "poll_choices",
        "hashtags",
    ]
)

In [5]:
# Converting timings to HK time, and extracting year-month-day-hour cols
raw['tweet_time'] = pd.to_datetime(raw['tweet_time'])
raw['tweet_time'] = raw['tweet_time'].dt.tz_localize('GMT').dt.tz_convert('Hongkong')
raw['tweet_year'] = raw['tweet_time'].dt.year
raw['tweet_month'] = raw['tweet_time'].dt.month
raw['tweet_day'] = raw['tweet_time'].dt.day
raw['tweet_hour'] = raw['tweet_time'].dt.hour

raw['account_creation_date'] = pd.to_datetime(raw['account_creation_date'], yearfirst=True)
raw['year_of_account_creation'] = raw['account_creation_date'].dt.year
raw['month_of_account_creation'] = raw['account_creation_date'].dt.month
raw['day_of_account_creation'] = raw['account_creation_date'].dt.day
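
A quick sanity check of the timezone step, on made-up timestamps rather than the Twitter data. It assumes, as the cell above does, that the raw `tweet_time` values are naive GMT/UTC timestamps; `Asia/Hong_Kong` is the canonical name for the zone that the `Hongkong` alias points to.

```python
import pandas as pd

# Made-up timestamps for illustration; not taken from the dataset
sample = pd.DataFrame({"tweet_time": ["2019-06-12 18:30", "2019-06-30 23:45"]})

sample["tweet_time"] = pd.to_datetime(sample["tweet_time"])
sample["tweet_time_hk"] = (
    sample["tweet_time"].dt.tz_localize("UTC").dt.tz_convert("Asia/Hong_Kong")
)

# Hong Kong is UTC+8 all year, so evening UTC tweets roll over to the next day
sample["tweet_day_hk"] = sample["tweet_time_hk"].dt.day
sample["tweet_hour_hk"] = sample["tweet_time_hk"].dt.hour
print(sample)
```

The rollover matters for any day-level or hour-level chart: a tweet sent late in the evening GMT belongs to the following morning in Hong Kong.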

In [6]:
# I'll focus only on tweets sent from 2017 onwards
raw = raw[(raw["tweet_year"] >= 2017)].copy()

In [7]:
# In this notebook, I'll focus only on English tweets. Chinese tweets will be dealt with separately
# In earlier drafts, I found that troll accounts with Chinese language settings were sending out English tweets too,
# so provisions were made here to include those 
# Note the sub-categories for Twitter language settings for English and Chinese
raw_eng = raw[
    (raw["tweet_language"] == "en")
    & raw["account_language"].isin(["en", "en-gb", "zh-cn", "zh-CN", "zh-tw"])
].copy()

In [8]:
# Simple function to clean the tweet_text col
def clean_tweet(text):
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation, including "@" and "#"
    return text.strip()  # trim leading/trailing whitespace, including newlines

In [9]:
raw_eng['clean_tweet_text'] = raw_eng['tweet_text'].map(clean_tweet)
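
To see what the cleaning step does to a single tweet, here is a small standalone sketch. The sample tweet and the `clean_text_sketch` helper are made up for illustration; the helper mirrors the intent of `clean_tweet` (lowercase, drop URLs, drop punctuation, trim whitespace) rather than reproducing it line for line.

```python
import re

def clean_text_sketch(text):
    """Hypothetical helper: lowercase, drop URLs and punctuation, trim whitespace."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)   # remove links
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation, incl. "@" and "#"
    return text.strip()

# Made-up tweet, not taken from the dataset
sample_tweet = "RT @XHNews: Hong Kong police respond to #ExtraditionBill protest https://t.co/abc123\n"
cleaned = clean_text_sketch(sample_tweet)
print(cleaned)
```

Because "@" and "#" are stripped, handles and hashtags survive as plain words, which is why terms like "xhnews" and "extraditionbill" show up in the Scattertext output later on.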

## 1.2 DATA FILTERING
As we've seen in Part 1, the dataset is incredibly noisy. I experimented with lightly filtered versions of the dataset and the results were highly unsatisfactory: the key words targeting the HK protests were buried by noise from irrelevant tweets.

So I focused on a set of key words highlighted in Part 1, and further filtered for irrelevant terms which I caught in my earlier drafts.

In [10]:
# Filtering and concating a new subset of tweets with keywords of interest

hk_eng1 = raw_eng[raw_eng['clean_tweet_text'].str.contains("hong kong")].copy()
hk_eng2 = raw_eng[raw_eng['clean_tweet_text'].str.contains("hk")].copy()
hk_eng3 = raw_eng[raw_eng['clean_tweet_text'].str.contains("police")].copy()
hk_eng4 = raw_eng[raw_eng['clean_tweet_text'].str.contains("protest")].copy()
hk_eng5 = raw_eng[raw_eng['clean_tweet_text'].str.contains("china")].copy()
hk_eng6 = raw_eng[raw_eng['clean_tweet_text'].str.contains("interference")].copy()
hk_eng7 = raw_eng[raw_eng['clean_tweet_text'].str.contains("meddling")].copy()
hk_eng8 = raw_eng[raw_eng['clean_tweet_text'].str.contains("ulterior motives")].copy()

hk_eng = pd.concat([hk_eng1, hk_eng2, hk_eng3, hk_eng4, hk_eng5, hk_eng6, hk_eng7, hk_eng8])
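
One thing to be aware of: a tweet that matches more than one keyword (say, both "hong kong" and "police") ends up in `hk_eng` once per match, because the eight subsets are concatenated as-is. If you'd rather count each tweet once, a single regex alternation is one way to do it. The sketch below uses a made-up three-row frame; it is an alternative to the cell above, not what produced the numbers in this notebook.

```python
import re
import pandas as pd

keywords = ["hong kong", "hk", "police", "protest", "china",
            "interference", "meddling", "ulterior motives"]

# Made-up rows for illustration
toy = pd.DataFrame({"clean_tweet_text": [
    "hong kong police fire tear gas",   # matches two keywords
    "great football match tonight",     # matches none
    "stop meddling in china affairs",   # matches two keywords
]})

# One pass: a row is kept once, however many keywords it contains
pattern = "|".join(re.escape(k) for k in keywords)
subset = toy[toy["clean_tweet_text"].str.contains(pattern, regex=True)].copy()

# For comparison: concatenating one subset per keyword keeps one copy per match
stacked = pd.concat(
    [toy[toy["clean_tweet_text"].str.contains(k, regex=False)] for k in keywords]
)
print(len(subset), len(stacked))
```

Calling `drop_duplicates()` on the concatenated frame would be another route to the same result.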

In [11]:
# These words popped up with high frequency in many of my initial Scattertext drafts
# You can add or shorten this list depending on your preference

noise_terms = [
    "guo wengui", "wengui", "guo", "poll", "disneyland", "world cup", "worldcup",
    "fcbayernen", "rogerfederer", "wimbledon", "pew", "mailonline", "football",
    "following", "onthisday", "bhivechat", "roundtrip", "bucket", "jessica",
    "ormsby", "glass bridge", "european tour",
]

# Drop any tweet whose cleaned text contains one of the noise terms
noise_pattern = "|".join(noise_terms)
hk_eng = hk_eng[~hk_eng["clean_tweet_text"].str.contains(noise_pattern)].copy()

# 2. SCATTERTEXT PLOTS

## 2.1 VISUALISING FILTERED ENGLISH TWEETS
The highly filtered set consists of 1,108 "original" tweets and 1,188 retweets - a pretty balanced set - and about 34,000 words. Let's see how they appear on a Scattertext chart.

In [12]:
hk_eng.shape

(2296, 21)

In [13]:
# Creating a new column to classify a tweet as a retweet or actual tweet
# This step is mandatory for the Scattertext plots
hk_eng['tweet_status'] = np.where(hk_eng["tweet_text"].str.startswith("RT @"), "retweet", "tweet")
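
The tweet/retweet split quoted above can be checked with `value_counts()` once the column exists. A minimal sketch on made-up rows (the handles and text are invented):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"tweet_text": [
    "RT @XHNews: made-up retweet text",
    "Made-up original tweet",
    "Another made-up tweet that mentions RT @someone mid-sentence",
]})

# Only text that *starts* with "RT @" is treated as a retweet
toy["tweet_status"] = np.where(
    toy["tweet_text"].str.startswith("RT @"), "retweet", "tweet"
)
counts = toy["tweet_status"].value_counts()
print(counts)
```

Note that the check runs on the raw `tweet_text` column: the cleaned column is lowercased and stripped of "@", so the "RT @" prefix no longer exists there.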

In [14]:
# There are several options for spacy's pre-trained English models
# Go here if you wish to use a different one: https://spacy.io/models/en

nlp = spacy.load("en_core_web_lg") # to install, run: python -m spacy download en_core_web_lg

corpus = (st.CorpusFromPandas(
    hk_eng, 
    category_col="tweet_status", 
    text_col="clean_tweet_text", 
    nlp=nlp
).build().remove_terms(ENGLISH_STOP_WORDS, ignore_absences=True))

In [15]:
# This gives us the terms most associated with tweets and retweets in this subset:

term_freq_df = corpus.get_term_freq_df()
term_freq_df["Tweet Score"] = corpus.get_scaled_f_scores("tweet")
print(
    "Terms most associated with tweets:",
    list(term_freq_df.sort_values(by="Tweet Score", ascending=False).index[:10]),
)

term_freq_df["Retweet Score"] = corpus.get_scaled_f_scores("retweet")
print(
    "Terms most associated with retweets:",
    list(term_freq_df.sort_values(by="Retweet Score", ascending=False).index[:10]),
)


Terms most associated with tweets: ['claim', 'hate crime', 'sterling police', 'sterling', 'hate', 'osullivan', 'police investigate', 'joshua', 'basketball', 'inside the']
Terms most associated with retweets: ['xhnews', 'cgtnofficial', 'echinanews', 'pdchina china', 'sw', 'xinhuatravel', 'xhnews china', 'party', 'pdchina chinas', 'cctv']
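
For intuition about these rankings: as I understand Scattertext's documentation, a scaled F-score rewards terms that are both frequent within a category and mostly confined to it, by scaling each of those two quantities with a normal CDF and taking their harmonic mean. The sketch below is a simplified, one-sided version on made-up counts, using only the standard library. It is not a re-implementation of `get_scaled_f_scores`, which has further options and details.

```python
from statistics import NormalDist, harmonic_mean, mean, pstdev

# Made-up counts: term -> (count in tweets, count in retweets)
counts = {
    "interference": (40, 2),
    "police": (60, 55),
    "xhnews": (1, 80),
    "basketball": (6, 0),
}

def normcdf_scale(values):
    """Map raw values to (0, 1) via a normal CDF fitted to their mean and std."""
    dist = NormalDist(mean(values), pstdev(values))
    return [dist.cdf(v) for v in values]

terms = list(counts)
cat = [counts[t][0] for t in terms]      # tweet-side counts
other = [counts[t][1] for t in terms]    # retweet-side counts

# Precision: share of a term's uses that fall in tweets
precision = [c / (c + o) for c, o in zip(cat, other)]
# Frequency: the term's share of all tweet-side counts
frequency = [c / sum(cat) for c in cat]

scores = {
    t: harmonic_mean([p, f])
    for t, p, f in zip(terms, normcdf_scale(precision), normcdf_scale(frequency))
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A term like "basketball" in this toy table is used only in tweets but is rare overall, so it ranks below "interference", which is both common and concentrated in tweets.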


In [16]:
# Generating a Scattertext plot. Check out the author's repo for other options
# Change the minimum_term_frequency if you wish to filter more aggressively

html = st.produce_scattertext_explorer(
    corpus,
    category="tweet",
    category_name="Tweets",
    not_category_name="Retweets",
    width_in_pixels=1000,
    metadata=hk_eng["user_screen_name"],
    minimum_term_frequency=5,
    show_characteristic=False,
)

interactive = "../output/scatter_eng.html"
with open(interactive, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=interactive, width=1200, height=700)

Download the chart [here](https://www.dropbox.com/sh/jmb1oy0kak18cwy/AABfHXYoA_P8d6Tw-scNpDVia?dl=0)

## HOW TO INTERPRET A SCATTERTEXT CHART:

### Colour
- The words in the chart are coloured by their association: those in blue are more associated with original tweets, those in red with retweets. Each dot corresponds to a word or phrase.

### Positioning: 
- Words nearer the top of the plot represent the most frequently used words in the "original" tweets.

- The further right a dot, the more that word or phrase was used in retweets (e.g. "legislative council").

- Words that appear frequently in both tweets and retweets, like "police" and "china", appear in the upper-right-hand corner.

- Words that aren't often used in either tweets or retweets appear in the bottom-left-hand corner.


### Key Areas:
- Upper-left corner: These words appear frequently in the tweets but not the retweets. We still see a substantial amount of noise here, as evidenced by the appearance of words like "sterling" and "osullivan" near the very top.

- Lower-right corner: Likewise, words that appear frequently in retweets but not in tweets sit in the lower-right corner. The terms here point to the accounts the trolls retweet most often: those owned by Chinese state media outlets.
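
The corners can also be read off the term-frequency table directly. `get_term_freq_df()` returns one `<category> freq` column per category (here `tweet freq` and `retweet freq`). The sketch below uses a made-up table with those column names and a hypothetical cut-off, and it ignores whatever scaling Scattertext applies to the axes, so treat it as a rough guide only.

```python
import pandas as pd

# Made-up counts, shaped like the output of corpus.get_term_freq_df()
toy_freq = pd.DataFrame(
    {"tweet freq": [45, 60, 2, 1], "retweet freq": [1, 58, 70, 2]},
    index=["interference", "police", "xhnews", "roundabout"],
)

cutoff = 10  # hypothetical threshold separating "frequent" from "rare"
often_in_tweets = toy_freq["tweet freq"] >= cutoff
often_in_retweets = toy_freq["retweet freq"] >= cutoff

# Start with the default corner, then overwrite the other three cases
corner = pd.Series("lower-left (rare in both)", index=toy_freq.index)
corner[often_in_tweets & ~often_in_retweets] = "upper-left (tweets only)"
corner[~often_in_tweets & often_in_retweets] = "lower-right (retweets only)"
corner[often_in_tweets & often_in_retweets] = "upper-right (frequent in both)"
print(corner)
```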

## SEARCH FOR KEY TERMS:

One of Scattertext's best features is the search box at the bottom of each chart. Just enter a key word, such as "police", and see how the word was used in the tweets and retweets by the various users in the subset. This provides great context in the usage of particular key words, beyond mere frequency of appearance.
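
If you want the same kind of lookup outside the chart, a plain pandas filter gets you most of the way. A minimal sketch on made-up rows, using the column names from this notebook:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "user_screen_name": ["user_a", "user_b", "user_c"],
    "tweet_status": ["tweet", "retweet", "tweet"],
    "clean_tweet_text": [
        "the police acted with restraint",
        "rt someaccount protesters clash with police",
        "lovely weather in shenzhen today",
    ],
})

term = "police"
# regex=False treats the search term as a literal string
hits = toy[toy["clean_tweet_text"].str.contains(term, regex=False)]
print(hits[["user_screen_name", "tweet_status", "clean_tweet_text"]])
```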

## 2.2 SCATTERTEXT PLOT FOR TOP TROLLS
While visually appealing, the first Scattertext plot above is still very "noisy". Let's see if things improve with a plot of just the tweets and retweets by the two accounts most active in English. 

In [17]:
# Same 2 top troll accounts analysed in Part 1

trolls_english = ['ctcc507', 'HKpoliticalnew']
# .copy() so that adding a column below doesn't trigger a SettingWithCopyWarning
top_trolls = hk_eng[hk_eng['user_screen_name'].isin(trolls_english)].copy()

In [18]:
top_trolls['tweet_status'] = np.where(top_trolls["tweet_text"].str.startswith("RT @"), "retweet", "tweet")

In [19]:
nlp = spacy.load("en_core_web_lg")

corpus_trolls = (st.CorpusFromPandas(
    top_trolls, 
    category_col="tweet_status", 
    text_col="clean_tweet_text", 
    nlp=nlp
).build().remove_terms(ENGLISH_STOP_WORDS, ignore_absences=True))

In [20]:
troll_term_freq_df = corpus_trolls.get_term_freq_df()

troll_term_freq_df['Tweet Score'] = corpus_trolls.get_scaled_f_scores('tweet')
print("Terms most associated with top trolls' tweets:", list(troll_term_freq_df.sort_values(by='Tweet Score', ascending=False).index[:10]))

Terms most associated with top trolls' tweets: ['affairs', 'in kong', 'kong affairs', 'interference', 'stop', 'british', 'the police', 'china', 'southwest', 'to stop']


In [21]:
troll_term_freq_df['Retweet Score'] = corpus_trolls.get_scaled_f_scores('retweet')
print("Terms most associated with top trolls' retweets:", list(troll_term_freq_df.sort_values(by='Retweet Score', ascending=False).index[:10]))

Terms most associated with top trolls' retweets: ['_', '_ _', 'protesters', 'public event', 'event', 'a public', 'hkpoliceforce', 'extraditionbill', 'chief', 'hongkong']


In [22]:
troll_html = st.produce_scattertext_explorer(
    corpus_trolls,
    category="tweet",
    category_name="Tweets",
    not_category_name="Retweets",
    width_in_pixels=1000,
    metadata=top_trolls["user_screen_name"],
    show_characteristic=False,
    minimum_term_frequency=5
)

troll_interactive = "../output/trolls_eng.html"
with open(troll_interactive, "wb") as f:
    f.write(troll_html.encode("utf-8"))

IFrame(src=troll_interactive, width=1200, height=700)

Download the chart [here](https://www.dropbox.com/sh/jmb1oy0kak18cwy/AABfHXYoA_P8d6Tw-scNpDVia?dl=0)

## NOTE:
The chart is very sparse, which is unsurprising given the smaller number of tweets and retweets. But we do get a good sense of what goes into the troll tweets from the terms that appear in the top-left corner: "interference", "foreign", "meddling", and "internal".

Overall, the analysis of the English-language troll tweets has been a frustrating experience, given the very high noise-to-signal ratio. Let's see if the analysis of the Chinese-language tweets in Parts 3 and 4 fares any better.