# HK PROTESTS: Visualising Chinese State Troll Tweets, Part 4 (Using Scattertext for Chinese text)

Scattertext works well for Chinese text and presents an attractive alternative to frequency distribution charts. But the resulting interactive HTML files can become too big, even for a modest body of Chinese text.

In [2]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import string
import scattertext as st
import spacy
import jieba

from IPython.display import IFrame
from IPython.display import display, HTML
from scattertext import CorpusFromParsedDocuments
from scattertext import chinese_nlp
from scattertext import produce_scattertext_explorer

mpl.rcParams["figure.dpi"] = 300
%matplotlib inline
%config InlineBackend.figure_format ='retina'

In [3]:
pd.set_option('display.max_columns', 40)

# 1. PRE-WORK

## 1.1 DATA PROCESSING
Download the original CSV files from [Twitter](https://blog.twitter.com/en_us/topics/company/2019/information_operations_directed_at_Hong_Kong.html). 

In [4]:
# Reminder that the raw1/2 CSV files are NOT in this repo. Download directly from Twitter; link above
raw1 = pd.read_csv('../data/china_082019_1_tweets_csv_hashed.csv')
raw2 = pd.read_csv('../data/china_082019_2_tweets_csv_hashed.csv')
raw = pd.concat([raw1, raw2])


In [5]:
raw = raw.drop(
    columns=[
        "user_profile_url",
        "tweet_client_name",
        "in_reply_to_tweetid",
        "in_reply_to_userid",
        "quoted_tweet_tweetid",
        "is_retweet",
        "retweet_userid",
        "retweet_tweetid",
        "latitude",
        "longitude",
        "quote_count",
        "reply_count",
        "like_count",
        "retweet_count",
        "urls",
        "user_mentions",
        "poll_choices",
        "hashtags",
    ]
)

In [6]:
# Converting timings to HK time
raw['tweet_time'] = pd.to_datetime(raw['tweet_time'])
raw['tweet_time'] = raw['tweet_time'].dt.tz_localize('GMT').dt.tz_convert('Hongkong')
raw['tweet_year'] = raw['tweet_time'].dt.year
raw['tweet_month'] = raw['tweet_time'].dt.month
raw['tweet_day'] = raw['tweet_time'].dt.day

In [7]:
raw['account_creation_date'] = pd.to_datetime(raw['account_creation_date'], yearfirst=True)
raw['year_of_account_creation'] = raw['account_creation_date'].dt.year
raw['month_of_account_creation'] = raw['account_creation_date'].dt.month
raw['day_of_account_creation'] = raw['account_creation_date'].dt.day

In [8]:
# I'll focus only on tweets sent from 2017
raw = raw[(raw["tweet_year"] >= 2017)].copy()

In [9]:
# Filtering out tweets which mention fugitive Chinese billionaire Guo Wengui,
# and other irrelevant figures like US-based dissidents Yang Jianli, Guo Baosheng etc
# Also filtering out tweets containing additional stop words
raw = raw[~raw["tweet_text"].str.contains("郭文贵")].copy()
raw = raw[~raw["tweet_text"].str.contains("文贵")].copy()
raw = raw[~raw["tweet_text"].str.contains("郭文")].copy()
raw = raw[~raw["tweet_text"].str.contains("杨建利")].copy()
raw = raw[~raw["tweet_text"].str.contains("郭宝胜")].copy()
raw = raw[~raw["tweet_text"].str.contains("宝胜")].copy()
raw = raw[~raw["tweet_text"].str.contains("老郭")].copy()
raw = raw[~raw["tweet_text"].str.contains("郭狗")].copy()
raw = raw[~raw["tweet_text"].str.contains("郭骗子")].copy()
raw = raw[~raw["tweet_text"].str.contains("余文生")].copy()
raw = raw[~raw["tweet_text"].str.contains("吴小晖")].copy()
raw = raw[~raw["tweet_text"].str.contains("整点")].copy()
raw = raw[~raw["tweet_text"].str.contains("日电")].copy()
raw = raw[~raw["tweet_text"].str.contains("时间")].copy()
raw = raw[~raw["tweet_text"].str.contains("桂海")].copy()
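
The chain of filters above could also be expressed as a single pass with one regular expression. The sketch below is only an illustration on an invented toy DataFrame (`toy`, `terms` and `pattern` are hypothetical names, and only a subset of the terms is used); `na=False` is an added assumption so that missing tweet text is treated as a non-match.

```python
import re
import pandas as pd

# Toy stand-in for the `raw` DataFrame; these tweets are invented for illustration.
toy = pd.DataFrame(
    {"tweet_text": ["香港今天有遊行", "郭文贵又造谣", "北京时间整点新闻", "警察嚴正執法"]}
)

# A subset of the notebook's terms, escaped and joined into one alternation.
terms = ["郭文贵", "文贵", "杨建利", "整点", "时间"]
pattern = "|".join(re.escape(t) for t in terms)

# Keep only rows whose text matches none of the terms.
filtered = toy[~toy["tweet_text"].str.contains(pattern, regex=True, na=False)].copy()
print(filtered["tweet_text"].tolist())
```

A single pattern keeps the term list in one place, which makes it easier to add or remove stop words later.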

In [10]:
# In this notebook, I'll focus only on Chinese tweets. 
# In earlier drafts, I found that troll accounts with English language settings were sending out Chinese tweets too,
# so provisions were made here to include those 
# Note the sub-categories for Twitter language settings for English and Chinese

raw_ch = raw[
    (raw["tweet_language"] == "zh")
    & (
        (raw["account_language"] == "en")
        | (raw["account_language"] == "en-gb")
        | (raw["account_language"] == "zh-cn")
        | (raw["account_language"] == "zh-CN")
        | (raw["account_language"] == "zh-tw")
    )
].copy()
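
The same language filter can be written more compactly with `Series.isin`. This is a minimal sketch on invented data; `toy`, `account_langs` and `subset` are hypothetical names, and the language codes are the ones listed in the cell above.

```python
import pandas as pd

# Invented rows standing in for the `raw` DataFrame.
toy = pd.DataFrame(
    {
        "tweet_language": ["zh", "zh", "en", "zh"],
        "account_language": ["en", "zh-CN", "zh-cn", "ru"],
    }
)

# Account language settings to keep, as listed in the notebook.
account_langs = ["en", "en-gb", "zh-cn", "zh-CN", "zh-tw"]

# Chinese-language tweets sent from accounts with one of those settings.
subset = toy[
    (toy["tweet_language"] == "zh") & toy["account_language"].isin(account_langs)
].copy()
print(subset)
```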

## 1.2 DATA FILTERING

As with the English tweets subset, it would be too unwieldy and inefficient to plot a Scattertext chart for the entire Chinese tweets subset. The resulting interactive chart would also be too large to display in most browsers.

I opted to plot smaller individual charts for key terms that we've seen come up repeatedly: "香港" (Hong Kong) on its own, and a group of terms tied to the foreign-plot narrative, namely "顏色革命" (color revolution), "外國勢力" (foreign forces) and "美國" (America). You can experiment with different combinations of key words.

In [11]:
# I'll plot the Scattertext charts for these key terms individually.
# When concatenated, the resultant body of text results in a huge html file which is hard to load

hk_ch = raw_ch[raw_ch['tweet_text'].str.contains("香港")].copy()

In [12]:
# I chose these key terms as the trolls pushed hard for the conspiracy theory of a foreign plot in HK
hk_ch2a = raw_ch[raw_ch['tweet_text'].str.contains("顏色革命")].copy() # translation: color revolution
hk_ch2b = raw_ch[raw_ch['tweet_text'].str.contains("外國勢力")].copy() # translation: foreign forces
hk_ch2c = raw_ch[raw_ch['tweet_text'].str.contains("美國")].copy() # translation: America

hk_ch2 = pd.concat([hk_ch2a, hk_ch2b, hk_ch2c])
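
One thing to keep in mind with this approach: a tweet that contains more than one of the key terms is selected once per term, so it appears more than once after `pd.concat`. Whether that happens in the real subset isn't shown here, and the counts reported later come from the concatenated version. The sketch below, on invented tweets, contrasts it with a single boolean mask (`toy`, `terms`, `concatenated` and `single_pass` are hypothetical names).

```python
import pandas as pd

# Invented tweets; the first one contains two of the key terms.
toy = pd.DataFrame({"tweet_text": ["美國策動顏色革命", "外國勢力干預香港", "香港天氣很好"]})
terms = ["顏色革命", "外國勢力", "美國"]

# Approach used in the notebook: one subset per term, then concatenate.
concatenated = pd.concat([toy[toy["tweet_text"].str.contains(t)] for t in terms])

# Alternative: one mask that matches any of the terms, so each tweet appears once.
mask = toy["tweet_text"].str.contains("|".join(terms))
single_pass = toy[mask].copy()

print(len(concatenated), len(single_pass))
```

Calling `drop_duplicates()` on the concatenated frame would be another way to get one row per tweet.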

In [13]:
# Simple function to clean the tweet_text col

def clean_tweet_ch(text):
    text = text.strip()  # trim leading/trailing whitespace, including newlines
    text = re.sub(r"http\S+", "", text)  # drop URLs before punctuation is stripped
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    filtered = re.compile(r"[^\u4E00-\u9FA5]")  # anything outside the Chinese range \u4E00-\u9FA5
    text = filtered.sub("", text)  # remove all non-Chinese characters
    return text
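
To see what the final filtering step does, here is a small stand-alone sketch that applies the same character range to an invented sample tweet. `keep_chinese` and `sample` are hypothetical names used only for this illustration.

```python
import re

# Matches any character outside the \u4E00-\u9FA5 range used in the notebook.
NON_CHINESE = re.compile(r"[^\u4E00-\u9FA5]")


def keep_chinese(text):
    """Return only the characters of `text` that fall inside the range."""
    return NON_CHINESE.sub("", text)


# Invented sample with a retweet prefix, a URL, a hashtag and punctuation.
sample = "RT @user: 香港 警察 https://t.co/abc123 #HK 嚴正執法!"
cleaned = keep_chinese(sample)
print(cleaned)  # 香港警察嚴正執法
```

Note that whitespace is removed too, so word boundaries are left entirely to the segmentation step that follows.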

In [14]:
hk_ch['clean_tweet_text'] = hk_ch['tweet_text'].map(lambda tweet: clean_tweet_ch(tweet))
hk_ch2['clean_tweet_text'] = hk_ch2['tweet_text'].map(lambda tweet: clean_tweet_ch(tweet))

In [15]:
# Creating a new column to classify a tweet as a retweet or actual tweet
# This step is mandatory for the Scattertext plots

hk_ch['tweet_status'] = np.where(hk_ch["tweet_text"].str.startswith("RT @"), "retweet", "tweet")
hk_ch2['tweet_status'] = np.where(hk_ch2["tweet_text"].str.startswith("RT @"), "retweet", "tweet")

In [16]:
hk_ch.shape, hk_ch2.shape

((2121, 21), (293, 21))

# 2. SCATTERTEXT PLOTS FOR CHINESE TWEETS

# 2.1 Visualising Filtered Tweets For "香港"(Hong Kong)
This subset consists of 1,551 "original" tweets and 570 retweets, and about 83,200 Chinese characters. Let's see how they appear on a Scattertext chart.
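
The notebook doesn't show how these figures were computed. One plausible way, sketched here on invented rows, is to count the `tweet_status` values and sum the lengths of the cleaned text (`toy`, `status_counts` and `total_chars` are hypothetical names).

```python
import pandas as pd

# Invented rows standing in for the filtered subset.
toy = pd.DataFrame(
    {
        "tweet_status": ["tweet", "retweet", "tweet"],
        "clean_tweet_text": ["香港警察", "遠離暴力", "港獨"],
    }
)

# Number of original tweets vs retweets.
status_counts = toy["tweet_status"].value_counts()

# Total number of Chinese characters left after cleaning.
total_chars = toy["clean_tweet_text"].str.len().sum()

print(status_counts.to_dict(), total_chars)
```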

In [17]:
hk_ch['clean_tweet_text'] = hk_ch['clean_tweet_text'].apply(chinese_nlp)

Building prefix dict from /anaconda3/lib/python3.6/site-packages/jieba/dict.txt ...
Loading model from cache /var/folders/6z/wrz4dxdx65585cc04rbtr1xh0000gn/T/jieba.cache
Loading model cost 0.9550659656524658 seconds.
Prefix dict has been built succesfully.


In [18]:
corpus = CorpusFromParsedDocuments(
    hk_ch, category_col="tweet_status", parsed_col="clean_tweet_text"
).build()

In [19]:
term_freq_df = corpus.get_term_freq_df()

term_freq_df["Tweet Score"] = corpus.get_scaled_f_scores("tweet")
print(
    "Terms most associated with trolls' tweets (Chinese):",
    list(term_freq_df.sort_values(by="Tweet Score", ascending=False).index[:10]),
)

term_freq_df["Retweet Score"] = corpus.get_scaled_f_scores("retweet")
print(
    "Terms most associated with trolls' retweets (Chinese):",
    list(term_freq_df.sort_values(by="Retweet Score", ascending=False).index[:10]),
)


Terms most associated with trolls' tweets (Chinese): ['陳', '出席', '佔', '今年', '舉行', '發展', '港獨', '香港 特區', '公布', '約']
Terms most associated with trolls' retweets (Chinese): ['輿論', '遠離', '暴力 小心', '遠離 暴力', '處置', '德國員警', '德國員警 是', '錯誤', '小心', '本身 的']


In [20]:
html = produce_scattertext_explorer(
    corpus,
    category="tweet",
    category_name="Tweets",
    not_category_name="Retweets",
    width_in_pixels=1000,
    metadata=hk_ch["user_screen_name"],
    asian_mode=True,
    show_characteristic=False,
)

interactive = "../output/ch1.html"
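# Write as UTF-8 so the Chinese text survives on any platform, and close the file afterwards
with open(interactive, "w", encoding="utf-8") as f:
    f.write(html)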

IFrame(src=interactive, width=1200, height=700)
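
Since the size of the output file is a recurring concern, it can help to check it before trying to load the chart in a browser. This is a minimal sketch using only the standard library; `html_size_mb` is a hypothetical helper, and the demo writes a throwaway file rather than the real chart.

```python
import os
import tempfile


def html_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


# Demonstrate on a throwaway file instead of the real chart output.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.html")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("<html>" + "香港" * 1000 + "</html>")
    size_mb = html_size_mb(demo_path)

print(f"{size_mb:.4f} MB")
```

In the notebook, the same helper could be pointed at the path stored in `interactive`.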

## INTERPRETING THE FIRST SCATTERTEXT CHART:

### Colour
 - The words in the chart are colored by their association. Those in blue are more associated with original tweets, while those in red are more associated with retweets. Each dot corresponds to a word or phrase mentioned.

### Positioning: 
- Words nearer the top of the plot represent the most frequently used words in the "original" tweets. In this chart, we can see that terms used by the trolls to criticise the protest movement ("港獨", or HK independence), or to push conspiracy theories ("美國", or the US, which Beijing accused of fomenting the protests), feature near the top.

- The further right a dot, the more that word or phrase was used in retweets. Here, we see the retweets pushing the trolls' message for strict law enforcement and public order, with terms such as "嚴正執法" (strict law enforcement) and "社會治安" (public order).

- Words that appear frequently in both tweets and retweets, like "香港"(Hong Kong) and "暴力"(violence), appear in the upper-right-hand corner.

- Words that aren't often used in either tweets or retweets appear in the bottom-left-hand corner.


### Key Areas:
- Upper-left corner: These words appear frequently in the tweets but not the retweets. We still see a fair bit of noise here, as evidenced by the appearance of words like "桂民海" (detained HK bookseller Gui Minhai) near the very top.

- Lower-right corner: Likewise, words which appear frequently in retweets but not the tweets appear in the lower right corner. Here we see terms urging people to be careful of potential violence at the protests.
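
The layout described above can be illustrated with a toy calculation: count each term per category, then scale the counts to a 0-1 position by rank. This is a simplified sketch of rank-based positioning, not Scattertext's actual implementation; the documents and the helpers `term_counts` and `rank_scale` are invented for the example.

```python
from collections import Counter

# Invented, pre-segmented toy documents for each category.
tweets = [["香港", "港獨", "美國"], ["香港", "港獨"], ["香港", "暴力"]]
retweets = [["香港", "暴力", "小心"], ["暴力", "小心"], ["香港"]]


def term_counts(docs):
    """Count how often each term occurs across a list of tokenised documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def rank_scale(counts, vocab):
    """Map each term to [0, 1] by the dense rank of its frequency."""
    distinct = sorted({counts[t] for t in vocab})
    if len(distinct) == 1:
        return {t: 0.5 for t in vocab}
    return {t: distinct.index(counts[t]) / (len(distinct) - 1) for t in vocab}


tweet_counts, retweet_counts = term_counts(tweets), term_counts(retweets)
vocab = sorted(set(tweet_counts) | set(retweet_counts))
y = rank_scale(tweet_counts, vocab)    # vertical position: frequency in tweets
x = rank_scale(retweet_counts, vocab)  # horizontal position: frequency in retweets

for term in vocab:
    print(term, round(x[term], 2), round(y[term], 2))
```

In this toy data, "香港" lands in the upper-right corner (frequent in both), "港獨" in the upper left (tweets only) and "小心" in the lower right (retweets only).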

Download the chart [here](https://www.dropbox.com/sh/jmb1oy0kak18cwy/AABfHXYoA_P8d6Tw-scNpDVia?dl=0) for a fuller look. Use the search box at the bottom of the chart to see how the word was used in the tweets and retweets by the various users in the subset. 

# 2.2 Visualising Filtered Tweets For "顏色革命"(Color Revolution), "外國勢力"(Foreign Forces) And "美國"(United States)

As we've seen in Part 3, the trolls pushed hard on the conspiracy theory about the US and UK fomenting a color revolution in HK. Let's see how the related terms appear on a Scattertext chart.

This subset consists of 207 "original" tweets and 86 retweets, and about 12,700 Chinese characters. 

In [21]:
hk_ch2['clean_tweet_text'] = hk_ch2['clean_tweet_text'].apply(chinese_nlp)

corpus2 = CorpusFromParsedDocuments(
    hk_ch2, category_col="tweet_status", parsed_col="clean_tweet_text"
).build()


term_freq_df2 = corpus2.get_term_freq_df()

term_freq_df2["Tweet Score"] = corpus2.get_scaled_f_scores("tweet")
print(
    "Terms most associated with trolls' tweets (Chinese):",
    list(term_freq_df2.sort_values(by="Tweet Score", ascending=False).index[:10]),
)

term_freq_df2["Retweet Score"] = corpus2.get_scaled_f_scores("retweet")
print(
    "Terms most associated with trolls' retweets (Chinese):",
    list(term_freq_df2.sort_values(by="Retweet Score", ascending=False).index[:10]),
)

Terms most associated with trolls' tweets (Chinese): ['佔', '民族', '佔 中', '浩天', '陳 浩天', '香港 民族', '去', '世界', '年', '勾']
Terms most associated with trolls' retweets (Chinese): ['本質', '這場', '的 本質', '鼓動', '有 計劃', '支持 下', '一場 有', '有 預謀有', '預謀有 組織', '組織 有']


In [22]:
html2 = produce_scattertext_explorer(
    corpus2,
    category="tweet",
    category_name="Tweets",
    not_category_name="Retweets",
    width_in_pixels=1000,
    metadata=hk_ch2["user_screen_name"],
    asian_mode=True,
    show_characteristic=False,
)

interactive2 = "../output/ch2.html"
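# Write as UTF-8 so the Chinese text survives on any platform, and close the file afterwards
with open(interactive2, "w", encoding="utf-8") as f:
    f.write(html2)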

IFrame(src=interactive2, width=1200, height=700)

## INTERPRETING THE SECOND SCATTERTEXT CHART:

### Colour
 - The words in the chart are colored by their association. Those in blue are more associated with original tweets, while those in red are more associated with retweets. Each dot corresponds to a word or phrase mentioned.

### Positioning: 
- Words nearer the top of the plot represent the most frequently used words in the "original" tweets. In this chart, the terms used by the trolls to push the conspiracy theories - "顏色革命" and "外國勢力" - naturally feature well near the top. Use the search box to see which accounts are actively pushing these tweets and retweets - @HKpoliticalnew, @mari1lcaire and a number of hashed accounts. 

- The further right a dot, the more that word or phrase was used in retweets. 

- Words that appear frequently in both tweets and retweets, like "香港"(Hong Kong), "中國"(China), and "逃犯條例"(extradition bill), appear in the upper-right-hand corner.

- Words that aren't often used in either tweets or retweets appear in the bottom-left-hand corner.


### Key Areas:
- Upper-left corner: These words appear frequently in the tweets but not the retweets. We see some mention of entrepreneur [Jimmy Lai("黎智英")](https://en.wikipedia.org/wiki/Jimmy_Lai) and political activist [Andy Chan("陳浩天")](https://en.wikipedia.org/wiki/Chan_Ho-tin), two people heavily targeted by Beijing's criticisms.

- Lower-right corner: Likewise, words which appear frequently in retweets but not the tweets appear in the lower right corner.

Download the chart [here](https://www.dropbox.com/sh/jmb1oy0kak18cwy/AABfHXYoA_P8d6Tw-scNpDVia?dl=0) for a fuller look. Use the search box at the bottom of the chart to see how the word was used in the tweets and retweets by the various users in the subset. 