# Data Exploration (1):
## Word Frequency Analyses
* This notebook focuses on the first step in my analysis: identifying which words are most commonly used in reference to China and its government, and setting up a list of words of interest to explore further.


---

### Setup
* includes additional modules, functions, and chars_to_remove

In [1]:
import json
from collections import Counter
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
%run functions.ipynb

In [3]:
chars_to_remove = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
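
The tokenize function used below is defined in functions.ipynb and is not shown in this notebook. As a rough idea of what it might do, here is a minimal sketch under my own assumptions: tokenize_sketch is a hypothetical stand-in, not the real implementation. It lowercases the text, deletes every character in strip_chars (including inside words), and splits on whitespace, which would be consistent with tokens such as 'chinas' and 'yearyear' in the lists below.

```python
def tokenize_sketch(text, lowercase=True, strip_chars=''):
    """Hypothetical stand-in for tokenize(); the real one lives in functions.ipynb."""
    if lowercase:
        text = text.lower()
    # delete every character listed in strip_chars, even in the middle of a word
    cleaned = text.translate(str.maketrans('', '', strip_chars))
    # split on any run of whitespace
    return cleaned.split()

chars_to_remove = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
sample_tokens = tokenize_sketch("China's year/year growth rose 3%.",
                                strip_chars=chars_to_remove)
print(sample_tokens)
```

Characters that are not in chars_to_remove (for example the bullet '•') would survive this kind of cleaning and end up as tokens of their own.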

## 1. Load data

In [4]:
def load_year(year):
    # read one year's cleaned LN articles; the with-block closes the file afterwards
    with open(f'../data/LN_data/cleaned/LN_{year}.json') as f:
        return json.load(f)

LN2017 = load_year(2017)
LN2018 = load_year(2018)
LN2019 = load_year(2019)
LN2020 = load_year(2020)
LN2021 = load_year(2021)

## 2. Frequency Lists for Each Year

### 2017

In [5]:
len(LN2017)

974

In [6]:
# set up a Counter for words
LN2017_word_dist = Counter()

# process each document in list
for doc in LN2017:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2017_word_dist.update(tokens)

In [7]:
LN2017_word_dist.most_common()[100:150]

[('report', 1023),
 ('national', 1022),
 ('some', 1019),
 ('foreign', 988),
 ('growth', 981),
 ('when', 975),
 ('rights', 967),
 ('after', 943),
 ('news', 942),
 ('do', 936),
 ('two', 936),
 ('could', 934),
 ('financial', 932),
 ('first', 928),
 ('research', 923),
 ('markets', 921),
 ('like', 890),
 ('media', 875),
 ('future', 863),
 ('just', 860),
 ('think', 853),
 ('no', 834),
 ('percent', 831),
 ('going', 830),
 ('bartiromo', 818),
 ('time', 816),
 ('because', 801),
 ('most', 795),
 ('billion', 791),
 ('many', 787),
 ('products', 785),
 ('how', 784),
 ('both', 781),
 ('those', 781),
 ('while', 776),
 ('them', 769),
 ('last', 766),
 ('right', 765),
 ('economic', 758),
 ('public', 756),
 ('beijing', 748),
 ('system', 747),
 ('2015', 743),
 ('according', 737),
 ('online', 723),
 ('group', 722),
 ('state', 711),
 ('statements', 710),
 ('use', 705),
 ('under', 704)]
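
The same counting loop is repeated for every year in this notebook. A small helper could wrap it; the sketch below is only an illustration, with build_word_dist, the toy documents, and the whitespace tokenizer all being names and data I made up.

```python
from collections import Counter

def build_word_dist(docs, tokenizer):
    """Count tokens across documents shaped like {'body': '...'}."""
    dist = Counter()
    for doc in docs:
        dist.update(tokenizer(doc['body']))
    return dist

# toy documents and a plain whitespace tokenizer, invented for illustration
toy_docs = [{'body': 'china trade china'}, {'body': 'trade war'}]
toy_dist = build_word_dist(toy_docs, lambda text: text.lower().split())
print(toy_dist)
```

In this notebook the tokenizer argument could be something like lambda t: tokenize(t, lowercase=True, strip_chars=chars_to_remove).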

### 2018

In [8]:
len(LN2018)

1915

In [9]:
LN2018_word_dist = Counter()

for doc in LN2018:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2018_word_dist.update(tokens)

In [10]:
LN2018_word_dist.most_common()[150:200]

[('mcdowell', 1518),
 ('most', 1515),
 ('rights', 1513),
 ('many', 1495),
 ('much', 1485),
 ('huawei', 1469),
 ('research', 1468),
 ('country', 1458),
 ('state', 1451),
 ('only', 1447),
 ('take', 1434),
 ('against', 1418),
 ('steel', 1415),
 ('make', 1406),
 ('want', 1398),
 ('according', 1381),
 ('economy', 1375),
 ('then', 1373),
 ('while', 1367),
 ('made', 1352),
 ('way', 1349),
 ('look', 1343),
 ('under', 1336),
 ('investment', 1331),
 ('even', 1323),
 ('say', 1321),
 ('financial', 1315),
 ('media', 1308),
 ('should', 1274),
 ('use', 1273),
 ('group', 1262),
 ('video', 1236),
 ('future', 1235),
 ('house', 1225),
 ('internet', 1209),
 ('really', 1209),
 ('being', 1207),
 ('network', 1167),
 ('big', 1158),
 ('both', 1155),
 ('today', 1144),
 ('human', 1137),
 ('end', 1127),
 ('intelligence', 1125),
 ('part', 1125),
 ('public', 1121),
 ('work', 1114),
 ('includes', 1112),
 ('go', 1111),
 ('administration', 1110)]

**Observations for 2017 + 2018**:
* 'bartiromo' ranks consistently high in both 2017 and 2018
    * A quick Google search suggests it is the last name of a Fox News journalist, but I will run a KWIC search in a different notebook to double-check its context.
* Interestingly, 'beijing' is mentioned quite frequently in 2017 (748 mentions in one of the smaller datasets)
    * Like 'China', 'Beijing' (China's capital) is sometimes used in media coverage to refer to the entire country, so this may be a notable word to look into.
* Further down the list for 2018, 'huawei' begins to appear as well.
    * This is expected, as I narrowed my LN search to focus on the tech industry, since America is currently being threatened by China's growth as an emerging global leader in science and technology.
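
Because the yearly datasets differ a lot in size (974 documents in 2017 versus 1915 in 2018), raw counts are hard to compare across years. One possible way to put them on the same scale is a rate per 10,000 tokens. This is a sketch with toy numbers, and per_10k is a hypothetical helper, not something from functions.ipynb.

```python
from collections import Counter

def per_10k(dist, word):
    """Occurrences of `word` per 10,000 tokens in a Counter."""
    total = sum(dist.values())
    return 10_000 * dist[word] / total if total else 0.0

# toy counts, invented for illustration only
small_year = Counter({'beijing': 5, 'the': 95})
large_year = Counter({'beijing': 10, 'the': 990})
print(per_10k(small_year, 'beijing'), per_10k(large_year, 'beijing'))
```

In the toy data the larger year has the higher raw count but the lower rate, which is the kind of difference a raw-count comparison would hide.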

### 2019

In [11]:
len(LN2019)

3106

In [12]:
LN2019_word_dist = Counter()

for doc in LN2019:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2019_word_dist.update(tokens)

In [13]:
LN2019_word_dist.most_common()[150:200]

[('better', 7127),
 ('revenue', 6982),
 ('huawei', 6956),
 ('day', 6953),
 ('2020', 6950),
 ('rate', 6944),
 ('price', 6877),
 ('yearyear', 6860),
 ('basis', 6848),
 ('chinas', 6847),
 ('industrial', 6828),
 ('11', 6781),
 ('net', 6707),
 ('cash', 6666),
 ('no', 6656),
 ('15', 6646),
 ('including', 6638),
 ('rose', 6619),
 ('what', 6567),
 ('global', 6543),
 ('phase', 6527),
 ('if', 6449),
 ('items', 6447),
 ('dow', 6411),
 ('who', 6410),
 ('agreement', 6381),
 ('percent', 6375),
 ('consumer', 6339),
 ('had', 6285),
 ('due', 6250),
 ('offering', 6185),
 ('second', 6155),
 ('bartiromo', 6129),
 ('yield', 6129),
 ('companys', 6094),
 ('production', 6041),
 ('3', 6029),
 ('reported', 5985),
 ('2018', 5933),
 ('stocks', 5920),
 ('nonrecurring', 5913),
 ('hong', 5864),
 ('sector', 5822),
 ('just', 5813),
 ('years', 5802),
 ('common', 5743),
 ('approximately', 5735),
 ('financial', 5708),
 ('any', 5700),
 ('october', 5692)]

**Observations**:
* As in 2018, the 2019 list shows a focus on Huawei.
    * Huawei likely appears more frequently this year because the [Huawei Ban](https://www.androidauthority.com/huawei-google-android-ban-988382/#:~:text=The%20Huawei%20ban%20begins%20on,deemed%20a%20national%20security%20risk.) was implemented in May 2019, after Trump deemed the company a "security risk."

### 2020

In [14]:
len(LN2020)

3057

In [15]:
LN2020_word_dist = Counter()

for doc in LN2020:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2020_word_dist.update(tokens)

In [16]:
LN2020_word_dist.most_common()[100:150]

[('what', 6472),
 ('security', 6431),
 ('coronavirus', 6429),
 ('fy20', 6365),
 ('while', 6261),
 ('2019', 6252),
 ('into', 6130),
 ('may', 6098),
 ('two', 6050),
 ('week', 6035),
 ('beats', 6033),
 ('global', 6003),
 ('can', 5926),
 ('technology', 5861),
 ('march', 5843),
 ('trump', 5775),
 ('q4', 5767),
 ('his', 5766),
 ('including', 5740),
 ('out', 5708),
 ('inline', 5636),
 ('had', 5601),
 ('if', 5589),
 ('report', 5580),
 ('some', 5570),
 ('growth', 5521),
 ('revenue', 5474),
 ('trade', 5389),
 ('national', 5358),
 ('no', 5213),
 ('pandemic', 5154),
 ('prior', 5139),
 ('yearyear', 5112),
 ('through', 5098),
 ('time', 5060),
 ('when', 5055),
 ('economic', 5046),
 ('april', 5030),
 ('1', 5030),
 ('do', 5025),
 ('better', 5003),
 ('patients', 4998),
 ('markets', 4918),
 ('issues', 4894),
 ('information', 4882),
 ('results', 4808),
 ('increased', 4779),
 ('cash', 4758),
 ('could', 4728),
 ('during', 4716)]

### 2021

In [17]:
len(LN2021)

771

In [18]:
LN2021_word_dist = Counter()

for doc in LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2021_word_dist.update(tokens)

In [19]:
LN2021_word_dist.most_common()[150:200]

[('country', 763),
 ('products', 750),
 ('since', 747),
 ('1', 743),
 ('share', 740),
 ('higher', 735),
 ('research', 734),
 ('quarter', 734),
 ('value', 728),
 ('where', 726),
 ('administration', 725),
 ('500', 722),
 ('according', 719),
 ('any', 718),
 ('may', 709),
 ('many', 704),
 ('increase', 698),
 ('briefingcom', 695),
 ('key', 692),
 ('bartiromo', 689),
 ('economic', 687),
 ('see', 687),
 ('markets', 684),
 ('being', 681),
 ('end', 680),
 ('cash', 680),
 ('use', 676),
 ('foreign', 675),
 ('•', 664),
 ('agreement', 663),
 ('total', 660),
 ('10', 658),
 ('how', 658),
 ('pandemic', 655),
 ('sector', 651),
 ('digital', 651),
 ('hong', 649),
 ('guidance', 648),
 ('should', 647),
 ('countries', 643),
 ('companys', 641),
 ('vs', 641),
 ('social', 639),
 ('trump', 637),
 ('under', 636),
 ('following', 635),
 ('q4', 633),
 ('15', 632),
 ('only', 632),
 ('increased', 630)]

**Observations for 2020 + 2021**:
* For 2020 and 2021, COVID took precedence over everything else, so related terms such as 'coronavirus', '(global) pandemic', and 'patients' appear in both lists
* Additionally, words such as 'beats' and 'vs', which suggest competition, begin to show up often as well
* 'trump' himself is mentioned frequently in 2020
    * This is fitting, as Trump handled the COVID outbreak particularly poorly, especially in the first few months, and contributed to the spread of misinformation, doubt, and defiance among the American people.
        * Unlike other world leaders, who clearly let their medical experts provide information to citizens, [Trump voiced conflicting information](https://www.latimes.com/politics/story/2020-03-30/trumps-mixed-messages-confuse-coronavirus-response) and caused more harm, slowing the country's recovery; as a result, America's COVID death rate exceeded that of any other country.
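
To check which terms really show up in both years rather than eyeballing two slices, the top-N words of each Counter could be intersected. This is only a sketch: shared_top_words is a hypothetical helper and the counts are toy values.

```python
from collections import Counter

def shared_top_words(dist_a, dist_b, n=200):
    """Words that are among the n most common in both Counters."""
    top_a = {word for word, _ in dist_a.most_common(n)}
    top_b = {word for word, _ in dist_b.most_common(n)}
    return sorted(top_a & top_b)

# toy counts, invented for illustration only
toy_2020 = Counter({'coronavirus': 9, 'pandemic': 8, 'trump': 7, 'q4': 1})
toy_2021 = Counter({'pandemic': 6, 'digital': 5, 'trump': 4, 'q4': 1})
shared = shared_top_words(toy_2020, toy_2021, n=3)
print(shared)
```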

---

In [20]:
all_word_dist = Counter()

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    all_word_dist.update(tokens)
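
Since every yearly Counter was built with the same tokenize settings, the combined distribution could also be obtained by adding the yearly Counters together instead of tokenizing all documents a second time. A minimal sketch, with merge_dists as a hypothetical helper and toy counts:

```python
from collections import Counter

def merge_dists(*dists):
    """Add up several Counters into one new Counter."""
    total = Counter()
    for dist in dists:
        total.update(dist)  # update() adds to existing counts instead of replacing them
    return total

# toy counts, invented for illustration only
toy_a = Counter({'huawei': 2, 'trade': 1})
toy_b = Counter({'huawei': 3, 'tiktok': 4})
merged = merge_dists(toy_a, toy_b)
print(merged)
```

In this notebook that would be something like merge_dists(LN2017_word_dist, LN2018_word_dist, LN2019_word_dist, LN2020_word_dist, LN2021_word_dist).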

In [21]:
all_word_dist.most_common()[100:150]

[('two', 18829),
 ('announced', 18785),
 ('who', 18656),
 ('out', 18612),
 ('value', 18229),
 ('01', 18216),
 ('futures', 17732),
 ('markets', 17700),
 ('global', 17601),
 ('eps', 17582),
 ('united', 17536),
 ('if', 17363),
 ('states', 17304),
 ('may', 17127),
 ('his', 17070),
 ('security', 17057),
 ('can', 17015),
 ('increased', 16706),
 ('02', 16629),
 ('economic', 16608),
 ('revenues', 16607),
 ('including', 16487),
 ('had', 16461),
 ('chinas', 16429),
 ('people', 16382),
 ('billion', 16213),
 ('yryr', 16083),
 ('results', 16044),
 ('week', 16010),
 ('news', 15972),
 ('03', 15963),
 ('trump', 15809),
 ('index', 15800),
 ('no', 15665),
 ('bartiromo', 15510),
 ('information', 15478),
 ('patients', 15337),
 ('nasdaq', 15330),
 ('sees', 15293),
 ('1', 15250),
 ('500', 15220),
 ('do', 15110),
 ('going', 14950),
 ('years', 14830),
 ('prior', 14632),
 ('shares', 14581),
 ('percent', 14487),
 ('today', 14408),
 ('could', 14406),
 ('through', 14394)]

**Observation**:
* Going through the first 250 or so words, 'huawei' and 'bartiromo' again rank high, consistent with the yearly lists above.
* 'beats' appears consistently high across the years as well
* I also noticed that 'his' (a male pronoun) is used frequently
    * likely because the leaders and CEOs being covered are all male figures
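
Instead of scrolling through slices of most_common(), the rank of specific words could be looked up directly. The sketch below assumes nothing beyond a Counter; word_ranks is a hypothetical helper and the counts are toy values.

```python
from collections import Counter

def word_ranks(dist, words):
    """Map each word to its 1-based frequency rank, or None if it never occurs."""
    rank_of = {word: i for i, (word, _) in enumerate(dist.most_common(), start=1)}
    return {word: rank_of.get(word) for word in words}

# toy counts, invented for illustration only
toy = Counter({'the': 50, 'china': 30, 'huawei': 12, 'beats': 7})
ranks = word_ranks(toy, ['huawei', 'beats', 'tiktok'])
print(ranks)
```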

---

## 3. Words of Interest for KWIC 
* words gathered from the frequency lists above or from my own prior knowledge that I would like to explore further
    * negative words, words that breed competition, brand names, words that draw on American fear


* NOTE: There are of course many more words of interest that could be explored, but this is the set I wanted to look at for this project in particular: a varied list of obviously negative words and ones that may appear more neutral (but contextually contain negative undertones). Even then, I couldn't go into every single one, but I thought it was still important to list them here. There is a huge list of words that American media uses to purposefully portray situations and events as China's fault alone when America itself (and other Western countries) has participated in the same or even worse, so this project could very easily expand into something bigger.
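
For reference, a KWIC (keyword in context) search shows each occurrence of a word together with a few tokens on either side. The actual KWIC work happens in a different notebook; the function below is only a minimal sketch of the idea, with kwic as a hypothetical name and an invented example sentence.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# invented example sentence, already lowercased and split into tokens
toy_tokens = 'officials warn that the regime will tighten its grip on data'.split()
hits = kwic(toy_tokens, 'regime', window=2)
for left, keyword, right in hits:
    print(f'{left:>20} [{keyword}] {right}')
```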

In [22]:
words_to_lookup = ['crisis', 'danger', 'dangerous', 'damage', 'deny', 'control', 'alarming', 'corrupt',
                   'communism', 'communist', 'dominance', 'war', 'warning', 'restrict', 'regime', 'beats', 'enemy',
                   'but', 'now', 'propaganda', 'national', 'grip', 'tighten', 'fall', 'hard', 'authoritarian',
                   'totalitarian', 'orwellian', 'influence', 'secret', 'secrets', 'appetite', 'surge', 'concerns',
                   'concern', 'reality', 'beijing', 'huawei', 'tiktok', 'google', 'apple', 'vs', 'freedom', 'weapon']

for word in words_to_lookup:
    print(word, all_word_dist[word])

crisis 2008
danger 305
dangerous 633
damage 743
deny 286
control 4960
alarming 125
corrupt 202
communism 88
communist 4427
dominance 429
war 3355
restrict 413
regime 1044
beats 13929
enemy 156
but 38656
now 20024
propaganda 801
national 13433
grip 124
tighten 109
fall 1218
hard 1669
authoritarian 550
totalitarian 78
orwellian 78
influence 2252
secret 656
secrets 1430
appetite 88
surge 551
concerns 5432
concern 2048
reality 777
beijing 8071
huawei 12284
tiktok 7703
google 4442
apple 5227
vs 42654
freedom 2545
weapon 228
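
The lookup prints the words in list order. Sorting by count could make it easier to see which words of interest dominate. This is a sketch using a handful of the counts copied from the output above; lookup_sorted and the placeholder 'notaword' are hypothetical.

```python
from collections import Counter

# a few counts copied from the output above; the rest are omitted for brevity
sample_counts = Counter({'crisis': 2008, 'orwellian': 78, 'huawei': 12284,
                         'vs': 42654, 'but': 38656})

def lookup_sorted(dist, words):
    """Return (word, count) pairs sorted from most to least frequent."""
    pairs = ((word, dist[word]) for word in words)  # a Counter returns 0 for unseen words
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = lookup_sorted(sample_counts,
                       ['crisis', 'orwellian', 'huawei', 'vs', 'but', 'notaword'])
for word, count in ranked:
    print(word, count)
```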


---