# Data Exploration (4):
## Keyness Analysis
* This notebook applies keyness analysis to the Lexis Nexis data to look for additional patterns across the years 2017, 2018, 2020, and 2021.
    * 2019 is treated as a middle ground between the earlier and later periods

---

### Setup
* includes additional modules, functions, and `chars_to_remove` (the characters stripped during tokenization)

In [1]:
import json
from collections import Counter
import random
import matplotlib.pyplot as plt
import os
import re
import pandas as pd
import math

from nltk.sentiment import SentimentIntensityAnalyzer

In [2]:
%run functions.ipynb

In [3]:
chars_to_remove = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~•'

## 1. Load data
* All `word_dist` counters are taken from the frequency analyses conducted in `LN_data_exploration1`.

In [4]:
LN2017 = json.load(open('../data/LN_data/cleaned/LN_2017.json'))

In [5]:
LN2018 = json.load(open('../data/LN_data/cleaned/LN_2018.json'))
LN2019 = json.load(open('../data/LN_data/cleaned/LN_2019.json'))
LN2020 = json.load(open('../data/LN_data/cleaned/LN_2020.json'))
LN2021 = json.load(open('../data/LN_data/cleaned/LN_2021.json'))
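
The five loads above follow one pattern, so they could be folded into a small helper that also closes each file handle once it has been read. This is only a sketch: `load_year` and its `data_dir` parameter are hypothetical names, not part of `functions.ipynb`, and it assumes every file follows the `LN_<year>.json` naming used above.

```python
import json
from pathlib import Path


def load_year(year, data_dir='../data/LN_data/cleaned'):
    """Load the cleaned article list for one year (hypothetical helper)."""
    path = Path(data_dir) / f'LN_{year}.json'
    # the with-block closes the file handle once the JSON has been parsed
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

With this in place, something like `{year: load_year(year) for year in range(2017, 2022)}` would gather all five years into one dictionary.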

In [6]:
# set up a Counter for words
LN2017_word_dist = Counter()

# process each document in list
for doc in LN2017:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2017_word_dist.update(tokens)

In [7]:
LN2018_word_dist = Counter()

for doc in LN2018:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2018_word_dist.update(tokens)

In [8]:
LN2019_word_dist = Counter()

for doc in LN2019:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2019_word_dist.update(tokens)

In [9]:
LN2020_word_dist = Counter()

for doc in LN2020:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2020_word_dist.update(tokens)

In [10]:
LN2021_word_dist = Counter()

for doc in LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    LN2021_word_dist.update(tokens)
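
The five cells above repeat the same loop, which could be wrapped in one function. The sketch below is an illustration under assumptions: `build_word_dist` is a hypothetical name, and `simple_tokenize` is only a stand-in for the `tokenize` function from `functions.ipynb`, whose actual behavior is not shown in this notebook.

```python
from collections import Counter


def simple_tokenize(text, lowercase=True, strip_chars=''):
    """Stand-in tokenizer (assumption): lowercase, drop strip_chars, split on whitespace."""
    if lowercase:
        text = text.lower()
    # delete every character listed in strip_chars
    text = text.translate(str.maketrans('', '', strip_chars))
    return text.split()


def build_word_dist(docs, tokenizer=simple_tokenize, strip_chars=''):
    """Count tokens across the 'body' field of every document in docs."""
    word_dist = Counter()
    for doc in docs:
        tokens = tokenizer(doc['body'], lowercase=True, strip_chars=strip_chars)
        word_dist.update(tokens)
    return word_dist


# toy documents, made up purely for illustration
sample_docs = [{'body': 'China said tariffs.'}, {'body': 'Tariffs on steel, said China!'}]
print(build_word_dist(sample_docs, strip_chars='.,!').most_common(3))
```

In the notebook itself, the call would presumably look like `build_word_dist(LN2017, tokenizer=tokenize, strip_chars=chars_to_remove)`.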

---

## 2. Keyness
* comparing 2017 + 2018 against 2020 + 2021
* comparing 2017 against 2020
    * first full year vs. last full year of data collected
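
Keyness is often measured with the log-likelihood (G2) statistic, which compares a word's observed count in each corpus with the count expected if both corpora used the word at the same rate. `calculate_keyness` lives in `functions.ipynb` and is not shown here, so the sketch below is an assumption about a typical formulation, not the notebook's implementation; `log_likelihood` and `keyness_sketch` are hypothetical names.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """G2 keyness of one word, given its two counts and the two corpus sizes."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    # a zero count contributes nothing (0 * ln 0 is taken as 0)
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2


def keyness_sketch(dist_a, dist_b, top=10):
    """Rank words that are overrepresented in dist_a relative to dist_b."""
    size_a, size_b = sum(dist_a.values()), sum(dist_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError('both corpora must contain at least one token')
    rows = []
    for word, freq_a in dist_a.items():
        freq_b = dist_b.get(word, 0)
        # keep only words that are relatively more frequent in corpus A
        if freq_a / size_a > freq_b / size_b:
            score = log_likelihood(freq_a, freq_b, size_a, size_b)
            rows.append((word, freq_a, freq_b, score))
    return sorted(rows, key=lambda row: row[3], reverse=True)[:top]
```

As a rough check, `log_likelihood(19989, 79, 6240764, 990587)` (the counts for 'sp' and the 2020 and 2017 token totals reported later in this notebook) comes out at roughly 5171, in line with the 2020 vs. 2017 keyness table. That suggests, but does not confirm, that `calculate_keyness` uses a comparable statistic.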

In [11]:
word_dist_1718 = Counter()

for doc in LN2017 + LN2018:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    word_dist_1718.update(tokens)

In [12]:
word_dist_1718.most_common(50)

[('the', 164204),
 ('and', 89485),
 ('to', 86064),
 ('of', 82304),
 ('in', 63553),
 ('a', 53698),
 ('that', 34983),
 ('for', 31136),
 ('is', 29651),
 ('on', 26437),
 ('with', 20593),
 ('china', 17953),
 ('as', 17894),
 ('it', 16499),
 ('this', 16482),
 ('by', 15536),
 ('are', 15536),
 ('chinese', 14665),
 ('its', 14055),
 ('us', 13232),
 ('from', 13046),
 ('has', 12579),
 ('have', 12527),
 ('be', 12221),
 ('at', 12083),
 ('we', 11914),
 ('you', 11911),
 ('i', 10420),
 ('was', 9943),
 ('will', 9932),
 ('said', 9685),
 ('an', 9412),
 ('not', 9267),
 ('or', 8846),
 ('but', 8502),
 ('they', 8204),
 ('about', 8199),
 ('government', 8006),
 ('market', 7842),
 ('more', 7401),
 ('their', 7203),
 ('he', 6726),
 ('which', 6515),
 ('company', 6405),
 ('new', 6380),
 ('our', 6379),
 ('so', 6165),
 ('other', 6045),
 ('also', 5950),
 ('were', 5931)]

In [13]:
word_dist_20021 = Counter()

for doc in LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    word_dist_20021.update(tokens)

In [14]:
word_dist_20021.most_common(50)

[('the', 382941),
 ('to', 207677),
 ('of', 206041),
 ('and', 179830),
 ('in', 141501),
 ('a', 108719),
 ('for', 76729),
 ('that', 69891),
 ('is', 63216),
 ('on', 57703),
 ('with', 47751),
 ('as', 41088),
 ('by', 40630),
 ('its', 38576),
 ('from', 38234),
 ('at', 35252),
 ('this', 31362),
 ('it', 30498),
 ('are', 30014),
 ('has', 29719),
 ('will', 27785),
 ('us', 27357),
 ('be', 27348),
 ('was', 23419),
 ('have', 23141),
 ('we', 22679),
 ('an', 22152),
 ('mln', 21632),
 ('china', 21493),
 ('sp', 21044),
 ('chinese', 20337),
 ('consensus', 20275),
 ('capital', 19818),
 ('not', 18213),
 ('or', 18108),
 ('company', 17641),
 ('2020', 16308),
 ('iq', 16212),
 ('you', 16195),
 ('vs', 15933),
 ('new', 15356),
 ('which', 15350),
 ('our', 15250),
 ('said', 15138),
 ('last', 14921),
 ('but', 14776),
 ('i', 14403),
 ('market', 14369),
 ('up', 14131),
 ('than', 13791)]

### Keyness for 2017, 2018 vs. 2020, 2021

In [15]:
LN2017_token_cnt = sum(LN2017_word_dist.values())
LN2020_token_cnt = sum(LN2020_word_dist.values())

In [16]:
print(f'{LN2017_token_cnt} tokens in 2017')
print(f'{LN2020_token_cnt} tokens in 2020')

990587 tokens in 2017
6240764 tokens in 2020


* From previous notebooks, LN2020 is known to be noticeably larger than LN2017 (2020 contains more articles), so it makes sense that 2020 has more tokens: roughly 6.3 times as many.
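
Because the two corpora differ so much in size, raw counts are not directly comparable. One common way to put them on the same scale is to convert each count to occurrences per million tokens. The sketch below is illustrative only: `per_million` is a hypothetical helper, and the 'china' counts are the ones reported in this notebook's 2017 vs. 2020 keyness table.

```python
def per_million(count, corpus_size):
    """Convert a raw count to occurrences per million tokens."""
    return count / corpus_size * 1_000_000


# token totals printed in the cell above
tokens_2017 = 990587
tokens_2020 = 6240764

# raw counts for 'china' from the 2017 vs. 2020 keyness table
china_2017 = per_million(6026, tokens_2017)
china_2020 = per_million(17828, tokens_2020)
print(f'{china_2017:.0f} per million in 2017, {china_2020:.0f} per million in 2020')
```

On these numbers 'china' has the lower raw count in 2017 but the higher relative frequency, which is why it can still appear as a key word for 2017.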

In [17]:
calculate_keyness(word_dist_1718, word_dist_20021, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
2017                     3942      1062      5241.705
china                    17953     21493     4616.499
2016                     2412      739       3001.737
corresponding            1354      41        2982.760
steel                    1774      334       2744.390
chinese                  14665     20337     2507.982
tariffs                  1953      610       2405.015
you                      11911     16195     2166.377
and                      89485     179830    1957.028
trade                    5712      6207      1825.555
i                        10420     14403     1800.583
2015                     1396      431       1730.089
blockchain               1079      233       1583.424
industry                 3633      3345      1577.596
2018                     2898      2317      1557.185
zte                      1124      313       1470.339
google                   1795      1108      1315.122
statements           

In [18]:
calculate_keyness(word_dist_20021, word_dist_1718, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
consensus                20275     117       12905.762
sp                       21044     353       11861.430
vs                       15933     58        10423.305
2020                     16308     328       8888.729
capital                  19818     906       8523.569
guidance                 10664     197       5908.250
tiktok                   7132      10        4818.737
q1                       6890      36        4415.672
q4                       6400      15        4263.111
eps                      6717      44        4235.077
beats                    6109      5         4167.399
expected                 12903     1018      4161.859
sees                     6905      102       3973.249
pandemic                 5809      36        3678.319
2019                     7042      264       3259.788
co                       9094      743       2863.266
dec                      4878      107       2611.555
per               

**Observations**:
* Comparing the two sets, no strong difference in sentiment is apparent between the words; however, the lists still give some insight into the shifts between 2017 + 2018 and 2020 + 2021.
* To get the more obvious difference out of the way: 2020 and 2021 have a higher relative frequency of health-related terms than 2017 and 2018. Words like 'pandemic', 'patients', 'fda', 'clinical', and 'outbreak' appear at higher rates than in the earlier years as a result of COVID. The words that appear with greater frequency in 2017 and 2018 are more tech-related: 'blockchain', 'google', 'zte', 'internet', 'bitcoin', and the word 'technology' itself.

* 'cooperation', 'forwardlooking', and 'independently' seem to be more positive words that appear in the earlier years (2017, 2018), whereas 'excluding' and 'misses' in 2020 and 2021 carry a more negative undertone.
    * There is a mix of potentially positive and negative words in the 2020 + 2021 list, but it still differs visibly from the 2017 + 2018 list, which reads as more positive or neutral.

### Keyness for 2017 vs. 2020

In [19]:
## the frequency (keyness) of words that appear more often in 2017 articles than 2020 articles

calculate_keyness(LN2017_word_dist, LN2020_word_dist)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
2017                     1622      862       3495.423
2016                     1311      607       2996.710
china                    6026      17828     2246.590
corresponding            526       36        1834.363
2015                     743       369       1649.434
and                      30839     151577    1494.150
internet                 1034      1393      1210.148
blockchain               425       138       1103.287
statements               710       687       1088.942
industry                 1354      2564      1087.025
chinese                  4702      16667     1083.862
cooperation              554       410       1008.583
bitcoin                  272       42        846.686
weibo                    319       114       802.646
forwardlooking           403       263       786.104
investment               1040      2197      717.536
steel                    359       229       708.614
analyzed                 2

In [20]:
## the frequency (keyness) of words that appear more often in 2020 articles than 2017 articles

calculate_keyness(LN2020_word_dist, LN2017_word_dist)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
sp                       19989     79        5171.172
consensus                19371     59        5140.297
vs                       15292     23        4252.162
2020                     14596     132       3317.967
capital                  19005     430       3181.423
guidance                 10016     58        2467.865
q1                       6761      9         1890.685
eps                      6613      24        1726.129
sees                     6643      27        1713.277
expected                 11297     318       1677.374
2019                     6252      23        1629.651
q4                       5767      11        1583.175
pandemic                 5154      16        1365.373
per                      9159      300       1230.451
reports                  10235     388       1228.270
patients                 4998      35        1194.250
dec                      4611      34        1091.672
co                   

**Observations**:
* Isolating just the two years (2017, 2020) yields results similar to the previous analysis.
    * Pandemic coverage runs through many of the 2020 articles, likely because it is hard to discuss COVID without mentioning China or its government.

---