# Data Exploration (2.1):
## KWIC Analyses
* This notebook focuses on KWIC (keyword-in-context) analyses for select words gathered from the frequency list or from my own prior knowledge of how China is covered in the media.
    * split between two notebooks


---

### Setup
* includes additional modules, functions, and `chars_to_remove`

In [1]:
import json
from collections import Counter
import random
import os
import re
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import graphviz

In [2]:
%run functions.ipynb

In [3]:
chars_to_remove = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~•'

## 1. Load data

In [4]:
LN2017 = json.load(open('../data/LN_data/cleaned/LN_2017.json'))
LN2018 = json.load(open('../data/LN_data/cleaned/LN_2018.json'))
LN2019 = json.load(open('../data/LN_data/cleaned/LN_2019.json'))
LN2020 = json.load(open('../data/LN_data/cleaned/LN_2020.json'))
LN2021 = json.load(open('../data/LN_data/cleaned/LN_2021.json'))

In [5]:
all_word_dist = Counter()

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    all_word_dist.update(tokens)

## 2. KWIC
* Using the list of words from `LN_data_exploration1`, I performed several KWIC analyses to gain more insight into how the words are used.
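
`tokenize`, `make_kwic`, `sort_kwic`, and `print_kwic` come from `functions.ipynb` and are not shown in this notebook. As a rough illustration of what a KWIC extraction does, here is a minimal sketch. The names `sketch_tokenize`, `sketch_make_kwic`, and `sketch_format_kwic`, the 4-token window, and the `(left, keyword, right)` tuple layout are all assumptions made for illustration, not the actual implementations.

```python
def sketch_tokenize(text, lowercase=True, strip_chars=''):
    """Split on whitespace, optionally lowercase, and delete unwanted characters."""
    if lowercase:
        text = text.lower()
    table = str.maketrans('', '', strip_chars)
    tokens = (tok.translate(table) for tok in text.split())
    # Drop tokens that became empty after stripping (e.g. a lone dash).
    return [tok for tok in tokens if tok]


def sketch_make_kwic(keyword, tokens, window=4):
    """Return one (left, keyword, right) tuple per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines


def sketch_format_kwic(line, width=40):
    """Right-align the left context so the keywords line up in one column."""
    left, keyword, right = line
    return f"{' '.join(left):>{width}}  {keyword}  {' '.join(right)}"
```

Deleting characters inside tokens (rather than splitting on them) would explain forms such as `socialmedia` and `wellknown` in the outputs below, although the real `tokenize` may behave differently.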

### KWIC of 'tiktok'

In [6]:
all_kwic_tiktok = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('tiktok', tokens)
    all_kwic_tiktok.extend(kwic)

In [7]:
len(all_kwic_tiktok)

7703
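
Each keyword in this notebook re-tokenizes the full corpus with the same loop. A possible refactor, sketched below, tokenizes once and reuses the result. `tokenize_corpus` and `collect_kwic` are hypothetical helper names, and the tokenizer and KWIC function are passed in as parameters so the sketch does not depend on the exact signatures in `functions.ipynb`.

```python
def tokenize_corpus(docs, tokenize_fn, strip_chars=''):
    """Tokenize every document body once so the result can be reused per keyword."""
    return [
        tokenize_fn(doc['body'], lowercase=True, strip_chars=strip_chars)
        for doc in docs
    ]


def collect_kwic(keyword, tokenized_docs, kwic_fn):
    """Gather the KWIC lines for one keyword across all tokenized documents."""
    lines = []
    for tokens in tokenized_docs:
        lines.extend(kwic_fn(keyword, tokens))
    return lines


# Intended use in this notebook (assumes tokenize and make_kwic are loaded):
# tokenized = tokenize_corpus(LN2017 + LN2018 + LN2019 + LN2020 + LN2021,
#                             tokenize, chars_to_remove)
# all_kwic_tiktok = collect_kwic('tiktok', tokenized, make_kwic)
```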

In [8]:
all_kwic_tiktok_s1 = random.sample(all_kwic_tiktok, 50)
all_kwic_tiktok_s1 = sort_kwic(all_kwic_tiktok_s1, 'L1')
print_kwic(all_kwic_tiktok_s1)

                        dress up weird a  tiktok  video of richer catching
                 to raise concerns about  tiktok  the chinese socialmedia service
   providing evidence washington accuses  tiktok  whose parent company is
             continuing talks to acquire  tiktok  while apple remained hot
            company in december alleging  tiktok  surreptitiously vacuumed up and
          mobile applications wechat and  tiktok  to safeguard the national
               wellknown short video app  tiktok  according to trump chinese
              situation and officials at  tiktok  and oracle did not
                     week that would ban  tiktok  in 45 days if
                29 india suddenly banned  tiktok  the app which is
                   users data to beijing  tiktok  filed this week for
                  so microsoft might buy  tiktok  really a really other
             american company should buy  tiktok  so thateveryone can continue
                     of data gathered 

**Observations**:
* Similar to Huawei, the context for TikTok mainly relates to coverage of the potential security threats to users who download TikTok
    * ridiculous, especially since most users of any social media site are already aware that companies are constantly stealing our information. What difference does it make whether TikTok has it or Facebook has it? Apparently a huge difference, because news outlets would not stop talking about the apparent ~ dangers of TikTok ~
        * most of these voices tend to be those of the older generation, who have a stronger aversion to technology, while younger audiences who have grown up with technology all their lives are less paranoid about the dangers of tech
        * Facebook and Google are consistently exposed for keeping full-on detailed profiles of their users, and while some Americans are vocal about it, most don't seem to mind
            * but of course it's different if cHiNa does it cause china bad, america good :)
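
The 50-line samples in this notebook are drawn with `random.sample`, so every rerun shows different lines and the observations may quote lines that are no longer in the printed output. A minimal sketch of a reproducible alternative follows. `sample_kwic` and its default seed are assumptions, not part of `functions.ipynb`. It also caps the sample size, since `random.sample` raises `ValueError` when a keyword has fewer hits than the requested sample.

```python
import random


def sample_kwic(lines, k=50, seed=0):
    """Draw up to k KWIC lines with a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    return rng.sample(lines, min(k, len(lines)))
```

For example, `sample_kwic(all_kwic_tiktok, 50, seed=2021)` would return the same 50 lines on every run, provided `all_kwic_tiktok` itself is built in the same order.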

### KWIC of 'beijing' 

In [9]:
all_kwic_beijing = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('beijing', tokens)
    all_kwic_beijing.extend(kwic)

In [10]:
len(all_kwic_beijing)

8071

In [11]:
all_kwic_beijing_s1 = random.sample(all_kwic_beijing, 50)
all_kwic_beijing_s1 = sort_kwic(all_kwic_beijing_s1, 'L1')
print_kwic(all_kwic_beijing_s1)

                                          beijing  confirmed friday that it
                      program and axis a  beijing  group specializing in leisure
             other fields washington and  beijing  are fighting over us
                        the holy see and  beijing  according to the americans
                         one deal us and  beijing  officials lawmakers and trade
                   pointed the finger at  beijing  over an extensive hacking
                    los angeles tel aviv  beijing  bangalore berlin london singapore
                     kong was accused by  beijing  of supporting antichina people
         draconian regulations issued by  beijing  in a bid to
                trade and currency fears  beijing  and washington still in
               government can enter from  beijing  to spy on the
             a intelligent response from  beijing  targets 20 chinese public
                    claims the app gives  beijing  access to the personal
              

**Observations**:
* 'beijing' is used as a synonym for China.
    * "its used by the  beijing  government for espionage efforts"
* Interestingly, even though it's used as a synonym, the surrounding context for 'beijing' seems more negative than that of plain 'china'.
    * this is likely because 'beijing' appears much less frequently than 'china'
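
The frequency claim can be checked against the `all_word_dist` counter built in the setup. The sketch below normalizes a raw count to occurrences per million tokens so that two words can be compared directly. `per_million` is a hypothetical helper name.

```python
from collections import Counter


def per_million(word, dist):
    """Relative frequency of word in a Counter, per million tokens."""
    total = sum(dist.values())
    return 1_000_000 * dist[word] / total if total else 0.0


# Intended use in this notebook (assumes all_word_dist from the setup cell):
# per_million('beijing', all_word_dist), per_million('china', all_word_dist)
```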

### KWIC of 'propaganda'

In [12]:
all_kwic_propaganda = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('propaganda', tokens)
    all_kwic_propaganda.extend(kwic)

In [13]:
len(all_kwic_propaganda)

801

In [14]:
all_kwic_propaganda_s1 = random.sample(all_kwic_propaganda, 50)
all_kwic_propaganda_s1 = sort_kwic(all_kwic_propaganda_s1, 'L1')
print_kwic(all_kwic_propaganda_s1)

                spread positive energy a  propaganda  catch phrase used by
                  china quietly pulled a  propaganda  film celebrating its tech
                the chinese government a  propaganda  consolation prize blocking a
                         on to china and  propaganda  that may come the
       chinese governments messaging and  propaganda  have been at fanning
            center on disinformation and  propaganda  related to coronavirus ms
                 york for commercial and  propaganda  purposes since 2011 after
                   a vast censorship and  propaganda  apparatus at home the
                 debate the line between  propaganda  and public diplomacy in
             exposed to the brainwashing  propaganda  and kept away from
                  as saying we broadcast  propaganda  radio programs to china
                accused of promoting ccp  propaganda  and falsely rumored that
                 year we detailed chinas  propaganda  efforts through t

### KWIC of 'orwellian'

In [15]:
all_kwic_orwellian = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('orwellian', tokens)
    all_kwic_orwellian.extend(kwic)

In [16]:
len(all_kwic_orwellian)

78

In [17]:
all_kwic_orwellian_s1 = random.sample(all_kwic_orwellian, 50)
all_kwic_orwellian_s1 = sort_kwic(all_kwic_orwellian_s1, 'L1')
print_kwic(all_kwic_orwellian_s1)

                     to create an almost  orwellian  surveillance state you know
                 local authorities in an  orwellian  step now established a
                   ccp has instituted an  orwellian  system of surveillance intimidation
                sinister creep toward an  orwellian  world   
                     not only created an  orwellian  state apparatus that oppresses
                 local authorities in an  orwellian  step now established a
                 not only constructed an  orwellian  hightech surveillance state at
                   ccp has instituted an  orwellian  system of surveillance intimidation
                     not only created an  orwellian  state apparatus that oppresses
                   ccp has instituted an  orwellian  system of surveillance intimidation
                     not only created an  orwellian  state apparatus that has
                   the campaign takes an  orwellian  theme with the line
                    china will beco

**Observations**:
* For both 'propaganda' and 'orwellian', I find the contexts displayed to be amusing. The dissonance and lack of cognizance are astounding. The idea behind the usage of both words is simply to evoke fear: "Big Brother is Watching You", and that has always been the message when American media discusses Chinese technology.
    * but America does the same thing... surveillance occurs everywhere... it doesn't make sense to point fingers and criticize China when you do the same thing, and it's not like America is doing much to keep big tech in America from getting any bigger
* pot calling the kettle black type of situation
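
The 'orwellian' sample above repeats several lines verbatim (for example "ccp has instituted an  orwellian  system of surveillance intimidation"), which suggests the same passages occur more than once in the corpus. A minimal sketch for collapsing identical KWIC lines before sampling is shown below. `dedupe_kwic` is a hypothetical helper, and it makes no assumption about whether a KWIC line is a string, list, or tuple.

```python
def dedupe_kwic(lines):
    """Keep the first occurrence of each distinct KWIC line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        key = repr(line)  # hashable key that works for strings, lists, and tuples
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique
```

Comparing `len(dedupe_kwic(all_kwic_orwellian))` with the 78 raw hits would show how much of the count comes from repeated text.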

### KWIC of 'reality'

In [18]:
all_kwic_reality = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('reality', tokens)
    all_kwic_reality.extend(kwic)

In [19]:
all_kwic_reality_s1 = random.sample(all_kwic_reality, 50)
all_kwic_reality_s1 = sort_kwic(all_kwic_reality_s1, 'L1')
print_kwic(all_kwic_reality_s1)

                    vision for america a  reality     
                         rollout of 5g a  reality  companies need to invest
                      begins to become a  reality  and operators make plans
          selfdriving cars and augmented  reality  trump said it will
                 envision has now become  reality  china has made remarkable
                 to engineer and control  reality  hrw said in the
                 lead to social disgrace  reality  or fiction many people
                   is quite a disturbing  reality  which cannot fail to
                     become the de facto  reality  for their china operations
                    us operations but in  reality  bytedance is no other
                       the league but in  reality  it is what will
                budget challenges but in  reality  it could actually push
                         a friend but in  reality  its an enemy its
                  and communist party in  reality  it is probably both
  

**Observations**:
* Drawing on fears of "new" and "unknown" realities --> naturally makes people hesitant and scared
    * virtual, augmented, hologram realities
    * "decoupling is becoming a  reality  the tiktok and wechat" - TikTok and WeChat both being Chinese media platforms
    * "harderhitting  reality"

### KWIC of 'regime'

In [20]:
all_kwic_regime = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('regime', tokens)
    all_kwic_regime.extend(kwic)

In [21]:
all_kwic_regime_s1 = random.sample(all_kwic_regime, 50)
all_kwic_regime_s1 = sort_kwic(all_kwic_regime_s1, 'L1')
print_kwic(all_kwic_regime_s1)

               uighur minority related a  regime  that fears religion new
                 reminding people that a  regime  whether then the soviet
                     just as the beijing  regime  and its dictatorforlife 8220president8221
             day the totalitarian castro  regime  has continued to surveil
             day the totalitarian castro  regime  has continued to surveil
                      party of china ccp  regime  in my research on
           to chinas rigorous censorship  regime  the senators wrote chinese
         ideology chinas vast censorship  regime  is without parallel freedom
       chinas strict internet censorship  regime  known as the great
              comply with its censorship  regime  as a prerequisite to
           to chinas rigorous censorship  regime  after a cyberattack that
                  demands of the chinese  regime  and closed the accounts
               bloomberg read on chinese  regime  announced retaliation against us
               

**Observations**:
* Very obvious that 'regime' has negative undertones, as it's paired with:
    * totalitarian, communist, authoritarian, dishonest, encryption
* "world of russias old  regime  and hint at the" - additionally, this example draws upon Cold War sentiments (negative history for Americans) by mentioning "russia's old regime"
* "clear to the maduro  regime  that it is our" - [Russia and China support Venezuelan president Maduro, US supports interim president Guaidó](https://www.bbc.com/news/world-latin-america-47053701)
    * further implies how America associates the word 'regime' with communist and/or dictatorship governments
    * additional names: Castro, Kim Jong Un
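
The words paired with 'regime' were read off a 50-line sample. To tally them across every hit, one option is to count the token immediately left of the keyword (the L1 position used for sorting above). This is a sketch under the assumption that each KWIC line is a `(left_tokens, keyword, right_tokens)` tuple. The real structure returned by `make_kwic` may differ, and `l1_collocates` is a hypothetical name.

```python
from collections import Counter


def l1_collocates(lines):
    """Count the token directly to the left of the keyword in each KWIC line."""
    counts = Counter()
    for left, _keyword, _right in lines:
        if left:  # skip hits at the very start of a document
            counts[left[-1]] += 1
    return counts


# Intended use in this notebook (assumes the tuple layout described above):
# l1_collocates(all_kwic_regime).most_common(20)
```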

---