# Data Exploration (2):
## KWIC Analyses
* This notebook runs KWIC (keyword-in-context) analyses on a selection of words gathered from the frequency list or from my own prior knowledge of how China is covered in the media.
    * split between two notebooks


---

### Setup
* includes additional modules, functions, and `chars_to_remove`

In [1]:
import json
from collections import Counter
import random
import os
import re
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import graphviz

In [2]:
%run functions.ipynb

In [3]:
chars_to_remove = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~•'

## 1. Load data

In [4]:
LN2017 = json.load(open('../data/LN_data/cleaned/LN_2017.json'))
LN2018 = json.load(open('../data/LN_data/cleaned/LN_2018.json'))
LN2019 = json.load(open('../data/LN_data/cleaned/LN_2019.json'))
LN2020 = json.load(open('../data/LN_data/cleaned/LN_2020.json'))
LN2021 = json.load(open('../data/LN_data/cleaned/LN_2021.json'))

In [5]:
all_word_dist = Counter()

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    all_word_dist.update(tokens)

## 2. KWIC
* using the list of words from `LN_data_exploration1`, I performed several KWIC analyses to gather more insight on the words. 
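
`make_kwic`, `sort_kwic`, and `print_kwic` come from `functions.ipynb`, so their definitions are not shown in this notebook. As a rough illustration of the idea only, a KWIC routine could look something like the sketch below. The name `simple_kwic`, its `window` parameter, and the tuple layout are assumptions for illustration, not the actual helpers.

```python
def simple_kwic(keyword, tokens, window=4):
    """Return (left context, keyword, right context) for every hit of keyword.

    Hypothetical stand-in for make_kwic from functions.ipynb.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]    # up to `window` tokens before the hit
            right = tokens[i + 1:i + 1 + window]   # up to `window` tokens after the hit
            hits.append((left, tok, right))
    return hits


# Toy example (made-up sentence, not taken from the LN data)
toy_tokens = "the us and china signed a deal but china later withdrew".split()
toy_hits = simple_kwic("china", toy_tokens, window=3)
for left, kw, right in toy_hits:
    # right-align the left context so the keywords line up in a column
    print(f"{' '.join(left):>30}  {kw}  {' '.join(right)}")
```

Right-aligning the left context is what produces the column layout seen in the outputs below.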

### KWIC on 'bartiromo'
* double-checking the context for this word (a last name): it appears very often, but I'm unfamiliar with it

In [6]:
all_kwic_bartiromo = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('bartiromo', tokens)
    all_kwic_bartiromo.extend(kwic)

In [7]:
len(all_kwic_bartiromo)

15510

In [8]:
all_kwic_bartiromo_s1 = random.sample(all_kwic_bartiromo, 10)
all_kwic_bartiromo_s1 = sort_kwic(all_kwic_bartiromo_s1, 'L1')
print_kwic(all_kwic_bartiromo_s1)

                                          bartiromo  fbn host good morning
                 with no air conditioner  bartiromo  not good mcdowell theyre
                    a lot more expensive  bartiromo  yes youre right freeman
                  thank you good morning  bartiromo  so much to unpack
                  was still a republican  bartiromo  thats right dagen mcdowell
             property varney thats right  bartiromo  but i just im
                       by his legal team  bartiromo  yes mcdowell and he
                   suspect also that the  bartiromo  i know ray attorney
                and easier deangelis yes  bartiromo  jackie thank you so
                you listen simonetti yes  bartiromo  because this is the


**Observations**:
(As this isn't a main part of my analysis, I limited it to 10 results to take up less space)
* My assumption was correct; it refers to Maria Bartiromo, a journalist and Fox News anchor who shares her "takes" on China on Fox News and in her latest book, "The Cost."
    * Bartiromo: on Fox News, sees China as an enemy --> not a surprising conclusion

### KWIC of 'china'

In [9]:
all_kwic_china = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('china', tokens)
    all_kwic_china.extend(kwic)

In [10]:
len(all_kwic_china)

61399

In [11]:
all_kwic_china_s1 = random.sample(all_kwic_china, 50)
all_kwic_china_s1 = sort_kwic(all_kwic_china_s1, 'L1')
print_kwic(all_kwic_china_s1)

                                          china  refuted the explosive claims
                    fcc priority in 2011  china  mobile usa which is
                     30 stake in allianz  china  general insurance becoming the
               human capital has allowed  china  to better realize the
                      between the us and  china  saying the motivation was
                   the united states and  china  are still building also
         asiapacific excluding japan and  china  middle east latin america
                    countries the us and  china  are not the only
                   the united states and  china  the infrastructure programme is
                stock market or angering  china  by calling too much
                 estate progress such as  china  overseas 267 was more
                   downgraded to hold at  china  renaissance posts alltime lows
                while throwing stones at  china  shirk said that as
             security law implemented by  ch

**Observations**: 
* In this sample, 'china' is always used in the same sense (the country, not 'fine china'/'porcelain')
* Looking at a random sample for 'china', there isn't a strong sense of contempt or pity; the hits relate to a wide array of emotions and topics. This is not unexpected, since this is one of the most common and broadest terms in the dataset (there are 61,399 instances of 'china'). While there appears to be some negativity within the list, in these cases it seems more nitpicky than a clear example of articles contributing to sinophobia. Of course, KWIC lists only show a small portion of each article, so it's hard to tell from them alone whether an entire article is harmful. A collocation analysis may provide more insights.
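
A collocation analysis along these lines could start from simple window counts. The sketch below is only an assumption about how that might look: `window_collocates` is a hypothetical helper, it uses raw co-occurrence counts rather than an association measure, and the stopword set is a placeholder.

```python
from collections import Counter


def window_collocates(keyword, tokens, window=4, stopwords=frozenset()):
    """Count tokens that appear within `window` positions of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        # tokens to the left and right of this hit, excluding the hit itself
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.update(t for t in neighbours if t not in stopwords and t != keyword)
    return counts


# Toy example (made-up sentence, not taken from the LN data)
toy_tokens = "the us and china trade deal stalled as china trade talks resumed".split()
toy_counts = window_collocates(
    "china", toy_tokens, window=2, stopwords={"the", "and", "as"}
)
print(toy_counts.most_common(3))
```

On the real data, the counts from each article would be added into one `Counter`, much like `all_word_dist` is built above.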

### KWIC of 'government'

In [12]:
all_kwic_gov = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('government', tokens)
    all_kwic_gov.extend(kwic)

In [13]:
len(all_kwic_gov)

30463

In [14]:
all_kwic_gov_s1 = random.sample(all_kwic_gov, 50)
all_kwic_gov_s1 = sort_kwic(all_kwic_gov_s1, 'L1')
print_kwic(all_kwic_gov_s1)

                        to 2025 table 10  government  enduser market share breakdown
                   private sector with a  government  that plans and executes
               including us military and  government  employees with highlevel security
            chinese culture business and  government  williams said the trip
     cooperation and partnership between  government  law enforcement and private
                forced to choose between  government  funding and highquality costeffective
               guidance from the central  government  to go abroad seeking
                     ties to the chinese  government  an accusation the company
                   work with the chinese  government  is a false statement
                     us says the chinese  government  blurs the lines between
                   behalf of the chinese  government  and they should not
            throughout china the chinese  government  should immediately shut down
                      in sum the 

**Observations**:
* 'government' shows a much more interesting sample. Looking at the first few examples and rerunning the cell to draw different random samples, words and phrases such as "contempt", "crack down", "detention facilities", "monitoring", and "vulnerabilities" appear as additional context for 'government'.
    * Additional samples also clearly describe China's government as "weak", "stringent", and "communist".
* My first analysis noted that there didn't seem to be any outward negativity towards Chinese people; it's much easier for news media to criticize the Chinese government instead. At first glance, this seems like a better, harmless alternative. Yet the jump from criticizing the government to criticizing the people isn't very far; even if the connection isn't conscious, negative assumptions about the government will lead to negative assumptions about the people. History offers several examples of this broad categorization; the most relevant one in ASAM history is the reasoning behind America's internment camps for Japanese Americans.
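
Because `random.sample` draws a different sample on every run, phrases quoted from earlier runs may not appear in the output shown above. One option would be to draw each sample from a seeded generator so a given sample can be reproduced. In this sketch, `sample_kwic` and the seed values are hypothetical choices, not something the notebook currently does.

```python
import random


def sample_kwic(kwic_lines, k, seed):
    """Draw k KWIC lines reproducibly from a dedicated, seeded generator."""
    rng = random.Random(seed)        # separate generator, global random state untouched
    k = min(k, len(kwic_lines))      # avoid ValueError when there are fewer than k hits
    return rng.sample(kwic_lines, k)


# Toy example with placeholder lines instead of real KWIC hits
toy_lines = [f"line {i}" for i in range(100)]
first = sample_kwic(toy_lines, 5, seed=1)
again = sample_kwic(toy_lines, 5, seed=1)
print(first == again)  # the same seed gives the same sample
```

Changing the seed gives a fresh sample, and noting the seed next to each observation makes it possible to recover the exact lines being discussed.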

### KWIC of 'control'

In [15]:
all_kwic_control = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('control', tokens)
    all_kwic_control.extend(kwic)

In [16]:
len(all_kwic_control)

4960

In [17]:
all_kwic_control_s1 = random.sample(all_kwic_control, 50)
all_kwic_control_s1 = sort_kwic(all_kwic_control_s1, 'L1')
print_kwic(all_kwic_control_s1)

              submarines and command and  control  he said we expect
            maritime risk management and  control  while esper according to
              and application payers can  control  their home appliances payers
                 are beyond the companys  control  which may cause the
                 are beyond the companys  control  which may cause the
             braking and adaptive cruise  control  highly autonomous vehiclesin other
 addressing chinese government direction  control  and subsidization of chinese
                   great wall of disease  control  has been put in
                 the centers for disease  control  and prevention says an
        agencies the manufacturer doesnt  control  the cyber components it
                    its efforts to exert  control  at home and expand
divisions counterintelligence and export  control  section the criminal divisions
          panopto which allows finegrain  control  limiting access to those
                       

### KWIC of 'freedom'

In [18]:
all_kwic_freedom = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('freedom', tokens)
    all_kwic_freedom.extend(kwic)

In [19]:
all_kwic_freedom_s1 = random.sample(all_kwic_freedom, 50)
all_kwic_freedom_s1 = sort_kwic(all_kwic_freedom_s1, 'L1')
print_kwic(all_kwic_freedom_s1)

                                          freedom  house issued the following
                                          freedom  support hong kong those
            their commitment to academic  freedom  to uphold the principle
                the precepts of academic  freedom  and threatens global scholarly
                  of speech and academic  freedom  wednesdays gao report studied
              write respect for academic  freedom  and other human rights
                  of speech and academic  freedom  wednesdays gao report studied
                  said that the academic  freedom  in the us is
      cyber operation codenamed airborne  freedom  and launched by the
          our behaviors connectivity and  freedom  the extent to which
                     freedom no pain and  freedom  from the most bothersome
              freedom of association and  freedom  of thought isnt that
                     freedom no pain and  freedom  from the most bothersome
               freedom o

**Observations** (for 'control' and 'freedom'):
* Unlike the rest of the words in this analysis, the context for 'control' is more obviously negative. Whether the hits criticize government practices ("a wider government move to tighten control over freedom of expression said human", "pathology of the chinese government to control the narratives of the country both") or internet policies ("theres no doubt about it their control over the internet is going to", "over the domestic internet and closely control what information may be accessed by", "heralding a new era of state control over the mobile web starting on"), the random sample seems to serve the main purpose of inciting fear and perpetuating the image of a tight/restrictive communist regime.
* While seemingly innocent, stories of "controlling" governments and instances of a "lack of freedom" especially resonate with American readers, who are very sensitive to and passionate about protecting their rights. The choice to use "freedom" is deliberate. This directly fuels the belief that the West is "superior" to the "inferior, less progressive" East, which needs the West to liberate it with the "real news" in place of the communist propaganda the Chinese government has been feeding its people. But every country produces propaganda. This narrative of China as a "threat" just happens to be so widespread (even outside of the States) and so constantly in our faces that people accept it as fact. After all, everyone is saying the same thing, so shouldn't the information be true?

### KWIC of 'communist'

In [20]:
all_kwic_comm = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('communist', tokens)
    all_kwic_comm.extend(kwic)

In [21]:
len(all_kwic_comm)

4427

In [22]:
all_kwic_comm_s1 = random.sample(all_kwic_comm, 50)
all_kwic_comm_s1 = sort_kwic(all_kwic_comm_s1, 'L1')
print_kwic(all_kwic_comm_s1)

                                          communist  party is stepping up
                  the podium before 2300  communist  party delegates to deliver
          location and internet activity  communist  china uses millions of
               line even more alarmingly  communist  party cells have reportedly
              the chinese government and  communist  partys efforts to undermine
               on chinese government and  communist  party officials it believes
               industry strategy seen by  communist  leaders as a path
                referring to the chinese  communist  party by its initials
                     the ccp the chinese  communist  party forbes im shocked
               companies and the chinese  communist  party dagen mcdowell fbn
                     used by the chinese  communist  party this is not
        modernday xi jinpinglead chinese  communist  party pose to the
                    reach of the chinese  communist  partys censors and law
        

**Observations**:
* The inclusion of 'communist', especially alongside 'chinese' to form 'chinese communist party', is also a deliberate choice that draws on Western fears built up by decades of propaganda. Since [McCarthyism and the Red Scare](https://millercenter.org/the-presidency/educational-resources/age-of-eisenhower/mcarthyism-red-scare) during the 1950s in America, the word itself has carried negative associations.

### KWIC of 'but'
* 'but' is a coordinating conjunction specifically used to connect two ideas that contrast
    * I thought it was an interesting word to look at because it appears very often among the LN articles and could indicate this type of sentence structure:
        * positive first half *but* ends the sentence negatively
        * disguising something negative as something positive

In [23]:
all_kwic_but = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('but', tokens)
    all_kwic_but.extend(kwic)

In [24]:
len(all_kwic_but)

38656

In [25]:
all_kwic_but_s1 = random.sample(all_kwic_but, 50)
all_kwic_but_s1 = sort_kwic(all_kwic_but_s1, 'L1')
print_kwic(all_kwic_but_s1)

               lower in overnight action  but  the market has since
                   one person to another  but  they dont know contagious
              floyd was resisting arrest  but  there is a new
             outbreak originated in bats  but  the way the virus
                    that they can change  but  they need to change
            which certainly rivals china  but  were in a moment
               blocked in mainland china  but  users can access it
          jurisdiction of this committee  but  they could also benefit
              from the rutgers community  but  from people around the
             by chinese parent companies  but  they have different histories
           patterns which are continuing  but  the current situation has
                   as a separate country  but  with the shift in
                     so its pretty crazy  but  i mean even what
                        for a trade deal  but  some investors are leaning
      down that produces dissatisfaction

### KWIC of 'influence'

In [26]:
all_kwic_influence = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('influence', tokens)
    all_kwic_influence.extend(kwic)

In [27]:
all_kwic_influence_s1 = random.sample(all_kwic_influence, 50)
all_kwic_influence_s1 = sort_kwic(all_kwic_influence_s1, 'L1')
print_kwic(all_kwic_influence_s1)

           beijings growing presence and  influence  in un agencies should
                 is there emerging asian  influence  in american music but
              was presumably the biggest  influence  on the transports but
                 united states by buying  influence  and conducting espionage at
                a chinese company chinas  influence  over tiktoks content critics
                  the increase in chinas  influence  will be felt immensely
      significant concerns about chinese  influence  is hikvision the hangzhoubased
                 pushing back on chinese  influence  let me ask you
                      hr 6010 on chinese  influence  operations which he intends
            american assets from chinese  influence  and possession and serve
           efforts to combat coordinated  influence  operations we disabled 210
               of political and cultural  influence  says mr onyebuchi african
                   to do semicoup either  influence  the campaign 

**Observations**:
* 'influence' is mainly used to describe the reach of China's influence in a negative manner
    * the word itself is quite loaded. It can refer to positive influences on one's life or decisions; in many contexts, however, it implies that one's final decision wasn't made independently but was the result of an external force.
    * "lawmakers wary of chinese  influence  on american lives" - inclusion of 'wary' in the same sentence
    * "are susceptible to chinese  influence  on the political military" - inclusion of 'susceptible' to portray the weakness of one's own mind falling for "Chinese influence"
    * "abroad to counter chinese  influence  has 60 billion in" - inclusion of a large number that quantifies China's influence

### KWIC of 'concerns'

In [28]:
all_kwic_concerns = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('concerns', tokens)
    all_kwic_concerns.extend(kwic)

In [29]:
all_kwic_concerns_s1 = random.sample(all_kwic_concerns, 50)
all_kwic_concerns_s1 = sort_kwic(all_kwic_concerns_s1, 'L1')
print_kwic(all_kwic_concerns_s1)

                   assuage our fears and  concerns  about that or confirm
                    the company have any  concerns  over how police could
                    eager to address any  concerns  senator schumer has and
                    new 5g technology as  concerns  grow about chinese telecommunications
         value amid apparent coronavirus  concerns  and a 75 drop
           wechat do raise cybersecurity  concerns  they are not significantly
                   unity in voicing deep  concerns  about the technology which
          transaction went ahead despite  concerns  raised to the court
               et with google dismissing  concerns  about ip theft in
             quit dragonfly over ethical  concerns  about the project and
              general us economic growth  concerns  1439 comdx energy settlement
                      board the ncsc has  concerns  around huaweis engineering and
                     urge google to heed  concerns  from its own employees
       

**Observations**:
* Adds to the idea that a great deal of uncertainty would arise should China be acknowledged for its advancements and growth.
    * there seems to be so much that could go wrong when something's related to China
    * many of the instances of 'concerns' appear to relate to large-scale issues, such as national security, growth, and trade, with qualifiers like "serious" to heighten the severity of the issue at hand.
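
To check the impression about qualifiers against more than one random sample, the word immediately to the left of each hit (the L1 slot used for sorting above) could be tallied across the whole corpus. Since the structure of the entries returned by `make_kwic` isn't shown here, this sketch works directly on token lists; `left_neighbours` is a hypothetical helper.

```python
from collections import Counter


def left_neighbours(keyword, tokens):
    """Count the token immediately preceding each occurrence of keyword (L1)."""
    return Counter(
        tokens[i - 1]
        for i, tok in enumerate(tokens)
        if tok == keyword and i > 0  # skip a hit at position 0, which has no L1
    )


# Toy example (made-up sentence, not taken from the LN data)
toy_tokens = (
    "serious concerns remain and security concerns grow "
    "while serious concerns persist"
).split()
toy_l1 = left_neighbours("concerns", toy_tokens)
print(toy_l1.most_common())
```

Summing these counters over every article would show which qualifiers actually dominate, rather than relying on whichever 50 lines happened to be sampled.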

### KWIC of 'huawei'

In [30]:
all_kwic_huawei = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('huawei', tokens)
    all_kwic_huawei.extend(kwic)

In [31]:
all_kwic_huawei_s1 = random.sample(all_kwic_huawei, 50)
all_kwic_huawei_s1 = sort_kwic(all_kwic_huawei_s1, 'L1')
print_kwic(all_kwic_huawei_s1)

              in 2019 washington accused  huawei  the chinese technology group
           rather unusual stance against  huawei  and the chinese government
                   seems to tilt against  huawei  even the uk now
       sees washingtons campaign against  huawei  as a political ploy
           the latters sanctions against  huawei  and chinaindia relations continue
         for patent infringement against  huawei  technologies co solaredge is
                     choice but to allow  huawei  in its a false
          of his administration allowing  huawei  technologies co entry to
                         to do right and  huawei  is one of those
             expert for cybersecurity at  huawei  told reporters here that
           equipment could this backfire  huawei  is already the biggest
               trojan horses for beijing  huawei  marine hopes to win
          tighter restrictions on chinas  huawei  technologies would be banned
       government suggested that despite  

**Observations**:
* The hits all relate to negative news about Huawei and the ban, as America continues to push the potential Chinese competitor out of the market.
    * nothing surprising based on what I already know about how Huawei is perceived in the States

### KWIC of 'vs'

In [32]:
all_kwic_vs = []

for doc in LN2017 + LN2018 + LN2019 + LN2020 + LN2021:
    text = doc['body']
    tokens = tokenize(text, lowercase=True, strip_chars=chars_to_remove)
    kwic = make_kwic('vs', tokens)
    all_kwic_vs.extend(kwic)

In [33]:
all_kwic_vs_s1 = random.sample(all_kwic_vs, 10)
all_kwic_vs_s1 = sort_kwic(all_kwic_vs_s1, 'L1')
print_kwic(all_kwic_vs_s1)

                   1031 september ppi 03  vs  briefingcom consensus of 01
                 interest margin was 361  vs  319 in the prior
           closed briefingcom sp futures  vs  fair value 1010 nasdaq
                   09771 0726 sp futures  vs  fair value 090 nasdaq
         then representing 101114 growth  vs  a projected 2019 midpoint
     263267 excluding nonrecurring items  vs  262 sp capital iq
     283293 excluding nonrecurring items  vs  291 sp capital iq
        050 excluding nonrecurring items  vs  053 sp capital iq
     359369 excluding nonrecurring items  vs  363 sp capital iq
             with 2720 contracts trading  vs  open int of 120


**Observations**:
* Not the results I was expecting. All of them seem to be comparing two things, but I'm not quite sure what the topic is about and the surrounding words only make the context more difficult for me to decipher. Overall, it doesn't seem to be useful for the purposes of my project.
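
One way to make sense of hits like these might be to measure how often the keyword sits next to a number, which would suggest figure-versus-figure comparisons rather than "X vs Y" framing. This is only a sketch under that assumption; `share_with_numeric_neighbour` is a hypothetical helper and treats only all-digit tokens as numbers.

```python
def share_with_numeric_neighbour(keyword, tokens):
    """Return the fraction of keyword hits with an all-digit token directly beside them."""
    hits = [i for i, tok in enumerate(tokens) if tok == keyword]
    if not hits:
        return 0.0
    numeric = sum(
        1
        for i in hits
        if (i > 0 and tokens[i - 1].isdigit())
        or (i + 1 < len(tokens) and tokens[i + 1].isdigit())
    )
    return numeric / len(hits)


# Toy example (made-up sentence, not taken from the LN data)
toy_tokens = "margin was 361 vs 319 last quarter and home vs away form".split()
toy_share = share_with_numeric_neighbour("vs", toy_tokens)
print(toy_share)
```

A high share across the corpus would support dropping 'vs' from the project, as concluded above.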

---