# Classifying the Digital Deciders

Fahmida Y Rashid (fr48)

### Introduction to Natural Language Processing

_The Choice Between an Open Internet and a Sovereign One_

On the international stage, there are two visions of the Internet: one sees the Internet as open and free for all ideas, and the other holds that each country should restrict the Internet within its own borders and allow only "approved" ideas. The United States and many other Western countries support the idea of an open Internet. Countries that prefer to restrict speech or monitor citizens, such as Russia and China, support what's called the "sovereign" Internet.

*Digital Deciders* is a term used by [New America](https://www.newamerica.org/cybersecurity-initiative/reports/digital-deciders/analyzing-the-clusters) to refer to countries that have not yet picked a side. We apply natural language processing tools to General Debate speeches made in the United Nations General Assembly from 1970 to 2018 to determine which worldview the Digital Deciders are more likely to lean towards.

[Project on GitHub](https://github.com/fr48/unga)

[Story on GitHub](https://fr48.github.io/prtfl)

## Part One: Creating the Corpus

Before I can make any comparisons, I have to create the corpus of relevant speeches. I began with cybersecurity-related terms, and then realized that using them alone doesn't make sense. Most countries that have made their position clear are the ones talking about security (to convince others). And the idea of voting with your friends means you are looking for other commonalities, such as economics and politics. So I created three different corpora and ran the analysis on each to see which is the most relevant.

### Looking for Relevant Speeches

The first thing I did was to use regexp and grep to look for cyber-related terms in the files. The related terms for cybersecurity were cyber, privacy, security, internet, and hack: `[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack`

The second thing I did was to use regexp and grep to look for economics-related terms. This resulted in a much larger set of files. The related terms for economics were: `[Ee]conom|[Dd]evelop|[Ii]nvest|[Ii]nfrastructure|[Ss]oci|[Ss]ecurity`

The third thing was to look for politics-related terms, which returned an even larger set of files. The related terms for politics were: `[Ss]overeign|[Dd]emocrat|[Ss]tability|[Cc]ooperation|[Ss]ecurity`. Scores for "peace and security" and "international security" are expected to be high because everyone says them, so I have to take that into consideration when comparing.

I then read in the files that contained the desired words and ran them through the TF/IDF.

In [1]:
#processing data 
import pandas as pd
import re

#nltk 
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

import nltk
nltk.download('stopwords')

# feature/text extractions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /home/fyr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1. Cybersecurity-related speeches

* I decided to use regexp on the command line.
* Running `grep -E '[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack' */*` returned all the files that seemed to be talking about cybersecurity.
* The resulting filenames are stored in `cybersecurity_speeches.csv`.
* I ran TF/IDF on each file to check it.
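The same search can be sketched in Python for anyone who prefers to stay inside the notebook. This is only an illustration, not the command used above: the function name `matching_files`, the `.txt` suffix, and the one-directory-deep layout (mirroring the `*/*` glob) are assumptions.

```python
import re
from pathlib import Path

# Same alternation as the grep command above.
CYBER_PATTERN = re.compile(r"[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack")


def matching_files(root, pattern=CYBER_PATTERN, suffix=".txt"):
    """Return sorted paths (relative to root) of files whose text matches pattern."""
    root = Path(root)
    hits = []
    # "*/*" mirrors the shell glob: files exactly one directory below root.
    for path in sorted(root.glob("*/*" + suffix)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(text):
            hits.append(path.relative_to(root).as_posix())
    return hits
```

Calling `matching_files("data")` would return a list of relative paths that could be written out one per line, in the same shape as `cybersecurity_speeches.csv`.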

In [2]:
# all the files for cybersecurity-related speeches
file_list = []
file_name = "data/cybersecurity_speeches.csv"
with open(file_name) as f_input:
    file_list = f_input.read().splitlines()

In [3]:
# for testing, just read in one at a time.
# will need a for loop eventually, but this is supposed to be a manual process as I am verifying each file
file_name = "data/"+file_list[55]
speeches = []

with open(file_name) as f_input:
    speeches.append(f_input.read())

speeches

['I \nwish to convey my greetings to the Secretary-General \nand all the Presidents, Heads of State, delegations and \ninternational organizations from across the world. I \nconvey also my special greetings to those attending this \nannual general debate of the General Assembly.\n\nWe are here once again, as usual, to share \nexperiences relating to leadership and to work for the \nsake of life, humanity, equality and social justice. But \nwe are also here to express our profound differences as \nconcerns life, peace and democracy. Over the past few \ndays I have been listening to the statements made by \ncertain Powers, which leave a lot to be desired in terms \nof liberty, equality, dignity and sovereignty.\n\nThanks to the awareness of the Bolivian people, I \nhave now held the presidency for almost eight years. In \nthat time — despite the economic and financial crises in \nsome so-called developed, industrialized and, I would \neven say, even exaggeratedly industrialized countries

In [4]:
# create a tf/idf of these files

#defining the ngram range to one word, two words, and three words
vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# creating the document
vectors = vectorizer.fit_transform(speeches)

#these are all the (1,3) ngrams
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()

# creating a dataframe with the  feature names as columns
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,000,000 per,000 per municipality,10,10 million,10 million inhabitants,10 per,10 per cent,100,100 per,...,york,york is,york is secure,york see,york see 65,york which,york which is,état,état what,état what peace
0,0.002128,0.002128,0.002128,0.004257,0.002128,0.002128,0.002128,0.002128,0.004257,0.004257,...,0.006385,0.002128,0.002128,0.002128,0.002128,0.002128,0.002128,0.002128,0.002128,0.002128
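Because `speeches` holds a single document here, the IDF part is the same for every term, so the scores are just length-normalized counts. That matches the output above, where every value is a multiple of 0.002128. The sketch below is a plain-Python illustration of that arithmetic, assuming the smoothed IDF and L2 normalization that, as far as I know, are scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`). The function `tfidf` and the toy token lists are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation for pre-tokenised docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# One document: every IDF equals 1, so scores are in the same 2:1:1 ratio as the counts.
single = tfidf([["security", "council", "security", "peace"]])[0]
print(single)
```

With two or more documents, a term that appears in only one of them gets a larger IDF than a term that appears in all of them, which is what makes the scores useful for comparing groups.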


In [5]:
# write to file to make it easier to view. I also transpose the df for easier reading.
df = df.T
df[0]=df[0]*100
df.to_csv(file_name[13:-4]+".csv", header=False)
df

Unnamed: 0,0
000,0.212848
000 per,0.212848
000 per municipality,0.212848
10,0.425696
10 million,0.212848
...,...
york which,0.212848
york which is,0.212848
état,0.212848
état what,0.212848


Manually go through the files and make sure the files are all about cybersecurity. Use the TF/IDF scores to look at whether they were actually part of the conversation. For example, "security" has a high score, but most of the score is because of "Security Council."
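One way to support the manual check is to count how many mentions of "security" are really "Security Council". A minimal sketch, assuming a case-insensitive match is acceptable. The function name `security_counts` is made up for the example.

```python
import re

# Any standalone occurrence of "security".
SECURITY_ANY = re.compile(r"\bsecurity\b", re.IGNORECASE)
# "security" NOT followed by "council"; \s+ also covers the line breaks
# that appear mid-sentence in the speech files.
SECURITY_NOT_COUNCIL = re.compile(r"\bsecurity\b(?!\s+council\b)", re.IGNORECASE)


def security_counts(text):
    """Split mentions of 'security' into 'Security Council' and everything else."""
    total = len(SECURITY_ANY.findall(text))
    other = len(SECURITY_NOT_COUNCIL.findall(text))
    return {"total": total, "security_council": total - other, "other": other}
```

A file where `other` is zero or close to it probably matched the grep pattern only because of the Security Council.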

I now have all the cybersecurity-related files. I am going to split them up into two groups so that I can compare to find out where the similarity is. I am grouping the countries using the [New America](https://www.newamerica.org/cybersecurity-initiative/reports/digital-deciders/analyzing-the-clusters) definitions.

One thing I realized is that there are countries New America doesn't account for, so I am leaving them out right now.

In [6]:
# start again
file_list = []
file_name = "data/cybersecurity_speeches.csv"
with open(file_name) as f_input:
    file_list = f_input.read().splitlines()

In [7]:
# New America definitions
digital = ['ALB','ARG','ARM','BOL','BIH','BWA','BRA','COL','COG','CRI',
           'CIV','DOM','ECU','SLV','GEO','GHA','GTM','HND','IND','IDN',
           'IRQ','JAM','JOR','KEN','KWT','KGZ','LBN','MKD','MYS','MEX',
           'MNG','MAR','NAM','NIC','NGA','PAK','PAN','PNG','PRY','PER',
           'PHL','MDA','SRB','SGP','ZAF','LKA','THA','TUN','UKR','URY']

global_open = ['GBR','CAN','AUS','DEU','JPN','SWE','NLD','USA','NOR',
            'FIN','CHE','EST','ESP','POL','NZL','KOR','AUT','IRL','CZE',
            'PRT','DNK','ITA','LVA','LTU','LUX','BEL','SVN','GRC','CHL',
            'CYP','SVK','ISR','HRV','BGR','HUN','ROU','FRA']
               
sovereign_controlled = ['SAU','ZWE','VEN','SWZ','CUB','IRN','DZE','LBY',
    'QAT','TUR','ARE','BLR','RUS','CHN','KAZ','OMN','BHR','AZE','VNM',
    'CMR','TKM','TJK','SYR','PRK','UZB','AGO','EGY']

In [8]:
sovereigns = []
currents = []
deciders = []
unknowns = []

for one_file in file_list:
    if (one_file[8:11]) in global_open:
        currents.append(one_file)
    elif (one_file[8:11]) in sovereign_controlled:
        sovereigns.append(one_file)
    elif (one_file[8:11]) in digital:
        deciders.append(one_file)
    else:
        unknowns.append(one_file)
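The files that land in `unknowns` belong to countries outside the three New America clusters. A quick tally of their country codes shows how much is being left out. This is a sketch only: the helper `unaccounted_codes` is hypothetical, and the example paths are invented to fit the `[8:11]` slice used above.

```python
from collections import Counter


def unaccounted_codes(paths, known_codes, code_slice=slice(8, 11)):
    """Count country codes that appear in paths but not in known_codes."""
    # Skip empty entries, such as a blank line at the end of the CSV.
    codes = (p[code_slice] for p in paths if p)
    return Counter(c for c in codes if c not in known_codes)


# Invented example paths; real entries come from file_list.
example_paths = ["68_2013/BOL_68_2013.txt", "68_2013/XYZ_68_2013.txt", ""]
print(unaccounted_codes(example_paths, {"BOL"}))
```

In the notebook, `known_codes` would be `set(digital + global_open + sovereign_controlled)`.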

### Prepare each corpus

* Read in the files into lists using the groupings just created.
* While it is possible to run the vectorizer without removing stop words, it is worth doing some cleanup first.
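The read, combine, and clean steps repeat for every group, so they could be folded into one helper. The sketch below is one possible shape, not the code used in the cells that follow. The names `simple_tokenize`, `clean_text`, and `build_corpus` are hypothetical, and the regex tokenizer is a stand-in so the example runs without NLTK. It also joins the speeches with a newline, which differs from the plain concatenation used below.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def clean_text(text, stop_words, tokenize=simple_tokenize):
    """Lower-case the tokens, then drop stop words and any token containing a digit."""
    kept = []
    for token in tokenize(text):
        token = token.lower()
        if token in stop_words:
            continue
        if re.search(r"[0-9]", token):
            continue
        kept.append(token)
    return " ".join(kept)


def build_corpus(texts, stop_words, tokenize=simple_tokenize):
    """Join several speeches into one cleaned document, returned as a one-element list."""
    # The newline keeps the last word of one speech from fusing with the first word of the next.
    combined = "\n".join(texts)
    return [clean_text(combined, stop_words, tokenize)]
```

Passing `tokenize=word_tokenize` and the `stop_words` list defined below would reproduce the same filtering as the cleanup cells.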

In [12]:
#added curly apostrophe to the stop words list
stop_words = stopwords.words('english') + list(punctuation) + ['’']

In [13]:
open_internet = []

for current in currents:
    current = "data/"+current
    with open(current) as f_input:
        open_internet.append(f_input.read())

text = ""
for oneeach in open_internet:
    text = oneeach+text
    
open_internet=[text]

In [14]:
# Lower-case the tokens, drop stop words and punctuation, and use regexp to drop any token containing a digit
clean_quotes = []
for x in open_internet:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
open_internet = clean_quotes

In [15]:
sovereign_internet = []

for sovereign in sovereigns:
    sovereign = "data/"+sovereign
    with open(sovereign) as f_input:
        sovereign_internet.append(f_input.read())

text = ""
for oneeach in sovereign_internet:
    text = oneeach+text
    
sovereign_internet=[text]

In [16]:
clean_quotes = []
for x in sovereign_internet:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
sovereign_internet = clean_quotes

In [51]:
# writing these to file for safe-keeping
pd.DataFrame(sovereign_internet).to_csv("output/cs_sov_corpus.csv", header=False, index=False)
pd.DataFrame(open_internet).to_csv("output/cs_op_corpus.csv", header=False, index=False)

# I need to save the deciders file for future processing
pd.DataFrame(deciders).to_csv("output/cs_deciders.csv", header=False, index=False)

### 2. Economics-related speeches

* Like the previous set, I used regexp on the command line to get all economics-related speeches. 
* Unlike the previous set, where I didn't restrict the years, economic conditions fluctuate, so I am looking at only the last ten years of data.
* `grep -E '[Ee]conom|[Dd]evelop|[Ii]nvest|[Ii]nfrastructure|[Ss]oci|[Ff]ood' */*`
* The resulting filenames are stored in `economics_speeches.csv`
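The notebook does not show how the ten-year window was applied. One way to do it is to filter the filenames by year, as in this sketch. It assumes each filename ends in `_YYYY.txt`, which is a guess about the naming scheme, and that the window ends in 2018, the last year of the data. The helper `recent_files` and the example paths are hypothetical.

```python
import re

# Assumed filename ending, e.g. "..._2013.txt".
YEAR_RE = re.compile(r"_(\d{4})\.txt$")


def recent_files(paths, last_year=2018, span=10):
    """Keep paths whose trailing year falls within the `span` years ending at last_year."""
    first_year = last_year - span + 1
    kept = []
    for path in paths:
        match = YEAR_RE.search(path)
        if match and first_year <= int(match.group(1)) <= last_year:
            kept.append(path)
    return kept


# Invented example paths.
example = ["63_2008/AAA_63_2008.txt", "64_2009/AAA_64_2009.txt", "73_2018/AAA_73_2018.txt"]
print(recent_files(example))
```

Paths that do not end in a four-digit year are dropped rather than guessed at.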

In [28]:
file_list = []
file_name = "data/economics_speeches.csv"
with open(file_name) as f_input:
    file_list = f_input.read().splitlines()

In [29]:
sovereign_economies = []
current_economies = []
decider_economies = []
unknown_economies = []

for one_file in file_list:
    if (one_file[8:11]) in global_open:
        current_economies.append(one_file)
    elif (one_file[8:11]) in sovereign_controlled:
        sovereign_economies.append(one_file)
    elif (one_file[8:11]) in digital:
        decider_economies.append(one_file)
    else:
        unknown_economies.append(one_file)

In [30]:
open_economy = []

for current in current_economies:
    current = "data/"+current
    with open(current) as f_input:
        open_economy.append(f_input.read())
        
text = ""
for oneeach in open_economy:
    text = oneeach+text
    
open_economy=[text]

In [31]:
clean_quotes = []
for x in open_economy:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
open_economy = clean_quotes

In [32]:
sovereign_economy = []

for sovereign in sovereign_economies:
    sovereign = "data/"+sovereign
    with open(sovereign) as f_input:
        sovereign_economy.append(f_input.read())

text = ""
for oneeach in sovereign_economy:
    text = oneeach+text
    
sovereign_economy=[text]

In [33]:
clean_quotes = []
for x in sovereign_economy:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
sovereign_economy = clean_quotes

In [34]:
# writing these to file for safe-keeping
pd.DataFrame(sovereign_economy).to_csv("output/ec_sov_corpus.csv", header=False, index=False)
pd.DataFrame(open_economy).to_csv("output/ec_op_corpus.csv", header=False, index=False)

# I need to save the deciders file for future processing
pd.DataFrame(decider_economies).to_csv("output/ec_deciders.csv", header=False, index=False)

### 3. Politics-related speeches

* Like the previous sets, I used regexp on the command line to get all politics-related speeches.
* While politics isn't as fluctuating as economics, I still restricted the data to only the last ten years of data because otherwise the list was getting too long.
* `grep -E '[Ss]overeign|[Dd]emocratic|[Ss]table|[Ss]tability|[Cc]ooperat|[Ss]ecurity' */*` 
* The resulting filenames are stored in `politics_speeches.csv`

In [22]:
file_list = []
file_name = "data/politics_speeches.csv"
with open(file_name) as f_input:
    file_list = f_input.read().splitlines()

In [23]:
sovereign_politics = []
current_politics = []
decider_politics = []
unknown_politics = []

for one_file in file_list:
    if (one_file[8:11]) in global_open:
        current_politics.append(one_file)
    elif (one_file[8:11]) in sovereign_controlled:
        sovereign_politics.append(one_file)
    elif (one_file[8:11]) in digital:
        decider_politics.append(one_file)
    else:
        unknown_politics.append(one_file)

In [63]:
open_politic = []

for current in current_politics:
    current = "data/"+current
    with open(current) as f_input:
        open_politic.append(f_input.read())
        
text = ""
for oneeach in open_politic:
    text = oneeach+text
    
open_politic=[text]

In [64]:
clean_quotes = []
for x in open_politic:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
open_politic = clean_quotes

In [65]:
sovereign_politic = []

for sovereign in sovereign_politics:
    sovereign = "data/"+sovereign
    with open(sovereign) as f_input:
        sovereign_politic.append(f_input.read())

text = ""
for oneeach in sovereign_politic:
    text = oneeach+text
    
sovereign_politic=[text]

In [66]:
clean_quotes = []
for x in sovereign_politic:
    my_text =[]
    x = word_tokenize(x)
    x = [w.lower() for w in x]
    for z in x:
        if (z not in stop_words):
            #if re.match(r'[0-9,.]+', z) is None:
            if len(re.findall(r'[0-9]+', z)) == 0:
                my_text.append(z)
    clean_quotes.append(" ".join(my_text))

#write the words back
sovereign_politic = clean_quotes

In [67]:
# writing these to file for safe-keeping
pd.DataFrame(sovereign_politic).to_csv("output/po_sov_corpus.csv", header=False, index=False)
pd.DataFrame(open_politic).to_csv("output/po_op_corpus.csv", header=False, index=False)

# I need to save the deciders file for future processing
pd.DataFrame(decider_politics).to_csv("output/po_deciders.csv", header=False, index=False)