# Classifying the Digital Deciders

Fahmida Y Rashid (fr48)

### Introduction to Natural Language Processing

_The Choice Between an Open Internet and a Sovereign One_

[Project on GitHub](https://fr48.github.io/prtfl)

On the international stage, there are two competing visions of the Internet. One sees the Internet as open and free for all ideas. The other holds that the Internet should be restricted within national boundaries, so that only "approved" ideas circulate. The United States and many other Western countries support the idea of an open Internet. Countries that prefer to restrict speech or monitor citizens, such as Russia and China, support what is called the "sovereign" Internet.

*Digital Deciders* is a term used by [New America](https://www.newamerica.org/cybersecurity-initiative/reports/digital-deciders/analyzing-the-clusters) to refer to countries that have not yet picked a side. We apply natural language processing tools to General Debate speeches made in the United Nations General Assembly from 1970 to 2018 to determine which worldview the Digital Deciders are more likely to lean towards.

## Part One: Creating the Corpus

Before I can make any comparisons, I have to create the corpus of relevant speeches. I began with cybersecurity-related terms, then realized that they are not enough on their own: the countries talking about security are mostly the ones that have already made their position clear and want to convince others. And the idea of voting with your friends means looking for other commonalities, such as economics and politics. So I created three different corpora to see which one is the most relevant for the analysis.

### Looking for Relevant Speeches

The first thing I did was to use a regular expression with grep to look for cyber-related terms in the files. The related terms for cybersecurity were cyber, privacy, security, internet, and hack: `[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack`
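
A quick way to sanity-check the pattern before running it over the whole corpus is to try it on a few sample strings. The sketch below uses Python's `re` module with the same expression; the sample sentences are made up for illustration. Note that `[Cc]yber\w+` needs at least one word character after "cyber", so it matches "cybersecurity" but not a bare "Cyber", and that `[Ss]ecurity` also fires on "Security Council".

```python
import re

# Same expression as the one given to grep.
CYBER_PATTERN = re.compile(r"[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack")

# Made-up sample sentences, for illustration only.
samples = [
    "We must invest in cybersecurity.",
    "Cyber attacks are increasing.",
    "The Security Council met yesterday.",
    "Trade and development remain our priority.",
]

# Map each sentence to the list of substrings the pattern finds in it.
matches = {s: CYBER_PATTERN.findall(s) for s in samples}
for sentence, found in matches.items():
    print(found, "<-", sentence)
```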

I then read in the files that contained the desired words and ran them through the TF/IDF.

The second thing I did was to use a regular expression with grep to look for economics-related terms. This resulted in a much larger set of files. The related terms for economics were: economic, economy, market, debt, financial, invest, infrastructure, social, and prosperity.
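
The exact expression used for the economics search is not shown here. As a sketch only, a term list like this one could be turned into a pattern with a small helper. `build_pattern` is a hypothetical name, and matching case-insensitively on word prefixes is an assumption about how the terms were meant to match.

```python
import re

def build_pattern(terms):
    """Compile one case-insensitive pattern matching any term as a word prefix."""
    # re.escape keeps characters such as '-' (as in "co-operation") literal.
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + r")\w*", re.IGNORECASE)

economics_terms = ["economic", "economy", "market", "debt", "financial",
                   "invest", "infrastructure", "social", "prosperity"]

economics_pattern = build_pattern(economics_terms)
# Made-up sentence, for illustration only.
found = economics_pattern.findall("Investment in infrastructure reduces debt.")
print(found)
```

Prefix matching is deliberately loose: "invest" also picks up "Investment", which is likely what is wanted here, but "social" would pick up "socialist" as well.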

The third thing was to look for politics-related terms, which returned an even larger set of files. The related terms for politics were: sovereign, security, elections, democrat, government, allies, ally, co-operation, cooperation, stability
"peace and security" and "international security"

In [1]:
#processing data 
import pandas as pd
from nltk import sent_tokenize

import re

# feature/text extractions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

### Cybersecurity-related speeches

* I decided to use a regular expression on the command line.
* Running `grep -E '[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack' */*` returned all the files that seemed to be talking about cybersecurity.
* The resulting file names are stored in `cybersecurity_speeches.csv`.
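
One caveat: without the `-l` flag, `grep` prints every matching line rather than just the file names, so its output may need to be reduced to unique file names. A rough Python equivalent that returns each matching file once is sketched below. `find_matching_files` is a hypothetical helper, and the assumption that the speeches are UTF-8 `.txt` files one directory below the root is mine.

```python
import re
from pathlib import Path

CYBER_PATTERN = re.compile(r"[Cc]yber\w+|[Pp]rivacy|[Ss]ecurity|[Ii]nternet|[Hh]ack")

def find_matching_files(root, pattern=CYBER_PATTERN):
    """Return sorted relative paths of .txt files whose text matches pattern."""
    root = Path(root)
    hits = []
    # "*/*.txt" mirrors the */* glob given to grep, limited to text files.
    for path in sorted(root.glob("*/*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if pattern.search(text):
            hits.append(path.relative_to(root).as_posix())
    return hits
```

Calling `find_matching_files("data")` would then give a list that plays the same role as the contents of `cybersecurity_speeches.csv`.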

In [2]:
# all the files for cybersecurity-related speeches (one relative path per line)
file_name = "data/cybersecurity_speeches.csv"
with open(file_name) as f_input:
    # splitlines() avoids the empty trailing entry that split('\n') leaves behind
    file_list = f_input.read().splitlines()

In [3]:
# for testing, just read in one at a time.
# will need a for loop eventually, but this is supposed to be a manual process as I am verifying each file
file_name = "data/"+file_list[108]
speeches = []

with open(file_name) as f_input:
    speeches.append(f_input.read())

speeches

['We are in the midst of a great revolution — a revolution in Israel’s standing among the nations. It is happening because so many countries around the world have finally woken up to what Israel can do for them. Those countries now recognize what brilliant investors like Warren Buffett and great companies like Google and Intel have recognized and known for years, that is, that Israel is the innovation nation, the place for cutting-edge technology in agriculture, in water, in cybersecurity, in medicine, in autonomous vehicles. You name it, we have got it.\nThose countries now also recognize Israel’s exceptional capabilities in fighting terrorism. In recent years, Israel has provided intelligence that has prevented dozens of major terrorist attacks around the world. We have saved countless lives. Members may not know it, but their Governments do. And they are working closely together with Israel to keep their countries and its citizens safe.\nI stood here last year, at this rostrum, and 

In [4]:
# create a tf/idf of these files

# use an n-gram range of one, two, and three words
vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# fit the vectorizer and build the document-term matrix
vectors = vectorizer.fit_transform(speeches)

# these are all the (1,3) n-grams
# get_feature_names_out() (scikit-learn 1.0+) replaces get_feature_names(), which was removed in 1.2
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()

# creating a dataframe with the  feature names as columns
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,000,000 years,000 years when,14,14 just,14 just look,181,181 ii,181 ii supporting,2016,...,you,you cannot,you cannot be,you light,you light unto,you name,you name it,zionist,zionist congress,zionist congress to
0,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,...,0.012256,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085,0.004085
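
Nearly every n-gram above has the same score because the vectorizer was fitted on a single speech. With one document, the inverse document frequency is the same for every term, so each score reduces to the n-gram's count divided by the L2 norm of all the counts. "you" scores three times higher (0.012256 against 0.004085) simply because it occurs three times as often as the n-grams that score 0.004085. The sketch below reproduces that arithmetic in plain Python. It assumes scikit-learn's default tokenization (lowercased tokens of two or more word characters) and default L2 normalization; `single_doc_scores` is a hypothetical helper, not part of the notebook.

```python
import math
import re
from collections import Counter

def single_doc_scores(text, max_n=3):
    """L2-normalized n-gram counts, i.e. TF-IDF when the corpus is one document."""
    # Assumed to mirror scikit-learn's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Count every n-gram of length 1..max_n.
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}

# Made-up text in which "you" occurs three times and "name" once.
scores = single_doc_scores("you name it, you have it, you see it")
print(scores["you"] / scores["name"])
```

This is why single-speech scores are only useful for ranking terms within that speech; the IDF part starts to matter once several speeches are fitted together.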


In [5]:
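# write to file to make it easier to view. I also transpose the df for easier reading.
df = df.T
# scale the scores by 100 so they are easier to compare by eye
df[0] = df[0] * 100
# file_name[13:-4] drops the first 13 characters and the last 4
# (presumably the directory prefix and the file extension)
df.to_csv(file_name[13:-4] + ".csv", header=False)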
df

| n-gram | 0 |
|---|---|
| 000 | 0.408545 |
| 000 years | 0.408545 |
| 000 years when | 0.408545 |
| 14 | 0.408545 |
| 14 just | 0.408545 |
| ... | ... |
| you name | 0.408545 |
| you name it | 0.408545 |
| zionist | 0.408545 |
| zionist congress | 0.408545 |


### Verify Files

Manually go through the files and make sure they are all about cybersecurity. Use the TF/IDF scores to check whether the search terms were actually part of the conversation. For example, "security" has a high score, but most of that score comes from "Security Council."
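
Part of this check can be automated. As a sketch, a negative lookahead counts mentions of "security" that are not immediately followed by "Council", which gives a rough sense of how much of the score comes from the name of the UN body. `count_security_mentions` is a hypothetical helper, the sample text is made up, and none of this replaces reading the speech.

```python
import re

# "security" as a whole word, unless the next word is "council".
SECURITY_NOT_COUNCIL = re.compile(r"\bsecurity\b(?!\s+council\b)", re.IGNORECASE)
SECURITY_ANY = re.compile(r"\bsecurity\b", re.IGNORECASE)

def count_security_mentions(text):
    """Return (all mentions of 'security', mentions not followed by 'Council')."""
    return len(SECURITY_ANY.findall(text)), len(SECURITY_NOT_COUNCIL.findall(text))

sample = ("The Security Council must act. Peace and security matter, "
          "and so does the security of our networks.")
total, without_council = count_security_mentions(sample)
print(total, without_council)
```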

I now have all the cybersecurity-related files. I am going to split them up into two groups so that I can compare them and find out where the similarities are. The groupings come from the New America lists.

One thing I realized is that there are countries New America doesn't account for, so I am leaving them out for now.

In [6]:
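# re-read the list of cybersecurity-related speeches (one relative path per line)
file_name = "data/cybersecurity_speeches.csv"
with open(file_name) as f_input:
    # splitlines() avoids the empty trailing entry that split('\n') leaves behind
    file_list = f_input.read().splitlines()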

In [7]:
sovereigns = []
currents = []
deciders = []
unknowns = []

digital = ['ALB','ARG','ARM','BOL','BIH','BWA','BRA','COL','COG','CRI',
           'CIV','DOM','ECU','SLV','GEO','GHA','GTM','HND','IND','IDN',
           'IRQ','JAM','JOR','KEN','KWT','KGZ','LBN','MKD','MYS','MEX',
           'MNG','MAR','NAM','NIC','NGA','PAK','PAN','PNG','PRY','PER',
           'PHL','MDA','SRB','SGP','ZAF','LKA','THA','TUN','UKR','URY']

global_open = ['GBR','CAN','AUS','DEU','JPN','SWE','NLD','USA','NOR',
            'FIN','CHE','EST','ESP','POL','NZL','KOR','AUT','IRL','CZE',
            'PRT','DNK','ITA','LVA','LTU','LUX','BEL','SVN','GRC','CHL',
            'CYP','SVK','ISR','HRV','BGR','HUN','ROU','FRA']
               
sovereign_controlled = ['SAU','ZWE','VEN','SWZ','CUB','IRN','DZA','LBY',
    'QAT','TUR','ARE','BLR','RUS','CHN','KAZ','OMN','BHR','AZE','VNM',
    'CMR','TKM','TJK','SYR','PRK','UZB','AGO','EGY']

for one_file in file_list:
    # characters 8 to 10 of each entry hold the three-letter country code
    country = one_file[8:11]
    if country in global_open:
        currents.append(one_file)
    elif country in sovereign_controlled:
        sovereigns.append(one_file)
    elif country in digital:
        deciders.append(one_file)
    else:
        unknowns.append(one_file)
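
To see which countries ended up in `unknowns`, the country codes can be tallied. The sketch below assumes, as the slicing above does, that characters 8 to 10 of each entry hold the three-letter country code. The sample entries and the `tally_codes` helper are made up for illustration.

```python
from collections import Counter

def tally_codes(paths, start=8, length=3):
    """Count how many entries carry each country code at a fixed offset."""
    return Counter(path[start:start + length] for path in paths)

# Made-up entries in an assumed "<8-character prefix><CODE>_..." layout.
sample_unknowns = ["prefix__AFG_73_2018.txt", "prefix__AFG_72_2017.txt",
                   "prefix__FJI_73_2018.txt"]
code_counts = tally_codes(sample_unknowns)
print(code_counts.most_common())
```

Running `tally_codes(unknowns)` on the real list would show which countries fall outside the three New America clusters and how many speeches each contributes.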

### Prepare each corpus

I now read all the files into two lists, using the groupings I just created. I then run TF/IDF to create the vectors.

In [8]:
open_internet = []

for current in currents:
    current = "data/"+current
    with open(current) as f_input:
        open_internet.append(f_input.read())

#open_internet

In [9]:
sovereign_internet = []

for sovereign in sovereigns:
    sovereign = "data/"+sovereign
    with open(sovereign) as f_input:
        sovereign_internet.append(f_input.read())

#sovereign_internet
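
The two read loops above do the same thing for different lists, so they could share one helper. A minimal sketch, where `read_corpus` is a hypothetical name and UTF-8 encoding is an assumption about the speech files:

```python
from pathlib import Path

def read_corpus(relative_paths, root="data"):
    """Read each listed file under root and return the texts in list order."""
    return [
        (Path(root) / rel).read_text(encoding="utf-8")
        for rel in relative_paths
    ]
```

With this in place, the two cells would reduce to `open_internet = read_corpus(currents)` and `sovereign_internet = read_corpus(sovereigns)`.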

In [10]:
s_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
s_vectors = s_vectorizer.fit_transform(sovereign_internet)
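# get_feature_names_out() (scikit-learn 1.0+) replaces get_feature_names(), which was removed in 1.2
s_feature_names = s_vectorizer.get_feature_names_out()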
s_dense = s_vectors.todense()
s_denselist = s_dense.tolist()
sov_df = pd.DataFrame(s_denselist, columns=s_feature_names)
sov_df

Unnamed: 0,000,000 000,000 000 square,000 aid,000 aid workers,000 children,000 children die,000 children perish,000 cuban,000 cuban health,...,être,être of,être of global,être of the,œif,œif you,œif you would,ﬂagrant,ﬂagrant violation,ﬂagrant violation of
0,0.009106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.014793,0.00716,0.00716,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.008666,0.008666,0.008666,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.006202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010626,0.010626,0.012007,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
o_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
o_vectors = o_vectorizer.fit_transform(open_internet)
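# get_feature_names_out() (scikit-learn 1.0+) replaces get_feature_names(), which was removed in 1.2
o_feature_names = o_vectorizer.get_feature_names_out()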
o_dense = o_vectors.todense()
o_denselist = o_dense.tolist()
op_df = pd.DataFrame(o_denselist, columns=o_feature_names)
op_df

Unnamed: 0,000,000 americans,000 americans were,000 and,000 and that,000 and the,000 asylum,000 asylum seekers,000 bulgarian,000 bulgarian jews,...,zones and,zones and high,zones as,zones as well,zones in,zones in our,zones of,zones of influence,zones those,zones those times
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.010579,0.010579,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.01527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.013639,0.013639,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# writing all these to file for safe-keeping
sov_df.to_csv("data/cs_sov_df.csv")
op_df.to_csv("data/cs_op_df.csv")
pd.DataFrame(sovereign_internet).to_csv("data/cs_sov_corpus.csv", header=False, index=False)
pd.DataFrame(open_internet).to_csv("data/cs_op_corpus.csv", header=False, index=False)