# Classifying the Digital Deciders

Fahmida Y Rashid (fr48)

### Introduction to Natural Language Processing

_The Choice Between an Open Internet and a Sovereign One_

[Project on GitHub](https://fr48.github.io/prtfl)

On the international stage, there are two competing visions of the Internet: one sees the Internet as open and free for all ideas, and the other holds that the Internet should be restricted within national boundaries so that only "approved" ideas circulate. The United States and many other Western countries support the idea of an open Internet. Countries that prefer to restrict speech or monitor citizens, such as Russia and China, support what is called the "sovereign" Internet.

*Digital Deciders* is a term used by New America (https://www.newamerica.org/cybersecurity-initiative/reports/digital-deciders/analyzing-the-clusters) to refer to countries that have not yet picked a side. We apply natural language processing tools to General Debate speeches made in the United Nations General Assembly from 1970 to 2018 to determine which worldview the Digital Deciders are more likely to lean towards.

## Part Two: Compare the corpus

Now that I have the corpus of relevant speeches, I can start the comparisons. The first step is to run the similarity analysis. As a precaution, I exported the earlier dataframes to CSV files.

In [38]:
# feature/text extractions
import pandas as pd
from nltk import sent_tokenize

from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

### Cybersecurity-related speeches


In [344]:
file_list = []
file_name = "data/cybersecurity_speeches.csv"
with open(file_name) as f_input:
    file_list = f_input.read().split('\n')

file_list = file_list[0:-1]

In [345]:
sovereigns = []
currents = []
deciders = []
unknowns = []

digital = ['ALB','ARG','ARM','BOL','BIH','BWA','BRA','COL','COG','CRI',
           'CIV','DOM','ECU','SLV','GEO','GHA','GTM','HND','IND','IDN',
           'IRQ','JAM','JOR','KEN','KWT','KGZ','LBN','MKD','MYS','MEX',
           'MNG','MAR','NAM','NIC','NGA','PAK','PAN','PNG','PRY','PER',
           'PHL','MDA','SRB','SGP','ZAF','LKA','THA','TUN','UKR','URY']

global_open = ['GBR','CAN','AUS','DEU','JPN','SWE','NLD','USA','NOR',
            'FIN','CHE','EST','ESP','POL','NZL','KOR','AUT','IRL','CZE',
            'PRT','DNK','ITA','LVA','LTU','LUX','BEL','SVN','GRC','CHL',
            'CYP','SVK','ISR','HRV','BGR','HUN','ROU','FRA']
               
sovereign_controlled = ['SAU','ZWE','VEN','SWZ','CUB','IRN','DZE','LBY',
    'QAT','TUR','ARE','BLR','RUS','CHN','KAZ','OMN','BHR','AZE','VNM',
    'CMR','TKM','TJK','SYR','PRK','UZB','AGO','EGY']

for one_file in file_list:
    if (one_file[8:11]) in global_open:
        currents.append(one_file)
    elif (one_file[8:11]) in sovereign_controlled:
        sovereigns.append(one_file)
    elif (one_file[8:11]) in digital:
        deciders.append(one_file)
    else:
        unknowns.append(one_file)
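
The grouping loop above relies on the country code sitting at a fixed character offset in each path. As a hedged alternative sketch, the code can be read from the file name itself, assuming names shaped like `CAN_71_2016.txt` (the pattern used later in this notebook). The shortened `GROUPS` sets and the helper names `country_code` and `group_of` are illustrative only and are not part of the original notebook.

```python
import os

# Hypothetical, shortened group sets; the full lists are defined in the cell above.
GROUPS = {
    "open": {"CAN", "USA", "GBR", "FRA"},
    "sovereign": {"RUS", "CHN", "IRN", "VNM"},
    "decider": {"IND", "BRA", "KEN", "MEX"},
}


def country_code(path):
    """Return the code before the first underscore of the file name."""
    return os.path.basename(path).split("_")[0]


def group_of(path, groups=GROUPS):
    """Return the name of the group containing the file's country code."""
    code = country_code(path)
    for name, members in groups.items():
        if code in members:
            return name
    return "unknown"


print(group_of("data/CAN_71_2016.txt"))
print(group_of("AFG_73_2018.txt"))
```

Using sets instead of lists keeps each membership test cheap, and splitting on the underscore avoids breaking when the directory prefix changes length.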

### Prepare each corpus

I now read all the files into two lists, using the groupings I just created. I then run TF-IDF to create the vectors.

In [346]:
open_internet = []

for current in currents:
    current = "data/"+current
    with open(current) as f_input:
        open_internet.append(f_input.read())

open_internet

["Allow me first to\ncongratulate you, Sir, on your election as President of the\nGeneral Assembly. Canadians are proud to have\naccompanied you and your people on their journey to join\nthe community of nations.\nOn behalf of Canada, allow me also to welcome the\nRepublic of Kiribati, the Republic of Nauru and the\nKingdom of Tonga as new Members of the United Nations.\n\nMr. President, your election is a tribute to your\nwisdom and your dedication to the goals of the United\nNations. I am convinced that you will guide us well in\ncarrying out the work that we are gathered here to do on\nbehalf of all of the world's people.\n\nIndeed, it is we the people for whom the United\nNations was founded and its purposes forged. We the\npeople, not we the nation States, or the ministers, or the\nambassadors, or the Secretariat. Let us recall these lines\nfrom the United Nations Charter:\n“We the peoples of the United Nations\ndetermined to save succeeding generations from the\nscourge of war, .

In [349]:
sovereign_internet = []

for sovereign in sovereigns:
    sovereign = "data/"+sovereign
    with open(sovereign) as f_input:
        sovereign_internet.append(f_input.read())

sovereign_internet

 'Mankind is increasingly feeling the winds\nof the twenty-first century. What they bring depends on\nall of us, on whether or not we succeed in responding\ncollectively to new challenges and in establishing a\nreliable system of international security and stability once\nwe have overcome the vices, antagonisms and stereotypes\naccumulated during the century about to end.\nThis is not only possible, it is the imperative of our\ntimes!\nA well-known Russian proverb says, â€œIf you would\nlive in the world, live in peaceâ€\x9d. It contains a highly\nphilosophical message of everlasting value. Mankind will\nlive in peace and harmony once it has learned to resolve\nemerging problems through peaceful, political means.\nStates will live in peace and harmony once they have\nrecognized their interrelationship and interdependence and\nstarted to seek collective responses to the challenges of\ntheir times.\nExperience confirms the truth of this popular\nwisdom. The most recent example is the sha

In [352]:
s_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
s_vectors = s_vectorizer.fit_transform(sovereign_internet)
s_feature_names = s_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
s_dense = s_vectors.todense()
s_denselist = s_dense.tolist()
sov_df = pd.DataFrame(s_denselist, columns=s_feature_names)
sov_df

Unnamed: 0,000,000 000,000 000 square,000 aid,000 aid workers,000 children,000 children die,000 children perish,000 cuban,000 cuban health,...,être,être of,être of global,être of the,œif,œif you,œif you would,ﬂagrant,ﬂagrant violation,ﬂagrant violation of
0,0.009106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.014793,0.00716,0.00716,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.008666,0.008666,0.008666,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.006202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010626,0.010626,0.012007,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [353]:
o_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
o_vectors = o_vectorizer.fit_transform(open_internet)
o_feature_names = o_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
o_dense = o_vectors.todense()
o_denselist = o_dense.tolist()
op_df = pd.DataFrame(o_denselist, columns=o_feature_names)
op_df

Unnamed: 0,000,000 americans,000 americans were,000 and,000 and that,000 and the,000 asylum,000 asylum seekers,000 bulgarian,000 bulgarian jews,...,zones and,zones and high,zones as,zones as well,zones in,zones in our,zones of,zones of influence,zones those,zones those times
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.010579,0.010579,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.01527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.013639,0.013639,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
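
To make the weights in the tables above less opaque, here is a minimal pure-Python sketch of a TF-IDF calculation. It assumes the smoothed IDF formula and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, and it works on single words only, whereas the notebook uses 1- to 3-grams. The two toy documents and the function name `tfidf` are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights.

    `docs` is a list of already tokenised documents (lists of words).
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy documents, invented for illustration only.
docs = [
    "open and free internet".split(),
    "sovereign internet and security".split(),
]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in only one document ("free", "open") end up with a larger weight than terms shared by both ("and", "internet"), which is why most cells in a large TF-IDF table are zero and the distinctive n-grams stand out.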




['\ufeff212.\tThe opportunity we have at this time to address this illustrious Assembly of the United Nations—the symbol for all the peoples represented here of hope for a future marked by harmony, respect, equality and co-operation among nations—enables us to express cur satisfaction at noting that some progress has been made towards such a future despite the fact that these objectives are still far from being embodied in international practice.\n213.\tWe wish to share in the responsibility, incumbent upon us all, of debating in this forum the topics that we\nbelieve to be fundamental for the progress of mankind. Collective security, peace and development are topics which concern all peoples in this world in which interdependence is becoming an increasingly clear reality. It is with pleasure that we note the consolidation of the principles for which we have always fought and the increasing acceptance of the\n\naspirations and legitimate demands of the countries of the third world for 


In [15]:

# build unigram-to-trigram TF-IDF vectors for the subset of speeches
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectors = vectorizer.fit_transform(subset_files)

# these are all the (1,3) ngrams in the corpus
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()

# one row per document, one column per ngram
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,12,12 years,12 years ago,13,13 new,13 new members,1962,1962 we,1962 we recognized,1980,...,yet so,yet so much,you,you sir,you sir on,your,your election,your election to,your vision,your vision and
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.002332,0.002332,0.002332,0.002332,0.002332,0.002332,0.002332,0.002332,0.002332,0.002332,...,0.002332,0.002332,0.002332,0.002332,0.002332,0.004664,0.002332,0.002332,0.002332,0.002332





In [97]:
# build unigram-to-trigram TF-IDF vectors for the corpus
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectors = vectorizer.fit_transform(corpus)

# these are all the (1,3) ngrams in the corpus
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()

# one row per document, one column per ngram
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,1244,1244 1999,1244 1999 and,14,14 april,14 april missile,1999,1999 and,1999 and will,1999 iraq,...,yemen and libya,yet,yet consigned,yet consigned to,yugoslavia,yugoslavia in,yugoslavia in 1999,zone,zone free,zone free of
0,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123,0.006245,0.003123,0.003123,0.003123,...,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123,0.003123


['security',
 'Security',
 'Security',
 'Security',
 'Security',
 'Security',
 'Security',
 'security',
 'Security',
 'security',
 'security',
 'Security',
 'Security',
 'Security',
 'Security',
 'cyberspace',
 'security',
 'cyberspace',
 'cybercrime',
 'security',
 'Security']

['security challenges have stalled. Diplomacy and a culture of negotiation and compromise are increasingly replaced by dictates and unilateral extraterritorial restrictions effected without the consent of the Security Council. Such measures, which have already been applied to dozens of countries, are ineffective as well as illegal, as demonstrated by the more  than  half-century  of the United States blockade of Cuba, which has been condemned by the entire international community.',
 'Security Council resolution 2254 (2015), and the basis for the intra-Syrian constitutional committee now being established in Geneva. Its agenda includes the restoration of ruined infrastructure in order to facilitate the return of millions of refugees to their homes. Assistance in addressing those issues in the interest of all Syrians, without double standards, must become a priority for international efforts and the activities of United Nations agencies.',
 'Security Council in its resolution 2202 (2015

Unnamed: 0,security,cyberspace,cybercrime
0,0.056208,0.006245,0.003123


Unnamed: 0,0
1244,0.003123
1244 1999,0.003123
1244 1999 and,0.003123
14,0.003123
14 april,0.003123
...,...
yugoslavia in,0.003123
yugoslavia in 1999,0.003123
zone,0.003123
zone free,0.003123


In [52]:
# create a corpus of sentences
sent_corpus = []
for speech in corpus:
    sentences = sent_tokenize(speech)
    for sentence in sentences:
        sent_corpus.append(sentence)
sent_corpus

['It is an honour to be present at the General Assembly today.',
 'It is wonderful to be here in the great city of New York.',
 'Once again this week, New Yorkers showed us how to be resilient and resolute in the face of violent extremism.',
 'On behalf of everyone in this Hall, let me directly say to the people of New York that they are a model to the rest of the world, and we thank them.',
 'Exactly one year ago, Canada was in the middle of a long — 78 days on the road, and I can assure the Assembly that, in Canada, there are 78 days’ worth of roads — and closely fought election campaign.',
 'It is the responsibility of a leader to spend time with the people they are elected to serve.',
 'To get the real stories, it is important to go where people live: coffee shops and church basements, mosques and synagogues, farmers’ markets and public parks.',
 'It was in those places that I got the best sense of what Canadians were thinking and how they were doing and, through the politeness — b

In [64]:
# reuse the vectorizer fitted above so the sentences share its vocabulary
svectors = vectorizer.transform(sent_corpus)
# these are all the ngrams the vectorizer learned. take a peek!
sfeature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
sdense = svectors.todense()
sdenselist = sdense.tolist()

In [65]:
# creating a dataframe with the feature names as columns and one row per sentence
sdf = pd.DataFrame(sdenselist, columns=sfeature_names)
sdf

Unnamed: 0,000,000 additional,000 refugees,000 victims,10,10 billion,10 per,10 trillion,10 years,100,...,zero we,zika,zika that,zionist,zionist pressure,zionist regime,zone,zone free,zor,zor and
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [81]:
# if you want to transpose, you can, but unnecessary
#sdft = sdf.T
#sdft

In [87]:
# run the cosine similarity (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)
from sklearn.metrics.pairwise import cosine_similarity
counts = cosine_similarity(sdf,sdf)
counts

array([[1.        , 0.11022803, 0.04485285, ..., 0.04506496, 0.07357459,
        0.03750946],
       [0.11022803, 1.        , 0.07569154, ..., 0.03687498, 0.05817123,
        0.03836575],
       [0.04485285, 0.07569154, 1.        , ..., 0.0407911 , 0.04826176,
        0.04526954],
       ...,
       [0.04506496, 0.03687498, 0.0407911 , ..., 1.        , 0.05387776,
        0.09096723],
       [0.07357459, 0.05817123, 0.04826176, ..., 0.05387776, 1.        ,
        0.16619749],
       [0.03750946, 0.03836575, 0.04526954, ..., 0.09096723, 0.16619749,
        1.        ]])
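
Each entry in the array above is the cosine of the angle between two sentence vectors. The following pure-Python sketch shows the calculation for a single pair of vectors. The function name `cosine` and the toy vectors are illustrative, and returning 0 when one vector is all zeros is a choice made in this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A sentence with no known ngrams has an all-zero vector.
        return 0.0
    return dot / (norm_u * norm_v)


same = cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])      # identical direction
disjoint = cosine([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])  # no shared terms
partial = cosine([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])   # one shared term
print(same, disjoint, partial)
```

This explains the pattern in the matrix: the diagonal is 1 because every sentence is compared with itself, and the matrix is symmetric because the formula does not depend on the order of the two vectors.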

In [88]:
pd.DataFrame(counts)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,845,846,847,848,849,850,851,852,853,854
0,1.000000,0.110228,0.044853,0.049437,0.029040,0.098658,0.056851,0.022823,0.045677,0.000000,...,0.000000,0.027924,0.053913,0.022208,0.037284,0.009980,0.057149,0.045065,0.073575,0.037509
1,0.110228,1.000000,0.075692,0.189817,0.047525,0.091864,0.046519,0.031125,0.037375,0.000000,...,0.000000,0.022849,0.029410,0.027259,0.042712,0.016332,0.058454,0.036875,0.058171,0.038366
2,0.044853,0.075692,1.000000,0.073085,0.043810,0.041062,0.032937,0.039118,0.022051,0.000000,...,0.000000,0.025276,0.027111,0.026803,0.035998,0.024089,0.051729,0.040791,0.048262,0.045270
3,0.049437,0.189817,0.073085,1.000000,0.084986,0.151357,0.054451,0.075152,0.033127,0.045286,...,0.014296,0.066861,0.061271,0.064993,0.063483,0.042480,0.083623,0.071788,0.089838,0.094679
4,0.029040,0.047525,0.043810,0.084986,1.000000,0.047854,0.031988,0.072880,0.000000,0.002828,...,0.005249,0.032730,0.035107,0.052062,0.037874,0.035092,0.039075,0.026411,0.055551,0.051293
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,0.009980,0.016332,0.024089,0.042480,0.035092,0.021927,0.034199,0.034997,0.000000,0.000000,...,0.007215,0.022495,0.124558,0.023854,0.166688,1.000000,0.030692,0.157685,0.118411,0.140062
851,0.057149,0.058454,0.051729,0.083623,0.039075,0.052318,0.041966,0.040945,0.028096,0.000000,...,0.010329,0.042940,0.034544,0.042689,0.034400,0.030692,1.000000,0.071459,0.054660,0.075857
852,0.045065,0.036875,0.040791,0.071788,0.026411,0.057758,0.088247,0.036899,0.033232,0.000000,...,0.032580,0.067721,0.059927,0.033662,0.054253,0.157685,0.071459,1.000000,0.053878,0.090967
853,0.073575,0.058171,0.048262,0.089838,0.055551,0.071590,0.043504,0.036381,0.017475,0.000000,...,0.006424,0.040062,0.132399,0.065048,0.187654,0.118411,0.054660,0.053878,1.000000,0.166197


In [89]:
#look at the results
pd.DataFrame(counts.dot(counts.T))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,845,846,847,848,849,850,851,852,853,854
0,2.680763,1.775889,1.258590,2.534314,1.463423,2.226008,1.396458,1.280597,0.708358,0.377809,...,0.387193,1.219230,1.365385,1.427801,1.388442,0.920059,1.559336,1.652648,2.044220,1.842012
1,1.775889,2.763290,1.478522,3.071154,1.717719,2.342550,1.443634,1.447483,0.705180,0.394042,...,0.414008,1.304545,1.409721,1.597151,1.509055,1.033241,1.712436,1.716410,2.198199,2.007429
2,1.258590,1.478522,2.207103,2.407331,1.455041,1.782324,1.204409,1.248044,0.550779,0.326058,...,0.353505,1.127565,1.195241,1.360602,1.275363,0.929895,1.455259,1.523861,1.834790,1.755408
3,2.534314,3.071154,2.407331,5.968643,3.023172,3.948789,2.380807,2.628822,1.095186,0.824326,...,0.800779,2.391011,2.468132,2.888956,2.570978,1.885162,2.986075,2.988924,3.815096,3.608988
4,1.463423,1.717719,1.455041,3.023172,2.873616,2.195060,1.420888,1.657758,0.537030,0.407202,...,0.450738,1.387386,1.493955,1.801698,1.522712,1.196523,1.744744,1.702713,2.292834,2.158849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,0.920059,1.033241,0.929895,1.885162,1.196523,1.377939,1.047213,1.037608,0.365153,0.235480,...,0.316164,0.938936,1.321852,1.118977,1.415273,1.986605,1.177109,1.630482,1.734047,1.794034
851,1.559336,1.712436,1.455259,2.986075,1.744744,2.220489,1.495435,1.545220,0.683082,0.399403,...,0.486762,1.463007,1.499581,1.728823,1.546826,1.177109,2.775529,1.978735,2.279801,2.236190
852,1.652648,1.716410,1.523861,2.988924,1.702713,2.340782,1.928264,1.647986,0.874319,0.491506,...,0.668739,1.686560,1.763895,1.765605,1.844556,1.630482,1.978735,3.684409,2.378100,2.566872
853,2.044220,2.198199,1.834790,3.815096,2.292834,2.912657,1.872295,1.920933,0.806990,0.484471,...,0.575598,1.820622,2.204324,2.280418,2.411767,1.734047,2.279801,2.378100,4.060822,3.117392
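
A matrix with several hundred rows and columns is hard to read by eye. One possible way to inspect it, sketched here rather than taken from the original analysis, is to list the highest-scoring pairs above the diagonal. The helper `top_pairs` is hypothetical, and the small matrix reuses the top-left 3x3 corner of the cosine similarity table printed earlier.

```python
def top_pairs(sim, k=3):
    """Return the k most similar (i, j, score) pairs with i < j."""
    pairs = []
    n = len(sim)
    for i in range(n):
        # Start at i + 1 to skip the diagonal and the mirrored lower half.
        for j in range(i + 1, n):
            pairs.append((sim[i][j], i, j))
    pairs.sort(reverse=True)
    return [(i, j, score) for score, i, j in pairs[:k]]


# Top-left corner of the sentence similarity matrix shown above.
sim = [
    [1.000000, 0.110228, 0.044853],
    [0.110228, 1.000000, 0.075692],
    [0.044853, 0.075692, 1.000000],
]

for i, j, score in top_pairs(sim, k=2):
    print(f"sentences {i} and {j}: {score:.6f}")
```

The indices returned can then be used to look up the matching sentences in `sent_corpus`.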


## Trying a different approach

In [97]:
rus_list = ['IRN_71_2016.txt','CHN_71_2016.txt','RUS_71_2016.txt','VNM_71_2016.txt']

rus = []

for file_path in rus_list:
    with open(file_path) as f_input:
        rus.append(f_input.read())

rus

['I congratulate Mr. Peter Thomson on his election as President of the General Assembly at its seventy-first session, and I hope that the decisions and initiatives taken by the Assembly will play an effective role in resolving the problems that our world is currently facing.\nFifteen years have passed since the painful terrorist attack in this city, a disaster whose human dimensions moved the entire world. On that day, no one imagined that this occurrence would lead to larger disasters or result in a devastating war in the Middle East and the spread of insecurity across the globe. This war has sown the seeds of borderless terrorism everywhere on Earth. Today, the most pressing question as to why we are facing such a situation should be on the agenda of every international forum. We need to find out which approaches, policies and erroneous actions have paved the way for the spread of insecurity throughout the world and what the world will look like 15 years from now.\nSecurity has becom

In [105]:
#break up speeches into a list of sentences
new_rus = []
for speech in rus:
    sentences = sent_tokenize(speech)
    for sentence in sentences:
        new_rus.append(sentence)    

#tag each sentence as 'RUS'
new_rus = pd.DataFrame(new_rus)
new_rus['group']='RUS'
new_rus

# go back to list


Unnamed: 0,0,group
0,I congratulate Mr. Peter Thomson on his electi...,RUS
1,Fifteen years have passed since the painful te...,RUS
2,"On that day, no one imagined that this occurre...",RUS
3,This war has sown the seeds of borderless terr...,RUS
4,"Today, the most pressing question as to why we...",RUS
...,...,...
322,"As a peace-loving and friendly nation, Viet Na...",RUS
323,"We strive to be a friend, a reliable partner a...",RUS
324,Our commitment to multilateralism and internat...,RUS
325,Viet Nam has decided to present its candidacy ...,RUS


In [95]:
# analyze the first file, which is Afghanistan
with open("AFG_73_2018.txt") as file:
    text = file.read()

# the size of this file will vary; len(text) tells you how big it is
text

'Allow me to start by extending my congratulations to Ms. María Fernanda Espinosa Garcés and wishing much success to the presidency of the General Assembly at its seventy- third session. Let me also assure her that, by working with Member States and the United Nations family, we look forward to advancing the seven priorities set out in the agenda of the General Assembly at its seventy- third session.\nFrom this rostrum, I would like to provide the General Assembly with the latest regarding the situation in Afghanistan and the gains, opportunities and challenges that my nation faces at this critical juncture, in addition to our views on other key global challenges.\nThe record of accomplishments by this institution over the past 73  years  demonstrates  that,  wherever it might be and whoever it might impact, we cannot escape the ripple effect of, or de-link ourselves from, the global, national, communal and human connections that bind  us, whether  in relation  to  the environment,\n \

In [91]:
usa_list = ['CAN_71_2016.txt','FRA_71_2016.txt','USA_71_2016.txt','GBR_71_2016.txt']

usa = []

for file_path in usa_list:
    with open(file_path) as f_input:
        usa.append(f_input.read())

usa

['It is an honour to be present at the General Assembly today. It is wonderful to be here in the great city of New York. Once again this week, New Yorkers showed us how to be resilient and resolute in the face of violent extremism. On behalf of everyone in this Hall, let me directly say to the people of New York that they are a model to the rest of the world, and we thank them.\nExactly one year ago, Canada was in the middle of a long — 78 days on the road, and I can assure the Assembly that, in Canada, there are 78 days’ worth of roads — and closely fought election campaign. It is the responsibility of a leader to spend time with the people they are elected to serve. To get the real stories, it is important to go where people live: coffee shops and church basements, mosques and synagogues, farmers’ markets and public parks. It was in those places that I got the best sense of what Canadians were thinking and how they were doing and, through the politeness — because we Canadians are alw

In [92]:
usa = pd.DataFrame(usa)
usa

Unnamed: 0,0
0,It is an honour to be present at the General A...
1,It is always an honour for me to address the G...
2,As I address the General Assembly in this Hall...
3,It is a great honour for me to address the Gen...


In [None]:
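# NaiveBayesClassifier is assumed to come from TextBlob; the import is missing from the original cell.
# `train` is assumed to be a list of (text, label) pairs defined elsewhere.
from textblob.classifiers import NaiveBayesClassifier

my_classifier = NaiveBayesClassifier(train)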
my_classifier.classify("thanks riri for this amazing foundation")

prob_dist = my_classifier.prob_classify('Fenty foundation is now the only foundation I use')
print(prob_dist.max())
print(round(prob_dist.prob('13-17'), 2))
print(round(prob_dist.prob('25-34'),2))
print(round(prob_dist.prob('over54'),2))
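
The cell above still carries example strings and labels from a different exercise. As a sketch of how the same idea could look with this project's two worldviews, here is a small hand-rolled multinomial naive Bayes classifier in pure Python. It is not the TextBlob implementation; the class name `TinyNaiveBayes`, the labels `OPEN` and `SOVEREIGN`, and the four training sentences are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def __init__(self, train):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in train:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def log_scores(self, text):
        """Return the unnormalised log probability of each label."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / denom)
            scores[label] = score
        return scores

    def prob_classify(self, text):
        """Return a dict mapping each label to its normalised probability."""
        scores = self.log_scores(text)
        top = max(scores.values())
        weights = {lab: math.exp(s - top) for lab, s in scores.items()}
        z = sum(weights.values())
        return {lab: w / z for lab, w in weights.items()}

    def classify(self, text):
        probs = self.prob_classify(text)
        return max(probs, key=probs.get)


# Toy training sentences, invented for illustration only.
train = [
    ("the internet must remain open and free", "OPEN"),
    ("free expression online is a human right", "OPEN"),
    ("states have sovereignty over their information space", "SOVEREIGN"),
    ("information security requires state control", "SOVEREIGN"),
]
clf = TinyNaiveBayes(train)
print(clf.classify("an open and free internet"))
print(clf.prob_classify("state control of information security"))
```

In practice the training list would hold the labelled sentences from the open and sovereign groups, and each Digital Decider sentence would be passed to `classify` or `prob_classify`.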