## UN Security Council Speeches `9 points`

Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KGVSYH

Description from [Data Is Plural](https://www.data-is-plural.com/archive/2019-07-17-edition/):

> Two decades of UN Security Council debates. A group of researchers have collected, parsed, and added metadata to all UN Security Council debates from 1995 through 2017. The dataset includes more than 65,000 speeches (with information about each speaker), extracted from nearly 5,000 meeting transcripts.

**Topics:**

* Reading in many files
* Extracting content from strings (regex, maybe)
* K-means clustering

## Opening the dataset `2 points`

You're interested in the `.tar` file. It extracts just like a `.zip` file and creates a folder containing tens of thousands of files.

In [1]:
import glob
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from sklearn.cluster import KMeans


In [2]:
# import tarfile
# my_tar = tarfile.open('files/speeches.tar')
# my_tar.extractall('files/speeches/') 
# my_tar.close()

### How many speeches does this dataset have?

In [3]:
files = glob.glob('files/speeches/*')
len(files)

82165

### Put the speeches into a dataframe

In [4]:
# Read the full text of every speech, in the same order as `files`
data = []
for f in files:
    with open(f, "r") as myfile:
        data.append(myfile.read())

df_dict = {'Filename':files,'Text':data}
df = pd.DataFrame(df_dict)

In [5]:
df.head()

Unnamed: 0,Filename,Text
0,files/speeches/UNSC_2000_SPV.4111Resumption1_s...,Mr. J erandi (Tunisia) (spoke in French): I sh...
1,files/speeches/UNSC_2016_SPV.7658_spch021.txt,Mr. Suarez Borges (Bolivarian Republic of Vene...
2,files/speeches/UNSC_2000_SPV.4118Resumption1_s...,Mr. Enkhsaikhan (Mongolia): It is a great hono...
3,files/speeches/UNSC_2018_SPV.8431_spch012.txt,"Mr. Allen (United Kingdom): Five years ago,\nf..."
4,files/speeches/UNSC_2020_SPV.2020_1129_spch015...,I would like to start by thanking Acting Speci...


In [6]:
# Strip the folder prefix; regex=False treats the pattern as a literal string
df['Filename'] = df.Filename.str.replace('files/speeches/', '', regex=False)

df.head()

Unnamed: 0,Filename,Text
0,UNSC_2000_SPV.4111Resumption1_spch002.txt,Mr. J erandi (Tunisia) (spoke in French): I sh...
1,UNSC_2016_SPV.7658_spch021.txt,Mr. Suarez Borges (Bolivarian Republic of Vene...
2,UNSC_2000_SPV.4118Resumption1_spch004.txt,Mr. Enkhsaikhan (Mongolia): It is a great hono...
3,UNSC_2018_SPV.8431_spch012.txt,"Mr. Allen (United Kingdom): Five years ago,\nf..."
4,UNSC_2020_SPV.2020_1129_spch015.txt,I would like to start by thanking Acting Speci...


### How many speeches are from each year? `1 point`

You'll want to create a new column.

In [7]:
df['year'] = df.Filename.apply(lambda st: st[st.find("UNSC_")+5:st.find("_SPV")])

df.year.value_counts()

2019    6168
2018    6160
2017    5411
2016    5005
2015    4790
2014    4769
2020    4308
2011    3198
2013    3104
2009    3088
2012    3057
2010    3036
2002    3026
2003    3023
2008    2997
2004    2819
2001    2655
2000    2571
2006    2549
2007    2071
2005    1879
1999    1613
1996    1394
1995    1374
1998    1187
1997     913
Name: year, dtype: int64
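
As an alternative to slicing with `find`, the year can be pulled out with a regular expression. This is a minimal sketch on three toy filenames that follow the dataset's naming pattern; the `sample` frame is made up for illustration and is not part of the notebook.

```python
import pandas as pd

# Toy filenames in the same pattern as the dataset (illustrative only)
sample = pd.DataFrame({
    "Filename": [
        "UNSC_2000_SPV.4111Resumption1_spch002.txt",
        "UNSC_2016_SPV.7658_spch021.txt",
        "UNSC_2020_SPV.2020_1129_spch015.txt",
    ]
})

# Capture the four digits right after "UNSC_"; expand=False returns a Series
sample["year"] = (
    sample["Filename"]
    .str.extract(r"^UNSC_(\d{4})_", expand=False)
    .astype(int)
)

# Sorting by index lists the years chronologically instead of by frequency
year_counts = sample["year"].value_counts().sort_index()
print(year_counts)
```

Storing the year as an integer also lets it line up with the numeric `year` column in `meta.tsv`.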

## Speech topics `2 points`

### Join with `meta.tsv` to see the topic of each speech

You'll need to massage the filename a lot.

In [8]:
meta = pd.read_csv('files/meta.tsv',sep='\t')

meta
# meta.info()

Unnamed: 0,basename,date,num_speeches,topic,pressrelease,outcome,year,month,day
0,UNSC_1995_SPV.3486,6 January 1995,1,Bosnia and Herzegovina,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,6
1,UNSC_1995_SPV.3487,12 January 1995,40,Federal Republic of Yugoslavia (Serbia and Mon...,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,12
2,UNSC_1995_SPV.3488,12 January 1995,12,Georgia,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,12
3,UNSC_1995_SPV.3489,13 January 1995,16,Liberia,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,13
4,UNSC_1995_SPV.3490,13 January 1995,1,Western Sahara,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,13
...,...,...,...,...,...,...,...,...,...
5743,UNSC_2020_SPV.8774,12 November 2020,4,Reports of the Secretary-General on the Sudan ...,https://www.un.org/press/en/2020/sc14354.doc.htm,http://undocs.org/en/S/RES/2550(2020),2020,11,12
5744,UNSC_2020_SPV.8775,12 November 2020,8,The situation in Somalia,https://www.un.org/press/en/2020/sc14355.doc.htm,http://undocs.org/en/S/RES/2551(2020),2020,11,12
5745,UNSC_2020_SPV.8776,12 November 2020,4,The situation in the Central African Republic,https://www.un.org/press/en/2020/sc14356.doc.htm,http://undocs.org/en/S/RES/2552(2020),2020,11,12
5746,UNSC_2020_SPV.8777,17 November 2020,3,The situation in Mali,http://www.un.org/press/en/2020/sc14359.doc.htm,,2020,11,17


In [9]:
df.sample(10)
# df.info()

Unnamed: 0,Filename,Text,year
47216,UNSC_2007_SPV.5655_spch013.txt,"Ms. Tincopa (Peru) (spoke in Spanish): We, too...",2007
68818,UNSC_1997_SPV.3808_spch005.txt,Mr. Park (Republic of Korea): We are gravely\n...,1997
26713,UNSC_2020_SPV.8699Resumption1_spch034.txt,"Mr. Ibragimov (Uzbekistan): First of all, allo...",2020
72982,UNSC_2018_SPV.8414_spch020.txt,Mr. Cohen (United States ofAmerica): I thank y...,2018
65550,UNSC_2019_SPV.8511_spch015.txt,Mr. Ipo (Cote d'Ivoire) (spoke in French): My\...,2019
13584,UNSC_1996_SPV.3628_spch033.txt,The President: I thank the representative of\n...,1996
78631,UNSC_2015_SPV.7435_spch005.txt,Mr. Oyarzun Marchesi (Spain) (Spoke in Spanish...,2015
32676,UNSC_1996_SPV.3702_spch048.txt,"Mr. Rubadiri (Malawi): About three months ago,...",1996
33641,UNSC_2016_SPV.7674_spch021.txt,Mr. Ciss (Senegal) (spoke in French): The\ndel...,2016
29555,UNSC_2011_SPV.6510Resumption1_spch018.txt,Mr. Waxman (Israel): Allow me to congratulate\...,2011


In [10]:
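# Meeting ID shared by all speeches in a meeting, e.g. UNSC_2000_SPV.4111 (the dot is escaped so it matches literally)
df['basename'] = df.Filename.str.extract(r'(UNSC_\d{4}_SPV\.\d{4})', expand=False)

topics = pd.merge(df, meta, on='basename')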

topics

Unnamed: 0,Filename,Text,year_x,basename,date,num_speeches,topic,pressrelease,outcome,year_y,month,day
0,UNSC_2000_SPV.4111Resumption1_spch002.txt,Mr. J erandi (Tunisia) (spoke in French): I sh...,2000,UNSC_2000_SPV.4111,13 March 2000,13,Sierra Leone,http://www.un.org/press/en/2000/20000313.sc682...,,2000,3,13
1,UNSC_2000_SPV.4111_spch002.txt,Mr. Yel'chenko (Ukraine): On behalf of my\nGov...,2000,UNSC_2000_SPV.4111,13 March 2000,13,Sierra Leone,http://www.un.org/press/en/2000/20000313.sc682...,,2000,3,13
2,UNSC_2000_SPV.4111Resumption1_spch001.txt,Mrs. Ashipala-Musavyi (Namibia): At the outset...,2000,UNSC_2000_SPV.4111,13 March 2000,13,Sierra Leone,http://www.un.org/press/en/2000/20000313.sc682...,,2000,3,13
3,UNSC_2000_SPV.4111_spch012.txt,"Miss Durrant (Jamaica): I, too, wish to thank ...",2000,UNSC_2000_SPV.4111,13 March 2000,13,Sierra Leone,http://www.un.org/press/en/2000/20000313.sc682...,,2000,3,13
4,UNSC_2000_SPV.4111_spch011.txt,Mr. Chen Xu (China) (spoke in Chinese): I shou...,2000,UNSC_2000_SPV.4111,13 March 2000,13,Sierra Leone,http://www.un.org/press/en/2000/20000313.sc682...,,2000,3,13
...,...,...,...,...,...,...,...,...,...,...,...,...
79477,UNSC_2007_SPV.5772_spch001.txt,The President: I should like to inform the\nCo...,2007,UNSC_2007_SPV.5772,29 October 2007,1,Côte d'Ivoire,http://www.un.org/press/en/2007/sc9158.doc.htm,http://www.un.org/en/ga/search/view_doc.asp?sy...,2007,10,29
79478,UNSC_1996_SPV.3677_spch001.txt,The President (interpretation from French): As...,1996,UNSC_1996_SPV.3677,3 July 1996,1,Croatia,http://www.un.org/press/en/1996/19960703.sc623...,http://www.un.org/en/ga/search/view_doc.asp?sy...,1996,7,3
79479,UNSC_2016_SPV.7848_spch001.txt,The President (spoke in Spanish): The Security...,2016,UNSC_2016_SPV.7848,21 December 2016,1,Peace consolidation in West Africa,http://www.un.org/press/en/2016/sc12650.doc.htm,http://www.un.org/en/ga/search/view_doc.asp?sy...,2016,12,21
79480,UNSC_2002_SPV.4548_spch001.txt,The President (spoke in Arabic): As this is th...,2002,UNSC_2002_SPV.4548,5 June 2002,1,Democratic Republic of the Congo,http://www.un.org/press/en/2002/sc7421.doc.htm,http://www.un.org/en/ga/search/view_doc.asp?sy...,2002,6,5
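
The merged frame has fewer rows than the 82,165 files counted earlier, so some speeches found no matching row in `meta.tsv`. One way to see which ones is a left merge with `indicator=True`. This is a sketch on toy data: `speeches` and `meta_toy` are stand-ins for `df` and `meta`, and the cause of the mismatch in the real data is not verified here.

```python
import pandas as pd

# Toy stand-ins for the speech and metadata frames (illustrative only)
speeches = pd.DataFrame({
    "Filename": [
        "UNSC_2000_SPV.4111Resumption1_spch002.txt",
        "UNSC_2000_SPV.4111_spch012.txt",
        "UNSC_2020_SPV.2020_1129_spch015.txt",
    ]
})
meta_toy = pd.DataFrame({
    "basename": ["UNSC_2000_SPV.4111"],
    "topic": ["Sierra Leone"],
})

# Same meeting-ID pattern as in the notebook
speeches["basename"] = speeches["Filename"].str.extract(
    r"(UNSC_\d{4}_SPV\.\d{4})", expand=False
)

# how="left" keeps every speech; the _merge column records whether metadata was found
checked = speeches.merge(meta_toy, on="basename", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "of", len(checked), "speeches have no metadata row")
print(unmatched["Filename"].tolist())
```

Looking at the unmatched filenames usually shows which naming variants the extraction pattern still needs to handle.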


### What are the most common speech topics?

In [11]:
topics.topic.value_counts().head(10)

Maintenance of international peace and security                         4907
Women and peace and security                                            3854
Middle East situation, including the Palestinian question               3778
The situation in the Middle East                                        3632
The situation in the Middle East, including the Palestinian question    3440
Children and armed conflict                                             2415
Protection of civilians in armed conflict                               2061
Afghanistan                                                             1734
Reports of the Secretary-General on the Sudan and South Sudan           1468
Iraq-Kuwait                                                             1378
Name: topic, dtype: int64

### Do you find these classifications useful? Why or why not?

In [12]:
# It depends on the goal. The labels help when looking for one specific agenda item, but they are not standardized (the Middle East alone appears under three differently worded labels in the top ten), so they are hard to use for general analysis.
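
To illustrate the standardization problem, here is a rough sketch that folds two differently worded labels into one key. `normalize_topic` is a hypothetical helper written for this example, the counts are copied from the top-ten output above, and whether two labels really describe the same agenda item is a judgment call that the regexes cannot make.

```python
import re
import pandas as pd

# Counts copied from the value_counts() output above
topics_toy = pd.DataFrame({
    "topic": [
        "Middle East situation, including the Palestinian question",
        "The situation in the Middle East",
        "The situation in the Middle East, including the Palestinian question",
        "Afghanistan",
    ],
    "speeches": [3778, 3632, 3440, 1734],
})

def normalize_topic(label):
    """Hypothetical helper: strip boilerplate wording so label variants share a key."""
    text = label.lower()
    # "the situation in the x ..." -> "x ..."
    text = re.sub(r"^the situation in the ", "", text)
    # "x situation, ..." -> "x, ..."
    text = re.sub(r"^(.*?) situation\b", r"\1", text)
    return text.strip()

topics_toy["topic_norm"] = topics_toy["topic"].map(normalize_topic)
merged_counts = (
    topics_toy.groupby("topic_norm")["speeches"].sum().sort_values(ascending=False)
)
print(merged_counts)
```

In this sketch the two "including the Palestinian question" labels collapse into one group, while "The situation in the Middle East" stays separate.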

## Automatic organization `4 points`

Using k-means clustering, try to organize the speeches into 5 to 10 groups. Play with hyperparameters like `max_df` and the stopword list to try to improve on the existing speech `topics` column.

In [13]:
# Needs the NLTK stopword corpus: run nltk.download('stopwords') once if it is missing
stopwords_list = nltk.corpus.stopwords.words('english')
extra_stopwords = ['nations','international','also','support','representative','thank','table','addressed',
'government','give','spoke','item','meeting','invite','kind','words','agenda','procedure','states','would','make','take','speaker','country','next']

stopwords_list.extend(extra_stopwords)

In [14]:
vectorizer = TfidfVectorizer(stop_words=stopwords_list,token_pattern=r'(?u)\b[A-Za-z]+\b', max_df=0.6,max_features=6000)
matrix = vectorizer.fit_transform(df.Text)

In [15]:
number_of_clusters = 5
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)

KMeans(n_clusters=5)

In [16]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
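terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i in range(number_of_clusters):
    # The eight highest-weighted terms of this cluster's centroid
    top_words = [terms[ind] for ind in order_centroids[i, :8]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))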

Top terms per cluster:
Cluster 0: peace must efforts resolution general african political people
Cluster 1: seat list inscribed statement rose consideration accordance provisional
Cluster 2: floor spanish french briefing chinese arabic republic russian
Cluster 3: palestinian israel israeli peace east gaza palestine middle
Cluster 4: women children conflict armed protection sexual civilians violence
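
A possible next step is to attach each speech's cluster label to the dataframe and compare the clusters with the existing `topic` column. The sketch below does this on four made-up texts; the `toy` frame and its topic labels are illustrative, and `random_state=0` is only there to make the toy run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up "speeches" with two clearly separate themes
toy = pd.DataFrame({
    "Text": [
        "women sexual violence conflict protection",
        "women children violence protection civilians",
        "israel palestinian gaza peace",
        "palestinian israeli gaza middle east",
    ],
    "topic": [
        "Women and peace and security",
        "Children and armed conflict",
        "Middle East",
        "Middle East",
    ],
})

toy_matrix = TfidfVectorizer().fit_transform(toy["Text"])
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_matrix)

# labels_ holds one cluster id per row, in the same order as the input
toy["cluster"] = toy_km.labels_

# How many speeches fall in each cluster, and how clusters line up with topics
print(toy["cluster"].value_counts())
print(pd.crosstab(toy["cluster"], toy["topic"]))
```

On the real data, the same two lines (`df['cluster'] = km.labels_` and a crosstab against `topic` after the merge) would show whether a cluster such as the procedural one is spread across every topic.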
