# FDS Assignment 1: UN Debates, World Happiness and International Trade

We can start by loading the data:

In [6]:
import os
import numpy as np
import pandas as pd

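sessions = np.arange(25, 76)
data = []

for session in sessions:
    year = 1945 + session
    directory = "./TXT/Session " + str(session) + " - " + str(year)
    for filename in os.listdir(directory):
        # Skip hidden files before trying to open them
        if filename.startswith("."):
            continue
        # The country code is the first underscore-separated part of the filename
        country = filename.split("_")[0]
        # The context manager closes each file once it has been read
        with open(os.path.join(directory, filename), encoding='utf8') as f:
            data.append([session, year, country, f.read()])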

df_speech = pd.DataFrame(data, columns=['Session','Year','ISO-alpha3 Code','Speech'])
df_speech.head(50)

Unnamed: 0,Session,Year,ISO-alpha3 Code,Speech
0,25,1970,ALB,33: May I first convey to our President the co...
1,25,1970,ARG,177.\t : It is a fortunate coincidence that pr...
2,25,1970,AUS,100.\t It is a pleasure for me to extend to y...
3,25,1970,AUT,155.\t May I begin by expressing to Ambassado...
4,25,1970,BEL,"176. No doubt each of us, before coming up to ..."
5,25,1970,BLR,71.\t. We are today mourning the untimely deat...
6,25,1970,BOL,135.\t I wish to congratulate the President o...
7,25,1970,BRA,"1.\tMr. President, I should like, first of all..."
8,25,1970,CAN,The General Assembly is fortunate indeed to ha...
9,25,1970,CMR,: A year ago I came here as the Acting Preside...


In [8]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\agniv\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\agniv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\agniv\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [92]:
from nltk.corpus import stopwords
import re
import string

def preprocess(speech):
    """Strip paragraph numbering and punctuation from a speech and lowercase it."""
    sentences = speech.split("\n")
    new_speech = []
    for sentence in sentences:
        # Remove numbering such as "33: " (raw string so \d and \s reach the regex engine unchanged)
        sentence = re.sub(r'\d+[:.]\s', '', sentence)
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))
        sentence = sentence.lower()
        # tokens_without_sw = [word for word in sentence.split() if word not in stopwords.words("english")]
        new_speech.append(sentence)
    return ' '.join(new_speech)
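
The numbering pattern used in `preprocess` is not anchored to the start of a line, so it also removes numbers that end a sentence in the middle of a speech. The sketch below compares it with an anchored alternative on a few sample lines. The helper `clean_line`, the constant names, the anchored pattern, and the whitespace normalisation are illustrative assumptions, not part of the notebook.

```python
import re
import string

# Pattern as used in preprocess(): matches digits + ':' or '.' + whitespace anywhere
LOOSE = r'(\d+)[:.]\s'
# Hypothetical alternative: only match numbering at the very start of the line
ANCHORED = r'^\s*\d+[:.]\s*'

def clean_line(line, pattern):
    """Remove numbering matched by `pattern`, drop punctuation, lowercase, tidy spaces."""
    line = re.sub(pattern, '', line)
    line = line.translate(str.maketrans('', '', string.punctuation))
    # split()/join() collapses the runs of whitespace left behind by the removals
    return ' '.join(line.lower().split())

samples = [
    "33: May I first convey to our President",
    "177.\t : It is a fortunate coincidence",
    "It began in 1970. Then it grew.",
]

for sample in samples:
    print(repr(clean_line(sample, LOOSE)), '|', repr(clean_line(sample, ANCHORED)))
```

With the loose pattern the year in the third sample disappears ("it began in then it grew"), while the anchored pattern keeps it ("it began in 1970 then it grew"). Whether that matters depends on whether years and figures should survive preprocessing.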

In [93]:
# from nltk import word_tokenize

df_1970 = df_speech.set_index(["Year", "ISO-alpha3 Code"]).loc[1970]
speeches_1970 = df_1970['Speech']

## ['', '', ... ]
# speeches = [word_tokenize(speech) for speech in text.values]
processed_speeches = [preprocess(speech) for speech in speeches_1970.values]

# Printing every processed speech at once produces a very large amount of cell output
print(processed_speeches)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



## TF-IDF

*Tf* stands for term frequency, and *tf-idf* for term frequency times inverse document frequency. It is a common term-weighting scheme in information retrieval that has also proved useful in document classification. Rather than using the raw count of a token in a document, tf-idf scales down the weight of tokens that occur very frequently across the corpus, since these are empirically less informative than tokens that occur in only a small fraction of the documents. More detail is available in the [scikit-learn `TfidfTransformer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
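
To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on three made-up sentences (they are not taken from the speeches). It assumes the default settings of scikit-learn's `TfidfVectorizer`: a smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document row. The toy documents, the whitespace tokenisation, and the helper `tfidf_row` are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "peace and security",
    "peace and development",
    "trade and development and growth",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default, smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised tf-idf weights of one document, ordered like vocab."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:12s} {weight:.3f}")
```

In this toy corpus "and" appears in every document, so its idf is exactly 1, the smallest possible value, whereas "security" appears in only one document and receives the largest weight in the first row.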

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(speeches_1970.values)
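# get_feature_names() was deprecated in scikit-learn 1.0 and later removed;
# get_feature_names_out() is its replacement
feature_names = vectorizer.get_feature_names_out()

# One row per speech, one column per vocabulary term
print(pd.DataFrame(X.toarray(), columns=feature_names))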

         000  054        10       100       101  101st       102       103  \
0   0.000000  0.0  0.000000  0.000000  0.000000    0.0  0.000000  0.000000   
1   0.000000  0.0  0.002787  0.000000  0.000000    0.0  0.000000  0.000000   
2   0.003554  0.0  0.000000  0.003323  0.003731    0.0  0.003731  0.003731   
3   0.000000  0.0  0.006018  0.000000  0.000000    0.0  0.000000  0.000000   
4   0.000000  0.0  0.003395  0.004191  0.000000    0.0  0.000000  0.000000   
..       ...  ...       ...       ...       ...    ...       ...       ...   
65  0.000000  0.0  0.002637  0.000000  0.000000    0.0  0.000000  0.000000   
66  0.000000  0.0  0.000000  0.005021  0.005636    0.0  0.005636  0.005636   
67  0.000000  0.0  0.000000  0.005643  0.006335    0.0  0.006335  0.006335   
68  0.015939  0.0  0.004024  0.000000  0.000000    0.0  0.000000  0.000000   
69  0.000000  0.0  0.003347  0.004131  0.004637    0.0  0.004637  0.004637   

         104       105  ...  zeal  zealand  zealous  zimbabwe  