# getting the data

Today, we will work with the UN General Debate dataset. The corpus consists of 7,507 speeches held at the annual sessions of the United Nations General Assembly from 1970 to 2016. It was created in 2017 by Mikhaylov, Baturo, and Dasandi at Harvard “for understanding and measuring state preferences in world politics.” Each of the almost 200 countries in the United Nations has the opportunity to present its views on global topics such as international conflicts, terrorism, or climate change at the annual General Debate.
The analysis of this data follows the book *Blueprints for Text Analytics Using Python*:

- code repository of the book: https://github.com/blueprints-for-text-analytics-python/blueprints-text
- the data can be downloaded from that repository, but it is easier to use the copy on my server (see the next section)
  - https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/data/un-general-debates/un-general-debates-blueprint.csv.gz



## downloading some toy data

only once!

In [1]:
# run this only if you don't have the data yet!
# you can also simply download the gzip file, unpack it, and put it manually next to your notebook:
# https://gerdes.fr/saclay/informationRetrieval/un-general-debates-blueprint.csv.gz

# !wget https://gerdes.fr/saclay/informationRetrieval/un-general-debates-blueprint.csv.gz
# import gzip, shutil
# # read the compressed file with gzip.open, write the decompressed bytes with plain open
# with gzip.open('un-general-debates-blueprint.csv.gz', 'rb') as f_in:
#     with open('un-general-debates-blueprint.csv', 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)
        

In [2]:
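# this turns on the autotimer, so that every cell shows its execution time below it
try:
    %load_ext autotime
except ModuleNotFoundError:
    # %pip installs into the environment of the running kernel (a plain !pip may hit another Python)
    %pip install ipython-autotime
    %load_ext autotime
# in order to stop using the autotimer: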
# %unload_ext autotime

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
from nltk.tokenize import word_tokenize
from tqdm.notebook import tqdm
from wordcloud import WordCloud
import re

In [None]:
df = pd.read_csv("un-general-debates-blueprint.csv")
df.sample(22) #, random_state=53)
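
As an alternative to unpacking the file by hand, pandas can read a gzip-compressed CSV directly, because `read_csv` infers the compression from the `.gz` extension. The following is a minimal sketch on a made-up toy file (the file name `toy.csv.gz` and its contents are assumptions for illustration, not part of the dataset):

```python
import gzip
import os
import tempfile

import pandas as pd

# a tiny stand-in for the real corpus (hypothetical content)
csv_text = "year,country,text\n1970,USA,hello\n1971,FRA,bonjour\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.gz")
    # write a gzip-compressed CSV file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(csv_text)
    # compression='infer' is the default, so the .gz suffix is enough
    toy = pd.read_csv(path)

print(toy)
```

With the real data this would presumably become `pd.read_csv("un-general-debates-blueprint.csv.gz")`, without any unpacking step.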

## Let's get to know the data (and Pandas):

In [None]:
df.columns, df.dtypes

In [None]:
df.info(memory_usage='deep')

## Adding the "length" column, describing the dataframe

In [None]:
df['length'] = df['text'].str.len()
df.describe().T

#### 🚧 todo: how long did the longest speech last?

The length is given in characters. How much text fits on one page (11pt)? In English: ~ 600 words.

That's approximately how many characters (including spaces)?

xxxx

What's your guess for 
German? French? Russian? Thai? Japanese?

In English, how many words per minute? ~ 150



In [None]:
# how many words for the longest speech?
print(72000/6)
# 🚧 todo:
# how many pages for the longest speech?
print(xxx)

# how long to read one page?
print(xxx)

# how long to read the longest speech?
print(xxx,'minutes')


## mean < average -> ?

terms you probably know: mode ? mean ? average ?

In [None]:
df[['country', 'speaker']].describe().T

## NaN ≠ NA
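NaN stands for "Not a Number": it is the floating-point result of undefined operations such as 0/0.

NA is generally interpreted as a missing value. In R it has various forms (NA_integer_, NA_real_, etc.); pandas uses NaN, `None`, `pd.NA`, or `NaT` as markers for missing values, and `isna()` detects all of them.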

https://stats.stackexchange.com/questions/5686/what-is-the-difference-between-nan-and-na
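
A small sketch of how this looks in pandas and NumPy, on a made-up Series (the values are assumptions for illustration): NaN is a float that is not even equal to itself, and `isna()` flags both NaN and `None` as missing.

```python
import numpy as np
import pandas as pd

# NaN is a regular float value, but it compares unequal to everything, itself included
nan_equals_itself = (np.nan == np.nan)

# in a float Series, None is converted to NaN; isna() marks both as missing
s = pd.Series([1.0, np.nan, None])
missing = s.isna().tolist()

# in an object (string) Series, None stays None and is still reported as missing
names = pd.Series(["Obama", None, "Putin"])
missing_names = names.isna().tolist()

print(nan_equals_itself, missing, missing_names)
```

This is why missing values are found with `isna()` and never with `== np.nan`.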

In [None]:
df.isna().sum()

In [None]:
df[df['position'].isna()]

In [None]:
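# assign the result back: fillna(..., inplace=True) on a selected column is unreliable in recent pandas
df['speaker'] = df['speaker'].fillna('unknown')
df['position'] = df['position'].fillna('unknown')
df[df['position'].isna()]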

# categorical values vs numerical values

In [None]:
df[df['speaker'].str.contains('Bush')]['speaker'].value_counts()

In [None]:
df['length'].plot(kind='box', vert=False)


In [None]:
df['length'].plot(kind='hist', bins=30) # , figsize=(8,2)

### Kernel density estimation

https://en.wikipedia.org/wiki/Kernel_density_estimation
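
To make the idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only: every sample contributes one small bell curve, and the estimate is their average. The function name `gaussian_kde_sketch`, the toy samples, and the bandwidth of 1.0 are assumptions for illustration; seaborn relies on its own implementation and chooses the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on the samples, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # one row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

toy_samples = [1.0, 2.0, 2.5, 6.0]        # hypothetical data
grid = np.linspace(-10.0, 20.0, 2001)
density = gaussian_kde_sketch(toy_samples, grid, bandwidth=1.0)

# a density should integrate to (approximately) one
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve, a larger one a smoother curve.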

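`sns.distplot` is deprecated since seaborn 0.11 and triggers a "FutureWarning: `distplot` is a deprecated function". Updating scipy does not change that; the replacement is `sns.histplot` (or `sns.displot`).

if other warnings persist and bother you:

In [None]:
# only if you got warnings!!!
import warnings
warnings.filterwarnings("ignore")

In [None]:
#plt.figure(figsize=(8, 2))
# stat='density' normalizes the histogram like distplot did
sns.histplot(df['length'], bins=30, kde=True, stat='density');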

# Seaborn docs?
https://seaborn.pydata.org/index.html  
https://seaborn.pydata.org/generated/seaborn.distplot.html

## from where?

catplot shows the relationship between a numerical and one or more categorical variables.
https://seaborn.pydata.org/generated/seaborn.catplot.html

In [None]:
sns.catplot(data=df, x="country", y="length")

In [None]:
# how to build a selection:
df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])

In [None]:
# using the selection
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])
sns.catplot(data=df[where], x="country", y="length", kind='box')
sns.catplot(data=df[where], x="country", y="length", kind='violin')

## significant differences?

Student's t-test? ANOVA?

Rule of thumb: if the boxes (marking the quartiles) don't overlap each other and the sample size is at least 10, then the two groups being compared should have different medians at the 5% level: https://stats.stackexchange.com/questions/262495/reading-box-and-whisker-plots-possible-to-glean-significant-differences-between

With `notch=True`, the notch around each median approximates a confidence interval for that median, so non-overlapping notches are a finer visual hint than non-overlapping boxes.

In [None]:
sns.catplot(data=df[where], x="country", y="length", kind='box', notch=True)

## time?

size() returns the number of rows per group  
Why number of countries?

In [None]:
df.groupby('year').size().plot(title="Number of Countries")

when more people want to speak, ...?

In [None]:
df.groupby('year').agg({'length': 'mean'}).plot(title="Avg. Speech Length", ylim=(0,30000))
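
The two group-by calls above can be checked on a toy table where the result is easy to verify by eye. This is a minimal sketch with made-up numbers (the frame `toy` is an assumption for illustration): `size()` counts the rows per group, and `agg({'length': 'mean'})` averages one column per group.

```python
import pandas as pd

# hypothetical mini corpus: one row per speech
toy = pd.DataFrame({
    "year": [1970, 1970, 1971, 1971, 1971],
    "country": ["USA", "FRA", "USA", "FRA", "CHN"],
    "length": [100, 300, 200, 400, 600],
})

# number of rows (speeches) per year
speeches_per_year = toy.groupby("year").size()

# mean speech length per year
mean_length_per_year = toy.groupby("year").agg({"length": "mean"})

print(speeches_per_year)
print(mean_length_per_year)
```

Since every country gives at most one speech per year, the number of rows per year is also the number of speaking countries.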

# Tokenization

### 🚧 todo:
Describe in one sentence the difference between the tokenizations. Which one is your favorite and why?


In [None]:
# 1.
text = "Let's all together defeat last year's problem, SARS-CoV-2, in 2022!"
'|'.join(text.split()),len(text.split())

In [None]:
# 2.
nochar = re.compile(r'\W+')  # raw string: '\W' without r is an invalid escape sequence
'|'.join(nochar.split(text)),len(nochar.split(text))

In [None]:
# 3.
nochar = re.compile(r'(\W+)')  # raw string, see above
'|'.join(nochar.split(text)),len(nochar.split(text))

In [None]:
#4.
charorhyphen = re.compile(r'[\w-]+')
'|'.join(charorhyphen.findall(text)),len(charorhyphen.findall(text))

### using a specialized class: nltk

In [None]:
#5.
'|'.join(word_tokenize(text)),len(word_tokenize(text))

Note that these are idiosyncratic rules for English. Think of *viens-tu*, *où va-t-il*, *Kaffeetasse*, *cantolo*, *我爱你*, ...

and it's slow!

In [None]:
for t in tqdm(df['text'][:100]):
    toks = word_tokenize(t)

### so be patient for this line:

In [None]:
df['tokens'] = df['text'].map(word_tokenize)
df['num_tokens'] = df['tokens'].map(len)

In [None]:
display(df)

In [None]:
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS', 'FRG', 'DEU'])
sns.catplot(data=df[where], x="country", y="num_tokens", kind='box')

## 🚧 todo: When speaking English, do Germans use longer words?

- Compare to English natives and French speakers using notched box plots.
- Is the result significant?
- How do you explain this?

In [None]:
# 🚧 todo:
df['avg_wordsize'] = xxx
display(df)

In [None]:
# 🚧 todo:
where = df['country'].isin(xxx
sns.catplot(data=df[where],xxx

#### 🚧 todo:
answer: 
xxx

# Let's Zipf it!
## skim through this section if you have followed Hands-on NLP!
but execute the code so that we have the freq_df and start again at word clouds
### Let's first flatten the list

In [None]:
alltoks = [item for sublist in df['tokens'] for item in sublist] 
len(alltoks)

In [None]:
text = "Let's all together defeat last year's problem, SARS-CoV-2, in 2021!"
tokens = word_tokenize(text)
counter = Counter(tokens)
counter

### What are the most common words of English?

In [None]:
counter = Counter(alltoks)
counter.most_common(22)

for even bigger datasets, it may be advisable to count iteratively, document by document, instead of first building one huge flat list:

In [None]:
counter = Counter()
df['tokens'].map(counter.update)
counter.most_common(22)
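
The same iterative counting, written as an explicit loop on a toy example (the three token lists are assumptions for illustration). `Counter.update` adds the counts of each new document to the running total, so only one document has to be in memory at a time.

```python
from collections import Counter

# hypothetical tokenized documents
toy_docs = [
    ["the", "united", "nations", "the"],
    ["peace", "and", "the", "nations"],
    ["the", "peace"],
]

toy_counter = Counter()
for tokens in toy_docs:
    # add this document's token counts to the running total
    toy_counter.update(tokens)

print(toy_counter.most_common(3))
```

In contrast to `df['tokens'].map(counter.update)`, the loop does not build a throw-away Series of `None` values.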

In [None]:
freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
freq_df.sort_values('freq',  inplace=True, ascending=False)
freq_df

In [None]:
freq_df.head(22).plot(kind='bar')


In [None]:
freq_df.head(2222).plot()

In [None]:
freq_df.head(2222).plot(loglog=True)

further reading:  
https://en.wikipedia.org/wiki/Zipf's_law  
https://stats.stackexchange.com/questions/6780/how-to-calculate-zipfs-law-coefficient-from-a-set-of-top-frequencies
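
A straight line in the log-log plot means that frequency is roughly proportional to 1 / rank^s. As a rough sketch, the exponent s can be estimated with a least-squares line through the log-log points. The function name `zipf_exponent` and the synthetic frequencies are assumptions for illustration, and this simple fit is known to be a crude estimator (the second link above discusses better ones).

```python
import numpy as np

def zipf_exponent(freqs):
    """Estimate s in freq ~ C / rank**s by a linear fit in log-log space."""
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]   # descending frequencies
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# synthetic frequencies that follow Zipf's law exactly, with s = 1
ideal_freqs = 1000.0 / np.arange(1, 201)
s_ideal = zipf_exponent(ideal_freqs)
print(round(s_ideal, 3))
```

On the corpus, one could try `zipf_exponent(freq_df['freq'].head(2222))` and compare the result with the slope seen in the plot.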

# Word cloud

http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud

In [None]:
text = df.query("year==2015 and country=='USA'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

In [None]:
plt.subplots(1, 2, figsize=(20, 4))

text = df.query("country=='USA'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)
plt.subplot(1, 2, 1)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

text = df.query("country=='RUS'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)

plt.subplot(1, 2, 2)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

plt.tight_layout()

In [None]:
wc = WordCloud(max_words=100, stopwords=freq_df.head(50).index)
wc.generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

the `generate_from_frequencies` function builds the cloud directly from a Counter (or any mapping from words to frequencies); the text is not tokenized again and the `stopwords` option is ignored:

In [None]:
wc.generate_from_frequencies(counter)
plt.title('from counter')
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

# Index

We want to build an inverted index:
- make a df such that for every type, we have a 1 if the document contains the type, 0 if not.
- for every type, give a list of document ids
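
The second variant, a list of document ids per type, can be sketched with a plain dictionary. The function name `build_inverted_index` and the three toy documents are assumptions for illustration; this is not meant as the efficient solution for the full corpus.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map every type to the ascending list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tokenized_docs):
        # set(): each document id is added at most once per type
        for tok in set(tokens):
            index[tok].append(doc_id)
    return dict(index)

# hypothetical tokenized documents
toy_docs = [
    ["peace", "and", "security", "peace"],
    ["peace", "now"],
    ["security", "council"],
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["peace"], toy_index["security"])
```

Looking up a type is then a single dictionary access instead of a scan over all documents.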

# 🚧 todo:
- how many types do we have?
- how many documents do we have?

In [None]:
print(xxx,'types')
print(xxx,'documents')


- we are checking with a small sub-sample first

In [None]:
list(freq_df.index[66:77])

In [None]:
df[33:36]

In [None]:
A = np.zeros((11, 3))
A.nbytes

we will first try the naïve way, to find out that this easily gets too slow:

In [None]:
for i,t in enumerate(freq_df.index[66:77]):
    for j,d in enumerate(df[33:36].tokens):
        if t in d: A[i,j] =1
A

In [None]:
A.nbytes

In [None]:
A = np.zeros((100, 7507))
for i,t in enumerate(tqdm(freq_df.index[:100])):  # tqdm inside enumerate, so that the bar knows the total
    for j,d in enumerate(df.tokens):
        if t in d: A[i,j] =1
# optional (skip at first): can you do that loop more efficiently?
A

In [None]:
A.nbytes

### 🚧 todo:

What would be the size of the complete table?


In [None]:
# 🚧 todo:
xxx
xxx
# x gb

### 🚧 todo:

How long will it take to fill the complete table?


In [None]:
# 🚧 todo:
# my computer takes xxx
xxx,'seconds', xxx,'minutes', xxx,'hours'


### redoing the same thing with CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

df[33:36].text

In [None]:
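# binary=True records presence (0/1) instead of counts; min_df is ignored when a vocabulary is given
vectorizer = CountVectorizer(vocabulary=freq_df.index[66:77], binary=True, lowercase=False)
# understand the options: 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
X = vectorizer.fit_transform(df[33:36].text)
# get_feature_names() was removed in scikit-learn 1.2, use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())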


In [None]:
# make it pretty:
d = {c:X.toarray()[i] for i,c in enumerate(df[33:36].index)}
df_cv = pd.DataFrame.from_dict(d,  orient='index',columns=freq_df.index[66:77])
df_cv

## trying the complete set of documents with the complete vocabulary

In [None]:
vectorizer = CountVectorizer(vocabulary=freq_df.index, binary=True, lowercase=False)
X = vectorizer.fit_transform(df.text)
print(len(vectorizer.get_feature_names_out()))
print(vectorizer.get_feature_names_out()[:11])
# X is a scipy sparse matrix: keep it sparse, X.toarray() would allocate the complete dense table
print(repr(X))

- wow! comparatively fast!

- can you get the vector of "the"? is there a speech that doesn't use it?


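In [None]:
# look up the column of "the" in the vocabulary instead of relying on its rank
the_col = vectorizer.vocabulary_['the']
print(X[:, the_col])

In [None]:
print(np.all(X[:, the_col].toarray() == 1))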

# a big vocabulary:
grab a pageview file here https://dumps.wikimedia.org/other/pageviews/2022/2022-01/

we produce a list of potential terms from it:

In [None]:
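terms = set()
with open('pageviews-20220101-000000', encoding='utf-8') as f:
    for li in f:
        fields = li.split()
        # keep only lines whose project code starts with 'en' and that have a page title
        if len(fields) < 2 or not li.startswith('en'):
            continue
        t = fields[1]
        if t.startswith('File:'):
            continue
        if t.startswith('Category:'):
            t = t[len('Category:'):]  # can be improved: Page:, Template:, ...
        terms.add(t.replace('_', ' '))
terms = sorted(terms)
with open('en.pages.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(terms))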

In [None]:
Counter([len(t.split()) for t in terms]).most_common()

In [None]:
[t for t in terms if len(t.split())>33]

- trying to index these terms

In [None]:
# ngram_range=(1,4): terms of up to four words can be matched; min_df is ignored with a fixed vocabulary
vectorizer = CountVectorizer(vocabulary=terms, binary=True, lowercase=False, ngram_range=(1,4))
X = vectorizer.fit_transform(df.text)

In [None]:
print(vectorizer.get_feature_names_out()[:11])


# Homework

complete the # 🚧 todo:

and
## find the most frequently encountered Wikipedia entity
- in number of speeches
- in number of occurrences

- which speech talks most about the "Union of African States"?




### Before submitting, check:
- I have not imported any other modules
- I have put explanations between the lines of code (either inline or in separate cells)
- My notebook runs all the way through when I hit
  1. the ↻ button and then
  2. the ⏩︎ button (remove or comment out cells that are too slow and not needed, such as installing or downloading sections).
  