<a href="https://colab.research.google.com/github/adong-hood/cs200/blob/main/Copy_of_ch_8_3_8_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 8 UN Speech

This notebook covers selected parts of sections 8.3, 8.4, 8.5 and 8.7 of chapter 8. It requires the 'un-general-debates.csv' and 'country_codes.csv' datasets. Please refer to both the Runestone book and this notebook. There are no Runestone book activities to complete; just share this notebook.

In [None]:
import string
import pandas as pd

In [None]:
undf = pd.read_csv('http://pluto.hood.edu/~dong/datasets/un-general-debates.csv')
undf.head(3)

# 8.3 Merging and Tidying Data

Let's read in the country code data with the encoding 'latin-1' (also known as 'iso-8859-1').

In [None]:
c_codes = pd.read_csv('http://pluto.hood.edu/~dong/datasets/country_codes.csv', encoding='iso-8859-1')
c_codes.head(2)
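The `encoding` argument matters because a file saved as Latin-1 can contain bytes that are not valid UTF-8, which is what `read_csv` assumes by default. Here is a minimal sketch of the effect. It uses a made-up sample string, not the actual contents of country_codes.csv, so treat the specific name as an assumption.

```python
import codecs

# Hypothetical sample: a name with an accented character, stored as Latin-1 bytes
raw = "Côte d'Ivoire".encode('latin-1')

try:
    raw.decode('utf-8')              # what a UTF-8 reader would attempt
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                  # the Latin-1 byte for 'ô' is not valid UTF-8 here

decoded = raw.decode('iso-8859-1')   # decodes cleanly with the right codec

# 'latin-1' and 'iso-8859-1' are two names for the same Python codec
same_codec = codecs.lookup('latin-1').name == codecs.lookup('iso-8859-1').name
print(utf8_ok, decoded, same_codec)
```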


Rename the columns so that the country code column has the same name in both dataframes.

In [None]:
undf.columns = ['session', 'year', 'code_3', 'text']

An inner join excludes rows that have no match in the other dataframe. To preserve the unmatched data, we now perform an __outer join__.

In [None]:
undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']], how = 'outer')
undfe.shape

(7562, 7)
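To see the difference between the two join types on something small, here is a sketch with two made-up dataframes. The names `speeches` and `codes` and their contents are illustrative only and are not taken from the UN data.

```python
import pandas as pd

# Hypothetical toy data: 'EU' has no entry in codes, 'ABW' has no speech
speeches = pd.DataFrame({'code_3': ['AFG', 'EU', 'USA'],
                         'year': [1989, 2013, 1970]})
codes = pd.DataFrame({'code_3': ['AFG', 'USA', 'ABW'],
                      'country': ['Afghanistan', 'United States', 'Aruba']})

inner = speeches.merge(codes, how='inner')   # only codes present in both
outer = speeches.merge(codes, how='outer')   # every code from either side

print(len(inner), len(outer))
print(outer)
```

In the outer result, the unmatched rows are kept and the columns that come from the other dataframe are filled with NaN.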

In [None]:
undfe[undfe.code_3 == 'EU']

The country, continent and sub-region are missing because EU wasn't in the c_codes file, but the rest of the data is preserved.

So what other codes don't have country names? To find out, we look for the code_3 values where country is NA.

### Exercise 1:  

How many rows have `nan` as the country name? What are the `code_3` values of these rows? Hint: use isna() to check for NA values.

In [None]:
#


101

In [None]:
undfe[undfe.country.isna()].code_3.unique()

array(['CSK', 'DDR', 'EU', 'YDYE', 'YUG'], dtype=object)

So the above codes don't match any country name.
* YDYE was the code for South Yemen, a country that existed from 1967 to 1990.
* CSK was the code for Czechoslovakia, which existed from 1918 to 1993 when it dissolved into the Czech Republic and Slovakia.
* YUG was the code for Yugoslavia, which existed from 1918 to 1992.
* DDR was East Germany, before the reunification of Germany after the fall of the Berlin Wall.
* We know that EU is the European Union.

We can fill these in "by hand."

In [None]:
undfe.loc[undfe.code_3 == 'EU', 'country'] = 'European Union'
undfe[undfe.code_3 == 'EU']

| | session | year | code_3 | text | country | continent | sub_region |
|---|---|---|---|---|---|---|---|
| 2268 | 68.0 | 2013.0 | EU | A year ago \nwhen we met in the General Assemb... | European Union | | |
| 2269 | 67.0 | 2012.0 | EU | The advance of democracy has taken place in\n... | European Union | | |
| 2270 | 69.0 | 2014.0 | EU | The world today is much more dangerous than \n... | European Union | | |
| 2271 | 66.0 | 2011.0 | EU | Europe presents to you a message of \ncooperat... | European Union | | |
| 2272 | 70.0 | 2015.0 | EU | I am here today to reassure the General Assemb... | European Union | | |


In [None]:
# Add country names for the remaining unmatched country codes.
# An explicit mapping pairs each code with its own name, so the result
# does not depend on the order in which unique() returns the codes.
missing_names = {'YDYE': 'South Yemen', 'CSK': 'Czechoslovakia',
                 'YUG': 'Yugoslavia', 'DDR': 'East Germany'}

for code, name in missing_names.items():
    undfe.loc[undfe.code_3 == code, 'country'] = name

Re-check that all countries have been filled in.

In [None]:
undfe[undfe['code_3'].isna()]


Unnamed: 0,session,year,code_3,text,country,continent,sub_region


In [None]:
undfe[undfe['country'].isnull()]

Unnamed: 0,session,year,code_3,text,country,continent,sub_region


# 8.4. Most and Least Common Words

Now we want to clean up the text so that we can analyze the most and least common words in the speeches. To do that, we need to complete the following four steps:
1. Convert all text to lower case.
2. Remove all punctuation.
3. Break the string into a list of words - tokenize
4. Remove [stop words](https://en.wikipedia.org/wiki/Stop_words) from the list.
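As a rough preview, the four steps above can be sketched on a single string in plain Python. This is only an illustration: `clean_text` and `toy_stop_words` are hypothetical names, the tokenizer is a simple whitespace split, and the stop word list is made up, whereas the notebook itself uses NLTK for steps 3 and 4.

```python
import string

# Hypothetical mini stop word list; the notebook uses NLTK's English list instead
toy_stop_words = {'the', 'of', 'and', 'to', 'in', 'a', 'is'}

def clean_text(text, stop_words):
    text = text.lower()                                   # 1. lower case
    table = str.maketrans(string.punctuation,
                          ' ' * len(string.punctuation))
    text = text.translate(table)                          # 2. punctuation -> spaces
    words = text.split()                                  # 3. naive whitespace tokenizer
    return [w for w in words if w not in stop_words]      # 4. drop stop words

result = clean_text("The Charter of the United Nations is a promise, and a duty.",
                    toy_stop_words)
print(result)
```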

We'll start by doing this just for the speeches from 1970 so we can work on smaller text.

First, let's make a (deep) copy of the data so that the original text won't be changed.

In [None]:
speeches_1970 = undfe[undfe.year == 1970].copy()
len(speeches_1970)

70

The `apply()` method works on pandas Series and DataFrames, for example with a simple lambda function.

In [None]:
test_pd = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                        'year': [2000, 2001, 2002, 2001, 2002],
                        'population': [1.5, 1.7, 3.6, 2.4, 2.9]})
test_pd

In [None]:
#Apply a lambda on a column
test_pd['population'] = test_pd.population.apply(lambda x: x*1000)
test_pd.head()

### Exercise 2:
Apply a lambda function on the `text` column to convert the text to lower case.

In [None]:
#


Next, we remove all punctuation.

In [None]:
# This line applies a function to the text column
speeches_1970['text'] = speeches_1970.text.apply(
    # it's a lambda function that replaces punctuation with whitespace,
    # but that may take some detective work reading the docs!
    lambda x: x.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))))
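To save some of that detective work, here is a small sketch of what `str.maketrans` and `translate` do. The sample string is made up. The second table, which deletes punctuation instead of replacing it, is shown only for contrast and is not what the notebook uses.

```python
import string

# Two equal-length strings: character i of the first maps to character i of the second
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# For contrast: the third argument lists characters to delete outright
to_nothing = str.maketrans('', '', string.punctuation)

sample = "war-torn regions, e.g. ours"
spaced = sample.translate(to_space)
deleted = sample.translate(to_nothing)
print(repr(spaced))
print(repr(deleted))
```

Replacing with a space keeps hyphenated words apart ("war torn"), while deleting glues them together ("wartorn"), which is one reason to prefer the space mapping before tokenizing.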

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

This adds a new column, called word_list, to our dataframe, by applying the function `nltk.word_tokenize` to the text column.

In [None]:
speeches_1970['word_list'] = speeches_1970.text.apply(nltk.word_tokenize)

In [None]:
speeches_1970.head(3)

So this turned our text into a list of words, stripped of punctuation. Now that we have these lists, we can count up how often each word occurs.

In [None]:
from collections import Counter
c = Counter(speeches_1970.word_list.sum())
c.most_common(10)

[('the', 25077),
 ('of', 16265),
 ('and', 9224),
 ('to', 9134),
 ('in', 6668),
 ('a', 4530),
 ('that', 3919),
 ('is', 3322),
 ('for', 2563),
 ('which', 2471)]
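`word_list.sum()` works because adding lists concatenates them, but every addition builds a new list, which can get slow when there are many long speeches. One possible alternative, sketched here on made-up word lists, is to feed the words to `Counter` through `itertools.chain.from_iterable`:

```python
from collections import Counter
from itertools import chain

# Hypothetical word lists standing in for speeches_1970.word_list
word_lists = [['the', 'united', 'nations'],
              ['the', 'peace', 'of', 'the', 'world']]

# chain.from_iterable walks through every inner list without building one big list
counts = Counter(chain.from_iterable(word_lists))
print(counts.most_common(1))
```

With the dataframe, the equivalent call would be `Counter(chain.from_iterable(speeches_1970.word_list))`.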

This is maybe not as interesting as we'd hoped. While these are common words, they tell us nothing about the common themes of the UN speeches. These words would be the most common in almost any English text. Removing <strong>stop words</strong>, that is, common words like 'the', will give us a more interesting list of the most common words.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
sw = set(stopwords.words('english'))
len(sw)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


198

In [None]:
# add your own stop words
# (matching is case-sensitive: 'It' only matches tokens that were not lowercased)
sw.add('It')
len(sw)

199

So now we'll remove those words from the word lists.

In [None]:
# We'll replace the word list with a new one after applying a function
speeches_1970['word_list'] = speeches_1970.word_list.apply(
    # This is a lambda function that takes a list of words as input
    # it gives back a list - see the [] - where each word is exactly the same
    # as long as it's not a stop word (i.e. if word not in sw)
    lambda list_of_words: [word for word in list_of_words if word not in sw])

c = Counter(speeches_1970.word_list.sum())
c.most_common(10)

[('nations', 1997),
 ('united', 1996),
 ('international', 1251),
 ('world', 1101),
 ('peace', 1019),
 ('countries', 908),
 ('states', 897),
 ('organization', 763),
 ('would', 677),
 ('people', 649)]

### Exercise 3

Redo the analysis for 2015. Make sure the most common words contain no non-text characters and are interesting, e.g. exclude words such as 'must' or 'also'.

In [None]:
#


# 8.5. Working with Text

In [None]:
# re-read for this section.
undf = pd.read_csv('http://pluto.hood.edu/~dong/datasets/un-general-debates.csv')

### Exercise 4 (Q1 from 8.5):

**How many rows** from the United Nations dataset have a country code that starts with ‘M’?

Hint: use `str.startswith()` and count the total rows with `sum()`

In [None]:
#


663

### Exercise 5 (Q2 from 8.5):  

**How many distinct country codes** in the United Nations dataset start with ‘M’?

Notice how this is different from the last question. As each row of our dataset is a speech, the answer to the last question was the number of speeches delivered by M countries, not the number of M countries.

Hint: use `unique()` to find the unique country codes
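The difference between counting rows and counting distinct codes can be seen on a toy Series. The codes and the letter 'A' below are made up for illustration and are not the UN data.

```python
import pandas as pd

# Hypothetical codes, one entry per speech
codes = pd.Series(['ALB', 'ALB', 'ARG', 'BEL', 'ARG', 'ALB'])

starts_with_a = codes.str.startswith('A')       # boolean Series, one value per row
n_speeches = int(starts_with_a.sum())           # True counts as 1
n_codes = len(codes[starts_with_a].unique())    # distinct codes only
print(n_speeches, n_codes)
```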

In [None]:
#

18

# 8.7. Sentiment Analysis of UN Speeches

The Natural Language Toolkit (NLTK) provides us with many tools for sentiment analysis, e.g., [VADER (Valence Aware Dictionary and sEntiment Reasoner, not Darth Vader)](https://www.nltk.org/_modules/nltk/sentiment/vader.html). VADER is lexicon- and rule-based, so it works on ordinary text and does not require us to manually train a model.



In [None]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
analyzer = SentimentIntensityAnalyzer()
score = analyzer.polarity_scores("I love sci-fi movies!")
score

{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6696}

In [None]:
score = analyzer.polarity_scores("I wouldn't suggest this solution to anyone.")
score

{'neg': 0.0, 'neu': 0.656, 'pos': 0.344, 'compound': 0.3869}

In [None]:
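def score_text(text):
    # Average the VADER compound score over the sentences of the text
    sentence_list = tokenize.sent_tokenize(text)
    if not sentence_list:  # avoid dividing by zero on empty text
        return 0.0
    cscore = 0.0
    for sent in sentence_list:
        ss = analyzer.polarity_scores(sent)['compound']
        cscore += ss
    return cscore / len(sentence_list)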

In [None]:
undfe['text'] = undfe['text'].astype(str)

In [None]:
# This will take a while.
undfe['sentiment'] = undfe.text.map(score_text)

In [None]:
undfe.head()

| | session | year | code_3 | text | country | continent | sub_region | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | | | ABW | | Aruba | Americas | Caribbean | 0.216956 |
| 1 | 44.0 | 1989.0 | AFG | \nIt gives me pleasure at the very outset to ... | Afghanistan | Asia | Southern Asia | 0.232908 |
| 2 | 68.0 | 2013.0 | AFG | I bring to all warm \ngreetings and the good w... | Afghanistan | Asia | Southern Asia | 0.27461 |
| 3 | 40.0 | 1985.0 | AFG | I wish at the outset to congratulate the Presi... | Afghanistan | Asia | Southern Asia | 0.235133 |
| 4 | 63.0 | 2008.0 | AFG | Since the last time we \ngathered here in this... | Afghanistan | Asia | Southern Asia | 0.207283 |


### Exercise 6.
Which countries are the most positive in their speeches throughout the years?

In [None]:
#

### Exercise 7.
Which sub-region is the most positive in its speeches throughout the years?


In [None]:
#