# Nobel Peace Prize Speech Analysis

## This notebook walks through how I updated my initial analysis and the improvements I made.

In [13]:
import os
from pathlib import Path
import nltk
import re
import collections as col
from nltk.corpus import stopwords

In [14]:
# Define the folder where the speech files are stored
path = Path.cwd() / "nobelspeeches"
nobel_speeches = os.listdir(path)

# Build the full path to each .txt file, sorted by filename
files = sorted([os.path.join(path, file) for file in nobel_speeches if file.endswith('.txt')])
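
#### For comparison, here is a minimal sketch of the same file listing written with pathlib alone.
#### The helper name `list_speech_files` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def list_speech_files(folder):
    """Return the .txt files directly inside `folder` as a sorted list of Path objects."""
    # glob("*.txt") matches only files in this folder, not in subfolders
    return sorted(Path(folder).glob("*.txt"))
```

#### It could be called as `list_speech_files(Path.cwd() / "nobelspeeches")`; `open()` accepts the resulting Path objects directly.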

#### The stopwords provided by the NLTK corpus are lists of the most common words in each supported language.
#### For English, these include "the", "it", "a", "and", "when", etc.
#### These words contribute almost nothing to the analysis, so we use the stopwords to filter them out.
#### I also added a few punctuation marks, plus a lone space and the empty string, to the stopwords so they are filtered out as well.

In [15]:
stops = set(stopwords.words('english'))

stops.update(["–","…","*", " ", ""])
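
#### To illustrate the filtering, here is a small sketch on a made-up sentence.
#### `sample_stops` is a hypothetical stand-in for the NLTK set, so the example runs without downloading the corpus.

```python
# Hypothetical stand-in for stopwords.words('english'); the real set is much larger
sample_stops = {"the", "a", "and", "it", "of", "is", "for"}
sample_stops.update(["–", "…", "*", " ", ""])

sentence = "peace is the work of a lifetime – and it is work for all"

# Keep only the words that are not in the stopword set
kept = [word for word in sentence.split(" ") if word not in sample_stops]
print(kept)
```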

#### Instead of using a tokenizer, I split each file's text into words and used a for loop to collect them into a list.

#### I also used regex to tackle some of the more challenging issues within the data so that the final result was as clean as possible.

#### The collections module is useful for quickly building dictionaries of counts from iterable data types and then sorting the entries by frequency.
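
#### As a quick illustration of that counting step, here is a sketch with `col.Counter` on a made-up word list.
#### `sample_words` is invented for the example and does not come from the speeches.

```python
import collections as col

# Made-up words standing in for the contents of a speech
sample_words = ["peace", "war", "peace", "hope", "peace", "war"]

# Counter builds a dictionary that maps each word to its number of occurrences
counts = col.Counter(sample_words)

# most_common() returns (word, count) pairs, highest count first
ranked = counts.most_common()
print(ranked)
```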

In [16]:
vocab_list = []

for file in files:
    # Read the whole speech; the with block closes the file afterwards
    with open(file, encoding="utf8") as speech:
        text = speech.read()

    # Replace literal "\n" sequences and real line breaks with spaces
    text = re.sub(r"\\n|\n", " ", text)
    # Strip parentheses and the punctuation attached to words
    text = re.sub(r"[(),;:.]", "", text)
    # Lowercase and split on whitespace
    lst = text.lower().split()

    for word in lst:
        # Skip fragments that start with an apostrophe
        if word.startswith("'"):
            continue
        if word not in stops:
            vocab_list.append(word)

# Count each word and sort the (word, count) pairs by frequency
most_frequent_words = col.Counter(vocab_list).most_common()
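
#### To see what the cleaning steps do to a single piece of text, here is a minimal sketch on one made-up sentence.
#### The function name `clean_words`, the sample sentence, and the tiny stopword set are all assumptions for the example.

```python
import re


def clean_words(text, stops):
    """Lowercase `text`, strip common punctuation, and drop stopwords."""
    # Literal "\n" sequences and real line breaks both become spaces
    text = re.sub(r"\\n|\n", " ", text)
    # Remove parentheses and the punctuation that is usually attached to a word
    text = re.sub(r"[(),;:.]", "", text)
    words = []
    for word in text.lower().split():
        # Skip apostrophe-led fragments and anything in the stopword set
        if word.startswith("'") or word in stops:
            continue
        words.append(word)
    return words


sample = "Peace, (real peace)\nis the work of us all."
cleaned = clean_words(sample, {"is", "the", "of"})
print(cleaned)
```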

#### Instead of a list with a word and the occurrence value, I used a dictionary and took advantage of its key:value strengths.

#### This data is far cleaner and was organized in a much shorter time with far fewer lines of code.

#### Uncomment the last line to see the most frequently used words of the last 29 Nobel Laureates.

In [17]:
#most_frequent_words
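
#### The full result can be long, so here is a sketch of two ways to inspect it: slicing off the top entries and converting back to a dictionary for lookups.
#### `sample_vocab` is made up for the example, and the cutoff of 2 is arbitrary.

```python
import collections as col

# Made-up sample standing in for the real vocab_list
sample_vocab = ["peace", "world", "peace", "people", "world", "peace"]
sample_most_frequent = col.Counter(sample_vocab).most_common()

# Slice the sorted (word, count) pairs to show only the top entries
top_entries = sample_most_frequent[:2]
for word, count in top_entries:
    print(f"{word}: {count}")

# dict() turns the pairs back into a word -> count mapping for direct lookup
lookup = dict(sample_most_frequent)
print(lookup["people"])
```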