# DATA620: Assignment 8 - High Frequency Words

## Homework Team 3: David Simbandumwe, Eric Lehmphul and Lidiia Tronina

1. Choose a corpus of interest.
2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting, defensible way).
3. Taking the most common words, how many unique words represent half of the total words in the corpus?
4. Identify the 200 highest frequency words in this corpus.
5. Create a graph that shows the relative frequency of these 200 words.
6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
7. In what ways do you think the frequency of the words in this corpus differs from “all words in all corpora”?

### Load Required Packages

In [62]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import nltk
from nltk.corpus import inaugural, stopwords  # stopwords is needed by later cells
warnings.filterwarnings("ignore")
nltk.download('inaugural')  # inaugural address corpus
nltk.download('stopwords')  # English stopword list

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lidiiatronina/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### 1. Choose a corpus of interest.

Presidents’ words matter. For better or worse, presidential rhetoric tells the American people who they are. That is why we chose the inaugural addresses, which are freely available as the inaugural corpus in the NLTK package. The corpus is a collection of 59 texts, one for each inaugural address from 1789 to 2021.

In [42]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

An interesting property of this collection is its time dimension. To get the year out of the filename, we extracted the first four characters.

In [43]:
[fileid[:4] for fileid in inaugural.fileids()]

['1789',
 '1793',
 '1797',
 '1801',
 '1805',
 '1809',
 '1813',
 '1817',
 '1821',
 '1825',
 '1829',
 '1833',
 '1837',
 '1841',
 '1845',
 '1849',
 '1853',
 '1857',
 '1861',
 '1865',
 '1869',
 '1873',
 '1877',
 '1881',
 '1885',
 '1889',
 '1893',
 '1897',
 '1901',
 '1905',
 '1909',
 '1913',
 '1917',
 '1921',
 '1925',
 '1929',
 '1933',
 '1937',
 '1941',
 '1945',
 '1949',
 '1953',
 '1957',
 '1961',
 '1965',
 '1969',
 '1973',
 '1977',
 '1981',
 '1985',
 '1989',
 '1993',
 '1997',
 '2001',
 '2005',
 '2009',
 '2013',
 '2017',
 '2021']

### 2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting, defensible way).

There are a total of 152,901 tokens in the corpus, counting punctuation marks as well as words.

In [48]:
# Count ALL words
all_words = inaugural.words()
len(all_words)

152901

In [45]:
#Washington's speech
inaugural.words('1789-Washington.txt')

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [52]:
nltk.FreqDist(all_words).most_common(20)

[('the', 9555),
 (',', 7275),
 ('of', 7169),
 ('and', 5226),
 ('.', 5011),
 ('to', 4477),
 ('in', 2604),
 ('a', 2229),
 ('our', 2062),
 ('that', 1769),
 ('be', 1505),
 ('is', 1477),
 ('we', 1281),
 ('for', 1141),
 ('by', 1063),
 ('it', 1036),
 ('have', 1029),
 ('which', 1007),
 ('not', 972),
 ('will', 935)]

In [50]:
len(set(all_words))

10025

There are 10,025 unique tokens in the raw data. The sample above shows that this count includes punctuation as well as function words such as 'the' and 'of'. Python also treats capitalized and lowercase forms as distinct strings, so we convert all words to lowercase and remove punctuation, common stopwords, and numbers before counting unique words.
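The cleaning steps described above can be sketched as a single helper. This is a minimal illustration rather than the notebook's own cell: `normalize_tokens` and the tiny `demo_stopwords` set are hypothetical names, and keeping only alphabetic tokens via `str.isalpha()` is an assumption about what counts as a word (it drops punctuation and numbers, but would also drop tokens containing hyphens or apostrophes).

```python
def normalize_tokens(tokens, stopword_set):
    """Lowercase tokens and keep alphabetic ones that are not in stopword_set."""
    lowered = (t.lower() for t in tokens)
    # str.isalpha() rejects punctuation (',', '--') and numbers ('1863')
    return [t for t in lowered if t.isalpha() and t not in stopword_set]


# Tiny stand-in for nltk.corpus.stopwords.words('english') (assumption for the demo)
demo_stopwords = {"the", "of", "and"}
demo_tokens = ["The", "People", ",", "of", "the", "people", "1863", "--", "Union"]

cleaned = normalize_tokens(demo_tokens, demo_stopwords)
print(cleaned)            # ['people', 'people', 'union']
print(len(set(cleaned)))  # 2
```

With the real corpus, the same helper could be applied to `inaugural.words()` with `set(stopwords.words('english'))` as the stopword set.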

In [120]:
# Build one row per speech, then create the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0).
rows = []
for file in inaugural.fileids():
    word_list = [w.lower() for w in inaugural.words(file)]  # handle the case sensitivity
    rows.append({"filename": file,
                 "year": int(file[:4]),
                 "length": len(word_list),
                 "unique": len(set(word_list))})
text_data = pd.DataFrame(rows, columns=['filename', 'year', 'length', 'unique'])

In [121]:
# Inspect NLTK's English stopword list (the words are removed in a later cell)
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [137]:
# Punctuation, stray symbols, and numeric tokens to remove along with the stopwords
custom_stopwords = set((',', '.', ';', '?', '-', '!', '(', ')','--','"',"'", ':', '¡¦', '¡','', '9', '/', '11','ii', '400','1863'))

In [138]:
english_stopwords = set(stopwords.words('english'))  # build once; set lookup is fast
for text in text_data['filename']:
    print(text)
    word_list = [w.lower() for w in inaugural.words(text)]  # handle the case sensitivity
    # filtered_words is reassigned on every pass, so after the loop it holds only the last speech
    filtered_words = [word for word in word_list
                      if word not in english_stopwords and word not in custom_stopwords]
    print(nltk.FreqDist(filtered_words).most_common(15))


1789-Washington.txt
[('every', 9), ('government', 8), ('public', 6), ('may', 6), ('citizens', 5), ('present', 5), ('country', 5), ('one', 4), ('ought', 4), ('duty', 4), ('people', 4), ('united', 4), ('since', 4), ('fellow', 3), ('could', 3)]
1793-Washington.txt
[('shall', 3), ('oath', 2), ('fellow', 1), ('citizens', 1), ('called', 1), ('upon', 1), ('voice', 1), ('country', 1), ('execute', 1), ('functions', 1), ('chief', 1), ('magistrate', 1), ('occasion', 1), ('proper', 1), ('arrive', 1)]
1797-Adams.txt
[('people', 20), ('government', 16), ('may', 13), ('nations', 11), ('country', 10), ('nation', 9), ('states', 9), ('foreign', 8), ('constitution', 8), ('honor', 7), ('justice', 6), ('ever', 6), ('congress', 6), ('public', 6), ('good', 6)]
1801-Jefferson.txt
[('government', 12), ('us', 10), ('may', 8), ('fellow', 7), ('citizens', 7), ('let', 7), ('shall', 6), ('principle', 6), ('would', 6), ('one', 6), ('man', 6), ('safety', 5), ('good', 5), ('others', 5), ('peace', 5)]
1805-Jefferson.tx

[('country', 17), ('must', 17), ('great', 16), ('people', 15), ('government', 14), ('world', 13), ('peace', 13), ('much', 12), ('upon', 12), ('one', 10), ('law', 10), ('party', 10), ('ought', 9), ('nations', 9), ('old', 9)]
1929-Hoover.txt
[('government', 30), ('upon', 17), ('progress', 16), ('people', 15), ('world', 15), ('must', 15), ('peace', 15), ('justice', 14), ('nation', 12), ('system', 11), ('law', 11), ('laws', 10), ('enforcement', 10), ('public', 10), ('federal', 9)]
1933-Roosevelt.txt
[('national', 9), ('must', 9), ('people', 8), ('may', 8), ('leadership', 7), ('helped', 7), ('shall', 7), ('nation', 6), ('us', 6), ('action', 6), ('world', 6), ('time', 5), ('money', 5), ('great', 4), ('first', 4)]
1937-Roosevelt.txt
[('government', 16), ('people', 11), ('nation', 9), ('men', 8), ('democracy', 8), ('good', 8), ('see', 8), ('power', 7), ('progress', 7), ('purpose', 6), ('new', 6), ('upon', 6), ('millions', 6), ('united', 5), ('us', 5)]
1941-Roosevelt.txt
[('nation', 12), ('know

In [139]:
len(set(filtered_words))

663

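Because filtered_words is overwritten on each pass of the loop, this count covers only the last speech processed, the 2021 address, which contains 663 unique words after filtering. A corpus-wide count requires accumulating the filtered words across all speeches.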
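A corpus-wide vocabulary count needs the filtered tokens from every speech, whereas `filtered_words` above is reassigned inside the loop and ends up holding only the final file. The sketch below accumulates counts with `collections.Counter`; `corpus_counts` is a hypothetical helper, and the two short token lists are stand-ins for the per-speech output of `inaugural.words(fileid)`.

```python
from collections import Counter


def corpus_counts(documents, excluded):
    """Accumulate lowercase token counts over all documents, skipping excluded tokens."""
    counts = Counter()
    for tokens in documents:
        counts.update(t.lower() for t in tokens if t.lower() not in excluded)
    return counts


# Stand-ins for [inaugural.words(f) for f in inaugural.fileids()] (assumption for the demo)
demo_docs = [
    ["We", "the", "People", ",", "the", "Union"],
    ["The", "people", "and", "the", "Nation", "."],
]
demo_excluded = {"the", "and", "we", ",", "."}

counts = corpus_counts(demo_docs, demo_excluded)
print(len(counts))            # unique words across all documents: 3
print(counts.most_common(1))  # [('people', 2)]
```

With the notebook's variables, `excluded` would plausibly be the union of the NLTK stopword set and `custom_stopwords`.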

### 3. Taking the most common words, how many unique words represent half of the total words in the corpus?

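The cell below lists the 332 most frequent words in filtered_words, which at this point holds only the 2021 address. Note that 332 is roughly half of the 663 unique words in that list, not the number of unique words needed to account for half of all word occurrences in the corpus. Answering the question as posed requires ranking words by frequency and accumulating their counts until the running total reaches 50% of all tokens.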

In [140]:
fdist = nltk.FreqDist(filtered_words)
print(*[w for w, n in fdist.most_common(332)], sep=", ")

us, america, one, nation, democracy, americans, today, people, much, know, story, another, history, american, must, world, unity, president, days, war, let, stand, fellow, day, together, look, great, time, work, children, like, moment, justice, come, power, ever, say, truth, may, cause, god, centuries, strength, good, peace, forward, virus, country, lost, soul, things, .", face, better, way, promise, love, defend, understand, vice, hope, resolve, tested, challenge, prevailed, ago, violence, thank, heart, constitution, night, taken, sacred, oath, first, still, many, year, jobs, cry, future, new, whole, uniting, ask, join, right, fear, never, enough, faith, show, dignity, respect, meet, hear, believe, yet, gave, common, objects, honor, lies, get, need, going, harris, leader, ages, hour, friends, ground, capitol, carry, ahead, set, spoke, last, service, ,", union, far, winter, possibilities, repair, restore, difficult, lives, thousands, racial, years, dream, comes, rise, extremism, defeat
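Treating the question as a coverage problem, one possible approach is to walk down the frequency ranking and stop when the cumulative count reaches half of all tokens. The following is a minimal sketch under that interpretation; `types_for_coverage` is a hypothetical helper and the counts are toy data, not results from the corpus.

```python
from collections import Counter


def types_for_coverage(counts, fraction=0.5):
    """Return how many top-ranked words are needed to cover `fraction` of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= fraction * total:
            return rank
    return len(counts)


demo_counts = Counter({"people": 5, "nation": 3, "union": 1, "oath": 1})
print(types_for_coverage(demo_counts))       # 1, since 5 of 10 tokens is already half
print(types_for_coverage(demo_counts, 0.8))  # 2, since 5 + 3 = 8 of 10 tokens
```

Because `nltk.FreqDist` is a subclass of `collections.Counter`, a frequency distribution built from the whole corpus could be passed in directly.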

### 4. Identify the 200 highest frequency words in this corpus.
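One way to approach this step, sketched under the assumption that word counts for the whole corpus are already available in a `Counter` (or `nltk.FreqDist`), is to take the 200 most common entries and divide each count by the total number of tokens. `top_relative_frequencies` is a hypothetical helper and the counts are toy data.

```python
from collections import Counter


def top_relative_frequencies(counts, k=200):
    """Return [(word, count / total_tokens)] for the k most frequent words."""
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common(k)]


demo_counts = Counter({"people": 5, "nation": 3, "union": 2})
top = top_relative_frequencies(demo_counts, k=2)
print(top)  # [('people', 0.5), ('nation', 0.3)]
```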

### 5. Create a graph that shows the relative frequency of these 200 words.
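A rank-frequency plot is a common way to display this kind of data. The sketch below is one possible layout, not the authors' figure: `rank_series` and `plot_rank_frequency` are hypothetical helpers, the log-log axes are a design assumption chosen because they make a power-law relationship appear as a straight line, and the input is assumed to be a list of relative frequencies for the top words.

```python
def rank_series(rel_freqs):
    """Sort relative frequencies in descending order and pair them with ranks 1..n."""
    ordered = sorted(rel_freqs, reverse=True)
    return list(range(1, len(ordered) + 1)), ordered


def plot_rank_frequency(rel_freqs, ax=None):
    """Plot relative frequency against rank on log-log axes and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting

    ranks, ordered = rank_series(rel_freqs)
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    ax.loglog(ranks, ordered, marker=".", linestyle="none")
    ax.set_xlabel("Rank")
    ax.set_ylabel("Relative frequency")
    ax.set_title("Relative frequency of the top words")
    return ax


ranks, ordered = rank_series([0.3, 0.5, 0.2])
print(ranks, ordered)  # [1, 2, 3] [0.5, 0.3, 0.2]
```

In the notebook, calling `plot_rank_frequency(freqs)` followed by `plt.show()` would render the figure, where `freqs` holds the relative frequencies of the 200 words.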

### 6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
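Zipf's law says that a word's frequency is roughly proportional to the inverse of its rank, so log(frequency) plotted against log(rank) should fall near a straight line with a slope close to -1. One simple check, sketched here with a hypothetical helper `zipf_slope`, is to fit that slope by ordinary least squares. The input is synthetic data that follows the law exactly; it is not a result from the corpus, and the helper assumes at least two strictly positive frequencies.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) on log(rank), with ranks starting at 1."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic frequencies that follow f = 1000 / rank, so the slope should be about -1
ideal = [1000 / r for r in range(1, 201)]
print(round(zipf_slope(ideal), 6))  # -1.0
```

A fitted slope far from -1, or visible curvature in the log-log plot, would suggest the observed frequencies depart from Zipf's law. Removing stopwords strips out the highest-ranked words, which can flatten the head of the curve.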

### 7. In what ways do you think the frequency of the words in this corpus differs from “all words in all corpora”?