# DATA620: Assignment 10 - Document Classification

## Homework Team 3: David Simbandumwe, Eric Lehmphul and Lidiia Tronina

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data: [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)
For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
More adventurous students are welcome to come up with a different set of documents that have already been classified, then analyze these documents to predict how new documents should be classified.

### Load Required Packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import nltk
from nltk.corpus import inaugural
from operator import itemgetter
warnings.filterwarnings("ignore")
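nltk.download('inaugural')  # corpus id is 'inaugural'; the recorded run used the misspelled 'inagural', hence the error in the output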
nltk.download('stopwords')

[nltk_data] Error loading inagural: Package 'inagural' not found in
[nltk_data]     index
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lidiiatronina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Choose a corpus of interest for classification.

Presidents’ words matter. For better or worse, presidential rhetoric tells the American people who they are. That is why we chose the inaugural address corpus, which is freely available through the NLTK package. The corpus contains 59 texts, one for each presidential inaugural address.
Let's look at the texts available in the corpus.

In [3]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

We have 59 presidential inaugural speeches, and we want to classify each one as Republican or Democrat. Most presidents have belonged to one of these two parties (the exceptions include Whig and National Union presidents), and we can start as early as Andrew Jackson's 1829 speech. It will be interesting to see whether the two parties use words differently in their addresses and whether we can identify a president's party from word choice.

In [11]:

party = pd.read_csv('https://raw.githubusercontent.com/dsimband/DATA620_Group3/main/Week10_Assignment/presidents.csv',index_col=0)

party.head(20)

| year | president | party |
|------|------------|---------------------|
| 1789 | Washington | |
| 1793 | Washington | |
| 1797 | Adams | Federalist |
| 1801 | Jefferson | Democrat-Republican |
| 1805 | Jefferson | Democrat-Republican |
| 1809 | Madison | Democrat-Republican |
| 1813 | Madison | Democrat-Republican |
| 1817 | Monroe | Democrat-Republican |
| 1821 | Monroe | Democrat-Republican |
| 1825 | Adams | Democrat-Republican |
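To make the labeling step concrete, here is a minimal sketch of how each corpus file id could be paired with a party and restricted to the two classes of interest. The `party_by_year` dictionary, the `label_speeches` helper, and the exact label strings `"Democrat"`, `"Republican"`, and `"Whig"` are assumptions for illustration; in the notebook the lookup would come from the `party` DataFrame loaded above, whose labels for later years are not shown here.

```python
# Hypothetical year -> party lookup with a few rows filled in for illustration.
# In the notebook this information comes from presidents.csv.
party_by_year = {
    1829: "Democrat",
    1841: "Whig",
    1861: "Republican",
    1933: "Democrat",
}


def label_speeches(fileids, party_by_year, keep=("Democrat", "Republican")):
    """Return (fileid, party) pairs for speeches whose party is in `keep`."""
    labeled = []
    for fileid in fileids:
        # File ids look like '1861-Lincoln.txt'; the year is the part before '-'.
        year = int(fileid.split("-", 1)[0])
        party = party_by_year.get(year)
        if party in keep:
            labeled.append((fileid, party))
    return labeled


sample_ids = [
    "1829-Jackson.txt",
    "1841-Harrison.txt",
    "1861-Lincoln.txt",
    "1933-Roosevelt.txt",
]
labeled = label_speeches(sample_ids, party_by_year)
print(labeled)
```

Speeches from other parties (the Whig entry in this sample) are dropped rather than relabeled, which keeps the task a two-class problem.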


## Data Exploration

There are a total of 152901 tokens in the corpus, counting punctuation marks as well as words.

In [5]:
# Count ALL words
all_words = inaugural.words()
len(all_words)

152901

In [6]:
#Washington's speech
inaugural.words('1789-Washington.txt')

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [8]:
nltk.FreqDist(all_words).most_common(10)

[('the', 9555),
 (',', 7275),
 ('of', 7169),
 ('and', 5226),
 ('.', 5011),
 ('to', 4477),
 ('in', 2604),
 ('a', 2229),
 ('our', 2062),
 ('that', 1769)]
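As far as we know, `nltk.FreqDist` is built on Python's `collections.Counter`, so the same counting can be illustrated without NLTK. The sketch below uses a small made-up token list rather than the inaugural corpus.

```python
from collections import Counter

# Made-up tokens standing in for inaugural.words(); punctuation is counted too.
tokens = ["the", "people", ",", "the", "government", "of", "the", "people", "."]

counts = Counter(tokens)
top = counts.most_common(2)  # the two most frequent tokens with their counts
print(top)
```

Just as in the corpus output above, function words and punctuation dominate the top of the list unless they are filtered out first.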

There are 10025 unique tokens in this data. The sample above shows that the tokens include punctuation as well as stopwords such as 'the' and 'of', which carry little information for telling one speech from another. Python also treats uppercase and lowercase letters as distinct, so 'The' and 'the' count as two different words. To get a meaningful vocabulary, we convert all words to lowercase and remove punctuation, common words, and numbers.

In [9]:
len(set(all_words))

10025
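The cleanup described above can be sketched on a handful of tokens. This version is an assumption-laden simplification: `normalize` and `SAMPLE_STOPWORDS` are hypothetical names, the stopword list is a tiny stand-in for NLTK's English list, and `str.isalpha()` is used in place of a hand-written punctuation list, so it also drops tokens that mix letters with other characters.

```python
# Tiny illustrative stopword list; the notebook uses NLTK's English stopwords.
SAMPLE_STOPWORDS = {"the", "of", "and", "to", "in", "a", "our", "that"}


def normalize(tokens, stop_words=SAMPLE_STOPWORDS):
    """Lowercase tokens and keep only alphabetic, non-stopword tokens."""
    cleaned = []
    for token in tokens:
        word = token.lower()
        if not word.isalpha():  # drops punctuation and numbers
            continue
        if word in stop_words:  # drops common high-frequency words
            continue
        cleaned.append(word)
    return cleaned


sample = ["Fellow", "-", "Citizens", "of", "the", "Senate", ",", "The", "1789", "senate"]
cleaned = normalize(sample)
print(cleaned)
print(len(set(sample)), len(set(cleaned)))
```

Lowercasing before the stopword check matters: 'The' is only recognized as a stopword once it has been converted to 'the'.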

Stopwords can be removed using the stopwords corpus in NLTK, which contains common high-frequency words that carry little meaning on their own.

In [47]:
#remove stopwords 
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words = set(stopwords.words('english')) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
filtered_words = [word.lower() for word in all_words if word.lower() not in stop_words] 

In [12]:
len(filtered_words)

80124

In [13]:
len(set(filtered_words))

9173

In [14]:
# remove punctuation, stray symbols, numbers, and a few extra tokens such as 'us'
custom_stopwords = {',', ' ', '.', ';', '?', '-', '!', '(', ')', '--', '"', "'", ':', '¡¦', '¡', '', '9', '/', '11', 'ii', '400', '1863', 'us', ',"'}

In [15]:
filtered_words2 = [ word.lower() for word in filtered_words if word.lower() not in custom_stopwords ] 

In [16]:
len(filtered_words2)

65255

In [18]:
len(set(filtered_words2))

9150

After filtering out stopwords, punctuation, and the other custom tokens, 9150 unique words remain in the corpus.
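The notebook tracks the corpus size at each stage with `len(...)` and `len(set(...))` (152901, 80124, and 65255 tokens; 10025, 9173, and 9150 unique words). A small helper makes that bookkeeping repeatable; `vocab_stats` is a hypothetical name, and the sample tokens below are made up rather than taken from the corpus.

```python
def vocab_stats(tokens):
    """Return (number of tokens, number of unique types, type/token ratio)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


# Made-up sample passed through the same three stages as the notebook.
raw = ["The", "people", ",", "the", "Government", "of", "the", "people", "."]
lowered = [t.lower() for t in raw]
content = [t for t in lowered if t.isalpha() and t not in {"the", "of"}]

for name, toks in [("raw", raw), ("lowercased", lowered), ("content words", content)]:
    print(name, vocab_stats(toks))
```

Filtering removes many tokens but few unique words, which matches the pattern in the notebook: the token count falls by more than half while the vocabulary shrinks by under a tenth.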

#### 200 highest frequency words:

In [22]:
fdist = nltk.FreqDist(filtered_words2)
fdist.most_common(200) 

[('government', 600),
 ('people', 594),
 ('must', 374),
 ('upon', 371),
 ('great', 346),
 ('world', 346),
 ('may', 343),
 ('states', 335),
 ('nation', 330),
 ('country', 322),
 ('shall', 316),
 ('every', 301),
 ('one', 272),
 ('peace', 259),
 ('new', 255),
 ('citizens', 248),
 ('power', 241),
 ('america', 240),
 ('public', 227),
 ('time', 223),
 ('would', 213),
 ('constitution', 209),
 ('united', 204),
 ('nations', 199),
 ('union', 191),
 ('freedom', 189),
 ('war', 185),
 ('free', 184),
 ('american', 171),
 ('let', 160),
 ('fellow', 158),
 ('national', 158),
 ('made', 156),
 ('good', 150),
 ('men', 149),
 ('make', 147),
 ('years', 143),
 ('well', 142),
 ('justice', 142),
 ('life', 140),
 ('without', 140),
 ('spirit', 140),
 ('rights', 138),
 ('never', 137),
 ('law', 136),
 ('congress', 130),
 ('laws', 130),
 ('work', 124),
 ('liberty', 123),
 ('right', 122),
 ('best', 122),
 ('duty', 120),
 ('hope', 120),
 ('interests', 115),
 ('know', 112),
 ('god', 112),
 ('today', 112),
 ('much', 11