# DATA620: Assignment 10 - Document Classification

## Homework Team 3: David Simbandumwe, Eric Lehmphul and Lidiia Tronina

It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data: [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)
For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
More adventurous students are welcome to come up with a different set of documents that have already been classified, then analyze these documents to predict how new documents should be classified.

### Load Required Packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import nltk
from nltk.corpus import inaugural
from operator import itemgetter
warnings.filterwarnings("ignore")
nltk.download('inaugural')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lidiiatronina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Choose a corpus of interest for classification.

Presidents’ words matter. For better or worse, presidential rhetoric tells the American people who they are. That's why we decided to look at the inaugural speeches in the freely available corpus that ships with the NLTK package. The corpus is a collection of 59 texts, one for each presidential inaugural address from 1789 to 2021.
Let's look at available texts in the corpus.

In [3]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

We have 59 presidential inaugural speeches. We want to classify each speech as Republican or Democrat. Most presidents have belonged to one of these two parties (the exceptions include Whigs and the National Union ticket), so we can start as early as Andrew Jackson's 1829 speech. It will be interesting to see whether the two parties use different words in their addresses and whether we can identify a president's party from word choice.
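One way to make the classification goal concrete is a bag-of-words Naive Bayes classifier. The sketch below is only an illustration under stated assumptions: the notebook does not say which classifier is used, the helper names `train_naive_bayes` and `predict` are hypothetical, and the toy token lists are invented for the example rather than taken from any speech.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label (hypothetical helper)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(tokens, label_counts, word_counts, vocab):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                smoothed = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data: NOT real party vocabulary, just enough to exercise the code.
toy_training = [
    (["tax", "freedom", "enterprise"], "Republican"),
    (["freedom", "market", "tax"], "Republican"),
    (["community", "workers", "opportunity"], "Democrat"),
    (["workers", "community", "health"], "Democrat"),
]
model = train_naive_bayes(toy_training)
prediction = predict(["tax", "market"], *model)
print(prediction)
```

With the real corpus, each training pair would be the filtered tokens of one speech and its party label, and some speeches would be held out for testing.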

In [11]:

party = pd.read_csv('https://raw.githubusercontent.com/dsimband/DATA620_Group3/main/Week10_Assignment/presidents.csv',index_col=0)

party.head(20)

year,president,party
1789,Washington,
1793,Washington,
1797,Adams,Federalist
1801,Jefferson,Democrat-Republican
1805,Jefferson,Democrat-Republican
1809,Madison,Democrat-Republican
1813,Madison,Democrat-Republican
1817,Monroe,Democrat-Republican
1821,Monroe,Democrat-Republican
1825,Adams,Democrat-Republican


## Data Exploration

There are a total of 152,901 tokens in the corpus. This count includes punctuation marks as well as words.

In [5]:
# Count ALL words
all_words = inaugural.words()
len(all_words)

152901

In [6]:
#Washington's speech
inaugural.words('1789-Washington.txt')

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [14]:
recent_list = inaugural.fileids()[-10:]  # the ten most recent speeches
for text in recent_list:
    # Lowercase every token so that "The" and "the" count as the same word
    word_list = [w.lower() for w in inaugural.words(text)]
    unique_words = len(set(word_list))
    # An f-string formats the file name and the count in one step
    print(f"For text {text}, the number of unique words is {unique_words}")

For text 1985-Reagan.txt, the number of unique words is 876
For text 1989-Bush.txt, the number of unique words is 754
For text 1993-Clinton.txt, the number of unique words is 604
For text 1997-Clinton.txt, the number of unique words is 727
For text 2001-Bush.txt, the number of unique words is 593
For text 2005-Bush.txt, the number of unique words is 742
For text 2009-Obama.txt, the number of unique words is 900
For text 2013-Obama.txt, the number of unique words is 792
For text 2017-Trump.txt, the number of unique words is 547
For text 2021-Biden.txt, the number of unique words is 783
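Raw unique-word counts depend on how long each speech is, so a length-normalized measure can make speeches easier to compare. The sketch below computes a type-token ratio (unique tokens divided by total tokens); the function name `type_token_ratio` and the sample token list are assumptions for illustration, not part of the original notebook.

```python
def type_token_ratio(tokens):
    """Share of distinct lowercase tokens among all tokens (0.0 for empty input)."""
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    return len(set(lowered)) / len(lowered)

# Illustrative token list; with NLTK this would be inaugural.words(fileid)
sample = ["We", "the", "people", "of", "the", "United", "States"]
ratio = type_token_ratio(sample)
print(round(ratio, 3))
```

The ratio tends to fall as texts get longer, so it is best used to compare speeches of similar length.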


In [37]:
import pandas as pd

# Collect one record per speech, then build the DataFrame in a single call
# (DataFrame.append was removed in pandas 2.0).
rows = []
for file in inaugural.fileids():
    word_list = [w.lower() for w in inaugural.words(file)]  # handle the case sensitivity
    rows.append({"filename": file,
                 "president": file[5:],   # e.g. 'Washington.txt'
                 "year": int(file[:4]),
                 "length": len(word_list),
                 "unique": len(set(word_list))})
text_data = pd.DataFrame(rows)


In [38]:
text_data['president'] = text_data['president'].str.replace('.txt', '', regex=False)
text_data['year'] = text_data['year'].astype(int)
# Join on year (the index of `party`). Surnames such as Adams are shared by
# different presidents, so merging on 'president' duplicates rows.
president_party = pd.merge(text_data, party[['party']], left_on='year', right_index=True, how='left')

president_party

filename,length,president,unique,year,party
1789-Washington.txt,1538,Washington,604,1789,
1793-Washington.txt,147,Washington,95,1793,
1797-Adams.txt,2585,Adams,803,1797,Federalist
1801-Jefferson.txt,1935,Jefferson,687,1801,Democrat-Republican
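A join keyed on surname alone pairs every speech with every party row that shares that surname, which is how one Adams speech can end up labeled both Federalist and Democrat-Republican. The sketch below reproduces the effect in plain Python so the join logic is explicit; the variable names are hypothetical, and the two party records are the Adams rows from the party table shown earlier.

```python
speeches = [
    {"year": 1797, "president": "Adams"},
    {"year": 1825, "president": "Adams"},
]
parties = [
    {"year": 1797, "president": "Adams", "party": "Federalist"},
    {"year": 1825, "president": "Adams", "party": "Democrat-Republican"},
]

# Join on surname: each speech matches both party rows, giving 2 x 2 = 4 rows
by_name = [
    {**s, "party": p["party"]}
    for s in speeches
    for p in parties
    if s["president"] == p["president"]
]

# Join on year: years are unique, so each speech gets exactly one party
party_by_year = {p["year"]: p["party"] for p in parties}
by_year = [{**s, "party": party_by_year.get(s["year"])} for s in speeches]

print(len(by_name), len(by_year))
```

Comparing the row count before and after a join is a cheap check that the key is unique on the lookup side.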


The most frequent tokens in the corpus are mostly punctuation and function words. We remove stopwords using NLTK's stopwords corpus, which contains common high-frequency words that carry little meaning on their own.

In [8]:
nltk.FreqDist(all_words).most_common(10)

[('the', 9555),
 (',', 7275),
 ('of', 7169),
 ('and', 5226),
 ('.', 5011),
 ('to', 4477),
 ('in', 2604),
 ('a', 2229),
 ('our', 2062),
 ('that', 1769)]

In [47]:
#remove stopwords 
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words = set(stopwords.words('english')) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [40]:
filtered_words = [word.lower() for word in all_words if word.lower() not in stop_words] 


In [14]:
#remove punctuation, stray numerals and other leftover tokens
custom_stopwords = {',', ' ', '.', ';', '?', '-', '!', '(', ')', '--', '"', "'", ':', '¡¦', '¡', '', '9', '/', '11', 'ii', '400', '1863', 'us', ',"'}

In [39]:
filtered_words2 = [ word.lower() for word in filtered_words if word.lower() not in custom_stopwords ] 


In [16]:
len(filtered_words2)

65255

In [18]:
len(set(filtered_words2))

9150
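A hand-maintained punctuation list has to be extended every time a new stray token shows up. One possible alternative, sketched below as an assumption rather than the notebook's method, is to keep only purely alphabetic tokens with `str.isalpha()`. The helper name `content_words` and the small `STOP_WORDS` set are hypothetical; the sample tokens are the opening of Washington's 1789 address as printed earlier.

```python
from collections import Counter

# Small illustrative stopword set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "our", "that"}

def content_words(tokens, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop stopwords and anything not purely alphabetic."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in stop_words]

sample = ["Fellow", "-", "Citizens", "of", "the", "Senate"]
filtered = content_words(sample)
counts = Counter(filtered)
print(filtered)
```

The trade-off is that `isalpha()` also discards tokens such as years, so it only fits when numerals are not wanted as features.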

#### 10 highest frequency words:

In [12]:
fdist = nltk.FreqDist(filtered_words2)
fdist.most_common(10) 
