# DATA620: Assignment 10 - Document Classification

## Homework Team 3: David Simbandumwe, Eric Lehmphul and Lidiia Tronina

### Assignment:

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data: [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)
For this project, you can either use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
For more adventurous students, you are welcome to come up a different set of documents that have already been classified , then analyze these documents to predict how new documents should be classified.

### Load Required Packages

In [120]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import random
import nltk
from nltk.corpus import inaugural
from operator import itemgetter
warnings.filterwarnings("ignore")
nltk.download('inagural')
nltk.download('stopwords')

[nltk_data] Error loading inagural: Package 'inagural' not found in
[nltk_data]     index
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lidiiatronina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Choose a corpus of interest for classification.

Presidents’ words matter. For better or worse, presidential rhetoric tells the American people who they are. That's why we decided to look at the inaugural speeches from the freely available library that can be downloaded from the NLTK package. The corpus is a collection of 55 texts, one for each presidential address.

We want to classify presidential speeches as Republican/Democrat. Most presidents have belonged to one of these two parties (except some who were Whigs and National Union), and we can start as early as Andrew Jackson's 1829 speech. It will be interesting to see if the two parties have different word use in their addresses and if we can identify the president's party by word choices.

Let's look at available texts in the corpus.

In [27]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

In [28]:
party = pd.read_csv('https://raw.githubusercontent.com/dsimband/DATA620_Group3/main/Week10_Assignment/presidents.csv',index_col=0)

party.head(20)

Unnamed: 0_level_0,president,party
year,Unnamed: 1_level_1,Unnamed: 2_level_1
1789,Washington,
1793,Washington,
1797,Adams,Federalist
1801,Jefferson,Democrat-Republican
1805,Jefferson,Democrat-Republican
1809,Madison,Democrat-Republican
1813,Madison,Democrat-Republican
1817,Monroe,Democrat-Republican
1821,Monroe,Democrat-Republican
1825,Adams,Democrat-Republican


## Data Exploration

There are a total of 152901 words in the corpus.

In [29]:
# Count ALL words
all_words = inaugural.words()
len(all_words)

152901

In [30]:
#Washington's speech
inaugural.words('1789-Washington.txt')

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [31]:
recent_list = inaugural.fileids()[-10:] 
for text in recent_list:
    word_list = inaugural.words(text)
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    unique_words = len(set(word_list))
    print ("For text " + text + ", the number of unique words is", str(unique_words))

For text 1985-Reagan.txt, the number of unique words is 876
For text 1989-Bush.txt, the number of unique words is 754
For text 1993-Clinton.txt, the number of unique words is 604
For text 1997-Clinton.txt, the number of unique words is 727
For text 2001-Bush.txt, the number of unique words is 593
For text 2005-Bush.txt, the number of unique words is 742
For text 2009-Obama.txt, the number of unique words is 900
For text 2013-Obama.txt, the number of unique words is 792
For text 2017-Trump.txt, the number of unique words is 547
For text 2021-Biden.txt, the number of unique words is 783


In [32]:
import pandas as pd
text_data = pd.DataFrame(columns = ['filename','year','length','unique'])
for file in inaugural.fileids():
    word_list = inaugural.words(file)
    word_list = [w.lower() for w in word_list] 
    this_file = pd.DataFrame(data = {"filename":[file], \
                                     "president" : [str(file[5:])], \
                                     "year" : [int(file[:4])], \
                                     "length" : [len(word_list)], \
                                     "unique" : [len(set(word_list))]})
    text_data = text_data.append(this_file, ignore_index=True)


In [33]:
text_data['president'] = text_data['president'].str.replace('.txt','')
president_party = pd.merge(text_data, party, on='president', how='outer')

president_party[-3:]

Unnamed: 0,filename,length,president,unique,year,party
116,2013-Obama.txt,2369,Obama,792,2013,Democrat
117,2017-Trump.txt,1693,Trump,547,2017,Republican
118,2021-Biden.txt,3104,Biden,783,2021,Democrat


## Preparing Documents for Classification

First we need to take all Democrat/Republican speeches and combine them to create one text. We will also remove punctuation and convert everything to lowercase to eliminate duplicate words. 
We also want to remove some of the common words for both parties such as - people, nation, world, country, etc.
Then we can take that and create a list of text segments. Each segment will be 500 words long.

In [34]:
democrat = inaugural.words('1833-Jackson.txt')+inaugural.words('1837-VanBuren.txt')+inaugural.words('1845-Polk.txt')+inaugural.words('1853-Pierce.txt')+inaugural.words('1857-Buchanan.txt')+inaugural.words('1885-Cleveland.txt')+inaugural.words('1893-Cleveland.txt')+inaugural.words('1905-Roosevelt.txt')+inaugural.words('1933-Roosevelt.txt')+inaugural.words('1937-Roosevelt.txt')+inaugural.words('1941-Roosevelt.txt')+inaugural.words('1945-Roosevelt.txt')+inaugural.words('1913-Wilson.txt')+inaugural.words('1917-Wilson.txt')+inaugural.words('1949-Truman.txt')+inaugural.words('1961-Kennedy.txt')+inaugural.words('1965-Johnson.txt')+inaugural.words('1969-Nixon.txt')+inaugural.words('1973-Nixon.txt')+inaugural.words('1977-Carter.txt')+inaugural.words('1993-Clinton.txt')+inaugural.words('1997-Clinton.txt')+inaugural.words('2009-Obama.txt')+inaugural.words('2013-Obama.txt')+inaugural.words('2021-Biden.txt')

In [35]:
republican = inaugural.words('1841-Harrison.txt')+inaugural.words('1889-Harrison.txt')+inaugural.words('1861-Lincoln.txt')+inaugural.words('1865-Lincoln.txt')+inaugural.words('1869-Grant.txt')+inaugural.words('1873-Grant.txt')+inaugural.words('1877-Hayes.txt')+inaugural.words('1881-Garfield.txt')+inaugural.words('1897-McKinley.txt')+inaugural.words('1901-McKinley.txt')+inaugural.words('1905-Roosevelt.txt')+inaugural.words('1933-Roosevelt.txt')+inaugural.words('1937-Roosevelt.txt')+inaugural.words('1941-Roosevelt.txt')+inaugural.words('1945-Roosevelt.txt')+inaugural.words('1909-Taft.txt')+inaugural.words('1921-Harding.txt')+inaugural.words('1925-Coolidge.txt')+inaugural.words('1929-Hoover.txt')+inaugural.words('1953-Eisenhower.txt')+inaugural.words('1957-Eisenhower.txt')+inaugural.words('1969-Nixon.txt')+inaugural.words('1973-Nixon.txt')+inaugural.words('1981-Reagan.txt')+inaugural.words('1985-Reagan.txt')+inaugural.words('1989-Bush.txt')+inaugural.words('2001-Bush.txt')+inaugural.words('2005-Bush.txt')+inaugural.words('2017-Trump.txt') 

In [36]:
#remove stopwords 
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 
#remove punctuation and common words
custom_stopwords = set((',',' ','.', ';', '?', '-', '!', '(', ')','--','"',"'", ':', '¡¦', '¡','', '9', '/', '11','ii', '400','1863','','people','shall','may','country','government','world', 'nation', 'must', 'great','upon','america','new','one','states','peace','every','us',',"'))

In [37]:
dem_filtered = [word.lower() for word in democrat if word.lower() not in stop_words] 

In [38]:
dem_filtered2 = [ word.lower() for word in dem_filtered if word.lower() not in custom_stopwords ] 

In [39]:
len(dem_filtered2)

21901

In [40]:
democrat1=[]
for i in range(43):
    democrat1.append([dem_filtered2[i*500:(i+1)*500],'dem'])

In [41]:
len(democrat1)

43

In [42]:
rep_filtered = [word.lower() for word in republican if word.lower() not in stop_words] 

In [43]:
rep_filtered2 = [ word.lower() for word in rep_filtered if word.lower() not in custom_stopwords ] 

In [44]:
len(rep_filtered2)

32344

In [45]:
republican1=[]
for i in range(64):
    republican1.append([rep_filtered2[i*500:(i+1)*500],'rep'])

In [46]:
len(republican1)

64

We now have a list of 43 Democrat and 64 Republican 500-word segments. These are the most common words for each party:

In [47]:
fdist = nltk.FreqDist(dem_filtered2)
fdist.most_common(20) 

[('let', 109),
 ('time', 102),
 ('power', 90),
 ('nations', 78),
 ('life', 78),
 ('union', 74),
 ('american', 73),
 ('citizens', 72),
 ('would', 72),
 ('spirit', 72),
 ('today', 70),
 ('united', 69),
 ('men', 69),
 ('free', 68),
 ('fellow', 65),
 ('together', 65),
 ('know', 63),
 ('constitution', 62),
 ('work', 60),
 ('public', 59)]

In [48]:
fdist1 = nltk.FreqDist(rep_filtered2)
fdist1.most_common(20) 

[('freedom', 132),
 ('power', 129),
 ('citizens', 119),
 ('constitution', 116),
 ('time', 112),
 ('would', 111),
 ('law', 107),
 ('free', 106),
 ('make', 104),
 ('public', 103),
 ('american', 99),
 ('united', 96),
 ('congress', 93),
 ('made', 92),
 ('years', 92),
 ('nations', 92),
 ('war', 90),
 ('national', 88),
 ('men', 86),
 ('good', 85)]

## Feature Extraction

We use a method from Natural Language Processing with Python to define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. We can define a feature for each word, indicating whether the segment contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document. 

In [55]:
all=dem_filtered2+rep_filtered2
all_words = nltk.FreqDist(w.lower() for w in all)
word_features = list(all_words)[:2000] 

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [56]:
features = document_features(dem_filtered2)
list(features.items())[:20]

[('contains(fellow)', True),
 ('contains(citizens)', True),
 ('contains(american)', True),
 ('contains(expressed)', True),
 ('contains(unsolicited)', True),
 ('contains(suffrages)', True),
 ('contains(calls)', True),
 ('contains(pass)', True),
 ('contains(solemnities)', True),
 ('contains(preparatory)', True),
 ('contains(taking)', True),
 ('contains(duties)', True),
 ('contains(president)', True),
 ('contains(united)', True),
 ('contains(another)', True),
 ('contains(term)', True),
 ('contains(approbation)', True),
 ('contains(public)', True),
 ('contains(conduct)', True),
 ('contains(period)', True)]

Now that we've defined our feature extractor, we can use it to train a classifier to label each speech segment.

In [53]:
documents=republican1+democrat1

## Create Test and Train Dataset

Now we create a list of all text segments from both Republican and Democrat and shuffle them to create the text corpus that we will use to train and test our classifier model. We will use 86 segments to train our model and leave 21 segments for testing.

In [121]:
#random.seed(123)
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

107

In [122]:
size = int(len(featuresets) * 0.2)
train_set, test_set = featuresets[size:], featuresets[:size]
len(test_set)

21

## Model 1

We use naive Bayes classifier to train our first model. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. 
(Natural Language Processing with Python https://www.nltk.org/book/ch06.html)

In [118]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [119]:
print(nltk.classify.accuracy(classifier, test_set))

0.6666666666666666


This model has a 66,6% accuracy.

Let's see what features are most important in training our model.

In [66]:
classifier.show_most_informative_features()

Most Informative Features
    contains(concession) = True              dem : rep    =      8.7 : 1.0
        contains(actual) = True              dem : rep    =      7.8 : 1.0
    contains(protecting) = True              dem : rep    =      7.8 : 1.0
         contains(minds) = True              dem : rep    =      6.9 : 1.0
      contains(fruitful) = True              dem : rep    =      6.0 : 1.0
 contains(consciousness) = True              dem : rep    =      6.0 : 1.0
   contains(appropriate) = True              dem : rep    =      6.0 : 1.0
         contains(globe) = True              dem : rep    =      6.0 : 1.0
         contains(happy) = True              dem : rep    =      5.8 : 1.0
        contains(humbly) = True              dem : rep    =      5.2 : 1.0


Interestingly enough, the most informative of the party words for democrats are concession, protecting, fruitful, consciosness. 

## Conclusion