# Welcome to "What would presidents say?"

# Which presidents do you think like?
We have access to all the US Presidential Speeches from April 30th, 1789 to Sept. 25th, 2019. Presidential speeches are an opportunity for Presidents to set the tone of the country, put forth their agenda, influence policy and sway public opinion.

We're going to build a system that takes an arbitrary sentence as input and outputs the 3 most similar sentences, each from a different president.

This reminds me of those WWJD aka What Would Jesus Do wrist bands that were popular back when I was in High School/University.

Now we're going to create WWPS, i.e. "What would presidents say?" - a pipeline that, when given a sentence, spits out the top 3 similar sentences previously uttered by different US presidents.
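The "top 3 from different presidents" requirement can be sketched on its own, before any similarity scoring exists. The snippet below is a minimal illustration, not the notebook's implementation: the function name `top_k_distinct` and the placeholder candidates are assumptions, and the scores are made up purely to exercise the logic.

```python
from typing import List, Tuple

# (president, sentence, similarity score); all values below are invented placeholders
Scored = Tuple[str, str, float]

def top_k_distinct(scored: List[Scored], k: int = 3) -> List[Scored]:
    """Return the k best-scoring sentences, at most one per president."""
    best = []
    seen = set()
    # Walk the candidates from most to least similar
    for president, sentence, score in sorted(scored, key=lambda row: row[2], reverse=True):
        if president in seen:
            continue  # this president already contributed a better sentence
        seen.add(president)
        best.append((president, sentence, score))
        if len(best) == k:
            break
    return best

candidates = [
    ("President A", "sentence a1", 0.91),
    ("President A", "sentence a2", 0.88),
    ("President B", "sentence b1", 0.75),
    ("President C", "sentence c1", 0.60),
    ("President D", "sentence d1", 0.42),
]
top3 = top_k_distinct(candidates)
for president, sentence, score in top3:
    print(f"{score:.2f}  {president}: {sentence}")
```

Keeping this step separate from the scoring means any similarity measure can be swapped in later without touching the selection logic.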

Source data: https://www.kaggle.com/littleotter/united-states-presidential-speeches

# Step i - Understanding the problem

The first thing we need to do is understand the type of problem we are solving. The main features of this problem are:
- Unsupervised
    - The data is not labelled and we don't have a test set to validate our results.
- Text Similarity
    - We need to compare one piece of text with another and determine their similarity.
- Information Retrieval
    - We need to fetch relevant sources of information from a corpus, just like a query on a search engine.

Now that we understand the problem, we know which steps to take to develop a suitable algorithm.

There are 2 main ways to analyse Text Similarity: Lexical Similarity and Semantic Similarity. We will select Semantic Similarity because, in addition to syntax, the algorithm will take context into account.
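To make the distinction concrete, here is a small sketch of a purely lexical measure, Jaccard similarity on word sets. It is only an illustration (the function name `jaccard` and the example sentences are assumptions). A semantic measure would need word vectors or a similar model, which this sketch does not attempt.

```python
def jaccard(a: str, b: str) -> float:
    """Lexical similarity: shared words divided by all distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Same words, opposite meaning: a lexical measure rates these as similar
same_words = jaccard("the economy is strong", "the economy is weak")
# Similar meaning, no shared words: a lexical measure rates these as unrelated
same_meaning = jaccard("the economy is strong", "our markets are booming")
print(same_words, same_meaning)
```

The second pair is exactly the case where a semantic measure is expected to do better.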

The steps involved in Text Similarity:
1. Text Normalization
    - Tokenize the sentences.
    - Develop a method that can be used for both the search_ and the corpus.
2. Information Retrieval
    - We need to define search_ and find a way to get input from the user.
3. Feature Engineering
    - Our options here include Bag of Words, TF-IDF and Word Vectorization
    - The 2 options include Lexical and Semantic.
4. Similarity Measure
    - We need to find an optimal measure.
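Step 1 talks about sentences, while the corpus stores one whole transcript per row, so at some point each transcript has to be cut into sentences. The following is a rough, dependency-free sketch of that step. The regex and the name `split_sentences` are assumptions, not part of the notebook. NLTK's `sent_tokenize` is the more robust option when its tokenizer data is installed.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter or a quote
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"])')

def split_sentences(transcript: str) -> list:
    """Naive sentence splitter; abbreviations such as 'Mr.' will be split wrongly."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(transcript) if s.strip()]

sample = "Thank you. Thank you very much. Is this the end? No!"
sentences = split_sentences(sample)
print(sentences)
```

Titles such as "Mr." are a known weak spot of this approach: `split_sentences("Mr. Speaker, hello.")` breaks after "Mr.", which matters for speeches that open with "Mr. Speaker, Mr. President".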

# Step ii: Import Libraries and Load the Data

In [190]:
#import matplotlib.pyplot as plt
import pandas as pd
import nltk
import re

from io import StringIO

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords  # needed for stopwords.words('english') in Step 4

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline




In [6]:
def read_corpus():
    corpus_df = pd.read_csv('corpus.csv')
    return corpus_df

In [7]:
corpus_df = read_corpus()

In [99]:
corpus_df.iloc[:10]

Unnamed: 0,President,Party,transcripts
0,George Washington,Unaffiliated,Fellow Citizens of the Senate and the House of...
1,John Adams,Federalist,"When it was first perceived, in early times, t..."
2,Thomas Jefferson,Democratic-Republican,"FRIENDS AND FELLOW-CITIZENS, Called upon to un..."
3,James Madison,Democratic-Republican,Unwilling to depart from examples of the most ...
4,James Monroe,Democratic-Republican,I should be destitute of feeling if I was not ...
5,John Quincy Adams,Democratic-Republican,"AND NOW, FRIENDS AND COUNTRYMEN, if the wise a..."
6,Andrew Jackson,Democratic,Fellow Citizens: About to undertake the arduou...
7,Martin Van Buren,Democratic,Fellow Citizens: The practice of all my predec...
8,William Harrison,Whig,Called from a retirement which I had supposed ...
9,John Tyler,Unaffiliated,To the People of the United States Before my a...


In [26]:
corpus_df.rename(columns={"Unnamed: 0": "President"}, inplace=True)
corpus_df.head()

Unnamed: 0,President,Party,transcripts
0,George Washington,Unaffiliated,Fellow Citizens of the Senate and the House of...
1,John Adams,Federalist,"When it was first perceived, in early times, t..."
2,Thomas Jefferson,Democratic-Republican,"FRIENDS AND FELLOW-CITIZENS, Called upon to un..."
3,James Madison,Democratic-Republican,Unwilling to depart from examples of the most ...
4,James Monroe,Democratic-Republican,I should be destitute of feeling if I was not ...


# Step 1: Text Normalization

Removing stop words may harm the results since we are looking for similarities in sentences. So we will try it both **with** and **without stop words**.

Reference: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html
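A quick way to see what is at stake is to count the words two unrelated sentences share, with and without stop words. This is a toy sketch: the stop list `TOY_STOP_WORDS`, the helper `shared_words` and both sentences are invented for illustration, whereas the notebook itself uses NLTK's English list.

```python
# A tiny hand-picked stop list for illustration only; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "of", "to", "and", "a", "in", "is", "we", "will"}

def shared_words(a: str, b: str, drop_stop_words: bool) -> set:
    """Words two sentences have in common, optionally ignoring stop words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if drop_stop_words:
        words_a -= TOY_STOP_WORDS
        words_b -= TOY_STOP_WORDS
    return words_a & words_b

s1 = "we will defend the freedom of the seas"
s2 = "we will reduce the price of the bread"
with_stop = shared_words(s1, s2, drop_stop_words=False)
without_stop = shared_words(s1, s2, drop_stop_words=True)
print(sorted(with_stop), sorted(without_stop))
```

With stop words kept, the two sentences overlap on function words only. With them removed, nothing is shared. The flip side is that a sentence built almost entirely from stop words would have nothing left to match on, which is why both variants are worth trying.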

In [83]:
stopword_list = nltk.corpus.stopwords.words('english')

In [73]:
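# Regex patterns (raw strings, so the backslashes reach the regex engine unchanged).
# Despite the first name, normalize() replaces matches of both patterns with a space.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")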
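To check what the two patterns actually match, it helps to run them on a short string. The sketch below repeats the patterns, written as raw strings, so that it is self-contained. The sample string is invented, and `str.split` stands in for the NLTK tokenizer used later.

```python
import re

# Same two patterns as in the cell above, written as raw strings
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

sample = "Fellow-Citizens: peace, and/or war?"
step1 = REPLACE_NO_SPACE.sub(" ", sample)    # punctuation -> space
step2 = REPLACE_WITH_SPACE.sub(" ", step1)   # hyphens and slashes -> space
tokens = step2.split()                       # split on whitespace, dropping the extra spaces
print(tokens)
```

Note that the apostrophe is in the first pattern, so a contraction like "It's" ends up as two tokens, "It" and "s".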

In [123]:
# Take the first 100 characters of one transcript (row 6) as a test sample
test = corpus_df['transcripts'].iloc[6][:100]
test

'Fellow Citizens: About to undertake the arduous duties that I have been appointed to perform by the '

### Normalize a line of text

In [128]:
text =0

In [129]:
def normalize(text):
    # Replace punctuation, hyphens and slashes with spaces
    text = re.sub(REPLACE_NO_SPACE, " ", text)
    text = re.sub(REPLACE_WITH_SPACE, " ", text)

    # Split into word tokens with nltk's word_tokenize
    tokens = word_tokenize(text)

    # Lowercase, then lemmatise every token as a verb (pos='v')
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]

    return lemmas

In [130]:
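def normalize_stop(text):
    # Same cleaning as normalize(), but stop words are dropped as well
    text = re.sub(REPLACE_NO_SPACE, " ", text)
    text = re.sub(REPLACE_WITH_SPACE, " ", text)

    tokens = word_tokenize(text)

    # Lemmatise and drop stop words.
    # Note: the stop-word check runs on the original-case token, so capitalised
    # stop words such as "About" or "I" are not removed.
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens if token not in stopword_list]

    return lemmas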

In [131]:
len(normalize(test)),len(normalize_stop(test))

(17, 9)

In [133]:
normalize(test),normalize_stop(test)

(['fellow',
  'citizens',
  'about',
  'to',
  'undertake',
  'the',
  'arduous',
  'duties',
  'that',
  'i',
  'have',
  'be',
  'appoint',
  'to',
  'perform',
  'by',
  'the'],
 ['fellow',
  'citizens',
  'about',
  'undertake',
  'arduous',
  'duties',
  'i',
  'appoint',
  'perform'])
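In the second list above, 'about' and 'i' survive even though both are in NLTK's English stop list. The reason is that the membership test sees the original tokens "About" and "I", and the list is all lowercase. A possible fix is to lowercase before filtering. The sketch below shows the idea with a small stand-in stop list and `str.split` so it runs without NLTK. `drop_stop_words` is a hypothetical helper, not a function from the notebook.

```python
# Stand-in stop list (assumption); in the notebook this would be stopword_list from NLTK
STOP_WORDS = {"about", "to", "the", "that", "i", "have", "been", "by"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Lowercase each token first, then filter, so 'About' and 'I' are caught too."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]

tokens = ("Fellow Citizens About to undertake the arduous duties that I "
          "have been appointed to perform by the").split()
filtered = drop_stop_words(tokens)
print(filtered)
```

Applied inside `normalize_stop`, the same change would amount to testing `token.lower() not in stopword_list`, which would shorten the stop-word variant of the output above.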

### Normalize the dataframe.column

In [134]:
text

0

In [115]:
corpus_df.transcripts[:2]

0    Fellow Citizens of the Senate and the House of...
1    When it was first perceived, in early times, t...
Name: transcripts, dtype: object

In [162]:
def get_tokens(col):
    new_column = [normalize(row) for row in col]  
    return new_column 

In [163]:
corpus_df['tokens'] = get_tokens(corpus_df.transcripts)


In [164]:
def get_tokens(col):
    new_column = [normalize_stop(row) for row in col]  
    return new_column 

In [165]:
corpus_df['tokens_stop'] = get_tokens(corpus_df.transcripts)
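`get_tokens` is defined twice above, so the second definition silently replaces the first, and re-running the cells out of order would fill `tokens` with the wrong variant. One way around that, sketched here under assumptions, is a single helper that takes the normalizer as an argument. `tokenize_column` and the two stand-in normalizers are hypothetical names used so the snippet runs without NLTK or pandas.

```python
def tokenize_column(rows, normalizer):
    """Apply one normalizer function to every document in a column."""
    return [normalizer(row) for row in rows]

# Stand-in normalizers (assumptions) so the sketch runs without NLTK
def simple_normalize(text):
    return text.lower().split()

def simple_normalize_stop(text):
    return [t for t in text.lower().split() if t not in {"the", "of", "and"}]

rows = ["Fellow Citizens of the Senate", "Friends and Fellow Citizens"]
tokens = tokenize_column(rows, simple_normalize)
tokens_stop = tokenize_column(rows, simple_normalize_stop)
print(tokens, tokens_stop)
```

In the notebook this would read `corpus_df['tokens'] = tokenize_column(corpus_df.transcripts, normalize)` and likewise with `normalize_stop`.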

In [166]:
corpus_df[30:40]

Unnamed: 0,President,Party,transcripts,tokens,tokens_stop
30,Theodore Roosevelt,Republican,By the President of the United States of Ameri...,"[by, the, president, of, the, unite, state, of...","[by, president, unite, state, america, a, proc..."
31,Harry S. Truman,Democratic,"Mr. Speaker, Mr. President, Members of the Con...","[mr, speaker, mr, president, members, of, the,...","[mr, speaker, mr, president, members, congress..."
32,Dwight D. Eisenhower,Republican,"My friends, before I begin the expression of t...","[my, friends, before, i, begin, the, expressio...","[my, friends, i, begin, expression, thoughts, ..."
33,John F. Kennedy,Democratic,"Governor Stevenson, Senator Johnson, Mr. Butle...","[governor, stevenson, senator, johnson, mr, bu...","[governor, stevenson, senator, johnson, mr, bu..."
34,Lyndon B. Johnson,Democratic,"On this hallowed ground, heroic deeds were per...","[on, this, hallow, grind, heroic, deeds, be, p...","[on, hallow, grind, heroic, deeds, perform, el..."
35,Richard M. Nixon,Republican,My Fellow Americans: I come before you tonight...,"[my, fellow, americans, i, come, before, you, ...","[my, fellow, americans, i, come, tonight, cand..."
36,Gerald Ford,Republican,"Mr. Chief Justice, my dear friends, my fellow ...","[mr, chief, justice, my, dear, friends, my, fe...","[mr, chief, justice, dear, friends, fellow, am..."
37,Jimmy Carter,Democratic,"I am Edwin Newman, moderator of this first deb...","[i, be, edwin, newman, moderator, of, this, fi...","[i, edwin, newman, moderator, first, debate, 1..."
38,Ronald Reagan,Republican,Thank you. Thank you very much. Thank you and ...,"[thank, you, thank, you, very, much, thank, yo...","[thank, thank, much, thank, good, even, the, s..."
39,George H. W. Bush,Republican,I have many friends to thank tonight. I thank ...,"[i, have, many, friends, to, thank, tonight, i...","[i, many, friends, thank, tonight, i, thank, v..."


# Step 3: Feature Extraction

In [170]:
def build_feature_matrix(documents, feature_type='frequency', ngram_range=(1, 1), min_df=0.0, max_df=1.0):
    """Fit a vectorizer on `documents` (an iterable of strings) and return (vectorizer, feature_matrix).

    feature_type: 'binary' (term present or not), 'frequency' (raw counts) or 'tfidf'.
    """
    feature_type = feature_type.lower().strip()
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    else:
        raise Exception("Wrong feature type entered. Possible values: 'binary', 'frequency','tfidf'")
    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix
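The three feature types differ only in what number ends up in each cell of the matrix. The sketch below computes all three by hand on two invented toy documents, using only the standard library. The TF-IDF part assumes the smoothed idf and L2 row normalisation that scikit-learn's `TfidfVectorizer` uses by default, so treat it as an approximation of what the function above returns, not a replacement for it.

```python
import math
from collections import Counter

docs = ["peace and war", "peace and peace"]            # toy documents (invented)
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # ['and', 'peace', 'war']

# 'frequency': raw term counts per document
frequency = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
# 'binary': 1 if the term occurs at all, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in frequency]

# 'tfidf': count * idf, then L2-normalise each row.
# idf uses the smoothed form ln((1 + n) / (1 + doc_freq)) + 1 (assumed default).
n_docs = len(docs)
doc_freq = [sum(1 for row in binary if row[j]) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def l2_normalise(row):
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row

tfidf = [l2_normalise([c * w for c, w in zip(row, idf)]) for row in frequency]

print(vocab)
print(frequency)
print(binary)
print([[round(x, 3) for x in row] for row in tfidf])
```

In the first document every word occurs once, yet "war" gets the largest TF-IDF weight because it appears in only one of the two documents.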

# Step 4: Text Similarity (Lexical or Semantic)

Possible improvement: we could train on a much larger data set, e.g. book excerpts or a Twitter feed.

In [177]:
enter_sentence = ["It's been a long time since we've been to the park to play with the dog."]

In [191]:
stop_words = set(stopwords.words('english')) 

In [192]:
# Interface lemma tokenizer from nltk with sklearn
class LemmaTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t not in self.ignore_tokens]

In [193]:
tokenizer=LemmaTokenizer()
token_stop = tokenizer(' '.join(stop_words))

search_terms = 'red tomato'
documents = ['cars drive on the road', 'tomatoes are actually fruit']

# Create TF-idf model
vectorizer = TfidfVectorizer(stop_words=token_stop, 
                              tokenizer=tokenizer)
doc_vectors = vectorizer.fit_transform([search_terms] + documents)

# Calculate similarity
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]


In [194]:
document_scores

[0.0, 0.2867109723804671]
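The first document shares no term with the query, hence 0.0. The second one matches, presumably because "tomatoes" is lemmatised to "tomato". The cell calls `linear_kernel`, which is a plain dot product, yet the result is named a cosine similarity. That works because `TfidfVectorizer` scales every row to unit length by default. The sketch below checks the equivalence on two made-up vectors using only the standard library. The helper names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit(u):
    """Scale a vector to length 1 (L2 normalisation)."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

query = [1.0, 2.0, 0.0]      # toy vectors, not taken from the corpus
document = [2.0, 1.0, 2.0]

cos = cosine(query, document)
linear = dot(unit(query), unit(document))  # what a linear kernel computes on unit vectors
print(cos, linear)
```

If the vectors were raw counts (the 'binary' or 'frequency' features), the dot product alone would favour long documents, and an explicit cosine would be needed.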

### Normalize and Extract Features for user data

In [178]:
# normalize() expects a single string, so normalise each sentence and re-join its tokens
norm_sentence = [' '.join(normalize(sentence)) for sentence in enter_sentence]
tfidf_vectorizer, tfidf_features = build_feature_matrix(norm_sentence,
                                                        feature_type='binary',
                                                        ngram_range=(1, 1),
                                                        min_df=0.0, max_df=1.0)

# Step 5: Term Similarity