<a name="top"></a>
# Welcome to WWPS (What would Presidents say?)!

We have access to all the US Presidential Speeches from April 30th, 1789 to Sept. 25th, 2019. Presidential speeches are an opportunity for Presidents to set the tone of the country, put forth their agenda, influence policy and sway public opinion. This dataset can give us some good insight into what a President would say about various topics. 

Now we're going to create WWPS, i.e. "What would Presidents say?"

We will build a pipeline that, given a sentence, returns the top 3 most similar sentences previously uttered by different US presidents.

Source data: https://www.kaggle.com/littleotter/united-states-presidential-speeches

<a name =stepi>
    
# Step i - Understanding the problem
[Go to top](#top)

The first thing we need to do is understand the type of problem we are solving. The main features of this problem are:
- Unsupervised Learning
    - The data is not labelled and we don't have a test set to validate our results.
- Text Similarity
    - We need to compare one piece of text with another and determine how similar they are.
- Information Retrieval
    - We need to fetch relevant information from a corpus, just like a query on a search engine.

Now that we know the type of problem, we understand which steps to take to develop a suitable algorithm.

There are 2 main ways to analyse Text Similarity: Lexical Similarity and Semantic Similarity. We will use Semantic Similarity because, in addition to the wording itself, the algorithm will consider context.
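
To make the distinction concrete, here is a minimal sketch of a purely lexical measure (Jaccard overlap of word sets). The function name and the example sentences are assumptions made for illustration and are not part of the pipeline below. It shows why lexical overlap alone can miss paraphrases and can over-reward sentences that share words but differ in meaning.

```python
def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Lexical similarity: shared words divided by all distinct words."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two sentences with a similar meaning but no word in common.
paraphrase_score = jaccard_similarity("taxes must fall", "reduce the levy")
# Two sentences that share most words but differ in meaning.
overlap_score = jaccard_similarity("taxes must fall", "taxes must rise")

print(paraphrase_score, overlap_score)
```

A semantic measure aims to score the first pair higher than a lexical measure does, which is the motivation for using word embeddings later on.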

The steps involved in Text Similarity: 

i. [Understanding the problem](#stepi)  
ii. [Import Libraries and Load the data](#stepii)  
iii.  [Work with subset of data](#stepiii)

1. [Information Retrieval](#step1)
    - We need to define a search term (query) and find a way to get input from the user.
2. [Text Normalization](#step2)
    - Tokenize the sentences.
    - Develop a method that can be used for both the search query and the corpus.
3. [Feature Engineering](#step3)
    - Our options here include Bag of Words, TF-IDF and Word Vectorization
    - We will use a Word Embedding Model  
4. [Similarity Measure](#step4)
    - We need to find an optimal measure. 
    - Options include Hellinger-Bhattacharya Distance and Okapi BM25 ranking
    - We will use Soft Cosine Measure
5. [Apply to Entire Dataframe](#step5)
6. [Results](#results)
7. [Limitations](#limitations)
8. [References](#references)

<a name =stepii>
    
# Step ii: Import Libraries and Load the Data

[Go to top](#top)

In [4]:
#import matplotlib.pyplot as plt
import pandas as pd
import nltk
import re

from io import StringIO

from nltk import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

In [2]:
def read_corpus():
    corpus_df = pd.read_csv('corpus.csv')
    return corpus_df

In [5]:
corpus_df = read_corpus()

In [341]:
corpus_df.iloc[:10]

Unnamed: 0.1,Unnamed: 0,Party,transcripts
0,George Washington,Unaffiliated,Fellow Citizens of the Senate and the House of...
1,John Adams,Federalist,"When it was first perceived, in early times, t..."
2,Thomas Jefferson,Democratic-Republican,"FRIENDS AND FELLOW-CITIZENS, Called upon to un..."
3,James Madison,Democratic-Republican,Unwilling to depart from examples of the most ...
4,James Monroe,Democratic-Republican,I should be destitute of feeling if I was not ...
5,John Quincy Adams,Democratic-Republican,"AND NOW, FRIENDS AND COUNTRYMEN, if the wise a..."
6,Andrew Jackson,Democratic,Fellow Citizens: About to undertake the arduou...
7,Martin Van Buren,Democratic,Fellow Citizens: The practice of all my predec...
8,William Harrison,Whig,Called from a retirement which I had supposed ...
9,John Tyler,Unaffiliated,To the People of the United States Before my a...


In [6]:
corpus_df.rename( columns={"Unnamed: 0" :"President"}, inplace=True )
corpus_df.head()

Unnamed: 0,President,Party,transcripts
0,George Washington,Unaffiliated,Fellow Citizens of the Senate and the House of...
1,John Adams,Federalist,"When it was first perceived, in early times, t..."
2,Thomas Jefferson,Democratic-Republican,"FRIENDS AND FELLOW-CITIZENS, Called upon to un..."
3,James Madison,Democratic-Republican,Unwilling to depart from examples of the most ...
4,James Monroe,Democratic-Republican,I should be destitute of feeling if I was not ...


In [96]:
corpus_df.shape

(44, 3)

<a name =stepiii>
    
# Step iii:  Work with subset of data </a>
For example, start with 1 sentence, then move on to 1 row (1 president).

[Go to top](#top)

In [358]:
#Get a sample to work with
#corpus_df.transcripts[-3:]
#word_tokenize(df['text'].iloc(0)[20])
#document_1=(corpus_df.transcripts[3])
#presidents =corpus_df.President

<a name =step1>
    
# Step 1: Information Retrieval Parameter</a>

[Go to top](#top)

In [7]:
# Read the query sentence from the user with input()
  
query_string = input("Type in your sentence: ")
query_string

Type in your sentence: In tendering this homage to the Great Author of every public and private good


'In tendering this homage to the Great Author of every public and private good'
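
The cell above accepts any input without checks. As a minimal sketch of what basic validation could look like, the hypothetical helper below (`validate_query` and its `min_words` threshold are assumptions, not part of the original notebook) collapses whitespace and rejects queries that are too short to carry much meaning.

```python
import re


def validate_query(text: str, min_words: int = 3) -> str:
    """Return the cleaned query, or raise ValueError with a readable message."""
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    words = re.findall(r"[A-Za-z']+", cleaned)
    if len(words) < min_words:
        raise ValueError(
            f"Please enter at least {min_words} words; got {len(words)}."
        )
    return cleaned


print(validate_query("  I believe   I can fly. "))
```

In the notebook this could wrap the prompt, e.g. `query_string = validate_query(input("Type in your sentence: "))`, with the threshold tuned to taste.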

<a name =step2>
    
# Step 2: Text Normalization </a>

[Go to top](#top)

In [33]:
from re import sub
# gensim's simple_preprocess lowercases and tokenizes the text; de-accenting is optional.
from gensim.utils import simple_preprocess

#query_string = 'fruit and vegetables'
#documents_1 = ('cars drive on the road. tomatoes are actually fruit')

speeches=(corpus_df.transcripts[0])
sentences=sent_tokenize(speeches)
president =corpus_df.President[0]
president, sentences

('George Washington',
 ['Fellow Citizens of the Senate and the House of Representatives: Among the vicissitudes incident to life, no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the fourteenth day of the present month.',
  'On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years: a retreat which was rendered every day more necessary as well as more dear to me, by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time.',
  'On the other hand, the magnitude and difficulty of the trust to which the voice of my Country called me, being sufficient to awaken in the wisest and most experienced of her citizens, a distrustful scrutiny

In [34]:
stopwords = ['the', 'and', 'are', 'a']

# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    #doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    #doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [35]:
# Preprocess the documents, including the query string

corpus = [preprocess(sentence) for sentence in sentences] 
query = preprocess(query_string)


<a name =step3>
    
# Step 3: Feature Engineering </a>

[Go to top](#top)

Source: https://towardsdatascience.com/how-to-rank-text-content-by-semantic-similarity-4d2419a84c32

In [37]:
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-50")  

In [38]:
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

  
similarity_index = WordEmbeddingSimilarityIndex(glove)

# Build the term dictionary, TF-idf model
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)


# Create the term similarity matrix.  
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
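
For intuition about what a TF-IDF weighting does, here is a simplified pure-Python sketch. It assumes the weight is the raw term count times log2(N / document frequency) and skips normalisation, so the numbers are illustrative only and may not match what a library model produces. The toy corpus and the `tfidf_weights` helper are made up for this example.

```python
import math
from collections import Counter

toy_corpus = [
    ["people", "liberty", "union"],
    ["people", "congress", "war"],
    ["people", "liberty", "peace"],
]


def tfidf_weights(document, corpus):
    """Weight each term by term frequency times log2(N / document frequency).

    Assumes `document` is one of the documents in `corpus`, so every term
    has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(document).items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        weights[term] = count * math.log2(n_docs / doc_freq)
    return weights


weights = tfidf_weights(toy_corpus[0], toy_corpus)
print(weights)
```

A word that appears in every document ("people") gets a weight of zero, while a word unique to one document ("union") gets the highest weight, so rare terms dominate the comparison.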

<a name =step4>
    
# Step 4: Soft Cosine Similarity </a>

[Go to top](#top)

In [40]:
import numpy as np
# Compute Soft Cosine Measure between the query and the documents.
# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(sentence) for sentence in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]

# Output the sorted similarity scores and documents
sorted_indexes = np.argsort(doc_similarity_scores)[::-1]  # argsort is ascending; [::-1] reverses it so the highest score comes first
idx = sorted_indexes[0]
print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {sentences[idx]}')

6 	 0.804 	 In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own; nor those of my fellow citizens at large, less than either.
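
For intuition about the score above, the Soft Cosine Measure between two term-weight vectors a and b can be written as a·S·b divided by sqrt(a·S·a) times sqrt(b·S·b), where S holds term-to-term similarities. The sketch below is a small pure-Python illustration, not the library implementation; the three-word vocabulary and the 0.8 similarity value are invented for the example.

```python
import math


def soft_cosine(vec_a, vec_b, term_sim):
    """Soft cosine: a.S.b / (sqrt(a.S.a) * sqrt(b.S.b)) for a term-similarity matrix S."""

    def bilinear(x, y):
        return sum(
            x[i] * term_sim[i][j] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    norm_a = math.sqrt(bilinear(vec_a, vec_a))
    norm_b = math.sqrt(bilinear(vec_b, vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # guard against dividing by a zero norm
    return bilinear(vec_a, vec_b) / (norm_a * norm_b)


# Hypothetical 3-term vocabulary: ["president", "leader", "banana"].
# The 0.8 similarity between "president" and "leader" is made up for illustration.
term_sim = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_president = [1.0, 0.0, 0.0]
doc_leader = [0.0, 1.0, 0.0]

hard_score = soft_cosine(doc_president, doc_leader, identity)  # plain cosine
soft_score = soft_cosine(doc_president, doc_leader, term_sim)
print(hard_score, soft_score)
```

With the identity matrix the two documents share no terms and score zero; with the term-similarity matrix the related words contribute, which is how the measure can match sentences that use different but related vocabulary.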


<a name =step5>
    
# Step 5: Apply to entire Dataframe </a>

[Go to top](#top)

### Trial #1 - Used existing phrase in corpora

In [97]:
#terms not loop dependent
query = preprocess(query_string)
similarity_index = WordEmbeddingSimilarityIndex(glove)
top_matches = pd.DataFrame()
for i in range(0,corpus_df.shape[0]):
    speeches=(corpus_df.transcripts[i])
    sentences=sent_tokenize(speeches)
    president =corpus_df.President[i]
    # STEP 2 -TEXT NORMALIZATION
    # Preprocess the documents, including the query string
    corpus = [preprocess(sentence) for sentence in sentences] 
    # STEP 3 - FEATURE ENGINEERING
    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus+[query])
    tfidf = TfidfModel(dictionary=dictionary)
    # Create the term similarity matrix.  
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    # STEP 4 - SOFT COSINE SIMILARITY
    # Compute Soft Cosine Measure between the query and the documents.
    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
                tfidf[[dictionary.doc2bow(sentence) for sentence in corpus]],
                similarity_matrix)
    doc_similarity_scores = index[query_tf]
    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)
    ranked=sorted_indexes[::-1]
    # runtime warning with index[0] only
    idx = ranked[0]
    # The most relevant sentence from each president
    df = {'President': president, 'Matching Sentence': sentences[idx], 'Index': idx, 'Similarity Score': (doc_similarity_scores[idx])}
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    top_matches = pd.concat([top_matches, pd.DataFrame([df])], ignore_index=True)
    top=top_matches.sort_values(by=['Similarity Score'], ascending=False)

top.head(3)

  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))


Unnamed: 0,President,Matching Sentence,Index,Similarity Score
0,George Washington,In tendering this homage to the Great Author o...,6.0,0.803656
1,John Adams,"Irritated by no literary altercation, animated...",11.0,0.541808
2,Thomas Jefferson,Long absent on duties which the history of a w...,929.0,0.582807
3,James Madison,To meet the extended and diversified warfare a...,526.0,0.559601
4,James Monroe,Under these impressions I invite your attentio...,1307.0,0.53225
5,John Quincy Adams,The next of these cardinal measures of policy ...,812.0,0.608466
6,Andrew Jackson,The beneficent Author of All Good has granted ...,477.0,0.519133
7,Martin Van Buren,The Constitution devised by our forefathers as...,691.0,0.559491
8,William Harrison,I wish for the support of no member of that bo...,203.0,0.47184
9,John Tyler,It will not escape the observation of Congress...,1218.0,0.506621


In [133]:
top=top_matches.sort_values(by=['Similarity Score'], ascending=False)
top.head(3)


Unnamed: 0,President,Matching Sentence,Index,Similarity Score
0,George Washington,In tendering this homage to the Great Author o...,6.0,0.803656
11,Zachary Taylor,With the strict observance of this rule and th...,216.0,0.649095
42,Barack Obama,"In this effort, we draw inspiration from our f...",5190.0,0.625251
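
The final ranking step only needs the best match per president and then the three highest scores overall. As a minimal sketch, assuming the per-president results are collected as plain tuples (the names and scores below are placeholders, not real output), the selection can be done once after the loop with `heapq.nlargest` instead of re-sorting on every iteration.

```python
import heapq

# Hypothetical per-president best matches: (president, sentence, score).
best_per_president = [
    ("President A", "sentence a", 0.54),
    ("President B", "sentence b", 0.80),
    ("President C", "sentence c", 0.65),
    ("President D", "sentence d", 0.58),
]


def top_k(matches, k=3):
    """Return the k matches with the highest similarity score, best first."""
    return heapq.nlargest(k, matches, key=lambda match: match[2])


top_three = top_k(best_per_president)
print([president for president, _, _ in top_three])
```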


### Trial #2 - Used short phrase

In [205]:
#terms not loop dependent
query = preprocess('I believe I can fly.')
similarity_index = WordEmbeddingSimilarityIndex(glove)
top_matches = pd.DataFrame()
for i in range(0,corpus_df.shape[0]):
    speeches=(corpus_df.transcripts[i])
    sentences=sent_tokenize(speeches)
    president =corpus_df.President[i]
    # STEP 2 -TEXT NORMALIZATION
    # Preprocess the documents, including the query string
    corpus = [preprocess(sentence) for sentence in sentences] 
    # STEP 3 - FEATURE ENGINEERING
    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus+[query])
    tfidf = TfidfModel(dictionary=dictionary)
    # Create the term similarity matrix.  
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    # STEP 4 - SOFT COSINE SIMILARITY
    # Compute Soft Cosine Measure between the query and the documents.
    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
                tfidf[[dictionary.doc2bow(sentence) for sentence in corpus]],
                similarity_matrix)
    doc_similarity_scores = index[query_tf]
    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)
    ranked=sorted_indexes[::-1]
    # runtime warning with index[0] only
    idx = ranked[0]
    # The most relevant sentence from each president
    df = {'President': president, 'Matching Sentence': sentences[idx], 'Index': idx, 'Similarity Score': (doc_similarity_scores[idx])}
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    top_matches = pd.concat([top_matches, pd.DataFrame([df])], ignore_index=True)
    top=top_matches.sort_values(by=['Similarity Score'], ascending=False)

top.head(3)

  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))


Unnamed: 0,President,Matching Sentence,Index,Similarity Score
19,James A. Garfield,Fellow Citizens: We stand to-day upon an emine...,0.0,1.0
8,William Harrison,The outline of principles to govern and measur...,5.0,1.0
32,Dwight D. Eisenhower,I would be prepared to submit to the Congress ...,387.0,1.0


In [210]:
top.head(10)

Unnamed: 0,President,Matching Sentence,Index,Similarity Score
19,James A. Garfield,Fellow Citizens: We stand to-day upon an emine...,0.0,1.0
8,William Harrison,The outline of principles to govern and measur...,5.0,1.0
32,Dwight D. Eisenhower,I would be prepared to submit to the Congress ...,387.0,1.0
3,James Madison,Having forborne to declare war until to other ...,547.0,1.0
11,Zachary Taylor,We could in no event be indifferent to their p...,129.0,1.0
9,John Tyler,This union in sentiment and feeling can only b...,1077.0,1.0
1,John Adams,"I have therefore thought fit to recommend, and...",172.0,0.97385
0,George Washington,But you will perceive from the information con...,36.0,0.972275
6,Andrew Jackson,But if from the original delay in asking for a...,2727.0,0.951481
12,Millard Fillmore,These papers will acquaint you with the ground...,440.0,0.943419


### Trial 3 a - Added stop words and repeated short phrase query
### Trial 3 b - Added stop words and more descriptive query 

In [211]:
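from nltk.corpus import stopwords as nltk_stopwords
# Use a set for fast membership tests; requires the NLTK 'stopwords' corpus to be downloaded
stopwords = set(nltk_stopwords.words('english'))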

# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    #doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [218]:
#terms not loop dependent
query = preprocess('I believe that the young people in America are our future.')
similarity_index = WordEmbeddingSimilarityIndex(glove)
top_matches = pd.DataFrame()
for i in range(0,corpus_df.shape[0]):
    speeches=(corpus_df.transcripts[i])
    sentences=sent_tokenize(speeches)
    president =corpus_df.President[i]
    # STEP 2 -TEXT NORMALIZATION
    # Preprocess the documents, including the query string
    corpus = [preprocess(sentence) for sentence in sentences] 
    # STEP 3 - FEATURE ENGINEERING
    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus+[query])
    tfidf = TfidfModel(dictionary=dictionary)
    # Create the term similarity matrix.  
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    # STEP 4 - SOFT COSINE SIMILARITY
    # Compute Soft Cosine Measure between the query and the documents.
    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
                tfidf[[dictionary.doc2bow(sentence) for sentence in corpus]],
                similarity_matrix)
    doc_similarity_scores = index[query_tf]
    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)
    ranked=sorted_indexes[::-1]
    # runtime warning with index[0] only
    idx = ranked[0]
    # The most relevant sentence from each president
    df = {'President': president, 'Matching Sentence': sentences[idx], 'Index': idx, 'Similarity Score': (doc_similarity_scores[idx])}
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    top_matches = pd.concat([top_matches, pd.DataFrame([df])], ignore_index=True)
    top=top_matches.sort_values(by=['Similarity Score'], ascending=False)

top.head(3)

  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))


Unnamed: 0,President,Matching Sentence,Index,Similarity Score
1,John Adams,On this subject it might become me better to b...,35.0,1.0
42,Barack Obama,In order to save our children from a future of...,769.0,0.99988
34,Lyndon B. Johnson,It does say the only limit to a man's hope for...,2383.0,0.982136


In [227]:
# Character lengths of the president name and matching sentence for each of the top 3
len(top.iloc[0][0]), len(top.iloc[0][1]), len(top.iloc[1][0]), len(top.iloc[1][1]), len(top.iloc[2][0]), len(top.iloc[2][1])

(10, 4345, 12, 127, 17, 119)
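
Note that `len` applied to a string counts characters, not words, so the figures above are character lengths. A minimal sketch of reporting both measures is shown below; the `length_stats` helper and the sample sentence are assumptions for illustration, and splitting on whitespace is only a rough word count.

```python
def length_stats(sentence):
    """Return (character count, whitespace-delimited word count)."""
    return len(sentence), len(sentence.split())


sample = "We the People of the United States"
n_chars, n_words = length_stats(sample)
print(n_chars, n_words)
```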

<a name =results>
    
# Results: What would a President say?</a>

[Go to top](#top)

### Result using existing phrase in corpora as query.
- The top matching sentence was the same sentence the excerpt came from. 
- This suggests that the algorithm works at least at the lexical level.

In [202]:
print('Instead of saying: "',query_string,'" ...')

for i in range(0,3):
    print(str('\n' + 'President ' + top.iloc[i][0] + ' would say ' + '"' + top.iloc[i][1] + '"' + '\t' +'Soft Cosine Similarity'), top.iloc[i][3])

Instead of saying: " In tendering this homage to the Great Author of every public and private good " ...

President George Washington would say "In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own; nor those of my fellow citizens at large, less than either."	Soft Cosine Similarity 0.8036555

President Zachary Taylor would say "With the strict observance of this rule and the other injunctions of the Constitution, with a sedulous inculcation of that respect and love for the Union of the States which our fathers cherished and enjoined upon their children, and with the aid of that overruling Providence which has so long and so kindly guarded our liberties and institutions, we may reasonably expect to transmit them, with their innumerable blessings, to the remotest posterity."	Soft Cosine Similarity 0.6490951

President Barack Obama would say "In this effort, we draw inspiration from our fellow

### Result using short phrase - "I believe I can fly"
- The algorithm returned at least 6 sentences with a similarity score of 1.0 that, upon review, did not make sense as matches.
- Way forward: Add more stop words
- Note: The quality of the input from the user affects the ability to detect semantics.

In [207]:
print('Instead of saying: "I believe I can fly" ...')

for i in range(0,3):
    print(str('\n' + 'President ' + top.iloc[i][0] + ' would say ' + '"' + top.iloc[i][1] + '"' + '\t' +'Soft Cosine Similarity'), top.iloc[i][3])

Instead of saying: "I believe I can fly" ...

President James A. Garfield would say "Fellow Citizens: We stand to-day upon an eminence which overlooks a hundred years of national life- a century crowded with perils, but crowned with the triumphs of liberty and law."	Soft Cosine Similarity 1.0

President William Harrison would say "The outline of principles to govern and measures to be adopted by an Administration not yet begun will soon be exchanged for immutable history, and I shall stand either exonerated by my countrymen or classed with the mass of those who promised that they might deceive and flattered with the intention to betray."	Soft Cosine Similarity 1.0

President Dwight D. Eisenhower would say "I would be prepared to submit to the Congress of the United States, and with every expectation of approval, any such plan that would: First, encourage world wide investigation into the most effective peacetime uses of fissionable material, and with the certainty that they had all the

In [208]:
#top.head(10)

Unnamed: 0,President,Matching Sentence,Index,Similarity Score
19,James A. Garfield,Fellow Citizens: We stand to-day upon an emine...,0.0,1.0
8,William Harrison,The outline of principles to govern and measur...,5.0,1.0
32,Dwight D. Eisenhower,I would be prepared to submit to the Congress ...,387.0,1.0
3,James Madison,Having forborne to declare war until to other ...,547.0,1.0
11,Zachary Taylor,We could in no event be indifferent to their p...,129.0,1.0
9,John Tyler,This union in sentiment and feeling can only b...,1077.0,1.0
1,John Adams,"I have therefore thought fit to recommend, and...",172.0,0.97385
0,George Washington,But you will perceive from the information con...,36.0,0.972275
6,Andrew Jackson,But if from the original delay in asking for a...,2727.0,0.951481
12,Millard Fillmore,These papers will acquaint you with the ground...,440.0,0.943419


### Result using stop words and "I believe I can fly"
- The stop words *may* have improved the semantics, *but* because the query is an idiom it is difficult to tell whether they improved the results.
- The similarity scores are still very high even though the semantic match is not clear.
- I only performed this trial on a small subset of the corpora, from 4 presidents.
- The quality of the query impacts the results, so we tried to improve the query in Trial 3b.

![Screenshot%202022-02-09%20at%2013.45.54.png](attachment:Screenshot%202022-02-09%20at%2013.45.54.png)

### Result using "I believe that young people are our future"
- The number 1 match was by John Adams, the 2nd President of the United States.
- His matching sentence was 4345 characters long, compared to ~120 characters for the next 2 presidents (the 44th and 36th respectively). This may be due to typos or an anomaly in the data, or perhaps he was known for being wordy, or being wordy was normal in those days.
- For the more readable sentences of only ~120 characters, from the 2nd- and 3rd-ranked presidents, it is easier to see why they matched well.
- Way forward: Inspect the data to determine if:
    - sentence length trends with time.
    - sentence length favours or doesn't favour being selected in the top 3.
    - the performance of the algorithm is independent of the era of the corpus.
- Note: The quality of the input from the user affects the ability to detect semantics.
    - Consider criteria for the query,
    - e.g. at least 2 verbs, 2 nouns, 2 adjectives/adverbs.

In [226]:
print('Instead of saying: "I believe that young people are our future" ...')

for i in range(0,3):
    print(str('\n' + 'President ' + top.iloc[i][0] + ' would say ' + '"' + top.iloc[i][1] + '"' + '\t' +'Soft Cosine Similarity'), top.iloc[i][3])

Instead of saying: "I believe that young people are our future" ...

President John Adams would say "On this subject it might become me better to be silent or to speak with diffidence; but as something may be expected, the occasion, I hope, will be admitted as an apology if I venture to say that if a preference, upon principle, of a free republican government, formed upon long and serious reflection, after a diligent and impartial inquiry after truth; if an attachment to the Constitution of the United States, and a conscientious determination to support it until it shall be altered by the judgments and wishes of the people, expressed in the mode prescribed in it; if a respectful attention to the constitutions of the individual States and a constant caution and delicacy toward the State governments; if an equal and impartial regard to the rights, interest, honor, and happiness of all the States in the Union, without preference or regard to a northern or southern, an eastern or western, 

<a name =limitations>
    
# Limitations & Observations</a>

[Go to top](#top)

1. I used the pretrained "glove-wiki-gigaword-50" vectors. GloVe is an unsupervised learning algorithm for obtaining word vector representations.
    - There is a newer release, as well as versions of GloVe trained on different source corpora.
    - Perhaps there is a corpus better suited to presidential speeches, which are more formally worded.
2. The data dates back to 1789, and language use has changed since then. This may impact semantic consistency across the corpora.
    - The results may be biased toward corpora from more modern time periods.
    - Or the results may be biased toward longer corpora.
3. Preprocessing should be reviewed so that the regexes and tokenization can be optimized.
    - Increasing the number of stop words may have helped the performance (the ability to detect meaning).
4. This type of semantic search is limited by the quality of the user input. 
    - The more meaning conveyed in the query, the better the model can perform.
5. Consider comparing the performance against Okapi BM25 Ranking or similar model that is tailored to Information Retrieval.
6. Optimize code:
    - I get a runtime warning when I take only the first entry (index[0]) of the sorted similarity indices.
7. Improve how I get the data from the user. There are no checks or error messages.
8. I need to investigate how to prove the validity/performance of Text Similarity Models.
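
For the comparison suggested in point 5, a baseline could be sketched as below. This is a small pure-Python version of Okapi BM25 using a non-negative IDF variant; the parameter defaults (`k1=1.5`, `b=0.75`), the function name and the toy corpus are assumptions for illustration, and a maintained library implementation would be preferable in practice.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in a non-empty corpus against the tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    ["liberty", "and", "union"],
    ["war", "and", "peace"],
    ["trade", "tariff", "revenue"],
]
scores = bm25_scores(["liberty", "union"], toy_corpus)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, scores)
```

Because BM25 is purely lexical, it only rewards exact term matches; comparing its ranking with the Soft Cosine ranking on the same queries would show how much the word embeddings contribute.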

<a name =references>
    
# References</a>

[Go to top](#top)

1. Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data - Sarkar, Dipanjan
2. https://towardsdatascience.com/how-to-rank-text-content-by-semantic-similarity-4d2419a84c32
3. GloVe https://nlp.stanford.edu/projects/glove/
