# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-HES-ALM/master/main/ds-masters/content/images/hes-logo.png"> CSCI E-82: Advanced ML, Data Mining and AI
<br/>


**Harvard Extension School - Fall 2024**<br/>

**Homework 5**: <br/>

**Professor**: Dr. Peter V. Henstock<br/>

**Team Name**: The Spice Analysts<br/>
**Students**: Daniel More Torres and Michael Assmus<br/>

**Due Date**: 25/Nov/2024, 8:00pm EST<br/>

----

<hr style="height:2pt">

In [1]:
# RUN THIS CELL TO GET CSS Styles for CSCI E-82
import requests
from IPython.display import HTML
styles = requests.get(
    "https://raw.githubusercontent.com/Harvard-HES-ALM/master/main/master/content/styles/"
    "csci-e-82.css"
).text
# Render the downloaded stylesheet so it applies to the notebook
HTML(styles)

**Load all libraries needed for our homework**

In [2]:
import string
import re
import pandas as pd # pip install pandas
import numpy as np
import matplotlib.pyplot as plt # pip install matplotlib
import seaborn as sns # pip install seaborn
import pickle
import time

from sklearn import manifold

import nltk # pip install nltk
from nltk.corpus import stopwords 
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')


from nltk.stem.porter import PorterStemmer

import spacy # pip install spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # pip install scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
import gensim # pip install gensim
from gensim.models import word2vec  # included with gensim, no separate install needed

# from wordcloud import WordCloud  # pip install wordcloud
# from textblob import TextBlob  # Sentiment Analysis - pip install textblob
# from sentence_transformers import SentenceTransformer, util # pip install sentence_transformers

In [3]:
# Record start time
start_time = time.perf_counter()

In [26]:
# Load debate dataset
debate_df = pd.read_csv('data/debates.csv', encoding='latin1')
display(debate_df.head())
len(debate_df['event_id'].unique())

Unnamed: 0,event_id,speaker_names_raw,rawtext,Order,Year,Party,PartyWin,Round,AgeDem,AgeRep,AgeDiff,InflationRate,GDPgrowth
0,1960_KennedyNixon_Rd1,KENNEDY,"Mr. Smith, Mr. Nixon. In the election of 1860...",1,1960,democrat,democrat,1,43,47,4,1.46,
1,1960_KennedyNixon_Rd1,NIXON,"Mr. Smith, Senator Kennedy. The things that S...",1,1960,republican,democrat,1,43,47,4,1.46,
2,1960_KennedyNixon_Rd1,KENNEDY,"Well, the Vice President and I came to the Co...",1,1960,democrat,democrat,1,43,47,4,1.46,
3,1960_KennedyNixon_Rd1,NIXON,I have no comment.,1,1960,republican,democrat,1,43,47,4,1.46,
4,1960_KennedyNixon_Rd1,NIXON,It would be rather difficult to cover them in...,1,1960,republican,democrat,1,43,47,4,1.46,


31

In [21]:
# Check debate year range
debate_df['Year'].min(), debate_df['Year'].max()

(1960, 2020)

In [29]:
# Inspect example debate text
debate_df.loc[396:398,'rawtext']

396    And I havenÕt got time to answer with regard t...
397                                                  NaN
398     Mr. President, if I heard you correctly, you ...
Name: rawtext, dtype: object

In [31]:
from collections import Counter

# Concatenate all text into a single string
all_text = ' '.join(debate_df.loc[debate_df['rawtext'].notna(), 'rawtext'])

# Extract special characters (non-alphanumeric and non-whitespace)
special_chars = re.findall(r'[^\w\s]', all_text)

# Count frequencies
freq = Counter(special_chars)
freq

Counter({'.': 24886,
         ',': 21801,
         '-': 1807,
         '?': 924,
         '$': 690,
         ';': 458,
         ':': 227,
         '%': 109,
         '[': 62,
         ']': 62,
         '/': 54,
         '(': 37,
         ')': 37,
         "'": 24,
         '\x89': 19,
         '\\': 12,
         '&': 5,
         '´': 4,
         '!': 3,
         '`': 2,
         '\x9d': 2,
         '+': 2,
         '>': 1})
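The counts above contain none of the accented letters (such as `Õ` or `Ð`) that are visible in the raw text. In Python 3, `\w` is Unicode-aware for `str` patterns, so those characters count as word characters and `[^\w\s]` skips them. A minimal sketch of a complementary check that counts every non-ASCII character instead; the helper name `count_non_ascii` and the sample string are illustrative and not part of the original notebook:

```python
import re
from collections import Counter

def count_non_ascii(text):
    """Count every character outside the 7-bit ASCII range."""
    return Counter(re.findall(r'[^\x00-\x7F]', text))

# Illustrative sample that mimics the mis-decoded debate text (not real data)
sample = "And I havenÕt got time Ð not tonight Ð to answer."

# The original pattern only sees the period in this sample
print(Counter(re.findall(r'[^\w\s]', sample)))
# The non-ASCII pattern surfaces the suspicious letters
print(count_non_ascii(sample))
```

On the real data this would presumably be called as `count_non_ascii(all_text)`.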

In [53]:
# Show each 'Ð' with 47 characters of context before it and 48 after it
re.findall(r'.{47}Ð.{48}', all_text)

['e have over nine billion dollars worth of food Ð some of it rotting Ð even though there is a hun',
 ' not satisfied when I see men like Jimmy Hoffa Ð in charge of the largest union in the United St',
 's at the present rate of hydropower production Ð and that is the hallmark of an industrialized s',
 'constitutional rights. If a Negro baby is born Ð and this is true also of Puerto Ricans and Mexi',
 'as much chance to own a house. He has about uh Ð four times as much chance that heÕll be out of ',
 'eedom be maintained under the most severe tack Ð attack it has ever known? I think it can be. An',
 ' records in the areas that Senator Kennedy has Ð has discussed tonight, I think we find that Ame',
 'ill see that our medical care for the aged are Ð is Ð are much Ð is much better handled than it ',
 'vocates. I could give better examples, but for Ð for whatever it is, whether itÕs in the field o',
 'time that he has, so that our experience in uh Ð government is comparable. Secondly, I th
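A fixed-width pattern like the one above skips occurrences that sit close to the start or end of the text, and it consumes any second `Ð` that falls inside an earlier match. A minimal sketch of an alternative that reports every occurrence with a clipped window; the helper name `contexts` and its `width` parameter are hypothetical:

```python
import re

def contexts(text, target, width=47):
    """Return up to `width` characters on each side of every `target` occurrence."""
    windows = []
    for match in re.finditer(re.escape(target), text):
        # Clip the window at the text boundaries instead of dropping the match
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        windows.append(text[start:end])
    return windows

# Illustrative sample taken from the first match shown above
sample = "food Ð some of it rotting Ð even though"
for window in contexts(sample, 'Ð', width=10):
    print(repr(window))
```

Because each window is sliced independently, neighbouring occurrences produce overlapping windows rather than being merged into one match.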

Special handling for special characters:
- \+ needs no handling. It is only used in "AOC+3", which is a group name.
- \> is part of a "> Transfer interrupted!" message. We should probably drop that segment.
- \x89Û\x9d seems to be a closing quotation mark. Its opening counterpart is \x89ÛÏ.
- \ \1/2\ \ and similar sequences appear to be genuine fractions. That one is "1/2", and other fractions appear in the same form.
- Ñ is most likely a dash (-).
- Õ is an apostrophe (').
- É might be an em-dash or an ellipsis.
- Ð is probably a dash rather than an ellipsis: in the matches above it separates clauses and marks restarts ("has Ð has discussed", "are Ð is Ð are much Ð is").
- There are a number of bracketed annotations such as [laughter], [crosstalk], and [applause], and possibly others.
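The notes above could be turned into a small repair step. The sketch below is a minimal version under stated assumptions: the replacement table is hypothetical and should be verified against the raw transcripts, and `Ð` is mapped to a dash because the matches shown earlier place it between clauses. These particular characters would also be consistent with Mac OS Roman bytes (0xD5 apostrophe, 0xD0 and 0xD1 dashes, 0xC9 ellipsis) that were decoded as Latin-1, so reading the file with `encoding='mac_roman'` may be worth trying as well. The names `CHAR_FIXES`, `ANNOTATION`, and `repair_text` are illustrative.

```python
import re

# Assumed replacements (hypothetical; verify against the raw transcripts first)
CHAR_FIXES = {
    'Õ': "'",    # apostrophe, as in "havenÕt"
    'Ð': '-',    # dash between clauses
    'Ñ': '-',    # dash
    'É': '...',  # ellipsis
}

# Any bracketed span, e.g. [laughter], [crosstalk], [applause]
ANNOTATION = re.compile(r'\[[^\]]*\]')

def repair_text(text):
    """Replace the mis-decoded characters and strip bracketed annotations."""
    for bad, good in CHAR_FIXES.items():
        text = text.replace(bad, good)
    text = ANNOTATION.sub(' ', text)
    # Collapse the whitespace left behind by removed annotations
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative sample built from a match shown above, plus an annotation
sample = "He has about uh Ð four times as much chance that heÕll be out of work. [laughter]"
print(repair_text(sample))
```

Note that the annotation pattern removes every bracketed span, so it would also drop any bracketed text that is part of the actual transcript.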

In [6]:
# Load speeches dataset
speech_df = pd.read_json('speeches.json')
display(speech_df.head())
len(speech_df['doc_name'].unique())

Unnamed: 0,doc_name,date,transcript,president,title
0,january-22-1807-special-message-congress-burr-...,1807-01-22,TO THE SENATE AND HOUSE OF REPRESENTATIVES OF ...,Thomas Jefferson,"January 22, 1807: Special Message to Congress ..."
1,may-25-1813-message-special-congressional-sess...,1813-05-25,Fellow-Citizens of the Senate and of the House...,James Madison,"May 25, 1813: Message on the Special Congressi..."
2,april-2-1917-address-congress-requesting-decla...,1917-04-02,I have called the Congress into extraordinary ...,Woodrow Wilson,"April 2, 1917: Address to Congress Requesting ..."
3,april-10-1975-address-us-foreign-policy,1975-04-10,"Mr. Speaker, Mr. President, distinguished gues...",Gerald Ford,"April 10, 1975: Address on U.S. Foreign Policy"
4,july-6-1848-message-regarding-treaty-guadalupe...,1848-07-06,To the House of Representatives of the United ...,James K. Polk,"July 6, 1848: Message Regarding the Treaty of ..."


1057

In [7]:
# Check speeches date range
speech_df['date'].min(), speech_df['date'].max()

('1789-04-30', '2024-09-24T10:12:00-04:00')
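The range above mixes two string formats: date-only values and a full timestamp with a UTC offset. Since the column holds strings, `min()` and `max()` compare them lexicographically. A minimal sketch of normalizing both formats to timezone-aware datetimes with the standard library; the helper name `parse_speech_date` is hypothetical, and treating date-only entries as midnight UTC is an assumption:

```python
from datetime import datetime, timezone

def parse_speech_date(value):
    """Parse an ISO 8601 date or datetime string into a UTC datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: date-only entries are treated as midnight UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# The two extremes reported by the cell above
dates = ['1789-04-30', '2024-09-24T10:12:00-04:00']
parsed = [parse_speech_date(d) for d in dates]
print(min(parsed), max(parsed))
```

In the notebook this could be applied with `speech_df['date'].apply(parse_speech_date)` before taking the minimum and maximum.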

In [8]:
# Make functions for text cleaning
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'stopwords', 'punkt', and 'wordnet' resources (see the commented downloads above)
stop_words = set(stopwords.words('english'))

def lower_case(text):
    """Convert text to lowercase."""
    return text.lower()

def remove_punctuation(text):
    """Delete every character except lowercase letters and whitespace.

    Expects lowercased input; punctuation, digits, and uppercase letters are removed.
    """
    return re.sub(r'[^a-z\s]', '', text)

def remove_stopwords(text):
    """Tokenize text and drop English stopwords."""
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lemmatize each token with WordNet (default part of speech: noun)."""
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_tokens)

In [9]:
# Clean speech dataset
speech_df['cleaned_text'] = speech_df['transcript'].apply(lower_case)
speech_df['cleaned_text'] = speech_df['cleaned_text'].apply(remove_punctuation)
speech_df['cleaned_text'] = speech_df['cleaned_text'].apply(remove_stopwords)
speech_df['cleaned_text'] = speech_df['cleaned_text'].apply(lemmatize_text)

In [10]:
# Check cleaned dataset
speech_df.head()

Unnamed: 0,doc_name,date,transcript,president,title,cleaned_text
0,january-22-1807-special-message-congress-burr-...,1807-01-22,TO THE SENATE AND HOUSE OF REPRESENTATIVES OF ...,Thomas Jefferson,"January 22, 1807: Special Message to Congress ...",senate house representative united statesagree...
1,may-25-1813-message-special-congressional-sess...,1813-05-25,Fellow-Citizens of the Senate and of the House...,James Madison,"May 25, 1813: Message on the Special Congressi...",fellowcitizens senate house representative ear...
2,april-2-1917-address-congress-requesting-decla...,1917-04-02,I have called the Congress into extraordinary ...,Woodrow Wilson,"April 2, 1917: Address to Congress Requesting ...",called congress extraordinary session serious ...
3,april-10-1975-address-us-foreign-policy,1975-04-10,"Mr. Speaker, Mr. President, distinguished gues...",Gerald Ford,"April 10, 1975: Address on U.S. Foreign Policy",mr speaker mr president distinguished guest go...
4,july-6-1848-message-regarding-treaty-guadalupe...,1848-07-06,To the House of Representatives of the United ...,James K. Polk,"July 6, 1848: Message Regarding the Treaty of ...",house representative united state answer resol...
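The `cleaned_text` column above contains merged tokens such as `fellowcitizens` and `statesagree...`. These follow from replacing punctuation with an empty string, which joins the words on either side of a hyphen or other separator. A minimal sketch of a variant that keeps word boundaries, assuming that splitting such words is preferred; the function name `remove_punctuation_keep_boundaries` is hypothetical:

```python
import re

def remove_punctuation_keep_boundaries(text):
    """Replace non-letter characters with a space, then collapse repeated whitespace."""
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative lowercased sample based on the second row above
sample = 'fellow-citizens of the senate'

# Behaviour of the original pattern: the hyphenated words are merged
print(re.sub(r'[^a-z\s]', '', sample))
# Variant: the hyphen becomes a word boundary
print(remove_punctuation_keep_boundaries(sample))
```

The trade-off is that contractions are split as well (for example `don't` becomes `don t`), so the choice depends on which tokens matter more for the downstream vectorizers.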


<a id='#Hours-Invested'></a>

#### Hours Invested

164 hours

<a id='#Time-for-notebook-to-run'></a>

#### Time for notebook to run


In [11]:
print(f"It took {(time.perf_counter() - start_time)/60:.2f} minutes for this notebook to run ")

It took 0.58 minutes for this notebook to run 
