## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import random
import re
import sqlite3
import string
from collections import Counter, defaultdict
from string import punctuation

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Download the tokenizer models and stop word list used below
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
convention_db = sqlite3.connect("/content/drive/MyDrive/ColabNotebooks/Assignment 4 datasets/2020_Conventions.db")
cur = convention_db.cursor()

In [5]:
# Execute a query to list all tables in the database
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")

# Fetch all table names from results
tables = cur.fetchall()

# Print the table names
if tables:
    for table in tables:
        print(table[0])
else:
    print("No tables found in the database.")


conventions


In [7]:
# Execute a query to fetch table schema
cur.execute("PRAGMA table_info(conventions)")

# Fetch and print the schema information
schema_info = cur.fetchall()
for column in schema_info:
    print(column)


(0, 'party', 'TEXT', 0, None, 0)
(1, 'night', 'INTEGER', 0, None, 0)
(2, 'speaker', 'TEXT', 0, None, 0)
(3, 'speaker_count', 'INTEGER', 0, None, 0)
(4, 'time', 'TEXT', 0, None, 0)
(5, 'text', 'TEXT', 0, None, 0)
(6, 'text_len', 'TEXT', 0, None, 0)
(7, 'file', 'TEXT', 0, None, 0)


In [8]:
# Execute a query to retrieve a few rows from the table
cur.execute("SELECT * FROM conventions LIMIT 5")

# Fetch and print the rows
rows = cur.fetchall()
for row in rows:
    print(row)


('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text
for each party and prepare it for use in Naive Bayes.  

In [10]:
stop_words = set(stopwords.words('english'))

def remove_punctuation(text):
    # Remove punctuation
    return "".join([ch for ch in text if ch not in string.punctuation])

# Function to preprocess
def preprocess_text(text):
    # Site-boilerplate patterns; defined here but not applied anywhere below
    unwanted_patterns = [
        r'skip to content\.',
        r'The company.*?today\.',
    ]
    # Convert text to lowercase
    text = text.lower()
    # URL-stripped copy; it is not assigned back to `text`, so URLs stay in the output
    text_without_urls = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove punctuation
    text = remove_punctuation(text)
    # Split on whitespace to get tokens
    tokens = text.split()
    # Join tokens back into a single-spaced string
    processed_text = ' '.join(tokens)
    return processed_text

# Execute a query to retrieve data from the convention table
cur.execute("SELECT text, party FROM conventions")
rows = cur.fetchall()

# Fill convention_data list with processed data
convention_data = []
for row in rows:
    text = preprocess_text(row[0])
    party = row[1]
    convention_data.append((text, party))  # Append processed text and party as a tuple
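
In `preprocess_text` above, the URL-stripped string is stored in `text_without_urls` and never used, so URLs survive punctuation removal as tokens such as `httptco3cwjiwysnq`. A minimal sketch of a variant that assigns the result back is given here. The name `preprocess_text_strict` and the `URL_PATTERN` regex are assumptions for illustration, and swapping this in would change the feature counts and outputs recorded in this notebook.

```python
import re
import string

# Assumed pattern: anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def preprocess_text_strict(text):
    """Hypothetical variant of preprocess_text that actually drops URLs."""
    text = text.lower()
    # Assign the result back so the URL removal takes effect
    text = URL_PATTERN.sub("", text)
    # Drop ASCII punctuation, then collapse runs of whitespace
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

print(preprocess_text_strict('Flawed Funding Bill" http://t.co/3CwjIWYsNq'))
# prints: flawed funding bill
```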



Let's look at some random entries and see if they look right.

In [11]:
random.choices(convention_data,k=10)

[('what a beautiful story okay now let’s talk about another important issue the climate crisis this is an issue that has been on joe biden’s mind for decades in 1986 while working as a senator joe biden introduced one of the first climate bills in congress and he’s continued to listen to and work with climate advocates and activists putting forward a plan to address climate change and create a clean energy economy',
  'Democratic'),
 ('thank you senator as the 46th mayor of a city that’s more than ready for our 46th president i’m proud to cast minnesota’s 31 votes for bernie sanders and 60 votes for our next us president joe biden',
  'Democratic'),
 ('joe biden is a decent man with a long history of public service to america',
  'Democratic'),
 ('california', 'Democratic'),
 ('they hoped to raise 15 billion one person who did not want to talk about it joe biden',
  'Republican'),
 ('my dad was a healthy 65 year old his only preexisting condition was trusting donald trump and for that 



If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [46]:
word_cutoff = 5
tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2514 as features in the model.
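
The cutoff cell uses a strict `>` comparison, so a word that occurs exactly `word_cutoff` times is left out. A small sketch of the same selection with `collections.Counter` on toy data makes that boundary visible. The function name `select_feature_words` and the toy texts are made up for illustration.

```python
from collections import Counter

def select_feature_words(texts, word_cutoff):
    """Keep words whose total count is strictly greater than word_cutoff."""
    counts = Counter(word for text in texts for word in text.split())
    return {word for word, count in counts.items() if count > word_cutoff}

# Toy corpus: jobs appears 4 times, climate 2 times, china once
toy_texts = ["jobs jobs jobs", "jobs climate climate", "china"]

print(sorted(select_feature_words(toy_texts, 1)))  # ['climate', 'jobs']
print(sorted(select_feature_words(toy_texts, 2)))  # ['jobs']
```

With a cutoff of 2, `climate` is dropped even though it occurs twice, which is the behavior the notebook's loop has at `word_cutoff = 5`.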


In [17]:
def conv_features(text, feature_words):
    # Preprocess once and collect the distinct tokens for fast membership checks
    tokens = set(preprocess_text(text).split())
    result_dict = {}
    for word in feature_words:
        if word.lower() in tokens:
            result_dict[word] = True
            print(f"Word: {repr(word)} in feature_words is in processed_text.")
    return result_dict

In [18]:
assert(len(feature_words)>0)

In [19]:
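features = conv_features("donald is the president", feature_words)
# Stop words are not removed, so 'is' and 'the' also appear; check only the content words
assert {'donald', 'president'} <= set(features)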

Word: 'the' in feature_words is in processed_text.
Word: 'donald' in feature_words is in processed_text.
Word: 'is' in feature_words is in processed_text.
Word: 'president' in feature_words is in processed_text.


In [20]:
features = conv_features("people are american in america", feature_words)
# Stop words such as 'are' and 'in' also appear; check only the content words
assert {'america', 'american', 'people'} <= set(features)

Word: 'in' in feature_words is in processed_text.
Word: 'america' in feature_words is in processed_text.
Word: 'american' in feature_words is in processed_text.
Word: 'people' in feature_words is in processed_text.
Word: 'are' in feature_words is in processed_text.


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [47]:
featuresets = [(conv_features(text, feature_words), party) for (text, party) in convention_data]


In [22]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [23]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.444
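
An accuracy figure like 0.444 is easier to interpret next to a baseline, such as the share of the most common label in the test set (what a classifier would score by always guessing that label). The sketch below computes that share for any list of `(features, label)` pairs. The function name `majority_baseline` and the toy data are assumptions for illustration, not values from this notebook.

```python
from collections import Counter

def majority_baseline(labeled_examples):
    """Accuracy obtained by always predicting the most common label."""
    label_counts = Counter(label for _, label in labeled_examples)
    most_common_count = label_counts.most_common(1)[0][1]
    return most_common_count / len(labeled_examples)

# Toy test set with made-up labels; the feature dicts are left empty
toy_test_set = [
    ({}, "Democratic"),
    ({}, "Democratic"),
    ({}, "Republican"),
    ({}, "Democratic"),
]
print(majority_baseline(toy_test_set))  # 0.75
```

In the notebook this could be called as `majority_baseline(test_set)` and compared with the classifier's accuracy.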


In [24]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

In my observation, many of the words that lean most strongly Republican carry negative or threat-related connotations, such as "destroy", "crime", "enforcement", and "isis". In the features shown, the Democratic side leads only on "votes" and "climate". What I also found very interesting is that the top feature is "china", which is about 25.8 times more likely to appear in Republican text than in Democratic text. These ratios measure how strongly a word separates the two parties, not how often it is used overall, so "china" is the most distinctive Republican word here rather than necessarily the most frequent one.
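
To make the `25.8 : 1.0` style of number concrete, here is a sketch of how a ratio of that kind can be computed from document counts. It uses add-0.5 smoothing over two outcomes (word present or absent), which I believe is close to NLTK's default estimator, but treat that as an assumption. The function name `presence_ratio` and the counts are made up for illustration.

```python
def presence_ratio(docs_with_word_a, n_docs_a, docs_with_word_b, n_docs_b, gamma=0.5):
    """Ratio P(word present | A) / P(word present | B) with add-gamma smoothing."""
    # Two outcomes per word (present, absent), so the denominator grows by 2 * gamma
    p_a = (docs_with_word_a + gamma) / (n_docs_a + 2 * gamma)
    p_b = (docs_with_word_b + gamma) / (n_docs_b + 2 * gamma)
    return p_a / p_b

# Made-up counts: the word is in 30 of 999 texts for A and 2 of 999 texts for B
ratio = presence_ratio(30, 999, 2, 999)
print(f"{ratio:.1f} : 1.0")  # 12.2 : 1.0
```

A word can have a large ratio while being rare overall, which is why the list of informative features is not a list of the most frequent words.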



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [34]:
cong_db = sqlite3.connect("/content/drive/MyDrive/ColabNotebooks/Assignment 4 datasets/congressional_data.db")
cong_cur = cong_db.cursor()

In [35]:
# Execute a query to list all tables in the database
cong_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")

# Fetch all table names from the results
tables = cong_cur.fetchall()

# Print the table names
for table in tables:
    print(table[0])

websites
candidate_data
tweets


In [36]:
cong_cur.execute("PRAGMA table_info(candidate_data)")

# Fetch and print the table schema
table_info = cong_cur.fetchall()
print("Table Schema:")
for column in table_info:
    print(column)

Table Schema:
(0, 'index', 'INTEGER', 0, None, 0)
(1, 'student', 'TEXT', 0, None, 0)
(2, 'state', 'TEXT', 0, None, 0)
(3, 'district_num', 'TEXT', 0, None, 0)
(4, 'formatted_dist_num', 'INTEGER', 0, None, 0)
(5, 'abbrev', 'TEXT', 0, None, 0)
(6, 'district', 'TEXT', 0, None, 0)
(7, 'candidate', 'TEXT', 0, None, 0)
(8, 'party', 'TEXT', 0, None, 0)
(9, 'website', 'TEXT', 0, None, 0)
(10, 'twitter_handle', 'TEXT', 0, None, 0)
(11, 'incumbent', 'TEXT', 0, None, 0)
(12, 'age', 'REAL', 0, None, 0)
(13, 'gender', 'TEXT', 0, None, 0)
(14, 'marital_status', 'TEXT', 0, None, 0)
(15, 'white_non_hispanic', 'TEXT', 0, None, 0)
(16, 'hispanic', 'TEXT', 0, None, 0)
(17, 'black', 'TEXT', 0, None, 0)
(18, 'partisian_lean_pvi', 'TEXT', 0, None, 0)
(19, 'opposed', 'TEXT', 0, None, 0)
(20, 'pct_urban', 'TEXT', 0, None, 0)
(21, 'income', 'REAL', 0, None, 0)
(22, 'region', 'TEXT', 0, None, 0)


In [37]:

# Execute the SQL query
results = cong_cur.execute('''
    SELECT DISTINCT
        cd.candidate,
        cd.party,
        tw.tweet_text
    FROM candidate_data cd
    INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
        AND cd.candidate == tw.candidate
        AND cd.district == tw.district
    WHERE cd.party IN ('Republican', 'Democratic')
        AND tw.tweet_text NOT LIKE '%RT%'
''')

# Store the results in a list
results = list(results)



In [38]:
for row in results[:10]: #See a limited amount of results, only 10 rows.
    print(row)

('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq')
('Mo Brooks', 'Republican', b'"Brooks: I Do Not Support America Raising, Training, and Arming a \nRebel Army to Fight in Syria\xe2\x80\x99s Civil War" http://t.co/f2QFErMkD4')
('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6')
('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA')
('Mo Brooks', 'Republican', b'"Rep. Mo Brooks: NDAA Amnesty Amendment \xe2\x80\x98Betrays Americans\xe2\x80\x99" via @BreitbartNews http://t.co/aflHYdUkuF')
('Mo Brooks', 'Republican', b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ')
('Mo Brooks', 'Republica

In [39]:
tweet_data = []

for twitter_name, party, tweet_text in results:
    decoded_tweet_text = tweet_text.decode('utf-8')  # tweet_text is stored as bytes
    processed_text = preprocess_text(decoded_tweet_text)  # Same cleaning as the convention text
    # Keep (candidate name, party, cleaned tweet); twitter_name holds the candidate column
    tweet_data.append((twitter_name, party, processed_text))

# Print the first 10 processed tweets
for row in tweet_data[:10]:
    twitter_name, party, processed_text = row
    print(f"Twitter Name: {twitter_name}, Party: {party}, Processed Tweet: {processed_text}")


Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: brooks joins alabama delegation in voting against flawed funding bill httptco3cwjiwysnq
Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: brooks i do not support america raising training and arming a rebel army to fight in syria’s civil war httptcof2qfermkd4
Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: brooks senate democrats allowing president to give americans’ jobs to illegals securetheborder httpstcomzteax8xs6
Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: nasa on the square event this sat 11am – 4pm stop by amp hear about the incredible work done in al05 downtownhsv httptcor9zy8wmepa
Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: rep mo brooks ndaa amnesty amendment ‘betrays americans’ via breitbartnews httptcoaflhydukuf
Twitter Name: Mo Brooks, Party: Republican, Processed Tweet: the trouble with socialism is that eventually you run out of other peoples money margar

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [40]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [41]:
def predict_party(text):
    # Punctuation is stripped during preprocessing, so the hyphenated keywords cannot match cleaned tweets
    republican_keywords = ['republican', 'gop', 'conservative', 'right-wing']
    democratic_keywords = ['democrat', 'democratic', 'liberal', 'left-wing']

    # Convert the text passed in to lowercase for case-insensitive matching
    lowercased_tweet = text.lower()

    # Check for Republican keywords (substring match, checked first)
    for keyword in republican_keywords:
        if keyword in lowercased_tweet:
            return 'Republican'

    # Check for Democratic keywords
    for keyword in democratic_keywords:
        if keyword in lowercased_tweet:
            return 'Democratic'

    # If no party keywords found, return 'Unknown'
    return 'Unknown'
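
The `predict_party` function above matches party keywords and does not call the Naive Bayes model trained in Part 1. A minimal sketch of a version that routes a tweet through a trained classifier follows. It assumes only that the classifier exposes a `classify(featureset)` method, as `nltk.NaiveBayesClassifier` does, and that the text is already preprocessed. The names `features_for`, `predict_party_nb`, and `StandInClassifier` are hypothetical; the stand-in class exists only so the sketch runs without a trained model.

```python
def features_for(text, feature_words):
    """Binary presence features for already-preprocessed text (no printing)."""
    return {word: True for word in set(text.split()) if word in feature_words}

def predict_party_nb(text, classifier, feature_words):
    """Classify text with any object exposing classify(featureset)."""
    return classifier.classify(features_for(text, feature_words))

class StandInClassifier:
    """Toy stand-in for a trained model; the rule here is made up."""
    def classify(self, featureset):
        return "Democratic" if "climate" in featureset else "Republican"

toy_words = {"climate", "china", "jobs"}
print(predict_party_nb("the climate crisis", StandInClassifier(), toy_words))
# prints: Democratic
```

In the notebook, the call would look like `predict_party_nb(processed_text, classifier, feature_words)`, which always returns one of the two party labels rather than `'Unknown'`.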


In [42]:
for twitter_name, party, processed_text in tweet_data_sample:
    estimated_party = predict_party(processed_text)


    print(f"Twitter Name: {twitter_name}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print(f"Here's our (cleaned) tweet: {processed_text}")


Twitter Name: Barbara Lee
Actual party is Democratic and our classifier says Unknown.
Here's our (cleaned) tweet: the mass shooting in las vegas was a horrific act of violence the victims and their families are in my thoughts and prayers
Twitter Name: Markwayne Mullin
Actual party is Republican and our classifier says Unknown.
Here's our (cleaned) tweet: early morning this traveltuesday leaving ok02 for dc httptcoigknci79e7
Twitter Name: Joseph Kaufman
Actual party is Republican and our classifier says Unknown.
Here's our (cleaned) tweet: the only moderates in iraq amp syria are civilians weve enemies on both sides of the conflict we should not assist either
Twitter Name: Abigail Spanberger
Actual party is Democratic and our classifier says Unknown.
Here's our (cleaned) tweet: rt natsecaction over 200 national security veterans are demanding answers on the release of the confidential national security questionna…
Twitter Name: Antonio Sabàto Jr
Actual party is Republican and our classi

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [43]:

parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))


for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, (twitter_name, party, processed_text) in enumerate(tweet_data):
    # Estimate the party affiliation using predict_party() function
    estimated_party_1 = predict_party(processed_text)

    results[party][estimated_party_1] += 1

    # Note: the check runs after scoring, so num_to_score + 2 tweets are counted
    if idx > num_to_score:
        break




In [44]:
print(dict(results))

{'Republican': defaultdict(<class 'int'>, {'Republican': 353, 'Democratic': 53, 'Unknown': 3734}), 'Democratic': defaultdict(<class 'int'>, {'Republican': 332, 'Democratic': 183, 'Unknown': 5347})}
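
The nested counts are easier to read once they are reduced to a few rates. The sketch below copies the numbers printed above into a plain dict and computes the total scored, the overall accuracy, and the share labeled `Unknown`. The function name `summarize` is an assumption for illustration.

```python
# Counts copied from the printed results: confusion[actual][predicted]
confusion = {
    "Republican": {"Republican": 353, "Democratic": 53, "Unknown": 3734},
    "Democratic": {"Republican": 332, "Democratic": 183, "Unknown": 5347},
}

def summarize(confusion):
    """Return (total scored, overall accuracy, share predicted Unknown)."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    unknown = sum(row.get("Unknown", 0) for row in confusion.values())
    return total, correct / total, unknown / total

total, accuracy, unknown_share = summarize(confusion)
print(f"scored={total} accuracy={accuracy:.3f} unknown={unknown_share:.3f}")
# prints: scored=10002 accuracy=0.054 unknown=0.908
```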


### Reflections

Based on the printed results, the keyword-based `predict_party` function does assign Republican and Democratic labels, but only to a small share of tweets. Of the 10,002 tweets scored, 9,081 (about 91%) came back Unknown and only 536 (about 5%) were labeled with the correct party. Among the 921 tweets that did receive a party label, about 58% were correct, and Democratic tweets were labeled Republican (332) more often than Democratic (183). The high Unknown count could have a couple of causes:

1. Candidates rarely name their own party in the text of a tweet, which is the text the function searches.
2. The keyword lists in `predict_party` are short, so they likely miss the other ways candidates signal their party affiliation on Twitter.