<a href="https://colab.research.google.com/github/dsklnr/ADS_509_Module4/blob/main/Political_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [22]:
# Additional imports
import os
from nltk.corpus import stopwords
import re

In [19]:
nltk.download('stopwords')

# Define stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [3]:
# Create a google drive connection
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [39]:
convetions_db_path = "/content/drive/MyDrive/ADS 509/Module4/2020_Conventions.db"
cong_db_path = "/content/drive/MyDrive/ADS 509/Module4/congressional_data.db"
os.listdir("/content/drive/MyDrive/ADS 509/Module4")

['2020_Conventions.db', 'congressional_data.db']

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text
for each party and prepare it for use in Naive Bayes.

In [23]:
# Function to clean and tokenize text
def clean_and_tokenize(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize the text
    tokens = text.split()

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Return the cleaned text as a single string
    return ' '.join(tokens)

In [24]:
# Create a conventions db connection
convention_db = sqlite3.connect(convetions_db_path)
convention_cur = convention_db.cursor()

# Create a list to hold cleaned text and party data
convention_data = []

# Query data
query_results = convention_cur.execute(
                            '''
                            SELECT text, party FROM conventions
                            ''')

# Add clean text to the convention_data list
for row in query_results :
    text, party = row
    cleaned_text = clean_and_tokenize(text)
    convention_data.append([cleaned_text, party])

# Close the connection
convention_db.close()

# Display the first 5 items
for data in convention_data[:5]:
    print(data)

['skip content company careers press freelancers blog services transcription captions foreign subtitles translation freelancers contact login return transcript library home transcript categories transcripts election transcripts classic speech transcripts congressional testimony hearing transcripts debate transcripts donald trump transcripts entertainment transcripts financial transcripts interview transcripts political transcripts press conference transcripts speech transcripts sports transcripts technology transcripts aug democratic national convention dnc night transcript rev blog transcripts election transcripts democratic national convention dnc night transcript night democratic national convention dnc august read full transcript event transcribe content try rev free save time transcribing captioning subtitling', 'Democratic']
['im calling full session th quadrennial national convention democratic party order welcome final session historic memorable convention weve called th quadre

Let's look at some random entries and see if they look right.

In [25]:
random.choices(convention_data,k=5)

[['yes', 'Republican'],
 ['official roll call business republican convention conducted today charlotte created short video symbolize excitement president trump across states territories thank watching god bless god bless united states america',
  'Republican'],
  'Republican'],
 ['think places make neighborhood feel like home barbershop catch latest neighborhood news get hair cut restaurant know order heart art studio creative guy great idea ton hustle turned abandoned building second home entire community economists tell small businesses like one critically important recovery theyre also plain important without places call home wouldnt thats thing joe biden understands donald trump never never talk economy stock market whether find work really means something instead feeling like youre supposed grateful get paycheck rep',
  'Democratic'],
 ['convention joe biden guy earned respect commands across world across aisle stakes election call kind leadership think one party think core democr

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [26]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2328 as features in the model.


In [32]:
def conv_features(text,fw) :
    """Return a dict mapping each feature word found in `text` to True."""
    # Split the text into a set of unique tokens for fast membership tests
    tokens = set(text.split())

    # Keep only the tokens that are also feature words
    return {word: True for word in tokens if word in fw}

In [33]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [34]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [35]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [36]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494
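An accuracy figure like the one above is easier to read next to a baseline. Here is a minimal sketch of a majority-class baseline, i.e. the accuracy you would get by always guessing the most common party. The function name `majority_baseline` and the toy labels are made up for illustration; in this notebook you would pass something like `[party for _, party in test_set]`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Hypothetical labels for illustration only (not taken from the convention data)
toy_labels = ["Democratic"] * 6 + ["Republican"] * 4
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(f"Always guessing {baseline_label} gives accuracy {baseline_accuracy:.3f}")
```

If the classifier does not beat this number on the test set, the features are probably not carrying much signal for held-out speeches.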


In [37]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

_Your observations to come._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [41]:
# Create a congress db connection
cong_db = sqlite3.connect(cong_db_path)
cong_cur = cong_db.cursor()

# Query tweets
results = cong_cur.execute(
        '''
           SELECT DISTINCT
                  cd.candidate,
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
               AND cd.candidate == tw.candidate
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic')
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

# Close the connection
cong_db.close()
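Since the tables are unindexed, one option for speeding up repeated runs of the query above is to add an index on the join columns. This is only a sketch: it assumes you are allowed to modify your copy of the DB file, the index name `idx_tweets_join` is made up, and the column names are taken from the query. The demo runs against an empty in-memory database with the same column names rather than the real file.

```python
import sqlite3

def add_join_index(db):
    """Create an index on the tweets columns used in the join, if missing."""
    db.execute(
        """
        CREATE INDEX IF NOT EXISTS idx_tweets_join
        ON tweets (handle, candidate, district)
        """
    )
    db.commit()

# Stand-in schema for illustration; the real tables live in congressional_data.db
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE candidate_data (candidate TEXT, party TEXT, district TEXT, twitter_handle TEXT)"
)
demo_db.execute(
    "CREATE TABLE tweets (handle TEXT, candidate TEXT, district TEXT, tweet_text BLOB)"
)

add_join_index(demo_db)

# PRAGMA index_list returns one row per index; the name is the second column
index_names = [row[1] for row in demo_db.execute("PRAGMA index_list('tweets')")]
print(index_names)
demo_db.close()
```

Whether this actually helps depends on how SQLite plans the join, so it is worth timing the query before and after.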

In [48]:
# Display first 5 results
for data in results[:5]:
    print(data)

('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq')
('Mo Brooks', 'Republican', b'"Brooks: I Do Not Support America Raising, Training, and Arming a \nRebel Army to Fight in Syria\xe2\x80\x99s Civil War" http://t.co/f2QFErMkD4')
('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6')
('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA')
('Mo Brooks', 'Republican', b'"Rep. Mo Brooks: NDAA Amnesty Amendment \xe2\x80\x98Betrays Americans\xe2\x80\x99" via @BreitbartNews http://t.co/aflHYdUkuF')


In [60]:
# Initialize the list to hold tweet data
tweet_data = []

# Define the set of stopwords
stop_words = set(stopwords.words('english'))

# Process each tweet in the results
for candidate, party, tweet_text in results:
    # Decode tweet_text if it's in bytes
    if isinstance(tweet_text, bytes):
        tweet_text = tweet_text.decode('utf-8')

    # Clean and tokenize the tweet text
    tokens = tweet_text.lower().split()
    cleaned_tokens = [word for word in tokens if word not in stop_words]

    # Join the cleaned tokens back into a single string
    cleaned_text = ' '.join(cleaned_tokens)

    # Append the cleaned text and party to the tweet_data list
    tweet_data.append([cleaned_text, party])

In [61]:
# Display first 5 results
for data in tweet_data[:5]:
    print(data)

['"brooks joins alabama delegation voting flawed funding bill" http://t.co/3cwjiwysnq', 'Republican']
['"brooks: support america raising, training, arming rebel army fight syria’s civil war" http://t.co/f2qfermkd4', 'Republican']
['"brooks: senate democrats allowing president give americans’ jobs illegals" #securetheborder https://t.co/mzteax8xs6', 'Republican']
['"nasa square" event sat. 11am – 4pm. stop &amp; hear incredible work done #al05! @downtownhsv http://t.co/r9zy8wmepa', 'Republican']
['"rep. mo brooks: ndaa amnesty amendment ‘betrays americans’" via @breitbartnews http://t.co/aflhydukuf', 'Republican']
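Notice that the tweets above still carry punctuation, URLs, and HTML entities like `&amp;`, while the convention text had all non-alphabetic characters stripped. A token such as `violence.` will not match a feature word `violence`. Here is one possible way to bring the tweets closer to the convention cleaning. The function name `clean_tweet` and the tiny stop word set in the demo are hypothetical; in the notebook you would pass the NLTK `stop_words` set.

```python
import html
import re

def clean_tweet(text, stop_words):
    """Lowercase a tweet, drop URLs and non-letters, and remove stop words."""
    # Turn HTML entities such as &amp; back into plain characters
    text = html.unescape(text).lower()

    # Remove URLs before stripping punctuation so they do not become junk tokens
    text = re.sub(r'https?://\S+', ' ', text)

    # Same character filter as the convention cleaning: keep letters and whitespace
    text = re.sub(r'[^a-z\s]', '', text)

    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

# Tiny stand-in stop word set for illustration only
demo_stop_words = {"by", "about"}
demo_clean = clean_tweet(
    "Stop by &amp; hear about #AL05! http://t.co/R9zY8WMEpA", demo_stop_words
)
print(demo_clean)
```

Hashtags and mentions lose their `#` and `@` here, which may or may not be what you want for political tweets.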


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [62]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [None]:
# Create feature words based on token frequency
word_cutoff = 5
tokens = [word for cleaned_text, _ in tweet_data for word in cleaned_text.split()]
word_dist = nltk.FreqDist(tokens)

feature_words = {word for word, count in word_dist.items() if count > word_cutoff}

# Prepare the featuresets
featuresets = [(conv_features(cleaned_text, feature_words), party) for cleaned_text, party in tweet_data]

# Shuffle the featuresets
random.seed(20220507)
random.shuffle(featuresets)

# Split into train and test sets
test_size = 500
test_set, train_set = featuresets[:test_size], featuresets[test_size:]

# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Create a random sample to predict party type
random.seed(20201014)
tweet_data_sample = random.choices(tweet_data, k=10)

# Predict the party for each sampled tweet
for cleaned_text, actual_party in tweet_data_sample:
    # conv_features expects the cleaned text as a string and splits it itself
    features = conv_features(cleaned_text, feature_words)
    estimated_party = classifier.classify(features)

    print(f"Here's our (cleaned) tweet: {cleaned_text}")
    print(f"Actual party is {actual_party} and our classifier says {estimated_party}.")
    print("")

In [63]:
for tweet, party in tweet_data_sample :
    estimated_party = 'Gotta fill this in'
    # Fill in the right-hand side above with code that estimates the actual party

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")


Here's our (cleaned) tweet: mass shooting las vegas horrific act violence. victims families thoughts prayers.
Actual party is Democratic and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: early morning #traveltuesday. leaving ok-02 dc! http://t.co/igknci79e7
Actual party is Republican and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: moderates #iraq &amp; #syria civilians. we've enemies sides conflict. assist either.
Actual party is Republican and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: rt @natsecaction: 200 national security veterans demanding answers release confidential national security questionna…
Actual party is Democratic and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: 💯%, #buildthatwall now!! https://t.co/hyb6jcw5ea
Actual party is Republican and our classifer says Gotta fill this in.

Here's our (cleaned) tweet: glad attend #g20 assure everyone could majority americans still stand tr

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [None]:
# dictionary of counts by actual party and estimated party.
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    # Stop once num_to_score tweets have been scored
    if idx >= num_to_score :
        break

    tweet, party = tp
    # Now do the same thing as above, but we store the results rather
    # than printing them.

    # get the estimated party from the trained classifier
    estimated_party = classifier.classify(conv_features(tweet, feature_words))

    results[party][estimated_party] += 1

In [None]:
results
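The nested dictionary is a confusion matrix, so it can be boiled down to a few summary numbers. Here is a small sketch that computes overall accuracy plus per-party precision and recall. The function name `summarize` and the toy counts are made up for illustration; with real output you would call it on `results` and `parties`.

```python
def summarize(results, parties):
    """Compute accuracy and per-party precision/recall from nested counts.

    results[actual][estimated] holds the number of tweets with that pairing.
    """
    total = sum(results[a][e] for a in parties for e in parties)
    correct = sum(results[p][p] for p in parties)
    summary = {"accuracy": correct / total if total else 0.0}

    for p in parties:
        actual = sum(results[p][e] for e in parties)      # tweets truly from p
        predicted = sum(results[a][p] for a in parties)   # tweets labeled as p
        summary[p] = {
            "recall": results[p][p] / actual if actual else 0.0,
            "precision": results[p][p] / predicted if predicted else 0.0,
        }
    return summary

# Hypothetical counts for illustration only (first key actual, second estimated)
toy_results = {
    "Republican": {"Republican": 30, "Democratic": 20},
    "Democratic": {"Republican": 10, "Democratic": 40},
}
toy_summary = summarize(toy_results, ["Republican", "Democratic"])
print(toy_summary)
```

Looking at recall by party is a quick way to spot a classifier that mostly predicts one party regardless of the tweet.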

### Reflections

_Write a little about what you see in the results_