## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
import pandas as pd
import string
import re

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
convention_db = sqlite3.connect(r'/Users/Chris/Downloads/2020_Conventions.db')
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [3]:
# Create a set of stopwords (set membership checks are faster than list lookups)
sw = set(stopwords.words("english"))

# Create a set of punctuation marks
punctuation = set(string.punctuation)  # Speeds up comparison
tw_punct = punctuation - {"#"}

# Remove stopwords
def remove_stop(tokens):
    return [w for w in tokens if w.lower() not in sw]

# Remove punctuation
def remove_punctuation(text, punct_set=tw_punct):
    return "".join([ch for ch in text if ch not in punct_set])

# Tokenize
def tokenize(text):
    return text.split()

# Convert to lowercase
def lowercase(text):
    return text.lower()

# Joins tokens (concatenation)
def join_tokens(tokens):
    text = " ".join(tokens)
    return text

# Applies pipeline to text
def prepare(text, pipeline):
    tokens = str(text)
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens

cleaning_pipeline = [lowercase, remove_punctuation, tokenize, remove_stop, join_tokens]
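
To see what the pipeline does to a single string, here is a minimal, self-contained sketch. It mirrors the steps in `cleaning_pipeline`, but `demo_sw`, `demo_punct`, `demo_prepare`, and `demo_pipeline` are hypothetical stand-ins, and a tiny stopword set replaces the NLTK list so the sketch runs without any downloads.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption, not the real list)
demo_sw = {"the", "is", "a", "of", "and", "to", "in"}

# Same idea as tw_punct: all ASCII punctuation except the hashtag symbol
demo_punct = set(string.punctuation) - {"#"}

def demo_prepare(text, pipeline):
    """Apply each transform in order, feeding the result of one into the next."""
    result = str(text)
    for transform in pipeline:
        result = transform(result)
    return result

demo_pipeline = [
    str.lower,                                                   # case fold
    lambda t: "".join(ch for ch in t if ch not in demo_punct),   # strip punctuation
    str.split,                                                   # tokenize on whitespace
    lambda toks: [w for w in toks if w not in demo_sw],          # drop stopwords
    " ".join,                                                    # back to one string
]

cleaned = demo_prepare(
    "The economy is STRONG, and #jobs are coming back to America!",
    demo_pipeline,
)
print(cleaned)  # economy strong #jobs are coming back america
```

Note that `#` survives because it is excluded from the punctuation set, and that curly apostrophes (’) are not part of `string.punctuation`, which is why words like `doesn’t` pass through the real pipeline unchanged.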

In [4]:
# Retrieve table names
convention_cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")

# Fetch table names
convention_table_names = [row[0] for row in convention_cur.fetchall()]
convention_table_names

['conventions']

In [5]:
convention_data = []

# Fill this list with items that are themselves lists.
# The first element of each sublist is the cleaned text as a single string; the second is the party.

query_results = convention_cur.execute("SELECT party, text FROM conventions")

for party, text in query_results:
    # The pipeline already lowercases the text, so only the party label needs it here
    clean_text = prepare(text, pipeline=cleaning_pipeline)
    convention_data.append([clean_text, lowercase(party)])

Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data, k = 10)

[['different kind convention', 'democratic'],
 ['son scranton claymont wilmington become one consequential vice presidents american history accolade nonetheless rest firmly behind legacy husband father grandfather grateful nation thanks vice president joseph r biden jr lifetime service behalf united states america',
  'democratic'],
 ['joe believes stand allies stand adversaries right president turns tragedies political weapons joe president turns challenges purpose joe bring us together build economy doesn’t leave anyone behind good paying job floor ceiling',
  'democratic'],
 ['feeding people act love think use little extra love days took food trucks community deliver meals made right clinton presidential center neighbors need even leaders let us americans kept looking arkansas casts 9 votes bernie sanders 27 votes next president joe biden',
  'democratic'],
 ['well i’m nurses i’m doctors i’m everybody make china virus go away it’s happening please go ahead',
  'republican'],
 ['good

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [7]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2391 as features in the model.
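
The comparison in the cutoff matters at the boundary: `count > word_cutoff` drops words that occur exactly `word_cutoff` times, while `>=` would keep them. A small sketch on a made-up corpus (the `docs` list and the variable names are hypothetical) shows the difference:

```python
from collections import Counter

# Hypothetical toy corpus, already cleaned
docs = ["tax cuts jobs", "jobs jobs climate", "climate tax jobs"]

# Count every token across all documents
counts = Counter(word for doc in docs for word in doc.split())

cutoff = 2

# Strict comparison: a word must occur more than `cutoff` times
strict = {word for word, count in counts.items() if count > cutoff}

# Inclusive comparison: a word must occur at least `cutoff` times
inclusive = {word for word, count in counts.items() if count >= cutoff}

print(sorted(strict))     # ['jobs']
print(sorted(inclusive))  # ['climate', 'jobs', 'tax']
```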


In [8]:
def conv_features(text,fw) :
        
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
    """
        
    ret_dict = {word: True for word in text.split() if word in fw}

    return ret_dict

In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [10]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [11]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [12]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.5


In [13]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           republ : democr =     25.8 : 1.0
                   votes = True           democr : republ =     23.8 : 1.0
             enforcement = True           republ : democr =     21.5 : 1.0
                 destroy = True           republ : democr =     19.2 : 1.0
                freedoms = True           republ : democr =     18.2 : 1.0
                 climate = True           democr : republ =     17.8 : 1.0
                supports = True           republ : democr =     17.1 : 1.0
                   crime = True           republ : democr =     16.1 : 1.0
                   media = True           republ : democr =     14.9 : 1.0
                 beliefs = True           republ : democr =     13.0 : 1.0
               countries = True           republ : democr =     13.0 : 1.0
                 defense = True           republ : democr =     13.0 : 1.0
                    isis = True           republ : democr =     13.0 : 1.0
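
As a rough guide to reading these ratios: each one compares how likely a feature is under one party versus the other. The sketch below assumes add-0.5 smoothing (expected likelihood estimation) over two possible feature values, present or absent, which as far as I know matches NLTK's default estimator; treat that as an assumption. The counts are made up for illustration and `smoothed_prob` is a hypothetical helper.

```python
def smoothed_prob(count, n_docs, gamma=0.5, bins=2):
    """P(feature present | party) with add-gamma smoothing (assumed estimator)."""
    return (count + gamma) / (n_docs + gamma * bins)

# Hypothetical counts: a word appears in 25 of 1000 Republican entries
# and in 0 of 1000 Democratic entries.
p_rep = smoothed_prob(25, 1000)
p_dem = smoothed_prob(0, 1000)

# The likelihood ratio is what the "republ : democr" column reports
ratio = p_rep / p_dem
print(f"{ratio:.1f} : 1.0")  # 51.0 : 1.0
```

Smoothing keeps the ratio finite even when a word never appears for one party, which is why words unseen on one side show up with large but bounded ratios.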

### My Observations

There are clear differences between Democrats and Republicans in the most informative features. The list is imbalanced: far more words indicate a Republican speaker than a Democratic one. Words such as `china`, `enforcement`, `defund`, and `amendment` point to Republican speeches and relate to traditional conservative themes. In contrast, only two words, `votes` and `climate`, are associated with Democratic speeches. These may relate to topics like voting rights and climate change, although `votes` may also reflect the Democratic roll call (as in the Arkansas entry in the sample above).

These patterns in word usage provide insight into what the Naive Bayes classifier has learned. They suggest that the classifier has many more strong cues for Republican speeches than for Democratic speeches.

### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [14]:
cong_db = sqlite3.connect("/Users/Chris/Downloads/congressional_data.db")

cong_cur = cong_db.cursor()

In [15]:
# Retrieve table names
cong_cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")

# Fetch table names
cong_table_names = [row[0] for row in cong_cur.fetchall()]
cong_table_names

['websites', 'candidate_data', 'tweets']

In [16]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
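
One thing to keep in mind when reading the retweet filter: in SQLite, `LIKE` is case-insensitive for ASCII characters by default, so `NOT LIKE '%RT%'` also drops any tweet that merely contains the letters "rt" (for example in "support" or "party"). The sketch below illustrates this on an in-memory table with plain text values; the table `demo_tweets`, its rows, and the alternative prefix pattern `'RT %'` are all hypothetical and not part of the original query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo_tweets (tweet_text TEXT)")
con.executemany(
    "INSERT INTO demo_tweets VALUES (?)",
    [
        ("RT @someone: big news",),       # an actual retweet
        ("Proud to support our party",),  # contains "rt" inside ordinary words
        ("Go tribe",),                    # no "rt" anywhere
    ],
)

# Broad pattern, as in the query above: matches "rt" anywhere, in any case
broad = [row[0] for row in con.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid"
)]

# Hypothetical narrower pattern: only tweets that start with "RT "
prefix = [row[0] for row in con.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE 'RT %' ORDER BY rowid"
)]

print(broad)   # ['Go tribe']
print(prefix)  # ['Proud to support our party', 'Go tribe']
con.close()
```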

In [17]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

sw = set(stopwords.words("english"))  # a set keeps stopword lookups fast

# Iterate through the query results
for row in results:
    # Unpack the values from the row
    candidate, party, tweet_text = row
    
    # Clean the tweet text using the pipeline
    clean_tweet_text = prepare(tweet_text, pipeline = cleaning_pipeline)
    
    # Append the cleaned tweet text and party to tweet_data
    tweet_data.append([clean_tweet_text, party])

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [18]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data, k = 10)

In [19]:
for tweet, party in tweet_data_sample :
    estimated_party = classifier.classify(conv_features(tweet, feature_words))

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: bearlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast httpstcowqgtrzt7vv
Actual party is Democratic and our classifier says republican.

Here's our (cleaned) tweet: bgo tribe #rallytogether httpstco0nxutfl9l5
Actual party is Democratic and our classifier says democratic.

Here's our (cleaned) tweet: bapparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh
Actual party is Democratic and our classifier says republican.

Here's our (cleaned) tweet: bwexe2x80x99re grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives linennhttpstcoezpv0vmiz3
Actual party is Republican and our classifier says republican.

Here's our (cleaned) tweet: bletxe2x80x99s make even greater #kag xf0x9fx87xbaxf0x9fx87xb8 httpstcoy9qozd5l2z
Actual party is Republican and our classifier says
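
The leading `b` on each cleaned tweet and fragments like `xe2x80x99` suggest that `tweet_text` comes back from the DB as `bytes`, so `str(text)` in `prepare` produces the printable form `b'...'` instead of the decoded text. A minimal sketch of one way to handle that, assuming the tweets are UTF-8 encoded (the helper `to_text` is hypothetical and not part of the notebook):

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass other values through str() unchanged."""
    if isinstance(value, bytes):
        # errors="replace" avoids a crash on malformed byte sequences
        return value.decode(encoding, errors="replace")
    return str(value)

raw = "We’re grateful".encode("utf-8")

# str() on bytes keeps the b'...' wrapper and escapes non-ASCII bytes
print(str(raw))      # b'We\xe2\x80\x99re grateful'

# Decoding first recovers the original text
print(to_text(raw))  # We’re grateful
```

After lowercasing and punctuation removal, the `str(raw)` form becomes `bwexe2x80x99re grateful`, which matches the pattern in the output above. Calling something like `to_text` before `prepare` would be one way to avoid it.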

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [20]:
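# Dictionary of counts by actual party and estimated party.
# The first key is the actual party (title case, as stored in the tweets DB);
# the second is the estimated party (lowercase, matching the convention training labels).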
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

# Iterate through the tweet data and score the results
for idx, tp in enumerate(tweet_data):
    tweet, party = tp
    
    # Get the estimated party using the trained classifier
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    # Update the results dictionary
    results[party][estimated_party] += 1

    if idx >= num_to_score:
        break

In [21]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0,
                          'Democratic': 0,
                          'republican': 3580,
                          'democratic': 698}),
             'Democratic': defaultdict(int,
                         {'Republican': 0,
                          'Democratic': 0,
                          'republican': 4712,
                          'democratic': 1011})})
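
To turn these counts into rates, here is a small sketch that copies the non-zero entries from the output above and computes overall accuracy and per-party recall. The variable names are hypothetical. The predicted labels are lowercase while the actual labels are title case, so the sketch lowercases the actual label when looking up the correct prediction.

```python
# Non-zero counts copied from the results above: counts[actual][predicted]
counts = {
    "Republican": {"republican": 3580, "democratic": 698},
    "Democratic": {"republican": 4712, "democratic": 1011},
}

# Total number of scored tweets
total = sum(sum(row.values()) for row in counts.values())

# A prediction is correct when it equals the lowercased actual label
correct = sum(row[actual.lower()] for actual, row in counts.items())

accuracy = correct / total

# Recall per party: share of that party's tweets labeled correctly
recall = {
    actual: row[actual.lower()] / sum(row.values())
    for actual, row in counts.items()
}

print(f"Scored {total} tweets, {correct} correct ({accuracy:.1%})")
for actual, value in recall.items():
    print(f"{actual} recall: {value:.1%}")
```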

In [22]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           republ : democr =     25.8 : 1.0
                   votes = True           democr : republ =     23.8 : 1.0
             enforcement = True           republ : democr =     21.5 : 1.0
                 destroy = True           republ : democr =     19.2 : 1.0
                freedoms = True           republ : democr =     18.2 : 1.0
                 climate = True           democr : republ =     17.8 : 1.0
                supports = True           republ : democr =     17.1 : 1.0
                   crime = True           republ : democr =     16.1 : 1.0
                   media = True           republ : democr =     14.9 : 1.0
                 beliefs = True           republ : democr =     13.0 : 1.0
               countries = True           republ : democr =     13.0 : 1.0
                 defense = True           republ : democr =     13.0 : 1.0
                    isis = True           republ : democr =     13.0 : 1.0

### Reflections

Applying the convention-trained Naive Bayes model to the congressional tweets produced the counts above for 10,001 scored tweets. The model was more accurate for Republicans than for Democrats. For tweets from Republicans, it predicted `republican` 3,580 times (correct) and `democratic` 698 times (incorrect), so about 84% were right. For tweets from Democrats, it predicted `republican` 4,712 times (incorrect) and `democratic` 1,011 times (correct), so only about 18% were right. Overall, 4,591 of the 10,001 tweets (about 46%) were classified correctly. Because the model labeled roughly 83% of all tweets `republican`, its apparent strength on Republican tweets largely reflects a tendency to predict `republican` regardless of the author.