## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords
import regex as re
import emoji  # third-party package; is_emoji() below depends on it



# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns
import os
print(os.getcwd())

/Users/travis/naive-bayes-text-classifier


In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [3]:
# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords
sw = stopwords.words("english")

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
all_language_emojis = set()

def is_emoji(s):
    return(emoji.is_emoji(s))

def contains_emoji(s):    
    s = str(s)
    emojis = [ch for ch in s if is_emoji(ch)]
    return(len(emojis) > 0)


def remove_stop(tokens) :
    return[t for t in tokens if t.lower() not in sw]
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    """ Return every run of word characters and hyphens that contains at least
        one letter. Tokens like '2A' are kept, purely numeric tokens are dropped,
        and a leading '#' is not part of the match. """
    
    return re.findall(r'[\w-]*\p{L}[\w-]*', text)

def restring(text) :
    return (" ".join(text))

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

def remove_title(text):
    text = str(text)
    text = text.split('"',2)[2]
    return(text)
    


### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [4]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
    'select text, party from conventions'
)
for row in query_results:
    convention_data.append(row)
    


Let's look at some random entries and see if they look right. 

In [5]:
random.choices(convention_data,k=10)

[('Yep. 15, close to 15 years cancer free?', 'Democratic'),
 ('A threat to this nation, to our democracy is real. It’s clear and it’s present. We’ve watched the president now for three years. Look at what he’s doing, instilling fear. I mean, not joking. Instilling fear. Showing division, stroking racial division, undercutting every institution that was designed to check the abuse of power by the president or anyone else. All this for what reason? All this in order to solidify his base and expand his power.',
  'Democratic'),
 ('Good evening. My name is Jeff Van Drew. I speak to you as a member of the Republican Party, but it always wasn’t that way. How I became a Republican says a lot about today’s Democratic Party. I’m from South Jersey, where we work hard, look after our neighbors, and care about our communities. Years ago, I was a local dentist and was asked by the Democrats to run for a town council seat. I had my doubts and I explained that my views were middle of the road to cons

In [6]:
#Converting the list into a two column DataFrame for the Convention Data
df = pd.DataFrame(convention_data, columns = ['Text', 'Party'])

In [7]:
# This pipeline is recycled from the previous module's work. I added "restring"
# on the end to join the tokens back into a single string after removing stopwords.
my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop, restring]

# Run the convention text through the pipeline
df["Text"] = df["Text"].apply(prepare,pipeline=my_pipeline)
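To sanity-check what the pipeline does to a single sentence, here is a minimal sketch that uses only the standard library. It assumes a tiny stand-in stopword set (`demo_sw`) instead of NLTK's English list, and it approximates the `\p{L}` letter class with `[^\W\d_]`, so results on real data may differ slightly from the notebook's own functions.

```python
import re
from string import punctuation

# Assumption: a tiny stand-in for stopwords.words("english")
demo_sw = {"a", "in", "the", "were", "we"}
# Same idea as tw_punct above: all ASCII punctuation except '#'
demo_punct = set(punctuation) - {"#"}

def remove_punctuation(text, punct_set=demo_punct):
    return "".join(ch for ch in text if ch not in punct_set)

def tokenize(text):
    # [^\W\d_] approximates \p{L} (a letter) using only the standard library
    return re.findall(r"[\w-]*[^\W\d_][\w-]*", text)

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in demo_sw]

def restring(tokens):
    return " ".join(tokens)

def prepare(text, pipeline):
    tokens = str(text)
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens

demo_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop, restring]
cleaned = prepare("We're building a 21st-century economy in 2020!", demo_pipeline)
print(cleaned)  # building 21stcentury economy
```

Note that the apostrophe and hyphen are removed before tokenizing, so "We're" becomes "were" (a stopword here), and the bare number is dropped because it contains no letter.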

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [8]:
word_cutoff = 5


tokens = []

# Collect every token from the Text column so we can count word frequencies
for i in df["Text"]:
    i = tokenize(i)
    tokens.extend(i)
word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2275 as features in the model.


In [9]:
def conv_features(text,fw) :
    """ Return {word: True} for every token in `text` that is in the
        feature-word collection `fw`. """
    ret_dict = dict()

    # Tokenize the cleaned text, then keep only tokens that are feature words.
    # A membership test replaces the inner loop over every feature word.
    for tok in tokenize(text):
        if tok in fw:
            ret_dict[tok] = True

    return(ret_dict)

In [10]:

assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)=={'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)=={'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [11]:
# Creating the feature set for the convention data.
# Note: this reads the raw text in convention_data, not the cleaned df["Text"].
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [12]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [13]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.49


In [14]:
classifier.show_most_informative_features(25)

Most Informative Features
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                 company = True           Republ : Democr =     16.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                religion = True           Republ : Democr =     13.0 : 1.0
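
For reading the ratios above, it can help to see how such a number might arise. The sketch below is not NLTK's code. It is a simplified, standard-library illustration that assumes each word is a present/absent feature and that probabilities are smoothed by adding 0.5 to each count (I believe this mirrors NLTK's default estimator, but treat that as an assumption). The toy documents and the `presence_probs` helper are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: (set of words in the speech, party)
toy_docs = [
    ({"climate", "jobs"}, "Democratic"),
    ({"climate"}, "Democratic"),
    ({"votes"}, "Democratic"),
    ({"crime"}, "Republican"),
    ({"crime", "jobs"}, "Republican"),
    ({"freedoms"}, "Republican"),
]

def presence_probs(docs, word, smoothing=0.5):
    """Smoothed P(word is present | party) for each party."""
    n_docs = Counter(party for _, party in docs)
    n_with_word = Counter(party for words, party in docs if word in words)
    # Two possible outcomes (present / absent), hence 2 * smoothing below
    return {
        party: (n_with_word[party] + smoothing) / (n_docs[party] + 2 * smoothing)
        for party in n_docs
    }

probs = presence_probs(toy_docs, "climate")
ratio = probs["Democratic"] / probs["Republican"]
print(f"climate = True   Democr : Republ = {ratio:.1f} : 1.0")
```

Under these assumptions, a word seen in only a handful of speeches, all from one party, can still produce a large ratio, so a high ranking does not by itself mean the word is common.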

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

What I notice about the classifier is that the bulk of the top 25 features lean Republican. This leads me to believe that there may be more Republican speeches than Democratic ones. That could lead to greater inaccuracy due to lower variability between Democratic and Republican rhetoric.
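
One way to check the hunch about class balance is to count the rows per party. A minimal sketch, using a hypothetical stand-in list (`demo_rows`) with the same `(text, party)` shape as `convention_data`:

```python
from collections import Counter

# Hypothetical rows shaped like convention_data: (text, party)
demo_rows = [
    ("text a", "Democratic"),
    ("text b", "Republican"),
    ("text c", "Republican"),
]

party_counts = Counter(party for _, party in demo_rows)
total = sum(party_counts.values())
for party, n in party_counts.most_common():
    print(f"{party}: {n} rows ({n / total:.0%})")
```

In the notebook, `convention_data` would take the place of `demo_rows`.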


### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [15]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [16]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
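
One detail of the query worth knowing: in SQLite, `LIKE` is case-insensitive for ASCII letters by default, so `NOT LIKE '%RT%'` drops any tweet containing "rt" anywhere, not just retweets. The sketch below demonstrates this on a hypothetical in-memory table (`demo_tweets`) that stores the text as `TEXT`. The anchored pattern `'RT @%'` is only one possible alternative and assumes retweets begin with "RT @".

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo_tweets (tweet_text TEXT)")
con.executemany(
    "INSERT INTO demo_tweets VALUES (?)",
    [
        ("RT @someone: vote today",),
        ("Join our party tonight",),   # contains 'rt' inside 'party'
        ("Go Tribe!",),
    ],
)

# Pattern used in the query above: matches 'rt' anywhere, in any case
broad = [r[0] for r in con.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid")]

# Hypothetical alternative: only exclude text that starts with 'RT @'
anchored = [r[0] for r in con.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE 'RT @%' ORDER BY rowid")]

print(broad)     # ['Go Tribe!']
print(anchored)  # ['Join our party tonight', 'Go Tribe!']
con.close()
```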

In [17]:
tweet_data = []

# Fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
for row in results:
    tweet_data.append(list(row))
print(len(tweet_data))

664656


In [18]:
#Deleting the first column of the list since we are not using the candidate names
for row in tweet_data:
    del row[0]

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [19]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)


In [20]:
#Like the convention dataset, I am converting the sample dataset to a DataFrame
df_conv = pd.DataFrame(tweet_data_sample, columns = ['Party', 'Text'])

#building the dataframe for the full dataset of tweet_data
df_full = pd.DataFrame(tweet_data, columns = ['Party', 'Text'])

In [21]:
# Decode the text column of both datasets from UTF-8 bytes to str
df_conv['Text'] = df_conv['Text'].str.decode(encoding='UTF-8')
df_full['Text'] = df_full['Text'].str.decode(encoding='UTF-8')

In [22]:
#Applying the pipeline to both datasets, saving the result to a new column 
#to retain the original string to use for comparison
df_conv["Token"] = df_conv["Text"].apply(prepare,pipeline=my_pipeline)
df_full["Token"] = df_full["Text"].apply(prepare,pipeline=my_pipeline)

In [24]:
for row in df_conv.index:
    estimated_party = classifier.classify(conv_features(df_conv['Token'][row],feature_words))
    print(f"Here's our (cleaned) tweet: {df_conv['Text'][row]}")
    print(f"Actual party is {df_conv['Party'][row]} and our classifer says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: Earlier today, I spoke on the House Floor abt protecting health care for women and praised @PPmarmonte for their work on the Central Coast. https://t.co/WqgTRzT7VV
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: Go Tribe! #RallyTogether https://t.co/0NXutFL9L5
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: Apparently, Trump thinks it's just too easy for students overwhelmed by the crushing burden of debt to pay off student loans #TrumpBudget https://t.co/ckYQO5T0Qh
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: We’re grateful for our first responders, our rescue personnel, our firefighters, our police, and volunteers who have been working tirelessly to keep people safe, provide much-needed help, while putting their own lives on the line.

https://t.co/eZPv0vMIz3
Actual party is Republican and our classifer says Republican.

Here'

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [25]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
#random.shuffle(test_df)

# for idx, tp in enumerate(tweet_data) :
#     tweet, party = tp
for row in df_full.index:
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = classifier.classify(conv_features(df_full['Token'][row],feature_words))
    
    results[df_full['Party'][row]][estimated_party] += 1
    
    if row > num_to_score : 
        break

In [26]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3508, 'Democratic': 577}),
             'Democratic': defaultdict(int,
                         {'Republican': 4930, 'Democratic': 987})})

### Reflections

_Write a little about what you see in the results_ 

From the results, I noticed that the model is biased toward classifying an observation as Republican. The results seem subpar to me: nearly 5,000 Democratic tweets were labeled Republican. This could be because the top 25 most informative features skew Republican. With a more balanced feature set, the accuracy could potentially be better.
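
To put numbers on this, the counts printed above can be turned into overall accuracy and per-party recall. A minimal sketch, with the counts copied by hand into a plain dict (`counts` is a stand-in for `results`):

```python
# Counts copied from the output above: counts[actual][estimated]
counts = {
    "Republican": {"Republican": 3508, "Democratic": 577},
    "Democratic": {"Republican": 4930, "Democratic": 987},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[p][p] for p in counts)
accuracy = correct / total

# Recall: of the tweets that truly belong to a party, the share labeled correctly
recall = {p: counts[p][p] / sum(counts[p].values()) for p in counts}

# Share of all scored tweets that the classifier labeled Republican
predicted_rep = sum(counts[p]["Republican"] for p in counts) / total

print(f"scored={total} accuracy={accuracy:.3f}")
for p, r in recall.items():
    print(f"recall[{p}]={r:.3f}")
print(f"predicted Republican share={predicted_rep:.3f}")
```

With these counts the accuracy is about 45%, and roughly 84% of the scored tweets are labeled Republican, which is the bias described above.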