## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Additional libraries
import re 
from string import punctuation
import pprint
import spacy

punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

nlp = spacy.load('en_core_web_sm')

# Feel free to include your text patterns functions

def remove_stop(tokens) :
    # remove spaCy's English stop words plus a few corpus-specific extras
    stop_words = set(spacy.lang.en.STOP_WORDS)                   # copy so the shared spaCy set is not mutated

    potential_stop_words = [ '', 'im', 'like',                   
                            'dont', 'got', 'cause',              # added extra stop words
                            'wanna', 'youre']
    
    stop_words.update(potential_stop_words)

    removed = [w for w in tokens if w not in stop_words]         # remove stop words
    return(removed)

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def remove_punctuation_keep_hash(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))
 
def remove_punctuation(text, punct_set=punctuation) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    
    # modify this function to return tokens
    collapse_whitespace = re.compile(r'\s+')
    return([item.lower() for item in collapse_whitespace.split(text)])  # using Module 2
    
def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

def join_by_space(text_list):
    return ' '.join(text_list)
#from text_functions_solutions import clean_tokenize, get_patterns
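
To see how `prepare` chains its transforms, here is a minimal stdlib-only sketch. The `demo_*` names and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses spaCy's stop-word list), and the demo tokenizer uses `str.split()` rather than the regex split above.

```python
import re
from string import punctuation

# Hypothetical stand-in for spaCy's stop-word list, for illustration only.
DEMO_STOP_WORDS = {"the", "is", "a", "to", "of"}

def demo_prepare(text, pipeline):
    """Apply each transform in order, feeding the output of one into the next."""
    result = str(text)
    for transform in pipeline:
        result = transform(result)
    return result

def demo_remove_url(text):
    return re.sub(r"http\S+", "", text)

def demo_remove_punct_keep_hash(text):
    drop = set(punctuation) - {"#"}   # keep '#' so hashtags survive
    return "".join(ch for ch in text if ch not in drop)

def demo_tokenize(text):
    return text.split()               # split() drops empty strings

def demo_remove_stop(tokens):
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

demo_pipeline = [str.lower, demo_remove_url, demo_remove_punct_keep_hash,
                 demo_tokenize, demo_remove_stop, " ".join]

demo_clean = demo_prepare("The #2A is a right! https://example.com/x", demo_pipeline)
print(demo_clean)
```

The order matters: the URL has to be removed before punctuation is stripped, otherwise `http\S+` would no longer match the mangled link.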

In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [3]:
# Test query results

for row in convention_cur.execute("SELECT party, text FROM conventions"):
    # print(row)
    pass

In [4]:
# Get column names from the convention_db

names = [description[0] for description in convention_cur.description]
print(names)

['party', 'text']


In [5]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT party, text FROM conventions
                            ''')

my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop, join_by_space]

for row in query_results :
    party = row[0]
    raw_string = row[1]

    # clean and tokenize the raw text string
    clean_string = prepare(raw_string, my_pipeline)

    # store the results in convention_data
    convention_data.append([clean_string, party])


Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data,k=10)

[['need numbers overwhelming trump can’t sneak steal way victory text vote 30330 started years ago yesterday 19th amendment constitution ratified took seven decades suffragists marching picketing going jail push closer perfect union 55 years ago john lewis marched bled selma work unfinished tonight i’m thinking girls boys america’s future kamala harris black woman daughter jamaican indian immigrants nominee vice president country’s story breaking barriers expanding circle possibility',
  'Democratic'],
 ['we’ve 1864 trade tariffs china they’ve horrible language trade deal said china buy price advantage kind trade deal that’s deal covid hit plummeted getting that’s problem that’s day day battle',
  'Democratic'],
 ['systematic decision country’s powerful leaders sell america china',
  'Republican'],
 ['need experienced leader tony 3631 leader passion integrity strategic leadership skills chris f 3636 joe biden joseph w 3637 joe biden',
  'Democratic'],
 ['want strong america want joe bi

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [7]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2278 as features in the model.
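
The comparison in the cell above is strict (`count > word_cutoff`), so a word that occurs exactly `word_cutoff` times is left out. A small sketch of the difference, using a made-up token list and hypothetical `demo_*` names:

```python
from collections import Counter

# Made-up tokens, for illustration only.
demo_counts = Counter("tax tax tax jobs jobs vote".split())
demo_cutoff = 2

# Strict comparison: a word must occur MORE than demo_cutoff times.
strict = {w for w, c in demo_counts.items() if c > demo_cutoff}

# Inclusive comparison: a word must occur AT LEAST demo_cutoff times.
inclusive = {w for w, c in demo_counts.items() if c >= demo_cutoff}

print(strict, inclusive)
```

Switching the notebook to `>=` would enlarge `feature_words`, so the feature count printed above would no longer apply.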


In [8]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
    """
    
    ret_dict = dict()
    text_list = text.split(' ')                     # split text by whitespace

    for word in text_list:                          # iterate through text_list
        if (word in fw) and (word not in ret_dict): # check for match in fw and first occurrence
            ret_dict[word] = True                   # add to ret_dict if True
    
    return(ret_dict)

In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [10]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [11]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [12]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.52
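
An accuracy figure is easier to read next to a majority-class baseline, meaning the score you would get by always guessing the most common label. A minimal sketch, where `majority_baseline` is a hypothetical helper and the label split is invented rather than taken from the convention data:

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Invented 60/40 split, for illustration only.
demo_labels = ["Democratic"] * 6 + ["Republican"] * 4
demo_label, demo_acc = majority_baseline(demo_labels)
print(demo_label, demo_acc)
```

In this notebook the same helper could be called on `[party for _, party in test_set]` to get the baseline for the held-out speeches.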


In [13]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0

*Write a little prose here about what you see in the classifier. Anything odd or interesting?*

### My Observations

The classifier output above points to a few interesting patterns among the 25 most informative words in the `feature_words` set. According to the `nltk` documentation for the `show_most_informative_features` function: 

``"the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label."``

This means the ratio on the right of each row compares how likely a word is to appear in a text from one party versus the other, with the party that favors the word listed first. Only two of the 25 words lean Democratic, `"votes"` and `"climate"`; the remaining 23 lean Republican. The three words most strongly aligned with the Republican party are `"china"`, `"enforcement"`, and `"destroy"`, which are roughly 19 to 26 times more likely to appear in a Republican text than in a Democratic one.
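
The ratio can be reproduced by hand on toy data. The sketch below assumes add-0.5 smoothing over two outcomes (word present or absent), which I believe mirrors `nltk`'s default estimator but have not verified against this notebook's classifier. `informativeness` and the toy feature sets are hypothetical.

```python
from collections import Counter

def informativeness(featuresets, feature, gamma=0.5):
    """Ratio of the largest to the smallest smoothed P(feature=True | label)."""
    label_totals = Counter(label for _, label in featuresets)
    hits = Counter(label for feats, label in featuresets if feats.get(feature))
    probs = {}
    for label, total in label_totals.items():
        # add-gamma smoothing over two outcomes: present / absent (assumption)
        probs[label] = (hits[label] + gamma) / (total + 2 * gamma)
    best = max(probs, key=probs.get)
    worst = min(probs, key=probs.get)
    return best, worst, probs[best] / probs[worst]

# Toy data: "china" appears in 9 of 10 Republican texts and 1 of 10 Democratic texts.
toy = ([({"china": True}, "Republican")] * 9 + [({}, "Republican")] * 1
       + [({"china": True}, "Democratic")] * 1 + [({}, "Democratic")] * 9)

best, worst, ratio = informativeness(toy, "china")
print(f"{best} : {worst} = {ratio:.1f} : 1.0")
```

Because the ratio divides by the smaller probability, a word that one party almost never uses can rank very high even if the other party uses it only occasionally.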

### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [14]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [15]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
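
One caveat about the `NOT LIKE '%RT%'` filter: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so the pattern matches any tweet containing "rt" anywhere, not just retweets. The sketch below shows this on a throwaway in-memory table. The table, the sample tweets, and the narrower `'RT %'` pattern are all hypothetical, and I have not checked how the filter behaves on the byte-encoded `tweet_text` column in the real database.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE demo_tweets (tweet_text TEXT)")
demo_db.executemany(
    "INSERT INTO demo_tweets VALUES (?)",
    [("RT @someone: big news",), ("I support our party",), ("Hello world",)],
)

# Broad pattern from the query above: drops anything containing "rt" in any case.
broad = [r[0] for r in demo_db.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid")]

# One possible narrower pattern: only drop tweets that start with "RT ".
narrow = [r[0] for r in demo_db.execute(
    "SELECT tweet_text FROM demo_tweets "
    "WHERE tweet_text NOT LIKE 'RT %' ORDER BY rowid")]

print(broad)
print(narrow)
demo_db.close()
```

If the broad pattern behaves the same way on the real data, tweets mentioning words like "support" or "party" would be excluded from `results`.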

In [16]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

names = [description[0] for description in cong_cur.description]
print(names)

my_pipeline = [str.lower, remove_url,           # add remove url to pipeline
               remove_punctuation_keep_hash,    # keep hash in punctuation
               tokenize, remove_stop, join_by_space]

for row in results :
    party = row[1]
    raw_string = row[2].decode("utf-8")              # convert bytes to string
    clean_string = prepare(raw_string, my_pipeline)  # clean raw string
    tweet_data.append([clean_string, party])         # store tweet


['candidate', 'party', 'tweet_text']


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [17]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

for row in tweet_data_sample:
    print(row)

['earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast', 'Democratic']
['tribe #rallytogether', 'Democratic']
['apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget', 'Democratic']
['we’re grateful responders rescue personnel firefighters police volunteers working tirelessly people safe provide muchneeded help putting lives line', 'Republican']
['let’s greater #kag 🇺🇸', 'Republican']
['1hr cavs tie series 22 #allin216 repbarbaralee scared #roadtovictory', 'Democratic']
['congrats belliottsd new gig sd city hall glad continue serve…', 'Democratic']
['close 3500 raised match right whoot that’s 7000 nonmath majors room 😂 help', 'Democratic']
['today comment period potus’s plan expand offshore drilling opened public 60 days march 9 share oppose proposed program directly trump administration comments email mail', 'Democratic']
['celebrated icseastla’s 22 years eastside commitment amp saluted c

In [18]:
# featuresets2 = [(conv_features(text,feature_words), party) for (text, party) in tweet_data_sample]

for tweet, party in tweet_data_sample :
    feature_set = conv_features(tweet, feature_words)   # generate feature set
    estimated_party = classifier.classify(feature_set)  # predict party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: tribe #rallytogether
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: we’re grateful responders rescue personnel firefighters police volunteers working tirelessly people safe provide muchneeded help putting lives line
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: let’s greater #kag 🇺🇸
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: 1hr cavs tie series 22 #allin216 repbarbaralee scared #roadtovictory
Actual party is Democratic and our classifier says Re

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [25]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    feature_set = conv_features(tweet, feature_words)   # generate feature set
    estimated_party = classifier.classify(feature_set)  # predict party
    
    results[party][estimated_party] += 1
    
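    # NOTE: this check runs after scoring and uses `>`, so tweets with
    # idx 0 through num_to_score + 1 are scored (10,002 in total here).
    if idx > num_to_score : 
        break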

In [26]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3725, 'Democratic': 672}),
             'Democratic': defaultdict(int,
                         {'Republican': 4557, 'Democratic': 1048})})

### Reflections

_Write a little about what you see in the results_ 

The dictionary above breaks down the classifier's predictions on the tweet data. As the code comments note, the outer keys are the actual party of each tweet and the inner keys are the classifier's predictions. Of the 4,397 Republican tweets, 3,725 were classified as Republican and 672 as Democratic. Of the 5,605 Democratic tweets, 4,557 were classified as Republican and only 1,048 were correctly classified as Democratic.

The classifier is therefore heavily skewed toward Republican predictions: it assigns the Republican label to 82.8% of the scored tweets (8,282 of 10,002). The Democratic label was assigned to 15.28% (672/4,397) of Republican tweets and to only 18.70% (1,048/5,605) of Democratic tweets, so overall accuracy is 47.7% (4,773/10,002), slightly below a coin flip. Given that 23 of the top 25 most informative words were associated with Republican convention texts, this imbalance is not surprising.

Because there is far more tweet data than convention data, it might make sense to retrain the classifier on a portion of the tweets as well. That would likely produce more accurate predictions, and the most informative words would probably change too.
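
The row-wise rates can be recomputed directly from the counts in `results`. A minimal sketch, with the counts copied into a plain dictionary named `confusion` (a hypothetical name) and each cell divided by its row total:

```python
# Counts copied from the `results` output above: confusion[actual][predicted].
confusion = {
    "Republican": {"Republican": 3725, "Democratic": 672},
    "Democratic": {"Republican": 4557, "Democratic": 1048},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[p][p] for p in confusion)
accuracy = correct / total

# Recall per party: correct predictions divided by that party's actual tweets.
recall = {p: confusion[p][p] / sum(confusion[p].values()) for p in confusion}

# Share of all scored tweets that were labeled Republican.
predicted_rep = sum(row["Republican"] for row in confusion.values()) / total

print(f"tweets scored: {total}")
print(f"accuracy: {accuracy:.2%}")
for p, r in recall.items():
    print(f"recall for {p}: {r:.2%}")
print(f"predicted Republican: {predicted_rep:.2%}")
```

Dividing by the row total (all tweets that actually belong to a party) is what makes these figures recall rates; dividing one cell by another cell in the same row gives a ratio of counts instead.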