## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [40]:
import sqlite3
import nltk
import random
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
import string
from nltk.corpus import stopwords
from itertools import chain
import re
import matplotlib.pyplot as plt
from nltk.probability import FreqDist
nltk.download('stopwords')
nltk.download('punkt')
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/datascience/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/datascience/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [41]:
# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords
sw = stopwords.words("english")

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# Functions
def clean_tokenize(words):
    # Remove punctuation characters
    cleaned = words.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    cleaned = cleaned.lower()
    # Tokenize the text
    tokenized = cleaned.split()
    # Remove stopwords
    tokenized = [word for word in tokenized if word not in sw]
    return tokenized

def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    """
        Given a list of tokens, print number of tokens, number of unique tokens, 
        number of characters, lexical diversity (https://en.wikipedia.org/wiki/Lexical_diversity), 
        and num_tokens most common tokens. Return a list with the number of tokens, number
        of unique tokens, lexical diversity, and number of characters. 
    
    """

  
    num_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    lexical_diversity = num_unique_tokens/num_tokens
    num_characters = sum(len(token) for token in tokens)
    
    if verbose :        
        print(f"There are {num_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
    
        # print the five most common tokens
        
    return([num_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters])

def get_patterns(text, num_words):
    
    top_words = []
    text = token_normal(text)
 
    total_tokens = len(text)
    unique_tokens = len(set(text))
    avg_token_len = np.mean([len(w) for w in text])
    lex_diversity = unique_tokens/total_tokens

    top_words = FreqDist(text).most_common(num_words)
    results = {"tokens": total_tokens,
              "unique_tokens": unique_tokens,
              "avg_token_len":avg_token_len,
              "lex_diversity": lex_diversity,
              "top_words":top_words}
    return results

def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    text = re.findall(r'\b\w+\b|#\w+', text)
    return(text)

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

In [42]:
convention_db = sqlite3.connect("/Users/datascience/Desktop/Text Mining/Datasets/2020_Conventions.db") 
convention_cur = convention_db.cursor() 

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [43]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
                            ''' 
                            SELECT text, party 
                            FROM conventions
                            ''')

for row in query_results :
    text = row[0] # Return Text
    party = row[1] # Return Party
    
    ## Clean and tokenize the text 
    
    # Make Text lower case
    tokens = word_tokenize(text.lower())
    # Remove Stop Words
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    # Join tokenized words into a single string
    cleaned_text = ' '.join(tokens) 
    # Append Data together
    convention_data.append([cleaned_text, party]) # Append Data together

In [122]:
# Dataframe with tokenized text in a single string and party identification
convention_df = pd.DataFrame(convention_data, columns=["text", "party"])
convention_df

Unnamed: 0,text,party
0,skip content company careers press freelancers...,Democratic
1,calling full session quadrennial national conv...,Democratic
2,every four years come together reaffirm democr...,Democratic
3,fight perfect union fighting soul country live...,Democratic
4,must come together defeat donald trump elect j...,Democratic
...,...,...
2536,come together decry darkness light way forward...,Democratic
2537,way see big problems demand big solutions love...,Democratic
2538,thing single american fix country alone even p...,Democratic
2539,greetings archbishop elpidophoros greek orthod...,Democratic


Let's look at some random entries and see if they look right. 

In [108]:
random.choices(convention_data,k=10)

[['would say knew joe biden going woman vice president', 'Democratic'],
 ['american people forever indebted work front line', 'Democratic'],
 ['record breaking number women ran midterms', 'Democratic'],
 ['alarm went morning june real violent night louis four officers shot others hit rocks fireworks least five businesses damaged looted set fire officer wellness coordinator cit coordinator police department keenly aware rioting spent evening getting ready mobilize support officers impacted',
  'Republican'],
 ['land greatness', 'Republican'],
 ['maryland', 'Republican'],
 ['state sovereignty', 'Republican'],
 ['ladies gentlemen leaders fighters freedom liberty american dream best yet come',
  'Republican'],
 ['surprised happy', 'Democratic'],
 ['immediately upon taking office president trump changed things change threatened establishment establishment fought back democrat obstruction phony investigations dishonest media incredible stories negative president seen anything like despite ev

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur at least `word_cutoff` times. Here's the code to test that if you want it. 

In [146]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2236 as features in the model.


In [110]:
def conv_features(text, fw): 
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    
    # Initialize an empty dictionary to store the feature words
    ret_dict = dict()
    # Initialize a set to keep track of the words that have already been seen
    seen_words = set()
    # Split the text into a list of words
    words = text.split()
    # Iterate over the words in the text
    for word in words:
        # Check if the word is in the feature words and if it hasn't been seen before
        if word in fw and word not in seen_words:
            # If the word is in the feature words and hasn't been seen before, add it to the return dictionary
            ret_dict[word] = True
            # Add the word to the set of seen words
            seen_words.add(word)
    # Return the dictionary of feature words
    return ret_dict


In [123]:
# Quick Test to see if the function works
conv_features("this is a test",feature_words)

{'test': True}

In [124]:
# Quick Test to see if the function works
conv_features("fast fast brown fox",feature_words)

{'fast': True, 'brown': True}

The conv_features seems to work as intended

In [125]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [126]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [127]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [128]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494


In [129]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0

### Quick Poise About the Classifier

The "Most Informative Features" section displays the most significant words or phrases the classifier takes into account while making its predictions. For each feature, the results show the ratio of the likelihood of the feature appearing in documents classified as "Republican" compared to "Democrat". For instance, if the feature "china = True" is present, it is 27.1 times more likely to appear in documents classified as "Republican" rather than in documents classified as "Democrat".

### My Observations
The results indicate that certain words, like "china," "votes," "enforcement," "destroy," and "freedoms," play a significant role in determining political party affiliation. These words are up to 27 times more likely to show up in Republican or Democratic text. Other terms like "climate," "crime," and "media" also have significant ratios, but not as strong. On the other hand, there are words such as "beliefs," "defense," "isis," and "trade," that have an equal probability of appearing in both Republican and Democratic text, with a ratio of 13:1.

More specifically, it can be observed that the topic of climate is more likely to appear in texts associated with the Democratic party, which aligns with the party's known stance on the issue. However, the results also indicate an imbalance in the dataset as the majority of the top informative features are associated with the Republican party.


## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [117]:
cong_db = sqlite3.connect("/Users/datascience/Desktop/Text Mining/Datasets/congressional_data.db")
cong_cur = cong_db.cursor()

In [118]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [121]:
tweet_data = []
# Now fill up tweet_data with sublists like we did on the convention speeches.
for row in results :
    text = row[0] # Return Text
    party = row[1] # Return Party
    
    ## Clean and tokenize the text 
    
    # Make Text lower case
    tokens = word_tokenize(text.lower())
    # Remove Stop Words
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    # Join tokenized words into a single string
    cleaned_text = ' '.join(tokens) 
    # Append Data together
    tweet_data.append([cleaned_text, party]) # Append Data together

In [130]:
# Dataframe with tokenized text in a single string and party identification
tweet_df = pd.DataFrame(convention_data, columns=["tweet", "party"])
tweet_df

Unnamed: 0,tweet,party
0,skip content company careers press freelancers...,Democratic
1,calling full session quadrennial national conv...,Democratic
2,every four years come together reaffirm democr...,Democratic
3,fight perfect union fighting soul country live...,Democratic
4,must come together defeat donald trump elect j...,Democratic
...,...,...
2536,come together decry darkness light way forward...,Democratic
2537,way see big problems demand big solutions love...,Democratic
2538,thing single american fix country alone even p...,Democratic
2539,greetings archbishop elpidophoros greek orthod...,Democratic


There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [133]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)
tweet_data_sample

[['jimmy panetta', 'Democratic'],
 ['marcy kaptur', 'Democratic'],
 ['debbie wasserman schultz', 'Democratic'],
 ['dave brat', 'Republican'],
 ['antonio sabàto jr', 'Republican'],
 ['marcia fudge', 'Democratic'],
 ['scott peters', 'Democratic'],
 ['mariah phillips', 'Democratic'],
 ['jimmy panetta', 'Democratic'],
 ['lucille', 'Democratic']]

In [143]:
for tweet, party in tweet_data_sample :
    # Classifier to predict party
    estimated_party = classifier.classify(conv_features(tweet, feature_words))

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    if estimated_party == party:
        print("The classifier got it right!")
    else:
        print("The classifier got it wrong.")
    print("")


Here's our (cleaned) tweet: jimmy panetta
Actual party is Democratic and our classifer says Democratic.
The classifier got it right!

Here's our (cleaned) tweet: marcy kaptur
Actual party is Democratic and our classifer says Democratic.
The classifier got it right!

Here's our (cleaned) tweet: debbie wasserman schultz
Actual party is Democratic and our classifer says Democratic.
The classifier got it right!

Here's our (cleaned) tweet: dave brat
Actual party is Republican and our classifer says Republican.
The classifier got it right!

Here's our (cleaned) tweet: antonio sabàto jr
Actual party is Republican and our classifer says Democratic.
The classifier got it wrong.

Here's our (cleaned) tweet: marcia fudge
Actual party is Democratic and our classifer says Democratic.
The classifier got it right!

Here's our (cleaned) tweet: scott peters
Actual party is Democratic and our classifer says Democratic.
The classifier got it right!

Here's our (cleaned) tweet: mariah phillips
Actual par

In [145]:
# Print overall accuracy of the classifier 
actual_parties = [party for tweet, party in tweet_data_sample]
estimated_parties = [classifier.classify(conv_features(tweet, feature_words)) 
                     for tweet, party in tweet_data_sample]

accuracy = sum(actual == estimated for actual, estimated 
               in zip(actual_parties, estimated_parties)) / len(tweet_data_sample)

print(f"The overall accuracy of the classifier is: {accuracy:.2f}")


The overall accuracy of the classifier is: 0.90


Now that we've looked at it some, let's score a bunch and see how we're doing.

In [153]:
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp    
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    results[party][estimated_party] += 1
    
    if idx > num_to_score:
        break


In [154]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 656, 'Democratic': 3726}),
             'Democratic': defaultdict(int,
                         {'Republican': 692, 'Democratic': 4928})})

In [161]:
# Store the predictions in a dataframe
predictions = []
conv_features_sets = [(conv_features(tweet, feature_words), party) 
                      for (tweet, party) in tweet_data]
for tweet, party in tweet_data_sample:
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    predictions.append((tweet, party, estimated_party))
classifier_df = pd.DataFrame(predictions, 
                             columns=["tweet", "actual_party", "predicted_party"])
# Print the accuracy of the classifier using nltk.classify.accuracy
print("Accuracy of the classifier: ", 
      nltk.classify.accuracy(classifier, conv_features_sets))


Accuracy of the classifier:  0.5640556919669724


In [160]:
classifier_df

Unnamed: 0,tweet,actual_party,predicted_party
0,jimmy panetta,Democratic,Democratic
1,marcy kaptur,Democratic,Democratic
2,debbie wasserman schultz,Democratic,Democratic
3,dave brat,Republican,Republican
4,antonio sabàto jr,Republican,Democratic
5,marcia fudge,Democratic,Democratic
6,scott peters,Democratic,Democratic
7,mariah phillips,Democratic,Democratic
8,jimmy panetta,Democratic,Democratic
9,lucille,Democratic,Democratic


### Reflections

The results of the classifier show that the overall accuracy of the classifier is 56.41%. This means that out of all the tweets the classifier tried to classify, 56.41% were correctly classified. From the results of the classifier, we see that the classifier had a higher accuracy in classifying tweets as 'Democratic' (4928 tweets) than 'Republican' (656 tweets). This could be because of the imbalance in the dataset, where there were more 'Democratic' tweets than 'Republican' tweets. The results also show that there were 3726 'Democratic' tweets classified as 'Republican' and 692 'Republican' tweets classified as 'Democratic.' It could be that the classifier has learned certain words that are common in both the 'Republican' and 'Democratic' tweets, leading to incorrect classifications.

Considering the imbalance in the dataset, it is important to test for other metrics such as specificity or recall to get a more comprehensive understanding of the performance of the classifier. Despite this, the accuracy of 56.41% slightly beats a random model (coin toss 50/50) and can be considered a good start for the classifier. However, more work can be done to improve its performance, especially when it comes to handling imbalanced datasets.