<a href="https://colab.research.google.com/github/dranzer120/tm-nb-conventions/blob/main/Assignment_4_Political_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [5]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
# from text_functions_solutions import clean_tokenize, get_patterns

In [6]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
# Imports and NLTK resources used by the preprocessing functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
convention_db = sqlite3.connect("/content/drive/MyDrive/Colab Notebooks/Assignment_4/2020_Conventions.db")
convention_cur = convention_db.cursor()

In [9]:
# Look at what tables are in the database
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = convention_cur.fetchall()
print(tables)

[('conventions',)]


In [10]:
# Look at the columns in the conventions table
convention_cur.execute("PRAGMA table_info(conventions);")
columns = convention_cur.fetchall()
print(columns)

[(0, 'party', 'TEXT', 0, None, 0), (1, 'night', 'INTEGER', 0, None, 0), (2, 'speaker', 'TEXT', 0, None, 0), (3, 'speaker_count', 'INTEGER', 0, None, 0), (4, 'time', 'TEXT', 0, None, 0), (5, 'text', 'TEXT', 0, None, 0), (6, 'text_len', 'TEXT', 0, None, 0), (7, 'file', 'TEXT', 0, None, 0)]


### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.
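As a rough illustration of the cleaning step (lowercase, keep alphabetic tokens, drop stopwords), here is a minimal sketch that runs without any NLTK downloads. It is an approximation, not the notebook's method: the regex stands in for `word_tokenize` plus `isalpha()`, and `TOY_STOP_WORDS` and `toy_preprocess` are hypothetical names with a tiny hand-picked stopword list, so results will differ from NLTK on contractions and similar cases.

```python
import re

# Tiny illustrative stopword list (an assumption); the notebook uses
# NLTK's much larger English list.
TOY_STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "we"}

def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens only, drop stopwords."""
    # Runs of ASCII letters stand in for word_tokenize + isalpha().
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

example = toy_preprocess("We believe in the promise of America, in 2020!")
print(example)
```

The result is a single space-separated string, which is the shape the later cells expect when they call `text.split()`.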

In [21]:
convention_data = []

# fill this list up with items that are themselves lists. The
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party.

query_results = convention_cur.execute(
                            '''
                            SELECT text, party
                            FROM conventions
                            ''')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = ' '.join([word.lower() for word in word_tokenize(text) if word.isalpha()])
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

for row in query_results:
    speech_text, party = row
    cleaned_text = preprocess_text(speech_text)
    convention_data.append([cleaned_text, party])



Let's look at some random entries and see if they look right.

In [23]:
random.choices(convention_data,k=10)

[['joe biden said living season darkness president trump said joe biden sees american darkness see american greatness challenging times country needs president believes america believes boundless capacity american people meet challenge defeat foe defend freedoms hold dear america needs four years president donald trump white house go allow say word families communities path hurricane laura prayers tonight administration working closely authorities states impacted fema mobilized resources supplies harm way',
  'Republican'],
 ['hi mike pompeo speaking beautiful jerusalem looking old city big job susan husband nick dad susan nick safe freedoms secure president trump put america first vision action may made popular every foreign capital worked president trump understands great fellow kansan president eisenhower said cherish justly desire children securing peace first indeed primary constitutional function national government ensuring family mine safe enjoy freedom live work learn worship 

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.
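To see how the cutoff shrinks the vocabulary, here is a small sketch on a toy corpus. It uses `collections.Counter` rather than `nltk.FreqDist`, and `select_feature_words` and `toy_texts` are hypothetical names for illustration. Like the cell that follows, it uses a strict `>` comparison, so a word must occur more than `cutoff` times to be kept.

```python
from collections import Counter

def select_feature_words(texts, cutoff):
    """Return the words whose corpus-wide count is strictly greater than cutoff."""
    counts = Counter(w for t in texts for w in t.split())
    return {w for w, c in counts.items() if c > cutoff}

toy_texts = ["jobs jobs economy", "economy jobs climate", "climate jobs"]

# "jobs" occurs 4 times; "economy" and "climate" occur twice each.
print(select_feature_words(toy_texts, cutoff=1))
print(select_feature_words(toy_texts, cutoff=2))
```

Raising the cutoff trades vocabulary size for robustness: rare words are dropped, so the model has fewer features estimated from very few examples.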

In [24]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2236 as features in the model.


In [25]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.

       Args:
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word
            in `text` must be in fw in order to be returned. This
            prevents us from considering very rarely occurring words.

       Returns:
            A dictionary with the words in `text` that appear in `fw`.
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of
            {'quick' : True,
             'fox' :    True}

    """

    # Your code here

    tokens = text.split()

    ret_dict = dict()

    # Iterate over the tokens and check if they are in feature_words
    for word in set(tokens):
        if word in fw:
            ret_dict[word] = True

    return(ret_dict)

In [26]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.
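For intuition about what a Naive Bayes classifier does with these word-presence dictionaries, here is a minimal from-scratch sketch. It is an assumption-laden toy, not NLTK's implementation: `train_toy_nb` and `classify_toy_nb` are hypothetical helpers, the smoothing is simple add-one, and only words present in the input are scored. NLTK's own estimator and smoothing differ in detail.

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(labeled_features):
    """labeled_features: list of (dict of word -> True, label) pairs."""
    label_counts = Counter(label for _, label in labeled_features)
    word_counts = defaultdict(Counter)  # label -> word -> number of texts
    for feats, label in labeled_features:
        word_counts[label].update(feats.keys())
    return label_counts, word_counts

def classify_toy_nb(feats, label_counts, word_counts):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)
        for word in feats:
            # Add-one smoothing over the two outcomes (present / absent).
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_train = [
    ({"climate": True, "votes": True}, "Democratic"),
    ({"climate": True}, "Democratic"),
    ({"china": True, "crime": True}, "Republican"),
    ({"china": True}, "Republican"),
]
model = train_toy_nb(toy_train)
print(classify_toy_nb({"china": True}, *model))
print(classify_toy_nb({"climate": True}, *model))
```

The "naive" part is the sum inside the loop: each word contributes independently to the score, with no interaction between words.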

In [27]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [28]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [29]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494


In [30]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations
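Based on the output above, "votes" and "climate" are the only listed words more strongly associated with Democrats. Every other word shown (for example "china", "enforcement", "crime", and "defund") leans Republican. The strongest single signal is "china", which is about 27 times as likely to appear in a Republican text as in a Democratic one.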
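Each row of the "Most Informative Features" output reports a ratio: how much more likely the feature is under one label than under the other. The sketch below computes that kind of ratio on made-up data. `presence_ratio`, the toy documents, and the `smoothing` value of 0.5 are assumptions for illustration, and the numbers it produces are not comparable to the notebook's output.

```python
def presence_ratio(word, docs_a, docs_b, smoothing=0.5):
    """Ratio of smoothed P(word present | A) to smoothed P(word present | B)."""
    p_a = (sum(word in d for d in docs_a) + smoothing) / (len(docs_a) + 2 * smoothing)
    p_b = (sum(word in d for d in docs_b) + smoothing) / (len(docs_b) + 2 * smoothing)
    return p_a / p_b

# Made-up documents, each represented as the set of words it contains.
rep_docs = [{"china", "trade"}, {"china", "crime"}, {"freedom"}, {"china"}]
dem_docs = [{"climate", "votes"}, {"votes"}, {"climate"}, {"health"}]

china_ratio = presence_ratio("china", rep_docs, dem_docs)
climate_ratio = presence_ratio("climate", rep_docs, dem_docs)
print(round(china_ratio, 2), round(climate_ratio, 2))
```

A ratio above 1 means the word favors the first group and a ratio below 1 means it favors the second. The smoothing keeps the ratio finite when a word never appears in one group.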




## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.
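If the wait becomes a nuisance, one option is to add an index on the join columns in a copy of the database. The sketch below demonstrates the idea on an in-memory stand-in. The table and column names come from the join in the query, while the rows and the index name `idx_tweets_join` are made up. Whether this helps on the real file, and by how much, is an assumption that would need to be checked there.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are made up.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE candidate_data "
    "(candidate TEXT, party TEXT, district TEXT, twitter_handle TEXT)"
)
cur.execute(
    "CREATE TABLE tweets "
    "(handle TEXT, candidate TEXT, district TEXT, tweet_text TEXT)"
)
cur.execute(
    "INSERT INTO candidate_data VALUES ('A. Example', 'Democratic', 'XX-01', 'aexample')"
)
cur.execute(
    "INSERT INTO tweets VALUES ('aexample', 'A. Example', 'XX-01', 'hello voters')"
)

# Hypothetical index name; it covers the three columns used in the join.
cur.execute("CREATE INDEX idx_tweets_join ON tweets (handle, candidate, district)")

rows = cur.execute(
    """
    SELECT cd.candidate, cd.party, tw.tweet_text
    FROM candidate_data cd
    INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
        AND cd.candidate = tw.candidate
        AND cd.district = tw.district
    """
).fetchall()
index_names = [
    r[0] for r in cur.execute("SELECT name FROM sqlite_master WHERE type='index'")
]
print(rows)
con.close()
```

Storing the query result in a list, as the cell below does, remains the simplest way to avoid rerunning the join.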

In [31]:
cong_db = sqlite3.connect("/content/drive/MyDrive/Colab Notebooks/Assignment_4/congressional_data.db")
cong_cur = cong_db.cursor()

In [32]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT
                  cd.candidate,
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
               AND cd.candidate == tw.candidate
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic')
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [34]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

# Preprocess function to clean and tokenize tweets
stop_words = set(stopwords.words('english'))

def preprocess_tweet_text(text):
    # Decode byte string to regular string
    text = text.decode('utf-8')
    text = ' '.join([word.lower() for word in word_tokenize(text) if word.isalpha()])
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Process the results and store preprocessed data in tweet_data
for row in results:
    candidate, party, tweet_text = row
    cleaned_tokens = preprocess_tweet_text(tweet_text)
    tweet_data.append([cleaned_tokens, party])

print(f"Processed {len(tweet_data)} tweets.")


Processed 1221141 tweets.


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...
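One detail about the sampling call used in the next cell: `random.choices` draws with replacement, so the same item can be picked more than once, while `random.sample` draws without replacement. With over a million tweets and `k=10` a repeat is very unlikely, but the difference matters for small populations. A quick sketch on a toy list (the seed value here is arbitrary):

```python
import random

population = list(range(5))

random.seed(0)  # arbitrary seed, only for repeatability of the illustration
with_replacement = random.choices(population, k=5)     # may contain repeats
without_replacement = random.sample(population, k=5)   # each item at most once

print(with_replacement)
print(without_replacement)
```

Because `random.sample` with `k` equal to the population size returns every item exactly once, it is effectively a shuffle. `random.choices` gives no such guarantee.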

In [35]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [36]:

for tweet, party in tweet_data_sample :
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    # Fill in the right-hand side above with code that estimates the actual party

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")


Here's our (cleaned) tweet: mass shooting las vegas horrific act violence victims families thoughts prayers
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: early morning traveltuesday leaving dc http
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: moderates iraq amp syria civilians enemies sides conflict assist either
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: rt natsecaction national security veterans demanding answers release confidential national security
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: buildthatwall https
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: glad attend assure everyone could majority americans still stand traditional allies
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: cnn everyone wraps flag pa

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [37]:
# dictionary of counts by actual party and estimated party.
# first key is actual, second is estimated
# Note: this rebinds `results`, discarding the query rows stored earlier.
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp
    # Now do the same thing as above, but we store the results rather
    # than printing them.
    tweet_features = conv_features(tweet, feature_words)

    # get the estimated party
    estimated_party = classifier.classify(tweet_features)

    results[party][estimated_party] += 1

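    # Note: this check runs after scoring, so the loop scores
    # num_to_score + 2 tweets (idx 0 through num_to_score + 1).
    if idx > num_to_score :
        break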

In [38]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3663, 'Democratic': 477}),
             'Democratic': defaultdict(int,
                         {'Republican': 5090, 'Democratic': 772})})

### Reflections

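The classifier is heavily biased toward predicting Republican: of the 10,002 tweets scored, it labeled 8,753 (about 87.5%) as Republican. It gets about 88.5% of Republican tweets right (3,663 of 4,140) but only about 13% of Democratic tweets (772 of 5,862), for an overall accuracy of about 44%. This matches the small sample above, where the classifier predicted Republican for 8 of the 10 tweets.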
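The bias can be quantified directly from the counts. The sketch below copies the four numbers from the `results` output above into a plain dictionary and computes overall accuracy, per-party recall, and the share of predictions going to each party. `summarize` and `confusion` are hypothetical names for illustration.

```python
# Counts copied from the `results` output: first key is the actual party,
# second key is the predicted party.
confusion = {
    "Republican": {"Republican": 3663, "Democratic": 477},
    "Democratic": {"Republican": 5090, "Democratic": 772},
}

def summarize(confusion):
    """Compute total, accuracy, per-label recall, and predicted-label shares."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[p][p] for p in confusion)
    recall = {p: confusion[p][p] / sum(confusion[p].values()) for p in confusion}
    predicted_share = {
        p: sum(confusion[actual][p] for actual in confusion) / total
        for p in confusion
    }
    return {
        "total": total,
        "accuracy": correct / total,
        "recall": recall,
        "predicted_share": predicted_share,
    }

summary = summarize(confusion)
print(summary["total"])
print(round(summary["accuracy"], 3))
print({p: round(r, 3) for p, r in summary["recall"].items()})
print({p: round(s, 3) for p, s in summary["predicted_share"].items()})
```

Looking at recall per party rather than accuracy alone makes the imbalance visible: a classifier that almost always answers "Republican" will score well on Republican tweets and poorly on Democratic ones.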