# Democratic Candidates
This notebook analyzes the sentiment of tweets posted about the Democratic candidates. There are a few ideas that I need to consider, for instance:

1. The number of positive and negative tweets posted for each candidate
2. The proportion of positivity and negativity
3. How these change over time and possibly after each debate or major event
4. Is there any relationship between tweet sentiment and the polls?
5. The location of the sentiments, broken down by state, possibly focusing on the swing states.
6. We can expand the analysis beyond the tweets and to the *users/voters*.
7. Make a word cloud for the tweets about each candidate?
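
Ideas 1 and 2 can be sketched as below. This is only an illustration: the `scored` pairs are made-up data, `label` and `summarize` are hypothetical helper names, and the ±0.05 cutoff is the threshold commonly used with VADER compound scores, not something tuned for this project.

```python
from collections import Counter, defaultdict

# Hypothetical (candidate, compound score) pairs; real scores would come
# from the sentiment pipeline developed later in this notebook.
scored = [
    ("bernie sanders", 0.42),
    ("bernie sanders", -0.31),
    ("joe biden", 0.0),
    ("joe biden", 0.6),
]

def label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse pos/neu/neg label."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

def summarize(scored):
    """Return {candidate: {label: (count, proportion)}}."""
    counts = defaultdict(Counter)
    for candidate, compound in scored:
        counts[candidate][label(compound)] += 1
    summary = {}
    for candidate, c in counts.items():
        total = sum(c.values())
        summary[candidate] = {k: (c[k], c[k] / total) for k in ("pos", "neu", "neg")}
    return summary

summary = summarize(scored)
```

Grouping the same counts by date instead of by candidate alone would give the time series needed for idea 3.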

## Twitter Data
I start by pulling some tweets from Twitter about the major candidates, such as Elizabeth Warren, Bernie Sanders, and Joe Biden. For this, I used *tweepy*.

In [3]:
# fetch api keys from api_keys.py

from api_keys import *

query = {'@BernieSanders': 'Bernie Sanders',     # Bernie
         '@ewarren':       'Elizabeth Warren',   # Elizabeth
         '@KamalaHarris':  'Kamala Harris',      # Kamala
         '@PeteButtigieg': 'Pete Buttigieg',     # Pete
         '@JoeBiden':      'Joe Biden',          # Joe Biden
         }

# make everything lowercase for consistency
query = {k.lower(): v.lower() for k, v in query.items()}

mentions = query.keys()
names = query.values()

# Data Cleaning

Things that need to be taken care of in data cleaning:

* Remove links (https, etc.).
* Remove punctuation except dots and commas, so that we can later break the text down accordingly. Or maybe *tokenize* does that automatically? Need to check!
* Remove special characters
* Remove numbers?
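
A minimal regex-only sketch of the checklist above, as a baseline to compare against the library-based cleaner. The names `URL_RE`, `SPECIAL_RE`, and `basic_clean` are hypothetical, and the set of characters kept (letters, digits, dots, commas, apostrophes, `@`, `#`) is an assumption made for illustration.

```python
import re

# Links: anything starting with http://, https://, or www.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
# Everything except letters, digits, whitespace, dots, commas, apostrophes,
# and the @ / # markers counts as a "special character" here.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,@#']")

def basic_clean(text, drop_numbers=False):
    """Strip links and special characters; optionally drop standalone numbers."""
    text = URL_RE.sub(' ', text)
    text = SPECIAL_RE.sub(' ', text)
    if drop_numbers:
        # \b keeps digits that are part of a word, e.g. a hashtag like #Debate2020
        text = re.sub(r'\b\d+\b', ' ', text)
    # collapse the runs of spaces left behind by the substitutions
    return ' '.join(text.split())

sample = "Vote now!!! https://t.co/abc #Debate2020 @ewarren, 100%"
cleaned = basic_clean(sample)
cleaned_no_numbers = basic_clean(sample, drop_numbers=True)
```

Dots and commas are kept on purpose so that a sentence tokenizer still has sentence boundaries to work with.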

In [4]:
# clean the tweet
import re
import string
from textblob import TextBlob
import preprocessor as pp
#nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from contractions import contractions

# only remove URLs, reserved words, emojis, and smileys; preserve hashtags and mentions
pp.set_options(pp.OPT.URL, pp.OPT.RESERVED, pp.OPT.EMOJI, pp.OPT.SMILEY)
# save stop words to be removed
stop_words = set(stopwords.words('english'))
# nltk tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# For the contractions
contractions_re = re.compile('(%s)' % '|'.join(contractions.keys()))

# remove mentions if not a candidate
def remove_mentions(tweet, mentions):
    words = tweet.split()
    clean_words = []
    for w in words:
        if w.startswith('@'):
            candidate_mentioned = False
            for m in mentions:
                if m in w:
                    candidate_mentioned = True
                    clean_words.append(w.replace(m, query[m].lower()))
                    break
            if not candidate_mentioned:
                clean_words.append(w[1:]) # remove @
        else:
            clean_words.append(w)
    return ' '.join(clean_words)

# expand hashtags
def fix_hashtags(tweet):
    # first replace underscore with space
    tweet = tweet.replace('_', ' ')
    words = tweet.split()
    clean_words = []
    for w in words:
        if w.startswith('#'):
            w = ' '.join([a for a in re.split('([A-Z][a-z]+)', w[1:]) if a])
        clean_words.append(w)
    return ' '.join(clean_words)

def expand_contractions(tweet):
    def replace(match):
        # expand the contraction with the most likely alternative: [0]
        return contractions[match.group(0)][0]
    return contractions_re.sub(replace, tweet)
    
def clean_tweet(tweet):
    # remove URLs, reserved words (RT, FAV, etc.), emojis, and smileys
    # (numbers are kept: pp.OPT.NUMBER is not set above)
    # preserve mentions and hashtags for now
    tweet = pp.clean(tweet)
    # fix hashtags
    tweet = fix_hashtags(tweet)
    # make the tweet lowercase
    tweet = tweet.lower()
    # now remove mentions that are not the candidates
    tweet = remove_mentions(tweet, mentions)
    # convert U+2019 to U+0027 (apostrophe)
    tweet = tweet.replace(u"\u2019", u"\u0027")
    # expand the contractions
    tweet = expand_contractions(tweet)
    # remove 's
    tweet = tweet.replace("'s",'')
    #replace consecutive non-ASCII characters with a space
    tweet = re.sub(r'[^\x00-\x7F]+',' ', tweet)
    # break into sentences
    # tb = TextBlob(tweet)
    sentences = []
    for sent in tokenizer.tokenize(tweet):
#    for sent in tb.sentences: # for this punkt package of nltk has to be downloaded once
#                              # with the following code:
#                              # import nltk
#                              # nltk.download('punkt')
        sent = str(sent)
        # replace punctuation with spaces
        sent = sent.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
        # consolidate white spaces
        sent = ' '.join(sent.split())
        if len(sent) > 4: # if the sentence is larger than 4 chars
            sentences.append(sent)
    return sentences
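
The `contractions` module imported above is a local file that is not shown here. Judging from `contractions[match.group(0)][0]`, it seems to map each contraction to a list of expansions with the most likely one first. The sketch below rebuilds that step with a small stand-in dictionary (an assumption, not the real file) and escapes the keys before joining them into a pattern, which guards against keys containing regex metacharacters.

```python
import re

# Hypothetical stand-in for the local `contractions` module:
# contraction -> list of expansions, most likely first.
contractions = {
    "won't": ["will not"],
    "i'm": ["i am"],
    "he's": ["he is", "he has"],
}

# Escape each key so it is matched literally; try longer keys first.
pattern = re.compile(
    '|'.join(re.escape(k) for k in sorted(contractions, key=len, reverse=True))
)

def expand(text):
    """Replace every known contraction with its first listed expansion."""
    return pattern.sub(lambda m: contractions[m.group(0)][0], text)

expanded = expand("i won't elaborate on this for some time")
```

Because the lookup is case-sensitive, this only works after the tweet has been lowercased and curly apostrophes have been normalized, which is the order `clean_tweet` uses.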

# Sentiment Analysis
Each tweet might comprise multiple sentences. Therefore each tweet must be broken down into sentences, which `clean_tweet` does with the NLTK *punkt* sentence tokenizer (the *TextBlob* alternative is commented out). The final sentiment can be a function of the sentiments of the individual sentences, perhaps their average.
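
One way to combine sentence scores into a tweet-level score is a plain or weighted average. The function name `tweet_polarity` and the idea of weighting by sentence length are assumptions for illustration; whether averaging is the right aggregate is still an open question.

```python
from statistics import mean

def tweet_polarity(sentence_scores, weights=None):
    """Combine per-sentence polarity scores into one tweet-level score.

    With no weights this is the plain average; with weights (for example,
    the number of words in each sentence) it is a weighted average.
    """
    if not sentence_scores:
        return 0.0  # a tweet with no usable sentences is treated as neutral
    if weights is None:
        return mean(sentence_scores)
    total = sum(weights)
    return sum(s * w for s, w in zip(sentence_scores, weights)) / total

# Two sentence-level compound scores, e.g. 0.0772 and 0.0
plain = tweet_polarity([0.0772, 0.0])
weighted = tweet_polarity([0.5, -0.5], weights=[3, 1])
```

A plain average lets a short neutral sentence dilute a strongly worded one, which is why a length-weighted variant may be worth comparing.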

In [104]:
from textblob import TextBlob

# This needs more work to be more accurate. Some ideas:
#    1. Don't remove emoticons and use vader
#    2. Clean stopwords and everything else, and use out-of-the-box TextBlob
#    3. Read papers on political sentiment analysis with twitter
def get_sentiment(text, mode = 'textblob'):
    if mode == 'textblob':
        testimonial = TextBlob(text)
        return {'Pol': testimonial.sentiment.polarity,
                'Subj': testimonial.sentiment.subjectivity}
    elif mode == 'nltk':
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        sid = SentimentIntensityAnalyzer()
        return sid.polarity_scores(text)
    elif mode == 'api':
        import requests
        # API endpoint
        url = "http://text-processing.com/api/sentiment/"
        params = {'text': text}
        r = requests.post(url=url, data=params)
        data = r.json()
        return data['probability']
    elif mode == 'vader':
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
        return analyzer.polarity_scores(text)

# Tweet and Sentence Objects
Let's make some classes and methods for tweets and sentences that clean the tweets and get their sentiment.

In [105]:
# sentence object
class Sentence:
    def __init__(self, text):
        self.text = text
        self.sentiment = []
        self.sentimentize()
    
    # calculate sentiment for the sentence
    def sentimentize(self):
        for mode in ['nltk', 'vader', 'textblob', 'api']:
            self.sentiment.append((mode, get_sentiment(self.text, mode)))
            
    def __str__(self):
        import json
        sentiment_str = ''
        for s in self.sentiment:
            sentiment_str += json.dumps(s) + '\n'
        return '%s >>>>> \n%s' % (self.text, sentiment_str)
    
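    def __repr__(self):
        # self.sentiment is a list of (mode, scores) pairs; report the TextBlob entry
        scores = dict(self.sentiment).get('textblob', {})
        return '%s >>>>> Pol: %.1f (Sub: %.1f)' % (self.text, scores.get('Pol', 0.0), scores.get('Subj', 0.0))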
        
# tweet object
class Tweet:
    def __init__(self, text):
        self.text = text
        self.sentencize()
        
    # clean tweet and break down sentences
    def sentencize(self):
        self.sentences = [Sentence(t) for t in clean_tweet(self.text)]
    
    def disp(self):
        print('********************')
        print(self.text)
        print('====================')
        for sentence in self.sentences:
            print(sentence)
            print('--------------------')
        print('********************')
    

# Tweet downloader
This downloads tweets to test the cleaning and sentiment analysis.

In [106]:
# REST API
import tweepy
import sys

def get_sample_tweets(query_phrase, tweet_count):
    # authorization
    auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit=True,           # wait until the limit is replenished
                           wait_on_rate_limit_notify=True)    # reply with a message if the limit is reached

    # check if not authorized
    if (not api):
        print ("Can't Authenticate")
        return

    tweets = []
    for status in tweepy.Cursor(api.search, q = query_phrase,
                                        tweet_mode = 'extended',
                                        lang = 'en').items(tweet_count):
        try:
            # retweets carry the untruncated text under 'retweeted_status'
            full_text = status._json['retweeted_status']['full_text']
        except KeyError:
            full_text = status._json['full_text']
        
        tweets.append(Tweet(full_text))
    return tweets

# Test
Now let's download some tweets and test the *cleaning* and *sentiment analysis*.

In [109]:
tweets = get_sample_tweets(query_phrase = 'bernie sanders', tweet_count = 1)

for tweet in tweets:
    tweet.disp()


********************
There are a number of similarities between Bernie Sanders and Hulk Hogan. I won’t elaborate on this for some time...
there are a number of similarities between bernie sanders and hulk hogan >>>>> 
["nltk", {"neg": 0.0, "neu": 0.885, "pos": 0.115, "compound": 0.0772}]
["vader", {"neg": 0.0, "neu": 0.885, "pos": 0.115, "compound": 0.0772}]
["textblob", {"Pol": 0.0, "Subj": 0.0}]
["api", {"neg": 0.37125424236150817, "neutral": 0.8253129065161494, "pos": 0.6287457576384918}]

--------------------
i will not elaborate on this for some time >>>>> 
["nltk", {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}]
["vader", {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}]
["textblob", {"Pol": -0.25, "Subj": 1.0}]
["api", {"neg": 0.5175747047924295, "neutral": 0.04378182270699219, "pos": 0.48242529520757044}]

--------------------
********************


# Streaming

Now let's stream the tweets in real time and process them.

In [118]:
## STREAMING

import tweepy
import json
from tweepy import OAuthHandler
from tweepy import API
from tweepy import Stream
auth = OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACS_TOKEN, ACS_SECRET)
api = API(auth, wait_on_rate_limit=True,
                wait_on_rate_limit_notify=True)

if (not api):
    print ("Can't Authenticate")
    sys.exit(-1)
# Continue with rest of code

# override tweepy.StreamListener to add custom logic to on_data
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, max_count = 5):
        super().__init__()
        self.count = 0
        self.max_count = max_count

    def on_data(self, data):
        all_data = json.loads(data)
        print('*************')
        print(self.count)
        print('*************')
        #print('-----------')
        print(all_data['text'])
        print('||||||||')
        try:
            if 'retweeted_status' in all_data:
                try:
                    text = all_data['retweeted_status']['extended_tweet']['full_text']
                except:
                    text = all_data['retweeted_status']['text']
            else:
                try:
                    text = all_data['extended_tweet']['full_text']
                except:
                    text = all_data['text']
            tweet = Tweet(text)
            tweet.disp()
        except:
            print('New structure!')
        
        self.count += 1
        # stop once max_count tweets have been processed
        if self.count >= self.max_count:
            return False
        return True

    def on_error(self, status):
        print('ERR!!')
        print(status)
        if status == 420:
            return False
        
    def on_status(self, status):
        print(status.text)

In [120]:
max_count = 1      
myStreamListener = MyStreamListener(max_count = max_count)
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

try:
    myStream.filter(track = mentions, languages=['en'])
except KeyboardInterrupt:
    print("Stopped.")
finally:
    print('Done.')
    myStream.disconnect()

*************
0
*************
RT @NancyJKoch: Ilhan Omar ended Joe Biden’s Presidential campaign with one confession - Well of course!! At this point in time ⁦@Ilhan⁩ no…
||||||||
********************
Ilhan Omar ended Joe Biden’s Presidential campaign with one confession - Well of course!! At this point in time ⁦@Ilhan⁩ not sure if this helps or hurts ⁦⁦@JoeBiden⁩ run. Pretty much another nail in the coffin💥 https://t.co/uMjtEx2yP4
ilhan omar ended joe biden presidential campaign with one confession well of course >>>>> 
["nltk", {"neg": 0.0, "neu": 0.851, "pos": 0.149, "compound": 0.2732}]
["vader", {"neg": 0.0, "neu": 0.851, "pos": 0.149, "compound": 0.2732}]
["textblob", {"Pol": 0.0, "Subj": 0.0}]
["api", {"neg": 0.3191407847009151, "neutral": 0.7887384300136802, "pos": 0.6808592152990849}]

--------------------
at this point in time ilhan not sure if this helps or hurts joebiden run >>>>> 
["nltk", {"neg": 0.257, "neu": 0.61, "pos": 0.132, "compound": -0.3532}]
["vader", {"neg": 0.