# Sentiment and Emotion Analysis of Twitter Feeds - Week 8 Assignment

I picked the Monero crypto-currency as a topic, and specifically tried to target the Hard Fork that occured on March 9th, 2019. This was a contencious period and I expected to see a lot of emotion in tweets.

Challenges:

1) The crypto-currency topic on twitter draws a lot of 'spam' tweets that made it difficult to focus on my chosen topic. If I was doing this again, I would spend more time identifying and removing accounts that create spam.  
2) The Tweepy API only allowed me to get tweets for the previous week. Ideally, I would have data back into February 2019 to capture more of the lead up to the March 9th date. In the future, I would plan ahead better, or pick a topic that wasn't time constrained.  
3) The Twitter API gave me access to the last 30 days of tweets, but it started with the most recent date and only allowed me to pull 18,000 tweets (since I wasted 6,000+ of my monthly quota testing and debugging). In the future, I would have a better tested starting point, or pay for higher quota limits.


In [23]:
import tweepy
from searchtweets import ResultStream, gen_rule_payload, load_credentials
import re
import time
import pickle
import os
import math
from datetime import date
from collections import defaultdict
import pandas as pd
import numpy as np
from nltk.corpus import stopwords

### Declare the Twitter and Tweepy APIs for getting Tweets

Each API class follows the same WORM (Write Once Read Many) philosophy and immediately writes the data from the API to a pickle

In [24]:
class TwitterAPI(object): 
    def __init__(self, query):
        
        try:
            self.raw_path = os.path.join(os.getcwd(), 'Searchtweets Raw')
            if not os.path.exists(self.raw_path):
                os.makedirs(self.raw_path)
        except Exception as e:
            print("Error setting the path for Raw Pickles\n" + str(e))
            
        # Get creds from yaml file in cwd (yaml file not provided in assignment submission)
        try:
            self.premium_search_args = load_credentials("twitter_keys.yaml",
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)
        except Exception as e:
            print("Error setting the search arguments\n" + str(e))
            
        # Create the rule to use in the call
        try:
            self.rule = gen_rule_payload(query, results_per_call=100) # sandbox accounts limited to 100
        except Exception as e:
            print("Error setting the query rule logic\n" + str(e))
            
        # Create the stream object that will pull the tweets
        try:
            self.rs = ResultStream(rule_payload=self.rule,
                      max_results=18000,
                      max_pages=1000,
                      **self.premium_search_args)
        except Exception as e:
            print("Error setting the result stream logic\n" + str(e))
                
    def get_tweets(self):
        
        file_label = date.today().isoformat().replace("-","_") + '_'
        
        tweets = []
        
        # Use the Twitter result stream object to bring tweets in
        for i, tweet in enumerate(self.rs.stream(), 1):
            
            tweets.append(tweet)
            
            # Every 1000 tweets, we need to save them off to a pickle and pause so that we don't exceed rate limits
            if i % 1000 == 0:
                
                # only 10 calls per second allowed - 10 calls * 100 tweets per call = 1000 tweets
                # only 30 calls per minute allowed - 30 calls * 100 tweets per call = 3000 tweets
                # to be safe ... we will use 1000 every 30 seconds
                time.sleep(30)

                file_name = file_label + str(i) + '.pickle'
                with open(os.path.join(self.raw_path, file_name), 'wb') as f:
                    pickle.dump(tweets, f, pickle.HIGHEST_PROTOCOL)
                    
                tweets.clear()
        
        # Capture any left over tweets that didn't make it to 1000
        if len(tweets) > 0:

            file_name = file_label + str(i) + '.pickle'
            with open(os.path.join(self.raw_path, file_name), 'wb') as f:
                pickle.dump(tweets, f, pickle.HIGHEST_PROTOCOL)

                
class TweepyAPI(object): 
    def __init__(self):
        # Provide the Twitter developer creds here
        consumer_key = 'xxxx'
        consumer_secret = 'xxxx'
        access_token = 'xxxx'
        access_token_secret = 'xxxx'
        
        try:
            self.raw_path = os.path.join(os.getcwd(), 'Tweepy Raw')
            if not os.path.exists(self.raw_path):
                os.makedirs(self.raw_path)
        except Exception as e:
            print("Error setting the path for Raw Pickles\n" + str(e))
            
        # Set up the Tweepy API connection, including the rate limit handling
        try:
            auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
            auth.set_access_token(access_token, access_token_secret)

            self.api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
        
        except tweepy.TweepError as e:
            print(f"Error: Twitter Authentication Failed - \n{str(e)}")
            
    # This is the 'fall back' rate limit handler that should not be needed anymore
    def _limit_handled(self, cursor):
        while True:
            try:
                yield cursor.next()
            except tweepy.RateLimitError:
                time.sleep(15 * 60)
                continue
            except tweepy.TweepError as e:  
                print(e.reason)
                time.sleep(15 * 60)
                continue
            except StopIteration:
                break
                
    def get_tweets(self, query, si, un):
        
        file_label = date.today().isoformat().replace("-","_") + '_' + si.replace("-","_") + "_" + un.replace("-","_") + "_"

        # Loop through the Tweepy cursor one 'page' at a time and save each page to a pickle for processing
        for i, page in enumerate(self._limit_handled(tweepy.Cursor(self.api.search, q=query, since=si, until=un, \
                                                                   lang='en', tweet_mode='extended').pages()), 1):

            file_name = file_label + str(i) + '.pickle'
            with open(os.path.join(self.raw_path, file_name), 'wb') as f:
                pickle.dump(page, f, pickle.HIGHEST_PROTOCOL)
                

### Query the API

These are the queries that were used throughout the 2 week period to pull tweets from the 2 API classes

In [25]:
XMR_QUERY = '(XMR OR monero OR #xmr OR #monero OR @monero) lang:en'

twit_api = TwitterAPI(XMR_QUERY)

# Only run this when needed because it will count against our count of requests (250 per month)
# twit_api.get_tweets()

Grabbing bearer token from OAUTH


In [26]:
XMR_QUERY = '$XMR OR #xmr OR #monero OR @monero'

twep_api = TweepyAPI()

# Only run this when needed
# twep_api.get_tweets(XMR_QUERY,'2019-03-17','2019-03-22')

### Pre Process the Tweets data to prepare for Training and Testing

In [145]:
class PreProcess(object):
    def __init__(self):
        # Paths to the pickles saved from the Twitter and Tweepy APIs
        self.tweepy_path = os.path.join(os.getcwd(), 'Tweepy Raw')
        self.searchtweets_path = os.path.join(os.getcwd(), 'Searchtweets Raw')
        
        # New path to save the Train and Test data
        self.pre_path = os.path.join(os.getcwd(), 'PreProcess Pickles')
    
    # Regex helper function
    def _remove_pattern(self, text, pattern_regex):
        r = re.findall(pattern_regex, text)
        for i in r:
            text = re.sub(i, '', text)

        return text
    
    # Read the Tweepy tweets
    def _read_tweepy_pickles(self):
        for filename in os.listdir(self.tweepy_path):
            if '.pickle' in filename:
                with open(os.path.join(self.tweepy_path, filename), 'rb') as f:
                    yield pickle.load(f)

    # Read the Twitter tweets
    def _read_searchtweets_pickles(self):
        for filename in os.listdir(self.searchtweets_path):
            if '.pickle' in filename:
                with open(os.path.join(self.searchtweets_path, filename), 'rb') as f:
                    yield pickle.load(f)
    
    def _getText(self, data):
        # From github issue # 878: https://github.com/tweepy/tweepy/issues/878
        # Try for extended text of original tweet, if RT'd (streamer)
        try: text = data['retweeted_status']['extended_tweet']['full_text']
        except: 
            # Try for extended text of an original tweet, if RT'd (REST API)
            try: text = data['retweeted_status']['full_text']
            except:
                # Try for extended text of an original tweet (streamer)
                try: text = data['extended_tweet']['full_text']
                except:
                    # Try for extended text of an original tweet (REST API)
                    try: text = data['full_text']
                    except:
                        # Try for basic text of original tweet if RT'd 
                        try: text = data['retweeted_status']['text']
                        except:
                            # Try for basic text of an original tweet
                            try: text = data['text']
                            except: 
                                # Nothing left to check for
                                text = ''
        return text
    
    # This function was my one attempt to remove spam Tweets
    def get_exclude_list(self):
        
        dd_users = defaultdict(int)
        
        for tweets in self._read_searchtweets_pickles():
            for tweet in tweets:
                dd_users[tweet['user']['screen_name']] += 1

        for page in self._read_tweepy_pickles():
            for status in page:
                dd_users[status.user.screen_name] += 1
            
        df_list = pd.Series(dd_users).to_frame('Count')
        
        df_sorted = df_list.sort_values('Count', axis=0, ascending=False)

        # Exclude top 7 tweeters because they are posting repetitive 'spam' posts
        return df_sorted.head(7).index.values.tolist()
            
    def _removeUnneededText(self, tweet):
        
        # Remove user Twitter handles and the 'RT' word
        tweet = self._remove_pattern(tweet, r'(@[\w]*\s|@[\w]*:\s|\bRT\s)')
        
        # Remove any http links
        tweet_words = [word for word in tweet.split() if 'http' not in word]
        tweet = ' '.join(tweet_words)
        
        return tweet
                
    def process_raw_pickles(self):
        tweet_list = []
        
        # Get a list of the top 7 users, because we will assume that they are spammers
        exclude = self.get_exclude_list()

        # Read Tweepy tweets and add them to list
        for page in self._read_tweepy_pickles():
            for status in page:
                if status.user.screen_name not in exclude:
                    text = self._getText(status._json)
                    text = self._removeUnneededText(status.full_text)

                    # Add to list if it's not a duplicate
                    if status.retweet_count > 0: 
                        if text not in tweet_list:
                            tweet_list.append(text)
                    else:
                        tweet_list.append(text)
    
        # Read Twitter tweets and add them to list
        for tweets in self._read_searchtweets_pickles():
            for tweet in tweets:
                if tweet['user']['screen_name'] not in exclude:
                    text = self._getText(tweet)
                    text = self._removeUnneededText(text)

                    # Add to list if it's not a duplicate
                    if tweet.retweet_count > 0: 
                        if text not in tweet_list:
                            tweet_list.append(text)
                    else:
                        tweet_list.append(text)
                    
        # Do one more step to eliminate duplicate tweets
        df_tweets = pd.DataFrame(tweet_list, columns=['tweets']).drop_duplicates()
        
        # Set this parameter manually to get over 1000 tweets to use for Training
        msk = np.random.rand(len(df_tweets)) < 0.111

        # Use the boolean mask array to split the data into Train and Test datasets
        train_df = df_tweets[msk]
        test_df = df_tweets[~msk]
        
        # Save TRAIN data to CSV for manual classication in Excel
        train_df.to_csv(os.path.join(self.pre_path, 'Train_data.csv'), encoding='utf-8', index=False)
        
        # Save TEST data to pickle for later use
        with open(os.path.join(self.pre_path, 'Test_data.pickle'), 'wb') as f:
            pickle.dump(test_df, f, pickle.HIGHEST_PROTOCOL)
            
        # Return stats for user to review
        return len(train_df), len(test_df)

In [147]:
# Call the pre process class and the function to split the data into Train and Test
PP = PreProcess()

# At this point, the data is split and we don't want to rerun this step again because it will write over our data
#    that we are using for Training
# train_test_lengths = PP.process_raw_pickles()

### Manual Classification of Train Data

The CSV output from the above PreProcess class was backed up in case it was erased or lost. Then I manually tagged each tweet for the five emotions, with respect to my target event:

    Anger, Enthusiasm, Passivity, Fear and Hope

Once that was complete, I saved the results back to a new CSV file that will be read by the next class.


### Train and Test

Based on instructions in the discussion group, I used a simple dictionary counting classifier. However, I tried to exended it a little bit by writting 4 tests instead of just relying on highest count. This offsets the fact that the 'passivity' emotion was highly represented in the tweet training data.

In future projects, I would use documented TF-IDF, Word2Vec or another method to vectorize my data first. And then use either ML classifiers with Sklearn pipelines to test best classifier, or use a Recurrent Neural Network such as LSTM or GRU.

For this project, it was a good exercise to see how the data was constructed


In [30]:
class Basic_Train_Test(object):
    def __init__(self):
        # Read from the PreProcess path
        self.pre_path = os.path.join(os.getcwd(), 'PreProcess Pickles')
        
        # Dicts for recording words used in Training tweets
        self.dict_an = defaultdict(int)
        self.dict_en = defaultdict(int)
        self.dict_pa = defaultdict(int)
        self.dict_fe = defaultdict(int)
        self.dict_ho = defaultdict(int)

        # Totals to record the total number of words used in each emotion category
        self.an_ttl = self.en_ttl = self.pa_ttl = self.fe_ttl = self.ho_ttl = 0
        
        # Dict of emotion categories to use when labelling Test data
        self.emotions = {0: "anger", 1: "enthusiasm", 2: "passivity", 3: "fear", 4: "hope"}
    
    def train_data(self):
        # Read the manually tagged CSV data
        df_train = pd.read_csv(os.path.join(self.pre_path, 'Train_data_Classified.csv'))
        
        train_ = df_train.values.tolist()
        
        for tweet in train_:
            # Clean the text and remove stop words
            temp = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet[0]).lower()
            words = [word for word in temp.split() if word not in stopwords.words('english')]

            # Put words in dicts based on their category and count them
            for word in words:
                if tweet[1] == "anger":
                    self.an_ttl += 1
                    if word in self.dict_an:
                        self.dict_an[word] += 1
                    else:
                        self.dict_an[word] = 1

                elif tweet[1] == "enthusiasm":
                    self.en_ttl += 1
                    if word in self.dict_en:
                        self.dict_en[word] += 1
                    else:
                        self.dict_en[word] = 1

                elif tweet[1] == "passivity":
                    self.pa_ttl += 1
                    if word in self.dict_pa:
                        self.dict_pa[word] += 1
                    else:
                        self.dict_pa[word] = 1

                elif tweet[1] == "fear":
                    self.fe_ttl += 1
                    if word in self.dict_fe:
                        self.dict_fe[word] += 1
                    else:
                        self.dict_fe[word] = 1

                elif tweet[1] == "hope":
                    self.ho_ttl += 1
                    if word in self.dict_ho:
                        self.dict_ho[word] += 1
                    else:
                        self.dict_ho[word] = 1

                else:
                    raise ValueError
        
        # Return stats for the user to review
        return { "anger": (len(self.dict_an), self.an_ttl), "enthusiasm": (len(self.dict_en), self.en_ttl), \
                 "passivity": (len(self.dict_pa), self.pa_ttl), "fear": (len(self.dict_fe), self.fe_ttl), \
                 "hope": (len(self.dict_ho), self.ho_ttl) }
    
    def test_data(self):
        # Read back test data from pickle
        with open(os.path.join(self.pre_path, 'Test_data.pickle'), 'rb') as f:
            df_test = pickle.load(f)
        
        test_ = df_test.values.tolist()
        
        for tweet in test_:
            # Clean the text
            temp = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet[0]).lower()
            
            an_cnt = en_cnt = pa_cnt = fe_cnt = ho_cnt = 0
            
            # Count each word based on the dictionary of emtions created while training
            for word in temp.split():
                # Remove any stop words and then add the counts of the word from the emotion categories
                # This creates a cumulative 'weight' of the words in this category from the training data
                # Note - This 'weight' needs to be 'offset' by size of each category, i.e. len(dict_*), or 
                #        by the number of words in each category, i.e. *_ttl. Hence, the matrix of tests.
                if word not in stopwords.words('english'):
                    
                    an_cnt += self.dict_an[word]
                    en_cnt += self.dict_en[word]
                    pa_cnt += self.dict_pa[word]
                    fe_cnt += self.dict_fe[word]
                    ho_cnt += self.dict_ho[word]

            # Create a matrix of values based on 4 (or more) different statistical tests
            mtrx = [
                    [an_cnt/len(self.dict_an),
                     en_cnt/len(self.dict_en),
                     pa_cnt/len(self.dict_pa),
                     fe_cnt/len(self.dict_fe),
                     ho_cnt/len(self.dict_ho)],
                    [math.log(an_cnt+1)/len(self.dict_an),
                     math.log(en_cnt+1)/len(self.dict_en),
                     math.log(pa_cnt+1)/len(self.dict_pa),
                     math.log(fe_cnt+1)/len(self.dict_fe),
                     math.log(ho_cnt+1)/len(self.dict_ho)],
                    [an_cnt/math.log(len(self.dict_an)),
                     en_cnt/math.log(len(self.dict_en)),
                     pa_cnt/math.log(len(self.dict_pa)),
                     fe_cnt/math.log(len(self.dict_fe)),
                     ho_cnt/math.log(len(self.dict_ho))],
                    [an_cnt/self.an_ttl,
                     en_cnt/self.en_ttl,
                     pa_cnt/self.pa_ttl,
                     fe_cnt/self.fe_ttl,
                     ho_cnt/self.ho_ttl]
                   ]

            mtrx_scr = []

            # Score the matrix from 0-5 based on highest to lowest score
            # Highest being the most likely category based on the best word matches
            for test in mtrx:
                rev = sorted(test, reverse=True)
                d = {k: v for v, k in enumerate(rev)}
                mtrx_scr.append([d[v] for v in test])

            d_scr = defaultdict(int)

            # Use a dictionary to sum up the test scores and then compute the average score across all tests
            for j in range(len(mtrx_scr[0])):
                for i in range(len(mtrx_scr)):
                    d_scr[j] += mtrx_scr[i][j]
                d_scr[j] /= len(mtrx_scr)

            # Yield back the tweet and the emotion category label based on the lowest score
            yield [tweet[0], self.emotions[min(d_scr, key=d_scr.get)]]
                    

In [31]:
# Call the train test class and then train on the traning CSV data
BTT = Basic_Train_Test()
train_results = BTT.train_data()
print("Each emotion label has a tuple of (number of distinct words in the dictionary, total number of words):\n")
print(train_results)

Each emotion label has a tuple of (number of distinct words in the dictionary, total number of words):

{'anger': (183, 214), 'enthusiasm': (1152, 2756), 'passivity': (3451, 14668), 'fear': (536, 959), 'hope': (587, 1123)}


In [35]:
# Define a generator that can be used to pull the tested data down
test_results = BTT.test_data()

df_tested = pd.DataFrame(list(test_results), columns=['tweets','emotion'])

### Results of Testing

The counts by category below seem in line with the counts that we saw from training above. However, based on reviewing manually the results, I don't think this classifier did a very good job. That is, I think the accuracy of this method of classifying is very low.

I think it will require one of the more advanced classifiers to correctly classify these emotions better.

In [48]:
df_tested.groupby('emotion').count()

Unnamed: 0_level_0,tweets
emotion,Unnamed: 1_level_1
anger,84
enthusiasm,2362
fear,21
hope,19
passivity,5442


#### Samples of tweets by category

In [54]:
df_tested.loc[df_tested['emotion'] == "passivity"].sample(n=5, random_state=2)

Unnamed: 0,tweets,emotion
3068,#XMR Buy at #Bittrex and sell at #Bitfinex. Ra...,passivity
5179,1/Xmr is the true cypher punk crypto vision.. ...,passivity
6467,"Yay! New 4 blocks: 8.24 MSR (133.77%@432420), ...",passivity
1284,$XMR has now been in in this consolidation ran...,passivity
1910,🔄 Prices update in $USD (1 hour): $EOS - 3.69 ...,passivity


In [58]:
df_tested.loc[df_tested['emotion'] == "enthusiasm"].sample(n=5, random_state=2)

Unnamed: 0,tweets,emotion
4518,Monero gold,enthusiasm
4729,I've just posted a new blog: Altcoin News: Mon...,enthusiasm
7023,360 Total Security team uncovered a new Monero...,enthusiasm
6835,Facebook and Monero?,enthusiasm
3573,(FEBRUARY) via /r/Monero hot 🔥 in #reddit #Mon...,enthusiasm


In [56]:
df_tested.loc[df_tested['emotion'] == "anger"].sample(n=5, random_state=3)

Unnamed: 0,tweets,emotion
7593,Unbelievable.,anger
7811,dictatorship,anger
2432,I will sue you too,anger
7469,Actuales.,anger
5130,because you're fucking retarded.,anger


In [51]:
df_tested.loc[df_tested['emotion'] == "fear"].sample(n=5, random_state=2)

Unnamed: 0,tweets,emotion
6396,I've almost failed to unlock my aunties ipad m...,fear
3196,"He's right about that metadata, at least $RYO ...",fear
269,"Ban Anonymous Cryptocurrencies, Says French Na...",fear
7234,Where are points put up? Super coach doesn’t h...,fear
1105,Should have used @monero,fear


In [52]:
df_tested.loc[df_tested['emotion'] == "hope"].sample(n=5, random_state=2)

Unnamed: 0,tweets,emotion
4901,"Young Buck saying ""next time, we only using Dr...",hope
1739,"Ok, yes the signatures are ECDSA, but bullet p...",hope
7671,im waiting,hope
685,Where is the trezor implementation ?????,hope
4968,Question about the RandomX algo,hope
