# Bryce Fenlon

## Final Assignment

This notebook outlines the project and the data used within it. All private Twitter developer keys are loaded from a separate file and are not included in the final submission.

#### Project Description
I have chosen to do this project on the ongoing social distancing and quarantine orders throughout the United States and around the world, specifically on how these measures impact tweet sentiment. Several locations around the United States will be chosen as the focus of this study, provided they fit the following criteria:
1. The area is, or has recently (within the past week) been, under strict stay-at-home orders or other similarly strict regulations.
2. Those stay-at-home orders or strict regulations are set to be eased (to any extent) before May 6, 2020.
3. The tweet volume in the area is sufficient to conduct statistical analysis.

For this project, I will be going outside the scope of this course to the realm of machine learning. I had to rely heavily on other projects and techniques, but I have made an effort to fully understand what is going on. The comments should reflect this. Any code explicitly taken from other projects/repositories will be cited in the code comments or in text.

In [1]:
# We will start this notebook with the creation of a classification model.
# The dataset used to create the model is the Sentiment140 set on Kaggle, a dataset with 1.6 million
# tweets tagged for sentiment as either 0 or 4 (0 = negative, 4 = positive).
# Start by loading pandas and numpy to work with the data in the csv.

import pandas as pd
import numpy as np

# >>>>>>>>>>>>> IMPORTANT <<<<<<<<<<<<< #
# Switch following lines to 'True' to start model creation from scratch
create_save_model = False

# >>>>>>>>>>> ARE YOU SURE? <<<<<<<<<<< #
create_save_model = False

if create_save_model == True:
    # Read the training data in, keep only the sentiment value and the text of the tweet
    cols = ['sentiment', 'text']
    training = pd.read_csv('training.1600000.processed.noemoticon.csv', usecols = [0, 5], header = None, names = cols, encoding = "ISO-8859-1")

    # Though the description of the data hints at a neutral value (2), there are none present.
    # After double checking that there are no 2s, we simply reduce 4s to 1s in the
    # sentiment value to make training easier in later steps
    training = training[training.sentiment != 2]
    training.sentiment = training.sentiment / 4

In [2]:
import re
import string

# Create dictionary of negation contractions, not comprehensive
neg_dict = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
# Create dictionary of 'positive' contractions, 
# not comprehensive and some are arbitrary choices, like "he would" instead of "he had"
pos_dict = {"could've":"could have", "he'd":"he would", "she'd":"she would", "he'll":"he will",
                "she'll":"she will", "how'd":"how would", "how'll":"how will", "how're":"how are",
                "how's":"how is", "i'd":"i would", "i'll":"i will", "i'm":"i am", "i've":"i have",
                "it's":"it is", "let's":"let us", "might've":"might have", "must've":"must have",
                "he's":"he is", "she's":"she is", "they're":"they are", "they'd":"they would",
                "should've":"should have", "that's":"that is", "that'd":"that would", "there'll":"there will",
                "there're":"there are", "they'll":"they will", "they've":"they have", "we'll":"we will",
                "we'd":"we would", "why's":"why is"}
def clean_tweet(some_tweet):
    # use regex to remove URLs, @s, and #s by matching on the character (or http)
    # and replacing everything after it until a whitespace character ('\S+') with an empty string
    # let's not forget any old fashioned 'www' type links (the dot is escaped so it only matches a literal '.')
    text = re.sub(r'http\S+', '', some_tweet).lower()
    text = re.sub(r'www\.\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    text = text.replace('\n', ' ') # remove new lines
    # Work word by word, so a match only changes that word and not every
    # occurrence of the same characters elsewhere in the tweet
    words = []
    for i in text.split():
        if i.isdigit(): # remove numbers
            continue
        # handle negative contractions with neg_dict
        i = neg_dict.get(i, i)
        # handle positive contractions with pos_dict
        i = pos_dict.get(i, i)
        words.append(i)
    text = " ".join(words)
    for x in string.punctuation + "‘’”“": # eliminate punctuation and other special characters
        text = text.replace(x, " ")
    # Eliminate non-ASCII characters (i.e. emoticons/emojis), because training data does not have emoticons
    text = "".join(i for i in text if ord(i)<128)
    # Get rid of words of single letters or extraneous single letters,
    # we will rely on the rest of the text to produce our results,
    # and we don't want random letters from ad hoc (e.g. ':P') emojis screwing up anything
    text = " ".join([x for x in text.split() if len(x) > 1])
    # Return cleaned text of tweet
    return(text)

In [3]:
if create_save_model == True:
    # Clean text of each tweet in training data frame
    # We're purposefully leaving stop words in
    training.text = training.text.apply(clean_tweet)
    x = training.text
    y = training.sentiment
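
To see what the first part of the cleaning step does to a single tweet, here is a minimal standalone sketch of the regex portion only. The helper name `strip_twitter_markup` and the example tweet are made up for illustration, and the sketch deliberately skips the contraction, punctuation and single-letter handling done in `clean_tweet`.

```python
import re

def strip_twitter_markup(raw_tweet):
    """Hypothetical helper: lowercase the text and drop URLs, @mentions and #hashtags."""
    text = raw_tweet.lower()
    for pattern in (r'http\S+', r'www\.\S+', r'@\S+', r'#\S+'):
        text = re.sub(pattern, '', text)
    # collapse the whitespace left behind by the removals
    return ' '.join(text.split())

example = "Can't wait to go outside! @friend https://t.co/abc #StayHome"
cleaned = strip_twitter_markup(example)
print(cleaned)
```

Everything from the matched character up to the next whitespace is removed, which is why the whole link and the whole hashtag disappear rather than just the `http` or `#`.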

In [4]:
if create_save_model == True:
    import sklearn
    from sklearn.model_selection import train_test_split

    # Split data into training and validation with sklearn's 'train_test_split'
    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=.1)

### Model Creation

Now begins the fun part. Using `TfidfVectorizer` from `sklearn`, we will use term frequency-inverse document frequency (TF-IDF) to extract features for our neural network. Features are the input to a neural network: the machine-readable values that allow us to train and adjust the weights within a machine learning model. In this case, we will extract 100,000 features from the `x_train` data created in the previous step. The model creation code and the `batch_generator` function are taken from https://github.com/tthustla/twitter_sentiment_analysis_part9/blob/master/Capstone_part4-Copy7.ipynb. Final validation accuracy on the Sentiment140 data was 82.28%.
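
To make the term weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It follows what I understand to be scikit-learn's default weighting (smoothed idf, rows scaled to unit length), but it only handles single words, not the 1- to 3-word terms used in this notebook. The names `tfidf` and `toy_docs` are made up for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (single words only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency: rare terms get a larger weight
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        raw = {term: tf[term] * idf[term] for term in tf}
        # scale each document to unit length (guarding against empty documents)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_docs = ["stay home stay safe", "home office is not fun", "not safe"]
weights = tfidf(toy_docs)
for doc, row in zip(toy_docs, weights):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the first toy document, "stay" ends up with a higher weight than "home" because it occurs twice there and in no other document, while "home" also appears in the second document.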

In [5]:
if create_save_model == True:
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Now we need to transform our text to something machine readable
    # The result will be a sparse matrix (that will need to be converted to dense)

    # This line initializes the Tfidf vectorizer with our given parameters
    tvec = TfidfVectorizer(max_features=100000,ngram_range=(1, 3))

    # Here we iterate through the training data to learn the vocabulary (the 100,000 most frequent
    # 1- to 3-word terms) and the inverse document frequency of each term
    tvec.fit(x_train)

In [6]:
if create_save_model == True:
    # Transform text training and validation data, converting x_validation data to dense matrix with .toarray() method
    # x_train data is simply too large, so the batch_generator function is needed (in the next cell)
    x_train_tfidf = tvec.transform(x_train)
    x_validation_tfidf = tvec.transform(x_validation).toarray()
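
A rough back-of-the-envelope calculation shows why the training matrix cannot simply be converted to dense in one go. The figures below are assumptions: all 1.6 million Sentiment140 rows are kept, the 90/10 split from the earlier cell is used, and each value is stored as a 64-bit float. The helper `dense_gigabytes` is made up for this illustration.

```python
n_tweets = 1_600_000     # assumed: every row of Sentiment140 is kept
n_features = 100_000     # max_features passed to TfidfVectorizer
bytes_per_value = 8      # assumed: 64-bit floats

def dense_gigabytes(n_rows, n_cols, itemsize=bytes_per_value):
    """Size of a dense n_rows x n_cols matrix in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * itemsize / 1e9

n_validation = n_tweets // 10          # test_size=.1
n_train = n_tweets - n_validation

train_gb = dense_gigabytes(n_train, n_features)
validation_gb = dense_gigabytes(n_validation, n_features)
batch_gb = dense_gigabytes(64, n_features)

print(train_gb, validation_gb, batch_gb)
```

Under these assumptions a single batch of 64 tweets is only about 0.05 GB when dense, whereas the full training matrix would be over a terabyte. The dense validation matrix is also large (about 128 GB), so a smaller `test_size` may be worth considering on a typical machine.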

In [7]:
if create_save_model == True:
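    def batch_generator(X_data, y_data, batch_size):
        # Yield (features, labels) batches forever, converting only one batch at a time to dense
        samples_per_epoch = X_data.shape[0]
        # Round up so the final, smaller batch is counted as a batch
        number_of_batches = int(np.ceil(samples_per_epoch / batch_size))
        counter = 0
        index = np.arange(samples_per_epoch)
        while True:
            index_batch = index[batch_size*counter:batch_size*(counter+1)]
            X_batch = X_data[index_batch,:].toarray()
            y_batch = y_data.iloc[index_batch]
            counter += 1
            yield X_batch, y_batch
            # Start over once every batch has been served, so an empty batch is never produced
            if counter >= number_of_batches:
                counter = 0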
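
The wrap-around behaviour of such a generator can be checked on a tiny in-memory example. This is a simplified sketch using plain lists instead of a sparse matrix and a pandas Series; `cyclic_batches` and the toy data are made up for this illustration.

```python
import math
from itertools import islice

def cyclic_batches(rows, labels, batch_size):
    """Yield (rows, labels) slices forever, wrapping around after the last batch."""
    n_batches = math.ceil(len(rows) / batch_size)
    counter = 0
    while True:
        start = batch_size * counter
        stop = start + batch_size
        yield rows[start:stop], labels[start:stop]
        # go back to the first batch once the last one has been served
        counter = (counter + 1) % n_batches

toy_rows = ['t0', 't1', 't2', 't3', 't4']
toy_labels = [0, 1, 1, 0, 1]

# Take the first four batches: three cover the data, the fourth starts over
first_four = list(islice(cyclic_batches(toy_rows, toy_labels, 2), 4))
for batch in first_four:
    print(batch)
```

With five rows and a batch size of two, the third batch holds the single leftover row and the fourth batch is identical to the first.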

In [8]:
if create_save_model == True:
    from keras.models import Sequential
    from keras.layers import Dense

    # One hidden layer of 128 units on top of the 100,000 tf-idf features, and a single
    # sigmoid output giving a value between 0 (negative) and 1 (positive)
    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=100000))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # steps_per_epoch must be a whole number of batches, so round up
    # (fit_generator is deprecated in newer Keras releases, where model.fit accepts generators)
    model.fit_generator(generator = batch_generator(x_train_tfidf, y_train, 64),
                epochs=2, 
                validation_data=(x_validation_tfidf, y_validation),
                steps_per_epoch=int(np.ceil(x_train_tfidf.shape[0]/64)))
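
To get a feel for the size of this network, the number of trainable weights can be worked out by hand. Each fully connected layer has one weight per input-unit pair plus one bias per unit; `dense_params` is a helper made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(100_000, 128)  # 100,000 tf-idf features into 128 hidden units
output = dense_params(128, 1)        # 128 hidden units into 1 sigmoid output
total = hidden + output

print(hidden, output, total)
```

Almost all of the roughly 12.8 million parameters sit in the first layer, which is a direct consequence of feeding 100,000 features into it.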

In [9]:
if create_save_model == True:
    model_json = model.to_json()
    with open('model.json', 'w') as json_file:
        json_file.write(model_json)

    model.save_weights('model.h5')

### Loading and sentiment tagging of tweets
All the Twitter data from our two time periods will now be loaded, the text collected and cleaned, and then analyzed for sentiment.

In [10]:
import tweepy

%run ~/twitter_credentials.py

#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=con_key, consumer_secret=con_secret)
auth.set_access_token(acc_token, acc_secret)

#Connect to the Twitter API using the authentication
api = tweepy.API(auth)

def get_tweets(geo_code, time_period, num_needed = 30000):
    # Written against Tweepy 3.x (api.search and tweepy.TweepError were renamed in Tweepy 4)
    tweet_list = [] # create container for tweets
    last_id = None # id of the oldest tweet seen so far, None until a tweet has been collected
    if time_period == 'before':
        # only collect tweets created before May 4th
        window = {'until': '2020-05-04'}
    elif time_period == 'after':
        # only collect tweets newer than the last tweet of the before_tweets (at 11:59:58, May 3rd)
        window = {'since_id': 1257097599406018573}
    else:
        return(tweet_list) # unknown time period, nothing to collect
    # if before_tweets/after_tweets was already collected in this session, resume from the end of it
    previous = globals().get(time_period + '_tweets')
    if isinstance(previous, list) and previous:
        tweet_list = previous
        last_id = previous[-1].id
    while len(tweet_list) < num_needed:
        # page backwards in time, max_id is inclusive so subtract 1 to avoid duplicates
        if last_id is not None:
            window['max_id'] = str(last_id - 1)
        try:
            new_tweets = api.search(q = '-filter:retweets', lang = 'en', geocode = geo_code, count = 100, tweet_mode = 'extended', **window)
        except tweepy.TweepError as e:
            print("Error", e)
            break
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        tweet_list.extend(new_tweets)
        last_id = new_tweets[-1].id
    return(tweet_list)
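
The paging idea behind `get_tweets` can be tried without touching the Twitter API. The sketch below uses a made-up stand-in, `fake_search`, that returns ids newest first and honours an inclusive `max_id`, which is how I understand the search endpoint to behave; all names and ids here are invented for illustration.

```python
def fake_search(all_ids, count, max_id=None):
    """Hypothetical stand-in for a search call: newest first, only ids <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]

def collect(all_ids, num_needed, page_size):
    collected = []
    last_id = None  # id of the oldest item seen so far
    while len(collected) < num_needed:
        # ask only for items strictly older than the oldest one already collected
        max_id = None if last_id is None else last_id - 1
        page = fake_search(all_ids, page_size, max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        last_id = page[-1]
    return collected

stored_ids = [101, 105, 109, 120, 133, 140, 152]
result = collect(stored_ids, num_needed=5, page_size=2)
print(result)
```

Because whole pages are appended, the loop can return slightly more items than `num_needed`, and subtracting one from the last id is what keeps the same item from appearing twice.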

In [11]:
# >>>>>>>>>>>>> IMPORTANT <<<<<<<<<<<<< #
# Switch following lines to 'True' to gather tweets and overwrite previously saved tweet data
need_to_collect_save = False

# >>>>>>>>>>> ARE YOU SURE? <<<<<<<<<<< #
need_to_collect_save = False

if need_to_collect_save == True:
    tweets = get_tweets(geo_code = '37.6872,-97.3301,50km', time_period = 'before')
    before_tweets = tweets

    tweets = get_tweets(geo_code = '37.6872,-97.3301,50km', time_period = 'after')
    after_tweets = tweets

    to_save_before = [(tweet.id, tweet.user.screen_name, tweet.full_text, tweet.created_at) for tweet in before_tweets]
    to_save_before = pd.DataFrame(to_save_before, columns = ['id', 'screen_name', 'text', 'date/time'])
    to_save_before.to_csv('before_tweets.csv', index = False)

    to_save_after = [(tweet.id, tweet.user.screen_name, tweet.full_text, tweet.created_at) for tweet in after_tweets]
    to_save_after = pd.DataFrame(to_save_after, columns = ['id', 'screen_name', 'text', 'date/time'])
    to_save_after.to_csv('after_tweets.csv', index = False)

### Clean collected tweets and tag sentiment values with TextBlob and custom model
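
One point worth illustrating for this step: a trained model expects each column of its input to mean the same term it meant during training, so new text is normally transformed with the vectorizer that was fit on the training text. The sketch below shows one possible way to keep such a vectorizer around, using `pickle`. It is an assumption for illustration, not something this notebook does, and the toy texts and variable names are made up.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

training_texts = ["so happy today", "this is not fun", "happy to stay home", "not happy at all"]
new_texts = ["stay home and be happy", "quarantine is not fun"]

# Fit once on the training text, then serialize the fitted vectorizer
train_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
train_vectorizer.fit(training_texts)
saved_bytes = pickle.dumps(train_vectorizer)  # these bytes could be written to a file instead

# Later: restore the vectorizer and only transform the new text, without fitting again
restored = pickle.loads(saved_bytes)
new_matrix = restored.transform(new_texts)
print(new_matrix.shape)

# For comparison, a vectorizer fit on the new text learns a different vocabulary
refit = TfidfVectorizer(ngram_range=(1, 3)).fit(new_texts)
print(refit.vocabulary_ == train_vectorizer.vocabulary_)
```

The restored vectorizer produces a matrix with one column per training term, whereas the refit one assigns its columns to whatever terms occur in the new text.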

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Load saved tweets
before_tweets = pd.read_csv('before_tweets.csv')
after_tweets = pd.read_csv('after_tweets.csv')

# Clean tweet text
before_tweets.text = before_tweets.text.apply(clean_tweet)
after_tweets.text = after_tweets.text.apply(clean_tweet)

# Save lists of before and after texts to feed to our model
to_pred_before = before_tweets.text
to_pred_after = after_tweets.text

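# Initialize 2 Tfidf vectorizers with given parameters
# NOTE: these vectorizers learn their vocabulary from the collected tweets rather than from the
# training data, so their columns do not correspond to the features the model was trained on.
# Reusing the fitted tvec from model creation would keep the feature space consistent.
tvec_before = TfidfVectorizer(max_features=100000,ngram_range=(1, 3))
tvec_after = TfidfVectorizer(max_features=100000,ngram_range=(1, 3))

# Learn the vocabulary and idf weights of each corpus, and store them in the vectorizers for use in the transform
tvec_before.fit(to_pred_before)
tvec_after.fit(to_pred_after)

# Transform the corpora by mapping each tweet to the tf-idf scores of the terms in each vectorizer's vocabulary_
# each object has a shape of (x, 100000), where x is the number of tweets,
# provided the corpus contains at least 100,000 distinct terms
before_tfidf = tvec_before.transform(to_pred_before)
after_tfidf = tvec_after.transform(to_pred_after)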

In [13]:
from textblob import TextBlob

def get_tweet_sentiment(tweet): 
    # Create textblob object for tweet 
    analysis = TextBlob(tweet) 
    
    # return the raw polarity (between -1 and 1), don't bin it into positive, neutral, or negative
    return(analysis.sentiment.polarity)

In [14]:
from keras.models import model_from_json

# Load saved model structure
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()

# Load model and saved weights
model = model_from_json(loaded_model_json)
model.load_weights('model.h5')

Using TensorFlow backend.


In [15]:
from textblob import TextBlob

custom_before = model.predict(before_tfidf)
custom_after = model.predict(after_tfidf)

textblob_before = [get_tweet_sentiment(i) for i in to_pred_before]

textblob_after = [get_tweet_sentiment(j) for j in to_pred_after]

Add columns containing the prediction values from the custom model and from TextBlob.
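
Note that the two columns are on different scales: the custom model's sigmoid output lies between 0 and 1, while TextBlob's polarity lies between -1 and 1. If the two ever need to be compared side by side, one simple option (an assumption on my part, not something done in this notebook) is a linear rescaling of the polarity. The helper name and the polarity values below are made up.

```python
def polarity_to_unit_interval(polarity):
    """Linearly map a polarity in [-1, 1] onto [0, 1]."""
    return (polarity + 1) / 2

textblob_scores = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up polarity values
rescaled = [polarity_to_unit_interval(p) for p in textblob_scores]
print(rescaled)
```

After this mapping a neutral polarity of 0 lands on 0.5, the midpoint of the custom model's output range.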

In [16]:
before_tweets['custom prediction'] = custom_before
after_tweets['custom prediction'] = custom_after
    
before_tweets['textblob prediction'] = textblob_before
after_tweets['textblob prediction'] = textblob_after

In [17]:
before_tweets.to_csv('before_r.csv', index = False)
after_tweets.to_csv('after_r.csv', index = False)

# Save culled version of data for submission
before_tweets[0:500].to_csv('culled_before_r.csv', index = False)
after_tweets[0:500].to_csv('culled_after_r.csv', index = False)