# Machine Learning-based Sentiment Analysis of Tweets                
## Yahui Peng


### Introduction
You are a data scientist working for the government. You want to understand public opinion regarding Hurricane Maria, which is responsible for killing at least 499 people in Puerto Rico. Total losses are estimated at $94.4 billion, which accrued to government agencies, businesses, and, more importantly, families [1]. With this background, whether you are a politician, a business person, or someone affected by the hurricane, understanding the sentiment of the general populace is important. For this assignment, you will use a subset of the tweets retrieved from Twitter that mentioned #PuertoRico over the period of October 4 to November 7, 2017 [2] to measure the sentiment (i.e., the "good" or "bad" opinions people have about the hurricane and its impact) of this event.

For this task, we will write code for a lexicon-based analysis (i.e., lexicon-based classification). Lexicon-based classification is a way to categorize text using manually generated lists of topical words. Essentially, we just need to check whether the topical words appear in a piece of text (e.g., a tweet). In this exercise we will make use of manually curated sentiment words. However, the basic experimental process is the same for other tasks (e.g., identifying offensive language).
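
The core check described above can be sketched in a few lines. This is a minimal illustration only: the word list `OFFENSIVE_WORDS` and the helper `mentions_lexicon_word` are hypothetical names, and the two-word lexicon is a toy stand-in for a curated list.

```python
# Toy lexicon for illustration only; not one of the curated word lists.
OFFENSIVE_WORDS = {"idiot", "stupid"}

def mentions_lexicon_word(text, lexicon):
    """Return True if any whitespace-separated token of `text` is in `lexicon`."""
    tokens = text.lower().split()
    return any(token in lexicon for token in tokens)

print(mentions_lexicon_word("What a stupid idea", OFFENSIVE_WORDS))  # True
print(mentions_lexicon_word("What a lovely idea", OFFENSIVE_WORDS))  # False
```

Swapping in a different word list changes the task (sentiment, offensive language, and so on) without changing the procedure.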

If you are interested, though it is not needed, you can learn more about lexicon-based classification in Chapter 21 (21.2 and 21.6) of the free online book at the following link: [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/21.pdf)

### References
[1]  Spalding, Rebecca (November 13, 2017). "Puerto Rico Seeks $94 Billion in Federal Aid for Hurricane Recovery". Bloomberg News. Retrieved December 15, 2017.

[2]  site: https://archive.org/details/puertorico-tweets

## Submission Instructions

After completing the exercises below, generate a PDF of the code **with** outputs. Then create a zip file containing both the completed exercise and the generated PDF/HTML. You are **required** to check the PDF/HTML to make sure all the code **and** outputs are clearly visible and easy to read. If your code runs off the page, shorten the lines; I generally recommend staying under 80 characters.

For this task, unzip the file "puerto-rico.jsonl" and move it into the same directory as this notebook, then complete the following exercises. When you turn the assignment in, do **NOT** include puerto-rico.jsonl in your zip file; it will kill Blackboard.

Finally, name the zip file using a combination of the assignment and your name, e.g., ps2_zanella.zip

## Module 1 

The files "positive-words.txt" and "negative-words.txt" contain manually curated positive (e.g., good, great, awesome) and negative words (e.g., bad, hate, terrible). The files contain one word on each line. Write a function that takes an open file, adds each word (i.e., each line) to a set, and returns the set.


def file_to_set(file):
    # Strip surrounding whitespace from each line and collect the words in a set
    return {line.strip() for line in file}

positive_file = open('./positive-words.txt', encoding='utf8')
positive_words = file_to_set(positive_file)
positive_file.close()

negative_file = open('./negative-words.txt', encoding='iso-8859-1') # If you get a weird read error. Let me know. We can change the encoding.
negative_words = file_to_set(negative_file)
negative_file.close()

The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [36]:
assert(type(positive_words) == type(set()))
assert(type(negative_words) == type(set()))
assert(len(positive_words) == 2006)
assert(len(negative_words) == 4783)
assert(('good' in positive_words)  == True)
assert(('bad' in negative_words)  == True)
assert(('bad' not in positive_words) == True)
print("Asserts finished successfully!")

Asserts finished successfully!


## Module 2

For this exercise, you need to write a function that counts the number of words in a sentence that also appear in a set. For example, given the set set(['good', 'great']) and the sentence "this is good good good", the function should return 3.

In [37]:
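def count_sentiment_words(sentiment_set, tweet_text, lower):
    # Count the tokens of tweet_text that appear in sentiment_set.
    # NOTE: `lower` is accepted but never applied. The pattern below only
    # matches runs of lowercase ASCII letters, so uppercase letters are never
    # part of a counted token.
    import re
    counts = 0
    words = re.findall(r'[a-z]+', tweet_text)
    for word in words:
        if word in sentiment_set:
            counts += 1
    return counts  # You should return a number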
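
The `lower` argument in the signature suggests optional case folding. The following is a minimal sketch of one way to honor that flag, shown for comparison only: the helper name `count_lexicon_hits` and the two-word `toy_lexicon` are hypothetical, and tokenizing on runs of ASCII letters is an assumption rather than a requirement of the exercise.

```python
import re

def count_lexicon_hits(lexicon, text, lower):
    """Count tokens of `text` that appear in `lexicon`.

    If `lower` is True the text is lowercased first, so "GOOD" matches "good".
    """
    if lower:
        text = text.lower()
    # Tokens are runs of ASCII letters of either case; anything else splits tokens.
    tokens = re.findall(r"[A-Za-z]+", text)
    return sum(1 for token in tokens if token in lexicon)

toy_lexicon = {"good", "great"}  # stand-in for positive_words
print(count_lexicon_hits(toy_lexicon, "this is GOOD GOOD good", True))   # 3
print(count_lexicon_hits(toy_lexicon, "this is GOOD GOOD good", False))  # 1
```

Because this version counts capitalized words when `lower` is True, it can produce different totals on real tweets than a pattern restricted to lowercase letters.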

The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [38]:
assert(count_sentiment_words(positive_words, "this is a good good good class", True) == 3)
assert(count_sentiment_words(positive_words, "this is a good\tgood\tgood class", True) == 3)
assert(count_sentiment_words(positive_words, "this is a GOOD GOOD good class", False) == 1)
assert(count_sentiment_words(positive_words, "Python is the best programming language for data science", True) == 1)
assert(count_sentiment_words(negative_words, "R is bad compared to Python ;)", True) == 1)
print("Asserts finished successfully!")

Asserts finished successfully!


## Module 3

For this exercise, you will write a function that takes two numbers as input and returns a string. Intuitively, this is a basic classification function for lexicon-based sentiment classification. 

The function should take as input parameters the number of positive (num_pos_words) and negative (num_neg_words) words in each tweet to predict sentiment. If the number of positive words is greater than the number of negative words (num_pos_words > num_neg_words), then predict **"positive"**. If the number of negative words is greater than the number of positive words (num_neg_words > num_pos_words), then predict **"negative"**. If num_pos_words and num_neg_words are equal (num_neg_words = num_pos_words), predict **"neutral"**. This decision rule is the core of lexicon-based classification.


In [39]:
def predict(num_pos_words, num_neg_words):
    # Your code here
    if num_pos_words > num_neg_words:
        prediction = 'positive'
    elif num_pos_words < num_neg_words:
        prediction = 'negative'
    else:
        prediction = 'neutral'
    return prediction # You should return a string

The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [40]:
assert(predict(2, 5) == 'negative')
assert(predict(5, 2) == 'positive')
assert(predict(3, 3) == 'neutral')
print("Assert finished successfully!")

Assert finished successfully!


## Module 4

This exercise is similar to Module 3. However, instead of making a prediction, we should write a function that returns a sentiment score. Specifically, if num_pos_words is 3 and num_neg_words is 4, the function should return -1. The idea is that the more *positive* the number, the more positive the sentiment. Likewise, the more *negative* the number, the more negative the sentiment.
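
One way to see how the score relates to the labels from Module 3 is that the sign of the score recovers the label. The sketch below illustrates this with hypothetical helper names (`sentiment_score`, `label_from_score`); it is not the required solution.

```python
def sentiment_score(num_pos_words, num_neg_words):
    """Difference between positive and negative word counts."""
    return num_pos_words - num_neg_words

def label_from_score(score):
    """Map a score to a label by its sign (hypothetical helper)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for pos, neg in [(3, 4), (5, 2), (2, 2)]:
    score = sentiment_score(pos, neg)
    print(pos, neg, score, label_from_score(score))
```

Unlike the label, the score also keeps the margin, which is what later allows ranking tweets from most negative to most positive.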

In [41]:
def predict_score(num_pos_words, num_neg_words):
    # Your code here
    score = num_pos_words - num_neg_words
    return score # You should return a number

The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [42]:
assert(predict_score(3, 1) == 2)
assert(predict_score(2, 2) == 0)
assert(predict_score(2, 5) == -3)
print("Asserts finished successfully!")

Asserts finished successfully!


## Module 5

Write a function that takes a JSON string as input and returns a Python object. Hint: This can be one line. You can use the json library.
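
In a `.jsonl` (JSON Lines) file every line is a separate JSON document, so a function like this is applied once per line. A minimal sketch with two hand-written records follows; the field names mirror the tweet fields read later in the notebook, but the values are made up.

```python
import json

# Two hand-written records in JSON Lines form (values are invented).
jsonl_text = (
    '{"full_text": "good news", "user": {"screen_name": "alice"}}\n'
    '{"full_text": "bad news", "user": {"screen_name": "bob"}}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(len(records))                       # 2
print(records[0]["user"]["screen_name"])  # alice
```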

In [43]:
import json

def json_string_to_dictionary(json_string):
    # Your code here
    mystr = json.loads(json_string)
    return mystr

The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [44]:
data = json_string_to_dictionary('{"a": 1}')
assert(data == {'a': 1})
data = json_string_to_dictionary('[1,2,3]')
assert(data == [1,2,3])
print("Assert finished successfully!")

Assert finished successfully!


## Module 6

For this task, we combine the functions written for the previous exercises to classify all of the tweets in a real Twitter dataset. You should write code that does the following:
1. Keeps track of the number of tweets
2. Keeps track of the number of positive and negative tweets
3. Keeps track of the user that tweets the most
4. Keeps track of the total number of unique users
5. Keeps track of the average number of tweets per user (how many tweets does each user make, on average)
6. Keeps track of the most positive and negative tweets.

Note: This task depends on Modules 1 through 5. You will need to complete them first. Also, do **not** store all of the tweets in a list. This will use too much memory because of the size of the dataset. It is okay to store all of the users' screen names.

Finally, the dataset is big! So, I recommend working on a subset of the dataset to make sure your code works, i.e., you could "break" after the first 100 lines.
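
The per-user statistics can be tracked in a single pass without keeping the tweets themselves. The sketch below uses `collections.Counter` on a hypothetical list of screen names standing in for the tweet stream; it is one possible approach, not the required one.

```python
from collections import Counter

# Hypothetical screen names standing in for the stream of tweets.
screen_names = ["alice", "bob", "alice", "carol", "alice", "bob"]

tweets_per_user = Counter()
total_tweets = 0
for name in screen_names:  # one pass; only per-user counts are stored
    tweets_per_user[name] += 1
    total_tweets += 1

unique_users = len(tweets_per_user)
average_per_user = total_tweets / unique_users
top_user, top_count = tweets_per_user.most_common(1)[0]

print(unique_users, average_per_user, top_user, top_count)  # 3 2.0 alice 3
```

Memory grows with the number of distinct users rather than the number of tweets, and finding the most active user avoids rescanning a list once per user.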



In [45]:
total_number_of_tweets = 0
total_number_of_positive_tweets = 0
total_number_of_negative_tweets = 0
most_positive_tweet = ''
most_negative_tweet = ''
user_with_most_tweets = ''
average_number_tweets_per_user = 0
total_number_of_users = 0
users = []
max_score, min_score = 0, 0

# NOTE: You may need to define extra variables to help generate the required output.

twitter_dataset = open('tweets_ps2.jsonl', 'r')

for row in twitter_dataset:
    tweet_dict = json_string_to_dictionary(row)
    
    ###############################
    
    tweet_text = tweet_dict['full_text'] # MODIFY THIS LINE TO GET THE "full_text" from the tweet_dict
    screen_name = tweet_dict['user']['screen_name'] # MODIFY THIS LINE TO GET THE "screen_name" from the tweet_dict
    ###############################
    
    num_pos_words = count_sentiment_words(positive_words, tweet_text, True)
    num_neg_words = count_sentiment_words(negative_words, tweet_text, True)
    
    sentiment_prediction = predict(num_pos_words, num_neg_words)
    sentiment_score = predict_score(num_pos_words, num_neg_words)
    
    ################################
    # Your code should do the following:
    #   1. Keep track of the number of tweets
    #   2. Keep track of the number of positive and negative tweets
    #   3. Keep track of the user that tweets the most
    #   4. Keep track of the total number of unique users
    #   5. Keep track of the average number of tweets per user (how many tweets does each user make, on average)
    #   6. Keep track of the most positive and negative tweets.
    
    # YOUR CODE HERE
    total_number_of_tweets += 1
    if sentiment_prediction == 'positive':
        total_number_of_positive_tweets += 1
    elif sentiment_prediction == 'negative':
        total_number_of_negative_tweets += 1    
    users.append(screen_name)  # list.append returns None, so nothing to assign
    if sentiment_score > max_score:
        max_score = sentiment_score
        most_positive_tweet = tweet_text
    elif sentiment_score < min_score:
        min_score = sentiment_score
        most_negative_tweet = tweet_text
    ################################

# You may need to have code after the for loop, depending on your implementation
# CODE HERE (Maybe, depending on your implementation)
total_number_of_users = len(set(users))
average_number_tweets_per_user = total_number_of_tweets/total_number_of_users
user_with_most_tweets = max(set(users), key=users.count)

twitter_dataset.close()

In [46]:
print("Total Number of Tweets: {}".format(total_number_of_tweets))
print("Total Number of Positive Tweets: {}".format(total_number_of_positive_tweets))
print("Total Number of Negative Tweets: {}\n".format(total_number_of_negative_tweets))

print("Most Positive Tweet")
print(most_positive_tweet)
print()

print("Most Negative Tweet")
print(most_negative_tweet)
print()

print("Total Number of Users: {}".format(total_number_of_users))
print("Average Number of Tweets per User: {}".format(average_number_tweets_per_user))
print("User with the most tweets: {}".format(user_with_most_tweets))

Total Number of Tweets: 8100
Total Number of Positive Tweets: 1946
Total Number of Negative Tweets: 2203

Most Positive Tweet
RT @Dr_Woga: I am so delighted on these proud women.

#PuertoRico #resist #Science #atheist #Impeach45   #free #tech #MAGA #trump #win #rt…

Most Negative Tweet
RT @TheDailyEdge: Trump says he's working really hard on #PuertoRico. He's either useless or a liar. Or a useless liar. https://t.co/dr8wLa…

Total Number of Users: 7512
Average Number of Tweets per User: 1.0782747603833867
User with the most tweets: Noti_PuertoRico


The lines below give example inputs and correct outputs using asserts, and can be run to test the code. Passing these tests is necessary, but **NOT** sufficient to guarantee your implementation is correct. You may add additional test cases, but do not remove any tests.

In [47]:
assert(isinstance(total_number_of_tweets, int) or isinstance(total_number_of_tweets, float))
assert(isinstance(total_number_of_positive_tweets, int) or isinstance(total_number_of_positive_tweets, float))
assert(isinstance(total_number_of_negative_tweets, int) or isinstance(total_number_of_negative_tweets, float))
assert(isinstance(most_positive_tweet, str))
assert(isinstance(most_negative_tweet, str))
assert(isinstance(user_with_most_tweets, str))
assert(total_number_of_tweets == 8100)
print("Assert finished successfully!")

Assert finished successfully!


## Module 7

For this exercise, you will perform manual analysis of the predictions. Modify the code to load the tweet text, then answer the questions below.

In [48]:
import json
twitter_dataset = open('tweets_ps2.jsonl', 'r')

num_tweets_to_print = 20

num_tweets = 0

for row in twitter_dataset:
    num_tweets += 1
    tweet_dict = json_string_to_dictionary(row)
    
    ###############################
    # YOUR CODE HERE
    tweet_text = tweet_dict['full_text'] # MODIFY THIS LINE TO GET THE "full_text" from the tweet_dict    
    ###############################
    
    num_pos_words = count_sentiment_words(positive_words, tweet_text, True)
    num_neg_words = count_sentiment_words(negative_words, tweet_text, True)
    
    sentiment_prediction = predict(num_pos_words, num_neg_words)
    
    print("Tweet {}: {}".format(num_tweets, tweet_text))
    print("Tweet {} Prediction: {}".format(num_tweets, sentiment_prediction))
    print()
    
    if num_tweets == num_tweets_to_print:
        break
    
twitter_dataset.close()

Tweet 1: RT @daddy_yankee: I know the reconstruction of my home island will requiere long-term solutions. - go to the link and help me raise more mo…
Tweet 1 Prediction: neutral

Tweet 2: RT @daddy_yankee: I know the reconstruction of my home island will requiere long-term solutions. - go to the link and help me raise more mo…
Tweet 2 Prediction: neutral

Tweet 3: RT @USNavy: #USNSComfort arrives in #SanJuan, #PuertoRico to support Hurricane #Maria relief - https://t.co/wkKaHc1LD2 @potus @fema @ricard…
Tweet 3 Prediction: positive

Tweet 4: PACKED at @MoMAPS1 before star-studded #PuertoRico fundraiser w/ @JimmyVanBramer @popdemoc + more! https://t.co/fcBt5uqfym
Tweet 4 Prediction: neutral

Tweet 5: RT @ExDemLatina: .@CarmenYulinCruz is a Lying policial Corrupt hack! 
She has time to make another shirt for media rounds. #PuertoRico #San…
Tweet 5 Prediction: negative

Tweet 6: RT @MStrooo6: Just trying to do my part. Please help me in my efforts to raise money for Puerto Rico! All donati

Complete the following tasks:
 
- Manually annotate all of the tweets printed above:
   1. Tweet 1: neutral
   1. Tweet 2: neutral
   1. Tweet 3: positive
   1. Tweet 4: neutral
   1. Tweet 5: negative
   1. Tweet 6: neutral
   1. Tweet 7: negative
   1. Tweet 8: neutral
   1. Tweet 9: positive
   1. Tweet 10: negative
   1. Tweet 11: neutral
   1. Tweet 12: negative
   1. Tweet 13: neutral
   1. Tweet 14: negative
   1. Tweet 15: positive
   1. Tweet 16: positive
   1. Tweet 17: positive
   1. Tweet 18: positive
   1. Tweet 19: negative
   1. Tweet 20: positive

- How many of the predictions are right or wrong compared to your annotations?
    - Four of the predictions are wrong compared to my annotations, namely tweets 10, 16, 17, and 18; the other 16 match.
    
- Do you see any major limitations of lexicon-based classification (i.e., making sentiment predictions using individual words)? Use your intuition; I will accept most answers, as long as they make some sense. Please describe and provide examples below:

- Yes. A major limitation of lexicon-based classification is that words are scored in isolation, so the lexicon can assign the wrong sentiment to opinion words used sarcastically, and it misses sentiment words that it does not contain. For instance:
  - Tweet 10 is predicted to be positive. However, the expression 'Morons are way smarter and far more compassionate.' is sarcastic and conveys negative sentiment. The positive word list includes 'smarter', 'compassionate', and 'better', while the negative word list includes 'moron', resulting in a sentiment score greater than zero.
  - Tweet 16 is predicted to be neutral because the words 'thank' and 'effort' are not in the positive word list. In fact, the user expresses gratitude.
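
The context problem can be reproduced with a small sketch. The lexicons below are toy stand-ins (the real word lists are much larger), and `lexicon_label` is a hypothetical helper that applies the same count-and-compare rule used throughout this notebook.

```python
# Toy lexicons for illustration; the real word lists are much larger.
POSITIVE = {"good", "smarter", "compassionate"}
NEGATIVE = {"bad", "morons"}

def lexicon_label(text):
    """Label text by comparing counts of positive and negative tokens."""
    tokens = text.lower().replace(".", " ").split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

# Negation flips the meaning, but the word count does not see it.
print(lexicon_label("This is not good"))  # positive
# Sarcasm: two positive words outvote one negative word.
print(lexicon_label("Morons are way smarter and far more compassionate."))  # positive
```

Both sentences read as negative to a person, yet the counting rule labels them positive because it never looks at word order or tone.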

## Module 8: Comparing Lexicon and RandomForest Classifiers

For this exercise, you will use a different dataset that is already split into train and test sets. Both *allTrainingData.tsv* and *twitdata_TEST.tsv* include the gold label column. The first cell of code loads the datasets into 4 lists of strings:
 - X_txt_train  (these are the features for training the classifier)
 - X_txt_test   (you test the classifier against these features)
 - y_test       (gold labels for the test dataset)
 - y_train      (gold labels for the train dataset)
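
Loading a TSV file into parallel text and label lists can be sketched as follows. The in-memory data, the header line, and the first two column names are assumptions made for illustration; only the positions of the label (column 2) and the text (column 3) follow the loading cell.

```python
import csv
import io

# In-memory stand-in for a TSV file. The header line and the first two
# column names are assumptions; label in column 2 and text in column 3
# follow the indices used by the loading cell.
tsv_text = (
    "id\ttopic\tlabel\ttext\n"
    "1\tdemo\tpositive\tgreat job everyone\n"
    "2\tdemo\tnegative\tthis is terrible\n"
)

texts, labels = [], []
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader, None)  # skip the (assumed) header row
for row in reader:
    texts.append(row[3])
    labels.append(row[2])

print(texts)   # ['great job everyone', 'this is terrible']
print(labels)  # ['positive', 'negative']
```

`csv.QUOTE_NONE` keeps quotation marks inside tweets as ordinary characters instead of treating them as field delimiters.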

You will perform the following tasks:
1. Use a Lexicon Classifier (count of positive and negative words) to classify the tweets in the **test dataset**
   - The classification should be *neutral, positive, or negative*
   - Calculate the F1 score for the Lexicon Classifier
2. Use a different classifier (e.g., RandomForestClassifier) as follows:
    - Train the selected classifier using the **train dataset**
    - Using the trained classifier, predict the labels for the **test dataset**
    - Calculate the F1 score
3. Compare the two F1 scores: which one is better?
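
The cells below pass `average="micro"` to the scikit-learn metric functions. For single-label multiclass data, micro-averaged precision, recall, and F1 all equal accuracy, whereas macro-averaging weights every class equally. The pure-Python sketch below illustrates the difference on made-up labels; `f1_scores` is a hypothetical helper, not a scikit-learn function.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Made-up gold labels and predictions.
y_true = ["positive", "negative", "neutral", "neutral", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "neutral", "neutral", "negative"]
micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(micro, 4), round(macro, 4), round(accuracy, 4))  # 0.6667 0.5079 0.6667
```

In this toy case the micro score matches accuracy, while the macro score is lower because the negative class is never predicted correctly.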

In [49]:
# Load the test and training data
import csv

X_txt_train = []
y_train = []
X_txt_test = []
y_test = []

# Column 3 holds the tweet text and column 2 the gold label.
# NOTE: every row is read, including the first; if the files start with a
# header line, skip it with next(test_reader, None) before the loop.
with open('twitdata_TEST.tsv', encoding='utf8') as test_file:
    test_reader = csv.reader(test_file, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in test_reader:
        X_txt_test.append(row[3])
        y_test.append(row[2])
with open('allTrainingData.tsv', encoding='utf8') as train_file:
    train_reader = csv.reader(train_file, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in train_reader:
        X_txt_train.append(row[3])
        y_train.append(row[2])
print(len(X_txt_train), len(X_txt_test))

8018 3199


In [50]:
# Write and document your code here
# Step 1: Lexicon Classifier
class LexiconClassifier():
    """Classify text by counting words from positive and negative lexicons."""

    def __init__(self):
        # Load one word per line from each lexicon file
        self.positive_words = set()
        with open('positive-words.txt', encoding='utf-8') as iFile:
            for row in iFile:
                self.positive_words.add(row.strip())

        self.negative_words = set()
        with open('negative-words.txt', encoding='iso-8859-1') as iFile:
            for row in iFile:
                self.negative_words.add(row.strip())

    def predict(self, sentence):
        """Return 'positive', 'negative', or 'neutral' for a sentence."""
        num_pos_words = 0
        num_neg_words = 0
        # A word found in the positive lexicon is not also counted as negative
        for word in sentence.lower().split():
            if word in self.positive_words:
                num_pos_words += 1
            elif word in self.negative_words:
                num_neg_words += 1

        pred = 'neutral'
        if num_pos_words > num_neg_words:
            pred = 'positive'
        elif num_pos_words < num_neg_words:
            pred = 'negative'

        return pred

    def count_pos_words(self, sentence):
        """Count whitespace-separated words found in the positive lexicon."""
        num_pos_words = 0
        for word in sentence.lower().split():
            if word in self.positive_words:
                num_pos_words += 1
        return num_pos_words

    def count_neg_words(self, sentence):
        """Count whitespace-separated words found in the negative lexicon."""
        num_neg_words = 0
        for word in sentence.lower().split():
            if word in self.negative_words:
                num_neg_words += 1
        return num_neg_words


In [51]:
# Step 1.2 Calculate F1 for the Lexicon Classifier
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
myLC = LexiconClassifier()
lex_test_preds = []
for t in X_txt_test:
    lex_test_preds.append(myLC.predict(t))
precision = precision_score(y_test, lex_test_preds, average="micro")
recall = recall_score(y_test, lex_test_preds, average="micro")
f1 = f1_score(y_test, lex_test_preds, average="micro")
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1: {:.4f}".format(f1))

Precision: 0.5827
Recall: 0.5827
F1: 0.5827


In [52]:
# Step 2: Use a different classifier
# Step 2.1 Train the selected classifier using X_txt_train and y_train
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
## Do not change the following 4 lines
import numpy as np
np.random.seed(42)
import random
random.seed(42)
# WRITE CODE HERE
vec = CountVectorizer(ngram_range=(1,1))
vec.fit(X_txt_train)
X_train = vec.transform(X_txt_train)
X_test = vec.transform(X_txt_test)

In [53]:
# Step 2.2  Predict the labels for the test dataset X_txt_test
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
rf_test_predictions = clf.predict(X_test)

In [54]:
# Step 2.3 Calculate the F1 score for the selected classifier

f1 = f1_score(y_test, rf_test_predictions, average="micro")
precision = precision_score(y_test, rf_test_predictions, average="micro")
recall = recall_score(y_test, rf_test_predictions, average="micro")
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1: {:.4f}".format(f1))

Precision: 0.6114
Recall: 0.6114
F1: 0.6114


In [56]:
# Step 3: Compare and discuss the two results.
print('RandomForestClassifier has a higher micro-averaged F1 score (0.6114) than LexiconClassifier (0.5827).')

RandomForestClassifier has a higher micro-averaged F1 score (0.6114) than LexiconClassifier (0.5827).
