# Analysis of Tweet Sentiment with Natural Language Processing (NLP) 
# and Naive Bayes

Instructions from SentimentAnalysis.pdf

Machine learning models allow us to process vast amounts of data quickly and build models
which attempt to understand the data. For example, email systems use NLP as part of their
spam detection systems. Many social networking platforms such as Facebook and Twitter use
such systems to flag offensive posts for review or elimination from the site. Success of such
systems is paramount to elimination of undesirable content while allowing desired content to
flow unimpeded.

In this project you will attempt to analyze a collection of tweets to determine whether the
messages carry a positive or negative sentiment using a Naive Bayes classifier. Given a
collection of positive and negative labeled tweets your model must process and analyze the text
and learn how to classify a given tweet as having a positive or negative sentiment. Then, you
will have a trained system that you can use to analyze some tweets around the election to find
trends.



Hints and Tricks:

Tasks you may need to perform:

● Determine the features of the message which are used for sentiment analysis i.e. words
or frequency of words which matter.

● Process raw tweet messages and create a dictionary of words which appear in tweets.

● Make encoded tweet features i.e., fixed length vectors for all tweets.

● Train model from encoded features and sentiments.

● Predict outcome for new unseen tweets.
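
The dictionary and fixed-length vector tasks listed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the required design: the helper names `build_vocabulary` and `encode` and the two sample tweets are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter

def build_vocabulary(tweets, min_count=1):
    """Map each word seen at least min_count times to a column index."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: i for i, word in enumerate(kept)}

def encode(tweet, vocabulary):
    """Return a fixed-length count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in tweet.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Hypothetical sample data, for illustration only
sample_tweets = ["good good day", "bad day"]
vocab = build_vocabulary(sample_tweets)
vectors = [encode(t, vocab) for t in sample_tweets]
```

Every tweet maps to a vector of the same length (the vocabulary size), which is the shape a classifier expects as input.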


Example of feature selection:

● Clean up text (Remove retweet tags, punctuations, …)

● Remove all words that are likely to be irrelevant (a, an, the, ...)

● Stem words (Find common root for words i.e., generally bird and birds should be counted
as bird in the dictionary)

● Count the association of all words with positive and negative sentiment

● Encode the tweets. For example tweet features are <Positives, Negatives, Sentiment>
where

○ Positives: sum of the positive sentiment counts for each word in the tweet

○ Negatives: sum of the negative sentiment counts for each word in the tweet

○ Sentiment: 1 for Positive and 0 for Negative
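
The encoding described above could look roughly like the sketch below. It assumes whitespace tokenization and skips the cleaning and stemming steps; the function names `count_associations` and `encode_tweet` and the tiny training set are hypothetical.

```python
from collections import Counter

def count_associations(labeled_tweets):
    """Count how often each word appears in positive (1) and negative (0) tweets."""
    pos_counts, neg_counts = Counter(), Counter()
    for sentiment, text in labeled_tweets:
        target = pos_counts if sentiment == 1 else neg_counts
        target.update(text.lower().split())
    return pos_counts, neg_counts

def encode_tweet(text, sentiment, pos_counts, neg_counts):
    """Encode one tweet as (Positives, Negatives, Sentiment)."""
    words = text.lower().split()
    positives = sum(pos_counts[w] for w in words)  # a Counter returns 0 for unseen words
    negatives = sum(neg_counts[w] for w in words)
    return positives, negatives, sentiment

# Hypothetical training data, for illustration only
training = [(1, "love this"), (1, "love it"), (0, "hate this")]
pos_counts, neg_counts = count_associations(training)
features = encode_tweet("love this", 1, pos_counts, neg_counts)
```

Here "love" occurs twice and "this" once in positive tweets, and "this" once in negative tweets, so the tweet "love this" is encoded as (3, 1, 1).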

You don’t have to build the NLP system from scratch. Modules, like the one found at
https://github.com/necromuralist/Neurotic-Networking, can be utilized to help you in your project.

Data Files:
Test and Training data file format:
tab delimited
classification is 0/1 = negative / positive
Line schema: classification <tab> tweet
Trial data format:
tab delimited
Line schema: Username <tab> user screen name <tab> time <tab> is a retweet <tab>
tweet text
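
One possible way to read the test and training format is sketched below, using Python's built-in `csv` module. The function name `read_labeled_tweets` and the in-memory sample are hypothetical; quoting is disabled on the assumption that quotation marks in tweets are ordinary text.

```python
import csv
import io

def read_labeled_tweets(handle):
    """Yield (classification, tweet) pairs from 'classification<TAB>tweet' lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Rejoin any extra tabs so the whole tweet text is kept
        label, text = row[0], "\t".join(row[1:])
        yield int(label), text

# Hypothetical sample standing in for a real file handle
sample = io.StringIO("1\tgreat day\n0\tso tired \"today\"\n")
rows = list(read_labeled_tweets(sample))
```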

What to turn in:
A zip file with:

(50pts) Documented .py file(s) to train and test your classifier, read in data files, and
produce plots

(10pts) README file with instructions how to run your code

(40pts) A pdf file with two sections:

Section 1- description of your code/design for Bayesian classification; accuracy
data demonstrating the performance on the testing/training data; analysis of what improved
accuracy

Section 2- analysis of the additional tweets around the election demonstrating
trends around the dates surrounding the election and/or geographic analysis of
sentiment.
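
For the trend analysis in Section 2, one simple summary is the share of positive predictions per day. The sketch below assumes each tweet has already been reduced to a (day, predicted label) pair; how the day is extracted from the time field is left open because the timestamp format is not specified here. The name `daily_positive_share` and the sample values are hypothetical.

```python
from collections import defaultdict

def daily_positive_share(predictions):
    """predictions: iterable of (day, label) pairs, label 1 = positive, 0 = negative."""
    totals = defaultdict(lambda: [0, 0])  # day -> [positive count, tweet count]
    for day, label in predictions:
        totals[day][0] += label
        totals[day][1] += 1
    # Sort by day key so the result can be plotted in order
    return {day: pos / n for day, (pos, n) in sorted(totals.items())}

# Hypothetical predictions, for illustration only
shares = daily_positive_share([("day1", 1), ("day1", 0), ("day2", 1)])
```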

WARNING: Some of these tweets may contain offensive language-- potentially important to sentiment analysis.  Please let an instructor know if this causes any discomfort, as we can accommodate.  But, we will need to know quickly as the assignment will be modified.  

4/21 - It was noted that carriage return characters slipped through in the training and test files.  Depending on how you choose to parse the files, this may or may not be an issue.  If this affects your program, you can use the cleaned versions of the train and test files (noCR_train.txt and noCR_test.txt).
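
If you parse the original files yourself, a tolerant line parser is one way to handle stray carriage returns. This is a sketch only: `parse_line` is a hypothetical helper, and it assumes the carriage returns may appear either at the end of a line or inside the tweet text.

```python
def parse_line(line):
    """Split one 'classification<TAB>tweet' line, tolerating stray carriage returns."""
    label, _, text = line.rstrip("\r\n").partition("\t")
    # Replace any carriage return left inside the tweet with a space
    return int(label), text.replace("\r", " ").strip()

# Hypothetical line containing carriage returns
parsed = parse_line("1\tnice\rday\r\n")
```

When reading from disk, opening the file with `open(path, newline="\n")` keeps Python from treating a lone carriage return as a line break, so each record still arrives as one line.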

4/26- There does seem to be some missing data from November 10th in the tweets from around the election.  From the original data files, it looks like a potential collection issue.

Also, a second election tweets file has been added.  Tweet_Election_2.txt contains location information on each tweet as well. The location is what the user entered, so it may be a city, country, or nonsense.

Line schema for the new file is : Username <tab> user screen name <tab> time <tab> is a retweet <tab> location <tab> tweet text
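
A minimal sketch of parsing the six-field schema follows. The names `ElectionTweet` and `parse_election_line` and the sample line are hypothetical, and the time and retweet fields are kept as raw strings because their exact formats are not given here.

```python
from typing import NamedTuple

class ElectionTweet(NamedTuple):
    username: str
    screen_name: str
    time: str        # raw text; the timestamp format is not specified here
    is_retweet: str  # raw text for the same reason
    location: str    # free text typed by the user; may be empty or nonsense
    text: str

def parse_election_line(line):
    """Parse one line of the six-field schema; return None if fields are missing."""
    # maxsplit=5 keeps any tab inside the tweet text in the final field
    fields = line.rstrip("\r\n").split("\t", 5)
    if len(fields) != 6:
        return None
    return ElectionTweet(*fields)

# Hypothetical sample line, for illustration only
tweet = parse_election_line("Ann\t@ann\t<time>\tFalse\tParis\tvote today\n")
```

Because the location is free text, any geographic analysis would still need a separate step to map these strings to real places.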

In [5]:
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from nltk.tokenize import TweetTokenizer
from pandas import read_csv
from sklearn import metrics
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from wordcloud import WordCloud, STOPWORDS 
import gzip
import itertools
import matplotlib.pyplot as plt 
import nltk
import numpy as np 
import os
import pandas as pd 
import plotly.express as px
import re
import scipy as sp
import seaborn as sns
import struct
import sys
import time
import warnings
nltk.download('stopwords')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# Read in our training dataset
# The data is tab-separated (TSV), so we pass the tab character as the delimiter
# Note: Data is pulled from a privately owned, password-protected GitHub repo that will not be shared
# or distributed by the author

# Define link to training dataset
link_to_train = "https://github.com/aasegura/probable-invention/blob/main/Tweets_train.txt"

# We have two columns of interest, ['score','text']; define them before reading
label = ['score','text']

# Read in the tab-delimited file
df_train = pd.read_csv(link_to_train, delimiter = '\t', header=None, names = label)

HTTPError: HTTP Error 404: Not Found

In [4]:
# Import the data


# The columns are tab-delimited; the tabs are removed when the file is parsed
label = ['score','text']


HTTPError: HTTP Error 404: Not Found

In [None]:
# Create function words_cloud for use below
def words_cloud(words):
    # Lowercase every token and join everything into one space-separated string
    comment_words = ' '.join(token.lower() for val in words for token in str(val).split())

    # Build the cloud, ignoring the stop words bundled with the wordcloud package
    wordcloud = WordCloud(width = 800, height = 800,
                  background_color ='white',
                  stopwords = set(STOPWORDS),
                  min_font_size = 10).generate(comment_words)

    # Show the cloud without axes or padding
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)

    plt.show()

In [None]:
# Data cleaning step
# Removes the tokens 'http' and 'co' (as in .com) from a list of words
def remove_http_co(words):
    # Return a new filtered list instead of relying on global arrays
    return [word for word in words if word not in ('http', 'co')]


In [None]:
# Creates a data cleaning function 'clean' to clean the training dataset
def clean(df_train):

    # Work on a copy so the caller's DataFrame is not modified
    df_train = df_train.copy()

    # Make sure every tweet is a string, lowercase it, and collapse whitespace
    df_train['text'] = df_train['text'].astype(str)
    df_train['text'] = df_train['text'].apply(lambda x: " ".join(w.lower() for w in x.split()))

    # Replace everything that is not a word character or whitespace with a space
    df_train['text'] = df_train['text'].str.replace(r'[^\w\s]', ' ', regex=True)

    # Removing any stop words
    stop = set(stopwords.words('english'))
    df_train['text'] = df_train['text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

    # Tokenize each tweet with NLTK's TweetTokenizer: one [score, tokens] pair per tweet
    tknzr = TweetTokenizer()
    temp_df_train = [[score, tknzr.tokenize(text)]
                     for score, text in zip(df_train['score'], df_train['text'])]

    # Removing symbols and signs: one [score, word] row per token, letters only
    rows = [[score, re.sub(r'[^A-Za-z]', '', token)]
            for score, tokens in temp_df_train for token in tokens]
    df_words = pd.DataFrame(rows, columns = label)

    # Drop tokens that became empty or blank after stripping non-letters
    df_words = df_words[df_words['text'].str.strip() != '']

    # Creating two different lists of words for the separate sentiments
    positive_word = df_words.loc[df_words.score == 1, 'text'].tolist()
    negative_word = df_words.loc[df_words.score == 0, 'text'].tolist()

    # Making word frequency DataFrames for each sentiment
    word_freq_pos = pd.Series(positive_word).value_counts().to_frame().reset_index()
    word_freq_neg = pd.Series(negative_word).value_counts().to_frame().reset_index()

    return temp_df_train, word_freq_pos, word_freq_neg, positive_word, negative_word

In [23]:
nltk.download(["names","stopwords", "state_union", "twitter_samples", "movie_reviews", "averaged_perceptron_tagger","vader_lexicon", "punkt"])

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\aasegura\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_ta

True

In [24]:
class classifier:

    # Constructor for Good and Bad histograms
    def __init__(self):
        self.word_good = {}
        self.word_bad = {}

    # Convert the word counts of one class into probabilities P(word | class)
    def calculate_prior(self, histogram):
        # Find the total number of word occurrences
        total = sum(histogram.values())

        # Word occurrence / total occurrences in doc, per word
        for key in histogram:
            histogram[key] /= total

    # Decide whether a token should be ignored (stop word, punctuation, name, or link)
    def _skip(self, word):
        return (word in stopwords or word in punc or word in names
                or "//" in word or "http" in word)

    def dict_add(self, words, histogram_A, histogram_B):
        for word in words:
            if self._skip(word):
                continue
            # Count the word for its own class; a new word starts at 2
            # (1 observed occurrence + 1 for add-one smoothing)
            if word in histogram_A:
                histogram_A[word] += 1
            else:
                histogram_A[word] = 2
            # Give the other class a smoothing count of 1 so no probability is 0
            if word not in histogram_B:
                histogram_B[word] = 1

    def train(self, X, y):
        for i in range(0, len(y)):
            words = nltk.word_tokenize(X[i].lower())
            if (y[i] == 0):
                self.dict_add(words, self.word_bad, self.word_good)
            elif (y[i] == 1):
                self.dict_add(words, self.word_good, self.word_bad)
        # Calculate per-class word probabilities
        self.calculate_prior(self.word_good)
        self.calculate_prior(self.word_bad)

    # Based on Bayes, predict what the labels should be given X
    def predict(self, X):
        labels = []
        for tweet in X:
            words = nltk.word_tokenize(tweet.lower())
            good = 1
            bad = 1
            for word in words:
                if self._skip(word):
                    continue
                # Only words seen in training contribute; dict_add keeps both
                # histograms on the same keys, so word_bad[word] exists too
                if word in self.word_good:
                    good *= self.word_good[word]
                    bad *= self.word_bad[word]
            if (good > bad):
                labels.append(1)
            else:
                labels.append(0)
        return labels

    # Collect the tweets whose predicted label differs from the true label
    def get_miss(self, X, y, y_pred):
        size = len(X)
        miss = []
        for i in range(0, size):
            if (y[i] != y_pred[i]):
                miss.append(X[i])
        print('Misclassified tweets: %d' % len(miss))
        return miss

    # Count the words that occur in the misclassified tweets
    def analysis(self, X, y, y_pred):
        miss = self.get_miss(X, y, y_pred)
        miss_dict = {}
        for tweet in miss:
            words = nltk.word_tokenize(tweet.lower())
            for word in words:
                if word in stopwords:
                    continue
                if word in punc:
                    continue
                if '//' in word:
                    continue
                if word in miss_dict:
                    miss_dict[word] += 1
                else:
                    miss_dict[word] = 1
        return miss_dict
    
    # Notes from OH meeting
    # Use bag of words approach from class
    # Feature extraction of Tweets
    # For each Tweet, good/bad and clean up Tweets,
    # e.g. remove stop words, 'a', 'the', root of word such that we are simplifying words
    # Use NLP library
    # Remove punct, #,
    # leave phrase as it may be helpful
    # Remove words smaller than 3 maybe?
    # Emoticons, emojis may communicate information
    # Frequency of words, with prior prob of 'mean' vs 'nice' Tweet
    # Prob of word given classification
    # + or multiple
    # Review lecture for more details
    # Need prob to account for inconsistencies, do not want a prob of 0
    # Laplacian modifier
    # If root words are not removed then we may generate errors or performance suffers.
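
One way to act on the notes about avoiding a probability of 0 is add-one (Laplace) smoothing, combined with summing log-probabilities instead of multiplying raw probabilities. The sketch below is a self-contained illustration, not the class defined in this notebook: the name `TinyNaiveBayes`, the whitespace tokenization, and the two training tweets are assumptions, and it expects both classes to appear in the training labels.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, tweets, labels):
        self.counts = {0: Counter(), 1: Counter()}  # word counts per class
        self.docs = Counter(labels)                 # tweet counts per class
        for text, label in zip(tweets, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_score(self, text, label):
        # log P(class) + sum of log P(word | class); smoothing keeps every term finite
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for word in text.lower().split():
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((self.counts[label][word] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        return 1 if self.log_score(text, 1) > self.log_score(text, 0) else 0

# Hypothetical training data, for illustration only
model = TinyNaiveBayes().fit(["nice happy day", "mean awful day"], [1, 0])
```

Adding logs gives the same ranking as multiplying probabilities but avoids numeric underflow on long inputs, and the `+ 1` in the numerator is what prevents a single unseen word-class pair from zeroing out a class.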
    

In [13]:
# names = nltk.corpus.punk.words()

In [25]:
# Set up the words that will likely affect our accuracy
punc = '''!()-[]{};:'"\,<>./?@#%^$^&*_~'''
stopwords = nltk.corpus.stopwords.words("english")
names = nltk.corpus.names.words()


In [21]:
# Initialization of Bayes Classifier
bayes = classifier()

# Trains classifier
bayes.train(X_train, y_train)

# Predictions for train and test
y_train_pred = bayes.predict(X_train)
y_test_pred = bayes.predict(X_test)

# Training and testing accuracies: fraction of predictions that match the labels
train_accuracy = np.mean(np.asarray(y_train) == np.asarray(y_train_pred))
test_accuracy = np.mean(np.asarray(y_test) == np.asarray(y_test_pred))

# Prints out our accuracies
print('Training Accuracy: %.3f%%' % (train_accuracy * 100))
print('Testing Accuracy: %.3f%%' % (test_accuracy * 100))


NameError: name 'classifier' is not defined

In [19]:
# Misclassification Analysis
miss_classified = bayes.analysis(X_train, y_train, y_train_pred)

print("Total number of words in our misclassified Tweets: ", len(miss_classified))

# Count the words from misclassified Tweets that never appeared in training
Counter = 0
for key in miss_classified:
    if not key in bayes.word_good:
        Counter = Counter + 1

print("Percent of words not appeared in training: %.3f%%" % (Counter/len(miss_classified)*100))


NameError: name 'bayes' is not defined

In [20]:
print(bayes.word_good.keys())

NameError: name 'bayes' is not defined