# Twitter Sentiment Analysis
#### Author: Harsh Tandon

### About the homework

In part 1 of this homework, we analyze the sentiment of tweets about **'Trump'**, and in part 2 we analyze the sentiment of tweets about Indian Prime Minister **'Modi'**.<br>
We begin by getting data from Twitter using the Twitter API. We then compare the words in the tweets with lists of positive and negative words to evaluate the sentiment. We also account for stop words; any word that falls into none of these categories is classified as 'others'.  

### Import libraries and set up jupyter interactivity

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world. The Python module `re` provides regular expression matching operations similar to those found in Perl.<br> We will use regular expressions to clean the tweet data. 

In [3]:
#for regular expressions 
import re 
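
As a quick illustration of the kind of substitution used later for cleaning, the sketch below strips mentions and the hashtag symbol from a made-up tweet. The sample string and the variable names are assumptions for illustration only; they are not part of the data set.

```python
import re

# Hypothetical sample text, not taken from the data set
sample = "RT @someuser: Great day for #python fans!"

# Remove @mentions: '@' followed by one or more word characters
no_mentions = re.sub(r'@\w+', '', sample)

# Remove the '#' symbol but keep the hashtag word itself
no_hash = re.sub(r'#', '', no_mentions)

print(no_hash)
```

`re.sub(pattern, replacement, string)` returns a new string and leaves the original unchanged, which is why each step assigns its result to a new name.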

In [4]:
#print all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Part 1 - Sentiment Analysis for Trump

### Read the data files

We read the data extracted through the Twitter API, which is stored in a .txt file.<br>
Next, we create an empty list called **'tweets'** and fill it with the pieces of text from the file, split on the '\n' sequence (a backslash followed by the letter n) that appears inside the saved tweets. 

In [5]:
tweets = [] #creating an empty list

#extracting each line from txt file and appending it in the list
with open("Trump.txt") as myfile: #open the file containing all tweets; it is closed automatically
    for line in myfile:
        #r"\n" is a literal backslash followed by 'n', not a newline character
        for j in line.split(r"\n"):
            tweets.append(j)
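
The difference between splitting on `r"\n"` and on `"\n"` is easy to miss, so here is a minimal sketch of it. The sample string is an assumption, written to resemble text in which line breaks were saved as the two characters backslash and n.

```python
# Hypothetical text in which "\n" was written out as two characters:
# a backslash followed by the letter n
line = 'first part\\n\\nsecond part'

# r"\n" is the two-character sequence backslash + n, so this splits the text
literal_split = line.split(r"\n")

# "\n" is a real newline character, which this text does not contain
newline_split = line.split("\n")

print(literal_split)
print(newline_split)
```

The first split yields three pieces, including an empty string between the two consecutive separators; the second leaves the text in one piece.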

In [6]:
tweets #let's print the list and see what it looks like

['b\'RT @TeaPainUSA: Trump on Lev Parnas (just now): "I don\\\'t even know who this man is."',
 '',
 'This will bite him.',
 '\'b\'TIL, Trump fans don\\\'t know how to use "sit this one out," but they_really_want to',
 "'b'Pelosi Loses It On House Floor, Compares Trump To Murdering Mobster https://t.co/J3zMEsaoqs via @realmattcouch",
 "'b'Nancy Pelosi is eaten alive with jealousy, rage , hatred. While her constituents sink in piles of feces and needles\\xe2\\x80\\xa6 https://t.co/4U9KfqNjbC",
 '\'b\'RT @davecclarke: Quite the @nytimes review for "A Very Stable Genius" by @PhilipRucker and @CarolLeonnig "They\\xe2\\x80\\x99re meticulous journalists, a\\xe2\\x80\\xa6',
 '\'b"RT @BernieSanders: Not only will we reverse the damage Trump has done to workers\' rights, we will expand those rights and double union memb\\xe2\\x80\\xa6',
 '"b\'RT @SJW_ForAll: The members of this jury who are @GOP members holding their hands up taking an oath, swearing to be impartial is an absolut\\xe2\\x80\\xa6

### Data Cleaning

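The data looks dirty, right? Let's clean it with the help of regular expressions.<br>
We loop over each entry in our list of tweets and split it on ':'. If the split produces more than one part, we keep only the text after the last colon (which drops prefixes such as 'RT @user:') and clean it with regular expressions. Entries that contain no colon are skipped, and because the split also happens at the colon in 'https:', any text before a link is discarded as well. <br>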

In [7]:
clean_tweets = [] #create an empty list which will contain cleaned tweets
for tweet in tweets:
    tweet_parse = tweet.split(':')
    if len(tweet_parse) > 1:                               #proceed only if the tweet contains at least one ':'
        tweet = tweet_parse[-1]                            #keep only the text after the last ':'
        tweet = re.sub(r'\\x[0-9a-f][0-9a-f]', "", tweet)  #remove escaped non-ASCII bytes (emoji, curly quotes, ellipses) such as \xe2
        tweet = re.sub(r'@(\w)+', "", tweet)               #remove mentions and usernames
        tweet = re.sub('&amp', "", tweet)                  #remove HTML-escaped ampersands (the trailing ';' is left behind)
        tweet = re.sub(r'//t.co/(\w)+', "", tweet)         #remove shortened hyperlinks
        tweet = re.sub('//t.c', "", tweet)                 #remove truncated hyperlinks
        tweet = re.sub(r'\\', "", tweet)                   #remove remaining backslashes
        tweet = re.sub('#', "", tweet)                     #remove the hashtag symbol
        tweet = re.sub("'", " ", tweet)                    #replace apostrophes with a space
        tweet = tweet.replace(",", "")                     #remove commas
        tweet = tweet.replace("*", "")                     #remove asterisks
        tweet = tweet.replace("//t", "")                   #remove leftover link fragments
        tweet = tweet.replace('"', "")                     #remove double quotes
        tweet = tweet.replace(".", "")                     #remove periods
        clean_tweets.append(tweet.lower())                 #after cleaning, each tweet is lowercased and appended to clean_tweets

# Create a list of words from cleansed tweets
clean_tweet_words = [] #create an empty list for cleaned tweet words
for i in clean_tweets:
    s = i.split()
    for j in s:
        clean_tweet_words.append(j) #append cleaned words to a list
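
The cleaning steps above can also be collected into a small function, which makes them easier to try on a single tweet. This is only a sketch: the function name `clean_tweet` and the sample input are assumptions, and only a subset of the substitutions is shown.

```python
import re

def clean_tweet(raw):
    """Apply a subset of the cleaning steps to one raw tweet string."""
    text = raw.split(':')[-1]                    # keep the text after the last colon
    text = re.sub(r'\\x[0-9a-f]{2}', '', text)   # drop escaped bytes such as \xe2
    text = re.sub(r'@\w+', '', text)             # drop @mentions
    text = re.sub(r'[#,*".]', '', text)          # drop leftover punctuation
    return text.lower()

# Hypothetical raw tweet written in the same style as the data
raw = r"b'RT @someone: Great speech by @other today\xe2\x80\xa6"
cleaned_words = clean_tweet(raw).split()
print(cleaned_words)
```

Calling `split()` with no arguments also discards the extra spaces left behind where mentions were removed.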

In [8]:
print(clean_tweet_words) #lets have a look at the list of cleaned words



### What are positive, negative and stop words?

Before we dive into analyzing our list of clean words from the tweets, we need something to compare them to.<br>
Stop words are very common words (such as 'a', 'about' and 'and') that carry little sentiment on their own, while positive and negative words signal a favorable or unfavorable opinion.<br>
We read three text files, containing stop words, positive words and negative words respectively.<br>
We create three empty lists and append each category of words to its respective list. 

In [9]:
stopWords = [] #an empty list which will hold stop words
positive = [] #an empty list which will hold positive words
negative = [] #an empty list which will hold negative words

#read all the stop words and append each of them in the stopWords list
with open('stopwords.txt', 'r') as f:
    for word in f:
        word = word.split('\n')
        stopWords.append(word[0])
        
#read all the positive words and append each of them in the positive list
with open('positive.txt', 'r') as f:
    for word in f:
        word = word.split('\n')
        positive.append(word[0])

#read all the negative words and append each of them in the negative list
with open('negative.txt', 'r') as f:
    for word in f:
        word = word.split('\n')
        negative.append(word[0])
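
Since the three loops above differ only in the file name, they could be folded into one helper. The sketch below is an assumption, not the notebook's method: the name `load_words` is hypothetical, it returns a set rather than a list (so duplicates and order are lost, but membership tests are faster), and a temporary file stands in for a real word list.

```python
import os
import tempfile

def load_words(path):
    """Read one word per line and return the words as a set."""
    with open(path, 'r') as f:
        return {line.rstrip('\n') for line in f}

# Demonstration with a temporary file standing in for a real word list
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_words.txt')
    with open(demo_path, 'w') as f:
        f.write('good\ngreat\nhappy\n')
    demo_words = load_words(demo_path)

print(sorted(demo_words))
```

With the real files, this would be called once per list, for example `load_words('stopwords.txt')`.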

In [10]:
print (stopWords) #lets have a look at the stopWords

['', '!!', '?!', '??', '!?', '`', '``', "''", '-lrb-', '-rrb-', '-lsb-', '-rsb-', ',', '.', ':', ';', '"', "'", '?', '<', '>', '{', '}', '[', ']', '+', '-', '(', ')', '&', '%', '$', '@', '!', '^', '#', '*', '..', '...', "'ll", "'s", "'m ", 'a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'ear

In [11]:
print (positive) #lets have a look at the positive list

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous', 'advantageously', 'advantages', 'adventuresome', 'adventurous', 'advocate', 'advocated', 'advocates', 'affability', 'affable', 'affably', 'affectation', 'affection', 'affectionate', 'affinity', 'affirm', 'affirmation', 'affirmative', 'affluence', 'affluent', 'afford', 'affordable', 'affordably', 'afordable', 'agile', 'agilely', 'agility', 'agreeable', 'ag

In [12]:
print (negative) #lets have a look at the negative list



### Let's Analyze!

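The code below defines four count variables, each initialized to zero, one for each category: stop words, positive words, negative words and 'others'.<br>
A fifth variable named 'sentiment' is incremented every time a positive word appears in our tweets and decremented every time a negative word appears. In simpler terms, this variable gives the difference between the number of positive and negative words in our tweets. Each word is checked against the stop words first, so a word that appears in both the stop-word list and a sentiment list is counted only as a stop word. 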

In [13]:
stopCount = 0 #count variable for stopWords
posCount = 0 #count variable for positive words
negCount = 0 #count variable for negative words
otherCount = 0 #count variable for 'others'
sentiment = 0 #count variable to calculate overall sentiment

#compare each word from the tweet to each of the lists
for word in clean_tweet_words:
    if word in stopWords:
        stopCount += 1
    elif word in positive:
        posCount += 1
        sentiment += 1
    elif word in negative:
        negCount += 1
        sentiment -= 1
    else:
        otherCount += 1
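
The same classification can be sketched with sets and a `Counter`. The function name `classify_words` and the tiny word lists are made up for illustration. Using sets instead of lists is an assumption that only changes lookup speed, not the result.

```python
from collections import Counter

def classify_words(words, stop, pos, neg):
    """Count words per category, checking stop words first as in the loop above."""
    counts = Counter()
    for word in words:
        if word in stop:
            counts['stop'] += 1
        elif word in pos:
            counts['positive'] += 1
        elif word in neg:
            counts['negative'] += 1
        else:
            counts['other'] += 1
    return counts

# Tiny made-up word lists for illustration
stop = {'the', 'is', 'a'}
pos = {'great', 'good'}
neg = {'bad'}
words = ['the', 'speech', 'is', 'great', 'bad', 'good']

counts = classify_words(words, stop, pos, neg)
score = counts['positive'] - counts['negative']
print(dict(counts), score)
```

The sentiment score is derived from the two counts at the end rather than updated inside the loop. Both approaches give the same number.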

### Results

We obtain the percentage of positive, negative, stop and 'other' words by dividing each count by the total number of words in the cleaned tweets. <br>
The overall sentiment is determined by the sign of the sentiment score, which is the number of positive words minus the number of negative words. A positive score means the overall sentiment is positive, and a negative score means it is negative (the code reports a score of exactly zero as positive). <br>
After this we check whether the sentiment is strong. For this we divide the sentiment score by the total number of positive and negative words (not all words), which gives the difference between positive and negative words as a percentage of the sentiment-bearing words alone. If this percentage is 50% or more, the sentiment is reported as strong; otherwise it is reported as weak.
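
The strength check described above can be sketched as a small function. The function name `sentiment_strength` and the counts are made up for illustration. Comparing the magnitude of the percentage, so that negative scores are treated the same way as positive ones, is also an assumption.

```python
def sentiment_strength(pos_count, neg_count):
    """Return (positive - negative) as a percentage of all sentiment words."""
    total = pos_count + neg_count
    if total == 0:
        # Assumption: with no sentiment words at all, report zero strength
        return 0.0
    return 100 * (pos_count - neg_count) / total

# Made-up counts for illustration
for pos_count, neg_count in [(30, 10), (12, 8), (5, 20)]:
    pct = sentiment_strength(pos_count, neg_count)
    label = "strong" if abs(pct) >= 50 else "weak"
    print("%d positive, %d negative: %.2f%% (%s)" % (pos_count, neg_count, pct, label))
```

Guarding against a zero total avoids a `ZeroDivisionError` when a set of tweets contains no positive or negative words.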

In [18]:
print("In all the tweets:\n\tTotal number of positive words are : " + str(posCount))
print("\tTotal number of negative words are : "  + str(negCount))
print("\tTotal number of stop words are : " + str(stopCount))
print("\tTotal number of 'other' words are : " + str(otherCount))
        
overallPos = (posCount/len(clean_tweet_words))*100
overallNeg = (negCount/len(clean_tweet_words))*100
overallStop = (stopCount/len(clean_tweet_words))*100
overallOther = (otherCount/len(clean_tweet_words))*100

print("\nThe ratios(percentages) are as follows:\n\tPositive: %.2f " % overallPos + "%")
print("\tNegative: %.2f" %overallNeg + "%")
print("\tStopWords: %.2f" %overallStop+ "%")
print("\tOthers: %.2f" %overallOther+ "%")

#checking whether positive or negative is greater
print("\nChecking whether overall sentiment is positive or negative:")

print("Sentiment score : ", sentiment)
if sentiment >= 0:  #note: a score of exactly 0 is reported as positive
    print("\nOverall the sentiment is positive")
else:
    print("Overall sentiment is negative ")

#checking if the sentiment is strong or not
#the magnitude is compared, so a strongly negative score can also be reported as strong
#(this raises ZeroDivisionError if no positive or negative words were found)
pctDiff = (100*sentiment)/(posCount+negCount)
print("Percentage difference between positive and negative words %.2f" % pctDiff + "%")
if abs(pctDiff) < 50:
    print("The sentiment is weak as the difference is less than 50% of the positive and negative words")
else:
    print("The sentiment is strong as the difference is more than or equal to 50% of the positive and negative words")

In all the tweets:
	Total number of positive words are : 2564
	Total number of negative words are : 2112
	Total number of stop words are : 28302
	Total number of 'other' words are : 22036

The ratios(percentages) are as follows:
	Positive: 4.66 %
	Negative: 3.84%
	StopWords: 51.45%
	Others: 40.06%

Checking whether overall sentiment is positive or negative:
Sentiment score :  452

Overall the sentiment is positive
Percentage difference between positive and negative words 9.67%
The sentiment is weak as the difference is less than 50% of the positive and negative words


## Part 2 - Sentiment Analysis for Modi

We repeat all the steps from Part 1, this time on the tweets about Modi.

In [19]:
tweets = [] #creating an empty list

#extracting each line from txt file and appending it in the list
with open("Modi.txt") as myfile: #open the file containing all tweets; it is closed automatically
    for line in myfile:
        #r"\n" is a literal backslash followed by 'n', not a newline character
        for j in line.split(r"\n"):
            tweets.append(j)

In [20]:
tweets #let's print the list and see what it looks like

["b'RT @SaketGokhale: Piyush Goyal is miffed at Jeff Bezos purely because he can\\xe2\\x80\\x99t call &amp; bully him for censoring news about Modi in the Washin\\xe2\\x80\\xa6",
 '\'b"RT @ashoswai: #MyPiece - Not for destroying India\'s secularism but his failure on the economic front which has made Modi vulnerable. So, th\\xe2\\x80\\xa6',
 '"b"RT @TheMamuns: @htTweets But, no CHAUKEEDAARI. ',
 "Because, under the Modi's guidance.",
 'Group of CHAUKEEDAAR criminal mandates.',
 'Today the India\\xe2\\x80\\xa6',
 '"b"RT @ajitdatta: BJP\'s opponents have often questioned PM Modi\'s legitimacy to rule on the ground that over 60% of the electorate did not vot\\xe2\\x80\\xa6',
 '"b\'RT @NekkantiR: Modi has shown us, the sleeping beauties Hindus, how "chilled out" we are living among burkaSecularists  aka ChuslamiTerrori\\xe2\\x80\\xa6',
 '\'b"RT @CTR_Nirmalkumar: Dear My PM Modi Ji,',
 '',
 '* Pakistan is against YOU',
 '* Media is against YOU',
 "* Congress Italian Mafia's against YOU",
 '

In [21]:
clean_tweets = [] #create an empty list which will contain cleaned tweets
for tweet in tweets:
    tweet_parse = tweet.split(':')
    if len(tweet_parse) > 1:                               #proceed only if the tweet contains at least one ':'
        tweet = tweet_parse[-1]                            #keep only the text after the last ':'
        tweet = re.sub(r'\\x[0-9a-f][0-9a-f]', "", tweet)  #remove escaped non-ASCII bytes (emoji, curly quotes, ellipses) such as \xe2
        tweet = re.sub(r'@(\w)+', "", tweet)               #remove mentions and usernames
        tweet = re.sub('&amp', "", tweet)                  #remove HTML-escaped ampersands (the trailing ';' is left behind)
        tweet = re.sub(r'//t.co/(\w)+', "", tweet)         #remove shortened hyperlinks
        tweet = re.sub('//t.c', "", tweet)                 #remove truncated hyperlinks
        tweet = re.sub(r'\\', "", tweet)                   #remove remaining backslashes
        tweet = re.sub('#', "", tweet)                     #remove the hashtag symbol
        tweet = re.sub("'", " ", tweet)                    #replace apostrophes with a space
        tweet = tweet.replace(",", "")                     #remove commas
        tweet = tweet.replace("*", "")                     #remove asterisks
        tweet = tweet.replace("//t", "")                   #remove leftover link fragments
        tweet = tweet.replace('"', "")                     #remove double quotes
        tweet = tweet.replace(".", "")                     #remove periods
        clean_tweets.append(tweet.lower())                 #after cleaning, each tweet is lowercased and appended to clean_tweets

# Create a list of words from cleansed tweets
clean_tweet_words = [] #create an empty list for cleaned tweet words
for i in clean_tweets:
    s = i.split()
    for j in s:
        clean_tweet_words.append(j) #append cleaned words to a list

In [22]:
print(clean_tweet_words) #lets have a look at the list of cleaned words

['piyush', 'goyal', 'is', 'miffed', 'at', 'jeff', 'bezos', 'purely', 'because', 'he', 'cant', 'call', ';', 'bully', 'him', 'for', 'censoring', 'news', 'about', 'modi', 'in', 'the', 'washin', 'mypiece', '-', 'not', 'for', 'destroying', 'india', 's', 'secularism', 'but', 'his', 'failure', 'on', 'the', 'economic', 'front', 'which', 'has', 'made', 'modi', 'vulnerable', 'so', 'th', 'but', 'no', 'chaukeedaari', 'bjp', 's', 'opponents', 'have', 'often', 'questioned', 'pm', 'modi', 's', 'legitimacy', 'to', 'rule', 'on', 'the', 'ground', 'that', 'over', '60%', 'of', 'the', 'electorate', 'did', 'not', 'vot', 'modi', 'has', 'shown', 'us', 'the', 'sleeping', 'beauties', 'hindus', 'how', 'chilled', 'out', 'we', 'are', 'living', 'among', 'burkasecularists', 'aka', 'chuslamiterrori', 'dear', 'my', 'pm', 'modi', 'ji', 'is', 'this', 'america?', 'no', 'wait', 'france?', 'oh', 'no', 'must', 'be', 'china!', 'janasena', 'and', 'bjp', 'declare', 'alliance', 'in', 'the', 'state', 'janasena', 'chief', 'pawan'

We have already read and stored the positive, negative and stop word files in Part 1, so there is no need to repeat that step.

In [23]:
stopCount = 0 #count variable for stopWords
posCount = 0 #count variable for positive words
negCount = 0 #count variable for negative words
otherCount = 0 #count variable for 'others'
sentiment = 0 #count variable to calculate overall sentiment

#compare each word from the tweet to each of the lists
for word in clean_tweet_words:
    if word in stopWords:
        stopCount += 1
    elif word in positive:
        posCount += 1
        sentiment += 1
    elif word in negative:
        negCount += 1
        sentiment -= 1
    else:
        otherCount += 1

### Results

In [24]:
print("In all the tweets:\n\tTotal number of positive words are : " + str(posCount))
print("\tTotal number of negative words are : "  + str(negCount))
print("\tTotal number of stop words are : " + str(stopCount))
print("\tTotal number of 'other' words are : " + str(otherCount))
        
overallPos = (posCount/len(clean_tweet_words))*100
overallNeg = (negCount/len(clean_tweet_words))*100
overallStop = (stopCount/len(clean_tweet_words))*100
overallOther = (otherCount/len(clean_tweet_words))*100

print("\nThe ratios(percentages) are as follows:\n\tPositive: %.2f " % overallPos + "%")
print("\tNegative: %.2f" %overallNeg + "%")
print("\tStopWords: %.2f" %overallStop+ "%")
print("\tOthers: %.2f" %overallOther+ "%")

#checking whether positive or negative is greater
print("\nChecking whether overall sentiment is positive or negative:")

print("Sentiment score : ", sentiment)
if sentiment >= 0:  #note: a score of exactly 0 is reported as positive
    print("\nOverall the sentiment is positive")
else:
    print("Overall sentiment is negative ")

#checking if the sentiment is strong or not
#the magnitude is compared, so a strongly negative score can also be reported as strong
#(this raises ZeroDivisionError if no positive or negative words were found)
pctDiff = (100*sentiment)/(posCount+negCount)
print("Percentage difference between positive and negative words %.2f" % pctDiff + "%")
if abs(pctDiff) < 50:
    print("The sentiment is weak as the difference is less than 50% of the positive and negative words")
else:
    print("The sentiment is strong as the difference is more than or equal to 50% of the positive and negative words")

In all the tweets:
	Total number of positive words are : 630
	Total number of negative words are : 1294
	Total number of stop words are : 21387
	Total number of 'other' words are : 19728

The ratios(percentages) are as follows:
	Positive: 1.46 %
	Negative: 3.01%
	StopWords: 49.69%
	Others: 45.84%

Checking whether overall sentiment is positive or negative:
Sentiment score :  -664
Overall sentiment is negative 
Percentage difference between positive and negative words -34.51%
The sentiment is weak as the difference is less than 50% of the positive and negative words
