# Twitter Sentiment Analysis

This is a simple twitter sentiment analysis. This data science project includes the following steps - 

1. Generation of twitter data
2. Cleaning the tweets
3. Determining the sentiments of the tweets and interpreting the sentiment of the tweets on basis of positive and negative words
4. Stop words are omitted in the analysis
5. Calculation of overall sentiment on the topic based on the sentiments deduced from the tweets

### Generation of Twitter data

The data is generated using an easy to use twitter library called the Tweepy. This library allows users to gain access to twitter data. The first step in obtaining twitter data is creating application on https://apps.twitter.com/. 

This allows users to obtain consumer_key, consumer_secret, access_token, access_token_secret. This data is stored in the file config.py. It will enable a user to extract data from Twitter using the Tweepy library.

Tweepy currently allows upto a maximum of 200 tweets to be extracted in one go. However, using custom function, multiple tweets are extracted.

Users can change the number of tweets and the data to be extracted.

In [1]:
import tweepy
import codecs
import config

In [2]:
#Setting up twitter

def twitter_setup():
    auth = tweepy.OAuthHandler(config.consumer_key, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_token_secret)
    api = tweepy.API(auth)
    return api

In [3]:
#Creating an object for extraction of data
tweet_extractor = twitter_setup()

In [4]:
# Extraction of tweets
all_tweets = []

search_term=input("Enter the topic you want to find sentiment of: ")
max_tweets=int(input("Enter the number of Tweets you want to download: "))

#max limit to extract tweet is 200 per request
new_tweets = tweet_extractor.user_timeline(screen_name = search_term , count=200)

all_tweets.extend(new_tweets)

oldest = all_tweets[-1].id - 1
   
#keep extracting tweets till there are no tweets left

while len(new_tweets) < max_tweets:
    
    #max_id has been used to prevent duplicates
    new_tweets = tweet_extractor.user_timeline(screen_name = search_term,count=200,max_id=oldest)

    #most recent tweets are saved in the all_tweets list
    all_tweets.extend(new_tweets)
        
    #id of the oldest tweet is updated -1
    oldest = all_tweets[-1].id - 1

    print("%s tweets downloaded so far" % (len(all_tweets)))

Enter the topic you want to find sentiment of: trump
Enter the number of Tweets you want to download: 200


In [5]:
#Writing the extracted tweets in the Tweets.txt file
file = codecs.open("Tweets.txt", "w", "utf-8")
for tweet in all_tweets:
    file.write(tweet.text)
    file.write("\n")
    
file.close()


print("\nTweets have been written in the file Tweets.txt")



Tweets have been written in the file Tweets.txt


In [6]:
import re
import nltk
from nltk.corpus import stopwords
import string

#For running multiple outputs in Jupyter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#### Cleaning the tweets using regular expressions

In [7]:
#Process of cleaning the tweet using regular expressions
#Following changes are done to clean the tweet - 

#Conversion to lower case
#Conversion of URLs in form of www.* or http://* or https://* to URL
#Conversion of @username to text AT_USER
#Removing all special characters
#Replacing all hashtag words to simple words
#Stripping additional space


def clean_tweet(tweet):
    tweets = tweet.lower()    
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+)|(https:/[^\s]+)|(http?[^\s]+))','URL',tweet)    
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    tweet = re.sub('[\s]+', ' ', tweet)    
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    tweet=re.sub(r'[^\w]', ' ', tweet)
    
    tweet = tweet.strip('\'"')
    tweet_list=tweet.split()
    
    return tweet_list

#### Create list of stop words for comparison

In [8]:
stop_words = list(stopwords.words('english'))
#Adding more parts like usernames, URL to the list of stop words

def stopword_list(stop_words):
    stop_words.append('AT_USER')
    stop_words.append('URL')
    stop_words.append('RT')
    stop_words.append('a')
    return stop_words

stop_words= stopword_list(stop_words)

In [9]:
#Reading the positive and negative word list files to identify these words

positive_tweets=open('positive-words.txt','r').read().split("\n")
negative_tweets=open('negative-words.txt','r').read().split("\n")

### Calculate Sentiment

#### Assign sentiment value to each tweet

In [10]:
def find_sentiment(final_tweet):
    count=0
    for words in final_tweet:
        if words in positive_tweets:
            count+=1
        if words in negative_tweets:
            count-=1
    return count

   #### Determine sentiment from each tweet

In [11]:
final_tweet=[]
processed_tweet=[]

#Read the file line by line 
fp = open('Tweets.txt', 'r',encoding="utf8")
line = fp.readline()
while line:
    line = fp.readline()
    
    #Add the cleaned tweets to list clean tweets
    clean_tweets=clean_tweet(line)
    
    #Stopwords are removed from the clean_tweets list
    
    for i in clean_tweets:
        if i in stop_words:
            continue
        else:
            processed_tweet.append(i) 
            
final_tweet.extend(processed_tweet) 

#Add sentiment to words for each tweet
count=find_sentiment(final_tweet)
if count > 0:
    print('The sentiment is positive')
elif count <0:
    print('The sentiment is negative')
else:
    print('The sentiment is neutral')

print("Sentiment score = %d" % count)

fp.close()


The sentiment is positive
Sentiment score = 133


#### Find Total Positive/Negative word count which was found in previous section

In [12]:
positive_count=0
negative_count=0
for word in final_tweet:
    if word not in stop_words:
        if word in positive_tweets:
            positive_count+=1
        elif word in negative_tweets:
            negative_count+=1
            
print("Positive word count = %d" % positive_count)
print("Negative word count = %d" % negative_count)

Positive word count = 143
Negative word count = 10


In [13]:
if positive_count>negative_count:
    print ("It looks like the sentiment of users on this topic is Positive.")
else:
    print ("It looks like the sentiment of users on this topic is Negative.")

It looks like the sentiment of users on this topic is Positive.


## References

For extracting tweets more than 200- https://gist.github.com/yanofsky/5436496 <br>

For positive and negative word list- http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html <br>

For cleaning the tweets- https://www.ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/#implementation-details