# Case Study 1: Twitter Sentiment Analysis

We use the [Twitter Dataset](https://www.kaggle.com/c/twitter-sentiment-analysis2/data). The task is to convert the given tweets into features that can be used for sentiment classification (positive and negative tweets). Every tweet is classified as having either a positive or a negative sentiment. A few example tweets:

**Few Positive Tweets:**
1. @Msdebramaye I heard about that contest! Congrats girl!!
2. UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3

**Few Negative Tweets:**
1. no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
2. Just had some bloodwork done. My arm hurts

We have 100,000 tweets for training and 300,000 tweets for testing. The ground truth is 1 for a positive tweet and 0 for a negative tweet. Let's try to build a sentiment analyzer using this dataset.

In [0]:
## Load dataset
import pandas as pd
dataFrame = pd.read_csv("../Datasets/train.csv",encoding='latin1')
print(dataFrame)

       ItemID  Sentiment                                      SentimentText
0           1          0                       is so sad for my APL frie...
1           2          0                     I missed the New Moon trail...
2           3          1                            omg its already 7:30 :O
3           4          0            .. Omgaga. Im sooo  im gunna CRy. I'...
4           5          0           i think mi bf is cheating on me!!!   ...
5           6          0                  or i just worry too much?        
6           7          1                 Juuuuuuuuuuuuuuuuussssst Chillin!!
7           8          0         Sunny Again        Work Tomorrow  :-|  ...
8           9          1        handed in my uniform today . i miss you ...
9          10          1           hmmmm.... i wonder how she my number @-)
10         11          0                      I must think about positive..
11         12          1        thanks to all the haters up in my face a...
12         1

In [0]:
# Convert data into array
data = dataFrame.values
n = dataFrame.shape[0] ## n is number of tweets
print(n)

## Store labels and tweets in separate arrays for the training data
labels = data[:,1]
tweets = data[:,2]
print(labels.shape)
print(tweets.shape)

99989
(99989,)
(99989,)


## Question 1
Modify the tweets so that irrelevant words and characters are removed. To this end, apply the following preprocessing steps.
1. **Case** Convert the tweets to lower case.
2. **URLs** We don't intend to follow the (short) URLs and determine the content of the site, so we can eliminate all of these URLs via regular expression matching or replace each one with the token URL.
3. **Username** We can eliminate "@username" via regex matching or replace it with AT\_USER.
4. **Hashtag** Hashtags can give us some useful information, so replace them with the exact same word without the hash. E.g. \#nike is replaced with 'nike'.
5. **Whitespace** Replace multiple whitespace characters with a single space.
6. **Stop words** a, is, the, with etc. The full list of stop words can be found at Stop Word List. These words don't indicate any sentiment and can be removed.
7. **Repeated letters** If you look at the tweets, people sometimes repeat letters to stress the emotion. E.g. hunggrryyy, huuuuuuungry for 'hungry'. We can look for 2 or more repeated letters in a word and replace them with 2 of the same letter.
8. **Punctuation** Remove punctuation such as commas, single/double quotes, and question marks at the start and end of each word. E.g. beautiful!!!!!! is replaced with beautiful.
9. **Non-alpha words** Remove all words that don't start with a letter. E.g. 15th, 5.34am
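
To see what the character-level steps do, here is a minimal sketch that applies steps 1 to 5 and step 7 to a single made-up tweet and records the text after each step. The function name `preview_steps` and the sample tweet are hypothetical and only serve as illustration. Note that the repeated-character pattern used here collapses any repeated character, including punctuation, not only letters.

```python
import re

def preview_steps(tweet):
    """Apply steps 1-5 and 7 to one tweet and return the text after each step."""
    stages = {}
    tweet = tweet.lower()                                   # step 1: lower case
    stages["case"] = tweet
    tweet = re.sub(r"(www\.\S+)|(https?://\S+)", "URL", tweet)  # step 2: URLs
    stages["urls"] = tweet
    tweet = re.sub(r"@\S+", "AT_USER", tweet)               # step 3: usernames
    stages["username"] = tweet
    tweet = re.sub(r"#(\S+)", r"\1", tweet)                 # step 4: drop the hash
    stages["hashtag"] = tweet
    tweet = re.sub(r"\s+", " ", tweet).strip()              # step 5: whitespace
    stages["whitespace"] = tweet
    tweet = re.sub(r"(.)\1+", r"\1\1", tweet)               # step 7: repeats -> 2
    stages["repeated"] = tweet
    return stages

# Hypothetical sample tweet, not taken from the dataset.
sample = "@Nike I am sooooo   huuuungry!!! #lunch http://tinyurl.com/abc"
for step, text in preview_steps(sample).items():
    print(f"{step:>10}: {text}")
```

The last line printed is `AT_USER i am soo huungry!! lunch URL`. Stop-word removal, punctuation stripping, and the non-alpha filter (steps 6, 8, and 9) operate on individual words and are therefore left out of this sketch.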

In [0]:
## Preprocess the tweets

## import regex
import re
import numpy as np

#start process_tweet
def processTweet(tweet):
    # process the tweets

    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL (raw strings keep the regex escapes intact)
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    tweet = tweet.strip('.,')
    return tweet


for i in range(n):
    tweets[i] = processTweet(tweets[i])

print(tweets[0:100])

[' is so sad for my apl friend' ' i missed the new moon trailer'
 ' omg its already 7:30 :o'
 " .. omgaga. im sooo im gunna cry. i've been at this dentist since 11.. i was suposed 2 just get a crown put on (30mins)"
 ' i think mi bf is cheating on me!!! t_t' ' or i just worry too much? '
 ' juuuuuuuuuuuuuuuuussssst chillin!!'
 ' sunny again work tomorrow :-| tv tonight'
 ' handed in my uniform today . i miss you already'
 ' hmmmm.... i wonder how she my number AT_USER'
 ' i must think about positive'
 ' thanks to all the haters up in my face all day! 112-102'
 ' this weekend has sucked so far'
 ' jb isnt showing in australia any more!' ' ok thats it you win'
 ' &lt;-------- this is the way i feel right now'
 " awhhe man.... i'm completely useless rt now. funny, all i can do is twitter. URL"
 " feeling strangely fine. now i'm gonna go listen to some semisonic to celebrate"
 ' huge roll of thunder just now...so scary!!!!'
 " i just cut my beard off. it's only been growing for well over a

## Question 2
Do further preprocessing to count the number of positive words and the number of negative words in each tweet. You can use [Positive_words.txt](https://drive.google.com/drive/folders/1TnJCyn4LiS6InT35skvCbbBrp37AGYc) and [Negative words.txt](https://drive.google.com/drive/folders/1TnJCyn4LiS6InT35skvCbbBrp37AGYcT), which contain positive words and negative words respectively.
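
As a rough illustration of the counting step, the sketch below counts positive, negative, and neutral tokens for one tokenized tweet. The two tiny lexicons and the function name `count_sentiment_words` are hypothetical stand-ins for the real word lists. Sets are used because membership tests on a set are fast, which matters when looping over roughly 100,000 tweets.

```python
# Toy lexicons for illustration only; the real lists come from the linked files.
toy_positive = {"good", "great", "love"}
toy_negative = {"bad", "hurts", "sad"}

def count_sentiment_words(tokens, positive, negative):
    """Return (positive, negative, neutral) counts for a list of tokens."""
    pos = sum(1 for t in tokens if t in positive)
    # A token found in both lexicons is counted as positive only.
    neg = sum(1 for t in tokens if t not in positive and t in negative)
    return pos, neg, len(tokens) - pos - neg

tokens = "love this great phone but my arm hurts".split()
print(count_sentiment_words(tokens, toy_positive, toy_negative))  # (2, 1, 5)
```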

In [0]:
#start replaceTwoOrMore
def replaceTwoOrMore(s):
    #look for 2 or more repetitions of a character and replace them with exactly 2 of that character
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)
#end

#start getStopWordList
def getStopWordList(stopWordListFileName):
    #read the stopwords file and build a list
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('URL')

    with open(stopWordListFileName, 'r') as f:
        for stopWord in f:
            # strip the trailing newline so that membership tests can match
            stopWord = stopWord.strip()
            if stopWord:
                stopWords.append(stopWord)
    return stopWords
#end

#start getfeatureVector
def getFeatureVector(tweet):
    featureVector = []
    words = tweet.split()
    PUNCTUATIONS = '\'"?!,.;:'
    for w in words:
        # collapse runs of a repeated character to two (step 7 of Question 1)
        w = replaceTwoOrMore(w)
        # strip punctuation from both ends of the word
        w = w.strip(PUNCTUATIONS)
        # keep only words that start with a letter and are fully alphanumeric
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        # ignore stop words and words that fail the check above
        if w in stopWords or val is None:
            continue
        featureVector.append(w.lower())
    return featureVector

def getwordcount(words, count):    
    positive_count = 0
    negative_count = 0
    neutral_count = 0
    
    total = []
    #print words
    
    for w in words:        
        if w in positive_words:
            positive_count += 1
        elif w in negative_words:
            negative_count += 1
        else:
            neutral_count += 1
            
    total.append(positive_count)
    total.append(negative_count)
    total.append(neutral_count)
    total.append(labels[count])
    return total
    
def loadWordSet(fileName):
    # read one word per line into a set for fast membership tests
    with open(fileName, 'r') as f:
        return {line.strip() for line in f if line.strip()}

tweets_modified = []

stopWords = getStopWordList('../Datasets/stopwords.txt')
positive_words = loadWordSet('../Datasets/positive-words.txt')
negative_words = loadWordSet('../Datasets/negative-words.txt')

# one row per tweet: [positive_count, negative_count, neutral_count, label]
for i in range(n):
    featureVector = getFeatureVector(tweets[i])
    tweets_modified.append(getwordcount(featureVector, i))




In [0]:
import numpy as np
x = np.asarray(tweets_modified)
print (x.shape)
print (x[0])

## Question 3
Plot a graph using the positive count and negative count of each tweet as features. Also plot the graph after scaling the features and after normalizing the features, respectively. You need to plot 3 graphs in total.

In [0]:
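## Using features as probabilities
import matplotlib.pyplot as plt
plt.figure(1, figsize=(20,10))

# words per tweet: positive + negative + neutral (column 3 is the label, so it is excluded)
totals = x[:,:3].sum(axis=1).astype(float)
totals[totals == 0] = 1.0  # tweets with no remaining words: avoid division by zero

# one color per tweet; assumed mapping: red for label 0, yellow for label 1
colors = np.array(["red","yellow"])
plt.scatter(x[:,0]/totals, x[:,1]/totals, c = colors[x[:,3].astype(int)], s=40)
plt.show()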

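
The question also asks for a scaled and a normalized view of the features. The sketch below is one possible reading, under two assumptions: "scaling" is taken to mean min-max scaling of each feature column to [0, 1], and "normalizing" is taken to mean rescaling each tweet's feature row to unit Euclidean length. The helper names and the small `demo_counts` array are hypothetical; in the notebook, `x[:, :2]` and `x[:, 3]` would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt

def min_max_scale(features):
    """Rescale each column to the range [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # constant column: avoid division by zero
    return (features - lo) / span

def l2_normalize(features):
    """Rescale each row to unit Euclidean length (all-zero rows stay zero)."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return features / norms

def plot_feature_views(features, labels):
    """Scatter raw, scaled, and normalized (positive, negative) counts side by side."""
    views = [("raw counts", np.asarray(features, dtype=float)),
             ("min-max scaled", min_max_scale(features)),
             ("L2 normalized", l2_normalize(features))]
    point_colors = np.where(np.asarray(labels) == 1, "tab:green", "tab:red")
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    for ax, (title, data) in zip(axes, views):
        ax.scatter(data[:, 0], data[:, 1], c=point_colors, s=40)
        ax.set(title=title, xlabel="positive count", ylabel="negative count")
    return fig

# Hypothetical stand-in for x[:, :2] and x[:, 3] from the cells above.
demo_counts = np.array([[3, 0], [0, 4], [1, 1], [0, 0]])
demo_labels = np.array([1, 0, 1, 0])
fig = plot_feature_views(demo_counts, demo_labels)
```

Inside a notebook the figure is displayed when the cell finishes; when running as a script, call `plt.show()` afterwards. Because many tweets share the same small integer counts, points overlap heavily in all three views, so adding transparency (the `alpha` argument of `scatter`) may make the plots easier to read.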

## Question 4
Load the file test.csv and preprocess it as above. Calculate the accuracy on the test data using a linear classifier or a KNN classifier.

In [0]:
train_X = x[:70000,:2]
train_Y = x[:70000,3]
test_X = x[70000:,:2]
test_Y = x[70000:,3]
print(train_Y.shape)

In [0]:
## Linear classifier
from sklearn import linear_model
clf = linear_model.SGDClassifier()
clf.fit(train_X,train_Y)
pred_label = clf.predict(test_X)
print(pred_label)
# count the predictions that match the ground truth
correct = np.sum(test_Y == pred_label)
print(correct)
accuracy = (correct/float(len(test_Y)))*100.0
print(accuracy)
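
For the KNN option, a minimal sketch with scikit-learn's `KNeighborsClassifier` could look like the following. The tiny `demo_*` arrays are made-up (positive count, negative count) features standing in for `train_X`, `train_Y`, `test_X`, and `test_Y`, and `n_neighbors=3` is an arbitrary choice for the illustration, not a tuned value.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (positive_count, negative_count) features with 1/0 labels.
demo_train_X = np.array([[3, 0], [2, 0], [2, 1], [0, 2], [0, 3], [1, 3]])
demo_train_Y = np.array([1, 1, 1, 0, 0, 0])
demo_test_X = np.array([[4, 1], [0, 4]])
demo_test_Y = np.array([1, 0])

# Each test point takes the majority label of its 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(demo_train_X, demo_train_Y)
demo_pred = knn.predict(demo_test_X)

# Accuracy is the percentage of predictions that equal the ground truth.
demo_accuracy = np.mean(demo_pred == demo_test_Y) * 100.0
print(demo_pred, demo_accuracy)
```

On the real data, count features produce many identical points, so ties among neighbors are common and the result may depend noticeably on the chosen number of neighbors.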

In [0]:
## Your code here