<h1>Twitter Sentiment Analysis</h1>

<h3>Abstract</h3>
<br>
<p>
The goal of this project was to predict the sentiment of a given Twitter post using Python. Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. In this project we focus on only two classes, positive and negative. The full dataset contains 2,500,000 tweets, half labelled positive and half labelled negative. Our task was to build a classifier that predicts the sentiment of the 10,000 tweets in the test set.
</p>

<h3>Data Loading</h3>

In [1]:
import pandas as pd

# Helper function to read a text file into a list of lines
def load_data(filename):
    # The context manager closes the file once the lines are read
    with open(filename, "r") as f:
        return f.readlines()

In [2]:
print(" == Loading Training Data ==")
pos_train_tweets = pd.DataFrame(load_data('data/tweets/train_pos_full.txt'), columns=['tweet'])
neg_train_tweets = pd.DataFrame(load_data('data/tweets/train_neg_full.txt'), columns=['tweet'])
train_tweets = pd.concat([pos_train_tweets, neg_train_tweets], axis=0)
print(train_tweets)
train_tweets.to_pickle('data/pickles/train_origin.pkl')

 == Loading Training Data ==
                                                     tweet
0        <user> i dunno justin read my mention or not ....
1        because your logic is so dumb , i won't even c...
2        " <user> just put casper in a box ! " looved t...
3        <user> <user> thanks sir > > don't trip lil ma...
4        visiting my brother tmr is the bestest birthda...
5        <user> yay ! ! #lifecompleted . tweet / facebo...
6        <user> #1dnextalbumtitle : feel for you / roll...
7        workin hard or hardly workin rt <user> at hard...
8             <user> i saw . i'll be replying in a bit .\n
9                                  this is were i belong\n
10                <user> anddd to cheer #nationals2013 ?\n
11       we send an invitation to shop on-line ! here y...
12                     just woke up , finna go to church\n
13                 <user> agreed ! 12 more days left tho\n
14                                   monet with katemelo\n
15       like dammm <user> 
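
<p>The abstract describes balanced positive and negative labels, but the frame printed above holds only the tweet text. A minimal sketch of attaching labels while concatenating, assuming a hypothetical <code>label</code> column (1 for positive, 0 for negative) and two tiny in-memory lists standing in for the real files:</p>

```python
import pandas as pd

# Tiny stand-ins for the two training files (hypothetical data)
pos_lines = ["i love this\n", "great day\n"]
neg_lines = ["i hate this\n"]

# A scalar label is broadcast to every row of the frame
pos = pd.DataFrame({"tweet": pos_lines, "label": 1})
neg = pd.DataFrame({"tweet": neg_lines, "label": 0})

# ignore_index=True gives each row a unique index after concatenation
labeled = pd.concat([pos, neg], ignore_index=True)
labeled["tweet"] = labeled["tweet"].str.rstrip("\n")
print(labeled)
```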

In [3]:
print(" == Loading Testing Data ==")
test_tweets = pd.DataFrame(load_data('data/tweets/test_data.txt'), columns=['tweet'])
test_tweets['tweet'] = test_tweets['tweet'].apply(lambda tweet: tweet.split(',', 1)[-1])
print(test_tweets)
test_tweets.to_pickle('data/pickles/test_origin.pkl')

 == Loading Testing Data ==
                                                  tweet
0     sea doo pro sea scooter ( sports with the port...
1     <user> shucks well i work all week so now i ca...
2             i cant stay away from bug thats my baby\n
3     <user> no ma'am ! ! ! lol im perfectly fine an...
4     whenever i fall asleep watching the tv , i alw...
5     <user> he needs to get rid of that thing ! it ...
6               its whatever . in a terrible mood ( (\n
7     yesss ! rt <user> <user> thanks jordan , i lov...
8     my friend <user> text me to check up on me las...
9     <user> #followback please . when will ur #unit...
10    watch some of y'all dumb asses get lock up tod...
11    obsessed with #phasell <user> you killed it ! ...
12    <user> robert de niro is not gay .. but with a...
13    <user> canada have to do it in grade 12 . but ...
14    <user> please say hi to denmark ! that would b...
15                                finally am home now\n
16    3x3 custom pic

<h3>Segmenter</h3>
<p>Helper class for the preprocessing step. It loads word counts for the corpus: the total word count is read from <span style="color:#e65100">"data/dictionary/en/total.tsv"</span> and the per-word counts from <span style="color:#e65100">"data/dictionary/en/frequencies.tsv"</span>. The resulting relative frequencies are used to split concatenated text, such as hashtags, into words.</p>
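
<p>Files in this layout can be produced by a simple counting pass. The sketch below is an assumption about how such files might be generated, not the script used for this project; <code>write_frequency_files</code> and the sample corpus are hypothetical.</p>

```python
import os
import tempfile
from collections import Counter

def write_frequency_files(lines, out_dir):
    """Write total.tsv (one number) and frequencies.tsv (word<TAB>count)."""
    counts = Counter(word for line in lines for word in line.split())
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "total.tsv"), "w", encoding="utf8") as f:
        f.write(str(sum(counts.values())) + "\n")
    with open(os.path.join(out_dir, "frequencies.tsv"), "w", encoding="utf8") as f:
        # Most frequent words first, one "word<TAB>count" pair per line
        for word, count in counts.most_common():
            f.write(f"{word}\t{count}\n")
    return counts

# Write into a temporary directory so the real dictionary is left untouched
out_dir = tempfile.mkdtemp()
counts = write_frequency_files(["life is good", "life completed"], out_dir)
```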

In [4]:
import os

class Analyzer(object):
    def __init__(self,file_dir="data/dictionary/en" , case_folding=True, minimum_frequency=1.0e-08):
        self.frequencies = dict()
        self.total = 0.0
        self.minimum_frequency = minimum_frequency
        self.file_dir =file_dir
        self.total = float(open(file_dir + "/total.tsv").readlines()[0])
        self.case_folding = case_folding
        counts = dict()
        for line in open(file_dir + "/frequencies.tsv"):
            parts = line.strip().split("\t")
            if len(parts) == 2:
                word, count = parts
                if case_folding:
                    key = word.lower()
                else:
                    key = word
                counts[key] = counts.get(key, 0.0) + float(count)
            else:
                pass

        for key in counts:
            frequency = counts[key] / self.total
            if frequency >= self.minimum_frequency:
                self.frequencies[key] = frequency

    def frequency(self, word):
        return self.frequencies.get(word, 0.0)

    def segment(self, text):
        # best[i] : best probability (product of word frequencies) for text[0:i]
        # words[i] : best word ending at position i
        original_text = text
        if self.case_folding:
            text = text.lower()
        n = len(text)
        # strings can be split into unicode chars using list
        words = [''] + list(text)
        best = [1.0] + [0.0] * n
        # fill in vectors best, words via dynamic programming
        for i in range(n + 1):
            for j in range(i):
                w = text[j:i]
                lp = self.frequency(w) * best[i - len(w)]
                if lp >= best[i]:
                    best[i] = lp
                    words[i] = original_text[j:i]
        # now recover the sequence of best words
        seq = []
        i = len(words) - 1
        while i > 0:
            # prevent an infinite loop from occurring here
            if len(words[i]) > 0:
                seq.append(words[i])
                i -= len(words[i])
            else:
                i -= 1
        # reverse sequence
        return seq[::-1]
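
<p>To illustrate the dynamic programme in <code>segment</code>, here is a self-contained sketch that swaps the dictionary files for a small in-memory frequency table. The table values and the name <code>segment_with_table</code> are made up for illustration, and unlike the class above it returns the unsplit text when no split made of known words exists.</p>

```python
def segment_with_table(text, freq):
    """Split text into the word sequence with the highest product of frequencies."""
    n = len(text)
    best = [1.0] + [0.0] * n      # best[i]: best score for text[:i]
    last = [""] * (n + 1)         # last[i]: final word of the best split of text[:i]
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            score = freq.get(word, 0.0) * best[j]
            if score > best[i]:
                best[i], last[i] = score, word
    if best[n] == 0.0:
        return [text]             # no split made entirely of known words
    # Walk backwards through the stored final words to recover the split
    words, i = [], n
    while i > 0:
        words.append(last[i])
        i -= len(last[i])
    return words[::-1]

# Hypothetical relative frequencies
freq = {"life": 0.02, "completed": 0.001, "complete": 0.004, "lifec": 1e-9}
print(segment_with_table("lifecompleted", freq))  # ['life', 'completed']
```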


<h3>Pre Processing</h3>
<p>User-generated content on the web is seldom present in a form usable for learning. It becomes important to normalize the text by applying a series of pre-processing steps. We have applied an extensive set of pre-processing steps to decrease the size of the feature set to make it suitable for learning algorithms.</p>

In [5]:
import numpy as np
import regex as re
import multiprocessing
from multiprocessing import Pool
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

np.random.seed(0)

# Loading current instance resource
num_partitions = multiprocessing.cpu_count()
num_cores = multiprocessing.cpu_count()
print("Total Number of Partitions : ", num_partitions)
print("Total Number of cores : ", num_cores)

Total Number of Partitions :  4
Total Number of cores :  4


<h5># Abbreviation Replacement</h5>

In [6]:
def abbreviation_replacement(text):
    text = re.sub(r"i\'m", "i am", text)
    text = re.sub(r"\'re", "are", text)
    text = re.sub(r"he\'s", "he is", text)
    text = re.sub(r"it\'s", "it is", text)
    text = re.sub(r"that\'s", "that is", text)
    text = re.sub(r"who\'s", "who is", text)
    text = re.sub(r"what\'s", "what is", text)
    text = re.sub(r"n\'t", "not", text)
    text = re.sub(r"\'ve", "have", text)
    text = re.sub(r"\'d", "would", text)
    text = re.sub(r"\'ll", "will", text)
    text = re.sub(r",", " , ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\.", " \. ", text)
    text = re.sub(r"\(", " \( ", text)
    text = re.sub(r"\)", " \) ", text)
    text = re.sub(r"\?", " \? ", text)
    return text

In [8]:
abbreviation_replacement("i'm afraid you're coming back to a lot")

'i am afraid youare coming back to a lot'
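
<p>The output above shows "you're" becoming "youare" because the replacement string has no leading space. A possible variant, offered as a sketch and not the version that produced the results in this notebook, keeps the space and pads punctuation without emitting backslashes. The name <code>expand_contractions</code> and the extra "won't" and "can't" rules are assumptions.</p>

```python
import re

# (pattern, replacement) pairs; specific forms come before generic suffixes
CONTRACTIONS = [
    (r"\bi'm\b", "i am"),
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "can not"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'ll\b", " will"),
    (r"'d\b", " would"),
    (r"\b(he|it|that|who|what)'s\b", r"\1 is"),
]

def expand_contractions(text):
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Surround punctuation with spaces, then collapse repeated whitespace
    text = re.sub(r"([,!.()?])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

print(expand_contractions("i'm afraid you're coming back to a lot"))
# i am afraid you are coming back to a lot
```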

<h5># Emoji Translation</h5>

In [9]:
def emoji_translation(text):
    loves = ["<3", "♥"]
    smilefaces = []
    sadfaces = []
    neutralfaces = []

    # Emoticons are compared as literal tokens, so the pieces must not be regex-escaped
    eyes = ["8",":","=",";"]
    nose = ["'","`","-","\\"]
    for e in eyes:
        for n in nose:
            for s in [")", "d", "]", "}","p"]:
                smilefaces.append(e+n+s)
                smilefaces.append(e+s)
            for s in ["(", "[", "{"]:
                sadfaces.append(e+n+s)
                sadfaces.append(e+s)
            for s in ["|", "/", "\\"]:
                neutralfaces.append(e+n+s)
                neutralfaces.append(e+s)
            #reversed
            for s in ["(", "[", "{"]:
                smilefaces.append(s+n+e)
                smilefaces.append(s+e)
            for s in [")", "]", "}"]:
                sadfaces.append(s+n+e)
                sadfaces.append(s+e)
            for s in ["|", "/", "\\"]:
                neutralfaces.append(s+n+e)
                neutralfaces.append(s+e)

    smilefaces = list(set(smilefaces))
    sadfaces = list(set(sadfaces))
    neutralfaces = list(set(neutralfaces))

    t = []
    for w in text.split():
        if w in loves:
            t.append("<love>")
        elif w in smilefaces:
            t.append("<happy>")
        elif w in neutralfaces:
            t.append("<neutral>")
        elif w in sadfaces:
            t.append("<sad>")
        else:
            t.append(w)
    newText = " ".join(t)
    return newText

<h5># Emphasize Punctuation</h5>

In [10]:
def emphaszie_punctuation(text):
    for punctuation in ['!', '?', '.']:
        regex = r'[' + punctuation + ']' + "+"
        text = re.sub(regex, punctuation + ' <repeat> ', text)
    return text

In [11]:
emphaszie_punctuation("yesss ! rt <user> <user> thanks jordan , i love you")

'yesss ! <repeat>  rt <user> <user> thanks jordan , i love you'
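
<p>As the example output shows, a single "!" is also tagged, because <code>+</code> matches one or more characters. If only genuine repetitions should be marked, one option is to require at least two marks. This is a sketch of that alternative, not the behaviour used for the results in this notebook; <code>mark_repeated_punctuation</code> is a hypothetical name.</p>

```python
import re

def mark_repeated_punctuation(text):
    # ([!?.]) captures one mark; (?:\s*\1)+ requires at least one repetition,
    # allowing the spaces that the tokenised tweets put between marks
    return re.sub(r"([!?.])(?:\s*\1)+", r"\1 <repeat>", text)

print(mark_repeated_punctuation("yay ! ! so good ."))
# yay ! <repeat> so good .
```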

<h5># Emphasize positive and negative words</h5>

In [12]:
# Loading positive and negative words from dictionary
positive_word_library = list(set(open('data/dictionary/positive-words.txt', encoding = "ISO-8859-1").read().split()))
negative_word_library = list(set(open('data/dictionary/negative-words.txt', encoding = "ISO-8859-1").read().split()))
print(positive_word_library)
print(negative_word_library)

['astutely', 'masterfully', 'exalted', 'enliven', 'competitive', 'personages', 'angel', 'luck', 'evaluative', 'thriving', 'fecilitous', 'dirt-cheap', 'roomier', 'awe', 'vivid', 'felicitous', 'glow', 'convenience', 'appreciates', 'foremost', 'comforting', 'overtaken', 'courtly', 'best-performing', 'reasonable', 'flawless', 'warmly', 'luxuriate', 'durable', 'mighty', 'wonderfully', 'supporting', 'risk-free', 'righteous', 'striving', 'marvels', 'covenant', 'progress', 'proven', 'adventurous', 'jaw-dropping', 'preeminent', 'beutifully', 'reaffirmation', 'congratulation', 'serene', 'rock-star', 'record-setting', 'intelligent', 'dependable', 'innovative', 'respect', 'richer', 'thoughtful', 'goood', 'hotcakes', 'graciously', 'stellar', 'breeze', 'friendly', 'smoothes', 'industrious', 'levity', 'headway', 'thrills', 'cherish', 'trouble-free', 'harmony', 'reverently', 'self-satisfaction', 'supports', 'dead-on', 'verifiable', 'survivor', 'stability', 'advocates', 'effusively', 'excallent', 'exub

In [13]:
def emphasize_pos_and_neg_words(text):
    t = []
    for w in text.split():
        if w in positive_word_library:
            t.append('<positive> ' + w)
        elif w in negative_word_library:
            t.append('<negative> ' + w)
        else:
            t.append(w)
    newTweet = " ".join(t)
    return newTweet
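
<p>Each <code>in</code> test against a list scans the list, which adds up over millions of tweets. A sketch of the same tagging with sets, using two tiny made-up lexicons in place of the dictionary files; <code>tag_sentiment_words</code> is a hypothetical name.</p>

```python
# Hypothetical miniature lexicons; the real ones are loaded from the word lists
positive_words = frozenset({"love", "good", "thanks"})
negative_words = frozenset({"sad", "dumb", "terrible"})

def tag_sentiment_words(text):
    tagged = []
    for word in text.split():
        if word in positive_words:        # hash lookup instead of a list scan
            tagged.append("<positive> " + word)
        elif word in negative_words:
            tagged.append("<negative> " + word)
        else:
            tagged.append(word)
    return " ".join(tagged)

print(tag_sentiment_words("i'm really sad about charlotte though"))
# i'm really <negative> sad about charlotte though
```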

In [14]:
emphasize_pos_and_neg_words(" i'm really sad about charlotte though")

"i'm really <negative> sad about charlotte though"

<h5># Clean hashtag</h5>

In [15]:
e = Analyzer()

In [16]:
def extract_hashtag(text):
    hash_list = ([re.sub(r"(\W+)$", "", i) for i in text.split() if i.startswith("#")])
    return hash_list

In [17]:
def split_hashtag_to_words(tag):
    word_list = [w for w in e.segment(tag[1:]) if len(w) > 3]
    return word_list

In [18]:
def clean_hashtag(text):
    words = []
    tag_list = extract_hashtag(text)
    for tag in tag_list:
        words += split_hashtag_to_words(tag)
    # An empty word list joins to "", so no special case is needed
    return " ".join(words).strip()

In [20]:
clean_hashtag("yay ! ! #lifecompleted . tweet / facebook")

'life completed'

<h5># Remove Number</h5>

In [21]:
def remove_number(text):
    new_tweet = []
    for word in text.split():
        try:
            float(word)
            new_tweet.append("")
        except ValueError:
            new_tweet.append(word)
    return " ".join(new_tweet)

In [22]:
remove_number("hello he scored 993")

'hello he scored '
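
<p>The trailing space in the output comes from appending an empty string for every number. A sketch of a variant that drops numeric tokens outright; the names <code>is_number</code> and <code>drop_numbers</code> are hypothetical.</p>

```python
def is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False

def drop_numbers(text):
    # Keep only the tokens that float() cannot parse
    return " ".join(tok for tok in text.split() if not is_number(tok))

print(drop_numbers("hello he scored 993"))
# hello he scored
```

<p>Note that <code>float()</code> also accepts tokens such as "nan" and "inf", so both versions remove those words as well.</p>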

<h5># Spelling correction</h5>

In [23]:
# Maps a misspelled or abbreviated token to its replacement.
# Named "corrections" so that the built-in dict is not shadowed.
corrections = {}

with open('data/dictionary/tweet_typo_corpus.txt', encoding='utf8') as corpus1:
    for line in corpus1:
        term = line.split()
        corrections[term[0]] = term[1]

with open('data/dictionary/text_correction.txt', encoding='utf8') as corpus2:
    for line in corpus2:
        term = line.split()
        corrections[term[1]] = term[3]


def spelling_correction(text):
    # Replace each token that has an entry in the corrections table
    return ' '.join(corrections.get(word, word) for word in text.split())

In [27]:
spelling_correction("hey! you are fab")

'hey! you are fabulous'

<h5># Removing Stopwords</h5>

In [28]:
stoplist = stopwords.words('english')

def remove_stopwords(text):
    tokens = text.split()
    for word in tokens:
        if word in stoplist:
            tokens.remove(word)
    return ' '.join(tokens)

In [29]:
remove_stopwords("hey! you are fab")

'hey! are fab'
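
<p>In the output above "are" survives even though it is a stopword: removing items from <code>tokens</code> while iterating over it makes the loop skip the element that follows each removal. A sketch of a filter that avoids this, using a small hypothetical stopword set so that the example runs without the NLTK corpus:</p>

```python
# Hypothetical subset of a stopword list
STOPWORDS = frozenset({"you", "are", "the", "to", "a", "is"})

def filter_stopwords(text, stopwords=STOPWORDS):
    # Build a new sequence instead of mutating the list being iterated
    return " ".join(tok for tok in text.split() if tok not in stopwords)

print(filter_stopwords("hey! you are fab"))
# hey! fab
```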

<h5># Lemmatization</h5>

In [30]:
lemma = WordNetLemmatizer()
def lemmatize_word(w):
    try:
        x = lemma.lemmatize(w).lower()
        return x
    except Exception as e:
        return w


def lemmatize_sentence(text):
    x = [lemmatize_word(t) for t in text.split()]
    return " ".join(x)

In [33]:
lemmatize_sentence("laurie's gardening design protective decal skin")

"laurie's gardening design protective decal skin"

<h5># Stemming</h5>

In [34]:
stemmer = PorterStemmer()
def stemming_word(w):
    return stemmer.stem(w)


def stemming_sentence(text):
    x = [stemming_word(t) for t in text.split()]
    return " ".join(x)

In [35]:
stemming_sentence("laurie's gardening design protective decal skin")

"laurie' garden design protect decal skin"

<h3>Combining all preprocessing methods</h3>

In [38]:
def preprocessing_all_method(tweet):
    tweet = abbreviation_replacement(tweet)
    tweet = emphaszie_punctuation(tweet)
    tweet = tweet + ' ' + clean_hashtag(tweet)
    tweet = emoji_translation(tweet)
    tweet = remove_number(tweet)
    tweet = emphasize_pos_and_neg_words(tweet)
    tweet = spelling_correction(tweet)
    tweet = remove_stopwords(tweet)
    tweet = stemming_sentence(tweet)
    tweet = lemmatize_sentence(tweet)
    return tweet.strip().lower()

In [39]:
preprocessing_all_method("laurie's gardening design protective decal skin")

"laurie' garden design <positive> protect deal skin"

In [40]:
# Parallelize pandas DataFrames
def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def multiply_columns(data):
    data['tweet'] = data['tweet'].apply(lambda x: preprocessing_all_method(x))
    return data

In [41]:
print("Test data preprocessing start!")
X_test = pd.read_pickle("data/pickles/test_origin.pkl")
X_test = parallelize_dataframe(X_test, multiply_columns)
X_test

Test data preprocessing start!


                                               tweet
0  sea pro sea scooter \( sport the <positive> po...
1  <user> shuck <positive> well <positive> work w...
2           cant stay away <negative> bug that' babi
3  <user> madam ! <repeat> ! <repeat> ! <repeat> ...
4  whenev <negative> fall asleep watch televis , ...
5  <user> need get rip that thing ! <repeat> scar...
6  whatev \. <repeat> a <negative> terribl mood \...
7  ye ! <repeat> <user> <user> thank jordan , <po...
8  friend <user> text to check on last night \. <...
9  <user> #followback pleas \. <repeat> will #uni...


In [42]:
X_test.to_pickle("data/pickles/test_clean.pkl")
print("Test data preprocessing finish!")

Test data preprocessing finish!


In [44]:
print("Train data preprocessing start!")
X_train = pd.read_pickle("data/pickles/train_origin.pkl")
X_train = parallelize_dataframe(X_train, multiply_columns)
X_train

Train data preprocessing start!


                                               tweet
0  <user> dunno justin read mention not \. <repea...
1  your logic so <negative> dumb , wonot even cro...
2  " <user> put casper a box ! <repeat> " love ba...
3  <user> <user> thank sir > > dont trip littl ma...
4  visit brother tomorrow the bestest birthday gi...
5  <user> <positive> may ! <repeat> ! <repeat> #l...
6  <user> #1dnextalbumtitl : feel you / <negative...
7  work <negative> hard hardli work <user> hardee...
8  <user> saw \. <repeat> be repli a bit \. <repeat>
9                                        is i belong


In [None]:
X_train.to_pickle("data/pickles/train_clean.pkl")
print("Train data preprocessing finish!")