# Twitter Sentiment Analysis

#### In this project we analyze the text of tweets about a company and sort them into Positive, Negative, or Neutral. This helps the company understand how people (its customers) feel about it.
### STEPS:
#### 1- Scrape the tweets (e.g. Careem Pak)
#### 2- Data Preparation (Preprocessing and Cleaning)
#### 3- Train the model (Use ML Algorithm)
#### 4- Test and Predict
##### GROUP: GHAZANFAR ALI (FA19-MSCS-0016) & MUHAMMAD ASHAR (FA19-MSCS-0007)

## NOTE:
Initially we tried to get Twitter data, but our API request was rejected, as we discussed earlier.
We then planned to scrape WhatMobile reviews. We did so, but most of those posts are people trying to sell
their own items rather than actual reviews, so we dropped that data.
Finally, we searched online and found a Twitter dataset available in CSV format.
That CSV file is the data we use here.

### IMPORTING NECESSARY MODULES

In [1]:
import numpy as np
import pandas as pd
import string

from sklearn import model_selection, preprocessing

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet         # To Lemmatize with POS Tag
from nltk import pos_tag
from nltk.corpus import stopwords


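#### The following NLTK resources need to be downloaded once, if they are not already present
nltk.download('punkt')   # tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')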

## Read data from the CSV file and store it in a DataFrame

In [2]:
df = pd.read_csv('data.csv',delimiter=',')
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 3 columns):
tweet_id     4999 non-null float64
sentiment    4999 non-null object
text         4999 non-null object
dtypes: float64(1), object(2)
memory usage: 117.2+ KB
None
       tweet_id sentiment                                               text
0  5.679000e+17  negative  @SouthwestAir I am scheduled for the morning, ...
1  5.699890e+17  positive  @SouthwestAir seeing your workers time in and ...
2  5.680890e+17  positive  @united Flew ORD to Miami and back and  had gr...
3  5.689280e+17  negative     @SouthwestAir @dultch97 that's horse radish 😤🐴
4  5.685940e+17  negative  @united so our flight into ORD was delayed bec...


## Getting the tweet (text) and sentiment columns

In [3]:
text = df['text']
sentiment = df['sentiment']
print(f"tweet: {text[4]} =======> sentiment: {sentiment[4]}")
print(type(sentiment))
print(type(text))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


## Splitting the data into train and test

In [4]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(text, sentiment, test_size=0.2)

In [5]:
print(f"X_train:{len(X_train)} y_train:{len(y_train)} X_test:{len(X_test)} y_test:{len(y_test)}")

X_train:3999 y_train:3999 X_test:1000 y_test:1000
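
The split above is drawn at random on every run, so the counts stay the same but the rows differ. A minimal sketch, assuming one wants a repeatable split that also keeps the label proportions similar in both parts: `random_state` and `stratify` are standard `train_test_split` parameters, while the toy `texts` and `labels` lists below are made up for illustration and are not the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical), standing in for df['text'] and df['sentiment']
texts = [f"tweet {i}" for i in range(10)]
labels = ["negative"] * 5 + ["positive"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # same 80/20 ratio as the notebook
    random_state=42,   # fixed seed so the split is repeatable
    stratify=labels,   # keep label proportions similar in both parts
)

print(len(X_train), len(X_test))
```

Whether stratifying is worthwhile depends on how uneven the sentiment labels are in the real CSV file.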


## Create a separate CSV file for each split (just to save it) and read it back

In [6]:
X_train.to_csv('xtrain.csv', header=['text'], index=False)
X_test.to_csv('xtest.csv', header=['text'], index=False)
y_train.to_csv('ytrain.csv', header=['sentiment'], index=False)
y_test.to_csv('ytest.csv', header=['sentiment'], index=False)

In [7]:
X_train = pd.read_csv('xtrain.csv')['text']
X_test = pd.read_csv('xtest.csv')['text']
y_train = pd.read_csv('ytrain.csv')['sentiment']
y_test = pd.read_csv('ytest.csv')['sentiment']

In [8]:
print(f"X_train:{len(X_train)} y_train:{len(y_train)} X_test:{len(X_test)} y_test:{len(y_test)}")

X_train:3999 y_train:3999 X_test:1000 y_test:1000


## Tokenizing words and creating a list of (tweet, sentiment) tuples

In [9]:
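def tokenizing(tweets, sentiments):
    # Pair each lower-cased, tokenized tweet with its sentiment label.
    # zip() walks both series in order, so it does not depend on the index labels being 0..n-1.
    doc_list = []
    for tweet, label in zip(tweets, sentiments):
        doc_list.append((word_tokenize(tweet.lower()), label))
    return doc_list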
    
    
reviews = tokenizing(X_train, y_train)
test_reviews = tokenizing(X_test, y_test)
print(f"length of reviews: {len(reviews)} length of test_reviews: {len(test_reviews)}")

print(reviews[9])
print(test_reviews[9])

length of reviews: 3999 length of test_reviews: 1000
(['@', 'united', 'i', 'would', 'like', 'to', 'know', 'if', 'its', 'possiable', 'to', 'checkin', 'online', 'or', 'must', 'it', 'be', 'done', 'at', 'schiphol', '?'], 'neutral')
(['@', 'americanair', 'their', 'flights', 'into', 'buffalo', 'as', 'well', '--', 'you', 'were', 'the', 'only', 'flight', 'cancelled', 'flightled', '!', '!', '!'], 'negative')


## Shuffle the reviews

In [10]:
import random
random.shuffle(reviews)
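
`random.shuffle` reorders the list in place and gives a different order on each run. A small sketch, assuming a repeatable order is wanted: a dedicated `random.Random` instance with a fixed seed (the value 42 and the toy `items` list are arbitrary choices for illustration) produces the same shuffle every time.

```python
import random

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)                  # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # private generator, global state unchanged
    return shuffled

items = [("tweet a", "negative"), ("tweet b", "positive"),
         ("tweet c", "neutral"), ("tweet d", "negative")]

first = seeded_shuffle(items)
second = seeded_shuffle(items)
print(first == second)  # True: both calls used the same seed
```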

## Lemmatization
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just strips characters from the end of the word, which often produces incorrect meanings and spelling errors.

For example, lemmatization correctly reduces ‘caring’ to ‘care’, whereas a crude stemmer that simply cuts off the ‘ing’ would turn it into ‘car’.

‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’

WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It also offers lemmatization capabilities and is one of the earliest and most commonly used lemmatizers.
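
The contrast can be sketched without any NLTK data. The code below is a toy illustration only: `naive_stem` and `toy_lemmatize` are hypothetical helpers, the tiny `TOY_LEMMAS` table stands in for a real lexical database, and real stemmers apply extra rules, so their output may differ from this crude suffix stripping.

```python
# Toy lookup table standing in for a lexical database such as WordNet
TOY_LEMMAS = {"caring": "care", "studies": "study", "flights": "flight"}

def naive_stem(word):
    """Crude stemming: chop a known suffix off the end, nothing more."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary-style lemmatization: look up the known base form."""
    return TOY_LEMMAS.get(word, word)

for w in ("caring", "studies", "flights"):
    print(w, "-> stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

Suffix stripping happens to work for 'flights' but leaves 'car' and 'studie' for the other two words, which is the kind of error a dictionary-backed lemmatizer avoids.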

In [11]:
def get_stop_words(language):
    stop_words = set(stopwords.words(language)) # create set of english stop words
    punctuations = list(string.punctuation) # get punctuations (using string import earlier)
    stop_words.update(punctuations) # update stop words with punctuations
    return stop_words

In [12]:
stops = get_stop_words('english')
print(type(stops))
print(stops)

<class 'set'>
{"you've", 'been', 'the', 'o', 'its', "mustn't", "haven't", "wasn't", 'there', 'these', 'where', 'were', 'will', 'it', 'am', 'about', 'very', 'both', '"', 'me', ':', '{', 'just', 'himself', 'has', '~', 'from', 'weren', 'hers', 'over', "couldn't", 'some', 'here', '<', '>', 'this', 'in', 'shan', 're', 'itself', "don't", ',', 'her', 'below', 'for', 'did', 'aren', 'mustn', 'during', 'again', 'by', 'mightn', '\\', 'ours', 'as', 'why', 'haven', 'a', 'because', 'once', 'they', 'that', 'then', 'won', 'any', '#', 'above', 'how', 'before', 'his', 'who', 'between', 'their', 'your', 'wasn', '/', 'herself', '|', '+', '^', '!', '_', "needn't", 'against', 'themselves', "you're", ')', "doesn't", 'but', 'll', '*', "won't", 'ma', 'each', 'more', "hasn't", 'isn', 'don', 'our', 'my', 'no', 'through', 'yours', 'which', 'was', 'can', "wouldn't", 'an', 'hadn', 'she', "shan't", 'while', 'doing', 'yourself', "mightn't", '?', "didn't", 'all', "that'll", 'out', 'same', 'm', 'theirs', "you'll", 'him

### Getting the part-of-speech tag for each word

In [13]:
# Map a Penn Treebank tag (as returned by pos_tag) to the matching WordNet POS constant
def get_pos_tag(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [14]:
# Example
w = "study"
pos = pos_tag([w])  # pos_tag expects a list, even for a single word
print(pos)
get_pos_tag(pos[0][1])

[('study', 'NN')]


'n'
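
As the `'n'` output shows, the WordNet POS constants are one-letter strings (`'a'`, `'v'`, `'n'`, `'r'`). A compact sketch of the same mapping as a dictionary lookup, written with plain strings so it runs without the WordNet corpus; `penn_to_wordnet` and `WORDNET_POS` are hypothetical names, not part of NLTK.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")

for tag in ("JJ", "VBG", "NN", "RB", "IN"):
    print(tag, "->", penn_to_wordnet(tag))
```

Any tag outside the four groups, such as the preposition tag `IN`, falls back to noun, matching the `else` branch of `get_pos_tag`.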

### Review-cleaning function

In [15]:
# In order to lemmatize, you need to create an instance of the WordNetLemmatizer() and call the lemmatize() 
# function on a single word.
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def cleaning_reviews(sentence):
    output = []
    for word in sentence:
        if word.lower() not in stops:
            pos = pos_tag([word])
            clean_word = lemmatizer.lemmatize(word, get_pos_tag(pos[0][1]))
            output.append(clean_word.lower())
    return output

In [16]:
# Example to clean review
sentence = ['@', 'americanair', 'i', "'m", 'trying', 'to', 'choose', 'my', 'seats', 'but', 'every', 'time', 'i', 'go', 'to', 'the', 'next', 'flight', 'i', 'get', 'system', 'error']
clean = cleaning_reviews(sentence)
print(clean)

['americanair', "'m", 'try', 'choose', 'seat', 'every', 'time', 'go', 'next', 'flight', 'get', 'system', 'error']
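
Stop-word removal leaves tweet-specific noise in place: the airline handle survives above, and tokens such as 'http' and 'amp' show up among the most common words in the feature list. A possible pre-cleaning step, sketched with the standard `re` and `html` modules and meant to run on the raw tweet before tokenizing; `strip_tweet_noise` is a hypothetical helper, and dropping mentions is a design choice that also removes the airline names from the features.

```python
import html
import re

def strip_tweet_noise(tweet):
    """Remove URLs and @mentions and decode HTML entities in a raw tweet."""
    tweet = html.unescape(tweet)                  # "&amp;" becomes "&"
    tweet = re.sub(r"https?://\S+", " ", tweet)   # drop links
    tweet = re.sub(r"@\w+", " ", tweet)           # drop @mentions
    return re.sub(r"\s+", " ", tweet).strip()     # collapse leftover whitespace

raw = "@united flight delayed &amp; no update http://t.co/xyz"
print(strip_tweet_noise(raw))  # flight delayed & no update
```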


### Clean reviews and test_reviews

In [17]:
reviews = [(cleaning_reviews(review), sentiment) for review, sentiment in reviews]
test_reviews = [(cleaning_reviews(review), sentiment) for review, sentiment in test_reviews]

In [18]:
print(reviews[19])
print(test_reviews[4])

(['jetblue', 'make', 'happy', 'hope', "n't", 'empty', 'promise'], 'positive')
(['united', '1/2', 'thanks', 'answer', 'question', 'want', 'make', 'reservation', 'phone', 'perth', 'hold', '24hrs', 'family'], 'negative')


In [19]:
# Just to verify number of tweets in training and testing
print(len(reviews))
print(len(test_reviews))

3999
1000


### Getting word frequencies and building features from the most common words

In [20]:
def get_features(reviews):
    all_words = []
    for review in reviews:
        all_words += review[0]
    common_words = nltk.FreqDist(all_words).most_common(2500) # get frequency of all words then top 2500
    features = [i[0] for i in common_words]
    return features
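
The same top-k selection can be illustrated on toy data with `collections.Counter`, whose `most_common` method behaves like the one used on `nltk.FreqDist` above. This is a sketch only: `top_k_words` and the three toy reviews are made up for the example.

```python
from collections import Counter

def top_k_words(tokenized_reviews, k):
    """Return the k most frequent tokens across (tokens, label) pairs."""
    counts = Counter()
    for tokens, _label in tokenized_reviews:
        counts.update(tokens)                 # add this review's tokens to the tally
    return [word for word, _count in counts.most_common(k)]

toy_reviews = [
    (["flight", "delay", "flight"], "negative"),
    (["great", "flight"], "positive"),
    (["delay"], "negative"),
]

print(top_k_words(toy_reviews, 2))  # ['flight', 'delay']
```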

In [21]:
features = get_features(reviews)
print(features)

['flight', 'united', 'usairways', 'americanair', 'jetblue', 'southwestair', 'get', "n't", "'s", 'http', 'hour', 'thanks', 'service', 'cancel', 'time', '...', 'u', 'help', 'customer', 'call', 'hold', 'wait', 'go', 'plane', 'bag', '2', 'amp', 'fly', 'would', 'thank', 'need', 'make', 'still', 'try', "'m", 'one', 'back', 'say', 'day', 'gate', 'delayed', 'airline', 'take', 'bad', 'please', 'ca', 'like', 'virginamerica', 'late', 'book', 'guy', "'ve", 'delay', 'phone', 'agent', 'seat', 'change', 'today', '``', 'flightled', "''", 'ticket', 'know', 'work', 'well', 'check', 'never', 'could', 'airport', 'minute', 'miss', 'great', "'re", 'give', 'see', '3', 'use', 'hr', 'home', 'weather', 'problem', 'travel', 'tomorrow', 'really', 'min', 'love', 'dm', 'another', 'want', 'even', 'look', 'luggage', 'someone', 'good', 'people', 'lose', 'last', 'issue', 'way', 'much', "'ll", 'let', 'sit', 'new', 'right', '4', 'come', 'first', 'email', 'ever', 'told', 'staff', 'passenger', 'reservation', 'trip', 'next'

In [22]:
def get_features_dict(review):
    current_review = {}
    review_set = set(review)
    for w in features:
        current_review[w] = w in review_set
    return current_review
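
A small worked example of this encoding, assuming a three-word vocabulary invented for illustration. The hypothetical `to_feature_dict` takes the vocabulary as a parameter instead of reading the global `features` list, which makes it easier to try on toy inputs.

```python
def to_feature_dict(tokens, vocabulary):
    """Mark, for every vocabulary word, whether it occurs in the review."""
    token_set = set(tokens)                   # set lookup is fast for long vocabularies
    return {word: (word in token_set) for word in vocabulary}

vocabulary = ["flight", "delay", "great"]
review = ["great", "flight", "crew"]

print(to_feature_dict(review, vocabulary))
# {'flight': True, 'delay': False, 'great': True}
```

Words outside the vocabulary, such as 'crew' here, are ignored, so every review maps to a dictionary with the same keys.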

## Creating a list of tuples, where each tuple has 2 values:
##### 1- a dictionary of features, each True or False for the current review
##### 2- the sentiment

In [23]:
train_in = [(get_features_dict(review), sentiment) for review, sentiment in reviews]
test_in = [(get_features_dict(review), sentiment) for review, sentiment in test_reviews]

In [24]:
print(train_in[0])

({'flight': True, 'united': False, 'usairways': True, 'americanair': False, 'jetblue': False, 'southwestair': False, 'get': True, "n't": False, "'s": False, 'http': False, 'hour': False, 'thanks': False, 'service': False, 'cancel': False, 'time': False, '...': False, 'u': False, 'help': False, 'customer': False, 'call': False, 'hold': False, 'wait': False, 'go': False, 'plane': False, 'bag': False, '2': False, 'amp': False, 'fly': False, 'would': False, 'thank': False, 'need': False, 'make': False, 'still': False, 'try': False, "'m": False, 'one': False, 'back': False, 'say': False, 'day': False, 'gate': False, 'delayed': True, 'airline': False, 'take': False, 'bad': False, 'please': False, 'ca': False, 'like': False, 'virginamerica': False, 'late': False, 'book': False, 'guy': False, "'ve": False, 'delay': False, 'phone': False, 'agent': False, 'seat': False, 'change': False, 'today': False, '``': False, 'flightled': False, "''": False, 'ticket': False, 'know': False, 'work': False, '

## Training and Testing
### Method 1: Using NaiveBayesClassifier

In [25]:
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_in)

In [26]:
classifier.classify(test_in[5][0])

'negative'

In [27]:
nltk.classify.accuracy(classifier, test_in)

0.758
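
An accuracy figure is easier to judge next to a trivial baseline. The sketch below, which uses toy labels rather than the notebook's data, computes the accuracy of always predicting the most frequent training label; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _count = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

toy_train = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
toy_test = ["negative", "negative", "negative", "positive", "neutral"]

print(majority_baseline_accuracy(toy_train, toy_test))  # ('negative', 0.6)
```

Calling it as `majority_baseline_accuracy(y_train, y_test)` on the real split would show how much of the classifier's accuracy comes from one sentiment simply being more common, if that is the case in this dataset.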

In [28]:
classifier.show_most_informative_features(20)

Most Informative Features
                   great = True           positi : neutra =     23.0 : 1.0
                 awesome = True           positi : neutra =     21.1 : 1.0
                  excite = True           positi : negati =     21.0 : 1.0
                    hold = True           negati : positi =     20.1 : 1.0
              definitely = True           positi : negati =     18.6 : 1.0
                   thank = True           positi : negati =     17.8 : 1.0
                   amaze = True           positi : negati =     17.1 : 1.0
                    rock = True           positi : negati =     17.1 : 1.0
                 atlanta = True           neutra : negati =     16.3 : 1.0
                    haha = True           positi : negati =     16.1 : 1.0
                 welcome = True           positi : negati =     16.1 : 1.0
               companion = True           neutra : negati =     14.4 : 1.0
                      hr = True           negati : positi =     14.3 : 1.0

### Method 2: Using RandomForestClassifier
Random forest is a supervised learning algorithm that can be used for both classification and regression. It is flexible and easy to use. A forest is made up of decision trees, and adding more trees generally makes the forest more robust. A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting. It also provides a fairly good indicator of feature importance.
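
The voting step can be sketched in a few lines. This is a conceptual illustration with made-up tree outputs, not scikit-learn's internal code: `RandomForestClassifier` combines its trees by averaging their class probabilities rather than counting hard votes, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    # On a tie, Counter keeps the label that appeared first in the list
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one tweet
tree_predictions = ["negative", "negative", "positive", "neutral", "negative"]

print(majority_vote(tree_predictions))  # negative
```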

In [29]:
from sklearn.ensemble import RandomForestClassifier
from nltk.classify.scikitlearn import SklearnClassifier

In [30]:
rfc = RandomForestClassifier()
classifier_sklearn = SklearnClassifier(rfc)

In [31]:
classifier_sklearn.train(train_in)



<SklearnClassifier(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))>

In [32]:
nltk.classify.accuracy(classifier_sklearn, test_in)

0.708