## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular mobile device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the rows with NA values

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)

Mounted at /content/gdrive


In [0]:
data = pd.read_csv('/content/gdrive/My Drive/tweets.csv',encoding='unicode-escape')

In [27]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [28]:
data.count()

tweet_text                                            9092
emotion_in_tweet_is_directed_at                       3291
is_there_an_emotion_directed_at_a_brand_or_product    9093
dtype: int64

In [0]:
data.dropna(inplace=True)

In [30]:
data.count()

tweet_text                                            3291
emotion_in_tweet_is_directed_at                       3291
is_there_an_emotion_directed_at_a_brand_or_product    3291
dtype: int64
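
The read and the `dropna` can also be chained into a single expression. The following is a minimal sketch on a hypothetical three-row stand-in for `tweets.csv` (the rows and the name `sample` are invented for illustration), not the dataset used above.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for tweets.csv; the second row has no target brand.
csv_text = (
    "tweet_text,emotion_in_tweet_is_directed_at,"
    "is_there_an_emotion_directed_at_a_brand_or_product\n"
    "Love my new phone,iPhone,Positive emotion\n"
    "Just landed in Austin,,No emotion toward brand or product\n"
    "Battery died again,iPad,Negative emotion\n"
)

# Read the CSV and drop incomplete rows in one chained expression.
sample = pd.read_csv(io.StringIO(csv_text)).dropna().reset_index(drop=True)
print(sample.shape)
```

Only rows with a value in every column survive, which is why the counts above fall from about 9,000 to 3,291.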

### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [0]:
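def preprocess(text):
    # Intended to keep only ASCII text.
    # Note: in Python 3, str has no .decode() method, so the call below raises
    # AttributeError and the except branch returns "" for every tweet.
    try:
        return text.decode('ascii')
    except Exception as e:
        return ""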

In [0]:
data['text'] = [preprocess(text) for text in data.tweet_text]

In [34]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | |
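
The `text` column above is empty for every row, most likely because `str` objects in Python 3 have no `.decode()` method, so `preprocess` always falls into its `except` branch. One possible alternative is sketched below; the name `to_ascii` is hypothetical, and this is not the preprocessing that produced the outputs in this notebook.

```python
def to_ascii(text):
    """Drop every non-ASCII character from a Python 3 str."""
    # encode() turns the str into bytes, silently skipping characters that
    # have no ASCII representation; decode() turns the bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("caf\u00e9 at #SXSW"))
```

Applied with `data['tweet_text'].apply(to_ascii)`, this would fill the column with the ASCII-only version of each tweet.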


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [35]:
data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [36]:
# Count the rows whose label is neither positive nor negative
data[data['is_there_an_emotion_directed_at_a_brand_or_product'].isin(
    ['No emotion toward brand or product', "I can't tell"])].count()

tweet_text                                            100
emotion_in_tweet_is_directed_at                       100
is_there_an_emotion_directed_at_a_brand_or_product    100
text                                                  100
dtype: int64

In [37]:
# Drop the rows whose label is neither positive nor negative
data = data[~data['is_there_an_emotion_directed_at_a_brand_or_product'].isin(
    ['No emotion toward brand or product', "I can't tell"])]
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | |


In [38]:
data.count()

tweet_text                                            3191
emotion_in_tweet_is_directed_at                       3191
is_there_an_emotion_directed_at_a_brand_or_product    3191
text                                                  3191
dtype: int64

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,2))

In [0]:
dtm = vect.fit_transform(data['tweet_text']).toarray()

In [42]:
dtm

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [43]:
len(vect.vocabulary_)  # number of distinct unigram and bigram features

29892
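
Calling `.toarray()` on 3,191 tweets with 29,892 features creates roughly 95 million cells, almost all of them zero. `fit_transform` already returns a SciPy sparse matrix, and `MultinomialNB` accepts it directly. The sketch below shows this on a hypothetical toy corpus (the names `docs`, `labels`, `toy_vect`, `toy_dtm` and `toy_nb` are invented here); it is an optional variant, not what the cells above do.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
docs = ["great phone", "awful battery", "great app", "awful screen"]
labels = [1, 0, 1, 0]

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_dtm = toy_vect.fit_transform(docs)  # SciPy sparse matrix, no .toarray()

print(sparse.issparse(toy_dtm), toy_dtm.shape)

# MultinomialNB can be fitted on the sparse matrix as is.
toy_nb = MultinomialNB().fit(toy_dtm, labels)
print(toy_nb.predict(toy_vect.transform(["great screen"])))
```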

### 5. Find number of different words in vocabulary

In [44]:
vect.vocabulary_

{'wesley83': 28452,
 'have': 10663,
 '3g': 271,
 'iphone': 12666,
 'after': 722,
 'hrs': 11419,
 'tweeting': 27085,
 'at': 2580,
 'rise_austin': 21214,
 'it': 13187,
 'was': 28115,
 'dead': 6219,
 'need': 17094,
 'to': 26229,
 'upgrade': 27444,
 'plugin': 19651,
 'stations': 23050,
 'sxsw': 23695,
 'wesley83 have': 28453,
 'have 3g': 10664,
 '3g iphone': 277,
 'iphone after': 12672,
 'after hrs': 733,
 'hrs tweeting': 11420,
 'tweeting at': 27087,
 'at rise_austin': 2666,
 'rise_austin it': 21215,
 'it was': 13361,
 'was dead': 28132,
 'dead need': 6222,
 'need to': 17118,
 'to upgrade': 26550,
 'upgrade plugin': 27446,
 'plugin stations': 19653,
 'stations at': 23051,
 'at sxsw': 2678,
 'jessedee': 13493,
 'know': 13884,
 'about': 465,
 'fludapp': 8448,
 'awesome': 2966,
 'ipad': 12238,
 'app': 1855,
 'that': 24899,
 'you': 29497,
 'll': 14731,
 'likely': 14420,
 'appreciate': 2212,
 'for': 8540,
 'its': 13380,
 'design': 6388,
 'also': 990,
 'they': 25737,
 're': 20612,
 'giving': 95
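
The keys above are lowercased tokens of two or more word characters, plus pairs of adjacent tokens, which is what `CountVectorizer` produces with its default `token_pattern` and `ngram_range=(1,2)`. The pure-Python sketch below imitates that behaviour on the visible part of the first tweet; the helper `word_ngrams` is hypothetical and only approximates the library's analyzer.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    """Lowercase, keep tokens of 2+ word characters, then join adjacent tokens."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams(".@wesley83 I have a 3G iPhone. After 3 hrs"))
```

Single-character tokens such as `I`, `a` and `3` are dropped, which is why `'have 3g'` and `'after hrs'` appear as bigrams in the vocabulary.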

#### Tip: To list all attributes and methods available on an object, use `dir`

In [45]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_sort_features',
 '_stop_words_id',
 '_validate_custom_analyzer',
 '_validate_params',
 '_validate_vocabulary',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words',
 'input',
 'inverse_transf

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [46]:
data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 7. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named `label` in the same dataframe

Hint: use map on that column and give labels

In [0]:
data['label'] = data.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion':1,'Negative emotion':0})

In [48]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text | label |
|---|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | | 0 |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | | 1 |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | | 1 |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | | 0 |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | | 1 |


### 8. Define the feature set (independent variable, X) as the `tweet_text` column and `label` as the target (dependent variable), then split into train and test datasets

In [0]:
X = data['tweet_text']
Y = data['label']

In [0]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=7)

In [53]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(2233,)
(958,)
(2233,)
(958,)
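
The labels are imbalanced (2,672 positive versus 519 negative), and the split above does not stratify. One option, shown here only as a sketch on hypothetical labels (`toy_X`, `toy_y` and the 80/20 ratio are invented), is to pass `stratify` so both splits keep the same class ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 positive (1) and 20 negative (0).
toy_X = [f"tweet {i}" for i in range(100)]
toy_y = [1] * 80 + [0] * 20

# stratify=toy_y keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=7, stratify=toy_y
)
print(Counter(y_tr), Counter(y_te))
```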


### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression, and report their accuracy scores, to predict the sentiment of the given text

In [0]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [0]:
nb = MultinomialNB()
logr = LogisticRegression()

In [0]:
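# Note: vect was fitted on the full dataset in step 4, so its vocabulary
# also contains terms that occur only in the test split.
dtmTrain = vect.transform(X_train).toarray()
dtmTest = vect.transform(X_test).toarray()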

In [61]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Fit Naive Bayes on the training matrix and evaluate it on the test matrix
nb.fit(dtmTrain,Y_train)
Y_pred_nb = nb.predict(dtmTest)
print(accuracy_score(Y_test,Y_pred_nb))
print(classification_report(Y_test,Y_pred_nb))
print(confusion_matrix(Y_test,Y_pred_nb))

0.8549060542797495
              precision    recall  f1-score   support

           0       0.54      0.45      0.49       149
           1       0.90      0.93      0.92       809

    accuracy                           0.85       958
   macro avg       0.72      0.69      0.70       958
weighted avg       0.85      0.85      0.85       958

[[ 67  82]
 [ 57 752]]
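
The figures in the report can be recomputed by hand from the confusion matrix, whose rows are true labels and whose columns are predicted labels. The sketch below does that for the Naive Bayes result above and also computes the accuracy of always predicting the positive class, as a point of comparison; the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 67, 82    # true negatives, negatives predicted as positive
fn, tp = 57, 752   # positives predicted as negative, true positives

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_neg = tn / (tn + fn)         # share of predicted negatives that are truly negative
recall_neg = tn / (tn + fp)            # share of true negatives that were found
majority_baseline = (fn + tp) / total  # accuracy of always predicting positive

print(round(accuracy, 4), round(precision_neg, 2),
      round(recall_neg, 2), round(majority_baseline, 4))
# prints: 0.8549 0.54 0.45 0.8445
```

Because about 84% of the test tweets are positive, accuracy alone says little here; the recall for class 0 is the more informative number.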


In [62]:
# Fit Logistic Regression on the same matrices; the metric functions
# were imported in the previous cell
logr.fit(dtmTrain,Y_train)
Y_pred_logr = logr.predict(dtmTest)
print(accuracy_score(Y_test,Y_pred_logr))
print(classification_report(Y_test,Y_pred_logr))
print(confusion_matrix(Y_test,Y_pred_logr))

0.8768267223382046
              precision    recall  f1-score   support

           0       0.75      0.31      0.44       149
           1       0.89      0.98      0.93       809

    accuracy                           0.88       958
   macro avg       0.82      0.65      0.68       958
weighted avg       0.86      0.88      0.85       958

[[ 46 103]
 [ 15 794]]


### 10. Create a function called `tokenize_predict` that takes a count vectorizer object as input and prints the accuracy for X (text) and Y (labels)

In [0]:
def tokenize_test(vect):
    # Fit the vectorizer on the training split only, then vectorize both splits
    x_train_dtm = vect.fit_transform(X_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(X_test)
    # Train a fresh Naive Bayes model and report test accuracy
    nb = MultinomialNB()
    nb.fit(x_train_dtm, Y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', accuracy_score(Y_test, y_pred_class))

In [0]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

def tokenize_predict(vect):
    # Fit the vectorizer on the training split only, then vectorize both splits
    vect.fit(X_train)
    dtmTr = vect.transform(X_train)
    dtmTe = vect.transform(X_test)
    # Train a fresh Naive Bayes model and print its test metrics
    nb = MultinomialNB()
    nb.fit(dtmTr,Y_train)
    Y_pred_nb = nb.predict(dtmTe)
    print(accuracy_score(Y_test,Y_pred_nb))
    print(classification_report(Y_test,Y_pred_nb))
    print(confusion_matrix(Y_test,Y_pred_nb))
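
The function above fits the vectorizer and the classifier as two separate steps. scikit-learn's `make_pipeline` can bundle both into one estimator, which makes it harder to fit the vectorizer on test data by accident. The following is a minimal sketch on hypothetical toy data (`train_docs`, `train_labels` and `pipe` are invented names), not a replacement for the function used in the steps below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, for illustration only.
train_docs = ["great phone", "awful battery", "great app", "awful screen"]
train_labels = [1, 0, 1, 0]

# The pipeline vectorizes the text and then fits Naive Bayes in one call.
pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipe.fit(train_docs, train_labels)

# predict() applies the already fitted vectorizer before classifying.
print(pipe.predict(["great app"]))
```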

### 11. Create a count vectorizer with `ngram_range=(1,2)` and pass it to the `tokenize_predict` function to print the accuracy score

In [74]:
cv_12 = CountVectorizer(ngram_range=(1,2))
tokenize_predict(cv_12)

0.8736951983298539
              precision    recall  f1-score   support

           0       0.94      0.20      0.33       149
           1       0.87      1.00      0.93       809

    accuracy                           0.87       958
   macro avg       0.90      0.60      0.63       958
weighted avg       0.88      0.87      0.84       958

[[ 30 119]
 [  2 807]]


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_predict` function to print the accuracy score

In [75]:
cv_sw = CountVectorizer(stop_words='english')
tokenize_predict(cv_sw)

0.8674321503131524
              precision    recall  f1-score   support

           0       0.76      0.21      0.34       149
           1       0.87      0.99      0.93       809

    accuracy                           0.87       958
   macro avg       0.82      0.60      0.63       958
weighted avg       0.86      0.87      0.83       958

[[ 32 117]
 [ 10 799]]


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_predict` function to print the accuracy score

In [76]:
cv_sw_300 = CountVectorizer(stop_words='english',max_features=300)
tokenize_predict(cv_sw_300)

0.8402922755741128
              precision    recall  f1-score   support

           0       0.48      0.36      0.41       149
           1       0.89      0.93      0.91       809

    accuracy                           0.84       958
   macro avg       0.68      0.65      0.66       958
weighted avg       0.82      0.84      0.83       958

[[ 54  95]
 [ 58 751]]


### 14. Create a count vectorizer with `ngram_range=(1,2)` and `max_features=15000` and pass it to the `tokenize_predict` function to print the accuracy score

In [77]:
# Note: this cell uses stop_words='english' rather than ngram_range=(1,2) as the heading asks
cv_sw_15000 = CountVectorizer(stop_words='english',max_features=15000)
tokenize_predict(cv_sw_15000)

0.8674321503131524
              precision    recall  f1-score   support

           0       0.76      0.21      0.34       149
           1       0.87      0.99      0.93       809

    accuracy                           0.87       958
   macro avg       0.82      0.60      0.63       958
weighted avg       0.86      0.87      0.83       958

[[ 32 117]
 [ 10 799]]


### 15. Create a count vectorizer with `ngram_range=(1,2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_predict` function to print the accuracy score

In [78]:
cv_min = CountVectorizer(ngram_range=(1,2),min_df=2)
tokenize_predict(cv_min)

0.8768267223382046
              precision    recall  f1-score   support

           0       0.69      0.38      0.49       149
           1       0.89      0.97      0.93       809

    accuracy                           0.88       958
   macro avg       0.79      0.67      0.71       958
weighted avg       0.86      0.88      0.86       958

[[ 56  93]
 [ 25 784]]
