# Sentiment Analysis of text files

The dataset consists of text distributed across three files:
- Training set.
- Test set.
- Validation set.

***Columns***

| Column Name   | Description |
|:---           |:---         |
| textID        | Unique ID for each piece of text |
| text          | The raw text (unique) |
| selected_text | The text that supports the sentiment (unique) |
| sentiment     | The general sentiment of the text (e.g. positive/negative) |
| Mood          | The mood expressed by the text (e.g. joy, sadness, anger) |


In [1]:
# Import libraries

import pandas as pd
import numpy as np
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer

## Loading train, test and validation sets

In [2]:
# Load dataset and make 'textID' column

df_train = pd.read_csv('train.txt', sep =';' , names= ['text','Mood'])
df_test = pd.read_csv('test.txt', sep =';' , names= ['text','Mood'])
df_val = pd.read_csv('val.txt', sep =';', names= ['text','Mood'])


df_train['textID'] = list(range(0,len(df_train)))
df_train.set_index('textID',drop=True,inplace=True)

df_test['textID'] = list(range(0,len(df_test)))
df_test.set_index('textID',drop=True,inplace=True)

df_val['textID'] = list(range(0,len(df_val)))
df_val.set_index('textID',drop=True,inplace=True)


In [3]:
# Check the shape of each set
print('Shape of train data', df_train.shape)
print('Shape of test data',df_test.shape)
print('Shape of validation data',df_val.shape)

Shape of train data (16000, 2)
Shape of test data (2000, 2)
Shape of validation data (2000, 2)


In [4]:
# show the top 5 rows in train set
df_train.head()

| textID | text | Mood |
|:---|:---|:---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [5]:
# show the top 5 rows in test set
df_test.head()

| textID | text | Mood |
|:---|:---|:---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness |
| 1 | im updating my blog because i feel shitty | sadness |
| 2 | i never make her separate from me because i do... | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | joy |
| 4 | i was feeling a little vain when i did this one | sadness |


In [6]:
# show the top 5 rows in validation set
df_val.head()

| textID | text | Mood |
|:---|:---|:---|
| 0 | im feeling quite sad and sorry for myself but ... | sadness |
| 1 | i feel like i am still looking at a blank canv... | sadness |
| 2 | i feel like a faithful servant | love |
| 3 | i am just feeling cranky and blue | anger |
| 4 | i can have for a treat or if i am feeling festive | joy |


In [7]:
# show the counts of each mood in each set
print('Mood count in train :', '\n', df_train['Mood'].value_counts())
print('\nMood count in test :','\n', df_test['Mood'].value_counts())
print('\nMood count in validation :','\n', df_val['Mood'].value_counts())

Mood count in train : 
 joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: Mood, dtype: int64

Mood count in test : 
 joy         695
sadness     581
anger       275
fear        224
love        159
surprise     66
Name: Mood, dtype: int64

Mood count in validation : 
 joy         704
sadness     550
anger       275
fear        212
love        178
surprise     81
Name: Mood, dtype: int64


**There are six moods, so we can group them into two sentiments:**
> - joy, love, surprise &rarr; **Positive Sentiment**
> - fear, anger, sadness &rarr; **Negative Sentiment** 

In [8]:
# Add a 'sentiment' column: 'P' (positive) or 'N' (negative)
df_train['sentiment'] = df_train['Mood'].map({'sadness':'N','anger':'N','love':'P','surprise':'P','fear':'N','joy':'P'})

df_test['sentiment'] = df_test['Mood'].map({'sadness':'N','anger':'N','love':'P','surprise':'P','fear':'N','joy':'P'})

df_val['sentiment'] = df_val['Mood'].map({'sadness':'N','anger':'N','love':'P','surprise':'P', 'fear':'N','joy':'P'})
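
As a quick sanity check on this grouping, the class balance it produces can be worked out from the mood counts printed earlier for the train set. This is a minimal sketch in plain Python; `MOOD_TO_SENTIMENT` is a name introduced here for illustration (the notebook repeats the dictionary inline for each set).

```python
# Same mood-to-sentiment grouping as in the cell above, kept in one shared dict.
MOOD_TO_SENTIMENT = {
    "joy": "P", "love": "P", "surprise": "P",
    "sadness": "N", "anger": "N", "fear": "N",
}

# Mood counts copied from the value_counts() output of the train set.
train_counts = {
    "joy": 5362, "sadness": 4666, "anger": 2159,
    "fear": 1937, "love": 1304, "surprise": 572,
}

# Sum the mood counts per sentiment label.
sentiment_counts = {"P": 0, "N": 0}
for mood, count in train_counts.items():
    sentiment_counts[MOOD_TO_SENTIMENT[mood]] += count

total = sum(sentiment_counts.values())
for label, count in sentiment_counts.items():
    print(label, count, f"{count / total:.1%}")
```

With these counts the train set ends up with 7,238 positive and 8,762 negative rows (roughly 45% and 55%), so the two classes are reasonably balanced.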

In [9]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to C:\Users\Bassant
[nltk_data]     Magdy\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Bassant
[nltk_data]     Magdy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Bassant
[nltk_data]     Magdy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Bassant
[nltk_data]     Magdy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Another view of the dataset: VADER polarity scores

In [10]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

c_train = df_train.copy()
c_test = df_test.copy()
c_val = df_val.copy()

c_train['Sentiments'] = c_train['text'].apply(lambda x: sid.polarity_scores(x))
c_train['Positive Sentiment'] = c_train['Sentiments'].apply(lambda x: x['pos']) 
c_train['Neutral Sentiment'] = c_train['Sentiments'].apply(lambda x: x['neu'])
c_train['Negative Sentiment'] = c_train['Sentiments'].apply(lambda x: x['neg'])
c_train.drop(columns=['Sentiments'],inplace=True)


c_test['Sentiments'] = c_test['text'].apply(lambda x: sid.polarity_scores(x))
c_test['Positive Sentiment'] = c_test['Sentiments'].apply(lambda x: x['pos']) 
c_test['Neutral Sentiment'] = c_test['Sentiments'].apply(lambda x: x['neu'])
c_test['Negative Sentiment'] = c_test['Sentiments'].apply(lambda x: x['neg'])
c_test.drop(columns=['Sentiments'],inplace=True)


c_val['Sentiments'] = c_val['text'].apply(lambda x: sid.polarity_scores(x))
c_val['Positive Sentiment'] = c_val['Sentiments'].apply(lambda x: x['pos']) 
c_val['Neutral Sentiment'] = c_val['Sentiments'].apply(lambda x: x['neu'])
c_val['Negative Sentiment'] = c_val['Sentiments'].apply(lambda x: x['neg'])
c_val.drop(columns=['Sentiments'],inplace=True)

c_train.head()

| textID | text | Mood | sentiment | Positive Sentiment | Neutral Sentiment | Negative Sentiment |
|:---|:---|:---|:---|---:|---:|---:|
| 0 | i didnt feel humiliated | sadness | N | 0.504 | 0.496 | 0.0 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | N | 0.271 | 0.503 | 0.227 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | N | 0.0 | 0.526 | 0.474 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | P | 0.091 | 0.909 | 0.0 |
| 4 | i am feeling grouchy | anger | N | 0.278 | 0.185 | 0.537 |
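
To get a feel for how these lexicon scores relate to the mood-derived labels, the five rows printed above can be compared by hand. The sketch below is only illustrative: `lexicon_label` is a hypothetical rule introduced here (positive when the `pos` score outweighs `neg`), and five rows say nothing about the full dataset.

```python
# (textID, mood-derived sentiment, pos, neu, neg) copied from the c_train.head() output.
rows = [
    (0, "N", 0.504, 0.496, 0.0),
    (1, "N", 0.271, 0.503, 0.227),
    (2, "N", 0.0, 0.526, 0.474),
    (3, "P", 0.091, 0.909, 0.0),
    (4, "N", 0.278, 0.185, 0.537),
]

def lexicon_label(pos, neg):
    """Hypothetical rule: call a text positive when pos outweighs neg."""
    return "P" if pos > neg else "N"

# The three VADER proportions should sum to about 1 (up to rounding).
sums_ok = all(abs(pos + neu + neg - 1.0) < 0.002 for _, _, pos, neu, neg in rows)

# Count how often the rule agrees with the mood-derived label.
matches = sum(gold == lexicon_label(pos, neg) for _, gold, pos, _, neg in rows)
print(sums_ok, f"{matches} of {len(rows)} rows agree")
```

On these rows the rule disagrees with the label for textID 0 and 1 ("i didnt feel humiliated" scores as mostly positive), which hints that negation and informal spelling can trip up a lexicon-based scorer.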


## Preparing the text for analysis

In [11]:
# Tokenize text
tr_tokens = [word_tokenize(sent) for sent in df_train.text]
te_tokens = [word_tokenize(sent) for sent in df_test.text]
val_tokens = [word_tokenize(sent) for sent in df_val.text]

# Normalize text by converting every word to lowercase
norm_train = []
for sent in tr_tokens:
  norm_train.append([word.lower() for word in sent])

norm_test = []
for sent in te_tokens:
  norm_test.append([word.lower() for word in sent])

norm_val = []
for sent in val_tokens:
  norm_val.append([word.lower() for word in sent])

# Remove stop words (e.g. 'i', 'me', 'your') and punctuation marks (string.punctuation)
stop_words = stopwords.words('english') + list(punctuation)

nostop_words_train = []
for sent in norm_train:
  nostop_words_train.append([word for word in sent if word not in stop_words])

nostop_words_test = []
for sent in norm_test:
  nostop_words_test.append([word for word in sent if word not in stop_words])

nostop_words_val = []
for sent in norm_val:
  nostop_words_val.append([word for word in sent if word not in stop_words])

# Replace every non-alphabetic character with a space
sp_train = []
for sent in nostop_words_train:
  sp_train.append([(re.sub('[^a-zA-Z]',' ', word)) for word in sent])

sp_test = []
for sent in nostop_words_test:
  sp_test.append([(re.sub('[^a-zA-Z]',' ', word)) for word in sent])

sp_val = []
for sent in nostop_words_val:
  sp_val.append([(re.sub('[^a-zA-Z]',' ', word)) for word in sent])


# Lemmatize text
l = WordNetLemmatizer()

Lemmatized_train = []
for sent in sp_train:
    Lemmatized_train.append([l.lemmatize(word) for word in sent])

Lemmatized_test = []
for sent in sp_test:
    Lemmatized_test.append([l.lemmatize(word) for word in sent])

Lemmatized_val = []
for sent in sp_val:
    Lemmatized_val.append([l.lemmatize(word) for word in sent])
    
    
train_txt = [" ".join(token) for token in Lemmatized_train]

test_txt = [" ".join(token) for token in Lemmatized_test]

val_txt = [" ".join(token) for token in Lemmatized_val]
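
The order of the cleaning steps above can be condensed into a single function. The sketch below is a dependency-free approximation, not the notebook's pipeline: whitespace splitting stands in for `word_tokenize`, `STOP_WORDS` is a deliberately tiny hypothetical list instead of NLTK's English stop words, and lemmatization is left out.

```python
import re
from string import punctuation

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list).
STOP_WORDS = {"i", "am", "my", "because", "a", "the", "so", "to"}
DROP = STOP_WORDS | set(punctuation)

def clean(text):
    """Lowercase, drop stop words and punctuation tokens, keep letters only."""
    tokens = text.lower().split()                        # stand-in for word_tokenize
    tokens = [tok for tok in tokens if tok not in DROP]  # stop words and punctuation
    tokens = [re.sub(r"[^a-z]", " ", tok) for tok in tokens]
    # Re-split and re-join so the spaces introduced by re.sub do not pile up.
    return " ".join(" ".join(tokens).split())

print(clean("i am feeling grouchy"))
print(clean("im updating my blog because i feel shitty"))
```

Wrapping the steps in one function also means the same code can be applied to the train, test and validation texts without repeating each loop three times.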

## The required train, test and validation datasets:

In [12]:
# Store the cleaned version of 'text' in a new 'selected_text' column
df_train['selected_text'] = pd.Series(train_txt)
df_test['selected_text'] = pd.Series(test_txt)
df_val['selected_text'] = pd.Series(val_txt)

# Rearrange the columns in each set according to the wanted dataset
df_train = df_train.reindex(columns=['text','selected_text','sentiment','Mood'])
df_test = df_test.reindex(columns=['text','selected_text','sentiment','Mood'])
df_val = df_val.reindex(columns=['text','selected_text','sentiment','Mood'])

In [13]:
df_train.head()

| textID | text | selected_text | sentiment | Mood |
|:---|:---|:---|:---|:---|
| 0 | i didnt feel humiliated | didnt feel humiliated | N | sadness |
| 1 | i can go from feeling so hopeless to so damned... | go feeling hopeless damned hopeful around some... | N | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | im grabbing minute post feel greedy wrong | N | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | ever feeling nostalgic fireplace know still pr... | P | love |
| 4 | i am feeling grouchy | feeling grouchy | N | anger |


In [14]:
df_test.head()

| textID | text | selected_text | sentiment | Mood |
|:---|:---|:---|:---|:---|
| 0 | im feeling rather rotten so im not very ambiti... | im feeling rather rotten im ambitious right | N | sadness |
| 1 | im updating my blog because i feel shitty | im updating blog feel shitty | N | sadness |
| 2 | i never make her separate from me because i do... | never make separate ever want feel like ashamed | N | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | left bouquet red yellow tulip arm feeling slig... | P | joy |
| 4 | i was feeling a little vain when i did this one | feeling little vain one | N | sadness |


In [15]:
df_val.head()

| textID | text | selected_text | sentiment | Mood |
|:---|:---|:---|:---|:---|
| 0 | im feeling quite sad and sorry for myself but ... | im feeling quite sad sorry ill snap soon | N | sadness |
| 1 | i feel like i am still looking at a blank canv... | feel like still looking blank canvas blank pie... | N | sadness |
| 2 | i feel like a faithful servant | feel like faithful servant | P | love |
| 3 | i am just feeling cranky and blue | feeling cranky blue | N | anger |
| 4 | i can have for a treat or if i am feeling festive | treat feeling festive | P | joy |


## Save the required datasets as csv files

In [16]:
df_train.to_csv('train.csv')
df_test.to_csv('test.csv')
df_val.to_csv('val.csv')

## Concatenate the train and validation sets into one set

In [17]:
# Concatenate the train and validation sets into one set 
training = pd.concat([df_train, df_val], axis=0)

# Reset to the default integer index and discard the old 'textID' index
training.reset_index(drop=True, inplace=True)

training.head(10)

|   | text | selected_text | sentiment | Mood |
|:---|:---|:---|:---|:---|
| 0 | i didnt feel humiliated | didnt feel humiliated | N | sadness |
| 1 | i can go from feeling so hopeless to so damned... | go feeling hopeless damned hopeful around some... | N | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | im grabbing minute post feel greedy wrong | N | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | ever feeling nostalgic fireplace know still pr... | P | love |
| 4 | i am feeling grouchy | feeling grouchy | N | anger |
| 5 | ive been feeling a little burdened lately wasn... | ive feeling little burdened lately wasnt sure | N | sadness |
| 6 | ive been taking or milligrams or times recomme... | ive taking milligram time recommended amount i... | P | surprise |
| 7 | i feel as confused about life as a teenager or... | feel confused life teenager jaded year old man | N | fear |
| 8 | i have been with petronas for years i feel tha... | petronas year feel petronas performed well mad... | P | joy |
| 9 | i feel romantic too | feel romantic | P | love |


In [18]:
# Split data into independent and dependent variable x,y

Xtrain = training['selected_text']

# Convert P to 1 and N to 0
Y_train = training['sentiment'].map({'P':1,'N':0})

# Do the same to test data
Xtest = df_test['selected_text']
Y_test = df_test['sentiment'].map({'P':1,'N':0})

In [19]:
Xtrain.head()

0                                didnt feel humiliated
1    go feeling hopeless damned hopeful around some...
2            im grabbing minute post feel greedy wrong
3    ever feeling nostalgic fireplace know still pr...
4                                      feeling grouchy
Name: selected_text, dtype: object

In [20]:
Y_train.head()

0    0
1    0
2    0
3    1
4    0
Name: sentiment, dtype: int64

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(Xtrain)
X_test = vectorizer.transform(Xtest)

In [22]:
print(X_train.shape)
print(X_test.shape)

(18000, 14293)
(2000, 14293)
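
To make the documents-by-vocabulary shape concrete, here is a small pure-Python sketch of TF-IDF weighting on three of the cleaned texts. It assumes `TfidfVectorizer`'s default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); splitting on whitespace is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Three cleaned texts taken from the selected_text column shown earlier.
docs = ["didnt feel humiliated", "feeling grouchy", "feel romantic"]

tokenized = [doc.split() for doc in docs]  # simplification: whitespace tokens
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    """Term counts times idf, scaled so the row has unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # rows = documents, columns = vocabulary terms
```

The word "feel" occurs in two of the three documents, so it receives a lower idf than words such as "grouchy" that occur in only one. The same logic, applied to 18,000 training texts, yields one column per distinct term, which is where the 14,293 columns come from.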


## Build Model 1:


A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
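
The idea of fitting many trees on sub-samples and averaging them can be illustrated with a toy model. The sketch below is a simplified analogy, not scikit-learn's algorithm: each "tree" is a one-threshold stump on a single made-up feature, and the helper names (`fit_stump`, `fit_forest`, `predict_proba`) are introduced here for illustration only.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold on a 1-D feature that misclassifies the fewest samples."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        # The stump predicts class 1 when x >= threshold.
        errors = sum((x >= threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_proba(stumps, x):
    """Average the individual stump votes for class 1."""
    return sum(x >= threshold for threshold in stumps) / len(stumps)

# Made-up toy data: small values belong to class 0, large values to class 1.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = fit_forest(xs, ys)
print(predict_proba(stumps, 0.05), predict_proba(stumps, 0.95))
```

Individual stumps can land on different thresholds depending on their bootstrap sample, but the averaged vote is stable for points far from the class boundary. A real random forest additionally grows full decision trees and considers random subsets of features at each split.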

In [23]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier()
RF.fit(X_train, Y_train)
Ypred_1 = RF.predict(X_test)

In [24]:
from sklearn.metrics import confusion_matrix

confusion_matrix(Y_test, Ypred_1)

array([[1032,   48],
       [  41,  879]], dtype=int64)

In [25]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

print("Accuracy Score is {}".format(accuracy_score(Y_test, Ypred_1)))
print("F1 Score is {}".format(f1_score(Y_test, Ypred_1)))

Accuracy Score is 0.9555
F1 Score is 0.9518137520303194
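
Both scores can be recomputed by hand from the confusion matrix printed earlier, which makes it clearer what each one measures. The sketch assumes scikit-learn's layout for `confusion_matrix`: rows are true labels, columns are predicted labels, with class 0 (negative) first.

```python
# Confusion matrix from the random forest cell: [[TN, FP], [FN, TP]].
tn, fp = 1032, 48
fn, tp = 41, 879

total = tn + fp + fn + tp

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / total

# Precision and recall for the positive class (label 1).
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, round(precision, 2), round(recall, 2), f1)
```

The results reproduce the reported accuracy of 0.9555 and F1 score of about 0.9518 for the positive class.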


In [26]:
from sklearn.metrics import classification_report

print("Classification report: \n\n {}".format(classification_report(Y_test, Ypred_1)))

Classification report: 

               precision    recall  f1-score   support

           0       0.96      0.96      0.96      1080
           1       0.95      0.96      0.95       920

    accuracy                           0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000



## Build Model 2: 

C-Support Vector Classification.

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.
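
One rough way to picture quadratic scaling is to count sample pairs, since kernel methods compare samples with each other. The arithmetic below is only an intuition, not libsvm's actual cost model; the dataset ten times larger is hypothetical.

```python
def pair_count(n_samples):
    """Number of distinct sample pairs, n * (n - 1) / 2."""
    return n_samples * (n_samples - 1) // 2

current = pair_count(18_000)     # combined train + validation set used here
ten_times = pair_count(180_000)  # hypothetical dataset ten times larger

print(current, round(ten_times / current, 1))
```

Ten times more samples means roughly a hundred times more pairs, which is why the 18,000 training texts used here are still manageable while much larger corpora may not be.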

In [27]:
from sklearn.svm import SVC

svc = SVC(kernel= 'linear')
svc.fit(X_train, Y_train)
Ypred_2 = svc.predict(X_test)

In [28]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

print("Accuracy Score is {}".format(accuracy_score(Y_test, Ypred_2)))
print("F1 Score is {}".format(f1_score(Y_test, Ypred_2)))

Accuracy Score is 0.9665
F1 Score is 0.9636067354698533


In [29]:
from sklearn.metrics import classification_report

print("Classification report: \n\n {}".format(classification_report(Y_test, Ypred_2)))

Classification report: 

               precision    recall  f1-score   support

           0       0.97      0.97      0.97      1080
           1       0.96      0.96      0.96       920

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000



### Comparing the two models on the test set, Model 2 (the linear SVM) scores higher on both accuracy (0.9665 vs. 0.9555) and F1 score (0.9636 vs. 0.9518).

In [30]:
# Show the predicted classes
Ypred_2

array([0, 0, 0, ..., 1, 1, 0], dtype=int64)

In [31]:
# Prepare Submission Dataframe

Submission = pd.DataFrame({'Actual':pd.Series(Y_test),'Predicted':pd.Series(Ypred_2)})
Submission.head(10)

|   | Actual | Predicted |
|:---|---:|---:|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 1 | 1 |
| 8 | 1 | 1 |
| 9 | 0 | 0 |


Label encoding:

- Positive --> 1
- Negative --> 0

In [32]:
# Save Submission data as csv file

Submission.to_csv('Submission.csv')