<a href="https://colab.research.google.com/github/abivilion/Sentiment-Analysis-Web-App/blob/master/Sentiment_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Goal**: Let a machine understand human sentiment by reading text data, and make a better guess by choosing the most probable sentiment.

# **Libraries**

In [None]:
import string
import spacy
import joblib
import numpy as np
import pandas as pd
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

## **Importing Datasets**

**Yelp.txt**

In [None]:



dty = pd.read_csv('yelp.txt', sep='\t', header=None)

dty.head()
# Columns: review text and sentiment label
# 0 -> negative review
# 1 -> positive review


Unnamed: 0,0,1
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


Assign Column Names

In [None]:
col_nm=['Review','Sentiment']
dty.columns = col_nm
dty.head()
# dty.shape

Unnamed: 0,Review,Sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


**Amazon.txt**

In [None]:
dta= pd.read_csv('amazon.txt',sep='\t',header=None)
# Columns: review text and sentiment label
# 0 -> negative review, 1 -> positive review
dta.head()

Unnamed: 0,0,1
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


Assign Column Names

In [None]:
col_nm = ['Review','Sentiment']
dta.columns = col_nm
dta.head()
dta.shape

(1000, 2)

**IMDB.txt**

In [None]:
dtim= pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/Major Project/imdb.txt',sep='\t',header=None)
# Columns: review text and sentiment label
# 0 -> negative review, 1 -> positive review
dtim.head()

Unnamed: 0,0,1
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


Assign Column Names

In [None]:
col_nm = ['Review','Sentiment']
dtim.columns = col_nm
dtim.head()
dtim.shape

(748, 2)

### **Mega DataSet**
Combine the three datasets into one, in the order Yelp, Amazon, IMDB.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
data = pd.concat([dty, dta, dtim], ignore_index=True)
data

Unnamed: 0,Review,Sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
2743,I just got bored watching Jessice Lange take h...,0
2744,"Unfortunately, any virtue in this film's produ...",0
2745,"In a word, it is embarrassing.",0
2746,Exceptionally bad!,0


In [None]:
data.shape

(2748, 2)

*Distribution Of Sentiments Data*

In [None]:
data['Sentiment'].value_counts()

1    1386
0    1362
Name: Sentiment, dtype: int64

*Null Checking*


In [None]:
data.isnull().sum()

Review       0
Sentiment    0
dtype: int64

In [None]:
x = data['Review']
y = data['Sentiment']
print(x.shape)
print(y.shape)

(2748,)
(2748,)


## **Data Preprocessing/Cleaning**

Each token is ***lemmatized***, then stop words and punctuation are **removed**.


In [None]:

punct = string.punctuation
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Stop Words

In [None]:
stopwords= list(STOP_WORDS) #list of stopwords
stopwords  #326 words

['how',
 'several',
 '’re',
 'him',
 'whenever',
 'really',
 'together',
 'two',
 'something',
 'after',
 'whole',
 'once',
 'sometime',
 'full',
 'whoever',
 'put',
 'hereby',
 'yourselves',
 'his',
 'hence',
 'cannot',
 'first',
 'beyond',
 'already',
 'over',
 'can',
 '‘ve',
 'else',
 'within',
 'does',
 'latter',
 'thereby',
 'noone',
 'seems',
 'indeed',
 'at',
 'part',
 'only',
 'without',
 'although',
 'everything',
 'such',
 'seem',
 'a',
 'hereupon',
 'none',
 'along',
 'regarding',
 'among',
 'besides',
 'neither',
 'ca',
 'back',
 'who',
 'becomes',
 'whose',
 'upon',
 'when',
 'another',
 'before',
 'say',
 'go',
 'well',
 '‘ll',
 'all',
 'if',
 'eleven',
 'itself',
 "'re",
 'hundred',
 'hers',
 'themselves',
 'sixty',
 'just',
 'had',
 'herein',
 'forty',
 'am',
 'seeming',
 'she',
 'namely',
 'has',
 'their',
 'fifty',
 'into',
 'alone',
 'yet',
 'somewhere',
 'latterly',
 'did',
 'here',
 'nor',
 'across',
 'myself',
 'whither',
 'move',
 'also',
 'somehow',
 'were',
 'e

***Data Cleaning Method***

In [None]:
nlp= spacy.load('en_core_web_sm')

In [None]:
def text_cleaning(vario):  # accepts one review per call
  doc = nlp(vario)  # run the spaCy model on a single review

  tokens = []  # lemmatized, lowercased tokens

  for token in doc:
    # spaCy v2 returns the placeholder lemma "-PRON-" for pronouns.
    # Every other token is replaced by its lowercased, stripped lemma.
    if token.lemma_ != "-PRON-":
      temp = token.lemma_.lower().strip()
    else:
      # Pronouns have no usable lemma here, so keep the lowercased surface form.
      temp = token.lower_
    tokens.append(temp)

  cleaned_tokens = []
  # Drop stop words and punctuation. `punct` is a string, so `in` is a
  # substring test; it also drops the empty tokens left behind by strip().
  for token in tokens:
    if token not in stopwords and token not in punct:
      cleaned_tokens.append(token)
  return cleaned_tokens
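
The `token not in punct` check in `text_cleaning` compares against the string `string.punctuation`, so it is a substring test rather than a per-token membership test. The short sketch below shows what that implies for a few tokens; the helper `is_punct_token` is a hypothetical alternative, not part of the notebook, and swapping it in could change the reported scores.

```python
import string

punct = string.punctuation

# `in` on a string checks for a substring, not for set membership.
print("!" in punct)    # True: a single punctuation mark
print("" in punct)     # True: the empty string is a substring of every string
print("..." in punct)  # False: an ellipsis token would be kept
print(",-" in punct)   # True: these two characters happen to be adjacent

# Hypothetical alternative: exact, character-wise membership via a set.
punct_set = set(punct)

def is_punct_token(token):
    """Return True if the token is non-empty and made only of punctuation."""
    return bool(token) and all(ch in punct_set for ch in token)

print(is_punct_token("..."))   # True
print(is_punct_token("good"))  # False
```

With this helper, multi-character tokens such as `...` would be filtered, while empty tokens would need a separate check.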

In [None]:

# text_cleaning("usa having Harvard University")
# text_cleaning

## **Vectorization / Feature Engineering (TF-IDF)**

In [None]:

tfidf = TfidfVectorizer(tokenizer=text_cleaning)
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function text_cleaning at 0x7f4ec4e534d0>,
                use_idf=True, vocabulary=None)
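
The settings printed above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) determine how each weight is computed. As a rough illustration, the pure-Python sketch below applies the same formula, `tf * (ln((1 + n) / (1 + df)) + 1)` followed by L2 normalisation, to a toy corpus. The three tokenized reviews are made up for the example and do not come from the datasets.

```python
import math
from collections import Counter

# Toy, already-tokenized reviews (hypothetical data).
docs = [
    ["good", "food"],
    ["bad", "food"],
    ["good", "good", "service"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: the number of reviews in which each term appears.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency, as with smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Return the L2-normalised TF-IDF weights of one tokenized review."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Terms that occur in fewer reviews (here `bad` and `service`) receive a larger IDF than terms shared across reviews (`food`, `good`), which is what lets the classifier focus on distinctive words.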

*Creating a Support Vector Classifier*

In [None]:
classifier = SVC()

## **Training and Testing**

**Splitting Data**

Testing Data: 0.2 (20% of Whole)

Training Data: 0.8 (80% of Whole)

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0 )
# x_train.shape+x_test.shape
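
As a quick sanity check on the split sizes: with 2748 reviews and `test_size=0.2`, the test share is not a whole number. The arithmetic below assumes scikit-learn rounds the test count up and assigns the remainder to training, which agrees with the support of 550 shown in the classification report.

```python
import math

n_samples = 2748  # rows in the combined dataset
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 2198 550
```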

**Fitting the Values/Data**

*Pipeline* - The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

In [None]:
clf =Pipeline([('tfidf',tfidf),('clf',classifier)])
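
To make the chaining idea concrete, here is a minimal pure-Python sketch of what a pipeline does when `fit` and `predict` are called: every step except the last transforms the data, and the last step is the estimator. `Lowercase`, `KeywordClassifier`, and `TinyPipeline` are hypothetical stand-ins written for this illustration; they are not scikit-learn classes and are far simpler than the TF-IDF and SVC steps used here.

```python
class Lowercase:
    """Hypothetical transformer: lowercases every text."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordClassifier:
    """Hypothetical estimator: remembers words seen only in positive reviews."""
    def fit(self, X, y):
        pos = {w for text, label in zip(X, y) if label == 1 for w in text.split()}
        neg = {w for text, label in zip(X, y) if label == 0 for w in text.split()}
        self.positive_words_ = pos - neg
        return self

    def predict(self, X):
        return [int(any(w in self.positive_words_ for w in text.split()))
                for text in X]


class TinyPipeline:
    """Chains transformers and a final estimator behind one fit/predict API."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit and apply each transformer in order, then fit the estimator.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the fitted transformers, then ask the estimator to predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("lower", Lowercase()), ("clf", KeywordClassifier())])
pipe.fit(["Great food", "Bad food"], [1, 0])
predictions = pipe.predict(["GREAT service", "bad service"])
print(predictions)  # [1, 0]
```

Because raw text goes in and labels come out, the same object can be trained, evaluated, and saved as a single unit.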

In [None]:
clf.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function text_cleaning at 0x7f4ec4e534d0>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0

## **Testing and Scoring**

In [None]:

y_pred = clf.predict(x_test)


**Confusion Matrix**

In [None]:
confusion_matrix(y_test,y_pred)

array([[209,  70],
       [ 48, 223]])

**Classification Report**

In [None]:
print(classification_report(y_test,y_pred)) 

              precision    recall  f1-score   support

           0       0.81      0.75      0.78       279
           1       0.76      0.82      0.79       271

    accuracy                           0.79       550
   macro avg       0.79      0.79      0.79       550
weighted avg       0.79      0.79      0.79       550



**Accuracy Score**

In [None]:
print(f'Accuracy Score: {round((accuracy_score(y_test,y_pred)*100),2)}%')
# 78.55%

Accuracy Score: 78.55%
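
The figures in the classification report and the accuracy score can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed earlier, reading rows as true labels and columns as predicted labels.

```python
# Confusion matrix from above: rows are true labels, columns are predictions.
cm = [[209, 70],
      [48, 223]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Class 1 (positive reviews).
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (negative reviews).
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy: {accuracy * 100:.2f}%")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The model misses 48 positive reviews and mislabels 70 negative ones as positive, so its errors lean slightly toward predicting positive.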


**Checking**

In [None]:
examine= "looping is best a way  safe "
print(clf.predict([examine]))

if __name__ == '__main__':
   text_cleaning
   clf

[1]


In [None]:
# clf

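### **Saving The Trained Model File**

`joblib.dump` pickles the whole pipeline. The custom tokenizer `text_cleaning` is stored by reference only, so any code that later loads `Sentai` must define or import `text_cleaning` under the same module name, together with the `nlp`, `stopwords`, and `punct` objects it relies on.
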
In [None]:
joblib.dump(clf,'Sentai')

['Sentai']