<a href="https://colab.research.google.com/github/ajayshgithub/Project-1/blob/main/Project_2_Comparing_different_Classifier_algos_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Dataset information

Number of instances = 25000

Number of classes = 2

1. Positive sentiment(1)

2. Negative sentiment(0)

Load data

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/content/labeledTrainData.tsv.zip', header = 0, delimiter="\t", quoting=3)

In [None]:
#Checking the data
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [None]:
#Number of examples (rows) and columns
df.shape

(25000, 3)

#Clean up the text

In [None]:
import re, string

In [None]:
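def clean_str(text):
  """
  String cleaning before we convert text to numbers.
  Replaces every non-letter character (digits, punctuation, HTML markup) with a
  space, lowercases the text and collapses repeated whitespace.
  Tag names such as "br" survive as words; you can clean the data further if you want.
  """
  try:
    #Intended to drop URLs; as written it only matches lines starting with "http://<>" or "https://<>"
    text = re.sub(r'^https?:\/\/<>.*[\r\n]*', '', text, flags=re.MULTILINE)
    #Keep letters only
    text = re.sub(r"[^A-Za-z]", " ", text)
    #split() already discards empty strings, so no length filter is needed
    words = text.strip().lower().split()
    return " ".join(words)
  except TypeError:
    #Non-string input (e.g. NaN) becomes a single space
    return " "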

In [None]:
df['clean_review']  = df['review'].apply(clean_str)
df.head()

Unnamed: 0,id,sentiment,review,clean_review
0,"""5814_8""",1,"""With all this stuff going down at the moment ...",with all this stuff going down at the moment w...
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...",the classic war of the worlds by timothy hines...
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...",the film starts with a manager nicholas bell g...
3,"""3630_4""",0,"""It must be assumed that those who praised thi...",it must be assumed that those who praised this...
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...",superbly trashy and wondrously unpretentious s...


#Prepare data for model training

In [None]:
#Let's check the first review
df.loc[0, 'clean_review']

'with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay br br visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him br br the actual feature film bit when it finally starts is only on for 
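The `br br` tokens in the cleaned review come from `<br />` tags: the cleaning step replaces the angle brackets and slash with spaces but keeps the letters of the tag name. The sketch below reproduces that on a made-up sentence and compares it with a hypothetical variant, `clean_text_no_html`, that strips tags first. Both function names and the sample string are assumptions for illustration, and switching to the variant would change the accuracy figures reported later.

```python
import re

def clean_text(text):
    # Same core steps as the notebook's cleaning: letters only, lowercase, single spaces
    text = re.sub(r"[^A-Za-z]", " ", text)
    return " ".join(text.lower().split())

def clean_text_no_html(text):
    # Hypothetical variant: remove anything that looks like an HTML tag first
    text = re.sub(r"<[^>]+>", " ", text)
    return clean_text(text)

sample = "Great movie!<br /><br />10/10, I'd watch it again."
with_tags = clean_text(sample)
without_tags = clean_text_no_html(sample)

print(with_tags)     # great movie br br i d watch it again
print(without_tags)  # great movie i d watch it again
```

A simple regular expression like this is only a rough filter; a proper HTML parser would be more robust if the reviews contained more complex markup.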

##Split the data into training and test datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train , X_test, y_train, y_test = train_test_split(df['clean_review'], df['sentiment'],  #we will split both features and labels
                                                     test_size = 0.3,
                                                     random_state=42)

In [None]:
print('Number of training examples: ', X_train.shape[0])
print('Number of test examples: ', X_test.shape[0])

Number of training examples:  17500
Number of test examples:  7500
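The split above is purely random, so the share of positive and negative reviews in each part is not guaranteed to match. A minimal sketch, assuming you want the class proportions preserved, passes the labels to the `stratify` argument of `train_test_split`. The `toy` frame is made-up data standing in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for df: 100 reviews, half positive and half negative
toy = pd.DataFrame({
    "clean_review": [f"review number {i}" for i in range(100)],
    "sentiment": [0, 1] * 50,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["clean_review"], toy["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=toy["sentiment"],  # keep the 50/50 class ratio in both parts
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```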


##Convert text to numbers using TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
#Initialize the TF-IDF vectorizer, keep the vocabulary to the top 10,000 words
#Top words are the most frequent words in the corpus the vectorizer is fitted on
tfidf = TfidfVectorizer(max_features=10000)


In [None]:
#Allow tfidf to build the information on training data
#Make sure you do not show test data to tfidf when building vocabulary and tfidf vector
tfidf.fit(X_train)


In [None]:
#Convert training and test data to numbers using tfidf
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
#Check out shape of training and test input data
#Number of features for each example will be equal to the vocabulary size
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(17500, 10000)
(7500, 10000)
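To make the effect of `max_features` and of fitting on training data only more concrete, here is a small sketch on a made-up corpus. The documents and the limit of 3 are assumptions chosen so the result is easy to check by hand: the three most frequent training words form the vocabulary, and words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "good" occurs 3 times, "film" and "bad" twice, the rest once
train_docs = ["good film good cast", "bad film bad", "good story"]
test_docs = ["good unseen word"]

vec = TfidfVectorizer(max_features=3)
vec.fit(train_docs)                  # vocabulary is learned from training text only

vocab = sorted(vec.vocabulary_)      # the three most frequent training words
X_toy_test = vec.transform(test_docs)

print(vocab)
print(X_toy_test.shape)              # one row, one column per vocabulary word
print(X_toy_test.nnz)                # only "good" is in the vocabulary
```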


##Building different Models to compare Accuracy

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

###Initializing each Model

SVC

In [None]:
#Initialize the Model (SVC)
model_svc = SVC(C=1, kernel='linear')

#Train the model
model_svc.fit(X_train_tfidf, y_train)

Logistic Regression

In [None]:
#Initialize the Model (LR)
model_lr = LogisticRegression()

#Train the model
model_lr.fit(X_train_tfidf, y_train)

Decision Tree Classifier

In [None]:
#Initialize the model(Dtree)
model_dtree = DecisionTreeClassifier(min_samples_leaf=5,
                               min_samples_split=15)

#Train the model
model_dtree.fit(X_train_tfidf, y_train)

Gradient Boosting Classifier

In [None]:
#Initialize the model(GBC)
model_gbc = GradientBoostingClassifier(n_estimators=20)

#Train the model
model_gbc.fit(X_train_tfidf, y_train)

Random Forest Classifier

In [None]:
#Initialize the model(RFC)
model_rfc = RandomForestClassifier(n_estimators=50)

#Train the model
model_rfc.fit(X_train_tfidf, y_train)

##Accuracy of each model

In [None]:
#Training accuracy
score1 = model_svc.score(X_train_tfidf, y_train)
score2 = model_lr.score(X_train_tfidf, y_train)
score3 = model_dtree.score(X_train_tfidf, y_train)
score4 = model_gbc.score(X_train_tfidf, y_train)
score5 = model_rfc.score(X_train_tfidf, y_train)

In [None]:
print(score1,score2,score3,score4,score5)

0.9485714285714286 0.9240571428571429 0.904 0.7458285714285714 1.0


**Order of Training Accuracy is:**

RandomForest > SVC > LogisticRegression > DecisionTree > GradientBoosting

In [None]:
#Test Accuracy
T_score1 = model_svc.score(X_test_tfidf, y_test)
T_score2 = model_lr.score(X_test_tfidf, y_test)
T_score3 = model_dtree.score(X_test_tfidf, y_test)
T_score4 = model_gbc.score(X_test_tfidf, y_test)
T_score5 = model_rfc.score(X_test_tfidf, y_test)

In [None]:
print(T_score1, T_score2, T_score3, T_score4 , T_score5)

0.8926666666666667 0.8892 0.7030666666666666 0.7328 0.828


**Order of Test Accuracy is:**

SVC > LogisticRegression > RandomForest > GradientBoosting > DecisionTree
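The two rankings are easier to compare when training and test accuracy sit side by side, since the gap between them hints at overfitting. A minimal sketch, assuming toy data from `make_classification` in place of the TF-IDF matrices and only two of the five models; the `models` dict and the `scores` frame are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy numeric data standing in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    # A large positive gap means the model fits the training set much better than unseen data
    rows.append({"model": name, "train_acc": train_acc,
                 "test_acc": test_acc, "gap": train_acc - test_acc})

scores = pd.DataFrame(rows).sort_values("test_acc", ascending=False)
print(scores)
```

Keeping the models in a dict also avoids numbered variables such as `score1` to `score5`, so adding another classifier only takes one more entry.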

##Prediction on Test Data

In [None]:
y_pred1 = model_svc.predict(X_test_tfidf)
y_pred2 = model_lr.predict(X_test_tfidf)
y_pred3 = model_dtree.predict(X_test_tfidf)
y_pred4 = model_gbc.predict(X_test_tfidf)
y_pred5 = model_rfc.predict(X_test_tfidf)

df1 = pd.DataFrame({'Actual':y_test,
                   'SVC':y_pred1,
                   'Lr':y_pred2,
                   'DTree':y_pred3,
                   'GBC':y_pred4,
                   'RFC':y_pred5})

In [None]:
print(df1)

       Actual  SVC  Lr  DTree  GBC  RFC
6868        0    0   0      0    0    0
24016       1    1   1      1    1    1
9668        0    0   0      0    0    0
13640       1    1   1      0    0    1
14018       0    0   0      0    0    0
...       ...  ...  ..    ...  ...  ...
21156       1    1   1      1    1    1
24654       0    0   1      0    1    1
14592       0    0   0      1    1    0
20160       0    0   0      0    1    0
4731        0    0   0      1    1    0

[7500 rows x 6 columns]
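Accuracy alone does not show which kind of mistake a model makes. As a hedged illustration, the sketch below builds a confusion matrix from the ten rows visible in the printout above (the `Actual` and `DTree` columns only), not from the full 7,500-row test set, so the numbers describe this small excerpt and nothing more.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The ten visible rows of the printed frame: Actual and DTree columns
actual     = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
dtree_pred = [0, 1, 0, 0, 0, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(actual, dtree_pred, labels=[0, 1])
acc = accuracy_score(actual, dtree_pred)

print(cm)   # [[5 2]
            #  [1 2]]
print(acc)  # 0.7
```

On the real data the same two calls would take `y_test` and one of the `y_pred` arrays.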


### On **training data**, **RandomForestClassifier** attains the highest accuracy of all the models (1.0), but its test accuracy drops to 0.828, which indicates overfitting.

### On **test data**, the **Support Vector Classifier** attains the highest accuracy (0.893), closely followed by Logistic Regression (0.889).

### Overall, **SVC** performs best on unseen data.

