# Description 
The objective of this project is to detect racist and sexist comments in Amazon reviews: given a review, we classify whether or not it contains hate speech.
We build a model on a training sample, use that model to predict labels for a test dataset, and measure the model's accuracy on that dataset.

# Introduction
Throughout, we use the Amazon review text file (`review.txt`), which contains 10,000 examples, to build the system. The process has the following components:
#### - Dataset Preparation
#### - Feature Extraction
#### - Model Training

In [1]:
##libraries
import pandas as pd
import numpy as np
from sklearn import model_selection,preprocessing,svm,metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

## Dataset Preparation

In [2]:
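# Convert the raw text file into a labelled DataFrame.
# Each line has the form "__label__N <review text>".
# Read the lines directly: recent pandas versions reject sep='\n' in read_csv.
with open('review.txt', encoding='utf-8') as f:
    df = pd.DataFrame([line.rstrip('\n') for line in f if line.strip()])
print(df)
# df[0] = df[0].str.lower()
# Split once on whitespace: the first token is the label, the rest is the text.
df[['label', 'Text']] = df[0].str.split(n=1, expand=True)
df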

                                                      0
0     __label__2 Stuning even for the non-gamer: Thi...
1     __label__2 The best soundtrack ever to anythin...
2     __label__2 Amazing!: This soundtrack is my fav...
3     __label__2 Excellent Soundtrack: I truly like ...
4     __label__2 Remember, Pull Your Jaw Off The Flo...
...                                                 ...
9995  __label__2 A revelation of life in small town ...
9996  __label__2 Great biography of a very interesti...
9997  __label__1 Interesting Subject; Poor Presentat...
9998  __label__1 Don't buy: The box looked used and ...
9999  __label__2 Beautiful Pen and Fast Delivery.: T...

[10000 rows x 1 columns]


|      | 0 | label | Text |
|------|---|-------|------|
| 0    | __label__2 Stuning even for the non-gamer: Thi... | __label__2 | Stuning even for the non-gamer: This sound tra... |
| 1    | __label__2 The best soundtrack ever to anythin... | __label__2 | The best soundtrack ever to anything.: I'm rea... |
| 2    | __label__2 Amazing!: This soundtrack is my fav... | __label__2 | Amazing!: This soundtrack is my favorite music... |
| 3    | __label__2 Excellent Soundtrack: I truly like ... | __label__2 | Excellent Soundtrack: I truly like this soundt... |
| 4    | __label__2 Remember, Pull Your Jaw Off The Flo... | __label__2 | Remember, Pull Your Jaw Off The Floor After He... |
| ...  | ... | ... | ... |
| 9995 | __label__2 A revelation of life in small town ... | __label__2 | A revelation of life in small town America in ... |
| 9996 | __label__2 Great biography of a very interesti... | __label__2 | Great biography of a very interesting journali... |
| 9997 | __label__1 Interesting Subject; Poor Presentat... | __label__1 | Interesting Subject; Poor Presentation: You'd ... |
| 9998 | __label__1 Don't buy: The box looked used and ... | __label__1 | Don't buy: The box looked used and it is obvio... |
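The parsing step above can also be sketched without pandas. The snippet below is a minimal illustration, assuming every line starts with a `__label__N` token followed by a space; the helper name `parse_review_line` and the two sample lines are hypothetical and are not taken from `review.txt`.

```python
def parse_review_line(line):
    """Split '__label__N review text' into a (label, text) pair."""
    # partition() splits on the first space only, so the review text is kept intact.
    label, _, text = line.strip().partition(" ")
    return label, text


# Hypothetical sample lines in the same format as the dataset.
sample_lines = [
    "__label__2 Great product: works as described",
    "__label__1 Don't buy: arrived broken",
]

parsed = [parse_review_line(line) for line in sample_lines]
for label, text in parsed:
    print(label, "|", text)
```

Splitting only once keeps colons, punctuation, and spacing inside the review untouched, which matters if the text is later inspected by hand.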


## Feature Extraction using TF-IDF Vectorization

In [3]:
# Split the data into training and test sets (default: 75% / 25%).
x_train, x_test, y_train, y_test = model_selection.train_test_split(df['Text'], df['label'])
# Encode the string labels as integers. Fit the encoder on the training labels only,
# then reuse the same mapping for the test labels.
encoder = preprocessing.LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)
# Build the TF-IDF vocabulary from the training text only.
# stop_words='english' drops common words (he, are, is, ...) that carry little signal.
# min_df=10 drops terms that appear in fewer than 10 training documents.
# max_df=0.99 drops terms that appear in more than 99% of the training documents.
# ngram_range=(1, 2) keeps both single words and two-word phrases.
tfidf = TfidfVectorizer(min_df=10, max_df=0.99, ngram_range=(1, 2), stop_words='english').fit(x_train)
x_train = tfidf.transform(x_train)
x_test = tfidf.transform(x_test)
# Feature extraction is complete.
# tfidf.get_feature_names_out()
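Why the label encoder should be fitted once and then reused can be seen on a tiny example. This is a sketch with made-up label lists (not the real split): if the test labels happen to contain only one class, refitting the encoder silently assigns a different integer to the same label.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the test split happens to contain a single class.
train_labels = ["__label__2", "__label__1", "__label__2"]
test_labels = ["__label__2", "__label__2"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)

# Reusing the fitted encoder keeps the training mapping.
test_reused = encoder.transform(test_labels)

# Refitting on the test labels builds a new, incompatible mapping.
test_refit = LabelEncoder().fit_transform(test_labels)

print(list(encoder.classes_))  # ['__label__1', '__label__2']
print(list(test_reused))       # [1, 1]
print(list(test_refit))        # [0, 0]
```

With both classes present in each split the two approaches usually agree, so the problem is easy to miss; calling `transform` on the test labels removes the risk.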

In [4]:
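# View the TF-IDF matrix as a dense DataFrame, one column per term.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_df = pd.DataFrame(x_train.toarray(), columns=tfidf.get_feature_names_out())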
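The effect of `min_df` and `max_df` is easier to see on a toy corpus. The four documents below are invented for illustration, and the thresholds are scaled down (`min_df=2` instead of 10) so that the filtering is visible on such a small sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus.
docs = [
    "great sound great price",
    "great battery",
    "great screen poor battery",
    "great price poor sound",
]

# min_df=2: keep only terms found in at least 2 documents ("screen" is dropped).
# max_df=0.99: drop terms found in more than 99% of documents ("great" is dropped).
vec = TfidfVectorizer(min_df=2, max_df=0.99)
matrix = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)         # ['battery', 'poor', 'price', 'sound']
print(matrix.shape)  # (4, 4)
```

A term that appears in every document, like "great" here, cannot help separate the classes, while a term that appears only once gives the model too little evidence to learn from.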

## Different ML Models and Model Selection
With the features and labels extracted, we now fit several ML models on the training dataset and calculate the accuracy of each one on the test dataset:
- Naive Bayes
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest Model
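One possible way to compare several of these models with the same code path is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and score each candidate with cross-validation. The sketch below is not the notebook's method: it uses a small invented set of reviews and only two of the four models, purely to show the pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical reviews: 1 = positive, 0 = negative.
texts = [
    "great sound and great price", "poor battery and poor screen",
    "excellent product works great", "terrible quality do not buy",
    "love it great value", "broke after one day terrible",
    "great battery excellent screen", "poor sound terrible price",
    "excellent value love the sound", "do not buy poor quality",
    "works great love the price", "terrible battery broke quickly",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
}

scores = {}
for name, clf in candidates.items():
    # The vectorizer is refitted inside each fold, so no fold sees
    # vocabulary statistics from its own validation data.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping the vectorizer inside the pipeline means the comparison is made under identical preprocessing for every model.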

In [5]:
#Naive Bayes
gridval={'alpha':[0.001,.01,.1,1,10]}
model=MultinomialNB()
grid=GridSearchCV(model,param_grid=gridval,scoring='accuracy')
grid.fit(x_train,y_train)
pred= grid.predict(x_test)
accuracy_mnb=accuracy_score(y_test,pred)*100

In [6]:
#Logistic Regression
gridval={'C':[0.0001,0.001,.01,.1,1,10]}
model=LogisticRegression()
grid=GridSearchCV(model,param_grid=gridval,scoring='accuracy')
grid.fit(x_train,y_train)
pred= grid.predict(x_test)
accuracy_lrg=accuracy_score(y_test,pred)*100
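After a grid search has been fitted, it can be useful to check which hyperparameter value was selected and what its cross-validated score was. The sketch below assumes synthetic data from `make_classification` as a stand-in for the TF-IDF features; `max_iter=1000` is an added assumption to avoid convergence warnings and is not part of the original cell.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the review dataset).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

# best_params_ holds the winning setting; best_score_ is its mean CV accuracy.
print("Best C:", grid.best_params_["C"])
print("Mean CV accuracy:", round(grid.best_score_, 3))

# predict() on the grid object delegates to the refitted best estimator.
pred = grid.predict(X[:5])
```

If the selected value sits at the edge of the grid, that is usually a hint to widen the search range in that direction.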

In [7]:
#Support Vector Machine (SVM)
#gridval={'gamma':[0.0001,0.001,.01,.1,1,10],'C':[0.0001,0.001,.01,.1,1,10]}
model=SVC(kernel='rbf')
#grid=GridSearchCV(model,param_grid=gridval,scoring='accuracy')
#grid.fit(x_train,y_train)
#pred= grid.predict(x_test)
model.fit(x_train,y_train)
pred=model.predict(x_test)
accuracy_svm=accuracy_score(y_test,pred)*100


In [8]:
# Random Forest Model(RFM)
model=RandomForestClassifier()
#gridval={'n_estimators':[10,100,200,300,400,500]}
#grid=GridSearchCV(model,param_grid=gridval,scoring='accuracy')
#grid.fit(x_train,y_train)
model.fit(x_train,y_train)
#pred= grid.predict(x_test)
pred=model.predict(x_test)
accuracy_rfm=accuracy_score(y_test,pred)*100

In [9]:
# Result
# Accuracy (%) of each model on the test set
print('Accuracies of Different Models\nNaive Bayes: {}\nLogistic Regression: {}\nSupport Vector Machine (SVM): {}\nRandom Forest Model: {}\n '.format(accuracy_mnb,accuracy_lrg,accuracy_svm,accuracy_rfm))

Accuracies of Different Models
Naive Bayes: 85.28
Logistic Regression: 85.92
Support Vector Machine (SVM): 85.88
Random Forest Model: 82.96
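Accuracy alone can hide how a model behaves on each class, so a confusion matrix may be a helpful complement to the numbers above. The example below uses invented labels and predictions (not the notebook's results) in which the model predicts the majority class every time.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth: eight reviews of class 1, two of class 0.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# Hypothetical predictions: the model always answers class 1.
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

print("Accuracy:", acc)  # 0.8
print(cm)
# [[0 2]
#  [0 8]]
```

Here the accuracy is 80% even though no class-0 example is ever identified, which only the first row of the matrix reveals.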
 
