<a href="https://colab.research.google.com/github/berkecengiz/ML-Examples/blob/main/ReviewClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Review Classification task**

> Some explanation here...


In [1]:
import numpy as np  
import pandas as pd
import os
import re
import nltk
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_confusion_matrix
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score,roc_auc_score, confusion_matrix
%matplotlib inline

#### ***Loading*** the Dataset

In [12]:
# Data: three tab-separated files, each line holding a review and its 0/1 label.
# Google Drive must already be mounted at /content/drive (see the drive.mount cell below).
data_dir = "/content/drive/MyDrive/archive/sentiment labelled sentences/sentiment labelled sentences"

amazon = pd.read_csv(os.path.join(data_dir, "amazon_cells_labelled.txt"), delimiter='\t', header=None, names=['review', 'sentiment'])
amazon['source'] = 'amazon'

yelp = pd.read_csv(os.path.join(data_dir, "yelp_labelled.txt"), delimiter='\t', header=None, names=['review', 'sentiment'])
yelp['source'] = 'yelp'

imdb = pd.read_csv(os.path.join(data_dir, "imdb_labelled.txt"), delimiter='\t', header=None, names=['review', 'sentiment'])
imdb['source'] = 'imdb'

# Stack the three sources into one DataFrame and store the labels as strings
data = pd.concat([amazon, yelp, imdb])
data['sentiment'] = data['sentiment'].astype(str)
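
To show what the loading step does without needing the Drive files, here is a minimal sketch that parses two made-up lines in the same tab-separated layout. The sample text and the `'sample'` source tag are assumptions for illustration only and are not taken from the real dataset.

```python
import io

import pandas as pd

# Two invented lines in the "review<TAB>label" layout (not from the real files)
sample = "Good case, excellent value.\t1\nThe battery died after a week.\t0\n"

df = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None,
                 names=['review', 'sentiment'])
df['source'] = 'sample'                        # tag the rows with their origin
df['sentiment'] = df['sentiment'].astype(str)  # labels become the strings '1' and '0'

print(df.shape)                  # (2, 3)
print(df['sentiment'].tolist())  # ['1', '0']
```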

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Variables** are split into two groups here: the review text (independent variable) and the sentiment label (dependent variable).


In [14]:
# Independent variable: the review text (first column)
review = data.iloc[:, 0].values

# Dependent variable: the sentiment label (second column)
senti = data.iloc[:, 1].values

#### **Text** Pre-processing


In [15]:
processed_reviews = []

for raw_review in review:
    processed_rev = re.sub(r'[^\w\s]', ' ', str(raw_review))  # replace special characters with a space
    processed_rev = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_rev)  # remove isolated single letters
    processed_rev = re.sub(r'\d+', ' ', processed_rev)  # remove numbers
    processed_rev = re.sub(r'\s+', ' ', processed_rev)  # collapse repeated whitespace (done last so no gaps remain)
    processed_rev = processed_rev.lower()
    processed_reviews.append(processed_rev)
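
To see the effect of these substitutions on one sentence, here is a small sketch that wraps them in a helper. The function name `clean_review` and the sample sentence are hypothetical and not part of the notebook. The sketch also collapses whitespace last and strips the result, which is an assumption about the intended behaviour.

```python
import re

def clean_review(text):
    """Apply the regex clean-up steps to a single review (illustrative helper)."""
    text = re.sub(r'[^\w\s]', ' ', str(text))    # punctuation -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'\d+', ' ', text)             # drop digits
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    return text.lower().strip()

sample = "Good case, Excellent value!!  I paid 20 dollars."
print(clean_review(sample))  # good case excellent value paid dollars
```

Note that the single-letter rule also removes the pronoun "I", which is why it is missing from the cleaned text.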

In [16]:
data_clean = pd.DataFrame(processed_reviews)
data_clean.columns = ['reviews']
data_clean['senti_score'] = senti
data_clean.head()

| | reviews | senti_score |
|---|---|---|
| 0 | so there is no way for me to plug it in here i... | 0 |
| 1 | good case excellent value | 1 |
| 2 | great for the jawbone | 1 |
| 3 | tied to charger for conversations lasting more... | 0 |
| 4 | the mic is great | 1 |


#### **Train-Test Split:**

Splitting data into train and test datasets




In [17]:
X_train, X_test, y_train, y_test = train_test_split(data_clean['reviews'], data_clean['senti_score'], 
                                                    test_size=0.2, random_state=0)
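
As a quick illustration of what `test_size=0.2` and `random_state=0` do, the sketch below splits ten made-up items. The stand-in lists `reviews` and `labels` are assumptions for illustration. The real notebook splits `data_clean` instead.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 short "reviews" with alternating string labels
reviews = [f"review {i}" for i in range(10)]
labels = ["1" if i % 2 == 0 else "0" for i in range(10)]

X_tr, X_te, y_tr, y_te = train_test_split(reviews, labels,
                                          test_size=0.2, random_state=0)
print(len(X_tr), len(X_te))  # 8 2

# Re-running with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(reviews, labels,
                                      test_size=0.2, random_state=0)
print(X_te == X_te2)  # True
```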

### **Extracting features from text data (using Tfidf Vectorizer) and Model Building**



Here, I trained several classification models on the training set and evaluated each one with 3-fold cross-validation, which gives a less optimistic estimate than scoring on the data the model was fitted on. The models are then compared on their mean **ROC AUC**, **recall**, **precision** and **accuracy** scores.
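
To make the feature-extraction step concrete, here is a minimal sketch of `TfidfVectorizer(stop_words="english")` applied to two of the cleaned reviews shown earlier. It only inspects the learned vocabulary and is not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the mic is great", "good case excellent value"]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(docs)  # sparse matrix: one row per review

print(features.shape[0])               # 2
print(sorted(vectorizer.vocabulary_))  # the terms kept after stop-word removal
```

Common English words such as "the" and "is" are dropped, so each review is represented only by its more informative terms.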

In [18]:
# Each pipeline turns the raw text into TF-IDF features and feeds them to a tree-based classifier
DTC = Pipeline([
        ("tfidf_vectorizer", TfidfVectorizer(stop_words="english")),
        ("dtc", DecisionTreeClassifier(random_state=0))
    ])

GBC = Pipeline([
        ("tfidf_vectorizer", TfidfVectorizer(stop_words="english")),
        ("gbc", GradientBoostingClassifier(random_state=0))
    ])

RFC = Pipeline([
        ("tfidf_vectorizer", TfidfVectorizer(stop_words="english")),
        ("rfc", RandomForestClassifier(random_state=0))
    ])

all_models = [
    ("DecisionTree", DTC),
    ("GradientBoosting", GBC),
    ("RandomForest", RFC),
    ]

# Output column name -> scikit-learn scoring string
metrics = {
    'roc_auc Score': 'roc_auc',
    'Recall Score': 'recall_macro',
    'Precision Score': 'precision_macro',
    'Accuracy Score': 'accuracy',
}

# Mean 3-fold cross-validation score for every model/metric pair
scores = pd.DataFrame({'ML Model': [name for name, _ in all_models]})
for column, scoring in metrics.items():
    scores[column] = [cross_val_score(model, X_train, y_train, cv=3, scoring=scoring).mean()
                      for _, model in all_models]

scores.head()

| | ML Model | roc_auc Score | Recall Score | Precision Score | Accuracy Score |
|---|---|---|---|---|---|
| 0 | DecisionTree | 0.745 | 0.715558 | 0.720009 | 0.715654 |
| 1 | GradientBoosting | 0.80902 | 0.742165 | 0.770195 | 0.74113 |
| 2 | RandomForest | 0.829389 | 0.74627 | 0.750941 | 0.746134 |


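The results show that the Random Forest classifier performed best overall, with the highest ROC AUC (about 0.83) and a cross-validated accuracy of about 75%. Gradient Boosting had slightly higher precision (about 0.77).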
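
A natural next step would be to check the chosen model on the held-out test split. The sketch below shows one way this could look, using a tiny made-up corpus as a stand-in for `X_train`, `X_test`, `y_train` and `y_test`. The toy sentences and variable names are assumptions for illustration, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

# Made-up stand-in data (not from the real dataset)
toy_train = [
    "great phone works perfectly", "excellent sound quality",
    "love this case great value", "good battery and great screen",
    "terrible product waste of money", "poor quality broke quickly",
    "bad service and awful food", "worst purchase do not buy",
]
toy_train_labels = ["1", "1", "1", "1", "0", "0", "0", "0"]
toy_test = ["great sound quality", "terrible waste of money"]
toy_test_labels = ["1", "0"]

# Same pipeline structure as RFC above
rfc = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer(stop_words="english")),
    ("rfc", RandomForestClassifier(random_state=0)),
])
rfc.fit(toy_train, toy_train_labels)

pred = rfc.predict(toy_test)
acc = accuracy_score(toy_test_labels, pred)
cm = confusion_matrix(toy_test_labels, pred, labels=["0", "1"])  # rows: true, columns: predicted

print(acc)
print(cm)
```

In the notebook itself, the same calls would be made with `RFC.fit(X_train, y_train)` and `RFC.predict(X_test)`.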