
**About Dataset**<br>

This dataset captures the cafe experience through genuine customer reviews. Each entry contains an index, cafe name, overall rating, cuisine, average cost for two, city, and the review text.

**Columns:**<br>

* Index: A unique identifier for each review entry.
* Name: The name of the cafe being reviewed.
* Overall Rating: The overall rating provided by the reviewer.
* Cuisine: The type of cuisine offered by the cafe.
* Rate for Two: The average cost for two people dining at the cafe.
* City: The city where the cafe is located.
* Review: A detailed review written by the customer, capturing their experience.


**Importing Necessary Libraries**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

## Data Cleaning

In [None]:
# Loading the dataset into a DataFrame
df = pd.read_csv('/kaggle/input/zomato-cafe-reviews/reviews.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# Total unique ratings
df['Overall_Rating'].nunique()

In [None]:
df['Overall_Rating'].value_counts()

In [None]:
# Removing '-'
df = df[df['Overall_Rating']!='-']

In [None]:
# Removing 'New'
df = df[df['Overall_Rating']!='New']

In [None]:
df.shape
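
The two filters above drop the placeholder ratings `'-'` and `'New'`. A minimal sketch of an alternative, assuming every non-numeric rating should be discarded: coerce the column with `pd.to_numeric` and drop the rows that become `NaN`. The `toy` frame below is hypothetical and only stands in for the real file.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews file
toy = pd.DataFrame({
    "Overall_Rating": ["4.1", "-", "New", "3.2", "4.5"],
    "Review": ["great coffee", "n/a", "just opened", "slow service", "lovely place"],
})

# Non-numeric entries such as '-' and 'New' become NaN ...
toy["Overall_Rating"] = pd.to_numeric(toy["Overall_Rating"], errors="coerce")
# ... and are then dropped in a single step
toy_clean = toy.dropna(subset=["Overall_Rating"]).reset_index(drop=True)

print(toy_clean)
```

This also catches any other non-numeric placeholder that `value_counts()` might have missed, at the cost of silently dropping it.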

## Exploratory Data Analysis

Plotting overall ratings:

In [None]:
fig = px.histogram(df, x="Overall_Rating")
fig.update_layout(title_text='Overall_Rating')
fig.show()

In [None]:
# Only wordcloud's built-in stopword list is used below
from wordcloud import WordCloud, STOPWORDS

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["br", "href"])
textt = " ".join(review for review in df.Review)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud.png')
plt.show()

In [None]:
# Changing Overall_Rating dtype to numeric
df['Overall_Rating'] = pd.to_numeric(df['Overall_Rating'], errors='coerce')

**Positive: rating of 3.5 and above<br>
Negative: everything else**

In [None]:
df['sentiment'] = df['Overall_Rating'].apply(lambda x: 'positive' if x >= 3.5 else 'negative')

In [None]:
df.head()
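
Since the 3.5 threshold decides the labels, it may be worth checking how balanced the two classes are, because accuracy can look better than it is when one class dominates. A small sketch on hypothetical ratings (the `toy` frame is not part of the dataset). Note that a missing rating compares as `False` against 3.5, so it ends up labelled negative.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including one missing value
toy = pd.DataFrame({"Overall_Rating": [4.2, 3.5, 3.4, 2.9, 4.8, np.nan]})

# Same rule as above: 3.5 and above is positive, everything else negative
toy["sentiment"] = np.where(toy["Overall_Rating"] >= 3.5, "positive", "negative")

# Share of each class in the data
balance = toy["sentiment"].value_counts(normalize=True)
print(balance)
```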

In [None]:
# WordCloud for Negative Sentiments

textt = " ".join(review for review in df[df['sentiment']=='negative'].Review)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_negative.png')  # separate file so earlier plots are not overwritten
plt.show()

In [None]:
# WordCloud for positive sentiments

textt = " ".join(review for review in df[df['sentiment']=='positive'].Review)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_positive.png')  # separate file so earlier plots are not overwritten
plt.show()

**Basic Cleaning of Reviews**

In [None]:
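def remove_punctuation(text):
    # Character-level filter: each character is compared against the tuple,
    # so only single-character entries can ever match.
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final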

df['Review'] = df['Review'].apply(remove_punctuation)
df['Cuisine'] = df['Cuisine'].apply(remove_punctuation)
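
A character-by-character filter like the one above can only drop single characters, so it cannot remove a whole word such as "food". A minimal sketch of one way to do both, assuming the intent is to strip the listed punctuation and the standalone word "food" in any casing. The helper name `clean_review` is hypothetical.

```python
import re

# Characters stripped by the punctuation filter
PUNCT = '?.;:!"'
_punct_table = str.maketrans("", "", PUNCT)
# Whole-word match for "food", case-insensitive (an assumption about the intent)
_food_re = re.compile(r"\bfood\b", flags=re.IGNORECASE)

def clean_review(text):
    """Remove selected punctuation and the standalone word 'food'."""
    text = text.translate(_punct_table)
    text = _food_re.sub("", text)
    # Collapse the extra whitespace left behind by removed words
    return re.sub(r"\s+", " ", text).strip()

print(clean_review('The Food was great! "Foodie" heaven: loved it.'))
# -> The was great Foodie heaven loved it
```

The `\b` word boundaries keep words like "Foodie" intact; dropping them would also cut "food" out of longer words.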

In [None]:
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
newData = df[['Review','sentiment']].copy()
newData.head()

## Splitting the Dataset

In [None]:
# train-test split: roughly 70% train and 30% test, with no overlap
index = newData.index
newData['random_number'] = np.random.rand(len(index))  # uniform in [0, 1)
train = newData[newData['random_number'] <= 0.70]
test = newData[newData['random_number'] > 0.70]
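
An alternative to thresholding random numbers by hand is scikit-learn's `train_test_split`, which guarantees disjoint sets and can keep the positive/negative ratio the same in both via `stratify`. A sketch on hypothetical toy data; `random_state=42` is an arbitrary choice made only for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same two columns as newData
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 6 + ["negative"] * 4,
})

train_toy, test_toy = train_test_split(
    toy,
    test_size=0.30,             # about 30% of rows go to the test set
    random_state=42,            # fixed seed so the split is repeatable
    stratify=toy["sentiment"],  # keep the class ratio similar in both sets
)

print(len(train_toy), len(test_toy))
```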

In [None]:
# count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

train_matrix = vectorizer.fit_transform(train['Review'])
test_matrix = vectorizer.transform(test['Review'])
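
To see what the vectorizer produces, here is a tiny sketch on made-up sentences (not from the dataset). `fit_transform` learns the vocabulary from the training text, while `transform` reuses it, so words that only appear in the test text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences
train_docs = ["good coffee", "bad coffee bad service"]
test_docs = ["good service and a view"]

vec = CountVectorizer(token_pattern=r"\b\w+\b")
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = vec.transform(test_docs)        # reuses it; unseen words are dropped

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(vec.vocabulary_)
print(vocab)                   # ['bad', 'coffee', 'good', 'service']
print(train_counts.toarray())  # [[0 1 1 0], [2 1 0 1]]
print(test_counts.toarray())   # [[0 0 1 1]]
```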

In [None]:
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']

## <u>Classification Models</u>

### Logistic Regression:

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

reg = LogisticRegression()

#fitting the model
reg.fit(x_train,y_train)

#predicting
reg_pred = reg.predict(x_test)

# built-in round() works whether accuracy_score returns a Python or NumPy float
print("accuracy is: ", round(accuracy_score(y_test, reg_pred), 2))
pd.crosstab(y_test,reg_pred)

In [None]:
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, reg_pred, target_names=['negative', 'positive']))

### Naive Bayes Classifier:

In [None]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_train, y_train)
mnb_pred = mnb.predict(x_test)
print('\nClassification Report\n')
print(classification_report(y_test, mnb_pred, target_names=['negative', 'positive']))

### Random Forest Classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(x_train, y_train)

rf_pred = rf.predict(x_test)
pd.crosstab(y_test,rf_pred)

In [None]:
print('\nClassification Report\n')
print(classification_report(y_test, rf_pred, target_names=['negative', 'positive']))

<u>**Summary Table**</u>:

| Model          | Accuracy |
| -------------  | ---------|
| Random Forest  | 0.85     |
| Naive Bayes    | 0.83     |
| Logistic Reg.  | 0.84     |

**Conclusion: Random Forest gives the best accuracy (0.85), narrowly ahead of Logistic Regression (0.84) and Naive Bayes (0.83).**
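
One way to collect accuracies for a table like the one above is to loop over the models instead of copying numbers by hand. This is only a sketch on made-up sentences, so its scores say nothing about the real dataset. The variable names and `random_state=0` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Made-up toy data, only to make the loop runnable
train_text = ["great coffee", "lovely place", "great service",
              "bad coffee", "rude staff", "bad service"]
train_y = ["positive"] * 3 + ["negative"] * 3
test_text = ["great place", "bad staff"]
test_y = ["positive", "negative"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_text)
X_test = vec.transform(test_text)

models = {
    "Logistic Reg.": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same features and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(X_train, train_y)
    results[name] = round(accuracy_score(test_y, model.predict(X_test)), 2)

# Print the models from best to worst
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc}")
```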