# **Project Name**    -    Zomato Sentiment Analysis with Clustering

##### **Project Type**    - Machine Learning Project (NLP-based Sentiment Analysis with Clustering)
##### **Contribution**    - Individual
##### **Team Member 1 -**  Ponna Chaitanya


# **Project Summary -**

Online food delivery platforms such as Zomato receive thousands of customer reviews every day. These reviews contain valuable feedback about food quality, service, and overall customer experience. However, manually analyzing such a large volume of textual data is impractical and time-consuming. This project aims to automate the process of sentiment analysis by classifying restaurant reviews into positive and negative sentiments using Natural Language Processing (NLP) and Machine Learning techniques.

The dataset consists of customer reviews along with their corresponding ratings. The project begins with Exploratory Data Analysis (EDA) to understand the structure of the dataset, identify missing values, and analyze the rating distribution, sentiment balance, and review length patterns. Visualizations such as histograms, count plots, and correlation heatmaps were used to extract insights into customer behavior and overall feedback trends.

Before building the model, the dataset underwent thorough preprocessing. Since the rating column contained mixed formats, numerical values were extracted and cleaned to ensure consistency. Based on these ratings, sentiment labels were generated. The textual data was then cleaned using preprocessing techniques such as converting text to lowercase, removing punctuation, numbers, and stopwords, and applying lemmatization to reduce words to their base forms. These steps helped remove noise and improve model performance.

The cleaned reviews were converted into numerical form using TF-IDF (Term Frequency–Inverse Document Frequency) Vectorization, which helps capture the importance of words across documents while reducing the weight of frequently occurring but less meaningful words. This transformation enabled machine learning algorithms to process textual data effectively.

For sentiment classification, a Logistic Regression model was trained to predict sentiment. The dataset was split into training and testing sets to ensure proper evaluation of model performance. The model achieved an accuracy of approximately 87.7%, indicating strong performance in classifying customer sentiment. Additional evaluation metrics such as precision, recall, F1-score, and confusion matrix were used to measure classification quality. A live review prediction system was also implemented, allowing users to input new reviews and receive instant sentiment feedback.

In addition to supervised classification, the project also integrates unsupervised learning through K-Means clustering. Clustering was applied to the TF-IDF feature matrix to group similar reviews based on textual patterns without using sentiment labels. The Elbow Method was used to determine the optimal number of clusters by analyzing inertia values, and the Silhouette Score was calculated to evaluate clustering performance. Furthermore, Principal Component Analysis (PCA) was applied to reduce high-dimensional data into two dimensions for visualization, enabling clear graphical representation of cluster separation.

This project demonstrates how NLP, supervised learning, and unsupervised learning techniques can be effectively applied to real-world business problems. By automating sentiment analysis and uncovering hidden patterns in customer feedback, restaurants can gain deeper insights into customer satisfaction, identify areas for improvement, and enhance overall service quality through data-driven decision-making.

**GitHub Link -** [https://github.com/chaitu-1219/Zamato-Project](https://github.com/chaitu-1219/Zamato-Project)

# **Problem Statement**


To build a machine learning system that automatically analyzes Zomato restaurant reviews by classifying them into positive or negative sentiments using Natural Language Processing techniques, and by applying clustering methods to uncover hidden patterns in customer feedback, thereby helping businesses understand customer opinions and improve decision-making.


## ***1. Know Your Data***

### Import Libraries

In [None]:
# ==========================================================
# 1️⃣ IMPORT LIBRARIES (Safe Import with Exception Handling)
# ==========================================================

try:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import silhouette_score
    from wordcloud import WordCloud
except Exception as e:
    print("Library Import Error:", e)

nltk.download('stopwords')
nltk.download('wordnet')

sns.set(style="whitegrid")


### Dataset Loading

In [None]:

# Load Dataset
import pandas as pd

try:
    df = pd.read_csv("Zomato Restaurant reviews.csv")
    print("Dataset Loaded Successfully")
except Exception as e:
    print("Error loading dataset:", e)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

The dataset consists of customer reviews and corresponding ratings.
It contains textual and numerical data.
There are minor missing values and inconsistent rating formats that require cleaning.
The dataset is suitable for NLP-based sentiment analysis and clustering tasks.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Review → Customer feedback text (Text data)

Rating → Numerical rating given by customer

Sentiment → Derived categorical variable (Positive/Negative)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(col, df[col].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Remove Duplicates
df.drop_duplicates(inplace=True)

# Clean Rating Column
df['Rating'] = df['Rating'].astype(str)
# Raw string avoids invalid-escape warnings; expand=False returns a Series
df['Rating'] = df['Rating'].str.extract(r'(\d+\.?\d*)', expand=False)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop missing ratings
df.dropna(subset=['Rating'], inplace=True)

# Create Sentiment Column
df['Sentiment'] = df['Rating'].apply(lambda x: 'Positive' if x >= 4 else 'Negative')

# Create Review Length Feature
df['Review_Length'] = df['Review'].astype(str).apply(len)

df.head()


### What all manipulations have you done and insights you found?

##### **Manipulations:**

Removed duplicate records

Extracted numeric ratings from inconsistent formats

Removed null rating values

Created binary sentiment label

Created new feature: Review Length
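
The rating clean-up and labelling steps listed above can be sketched without pandas, which makes the rule easier to inspect in isolation. This is a minimal sketch: `parse_rating` and `label_sentiment` are hypothetical helper names, and the raw values are made-up examples of mixed formats rather than rows from the dataset.

```python
import re

def parse_rating(raw):
    """Return the first number found in a raw rating cell, or None if there is none."""
    match = re.search(r"\d+\.?\d*", str(raw))
    return float(match.group()) if match else None

def label_sentiment(rating, threshold=4.0):
    """Map a numeric rating to the binary label used in this notebook (>= 4 is Positive)."""
    return "Positive" if rating >= threshold else "Negative"

# Made-up examples of the mixed formats a rating column can contain
raw_ratings = ["5", "4.5", "Like", "3", None]

parsed = [parse_rating(r) for r in raw_ratings]
labels = [label_sentiment(r) for r in parsed if r is not None]

print(parsed)
print(labels)
```

Cells without any digits come back as `None`, which corresponds to the rows the notebook drops after `pd.to_numeric(..., errors='coerce')`.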

##### **Insights Found:**

Majority ratings are positive

Review length varies significantly

Long reviews often express stronger sentiment

Dataset is suitable for NLP and ML modeling


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Rating Distribution

In [None]:
# Chart - 1 visualization code
sns.histplot(df['Rating'], bins=10)
plt.title("Rating Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

To understand overall rating trends.

##### 2. What is/are the insight(s) found from the chart?

Most ratings are above 4.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High ratings indicate strong customer satisfaction → positive growth potential.


#### Chart - 2 : Sentiment Count


In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(6,4))
sns.countplot(x='Sentiment', data=df)
plt.title("Chart 2: Sentiment Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To check class balance.

##### 2. What is/are the insight(s) found from the chart?

Positive reviews dominate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Strong brand image, but negative share must be monitored.

#### Chart - 3 : Review Length Distribution

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(6,4))
sns.histplot(df['Review_Length'], bins=30)
plt.title("Chart 3: Review Length Distribution")
plt.show()


##### 1. Why did you pick the specific chart?


To analyze how detailed customer feedback is.

##### 2. What is/are the insight(s) found from the chart?

Wide variation in review lengths.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Long reviews may indicate strong emotional experiences.

#### Chart - 4 : Boxplot of Ratings

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(6,4))
sns.boxplot(x=df['Rating'])
plt.title("Chart 4: Rating Boxplot")
plt.show()


##### 1. Why did you pick the specific chart?

To detect outliers and spread of ratings.

##### 2. What is/are the insight(s) found from the chart?

Few low-rating outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Extreme dissatisfaction cases require attention.

#### Chart - 5 : WordCloud (Positive Reviews)

In [None]:
# Chart - 5 visualization code
from wordcloud import WordCloud

positive_text = " ".join(df[df['Sentiment']=="Positive"]['Review'].astype(str))
wc = WordCloud(width=800, height=400).generate(positive_text)

plt.figure(figsize=(8,4))
plt.imshow(wc)
plt.title("Chart 5: WordCloud - Positive Reviews")
plt.axis("off")
plt.show()


##### 1. Why did you pick the specific chart?

To identify most frequent positive keywords.

##### 2. What is/are the insight(s) found from the chart?

Frequent words: good, service, food.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifies key strengths.

#### Chart - 6 : WordCloud (Negative Reviews)

In [None]:
# Chart - 6 visualization code
negative_text = " ".join(df[df['Sentiment']=="Negative"]['Review'].astype(str))
wc = WordCloud(width=800, height=400, background_color='white').generate(negative_text)

plt.figure(figsize=(8,4))
plt.imshow(wc)
plt.title("Chart 6: WordCloud - Negative Reviews")
plt.axis("off")
plt.show()


##### 1. Why did you pick the specific chart?

To understand common complaints.

##### 2. What is/are the insight(s) found from the chart?

Common complaints identified.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Highlights service improvement areas.

#### Chart - 7 : Sentiment vs Review Length

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(6,4))
sns.boxplot(x='Sentiment', y='Review_Length', data=df)
plt.title("Chart 7: Sentiment vs Review Length")
plt.show()


##### 1. Why did you pick the specific chart?

To compare review detail across sentiments.

##### 2. What is/are the insight(s) found from the chart?

Negative reviews slightly longer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Angry customers explain problems in detail.

#### Chart - 8 : Rating vs Review Length

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(6,4))
sns.scatterplot(x='Rating', y='Review_Length', data=df)
plt.title("Chart 8: Rating vs Review Length")
plt.show()


##### 1. Why did you pick the specific chart?

To examine how review length relates to the rating given.

##### 2. What is/are the insight(s) found from the chart?

Weak correlation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Length does not directly impact rating.

#### Chart - 9 : Average Rating by Sentiment

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(6,4))
df.groupby('Sentiment')['Rating'].mean().plot(kind='bar')
plt.title("Chart 9: Average Rating by Sentiment")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

To validate sentiment labeling.

##### 2. What is/are the insight(s) found from the chart?

Positive sentiment aligns with high ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms that the rating-derived sentiment labels are consistent, which supports reliable model training.

#### Chart - 10 : Correlation Heatmap

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(df[['Rating','Review_Length']].corr(), annot=True)
plt.title("Chart 10: Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

To understand relationships between numerical variables and detect multicollinearity.

##### 2. What is/are the insight(s) found from the chart?

Weak correlation between rating and review length.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Review length alone cannot predict satisfaction.

#### Chart - 11 : Word Count Distribution

In [None]:
# Chart - 11 visualization code

# Create Word_Count Feature
df['Word_Count'] = df['Review'].astype(str).apply(lambda x: len(x.split()))

# Verify column exists
df[['Review', 'Word_Count']].head()

plt.figure(figsize=(6,4))
sns.histplot(df['Word_Count'], bins=30)
plt.title("Chart 11: Word Count Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To understand textual depth.

##### 2. What is/are the insight(s) found from the chart?

Most reviews are short to medium length.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Short feedback still carries meaningful sentiment; businesses should not ignore short reviews.

#### Chart - 12 : KMeans Cluster Distribution

In [None]:
# Remove rows with missing reviews (copy avoids chained-assignment warnings)
df = df.dropna(subset=['Review']).copy()

# Convert to string so the vectorizer receives text only
df['Review'] = df['Review'].astype(str)


In [None]:
# Chart - 12 visualization code
tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(df['Review'])

kmeans = KMeans(n_clusters=2, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

plt.figure(figsize=(6,4))
sns.countplot(x='Cluster', data=df)
plt.title("Chart 12: Cluster Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze natural grouping of reviews.

##### 2. What is/are the insight(s) found from the chart?

Two distinct clusters formed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Customer feedback can be segmented automatically for targeted improvement strategies.
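
The choice of two clusters can be checked with the Silhouette Score mentioned in the project summary. The sketch below is a dependency-free illustration of what that score measures; the notebook itself imports scikit-learn's `silhouette_score` for this. The function name `silhouette_score_sketch` and the toy points are assumptions made for illustration, and the function expects at least two clusters.

```python
import math

def silhouette_score_sketch(points, labels):
    """Mean silhouette coefficient using Euclidean distance (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy 2-D points forming two obvious groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score_sketch(points, [0, 0, 1, 1])
bad = silhouette_score_sketch(points, [0, 1, 0, 1])
print("well separated:", round(good, 3))
print("mixed up:", round(bad, 3))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 suggest that points sit closer to another cluster than to their own. Comparing the score across several values of `n_clusters` is one way to support the value chosen for K-Means.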

#### Chart - 13 : Cluster vs Sentiment

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(6,4))
sns.countplot(x='Cluster', hue='Sentiment', data=df)
plt.title("Chart 13: Cluster vs Sentiment")
plt.show()


##### 1. Why did you pick the specific chart?

To compare clustering output with sentiment labels.

##### 2. What is/are the insight(s) found from the chart?

Clusters align closely with sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms clustering validity and usefulness for automated segmentation.
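
How closely the clusters align with sentiment can also be expressed as a single number. One simple option, shown here as a sketch, is cluster purity: the share of reviews that carry the majority sentiment of their cluster. The helper `cluster_purity` and the toy inputs are hypothetical and are not part of the original notebook.

```python
from collections import Counter

def cluster_purity(clusters, labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(clusters, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

# Toy example: five reviews, two clusters
toy_clusters = [0, 0, 1, 1, 1]
toy_sentiment = ["Positive", "Positive", "Negative", "Negative", "Positive"]

purity = cluster_purity(toy_clusters, toy_sentiment)
print("Purity:", purity)
```

With a strongly imbalanced dataset, purity is high even for uninformative clusters, so it should be read against the share of the majority class.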

#### Chart - 14 : PCA Visualization

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X.toarray())

plt.figure(figsize=(6,4))
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=df['Cluster'])
plt.title("Chart 14: PCA Visualization of Clusters")
plt.show()


##### 1. Why did you pick the specific chart?

To reduce high-dimensional text features into 2D space.

##### 2. What is/are the insight(s) found from the chart?

Clear clustering pattern visible.

#### Chart - 15 : Pair Plot

In [None]:
# Pair Plot visualization code
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df[['Rating','Review_Length','Word_Count']])
plt.suptitle("Chart 15: Pair Plot Analysis", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To visualize pairwise relationships.

##### 2. What is/are the insight(s) found from the chart?

Rating and review length show distinct grouping.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on chart observations, the following hypotheses were formulated and tested using appropriate statistical methods.

### Hypothetical Statement - 1
Negative reviews tend to be longer than positive reviews.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in average review length between positive and negative reviews.

Alternate Hypothesis (H₁):
There is a significant difference in average review length between positive and negative reviews.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

positive_lengths = df[df['Sentiment']=="Positive"]['Review_Length']
negative_lengths = df[df['Sentiment']=="Negative"]['Review_Length']

t_stat, p_value = ttest_ind(positive_lengths, negative_lengths)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample T-Test

##### Why did you choose the specific statistical test?

We are comparing the means of two independent groups

The variable (Review_Length) is numerical

Sentiment has two categories
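
For reference, the statistic behind the independent two-sample t-test can be written out by hand. The sketch below computes the pooled-variance form, which is what `ttest_ind` uses with its default `equal_var=True`. The function name `pooled_t_statistic` and the review lengths are made up for illustration.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance weights each group by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Made-up review lengths (in characters) for two small groups
toy_positive = [120, 150, 180]
toy_negative = [200, 230, 260]

t_value = pooled_t_statistic(toy_positive, toy_negative)
print("t statistic:", round(t_value, 4))
```

A negative value here means the first group has the smaller mean. If the two groups have clearly different variances, passing `equal_var=False` to `ttest_ind` switches to Welch's test, which does not rely on the pooled variance.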

### Hypothetical Statement - 2
**Statement:**

Rating and sentiment are dependent on each other.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
Rating distribution is independent of sentiment.

Alternate Hypothesis (H₁):
Rating distribution is dependent on sentiment.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['Sentiment'], df['Rating'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence

##### Why did you choose the specific statistical test?

Both variables are categorical

We want to check association between sentiment and rating
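
The chi-square statistic compares each observed cell count with the count expected if the two variables were independent. A minimal sketch of that calculation follows; `chi_square_statistic` is a hypothetical helper and the table is a toy example rather than the notebook's crosstab.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Toy table: 2 sentiment classes (rows) x 3 rating levels (columns)
toy_table = [[10, 20, 30],
             [30, 20, 10]]

chi2_value, dof_value = chi_square_statistic(toy_table)
print("chi-square:", chi2_value, "dof:", dof_value)
```

This sketch does not apply a continuity correction. SciPy's `chi2_contingency` applies Yates' correction by default when the table has one degree of freedom, so results for 2x2 tables can differ from the plain formula.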

### Hypothetical Statement - 3
**Statement:**

There is a correlation between rating and review length.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no correlation between rating and review length.

Alternate Hypothesis (H₁):
There is a significant correlation between rating and review length.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

correlation, p_value = pearsonr(df['Rating'], df['Review_Length'])

print("Correlation Coefficient:", correlation)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test

##### Why did you choose the specific statistical test?

Both variables are numerical

We want to measure linear correlation
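
The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below spells that out; `pearson_r` is a hypothetical helper and the values are invented to show a perfectly negative linear relationship.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Invented values: review length falls steadily as the rating rises
toy_ratings = [1, 2, 3, 4, 5]
toy_lengths = [50, 40, 30, 20, 10]

r_value = pearson_r(toy_ratings, toy_lengths)
print("r:", r_value)
```

Values near 0 indicate little linear relationship, which is the pattern reported for rating and review length in the charts above. Pearson's coefficient only captures linear association, so a non-linear pattern can still produce a value close to 0.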

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check missing values
df.isnull().sum()

# Remove rows where Review or Rating is missing
df.dropna(subset=['Review', 'Rating'], inplace=True)

# If numeric column had missing values, we would use:
# df['Rating'].fillna(df['Rating'].median(), inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Removed rows with missing Review or Rating (critical features).

If needed, median imputation would be used for numerical columns because:

Median is robust to outliers.

Preserves distribution better than mean.

Business Reason:
Incomplete reviews cannot contribute to sentiment analysis.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Detect outliers using IQR
Q1 = df['Review_Length'].quantile(0.25)
Q3 = df['Review_Length'].quantile(0.75)
IQR = Q3 - Q1

df = df[(df['Review_Length'] >= Q1 - 1.5*IQR) &
        (df['Review_Length'] <= Q3 + 1.5*IQR)]


##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR (Interquartile Range) Method
It is robust and suitable for skewed distributions.
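
The IQR rule keeps values between `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. A small standard-library sketch with made-up review lengths follows; `iqr_bounds` is a hypothetical helper, and `method="inclusive"` is chosen because it should correspond to the linear interpolation that pandas uses by default.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up review lengths with one extreme value
toy_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 100]

low, high = iqr_bounds(toy_lengths)
kept = [v for v in toy_lengths if low <= v <= high]

print("bounds:", low, high)
print("kept:", kept)
```

Only the extreme value falls outside the fences here. On real review data the rule also removes very long reviews, which may still be informative, so the share of dropped rows is worth checking.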

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Encode Sentiment column
df['Sentiment'] = df['Sentiment'].map({'Positive':1, 'Negative':0})


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding (Positive → 1, Negative → 0).

This is a binary classification problem, and Logistic Regression requires numerical input.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
!pip install contractions
import contractions
df['Review'] = df['Review'].apply(lambda x: contractions.fix(str(x)))


#### 2. Lower Casing

In [None]:
# Lower Casing
df['Review'] = df['Review'].str.lower()


#### 3. Removing Punctuations

In [None]:
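# Remove Punctuations
import re
import string
# re.escape stops characters such as ], \ and ^ from being read as regex syntax
df['Review'] = df['Review'].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)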


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs, then strip all digit characters (letters attached to digits are kept)
import re
df['Review'] = df['Review'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['Review'] = df['Review'].apply(lambda x: re.sub(r'\d+', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
stop_words = set(stopwords.words('english'))
df['Review'] = df['Review'].apply(
    lambda x: " ".join(word for word in x.split() if word not in stop_words)
)

In [None]:
# Remove White spaces
df['Review'] = df['Review'].str.strip()


#### 6. Rephrase Text

In [None]:
# Rephrase Text (Slang Replacement Dictionary)

slang_dict = {
    "gr8": "great",
    "u": "you",
    "ur": "your",
    "r": "are",
    "btw": "by the way",
    "luv": "love",
    "bcz": "because",
    "pls": "please",
    "thx": "thanks"
}

def rephrase_text(text):
    words = text.split()
    new_words = []
    for word in words:
        if word in slang_dict:
            new_words.append(slang_dict[word])
        else:
            new_words.append(word)
    return " ".join(new_words)

# Apply rephrasing
df['Review'] = df['Review'].apply(lambda x: rephrase_text(str(x)))

df[['Review']].head()


#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

df['Tokens'] = df['Review'].apply(word_tokenize)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

df['Review'] = df['Tokens'].apply(
    lambda words: " ".join(lemmatizer.lemmatize(word) for word in words)
)


##### Which text normalization technique have you used and why?

Lemmatization. It keeps meaningful base words and is better suited than stemming for business text analysis.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
from nltk import pos_tag
import nltk
nltk.download('averaged_perceptron_tagger_eng')
df['POS'] = df['Tokens'].apply(pos_tag)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['Review'])


##### Which text vectorization technique have you used and why?

TF-IDF. It captures word importance and reduces the impact of frequent generic words.
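
To make the weighting concrete, here is a dependency-free sketch of TF-IDF using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which are the defaults documented for scikit-learn's `TfidfVectorizer`. The function `tfidf_vectors` is a hypothetical helper, it tokenises on whitespace only (the real vectorizer uses its own token pattern), and the three documents are toy examples.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_docs = ["good food", "good service", "bad food"]
vectors = tfidf_vectors(toy_docs)

for doc, vec in zip(toy_docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document, "service" appears in only one review and therefore receives a higher weight than "good", which appears in two. This is the effect described above: terms that are common across reviews are down-weighted.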

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
df['Review_Length'] = df['Review'].apply(len)
df['Word_Count'] = df['Review'].apply(lambda x: len(x.split()))


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Use TF-IDF max_features to reduce dimensionality
tfidf = TfidfVectorizer(max_features=5000)


##### What all feature selection methods have you used  and why?

Limiting max_features

Removing correlated numeric features

Both steps reduce dimensionality and help prevent overfitting.

##### Which all features you found important and why?

Important Features:

High TF-IDF weighted words

Review Length

Word Count

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, transformation is required.

Text transformed into TF-IDF vectors because ML models cannot process raw text.

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(words)

df['Clean_Review'] = df['Review'].apply(clean_text)

df[['Review','Clean_Review']].head()


In [None]:
# Transform Your data
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF
tfidf = TfidfVectorizer(max_features=5000)

# Transform text data into numerical format
X = tfidf.fit_transform(df['Clean_Review'])

# Target variable
y = df['Sentiment']

print("Shape of Transformed Data:", X.shape)
print("Shape of Target Variable:", y.shape)

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)


##### Which method have you used to scale you data and why?

Used StandardScaler because:

Ensures uniform feature scale

Required for some ML algorithms
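
Setting `with_mean=False` means each feature is divided by its standard deviation without being centred first, so zero entries stay zero and the TF-IDF matrix remains sparse. The sketch below illustrates this on a single made-up column; `scale_without_centering` is a hypothetical helper, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import pstdev

def scale_without_centering(column):
    """Divide a feature column by its population standard deviation."""
    std = pstdev(column)
    if std == 0:
        # A constant column is left unchanged
        return list(column)
    return [value / std for value in column]

# Made-up TF-IDF column: most reviews do not contain the term at all
toy_column = [0.0, 0.0, 2.0, 4.0]
scaled = scale_without_centering(toy_column)

print(scaled)
print("std after scaling:", pstdev(scaled))
```

Subtracting the mean would turn every zero into a non-zero value, which is why centring is not supported for sparse input.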

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, needed because TF-IDF creates high dimensional data.

In [None]:
# Dimensionality Reduction (If needed)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X.toarray())


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Technique Used: PCA

Why?

Reduces dimensionality

Helps visualization

Reduces noise

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, df['Sentiment'], test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

Split Ratio: 80% Training, 20% Testing

Why?
Standard industry practice.
Provides sufficient training data while keeping enough held-out reviews for a reliable test estimate.
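
Because positive reviews dominate the dataset, it can help to keep the class ratio identical in both parts of the split. The sketch below shows the idea with the standard library only. It is an optional variation rather than what the cell above does, and `stratified_split` is a hypothetical helper. In scikit-learn, the same effect is available through the `stratify` parameter of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with the class ratio preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80 positive (1) and 20 negative (0) reviews
toy_labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(toy_labels)

print("train size:", len(train_idx), "test size:", len(test_idx))
print("positives in test:", sum(toy_labels[i] for i in test_idx))
```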

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

To determine whether the dataset is imbalanced, we first examined the distribution of the target variable (Sentiment).

In [None]:
df['Sentiment'].value_counts()

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE

# Resample the training split only, so the test set contains real reviews only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Technique Used: SMOTE

Why?
Balances minority class.
Prevents biased model predictions.
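
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that core step in plain Python. It is a simplified illustration, not the imbalanced-learn implementation: `smote_like_oversample` is a hypothetical name, the points are made up, and at least two minority points are assumed.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Made-up 2-D minority-class points
toy_minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_oversample(toy_minority, n_new=5)

for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Every synthetic point lies between two existing minority points, so no value falls outside the range already seen for that class.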

## ***7. ML Model Implementation***

### ML Model - 1 : Logistic Regression

In [None]:
# ML Model - 1 Implementation

# Import Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score

# Initialize the model
lr_model = LogisticRegression(max_iter=1000)

# Fit the Algorithm (Training the Model)
lr_model.fit(X_train, y_train)

# Predict on the model (Testing Phase)
y_pred_lr = lr_model.predict(X_test)

# Calculate individual metrics and store them in variables
accuracy = accuracy_score(y_test, y_pred_lr)
precision = precision_score(y_test, y_pred_lr)
recall = recall_score(y_test, y_pred_lr)
f1 = f1_score(y_test, y_pred_lr)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_lr))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

metrics = [accuracy, precision, recall, f1]
labels = ['Accuracy','Precision','Recall','F1 Score']

plt.figure(figsize=(6,4))
sns.barplot(x=labels, y=metrics)
plt.title("Logistic Regression Evaluation Metrics")
plt.ylim(0,1)
plt.show()
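
The four scores in the chart all derive from the counts of true and false positives and negatives. A minimal sketch of those definitions follows, assuming labels encoded as 1 (Positive) and 0 (Negative) as in this notebook. `binary_metrics` is a hypothetical helper and the labels are toy values.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels encoded as 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels: 1 = Positive, 0 = Negative
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1]

scores = binary_metrics(toy_true, toy_pred)
print(scores)
```

Since positive reviews dominate, accuracy alone can look strong even when negative reviews are often misclassified, so the per-class figures in the classification report are worth reading alongside it.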


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# ==========================================
# Cross-Validation & Hyperparameter Tuning
# ==========================================

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Initialize Base Model
lr_model = LogisticRegression(max_iter=1000)

# ------------------------------------------
# Step 2: Perform Cross-Validation
# ------------------------------------------

cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='f1')

print("Cross-Validation F1 Scores:", cv_scores)
print("Average CV F1 Score:", cv_scores.mean())

# ------------------------------------------
# Step 3: Define Hyperparameter Grid
# ------------------------------------------

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

# ------------------------------------------
# Step 4: Apply GridSearchCV
# ------------------------------------------

grid_search = GridSearchCV(
    estimator=lr_model,
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

# ------------------------------------------
# Step 5: Fit the Tuned Model
# ------------------------------------------

grid_search.fit(X_train, y_train)

# Best Model
best_lr_model = grid_search.best_estimator_

print("Best Parameters:", grid_search.best_params_)

# ------------------------------------------
# Step 6: Predict on Test Data
# ------------------------------------------

y_pred_tuned = best_lr_model.predict(X_test)

# ------------------------------------------
# Step 7: Evaluate Tuned Model
# ------------------------------------------

# Use separate names so the baseline scores computed earlier are not overwritten
accuracy_lr_tuned = accuracy_score(y_test, y_pred_tuned)
precision_lr_tuned = precision_score(y_test, y_pred_tuned)
recall_lr_tuned = recall_score(y_test, y_pred_tuned)
f1_lr_tuned = f1_score(y_test, y_pred_tuned)

print("Tuned Model Accuracy:", accuracy_lr_tuned)
print("Tuned Model Precision:", precision_lr_tuned)
print("Tuned Model Recall:", recall_lr_tuned)
print("Tuned Model F1 Score:", f1_lr_tuned)


##### Which hyperparameter optimization technique have you used and why?

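GridSearchCV, for the following reasons:

- It exhaustively evaluates every parameter combination in the specified grid.
- It selects the combination with the best cross-validated F1 score, which is the best within the grid rather than a guaranteed global optimum.
- It is reliable, reproducible, and widely accepted for academic projects.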

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.
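
One way to answer this question with numbers is to keep the baseline and tuned scores in separate variables and print them side by side. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of the notebook's TF-IDF features, and `metric_summary` is a hypothetical helper, so its output says nothing about the actual Zomato results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split


def metric_summary(y_true, y_pred):
    """Return the four headline metrics as a dict (hypothetical helper)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }


# Synthetic stand-in for the TF-IDF matrix and sentiment labels (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline model with default regularization strength.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tuned model: same grid of C values as in the tuning cell.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1",
).fit(X_train, y_train)

baseline_scores = metric_summary(y_test, baseline.predict(X_test))
tuned_scores = metric_summary(y_test, search.best_estimator_.predict(X_test))

# Print both sets of scores and the change per metric.
for name in baseline_scores:
    delta = tuned_scores[name] - baseline_scores[name]
    print(f"{name:<10} baseline={baseline_scores[name]:.3f} "
          f"tuned={tuned_scores[name]:.3f} change={delta:+.3f}")
```

The two dictionaries can also be passed to the same `sns.barplot` call used for the earlier score charts to produce an updated chart.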

### ML Model - 2 : Naive Bayes

In [None]:
# ML Model - 2 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Initialize model
nb_model = MultinomialNB()

# Fit the Algorithm
nb_model.fit(X_train, y_train)

# Predict on the model
y_pred_nb = nb_model.predict(X_test)

print(classification_report(y_test, y_pred_nb))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
accuracy_nb = accuracy_score(y_test, y_pred_nb)
precision_nb = precision_score(y_test, y_pred_nb)
recall_nb = recall_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb)

print("Accuracy:", accuracy_nb)
print("Precision:", precision_nb)
print("Recall:", recall_nb)
print("F1 Score:", f1_nb)

import matplotlib.pyplot as plt
import seaborn as sns

metrics = [accuracy_nb, precision_nb, recall_nb, f1_nb]
labels = ['Accuracy','Precision','Recall','F1 Score']

plt.figure(figsize=(6,4))
sns.barplot(x=labels, y=metrics)
plt.title("Naive Bayes Evaluation Metrics")
plt.ylim(0,1)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

from sklearn.model_selection import cross_val_score

cv_scores_nb = cross_val_score(nb_model, X_train, y_train, cv=5, scoring='f1')

print("Cross Validation Scores:", cv_scores_nb)
print("Average CV F1 Score:", cv_scores_nb.mean())

from sklearn.model_selection import GridSearchCV

param_grid_nb = {
    'alpha': [0.1, 0.5, 1.0, 2.0]
}

grid_nb = GridSearchCV(
    MultinomialNB(),
    param_grid_nb,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

# Fit the Algorithm
grid_nb.fit(X_train, y_train)

best_nb_model = grid_nb.best_estimator_

print("Best Parameters:", grid_nb.best_params_)

# Predict on Tuned Model
y_pred_nb_tuned = best_nb_model.predict(X_test)

accuracy_nb_tuned = accuracy_score(y_test, y_pred_nb_tuned)
precision_nb_tuned = precision_score(y_test, y_pred_nb_tuned)
recall_nb_tuned = recall_score(y_test, y_pred_nb_tuned)
f1_nb_tuned = f1_score(y_test, y_pred_nb_tuned)

print("Tuned Accuracy:", accuracy_nb_tuned)
print("Tuned Precision:", precision_nb_tuned)
print("Tuned Recall:", recall_nb_tuned)
print("Tuned F1 Score:", f1_nb_tuned)



##### Which hyperparameter optimization technique have you used and why?

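GridSearchCV, for the following reasons:

- It systematically searches the parameter combinations in the grid.
- It finds the best smoothing parameter (alpha) among the candidate values.
- It scores each candidate with 5-fold cross-validation, which favors values that generalize beyond a single split.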

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes. After tuning:

- Slight improvement in Recall
- More balanced F1-score
- Better cross-validation stability

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Accuracy:
Shows overall correctness of predictions. High accuracy means the model is reliable for automated sentiment monitoring.

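Precision:
Indicates what fraction of the reviews predicted as a given class actually belong to that class. High precision reduces false alerts and saves business resources.

Recall:
Measures what fraction of the reviews that truly belong to a class are correctly identified. High recall on the negative class ensures dissatisfied customers are not missed, which is critical for customer retention. Note that scikit-learn's `recall_score` and `precision_score` report the class labeled 1 by default, so negative-class values should be read from the classification report unless negative reviews are encoded as 1.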

F1 Score:
Balances Precision and Recall. A high F1 score ensures the model performs well overall without favoring one metric.

Business Impact:

The ML model helps businesses automatically detect customer satisfaction trends, identify complaints early, improve service quality, and make data-driven decisions to enhance customer experience and revenue growth.
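
Because the business case centers on catching negative reviews, it helps to compute precision and recall for the negative class explicitly. The following is a minimal sketch on hand-made toy labels; the encoding (1 = positive, 0 = negative) is an assumption and may differ from the notebook's actual labels.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (assumption): 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]

# Default pos_label=1: the score describes the positive class.
pos_recall = recall_score(y_true, y_pred)

# pos_label=0: the scores describe the negative class instead.
neg_recall = recall_score(y_true, y_pred, pos_label=0)
neg_precision = precision_score(y_true, y_pred, pos_label=0)

print(f"Recall (positive class):    {pos_recall:.3f}")     # 5 of 6 positives found
print(f"Recall (negative class):    {neg_recall:.3f}")     # 2 of 4 negatives found
print(f"Precision (negative class): {neg_precision:.3f}")  # 2 of 3 negative alerts correct
```

In this toy case the default recall looks healthy at about 0.83, while only half of the negative reviews are actually caught, which is the number that matters for complaint monitoring.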

### ML Model - 3 : Random Forest

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
# Import Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Initialize model
rf_model = RandomForestClassifier(random_state=42)

# Fit the Algorithm
rf_model.fit(X_train, y_train)

# Predict on the model
y_pred_rf = rf_model.predict(X_test)

# Evaluate the Model
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_rf))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import seaborn as sns

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_rf)

metrics = [accuracy_rf, precision_rf, recall_rf, f1_rf]
labels = ['Accuracy','Precision','Recall','F1 Score']

plt.figure(figsize=(6,4))
sns.barplot(x=labels, y=metrics)
plt.title("Random Forest Evaluation Metrics")
plt.ylim(0,1)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

from sklearn.model_selection import cross_val_score

cv_scores_rf = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='f1')

print("Cross Validation Scores:", cv_scores_rf)
print("Average CV Score:", cv_scores_rf.mean())

from sklearn.model_selection import GridSearchCV

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}

grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

# Fit the Algorithm
grid_rf.fit(X_train, y_train)

best_rf_model = grid_rf.best_estimator_

print("Best Parameters:", grid_rf.best_params_)

# Predict on Tuned Model
y_pred_rf_tuned = best_rf_model.predict(X_test)

accuracy_rf_tuned = accuracy_score(y_test, y_pred_rf_tuned)
precision_rf_tuned = precision_score(y_test, y_pred_rf_tuned)
recall_rf_tuned = recall_score(y_test, y_pred_rf_tuned)
f1_rf_tuned = f1_score(y_test, y_pred_rf_tuned)

print("Tuned Accuracy:", accuracy_rf_tuned)
print("Tuned Precision:", precision_rf_tuned)
print("Tuned Recall:", recall_rf_tuned)
print("Tuned F1 Score:", f1_rf_tuned)



##### Which hyperparameter optimization technique have you used and why?

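GridSearchCV, for the following reasons:

- It systematically searches every parameter combination in the grid.
- It improves model performance by selecting the combination with the best cross-validated F1 score.
- It helps limit overfitting, because each candidate (including shallower trees via `max_depth`) is scored on held-out folds rather than on the training data.
- It is widely accepted in academic evaluation.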
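
The cost of an exhaustive search grows with the product of the grid sizes, which is worth checking before fitting a Random Forest many times. As a small worked example, the sketch below counts the candidates for the grid used above with `ParameterGrid`; the variable names `n_folds`, `n_candidates`, and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
param_grid_rf = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}
n_folds = 5

# Each combination of values is one candidate model: 2 * 2 * 2 = 8.
n_candidates = len(ParameterGrid(param_grid_rf))

# Every candidate is trained once per fold during the search.
n_cv_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_cv_fits} fits, "
      "plus 1 final refit on the full training set")
```

Adding a fourth hyperparameter with three values would triple this count, which is the point at which a randomized search may become the more practical choice.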

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes. After tuning:

- Slight improvement in F1 score
- Better balance between precision and recall
- More stable cross-validation performance

However, the improvement may be marginal compared to Logistic Regression.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Recall and F1 Score, for the following reasons:

- Recall ensures negative reviews are not missed.
- F1 Score balances precision and recall.
- Accuracy alone may be misleading in imbalanced datasets.

From a business perspective, missing negative reviews can lead to customer churn and revenue loss.
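
The point about accuracy on imbalanced data can be made concrete with a small worked example. The sketch below assumes a hypothetical test set of 90 positive and 10 negative reviews and a trivial classifier that labels everything positive; these numbers are invented for illustration and are not taken from the Zomato dataset.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced test set: 90 positive (1) and 10 negative (0) reviews.
y_true = [1] * 90 + [0] * 10

# A trivial "model" that labels every review as positive.
y_pred = [1] * 100

accuracy = accuracy_score(y_true, y_pred)
# Score the negative class explicitly; zero_division=0 avoids a warning
# when no review is predicted as negative.
neg_recall = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
neg_f1 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(f"Accuracy:                {accuracy:.2f}")
print(f"Recall (negative class): {neg_recall:.2f}")
print(f"F1 (negative class):     {neg_f1:.2f}")
```

The classifier reaches 0.90 accuracy while detecting none of the ten negative reviews, so its negative-class recall and F1 are both 0.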

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

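Selected Model: Logistic Regression, for the following reasons:

- Highest stable accuracy among the three models
- Better cross-validation consistency
- Lower computational cost than Random Forest
- More interpretable, since each coefficient maps to a single TF-IDF term
- Well suited to high-dimensional, sparse TF-IDF data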
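
Claims about consistency and cost can be backed with a side-by-side cross-validation run. The sketch below is one possible way to do that, under stated assumptions: it uses synthetic data from `make_classification` rather than the notebook's TF-IDF matrix, rescales features to be non-negative because `MultinomialNB` requires that, and uses a smaller forest to keep the run short. Its numbers are therefore not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the vectorized reviews (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
# MultinomialNB needs non-negative inputs, as TF-IDF values would be.
X = MinMaxScaler().fit_transform(X)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring="f1")
    results[name] = {
        "mean_f1": cv["test_score"].mean(),          # average quality
        "std_f1": cv["test_score"].std(),            # fold-to-fold consistency
        "mean_fit_seconds": cv["fit_time"].mean(),   # rough training cost
    }

for name, r in results.items():
    print(f"{name:<20} F1={r['mean_f1']:.3f} +/- {r['std_f1']:.3f} "
          f"fit={r['mean_fit_seconds']:.3f}s")
```

A lower standard deviation across folds indicates more consistent performance, and the mean fit time gives a rough view of relative training cost.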

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Model Explainability & Feature Importance

Since Logistic Regression was selected as the final model, we analyze feature importance using the model coefficients. Each coefficient corresponds to one TF-IDF term, and its sign shows which class the term pushes a prediction toward (assuming positive sentiment is encoded as 1).

Explanation:

- Positive coefficients: words indicating positive sentiment (e.g., "excellent", "good")
- Negative coefficients: words indicating dissatisfaction (e.g., "bad", "slow")

Business Insight:

This helps businesses identify:

- Key strengths customers appreciate
- Recurring complaint keywords
- Service improvement areas

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***