# Sentiment Analysis using Naive Bayes Classifier

## Introduction

- This project builds a **sentiment analysis model** that classifies restaurant reviews as **positive or negative** using a **Naive Bayes (NB) classifier**. Despite its simplicity, NB is often competitive on text classification tasks, particularly with **smaller datasets**, which is why it is widely used as a **baseline model**.

- **Dataset**:
  - We use a **public Kaggle dataset** of **1,000 restaurant reviews**.
  - The data is perfectly balanced: 500 positive and 500 negative reviews. Each review is a short text paired with a binary `Liked` label (1 = positive, 0 = negative), which makes the dataset well suited to training a binary classifier.

## 1. Import Libraries and Load Dataset

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/Restaurant_Reviews.tsv', sep='\t')

print("Number of reviews:", df.shape[0])
print(df.head(5))


Number of reviews: 1000
                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [2]:
df['Liked'].value_counts()

Liked
1    500
0    500
Name: count, dtype: int64

## 2. Text Preprocessing

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (run once)
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

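def preprocess_text(text: str) -> str:
    """Clean one review: keep letters only, lowercase, drop stopwords, lemmatize."""
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r'[^A-Za-z]', ' ', text)
    # Convert to lowercase and split on whitespace
    tokens = text.lower().split()
    # Remove stopwords (note: NLTK's English list includes negations such as "not")
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize each token; the default part of speech is noun,
    # so verb forms such as "loved" are left unchanged
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back into a single string
    return ' '.join(tokens)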

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [4]:
# Preprocess all reviews in the dataset
corpus = [preprocess_text(review) for review in df['Review']]

print("Original review:", df['Review'][0])
print("After preprocessing:", corpus[0])

Original review: Wow... Loved this place.
After preprocessing: wow loved place
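
One side effect of the stopword step is worth seeing in isolation. NLTK's English stopword list contains negations such as "not" and "no", so a review like "Crust is not good." is reduced to "crust good" before it reaches the classifier. The sketch below illustrates the effect with a small hand-picked stopword subset (an assumption made so the example runs without downloads; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names, not part of the notebook).

```python
import re

# Illustrative subset only; the notebook uses NLTK's full English list.
STOP_WORDS = {"is", "the", "and", "was", "this", "just", "not", "no", "nor"}
# Words we may want to keep because they flip the sentiment of a phrase.
NEGATIONS = {"not", "no", "nor"}

def remove_stopwords(text: str, keep_negations: bool = False) -> str:
    # Same cleaning idea as preprocess_text: letters only, lowercase, split.
    tokens = re.sub(r"[^A-Za-z]", " ", text).lower().split()
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(token for token in tokens if token not in stop)

print(remove_stopwords("Crust is not good."))                       # crust good
print(remove_stopwords("Crust is not good.", keep_negations=True))  # crust not good
```

In the notebook, the equivalent change would be to subtract a negation set from `stop_words` before calling `preprocess_text`. Whether that improves accuracy on this dataset is not measured here.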


## 3. Feature Extraction (Bag-of-Words Model)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Build a bag-of-words representation: one column per vocabulary term
vectorizer = CountVectorizer()
# Learn the vocabulary and transform the corpus into a sparse count matrix
X_features = vectorizer.fit_transform(corpus)

print("Number of features (vocabulary size):", len(vectorizer.get_feature_names_out()))

Number of features (vocabulary size): 1766
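
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch on three toy documents (invented for illustration). It mimics the idea behind `CountVectorizer`, not its implementation: the real class returns a sparse matrix, lowercases input, and by default only keeps tokens of two or more alphanumeric characters.

```python
from collections import Counter

# Toy corpus, already preprocessed (invented for illustration).
docs = ["wow loved place", "crust good", "loved crust loved place"]

# Vocabulary: every distinct token, sorted so each token gets a fixed column.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_count_vector(doc: str) -> list:
    # Count how often each vocabulary token occurs in this document.
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)  # ['crust', 'good', 'loved', 'place', 'wow']
for row in matrix:
    print(row)
# [0, 0, 1, 1, 1]
# [1, 1, 0, 0, 0]
# [1, 0, 2, 1, 0]
```

Word order is discarded: each review becomes a row of counts, and the 1,766 features reported above are the columns of that matrix.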


## 4. Train/Test Split

In [6]:
from sklearn.model_selection import train_test_split
# feature matrix
X = X_features
# sentiment labels array
y = df['Liked'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training examples:", X_train.shape[0])
print("Testing examples:", X_test.shape[0])

Training examples: 800
Testing examples: 200
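
The split above is random but not stratified, which is why the classification report in Section 6 shows 96 negative and 104 positive reviews in the test set rather than 100 of each. If an exactly balanced test set is preferred, `train_test_split` accepts a `stratify` argument. The sketch below uses placeholder arrays (`X_demo`, `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same class balance as the notebook: 500 zeros, 500 ones.
y_demo = np.array([0] * 500 + [1] * 500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Test class counts:", np.bincount(y_te))
print("Train class counts:", np.bincount(y_tr))
```

Using a stratified split in the notebook would change which reviews land in the test set, so the accuracy figures reported later would not be directly comparable.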


## 5. Training the Naive Bayes Classifier

In [7]:
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB()

# Train (fit) the model on the training data
nb_classifier.fit(X_train, y_train)
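
For intuition about what the fit step learns, the following is a from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the training texts are invented, and `alpha=1.0` is chosen to match the default of `MultinomialNB`. Each class gets a log prior plus one smoothed log likelihood per vocabulary token, and prediction picks the class with the highest summed score.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Vocabulary over all training documents.
    vocab = {token for doc in docs for token in doc.split()}
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_tokens = sum(token_counts[label].values())
        denom = total_tokens + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(n_docs / len(docs)),
            # log P(token | class) with add-alpha smoothing
            "log_likelihood": {
                token: math.log((token_counts[label][token] + alpha) / denom)
                for token in vocab
            },
        }
    return model

def predict_nb(model, doc):
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in doc.split():
            # Tokens outside the training vocabulary are skipped.
            if token in params["log_likelihood"]:
                score += params["log_likelihood"][token]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, invented for illustration (1 = positive, 0 = negative).
toy_docs = ["loved place", "great food loved", "nasty texture", "crust good nasty"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "loved food"))   # 1
print(predict_nb(toy_model, "nasty crust"))  # 0
```

Smoothing matters because a token never seen with a class would otherwise have probability zero and wipe out the whole product for that class.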

## 6. Model Evaluation

In [8]:
from sklearn.metrics import accuracy_score, classification_report

# Predict sentiments for the test set
y_pred = nb_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy*100:.2f}%")

# Display detailed classification report (precision, recall, f1-score)
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))


Test Accuracy: 74.50%
              precision    recall  f1-score   support

    Negative       0.73      0.74      0.74        96
    Positive       0.76      0.75      0.75       104

    accuracy                           0.74       200
   macro avg       0.74      0.74      0.74       200
weighted avg       0.75      0.74      0.75       200
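
The report's figures can be traced back to four confusion-matrix counts. The notebook does not print them, so the counts below are an inference: they are the only integers consistent with the rounded recall values, the support column, and the 74.50% accuracy. The sketch recomputes precision, recall, and F1 from those counts (`prf` is a hypothetical helper).

```python
# Counts inferred from the rounded report (an assumption, not notebook output).
tn, fp = 71, 25  # actual Negative: predicted Negative / predicted Positive
fn, tp = 26, 78  # actual Positive: predicted Negative / predicted Positive

def prf(true_pos, false_pos, false_neg):
    # Precision: of everything predicted as this class, how much was right.
    precision = true_pos / (true_pos + false_pos)
    # Recall: of everything that really is this class, how much was found.
    recall = true_pos / (true_pos + false_neg)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (tn + tp) / (tn + fp + fn + tp)
# Treating Negative as the class of interest swaps the roles of the counts.
negative = prf(true_pos=tn, false_pos=fn, false_neg=fp)
positive = prf(true_pos=tp, false_pos=fp, false_neg=fn)

print(f"Accuracy: {accuracy:.3f}")
print("Negative:", " ".join(f"{value:.2f}" for value in negative))
print("Positive:", " ".join(f"{value:.2f}" for value in positive))
```

Read this way, the model misclassifies 25 of the 96 negative reviews and 26 of the 104 positive ones, so its errors are spread almost evenly across the two classes.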



## 7. Testing the Model on New Reviews

In [9]:
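def predict_sentiment(review: str) -> str:
    # Apply the same cleaning used for the training corpus
    cleaned = preprocess_text(review)
    # Reuse the fitted vocabulary; words not seen during fitting are ignored
    features = vectorizer.transform([cleaned])
    pred = nb_classifier.predict(features)
    return "Positive" if pred[0] == 1 else "Negative"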

print(predict_sentiment("The food was absolutely wonderful, fresh and very tasty!"))
print(predict_sentiment("I will never come to this restaurant again. The service was terrible."))


Positive
Negative
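
A possible refinement, sketched here on invented toy data, is to bundle the vectorizer and classifier into a single scikit-learn pipeline and to report a probability alongside the label. In the notebook the vectorizer is fitted on the full corpus before the train/test split; with a pipeline fitted on training text only, the vocabulary is learned from the training set alone. `predict_with_confidence` is a hypothetical helper, and text cleaning is omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (1 = positive, 0 = negative).
train_texts = [
    "great food loved it",
    "wonderful tasty fresh",
    "terrible service never again",
    "nasty bland awful",
]
train_labels = [1, 1, 0, 0]

# One object that vectorizes raw text and then classifies it.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

def predict_with_confidence(review: str):
    # predict_proba returns one row per input with a probability per class.
    proba = model.predict_proba([review])[0]
    best = proba.argmax()
    label = "Positive" if model.classes_[best] == 1 else "Negative"
    return label, float(proba[best])

print(predict_with_confidence("wonderful food"))
print(predict_with_confidence("terrible service"))
```

Naive Bayes probabilities tend to be poorly calibrated, so the second value is better read as a rough ranking signal than as a true likelihood.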
