# Day 64 – How Machine Learning Models are Implemented in NLP


## 1. Introduction

Today, I'll explore how **Machine Learning (ML)** techniques can be applied to **Natural Language Processing (NLP)** tasks.
NLP allows computers to understand and analyze human language, and ML provides the intelligence that helps make predictions and decisions based on textual data.

In this notebook, I'll work with a **Restaurant Reviews dataset** to predict whether a customer review is **positive** or **negative**.
I'll apply traditional ML models to perform **sentiment analysis**, which helps understand customer opinions and feedback automatically.

By the end of this session, I’ll understand how NLP data flows through a typical ML pipeline — from text preprocessing to model evaluation.

---

## 2. NLP + ML Pipeline Overview

Implementing an NLP task using Machine Learning involves several key steps, which together form a complete **ML workflow**.
Here’s an overview of the **NLP + ML pipeline**:

### Step 1: Import Libraries and Dataset

Import essential libraries such as:

* `pandas`, `numpy` – for data handling and analysis.
* `matplotlib`, `seaborn` – for visualization.
* `sklearn` – for ML model building and evaluation.
* `nltk` or `re` – for text preprocessing.

Load the **Restaurant Reviews dataset** using `pandas.read_csv()` or any data source.

---

### Step 2: Text Cleaning and Preprocessing

Before feeding text data into a model, it must be cleaned and standardized.
Common preprocessing steps include:

1. **Removing punctuation, numbers, and special characters.**
2. **Converting all text to lowercase.**
3. **Removing stopwords** (like “is”, “the”, “and”) that don’t add much meaning.
4. **Stemming or Lemmatization** – reducing words to their base or root form (e.g., “loved” → “love”).

This step helps convert messy text into structured, meaningful data that ML models can interpret.
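The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a tiny hand-written stopword set (`STOPWORDS`) and a hypothetical helper name (`clean_review`); it skips stemming, which the notebook later handles with NLTK's `PorterStemmer`.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
STOPWORDS = {"is", "the", "and", "this", "was", "a", "not"}

def clean_review(text, stopwords=STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

print(clean_review("Wow... Loved this place."))   # wow loved place
print(clean_review("Crust is not good."))         # crust good
```

Note how dropping "not" makes a negative review look positive. Keeping negation words out of the stopword set is a common adjustment worth trying.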

---

### Step 3: Feature Extraction

ML models can’t understand raw text — so we must convert it into numerical features.
This process is called **vectorization** or **feature extraction**.

Common methods:

* **Bag of Words (BoW):** Counts how often each word appears.
* **TF-IDF (Term Frequency–Inverse Document Frequency):** Weighs words by their importance in a document relative to the corpus.

The result is a **sparse matrix** where each row represents a review and each column represents a word feature.
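
To make the matrix layout concrete, here is a small Bag of Words sketch in plain Python. The three cleaned reviews and the helper name `bag_of_words` are illustrative assumptions; the notebook itself uses scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["wow love place", "crust good", "love love crust"]
vocab, matrix = bag_of_words(docs)
print(vocab)    # ['crust', 'good', 'love', 'place', 'wow']
for row in matrix:
    print(row)
```

Most entries are zero even in this toy case, which is why real vectorizers return a sparse matrix.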

---

### Step 4: Splitting the Dataset

Split the dataset into:

* **Training set:** Used to train the ML model (usually 70–80% of data).
* **Testing set:** Used to evaluate model performance (20–30% of data).

---

### Step 5: Training Machine Learning Models

* Train different ML models to classify the sentiment.
* Each model learns patterns from the training data and builds decision boundaries to predict sentiments.

---

### Step 6: Model Evaluation

After training, evaluate model performance using:

* **Accuracy Score** – measures how often the model is correct.
* **Confusion Matrix (CM)** – shows how many reviews were correctly and incorrectly classified.
* **Classification Report** – includes precision, recall, and F1-score.

These metrics help determine whether the model is performing well or needs improvements.
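
For reference, these are the standard definitions behind the metrics, written in terms of true/false positives and negatives ($TP$, $TN$, $FP$, $FN$):

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$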

---

### Step 7: Result Comparison and Insights

Compare all models based on accuracy and confusion matrices.
This helps identify which model performs best for your dataset.

---

### Step 8: Insights and Conclusion

* Identify which model achieves the best accuracy.
* Analyze common misclassifications using the confusion matrix.
* Observe whether the model shows **high bias** (underfits: low accuracy even on the training data) or **high variance** (overfits: training accuracy far above test accuracy).

---

## Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Load the dataset

In [2]:
dataset = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\Restaurant_Reviews.tsv", delimiter = '\t', quoting = 3)
dataset.head()

|   | Review                                             | Liked |
|---|----------------------------------------------------|-------|
| 0 | Wow... Loved this place.                           | 1     |
| 1 | Crust is not good.                                 | 0     |
| 2 | Not tasty and the texture was just nasty.          | 0     |
| 3 | Stopped by during the late May bank holiday of...  | 1     |
| 4 | The selection on the menu was great and so wer...  | 1     |


## Text Cleaning & Preprocessing

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# nltk.download('stopwords')  # needed once if the stopword list is not installed

corpus = []
ps = PorterStemmer()
# Build the stopword set once instead of once per word.
# Note: NLTK's English list includes "not", so negations are dropped.
stop_words = set(stopwords.words('english'))

for i in range(len(dataset)):
    # Keep only letters
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # Lowercase
    review = review.lower()
    # Tokenize
    review = review.split()
    # Stemming + Stopword Removal
    review = [ps.stem(word) for word in review if word not in stop_words]
    # Join back into string
    review = ' '.join(review)
    corpus.append(review)

corpus[:15]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch',
 'servic prompt',
 'would go back',
 'cashier care ever say still end wayyy overpr',
 'tri cape cod ravoli chicken cranberri mmmm',
 'disgust pretti sure human hair']

## Feature Extraction (Creating the Bag of Words model)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)   
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

## Splitting the dataset into the Training set and Test set

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Train Machine Learning Model (Decision Tree)

In [6]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train) 

## Predicting the Test set results

In [7]:
y_pred = classifier.predict(X_test)

# Confusion Matrix & Accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", cm)
print("Accuracy:", ac)

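# Training vs. test accuracy
# (printed as "Bias" and "Variance" as shorthand; both are accuracy scores,
# and it is the gap between them that signals overfitting)
bias = classifier.score(X_train,y_train)
variance = classifier.score(X_test,y_test)
print("Bias (Training Score):", bias)
print("Variance (Test Score):", variance)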

Confusion Matrix:
 [[71 26]
 [38 65]]
Accuracy: 0.68
Bias (Training Score): 0.99625
Variance (Test Score): 0.68


---

## 3. Results Interpretation

### Confusion Matrix

The confusion matrix summarizes how well the model classified positive and negative reviews.

The rows of scikit-learn's confusion matrix are the true labels and the columns are the predicted labels, ordered `0` (negative) then `1` (positive). Reading `[[71 26], [38 65]]` that way:

* **True Negatives (TN = 71)** → Negative reviews correctly identified as negative.
* **False Positives (FP = 26)** → Negative reviews wrongly predicted as positive.
* **False Negatives (FN = 38)** → Positive reviews wrongly predicted as negative.
* **True Positives (TP = 65)** → Positive reviews correctly identified as positive.

The model performs somewhat **better at detecting negative reviews** than positive ones: it catches 71 of 97 negatives (about 73%) but only 65 of 103 positives (about 63%).


### Accuracy

The model’s accuracy gives an overall idea of how many predictions were correct out of all predictions made.

$$
\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{65 + 71}{200} = 0.68
$$

* The model achieved an **accuracy of 68%**, which indicates a moderate level of performance.
* While it can distinguish between positive and negative reviews to some extent, there’s still plenty of room for improvement.
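
Accuracy alone hides how the errors are distributed. As a rough sketch, the other metrics can be computed by hand from the four cells of the matrix printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, class `0` first). The variable names are my own.

```python
# Cells of the confusion matrix [[71, 26], [38, 65]] reported above,
# read with rows = true labels and columns = predicted labels.
tn, fp = 71, 26
fn, tp = 38, 65

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # of reviews predicted positive, the share that really were
recall = tp / (tp + fn)         # of truly positive reviews, the share that was found
specificity = tn / (tn + fp)    # of truly negative reviews, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

Recall for the positive class (about 0.63) comes out lower than specificity (about 0.73), which is the same imbalance the confusion matrix shows.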


### Training Score (printed as "Bias") → **0.99625**

* The model performs almost perfectly on training data, scoring around **99.6%**.
* This means it has learned the training examples extremely well, possibly **too well**, which is often a warning sign that the model is not generalizing properly.
* Strictly speaking, a near-perfect training score indicates **low bias**: the tree is flexible enough to fit almost every training example.

### Test Score (printed as "Variance") → **0.68**

* When tested on unseen data, the performance drops sharply to **68%**.
* This gap of roughly 32 percentage points between training and test accuracy suggests the model **memorizes training data** rather than learning general patterns.


### Diagnosis

* The large difference between training and test performance indicates **overfitting (high variance)**.
* The model has become too specific to the training data, capturing noise instead of meaningful insights.
* As a result, it performs well during training but struggles to handle new, unseen examples.
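
One way to see this gap directly is to compare training and test accuracy for an unrestricted tree and a depth-limited one. This is a minimal sketch, assuming synthetic data from `make_classification` in place of the review features; the depth value `3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the review features.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    scores[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

The unrestricted tree fits its training data perfectly; whether limiting depth helps on the test set depends on the data, so it is worth checking rather than assuming.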

---

## Improving Model Accuracy

Since the Decision Tree model currently achieves around **68% accuracy**, it needs optimization to reach a more reliable performance level.

To improve results, the following steps should be taken:

1. **Experiment with multiple classification models** such as Logistic Regression, Naïve Bayes, Random Forest, and SVM.
2. **Maintain the same train/test split** to ensure that all models are compared under identical conditions.
3. **Compare accuracy and confusion matrices** across different algorithms to identify the best performer.
4. **Tune hyperparameters** using techniques like Grid Search or Random Search to find the optimal model settings.

**Target:** Achieve at least **80% accuracy** while reducing the gap between training and test performance — ensuring the model generalizes well and avoids overfitting.
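
Hyperparameter tuning (step 4) could look roughly like the sketch below. It assumes a tiny hand-made corpus and an arbitrary grid over Naive Bayes' `alpha`; the names `texts`, `labels`, and `search` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumption: a tiny hand-made corpus, only to show the mechanics.
texts = [
    "love place", "great food", "great servic", "tasti fresh",
    "amaz staff", "love food", "bad servic", "nasti textur",
    "slow rude", "terribl food", "bad place", "rude staff",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
param_grid = {"classifier__alpha": [0.1, 0.5, 1.0]}

# 3-fold cross-validation over every value in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the validation fold never influences the vocabulary.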

---

## Import different classifiers

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

# Store models in dictionary
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0, max_depth=10),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (Linear)": SVC(kernel='linear'),
    "SVM (RBF)": SVC(kernel='rbf'),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

results = {}

## Train and evaluate each model

In [9]:
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    
    results[name] = acc
    
    print(f"\n {name}")
    print("Accuracy:", acc)
    print("Confusion Matrix:\n", cm)


 Decision Tree
Accuracy: 0.69
Confusion Matrix:
 [[94  3]
 [59 44]]

 Naive Bayes
Accuracy: 0.765
Confusion Matrix:
 [[72 25]
 [22 81]]

 Logistic Regression
Accuracy: 0.71
Confusion Matrix:
 [[76 21]
 [37 66]]

 Random Forest
Accuracy: 0.715
Confusion Matrix:
 [[86 11]
 [46 57]]

 SVM (Linear)
Accuracy: 0.72
Confusion Matrix:
 [[76 21]
 [35 68]]

 SVM (RBF)
Accuracy: 0.73
Confusion Matrix:
 [[90  7]
 [47 56]]

 KNN
Accuracy: 0.63
Confusion Matrix:
 [[83 14]
 [60 43]]


## Results Comparison

In [10]:
# Compare results
results_df = pd.DataFrame(list(results.items()), columns=["Model", "Accuracy"])
results_df = results_df.sort_values(by="Accuracy", ascending=False)
results_df

|   | Model               | Accuracy |
|---|---------------------|----------|
| 1 | Naive Bayes         | 0.765    |
| 5 | SVM (RBF)           | 0.730    |
| 4 | SVM (Linear)        | 0.720    |
| 3 | Random Forest       | 0.715    |
| 2 | Logistic Regression | 0.710    |
| 0 | Decision Tree       | 0.690    |
| 6 | KNN                 | 0.630    |


## Insights

* **Naive Bayes** performed the best with an accuracy of **76.5%**, showing that it handles **Bag-of-Words** count features well.
* **SVM (RBF and Linear)** came close, achieving **72–73%**.
* **Random Forest** and **Logistic Regression** achieved about **71%**, offering stable but moderate performance.
* **Decision Tree** reached **69%** even with `max_depth=10`; its confusion matrix shows it missing 59 of 103 positive reviews.
* **KNN** scored **63%**, the lowest among all models, as distance-based methods tend to struggle with high-dimensional, sparse text data.


## Conclusion

* The initial **Decision Tree model (68%)** showed overfitting and limited accuracy; trying multiple algorithms raised the best accuracy by about 8.5 percentage points.
* **Naive Bayes (76.5%)** emerged as the most accurate model for this sentiment classification task.
* Although accuracy improved, the target of **80%** has not yet been reached. Further optimization such as **hyperparameter tuning**, **feature engineering**, and **ensemble techniques** can help push model performance beyond this threshold.

---


## Build the Model with TF-IDF Vectorizer

Up to this point, I used the **Bag of Words (CountVectorizer)** approach, which simply counts how many times each word appears in a review.
While effective, it doesn’t take into account how **important** or **unique** a word is across the entire dataset.

To improve this, I'll now use **TF-IDF (Term Frequency – Inverse Document Frequency)**, which provides a better way of representing text data numerically.

### Key Concepts

* **Term Frequency (TF):**
  Measures how frequently a word appears in a single document (review).
  The more times a word appears, the higher its TF value.

* **Inverse Document Frequency (IDF):**
  Measures how rare or unique a word is across all documents.
  Common words like *“the”*, *“is”*, and *“a”* get a **lower IDF score**, while rare and meaningful words like *“delicious”* or *“terrible”* get **higher scores**.

* **TF-IDF Value:**
  Combines both TF and IDF to highlight **important and distinctive words** in each review while reducing the impact of common or irrelevant ones.
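
The three concepts above can be worked through by hand. The sketch below uses the smoothed IDF variant that scikit-learn's `TfidfVectorizer` applies by default, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and skips the row normalization that the vectorizer also performs. The three toy documents and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

docs = [
    "food good".split(),
    "food bad".split(),
    "food delici".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed IDF, the variant scikit-learn's TfidfVectorizer uses by default.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    """Raw term count times IDF for each word in one document (no normalization)."""
    counts = Counter(doc)
    return {word: counts[word] * idf(word) for word in counts}

for doc in docs:
    print({word: round(weight, 3) for word, weight in tfidf(doc).items()})
```

"food" appears in every document and gets the minimum weight of 1.0, while words that occur in only one document are weighted higher.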

### Why Use TF-IDF?

Unlike the simple word count method, TF-IDF weights each word by **how distinctive it is**, not only by how often it appears.
This often helps linear models such as **Logistic Regression** and **SVM**, and it can also be used with **Naïve Bayes**, though an improvement is not guaranteed and should be checked on the dataset at hand.

---

## Import models again

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

# Define models
models_tfidf = {
    "Decision Tree": DecisionTreeClassifier(random_state=0, max_depth=10),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (Linear)": SVC(kernel='linear'),
    "SVM (RBF)": SVC(kernel='rbf'),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

results_tfidf = {}

## Train and evaluate

In [12]:
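# Caution: X_train and X_test still hold the Bag-of-Words counts from In [5].
# No TfidfVectorizer has been fitted at this point, so this loop repeats the BoW experiment.
for name, model in models_tfidf.items():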
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    results_tfidf[name] = acc
    
    print(f"\n {name}")
    print("Accuracy:", acc)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


 Decision Tree
Accuracy: 0.69
Confusion Matrix:
 [[94  3]
 [59 44]]

 Naive Bayes
Accuracy: 0.765
Confusion Matrix:
 [[72 25]
 [22 81]]

 Logistic Regression
Accuracy: 0.71
Confusion Matrix:
 [[76 21]
 [37 66]]

 Random Forest
Accuracy: 0.715
Confusion Matrix:
 [[86 11]
 [46 57]]

 SVM (Linear)
Accuracy: 0.72
Confusion Matrix:
 [[76 21]
 [35 68]]

 SVM (RBF)
Accuracy: 0.73
Confusion Matrix:
 [[90  7]
 [47 56]]

 KNN
Accuracy: 0.63
Confusion Matrix:
 [[83 14]
 [60 43]]


## Results with TF-IDF

In [13]:
results_df_tfidf = pd.DataFrame(list(results_tfidf.items()), columns=["Model", "Accuracy"])
results_df_tfidf = results_df_tfidf.sort_values(by="Accuracy", ascending=False)
results_df_tfidf

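|   | Model               | Accuracy |
|---|---------------------|----------|
| 1 | Naive Bayes         | 0.765    |
| 5 | SVM (RBF)           | 0.730    |
| 4 | SVM (Linear)        | 0.720    |
| 3 | Random Forest       | 0.715    |
| 2 | Logistic Regression | 0.710    |
| 0 | Decision Tree       | 0.690    |
| 6 | KNN                 | 0.630    |

These numbers are identical to the Bag of Words run because the models were trained on the same `X_train`/`X_test` arrays; the features were not re-extracted with `TfidfVectorizer` for this experiment.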
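
For this comparison to measure TF-IDF, the split and the features have to be rebuilt with `TfidfVectorizer` before the training loop runs. A minimal sketch, assuming a toy cleaned corpus in place of the notebook's `corpus`; the `_toy` and `_tfidf` names and the labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Assumption: a toy cleaned corpus and made-up labels stand in for the real data.
corpus_toy = [
    "wow love place", "crust good", "tasti textur nasti", "select menu great price",
    "fri great", "great touch", "servic prompt", "would go back",
    "disgust pretti sure human hair", "get angri want damn pho",
]
labels_toy = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# Split the raw text first, then fit the vectorizer on the training part only.
text_train, text_test, y_train_toy, y_test_toy = train_test_split(
    corpus_toy, labels_toy, test_size=0.20, random_state=0
)
tfidf_toy = TfidfVectorizer(max_features=1500)
X_train_tfidf = tfidf_toy.fit_transform(text_train)  # learns vocabulary and IDF weights
X_test_tfidf = tfidf_toy.transform(text_test)        # reuses them, learns nothing new

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

The models would then be fitted on `X_train_tfidf` and scored on `X_test_tfidf` instead of the count-based arrays.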



---

## Increasing Dataset Size by Duplication

So far, I’ve trained the models on the **original restaurant reviews dataset**, which is relatively small.
A small dataset can limit a model’s ability to learn diverse patterns and can often lead to **overfitting** — where the model performs well on training data but poorly on unseen data.

To experiment with a larger dataset, we can **artificially increase the dataset size** by **duplicating existing samples**.
This technique doesn’t add new information: every model sees exactly the same reviews, each one simply repeated.

### Why Duplicate Data?

* When you have a **limited dataset**, duplicating records increases the sample count, which is a quick way to test how the pipeline scales.
* For most of the models used here, training on exact copies behaves much like giving each original review a larger weight, so it does not teach the model anything new.
* **Caution:** if the duplication happens *before* the train/test split, copies of the same review can land in both sets. The test score then partly measures memorization, not generalization.

### Important Note

* Duplication should be used **only as a temporary technique** to test scalability or model consistency.
* For long-term improvement, consider **data augmentation techniques**, such as:

  * **Synonym replacement** (e.g., “good” → “great”)
  * **Back translation** (translating to another language and back)
  * **Adding noise** or small wording variations
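
As a rough illustration of synonym replacement, the sketch below swaps words using a small lookup table. The table `SYNONYMS` and the helper `replace_synonyms` are hypothetical; a real setup might draw synonyms from a thesaurus.

```python
# Assumption: a tiny hand-written synonym table, for illustration only.
SYNONYMS = {"good": "great", "bad": "awful", "tasty": "delicious"}

def replace_synonyms(review, synonyms=SYNONYMS):
    """Swap each word that has an entry in the table; leave other words unchanged."""
    return " ".join(synonyms.get(word, word) for word in review.split())

original = "the food was good but the service was bad"
augmented = replace_synonyms(original)
print(augmented)  # the food was great but the service was awful
```

Augmented copies belong in the training set only, so the test set keeps measuring performance on untouched reviews.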

After duplicating the dataset, retrain all models to see whether the **accuracy changes** and whether the **gap between training and test scores** shrinks.

---
The dataset currently has 1000 reviews.  
To experiment with a larger dataset, we can **concatenate three copies of it** (1000 → 3000 samples).  

This does not add new information; it only repeats each review three times.


In [14]:
dataset.shape

(1000, 2)

In [16]:
# Duplicate dataset 3 times (1000 -> 3000)
dataset_expanded = pd.concat([dataset]*3, ignore_index=True)

print("Original size:", len(dataset))
print("Expanded size:", len(dataset_expanded))

dataset_expanded.head()

Original size: 1000
Expanded size: 3000


|   | Review                                             | Liked |
|---|----------------------------------------------------|-------|
| 0 | Wow... Loved this place.                           | 1     |
| 1 | Crust is not good.                                 | 0     |
| 2 | Not tasty and the texture was just nasty.          | 0     |
| 3 | Stopped by during the late May bank holiday of...  | 1     |
| 4 | The selection on the menu was great and so wer...  | 1     |


Now, instead of using `dataset`, I will use `dataset_expanded` for preprocessing, feature extraction (TF-IDF), and model training.  

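This simulates a larger dataset. Because the copies are created before the train/test split, most test reviews will also have a copy in the training set, so the resulting accuracy should be read with caution.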
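
The order of duplicating and splitting matters. The sketch below measures how many test items also appear in the training set under both orders, assuming 200 placeholder strings in place of real reviews; `leaked_fraction` and `shuffled_split` are hypothetical helpers written for this example.

```python
import random

def leaked_fraction(train, test):
    """Share of test items whose exact text also appears in the training set."""
    train_set = set(train)
    return sum(item in train_set for item in test) / len(test)

def shuffled_split(items, test_ratio=0.20, seed=0):
    """Shuffle a copy of items and cut it into (train, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) - round(len(items) * test_ratio)
    return items[:cut], items[cut:]

# Assumption: 200 unique placeholder strings stand in for real reviews.
reviews = [f"review {i}" for i in range(200)]

# Option A: duplicate first, then split (copies can land on both sides).
train_a, test_a = shuffled_split(reviews * 3)

# Option B: split first, then duplicate only the training part.
train_b, test_b = shuffled_split(reviews)
train_b = train_b * 3

print("duplicate then split:", leaked_fraction(train_a, test_a))
print("split then duplicate:", leaked_fraction(train_b, test_b))
```

With Option B the test set stays free of training copies, so its accuracy still reflects unseen reviews.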

## Apply All ML Algorithms with TF-IDF on Expanded Dataset

I expanded the dataset from **1000 → 3000 reviews** by duplicating entries.  
Now, I'll:  

1. Preprocess the expanded dataset.  
2. Convert reviews into **TF-IDF features**.  
3. Train multiple ML classifiers.  
4. Compare their accuracy results.  

## Text Cleaning & Preprocessing on expanded dataset

In [17]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus_expanded = []

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, not once per word
for i in range(len(dataset_expanded)):
    review = re.sub('[^a-zA-Z]', ' ', dataset_expanded['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus_expanded.append(review)

## TF-IDF Feature Extraction

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1,2))  # using unigrams + bigrams
X = tfidf.fit_transform(corpus_expanded).toarray()
y = dataset_expanded.iloc[:, 1].values

# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

##  Train Multiple ML Models

In [19]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Define models
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM (Linear)": SVC(kernel='linear'),
    "SVM (RBF)": SVC(kernel='rbf'),
    "Decision Tree": DecisionTreeClassifier(max_depth=20, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

results_expanded = {}

# Train & evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results_expanded[name] = acc
    print(f"\n {name}")
    print("Accuracy:", acc)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


 Naive Bayes
Accuracy: 0.96
Confusion Matrix:
 [[273  14]
 [ 10 303]]

 Logistic Regression
Accuracy: 0.9633333333333334
Confusion Matrix:
 [[280   7]
 [ 15 298]]

 Random Forest
Accuracy: 0.9816666666666667
Confusion Matrix:
 [[282   5]
 [  6 307]]

 SVM (Linear)
Accuracy: 0.9683333333333334
Confusion Matrix:
 [[279   8]
 [ 11 302]]

 SVM (RBF)
Accuracy: 0.9816666666666667
Confusion Matrix:
 [[281   6]
 [  5 308]]

 Decision Tree
Accuracy: 0.8016666666666666
Confusion Matrix:
 [[281   6]
 [113 200]]

 KNN
Accuracy: 0.5433333333333333
Confusion Matrix:
 [[281   6]
 [268  45]]


## Compare Results

In [20]:
results_df_expanded = pd.DataFrame(list(results_expanded.items()), columns=["Model", "Accuracy"])
results_df_expanded = results_df_expanded.sort_values(by="Accuracy", ascending=False)
results_df_expanded

|   | Model               | Accuracy |
|---|---------------------|----------|
| 2 | Random Forest       | 0.981667 |
| 4 | SVM (RBF)           | 0.981667 |
| 3 | SVM (Linear)        | 0.968333 |
| 1 | Logistic Regression | 0.963333 |
| 0 | Naive Bayes         | 0.960000 |
| 5 | Decision Tree       | 0.801667 |
| 6 | KNN                 | 0.543333 |


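On paper, that’s a **major jump**: after expanding the dataset (3×) and using **TF-IDF with unigrams and bigrams**, most models score far higher. Because the reviews were duplicated before the split, though, most test reviews also have a copy in the training set, so these scores largely reflect how well each model recalls reviews it has already seen.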

---

## Insights

* **Random Forest** and **SVM (RBF)** scored highest, each reaching about **98.2% accuracy**.
* **SVM (Linear)** and **Logistic Regression** landed in the **96–97% range**.
* **Naive Bayes** followed closely with **96% accuracy**.
* **Decision Tree** improved to **80%**, but it still misses 113 of 313 positive reviews.
* **KNN** dropped to about **54%**, the weakest result: it labeled 268 of 313 positive reviews as negative, consistent with its limitations on sparse, high-dimensional text data.


## Conclusion

* Using **TF-IDF with bigrams** on the duplicated dataset raised the best reported accuracy from about **76% to 98%**, but the two numbers are not directly comparable: the earlier test set contained only unseen reviews, while this one shares most of its reviews with the training set.
* On this setup, **SVM (RBF)** and **Random Forest** scored highest; confirming that ranking requires a test set with no duplicated reviews.
* **Naive Bayes** and **Logistic Regression** continue to serve as fast, solid baseline models.

---

## Model Performance Across Different Experiments

To understand how text representation and dataset size affect performance, three experimental setups were compared:

1. **Bag of Words (BoW)** – Original dataset (**1000 reviews**)
2. **TF-IDF** – Original dataset (**1000 reviews**); as run, this setup reused the BoW feature matrix, so its numbers match setup 1
3. **TF-IDF (Unigrams + Bigrams)** – Expanded dataset (**3000 reviews**)

## Final Accuracy Comparison

| Model               | BoW (1000) | TF-IDF (1000) | TF-IDF (3000, expanded) |
|---------------------|------------|---------------|-------------------------|
| Naive Bayes         | **0.765**  | **0.765**     | 0.9600                  |
| Logistic Regression | 0.710      | 0.710         | 0.9633                  |
| Random Forest       | 0.715      | 0.715         | **0.9817**              |
| SVM (Linear)        | 0.720      | 0.720         | 0.9683                  |
| SVM (RBF)           | 0.730      | 0.730         | **0.9817**              |
| Decision Tree       | 0.690      | 0.690         | 0.8017                  |
| KNN                 | 0.630      | 0.630         | 0.5433                  |



##  Step-by-Step Insights

### **1. Bag of Words (BoW, 1000 samples)**

* Best performer: **Naive Bayes (76.5%)**.
* Four of the seven models landed between **71% and 73%**; Decision Tree (69%) and KNN (63%) trailed.
* Raw counts treat every word as equally informative, which may be one reason accuracy plateaued.


### **2. TF-IDF (1000 samples)**

* TF-IDF is meant to improve the representation by assigning higher weights to distinctive words.
* In this run, however, the models were trained on the same BoW feature matrix, so every accuracy is identical to setup 1 (**63–76.5%**).
* **Naive Bayes** stayed on top at **76.5%**; no model's score changed, since the inputs did not.
* None of the models surpassed the **80% threshold**.


### **3. TF-IDF + Expanded Dataset (3000 samples, with bigrams)**

* The combination of **TF-IDF with bigrams** and dataset duplication produced a **dramatic increase in reported accuracy** for most models.
* **Random Forest** and **SVM (RBF)** topped the list with **~98.2% accuracy**.
* **SVM (Linear)** and **Logistic Regression** followed closely with **~96–97%**.
* **Naive Bayes** reached **96%**.
* **Decision Tree** reached **80%**, up from 69%.
* **KNN** fell to about **54%**, further suggesting it’s not ideal for this type of data.
* These scores come from a test set that overlaps with the training set (the duplication happened before the split), so they overstate how the models would do on new reviews.

---

## Final Conclusion

* **Representation matters:** TF-IDF with unigrams and bigrams gives the models richer features than raw counts, although this notebook did not isolate its effect (the TF-IDF run on 1000 reviews reused the BoW features).
* **Data quantity matters, but duplication is not new data:** tripling the dataset by copying raised the scores mainly because test reviews also appeared in training. Genuinely new reviews or augmentation are needed to improve generalization.
* **Top models:** **Random Forest** and **SVM (RBF)** delivered the highest scores on the expanded dataset (~98%).
* **Fast baselines:** **Naive Bayes** and **Logistic Regression** provided quick yet solid results (~96%).
* **Weak performers:** **KNN** struggled with sparse data (63% → 54%), while **Decision Tree** trailed every model except KNN in all experiments.

---


## Final Summary

In this session, I learned how to apply **Machine Learning models** to **NLP tasks** using the **Restaurant Reviews dataset** to predict sentiments.

I implemented models with **Bag of Words** and **TF-IDF** features, and later raised the reported accuracy by **expanding the dataset** through duplication.

Different algorithms like **Naive Bayes**, **Logistic Regression**, **SVM**, **Random Forest**, **Decision Tree**, and **KNN** were trained and compared to find the best performer.



## Key Learning

* Understood the complete **NLP + ML workflow** — from preprocessing to evaluation.
* Learned how **TF-IDF** weights words differently from Bag of Words, and that the features must actually be re-extracted for the comparison to count.
* Saw how **duplicating the dataset** before splitting inflates accuracy, because test reviews leak into the training set.
* **Random Forest** and **SVM (RBF)** gave the highest reported accuracy (~98%) on the expanded dataset.
* **Naive Bayes** and **Logistic Regression** were strong and fast (~96%).
* **Decision Tree** overfit in the first experiment, and **KNN** performed poorly on text data throughout.
* Realized that **data representation**, **data size**, and a **clean train/test split** all matter when judging NLP models.

---
