## 🧾 **Example**

Suppose the original review is:

> `"I do not like this movie. It was boring and dull!"`

### Step-by-Step:

| Step                            | Result                                                                              |
| ------------------------------- | ----------------------------------------------------------------------------------- |
| Remove punctuation              | `"I do not like this movie It was boring and dull"`                                 |
| Lowercase                       | `"i do not like this movie it was boring and dull"`                                 |
| Tokenize                        | `['i', 'do', 'not', 'like', 'this', 'movie', 'it', 'was', 'boring', 'and', 'dull']` |
| Remove stopwords (except 'not') | `['not', 'like', 'movie', 'boring', 'dull']`                                        |
| Stemming                        | `['not', 'like', 'movi', 'bore', 'dull']`                                           |
| Final cleaned string            | `'not like movi bore dull'`                                                         |

The final cleaned string is what gets appended to the `corpus` list.
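
The first four rows of the table can be reproduced with the standard library alone. This is a minimal sketch rather than the notebook's own code: `STOPWORDS` is a hypothetical, hand-picked subset that only covers this one sentence (the notebook uses NLTK's full English list minus `'not'`), and `clean_without_stemming` is an illustrative name. The stemming row is left out because it needs NLTK's `PorterStemmer`.

```python
import re

# Hypothetical stopword subset, just enough for this sentence.
# 'not' is deliberately absent so that the negation survives.
STOPWORDS = {"i", "do", "this", "it", "was", "and"}

def clean_without_stemming(text):
    """Apply the table's steps up to (but not including) stemming."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # punctuation -> spaces
    tokens = letters_only.lower().split()            # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS] # drop stopwords

tokens = clean_without_stemming("I do not like this movie. It was boring and dull!")
print(tokens)  # ['not', 'like', 'movie', 'boring', 'dull']
```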

---

## 🧮 **Bag of Words Representation**

Imagine you have 3 processed reviews:

1. `'love movi'`
2. `'hate movi'`
3. `'not like movi bore dull'`

Vocabulary (unique words):
`['love', 'hate', 'not', 'like', 'movi', 'bore', 'dull']`

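Now we represent each review as a vector of word counts. In the table below each column is one review; every word occurs at most once per review here, so the counts coincide with simple **word presence**: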

| Word | R1 | R2 | R3 |
| ---- | -- | -- | -- |
| love | 1  | 0  | 0  |
| hate | 0  | 1  | 0  |
| not  | 0  | 0  | 1  |
| like | 0  | 0  | 1  |
| movi | 1  | 1  | 1  |
| bore | 0  | 0  | 1  |
| dull | 0  | 0  | 1  |
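
The columns of this table can be computed by hand in a few lines. The sketch below assumes the vocabulary order shown above and uses an illustrative helper name, `to_count_vector`; it is not how scikit-learn implements the step internally.

```python
# Vocabulary and reviews copied from the example above
vocabulary = ['love', 'hate', 'not', 'like', 'movi', 'bore', 'dull']
reviews = ['love movi', 'hate movi', 'not like movi bore dull']

def to_count_vector(review, vocab):
    """Count how often each vocabulary word occurs in one review."""
    tokens = review.split()
    return [tokens.count(word) for word in vocab]

vectors = [to_count_vector(r, vocabulary) for r in reviews]
for name, vec in zip(['R1', 'R2', 'R3'], vectors):
    print(name, vec)
# R1 [1, 0, 0, 0, 1, 0, 0]
# R2 [0, 1, 0, 0, 1, 0, 0]
# R3 [0, 0, 1, 1, 1, 1, 1]
```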

---

## 📊 Diagram (Conceptual Visualization)

```
           ┌─────────────────────────────────────┐
           │         Corpus of Documents         │
           └─────────────────────────────────────┘
                        ↓
              Cleaned using NLP steps:
             - Lowercasing
             - Removing punctuation
             - Removing stopwords
             - Stemming
                        ↓
               Cleaned Text in Corpus
                        ↓
              Bag of Words Matrix Created
                        ↓
┌────────────┬────┬────┬────┬────┬────┐
│ Vocabulary │ R1 │ R2 │ R3 │ .. │ RN │
├────────────┼────┼────┼────┼────┼────┤
│ love       │  1 │  0 │  0 │    │    │
│ hate       │  0 │  1 │  0 │    │    │
│ movi       │  1 │  1 │  1 │    │    │
│ ...        │    │    │    │    │    │
└────────────┴────┴────┴────┴────┴────┘
```



### Importing the libraries

In [105]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

In [106]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
# quoting=3 (csv.QUOTE_NONE) makes pandas treat double quotes as ordinary characters,
# so quotation marks inside the reviews do not break the parsing
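
The value `3` is the standard library constant `csv.QUOTE_NONE`. Its effect can be seen without pandas using the `csv` module on an in-memory sample. The two data rows and the `Liked` column name below are made up for illustration and are not taken from the real file.

```python
import csv
import io

# Made-up sample in the same tab-separated layout as the dataset
sample = 'Review\tLiked\n"Wow" is all I can say.\t1\nCrust is not good.\t0\n'

print(csv.QUOTE_NONE)  # 3

# With QUOTE_NONE, double quotes are kept as plain characters in the field
rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))
for row in rows:
    print(row)
```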

### Cleaning the texts

In [107]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []  # Will hold the cleaned and processed reviews

# Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')  # Common words like "the", "is", etc.
all_stopwords.remove('not')  # Keep 'not' for sentiment analysis
all_stopwords = set(all_stopwords)  # Set membership tests are fast

for i in range(len(dataset)):  # 1000 reviews in this dataset
  # Replace everything that is not a letter (digits, punctuation) with a space
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
  review = review.lower()  # Convert to lowercase
  review = review.split()  # Tokenize the review into words

  # Drop stopwords, then stem the remaining words
  review = [ps.stem(word) for word in review if word not in all_stopwords]

  review = ' '.join(review)  # Rejoin cleaned words into one string
  corpus.append(review)  # Add to final list


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abhis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Creating the Bag of Words model

In [108]:
from sklearn.feature_extraction.text import CountVectorizer

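# Convert the text corpus into a matrix of token counts (Bag of Words model)
# ngram_range=(1, 2) counts single words and two-word sequences (e.g. "not like");
# max_features=1500 keeps only the 1500 most frequent of these terms to reduce dimensionality
cv = CountVectorizer(max_features=1500, ngram_range=(1, 2))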

# Fit the model on the corpus and transform the text data into numerical feature vectors
X = cv.fit_transform(corpus).toarray()

# Extract the labels (target values) from the last column of the dataset
y = dataset.iloc[:, -1].values
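
To make the effect of `ngram_range=(1, 2)` concrete, the sketch below lists the terms it would extract from one cleaned review: every single word plus every pair of adjacent words. The helper `word_ngrams` is an illustrative stand-in written for this note, not part of scikit-learn.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

terms = word_ngrams('not like movi bore dull')
print(terms)
# ['not', 'like', 'movi', 'bore', 'dull',
#  'not like', 'like movi', 'movi bore', 'bore dull']
```

The bigram `'not like'` is the reason for keeping `'not'` during cleaning: as a single feature it can carry the negation that the separate words `'not'` and `'like'` lose.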

### Splitting the dataset into the Training set and Test set

In [109]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Training the Naive Bayes model on the Training set

In [110]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)


# Alternatives that were also tried (kept commented out):
#
# Decision Tree (CART): 72% accuracy
# from sklearn.tree import DecisionTreeClassifier
# classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
# classifier.fit(X_train, y_train)
#
# Kernel SVM (RBF kernel): 78% accuracy
# from sklearn.svm import SVC
# classifier = SVC(kernel='rbf', random_state=0)
# classifier.fit(X_train, y_train)

### Predicting the Test set results

In [111]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [0 0]
 [0 1]
 [1 0]
 [0 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]

### Making the Confusion Matrix

In [112]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[58 39]
 [15 88]]


0.73
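
The accuracy can be checked by hand from the confusion matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below recomputes the accuracy from the printed numbers and, as an extra not present in the notebook, the precision and recall for the positive class; `cm_values` is just a local copy of the output above.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
cm_values = [[58, 39],
             [15, 88]]
(tn, fp), (fn, tp) = cm_values

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all test reviews classified correctly
precision = tp / (tp + fp)     # of the reviews predicted positive, how many are
recall = tp / (tp + fn)        # of the truly positive reviews, how many were found

print(total)               # 200
print(round(accuracy, 2))  # 0.73
print(round(precision, 3), round(recall, 3))
```

The 39 false positives against 15 false negatives show that this model errs mostly by labelling negative reviews as positive.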

## Predicting if a single review is positive or negative

### Positive

Use our model to predict whether the following review is positive or negative:

"I love this restaurant so much"

In [113]:
new_review = 'I love this restaurant so much'
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)

[1]


### Negative

Now the same steps are applied to the review:

"I hate this restaurant so much"

In [114]:
new_review = 'I hate this restaurant so much'
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)

[0]
