# 🧭 Plan of Attack
**🔹 What it means:**  
This is your overall game plan or roadmap.

🧠 _"What steps will I follow to build the text classification project?"_

### 📋 It usually includes:
- Collecting data  
- Cleaning (preprocessing)  
- Feature extraction (BoW, TF-IDF, Word2Vec)  
- Training models  
- Evaluating results  

---

# 🧠 What is Text Classification
**🔹 What it means:**  
Text classification is a way to automatically assign categories (like spam, positive/negative, topic labels) to pieces of text using machine learning.

---

# 🧩 Types of Text Classification
**🔹 Key types:**

- **Binary Classification** — Two labels (e.g., spam vs not spam)  
- **Multi-class** — One label out of many (e.g., news category: politics, sports, tech)  
- **Multi-label** — Text may belong to multiple categories (e.g., ["COVID", "Health", "News"])

---

# 📱 Applications
**🔹 Where it's used:**

- Spam detection  
- Sentiment analysis  
- Chatbot intent classification  
- Product review analysis  
- News categorization  

---

# 🧵 The Pipeline
**🔹 What it means:**  
A step-by-step process from raw text to predictions.

### 📊 Pipeline Steps:
- Collect & clean data  
- Text preprocessing (lowercase, remove stopwords, etc.)  
- Feature extraction (BoW, TF-IDF, Word2Vec)  
- Model training (Naive Bayes, SVM, etc.)  
- Evaluation & optimization  
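
🧠 The steps above can be sketched as a single scikit-learn pipeline. This is only a minimal illustration: the four example texts, their labels, and the choice of `TfidfVectorizer` with `LogisticRegression` are assumptions made for the sketch, not the setup used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, for illustration only
texts = [
    "win free money now",
    "free prize claim your money",
    "meeting moved to friday",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Preprocessing (lowercasing, stop-word removal) and feature extraction
# happen inside the vectorizer; the classifier is the final step.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(texts, labels)

# Classify an unseen message
prediction = model.predict(["claim your free money"])[0]
print(prediction)
```

Keeping the vectorizer and the model in one pipeline means the same preprocessing is applied at training and prediction time.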

---

# 🧠 Different Approaches
**🔹 What it means:**  
Multiple ways to build a classifier:

- Rule-based (heuristics)  
- Traditional ML (SVM, Logistic Regression)  
- Deep Learning (RNN, BERT)  
- Pre-trained models via APIs  

---

# 🛠️ Heuristic Approach
**🔹 What it means:**  
You write custom rules to decide categories instead of using a model.

**🧠 Example:**
```python
def classify(text):
    if "free money" in text:
        return "spam"
    return "not spam"
```
✅ Good for small datasets or prototyping.

---

## ☁️ Using an API  
**🔹 What it means:**  
Use pre-built models via APIs like:

- Hugging Face  
- OpenAI  
- Google Cloud Natural Language  
- Amazon Comprehend  

📦 You send your text and get predictions back without training your own model.

---

## 🧺 Using BoW and N-grams  
**🔹 What it means:**  
Transform text into numbers using:

- **BoW (Bag of Words):** Count how often each word appears, ignoring word order  
- **N-grams:** Capture short phrases (e.g., “not good” = bigram)

✅ Great for simple ML models.

---

## 📊 Using TF-IDF  
**🔹 What it means:**  
A weighted version of BoW.  
It scores a word higher when it appears often in one document but rarely across the rest of the corpus.

✅ Helps models focus on unique and meaningful terms.
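
🧠 A hand-rolled sketch of the textbook formula, `tf * log(N / df)`, on three made-up documents. The helper `tf_idf` is a hypothetical name for this example, and it assumes the term occurs somewhere in the corpus. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus, already tokenized
docs = [
    "the game is great".split(),
    "the game is broken".split(),
    "the update is great".split(),
]

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document taken up by this term
    tf = Counter(doc)[term] / len(doc)
    # Document frequency: number of documents that contain the term
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency shrinks to 0 for terms found everywhere
    return tf * math.log(len(corpus) / df)

score_the = tf_idf("the", docs[1], docs)
score_broken = tf_idf("broken", docs[1], docs)
print(score_the, score_broken)
```

"the" appears in every document, so its score is zero, while "broken" appears in only one document and gets a positive weight.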

---

## 🔡 Using Word2Vec  
**🔹 What it means:**  
Transforms words into vectors that carry semantic meaning.

**🧠 Example:**  
_vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")_  

✅ Words used in similar contexts get similar vectors, so the embedding captures relationships between words.
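
🧠 The analogy can be illustrated with toy vectors. The 2-D vectors below are hand-made for this sketch (one axis loosely standing for "royalty", the other for "gender"); they are not trained Word2Vec embeddings, which typically have a hundred or more dimensions.

```python
import math

# Hand-made 2-D vectors, for illustration only
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
    "apple": [0.1, 0.0],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity, excluding the three query words
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```

With real embeddings the result is only approximately equal to the "queen" vector, which is why the nearest neighbour is looked up instead of testing for equality.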

---

## 🧾 Interpreting Models  
**🔹 What it means:**  
Understand why the model gave a prediction.

**🧠 Tools:**

- SHAP  
- LIME  
- Confusion matrix (shows which classes the model mixes up)  

✅ Helps you trust, debug, and improve your model.

---

## 💡 Practical Advice  
**🔹 What it means:**  
Real-world tips like:

- Always clean your data well  
- Use a validation set  
- Try simple models before complex ones  
- Use pre-trained models to save time


In [1]:
import pandas as pd 
import numpy as np


In [2]:
# The file has no header row, so this first read promotes the first record to column names
df = pd.read_csv('IMDB Review.csv')

df.head()


|   | 2401 | Borderlands | Positive | im getting on borderlands and i will murder you all , |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 1 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 2 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 3 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... |


In [3]:
df.columns

Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')

In [4]:
# The file has no header row, so pass header=None and supply the column names
column_names = ['Review_ID', 'Game_Title', 'Sentiment_Label', 'Review_Text']
df = pd.read_csv('IMDB Review.csv', names=column_names, header=None)

# Preview the updated dataframe
df.head()


|   | Review_ID | Game_Title | Sentiment_Label | Review_Text |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


In [5]:
df.columns

Index(['Review_ID', 'Game_Title', 'Sentiment_Label', 'Review_Text'], dtype='object')

In [6]:
df['Sentiment_Label'].value_counts()

Sentiment_Label
Negative      22542
Positive      20832
Neutral       18318
Irrelevant    12990
Name: count, dtype: int64

In [7]:
df.isnull().sum()

Review_ID            0
Game_Title           0
Sentiment_Label      0
Review_Text        686
dtype: int64

In [8]:
df = df.dropna()

In [9]:
df.isnull().sum()

Review_ID          0
Game_Title         0
Sentiment_Label    0
Review_Text        0
dtype: int64

In [10]:
df.duplicated().sum()

np.int64(2340)

In [11]:
df = df.drop_duplicates()

In [12]:
df.duplicated().sum()

np.int64(0)

In [13]:
import re

def remove_tags(raw_text):
    # Strip anything that looks like an HTML tag, e.g. <br />
    return re.sub(r'<.*?>', '', raw_text)

In [14]:
df['Review_Text'] = df['Review_Text'].apply(remove_tags)
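
A quick check of what the tag-stripping regex does, on a made-up sample string. The function is repeated here so the snippet runs on its own.

```python
import re

def remove_tags(raw_text):
    # Non-greedy match, so each <...> tag is removed separately
    return re.sub(r"<.*?>", "", raw_text)

sample = "Great game!<br /><br />Would <b>buy</b> again."
cleaned = remove_tags(sample)
print(cleaned)  # Great game!Would buy again.
```

Note that the tags are replaced with an empty string, so words on either side of a `<br />` end up joined. Replacing with a space instead would be one way to avoid that.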

In [15]:
df.Review_Text

0        im getting on borderlands and i will murder yo...
1        I am coming to the borders and I will kill you...
2        im getting on borderlands and i will kill you ...
3        im coming on borderlands and i will murder you...
4        im getting on borderlands 2 and i will murder ...
                               ...                        
74677    Just realized that the Windows partition of my...
74678    Just realized that my Mac window partition is ...
74679    Just realized the windows partition of my Mac ...
74680    Just realized between the windows partition of...
74681    Just like the windows partition of my Mac is l...
Name: Review_Text, Length: 71656, dtype: object

In [16]:
# Drop the columns
df = df.drop(['Review_ID', 'Game_Title'], axis=1)

# View the updated DataFrame
df.head()


|   | Sentiment_Label | Review_Text |
|---|---|---|
| 0 | Positive | im getting on borderlands and i will murder yo... |
| 1 | Positive | I am coming to the borders and I will kill you... |
| 2 | Positive | im getting on borderlands and i will kill you ... |
| 3 | Positive | im coming on borderlands and i will murder you... |
| 4 | Positive | im getting on borderlands 2 and i will murder ... |


In [17]:
# Convert all reviews to lowercase
df['Review_Text'] = df['Review_Text'].str.lower()


In [18]:
# Select by name; the double brackets keep x as a one-column DataFrame
x = df[['Review_Text']]
y = df['Sentiment_Label']

In [19]:
x

|   | Review_Text |
|---|---|
| 0 | im getting on borderlands and i will murder yo... |
| 1 | i am coming to the borders and i will kill you... |
| 2 | im getting on borderlands and i will kill you ... |
| 3 | im coming on borderlands and i will murder you... |
| 4 | im getting on borderlands 2 and i will murder ... |
| ... | ... |
| 74677 | just realized that the windows partition of my... |
| 74678 | just realized that my mac window partition is ... |
| 74679 | just realized the windows partition of my mac ... |
| 74680 | just realized between the windows partition of... |


In [20]:
y

0        Positive
1        Positive
2        Positive
3        Positive
4        Positive
           ...   
74677    Positive
74678    Positive
74679    Positive
74680    Positive
74681    Positive
Name: Sentiment_Label, Length: 71656, dtype: object

In [21]:
df.columns

Index(['Sentiment_Label', 'Review_Text'], dtype='object')

In [22]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

In [23]:
y

array([3, 3, 3, ..., 3, 3, 3], shape=(71656,))
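
To see which integer stands for which label, the fitted encoder's `classes_` attribute can be inspected. The sketch below refits a fresh encoder on a short made-up list containing the four label values from the `value_counts()` output, so it runs on its own.

```python
from sklearn.preprocessing import LabelEncoder

# The four label values from value_counts(), in arbitrary order
labels = ["Positive", "Negative", "Neutral", "Irrelevant", "Positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Irrelevant': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3}

# inverse_transform turns integer codes back into the original strings
decoded = list(encoder.inverse_transform(encoded))
```

This matches the array of 3s above: the first and last rows of the dataset are all `Positive`.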

In [24]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

In [25]:
x_train.shape

(57324, 1)

In [26]:
# Apply Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
cv = CountVectorizer()

In [28]:
x_train_bow = cv.fit_transform(x_train['Review_Text']).toarray()
x_test_bow = cv.transform(x_test['Review_Text']).toarray()

In [29]:
x_train_bow.shape

(57324, 29698)
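
A dense array of this shape holds about 1.7 billion entries, which is on the order of 13.6 GB at `CountVectorizer`'s default `int64` dtype. One possible alternative, sketched below on made-up data, is to skip `.toarray()` and train a model that accepts sparse input, such as `MultinomialNB`. This is a different model from the `GaussianNB` used in this notebook, so its accuracy would not match the numbers reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for x_train['Review_Text'] and y_train
train_texts = ["love this game", "great fun game", "hate this game", "awful boring game"]
train_labels = [1, 1, 0, 0]

cv = CountVectorizer()
# No .toarray(): fit_transform already returns a SciPy sparse matrix
x_train_sparse = cv.fit_transform(train_texts)

# MultinomialNB models word counts directly and works on sparse matrices
nb = MultinomialNB()
nb.fit(x_train_sparse, train_labels)

x_new = cv.transform(["great game", "boring game"])
predictions = nb.predict(x_new).tolist()
print(predictions)
```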

In [30]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train_bow,y_train)

In [31]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = gnb.predict(x_test_bow)
accuracy_score(y_test, y_pred)


0.6823192855149316

In [32]:
confusion_matrix(y_test,y_pred)

array([[2418,   36,   24,   62],
       [1274, 2480,   36,  528],
       [ 709,  180, 2428,  308],
       [1326,   42,   28, 2453]])
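
The matrix is easier to read as per-class precision and recall. The sketch below copies the numbers from the output above. In scikit-learn's convention rows are true classes and columns are predicted classes; the class names are an assumption based on `LabelEncoder` sorting labels alphabetically.

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = [
    [2418,   36,   24,   62],
    [1274, 2480,   36,  528],
    [ 709,  180, 2428,  308],
    [1326,   42,   28, 2453],
]
# Assumed class order: LabelEncoder sorts the label strings alphabetically
classes = ["Irrelevant", "Negative", "Neutral", "Positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

precision, recall = {}, {}
for i, name in enumerate(classes):
    # Recall: share of true class-i samples that were predicted as class i
    recall[name] = cm[i][i] / sum(cm[i])
    # Precision: share of class-i predictions that were actually class i
    precision[name] = cm[i][i] / sum(row[i] for row in cm)
    print(f"{name:<10} precision={precision[name]:.3f} recall={recall[name]:.3f}")

print(f"accuracy={accuracy:.4f}")
```

The first column carries most of the errors (1274, 709 and 1326 off-diagonal), which suggests the model predicts class 0 far too often: recall for that class is high, but precision is low.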

# Training a Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()

rf.fit(x_train_bow, y_train)
y_pred = rf.predict(x_test_bow)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


In [None]:
# Keep only the 3,000 most frequent terms to shrink the feature matrix
cv = CountVectorizer(max_features=3000)

x_train_bow = cv.fit_transform(x_train['Review_Text']).toarray()
x_test_bow = cv.transform(x_test['Review_Text']).toarray()

rf = RandomForestClassifier()

rf.fit(x_train_bow,y_train)
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test,y_pred)

KeyboardInterrupt: 
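
The run above was interrupted before it finished. If training time is the problem, one thing to try is keeping the features sparse and limiting the forest's size. The sketch below uses made-up data, and the parameter values are illustrative guesses, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the review text and the encoded labels
texts = ["love this game", "great fun game", "hate this game", "awful boring game"] * 5
labels = [1, 1, 0, 0] * 5

cv = CountVectorizer(max_features=3000)
x_sparse = cv.fit_transform(texts)  # kept sparse; random forests accept sparse input

rf = RandomForestClassifier(
    n_estimators=50,   # fewer trees than the default of 100
    max_depth=20,      # cap tree depth to bound training time
    n_jobs=-1,         # train trees on all available CPU cores
    random_state=1,    # make the run repeatable
)
rf.fit(x_sparse, labels)

train_accuracy = rf.score(x_sparse, labels)
print("Training accuracy:", train_accuracy)
```

Fewer or shallower trees trade some accuracy for speed, so the result should still be checked on the held-out test set.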