<a href="https://colab.research.google.com/github/dhahbimohamed/ml-zero-to-expert/blob/main/07_naive_bayes_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🔹 What is Naive Bayes?**
It’s a simple but powerful classification algorithm based on Bayes’ Theorem.
It assumes that all features are conditionally independent given the class (the "naive" assumption) and calculates the probability of each class given the input.

$$
P(\text{Class} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Class}) \cdot P(\text{Class})}{P(\text{Data})}
$$
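
To make the formula concrete, here is a small worked example. All the probabilities are made-up numbers for a hypothetical spam filter, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_spam = 0.2               # P(Class): prior probability that an email is spam
p_word_given_spam = 0.6    # P(Data | Class): "free" appears in 60% of spam
p_word_given_ham = 0.05    # "free" appears in 5% of non-spam emails

# P(Data): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(Class | Data) = P(Data | Class) * P(Class) / P(Data)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 2))  # 0.75
```

Even though only 20% of emails are spam in this toy setup, seeing the word raises the spam probability to 75%, because the word is much more likely under the spam class.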


# **🔹 Why is it “Naive”?**
Because it assumes that the input features (words) are conditionally independent given the class. This is rarely true in language, where words such as "hockey" and "puck" tend to appear together, but the algorithm still works well in practice.
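
Written out, the independence assumption turns the likelihood of a whole document into a product over its words. The notation $w_1, \dots, w_n$ for the words of a document and $c$ for a class is assumed here for illustration:

$$
P(\text{Data} \mid \text{Class}) = P(w_1, \dots, w_n \mid \text{Class}) \approx \prod_{i=1}^{n} P(w_i \mid \text{Class})
$$

Since $P(\text{Data})$ is the same for every class, the predicted class is the one that maximises the numerator of Bayes' Theorem:

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$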

# **🔹 Where is it used?**
Common in:

* Spam detection
* Sentiment analysis
* Document classification



In [20]:
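from sklearn.datasets import fetch_20newsgroups

# Load the training split for two newsgroups (a binary classification task)
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.sport.hockey'], shuffle=True)
X, y = data.data, data.target  # X: raw post texts, y: integer class labels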

* CountVectorizer() transforms text into a bag of words: it counts how many times each word appears in each document.
* stop_words='english' removes very common words that carry little meaning, such as "the", "is", and "and".
* fit_transform() learns the vocabulary of the corpus and builds the word-count matrix.

  => So X_vectorized is now a large sparse matrix of word counts, with one row per document and one column per vocabulary word.
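
The steps above can be tried on a tiny corpus where the matrix is small enough to read. This is a minimal sketch: the three sentences and the names `docs`, `toy_vectorizer`, and `counts` are made up for illustration and are not part of the 20 Newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
docs = [
    "the rocket launch was a success",
    "the hockey team played the game",
    "rocket rocket launch",
]

toy_vectorizer = CountVectorizer(stop_words="english")
counts = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each learned word to its column index
print(sorted(toy_vectorizer.vocabulary_))
print(counts.toarray())

# "rocket" appears twice in the third document
rocket_col = toy_vectorizer.vocabulary_["rocket"]
print(counts[2, rocket_col])
```

The stop word "the" never gets a column, and the repeated "rocket" shows up as a count of 2 rather than a simple present/absent flag.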



* MultinomialNB() is the Naive Bayes variant for discrete features such as word counts, which makes it a natural fit for text.
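
To see what MultinomialNB actually estimates from word counts, the sketch below fits it on a made-up 4x3 count matrix and reproduces its per-class word probabilities by hand. The data and the names `X_toy`, `y_toy`, and `manual` are assumptions for illustration. The default `alpha=1.0` is add-one (Laplace) smoothing, which keeps a word that never occurred in a class from getting probability zero.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: rows are documents, columns are ["goal", "orbit", "puck"]
X_toy = np.array([
    [2, 0, 1],   # hockey
    [1, 0, 2],   # hockey
    [0, 3, 0],   # space
    [0, 2, 1],   # space
])
y_toy = np.array([0, 0, 1, 1])  # 0 = hockey, 1 = space

nb = MultinomialNB(alpha=1.0)  # alpha=1.0 is the default add-one smoothing
nb.fit(X_toy, y_toy)

# Manual estimate of P(word | hockey):
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
hockey_counts = X_toy[y_toy == 0].sum(axis=0)
manual = (hockey_counts + 1.0) / (hockey_counts.sum() + 1.0 * X_toy.shape[1])

print(manual)                            # smoothed P(word | hockey)
print(np.exp(nb.feature_log_prob_[0]))   # scikit-learn's estimate for the same class
print(nb.predict([[1, 0, 1]]))           # a document containing "goal" and "puck"
```

The two printed probability vectors should agree, which shows that the model is little more than smoothed word frequencies per class combined with the class priors.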

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Turn the raw texts into a sparse matrix of word counts.
# Note: the vocabulary is learned from all documents, before the split below.
vectorizer = CountVectorizer(stop_words='english')
X_vectorized = vectorizer.fit_transform(X)

# Hold out 33% of the documents for testing
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.33, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))

                  precision    recall  f1-score   support

rec.sport.hockey       1.00      0.98      0.99       208
       sci.space       0.98      0.99      0.99       186

        accuracy                           0.99       394
       macro avg       0.99      0.99      0.99       394
    weighted avg       0.99      0.99      0.99       394
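
The precision and recall columns in a report like this one can be reproduced by hand from a confusion matrix. The sketch below uses ten made-up labels (not the notebook's real predictions) and treats sci.space as the "positive" class; the names `y_true` and `y_hat` are illustrative.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 0 = rec.sport.hockey, 1 = sci.space
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()  # label 1 (sci.space) is the positive class

precision = tp / (tp + fp)   # of the posts predicted as sci.space, how many really were
recall = tp / (tp + fn)      # of the real sci.space posts, how many were found

print(cm)
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```

In real-world terms, low precision for sci.space would mean hockey posts are being wrongly filed under space, while low recall would mean space posts are being missed.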



In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Same workflow as with CountVectorizer, but with TF-IDF weights instead of raw counts
vectorizer = TfidfVectorizer(stop_words='english')
X_vectorized = vectorizer.fit_transform(X)

# Hold out 20% of the documents for testing (the CountVectorizer run held out 33%)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))

                  precision    recall  f1-score   support

rec.sport.hockey       1.00      0.98      0.99       125
       sci.space       0.97      1.00      0.99       114

        accuracy                           0.99       239
       macro avg       0.99      0.99      0.99       239
    weighted avg       0.99      0.99      0.99       239



# **🔹 CountVectorizer vs TfidfVectorizer**
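| Aspect                 | CountVectorizer                      | TfidfVectorizer                                        |
| ---------------------- | ------------------------------------ | ------------------------------------------------------ |
| Method                 | Counts raw word occurrences          | Weights counts by how rare a word is across documents  |
| Effect on common words | Frequent words dominate the features | Words that appear in many documents are down-weighted  |
| Use case               | Quick baseline models                | More informative features when common words add noise  |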
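
The difference between the two vectorizers can be checked by recomputing TF-IDF by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF and L2 row normalisation) and uses a made-up three-document corpus in which "space" appears everywhere and "puck" only once.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up): "space" is in every document, "puck" in only one
docs = [
    "space space orbit",
    "space puck",
    "space orbit orbit",
]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts
tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # scikit-learn's TF-IDF

# Manual TF-IDF with scikit-learn's default (smoothed) formula:
# idf(t) = ln((1 + n_docs) / (1 + doc_freq(t))) + 1, then L2-normalise each row
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.round(tfidf, 2))
print(np.allclose(manual, tfidf))  # True
```

Columns are in alphabetical order (orbit, puck, space). In the second document both words occur once, yet "puck" ends up with the larger weight because it is rare across the corpus.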


# **🔹 My Performance Summary:**
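| Vectorizer      | Test size | Precision (hockey) | Precision (sci.space) | F1-score (both classes) | Support (test documents) |
| --------------- | --------- | ------------------ | --------------------- | ----------------------- | ------------------------ |
| CountVectorizer | 0.33      | 1.00               | 0.98                  | 0.99                    | 394                      |
| TfidfVectorizer | 0.20      | 1.00               | 0.97                  | 0.99                    | 239                      |

Both vectorizers reach 0.99 accuracy. The two runs use different test sizes, so the small differences between the rows are not a like-for-like comparison.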
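
For a like-for-like comparison, both vectorizers would need to be scored on identical splits. One way to do that is sketched below, with each vectorizer wrapped in a pipeline and scored by cross-validation. `compare_vectorizers` is a hypothetical helper written for this note, and the eight sentences are made up so the sketch runs without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, cv=5):
    """Hypothetical helper: mean cross-validated accuracy per vectorizer."""
    results = {}
    for name, vectorizer in [
        ("CountVectorizer", CountVectorizer(stop_words="english")),
        ("TfidfVectorizer", TfidfVectorizer(stop_words="english")),
    ]:
        # The pipeline refits the vectorizer on each training fold only,
        # so held-out texts never influence the vocabulary or IDF weights.
        pipeline = make_pipeline(vectorizer, MultinomialNB())
        scores = cross_val_score(pipeline, texts, labels, cv=cv)
        results[name] = scores.mean()
    return results


# Tiny made-up corpus: 1 = space, 0 = hockey
texts = [
    "the shuttle reached orbit", "nasa launched a new satellite",
    "the rocket engine failed on launch", "astronauts aboard the station",
    "the goalie stopped every puck", "a hat trick in the third period",
    "the team traded its best defenseman", "overtime goal wins the playoff game",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(compare_vectorizers(texts, labels, cv=2))
```

With the notebook's data the call would be `compare_vectorizers(X, y)`, where `X` holds the raw post texts. Scores on the toy corpus are not meaningful; the point is only that both vectorizers see the same folds.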


# **🔹 Confidence Rating:**
6.5 / 10. I understand Naive Bayes, the classification report, and text preprocessing better now, but I still want to improve my grasp of:

* What actually happens inside TF-IDF
* How to interpret precision/recall in real-world terms