# Text Classification with Naive Bayes using Count Vectorizer and TF-IDF

1. Load Textual Data
2. Text Preprocessing (TF-IDF, word count)
3. Model (train, predict, evaluate)
4. Test the Model

#### Your task:
- Run the code and observe how the two basic vectorization techniques differ.
- Suggest improvements to the code.
- Add further evaluation metrics beyond accuracy and F1.
- Add visualizations that convey the results more clearly.

In [None]:
!pip install scikit-learn matplotlib pandas

## Importing Libraries

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.pipeline import make_pipeline

## 1. Load Textual Data

The 20 Newsgroups dataset contains newsgroup posts in 20 different categories. This tutorial uses four of them:
 - alt.atheism
 - comp.graphics
 - sci.med
 - soc.religion.christian

In [None]:
news = fetch_20newsgroups()

In [None]:
news.target_names

In [None]:
target_categories = ['alt.atheism','comp.graphics','sci.med','soc.religion.christian']

train = fetch_20newsgroups(subset='train', categories=target_categories)
test = fetch_20newsgroups(subset='test', categories=target_categories)

In [None]:
len(test.data), len(train.data)

### Sample

In [None]:
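# Look up the label via train.target_names, which does not depend on the order of target_categories
print(f'CATEGORY: {train.target_names[train.target[0]]}')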
print('-' * 80)
print(train.data[0])
print('-' * 80)

## 2. Text preprocessing

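A classifier cannot work on raw text, so each document must first be represented as a numeric vector. This notebook compares two common techniques:
1. TF-IDF (Term Frequency - Inverse Document Frequency): word counts reweighted so that terms appearing in many documents contribute less
2. Word Count: the number of times each vocabulary word occurs in the document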
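
To make the difference between the two representations concrete, the following sketch computes both by hand using only the standard library. It is an illustration under stated assumptions, not scikit-learn's implementation: tokenization is a plain whitespace split, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the variant scikit-learn documents as the default for `TfidfVectorizer`. The names `docs`, `counts`, and `tfidf_rows` are made up for this example.

```python
import math
from collections import Counter

docs = [
    "my name is george this is my name",
    "i like apples",
    "apple is my favorite fruit",
]

# Tokenize on whitespace (a simplification of what real vectorizers do).
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Word count: one row per document, one column per vocabulary term.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed IDF (assumed variant): rarer terms receive a larger weight.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf_rows = [tfidf_row(row) for row in counts]

# Compare both representations for the first sentence.
for term, c, w in zip(vocab, counts[0], tfidf_rows[0]):
    if c:
        print(f"{term:<8} count={c}  tfidf={w:.3f}")
```

In the first sentence, "my" and "name" both occur twice, so their counts are equal. "my" also appears in the third sentence, so its TF-IDF weight comes out lower than that of "name".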

In [None]:
sample_sentences = [
    'My name is George, this is my name', 
    'I like apples', 
    'apple is my favorite fruit'
    ]

### 2.1 TF-IDF

In [None]:
tfidf = TfidfVectorizer()

In [None]:
# fit_transform returns a sparse document-term matrix, not a vectorizer
tfidf_matrix = tfidf.fit_transform(sample_sentences)

In [None]:
pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

### 2.2 Word Count

In [None]:
count_vector = CountVectorizer()

In [None]:
# fit_transform returns a sparse document-term matrix of raw counts
count_matrix = count_vector.fit_transform(sample_sentences)

In [None]:
pd.DataFrame(count_matrix.toarray(), columns=count_vector.get_feature_names_out())

## 3. Model

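Build two models that differ only in the vectorization step: one uses TF-IDF, the other uses word counts. Each is a pipeline that vectorizes the text and passes the result to a Multinomial Naive Bayes classifier.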

In [None]:
model_tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [None]:
model_count = make_pipeline(CountVectorizer(), MultinomialNB())

### 3.1 Training

In [None]:
model_tfidf.fit(train.data, train.target)

In [None]:
model_count.fit(train.data, train.target)

### 3.2 Predicting

In [None]:
y_pred_tfidf = model_tfidf.predict(test.data)

In [None]:
y_pred_count = model_count.predict(test.data)

### 3.3 Evaluation

In [None]:
f1 = f1_score(test.target, y_pred_tfidf, average='weighted')
accuracy = accuracy_score(test.target, y_pred_tfidf)
print('Multinomial Naive Bayes with TF-IDF:')
print('-' * 40)
print(f'f1: {f1:.4f}')
print(f'accuracy: {accuracy:.4f}')

In [None]:
f1 = f1_score(test.target, y_pred_count, average='weighted')
accuracy = accuracy_score(test.target, y_pred_count)
print('Multinomial Naive Bayes with Word Count:')
print('-' * 40)
print(f'f1: {f1:.4f}')
print(f'accuracy: {accuracy:.4f}')
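
Accuracy and weighted F1 each summarize performance in a single number. A confusion matrix and per-class precision and recall show which categories get mixed up. The following is a minimal sketch using made-up labels (`y_true_toy`, `y_pred_toy`) in place of `test.target` and the model predictions, so that it runs without downloading the dataset.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for test.target and a model's predictions.
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 3, 1, 1, 2, 2, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Per-class precision, recall, and F1, plus overall averages.
print(classification_report(y_true_toy, y_pred_toy, target_names=labels))
```

With the real data, one would pass `test.target` together with `y_pred_tfidf` or `y_pred_count`, and `target_names=test.target_names`. For a plot of the same matrix, `ConfusionMatrixDisplay.from_predictions` in `sklearn.metrics` is one option in recent scikit-learn releases.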

## 4. Testing the Model

In [None]:
text = [
    'I believe in jesus', 
    'Nvidia released new video card', 
    'an apple a day keeps the doctor away',
    'God does not exist',
    'My monitor supports HDR',
    'Vitamins are essential for your health and development'
]

### 4.1 TF-IDF

In [None]:
y_pred = model_tfidf.predict(text)

In [None]:
# Map each predicted class index to its name via train.target_names
for label, sentence in zip(y_pred, text):
    print(f'"{train.target_names[label]:<22}" ==> "{sentence}"')

### 4.2 Word Count

In [None]:
y_pred = model_count.predict(text)

In [None]:
# Map each predicted class index to its name via train.target_names
for label, sentence in zip(y_pred, text):
    print(f'"{train.target_names[label]:<22}" ==> "{sentence}"')
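
A predicted label alone does not show how confident the model is. As a possible extension, the sketch below uses `predict_proba` to print the probability of the winning class. It trains on a tiny made-up corpus (`toy_texts`, `toy_labels`) so that it runs without the dataset. The same call should work on `model_tfidf` and `model_count`, since both end in a `MultinomialNB` step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; not part of the 20 Newsgroups data.
toy_texts = [
    'the doctor prescribed medicine for the patient',
    'vitamins and diet affect health',
    'the graphics card renders images on the monitor',
    'image rendering needs a fast video card',
]
toy_labels = ['sci.med', 'sci.med', 'comp.graphics', 'comp.graphics']

toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

queries = ['my monitor shows sharp images', 'vitamins are good for health']

# One row per query, one column per class (ordered as toy_model.classes_).
probabilities = toy_model.predict_proba(queries)

for query, row in zip(queries, probabilities):
    best = row.argmax()
    print(f'{toy_model.classes_[best]:<14} p={row[best]:.2f}  "{query}"')
```

Probabilities from Naive Bayes tend to be pushed toward 0 or 1, so they are better read as a rough ranking than as calibrated confidence.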