
# **1-Multinomial naive Bayes for text classification**

## **1.1- Introduction**

Multinomial naive Bayes is often used in data science for text classification, where the features are word counts or word frequencies within the documents to be classified.

In this project we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these short documents into categories [1].
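To make the idea of classifying on word counts concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is only an illustration: the toy documents, the labels `space` and `graphics`, and the helper names `train_multinomial_nb` and `predict_multinomial_nb` are assumptions made for this example, not part of the project code or of scikit-learn.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Add alpha to every word count so unseen words never get probability 0
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_multinomial_nb(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for one string."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            if w in log_likelihood[c]:  # words outside the vocabulary are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, standing in for the newsgroup posts
toy_docs = [
    "rocket launch orbit",
    "orbit satellite launch",
    "pixel render image",
    "image pixel shader",
]
toy_labels = ["space", "space", "graphics", "graphics"]

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
print(predict_multinomial_nb("satellite orbit", log_prior, log_likelihood))   # space
print(predict_multinomial_nb("render the image", log_prior, log_likelihood))  # graphics
```

The scikit-learn `MultinomialNB` estimator used later in this notebook follows the same principle, but works on sparse feature matrices rather than Python dictionaries.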

## **1.2- Dataset**
Let’s download the data and take a look at the target names [1]:

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
print(data.target_names)

## **1.3- Training and Testing**
For simplicity, we will select just a few of these categories and download the training and testing sets [1]:



In [2]:
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

Here is a representative entry from the data [1]:

In [None]:
print(train.data[5])

## **1.4- Build model**

In order to use this data for machine learning, we need to convert the content of each string into a vector of numbers. For this we will use the TF–IDF vectorizer, and create a pipeline that attaches it to a multinomial naive Bayes classifier [1]:

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

In [5]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
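To see what the vectorizing step of the pipeline produces on its own, the following sketch applies `TfidfVectorizer` with its default settings to three made-up sentences. The sentences and the variable names `toy_docs`, `vectorizer`, and `X` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the 20 Newsgroups corpus
toy_docs = [
    "the rocket reached orbit",
    "the image has low resolution",
    "the satellite left orbit",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

print(X.shape)                          # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))   # the words that became feature columns

# "the" appears in every document, so it receives the lowest IDF weight
the_idx = vectorizer.vocabulary_["the"]
print(vectorizer.idf_[the_idx] == vectorizer.idf_.min())
```

Each row of `X` is the numeric vector for one document; inside the pipeline, this matrix is what `MultinomialNB` receives instead of raw strings.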

With this pipeline, we can fit the model to the training data and predict labels for the test data [1]:



In [6]:
model.fit(train.data, train.target)
labels = model.predict(test.data)

## **1.5- Model evaluation**
Now that we have predicted the labels for the test data, we can compare them with the true labels to learn about the performance of the estimator. For example, here is the confusion matrix between the true and predicted labels for the test data [1].


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Rows of mat are true labels; transposing puts true labels on the x-axis
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('True Label')
plt.ylabel('Predicted Label')
plt.show()

Evidently, even this very simple classifier can successfully separate space talk from computer talk, but it gets confused between talk about religion and talk about Christianity. This is perhaps an expected area of confusion.
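The orientation of the confusion matrix is easy to misread, so here is a small sketch on hand-written labels. The lists `y_true` and `y_pred` are invented for illustration; they are not results from the newsgroups model. `accuracy_score` is included as one simple summary number alongside the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for six samples in three classes
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

mat = confusion_matrix(y_true, y_pred)
print(mat)
# mat[i, j] counts samples whose true class is i and predicted class is j,
# so mat[0, 1] is the single class-0 sample that was predicted as class 1.

print(mat.T)
# After transposing, rows are predicted labels and columns are true labels,
# which matches the axis labels used in the heatmap.

print(accuracy_score(y_true, y_pred))  # fraction of samples predicted correctly
```

Off-diagonal entries are the misclassifications; in the heatmap for the newsgroups data, they are where the two religion-related categories get mixed up.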

We now have the tools to determine the category for any string, using the predict() method of this pipeline. Here’s a quick utility function that prints the predicted category for a single string [1]:

In [10]:
def predict_category(s, train=train, model=model):
    # Predict the class index for one string, then print its category name
    pred = model.predict([s])
    print(train.target_names[pred[0]])

In [11]:
predict_category("sending a payload to the ISS")

sci.space


In [None]:
predict_category("discussing islam vs atheism")

In [None]:
predict_category("determining the screen resolution")

Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking. Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.
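Because the classifier is a probability model, the class probabilities behind a prediction can be inspected with `predict_proba`. The sketch below uses a tiny made-up corpus so that it runs without downloading anything; the documents, the string labels, and the name `toy_model` are assumptions for illustration, and the probabilities it prints say nothing about the newsgroups model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two classes
toy_docs = [
    "rocket launch orbit",
    "orbit satellite launch",
    "pixel render image",
    "image pixel shader",
]
toy_labels = ["space", "space", "graphics", "graphics"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_docs, toy_labels)

# One row of class probabilities per input string; words not seen in training are ignored
proba = toy_model.predict_proba(["satellite in orbit"])[0]
for name, p in zip(toy_model.classes_, proba):
    print(f"{name}: {p:.3f}")

print(toy_model.predict(["satellite in orbit"])[0])
```

The predicted label is simply the class with the largest probability, so looking at the full row shows how confident the model is in a given answer.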

# **References** 

[[1]-Text Classification with Data Science](https://thecleverprogrammer.com/2020/05/14/text-classification-with-data-science/)

[[2]-Text Classification Using Convolutional Neural Networks](https://www.youtube.com/watch?v=8YsZXTpFRO0)