# What Do They Talk About
The objective is to train a model that automatically detects the topic of a message.

## Step 1: Dataset
We will use the 20 Newsgroups dataset, loaded with <b>fetch_20newsgroups</b> from <b>sklearn.datasets</b>.

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

In [None]:
# see the data
data

In [None]:
# see the topics/categories available from the train data
data.target_names

## Step 2: Data Preparation

In [None]:
# select a few categories to work with, for example
categories = ['sci.crypt', 'talk.politics.misc', 'sci.space', 'comp.graphics', 'talk.politics.guns', 'sci.med']

In [None]:
# split train and test sets for these categories
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

## Step 3: Train Model

In [None]:
# import the TF-IDF vectorizer, which converts the text of each document into a vector of numbers
from sklearn.feature_extraction.text import TfidfVectorizer

# import the function that creates a pipeline, which will attach the vectorizer to a multinomial naive Bayes classifier
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB

# build a model 
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
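To make the vectorizer step less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which match the defaults of `TfidfVectorizer`. The corpus, the `tfidf` helper, and the whitespace tokenization are illustrative simplifications, not what the library does internally.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the 20 Newsgroups data
docs = [
    "the rocket reached orbit",
    "the encryption key was leaked",
    "the rocket engine failed",
]

def tfidf(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    # simplification: split on whitespace instead of a real tokenizer
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec])
    return vocab, vectors

vocab, vectors = tfidf(docs)
```

A word such as "the", which occurs in every document, ends up with a lower weight than a word such as "orbit", which occurs in only one. That is the property that makes TF-IDF useful for telling topics apart.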

In [None]:
# train the model with the train data and train labels
model.fit(train.data, train.target)
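As a rough illustration of what fitting a multinomial naive Bayes classifier involves, the sketch below estimates class priors and Laplace-smoothed word likelihoods from raw word counts. The toy documents and the helpers `fit_nb` and `predict_nb` are hypothetical. The pipeline above feeds TF-IDF weights rather than raw counts into `MultinomialNB`, so this is a simplified analogue and not the library's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set, not taken from the 20 Newsgroups data
train_docs = [
    ("rocket launch orbit", "space"),
    ("orbit satellite rocket", "space"),
    ("cipher key encryption", "crypt"),
    ("encryption key leaked", "crypt"),
]

def fit_nb(labelled_docs, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from word counts."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    for text, label in labelled_docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labelled_docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # alpha adds a pseudo-count so unseen words never get zero probability
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(text, log_prior, log_likelihood):
    """Return the class with the highest score; words outside the vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_nb(train_docs)
print(predict_nb("rocket orbit", log_prior, log_likelihood))  # prints "space"
```

The smoothing parameter `alpha=1.0` mirrors the default of `MultinomialNB`. Working in log space avoids numerical underflow when many small probabilities are multiplied.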

## Step 4: Test Model

In [None]:
# use the trained model to predict labels for the test data, without test labels
labels = model.predict(test.data)

In [None]:
# see the predicted labels
labels

## Step 5: Validate Model

In [None]:
# Calculate the confusion matrix to evaluate the accuracy of the predicted labels
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)

In [None]:
# Visualize it with the seaborn library
import seaborn as sns; sns.set()
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
# label the axes on the heatmap itself: mat.T puts true labels on x and predictions on y
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
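The heatmap shows where the model confuses categories, but a single number is often handy too. The sketch below builds a confusion matrix by hand and derives overall accuracy and per-class recall from it. The label lists and the `confusion` helper are hypothetical and only illustrate the arithmetic. As in `confusion_matrix`, rows are true labels and columns are predicted labels.

```python
# Hypothetical true and predicted label indices, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

def confusion(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    mat = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        mat[t][p] += 1
    return mat

mat_toy = confusion(y_true, y_pred, 3)

# accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = sum(mat_toy[i][i] for i in range(3)) / len(y_true)

# recall per class: correct predictions divided by the row total
recall = [mat_toy[i][i] / sum(mat_toy[i]) for i in range(3)]
```

On the real data, the same accuracy figure can be obtained from `mat` by dividing the sum of its diagonal by the sum of all its entries.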

## Step 6: Use the Model for Prediction

In [None]:
# define a function that takes any string and predicts its topic/category
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [None]:
# call the function with an arbitrary string
predict_category('determining the screen resolution')

In [None]:
predict_category('election')

In [None]:
predict_category('hiv')