# Text Classification

Text classification is the task of assigning a text or document to a predefined category based on its content. The classification may be topic-based or genre-based.

### Data Attributes Description
* document: document ID
* content: document content
* category: document category, e.g. Auto, Sports, Computer

### Goal
Determine the category of the documents below using a Naive Bayes classifier:

| document | content           | category |
|----------|-------------------|----------|
| D6       | Home Runs Game    | ?        |
| D7       | Car Engine Noises | ?        |

## TASK 1: Loading the dataset into a pandas DataFrame

In [1]:
dat_path = "data/doc_classification.csv"

In [2]:
# Import pandas and load the dataset
import pandas as pd

df = pd.read_csv(dat_path, index_col=False)
df

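| Unnamed: 0 | document | content             | category |
|------------|----------|---------------------|----------|
| 0          | D1       | Saturn Dealer’s Car | Auto     |
| 1          | D2       | Toyota Car Tercel   | Auto     |
| 2          | D3       | Baseball Game Play  | Sports   |
| 3          | D4       | Pulled Muscle Game  | Sports   |
| 4          | D5       | Colored GIFs Root   | Computer |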


## TASK 2: Integer-based labels/categories
> Many machine learning algorithms work with numerical values rather than categorical strings, so each category is mapped to an integer label.

In [3]:
df["category_num"] = df.category.map({"Auto":0, "Sports":1, "Computer":2})
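The mapping above goes one way only. To read numeric predictions back as category names, an inverse dictionary helps. The following is a minimal sketch, assuming the three categories shown in the dataset; `label_map` and `inverse_map` are names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Categories as they appear in the dataset above
categories = pd.Series(["Auto", "Auto", "Sports", "Sports", "Computer"])

# Same mapping as in the cell above, kept in a variable so it can be inverted
label_map = {"Auto": 0, "Sports": 1, "Computer": 2}
inverse_map = {num: name for name, num in label_map.items()}

category_num = categories.map(label_map)

# A category missing from label_map would silently become NaN, so check for that
assert category_num.notna().all()

print(category_num.tolist())
print(category_num.map(inverse_map).tolist())
```

Keeping the dictionary in a variable avoids retyping the mapping when translating predictions back to names.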

## TASK 3: Splits
* Split the columns into X (predictor) and y (label)
* Split the observations into train and test sets

In [4]:
# Splitting features to X (predictor/s) and y(label)
X = df.content
y = df.category_num
print(X.shape)
print(y.shape)

(5,)
(5,)


In [5]:
# Splitting observations into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
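When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, and with five rows that rounds up to two test documents. A small sketch to confirm the split sizes, using an inline copy of the data shown above in place of the CSV file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Inline stand-in for the five documents loaded from the CSV
X = pd.Series(["Saturn Dealer’s Car", "Toyota Car Tercel", "Baseball Game Play",
               "Pulled Muscle Game", "Colored GIFs Root"])
y = pd.Series([0, 0, 1, 1, 2])

# Default test_size of 0.25 applies because no size is passed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With only three training documents, any class that lands entirely in the test set would be unknown to the model, which is worth checking before reading the scores.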

## TASK 4: Building and evaluating the model
This task uses the [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier from scikit-learn.
> The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
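To make the note about fractional counts concrete, the sketch below vectorizes the five documents with tf-idf and fits `MultinomialNB` directly on the resulting weights. It uses an inline copy of the data and fits on all five rows purely for illustration; `docs`, `labels`, and `X_tfidf` are names introduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Inline copy of the five documents and their integer labels
docs = ["Saturn Dealer’s Car", "Toyota Car Tercel", "Baseball Game Play",
        "Pulled Muscle Game", "Colored GIFs Root"]
labels = [0, 0, 1, 1, 2]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)

# One row per document, one column per distinct token
print(X_tfidf.shape)

# The weights are fractional tf-idf values, not integer counts
print(X_tfidf.toarray()[0].round(2))

# MultinomialNB accepts these non-negative fractional features
clf = MultinomialNB().fit(X_tfidf, labels)
print(clf.classes_)
```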

In [6]:
# Importing packages necessary for building the model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


text_clf = Pipeline([('vect_tfid', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
                     ])

In [7]:
# Fitting the model to the train set
text_clf = text_clf.fit(X_train, y_train)

In [8]:
# Predicting the categories of the test set
y_predicted = text_clf.predict(X_test)

In [9]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_predicted)

1.0

In [10]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_predicted)

array([[1, 0],
       [0, 1]])
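The 2x2 matrix covers only the classes that occur in the test labels or predictions. Passing `labels` explicitly makes an absent class visible as an all-zero row and column. A sketch under a stated assumption: the test set holds one document of class 0 and one of class 1, both predicted correctly; the values are typed in by hand rather than taken from the notebook state.

```python
from sklearn.metrics import confusion_matrix

# Assumed values: one Auto (0) and one Sports (1) document, both predicted correctly
y_test = [1, 0]
y_predicted = [1, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_test, y_predicted, labels=[0, 1, 2])
print(cm)
```

The empty third row shows that class 2 (Computer) was never evaluated.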

In [11]:
# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00         1
          1       1.00      1.00      1.00         1

avg / total       1.00      1.00      1.00         2
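The goal stated at the top, categorizing D6 and D7, can be sketched with the same pipeline. Two assumptions are made here: the pipeline is refit on all five labeled documents, and string labels are passed directly for readability (the integer mapping from Task 2 would work the same way). `train`, `new_docs`, and `predicted` are names introduced for this sketch.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Inline copy of the labeled documents D1 to D5
train = pd.DataFrame({
    "content": ["Saturn Dealer’s Car", "Toyota Car Tercel", "Baseball Game Play",
                "Pulled Muscle Game", "Colored GIFs Root"],
    "category": ["Auto", "Auto", "Sports", "Sports", "Computer"],
})

# The two unlabeled documents from the Goal section
new_docs = pd.Series(["Home Runs Game", "Car Engine Noises"], index=["D6", "D7"])

text_clf = Pipeline([("vect_tfid", TfidfVectorizer()),
                     ("clf", MultinomialNB())])

# Assumption: fit on all five labeled documents instead of the train split
text_clf.fit(train["content"], train["category"])

predicted = pd.Series(text_clf.predict(new_docs), index=new_docs.index)
print(predicted)
```

Only "game" and "car" occur in the training vocabulary, so each prediction rests on a single word.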



## Caution
The model shows perfect scores, but that does not mean the model is perfect. The test set contains only two documents, one Auto and one Sports, and no Computer document. A dataset this small is prone to bias and overfitting.
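One way to probe how fragile the scores are is leave-one-out cross-validation, which the original notebook does not use. The sketch below is an illustration only: it holds out each of the five documents in turn and scores the pipeline on that single document, using an inline copy of the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Inline copy of the five documents and their integer labels
docs = ["Saturn Dealer’s Car", "Toyota Car Tercel", "Baseball Game Play",
        "Pulled Muscle Game", "Colored GIFs Root"]
labels = [0, 0, 1, 1, 2]

text_clf = Pipeline([("vect_tfid", TfidfVectorizer()),
                     ("clf", MultinomialNB())])

# Each fold trains on four documents and tests on the remaining one
scores = cross_val_score(text_clf, docs, labels, cv=LeaveOneOut())

print(scores)         # one accuracy value (0.0 or 1.0) per held-out document
print(scores.mean())
```

The fold that holds out D5 cannot be predicted correctly, because no other Computer document remains in the training data. That alone keeps the mean below 1.0.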
