In [6]:
import pandas as pd
import numpy as np
from numpy import random

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

# pass in previously stored df
%store -r df_bbc

## Train and Test set splitting
We now split our dataset into train and test sets, holding out 30% of the articles for testing.

In [7]:
# set input and output matrices
X = df_bbc['article_text_clean']
y = df_bbc['category']

# set train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 40)

> **Note:** As we saw earlier, our categories are relatively well balanced, so we expect a random split to preserve roughly the same category distribution in both the train and test sets.

We may revisit this assumption if we see signs of overfitting.
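
As a rough check of this assumption, the sketch below compares the category proportions in the two sets. It uses a small made-up frame (`toy`, hypothetical data) in place of `df_bbc`, and it passes `stratify`, which our split above does not use, to show one way of enforcing matching proportions rather than relying on chance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_bbc (hypothetical data, not the real articles)
toy = pd.DataFrame({
    'text': [f'doc {i}' for i in range(100)],
    'category': ['sport'] * 50 + ['tech'] * 30 + ['politics'] * 20,
})

# stratify keeps the category proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['category'],
    test_size=0.3, random_state=40, stratify=toy['category'])

# Put the class proportions of each split side by side
comparison = pd.DataFrame({
    'train': y_tr.value_counts(normalize=True).sort_index(),
    'test': y_te.value_counts(normalize=True).sort_index(),
})
print(comparison)
```

Running the same comparison on `y_train` and `y_test` from the cell above would show how far a purely random split drifts from the overall distribution.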

***
## Model: Naive Bayes
Having loaded and preprocessed our labelled data, we can move on to feature engineering.

We will convert our text documents to a matrix of token counts using `CountVectorizer`, then transform this count matrix to a normalised *tf-idf* representation using `TfidfTransformer`. We will then train several classifiers from the scikit-learn library, starting with Naive Bayes.

#### TF-IDF
*Term frequency-inverse document frequency* is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics:
- (TF) Term frequency, which is the number of times a word appears in a document
- (IDF) Inverse document frequency, which shrinks as the word appears in more of the documents in the collection

Words that are common in every document, such as *'this'*, *'the'*, *'what'*, are ranked low even though they appear many times, because they say little about any one document in particular. In contrast, if a word such as *'football'* appears many times in one document but rarely in others, it is probably very relevant to that document. The word *'football'* will therefore probably be tied to the category/topic *'Sport'*, since most articles containing it will be about that topic.

*[Reference: [MonkeyLearn blog, 2019](https://monkeylearn.com/blog/what-is-tf-idf/)]*
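
To make the weighting concrete, here is a minimal sketch on three made-up sentences (`toy_docs` is hypothetical and not taken from the BBC data). It assumes scikit-learn's default settings, where the IDF is smoothed as `ln((1 + n) / (1 + df)) + 1` and each document vector is L2-normalised, so the numbers differ slightly from a plain TF × IDF product.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'the' occurs in every document, 'football' in one
toy_docs = [
    'the football match was a great football game',
    'the election result surprised the party',
    'the new phone has a faster chip',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_docs)
vocab = vectorizer.vocabulary_  # maps each word to its column index

# IDF: words found in fewer documents get a larger value
idf_the = vectorizer.idf_[vocab['the']]
idf_football = vectorizer.idf_[vocab['football']]

# Final tf-idf weights of the two words in the first document
w_the = tfidf[0, vocab['the']]
w_football = tfidf[0, vocab['football']]

print('idf    the: {:.3f}  football: {:.3f}'.format(idf_the, idf_football))
print('weight the: {:.3f}  football: {:.3f}'.format(w_the, w_football))
```

In the first sentence, *'football'* ends up with a higher weight than *'the'*, which matches the intuition described above.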

In [8]:
model_naivebayes = Pipeline(steps = [('vect', CountVectorizer()),
                                     ('tfidf', TfidfTransformer()),
                                     ('clf', MultinomialNB())])
model_naivebayes.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
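
As an aside, the first two pipeline steps can be collapsed into one. The sketch below, on a hypothetical mini-corpus (`toy_docs`), checks that `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as `TfidfVectorizer` when both use their default settings. This is only an illustration of the equivalence, not a change to the model above.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus, used only for this comparison
toy_docs = [
    'shares rise as profits beat forecasts',
    'striker scores twice in cup final',
    'new phone launch boosts tech shares',
]

# Two-step version, as in the pipeline above
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(toy_docs)

# One-step version
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the two feature matrices element by element
same = np.allclose(two_step.toarray(), one_step.toarray())
print('Same features:', same)
```

Keeping the two steps separate, as we do, makes it easy to inspect the raw counts or swap out the weighting later.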

In [9]:
%%time
y_pred = model_naivebayes.predict(X_test)

# scikit-learn metrics expect the true labels first, then the predictions
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.970
CPU times: user 141 ms, sys: 4.58 ms, total: 146 ms
Wall time: 146 ms


In [10]:
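# classification_report orders its rows by sorted label, so target_names must
# follow that same order; unique() alone returns labels in order of appearance
list_categories = sorted(df_bbc['category'].unique())
print(classification_report(y_test, y_pred, target_names = list_categories))

               precision    recall  f1-score   support

     business       0.95      0.99      0.97       144
entertainment       0.99      0.91      0.95       110
     politics       0.95      0.98      0.96       129
        sport       0.98      1.00      0.99       164
         tech       0.98      0.96      0.97       121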

     accuracy                           0.97       668
    macro avg       0.97      0.97      0.97       668
 weighted avg       0.97      0.97      0.97       668



The accuracy is very good at 97%, and precision and recall are above 0.9 for every category. Can we do better than this?
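
To see where the remaining errors come from, a confusion matrix shows which categories are mistaken for which. The sketch below uses a handful of made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) so it runs on its own. In the notebook, `y_test` and `y_pred` would take their place.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true_toy = ['business', 'business', 'sport', 'tech', 'tech', 'politics']
y_pred_toy = ['business', 'politics', 'sport', 'tech', 'tech', 'politics']

# Fix the label order explicitly so rows and columns are unambiguous
labels = sorted(set(y_true_toy))
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Rows are the true categories, columns are the predicted categories
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

Off-diagonal cells are the misclassifications. Here, one *business* article is predicted as *politics*.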

In [11]:
object_keep = {'df_bbc': df_bbc,
               'X': X,
               'y': y,
               'X_train': X_train,
               'X_test': X_test,
               'y_train': y_train,
               'y_test': y_test}
%store object_keep

Stored 'object_keep' (dict)
