# Text Classification

Using Naive Bayes

In [15]:
import pickle
from pathlib import Path
import os

import pandas as pd
import numpy as np
from sklearn import metrics

from nlp.nb import NBClassifier

### Loading articles

In [16]:
STORAGE_PATH = './storage/nb.model'

# One article per line. Note: newer pandas releases reject sep='\n',
# so this call may need adapting there.
good_articles = pd.read_table('./articles/good.articles', sep='\n', names=['body'])
good_articles['label'] = 0 # good
bad_articles  = pd.read_table('./articles/bad.articles', sep='\n', names=['body'])
bad_articles['label'] = 1 # bad

# joining
articles = pd.concat([good_articles, bad_articles], ignore_index=True)

# persistently shuffling
index_path = f"{STORAGE_PATH}.shuffled_index"
if Path(index_path).is_file():
  file_size = os.path.getsize(index_path)
  print(f"Loading index from stored file ({file_size} bytes)...")
  with open(index_path, 'rb') as fp:
    shuffled_index = pickle.load(fp)
else:
  print('Shuffling index for the first time ...')
  shuffled_index = np.random.permutation(articles.index)
  print('Saving index on disk for further access...')
  with open(index_path, 'wb') as fp:
    pickle.dump(shuffled_index, fp)
  file_size = os.path.getsize(index_path)
  print(f"Done. It took {file_size} bytes on the disk.")

articles = articles.reindex(shuffled_index)

print(f"Counts:\n{articles['label'].value_counts()}")
articles.head()

Loading index from stored file (113239 bytes)...
Counts:
1    7518
0    6617
Name: label, dtype: int64


|       | body | label |
|------:|------|------:|
| 10624 | Во Франции цементный концерн подозревают в спо... | 1 |
| 10193 | Полиция Ирландии арестовала двух человек после... | 1 |
| 2894 | Российские медицинские туристы, которые ездили... | 0 |
| 9087 | Следователь московского полицейского главка вы... | 1 |
| 7888 | МЧС предупредило москвичей об ухудшении погоды... | 1 |
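
An alternative to pickling the shuffled index is to derive it from a fixed seed, which reproduces the same order without a file on disk. The following is a minimal sketch rather than the notebook's method; `SEED` and `shuffled_index_for` are hypothetical names introduced here.

```python
import numpy as np

SEED = 42  # hypothetical fixed seed; any integer works


def shuffled_index_for(n_rows, seed=SEED):
    """Return a reproducible permutation of 0..n_rows-1."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_rows)


first = shuffled_index_for(10)
second = shuffled_index_for(10)
print((first == second).all())  # same seed, same order within one NumPy version
```

The stored file remains the safer choice if the order must survive a NumPy upgrade, since the output of `default_rng` is not guaranteed to stay identical across releases.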


### Splitting between train and test

In [29]:
# The first 35% of the shuffled articles are used for training,
# the remaining 65% for testing.
test_first_index = int(articles.shape[0] * 0.35)

train_data = articles[:test_first_index]
test_data  = articles[test_first_index:]

print(train_data.shape)
print(test_data.shape)

(4947, 2)
(9188, 2)
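
Slicing a shuffled frame keeps the class ratio only approximately. If an exact ratio matters, a stratified split is one option. The sketch below uses scikit-learn's `train_test_split` on a toy frame; the frame `toy` and the `random_state` value are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the `articles` frame (hypothetical data).
toy = pd.DataFrame({
    "body": [f"doc {i}" for i in range(100)],
    "label": [0] * 40 + [1] * 60,
})

# 35% for training, as in the cell above; stratify preserves the label ratio.
train_toy, test_toy = train_test_split(
    toy, train_size=0.35, stratify=toy["label"], random_state=0
)

print(train_toy.shape, test_toy.shape)
print(train_toy["label"].value_counts(normalize=True))
```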


### Training

In [None]:
# Train one classifier per word-probability strategy
nbcs = []
for strategy in ['tf', 'tfidf']:
  nbc = NBClassifier(strategy=strategy)
  nbc.train(train_data)
  nbcs.append(nbc)

### Predicting

In [31]:
predictions_by_strategy = [[nbc.predict(article).label for article in test_data['body']] for nbc in nbcs]

In [32]:
for predictions in predictions_by_strategy:
  report = metrics.classification_report(test_data['label'], predictions)
  print(report)

             precision    recall  f1-score   support

          0       0.97      0.98      0.98      4322
          1       0.98      0.97      0.98      4866

avg / total       0.98      0.98      0.98      9188

             precision    recall  f1-score   support

          0       0.97      0.98      0.98      4322
          1       0.98      0.98      0.98      4866

avg / total       0.98      0.98      0.98      9188
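
At two decimal places the two reports are nearly indistinguishable. Printing more digits, or reducing each run to a single F1 number, makes small differences visible. The sketch below shows the calls on hypothetical toy labels, not on the notebook's predictions.

```python
from sklearn import metrics

# Hypothetical toy labels, only to demonstrate the calls.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# digits=4 prints four decimal places instead of the default two.
print(metrics.classification_report(y_true, y_pred, digits=4))

# A single macro-averaged F1 value is convenient for comparing strategies.
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro")
print(f"macro F1: {macro_f1:.4f}")
```

In the notebook, the same calls would take `test_data['label']` and each entry of `predictions_by_strategy`.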



### Results

Naive Bayes makes accurate predictions even when trained on only a few thousand documents (4,947 here). It compares well with neural networks, deep ones in particular, which are far more data-hungry and take longer to train.

We used two ways of estimating the probability of a word `w` given a class `C`:
  - **Term Frequency**, where `P(w|C) = TF(w|C) / sum_for_words_in_C( TF(w'|C) )`
  - **TFIDF**, where `P(w|C) = TF(w|C) * IDF(w) / sum_for_words_in_C( TF(w'|C) * IDF(w') )`
  
We also added Laplace smoothing, normalized by the average document length of a class (_not shown in the formulas above_).
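
The two estimates can be sketched in plain Python as follows. This is an illustration, not the code of `NBClassifier`: the function names, the smoothed IDF variant, and the plain add-`alpha` smoothing (instead of the length-normalized smoothing mentioned above) are all assumptions.

```python
import math
from collections import Counter


def idf(word, all_docs):
    """Smoothed inverse document frequency over the whole corpus (assumed variant)."""
    df = sum(1 for doc in all_docs if word in doc)
    return math.log((1 + len(all_docs)) / (1 + df)) + 1.0


def class_word_probs(class_docs, all_docs, vocab, strategy="tf", alpha=1.0):
    """Estimate P(w|C) for every word in vocab, with add-alpha smoothing."""
    tf = Counter(word for doc in class_docs for word in doc)
    if strategy == "tfidf":
        weight = {w: tf[w] * idf(w, all_docs) for w in vocab}
    else:
        weight = {w: float(tf[w]) for w in vocab}
    # Normalize so that the probabilities over the vocabulary sum to 1.
    total = sum(weight.values()) + alpha * len(vocab)
    return {w: (weight[w] + alpha) / total for w in vocab}


# Tiny tokenized corpus (hypothetical data).
good = [["sunny", "weather", "today"], ["weather", "forecast"]]
bad = [["police", "arrest", "today"], ["police", "report"]]
all_docs = good + bad
vocab = sorted({w for doc in all_docs for w in doc})

for strategy in ("tf", "tfidf"):
    probs = class_word_probs(good, all_docs, vocab, strategy)
    print(strategy, round(probs["weather"], 3), round(sum(probs.values()), 3))
```

A classifier would then score a document by adding `log P(w|C)` over its words to the log prior of each class and picking the class with the highest total.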

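Both strategies gave approximately the same f1-scores on our dataset: 0.98 for each class at the two-decimal precision of the reports above. The only visible difference is recall for class 1, which is 0.97 with **TF** and 0.98 with **TFIDF**, so the reports do not show a clear winner.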

However, in both scenarios a considerable improvement came from a refined definition of **Term Frequency**: the raw count of a word in a document, adjusted by the document's length. The gain can be attributed to the widely varying document lengths in the dataset.
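
The exact adjustment is not shown in this notebook. One common choice, assumed here purely for illustration, is to divide each raw count by the number of tokens in the document, so that a long article does not dominate the statistics merely because of its size. The function name `length_adjusted_tf` is hypothetical.

```python
from collections import Counter


def length_adjusted_tf(tokens):
    """Raw word counts divided by the number of tokens in the document."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()} if n else {}


short_doc = ["police", "arrest"]
long_doc = ["police"] + ["filler"] * 9

print(length_adjusted_tf(short_doc)["police"])  # 0.5
print(length_adjusted_tf(long_doc)["police"])   # 0.1
```

The word occurs once in each document, yet it carries five times more weight in the short one.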