# Text Classification

Using Naive Bayes

In [1]:
import pickle
from pathlib import Path
import os

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split

from nlp.nb import NBClassifier

### Loading articles

In [2]:
STORAGE_PATH = './storage/nb.model'

good_articles = pd.read_table('./articles/good.articles', sep='\n', names=['body'])
good_articles['label'] = 0 # good
bad_articles  = pd.read_table('./articles/bad.articles', sep='\n', names=['body'])
bad_articles['label'] = 1 # bad

# joining
articles = pd.concat([good_articles, bad_articles], ignore_index=True)

# shuffle once and persist the index, so the row order is the same on every run
index_path = f"{STORAGE_PATH}.shuffled_index"
if Path(index_path).is_file():
  file_size = os.path.getsize(index_path)
  print(f"Loading index from stored file ({file_size} bytes)...")
  with open(index_path, 'rb') as fp:
    shuffled_index = pickle.load(fp)
else:
  print('Shuffling index for the first time ...')
  shuffled_index = np.random.permutation(articles.index)
  print('Saving index on disk for further access...')
  with open(index_path, 'wb') as fp:
    pickle.dump(shuffled_index, fp)
  file_size = os.path.getsize(index_path)
  print(f"Done. It took {file_size} bytes on the disk.")

articles = articles.reindex(shuffled_index)

print(f"Counts:\n{articles['label'].value_counts()}")
articles.head()

Loading index from stored file (113239 bytes)...
Counts:
1    7518
0    6617
Name: label, dtype: int64


| | body | label |
|---|---|---|
| 10624 | Во Франции цементный концерн подозревают в спо... | 1 |
| 10193 | Полиция Ирландии арестовала двух человек после... | 1 |
| 2894 | Российские медицинские туристы, которые ездили... | 0 |
| 9087 | Следователь московского полицейского главка вы... | 1 |
| 7888 | МЧС предупредило москвичей об ухудшении погоды... | 1 |
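As an alternative to storing the shuffled index on disk, a seeded random generator also gives the same order on every run. This is a minimal sketch of that different technique, not what the cell above does; `toy`, `seeded_shuffle` and the seed value are hypothetical names and data chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `articles` frame.
toy = pd.DataFrame({"body": ["a", "b", "c", "d", "e"], "label": [0, 0, 1, 1, 1]})

def seeded_shuffle(frame, seed=0):
    """Return the frame with its rows permuted by a seeded generator."""
    rng = np.random.default_rng(seed)
    return frame.reindex(rng.permutation(frame.index))

first = seeded_shuffle(toy)
second = seeded_shuffle(toy)
print(first.index.tolist() == second.index.tolist())
```

Unlike a stored index, the seeded order changes as soon as the number of rows changes, so it only helps while the dataset stays fixed.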


### Splitting between train and test

The initial dataset consists of roughly two categories: accidents and everything else. This is our best approximation of what we want to treat as bad and good news (a cold start). Later, the initial dataset is to be replaced by human-labeled data from any category.

That is why we initially train the model on as few as 2k (15%) of the articles.

In [3]:
train_data, test_data = train_test_split(articles, train_size = 0.15)

print('\nTrain dataset:')
print(train_data.groupby('label').size())
print(train_data.shape)

print('\nTest dataset:')
print(test_data.groupby('label').size())
print(test_data.shape)
train_data.head()


Train dataset:
label
0     991
1    1129
dtype: int64
(2120, 2)

Test dataset:
label
0    5626
1    6389
dtype: int64
(12015, 2)


| | body | label |
|---|---|---|
| 8298 | Шесть детей госпитализировали из петербургской... | 1 |
| 5947 | Французский модный дом Balenciaga представил р... | 0 |
| 4395 | В Ненецком автономном округе частные детсады, ... | 0 |
| 10142 | Число жертв взрыва в дипломатическом квартале ... | 1 |
| 11498 | На юго-востоке Москвы нашли еще один снаряд. П... | 1 |
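The split above draws a new random sample on every run, so the counts shift slightly between runs. If a repeatable split with the same label ratio in both parts is wanted, `train_test_split` accepts `random_state` and `stratify`. A minimal sketch on hypothetical toy data (`toy_articles` and the seed `42` are assumptions, not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `articles`: 100 rows, 40 good (0) and 60 bad (1).
toy_articles = pd.DataFrame({
    "body": [f"article {i}" for i in range(100)],
    "label": [0] * 40 + [1] * 60,
})

# random_state fixes the sample; stratify keeps the label ratio in both parts.
toy_train, toy_test = train_test_split(
    toy_articles,
    train_size=0.15,
    random_state=42,
    stratify=toy_articles["label"],
)

print(toy_train["label"].value_counts().sort_index())
print(toy_test["label"].value_counts().sort_index())
```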


### Training

In [4]:
nbcs = []
for strategy in ['tf', 'tfidf']:
  nbc = NBClassifier(strategy = strategy)
  nbc.train(train_data)
  nbcs.append(nbc)

Removed highly correlated words: ,, нача, не, к, num, британск, с, юбк, труд, средств, руководител, нам, привод, глав, изначальн, работ, чечн, все, рассказа, церкв, чувств, том, продолж, установк, измерен, тысяч, крайн, осуществлен, устройств, проход, немн, кипр, отел, мин, через, нег, волк, занят, демонстрирова, отказа, зат, нью, мог, кров, разбира, проведен, заработн, владивосток, судьб, граждан, сбыва, остальн, помоч, забрасыва, образова, добра, забира, пас, пенсионер, девушк, шок, реконструкц, впоследств, замет, исполнен, попрос, огранич, сиден, воева, первичн, наруша, пугачев, проб, бумажн, ирин, паспорт, барс, пороча, нац, приостанов, заперт, стремительн, ура, самостоятельн, телефон, действова, великобритан, матер, утвержд, фрг, выкат, казанск, высажива, ялт, юстиц, генера, кат, настаива, тунисск, штраф, концерт, буквальн, снежн, размест, горн, принадлежат, денежн, эфиопск, нормальн, предел, ушл, монсон, пояс, жан, мот, захотел, заказа, принадлежа, электрон, крут, противоположн, 

### Predicting

In [5]:
predictions_by_strategy = [[nbc.predict(article).label for article in test_data['body']] for nbc in nbcs]

In [6]:
for predictions in predictions_by_strategy:
  report = metrics.classification_report(test_data['label'], predictions)
  print(report)

             precision    recall  f1-score   support

          0       0.97      0.98      0.98      5626
          1       0.98      0.97      0.98      6389

avg / total       0.98      0.98      0.98     12015

             precision    recall  f1-score   support

          0       0.98      0.97      0.98      5626
          1       0.98      0.98      0.98      6389

avg / total       0.98      0.98      0.98     12015
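The two reports above appear in the order of the strategy list, so the first belongs to `'tf'` and the second to `'tfidf'`. A small sketch of how each report could be labeled and paired with a confusion matrix, using hypothetical toy labels (`toy_true`, `toy_predictions`) rather than the notebook's data:

```python
from sklearn import metrics

# Hypothetical true labels and per-strategy predictions.
toy_true = [0, 0, 1, 1, 1, 0]
toy_predictions = {
    "tf": [0, 0, 1, 1, 0, 0],
    "tfidf": [0, 1, 1, 1, 1, 0],
}

toy_accuracy = {}
for strategy, predicted in toy_predictions.items():
    toy_accuracy[strategy] = metrics.accuracy_score(toy_true, predicted)
    print(f"Strategy: {strategy}")
    # Rows are true labels, columns are predicted labels.
    print(metrics.confusion_matrix(toy_true, predicted))
    print(metrics.classification_report(toy_true, predicted))
```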



### Results

Naive Bayes makes accurate predictions after training on only about two thousand documents. It compares well with neural networks, deep ones in particular, which are far more data-hungry and take longer to train.

We used two ways of estimating the probability of a word given the class:
  - using **Term Frequency**, where `P(w|C) = TF(w|C) / sum_for_words_in_C( TF(w'|C) )`
  - and **TFIDF**, where `P(w|C) = TF(w|C) * IDF(w) / sum_for_words_in_C( TF(w'|C) * IDF(w') )`
  
We also added Laplace smoothing normalized by the average document length of a class (_not shown in the formulas above_).
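The two estimates can be sketched in a few lines. This is an illustration under stated assumptions, not the code inside `NBClassifier`: it uses plain add-`alpha` smoothing instead of the length-normalized variant mentioned above, assumes `IDF(w) = log(N / df(w))`, and all names and toy documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents: two from class C, one from another class.
class_docs = [["fire", "rescue", "fire"], ["fire", "police"]]
all_docs = class_docs + [["fashion", "show", "police"]]
vocabulary = sorted({w for doc in all_docs for w in doc})

def idf(word):
    # Assumed IDF definition: log(number of documents / document frequency).
    df = sum(1 for doc in all_docs if word in doc)
    return math.log(len(all_docs) / df)

def word_likelihoods(docs, weight, alpha=1.0):
    # P(w|C) = (weight(w) * TF(w|C) + alpha) / (sum over w' + alpha * |V|)
    tf = Counter(w for doc in docs for w in doc)
    weighted = {w: tf[w] * weight(w) for w in vocabulary}
    total = sum(weighted.values()) + alpha * len(vocabulary)
    return {w: (weighted[w] + alpha) / total for w in vocabulary}

p_tf = word_likelihoods(class_docs, weight=lambda w: 1.0)
p_tfidf = word_likelihoods(class_docs, weight=idf)
print(p_tf["fire"], p_tfidf["fire"])
```

With smoothing, words never seen in the class (here `fashion` and `show`) still receive a small non-zero probability, so a single unseen word cannot zero out the whole product.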

Both gave us approximately the same precision, recall and f1-scores.

However, in both cases a considerable improvement came from a refined definition of **Term Frequency**: the raw count of a word in a document adjusted by the document's length. The improvement can be attributed to the widely varying document lengths in the dataset.
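A sketch of one way to adjust term frequency by document length: divide each word count by the number of tokens in its document before summing over the class. The exact adjustment used by `NBClassifier` is not shown in this notebook, so the division by token count and the names below are assumptions.

```python
from collections import Counter

def length_normalized_tf(tokens):
    """Word counts of one document divided by the document's token count."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

# Hypothetical documents of one class: a short one and a much longer one.
short_doc = ["storm", "warning"]
long_doc = ["fashion"] * 6 + ["storm"] * 2

# Raw counts let the long document dominate the class statistics.
raw_tf = Counter(short_doc) + Counter(long_doc)

# Normalized counts give each document the same total weight of 1.
class_tf = Counter()
for doc in (short_doc, long_doc):
    class_tf.update(length_normalized_tf(doc))

print(dict(raw_tf))
print(dict(class_tf))
```

In the raw counts `fashion` outweighs `storm` six to three, while after normalization both carry the same weight, because every document contributes equally regardless of its length.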