## Comments classifier

This is the example of 8-class classification algorithm to classify text explanation of excess stocks in a mining company warehouse. Comments are written in Russian. 

Training set is a manually labeled sentences for each categories. Feature extraction from text was performed with Tf-idf vectorizer from NLTK package. 

Learning algoritm is a multiclass stochastic gradient descent with elastic-net loss function regularization.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

train = pd.read_excel('~/train.xlsx')
train.fillna('-', inplace=True)
train = train[(train['Комментарий']!='-')&(train['Класс']!='-')]

### Data preparation and feature extraction

In [2]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
y = lb.fit_transform(train['Класс'])

text = train['Комментарий'].values

In [3]:
import re
import string
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import TweetTokenizer
exclude = set(string.punctuation)
exclude.remove('.')

def stripper(s):
    s = ''.join(ch for ch in s if ch not in exclude)
    s = re.sub("[\t\n№\d«»–]'", ' ', s)
    return ' '.join(s.split())

def strippers(s):
    s = ''.join(ch for ch in s if ch not in exclude)
    s = re.sub('[\t№\n\d«»–]', ' ', s)
    return ' '.join(s.split())

x = []
for line in text:
    x.append(stripper(line))

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer(decode_error='replace', encoding='utf-8')
X = vectorizer.fit_transform(x)
X = X.todense()
X.shape

(142, 489)

### Training

In [6]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier


classif = OneVsRestClassifier(SGDClassifier(penalty='elasticnet', verbose=False))
classif.fit(X, y)
print('Accuracy:', classif.score(X, y))

Accuracy: 0.929577464789


### Prediction

In [None]:
test = pd.read_excel('~/test.xlsx')
# test.fillna('-', inplace=True)
test = test[(test['Комментарий']!='-')]

In [None]:
test_text = test['Комментарий'].values

x_test = []
for line in test_text:
    x_test.append(stripper(line))
    
X_test = vectorizer.transform(x_test)
X_test = X_test.todense()
X_test.shape

In [None]:
pred = lb.inverse_transform(classif.predict(X_test))
pred = pd.DataFrame(pred)
result = pd.concat([test, pred], axis=1)
result.to_excel('~/prediction.xlsx')

### Feature importances for each class

Linear model allows to understand what features are important for each class. This loop returns 5 most important words for each class ordered by weights in model.

In [None]:
import operator
classes = lb.classes_
for i in range(len(classif.coef_)):
    print(classes[i])
    vocab = vectorizer.vocabulary_.keys()
    coef = abs(classif.coef_[i])
    d = dict([(t, coef[vectorizer.vocabulary_[t]]) for t in vocab])
    sorted_x = sorted(d.items(), key=operator.itemgetter(1),reverse=True)
    print(dict(sorted_x[0:5]).keys())