## Document Analysis
This is a decision support system which classifies important sentences in technical documents in Russian and filters unimportant. Manually labeled sentences from several documents were used for the training set.

Training was performed with Naive Bayes classification algorithm with stratified 10-fold cross-validation.
Feature extraction from text was performed with Tf-Idf vectorizer from NLTK package. Demencionality of vocabulary was reduced with PCA to decrease computational complexity.

In [1]:
import re
import string
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import TweetTokenizer
exclude = set(string.punctuation)
exclude.remove('.')

In [2]:
with open('train.txt') as f:
    text = f.read().splitlines()
    
def stripper(s):
    s = ''.join(ch for ch in s if ch not in exclude)
    s = re.sub("[\t\n№\d«»–]'", ' ', s)
    return ' '.join(s.split())

def strippers(s):
    s = ''.join(ch for ch in s if ch not in exclude)
    s = re.sub('[\t№\n\d«»–]', ' ', s)
    return ' '.join(s.split())

train = []
for line in text:
    train.append(stripper(line))

In [3]:
with open('target.txt') as f:
    target = f.read().splitlines()

In [4]:
assert len(train) == len(target)

In [5]:
Y = []
for label in target:
    Y.append(int(label))

### Feature extraction and training
Every sentence is a bag-of-words representation with stemming and reduced stop-words

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer(decode_error='replace', encoding='utf-8')
X = vectorizer.fit_transform(train)
X = X.todense()

In [7]:
X.shape

(2364, 1436)

In [8]:
### Naive Bayes ###

from sklearn.model_selection import StratifiedKFold
from sklearn import naive_bayes
from sklearn.model_selection import cross_val_score
from sklearn import decomposition


pca = decomposition.PCA(n_components=180)
X_pca = pca.fit_transform(X)

print(sum(pca.explained_variance_ratio_))

clf = naive_bayes.GaussianNB()
kfold = StratifiedKFold(n_splits=10, random_state=7)
results = cross_val_score(clf, X_pca, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
clf.fit(X_pca, Y)

0.77979769671
Accuracy: 83.76% (6.40%)


GaussianNB(priors=None)

Logistic Regression was used as a baseline for accuracy

In [10]:
# ### Logistic Regression ###

# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import GridSearchCV, StratifiedKFold

# skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# logit = LogisticRegression()
# logit_params = {'C': [0.001, 0.01, 0.1, 1, 10]}

# grid = GridSearchCV(estimator=logit, param_grid=logit_params, n_jobs=-1, cv=skf, verbose=1, scoring='accuracy')

# grid_result = grid.fit(X_pca, Y)
# # summarize results
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# means = grid_result.cv_results_['mean_test_score']
# stds = grid_result.cv_results_['std_test_score']
# params = grid_result.cv_results_['params']
# for mean, stdev, param in zip(means, stds, params):
#     print("%f (%f) with: %r" % (mean, stdev, param))


### Prediction

Prediction is based on the probability of sentence being helpful or not. The function defined below requires a probability threshold. Low theshold decrease the algorithm accuracy but reduce the number of false negative cases. 

In [None]:
import pickle
import pprint

with open('~/home/docs.pickle', 'rb') as f:
    data = pickle.load(f)

list_of_documents = list(data.keys())

from stop_words import get_stop_words
stoplist = get_stop_words('russian')
stemmer = SnowballStemmer('russian')


In [None]:
#document_name = list_of_documents[15]

def predict_importance(document_name, threshold = 0.3):
    
    print('Document Name: ', document_name)
    print('====================================')
    doc = data[str(document_name)]['content']
    sentences = strippers(doc).split('. ')

    for sentence in sentences:
        initial_sentence = sentence
        sentence = [word for word in sentence.lower().split() if word not in stoplist]
        for i, word in enumerate(sentence):
                sentence[i] = stemmer.stem(word)
        if sentence != []:
            sentence = ' '.join(sentence)

            ############ Validation ###########

            vector = vectorizer.transform([sentence])
            is_important = clf.predict(pca.transform(vector.todense()))
            probability = clf.predict_proba(pca.transform(vector.todense()))
            if probability[0][1] > threshold:
                print(initial_sentence, probability[0][1])
                print('-----------------------------------------')
    print('========== Document End ===========')

predict_importance(document_name)   
    