## Training notebook

This notebook covers the last stages of preprocessing and the training of LDA and auxiliary models such as CountVectorizer.

In [6]:
%matplotlib inline
#adding this to avoid memory errors
%env JOBLIB_TEMP_FOLDER=/tmp
import numpy as np

from tqdm import tqdm_notebook as tqdm

from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.model_selection import train_test_split
import pickle

import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem
import re

from os.path import join
from glob import glob

from time import time

env: JOBLIB_TEMP_FOLDER=/tmp


In [8]:
#reading stopwords and train set names
with open('../Data/stopwords/stopwords.pkl', 'rb') as f:
    stopwords = pickle.load(f)
with open('../Data/train_names.pkl', 'rb') as f:
    train_names = pickle.load(f)
print('Got %d stopwords.\nGot %d texts in training corpus.'%(len(stopwords), len(train_names)))

Got 796 stopwords.
Got 3738 texts in training corpus.


## Idf filtering
We have about 700k unique words even after removing numbers and punctuation.

So every term-doc matrix would be immensely large, N x 700k, where N is also large.

Most of those words are uninformative and ought to be filtered out, so we filter them by IDF (Inverse Document Frequency) thresholding.
This removes both very common and very rare words. The thresholds were set manually and then tuned over a few rounds of training.
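
To make the thresholds more tangible, an IDF value can be translated back into a document count. The sketch below assumes scikit-learn's unsmoothed definition, idf = ln(N / df) + 1, which is what `smooth_idf=False` selects. The helper name `idf_to_df` is made up for illustration; 3 and 6 are the thresholds used in the cell that follows, and 3738 is the training corpus size printed above.

```python
import math

def idf_to_df(idf, n_docs):
    """Invert idf = ln(n_docs / df) + 1 to recover the document frequency df."""
    return n_docs * math.exp(1.0 - idf)

n_docs = 3738                      # size of the training corpus
max_df = idf_to_df(3.0, n_docs)    # idf > 3  <=>  df < max_df
min_df = idf_to_df(6.0, n_docs)    # idf < 6  <=>  df > min_df
print("keep words found in more than %.1f and fewer than %.1f documents" % (min_df, max_df))
```

Under that assumption, the kept words are those that occur in roughly 26 to 505 of the 3738 training texts.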

In [12]:
#'training' (tf-)idf vectorizer.
tf_idf = TfidfVectorizer(input='filename',
                             stop_words=stopwords,
                             smooth_idf=False
                         )
tf_idf.fit(train_names)
#getting idfs
idfs = tf_idf.idf_
#sorting out too rare and too common words
#previously tried thresholds: 1.3 and 7 (original); 2 and 6
lower_thresh = 3.
upper_thresh = 6.
not_often = idfs > lower_thresh
not_rare = idfs < upper_thresh

#elementwise AND of the two boolean masks
mask = not_often & not_rare

good_words = np.array(tf_idf.get_feature_names())[mask]
#dropping tokens that start with a digit and stripping leading underscores
cleaned = []
for word in good_words:
    word = re.sub(r"^(\d+\w*$|_+)", "", word)
    
    if len(word) == 0:
        continue
    cleaned.append(word)
print("Len of original vocabulary: %d\nAfter filtering: %d"%(idfs.shape[0], len(cleaned)))

Len of original vocabulary: 718729
After filtering: 36450


### Filtering results

We reduced the vocabulary from 718729 to 36450 words, almost a 20-fold reduction.

### Stemming

We will further reduce the size of our vocabulary by stemming Russian words.

This is easily done with pymystem3 (https://github.com/nlpub/pymystem3),
which reduces Russian words to their dictionary form without mangling them and leaves English words alone.

In [13]:
#Stemming
m = Mystem()
stemmed = set()
voc_len = len(cleaned)
for i in tqdm(range(voc_len)):
    word = cleaned.pop()
    stemmed_word = m.lemmatize(word)[0]
    stemmed.add(stemmed_word)
    
stemmed = list(stemmed)
print('After stemming: %d'%(len(stemmed)))



After stemming: 13611


## Term-doc matrix

To construct the term-doc matrix we will use CountVectorizer from sklearn.

In [14]:
#training count vectorizer
voc = {word : i for i,word in enumerate(stemmed)}

count_vect = CountVectorizer(input='filename',
                             stop_words=stopwords,
                             vocabulary=voc)

dataset = count_vect.fit_transform(train_names)
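
To illustrate what a fixed `vocabulary` does, here is a minimal sketch on a toy corpus. The three sentences and the two-word vocabulary are invented for the example, and it passes raw strings rather than file names (`input='filename'` is what the cell above uses). Each vocabulary word gets the column given by its index, and tokens outside the vocabulary are simply not counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus and vocabulary, made up for illustration
docs = ["cat sleeps", "cat and dog", "bird sings"]
toy_voc = {"cat": 0, "dog": 1}

toy_vect = CountVectorizer(vocabulary=toy_voc)
# with a fixed vocabulary nothing is learned from the data;
# fit_transform only counts occurrences of the listed words
X = toy_vect.fit_transform(docs)

print(X.toarray())
```

The last document contains no vocabulary words, so its row is all zeros. Rows like that carry no information for LDA, which is worth keeping in mind when the vocabulary is filtered aggressively.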

### LDA
Finally, we train LDA.

All the hyperparameters were set by intuition and then tuned over a few rounds.

In [16]:
#training LDA
lda = LDA(n_components = 60, max_iter=30, n_jobs=6, learning_method='batch', verbose=1)
lda.fit(dataset)

iteration: 1 of max_iter: 30
iteration: 2 of max_iter: 30
iteration: 3 of max_iter: 30
iteration: 4 of max_iter: 30
iteration: 5 of max_iter: 30
iteration: 6 of max_iter: 30
iteration: 7 of max_iter: 30
iteration: 8 of max_iter: 30
iteration: 9 of max_iter: 30
iteration: 10 of max_iter: 30
iteration: 11 of max_iter: 30
iteration: 12 of max_iter: 30
iteration: 13 of max_iter: 30
iteration: 14 of max_iter: 30
iteration: 15 of max_iter: 30
iteration: 16 of max_iter: 30
iteration: 17 of max_iter: 30
iteration: 18 of max_iter: 30
iteration: 19 of max_iter: 30
iteration: 20 of max_iter: 30
iteration: 21 of max_iter: 30
iteration: 22 of max_iter: 30
iteration: 23 of max_iter: 30
iteration: 24 of max_iter: 30
iteration: 25 of max_iter: 30
iteration: 26 of max_iter: 30
iteration: 27 of max_iter: 30
iteration: 28 of max_iter: 30
iteration: 29 of max_iter: 30
iteration: 30 of max_iter: 30


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=30, mean_change_tol=0.001,
             n_components=60, n_jobs=6, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=1)
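
One possible way to make the tuning rounds less dependent on intuition is to compare held-out perplexity for a few values of `n_components`. The sketch below is not the procedure used in this notebook: it runs on a small random count matrix as a stand-in for `dataset`, and the candidate values 2 and 4 are arbitrary.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.model_selection import train_test_split

# random counts standing in for the real term-doc matrix (assumption)
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(40, 30))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scores = {}
for k in (2, 4):
    model = LDA(n_components=k, max_iter=5,
                learning_method='batch', random_state=0)
    model.fit(X_train)
    # lower perplexity on unseen documents is better
    scores[k] = model.perplexity(X_test)

for k, score in scores.items():
    print("n_components=%d, held-out perplexity=%.2f" % (k, score))
```

Perplexity does not always agree with how interpretable the topics look, so it is best treated as one signal alongside manual inspection.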

In [18]:
joblib.dump(lda, '../Data/models/lda.pkl')
joblib.dump(count_vect, '../Data/models/countVect.pkl')
joblib.dump(tf_idf,'../Data/models/tf_idf.pkl')

['../Data/models/tf_idf.pkl']
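
For completeness, a saved model can be read back with `joblib.load`. The sketch below round-trips a small CountVectorizer through a temporary directory; the toy vocabulary is invented, and it assumes the standalone `joblib` package is installed (newer scikit-learn versions no longer ship `sklearn.externals.joblib`).

```python
import os
import tempfile

import joblib  # standalone package; assumed to be installed
from sklearn.feature_extraction.text import CountVectorizer

# toy vectorizer with a fixed vocabulary, made up for illustration
vect = CountVectorizer(vocabulary={"cat": 0, "dog": 1})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "countVect.pkl")
    joblib.dump(vect, path)
    restored = joblib.load(path)

# the restored object behaves like the original one
counts = restored.transform(["cat dog cat"]).toarray()
print(counts)
```

Pickled models are generally only safe to load with the same library versions that wrote them, so it helps to record the scikit-learn version next to the files.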