In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [3]:
%%time
import numpy as np
import pandas as pd
import pickle

# None disables column-width truncation (-1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)

PATH = 'Задача 1/input/'

df = pd.read_csv(PATH + 'train.csv', index_col='doc_id')
test_df = pd.read_csv(PATH + 'test.csv', index_col='doc_id')

CPU times: user 11.4 s, sys: 2.02 s, total: 13.5 s
Wall time: 13.7 s


## Sklearn Latent Dirichlet Allocation

### Training

First, I convert the data to a bag-of-words representation.

In [10]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopWords = stopwords.words('russian')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
%%time
from sklearn.feature_extraction.text import CountVectorizer

features = 100000

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=features,
                                stop_words=stopWords)

tf = tf_vectorizer.fit_transform(df['text'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()

CPU times: user 1min 24s, sys: 1.59 s, total: 1min 26s
Wall time: 1min 26s
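
As a rough illustration of the bag-of-words conversion above, here is a minimal standard-library sketch on a toy corpus. The helper `bag_of_words` and the toy documents are hypothetical and are not part of the notebook's pipeline; it splits on whitespace only, whereas `CountVectorizer` also applies its own tokenizer, the stop-word list, and the `min_df`/`max_df` filters.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count rows) for a list of raw text documents."""
    # Naive tokenization: lowercase and split on whitespace (an assumption,
    # not what CountVectorizer does internally).
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for tokens missing from this document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["cat sat", "cat cat ran"])
print(vocabulary)
print(rows)
```

Each row is one document and each column one vocabulary term, which is the same layout as the sparse matrix `tf`, only dense and much smaller.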


I choose 50 topics for LDA.

In [0]:
from sklearn.decomposition import LatentDirichletAllocation
n_components = 50
lda = LatentDirichletAllocation(n_components=n_components, max_iter=1,
                                learning_method='online', random_state=17,
                                verbose=True, n_jobs=-1)

In [0]:
lda.fit(tf)

In [0]:
pkl_filename = "/ITMO/pickle_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(lda, file)

In [0]:
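def top_words(model, feature_names, n_top_words):
  # Collect the n_top_words highest-weight tokens for each topic
  l = []
  for topic in model.components_:
    # argsort is ascending, so the reversed slice yields the largest weights first
    topic_tokens = " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
    l.append(topic_tokens)
  return pd.Series(data=l, name='topic_tokens')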

In [0]:
topics_to_tokens = top_words(lda, tf_feature_names, 5)

In [112]:
topics_to_tokens[:10]

0    серия порядке сегодня сезон сериал     
1    любовь любить сердце люди знаешь       
2    людей люди хочется жизни умеешь        
3    жизнь жизни человека люди людей        
4    минут масло воды затем масла           
5    окружающим людей уважение людям следует
6    бизнес страница теги como сайт         
7    лишь свет руки любви детей             
8    женщина говорит мужчина муж сказала    
9    репост выбираем авы группу рандомно    
Name: topic_tokens, dtype: object
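
The slice `[:-n_top_words - 1:-1]` used in `top_words` can be hard to read, so here is a small sketch of what it selects. The `weights` array is made up for illustration and stands in for one row of `lda.components_`.

```python
import numpy as np

# Hypothetical topic-word weights for a five-word vocabulary.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
n_top_words = 3

# argsort returns indices in ascending order of weight; walking the result
# backwards and stopping after n_top_words items gives the indices of the
# largest weights, largest first.
top_idx = weights.argsort()[:-n_top_words - 1:-1]
print(top_idx)
```

Here the indices come back in order of decreasing weight (0.7, then 0.4, then 0.2), which is why the first token listed for each topic is its most heavily weighted one.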

### Testing

In [49]:
%%time
test_tf = tf_vectorizer.transform(test_df['text'])
ans = lda.transform(test_tf)

CPU times: user 10.3 s, sys: 117 ms, total: 10.4 s
Wall time: 10.4 s


In [0]:
test_df['label'] = np.argmax(ans, axis=1)

Save the topic number for every document and the top words for every topic.

In [0]:
test_df.to_csv('ITMO/labels.csv', columns=['label'], index=True)
topics_to_tokens.to_csv('ITMO/topics_tokens.csv', 
                        index_label='topic_id', header=True)

### Evaluation

#### Quantitative metric

To evaluate LDA quantitatively we can use [perplexity](http://qpleple.com/perplexity-to-evaluate-topic-models/). Perplexity is computed from the log-likelihood of a held-out test set, which is the most common way to evaluate a probabilistic model; lower perplexity means the model fits the held-out data better.
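
To make the link between log-likelihood and perplexity concrete, here is a minimal sketch that computes perplexity as the exponential of the negative average per-token log-likelihood. The helper name `perplexity_from_log_probs` and the toy log-probabilities are assumptions for illustration. Note that scikit-learn's `perplexity` uses an approximate variational bound on the log-likelihood rather than exact per-token probabilities.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """exp(-(1/N) * sum of log p(token)) over N held-out tokens."""
    n_tokens = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n_tokens)


# Toy case: a model that assigns every token probability 1/4,
# i.e. a uniform guess over a four-word vocabulary.
vocab_size = 4
log_probs = [math.log(1 / vocab_size)] * 10
toy_perplexity = perplexity_from_log_probs(log_probs)
print(toy_perplexity)
```

A model that spreads probability uniformly over V words has perplexity V, so perplexity can be read as an effective number of equally likely word choices.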

In [31]:
lda.perplexity(test_tf)

31362.778725074197

The perplexity is quite large. We can therefore train models with different numbers of topics and compare their perplexity on held-out data to find a better number of topics.
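
The search over topic counts described above could be sketched as follows. This is only an outline under stated assumptions: the helper `perplexity_by_topic_count`, the toy corpus, the candidate values, and `max_iter=5` are all hypothetical, and on the real data `tf` and a held-out matrix would take the place of the toy matrices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def perplexity_by_topic_count(train_counts, heldout_counts, candidates,
                              random_state=17):
    """Fit one LDA model per candidate topic count; return {k: perplexity}."""
    scores = {}
    for k in candidates:
        model = LatentDirichletAllocation(n_components=k, max_iter=5,
                                          learning_method='online',
                                          random_state=random_state)
        model.fit(train_counts)
        # Perplexity is measured on documents the model was not fitted on.
        scores[k] = model.perplexity(heldout_counts)
    return scores


# Toy corpus (made up) so that the sketch runs in well under a second.
toy_train = ["cat dog pet animal", "dog pet walk park", "cat pet sleep",
             "stock market price trade", "market trade bank money",
             "bank money price stock"]
toy_heldout = ["cat dog park", "stock bank money"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(toy_train)
X_heldout = vectorizer.transform(toy_heldout)

scores = perplexity_by_topic_count(X_train, X_heldout, [2, 3, 4])
best_k = min(scores, key=scores.get)  # lowest held-out perplexity
print(scores, best_k)
```

Because each candidate requires a full fit, on the full corpus it may be more practical to sweep a handful of values on a subsample first.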

### Possible improvements

1) Choose the number of topics based on perplexity - [1]

2) Change *max_df* of *CountVectorizer* to 0.6 as advised here - [1]

3) Lemmatization/stemming

4) *TF-IDF* weighting instead of the raw counts from *CountVectorizer*

[[1]](https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus)