input: preprocessed ndjson from ttx with giveaways removed

## Feature selection 

a) __Filter out non-meaningful Parts of Speech from all texts__.   
Only nouns, proper nouns, adjectives, verbs and adverbs are kept.
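
As a rough illustration of step a), the filter can be sketched on a document that is already POS-tagged. The `(token, tag)` pair format, the Universal Dependencies tag names and the `filter_pos` helper are assumptions made for this sketch; the project's `phraser` module handles this step internally and may do it differently.

```python
# Universal Dependencies tags assumed to correspond to the five kept categories.
KEPT_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}

def filter_pos(tagged_doc, kept=KEPT_POS):
    """Return the tokens of one document whose POS tag is in `kept`."""
    return [token for token, pos in tagged_doc if pos in kept]

# Toy tagged document, invented for illustration only.
doc = [("the", "DET"), ("new", "ADJ"), ("minister", "NOUN"),
       ("spoke", "VERB"), ("in", "ADP"), ("Copenhagen", "PROPN"),
       ("yesterday", "ADV")]
filtered = filter_pos(doc)
```

Determiners and adpositions are dropped here, while the content-bearing tokens survive.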


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
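
One common way to approximate step b) is a count-based collocation score: a pair is merged when it co-occurs more often than the individual token frequencies would suggest. The sketch below is an assumption about how such a step could look, not the implementation in `src.utility.phraser`; the function names, the scoring formula and the thresholds are all hypothetical.

```python
from collections import Counter

def find_phrases(docs, min_count=2, threshold=0.1):
    """Return the set of adjacent token pairs that co-occur often enough to merge."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # Discounted co-occurrence count relative to the individual frequencies.
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(doc, phrases, sep="_"):
    """Greedily join detected pairs, scanning the document left to right."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            out.append(doc[i] + sep + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

# Toy corpus, invented for illustration only.
docs = [
    ["climate", "change", "is", "real"],
    ["climate", "change", "policy", "is", "new"],
    ["climate", "change", "debate"],
    ["policy", "debate", "is", "open"],
]
phrases = find_phrases(docs, min_count=1, threshold=0.1)
merged = [merge_phrases(doc, phrases) for doc in docs]
```

In this toy corpus only `climate change` passes the threshold, so it becomes the single token `climate_change` while all other tokens are left untouched.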

In [None]:
import pandas as pd
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [None]:
# import preprocessed data
texts = load_data('fb_prep.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(texts)

# display a sample


## Seed selection

a) __Train a CBOW model__  
Used to find words related to each query phrase.  
Intentions behind the parameters:
- words that appear together in the whole FB post (window=20)
- frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. The enhanced list will be used to guide the topic model.
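
The idea behind step b) can be sketched as a nearest-neighbour lookup: for each query word, take the vocabulary words whose vectors have the highest cosine similarity to it. The helper names and the toy vectors below are hypothetical, and the sketch does not reproduce the `cutoff` argument of `get_related`, whose meaning is not shown in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related_words(vectors, query, topn=2):
    """Return the `topn` vocabulary words closest to `query` by cosine similarity."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scores[:topn]]

# Toy 2-d vectors, invented for illustration only.
vectors = {
    "tax": [1.0, 0.1],
    "taxes": [0.9, 0.2],
    "budget": [0.8, 0.4],
    "football": [0.1, 1.0],
}
expanded = {q: related_words(vectors, q, topn=2) for q in ["tax"]}
```

With real embeddings the candidates returned this way still need a manual check, since nearest neighbours often include antonyms and off-topic words.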

In [None]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

In [None]:
# import phrase list
query_list = import_query('path.csv')

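# train a cbow model (sg=0) on the phrased texts
# note: `size` and `iter` are the gensim < 4.0 names; gensim >= 4.0 renamed them to `vector_size` and `epochs`
cbow_texts = Word2Vec(texts_phrased,
                      size=300, window=20, min_count=100,
                      sg=0, hs=0,
                      iter=500, workers=4)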

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=100)

# save (to_csv returns None when given a path, so it cannot be chained)
query_related.to_csv('path.csv')

# display a sample
query_related.head()

The seeds now have to be reviewed and edited manually.

## Topic modeling

In [None]:
from src.lda.asymmetric import lda_grid_search_ASM
from src.lda.seeded import lda_grid_search_SED