## First tries with LDA
A first attempt at running Spark's LDA on a corpus preprocessed with gensim.

### Spark and gensim mixed

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora
from gensim.matutils import corpus2dense
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

In [None]:
# Build the gensim dictionary; each lemma is assigned an integer id
dictionary = corpora.Dictionary(wlp_bytext.rdd.map(lambda r: r[1]).collect())

In [None]:
# The dictionary object also stores some useful statistics
print('Number of documents in corpus: \t', dictionary.num_docs)
print('Number of words in corpus: \t', dictionary.num_pos)
print('Number of tokens in dictionary: ', len(dictionary.token2id))

In [None]:
# Class that builds the gensim corpus object.
# So far this is the only way I have found to go from sparse to dense vectors
# (through gensim's corpus2dense function).
class MyCorpus(object):
    def __iter__(self):
        # Yield one bag-of-words (list of (token_id, count) pairs) per document
        for line in wlp_bytext.rdd.map(lambda r: r[1]).collect():
            yield dictionary.doc2bow(line)

In [None]:
# Create the corpus and turn it into a format that Spark accepts
corpus = MyCorpus()
# Convert from sparse to dense representation (transposed so that each row is one document)
data = sc.parallelize(corpus2dense(corpus, num_terms=len(dictionary.token2id), num_docs=dictionary.num_docs).T)
# Possibly not strictly necessary, but the rows are converted to Spark dense vectors (may be faster)
parsedData = data.map(lambda line: Vectors.dense(line))
# Index documents with unique IDs, giving [doc_id, vector] pairs
corpus_rdd = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
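
To make the sparse-to-dense step above concrete, here is a minimal pure-Python sketch of what happens to a single document. The helper names (`build_token2id`, `doc_to_bow`, `bow_to_dense`) and the toy data are hypothetical stand-ins for illustration only; they are not gensim's implementation, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Sparse form: sorted (token_id, count) pairs, similar in shape to doc2bow output."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def bow_to_dense(bow, num_terms):
    """Dense form: one slot per vocabulary entry, zero where the token is absent."""
    row = [0.0] * num_terms
    for token_id, count in bow:
        row[token_id] = float(count)
    return row

toy_docs = [["cat", "sat", "cat"], ["dog", "sat"]]
toy_token2id = build_token2id(toy_docs)
toy_bows = [doc_to_bow(doc, toy_token2id) for doc in toy_docs]
toy_dense = [bow_to_dense(bow, len(toy_token2id)) for bow in toy_bows]
print(toy_bows)
print(toy_dense)
```

The dense rows grow with the vocabulary size, which may contribute to memory pressure on a real corpus. `Vectors.sparse(size, pairs)` in `pyspark.mllib.linalg` accepts a list of (index, value) pairs, so it might be possible to feed the bag-of-words output to Spark directly and skip the dense conversion; this has not been tried here.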

In [None]:
# Train the model. This currently crashes; it should work, and the failure is probably just due to a lack of resources.
ldas = LDA.train(corpus_rdd, k=10)

# Output the topics. The matrix only holds word ids, not strings, so they still need to be
# mapped back through the dictionary (this would be even harder without gensim).
# Disabled until training succeeds.
'''print("Learned topics (as distributions over vocab of " + str(ldas.vocabSize())
      + " words):")
topics = ldas.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldas.vocabSize()):
        print(" " + str(topics[word][topic]))'''
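
Mapping the topic weights back to strings could look roughly like the sketch below. It assumes a topics matrix indexed as `[word_id][topic]`, as in the loop above, and an id-to-token lookup. The function `top_words` and the toy values are hypothetical, and nothing here has been run against a trained Spark model.

```python
def top_words(topics_matrix, id2token, topic, n=3):
    """Return the n highest-weighted (token, weight) pairs for one topic."""
    weighted = [(row[topic], word_id) for word_id, row in enumerate(topics_matrix)]
    # Sort by descending weight; break ties by word id for a stable order
    weighted.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(id2token[word_id], weight) for weight, word_id in weighted[:n]]

# Toy stand-ins: 3 words, 2 topics
toy_id2token = {0: "cat", 1: "sat", 2: "dog"}
toy_topics = [
    [5.0, 0.5],  # weights of "cat" in topic 0 and topic 1
    [1.0, 1.0],  # weights of "sat"
    [0.2, 4.0],  # weights of "dog"
]

for topic in range(2):
    print("Topic", topic, top_words(toy_topics, toy_id2token, topic, n=2))
```

With the real objects, the gensim `dictionary` supports lookup by id (`dictionary[word_id]`), so it could presumably play the role of `toy_id2token`.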

### Full gensim
The same pipeline, done entirely with gensim, which is practical and concise. Word selection could also be done at this stage: see the dictionary methods `filter_extremes` and `filter_n_most_frequent` [here](https://radimrehurek.com/gensim/corpora/dictionary.html).
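
As an illustration of the kind of word selection meant here, the sketch below filters tokens by document frequency in plain Python. It mirrors the idea behind the `no_below` and `no_above` parameters of `filter_extremes`, but the function `filter_by_doc_freq` and the sample data are hypothetical, and gensim's own behaviour (for example its additional `keep_n` cap) may differ in detail.

```python
from collections import Counter

def filter_by_doc_freq(documents, no_below=2, no_above=0.5):
    """Keep tokens that occur in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(documents)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in documents]

sample_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
filtered_docs = filter_by_doc_freq(sample_docs, no_below=2, no_above=0.5)
print(filtered_docs)
```

Here "the" is dropped for appearing in every document, and words seen in only one document are dropped as too rare.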

In [None]:
from gensim.models.ldamodel import LdaModel

In [None]:
corpus = MyCorpus()
ldag = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=100, passes=5)

In [None]:
ldag.print_topics()