# Distributed LDA with gensim

This is a work-in-progress (currently not working) attempt to distribute gensim's LDA across CDSW workers. Gensim supports distributed training through the [Python Remote Objects](https://pyro4.readthedocs.io/en/stable/) (Pyro4) library, and the Pyro setup is likely what we still need to figure out to get it working.

Install dependencies.

In [1]:
!pip install gensim[distributed]

You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.


In [2]:
import os
import time
from collections import defaultdict

import cdsw
from gensim import corpora, models



Launch a Pyro name server and fetch the IP address.

In [3]:
name_server = cdsw.launch_workers(
  n=1,
  cpu=1,
  memory=2,
  kernel="python3",
  code="!export PYRO_SERIALIZERS_ACCEPTED=pickle; export PYRO_SERIALIZER=pickle; python -m Pyro4.naming -n 0.0.0.0; while true; do sleep 10; done"
)

Pause so the name server worker has time to be assigned an IP address.

In [4]:
time.sleep(10)

In [5]:
name_server_ip = [
    worker["ip_address"] for worker in cdsw.list_workers()
    if worker["id"] == name_server[0]["id"]
][0]

The address should not be `'unknown'`. If it is, re-run the cell above.

In [6]:
name_server_ip

'100.100.75.90'
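Rather than sleeping for a fixed 10 seconds and re-running by hand, the lookup could be polled until the address stops being `'unknown'`. This is a minimal sketch: `wait_for_ip` is a hypothetical helper (not part of `cdsw`), and it only assumes that the listing function returns dicts with `id` and `ip_address` keys, as the cells above do.

```python
import time


def wait_for_ip(list_workers, worker_id, timeout=60.0, interval=2.0):
    """Poll list_workers() until the given worker reports a usable IP address."""
    deadline = time.monotonic() + timeout
    while True:
        for worker in list_workers():
            ip = worker.get("ip_address")
            # 'unknown' is the placeholder seen before the address is assigned
            if worker["id"] == worker_id and ip and ip != "unknown":
                return ip
        if time.monotonic() >= deadline:
            raise TimeoutError(f"worker {worker_id} has no IP address after {timeout}s")
        time.sleep(interval)


# Intended use in this notebook (assumes the cdsw module imported above):
# name_server_ip = wait_for_ip(cdsw.list_workers, name_server[0]["id"])
```

Passing the listing function in as an argument keeps the helper independent of `cdsw`, so it can be tried out with a stub.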

Launch some workers, with one gensim worker on each node.

In [7]:
workers = cdsw.launch_workers(
  n=3,
  cpu=1,
  memory=2,
  kernel="python3",
  code=f"!export PYRO_SERIALIZERS_ACCEPTED=pickle; export PYRO_SERIALIZER=pickle; python -m gensim.models.lda_worker --host {name_server_ip} --verbose"
)

Launch a gensim dispatcher.

In [8]:
dispatcher = cdsw.launch_workers(
  n=1,
  cpu=1,
  memory=2,
  kernel="python3",
  code=f"!export PYRO_SERIALIZERS_ACCEPTED=pickle; export PYRO_SERIALIZER=pickle; python -m gensim.models.lda_dispatcher --host {name_server_ip}; while true; do sleep 10; done"
)

In theory, this is all we need.

In [9]:
len(cdsw.list_workers())

8
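Before involving gensim at all, it may be worth checking that the name server's port is reachable from this session. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, and port 9090 is assumed to be the name server port because that is the port in the connection error quoted further down.

```python
import socket


def can_connect(host, port=9090, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Intended use in this notebook:
# can_connect(name_server_ip)
```

If this returns `False` for `name_server_ip`, the problem is at the network or process level rather than in gensim or the serializer settings.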

Hacked together from the gensim tutorials, this should give LDA something (very simple) to do.

In [10]:
# from the gensim tutorials:

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]
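For reference, the bag-of-words step above turns each token list into `(token_id, count)` pairs. The following is a plain-Python sketch of that idea, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids assigned here (in order of first appearance) may differ from the ones `corpora.Dictionary` assigns.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(text, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())


sample_texts = [["human", "interface", "computer"], ["system", "human", "system"]]
vocab = build_vocab(sample_texts)
sample_corpus = [to_bow(text, vocab) for text in sample_texts]
```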

Set the same environment variables on this host. I also tried running this host as an `lda_worker` from the terminal, which changed nothing, and running the `Pyro4.naming` name server on this node, which gives a serialization error.

In [11]:
os.environ["PYRO_HOST"] = name_server_ip
os.environ["PYRO_SERIALIZERS_ACCEPTED"] = "pickle"
os.environ["PYRO_SERIALIZER"] = "pickle"
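One thing that might be worth checking, offered as an untested assumption rather than a confirmed fix: Pyro4's configuration has separate `HOST` and `NS_HOST` items, where `HOST` is the address local daemons bind to and `NS_HOST` is where the name server is looked up, so `PYRO_NS_HOST` may be the variable that matters here. Pyro4 reads `PYRO_`-prefixed variables when it is imported, so they would need to be set before that happens. `configure_pyro_env` below is a hypothetical helper, and the default port of 9090 is taken from the connection error quoted just below.

```python
import os
import sys


def configure_pyro_env(ns_host, ns_port=9090, serializer="pickle"):
    """Set Pyro4-related environment variables.

    Returns True if Pyro4 had not been imported yet in this process,
    i.e. the variables were set early enough to be read at import time.
    """
    os.environ["PYRO_NS_HOST"] = ns_host        # where to look for the name server
    os.environ["PYRO_NS_PORT"] = str(ns_port)
    os.environ["PYRO_SERIALIZER"] = serializer
    os.environ["PYRO_SERIALIZERS_ACCEPTED"] = serializer
    return "Pyro4" not in sys.modules


# Intended use in this notebook:
# set_in_time = configure_pyro_env(name_server_ip)
```

A `False` return value would suggest restarting the kernel and setting the variables in the first cell.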

We'll get something like ```CommunicationError: cannot connect to ('localhost', 9090): [Errno 111] Connection refused```, and the call fails with:

In [12]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, distributed=True)

RuntimeError: failed to initialize distributed LDA (Pyro name server not found)
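Since the error says the name server was not found, another avenue to try is telling gensim where it is directly. `LdaModel` accepts an `ns_conf` dictionary that is passed on to the name server lookup. The sketch below only builds the keyword arguments, so it runs without gensim installed; `distributed_lda_kwargs` is a hypothetical helper, the `host` and `port` keys are assumed to match what the lookup accepts, and whether this resolves the failure here is untested.

```python
def distributed_lda_kwargs(ns_host, ns_port=9090, num_topics=3):
    """Build LdaModel keyword arguments that name the Pyro name server explicitly."""
    return {
        "num_topics": num_topics,
        "distributed": True,
        # Forwarded to the name server lookup; key names are an assumption
        "ns_conf": {"host": ns_host, "port": ns_port},
    }


# Intended use in this notebook:
# lda = models.LdaModel(
#     corpus=corpus, id2word=id2word, **distributed_lda_kwargs(name_server_ip)
# )
```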

Clean up after ourselves.

In [13]:
cdsw.stop_workers(*[worker["id"] for worker in name_server + dispatcher + workers])

[<Response [204]>,
 <Response [204]>,
 <Response [204]>,
 <Response [204]>,
 <Response [204]>]