<a href="https://colab.research.google.com/github/colorprint/idhcc/blob/master/lda2vec_test5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
!pip install pylda2vec jellyfish pyLDAvis
!python -m spacy download en_core_web_md

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Collecting funcy
  Downloading https://files.pythonhosted.org/packages/ce/4b/6ffa76544e46614123de31574ad95758c421aae391a1764921b8a81e1eae/funcy-1.14.tar.gz (548kB)
Building wheels for collected packages: pyLDAvis, funcy
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97712 sha256=5f4c1dfb44b82862e05291478a481611c4b25f82eb1dac0321466c39ef6ff2dc
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
  Building wheel for funcy (setup.py) ... done
  Created wheel for funcy: filename=funcy-1.14-py2.py3-none-any.whl size=32042 sha256=3ba1632b

In [5]:
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz 
!gunzip GoogleNews-vectors-negative300.bin.gz

--2020-09-26 11:11:22--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.224.187
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.224.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-09-26 11:11:44 (72.0 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [2]:
#DATASET
import logging
import pickle
#from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd
from lda2vec import preprocess, Corpus
logging.basicConfig()

# Adapted for our dataset: use the course descriptions as documents
df = pd.read_csv("https://raw.githubusercontent.com/colorprint/idhcc/master/idhcc-courses.csv", sep=';')
# Keep only rows that actually have a description
df_docs = df[df['course_desc'].notnull()]
texts = [str(row_val) for row_val in df_docs['course_desc']]

bad = set(["ax>", '`@("', '---', '===', '^^^'])


def clean(line):
    return ' '.join(w for w in line.split() if not any(t in w for t in bad))


# Preprocess data
max_length = 10000   # Limit of 10k words per document
# Clean each description once and drop any that end up empty
# (Python 3 strings are already unicode, which is what spaCy expects)
texts = [c for c in (clean(d) for d in texts) if len(c) > 0]
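
To see what the `clean` filter above does, here is a minimal sketch on a made-up string. The sample text is hypothetical and not taken from the course dataset. A word is dropped whenever it contains any of the `bad` substrings.

```python
# Same marker set and filter as in the cell above, applied to a toy string.
bad = {"ax>", '`@("', '---', '===', '^^^'}


def clean(line):
    # Keep only whitespace-separated words that contain none of the markers.
    return ' '.join(w for w in line.split() if not any(t in w for t in bad))


sample = "Intro ---- to digital === humanities"  # hypothetical input
cleaned = clean(sample)
print(cleaned)
```

Because the check is a substring test, a longer run such as `----` is removed as well, not only the exact marker.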

In [3]:
texts[0]

'This course will introduce students to some of the major concepts, practices, and implications involved in the use of digital technologies in the humanities – the group of academic disciplines interested in examining what it means to be human from cultural, historical, and philosophical perspectives. From the vantage point of these new ‘digital humanities’, we will examine the contemporary shift away from a predominantly print culture to one that is increasingly digital and online, while at the same time analysing and critiquing the emerging cultural practices that accompany this development. In so doing, we will seek to better understand the historical influence of new technologies on how we think of ourselves and our cultural heritage, both individually and collectively; how we interact socially and politically; how we determine public and private spaces in an increasingly connected world; and how we can use digital technologies to produce, preserve, and study cultural materials.'

In [4]:
tokens, vocab = preprocess.tokenize(texts, max_length, merge=False, n_threads=4)
np.save("vocab", vocab)
np.save("tokens", tokens)
#tokens = np.load("tokens.npy")
#vocab = np.load("vocab.npy")





In [5]:
#vocab = vocab.tolist()
#vocab = list(vocab.values())
corpus = Corpus()
# Make a ranked list of rare vs frequent words
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=30)
# Convert the compactified arrays into bag of words arrays
bow = corpus.compact_to_bow(pruned)
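# Words tend to have power law frequency, so selectively
# downsample the most prevalent words. The result is not used below;
# it gets its own name so that it does not shadow the clean() helper.
subsampled = corpus.subsample_frequent(pruned)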
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
doc_ids = np.arange(pruned.shape[0])
flattened, (doc_ids,) = corpus.compact_to_flat(pruned, doc_ids)
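
The compaction and flattening steps in the cell above can be pictured with plain NumPy on toy data. This is only a sketch of the idea, not the `Corpus` implementation: the skip marker `-1`, the frequency-based ordering, and all array values are assumptions made for illustration.

```python
import numpy as np

# Toy "sparse" token ids: 2 documents padded to length 4 with an assumed skip marker.
SKIP = -1
tokens = np.array([[1005, 42, 1005, SKIP],
                   [42, 77, SKIP, SKIP]])

# Compact index: map each distinct real id to 0..n-1 (here: most frequent first).
real = tokens[tokens != SKIP]
ids, counts = np.unique(real, return_counts=True)
order = np.argsort(-counts, kind="stable")
sparse_to_compact = {int(s): rank for rank, s in enumerate(ids[order])}

compact = np.full_like(tokens, SKIP)
for s, c in sparse_to_compact.items():
    compact[tokens == s] = c

# Flatten: drop the skips and remember which document each token came from.
doc_ids = np.repeat(np.arange(tokens.shape[0])[:, None], tokens.shape[1], axis=1)
mask = compact != SKIP
flattened = compact[mask]
flat_doc_ids = doc_ids[mask]
print(flattened, flat_doc_ids)
```

The two 1D arrays stay aligned position by position, which is the form the training loop consumes.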

In [6]:
d=dict()
for i in range(len(vocab)-2):
  d[i]=vocab[i]
vocab=d

In [7]:
assert flattened.min() >= 0
# Fill in the pretrained word vectors
n_dim = 300
fn_wordvc = 'GoogleNews-vectors-negative300.bin'
vectors, s, f = corpus.compact_word_vectors(vocab, filename=fn_wordvc)

In [9]:
# Save all of the preprocessed files
# Use context managers so the files are flushed and closed
with open('vocab.pkl', 'wb') as fh:
    pickle.dump(vocab, fh)
with open('corpus.pkl', 'wb') as fh:
    pickle.dump(corpus, fh)
np.save("flattened", flattened)
np.save("doc_ids", doc_ids)
np.save("pruned", pruned)
np.save("bow", bow)
np.save("vectors", vectors)

In [10]:
### MODEL
import os
import os.path
import pickle
import time
import shelve

import chainer
from chainer import cuda
from chainer import serializers
import chainer.optimizers as O
import numpy as np

from lda2vec import utils
from lda2vec import prepare_topics, print_top_words_per_topic, topic_coherence
from lda2vec import LDA2Vec

gpu_id = int(os.getenv('CUDA_GPU', 0))
cuda.get_device(gpu_id).use()
print("Using GPU:" + str(gpu_id))

Using GPU:0


In [11]:
#data_dir = os.getenv('data_dir', '../data/')
fn_vocab = 'vocab.pkl'
fn_corpus = 'corpus.pkl'
fn_flatnd = 'flattened.npy'
fn_docids = 'doc_ids.npy'
fn_vectors = 'vectors.npy'
with open(fn_vocab, 'rb') as fh:
    vocab = pickle.load(fh)
with open(fn_corpus, 'rb') as fh:
    corpus = pickle.load(fh)
flattened = np.load(fn_flatnd)
doc_ids = np.load(fn_docids)
vectors = np.load(fn_vectors)

In [12]:
# Model Parameters
# Number of documents
n_docs = doc_ids.max() + 1
# Number of unique words in the vocabulary
n_vocab = flattened.max() + 1
# 'Strength' of the Dirichlet prior; 200.0 seems to work well
clambda = 200.0
# Number of topics to fit
n_topics = int(os.getenv('n_topics', 20))
batchsize = 4096
# Power for neg sampling
power = float(os.getenv('power', 0.75))
# Initialize with pretrained word vectors
pretrained = bool(int(os.getenv('pretrained', True)))
# Sampling temperature
temperature = float(os.getenv('temperature', 1.0))
# Number of dimensions in a single word vector
n_units = int(os.getenv('n_units', 300))
# Get the string representation for every compact key
words = corpus.word_list(vocab)[:n_vocab]
# How many tokens are in each document
doc_idx, lengths = np.unique(doc_ids, return_counts=True)
doc_lengths = np.zeros(doc_ids.max() + 1, dtype='int32')
doc_lengths[doc_idx] = lengths
# Count all token frequencies
tok_idx, freq = np.unique(flattened, return_counts=True)
term_frequency = np.zeros(n_vocab, dtype='int32')
term_frequency[tok_idx] = freq
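
The counting pattern used for `doc_lengths` and `term_frequency` is easier to follow on a tiny example. The arrays below are made up; the point is that `np.unique(..., return_counts=True)` only reports ids that occur, so writing the counts into a zero array keeps a 0 for every id that never appears.

```python
import numpy as np

# Toy flattened corpus: 5 tokens in documents 0 and 2 (document 1 is empty).
doc_ids = np.array([0, 0, 0, 2, 2])
flattened = np.array([3, 1, 3, 0, 3])
n_vocab = flattened.max() + 1

# Tokens per document
doc_idx, lengths = np.unique(doc_ids, return_counts=True)
doc_lengths = np.zeros(doc_ids.max() + 1, dtype='int32')
doc_lengths[doc_idx] = lengths

# Occurrences per vocabulary id
tok_idx, freq = np.unique(flattened, return_counts=True)
term_frequency = np.zeros(n_vocab, dtype='int32')
term_frequency[tok_idx] = freq

print(doc_lengths.tolist())
print(term_frequency.tolist())
```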

In [13]:
for key in sorted(locals().keys()):
    val = locals()[key]
    if len(str(val)) < 100 and '<' not in str(val):
        print(key, val)

__ 
___ 
__doc__ Automatically created module for IPython interactive environment
__loader__ None
__name__ __main__
__package__ None
__spec__ None
_dh ['/content']
_i1 len(d)
_i3 texts[0]
_i6 d=dict()
for i in range(len(vocab)-2):
  d[i]=vocab[i]
vocab=d
bad {'^^^', '---', '`@("', 'ax>', '==='}
batchsize 4096
clambda 200.0
doc_ids [  0   0   0 ... 380 380 380]
f 200
flattened [ 14  17  11 ...  19 113   7]
fn_corpus corpus.pkl
fn_docids doc_ids.npy
fn_flatnd flattened.npy
fn_vectors vectors.npy
fn_vocab vocab.pkl
fn_wordvc GoogleNews-vectors-negative300.bin
gpu_id 0
i 4414
max_length 10000
n_dim 300
n_docs 381
n_topics 20
n_units 300
n_vocab 211
power 0.75
pretrained True
s 4211
temperature 1.0


In [14]:
### TRAINING MODEL
model = LDA2Vec(n_documents=n_docs, n_document_topics=n_topics,
                n_units=n_units, n_vocab=n_vocab, counts=term_frequency,
                n_samples=15, power=power, temperature=temperature)
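
The model receives `counts=term_frequency` and `power=0.75`, the latter described earlier as the power for negative sampling. In word2vec-style training, negatives are commonly drawn with probability proportional to `count ** power`. Whether the `LDA2Vec` sampler does exactly this internally is an assumption here; the sketch only shows what the exponent does to a toy frequency table.

```python
import numpy as np

counts = np.array([100, 10, 1], dtype=float)  # toy term frequencies
power = 0.75


def sampling_probs(counts, power):
    # Raise counts to the given power, then normalise to a distribution.
    weights = counts ** power
    return weights / weights.sum()


raw = sampling_probs(counts, 1.0)         # plain unigram distribution
smoothed = sampling_probs(counts, power)  # flattened distribution
print(raw.round(3), smoothed.round(3))
```

With a power below 1, frequent words are sampled somewhat less often and rare words somewhat more often than their raw frequency would suggest.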

In [15]:
if os.path.exists('lda2vec.hdf5'):
    print("Reloading from saved")
    serializers.load_hdf5("lda2vec.hdf5", model)
    
if pretrained:
    model.sampler.W.data[:, :] = vectors[:n_vocab, :]

In [16]:
model.to_gpu()
optimizer = O.Adam()
optimizer.setup(model)
clip = chainer.optimizer.GradientClipping(5.0)
optimizer.add_hook(clip)
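
The `GradientClipping(5.0)` hook rescales gradients whose overall L2 norm exceeds the threshold. The following NumPy sketch illustrates that idea under the assumption that the norm is taken jointly over all parameter gradients; `clip_by_global_norm` is a hypothetical helper, not Chainer code.

```python
import numpy as np


def clip_by_global_norm(grads, threshold):
    """Scale all gradient arrays by one common factor so their joint L2 norm <= threshold."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    rate = threshold / max(total_norm, 1e-12)
    if rate < 1.0:
        grads = [g * rate for g in grads]
    return grads, total_norm


grads = [np.array([3.0, 4.0]), np.array([12.0])]  # toy gradients, joint norm 13
clipped, norm_before = clip_by_global_norm(grads, threshold=5.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(norm_before, norm_after)
```

Because every array is scaled by the same factor, the direction of the update is preserved and only its length changes.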

In [17]:
j = 0
epoch = 0
fraction = batchsize * 1.0 / flattened.shape[0]
progress = shelve.open('progress.shelve')
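
The training loop iterates over `utils.chunks(batchsize, doc_ids, flattened)` and scales the prior by `fraction`. The sketch below shows what such a chunking helper might look like; its behaviour is inferred from how it is called, so treat this `chunks` as a hypothetical stand-in and not the library function. The toy lists are made up.

```python
def chunks(n, *arrays):
    """Yield aligned slices of at most n items from each input sequence."""
    length = len(arrays[0])
    for start in range(0, length, n):
        yield tuple(a[start:start + n] for a in arrays)


doc_ids = [0, 0, 0, 1, 1, 2, 2]      # toy document id per token
flattened = [5, 3, 5, 1, 0, 2, 5]    # toy token ids
batchsize = 3

# Share of the corpus covered by one full batch
fraction = batchsize * 1.0 / len(flattened)
batches = list(chunks(batchsize, doc_ids, flattened))
for d, f in batches:
    print(d, f)
```

One way to read `fraction`: the prior is a corpus-level term, so each minibatch adds only its share of it, and over one pass the shares add up to roughly the full prior.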

In [18]:
for epoch in range(1):
    data = prepare_topics(cuda.to_cpu(model.mixture.weights.W.data).copy(),
                          cuda.to_cpu(model.mixture.factors.W.data).copy(),
                          cuda.to_cpu(model.sampler.W.data).copy(),
                          words)
    top_words = print_top_words_per_topic(data)
    if j % 100 == 0 and j > 100:
        coherence = topic_coherence(top_words)
        # Use a separate loop variable so the batch counter j is not overwritten
        for t in range(n_topics):
            print(t, coherence[(t, 'cv')])
        kw = dict(top_words=top_words, coherence=coherence, epoch=epoch)
        progress[str(epoch)] = pickle.dumps(kw)
    data['doc_lengths'] = doc_lengths
    data['term_frequency'] = term_frequency
    np.savez('topics.pyldavis', **data)
    print(epoch)
    for d, f in utils.chunks(batchsize, doc_ids, flattened):
        t0 = time.time()
        model.cleargrads()
        #optimizer.use_cleargrads(use=False)
        l = model.fit_partial(d.copy(), f.copy())
        print("after partial fitting:", l)
        prior = model.prior()
        loss = prior * fraction
        loss.backward()
        optimizer.update()
        msg = ("J:{j:05d} E:{epoch:05d} L:{loss:1.3e} "
               "P:{prior:1.3e} R:{rate:1.3e}")
        prior.to_cpu()
        loss.to_cpu()
        t1 = time.time()
        dt = t1 - t0
        rate = batchsize / dt
        logs = dict(loss=float(l), epoch=epoch, j=j,
                    prior=float(prior.data), rate=rate)
        print(msg.format(**logs))
        j += 1
    serializers.save_hdf5("lda2vec.hdf5", model)

Top words in topic 0 examines explores fundamental cultural skip programming problems practices contemporary forms
Top words in topic 1 principles technical world you these skills we methods techniques production
Top words in topic 2 introduction access use user various out_of_vocabulary related cultural include methods
Top words in topic 3 projects techniques into technologies between project explore public explores debates
Top words in topic 4 methods create creative content ways tools cultural techniques relevant skills
Top words in topic 5 making processes world systems major introduce computational theoretical approaches at
Top words in topic 6 module ( out_of_vocabulary unit ; writing digital knowledge communication tools
Top words in topic 7 topics how tools processes concepts not issues ways principles understand
Top words in topic 8 students computational student languages based programming critically humanities science social
Top words in topic 9 practice topics technologies 

In [19]:
### VIZ
%matplotlib inline

In [22]:
from sklearn.datasets import fetch_20newsgroups
from lda2vec import preprocess, Corpus
import matplotlib.pyplot as plt
import numpy as np
import seaborn
import warnings
import pyLDAvis

In [25]:
pyLDAvis.enable_notebook()
warnings.filterwarnings('ignore')

In [26]:
npz = np.load('topics.pyldavis.npz')
# .items() replaces the Python 2 style .iteritems()
dat = {k: v for (k, v) in npz.items()}
dat['vocab'] = dat['vocab'].tolist()

In [28]:
top_n = 10
topic_to_topwords = {}
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'Topic %i '  % j
    top_words = [dat['vocab'][i].strip()[:35] for i in top]
    msg += ' '.join(top_words)
    print(msg)
    topic_to_topwords[j] = top_words

Topic 0 examines explores fundamental cultural skip programming problems practices contemporary forms
Topic 1 principles technical world you these skills we methods techniques production
Topic 2 introduction access use user various out_of_vocabulary related cultural include methods
Topic 3 projects techniques into technologies between project explore public explores debates
Topic 4 methods create creative content ways tools cultural techniques relevant skills
Topic 5 making processes world systems major introduce computational theoretical approaches at
Topic 6 module ( out_of_vocabulary unit ; writing digital knowledge communication tools
Topic 7 topics how tools processes concepts not issues ways principles understand
Topic 8 students computational student languages based programming critically humanities science social
Topic 9 practice topics technologies explores related culture examines principles a cultural
Topic 10 project opportunity development develop projects writing creative
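
The top-word lookup in the cell above relies on `np.argsort`, which sorts in ascending order, so the result is reversed before taking the first `top_n` indices. A small sketch with a made-up vocabulary and made-up topic-term rows:

```python
import numpy as np

vocab = ['digital', 'history', 'code', 'media']  # toy vocabulary
topic_term_dists = np.array([[0.1, 0.6, 0.1, 0.2],
                             [0.5, 0.0, 0.4, 0.1]])
top_n = 2

top_words = {}
for j, dist in enumerate(topic_term_dists):
    # Indices of the top_n largest probabilities, largest first
    top = np.argsort(dist)[::-1][:top_n]
    top_words[j] = [vocab[i] for i in top]

print(top_words)
```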

In [29]:
prepared_data = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'], 
                                 dat['doc_lengths'] * 1.0, dat['vocab'], dat['term_frequency'] * 1.0, mds='tsne')
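
To my understanding, `pyLDAvis.prepare` validates that its inputs agree in shape and that the rows of both distribution matrices sum to 1. A quick pre-check can make a failure easier to diagnose. `check_lda_inputs` below is a hypothetical helper written for this notebook, not part of pyLDAvis, and the toy arrays are made up.

```python
import numpy as np


def check_lda_inputs(topic_term, doc_topic, doc_lengths, vocab, term_frequency):
    """Return a list of problems; an empty list means everything lines up."""
    problems = []
    n_topics, n_terms = topic_term.shape
    n_docs = doc_topic.shape[0]
    if doc_topic.shape[1] != n_topics:
        problems.append("doc_topic columns != number of topics")
    if len(vocab) != n_terms or len(term_frequency) != n_terms:
        problems.append("vocab / term_frequency length != number of terms")
    if len(doc_lengths) != n_docs:
        problems.append("doc_lengths length != number of documents")
    if not np.allclose(topic_term.sum(axis=1), 1.0):
        problems.append("topic_term rows do not sum to 1")
    if not np.allclose(doc_topic.sum(axis=1), 1.0):
        problems.append("doc_topic rows do not sum to 1")
    return problems


# Toy inputs: 2 topics, 3 terms, 2 documents.
topic_term = np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
doc_topic = np.array([[0.9, 0.1], [0.4, 0.6]])
ok = check_lda_inputs(topic_term, doc_topic, [10, 7], ['a', 'b', 'c'], [5, 6, 6])
broken = check_lda_inputs(topic_term * 2, doc_topic, [10], ['a', 'b', 'c'], [5, 6, 6])
print(ok, broken)
```

In this notebook the call would be `check_lda_inputs(dat['topic_term_dists'], dat['doc_topic_dists'], dat['doc_lengths'], dat['vocab'], dat['term_frequency'])`.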

In [30]:
pyLDAvis.display(prepared_data)