# Topic modeling

Can we build interesting topics from the transcripts? WHO KNOWS. Let's try.

Following the code here: https://de.dariah.eu/tatom/topic_model_python.html

In [62]:
import os
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn import decomposition

### Get all transcript files:

In [18]:
CORPUS_PATH = "transcripts_no_timestamp"
filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

### Vectorize

In [55]:
vectorizer = text.TfidfVectorizer(input='filename', stop_words='english', min_df=1, use_idf=True)
dtm = vectorizer.fit_transform(filenames).toarray()
# Vocabulary terms, indexed by dtm column (get_feature_names_out needs scikit-learn >= 1.0).
vocab = np.array(vectorizer.get_feature_names_out())

### Build topic models

In [63]:
num_topics = 8
num_top_words = 10
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)
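
For reference, here's a minimal sketch of what the two NMF outputs look like, using a tiny made-up matrix in place of the real `dtm`. The names `toy_dtm`, `W`, and `H` are only for illustration.

```python
import numpy as np
from sklearn import decomposition

# Made-up non-negative "document-term" matrix: 4 documents, 5 terms.
toy_dtm = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.2],
])

toy_nmf = decomposition.NMF(n_components=2, random_state=1, max_iter=1000)
W = toy_nmf.fit_transform(toy_dtm)  # document-topic weights
H = toy_nmf.components_             # topic-term weights

print(W.shape)  # (4, 2): one row per document, one column per topic
print(H.shape)  # (2, 5): one row per topic, one column per term
# W @ H is a non-negative approximation of the original matrix.
print(np.round(W @ H, 2))
```

In the cell above, `doctopic` plays the role of `W` and `clf.components_` plays the role of `H`.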

## Output

In [90]:
topic_words = []
for idx, topic in enumerate(clf.components_):
    # Indices of the highest-weighted terms for this topic, largest first.
    word_idx = np.argsort(topic)[::-1][:num_top_words]
    topic_words.append("{}: {}".format(idx, ", ".join(vocab[i] for i in word_idx)))
topic_words

['0: like, image, data, gradient, pixel, satellite, color, want, canvas, going',
 '1: distribution, uncertainty, data, people, difference, heights, statistics, like, plots, observed',
 '2: charts, participants, pie, chart, bar, studies, data, line, chartbuilder, people',
 '3: like, vr, attention, just, thing, people, really, things, right, think',
 '4: data, seasonality, customers, illumination, week, really, workflows, time, product, day',
 '5: views, like, model, page, player, tennis, career, tournaments, wikipedia, data',
 '6: vega, brush, data, events, interaction, signals, event, know, signal, values',
 '7: net, neural, like, robot, words, things, data, images, going, network']
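
The top-words lookup boils down to one `argsort` trick. A small sketch with a made-up vocabulary and made-up weights for a single topic:

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
toy_vocab = np.array(["bar", "chart", "pie", "robot", "tennis"])
toy_topic = np.array([0.4, 0.9, 0.7, 0.0, 0.1])

num_top = 3
# argsort sorts ascending, so reverse it and keep the first num_top indices.
top_idx = np.argsort(toy_topic)[::-1][:num_top]
print(", ".join(toy_vocab[top_idx]))  # chart, pie, bar
```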

# Testing document fit

In [92]:
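import re

# Normalize each document's topic weights so they sum to 1.
doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
novel_names = []

for fn in filenames:
    basename = os.path.basename(fn)
    name, ext = os.path.splitext(basename)
    # Drop the leading "ovc2016" plus any digits/underscores (assumed filename pattern).
    # str.lstrip('ovc2016_0123456789') strips a set of characters instead, so it also
    # eats leading o/v/c letters of the name itself.
    name = re.sub(r'^ovc2016[_\d]*', '', name)
    novel_names.append(name)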

novel_names = np.asarray(novel_names)
doctopic_orig = doctopic.copy()
# Average the topic shares of all files that share a name.
num_groups = len(set(novel_names))
doctopic_grouped = np.zeros((num_groups, num_topics))
for i, name in enumerate(sorted(set(novel_names))):
    doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)

doctopic = doctopic_grouped
novels = sorted(set(novel_names))
# Print the three highest-weighted topics for each name.
print("Top NMF topics in...")
for i in range(len(doctopic)):
    top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
    top_topics_str = ' '.join(str(t) for t in top_topics)
    print("{}: {}".format(novels[i], top_topics_str))

Top NMF topics in...
albrecht: 0 4 1
armstrong: 5 3 4
ase: 2 3 5
becker: 0 7 6
binx: 6 7 5
bremer: 0 2 6
elliott: 4 3 1
halabi: 7 6 5
hu: 7 4 6
hullman: 4 3 7
ivo: 4 7 6
kosaka: 3 7 6
llins: 0 3 7
mcdonald: 1 0 7
mcnamara: 1 2 7
pearce: 3 6 0
satyanarayan: 3 7 6
waigl: 7 1 3
wattenberg_viegas: 5 7 6
wu: 2 7 6
yanofsky: 3 7 5
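
A note on the name cleanup: `str.lstrip` treats its argument as a set of characters, not a prefix, so it keeps stripping until it hits a character outside the set. That may be why a label like `llins` shows up in the list above. A minimal sketch with hypothetical filenames (the real ones may be formatted differently), comparing it to a regex that removes only the prefix:

```python
import re

# Hypothetical filenames; the real transcript names may be formatted differently.
toy_names = ["ovc2016_01_albrecht", "ovc2016_02_collins"]

for raw in toy_names:
    # lstrip removes any leading characters found in the given set.
    via_lstrip = raw.lstrip("ovc2016_0123456789")
    # The regex removes only the "ovc2016" prefix and the digits/underscores after it.
    via_regex = re.sub(r"^ovc2016[_\d]*", "", raw)
    print("{}: lstrip -> {}, regex -> {}".format(raw, via_lstrip, via_regex))

# ovc2016_01_albrecht: lstrip -> albrecht, regex -> albrecht
# ovc2016_02_collins: lstrip -> llins, regex -> collins
```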


# Final Thoughts

These look fairly uninteresting.
The word "like" appears in a lot of them, probably because it's a spoken filler that the `'english'` stop-word list doesn't remove.
It did produce some interesting groupings:

* Satellite & canvas talks
* Neural net talks
* Charting talks
* Stats talks

Not sure that there's anything we should do with this, but recording it here for future reference.

Also, clearly this algorithm isn't particularly solid. Jim suggested I try out http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html, but I'm calling it a bust.
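
If anyone picks this back up, one cheap thing to try for the "like" problem is extending the stop-word list with spoken fillers. This is a sketch only: the `filler_words` set is a hypothetical pick based on the topic output above, and the two documents are made up.

```python
import sklearn.feature_extraction.text as text

# Hypothetical filler words picked from the topic output; tune to taste.
filler_words = {"like", "just", "really", "thing", "things",
                "going", "know", "right", "think"}
# Combine scikit-learn's built-in English list with the fillers.
stop_words = list(text.ENGLISH_STOP_WORDS.union(filler_words))

# Made-up documents standing in for transcripts.
toy_docs = [
    "so like the pie chart is really just a chart",
    "you know the neural net is like going to learn",
]
toy_vectorizer = text.TfidfVectorizer(stop_words=stop_words)
toy_vectorizer.fit(toy_docs)
print(sorted(toy_vectorizer.vocabulary_))
```

Passing `max_df` to the vectorizer is another option, since it drops terms that appear in more than a given share of documents.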