<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/umbcdata602/fall2020/blob/master/lab_topic_modeling.ipynb">
<img src="http://introtodeeplearning.com/images/colab/colab.png?v2.0"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# Lab: topic modeling

Reference: Raschka's [ch08.ipynb](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch08/ch08.ipynb) -- github


In [1]:
# Read movie reviews from CSV in Raschka's github repo
# This cell replaces cells 2, 3 & 4
import os
import sys
import time
import pandas as pd
import urllib.request

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()

target = "movie_data.csv.gz"
source = "https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/" + target
if not os.path.isfile(target):
    urllib.request.urlretrieve(source, target, reporthook)

df = pd.read_csv(target, compression='gzip')

assert df.shape == (50000, 2)
df.head(3)

100% | 25 MB | 6.18 MB/s | 4 sec elapsed

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [2]:
# Cell 46
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)
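
To see what `stop_words` and `max_df` do, here is a minimal sketch on a four-document toy corpus. The corpus and the names `docs`, `toy_count`, and `toy_X` are made up for illustration and are not part of the lab data. With `max_df=0.5`, a word that appears in more than half of the documents is dropped from the vocabulary, in the same way `max_df=.1` drops words that appear in more than 10% of the reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the movie was great fun",
    "the movie was awful",
    "a great war movie",
    "the family drama was sad",
]

# stop_words='english' removes common function words ("the", "was", ...).
# max_df=0.5 removes words found in more than 50% of the documents.
toy_count = CountVectorizer(stop_words='english', max_df=0.5, max_features=5000)
toy_X = toy_count.fit_transform(docs)

# "movie" occurs in 3 of 4 documents, so it is dropped;
# "great" occurs in exactly 2 of 4, which does not exceed the threshold.
print(sorted(toy_count.vocabulary_))
print(toy_X.shape)  # (number of documents, vocabulary size)
```

The result is a sparse document-term matrix with one row per document and one column per retained word.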

### Expectation-maximization algorithm

* Goal: Estimate the parameters of a statistical model, typically one with unobserved (latent) variables, so you can use it to make predictions.
    * With a generative model, you can compute the probability (likelihood) of the data given a set of parameters.
* Expectation -- using the current parameter estimates, infer the unobserved quantities (e.g., which class or cluster each sample belongs to)
* Maximization -- update the parameters by optimizing a "fitness" function given those inferred values (in standard E-M, the expected log-likelihood)
* VanderPlas demonstrates E-M with K-means [05.11-k-means](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
    * Initialize: randomly choose K cluster centers
        * i.e., initialize parameters of statistical model
    * Step 1: Expectation -- make predictions based on statistical model
        * i.e., classify data by assigning samples to clusters based on distance from the K centers
    * Step 2: Maximization -- update the parameters based on the data and some "fitness" criterion
        * i.e., recompute the cluster centers from the data and the predicted labels from Step 1
    * Repeat steps 1 & 2 until done
        * i.e., go back to Step 1 and repeat until parameters stop changing
* Visualization using Old Faithful geyser data -- wait-time (delay) between eruptions vs eruption duration
    * Ref: [Expectation-maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) -- wikipedia

<img src="https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif" width="400"/>
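
The K-means steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration, not VanderPlas's code: the function name `kmeans_em`, the synthetic two-blob data, and the stopping rule (`np.allclose` on the centers) are all assumptions made for this example.

```python
import numpy as np

def kmeans_em(X, k, n_iter=100, seed=0):
    """Hypothetical helper: K-means written as alternating E and M steps."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct samples as the starting cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1 (Expectation): assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (Maximization): recompute each center as the mean of its samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the parameters (the centers) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Synthetic data: two well-separated 2-D blobs (made up for this sketch)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X_toy = np.vstack([blob_a, blob_b])

centers, labels = kmeans_em(X_toy, k=2)
print(np.round(centers, 1))
```

K-means makes a "hard" assignment in the E-step (each sample belongs to exactly one cluster). Full E-M for a mixture model, as in the Old Faithful animation, instead assigns each sample a probability of belonging to each cluster.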



In [3]:
# Cell 47 (takes ~7 minutes in Colab)
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [4]:
# Cell 48
lda.components_.shape

(10, 5000)
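
The shape `(10, 5000)` means one row per topic and one column per vocabulary word. As a rough sketch of how to read these arrays, the example below fits LDA on a tiny made-up corpus (the names `toy_docs`, `toy_counts`, `toy_lda`, and `toy_doc_topics` are hypothetical). It shows that normalizing each row of `components_` gives a word distribution per topic, and that each row of the `fit_transform` output is a topic distribution for one document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, small enough to fit in well under a second
toy_docs = [
    "ghost house haunted night scream",
    "haunted house ghost scream dark",
    "family mother father children love",
    "mother father family love home",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    random_state=123,
                                    learning_method='batch')
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# One row per topic, one column per vocabulary word
print(toy_lda.components_.shape)

# components_ holds unnormalized word weights (pseudocounts);
# dividing each row by its sum gives a word distribution for that topic
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))

# Each row of the transformed data is a document's mixture over topics
print(toy_doc_topics.sum(axis=1))
```

Applied to the lab's arrays, the same reading holds: `lda.components_` has one row per topic, and `X_topics` has one row per review.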

In [5]:
# Cell 49
n_top_words = 5
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
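    # argsort() sorts ascending, so the reversed slice picks the n_top_words largest weights
    top_idx = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join(feature_names[i] for i in top_idx))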

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
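
The slice `[:-n_top_words - 1:-1]` in Cell 49 is compact but easy to misread. Here is a small sketch with made-up weights and words (`weights`, `words`, and `n_top` are hypothetical) showing that it selects the indices of the largest values in descending order, and that it matches the more explicit "reverse, then take the first n" form.

```python
import numpy as np

# Hypothetical word weights for a single topic
weights = np.array([0.1, 2.5, 0.7, 3.0, 1.2])
words = np.array(['plot', 'ghost', 'car', 'haunted', 'night'])
n_top = 3

# argsort() returns indices in ascending order of weight: [0, 2, 4, 1, 3].
# Walking backward from the end and stopping before position -(n_top + 1)
# yields the n_top largest, biggest first.
top_a = weights.argsort()[:-n_top - 1:-1]

# Equivalent and arguably clearer: reverse the order, then take the first n_top
top_b = weights.argsort()[::-1][:n_top]

print(top_a)         # [3 1 4]
print(words[top_a])  # ['haunted' 'ghost' 'night']
```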


In [7]:
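# Cell 50
# Column index 5 (0-based) is Topic 6 above: "horror house sex girl woman".
# Sort the reviews by their weight on that topic, highest first.
horror = X_topics[:, 5].argsort()[::-1]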

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...
