## **LDA Topic Identification**

The objective in this notebook is to extract topics from the corpus of uk.TrustPilot.uk reviews of train service in the UK. Example tasks include,
* Extract the main topics being discussed. (Covered in **Topic Identification** below.)
* Analyse how the topics changed over time. Is there a difference in the topics or their frequency over time?
* Visualise the results. (See **Visualizations**.)

This objective was met using Latent Dirchlet Alloction (LDA). LDA can be thought of as "soft clustering". If the number of topics in a given corpra is the number of clusters, and the probability of any one corpus belonging to a topic is roughly the proportion of cluster membership, then LDA can be interpreted as a way of clustering the corpra into distinct clusters. Contrasting this with k-means, where each entity can belong to only one cluster, LDA allows for "fuzzy" memberships. Thereby providing a more nuanced way of indentifying similar items in the input data.

A good reference paper for LDA is http://jmlr.org/papers/volume3/blei03a/blei03a.pdf.

**Libraries** <br/>
Library imports are consolidated here.

In [1]:
# Standard Python Libraries
import random
import re

from collections import OrderedDict

# Data Science Stack Libraries
import numpy as np

import pandas as pd

from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# NLP Libraries
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk

import gensim
from gensim import corpora

# Visualization
from bokeh import plotting as bplot
from bokeh import models as bmodels
from bokeh import palettes

In [2]:
# Supress all notebook output warnings
# Import made to specifically to supress a depricated numpy warning found in the gensim LDA lib
import warnings
warnings.simplefilter("ignore")

In [3]:
# Additional options
pd.set_option('max.columns', 999)  # sets the max number of columns to 999 when displaying dataframes

bplot.output_notebook()  # output bokeh figures to notebook

Read the raw `train_reviews.json` dataset into a dataframe.

In [4]:
raw = pd.read_json('./train_reviews.json')
raw.sample(10)

Unnamed: 0,date,stars,text,title,url
1988,2018-02-13 14:37:10,star-rating star-rating-1 star-rating--medium,It's difficult to even explain the depths of f...,"Late, cramped, delayed and skipped your station",https://uk.trustpilot.com/review/www.southwest...
1914,2018-01-06 09:01:51,star-rating star-rating-2 star-rating--medium,I booked this ticket online but knew nothing a...,Not a pleasant experience,https://uk.trustpilot.com/review/www.grandcent...
323,2018-06-06 05:19:27,star-rating star-rating-1 star-rating--medium,08/June / 18,No trust ....,https://uk.trustpilot.com/review/www.nationalr...
1659,2016-02-09 14:09:33,star-rating star-rating-1 star-rating--medium,ANOTHER HARD DAY AT WORK FOLLOWED BY ANOTHER N...,HOW ARE THEY STILL IN BUSINESS???????????,https://uk.trustpilot.com/review/www.southernr...
1205,2018-03-11 11:41:16,star-rating star-rating-1 star-rating--medium,I can't remember the last time I got an arriva...,A useless company,https://uk.trustpilot.com/review/www.arrivatra...
1285,2017-10-24 16:19:05,star-rating star-rating-1 star-rating--medium,We travelled on the Eurostar from London to Br...,Terrible customer service and dishonest staff,https://uk.trustpilot.com/review/www.eurostar....
575,2017-12-01 16:16:29,star-rating star-rating-1 star-rating--medium,Back in 2004 Virgin Trains were fabulous and I...,Horrendous service even before travelling,https://uk.trustpilot.com/review/www.virgintra...
100,2018-06-21 15:28:41,star-rating star-rating-1 star-rating--medium,They break their own terms of use by not respo...,They break their own terms of use,https://uk.trustpilot.com/review/www.nationalr...
1743,2018-07-08 23:02:53,star-rating star-rating-1 star-rating--medium,Today on the 16.01 from Exeter to London mysel...,LIFE THREATENING OVERCROWDING,https://uk.trustpilot.com/review/www.gwr.com?p...
1295,2017-09-29 16:14:47,star-rating star-rating-1 star-rating--medium,Ridiculous and irritating customer service. I ...,Error ordering,https://uk.trustpilot.com/review/www.eurostar....


In [5]:
raw.shape

(2021, 5)

In [6]:
# Generate a simple summary of the dataset
raw.describe()

Unnamed: 0,date,stars,text,title,url
count,2021,2021,2021,2021,2021
unique,1441,5,1437,1396,86
top,2018-04-03 13:10:39,star-rating star-rating-1 star-rating--medium,I hunted for a review site just so I can share...,Deplorable service,https://uk.trustpilot.com/review/www.nationalr...
freq,21,1601,21,21,400
first,2011-05-28 15:00:36,,,,
last,2018-07-17 13:59:52,,,,


In [7]:
# Test for missing data?
pd.isnull(raw).any(1).nonzero()  # returns an array of the integer-value indicies of any rows that contain at least one null-element

(array([], dtype=int64),)

In [8]:
# Test for duplicate data alone on the column `text` as this is what we will base our clustering on
raw[raw['text'].duplicated()==True].shape

(584, 5)

In [9]:
# We drop these duplicate rows, knowing that other columns may provide discriminating information, but for the sake of simplicity of this exercise
raw_final = raw[raw['text'].duplicated(keep='first')==False]

**Data Preparation & Exploration** <br/>
In this section we clean several columns of the data set in which there are clearly redundant elements, such as in the `stars` and `url` columns.

In [10]:
# Copy the raw data into a new dataframe to be manipulated
df = raw_final.copy(deep=True)
df.shape

(1437, 5)

The elements of the column `stars` contains many repetitve elements.

In [11]:
# Split on white space to extract the last character of the second element, which is the "star rating"; force to int
df['stars'] = df['stars'].apply(lambda r: int(r.split()[1][-1:]))

In [12]:
# Take a look at the result
df.sample(3)

Unnamed: 0,date,stars,text,title,url
917,2015-07-02 14:40:04,2,I was one of many people disrupted yesterday (...,Poor Communication Again,https://uk.trustpilot.com/review/www.virgintra...
1356,2017-05-30 15:13:57,1,Travelled recently from London to Paris. Long ...,Terrible untrustworthy company who do not hono...,https://uk.trustpilot.com/review/www.eurostar....
1783,2017-12-19 20:17:45,1,I took a short trip from Hook to Reading invol...,No Christamas spirit at GWR!,https://uk.trustpilot.com/review/www.gwr.com?p...


In [13]:
df['stars'].value_counts()

1    1125
5     121
2     107
4      45
3      39
Name: stars, dtype: int64

People tend only to leave reviews if they have something to complain about ;)

What we're actually after though are the identification of "topics". Perhaps there is something we kind find in the urls. The first part of each url is dropped as it's the same, a reference to uk.trustpilot.com.

In [14]:
df['url'] = df['url'].apply(lambda r: r.split('https://uk.trustpilot.com/review/')[1])

In [15]:
df.sample(10)

Unnamed: 0,date,stars,text,title,url
934,2015-11-21 12:09:46,1,Virgin should have their franchise removed the...,The worst company ever! Con and fraud spring t...,www.virgintrains.co.uk?page=23
1540,2016-12-18 14:28:58,1,never use again,wow so bad,www.southernrailway.com?page=8
1515,2017-04-03 08:56:15,5,I had a great service on Friday 31st May - I w...,Perfect service - May 31st 2017,www.southernrailway.com?page=6
1997,2018-06-01 21:41:53,1,"Unfortunately, after this new company took ove...",Poor facilities,www.londonnorthwesternrailway.co.uk
479,2018-05-25 13:33:34,1,"Yet again virgin train is late , a later virgi...",Yet again virgin train is late,www.virgintrains.co.uk?page=2
1066,2017-06-08 11:42:43,1,Booked a trip from London - Edinburgh for my 7...,Outrageous,www.virgintrainseastcoast.com?page=5
1215,2018-05-30 16:58:28,1,Aniko : Worst customer service ever by an agen...,Worst customer service ever by an agent…,www.eurostar.com?page=2
1619,2016-09-04 09:04:25,1,words fail me,unbelievably bad,www.southernrailway.com?page=9
774,2018-01-25 21:14:19,1,My weekly commute from Brighton to Wilmslow ha...,100% fare increase,www.virgintrains.co.uk?page=5
1016,2018-05-31 08:52:00,5,Me and my sister really enjoyed our trip on th...,Me and my sister really enjoyed our…,www.virgintrainseastcoast.com?page=1


It seems that the urls do not contain any useful information, but we can use, as a feature, the base of each unique url. So we apply some cleaning to those as well.

In [16]:
# Isolate the base of each url, ignoring anything that may come after any '?' character
df['url'] = df['url'].apply(lambda r: r.split('?')[0])

In [17]:
df['url'].value_counts()

www.virgintrains.co.uk                     479
www.southernrailway.com                    260
www.gwr.com                                180
www.eurostar.com                           179
www.virgintrainseastcoast.com              118
www.nationalrail.co.uk                      65
www.southeasternrailway.co.uk               20
www.buytickets.crosscountrytrains.co.uk     20
www.eastmidlandstrains.co.uk                20
www.tpexpress.co.uk                         20
www.arrivatrainswales.co.uk                 20
www.grandcentralrail.com                    19
www.southwesternrailway.com                 13
www.londonnorthwesternrailway.co.uk          8
chilternrailways.co.uk                       8
sleeper.scot                                 6
www.hulltrains.co.uk                         2
Name: url, dtype: int64

Looks good. This gives us a concise set of consistant labels we can use later.

**Topic Indentification**

The strategy is as follows:
1. Preprocess the text in the `text` and `title` columns to homogenize their representations.
    * lowercase
    * replace numeric symbols
    * standardize abbreviations
    * remove stopwords and then lemmatize texts
2. Use LDA to identify topics.
    * Use a fixed number, based on manual refinement

We construct some regular expressions to be able to clean the text, and define a preprocessor method that lowercases the input text, "cleans" abbreviations, removes special characters, and strips dashes and underscores.

In [18]:
re_specialchar_removal = re.compile(r'(!|@|#|&|\(|\)|\+|=|\{|\}|\[|\]|:|;|\"|\'|,|\.|\?)', re.UNICODE)
re_specialchar_numsymbremoval = re.compile(r'(\$|%)', re.UNICODE)
re_numword = re.compile(r'(\s\d*\s)|\s\d*\.\d*\s', re.UNICODE)
re_dash_removal = re.compile(r'-|_', re.UNICODE)


def remove_stopwords(text, stopwords):
    """
    Given an input 'text' stop words are removed, reducing the complexity of the input text. A collection of stopwords must
    be supplied, otherwise this method returns the input text directly again.
        :param: text: Input string.
        :return: Reduced complexity string.
    """
    return ' '.join([word for word in text.split() if word not in stopwords])


def preprocessor(text, stopwords=''):
    """
    Applies the following preprocessing steps to any input text:
        - lowercases all text
        - maps abbreviations to same format (e.g., A.D., A. D., A D to AD)
        - removes general special characters (e.g., an '!' or an '&' symbol)
        - splits words that contains dashes or underscores
        - strips the any newline characters
    """
    text = text.lower()
    text = re_specialchar_removal.sub('', text)
    text = re_specialchar_numsymbremoval.sub('', text)  # depending on intent, this should be optional
    text = re_numword.sub(' numword ', text)            # and this one
    text = remove_stopwords(text, stopwords=stopwords)
    return text.strip()

In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexanderdesouza/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexanderdesouza/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [27]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alexanderdesouza/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [21]:
# We utilize the nltk corpus of English stop words; in addition...
#   we omit a few additional common articles such as `the`, which are surprisingly not already included as stopwords
#   we omit our replacement for numerics, as this is too often contained in the doc corpus
#   we omit `train` as this too dominates the feature space
stopwords = nltk.corpus.stopwords.words('English') + \
            ['and', 'are', 'the', 'this'] + \
            ['numword'] +  \
            ['train']

In [22]:
# Example of usage...
test_string = "Hi! I have the 1 name: Alexander L.M. De Souza."
preprocessed_test_string = preprocessor(test_string, stopwords)
preprocessed_test_string

'hi name alexander lm de souza'

In order to match different, but analogous, sentences the body text need be simplified. One way to accomplish this is to remove stopwords, stem, and lemmatize all the sentences to be matched. Here, a tokenizer and lemmatizer are applied to the text. Text objects can then be compared by calculating and evaluating the cosine distance between the resultant document vectors.

In [23]:
# This initial attempt utilized Porter Stemming instead of Lemmatization; Porter stemming, a rule-based heuristic, produced incogruent results
def tokenize_and_stem(text):
    """
    """
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    return [i for i in [lemmatizer.lemmatize(t) for t in tokens] if len(i) > 2]

In [28]:
# Examine the output from the TokenLemmatizer() on the `test_string`
tokenize_and_stem(preprocessed_test_string)

['name', 'alexander', 'souza']

It works, so the `preprocessor` and `tokenize_and_stem` methods are applied to all document texts...

In [29]:
df['preprocessed_text'] = df['text'].apply(preprocessor, args=([stopwords]))
df['wordvecs'] = df['preprocessed_text'].apply(tokenize_and_stem)

In [30]:
# And take a look at a sample of the resultant dataframe
df.sample(10)

Unnamed: 0,date,stars,text,title,url,preprocessed_text,wordvecs
408,2018-02-21 15:41:05,1,I use the trains a lot and this must be the mo...,I use the trains a lot and this must be…,www.eastmidlandstrains.co.uk,use trains lot must appaling comp country boar...,"[use, train, lot, must, appaling, comp, countr..."
2004,2018-04-06 16:52:55,1,Super rude staff. Train are delayed almost eve...,Slow and unrelaible.,www.southeasternrailway.co.uk,super rude staff delayed almost everyday,"[super, rude, staff, delayed, almost, everyday]"
1906,2018-05-13 19:21:19,2,First class traveller sometimes using Virgin f...,First class traveller sometimes using…,www.grandcentralrail.com,first class traveller sometimes using virgin w...,"[first, class, traveller, sometimes, using, vi..."
12,2018-02-09 16:16:41,1,b@stards charge £330 to get from London to Man...,b@stards charge £330 to get from London …,www.nationalrail.co.uk,bstards charge £330 get london manchester paid...,"[bstards, charge, £330, get, london, mancheste..."
560,2017-12-21 22:35:52,1,My mother is poorly. Partner books train. Trai...,Still can't believe this experience.,www.virgintrains.co.uk,mother poorly partner books strike try get ref...,"[mother, poorly, partner, book, strike, try, g..."
936,2015-11-10 08:22:51,1,"9th November 2015 saw a problem 'on the line',...",Utterly the worst customer service I've ever e...,www.virgintrains.co.uk,9th november saw problem line delay amounted 5...,"[9th, november, saw, problem, line, delay, amo..."
1657,2016-06-08 17:58:21,2,Southern Rail are cancelling services on the b...,Ridiculas to the subline,www.southernrailway.com,southern rail cancelling services back staff s...,"[southern, rail, cancelling, service, back, st..."
1280,2017-11-23 17:47:44,1,Spent even more time explaining to their custo...,Sharp Practice,www.eurostar.com,spent even time explaining customer services h...,"[spent, even, time, explaining, customer, serv..."
928,2015-03-27 12:25:39,1,I just booked tickets online and due to an iss...,Awful Customer Service,www.virgintrains.co.uk,booked tickets online due issue website chose ...,"[booked, ticket, online, due, issue, website, ..."
1258,2018-01-23 10:00:10,1,Please be aware that Eurostar website provides...,False pricing advertised on the website,www.eurostar.com,please aware eurostar website provides false p...,"[please, aware, eurostar, website, provides, f..."


We rely on the implementation of LDA supplied by the `gensim` library. This works by way of constructing bag-of-word representation and then finding similarities between the resulting word vectors.

In [31]:
texts = list(df['wordvecs'].values)  # a list of lists of each word vector constructed from each individual review
dictionary = corpora.Dictionary(texts)  # a gensim object of the unique words in our corpus
corpus = [dictionary.doc2bow(text) for text in texts]  # docs to bag-of-words, represented a tuple joining the doc to the relevant bow representation

Next we construct the LDA model imposing both the `NUM_TOPICS` to be categorized and the `NUM_WORDS_PER_TOPIC` that should be used when forming the topic vectors.

In [32]:
print("Mean number of words per review: {}\n".format(df['wordvecs'].apply(lambda r: len(r)).mean()) + \
      "Std of the number of words per review: {}".format(df['wordvecs'].apply(lambda r: len(r)).std())
)

Mean number of words per review: 32.67292971468337
Std of the number of words per review: 30.412693858779463


On this basis, due to the high variance observed in the number of words in each word vector, per review, we elect to divide the data into 8 distinct topics, which we construct using 5 words per topic. Note that there is _significant_ overlap in the corpuses of any two individual reviews.

In [33]:
NUM_TOPICS = 8
NUM_WORDS_PER_TOPIC = 5

Training was refined with logging inplace, using a host of specified learning parameters for optimization. The commented-out lines below give an example of how this was done.

In [34]:
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [35]:
# `passes` is equivalent to number of epochs of training for the model
# `eval_every` reports the preplexity of the model after each training epoch / pass
# lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha='auto', eta='auto', passes=33, eval_every=1)

# Note: When this was executed last, a large number of iterations were used per pass; perplexity was evaluated after each pass.

In [36]:
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha=0.1, eta='auto', passes=10)
topics = lda.print_topics(num_words=NUM_WORDS_PER_TOPIC)

In [37]:
# Print each topic in terms of the vector of the relevant words and the associated weights for reference
for topic in topics:
    print("Topic {}: {}".format(topic[0], topic[1]))

Topic 0: 0.021*"service" + 0.011*"train" + 0.009*"staff" + 0.008*"journey" + 0.007*"time"
Topic 1: 0.026*"ticket" + 0.016*"virgin" + 0.014*"service" + 0.012*"train" + 0.010*"time"
Topic 2: 0.011*"ticket" + 0.010*"time" + 0.010*"service" + 0.009*"london" + 0.009*"virgin"
Topic 3: 0.024*"service" + 0.013*"get" + 0.011*"customer" + 0.011*"time" + 0.011*"ticket"
Topic 4: 0.014*"london" + 0.010*"virgin" + 0.010*"service" + 0.009*"hour" + 0.009*"time"
Topic 5: 0.013*"service" + 0.013*"eurostar" + 0.011*"customer" + 0.010*"ticket" + 0.008*"time"
Topic 6: 0.014*"train" + 0.011*"get" + 0.011*"ticket" + 0.011*"service" + 0.011*"seat"
Topic 7: 0.016*"ticket" + 0.011*"get" + 0.010*"train" + 0.009*"time" + 0.007*"one"


We need to construct human-intelligble topic labels based on these representations. The prevelance of `train` and `service` in each word vector implies these words could potentially be added to the list of stopwords for this particular corpus (the former has been added retro-actively). Baring that, we look to the remaining parameters to construct the unique topic names. Notice that for some of these topics, we will group some of the topics as well.

In [38]:
human_intelligible_topic_labels = {0: '0 Poor Customer Service',
                                   1: '1 Virgin Specific Train Issues',
                                   2: '2 General Train Issues',
                                   3: '3 Other Issues',
                                   4: '4 First Class Service',
                                   5: '5 Ticketing Issues',
                                   6: '6 Train Timing',
                                   7: '7 London Regional Issues'}

**Visualizations** <br/>
With these topics constructed, we next want to validate them via visualization. For this we employ t-SNE; but to do so requires first projecting the data from 5-dimensions (based on `NUM_WORDS_PER_TOPIC`) to 2.

In [39]:
vectorizer = TfidfVectorizer(input='content',
                             analyzer='word',
                             min_df=0.01,
                             max_df=0.25,
                             norm='l2',
                             use_idf=True,
                             smooth_idf=True,
                             sublinear_tf=True)

df_tfidf = vectorizer.fit_transform(df['preprocessed_text']).toarray()

In [40]:
def get_doc_topic_distance(model, corpus, key_words=False):
    
    """
    LDA transformation, for each doc return only those topics with non-zero weight.
    This function composes a matrix of the docs in the topic space.
    """
    topic_distance =[]
    keys = []

    for d in corpus:
        tmp = {i: 0 for i in range(NUM_TOPICS)}
        tmp.update(dict(model[d]))

        values = list(OrderedDict(tmp).values())
        
        topic_distance += [np.array(values)]
        
        if key_words:
            keys += [np.array(values).argmax()]

    return np.array(topic_distance), keys

In [41]:
# The above method is used to calculate the topic distances as these cannot be extracted from the gensim model
topic_distance, lda_keys = get_doc_topic_distance(lda, corpus, True)
features = vectorizer.get_feature_names()

In [42]:
# And we also make a list of each of the words that constitute each topic
topic_words = []
for n in range(len(df_tfidf)):
    indexes = np.argsort(df_tfidf[n])[::-1][:4]
    tmp = [features[i] for i in indexes]
    topic_words += [' '.join(tmp)]

In [43]:
# And append to the dataframe a representation for each review, and assign to it an associated topic number
df['representation'] = pd.DataFrame(topic_words)
df['cluster_number'] = pd.DataFrame(lda_keys)
df['cluster_number'].fillna(9, inplace=True)  # all reviews which belong to no identifiable cluster we group separately

In [44]:
# Build a color schema based on the number of clusters + 1 (accounting for the `na` category)
cluster_color_schema = {}
palette = palettes.viridis(len(df['cluster_number'].unique())+1)
for n, color in enumerate(palette):
    cluster_color_schema[n] = color

cluster_color_schema

{0: '#440154',
 1: '#472777',
 2: '#3E4989',
 3: '#30678D',
 4: '#25828E',
 5: '#1E9C89',
 6: '#35B778',
 7: '#6BCD59',
 8: '#B2DD2C',
 9: '#FDE724'}

In [45]:
# Apply...
df['color'] = df['cluster_number'].apply(lambda l: cluster_color_schema[l])

Next we can build a t-SNE projection of these topics in 2-dimension.

In [46]:
tsne = TSNE(n_components=2)
tsne_coordinates = tsne.fit_transform(topic_distance)

In [47]:
# Validate that we do have one t-SNE projection coordinate per dataframe entry...
print(len(df) == len(tsne_coordinates))

True


In [48]:
# Then add these coordinates to the dataframe
df['tsne_xcoord'] = tsne_coordinates[:, 0]
df['tsne_ycoord'] = tsne_coordinates[:, 1]

In [49]:
# And plot only those elements for which we have a valid `cluster_number` entry, i.e., not the "invalid" null cluster (if it exists)
df2plot = df[df['cluster_number']!=len(df['cluster_number'].unique())]

In [50]:
source = bmodels.ColumnDataSource(dict(
    x=df2plot['tsne_xcoord'],
    y=df2plot['tsne_ycoord'],
    
    color=df2plot['color'],
    
    label=df2plot['cluster_number'].apply(lambda l: human_intelligible_topic_labels[l]),
    
    topic_key= df2plot['cluster_number'],
    
    content = df2plot['representation']
))

In [54]:
title = 't-SNE Topic Visualization'

plot_lda = bplot.figure(plot_width=1000,
                        plot_height=600,
                        tools="pan, wheel_zoom, box_zoom, reset, hover, previewsave",
                        x_axis_type=None,
                        y_axis_type=None,
                        min_border=1)

plot_lda.scatter(x='x',
                 y='y',
                 legend='label',
                 source=source,
                 color='color',
                 alpha=0.33,
                 size=10)

# hover tools
hover = plot_lda.select(dict(type=bmodels.HoverTool))
hover.tooltips = {"content": "Topic: @topic_key - Text: @content"}
plot_lda.legend.location = "top_left"

bplot.show(plot_lda)

Well, this is unfortunate, but was expected. What happened?

The word vectors that comprise each topic are largely dominated by similar elements, namely "service", "ticket", and often "virign". This makes it difficult to build a distinct configuration of word vectors (when using five words per toic) that can be projected distinctly in 2D. One potential solution is to plot this in a higher dimension, or simply to remove some of these more commonly occuring words that do not add to the distinctiveness of the projections.

Due to time constraints I did not explore in any depth the time evolution of the topic streams in this notebook.

However, below I provide a rough sketch of the "stellar" evolution of each topic, based on it's mean rolling `stars` rating.

In [55]:
# Compute the rolling mean star rating for each topic
df['mean_stars'] = df.sort_values(by='date').groupby('cluster_number')['stars'].rolling(100).mean().reset_index(0, drop=True)

In [56]:
# Construct a grouped dataframe for plotting the rolling average for each topic
dfs2plot = df.sort_values(by='date').groupby('cluster_number')

In [57]:
# Make the plot...
p = bplot.figure(plot_width=1000, plot_height=600, x_axis_type="datetime", x_range=(min(dfs2plot.get_group(0)['date']), max(df['date'])))

n = 0
for name, grp in dfs2plot:
    p.line('date', 'mean_stars', source=grp, line_color=cluster_color_schema[n], legend=human_intelligible_topic_labels[n])
    n += 1
    if n == 8:
        break  # do not plot the null-group

p.legend.location = "top_left"

bplot.show(p)

No doubt improvements could be made smoothing this data, or even by fitting a trend-line to the rolling mean.

In [58]:
kmeans = KMeans(n_clusters=5).fit(df_tfidf)

In [59]:
labels = kmeans.predict(df_tfidf)

In [60]:
title = 'k-Means Topic Visualization'

plot_kmeans = bplot.figure(plot_width=1000,
                           plot_height=600,
                           tools="pan, wheel_zoom, box_zoom, reset, hover, previewsave",
                           x_axis_type=None,
                           y_axis_type=None,
                           min_border=1)

plot_kmeans.scatter(x='x',
                    y='y',
                    legend='label',
                    source=source,
                    color='color',
                    alpha=0.33,
                    size=10)

# hover tools
hover = plot_kmeans.select(dict(type=bmodels.HoverTool))
hover.tooltips = {"content": "Text: @content - Topic: @topic_key "}
plot_kmeans.legend.location = "top_left"

bplot.show(plot_lda)