# Experiments

We experiment with several commonly used topic model evaluation methods as well as a few novel methods.

- Topic coherence
  - UMass
  - UCI
- Perplexity on held-out data
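
As a rough illustration of what the UMass coherence score measures, the sketch below counts how often a topic's top words co-occur in documents. It follows the commonly cited formulation (log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word). The function name `umass_coherence`, the toy corpus, and the decision to skip words that never occur are assumptions made for illustration; they are not taken from this notebook's code.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass coherence for one topic.

    top_words: the topic's top words, ordered from most to least probable.
    documents: iterable of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # For each pair, condition on the higher-ranked (earlier) word.
    for i, j in combinations(range(len(top_words)), 2):
        w_hi, w_lo = top_words[i], top_words[j]
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # assumption: ignore words absent from the corpus
        score += math.log((doc_freq(w_hi, w_lo) + 1) / d_hi)
    return score

# Toy corpus (made up for this example).
toy_docs = [["gene", "cell"], ["gene", "cell"], ["gene", "dna"]]
toy_score = umass_coherence(["gene", "cell", "dna"], toy_docs)
```

Scores closer to zero indicate top words that tend to appear in the same documents.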


The above evaluation metrics are tested against varying corpora. We take real-world corpora and make modifications to the following attributes:

- Number of documents
- Average document length
- Presence of stopwords
- Signal injection

Related work has reported the effects of changing the number of documents and the average document length. To our knowledge, the effects of injecting words into a corpus or removing words from it have not been investigated. In particular, our novel experiments test the injection and removal of interesting words and compare the effect against the injection and removal of uninteresting stopwords.
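
The notebook does not yet show how injection and removal are carried out, so the following is only a minimal sketch of one possible approach on tokenized documents. The helper names `remove_words` and `inject_words`, the uniform choice of insertion positions, and the example tokens are all hypothetical.

```python
import random

def remove_words(tokens, targets):
    """Drop every occurrence of the target words from a token list."""
    targets = set(targets)
    return [t for t in tokens if t not in targets]

def inject_words(tokens, words, n, seed=None):
    """Insert n tokens drawn uniformly from `words` at random positions."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n):
        # randint is inclusive, so len(out) means "append at the end".
        out.insert(rng.randint(0, len(out)), rng.choice(words))
    return out

doc = ["gene", "cell", "dna", "protein"]
noisy = inject_words(doc, ["the", "of"], n=5, seed=0)   # stopword injection
cleaned = remove_words(noisy, ["the", "of"])            # stopword removal
```

The same two helpers would cover both the stopword and the signal experiments by swapping the word list passed in.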

[skip here to load plots from pickles](#Load-Data-from-Pickles)

In [1]:
%load_ext autoreload
%autoreload 2

## Number of Documents

In this experiment we change the number of documents in a corpus and observe changes in the resulting topic metrics.

TODO: Make each document fixed length.

In [4]:
from os import listdir
from os.path import isfile, join
from nltk.corpus import abc
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from corpus import PropertiesCorpusReader

fileids = [f for f in listdir('corpus/abc_science') if isfile(join('corpus/abc_science', f))]
abc_science = PlaintextCorpusReader('corpus/abc_science',fileids)
abc_science_pcr = PropertiesCorpusReader(abc_science)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


The next step is to select the range of document counts we want to test. There are 764 documents in this corpus, and the average document length is ~551. We'll test with up to 750 documents in increments of 50 documents.

In [5]:
import numpy as np
d_range = np.arange(50,abc_science_pcr.num_docs,50)

Now, for each value in `d_range`, we create a subset of the corpus by selecting that number of documents at random from the documents in the corpus. We'll randomly select from `fileids` and build a corpus that way.

In [6]:
pcrs = {}
for d in d_range:
    # replace=False so that no document is selected twice within a subset
    d_fileids = np.random.choice(fileids,d,replace=False)
    d_cr = PlaintextCorpusReader('corpus/abc_science',d_fileids)
    d_pcr = PropertiesCorpusReader(d_cr,verbose=False)
    pcrs[d] = d_pcr
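
Note that `np.random.choice` samples with replacement unless told otherwise, so a subset can contain the same file more than once. The sketch below shows one way to draw distinct documents reproducibly. The seed value and the stand-in file names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the subsets are reproducible
# Stand-in file names; the real ones come from the corpus directory.
fileids_demo = ['doc{:03d}.txt'.format(i) for i in range(764)]

subsets = {}
for d in np.arange(50, len(fileids_demo), 50):
    # replace=False guarantees d distinct documents in each subset.
    subsets[int(d)] = rng.choice(fileids_demo, size=int(d), replace=False)
```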

Next, load our fixed topic model and fit it separately to each subset corpus to get the model components for which we can calculate metrics.

In [7]:
import fixed_model as fm

model_components = {}
for d in d_range:
    model_components[d] = fm.get_model(pcrs[d].raw_docs())

Calculate the necessary plotting data (all the X and Y ranges). Doing this ahead of time makes the plots load/interact slightly faster.

In [8]:
import plotting
plotting_data = plotting.calculate_plotting_data(d_range,model_components,pcrs)

50
	Calculating corpus properties...
	Calculating topic metrics...
100
	Calculating corpus properties...
	Calculating topic metrics...
150
	Calculating corpus properties...
	Calculating topic metrics...
200
	Calculating corpus properties...
	Calculating topic metrics...
250
	Calculating corpus properties...
	Calculating topic metrics...
300
	Calculating corpus properties...
	Calculating topic metrics...
350
	Calculating corpus properties...
	Calculating topic metrics...
400
	Calculating corpus properties...
	Calculating topic metrics...
450
	Calculating corpus properties...
	Calculating topic metrics...
500
	Calculating corpus properties...
	Calculating topic metrics...
550
	Calculating corpus properties...
	Calculating topic metrics...
600
	Calculating corpus properties...
	Calculating topic metrics...
650
	Calculating corpus properties...
	Calculating topic metrics...
700
	Calculating corpus properties...
	Calculating topic metrics...
750
	Calculating corpus properties...
	Calculatin

In [23]:
import util
util.dump_pickle("plotting_data_abc_science_num_docs.pickle",plotting_data)

# Load Data from Pickles

In [35]:
import util
plotting_data = util.load_pickle("plotting_data_abc_science_num_docs.pickle")
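
The `util` module itself is not shown in this notebook. As a minimal sketch, helpers of this kind could be thin wrappers around the standard `pickle` module; the argument order (filename first) and the return-value style of `load_pickle` below are assumptions, not the actual `util` implementation.

```python
import os
import pickle
import tempfile

def dump_pickle(filename, obj):
    """Serialize obj to filename."""
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(filename):
    """Return the object stored in filename."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Round trip through a temporary file with made-up plotting data.
path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
dump_pickle(path, {'50_exclusivity': [0.1, 0.2]})
restored = load_pickle(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.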

In [30]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models.widgets import Panel, Tabs
from bokeh.models import HoverTool, Legend, ColumnDataSource
output_notebook()

In [31]:
def plot_metric(metric_name, plotting_data):
  properties = ['num_docs','avg_doc_len','vocab_size',
              'readability','lexical_diversity',
              'stopword_presence']
  hover = HoverTool(tooltips=[('x','@x'),('y','@y'),('top words','@top_three')])
  
  property_tabs= []
  figs = {}
  for p in properties:
    fig = figure(x_axis_label=p,
                 y_axis_label=metric_name,
                 height=600,
                 width=800,
                 toolbar_location='above')
    
    # Make the labels legible when plots are downloaded
    fig.xaxis.axis_label_text_font_size = "30pt"
    fig.yaxis.axis_label_text_font_size = "30pt"
    fig.xaxis.major_label_text_font_size = "15pt"
    fig.yaxis.major_label_text_font_size = "15pt"
    fig.add_tools(hover)
    
    # One scatter plot for each corpus
    for n in d_range:
      circle = fig.circle(x='x',
                          y='y',
                          source=ColumnDataSource({'x':plotting_data['{}_{}'.format(n,p)],
                                                   'y':plotting_data['{}_{}'.format(n,metric_name)],
                                                   'top_three':[' '.join(w[:3]) for w in plotting_data['{}_top_words'.format(n)]]}),
                          size=20, alpha=0.2)
    figs[p] = fig
    property_tabs.append(Panel(child=fig, title=p))

  tabs = Tabs(tabs=property_tabs)

  show(tabs,notebook_handle=True)

In [37]:
plot_metric('exclusivity',plotting_data)

In [38]:
plot_metric('distance_from_uniform',plotting_data)

In [39]:
plot_metric('distance_from_corpus',plotting_data)

Interestingly, distance from the corpus distribution and distance from the uniform distribution over the corpus space behave effectively the same for the purpose of distinguishing topic quality.
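
To make the two quantities concrete, here is a small sketch that compares a topic's word distribution against a uniform distribution and against a corpus-wide word distribution. The notebook does not state which distance measure is used, so KL divergence is an assumption here, as are the function name and the toy probabilities.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Made-up distributions over a four-word vocabulary.
topic = np.array([0.7, 0.2, 0.05, 0.05])
corpus_dist = np.array([0.4, 0.3, 0.2, 0.1])
uniform = np.full(len(topic), 1.0 / len(topic))

distance_from_uniform = kl_divergence(topic, uniform)
distance_from_corpus = kl_divergence(topic, corpus_dist)
```

Both values are zero only when the topic matches the reference distribution exactly, and both grow as the topic concentrates its mass on a few words.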

In [40]:
plot_metric('rank1',plotting_data)

In [41]:
plot_metric('effective_size',plotting_data)

In [42]:
plot_metric('average_word_length',plotting_data)

## Average document length

We'll repeat the above experiment, this time randomly selecting a fixed number of words from each document so that every document has the specified length.

TODO: Fix corpus size.

In [43]:
import corpus

d_range = np.arange(10,abc_science_pcr.avg_doc_len,50)

doc_len_pcrs = {}
for d in d_range:
    d_strings = []
    for fn in abc_science.fileids():
        words = abc_science.words(fn)
        # np.random.choice needs an integer size; d_range holds floats here
        d_strings.append(' '.join(np.random.choice(words,int(d))))
    cr = corpus.from_strings('abc_science_{}'.format(d),d_strings)
    doc_len_pcrs[d] = cr


Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.
Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calcula

Load the fixed topic model and calculate components.

In [44]:
import fixed_model as fm

doc_len_model_components = {}
for d in d_range:
    doc_len_model_components[d] = fm.get_model(doc_len_pcrs[d].raw_docs())

Calculate the necessary plotting data.

In [45]:
import plotting
doc_len_plotting_data = plotting.calculate_plotting_data(d_range,doc_len_model_components,doc_len_pcrs)

10.0
	Calculating corpus properties...
	Calculating topic metrics...
60.0
	Calculating corpus properties...
	Calculating topic metrics...
110.0
	Calculating corpus properties...
	Calculating topic metrics...
160.0
	Calculating corpus properties...
	Calculating topic metrics...
210.0
	Calculating corpus properties...
	Calculating topic metrics...
260.0
	Calculating corpus properties...
	Calculating topic metrics...
310.0
	Calculating corpus properties...
	Calculating topic metrics...
360.0
	Calculating corpus properties...
	Calculating topic metrics...
410.0
	Calculating corpus properties...
	Calculating topic metrics...
460.0
	Calculating corpus properties...
	Calculating topic metrics...
510.0
	Calculating corpus properties...
	Calculating topic metrics...


In [None]:
import util
# Filename first, matching the dump_pickle call used for the num_docs data
util.dump_pickle("plotting_data_abc_science_doc_len.pickle",doc_len_plotting_data)

In [None]:
import util
doc_len_plotting_data = util.load_pickle("plotting_data_abc_science_doc_len.pickle")

In [54]:
plot_metric('exclusivity',doc_len_plotting_data)

In [55]:
plot_metric('distance_from_uniform',doc_len_plotting_data)

In [56]:
plot_metric('distance_from_corpus',doc_len_plotting_data)

In [57]:
plot_metric('average_word_length',doc_len_plotting_data)

In [58]:
plot_metric('rank1',doc_len_plotting_data)

In [59]:
plot_metric('effective_size',doc_len_plotting_data)

## Stopword Injection/Removal

Future work: stopwords specific to the corpus.

## Signal Injection/Removal