# Experiments

We experiment with several commonly used topic model evaluation methods as well as a few novel methods. The evaluation metrics are tested against varying corpora. We take real-world corpora and make modifications to the following attributes:

- [Number of documents](#Number-of-Documents)
- [Average document length](#Average-Document-Length)
- [Presence of stopwords](#Presence-of-Stopwords)

Related work has shown the results of changing the number of documents and the average document length. To our knowledge, the effects of injecting words into or removing words from the corpus have not been investigated. In particular, our novel experiments test the injection and removal of interesting words and compare the effect against the injection and removal of uninteresting stopwords.

In each section we'll review the expected outcome beforehand and discuss results afterwards. Each section also includes a description of the algorithm used to generate test corpora.

__NOTE:__ If you want to change the corpus being tested, there may be more than one spot where you need to make changes. For example, you'll need to add an import statement if that's where the corpus is coming from. Or you'll need to change the path directory variables. I'm still working on making this better.

In [2]:
%load_ext autoreload
%autoreload 2

In [2]:
# UTILITIES
import util,os,corpus
import numpy as np
from collections import Counter
from time import time
# NLP
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from metrics_model import MetricsModel
# PLOTTING
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models.widgets import Panel, Tabs
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()

## Topic Model

First, a topic model that will be fit to each test corpus.

In [3]:
mm = MetricsModel()

And some utility functions.

In [4]:
def get_metrics(pcrs,x_range,n_topics,verbose=False):
    metrics_dict = {"exclusivity":{'x':[],'y':[],'top_three':[]},
                    "avg_cos":{'x':[],'y':[],'top_three':[]},
                    "avg_kld":{'x':[],'y':[],'top_three':[]},
                    "avg_jsd":{'x':[],'y':[],'top_three':[]},
                    "rank1":{'x':[],'y':[],'top_three':[]},
                    "distance_from_uniform":{'x':[],'y':[],'top_three':[]},
                    "effective_size":{'x':[],'y':[],'top_three':[]}}
    for x in x_range:
        if verbose:
            print(x)
        # Fit each set of documents to the model
        mm.fit_from_samples(pcrs[x].raw_docs(),n_topics)
        # Calculate model metrics
        for m in metrics_dict.keys():
            m_func = getattr(mm,m)
            for k in range(n_topics):
                metrics_dict[m]['x'].append(x)
                metrics_dict[m]['y'].append(m_func(k))
                metrics_dict[m]['top_three'].append(' '.join(mm.top_words(k)[:3]))
    return metrics_dict

def get_tabs(metrics_dict,x_label):
    # Each metric is a different tab
    tabs = []
    hover = HoverTool(tooltips=[('x','@x'),('y','@y'),('top words','@top_three')])
    for m in metrics_dict.keys():
        fig = figure(title=CORPUS_NAME,
                     x_axis_label=x_label,
                     y_axis_label=m,
                     height=600,
                     width=800,
                     toolbar_location='above')
        fig.xaxis.axis_label_text_font_size = "30pt"
        fig.yaxis.axis_label_text_font_size = "30pt"
        fig.xaxis.major_label_text_font_size = "15pt"
        fig.yaxis.major_label_text_font_size = "15pt"
        fig.title.text_font_size = '15pt'
        fig.add_tools(hover)
        source = ColumnDataSource(metrics_dict[m])
        fig.circle(x='x',y='y',source=source,size=20,alpha=0.5,color='black')
        tabs.append(Panel(child=fig,title=m))
    return tabs

Start off by choosing a corpus to test on. Just run the cell for the corpus you want below. Or add your own.

In [67]:
# ABC SCIENCE CORPUS
CORPUS_NAME = 'abc_science'
CORPUS_DIR = os.path.join('corpus',CORPUS_NAME)
fileids = []
for f in os.listdir(CORPUS_DIR):
    if os.path.isfile(os.path.join(CORPUS_DIR,f)):
        fileids.append(f)
cr = PlaintextCorpusReader(CORPUS_DIR,fileids)
pcr = corpus.PropertiesCorpusReader(cr)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


In [68]:
pcr.print_properties()

Number of documents: 764
Average document length: 551.3495
Vocab size: 24306
Readability: 13.3439
Distance from uniform: 0.9375
Lexical diversity: 0.0577
Stopword presence: 0.3493


In [53]:
# ABC RURAL CORPUS
CORPUS_NAME = 'abc_rural'
CORPUS_DIR = os.path.join('corpus',CORPUS_NAME)
fileids = []
for f in os.listdir(CORPUS_DIR):
    if os.path.isfile(os.path.join(CORPUS_DIR,f)):
        fileids.append(f)
cr = PlaintextCorpusReader(CORPUS_DIR,fileids)
pcr = corpus.PropertiesCorpusReader(cr)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


In [54]:
pcr.print_properties()

Number of documents: 2424
Average document length: 142.5660
Vocab size: 17222
Readability: 11.9944
Distance from uniform: 0.9205
Lexical diversity: 0.0498
Stopword presence: 0.3507


In [16]:
pcr.num_docs,pcr.avg_doc_len,pcr.stopword_presence

(2424, 142.56600660066007, 0.35065397303084667)

In [56]:
# WINE REVIEWS
CORPUS_NAME = 'wine'
CORPUS_DIR = os.path.join('corpus',CORPUS_NAME)
fileids = []
for f in os.listdir(CORPUS_DIR):
    if os.path.isfile(os.path.join(CORPUS_DIR,f)):
        fileids.append(f)
cr = PlaintextCorpusReader(CORPUS_DIR,fileids)
pcr = corpus.PropertiesCorpusReader(cr)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


In [57]:
pcr.print_properties()

Number of documents: 1230
Average document length: 25.4878
Vocab size: 3414
Readability: 276.1483
Distance from uniform: 0.8687
Lexical diversity: 0.1089
Stopword presence: 0.2614


In [19]:
# WEBTEXT CORPUS
from nltk.corpus import webtext
CORPUS_NAME = 'webtext'
CORPUS_DIR = webtext.root
cr = webtext
pcr = corpus.PropertiesCorpusReader(cr)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


In [20]:
pcr.num_docs,pcr.avg_doc_len,pcr.stopword_presence

(6, 66122.166666666672, 0.27507920944312675)

In [65]:
# BROWN CORPUS
from nltk.corpus import brown
CORPUS_NAME = 'brown'
CORPUS_DIR = "corpus/brown"
fileids = []
for f in os.listdir(CORPUS_DIR):
    if os.path.isfile(os.path.join(CORPUS_DIR,f)):
        fileids.append(f)
cr = PlaintextCorpusReader(CORPUS_DIR,fileids)
pcr = corpus.PropertiesCorpusReader(cr)

Calculating properties...
	Readability calculated.
	Distance from uniform calculated.
	Lexical diversity calculated.
	Stopword presence calculated.


In [66]:
pcr.print_properties()

Number of documents: 500
Average document length: 2418.0880
Vocab size: 48675
Readability: 1997.3570
Distance from uniform: 0.9581
Lexical diversity: 0.0403
Stopword presence: 0.3726


## Number of Documents

In this experiment we change the number of documents in a corpus and observe changes in the resulting topic metrics. Our expectation is that topic quality improves as the number of documents increases. Results from Tang et al. showed that while it may be theoretically impossible to guarantee the identifiability of topics with too few documents (regardless of their length), performance peaks once there are sufficiently many documents. This section describes the experimental design in greater detail than the following sections, since their designs are roughly similar.

### ALGORITHM

        1. For each corpus:
        2.   For d in the range of corpus sizes to test:
        3.       Select a random subset of size d from the list of corpus files

The next step is to select the range of document counts we want to test. There are 764 documents in this corpus, and the average document length is ~551. We'll test with up to 750 documents in increments of 50 documents.

In [47]:
d_range = np.arange(50,pcr.num_docs,50)

Now, for each value `d` in `d_range`, we create a subset of the corpus by selecting that number of documents at random from the documents in the corpus. We'll randomly select from `fileids` and build a corpus that way.

In [49]:
num_docs_pcrs = {}
for d in d_range:
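    # Sample without replacement so no document is selected twice
    d_fileids = list(np.random.choice(pcr.fileids(),d,replace=False))
    # Read from the directory of the currently selected corpus
    d_cr = PlaintextCorpusReader(CORPUS_DIR,d_fileids)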
    d_pcr = corpus.PropertiesCorpusReader(d_cr,verbose=False)
    num_docs_pcrs[d] = d_pcr
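
One thing to watch when sampling file ids: `np.random.choice` draws with replacement unless `replace=False` is passed, so a subset can contain the same document more than once. The following is a minimal sketch of duplicate-free sampling. The helper name `sample_fileids`, the `seed` parameter, and the file names are hypothetical and only serve the illustration.

```python
import numpy as np

def sample_fileids(fileids, size, seed=None):
    """Return `size` distinct file ids chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no file id is picked twice
    return list(rng.choice(fileids, size=size, replace=False))

# Made-up file names standing in for pcr.fileids()
fileids = ["doc_{}.txt".format(i) for i in range(100)]
subset = sample_fileids(fileids, 50, seed=0)
print(len(subset), len(set(subset)))
```

Passing a seed makes the subsets reproducible across runs, which may help when comparing metrics between experiments.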

Next, load our fixed topic model and separately fit it to each pcr to get model components we can calculate metrics for. 

In [21]:
num_docs_metrics = get_metrics(num_docs_pcrs,d_range,15)

Now we build the plots from the metrics dictionary.

In [22]:
num_docs_tabs = Tabs(tabs=get_tabs(num_docs_metrics,"Number of Documents"))
show(num_docs_tabs,notebook_handle=True)

### Discussion

As the number of documents is increased, the average effective size of topics also increases. For several topics, the rank1 and distance from uniform metrics increase as well. These appear to be the topics containing corpus-specific stopwords. All other topics score roughly the same rank1 regardless of the number of documents. The average Jensen-Shannon divergence between each topic and the other topics in the model shows a steady linear increase with the number of documents. Exclusivity shows a slight linear increase in both average and range among all topics in the model as the number of documents increases.
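
For readers unfamiliar with the metric, here is a rough sketch of an average Jensen-Shannon divergence between one topic and all other topics. The implementation inside `MetricsModel` is not shown in this notebook, so the function names, the base-2 logarithm, and the plain mean over the other topics are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p is zero contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and 1 with log base 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture of the two distributions
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def avg_jsd(topic_word, k):
    """Mean JSD between topic k and every other topic (assumed definition)."""
    others = [j for j in range(len(topic_word)) if j != k]
    return float(np.mean([js_divergence(topic_word[k], topic_word[j]) for j in others]))

# Toy topic-word distributions over a 4-word vocabulary
topics = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.25, 0.25, 0.25, 0.25]])
print(avg_jsd(topics, 0))
```

Under this definition, a higher value means the topic's word distribution is more distinct from the rest of the model.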

In [23]:
# RUN THIS CELL TO SAVE YOUR WORK
util.dump_pickle(num_docs_metrics,CORPUS_NAME+"_num_docs_metrics.pickle")

True

After we've run the experiment at least once, we have the option to skip all the long cells and go straight to the plots by loading the necessary data from stored pickle objects. This cell loads the object if it already exists. If loading is not successful, it returns `None`.
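
The `util` module itself is not shown in this notebook. As a rough idea of the behaviour described (return `True` after saving, return `None` when loading fails), helpers along these lines would work. This is a sketch built on the standard `pickle` module, not the actual `util` code, and the file name is made up.

```python
import os
import pickle
import tempfile

def dump_pickle(obj, path):
    """Write obj to path with pickle and report success."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return True

def load_pickle(path):
    """Return the unpickled object, or None if the file is missing or unreadable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None

# Round trip in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example_metrics.pickle")
saved = dump_pickle({"avg_jsd": {"x": [50], "y": [0.4]}}, path)
loaded = load_pickle(path)
missing = load_pickle(os.path.join(tmp_dir, "does_not_exist.pickle"))
```

Returning `None` instead of raising keeps the notebook cell from failing on a first run, at the cost of having to check the result before plotting.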

In [24]:
num_docs_metrics = util.load_pickle(CORPUS_NAME+"_num_docs_metrics.pickle")

## Average Document Length

We'll repeat the above experiment, this time randomly selecting a fixed number of words from each document so that every document has the specified length.

### ALGORITHM

        1. For each corpus:
        2.   For l in the range of document sizes to test:
        3.     Repeat 10 times:
        4.       Randomly select l words from each document in the corpus
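
The steps above, including the repetition in step 3, can be sketched as follows. This is a minimal illustration on in-memory token lists rather than on corpus readers. The function name `resample_documents` and its parameters are hypothetical, and words are drawn with replacement, which is the default behaviour of NumPy's `choice`.

```python
import numpy as np

def resample_documents(docs, length, n_repeats=10, seed=None):
    """Build n_repeats test corpora in which every document has `length` words.

    docs is a list of token lists. Each output corpus is a list of strings.
    """
    rng = np.random.default_rng(seed)
    corpora = []
    for _ in range(n_repeats):
        # Draw `length` words at random from each document
        corpus_strings = [" ".join(rng.choice(words, size=length)) for words in docs]
        corpora.append(corpus_strings)
    return corpora

# Two toy documents
docs = [["wheat", "rain", "farm", "crop", "soil"],
        ["star", "orbit", "planet", "moon"]]
corpora = resample_documents(docs, length=3, n_repeats=10, seed=0)
print(len(corpora), len(corpora[0]))
```

Repeating the draw gives several corpora per document length, so the metrics can be averaged rather than read from a single random sample.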

In [28]:
pcr.avg_doc_len

1.0

In [27]:
# The range of document sizes to test
l_range = np.arange(10,pcr.avg_doc_len,50)

# Build the test corpora following the algorithm above
doc_len_pcrs = {}
for l in l_range:
    l_strings = []
    for fn in pcr.fileids():
        words = pcr.words(fn)
        l_strings.append(' '.join(np.random.choice(words,int(l))))
    new_cr = corpus.from_strings(os.path.join(CORPUS_DIR+'_x','doc_len_{}'.format(l)),l_strings)
    doc_len_pcrs[l] = new_cr

# Run metrics on the test corpora
# This step involves fitting a 15-topic LDA model to each corpus
doc_len_metrics = get_metrics(doc_len_pcrs,l_range,15)

In [26]:
# Generate the plots
tabs=Tabs(tabs=get_tabs(doc_len_metrics,"Document Length"))
show(tabs,notebook_handle=True)

### Discussion

The behaviour of effective size is different when we change the corpus via modification of the document length.

Exclusivity, rank1, and distance from uniform showed no notable trends.
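
As a reminder of what effective size captures, one common formulation is the inverse of the sum of squared word probabilities in a topic (the inverse Simpson index). Whether `MetricsModel.effective_size` uses exactly this definition is an assumption. The sketch below only illustrates the idea.

```python
import numpy as np

def effective_size(word_probs):
    """Inverse Simpson index of a topic's word distribution.

    Equals the vocabulary size for a uniform topic and 1 for a
    topic concentrated on a single word.
    """
    p = np.asarray(word_probs, dtype=float)
    p = p / p.sum()  # normalise in case the weights do not sum to 1
    return float(1.0 / np.sum(p ** 2))

uniform_topic = [0.25, 0.25, 0.25, 0.25]
peaked_topic = [1.0, 0.0, 0.0, 0.0]
print(effective_size(uniform_topic), effective_size(peaked_topic))
```

Under this definition a larger value means the topic's probability mass is spread over more words.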

In [125]:
# RUN THIS CELL TO SAVE YOUR WORK
util.dump_pickle(doc_len_metrics,CORPUS_NAME+"_doc_len_metrics.pickle")

True

In [139]:
# RUN THIS CELL TO LOAD PLOT DATA FROM PICKLEJAR
doc_len_metrics = util.load_pickle(CORPUS_NAME+"_doc_len_metrics.pickle")

## Presence of Stopwords

This portion experiments with injecting or removing stopwords. The hypothesis is that topics will converge when stopwords are removed and lose quality as stopwords are injected.

### ALGORITHM

        1. For each corpus:
        2.   For s in the range of stopword presences to test:
        3.     Repeat 10 times:
        4.       For each file in the corpus:
        5.         Calculate the percentage of stopwords in that file
        6.         If it is smaller than s:
        7.           Calculate the number of stopwords that need to be added
        8.           Randomly select this many words from the list of stopwords
        9.         Otherwise, if it is larger than s:
        10.          Randomly remove stopwords from the file until the percentage is no larger than s

In [135]:
stoplist = stopwords.words('english')

def stopwords_count(tokens):
    stopword_count = 0
    counter = Counter(tokens)
    u = set(tokens)
    for w in u.intersection(stoplist):
        stopword_count += counter[w]
    return stopword_count

def stopwords_presence(tokens):
    return stopwords_count(tokens)/len(tokens)

s_range = np.arange(0,1,0.1)

sw_pres_pcrs = {}
for s in s_range:
    s_strings = []
    for fn in pcr.fileids():
        tokens = [w.lower() for w in list(pcr.words(fn))]
        n_s = len(tokens)*s # Number of stopwords we want
        s_p = stopwords_presence(tokens)
        if s_p < s: # Add stopwords
            # Solve (count + a)/(len + a) = s for the number a of stopwords to add
            to_add = int((stopwords_count(tokens)-len(tokens)*s)/(s-1))
            tokens.extend(np.random.choice(stoplist,to_add))
        elif s_p > s: # Remove stopwords
            # Remove random stopword occurrences until the target presence is reached
            while stopwords_presence(tokens) > s:
                sw = np.random.choice(list(set(tokens).intersection(stoplist)))
                tokens.remove(sw)
        s_strings.append(' '.join(tokens))
    cr = corpus.from_strings(os.path.join(CORPUS_DIR+'_x','sw_pres_{}'.format(s)),s_strings)
    sw_pres_pcrs[s] = cr
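
The number of stopwords to add in the cell above follows from a short calculation. With `n` tokens, `c` of which are stopwords, adding `a` stopwords gives a presence of `(c + a) / (n + a)`. Setting this equal to the target `s` and solving yields `a = (s*n - c) / (1 - s)`, which is the same expression as `(c - n*s) / (s - 1)`. A small standalone check of that arithmetic, with a hypothetical helper name:

```python
def stopwords_to_add(n_tokens, n_stopwords, target):
    """Number of stopwords to append so that their share reaches `target`.

    Solves (n_stopwords + a) / (n_tokens + a) = target for a,
    truncating to an integer as the notebook cell does.
    """
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    if n_stopwords / n_tokens >= target:
        return 0  # already at or above the target, nothing to add
    return int((target * n_tokens - n_stopwords) / (1 - target))

# 100 tokens with 20 stopwords, aiming for a 50% stopword presence
a = stopwords_to_add(100, 20, 0.5)
print(a, (20 + a) / (100 + a))
```

The formula breaks down at a target of 1, since no finite number of additions turns a document with content words into pure stopwords. That matches the `s_range` above, which stops at 0.9.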

In [136]:
sw_pres_metrics = get_metrics(sw_pres_pcrs,s_range,15)

In [154]:
pcr.stopword_presence

0.3493475076620667

In [12]:
tabs = Tabs(tabs=get_tabs(sw_pres_metrics,"Stopword Presence (%)"))
show(tabs,notebook_handle=True)

### Discussion

Average cosine distance between topics decreases.

Average KL-divergence between topics also decreases.

Only effective size of poor topics is affected.

Exclusivity is largely unaffected.

Only rank1 of poor topics is affected.

In [155]:
# RUN THIS CELL TO SAVE YOUR WORK
util.dump_pickle(sw_pres_metrics,CORPUS_NAME+"_sw_pres_metrics.pickle")

True

In [11]:
# RUN THIS CELL TO LOAD PLOT DATA FROM PICKLEJAR
sw_pres_metrics = util.load_pickle(CORPUS_NAME+"_sw_pres_metrics.pickle")

## Corpus Specific Stopwords

The phenomenon being observed in the results of the above stopword experiments is that of "corpus-specific stopwords." Though not often referenced in related work, they play a critical role in determining the quality of topics generated by a model. It may be easy enough to disregard the two topics acting strange. However, their top words often appear as top words in other topics which in turn also affects 

We wish to explore the effect on topic quality of removing corpus-specific stopwords. Although many popular stoplists are available (Gensim, NLTK, scikit-learn, etc.), these are often not enough. There is a fine line between removing too many stopwords and not removing enough.

- Which algorithms are affected by stopword removal? (Which are not?)
- What stoplists are available?

TODO: Investigate the research question, what is the effect of fitting once, and then treating the highest scoring rank1 topic as a 

# Future Work / Questions

- Should each document be fixed length?
- More control of other variables?
- Stopwords specific to the corpus.
- Get timing stats