# Text Mining: A Brief Introduction to Topic Modeling
## June 2018

# Lineup

<li>Brief review of machine learning definitions</li>

<li>Case study: JSM Conference Abstracts</li>

<li>Q/A</li>

# Goals

<li>Understand which type of machine learning to apply when facing a specific problem</li>

<li>Familiarize yourself with some algorithms for the types of machine learning described</li>

<li>Observe Python code and a preferred Data Science development environment</li>

<li>Inspire creativity in applications for client challenges</li>

# Recap: Machine Learning


<li>Supervised Machine Learning: data has labels with target output you want to predict</li>

<li>Unsupervised Machine Learning: data has no labels; algorithm looks for patterns + similarities</li>

<li>Semi-supervised Machine Learning: some data are labeled, but the majority is unlabeled</li>

<li>Reinforcement Learning: algorithm learns by trial and error to maximize reward or minimize risk</li>

# What is the difference between Machine Learning (ML) & Traditional Statistical Models?


Not much. They use many of the same models, but with different goals in mind.

ML practitioners typically aim to predict an observation accurately and focus less on model assumptions & representativeness.

The most comprehensive & insightful work will come from a team that includes both machine learning/data science & traditional statistics.


<em>*Additional reading: https://www.kdnuggets.com/2017/08/machine-learning-vs-statistics.html</em>

# Machine Learning "Cheat Sheet":

In [1]:
from IPython.display import Image
image_ID = '1ovhTbem5Mcg09pqILBqZXIYRR-_j9RP7'
Image(url="https://drive.google.com/uc?export=view&id={}".format(image_ID))

# Overview - Case Study

### 1. Scrape 10 years of JSM abstracts (2008-2017)


### 2. Pre-process text data for analysis


### 3. Conduct topic modeling using 2 approaches


### 4. Review results & limitations

# JSM website (2017)


In [2]:
image_ID = '15vvrVMn5Z8GgghO89msZKjVi1PlN1u3m'
Image(url="https://drive.google.com/uc?export=view&id={}".format(image_ID), width=800)

#  Setup

### Import initial libraries

In [3]:
import numpy as np
import pandas as pd
import csv, spacy
import en_core_web_sm
import nltk, re, os, codecs, mpld3
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD

Suppress certain warnings

In [4]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim') # suppress chunksize warning from gensim
warnings.filterwarnings(action='ignore', category=DeprecationWarning, module='scipy') # suppress scipy deprecation warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning, module='pyLDAvis') # suppress pyLDAvis deprecation warnings

In [5]:
import gensim

### Import and merge annual JSM data (2008-2017)

In [6]:
dataDir = r'..\data'
files = (file for file in os.listdir(dataDir) if re.search(r'\.xlsx$', file))  # escape the dot so only .xlsx files match

In [7]:
list_ = []
for file_ in files:
    df = pd.read_excel(r'{}\{}'.format(dataDir, file_))
    df['Year'] = re.search(r'(\d{4})_.*\.xlsx', file_).group(1)  # four-digit year taken from the file name
    list_.append(df)
frame = pd.concat(list_)  # one DataFrame covering all years


# Pre-processing Data

### 1. Data Cleaning


<li>remove all observations w/no content in abstract text</li>
        

<li>include only observations sponsored by "survey research methods"</li>


### 2. Tokenize text

<li>break sentences into individual words (aka tokens)</li>

### 3. Lemmatize text

<li>analyzing, analyze, analyzes → analyze (each word is reduced to its dictionary form)</li>


### 4. Remove stopwords

<li>the, of, are, is, thereby, often, nevertheless, etc.</li>


<li>additional stopwords for this data: data, roundtable</li>
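
The pre-processing steps above can be sketched on a toy sentence using only the standard library. This is a minimal illustration, not the notebook's actual pipeline: the regex tokenizer, the tiny `STOPWORDS` set, and the hand-written `LEMMAS` lookup are all assumptions standing in for gensim's `simple_preprocess`, scikit-learn's `ENGLISH_STOP_WORDS`, and spaCy's lemmatizer.

```python
import re

# Hypothetical stand-ins for the real resources used in this notebook
STOPWORDS = {"the", "of", "are", "is", "data", "roundtable"}
LEMMAS = {"analyzing": "analyze", "analyzes": "analyze", "surveys": "survey"}

def preprocess(text):
    # Tokenize: lowercase the text and keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", text.lower())
    # Lemmatize: map each inflected form to its dictionary form
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # Remove stopwords
    return [tok for tok in lemmas if tok not in STOPWORDS]

print(preprocess("The roundtable is analyzing the data of surveys."))
# ['analyze', 'survey']
```

Only the content-bearing words survive, which is what the topic models are later fitted on.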


## Data Cleaning
Filter all JSM data to include only `Survey Research Methods`

In [8]:
df = frame.copy()
df = df[~df['Abstract_Text'].isnull()]
df['Abstract_Text'] = df['Abstract_Text'].apply(lambda x: re.sub('\xa0', '', x))  # strip non-breaking spaces
df['Abstract_Text'] = df['Abstract_Text'].apply(lambda x: re.sub('^ $', '', x))
df = df.loc[df['Abstract_Text'].apply(lambda x: len(x)>0)]
df = df[df['Sponsors'].str.contains('Survey Research Methods', na=False)]
df = df.drop_duplicates(subset = 'Abstract_Text', keep = 'first')
df.reset_index(drop=True, inplace = True)

In [9]:
frame.shape

(38484, 19)

In [10]:
df.shape

(3044, 19)

# II. Helper Functions

### Declare what processes you want to apply on text

In [11]:
def abstractText(data, process_all= True, lemmatize = False):
    # Returns raw or lemmatized text, for whole abstracts or first/last sentences only
    
    if process_all and not lemmatize:
        return data
    
    elif process_all and lemmatize:    
        return lemmatization(list(sent_to_words(data)), allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    
    elif not process_all and not lemmatize:
        return data.apply(lambda x: firstLastSentence(x))
    
    else:
        data = data.apply(lambda x: firstLastSentence(x))
        return lemmatization(list(sent_to_words(data)), allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    

Keep only the first and last sentences of each abstract, on the assumption that they carry the most relevant information

In [12]:
def firstLastSentence(x):
    sentences = x.split('. ')
    sentences = [sentence for sentence in sentences if len(sentence)>0]
    if len(sentences) < 2:
        return x  # zero or one sentence: nothing to trim
    return '. '.join([sentences[0], sentences[-1]])
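
To make the helper's behavior concrete, here is a small self-contained sketch. `first_last_sentence` is a hypothetical re-implementation that follows the same split-on-`'. '` approach and returns single-sentence text unchanged; the sample abstract is made up.

```python
def first_last_sentence(text):
    # Split on '. ' (a simple heuristic: abbreviations such as 'U.S. ' also split)
    sentences = [s for s in text.split('. ') if len(s) > 0]
    if len(sentences) < 2:
        return text  # nothing to trim
    return '. '.join([sentences[0], sentences[-1]])

abstract = "We study nonresponse. We use three surveys. Weighting reduces bias."
print(first_last_sentence(abstract))
# We study nonresponse. Weighting reduces bias.
```

Because the split is purely textual, abstracts containing abbreviations may be cut in unexpected places; a sentence tokenizer would be a more careful choice if that matters.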

Functions to prepare for lemmatization

## Tokenize Text

In [13]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizing drops punctuation; deacc=True also strips accent marks


## Lemmatize Text

In [14]:
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized texts, keeping only the allowed parts of speech.
    See https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))  # uses the module-level spaCy pipeline loaded above
        # spaCy 2.x gives pronouns the lemma '-PRON-'; skip those
        texts_out.append(" ".join([token.lemma_ for token in doc
                                   if token.pos_ in allowed_postags and token.lemma_ != '-PRON-']))
    return texts_out

### Displaying Output

In [15]:
def display_topics_and_docs(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic {}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print('\n{}\n'.format(df['Abstract_Title'].iloc[doc_index])) # show titles associated with topics
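
The slice `argsort()[:-no_top_words - 1:-1]` used above is compact but easy to misread. A minimal sketch with made-up weights and feature names (all hypothetical) shows that it returns the indices of the largest values in descending order:

```python
import numpy as np

feature_names = np.array(["sample", "health", "design", "rate", "census"])
topic = np.array([0.9, 0.1, 0.7, 0.05, 0.3])  # hypothetical weights for one row of H
no_top_words = 3

# argsort() sorts ascending; the slice walks backwards over the last no_top_words entries
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [str(w) for w in feature_names[top_idx]]
print(top_words)
# ['sample', 'design', 'census']
```

The top documents for a topic are found the same way, by sorting a column of the document-topic matrix and reversing the order.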
        

Create a document-topic matrix and highlight positive/larger values in green

In [16]:
def display_document_topic_matrix(_model, no_docs, _W):
    topicnames = ["Topic" + str(i) for i in range(_model.n_components)]
    docnames = ["Doc" + str(i) for i in range(no_docs)]

    df_document_topic = pd.DataFrame(np.round(_W, 2), columns=topicnames, index=docnames)
    dominant_topic = np.argmax(df_document_topic.values, axis=1)
    df_document_topic['dominant_topic'] = dominant_topic

    def color_green(val):
        color = 'green' if val > .01 else 'black'
        return 'color: {col}'.format(col=color)

    def make_bold(val):
        weight = 700 if val > .02 else 400
        return 'font-weight: {weight}'.format(weight=weight)

    # Apply Style
    df_document_topics_styler = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
    
    return df_document_topics_styler

Show top n keywords for each topic

In [17]:
def show_topics(vectorizer, model, n_words):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
        
        
    df_topic_keywords = pd.DataFrame(topic_keywords)
    df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
    df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]

    return df_topic_keywords



### Predicting Topics from Text

In [18]:
def predict_topic(text, nlp, lemma_flag, model):
    # Relies on the notebook globals my_stopwords, _vectorizer, and df_topic_keywords.
    # nlp is kept in the signature for compatibility; lemmatization() uses the module-level pipeline.

    # Step 1: Tokenize with simple_preprocess
    mytext_2 = sent_to_words(text)

    # Step 2: Remove stopwords
    mytext_3 = [[word for word in item if word not in my_stopwords] for item in mytext_2]

    # Step 3: Lemmatize (optional) and rejoin tokens into one string per document
    if lemma_flag:
        mytext_4 = lemmatization(mytext_3, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    else:
        mytext_4 = [' '.join(item) for item in mytext_3]

    # Step 4: Vectorize with the fitted vectorizer
    mytext_5 = _vectorizer.transform(mytext_4)

    # Step 5: Project onto the fitted topic model (NMF or LDA)
    topic_probability_scores = model.transform(mytext_5)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores



### Create Topic Labels

In [19]:
def convertLabelsToDict(text, no_topics):
    return dict(zip((i for i in range(no_topics)), re.findall(r'Topic \d+: (.*)', text)))
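
A short usage sketch of the label parser, restated here so the block runs on its own; the three-line label text is a made-up sample in the same format as the label strings defined in this notebook.

```python
import re

def convertLabelsToDict(text, no_topics):
    # Pair topic indices 0..no_topics-1 with the label text after each 'Topic N: '
    return dict(zip(range(no_topics), re.findall(r'Topic \d+: (.*)', text)))

sample = """
Topic 0: Sampling Design
Topic 1: Health Surveys
Topic 2: Imputation
"""
labels = convertLabelsToDict(sample, 3)
print(labels)
# {0: 'Sampling Design', 1: 'Health Surveys', 2: 'Imputation'}
```

Labels are matched by position rather than by the number written in the text, so the lines need to be listed in topic order.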

In [20]:
text_20_LDA = """
Topic 0: Sampling Design
Topic 1: Response Issues & Consumer Expenditure Survey
Topic 2: Bayesian & Uncertainty Analyses
Topic 3: Nonresponse Errors & Adjustments
Topic 4: Health Studies
Topic 5: Phone Surveys
Topic 6: Small Area Estimation
Topic 7: Employment Estimates
Topic 8: Misc Estimation
Topic 9: Disclosure Limitations & Risks
Topic 10: UNDETERMINED
Topic 11: Linkages
Topic 12: Imputation
Topic 13: UNDETERMINED
Topic 14: Address-Based Sampling
Topic 15: UNDETERMINED
Topic 16: Data Collection Process
Topic 17: Healthcare (Administrative Data)
Topic 18: Census & Similar Prgms
Topic 19: Misc Analyses
"""

In [21]:
text_20_NMF = """
Topic 0: Sampling Design
Topic 1: Health Surveys
Topic 2: Imputation
Topic 3: Small Area Estimation
Topic 4: Response Rates
Topic 5: Census
Topic 6: Phone Surveys
Topic 7: Calibration & Weighting
Topic 8: American Community Survey (ACS)
Topic 9: Variance Estimation
Topic 10: Misc Statistics
Topic 11: Nonresponse Bias
Topic 12: Measurement Error
Topic 13: Employment Statistics
Topic 14: Address-Based Sampling
Topic 15: Interviewer Effects & Behavior
Topic 16: Linkages
Topic 17: Mode Effects
Topic 18: UNDETERMINED
Topic 19: Parametric Approaches
"""

# Topic Modeling


Topic Modeling (no labeled data) → Unsupervised learning algorithms 


# Constructing Model Frames


<li>Non-negative Matrix Factorization (NMF)</li>

<li>Latent Dirichlet Allocation (LDA)</li>

Both models require you to set the number of topics ("n") in advance

# NMF

<li>Unsupervised learning algorithm that extracts useful features</li>

<li>Each document is represented as a weighted sum of some components (topics)</li>

<li>Approximates the document-term matrix as the product of a document-topic matrix (W) and a topic-term matrix (H)</li>
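
To make the factorization concrete, the sketch below factors a tiny document-term matrix V (documents × terms) into W (documents × topics) and H (topics × terms) so that V ≈ WH. It uses the classic Lee-Seung multiplicative updates in plain NumPy, which is an assumption made for illustration only; scikit-learn's `NMF` uses a different solver by default, and the matrix values and the function name `nmf_multiplicative` are made up.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))
    H = rng.random((n_topics, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical counts: docs 0-1 use terms 0-1, docs 2-3 use terms 2-3
V = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 1],
              [0, 0, 1, 4]], dtype=float)

W, H = nmf_multiplicative(V, n_topics=2)
print(np.round(W, 2))             # document-topic weights
print(np.round(H, 2))             # topic-term weights
print(np.linalg.norm(V - W @ H))  # reconstruction error
```

Rows of W play the role of `_W` in this notebook (how strongly each document loads on each topic), and rows of H play the role of `_H` (how strongly each term loads on each topic).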

In [22]:
image_ID = '1xXNnNXG5CuSeRsDLjah3dEBoro9_v_je'
Image(url="https://drive.google.com/uc?export=view&id={}".format(image_ID))

<em>Source: https://en.wikipedia.org/wiki/Non-negative_matrix_factorization</em>

In [23]:
def createNMF(min_df, max_df, max_features, stop_words, ngram_range, n_topics, data):
    _vectorizer = TfidfVectorizer(min_df = min_df, max_df = max_df,
                                       max_features = max_features, stop_words= stop_words, ngram_range = ngram_range)
    
    _vectorized = _vectorizer.fit_transform(data)
    _feature_names = _vectorizer.get_feature_names()
    
    _model = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(_vectorized)
    _W = _model.transform(_vectorized)
    _H = _model.components_
    
    return _vectorizer, _vectorized, _feature_names, _model, _W, _H
    

# LDA


<li>Assumes that groups of words that frequently appear together indicate a “topic”</li>


<li>Assumes that each document contains a mixture of “topics”</li>


<li>Probabilistic model using P(word | topic) and P(topics | document) – probabilities are calculated iteratively until convergence</li>
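
The two probability tables in the last bullet combine by the law of total probability: P(word | document) is the sum over topics of P(word | topic) × P(topic | document). The sketch below evaluates this mixture with NumPy for made-up distributions (every number and name here is hypothetical); it illustrates the structure of the model, not the iterative fitting procedure.

```python
import numpy as np

vocab = ["sample", "design", "health", "care"]

# P(word | topic): one row per topic, each row sums to 1
topic_word = np.array([[0.50, 0.40, 0.05, 0.05],   # a "sampling" topic
                       [0.05, 0.05, 0.50, 0.40]])  # a "health" topic

# P(topic | document): one row per document, each row sums to 1
doc_topic = np.array([[0.9, 0.1],
                      [0.3, 0.7]])

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
doc_word = doc_topic @ topic_word
print(np.round(doc_word, 3))
```

Each row of `doc_word` is again a probability distribution over the vocabulary; fitting LDA amounts to searching for the two tables that make the observed word counts most plausible.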

# Visual of LDA


In [24]:
image_ID = '14yNhw5sa740jkkjXbcvGjyoKHvIHkPlE'
Image(url="https://drive.google.com/uc?export=view&id={}".format(image_ID), width=800, height=500)

<em>Source: Blei, 2012 https://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext</em>

In [25]:
def createLDA(min_df, max_df, max_features, stop_words, ngram_range, n_topics, data):
    _vectorizer = CountVectorizer(min_df = min_df, max_df = max_df,
                                    max_features = max_features,
                                    stop_words= stop_words, 
                                    ngram_range = ngram_range,
                                    token_pattern='[a-zA-Z0-9]{3,}')

    _vectorized = _vectorizer.fit_transform(data)
    _feature_names = _vectorizer.get_feature_names()
    
    
    _model = LatentDirichletAllocation(n_components=n_topics,          # Number of topics
                                       max_iter=25,               # Max learning iterations
                                       learning_method='batch',  # batch or online - latter is faster on large sets 
                                       random_state=0,          # Random state
#                                       batch_size=124,            # n docs in each learning iter
#                                       evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
#                                       learning_decay = 0.5       # set learning rate
                                     )
    _W = _model.fit_transform(_vectorized)
    _H = _model.components_
    
    return _vectorizer, _vectorized, _feature_names, _model, _W, _H

# Process Data

### 1. Process all text? First/Last sentences only?


### 2. Do you want to lemmatize the data?


### 3. Update stopwords?

In [26]:
lemma_flag = True
process_all_flag = True # False if you want to process first/last sentence only

data = abstractText(df['Abstract_Text'], process_all=process_all_flag, lemmatize= lemma_flag) 

In [27]:
my_stopwords = ENGLISH_STOP_WORDS.union(['data', 'roundtable', 'datum'])

# IV. Build Model

### Definitions of Parameters to show

# Set parameters

In [28]:
no_features = 50000
no_topics = 20

ngram_range = (1,3)
max_df = 0.5
min_df = 0.02


# for display: "n" keywords to show & "n" abstract titles
no_top_words = 5
no_top_documents = 10
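
As a rough guide to what these settings mean: `ngram_range = (1,3)` makes the vectorizer count single words, two-word phrases, and three-word phrases, and float values of `min_df` and `max_df` are read by scikit-learn as proportions of documents. The standard-library sketch below mimics both ideas on made-up token lists; `ngrams` and `document_frequency` are hypothetical helper names, not scikit-learn functions, and the thresholds are chosen to suit the toy data.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams of tokens for n in ngram_range (inclusive)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def document_frequency(docs, ngram_range=(1, 3)):
    """Proportion of documents in which each n-gram appears at least once."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(ngrams(tokens, ngram_range)))
    return {term: n / len(docs) for term, n in counts.items()}

docs = [["small", "area", "estimation"],
        ["small", "area", "model"],
        ["survey", "model"],
        ["survey", "weight"]]

df_prop = document_frequency(docs)
# Keep terms that are neither too rare nor too common (toy thresholds)
toy_min_df, toy_max_df = 0.5, 0.75
kept = sorted(t for t, p in df_prop.items() if toy_min_df <= p <= toy_max_df)
print(ngrams(docs[0]))
# ['small', 'area', 'estimation', 'small area', 'area estimation', 'small area estimation']
print(kept)
# ['area', 'model', 'small', 'small area', 'survey']
```

With the notebook's values, a term must appear in at least 2% and at most 50% of abstracts to be kept, which is why multi-word phrases such as "small area estimation" can show up as topic keywords.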

# Instantiate Model

Run whichever of the next two cells matches the model you want to use (NMF or LDA)

In [29]:
model_type = 'NMF'
_vectorizer, _vectorized, _feature_names, _model, _W, _H = createNMF(
    min_df= min_df, 
    max_df= max_df, 
    max_features= no_features,
    stop_words= my_stopwords,
    ngram_range= ngram_range,
    n_topics= no_topics,
    data= data)

In [None]:
model_type = 'LDA'
_vectorizer, _vectorized, _feature_names, _model, _W, _H = createLDA(
    min_df= min_df, 
    max_df= max_df, 
    max_features= no_features,
    stop_words= my_stopwords,
    ngram_range= ngram_range,
    n_topics= no_topics,
    data= data)

# V. Display Model Output

In [30]:
saveModel = r'..\models'
saveTextOutput = r'..\output\text'
saveVizOutput = r'..\output\visualizations'

In [31]:
_topics = convertLabelsToDict(text_20_NMF, no_topics) # change depending on model

# Results - NMF

### Displaying topics, top words associated with topic, top titles

<em>open text file</em>

In [32]:
display_topics_and_docs(_H, _W, _feature_names, data, no_top_words, no_top_documents)


Topic 0:
sample design sampling size probability

Investigating the Performance of Inverse Sampling for Model Estimation


Designing Minimum-Cost Multi-Stage Sample Designs


Expanding the Number of Primary Sampling Units for the National Health Interview Survey


Occupational Requirements Survey Sample Design


Adaptive Sampling Using Neyman Allocation


Sample Design Research in the 2010 Sample Redesign


State and Local Government Sample Design for the National Compensation Survey


Studying Millions of Rescued Documents: Sample Plan at the Guatemalan National Police Archive


Using Response Rates to Adjust a Dual Sample Design


Reducing the Public Employment Survey Sample Size

Topic 1:
health care national interview medical

Using the National Health Interview Survey to Monitor Health Insurance and Access to Care


Using the National Health Interview Survey to Monitor the Early Effects of the Affordable Care Act


Evaluation of Design Effects for Selected Estimates in the Medical



A Class of Dual Frame Survey Sampling Estimators in the Presence of a Covariate: How Amy Predicts Her President


Propensity Score Adjustments Using Covariates in Observational Studies


Analysis on Generalized Variance Function Estimators from Complex Sample Surveys


A New Optimal Estimator of Population Proportion in Randomized Response Sampling


Using Successive Difference Replication for Estimating Variances


Small Area Estimation Under Fay-Herriot Models with Nonparametric Estimation of Error Variances

Topic 10:
statistical analysis research risk time

Confidentiality Approaches for Real-Time Systems Generating Aggregated Results


nan


Revising Statistical Standards to Keep Pace with the Web


Human Rights and Statistics: A Reciprocal Relationship


Managing Disclosure Risks in the Curation and Dissemination of Research Data


Comparative Study of Differentially Private Data Synthesis Methods


Rethinking the Risk-Utility Tradeoff Approach to Statistical Disclosure Limitat



Secondary Analysis in GWAS


Cancer Marker Identification via Penalized Integrative Analysis


Confidence Intervals for the Ratio of Two Poisson Rates


Propensity Score Adjustment Method for Nonignorable Nonresponse


Modeling of Longitudinal Biomarker Data with Dropout and Death Using a Weighted Pseudo--Maximum Likelihood Method


Haplotype-Based Association Studies Under Complex Sampling


Bayesian Finite Population Inference for Skewed Survey Data Using Skew-Normal Penalized-Spline Regression

Topic 19:
model effect regression linear variable

A semi-parametric approach to fractional imputation for nonignorable missing data


A Semiparametric Approach to Inference with Nonignorable Missing Data Using Surrogate Information


Nonparametric and Semiparametric M-Quantile Inference for Longitudinal Data


Application of GEE and MRM in Evaluation of the Efficacy of an HIV Prevention Intervention


A Picture Is Worth a Billion Words: Visualizing Mega-Parameter Models from Giga-Scale Tex

Write the above display of topics and relevant titles to a txt file

In [33]:
with open(r'{}\{}_Topics_and_top{}_abstract_titles.txt'.format(saveTextOutput, model_type, no_top_documents), 'w') as f:
    
    for topic_idx, topic in enumerate(_H):
        f.write("Topic {}:\n\n".format(topic_idx))
        f.write(" ".join([_feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        f.write('\n')

        top_doc_indices = np.argsort(_W[:,topic_idx] )[::-1][0:no_top_documents]
        for num, doc_index in enumerate(top_doc_indices):
            f.write('\n{}. {}\n'.format(num, df['Abstract_Title'].iloc[doc_index])) # show titles associated with topics
            f.write('\n')

### View document-topic matrix

In [34]:
doc_topic_matrix = display_document_topic_matrix(_model=_model, no_docs=len(data), _W = _W)
# doc_topic_matrix

### View Top 5 keywords for each topic

In [35]:
df_topic_keywords = show_topics(vectorizer=_vectorizer, model=_model, n_words=5)        
df_topic_keywords.head()

| | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 |
|---|---|---|---|---|---|
| Topic 0 | sample | design | sampling | size | probability |
| Topic 1 | health | care | national | interview | medical |
| Topic 2 | imputation | miss | multiple imputation | multiple | value |
| Topic 3 | area | small | small area | area estimation | small area estimation |
| Topic 4 | response | rate | response rate | respondent | non |


# Predict Topic(s) Based on Text

2018 Abstracts in Survey Methods:

<li>https://ww2.amstat.org/meetings/jsm/2018/onlineprogram/AbstractDetails.cfm?abstractid=329351</li>

<li>https://ww2.amstat.org/meetings/jsm/2018/onlineprogram/AbstractDetails.cfm?abstractid=330627</li>


In [41]:
mytext = [input('Paste text here: ')]

Paste text here: Current Population Survey (CPS) is the oldest survey in the United States since 1942 and the source of numerous high-profile economic statistics, including the national unemployment rate. Balanced repeated replication (BRR) is the main methodology used for variance estimation in CPS as well as many other surveys conducted by the U.S. Census Bureau. In this talk, we study the properties of collapsed-stratum BRR implemented at the Census Bureau in estimating variance of CPS household response rate in non-self-representing (NSR) strata. In addition, we will present some bias study in BRR variance estimate using simulations based on CPS data as a frame.


In [42]:
topic, prob_scores = predict_topic(text =mytext, nlp=nlp, lemma_flag= lemma_flag, model = _model)

# run this if you don't have labels for your topics
# print(r'Most dominant topic(s): {}'.format(', '.join([df_topic_keywords.index[i] for i in np.argsort(-prob_scores)[0,:3] 
#                                                       if prob_scores[0, i]> 0.01])))

# # run this if you do have labels for your topics
print('Most dominant topic(s): \n{}'.format(',\n\n'.join(['\n'.join(('Topic: ' + str(i), _topics[i])) 
                                                        for i in np.argsort(-prob_scores)[0,:5] if prob_scores[0, i]> 0.01])))

Most dominant topic(s): 
Topic: 5
Census,

Topic: 4
Response Rates,

Topic: 8
American Community Survey (ACS),

Topic: 9
Variance Estimation,

Topic: 13
Employment Statistics
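
The selection logic in the cell above (rank topics by score, keep at most five, drop any at or below 0.01) can be seen in isolation with made-up scores. The `scores` array and `labels` dictionary below are hypothetical stand-ins for `prob_scores` and `_topics`.

```python
import numpy as np

scores = np.array([[0.02, 0.40, 0.005, 0.25, 0.0, 0.10]])  # shape (1, n_topics)
labels = {0: "Sampling Design", 1: "Health Surveys", 2: "Imputation",
          3: "Small Area Estimation", 4: "Response Rates", 5: "Census"}

# Negating the scores makes argsort return indices from highest to lowest score
ranked = np.argsort(-scores)[0, :5]
dominant = [(int(i), labels[int(i)]) for i in ranked if scores[0, i] > 0.01]
print(dominant)
# [(1, 'Health Surveys'), (3, 'Small Area Estimation'), (5, 'Census'), (0, 'Sampling Design')]
```

The 0.01 cutoff is what keeps near-zero topics out of the printed list even when fewer than five topics are relevant.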


# VII. Visualizing Topics

### Import viz libraries

In [None]:
import bokeh.plotting as bp
from bokeh.plotting import save
from bokeh.models import HoverTool
from sklearn.manifold import TSNE

### Initialize t-SNE

In [None]:
# angle value close to 1 means sacrificing accuracy for speed
# pca initialization usually leads to better results 
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')


tsne_= tsne_model.fit_transform(_W)

In [None]:
cust_color_palette = np.array(["#e6194b", "#3cb44b", "#ffe119", "#0082c8", "#f58231",
                               "#911eb4", "#46f0f0", "#f032e6", "#d2f53c", "#fabebe",
                               "#008080", "#e6beff", "#aa6e28", "#fffac8", "#800000",
                               "#aaffc3", "#808000", "#ffd8b1", "#000080", "#808080",
                              "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78","#2ca02c"])
                           


In [None]:
_keys = _W.argmax(axis=1).tolist()  # dominant topic index for each document

In [None]:
df_tsne = pd.DataFrame(tsne_)
df_tsne.rename(columns = {0: 'X_tsne', 1: 'Y_tsne'}, inplace = True)
df_topics = pd.DataFrame(_W)
df_topics['ind'] = _keys
df_topics.head()

In [None]:
df2 = pd.merge(df_topics, df_tsne, how = 'inner', left_index=True, right_index = True)

In [None]:
from bokeh.plotting import figure, show, output_notebook, save, output_file
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource
output_notebook()

In [None]:
source = bp.ColumnDataSource(dict(
    x=df2['X_tsne'],
    y=df2['Y_tsne'],
    color=cust_color_palette[_keys],
#     topic_key= df2['ind'],
    topic_key= df2['ind'].apply(lambda l: _topics[l]), # dictionary here needs to be updated as you change models
    title= df['Abstract_Title'],
    content = df['Abstract_Text']
))

# Visualizing results - scatterplot

In [None]:
title = 'Visualization of {} topics ({})'.format(no_topics, model_type)

_model = figure(plot_width=1100, plot_height=600,
                     title=title, tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                     x_axis_type=None, y_axis_type=None, min_border=1)

_model.scatter(x='x', y='y', legend='topic_key', source=source,
                 color='color', alpha=0.8, size=10)#'msize', )

# hover tools
hover = _model.select(dict(type=HoverTool))
hover.tooltips = {"Title": "@title", "Topic": "@topic_key"} #, KeyWords: @content - Topic: @topic_key "}
_model.legend.location = "top_left"

# move legend to outside of chart area
new_legend = _model.legend[0]
_model.legend[0].plot = None
_model.add_layout(new_legend, 'right')

                

In [None]:
show(_model)

In [None]:
filename = r'{}_scatterplot_{}_topics.html'.format(model_type, no_topics)


output_file(r'{}\{}'.format(saveVizOutput, filename))
save(_model)

# bp.reset_output()

# VIII. Visualizing LDA (optional)

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

The LDA model is re-fit here because the variables returned by the function above do not retain the attributes that the visualization method needs

In [None]:
_vectorizer = CountVectorizer(min_df = min_df, max_df = max_df,
                                    max_features = no_features,
                                    stop_words= my_stopwords, 
                                    ngram_range = ngram_range,
                                    token_pattern='[a-zA-Z0-9]{3,}')

_vectorized = _vectorizer.fit_transform(data)
_feature_names = _vectorizer.get_feature_names()


_model = LatentDirichletAllocation(n_components=no_topics,          # Number of topics
                                   max_iter=25,               # Max learning iterations
                                   learning_method='batch',  # batch or online - latter is faster on large sets 
                                   random_state=0,          # Random state
#                                       batch_size=124,            # n docs in each learning iter
#                                       evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                  n_jobs = -1,               # Use all available CPUs
#                                       learning_decay = 0.5       # set learning rate
                                 )
_W = _model.fit_transform(_vectorized)
_H = _model.components_

# Visualizing results - intertopic distance (LDA)

In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model= _model, dtm=_vectorized, vectorizer=_vectorizer, mds='pca')


In [None]:
panel

In [None]:
pyLDAvis.save_html(panel, r'{}\intertopic_distance_LDA_{}topics.html'.format(saveVizOutput, no_topics))  # raw string keeps the backslash literal

# Results & Limitations

<li>Results from any unsupervised machine learning model should be taken with a grain of salt</li>

<li>“Topics” as uncovered by machine learning models may not match what humans understand as a coherent topic; for machine learning models, “topics” are composed of components which may or may not have semantic meaning</li>

><li>Two authors writing about different subjects could be seen as having similar “topics” because of their writing style and preferred word choices</li>

<li>Pre-processing data and model parameters may dramatically change the separation of topics</li>


# Questions?
## email: Alison_Thaung@abtassoc.com
## twitter: @AlisonThaung