### Project Gutenberg: Topic Modelling Using Unsupervised NLP

https://www.gutenberg.org/

Assignment 2: Devika Pace

[Import Libraries](#0)

[Select and load book(s) from Project Gutenberg](#1)
- Alice in Wonderland
- Grimm's Fairy Tales
- Walden

[Pre-process Text](#2)
- Clean
- Tokenize
- Stop Word Removal
- Stemming and Lemmatization

[Vectorization](#3)
- Count Vectorizer
- Tf-idf Vectorizer
- Sparsity

[Topic Model with Unsupervised Clustering Algorithm](#4)
- LDA
- NMF
- LSI

[Metrics & Results](#5)
- Tuning
- Visualization
- Performance


<a name="0"></a>
### **Import Libraries**

In [3]:
import os
from urllib.request import urlopen
from warnings import simplefilter
simplefilter('ignore')

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import uniform, truncnorm, randint
from pprint import pprint

import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis 3.4
import matplotlib.pyplot as plt
%matplotlib inline

<a name="1"></a>
### **Project Gutenberg**

We will download a few different texts for comparison's sake.

In [4]:
url_alice = "https://gutenberg.org/files/11/11-0.txt"      # Alice in Wonderland
url_grimm = "https://gutenberg.org/files/2591/2591-0.txt"  # Grimm's Fairy Tales
url_walden = "https://gutenberg.org/files/205/205-0.txt"   # Walden

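def get_gutenberg_text(url):
    """Download a Project Gutenberg text file and return it as a decoded string."""
    with urlopen(url) as response:
        raw_bytes = response.read()
        # fall back to UTF-8 when the server does not declare a charset
        encoding = response.headers.get_content_charset('utf-8')
    return raw_bytes.decode(encoding)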

**Alice in Wonderland**

In [5]:
%%capture
text = get_gutenberg_text(url_alice)
print(f'length of text response: {len(text)}, length of split: {len(text.split("***"))}\n')
text.split('***')

In [6]:
alice = text.split('***')[2]
print(f'LENGTH OF ALICE TEXT: {len(alice)}')
print(f'\n************************START OF ALICE TEXT*****************************:{alice[0:3000]}')
print(f'\n**************************END OF ALICE TEXT*****************************:\n\n{alice[-1000:]}')

LENGTH OF ALICE TEXT: 147994

************************START OF ALICE TEXT*****************************:

[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.   A Caucus-Race and a Long Tale
 CHAPTER IV.    The Rabbit Sends in a Little Bill
 CHAPTER V.     Advice from a Caterpillar
 CHAPTER VI.    Pig and Pepper
 CHAPTER VII.   A Mad Tea-Party
 CHAPTER VIII.  The Queen’s Croquet-Ground
 CHAPTER IX.    The Mock Turtle’s Story
 CHAPTER X.     The Lobster Quadrille
 CHAPTER XI.    Who Stole the Tarts?
 CHAPTER XII.   Alice’s Evidence




CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use 
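
Selecting the book body with `text.split('***')[2]` depends on how many `***` runs happen to appear in the file. A minimal sketch of a more explicit alternative follows, assuming the file wraps the body in `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` lines. The marker wording and the helper name `extract_body` are assumptions, not part of the notebook above.

```python
import re

# Assumed marker wording; "THE" and "THIS" are both accepted.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def extract_body(text):
    """Return the text between the START and END marker lines (hypothetical helper)."""
    start = START_RE.search(text)
    if start is None:
        raise ValueError("START marker not found")
    end = END_RE.search(text, start.end())
    if end is None:
        raise ValueError("END marker not found")
    return text[start.end():end.start()]


# Synthetic stand-in for a downloaded file
sample = (
    "header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "body text\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "license"
)
print(repr(extract_body(sample)))
```

Raising an error when a marker is missing makes a change in the file layout visible immediately, instead of silently returning the wrong slice.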

**Grimm's Fairy Tales**

In [7]:
%%capture
text = get_gutenberg_text(url_grimm)
print(f'length of text response: {len(text)}, length of split: {len(text.split("***"))}\n')
text.split('***')

In [8]:
grimm = text.split('***')[2]
grimm_about = text.split('***')[3]
print(f'LENGTH OF GRIMM TEXT: {len(grimm)}')
print(f'\n************************START OF GRIMM TEXT*****************************:{grimm[0:3000]}')
print(f'\n*******************END OF GRIMM TEXT (SNOW WHITE)***********************:\n\n{grimm[-1000:]}')
print(f'******************************ABOUT*************************************:\n{grimm_about}')

LENGTH OF GRIMM TEXT: 528873

************************START OF GRIMM TEXT*****************************:




Grimms’ Fairy Tales

By Jacob Grimm and Wilhelm Grimm



PREPARER’S NOTE

     The text is based on translations from
     the Grimms’ Kinder und Hausmärchen by
     Edgar Taylor and Marian Edwardes.




CONTENTS:

     THE GOLDEN BIRD
     HANS IN LUCK
     JORINDA AND JORINDEL
     THE TRAVELLING MUSICIANS
     OLD SULTAN
     THE STRAW, THE COAL, AND THE BEAN
     BRIAR ROSE
     THE DOG AND THE SPARROW
     THE TWELVE DANCING PRINCESSES
     THE FISHERMAN AND HIS WIFE
     THE WILLOW-WREN AND THE BEAR
     THE FROG-PRINCE
     CAT AND MOUSE IN PARTNERSHIP
     THE GOOSE-GIRL
     THE ADVENTURES OF CHANTICLEER AND PARTLET
       1. HOW THEY WENT TO THE MOUNTAINS TO EAT NUTS
       2. HOW CHANTICLEER AND PARTLET WENT TO VISIT MR KORBES
     RAPUNZEL
     FUNDEVOGEL
     THE VALIANT LITTLE TAILOR
     HANSEL AND GRETEL
     THE MOUSE, T

**Walden**


In [9]:
%%capture
text = get_gutenberg_text(url_walden)
print(f'length of text response: {len(text)}, length of split: {len(text.split("***"))}\n')
text.split('***')

In [10]:
walden = text.split('***')[2]
print(f'LENGTH OF WALDEN TEXT: {len(walden)}')
print(f'\n************************START OF THOREAU TEXT*****************************:{walden[0:2500]}')
print(f'\n**************************END OF THOREAU TEXT*****************************:\n\n{walden[-1000:]}')

************************START OF THOREAU TEXT*****************************:




WALDEN




and



ON THE DUTY OF CIVIL DISOBEDIENCE



by Henry David Thoreau


cover


Contents


 WALDEN

 Economy
 Where I Lived, and What I Lived For
 Reading
 Sounds
 Solitude
 Visitors
 The Bean-Field
 The Village
 The Ponds
 Baker Farm
 Higher Laws
 Brute Neighbors
 House-Warming
 Former Inhabitants and Winter Visitors
 Winter Animals
 The Pond in Winter
 Spring
 Conclusion

 ON THE DUTY OF CIVIL DISOBEDIENCE



WALDEN

Economy

When I wrote the following pages, or rather the bulk of them, I lived
alone, in the woods, a mile from any neighbor, in a house which I had
built myself, on the shore of Walden Pond, in Concord, Massachusetts,
and earned my living by the labor of my hands only. I lived there two
years and two months. At present I am a sojourner in civilized life
again.

I should not obtrude my affairs 

<a name="2"></a>
### **Pre-process Text**

In [11]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

def preprocess_text(text, n=25):
    """Clean, tokenize, drop stop words, lemmatize and stem; print the first n tokens of each step."""
    # clean text: lowercase, then replace every character outside a-z, A-Z, 0-9 with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # tokenize text
    words = word_tokenize(text)
    print('tokenized:\n', words[:n])

    # remove stop words (build the set once instead of reloading the list for every token)
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    print('no stop words:\n', words[:n])

    # lemmatize, treating every token as a verb (pos='v')
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w, pos='v') for w in words]
    print('lemmed with pos:\n', lemmed[:n])

    # stemming: strip common suffixes with the Porter algorithm
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in lemmed]
    print('stemmed:\n', stemmed[:n])

    return stemmed

print('stop words:\n', stopwords.words('english'), '\n')

stop words:
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 's

In [12]:
post_alice = preprocess_text(alice)

tokenized:
 ['illustration', 'alice', 's', 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', 'the', 'millennium', 'fulcrum', 'edition', '3', '0', 'contents', 'chapter', 'i', 'down', 'the', 'rabbit', 'hole', 'chapter', 'ii', 'the']
no stop words:
 ['illustration', 'alice', 'adventures', 'wonderland', 'lewis', 'carroll', 'millennium', 'fulcrum', 'edition', '3', '0', 'contents', 'chapter', 'rabbit', 'hole', 'chapter', 'ii', 'pool', 'tears', 'chapter', 'iii', 'caucus', 'race', 'long', 'tale']
lemmed with pos:
 ['illustration', 'alice', 'adventure', 'wonderland', 'lewis', 'carroll', 'millennium', 'fulcrum', 'edition', '3', '0', 'content', 'chapter', 'rabbit', 'hole', 'chapter', 'ii', 'pool', 'tear', 'chapter', 'iii', 'caucus', 'race', 'long', 'tale']
stemmed:
 ['illustr', 'alic', 'adventur', 'wonderland', 'lewi', 'carrol', 'millennium', 'fulcrum', 'edit', '3', '0', 'content', 'chapter', 'rabbit', 'hole', 'chapter', 'ii', 'pool', 'tear', 'chapter', 'iii', 'caucu', 'race', 'long', '

In [13]:
post_grimm = preprocess_text(grimm)

tokenized:
 ['grimms', 'fairy', 'tales', 'by', 'jacob', 'grimm', 'and', 'wilhelm', 'grimm', 'preparer', 's', 'note', 'the', 'text', 'is', 'based', 'on', 'translations', 'from', 'the', 'grimms', 'kinder', 'und', 'hausm', 'rchen']
no stop words:
 ['grimms', 'fairy', 'tales', 'jacob', 'grimm', 'wilhelm', 'grimm', 'preparer', 'note', 'text', 'based', 'translations', 'grimms', 'kinder', 'und', 'hausm', 'rchen', 'edgar', 'taylor', 'marian', 'edwardes', 'contents', 'golden', 'bird', 'hans']
lemmed with pos:
 ['grimms', 'fairy', 'tales', 'jacob', 'grimm', 'wilhelm', 'grimm', 'preparer', 'note', 'text', 'base', 'translations', 'grimms', 'kinder', 'und', 'hausm', 'rchen', 'edgar', 'taylor', 'marian', 'edwardes', 'content', 'golden', 'bird', 'hans']
stemmed:
 ['grimm', 'fairi', 'tale', 'jacob', 'grimm', 'wilhelm', 'grimm', 'prepar', 'note', 'text', 'base', 'translat', 'grimm', 'kinder', 'und', 'hausm', 'rchen', 'edgar', 'taylor', 'marian', 'edward', 'content', 'golden', 'bird', 'han']
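
The Grimm output above splits "Hausmärchen" into `'hausm'` and `'rchen'`, because the cleaning pattern `[^a-zA-Z0-9]` replaces the accented letter with a space. A minimal sketch of a Unicode-aware alternative is shown below. `clean_unicode` is a hypothetical variant, not what the notebook ran, and the vectorizer's ASCII-only `token_pattern` would also need adjusting for the accented token to survive vectorization.

```python
import re


def clean_ascii(text):
    # Same pattern as preprocess_text: anything outside a-z, A-Z, 0-9 becomes a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


def clean_unicode(text):
    # [\W_] matches every character that is not a Unicode letter or digit
    # (the underscore is listed explicitly because \w would keep it).
    return re.sub(r"[\W_]", " ", text.lower())


title = "Kinder und Hausmärchen"
print(clean_ascii(title).split())    # the accented word is cut in two
print(clean_unicode(title).split())  # the accented word stays whole
```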


In [14]:
post_walden = preprocess_text(walden)

tokenized:
 ['walden', 'and', 'on', 'the', 'duty', 'of', 'civil', 'disobedience', 'by', 'henry', 'david', 'thoreau', 'cover', 'contents', 'walden', 'economy', 'where', 'i', 'lived', 'and', 'what', 'i', 'lived', 'for', 'reading']
no stop words:
 ['walden', 'duty', 'civil', 'disobedience', 'henry', 'david', 'thoreau', 'cover', 'contents', 'walden', 'economy', 'lived', 'lived', 'reading', 'sounds', 'solitude', 'visitors', 'bean', 'field', 'village', 'ponds', 'baker', 'farm', 'higher', 'laws']
lemmed with pos:
 ['walden', 'duty', 'civil', 'disobedience', 'henry', 'david', 'thoreau', 'cover', 'content', 'walden', 'economy', 'live', 'live', 'read', 'sound', 'solitude', 'visitors', 'bean', 'field', 'village', 'ponds', 'baker', 'farm', 'higher', 'laws']
stemmed:
 ['walden', 'duti', 'civil', 'disobedi', 'henri', 'david', 'thoreau', 'cover', 'content', 'walden', 'economi', 'live', 'live', 'read', 'sound', 'solitud', 'visitor', 'bean', 'field', 'villag', 'pond', 'baker', 'farm', 'higher', 'law']


<a name="3"></a>
### **Vectorize Text**

In [15]:
len(post_alice), len(post_grimm), len(post_walden)

(12306, 44860, 54684)
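
These counts are numbers of tokens, and `fit_transform` treats each element of its input as one document. Passing a flat token list therefore produces one single-word document per token, which is why the Alice matrices in the model section have 12306 rows. One possible alternative, sketched here under the assumption that fixed-size chunks are an acceptable stand-in for documents, is to join consecutive tokens into pseudo-documents before vectorizing. The helper name and the chunk size are hypothetical.

```python
def chunk_tokens(tokens, chunk_size=200):
    """Join consecutive tokens into space-separated pseudo-documents."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# Tiny stand-in for a list such as post_alice
tokens = ["alic", "begin", "get", "tire", "sit", "sister", "bank"]
docs = chunk_tokens(tokens, chunk_size=3)
print(docs)
```

Splitting on chapter headings would be another option. Either way, each row of the vectorized matrix would then describe a passage rather than a single word.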

In [16]:
# raw strings keep the '\-' escape from being interpreted by Python itself
vectorizer = CountVectorizer(analyzer='word', min_df=10,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')

tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=10,
                                   stop_words='english', lowercase=True,
                                   token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')

vec_alice = vectorizer.fit_transform(post_alice)
vec_tfidf_alice = tfidf_vectorizer.fit_transform(post_alice)

vec_grimm = vectorizer.fit_transform(post_grimm)
vec_tfidf_grimm = tfidf_vectorizer.fit_transform(post_grimm)

vec_walden = vectorizer.fit_transform(post_walden)
vec_tfidf_walden = tfidf_vectorizer.fit_transform(post_walden)

In [17]:
vecs = {'vec_alice': vec_alice, 'vec_tfidf_alice':vec_tfidf_alice, 'vec_grimm':vec_grimm, 
        'vec_tfidf_grimm': vec_tfidf_grimm, 'vec_walden': vec_walden, 'vec_tfidf_walden':vec_tfidf_walden}
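
Calling `fit_transform` again on the same vectorizer refits it and replaces its vocabulary. After the cell above, `vectorizer` and `tfidf_vectorizer` hold the Walden vocabulary, while `vec_alice` and `vec_tfidf_alice` keep the column order learned from Alice. That mismatch would explain Walden-specific stems such as 'concord' appearing in the topic listings further down. A minimal sketch of one way to keep each matrix paired with its own vocabulary follows, assuming one vectorizer per book. The helper name and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_books(books, **vectorizer_kwargs):
    """Fit one TfidfVectorizer per book and return {name: (vectorizer, matrix)}."""
    fitted = {}
    for name, docs in books.items():
        vec = TfidfVectorizer(**vectorizer_kwargs)
        fitted[name] = (vec, vec.fit_transform(docs))
    return fitted


# Toy documents standing in for the pre-processed books
books = {
    "alice": ["rabbit hole", "mad tea party", "white rabbit"],
    "walden": ["pond in winter", "bean field", "walden pond"],
}
fitted = vectorize_books(books)
for name, (vec, matrix) in fitted.items():
    print(name, matrix.shape, sorted(vec.vocabulary_))
```

With this layout, the vectorizer passed to a function such as `print_topics` can always be the one that produced the matrix the model was trained on.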

#### Check Sparsity

In [18]:
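def check_sparsity(key, vec):
    # materialize the sparse matrix as a dense one (memory-hungry for large inputs)
    data_dense = vec.todense()
    # report the percentage of non-zero cells (the density): the lower the value, the sparser the matrix
    print(key, "sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")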

In [19]:
for k,v in vecs.items():
    check_sparsity(k,v)

vec_alice sparsicity:  0.22439548093464512 %
vec_tfidf_alice sparsicity:  0.22439548093464512 %
vec_grimm sparsicity:  0.0934931314461793 %
vec_tfidf_grimm sparsicity:  0.0934931314461793 %
vec_walden sparsicity:  0.05316807626200962 %
vec_tfidf_walden sparsicity:  0.05316807626200962 %
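
`check_sparsity` converts each matrix to a dense array before counting, which can be costly for larger corpora. A minimal sketch of the same percentage computed directly from the sparse representation is shown below. It assumes a SciPy sparse matrix with no explicitly stored zeros, since `nnz` counts stored entries. The function name is hypothetical.

```python
from scipy.sparse import csr_matrix


def density_percent(matrix):
    """Percentage of stored (non-zero) entries in a SciPy sparse matrix."""
    rows, cols = matrix.shape
    return 100.0 * matrix.nnz / (rows * cols)


# 2 non-zero cells out of 8
demo = csr_matrix([[1, 0, 0, 0], [0, 0, 2, 0]])
print(density_percent(demo))
```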



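We will model 'Alice in Wonderland': it has the highest share of non-zero cells (about 0.22%), so it is the least sparse of the three, and at 12,306 tokens it is roughly 20-25% the size of the sparsest text, 'Walden' (54,684 tokens).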

In [20]:
data_vectorized = vec_tfidf_alice.copy()

<a name="4"></a>
### Build Models

In [21]:
# Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=10)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=10)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Latent Dirichlet Allocation Model (n_components is the number of topics)
lda_model = LatentDirichletAllocation(n_components=10, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

(12306, 10)
(12306, 10)
(12306, 10)


In [22]:
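# inspect the inferred topics: print the top_n highest-weighted terms of each topic
def print_topics(model, vectorizer, top_n=10):
    # get_feature_names() was removed in scikit-learn 1.2; prefer get_feature_names_out()
    if hasattr(vectorizer, 'get_feature_names_out'):
        feature_names = vectorizer.get_feature_names_out()
    else:
        feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(str(feature_names[i]), topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])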
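
The slice `[:-top_n - 1:-1]` is compact but easy to misread. `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end, yielding the `top_n` largest weights first. A small sketch with made-up weights (the helper name is hypothetical):

```python
import numpy as np


def top_indices(weights, top_n):
    """Indices of the top_n largest weights, largest first."""
    return np.asarray(weights).argsort()[:-top_n - 1:-1]


weights = np.array([0.1, 0.9, 0.0, 0.5])
print(top_indices(weights, 2).tolist())  # [1, 3]: 0.9 first, then 0.5
```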

In [23]:
# check log-likelihood and perplexity of an LDA model
def print_performance(model, data_vectorized):
    # log-likelihood: higher is better
    print("Log Likelihood: ", model.score(data_vectorized))
    # perplexity: lower is better; perplexity = exp(-1. * log-likelihood per word)
    print("Perplexity: ", model.perplexity(data_vectorized))
    # model parameters
    pprint(model.get_params())

In [24]:
print("NMF Model:")
print_topics(nmf_model, tfidf_vectorizer)
print("=" * 20)

NMF Model:
Topic 0:
[('commenc', 4.802616494093879), ('deliber', 0.0), ('bear', 0.0), ('berri', 0.0), ('beneath', 0.0), ('bend', 0.0), ('belong', 0.0), ('bell', 0.0), ('believ', 0.0), ('behold', 0.0)]
Topic 1:
[('accord', 4.469338245807474), ('deliber', 0.0), ('beast', 0.0), ('besid', 0.0), ('berri', 0.0), ('beneath', 0.0), ('bend', 0.0), ('belong', 0.0), ('bell', 0.0), ('believ', 0.0)]
Topic 2:
[('cow', 3.427438806199466), ('civil', 2.452125007609153e-20), ('creatur', 1.5476682240074683e-21), ('custom', 4.622127836842346e-26), ('board', 8.200703599143148e-27), ('blue', 1.8387333664617303e-28), ('deal', 1.67093304010306e-28), ('caus', 1.8546914468005856e-29), ('clean', 1.3501012257806272e-29), ('clear', 2.1652719464115938e-30)]
Topic 3:
[('burst', 3.37011611454764), ('care', 4.669361777040618e-19), ('aliv', 1.2795183384945585e-29), ('boat', 4.0154177571039855e-31), ('cellar', 1.142838605403968e-33), ('cri', 6.017174148850361e-35), ('attend', 4.225440133843332e-35), ('color', 3.10374325

In [25]:
print("LSI Model:")
print_topics(lsi_model, tfidf_vectorizer)
print("=" * 20)

LSI Model:
Topic 0:
[('commenc', 0.9999999999974932), ('caus', 9.132190885816087e-07), ('clear', 7.164715201356431e-07), ('deal', 5.04097293831358e-07), ('danger', 3.579064907926029e-07), ('cover', 2.3872127501697113e-07), ('art', 2.3658546926946302e-07), ('berri', 7.645407317979535e-08), ('boat', 7.273440008946646e-08), ('cri', 5.464417711411476e-08)]
Topic 1:
[('accord', 0.9999999998529792), ('custom', 9.084788210291837e-06), ('civil', 3.9928359902025675e-06), ('aliv', 3.815188586139706e-06), ('care', 3.413616478705748e-06), ('blue', 2.9238207062881833e-06), ('cove', 2.497422065220584e-06), ('clean', 1.9822924869765313e-06), ('betray', 1.6874471930060731e-06), ('advantag', 1.668044902985133e-06)]
Topic 2:
[('cow', 0.9999575608228036), ('clear', 0.004078647375475867), ('caus', 0.0031027258932351826), ('bright', 0.0021081829550984458), ('cover', 0.0019305435413368827), ('danger', 0.0011837908918862245), ('attend', 0.0008210500158325748), ('cultiv', 0.0008037781665256781), ('cri', 0.000

In [26]:
print("LDA Model:")
print_topics(lda_model, tfidf_vectorizer)
print("=" * 20)

LDA Model:
Topic 0:
[('art', 44.34589969382979), ('bank', 36.913361983039934), ('birch', 36.52542653047315), ('concord', 36.097005890264576), ('daili', 32.86179963332501), ('cost', 29.000238415954094), ('circul', 28.272648232992143), ('cloud', 28.020214691756266), ('absolut', 27.01873324062051), ('deliber', 25.45378671480786)]
Topic 1:
[('cow', 128.51005850768775), ('care', 75.14476845719682), ('attend', 43.30094891291943), ('cove', 41.73404787555231), ('cellar', 33.71431693353539), ('best', 30.78737440084893), ('apart', 24.48102505873801), ('buri', 20.196921172430844), ('appetit', 18.241898118538224), ('calm', 15.369495897586923)]
Topic 2:
[('bring', 106.469822389357), ('cover', 75.08034011411172), ('blue', 63.16739346974223), ('deal', 51.11214945100638), ('certainli', 34.0171678405048), ('add', 33.37152110324514), ('companion', 22.71013520799123), ('actual', 21.126184351183095), ('constitut', 21.112249854691004), ('america', 19.777046254896348)]
Topic 3:
[('accord', 392.8740333473071

<a name="5"></a>
### LDA Performance Metrics

In [27]:
print_performance(lda_model, data_vectorized)

Log Likelihood:  -34952.01706683081
Perplexity:  190.9465505315102
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 10,
 'n_jobs': None,
 'perp_tol': 0.1,
 'random_state': None,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
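
A minimal sketch of how the two numbers above relate, assuming scikit-learn's definition in which perplexity is `exp(-score / total_count)` and `total_count` is the sum of all entries in the input matrix. The function names are hypothetical.

```python
import math


def perplexity_from_score(log_likelihood, total_count):
    """Perplexity implied by a total log-likelihood and the sum of the matrix entries."""
    return math.exp(-log_likelihood / total_count)


def implied_total_count(log_likelihood, perplexity):
    """Invert the formula: the total count that links a score to a perplexity."""
    return -log_likelihood / math.log(perplexity)


# Round trip on made-up numbers: 10 "words", each assigned probability 1/4
score = 10 * math.log(0.25)
print(perplexity_from_score(score, 10))

# Total count implied by the values reported above
print(implied_total_count(-34952.01706683081, 190.9465505315102))
```

Because the log-likelihood is a total over the data it is scored on, it grows in magnitude with the amount of data, whereas perplexity is normalized per unit of count. Perplexity is therefore the safer number to compare across data sets of different sizes.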


### Visualization

In [32]:
from IPython.display import HTML, display
# refitting on Alice restores the Alice vocabulary in tfidf_vectorizer (it was last fit on Walden)
data_vectorized = tfidf_vectorizer.fit_transform(post_alice)

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, tfidf_vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda_result_10.html')
display(HTML(filename='lda_result_10.html'))


[Interactive LDA Visualization - Alice in Wonderland (10 topics)](lda_result_10.html?raw=True)

![](pyLDAvis_alice_10_stat.png?raw=True)

### Tune Hyperparameters

In [34]:
# learning_decay only steers the online update; with the default learning_method='batch' it should not change the fit
search_params = {'n_components': [5, 15, 30], 'learning_decay': [.5, .9]}
model = GridSearchCV(LatentDirichletAllocation(), param_grid=search_params)
model.fit(data_vectorized)

GridSearchCV(estimator=LatentDirichletAllocation(),
             param_grid={'learning_decay': [0.5, 0.9],
                         'n_components': [5, 15, 30]})

### Best Model

In [35]:
best_lda_model = model.best_estimator_
print("Best Model's Params: ", model.best_params_)
print("Best Log Likelihood Score: ", model.best_score_)
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 5}
Best Log Likelihood Score:  -8127.429234058503
Model Perplexity:  184.7837496392483
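
To put the perplexity change in proportion, here is a small arithmetic sketch using the two perplexity values reported in this notebook (the helper name is hypothetical):

```python
def relative_change(before, after):
    """Signed relative change from before to after."""
    return (after - before) / before


change = relative_change(190.9465505315102, 184.7837496392483)
print(f"{change:.1%}")  # -3.2%
```

Only the perplexities are compared here. The best score reported by `GridSearchCV` is a mean over held-out cross-validation folds, so it is measured on less data than the full-data log-likelihood printed earlier.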


In [37]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, tfidf_vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda_result_5.html')
display(HTML('lda_result_5.html'))

[Interactive LDA Visualization - Alice in Wonderland (5 topics)](lda_result_5.html?raw=True)

![](pyLDAvis_alice_5_stat.png?raw=True)

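Tuning the number of components and the learning decay modestly improved our topic model: the grid search selected 5 topics with a learning decay of 0.5, and perplexity on the full data fell from 191 to 185. The two log-likelihood values are not directly comparable, because -34952 is scored on the full data set, whereas -8127 is the mean score that GridSearchCV computed on held-out cross-validation folds, each of which contains only a fraction of the data. To the reader, however, the topic threads remain somewhat perplexing, so to speak. As stated in the reference below, 'optimizing for perplexity may not yield human interpretable topics'.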

We downloaded three texts and analyzed 'Alice in Wonderland' because it is the least sparse and the shortest of the three. It would be interesting to examine how these factors affect the models, and to compare different pre-processing and vectorization methods.

References:

LDA Performance and Hyperparameter Tuning<br>
https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/


Evaluate Topic Models: Latent Dirichlet Allocation (LDA)<br>
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0