## LDA Text Mining

In natural language processing (NLP), Latent Dirichlet Allocation (LDA) is a popular topic-modeling technique that discovers recurring topics and latent semantic structure in a corpus.

This notebook covers the cleaning and exploratory data analysis (EDA) steps.

#### Cleaning

In the previous notebook, we did some preliminary text cleaning. We will now fit one LDA model per author, using that author's works, for each of the 20 most prolific writers in the [University of Michigan Gutenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html).

In [1]:
import pandas as pd
import os
import re
import gensim
import gensim.corpora as corpora

result_df = pd.read_csv('./data/result_df.csv')

`stop_words.py` contains one function, `custom_stopwords()`, which returns a combined list of stop words from gensim, NLTK, spaCy, and other sources. **NOTE:** this stop-word list can be extended as needed.

In [2]:
# Import custom stopwords list
from stop_words import custom_stopwords

stop_words = custom_stopwords()

In [3]:
# Tokenize text
def regex_tokenizer(text):
    # Remove every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Split the remaining text into alphanumeric tokens
    words = re.findall(r'[a-zA-Z0-9]+', text)

    # Remove stop words (the comparison is case-sensitive)
    words = [w for w in words if w not in stop_words]
    return words
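
To make the tokenizer's behavior concrete, here is a minimal sketch that repeats the same two regex steps on one sentence. The `demo_stop_words` set and the `demo_tokenizer` name are hypothetical stand-ins used only for this illustration; the real list comes from `custom_stopwords()`.

```python
import re

# Hypothetical stand-in for custom_stopwords(); the real list is much longer.
demo_stop_words = {"the", "and", "of", "a"}

def demo_tokenizer(text, stop_words=demo_stop_words):
    """Same steps as regex_tokenizer, with the stop-word list passed in."""
    text = re.sub(r'[^\w\s]', '', text)          # drop punctuation
    words = re.findall(r'[a-zA-Z0-9]+', text)    # keep alphanumeric runs
    return [w for w in words if w not in stop_words]

sample = "The hound's howl, and the fog of Dartmoor!"
tokens = demo_tokenizer(sample)
print(tokens)  # ['The', 'hounds', 'howl', 'fog', 'Dartmoor']
```

Two things stand out in the result. Removing the apostrophe turns "hound's" into "hounds". The capitalized "The" also survives, because the stop-word check is case-sensitive. If the stop-word list is all lowercase, lowercasing the text before tokenizing would be one way to catch such cases.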

### Top 20 Authors & their LDA topics

I referenced the following [source](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#11createthedictionaryandcorpusneededfortopicmodeling) to generate the id2word and corpus for topic modeling.
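
As a rough illustration of what these two objects hold, the sketch below builds them by hand for two tiny made-up documents. It only mirrors the shape of the data (an id-to-token mapping and a list of `(token_id, count)` pairs per document); it is not gensim's implementation, and gensim may assign ids in a different order.

```python
from collections import Counter

# Two tiny hypothetical documents, already tokenized
docs = [["whale", "sea", "whale"], ["sea", "ship"]]

# id2word: one integer id per unique token
# (ids are assigned in alphabetical order here, an assumption of this sketch)
vocab = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}
id2word = {i: tok for tok, i in token2id.items()}

# corpus: each document becomes a sorted list of (token_id, count) pairs
corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

print(id2word)  # {0: 'sea', 1: 'ship', 2: 'whale'}
print(corpus)   # [[(0, 1), (2, 2)], [(0, 1), (1, 1)]]
```

Word order within a document is discarded; only the counts remain, which is all the LDA model uses.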

In [10]:
result_df['tokenized'] = result_df['text'].map(regex_tokenizer)

In [11]:
# If the folder that stores each author's LDA visualizations doesn't exist, create it
if not os.path.isdir('./data/LDA_htmls'):
    print("LDA_htmls folder used to not exist in the data folder. Now it does. Congrats!!!")
    os.mkdir('./data/LDA_htmls')
else:
    print("Folder LDA_htmls already exists")

LDA_htmls folder used to not exist in the data folder. Now it does. Congrats!!!


### Interpret the pyLDAvis visualization

The Intertopic Distance Map draws each topic as a circle.

By default, pyLDAvis computes a distance matrix between the topic-term distributions (using Jensen-Shannon divergence) and projects it onto two axes, PC1 and PC2, with principal coordinate analysis (PCoA). The distance between two circles indicates how similar the two topics are: the closer the circles, the more alike their term distributions. The area of each circle is proportional to the share of the corpus's tokens assigned to that topic.
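
The distance matrix behind the map can be sketched in a few lines. The example below computes pairwise Jensen-Shannon divergence (the measure pyLDAvis uses by default) between three made-up topic-term distributions. The numbers are hypothetical, and the final step that projects the matrix onto two axes is omitted.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence between two distributions, bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-term distributions over a 4-word vocabulary
topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]

# Pairwise distance matrix: small values mean similar topics
dist = [[jensen_shannon(p, q) for q in topics] for p in topics]
for row in dist:
    print([round(d, 3) for d in row])
```

Topics 0 and 1 put most of their weight on the same words, so their divergence is small and their circles would sit close together. Topic 2 is far from both.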

The bar chart on the right lists the Top 30 Most Salient Terms when no topic is selected, and the Top 30 Most Relevant Terms once a topic is selected.
1. The red bars show the estimated frequency of each term within the selected topic. Terms are ranked by relevance, which the following [source](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) defines as:

$$r(w, t \mid \lambda) = \lambda \log P(w \mid t) + (1 - \lambda) \log \frac{P(w \mid t)}{P(w)}$$

With $\lambda = 1$, terms are ranked purely by their probability within the topic. Lowering $\lambda$ gives more weight to the ratio $P(w \mid t) / P(w)$, which favors terms that are frequent in the topic relative to their frequency in the corpus as a whole.

2. The blue bars show the overall frequency of each term in the corpus. With no topic selected, terms are ranked by saliency, which weights a term's frequency by how distinctive it is. Distinctiveness compares the likelihood that an observed word was generated by a topic, $P(t \mid w)$, with the likelihood that any randomly selected word was generated by that topic, $P(t)$ ([source](http://vis.stanford.edu/files/2012-Termite-AVI.pdf)):

$$\text{saliency}(w) = \text{frequency}(w) \cdot \sum_t P(t \mid w) \cdot \log \frac{P(t \mid w)}{P(t)}$$
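
Both measures can be worked through on a toy example. The sketch below uses a hypothetical two-topic model over four words. It follows the log form of relevance from the relevance paper, and uses $P(w)$ in place of the raw term frequency for saliency, which only changes the scale. All numbers and names are made up for illustration, and $\lambda = 0.6$ is an arbitrary choice.

```python
import math

vocab = ["ship", "sea", "love", "letter"]

# Hypothetical P(w | t) for two topics, and topic weights P(t)
p_w_given_t = [
    [0.50, 0.40, 0.05, 0.05],   # topic 0: seafaring
    [0.10, 0.10, 0.40, 0.40],   # topic 1: romance
]
p_t = [0.5, 0.5]

# Marginal term probability P(w) = sum_t P(t) * P(w | t)
p_w = [sum(p_t[t] * p_w_given_t[t][w] for t in range(len(p_t)))
       for w in range(len(vocab))]

def relevance(w, t, lam):
    """lam * log P(w|t) + (1 - lam) * log(P(w|t) / P(w))."""
    p = p_w_given_t[t][w]
    return lam * math.log(p) + (1 - lam) * math.log(p / p_w[w])

def saliency(w):
    """P(w) times the KL divergence between P(t|w) and P(t)."""
    total = 0.0
    for t in range(len(p_t)):
        p_t_given_w = p_t[t] * p_w_given_t[t][w] / p_w[w]   # Bayes' rule
        total += p_t_given_w * math.log(p_t_given_w / p_t[t])
    return p_w[w] * total

# Rank the vocabulary for topic 0 with lambda = 0.6
ranked = sorted(range(len(vocab)), key=lambda w: relevance(w, 0, 0.6), reverse=True)
print([vocab[w] for w in ranked])
print({vocab[w]: round(saliency(w), 3) for w in range(len(vocab))})
```

For topic 0, "ship" and "sea" rank first because they are both probable within the topic and more common there than in the corpus overall.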




In [15]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
    
import pyLDAvis.gensim  # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models

# Top 20 Authors
authors = result_df['author'].unique()

for author in authors:
    
    # Convert author name to lower and replace spaces
    file_name = author.lower().replace(' ','_')
    
    # Grab author's tokenized text data
    text_data = result_df[result_df['author'] == author]['tokenized'].tolist()

    # Create Dictionary:
    id2word = corpora.Dictionary(text_data)

    # Print the author's vocabulary size (number of unique tokens in the dictionary):
    print(f"{author}'s works have a total of {len(id2word)} words")

    # Convert document into bag of words format
    corpus = [id2word.doc2bow(text) for text in text_data]

    # LDA model
    LDA = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                          id2word = id2word,
                                          num_topics = 10, # Choosing 10 topics
                                          random_state=42,
                                          update_every = 2, # online iterative learning instead of batch
                                          chunksize = 100,
                                          passes = 5,
                                          alpha = 'auto',
                                          per_word_topics = True)

    # Get each of the 10 topics as a (topic_id, top-words string) pair
    lda_topics = LDA.print_topics()
    
    # Save pyLDAvis visualization as an html file:
    viz = pyLDAvis.gensim.prepare(LDA, corpus, dictionary = LDA.id2word)
    pyLDAvis.save_html(viz, f"./data/LDA_htmls/{file_name}_LDA.html")

William Dean Howells's works have a total of 46756 words
George Alfred Henty's works have a total of 45601 words
Edward Stratemeyer's works have a total of 25351 words
William Wymark Jacobs's works have a total of 17700 words
Henry Rider Haggard's works have a total of 37008 words
Sir Arthur Conan Doyle's works have a total of 43490 words
Henry James's works have a total of 41678 words
Bret Harte's works have a total of 39673 words
Nathaniel Hawthorne's works have a total of 33071 words
Jacob Abbott's works have a total of 25070 words
Edward Phillips Oppenheim's works have a total of 32134 words
Anthony Trollope's works have a total of 45842 words
Andrew Lang's works have a total of 61704 words
Robert Louis Stevenson's works have a total of 50476 words
Charles Dickens's works have a total of 46140 words
Charlotte Mary Yonge's works have a total of 53964 words
Jack London's works have a total of 43013 words
Herbert George Wells's works have a total of 50380 words
R M Ballantyne's works 