## LDA Text Mining

In natural language processing (NLP), Latent Dirichlet Allocation (LDA) is a popular tool for topic modeling which allows for discovery of patterns in similar topics and semantic structures in a corpus. 

We will now generate an LDA model for each of the 1312 books by the top 20 most prolific authors.

In [1]:
import gensim, spacy, nltk
import pyLDAvis.gensim
import os
import pandas as pd
import re
import gensim.corpora as corpora

  from collections import Mapping


#### Pre-processing & Preliminary Cleaning

The first step is to clean and process around 1300 English books written by the top 20 most prolific authors from the [University of Michigan Gutenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) and store it in a dataframe that contains the author's name, book title, and text.

In [2]:
%%time

# Processing tools from teammate Bo Cheng:

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\_", " ", text)
    text = re.sub(r"\[|\]", "", text)
    text = re.sub('''\'|\"''', "", text) # remove single and double quotation marks
    text = re.sub(r"\-", " ", text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip(' ')
    return text

data_path = r'./Gutenberg/txt'

records_dict_list = []

for txt_filename in os.listdir(data_path):
    if txt_filename.endswith(".txt"):
        try:
            author, title = txt_filename.split('.')[0].split('___')
            with open(os.path.join(data_path, txt_filename), "r", encoding="utf-8") as f:
                content = f.read()
            data = {'author': author, 'title': title, 'text': clean_text(content)}
            records_dict_list.append(data)
        except Exception:
            print('not found' + data_path + "/" + txt_filename)
        

df = pd.DataFrame.from_dict(records_dict_list)

CPU times: user 1min 52s, sys: 10.5 s, total: 2min 3s
Wall time: 2min 23s


In [3]:
df.head()

Unnamed: 0,author,title,text
0,Sir William Schwenck Gilbert,Bab Ballads and Savoy Songs,bab ballads and savoy songs by w. h. gilbert p...
1,William Dean Howells,The Editor's Relations With The Young Contributor,literature and life the young contributor by w...
2,George Alfred Henty,Bonnie Prince Charlie,bonnie prince charlie a tale of fontenoy and c...
3,George Bernard Shaw,Arms and the Man,arms and the man by george bernard shaw introd...
4,Hamlin Garland,The Spirit of Sweetwater,ladies home journal library of fiction the spi...


### Filter for Top 20 Most Prolific Authors based on Book Count

In [4]:
top_20_authors = df['author'].value_counts().nlargest(20).reset_index()
top_20_authors.columns = ['author', 'title_count']
display(top_20_authors.head())

df = df[df['author'].isin(list(top_20_authors['author'].unique()))].reset_index(drop = True)
df.head()

Unnamed: 0,author,title_count
0,William Wymark Jacobs,97
1,George Alfred Henty,89
2,R M Ballantyne,88
3,Nathaniel Hawthorne,86
4,William Dean Howells,84


Unnamed: 0,author,title,text
0,William Dean Howells,The Editor's Relations With The Young Contributor,literature and life the young contributor by w...
1,George Alfred Henty,Bonnie Prince Charlie,bonnie prince charlie a tale of fontenoy and c...
2,Edward Stratemeyer,Marching on Niagara,marching on niagara or the soldier boys of the...
3,William Wymark Jacobs,"Keeping Watch, Night Watches, Part 2",night watches by w.w. jacobs keeping watch hum...
4,George Alfred Henty,The Tiger of Mysore,the tiger of mysore: a story of the war with t...


`stop_words.py` contains one function: `custom_stopwords()` that returns a combined list of stopwords from gensim, nltk, spacy, and other sources. **NOTE:** we can continue to add to this stop words list.

In [5]:
# Import custom stopwords list
from stop_words import custom_stopwords

stop_words = custom_stopwords()

### IMPORTANT:

Download spaCy's model by running the following:

`python -m spacy download en_core_web_sm`

Alternative method via pip:

`pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz`

The following function `clean_and_tokenize` prepares the text for LDA modeling.

In [11]:
########################
#### IMPORTANT!!!! #####
########################
# Download small spacy model!
# !python -m spacy download en_core_web_sm

def clean_and_tokenize(text):
    '''
    Cleans and tokenizes text into corpus of documents that can be fed into an LDA model.
    
    Steps: 
    
    1. Book text is split into sentences based on whether a sentence ends in either an 
    exclamation mark, question mark, or period
    2. Gensim's simple_preprocess lowercases, tokenizes, and de-accents sentences
    3. Stop words are removed
    4. Keep tokens and lemmatize them if their part of speech is either a noun, adj, verb, or adverb
    
    Input: Any book or block of text. For example, df.loc[0,'text'] is the first book in the dataframe
    Returns: Corpus - A list of lists - Each sublist is a tokenized sentence
    '''
    
    raw_book_text = re.split(f"[{re.escape('!.?')}]", text)
    
    nlp = spacy.load('en_core_web_sm', disable = ['parser', 'ner'])

    # Part of speech tagging allowed:
    pos_tagging = ['NOUN', 'ADJ', 'VERB', 'ADV']

    book_text = []
    for sentence in raw_book_text:

        sentence = sentence.strip()
        # Lowercases, tokenizes, de-accents text:
        sentence = gensim.utils.simple_preprocess(sentence, min_len = 2, deacc=True) 

        # Remove stop words by using sets:
        sentence = list(set(sentence).difference(set(stop_words)))

        # Recreate sentence as a string for spaCy processing:
        sentence = nlp(" ".join(sentence))

        # If a token is a noun, adjective, verb, or adverb, keep the token and lemmatize it!
        sentence = [word.lemma_ for word in sentence if word.pos_ in pos_tagging]

        if len(sentence) > 1:
            book_text.append(sentence)
            
    return book_text

### Interpret the pyLDAvis visualization:

The Intertopic Distance Map projects topic clusters as circles. 

We are using a default PCA model to extract first 2 components (PC1 and PC2) on a topic-term distribution distance matrix. The distance between circles indicates how similar a topic is to one another. The area of the circles is proportional to the proportions of the topics across the total number of tokens in the corpus.

The Top 30 Most Salient Terms represent how relevant and salient terms are for a selected topic.
1. The red bars represent the most relevant terms for a selected topic. Relevance rank terms within topics for topic interpretation. The way relevance is defined is by the following [source](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf): 

$$r(w, t | \lambda) = \lambda \cdot P(w|t) + (1 - \lambda) \cdot \frac{P(w|t)}{P(w)}$$

Terms are ranked based on how relevant they are to a given topic and the proportion of their frequency in the topic compared to the rest of the corpus

2. The blue bars represent the most salient terms for a selected topic. Saliency is defined as how distinctive a term is for a selected topic. For example, we observe the likelihood that an observed word was generated by a topic and the likelihood that any  [source](http://vis.stanford.edu/files/2012-Termite-AVI.pdf)

$$saliency(w) = frequency(w) \cdot \sum_t P(t|w) \cdot \log \frac{P(t|w)}{P(t)}$$




Generating 1300+ LDA models:

I referenced the following [source](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#11createthedictionaryandcorpusneededfortopicmodeling) to generate the id2word and corpus for topic modeling.

In [12]:
# If the folder that stores the author's ldas doesn't exist, create it
if os.path.isdir('./data/LDA_htmls/ALL_LDA_htmls') == False:
    print("The folder ALL_LDA_htmls has been created: ./data/LDA_htmls/ALL_LDA_htmls")
    os.mkdir('./data/LDA_htmls/ALL_LDA_htmls')
else:
    print("Folder ALL_LDA_htmls already exists")

The folder ALL_LDA_htmls has been created: ./data/LDA_htmls/ALL_LDA_htmls


In [13]:
%%time
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Top 20 Most Prolific Authors
authors = df['author'].unique()

# Loop through top 20 authors
for author in authors:
    
    # Create an author mask to filter dataframe
    auth_mask = (df['author'] == author)
    
    # Grab books per selected author
    books_per_auth = df[auth_mask]['title']
    
    # Convert author name to lower and replace spaces
    auth_file_name = author.lower().replace(' ','_')
    
    # Create the author's folder in ALL_LDA_htmls:
    os.mkdir(f"./data/LDA_htmls/ALL_LDA_htmls/{auth_file_name}")
    
    print(f"Currently processing {author}'s works:")
    print("*********************************************")
    
    # Loop through each author's book and create an LDA html
    for book in books_per_auth:
        
        book_mask = (df['title'] == book)
        
        book_file_name = book.lower().replace(" ","_")
        book_file_name = re.sub('[^a-z0-9\\_]', '', book_file_name)
        
        # Grab author's specific book's tokenized text data
        raw_data = df[auth_mask & book_mask]['text'].values[0]
        
        # Process text data:
        text_data = clean_and_tokenize(raw_data)
        
        # Create Dictionary: (bag of words)
        id2word = corpora.Dictionary(text_data)
        
        # Print total words for author's book:
        print(f"{book} has a total of {len(id2word)} words")
        
        # Convert document into bag of words format
        corpus = [id2word.doc2bow(text) for text in text_data]
        
        # LDA model
        LDA = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                              id2word = id2word,
                                              num_topics = 5, # Choosing 5 topics
                                              random_state=42,
                                              update_every = 2, # online iterative learning instead of batch
                                              chunksize = 100,
                                              passes = 5,
                                              alpha = 'auto',
                                              per_word_topics = True)
        
        # Generate top 10 topics
        lda_topics = LDA.print_topics()
        
        # Save pyLDAvis visualization as an html file:
        viz = pyLDAvis.gensim.prepare(LDA, corpus, dictionary = LDA.id2word)
        pyLDAvis.save_html(viz, f"./data/LDA_htmls/ALL_LDA_htmls/{auth_file_name}/{book_file_name}.html")
        
    print(' ')

Currently processing William Dean Howells's works:
*********************************************
The Editor's Relations With The Young Contributor has a total of 618 words
A Chance Acquaintance has a total of 4170 words
The March Family Trilogy has a total of 10871 words
The Quality of Mercy has a total of 4855 words
Christmas Every Day and Other Stories has a total of 1153 words
A Little Swiss Sojourn has a total of 2172 words
Dr has a total of 3411 words
Between The Dark And The Daylight has a total of 3359 words
Poems has a total of 2756 words
The Albany Depot has a total of 643 words
The Landlord at Lion's Head has a total of 4579 words
Last Days in a Dutch Hotel has a total of 789 words
A Pair of Patient Lovers has a total of 4187 words
Suburban Sketches has a total of 5372 words
Their Wedding Journey has a total of 5414 words
My Mark Twain has a total of 2774 words
Fennel and Rue has a total of 2493 words
A Foregone Conclusion has a total of 4014 words
The Leatherwood God has a t

In [32]:
# Rearrange works.csv, placeofbirth_df.csv, result_df.csv, and works.csv in a folder called author-data

# If the folder that stores the author's data doesn't exist, create it
if os.path.isdir('./data/author-data') == False:
    print("The folder author-data has been created: ./data/author-data/")
    os.mkdir('./data/author-data')
else:
    print("Folder author-data already exists")

The folder author-data has been created: ./data/author-data/


In [33]:
import shutil

In [34]:
# If file ends with .csv , move it to author-data
for csv in os.listdir('./data'):
    if csv == 'placeofbirth_df.csv':
        shutil.copy2(f'./data/{csv}', f'./data/author-data/{csv}')
    elif csv == 'works.csv':
        pass
    elif csv.endswith('.csv'):
        os.rename(f'./data/{csv}', f'./data/author-data/{csv}')
        print(f"{csv} moved to author-data")

author_list.csv moved to author-data
result_df.csv moved to author-data
