<a href="https://colab.research.google.com/github/cyrus723/my-first-binder/blob/main/nlp_lda_model_gensim1_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Implementing LDA in Python</center>
Source:
https://github.com/wjbmattingly/topic_modeling_textbook/blob/main/03_03_lda_model_demo.ipynb

<center>Dr. W.J.B. Mattingly</center>

<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>

<center>February 2021</center>

## Key Concepts in this Notebook

## Introduction

## Importing the Required Libraries

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
pip install pyLDAvis



In [3]:
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#1introduction
import numpy as np
import json
import glob

#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#spacy
import spacy
from nltk.corpus import stopwords

#vis
import pyLDAvis
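try:
    import pyLDAvis.gensim_models as gensimvis  # module name in pyLDAvis 3.x
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older pyLDAvis releases
pyLDAvis.gensim = gensimvis  # keep the older `pyLDAvis.gensim` name usable in either case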

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Preparing the Data

In [4]:
#from google.colab import drive
#drive.mount('/content/drive')

#import os
#os.chdir('/content/drive/My Drive/Colab Notebooks/data/')
#os.getcwd()
#os.listdir()

In [5]:
#def load_data(file):
#    with open (file, "r", encoding="utf-8") as f:
#        data = json.load(f)
#    return (data)

#def write_data(file, data):
#    with open (file, "w", encoding="utf-8") as f:
#        json.dump(data, f, indent=4)

In [6]:
import requests

def load_data(url):
    """Download a JSON file from `url` and return the parsed object."""
    response = requests.get(url, timeout=30)
    # Raise requests.HTTPError if the server returned an error status
    response.raise_for_status()
    return response.json()

def write_data(file, data):
    """Write `data` to `file` as indented JSON."""
    with open(file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

In [7]:
# Use a new name so the imported nltk `stopwords` module is not overwritten
stop_words = stopwords.words("english")

In [8]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', ...]
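
A stop-word list like the one printed above is typically applied by dropping every token that appears in it. The following is a minimal sketch of that idea, not a step taken from this notebook: `example_stop_words` is a small hand-picked subset of the NLTK list, and `remove_stop_words` is a hypothetical helper introduced only for illustration.

```python
# A hand-picked subset of the NLTK English stop words (assumption: chosen for this example only).
example_stop_words = {"i", "was", "in", "a", "and", "my"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "I was born in a small town".split()
filtered = remove_stop_words(tokens, example_stop_words)
print(filtered)
```

Using a set for the stop words keeps each membership check cheap, which matters once the corpus grows beyond a few documents.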

In [9]:
data = load_data("https://raw.githubusercontent.com/wjbmattingly/topic_modeling_textbook/main/data/ushmm_dn.json")["texts"]

print (data[0][0:90])

 My name David Kochalski. I was born in a small town called , and I was born May 5, 1928. 


In [10]:
def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Load the small English pipeline; the parser and NER are not needed for lemmas
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        # Keep only the lemmas of tokens whose part of speech is allowed
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out


lemmatized_texts = lemmatization(data)
print(lemmatized_texts[0][0:90])

name bear small town call bear very hard work child father mother small mill flour buckwhe
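
The filtering step inside `lemmatization` can be illustrated without loading spaCy. In the sketch below, the `(lemma, POS)` pairs are written by hand as an assumption about how the opening words of the first testimony might be tagged; they are not actual model output, and `keep_allowed` is a hypothetical helper.

```python
ALLOWED_POS = {"NOUN", "ADJ", "VERB", "ADV"}

# Hand-written (lemma, part-of-speech) pairs; an assumption for illustration, not spaCy output.
tagged_tokens = [
    ("my", "PRON"), ("name", "NOUN"), ("David", "PROPN"),
    ("I", "PRON"), ("be", "AUX"), ("bear", "VERB"),
    ("in", "ADP"), ("a", "DET"), ("small", "ADJ"), ("town", "NOUN"),
]

def keep_allowed(pairs, allowed=ALLOWED_POS):
    """Join the lemmas whose part of speech is in the allowed set."""
    return " ".join(lemma for lemma, pos in pairs if pos in allowed)

result = keep_allowed(tagged_tokens)
print(result)
```

Pronouns, proper nouns, auxiliaries, and determiners fall outside the allowed set, which is why names and function words are absent from the lemmatized text.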


In [11]:
def gen_words(texts):
    # Tokenize and lowercase each text; deacc=True also strips accent marks
    return [simple_preprocess(text, deacc=True) for text in texts]

data_words = gen_words(lemmatized_texts)

print(data_words[0][0:20])

['name', 'bear', 'small', 'town', 'call', 'bear', 'very', 'hard', 'work', 'child', 'father', 'mother', 'small', 'mill', 'flour', 'buckwheat', 'prosperous', 'comfortable', 'go', 'school']
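
To make the tokenization step more concrete, here is a rough standard-library approximation of what `simple_preprocess(text, deacc=True)` does: lowercase the text, strip accent marks, split into alphabetic tokens, and drop very short or very long tokens. It is a sketch under those assumptions, not gensim's implementation, and `rough_preprocess` and `strip_accents` are hypothetical names.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, de-accent, keep non-digit word runs of sensible length."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d]+", text)  # runs of word characters that are not digits
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

example_tokens = rough_preprocess("Café owners, born in 1928, worked very hard.")
print(example_tokens)
```

Numbers and punctuation disappear at this stage, so dates such as 1928 do not reach the dictionary.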


In [12]:
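# Map each unique token to an integer id
id2word = corpora.Dictionary(data_words)

# Convert each document to a bag of words: a list of (token_id, count) tuples
corpus = [id2word.doc2bow(text) for text in data_words]

print(corpus[0][0:20])

# Look up the token behind id 0
word = id2word[0]
print(word)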

[(0, 2), (1, 11), (2, 1), (3, 2), (4, 1), (5, 2), (6, 1), (7, 2), (8, 3), (9, 1), (10, 12), (11, 1), (12, 8), (13, 1), (14, 2), (15, 1), (16, 3), (17, 2), (18, 1), (19, 2)]
able
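
The `(token_id, count)` pairs printed above are a bag-of-words encoding. The sketch below reproduces the idea with the standard library on a toy corpus. It assigns ids in order of first appearance, which is an assumption of this example and may differ from the ids gensim's `Dictionary` assigns; `build_vocab` and `to_bow` are hypothetical helpers.

```python
from collections import Counter

def build_vocab(docs):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["small", "town", "small", "mill"], ["town", "school"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only which tokens occur in a document, and how often, is passed on to the model.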


In [13]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=30,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha="auto")


The cell above creates an LDA (Latent Dirichlet Allocation) model with the `gensim` library. LDA is a widely used topic-modeling algorithm that discovers abstract topics within a collection of documents. The call and its parameters break down as follows:

1. **lda_model**:
   - This is the variable to which the model is assigned after it's created.

2. **gensim.models.ldamodel.LdaModel**:
   - This specifies that we are using the `LdaModel` class from the `ldamodel` module in the `gensim.models` package.

3. **Parameters**:
   - **corpus**: The documents in bag-of-words form: a list with one entry per document, where each entry is a list of `(token_id, count)` tuples, as produced by `doc2bow`.
   - **id2word**: The dictionary that maps each token ID back to its token (word), so that topics can be reported as words rather than numbers.
   - **num_topics**: The number of topics to extract from the corpus. Here it is set to 30, so the model fits 30 topics.
   - **random_state**: A seed for reproducibility. Using the same seed, parameters, and corpus should yield the same topics. Here it is set to 100.
   - **update_every**: How often the model is updated, measured in chunks. With 1, the model is updated after every chunk of documents (online learning); with 0, gensim switches to batch learning.
   - **chunksize**: The number of documents in each training chunk. Here it is set to 100, so with `update_every=1` the model updates after every 100 documents.
   - **passes**: The number of passes the model makes over the entire corpus. More passes may give a better model at the cost of more computation time. Here it is set to 10.
   - **alpha**: The prior on each document's topic distribution, which influences how many topics a document tends to mix. With `"auto"`, the model learns an asymmetric prior from the corpus instead of using a fixed one.

Overall, this cell trains an LDA model that looks for 30 topics in the corpus, making 10 passes over the data, updating after every chunk of 100 documents, and learning the `alpha` prior from the corpus. The fitted model can then be used to explore the themes in the corpus, and it can support downstream tasks such as document classification or summarization.
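
To see how `chunksize`, `update_every`, and `passes` interact, the arithmetic can be sketched as below. This is a simplified single-process estimate rather than gensim's code, and the corpus size of 1,000 documents is a hypothetical figure, since the number of testimonies is not stated here. `count_updates` is a name introduced for this example.

```python
import math

def count_updates(num_docs, chunksize, passes, update_every):
    """Estimate chunks per pass and total model updates (simplified, single process)."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    if update_every == 0:
        updates_per_pass = 1  # batch learning: one update at the end of each pass
    else:
        updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return chunks_per_pass, updates_per_pass * passes

# Hypothetical corpus size; the other values match the LdaModel call above.
chunks, total_updates = count_updates(num_docs=1000, chunksize=100, passes=10, update_every=1)
print(chunks, total_updates)
```

Under these assumptions, a larger `chunksize` or `update_every` means fewer, larger updates per pass, while more `passes` multiplies the total.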

## Visualizing the Data

In [14]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis

