# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) estimates two sets of distributions: the distribution of words within each topic and the distribution of topics within each document. Modelling starts by initializing the parameters: the number of documents, the number of words in each document, the number of topics, the number of iterations, and the LDA coefficients (the Dirichlet priors, passed to the model in this notebook as `alpha` and `eta`). Each word is then assigned an initial topic, drawn semi-randomly according to the Dirichlet distribution. The final stage is iteration, in which the assignments are updated repeatedly to refine the topic distribution of each document and the word distribution of each topic.
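The generative view behind these stages can be sketched in a few lines. This is a minimal illustration using only the standard library, not the inference procedure gensim runs; the helper names (`sample_dirichlet`, `generate_document`) and the toy sizes and prior values are assumptions chosen for readability.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(num_topics, vocab_size, doc_length, alpha, eta, seed=0):
    """Generate one toy document as a list of (topic_id, word_id) pairs."""
    rng = random.Random(seed)
    # One word distribution per topic, each drawn from Dirichlet(eta).
    topic_word = [
        sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)
    ]
    # One topic mixture for this document, drawn from Dirichlet(alpha).
    doc_topic = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # Pick a topic for this position, then a word from that topic.
        z = rng.choices(range(num_topics), weights=doc_topic)[0]
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
        words.append((z, w))
    return doc_topic, topic_word, words


# Toy values for illustration only (the notebook itself uses priors of 0.01).
doc_topic, topic_word, words = generate_document(
    num_topics=4, vocab_size=6, doc_length=8, alpha=0.5, eta=0.5
)
```

Training an LDA model amounts to running this process in reverse: given only the observed words, it estimates `doc_topic` and `topic_word`.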

## Dependencies

In [1]:
# %pip install google-cloud-bigquery
# %pip install PySastrawi
# %pip install nltk
# %pip install pyLDAvis

In [2]:
from string import punctuation

from google.oauth2 import service_account
from google.cloud import bigquery

import numpy as np
import pandas as pd

from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

from util.text_preprocessing import (
    StopWordRemoverFactory,
    Stemming,
    Formalization,
    TextTokenizer,
    LDATextPreprocess,
)



## Load Data

In [3]:
key_path = '../airflow/credentials/future-data-track-1-sapporo.json'
credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

bigquery_client = bigquery.Client(
    project='future-data-track-1',
    credentials=credentials
)

In [4]:
query = """
SELECT
  *
FROM
  `future-data-track-1.sapporo_mart.topic_modelling`;
"""

query_job = bigquery_client.query(query)
df = query_job.to_dataframe()

### Preprocess

This process includes case folding, removing special characters and repeated whitespace, removing stop words, and stemming.
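The `util.text_preprocessing` helpers used in the next cells are not shown in this notebook, so the following is only a rough stand-in for the steps listed above. It assumes Indonesian review text (as the PySastrawi dependency suggests), uses a hypothetical four-word stop word list, and leaves out stemming and formalization entirely.

```python
import re

# Hypothetical stop word list; the real one comes from StopWordRemoverFactory.
STOP_WORDS = {"yang", "dan", "di", "ini"}


def simple_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip non-letters, collapse whitespace, drop stop words."""
    text = text.lower()                        # case folding
    text = re.sub(r"[^a-z\s]", " ", text)      # remove special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return [tok for tok in text.split() if tok not in stop_words]


tokens = simple_preprocess("Barang  ini BAGUS!!  dan pengiriman cepat :)")
print(tokens)  # ['barang', 'bagus', 'pengiriman', 'cepat']
```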

In [5]:
_sw_remover = StopWordRemoverFactory().create_stop_word_remover()
_stemmer = Stemming()
_tokenizer = TextTokenizer()
_formalizer = Formalization()

sw_remover = _sw_remover.remove
stemmer = _stemmer.stem
tokenizer = _tokenizer.tokenize
formalizer = _formalizer.convert_all

In [6]:
preprocess = LDATextPreprocess(sw_remover, stemmer, tokenizer, formalizer)

In [7]:
# Assign the result back; replace() returns a new Series rather than modifying df.
df['review'] = df['review'].replace('', float("NaN"))
df.dropna(subset=["review"], inplace=True)

In [8]:
texts = preprocess.preprocess(df['review'])

## Modelling

In [9]:
dictionary = corpora.Dictionary(texts)

In [10]:
corpus = [dictionary.doc2bow(doc) for doc in texts]
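To make the shape of `corpus` concrete, here is a plain-Python imitation of what the dictionary and bag-of-words steps produce: each document becomes a list of `(token_id, count)` pairs. The toy documents and the `to_bow` helper are illustrative assumptions, and gensim may assign ids in a different order.

```python
from collections import Counter

toy_texts = [
    ["barang", "bagus", "bagus"],
    ["pengiriman", "cepat", "barang"],
]

# Give each distinct token an integer id in order of first appearance.
token2id = {}
for doc in toy_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in toy_texts]
print(toy_corpus)  # [[(0, 1), (1, 2)], [(0, 1), (2, 1), (3, 1)]]
```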

In [11]:
vocab_size = len(dictionary.keys())

In [12]:
model = LdaModel(
    corpus=corpus, num_topics=4, id2word=dictionary,
    passes=20, iterations=100,
    alpha=[0.01] * 4, eta=[0.01] * vocab_size,
)

In [20]:
model.log_perplexity(corpus)

-32.76274236927371
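Despite its name, `log_perplexity` returns a per-word likelihood bound rather than a perplexity. To the best of my understanding, gensim derives the perplexity estimate it logs as 2 raised to the negative of that bound. The conversion can be sketched as follows, with the bound copied from the output above:

```python
import math

# Per-word bound reported by model.log_perplexity(corpus) in this notebook.
per_word_bound = -32.76274236927371

# Assumed conversion: perplexity = 2 ** (-bound); lower perplexity is better.
perplexity = math.pow(2.0, -per_word_bound)
print(f"{perplexity:.3e}")
```

Since it is an estimate based on a variational bound, the figure is most useful for comparing models trained on the same corpus, not as an absolute measure.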

In [21]:
coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model.get_coherence()

print("Model evaluation with {k} topics".format(k=4))
print(coherence_lda)

Model evaluation with 4 topics
0.34760066821366586


In [22]:
model.save('Topic Modelling/lda_4_dion-ricky')