# Latent Dirichelt Allocation Project
Chulwoo Kim 

[Background]
Topic models automatically infer the topics discussed in a collection of documents. These topics can be used to summarize and organize documents, or used for featurization and dimensionality reduction in later stages of the data analysis.

LDA (Latent Dirichlet Allocation) is one of the most successful topic model libraries. Use LDA in this exercise to derive ‘topics’ from the dataset provided, the code should be written in Python.

The dataset could be obtained from Yelp’s webportal by accessing this url:
https://www.yelp.com/dataset_challenge

Your code should include the following steps:
- Prepping the data
  * Tokenizing
  * Stopping
  * Stemming
- Construct a Document-term Matrix
- Apply the LDA Model
- Examine the results

Describe how accurate your model is and what it needs to be done to make it better.

## 1. Data Preparation

The size of this data downloaded from Yelp.com is about 3.4GB. Therefore, I decide to use the database system because it is impossible to upload it in memory. The total number of lines for the data loaded into the mongoDB is 4169501 lines. 

In [None]:
import json
from pymongo import MongoClient

dataset_file = 'dataset/yelp_academic_dataset_review.json'
reviews_collection = MongoClient("mongodb://localhost:27017/")["Yelp_Reviews"]["Reviews"]

with open(dataset_file, encoding='utf8') as dataset:
    next(dataset)
    for line in dataset:
        try:
            data = json.loads(line)
        except ValueError:
            print('Value Error')
        if data["type"] == "review":
            reviews_collection.insert({
                "reviewId": data["review_id"],
                "business": data["business_id"],
                "text": data["text"]
            })

## 2. Data Preprocessing

In this step, I performed to remove the punctuation as well as tokenizing, stopping, and stemming to clear dataset. Also, I use only 2000 reviews because it takes a very long time to process all the data.

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from pymongo import MongoClient

NUMBER_OF_TEST_REVIEWS = 2000

reviews_collection = MongoClient("mongodb://localhost:27017/")["Yelp_Reviews"]["Reviews"]
reviews_cursor = reviews_collection.find()
reviewsCount = reviews_cursor.count()
reviews_cursor.batch_size(1000)

stopword = set(stopwords.words('english'))
punctuation = set(string.punctuation)
ps = PorterStemmer()

words = []
for i, review in enumerate(reviews_cursor):    
    sentences = nltk.sent_tokenize(review["text"].lower())
    if i == NUMBER_OF_TEST_REVIEWS:
        break;
    for sentence in sentences:
        # Remove punctuation
        sentence = ''.join(ch for ch in sentence if ch not in punctuation)
        # Remove number
        sentence = ''.join(ch for ch in sentence if ch not in "0123456789")
        tokens = nltk.word_tokenize(sentence)    
        # Stemming & Removing stop words
        text = [ps.stem(word) for word in tokens if word not in stopword]
    words.append(text)

## 3. Construction of a document-term matrix

If the data preprocessing is finished, it only left the words as a string type. We have to number it to apply this to the LDA model. I use two function, Dictionary and doc2bow. The dictionary function transfers each words to a unique number. After dictionary function perform, I execute the doc2bow function which converts dictionary into a bag-of-words.

In [None]:
import gensim
from gensim import corpora
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

dictionary = corpora.Dictionary(doc for doc in words)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in words]

## 4. Applying the LDA model

Latent Dirichlet allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set.

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50)

## 5. Examining the results

Using the print_topics function, we can get a specific output. This result shows us topics and their probability. You can guess the sentence;I would definitely come back here, by looking at the topic of the first review, but it is hard to see it as a meaningful topic. Next topics such as best drink make sense.

In [3]:
print(ldamodel.print_topics(num_topics=10, num_words=5))

[(0, '0.076*"back" + 0.031*"go" + 0.030*"come" + 0.020*"definit" + 0.015*"would"'), (1, '0.015*"place" + 0.013*"one" + 0.010*"best" + 0.009*"drink" + 0.009*"go"'), (2, '0.021*"good" + 0.015*"servic" + 0.012*"time" + 0.010*"im" + 0.009*"go"'), (3, '0.015*"time" + 0.013*"love" + 0.012*"like" + 0.010*"friendli" + 0.009*"staff"'), (4, '0.021*"place" + 0.015*"good" + 0.013*"get" + 0.010*"lot" + 0.009*"know"'), (5, '0.013*"food" + 0.011*"die" + 0.011*"money" + 0.010*"und" + 0.008*"happi"'), (6, '0.016*"even" + 0.014*"place" + 0.013*"price" + 0.011*"get" + 0.009*"would"'), (7, '0.060*"recommend" + 0.023*"highli" + 0.021*"would" + 0.012*"go" + 0.012*"realli"'), (8, '0.023*"food" + 0.016*"good" + 0.012*"great" + 0.012*"enjoy" + 0.011*"place"'), (9, '0.021*"go" + 0.020*"great" + 0.019*"definit" + 0.014*"would" + 0.014*"place"')]


## 6. Conclusion
If I set up the properties like 10 topics and 5 words, the topics are mostly significant but it is not perfect. I would say this model fairly make result, even if I cannot calculate specific accuracy value because this is not a classification problem.

Also, we can think about the way to make better result.
Here are some way to improve performence:
1. It has to learn the LDA model having more data. I just use 2,000 reviews but the more data, the better results. Instead, the learning time will increase tremendously.
2. Well-tuned properties of the lda model, such as passes and num_topics, can improve performance.
