# Machine Learning with Python
---
The Process:
1. [Import dependencies and configure settings](#Step-1:-Import-dependencies-and-configure-settings)
2. [Load pre-processed data](#Step-2:-Load-pre-processed-data)
3. [Create instance of LDA model](#Step-3:-Create-instance-of-LDA-model)
4. [Visualize model results](#Step-4:-Visualize-model-results)
5. [Store the model instance](#Step-5:-Store-the-model-instance)

![topic-modeling](../images/topic_model_intro.png "Topic Modeling Intro")

---

## Step 1: Import dependencies and configure settings
### Import packages, modules, objects, or functions into the current notebook

***TIP: Avoid 'from PACKAGE import \*' syntax as it clutters your namespace with objects you may not be aware of***
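
To see how much a wildcard import adds, the following minimal sketch runs both import styles in throwaway namespaces and compares what each one binds. It uses the standard-library `math` module purely as a stand-in for any package; the helper and variable names are illustrative and not part of the notebook.

```python
# Compare what a wildcard import and a targeted import add to a namespace.
# 'math' is only a convenient standard-library stand-in for any package.
wildcard_namespace = {}
exec("from math import *", wildcard_namespace)

targeted_namespace = {}
exec("from math import sqrt", targeted_namespace)

def public_names(namespace):
    """Return the names bound in a namespace, ignoring dunder entries."""
    return sorted(name for name in namespace if not name.startswith("__"))

wildcard_names = public_names(wildcard_namespace)
targeted_names = public_names(targeted_namespace)

print(f"wildcard import bound {len(wildcard_names)} names, e.g. {wildcard_names[:5]}")
print(f"targeted import bound {len(targeted_names)} name(s): {targeted_names}")
```

The targeted import binds exactly one name, while the wildcard import binds every public name the module exposes, any of which could silently shadow a variable you defined earlier.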

In [None]:
# BEST PRACTICE: import standard-library packages first
# from PACKAGE import OBJECT lets us bring only what we need into our namespace
from pprint import pprint
from warnings import filterwarnings

# BEST PRACTICE: import third-party packages second
# from PACKAGE import OBJECT as ALIAS renames the object in our namespace
from gensim.models.ldamodel import LdaModel as gensim_LdaModel
from joblib import load as joblib_load
from pyLDAvis.gensim_models import prepare as pyLDAvis_prepare_gensim
from seaborn import lineplot as seaborn_lineplot
# aliases can also be used when importing an individual module from the package
import matplotlib.pyplot as plt
# aliases can be added when fully importing packages
import pandas as pd

# BEST PRACTICE: import custom packages last
# for example: import custom_module as cm

### Configure settings by using functions or methods (always check the documentation)
Third-party packages used in this notebook:
- [Gensim](https://radimrehurek.com/gensim/)
- [Matplotlib](https://matplotlib.org/)
- [Pandas](https://pandas.pydata.org/pandas-docs/version/1.2.4/user_guide/index.html)
- [pyLDAvis](https://github.com/bmabey/pyLDAvis)
- [Seaborn](https://seaborn.pydata.org/)

In [None]:
# filter out all warnings so they do not clutter cell outputs
filterwarnings(
    'ignore'
)
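
Calling `filterwarnings('ignore')` silences every warning for the rest of the session, which can also hide deprecation notices worth reading. As an alternative sketch using only the standard-library `warnings` module, suppression can be limited to a single block with `catch_warnings`; `noisy_step` is a hypothetical function standing in for any call that emits a warning.

```python
import warnings

def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"

# Inside the 'with' block warnings are ignored; filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy_step()

# Outside that block, record warnings to show they are emitted again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_step()

print(quiet_result, loud_result, len(caught))
```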

In [None]:
# set custom pandas package options by iterating through key-value pairs in a dictionary
for option, value in { # dictionaries are denoted by curly brackets {} or the dict() function
    'display.max_columns': 50,
    'display.max_colwidth': None,
    'display.max_info_columns': 50,
    'display.max_rows': 20,
    'display.precision': 4
}.items(): # the .items() method of a dictionary lets us iterate through key, value pairs
    # the loop variables 'option' and 'value' hold the current key and value
    # from the dictionary above, so we can pass them straight to a function
    pd.set_option(
        option, # this will be 'display.max_columns', etc.
        value   # this will be 50, None, etc.
    )
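
Options set with `pd.set_option` stay in effect until they are changed again. As a small sketch, `pd.get_option` can confirm that a setting took effect, and `pd.option_context` can override it temporarily for a single block; the precision values used here are arbitrary examples.

```python
import pandas as pd

# Apply one display setting and confirm it took effect.
pd.set_option('display.precision', 4)
precision_after_set = pd.get_option('display.precision')

# option_context overrides an option only inside the 'with' block.
with pd.option_context('display.precision', 2):
    precision_inside = pd.get_option('display.precision')

# The earlier value is restored automatically once the block exits.
precision_after_block = pd.get_option('display.precision')

print(precision_after_set, precision_inside, precision_after_block)
```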

In [None]:
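# set pyLDAvis to render plots within IPython notebook cells
# only the 'prepare' function was imported earlier, so import this function by name too
from pyLDAvis import enable_notebook as pyLDAvis_enable_notebook
pyLDAvis_enable_notebook()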

[Top](#Machine-Learning-with-Python)

---

## Step 2: Load pre-processed data
### Use [pickling](https://realpython.com/python-pickle-module/) to store Python objects and reuse them across notebooks

In [None]:
# import the corpus and id2word objects by 'unpickling' them
corpus = joblib_load(
    '../data/cocoon_reviews_corpus.pkl'
)
id2word = joblib_load(
    '../data/cocoon_reviews_id2word.pkl'
)
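
The notebook does not show how `corpus` and `id2word` were built, so the sketch below is an assumption about their general shape: a bag-of-words corpus is a list of documents, each a list of `(token_id, count)` pairs, and `id2word` maps each id back to its token. It builds toy versions with plain Python (not gensim's own dictionary class) and round-trips them through the standard-library `pickle` module in place of joblib; the documents and the temporary file name are made up for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter

# Toy tokenized documents (hypothetical, not the real review data).
documents = [
    ["great", "product", "great", "price"],
    ["product", "arrived", "late"],
]

# Assign an integer id to each distinct token, in order of first appearance.
word2id = {}
for document in documents:
    for token in document:
        word2id.setdefault(token, len(word2id))
id2word_toy = {token_id: token for token, token_id in word2id.items()}

# Bag-of-words: one list of (token_id, count) pairs per document.
corpus_toy = [
    sorted(Counter(word2id[token] for token in document).items())
    for document in documents
]

# Pickle the corpus to disk, then load it back unchanged.
with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, "corpus_toy.pkl")
    with open(path, "wb") as file:
        pickle.dump(corpus_toy, file)
    with open(path, "rb") as file:
        corpus_restored = pickle.load(file)

print(corpus_toy)
print(corpus_restored == corpus_toy)
```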

[Top](#Machine-Learning-with-Python)

---

## Step 3: Create instance of LDA model

In [None]:
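# train an LDA topic model on the bag-of-words corpus
optimal_model = gensim_LdaModel(
    corpus=corpus,
    id2word=id2word,       # maps token ids back to words
    num_topics=4,          # number of topics to extract
    random_state=100,      # fixed seed for reproducible results
    update_every=1,        # update the model after every chunk
    chunksize=100,         # number of documents per training chunk
    passes=10,             # number of passes over the whole corpus
    alpha='auto',          # learn the document-topic prior from the data
    per_word_topics=True   # also compute the most likely topics for each word
)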
model_topics = optimal_model.show_topics(formatted=False)

In [None]:
pprint(optimal_model.print_topics())

[Top](#Machine-Learning-with-Python)

---

## Step 4: Visualize model results

In [None]:
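top_n_words = 10
# formatted=False returns (topic_id, [(word, weight), ...]) tuples instead of strings
topics = optimal_model.show_topics(
    num_topics=4, num_words=top_n_words, formatted=False
)

# plot one line per topic: word weight against word rank within the topic
for _, word_weights in topics:
    weights = [weight for _, weight in word_weights]
    # newer seaborn releases require x and y to be passed as keyword arguments
    seaborn_lineplot(x=list(range(top_n_words)), y=weights, marker='*')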

plt.xlabel('Word rank')
plt.ylabel('Weights')
plt.title('Weights of Top {} Words in each Topic'.format(top_n_words))
plt.show()
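
With `formatted=False`, `show_topics` returns a list of `(topic_id, [(word, weight), ...])` tuples, which is what the plotting loop unpacks. The following sketch walks the same kind of loop over hand-written data so the shapes are visible without a trained model; the words and weights are invented for illustration.

```python
# Hand-written stand-in for optimal_model.show_topics(formatted=False);
# the words and weights are invented, not real model output.
toy_topics = [
    (0, [("battery", 0.05), ("charge", 0.03), ("cable", 0.02)]),
    (1, [("price", 0.06), ("value", 0.04), ("cheap", 0.01)]),
]

weights_by_topic = {}
for topic_id, word_weights in toy_topics:
    # Keep only the weights, in the order the words are ranked.
    weights_by_topic[topic_id] = [weight for _, weight in word_weights]

for topic_id, weights in weights_by_topic.items():
    print(f"topic {topic_id}: {weights}")
```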

In [None]:
LDAvis_prepared = pyLDAvis_prepare_gensim(optimal_model, corpus, id2word)
LDAvis_prepared