# Machine Learning with Python
---
The Process:
1. [Import dependencies and configure settings](#Step-1:-Import-dependencies-and-configure-settings)
2. [Load pre-processed data](#Step-2:-Load-pre-processed-data)
3. [Create instance of LDA model](#Step-3:-Create-instance-of-LDA-model)
4. [Visualize the model results](#Step-4:-Visualize-the-model-results)

![topic-modeling](../images/topic_model_intro.png "Topic Modeling Intro")

---

## Step 1: Import dependencies and configure settings
### Import packages, modules, objects, or functions into the current notebook

***TIP: Avoid 'from PACKAGE import \*' syntax as it clutters your namespace with objects you may not be aware of***
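
The clutter is easy to measure. The sketch below uses the standard-library `math` module purely as a stand-in for any package, and compares how many names each import style adds to an empty namespace:

```python
# Compare how many names each import style adds to a fresh namespace.
# 'math' is only a stand-in here; any package behaves the same way.
star_namespace = {}
exec("from math import *", star_namespace)

explicit_namespace = {}
exec("from math import sqrt", explicit_namespace)

# exec() adds a '__builtins__' entry to each dict, so drop it before counting
star_names = sorted(name for name in star_namespace if name != "__builtins__")
explicit_names = sorted(name for name in explicit_namespace if name != "__builtins__")

print(len(star_names), "names from the star import")
print(explicit_names, "from the explicit import")
```

The explicit import adds exactly one name, while the star import adds every public name the module defines.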

In [None]:
# BEST PRACTICE: import built-in packages first
# from PACKAGE import OBJECT lets us bring only what we need into our namespace
from pprint import pprint
from warnings import filterwarnings

# BEST PRACTICE: import third-party packages second
# from PACKAGE import OBJECT as ALIAS, renames the object in our namespace
from gensim.models.ldamodel import LdaModel as gensim_LdaModel
from joblib import load as joblib_load
from pyLDAvis.gensim_models import prepare as pyLDAvis_prepare_gensim
from seaborn import lineplot as seaborn_lineplot
# aliases can also be used when importing an individual module from the package
import matplotlib.pyplot as plt
# aliases can be added when fully importing packages
import pandas as pd

# BEST PRACTICE: import custom packages last
# for example: import custom_module as cm

### Configure settings by using functions or methods (always check the documentation)
Third-party packages used in this notebook:
- [Gensim](https://radimrehurek.com/gensim/)
- [Matplotlib](https://matplotlib.org/)
- [Pandas](https://pandas.pydata.org/pandas-docs/version/1.2.4/user_guide/index.html)
- [pyLDAvis](https://github.com/bmabey/pyLDAvis)
- [Seaborn](https://seaborn.pydata.org/)

In [None]:
# configure Python to filter out warnings so they don't clutter cell outputs
filterwarnings(
    'ignore'
)
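
Ignoring every warning for the whole session is a blunt setting. As a possible alternative, the standard-library `warnings.catch_warnings` context manager limits a filter to one block of code. The `noisy` function below is a hypothetical stand-in for any call that emits a warning:

```python
import warnings

def noisy():
    # hypothetical function that warns before returning a value
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42

# record warnings so we can count them instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
seen = len(caught)

# the same call with an 'ignore' filter that only applies inside this block
with warnings.catch_warnings(record=True) as caught_ignored:
    warnings.simplefilter("ignore")
    result = noisy()
suppressed = len(caught_ignored)

print(seen, suppressed, result)
```

Once the `with` block ends, the previous warning filters are restored.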

In [None]:
# set custom pandas package options by iterating through key-value pairs in a dictionary
for option, value in { # dictionaries are denoted by curly brackets {} or the dict() function
    'display.max_columns': 50,
    'display.max_colwidth': None,
    'display.max_info_columns': 50,
    'display.max_rows': 20,
    'display.precision': 4
}.items(): # the .items() method of a dictionary lets us iterate through key, value pairs
    # we can call a function on the variable we set for the objects we're iterating over
    # in this case those variables are 'option', and 'value' and they represent
    # the key, value pairs from the dictionary above
    pd.set_option(
        option, # this will be 'display.max_columns' etc..
        value   # this will be 50, None, etc..
    )

In [None]:
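# set pyLDAvis to render plots within IPython notebook cells
# only the 'prepare' function was imported earlier, so the name 'pyLDAvis' is not defined;
# import 'enable_notebook' explicitly before calling it
from pyLDAvis import enable_notebook as pyLDAvis_enable_notebook
pyLDAvis_enable_notebook()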

[Top](#Machine-Learning-with-Python)

---

## Step 2: Load pre-processed data
### Use [pickling](https://realpython.com/python-pickle-module/) to store python objects and reuse them across notebooks 
NOTE: Pickled files can only be reliably 'unpickled' in an environment whose dependencies match the one they were 'pickled' in, e.g. a colleague may not be able to open pickled files without the same Python environment on their machine
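
A minimal round-trip sketch, using the standard-library `pickle` module and a temporary directory. The object and file name are hypothetical; `joblib`'s `dump` and `load` functions follow the same save-then-restore pattern:

```python
import os
import pickle
import tempfile

# hypothetical object standing in for the pre-processed data
original = {"corpus": [[(0, 2), (1, 1)]], "id2word": {0: "great", 1: "product"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    # 'pickle' the object: serialize it to bytes on disk
    with open(path, "wb") as handle:
        pickle.dump(original, handle)
    # 'unpickle' it: rebuild an equal object from those bytes
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == original)
```

Unpickling can execute arbitrary code, so only load pickled files from sources you trust.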

### Topic Model Requirements

**id2word**
: A Python dictionary mapping the numerical identifiers created during pre-processing to the actual words they represent

**corpus**
: A 'bag of words' representation of each review in the dataset, converting text to a Python list of tuples (immutable sequences) in the form (Word Identifier, Word Frequency)

NOTE: We use the Gensim package's functions to transform the text for us in the [Data Wrangling Notebook](./data_wrangling.ipynb)
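
To make the two structures concrete, here is a pure-Python imitation of what they contain. It is only a sketch: the reviews are hypothetical, and the identifiers Gensim assigns may differ from the ones produced here.

```python
from collections import Counter

# hypothetical, already-tokenized reviews
reviews = [
    ["great", "product", "great", "price"],
    ["slow", "delivery", "great", "product"],
]

# assign an integer identifier to each distinct word, in order of first appearance
word2id = {}
for review in reviews:
    for word in review:
        word2id.setdefault(word, len(word2id))

# id2word: identifier -> word
id2word_example = {identifier: word for word, identifier in word2id.items()}

# corpus: one list of (word identifier, word frequency) tuples per review
corpus_example = [
    sorted(Counter(word2id[word] for word in review).items())
    for review in reviews
]

print(id2word_example)
print(corpus_example)
```

Word order within each review is discarded; only the counts survive, which is why the representation is called a 'bag' of words.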

In [None]:
# import corpus and id2word objects by 'unpickling' them
corpus = joblib_load(
    '../data/cocoon_reviews_corpus.pkl'
)
id2word = joblib_load(
    '../data/cocoon_reviews_id2word.pkl'
)

### Text Pre-processing Procedure

**1. Tokenization**: Split raw text into small, indivisible units for processing.
These can be:
- words
- n-grams
- sentences
- paragraphs

>> n-gram
: A sequence of *n* words, e.g. a 2-gram, called a bigram, is a sequence of two words (“please turn”, “turn your”, or “your homework”), and a 3-gram, called a trigram, is a sequence of three words (“please turn your”, or “turn your homework”)
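
The definition above can be sketched in a few lines. The `ngrams` helper is a hypothetical name, not a function from any of the packages used in this notebook:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```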

**2. Normalization**: Convert all text to lower case, expand contractions, remove numerals and accent marks

>> Contraction
: A word made by shortening and combining two words, e.g. can't (can + not), don't (do + not), and I've (I + have)
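
A rough sketch of the normalization step using only the standard library. The contraction map is hypothetical and deliberately tiny; a real pipeline would use a much fuller list:

```python
import re
import unicodedata

# hypothetical, deliberately tiny contraction map
CONTRACTIONS = {"can't": "can not", "don't": "do not", "i've": "i have"}

def normalize(text):
    # convert to lower case
    text = text.lower()
    # straighten curly apostrophes so the contraction map can match
    text = text.replace("\u2019", "'")
    # expand contractions
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # remove accent marks: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # remove numerals
    text = re.sub(r"\d+", "", text)
    # collapse the whitespace left behind
    return " ".join(text.split())

normalized = normalize("I've waited 3 days and can't find the Café")
print(normalized)
```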

**3. Stopwords Removal**: Remove words below a certain length threshold or which contribute little overall meaning

>> Stopwords
: A set of commonly used words in a given language; removing them sharpens the meaning drawn from the text, since only important or relevant words remain
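
As an illustration, both filters (stopword membership and minimum length) fit in one list comprehension. The stopword set and the length threshold are hypothetical values chosen only for this sketch:

```python
# hypothetical stopword list and length threshold
STOPWORDS = {"the", "is", "and", "a", "it", "was", "to"}
MIN_LENGTH = 3

def remove_stopwords(tokens):
    # keep a token only if it is not a stopword and is long enough
    return [t for t in tokens if t not in STOPWORDS and len(t) >= MIN_LENGTH]

tokens = "the delivery was slow and it is ok".split()
kept = remove_stopwords(tokens)
print(kept)
```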

**4. Stemming or Lemmatization**: Tag each word with the part of speech it represents, then reduce it to a base form, either by removing common word endings ('-ing', '-ed') or by mapping it to its lemma (the canonical form chosen to represent the word)

![stemming_vs_lemmatization](../images/stemming_vs_lemmatization.png)

>> Stemming
: Cut off the end or the beginning of the word based on an algorithm, typically taking into account a list of common prefixes and suffixes that can be found in an inflected word in a given language

>> Lemmatization
: Groups words based on a morphological analysis of their structure and definition. This requires detailed dictionaries of words and their representatives, which the algorithm searches to link each form back to its lemma.

>> Part-of-Speech Tagging
: A common Natural Language Processing task that assigns each word in a text (corpus) to a part of speech, based on the definition of the word and its context

### Why is part of speech tagging important to do before stemming or lemmatization?
![pos_example1](../images/pos_example1.png)
![pos_example2](../images/pos_example2.png)
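
One way to see the answer in code: a stemmer treats a word the same way regardless of its role, while a lemmatizer can return different lemmas for the same spelling once the part of speech is known. Both the suffix rules and the lemma dictionary below are hypothetical toys, far simpler than a real stemmer or lemmatizer:

```python
# toy suffix-stripping stemmer with hypothetical rules
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# toy lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(word, pos):
    # fall back to the word itself when the pair is unknown
    return LEMMAS.get((word, pos), word)

stem_result = naive_stem("meeting")
verb_lemma = lemmatize("meeting", "VERB")
noun_lemma = lemmatize("meeting", "NOUN")
print(stem_result, verb_lemma, noun_lemma)
```

Without the tag, "the meeting ran long" and "we are meeting today" would be reduced to the same form, losing the distinction between the noun and the verb.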

[Top](#Machine-Learning-with-Python)

---

## Step 3: Create instance of LDA model

**Latent Dirichlet Allocation (LDA)**  

A ‘generative probabilistic model’ that describes sets of observations through unobserved groups, which explain why some parts of the data are similar.

For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. In this case the documents are the reviews and the parts are the words and/or phrases (n-grams), with LDA serving as a way of soft clustering the documents and parts.

The fuzzy memberships drawn provide a more nuanced way of inferring topics, as each review can be made up of a combination of topics, versus simply attributing each review to one topic. This feature of LDA is key in this instance, as we are seeking to draw out terms indicative of constructive comments, with the assumption that customers might write about more than one topic in their reviews.
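
The generative story can be sketched with the standard library alone. This is an illustration of the idea, not how Gensim fits the model: the topics, their word probabilities, and the `alpha` value are all hypothetical, and fitting LDA means inferring such distributions from observed documents rather than sampling from known ones.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# hypothetical topics: each is a probability distribution over a tiny vocabulary
topics = {
    "delivery": {"shipping": 0.5, "late": 0.3, "box": 0.2},
    "quality": {"sturdy": 0.6, "cheap": 0.3, "broken": 0.1},
}

def sample_dirichlet(alphas):
    """Draw topic proportions by normalizing independent gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]

def generate_document(num_words, alpha=0.5):
    names = list(topics)
    # 1. draw this document's mixture of topics
    proportions = sample_dirichlet([alpha] * len(names))
    words = []
    for _ in range(num_words):
        # 2. pick a topic for this word according to the mixture
        topic = random.choices(names, weights=proportions)[0]
        # 3. pick a word from that topic's word distribution
        vocab = topics[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return proportions, words

proportions, words = generate_document(num_words=8)
print(proportions)
print(words)
```

The `proportions` list is the 'fuzzy membership' of the document: it can lean heavily toward one topic or split between several.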

### Optional
Tune the [hyperparameters](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac), which are parameters whose values control the learning process and determine the values of the model parameters that a learning algorithm ends up learning  

For this Latent Dirichlet Allocation Model those would be:
- num_topics
- random_state
- update_every
- chunksize
- passes
- alpha
- per_word_topics
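
One possible way to organize tuning is to enumerate candidate settings up front. The sketch below only builds the combinations; the candidate values are hypothetical, and the right ranges depend on the dataset:

```python
from itertools import product

# hypothetical candidate values for the settings we want to vary
search_space = {
    "num_topics": [2, 4, 6],
    "passes": [5, 10],
    "alpha": ["symmetric", "auto"],
}

# settings held fixed across every candidate
fixed = {
    "random_state": 100,
    "update_every": 1,
    "chunksize": 100,
    "per_word_topics": True,
}

# one dictionary of keyword arguments per combination of candidate values
candidates = [
    {**fixed, **dict(zip(search_space, values))}
    for values in product(*search_space.values())
]

print(len(candidates))
print(candidates[0])
```

Each dictionary could then be unpacked into the model constructor alongside `corpus` and `id2word`, and the resulting models compared with a quality measure of your choice, such as topic coherence.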

In [None]:
optimal_model = gensim_LdaModel(
    corpus = corpus,
    id2word = id2word,
    num_topics = 4,
    random_state = 100,
    update_every = 1,
    chunksize = 100,
    passes = 10,
    alpha = 'auto',
    per_word_topics = True
)
model_topics = optimal_model.show_topics(
    formatted = False
)
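
With `formatted = False`, `show_topics` returns a list of `(topic id, [(word, weight), ...])` pairs rather than formatted strings. The sketch below walks a hypothetical sample in that shape; the words and weights are made up for illustration and are not results from this model:

```python
# hypothetical sample in the shape returned with formatted=False
sample_topics = [
    (0, [("shipping", 0.12), ("late", 0.08)]),
    (1, [("price", 0.15), ("value", 0.05)]),
]

# flatten into one row per (topic, word), ranked within each topic
rows = [
    {"topic": topic_id, "rank": rank, "word": word, "weight": weight}
    for topic_id, words in sample_topics
    for rank, (word, weight) in enumerate(words, start=1)
]

# or keep just the top words for each topic
top_words = {topic_id: [word for word, _ in words] for topic_id, words in sample_topics}

print(rows[0])
print(top_words)
```

A flat list of rows like this is convenient to pass to `pd.DataFrame` for further inspection.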

[Top](#Machine-Learning-with-Python)

---

## Step 4: Visualize the model results

In [None]:
pprint(optimal_model.print_topics())

In [None]:
top_n_words = 10
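topics = optimal_model.show_topics(
    num_topics = 4, num_words = top_n_words, formatted = False)

# each entry is (topic id, [(word, weight), ...]); we only need the weights
for _, infos in topics:
    probs = [prob for _, prob in infos]
    # pass x and y as keywords: newer seaborn releases no longer accept them positionally
    seaborn_lineplot(x = list(range(top_n_words)), y = probs, marker = '*')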

plt.xlabel('Word rank')
plt.ylabel('Weights')
plt.title('Weights of Top {} Words in each Topic'.format(top_n_words))
plt.show()

In [None]:
LDAvis_prepared = pyLDAvis_prepare_gensim(optimal_model, corpus, id2word)
LDAvis_prepared

[Top](#Machine-Learning-with-Python)

---