## ⚡️ Make a Copy

Save a copy of this notebook in your Google Drive before continuing. Be sure to edit your own copy, not the original notebook.

In this lab, you will simply execute the code to create your document classifications and your interactive LDAvis visualization. Success in this lab depends primarily on having properly completed Homework 2.

Before continuing with this lab, be sure you have completed Homework 2 and saved your corpus and topic model to your Google Drive. They should be files in the root of your Google Drive called MSDS_HW2_corpus.p and MSDS_HW2_model.p. In this lab, you will load the model and use it for classification of documents in the corpus.

Mount your Google Drive via the left sidebar controls before continuing.

✏️  **Note:** You may need to restart the runtime after running the following imports/installs.

In [2]:
import numpy as np
import pickle

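# Install each package only if it is not already available in the runtime.
try:
  import pyLDAvis
except ImportError:
  !pip install pyLDAvis==3.4.1
  import pyLDAvis

try:
  import tmtoolkit
except ImportError:
  !pip install tmtoolkit
  import tmtoolkit

try:
  from lda import LDA
except ImportError:
  !pip install lda
  from lda import LDA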

from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics
from tmtoolkit.topicmod.model_io import load_ldamodel_from_pickle
from tmtoolkit.topicmod.visualize import parameters_for_ldavis



In [3]:
# prompt: Ensure matplotlib version 3.1.3 is installed. Restart the kernel if a different version is installed after installing.
# !pip install matplotlib==3.1.3



## Load the saved objects

Be sure that you have mounted your Google Drive and that the files are there.

In [4]:
from google.colab import drive
drive.mount('/content/drive')
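# Folder in your Drive that holds the Homework 2 files; adjust this path if you
# saved them elsewhere (for example, "drive/MyDrive" for the root of your Drive).
folder_path = "drive/MyDrive/MSDS_marketing_text_analytics"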



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
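
One way to confirm that the files are in place before loading them is sketched below. The helper name `missing_files` is hypothetical (it is not part of the lab code), and the folder path is simply the one assigned to folder_path above.

```python
from pathlib import Path

def missing_files(folder, filenames):
    """Hypothetical helper: return the names in `filenames` not found in `folder`."""
    folder = Path(folder)
    return [name for name in filenames if not (folder / name).is_file()]

# File names taken from the lab instructions
expected = ["MSDS_HW2_corpus.p", "MSDS_HW2_model.p"]
missing = missing_files("drive/MyDrive/MSDS_marketing_text_analytics", expected)
print("Missing files:", missing if missing else "none")
```

If any file is reported missing, check that the Drive is mounted and that the path matches where you saved your Homework 2 output.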


In [5]:
with open(f"{folder_path}/MSDS_HW2_corpus.p", "rb") as corpusfile:
    corpus = pickle.load(corpusfile)



In [6]:
with open(f"{folder_path}/MSDS_HW2_model.p", "rb") as modelfile:
    model_info = load_ldamodel_from_pickle(modelfile)

model_info.keys()



dict_keys(['model', 'vocab', 'doc_labels', 'dtm'])

## Classification

As seen from the keys output above, the saved model is actually a dictionary containing the model itself, along with some other resources we will need.

Here, you will use those resources to create topic labels using [generate_topic_labels_from_top_words](https://tmtoolkit.readthedocs.io/en/latest/api.html#tmtoolkit.topicmod.model_stats.generate_topic_labels_from_top_words). Note that the vocabulary is stored as a list, but the function needs a NumPy array, so it is converted with np.array. Also note that the document lengths are computed from the saved document-term matrix (dtm) by the doc_lengths function.

In [7]:
model = model_info["model"]
vocab = model_info["vocab"]
dtm = model_info["dtm"]
doc_labels = model_info["doc_labels"]

topic_labels = generate_topic_labels_from_top_words(
    model.topic_word_,
    model.doc_topic_,
    doc_lengths(dtm),
    np.array(vocab),
)



In [8]:
topic_labels



array(['1_size_fit', '2_wear_foot', '3_run_sock', '4_shoe_great',
       '5_shoe_nike', '6_shoe_size', '7_bag_love', '8_love_comfortable',
       '9_shoe_buy', '10_shoe_like'], dtype='<U18')
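
To make the two notes about the inputs concrete, here is a minimal sketch on a made-up document-term matrix. All `toy_` names and values are hypothetical and unrelated to the homework model. It assumes that a document's length is the row sum of its term counts, which is how the tmtoolkit documentation describes doc_lengths, and it shows one reason a NumPy array is convenient for the vocabulary: it can be indexed with an array of term positions.

```python
import numpy as np

# Hypothetical toy document-term matrix: 3 documents x 4 vocabulary terms
toy_dtm = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 1, 1],
])
toy_vocab = ["shoe", "size", "fit", "love"]  # plain list, like the saved vocab

# Document length = total number of term occurrences in each row
toy_doc_lengths = toy_dtm.sum(axis=1)

# A NumPy array can be indexed with an array of positions;
# a plain Python list cannot be indexed this way.
toy_vocab_arr = np.array(toy_vocab)
top_term_idx = np.argsort(toy_dtm.sum(axis=0))[::-1][:2]  # two most frequent terms

print(toy_doc_lengths)
print(toy_vocab_arr[top_term_idx])
```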

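Document classifications can be retrieved via [ldamodel_top_doc_topics](https://tmtoolkit.readthedocs.io/en/latest/api.html?highlight=ldamodel_top_doc_topics#tmtoolkit.topicmod.model_io.ldamodel_top_doc_topics). With top_n=2, each document is assigned its two most probable topics, and each topic's probability is shown in parentheses.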

In [9]:
doc_topic = model.doc_topic_
documentclassifications = ldamodel_top_doc_topics(doc_topic, doc_labels, top_n=2, topic_labels=topic_labels)



In [10]:
documentclassifications.head()



| document | rank_1 | rank_2 |
|---|---|---|
| 0 | 10_shoe_like (0.5215) | 4_shoe_great (0.4355) |
| 1 | 4_shoe_great (0.2989) | 1_size_fit (0.2566) |
| 2 | 6_shoe_size (0.7458) | 8_love_comfortable (0.108) |
| 3 | 4_shoe_great (0.6647) | 6_shoe_size (0.1941) |
| 4 | 6_shoe_size (0.5513) | 8_love_comfortable (0.4145) |
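
The ranking idea behind this output can be sketched in plain Python: for each document, sort the topics by probability and keep the top few. This is an illustration under assumed toy data, not tmtoolkit's actual implementation; the function name `top_doc_topics` and the `toy_` values are hypothetical.

```python
def top_doc_topics(doc_topic, topic_labels, top_n=2):
    """Return, per document, the top_n topics as 'label (probability)' strings."""
    results = []
    for probs in doc_topic:
        # Rank topic indices by probability, highest first
        ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        results.append([f"{topic_labels[j]} ({probs[j]:.4g})" for j in ranked[:top_n]])
    return results

# Hypothetical topic distributions for two documents over three topics
toy_doc_topic = [
    [0.10, 0.60, 0.30],
    [0.70, 0.05, 0.25],
]
toy_labels = ["1_size_fit", "2_wear_foot", "3_run_sock"]

toy_ranked = top_doc_topics(toy_doc_topic, toy_labels)
for doc_id, ranks in enumerate(toy_ranked):
    print(doc_id, ranks)
```

Note that a document whose top two probabilities are close, such as document 1 in the output above, is a less clear-cut classification than one dominated by a single topic.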


### Attach the original corpus texts to the dataframe

Finally, add the corpus texts to the DataFrame so you can read each document alongside its topics. This assumes that corpus holds the documents in the same order as the rows of the DataFrame.

In [11]:
documentclassifications["text"] = corpus
documentclassifications.head()



| document | rank_1 | rank_2 | text |
|---|---|---|---|
| 0 | 10_shoe_like (0.5215) | 4_shoe_great (0.4355) | bought these for supportive shoes after our da... |
| 1 | 4_shoe_great (0.2989) | 1_size_fit (0.2566) | I was a little hesitant about buying sneakers ... |
| 2 | 6_shoe_size (0.7458) | 8_love_comfortable (0.108) | I have a lot of pairs of running shoes, and th... |
| 3 | 4_shoe_great (0.6647) | 6_shoe_size (0.1941) | My husband said they are very comfortable and ... |
| 4 | 6_shoe_size (0.5513) | 8_love_comfortable (0.4145) | Very nice shoes...my son loved the color and c... |


## Visualization

[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html) is a Python port of the LDAvis package in R, and is used as a tool for interpreting the topics in a topic model that has been fit to a corpus of text data.

Execute the following code to create an interactive visualization of your topic model.

In [12]:
ldavis_params = parameters_for_ldavis(
    model.topic_word_,
    model.doc_topic_,
    dtm,
    vocab
)
print(ldavis_params.keys())

dict_keys(['topic_term_dists', 'doc_topic_dists', 'vocab', 'doc_lengths', 'term_frequency', 'sort_topics'])
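
If the visualization step fails, a quick consistency check on these inputs can help narrow down the cause. The sketch below is a hypothetical helper, not part of pyLDAvis or tmtoolkit. It rests on the assumption that the visualization expects each row of the two distribution matrices to sum to 1 and the dimensions of all inputs to agree; the `toy_` values are made up.

```python
import numpy as np

def check_ldavis_inputs(topic_term_dists, doc_topic_dists, vocab,
                        doc_lengths, term_frequency):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    topic_term = np.asarray(topic_term_dists)
    doc_topic = np.asarray(doc_topic_dists)
    # Each row is a probability distribution and should sum to 1
    if not np.allclose(topic_term.sum(axis=1), 1.0):
        problems.append("topic-term rows do not sum to 1")
    if not np.allclose(doc_topic.sum(axis=1), 1.0):
        problems.append("doc-topic rows do not sum to 1")
    # Dimensions must agree across all inputs
    if topic_term.shape[1] != len(vocab) or len(term_frequency) != len(vocab):
        problems.append("vocabulary size mismatch")
    if doc_topic.shape[0] != len(doc_lengths):
        problems.append("document count mismatch")
    if topic_term.shape[0] != doc_topic.shape[1]:
        problems.append("topic count mismatch")
    return problems

# Hypothetical toy inputs: 2 topics, 3 terms, 2 documents
toy_topic_term = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
toy_doc_topic = [[0.9, 0.1], [0.4, 0.6]]
toy_problems = check_ldavis_inputs(
    toy_topic_term, toy_doc_topic, ["shoe", "size", "fit"], [5, 7], [4, 3, 5]
)
print(toy_problems)
```

With the real parameters, the same check could be run as `check_ldavis_inputs(**{k: v for k, v in ldavis_params.items() if k != "sort_topics"})`, since the remaining keys match the helper's argument names.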




In [13]:
%matplotlib inline
# pyLDAvis.drop('saliency', axis=1)
vis = pyLDAvis.prepare(**ldavis_params)
pyLDAvis.enable_notebook(local=True)
pyLDAvis.display(vis)



⚡️ **For a better screenshot:**

 * Select the code cell above which contains the display code
 * Select the 3 dots at the far right for "More cell actions"
 * Select "View output fullscreen"



## Submit a screenshot for peer evaluation

Once you have executed the code above and generated the visualization of your topic model, submit a screenshot of your interactive LDAvis as proof that you completed the lab. Submit it as an image file in the peer review assessment after Lab 2 in Coursera.

💡 Tip: The LDAvis tends to be more readable in Colab's light theme than in the dark theme (go to Tools > Settings > Site > Theme > light). It may also be easier to get a good screenshot by viewing the chart at full screen (see the ⚡️ tip above).