# MSDS Marketing Text Analytics, Unit 2, Assignment 2: Build a topic model

## ⚡️ Make a Copy

Save a copy of this notebook in your Google Drive before continuing. Be sure to edit your own copy, not the original notebook.

In this assignment, you will implement a topic model preprocessor, which can then be applied to topic modeling of Amazon text reviews. Please review the course lectures and documentation up to this point before continuing. Also familiarize yourself with the [documentation for TMToolkit](https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html).


You will implement a preprocessing function to prepare your corpus for topic modeling. It is recommended that you use a small test corpus (an example is provided below) for development, rather than starting with the full review set.

## Imports

In [1]:
try:
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ModuleNotFoundError:
    !pip install lda
    !pip install tmtoolkit
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel


RuntimeError: the required package "spacy" for text processing is not installed; did you install tmtoolkit with "recommended" or "textproc" option? see https://tmtoolkit.readthedocs.io/en/latest/install.html for further information

In [3]:
# Quote the requirement so the shell does not treat the square brackets as a glob pattern
!pip install -U "tmtoolkit[textproc_extra]"


**NOTE:** Loading a corpus as a list of strings is not the only way to use tmtoolkit, and it does not work well for a corpus too large to fit in memory. See the tmtoolkit docs on [working with text corpora](https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html) for more info.
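
For a corpus too large to hold in memory, one option is to stream documents from disk one at a time instead of building a list. The sketch below uses only the standard library. The helper name `iter_review_texts` is hypothetical, the `reviewText` field is borrowed from the review file used later in this notebook, and the demo file is made up. It does not show how to feed the stream into tmtoolkit.

```python
import gzip
import json
import os
import tempfile


def iter_review_texts(path, field="reviewText"):
    """Yield one text at a time from a gzipped JSON-lines file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if field in record:
                yield record[field]


# Demo with a tiny temporary file (made-up records) so the sketch runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    out.write(json.dumps({"asin": "A1", "reviewText": "Great shoes"}) + "\n")
    out.write(json.dumps({"asin": "A2", "reviewText": "Runs small"}) + "\n")

streamed = list(iter_review_texts(demo_path))
print(streamed)
```

Because the function is a generator, only one line of the file is decoded at a time; calling `list()` on it, as the demo does, gives up that benefit and is only for display.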

## Implement a pre-processor

Here you will implement a function called `preprocess` which returns the TMPreproc object to be used for topic modeling.

The preprocess function will take a list of texts and return a pre-processed corpus object, i.e. a TMPreproc object. Preprocessing should include the following actions on the corpus using the appropriate methods in the TMPreproc class:

 - lemmatize the texts
 - convert tokens to lowercase
 - remove special characters
 - clean tokens to remove numbers and any tokens shorter than 3 characters

The first part of the function, which creates the corpus and the preprocess object, is done for you. Your job is to call the specific preprocessing methods and return the resulting preprocess object.
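
To build intuition for what the non-lemmatizing steps do to a list of tokens, here is a minimal plain-Python sketch. It is not tmtoolkit code and not a substitute for the TMPreproc methods: the function name `clean_tokens_sketch` is hypothetical, the regular expression assumes ASCII-only text, and lemmatization is left out because it needs a language model.

```python
import re


def clean_tokens_sketch(tokens, min_length=3):
    """Illustrative stand-in for the non-lemmatizing steps; not tmtoolkit code."""
    cleaned = []
    for token in tokens:
        token = token.lower()                    # convert to lowercase
        token = re.sub(r"[^a-z0-9]", "", token)  # strip special characters (ASCII assumption)
        if token.isdigit():                      # drop purely numeric tokens
            continue
        if len(token) < min_length:              # drop short (and now-empty) tokens
            continue
        cleaned.append(token)
    return cleaned


sample = ["The", "3", "Cats", "sat", "on", "$ea$shells", "!"]
result = clean_tokens_sketch(sample)
print(result)
```

Comparing a hand-computed result like this with the output of your `preprocess` function on the same text can help you spot a step that was skipped or applied with the wrong threshold.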


In [None]:
def preprocess(texts, lang="en"):
    """Return a TMPreproc object for the given texts, processed in the language
    specified by lang (defaults to "en").

    Performs all of the following pre-processing steps:
     - lemmatize
     - tokens_to_lowercase
     - remove_special_chars_in_tokens
     - clean_tokens (remove numbers, and remove tokens shorter than 3 characters)
    """
    # Here, we just use the index of the text as the label for the corpus item
    corpus = Corpus({i: r for i, r in enumerate(texts)})
    preproc = TMPreproc(corpus, language=lang)

    # TODO: Complete the implementation of this function and submit the
    # .py download of this notebook as your assignment submission.
    # Note: lemmatize not working correctly

    # remove_shorter_than=3 matches the stated requirement (drop tokens
    # shorter than 3 characters).
    return (
        preproc.lemmatize()
        .tokens_to_lowercase()
        .remove_special_chars_in_tokens()
        .clean_tokens(remove_shorter_than=3, remove_numbers=True)
    )

In [None]:
# %pip install -Iv spaCy

In [None]:
help(TMPreproc.lemmatize)

In [None]:
#~~ /autograde # do not delete this cell

---
### ⚠️  **Caution:** No arbitrary code above this line

The only code written above should be the implementation of your graded function. For experimentation and testing, only add code below.
___

## Function development

Use this section of code to verify your function implementation. You may change `test_corpus` as needed. The grader will check that your function returns a TMPreproc object that meets all of the following criteria:

 - tokens are lemmatized
 - tokens are converted to lowercase
 - special characters are removed from tokens
 - tokens shorter than 3 characters and numerics are removed
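
Most of these criteria can be checked mechanically. The sketch below is one hedged way to do so, assuming the tokens come as a mapping from document label to a list of token strings (which is what `get_tokens()` is assumed to return here). The helper name `find_violations` is hypothetical, this is not the grader's actual logic, and lemmatization is not checked.

```python
import string


def find_violations(tokens_by_doc, min_length=3):
    """Return (doc_label, token, reason) tuples for tokens that break the criteria."""
    special = set(string.punctuation)
    violations = []
    for label, tokens in tokens_by_doc.items():
        for token in tokens:
            if token != token.lower():
                violations.append((label, token, "not lowercase"))
            if any(ch in special for ch in token):
                violations.append((label, token, "special character"))
            if token.isdigit():
                violations.append((label, token, "numeric"))
            if len(token) < min_length:
                violations.append((label, token, "too short"))
    return violations


good = {0: ["cat", "sit", "mat"]}
bad = {0: ["Cat", "on", "3", "$hell"]}
print(find_violations(good))
print(find_violations(bad))
```

An empty list means no violations were found for the criteria this helper covers.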

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [7]:
test_corpus = [ # Feel free to edit this corpus for further testing
                # to be sure that your functions meet specifications.
    "The 3 cats sat on the mats!",
    "1 fish 2 fish Red fish Blue fish",  # comma added: without it, Python joins this string with the next
    "She sells $ea$shells"
]
preproc = preprocess(test_corpus)
pp.pprint(preproc.get_tokens())



KeyError: ignored

In [None]:
dtms = {
    "test_corpus": preproc.dtm
}
lda_params = {
    'n_topics': 2,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

In [None]:
model = models["test_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, preproc.vocabulary, top_n=5)
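
The `dtm` passed to the model above is a document-term matrix: one row per document, one column per vocabulary term, and each cell holding how often that term occurs in that document. The following is a minimal plain-Python sketch of that idea with made-up tokens. tmtoolkit builds its own matrix internally, so this only illustrates the concept.

```python
from collections import Counter

# Made-up, already-preprocessed tokens for two documents
docs = {
    "doc0": ["cat", "sit", "mat"],
    "doc1": ["fish", "fish", "red", "fish"],
}

# Vocabulary: every distinct token, in sorted order (one column per term)
vocab = sorted({tok for toks in docs.values() for tok in toks})

# One row per document, holding the count of each vocabulary term
dtm = []
for label in sorted(docs):
    counts = Counter(docs[label])
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```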

### Assignment submission

After completing the preprocess implementation, download your notebook as a .py file (File > Download > Download .py) and submit the downloaded file for grading.

## Topic modeling Amazon Reviews

Once you have completed the assignment above, you will be well prepared to start your final project for this unit. The project will include loading Amazon reviews into a corpus for topic modeling. The code below demonstrates topic modeling the reviews for a given brand. Note that the final project will require additional segmentation of the data, which is not done for you in the example here.

In [None]:
import gzip
import itertools
import json

asins = []

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        categories = [c.lower() for c in
                      list(itertools.chain(*data.get("categories", [])))]
        if "nike" in categories:
            asins.append(data["asin"])
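
The category check in the cell above relies on flattening a list of category paths (a list of lists) into a single list before testing for the brand name. Here is a small self-contained sketch of that step using made-up records, so it can be run without the metadata file. The sample ASINs and categories are invented for illustration.

```python
import itertools
import json

# Made-up JSON-lines records; the third has no "categories" key at all
sample_lines = [
    '{"asin": "B000001", "categories": [["Clothing", "Men"], ["Nike", "Shoes"]]}',
    '{"asin": "B000002", "categories": [["Clothing", "Women"]]}',
    '{"asin": "B000003"}',
]

matched = []
for line in sample_lines:
    data = json.loads(line)
    # chain.from_iterable flattens the list of category paths into one sequence;
    # data.get(..., []) makes a missing "categories" key behave like an empty list
    flat = [c.lower() for c in itertools.chain.from_iterable(data.get("categories", []))]
    if "nike" in flat:
        matched.append(data["asin"])

print(matched)
```

Note that the test is an exact match on a lowercased category name, so a category such as "Nike Running" would not match.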

Inspect the first few ASINs

In [None]:
asins[:3]

Check the length, i.e. the number of resulting ASINs

In [None]:
len(asins)

Build a corpus of review texts

In [None]:
review_corpus = []
asin_set = set(asins)  # set membership tests are much faster than scanning a list
with gzip.open("drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry_5.json.gz") as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asin_set:
            text = data["reviewText"]
            review_corpus.append(text)

Inspect a few of the reviews

In [None]:
for i, review in enumerate(review_corpus[:5]):
    print(i, review[:80])

Build a TMPreproc object from the review corpus

In [None]:
pre = preprocess(review_corpus)

In [None]:
dtms = {
    "reviews_corpus": pre.dtm
}
lda_params = {
    'n_topics': 10,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

Print the topics

In [None]:
model = models["reviews_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, pre.vocabulary, top_n=5)

## Save your topic model and corpus for use in Lab 2

Once you have completed the above assignment, run the following code to save your topic model and your corpus to your Google Drive. You will load this model and use it for document classification in Lab 2.

In [None]:
import pickle
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle

# save_ldamodel_to_pickle takes the target file path and opens the file itself
save_ldamodel_to_pickle("drive/MyDrive/MSDS_HW2_model.p", model, pre.vocabulary, pre.doc_labels, dtm=pre.dtm)

In [None]:
with open("drive/MyDrive/MSDS_HW2_corpus.p", "wb") as corpusfile:
    pickle.dump(review_corpus, corpusfile)
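
To confirm that a pickled corpus can be read back, you can load it with `pickle.load` and compare it with the original. The sketch below demonstrates the round trip with a made-up two-item list and a temporary file rather than your Google Drive path. Both are stand-ins for illustration.

```python
import os
import pickle
import tempfile

sample_corpus = ["Great shoes", "Runs small"]  # stand-in for review_corpus
path = os.path.join(tempfile.mkdtemp(), "corpus_demo.p")

# Write the corpus to disk in binary mode
with open(path, "wb") as corpusfile:
    pickle.dump(sample_corpus, corpusfile)

# Read it back; the file must also be opened in binary mode
with open(path, "rb") as corpusfile:
    restored = pickle.load(corpusfile)

print(restored == sample_corpus)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.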