# Simple topic identification
  
This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment and compare two simple methods: bag-of-words and Tf-idf using NLTK, and a new library `Gensim`.

**Helpful links**  
  
[Regex Testing](https://regex101.com)  
[NLTK Documentation](https://www.nltk.org)  
[Gensim Documentation](https://radimrehurek.com/gensim/auto_examples/index.html)  
[Python Documentation for Text Processing Services (re module and strings)](https://docs.python.org/3/library/text.html)


In [22]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import re                           # Regular Expressions:      Text manipulation
from pprint import pprint           # Pretty Print:             Advanced printing operations

**Word counts with bag-of-words**
  
Welcome to chapter two! We'll begin with using word counts with a bag of words approach.
  
**Bag-of-words**
  
Bag of words is a very simple and basic method to finding topics in a text. For bag of words, you need to first create tokens using tokenization, and then count up all the tokens you have. The theory is that the more frequent a word or token is, the more central or important it might be to the text. Bag of words can be a great way to determine the significant words in a text based on the number of times they are used.
  
**Bag-of-words**  
- Basic method for finding topics in a text
- Need to first create tokens using tokenization
- ... and then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text
  
**Bag-of-words example**
  
Here we see an example series of sentences, mainly about a cat and a box. If we just us a simple bag of words model with tokenization like we learned in chapter one and remove the punctuation, we can see the example result. Box, cat, The and the are some of the most important words because they are the most frequent. Notice that the word THE appears twice in the bag of words, once with uppercase and once lowercase. If we added a preprocessing step to handle this issue, we could lowercase all of the words in the text so each word is counted only once.
  
<img src='../_images/nlp-bag-of-words-example.png' alt='img' width='500'>
  
**Bag-of-words in Python**
  
We can use the NLP fundamentals we already know, such as tokenization with NLTK to create a list of tokens. We will use a new class called `Counter` which we import from the standard library module `collections`. The list of tokens generated using `word_tokenize` can be passed as the initialization argument for the `Counter` class. The result is a counter object which has similar structure to a dictionary and allows us to see each token and the frequency of the token. `Counter` objects also have a method called `.most_common()`, which takes an integer argument, such as 2 and would then return the top 2 tokens in terms of frequency. The return object is a series of tuples inside a list. For each tuple, the first element holds the token and the second element represents the frequency. Note: other than ordering by token frequency, the `.most_common()` method does not sort the tokens it returns or tell us there are more tokens with that same frequency.
  
<img src='../_images/nlp-bag-of-words-example1.png' alt='img' width='500'>
  
**Let's practice!**
  
Now you know a bit about bag of words and can get started building your own using Python.

### Bag-of-words picker
  
It's time for a quick check on your understanding of bag-of-words. Which of the below options, with basic nltk tokenization, map the bag-of-words for the following text?

"The cat is in the box. The cat box."
  
**Possible answers**
  
- [ ] ('the', 3), ('box.', 2), ('cat', 2), ('is', 1)
- [ ] ('The', 3), ('box', 2), ('cat', 2), ('is', 1), ('in', 1), ('.', 1)
- [ ] ('the', 3), ('cat box', 1), ('cat', 1), ('box', 1), ('is', 1), ('in', 1)
- [x] ('The', 2), ('box', 2), ('.', 2), ('cat', 2), ('is', 1), ('in', 1), ('the', 1)
  
**Solution**
  
```python
In [1]: from nltk.tokenize import word_tokenize

In [2]: word_tokenize("The cat is in the box. The cat box.", language='english')
Out[2]: ['The', 'cat', 'is', 'in', 'the', 'box', '.', 'The', 'cat', 'box', '.']
```
**Alternative Solution**  
  
```python
In [3]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize(my_string)).most_common(len(word_tokenize(my_string)))

Out [3]: 
[('The', 2),
 ('cat', 2),
 ('box', 2),
 ('.', 2),
 ('is', 1),
 ('in', 1),
 ('the', 1)]
```
  
Well done!

### Building a Counter with bag-of-words
  
In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as `article_title`. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.
  
`word_tokenize` has been imported for you.
  
1. Import `Counter` from `collections`.
2. Use `word_tokenize()` to split the article into tokens.
3. Use a list comprehension with `t` as the iterator variable to convert all the tokens into lowercase. The `.lower()` method converts text into lowercase.
4. Create a bag-of-words counter called `bow_simple` by using `Counter()` with `lower_tokens` as the argument.
5. Use the `.most_common()` method of `bow_simple` to print the 10 most common tokens.

In [23]:
from nltk.tokenize import word_tokenize
from collections import Counter


with open('../_datasets/wikipedia_articles/wiki_text_debugging.txt', 'r') as file:
    article = file.read()
    article_title = word_tokenize(article)[2]

In [24]:
# Tokenize the aricle: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple, Bag-of-Words
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
pprint(bow_simple.most_common(10))

[(',', 151),
 ('the', 150),
 ('.', 89),
 ('of', 81),
 ("''", 69),
 ('to', 63),
 ('a', 60),
 ('``', 47),
 ('in', 44),
 ('and', 41)]


Great work!

## Simple text preprocessing
  
In this video, we will cover some simple text preprocessing.
  
**Why preprocess?**
  
Text processing helps make for better input data when performing machine learning or other statistical methods. For example, in the last few exercises you have applied small bits of preprocessing (like tokenization) to create a bag of words. You also noticed that applying simple techniques like lowercasing all of the tokens, can lead to slightly better results for a bag-of-words model. Preprocessing steps like tokenization or lowercasing words are commonly used in NLP. Other common techniques are things like *lemmatization* or *stemming*, where you shorten the words to their root stems, or techniques like removing stop words, which are common words in a language that don't carry a lot of meaning -- such as and or the, or removing punctuation or unwanted tokens. Of course, each model and process will have different results -- so it's good to try a few different approaches to preprocessing and see which works best for your task and goal.
  
**Preprocessing example**
  
We have here some example input and output text we might expect from preprocessing. First we have a simple two sentence string about pets. Then we have some example output tokens we want. You can see that the text has been tokenized and that everything is lowercase. We also notice that stopwords have been removed and the plural nouns have been made singular.
  
<img src='../_images/nlp-preprocessing-examples.png' alt='img' width='530'>
  
**Text preprocessing with Python**
  
We can perform text preprocessing using many of the tools we already know and have learned. In this code, we are using the same text as from our previous video, a few sentences about a cat with a box. We can use list comprehensions to tokenize the sentences which we first make lowercase using the string `.lower()` method. The string `.is_alpha()` method will return `True` if the string has only alphabetical characters. We use the `.is_alpha()` method along with an if statement iterating over our tokenized result to only return only alphabetic strings (this will effectively strip tokens with numbers or punctuation). To read out the process in both code and English we say we take each token from the `word_tokenize` output of the lowercase text if it contains only alphabetical characters. In the next line, we use another list comprehension to remove words that are in the stopwords list. This stopwords list for english comes built in with the NLTK library. Finally, we can create a counter and check the two most common words, which are now cat and box (unlike the and box which were the two tokens returned in our first result). Preprocessing has already improved our bag of words and made it more useful by removing the stopwords and non-alphabetic words.
  
<img src='../_images/nlp-preprocessing-examples1.png' alt='img' width='530'>
  
**Let's practice!**
  
You can now get started by preprocessing your own text!

### Text preprocessing steps
  
Which of the following are useful text preprocessing steps?
  
Possible Answers
  
- [ ] Stems, spelling corrections, lowercase.
- [x] Lemmatization, lowercasing, removing unwanted tokens.
- [ ] Removing stop words, leaving in capital words.
- [ ] Strip stop words, word endings and digits.
  
Well done!

### Text preprocessing practice
  
Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.
  
You start with the same tokens you created in the last exercise: `lower_tokens`. You also have the `Counter` class imported.
  
1. Import the `WordNetLemmatizer` class from `nltk.stem`.
2. Create a list `alpha_only` that contains only alphabetical characters. You can use the `.isalpha()` method to check for this.
3. Create another list called `no_stops` consisting of words from `alpha_only` that are not contained in `english_stops`.
4. Initialize a `WordNetLemmatizer` object called `wordnet_lemmatizer` and use its `.lemmatize()` method on the tokens in `no_stops` to create a new list called lemmatized.
5. Create a new `Counter` called bow with the lemmatized words.
6. Lastly, print the 10 most common tokens.

In [25]:
# import nltk
# import ssl
# nltk.download('wordnet')
# Out [i]: [nltk_data] Error loading wordnet: <urlopen error [SSL:
# Out [i]: [nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
# Out [i]: [nltk_data]     unable to get local issuer certificate (_ssl.c:992)>
# Out [i]: False

import nltk
import ssl


# Disable SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alexandergursky/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

> Please note that disabling SSL certificate verification introduces potential security risks. It is recommended to enable certificate verification in a production environment. 

In [26]:
# Loading english stopwords
with open('../_datasets/english_stopwords.txt', 'r') as file:
    english_stops = file.read()

In [27]:
from nltk.stem import WordNetLemmatizer


# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()  # Lemmatize using WordNet's built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet.

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
pprint(bow.most_common(10))

[('debugging', 39),
 ('system', 25),
 ('bug', 17),
 ('software', 16),
 ('problem', 15),
 ('tool', 15),
 ('computer', 14),
 ('process', 13),
 ('term', 13),
 ('debugger', 13)]


Great work!

## Introduction to `gensim`
  
In this video, we will get started using a new tool called `Gensim`.
  
**What is `gensim`?**
  
**`Gensim`** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks like building document or word vectors, corpora and performing topic identification and document comparisons.
  
**gensim**  
  
- Popular open-source NLP library
- Uses top academic models to perform complex tasks
- Building document or word vectors
- Performing topic identification and document comparison
  
**What is a word vector?**
  
You might be wondering what a word or document vector is? Here are some examples <span style="color:red;">**(Image Error: IMAGE NOT SHOWN IN VIDEO)**</span> here in visual form. A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word or document. You can think of it as a multi-dimensional array normally with sparse features (lots of zeros and some ones). With these vectors, we can then see relationships among the words or documents based on how near or far they are and also what similar comparisons we find. For example, in this graphic we can see that the vector operation king minus queen is approximately equal to man minus woman. Or that Spain is to Madrid as Italy is to Rome. The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.
  
**`Gensim` example**
  
The graphic we have here is an example of LDA visualization. LDA stands for *latent dirichlet allocation*, and it is a statistical model we can apply to text using `Gensim` for topic analysis and modelling. This graph is just a portion of a blog post written in 2015 using `Gensim` to analyze US presidential addresses. The article is really neat and you can find the link here.
  
<img src='../_images/gensim-example-usa-presidential-addresses.png' alt='img' width='530'>
  
**Creating a `gensim` dictionary**
  
`Gensim` allows you to build corpora and dictionaries using simple classes and functions. A corpus (or if plural, corpora) is a set of texts used to help perform natural language processing tasks. Here, our documents are a list of strings that look like movie reviews about space or sci-fi films. First we need to do some basic preprocessing. For brevity, we will only tokenize and lowercase. For better results, we would want to apply more of the preprocessing we have learned in this chapter, such as removing punctuation and stop words. Then we can pass the tokenized documents to the `Gensim` Dictionary class. This will create a mapping with an id for each token. This is the beginning of our corpus. We now can represent whole documents using just a list of their token ids and how often those tokens appear in each document. We can take a look at the tokens and their ids by looking at the `.token2id` attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.
  
<img src='../_images/gensim-example-usa-presidential-addresses1.png' alt='img' width='530'>
  
**Creating a `gensim` corpus**
  
Using the dictionary we built in the last slide, we can then create a `Gensim` corpus. This is a bit different than a normal corpus -- which is just a collection of documents. `Gensim` uses a simple bag-of-words model which transforms each document into a bag of words using the token ids and the frequency of each token in the document. Here, we can see that the `Gensim` corpus is a list of lists, each list item representing one document. Each document a series of tuples, the first item representing the `tokenid` from the dictionary and the second item representing the token frequency in the document. In only a few lines, we have a new bag-of-words model and corpus thanks to `Gensim`. And unlike our previous Counter-based bag of words, this `Gensim` model can be easily saved, updated and reused thanks to the extra tools we have available in `Gensim`. Our dictionary can also be updated with new texts and extract only words that meet particular thresholds. We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises.
  
<img src='../_images/gensim-example-usa-presidential-addresses2.png' alt='img' width='530'>
  
**Let's practice!**
  
Now you can get started building your own dictionary with `Gensim`!

### What are word vectors?
  
What are word vectors and how do they help with NLP?
  
Possible Answers
  
- [ ] They are similar to bags of words, just with numbers. You use them to count how many tokens there are.
- [ ] Word vectors are sparse arrays representing bigrams in the corpora. You can use them to compare two sets of words to one another.
- [x] Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.
- [ ] Word vectors don't actually help NLP and are just hype.
  
Well done! Keep working to use some word vectors yourself!

### Creating and querying a corpus with `gensim`
  
It's time to apply the methods you learned in the previous video to create your first `gensim` dictionary and corpus!
  
You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called `articles`. You'll need to do some light preprocessing and then generate the `gensim` dictionary and corpus.
  
1. Import `Dictionary` from `gensim.corpora.dictionary`.
2. Initialize a `gensim` `Dictionary` with the tokens in `articles`.
3. Obtain the id for "`computer`" from `dictionary`. To do this, use its `.token2id` method which returns ids from text, and then chain `.get()` which returns tokens from ids. Pass in "`computer`" as an argument to `.get()`.
4. Use a list comprehension in which you iterate over articles to create a `gensim` `MmCorpus` from `dictionary`.
5. In the output expression, use the `.doc2bow()` method on dictionary with article as the argument.
6. Print the first 10 word ids with their frequency counts from the fifth document. This has been done for you, so hit 'Submit Answer' to see the results!

In [28]:
import glob


# Extracting all txt files in directory, and preprocessing in-order to do exercise, originally not given
path_list = glob.glob('../_datasets/wikipedia_articles/*.txt')
articles = []                                   # Storing articles, global iterable variable to append to
for article_path in path_list:
    article = []                                # 'Holding-cell' for all extracted files, local iterable
    with open(article_path, 'r') as file:
        a = file.read()                         # Cycled variable that cycles the articles
    tokens = word_tokenize(a)                   # Tokenization of words in article[i]
    lower_tokens = [t.lower() for t in tokens]  # Convert all tokenized-words to lowercase
    
    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]

    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in english_stops]
    articles.append(no_stops)

In [29]:
from gensim.corpora.dictionary import Dictionary  # pip3 install gensim


# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)  # Dictionary encapsulates the mapping between normalized words and their integer ids.

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")  # computer_id = int(223)

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a MmCorpus: corpus
corpus = [dictionary.doc2bow(article) for article in articles]  # Corpus serialized using the sparse coordinate Matrix Market format

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

computer
[(4, 1), (6, 6), (7, 2), (9, 5), (18, 1), (19, 1), (20, 1), (22, 1), (24, 2), (28, 3)]


Great work!

### Gensim bag-of-words
  
Now, you'll use your new `gensim` corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!
  
You have access to the `dictionary` and `corpus` objects you created in the previous exercise, as well as the Python `defaultdict` and `itertools` to help with the creation of intermediate data structures for analysis.
  
- `defaultdict` allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument `int`, we are able to ensure that any non-existent keys are automatically assigned a default value of `0`. This makes it ideal for storing the counts of words in this exercise.
  
- `itertools.chain.from_iterable()` allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
  
The fifth document from `corpus` is stored in the variable `doc`, which has been sorted in descending order.
  
<br></br>

1. Using the first for loop, print the top five words of `bow_doc` using each `word_id` with the `dictionary` alongside `word_count`.
- The `word_id` can be accessed using the `.get()` method of `dictionary`.
2. Create a `defaultdict` called `total_word_count` in which the keys are all the token ids (word_id) and the values are the sum of their occurrence across all documents (`word_count`).
3. Remember to specify int when creating the `defaultdict`, and inside the second for loop, increment each `word_id` of `total_word_count` by `word_count`.
4. Create a sorted list from the `defaultdict`, using words across the entire corpus. To achieve this, use the `.items()` method on `total_word_count` inside `sorted()`.
5. Similar to how you printed the top five words of `bow_doc` earlier, print the top five words of `sorted_word_count` as well as the number of occurrences of each word across all the documents.


In [30]:
from collections import defaultdict  # The default factory is called without arguments to produce a new value when a key is not present
import itertools


# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    

# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    

# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)



computer 252
computers 100
first 61
cite 59
computing 59
computer 597
software 451
cite 322
ref 259
code 235


Good work!

## Tf-idf with gensim
  
In this video, we will learn how to use a TFIDF model with Gensim.
  
**What is tf-idf?**
  
Tf-idf stands for term-frequncy - inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in the corpus. The idea behind tf-idf is that each corpus might have more shared words than just stopwords. These common words are like stopwords and should be removed or at least down-weighted in importance. For example, if I am an astronomer, sky might be used often but is not important, so I want to downweight that word. TF-Idf does precisely that. It will take texts that share common language and ensure the most common words across the entire corpus don't show up as keywords. Tf-idf helps keep the document-specific frequent words weighted high and the common words across the entire corpus weighted low.
  
<img src='../_images/what-is-tf-idf.png' alt='img' width='530'>
  
**Tf-idf formula**
  
The equation to calculate the weights can be outlined like so: The weight of token $i$ in document $j$ is calculated by taking the term frequency (or how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the same term. Let's unpack this a bit. First, the weight will be low if the term doesnt appear often in the document because the $tf$ variable will then be low. However, the weight will also be a low if the logarithm ($\log()$) is close to zero, meaning the internal equation is low. Here we can see if the total number of documents divded by the number of documents that have the term is close to one, then our logarithm will be close to zero. So words that occur across many or all documents will have a very low tf-idf weight. On the contrary, if the word only occurs in a few documents, that logarithm will return a higher number.
  
$formula:$
  
$\Large w_{i, j} = \text{tf}_{i, j} â€¢ \log (\frac{N}{\text{df}_i})$  
  
$where:$
  
$w_{i,j}$ = tf-idf for token $i$ in document $j$  
$tf_{i,j}$ = Number of occurences for token $i$ in document $j$  
$df_{i}$ = Number of documents that contain token $i$  
$N$ = Total number of documents  
  
**Tf-idf with gensim**
  
You can build a Tfidf model using Gensim and the corpus you developed previously. Taking a look at the corpus we used in the last video, around movie reviews, we can use the Bag of Words corpus to translate it into a TF-idf model by simply passing it in initialization. We can then reference each document by using it like a dictionary key with our new tfidf model. For the second document in our corpora, we see the token weights along with the token ids. Notice there are some large differences! Token id 10 has a weight of 0.77 whereas tokens 0 and 1 have weights below 0.18. These weights can help you determine good topics and keywords for a corpus with shared vocabulary.
  
<img src='../_images/what-is-tf-idf1.png' alt='img' width='530'>
  
**Let's practice!**
  
Now you can build a tfidf model using Gensim to explore topics in the Wikipedia article list.

## What is tf-idf?
  
You want to calculate the tf-idf weight for the word "computer", which appears five times in a document containing 100 words. Given a corpus containing 200 documents, with 20 documents mentioning the word "computer", tf-idf can be calculated by multiplying term frequency with inverse document frequency.
  
Term frequency = percentage share of the word compared to all tokens in the document Inverse document frequency = logarithm of the total number of documents in a corpora divided by the number of documents containing the term
  
Which of the below options is correct?
  
Possible answers  
  
- [x] (5 / 100) * log(200 / 20)
- [ ] (5 * 100) / log(200 * 20)
- [ ] (20 / 5) * log(200 / 20)
- [ ] (200 * 5) * log(400 / 5)
  
Correct!

Tf-idf with Wikipedia
Now it's your turn to determine new significant terms for your corpus by applying gensim's tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - `dictionary`, `corpus`, and `doc`. Will tf-idf make for more interesting results on the document level?
  
`TfidfModel` has been imported for you from `gensim.models.tfidfmodel`.
  
1. Initialize a new `TfidfModel` called `tfidf` using `corpus`.
2. Use `doc` to calculate the weights. You can do this by passing `[doc]` to `tfidf`.
3. Print the first five term ids with weights.
4. Sort the term ids and weights in a new list from highest to lowest weight. *This has been done for you.*
5. Using your pre-existing `dictionary`, print the top five weighted words (`term_id`) from `sorted_tfidf_weights`, along with their weighted score (`weight`).

In [31]:
from gensim.models.tfidfmodel import TfidfModel


# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

[(4, 0.005149712197382678), (6, 0.005127761019345027), (7, 0.008207466968432171), (9, 0.02574856098691339), (18, 0.0032491066476585816)]


In [32]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

mechanical 0.1847740145909077
circuit 0.15142303140509794
manchester 0.1427799203657014
alu 0.1397751059123981
thomson 0.12812718041969826


Great work!