**Introduction to regular expressions**
___
- What is Natural Language Processing?
    - field of study focused on making sense of language
        - using statistics and computers
    - you will learn the basics of NLP
        - topic identification
        - text classification
    - other NLP applications
        - chatbots
        - translation
        - sentiment analysis
- What exactly are regular expressions?
    - strings with a special syntax
    - allow us to match patterns in other strings
    - applications of regular expressions:
        - find all web links in a document
        - parse email addresses
        - remove/replace unwanted characters
- in Python
    - import re
    - re.match('pattern', 'string')
![_images/18.1.PNG](_images/18.1.PNG)
![_images/18.2.PNG](_images/18.2.PNG)
___
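
A minimal sketch of the calls listed above, using only the standard-library `re` module. The sample string and variable names are illustrative assumptions, not taken from the course material.

```python
import re

text = "hi there, hi again"  # hypothetical sample string

# re.match only succeeds if the pattern matches at the start of the string
match_start = re.match(r"hi", text)
match_word = re.match(r"there", text)   # None: "there" is not at the start

# re.search scans the whole string for the first match
search_word = re.search(r"there", text)

# re.findall returns every non-overlapping match as a list of strings
all_hi = re.findall(r"hi", text)

print(match_start.group(), match_word, search_word.span(), all_hi)
```

In every call the pattern comes first and the string to examine second.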

In [2]:
#Practicing regular expressions: re.split() and re.findall()

#Now you'll get a chance to write some regular expressions to match
#digits, strings and non-alphanumeric characters. Take a look at
#my_string first by printing it in the IPython Shell, to determine
#how you might best match the different steps.

#Note: It's important to prefix your regex patterns with r to ensure
#that they are interpreted the way you intend. Otherwise, you may run
#into problems with escape sequences in strings. For example, "\n" in
#Python indicates a new line, but with the r prefix it is read as the
#raw string r"\n" - that is, the character "\" followed by the
#character "n" - and not as a new line.

#The regular expression module re has already been imported for you.

#Remember from the video that the syntax for the regex library is to
#always pass the pattern first and the string second.

import re
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

#################################################
#Practice is the key to mastering RegEx.

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
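
The note about the `r` prefix in the exercise above can be checked directly. This is a small sketch with made-up strings; it contrasts `\n`, which the regex engine understands in both forms, with `\b`, where the two forms behave differently.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash followed by "n"
print(len(plain), len(raw))

# For \n the regex engine understands both forms, so both find the newline
print(re.findall("\n", "a\nb"), re.findall(r"\n", "a\nb"))

# For \b the two forms differ: "\b" is a backspace character in a normal
# string, while r"\b" reaches the regex engine as a word boundary
no_raw = re.findall("\bcat\b", "cat concat")
with_raw = re.findall(r"\bcat\b", "cat concat")
print(no_raw, with_raw)
```

Using the `r` prefix on every pattern avoids having to remember which escapes behave which way.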


**Introduction to tokenization**
___
- What is tokenization?
    - turning a string or document into **tokens** (smaller chunks)
    - one step in preparing a text for NLP
    - many different theories and rules
    - you can create your own rules using regular expressions
    - some examples:
        - breaking out words or sentences
        - separating punctuation
        - separating all hashtags in a tweet
    - Why tokenize?
        - easier to map parts of speech
        - matching common words
        - removing unwanted tokens
    - nltk library
___
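
The point about creating your own tokenization rules with regular expressions can be sketched with the standard-library `re` module alone. The example text and the three rules are illustrative assumptions; NLTK's tokenizers apply more elaborate rules than these.

```python
import re

tweet = "Loving #nlp and #python, thanks @datacamp!"  # hypothetical example text

# Rule 1: words only (punctuation is dropped)
words = re.findall(r"\w+", tweet)

# Rule 2: words and punctuation marks as separate tokens
words_and_punct = re.findall(r"\w+|[^\w\s]", tweet)

# Rule 3: only hashtags
hashtags = re.findall(r"#\w+", tweet)

print(words)
print(words_and_punct)
print(hashtags)
```

Each rule yields a different token list from the same text, which is why the choice of tokenizer depends on the task.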

In [1]:
#Word tokenization with NLTK

#Here, you'll be using the first scene of Monty Python's Holy Grail,
#which has been pre-loaded as scene_one. Feel free to check it out in
#the IPython Shell!

#Your job in this exercise is to utilize word_tokenize and sent_tokenize
#from nltk.tokenize to tokenize both words and sentences from Python
#strings - in this case, the first scene of Monty Python's Holy Grail.

scene_one="SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  
That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{'beat', 'under', 'winter', 'matter', 'kingdom', 'Oh', 'It', 'suggesting', 'minute', 'King', 'SCENE', 'halves', 'get', 'they', 'use', 'house', 'Patsy', 'coconut', "'em", 'They', 'together', "'s", 'course', 'sovereign', 'since', 'covered', 'I', "n't", 'five', 'speak', 'zone', 'and', 'point', 'strand', 'why', 'horse', 'Whoa', 'in', '.', 'In', 'Well', 'or', 'England', '1', 'Halt', 'here', "'re", 'Where', 'What', 'to', 'times', 'one', 'weight', 'bird', 'he', 'carrying', 'creeper', 'master', 'who', 'bring', 'ounce', 'Found', 'this', 'lord', 'at', 'coconuts', 'No', 'tropical', 'Please', 'goes', 'martin', 'other', 'African', 'through', 'seek', 'maintain', 'Mercea', 'Not', 'join', 'bangin', 'be', 'ask', 'non-migratory', 'anyway', 'grips', 'by', 'will', 'am', 'does', 'migrate', 'empty', 'KING', '[', 'plover', 'an', 'you', 'carry', 'found', 'are', 'We', 'could', 'fly', 'but', 'Yes', 'it', 'Will', 'mean', 'You', 'just', "'d", 'Ridden', 'swallow', 'simple', 'tell', '!', 'these', 'my', '--', '#', '

In [4]:
#More regex with re.search()

#In this exercise, you'll utilize re.search() and re.match() to find
#specific tokens. Both search and match expect regex patterns, similar
#to those you defined in an earlier exercise. You'll apply these regex
#library methods to the same Monty Python text from the nltk corpora.

#You have both scene_one and sentences available from the last exercise;
#now you can use them with re.search() and re.match() to extract and
#match more text.

# Import necessary modules
from nltk.tokenize import sent_tokenize
import re

scene_one="SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  
That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

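# Write a regular expression to search for anything in square brackets: pattern1
# Note: .* is greedy, so on a line with several bracketed parts the match
# runs from the first "[" to the last "]" on that line
pattern1 = r"\[.*\]"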

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

#################################################
#Now that you're familiar with the basics of tokenization and
#regular expressions, it's time to learn about more advanced tokenization.

580 588
<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
<re.Match object; span=(0, 7), match='ARTHUR:'>
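
Two behaviours are visible in the output above: `re.match` only looks at the start of the string, and `.*` is greedy. The sketch below isolates both on the first line of the scene; the non-greedy `.*?` variant is an addition for comparison and is not part of the exercise.

```python
import re

line = "SCENE 1: [wind] [clop clop clop]"

# re.match anchors at position 0; re.search looks anywhere in the string
print(re.match(r"\[", line))            # None: the line starts with "S"
first_bracket = re.search(r"\[", line)

# Greedy .* runs to the last "]"; non-greedy .*? stops at the first one
greedy = re.search(r"\[.*\]", line).group()
lazy = re.search(r"\[.*?\]", line).group()
print(first_bracket.start(), greedy, lazy)
```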


**Advanced tokenization with NLTK and regex**
___
- Regex groups using the "|"
    - OR is represented using |
    - You can define a group using ()
    - You can define explicit character ranges using []
- Regex ranges and groups
![_images/18.3.PNG](_images/18.3.PNG)
___
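
A short sketch of OR, groups, and character ranges with `re.findall`. The sample strings are hypothetical and chosen only to make the behaviour easy to see.

```python
import re

text = "He has 11 cats and 2 dogs."  # hypothetical sample

# OR inside a group: match either a run of digits or a run of word characters
digits_or_words = re.findall(r"(\d+|\w+)", text)

# Explicit character range: lowercase letters only
lowercase_runs = re.findall(r"[a-z]+", text)

# A range that also lists extra characters: letters, digits, "-" and "."
mixed = re.findall(r"[A-Za-z0-9\-\.]+", "My website is example-site.org, visit it!")

print(digits_or_words)
print(lowercase_runs)
print(mixed)
```

Note that the lowercase range splits "He" and keeps only "e", because "H" falls outside `[a-z]`.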

In [7]:
#Regex with NLTK tokenization

#Twitter is a frequently used source for NLP text and tasks. In this
#exercise, you'll build a more complex tokenizer for tweets with
#hashtags and mentions using nltk and regex. The nltk.tokenize.TweetTokenizer
#class gives you some extra methods and attributes for parsing tweets.

#Here, you're given some example tweets to parse using both
#TweetTokenizer and regexp_tokenize from the nltk.tokenize module.
#These example tweets have been pre-loaded into the variable tweets.
#Feel free to explore it in the IPython Shell!

#Unlike the syntax for the regex library, with regexp_tokenize() you
#pass the pattern as the second argument.

# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

tweets = ['This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @datacamp :) #nlp #python']

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[2], pattern2)
print(mentions_hashtags)

# Use the TweetTokenizer to tokenize all tweets into one list (one token list per tweet)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

['#nlp', '#python']
['@datacamp', '#nlp', '#python']
[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]


In [8]:
#Non-ascii tokenization

#In this exercise, you'll practice advanced tokenization by tokenizing
#some non-ascii based text. You'll be using German with emoji!

#Here, you have access to a string called german_text, which has
#been printed for you in the Shell. Notice the emoji and the German
#characters!

#The following modules have been pre-imported from nltk.tokenize:
#regexp_tokenize and word_tokenize.

#Unicode ranges for emoji are:

#('\U0001F300'-'\U0001F5FF'), ('\U0001F600'-'\U0001F64F'),
#('\U0001F680'-'\U0001F6FF'), and ('\u2600'-'\u26FF', '\u2700'-'\u27BF').

from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize

german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

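# Tokenize and print only emoji
# (inside [...] the ranges are listed back to back; quotes and "|" would be
# treated as literal characters to match, so they are left out)
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"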
print(regexp_tokenize(german_text, emoji))

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
['Wann', 'Pizza', 'Und', 'Über']
['🍕', '🚕']
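
Inside square brackets, `|` and quote characters are ordinary members of the set rather than operators. A quick check with plain ASCII ranges (a hypothetical stand-in for the emoji ranges) shows the difference:

```python
import re

# Inside [...], "|" and "'" are ordinary characters, not OR or quoting
quoted_class = r"['a-c'|'x-z']"
plain_class = r"[a-cx-z]"

sample = "a|b'z"
with_extras = re.findall(quoted_class, sample)   # also picks up "|" and "'"
letters_only = re.findall(plain_class, sample)   # letters only
print(with_extras, letters_only)
```

Listing the ranges back to back keeps the character class limited to the intended characters.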


**Charting word length with NLTK**
___

In [None]:
#Charting practice

#Try using your new skills to find and chart the number of words per
#line in the script using matplotlib. The Holy Grail script is loaded
#for you, and you need to use regex to find the words per line.

#Using list comprehensions here will speed up your computations. For
#example: my_lines = [tokenize(l) for l in lines] will call a function
#tokenize on each line in the list lines. The new transformed list
#will be saved in the my_lines variable.

#You have access to the entire script in the variable holy_grail.
#Go for it!

# Split the script into lines: lines
#lines = holy_grail.split('\n')

# Remove the speaker prefix (e.g. "SOLDIER #1:") from each script line
#pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
#lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
#tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
#line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
#plt.hist(line_num_words)

# Show the plot
#plt.show()

![_images/18.1.svg](_images/18.1.svg)
See you in Chapter 2, where you'll begin learning about topic identification!
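
The steps of the charting exercise can be tried without the preloaded script. This sketch uses three shortened lines based on the scene shown earlier and `re.findall` in place of `regexp_tokenize` (an assumption made to keep it dependency-free). The resulting `line_num_words` list is the kind of input `plt.hist` would receive.

```python
import re

# Three shortened lines based on the scene_one text shown earlier
script = ("KING ARTHUR: Whoa there!\n"
          "SOLDIER #1: Halt!  Who goes there?\n"
          "ARTHUR: Yes!")

lines = script.split("\n")

# Strip the speaker prefix, e.g. "SOLDIER #1:" or "KING ARTHUR:"
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, "", l) for l in lines]

# Tokenize each line into words and count them
tokenized_lines = [re.findall(r"\w+", l) for l in lines]
line_num_words = [len(t_line) for t_line in tokenized_lines]
print(line_num_words)
```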

**Word counts with bag-of-words**
___
- first create tokens using tokenization
- count all of the tokens
- the more frequent a word, the more important it might be
___
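
The three steps above can be sketched with `collections.Counter`. The sample text is hypothetical, and a simple regex stands in for `word_tokenize` so the example runs without NLTK.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box."  # hypothetical text

# Tokenize (here with a simple regex instead of word_tokenize) and lowercase
tokens = [t.lower() for t in re.findall(r"\w+", text)]

# Count every token and look at the most frequent one
bow = Counter(tokens)
print(bow.most_common(1))
```

As in the exercise below, the most frequent token is often a stop word such as "the", which motivates the preprocessing covered next.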

In [None]:
#Building a Counter with bag-of-words

#In this exercise, you'll build your first (in this course)
#bag-of-words counter using a Wikipedia article, which has been
#pre-loaded as article. Try doing the bag-of-words without looking
#at the full article text, and guessing what the topic is! If you'd
#like to peek at the title at the end, we've included it as
#article_title. Note that this article text has had very little
#preprocessing from the raw Wikipedia database entry.

#word_tokenize has been imported for you.

# Import Counter
#from collections import Counter

# Tokenize the article: tokens
#tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
#lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
#bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
#print(bow_simple.most_common(10))

#################################################
#<script.py> output:
#    [(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]
#################################################

**Simple text preprocessing**
___
- Why preprocess?
    - helps make for better input data
        - when performing machine learning or other statistical methods
    - examples:
        - tokenization to create bag of words
        - lowercasing words
    - lemmatization/stemming
        - shorten words to their root stems
    - removing stop words, punctuation, or unwanted tokens
    - good to experiment with different approaches
___
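
A rough sketch of two of these preprocessing steps in plain Python. The token list and the tiny stop-word set are hypothetical; lemmatization is left out here because it relies on NLTK and its WordNet data.

```python
from collections import Counter

# Hypothetical inputs: a token list and a tiny hand-written stop-word set
lower_tokens = ["the", "bugs", "in", "the", "system", ",", "2", "bugs", "!"]
stop_words = {"the", "in", "a", "of"}

# Keep alphabetic tokens only, then drop stop words
alpha_only = [t for t in lower_tokens if t.isalpha()]
no_stops = [t for t in alpha_only if t not in stop_words]

bow = Counter(no_stops)
print(bow.most_common(2))
```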

In [None]:
#Text preprocessing practice

#Now, it's your turn to apply the techniques you've learned to help
#clean up text for better NLP results. You'll need to remove stop
#words and non-alphabetic characters, lemmatize, and perform a new
#bag-of-words on your cleaned text.

#You start with the same tokens you created in the last exercise:
#lower_tokens. You also have the Counter class imported.

# Import WordNetLemmatizer
#from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
#alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
# (english_stops is assumed to be a preloaded list of English stop words)
#no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
#wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
#lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
#bow = Counter(lemmatized)

# Print the 10 most common tokens
#print(bow.most_common(10))

#################################################
#<script.py> output:
#    [('debugging', 40), ('system', 25), ('software', 16), ('bug', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('used', 12)]
#################################################

**Introduction to gensim**
___
- What is gensim?
    - popular open-source NLP library
    - uses top academic models to perform complex tasks
        - building document or word vectors
        - performing topic identification and document comparison
![_images/18.4.PNG](_images/18.4.PNG)
![_images/18.5.PNG](_images/18.5.PNG)
___
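
To make the dictionary and corpus structures concrete, the sketch below imitates them in plain Python: a token-to-id mapping and, per document, a list of `(token_id, count)` pairs. This is an illustrative stand-in, not gensim's implementation; the function name `to_bow` and the sample documents are hypothetical, and gensim may assign different ids.

```python
from collections import Counter

documents = [["movie", "about", "spaceship"],
             ["movie", "movie", "review"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

corpus = [to_bow(doc) for doc in documents]
print(token2id)
print(corpus)
```

The output of the exercise below has the same shape: each document becomes a list of id and count pairs, and the dictionary translates ids back to words.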

In [None]:
#Creating and querying a corpus with gensim

#It's time to apply the methods you learned in the previous video to
#create your first gensim dictionary and corpus!

#You'll use these data structures to investigate word trends and
#potential interesting topics in your document set. To get started,
#we have imported a few additional messy articles from Wikipedia,
#which were preprocessed by lowercasing all words, tokenizing them,
#and removing stop words and punctuation. These were then stored in
#a list of document tokens called articles. You'll need to do some
#light preprocessing and then generate the gensim dictionary and corpus.

# Import Dictionary
#from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
#dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
#computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
#print(dictionary.get(computer_id))

# Create a bag-of-words corpus (one doc2bow list per article): corpus
#corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
#print(corpus[4][:10])

#################################################
#<script.py> output:
#    computer
#    [(0, 88), (23, 11), (24, 2), (39, 1), (41, 2), (55, 22), (56, 1), (57, 1), (58, 1), (59, 3)]
#################################################

In [None]:
#Gensim bag-of-words

#Now, you'll use your new gensim corpus and dictionary to see the
#most common terms per document and across all documents. You can
#use your dictionary to look up the terms. Take a guess at what the
#topics are and feel free to explore more documents in the IPython
#Shell!

#You have access to the dictionary and corpus objects you created in
#the previous exercise, as well as the Python defaultdict and itertools
#to help with the creation of intermediate data structures for analysis.

#defaultdict allows us to initialize a dictionary that will assign
#a default value to non-existent keys. By supplying the argument
#int, we are able to ensure that any non-existent keys are automatically
#assigned a default value of 0. This makes it ideal for storing the
#counts of words in this exercise.

#itertools.chain.from_iterable() allows us to iterate through a set
#of sequences as if they were one continuous sequence. Using this
#function, we can easily iterate through our corpus object (which is
#a list of lists).

#The fifth document from corpus is stored in the variable doc, which
#has been sorted in descending order.

# Save the fifth document: doc
#doc = corpus[4]

# Sort the doc for frequency: bow_doc
#bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
#for word_id, word_count in bow_doc[:5]:
#    print(dictionary.get(word_id), word_count)

# Create the defaultdict: total_word_count
#total_word_count = defaultdict(int)
#for word_id, word_count in itertools.chain.from_iterable(corpus):
#    total_word_count[word_id] += word_count

#################################################
#<script.py> output:
#    engineering 91
#    '' 88
#    reverse 71
#    software 51
#    cite 26
#################################################

# Create a sorted list from the defaultdict: sorted_word_count
#sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the top 5 words across all documents alongside the count
#for word_id, word_count in sorted_word_count[:5]:
#    print(dictionary.get(word_id), word_count)

#################################################
#<script.py> output:
#    engineering 91
#    '' 88
#    reverse 71
#    software 51
#    cite 26
#    '' 1042
#    computer 594
#    software 450
#    `` 345
#    cite 322
#################################################
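
The roles of `defaultdict(int)` and `itertools.chain.from_iterable()` described in the exercise above can be seen on a tiny corpus. The word ids and counts here are made up for illustration.

```python
import itertools
from collections import defaultdict

# Hypothetical corpus: two documents as (word_id, count) pairs
corpus = [[(0, 2), (1, 1)],
          [(0, 3), (2, 5)]]

total_word_count = defaultdict(int)   # missing keys start at 0

# chain.from_iterable walks the pairs of all documents as one sequence
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)
print(sorted_word_count)
```

Without `defaultdict`, the first `+=` for each new word id would raise a `KeyError`.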

**Tf-idf with gensim**
___
- What is tf-idf?
    - term frequency-inverse document frequency
    - allows you to determine the most important words in each document
    - each corpus may have shared words beyond just stopwords
    - these words should be down-weighted in importance
    - ensures most common words do not show up as key words
    - keeps document-specific frequent words weighted high
![_images/18.6.PNG](_images/18.6.PNG)
___
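
A common textbook form of the weight is `w(i, j) = tf(i, j) * log(N / df(i))`, where `tf(i, j)` is the count of term i in document j, `df(i)` is the number of documents containing term i, and `N` is the number of documents. The sketch below computes that form by hand on a hypothetical two-document corpus. It is not gensim's code; `TfidfModel` applies its own defaults (such as normalization), so its numbers will generally differ.

```python
import math

# Hypothetical corpus in (word_id, count) form
corpus = [[(0, 3), (1, 1)],
          [(0, 2), (2, 4)]]

num_docs = len(corpus)

# Document frequency: in how many documents each word id appears
doc_freq = {}
for doc in corpus:
    for word_id, _ in doc:
        doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

def tfidf(doc):
    """Weight each (word_id, count) pair by count * log(N / df)."""
    return [(word_id, count * math.log(num_docs / doc_freq[word_id]))
            for word_id, count in doc]

weights = tfidf(corpus[1])
print(weights)
```

Word 0 appears in every document, so its weight drops to zero, while word 2 is specific to the second document and keeps a high weight.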

In [None]:
#Tf-idf with Wikipedia

#Now it's your turn to determine new significant terms for your
#corpus by applying gensim's tf-idf. You will again have access to
#the same corpus and dictionary objects you created in the previous
#exercises - dictionary, corpus, and doc. Will tf-idf make for more
#interesting results on the document level?

#TfidfModel has been imported for you from gensim.models.tfidfmodel.

# Create a new TfidfModel using the corpus: tfidf
#tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
#tfidf_weights = tfidf[doc]

# Print the first five weights
#print(tfidf_weights[:5])

#################################################
#<script.py> output:
#    [(24, 0.0022836332291091273), (39, 0.0043409401554717324), (41, 0.008681880310943465), (55, 0.011988285029371418), (56, 0.005482756770026296)]
#################################################

# Sort the weights from highest to lowest: sorted_tfidf_weights
#sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
#for term_id, weight in sorted_tfidf_weights[:5]:
#    print(dictionary.get(term_id), weight)

#################################################
#<script.py> output:
#    reverse 0.4884961428651127
#    infringement 0.18674529210288995
#    engineering 0.16395041814479536
#    interoperability 0.12449686140192663
#    reverse-engineered 0.12449686140192663
#################################################