# Introduction to Natural Language Processing in Python
In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

# $\star$ Chapter 1: Regular expressions & word tokenization
This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and the more difficult tokenization cases you might encounter.

#### What is Natural Language Processing?
* Massive field of study focused on making sense of language using statistics and computers
* Some of the basics of NLP:
    * Topic identification
    * Text classification
* NLP applications include:
    * topic identification
    * chatbots
    * text classification
    * translation
    * sentiment analysis
    * ... many, many more!
    
#### Regular Expressions
* **Regular expressions** are strings with a special syntax that lets you match patterns and find other strings
* A **pattern** is a series of letters or symbols that can map to actual text, words, or punctuation.
* Applications of regular expressions:
    * Find links in a webpage or document
    * Parse email addresses
    * Remove unwanted strings or characters
* Regular expressions are often referred to as **regex** and can be used easily with Python via the `re` library
* Match a substring at the start of a string using the `re.match()` function:

In [40]:
import re
import nltk
from nltk.tokenize import word_tokenize, regexp_tokenize, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from collections import Counter
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
import spacy

In [6]:
#nltk.download('punkt')

In [2]:
re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

* `re.match()` takes the pattern as the first argument, the string as the second argument, and returns a **match object**
* We can also use "special" patterns that regex understands, like `\w+`, which will match a word:

In [3]:
word_regex = r'\w+'
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>

* There are hundreds of characters and patterns you can learn and memorize with regular expressions, but here we get started with a few common patterns:

<img src='data/common_regex.png' width="300" height="150" align="center"/>

#### Python's re module
* The **`re`** module provides functions such as:
    * **`split`**: split a string on a regex
    * **`findall`**: find all matches of a pattern in a string
    * **`search`**: search for a pattern anywhere in a string
    * **`match`**: match a pattern at the beginning of a string (the match may cover the whole string or only a leading substring)
* The syntax for these functions is always to pass the pattern first and the string second
* Depending on the function, it may return a list, an iterator, a new string, or a match object

In [4]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

* This can be used for tokenization, so you can process text using regex while doing NLP
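The four functions listed above can be compared side by side on one string. This is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Call me at 9am. Bring 2 pens, 10 notebooks!"

# split: break the string wherever the pattern matches (here, runs of whitespace)
pieces = re.split(r"\s+", text)

# findall: return every non-overlapping match as a list of strings
numbers = re.findall(r"\d+", text)

# search: return a match object for the first match anywhere in the string
first_number = re.search(r"\d+", text)

# match: only succeeds if the pattern matches at the very start of the string
starts_with_word = re.match(r"\w+", text)
starts_with_digit = re.match(r"\d+", text)  # None, because the string starts with "C"

print(pieces)
print(numbers)
print(first_number.group(), first_number.span())
print(starts_with_word.group(), starts_with_digit)
```

Note that `split` and `findall` return plain lists, while `search` and `match` return a match object (or `None` when nothing matches).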

#### Exercises: Practicing regular expressions: re.split() and re.findall()

```
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))
```

### Introduction to tokenization

* **Tokenization** is the process of transforming a string or document into smaller chunks, which we call tokens.
* One step in the process of preparing a text for NLP
* Many different theories and rules regarding tokenization
    * You can also create your own tokenization rules using regular expressions
* Some examples:
    * Breaking out words or sentences
    * Separating punctuation
    * Separating all hashtags in a tweet
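As a rough illustration of rolling your own tokenization rules with regular expressions, the sketch below covers the three examples listed above. The sample strings and patterns are assumptions chosen for illustration; real text usually needs more careful rules.

```python
import re

# Break out words and separate punctuation into its own tokens
word_tokens = re.findall(r"\w+|[^\w\s]", "Hi there!")

# Break a text into sentences: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", "Hi there! How are you? Fine.")

# Separate all hashtags (and, similarly, mentions) in a tweet
tweet = "Loving #NLP with @nltk_org! Best day ever #python"
hashtags = re.findall(r"#\w+", tweet)
mentions = re.findall(r"@\w+", tweet)

print(word_tokens)
print(sentences)
print(hashtags, mentions)
```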

#### nltk library
* One library that is commonly used for simple tokenization is `nltk`, the **Natural Language Toolkit** library

In [None]:
#nltk.download('punkt')

In [5]:
word_tokenize("Hi there!")

['Hi', 'there', '!']

#### Why tokenize?
* Tokenizing can help us with some simple text processing tasks like:
    * Mapping parts of speech
    * Matching common words
    * Removing unwanted tokens
    
#### Other nltk tokenizers
* **`sent_tokenize`:** tokenize a document into sentences
* **`regexp_tokenize`:** tokenize a string or document based on a regular expression pattern
* **`TweetTokenizer`:** special class just for tweet tokenization, allowing you to separate hashtags, mentions, and lots of exclamation points

#### More regex practice
* Difference between `re.search()` and `re.match()`:
    * When the pattern occurs at the beginning of the string, `search` and `match` return identical matches for the same pattern and string.
    
<img src='data/search_vs_match.png' width="600" height="300" align="center"/>    

* **Note that `match` only tries to match from the beginning of the string and stops as soon as it can no longer match, while `search` scans the ENTIRE string for a match.**
* So, if you need to find a pattern that might not be at the beginning of the string, you should use `search`
* If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use `match`
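A small sketch of the difference, using a short made-up string. `re.fullmatch()` is included as an extra: it is not covered in the notes above, but it is the standard-library function to reach for when the pattern must cover the whole string.

```python
import re

text = "abcdef"

# Pattern at the start of the string: both functions find the same match
print(re.match(r"abc", text))
print(re.search(r"abc", text))

# Pattern in the middle of the string: only search finds it
middle_match = re.match(r"cd", text)
middle_search = re.search(r"cd", text)
print(middle_match)
print(middle_search)

# fullmatch succeeds only if the pattern covers the entire string
whole = re.fullmatch(r"[a-f]+", text)
partial = re.fullmatch(r"abc", text)
print(whole, partial)
```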

#### Exercises: Word tokenization with NLTK

```
# Import necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize 

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
```


### Advanced tokenization with regex

#### Regex groups using or "|"
* OR is represented using **`|`**
* Define a group using **`()`**
    * Groups can be either a pattern or set of characters you want to match
* You can also define explicit character classes using **`[]`**
* Example: we want to find all digits and words using tokenization:

In [7]:
# import re
match_digits_and_words = r'(\d+|\w+)'

In [8]:
re.findall(match_digits_and_words, "He has 11 cats.")

['He', 'has', '11', 'cats']

* Pseudo script: "find all" digits and/or words.
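One detail worth knowing when combining groups with `re.findall()`: if the pattern contains a capturing group, `findall` returns the text captured by the group, not the whole match. The sketch below uses a made-up sentence to show this, along with the non-capturing form `(?:...)`.

```python
import re

text = "He has 11 cats and 2 dogs."

# The whole pattern is one group, so group text and full match are the same
digits_and_words = re.findall(r"(\d+|\w+)", text)

# Capturing group plus a suffix: findall returns only what the group captured
captured = re.findall(r"(cat|dog)s", text)

# Non-capturing group: findall returns the full match, including the suffix
full_matches = re.findall(r"(?:cat|dog)s", text)

print(digits_and_words)
print(captured)
print(full_matches)
```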

#### Regex ranges and groups

<img src='data/regex_ranges.png' width="600" height="300" align="center"/>

* **Note** that:
    * **ranges** are marked with **`[]`**
    * **groups** are marked with **`()`**
* We can see in the chart above that we can use square brackets to define a new character class
* Note in the third row of the chart that, because the hyphen and period are special characters in regex, we must tell regex we mean an ACTUAL period or hyphen
    * To do so we use what is called an **escape character**: in regex that means placing a backslash in front of the character so that regex looks for a literal hyphen or period.
* On the other hand, with groups, which are designated by parentheses, we can only match what we explicitly define in the group
    * For example, see row four in the chart above; this regex specifies exactly three characters to match, in that order: `a`, `-`, `z` (and *not* "all the lowercase letters between a and z").
    * **Groups are useful when you want to define an explicit set of characters.**
* Final example in the chart: spaces or a comma.
* In the code example below, use `match` with a character range to match all lowercase ascii, any digits, and spaces:

In [10]:
# import re
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<re.Match object; span=(0, 35), match='match lowercase spaces nums like 12'>

* The above regex is **greedy**, marked by the **`+`** after the range definition, but once it hits the comma, it can't match any more.
* This short example demonstrates that thinking about what regex method you use (such as `search` versus `match`) and whether you define a *group* or a *range* can have a large impact on the usefulness and readability of your patterns.
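To make the range-versus-group distinction concrete, here is a short sketch. The patterns are assumed to resemble those in the chart (which is not reproduced in the text), and the test strings are made up.

```python
import re

# Range: one or more characters between a and z
range_matches = re.findall(r"[a-z]+", "abc a-z")

# Group: only the literal sequence "a-z"
group_matches = re.findall(r"(a-z)", "abc a-z")

# Escaped hyphen and period inside a character class
domain_matches = re.findall(r"[A-Za-z\-\.]+", "My-Website.com rocks")

# Group with OR: a run of whitespace, or a single comma
separator_matches = re.findall(r"(\s+|,)", "a, b")

print(range_matches)
print(group_matches)
print(domain_matches)
print(separator_matches)
```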

#### Exercises: Regex with NLTK tokenization
Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using nltk and regex. The nltk.tokenize.TweetTokenizer class gives you some extra methods and attributes for parsing tweets.

Here, you're given some example tweets to parse using both TweetTokenizer and regexp_tokenize from the nltk.tokenize module. These example tweets have been pre-loaded into the variable tweets. 

*Unlike the syntax for the regex library, with `regexp_tokenize()` you pass the pattern as the **second** argument.*

```
# Import the necessary modules
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# PART 1
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)


# PART 2
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"[#@]\w+"

# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)


# PART 3
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
```

#### Exercises: Non-ascii Tokenization
In this exercise, you'll practice advanced tokenization by tokenizing some non-ascii based text. You'll be using German with emoji!

Here, you have access to a string called `german_text`, which has been printed for you in the Shell. Notice the emoji and the German characters!

The following modules have been pre-imported from `nltk.tokenize`: `regexp_tokenize` and `word_tokenize`.

Unicode ranges for emoji are:

`'\U0001F300-\U0001F5FF'`, `'\U0001F600-\U0001F64F'`, `'\U0001F680-\U0001F6FF'`, and `'\u2600-\u26FF\u2700-\u27BF'`.

```
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))
```

### Charting word length with nltk
* Using `nltk` with `matplotlib`
* Tokenize text and chart word length for a simple sentence

#### Combining NLP data extraction with plotting

```
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize

# Tokenize the words and punctuation in a short sentence
words = word_tokenize("this is a pretty cool tool!")
word_lengths = [len(w) for w in words]
plt.hist(word_lengths)
```
* As a brief refresher on list comprehensions, they are a succinct way to write a for loop
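For comparison, here is the same word-length computation written as an explicit loop and as a list comprehension. The token list is typed by hand so the sketch runs without `nltk`; the optional `if` clause at the end is an extra illustration.

```python
words = ["this", "is", "a", "pretty", "cool", "tool", "!"]

# Explicit for loop
word_lengths_loop = []
for w in words:
    word_lengths_loop.append(len(w))

# Equivalent list comprehension
word_lengths = [len(w) for w in words]

# A comprehension can also filter: keep only alphabetic tokens
alpha_lengths = [len(w) for w in words if w.isalpha()]

print(word_lengths)
print(alpha_lengths)
```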

#### Exercises: Charting practice
Try using your new skills to find and chart the number of words per line in the script using `matplotlib`. The Holy Grail script is loaded for you, and you need to use regex to find the words per line.

Using list comprehensions here will speed up your computations. For example: `my_lines = [tokenize(l) for l in lines]` will call a function `tokenize` on each line in the list `lines`. The new transformed list will be saved in the `my_lines` variable.

You have access to the entire script in the variable `holy_grail`. Go for it!

```
# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove the speaker prefix (e.g. "SOLDIER #1:") from each script line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)
plt.show()
```

# $\star$ Chapter 2: Simple topic identification
This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and tf-idf, using NLTK and a new library, `Gensim`.

### Word counts with bag-of-words
#### Bag-of-words
* Basic method for finding topics in a text
* Need to first create tokens using tokenization
* ... and then count up all the tokens
* Theory: **the more frequent a word or token is, the more central or important it might be to the text.**
* Bag-of-words can be a great way to determine the significant words in a text (based on the number of times they are used)

#### Bag-of-words in Python

```
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
```

In [14]:
Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

* The result is a `Counter` object, which has a similar structure to a dictionary and allows us to see each token and the frequency of the token.
* `Counter` objects also have a method called **`most_common()`**, which takes an integer argument, such as `2`, and then returns the top 2 tokens in terms of frequency.

In [16]:
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
counter.most_common(2)

[('The', 3), ('cat', 3)]

* The returned object is a series of tuples inside a list
* For each tuple, the first element holds the token and the second element represents the frequency
* **Note:** Other than ordering by token frequency, the `most_common` method does not sort the tokens it returns or tell us there are more tokens with that same frequency. 
    * For example: In the above example, the `most_common()` method doesn't alert us that `'box'` also has a frequency of `3`.
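One way to check for ties yourself is to look up the top frequency and then collect every token that shares it. This is a minimal sketch; the sentence is pre-split on whitespace (with spaces added before the periods) so that it runs without `nltk`.

```python
from collections import Counter

tokens = "The cat is in the box . The cat likes the box . The box is over the cat .".split()
counter = Counter(tokens)

# Frequency of the single most common token
top_count = counter.most_common(1)[0][1]

# Every token that shares that frequency, sorted for a stable display order
tied_tokens = sorted(tok for tok, n in counter.items() if n == top_count)

print(top_count)
print(tied_tokens)
```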
    

#### Exercises: Building a Counter with bag-of-words

```
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))
```

### Simple text preprocessing
* Text preprocessing helps make for better input data when performing machine learning or other statistical methods
* Examples:
    * Tokenization to create a bag of words
    * Lowercasing words/tokens
    * **Lemmatization** or **stemming**: shortening words to their root stems
    * **Removing stop words**, punctuation, or unwanted tokens: stop words are common words in a language that may not convey much meaning regarding content or topics
* Good to experiment with different approaches

#### Preprocessing example
* Input: Cats, dogs, and birds are common pets. So are fish.
* Output: cat, dog, bird, common, pet, fish
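The transformation above can be roughly sketched with the standard library alone. The stop-word set and the plural-stripping helper below are hand-made stand-ins chosen for this one sentence; they are assumptions for illustration, not a replacement for a real stop-word list or lemmatizer.

```python
import re

text = "Cats, dogs, and birds are common pets. So are fish."

# Tiny hand-picked stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"and", "are", "so"}

def naive_singular(word):
    """Strip a trailing 's' from longer words; a crude stand-in for lemmatization."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Lowercase, then keep alphabetic runs only (this drops punctuation)
tokens = re.findall(r"[a-z]+", text.lower())
no_stops = [t for t in tokens if t not in STOP_WORDS]
processed = [naive_singular(t) for t in no_stops]

print(processed)
```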

#### Text preprocessing with Python
* Below we use list comprehensions to tokenize the sentences which we first make lowercase using the string `lower()` method.
* The **`isalpha()`** method will return `True` if the string has *only* alphabetical characters
    * Effectively strips tokens of numbers or punctuation when used as a conditional within list comprehension
* We use another list comprehension to remove words that are in the stopwords list
    * This **`stopwords` list comes built in with the `nltk` library.**
* Finally, we create a counter and check the two most common words, which are now cat and box ("the" is no longer included, nor is "The")

```
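from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = """The cat is in the box. The cat likes the box. The box is over the cat."""
# Lowercase, tokenize, and keep only purely alphabetic tokens
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stop-word set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
Counter(no_stops).most_common(2)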
```

In [19]:
#nltk.download('stopwords')


In [20]:
text = """The cat is in the box. The cat likes the box. The box is over the cat."""
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()] 
no_stops = [t for t in tokens if t not in stopwords.words('english')]
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

* As demonstrated, preprocessing has already improved our bag of words and made its most common tokens more informative.

#### Exercises

```
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))
```

### Introduction to gensim
* **Gensim** is a popular open-source NLP library
* It uses top academic models to perform complex tasks like:
    * Building document or word vectors
    * Building corpora
    * Performing topic identification and document comparisons
    
<img src='data/word_vectors.png' width="600" height="300" align="center"/>

#### Word vectors
* A **word embedding** or **vector** is trained from a larger corpus and is a multi-dimensional representation of a word or document
    * Think of it as a multi-dimensional array normally with sparse features 
    * With these vectors we can then see relationships among the words or documents based on how near or far they are and also what similar comparisons we find
* In the image above, we see that `king minus queen` is approximately equal to `man minus woman`.
* Or, that `Spain` is to `Madrid` as `Italy` is to `Rome`.
* The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.

<img src='data/LDA_viz.png' width="600" height="300" align="center"/>

* The image above is an example of LDA visualization
* **LDA** stands for **Latent Dirichlet Allocation** and is a statistical model that we can apply to text using Gensim for topic analysis and modeling
* Link to above article [HERE](http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

#### Gensim
* Gensim allows you to build corpora and dictionaries using simple classes and functions
* A **corpus** (or, if plural, **corpora**) is a set of texts used to help perform NLP tasks

```
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

my_documents = ["The movie was about a spaceship and aliens.",
                "I really liked the movie!",
                "Awesome action scenes, but boring characters.",
                "The movie was awful! I hate alien films.",
                "Space is cool! I liked the movie.",
                "More space films, please!"]
```
* In the above example, our "documents" are a list of strings that look like movie reviews about space or sci-fi films

In [23]:
my_documents = ["The movie was about a spaceship and aliens.",
                "I really liked the movie!",
                "Awesome action scenes, but boring characters.",
                "The movie was awful! I hate alien films.",
                "Space is cool! I liked the movie.",
                "More space films, please!"]

In [24]:
# Lower-case-it and tokenize
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

* Then, pass the tokenized documents to the Gensim Dictionary class:
    * This will create a mapping with an id for each token
    * This is the beginning of our corpus
    * We can now represent whole documents using just a list of their token ids and how often those tokens appear in each document
    * We can take a look at the tokens and their ids by looking at the `token2id` attribute, which is a dictionary of all our tokens and their respective ids in our new dictionary

In [25]:
dictionary = Dictionary(tokenized_docs)

In [29]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7f9b62045850>

In [26]:
dictionary.token2id

{'.': 0,
 'a': 1,
 'about': 2,
 'aliens': 3,
 'and': 4,
 'movie': 5,
 'spaceship': 6,
 'the': 7,
 'was': 8,
 '!': 9,
 'i': 10,
 'liked': 11,
 'really': 12,
 ',': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'cool': 24,
 'is': 25,
 'space': 26,
 'more': 27,
 'please': 28}

#### Creating a gensim corpus
* Using the dictionary created above, we can then create a Gensim corpus, which is a bit different from a normal corpus (which is just a collection of documents)
* Gensim uses a simple bag-of-words model which transforms each document into a bag-of-words using the token ids and the frequency of each token in the document
* Below, we can see that the Gensim corpus is a list of lists, with each list item representing one document.
* Each document is now a series of tuples, the first item representing the token id from the dictionary and the second item representing the token frequency in the document
* Unlike our previous Counter-based bag-of-words, this Gensim model can be easily saved, updated, and reused thanks to the extra tools we have available in Gensim
* Our dictionary can also be updated with new texts and filtered to keep only words that meet particular thresholds
* We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises
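To make the data structure concrete, here is a pure-Python sketch of the idea: map each token to an integer id, then represent each document as sorted `(token_id, count)` pairs. The helper `doc_to_bow` and the two toy documents are hypothetical; this is not how Gensim is implemented, and Gensim may assign ids in a different order.

```python
from collections import Counter

tokenized_docs = [
    ["i", "liked", "the", "movie", "!"],
    ["the", "movie", "was", "awful", "awful", "!"],
]

# Assign each previously unseen token the next free integer id
token2id = {}
for doc in tokenized_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc_to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping tokens that have no id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [doc_to_bow(doc) for doc in tokenized_docs]

print(token2id)
print(toy_corpus)
```

In the second document, `awful` appears twice, so its pair carries a count of 2 while every other token has a count of 1.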

In [27]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

In [28]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
 [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]

#### Exercises: Creating and querying a corpus with gensim

```
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a bag-of-words corpus (a list of lists): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])
```

#### Exercises: Gensim bag-of-words
Now, you'll use your new `gensim` corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!

You have access to the `dictionary` and `corpus` objects you created in the previous exercise, as well as the Python `defaultdict` and `itertools` to help with the creation of intermediate data structures for analysis.

* `defaultdict` allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument `int`, we are able to ensure that any non-existent keys are automatically assigned a default value of `0`. This makes it ideal for storing the counts of words in this exercise.

* `itertools.chain.from_iterable()` allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our `corpus` object (which is a list of lists).
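The two helpers can be tried out on a tiny corpus first. The `(word_id, count)` pairs below are made-up values, used only to show the mechanics.

```python
from collections import defaultdict
import itertools

# A toy corpus: one list of (word_id, count) pairs per document (invented values)
toy_corpus = [
    [(0, 1), (5, 1), (7, 2)],
    [(5, 3), (9, 1)],
    [(0, 1), (9, 1)],
]

# Missing keys start at int() == 0, so we can add to them right away
total_word_count = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of pairs
for word_id, word_count in itertools.chain.from_iterable(toy_corpus):
    total_word_count[word_id] += word_count

# Sort by total count, highest first (ties keep their insertion order)
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

print(sorted_word_count)
```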

The fifth document from `corpus` is stored in the variable `doc`, which has been sorted in descending order.

```
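# Import the helpers used below (pre-loaded in the original exercise environment)
from collections import defaultdict
import itertools

# Save the fifth document: doc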
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
```

### Tf-idf with gensim
* Here we will learn how to use a tf-idf model with Gensim

#### What is tf-idf?
* **T**erm **f**requency-**i**nverse **d**ocument **f**requency
* Allows you to determine the most important words in each document in the corpus
* Underlying theory: each corpus may have more shared words than just stop words; these common words behave like stop words and should be removed or at least down-weighted in importance
* Tf-idf ensures the most common words don't show up as keywords
* Keeps document-specific frequent words weighted high
    * (And the common words across the entire corpus weighted low)
    
<img src='data/tfidf_formula.png' width="600" height="300" align="center"/>

* The weight will be low if the term doesn't appear often in the document, because the tf variable will then be low. The weight will also be low if the logarithm is close to zero, which happens when the ratio inside the logarithm is close to 1, i.e. when the term appears in nearly every document in the corpus.
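The formula in the figure is not reproduced in the text. Assuming it is the standard tf-idf definition, it can be written as:

$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

where $w_{i,j}$ is the weight of token $i$ in document $j$, $tf_{i,j}$ is the number of occurrences of token $i$ in document $j$, $df_i$ is the number of documents that contain token $i$, and $N$ is the total number of documents. As a worked example with a base-2 logarithm (the base is an assumption here; it only rescales the weights): in a corpus of $N = 6$ documents, a token that occurs once in a document and appears in only $df_i = 1$ document gets $1 \cdot \log_2(6/1) \approx 2.58$, while a token that occurs once but appears in $df_i = 4$ documents gets $1 \cdot \log_2(6/4) \approx 0.58$.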

#### Tf-idf with gensim
* We reference each document by using it like a dictionary key with our new tfidf model 

```
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]
```

In [31]:
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]

[(5, 0.1746298276735174),
 (7, 0.1746298276735174),
 (9, 0.1746298276735174),
 (10, 0.29853166221463673),
 (11, 0.47316148988815415),
 (12, 0.7716931521027908)]

* For the second document in our corpus, we see the token weights along with the token ids. Notice there are some large differences.
* These weights can help you determine good topics and keywords for a corpus with shared vocabulary
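To see where such weights come from, here is a pure-Python sketch that computes tf-idf for the second movie review by hand. It assumes a base-2 logarithm and L2 normalization of each document's weights; those two choices are assumptions, picked because they reproduce the numbers printed above. The regex tokenizer is a stand-in for `word_tokenize` that gives the same tokens on these simple sentences.

```python
import math
import re
from collections import Counter

my_documents = ["The movie was about a spaceship and aliens.",
                "I really liked the movie!",
                "Awesome action scenes, but boring characters.",
                "The movie was awful! I hate alien films.",
                "Space is cool! I liked the movie.",
                "More space films, please!"]

# Lowercase and split into words and individual punctuation marks
tokenized_docs = [re.findall(r"\w+|[^\w\s]", d.lower()) for d in my_documents]
n_docs = len(tokenized_docs)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))

def tfidf_weights(doc):
    """Return {token: weight} using tf * log2(N / df), then L2-normalize."""
    tf = Counter(doc)
    raw = {tok: n * math.log2(n_docs / doc_freq[tok]) for tok, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {}
    return {tok: w / norm for tok, w in raw.items() if w > 0}

weights = tfidf_weights(tokenized_docs[1])
for tok, w in sorted(weights.items(), key=lambda p: p[1], reverse=True):
    print(tok, round(w, 4))
```

The rarest token in the review (`really`, which appears in only one document) ends up with the highest weight, while tokens shared by four of the six documents (`the`, `movie`, `!`) get the lowest.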

#### Exercises: Tf-idf with Wikipedia

```
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
```

# $\star$ Chapter 3: Named-entity recognition
This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.

### Named Entity Recognition

#### What is Named Entity Recognition?
* Named Entity Recognition (or **NER**) is an NLP task used to identify important named entities in the text
    * People, places, and organizations
    * Dates, states, works of art
    * ...and other categories (depending on the libraries and notation you use)
* NER can be used alongside topic identification, or on its own
* Who? What? When? Where?

<img src='data/NER_example.png' width="600" height="300" align="center"/>

* The text above has been highlighted for different types of named entities that were found using the **Stanford NER library**.
* Use NER to solve problems like fact extraction or determining which entities are related, by using computational language models.

### nltk and the Stanford CoreNLP Library
* NLTK allows you to interact with NER via its own model, but also the Stanford CoreNLP library
* **The Stanford CoreNLP library:**
    * Integrated into Python via `nltk`
    * Java based
    * You can also use the Stanford library on its own, without integrating it with NLTK, or operate it as an API server
    * Great support for **NER** as well as **coreference** and **dependency trees**
        * **Coreference:** linking pronouns and entities together
        * **Dependency trees:** help with parsing meaning and relationships amongst words or phrases in a sentence
* For our simple use case, we will use the *built-in* named entity recognition with NLTK

#### Using nltk for Named Entity Recognition
* Take a normal sentence
* Preprocess it via tokenization
* Then, tag the sentence for parts of speech
    * This will add tags for proper nouns, pronouns, adjectives, verbs, and other parts of speech that NLTK uses based on English grammar.

```
import nltk
sentence = '''In New York, I like to rie the Metro to visit MOMA and some restuarants rated well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]
```

In [33]:
#nltk.download('averaged_perceptron_tagger')


In [34]:
sentence = '''In New York, I like to rie the Metro to visit MOMA and some restuarants rated well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]

[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

* When we take a look at the tags, we see `New` and `York` are tagged `NNP` which is the tag for a proper noun, singular
* Then we pass the tagged sentence (`tagged_sent`) into the **`nltk.ne_chunk()`** function

In [36]:
# nltk.download('maxent_ne_chunker')


In [38]:
# nltk.download('words')


In [39]:
print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  rie/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restuarants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)


* `ne_chunk` = "named entity chunk"
* `ne_chunk` will return the sentence as a tree
* Though NLTK trees may look a bit different than trees from other libraries, they do still have leaves and subtrees representing more complex grammar
* `GPE` - Geopolitical Entity
* NLTK classifies each of these words *without consulting a knowledge base* like Wikipedia; instead, it uses *trained statistical and grammatical parsers.*
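
To make the tree structure concrete, the sketch below walks a small tree and collects the labeled subtrees. It assumes a hand-built `Tree` that mirrors part of the `ne_chunk` output above, so it runs without downloading any tagger or chunker data; the variable names are illustrative.

```python
from nltk.tree import Tree

# Hand-built stand-in for part of the ne_chunk output shown above
# (an assumption for illustration; ne_chunk would normally produce it)
tree = Tree('S', [
    ('In', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    (',', ','),
    ('I', 'PRP'),
    ('like', 'VBP'),
    Tree('PERSON', [('Ruth', 'NNP'), ('Reichl', 'NNP')]),
])

# Top-level children are either (token, tag) tuples or labeled subtrees;
# only the subtrees represent named entities
entities = []
for child in tree:
    if isinstance(child, Tree):
        text = ' '.join(token for token, tag in child.leaves())
        entities.append((child.label(), text))

print(entities)  # [('GPE', 'New York'), ('PERSON', 'Ruth Reichl')]
```

The `isinstance(child, Tree)` check plays the same role as the `hasattr(chunk, "label")` test used in the exercises: plain tuples are ordinary tokens, while subtrees carry an entity label.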

#### Exercises: 

```
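# sent_tokenize and word_tokenize come from nltk.tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize the article into sentences: sentences
# (`article` is assumed to already hold the raw text as a string)
sentences = sent_tokenize(article)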

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
```            

#### Exercise: Charting practice

```
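from collections import defaultdict

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
# Note: with binary=True every entity is labeled 'NE'; category labels
# such as GPE or PERSON require chunking with binary=False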
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()
```

### Introduction to SpaCy
* **SpaCy** is another great library for NLP

#### What is SpaCy?
* SpaCy is an NLP library similar to gensim, but with different implementations
* Focus on creating NLP pipelines to generate models and corpora
* SpaCy is open-source, with extra libraries and tools, including:
    * **displaCy**: A visualization tool for viewing parse trees, which uses Node.js to create interactive text
    * tools to build word and document vectors from text

In [42]:
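import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3 expects a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')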

* The object `nlp` functions similarly to our gensim dictionary and corpus.
* It has several linked objects, including `entity`, which is an `EntityRecognizer` object from the pipeline module; this is what's used to find entities in the text.

In [43]:
nlp.entity

<spacy.pipeline.pipes.EntityRecognizer at 0x7f9b62612e20>

In [44]:
doc = nlp("""Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel""")
doc.ents

(Berlin, Germany, Angela Merkel)

* When the document (above) is loaded, the named entities are stored as a document attribute called `ents`
* We see SpaCy has properly tagged and identified the three main entities in the sentence
* We can also inspect the label of each entity: index into `ents` to pick out the first entity, then read its `label_` attribute.

In [45]:
print(doc.ents[0], doc.ents[0].label_)

Berlin GPE


* SpaCy has several other language models available, including advanced German and Chinese implementations
* It's a great tool especially if you want to build your own extraction and natural language processing pipeline quickly and iteratively.

#### Why use SpaCy for NER?
* Ability to integrate with the other great SpaCy features
* Easy pipeline creation
* Different entity types compared to `nltk` (and often labels entities differently than `nltk`)
* SpaCy also comes with informal language corpora
    * Easily find entities in Tweets and chat messages
* Quickly growing! May have more by now!
* Some of the *extra* categories that `spacy` uses compared to `nltk` in its NER are:
    * `NORP`, `CARDINAL`, `MONEY`, `WORK_OF_ART`, `LANGUAGE`, `EVENT`
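
The label names above are terse, so it can help to look up what each one means. A small sketch using `spacy.explain`, which reads spaCy's built-in glossary and needs no trained model; note that spaCy spells the work-of-art label `WORK_OF_ART`. The `labels` and `descriptions` names are illustrative.

```python
import spacy

# Extra entity labels listed above, using spaCy's spelling WORK_OF_ART
labels = ['NORP', 'CARDINAL', 'MONEY', 'WORK_OF_ART', 'LANGUAGE', 'EVENT']

# spacy.explain looks up a short description in spaCy's built-in glossary
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, '-', description)
```
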

#### Exercises: Comparing NLTK with spaCy NER

```
# Import spacy
import spacy

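# Instantiate the English model: nlp
# (older-style keyword arguments; recent spaCy releases disable components
# with disable=[...], e.g. spacy.load('en_core_web_sm', disable=['tagger', 'parser']))
nlp = spacy.load('en', tagger=False, parser=False, matcher=False)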

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)
```

<img src='data/course_datasets.png' width="600" height="300" align="center"/>