1\. Natural Language Processing (NLP) basics
--------------------------------------------

00:00 - 00:11

Welcome to this course! I'm Azadeh, a principal data scientist. In this course, we'll explore Natural Language Processing (NLP) using spaCy.

2\. Natural Language Processing (NLP)
-------------------------------------

00:11 - 00:47

NLP is a subfield of artificial intelligence that combines computer science and linguistics to help computers understand, analyze, and generate human language. NLP helps extract insights from unstructured data. Unstructured data, such as textual data, is information that is not organized in a pre-defined manner. NLP incorporates statistics, machine learning, and deep learning models to understand human language, intent, and sentiment.

- A subfield of Artificial Intelligence (AI)
- Helps computers to understand human language  
- Helps extract insights from unstructured data
- Incorporates statistics, machine learning models and deep learning models

Artificial Intelligence contains:
- Machine Learning
- Natural Language Processing
- Deep Learning (intersection between Machine Learning and NLP)

[Note: The original shows a Venn diagram with Artificial Intelligence as the outer circle, Machine Learning and Natural Language Processing as overlapping circles within it, and Deep Learning in the intersection of ML and NLP]

3\. NLP use cases
-----------------

00:47 - 01:26

NLP has many applications. We will introduce three well-known use cases: sentiment analysis, named-entity recognition, and chatbots. Sentiment analysis is the use of computers to interpret the underlying subjective tone of a piece of text, and categorize it into positive, neutral, or negative classes. For example, a review about great service and affordable price is classified with a positive sentiment, while a review of a horrible experience is categorized with a negative sentiment.

# Sentiment analysis

- Use of computers to determine the underlying subjective tone of a piece of writing

| Sentiment | Example Text |
|-----------|--------------|
| Positive | "Great service and affordable price. I will buy it again." |
| Negative | "This was a horrible experience. Not worth the money" |

[Note: The original image shows emoticons/smileys - a happy face for Positive and sad face for Negative sentiment, in green and orange boxes respectively]
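
To make the idea concrete, here is a deliberately simplified sketch of sentiment classification that counts words from two small, hand-picked word lists. The word lists and the `classify_sentiment` function are hypothetical and only for illustration; real sentiment models are usually trained on labeled data rather than built from fixed word lists.

```python
import re

# Hypothetical, hand-picked word lists (an assumption for this sketch only).
POSITIVE_WORDS = {"great", "affordable", "good", "excellent"}
NEGATIVE_WORDS = {"horrible", "bad", "terrible", "poor"}


def classify_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting listed words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(word in POSITIVE_WORDS for word in words)
    score -= sum(word in NEGATIVE_WORDS for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great service and affordable price. I will buy it again.",
    "This was a horrible experience. Not worth the money",
]
for review in reviews:
    print(classify_sentiment(review), "|", review)
```

With these word lists, the first review is labeled positive and the second negative, matching the table above.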

4\. NLP use cases
-----------------

01:26 - 02:06

The next NLP use case is named entity recognition (NER). NER is used in information extraction to locate and classify named entities in unstructured text into predefined categories. Entities are objects such as a person or location. For example, with the phrase "John McCarthy was born on September 4, 1927." NER would classify John McCarthy as the name, highlighted in blue here, and September 4, 1927 as the date, highlighted in red.

# Named entity recognition (NER)

- Locating and classifying named entities mentioned in unstructured text into pre-defined categories
- Named entities are real-world objects such as a person or location

Example:
`[John McCarthy][Name] was born on [September 4, 1927][Date]`

[Note: The original image shows text with colored boxes highlighting and labeling the Name and Date entities]

5\. NLP use cases
-----------------

02:06 - 02:21

Another NLP use case is text generation in chatbots. ChatGPT is an example, which is based on a transformer-based language model trained on a vast amount of unstructured text data.

- Generate human-like responses to text input, such as ChatGPT

[Note: The original image shows an illustration of a computer screen with a robot/AI assistant icon and chat bubble, displayed in blue and white colors]

6\. Introduction to spaCy
-------------------------

02:21 - 02:56

Now that we have learned about NLP, let's learn more about spaCy and how we can utilize it in our NLP projects. spaCy is a free and open-source library for NLP in Python, which is designed to simplify building systems for information extraction. spaCy provides production-ready code widely used for NLP use cases. It supports 64+ languages. It is robust, fast and has built-in visualizers for various NLP functionalities.

## spaCy is a free, open-source library for NLP in Python which:

- Is designed to build systems for information extraction
- Provides production-ready code for NLP use cases
- Supports 64+ languages
- Is robust and fast and has visualization libraries

[Note: The original image includes the spaCy logo in white against a blue patterned background]

7\. Install and import spaCy
----------------------------

02:56 - 03:40

As the first step, we install spaCy using pip, the Python package manager. We can then download any spaCy model by running the python -m spacy download command with a given model name. Here we choose "en_core_web_sm", the smallest English model. After downloading the model, we import spacy and create an nlp object by passing the model name in quotation marks to the spacy-dot-load function. spaCy has multiple trained models for the English language that are available for download from the spacy-dot-io website.

- As the first step, `spaCy` can be installed using the Python package manager pip
- `spaCy` trained models can be downloaded 
- Multiple trained models are available for English language at spacy.io

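```bash
# In a terminal: install spaCy, then download the small English model
python3 -m pip install spacy
python3 -m spacy download en_core_web_sm

# Then, in Python: import spaCy and load the downloaded model
import spacy
nlp = spacy.load("en_core_web_sm")
```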

8\. Read and process text with spaCy
------------------------------------

03:40 - 04:14

Now that our nlp object is ready, we can move on to reading and processing text. The loaded spaCy model (the nlp object) can process text and convert it into a Doc object, which is a container that stores the processed text. The Doc object contains information such as tokens, linguistic annotations, and relationships between tokens. We'll learn about each of these later in the chapter.

- Loaded `spaCy` model `en_core_web_sm` = `nlp` object
- `nlp` object converts text into a `Doc` object (container) to store processed text

```
nlp(Text)     -->     Doc object     -->     Components:
object                                       - Tokens
                                            - Linguistic annotations
                                            - Relationships
```

[Note: The original image shows this as a flow diagram with boxes and arrows. The Doc object is represented as a blue cylinder/database icon.]

9\. spaCy in action
-------------------

04:14 - 05:04

Let's look at an example of processing text with spaCy. This example will use a preprocessing step known as tokenization. The first step is to read text, in this case the string "A spaCy pipeline object is created.". We convert this text into a Doc object by running a loaded spaCy model, nlp, on the text. Now, we can utilize list comprehension to print all tokens of the input text by using token-dot-text for token in doc. A token is the smallest meaningful part of a text. The process of dividing a text into a list of meaningful tokens is called tokenization.

- Processing a string using `spaCy`

```python
import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)
```

- Tokenization
  - A `Token` is defined as the smallest meaningful part of the text.
  - Tokenization: The process of dividing a text into a list of meaningful tokens

```python
print([token.text for token in doc])
```

```
['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
```

10\. Let's practice!
--------------------

05:04 - 05:09

Let's practice our learnings!

Doc container in spaCy
======================

The first step of a spaCy text processing pipeline is to convert a given text string into a `Doc` container, which stores the processed text. In this exercise, you'll practice loading a `spaCy` model, creating an `nlp` object, creating a `Doc` container and processing a `text` string that is available for you.

The `en_core_web_sm` model is already downloaded.

Instructions
------------

-   Load `en_core_web_sm` and create an `nlp` object.
-   Create a `doc` container of the `text` string.
-   Create a list containing the text of each token in the `doc` container.

```python
import spacy

# Load en_core_web_sm and create an nlp object
nlp = spacy.load("en_core_web_sm")

# Create a Doc container for the text object
doc = nlp(text)

# Create a list containing the text of each token in the Doc container
print([token.text for token in doc])
```

NER use case
============

NLP has many applications across different industries such as sentiment analysis, named entity recognition and chatbots. 

Is the following a correct definition for named entity recognition? 

*"Given a string of text, named entity recognition is identifying and categorizing entities in text."*

#### Possible Answers

-   False

[/] -   True

Tokenization with spaCy
=======================

In this exercise, you'll practice tokenizing text. You'll use the first review from the Amazon Fine Food Reviews dataset for this exercise. You can access this review by using the `text` object provided. 

The `en_core_web_sm` model is already loaded for you. You can access it by calling `nlp()`. You can use list comprehension to compile output lists.

Instructions
------------

-   Store the `Doc` container for the pre-loaded review in a `document` object.
-   Store the texts of all the tokens of the `document` in the variable `first_text_tokens`, then print it to review them.

```python
# Create a Doc container of the given text
document = nlp(text)

# Store and review the token text values of tokens for the Doc container
first_text_tokens = [token.text for token in document]
print("First text tokens:\n", first_text_tokens, "\n")
```

1\. spaCy basics
----------------

00:00 - 00:05

Let's learn more about spaCy and some of its core functionalities.

2\. spaCy NLP pipeline
----------------------

00:05 - 00:36

We previously learned that a spaCy NLP pipeline is created when we load a spaCy model. We start by importing spaCy, then call spacy-dot-load() to return an nlp object, an instance of the spaCy Language class. The Language class is the text processing pipeline and applies all necessary preprocessing steps to our input text behind the scenes. After that, we can apply nlp() to any given text to return a Doc container.

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")
```

- Import `spaCy`
- Use `spacy.load()` to return `nlp`, a `Language` object
  - The `Language` object is the text processing pipeline
- Apply `nlp()` on any text to get a `Doc` container

3\. spaCy NLP pipeline
----------------------

00:36 - 01:06

Let's learn more about the spaCy NLP pipeline. Every NLP application consists of several steps of text processing. spaCy applies a series of preprocessing steps to the text when we call nlp(), an instance of the spaCy Language class. Some of these steps are tokenization, tagging, parsing, and named entity recognition, among others, and together they produce a Doc container.

spaCy applies some processing steps using its Language class:

Text -> [nlp] -> Doc
     
where [nlp] contains:
Tokenizer -> Tagger -> Parser -> NER -> ...

4\. Container objects in spaCy
------------------------------

01:06 - 01:42

The Doc object is only one of the container classes that spaCy supports. spaCy uses multiple data structures to represent text data. Container classes such as Doc hold information about sentences, words and the text. Another container class is Span, which represents a slice of a Doc object. spaCy also has a Token class, which represents an individual token, such as a word or punctuation symbol.

There are multiple data structures to represent text data in spaCy:

| Name | Description |
|------|-------------|
| Doc | A container for accessing linguistic annotations of text |
| Span | A slice from a `Doc` object |
| Token | An individual token, e.g. a word, punctuation symbol, or whitespace |
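
The relationship between the three containers can be sketched in a few lines: indexing a `Doc` gives a `Token`, and slicing it gives a `Span`. This example assumes a blank English pipeline created with `spacy.blank("en")`, which only tokenizes, since no trained model is needed to see the containers themselves.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a tokenizer-only pipeline is enough to illustrate the containers.
nlp = spacy.blank("en")
doc = nlp("A spaCy pipeline object is created.")

token = doc[1]     # indexing a Doc returns a Token
phrase = doc[1:4]  # slicing a Doc returns a Span

print(type(doc).__name__, type(phrase).__name__, type(token).__name__)
print(token.text)
print(phrase.text)
```

```
Doc Span Token
spaCy
spaCy pipeline object
```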

5\. Pipeline components
-----------------------

01:42 - 02:21

All the container classes are generated during the spaCy NLP processing steps. Each of the processing steps we saw in the spaCy pipeline has a well-defined task. In this course, we mostly focus on the tokenizer, tagger, lemmatizer, and ner components. As shown, the tokenizer creates a Doc object and segments text into tokens. Then the tagger and other components add attributes such as part-of-speech tags and named entity labels.

The spaCy language processing pipeline always depends on the loaded model and its capabilities.

| Component | Name | Description |
|-----------|------|-------------|
| Tokenizer | Tokenizer | Segment text into tokens and create `Doc` object |
| Tagger | Tagger | Assign part-of-speech tags |
| Lemmatizer | Lemmatizer | Reduce the words to their root forms |
| EntityRecognizer | NER | Detect and label named entities |

6\. Pipeline components
-----------------------

02:21 - 02:49

Many more text processing components are available in spaCy. Other important parts of an nlp instance, each with its own duties, include Language, DependencyParser, and Sentencizer. Each component has unique features to help us process our text better. We will see more examples of each component throughout the course.

Each component has unique features to process text
- Language
- DependencyParser 
- Sentencizer
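
As a small illustration of one of these components, the sketch below adds the rule-based `sentencizer` to an otherwise empty pipeline and uses it to split a text into sentences. It assumes spaCy v3, where built-in components are added by their string name, and uses `spacy.blank("en")` so that no trained model is required.

```python
import spacy

# Assumption: spaCy v3 API, where add_pipe() accepts a component name.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("We are learning NLP. This course introduces spaCy.")
sentence_texts = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
print(sentence_texts)
```

```
['sentencizer']
['We are learning NLP.', 'This course introduces spaCy.']
```

The sentencizer splits on punctuation only, which makes it fast but less accurate than a parser-based approach on text with unusual punctuation.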

7\. Tokenization
----------------

02:49 - 03:41

We introduced tokenization earlier, but let's explore it further. Tokenization is always the first processing step in a spaCy NLP pipeline, as all other processing steps require the tokens of a given text. Recall that tokenization splits a sentence into its tokens, the smallest meaningful pieces of text. Tokens can be words, numbers and punctuation. The code segment shows the tokenization process we've seen before using a small English spaCy model. Once we apply the nlp object to the input sentence and create a Doc object, we can access each Token by using list comprehension and print a token's text by using the -dot-text attribute.

- Always the first operation
- All the other operations require tokens
- Tokens can be words, numbers and punctuation

```python
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tokenization splits a sentence into its tokens.")
print([token.text for token in doc])
```

```
['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']
```
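
To see that tokens can be words, numbers or punctuation, we can inspect a few boolean attributes that spaCy sets on each token. The sketch below uses the `is_punct`, `like_num` and `is_alpha` token attributes; the `describe` helper and the example sentence are made up for illustration, and a blank English pipeline is assumed because only the tokenizer is needed.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("I bought 3 apples.")


def describe(token):
    """Return a rough category for a token (hypothetical helper)."""
    if token.is_punct:
        return "punctuation"
    if token.like_num:
        return "number"
    if token.is_alpha:
        return "word"
    return "other"


token_kinds = [(token.text, describe(token)) for token in doc]
print(token_kinds)
```

```
[('I', 'word'), ('bought', 'word'), ('3', 'number'), ('apples', 'word'), ('.', 'punctuation')]
```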

8\. Sentence segmentation
-------------------------

03:41 - 04:23

Sentence segmentation, or breaking a text into its sentences, is a more complex task than tokenization because of the difficulty of handling punctuation and abbreviations. Sentence segmentation happens as part of the DependencyParser pipeline component. We utilize a for loop to iterate over the sentences of "We are learning NLP. This course introduces spaCy." using the dot-sents property of a Doc container. Then, we can use the dot-text attribute to access the sentence text.

- More complex than tokenization
- Is a part of `DependencyParser` component

```python
import spacy
nlp = spacy.load("en_core_web_sm")

text = "We are learning NLP. This course introduces spaCy."
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
```

```
We are learning NLP.
This course introduces spaCy.
```

9\. Lemmatization
-----------------

04:23 - 04:58

Lemmatization, one of the spaCy processing steps, reduces word forms to their lemmas. A lemma is the base form of a token, that is, the form in which it appears in a dictionary. For instance, the lemma of the words "eats" and "ate" is "eat". Lemmatization can improve the accuracy of many language modeling tasks. We iterate over tokens to get their text and lemmas using token-dot-text and token-dot-lemma_.

- A lemma is the base form of a token
- The lemma of eats and ate is eat
- Improves accuracy of language models

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])
```

```
[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'), 
('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]
```

10\. Let's practice!
--------------------

04:58 - 05:00

Let's exercise!

Running a spaCy pipeline
========================

You've already run a spaCy NLP pipeline on a single piece of text and also extracted tokens of a given list of Doc containers. In this exercise, you'll practice the initial steps of running a `spaCy` pipeline on `texts`, which is a list of text strings. 

You will use the `en_core_web_sm` model for this purpose. The `spaCy` package has already been imported for you.

Instructions
------------

-   Load the `en_core_web_sm` model as `nlp`.
-   Run the `nlp` model on each item of `texts`, and append each corresponding `Doc` container to a `documents` list.
-   Print the token texts for each `Doc` container of the `documents` list.

```python
# Load en_core_web_sm model as nlp
nlp = spacy.load("en_core_web_sm")

# Run an nlp model on each item of texts and append the Doc container to documents
documents = []
for text in texts:
    documents.append(nlp(text))

# Print the token texts for each Doc container
for doc in documents:
    print([token.text for token in doc])
```

Lemmatization with spaCy
========================

In this exercise, you will practice lemmatization. Lemmatization can be helpful to generate the root form of derived words. This means that given any sentence, we expect the number of distinct lemmas to be less than or equal to the number of distinct tokens.

The first Amazon food review is provided for you in a string called `text`. `en_core_web_sm` is loaded as `nlp`, and has been run on the `text` to compile `document`, a `Doc` container for the text string.

`tokens`, a list containing tokens for the `text`, is also already loaded for your use.

Instructions
------------

-   Append the lemma for all tokens in the `document`, then print the list of `lemmas`.
-   Print `tokens` list and observe the differences between `tokens` and `lemmas`.

```python
document = nlp(text)
tokens = [token.text for token in document]

# Append the lemma for all tokens in the document
lemmas = [token.lemma_ for token in document]
print("Lemmas:\n", lemmas, "\n")

# Print tokens and compare with lemmas list
print("Tokens:\n", tokens)
```

Sentence segmentation with spaCy
================================

In this exercise, you will practice sentence segmentation. In NLP, segmenting a document into its sentences is a useful basic operation. It is one of the first steps in many NLP tasks that are more elaborate, such as detecting named entities. Additionally, capturing the number of sentences may provide some insight into the amount of information provided by the text.

You can access ten food reviews in the list called `texts`. 

The `en_core_web_sm` model has already been loaded for you as `nlp`.

Instructions
------------

-   Run the `spaCy` model on each item in the `texts` list to compile `documents`, a list of all `Doc` containers.
-   Extract the sentences of each `doc` container by iterating through the `documents` list and append them to a list called `sentences`.
-   Count the number of sentences in each `doc` container using the `sentences` list.

```python
# Generating a documents list of all Doc containers
documents = [nlp(text) for text in texts]

# Iterate through documents and append sentences in each doc to the sentences list
sentences = []
for doc in documents:
    sentences.append(list(doc.sents))

# Find number of sentences per each doc container
print([len(s) for s in sentences])
```

1\. Linguistic features in spaCy
--------------------------------

00:00 - 00:11

Welcome back! Let's learn about some of spaCy's linguistic features such as the part-of-speech-tagger and named entity recognizer to extract information from text.

2\. POS tagging
---------------

00:11 - 00:36

POS stands for part-of-speech. A part of speech is a grammatical term that categorizes words based on their function and context within a sentence. For example, the English language has nine main POS categories, including verb, noun, adjective, adverb and conjunction.

3\. POS tagging with spaCy
--------------------------

00:36 - 01:02

One use case for POS tagging is to confirm the meaning of a word. For example, some words such as "watch" can be both a noun and a verb. spaCy stores the POS tag of each token in the token's pos_ attribute. spacy-dot-explain() can be called on a given tag to get an explanation of that tag.

4\. POS tagging with spaCy
--------------------------

01:02 - 01:36

Let's look at an example for extracting part-of-speech tags for two sentences: "I watch TV" and "I left without my watch". We use list comprehension to identify the token and the POS tag using token-dot-pos-underscore, and the explanation using spacy-dot-explain and passing it token-dot-pos-underscore. The word "watch" is correctly tagged as a verb in the first sentence, and tagged as a noun in the second example.

5\. Named entity recognition
----------------------------

01:36 - 02:21

On to named entity recognition! A named entity is a word or phrase that refers to a specific entity with a name, such as an organization. Named-entity recognition (NER) is an NLP task that classifies named entities found in unstructured text into pre-defined categories such as person names. spaCy supports a wide range of entity types, such as PERSON to represent a named person, ORG to represent a company, GPE for a geo-political entity like a country, LOC for other locations such as mountain ranges, and DATE and TIME.

6\. NER and spaCy
-----------------

02:21 - 02:51

spaCy models can predict named entities and their corresponding labels as part of the NER component. Named entities are available via the doc-dot-ents property of a Doc container. spaCy will also tag each entity with its corresponding label, which represents an entity type. The label of an entity is available via the -dot-label_ property.

7\. NER and spaCy
-----------------

02:51 - 03:34

The code snippet illustrates how we extract named entities from "Albert Einstein was genius". We can iterate through entities by using doc-dot-ents attribute, and access entity text, the start and end characters of each entity, and entity labels by using -dot-text, -dot-start_char, -dot-end_char and -dot-label_ respectively. In this instance, Albert Einstein is detected as a PERSON which starts from the first and ends at the 15th character of the given text.

8\. NER and spaCy
-----------------

03:34 - 04:11

An alternative approach to extracting entities and their types is to use the Token class directly, instead of accessing doc-dot-ents, which only returns the extracted named entities. spaCy tags each token in a given Doc container with its entity type if it is part of an entity. We can access a Token's -dot-text and -dot-ent_type_ attributes. If a token is not classified as an entity, such as the words "was" and "genius", we will see an empty string as the entity type.

9\. displaCy
------------

04:11 - 04:50

We can also visualize these entities using displaCy. displaCy has different visualization options, such as the entity visualizer, which highlights named entities and their labels in a text. For example, we can use the displacy-dot-serve function to visualize the named entities of a previous example, "Albert Einstein was genius". The displacy-dot-serve function takes two arguments: a Doc container, and the type of displaCy visualization, which is "ent" (entities) in this instance.

10\. Let's practice!
--------------------

04:50 - 04:54

Let's exercise our learnings!

POS tagging with spaCy
======================

In this exercise, you will practice POS tagging. POS tagging is a useful tool in NLP as it allows algorithms to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings, such as `watch` and `play`.

For this exercise, `en_core_web_sm` has been loaded for you as `nlp`. Three comments from the Airline Travel Information System (ATIS) dataset have been provided for you in a list called `texts`.

Instructions
------------

-   Compile `documents`, a list of `Doc` containers for each text in the `texts` list, using list comprehension.
-   For each `doc` container, print each token's text and its corresponding POS tag by iterating through `documents` and the tokens of each `doc` container using a nested for loop.

```python
# Compile a list of all Doc containers of texts
documents = [nlp(text) for text in texts]

# Print token texts and POS tags for each Doc container
for doc in documents:
    for token in doc:
        print("Text: ", token.text, "| POS tag: ", token.pos_)
    print("\n")
```

NER with spaCy
==============

Named entity recognition (NER) helps you to easily identify key elements of a given document, like names of people and places. It helps sort unstructured data and detect important information, which is crucial if you are dealing with large datasets. In this exercise, you will practice Named Entity Recognition.

`en_core_web_sm` has been loaded for you as `nlp`. Three comments from the Airline Travel Information System (ATIS) dataset have been provided for you in a list called `texts`.

Instructions
------------

-   Compile `documents`, a list of `Doc` containers for each text in the `texts` list, using list comprehension.
-   For each `doc` container, print each entity's text and corresponding label by iterating through `doc.ents`.
-   Print the text and entity type of the sixth token in the second `Doc` container.

```python
# Compile a list of all Doc containers of texts
documents = [nlp(text) for text in texts]

# Print the entity text and label for the entities in each document
for doc in documents:
    print([(ent.text, ent.label_) for ent in doc.ents])

# Print the 6th token's text and entity type of the second document
print("\nText:", documents[1][5].text, "| Entity type: ", documents[1][5].ent_type_)
```

Text processing with spaCy
==========================

Every NLP application consists of several text processing steps. You have already learned some of these steps, including tokenization, lemmatization, sentence segmentation and named entity recognition.

In this exercise, you'll continue to practice with text processing steps in spaCy, such as breaking the text into sentences and extracting named entities. You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the `texts` object. 

The `en_core_web_sm` model has already been loaded for you to use, and you can access it by using `nlp`. The list of `Doc` containers for each item in `texts` is also pre-loaded and accessible at `documents`.

Instructions 1/2
----------------

-   Create `sentences`, a list of lists of all sentences in each `doc` container in `documents`, using list comprehension.
-   Print `num_sentences`, a list containing the number of sentences for each `doc` container, by using the `len()` function.

```python
# Create a list to store sentences of each Doc container in documents
sentences = [[sent for sent in doc.sents] for doc in documents]

# Print number of sentences in each Doc container in documents
num_sentences = [len(s) for s in sentences]
print("Number of sentences in documents:\n", num_sentences)
```

Instructions 2/2
----------------

-   Create a list of tuples of format (entity text, entity label) for the third `doc` container and store it in `third_text_entities`.
-   Create a list of tuples of format (token text, POS tag) for the first ten tokens of the third `doc` container and store it in `third_text_10_pos`.

```python
# Create a list to store sentences of each Doc container in documents
sentences = [[sent for sent in doc.sents] for doc in documents]

# Create a list to track number of sentences per Doc container in documents
num_sentences = [len(list(doc.sents)) for doc in documents]
print("Number of sentences in documents:\n", num_sentences, "\n")

# Record entities text and corresponding label of the third Doc container
third_text_entities = [(ent.text, ent.label_) for ent in documents[2].ents]
print("Third text entities:\n", third_text_entities, "\n")

# Record first ten tokens and corresponding POS tag for the third Doc container
third_text_10_pos = [(token.text, token.pos_) for token in documents[2]][:10]
print("First ten tokens of third text:\n", third_text_10_pos)
```