1\. Natural Language Processing (NLP) basics
--------------------------------------------

00:00 - 00:11

Welcome to this course! I'm Azadeh, a principal data scientist. In this course, we'll explore Natural Language Processing (NLP) using spaCy.

2\. Natural Language Processing (NLP)
-------------------------------------

00:11 - 00:47

NLP is a subfield of artificial intelligence that combines computer science and linguistics to help computers understand, analyze, and generate human language. NLP helps extract insights from unstructured data. Unstructured data, such as textual data, is information that is not organized in a pre-defined manner. NLP incorporates statistics, machine learning, and deep learning models to understand human language, intent, and sentiment.

3\. NLP use cases
-----------------

00:47 - 01:26

NLP has many applications. We will introduce three well-known use cases: sentiment analysis, named-entity recognition, and chatbots. Sentiment analysis is the use of computers to interpret the underlying subjective tone of a piece of text, and categorize it into positive, neutral, or negative classes. For example, a review about great service and affordable price is classified with a positive sentiment, while a review of a horrible experience is categorized with a negative sentiment.
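To make the positive / neutral / negative split concrete, here is a minimal pure-Python sketch of a lexicon-based classifier. The word lists and the `classify_sentiment` helper are hypothetical and written only for this illustration. Real sentiment models are trained on labeled data and are far more nuanced.

```python
# Toy lexicon-based sentiment classifier (illustrative only).
# The word lists below are hypothetical and far too small for real use.
POSITIVE_WORDS = {"great", "affordable", "good", "excellent"}
NEGATIVE_WORDS = {"horrible", "bad", "terrible", "awful"}


def classify_sentiment(review: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a review string."""
    # Lowercase and strip basic punctuation so "Great!" matches "great".
    words = [word.strip(".,!?").lower() for word in review.split()]
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("Great service and affordable price!"))  # positive
print(classify_sentiment("What a horrible experience."))          # negative
```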

4\. NLP use cases
-----------------

01:26 - 02:06

The next NLP use case is named entity recognition (NER). NER is used in information extraction to locate and classify named entities in unstructured text into predefined categories. Entities are objects such as a person or location. For example, given the sentence "John McCarthy was born on September 4, 1927.", NER would classify John McCarthy as a name, highlighted in blue here, and September 4, 1927 as a date, highlighted in red.
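The input and output of NER can be sketched with spaCy. As an assumption made so the example runs without downloading a trained model, it uses spaCy's rule-based `entity_ruler` on a blank English pipeline as a stand-in for the statistical `ner` component. The two patterns are hand-written for this one sentence, so this shows the shape of the result, not how a trained model finds entities.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Rule-based stand-in for a trained NER component (assumption for this sketch).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "PERSON", "pattern": "John McCarthy"},
        {"label": "DATE", "pattern": "September 4, 1927"},
    ]
)

doc = nlp("John McCarthy was born on September 4, 1927.")

# Each entity is a Span with a text and a label.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
# [('John McCarthy', 'PERSON'), ('September 4, 1927', 'DATE')]
```

With a trained pipeline such as `en_core_web_sm`, the same `doc.ents` and `ent.label_` attributes are used, but the entities come from the model's predictions.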

5\. NLP use cases
-----------------

02:06 - 02:21

Another NLP use case is text generation in chatbots. ChatGPT is one example: it is built on a transformer-based language model trained on a vast amount of unstructured text data.

6\. Introduction to spaCy
-------------------------

02:21 - 02:56

Now that we have learned about NLP, let's learn more about spaCy and how we can utilize it in our NLP projects. spaCy is a free and open-source library for NLP in Python, which is designed to simplify building systems for information extraction. spaCy provides production-ready code widely used for NLP use cases. It supports 64+ languages. It is robust, fast and has built-in visualizers for various NLP functionalities.

7\. Install and import spaCy
----------------------------

02:56 - 03:40

As the first step, we install spaCy using pip, a Python package manager. We can then download any spaCy model by running the spaCy download command, python -m spacy download, followed by the model name. Here we choose "en_core_web_sm", the smallest English model. After downloading the model, we import spacy and create an nlp object by passing the model name in quotation marks to the spacy-dot-load function. spaCy has multiple trained models for the English language that are available for download from the spacy-dot-io website.
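The installation and loading steps described above can be sketched as follows. The shell commands appear as comments because they are run in a terminal, not in Python. The `try`/`except` fallback to `spacy.blank("en")` is an assumption added here so the sketch still runs before the model has been downloaded; it is not part of the course code.

```python
# Run these two commands in a terminal first:
#   python -m pip install spacy
#   python -m spacy download en_core_web_sm

import spacy
from spacy.language import Language

try:
    # Load the small English model by name.
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model is not installed.
    # Fallback (assumption for this sketch): a blank English pipeline
    # with a tokenizer but no trained components.
    nlp = spacy.blank("en")

# The nlp object is an instance of spaCy's Language class.
print(isinstance(nlp, Language))
```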

8\. Read and process text with spaCy
------------------------------------

03:40 - 04:14

Now that our nlp object is ready, we can move on to reading and processing text. The loaded spaCy model (the nlp object) can process text and convert it into a Doc object, which is a container that stores the processed text. The Doc object contains information such as tokens, linguistic annotations, and relationships within the text. We'll learn about each of these later in the chapter.

9\. spaCy in action
-------------------

04:14 - 05:04

Let's look at an example of processing text with spaCy. This example will use a preprocessing step known as tokenization. The first step is to read text, in this case the string "A spaCy pipeline object is created.". We convert this text into a Doc object by running a loaded spaCy model, nlp, on the text. Now, we can use a list comprehension to print all tokens of the input text by using token-dot-text for token in doc. A token is the smallest meaningful part of a text. The process of dividing a text into a list of meaningful tokens is called tokenization.
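The tokenization example described above can be sketched like this. As an assumption, it uses `spacy.blank("en")` so it runs without a downloaded model. The blank pipeline carries the English tokenizer rules but no trained components, and it should tokenize this sentence the same way as a loaded `en_core_web_sm` model.

```python
import spacy

# Blank English pipeline (assumption): tokenizer only, no model download needed.
nlp = spacy.blank("en")

text = "A spaCy pipeline object is created."

# Running nlp on a string returns a Doc container.
doc = nlp(text)

# Collect the text of every token with a list comprehension.
tokens = [token.text for token in doc]
print(tokens)
# ['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
```

Note that the final period becomes its own token, separate from "created".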

10\. Let's practice!
--------------------

05:04 - 05:09

Let's practice what we've learned!

Doc container in spaCy
======================

The first step of a spaCy text processing pipeline is to convert a given text string into a `Doc` container, which stores the processed text. In this exercise, you'll practice loading a `spaCy` model, creating an `nlp` object, creating a `Doc` container, and processing a `text` string that is available for you.

The `en_core_web_sm` model is already downloaded.

Instructions
------------

-   Load `en_core_web_sm` and create an `nlp` object.
-   Create a `doc` container of the `text` string.
-   Create a list containing the text of each token in the `doc` container.

In [None]:
# Import spaCy, then load en_core_web_sm and create an nlp object
import spacy
nlp = spacy.load("en_core_web_sm")

# Create a Doc container for the text object
doc = nlp(text)

# Create a list containing the text of each token in the Doc container
print([token.text for token in doc])

NER use case
============

NLP has many applications across different industries, such as sentiment analysis, named entity recognition, and chatbots.

Is the following a correct definition for named entity recognition? 

*"Given a string of text, named entity recognition is identifying and categorizing entities in text."*

#### Possible Answers

Select one answer

-   False

[/] -   True

Tokenization with spaCy
=======================

In this exercise, you'll practice tokenizing text. You'll use the first review from the Amazon Fine Food Reviews dataset for this exercise. You can access this review by using the `text` object provided. 

The `en_core_web_sm` model is already loaded for you. You can access it by calling `nlp()`. You can use list comprehension to compile output lists.

Instructions
------------

-   Store the Doc container for the pre-loaded review in a `document` object.
-   Store and review the texts of all tokens in `document` in the variable `first_text_tokens`.

In [None]:
# Create a Doc container of the given text
document = nlp(text)

# Store and review the token text values of tokens for the Doc container
first_text_tokens = [token.text for token in document]
print("First text tokens:\n", first_text_tokens, "\n")

1\. spaCy basics
----------------

00:00 - 00:05

Let's learn more about spaCy and some of its core functionalities.

2\. spaCy NLP pipeline
----------------------

00:05 - 00:36

We previously learned that a spaCy NLP pipeline is created when we load a spaCy model. We started by importing spaCy, then called spacy-dot-load() to return an nlp object, an instance of the spaCy Language class. The Language class is the text processing pipeline and applies all necessary preprocessing steps to our input text behind the scenes. After that, we can apply nlp() to any given text to return a Doc container.

3\. spaCy NLP pipeline
----------------------

00:36 - 01:06

Let's learn more about the spaCy NLP pipeline. Every NLP application consists of several steps of text processing. spaCy applies a series of preprocessing steps to the text when we call nlp(), the spaCy Language object. These steps include tokenization, tagging, parsing, and named entity recognition, among others, and they result in a Doc container.

4\. Container objects in spaCy
------------------------------

01:06 - 01:42

The Doc object is only one of the container classes that spaCy supports. spaCy uses multiple data structures to represent text data. Container classes such as Doc hold information about sentences, words, and the text. Another container class is the Span object, which represents a slice of a Doc object. spaCy also has a Token class, which represents an individual token, such as a word or punctuation symbol.
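The relationship between the three container classes can be sketched as follows: indexing a Doc with a single position gives a Token, and slicing it gives a Span. The example sentence and the use of `spacy.blank("en")` (chosen so no model download is needed) are assumptions of this sketch.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Blank English pipeline (assumption): enough to build Doc, Span and Token objects.
nlp = spacy.blank("en")

doc = nlp("We are learning NLP.")

# A single index returns a Token; a slice returns a Span.
token = doc[0]
span = doc[1:3]

print(type(doc).__name__, len(doc))      # Doc 5
print(type(token).__name__, token.text)  # Token We
print(type(span).__name__, span.text)    # Span are learning
```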

5\. Pipeline components
-----------------------

01:42 - 02:21

All the container classes are generated during the spaCy NLP processing steps. Each of the processing steps we saw in the spaCy pipeline has a well-defined task. In this course, we mostly focus on the tokenizer, tagger, lemmatizer, and ner components. As shown, the tokenizer creates a Doc object and segments text into tokens. Then the tagger and other components add more attributes, such as part-of-speech tags and named entity labels.
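One way to see which components a pipeline contains is to inspect the nlp object. This is a minimal sketch: `nlp.tokenizer` and `nlp.pipe_names` are standard attributes, while the fallback to a blank pipeline is an assumption added so the code runs even when `en_core_web_sm` is not installed. In that case the list of component names is simply empty.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback (assumption for this sketch): tokenizer only, no other components.
    nlp = spacy.blank("en")

# The tokenizer always runs first and is stored separately from the other components.
print(type(nlp.tokenizer).__name__)

# Names of the components that run after the tokenizer, in order.
print(nlp.pipe_names)

# nlp.pipeline pairs each name with the component object itself.
for name, component in nlp.pipeline:
    print(name, type(component).__name__)
```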

6\. Pipeline components
-----------------------

02:21 - 02:49

There are many more text processing components available in spaCy. It is worth highlighting some other important components of an nlp instance and their duties, such as Language, DependencyParser, and Sentencizer. Each component has unique features to help us process our text better. We will see more examples of each component throughout the course.

7\. Tokenization
----------------

02:49 - 03:41

We introduced tokenization earlier, but let's explore it further. Tokenization is always the first processing step in a spaCy NLP pipeline, as all other processing steps require the tokens of a given text. Recall that tokenization splits a sentence into its tokens, the smallest meaningful pieces of text. Tokens can be words, numbers, and punctuation. The code segment shows the tokenization process we've seen before, using a small English spaCy model. Once we apply the nlp object to the input sentence and create a Doc object, we can access each Token with a list comprehension and print a token's text using the dot-text attribute.
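To illustrate that tokens can be words, numbers, or punctuation, the sketch below labels each token using the lexical attributes `is_punct`, `like_num`, and `is_alpha`. The example sentence and the category names are made up for this illustration, and `spacy.blank("en")` is assumed so that no model download is needed.

```python
import spacy

# Blank English pipeline (assumption): lexical attributes work without a trained model.
nlp = spacy.blank("en")
doc = nlp("I bought 3 apples, really!")

token_kinds = []
for token in doc:
    if token.is_punct:
        kind = "punctuation"
    elif token.like_num:
        kind = "number"
    elif token.is_alpha:
        kind = "word"
    else:
        kind = "other"
    token_kinds.append((token.text, kind))

print(token_kinds)
# [('I', 'word'), ('bought', 'word'), ('3', 'number'), ('apples', 'word'),
#  (',', 'punctuation'), ('really', 'word'), ('!', 'punctuation')]
```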

8\. Sentence segmentation
-------------------------

03:41 - 04:23

Sentence segmentation, or breaking a text into its sentences, is a more complex task than tokenization because of the difficulty of handling punctuation and abbreviations. Sentence segmentation happens as part of the DependencyParser pipeline component. We use a for loop to iterate over the sentences of "We are learning NLP. This course introduces spaCy." using the dot-sents property of a Doc container. Then, we can use the dot-text attribute to access the sentence text.
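The loop over sentences can be sketched as follows. As an assumption, this version adds the rule-based `sentencizer` component to a blank English pipeline, so it runs without a trained model. With a loaded model such as `en_core_web_sm`, the sentence boundaries would come from the dependency parser instead, but `doc.sents` is used in the same way.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer (assumption for this sketch).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("We are learning NLP. This course introduces spaCy.")

# doc.sents yields one Span per sentence; .text gives the sentence string.
sentence_texts = []
for sent in doc.sents:
    sentence_texts.append(sent.text)
    print(sent.text)
# We are learning NLP.
# This course introduces spaCy.
```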

9\. Lemmatization
-----------------

04:23 - 04:58

Lemmatization, one of the spaCy processing steps, reduces word forms to their lemmas. A lemma is the base form of a token, the form under which it appears in a dictionary. For instance, the lemma of the words "eats" and "ate" is "eat". Lemmatization can improve the accuracy of many language modeling tasks. We iterate over tokens to get their text and lemmas using token-dot-text and token-dot-lemma_.
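The idea of mapping several word forms to one lemma can be sketched in pure Python. This is a toy dictionary lookup, not spaCy's lemmatizer: the `LEMMA_LOOKUP` table and the `lookup_lemma` helper are hypothetical and cover only the words in this example. In spaCy itself, the lemma is read from `token.lemma_` after running a trained pipeline.

```python
# Toy lookup table (hypothetical): maps inflected forms to their lemma.
LEMMA_LOOKUP = {"eats": "eat", "ate": "eat", "eating": "eat"}


def lookup_lemma(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_LOOKUP.get(key, key)


words = ["She", "eats", "what", "he", "ate"]
lemmas = [lookup_lemma(word) for word in words]

print(list(zip(words, lemmas)))
# [('She', 'she'), ('eats', 'eat'), ('what', 'what'), ('he', 'he'), ('ate', 'eat')]

# "eats" and "ate" share one lemma, so there are fewer distinct lemmas than words.
print(len(set(words)), len(set(lemmas)))  # 5 4
```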

10\. Let's practice!
--------------------

04:58 - 05:00

Let's exercise!

Running a spaCy pipeline
========================

You've already run a spaCy NLP pipeline on a single piece of text and also extracted tokens of a given list of Doc containers. In this exercise, you'll practice the initial steps of running a `spaCy` pipeline on `texts`, which is a list of text strings. 

You will use the `en_core_web_sm` model for this purpose. The `spaCy` package has already been imported for you.

Instructions
------------

-   Load the `en_core_web_sm` model as `nlp`.
-   Run the `nlp` model on each item of `texts`, and append each corresponding `Doc` container to a `documents` list.
-   Print the token texts for each `Doc` container of the `documents` list.

In [None]:
# Load en_core_web_sm model as nlp
nlp = spacy.load("en_core_web_sm")

# Run an nlp model on each item of texts and append the Doc container to documents
documents = []
for text in texts:
  documents.append(nlp(text))

# Print the token texts for each Doc container
for doc in documents:
  print([token.text for token in doc])

Lemmatization with spaCy
========================

In this exercise, you will practice lemmatization. Lemmatization can be helpful to generate the root form of derived words. Because several word forms can share one lemma, we expect the number of distinct lemmas in any sentence to be less than or equal to the number of distinct tokens.

The first Amazon food review is provided for you in a string called `text`. `en_core_web_sm` is loaded as `nlp`, and has been run on the `text` to compile `document`, a `Doc` container for the text string.

`tokens`, a list containing tokens for the `text`, is also already loaded for your use.

Instructions
------------

-   Append the lemma for all tokens in the `document`, then print the list of `lemmas`.
-   Print `tokens` list and observe the differences between `tokens` and `lemmas`.

In [None]:
document = nlp(text)
tokens = [token.text for token in document]

# Append the lemma for all tokens in the document
lemmas = [token.lemma_ for token in document]
print("Lemmas:\n", lemmas, "\n")

# Print tokens and compare with lemmas list
print("Tokens:\n", tokens)

Sentence segmentation with spaCy
================================

In this exercise, you will practice sentence segmentation. In NLP, segmenting a document into its sentences is a useful basic operation. It is one of the first steps in many NLP tasks that are more elaborate, such as detecting named entities. Additionally, capturing the number of sentences may provide some insight into the amount of information provided by the text.

You can access ten food reviews in the list called `texts`. 

The `en_core_web_sm` model has already been loaded for you as `nlp`.

Instructions
------------

-   Run the `spaCy` model on each item in the `texts` list to compile `documents`, a list of all `Doc` containers.
-   Extract the sentences of each `doc` container by iterating through the `documents` list and append them to a list called `sentences`.
-   Count the number of sentences in each `doc` container using the `sentences` list.

In [None]:
# Generating a documents list of all Doc containers
documents = [nlp(text) for text in texts]

# Iterate through documents and append sentences in each doc to the sentences list
sentences = []
for doc in documents:
  sentences.append(list(doc.sents))

# Find the number of sentences in each Doc container
print([len(s) for s in sentences])