1\. Natural Language Processing (NLP) basics
--------------------------------------------

00:00 - 00:11

Welcome to this course! I'm Azadeh, a principal data scientist. In this course, we'll explore Natural Language Processing (NLP) using spaCy.

2\. Natural Language Processing (NLP)
-------------------------------------

00:11 - 00:47

NLP is a subfield of artificial intelligence that combines computer science and linguistics to help computers understand, analyze, and generate human language. NLP helps extract insights from unstructured data. Unstructured data, such as textual data, is information that is not organized in a pre-defined manner. NLP incorporates statistics, machine learning, and deep learning models to understand human language, intent, and sentiment.

3\. NLP use cases
-----------------

00:47 - 01:26

NLP has many applications. We will introduce three well-known use cases: sentiment analysis, named-entity recognition, and chatbots. Sentiment analysis is the use of computers to interpret the underlying subjective tone of a piece of text, and categorize it into positive, neutral, or negative classes. For example, a review about great service and affordable price is classified with a positive sentiment, while a review of a horrible experience is categorized with a negative sentiment.
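To make the idea concrete, here is a deliberately tiny sketch of sentiment classification. It is a toy word-counting approach, not how production systems or spaCy work; the word lists and the `classify_sentiment` function are hypothetical names made up for this illustration.

```python
# Toy sentiment classifier: counts words from two small, hand-picked lists.
# The lists and the function name are illustrative assumptions only.
POSITIVE_WORDS = {"great", "affordable", "good", "excellent"}
NEGATIVE_WORDS = {"horrible", "bad", "terrible", "awful"}


def classify_sentiment(review):
    """Return 'positive', 'negative', or 'neutral' for a review string."""
    # Lowercase each word and strip surrounding punctuation
    words = [word.strip(".,!?").lower() for word in review.split()]
    score = sum(word in POSITIVE_WORDS for word in words)
    score -= sum(word in NEGATIVE_WORDS for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("Great service and an affordable price."))
print(classify_sentiment("A horrible experience."))
print(classify_sentiment("The package arrived on Tuesday."))
```

Real sentiment models are typically trained on labeled reviews rather than built from fixed word lists, but the input and output have the same shape: text in, one of a few classes out.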

4\. NLP use cases
-----------------

01:26 - 02:06

The next NLP use case is named entity recognition (NER). NER is used in information extraction to locate and classify named entities in unstructured text into predefined categories. Entities are objects such as a person or location. For example, given the sentence "John McCarthy was born on September 4, 1927.", NER would classify John McCarthy as a person's name, highlighted in blue here, and September 4, 1927 as a date, highlighted in red.
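The John McCarthy example can be sketched in spaCy as follows. To keep the snippet runnable without downloading a trained model, it uses a blank English pipeline with spaCy's rule-based `entity_ruler` component and two hand-written patterns; this is a stand-in, and the patterns are assumptions written for this one sentence. A trained pipeline such as `en_core_web_sm` predicts entities statistically instead, but exposes them through the same `doc.ents` attribute.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Rule-based entity matcher used here in place of a trained NER model
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match for the person's name
    {"label": "PERSON", "pattern": "John McCarthy"},
    # Token pattern for a date written like "September 4, 1927"
    {"label": "DATE", "pattern": [
        {"LOWER": "september"},
        {"IS_DIGIT": True},
        {"ORTH": ","},
        {"IS_DIGIT": True},
    ]},
])

doc = nlp("John McCarthy was born on September 4, 1927.")

# Each entity is a span of the Doc with a text and a category label
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
# [('John McCarthy', 'PERSON'), ('September 4, 1927', 'DATE')]
```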

5\. NLP use cases
-----------------

02:06 - 02:21

Another NLP use case is text generation in chatbots. ChatGPT is one example: it is built on a transformer-based language model trained on a vast amount of unstructured text data.

6\. Introduction to spaCy
-------------------------

02:21 - 02:56

Now that we have learned about NLP, let's learn more about spaCy and how we can utilize it in our NLP projects. spaCy is a free and open-source library for NLP in Python, which is designed to simplify building systems for information extraction. spaCy provides production-ready code widely used for NLP use cases. It supports 64+ languages. It is robust and fast, and it has built-in visualizers for various NLP functionalities.

7\. Install and import spaCy
----------------------------

02:56 - 03:40

As the first step, we install spaCy using pip, a Python package manager. We can then download any spaCy model using the python -m spacy download command followed by the model name. Here we choose "en_core_web_sm", the smallest English model. After downloading the model, we import spacy and create an nlp object by passing the model name in quotation marks to the spacy-dot-load function. spaCy has multiple trained models for the English language that are available for download from the spacy-dot-io website.
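The steps above might look like the following sketch. The two shell commands appear as comments because they run in a terminal, not in Python. The `load_pipeline` helper and its fallback to a blank English pipeline are hypothetical additions for this illustration, so the snippet still runs when the model has not been downloaded.

```python
# Run once in a terminal, outside Python:
#   python -m pip install spacy
#   python -m spacy download en_core_web_sm
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline by name (hypothetical helper)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed;
        # fall back to a tokenizer-only English pipeline for demonstration
        return spacy.blank("en")


nlp = load_pipeline()

# Lists the processing components; empty for the blank fallback pipeline
print(nlp.pipe_names)
```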

8\. Read and process text with spaCy
------------------------------------

03:40 - 04:14

Now that our NLP object is ready, we can move on to reading and processing text. The loaded spaCy model (nlp object) can process text and convert it into a Doc object, which is a container to store the processed text. The Doc object contains information like tokens, linguistic annotations, and relationships about the text. We'll learn about each of these later in the chapter.
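As a minimal sketch of the Doc as a container, the snippet below processes a sample sentence and inspects the result. It assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without a downloaded model; a blank pipeline only tokenizes, so the linguistic annotations mentioned above would need a trained model such as `en_core_web_sm`. The sample sentence is made up for the illustration.

```python
import spacy
from spacy.tokens import Doc

# Tokenizer-only pipeline; swap in spacy.load("en_core_web_sm") if installed
nlp = spacy.blank("en")

text = "spaCy stores processed text in a Doc container."
doc = nlp(text)

# Calling the nlp object on a string returns a Doc
print(isinstance(doc, Doc))

# The Doc keeps the original text and can be indexed like a sequence of tokens
print(doc.text)
print(len(doc))
print(doc[0].text, doc[-1].text)
```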

9\. spaCy in action
-------------------

04:14 - 05:04

Let's look at an example of processing text with spaCy. This example will use a preprocessing step known as tokenization. The first step is to read text, in this case the string "A spaCy pipeline object is created.". We convert this text into a Doc object by running a loaded spaCy model, nlp, on the text. Now, we can use a list comprehension to print all tokens of the input text by using token-dot-text for token in doc. A token is the smallest meaningful part of a text. The process of dividing a text into a list of meaningful tokens is called tokenization.
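The tokenization example described above can be sketched as follows. The video runs a loaded `en_core_web_sm` model; this sketch assumes a blank English pipeline instead so that it runs without a model download, which should tokenize this sentence the same way because tokenization is rule-based.

```python
import spacy

# Tokenizer-only English pipeline (assumption: stands in for the loaded model)
nlp = spacy.blank("en")

text = "A spaCy pipeline object is created."

# Running the pipeline on the text returns a Doc container
doc = nlp(text)

# Collect the text of every token with a list comprehension
tokens = [token.text for token in doc]
print(tokens)
# ['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
```

Note that the final period becomes its own token: tokenization splits punctuation from words rather than simply splitting on whitespace.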

10\. Let's practice!
--------------------

05:04 - 05:09

Let's practice what we've learned!

Doc container in spaCy
======================

The first step of a spaCy text processing pipeline is to convert a given text string into a `Doc` container, which stores the processed text. In this exercise, you'll practice loading a `spaCy` model, creating an `nlp` object, creating a `Doc` container, and processing a `text` string that is available for you.

The `en_core_web_sm` model is already downloaded.

Instructions
------------

-   Load `en_core_web_sm` and create an `nlp` object.
-   Create a `doc` container of the `text` string.
-   Create a list containing the text of each token in the `doc` container.

In [None]:
# Import spaCy, then load en_core_web_sm and create an nlp object
import spacy

nlp = spacy.load("en_core_web_sm")

# Create a Doc container for the text object
doc = nlp(text)

# Create a list containing the text of each token in the Doc container
print([token.text for token in doc])

NER use case
============

NLP has many applications across different industries, such as sentiment analysis, named entity recognition, and chatbots.

Is the following a correct definition for named entity recognition? 

*"Given a string of text, named entity recognition is identifying and categorizing entities in text."*

##### Answer the question

#### Possible Answers

Select one answer

-   False

[/] -   True

Tokenization with spaCy
=======================

In this exercise, you'll practice tokenizing text. You'll use the first review from the Amazon Fine Food Reviews dataset for this exercise. You can access this review by using the `text` object provided. 

The `en_core_web_sm` model is already loaded for you. You can access it by calling `nlp()`. You can use a list comprehension to compile output lists.

Instructions
------------

-   Store the `Doc` container for the pre-loaded review in a `document` object.
-   Store and review the text of every token in `document` in the variable `first_text_tokens`.

In [None]:
# Create a Doc container of the given text
document = nlp(text)

# Store and review the text of each token in the Doc container
first_text_tokens = [token.text for token in document]
print("First text tokens:\n", first_text_tokens, "\n")