1\. Customizing spaCy models
----------------------------

00:00 - 00:14

Welcome! We have learned to use spaCy model functionality such as POS taggers and NER. We'll now learn about situations where we might seek to customize spaCy models.

2\. Why train spaCy models?
---------------------------

00:14 - 01:42

spaCy models go a long way for general NLP use cases such as splitting a document into sentences, understanding sentence syntax, and extracting named entities. However, sometimes we work with text from specific domains that spaCy models haven't seen during their training. For example, Twitter data can contain hashtags or emoticons, which may not have any specific meaning outside the Twitter platform. Additionally, tweets are usually short phrases rather than full sentences. As a result, we might observe low-quality sentence segmentation on this data if we use off-the-shelf spaCy models. Similarly, text from the medical domain typically contains many named entities, such as drugs and diseases. We don't expect these entities to be classified accurately by existing spaCy NER models, because those models generally don't include disease or drug entity labels. In such scenarios, it is worthwhile to train a spaCy model using our own domain-specific text data. The snapshot shows example results from an NER model that was trained on medical domain data and hence performs well.

Go a long way for general NLP use cases

But may not have seen specific domains data during their training, e.g.
- Twitter data
- Medical data

```
PAST MEDICAL HISTORY: Significant for history of pulmonary fibrosis DISEASE and atrial fibrillation DISEASE. He is status post bilateral lung transplant back in 2004 because of the pulmonary fibrosis DISEASE.

ALLERGIES: There are no known allergies.

MEDICATIONS: Include multiple medications that are significant for his lung transplant including Prograf, CellCept CHEMICAL, prednisone CHEMICAL, omeprazole CHEMICAL, Bactrim CHEMICAL which he is on chronically, folic acid CHEMICAL, vitamin D CHEMICAL, Mag-Ox, Toprol-XL, calcium CHEMICAL 500 mg DOSAGE, vitamin B1, Centrum Silver, verapamil CHEMICAL, and digoxin CHEMICAL.
```

3\. Why train spaCy models?
---------------------------

01:42 - 02:06

We can usually make the model more accurate by showing it examples from our domain, and we often also want to predict categories specific to our problem. Before starting to train, we need to ask the following questions. Do spaCy models perform well enough on our data? And does our domain include many labels that are absent in the spaCy models?

Better results on your specific domain

Essential for domain specific text classification

Before starting to train, ask the following questions:

- Do spaCy models perform well enough on our data?
- Does our domain include many labels that are absent in spaCy models?

4\. Models performance on our data
----------------------------------

02:06 - 03:20

To determine if training is needed, let's start with the question of whether existing spaCy models perform well enough on our data. If they do, we can use existing models in our NLP pipeline. However, there are multiple scenarios where the existing models do not perform as expected. For example, the en_core_web_sm spaCy model does not correctly classify Oxford Street in "The car was navigating to the Oxford Street." as a location with a GPE label. Instead, it identifies this location as an organization with an ORG label. This is likely because the model did not observe similar location examples during its training phase but might have observed Oxford in the names of organizations, so it confuses this GPE entity with an ORG one. If we observe such behavior from a spaCy model, we should train the model further to improve its performance.

```python
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The car was navigating to the Oxford Street."
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```

• Do spaCy models perform well enough on our data?
• Oxford Street is not correctly classified with a GPE label:

`[('the Oxford Street', 'ORG')]`

5\. Output labels in spaCy models
---------------------------------

03:20 - 03:56

Before rushing to train our own models, we also need to confirm if there are missing output labels in the existing spaCy models or not. The snapshot shows an example of NER entities on the common English domain on the top and an example of medical domain on the bottom. Common domain entities (LOC, ORG, DATE) that are used for training of existing spaCy models are considerably different from medical domain entities (DISEASE, DOSAGE, CHEMICAL).

• Does our domain include many labels that are absent in spaCy models?

Market Analysis Text:
```
In fact, the Chinese[NORP] market has the three[CARDINAL] most influential names of the retail and tech space - Alibaba[GPE], Baidu[ORG], and Tencent[PERSON] (collectively touted as BAT[ORG]), and is betting big in the global AI[GPE] in retail industry space. The three[CARDINAL] giants which are claimed to have a cut-throat competition with the U.S.[GPE] (in terms of resources and capital) are positioning themselves to become the future AI[PERSON] platforms. The trio is also expanding in other Asian[NORP] countries and investing heavily in the U.S.[GPE] based AI[GPE] startups to leverage the power of AI[GPE]. Backed by such powerful initiatives and presence of these conglomerates, the market in APAC AI is forecast to be the fastest-growing one[CARDINAL], with an anticipated CAGR[PERSON] of 45%[PERCENT] over 2018-2024[DATE].
```

Medical History Text:
```
PAST MEDICAL HISTORY: Significant for history of pulmonary fibrosis[DISEASE] and atrial fibrillation[DISEASE]. He is status post bilateral lung transplant back in 2004 because of the pulmonary fibrosis[DISEASE].

ALLERGIES: There are no known allergies.

MEDICATIONS: Include multiple medications that are significant for his lung transplant including Prograf, CellCept[CHEMICAL], prednisone[CHEMICAL], omeprazole[CHEMICAL], Bactrim[CHEMICAL] which he is on chronically, folic acid[CHEMICAL], vitamin D[CHEMICAL], Mag-Ox, Toprol-XL, calcium[CHEMICAL] 500 mg[DOSAGE], vitamin B1, Centrum Silver, verapamil[CHEMICAL], and digoxin[CHEMICAL].
```

6\. Output labels in spaCy models
---------------------------------

03:56 - 04:20

It is clear that the existing spaCy models do not have many of the output labels for an NER task on medical domain data and do not perform well on our data. In such a case, we first need to collect our domain-specific data, annotate it, and then either update an existing model or train a model from scratch with our data.

If we need custom model training, we follow these steps:

• Collect our domain specific data
• Annotate our data 
• Decide whether to update an existing model or train a model from scratch

7\. Let's practice!
-------------------

04:20 - 04:28

Great! Let's practice and then begin our journey of training spaCy models.

Training spaCy models
=====================

spaCy models go a long way for general NLP use cases such as splitting a document into sentences, understanding sentence syntax, and extracting named entities. However, sometimes you need to train a spaCy model on your own data.

When do you need to train a spaCy model? Please select all options that apply.

##### Answer the question

#### Possible Answers

Select all correct answers

[/] Models do not include many labels for your specific domain.

[] Models perform well on your data out-of-the-box.

[/] The accuracy of the spaCy model on your data is unacceptably low.

Model performance on your data
==============================

In this exercise, you will practice evaluating an existing model on your data. In this case, the aim is to examine model performance on a specific entity label, `PRODUCT`. If a model can accurately classify a large percentage of `PRODUCT` entities (e.g. more than 75%), you do not need to train the model on examples of `PRODUCT` entities. Otherwise, you should consider training the model to improve its performance on `PRODUCT` entity prediction.

You'll use two reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the `texts` list. 

The `en_core_web_sm` model is already loaded for you. You can access it by calling `nlp()`. The model has already been run on the `texts` list, and `documents`, a list of `Doc` containers, is available for your use.

Instructions
------------

-   Compile a `target_entities` list, of all the entities for each of the `documents`, and append a tuple of (entities text, entities label) only if `Jumbo` is in the entity text.
-   For any tuple in the `target_entities`, append `True` to a `correct_labels` list if the entity label (second attribute in the tuple) is `PRODUCT`, otherwise append `False`

```python
# Append a tuple of (entities text, entities label) if Jumbo is in the entity
target_entities = []
for doc in documents:
    target_entities.extend([(ent.text, ent.label_) for ent in doc.ents if "Jumbo" in ent.text])
print(target_entities)

# Append True to the correct_labels list if the entity label is `PRODUCT`
correct_labels = []
for ent in target_entities:
    if ent[1] == "PRODUCT":
        correct_labels.append(True)
    else:
        correct_labels.append(False)
print(correct_labels)
```
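
The 75% rule of thumb from the exercise description can be checked with a small helper. This is a minimal sketch in plain Python: the function name `label_accuracy` and the sample tuples are hypothetical, and the input is assumed to be a list of (entity text, entity label) tuples shaped like `target_entities`.

```python
def label_accuracy(entities, expected_label):
    """Return the fraction of (text, label) tuples whose label matches expected_label."""
    if not entities:
        return 0.0
    correct = sum(1 for _, label in entities if label == expected_label)
    return correct / len(entities)


# Hypothetical model output, for illustration only
sample_entities = [("Jumbo", "PERSON"), ("Jumbo Salted Peanuts", "PRODUCT")]
accuracy = label_accuracy(sample_entities, "PRODUCT")
needs_training = accuracy < 0.75  # threshold taken from the exercise text
print(f"accuracy={accuracy:.2f}, needs_training={needs_training}")
```

With only a handful of entities the fraction is a rough signal at best, so a larger sample of your own data gives a more reliable picture.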

1\. Training data preparation
-----------------------------

00:00 - 00:11

Welcome! Now that we have learned how to identify whether training a spaCy model is necessary, let's learn how to prepare training data.

2\. Training steps
------------------

00:11 - 00:57

spaCy allows us to update existing models using examples from our own annotated data. To do so, we initialize a spaCy model, either with weights from an existing model or with random values. Next, we predict a batch of examples with the current weights. The model then checks the predictions against the correct answers we provide and adjusts the weights to achieve better results. Optimizer objects are used at this stage; we will learn more about them later on. We then move on to the next batch of examples, and spaCy continues calling the model to predict another batch of examples in the data and refine the weights.

1. Annotate and prepare input data
2. Initialize the model weights
3. Predict a few examples with the current weights 
4. Compare prediction with correct answers
5. Use optimizer to calculate weights that improve model performance
6. Update weights slightly
7. Go back to step 3.
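
To make steps 2 to 7 concrete, here is a toy illustration in plain Python. It is not how spaCy works internally: the one-weight linear model, the squared-error loss, the learning rate, and all names are assumptions chosen only to show the predict, compare, and update cycle.

```python
import random

# Toy "annotated" data: inputs paired with correct answers (here y = 2 * x)
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

weight = 0.0          # step 2: initialize the model weight
learning_rate = 0.01  # keeps each update small (step 6)
rng = random.Random(0)

for epoch in range(50):
    rng.shuffle(training_data)
    for x, y_true in training_data:
        y_pred = weight * x                 # step 3: predict with the current weight
        error = y_pred - y_true             # step 4: compare with the correct answer
        gradient = 2 * error * x            # step 5: direction that reduces the squared error
        weight -= learning_rate * gradient  # step 6: update the weight slightly
    # step 7: the next epoch goes back to predicting

print(round(weight, 4))
```

The learned weight ends up very close to 2, the value that generated the toy data. A real NER model has many weights and a different loss, but the loop has the same shape.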

3\. Annotating and preparing data
---------------------------------

00:57 - 01:50

It is clear now that the first step of training a model is always preparing training data. spaCy model training code works with dictionaries. After collecting data, we annotate data in the required format for a spaCy model. Annotation means labeling the intent, entities, POS tags, and so on. We can see an example of an annotated data record for a NER task in the medical domain. The annotated data has two key value pairs. The first attribute records the input text with a "sentence" key and the second attribute captures all the labeled entities of the input text with an "entities" key. In this instance, there is only one labeled entity with the entity type of "Medicine".

• First step is to prepare training data in required format
• After collecting data, we annotate it
• Annotation means labeling the intent, entities, etc.
• This is an example of annotated data:

```python
annotated_data = {
    "sentence": "An antiviral drugs used against influenza is neuraminidase inhibitors.",
    "entities": {
        "label": "Medicine",
        "value": "neuraminidase inhibitors",
    }
}
```

4\. Annotating and preparing data
---------------------------------

01:50 - 02:22

Let's check another example of an annotated data record for a NER task of the common English language. In this instance, the annotated data has two entities for the given text. In such scenarios, a list of dictionaries will be stored for the entities attribute. For example, the first element captures the Bill Gates entity with the type PERSON and the second element shows the SFO Airport entity with the type LOC (location).

• Here's another example of annotated data:

```python
annotated_data = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [{"label": "PERSON", "value": "Bill Gates"},
                 {"label": "LOC", "value": "SFO Airport"}]
}
```

5\. spaCy training data format
------------------------------

02:22 - 03:05

The goal of data annotation is to prepare training data and point the spaCy model to what we want the model to learn. The annotations have to be stored in a dictionary, and we also need to provide the start and end character offsets of each text span along with its label. Let's see an example of a training dataset. This dataset consists of three example pairs for a named entity recognition task. Each example pair includes a sentence as the first element. The second element of the pair is a dictionary holding a list of annotated entities with their corresponding start and end characters and labels.

• Data annotation prepares training data for what we want the model to learn
• Each training example pairs a text with a dictionary of annotations:

```python
training_data = [
("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
("I'm going to Sam's house.", {"entities": [(13,18, "PERSON"), (19, 24, "GPE")]}),
("I will go.", {"entities": []})
]
```

Three example pairs:
• Each example pair includes a sentence as the first element
• Pair's second element is list of annotated entities and start and end characters
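
Moving from the label/value records shown earlier to this offset-based format can be sketched in plain Python. The helper name `to_training_example` is hypothetical, and the sketch assumes each entity value occurs in the sentence and that its first occurrence is the intended one.

```python
def to_training_example(record):
    """Convert a {"sentence", "entities"} record into a (text, annotations) pair."""
    text = record["sentence"]
    entities = record["entities"]
    # A single entity may be stored as one dictionary instead of a list
    if isinstance(entities, dict):
        entities = [entities]
    spans = []
    for entity in entities:
        start = text.find(entity["value"])  # first occurrence only
        if start == -1:
            raise ValueError(f"{entity['value']!r} not found in text")
        spans.append((start, start + len(entity["value"]), entity["label"]))
    return (text, {"entities": spans})


annotated_data = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [{"label": "PERSON", "value": "Bill Gates"},
                 {"label": "LOC", "value": "SFO Airport"}],
}
example = to_training_example(annotated_data)
print(example)
```

If an entity value can appear more than once in a sentence, the record would need to carry explicit offsets, because a text search alone cannot tell the occurrences apart.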

6\. Example object data for training
------------------------------------

03:05 - 04:11

We cannot feed the raw text and annotations directly to spaCy and need to create an Example object for each training example. Let's check an example for a NER model. Let's assume we have a training data point we want to feed to our NER component to ensure the model will correctly predict Austin as GPE (Geopolitical entity). First, we will convert the associated text to a Doc container, and then use the Example class from spaCy to convert the Doc container and the relevant annotation to an Example object which is compatible for training with spaCy. For this purpose, we use Example-dot-from_dict() method and pass two arguments: the Doc container and the annotations dictionary. We can view attributes that are processed and stored at the example object by using the example-dot-to_dict() method.

• We cannot feed the raw text directly to spaCy
• We need to create an Example object for each training example

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

doc = nlp("I will visit you in Austin.")
annotations = {"entities": [(20, 26, "GPE")]}

example_sentence = Example.from_dict(doc, annotations)
print(example_sentence.to_dict())
```

7\. Let's practice!
-------------------

04:11 - 04:24

Great! We learned about the training data format and the Example object, which converts training data to a format compatible with training a spaCy model. Let's practice what we have learned.

#### Training steps

You may work on very specific domains that spaCy models didn't see during training, such as the medical domain. spaCy allows you to update existing models with more examples from your own data to improve model performance on that data.

-   Organize and order the given steps for training a spaCy model for a single epoch.

1. Annotate and/or prepare input data
2. Initialize model weights randomly or from an existing model
3. Predict a few examples using the current weights
4. Compare predictions with correct answers
5. Use optimizer to calculate weights that increase the chance of correct predictions
6. Update weights slightly

Annotation and preparing training data
======================================

After collecting data, you can annotate data in the required format for a spaCy model. In this exercise, you will practice forming the correct annotated data record for an NER task in the medical domain.

A `sentence` and two entities are available for you to use: `entity_1`, with the text `chest pain` and the type `SYMPTOM`, and `entity_2`, with the text `hyperthyroidism` and the type `DISEASE`.

Instructions
------------

-   Complete the `annotated_data` record in the correct format.
-   Extract start and end characters of each entity and store as the corresponding variables.
-   Store the same input sentence and its entities in the proper training format as `training_data`.

```python
text = "A patient with chest pain had hyperthyroidism."
entity_1 = "chest pain"
entity_2 = "hyperthyroidism"

# Store annotated data information in the correct format
annotated_data = {"sentence": text, "entities": [{"label": "SYMPTOM", "value": entity_1}, {"label": "DISEASE", "value": entity_2}]}

# Extract start and end characters of each entity
entity_1_start_char = text.index(entity_1)
entity_1_end_char = entity_1_start_char + len(entity_1)
entity_2_start_char = text.index(entity_2)
entity_2_end_char = entity_2_start_char + len(entity_2)

# Store the same input information in the proper format for training
training_data = [(text, {"entities": [(entity_1_start_char, entity_1_end_char, "SYMPTOM"),
                                      (entity_2_start_char, entity_2_end_char, "DISEASE")]})]
print(training_data)
```

Compatible training data
========================

Recall that you cannot feed the raw text directly to `spaCy`. Instead, you need to create an `Example` object for each training example. In this exercise, you will practice converting a `training_data` with a single annotated sentence into a list of `Example` objects.

The `en_core_web_sm` model is already loaded and ready for use as `nlp`. The `Example` class is also imported for your use.

Instructions
------------

-   Iterate through the text and annotations in the `training_data`, convert the text to a `Doc` container and store it at `doc`.
-   Create an `Example` object using the `doc` object and the annotations of each training data point, and store it at `example_sentence`.
-   Append `example_sentence` to a list of `all_examples`.

```python
example_text = 'A patient with chest pain had hyperthyroidism.'
training_data = [(example_text, {'entities': [(15, 25, 'SYMPTOM'), (30, 45, 'DISEASE')]})]

all_examples = []
# Iterate through text and annotations and convert text to a Doc container
for text, annotations in training_data:
    doc = nlp(text)

    # Create an Example object from the doc container and annotations
    example_sentence = Example.from_dict(doc, annotations)
    print(example_sentence.to_dict(), "\n")

    # Append the Example object to the list of all examples
    all_examples.append(example_sentence)

print("Number of formatted training data: ", len(all_examples))
```

1\. Training with spaCy
-----------------------

00:00 - 00:06

Welcome! Let's learn how we train spaCy models for a NER task.

2\. Training steps
------------------

00:06 - 00:48

We previously learned that a spaCy model may not work well on a given dataset. One solution is to train the model on our data. In this video, we will learn how to train a model on our data after annotating and preparing training data and disabling all other pipeline components. For example, to train an NER component, we need to disable all other pipeline components, such as the POS tagger and dependency parser. We then feed our examples to the training procedure and evaluate the new NER model's performance. Let's learn more about each of these steps.

- Annotate and prepare input data
- Disable other pipeline components
- Train a model for a few epochs
- Evaluate model performance
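
For the last step, one common way to evaluate an NER model is to compare predicted entity spans with hand-labeled ones and compute precision, recall, and F1. The sketch below is plain Python with hypothetical names and made-up spans. It counts a prediction as correct only if start, end, and label all match exactly, which is an assumption about the scoring scheme. spaCy also provides an `nlp.evaluate()` method that works on `Example` objects.

```python
def ner_scores(gold_spans, predicted_spans):
    """Compute exact-match precision, recall and F1 for (start, end, label) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model predictions for one sentence
gold = [(15, 25, "SYMPTOM"), (30, 45, "DISEASE")]
predicted = [(15, 25, "SYMPTOM"), (30, 45, "SYMPTOM")]
precision, recall, f1 = ner_scores(gold, predicted)
print(precision, recall, f1)
```

Here the second prediction has the right span but the wrong label, so it counts against both precision and recall.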


3\. Disabling other pipeline components
---------------------------------------

00:48 - 01:28

It is necessary to disable other pipeline components of an nlp model in order to only train the intended component. For example, if we want to train an NER model we have to ensure any other components are disabled. For this purpose, we can use nlp-dot-disable_pipes() method given a list of other_pipes (all other pipeline components). Other_pipes is compiled by looping through each pipe name of nlp-dot-pipe_names and checking if the pipe name is not the same as ner.

• Disable all pipeline components except NER:

```python
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*other_pipes)
```

4\. Model training procedure
----------------------------

01:28 - 03:15

In the training procedure, we go over the training data several times. An epoch is one complete pass of the learning algorithm through the entire training dataset, so the number of epochs is the number of such passes. In each epoch, the training code updates the weights of the model by a small amount, using an optimizer object on randomly shuffled training data. Optimizers are functions that update the model weights with the aim of reducing prediction errors and improving the accuracy of the model. We can create an optimizer object using the nlp-dot-create_optimizer() method. In each epoch, we first shuffle training_data, a list of text and annotation tuples, by using the random-dot-shuffle() method. Next, for each training data point, which is a tuple of a text and annotations, we create an Example object from the Doc container of the text and the annotations by using the Example-dot-from_dict() method. The Example object is then used to update the nlp model weights by calling the nlp-dot-update() method and passing a list containing the Example object, the optimizer object, and a losses dictionary to track the model's loss during training. Loss is a number indicating how bad the model's prediction is on a single example. The procedure then continues with the next training data points.

• Go over the training set several times; one iteration is called an epoch.
• In each epoch, update the weights of the model by a small amount.
• Optimizers update the model weights.

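```python
import random

from spacy.training import Example

# Assumes `nlp`, `training_data` (list of (text, annotations) tuples) and `epochs` exist
optimizer = nlp.create_optimizer()

losses = {}
for i in range(epochs):
    # Shuffle the (text, annotations) tuples at the start of every epoch
    random.shuffle(training_data)
    for text, annotation in training_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotation)
        # Update the weights and accumulate the loss in `losses`
        nlp.update([example], sgd=optimizer, losses=losses)
```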
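
The loop above updates the model one example at a time. When there is more data, examples are usually grouped into small batches first. The generator below is a plain-Python sketch of that idea with a hypothetical name, `minibatches`; spaCy ships its own `minibatch` helper in `spacy.util`, which would normally be used instead.

```python
import random


def minibatches(items, batch_size, rng):
    """Yield shuffled batches of at most batch_size items; the input is not modified."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 18, "PERSON"), (19, 24, "GPE")]}),
    ("I will go.", {"entities": []}),
]
rng = random.Random(42)  # fixed seed so runs are repeatable
batches = list(minibatches(training_data, batch_size=2, rng=rng))
for batch in batches:
    print([text for text, _ in batch])
```

Each batch would then be converted to a list of `Example` objects and passed to a single `nlp.update()` call.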

5\. Save and load a trained model
---------------------------------

03:15 - 04:20

After we have trained a model, the next step is to test the model. For this purpose, we need to save it and later load it. We use the nlp-dot-get_pipe() method to get the trained pipeline component. In this example, we trained an NER model, so we get the NER component and save it to disk using the ner-dot-to_disk() method, passing a model name. Later, we load a spaCy model and add a blank NER component to its pipeline by calling the nlp-dot-add_pipe() method with the "ner" factory name and a name for the new component. In spaCy v3, add_pipe() expects the factory name as a string and returns the created component; passing a component object built with nlp-dot-create_pipe(), as was done in spaCy v2, raises an error. Lastly, we load the trained NER model from disk by calling the ner-dot-from_disk() method on the added component.

• Save a trained NER model:

```python
ner = nlp.get_pipe("ner")
ner.to_disk("<ner model name>")
```

• Load the saved model:

```python
# spaCy v3: add_pipe() takes the factory name and returns the new component
ner = nlp.add_pipe("ner", name="<ner model name>")
ner.from_disk("<ner model name>")
```

6\. Model for inference
-----------------------

04:20 - 04:37

Once a trained model is saved, it can be loaded as nlp. Then, we can use the model to find entities of a given text. We can see an example of how to apply the NER model and store entities' texts and labels for a text.

• Use a saved model at inference.

• Apply NER model and store tuples of (entity text, entity label):

```python
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
```
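
Once the (entity text, entity label) tuples are collected, it is often convenient to group them by label for inspection. This is a small plain-Python sketch; the helper name `group_by_label` and the sample tuples are hypothetical and stand in for real model output.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each entity label to the list of entity texts found with that label."""
    grouped = defaultdict(list)
    for entity_text, label in entities:
        grouped[label].append(entity_text)
    return dict(grouped)


# Hypothetical (entity text, entity label) tuples from a medical NER model
entities = [("pulmonary fibrosis", "DISEASE"), ("prednisone", "CHEMICAL"),
            ("atrial fibrillation", "DISEASE")]
grouped = group_by_label(entities)
print(grouped)
```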

7\. Let's practice!
-------------------

04:37 - 04:41

Let's practice our learnings.

Training preparation steps
==========================

Before and during training of a `spaCy` model, you'll need to (1) disable other pipeline components in order to only train the intended component and (2) convert a `Doc` container of a training data point and its corresponding `annotations` into an `Example` object.

In this exercise, you will practice these two steps by using a pre-loaded `en_core_web_sm` model, which is accessible as `nlp`. The `Example` class is already imported, and a `text` string and related `annotations` are also available for your use.

Instructions
------------

-   Disable all pipeline components of the `nlp` model except `ner`.
-   Convert a `text` string and its `annotations` to the correct format usable for training.

```python
nlp = spacy.load("en_core_web_sm")

# Disable all pipeline components of nlp except `ner`
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)

# Convert a text and its annotations to the correct format usable for training
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
print("Example object for training: \n", example.to_dict())
```

Train an existing NER model
===========================

A spaCy model may not work well on a given dataset. One solution is to train the model on our data. In this exercise, you will practice training an NER model in order to improve its prediction performance.

A spaCy `en_core_web_sm` model is accessible as `nlp`. It is not able to correctly predict `house` as an entity in a `test` string.

Given a `training_data`, write the steps to update this model while iterating through the data two times. The other pipelines are already disabled and `optimizer` is also ready to be used. Number of epochs is already set to 2.

Instructions
------------

-   Use the `optimizer` object and, for each epoch, shuffle the dataset using the `random` package and create an `Example` object.
-   Update the `nlp` model using the `.update()` method and set the `sgd` argument to use the optimizer.

```python
nlp = spacy.load("en_core_web_sm")
print("Before training: ", [(ent.text, ent.label_) for ent in nlp(test).ents])
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)
optimizer = nlp.create_optimizer()

# For each epoch, shuffle the training data using the random package
for i in range(epochs):
    random.shuffle(training_data)
    for text, annotations in training_data:
        doc = nlp.make_doc(text)
        # Create an Example object, then update the nlp model with sgd set to the optimizer
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer)
print("After training: ", [(ent.text, ent.label_) for ent in nlp(test).ents])
```

Training a spaCy model from scratch
===================================

spaCy provides a very clean and efficient approach to train your own models. In this exercise, you will train a NER model from scratch on a real-world corpus (CORD-19 data).

Training data is available in the right format as `training_data`. In this exercise, you will use a given list of labels ("Pathogen", "MedicalCondition", "Medicine") stored in `labels` and a blank English model (`nlp`) with an NER component. The intended medical `labels` will be added to the NER pipeline, and then you can train the model for one epoch. You can use the pre-imported `Example` class to convert the training data to the required format. To track model training, you can pass a `losses` dictionary to the `.update()` method and review the training loss.

Instructions
------------

-   Create a blank spaCy model and add an NER component to the model. 
-   Disable other pipeline components, use the created `optimizer` object and update the model weights using converted data to the `Example` format.

```python
# Load a blank English model, add NER component, add given labels to the ner pipeline
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for ent in labels:
    ner.add_label(ent)

# Disable other pipeline components, then complete and run the training loop
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
nlp.disable_pipes(*other_pipes)
losses = {}
# In spaCy v3, begin_training() is a deprecated alias of nlp.initialize()
optimizer = nlp.begin_training()
for text, annotation in training_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotation)
    nlp.update([example], sgd=optimizer, losses=losses)
    print(losses)
```
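
When the same `losses` dictionary is passed to every `nlp.update()` call, spaCy adds each new loss to the value already stored, so the printed numbers are running totals and not per-example losses. The plain-Python sketch below recovers the per-step values; the helper name `per_step_losses` and the sample totals are hypothetical.

```python
def per_step_losses(cumulative):
    """Turn a list of running loss totals into the loss added at each step."""
    steps = []
    previous = 0.0
    for total in cumulative:
        steps.append(total - previous)
        previous = total
    return steps


# Hypothetical running totals, e.g. collected from losses["ner"] after each update
running_totals = [4.0, 7.0, 9.0, 10.0]
step_losses = per_step_losses(running_totals)
print(step_losses)
```

A shrinking per-step loss is a rough sign that the model is fitting the training data; it says nothing yet about performance on unseen text.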

1\. Wrap-up
-----------

00:00 - 00:08

Well done! We have reached the final video of this course. Let's briefly review what we have learned.

2\. Chapter 1 - Introduction to NLP and spaCy
---------------------------------------------

00:08 - 00:26

In chapter one, we learned about NLP, some of its use cases, and how to use a spaCy pipeline to perform various natural language processing tasks such as tokenization, sentence segmentation, and named entity recognition.

```python
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Create text processing pipeline
text = "Input text here" 

# Process text through pipeline
doc = nlp(text)

# The pipeline includes:
# - Tokenizer: Splits text into tokens
# - Tagger: Assigns part-of-speech tags 
# - Parser: Assigns dependency labels
# - NER: Named Entity Recognition
# - Additional components (...)

# Returns processed Doc object
```

Pipeline Diagram:
```
Text -> [nlp] -> Doc
        |
        Tokenizer -> Tagger -> Parser -> NER -> ...
```

3\. Chapter 2 - spaCy linguistic annotations and word vectors
-------------------------------------------------------------

00:26 - 00:43

In chapter two, we learned about linguistic features, word vectors, analogies, and word vector operations. We also learned how to use spaCy to extract word vectors and find semantically similar terms.

```python
import spacy

# Load language model with word vectors
nlp = spacy.load('en_core_web_lg')

# Example words
words = ['man', 'woman', 'king', 'queen']
words2 = ['walking', 'walked', 'swimming', 'swam']

# Get tokens
tokens = [nlp(word)[0] for word in words]
tokens2 = [nlp(word)[0] for word in words2]

# Calculate similarities using word vectors
# The word vectors capture semantic relationships:
# woman - man ≈ queen - king  
# walking - walked ≈ swimming - swam

# Plot word vectors (simplified 2D visualization)
# Left plot: Gender-royalty relationship
# Right plot: Present-past tense relationship
```

Word Vector Relationships:
```
Left plot:
woman (275, -200) -> queen (200, -100)
man (135, -250) -> king (60, -150)

Right plot: 
swimming (275, 200) -> swam (75, 200)
walking (50, -200) -> walked (-150, 0)
```
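
The left plot can be turned into a small worked analogy in plain Python. The coordinates below are the simplified 2-D points listed above, not real spaCy word vectors, and the helper names `analogy` and `nearest_word` are hypothetical.

```python
import math

# 2-D coordinates read off the simplified left plot, not real spaCy vectors
vectors = {
    "woman": (275.0, -200.0),
    "queen": (200.0, -100.0),
    "man": (135.0, -250.0),
    "king": (60.0, -150.0),
}


def analogy(a, b, c):
    """Return the point b - a + c, i.e. 'a is to b as c is to ?'."""
    return tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))


def nearest_word(point):
    """Return the word whose coordinates are closest to point."""
    return min(vectors, key=lambda word: math.dist(vectors[word], point))


target = analogy("man", "king", "woman")
answer = nearest_word(target)
print(target, answer)
```

With these plotted points, king - man + woman lands exactly on queen. With real high-dimensional word vectors the match is only approximate, which is why the nearest neighbor is used.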

4\. Chapter 3 - Data analysis with spaCy
----------------------------------------

00:43 - 01:02

In the third chapter, we learned multiple approaches for rule-based information extraction using the EntityRuler, Matcher, and PhraseMatcher classes in spaCy and Python's RegEx package (`re`). Examples of using Matcher and PhraseMatcher are shown.

```python
from spacy.matcher import Matcher, PhraseMatcher

# Using Matcher (assumes `nlp` is a loaded spaCy pipeline)
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])

# Using PhraseMatcher (assumes `terms` is a list of phrase strings)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)
```

5\. Chapter 4 - Customizing spaCy models
----------------------------------------

01:02 - 01:17

And finally, we learned when we may need to train a spaCy model, and how to prepare data, train spaCy models and use trained models at inference.

```
PAST MEDICAL HISTORY: Significant for history of [pulmonary fibrosis](DISEASE) and [atrial fibrillation](DISEASE). He is status post bilateral lung transplant back in 2004 because of the [pulmonary fibrosis](DISEASE).

ALLERGIES: There are no known allergies.  

MEDICATIONS: Include multiple medications that are significant for his lung transplant including Prograf, [CellCept](CHEMICAL), [prednisone](CHEMICAL), [omeprazole](CHEMICAL), [Bactrim](CHEMICAL) which he is on chronically, [folic acid](CHEMICAL), [vitamin D](CHEMICAL), Mag-Ox, Toprol-XL, [calcium](CHEMICAL) [500 mg](DOSAGE), vitamin B1, Centrum Silver, [verapamil](CHEMICAL), and [digoxin](CHEMICAL).
```

Annotations:
- Green highlight: DISEASE
- Orange highlight: CHEMICAL  
- Blue highlight: DOSAGE

6\. Recommended resources
-------------------------

01:17 - 01:28

That was a summary of what we have learned about spaCy. DataCamp has many useful resources for you to continue your learnings in AI and NLP. This is a list of a few recommended courses.

- [Introduction to Deep Learning in Python](https://app.datacamp.com/learn/courses/introduction-to-deep-learning-in-python)
- [Introduction to Deep Learning with PyTorch](https://app.datacamp.com/learn/courses/introduction-to-deep-learning-with-pytorch)
- [Introduction to ChatGPT](https://app.datacamp.com/learn/courses/introduction-to-chatgpt)


7\. Congratulations!
--------------------

01:28 - 01:36

It has been a pleasure to work with you. Thank you for the time that you have dedicated to this course.