1\. Customizing spaCy models
----------------------------

00:00 - 00:14

Welcome! We have learned to use spaCy model functionality such as POS taggers and NER. We'll now learn about situations where we might seek to customize spaCy models.

2\. Why train spaCy models?
---------------------------

00:14 - 01:42

spaCy models go a long way for general NLP use cases such as splitting a document into sentences, understanding sentence syntax, and extracting named entities. However, sometimes we work with text data from specific domains that spaCy models haven't seen during their training. For example, Twitter data can contain hashtags or emoticons, which may not have any specific meaning outside the Twitter platform. Additionally, tweets are usually just phrases rather than full sentences. As a result, we might observe low-quality sentence segmentation on this data if we use off-the-shelf spaCy models. Similarly, text data from the medical domain typically contains several named entities, such as drugs and diseases. We don't expect these entities to be classified accurately by existing spaCy NER models, because the models generally don't contain disease or drug entity labels and will therefore perform poorly on such domain data. In such scenarios, it is worthwhile to train a spaCy model using our own domain-specific text data. The snapshot shows example results from an NER model that was trained on medical domain data and hence performs well.

Go a long way for general NLP use cases

But may not have seen domain-specific data during their training, e.g.
- Twitter data
- Medical data

```
PAST MEDICAL HISTORY: Significant for history of pulmonary fibrosis[DISEASE] and atrial fibrillation[DISEASE]. He is status post bilateral lung transplant back in 2004 because of the pulmonary fibrosis[DISEASE].

ALLERGIES: There are no known allergies.

MEDICATIONS: Include multiple medications that are significant for his lung transplant including Prograf, CellCept[CHEMICAL], prednisone[CHEMICAL], omeprazole[CHEMICAL], Bactrim[CHEMICAL] which he is on chronically, folic acid[CHEMICAL], vitamin D[CHEMICAL], Mag-Ox, Toprol-XL, calcium[CHEMICAL] 500 mg[DOSAGE], vitamin B1, Centrum Silver, verapamil[CHEMICAL], and digoxin[CHEMICAL].
```

3\. Why train spaCy models?
---------------------------

01:42 - 02:06

We can usually make the model more accurate by showing it examples from our domain, and we often also want to predict categories specific to our problem. Before starting to train, we need to ask two questions. Do spaCy models perform well enough on our data? And does our domain include many labels that are absent from the spaCy models?

Better results on your specific domain

Essential for domain-specific text classification

Before starting to train, ask the following questions:

- Do spaCy models perform well enough on our data?
- Does our domain include many labels that are absent in spaCy models?

4\. Model performance on our data
---------------------------------

02:06 - 03:20

To determine whether training is needed, let's start with the question of whether existing spaCy models perform well enough on our data. If they do, we can use existing models in our NLP pipeline. However, there are multiple scenarios where the existing models do not perform as expected. For example, the `en_core_web_sm` model does not correctly classify Oxford Street in "The car was navigating to the Oxford Street." as a location with a GPE label. Instead, it identifies this location as an organization with an ORG label. This is likely because the model did not observe similar location examples during its training phase but might have observed Oxford in the names of organizations, so it confuses this GPE entity with one of type ORG. If we observe such behavior from a spaCy model, we should train the model further to improve its performance.

```python
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The car was navigating to the Oxford Street."
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```

• Do spaCy models perform well enough on our data?
• Oxford Street is not correctly classified with a GPE label:

`[('the Oxford Street', 'ORG')]`
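A spot check like the one above can be repeated over several entities by comparing predictions against labels we expect. The following is a minimal sketch, not part of the course code: the `expected` dictionary and the `find_mismatches` helper are hypothetical names introduced for illustration, and the prediction is copied from the output shown above so that the snippet runs without spaCy installed.

```python
# Hypothetical spot-check data: labels a human reviewer expects.
expected = {"the Oxford Street": "GPE"}

# Predictions in the shape produced by
# [(ent.text, ent.label_) for ent in doc.ents];
# this value is copied from the output shown above.
predicted = [("the Oxford Street", "ORG")]


def find_mismatches(predicted, expected):
    """Return (text, predicted_label, expected_label) for each wrong label."""
    mismatches = []
    for text, label in predicted:
        gold = expected.get(text)
        # Entities without an expected label are skipped, not counted as errors.
        if gold is not None and gold != label:
            mismatches.append((text, label, gold))
    return mismatches


print(find_mismatches(predicted, expected))
# → [('the Oxford Street', 'ORG', 'GPE')]
```

A non-empty result on a representative sample of our data is a signal that the existing model may need further training.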

5\. Output labels in spaCy models
---------------------------------

03:20 - 03:56

Before rushing to train our own models, we also need to check whether the existing spaCy models are missing output labels that our domain needs. The snapshot shows an example of NER entities from the general English domain at the top and an example from the medical domain at the bottom. The general-domain entities (LOC, ORG, DATE) that are used to train existing spaCy models are considerably different from medical domain entities (DISEASE, DOSAGE, CHEMICAL).

• Does our domain include many labels that are absent in spaCy models?

Market Analysis Text:
```
In fact, the Chinese[NORP] market has the three[CARDINAL] most influential names of the retail and tech space - Alibaba[GPE], Baidu[ORG], and Tencent[PERSON] (collectively touted as BAT[ORG]), and is betting big in the global AI[GPE] in retail industry space. The three[CARDINAL] giants which are claimed to have a cut-throat competition with the U.S.[GPE] (in terms of resources and capital) are positioning themselves to become the future AI[PERSON] platforms. The trio is also expanding in other Asian[NORP] countries and investing heavily in the U.S.[GPE] based AI[GPE] startups to leverage the power of AI[GPE]. Backed by such powerful initiatives and presence of these conglomerates, the market in APAC AI is forecast to be the fastest-growing one[CARDINAL], with an anticipated CAGR[PERSON] of 45%[PERCENT] over 2018-2024[DATE].
```

Medical History Text:
```
PAST MEDICAL HISTORY: Significant for history of pulmonary fibrosis[DISEASE] and atrial fibrillation[DISEASE]. He is status post bilateral lung transplant back in 2004 because of the pulmonary fibrosis[DISEASE].

ALLERGIES: There are no known allergies.

MEDICATIONS: Include multiple medications that are significant for his lung transplant including Prograf, CellCept[CHEMICAL], prednisone[CHEMICAL], omeprazole[CHEMICAL], Bactrim[CHEMICAL] which he is on chronically, folic acid[CHEMICAL], vitamin D[CHEMICAL], Mag-Ox, Toprol-XL, calcium[CHEMICAL] 500 mg[DOSAGE], vitamin B1, Centrum Silver, verapamil[CHEMICAL], and digoxin[CHEMICAL].
```
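The comparison between the two snapshots can also be made programmatically. The sketch below assumes the NER label set of the English core models (the OntoNotes scheme) and hard-codes it so that the snippet runs on its own; with spaCy and a model installed, the same labels should be available from `nlp.get_pipe("ner").labels`, which is worth checking against the model version in use. The `missing_labels` helper is a hypothetical name.

```python
# Assumed NER labels of the English core models (OntoNotes scheme).
# With spaCy loaded, compare against nlp.get_pipe("ner").labels instead.
MODEL_LABELS = {
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
}

# Labels that appear in the medical snapshot above.
DOMAIN_LABELS = {"DISEASE", "CHEMICAL", "DOSAGE"}


def missing_labels(domain_labels, model_labels):
    """Return the domain labels that the model cannot predict, sorted."""
    return sorted(set(domain_labels) - set(model_labels))


print(missing_labels(DOMAIN_LABELS, MODEL_LABELS))
# → ['CHEMICAL', 'DISEASE', 'DOSAGE']
```

Here every medical label is missing from the model, which is a strong hint that a custom model is needed.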

6\. Output labels in spaCy models
---------------------------------

03:56 - 04:20

It is clear that the existing spaCy models lack many of the output labels needed for an NER task on medical domain data and do not perform well on our data. In such a case, we first need to collect our domain-specific data, annotate it, and then either update an existing model or train a model from scratch with our data.

If we need custom model training, we follow these steps:

• Collect our domain specific data
• Annotate our data
• Decide whether to update an existing model or train a model from scratch
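To make the annotation step more concrete, here is a minimal sketch of one common way to represent NER annotations for spaCy: a text paired with a dictionary of `(start_char, end_char, label)` entity spans. The `annotate` helper is hypothetical and only locates the first occurrence of each phrase; real annotation is usually done with a dedicated tool, and the exact format expected by the training code should be checked against the spaCy version in use.

```python
def annotate(text, phrases):
    """Build a (text, {"entities": [(start, end, label), ...]}) pair.

    `phrases` is a list of (phrase, label) tuples. Only the first
    occurrence of each phrase in `text` is annotated.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        # End offset is exclusive, so text[start:end] == phrase.
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": sorted(entities)})


# Sentence adapted from the medical snapshot shown earlier.
text = "Significant for history of pulmonary fibrosis and atrial fibrillation."
example = annotate(
    text,
    [("pulmonary fibrosis", "DISEASE"), ("atrial fibrillation", "DISEASE")],
)

for start, end, label in example[1]["entities"]:
    print(text[start:end], label)
```

Storing character offsets rather than the phrase text keeps each annotation tied to an exact position in the text.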

7\. Let's practice!
-------------------

04:20 - 04:28

Great! Let's practice and then begin our journey of training spaCy models.

Training spaCy models
=====================

spaCy models go a long way for general NLP use cases such as splitting a document into sentences, understanding sentence syntax, and extracting named entities. However, sometimes you need to train a spaCy model on your own data.

When do you need to train a spaCy model? Please select all options that apply.

##### Answer the question

#### Possible Answers

Select all correct answers

[/] Models do not include many labels for your specific domain.

[] Models perform well on your data out-of-the-box.

[/] The accuracy of the spaCy model on your data is unacceptably low.

Model performance on your data
==============================

In this exercise, you will practice evaluating an existing model on your data. In this case, the aim is to examine model performance on a specific entity label, `PRODUCT`. If a model can accurately classify a large percentage of `PRODUCT` entities (e.g. more than 75%), you do not need to train the model on examples of `PRODUCT` entities. Otherwise, you should consider training the model to improve its performance on `PRODUCT` entity prediction.

You'll use two reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the `texts` list. 

The `en_core_web_sm` model is already loaded for you. You can access it by calling `nlp()`. The model has already been run on the `texts` list, and `documents`, a list of `Doc` containers, is available for your use.

Instructions
------------

-   Compile a `target_entities` list: for each `Doc` in `documents`, add a tuple of (entity text, entity label) for every entity whose text contains `Jumbo`.
-   For each tuple in `target_entities`, append `True` to a `correct_labels` list if the entity label (the second element of the tuple) is `PRODUCT`; otherwise append `False`.

```python
# `documents` is the preloaded list of Doc containers described above.

# Collect (entity text, entity label) tuples for entities containing "Jumbo"
target_entities = []
for doc in documents:
    target_entities.extend(
        [(ent.text, ent.label_) for ent in doc.ents if "Jumbo" in ent.text]
    )
print(target_entities)

# Append True to correct_labels if the entity label is PRODUCT, else False
correct_labels = []
for ent in target_entities:
    if ent[1] == "PRODUCT":
        correct_labels.append(True)
    else:
        correct_labels.append(False)
print(correct_labels)
```
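The exercise text suggests a threshold of 75% for deciding whether training is needed. A minimal sketch of that decision rule follows; the helper names and the sample list are hypothetical, and treating an empty list as 0% accuracy is an assumption rather than something the exercise specifies.

```python
def label_accuracy(correct_labels):
    """Fraction of True values; an empty list is treated as 0.0 (assumption)."""
    if not correct_labels:
        return 0.0
    return sum(correct_labels) / len(correct_labels)


def needs_training(correct_labels, threshold=0.75):
    """Return True unless accuracy is strictly above the threshold."""
    return label_accuracy(correct_labels) <= threshold


# Hypothetical sample; in the exercise this would be `correct_labels`.
sample = [False, True, False]
print(label_accuracy(sample), needs_training(sample))
```

With only a handful of entities the accuracy estimate is very coarse, so in practice a larger sample of labeled text gives a more reliable answer.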