# Text preprocessing, POS tagging and NER

1\. Tokenization and Lemmatization

----------------------------------

00:00 - 00:06

In NLP, we usually have to deal with texts from a variety of sources. For instance,

2\. Text sources

----------------

00:06 - 00:22

it can be a news article where the text is grammatically correct and proofread. It could be tweets containing shorthand and hashtags. It could also be comments on YouTube, where people have a tendency to abuse capital letters and punctuation.

- News articles
- Tweets
- Comments

3\. Making text machine friendly

--------------------------------

00:22 - 01:03

It is important that we standardize these texts into a machine-friendly format. We want our models to treat similar words as the same. Consider the words Dogs and dog. Strictly speaking, they are different strings. However, they denote the same thing. Similarly, reduction, reducing and reduce should also be standardized to the same string regardless of their form and case usage. Other examples include don't and do not, and won't and will not. In the next couple of lessons, we will learn techniques to achieve this.

- `Dogs`, `dog`
- `reduction`, `REDUCING`, `Reduce`
- `don't`, `do not`
- `won't`, `will not`
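
As a rough illustration of the idea, the sketch below lowercases text and expands the two contractions mentioned above. The `CONTRACTIONS` table and the `normalize` function are hypothetical names introduced here, not part of any library, and the table is far too small for real use.

```python
# Hypothetical, deliberately tiny contraction table (an assumption for
# illustration; a real table would need many more entries).
CONTRACTIONS = {"don't": "do not", "won't": "will not"}


def normalize(text):
    """Lowercase text and expand the contractions listed in CONTRACTIONS."""
    lowered = text.lower()
    for short, full in CONTRACTIONS.items():
        lowered = lowered.replace(short, full)
    return lowered


print(normalize("REDUCING"))      # reducing
print(normalize("I DON'T know"))  # i do not know
print(normalize("Dogs"))          # dogs
```

Note that lowercasing alone leaves `dogs` and `reducing` untouched; mapping them to `dog` and `reduce` is the job of lemmatization, covered later in this lesson.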

4\. Text preprocessing techniques

---------------------------------

01:03 - 01:31

The text preprocessing techniques you use depend on the application you're working on. We'll be covering the common ones, including converting words into lowercase, removing unnecessary whitespace, removing punctuation, removing commonly occurring words or stopwords, expanding contracted words like don't, and removing special characters such as numbers and emojis.

- Converting words into lowercase
- Removing leading and trailing whitespaces
- Removing punctuation
- Removing stopwords
- Expanding contractions
- Removing special characters (numbers, emojis, etc.)
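
A few of these steps can be sketched with the standard library alone. The function name `basic_clean` is an assumption made for this example, and it only handles ASCII punctuation, so treat it as a minimal illustration rather than a complete cleaner.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(PUNCT_TABLE)
    # split() with no argument also discards leading/trailing whitespace.
    return " ".join(text.split())


print(basic_clean("  Hello,   World!  "))  # hello world
```

The order of steps matters: if punctuation is stripped before contractions are expanded, `don't` turns into `dont`, which is why contraction handling usually comes first.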

5\. Tokenization

----------------

01:31 - 02:21

To do this, we must first understand tokenization. Tokenization is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and the rules for producing them are specific to a particular language. In this course, we will primarily focus on word and punctuation tokens. For instance, consider this sentence. Tokenizing it into its constituent words and punctuation marks yields the following list of tokens. Tokenization also involves splitting contracted words. Therefore, a word like don't gets decomposed into two tokens: do and n't.

"I have a dog. His name is Hachi."
Tokens:
```python
["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]
```

6\. Tokenization using spaCy

----------------------------

02:21 - 03:21

To perform tokenization in Python, we will use the spaCy library. We first import the spacy library. Next, we load a pre-trained English model 'en_core_web_sm' using spacy.load(). This returns a Language object that has the know-how to perform tokenization. This is stored in the variable nlp. Let's now define a string we want to tokenize. We pass this string into nlp to generate a spaCy Doc object. We store this in a variable named doc. This Doc object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens using a list comprehension, as shown. This essentially loops over doc and extracts the text of each token in each iteration. The result is as follows.

```python
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)
```

Output:
```python
['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']
```
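
To get a feel for what a tokenizer does under the hood, here is a much cruder sketch that uses a regular expression instead of spaCy. The names `TOKEN_RE` and `simple_tokenize` are made up for this example. It separates words from punctuation but knows nothing about contractions, which is one reason a trained tokenizer is preferable.

```python
import re

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (no contraction handling)."""
    return TOKEN_RE.findall(text)


print(simple_tokenize("I have a dog. His name is Hachi."))
# ['I', 'have', 'a', 'dog', '.', 'His', 'name', 'is', 'Hachi', '.']
print(simple_tokenize("don't"))
# ['don', "'", 't']
```

Compare the second result with spaCy's `do` and `n't`: the regex splits at the apostrophe, while spaCy applies language-specific rules.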

7\. Lemmatization

-----------------

03:21 - 04:07

Lemmatization is the process of converting a word into its lowercased base form or lemma. This is an extremely powerful process of standardization. For instance, the words reducing, reduces, reduced and reduction, when lemmatized, are all converted into the base form reduce. Similarly, forms of the verb be such as am, are and is are converted into be. Lemmatization also allows us to convert words with apostrophes into their full forms. Therefore, n't is converted to not and 've is converted to have.

- Convert word into its base form
  - reducing, reduces, reduced, reduction → reduce
  - am, are, is → be
  - n't → not
  - 've → have
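
The simplest way to picture lemmatization is as a table lookup. The sketch below builds a tiny hypothetical table (`LEMMA_TABLE`) from some of the examples above; `lookup_lemmas` is likewise a name invented here. Real lemmatizers, including spaCy's, rely on much larger resources and on part-of-speech information.

```python
# Hypothetical lookup table built from the examples above; real lemmatizers
# use far larger tables plus part-of-speech information.
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce", "reduced": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lookup_lemmas(tokens):
    """Lowercase each token, then map it to its lemma if the table has one."""
    return [LEMMA_TABLE.get(tok.lower(), tok.lower()) for tok in tokens]


print(lookup_lemmas(["They", "are", "reducing", "costs"]))
# ['they', 'be', 'reduce', 'costs']
```

Words missing from the table, such as `costs`, pass through unchanged apart from lowercasing, which shows the limits of a pure lookup approach.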


8\. Lemmatization using spaCy

-----------------------------

04:07 - 04:42

When you pass the string into nlp, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens except that we extract token.lemma_ in each iteration inside the list comprehension instead of token.text. Also, observe how spaCy converted each I into -PRON-. This is standard behavior in spaCy v2, where every pronoun is converted into the string '-PRON-'; newer releases (v3 onward) return a regular lemma for pronouns instead.

```python
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Output:
```python
['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']
```

9\. Let's practice!

-------------------

04:42 - 05:00

Once we understand how to perform tokenization and lemmatization, performing the text preprocessing techniques described earlier becomes easier. Before we move to that, let's first practice our understanding of the concepts introduced so far.

Identifying lemmas
==================

Identify the list of words from the choices below that do not share the same lemma.

##### Answer the question

#### Possible Answers

Select one answer

- [ ] He, She, I, They

- [ ] Am, Are, Is, Was

- [ ] Increase, Increases, Increasing, Increased

- [x] Car, Bike, Truck, Bus

Tokenizing the Gettysburg Address
=================================

In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War.

The entire speech is available as a string named `gettysburg`.

Instructions
------------

-   Load the `en_core_web_sm` model using `spacy.load()`.
-   Create a Doc object `doc` for the `gettysburg` string.
-   Using list comprehension, loop over `doc` to generate the token texts.

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)
```

Lemmatizing the Gettysburg address
==================================

In this exercise, we will perform lemmatization on the same `gettysburg` address from before. 

However, this time, we will also take a look at the speech before and after lemmatization, and try to assess the kinds of changes that make the piece more machine friendly.

Instructions 1/3
----------------

Print the gettysburg address to the console.

```python
# Print the gettysburg address
print(gettysburg)
```

Instructions 2/3
----------------

Loop over doc and extract the lemma for each token of gettysburg.

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]
```

Instructions 3/3
----------------

Convert lemmas into a string using join.

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))
```

# Text cleaning

1\. Text cleaning
-----------------

00:00 - 00:08

Now that we know how to convert a string into a list of lemmas, we are in a good position to perform basic text cleaning.


2\. Text cleaning techniques
----------------------------

00:08 - 00:31

Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters such as numbers, and stopwords. In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis.

- Unnecessary whitespaces and escape sequences
- Punctuations
- Special characters (numbers, emojis, etc.)
- Stopwords


3\. isalpha()
-------------

00:31 - 01:10

Every Python string has an isalpha() method that returns True if the string is non-empty and all of its characters are alphabetic. Therefore, "Dog".isalpha() will return True but "3dogs".isalpha() will return False as it contains the non-alphabetic character 3. Similarly, numbers, punctuation and emojis will all return False too. This makes it an extremely convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation and emojis.

```python
"Dog".isalpha()
```
Output:
```python
True
```

```python
"3dogs".isalpha()
```
Output:
```python
False
```

```python
"12347".isalpha()
```
Output:
```python
False
```

```python
"!".isalpha()
```
Output:
```python
False
```

```python
"?".isalpha()
```
Output:
```python
False
```

4\. A word of caution
---------------------

01:10 - 01:56

If isalpha() as a silver bullet that cleans text meticulously seems too good to be true, it's because it is. Remember that isalpha() returns False on some words we would not want to remove. Examples include abbreviations such as U.S.A and U.K, which have periods in them, and proper nouns with numbers in them such as word2vec and xto10x. For such nuanced cases, isalpha() may not be sufficient. It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.

```markdown
- Abbreviations: U.S.A, U.K, etc.
- Proper Nouns: word2vec and xto10x.
- Write your own custom function (using regex) for the more nuanced cases.
```
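
One possible shape for such a custom function is sketched below. The pattern encodes an assumption made only for this example: a token worth keeping starts with a letter and may contain digits or internal periods. `TOKEN_PATTERN` and `is_wordlike` are hypothetical names, and the right rule for your data may well differ.

```python
import re

# Assumption: a useful token starts with a letter, may contain digits, and may
# have period-separated parts (as in U.S.A). Adjust the pattern to your data.
TOKEN_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*(\.[A-Za-z0-9]+)*")


def is_wordlike(token):
    """Return True if the whole token matches TOKEN_PATTERN."""
    return TOKEN_PATTERN.fullmatch(token) is not None


for token in ["Dog", "U.S.A", "word2vec", "xto10x", "3dogs", "12347", "!"]:
    print(token, token.isalpha(), is_wordlike(token))
```

Here `U.S.A`, `word2vec` and `xto10x` pass `is_wordlike` even though `isalpha()` rejects them, while `3dogs`, `12347` and `!` are rejected by both.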

5\. Removing non-alphabetic characters
--------------------------------------

01:56 - 02:13

Consider the string here. This has a lot of punctuations, unnecessary extra whitespace, escape sequences, numbers and emojis. We will generate the lemmatized tokens like before.

```python
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
import spacy

# Generate list of lemmas
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Output:
```python
['OMG', '!', 'This', 'be', 'like', 'the', 'best', 'thing', 'ever', '\n', '.', 'Wow', ',', 'such', 'an', 'amazing', 'song', '!', 'I', "'m", 'hook', '.', 'Top', '5', 'definitely', '.', '?']
```

6\. Removing non-alphabetic characters
--------------------------------------

02:13 - 02:35

Next, we loop through the tokens again and choose only those words that are either -PRON- or contain only alphabetic characters. Let's now print out the sanitized string. We see that all the non-alphabetic characters have been removed and each word is separated by a single space.

```python
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() or lemma == '-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))
```

Output:
```python
'OMG this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'
```

7\. Stopwords
-------------

02:35 - 02:55

There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be verbs such as is and am and pronouns such as he and she.

- Words that occur extremely commonly
- Eg. articles, be verbs, pronouns, etc.


8\. Removing stopwords using spaCy
----------------------------------

02:55 - 03:03

spaCy has a built-in list of stopwords which we can access using spacy.lang.en.stop_words.STOP_WORDS.

```python
import spacy.lang.en.stop_words  # ensures the stop_words submodule is loaded

# Get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
```


9\. Removing stopwords using spaCy
----------------------------------

03:03 - 03:49

We make a small tweak to the a_lemmas generation step. Notice that we have removed the -PRON- condition, as pronouns are stopwords anyway and should be removed. Additionally, we have introduced a new condition to check whether the word belongs to spaCy's list of stopwords. The output is as follows. Notice how the string consists only of base form words. Always exercise caution while using third-party stopword lists. It is common for an application to find certain words useful that a third-party list treats as stopwords. It is often advisable to create your own custom stopword list.

```python
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))
```

Output:
```python
'OMG like good thing wow amazing song hooked definitely'
```
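
One way to follow this advice is to start from an existing list and adjust it for your application. The sketch below is purely illustrative: `base_stopwords` is a small hypothetical stand-in for a third-party list (not spaCy's actual list), and `domain_keep` holds words that an imagined sentiment task would want to retain.

```python
# Hypothetical stand-in for a third-party stopword list (not spaCy's real list).
base_stopwords = {"a", "an", "the", "be", "not", "this", "such"}

# Words this imagined application needs, e.g. negations for sentiment analysis.
domain_keep = {"not"}

# Custom list: everything in the base list except the words we want to keep.
custom_stopwords = base_stopwords - domain_keep

lemmas = ["this", "be", "not", "a", "good", "song", "!"]
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in custom_stopwords]
print(a_lemmas)  # ['not', 'good', 'song']
```

With the unmodified base list, `not` would have been dropped and the cleaned text would read as a positive statement.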

10\. Other text preprocessing techniques
----------------------------------------

03:49 - 04:07

There are other preprocessing techniques that are used but have been omitted for the sake of brevity. Some of them include removing HTML or XML tags, replacing accented characters, and correcting spelling errors and shorthand.

- Removing HTML/XML tags
- Replacing accented characters (such as é)
- Correcting spelling errors


11\. A word of caution
----------------------

04:07 - 04:42

We have covered a lot of text preprocessing techniques in the last couple of lessons. However, a word of caution is in order. The text preprocessing techniques you use always depend on the application. There are many applications which may find punctuation, numbers and emojis useful, so it may be wise not to remove them. In other cases, the use of all caps may be a good indicator of something. Remember to always use only those techniques that are relevant to your particular use case.

- Always use only those text preprocessing techniques that are relevant to your application.
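
As an example of information that lowercasing would destroy, the sketch below measures how much of a text is written in all caps. `caps_ratio` is a hypothetical feature invented for this illustration; whether it is useful depends entirely on the task.

```python
def caps_ratio(text):
    """Fraction of alphabetic words written entirely in capitals.

    Single-letter words such as "I" are not counted as all caps.
    """
    words = [word for word in text.split() if word.isalpha()]
    if not words:
        return 0.0
    shouted = sum(1 for word in words if len(word) > 1 and word.isupper())
    return shouted / len(words)


print(caps_ratio("OMG this is GREAT"))  # 0.5
print(caps_ratio("I love it"))          # 0.0
```

A feature like this has to be computed on the raw text, before any step that converts words to lowercase.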

12\. Let's practice!
--------------------

04:42 - 04:45

It's now time to practice!

Cleaning a blog post
====================

In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine friendly format. This will involve converting to lowercase, lemmatization and removing stopwords, punctuations and non-alphabetic characters.

The excerpt is available as a string `blog` and has been printed to the console. The list of stopwords is available as `stopwords`.

Instructions
------------

-   Using list comprehension, loop through `doc` to extract the `lemma_` of each token.
-   Remove stopwords and non-alphabetic tokens using `stopwords` and `isalpha()`.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))
```

Cleaning TED talks in a dataframe
=================================

In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe `ted` consisting of 5 TED Talks. Your task is to clean these talks using techniques discussed earlier by writing a function `preprocess` and applying it to the `transcript` feature of the dataframe.

The stopwords list is available as `stopwords`.

Instructions
------------

-   Generate the Doc object for `text`. Ignore the `disable` argument for now.
-   Generate lemmas using list comprehension using the `lemma_` attribute.
-   Remove non-alphabetic characters using `isalpha()` in the if condition.

```python
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]

    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])
```

1\. Part-of-speech tagging
--------------------------

00:00 - 00:08

In this lesson, we will cover part-of-speech tagging, which is one of the most popularly used feature engineering techniques in NLP.

2\. Applications
----------------

00:08 - 01:14

Part-of-speech tagging or POS tagging has an immense number of applications in NLP. It is used in word-sense disambiguation to identify the sense of a word in a sentence. For instance, consider the sentences "the bear is a majestic animal" and "please bear with me". Both sentences use the word 'bear' but they mean different things. POS tagging helps in making this distinction by identifying one bear as a noun and the other as a verb. POS tagging is also used in sentiment analysis, question answering systems and linguistic approaches to detecting fake news and opinion spam. For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.

```markdown
- Word-sense disambiguation
  - "The bear is a majestic animal"
  - "Please bear with me"
- Sentiment analysis
- Question answering
- Fake news and opinion spam detection
```
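
To make the bear example concrete, the sketch below looks up a word's sense from its POS tag. The tags are written by hand for illustration and are not the output of a real tagger; `SENSES` and `sense_of` are hypothetical names, and practical word-sense disambiguation is considerably more involved.

```python
# Hand-written (token, tag) pairs for the two example sentences. These tags are
# an assumption for illustration, not the output of a real tagger.
sentence_a = [("The", "DET"), ("bear", "NOUN"), ("is", "VERB"),
              ("a", "DET"), ("majestic", "ADJ"), ("animal", "NOUN")]
sentence_b = [("Please", "INTJ"), ("bear", "VERB"), ("with", "ADP"),
              ("me", "PRON")]

# Hypothetical sense inventory keyed on (lowercased word, POS tag).
SENSES = {
    ("bear", "NOUN"): "a large mammal",
    ("bear", "VERB"): "to tolerate or endure",
}


def sense_of(word, tagged_sentence):
    """Return the sense for the first occurrence of `word`, or None."""
    for token, tag in tagged_sentence:
        if token.lower() == word:
            return SENSES.get((word, tag))
    return None


print(sense_of("bear", sentence_a))  # a large mammal
print(sense_of("bear", sentence_b))  # to tolerate or endure
```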

3\. POS tagging
---------------

01:14 - 01:45

So what is POS tagging? It is the process of assigning every word (or token) in a piece of text its corresponding part of speech. For instance, consider the sentence "Jane is an amazing guitarist". A typical POS tagger will label Jane as a proper noun, is as a verb, an as a determiner (or an article), amazing as an adjective and finally, guitarist as a noun.

```markdown
- Assigning every word, its corresponding part of speech.
  - "Jane is an amazing guitarist."
- POS Tagging:
  - Jane → proper noun
  - is → verb
  - an → determiner
  - amazing → adjective
  - guitarist → noun
```


4\. POS tagging using spaCy
---------------------------

01:45 - 02:14

POS Tagging is extremely easy to do using spaCy's models and performing it is almost identical to generating tokens or lemmas. As usual, we import the spacy library and load the en_core_web_sm model as nlp. We will use the same sentence "Jane is an amazing guitarist" from before. We will then create a Doc object that will perform POS tagging, by default.

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Jane is an amazing guitarist"

# Create a Doc object
doc = nlp(string)
```

5\. POS tagging using spaCy
---------------------------

02:14 - 03:08

Using list comprehension, we generate a list of tuples pos where the first element of the tuple is the token and is generated using token.text and the second element is its POS tag, which is generated using token.pos_. Printing pos will give us the following output. Note how the tagger correctly identified all the parts-of-speech as we had discussed earlier. That said, remember that POS tagging is not an exact science. spaCy infers the POS tags of these words based on the predictions given by its pre-trained models. In other words, the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.

```python
# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'),
 ('is', 'VERB'),
 ('an', 'DET'),
 ('amazing', 'ADJ'),
 ('guitarist', 'NOUN')]
```
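
Once the tags are available as a list of tuples, simple bookkeeping turns them into features. The sketch below copies the output shown above into a plain list, so it runs without spaCy, and tallies the tags with `collections.Counter`. The variable name `tag_counts` is just a choice made for this example.

```python
from collections import Counter

# The (token, tag) pairs from the output above, copied in by hand.
pos = [('Jane', 'PROPN'), ('is', 'VERB'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('guitarist', 'NOUN')]

# Count how often each POS tag occurs.
tag_counts = Counter(tag for _, tag in pos)

print(tag_counts['PROPN'])  # 1
print(tag_counts['NOUN'])   # 1
print(tag_counts['ADV'])    # 0 (a Counter returns 0 for missing keys)
```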

6\. POS annotations in spaCy
----------------------------

03:08 - 03:39

spaCy is capable of identifying close to 20 parts of speech and, as we saw in the previous slide, it uses specific annotations to denote each one. For instance, PROPN refers to a proper noun and DET refers to a determiner. You can find the complete list of POS annotations used by spaCy in spaCy's documentation. Here is a snapshot of the web page.

```markdown
- PROPN → proper noun
- DET → determiner
- spaCy annotations at [https://spacy.io/api/annotation](https://spacy.io/api/annotation)

| POS   | DESCRIPTION          | EXAMPLES                              |
|-------|----------------------|---------------------------------------|
| ADJ   | adjective            | big, old, green, incomprehensible, first |
| ADP   | adposition           | in, to, during                       |
| ADV   | adverb               | very, tomorrow, down, where, there   |
| AUX   | auxiliary            | is, has (done), will (do), should (do) |
| CONJ  | conjunction          | and, or, but                         |
| CCONJ | coordinating conjunction | and, or, but                         |
| DET   | determiner           | a, an, the                           |
```

7\. Let's practice!
-------------------

03:39 - 03:47

Great! Let's now practice our understanding of POS tagging in the next few exercises.

POS tagging in Lord of the Flies
================================

In this exercise, you will perform part-of-speech tagging on a famous passage from one of the most well-known novels of all time, *Lord of the Flies*, authored by William Golding.

The passage is available as `lotf` and has already been printed to the console.

Instructions
------------

-   Load the `en_core_web_sm` model.
-   Create a doc object for `lotf` using `nlp()`.
-   Using the `text` and `pos_` attributes, generate tokens and their corresponding POS tags.

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)
```

Counting nouns in a piece of text
=================================

In this exercise, we will write two functions, `nouns()` and `proper_nouns()` that will count the number of other nouns and proper nouns in a piece of text respectively.

Each function will take in a piece of text and generate a list containing the POS tag of each word. It will then return the number of proper nouns or other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.

The `en_core_web_sm` model has already been loaded as `nlp` in this exercise.

Instructions 1/2
----------------

-   Using the list `count` method, count the number of proper nouns (annotated as `PROPN`) in the `pos` list.

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```

Instructions 2/2
----------------

-   Using the list `count` method, count the number of other nouns (annotated as `NOUN`) in the `pos` list.

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of other nouns
    return pos.count("NOUN")  # Count the number of NOUN tags

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```


Noun usage in fake news
=======================

In this exercise, you have been given a dataframe `headlines` that contains news headlines that are either fake or real. Your task is to generate two new features `num_propn` and `num_noun` that represent the number of proper nouns and other nouns contained in the `title` feature of `headlines`.

Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. If there is a remarkable difference, then there is a good chance that using the `num_propn` and `num_noun` features in fake news detectors will improve their performance.

To accomplish this task, the functions `proper_nouns` and `nouns` that you had built in the previous exercise have already been made available to you.

Instructions 1/2
----------------

-   Create a new feature `num_propn` by applying `proper_nouns` to `headlines['title']`.
-   Filter `headlines` to compute the mean number of proper nouns in fake news using the `mean` method.

```python
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
```

Instructions 2/2
----------------

-   Repeat the process for other nouns: create a feature `'num_noun'` using `nouns` and compute the mean number of other nouns.

```python
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))
```


1\. Named entity recognition
----------------------------

00:00 - 00:06

The final technique we will learn as part of this chapter is named entity recognition.


2\. Applications
----------------

00:06 - 00:48

Named entity recognition or NER has a host of extremely useful applications. It is used to build efficient search algorithms and question answering systems. For instance, let us say you have a piece of text and you ask your system about the people that are being talked about in the text. NER would help the system answer this question by identifying all the entities in the text that refer to a person. NER is also used by news providers to categorize their articles and by customer service centers to classify and record complaints efficiently.

```markdown
- Efficient search algorithms
- Question answering
- News article classification
- Customer service (classifying and recording complaints)
```

3\. Named entity recognition
----------------------------

00:48 - 01:38

Let us now get down to the definitions. A named entity is anything that can be denoted with a proper name or a proper noun. Named entity recognition or NER, therefore, is the process of identifying such named entities in a piece of text and classifying them into predefined categories such as person, organization, country, etc. For example, consider the text "John Doe is a software engineer working at Google. He lives in France." Performing NER on this text will tell us that there are three named entities: John Doe, who is a person, Google, which is an organization and France, which is a country (or geopolitical entity).

```markdown
- Identifying and classifying named entities into predefined categories.
  - Categories include person, organization, country, etc.
- "John Doe is a software engineer working at Google. He lives in France."
- Named Entities:
  - John Doe → person
  - Google → organization
  - France → country (geopolitical entity)
```

4\. NER using spaCy
-------------------

01:38 - 02:40

Like POS tagging, performing NER is extremely easy using spaCy's pre-trained models. Let's try to find the named entities in the same sentence we used earlier. As usual, we import the spacy library, load the required model and create a Doc object for the string. When we do this, spaCy automatically computes all the named entities and makes them available as the ents attribute of doc. Therefore, to access each named entity and its category, we use list comprehension to loop over doc.ents and create a tuple containing the entity name, which is accessed using ent.text, and the entity category, which is accessed using ent.label_. Printing this list out will give the following output. We see that spaCy has correctly identified and classified all the named entities in this string.

```python
import spacy
string = "John Doe is a software engineer working at Google. He lives in France."

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# Generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
```
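
Returning to the question answering scenario described earlier, a list of `(text, label)` tuples like the one above can be grouped by label so that "who are the people in this text?" becomes a dictionary lookup. The sketch copies the output above into a plain list so that it runs without spaCy; `by_label` is a name chosen for this example.

```python
from collections import defaultdict

# The (entity text, label) pairs from the output above, copied in by hand.
ne = [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]

# Group entity texts under their label.
by_label = defaultdict(list)
for text, label in ne:
    by_label[label].append(text)

print(by_label['PERSON'])  # ['John Doe']
print(by_label['GPE'])     # ['France']
```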

5\. NER annotations in spaCy
----------------------------

02:40 - 03:00

Currently, spaCy's models are capable of identifying more than 15 different types of named entities. The complete list of categories and their annotations can be found in spaCy's documentation. Here is a snapshot of the page.

```markdown
- More than 15 categories of named entities
- NER annotations at [https://spacy.io/api/annotation#named-entities](https://spacy.io/api/annotation#named-entities)

| TYPE  | DESCRIPTION                                      |
|-------|--------------------------------------------------|
| PERSON | People, including fictional.                    |
| NORP   | Nationalities or religious or political groups. |
| FAC    | Buildings, airports, highways, bridges, etc.    |
| ORG    | Companies, agencies, institutions, etc.         |
| GPE    | Countries, cities, states.                      |
```

6\. A word of caution
---------------------

03:00 - 03:54

In this chapter, we have used spaCy's models to accomplish several tasks. However, remember that spaCy's models are not perfect and their performance depends on the data they were trained with and the data they are being used on. For instance, if we are trying to extract named entities from texts in a heavily technical field, such as medicine, spaCy's pretrained models may not do such a great job. In such nuanced cases, it is better to train your models with your own specialized data. Also, remember that spaCy's models are language specific. This is understandable considering that each language has its own grammar and nuances. The en_core_web_sm model that we've been using is, as the name suggests, only suitable for English texts.

```markdown
- Not perfect
- Performance dependent on training and test data
- Train models with specialized data for nuanced cases
- Language specific
```

7\. Let's practice!
-------------------

03:54 - 04:05

This concludes our lesson on named entity recognition. Let us practice our understanding of this technique in the exercises.