1\. Introduction to NLP feature engineering
-------------------------------------------

00:00 - 00:18

Welcome to Feature Engineering for NLP in Python! I am Rounak and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.

2\. Numerical data
------------------

00:18 - 00:44

For any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical. Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict if an instance belongs to class iris-virginica, iris-setosa or iris-versicolor.

#### Iris dataset

| sepal length | sepal width | petal length | petal width | class |
|--------------|-------------|--------------|-------------|-------|
| 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor |
| 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |

3\. One-hot encoding
--------------------

00:44 - 01:01

ML algorithms can also work with categorical data provided they are converted into numerical form through one-hot encoding. Let's say you have a categorical feature 'sex' with two categories 'male' and 'female'.

| sex |
|-----|
| female |
| male |
| female |
| male |
| female |
| ... |

4\. One-hot encoding
--------------------

01:01 - 01:05

One-hot encoding will convert this feature into two features,

| sex | one-hot encoding |
|-----|-----------------|
| female | → |
| male | → |
| female | → |
| male | → |
| female | → |
| ... | ... |

5\. One-hot encoding
--------------------

01:05 - 01:17

'sex_male' and 'sex_female', such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For females, it is the other way around.

| sex | one-hot encoding | sex_female | sex_male |
| --- | ---------------- | ---------- | -------- |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| ... | ... | ... | ... |

6\. One-hot encoding with pandas
--------------------------------

01:17 - 01:54

To do this in code, we use pandas' get_dummies() function. Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. Not mentioning columns will lead pandas to automatically encode all non-numerical features. Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.

```python
# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
```
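As a small self-contained illustration, the sketch below builds a toy dataframe and encodes its `sex` column. The `age` column and all values are made up for this example. `dtype=int` is passed so that the result shows 1s and 0s as in the table above; recent pandas versions otherwise return booleans.

```python
import pandas as pd

# Toy data; the 'age' column and all values are hypothetical
df = pd.DataFrame({
    'age': [34, 27, 45],
    'sex': ['female', 'male', 'female'],
})

# One-hot encode only the 'sex' column; dtype=int yields 1/0 instead of True/False
encoded = pd.get_dummies(df, columns=['sex'], dtype=int)

print(encoded.columns.tolist())  # ['age', 'sex_female', 'sex_male']
print(encoded)
```

Columns that are not listed in `columns` (here `age`) are passed through unchanged.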

7\. Textual data
----------------

01:54 - 02:10

Consider a movie reviews dataset. This data cannot be utilized by any machine learning or ML algorithm. The training feature 'review' isn't numerical. Neither is it categorical to perform one-hot encoding on.

#### Movie Review Dataset

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

8\. Text pre-processing
-----------------------

02:10 - 02:34

We need to perform two steps to make this dataset suitable for ML. The first is to standardize the text. This involves steps like converting words to lowercase and their base form. For instance, 'Reduction' gets lowercased and then converted to its base form, reduce. We will cover these concepts in more detail in subsequent lessons.

- Converting to lowercase
    - Example: `Reduction` to `reduction`
- Converting to base form
    - Example: `reduction` to `reduce`


9\. Vectorization
-----------------

02:34 - 02:48

After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

10\. Vectorization
------------------

02:48 - 02:55

into something like this. We will learn techniques to achieve this in later lessons.

| 0 | 1 | 2 | ... | n | class |
| --- | --- | --- | --- | --- | --- |
| 0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
| 0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
| 0.14 | 0.18 | 0.00 | ... | 0.45 | positive |

11\. Basic features
-------------------

02:55 - 03:20

We can also extract certain basic features from text. It may be useful to know the word count, character count and average word length of a particular text. While working with niche data such as tweets, it may also be useful to know how many hashtags have been used in a tweet. This tweet by Silverado Records, for instance, uses two.

- Number of words
- Number of characters
- Average length of words
- Tweets

```markdown
testbook @books
What book are you guys reading?

#books #reading
```

12\. POS tagging
----------------

03:20 - 03:50

So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to do parts-of-speech tagging to know the different parts-of-speech present in your text as shown. As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part-of-speech.

| Word | POS |
| --- | --- |
| I | Pronoun |
| have | Verb |
| a | Article |
| dog | Noun |

13\. Named Entity Recognition
-----------------------------

03:50 - 04:16

You may also want to perform named entity recognition to find out if a particular noun is referring to a person, organization or country. For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns, "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.

#### Noun Reference

The image asks whether nouns refer to persons, organizations, or countries.

| Noun | NER |
| --- | --- |
| Brian | Person |
| DataCamp | Organization |

14\. Concepts covered
---------------------

04:16 - 04:33

Therefore, broadly speaking, this course will teach you how to conduct text preprocessing, extract certain basic features, word features and convert documents into a set of numerical features (using a process known as vectorization).

- Text Preprocessing
- Basic Features
- Word Features
- Vectorization


15\. Let's practice!
--------------------

04:33 - 04:36

Great! Now, let's practice!

Data format for ML algorithms
=============================

In this exercise, you have been given four dataframes `df1`, `df2`, `df3` and `df4`. The final column of each dataframe is the target variable (the value to be predicted) and the rest of the columns are training features.

Using the console, determine which dataframe is in a suitable format to be trained by a classifier.

Instructions
------------

### Possible answers

`df1`

`df2`

[/] `df3`

`df4`

One-hot encoding
================

In the previous exercise, we encountered a dataframe `df1` which contained categorical features and therefore, was unsuitable for applying ML algorithms to.

In this exercise, your task is to convert `df1` into a format that is suitable for machine learning.

Instructions 1/3
----------------

-   Use the `columns` attribute to print the features of `df1`.

In [None]:
print(df1.columns)

Instructions 2/3
----------------

-   Use the `pd.get_dummies()` function to perform one-hot encoding on `feature 5` of `df1`.

In [None]:
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

Instructions 3/3
----------------

-   Use the `columns` attribute again to print the new features of `df1`.
-   Print the first five rows of `df1` using `head()`.

In [None]:
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())

1\. Basic feature extraction
----------------------------

00:00 - 00:11

In this video, we will learn to extract certain basic features from text. While not very powerful, they can give us a good idea of the text we are dealing with.

2\. Number of characters
------------------------

00:11 - 00:58

The most basic feature we can extract from text is the number of characters, including whitespaces. For instance, the string "I don't know." has 13 characters. The number of characters is the length of the string. Python gives us a built-in len() function which returns the length of the string passed into it. The output will be 13 here too. If our dataframe df has a textual feature (say 'review'), we can compute the number of characters for each review and store it as a new feature 'num_chars' by using the pandas dataframe apply method. This is done by creating df['num_chars'] and assigning it to df['review'].apply(len).

```python
"I don't know." # 13 characters
# Compute the number of characters 
text = "I don't know."
num_char = len(text)
# Print the number of characters
print(num_char)

# Output: 13

# Create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)
```

3\. Number of words
-------------------

00:58 - 01:32

Another feature we can compute is the number of words. Assuming that every word is separated by a space, we can use a string's split() method to convert it into a list where every element is a word. In this example, the string Mary had a little lamb is split to create a list containing the words Mary, had, a, little and lamb. We can now compute the number of words by computing the number of elements in this list using len().

```python
# Split the string into words
text = "Mary had a little lamb."
words = text.split()
# Print the list containing words
print(words)

# Output: ['Mary', 'had', 'a', 'little', 'lamb.']

# Print number of words
print(len(words))

# Output: 5
```
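Two details of `split()` are worth seeing in isolation. Called without arguments, it splits on any run of whitespace, and it leaves punctuation attached to the neighbouring word (hence `'lamb.'` above). A minimal sketch with a made-up string:

```python
text = "Mary  had a\tlittle lamb."

# No argument: split on any run of whitespace (spaces, tabs, newlines)
words = text.split()
print(words)       # ['Mary', 'had', 'a', 'little', 'lamb.']
print(len(words))  # 5

# Explicit single-space separator: consecutive spaces produce empty strings
naive = text.split(' ')
print(naive)       # ['Mary', '', 'had', 'a\tlittle', 'lamb.']
```

For word counts, the argument-free form is usually the safer choice because stray double spaces and tabs do not distort the result.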

4\. Number of words
-------------------

01:32 - 01:58

To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it. The steps followed inside the function are similar as before. We then pass this function word_count into apply. We create df['num_words'] and assign it to df['review'].apply(word_count).

```python
# Function that returns number of words in string
def word_count(string):
    # Split the string into words    
    words = string.split()
    # Return length of words list
    return len(words)

# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)
```

5\. Average word length
-----------------------

01:58 - 02:24

Let's now compute the average length of words in a string. Let's define a function avg_word_length() which takes in a string and returns the average word length. We first split the string into words and compute the length of each word. Next, we compute the average word length by dividing the sum of the lengths of all words by the number of words.

```python
#Function that returns average word length
def avg_word_length(x):
    # Split the string into words    
    words = x.split()
    # Compute length of each word and store in a separate list 
    word_lengths = [len(word) for word in words]
    # Compute average word length    
    avg_word_length = sum(word_lengths)/len(words)
    # Return average word length
    return(avg_word_length)
```
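One edge case to keep in mind: an empty or whitespace-only string contains no words, so the division above would raise a `ZeroDivisionError`. The variant below is a sketch of one way to guard against that. The name `safe_avg_word_length` and the choice to return 0.0 for empty text are assumptions made for illustration.

```python
def safe_avg_word_length(text):
    """Average word length, or 0.0 when the text contains no words."""
    words = text.split()
    # Guard against empty input before dividing
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

print(safe_avg_word_length("Mary had a little lamb."))  # 3.8
print(safe_avg_word_length("   "))                      # 0.0
```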

6\. Average word length
-----------------------

02:24 - 02:31

We can now pass this function into apply() to generate an average word length feature like before.

```python
# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
```
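To see the three basic features side by side, here is a sketch on a tiny made-up dataframe. The review texts are hypothetical, and the helper functions are redefined so that the block runs on its own.

```python
import pandas as pd

def word_count(text):
    # Number of whitespace-separated words
    return len(text.split())

def avg_word_length(text):
    # Mean length of the whitespace-separated words
    words = text.split()
    return sum(len(word) for word in words) / len(words)

# Hypothetical reviews, used only for illustration
df = pd.DataFrame({'review': ["I loved it", "Not good at all"]})

df['num_chars'] = df['review'].apply(len)
df['num_words'] = df['review'].apply(word_count)
df['avg_word_length'] = df['review'].apply(avg_word_length)

print(df)
```

Each call to `apply()` runs the given function once per row of the `review` column and returns a new column of the same length.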

7\. Special features
--------------------

02:31 - 02:52

When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used. This tweet by DataCamp, for instance, has one mention, upendra_35, which begins with an @, and two hashtags, PySpark and Spark, which begin with a #.

```markdown
Tweet: 
Datacamp @Datacamp

Big data Fundamentals via PySpark. #BigData #pySpark
```

8\. Hashtags and mentions
-------------------------

02:52 - 03:44

Let's write a function that computes the number of hashtags in a string. We split the string into words. We then use list comprehension to create a list containing only those words that are hashtags. We do this using the startswith method of strings to find out if a word begins with #. The final step is to return the number of elements in this list using len. The procedure to compute number of mentions is identical except that we check if a word starts with @. Let's see this function in action. When we pass a string "@janedoe This is my first tweet! #FirstTweet #Happy", the function returns 2 which is indeed the number of hashtags in the string.

```python
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words    
    words = string.split()
    # Create a list of hashtags    
    hashtags = [word for word in words if word.startswith('#')]
    # Return number of hashtags
    return len(hashtags)

print(hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy"))

# Output: 2
```
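The mention counter described above differs only in the prefix it checks. A sketch, with `mention_count` as an assumed name chosen to mirror `hashtag_count`:

```python
def mention_count(string):
    # Split the string into words
    words = string.split()
    # Keep only the words that begin with '@'
    mentions = [word for word in words if word.startswith('@')]
    # Return number of mentions
    return len(mentions)

print(mention_count("@janedoe This is my first tweet! #FirstTweet #Happy"))  # 1
```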


9\. Other features
------------------

03:44 - 04:04

There are other basic features we can compute such as number of sentences, number of paragraphs, number of words starting with an uppercase, all-capital words, numeric quantities etc. The procedure to do this is extremely similar to the ones we've already covered.

- Number of sentences
- Number of paragraphs
- Words starting with an uppercase
- All-capital words
- Numeric quantities
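
As a rough illustration of how these follow the same pattern, the sketch below computes all five in one function. The heuristics are assumptions chosen for simplicity (sentences end at `.`, `!` or `?`, paragraphs are separated by blank lines), and the function name and sample text are made up.

```python
import re

def other_features(text):
    words = text.split()
    return {
        # Rough sentence count: each run of ., ! or ? is treated as one sentence end
        'num_sentences': len(re.findall(r'[.!?]+', text)),
        # Paragraphs are assumed to be separated by blank lines
        'num_paragraphs': len([p for p in re.split(r'\n\s*\n', text) if p.strip()]),
        # Words whose first character is an uppercase letter
        'num_capitalized': len([w for w in words if w[0].isupper()]),
        # Words written entirely in capitals (single letters such as 'I' also count)
        'num_all_caps': len([w for w in words if w.isupper()]),
        # Tokens made up only of digits
        'num_numeric': len([w for w in words if w.isdigit()]),
    }

sample = "I have 2 dogs. They are LOUD!\n\nBut I love them."
print(other_features(sample))
```

Abbreviations and decimal numbers would confuse the sentence heuristic, so a real project would likely rely on a proper sentence tokenizer instead.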


10\. Let's practice!
--------------------

04:04 - 04:09

That's enough theory for now. Let's practice!

Character count of Russian tweets
=================================

In this exercise, you have been given a dataframe `tweets` which contains some tweets associated with Russia's Internet Research Agency and compiled by FiveThirtyEight. 

Your task is to create a new feature 'char_count' in `tweets` which computes the number of characters for each tweet. Also, compute the average character count across all tweets. The tweets are available in the `content` feature of `tweets`.

*Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).*

Instructions
------------

-   Create a new feature `char_count` by applying `len` to the 'content' feature of `tweets`.
-   Print the average character count of the tweets by computing the mean of the 'char_count' feature.

In [None]:
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())

Word count of TED talks
=======================

`ted` is a dataframe that contains the transcripts of 500 TED talks. Your job is to compute a new feature `word_count` which contains the approximate number of words for each talk. Consequently, you also need to compute the average word count of the talks. The transcripts are available as the `transcript` feature in `ted`.

In order to complete this task, you will need to define a function `count_words` that takes in a string as an argument and returns the number of words in the string. You will then need to apply this function to the `transcript` feature of `ted` to create the new feature `word_count` and compute its mean.

Instructions
------------

-   Split `string` into a list of words using the `split()` method.
-   Return the number of elements in `words` using `len()`.
-   Apply your function to the `transcript` column of `ted` to create the new feature `word_count`.
-   Compute the average word count of the talks using `mean()`.

In [None]:
# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())

Hashtags and mentions in Russian tweets
=======================================

Let's revisit the `tweets` dataframe containing the Russian tweets. In this exercise, you will compute the number of hashtags and mentions in each tweet by defining two functions, `count_hashtags()` and `count_mentions()` respectively, and applying them to the `content` feature of `tweets`.

In case you don't recall, the tweets are contained in the `content` feature of `tweets`.

Instructions 1/2
----------------

-   In the list comprehension, use `startswith()` to check if a particular `word` starts with `'#'`.

In [None]:
# Function that returns number of hashtags in a string
def count_hashtags(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are hashtags
    hashtags = [word for word in words if word.startswith('#')]
    
    # Return number of hashtags
    return(len(hashtags))

# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()

Instructions 2/2
----------------

-   In the list comprehension, use `startswith()` to check if a particular `word` starts with '@'.

In [None]:
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()

1\. Readability tests
---------------------

00:00 - 00:06

In this lesson, we will look at a set of interesting features known as readability tests.

2\. Overview of readability tests
---------------------------------

00:06 - 00:56

These tests are used to determine the readability of a particular passage. In other words, they indicate the educational level a person needs in order to comprehend a piece of text. The scale usually ranges from primary school up to college graduate level and is in the context of the American education system. Readability tests use a mathematical formula based on the word, syllable and sentence counts of the passage. They are routinely used by organizations to determine how easy their publications are to understand. They have also found applications in domains such as fake news and opinion spam detection.

- Determine readability of an English passage
- Scale ranging from primary school up to college graduate level
- A mathematical formula utilizing word, syllable and sentence count
- Used in fake news and opinion spam detection


3\. Readability test examples
-----------------------------

00:56 - 01:25

There are a variety of readability tests in use. Some of the common ones include the Flesch reading ease, the Gunning fog index, the simple measure of gobbledygook or SMOG and the Dale-Chall score. Note that these tests are used for texts in English. Tests for other languages also exist that take into consideration the nuances of that particular language. For the sake of brevity, we will cover only the

- Flesch reading ease
- Gunning fog index
- Simple Measure of Gobbledygook (SMOG)
- Dale-Chall score


4\. Readability test examples
-----------------------------

01:25 - 01:36

first two scores in detail. However, once you understand them, you will be in a good position to understand and use the other scores too.

- Flesch reading ease
- Gunning fog index
- Simple Measure of Gobbledygook (SMOG)
- Dale-Chall score


5\. Flesch reading ease
-----------------------

01:36 - 02:30

The Flesch Reading Ease is one of the oldest and most widely used readability tests. The score is based on two ideas. The first is that the greater the average sentence length, the harder the text is to read. Consider these two sentences. The first is easier to follow than the second. The second idea is that the greater the average number of syllables in a word, the harder the text is to read. Therefore, "I live in my home" is considered easier to read than "I reside in my domicile" because it uses fewer syllables per word. The higher the Flesch Reading Ease score, the greater the readability. Therefore, a higher score indicates that the text is easier to understand.

- One of the oldest and most widely used tests
- **Greater the average sentence length, harder the text is to read**
    - "This is a short sentence."
    - "This is a longer sentence with more words and it is harder to follow than the first sentence."
- **Greater the average number of syllables in a word, harder the text is to read**
    - "I live in my home."
    - "I reside in my domicile."
- Higher the score, greater the readability
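
For reference, the standard Flesch Reading Ease formula combines exactly these two averages. The function below is a sketch that takes pre-computed counts as arguments, since counting syllables reliably is the hard part and is best left to a library. The function and parameter names are assumptions.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# "I live in my home."       -> 5 words, 1 sentence, 5 syllables
# "I reside in my domicile." -> 5 words, 1 sentence, 8 syllables
print(round(flesch_reading_ease(5, 1, 5), 2))  # 117.16
print(round(flesch_reading_ease(5, 1, 8), 2))  # 66.4
```

Both sentences have the same length, so the difference in score comes entirely from the syllables-per-word term. Very simple sentences can score above 100.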


6\. Flesch reading ease score interpretation
--------------------------------------------

02:30 - 02:49

This table shows how to interpret the Flesch Reading Ease scores. A score above 90 would imply that the text is comprehensible to a 5th grader whereas a score below 30 would imply the text can only be understood by college graduates.

| Reading ease score | Grade Level |
|-------------------|-------------|
| 90-100           | 5           |
| 80-90            | 6           |
| 70-80            | 7           |
| 60-70            | 8-9         |
| 50-60            | 10-12       |
| 30-50            | College     |
| 0-30             | College Graduate |
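
The table can also be expressed as a small lookup. This is only a sketch: the function name is an assumption, and because the bands in the table share their endpoints, boundary scores such as 90 are assigned here to the easier band.

```python
def flesch_grade_level(score):
    """Map a Flesch Reading Ease score to the grade level in the table."""
    # (lower bound, label) pairs, checked from the easiest band downwards
    bands = [
        (90, '5'),
        (80, '6'),
        (70, '7'),
        (60, '8-9'),
        (50, '10-12'),
        (30, 'College'),
    ]
    for lower_bound, grade in bands:
        if score >= lower_bound:
            return grade
    # Anything below 30 falls into the hardest band
    return 'College Graduate'

print(flesch_grade_level(95))    # 5
print(flesch_grade_level(66.4))  # 8-9
print(flesch_grade_level(12))    # College Graduate
```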

7\. Gunning fog index
---------------------

02:49 - 03:23

The Gunning fog index was developed in 1952. Like Flesch, this score is also dependent on the average sentence length. However, it uses the percentage of complex words in place of average syllables per word to compute its score. Here, complex words refer to all words that have three or more syllables. Unlike Flesch, the formula for the Gunning fog index is such that the higher the score, the more difficult the passage is to understand.

- Developed in 1952
- Also dependent on average sentence length
- Greater the percentage of complex words, harder the text is to read
- Higher the index, lesser the readability
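
The commonly cited form of the index adds the average sentence length to the percentage of complex words and scales the sum by 0.4. The sketch below works from pre-computed counts. The function and parameter names are assumptions, and the counts in the example are hypothetical.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index from raw counts (higher means harder to read)."""
    avg_sentence_length = total_words / total_sentences
    # Share of words with three or more syllables, as a percentage
    percent_complex = 100 * complex_words / total_words
    return 0.4 * (avg_sentence_length + percent_complex)

# Hypothetical passage: 100 words, 5 sentences, 10 words of three or more syllables
print(gunning_fog(100, 5, 10))  # 12.0
```

A result of 12 would correspond to a high school senior reading level.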

8\. Gunning fog index interpretation
------------------------------------

03:23 - 03:39

The index can be interpreted using this table. A score of 6 would indicate 6th grade reading difficulty whereas a score of 17 would indicate college graduate level reading difficulty.

| Fog index | Grade level |
|-----------|-------------|
| 17 | College graduate |
| 16 | College senior |
| 15 | College junior |
| 14 | College sophomore |
| 13 | College freshman |
| 12 | High school senior |
| 11 | High school junior |
| 10 | High school sophomore |
| 9 | High school freshman |
| 8 | Eighth grade |
| 7 | Seventh grade |
| 6 | Sixth grade |

9\. The readability library
---------------------------

03:39 - 04:28

We can conduct these tests in Python using the readability library (installed from PyPI as py-readability-metrics). In order to use this package, we first need to download the punkt module from nltk. We then import the Readability class from readability. Next, we create a Readability object and pass in the passage or text we're evaluating. To compute a readability score, we call the method for the score we are interested in, for instance, gunning_fog(). We store the result in a variable named gf. Next, we access the score using gf.score. In this example, the text that was passed is between the reading level of a college senior and that of a college graduate.

```python
# Download nltk punkt module
import nltk
nltk.download('punkt_tab')
# Import the Readability class
from readability import Readability
# Create a Readability object
readability_scores = Readability(text)
# Generate scores
gf = readability_scores.gunning_fog()
# score is an attribute of the result object, not a method
print(gf.score)

# Output: 16.26
```

10\. Let's practice!
--------------------

04:28 - 04:38

Let's now practice computing readability scores using the readability library in the exercises.

Readability of 'The Myth of Sisyphus'
=====================================

In this exercise, you will compute the Flesch reading ease score for Albert Camus' famous essay *The Myth of Sisyphus*. We will then interpret the value of this score as explained in the video and try to determine the reading level of the essay.

The entire essay is in the form of a string and is available as `sisyphus_essay`.

Instructions
------------

-   Import the `Readability` class from `readability`.
-   Compute the `readability_scores` object for `sisyphus_essay` using `Readability`.
-   Print the Flesch reading ease score using the `flesch` method.

In [None]:
# Import the Readability class
from readability import Readability

# Compute the readability scores object
readability_scores = Readability(sisyphus_essay)

# Print the flesch reading ease score
flesch = readability_scores.flesch()
print("The Flesch Reading Ease is %.2f" % (flesch.score))

Readability of various publications
===================================

In this exercise, you have been given excerpts of articles from four publications. Your task is to compute the readability of these excerpts using the Gunning fog score and consequently, determine the relative difficulty of reading these publications.

The excerpts are available as the following strings:

-   `forbes` - An excerpt from an article from *Forbes* magazine on the Chinese social credit score system.
-   `harvard_law` - An excerpt from a book review published in *Harvard Law Review*.
-   `r_digest` - An excerpt from a *Reader's Digest* article on flight turbulence.
-   `time_kids` - An excerpt from an article on the ill effects of salt consumption published in *TIME for Kids*.

Instructions
------------

-   Import the `Readability` class from `readability`.
-   Compute the `gf` object for each `excerpt` using the `gunning_fog()` method on `Readability`.
-   Compute the Gunning fog score using the `score` attribute.
-   Print the list of Gunning fog scores.

In [None]:
# Import the Readability class
from readability import Readability

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  gf = Readability(excerpt).gunning_fog()
  gunning_fog = gf.score
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)

# Text preprocessing, POS tagging and NER

1\. Tokenization and Lemmatization
----------------------------------

00:00 - 00:06

In NLP, we usually have to deal with texts from a variety of sources. For instance,

2\. Text sources
----------------

00:06 - 00:22

it can be a news article where the text is grammatically correct and proofread. It could be tweets containing shorthands and hashtags. It could also be comments on YouTube where people have a tendency to abuse capital letters and punctuations.

- News articles
- Tweets
- Comments

3\. Making text machine friendly
--------------------------------

00:22 - 01:03

It is important that we standardize these texts into a machine-friendly format. We want our models to treat similar words as the same. Consider the words Dogs and dog. Strictly speaking, they are different strings. However, they denote the same thing. Similarly, reduction, reducing and reduce should also be standardized to the same string regardless of their form and case usage. Other examples include don't and do not, and won't and will not. In the next couple of lessons, we will learn techniques to achieve this.

- `Dogs`, `dog`
- `reduction`, `REDUCING`, `Reduce`
- `don't`, `do not`
- `won't`, `will not`

4\. Text preprocessing techniques
---------------------------------

01:03 - 01:31

The text preprocessing techniques you use depend on the application you're working on. We'll be covering the common ones, including converting words into lowercase, removing unnecessary whitespace, removing punctuation, removing commonly occurring words or stopwords, expanding contracted words like don't and removing special characters such as numbers and emojis.

- Converting words into lowercase
- Removing leading and trailing whitespaces
- Removing punctuation
- Removing stopwords
- Expanding contractions
- Removing special characters (numbers, emojis, etc.)
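
To make the list concrete, here is a plain-Python sketch that chains these steps together. The contraction and stopword tables are tiny hypothetical stand-ins, and the function name is an assumption. The lessons that follow rely on spaCy rather than hand-written rules like these.

```python
import string

# Tiny illustrative lookups; real projects would use much fuller lists
CONTRACTIONS = {"don't": "do not", "won't": "will not", "i'm": "i am"}
STOPWORDS = {"i", "a", "am", "the", "is", "to", "do", "what"}

def preprocess(text):
    # Lowercase and trim leading/trailing whitespace
    text = text.lower().strip()
    # Expand contractions word by word before punctuation is stripped
    words = [CONTRACTIONS.get(word, word) for word in text.split()]
    text = ' '.join(words)
    # Remove punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop stopwords and tokens that contain non-alphabetic characters
    return [w for w in text.split() if w not in STOPWORDS and w.isalpha()]

print(preprocess("  Hello! I don't know what I'm doing here in 2024.  "))
# ['hello', 'not', 'know', 'doing', 'here', 'in']
```

The order matters in this sketch: contractions are expanded before punctuation is removed, otherwise the apostrophes they depend on would already be gone.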

5\. Tokenization
----------------

01:31 - 02:21

To do this, we must first understand tokenization. Tokenization is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and the process is specific to a particular language. In this course, we will primarily focus on word and punctuation tokens. For instance, consider this sentence. Tokenizing it into its constituent words and punctuation marks will yield the following list of tokens. Tokenization also involves splitting contracted words. Therefore, a word like don't gets decomposed into two tokens: do and n't.

"I have a dog. His name is Hachi."
Tokens:
```python
["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]
```

6\. Tokenization using spaCy
----------------------------

02:21 - 03:21

To perform tokenization in Python, we will use the spaCy library. We first import the spacy library. Next, we load a pre-trained English model 'en_core_web_sm' using spacy.load(). This will return a Language object that has the know-how to perform tokenization. This is stored in the variable nlp. Let's now define a string we want to tokenize. We pass this string into nlp to generate a spaCy Doc object. We store this in a variable named doc. This Doc object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens by using a list comprehension as shown. This is essentially looping over doc and extracting the text of each token in each iteration. The result is as follows.

```python
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)
```

Output:
```python
['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']
```

7\. Lemmatization
-----------------

03:21 - 04:07

Lemmatization is the process of converting a word into its lowercased base form or lemma. This is an extremely powerful process of standardization. For instance, the words reducing, reduces, reduced and reduction, when lemmatized, are all converted into the base form reduce. Similarly, forms of the verb be such as am, are and is are converted into be. Lemmatization also allows us to convert words with apostrophes into their full forms. Therefore, n't is converted to not and 've is converted to have.

- Convert word into its base form
  - reducing, reduces, reduced, reduction → reduce
  - am, are, is → be
  - n't → not
  - 've → have


8\. Lemmatization using spaCy
-----------------------------

04:07 - 04:42

When you pass the string into nlp, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens except that we extract token.lemma_ in each iteration inside the list comprehension instead of token.text. Also, observe how spaCy converted each I into -PRON-. This is standard behavior in spaCy 2.x, where every pronoun is converted into the string '-PRON-'; spaCy 3 and later return the pronoun's own lemma instead.

```python
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Output:
```python
['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']
```

9\. Let's practice!
-------------------

04:42 - 05:00

Once we understand how to perform tokenization and lemmatization, performing the text preprocessing techniques described earlier becomes easier. Before we move to that, let's first practice our understanding of the concepts introduced so far.

Identifying lemmas
==================

Identify the list of words from the choices which do not have the same lemma.

##### Answer the question

#### Possible Answers

Select one answer

-   He, She, I, They

-   Am, Are, Is, Was

-   Increase, Increases, Increasing, Increased

- [x] Car, Bike, Truck, Bus

Tokenizing the Gettysburg Address
=================================

In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War.

The entire speech is available as a string named `gettysburg`.

Instructions
------------

-   Load the `en_core_web_sm` model using `spacy.load()`.
-   Create a Doc object `doc` for the `gettysburg` string.
-   Using list comprehension, loop over `doc` to generate the token texts.

In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

Lemmatizing the Gettysburg address
==================================

In this exercise, we will perform lemmatization on the same `gettysburg` address from before. 

However, this time, we will also take a look at the speech, before and after lemmatization, and try to adjudge the kind of changes that take place to make the piece more machine friendly.

Instructions 1/3
----------------

Print the gettysburg address to the console.

In [None]:
# Print the gettysburg address
print(gettysburg)

Instructions 2/3
----------------

Loop over doc and extract the lemma for each token of gettysburg.

In [None]:
# Loop over doc and extract the lemma for each token of gettysburg.
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

Instructions 3/3
----------------

Convert lemmas into a string using join.

In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))