1\. Introduction to NLP feature engineering
-------------------------------------------

00:00 - 00:18

Welcome to Feature Engineering for NLP in Python! I am Rounak and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.

2\. Numerical data
------------------

00:18 - 00:44

The data fed into an ML algorithm must be in tabular form, and all the training features must be numerical. Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict whether an instance belongs to the class Iris-virginica, Iris-setosa or Iris-versicolor.

```markdown
# Iris dataset

| sepal length | sepal width | petal length | petal width | class           |
|--------------|-------------|--------------|-------------|-----------------|
| 6.3          | 2.9         | 5.6          | 1.8         | Iris-virginica  |
| 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa     |
| 5.6          | 2.9         | 3.6          | 1.3         | Iris-versicolor |
| 6.0          | 2.7         | 5.1          | 1.6         | Iris-versicolor |
| 7.2          | 3.6         | 6.1          | 2.5         | Iris-virginica  |
```
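
One way to check that a table meets this requirement is to inspect the column dtypes. The sketch below rebuilds the five Iris rows shown above as a pandas DataFrame and confirms that every feature column is numeric. The variable names `iris`, `features` and `all_numeric` are illustrative and not part of the course material.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The five Iris rows from the table above
iris = pd.DataFrame({
    "sepal length": [6.3, 4.9, 5.6, 6.0, 7.2],
    "sepal width": [2.9, 3.0, 2.9, 2.7, 3.6],
    "petal length": [5.6, 1.4, 3.6, 5.1, 6.1],
    "petal width": [1.8, 0.2, 1.3, 1.6, 2.5],
    "class": ["Iris-virginica", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica"],
})

# Separate the training features from the class label (the last column)
features = iris.drop(columns=["class"])

# True only if every feature column has a numeric dtype
all_numeric = all(is_numeric_dtype(features[col]) for col in features.columns)
print(all_numeric)
```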

3\. One-hot encoding
--------------------

00:44 - 01:01

ML algorithms can also work with categorical data, provided it is converted into numerical form through one-hot encoding. Let's say you have a categorical feature 'sex' with two categories, 'male' and 'female'.

```markdown
| sex    |
|--------|
| female |
| male   |
| female |
| male   |
| female |
| ...    |
```

4\. One-hot encoding
--------------------

01:01 - 01:05

One-hot encoding will convert this feature into two features,

```markdown
| sex    | one-hot encoding |
|--------|------------------|
| female | →                |
| male   | →                |
| female | →                |
| male   | →                |
| female | →                |
| ...    | ...              |
```

5\. One-hot encoding
--------------------

01:05 - 01:17

'sex_male' and 'sex_female', such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For females, it is the reverse.

| sex | one-hot encoding | sex_female | sex_male |
| --- | ---------------- | ---------- | -------- |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| ... | ... | ... | ... |

6\. One-hot encoding with pandas
--------------------------------

01:17 - 01:54

To do this in code, we use pandas' get_dummies() function. Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. Not mentioning columns will lead pandas to automatically encode all non-numerical features. Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.

```python
# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
```
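
The snippet above assumes that `df` already exists. As a minimal self-contained sketch, the version below builds a small hypothetical dataframe (the `age` column and all values are invented for illustration) and encodes its 'sex' feature. Passing `dtype=int` is a choice made here so that the new columns hold 1 and 0 as in the slide; recent pandas versions otherwise produce boolean columns.

```python
import pandas as pd

# Hypothetical toy dataframe with one numerical and one categorical feature
df = pd.DataFrame({
    "age": [34, 28, 45],
    "sex": ["female", "male", "female"],
})

# Encode 'sex'; dtype=int asks for 0/1 columns instead of booleans
df = pd.get_dummies(df, columns=["sex"], dtype=int)

print(df.columns.tolist())
print(df)
```

The original 'sex' column is dropped and replaced by 'sex_female' and 'sex_male', while 'age' is left untouched.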

7\. Textual data
----------------

01:54 - 02:10

Consider a movie reviews dataset. This data cannot be used by any ML algorithm as it stands. The training feature 'review' isn't numerical. Nor is it categorical, so one-hot encoding cannot be applied to it.

#### Movie Review Dataset

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

8\. Text pre-processing
-----------------------

02:10 - 02:34

We need to perform two steps to make this dataset suitable for ML. The first is to standardize the text. This involves steps like converting words to lowercase and reducing them to their base form. For instance, 'Reduction' gets lowercased and then converted to its base form, 'reduce'. We will cover these concepts in more detail in subsequent lessons.

- Converting to lowercase
    - Example: `Reduction` to `reduction`
- Converting to base form
    - Example: `reduction` to `reduce`
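
The two steps above can be sketched in plain Python. This is only a toy illustration: the `BASE_FORMS` dictionary and the `standardize` function are hypothetical stand-ins for a real base-form tool, which the course introduces in later lessons.

```python
# Hypothetical lookup table standing in for a real base-form tool
BASE_FORMS = {"reduction": "reduce"}

def standardize(word):
    """Lowercase a word, then map it to its base form if one is known."""
    lowered = word.lower()
    # Words missing from the table are returned lowercased but otherwise unchanged
    return BASE_FORMS.get(lowered, lowered)

print(standardize("Reduction"))
```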


9\. Vectorization
-----------------

02:34 - 02:48

After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

10\. Vectorization
------------------

02:48 - 02:55

into something like this. We will learn techniques to achieve this in later lessons.

| 0 | 1 | 2 | ... | n | class |
| --- | --- | --- | --- | --- | --- |
| 0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
| 0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
| 0.14 | 0.18 | 0.00 | ... | 0.45 | positive |
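
As a rough idea of what vectorization does, the sketch below turns each review into a row of word counts, one column per distinct word. This is one simple scheme chosen for illustration, not necessarily the technique the course uses; the fractional values in the table above suggest a different weighting. The review strings are shortened, hypothetical stand-ins for the preprocessed text.

```python
from collections import Counter

# Shortened, hypothetical stand-ins for the preprocessed reviews
reviews = [
    "this movie is for dog lovers",
    "the movie is forgettable",
    "a truly amazing movie about dogs",
]

# Build a sorted vocabulary: one column per distinct word
vocabulary = sorted({word for review in reviews for word in review.split()})

def vectorize(review):
    """Return one word count per vocabulary entry."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

# Every review becomes a numerical row of the same length
vectors = [vectorize(review) for review in reviews]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row now has the same number of numerical columns, which is the tabular shape an ML algorithm expects.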

11\. Basic features
-------------------

02:55 - 03:20

We can also extract certain basic features from text. It may be useful to know the word count, character count and average word length of a particular text. While working with niche data such as tweets, it may also be useful to know how many hashtags have been used in a tweet. This tweet by Silverado Records, for instance, uses two.

- Number of words
- Number of characters
- Average length of words
- Tweets

```markdown
testbook @books
What book are you guys reading?

#books #reading
```
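
These basic features can be computed with plain string operations. The sketch below uses the text of the tweet above, joined onto one line. Splitting on whitespace and treating any token that starts with `#` as a hashtag are simplifying assumptions, and the variable names are illustrative.

```python
# Text of the example tweet, joined onto a single line
tweet = "What book are you guys reading? #books #reading"

# Split on whitespace; punctuation stays attached to the words
words = tweet.split()

num_words = len(words)
num_chars = len(tweet)  # counts spaces and punctuation too
avg_word_length = sum(len(word) for word in words) / num_words

# A hashtag is taken to be any token that starts with '#' and has more characters
num_hashtags = sum(1 for word in words if word.startswith("#") and len(word) > 1)

print(num_words, num_chars, avg_word_length, num_hashtags)
```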

12\. POS tagging
----------------

03:20 - 03:50

So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to do part-of-speech (POS) tagging to identify the different parts of speech present in your text. As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part of speech.

| Word | POS |
| --- | --- |
| I | Pronoun |
| have | Verb |
| a | Article |
| dog | Noun |

13\. Named Entity Recognition
-----------------------------

03:50 - 04:16

You may also want to perform named entity recognition to find out whether a particular noun refers to a person, organization or country. For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns, "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.

#### Noun Reference

| Noun | NER |
| --- | --- |
| Brian | Person |
| DataCamp | Organization |

Slide image: a person playing a guitar, a Swiss flag, and the TED logo, illustrating a person, a country, and an organization respectively.

14\. Concepts covered
---------------------

04:16 - 04:33

Broadly speaking, this course will teach you how to conduct text preprocessing, extract basic features and word features, and convert documents into a set of numerical features (using a process known as vectorization).

- Text Preprocessing
- Basic Features
- Word Features
- Vectorization


15\. Let's practice!
--------------------

04:33 - 04:36

Great! Now, let's practice!

Data format for ML algorithms
=============================

In this exercise, you have been given four dataframes `df1`, `df2`, `df3` and `df4`. The final column of each dataframe is the target variable (the class to be predicted) and the rest of the columns are training features.

Using the console, determine which dataframe is in a suitable format to be trained by a classifier.

Instructions
------------

### Possible answers

`df1`

`df2`

[/] `df3`

`df4`

One-hot encoding
================

In the previous exercise, we encountered a dataframe `df1` which contained categorical features and was therefore unsuitable for ML algorithms.

In this exercise, your task is to convert `df1` into a format that is suitable for machine learning.

Instructions 1/3
----------------

-   Use the `columns` attribute to print the features of `df1`.

```python
# Print the features of df1
print(df1.columns)
```

Instructions 2/3
----------------

-   Use the `pd.get_dummies()` function to perform one-hot encoding on `feature 5` of `df1`.

```python
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])
```

Instructions 3/3
----------------

-   Use the `columns` attribute again to print the new features of `df1`.
-   Print the first five rows of `df1` using `head()`.

```python
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())
```