1\. Introduction to NLP feature engineering
-------------------------------------------

00:00 - 00:18

Welcome to Feature Engineering for NLP in Python! I am Rounak and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.

2\. Numerical data
------------------

00:18 - 00:44

For any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical. Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict if an instance belongs to class iris-virginica, iris-setosa or iris-versicolor.

```markdown
# Iris dataset

| sepal length | sepal width | petal length | petal width | class           |
|--------------|-------------|--------------|-------------|-----------------|
| 6.3          | 2.9         | 5.6          | 1.8         | Iris-virginica  |
| 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa     |
| 5.6          | 2.9         | 3.6          | 1.3         | Iris-versicolor |
| 6.0          | 2.7         | 5.1          | 1.6         | Iris-versicolor |
| 7.2          | 3.6         | 6.1          | 2.5         | Iris-virginica  |
```

3\. One-hot encoding
--------------------

00:44 - 01:01

ML algorithms can also work with categorical data, provided it is converted into numerical form through one-hot encoding. Let's say you have a categorical feature 'sex' with two categories, 'male' and 'female'.

```markdown
| sex    |
|--------|
| female |
| male   |
| female |
| male   |
| female |
| ...    |
```

4\. One-hot encoding
--------------------

01:01 - 01:05

One-hot encoding will convert this feature into two features,

```markdown
| sex    | one-hot encoding |
|--------|------------------|
| female | →                |
| male   | →                |
| female | →                |
| male   | →                |
| female | →                |
| ...    | ...              |
```

5\. One-hot encoding
--------------------

01:05 - 01:17

'sex_male' and 'sex_female' such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For females, it is vice versa.

| sex | one-hot encoding | sex_female | sex_male |
| --- | ---------------- | ---------- | -------- |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| ... | ... | ... | ... |
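
To make the mapping in the table concrete, here is a minimal plain-Python sketch of what one-hot encoding computes. The list `sex` and the dictionary `one_hot` are illustrative names, not part of the course material, and this is not how any particular library implements the encoding.

```python
# Toy data mirroring the 'sex' column shown in the table
sex = ["female", "male", "female", "male", "female"]

# One new column per distinct category, named '<feature>_<category>'
categories = sorted(set(sex))

# Each new column holds 1 where the row matches the category, else 0
one_hot = {
    f"sex_{category}": [1 if value == category else 0 for value in sex]
    for category in categories
}

for name, column in one_hot.items():
    print(name, column)
# sex_female [1, 0, 1, 0, 1]
# sex_male [0, 1, 0, 1, 0]
```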

6\. One-hot encoding with pandas
--------------------------------

01:17 - 01:54

To do this in code, we use pandas' get_dummies() function. Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. Not mentioning columns will lead pandas to automatically encode all non-numerical features. Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.

```python
# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
```
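
The snippet above assumes a dataframe `df` already exists. As a self-contained sketch, the example below builds a small hypothetical dataframe (the 'age' column and all values are invented for illustration) and encodes its 'sex' feature. Passing `dtype=int` is an optional choice made here so that the new columns hold 0 and 1 as in the slides; recent pandas versions produce booleans by default.

```python
import pandas as pd

# Hypothetical toy dataframe; column values are invented for illustration
df = pd.DataFrame({
    "age": [34, 27, 45],
    "sex": ["female", "male", "female"],
})

# One-hot encode 'sex'; dtype=int asks for 0/1 instead of True/False
df = pd.get_dummies(df, columns=["sex"], dtype=int)

print(df.columns.tolist())
print(df)
```

The untouched 'age' column is kept, and 'sex' is replaced by 'sex_female' and 'sex_male'.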

7\. Textual data
----------------

01:54 - 02:10

Consider a movie reviews dataset. This data cannot be used by any machine learning (ML) algorithm as it is. The training feature 'review' isn't numerical. Nor is it categorical, so we cannot perform one-hot encoding on it.

#### Movie Review Dataset

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

8\. Text pre-processing
-----------------------

02:10 - 02:34

We need to perform two steps to make this dataset suitable for ML. The first is to standardize the text. This involves steps like converting words to lowercase and their base form. For instance, 'Reduction' gets lowercased and then converted to its base form, reduce. We will cover these concepts in more detail in subsequent lessons.

- Converting to lowercase
    - Example: `Reduction` to `reduction`
- Converting to base-form
    - Example: `reduction` to `reduce`
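
As a rough sketch of these two steps, the example below lowercases a word with Python's built-in `str.lower()` and then looks up its base form. The `BASE_FORMS` dictionary and the `standardize()` function are hypothetical stand-ins used only for illustration; a real pipeline would use a proper tool for finding base forms, which later lessons cover.

```python
# Toy lookup standing in for a real base-form tool (hypothetical, illustration only)
BASE_FORMS = {"reduction": "reduce"}

def standardize(word):
    # Step 1: convert the word to lowercase
    lowered = word.lower()
    # Step 2: map to the base form if the toy lookup knows the word,
    # otherwise keep the lowercased word unchanged
    return BASE_FORMS.get(lowered, lowered)

print(standardize("Reduction"))  # reduce
```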


9\. Vectorization
-----------------

02:34 - 02:48

After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted

| review | class |
| --- | --- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

10\. Vectorization
------------------

02:48 - 02:55

into something like this. We will learn techniques to achieve this in later lessons.

| 0 | 1 | 2 | ... | n | class |
| --- | --- | --- | --- | --- | --- |
| 0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
| 0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
| 0.14 | 0.18 | 0.00 | ... | 0.45 | positive |
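
To give a feel for what a vectorized dataset looks like, here is a minimal plain-Python sketch that turns documents into rows of word counts. This is only one simple scheme, chosen here for illustration; it is not necessarily the technique used to produce the values in the table, and the shortened documents are hypothetical stand-ins for the reviews.

```python
# Toy documents (shortened, hypothetical stand-ins for the reviews)
docs = [
    "this movie is for dog lovers",
    "the movie is forgettable",
    "a truly amazing movie about dogs",
]

# Vocabulary: every distinct word, in sorted order, gets one column
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a row of counts, one entry per vocabulary word
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in vectors:
    print(row)
```

Every row has the same length as the vocabulary, so the result is tabular and fully numerical.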

11\. Basic features
-------------------

02:55 - 03:20

We can also extract certain basic features from text. It may be useful to know the word count, character count and average word length of a particular text. While working with niche data such as tweets, it may also be useful to know how many hashtags have been used in a tweet. This tweet by Silverado Records, for instance, uses two.

- Number of words
- Number of characters
- Average length of words
- Tweets

```markdown
testbook @books
What book are you guys reading?

#books #reading
```

12\. POS tagging
----------------

03:20 - 03:50

So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to do parts-of-speech tagging to know the different parts-of-speech present in your text as shown. As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part-of-speech.

| Word | POS |
| --- | --- |
| I | Pronoun |
| have | Verb |
| a | Article |
| dog | Noun |

13\. Named Entity Recognition
-----------------------------

03:50 - 04:16

You may also want to perform named entity recognition to find out if a particular noun is referring to a person, organization or country. For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns, "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.

#### Noun Reference

Does a noun refer to a person, an organization or a country?

| Noun | NER |
| --- | --- |
| Brian | Person |
| DataCamp | Organization |

[Slide images: a person playing a guitar, a Swiss flag, and the TED logo.]

14\. Concepts covered
---------------------

04:16 - 04:33

Therefore, broadly speaking, this course will teach you how to conduct text preprocessing, extract certain basic features and word features, and convert documents into a set of numerical features (using a process known as vectorization).

- Text Preprocessing
- Basic Features
- Word Features
- Vectorization


15\. Let's practice!
--------------------

04:33 - 04:36

Great! Now, let's practice!

Data format for ML algorithms
=============================

In this exercise, you have been given four dataframes `df1`, `df2`, `df3` and `df4`. The final column of each dataframe is the target variable (the class to be predicted) and the rest of the columns are training features.

Using the console, determine which dataframe is in a suitable format to be trained by a classifier.

Instructions
------------

### Possible answers

`df1`

`df2`

[/] `df3`

`df4`

One-hot encoding
================

In the previous exercise, we encountered a dataframe `df1` which contained categorical features and therefore, was unsuitable for applying ML algorithms to.

In this exercise, your task is to convert `df1` into a format that is suitable for machine learning.

Instructions 1/3
----------------

-   Use the `columns` attribute to print the features of `df1`.

In [None]:
print(df1.columns)

Instructions 2/3
----------------

-   Use the `pd.get_dummies()` function to perform one-hot encoding on `feature 5` of `df1`.

In [None]:
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

Instructions 3/3
----------------

-   Use the `columns` attribute again to print the new features of `df1`.
-   Print the first five rows of `df1` using `head()`.

In [None]:
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())

1\. Basic feature extraction
----------------------------

00:00 - 00:11

In this video, we will learn to extract certain basic features from text. While not very powerful, they can give us a good idea of the text we are dealing with.

2\. Number of characters
------------------------

00:11 - 00:58

The most basic feature we can extract from text is the number of characters, including whitespaces. For instance, the string "I don't know." has 13 characters. The number of characters is the length of the string. Python gives us a built-in len() function which returns the length of the string passed into it. The output will be 13 here too. If our dataframe df has a textual feature (say 'review'), we can compute the number of characters for each review and store it as a new feature 'num_chars' by using the pandas dataframe apply method. This is done by creating df['num_chars'] and assigning it to df['review'].apply(len).

```python
"I don't know."  # 13 characters

# Compute the number of characters
text = "I don't know."
num_char = len(text)

# Print the number of characters
print(num_char)
# Output: 13

# Create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)
```

3\. Number of words
-------------------

00:58 - 01:32

Another feature we can compute is the number of words. Assuming that every word is separated by a space, we can use a string's split() method to convert it into a list where every element is a word. In this example, the string Mary had a little lamb is split to create a list containing the words Mary, had, a, little and lamb. We can now compute the number of words by computing the number of elements in this list using len().

```python
# Split the string into words
text = "Mary had a little lamb."
words = text.split()
# Print the list containing words
print(words)

['Mary', 'had', 'a', 'little', 'lamb.']

# Print number of words
print(len(words))
# Output: 5
```
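
Note that `split()` leaves punctuation attached to the neighbouring word, which is why the last element above is `'lamb.'` rather than `'lamb'`. The count is unaffected here, but if the words themselves matter, one possible cleanup (an illustrative choice, not the course's prescribed method) is to strip punctuation from each token using the standard library's `string.punctuation`:

```python
import string

text = "Mary had a little lamb."

# Strip leading and trailing punctuation from each token
words = [token.strip(string.punctuation) for token in text.split()]

# Drop any tokens that consisted of punctuation only
words = [word for word in words if word]

print(words)       # ['Mary', 'had', 'a', 'little', 'lamb']
print(len(words))  # 5
```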

4\. Number of words
-------------------

01:32 - 01:58

To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it. The steps followed inside the function are the same as before. We then pass this function, word_count, into apply. We create df['num_words'] and assign it to df['review'].apply(word_count).

```python
# Function that returns number of words in string
def word_count(string):
    # Split the string into words
    words = string.split()
    # Return length of words list
    return len(words)

# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)
```

5\. Average word length
-----------------------

01:58 - 02:24

Let's now compute the average length of words in a string. Let's define a function avg_word_length() which takes in a string and returns the average word length. We first split the string into words and compute the length of each word. Next, we compute the average word length by dividing the sum of the lengths of all words by the number of words.

```python
# Function that returns average word length
def avg_word_length(x):
    # Split the string into words
    words = x.split()
    # Compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]
    # Compute average word length
    avg_word_length = sum(word_lengths) / len(words)
    # Return average word length
    return avg_word_length
```
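
One caveat: for an empty string, or one containing only whitespace, the list of words is empty and the division raises a `ZeroDivisionError`. The sketch below shows one way to guard against that. The name `safe_avg_word_length()` and the choice to return `0.0` for empty text are assumptions made for illustration, not part of the course code.

```python
def safe_avg_word_length(text):
    # Hypothetical variant of avg_word_length() that tolerates empty text
    words = text.split()
    if not words:
        # No words at all: return 0.0 instead of dividing by zero
        return 0.0
    # Sum of word lengths divided by the number of words
    return sum(len(word) for word in words) / len(words)

print(safe_avg_word_length("Mary had a little lamb."))  # 3.8
print(safe_avg_word_length(""))                         # 0.0
```

The first result is 3.8 because the trailing period is counted as part of the last word: the lengths are 4, 3, 1, 6 and 5.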

6\. Average word length
-----------------------

02:24 - 02:31

We can now pass this into apply() to generate an average word length feature like before.

```python
# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
```

7\. Special features
--------------------

02:31 - 02:52

When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used. This tweet by DataCamp, for instance, has one mention, upendra_35, which begins with an @, and two hashtags, PySpark and Spark, which begin with a #.

```markdown
Tweet: 
Datacamp @Datacamp

Big data Fundamentals via PySpark. #BigData #pySpark
```

8\. Hashtags and mentions
-------------------------

02:52 - 03:44

Let's write a function that computes the number of hashtags in a string. We split the string into words. We then use a list comprehension to create a list containing only those words that are hashtags. We do this using the startswith method of strings to find out if a word begins with #. The final step is to return the number of elements in this list using len. The procedure to compute the number of mentions is identical, except that we check if a word starts with @. Let's see this function in action. When we pass the string "@janedoe This is my first tweet! #FirstTweet #Happy", the function returns 2, which is indeed the number of hashtags in the string.

```python
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words    
    words = string.split()
    # Create a list of hashtags    
    hashtags = [word for word in words if word.startswith('#')]
    # Return number of hashtags
    return len(hashtags)

print(hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy"))
# Output: 2
```
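
The mention counter described above can be sketched the same way, checking for '@' instead of '#'. The function name `mention_count()` is chosen here to mirror `hashtag_count()` and is an illustrative assumption.

```python
# Function that returns number of mentions (illustrative counterpart of hashtag_count)
def mention_count(text):
    # Split the string into words
    words = text.split()
    # Keep only the words that begin with '@'
    mentions = [word for word in words if word.startswith('@')]
    # Return number of mentions
    return len(mentions)

print(mention_count("@janedoe This is my first tweet! #FirstTweet #Happy"))
# Output: 1
```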


9\. Other features
------------------

03:44 - 04:04

There are other basic features we can compute, such as the number of sentences, number of paragraphs, number of words starting with an uppercase letter, all-capital words, numeric quantities and so on. The procedure to do this is very similar to the ones we've already covered.

- Number of sentences
- Number of paragraphs
- Words starting with an uppercase
- All-capital words
- Numeric quantities
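
As a rough sketch of how a few of these could be computed with the same split-and-filter pattern, consider the example below. The sample sentence is invented, and the rules used (for instance, treating any word ending in '.', '!' or '?' as the end of a sentence) are simplifying assumptions that will miscount on real text with abbreviations or decimals.

```python
import string

# Invented sample text for illustration
text = "NASA launched 3 rockets in 2019. Mary watched. WOW!"
words = text.split()

# Words whose first character is an uppercase letter
capitalized = [w for w in words if w[0].isupper()]

# All-capital words (punctuation stripped before checking)
all_caps = [w for w in words if w.strip(string.punctuation).isupper()]

# Numeric quantities (punctuation stripped before checking)
numeric = [w for w in words if w.strip(string.punctuation).isdigit()]

# Rough sentence count: words that end with sentence-final punctuation
sentence_count = len([w for w in words if w.endswith(('.', '!', '?'))])

print(len(capitalized))  # 3
print(len(all_caps))     # 2
print(len(numeric))      # 2
print(sentence_count)    # 3
```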


10\. Let's practice!
--------------------

04:04 - 04:09

That's enough theory for now. Let's practice!

Character count of Russian tweets
=================================

In this exercise, you have been given a dataframe `tweets` which contains some tweets associated with Russia's Internet Research Agency and compiled by FiveThirtyEight. 

Your task is to create a new feature 'char_count' in `tweets` which computes the number of characters for each tweet. Also, compute the average length of the tweets. The tweets are available in the `content` feature of `tweets`.

*Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).*

Instructions
------------

-   Create a new feature `char_count` by applying `len` to the 'content' feature of `tweets`.
-   Print the average character count of the tweets by computing the mean of the 'char_count' feature.

In [None]:
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())

Word count of TED talks
=======================

`ted` is a dataframe that contains the transcripts of 500 TED talks. Your job is to compute a new feature `word_count` which contains the approximate number of words for each talk. Consequently, you also need to compute the average word count of the talks. The transcripts are available as the `transcript` feature in `ted`.

In order to complete this task, you will need to define a function `count_words` that takes in a string as an argument and returns the number of words in the string. You will then need to apply this function to the `transcript` feature of `ted` to create the new feature `word_count` and compute its mean.

Instructions
------------

-   Split `string` into a list of words using the `split()` method.
-   Return the number of elements in `words` using `len()`.
-   Apply your function to the `transcript` column of `ted` to create the new feature `word_count`.
-   Compute the average word count of the talks using `mean()`.

In [None]:
# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())

Hashtags and mentions in Russian tweets
=======================================

Let's revisit the `tweets` dataframe containing the Russian tweets. In this exercise, you will compute the number of hashtags and mentions in each tweet by defining two functions, `count_hashtags()` and `count_mentions()` respectively, and applying them to the `content` feature of `tweets`.

In case you don't recall, the tweets are contained in the `content` feature of `tweets`.

Instructions 1/2
----------------

-   In the list comprehension, use `startswith()` to check if a particular `word` starts with `'#'`.

In [None]:
# Function that returns number of hashtags in a string
def count_hashtags(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are hashtags
    hashtags = [word for word in words if word.startswith('#')]
    
    # Return number of hashtags
    return(len(hashtags))

# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()

Instructions 2/2
----------------

-   In the list comprehension, use `startswith()` to check if a particular `word` starts with `'@'`.

In [None]:
# Function that returns number of mentions in a string
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()