1\. Introduction to regular expressions
---------------------------------------

00:00 - 00:06

Welcome to the course! In this video, you'll be learning about regular expressions.

2\. What is Natural Language Processing?
----------------------------------------

00:06 - 00:47

Natural language processing is a massive field of study and actively used practice which aims to make sense of language using statistics and computers. In this course, you will learn some of the basics of NLP which will help you move from simple to more difficult and advanced topics. Even though this is the first course, you will still get some exposure to the challenges of the field such as topic identification and text classification. Some interesting NLP areas you might have heard about are: topic identification, chatbots, text classification, translation, sentiment analysis. There are also many more! You will learn the fundamentals of some of these topics as we move through the course.

* Field of study focused on making sense of language
  * Using statistics and computers

* You will learn the basics of NLP
  * Topic identification
  * Text classification

* NLP applications include:
  * Chatbots
  * Translation
  * Sentiment analysis
  * ... and many more!

3\. What exactly are regular expressions?
-----------------------------------------

00:47 - 01:54

Regular expressions are strings with a special syntax that lets you match patterns and find other strings. A pattern is a series of letters or symbols that can map to actual text, such as words or punctuation. You can use regular expressions to find links in a webpage, parse email addresses and remove unwanted strings or characters. Regular expressions are often referred to as regex and can be used easily with Python via the `re` library. Here we have a simple import of the library. We can match a substring using the `re.match` method, which matches a pattern against a string. It takes the pattern as the first argument and the string as the second, and returns a match object. Here we see it matched exactly what we expected: abc. We can also use special patterns that regex understands, like `\w+`, which will match a word. We can see from the match object representation that it has matched the first word it found: hi.

* Strings with a special syntax

* Allow us to match patterns in other strings

* Applications of regular expressions:
  * Find all web links in a document
  * Parse email addresses
  * Remove/replace unwanted characters

```python
import re
re.match('abc', 'abcdef')
```

`<_sre.SRE_Match object; span=(0, 3), match='abc'>`

```python
word_regex = r'\w+'
re.match(word_regex,
    'hi there!')
```

`<_sre.SRE_Match object; span=(0, 2), match='hi'>`
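As a minimal sketch of working with the match objects shown above, the snippet below pulls the matched text and its position out of the result. The variable names are illustrative and not part of the course. Newer Python versions print the object as `re.Match` rather than `_sre.SRE_Match`, but the methods are the same.

```python
import re

# re.match returns a match object, or None when the pattern does not match
abc_match = re.match(r'abc', 'abcdef')
print(abc_match.group())   # the matched text
print(abc_match.span())    # (start, end) indexes of the match

# \w+ matches one or more word characters, so it stops at the first space
word_match = re.match(r'\w+', 'hi there!')
print(word_match.group())

# When the pattern is not found at the start of the string, the result is None
print(re.match(r'abc', 'xyzabc'))
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when nothing matches.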

4\. Common regex patterns
-------------------------

01:54 - 02:10

There are hundreds of characters and patterns you can learn and memorize with regular expressions, but to get started, I want to share a few common patterns. The first pattern, `\w`, we already saw; it matches word characters. The `\d` pattern matches digits, which can be useful when you need to find them and separate them in a string. The `\s` pattern matches whitespace, and the period is a wildcard character. The wildcard will match ANY letter or symbol (except, by default, a newline). The `+` and `*` characters make a pattern greedy, grabbing repeats of single letters or whole patterns. For example, to match a full word rather than one character, we need to add the `+` symbol after the `\w`. Writing these character classes as capital letters negates them, so `\S` matches anything that is not a space. You can also create a group of characters you want by putting them inside square brackets, like our lowercase group.

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |

5\. Common regex patterns (2)
-----------------------------

02:10 - 02:17

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |

6\. Common regex patterns (3)
-----------------------------

02:17 - 02:21

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |
| \s       | space    | ' '      |

7\. Common regex patterns (4)
-----------------------------

02:21 - 02:29

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |
| \s       | space    | ' '      |
| .*       | wildcard | 'username74' |

8\. Common regex patterns (5)
-----------------------------

02:29 - 02:44

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |
| \s       | space    | ' '      |
| .*       | wildcard | 'username74' |
| + or *   | greedy match | 'aaaaaa' |

9\. Common regex patterns (6)
-----------------------------

02:44 - 02:55

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |
| \s       | space    | ' '      |
| .*       | wildcard | 'username74' |
| + or *   | greedy match | 'aaaaaa' |
| \S       | not space | 'no_spaces' |

10\. Common regex patterns (7)
------------------------------

02:55 - 03:02

| pattern | matches | example |
|----------|----------|----------|
| \w+      | word     | 'Magic'  |
| \d       | digit    | 9        |
| \s       | space    | ' '      |
| .*       | wildcard | 'username74' |
| + or *   | greedy match | 'aaaaaa' |
| \S       | not space | 'no_spaces' |
| [a-z]    | lowercase group | 'abcdefg' |

11\. Python's re module
-----------------------

03:02 - 03:56

In the following exercises, you'll use the `re` module to perform some simple activities, like splitting on a pattern or finding all patterns in a string. In addition to split and findall, search and match are also quite popular. You saw a simple match at the beginning of this video, and search is similar but doesn't require you to match the pattern from the beginning of the string. The syntax for the regex library is always to pass the pattern first, and the string second. Depending on the method, it may return an iterator, a new string or a match object. Here we see the re.split method will take a pattern for spaces and a string with some spaces and return a list object with the results of splitting on spaces. This can be used for tokenization, so you can preprocess text using regex while doing natural language processing.

* `re` module

* `split`: split a string on regex

* `findall`: find all patterns in a string

* `search`: search for a pattern

* `match`: match an entire string or substring based on a
  pattern

* Pattern first, and the string second

* May return an iterator, string, or match object

```python
re.split(r'\s+', 'Split on spaces.')
```

`['Split', 'on', 'spaces.']`
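The four functions listed above can be compared side by side. This is a sketch on a made-up sentence (`text` is an assumption), showing what kind of object each call returns.

```python
import re

text = 'Split on spaces, then count 2 or 3 words.'  # hypothetical sentence

tokens = re.split(r'\s+', text)         # list of the pieces between matches
numbers = re.findall(r'\d+', text)      # list of every match
first_number = re.search(r'\d+', text)  # match object for the first match anywhere
at_start = re.match(r'\d+', text)       # None, because the text does not start with a digit

print(tokens)
print(numbers)
print(first_number.group(), first_number.span())
print(at_start)
```

In every call the pattern comes first and the string second.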

12\. Let's practice!
--------------------

03:56 - 04:02

Now it's your turn! Get started writing your first Regex and I'll see you back here soon!

Which pattern?
==============

Which of the following Regex patterns results in the following text? 

```
>>> my_string = "Let's write RegEx!"
>>> re.findall(PATTERN, my_string)
['Let', 's', 'write', 'RegEx']

```

In the IPython Shell, try replacing `PATTERN` with one of the below options and observe the resulting output. The `re` module has been pre-imported for you and `my_string` is available in your namespace.

### Possible answers

PATTERN = r"\s+"

[/] PATTERN = r"\w+"

PATTERN = r"[a-z]"

PATTERN = r"\w"

Practicing regular expressions: re.split() and re.findall()
===========================================================

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at `my_string` first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: It's important to prefix your regex patterns with `r` to ensure that your patterns are interpreted in the way you want them to. Otherwise, you may run into problems with escape sequences in strings. For example, `"\n"` in Python indicates a new line, but with the `r` prefix it is interpreted as the raw string `"\n"` (that is, the character `"\"` followed by the character `"n"`) and not as a new line.
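A short sketch of the difference the `r` prefix makes. The `\b` example is an addition for illustration: in a plain Python string `"\b"` is a backspace character, while in a regex it means a word boundary.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n
print(len(plain), len(raw))

# Without the r prefix, Python turns \b into a backspace before re ever sees it
without_prefix = re.findall("\bcat\b", "cat concat")

# With the r prefix, re receives \b and treats it as a word boundary
with_prefix = re.findall(r"\bcat\b", "cat concat")

print(without_prefix)
print(with_prefix)
```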

The regular expression module `re` has already been imported for you.

*Remember from the video that the syntax for the regex library is to always pass the **pattern first**, and then the **string second**.*

Instructions
------------

-   Split `my_string` on each sentence ending. To do this:
    -   Write a pattern called `sentence_endings` to match sentence endings (`.?!`).
    -   Use `re.split()` to split `my_string` on the pattern and print the result.
-   Find and print all capitalized words in `my_string` by writing a pattern called `capitalized_words` and using `re.findall()`. 
    -   Remember the `[a-z]` pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
-   Write a pattern called `spaces` to match one or more spaces (`"\s+"`) and then use `re.split()` to split `my_string` on this pattern, keeping all punctuation intact. Print the result.
-   Find all digits in `my_string` by writing a pattern called `digits` (`"\d+"`) and using `re.findall()`. Print the result.

In [None]:
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))


1\. Introduction to tokenization
--------------------------------

00:00 - 00:04

In this video, we'll learn more about string tokenization!

2\. What is tokenization?
-------------------------

00:04 - 00:39

Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing. There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expressions. Normally, though, tokenization will do things like break out words or sentences and often separate punctuation, or you can tokenize just parts of a string, like separating all hashtags in a Tweet.

* Turning a string or document into tokens (smaller chunks)

* One step in preparing a text for NLP

* Many different theories and rules

* You can create your own rules using regular expressions

* Some examples:
  * Breaking out words or sentences
  * Separating punctuation
  * Separating all hashtags in a tweet

3\. nltk library
----------------

00:39 - 00:58

One library that is commonly used for simple tokenization is nltk, the natural language toolkit library. Here is a short example of using the word_tokenize method to break down a string into tokens. We can see from the result that words are separated and punctuation marks are individual tokens as well.

* `nltk`: natural language toolkit

```python
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
```

`['Hi', 'there', '!']`

4\. Why tokenize?
-----------------

00:58 - 01:31

Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping parts of speech, matching common words and perhaps removing unwanted tokens like common words or repeated words. Here, we have a good example. The sentence is: I don't like Sam's shoes. When we tokenize it we can clearly see the negation in the n't and we can see possession with the 's. These indicators can help us determine meaning from simple text.

* Easier to map part of speech

* Matching common words

* Removing unwanted tokens

* "I don't like Sam's shoes."

* "I", "do", "n't", "like", "Sam", "'s", "shoes", "."

This shows how tokenization breaks down a sentence into its individual components, including splitting contractions and punctuation marks into separate tokens. It's a key preprocessing step that makes further NLP analysis possible.


5\. Other nltk tokenizers
-------------------------

01:31 - 02:03

Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including these ones you'll be working with in this chapter. The sent_tokenize function will split a document into individual sentences. The regexp_tokenize function uses regular expressions to tokenize the string, giving you more granular control over the process. And the TweetTokenizer class does neat things like recognize hashtags, mentions and when you have too many punctuation symbols following a sentence. How convenient!!!

* `sent_tokenize`: tokenize a document into sentences

* `regexp_tokenize`: tokenize a string or document based on a 
  regular expression pattern

* `TweetTokenizer`: special class just for tweet tokenization, 
  allowing you to separate hashtags, mentions and lots of
  exclamation points!!!

6\. More regex practice
-----------------------

02:03 - 03:07

You'll be using more regex in this section as well, not only when you are tokenizing, but also when figuring out how to parse tokens and text. The regex module's re.match and re.search are essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they are different. When we use search and match with the same pattern and string, and the pattern is at the beginning of the string, we find identical matches. That is the case with matching and searching abcde with the pattern abc. When we use search for a pattern that appears later in the string we get a result, but we don't get the same result using match. This is because match will try to match a string from the beginning until it cannot match any longer. Search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.

* Difference between `re.search()` and `re.match()`

```python
import re
re.match('abc', 'abcde')
```
`<_sre.SRE_Match object; span=(0, 3), match='abc'>`

```python
re.search('abc', 'abcde')
```
`<_sre.SRE_Match object; span=(0, 3), match='abc'>`

```python
re.match('cd', 'abcde')   # returns None: 'cd' is not at the start of the string
re.search('cd', 'abcde')  # finds 'cd' later in the string
```
`<_sre.SRE_Match object; span=(2, 4), match='cd'>`

This demonstrates that `re.match()` only finds matches at the beginning of a string, while `re.search()` will find matches anywhere in the string. That's why the 'cd' pattern is only found by `search()` but not by `match()`.
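Since both functions return `None` when nothing is found, code that uses them usually checks the result first. A minimal sketch, with illustrative variable names:

```python
import re

text = 'abcde'

match_result = re.match(r'cd', text)    # None: 'cd' is not at the start
search_result = re.search(r'cd', text)  # match object for 'cd'

# Check for None before calling methods on the result
for name, result in [('match', match_result), ('search', search_result)]:
    if result is None:
        print(name, '-> no match')
    else:
        print(name, '->', result.group(), result.span())

# Anchoring the pattern with ^ restricts search to the start of the string
anchored = re.search(r'^cd', text)
print(anchored)
```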

7\. Let's practice!
-------------------

03:07 - 03:11

Now it's your turn to try some tokenization!

Word tokenization with NLTK
===========================

Here, you'll be using the first scene of Monty Python's Holy Grail, which has been pre-loaded as `scene_one`. Feel free to check it out in the IPython Shell!

Your job in this exercise is to utilize `word_tokenize` and `sent_tokenize` from `nltk.tokenize` to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.

Instructions
------------

-   Import the `sent_tokenize` and `word_tokenize` functions from `nltk.tokenize`.
-   Tokenize all the sentences in `scene_one` using the `sent_tokenize()` function.
-   Tokenize the fourth sentence in `sentences`, which you can access as `sentences[3]`, using the `word_tokenize()` function. 
-   Find the unique tokens in the entire scene by using `word_tokenize()` on `scene_one` and then converting it into a set using `set()`.
-   Print the unique tokens found. This has been done for you, so hit 'Submit Answer' to see the results!





In [None]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

More regex with re.search()
===========================

In this exercise, you'll utilize `re.search()` and `re.match()` to find specific tokens. Both `search` and `match` expect regex patterns, similar to those you defined in an earlier exercise. You'll apply these regex library methods to the same Monty Python text from the `nltk` corpora.

You have both `scene_one` and `sentences` available from the last exercise; now you can use them with `re.search()` and `re.match()` to extract and match more text.

Instructions 1/3
----------------

-   Use `re.search()` to search for the first occurrence of the word `"coconuts"` in `scene_one`. Store the result in `match`.
-   Print the start and end indexes of `match` using its `.start()` and `.end()` methods, respectively.

In [None]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

Instructions 2/3
----------------

-   Write a regular expression called `pattern1` to find anything in square brackets.
-   Use `re.search()` with the pattern to find the first text in square brackets in `scene_one`. Print the result.

In [None]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

Instructions 3/3
----------------

-   Create a pattern to match the script notation (e.g. `Character:`), assigning the result to `pattern2`. *Remember that you will want to match any words or spaces that precede the `:` (such as the space within `SOLDIER #1:`).*
-   Use `re.match()` with your new pattern to find and print the script notation in the fourth line. The tokenized sentences are available in your namespace as `sentences`.

In [None]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
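The two patterns from this exercise can be tried on a made-up line. The strings below are assumptions written in the style of the script, not the actual `scene_one` text, and the non-greedy `.*?` variant is an addition for comparison.

```python
import re

# Hypothetical line in the style of the script
line = "ARTHUR: Whoa there! [clop clop] We have ridden far. [wind]"

greedy = re.search(r"\[.*\]", line)    # .* is greedy: runs to the LAST closing bracket
lazy = re.search(r"\[.*?\]", line)     # .*? stops at the FIRST closing bracket
speaker = re.match(r"[\w\s]+:", line)  # word characters or spaces, then a colon

print(greedy.group())
print(lazy.group())
print(speaker.group())

# The class [\w\s] does not include '#', so a name like "SOLDIER #1:" is not matched
soldier = re.match(r"[\w\s]+:", "SOLDIER #1: Halt!")
soldier_with_hash = re.match(r"[\w\s#]+:", "SOLDIER #1: Halt!")
print(soldier, soldier_with_hash.group())
```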

1\. Advanced tokenization with regex
------------------------------------

00:00 - 00:08

In this video, we'll take a look at doing more advanced tokenization with regex.

2\. Regex groups using or "|"
-----------------------------

00:08 - 01:01

One new regex pattern you will find useful for advanced tokenization is the OR operator. In regex, OR is represented by the pipe character. To use OR, you can define a group using parentheses. Groups can be either a pattern or a set of characters you want to match. You can also define explicit character classes using square brackets. We'll go into more depth on groups and ranges soon. Let's take an example string that we want to tokenize using regular expressions, finding all digits and words. We define our pattern using a group with the OR symbol and make both alternatives greedy so they catch the full word or digits. Then, we can call findall using Python's re library and return our tokens. Notice that our pattern does not match punctuation but properly matches the words and digits.

* OR is represented using `|`

* You can define a group using `()`

* You can define explicit character ranges using `[]`

```python
import re
match_digits_and_words = r'(\d+|\w+)'
re.findall(match_digits_and_words, 'He has 11 cats.')
```

`['He', 'has', '11', 'cats']`

This example shows how to:
- Use `|` to match either digits OR words
- Group the pattern using parentheses
- Find all occurrences that match either digits or words in the string
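A few variations on the OR pattern, as a sketch. The second and third patterns are illustrative additions rather than course material.

```python
import re

sentence = 'He has 11 cats.'

# With a single capturing group, findall returns the text captured by the group
words_and_digits = re.findall(r'(\d+|\w+)', sentence)

# Alternatives are tried left to right at each position in the string
mixed = re.findall(r'(\d+|[a-z]+)', 'abc123')

# Adding an alternative for sentence punctuation keeps the period as its own token
with_punctuation = re.findall(r'\d+|\w+|[.!?]', sentence)

print(words_and_digits)
print(mixed)
print(with_punctuation)
```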

3\. Regex ranges and groups
---------------------------

01:01 - 02:28

Let's take a look at another more advanced topic, defining groups and character ranges. Here we have another chart of patterns, and this time we are using ranges or character classes marked by the square brackets and groups marked by the parentheses. We can see in this chart that we can use square brackets to define a new character class. For example, we can match all upper and lowercase English letters using Uppercase A hyphen Uppercase Z which will match all uppercase and then lowercase a hyphen lowercase z which will match all lowercase letters. We can also make ranges to match all digits 0 hyphen 9, or perhaps a more complex range like uppercase and lowercase English with the hyphen and period. Because the hyphen and period are special characters in regex, we must tell regex we mean an ACTUAL period or hyphen. To do so, we use what is called an escape character, and in regex that means placing a backslash in front of our character so it knows to look for a hyphen or period. On the other hand, with groups, which are designated by the parentheses, we can only match what we explicitly define in the group. So (a-z) matches only a, a hyphen and z. Groups are useful when you want to define an explicit group, such as the final example, where we are matching spaces or commas.

| pattern | matches | example |
|----------|----------|----------|
| [A-Za-z]+ | upper and lowercase English alphabet | 'ABCDEFghijk' |
| [0-9] | numbers from 0 to 9 | 9 |
| [A-Za-z\\-\\.]+ | upper and lowercase English alphabet, - and . | 'My-Website.com' |
| (a-z) | a, - and z | 'a-z' |
| (\\s+\|,) | spaces or a comma | ', ' |
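The difference between square brackets and parentheses is easiest to see by running the chart's patterns. This is a sketch on made-up input strings.

```python
import re

# Character class: any mix of letters, hyphens and periods
letters_hyphen_period = re.findall(r'[A-Za-z\-\.]+', 'Visit My-Website.com today')

# Group: matches the literal sequence a, hyphen, z and nothing else
literal_group = re.findall(r'(a-z)', 'a-z abc')

# The same characters inside square brackets form a range instead
lowercase_range = re.findall(r'[a-z]', 'a-z abc')

# Group with OR: a run of whitespace, or a single comma
spaces_or_comma = re.findall(r'(\s+|,)', 'one, two')

print(letters_hyphen_period)
print(literal_group)
print(lowercase_range)
print(spaces_or_comma)
```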

4\. Character range with `re.match()`
-------------------------------------

02:28 - 02:57

In this code example, we can use match with a character range to match all lowercase ASCII letters, any digits and spaces. It is greedy, as marked by the + after the range definition, but once it hits the comma, it can't match anymore. This short example demonstrates that thinking about what regex method you use (such as search versus match) and whether you define a group or a range can have a large impact on the usefulness and readability of your patterns.

```python
import re
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)
```

`<_sre.SRE_Match object; span=(0, 35), match='match lowercase spaces nums like 12'>`
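To check where the greedy match stops, the match object's `.end()` index can be used to look at the next character. A small sketch; the second pattern, with a comma added to the range, is an illustrative variation.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'
m = re.match(r'[a-z0-9 ]+', my_str)

stop_index = m.end()          # index of the first character that was not matched
stop_char = my_str[stop_index]
print(m.group())
print(stop_index, repr(stop_char))

# With the comma added to the range, the match runs to the end of the string
full = re.match(r'[a-z0-9, ]+', my_str)
print(full.group() == my_str)
```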

5\. Let's practice!
-------------------

02:57 - 03:03

Now it's your turn to practice advanced regex techniques to help with tokenization!

Choosing a tokenizer
====================

Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have `'#1'` remain a single token.

```
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

```

The string is available in your workspace as `my_string`, and the patterns have been pre-loaded as `pattern1`, `pattern2`, `pattern3`, and `pattern4`, respectively. 

Additionally, `regexp_tokenize` has been imported from `nltk.tokenize`. You can use `regexp_tokenize(string, pattern)` with `my_string` and one of the patterns as arguments to experiment for yourself and see which is the best tokenizer.

Instructions
------------

### Possible answers

r"(\w+|\?|!)"

[/] r"(\w+|#\d|\?|!)"

r"(#\d\w+\?!)"

r"\s+"
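One way to compare the candidate patterns outside the course environment is with `re.findall`, used here as a stand-in for `regexp_tokenize`. That substitution is an assumption: the two should behave similarly for these patterns, but `regexp_tokenize` may differ in details depending on the NLTK version.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

patterns = [r"(\w+|\?|!)", r"(\w+|#\d|\?|!)", r"(#\d\w+\?!)", r"\s+"]

# Map each pattern to the tokens it produces
results = {p: re.findall(p, my_string) for p in patterns}
for p in patterns:
    print(p, '->', results[p])
```

Only the pattern with the `#\d` alternative keeps `#1` together while still returning `?` and `!` as separate tokens.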





Regex with NLTK tokenization
============================

Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using `nltk` and regex. The `nltk.tokenize.TweetTokenizer` class gives you some extra methods and attributes for parsing tweets. 

Here, you're given some example tweets to parse using both `TweetTokenizer` and `regexp_tokenize` from the `nltk.tokenize` module. These example tweets have been pre-loaded into the variable `tweets`. Feel free to explore it in the IPython Shell!

*Unlike the syntax for the regex library, with `regexp_tokenize()` you pass the pattern as the **second** argument.*

Instructions 1/4
----------------

-   From `nltk.tokenize`, import `regexp_tokenize` and `TweetTokenizer`.

In [None]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
regexp_tokenize(tweets[0], pattern1)

# Write a pattern that matches both mentions and hashtags
pattern2 = r"([@#]\w+)"

# Use the pattern on the last tweet in the tweets list
regexp_tokenize(tweets[-1], pattern2)

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

Instructions 2/4
----------------

-   A regex pattern to define hashtags called `pattern1` has been defined for you. Call `regexp_tokenize()` with this hashtag pattern on the **first** tweet in `tweets` and assign the result to `hashtags`.
-   Print `hashtags` (this has already been done for you).

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

Instructions 3/4
----------------

-   Write a new pattern called `pattern2` to match mentions and hashtags. A mention is something like `@DataCamp`. 

-   Then, call `regexp_tokenize()` with your new hashtag pattern on the **last** tweet in `tweets` and assign the result to `mentions_hashtags`.

    -   You can access the last element of a list using `-1` as the index, for example, `tweets[-1]`.
-   Print `mentions_hashtags` (this has been done for you).

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"

# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

Instructions 4/4
----------------

-   Create an instance of `TweetTokenizer` called `tknzr` and use it inside a list comprehension to tokenize each tweet into a new list called `all_tokens`. 
    -   To do this, use the `.tokenize()` method of `tknzr`, with `t` as your iterator variable.
-   Print `all_tokens`.

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"

# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
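The hashtag and mention patterns can also be tried without NLTK. In this sketch the `tweets` list is hypothetical (the course's pre-loaded data is not shown here) and `re.findall` is used as a stand-in for `regexp_tokenize`.

```python
import re

# Hypothetical tweets standing in for the pre-loaded list
tweets = ['This is a #nlp exercise! #python',
          'Thanks @datacamp :) #learning']

hashtags = re.findall(r"#\w+", tweets[0])
mentions_hashtags = re.findall(r"[@#]\w+", tweets[-1])
print(hashtags)
print(mentions_hashtags)

# Inside square brackets, | is a literal character rather than OR,
# so a class such as [#|@] would also accept tokens starting with a pipe
pipe_match = re.findall(r"[#|@]\w+", "a|b #c")
print(pipe_match)
```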

Non-ascii tokenization
======================

In this exercise, you'll practice advanced tokenization by tokenizing some non-ascii based text. You'll be using German with emoji!

Here, you have access to a string called `german_text`, which has been printed for you in the Shell. Notice the emoji and the German characters!

The following modules have been pre-imported from `nltk.tokenize`: `regexp_tokenize` and `word_tokenize`. 

Unicode ranges for emoji are:

`('\U0001F300-\U0001F5FF')`, `('\U0001F600-\U0001F64F')`, `('\U0001F680-\U0001F6FF')`, and `('\u2600-\u26FF\u2700-\u27BF')`.

Instructions
------------

-   Tokenize all the words in `german_text` using `word_tokenize()`, and print the result.
-   Tokenize only the capital words in `german_text`. 
    -   First, write a pattern called `capital_words` to match only capital words. Make sure to check for the German `Ü`! To use this character in the exercise, copy and paste it from these instructions.
    -   Then, tokenize it using `regexp_tokenize()`. 
-   Tokenize only the emoji in `german_text`. The pattern using the unicode ranges for emoji given in the assignment text has been written for you. Your job is to use `regexp_tokenize()` to tokenize the emoji.

In [None]:
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))
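A sketch of the same idea using only the `re` module. The `german_text` value is a hypothetical stand-in for the pre-loaded string, and `re.findall` replaces `regexp_tokenize` here as an assumption.

```python
import re

# Hypothetical stand-in for german_text
german_text = "Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕"

capital_words = re.findall(r"[A-ZÜ]\w+", german_text)

# One character class listing the emoji ranges back to back
emoji_class = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
emoji_found = re.findall(emoji_class, german_text)

print(capital_words)
print(emoji_found)

# Quotes and | placed inside square brackets are treated as literal characters,
# so a class written that way would also match apostrophes and pipes
stray = re.findall("['\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF']", "it's a|b")
print(stray)
```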

1\. Charting word length with nltk
----------------------------------

00:00 - 00:07

Hi everyone! In this video, we are going to learn about using charts with our NLP tools.

2\. Getting started with matplotlib
-----------------------------------

00:07 - 00:30

Matplotlib is a charting library used by many different open-source Python projects to create data visualizations, charts and graphs. It has fairly straightforward functionality with lots of options for graphs like histograms, bar charts, line charts and scatter plots. It even has advanced functionality like generating 3D graphs and animations.

* Charting library used by many open source Python projects
* Straightforward functionality with lots of options
  * Histograms
  * Bar charts
  * Line charts
  * Scatter plots
* ... and also advanced functionality like 3D graphs and animations!

3\. Plotting a histogram with matplotlib
----------------------------------------

00:30 - 01:00

Matplotlib is usually imported by simply aliasing the pyplot module as plt. If we want to plot a basic histogram, which is a type of plot used to show distribution of data, we can pass in a small array to the `hist` function. The array has 5 appearing twice and 7 appearing three times, so it's a good candidate to show distribution. Finally, we call the plt.show function and matplotlib will show us the generated chart in our system's standard graphics viewing tool.

```python
from matplotlib import pyplot as plt
plt.hist([1, 5, 5, 7, 7, 7, 9])
# Output (bin counts, bin edges, patches):
# (array([ 1., 0., 0., 0., 0., 2., 0., 3., 0., 1.]),
#  array([ 1., 1.8, 2.6, 3.4, 4.2, 5., 5.8, 6.6, 7.4, 8.2, 9.]),
#  <a list of 10 Patch objects>)

plt.show()
```

4\. Generated histogram
-----------------------

01:00 - 01:29

This is the chart that we generated using the previous code. We notice that indeed it has determined proper bins for each entry and we can see that the 7 and 5 bins reflect the distribution we expected to see. It's not the prettiest chart by default, but making it look nicer is fairly easy with more arguments and several available helper libraries.

[Figure: histogram of the values 1, 5, 5, 7, 7, 7, 9. x-axis: value, from 1 to 9; y-axis: frequency, from 0 to 3.]

The image shows a histogram created using matplotlib. The x-axis ranges from 1 to 9, and the y-axis shows frequencies from 0 to 3. The data distribution is represented by blue bars showing:
- A frequency of 1 for value 1
- A frequency of 2 around value 5
- A frequency of 3 around value 7 
- A frequency of 1 for value 9

This matches the data input from the previous code: [1, 5, 5, 7, 7, 7, 9], where:
- 1 appears once
- 5 appears twice
- 7 appears three times
- 9 appears once
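As a rough sketch of what "more arguments" can look like, the example below sets explicit bin edges, colors and axis labels. The specific choices (bins centered on each integer, the color names, the label text and the output filename) are assumptions made for illustration, not settings used in the video.

```python
from matplotlib import pyplot as plt

data = [1, 5, 5, 7, 7, 7, 9]

# Bin edges at 0.5, 1.5, ..., 9.5 so each integer sits in the middle of its own bar
bin_edges = [x + 0.5 for x in range(10)]

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(
    data, bins=bin_edges, color="steelblue", edgecolor="black"
)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of sample values")
ax.set_xticks(list(range(1, 10)))

# Save to a file; in an interactive session, plt.show() would display it instead
fig.savefig("styled_histogram.png")
```

With one bin per integer, each bar's height can be read directly as the number of times that value occurs.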

5\. Combining NLP data extraction with plotting
-----------------------------------------------

01:29 - 02:27

We can then use skills we have learned throughout this first chapter to tokenize text and chart word length for a simple sentence. First, we perform the necessary imports to use NLTK for word tokenization and matplotlib for charting. Then, we tokenize the words and punctuation in a short sentence. Finally, we use a Python list comprehension to transform our list of tokenized words into a list of lengths. As a brief refresher, a list comprehension is a succinct way to write a for loop that builds a list. Looking at the syntax, we have opening and closing square brackets; inside them, we can iterate over any list and make a new list. Here, we create a list that holds the length of each word in the words list simply by writing len(w) for w in words. This iterates over each word, calculates its length and collects the results in a new list. We then pass this list of token lengths to the hist function and display our chart using the plt.show function.

```python
from matplotlib import pyplot as plt
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data to be installed
words = word_tokenize("This is a pretty cool tool!")

# Length of every token, punctuation included
word_lengths = [len(w) for w in words]

plt.hist(word_lengths)
# Output:
# (array([ 2., 0., 1., 0., 0., 0., 3., 0., 0., 1.]),
#  array([ 1., 1.5, 2., 2.5, 3., 3.5, 4., 4.5, 5., 5.5, 6.]),
#  <a list of 10 Patch objects>)

plt.show()
```
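To make the list comprehension refresher concrete, the sketch below writes the same transformation as an explicit for loop and as a comprehension. The token list is typed out by hand, as the tokens `word_tokenize` is expected to produce for this sentence, so that the example runs without NLTK; `loop_lengths` is a name introduced here for comparison.

```python
# Tokens written out by hand so the example runs without NLTK
words = ["This", "is", "a", "pretty", "cool", "tool", "!"]

# Explicit for loop: start with an empty list and append each length
loop_lengths = []
for w in words:
    loop_lengths.append(len(w))

# Equivalent list comprehension: the same logic in a single expression
word_lengths = [len(w) for w in words]

print(word_lengths)                  # [4, 2, 1, 6, 4, 4, 1]
print(loop_lengths == word_lengths)  # True
```

Note that the exclamation mark is its own token, so it contributes a length of 1 alongside "a".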

6\. Word length histogram
-------------------------

02:27 - 02:48

Here is the generated histogram from our previous code. We can see from the chart that four-letter words are the most common in our example sentence. Of course, with a simple sentence, this is easy enough to count by hand, but for an entire play or book, this would be tedious and prone to error, so writing it in code makes it a lot easier.

```
                                                      3.0 -|
                                                          |                    ■
                                                      2.5 -|
                                                          |
                                                      2.0 -|   ■
                                                          |
                                                      1.5 -|
                                                          |
                                                      1.0 -|         ■                    ■
                                                          |
                                                      0.5 -|
                                                          |
                                                      0.0 -|____________________________________
                                                          |   |   |   |   |   |   |   |   |
                                                          1   2   3   4   5   6   7   8   9
```

This histogram shows the distribution of word lengths from the text "This is a pretty cool tool!", where:

- Length 1: There are 2 one-character tokens ("a" and the punctuation token "!")
- Length 2: There is 1 two-letter word ("is")
- Length 4: There are 3 four-letter words ("This", "cool", "tool")
- Length 6: There is 1 six-letter word ("pretty")

The y-axis shows the frequency (count) of words of each length, and the x-axis shows the word lengths from 1 to 6 characters. The histogram uses blue bars to represent these frequencies, with the tallest bar at length 4 showing that four-letter words are the most common in this text sample.
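When exact tallies matter more than the picture, the same counts can be checked in code. The sketch below uses `collections.Counter` from the standard library, which is a swapped-in technique not shown in the video, on the token lengths for the example sentence.

```python
from collections import Counter

# Token lengths for "This is a pretty cool tool!", punctuation included
word_lengths = [4, 2, 1, 6, 4, 4, 1]

# Counter maps each distinct length to the number of times it occurs
length_counts = Counter(word_lengths)

for length, count in sorted(length_counts.items()):
    print(f"length {length}: {count} token(s)")

# most_common(1) returns a list holding the single most frequent (length, count) pair
most_common_length, frequency = length_counts.most_common(1)[0]
print(most_common_length, frequency)  # 4 3
```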

7\. Let's practice!
-------------------

02:48 - 02:54

Now it's your turn to start plotting NLP charts with matplotlib!

Charting practice
=================

Try using your new skills to find and chart the number of words per line in the script using `matplotlib`. The Holy Grail script is loaded for you, and you need to use regex to find the words per line. 

Using list comprehensions here will speed up your computations. For example: `my_lines = [tokenize(l) for l in lines]` will call a function `tokenize` on each line in the list `lines`. The new transformed list will be saved in the `my_lines` variable.

You have access to the entire script in the variable `holy_grail`. Go for it!

Instructions
------------

-   Split the script `holy_grail` into lines using the newline (`'\n'`) character.
-   Use `re.sub()` inside a list comprehension to remove speaker prompts such as `ARTHUR:` and `SOLDIER #1:`. The pattern has been written for you.
-   Use a list comprehension to tokenize `lines` with `regexp_tokenize()`, keeping **only words**. Recall that the pattern for words is `"\w+"`.
-   Use a list comprehension to create a list of line lengths called `line_num_words`.
    -   Use `t_line` as your iterator variable to iterate over `tokenized_lines`, and then the `len()` function to compute line lengths.
-   Plot a histogram of `line_num_words` using `plt.hist()`. Don't forget to use `plt.show()` as well to display the plot.

In [None]:
import re
from matplotlib import pyplot as plt
from nltk.tokenize import regexp_tokenize

# holy_grail is preloaded with the full script text

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove speaker prompts such as ARTHUR: from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line, keeping only words: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Count the words on each line: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the words per line
plt.hist(line_num_words)

# Show the plot
plt.show()
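
To sanity-check the steps of this exercise without the full script, they can be tried on a couple of lines. In the sketch below, `sample_lines` is a small hand-written stand-in for `holy_grail`, and `re.findall` from the standard library is swapped in for `regexp_tokenize` so the example runs without NLTK; for the `\w+` pattern the two should give the same tokens, but treat that as an assumption.

```python
import re

# Hand-written stand-in for a few lines of the script
sample_lines = [
    "ARTHUR: Whoa there!",
    "SOLDIER #1: Halt! Who goes there?",
]

# Same speaker-prompt pattern as in the exercise, written as a raw string
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"

# Remove the speaker prompt from the start of each line
cleaned = [re.sub(pattern, "", line) for line in sample_lines]

# Keep only runs of word characters, dropping punctuation
tokenized = [re.findall(r"\w+", line) for line in cleaned]

# Number of words on each line
num_words = [len(tokens) for tokens in tokenized]

print(cleaned)    # [' Whoa there!', ' Halt! Who goes there?']
print(num_words)  # [2, 4]
```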