What is Natural Language Processing?
------------------------------------

Natural language processing (NLP) is a massive field of study and an actively used practice that aims to make sense of language using statistics and computers. In this course, you will learn some of the basics of NLP, which will help you move from simple to more difficult and advanced topics. Even though this is the first course, you will still get some exposure to challenges of the field such as topic identification and text classification. Some interesting NLP areas you might have heard about are topic identification, chatbots, text classification, translation, and sentiment analysis. There are also many more! You will learn the fundamentals of some of these topics as we move through the course.

What exactly are regular expressions?
-----------------------------------------

Regular expressions are strings with a special syntax that lets you match patterns and find other strings. A pattern is a series of letters or symbols that can map to actual text, words, or punctuation. You can use regular expressions to do things like find links in a webpage, parse email addresses, and remove unwanted strings or characters. Regular expressions are often referred to as regex and can be used easily in Python via the `re` library, which you load with `import re`. We can match a substring by using the `re.match` method, which matches a pattern against the start of a string. It takes the pattern as the first argument and the string as the second, and returns a match object; matching the pattern `abc` against a string that begins with `abc` matches exactly what we expect: `abc`. We can also use special patterns that regex understands, like `\w+`, which will match a word. For a string that begins with `hi`, the match object representation shows that it has matched the first word it found: `hi`.
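
A minimal sketch of the two matches described above. The input strings `"abcdef"` and `"hi there!"` are illustrative assumptions, not necessarily the ones shown in the video.

```python
import re

# re.match takes the pattern first and the string second,
# and only matches at the start of the string.
abc_match = re.match(r"abc", "abcdef")
print(abc_match.group())   # abc

# \w+ matches one or more word characters, i.e. a whole word.
word_match = re.match(r"\w+", "hi there!")
print(word_match.group())  # hi
```

Calling `.group()` on the match object returns the matched text; if nothing matches, `re.match` returns `None` instead of a match object.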

Which pattern?
==============

Which of the following regex patterns produces the output shown below?

```
>>> my_string = "Let's write RegEx!"
>>> re.findall(PATTERN, my_string)
['Let', 's', 'write', 'RegEx']

```

In the IPython Shell, try replacing `PATTERN` with one of the options below and observe the resulting output. The `re` module has been pre-imported for you and `my_string` is available in your namespace.

### Possible answers

PATTERN = r"\s+"

[/] PATTERN = r"\w+"

PATTERN = r"[a-z]"

PATTERN = r"\w"

Common regex patterns
-------------------------

There are hundreds of characters and patterns you can learn and memorize with regular expressions, but to get started, I want to share a few common patterns. The first pattern, `\w`, we already saw; on its own it matches a single word character (a letter, digit, or underscore). The `\d` pattern matches digits, which can be useful when you need to find them and separate them in a string. The `\s` pattern matches whitespace, and the period is a wildcard character that matches any letter or symbol (by default, anything except a newline). The `+` and `*` characters are greedy quantifiers: they grab as many repeats of a single character or whole pattern as they can, with `+` requiring at least one repeat and `*` allowing zero. For example, to match a full word rather than one character, we need to add the `+` symbol after the `\w`. Writing these character classes with a capital letter negates them, so `\S` matches anything that is not whitespace. You can also create a group of characters you want by putting them inside square brackets, like `[a-z]` for lowercase letters.
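
The patterns above can be tried out with `re.findall`. This is a small sketch; the example strings are hypothetical and chosen only to make each pattern's behavior visible.

```python
import re

text = "Order 66 shipped on day 7!"  # hypothetical example string

digits = re.findall(r"\d+", text)         # runs of digits
words = re.findall(r"\w+", text)          # runs of word characters
single = re.findall(r"\w", "hi!")         # one word character at a time
non_space = re.findall(r"\S+", "a b  c")  # capital S negates \s
lowercase = re.findall(r"[a-z]+", "Hello World")  # custom character group

print(digits)     # ['66', '7']
print(words)      # ['Order', '66', 'shipped', 'on', 'day', '7']
print(single)     # ['h', 'i']
print(non_space)  # ['a', 'b', 'c']
print(lowercase)  # ['ello', 'orld']
```

Note how `[a-z]+` skips the capital letters entirely, which is why `Hello` comes back as `ello`.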

Python's re module
-----------------------

In the following exercises, you'll use the `re` module to perform some simple activities, like splitting on a pattern or finding all patterns in a string. In addition to `split` and `findall`, `search` and `match` are also quite popular. You saw a simple match at the beginning of this video, and `search` is similar but doesn't require the pattern to match at the beginning of the string. The syntax for the regex library is always to pass the pattern first, and the string second. Depending on the method, it may return a list, an iterator, a new string or a match object. For example, the `re.split` method takes a pattern for spaces and a string with some spaces and returns a list with the results of splitting on spaces. This can be used for tokenization, so you can preprocess text using regex while doing natural language processing.
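
As a rough illustration of the return types mentioned above, the sketch below splits a hypothetical string on whitespace and then calls two other `re` functions on the same input.

```python
import re

sentence = "Split  on   spaces."  # hypothetical string with uneven spacing

# re.split returns a list of the pieces between matches of the pattern.
tokens = re.split(r"\s+", sentence)
print(tokens)  # ['Split', 'on', 'spaces.']

# Other functions return other types: re.sub returns a new string,
# re.search returns a match object (or None when nothing matches).
collapsed = re.sub(r"\s+", " ", sentence)
first_gap = re.search(r"\s+", sentence)
print(collapsed)         # Split on spaces.
print(first_gap.span())  # (5, 7)
```

In every call the pattern comes first and the string comes after it.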

Let's practice!
---------------

Now it's your turn! Get started writing your first Regex and I'll see you back here soon!

Practicing regular expressions: re.split() and re.findall()
===========================================================

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at `my_string` first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: It's important to prefix your regex patterns with `r` to ensure that your patterns are interpreted in the way you want them to be. Otherwise, you may encounter problems with escape sequences in strings. For example, `"\n"` in Python is used to indicate a new line, but if you use the `r` prefix, it will be interpreted as the raw string `"\n"` - that is, the character `"\"` followed by the character `"n"` - and not as a new line.
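
A short sketch of what the `r` prefix changes. The `\b` comparison is an added illustration (it is not part of the exercise): without the prefix Python turns `"\b"` into a backspace character, while `r"\b"` reaches the regex engine as a word boundary.

```python
import re

newline_len = len("\n")   # 1: a single newline character
raw_len = len(r"\n")      # 2: a backslash followed by the letter n
print(newline_len, raw_len)

sample = "cat concat cat"  # hypothetical example string

# Raw string: \b is a word boundary, so only the standalone words match.
with_raw = re.findall(r"\bcat\b", sample)
# Plain string: "\b" is a backspace character, which never occurs in sample.
without_raw = re.findall("\bcat\b", sample)

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```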

The regular expression module `re` has already been imported for you.

*Remember from the video that the syntax for the regex library is to always pass the **pattern first**, and then the **string second**.*

Instructions
------------

-   Split `my_string` on each sentence ending. To do this:
    -   Write a pattern called `sentence_endings` to match sentence endings (`.?!`).
    -   Use `re.split()` to split `my_string` on the pattern and print the result.
-   Find and print all capitalized words in `my_string` by writing a pattern called `capitalized_words` and using `re.findall()`. 
    -   Remember the `[a-z]` pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
-   Write a pattern called `spaces` to match one or more spaces (`"\s+"`) and then use `re.split()` to split `my_string` on this pattern, keeping all punctuation intact. Print the result.
-   Find all digits in `my_string` by writing a pattern called `digits` (`"\d+"`) and using `re.findall()`. Print the result.

In [None]:
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))


What is tokenization?
-------------------------

Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing. There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expressions. Normally, though, tokenization will do things like break out words or sentences and often separate punctuation, or you can even tokenize just parts of a string, like separating all hashtags in a Tweet.
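
To make the idea of custom tokenization rules concrete, here is a minimal regex-only sketch. The tweet text and both patterns are hypothetical choices for illustration, not rules taken from any particular tokenizer.

```python
import re

tweet = "Learning #NLP with #Python is fun!"  # hypothetical tweet

# Rule 1: keep only the hashtags (a '#' followed by word characters).
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#NLP', '#Python']

# Rule 2: split into words, with each punctuation mark as its own token.
tokens = re.findall(r"\w+|[^\w\s]", "Hi there, friend!")
print(tokens)  # ['Hi', 'there', ',', 'friend', '!']
```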


nltk library
----------------

One library that is commonly used for simple tokenization is nltk, the natural language toolkit library. Its `word_tokenize` function breaks a string down into tokens: in the result, words are separated and punctuation marks are individual tokens as well.


Why tokenize?
-----------------

Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping parts of speech, matching common words and perhaps removing unwanted tokens like common words or repeated words. Here is a good example. The sentence is: I don't like Sam's shoes. When we tokenize it, we can clearly see the negation in the `n't` token and we can see possession in the `'s` token. These indicators can help us determine meaning from simple text.

Other nltk tokenizers
-------------------------

Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including the ones you'll be working with in this chapter. The `sent_tokenize` function will split a document into individual sentences. The `regexp_tokenize` function uses regular expressions to tokenize the string, giving you more granular control over the process. And the `TweetTokenizer` class does neat things like recognize hashtags, mentions and runs of too many punctuation symbols following a sentence. How convenient!!!

More regex practice
-----------------------

You'll be using more regex in this section as well, not only when you are tokenizing, but also when figuring out how to parse tokens and text. The regex module's `re.match` and `re.search` are pretty essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they are different. When we use search and match with the same pattern and string, and the pattern is at the beginning of the string, we find identical matches. That is the case with matching and searching `abcde` with the pattern `abc`. When we use search for a pattern that appears later in the string we get a result, but we don't get the same result using match. This is because match tries to match the pattern starting at the beginning of the string and continues only until it can no longer match. Search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.
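
The difference can be sketched as follows. The string `abcde` and the pattern `abc` come from the description above; the second pattern, `cd`, is an assumed example of a pattern that appears later in the string.

```python
import re

# Pattern at the start of the string: match and search agree.
match_start = re.match(r"abc", "abcde")
search_start = re.search(r"abc", "abcde")
print(match_start.span(), search_start.span())  # (0, 3) (0, 3)

# Pattern later in the string: only search finds it.
match_later = re.match(r"cd", "abcde")
search_later = re.search(r"cd", "abcde")
print(match_later)          # None
print(search_later.span())  # (2, 4)
```

Because `re.match` returns `None` on failure, it is worth checking the result before calling methods such as `.span()` or `.group()` on it.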

Let's practice!
-------------------

Now it's your turn to try some tokenization!

Word tokenization with NLTK
===========================

Here, you'll be using the first scene of Monty Python's Holy Grail, which has been pre-loaded as `scene_one`. Feel free to check it out in the IPython Shell!

Your job in this exercise is to utilize `word_tokenize` and `sent_tokenize` from `nltk.tokenize` to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.

Instructions
------------

-   Import the `sent_tokenize` and `word_tokenize` functions from `nltk.tokenize`.
-   Tokenize all the sentences in `scene_one` using the `sent_tokenize()` function.
-   Tokenize the fourth sentence in `sentences`, which you can access as `sentences[3]`, using the `word_tokenize()` function. 
-   Find the unique tokens in the entire scene by using `word_tokenize()` on `scene_one` and then converting it into a set using `set()`.
-   Print the unique tokens found. This has been done for you, so hit 'Submit Answer' to see the results!





In [None]:
# Import the tokenizers from nltk.tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

More regex with re.search()
===========================

In this exercise, you'll utilize `re.search()` and `re.match()` to find specific tokens. Both `search` and `match` expect regex patterns, similar to those you defined in an earlier exercise. You'll apply these regex library methods to the same Monty Python text from the `nltk` corpora.

You have both `scene_one` and `sentences` available from the last exercise; now you can use them with `re.search()` and `re.match()` to extract and match more text.

Instructions
------------

-   Use `re.search()` to search for the first occurrence of the word `"coconuts"` in `scene_one`. Store the result in `match`.
-   Print the start and end indexes of `match` using its `.start()` and `.end()` methods, respectively.
  
-   Write a regular expression called `pattern1` to find anything in square brackets.
-   Use `re.search()` with the pattern to find the first text in square brackets in `scene_one`. Print the result.

-   Create a pattern to match the script notation (e.g. `Character:`), assigning the result to `pattern2`. *Remember that you will want to match any words or spaces that precede the `:` (such as the space within `SOLDIER #1:`).*
-   Use `re.match()` with your new pattern to find and print the script notation in the **fourth** sentence, `sentences[3]`. The tokenized sentences are available in your namespace as `sentences`.
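
The steps above can be tried on a small stand-in string first. In this sketch, `line` is a hypothetical script line, not the real contents of `scene_one`, and the patterns are one possible choice rather than the only correct answer.

```python
import re

# Hypothetical script line standing in for the pre-loaded scene_one.
line = "ARTHUR: We have coconuts! [clop clop]"

# Locate a word and recover it from its start and end indexes.
found = re.search(r"coconuts", line)
word = line[found.start():found.end()]
print(word)  # coconuts

# Anything in square brackets: an escaped '[', any characters, an escaped ']'.
bracketed = re.search(r"\[.*\]", line).group()
print(bracketed)  # [clop clop]

# Script notation: words or spaces followed by a colon, at the start of the line.
speaker = re.match(r"[\w\s]+:", line).group()
print(speaker)  # ARTHUR:

# A speaker such as "SOLDIER #1:" contains '#', which is neither a word
# character nor whitespace, so it has to be added to the character group.
no_hash = re.match(r"[\w\s]+:", "SOLDIER #1: Halt!")
with_hash = re.match(r"[\w\s#]+:", "SOLDIER #1: Halt!").group()
print(no_hash)    # None
print(with_hash)  # SOLDIER #1:
```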

In [None]:
import re

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
