# Introduction to Natural Language Processing in Python
In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

## $\star$ Chapter 1: Regular expressions & word tokenization
This chapter introduces some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and the more difficult tokenization cases you might encounter.

#### What is Natural Language Processing?
* Massive field of study focused on making sense of language using statistics and computers
* Some of the basics of NLP:
    * Topic identification
    * Text classification
* NLP applications include:
    * topic identification
    * chatbots
    * text classification
    * translation
    * sentiment analysis
    * ... many, many more!
    
#### Regular Expressions
* **Regular expressions** are strings with a special syntax that lets you describe patterns and find the text that matches them
* A **pattern** is a series of letters or symbols that can match actual text: words, digits, or punctuation.
* Applications of regular expression:
    * Find links in a webpage or document
    * Parse email addresses
    * Remove unwanted strings or characters
* Regular expressions are often referred to as **regex** and can be used easily with Python via the `re` module
* Match a pattern at the start of a string using the `re.match()` function:

In [1]:
import re
import nltk
from nltk.tokenize import word_tokenize

In [6]:
#nltk.download('punkt')

In [2]:
re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

* `re.match()` takes the pattern as the first argument, the string as the second argument, and returns a **match object**
* We can also use "special" patterns that regex understands, like `\w+`, which matches a word (a run of letters, digits, or underscores):

In [3]:
word_regex = r'\w+'  # raw string, so the backslash reaches the regex engine unchanged
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>
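
The match object returned above can be inspected directly. The following is a minimal sketch using only the standard `re` module; the variable names and sample strings are illustrative and not part of the course material. It also shows that `re.match()` returns `None` when the pattern does not match at the start of the string, which is worth checking before calling methods on the result.

```python
import re

m = re.match(r"\w+", "hi there!")

matched_text = m.group()   # the text that matched
span = m.span()            # (start, end) indexes of the match

# There is no word character at position 0 here, so re.match returns None
no_match = re.match(r"\w+", "!hi there")

print(matched_text)  # hi
print(span)          # (0, 2)
print(no_match)      # None
```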

* There are hundreds of characters and patterns you can learn and memorize with regular expressions, but here we get started with a few common patterns:

<img src='data/common_regex.png' width="300" height="150" align="center"/>
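
As a rough illustration of a few common patterns, the sketch below applies several of them to one sample string. The sample text and variable names are made up for this example, and the selection of patterns is an assumption about what is most useful to try first, not a transcription of the chart.

```python
import re

text = "Order 66 shipped to user_9 at 10:45!"

# \w+ : a run of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
# \d+ : a run of digits
numbers = re.findall(r"\d+", text)
# \s+ : a run of whitespace, used here as a separator
chunks = re.split(r"\s+", text)
# \S+ : a run of non-whitespace characters
non_space = re.findall(r"\S+", text)
# [a-z]+ : a run of lowercase ASCII letters
lowercase = re.findall(r"[a-z]+", text)
# .* : any characters (except a newline), as many as possible
everything = re.match(r".*", text).group()

print(words)      # ['Order', '66', 'shipped', 'to', 'user_9', 'at', '10', '45']
print(numbers)    # ['66', '9', '10', '45']
print(chunks)     # ['Order', '66', 'shipped', 'to', 'user_9', 'at', '10:45!']
print(lowercase)  # ['rder', 'shipped', 'to', 'user', 'at']
```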

#### Python's re module
* **`re`** module
* **`split`**: split a string on a regex
* **`findall`**: find all patterns in a string
* **`search`**: search for a pattern anywhere in a string
* **`match`**: match a pattern at the beginning of a string
* The `re` functions shown here always take the pattern first and the string second
* Depending on the function, it may return a list, an iterator, a new string, or a match object
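
To make the different return types concrete, here is a small sketch that calls each of the functions listed above on one string. The sample sentence and variable names are illustrative. `re.finditer` and `re.sub` are not covered in these notes; they are standard `re` functions included here only as examples of a function returning an iterator and a function returning a new string.

```python
import re

sentence = "Let's write RegEx! Won't that be fun?"

parts = re.split(r"[.?!]", sentence)        # list of strings
caps = re.findall(r"[A-Z]\w+", sentence)    # list of strings
first_fun = re.search(r"fun", sentence)     # match object (or None)
at_start = re.match(r"fun", sentence)       # None: "fun" is not at the start
word_iter = re.finditer(r"\w+", sentence)   # iterator of match objects
first_word = next(word_iter).group()        # pull the first match from the iterator
collapsed = re.sub(r"\s+", "_", sentence)   # new string with whitespace replaced

print(parts)       # ["Let's write RegEx", " Won't that be fun", '']
print(caps)        # ['Let', 'RegEx', 'Won']
print(first_fun.group(), at_start, first_word)
print(collapsed)
```

Note the trailing empty string in `parts`: splitting on a separator that ends the string leaves an empty piece after it.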

In [4]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

* This can be used for tokenization, so you can process text using regex while doing NLP

#### Exercises: Practicing regular expressions: re.split() and re.findall()

```
# Assumes `re` is imported and that my_string is supplied by the exercise environment
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))
```

### Introduction to tokenization

* **Tokenization** is the process of transforming a string or document into smaller chunks, which we call tokens.
* One step in the process of preparing a text for NLP
* Many different theories and rules regarding tokenization
    * You can also create your own tokenization rules using regular expressions
* Some examples:
    * Breaking out words or sentences
    * Separating punctuation
    * Separating all hashtags in a tweet
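
The examples above can be sketched with plain regular expressions. This is a minimal illustration that uses only the `re` module; the sample strings, variable names, and the specific patterns are assumptions chosen for this example, and real tokenizers handle many more edge cases.

```python
import re

tweet = "Loving the #NLP course so far! #python #regex"

# Separating all hashtags in a tweet
hashtags = re.findall(r"#\w+", tweet)

# Breaking out words and separating punctuation:
# either an optional '#' followed by word characters, or one non-word, non-space character
tokens = re.findall(r"#?\w+|[^\w\s]", tweet)

# Breaking out sentences: split on whitespace that follows ., ? or !
text = "Hello there. How are you? Fine!"
sentences = re.split(r"(?<=[.?!])\s+", text)

print(hashtags)   # ['#NLP', '#python', '#regex']
print(tokens)     # ['Loving', 'the', '#NLP', 'course', 'so', 'far', '!', '#python', '#regex']
print(sentences)  # ['Hello there.', 'How are you?', 'Fine!']
```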

#### nltk library
* One library that is commonly used for simple tokenization is `nltk`, the **Natural Language Toolkit** library

In [None]:
#nltk.download('punkt')

In [5]:
word_tokenize("Hi there!")

['Hi', 'there', '!']

#### Why tokenize?
* Tokenizing can help us with some simple text processing tasks like:
    * Mapping parts of speech
    * Matching common words
    * Removing unwanted tokens
    
#### Other nltk tokenizers
* **`sent_tokenize`:** tokenize a document into sentences
* **`regexp_tokenize`:** tokenize a string or document based on a regular expression pattern
* **`TweetTokenizer`:** special class just for tweet tokenization, allowing you to separate hashtags, mentions, and lots of exclamation points

#### More regex practice
* Difference between `re.search()` and `re.match()`:
    * When the pattern occurs at the beginning of the string, `search` and `match` called with the same pattern and string find identical matches.
    
<img src='data/search_vs_match.png' width="600" height="300" align="center"/>    

* **Note that `match` only tries to match at the beginning of the string, while `search` scans the ENTIRE string for the first location where the pattern matches.**
* So, if you need to find a pattern that might not be at the beginning of the string, you should use `search`
* If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use `match`
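
The difference can be checked with a short sketch. The string and variable names below are illustrative. `re.fullmatch` is not covered in these notes; it is a standard `re` function shown here as one way to constrain the composition of the entire string.

```python
import re

text = "abcde"

# Pattern at the beginning of the string: both functions find the same match
match_start = re.match(r"abc", text)
search_start = re.search(r"abc", text)

# Pattern in the middle of the string: only search finds it
match_middle = re.match(r"cd", text)
search_middle = re.search(r"cd", text)

# fullmatch succeeds only if the whole string fits the pattern
whole = re.fullmatch(r"[a-e]+", text)
partial = re.fullmatch(r"abc", text)

print(match_start.span(), search_start.span())  # (0, 3) (0, 3)
print(match_middle)                             # None
print(search_middle.span())                     # (2, 4)
print(whole.group(), partial)                   # abcde None
```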

#### Exercises: Word tokenization with NLTK

```
# Import necessary modules (scene_one is supplied by the exercise environment)
import re
from nltk.tokenize import sent_tokenize, word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
```


### Advanced tokenization with regex

#### Regex groups using or "|"
* OR is represented using **`|`**
* Define a group using **`()`**
    * Groups can be either a pattern or set of characters you want to match
* You can also define explicit character classes using **`[]`**
* Example: we want to find all digits and words using tokenization:

In [7]:
# import re
match_digits_and_words = r'(\d+|\w+)'

In [8]:
re.findall(match_digits_and_words, "He has 11 cats.")

['He', 'has', '11', 'cats']

* Pseudo script: "find all" digits and/or words.
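
One detail worth knowing when combining groups with `re.findall`: if the pattern contains a single capturing group, `findall` returns the text captured by the group rather than the whole match. The sketch below illustrates this with made-up strings and names. The non-capturing form `(?:...)` is not covered in these notes; it is standard regex syntax shown as one way to group alternatives without changing what `findall` returns.

```python
import re

# One capturing group around the whole pattern: results equal the full matches
tokens = re.findall(r"(\d+|\w+)", "He has 11 cats.")

# The group covers only part of the pattern: findall returns just the group
animals = re.findall(r"(cat|dog)s?", "cats and dogs")

# A non-capturing group keeps the alternation but returns the full match
plurals = re.findall(r"(?:cat|dog)s?", "cats and dogs")

print(tokens)   # ['He', 'has', '11', 'cats']
print(animals)  # ['cat', 'dog']
print(plurals)  # ['cats', 'dogs']
```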

#### Regex ranges and groups

<img src='data/regex_ranges.png' width="600" height="300" align="center"/>

* **Note** that:
    * **ranges** (character classes) are marked with **`[]`**
    * **groups** are marked with **`()`**
* We can see in the chart above that we can use square brackets to define a new character class
* Note in the third row of the chart that the hyphen and period are escaped. Both can have special meanings in regex (inside square brackets a hyphen between two characters marks a range, and outside them a period matches any character), so we must tell regex we mean an ACTUAL period or hyphen
    * To do so we use what is called an **escape character**: in regex that means placing a backslash in front of the character, so regex knows to look for a literal hyphen or period.
* On the other hand, with groups, which are designated by parentheses, we can only match what we explicitly define in the group
    * For example, see row four in the chart above; this regex only matches the literal sequence of 3 characters `a`, `-`, `z` (and *not* "all the lowercase letters between a and z").
    * **Groups are useful when you want to define an explicit pattern or set of alternatives.**
* Final example: spaces or a comma.
* In the code example below, we use `match` with a character range to match lowercase ASCII letters, digits, and spaces:

In [10]:
# import re
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<re.Match object; span=(0, 35), match='match lowercase spaces nums like 12'>

* The **`+`** after the range definition is a **greedy** quantifier: it matches as many characters as it can, but once it hits the comma, it can't match any more.
* This short example demonstrates that thinking about what regex method you use (such as `search` versus `match`) and whether you define a *group* or a *range* can have a large impact on the usefulness and readability of your patterns.
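
To see the range-versus-group distinction in practice, here is a small sketch. The sample strings and variable names are illustrative assumptions; `my_str` repeats the string from the cell above.

```python
import re

sample = "a-z and abc"

# Range: any lowercase letter between a and z, one or more times
range_hits = re.findall(r"[a-z]+", sample)
# Group: only the literal three-character sequence a, -, z
group_hits = re.findall(r"(a-z)", sample)
# Range with an escaped hyphen: lowercase letters or a literal hyphen
escaped_hits = re.findall(r"[a-z\-]+", sample)
# Group with alternatives: a run of whitespace or a comma
separators = re.findall(r"(\s+|,)", "one, two")

# Method choice matters too: match anchors at the start, search does not
my_str = 'match lowercase spaces nums like 12, but no commas'
digits_match = re.match(r"[0-9]+", my_str)
digits_search = re.search(r"[0-9]+", my_str)

print(range_hits)    # ['a', 'z', 'and', 'abc']
print(group_hits)    # ['a-z']
print(escaped_hits)  # ['a-z', 'and', 'abc']
print(separators)    # [',', ' ']
print(digits_match, digits_search.group())  # None 12
```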

<img src='data/course_datasets.png' width="600" height="300" align="center"/>