# Introduction to Natural Language Processing in Python
In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

## $\star$ Chapter 1: Regular expressions & word tokenization
This chapter introduces some basic NLP concepts, such as word tokenization and regular expressions, to help parse text. You'll also learn how to handle non-English text and the more difficult tokenization cases you might encounter.

#### What is Natural Language Processing?
* Massive field of study focused on making sense of language using statistics and computers
* Some of the basics of NLP:
    * Topic identification
    * Text classification
* NLP applications include:
    * topic identification
    * chatbots
    * text classification
    * translation
    * sentiment analysis
    * ... many, many more!
    
#### Regular Expressions
* **Regular expressions** are strings with a special syntax that lets you describe patterns and find matching text in other strings
* A **pattern** is a sequence of letters or symbols that can match actual text, such as words or punctuation
* Applications of regular expression:
    * Find links in a webpage or document
    * Parse email addresses
    * Remove unwanted strings or characters
* Regular expressions are often referred to as **regex** and can be used easily in Python via the `re` module
* Match a pattern at the beginning of a string using the `re.match()` function:

In [1]:
import re

In [2]:
re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

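* `re.match()` takes the pattern as the first argument and the string as the second argument, and returns a **match object** (or `None` if the pattern does not match at the start of the string)
* We can also use "special" patterns that regex understands, such as `\w+`, which matches a run of one or more word characters (letters, digits, or underscores):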

In [3]:
word_regex = r'\w+'  # raw string so the backslash reaches the regex engine unchanged
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>
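The applications listed earlier (finding links, parsing email addresses, removing unwanted characters) can be sketched roughly as follows. The sample text is made up, and the patterns are deliberately simplified assumptions; real-world URLs and email addresses need more careful patterns than these.

```python
import re

# Made-up sample text for illustration only
text = "Contact us at info@example.com or visit https://example.com/docs today!!!"

# Find links: 'http://' or 'https://' followed by non-whitespace characters
links = re.findall(r"https?://\S+", text)

# Parse email addresses with a simplified pattern (not a complete validator)
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# Remove unwanted characters: drop every run of exclamation marks
cleaned = re.sub(r"!+", "", text)

print(links)
print(emails)
print(cleaned)
```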

* There are hundreds of characters and patterns you can learn and memorize with regular expressions, but here we get started with a few common patterns:

<img src='data/common_regex.png' width="300" height="150" align="center"/>
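Because the pattern table above is an image, here is a small sketch that tries a few widely used patterns on a made-up string. The choice of patterns is an assumption about what is most useful to start with, not a transcription of the table.

```python
import re

sample = "User42 paid 19 dollars"  # made-up sample string

words = re.findall(r"\w+", sample)        # runs of word characters
digits = re.findall(r"\d+", sample)       # runs of digits
pieces = re.split(r"\s+", sample)         # split on runs of whitespace
lowercase = re.findall(r"[a-z]+", sample) # runs of lowercase letters only
everything = re.match(r".*", sample).group()  # wildcard: any characters, greedy

print(words)       # ['User42', 'paid', '19', 'dollars']
print(digits)      # ['42', '19']
print(pieces)      # ['User42', 'paid', '19', 'dollars']
print(lowercase)   # ['ser', 'paid', 'dollars']
print(everything)  # User42 paid 19 dollars
```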

#### Python's re module
* **`re`** module
* **`split`**: split a string on a regex
* **`findall`**: find all patterns in a string
* **`search`**: search for the first occurrence of a pattern anywhere in a string
* **`match`**: match a pattern at the beginning of a string
* The `re` functions always take the pattern first and the string second
* Depending on the function, the result may be a list, an iterator, a new string, or a match object
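A minimal sketch of how these functions differ, using the same `'hi there!'` string as the earlier cell. The variable names are only illustrative.

```python
import re

text = "hi there!"

# match() only succeeds when the pattern matches at the start of the string
no_match = re.match(r"there", text)

# search() scans the whole string and returns the first match it finds
found = re.search(r"there", text)

# findall() and split() both return lists of strings
all_words = re.findall(r"\w+", text)
parts = re.split(r"\s+", text)

# sub() returns a new string; finditer() returns an iterator of match objects
replaced = re.sub(r"\s+", "_", text)
spans = [m.span() for m in re.finditer(r"\w+", text)]

print(no_match)                      # None
print(found.span(), found.group())   # (3, 8) there
print(all_words)                     # ['hi', 'there']
print(parts)                         # ['hi', 'there!']
print(replaced)                      # hi_there!
print(spans)                         # [(0, 2), (3, 8)]
```

Checking the result of `match()` or `search()` against `None` before calling `.group()` avoids an `AttributeError` when nothing matches.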

In [4]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

* This can be used for tokenization, so you can process text using regex while doing NLP
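As a rough sketch of regex-based tokenization, the example below compares three simple strategies on a made-up sentence. The third pattern, which keeps punctuation as separate tokens, is one possible choice rather than a standard.

```python
import re

sentence = "Split on spaces, or grab the words!"  # made-up sample sentence

# Splitting on whitespace leaves punctuation attached to the words
whitespace_tokens = re.split(r"\s+", sentence)

# Finding runs of word characters drops punctuation entirely
word_tokens = re.findall(r"\w+", sentence)

# Matching either a word or a single non-space, non-word character
# keeps punctuation as its own token
word_and_punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(word_tokens)
print(word_and_punct_tokens)
```

Which strategy fits depends on whether punctuation carries meaning for the task at hand.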

#### Exercises: Practicing regular expressions: re.split() and re.findall()

```
import re

# my_string is assumed to be predefined by the exercise environment

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
# (the pattern needs at least two characters, so a lone "I" is skipped)
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))
```
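To run the exercise outside the course environment, the sketch below uses a hypothetical stand-in for `my_string` (the real exercise string is not shown here). It also shows one way to tidy up the empty and space-padded pieces that `re.split()` leaves behind.

```python
import re

# Hypothetical stand-in for the exercise's my_string
my_string = "Let's practice regex! Is it fun? There are 3 sentences and 13 tokens."

# Splitting on sentence endings leaves leading spaces and a trailing empty string
raw_sentences = re.split(r"[.?!]", my_string)

# Strip whitespace and drop empty pieces
sentences = [s.strip() for s in raw_sentences if s.strip()]

capitalized = re.findall(r"[A-Z]\w+", my_string)
tokens = re.split(r"\s+", my_string)
numbers = re.findall(r"\d+", my_string)

print(raw_sentences)
print(sentences)
print(capitalized)   # ['Let', 'Is', 'There']
print(len(tokens))   # 13
print(numbers)       # ['3', '13']
```

Note that `[A-Z]\w+` returns `'Let'` rather than `"Let's"`, because the apostrophe is not a word character.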

### Introduction to tokenization

<img src='data/course_datasets.png' width="600" height="300" align="center"/>