This notebook follows the structure of Chapter 2 of the book "*An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition*".

### Overview of this notebook

In this notebook, we introduce the basic techniques for preprocessing text, also called **text normalization**. In general, text normalization is a set of procedures that convert text to a more convenient, standard form. We cover **tokenization**, **lemmatization**, and **stemming**, and we also introduce a fundamental tool in language processing: the **regular expression**.

**Tokenization** is the task of separating text into words. Most languages written in the Latin alphabet can be split on white space, but sometimes the text must be treated differently; for example, we often treat "New York" as a single word.
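As a minimal sketch of the difference, the cell below compares plain white-space splitting with a tokenizer that keeps a few multi-word names together. The sentence and the `multiword_expressions` list are made up for illustration, and the pattern uses regular-expression syntax that is explained later in this notebook.

```python
import re

sentence = "She moved from New York to San Francisco last year"

# Naive tokenization: split on runs of white space.
naive_tokens = sentence.split()

# Hypothetical list of multi-word expressions to keep as single tokens.
multiword_expressions = ["New York", "San Francisco"]

# Try the multi-word expressions first, then fall back to any run of
# non-space characters.
pattern = "|".join(re.escape(m) for m in multiword_expressions) + r"|\S+"
merged_tokens = re.findall(pattern, sentence)

print(naive_tokens)
print(merged_tokens)
```

With white-space splitting, "New" and "York" come out as two tokens; with the second pattern they stay together as "New York".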

**Lemmatization** and **stemming** are closely related. Lemmatization maps the different forms of a word to its root; for example, sang, sung, and sing are all mapped to the verb sing. Stemming is a simpler version of lemmatization that only strips suffixes from the ends of words.
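To make the contrast concrete, here is a toy sketch. The lemma table and the suffix list are hand-written assumptions for this example only; real lemmatizers and stemmers use much richer rules and dictionaries.

```python
# Toy lemma table (hypothetical, hand-written for this example only).
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing", "singing": "sing"}

# Toy suffix list (hypothetical); suffixes are tried in this order.
SUFFIXES = ("ing", "es", "s")


def toy_lemmatize(word):
    """Look the word up in the table; return it unchanged if unknown."""
    return LEMMAS.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["sang", "sung", "sings", "singing"]
lemmas = [toy_lemmatize(w) for w in words]
stems = [toy_stem(w) for w in words]

print(lemmas)
print(stems)
```

The toy stemmer reduces "sings" and "singing" to "sing" but leaves the irregular forms "sang" and "sung" untouched, which is the kind of case a lemmatizer handles.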

We divide each topic into several subsections, and each subsection gives an example after the definition.

Let's dive in.

### Regular expression

A regular expression is a useful tool for text normalization: it is a language for specifying the strings to be searched for. Many string processing functions in Python support regular expressions. Here we use *re.findall* to show how regular expressions work; at the end of this section, we will introduce more functions that accept regular expressions.

In [3]:
import re  # the Python library for regular expressions
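For orientation, the sketch below contrasts two functions from the `re` module, `re.search` and `re.findall`, on a made-up sentence. It is only meant to show what each one returns.

```python
import re

text = "one fish, two fish"

# re.search returns a match object for the first match, or None if there is none.
first = re.search(r"fish", text)
print(first.group(), first.start())  # the matched text and where it starts

# re.findall returns a list with every non-overlapping match.
all_matches = re.findall(r"fish", text)
print(all_matches)

# No match: search gives None, findall gives an empty list.
no_search = re.search(r"bird", text)
no_findall = re.findall(r"bird", text)
print(no_search, no_findall)
```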

#### 1: Basic Regular Expression Patterns
The simplest regular expression matches a sequence of simple characters. For example, suppose we have the following text:

In [52]:
Text = "Though hundreds of thousands and Thousands had done their very best to disfigure the small piece of land on which they were crowded together"

Suppose we want to match the word "thousands". We simply use the pattern /thousands/. The *findall* function checks for a match anywhere in the text and returns all of the matched strings. Note that matching is case sensitive, so "Thousands" is not returned.

In [53]:
matchObj = re.findall(r'thousands',Text)

In [54]:
matchObj

['thousands']

**Basic Regular 1.1**: **[ ]**, **^**, and **-**

Regular expressions may not seem useful at this point, but suppose we now want to match both 'thousands' and 'Thousands'. This is where "[ ]" comes into play: the regular expression matches any one of the characters inside the square brackets. The pattern [Tt]housands therefore matches both **thousands** and **Thousands**.

In [60]:
matchObj = re.findall(r'[Tt]housands',Text)

In [61]:
matchObj

['thousands', 'Thousands']

Things become interesting once we have the *[ ]* tool: we can match any single digit using [1234567890], and any capital letter using [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. Let's see an example.

In [67]:
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

In [63]:
matchObj = re.findall(r'[1234567890]',Text)

In [64]:
matchObj

['2', '3']

In [69]:
matchObj = re.findall(r'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]',Text)

In [70]:
matchObj

['A', 'B', 'M']

Now let's continue to explore some other tools that come along with *[ ]*; they help express richer structure. The brackets can be used with the dash (-) to specify any one character in a range. For example, the pattern *[1-9]* specifies any one of the characters from 1 to 9 (it excludes 0, unlike *[0-9]*), and the pattern *[A-Z]* is equivalent to the expression we used above, *[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*. Another tool that comes along with *[ ]* is the caret *^*, which negates the set when it is the first symbol after the opening square bracket. For example, *[^a-z]* matches any single character except the lowercase letters a through z. Let's see some examples:

In [11]:
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

In [15]:
matchObj = re.findall(r'[1-9]',Text)

In [14]:
matchObj

['2', '3']

In [16]:
matchObj = re.findall(r'[^a-z]',Text)

In [17]:
matchObj

['A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 '2',
 ' ',
 'B',
 ' ',
 ' ',
 '3',
 ' ',
 'M',
 ' ',
 ' ',
 ' ']
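Ranges and negation can also be combined inside one pair of brackets. The cell below is a small sketch on the same sentence; the variable names are arbitrary.

```python
import re

text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

# Several ranges can share one pair of brackets: any letter, upper or lower case.
letters = re.findall(r"[A-Za-z]", text)

# Negated class: anything that is neither a letter nor a space.
others = re.findall(r"[^A-Za-z ]", text)
print(others)

# A caret that is not the first character inside the brackets is a literal caret.
carets = re.findall(r"[a^]", "a^b")
print(carets)
```

Excluding the space as well makes the negated result much easier to read than the long list above: only the comma and the two digits remain.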

**Basic Regular 1.2**: **?**, **\***, **+**, and **.**

The question mark *?* is another common symbol; it means "the preceding character or nothing". For example, *falls?* matches both "fall" and "falls". See the example below.

In [1]:
Text = "An apple falls from the tree while children fall from another tree"

In [4]:
matchObj = re.findall(r'falls?',Text)

In [5]:
matchObj

['falls', 'fall']
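The same idea works for an optional character in the middle of a word. A short sketch with made-up sentences:

```python
import re

text = "The colour of the sky and the color of the sea"

# The u is optional, so both spellings match.
spellings = re.findall(r"colou?r", text)
print(spellings)

# ? applies only to the single preceding character: here only the final s is optional.
forms = re.findall(r"trees?", "one tree, two trees")
print(forms)
```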

While the question mark *?* matches "zero or one instance of the previous character", the asterisk \* matches "zero or more occurrences of the immediately previous character". For example, "ab\*" matches all of the following: "a", "ab", "abb", "abbbb". Interestingly, we can use [0-9][0-9]\* to match any integer. Another symbol, *+*, works almost the same way as \*, with one difference: it matches "**one** or more occurrences of the immediately previous character".

In [11]:
Text = "Some dummy words； ab, abb, abbbb, abbbb, a"
matchObj = re.findall(r'ab*',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb', 'a']

If we use "ab+" instead of "ab\*", it matches "ab", "abb", "abbb...", but not "a".

In [13]:
matchObj = re.findall(r'ab+',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb']

In [14]:
Text = "There are 57 apples and 2 birds on the tree"
matchObj = re.findall(r'[0-9][0-9]*',Text)
matchObj

['57', '2']
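Since *+* means "one or more", the integer pattern can be written more compactly. The sketch below checks that both forms give the same result on this sentence, and shows why a bare `[0-9]*` is rarely what we want.

```python
import re

text = "There are 57 apples and 2 birds on the tree"

# [0-9]+ requires at least one digit, so it is equivalent to [0-9][0-9]*.
with_star = re.findall(r"[0-9][0-9]*", text)
with_plus = re.findall(r"[0-9]+", text)
print(with_star, with_plus)

# On its own, [0-9]* can match zero digits, so the result also
# contains empty strings.
star_only = re.findall(r"[0-9]*", "a7")
print(star_only)
```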

The period **.** is another very important expression; it matches any single character (by default, any character except a newline).

In [17]:
Text = "began is the past tense of begin"
matchObj = re.findall(r'beg.n',Text)
matchObj

['began', 'begin']
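A slightly larger sketch shows that the period really accepts any character, including a space or a digit, but always exactly one. The test strings are made up; `re.DOTALL` is the standard flag that lets the period match a newline as well.

```python
import re

# . matches a letter, a digit, or even a space, but it always
# consumes exactly one character, so "begn" does not match.
text = "began begin begun beg n beg3n begn"
matches = re.findall(r"beg.n", text)
print(matches)

# By default . does not match a newline; the re.DOTALL flag changes that.
default = re.findall(r"beg.n", "beg\nn")
dotall = re.findall(r"beg.n", "beg\nn", flags=re.DOTALL)
print(default, dotall)
```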

**Basic Regular 1.3:** Anchors

Anchors are special characters that anchor regular expressions to particular places in a string. The most commonly used anchors are the caret *^* and the dollar sign *$*. The caret *^* matches the start of a line, so *^began* matches the word "began" only at the start of a line. Similarly, \$ matches the end of a line. In Python, both anchors refer to the start and end of the whole string unless the re.MULTILINE flag is set.

In [29]:
Text = "end1 of a period of life often indicates the beginning of another journey, so end2ing is not always the end3"
matchObj = re.findall(r'^end.',Text)
matchObj

['end1']

In [31]:
matchObj = re.findall(r'end.',Text)
matchObj

['end1', 'end2', 'end3']

In [32]:
matchObj = re.findall(r'end.$',Text)
matchObj

['end3']
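The examples above use a single-line string. For text with several lines, Python's `re.MULTILINE` flag makes the anchors work line by line; the two-line string below is made up to sketch the difference.

```python
import re

text = "end of line one\nend of line two"

# By default ^ matches only at the very start of the string.
default_start = re.findall(r"^end", text)

# With re.MULTILINE, ^ also matches right after each newline.
each_line_start = re.findall(r"^end", text, flags=re.MULTILINE)
print(default_start, each_line_start)

# Likewise, $ matches only at the end of the string by default,
# and at the end of every line with re.MULTILINE.
default_end = re.findall(r"line ...$", text)
each_line_end = re.findall(r"line ...$", text, flags=re.MULTILINE)
print(default_end, each_line_end)
```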

#### 2: Disjunction, Grouping and Precedence

**2.1:** Disjunction

The disjunction operator, also called the pipe symbol "|", is similar to "or" in natural language. Suppose we want to search for "bird or monkeys" in a sentence; how can we do this? One might think of using *[ ]*, but brackets only choose among single characters, whereas the pipe chooses between whole sequences.

In [57]:
Text = "Here are a bird and some monkeys on the tree"
matchObj = re.findall(r'bird|monkeys',Text)
matchObj

['bird', 'monkeys']
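To see why brackets are the wrong tool here, the sketch below puts the same words inside *[ ]*. Inside brackets every symbol, including the pipe, is just one more character in the set.

```python
import re

text = "Here are a bird and some monkeys on the tree"

# Brackets define a set of single characters, so this matches
# one character at a time, wherever any of those letters occurs.
char_class = re.findall(r"[bird|monkeys]", text)
print(char_class[:5])

# The pipe chooses between whole sequences.
disjunction = re.findall(r"bird|monkeys", text)
print(disjunction)
```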

**2.2:** Precedence and Grouping

Parentheses "()" are used for both precedence and grouping. Suppose you want to match both "study" and "studies". We may want to use the disjunction operator, but "study|ies" does not do this: it matches either "study" or "ies". This is because, by default, sequences like "study" and "ies" take precedence over the disjunction operator "|". To solve this problem, we can use "()" to make the disjunction apply only to the suffix. So "stud(y|ies)" does what we want.

In general, adding parentheses around part of a pattern groups it into one unit; remember that grouping works by changing the precedence.
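Here is a short sketch of precedence in practice, on a made-up sentence. One Python-specific detail: when a pattern contains a capturing group, `re.findall` returns the text of the group rather than the whole match, so the example uses the non-capturing form `(?:...)` to get whole words back.

```python
import re

text = "I study here, she studies there, and we like cookies"

# Without parentheses the alternatives are the whole sequences "study" and "ies".
no_group = re.findall(r"study|ies", text)
print(no_group)

# Parentheses limit the disjunction to the suffix. (?:...) groups
# without capturing, so findall returns the whole match.
grouped = re.findall(r"stud(?:y|ies)", text)
print(grouped)

# With a capturing group, findall returns only the group's text.
captured = re.findall(r"stud(y|ies)", text)
print(captured)
```

Note that the ungrouped pattern also picks up the "ies" at the end of "cookies", which is a good sign that it is not matching what we intended.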