This notebook follows the structure of Chapter 2 of the book "*An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition*".

### Overview of this notebook

In this notebook, we introduce the basic techniques for preprocessing, or **text normalization**. In general, text normalization is a set of procedures that convert text to a more convenient, standard form. Here we cover **tokenization**, **lemmatization**, and **stemming**, and we also introduce a fundamental tool in language processing: the **regular expression**.

**Tokenization** is the task of separating text into words. Most languages written in the Latin alphabet can be split on whitespace, but sometimes a different treatment is required; for example, we often treat "New York" as a single word.
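As a rough illustration of the point about "New York", the sketch below compares plain whitespace splitting with a regular-expression tokenizer that keeps a listed multiword expression together. The sentence and the `MULTIWORD` list are made up for this example, and the pattern syntax is explained in the regular expression section of this notebook.

```python
import re

sentence = "She moved from New York to Boston last year"

# Naive tokenization: split on runs of whitespace.
whitespace_tokens = sentence.split()

# Hypothetical list of multiword expressions to keep as single tokens.
MULTIWORD = ["New York"]

# Try the multiword expressions first, then fall back to any run of
# non-whitespace characters (\S+).
pattern = "|".join(re.escape(m) for m in MULTIWORD) + r"|\S+"
regex_tokens = re.findall(pattern, sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "New" and "York" end up as two tokens; with the second approach they stay together. A real tokenizer would need a much larger list or a statistical model.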

**Lemmatization** and **stemming** are closely related. Lemmatization maps all the different forms of a word to its root; for example, sang, sung, and sing are all mapped to the verb sing. Stemming is a simpler version of lemmatization that only strips suffixes from the ends of words.
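To make the difference concrete, here is a toy sketch. Both the lookup table `LEMMAS` and the suffix list `SUFFIXES` are invented for illustration; real lemmatizers and stemmers use far richer dictionaries and rules.

```python
# Hypothetical, tiny lookup table standing in for a real lemma dictionary.
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing", "singing": "sing"}

# Hypothetical suffix list; real stemmers apply more careful rules.
SUFFIXES = ("ing", "s")


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


words = ["sang", "sung", "sings", "singing"]
lemmas = [toy_lemmatize(w) for w in words]
stems = [toy_stem(w) for w in words]

print(lemmas)
print(stems)
```

The lemmatizer maps all four forms to "sing", while the suffix stripper leaves the irregular forms "sang" and "sung" unchanged.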

We divide each topic into several subsections, and each subsection gives an example after the definition.

Let's dive in.

### Regular expression

A regular expression is a useful tool for text normalization: it is a language for specifying the strings to search for. Many string-processing functions in Python support regular expressions. Here we use *re.findall* as the running example to show how regular expressions work; other functions in the *re* module, such as *re.search*, accept the same patterns.

In [1]:
import re  # the library for using regular expressions in Python
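As a quick orientation, the sketch below contrasts *re.search* with *re.findall* on a made-up sample string: the former returns a match object for the first match (or `None`), the latter a list of all matches.

```python
import re

sample = "Though hundreds of thousands had done their very best"

# re.search returns a match object for the first match, or None.
match = re.search(r"thousands", sample)
if match is not None:
    print(match.group())  # the matched text
    print(match.span())   # (start, end) offsets into the string

# When nothing matches, re.search returns None.
no_match = re.search(r"millions", sample)

# re.findall returns a list of all non-overlapping matches.
all_matches = re.findall(r"thousands", sample)
print(all_matches)
```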

#### 1: Basic Regular Expression Patterns
The simplest regular expression matches a sequence of simple characters. For example, suppose we have the following text:

In [2]:
Text = "Though hundreds of thousands and Thousands had done their very best to disfigure the small piece of land on which they were crowded together"

Suppose we want to match the word "thousands". We simply use the pattern /thousands/. The *findall* function scans the whole text and returns all the matched strings.

In [3]:
matchObj = re.findall(r'thousands',Text)

In [4]:
matchObj

['thousands']

**Basic Regular 1.1**: **[ ]**, **^**, and **-**

Regular expressions may not seem useful at this point, but suppose we now want to match both 'thousands' and 'Thousands'. This is where "[]" comes into play: the regular expression matches any one of the characters inside the brackets. So [Tt]housands matches both **thousands** and **Thousands**.

In [5]:
matchObj = re.findall(r'[Tt]housands',Text)

In [6]:
matchObj

['thousands', 'Thousands']

Things become interesting once we have the *[ ]* tool: we can match any single digit using [1234567890], and use [ABCDEFGHIJKLMNOPQRSTUVWXYZ] to match any capital letter. Let's see an example.

In [7]:
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

In [8]:
matchObj = re.findall(r'[1234567890]',Text)

In [9]:
matchObj

['2', '3']

In [10]:
matchObj = re.findall(r'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]',Text)

In [11]:
matchObj

['A', 'B', 'M']

Now let's continue to explore some other tools that come along with *[ ]*; they help express richer structure. The brackets can be used with the dash (-) to specify any one character in a range. For example, the pattern *[1-9]* specifies any one of the characters from 1 to 9, and the pattern *[A-Z]* is equivalent to the expression we used above, *[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*. Another tool that comes with *[ ]* is the caret *^*, which negates the set when it is the first symbol after the opening square bracket. For example, *[^a-z]* matches any single character except the lowercase letters a, b, c, ..., z. Let's see some examples:

In [12]:
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

In [13]:
matchObj = re.findall(r'[1-9]',Text)

In [14]:
matchObj

['2', '3']

In [15]:
matchObj = re.findall(r'[^a-z]',Text)

In [16]:
matchObj

['A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 '2',
 ' ',
 'B',
 ' ',
 ' ',
 '3',
 ' ',
 'M',
 ' ',
 ' ',
 ' ']
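Two further details of bracket expressions can be checked with a small sketch (the sample strings are made up): several ranges can be combined inside one pair of brackets, and the caret only negates when it comes first.

```python
import re

# Several ranges can be combined inside one pair of brackets.
letters_or_digits = re.findall(r"[A-Za-z0-9]", "a-Z 9!")

# A caret directly after "[" negates the set ...
non_digits = re.findall(r"[^0-9]", "4a2^")

# ... but anywhere else inside the brackets it is a literal caret.
digit_or_caret = re.findall(r"[0-9^]", "4a2^")

print(letters_or_digits)
print(non_digits)
print(digit_or_caret)
```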

**Basic Regular 1.2**: **?**, **\***, **+**, and **.**

*?* is another common symbol. It means "the preceding character or nothing". For example, *falls?* matches both "fall" and "falls". See the example below.

In [17]:
Text = "An apple falls from the tree while children fall from another tree"

In [18]:
matchObj = re.findall(r'falls?',Text)

In [19]:
matchObj

['falls', 'fall']

While the question mark *?* matches "zero or one instance of the previous character", the asterisk \* matches "zero or more occurrences of the immediately previous character". For example, "ab\*" matches all of the following: "a", "ab", "abb", "abbbb". We can therefore use [0-9][0-9]\* to match any sequence of one or more digits, that is, any unsigned integer. The symbol *+* does almost the same as \*, with one difference: it matches "**one** or more occurrences of the immediately previous character".

In [20]:
Text = "Some dummy words: ab, abb, abbbb, abbbb, a"
matchObj = re.findall(r'ab*',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb', 'a']

If we use "ab+" instead of "ab\*", it matches "ab", "abb", "abbb...", but not "a".

In [21]:
matchObj = re.findall(r'ab+',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb']

In [22]:
Text = "There are 57 apples and 2 birds on the tree"
matchObj = re.findall(r'[0-9][0-9]*',Text)
matchObj

['57', '2']

The period **.** is another very important expression; it matches any single character (by default, any character except a newline).

In [23]:
Text = "began is the past tense of begin"
matchObj = re.findall(r'beg.n',Text)
matchObj

['began', 'begin']
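Because the period matches almost anything, it is easy to match more than intended. The sketch below, on made-up strings, shows what the period accepts, that it does not cross a newline by default, and how to match a literal period by escaping it with a backslash.

```python
import re

# "." matches any single character, including spaces and digits ...
loose = re.findall(r"beg.n", "began begin beg n beg3n")

# ... but by default it does not match a newline.
across_newline = re.findall(r"beg.n", "beg\nn")

# To match a literal period, escape it with a backslash.
literal_dot = re.findall(r"3\.14", "3.14 and 3514")
unescaped = re.findall(r"3.14", "3.14 and 3514")

print(loose)
print(across_newline)
print(literal_dot)
print(unescaped)
```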

**Basic Regular 1.3:** Anchors

Anchors are special characters that anchor regular expressions to particular places in a string. The most commonly used anchors are the caret *^* and the dollar sign *$*. The caret *^* matches the start of a line, so *^began* only matches "began" at the start of a line. Similarly, \$ matches the end of a line. In Python, both anchors refer to the start and end of the whole string unless the *re.MULTILINE* flag is given.

In [24]:
Text = "end1 of a period of life often indicates the beginning of another journey, so end2ing is not always the end3"
matchObj = re.findall(r'^end.',Text)
matchObj

['end1']

In [25]:
matchObj = re.findall(r'end.',Text)
matchObj

['end1', 'end2', 'end3']

In [26]:
matchObj = re.findall(r'end.$',Text)
matchObj

['end3']
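The examples above use a single-line string. As a small sketch with a made-up two-line string, the following shows how the anchors behave with and without the `re.MULTILINE` flag.

```python
import re

two_lines = "end1 starts the first line\nend2 starts the second line"

# Default: "^" matches only at the very start of the string.
default_start = re.findall(r"^end.", two_lines)

# With re.MULTILINE, "^" also matches right after each newline.
multiline_start = re.findall(r"^end.", two_lines, flags=re.MULTILINE)

# The same applies to "$" at the end of the string versus each line.
default_end = re.findall(r"line$", two_lines)
multiline_end = re.findall(r"line$", two_lines, flags=re.MULTILINE)

print(default_start, multiline_start)
print(default_end, multiline_end)
```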

#### 2: Disjunction, Grouping and Precedence

**2.1:** Disjunction

The disjunction operator, also called the pipe symbol "|", is similar to "or" in natural language. Suppose we want to search for "bird" or "monkeys" in a sentence; how can we do this? One might think of using *[ ]*, but brackets only express a choice between single characters.

In [27]:
Text = "Here are a bird and some monkeys on the tree"
matchObj = re.findall(r'bird|monkeys',Text)
matchObj

['bird', 'monkeys']

**2.2:** Precedence and Grouping

The parenthesis symbol "()" is used for both precedence and grouping. Suppose you want to match both "study" and "studies". We may want to use the disjunction operator, but "study|ies" does not do this; instead, it matches "study" or "ies". This is because sequences like "study" and "ies" take precedence over the disjunction operator "|". To solve this problem, we can use "()" to specify the precedence we want: "stud(y|ies)" matches "stud" followed by either "y" or "ies".

In general, putting parentheses around part of a pattern groups it, but remember that the grouping works by changing the precedence.

In [29]:
Text = "Tom like to study history, while his brother studies maths in the college"
matchObj = re.findall(r'study|ies',Text)
matchObj

['study', 'ies']

Parentheses *()* have two meanings: one is to specify precedence, and the other is to capture, that is, to store the text matched inside the parentheses so it can be returned when needed.

To separate these two cases, *()* both groups and captures, while *(?:)*, called a non-capturing group, merely specifies precedence.

If the pattern contains a capturing group *()*, the findall function returns only the content of the group; here it returns only "y" and "ies". This is useful when we want to match a whole pattern but return only part of it.

In [50]:
matchObj = re.findall(r'stud(?:y|ies)',Text)
matchObj

['study', 'studies']

In [51]:
matchObj = re.findall(r'stud(y|ies)',Text)
matchObj

['y', 'ies']
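A case where capturing is actually what we want can be sketched as follows. The sample sentence is made up; the idea is to require the word "in" before a number but return only the number.

```python
import re

sample = "It was found in 1927 and re-dated in 2011."

# No group: findall returns the whole match.
whole = re.findall(r"in [0-9]+", sample)

# Capturing group: findall returns only what the parentheses captured.
years_only = re.findall(r"in ([0-9]+)", sample)

# re.search keeps both: group(0) is the whole match, group(1) the capture.
first = re.search(r"in ([0-9]+)", sample)

print(whole)
print(years_only)
print(first.group(0), first.group(1))
```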

#### 3: An example to show the use of Regular Expression

Here is an example showing how to construct a regular expression step by step so that it matches what we want. I extracted an article about the history of England. Suppose we want a list of the years that occur in the article, and we need to write regular expressions to do this. Here is what the passage looks like:

In [106]:
with open('History of England(used for regex tutorial).txt', 'rb') as myfile:
    data=myfile.read()

In [107]:
data = data.decode("utf-8")
data[:500]

'History of England\r\nFrom Wikipedia, the free encyclopedia\r\n\r\nEngland became inhabited more than 800,000 years ago, as the discovery of stone tools and footprints at Happisburgh in Norfolk has revealed.[1] The earliest evidence for early modern humans in North West Europe, a jawbone discovered in Devon at Kents Cavern in 1927, was re-dated in 2011 to between 41,000 and 44,000 years old.[2] Continuous human habitation in England dates to around 13,000 years ago (see Creswellian), at the end of the'

By looking at the text, we can see three ways of mentioning years. The first and most common is "in 2011", where 2011 can be any number. The second is "4000 BC/AD", where "4000" can be any number. The third is an expression like "1495–1497", which refers to a period. We write regular expressions to capture these three formats below.

It is straightforward to write the expressions using what we have learned in the previous sections.

In [108]:
matchObj = re.findall(r'[Ii]n [0-9]+',data)
matchObj[:10]

['in 1927',
 'in 2011',
 'In 1066',
 'in 1485',
 'in 1660',
 'in 1707',
 'In 55',
 'In 2003',
 'in 43',
 'in 60']

In [80]:
matchObj = re.findall(r'[0-9]+ (?:BC|AD)',data)
matchObj

['000 BC',
 '6500 BC',
 '9000 BC',
 '4000 BC',
 '3806 BC',
 '2500 BC',
 '800 BC',
 '400 BC',
 '400 BC',
 '325 BC',
 '150 BC',
 '300 BC',
 '100 BC',
 '54 BC',
 '43 AD',
 '54 AD',
 '60 AD',
 '138 AD',
 '700 AD']

But wait a minute: if you look at the results carefully, you may find something odd. What does "000 BC" mean? It turns out that some years are written in the format "9,000 BC", with a comma inside the number. So after "[0-9]+", which matches a plain integer, we add "(?:,[0-9]{3})\*" to match integers containing commas; "{3}" means exactly three occurrences of the preceding item.

In [87]:
matchObj = re.findall(r'[0-9]+(?:,[0-9]{3})* (?:BC|AD)',data)
matchObj

['9,000 BC',
 '6500 BC',
 '9000 BC',
 '4000 BC',
 '3806 BC',
 '2500 BC',
 '800 BC',
 '400 BC',
 '400 BC',
 '325 BC',
 '150 BC',
 '300 BC',
 '100 BC',
 '54 BC',
 '43 AD',
 '54 AD',
 '60 AD',
 '138 AD',
 '700 AD']
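The effect of the comma group can be reproduced without the article file. The sketch below uses a short made-up sentence (an assumption for illustration, not taken from the article) and compares the two patterns side by side.

```python
import re

# Hypothetical sample sentence, not taken from the article file.
sample = "Settled by 9,000 BC, farmed by 4000 BC, occupied in 43 AD."

# Plain digits only: the part before the comma is lost.
plain = re.findall(r"[0-9]+ (?:BC|AD)", sample)

# Digits, then zero or more groups of a comma plus exactly three digits.
with_commas = re.findall(r"[0-9]+(?:,[0-9]{3})* (?:BC|AD)", sample)

print(plain)
print(with_commas)
```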

In [104]:
matchObj = re.findall(r'\w+–\w+',data)
matchObj

['1135–1154',
 '1337–1453',
 '1649–1653',
 '1653–1659',
 '18th–19th',
 '3807–3806',
 '600–400',
 '150–100',
 '1139–1153',
 '1216–1272',
 '1272–1307',
 '1315–1317',
 '1327–1377',
 '1455–1485',
 '1470–1471',
 '1495–1497',
 '1516–1558',
 '1527–1598',
 '1558–1603',
 '1585–1604',
 '1585–1603',
 '1630–1660',
 '1642–1645',
 '18th–19th',
 '1832–1974',
 '1945–present',
 '1945–present',
 '1985–86']

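Here "\w" matches any alphanumeric character or underscore. For ASCII text this is equivalent to "[a-zA-Z0-9_]"; on Python 3 strings it also matches Unicode letters and digits unless the *re.ASCII* flag is used.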
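The output above also contains ranges such as "18th–19th" and "1945–present", which are not pairs of years. One possible refinement, sketched here on a made-up sample string, is to require digits on both sides of the dash. The same sketch shows the effect of the `re.ASCII` flag on "\w".

```python
import re

# Hypothetical sample, mimicking the kinds of ranges seen in the output above.
sample = "1135–1154, 18th–19th century, 1945–present, café–bar"

# \w+ on both sides matches any word-like range.
word_ranges = re.findall(r"\w+–\w+", sample)

# Restricting both sides to digits keeps only numeric ranges.
year_ranges = re.findall(r"[0-9]+–[0-9]+", sample)

# re.ASCII limits \w to [a-zA-Z0-9_], so "é" no longer counts as a word character.
ascii_ranges = re.findall(r"\w+–\w+", sample, flags=re.ASCII)

print(word_ranges)
print(year_ranges)
print(ascii_ranges)
```

Whether to keep entries like "1945–present" depends on the task; the digit-only pattern is just one choice.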

#### 4: Summary of regular expression

Constructing a regular expression can be a time-consuming and error-prone process, but regular expressions are extremely powerful: they usually execute quickly and can match almost any pattern. This has been only a brief introduction, and many things have not been covered. Still, it should prepare you to understand the regular expressions you encounter elsewhere. When you want to write one yourself, it is a good idea to search the internet for how other people have constructed similar expressions and take inspiration from them.
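Besides *re.findall*, a few other functions in the *re* module take the same patterns. The sketch below, on a made-up sentence, shows three that come up often in text normalization: `re.sub` for replacement, `re.split` for splitting, and `re.compile` for reusing a pattern.

```python
import re

sample = "There are 57 apples, 2 birds and 3 monkeys"

# re.sub replaces every match with the given string.
masked = re.sub(r"[0-9]+", "N", sample)

# re.split cuts the string wherever the pattern matches.
tokens = re.split(r"[ ,]+", sample)

# re.compile builds a reusable pattern object with the same methods.
number = re.compile(r"[0-9]+")
numbers = number.findall(sample)

print(masked)
print(tokens)
print(numbers)
```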

### Text normalization