This notebook follows the structure of Chapter 2 of the book "*An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition*".

**Practical Info: The hash symbol "#" in the code fields marks a comment: the text after it on the same line does not affect the code in any way. If there is a "#" in front of a line of code, that line is disabled.**

## Basic Techniques for Preprocessing (text normalization)

This notebook introduces the basic techniques for preprocessing, or **text normalization**. In general, text normalization is a set of procedures that convert text to a more convenient, standard form. This process includes **tokenization**, **lemmatization** and **stemming**. 



**Regular Expression:**

This is a fundamental tool for language processing. In this notebook we will show you how to use it to search your text for specific patterns, whether they are words, numbers or any combination of letters, special characters and numbers.

**Tokenization:** 

Tokenization refers to the task of separating the text into words. Most languages written in the Latin alphabet can be split on white space, but sometimes words need to be treated differently; for example, we often treat "New York" as a single word.

**Lemmatization** and **stemming** are closely related:

**Lemmatization:** maps all the different forms of a word to its root; for example, sang, sung, and sing will all be mapped to the verb sing. 

**Stemming:** is a simpler version of lemmatization; it only strips suffixes from the ends of words.


Each topic is divided into several sub-sections, and each section has an example after the definition.




## Content

### Regular Expression
1. Basic Regular Expression Patterns
    - 1.1 Basic Regular: **[ ]**, **^**, and **-** 
    - 1.2 Basic Regular: **?**, **\***, **+**, and **.**
    - 1.3 Basic Regular: Anchors
    - 1.4 Practice
2. Disjunction, Grouping and Precedence
    - 2.1 Disjunction
    - 2.2 Grouping and Precedence
3. An Example of Regular Expression in Use
4. Summary of Regular Expression

### Tokenization and Normalization
1. Tokenization
2. Collapsing Words: Lemmatization and Stemming
3. Sentence Segmentation and Summary

### Regular expression

Regular expressions are a useful tool for text normalization: they are a language for specifying the strings to be searched. Many string processing functions in Python support regular expressions. Here we use **re.findall** as an example to show how regular expressions work; at the end of this section, we will mention more functions that can use regular expressions.

In [1]:
#this is the library for using regular expressions in python
import re 

#### 1: Basic Regular Expression Patterns
The simplest regular expression is to match a sequence of simple characters. For example, suppose we have the following text:

In [2]:
#"Text" will be used in the regular expressions, to tell the notebook which text to search through
Text = "Though hundreds of thousands and Thousands had done their very best to disfigure the small piece of land on which they were crowded together"

Suppose we want to match the word "thousands". We simply use the pattern /thousands/. The **findall** function checks for a match anywhere in the text and returns all the matched strings as a list.

**Note on the function:**
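The "r" in front of the pattern string designates a Python raw string, in which backslashes are kept exactly as written; this matters because regular expressions use backslashes often. The function call therefore looks as follows: **re.findall(r'word',Text)**. The first argument, **r'word'**, is the pattern you want to match, and the second argument, **Text**, tells the notebook which text to search (as we defined above).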
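To see why the raw-string prefix matters, the short sketch below compares a normal string with a raw string. The sample sentence and variable names are only illustrative, and the `\b` (word boundary) symbol is an assumption that goes slightly beyond the tools covered in this notebook.

```python
import re

# In a normal string, "\n" is a single line-break character.
normal = '\n'
# In a raw string the backslash is kept, so r'\n' is two characters.
raw = r'\n'
print(len(normal), len(raw))  # prints: 1 2

# "\b" means "word boundary" in a regular expression, but a backspace
# character in a normal Python string, so the raw string is needed here.
sample = "a cat sat on the catalog"
whole_word = re.findall(r'\bcat\b', sample)
no_raw = re.findall('\bcat\b', sample)
print(whole_word)  # prints: ['cat']
print(no_raw)      # prints: []
```

For simple patterns such as r'thousands' the prefix makes no difference, but using it everywhere avoids surprises once backslashes appear.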


In [3]:
#labels "matchObj" as containing the results of the function
matchObj = re.findall(r'thousands',Text)

In [4]:
#prints the results of the function
matchObj

['thousands']

**Basic Regular 1.1**: **[ ]**, **^**, and **-** 


So far the regular expression only matches 'thousands' exactly. That is, every occurrence of 'thousands' written with a lowercase 't' and plural. In order to include other occurrences of 'thousands', the regular expression needs to be expanded:

- **[ ]** = Match one of the characters in the brackets

Suppose now we want to match both **'thousands'** and **'Thousands'**. This is where **"[ ]"** comes into play: square brackets match any one of the characters inside the brackets, so **[Tt]** matches either "T" or "t". The new regular expression needs to look as follows: **(r'[Tt]housands')** 

**Example: Find "thousands" in lower and upper case**

In [5]:
#labels "matchObj" as containing the results of the function
matchObj = re.findall(r'[Tt]housands',Text)

In [6]:
#prints the results of the function
matchObj

['thousands', 'Thousands']

Using the square brackets, we can match all the single digits in a text: **(r'[1234567890]')** 

or match any capital letter in a text: **(r'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]')**

Let's look at an example of each:

In [7]:
#"Text" will be used in the regular expressions, to tell the notebook which text to search through
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

**Example: Find all the numbers**

In [8]:
#labels "matchObj" as containing the results of the function
matchObj = re.findall(r'[1234567890]',Text)

In [9]:
#prints the results of the function
matchObj

['2', '3']

**Example: Find all the capital letters**

In [10]:
#labels "matchObj" as containing the results of the function
matchObj = re.findall(r'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]',Text)

In [11]:
#prints the results of the function
matchObj

['A', 'B', 'M']

Let's continue to explore some other tools which you can use with **[ ]**. Using these tools will give you more specific results:

- **-** = Range indicator

The brackets can be used with the dash **-** to specify any one character in a range. For example, the pattern **[0-9]** specifies any one of the digits from 0 to 9, so it is equivalent to **(r'[1234567890]')** used above, while **[1-9]** leaves out 0. Likewise, the pattern **[A-Z]** is equivalent to the expression we used above: *[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*.

- **^** = When [^inside brackets], it means "not"

The caret **^** means "not" when it is the first symbol after the opening square bracket. For example, **[^a-z]** matches any single character except a lowercase letter. That means it will still match digits and capital letters, as well as spaces and punctuation.

Let's see two examples: 


In [12]:
Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"

**Example: Find all numbers in "a range from 1 to 9"**

In [13]:
matchObj = re.findall(r'[1-9]',Text)

In [14]:
matchObj

['2', '3']

**Example: Find all "characters and numbers", except "lower case alphabet"**


In [15]:
matchObj = re.findall(r'[^a-z]',Text)

In [16]:
matchObj

['A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 '2',
 ' ',
 'B',
 ' ',
 ' ',
 '3',
 ' ',
 'M',
 ' ',
 ' ',
 ' ']
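Most of the matches above are spaces. As a minimal sketch, the negated set can be extended so that white space is excluded as well. The `\s` shorthand for white space is an assumption here, since it is only introduced later in this notebook.

```python
import re

Text = "An apple falls from the tree, there are 2 Birds and 3 Monkeys on the tree"
# [^a-z\s] = any character that is neither a lowercase letter nor white space
matchObj = re.findall(r'[^a-z\s]', Text)
print(matchObj)  # prints: ['A', ',', '2', 'B', '3', 'M']
```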

**Basic Regular 1.2**: **?**, **\***, **+**, and **.**

Let's add another tool:

- **?** = Once or none

The question mark means "the preceding character or nothing". For example, **(r'falls?')** will match both "fall" and "falls". 

See the example below:

In [17]:
Text = "An apple falls from the tree while children fall from another tree. One apple is falling. Is it false?"

**Example: Find all instances of words starting with "fall" and "falls", output only "fall" and "falls"**

In [18]:
matchObj = re.findall(r'falls?',Text)

In [19]:
matchObj

['falls', 'fall', 'fall']


- **\*** = Zero or more times

The asterisk **\*** matches "zero or more occurrences of the immediately previous character". For example, **(r'ab\*')** will match all of the following: "a", "ab", "abb", "abbbb". 

Interestingly, we can use **(r'[0-9][0-9]\*')** to match any non-negative integer (a sign such as "-" is not part of the pattern). 

*Side note: An integer (from the Latin integer meaning "whole") is a number that can be written without a fractional component. For example, 21, 4, 0, and −2048 are integers, while 9.75, 5 1/2, and √2 are not. [Source](https://en.wikipedia.org/wiki/Integer)*

**Example: Find all instances of "any number", output the exact number**



In [20]:
Text = "There are 57 apples and 2 birds on the tree 300, 1234"
matchObj = re.findall(r'[0-9][0-9]*',Text)
matchObj

['57', '2', '300', '1234']

**Example: Find all instances of "a" followed by zero or more "b"s, output "a" together with the "b"s that follow it**

In [21]:
Text = "Some dummy words； ab, abb, abbbb, abbbb, a, afg, abcde"
matchObj = re.findall(r'ab*',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb', 'a', 'a', 'ab']

- **+** = One or more

The **+** tool has almost the same function as **\***. The difference is: it matches "**one** or more occurrences of the immediately previous character" (as opposed to "**zero** or more...")

If we use "ab+" instead of "ab\*", it will only match "ab", "abb", "abbb...", but not "a" on its own.

See in the example:

**Example: Find all instances of "a" immediately followed by one or more "b"s, output "a" together with the "b"s that follow it**

In [22]:
matchObj = re.findall(r'ab+',Text)
matchObj

['ab', 'abb', 'abbbb', 'abbbb', 'ab']
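The relationship between **+** and **\*** can be checked on the integer example from earlier: a minimal sketch, assuming the same sample sentence, showing that "[0-9]+" gives the same result as "[0-9][0-9]\*". The second part, with an optional leading hyphen for negative numbers, is an illustrative extension and uses a made-up sample string.

```python
import re

Text = "There are 57 apples and 2 birds on the tree 300, 1234"
# "one digit followed by zero or more digits" ...
with_star = re.findall(r'[0-9][0-9]*', Text)
# ... is the same as "one or more digits"
with_plus = re.findall(r'[0-9]+', Text)
print(with_star == with_plus)  # prints: True

# Illustrative extension: "-?" allows an optional minus sign (a plain hyphen)
signed = re.findall(r'-?[0-9]+', "-2048 and 21")
print(signed)  # prints: ['-2048', '21']
```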

- **.** = Any character except line break

The period **.** is a very important expression. It is used to match any single character, also within a word. For example, if you want to match any occurrence of the words "sing, sang, sung and song", use: **(r's.ng')**

Take a look at the example below:

**Example: Find all instances of "beg_n"**


In [23]:
Text = "began is the past tense of begin not begging"
matchObj = re.findall(r'beg.n',Text)
matchObj

['began', 'begin']
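The "s.ng" pattern mentioned above can be tried out in the same way. This is a small sketch with made-up sample sentences; the second call shows that the period also matches digits and spaces, not only letters.

```python
import re

Text = "I sing, she sang, they have sung a song"
matchObj = re.findall(r's.ng', Text)
print(matchObj)  # prints: ['sing', 'sang', 'sung', 'song']

# The period matches any single character, including a digit or a space
other = re.findall(r's.ng', "s1ng s ng")
print(other)  # prints: ['s1ng', 's ng']
```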

**Basic Regular 1.3:** Anchors

Anchors are special characters that anchor regular expressions to particular places in a string. 

- **^** = Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not")

Caret **^** is used to match the start of a line, so **(r'^began')** will only match the word "began" at the start of a line.

- **$** = 	End of string or end of line depending on multiline mode.

The dollar sign **$** is used to match the end of a line. So **(r'began$')** will only match "began" if it appears at the end of a line.

See the three examples below. *(Note: the period after "end" in the patterns matches exactly one additional character of any kind, so each pattern finds "end" immediately followed by one more character, here a digit.)*

**Example: Find all instances of "end" at the beginning of a line**



In [24]:
Text = "end1 of the a period of life often indicates the beginning of another journey,so end2ing is not always the end3"
matchObj = re.findall(r'^end.',Text)
matchObj

['end1']


**Example: Find all instances of "end" followed by "any character or number"**

In [25]:
matchObj = re.findall(r'end.',Text)
matchObj

['end1', 'end2', 'end3']

**Example: Find all instances of "end" at the end of a line**

In [26]:
matchObj = re.findall(r'end.$',Text)
matchObj

['end3']
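The definitions above say that the anchors depend on multiline mode. A minimal sketch of the difference, assuming a made-up two-line sample text: without the `re.MULTILINE` flag, **^** and **$** only match at the start and end of the whole string; with the flag, they match at the start and end of every line.

```python
import re

Text = "end1 is on the first line\nend2 is on the second line"

# Default mode: ^ matches only at the very start of the string
single_start = re.findall(r'^end.', Text)
# Multiline mode: ^ matches at the start of every line
multi_start = re.findall(r'^end.', Text, re.MULTILINE)
print(single_start)  # prints: ['end1']
print(multi_start)   # prints: ['end1', 'end2']

# The same holds for $ at the end of the string or of every line
single_end = re.findall(r'line$', Text)
multi_end = re.findall(r'line$', Text, re.MULTILINE)
print(single_end)  # prints: ['line']
print(multi_end)   # prints: ['line', 'line']
```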


### 1.4 Practice

You're welcome to use the code field below to practice the regular expressions you've learned above. We provided a text you can practice on, but you can also replace it with your own text if you want to (just copy and paste it inside of the quotation marks).

A little recap, these are the regular expression tools you've learned so far:

**(r'xxx',Text)**

**Tools**
- **[ ]** = Match one of the characters in the brackets
- **-** = Range indicator
- **^** = When [^inside brackets], it means "not" (also means: start of string or start of line depending on multiline mode. )
- **?** = Once or none
- **\*** = Zero or more times
- **+** = One or more
- **.** = Any character except line break

**Anchors**

- **^** = Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not")
- **$** = 	End of string or end of line depending on multiline mode.



In [27]:
#"Text" will be used in the regular expressions, to tell the notebook which text to search through
Text = "Top Three Reasons Why TV Remote Controls Are Frustrating to Use. The average U.S. household has five remote controls. In fact, many households have 10 or more. But despite being introduced over 50 years ago, TV remote controls still maintain a basic design that can be frustrating to use. Below are the top three reasons why TV remote controls are frustrating to use: #3 – Commonly used buttons are too small, making their use awkward. The most commonly used buttons are frequently the smallest buttons. They also are usually surrounded by lots of other small buttons. This is especially problematic in low light. #2 – Too many rarely used buttons getting in the way. Most remotes have from 40 to 60 buttons, yet only about 10 of them are commonly used. All of these extra buttons make it difficult to find the buttons that are actually used most often. This, too, is especially problematic in low light. #1 – Difficult to use in low light, impossible to use in the dark. Nearly 90 percent of remote controls do not have backlighting. Considering the popularity of watching television with the lights off, along with the fact that 75 percent of Americans have vision problems, it is surprising so few are backlit. Backlighting is not a perfect solution, though, because it only illuminates the buttons but not the text next to the buttons."
#labels "matchObj" as containing the results of the regular expression
matchObj = re.findall(r'ENTER REGULAR EXPRESSION HERE',Text)
#prints the results of the regular expression
matchObj

[]

## 2. Disjunction, Grouping and Precedence

### 2.1 Disjunction

- **|** = Alternation / OR operator

The disjunction operator **|** (also called the pipe symbol) is similar to "or" in natural language. Suppose we want to search for "bird or monkey" in the sentence. How can we do this? You may think about using **[ ]**, but the problem is that brackets only apply to single characters.

This is how you search for x or y: **(r'x|y')**


In [28]:
Text = "Here is a bird and some monkeys on the tree. There is one more monkey and some birds in the air. Look abird."
matchObj = re.findall(r'bird|monkeys',Text)
matchObj

['bird', 'monkeys', 'bird', 'bird']

*Note: There are three results for "bird" because the pattern also matches "bird" as part of a longer word, including "birds" (plural) and "abird" (typo). The singular "monkey" is not included because the pattern only matches "monkeys" exactly.*
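If only whole words should be matched, one possible refinement is sketched below. It combines the disjunction with the **?** tool and with the word-boundary symbol `\b`, which is an assumption here because it is not otherwise covered in this notebook.

```python
import re

Text = "Here is a bird and some monkeys on the tree. There is one more monkey and some birds in the air. Look abird."
# \b marks a word boundary, and s? makes the plural "s" optional
matchObj = re.findall(r'\bbirds?\b|\bmonkeys?\b', Text)
print(matchObj)  # prints: ['bird', 'monkeys', 'monkey', 'birds']
```

The typo "abird" is no longer matched, because there is no word boundary between "a" and "b".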

### 2.2 Grouping and Precedence

- **()** = Capturing group

The parenthesis symbol **()** is used for both grouping and precedence. 

Suppose you want to match both "study" and "studies". We may want to use the disjunction operator, but "study|ies" will not match "study" and "studies"; instead, it will match "study" and "ies". This is because, by default, sequences like "study" and "ies" take precedence over the disjunction operator **|**. To solve this problem, we can use **()** to specify the precedence we want: **(r'stud(y|ies)')** applies the disjunction only to the endings "y" and "ies". Note that the parentheses must enclose the whole disjunction; **(r'stud(y)|(ies)')** still means "study" or "ies".

In general, adding parentheses around part of a pattern groups it, but remember that the grouping works by changing the precedence.

**Example: Find all instances of "study" and "ies" using |, output "study" and "ies"**



In [29]:
Text = "Tom likes to study history, while his brother studies maths in the college. It's not lies."
matchObj = re.findall(r'study|ies',Text)
matchObj

['study', 'ies', 'ies']

**Example: Find all instances of "study" and "ies", output "y" and "ies"**

In [30]:
Text = "Tom likes to study history, while his brother studies maths in the college. It's not lies. There is a stud y."
matchObj = re.findall(r'stud(y)|(ies)',Text)
matchObj

[('y', ''), ('', 'ies'), ('', 'ies')]



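As you can see in the second example above, the output is not "study" and "studies"; it contains only the captured parts "y" and "ies" (as tuples with one entry per group, because the pattern has two groups). The pattern **stud(y)|(ies)** also still means "study" or "ies", which is why "lies" produces a match. If we want the output to be "study" and "studies", we need to put the whole disjunction inside one group and add another tool: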

- **?:** = Non-capturing group

**What is the problem?**

**()** has two meanings: 
- one is to specify the precedence 
- the other is to actually store (capture) the text matched inside the parentheses so that it can be output when necessary.

Basically, if we merely put **()** here, the findall function will only return the content matched inside the parentheses, in this case **"y"** and **"ies"**. Sometimes this is exactly what we need: we want to match a larger pattern but only return part of it.

**Solution:**
Using **(?:)** specifies the precedence without capturing, so findall returns the whole match and not only the content of the parentheses.

Take a look at the examples. 

1. Uses only brackets
2. Uses the **?:** tool

**Example: Find all instances of "study" and "studies", output "y" and "ies"**


In [31]:
matchObj = re.findall(r'stud(y|ies)',Text)
matchObj

['y', 'ies']


**Example: Find all instances of "study" and "studies", output "study" and "studies"**

In [32]:
matchObj = re.findall(r'stud(?:y|ies)',Text)
matchObj

['study', 'studies']
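As an alternative sketch, a capturing group can be kept and the whole match retrieved anyway by using `re.finditer`, which yields match objects instead of strings. This goes slightly beyond the findall function used in this notebook; the variable names are only illustrative.

```python
import re

Text = "Tom likes to study history, while his brother studies maths in the college. It's not lies. There is a stud y."
matches = list(re.finditer(r'stud(y|ies)', Text))

# group(0) is the whole match, group(1) is the content of the first ()
whole = [m.group(0) for m in matches]
endings = [m.group(1) for m in matches]
print(whole)    # prints: ['study', 'studies']
print(endings)  # prints: ['y', 'ies']
```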

## 3. An example of Regular Expression in Use

Here is an example showing how we can construct a regular expression step by step. For this example we will use an article about the history of England. 

Suppose now we want a list of all the years (written as digits) that occur in the article, and we need to write regular expressions to help us find them. Here is what the beginning of the article looks like:

In [33]:
#This loads an external text file and labels the file as "data"
with open('History of England(used for regex tutorial).txt', 'rb') as myfile:
    data=myfile.read()

In [34]:
data = data.decode("utf-8")
data[:500]

'History of England\r\nFrom Wikipedia, the free encyclopedia\r\n\r\nEngland became inhabited more than 800,000 years ago, as the discovery of stone tools and footprints at Happisburgh in Norfolk has revealed.[1] The earliest evidence for early modern humans in North West Europe, a jawbone discovered in Devon at Kents Cavern in 1927, was re-dated in 2011 to between 41,000 and 44,000 years old.[2] Continuous human habitation in England dates to around 13,000 years ago (see Creswellian), at the end of the'


By looking at the text, we can see that there are three "types" of ways to mention years: 

1. **In 2011** where "2011" can be replaced by any number 
2. **4000 BC/AD** where "4000" can be replaced by any number 
3. **1495–1497** where "1495" and "1497" can be replaced by any number 

We will write regular expressions to capture these three formats one by one, starting with the first:
**(r'[Ii]n [0-9]+',data)**

- "[Ii]n" = matches any occurrence of "In" or "in"
- "[0-9]" = matches any digit between "0 and 9"
- "+" = matches one or more digits in a row
- ",data" = the label we gave the text 

**Example: Find all instances of "in" followed by "numbers"**


In [35]:
matchObj = re.findall(r'[Ii]n [0-9]+',data)
#The [:10] means it will show only the first 10 results.
matchObj[:10]

['in 1927',
 'in 2011',
 'In 1066',
 'in 1485',
 'in 1660',
 'in 1707',
 'In 55',
 'In 2003',
 'in 43',
 'in 60']


In the previous example we got all instances of numbers appearing with "I/in" in front of it, but as mentioned before, we also have numbers like **"4000 BC or AD"**

Let's find those next with:
**(r'[0-9]+ (?:BC|AD)',data)**

- "[0-9]" = matches any digit between "0 and 9"
- "+" = matches one or more digits in a row
- "(?:BC|AD)" = matches any occurrence of "BC" or "AD"


**Example: Find all instances of "numbers" followed by "BC" or "AD"**

In [36]:
matchObj = re.findall(r'[0-9]+ (?:BC|AD)',data)
matchObj

['000 BC',
 '6500 BC',
 '9000 BC',
 '4000 BC',
 '3806 BC',
 '2500 BC',
 '800 BC',
 '400 BC',
 '400 BC',
 '325 BC',
 '150 BC',
 '300 BC',
 '100 BC',
 '54 BC',
 '43 AD',
 '54 AD',
 '60 AD',
 '138 AD',
 '700 AD']

### But wait a minute! 

If you look at the results carefully, you can see something weird. What does **"000 BC"** mean? 

It turns out that some of the years are written in this format: **"9,000 BC"**, so there is a comma inserted in the number. 

**Solution:**

In addition to "[0-9]+", which matches a plain integer, we need to add **"(?:,[0-9]{3})\*"** to match integers containing commas.

The new regular expression now looks like this: **(r'[0-9]+(?:,[0-9]{3})\* (?:BC|AD)',data)**

- "[0-9]+" = matches one or more digits between "0 and 9"
- "(?:,[0-9]{3})\*" = matches the comma-separated part of a number
    - "?:" = makes the group non-capturing, so findall returns the whole match (including the comma and digits) rather than only the content of the group
    - "," = matches the comma itself
    - "[0-9]" = specifies that the comma needs to be followed by digits
    - "{3}" = (def: Exactly three times), specifies that the comma is followed by exactly 3 digits
    - "\*" = matches zero or more instances of the preceding group, in this case any number of "a comma followed by exactly 3 digits"
- "(?:BC|AD)" = matches any occurrence of "BC" or "AD"
- ",data" = the label we gave the text

**Example: Find all instances of "numbers" and "numbers containing a comma" followed by "BC" or "AD"**



In [37]:
matchObj = re.findall(r'[0-9]+(?:,[0-9]{3})* (?:BC|AD)',data)
matchObj

['9,000 BC',
 '6500 BC',
 '9000 BC',
 '4000 BC',
 '3806 BC',
 '2500 BC',
 '800 BC',
 '400 BC',
 '400 BC',
 '325 BC',
 '150 BC',
 '300 BC',
 '100 BC',
 '54 BC',
 '43 AD',
 '54 AD',
 '60 AD',
 '138 AD',
 '700 AD']
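A pattern with this many parts can be hard to read. As a sketch, Python's `re.VERBOSE` flag allows the same pattern to be spread over several lines with a comment on each part. The sample string and the name `year_pattern` below are made up for illustration, since the article file is not reproduced here.

```python
import re

year_pattern = re.compile(r"""
    [0-9]+            # one or more digits
    (?:,[0-9]{3})*    # zero or more groups of "comma + exactly three digits"
    [ ]               # one space (in verbose mode a bare space is ignored)
    (?:BC|AD)         # "BC" or "AD", without capturing
""", re.VERBOSE)

sample = "Sample years: 9,000 BC, 6500 BC, 43 AD and 1,234,567 BC."
years = year_pattern.findall(sample)
print(years)  # prints: ['9,000 BC', '6500 BC', '43 AD', '1,234,567 BC']
```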



Now we matched all occurrences of years followed by BC or AD, but what about occurrences of time periods?

Let's take a look:

**(r'\w+–\w+',data)**

- **\w** = matches a single Unicode letter, ideogram, digit, or underscore
- **+** = matches one or more at a time
- **–** = matches the en dash character "–" itself (note that this is a different character from the hyphen "-")

This regular expression searches for occurrences of letters or digits, followed by an en dash, followed by letters or digits.

Here is the example:

**Example: Find all instances of "letters or digits" followed by "–" followed by "letters or digits" (without spaces)**

In [38]:
matchObj = re.findall(r'\w+–\w+',data)
matchObj

['1135–1154',
 '1337–1453',
 '1649–1653',
 '1653–1659',
 '18th–19th',
 '3807–3806',
 '600–400',
 '150–100',
 '1139–1153',
 '1216–1272',
 '1272–1307',
 '1315–1317',
 '1327–1377',
 '1455–1485',
 '1470–1471',
 '1495–1497',
 '1516–1558',
 '1527–1598',
 '1558–1603',
 '1585–1604',
 '1585–1603',
 '1630–1660',
 '1642–1645',
 '18th–19th',
 '1832–1974',
 '1945–present',
 '1945–present',
 '1985–86']
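The result above also contains entries such as "18th–19th" and "1945–present". If only numeric ranges are wanted, one option is to replace **\w** with **[0-9]**. This is a minimal sketch on a made-up sample string that reuses a few of the matches listed above.

```python
import re

sample = "1135–1154, 18th–19th, 1945–present, 1985–86"
# digits, then an en dash, then digits
ranges = re.findall(r'[0-9]+–[0-9]+', sample)
print(ranges)  # prints: ['1135–1154', '1985–86']
```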



## 4. Summary of Regular Expression

Regular expressions are an extremely powerful tool, because they are fast to execute and can be used to match almost any pattern. 

One thing you need to remember is that regular expressions only match exactly what you tell them to match. It is always important to check the results to see whether the regular expression matched what you wanted, or whether it included or excluded something unexpected.

This has been only a very brief introduction to regular expressions, and a lot of things haven't been mentioned. Fortunately, there are many websites with cheat sheets that can give you a quick overview of the tools. Alternatively, you can search online for regular expression examples to see what other people did and be inspired to construct your own.
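Besides **re.findall**, several other functions in the `re` library accept regular expressions. The sketch below shows three of them on a made-up sample sentence: `re.search` (first match only), `re.sub` (replace matches) and `re.split` (split the text at matches).

```python
import re

text = "There are 2 Birds and 3 Monkeys on the tree"

# re.search returns a match object for the first match (or None)
first = re.search(r'[0-9]+', text)
print(first.group(0), first.start())  # prints: 2 10

# re.sub replaces every match with the given string
masked = re.sub(r'[0-9]+', 'N', text)
print(masked)  # prints: There are N Birds and N Monkeys on the tree

# re.split cuts the text wherever the pattern matches (\s+ = white space)
words = re.split(r'\s+', text)
print(words[:3])  # prints: ['There', 'are', '2']
```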


### Word Tokenization and Normalization


### 1. Tokenization

Tokenization is used to segment running text into words, but just like other preprocessing tasks, word tokenization is highly dependent on the task we want to perform. 

For example, in some tasks we want to keep the punctuation in our sentence and use it as a separate token, but most of the time, if we only care about the *semantic level* of the text (the meaning of the text), we can ignore the punctuation and other non-alphanumeric characters in the text. 

**Note:** Non-alphanumeric characters are all characters that are neither alphabetical (abc...) nor numeric (123...), like: **!"#€%&/()=?**

Here we provide two functions: 
1. remove the non-alphanumeric characters
2. tokenize the text into words.

But first we need to add text:

**Add text** 

This is an example text with randomly added non-alphanumerical characters:


In [39]:
#"data" will be used in the functions, to tell the notebook which text to search through
data = "emmm... There is arc**haeological evidence of hu$man occupation of the Rome area from approximately 14,000 years ago, but the dense layer of much younger debris obscures Palaeolithic and Neolithic sites.[6] Evidence of stone tools, pottery, and stone weapons attest to about 10,000 years of human presence. Several excavations support the view that Rome grew from pastoral settlements on the Palatine Hill built above the area of the future Roman Forum. Between the end of the bronze age and the beginning of the Iron age, each hill between the sea and the Capitol was topped by a village (on the Capitol Hill, a village is attested since the end of the 14th century BC).[22]"

**1. Remove non-alphanumeric characters**

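Below we import the regular expressions library called "re". Then we write the code to remove the non-alphanumeric characters.

The function uses a very simple regular expression: **[^\w\s]** matches every character that is neither alphanumeric (represented by \w) nor white space (represented by \s), and the function replaces each of those characters with nothing. Because \w also covers the underscore, underscores are removed in a separate step, and so are line breaks.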

In [40]:
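#imports regular expressions library
import re
#removes the non-alphanumeric characters based on "re"
def re_nalpha(text):
    #matches any character that is neither a word character (\w) nor white space (\s)
    pattern = re.compile(r'[^\w\s]', re.U)
    #removes the matched characters, then underscores, then line breaks
    return re.sub(r'\n','',re.sub(r'_', '', re.sub(pattern, '', text)))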

**1.1 Output - Remove non-alphanumeric characters**

We defined *re_nalpha* above to remove all the non-alphanumeric characters and we tell it in the brackets to remove the characters from the text we labelled "data" (see the "add text" field, where it says **data = "..."**).

In [41]:
#labels "alpha" as containing the text without non-alphanumeric characters
alpha = re_nalpha(data)
#prints the text without non-alphanumeric characters
alpha

'emmm There is archaeological evidence of human occupation of the Rome area from approximately 14000 years ago but the dense layer of much younger debris obscures Palaeolithic and Neolithic sites6 Evidence of stone tools pottery and stone weapons attest to about 10000 years of human presence Several excavations support the view that Rome grew from pastoral settlements on the Palatine Hill built above the area of the future Roman Forum Between the end of the bronze age and the beginning of the Iron age each hill between the sea and the Capitol was topped by a village on the Capitol Hill a village is attested since the end of the 14th century BC22'


*Note: Removing non-alphanumeric characters might lead to some unexpected results like this:*
- Original: **sites.[6]**
- After removing non-alphanumeric characters: **sites6**
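One possible way to avoid results like "sites6" is to replace the removed characters with a space instead of deleting them. The function `re_nalpha_spaced` below is a hypothetical variant written for illustration, not part of this repository's tools. It comes with a trade-off: characters inserted in the middle of a word, as in "hu$man", now split that word in two.

```python
import re

def re_nalpha_spaced(text):
    # Hypothetical variant: replace non-alphanumeric characters and
    # underscores with a space instead of deleting them ...
    spaced = re.sub(r'[^\w\s]|_', ' ', text)
    # ... then collapse runs of white space into a single space
    return re.sub(r'\s+', ' ', spaced).strip()

kept_apart = re_nalpha_spaced("Neolithic sites.[6] Evidence")
split_word = re_nalpha_spaced("hu$man occupation")
print(kept_apart)  # prints: Neolithic sites 6 Evidence
print(split_word)  # prints: hu man occupation
```

Which behaviour is preferable depends on the kind of noise in the text.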

**2. Tokenize the text into words**

We already imported the regular expressions library above, so we don't need to do it again here. We only need to write the code to tokenize the words.

In [42]:
#tokenizes the text into words (and converts them to lowercase)
def word_tokenize(text):
    return re.findall(r'\w+', text.lower())

**2.1 Output - Tokenize text into words**

Above we defined "word_tokenize" to tokenize the words and under "1.1 Output" we labelled the cleaned text "alpha".

We could also tokenize the original text by replacing "alpha" with "data" in the brackets, but then words would be split wherever a non-alphanumeric character occurs (for example, "arc\*\*haeological" would become "arc" and "haeological"), and that's not what we want.

- **[:10]** = show only the first 10 results



In [43]:
word_tokenize(alpha)[:10]

['emmm',
 'there',
 'is',
 'archaeological',
 'evidence',
 'of',
 'human',
 'occupation',
 'of',
 'the']

**Sidenote on Languages**

Tokenization in English and other languages that separate words with spaces is relatively easy, but for languages like Chinese, which do not mark word boundaries with spaces, tokenization is actually a challenging task. 

**Sidenote on Special Cases**

- Also, there are some "special cases" we haven't talked about. For example, how do we deal with the **"you're"** case? Should we separate it into **"you are"** or keep it as a single string **"you're"**?  
- And should we regard "New York" as a single string or two words? 

In these cases, we cannot merely rely on the space between the words to do tokenization. More advanced algorithms are needed.
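As a small illustration of the first special case, the tokenizer pattern can be extended so that an apostrophe inside a word does not split it. The function name `word_tokenize_keep_apostrophes` is hypothetical, and the sketch only handles the straight apostrophe character; it does not address multi-word names such as "New York".

```python
import re

def word_tokenize_keep_apostrophes(text):
    # \w+ matches a word; (?:'\w+)* allows it to continue after an apostrophe
    return re.findall(r"\w+(?:'\w+)*", text.lower())

tokens = word_tokenize_keep_apostrophes("You're in New York, aren't you?")
print(tokens)  # prints: ["you're", 'in', 'new', 'york', "aren't", 'you']
```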


### 2. Collapsing Words: Lemmatization and Stemming

**Lemmatization** is the task of determining whether two words have the same root. When we apply lemmatization to a text, all the words that have the same root/origin are mapped to that root. 

For example, the three words **"am"**, **"is"**, and **"are"** will be mapped to **"be"**.

Lemmatization requires a very complex algorithm, so in many cases, we will do a simpler version of lemmatization: 

**Stemming** - just removes the suffix of the words. In this case, **"cats"** will become **"cat"**, **"heights"** will become **"height"**. 

Of course, stemming is not just removing the "s" character at the end of the word; different algorithms use different stemming rules. The most widely used algorithm was proposed by Porter (1980). It will, for example, map **"accurate"** to **"accur"**. 

Let's look at an example of stemming using the nltk library:



In [44]:
#imports the Porter Stemmer from the nltk stemming library
from nltk.stem import PorterStemmer

In [45]:
#creates a PorterStemmer object and labels it "ps", so we don't have to write the whole expression every time
ps = PorterStemmer()

In [46]:
#labels wordList to contain the tokenized words from the "alpha" text (the one we tokenized above)
wordList = word_tokenize(alpha)[10:30]
#prints the list of tokenized words
wordList

['rome',
 'area',
 'from',
 'approximately',
 '14000',
 'years',
 'ago',
 'but',
 'the',
 'dense',
 'layer',
 'of',
 'much',
 'younger',
 'debris',
 'obscures',
 'palaeolithic',
 'and',
 'neolithic',
 'sites6']

In [47]:
#prints the list of stemmed words
for w in wordList:
    print(ps.stem(w))

rome
area
from
approxim
14000
year
ago
but
the
dens
layer
of
much
younger
debri
obscur
palaeolith
and
neolith
sites6


**More info about Porter Stemmer**

The PorterStemmer is based on a set of rules. For example, one of the rules converts every "SSES" ending to "SS"; following this rule, "grasses" will be converted to "grass". The list of rules and the code for implementing the algorithm can be found on [this website](https://tartarus.org/martin/PorterStemmer/index.html).
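To give a feeling for what such rules look like in code, here is a minimal sketch of the first group of plural rules from Porter's rule list (SSES to SS, IES to I, SS to SS, S to nothing). The function name `porter_step_1a` is made up for this illustration; it covers only this one step, not the full algorithm, and library implementations such as the one in nltk may differ in details.

```python
def porter_step_1a(word):
    # The rules are tried in order; only the first matching rule is applied.
    if word.endswith('sses'):
        return word[:-2]   # SSES -> SS   (grasses -> grass)
    if word.endswith('ies'):
        return word[:-2]   # IES  -> I    (ponies  -> poni)
    if word.endswith('ss'):
        return word        # SS   -> SS   (caress  -> caress)
    if word.endswith('s'):
        return word[:-1]   # S    -> ""   (cats    -> cat)
    return word

stems = [porter_step_1a(w) for w in ['grasses', 'ponies', 'caress', 'cats']]
print(stems)  # prints: ['grass', 'poni', 'caress', 'cat']
```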


### 3. Sentence Segmentation and Summary

Sentence segmentation is another important step in text processing. It is similar to *word tokenization*, but the task is to split the text at the sentence level. 

While sentence segmentation seems easy to do, state-of-the-art approaches rely on machine learning. Some people may argue we can simply use punctuation as the sentence boundary, but punctuation is often ambiguous. For example, the period character **"."** can either be a sentence boundary marker or a marker of abbreviations like **"Mr."** or **"etc."**.
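The ambiguity can be demonstrated with a deliberately naive splitter that cuts the text after every ".", "!" or "?" followed by white space. This is only a sketch with a made-up sample sentence; the `(?<=...)` lookbehind syntax is an assumption that goes beyond the tools introduced in this notebook.

```python
import re

Text = "Mr. Smith bought apples, pears, etc. for the trip. He left at noon."
# Split at white space that comes directly after ".", "!" or "?"
naive = re.split(r'(?<=[.!?])\s+', Text)
print(naive)
# prints: ['Mr.', 'Smith bought apples, pears, etc.', 'for the trip.', 'He left at noon.']
```

The text contains two sentences, but the naive rule produces four pieces because it also splits after the abbreviations "Mr." and "etc.".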

In this repository, we have some preprocessing tools built by ourselves. You can find the scripts for tokenization under *CLEAR/tools/scripts_py/preprocessing.py* , but for the stemming part, we rely on the **nltk** library, which provides many useful functions for preprocessing.
