# Natural Language Processing

Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process large amounts of natural-language data effectively.

Challenges in natural-language processing frequently involve speech recognition, natural-language understanding, and natural-language generation. NLP is used in many applications, including text classification, search engines, and sentiment analysis.


## Preparation

In this lab, we use the text of *Alice's Adventures in Wonderland*, stored in `alice.txt`. You can load it with the following command.

```python
text = open('alice.txt').read()
```

## Tokenization

Tokenization is the task of cutting a string into smaller units (tokens), each holding a single piece of language data such as a word.
The simplest way to tokenize text is to split it on spaces.

```python
tokens = text.split(' ')
```

If you do a quick check on the tokens, you will find that we have extracted the word tokens, but the result is not ideal: many tokens still carry noise such as punctuation and control characters (for example, newlines). We are going to deal with that next.

```python
text = open('alice.txt').read()
tokens = text.split(' ')

print(len(tokens))
```

```
25056
```
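To see where the noise comes from, here is a small sketch on a made-up two-line sample (the string `sample` and all variable names are illustrative and not taken from `alice.txt`). Splitting on a single space leaves punctuation attached and does not break at newlines. Calling `split()` with no argument splits on any whitespace, but punctuation still sticks to the words.

```python
# Hypothetical sample standing in for a fragment of the book.
sample = "Alice was very tired,\nand of having nothing to do: once or twice"

space_tokens = sample.split(' ')  # splits on the space character only
ws_tokens = sample.split()        # splits on any run of whitespace

# Tokens that are not purely alphabetic still carry noise.
noisy_space = [t for t in space_tokens if not t.isalpha()]
noisy_ws = [t for t in ws_tokens if not t.isalpha()]

print(noisy_space)
print(noisy_ws)
```

Neither variant removes the punctuation, which is why a pattern-based tokenizer is the usual next step.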


A better approach is to use regular expressions. A regular expression (regex for short) is a pattern language for matching and manipulating strings.

For more details, please refer to: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

```python
import re
tokens = re.findall(r'\w+', text)
print(tokens)
```

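In this code, the pattern `\w+` matches each run of one or more word characters (letters, digits, and underscores), so every word is captured as a whole, giving a much cleaner result than `split()` alone. The `r` prefix in front of `'\w+'` marks a *raw string*, in which backslashes are kept as written instead of being treated as Python escape sequences. Raw strings are not strictly required for regex, but they are the usual convention for writing patterns. They can also be used in other circumstances, but you do not have to worry about that for now.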
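One side effect of `\w+` is worth knowing: an apostrophe is not a word character, so contractions are split in two. The sketch below shows this on a made-up sentence and compares it with an alternative pattern that keeps contractions together. The sample string and the alternative pattern are assumptions for illustration, not part of the lab, and the pattern only handles the straight apostrophe `'` and unaccented letters.

```python
import re

# Hypothetical sample sentence, not taken from alice.txt.
sample = "Alice didn't reply: 'Oh dear!'"

# \w+ treats the apostrophe as a separator, so "didn't" becomes two tokens.
word_tokens = re.findall(r'\w+', sample)

# Alternative sketch: letters, optionally followed by an apostrophe and more letters.
contraction_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sample)

print(word_tokens)
print(contraction_tokens)
```

Fragments such as `t`, `s`, or `ll` in a frequency list usually come from contractions split this way.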





## First look

Now that we have our tokens, we can take a quick look at the 20 most frequent ones and see whether they tell us anything about the content.

```python
from collections import Counter
print(Counter(tokens).most_common(20))
```

Some of the issues here are:
- Uppercase and lowercase forms of the same word are counted separately.
- The top 20 words do not carry much meaning.
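The first issue can be reproduced on a tiny, made-up token list (the list `sample_tokens` is hypothetical). `Counter` compares strings exactly, so `'The'` and `'the'` are separate keys until the tokens are lowercased.

```python
from collections import Counter

# Hypothetical tokens for illustration only.
sample_tokens = ['The', 'cat', 'and', 'the', 'hat', 'The']

case_sensitive = Counter(sample_tokens)
case_insensitive = Counter(t.lower() for t in sample_tokens)

print(case_sensitive['The'], case_sensitive['the'])  # counted separately
print(case_insensitive['the'])                       # merged after lowercasing
```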

Most of the words in the top 20 list carry no indicative meaning; such words are called *stopwords*.
In information retrieval (IR) and text mining, a stop word is a commonly used word (such as *is*, *am*, *the*) that a program is set up to ignore, both when indexing and when retrieving. When building an index, most engines remove certain words from every index entry; the list of words to be excluded is called a *stop list*. Stop words are deemed irrelevant because they occur so frequently in the language that they say little about any particular document. Dropping them saves both space and time.



## Stopword Removal

We can remove stopwords by using a word list (`stopwords.txt`), a collection of common stopwords in the English language. The file was compiled from multiple sources, including research outcomes from Dr. Lau's PhD. Feel free to edit the file to suit your own use case.

```python
stopwords = open('stopwords.txt','r').read().splitlines()
```

If you open `stopwords.txt`, you will notice that all of the words are in lowercase. Lowercasing is another common step in NLP: it normalizes case so that *The* and *the* are treated as the same word. Let's combine the two steps in a list comprehension to obtain our token list without stopwords.


```python
tokens = [t.lower() for t in tokens if t.lower() not in stopwords]
```
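As a minimal sketch of the same idea on data you can check by hand, the helper below filters a short token list against a short stop list. The function name `remove_stopwords` and both sample lists are hypothetical. It also converts the stop list to a `set`, a small design choice that makes each membership test a constant-time lookup instead of a scan through a list.

```python
def remove_stopwords(tokens, stopwords):
    """Lowercase the tokens and drop those found in the stop list."""
    # A set gives fast membership tests; blank lines in the list are skipped.
    stop_set = {w.strip().lower() for w in stopwords if w.strip()}
    return [t.lower() for t in tokens if t.lower() not in stop_set]

# Hypothetical sample data, not the contents of stopwords.txt.
sample_tokens = ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired']
sample_stopwords = ['was', 'to', 'get', 'very', '']

filtered = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered)
```

The result is the same as with a plain list; only the lookup speed differs, which starts to matter with long texts and large stop lists.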

Before stopword removal, the top 20 tokens are:

```python
from collections import Counter
print(Counter(tokens).most_common(20))
```

```
[('the', 1527), ('and', 802), ('to', 725), ('a', 615), ('I', 543), ('it', 527), ('she', 509), ('of', 500), ('said', 456), ('Alice', 396), ('in', 357), ('was', 352), ('you', 345), ('that', 275), ('as', 246), ('her', 243), ('t', 216), ('at', 202), ('s', 195), ('on', 189)]
```

After lowercasing and stopword removal, they are:

```python
stopwords = open('stopwords.txt','r').read().splitlines()
tokens = [t.lower() for t in tokens if t.lower() not in stopwords]

print(Counter(tokens).most_common(20))
```

```
[('alice', 398), ('queen', 75), ('time', 71), ('king', 63), ('don', 61), ('turtle', 59), ('ll', 57), ('hatter', 56), ('mock', 56), ('gryphon', 55), ('rabbit', 51), ('head', 50), ('voice', 48), ('looked', 45), ('ve', 44), ('mouse', 44), ('duchess', 42), ('round', 41), ('tone', 40), ('dormouse', 40)]
```


## Final look

We can now call `most_common` again to look at the top `n` most frequent words in our dataset.

```python
counter = Counter(tokens).most_common(30)
```

**P.S.** By now you should be quite familiar with Python, Jupyter Notebook, and their common operations, so we won't constantly remind you how to inspect the content of variables or results. As a refresher, you can use either of the following commands:

```python
print(counter)
```

or simply

```python
counter
```

Both display the content of the variable; the bare name works when it is the last line of a notebook cell.
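It can also help to know the shape of what `most_common` returns: a list of `(word, count)` tuples, sorted from most to least frequent. The sketch below unpacks those tuples to print one word per line. The token list and the column widths are made up for illustration.

```python
from collections import Counter

# Hypothetical tokens for illustration only.
sample_tokens = ['alice', 'queen', 'alice', 'king', 'queen', 'alice']

top = Counter(sample_tokens).most_common(2)

# Each entry is a (word, count) tuple, so it can be unpacked in a loop.
for word, count in top:
    print(f"{word:<10}{count:>5}")
```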


Once you have done that, you will see that the list of tokens is tidier and makes more sense, even to somebody who has not read *Alice's Adventures in Wonderland*. The names of main characters such as *alice*, *rabbit*, and *queen* are all there.

This concludes your first exercise with NLP. The list of tokens forms the basis of your future work. What we have learnt here is how to convert unstructured text into a numerical representation by counting how often each word occurs. In the next exercise, we will look at handling multiple documents and how to retrieve information.
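The steps of this exercise can be gathered into a single function, sketched below under a few assumptions: the name `top_words`, its parameters, and the sample data are all hypothetical, and the tokenizer is the same `\w+` pattern used earlier.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=20):
    """Tokenize, lowercase, drop stopwords, and count the n most frequent words."""
    stop_set = {w.lower() for w in stopwords}
    tokens = re.findall(r'\w+', text)
    kept = [t.lower() for t in tokens if t.lower() not in stop_set]
    return Counter(kept).most_common(n)

# Hypothetical sample data standing in for alice.txt and stopwords.txt.
sample_text = "The Queen said to Alice: the Queen is the Queen."
sample_stopwords = ['the', 'to', 'is', 'said']

result = top_words(sample_text, sample_stopwords, n=2)
print(result)
```

With the real files, the call would take the loaded `text` and `stopwords` in place of the sample values.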


```python
print(tokens)
```

