# Basic Text Pre-Processing Technologies

The aim of this tutorial is to demonstrate the basic technologies used to pre-process text data in the text mining, Information Retrieval (IR) and Natural Language Processing (NLP) communities. These technologies include:
* Tokenizing text
* Removing stop words
* Stemming & Lemmatization
* Sentence segmentation

The ultimate goal of pre-processing text is to convert unstructured, free-language text into structured data so that text analysis algorithms can take the structured data directly as input. For example, the UCI Machine Learning Repository provides free downloads of bag-of-words datasets that contain ENRON emails, NIPS articles, New York Times news articles, and <a href="https://www.ncbi.nlm.nih.gov/pubmed">PubMed</a> articles. These are benchmark datasets used in text analysis. Let's have a look at one of them, PubMed. The image below shows a screenshot of the first 15 lines of the dataset:
<img src="pubmed_example.png">

The first three lines are the total number of PubMed abstracts, the vocabulary size, and the number of non-zero (document, word) count entries in the dataset. Each abstract is stored in a sparse format that is often used in text analysis, where each row contains the **document ID**, the **word index** and the corresponding **word count** in the document. For example, "1 6811 1" means that word 6811 appears in document 1 exactly once. To find the word string for "6811", you go to the vocabulary and look up the 6811th word. Now, how can we pre-process text data and save the processed data in this sparse format?
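
To make the sparse format concrete, here is a minimal sketch that turns already tokenized documents into `docID wordID count` triples. The two toy documents and the helper name `to_sparse_triples` are assumptions made for illustration and are not part of the UCI data; IDs start at 1 to match the example above.

```python
from collections import Counter

def to_sparse_triples(docs):
    """Return (vocabulary, triples) for a list of token lists.

    Each triple is (doc_id, word_id, count), with 1-based IDs.
    """
    # Sort the vocabulary so that word IDs are reproducible.
    vocab = sorted({tok for doc in docs for tok in doc})
    word_id = {w: i for i, w in enumerate(vocab, start=1)}
    triples = []
    for doc_id, doc in enumerate(docs, start=1):
        counts = Counter(doc)
        # Emit one row per distinct word, ordered by word ID.
        for w in sorted(counts, key=word_id.get):
            triples.append((doc_id, word_id[w], counts[w]))
    return vocab, triples

# Hypothetical toy documents, already tokenized.
toy_docs = [["lobe", "nodule", "lobe"], ["nodule", "resolution"]]
vocab, triples = to_sparse_triples(toy_docs)
for row in triples:
    print(*row)
```

Only words that actually occur in a document produce a row, which is what keeps the format compact for large vocabularies.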

Assume that we are going to analyze some medical reports about fungal disease. The goal of the analysis is to **predict how likely it is that a patient has a fungal infection, given a diagnostic report**. The prediction can be formulated as a **classification task** in which we **assign a binary label to a patient**: 1 means the patient has a fungal infection, and 0 means the patient does not.

The text in the following cell contains a short diagnostic report for a patient. In this tutorial, you are going to learn **the basic techniques** often used in preprocessing text. In the next tutorial, you will learn how to put these techniques together to build the vocabulary and generate the final structured data.

In [226]:
raw_text = """Previous right upper lobe nodule? Fungal question resolution change. 
Findings: Comparison is made to prior CT dated November 30, 2004. Significant resolution 
in the previously noted fluid overload status. Ectasia of the thoracic aorta measuring 4.2 cm.
Features of generalised centrilobular emphysema. Resolution of right upper lobe nodule. 
There is now presence of a nodule within the medial segment of the right lower lobe which 
measures 5.4 mm and is non-specific in nature. Given the interval development of this 
fungal/inflammatory aetiology is likely. There is a 13 mm right axillary node which is a new 
finding since the prior study. No significant mediastinal or hilar adenopathy. Conclusion: 
Nodule in the right lower lobe in keeping with fungal/inflammatory aetiology."""
raw_text = raw_text.lower()

## 1. Word Tokenization

Now, we need to think about how to break such a long sequence of characters into word tokens. The task of breaking a character sequence into pieces is known as tokenization. In the lecture, we covered different tokenizers built into NLTK, for example the whitespace tokenizer (**WhitespaceTokenizer**, which splits on whitespace only) and the regular expression tokenizer (**RegexpTokenizer**, which splits according to a regular expression). You can find more information on the NLTK website, e.g.,
* <a href="http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize">tokenize module</a> in nltk.
* <a href="http://www.nltk.org/howto/tokenize.html">tokenize</a>: shows you how to use Treebank tokenizer and Regexp tokenizer

You can also refer to the Jupyter Notebook we provided. After tokenizing the <font color="brown">raw_text</font>, you should obtain the following list of tokens:
```
['previous', 'right', 'upper', 'lobe', 'nodule', 'fungal', 'question', 'resolution', 'change', 'findings', 'comparison', 'is', 'made', 'to', 'prior', 'ct', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'in', 'the', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'of', 'the', 'thoracic', 'aorta', 'measuring', '4.2', 'cm', 'features', 'of', 'generalised', 'centrilobular', 'emphysema', 'resolution', 'of', 'right', 'upper', 'lobe', 'nodule', 'there', 'is', 'now', 'presence', 'of', 'a', 'nodule', 'within', 'the', 'medial', 'segment', 'of', 'the', 'right', 'lower', 'lobe', 'which', 'measures', '5.4', 'mm', 'and', 'is', 'non-specific', 'in', 'nature', 'given', 'the', 'interval', 'development', 'of', 'this', 'fungal', 'inflammatory', 'aetiology', 'is', 'likely', 'there', 'is', 'a', '13', 'mm', 'right', 'axillary', 'node', 'which', 'is', 'a', 'new', 'finding', 'since', 'the', 'prior', 'study', 'no', 'significant', 'mediastinal', 'or', 'hilar', 'adenopathy', 'conclusion', 'nodule', 'in', 'the', 'right', 'lower', 'lobe', 'in', 'keeping', 'with', 'fungal', 'inflammatory', 'aetiology']
```

### RegexpTokenizer -- splitting with a regular expression

In [227]:
from nltk.tokenize import RegexpTokenizer 
unigram_tokens = RegexpTokenizer(r"\w+(?:[-.]\w+)?").tokenize(raw_text)
print (unigram_tokens)
len(set(unigram_tokens))

['previous', 'right', 'upper', 'lobe', 'nodule', 'fungal', 'question', 'resolution', 'change', 'findings', 'comparison', 'is', 'made', 'to', 'prior', 'ct', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'in', 'the', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'of', 'the', 'thoracic', 'aorta', 'measuring', '4.2', 'cm', 'features', 'of', 'generalised', 'centrilobular', 'emphysema', 'resolution', 'of', 'right', 'upper', 'lobe', 'nodule', 'there', 'is', 'now', 'presence', 'of', 'a', 'nodule', 'within', 'the', 'medial', 'segment', 'of', 'the', 'right', 'lower', 'lobe', 'which', 'measures', '5.4', 'mm', 'and', 'is', 'non-specific', 'in', 'nature', 'given', 'the', 'interval', 'development', 'of', 'this', 'fungal', 'inflammatory', 'aetiology', 'is', 'likely', 'there', 'is', 'a', '13', 'mm', 'right', 'axillary', 'node', 'which', 'is', 'a', 'new', 'finding', 'since', 'the', 'prior', 'study', 'no', 'significant', 'mediastinal', 'or', 'hilar', 'adenopathy', 'c

76

### WhitespaceTokenizer -- splitting on whitespace

In [228]:
from nltk.tokenize import WhitespaceTokenizer
tokens_white = WhitespaceTokenizer().tokenize(raw_text)
print(tokens_white)
len(set(tokens_white))

['previous', 'right', 'upper', 'lobe', 'nodule?', 'fungal', 'question', 'resolution', 'change.', 'findings:', 'comparison', 'is', 'made', 'to', 'prior', 'ct', 'dated', 'november', '30,', '2004.', 'significant', 'resolution', 'in', 'the', 'previously', 'noted', 'fluid', 'overload', 'status.', 'ectasia', 'of', 'the', 'thoracic', 'aorta', 'measuring', '4.2', 'cm.', 'features', 'of', 'generalised', 'centrilobular', 'emphysema.', 'resolution', 'of', 'right', 'upper', 'lobe', 'nodule.', 'there', 'is', 'now', 'presence', 'of', 'a', 'nodule', 'within', 'the', 'medial', 'segment', 'of', 'the', 'right', 'lower', 'lobe', 'which', 'measures', '5.4', 'mm', 'and', 'is', 'non-specific', 'in', 'nature.', 'given', 'the', 'interval', 'development', 'of', 'this', 'fungal/inflammatory', 'aetiology', 'is', 'likely.', 'there', 'is', 'a', '13', 'mm', 'right', 'axillary', 'node', 'which', 'is', 'a', 'new', 'finding', 'since', 'the', 'prior', 'study.', 'no', 'significant', 'mediastinal', 'or', 'hilar', 'adenop

79

### Differences:

In [229]:
# Tokens produced by the whitespace tokenizer that the regexp tokenizer does not produce.
# A list comprehension leaves tokens_white unchanged and catches repeated tokens as well.
unigram_set = set(unigram_tokens)
diff_tokens = [t for t in tokens_white if t not in unigram_set]
print(diff_tokens)

['nodule?', 'change.', 'findings:', '30,', '2004.', 'status.', 'cm.', 'emphysema.', 'nodule.', 'nature.', 'fungal/inflammatory', 'likely.', 'study.', 'adenopathy.', 'conclusion:', 'fungal/inflammatory', 'aetiology.']


**Note**:  
<font color='blue'>**WhitespaceTokenizer**</font> simply splits the text on whitespace, so non-alphanumeric characters (such as `?`, `,`, `.`, etc.) stay attached to the tokens.
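
The same contrast can be reproduced with the standard library alone. The sketch below is only an illustration: it uses `str.split` as a stand-in for whitespace tokenization and `re.findall` with the pattern from the RegexpTokenizer cell above, applied to one sentence of the report.

```python
import re

sample = "ectasia of the thoracic aorta measuring 4.2 cm."

# Splitting on whitespace keeps trailing punctuation attached to the word.
whitespace_tokens = sample.split()

# Same pattern as in the RegexpTokenizer cell: a run of word characters,
# optionally followed by one '-' or '.' and another run of word characters.
regex_tokens = re.findall(r"\w+(?:[-.]\w+)?", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The optional group is what keeps `4.2` together while dropping the full stop after `cm`, because that full stop is not followed by another word character.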

### Multi-word expressions

The tokens are all unigrams. Apart from the hyphenated token `"non-specific"` and numbers such as `"4.2"`, every token is a single word. As we know, **phrases are often more meaningful than single words**, so it would be good to tokenize a text in a way that keeps phrases as phrases. The question is then how we can **merge multi-word expressions into single tokens**. Assume that the following multi-word expressions should be treated as single tokens:
* "<font color="red">generalised centrilobular emphysema</font>"
* "<font color="red">inflammatory aetiology</font>"
* "<font color="red">lobe nodule</font>"
* "<font color="red">axillary node</font>"
* "<font color="red">thoracic aorta measuring</font>"
* "<font color="red">hilar adenopathy</font>"

In other words, **these phrases must not be split into individual words**. Fortunately, NLTK provides a <a href="http://www.nltk.org/_modules/nltk/tokenize/mwe.html#MWETokenizer">multi-word expression tokenizer</a>. The output should be
```
['previous', 'right', 'upper', 'lobe_nodule', 'fungal', 'question', 'resolution', 'change', 'findings', 'comparison', 'is', 'made', 'to', 'prior', 'ct', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'in', 'the', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'of', 'the', 'thoracic_aorta_measuring', '4.2', 'cm', 'features', 'of', 'generalised_centrilobular_emphysema', 'resolution', 'of', 'right', 'upper', 'lobe_nodule', 'there', 'is', 'now', 'presence', 'of', 'a', 'nodule', 'within', 'the', 'medial', 'segment', 'of', 'the', 'right', 'lower', 'lobe', 'which', 'measures', '5.4', 'mm', 'and', 'is', 'non-specific', 'in', 'nature', 'given', 'the', 'interval', 'development', 'of', 'this', 'fungal', 'inflammatory_aetiology', 'is', 'likely', 'there', 'is', 'a', '13', 'mm', 'right', 'axillary_node', 'which', 'is', 'a', 'new', 'finding', 'since', 'the', 'prior', 'study', 'no', 'significant', 'mediastinal', 'or', 'hilar_adenopathy', 'conclusion', 'nodule', 'in', 'the', 'right', 'lower', 'lobe', 'in', 'keeping', 'with', 'fungal', 'inflammatory_aetiology']
```

First, think about how to extend the list of unique words given by the unigram tokenizer above. To get the list of unique tokens (76 in total), you can apply the <font color="blue">set</font> function, convert the set to a list, and then append the multi-word phrases to that list as tuples.

In [259]:
uni_voc = list(set(unigram_tokens))
# include the Multi-word expression
uni_voc.append(('generalised', 'centrilobular', 'emphysema'))
uni_voc.append(('inflammatory', 'aetiology'))
uni_voc.append(('hilar', 'adenopathy'))
uni_voc.append(('lobe', 'nodule'))
uni_voc.append(('axillary', 'node'))
uni_voc.append(('thoracic', 'aorta', 'measuring'))
print(uni_voc)

['and', 'fluid', 'resolution', 'inflammatory', 'or', '2004', '13', 'lobe', 'previous', 'question', 'is', 'which', 'ct', 'noted', 'conclusion', 'medial', 'mediastinal', 'nature', 'centrilobular', 'study', 'presence', 'significant', 'likely', 'right', 'finding', 'axillary', 'upper', 'this', 'generalised', 'aorta', 'a', 'keeping', 'adenopathy', 'the', 'with', 'dated', 'new', 'overload', 'cm', '4.2', 'november', 'nodule', 'interval', 'node', 'no', 'now', 'status', 'emphysema', 'development', 'to', 'previously', 'given', 'aetiology', 'findings', 'made', 'thoracic', 'features', 'mm', 'hilar', 'non-specific', 'of', 'there', 'in', 'segment', 'ectasia', 'lower', 'within', 'measuring', 'fungal', '30', '5.4', 'measures', 'change', 'comparison', 'prior', 'since', ('generalised', 'centrilobular', 'emphysema'), ('inflammatory', 'aetiology'), ('hilar', 'adenopathy'), ('lobe', 'nodule'), ('axillary', 'node'), ('thoracic', 'aorta', 'measuring')]


Then, tokenize the <font color="brown">**raw_text**</font> with multi-word expressions.

In [260]:
from nltk.tokenize import MWETokenizer #Multi-words Expression Tokenizer
mwe_tokenizer = MWETokenizer(uni_voc) #input the unique tokens
mwe_tokens = mwe_tokenizer.tokenize(unigram_tokens)
print(set(unigram_tokens))
print("\n-----After MWE------\n")
print(set(mwe_tokens))
print("\n-----Difference------\n")
print(set(mwe_tokens)-set(unigram_tokens))

{'and', 'fluid', 'resolution', 'inflammatory', 'or', '2004', '13', 'lobe', 'previous', 'question', 'is', 'which', 'ct', 'noted', 'conclusion', 'medial', 'mediastinal', 'nature', 'centrilobular', 'study', 'presence', 'significant', 'likely', 'right', 'finding', 'axillary', 'upper', 'this', 'generalised', 'aorta', 'a', 'keeping', 'adenopathy', 'the', 'with', 'dated', 'new', 'overload', 'cm', '4.2', 'november', 'nodule', 'interval', 'node', 'no', 'now', 'status', 'emphysema', 'development', 'to', 'previously', 'given', 'aetiology', 'findings', 'made', 'thoracic', 'features', 'mm', 'hilar', 'non-specific', 'of', 'there', 'in', 'segment', 'ectasia', 'lower', 'within', 'measuring', 'fungal', '30', '5.4', 'measures', 'change', 'comparison', 'prior', 'since'}

-----After MWE------

{'and', 'fluid', 'resolution', 'generalised_centrilobular_emphysema', 'or', '2004', '13', 'previous', 'lobe', 'question', 'is', 'which', 'ct', 'axillary_node', 'noted', 'conclusion', 'medial', 'mediastinal', 'nature

**Note**:  
1. `'lobe', 'nodule'` has become `'lobe_nodule'` 
2. `'thoracic', 'aorta', 'measuring'` has become `'thoracic_aorta_measuring'`
3. `'generalised', 'centrilobular', 'emphysema'` has become  `'generalised_centrilobular_emphysema'`
4. `'hilar', 'adenopathy'` has become `'hilar_adenopathy'`
5. `'inflammatory', 'aetiology'` has become `'inflammatory_aetiology'`
6. `'axillary', 'node'` has become `'axillary_node'`
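
To see what merging multi-word expressions amounts to, here is a simplified pure-Python sketch. It is not NLTK's implementation; the function name `merge_mwes` and its greedy longest-match strategy are assumptions chosen for illustration.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge the longest matching multi-word expression at each position."""
    # Try longer expressions first so that the longest match wins;
    # empty expressions are skipped because they could never advance the scan.
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["resolution", "of", "right", "upper", "lobe", "nodule"]
print(merge_mwes(tokens, [("lobe", "nodule"), ("axillary", "node")]))
```

A token is only merged when the whole expression matches in order, which is why a lone `lobe` or `nodule` elsewhere in the report stays a unigram.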

You can also try the different tokenizers online at http://text-processing.com/demo/tokenize/.

The <font color='brown'>**raw_text**</font> has been split into a list of tokens that contains **both unigrams and multi-word expressions**. However, the list contains many function words, such as "to", "in", "the", "is" and so on. These words usually contribute little to the semantics of the text and mainly increase the dimensionality of the data in text analysis. Also, recall that our goal is to build a classification model for predicting fungal disease, so we are more interested in the meaning of the diagnostic report than in its syntax. Therefore, we can choose to **remove those words**, which is your next task.

## 2. Stop Words Removal

As discussed in the lecture and in the Jupyter Notebook, **stop words carry little lexical content**. They are often function words in English, for example articles, pronouns, particles, and so on. In NLP and IR, we usually **exclude stop words from the vocabulary**; otherwise they inflate the vocabulary and add to the curse of dimensionality. There are exceptions: in syntactic analysis such as parsing, we keep those function words. Here, however, you are going to remove all the stop words in the above list by using the stop word list in NLTK, which is

In [232]:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
print(stopwords_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Now, it is your turn to remove all the stop words from the output of your <b>MWETokenizer</b>.

> **Note**: the difference between <font color="blue">**list**</font> and <font color="blue">**set**</font>.  
Sets are **significantly faster** when your task is to determine whether an object **is present in the collection**,
but they can be slightly **slower** than lists when you **iterate over the elements**.
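
The membership claim is easy to check yourself. The sketch below times the `in` operator on a list and on a set; the 10,000 made-up words and the probe word are assumptions for illustration, and the measured times will vary from machine to machine.

```python
import timeit

# Hypothetical vocabulary of 10,000 distinct strings.
words_list = [f"w{i}" for i in range(10_000)]
words_set = set(words_list)
probe = "w9999"  # last element: the worst case for a linear scan of the list

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in words_list, number=200)
set_time = timeit.timeit(lambda: probe in words_set, number=200)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

A list has to be scanned element by element, whereas a set looks the word up by its hash, which is why converting the stop word list to a set before filtering is worthwhile.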

The list of tokens after stop word removal should be:
```Python
['previous', 'right', 'upper', 'lobe_nodule', 'fungal', 'question', 'resolution', 'change', 'findings', 'comparison', 'made', 'prior', 'ct', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'thoracic_aorta_measuring', '4.2', 'cm', 'features', 'generalised_centrilobular_emphysema', 'resolution', 'right', 'upper', 'lobe_nodule', 'presence', 'nodule', 'within', 'medial', 'segment', 'right', 'lower', 'lobe', 'measures', '5.4', 'mm', 'non-specific', 'nature', 'given', 'interval', 'development', 'fungal', 'inflammatory_aetiology', 'likely', '13', 'mm', 'right', 'axillary_node', 'new', 'finding', 'since', 'prior', 'study', 'significant', 'mediastinal', 'hilar_adenopathy', 'conclusion', 'nodule', 'right', 'lower', 'lobe', 'keeping', 'fungal', 'inflammatory_aetiology']
```

In [233]:
stopwords_set = set(stopwords_list)
stopped_tokens = [w for w in mwe_tokens if w not in stopwords_set] #use sets to check if word present
print(stopped_tokens)

['previous', 'right', 'upper', 'lobe_nodule', 'fungal', 'question', 'resolution', 'change', 'findings', 'comparison', 'made', 'prior', 'ct', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'thoracic_aorta_measuring', '4.2', 'cm', 'features', 'generalised_centrilobular_emphysema', 'resolution', 'right', 'upper', 'lobe_nodule', 'presence', 'nodule', 'within', 'medial', 'segment', 'right', 'lower', 'lobe', 'measures', '5.4', 'mm', 'non-specific', 'nature', 'given', 'interval', 'development', 'fungal', 'inflammatory_aetiology', 'likely', '13', 'mm', 'right', 'axillary_node', 'new', 'finding', 'since', 'prior', 'study', 'significant', 'mediastinal', 'hilar_adenopathy', 'conclusion', 'nodule', 'right', 'lower', 'lobe', 'keeping', 'fungal', 'inflammatory_aetiology']


Of course, you can use a richer stopword list, such as the one used in the lecture. You can also expand the stopword list by adding corpus-specific stop words, for example **very frequent words** (words that appear in every document and therefore do not help us distinguish documents). For example, the following words appear in each diagnostic report. (In the next tutorial, we will demonstrate how to use basic statistics to identify them.)
* <font color="red">ct</font>
* <font color="red">mm</font>
* <font color="red">cm</font>
* <font color="red">fungal</font>
* <font color="red">conclusion</font>
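
One simple statistic for spotting such words is the document frequency, i.e. the number of documents a word occurs in. The following is a minimal sketch only: the three toy reports and the helper name `words_in_every_document` are invented for illustration and are not the data used in this tutorial.

```python
from collections import Counter

def words_in_every_document(docs):
    """Return the set of words whose document frequency equals len(docs)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word at most once per document
    return {w for w, n in df.items() if n == len(docs)}

# Hypothetical tokenized reports, invented for illustration only.
reports = [
    ["ct", "nodule", "mm", "conclusion"],
    ["ct", "effusion", "mm", "conclusion"],
    ["ct", "mm", "emphysema", "conclusion"],
]
print(sorted(words_in_every_document(reports)))
```

In practice a threshold such as "occurs in more than 95% of the documents" is often used instead of "occurs in all of them"; the cut-off is a modelling choice.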

Your task is to expand the stopword list with the five words listed above and process the list of tokens again. The output should be
```Python
['previous', 'right', 'upper', 'lobe_nodule', 'question', 'resolution', 'change', 'findings', 'comparison', 'made', 'prior', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'thoracic_aorta_measuring', '4.2', 'features', 'generalised_centrilobular_emphysema', 'resolution', 'right', 'upper', 'lobe_nodule', 'presence', 'nodule', 'within', 'medial', 'segment', 'right', 'lower', 'lobe', 'measures', '5.4', 'non-specific', 'nature', 'given', 'interval', 'development', 'inflammatory_aetiology', 'likely', '13', 'right', 'axillary_node', 'new', 'finding', 'since', 'prior', 'study', 'significant', 'mediastinal', 'hilar_adenopathy', 'nodule', 'right', 'lower', 'lobe', 'keeping', 'inflammatory_aetiology']
```

In [234]:
### write your code below
stopwords_set.add('ct')
stopwords_set.add('mm')
stopwords_set.add('cm')
stopwords_set.add('fungal')
stopwords_set.add('conclusion')
stopped_tokens = [w for w in mwe_tokens if w not in stopwords_set]
print(set(stopped_tokens))

{'fluid', 'resolution', 'generalised_centrilobular_emphysema', '2004', '13', 'previous', 'lobe', 'question', 'axillary_node', 'noted', 'medial', 'mediastinal', 'nature', 'study', 'presence', 'significant', 'likely', 'right', 'finding', 'upper', 'keeping', 'dated', 'new', 'overload', 'lobe_nodule', '4.2', 'november', 'nodule', 'interval', 'hilar_adenopathy', 'status', 'thoracic_aorta_measuring', 'inflammatory_aetiology', 'development', 'previously', 'given', 'findings', 'made', 'features', 'non-specific', 'segment', 'ectasia', 'lower', 'within', '5.4', '30', 'measures', 'change', 'comparison', 'prior', 'since'}


Again, we should inspect the output, which is very good practice in data preprocessing. You will find that we have words like `"finding"` and `"findings"`, `"previous"` and `"previously"`, `"noted"`, etc. Should we keep them as they are, or should we reduce them to their base form?

## 3. Stemming, Lemmatization, Sentence Segmentation and POS Tagging

The task of stemming and lemmatization is to **reduce the different lexical forms of a word to its base form** in the lexicon without significantly losing the meaning. In English, nouns are inflected for the plural, verbs for the various tenses, and adjectives for the comparative and superlative. In morphology, derivation creates a new word out of an existing one, often by adding a prefix or a suffix. In this exercise, you are going to apply the <font color='blue'>**WordNetLemmatizer**</font> provided in NLTK's <a href="http://www.nltk.org/api/nltk.stem.html">stem</a> package. Note that the <font color='blue'>**WordNetLemmatizer**</font> can take **the POS tag** of each word as an argument; specifying it gives a more accurate base form of the word.

POS taggers work on one sentence at a time, so you should first carry out sentence segmentation. Your code should produce
```
previous right upper lobe nodule?
fungal question resolution change.
findings: comparison is made to prior ct dated november 30, 2004. significant resolution in the previously noted fluid overload status.
ectasia of the thoracic aorta measuring 4.2 cm.
features of generalised centrilobular emphysema.
resolution of right upper lobe nodule.
there is now presence of a nodule within the medial segment of the right lower lobe which measures 5.4 mm and is non-specific in nature.
given the interval development of this fungal/inflammatory aetiology is likely.
there is a 13 mm right axillary node which is a new finding since the prior study.
no significant mediastinal or hilar adenopathy.
conclusion: nodule in the right lower lobe in keeping with fungal/inflammatory aetiology.
```

In order to segment the given text into sentences, you can refer to the Jupyter Notebook (chapter 1) or search for "<b>Punkt Sentence Tokenizer</b>" at http://www.nltk.org/api/nltk.tokenize.html
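
For intuition only, the sketch below shows a naive rule-based splitter built with the standard library. It is not the Punkt algorithm; the function name `naive_sentence_split` and the rule "split after `.`, `?` or `!` when whitespace follows" are assumptions chosen for illustration.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '?' or '!' when the mark is followed by whitespace."""
    # Collapse line breaks first so that wrapped lines do not matter.
    flat = " ".join(text.split())
    return re.split(r"(?<=[.?!])\s+", flat)

sample = ("ectasia of the thoracic aorta measuring 4.2 cm. "
          "features of generalised centrilobular emphysema.")
for sentence in naive_sentence_split(sample):
    print(sentence)
```

The decimal point in `4.2` is not treated as a boundary because no whitespace follows it. A rule like this still breaks on abbreviations that end in a full stop, which is one reason to prefer a trained tokenizer such as Punkt.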

### Punkt Sentence Tokenizer -- sentence segmentation

In [261]:
import nltk.data
# Load the pre-trained Punkt model for English.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_detector.tokenize(raw_text.strip())
print(sent_detector)
# Number the sentences starting from 1.
for i, sent in enumerate(sentences, start=1):
    print(i, sent)

<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x10a2133c8>
1 previous right upper lobe nodule?
2 fungal question resolution change.
3 findings: comparison is made to prior ct dated november 30, 2004. significant resolution 
in the previously noted fluid overload status.
4 ectasia of the thoracic aorta measuring 4.2 cm.
5 features of generalised centrilobular emphysema.
6 resolution of right upper lobe nodule.
7 there is now presence of a nodule within the medial segment of the right lower lobe which 
measures 5.4 mm and is non-specific in nature.
8 given the interval development of this 
fungal/inflammatory aetiology is likely.
9 there is a 13 mm right axillary node which is a new 
finding since the prior study.
10 no significant mediastinal or hilar adenopathy.
11 conclusion: 
nodule in the right lower lobe in keeping with fungal/inflammatory aetiology.


Then, we will use **the POS tagger** to assign a POS tag to each word in each sentence. Please refer to Section 1 of http://www.nltk.org/book/ch05.html. The steps are:

For each sentence
1. use the unigram tokenizer you developed in Exercise 1 to tokenize the sentence
2. use the **MWETokenizer** used in Exercise 1 to tokenize the sentence with **multi-word expressions (MWE)**  
3. use the information in Section 1 of http://www.nltk.org/book/ch05.html to help you finish the POS tagging.
4. remove the stop words in each sentence

Finally, save the tagged sentences in a list. The output you obtain should be
```
[[('previous', 'JJ'), ('right', 'JJ'), ('upper', 'NN'), ('lobe_nodule', 'NN')], [('question', 'NN'), ('resolution', 'NN'), ('change', 'NN')], 
[('findings', 'NNS'), ('comparison', 'NN'), ('made', 'VBN'), ('prior', 'VB'), ('dated', 'JJ'), ('november', 'RB'), ('30', 'CD'), ('2004', 'CD'), ('significant', 'JJ'), ('resolution', 'NN'), ('previously', 'RB'), ('noted', 'VBN'), ('fluid', 'NN'), ('overload', 'NN'), ('status', 'NN')], 
[('ectasia', 'NN'), ('thoracic_aorta_measuring', 'VBG'), ('4.2', 'CD')], 
[('features', 'NNS'), ('generalised_centrilobular_emphysema', 'NN')], 
[('resolution', 'NN'), ('right', 'JJ'), ('upper', 'JJ'), ('lobe_nodule', 'NN')], [('presence', 'NN'), ('nodule', 'NN'), ('within', 'IN'), ('medial', 'JJ'), ('segment', 'NN'), ('right', 'JJ'), ('lower', 'JJR'), ('lobe', 'NN'), ('measures', 'VBZ'), ('5.4', 'CD'), ('non-specific', 'JJ'), ('nature', 'NN')], 
[('given', 'VBN'), ('interval', 'NN'), ('development', 'NN'), ('inflammatory_aetiology', 'NN'), ('likely', 'JJ')], 
[('13', 'CD'), ('right', 'NN'), ('axillary_node', 'NN'), ('new', 'JJ'), ('finding', 'NN'), ('since', 'IN'), ('prior', 'JJ'), ('study', 'NN')], 
[('significant', 'JJ'), ('mediastinal', 'NN'), ('hilar_adenopathy', 'NN')], 
[('nodule', 'NN'), ('right', 'NN'), ('lower', 'JJR'), ('lobe', 'NN'), ('keeping', 'VBG'), ('inflammatory_aetiology', 'NN')]]
```

In [236]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

* `CC`: Coordinating conjunction
* `RB`: Adverb
* `IN`: Preposition
* `JJ`: Adjective
* `NN`: Noun

Now, your task is to fill in the for loop below by following the steps above. If you would like to know the meaning of a tag, you can type, for example,
```
nltk.help.upenn_tagset('NNP')
```
Replace `"NNP"` with the tag you are interested in to see its explanation.

In [244]:
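# Rebuild the unigram tokenizer from Exercise 1 (same pattern as above).
tokenizer = RegexpTokenizer(r"\w+(?:[-.]\w+)?")
tagged_sents = []
for sent in sentences:
    uni_sent = tokenizer.tokenize(sent)          # 1. unigram tokens
    mwe_text = mwe_tokenizer.tokenize(uni_sent)  # 2. merge multi-word expressions
    tagged_sent = nltk.tag.pos_tag(mwe_text)     # 3. POS-tag the whole sentence
    # 4. remove stop words only after tagging, so the tagger sees the full sentence
    stopped_tagged_sent = [x for x in tagged_sent if x[0] not in stopwords_set]
    tagged_sents.append(stopped_tagged_sent)
    # Uncomment the lines below to inspect the intermediate results.
    # print('sentence: ', sent)                  # single sentence
    # print('uni_sent: ', uni_sent)              # tokens
    # print('MWE: ', mwe_text)                   # check multi-word expressions
    # print('POS tagging: ', tagged_sent)
    # print('Stopped: ', stopped_tagged_sent)
    # print('----------')
print(tagged_sents)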

[[('previous', 'JJ'), ('right', 'JJ'), ('upper', 'NN'), ('lobe_nodule', 'NN')], [('question', 'NN'), ('resolution', 'NN'), ('change', 'NN')], [('findings', 'NNS'), ('comparison', 'NN'), ('made', 'VBN'), ('prior', 'VB'), ('dated', 'JJ'), ('november', 'RB'), ('30', 'CD'), ('2004', 'CD'), ('significant', 'JJ'), ('resolution', 'NN'), ('previously', 'RB'), ('noted', 'VBN'), ('fluid', 'NN'), ('overload', 'NN'), ('status', 'NN')], [('ectasia', 'NN'), ('thoracic_aorta_measuring', 'VBG'), ('4.2', 'CD')], [('features', 'NNS'), ('generalised_centrilobular_emphysema', 'NN')], [('resolution', 'NN'), ('right', 'JJ'), ('upper', 'JJ'), ('lobe_nodule', 'NN')], [('presence', 'NN'), ('nodule', 'NN'), ('within', 'IN'), ('medial', 'JJ'), ('segment', 'NN'), ('right', 'JJ'), ('lower', 'JJR'), ('lobe', 'NN'), ('measures', 'VBZ'), ('5.4', 'CD'), ('non-specific', 'JJ'), ('nature', 'NN')], [('given', 'VBN'), ('interval', 'NN'), ('development', 'NN'), ('inflammatory_aetiology', 'NN'), ('likely', 'JJ')], [('13', '

### WordNet and Lemmas

More about **WordNet**: see the [official doc](http://www.nltk.org/howto/wordnet.html).

In [238]:
from nltk.corpus import wordnet

The last step is to apply the <b>WordNetLemmatizer</b>.  

Your code should **make use of the POS tag of each word to decide the lexical base form**.  
The <font color="blue">**lemmatize**</font> function in <b>WordNetLemmatizer</b> accepts the following WordNet tags:
* `wordnet.ADJ`
* `wordnet.VERB`
* `wordnet.NOUN` 
* `wordnet.ADV`

The function for **converting POS tags to WordNet tags** is given below. In your code, you should think about how to call it.  
[code source](http://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python)

After POS tagging, each word is in the form `('word', 'tag')`. To make use of the `tag`, we map it to a WordNet tag, which improves the accuracy of the lemmatization:

In [239]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ # 'a'
    elif treebank_tag.startswith('V'):
        return wordnet.VERB # 'v'
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN # 'n'
    elif treebank_tag.startswith('R'):
        return wordnet.ADV # 'r'
    else:
        return wordnet.NOUN
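
To see how the mapping behaves on the tags that appear in this tutorial, here is an equivalent table-driven sketch that runs without NLTK. It assumes that the WordNet constants are the single letters shown in the comments above (`'a'`, `'v'`, `'n'`, `'r'`); the names `TAG_PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical.

```python
# Assumed to match wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], "n")

# Tags taken from the tagged sentences above.
for tag in ["JJR", "VBN", "NNS", "RB", "CD", "IN"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Only the first letter of the Treebank tag matters, so `JJ`, `JJR` and `JJS` all map to the adjective tag, and anything unrecognised (such as `CD` or `IN`) falls back to noun.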

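Now write your lemmatization code.

The `WordNetLemmatizer().lemmatize(word, pos='n')` method uses WordNet’s built-in morphy function and returns the input word unchanged if it cannot be found in WordNet.

**Note**: 
1. Don't forget the parentheses `()` after `WordNetLemmatizer`: the class has to be instantiated before `lemmatize` can be called.
2. The input should be in lowercase; otherwise the word may not be found in WordNet and is returned unchanged, as the example below shows.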

In [252]:
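from nltk.stem import WordNetLemmatizer
# The capitalised form is not found in WordNet and comes back unchanged.
case = WordNetLemmatizer().lemmatize('Implies', wordnet.VERB)
case_lower = WordNetLemmatizer().lemmatize('implies', wordnet.VERB)
print('Implies ->', case, '\nimplies ->', case_lower)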

Implies -> Implies 
implies -> imply


In [240]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
final_tokens = []
for tagged_set in tagged_sents:
    final_tokens.append([lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])) for w in tagged_set ])

You can compare the tokens with and without lemmatization. For example, if the list of tokens generated in Exercise 2 is `stopped_tokens`, you can use the following code to see the difference. Note that `final_tokens` is a list of lists (one list per sentence), so it has to be flattened before it can be turned into a set.
```python
set(t for sent in final_tokens for t in sent) - set(stopped_tokens)
```
```python
set(stopped_tokens) - set(t for sent in final_tokens for t in sent)
```

In [241]:
set([1,2,3,4])-set([1,3,4])

{2}
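
The same set difference can be applied to the tokens themselves. The sketch below uses two lemmatized sentences and the matching unlemmatized tokens copied from the outputs of this tutorial; the variable names are assumptions made for illustration.

```python
from itertools import chain

# Two lemmatized sentences and the matching unlemmatized tokens,
# copied from the outputs of this tutorial.
lemmatized_sents = [["feature", "generalised_centrilobular_emphysema"],
                    ["give", "interval", "development",
                     "inflammatory_aetiology", "likely"]]
unlemmatized = ["features", "generalised_centrilobular_emphysema",
                "given", "interval", "development",
                "inflammatory_aetiology", "likely"]

# Lists are unhashable, so flatten the nested list before building a set.
flat_lemmas = set(chain.from_iterable(lemmatized_sents))

introduced = sorted(flat_lemmas - set(unlemmatized))  # forms created by lemmatization
reduced = sorted(set(unlemmatized) - flat_lemmas)     # forms that were reduced
print(introduced)
print(reduced)
```

Reading the two differences side by side shows which surface forms were mapped to which base forms, here `features` to `feature` and `given` to `give`.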

To use the stemmers discussed in the lecture, you can simply replace the following line in the code cell above
```python
    lemmatizer = WordNetLemmatizer()
```
with 
```python
    stemmer = PorterStemmer()
```
and replace
```python
    lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])) 
```
with 
```python
    stemmer.stem(w[0])
```
Don't forget to import the corresponding module, for example `from nltk.stem import PorterStemmer`.

Note that the preprocessing is not finished yet. In the next tutorial, we will learn how to build the vocabulary by further removing the most and least frequent words, how to generate a numerical representation of a document, and so on.

In [242]:
print(final_tokens)

[['previous', 'right', 'upper', 'lobe_nodule'], ['question', 'resolution', 'change'], ['finding', 'comparison', 'make', 'prior', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'previously', 'note', 'fluid', 'overload', 'status'], ['ectasia', 'thoracic_aorta_measuring', '4.2'], ['feature', 'generalised_centrilobular_emphysema'], ['resolution', 'right', 'upper', 'lobe_nodule'], ['presence', 'nodule', 'within', 'medial', 'segment', 'right', 'low', 'lobe', 'measure', '5.4', 'non-specific', 'nature'], ['give', 'interval', 'development', 'inflammatory_aetiology', 'likely'], ['13', 'right', 'axillary_node', 'new', 'finding', 'since', 'prior', 'study'], ['significant', 'mediastinal', 'hilar_adenopathy'], ['nodule', 'right', 'low', 'lobe', 'keep', 'inflammatory_aetiology']]


In [243]:
print(stopped_tokens)

['previous', 'right', 'upper', 'lobe_nodule', 'question', 'resolution', 'change', 'findings', 'comparison', 'made', 'prior', 'dated', 'november', '30', '2004', 'significant', 'resolution', 'previously', 'noted', 'fluid', 'overload', 'status', 'ectasia', 'thoracic_aorta_measuring', '4.2', 'features', 'generalised_centrilobular_emphysema', 'resolution', 'right', 'upper', 'lobe_nodule', 'presence', 'nodule', 'within', 'medial', 'segment', 'right', 'lower', 'lobe', 'measures', '5.4', 'non-specific', 'nature', 'given', 'interval', 'development', 'inflammatory_aetiology', 'likely', '13', 'right', 'axillary_node', 'new', 'finding', 'since', 'prior', 'study', 'significant', 'mediastinal', 'hilar_adenopathy', 'nodule', 'right', 'lower', 'lobe', 'keeping', 'inflammatory_aetiology']
