# Please go to https://ccv.jupyter.brown.edu

# What we learned so far...
- Functions for making code organized and reusable
- Comprehensions for constructing sequences from other sequences
- Error handling

# 4. Intro to Natural Language Processing (NLP)
### By the end of the day, you'll be able to 
- describe the main goal of NLP in machine learning
- explain what text normalization involves
- explain the differences between stemming and lemmatization
- design code to pre-process a text corpus for downstream text analysis

## 4.1 Machine Learning and NLP

Typical problems you can solve:
- *classification and regression* problems like
    - grading essays
    - predicting stock prices from news articles (a Kaggle competition) and/or the president's tweets (a conference talk)
    - predicting the author of an article from a set of authors (e.g., anonymous NYT op-ed)
    - predicting the topic of articles from a set of topics

- *unsupervised* problems like
    - grouping documents based on similarity
    - extracting topics from a set of documents
- other
    - sentiment analysis
    - summarizing a corpus that is too long for a human to read

## 4.2 The steps in the NLP pipeline
1. Text normalization
2. Stem or lemmatize the words 
3. Collect the unique words in the corpus
4. Count how many times each unique word appears in the documents (count matrix)

## 4.3 Step 1: Text Normalization
### Can be tricky; spend some time considering the following:
- Converting all characters to lower case is usually a good idea
- Do we care about special characters? (social media corpus: yes!!!??? :) ;) :D :( )
- Do we care about numbers? (for scientific articles, probably)
- Are there non-English words in the corpus?
- Do we need to worry about misspelled words? (usually yes, unless you work with classics)
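
How much these choices matter shows up even on a single short string. The sketch below is only an illustration: `post` is a made-up social-media-style message, and the emoticon pattern is an assumption that covers just a handful of emoticons. It compares stripping every special character with pulling the emoticons out first.

```python
import re

# Hypothetical social-media post, used only for illustration.
post = "Loved it!!! :) 10/10 would watch again :D"

# Option A: lower-case, then drop everything that is not a word character or whitespace.
# The emoticons are destroyed and "10/10" becomes "1010".
plain = re.sub(r'[^\w\s]', '', post.lower())
plain = ' '.join(plain.split())  # collapse the leftover runs of spaces

# Option B: extract a few emoticons first, then clean the remaining text.
# The pattern below is an assumption; a real project would need a longer list.
emoticon_pattern = r'[:;][\)\(D]'
emoticons = re.findall(emoticon_pattern, post)
text_only = re.sub(emoticon_pattern, ' ', post)
text_only = re.sub(r'[^\w\s]', '', text_only.lower())
kept = text_only.split() + emoticons

print(plain)
print(kept)
```

In option A the `:D` leaves behind a stray `d` token, which is the kind of side effect worth checking for before settling on a normalization recipe.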

#### Robust and fast functions for text analysis have been implemented by Python developers in the `sklearn` and `nltk` packages. 
scikit-learn (`sklearn`):
- Simple and efficient tools for machine learning, data mining, and data analysis
- Built on NumPy, SciPy, and matplotlib

nltk (Natural Language ToolKit):
- Tools for working with human language data
- Over 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries

### 4.3.1 Document Normalization

In [None]:
doc = "  The 5 biggest countries by population in 2017 are: China, India, United States, Indonesia, and Brazil. \
Aren't they neat?   "
doc

In [None]:
doc = doc.lower()
doc

In [None]:
doc = doc.strip()
doc

In [None]:
import contractions
# fix() returns a new string, so assign it back to keep the expanded contractions
doc = contractions.fix(doc)
doc

In [None]:
import re
doc = re.sub(r'[^\w\s]', '', doc)
doc

### 4.3.2 Tokenization and Word Normalization

In [None]:
# tokenize
import nltk
nltk.download('punkt')
words = nltk.word_tokenize(doc)
words

In [None]:
# replace integer tokens with their textual representation, keeping the word order
import inflect
p = inflect.engine()

words = [p.number_to_words(word) if word.isdigit() else word for word in words]
words

In [None]:
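from nltk.corpus import stopwords
nltk.download('stopwords')
# note: this rebinds the name `stopwords` from the corpus reader to a set of English stopwords
stopwords = set(stopwords.words('english'))
words = [word for word in words if word not in stopwords]
words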

# Exercise 1
## Write a function to normalize a document and apply it to this corpus:

```
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']
```

In [None]:
# Solution



## 4.4 Step 2: Stemming and Lemmatization
#### Goal: reduce derived words to a common base.

An example list of unique words:
- ['time', 'timed', 'timing', 'times']
- ['was', 'is' ,'am', 'were']
- ['operate', 'operating', 'operates', 'operation', 'operative', 'operatives', 'operational']

#### Are these different forms of the same word or different words?

### 4.4.1 Example

#### Lemmatization tries to return the dictionary form of a word via morphological analysis (not perfect!)
Example: 'drunk'
- Stemming would give us 'drunk'
- Lemmatization would return 'drink' if 'drunk' is likely a verb in the sentence, or 'drunk' if it is likely a noun.

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

words = ['drunk','drunk']
pos_tags = ['n','v']

lemmatizer = WordNetLemmatizer()

for w,p in zip(words,pos_tags):
    print(w,p,'=>',lemmatizer.lemmatize(w,pos=p))

### 4.4.2 Stemming
Stemming reduces inflected words to a common stem by stripping suffixes. Related words map to the same stem, even if that stem is not a valid word in the language. See Porter (1980), "An algorithm for suffix stripping", Program 14(3): 130-137, for more details.

#### Some general rules:
- SSES => SS (e.g., dresses => dress)
- IES => I (e.g., ponies => poni)
- SS => SS (e.g., class => class)
- S => _ (e.g., cats => cat)
- EMENT => _ if the remaining stem is long enough, i.e., its Porter measure is greater than 1 (e.g., replacement => replac, but cement does not change to c)
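
The first four rules can be tried out in a few lines of plain Python. This is a toy sketch, not the Porter stemmer itself: `strip_plural_suffix` is a hypothetical helper, it covers only the four plural rules listed above, and it leaves out the EMENT rule because that one needs the stem-length condition.

```python
def strip_plural_suffix(word):
    """Apply the first matching suffix rule (toy version of one Porter step)."""
    # Rules are checked from the longest suffix to the shortest,
    # so 'sses' is tried before 'ss' and 's'.
    rules = [('sses', 'ss'), ('ies', 'i'), ('ss', 'ss'), ('s', '')]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word  # no rule matched, leave the word unchanged

for w in ['dresses', 'ponies', 'class', 'cats', 'time']:
    print(w, '=>', strip_plural_suffix(w))
```

The order of the rules matters: without the `SS => SS` rule ahead of `S => _`, 'class' would be cut down to 'clas'.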

In [None]:
from nltk.stem import PorterStemmer

words = ['time', 'timed', 'timing', 'times',
         'was', 'is', 'am', 'are',
         'operate', 'operating', 'operates', 'operation', 'operative', 'operatives', 'operational']

ps = PorterStemmer()

for w in words:
    print(w + ' => ' + ps.stem(w))

# Exercise 2

## Write a function to stem a document and apply it to `norm_corpus` from Exercise 1

In [None]:
# Solution



### Problem: stem of 'operate', 'operating', 'operates', 'operation', 'operative', 'operatives', 'operational' is _'oper'_

### 4.4.3 Lemmatization
Lemmatization, unlike stemming, reduces inflected words to a root that is itself a word in the language. This root is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

In [None]:
words = ['time', 'timed', 'timing', 'times',
         'was', 'is', 'am', 'are',
         'operate', 'operating', 'operates', 'operations', 'operative', 'operatives', 'operational']
lemmatizer = WordNetLemmatizer()
# without a pos argument, lemmatize() treats every word as a noun
for w in words:
    print(w + ' => ' + lemmatizer.lemmatize(w))

In [None]:
help(lemmatizer.lemmatize)

### Googled "python nltk lemmatization pos" and this came up:
https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python

In [None]:
words = ['time', 'timed', 'timing', 'times', 'was', 'is', 'am', 'are',
         'operate', 'operating', 'operates', 'operations', 'operative', 'operatives', 'operational',
         'drunk', 'drunk', 'saw', 'saw']
pos_tags = ['n', 'v', 'v', 'n', 'v', 'v', 'v', 'v',
            'v', 'v', 'v', 'n', 'n', 'n', 'a',
            'v', 'n', 'v', 'n']
lemmatizer = WordNetLemmatizer()
for w, p in zip(words, pos_tags):
    print(w + ' ' + p + ' => ' + lemmatizer.lemmatize(w, pos=p))

### `nltk` has a part-of-speech tagger that is useful for lemmatization

In [None]:
nltk.download('averaged_perceptron_tagger')
tagged_words = nltk.pos_tag(words)
for tw in tagged_words:
    print(tw)

# Exercise 3
## Write a function that prepares wordnet POS tags for input to the `lemmatize()` function

- Hint 1: There is a string method, `.startswith()`
- Hint 2: Use the `None` keyword for POS's other than nouns, verbs, adjectives

In [None]:
# Solution



# Exercise 4

##  Using the function from Exercise 3 to automatically tag POS, write a function that lemmatizes the documents in `norm_corpus` from Exercise 1.

In [None]:
# Solution



### 4.4.4 Pros and cons of stemming and lemmatization
- Stemming is faster but less accurate
- Lemmatization is slower but can be more accurate (though usually not by much)

## 4.5 Step 3: Collect the unique words in the corpus

You can use `set()` or `np.unique()` to collect the unique words.
```
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']
```

In [None]:
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']
norm_corpus = [normalize(doc, stopwords) for doc in corpus]
stemmed_corpus = [stemmatize(doc) for doc in norm_corpus]
lemmed_corpus = [lemmatize(doc) for doc in norm_corpus]

In [None]:
unique_words = {word for doc in lemmed_corpus for word in doc}
unique_words
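
Since the functions from the exercises are needed to run the cell above, here is a minimal standalone sketch of the same step. The token lists in `docs` are hypothetical stand-ins for a lemmatized corpus. A set has no guaranteed order, so the sketch sorts the result to get a stable vocabulary.

```python
# Hypothetical tokenized documents, standing in for the output of the exercises.
docs = [['character', 'say', 'best', 'time'],
        ['character', 'say', 'bad', 'time']]

# Flatten the list of lists and let the set drop the duplicates.
vocabulary = {word for doc in docs for word in doc}

# Sort the set so the vocabulary has the same order on every run.
sorted_vocabulary = sorted(vocabulary)
print(sorted_vocabulary)
```

`np.unique()` applied to the flattened list of words would give the same words as a sorted NumPy array, so the extra sorting step is not needed there.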

## 4.6 Step 4: Count how many times each unique word appears in the documents (count matrix)

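- If a word appears frequently in all documents, it does not help tell the documents apart, so it carries little useful information.
- If a word appears in only one (or very few) documents, it is of limited use too, because it cannot link documents to one another.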
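
One way to check both cases is to count, for every word, the number of documents it appears in. The sketch below assumes the corpus has already been normalized and lemmatized. The token lists in `docs` are written out by hand to match the count matrix in this section, and the names `docs` and `doc_freq` are only illustrative.

```python
from collections import Counter

# Hand-written token lists matching the count matrix in this section.
docs = [['character', 'say', 'best', 'time'],
        ['character', 'say', 'bad', 'time'],
        ['character', 'say', 'age', 'wisdom'],
        ['age', 'foolishness', 'charles', 'dickens']]

# Convert each document to a set first so a word is counted once per document.
doc_freq = Counter(word for doc in docs for word in set(doc))

n_docs = len(docs)
for word, n in sorted(doc_freq.items()):
    print(f'{word}: appears in {n} of {n_docs} documents')
```

Words near `n_docs` and words with a count of 1 are the two extremes described above.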

### The count matrix:
|-|character|say|age|bad|best|charles|dickens|foolishness|time|wisdom|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Doc1|1| 1| 0| 0| 1| 0| 0| 0| 1| 0|
|Doc2|1| 1| 0| 1| 0| 0| 0| 0| 1| 0|
|Doc3|1| 1| 1| 0| 0| 0| 0| 0| 0| 1|
|Doc4|0| 0| 1| 0| 0| 1| 1| 1| 0| 0|

### Word order does not matter!
|-|age|bad|best|character|charles|dickens|foolishness|say|time|wisdom|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Doc1|0| 0| 1| 1| 0| 0| 0| 1| 1| 0|
|Doc2|0| 1| 0| 1| 0| 0| 0| 1| 1| 0|
|Doc3|1| 0| 0| 1| 0| 0| 0| 1| 0| 1|
|Doc4|1| 0| 0| 0| 1| 1| 1| 0| 0| 0|

### Document vectors:
- 'The character said: It was the best of times, ' => 'character said best time' => [0,0,1,1,0,0,0,1,1,0]
- 'The character said: it was the worst of times, ' => 'character said worst time' => [0,1,0,1,0,0,0,1,1,0]
- 'The character said: it was the Age of Wisdom, ' => 'character said age wisdom' => [1,0,0,1,0,0,0,1,0,1]
- 'it was the Age of Foolishness. - Charles Dickens' => 'age foolishness charles dickens' => [1,0,0,0,1,1,1,0,0,0]
___
Key: ['age', 'bad', 'best', 'character', 'charles', 'dickens', 'foolishness', 'say', 'time', 'wisdom']
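
The document vectors can be rebuilt by hand from the key. This is a minimal sketch that assumes the documents are already lemmatized token lists (so 'said' has become 'say' and 'worst' has become 'bad'); `to_vector` is a hypothetical helper, not a library function.

```python
key = ['age', 'bad', 'best', 'character', 'charles',
       'dickens', 'foolishness', 'say', 'time', 'wisdom']

# Hand-written lemmatized tokens for the four documents.
docs = [['character', 'say', 'best', 'time'],
        ['character', 'say', 'bad', 'time'],
        ['character', 'say', 'age', 'wisdom'],
        ['age', 'foolishness', 'charles', 'dickens']]

def to_vector(tokens, key):
    """Return one count per word in key, in the order of key."""
    return [tokens.count(word) for word in key]

vectors = [to_vector(doc, key) for doc in docs]
for v in vectors:
    print(v)
```

Every vector has the same length as the key, no matter how long the original document was.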

### 4.6.1 Bag of Words

- A method to extract features from text documents
- Can be used for training machine learning algorithms
- Creates a vocabulary of unique words occurring in all documents
- Represents a document as word counts, disregarding the order in which they appear
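
The last point has a practical consequence that a tiny example makes visible. The two sentences below are made up for illustration; they use the same words in a different order and mean different things, yet their bags of words are identical.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order.
a = 'the dog chased the cat'
b = 'the cat chased the dog'

# A Counter maps each word to its count and ignores word order.
bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(bag_a)
print(bag_a == bag_b)
```

Any model trained on these counts alone cannot distinguish the two sentences.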

### This problem has been solved already (like nearly all other programming problems you will come across)!

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
help(CountVectorizer)

In [None]:
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out())
print(X.toarray())

### 4.6.2 CountVectorizer
`CountVectorizer` constructs a bag-of-words matrix from a corpus of documents, and it supports some of the pre-processing steps we coded before (lower-casing, splitting into words/tokens, removing stopwords).

#### If you don't need to do a lot of custom preprocessing to your corpus, you can directly input your corpus into CountVectorizer and let it do basic preprocessing for you...

In [None]:
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

#### For more granular control over pre-processing, do it outside of CountVectorizer, join your tokens back together, then run CountVectorizer().

In [None]:
print(lemmed_corpus)
clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
print(clean_corpus)

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

## Recap
- The main goal of NLP in ML is to convert variable-length documents into fixed-length numeric vectors
- Text Normalization is the process of cleaning and processing raw text
- Stemming and lemmatization are two attempts to reduce derived words to their bases
- Bag of words counts how many times each unique word appears in a document