In [None]:
#! pip install nltk
import nltk
import pandas as pd
import numpy as np
#%load_ext pep8_magic
from IPython.display import HTML

<p style="font-size:200%; font-weight:bold"> Intro to Natural Language Processing (NLP)</p>

# What is NLP?
- Field at the intersection of **Computer Science**, **Artificial Intelligence**, and **Linguistics**
- **Goal**: Teach computers to **process**, **understand**, and **generate** human language
- NLP is "**AI-complete**": solving it fully requires all the different types of knowledge that humans possess

In [None]:
HTML('''
<script type="text/javascript" charset="utf-8" src="go.js"></script>
<script type="text/javascript" charset="utf-8" src="nlp_levels.js"></script>
<div id="sample">
  <h3>NLP Layers</h3>
  <div id="myDiagramDiv" style="border: solid 1px black; width:100%; height:600px"></div>
</div>''')

## Natural Language Understanding (NLU)
- Getting computers to **derive meaning** from natural language
  - TODO: Examples
- Imagine a "**Concept/Semantic/Representation space**" 
  - In it, every idea, word, or concept of interest has a unique computer representation
  - This is usually done via a **vector space**!
  - ***NLU is about mapping language into this space***

## Natural Language Generation (NLG)
- Mapping from computer representation space to language space
- Essentially, opposite direction of NLU
- Usually, you need NLU to perform NLG!

## Text vs Speech
- "Natural Language" can refer to **Text** or **Speech**
- They're 2 different presentations of the same thing: the concept space

## History of NLP
- NLP has been through (at least) 3 major eras:
  - 1950s-1980s: **Linguistics** Methods and **Handwritten Rules**
  - 1980s-Now: **Corpus/Statistical** Methods
  - Now-???: **Deep Learning**
    - Lucky you!  You're right near the start of a paradigm shift!

### 1950s-1980s: Linguistics Methods and Rules-based Systems
- NLP Systems focus on:
  - **Linguistics**: Grammar rules, sentence structure parsing, etc
  - **Handwritten Rules**: Huge sets of logical (if/else) statements
  - **Ontologies**: Manually created (domain-specific!) knowledge bases to augment rules above
- **Problems**: 
  - Too complex to maintain!
  - Can't scale!
  - Can't generalize!

### 1980s-Now: Corpus and Statistical Methods
- NLP starts using ML methods
- Make use of statistical learning over huge datasets of unstructured text
- e.g. Supervised Learning: Machine Translation
- e.g. Unsupervised Learning: Deriving Word "Meanings" (vectors)

### Now-???: Deep Learning
- Deep Learning made its name with Images first
- Around 2013: Deep Learning has major NLP breakthroughs
- Deep Learning very useful for unified processing of Language + Images

# NLP Toolkits
- Many NLP computing frameworks big and small
- The following list is **not** exhaustive, but good

## Natural Language Toolkit (`nltk`)
- Has many features
- Preeminent Python library for simple NLP manipulations
- **Terrible** documentation
- **Blah** syntax

### Configuring NLTK
- Installation
- Download all its data and models!

In [None]:
# Installation
#!pip install nltk
# Download Data, Models, etc
nltk.download()

### `nltk` Corpora + Models
- `nltk` has a whole bunch of text **corpora**
  - For training or trying out models for instance
- `nltk` also has a bunch of pre-trained models

## TextBlob
- Basically wraps common `nltk` functionality
- Much nicer (Pythonic!) syntax

## `sklearn`
- Not to be left behind, has some NLP primitives of its own
- Most notably, creating **frequency vectors** from text

## `gensim`
- Amazing open source **Topic Modeling** library
- Scales to large datasets
- We'll play soon!

## Stanford NLP Group Tools
- Stanford has a whole suite of NLP tools [here]()
- They're really the best at this stuff (other than Google)
- Wondering best way to quickly try an NLP task?  Always check Stanford
- Written in Java tho, need to add libraries to Java classpath

## Deep Learning Frameworks
- Deep Learning has exploded for NLP in the past few years!
- We'll get here :)

### TensorFlow
- Google open-sourced Deep Learning framework

### Theano
- Academia Deep Learning framework

### Keras
- Able to wrap both TensorFlow and Theano!

### Caffe
- Yet another one!

# Important NLP Terminology

## Corpora
- **Corpus**: A set of documents
- **Corpora**: Plural of corpus

## Word Vector
- The most important fundamental unit in NLP!
- **If we can turn words (or rather whatever language units) into vectors, we can do any ML we want!**
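
To make the "words as vectors" idea concrete, here is a minimal sketch. The three vectors are invented by hand purely for illustration (real word vectors are learned from a corpus and have far more dimensions), and `toy_vectors` and `cosine_similarity` are hypothetical names, not part of any library.

```python
import math

# Toy 3-dimensional "word vectors". The numbers are made up for illustration;
# real vectors are learned from data and are much larger.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Once words are vectors, "how related are these words?" becomes plain arithmetic
print("king vs queen:", cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print("king vs apple:", cosine_similarity(toy_vectors["king"], toy_vectors["apple"]))
```

With these toy numbers, "king" comes out closer to "queen" than to "apple", which is the kind of comparison any downstream ML model can build on.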

## n-grams
- All sequences of n tokens in a chunk of text
  - **Character-based**: Tokens are characters
    - e.g. 2-grams (bigrams): To, ok, ke, en, ns, ar, re, ch, ha, ar...
  - **Word-based**: Tokens are words
    - e.g. 3-grams (trigrams): (All sequences of), (sequences of n), (of n tokens)...
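
The definition above fits in a couple of lines of plain Python. This is only a sketch; `ngrams_of` is a hypothetical helper name chosen so it doesn't clash with any library function.

```python
def ngrams_of(tokens, n):
    """Return every run of n consecutive tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Character-based: the tokens are the characters of a string
char_bigrams = ngrams_of(list("Tokens"), 2)
print(["".join(gram) for gram in char_bigrams])

# Word-based: the tokens are whitespace-separated words
word_trigrams = ngrams_of("All sequences of n tokens".split(), 3)
print(word_trigrams)
```

The same function handles both cases because an n-gram only cares about the order of the tokens, not about what a token is.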

**Oh lord there are certainly many more...**  
We'll cover them as we see them.

# NLP Applications
- NLP applications are essentially uncountable
- Here is a pretty good list, but surely not complete

## Text Processing
- All text tasks

### Text Understanding
- Deriving meaning from text (class of problems)

#### ML with Text
- Standard machine learning algorithms
- Raw text is input

##### Text Classification and Regression
- Classify or Regress on documents
  - Documents $\rightarrow$ vectors
  - vectors + Labeled set $\rightarrow$ Supervised Learning
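
Here is a rough sketch of that two-step recipe using only the standard library. The labeled documents are a toy set invented for illustration, the classifier is a bare 1-nearest-neighbor (one simple choice among many supervised learners), and `vectorize`, `cosine`, and `classify` are hypothetical names.

```python
import math
from collections import Counter


def vectorize(document):
    """Documents -> vectors: a sparse word-count vector stored as a Counter."""
    return Counter(document.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy labeled set (made up for illustration)
labeled_docs = [
    ("the team won the game last night", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the new phone has a faster chip", "tech"),
    ("the laptop update fixed a software bug", "tech"),
]
training_set = [(vectorize(text), label) for text, label in labeled_docs]


def classify(document):
    """Vectors + labeled set -> prediction: label of the most similar training document."""
    query = vectorize(document)
    return max(training_set, key=lambda pair: cosine(query, pair[0]))[1]


print(classify("who scored the winning goal"))
print(classify("a bug in the phone software"))
```

Swapping in a different vectorizer or a different learner changes the quality of the predictions, but not the overall shape of the pipeline.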

##### Text Clustering
- Cluster chunks of text
- Again, vectorize the text first!

#### Automated Essay Scoring (AES)
- Exactly what it sounds like!

#### Language Identification
- Identify the language a chunk of text is written in
- Sure we can use a dictionary...
- But it's slow!  Can do better.
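
One common way to "do better" than a dictionary lookup (an assumption on my part, the slide doesn't name a method) is to compare character n-gram profiles. The sketch below uses two tiny invented samples; a real system would build profiles from much larger corpora. `char_ngrams`, `profiles`, and `identify_language` are hypothetical names.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams in a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


# Tiny toy samples (invented for illustration only)
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then runs away",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego corre",
}
profiles = {language: char_ngrams(text) for language, text in samples.items()}


def identify_language(text):
    """Pick the language whose profile shares the most n-gram counts with the text."""
    query = char_ngrams(text)

    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in query.items())

    return max(profiles, key=lambda language: overlap(profiles[language]))


print(identify_language("the dog runs"))
print(identify_language("el perro corre"))
```

No word list is consulted at all: short character sequences such as " th" or "rre" are already strong hints about the language.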

#### Natural Language Programming
- Programming by giving natural language instructions
- The ultimate in "Declarative Programming"
- Check out **Wolfram Natural Language**

#### Natural Language Search
- "Hey Siri, find me Lebron James's career NBA finals numbers."

#### Optical Character Recognition (OCR)
- Translating images of human-readable text into a machine representation
  - e.g. Digits dataset!
  - e.g. Extracting text of scanned PDF
- We won't focus on this much
  - But it's a long-studied problem!

#### Sentiment Analysis
- Evaluate sentiment (positive/negative/neutral) of a chunk of text
  - This is a classification problem!
  - Can have other emotional states (classes) as well
  - **Polarity**: How positive/negative is the text?
  - **Subjectivity**: How opinionated (subjective vs objective) is the text?
  - e.g.: Text with many positive and negative statements may have high subjectivity but neutral polarity (cancel out)
- `TextBlob` is easiest (**not best!**) for sentiment:

In [None]:
# Import
from textblob import TextBlob

# Create "blobs"
blob1 = TextBlob("I hate Mondays.")
blob2 = TextBlob("I hate Mondays, but I love you all!")

# Get sentiment 
print("Sentiment 1: {}".format(blob1.sentiment))
print("Sentiment 2: {}".format(blob2.sentiment))

#### Proofreading
- That thing you never did in middle school

#### Text Simplification
- What Paul needs for his slides and READMEs
- Removes verbosity, cuts to core content

#### Extracting Word and Document Meaning ("Semantic Analysis")
- Word vectors baby
- **Conceptual Comparisons** between:
  - Words-Words
  - Chunks of Text-Chunks of Text
  - Chunks-Words

#### Information Retrieval
- Getting those query results that you want

#### Relationship Extraction
- Determining the relationships between entities in a chunk of text
  - e.g. Paul = father(Paul) ;)

#### Topic Modeling
- The topic of the next 2 days
- Determining **underlying topics/concepts** in a document
- Example algorithms:
  - LDA
  - LSA
  - NMF
  - Word2Vec (ish)

### Text Generation
- That thing you do with your diary every night

#### Image Annotation/Captioning
- Such a cool use case, describing what's happening in an image
- Combines Image spaces with Language spaces
  - $\rightarrow$ Deep Learning

#### Form Letter Generation
- Frequent use case for this

### Text Understanding AND Generation
- Generally, Generation requires Understanding anyway
- Here are some tasks that need both!

#### Automatic Summarization
- Summarizing text like a human does
- **Extractive**: Combining existing chunks from the text.
- **Abstractive**: Creating novel summary chunks **paraphrasing** the core points
- Humans do the latter, it's hard :)
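
The extractive flavor is simple enough to sketch. The version below scores each sentence by the average frequency of its words across the whole text and keeps the top scorers; that scoring rule is just one naive assumption, and `extractive_summary` is a hypothetical name.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average whole-text frequency of the words in this sentence
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]


document = (
    "Topic models find topics in documents. "
    "My cat is asleep. "
    "Topic models represent documents as mixtures of topics."
)
print(extractive_summary(document, num_sentences=1))
```

Note that the output only ever contains sentences copied from the input. Producing a new paraphrase, the abstractive case, is not something a scoring trick like this can do.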

#### Machine Translation
- Oh so long studied
- Basically the great great grandmother of NLP
- Deep Learning state of the art now

#### Question Answering
- Answering any generic question, simple as that
- AI-complete
- Deep Learning!

## Speech Processing
- Remember, we have a unified "Language Space" between speech and text
- As long as we have good tools for Text-to-Speech and Speech-to-Text, can use the same kinds of algorithms

### Speech Understanding

#### Speech to Text (STT)
- "Siri, did you **hear** me??"

### Speech Generation

#### Text to Speech (TTS)
- "Yes, I heard what you said Paul."

# Common Component Tasks in NLP

## Chunking
- Extracting meaningful units, or **chunks**, of text from raw text
- Can range from very simple to extremely complex
- Examples:
  - Tokenization
  - Sentence Segmentation
  - Named Entity Recognition
  - Compound Term Extraction
- In `nltk`: `nltk.chunk.*`

### Tokenization
- Splitting raw text into small indivisible units
  - e.g.: **Word Tokenization**: Splitting into list of words
  - e.g.: **N-Gram Tokenization**: Splitting into all n-grams
- There's no reason you can't do these manually (regex, etc)!  But these are optimized  
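
For reference, here is what a manual regex tokenizer might look like. It is a sketch with one particular set of rules (keep contractions together, split off each punctuation mark), and `regex_word_tokenize` is a hypothetical name.

```python
import re


def regex_word_tokenize(text):
    """Split text into words (keeping contractions whole) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)


print(regex_word_tokenize("I'm sure that's rigorous (so rigorous)!"))
```

Every choice baked into the pattern (what to do with apostrophes, digits, hyphens...) is a tokenization decision that a library tokenizer has already made for you.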

#### Tokenization in NLTK
In `nltk`, use the `nltk.tokenize` module for tokenization.  
Here are some examples!

**Word and Whitespace Tokenization:**

In [None]:
# Imports
from nltk.tokenize import word_tokenize, wordpunct_tokenize, WhitespaceTokenizer

# Create some text to tokenize
text = """I'm relatively certain that you all are the best Metis cohort of all time!
    However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence."""

# Word Tokenize: Creates tokens from words and punctuation
word_tokens = word_tokenize(text)
print("Results of word_tokenize: {}\n".format(word_tokens))

# Word + Punctuation Tokenize: Splits into runs of word characters and runs of punctuation
wordpunct_tokens = wordpunct_tokenize(text)
print("Results of wordpunct_tokenize: {}\n".format(wordpunct_tokens))

# Whitespace Tokenize with span: Creates start/end indices for tokens yielded by splitting on whitespace only
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_token_slices = list(whitespace_tokenizer.span_tokenize(text))
print("Results of WhitespaceTokenizer: {}\n".format(whitespace_token_slices))

**N-gram Tokenization**:  

In [None]:
# Import
from nltk.util import ngrams

# Bigrams (ngrams returns a lazy iterator, so wrap it in list() to see the values)
print("Bigrams: {}".format(list(ngrams(word_tokens, 2))))

# Trigrams
print("Trigrams: {}".format(list(ngrams(word_tokens, 3))))


**Regex Tokenization:**  
- Create tokens by:
  - Splitting on a defined regex delimiter (parameter `gaps=True`)
  - Matching tokens to defined regex
- In `nltk`:  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time! However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence.*

In [None]:
# Import
from nltk.tokenize import RegexpTokenizer

# RegexpTokenizer with whitespace delimiter (raw strings keep the regex escapes intact)
whitespace_regex_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print("Results of whitespace_regex_tokenizer: {}\n".format(whitespace_regex_tokenizer.tokenize(text)))

# RegexpTokenizer to match only capitalized words 
cap_tokenizer = RegexpTokenizer(r"[A-Z]['\w]+")
print("Results of cap_tokenizer: {}".format(cap_tokenizer.tokenize(text)))

### Sentence Segmentation
- Splitting data into sentences.
- This can be harder than it seems!
  - Periods that don't end sentences, spacing issues, etc
- Here's an example in `nltk`!  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time! However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence.*

In [None]:
# Import 
from nltk.tokenize import sent_tokenize

# Sentence Tokenize: Creates tokens from sentences
sentence_tokens = sent_tokenize(text)
print("Results of sent_tokenize: {}\n".format(sentence_tokens))
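
To see why this is harder than it seems, here is a deliberately naive splitter that breaks after every period, exclamation mark, or question mark. The sample sentence and the name `naive_sentence_split` are made up for illustration.

```python
import re


def naive_sentence_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Dr. Smith arrived at 5 p.m. sharp. He was not late!"
for piece in naive_sentence_split(sample):
    print(repr(piece))
```

The sample contains two sentences, but the naive rule returns four pieces because "Dr." and "p.m." contain periods that don't end a sentence.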

### Named Entity Recognition (NER) aka Entity Extraction
- Identifying and tagging named entities in text
  - Persons, Places, Organizations, Phone #s, Emails, etc
- Can be **tremendously valuable** for further NLP tasks
  - e.g.: "George Bush" $\rightarrow$ "George_Bush"
    - George_Bush has a lot more meaning in it than George and Bush separately!

TODO: Implement Stanford NER

In [None]:
# Import 
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag

# Text with some entities
ner_text = "Abraham Lincoln was the 16th President of the United States"

# Create tokens and POS-tag them (ne_chunk expects tagged tokens)
tokens = pos_tag(word_tokenize(ner_text))

# Extract entities from token list
entities = ne_chunk(tokens)
print(entities)

### Compound Term Extraction
- Extracting and tagging **compound words** or **phrases** in text
- **Super valuable** as well!
  - e.g.: "baseball bat" $\rightarrow$ "baseball_bat"
    - This totally changes the conceptual meaning!
- Here's one way to do it manually in `nltk`:  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time! However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence.*

In [None]:
# Import
from nltk.tokenize import MWETokenizer

# Multi-word Expression Tokenizer: Takes a list of tuples that will be additionally tagged when seen together
mwe_tokenizer = MWETokenizer([('you', 'all'), ('of', 'all', 'time'), ('statistical', 'analysis')])
# Add one more
mwe_tokenizer.add_mwe(('degree', 'of', 'confidence'))
# Tokenize (takes a list of tokens, not raw text, then retokenizes)
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text))
print("Results of MWETokenizer: {}".format(mwe_tokens))

But in practice, we'd want it done automagically!
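
One simple way to automate it, sketched below, is to count adjacent word pairs across a corpus and treat the frequent ones as compound terms. Raw counts are an assumption made to keep the example short (on real text they would also pick up pairs like "of the", so practical approaches typically add an association score), the corpus is invented, and `frequent_bigrams` and `merge_compounds` are hypothetical names.

```python
from collections import Counter


def frequent_bigrams(token_lists, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


def merge_compounds(tokens, compounds, separator="_"):
    """Join each detected pair into a single token, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compounds:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus (invented for illustration), already tokenized
corpus = [
    "he swung the baseball bat".split(),
    "a baseball bat is made of wood".split(),
    "she bought a wooden bat".split(),
]
compounds = frequent_bigrams(corpus)
print(compounds)
print(merge_compounds(corpus[0], compounds))
```

Unlike the `MWETokenizer` example, nobody had to type in the list of expressions: it comes from the corpus itself.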

## Stemming
- Cutting words down to a common root (the **stem**), which need not be a real word
- **Motivation**: 
  - Meaning of run, runs, running, ran all pretty much the same
  - Cuts down on complexity by reducing # unique words
- Here's one way we do it in `nltk`: `nltk.stem` module  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time! However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence.*

In [None]:
# Import
from nltk.stem.lancaster import LancasterStemmer

# The Lancaster stemmer is a rule-based suffix stripper (it does not look words up in a dictionary such as WordNet)
stemmer = LancasterStemmer()

# Try some stems
print("Dogs: {}".format(stemmer.stem('dogs')))
print('drive: {}'.format(stemmer.stem('drive')))
print('drives: {}'.format(stemmer.stem('drives')))
print('driver: {}'.format(stemmer.stem('driver')))
print('drivers: {}'.format(stemmer.stem('drivers')))
print('driven: {}'.format(stemmer.stem('driven')))
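
To show the "fewer unique words" motivation in isolation, here is a toy suffix stripper. It is emphatically not the Lancaster (or any other published) algorithm, just a handful of hand-picked suffix rules under the hypothetical name `toy_stem`.

```python
def toy_stem(word):
    """Strip one common English suffix, keeping at least three leading characters."""
    for suffix in ("ing", "ers", "er", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["drive", "drives", "driver", "drivers", "driving", "dogs"]
stems = [toy_stem(w) for w in words]
print(stems)
print("unique words:", len(set(words)), "-> unique stems:", len(set(stems)))
```

Six distinct surface forms collapse to two stems. Rules this crude break down quickly on real vocabulary, which is why the library stemmers carry much larger rule sets.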

## Term-Document Matrix
- Given a corpus of documents:
  - Create matrix of all documents (rows) vs unique tokens (columns)
  - Fill values with either frequency counts (term in document) or binary occurrence (0s and 1s)
- Dependent on tokenizer!
- Can grow large fast!
- Probably really sparse (many 0 entries)!
- Let's use `sklearn` to do this like nothin':

In [13]:
# Import
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load data
ng_train = fetch_20newsgroups()
ng_train_data = ng_train.data

# Create a vectorizer object to generate term document counts
# Note all the parameters we can use, let's play!
cv = CountVectorizer()
# Get the vectors
ng_train_vecs = cv.fit_transform(ng_train_data)
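# Store them in a Pandas DataFrame (densifying a corpus this size needs a lot of memory)
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
ng_train_df = pd.DataFrame(ng_train_vecs.toarray(), columns=cv.get_feature_names_out())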
ng_train_df.head()

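| | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | 0000000004 | 0000000005 | 00000000b | 00000001 | ... | çon | ère | ée | égligent | élangea | érale | ête | íålittin | ñaustin | ýé |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |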
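
To demystify what a term-document matrix is, the same idea can be built by hand on a toy corpus. This is a sketch of the concept, not of how `sklearn` implements it; the documents are invented and the variable names are hypothetical.

```python
from collections import Counter

# Toy corpus (invented for illustration)
docs = ["the cat sat", "the cat ate the fish", "dogs chase cats"]

# Tokenize with a plain whitespace split; a different tokenizer gives different columns
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per unique token
doc_counts = [Counter(tokens) for tokens in tokenized]
count_matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]
binary_matrix = [[int(count > 0) for count in row] for row in count_matrix]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Even with three short documents most entries are 0, which is the sparsity mentioned above. Note also that "cat" and "cats" land in separate columns because the tokenizer treats them as different tokens.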


### Stopwords
- Words that have very little **semantic value**
  - Because they appear everywhere!
  - e.g.: the, is, a, an, etc
- Typically there are language (or context) specific lists
  - `nltk` has one for English
- But let's stick with `sklearn` and only slightly augment our code above:

In [17]:
# Create a vectorizer object to generate term document counts
# Note all the parameters we can use, let's play!
cv = CountVectorizer(stop_words='english')
# Get the vectors
ng_train_vecs = cv.fit_transform(ng_train_data)
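# Store them in a Pandas DataFrame (densifying a corpus this size needs a lot of memory)
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
ng_train_df = pd.DataFrame(ng_train_vecs.toarray(), columns=cv.get_feature_names_out())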
ng_train_df.head()

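| | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | 0000000004 | 0000000005 | 00000000b | 00000001 | ... | çon | ère | ée | égligent | élangea | érale | ête | íålittin | ñaustin | ýé |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |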
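
Stopword removal itself is just a filter. Here is a sketch with a tiny hand-picked list (real lists, like the English ones in `nltk` and `sklearn`, are much longer); `toy_stopwords` and `remove_stopwords` are hypothetical names.

```python
# A tiny hand-picked stopword list (illustrative only)
toy_stopwords = {"the", "is", "a", "an", "of", "and"}


def remove_stopwords(tokens, stopwords=toy_stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords("The cat is a master of stealth".split()))
```

What survives is the part of the sentence that actually distinguishes it from other sentences.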


### Term Frequency Inverse Document Frequency (TFIDF) Weighting
- Don't stop at just counts!
- We want to **weight the counts**
- **TFIDF**:
  - **TF**: Weight **directly proportional** to count for term within the document (**local count**)
  - **IDF**: Weight **inversely proportional** to count for term across all documents (**global count**)
- In `sklearn`!:

In [16]:
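# Import
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a vectorizer object to generate TF-IDF weighted term document vectors
# Note all the parameters we can use, let's play!
tfidf = TfidfVectorizer(stop_words='english')
# Get the vectors (fit the TF-IDF vectorizer, not the count vectorizer from above)
ng_train_vecs = tfidf.fit_transform(ng_train_data)
# Store them in a Pandas DataFrame (densifying a corpus this size needs a lot of memory)
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
ng_train_df = pd.DataFrame(ng_train_vecs.toarray(), columns=tfidf.get_feature_names_out())
ng_train_df.head()

| | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | 0000000004 | 0000000005 | 00000000b | 00000001 | ... | çon | ère | ée | égligent | élangea | érale | ête | íålittin | ñaustin | ýé |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |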
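
For intuition, here is the weighting worked by hand using one textbook variant, $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log(N / \text{df}(t))$, where $N$ is the number of documents and $\text{df}(t)$ is how many of them contain the term. This is an assumption chosen for simplicity: `sklearn`'s default applies smoothing and normalization, so its numbers will differ. The corpus is invented and `tf_idf` is a hypothetical name.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration), already tokenized
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: relative frequency in the doc times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)                 # local count
    df = sum(1 for other in corpus if term in other)   # global count
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

"the" appears in every document, so its weight drops to zero, while "mat" (unique to the first document) gets the highest weight of the three.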


## Parts-of-Speech (POS) Tagging
- Tagging part of speech of each word
- In `nltk`:  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time! However, I'll wait until the rigorous (so rigorous) statistical analysis comes back before I quantify my degree of confidence.*  

TODO: Implement Stanford Tagger

In [None]:
# Import
from nltk.tag import pos_tag

# Tag away!
pos_tag(word_tokens)

## Parsing
- Generating **Parse trees** for sentences
<img  src="sentenceParse.png"/>
- In `nltk`: `nltk.parse`  
**Text**: *I'm relatively certain that you all are the best Metis cohort of all time!*

TODO: Implement Stanford Parser

In [None]:
# Import 
from nltk.parse.stanford import StanfordParser

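# Create parser (needs the Stanford Parser libraries on the Java classpath)
stanford_parser = StanfordParser()

# Parse it out (parse() returns an iterator of trees, so materialize it with list())
list(stanford_parser.parse(word_tokens))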

## Coreference Resolution
- Given a chunk of text, map which words refer to the same objects.
  - **Anaphora Resolution**: Special case, mapping pronouns to the nouns they refer to
    - e.g. <span class="burk">Paul</span> worked hard on <span class="burk">his</span> slides for this.

### Coreference Resolution in Stanford CoreNLP
Here's how we do this with the Stanford Tools!

TODO: Implement Coref Resolution with Stanford

In [None]:
# Import 

## Query Expansion

## Speech Segmentation
- Breaking speech into phonemes

## Word-sense Disambiguation
- Distinguishing the different meanings (**senses**) that the same character sequence can have
  - e.g.: George Bush vs bush
- This is very hard!

# Where We're Going
- Topic Modeling
  - LDA
  - LSA
  - NMF
  - Word2Vec
- Other NLP Models
  - Markov Models
  - Maybe more
- Deep Learning for NLP