<img src="https://images.efollett.com/htmlroot/images/templates/storeLogos/CA/864.gif" style="float: right;"> 




# ECON628-01 
### Lecture 1.1 - NLP
---

 


### Language?
---
![](https://snag.gy/BXQ6xM.jpg)

### What is NLP?
---
**NLP =** Natural Language Processing => Computational Linguistics, Text Analysis, etc.

- Using computers to process (analyze, understand, generate) natural human languages 
- Most knowledge created by humans is _unstructured text_, and we need a way to make sense of it
- We need a way to analyze text quantitatively
- Also referred to as machine learning with text.

### NLP Applications?
---

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [A friend's application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)
- **Spell and Grammar Checking**: Checking spelling and grammar & suggesting alternatives for the errors
    - [Grammarly](https://app.grammarly.com)
- **Word Prediction**: Search for something in Google
    - [Google](https://www.google.com/)
- **Text Categorization**: Assigning one or more pre-defined categories to a text
    - [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/)  
- **Sentiment Analysis**: Reviews (Amazon => customer reviews, IMDb => movie reviews, etc.)
    - [Amazon](https://www.amazon.com/)    
- **Speech Recognition**: Siri, Alexa
    - [Siri](https://www.apple.com/ios/siri/) 

### NLP general applications -  level of difficulty
---
- **Easy (mostly solved)**:
    - Spell and grammar checking
    - Some text categorization tasks
    - Some named-entity recognition tasks

&nbsp;

- **Intermediate (good progress)**:
    - Information retrieval
    - Sentiment analysis
    - Machine translation
    - Information extraction

&nbsp;    

- **Difficult (still hard)**:
    - Question answering (Match query with knowledge base)
    - Summarization
    - Speech recognition:
        - Speech to text
        - Trained/untrained user models
        - Voice-based interfaces
    - Closed domain vs open domain
    - Reasoning about intent of question

### NLP basic components
---
- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**: (classification, clustering, recommendation)
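
As a rough illustration of the tokenization item in the list above, the sketch below splits a string into word tokens and then builds n-grams from them. The helper names `word_tokens` and `ngrams` and the regular expression are illustrative assumptions, not part of any library.

```python
import re

def word_tokens(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = word_tokens("New York City subway")
print(tokens)             # ['new', 'york', 'city', 'subway']
print(ngrams(tokens, 2))  # [('new', 'york'), ('york', 'city'), ('city', 'subway')]
```

Bigrams such as `('new', 'york')` keep a little of the word order that single-word tokens lose.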

### Why is NLP hard?
----

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"
- **Texts with the same words and phrases can have different meanings**: 
the State Farm commercial in which two different people say "Is this my car? What? This is ridiculous! This can't be happening! Shut up! Ahhhh!!!"


NLP requires an understanding of the **language** and the **world**.

## Common NLP Related Models
---

- **LSI**: Latent semantic indexing
    - [Latent semantic indexing](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html) 
- **LDA**: Latent Dirichlet Allocation
    - [Latent Dirichlet Allocation](https://algorithmia.com/algorithms/nlp/LDA)
    
- **HDP**: Hierarchical Dirichlet Processes  
    - [Hierarchical Dirichlet Processes](http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf)
- **GloVe**: Global Vectors for Word Representation (a word-embedding model, related to Word2Vec)
    - [Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
- **LogisticRegression**
- **Naive Bayes**
- **SVM**
- **CountVectorizer**
    - [Count Vectorizer ](https://de.dariah.eu/tatom/working_with_text.html)
- **TfIdF**: Term frequency–Inverse document frequency
    - [Term frequency–Inverse document frequency](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html)
- **DTM**: Dynamic Topic Models
    - [Dynamic Topic Models](https://radimrehurek.com/gensim/models/dtmmodel.html)

### NLP - simple example
---
Suppose we are building a spam/ham classifier. The inputs are emails, and the output is a binary classification.

We can classify an email like the one below as spam or not spam (ham) by:
 
> _Checking for the presence of the words Donate, WILL, sum, cancer, LinkedIn and similar._

> Defining a simple rule that parses the text. This is one of the simplest forms of feature extraction from text: _binary word counting_.

In [1]:
spam = """
Hello,\nI saw your contact information on LinkedIn. 
I have carefully read through your profile and you seem to have an outstanding personality. 
This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of 
Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going 
in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros
(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
"""

ham = """
Hello,\nI am writing in regards to your application to the position of Data Scientist at Hooli X. 
We are pleased to inform you that you passed the first round of interviews and we would like to invite you 
for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message 
further information on date, time and location of the interview. Please let me know if I can be of any further 
assistance. Best Regards.
"""
print(spam)
print()
print(ham)



Hello,
I saw your contact information on LinkedIn. 
I have carefully read through your profile and you seem to have an outstanding personality. 
This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of 
Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going 
in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros
(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.



Hello,
I am writing in regards to your application to the position of Data Scientist at Hooli X. 
We are pleased to inform you that you passed the first round of interviews and we would like to invite you 
for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message 
further information on date, time and location of the interview. Please let me know if I can be of any further 
assistance. Best Regards.
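
The keyword rule described at the start of this section could be sketched roughly as follows. The keyword set, the helper names `binary_features` and `looks_like_spam`, the threshold of 3, and the two shortened snippets are all assumptions made for illustration. This is a minimal sketch of binary word counting, not a real spam filter.

```python
import re

# Hypothetical keyword list, taken from the words mentioned in the text.
SPAM_WORDS = {"donate", "will", "sum", "cancer", "linkedin"}

def binary_features(text, vocabulary=SPAM_WORDS):
    """Map each vocabulary word to 1 if it occurs in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {word: int(word in words) for word in sorted(vocabulary)}

def looks_like_spam(text, threshold=3):
    """Flag the text when at least `threshold` keywords are present."""
    return sum(binary_features(text).values()) >= threshold

spam_snippet = "I saw your contact on LinkedIn. I decided to WILL/Donate the sum of 8,750,000.00 Euros"
ham_snippet = "You will find attached further information on the interview."

print(looks_like_spam(spam_snippet))  # True
print(looks_like_spam(ham_snippet))   # False
```

Note that "will" also appears in the ham snippet, which is why a single keyword is not enough and the sketch requires several matches.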



### NLP - Bag of Words (word counting model, CountVectorizer, HashingVectorizer)
---

- The bag-of-words is a simplifying representation used in NLP. 
- In this method, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, **disregarding grammar and even word order but keeping multiplicity.**
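
A small sketch of what "disregarding word order but keeping multiplicity" means in practice. The helper name `bag_of_words` and the two example sentences are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated words."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)     # True: word order is discarded
print(a["the"])   # 2: multiplicity is kept
```

The two sentences mean different things but produce the same bag, which is the main limitation of this representation.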

In [2]:
## Using Counter
from collections import Counter
print("This is the word counting for Spam:\n", Counter(spam.lower().split()))
print("===========")
print("This is the word counting for Ham:\n", Counter(ham.lower().split()))

This is the word counting for Spam:
Counter({'i': 7, 'of': 4, 'and': 3, 'is': 2, 'etc.': 2, 'am': 2, 'an': 2, 'have': 2, 'in': 2, 'your': 2, 'euros': 2, 'to': 2, 'years': 2, 'with': 2, 'this': 2, 'contact': 2, 'the': 2, 'major': 1, 'old': 1, 'cancer': 1, 'outstanding': 1, 'seven': 1, 'decided': 1, 'through': 1, 'carefully': 1, 'seem': 1, 'saw': 1, '(eight': 1, 'information': 1, 'for': 1, 'fifty': 1, '86': 1, 'sum': 1, '"lukoil".': 1, 'only': 1, 'pjsc': 1, 'mr.': 1, '2': 1, 'linkedin.': 1, 'will/donate': 1, 'you': 1, 'hundred': 1, 'was': 1, 'personality.': 1, 'chairman': 1, 'profile': 1, 'you.': 1, 'hello,': 1, 'ago.': 1, 'read': 1, 'going': 1, 'thousand': 1, 'million': 1, 'grayfer': 1, 'reason': 1, 'be': 1, 'one': 1, 'why': 1, 'on': 1, 'name': 1, 'week.': 1, '8,750,000.00': 1, 'later': 1, 'board': 1, 'operation': 1, 'will': 1, 'directors': 1, 'diagnosed': 1, 'valery': 1, 'my': 1})
This is the word counting for Ham:
Counter({'to': 5, 'you': 4, 'of': 4, 'the': 3, 'and': 2, 'we': 2, 'scie

In [3]:
##################
## CountVectorizer: it counts the frequency of each word.
##################

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
## http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

'''CountVectorizer = Convert a collection of text documents to a matrix of token counts
   tokens = (words, sentences, n-grams)'''

cvec = CountVectorizer()
cvec

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]:
cvec.fit([spam])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [25]:
# The idea is to take the raw word counts and reorganize them into a table:
# one row per document, one column per word.
df  = pd.DataFrame(cvec.transform([spam]).todense(),
                   columns=cvec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0

''' toarray returns an ndarray; 
    todense returns a matrix. 
    If you want a matrix, use todense; otherwise, use toarray
'''
print(df.shape)
print('')

'''show the 10 most frequent words'''
print(df.transpose().shape)
print('')
df.transpose().sort_values(0, ascending=False).head(10).transpose()

(1, 69)

(69, 1)



| | of | and | your | contact | is | in | have | euros | the | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
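
To make the fit/transform split more concrete, here is a minimal pure-Python sketch of what a count vectorizer does conceptually. It is not scikit-learn's implementation. The function names `fit_vocabulary` and `transform_counts` and the two example documents are hypothetical, and the token pattern is copied from the `token_pattern` shown in the output above.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # words of 2+ characters

def tokenize(doc):
    """Lowercase the document and extract tokens with the pattern above."""
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """Learn a sorted vocabulary and map each word to a column index."""
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def transform_counts(docs, vocabulary):
    """Return one row of counts per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

docs = ["I am in contact with you", "contact me, contact us"]
vocab = fit_vocabulary(docs)
print(vocab)
print(transform_counts(docs, vocab))
```

The single-letter word "I" is dropped because the pattern only keeps tokens of two or more characters, which is also why it does not appear in the table above.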


### NLP - Hashing Vectorizer
---
- With `CountVectorizer`, keeping the dictionary (vocabulary) at a fixed size means keeping only words within a certain frequency range.
- Even then, we still have to compute the dictionary and hold it in memory. This can be a **problem** with a large corpus, or in streaming applications where we don't know which words we will encounter in the future.


- **Solution:** Use the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences, calculated with the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing). Each word is mapped to a feature index by a [hash function](https://en.wikipedia.org/wiki/Hash_function). If we encounter that word again in the text, it is mapped to the same index, which lets us count word occurrences without retaining a dictionary in memory. This is very convenient!


- **The drawback:** with this trick it is not possible to compute the inverse transform, so we lose the information about which words the important features correspond to. 
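
The hashing trick itself can be sketched in a few lines. This is an illustration under stated assumptions, not how `HashingVectorizer` is implemented: scikit-learn uses MurmurHash3, while the sketch swaps in MD5 from the standard library to stay self-contained. The helper names `bucket` and `hashed_counts` and the tiny `n_features=16` are hypothetical.

```python
import hashlib

def bucket(token, n_features):
    """Map a token to a column index with a stable hash (MD5 here)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashed_counts(text, n_features=16):
    """Count tokens per bucket without storing any vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        vector[bucket(token, n_features)] += 1
    return vector

vec = hashed_counts("contact me contact us")
print(vec)
print(bucket("contact", 16) == bucket("contact", 16))  # True: same word, same bucket
```

With so few buckets, different words may collide in the same column. Nothing in `vec` records which word produced which index, which is the lost inverse transform mentioned above.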

In [17]:
####################
## HashingVectorizer
####################

from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer()
hvec

HashingVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
         dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, n_features=1048576, ngram_range=(1, 1),
         non_negative=False, norm=u'l2', preprocessor=None,
         stop_words=None, strip_accents=None,
         token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None)

In [18]:
hvec.fit([spam])

HashingVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
         dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, n_features=1048576, ngram_range=(1, 1),
         non_negative=False, norm=u'l2', preprocessor=None,
         stop_words=None, strip_accents=None,
         token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None)

In [20]:
df  = pd.DataFrame(hvec.transform([spam]).todense())
df.transpose().sort_values(0, ascending=False).head(10).transpose()

| | 479532 | 144749 | 174171 | 832412 | 828689 | 994433 | 1005907 | 170062 | 675997 | 959146 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.338062 | 0.169031 | 0.169031 | 0.169031 | 0.169031 | 0.169031 | 0.169031 | 0.169031 | 0.169031 | 0.084515 |


### NLP - Bag of Words (drawbacks)
---

- Bag of Words completely ignores the structure of a sentence.
- Bag of Words merely assesses the presence of specific words or word combinations.
- The same word can have multiple meanings in different contexts. Consider for example the following two sentences:

    - There's wood floating in the **sea**
    - Mike's in a **sea** of trouble with the move

In the first case the word "sea" indicates a large body of water, while in the second case it indicates "a lot of".

How do we teach a computer to disambiguate? Here are some additional techniques that can help.

### NLP - Segmentation (hard way and easy way with NLTK)
---

This is a technique to **identify sentences** within a body of text. 
As we know, language is not a continuous uninterrupted stream of words: punctuation serves as a guide to group together words that convey meaning when contiguous.

In [7]:
####################
## Segmentation
## The hard way!!
####################

easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"

def simple_sentencer(text):
    '''take a string called `text` and return
    a list of strings, each containing a sentence'''

    sentences = []
    substring = ''
    for c in text:
        if c in ('.', '!', '?'):
            sentences.append(substring + c)
            substring = ''
        else:
            substring += c
    return sentences

simple_sentencer(easy_text)

['I went to the zoo today.',
 ' What do you think of that?',
 ' I bet you hate it!']
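
Note that the hand-written version keeps the leading spaces and drops the last sentence, because that sentence has no closing punctuation. One possible middle ground, before reaching for a library, is a regular-expression split. The function name `regex_sentencer` is a hypothetical helper for illustration.

```python
import re

def regex_sentencer(text):
    """Split after '.', '!' or '?' followed by whitespace; keep the tail."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"
print(regex_sentencer(easy_text))
```

This still breaks on abbreviations such as "Mr. Smith", which is one reason a trained tokenizer is usually preferred.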

In [None]:
##################################
''' Is NLTK installed on my computer?
    If the downloader opens, install the
    packages you need''' 
################################
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [8]:
##########################
## Segmentation with NLTK
#########################
from nltk.tokenize import PunktSentenceTokenizer

detect_sentence = PunktSentenceTokenizer()

detect_sentence.sentences_from_text(easy_text)

['I went to the zoo today.',
 'What do you think of that?',
 'I bet you hate it!',
 "Or maybe you don't"]

### NLP - with NLTK (Normalization)
---

- **_Normalization_: Text normalization is the process of transforming text into a single canonical form that it might not have had before.**


- **_Normalization_** is needed when slightly different versions of a word exist. For example, LinkedIn sees 6000+ variations of the title "Software Engineer" and 8000+ variations of the word "IBM".
- Here is a longer list of things that vary:
    - Person titles (Mr. MR. DR etc.)
    - Dates (10/03, March 10 etc.)
    - Numbers
    - Plurals
    - Verb conjugations
    - Slang
    - SMS abbreviations
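
A minimal sketch of normalizing person titles, assuming a small hand-made lookup table. The `CANONICAL` table and the `normalize_token` helper are hypothetical; real systems need far larger tables or learned rules.

```python
import re

# Hypothetical lookup table mapping cleaned variants to one canonical form.
CANONICAL = {"mr": "mr", "mister": "mr", "dr": "dr", "doctor": "dr"}

def normalize_token(token):
    """Lowercase, drop surrounding punctuation, then apply the lookup table."""
    cleaned = re.sub(r"^\W+|\W+$", "", token.lower())
    return CANONICAL.get(cleaned, cleaned)

print([normalize_token(t) for t in ["Mr.", "MR.", "mr", "Mister", "DR"]])
# ['mr', 'mr', 'mr', 'mr', 'dr']
```

Tokens that are not in the table are only lowercased and stripped of punctuation, so unknown variants pass through unchanged.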



### NLP - with NLTK (Stemming)
---
- **_Stemming_ is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form**


- **_Stemming_**: It would be wrong to treat the words "MR." and "mr" as different features, so we need a technique that normalizes words to a common root. This technique is called _Stemming_.

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

In [38]:
####################
## Stemming
## The hard way and 
## really ineffective
####################
def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''

    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])  # strip all three characters of 'ing'
        else:
            new_bag.append(token)

    return new_bag

In [42]:
stem(['Swimmed', 'Swimming'])

['Swimmed', 'Swimm']

In [9]:
##########################
## Stemming with NLTK
#########################
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('Swimmed'))
print(stemmer.stem('Swimming'))

Swim
Swim


### NLP - with NLTK (Stop Words)
---
- Some words are very common and provide little or **NO** information about the text content, so they should be removed.


In [None]:
##########################
## Stop Words with NLTK
#########################

from nltk.corpus import stopwords
stop = stopwords.words('english')
print("There are", len(stop), "stop words in the NLTK English list")
print('')
sentence = "this class is great is the best way to learn ml and python"
print(sentence.split())
print('')
print([i for i in sentence.split() if i not in stop])  # words of the sentence that are not in the stop word list

# Using `in` instead of `not in` would print the words that are stop words.
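
The same filtering idea can be tried without downloading any NLTK data. The sketch below uses a deliberately tiny, hypothetical stop word set, and `remove_stop_words` is an illustrative helper name. NLTK's English list is much longer, so the result will differ.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"this", "is", "the", "to", "and", "a", "an", "of"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Keep the words that are not in the stop word set, preserving order."""
    return [word for word in sentence.lower().split() if word not in stop_words]

sentence = "this class is great is the best way to learn ml and python"
print(remove_stop_words(sentence))
# ['class', 'great', 'best', 'way', 'learn', 'ml', 'python']
```

Storing the stop words in a `set` rather than a list keeps each membership check fast, which matters on large corpora.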

### NLP - with NLTK (Parts of Speech)
---
- We all know that each word has a specific role in a sentence (Verb, Noun etc.) 
- Parts-of-speech tagging (POS) is a **feature extraction technique** that attaches a _tag_ to each word in the sentence, to provide a more precise context for further analysis. 
- This is often a resource intensive process, but it can sometimes improve the accuracy of models.
- [What are all possible pos tags of NLTK?](http://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)


In [48]:
############################
## Parts of Speech with NLTK
############################
from nltk.tag import pos_tag
from nltk.tokenize import WordPunctTokenizer

tok = WordPunctTokenizer()
pos_tag(tok.tokenize("this class is great is the best way to learn ml and python"))

[('this', 'DT'),
 ('class', 'NN'),
 ('is', 'VBZ'),
 ('great', 'JJ'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('way', 'NN'),
 ('to', 'TO'),
 ('learn', 'VB'),
 ('ml', 'NN'),
 ('and', 'CC'),
 ('python', 'NN')]

### NLP - with scikit-learn (Term Frequency - Inverse Document Frequency)
---

- Tf-Idf gives you a score that shows which words are **most discriminating** between documents. Words that occur a lot in one document but don't occur in many documents tell you something special about that document.

- This weight/score evaluates how important a word is to a document in a collection (corpus)
- The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
- [Here is a deeper explanation for tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)


Term frequency tf is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

Inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

Term frequency - Inverse Document Frequency is calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$


<Br><br>
![](https://snag.gy/rBNLtd.jpg)

This enhances terms that are highly specific to a particular document, while suppressing terms that are common to most documents.
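
The three formulas can be checked by hand on a toy corpus. This is a minimal sketch of the plain definitions above. The two-document corpus and the function names are assumptions made for illustration. scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will not match these.

```python
import math

def tf(term, doc_tokens):
    """Term count divided by the number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc_tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "donate the sum of euros".split(),
    "the interview is on site".split(),
]

print(tfidf("the", corpus[0], corpus))     # 0.0, "the" appears in every document
print(tfidf("euros", corpus[0], corpus))   # (1/5) * log(2), about 0.1386
```

As written, `idf` assumes the term occurs in at least one document; a term absent from the whole corpus would cause a division by zero.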

In [None]:
########################################################
## Term frequency - Inverse document Frequency with scikit-learn
########################################################


In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
## http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
tvec = TfidfVectorizer(stop_words='english')

tvec.fit([spam, ham])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [50]:
df  = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                   columns=tvec.get_feature_names_out(),  # get_feature_names() in scikit-learn < 1.0
                   index=['spam', 'ham'])
print(len(df))
print('')
df.transpose().sort_values('spam', ascending=False).head(10).transpose()

2



| | years | euros | contact | personality | linkedin | lukoil | major | million | old | operation |
|---|---|---|---|---|---|---|---|---|---|---|
| spam | 0.290133 | 0.290133 | 0.290133 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 |
| ham | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [51]:
df.transpose().sort_values('ham', ascending=False).head(10).transpose()

| | regards | interview | data | scientist | location | position | let | know | senior | invite |
|---|---|---|---|---|---|---|---|---|---|---|
| spam | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ham | 0.31039 | 0.31039 | 0.31039 | 0.31039 | 0.155195 | 0.155195 | 0.155195 | 0.155195 | 0.155195 | 0.155195 |
