# Introduction to Natural Language Processing (NLP)

<a id='common'></a>
## Common NLP Problems

---
The applications of using text data in statistical modeling are practically infinite. Some examples include:

**Sentiment Analysis** | Determining if what is written is positive or negative (e.g. movie reviews, restaurant reviews, tweets) or if a political writer is left-leaning or right-leaning. 

**Named Entity Recognition** | Classifying names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

**Summarization** | Boiling down large bodies of text to paraphrased versions.

**Topic Modeling** | Pinpointing the topics a body of text belongs to (e.g., auto-tagging news articles).

**Question Answering** | Determining the answer to a human-language question.

**Word Disambiguation** | Many words have more than one meaning; we have to select the meaning that makes the most sense in context. For this problem, we’re typically given a list of words and associated word senses (e.g., from a dictionary or from an online resource such as WordNet).

**Machine Dialog Systems** | Building response systems that react contextually to human input (e.g., Me: "Siri, cook me some bacon." Siri: "How do you like your bacon cooked?").

**Language Detection** | Determining what natural language a given text is written in, including distinguishing between similar languages and dialects (e.g., Serbian vs. Croatian, Indonesian vs. Malay, Québécois vs. European French).

**Machine Translation** | Converting from one natural language to another while preserving the meaning and producing valid output.

**Pragmatic Analysis** | Going from what is _said_ to what is _meant_, which requires context awareness.

<a name="intro"></a>
## Introduction to Text Feature Extraction

---

The models we’ve been using so far accept a two-dimensional matrix of real numbers as input `X` and a target vector of classes or numbers as `y`. 

What if our starting data is unstructured and qualitative (e.g., documents of text) instead of structured and quantitative?

We need a way to fit unstructured data into our usual numeric `X` matrix in order to use the same models. This process is called _feature extraction_.





<a id='simple'></a>
## A Simple Example

In [1]:
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer, chairman of the board of directors of PJSC "LUKOIL." I am 86-years old and I was diagnosed with cancer two years ago. I will be going in for an operation later this week. I decided to will/donate the sum of 8,750,000.00 Euros (eight million seven hundred and fifty thousand euros only etc. etc.
"""

ham = """
Hello,\nI am writing in regards to your application to the position of data scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews, and we would like to invite you for an onsite interview with our senior data scientist, Mr. John Smith. You will find attached to this message further information on date, time, and location of the interview. Please let me know if I can be of any further assistance. Best regards.
"""

print(spam)
print()
print(ham)


Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer, chairman of the board of directors of PJSC "LUKOIL." I am 86-years old and I was diagnosed with cancer two years ago. I will be going in for an operation later this week. I decided to will/donate the sum of 8,750,000.00 Euros (eight million seven hundred and fifty thousand euros only etc. etc.



Hello,
I am writing in regards to your application to the position of data scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews, and we would like to invite you for an onsite interview with our senior data scientist, Mr. John Smith. You will find attached to this message further information on date, time, and location of the interview. Please let me know if I can be of any further assistance. Best regards.



## Some NLP Terminology

---

- a collection of text is a **document**
- a collection of documents is a **corpus** (plural corpora)

In [2]:
corpus = [spam, ham]  # two documents in our corpus

## Can You Think of a Simple Heuristic Rule to Catch Emails Like This?

> _We could check for the presence of words such as "donate", "will", "sum", "cancer", "LinkedIn"_

By defining a simple rule that parses the text for the presence of keywords, we’re performing one of the simplest text feature extraction methods: _word counting_.
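As a rough illustration of such a rule, the sketch below counts the keywords suggested above in a piece of text. The keyword set, the `threshold` value, and the two short snippets are assumptions made for this example; they are not part of the lesson's data.

```python
import re

# Hypothetical keyword list, taken from the suggestion above
SPAM_KEYWORDS = {"donate", "will", "sum", "cancer", "linkedin"}

def keyword_hits(text, keywords=SPAM_KEYWORDS):
    """Return how many times each keyword occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {kw: tokens.count(kw) for kw in sorted(keywords)}

def looks_like_spam(text, threshold=3):
    """Flag the text when at least `threshold` distinct keywords are present."""
    hits = keyword_hits(text)
    return sum(1 for count in hits.values() if count > 0) >= threshold

# Shortened, illustrative snippets in the spirit of the two emails above
spam_snippet = ("I saw your contact information on LinkedIn. I was diagnosed with cancer. "
                "I decided to will/donate the sum of 8,750,000.00 Euros.")
ham_snippet = "We are pleased to inform you that you passed the first round of interviews."

print(keyword_hits(spam_snippet))
print(looks_like_spam(spam_snippet), looks_like_spam(ham_snippet))
```

A rule like this is brittle: the full ham email also contains the word "will", so the threshold matters, and any hand-picked list will miss spam that uses different wording.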


<a id='bow'></a>
## Bag of Words/Word Counting
---
<img src="https://c1.staticflickr.com/5/4070/4654257115_aab01d4e37_b.jpg" width="300px"/>

The bag-of-words model is a simplified representation of the raw data. In this model, a document is represented as the bag of its words.

Bag-of-words representations discard grammar, order, and structure in the text and only track occurrences.
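To make the idea concrete, here is a minimal sketch that builds bags of words with Python's `collections.Counter` and lays them out as rows of a numeric matrix. The helper `bag_of_words`, the regular-expression tokenizer, and the toy documents are assumptions for illustration only.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"\w+", text.lower()))

doc_a = "the dog chased the cat"
doc_b = "the cat slept"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Shuffling the words of a document leaves its bag unchanged
print(bag_of_words("cat the chased dog the") == bag_a)

# One column per vocabulary word, one row per document
vocabulary = sorted(set(bag_a) | set(bag_b))
X = [[bag[word] for word in vocabulary] for bag in (bag_a, bag_b)]
print(vocabulary)
print(X)
```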

---
### Now we can have numeric X features again

<a name="countvectorizer"></a>
## Demo: Scikit-Learn `CountVectorizer`
---

Scikit-learn offers a `CountVectorizer` class:
> "Convert a collection of text documents to a matrix of token counts"


In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(stop_words='english')

In [4]:
corpus = [spam, ham]  # two documents in our corpus

cvec.fit(corpus) # count vectorizer learns the vocabulary of the corpus
matrix_corpus = cvec.transform(corpus)
matrix_corpus

<2x68 sparse matrix of type '<class 'numpy.int64'>'
	with 71 stored elements in Compressed Sparse Row format>

In [5]:
matrix_corpus.todense()

matrix([[1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 0, 0, 1, 1, 1, 1, 2,
         1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
         1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,
         0, 1, 1, 0, 2],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,
         0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 2, 1, 0, 2, 1, 0, 1, 0, 0,
         1, 0, 0, 1, 0]])

In [6]:
df  = pd.DataFrame(matrix_corpus.todense(),
                   columns=cvec.get_feature_names_out(),  # get_feature_names() in scikit-learn < 1.0
                   index=['spam', 'ham'])

df.T.sort_values('spam', ascending=False).head(10).T

Unnamed: 0,years,contact,euros,000,linkedin,lukoil,major,million,mr,old
spam,2,2,2,1,1,1,1,1,1,1
ham,0,0,0,0,0,0,0,0,1,0


In [7]:
df.shape

(2, 68)

<a id='stopwords'></a>
## Stop Words

---

Some words are commonly used and provide no legitimate information about the content of the text.

In [8]:
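# The stop_words module was made private in later scikit-learn releases;
# the list itself can be imported from sklearn.feature_extraction.text.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)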

frozenset({'among', 'up', 'get', 'the', 'because', 'thus', 'put', 'your', 'system', 'someone', 'other', 'five', 'in', 'becoming', 'thence', 'sincere', 'would', 'into', 'much', 'namely', 'move', 'via', 'whereby', 'whither', 'once', 'besides', 'hasnt', 'done', 'ie', 'myself', 'she', 'this', 'fifteen', 'may', 'over', 'against', 'without', 'before', 'serious', 'nine', 'found', 'fifty', 'un', 'no', 'least', 'as', 'mostly', 'until', 'each', 'whatever', 'from', 'name', 'thereupon', 'everywhere', 'onto', 'two', 'their', 'yet', 'is', 'top', 'empty', 'otherwise', 'everything', 'further', 'whenever', 'beyond', 'after', 'seeming', 'bottom', 'front', 'last', 'below', 'was', 'see', 'back', 'might', 'some', 'what', 'again', 'show', 'toward', 'both', 'ten', 'with', 'us', 'thru', 'moreover', 'ever', 'yours', 'be', 'will', 'latterly', 'latter', 'made', 'nowhere', 'could', 'amongst', 'indeed', 'while', 'anyway', 'whereas', 'keep', 'due', 'part', 'than', 'or', 'which', 'even', 'third', 'often', 'his', 'th

### Let's look at the [count vectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what some of its parameters do.

In particular, look at what the following do:

- `encoding`
- `analyzer`
- `lowercase`
- `stop_words`



Let's go back and try eliminating stop words from our CountVectorizer's feature set.
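One way to explore these parameters is to fit the vectorizer several times and compare the vocabulary it learns. The sketch below does this for `lowercase` and `stop_words` on two toy sentences; the helper `learned_vocabulary` and the sentences are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog ate the cat"]

def learned_vocabulary(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

vocab_default = learned_vocabulary()                    # lowercase=True, no stop-word removal
vocab_cased = learned_vocabulary(lowercase=False)       # "The" and "the" become separate features
vocab_no_stop = learned_vocabulary(stop_words="english")  # built-in English stop list removed

print(vocab_default)
print(vocab_cased)
print(vocab_no_stop)
```

Reading the fitted `vocabulary_` attribute keeps the example independent of which method name a given scikit-learn version uses for listing feature names.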

## n-grams

- The unit varies: character, word, etc. (In `CountVectorizer`, the default is word, but you can change this with the `analyzer` param.)

If we're talking about words:

- uni-gram (1-gram) is 1 word
- bi-gram (2-gram) is 2 adjacent words
- tri-gram (3-gram) is 3 adjacent words

The sentence `I am counting words` contains:
- four uni-grams: 
    * `I`
    * `am`
    * `counting`
    * `words`  
- three bi-grams: 
    * `I am` 
    * `am counting`
    * `counting words`   
    
- two tri-grams: 
    * `I am counting`
    * `am counting words` 
    
    
If we're talking about characters instead of words:

- uni-gram (1-gram) is 1 character
- bi-gram (2-gram) is 2 adjacent characters
- tri-gram (3-gram) is 3 adjacent characters

e.g. `_I`, `I_`, `_a`, `am`, `m_`, `_c`, `co`, `ou`, `un`, `nt`, `ti`, `in`, `ng`, `g_` are the bi-grams contained in `I am counting`
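The n-grams listed above can be reproduced with a few lines of plain Python. This is a sketch under simple assumptions: words are split on whitespace, and for character n-grams each word is padded with `_` on both sides, matching the notation used above. The function names are hypothetical.

```python
def word_ngrams(text, n):
    """Return every run of n adjacent words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n, pad="_"):
    """Return character n-grams, padding each word with `pad` on both sides."""
    grams = []
    for word in text.split():
        padded = pad + word + pad
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

sentence = "I am counting words"
print(word_ngrams(sentence, 1))
print(word_ngrams(sentence, 2))
print(word_ngrams(sentence, 3))
print(char_ngrams("I am counting", 2))
```

In `CountVectorizer`, the `ngram_range` parameter plays the role of `n` here. Note that its default `token_pattern` only keeps tokens of two or more characters, so a one-letter word such as `I` would be dropped unless that pattern is changed.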

<a name="hashingvectorizer"></a>
## Scikit-Learn's `HashingVectorizer`

---

We can set the `CountVectorizer` dictionary to a fixed size (using the `max_features` param), keeping only the words with the highest frequencies. However, **we still have to compute a dictionary and hold it in memory.** This could be a problem when...

- we have a large corpus, or
- we're working with streaming applications where we don't know which words we'll encounter in the future

Both problems can be solved using the `HashingVectorizer`, which works similarly to `CountVectorizer` but uses the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing). Each word is mapped to a feature with the use of a [hash function](https://en.wikipedia.org/wiki/Hash_function), which converts it from a word to a "hash". If we encounter that word again in the text, it will be converted to the _same_ hash, allowing us to count word occurrences without retaining a dictionary in memory.

<img src="images/hashing.png" width="500px" />

$$
\text{word} \rightarrow \text{hash} \rightarrow \text{column index}
$$
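The mapping above can be sketched in plain Python. This is not scikit-learn's implementation: `HashingVectorizer` uses MurmurHash3, whereas the sketch below uses `hashlib.md5` as a stand-in so that it runs with the standard library alone. The function names and the whitespace tokenizer are assumptions.

```python
import hashlib

def column_index(word, n_features=2**20):
    """Map a word to a column index: word -> hash -> index in [0, n_features)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()  # stand-in hash function
    return int(digest, 16) % n_features

def hashed_counts(text, n_features=2**20):
    """Count words by column index without storing a vocabulary."""
    counts = {}
    for word in text.lower().split():
        idx = column_index(word, n_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

# The same word always lands in the same column
print(column_index("euros") == column_index("euros"))
print(hashed_counts("euros euros contact"))
```

Nothing about the vocabulary is stored between calls: the column for a word is recomputed from the word itself every time it is seen.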


<a id='hash'></a>
## What is a Hash Function?

---
<div style="overflow-y: hidden; height: 250px">
<img src="https://i.ytimg.com/vi/bs7Wq0Z1uYk/maxresdefault.jpg" />
</div>

### Hashing

A hash value is a number generated from a string of text. It's also referred to simply as "hash" or "message digest."

The hash is substantially smaller than the text itself and is generated by a formula in such a way that it's extremely unlikely some other text will produce the same hash value.

Think of the hash as a code that represents the original text in a more condensed format.

![](images/hash_function.png)

## The main drawback

The hashing trick is *not reversible* - we can't tell which original words correspond with the important features.
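A small sketch can show why the mapping cannot be inverted. With fewer columns than distinct words, several words must share a column, and the column index alone no longer identifies a word. As an assumption for illustration, `hashlib.md5` stands in for the real hash function and the number of columns is set far lower than anyone would use in practice.

```python
import hashlib

def bucket(word, n_features):
    """Hash a word into one of n_features buckets (md5 used as a stand-in hash)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % n_features

words = ["donate", "cancer", "euros", "interview", "scientist"]
n_features = 4  # deliberately tiny: five words cannot all get their own bucket

buckets = {}
for word in words:
    buckets.setdefault(bucket(word, n_features), []).append(word)

for index, members in sorted(buckets.items()):
    print(index, members)

# Pigeonhole principle: at least one bucket holds more than one word
print(max(len(members) for members in buckets.values()) >= 2)
```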

### Let's repeat the vectorization using a `HashingVectorizer`.



In [9]:
from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer(stop_words='english', norm=None, alternate_sign=False)
hvec.fit(corpus)
matrix_corpus = hvec.transform(corpus)

df  = pd.DataFrame(matrix_corpus.todense(), index=['spam', 'ham'])  

df.T.sort_values('spam', ascending=False).head(10).T

Unnamed: 0,288820,449291,484920,816269,543187,826540,51251,608416,752544,946487
spam,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df.shape

(2, 1048576)

### What new parameters does this vectorizer offer?

Check out [the `HashingVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) and compare it to [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).



- `n_features`
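As a quick check of what `n_features` controls, the sketch below transforms two toy sentences with the default setting and with a much smaller value, then compares the shapes of the resulting matrices. The sentences and the choice of `2**10` are assumptions for this example.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["I will donate the sum of eight million euros",
        "We would like to invite you for an onsite interview"]

default_hvec = HashingVectorizer(norm=None, alternate_sign=False)
small_hvec = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)

# No fit is needed: the vectorizer is stateless and learns no vocabulary
default_matrix = default_hvec.transform(docs)
small_matrix = small_hvec.transform(docs)

print(default_matrix.shape)
print(small_matrix.shape)
```

A smaller `n_features` saves memory in downstream models but makes it more likely that different words collide in the same column.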

<a name="downsides-bow"></a>
## Downsides to Bag of Words

---

Bag-of-words approaches ignore the structure of a sentence and merely assess the presence of specific words or word combinations.


When counting only unigrams, these two documents produce identical bags of words:


- It was bad. It was not good.
- It was good. It was not bad. <img src="images/bag_of_words.png" width="180px"/>
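The claim about these two documents can be checked directly. The sketch below counts word n-grams with `collections.Counter`; the helper `ngram_bag` and its simple tokenizer are assumptions, and for brevity the bigrams are allowed to run across the sentence boundary.

```python
from collections import Counter
import re

def ngram_bag(text, n):
    """Count the word n-grams in a text after lowercasing and dropping punctuation."""
    words = re.findall(r"\w+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

doc_1 = "It was bad. It was not good."
doc_2 = "It was good. It was not bad."

print(ngram_bag(doc_1, 1) == ngram_bag(doc_2, 1))  # unigram bags are identical
print(ngram_bag(doc_1, 2) == ngram_bag(doc_2, 2))  # bigram bags differ
print(ngram_bag(doc_1, 2)[("not", "good")], ngram_bag(doc_2, 2)[("not", "good")])
```

Adding bigrams recovers some local word order (for example, `not good` versus `not bad`), at the cost of a larger feature space.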


The same word can have multiple meanings in different contexts:

- There's wood floating in the **sea**.
- Mike's in a **sea** of trouble with the move.


How do we teach a computer to disambiguate?


<a name="tfidf"></a>
## Term Frequency-Inverse Document Frequency (TF-IDF)

---

A TF-IDF score tells us which words are most discriminating between documents. Words that occur often in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).
- The importance increases in proportion to the number of times a word appears in a document but is offset by the frequency of the word in the corpus.

Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency is a measure of how much information the word provides — that is, whether the term is common or rare across all documents. 

The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.


**Let's see how it's calculated:**

Term frequency (`tf`) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (`idf`) of a term is calculated by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$
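The three formulas can be worked through on a tiny corpus. The sketch below follows them term by term; it assumes the natural logarithm (the formula above does not fix a base), whitespace tokenization, and a made-up three-document corpus.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term divided by the document length."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Inverse document frequency: log(number of documents / documents containing the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / n_containing)

def tfidf(term, document, corpus):
    """TF-IDF: the product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)

# Hypothetical corpus: each document is a list of tokens
corpus = [
    "the euros are in the bank".split(),
    "the interview is on monday".split(),
    "the bank called about the interview".split(),
]

for term in ("the", "euros", "bank"):
    print(term, round(tfidf(term, corpus[0], corpus), 4))
```

In the first document, `the` appears in every document, so its idf is `log(3/3) = 0` and its score vanishes. `euros` appears in only one document and scores higher than `bank`, which appears in two.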


> **You might ask: But what is `log` used for?**<br>
> Good question!

The log function looks like this:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7f/Graph_of_common_logarithm.svg/250px-Graph_of_common_logarithm.svg.png)

Taking the log helps "dampen" the effect of very high values.  

For example, a document containing a term 2 million times is not twice as relevant as a document containing the same term 1 million times.  They are both very relevant documents!
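The same dampening applies to the ratio inside the idf formula. The sketch below compares the raw ratio of documents with its base-10 logarithm for a hypothetical corpus of one million documents; the corpus size and the document counts are made-up numbers chosen for illustration.

```python
import math

n_documents = 1_000_000  # hypothetical corpus size

rows = []
for docs_with_term in (1, 1_000, 500_000, 1_000_000):
    ratio = n_documents / docs_with_term
    rows.append((docs_with_term, ratio, math.log10(ratio)))

for docs_with_term, ratio, log_ratio in rows:
    print(f"{docs_with_term:>9,} docs with term | ratio {ratio:>11,.1f} | log10 {log_ratio:.2f}")
```

The raw ratio spans six orders of magnitude, while its logarithm stays in a narrow range, so a very rare term does not swamp every other feature.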

<a id='tfidf-vec'></a>
## Practice Using the `TfidfVectorizer`

---

### Why Use TF-IDF?
- Common words are penalized.
- Rare words have more influence.

Scikit-learn provides a TF-IDF vectorizer that works similarly to the other vectorizers we've covered.
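Scikit-learn's defaults differ slightly from the textbook formulas above: term frequency is the raw count, the idf is smoothed (`smooth_idf=True`), and each row is scaled to unit length (`norm='l2'`). The sketch below recomputes the scores by hand under those assumptions and compares them with `TfidfVectorizer` on two made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["euros euros contact interview", "interview data scientist"]

# Raw term counts; both vectorizers sort their vocabulary, so columns line up
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf (the smooth_idf=True default): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts times idf, then scale each row to unit Euclidean length (norm='l2')
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_result = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sklearn_result))
```

Because of the row normalization, scores from `TfidfVectorizer` are comparable within a document but are not the plain `tf * idf` products from the formulas above.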



In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words='english')
tvec.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [12]:
corpus_matrix = tvec.transform(corpus)

df  = pd.DataFrame(corpus_matrix.todense(),
                   columns=tvec.get_feature_names_out(),  # get_feature_names() in scikit-learn < 1.0
                   index=['spam', 'ham'])

df.transpose().sort_values('spam', ascending=False).head(10).transpose()

Unnamed: 0,years,euros,contact,personality,linkedin,lukoil,major,million,old,operation
spam,0.290133,0.290133,0.290133,0.145067,0.145067,0.145067,0.145067,0.145067,0.145067,0.145067
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
df.transpose().sort_values('ham', ascending=False).head(10).transpose()

Unnamed: 0,scientist,regards,data,interview,round,hooli,inform,interviews,invite,like
spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ham,0.31039,0.31039,0.31039,0.31039,0.155195,0.155195,0.155195,0.155195,0.155195,0.155195


### "Real" Example

---

Let's test this stuff out on some SMS text data.  Can you predict real vs. promotional texts just based on what is written?  Let's see...

In [14]:
df = pd.read_csv('data/sms.csv', index_col=0)
df.head()

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [15]:
df.shape

(5574, 2)

In [16]:
df['class'].value_counts()/df.shape[0]

ham     0.865985
spam    0.134015
Name: class, dtype: float64

In [17]:
from sklearn.model_selection import train_test_split

x = df['text'].values
y = df['class'].map({'ham':0, 'spam':1})

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [18]:
# Vectorize
cvec = CountVectorizer(stop_words='english')
x_train_counts = cvec.fit_transform(x_train)
x_test_counts = cvec.transform(x_test)

In [19]:
# or:
# tfidf_vec = TfidfVectorizer()
# x_train_counts = tfidf_vec.fit_transform(x_train)
# x_test_counts = tfidf_vec.transform(x_test)

In [20]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(x_train_counts, y_train)
log_reg.score(x_test_counts, y_test)



0.9820652173913044

In [21]:
predictions = log_reg.predict(x_test_counts)

In [22]:
df = pd.DataFrame(list(zip(predictions, y_test, x_test)), columns=['predictions', 'label', 'text_msg'])

In [23]:
df[df.predictions != df.label]

Unnamed: 0,predictions,label,text_msg
26,0,1,Kit Strip - you have been billed 150p. Netcoll...
30,0,1,Call Germany for only 1 pence per minute! Call...
45,0,1,I'd like to tell you my deepest darkest fantas...
295,0,1,Urgent Ur £500 guaranteed award is still uncla...
300,0,1,2/2 146tf150p
320,0,1,Loans for any purpose even if you have Bad Cre...
366,0,1,CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
459,0,1,tddnewsletter@emc1.co.uk (More games from TheD...
524,0,1,Hello darling how are you today? I would love ...
533,0,1,Fantasy Football is back on your TV. Go to Sky...


<a id='resources'></a>
## Additional Resources

---

- Documentation: 
    - [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). 
    - [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). 
    - [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
- A list of stop words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt).
- Wikipedia's [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and [hash functions](https://en.wikipedia.org/wiki/Hash_function) entries are a great place to turn for more information on hashing.
- [Term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
- Check out Charlie Greenbacker's [introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/).
- Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0).
