> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser type in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`


<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Natural Language Processing

_Authors: Dave Yerrington (SF)_

---


![](https://snag.gy/uvESGH.jpg)

### Learning Objectives
*After this lesson, you will be able to:*
- Extract features from unstructured text using Scikit Learn
- Identify Parts of Speech using NLTK
- Remove stop words
- Describe how TFIDF works

### Student Pre-Work
*Before this lesson, you should already be able to:*
- Use [nltk.download()](http://www.nltk.org/data.html) to download additional corpora when needed
- Describe what a transformer is in Scikit Learn and use it
- Recognize basic principles of English language syntax

### Lesson Guide
- [Introduction to text feature extraction](#intro)
- [An NLP project: rapstats.io](#rapstats)
- [Common NLP problems](#common)
- [Some common NLP models](#models)
- [A simple example](#simple)
- [Bag-of-words / word counting](#bow)
    - [Bag-of-words demo](#bow-demo)
    - [Unicode: a common pitfall](#unicode)
- [Demo: sklearn `CountVectorizer`](#countvectorizer)
- [What is a hash function?](#hash)
- [Sklearn's `HashingVectorizer`](#hashingvectorizer)
- [Downsides to bag-of-words](#downsides-bow)
- [Segmentation](#segmentation)
    - [NLTK sentencer](#nltk-sentencer)
- [Demo: stemming with NLTK](#stem-nltk)
    - [Stemming](#stemming)
    - [Stemming group activity](#group)
- [Stop words](#stopwords)
- [Part of speech tagging](#pos)
- [Term frequency - inverse document frequency](#tfidf)
- [Practice using `TfidfVectorizer`](#tfidf-vec)
- [Conclusion](#conclusion)
- [Additional resources](#resources)

<a name="intro"></a>
## Introduction to text feature extraction

---

The models we have been using so far accept a 2D matrix of real numbers as input `X` and a target vector of classes or numbers `y`. What if our starting point data is not given in the form of a table of numbers, but rather is unstructured? This is the case when working with text documents.

We need a way to go from unstructured data to our numeric `X` matrix in order to use the same models. This is called _feature extraction_ and this lesson is dedicated to it.

The applications of using text data in statistical modeling are practically infinite. Some examples include:
- Sentiment analysis of Yelp reviews
- Identifying topics of news articles
- Classification of political authors

<br>
<center>
# Y ~ "YOLO 4life ^^ BBQ@@ OMG LOL!"
</center>
<br>

<img src="https://snag.gy/FoaBeK.jpg" style="width: 400px; float: left; margin-right: 20px;">

<img src="https://snag.gy/Qz0mav.jpg" style="width: 150px; float: left; margin-right: 50px;">

<img src="https://snag.gy/6Lu9aC.jpg">


<a id='rapstats'></a>
## An example NLP Project:  rapstats.io

---

<a href="http://rapstats.io"><img src="https://snag.gy/8GSVqf.jpg"></a>

<img src="https://snag.gy/8eJNFv.jpg" style="width: 300px; float: left;">
<img src="https://snag.gy/2Hz0o7.jpg" style="width: 300px;">

**See Also:**

- [Largest Vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html)
- [Rap Genius: Rapstats](http://genius.com/rapstats)
- [Rap Lyric Generator, Hieu Nguyen, Brian Sa](http://nlp.stanford.edu/courses/cs224n/2009/fp/5.pdf)

<a id='common'></a>
## Common NLP problems

---

The table below details some of the most common problems and tasks in the vast field of natural language processing (NLP).

| | |
|-|-|
| **Sentiment Analysis** | Is what is written positive or negative? | 
| **Named Entity Recognition** | Classify names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. |
| **Summarization** | Boiling down large bodies of text to paraphrased versions |
| **Topic Modeling** | What topics does a body of text belong to? (e.g., auto-tagging of news articles) |
| **Question answering** | Given a human-language question, determine its answer. |
| **Word disambiguation** | Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet. |
| **Machine dialog systems** | Building response systems that react contextually to human input (e.g., Me: "Siri, cook me some bacon." Siri: "How do you like your bacon?") |


See Also:

- [News Headline Analysis](http://nbviewer.jupyter.org/github/AYLIEN/headline_analysis/blob/06f1223012d285412a650c201a19a1c95859dca1/main-chunks.ipynb?utm_content=buffer5d40c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
- [Sentiment + Robot Classification in Movies](http://nbviewer.jupyter.org/github/cojette/ClusteringRobotsinMovie/blob/master/Classification%20of%20Robots%20in%20Movies.ipynb)
- [Text Summarization with Gensim](http://nbviewer.jupyter.org/github/piskvorky/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb)
- [Sentiment Analysis Intro](http://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/SentimentAnalysis.ipynb)

<a id='models'></a>
## Some common NLP models and terms

---

- LSI (Latent Semantic Indexing)
- LDA (Latent Dirichlet Allocation)
- HDP (Hierarchical Dirichlet Process)
- Word2Vec
- LogisticRegression
- Naive Bayes
- SVM
- CountVectorizer
- TF-IDF (term frequency - inverse document frequency)
- DTM (document term matrix)

> **Note:** This is not an exhaustive list, nor will we be covering all of these models in class. NLP is a very deep and very broad area of data science that could warrant its own immersive entirely!

<a id='simple'></a>
## A Simple Example
---

Suppose we are building a spam classifier. Inputs are emails and the output is a binary label.

Here's an example of an input email from each class:


```python
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
"""

ham = """
Hello,\nI am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.
"""
print(spam)
print()
print(ham)
```


## Can you think of a simple heuristic rule to catch email like this?

> _We could check for the presence of the words Donate, WILL, sum, cancer, LinkedIn and similar._

By defining a simple rule that parses the text for presence of keywords we are performing one of the simplest text feature extraction methods: _binary word counting_.
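
The keyword rule described above can be sketched in a few lines. This is only an illustration: the keyword set comes from the answer above, the two strings are shortened excerpts of the emails, and the names `KEYWORDS` and `keyword_flags` are made up for this example.

```python
import re

# Keywords suggested in the answer above (lowercased).
KEYWORDS = {"donate", "will", "sum", "cancer", "linkedin"}

def keyword_flags(text, keywords=KEYWORDS):
    """Return a 0/1 flag per keyword: 1 if the keyword occurs in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {word: int(word in tokens) for word in sorted(keywords)}

# Shortened excerpts of the two emails, so the block runs on its own.
spam_excerpt = ("I saw your contact information on LinkedIn. "
                "I was diagnosed with cancer 2 years ago. "
                "I decided to WILL/Donate the sum of 8,750,000.00 Euros")
ham_excerpt = ("We are pleased to inform you that you passed the first round "
               "of interviews. You will find attached to this message "
               "further information.")

spam_flags = keyword_flags(spam_excerpt)
ham_flags = keyword_flags(ham_excerpt)

print(spam_flags, sum(spam_flags.values()))
print(ham_flags, sum(ham_flags.values()))
```

Note that the ham excerpt also triggers the `will` flag, which hints at why hand-written keyword rules need care.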


<a id='bow'></a>
## Bag-of-words / word counting
---

The bag-of-words model is a simplified representation of the raw data. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words.

Bag-of-words representations discard grammar, order, and structure in the text, but track occurrences.

<a id='bow-demo'></a>
### Load this up in a blank notebook and try this out!

```python
from collections import Counter
print(Counter(spam.lower().split()))
print()
print(Counter(ham.lower().split()))
```

    Counter({'i': 7, 'of': 4, 'and': 3, 'is': 2, 'etc.': 2, 'am': 2, 'an': 2, 'have': 2, 'in': 2, 'your': 2, 'to': 2, 'years': 2, 'with': 2, 'this': 2, 'contact': 2, 'the': 2, 'major': 1, 'old': 1, 'cancer': 1, 'outstanding': 1, 'seven': 1, 'decided': 1, 'through': 1, 'carefully': 1, 'euros(eight': 1, 'seem': 1, 'saw': 1, 'information': 1, 'for': 1, 'euros': 1, 'fifty': 1, '86': 1, 'sum': 1, '"lukoil".': 1, 'only': 1, 'pjsc': 1, 'mr.': 1, '2': 1, 'linkedin.': 1, 'will/donate': 1, 'you': 1, 'hundred': 1, 'was': 1, 'personality.': 1, 'chairman': 1, 'profile': 1, 'you.': 1, 'hello,': 1, 'ago.': 1, 'read': 1, 'going': 1, 'thousand': 1, 'million': 1, 'grayfer': 1, 'reason': 1, 'be': 1, 'one': 1, 'why': 1, 'on': 1, 'name': 1, 'week.': 1, '8,750,000.00': 1, 'later': 1, 'board': 1, 'operation': 1, 'will': 1, 'directors': 1, 'diagnosed': 1, 'valery': 1, 'my': 1})
    
    Counter({'to': 5, 'you': 4, 'of': 4, 'the': 3, 'and': 2, 'we': 2, 'scientist': 2, 'data': 2, 'i': 2, 'further': 2, 'this': 1, 'regards.': 1, 'find': 1, 'information': 1, 'am': 1, 'an': 1, 'at': 1, 'in': 1, 'our': 1, 'message': 1, 'pleased': 1, 'best': 1, 'if': 1, 'will': 1, 'would': 1, 'with': 1, 'interviews': 1, 'please': 1, 'writing': 1, 'application': 1, 'mr.': 1, 'location': 1, 'passed': 1, 'interview': 1, 'for': 1, 'john': 1, 'date,': 1, 'be': 1, 'hello,': 1, 'x.': 1, 'invite': 1, 'that': 1, 'any': 1, 'interview.': 1, 'regards': 1, 'let': 1, 'know': 1, 'hooli': 1, 'on-site': 1, 'me': 1, 'on': 1, 'your': 1, 'like': 1, 'assistance.': 1, 'attached': 1, 'senior': 1, 'inform': 1, 'smith.': 1, 'can': 1, 'time': 1, 'position': 1, 'first': 1, 'round': 1, 'are': 1})


In the above example we counted the number of times each word appeared in the text. Note that since we included all the words in the text, we created a dictionary that contains many words with only one appearance.

<a id='unicode'></a>
### Unicode: a common pitfall

#### What happens when we get a character that falls outside the expected character set, for instance a Japanese Katakana character?  (片仮名 / カタカナ)

**UTF-8 vs Latin**
[Latin Ord Reference](https://media-mediatemple.netdna-ssl.com/wp-content/uploads/2012/04/iso-8859-5-php-orange.png)


- In Python 2, plain `str` objects are byte strings, and mixing them with non-ASCII text often raises a `UnicodeDecodeError` or `UnicodeEncodeError`
- Decoding bytes with the wrong codec (for example, reading UTF-8 bytes as Latin-1) does not always fail; it can silently produce garbled characters
- This problem can be very frustrating to deal with

Luckily, sklearn has robust classes for text feature extraction. Use sklearn's built-in text preprocessing methods when possible.  Always save/encode your text as UTF-8 when there are options available to do so.
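
The pitfall can be reproduced with the standard library alone. The sketch below assumes Python 3, where `str` is Unicode and `bytes` is a separate type; the variable names are only illustrative.

```python
text = "カタカナ"

# Encoding to UTF-8 works; each of these characters takes 3 bytes.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))

# Decoding those bytes with the wrong codec does not raise an error,
# it silently yields garbled text instead.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Encoding to a codec that lacks the characters fails loudly.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print("encodable as latin-1:", latin1_ok)

# Decoding with the codec that was used for encoding restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored == text)
```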

<a name="countvectorizer"></a>
## Demo: sklearn `CountVectorizer`
---

Sklearn offers a `CountVectorizer` class with many configurable options:


```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
```


```python
cvec = CountVectorizer()
cvec.fit([spam])
```




    CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), preprocessor=None, stop_words=None,
            strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
            tokenizer=None, vocabulary=None)




```python
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df  = pd.DataFrame(cvec.transform([spam]).todense(),
             columns=cvec.get_feature_names_out())

df.transpose().sort_values(0, ascending=False).head(10).transpose()
```

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>of</th>
      <th>and</th>
      <th>your</th>
      <th>contact</th>
      <th>is</th>
      <th>in</th>
      <th>have</th>
      <th>euros</th>
      <th>the</th>
      <th>this</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>4</td>
      <td>3</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
    </tr>
  </tbody>
</table>
</div>

**Note that there are several parameters to tweak.**

### Spend a couple of minutes scanning the documentation to figure out what those parameters do. 

Share a few takeaways from the documentation in groups. What arguments and capabilities stand out to you? How do the arguments affect the parsing behavior?

[Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
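
To make two of the defaults shown in the repr above concrete (`lowercase=True` and the `token_pattern`), here is a rough pure-Python approximation of the counting step. It is a sketch for intuition, not sklearn's actual implementation; `count_tokens` is a made-up helper name and the sentence is adapted from the spam email.

```python
import re
from collections import Counter

# The default token_pattern from the CountVectorizer repr:
# a token is a run of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def count_tokens(text):
    """Lowercase the text, extract tokens with the regex, and count them."""
    return Counter(re.findall(TOKEN_PATTERN, text.lower()))

counts = count_tokens(
    "I am 86 years old and I was diagnosed with cancer 2 years ago."
)
print(counts.most_common(3))
print("i" in counts, "2" in counts, "86" in counts)
```

Single-character tokens such as `i` and `2` never match the pattern, which is why `i` tops the plain `Counter` output earlier but is absent from the `CountVectorizer` table.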

<a id='hash'></a>
## What is a hash function?

---
![](https://i.ytimg.com/vi/bs7Wq0Z1uYk/maxresdefault.jpg)

### Hashing

A hash value (or simply hash), also called a message digest, is a number generated from a string of text. 

The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.

Think of the hash as a code that represents the original text in a more condensed format.
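
As a quick illustration, the standard library's `hashlib` can compute such a digest. SHA-256 is used here only as a convenient example of a hash function; `text_hash` is a made-up helper name.

```python
import hashlib

def text_hash(text):
    """Return the SHA-256 digest of a string as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short_digest = text_hash("spam")
long_digest = text_hash("spam " * 10000)

# The digest has the same length no matter how long the input is.
print(len(short_digest), len(long_digest))

# The same input always produces the same hash; different inputs
# are extremely unlikely to collide.
print(text_hash("spam") == short_digest)
print(text_hash("ham") == short_digest)
```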

<a name="hashingvectorizer"></a>
## Sklearn's `HashingVectorizer`

---

As you have seen, we can set the `CountVectorizer` dictionary to have a fixed size, keeping only words of certain frequencies. However, we still have to compute a dictionary and hold it in memory. This could be a problem when we have a large corpus, or in streaming applications where we don't know which words we will encounter in the future.

Both problems can be solved using the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences calculated with the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing). Each word is mapped to a feature with the use of a [hash function](https://en.wikipedia.org/wiki/Hash_function) that converts it to a hash. If we encounter that word again in the text, it will be converted to the same hash, allowing us to count word occurrences without retaining a dictionary in memory. This is very convenient!

The main drawback of this trick is that it's *not possible to compute the inverse transform*, and thus we lose information on what words the important features correspond to. The hash function employed is the signed 32-bit version of Murmurhash3.
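
The idea can be sketched without sklearn. The version below is an assumption-laden toy: it swaps in MD5 from `hashlib` because MurmurHash3 is not in the standard library, it uses unsigned counts, and it uses a tiny `n_features` so the vector is easy to inspect. `hashed_counts` is a made-up name.

```python
import hashlib
import re

def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size vector using the hashing trick."""
    vector = [0] * n_features
    for token in re.findall(r"\w\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        # The hash decides the column; no vocabulary is stored anywhere.
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

short_vec = hashed_counts("free money free")
long_vec = hashed_counts("I decided to donate the sum of eight million euros "
                         "after my operation later this week")

# Both vectors have the same length regardless of the text length.
print(len(short_vec), len(long_vec))
print(short_vec)
```

With so few columns, different words will often share a bucket (a collision). A much larger feature space makes collisions rare, at the cost of no longer being able to map a column back to a word.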

## What characteristics should feature extraction from text satisfy?

It should return a vector of fixed size, regardless of the length of a text.

## Using the code above as example, let's repeat the vectorization using a `HashingVectorizer`.

Look up how to do this and then try it.


```python
from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer()
hvec.fit([spam])
```



```python
#
# .todense() converts the sparse result into a dense matrix
# (1 row x n_features columns).
#
df  = pd.DataFrame(hvec.transform([spam]).todense())  

df.transpose().sort_values(0, ascending=False).head(10).transpose()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>479532</th>
      <th>144749</th>
      <th>174171</th>
      <th>832412</th>
      <th>828689</th>
      <th>994433</th>
      <th>1005907</th>
      <th>170062</th>
      <th>675997</th>
      <th>959146</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.338062</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.169031</td>
      <td>0.084515</td>
    </tr>
  </tbody>
</table>
</div>


### What new parameters does this vectorizer offer?

Go ahead and go back to the documentation and compare to the `CountVectorizer`.


> Answer:
- n_features

<a name="downsides-bow"></a>
## Downsides to bag-of-words

---

Bag-of-words approaches like the one outlined above completely ignore the structure of a sentence. BOW models merely assess the presence of specific words or word combinations.

The same word can have multiple meanings in different contexts. Consider for example the following two sentences:

- There's wood floating in the **sea**
- Mike's in a **sea** of trouble with the move

How do we teach a computer to disambiguate? Later we will cover some other techniques that may help.
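
A tiny illustration of the lost structure: two sentences with opposite meanings produce exactly the same bag of words. The sentences are invented for this example.

```python
from collections import Counter

bag_a = Counter("the dog bit the man".split())
bag_b = Counter("the man bit the dog".split())

print(bag_a)
# Word order is gone, so the two bags are indistinguishable.
print(bag_a == bag_b)
```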


<a id='segmentation'></a>
## Segmentation

---

_Segmentation_ is a technique to **identify sentences** within a body of text. Language is not a continuous uninterrupted stream of words: punctuation serves as a guide to group together words that convey meaning when contiguous.




```python
easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"

easy_split_text = ["I went to the zoo today.",
                   "What do you think of that?",
                   "I bet you hate it!",
                   "Or maybe you don't"]
```


```python
def simple_sentencer(text):
    '''take a string called `text` and return
    a list of strings, each containing a sentence'''
    
    sentences = []
    substring = ''
    for c in text:
        if c in ('.', '!', '?'):
            sentences.append(substring + c)
            substring = ''
        else:
            substring += c
    return sentences

simple_sentencer(easy_text)
```

`# Result:`

    ['I went to the zoo today.',
     ' What do you think of that?',
     ' I bet you hate it!']

The function above doesn't work perfectly. On the other hand, the Python NLTK library offers a more robust and easy-to-use sentencer.
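
Before reaching for NLTK, the two visible flaws (leading spaces and the dropped final fragment) could be patched with a regular expression. This is a hedged sketch, not a replacement for a trained tokenizer; `regex_sentencer` is a made-up name.

```python
import re

def regex_sentencer(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

easy_text = ("I went to the zoo today. What do you think of that? "
             "I bet you hate it! Or maybe you don't")
sentences = regex_sentencer(easy_text)
print(sentences)

# The rule is still naive: an abbreviation ends a "sentence" too early.
tricky = regex_sentencer("Mr. Smith arrived. He sat.")
print(tricky)
```

The second call splits after `Mr.`, which is the kind of case a more robust sentencer is designed to handle.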


<a id='nltk-sentencer'></a>
### There's an easier way to do the same thing!

```python
from nltk.tokenize import PunktSentenceTokenizer
sent_detector = PunktSentenceTokenizer()
sent_detector.sentences_from_text(easy_text)
```

    ['I went to the zoo today.',
     'What do you think of that?',
     'I bet you hate it!',
     "Or maybe you don't"]


## Does NLTK offer other Tokenizers? Use `nltk.download()` to explore the available packages.

Try it, we'll wait.

    # import nltk
    # nltk.download()

<a name="stem-nltk"></a>
## Demo: stemming with NLTK

---

**Text normalization** is the process of converting slightly different versions of words with essentially equivalent meaning into the same features.

For example: LinkedIn sees 6000+ variations of the title "Software Engineer" and 8000+ variations of the word "IBM".

### What are other common cases of text that could need normalization?


- Person titles (Mr. MR. DR etc.)
- Dates (10/03, March 10 etc.)
- Numbers
- Plurals
- Verb conjugations
- Slang
- SMS abbreviations

### Stemming

It would be wrong to consider the words "MR." and "mr" to be different features. More generally, we need a technique that normalizes different forms of a word to a common root. This technique is called _stemming_.

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

As we did above we could define a Stemmer based on rules:


```python
def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''
    
    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])  # strip all three characters of 'ing'
        else:
            new_bag.append(token)

    return new_bag

stem(['Science', 'Scientist'])
```




    ['Scien', 'Scien']

Luckily for us, NLTK contains several robust stemmers.


```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('Swimmed'))
print(stemmer.stem('Swimming'))
```

    Swim
    Swim


<a id='group'></a>
### Stemming group activity


There are other stemmers available in NLTK. Let's split the class into two teams and have a look at [this article](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html). One team will focus on the pros of the Porter Stemmer, the other team will focus on the pros of the Snowball Stemmer.

You have 5 minutes to read, then each side will have 2 minutes to convince the other side about their stemmer.

<center>

<h3>We're data people.  We must research.  We will advocate.</h3>
<br>

</center>

![](https://mlpforums.com/uploads/post_images/sig-4104702.0RpMdUv.jpg)


<a id='stopwords'></a>
## Stop words

---

Some words are very common and carry little useful information about the content of the text.

#### Can you give some examples?

We should remove these _stop words_. Note that each language has different stop words.


```python
from nltk.corpus import stopwords
stop = stopwords.words('english')
sentence = "this is a foo bar sentence"
print([i for i in sentence.split() if i not in stop])
```

    ['foo', 'bar', 'sentence']


<a id='pos'></a>
## Part of speech tagging

---

Each word has a specific role in a sentence (verb, noun, etc.). Part-of-speech (POS) tagging is a feature extraction technique that attaches a tag to each word in the sentence in order to provide a more precise context for further analysis. This is often a resource-intensive process, but having the grammatical features can sometimes improve the accuracy of our models.


```python
from nltk.tag import pos_tag
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
pos_tag(tok.tokenize("today is a great day to learn nlp"))
```

    [('today', 'NN'),
     ('is', 'VBZ'),
     ('a', 'DT'),
     ('great', 'JJ'),
     ('day', 'NN'),
     ('to', 'TO'),
     ('learn', 'VB'),
     ('nlp', 'NN')]
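
The tags follow the Penn Treebank convention. As a small sketch of how the tagged output might be used downstream, the block below copies the result above into a list (so it runs without NLTK), attaches a short description to each tag, and keeps only the nouns. `TAG_MEANINGS` is an illustrative lookup covering just the tags that appear here.

```python
# Tagged tokens copied from the output above.
tagged = [('today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
          ('day', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('nlp', 'NN')]

# Short descriptions of the Penn Treebank tags seen in this sentence.
TAG_MEANINGS = {
    'NN': 'noun, singular or mass',
    'VBZ': 'verb, 3rd person singular present',
    'DT': 'determiner',
    'JJ': 'adjective',
    'TO': 'the word "to"',
    'VB': 'verb, base form',
}

for word, tag in tagged:
    print(word, tag, TAG_MEANINGS[tag])

# Example of a grammatical feature: keep only the nouns.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```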

<a name="tfidf"></a>
## Term frequency - inverse document frequency (tf-idf)

---

A tf-idf score tells us which words are most discriminating between documents. Words that occur a lot in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (aka corpus).
- The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

**Let's see how it is calculated:**

Term frequency tf is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

Inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

Term frequency - Inverse Document Frequency is calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$
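
The three formulas above translate almost directly into code. The sketch below uses an invented three-document toy corpus and made-up function names; it follows the textbook definitions given here, not sklearn's implementation. sklearn's `TfidfVectorizer` by default smooths the idf and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus: document name -> list of tokens (invented for illustration).
docs = {
    "spam": "donate euros donate cancer".split(),
    "ham": "interview data scientist interview".split(),
    "note": "data euros".split(),
}

def tf(term, tokens):
    """Occurrences of the term divided by the number of terms in the document."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_name, corpus):
    return tf(term, corpus[doc_name]) * idf(term, corpus)

# "donate" appears twice in spam and nowhere else: high score.
donate_score = tfidf("donate", "spam", docs)
# "euros" appears once in spam and also in another document: lower score.
euros_score = tfidf("euros", "spam", docs)

print(round(donate_score, 3), round(euros_score, 3))
```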


### tf-idf visually

![](https://snag.gy/rBNLtd.jpg)

This enhances terms that are highly specific of a particular document, while suppressing terms that are common to most documents.

> **Someone will ask:  "But what is log used for!?"**<br>
> Good question!  The logarithm is a sublinear transformation: it compresses very large ratios, so extremely rare terms are still weighted more heavily than common ones without completely dominating the score.

> "...any linear function $g$, for sufficiently large input $f$ grows slower than $g$" -Wikipedia
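
To see the effect of the logarithm, compare the raw ratio with its log for a hypothetical corpus of one million documents (the numbers are invented for illustration).

```python
import math

n_documents = 1_000_000

rows = []
for n_containing in (1, 1_000, 1_000_000):
    ratio = n_documents / n_containing
    rows.append((n_containing, ratio, round(math.log(ratio), 2)))

# Columns: documents containing the term, raw ratio, log of the ratio.
for row in rows:
    print(row)
```

The raw ratio spans six orders of magnitude, while its logarithm stays within a narrow range, and a term that appears in every document gets an idf of zero.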

<a id='tfidf-vec'></a>
## Practice using `TfidfVectorizer`

---

### Why Use TFIDF?
- Common words are penalized
- Rare words have more influence

Sklearn provides a tf-idf vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

As you did above, import and initialize the `TfidfVectorizer`, then fit the spam and ham data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english')
tvec.fit([spam, ham])
```

```python
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df  = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                   columns=tvec.get_feature_names_out(),
                   index=['spam', 'ham'])

df.transpose().sort_values('spam', ascending=False).head(10).transpose()
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>years</th>
      <th>euros</th>
      <th>contact</th>
      <th>personality</th>
      <th>linkedin</th>
      <th>lukoil</th>
      <th>major</th>
      <th>million</th>
      <th>old</th>
      <th>operation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>spam</th>
      <td>0.287128</td>
      <td>0.287128</td>
      <td>0.287128</td>
      <td>0.143564</td>
      <td>0.143564</td>
      <td>0.143564</td>
      <td>0.143564</td>
      <td>0.143564</td>
      <td>0.143564</td>
      <td>0.143564</td>
    </tr>
    <tr>
      <th>ham</th>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>
</div>

```python
df.transpose().sort_values('ham', ascending=False).head(10).transpose()
```

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>interview</th>
      <th>regards</th>
      <th>scientist</th>
      <th>data</th>
      <th>let</th>
      <th>position</th>
      <th>john</th>
      <th>invite</th>
      <th>interviews</th>
      <th>inform</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>spam</th>
      <td>0.00000</td>
      <td>0.00000</td>
      <td>0.00000</td>
      <td>0.00000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>ham</th>
      <td>0.31039</td>
      <td>0.31039</td>
      <td>0.31039</td>
      <td>0.31039</td>
      <td>0.155195</td>
      <td>0.155195</td>
      <td>0.155195</td>
      <td>0.155195</td>
      <td>0.155195</td>
      <td>0.155195</td>
    </tr>
  </tbody>
</table>
</div>


<a name="conclusion"></a>
## Conclusion

---

In this lesson we got an overview of Natural Language Processing (NLP) and worked with two very powerful toolkits:
- scikit-learn's text feature extraction module (`sklearn.feature_extraction.text`)
- The Natural Language Toolkit (NLTK)

**Check:** Discussion: what are some real world applications of these techniques?

- Spam Detection for one
- Preprocessing for larger NLP problems
- Job market analysis
- Is someone date-able or not? "I" in relation to signifier
- Crude topic analysis
- Build a keyword extraction heuristic and pipe it into a marketing analysis

<a id='resources'></a>
## Additional resources

---

- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how they completed a classification task (with over 1000 response variables!) using restaurant review text
- Always check documentation: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html), [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- A list of all stop-words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt) h/t sleevillanueva
- Wikipedia's articles on [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and [hash functions](https://en.wikipedia.org/wiki/Hash_function) are great places to turn for more info on hashing
- I made use of Charlie Greenbacher's [Intro to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/) (join!)
- Wikipedia includes a [walkthrough](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of TF-IDF
- We played with Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0)
- A hilarious data scientist gone rogue used NLP and Eigenfaces (Eigenvalues for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/)
- I referenced KPCB's 2016 internet trends. If you're into startups, check out [this massive, insightful deck](http://www.kpcb.com/internet-trends)
- [Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [Choosing a Stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html)
- [Feature Hashing](https://en.wikipedia.org/wiki/Feature_hashing)
- [Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [TFIDF Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)