# Basics of NLP
A short workshop on NLP techniques for those with little or no prior experience.

## What is NLP?
* NLP stands for Natural Language Processing, and the field is concerned with the ability to use computers to manipulate text data. 
* Researchers in computational linguistics tend to focus on three areas: NLP, NLG (natural language generation), and NLU (natural language understanding).
* NLP researchers tend to study and work on problems that include:
  * parsing strings into individual paragraphs, sentences, words, morphemes, etc
  * finding grammatical relations and structures
  * identifying entities
  * comparing strings
  * extracting features such as sentiment, topics, etc.
  
Today we're going to cover some of the basics of NLP, including why and how to preprocess text data.

First we'll read in some data:

In [8]:
import pandas as pd

# import data 
wine_data = pd.read_csv("data/winemag-data_first150k.csv", encoding = 'utf8')
wine_data.head()

| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro | | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 95 | 66.0 | Provence | Bandol | | Provence red blend | Domaine de la Bégude |


In [9]:
# remove any rows where we have no description
wine_data = wine_data[pd.notnull(wine_data['description'])]

## A few quick initial thoughts about Computational Linguistics:
* computers know nothing about words
  * as humans we know that "US", "USA", "us", "usa", and "u.s.a." all refer to the same place, but a computer comparing strings only treats them as equal when they are identical
  * if we want to be able to compare words, then we have to make the words as uniform as possible.

In [10]:
# In case you don't believe me:
print("USA and usa are the same word:", "USA" == "usa")
print("u.s.a. and usa are the same word:", "u.s.a." == "usa")
print("Amanda and amanda are the same word:", "Amanda" == "amanda")

USA and usa are the same word: False
u.s.a. and usa are the same word: False
Amanda and amanda are the same word: False
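
As a minimal sketch of what "making words uniform" could look like, the snippet below lowercases each string and strips periods before comparing. The `normalize` and `canonical` helpers and the `ALIASES` table are hypothetical names introduced for illustration; real data would likely need a much larger mapping.

```python
# Hypothetical alias table: surface forms that should map to one canonical form.
ALIASES = {"us": "usa"}

def normalize(word):
    """Lowercase a word and drop periods (an illustrative rule set only)."""
    return word.lower().replace(".", "")

def canonical(word):
    """Normalize a word, then map known aliases to a single spelling."""
    cleaned = normalize(word)
    return ALIASES.get(cleaned, cleaned)

variants = ["US", "USA", "us", "usa", "u.s.a."]
print({v: canonical(v) for v in variants})
print("All variants match:", len({canonical(v) for v in variants}) == 1)
```

Lowercasing and removing punctuation handle most of the variants, but "US" versus "USA" needs an explicit mapping, which is why normalization rules usually end up being specific to the data.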


## Preprocessing Text

When working with text data, the goal is to process (remove, filter, and combine) the text in such a way that informative text is preserved and munged into a form that models can better understand. After looking at our raw text, we know that there are a number of textual attributes that we will need to address before we can ultimately represent our text as quantified features.

A common first step is to handle [string encoding](http://kunststube.net/encoding/) and formatting issues.  Often it is easy to address the character encoding and mixed capitalization using Python's built-in functions. For our wine example, we will convert everything to UTF-8 encoding and convert all letters to lowercase.

In [13]:
# for simplicity we can look at what this does for a single row:

wine1 = wine_data.description.iloc[0]

print("Original: ", wine1)
print("Lower-cased: ", wine1.lower())

Original:  This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
Lower-cased:  this tremendous 100% varietal wine hails from oakville and was aged over three years in oak. juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. enjoy 2022–2030.
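
The lowercasing step is shown above. The encoding step is not, so here is a hedged sketch of how it might look when text arrives as bytes in a legacy encoding. The sample bytes are simulated from the `designation` value in row 1, and the assumption that the source was Latin-1 is made purely for illustration; the wine CSV itself is already read as UTF-8.

```python
import unicodedata

# Simulated legacy input: pretend this value arrived as Latin-1 bytes.
raw_bytes = "Carodorum Selección Especial Reserva".encode("latin-1")

# Decoding with the wrong codec fails loudly instead of producing text.
try:
    raw_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decode with the (assumed) source encoding to get a proper str.
text = raw_bytes.decode("latin-1")

# NFC normalization makes visually identical strings compare equal, whether
# an accented letter is stored as one code point or as letter + combining mark.
clean = unicodedata.normalize("NFC", text).lower()

print("Decoded as UTF-8:", decoded_as_utf8)
print(clean)
print(clean.encode("utf-8"))
```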


## Tokenizing
In order to process text, it must be deconstructed into its constituent elements through a process termed *tokenization*. Often, the *tokens* yielded from this process are simply individual words in a document.  In certain cases, it can be useful to tokenize stranger objects like emoji or parts of html (or other code).

A simplistic way to tokenize text relies on white space, as in <code>nltk.tokenize.WhitespaceTokenizer</code>. Splitting on white space, however, does not take *punctuation* into account, so some tokens will include punctuation (e.g. 'account,') and will require further preprocessing. Depending on your data, the punctuation may carry meaningful information, so think about whether it should be preserved or removed.
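
To illustrate, the sketch below uses `str.split()` as a standard-library stand-in for whitespace tokenization, then strips leading and trailing punctuation from each token. The sample sentence is made up, and whether stripping is appropriate depends on your data.

```python
import string

text = "Whitespace splitting does not take punctuation into account, so some tokens keep it."

# Split on runs of whitespace; punctuation stays attached to the words.
tokens = text.split()
print(tokens)

# Strip ASCII punctuation from both ends of each token and drop empty results.
stripped = [tok.strip(string.punctuation) for tok in tokens]
stripped = [tok for tok in stripped if tok]
print(stripped)
```

Note that `strip` only touches the ends of a token, so internal punctuation such as the hyphen in 'red-cherry' is preserved.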

Tokenization is particularly challenging in the biomedical field, where many phrases contain substantial punctuation (parentheses, hyphens, etc.) that can't necessarily be ignored. Negation detection can also be critical in this context, which adds a further preprocessing challenge.
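
As a toy illustration of why punctuation can matter for negation, the sketch below marks every token that follows a negation cue until the next punctuation mark. This is a deliberately simplistic heuristic, not a clinical-grade method: the cue list, the `_NEG` suffix, the `tag_negated_tokens` name, and the example sentence are all assumptions made for illustration.

```python
import re

# Illustrative cue list only; real systems use much richer lexicons and rules.
NEGATION_CUES = {"no", "not", "denies", "without"}

def tag_negated_tokens(text):
    """Append _NEG to tokens between a negation cue and the next punctuation mark."""
    # Split into word tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tagged, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[^\w\s]", tok):
            negated = False          # punctuation closes the negation scope
            tagged.append(tok)
        elif tok in NEGATION_CUES:
            negated = True           # start marking the tokens that follow
            tagged.append(tok)
        else:
            tagged.append(tok + "_NEG" if negated else tok)
    return tagged

print(tag_negated_tokens("Patient denies chest pain, reports nausea."))
```

If the comma had been removed during preprocessing, "reports" and "nausea" would have been marked as negated too.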

NLTK contains many built-in modules for tokenization, such as <code>nltk.tokenize.WhitespaceTokenizer</code> and <code>nltk.tokenize.RegexpTokenizer</code>. It also has a module specifically for dealing with Twitter data, <code>nltk.tokenize.casual.TweetTokenizer</code>, which mainly adds a few features related to handling Twitter handles.
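
A short sketch of `TweetTokenizer` in use, assuming NLTK is installed. The `strip_handles`, `reduce_len`, and `preserve_case` options are constructor arguments of the class; the example tweet is only illustrative.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions, reduce_len shortens runs of three or more
# repeated characters to three, and preserve_case=False lowercases the text.
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print(tweet_tokens)
```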

SpaCy also has built-in modules to deal with tokenization. Below we'll look at a few different kinds of tokenizers.

See also:

[The Art of Tokenization](https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en)<br>
[Negation's Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4231086/)

### Whitespace Tokenizer
Whitespace tokenization is one possible method. Because this tool identifies words purely by the whitespace between them, it sometimes treats punctuation as part of a word.

In [15]:
from nltk.tokenize import WhitespaceTokenizer
ws_tokenizer = WhitespaceTokenizer()

# tokenize example review
wine1_ws = ws_tokenizer.tokenize(wine1.lower())
print(wine1_ws)

['this', 'tremendous', '100%', 'varietal', 'wine', 'hails', 'from', 'oakville', 'and', 'was', 'aged', 'over', 'three', 'years', 'in', 'oak.', 'juicy', 'red-cherry', 'fruit', 'and', 'a', 'compelling', 'hint', 'of', 'caramel', 'greet', 'the', 'palate,', 'framed', 'by', 'elegant,', 'fine', 'tannins', 'and', 'a', 'subtle', 'minty', 'tone', 'in', 'the', 'background.', 'balanced', 'and', 'rewarding', 'from', 'start', 'to', 'finish,', 'it', 'has', 'years', 'ahead', 'of', 'it', 'to', 'develop', 'further', 'nuance.', 'enjoy', '2022–2030.']


### Regular Expression Tokenization

By applying the regular expression tokenizer we can fine-tune the tokenizer to yield the types of tokens that are useful for our data. Here we return a list of word tokens without punctuation.

In [16]:
from nltk.tokenize import RegexpTokenizer
re_tokenizer = RegexpTokenizer(r'\w+')

# tokenize example review
wine1_re = re_tokenizer.tokenize(wine1.lower())
print(wine1_re)

['this', 'tremendous', '100', 'varietal', 'wine', 'hails', 'from', 'oakville', 'and', 'was', 'aged', 'over', 'three', 'years', 'in', 'oak', 'juicy', 'red', 'cherry', 'fruit', 'and', 'a', 'compelling', 'hint', 'of', 'caramel', 'greet', 'the', 'palate', 'framed', 'by', 'elegant', 'fine', 'tannins', 'and', 'a', 'subtle', 'minty', 'tone', 'in', 'the', 'background', 'balanced', 'and', 'rewarding', 'from', 'start', 'to', 'finish', 'it', 'has', 'years', 'ahead', 'of', 'it', 'to', 'develop', 'further', 'nuance', 'enjoy', '2022', '2030']
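
The `\w+` pattern above splits 'red-cherry' into two tokens and drops the '%' from '100%'. As a hedged sketch of how a pattern might be tuned, the snippet below uses the standard-library `re.findall` as a stand-in; the pattern itself is an assumption chosen for this example, and the same pattern string should also be usable with `RegexpTokenizer`.

```python
import re

sample = "this tremendous 100% varietal wine has juicy red-cherry fruit."

# Baseline: runs of word characters only.
basic_tokens = re.findall(r"\w+", sample)

# Tuned: keep percentages whole and keep hyphenated compounds as one token.
tuned_pattern = r"\d+%|\w+(?:-\w+)*"
tuned_tokens = re.findall(tuned_pattern, sample)

print(basic_tokens)
print(tuned_tokens)
```

Whether 'red-cherry' should be one token or two depends on what the downstream model needs, so a tuned pattern like this is a choice rather than a default.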


### SpaCy
SpaCy is a package that can tokenize text, lemmatize words, find named entities, and more, so you can think of it as an alternative to NLTK.

In [18]:
import spacy

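# spaCy 3+ no longer accepts the 'en' shortcut; this assumes the small English model is installed
nlp = spacy.load('en_core_web_sm')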
print(nlp(wine1.lower()))

this tremendous 100% varietal wine hails from oakville and was aged over three years in oak. juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. enjoy 2022–2030.


In [20]:
# you can use spacy to get individual tokens:
for token in nlp(wine1.lower()):
    print(token)

this
tremendous
100
%
varietal
wine
hails
from
oakville
and
was
aged
over
three
years
in
oak
.
juicy
red
-
cherry
fruit
and
a
compelling
hint
of
caramel
greet
the
palate
,
framed
by
elegant
,
fine
tannins
and
a
subtle
minty
tone
in
the
background
.
balanced
and
rewarding
from
start
to
finish
,
it
has
years
ahead
of
it
to
develop
further
nuance
.
enjoy
2022–2030
.
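
spaCy tokens carry attributes, so punctuation can be filtered after tokenization instead of being discarded up front. The sketch below assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, so that no trained model has to be downloaded; the variable names are illustrative.

```python
import spacy

# A blank English pipeline provides the tokenizer without a trained model.
blank_nlp = spacy.blank("en")

doc = blank_nlp("aged over three years in oak. juicy red-cherry fruit")

# Keep only the tokens that spaCy does not flag as punctuation.
words = [token.text for token in doc if not token.is_punct]
print(words)
```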


In [21]:
# you can also get sentences using SpaCy
for sent in nlp(wine1.lower()).sents:
    print(sent)

this tremendous 100% varietal wine hails from oakville and was aged over three years in oak.
juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background.
balanced and rewarding from start to finish, it has years ahead of it to develop further nuance.
enjoy 2022–2030.


### Final thoughts on tokenization:
Which tokenizer you use ultimately depends on what your model needs, so think about which choice is most appropriate. In some cases it is fine to lowercase the text and then extract individual words without punctuation, but in other circumstances this might not be helpful. For example, imagine if you were trying to identify questions in your 