# Basics of NLP
A short workshop of NLP techniques for those with little or no experience with NLP.

## What is NLP?
* NLP stands for Natural Language Processing, and the field is concerned with the ability to use computers to manipulate text data. 
* Researchers in computational linguistics focus on three areas: NLP, NLG (natural language generation), and NLU (natural language understanding).
* NLP researchers tend to study and work on problems that include: 
  * parsing strings into individual paragraphs, sentences, words, morphemes, etc
  * finding grammatical relations and structures
  * identifying entities
  * comparing strings
  * feature extraction, such as sentiments, topics, etc.
  
Today we're going to cover some of the basics of NLP, including why and how to preprocess text data.

First we'll read in some data:

In [2]:
import pandas as pd
import nltk

# import data 
wine_data = pd.read_csv("data/winemag-data_first150k.csv", encoding = 'utf8')
wine_data.head()

| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro | | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 95 | 66.0 | Provence | Bandol | | Provence red blend | Domaine de la Bégude |


In [3]:
# remove any rows where we have no description
wine_data = wine_data[pd.notnull(wine_data['description'])]

## A few quick initial thoughts about Computational Linguistics:
* computers know nothing about words
  * as humans we know that "US", "USA", "us", "usa", and "u.s.a." are all referencing the same place, but computers can only compare things that are identical
  * if we want to be able to compare words, then we have to make the words as uniform as possible.

In [9]:
# In case you don't believe me:
print ("USA and usa are the same word:" , "USA"=="usa")
print ("u.s.a. and usa are the same word:" , "u.s.a."=="usa")
print("Amanda and amanda are the same word:", "Amanda"=="amanda")

USA and usa are the same word: False
u.s.a. and usa are the same word: False
Amanda and amanda are the same word: False
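One way to make such variants comparable is to normalize them before comparing. The sketch below is a minimal illustration, assuming that lowercasing and stripping ASCII punctuation is enough for your data; `normalize` is a hypothetical helper written for this example, not a function from pandas or NLTK.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

variants = ["USA", "usa", "u.s.a.", "U.S.A."]
normalized = {normalize(v) for v in variants}
print(normalized)                           # {'usa'}
print(normalize("US") == normalize("USA"))  # False
```

Note that "US" and "USA" still do not match after normalization. Abbreviations and synonyms like these need an explicit mapping, which simple string cleanup cannot provide.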


## Preprocessing Text

When working with text data, the goal is to process (remove, filter, and combine) the text in such a way that informative text is preserved and munged into a form that models can better understand.  After looking at our raw text, we know that there are a number of textual attributes that we will need to address before we can ultimately represent our text as quantified features. 

A common first step is to handle [string encoding](http://kunststube.net/encoding/) and formatting issues.  Often it is easy to address the character encoding and mixed capitalization using Python's built-in functions. For our wine example, we will convert everything to UTF-8 encoding and convert all letters to lowercase.

In [10]:
# for simplicity we can look at what this does for a single row:

wine1 = wine_data.description.iloc[0]

print ("Original: ", wine1)
print("Lower-cased: ", wine1.lower())

Original:  This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
Lower-cased:  this tremendous 100% varietal wine hails from oakville and was aged over three years in oak. juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. enjoy 2022–2030.
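Lowercasing does not resolve every encoding issue. Accented characters, which are common in this dataset (for example "Bégude"), can be stored either as a single code point or as a base letter followed by a combining accent, and the two forms do not compare equal. Here is a minimal sketch using the standard library's `unicodedata` module; the example strings and the `to_nfc` helper are made up for illustration.

```python
import unicodedata

composed = "B\u00e9gude"      # 'é' stored as a single code point
decomposed = "Be\u0301gude"   # 'e' followed by a combining acute accent

# Both render as "Bégude", but the underlying strings differ
print(composed == decomposed)  # False

def to_nfc(text):
    """Normalize text to the composed (NFC) Unicode form."""
    return unicodedata.normalize("NFC", text)

print(to_nfc(composed) == to_nfc(decomposed))  # True
```

Whether NFC or another normalization form is the right choice depends on your data and on what your downstream tools expect.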


## Tokenizing
In order to process text, it must be deconstructed into its constituent elements through a process termed *tokenization*. Often, the *tokens* yielded from this process are simply individual words in a document.  In certain cases, it can be useful to tokenize stranger objects like emoji or parts of html (or other code).

A simplistic way to tokenize text relies on white space, such as in <code>nltk.tokenize.WhitespaceTokenizer</code>. Relying on white space, however, does not take *punctuation* into account, so some tokens will include punctuation and will require further preprocessing (e.g. 'account,'). Depending on your data, the punctuation may provide meaningful information, so you will want to think about whether it should be preserved or if it can be removed.
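To make the punctuation problem concrete, here is a small sketch in plain Python. It splits on white space and then trims punctuation from the edges of each token. The `split_and_strip` helper is hypothetical and only handles ASCII punctuation.

```python
import string

def split_and_strip(text):
    """Split on whitespace, then trim punctuation from the edges of each token."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that consisted of punctuation only
            tokens.append(token)
    return tokens

sample = "juicy red-cherry fruit greet the palate, framed by elegant, fine tannins."
print(split_and_strip(sample))
```

Internal hyphens survive ("red-cherry"), but so does any decision you did not think about: "100%" becomes "100", which may or may not be what you want.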

Tokenization is particularly challenging in the biomedical field, where many phrases contain substantial punctuation (parentheses, hyphens, etc.) that can't necessarily be ignored. Additionally, negation detection can be critical in this context, which can pose an additional preprocessing challenge.

NLTK contains many built-in modules for tokenization, such as <code>nltk.tokenize.WhitespaceTokenizer</code> and <code>nltk.tokenize.RegexpTokenizer</code>. It also has a module specifically for dealing with Twitter data, <code>nltk.tokenize.casual.TweetTokenizer</code>, which adds a few features related to handling Twitter handles.

SpaCy also has built-in modules to deal with tokenization. Below we'll look at a few different kinds of tokenizers. 

See also:

[The Art of Tokenization](https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en)<br>
[Negation's Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4231086/)

### Whitespace Tokenizer
Whitespace tokenization is one possible method. This tool identifies words using whitespace alone, so it sometimes treats punctuation as part of a word.

In [11]:
from nltk.tokenize import WhitespaceTokenizer
ws_tokenizer = WhitespaceTokenizer()

# tokenize example review
wine1_ws = ws_tokenizer.tokenize(wine1.lower())
print(wine1_ws)

['this', 'tremendous', '100%', 'varietal', 'wine', 'hails', 'from', 'oakville', 'and', 'was', 'aged', 'over', 'three', 'years', 'in', 'oak.', 'juicy', 'red-cherry', 'fruit', 'and', 'a', 'compelling', 'hint', 'of', 'caramel', 'greet', 'the', 'palate,', 'framed', 'by', 'elegant,', 'fine', 'tannins', 'and', 'a', 'subtle', 'minty', 'tone', 'in', 'the', 'background.', 'balanced', 'and', 'rewarding', 'from', 'start', 'to', 'finish,', 'it', 'has', 'years', 'ahead', 'of', 'it', 'to', 'develop', 'further', 'nuance.', 'enjoy', '2022–2030.']


### Regular Expression Tokenization

By applying the regular expression tokenizer we can tune our tokenizer more finely to yield the types of tokens useful for our data.  Here we return a list of word tokens without punctuation.

In [12]:
from nltk.tokenize import RegexpTokenizer
re_tokenizer = RegexpTokenizer(r'\w+')

# tokenize example review
wine1_re = re_tokenizer.tokenize(wine1.lower())
print(wine1_re)

['this', 'tremendous', '100', 'varietal', 'wine', 'hails', 'from', 'oakville', 'and', 'was', 'aged', 'over', 'three', 'years', 'in', 'oak', 'juicy', 'red', 'cherry', 'fruit', 'and', 'a', 'compelling', 'hint', 'of', 'caramel', 'greet', 'the', 'palate', 'framed', 'by', 'elegant', 'fine', 'tannins', 'and', 'a', 'subtle', 'minty', 'tone', 'in', 'the', 'background', 'balanced', 'and', 'rewarding', 'from', 'start', 'to', 'finish', 'it', 'has', 'years', 'ahead', 'of', 'it', 'to', 'develop', 'further', 'nuance', 'enjoy', '2022', '2030']
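Notice that the pattern `\w+` splits "red-cherry" into two tokens and drops the percent sign from "100%". If those distinctions matter for your data, the pattern can be tuned. The sketch below uses the standard library's `re` module to stay self-contained. The pattern is an illustrative assumption rather than a recommended default, and the same pattern string could be tried with `RegexpTokenizer`.

```python
import re

# Percentages first, then words that may contain internal hyphens
TOKEN_PATTERN = re.compile(r"\d+%|\w+(?:-\w+)*")

def regex_tokenize(text):
    """Lowercase the text and return every match of TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = regex_tokenize("This tremendous 100% varietal wine has juicy red-cherry fruit.")
print(tokens)
# ['this', 'tremendous', '100%', 'varietal', 'wine', 'has', 'juicy', 'red-cherry', 'fruit']
```

The order of the alternatives matters: the percentage branch has to come first, otherwise `\w+` would consume "100" and leave the "%" behind.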


### SpaCy
SpaCy is a package that can tokenize text, lemmatize, find named entities, and more, so you can think of it as an alternative to NLTK.

In [13]:
import spacy

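# 'en' is a shortcut link from SpaCy 2.x; SpaCy 3+ needs a full model
# name such as 'en_core_web_sm' instead
nlp = spacy.load('en')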
print(nlp(wine1.lower()))

this tremendous 100% varietal wine hails from oakville and was aged over three years in oak. juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. enjoy 2022–2030.


In [14]:
# you can use spacy to get individual tokens:
wine1_tokens = []
for token in nlp(wine1.lower()):
    wine1_tokens.append(token)
    
print(wine1_tokens)

[this, tremendous, 100, %, varietal, wine, hails, from, oakville, and, was, aged, over, three, years, in, oak, ., juicy, red, -, cherry, fruit, and, a, compelling, hint, of, caramel, greet, the, palate, ,, framed, by, elegant, ,, fine, tannins, and, a, subtle, minty, tone, in, the, background, ., balanced, and, rewarding, from, start, to, finish, ,, it, has, years, ahead, of, it, to, develop, further, nuance, ., enjoy, 2022–2030, .]


In [15]:
# you can also get sentences using SpaCy
for sent in nlp(wine1.lower()).sents:
    print(sent)

this tremendous 100% varietal wine hails from oakville and was aged over three years in oak.
juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background.
balanced and rewarding from start to finish, it has years ahead of it to develop further nuance.
enjoy 2022–2030.


### Final thoughts on tokenization:
Which tokenizer you use depends on what your model will ultimately need, so think about which choice is most appropriate. In some cases it is fine to lowercase the text and then extract individual words without punctuation, but in other circumstances this might not be helpful. For example, imagine you were trying to identify questions in your input: you may want to use something like SpaCy's tokenizer to get the sentences before further processing your data.
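The question example can be sketched in a few lines. This is a deliberately naive illustration, assuming sentences end with ".", "!" or "?" followed by white space. A real sentence splitter such as SpaCy's handles abbreviations and other edge cases that this one does not. The review text and helper names are made up for the example.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def questions(text):
    """Return the sentences that end with a question mark."""
    return [s for s in naive_sentences(text) if s.endswith("?")]

review = "Is this the best Cabernet of the vintage? It might be. Enjoy now."
print(questions(review))    # ['Is this the best Cabernet of the vintage?']

# Once punctuation has been stripped, the question can no longer be found
stripped = " ".join(re.findall(r"\w+", review))
print(questions(stripped))  # []
```

The order of preprocessing steps matters: anything that depends on punctuation has to run before the punctuation is removed.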

## Stop Words
Depending on the application, many words provide little value when building an NLP model. Moreover, they may provide a source of "distraction" for models since model capacity is used to understand words with low information content.  Accordingly, these are termed *stop words*. Examples of stop words include pronouns, articles, prepositions and conjunctions, but there are many other words, or non-meaningful tokens, that you may wish to remove. 
<p>Stop words can be determined and handled in many different ways, including:
* Using a list of words determined *a priori* - either a standard list from the NLTK package or one modified from such a list based on domain knowledge of a particular subject
* Sorting the terms by *collection frequency* (the total number of times each term appears in the document collection), and then taking the most frequent terms, judged on their semantic content, as a stop list.
* Using no defined stop list at all, and dealing with text data in a purely statistical manner. In general, search engines do not use stop lists.

As you work with your text, you may decide to iterate on this process. When in doubt, it is often a fruitful strategy to try the above bullets in order.  See also: [Stop Words](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)
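The collection-frequency approach from the list above can be sketched with the standard library alone. The three toy reviews are invented for illustration, and the cutoff of four terms is an arbitrary assumption.

```python
from collections import Counter

toy_reviews = [
    "the wine has a hint of cherry and a long finish",
    "a crisp wine with a hint of citrus",
    "the finish of the wine is dry and the tannins are firm",
]

# Collection frequency: total occurrences of each term across all documents
collection_freq = Counter(
    token for review in toy_reviews for token in review.split()
)

# Candidate stop words: the most frequent terms, to be reviewed by hand
candidates = [term for term, _ in collection_freq.most_common(4)]
print(candidates)
```

The candidates include "the", "a" and "of", but also "wine". This is where judgment about semantic content comes in: in a corpus made up entirely of wine reviews, "wine" may carry little information, while in a mixed corpus it could be highly informative.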

### Stopword Corpus

For this example, we will use the English stopword corpus from NLTK. 

In [16]:
from nltk.corpus import stopwords

# here you can see the words included in the stop words corpus
print (stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's remove the stop words and compare to our original list of tokens from our regular expression tokenizer.

In [17]:
cleaned_tokens = []
stop_words = set(stopwords.words('english'))
for token in wine1_re:
    if token not in stop_words:
        cleaned_tokens.append(token)

In [18]:
print ('Number of tokens before removing stop words: %d' % len(wine1_re))
print ('Number of tokens after removing stop words: %d' % len(cleaned_tokens))

Number of tokens before removing stop words: 62
Number of tokens after removing stop words: 38


In [19]:
print(cleaned_tokens)

['tremendous', '100', 'varietal', 'wine', 'hails', 'oakville', 'aged', 'three', 'years', 'oak', 'juicy', 'red', 'cherry', 'fruit', 'compelling', 'hint', 'caramel', 'greet', 'palate', 'framed', 'elegant', 'fine', 'tannins', 'subtle', 'minty', 'tone', 'background', 'balanced', 'rewarding', 'start', 'finish', 'years', 'ahead', 'develop', 'nuance', 'enjoy', '2022', '2030']


You can see that removing stop words cut our list from 62 tokens to 38, a reduction of nearly 40%. Taking a peek at the cleaned tokens, we can see that a lot of the information that makes sentences human-readable has been lost, but the key nouns, verbs, adjectives, and adverbs remain. 

### With SpaCy

In [20]:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS) # <- set of Spacy's default stop words

{'somewhere', 'either', 'over', 'us', 'there', 'am', 'three', 'enough', 'indeed', 'take', 'such', 'two', 'never', 'whenever', 'you', 'did', 'few', 'if', 'hereupon', 'others', 'so', 'neither', 'wherever', 'their', 'has', 'very', 'hereby', 'former', 'under', 'anything', 'see', 'while', 'any', 'beyond', 'none', 'name', 'towards', 'too', 'next', 'its', 'my', 'whither', 'whom', 'almost', 'nevertheless', 'otherwise', 'this', 'well', 'above', 'seem', 'but', 'together', 'about', 'because', 'most', 'still', 'even', 'bottom', 'below', 'again', 'just', 'are', 'by', 'beforehand', 're', 'everyone', 'against', 'put', 'thereafter', 'up', 'eleven', 'onto', 'serious', 'yourselves', 'call', 'via', 'when', 'throughout', 'only', 'been', 'himself', 'i', 'thereby', 'these', 'however', 'were', 'each', 'his', 'eight', 'me', 'show', 'rather', 'whether', 'though', 'twelve', 'various', 'all', 'those', 'afterwards', 'should', 'yourself', 'hundred', 'in', 'further', 'namely', 'thru', 'seeming', 'being', 'nine', 't

In [21]:
# if you want to add your own personal stop words
STOP_WORDS.add('.')
STOP_WORDS.add('%')
STOP_WORDS.add(',')
STOP_WORDS.add('-')

***Note:*** with SpaCy, it seems like you have to lemmatize and remove stop words in the same step...

### Final thoughts on stop words
Be aware that removing stop words can also remove meaning. The NLTK stopword corpus includes the words 'no', 'nor', and 'not', so removing every word on the list also removes negation. You can edit these stopword lists to avoid removing some words, and you can add words to them. For example, if you really thought that you would never need to care about the word *flamingo*, you could add it to the stop word list and it would automatically be removed.
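Adjusting a stop list comes down to a couple of set operations. The sketch below uses a tiny made-up stop list so that it runs without downloading any corpus; in practice you would start from a full list such as NLTK's.

```python
# A tiny illustrative stop list (an assumption, not NLTK's actual list)
base_stop_words = {"a", "the", "is", "and", "no", "nor", "not", "this"}

negations = {"no", "nor", "not"}
domain_words = {"flamingo"}   # hypothetical word we never care about

# Keep negations in the text, and additionally drop the domain-specific word
custom_stop_words = (base_stop_words - negations) | domain_words

tokens = "this wine is not a flamingo and not sweet".split()
kept = [t for t in tokens if t not in custom_stop_words]
print(kept)   # ['wine', 'not', 'not', 'sweet']
```

Building a new set, rather than modifying the original in place, keeps the base list intact in case you want to compare both versions later.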

## Stemming and Lemmatization


The overarching goal of stemming and lemmatization is to reduce the different forms of a word to a common base form. After stemming or lemmatization, counts of word occurrences become more informative for further processing of the data (such as the vectorization, see below). 

In deciding how to reduce the different forms of words, you will want to consider how much information you will need to retain for your application. For instance, in many cases markers of tense and plurality are not informative, and so removing these markers will allow you to reduce the number of features.  In other cases, retaining these variations results in better understanding of the underlying content. 

**Stemming** is the process of representing the word as its root word while removing inflection. For example, the stem of the word 'explained' is 'explain'. By passing this word through a stemming function you would remove the tense inflection. There are multiple approaches to stemming: [Porter stemming](http://snowball.tartarus.org/algorithms/porter/stemmer.html), [Porter2 (snowball) stemming](http://snowball.tartarus.org/algorithms/english/stemmer.html), and [Lancaster stemming](http://www.nltk.org/_modules/nltk/stem/lancaster.html). The linked pages describe each approach in more depth.


In [22]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

In [23]:
print ('Porter Stem of "explanation": %s' % porter.stem('explanation'))
print ('Porter2 (Snowball) Stem of "explanation": %s' %snowball.stem('explanation'))
print ('Lancaster Stem of "explanation": %s' %lancaster.stem('explanation'))

Porter Stem of "explanation": explan
Porter2 (Snowball) Stem of "explanation": explan
Lancaster Stem of "explanation": expl


While <b><em>stemming</em></b> is a heuristic process that selectively removes the end of words, <b><em>lemmatization</em></b> is a more sophisticated process that can account for variables such as part-of-speech, meaning, and context within a document or neighboring sentences.</p>
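To get a feel for what "heuristic" means here, consider the toy suffix stripper below. It is emphatically not the Porter, Snowball, or Lancaster algorithm, just a few hand-picked rules (the suffix list and the minimum stem length are assumptions chosen for this example) that show both the appeal and the failure modes of rule-based stemming.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # checked in this order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["explained", "explaining", "explains", "tannins", "glass", "aged"]:
    print(w, "->", toy_stem(w))
```

The three forms of "explain" collapse to one stem, which is the goal. But "glass" is cut to "glas", and "aged" is left alone only because of the length guard. Real stemmers add many more rules to limit such errors, and they still make some.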

In [24]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print (lemmatizer.lemmatize('explanation'))

explanation


In this example, lemmatization retains more information than stemming: the lemmatizer returns 'explanation' unchanged, while the stemmers truncate it. Within stemming, the Lancaster method is more aggressive than Porter and Snowball. Remember that this step allows us to reduce words to a common base form so that we can reduce our feature space and perform counting of occurrences. How much information you need to retain depends on your data and your application.

As a good starting point, see also: [Stemming and lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

### Stemming vs Lemmatizing

In [25]:
stemmed_tokens = []
lemmatized_tokens = []

for token in cleaned_tokens:
    stemmed_tokens.append(snowball.stem(token))
    lemmatized_tokens.append(lemmatizer.lemmatize(token))

Let's compare: stemmed tokens

In [26]:
print(stemmed_tokens)

['tremend', '100', 'variet', 'wine', 'hail', 'oakvill', 'age', 'three', 'year', 'oak', 'juici', 'red', 'cherri', 'fruit', 'compel', 'hint', 'caramel', 'greet', 'palat', 'frame', 'eleg', 'fine', 'tannin', 'subtl', 'minti', 'tone', 'background', 'balanc', 'reward', 'start', 'finish', 'year', 'ahead', 'develop', 'nuanc', 'enjoy', '2022', '2030']


Lemmatized tokens

In [27]:
print(lemmatized_tokens)

['tremendous', '100', 'varietal', 'wine', 'hail', 'oakville', 'aged', 'three', 'year', 'oak', 'juicy', 'red', 'cherry', 'fruit', 'compelling', 'hint', 'caramel', 'greet', 'palate', 'framed', 'elegant', 'fine', 'tannin', 'subtle', 'minty', 'tone', 'background', 'balanced', 'rewarding', 'start', 'finish', 'year', 'ahead', 'develop', 'nuance', 'enjoy', '2022', '2030']


Looking at the above, it is clear that different strategies for generating tokens might retain different information. Moreover, given the transformations that stemming and lemmatization apply, a different number of tokens will be retained in the overall vocabulary.

***Critical thoughts:*** It's best to apply intuition and domain knowledge to get a feel for which strategy or strategies to begin with.  In short, it's usually a good idea to optimize for smaller numbers of unique tokens and greater interpretability as long as it doesn't disagree with common sense and (sometimes more importantly) overall performance.
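One detail is worth knowing when comparing the two lists above: a lemmatizer's answer can depend on the part of speech it assumes. NLTK's `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, which is probably why words like 'aged' and 'framed' came back unchanged. The sketch below imitates that behavior with a small hypothetical lookup table; a real lemmatizer uses a full dictionary and morphological rules instead.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("aged", "VERB"): "age",
    ("aged", "ADJ"): "aged",
    ("framed", "VERB"): "frame",
    ("years", "NOUN"): "year",
    ("tannins", "NOUN"): "tannin",
}

def toy_lemmatize(word, pos="NOUN"):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("aged"))           # 'aged': no noun entry, so unchanged
print(toy_lemmatize("aged", "VERB"))   # 'age'
print(toy_lemmatize("years"))          # 'year'
```

With NLTK, the corresponding experiment would be to pass `pos='v'` to `lemmatizer.lemmatize` for tokens you know to be verbs, for instance after POS tagging.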

### SpaCy
You can also lemmatize things using SpaCy (but there is no stemming):

In [28]:
# each SpaCy token already carries its lemma (token.lemma_),
# so no separate lemmatizer object is needed
lemmatized_tokens_spacy = []

for token in wine1_tokens:
    lemma = token.lemma_
    if lemma not in STOP_WORDS:
        lemmatized_tokens_spacy.append(lemma)

In [29]:
print(lemmatized_tokens_spacy)

['tremendous', '100', 'varietal', 'wine', 'hail', 'oakville', 'age', 'year', 'oak', 'juicy', 'red', 'cherry', 'fruit', 'compelling', 'hint', 'caramel', 'greet', 'palate', 'frame', 'elegant', 'fine', 'tannin', 'subtle', 'minty', 'tone', 'background', 'balance', 'rewarding', 'start', 'finish', '-PRON-', 'year', 'ahead', '-PRON-', 'develop', 'nuance', 'enjoy', '2022–2030']


## POS tagging
Another thing that we might care about is the parts of speech (aka POS) of the various words in our reviews. You might, for example, want to do something like count the number of nouns in a review, or the number of adjectives, etc. By tagging words with their POS, you can build features like this.

In case you're wondering, this is the list of POS tags used by NLTK's default tagger (the Penn Treebank tag set):
* CC	coordinating conjunction
* CD	cardinal number
* DT	determiner
* EX	existential there (like: "there is" ... think of it like "there exists")
* FW	foreign word
* IN	preposition/subordinating conjunction
* JJ	adjective	'big'
* JJR	adjective, comparative	'bigger'
* JJS	adjective, superlative	'biggest'
* LS	list marker	1)
* MD	modal	could, will
* NN	noun, singular 'desk'
* NNS	noun plural	'desks'
* NNP	proper noun, singular	'Harrison'
* NNPS	proper noun, plural	'Americans'
* PDT	predeterminer	'all the kids'
* POS	possessive ending	parent's
* PRP	personal pronoun	I, he, she
* PRP\$	possessive pronoun	my, his, hers
* RB	adverb	very, silently,
* RBR	adverb, comparative	better
* RBS	adverb, superlative	best
* RP	particle	give up
* TO	to	go 'to' the store.
* UH	interjection	errrrrrrrm
* VB	verb, base form	take
* VBD	verb, past tense	took
* VBG	verb, gerund/present participle	taking
* VBN	verb, past participle	taken
* VBP	verb, non-3rd person singular present	take
* VBZ	verb, 3rd person sing. present	takes
* WDT	wh-determiner	which
* WP	wh-pronoun	who, what
* WP\$	possessive wh-pronoun	whose
* WRB	wh-adverb	where, when

In [30]:
from nltk import pos_tag

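# get the POS tags for our review
# (note: the tagger is given stemmed, stop-word-filtered tokens here, so it
# sees no sentence context and some of its tags will be unreliable)
pos_tagged_review = pos_tag(stemmed_tokens)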
    
print(pos_tagged_review)

[('tremend', 'VB'), ('100', 'CD'), ('variet', 'JJ'), ('wine', 'NN'), ('hail', 'NN'), ('oakvill', 'NN'), ('age', 'NN'), ('three', 'CD'), ('year', 'NN'), ('oak', 'NN'), ('juici', 'NN'), ('red', 'JJ'), ('cherri', 'NN'), ('fruit', 'NN'), ('compel', 'NN'), ('hint', 'NN'), ('caramel', 'NN'), ('greet', 'NN'), ('palat', 'JJ'), ('frame', 'NN'), ('eleg', 'JJ'), ('fine', 'JJ'), ('tannin', 'NN'), ('subtl', 'NN'), ('minti', 'FW'), ('tone', 'NN'), ('background', 'NN'), ('balanc', 'NN'), ('reward', 'JJ'), ('start', 'NN'), ('finish', 'JJ'), ('year', 'NN'), ('ahead', 'RB'), ('develop', 'VB'), ('nuanc', 'NN'), ('enjoy', 'NN'), ('2022', 'CD'), ('2030', 'CD')]


For example, if we wanted to count the number of nouns in this review, we could do this:

In [31]:
nouns = ['NN', 'NNS', 'NNP', 'NNPS']
count = 0
for word, tag in pos_tagged_review:
    if tag in nouns:
        count +=1
    
print("There are %d nouns in the first review" %count)

There are 23 nouns in the first review
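The same idea extends to counting every coarse category at once, which gives a small feature vector per review. The sketch below works on a hand-written list of `(token, tag)` pairs in the format that `pos_tag` returns. The grouping of tags by their first two letters is a simplifying assumption.

```python
from collections import Counter

# A few (token, tag) pairs in the format returned by nltk.pos_tag
tagged = [("wine", "NN"), ("red", "JJ"), ("three", "CD"), ("year", "NN"),
          ("fine", "JJ"), ("tannin", "NN"), ("ahead", "RB"), ("develop", "VB")]

# Map fine-grained Penn Treebank tags to coarse groups by their two-letter prefix
GROUPS = {"NN": "noun", "JJ": "adjective", "VB": "verb", "RB": "adverb"}

def pos_features(tagged_tokens):
    """Count how many tokens fall into each coarse part-of-speech group."""
    counts = Counter()
    for _, tag in tagged_tokens:
        counts[GROUPS.get(tag[:2], "other")] += 1
    return counts

features = pos_features(tagged)
print(dict(features))
```

Because `NNS`, `NNP` and `NNPS` all start with `NN`, they are counted as nouns without having to be listed one by one.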


### SpaCy?

In [32]:
pos_tagged_spacy = []

for token in nlp(wine1.lower()):
    pos_tagged_spacy.append((token, token.pos_))

print(pos_tagged_spacy)

[(this, 'DET'), (tremendous, 'ADJ'), (100, 'NUM'), (%, 'NOUN'), (varietal, 'ADJ'), (wine, 'NOUN'), (hails, 'VERB'), (from, 'ADP'), (oakville, 'NOUN'), (and, 'CCONJ'), (was, 'VERB'), (aged, 'VERB'), (over, 'ADP'), (three, 'NUM'), (years, 'NOUN'), (in, 'ADP'), (oak, 'NOUN'), (., 'PUNCT'), (juicy, 'ADJ'), (red, 'ADJ'), (-, 'PUNCT'), (cherry, 'NOUN'), (fruit, 'NOUN'), (and, 'CCONJ'), (a, 'DET'), (compelling, 'ADJ'), (hint, 'NOUN'), (of, 'ADP'), (caramel, 'NOUN'), (greet, 'VERB'), (the, 'DET'), (palate, 'NOUN'), (,, 'PUNCT'), (framed, 'VERB'), (by, 'ADP'), (elegant, 'ADJ'), (,, 'PUNCT'), (fine, 'ADJ'), (tannins, 'NOUN'), (and, 'CCONJ'), (a, 'DET'), (subtle, 'ADJ'), (minty, 'NOUN'), (tone, 'NOUN'), (in, 'ADP'), (the, 'DET'), (background, 'NOUN'), (., 'PUNCT'), (balanced, 'VERB'), (and, 'CCONJ'), (rewarding, 'ADJ'), (from, 'ADP'), (start, 'NOUN'), (to, 'ADP'), (finish, 'NOUN'), (,, 'PUNCT'), (it, 'PRON'), (has, 'VERB'), (years, 'NOUN'), (ahead, 'ADV'), (of, 'ADP'), (it, 'PRON'), (to, 'PART'),

In [33]:
count = 0
for word, tag in pos_tagged_spacy:
    if tag =='NOUN':
        count +=1
    
print("There are %d nouns in the first review" %count)

There are 18 nouns in the first review


***You'll note:*** SpaCy and NLTK differ in how many nouns they think exist in the text. Part of the difference comes from the inputs: NLTK tagged the stemmed, stop-word-filtered tokens, while SpaCy tagged the full lowercased text. Either way, this is something you should think about: the reliability of your POS tagger. Also note that SpaCy more or less requires you to do all of your preprocessing at once, and I found it difficult to use the POS tagger on lemmatized tokens.

## Vectorization

Often in natural language processing we want to represent our text as a quantitative set of features for subsequent analysis. We can refer to this as vectorization. One way to generate features from text is to count the occurrences of words. This approach is often referred to as a <b>bag of words approach</b>.

For our example review, we can represent the document as a vector of counts for each token. We can do the same for the other reviews, and in the end we would have a set of vectors, with each vector representing a review. These vectors could then be used in the next phase of analysis (e.g. classification, document clustering, ...). 

When we apply a <b>count vectorizer</b> to our corpus of reviews, the output will be a matrix with the number of rows corresponding to the number of reviews and the number of columns corresponding to the number of unique tokens across all reviews. You can imagine that if we have many reviews in a corpus of varied content, the number of unique tokens could get quite large. Some of our preprocessing steps address this issue. In particular, the stemming/lemmatization step reduces the number of unique versions of a word that appear in the corpus. Additionally it is possible to reduce the number of features by removing words that appear least frequently, or by removing words that are common to every review and therefore may not be informative for subsequent analysis.
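To see what such a matrix looks like, here is a bag-of-words count matrix built by hand with the standard library. The three toy documents are invented. A real vectorizer produces the same kind of structure, usually stored as a sparse matrix.

```python
from collections import Counter

docs = ["cherry oak cherry", "oak tannin", "cherry tannin tannin finish"]

# Vocabulary: every unique token across the corpus, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(tok, 0) for tok in vocab])

print(vocab)      # ['cherry', 'finish', 'oak', 'tannin']
for row in matrix:
    print(row)
# [2, 0, 1, 0]
# [0, 0, 1, 1]
# [1, 1, 0, 2]
```

Word order within each document is gone: "cherry oak cherry" and "cherry cherry oak" would produce the same row, which is exactly what "bag of words" means.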


#### Count Vectorization of a Review

For this example we will use the stemmed tokens from our review. We will need to join the tokens together to represent one review.

Check out the documentation for [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn. You will see that there are a number of parameters that you can specify - including the maximum number of features. Depending on your data, you may choose to restrict the number of features by removing words that appear with least frequency (and this number may be set by cross-validation).
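As a rough illustration of what restricting the vocabulary means, the sketch below prunes terms by document frequency. It is a simplified stand-in in the spirit of scikit-learn's `min_df` and `max_features` parameters, not a reimplementation of them: the `prune_vocabulary` helper is hypothetical, and its ranking rule is an assumption made for this example.

```python
from collections import Counter

docs = ["cherry oak cherry", "oak tannin", "cherry tannin tannin finish"]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

def prune_vocabulary(df_counts, min_df=2, max_features=None):
    """Keep terms seen in at least min_df documents, optionally capped in size."""
    kept = [t for t, df in df_counts.items() if df >= min_df]
    # Most widespread terms first; ties are broken alphabetically
    kept.sort(key=lambda t: (-df_counts[t], t))
    return kept[:max_features] if max_features is not None else kept

print(prune_vocabulary(doc_freq))                  # ['cherry', 'oak', 'tannin']
print(prune_vocabulary(doc_freq, max_features=2))  # ['cherry', 'oak']
```

The term "finish" appears in only one document and is dropped. Whether rare terms are noise or valuable signal depends on the task, which is why such thresholds are often tuned by cross-validation.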

**Example:**

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# join the stemmed tokens back into one string for our example review
stemmed_review = ' '.join(stemmed_tokens)

# perform a count-based vectorization of the document
review_vect = vectorizer.fit_transform([stemmed_review])

As shown below, the only token that occurs more than once in this review is __year__; the other four entries in the top five (__tremend, 100, variet,__ and __wine__) each occur once and simply come first among the ties: 

In [50]:
freqs = [(word, review_vect.getcol(idx).sum()) for word, idx in vectorizer.vocabulary_.items()]
print ("top 5 words for the first reviewed wine:", sorted (freqs, key = lambda x: -x[1])[0:5])

top 5 words for the first reviewed wine: [('year', 2), ('tremend', 1), ('100', 1), ('variet', 1), ('wine', 1)]


Now you can imagine that we could apply this count vectorizer to all of our reviews. We could then use the word count vectors in a number of subsequent analyses (e.g. exploring the topics appearing across the corpus).  

#### Term Frequency - Inverse Document Frequency (tf-idf) Vectorization

We have mentioned that you may want to limit the number of features in your vector, and that one way to do this would be to only take the tokens that occur most frequently. Imagine, for example, trying to differentiate between supporting and opposing documents in a political context. If the documents are all related to the same political initiative, then very likely there will be words related to the initiative that appear in both kinds of documents and thus have high frequency counts. If we cap the number of features by frequency, these words would likely be included, but will they be the most informative when trying to differentiate documents?


For many such cases we may want to use a vectorization approach called **term frequency - inverse document frequency (tf-idf)**. [Tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) allows us to weight words by their importance, considering both how often a word appears in a given document and how often it appears throughout the corpus. That is, if a word occurs frequently in a (preprocessed) document it should be important, yet if it also occurs frequently across many documents it is less informative and differentiating.

In our example, the name of the initiative would likely appear numerous times in each document for both opposing and supporting positions. Because the name occurs across all documents, this word would be down-weighted in importance. These posts go into more depth about text vectorization: [tf-idf part 1](http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/) and [tf-idf part 2](http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/).
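The down-weighting can be worked through by hand. The sketch below uses the textbook formulation, raw term count times log(N / document frequency), on three invented documents about an initiative. This is an assumption made for clarity: as far as we know, scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "initiative support tax relief".split(),
    "initiative oppose tax increase".split(),
    "initiative support school funding".split(),
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

weights = tfidf(docs[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s}{w:.3f}")
```

The word "initiative" appears in all three documents, so its weight is log(3/3) = 0. The word "oppose" appears in only one and receives the highest weight, with "tax" (two documents) in between.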

**Example:**

To utilize tf-idf, we will add in additional reviews from our dataset. We will need to preprocess the text from these reviews and then we can use [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) on our stemmed tokens.

To perform tf-idf transformations, we first need occurrence vectors for all our reviews, using a count vectorizer as above.  From here, we could use scikit-learn's [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to transform our matrix into a tf-idf matrix. TfidfVectorizer combines both of these steps into one.

For a more complete example, consider a preprocessing pipeline where we first tokenize using regexp, remove standard stop words, perform stemming, and finally convert to tf-idf vectors:

In [57]:
def preprocess_review_content(text_df):
    """
    Simple preprocessing pipeline: tokenizes with a regular expression, applies basic token requirements, removes stop words, and stems what remains.
    """
    print ('preprocessing review text...')

    # tokenizer, stops, and stemmer
    tokenizer = RegexpTokenizer(r'\w+')
    stop_words = set(stopwords.words('english'))  # can add more stop words to this set
    stemmer = SnowballStemmer('english')

    # process reviews
    review_list = []
    for review in text_df['description']:
        cleaned_tokens = []
        tokens = tokenizer.tokenize(review.lower())
        for token in tokens:
            if token not in stop_words:
                if 0 < len(token) < 20:  # drops implausibly long tokens
                    # drops tokens that start or end with a digit
                    if not token[0].isdigit() and not token[-1].isdigit():
                        cleaned_tokens.append(stemmer.stem(token))
        # add the processed review
        review_list.append(' '.join(cleaned_tokens))

    # echo results and return
    print ('preprocessed content for %d articles' % len(review_list))
    return review_list

# process articles
processed_review_list = preprocess_review_content(wine_data)

# vectorize the reviews and compute the tf-idf matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vectorizer = TfidfVectorizer()
tfidf_article_matrix = tf_vectorizer.fit_transform(processed_review_list)

print (tfidf_article_matrix.shape)

preprocessing review text...
preprocessed content for 150930 articles
(150930, 21503)


You can see that after applying the tf-idf vectorizer to our sample of 150K reviews, we have a sparse matrix with 150K rows (each corresponding to a review) and 21,503 columns (each corresponding to a stemmed token). Depending on our application, we may choose to restrict the number of features (that is, the number of columns).
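One way to restrict the number of columns is through the vectorizer's own parameters. The sketch below uses `max_features` and `min_df` on a made-up set of already-stemmed reviews; the specific thresholds are arbitrary and would need tuning on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-stemmed toy reviews.
toy_reviews = [
    "ripe cherri soft tannin",
    "bright cherri crisp acid",
    "soft oak vanilla cherri",
    "crisp appl citrus acid",
]

# No restriction: one column per distinct token.
full = TfidfVectorizer().fit_transform(toy_reviews)

# Keep only the 5 most frequent tokens across the corpus.
capped = TfidfVectorizer(max_features=5).fit_transform(toy_reviews)

# Keep only tokens that appear in at least 2 reviews.
frequent = TfidfVectorizer(min_df=2).fit_transform(toy_reviews)

print(full.shape, capped.shape, frequent.shape)
```

`min_df` is often the gentler choice, since it drops tokens that are too rare to generalize while leaving the rest of the vocabulary intact.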

#### Unigrams v. Bigrams v. Ngrams

When we tokenized our corpus above, we chose to treat each word as a token.  A collection of text represented by single words only is a **unigram model of the data**.  This representation can often be surprisingly powerful, because the presence of single words can be hugely informative. 

However, when dealing with natural language we often want to incorporate the structure that is present: grammar, syntactic meaning, and tone.  The **downside of unigrams is that they ignore the ordering of words**, since token counts do not capture order.  The simplest model that captures some ordering and structure is one that treats neighboring word pairs as tokens; this is called a **bigram** model.  

As an example, consider a document that contains the words "good", "bad", and "project" (with relatively similar frequencies).  From unigrams alone, it's not possible to tell whether the project is good or bad, because those adjectives could appear next to the subject "project" or in completely unrelated sentences.  With bigrams, we might see the token "good project" appearing frequently, and we would then know significantly more about what the document says.

Choosing pairs of words (bigrams) is just the simplest choice we can make. We can generalize this to allow tokens of N consecutive words; these are called **Ngrams**.  When N=3 we refer to the tokens as trigrams, but for higher values of N we do not typically assign a unique name. 
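As a small illustration of how unigrams, bigrams and trigrams relate, here is a sketch of an Ngram extractor in plain Python. The helper name `ngrams` and the example sentence are made up for this illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a good project".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

In scikit-learn you would not normally write this yourself: `CountVectorizer` and `TfidfVectorizer` accept an `ngram_range` parameter, e.g. `ngram_range=(1, 2)` to include both unigrams and bigrams.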

***Best practices:*** Generally speaking, most NLP models want to have unigrams present.  Bigrams are also very commonly important and are used to build high quality models.  Higher-order Ngrams are less common, as the number of features (and the computational requirements) increases rapidly while yielding diminishing returns.

### Word Embeddings
Another vectorization option is to use a word embedding model to generate vector representations of words. Word embedding models create non-linear representations of words, which account for the context and neighboring language surrounding a word.  A common model (for which many pretrained versions are available) is [Word2Vec](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).

Word embedding models have gained a lot of popularity, as they are able to capture semantic and syntactic relationships between words quite well.  However, vector representations are only appropriate for the kind of corpus they were trained on, and they often will not produce good models for corpora that are significantly different.  For instance, a Word2Vec model trained on literature may not be appropriate for Twitter or StackOverflow text data.  The alternative in these cases is to retrain the model on the correct data, but this is hard - it requires lots of data, choices, and computation to generate good representations.  As a first approach, it's probably best to start with Ngrams using counts or tf-idf weightings.

In [16]:
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300-SLIM.bin', binary=True)  

## From Words To Paragraphs: Vector Averaging
One challenge with the wine dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review (for this purpose, we removed stop words, which would just add noise).

The following code averages the feature vectors.

In [17]:
import numpy as np  # Make sure that numpy is imported
import re

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total. KeyedVectors
    # supports `in` directly, so we don't need to rebuild a set of the
    # vocabulary on every call
    for word in words:
        if word in model:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    # 
    # Divide the result by the number of words to get the average.
    # If no word was in the vocabulary, keep the zero vector instead
    # of dividing by zero (which would fill the vector with NaN)
    if nwords > 0:
        featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews, filling in one row per review
    for counter, review in enumerate(reviews):
        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
    return reviewFeatureVecs

Since we want to see if we can train a model to predict something (for simplicity's sake, let's say we want to predict the wine score), we need to split our data into a training set and a test set.

In [18]:
from sklearn.model_selection import train_test_split

# make smaller df
wine_data = wine_data[pd.notnull(wine_data['points'])]
df = wine_data
df = df.iloc[:1000,]


# split the data 
train, test = train_test_split(df, test_size=0.2, random_state=42)

train.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
29,29,US,This standout Rocks District wine brings earth...,The Funk Estate,94,60.0,Washington,Walla Walla Valley (WA),Columbia Valley,Syrah,Saviah
535,535,France,"The wine is pure minerality, tightly woven int...",Mandelberg Grand Cru,92,,Alsace,Alsace,,Riesling,Jean Becker
695,695,US,"Cherry-rhubarb juice, shiitake mushrooms, sage...",,90,30.0,California,Sta. Rita Hills,Central Coast,Pinot Noir,Fess Parker
557,557,US,This youthful wine is shyer on the nose than i...,Montecillo Vineyard,90,42.0,California,Sonoma Valley,Sonoma,Zinfandel,Ordaz Family Wines
836,836,US,Aromas of gingerbread cookies with a black-che...,Bailey Ranch Riserva,90,40.0,California,Adelaida District,Central Coast,Zinfandel,Bella Luna


In [19]:
from nltk.corpus import stopwords
import re

stop_words = set( stopwords.words("english") )
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')


def review_to_wordlist( raw_review, remove_stopwords = False ):
    
    words = re.sub("[^a-zA-Z]"," ",raw_review).lower().split()
    
    if remove_stopwords:
        words = [word for word in words if word not in stop_words]
    
    return words

In [20]:
# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.

num_features = 300 # this is the length of our word vectors
clean_train_reviews = []
for review in train["description"]:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print ("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["description"]:
    clean_test_reviews.append(review_to_wordlist(review, remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features )

Creating average feature vecs for test reviews


In [22]:
# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
import numpy as np
forest = RandomForestClassifier( n_estimators = 100 )

forest = forest.fit( trainDataVecs, train["points"] )

# Test & extract results 
result = forest.predict( testDataVecs )

print("F1 score: ", f1_score(test["points"], result, average="weighted"))

F1 score:  0.1903963278101209



### Findings:
Turns out that the F1 score for our random forest isn't great. We could tune the model further, or we could conclude that other features might be better at predicting our wine scores (e.g., are there predictive keywords? Does the sentiment of the review better reflect scores?).

## Sentiment Analysis
If you want to know how positive / neutral / negative a review is, you can try passing it through a sentiment analysis tool.

In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

In [5]:
def print_sentiment_scores(sentence):
    snt = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(snt)))

In [6]:
print_sentiment_scores(wine1)

This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030. {'neg': 0.0, 'neu': 0.768, 'pos': 0.232, 'compound': 0.9287}


We could try to compute the various scores for all wines in our table and use those scores as features in our analysis.
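A minimal sketch of that idea is below. The helper `add_sentiment_features` is hypothetical, and so is `toy_scorer`, which only mimics the shape of the dict returned by `polarity_scores` so that the example runs without VADER installed.

```python
import pandas as pd

def add_sentiment_features(df, score_fn, text_col="description"):
    """Return a copy of df with one new column per sentiment score.

    score_fn takes a string and returns a dict of scores, like
    SentimentIntensityAnalyzer.polarity_scores does.
    """
    scores = pd.DataFrame([score_fn(text) for text in df[text_col]], index=df.index)
    return pd.concat([df, scores.add_prefix("sent_")], axis=1)

def toy_scorer(text):
    # Stand-in scorer (NOT VADER): counts a few hand-picked words.
    words = text.lower().split()
    pos = sum(w in {"elegant", "balanced"} for w in words) / len(words)
    neg = sum(w in {"harsh", "flat"} for w in words) / len(words)
    return {"neg": neg, "neu": 1 - pos - neg, "pos": pos, "compound": pos - neg}

toy_df = pd.DataFrame({
    "description": ["elegant and balanced", "harsh and flat finish"],
    "points": [95, 82],
})
featured = add_sentiment_features(toy_df, toy_scorer)
print(featured.columns.tolist())
```

In the notebook, the call would presumably look like `add_sentiment_features(wine_data, analyser.polarity_scores)`, after which the `sent_*` columns can be fed to a model alongside other features.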

## Named Entity Recognition
For some tasks you might be interested in finding the named entities in the text. This is similar to POS tagging, only instead of getting the parts of speech, you get what kind of thing a word or phrase refers to (e.g., 'PERSON', 'PLACE'). This is particularly useful if you want to provide some sort of summary, or if you want a general sense of what was talked about.

I'm using spaCy here because it is easy to use, though it is not necessarily the best option. For more on the entity types, see [spaCy's website](https://spacy.io/usage/linguistic-features#section-named-entities).

In [59]:
import spacy
nlp = spacy.load('en') # install 'en' model (python3 -m spacy download en); spaCy 3+ needs 'en_core_web_sm' instead
doc = nlp(wine1)
for ent in doc.ents:
    print(ent.text, ent.label_)


100% PERCENT
Oakville GPE
aged over three years DATE
2022–2030 CARDINAL


# Other things that I didn't get around to discussing:
Other ways to vectorize:
* Facebook's FastText: similar to Word2Vec, except that in addition to building vectors for words it builds vectors for sub-word units (character n-grams)
  * this means that for the word "apple" with n=3 it builds vectors for "<ap", "app", "ppl", "ple" and "le>" (where "<" and ">" mark the word boundaries), as well as for "apple" itself
  * a misspelling such as "appel" shares some of these n-grams with "apple", so the two end up with similar vectors, making the model more robust to spelling errors
  * it can also handle made up words and infrequent words a bit better than word2vec
* doc2vec, sent2vec, etc: all variations on word2vec
  * word2vec is just a means of matching words to vectors
  * sent2vec would mean matching sentences to vectors, etc
* Topic Modelling:
  * What are the most important topics in the document?
  * LSA, LDA, etc (hopefully covered by someone at a later point?)
* Cosine similarity:
  * a measure that can be used to find the reviews most similar to a given review among those we have vectors for; we could, for example, give a new wine the same score as the review with the most similar vector
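
The last idea in the list above (borrowing the score of the most similar review) can be sketched with NumPy. The vectors and scores below are made up, and the helper names are hypothetical; in practice the vectors would be the averaged word vectors or tf-idf rows computed earlier. The sketch assumes no vector is all zeros, since that would divide by zero.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_index(query_vec, review_vecs):
    """Index of the row in review_vecs that is most similar to query_vec."""
    sims = [cosine_similarity(query_vec, vec) for vec in review_vecs]
    return int(np.argmax(sims))

# Hypothetical review vectors and the scores of the wines they describe.
known_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
known_points = [88, 92, 95]

new_vec = np.array([0.9, 0.8, 0.1])
nearest = most_similar_index(new_vec, known_vecs)
predicted_points = known_points[nearest]
print(nearest, predicted_points)
```

This is effectively a one-nearest-neighbour predictor; averaging the scores of the few most similar reviews would be a natural next step.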
