[View in Colaboratory](https://colab.research.google.com/github/alvations/attap/blob/master/Attap.ipynb)

NLP in 3-Steps: Transform, Transfer and Predict (TTP)
====


>[1.0 Introduction](#scrollTo=7lxVWeaqCotg)

>[2.0 Vector-Space Model](#scrollTo=ieIxgEASDABP)

>>[2.1 Term-Frequency / Inverse Document Frequency (TF-IDF)](#scrollTo=3cPrTgtYXEzv)

>>>[2.1.1 TF-IDF with numpy "natively"](#scrollTo=mfpCVI_zNJwI)

>>>[2.1.2 Use Gensim for TF-IDF](#scrollTo=YBHT9ur6it3o)

>>>[2.1.3 Using sklearn for TF-IDF](#scrollTo=ERXxBDkbKlC5)


>>[2.2 Where on earth is "mr brown"?](#scrollTo=EipNchQj4ON-)

>>>[2.2.1 Transform: Documents to TF-IDF vectors](#scrollTo=w7sa3utu5N2X)

>>>[2.2.2 Transfer: Learn a Classifier from the Vectors](#scrollTo=p9NqX3AbVcUH)

>>>[2.2.3 Predict: Make Predictions on Test/Unseen Data](#scrollTo=w_lLkkobZ1jQ)


>[잠시만요... (wait a minute)](#scrollTo=f4calKgsccOi)


>[3.0 Vectorization Is All You Need](#scrollTo=RyUOt25Qjki6)

>>[3.1. Universal Sentence Encoder](#scrollTo=RyUOt25Qjki6)

>>[3.1.1 Transform: Encode the Text Input into a vector](#scrollTo=BkbTYn8dpcam)

>>[3.1.2 Transfer: Learn the Model to Map Sentence Vectors to Labels](#scrollTo=hTF1Crylpq4M)

>>[3.1.3 Predict: Make Predictions on the Test Data](#scrollTo=OwgvNlvOyA95)

>[4.0 Mirrors, it must be mirrors!](#scrollTo=uPgDrfX2yj0V)

>[4.1. Classifying Toxic Comments](#scrollTo=wNe2mX1Iz3JK)

>>[4.1.1 Transform: Lets munge the data to get X and y](#scrollTo=BH8i5OAEp60u)

>>[4.1.2 Transfer: Train the model using to map X -> y](#scrollTo=pYqQpstG1N8y)

> [5.0 And now...](#scrollTo=lkYd0dzR6sH9)

# 1. Introduction

**Natural Language Processing (NLP)** is the task of making computers understand and produce human languages.

**Deep Learning (DL)** is... 

Some people say it's: 

 - Neural nets
 - Stacking multiple/deep layers of "representation learning"
 - Something that burns up as many GPUs as Bitcoin mining
 - A subset of methods in machine learning 
 
 
 <br>
 **Deep Learning in NLP** is lots of arrays of floats.
 

# 2. Vector-Space Model

To understand why deep learning approaches are attractive in NLP, we should first understand the notion of a **vector space model**. 

Essentially, a vector space model is a **numerical representation of text** (usually an array). 

Traditionally, it is computed using the number of times each word occurs, e.g.



In [0]:
from collections import Counter

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

In [32]:
print('Word counts:')
print(Counter(sent0))

Word counts:
Counter({'the': 2, 'brown': 2, 'quick': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, '.': 1})


In [33]:
print(Counter(sent1))

Counter({'mr': 1, 'brown': 1, 'jumps': 1, 'over': 1, 'the': 1, 'lazy': 1, 'fox': 1, '.': 1})


### Putting the words and counts into a nice table

|  |the | brown | jumps | fox | quick | dog | over | lazy | mr | . | 
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|Sent0 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 
|Sent1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 


When we fix the position of the vocabulary in the table, we get the **sentence vectors** <br>
(i.e. list of numbers to represent each sentence) **that are comparable across sentences**: 



In [34]:
import numpy as np
import pandas as pd

column_names = ['the', 'brown', 'jumps', 'fox', 'quick', 'dog', 'over', 'lazy', 'mr', '.']
sent0 = [2,2,1,1,1,1,1,1,0,1]
sent1 = [1,1,1,1,0,0,1,1,1,1]

matrix = np.array([sent0, sent1]) # We put all documents into a matrix.

documents = pd.DataFrame(matrix, columns=column_names)
documents.head()

Unnamed: 0,the,brown,jumps,fox,quick,dog,over,lazy,mr,.
0,2,2,1,1,1,1,1,1,0,1
1,1,1,1,1,0,0,1,1,1,1


### Alternatively, without hard-coding the counts.

In [35]:
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

documents = pd.DataFrame.from_dict([Counter(sent0), Counter(sent1)])
documents = documents.fillna(0).astype(int)
documents.head()

Unnamed: 0,.,brown,dog,fox,jumps,lazy,mr,over,quick,the
0,1,2,1,1,1,1,0,1,1,2
1,1,1,0,1,1,1,1,1,0,1


In [36]:
sent0 = documents.iloc[0]
print(type(sent0))
print(sent0) # When we want to access the sentence as a pandas.Series object

<class 'pandas.core.series.Series'>
.        1
brown    2
dog      1
fox      1
jumps    1
lazy     1
mr       0
over     1
quick    1
the      2
Name: 0, dtype: int64


In [37]:
sent0 = documents.iloc[0]
print(type(sent0.values))
print(sent0.values) # When we want to access the sentence as a numpy.array object

<class 'numpy.ndarray'>
[1 2 1 1 1 1 0 1 1 2]



Conversely, if we flip the table around we get a **vector representation of each word** too:

In [38]:
documents.head()

Unnamed: 0,.,brown,dog,fox,jumps,lazy,mr,over,quick,the
0,1,2,1,1,1,1,0,1,1,2
1,1,1,0,1,1,1,1,1,0,1


In [39]:
print(documents['the'].values)

[2 1]


In [40]:
print(documents['over'].values)

[1 1]


## 2.1 Term-Frequency / Inverse Document Frequency (TF-IDF)

Traditionally, if we look at the word count tables `documents` above from the column perspective, <br>
it shows the raw count of each word in each document. 

The simplest **term-frequency** definition is the number of occurrences of each word per sentence.

In [41]:
word = 'brown'
print(f'Term-Frequency (TF) of "{word}" in sent0 =', documents[word].values[0])
print(f'Term-Frequency (TF) of "{word}" in sent1 =', documents[word].values[1])

Term-Frequency (TF) of "brown" in sent0 = 2
Term-Frequency (TF) of "brown" in sent1 = 1


Using the raw counts to denote TF allows the values in the table to range over `[0, ∞)`, <br>
which might cause some numerical instability, so typically we <br>
**normalize each word's raw count by the total count of that word across all sentences**; <br>
that way we limit the range of values to `[0, 1]`.

In [42]:
documents = documents.apply(lambda x: x/sum(x))

documents

Unnamed: 0,.,brown,dog,fox,jumps,lazy,mr,over,quick,the
0,0.5,0.666667,1.0,0.5,0.5,0.5,0.0,0.5,1.0,0.666667
1,0.5,0.333333,0.0,0.5,0.5,0.5,1.0,0.5,0.0,0.333333


Another numerical statistic that is useful for representing a word is the **document-frequency**. 

The document-frequency is simply the no. of documents that a word occurs in. <br>
So, for each word, there will be only 1 document-frequency value.






In [43]:
word = 'brown'
# Note: `documents[word].values.nonzero()[0]` returns the indices of the sentences with a non-zero value.
print(f'Document-Frequency (DF) of "{word}" =', len(documents[word].values.nonzero()[0]))
print(f'The "{word}" word appears in these sentences:', documents[word].values.nonzero()[0])

Document-Frequency (DF) of "brown" = 2
The "brown" word appears in these sentences: [0 1]


In [44]:
word = 'mr'
print(f'Document-Frequency (DF) of "{word}" =', len(documents[word].values.nonzero()[0]))
print(f'The "{word}" word appears in these sentences:', documents[word].values.nonzero()[0])

Document-Frequency (DF) of "mr" = 1
The "mr" word appears in these sentences: [1]


While term-frequency tells you how often a word occurs per sentence, <br>
*the information is localized to the word per sentence*. 

The document-frequency tells you in how many sentences of the whole collection a word occurs, <br>
*the information is global and not specific to any sentence*.






A useful statistic that combines both term- and document-frequencies is the<br>
**term-frequency / inverse document-frequency** (TF-IDF) number.<br><br>

The **inverse document frequency**, like the document-frequency, is a global statistic; it's the <br>
logarithm of the no. of documents divided by the no. of documents that the word occurs in.

In [45]:
import math

word = 'brown'
print(f'The "{word}" word appears in these sentences:', documents[word].values.nonzero()[0])

word_df = len(documents[word].values.nonzero()[0])
num_sentences, num_words = documents.shape
word_idf = math.log(num_sentences/word_df)
print(f'Inverse Document-Frequency (IDF) of "{word}" =', word_idf)

The "brown" word appears in these sentences: [0 1]
Inverse Document-Frequency (IDF) of "brown" = 0.0


In [46]:
word = 'mr'
print(f'The "{word}" word appears in these sentences:', documents[word].values.nonzero()[0])

word_df = len(documents[word].values.nonzero()[0])
num_sentences, num_words = documents.shape
word_idf = math.log(num_sentences/word_df)
print(f'Inverse Document-Frequency (IDF) of "{word}" =', word_idf)

The "mr" word appears in these sentences: [1]
Inverse Document-Frequency (IDF) of "mr" = 0.6931471805599453


In [47]:
# To compute the IDF for all words.
num_sentences, num_words = documents.shape

word2idf = {}   # Lets save a dictionary to map the words to their IDFs
idf_vector = [] # Lets save an ordered list of IDFS w.r.t. order of the column names.

for word in documents:
  word_idf = math.log(num_sentences/len(documents[word].values.nonzero()[0]))
  word2idf[word] = word_idf
  idf_vector.append(word_idf)
  print(f'IDF of "{word}" \t=', word_idf)

IDF of "." 	= 0.0
IDF of "brown" 	= 0.0
IDF of "dog" 	= 0.6931471805599453
IDF of "fox" 	= 0.0
IDF of "jumps" 	= 0.0
IDF of "lazy" 	= 0.0
IDF of "mr" 	= 0.6931471805599453
IDF of "over" 	= 0.0
IDF of "quick" 	= 0.6931471805599453
IDF of "the" 	= 0.0


**Note:** The IDF of the words that occur in both sentences is 0, while the IDF of the words that occur in only one of the two sentences is `ln(2) = 0.693...`

The **TF-IDF** is simply the product of the TF and the IDF.
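
Written out with the definitions used so far, where $c(w, d)$ is the raw count of word $w$ in sentence $d$ and $N$ is the number of sentences (these symbols are notation introduced here for illustration, not taken from any library):

$$
\begin{aligned}
\mathrm{tf}(w, d) &= \frac{c(w, d)}{\sum_{d'} c(w, d')} \\
\mathrm{df}(w) &= \left|\{\, d : c(w, d) > 0 \,\}\right| \\
\mathrm{idf}(w) &= \ln \frac{N}{\mathrm{df}(w)} \\
\mathrm{tfidf}(w, d) &= \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
\end{aligned}
$$

For example, with the two sentences above, `mr` occurs only in sent1, so $\mathrm{tf} = 1$, $\mathrm{df} = 1$ and $N = 2$, which gives a TF-IDF of $1 \cdot \ln 2 \approx 0.693$.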

In [48]:
documents_tfidf_dict = {}

# Iterate through each word in the vocabulary.
for word in documents:
  tf = documents[word] # Retrieve the term-frequency.
  idf = word2idf[word] # Retrieve the inverse doc-frequency.
  tfidf = tf*idf       # Compute the TF-IDF value.
  documents_tfidf_dict[word] = tfidf
  
pd.DataFrame.from_dict(documents_tfidf_dict)

Unnamed: 0,.,brown,dog,fox,jumps,lazy,mr,over,quick,the
0,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.693147,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.693147,0.0,0.0,0.0


In [49]:
# We can use some matrix/vector tricks to multiply each column by the respective IDF.
# `documents.values * np.array(idf_vector)`

documents_tfidf = pd.DataFrame(documents.values * np.array(idf_vector), 
                               columns=list(documents))
documents_tfidf

Unnamed: 0,.,brown,dog,fox,jumps,lazy,mr,over,quick,the
0,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.693147,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.693147,0.0,0.0,0.0


**Note:** 

- `0.0` means that the word is not helpful in differentiating between the documents. 
- Where the value is `0.693147`, the word appears in only one of the two sentences (TF = `1.0`, IDF = `ln(2)`).

### Let's add a few more sentences.

In [50]:
# Alternatively, without hard-coding the counts.

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()
sent2 = "Roses are red , the chocolates are brown .".lower().split()
sent3 = "The frank dog jumps through the red roses .".lower().split()

documents = pd.DataFrame.from_dict(list(map(Counter, [sent0, sent1, sent2, sent3])))
documents = documents.fillna(0).astype(int)
documents = documents.apply(lambda x: x/sum(x))  # Normalize the TF.
documents.head()

Unnamed: 0,",",.,are,brown,chocolates,dog,fox,frank,jumps,lazy,mr,over,quick,red,roses,the,through
0,0.0,0.25,0.0,0.5,0.0,0.5,0.5,0.0,0.333333,0.5,0.0,0.5,1.0,0.0,0.0,0.333333,0.0
1,0.0,0.25,0.0,0.25,0.0,0.0,0.5,0.0,0.333333,0.5,1.0,0.5,0.0,0.0,0.0,0.166667,0.0
2,1.0,0.25,1.0,0.25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.166667,0.0
3,0.0,0.25,0.0,0.0,0.0,0.5,0.0,1.0,0.333333,0.0,0.0,0.0,0.0,0.5,0.5,0.333333,1.0


In [51]:
# To compute the IDF for all words.
num_sentences, num_words = documents.shape

word2idf = {}   # Lets save a dictionary to map the words to their IDFs
idf_vector = [] # Lets save an ordered list of IDFS w.r.t. order of the column names.

for word in documents:
  word_idf = math.log(num_sentences/len(documents[word].values.nonzero()[0]))
  word2idf[word] = word_idf
  idf_vector.append(word_idf)
  print(f'IDF of "{word}" \t=', word_idf)

IDF of "," 	= 1.3862943611198906
IDF of "." 	= 0.0
IDF of "are" 	= 1.3862943611198906
IDF of "brown" 	= 0.28768207245178085
IDF of "chocolates" 	= 1.3862943611198906
IDF of "dog" 	= 0.6931471805599453
IDF of "fox" 	= 0.6931471805599453
IDF of "frank" 	= 1.3862943611198906
IDF of "jumps" 	= 0.28768207245178085
IDF of "lazy" 	= 0.6931471805599453
IDF of "mr" 	= 1.3862943611198906
IDF of "over" 	= 0.6931471805599453
IDF of "quick" 	= 1.3862943611198906
IDF of "red" 	= 0.6931471805599453
IDF of "roses" 	= 0.6931471805599453
IDF of "the" 	= 0.0
IDF of "through" 	= 1.3862943611198906


In [52]:
# Compute the TF-IDF table.
documents_tfidf = pd.DataFrame(documents.values * np.array(idf_vector), 
                               columns=list(documents))
documents_tfidf

Unnamed: 0,",",.,are,brown,chocolates,dog,fox,frank,jumps,lazy,mr,over,quick,red,roses,the,through
0,0.0,0.0,0.0,0.143841,0.0,0.346574,0.346574,0.0,0.095894,0.346574,0.0,0.346574,1.386294,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.071921,0.0,0.0,0.346574,0.0,0.095894,0.346574,1.386294,0.346574,0.0,0.0,0.0,0.0,0.0
2,1.386294,0.0,1.386294,0.071921,1.386294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.346574,0.346574,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.346574,0.0,1.386294,0.095894,0.0,0.0,0.0,0.0,0.346574,0.346574,0.0,1.386294


**Note:** Now, we see more cells with different values. The higher they are the more salient/prominent the word is w.r.t. (with respect to) the sentence and across sentences.
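
As a quick sanity check on the table above, one cell can be recomputed by hand. This is a minimal sketch that uses only counts visible in the four sentences; the variable names are illustrative.

```python
import math

# "brown" occurs 2 times in sent0 and 4 times across all four sentences.
tf_brown_sent0 = 2 / 4
# "brown" occurs in 3 of the 4 sentences.
idf_brown = math.log(4 / 3)

tfidf_brown_sent0 = tf_brown_sent0 * idf_brown
print(round(tfidf_brown_sent0, 6))  # 0.143841
```

The result matches the `brown` cell in row 0 of the TF-IDF table.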

## 2.1.1 TF-IDF with numpy "natively"

To summarize the above, we can easily achieve our TF-IDF table using:

In [53]:
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()
sent2 = "Roses are red , the chocolates are brown .".lower().split()
sent3 = "The frank dog jumps through the red roses .".lower().split()

documents = pd.DataFrame.from_dict(list(map(Counter, [sent0, sent1, sent2, sent3])))
documents = documents.fillna(0).astype(int)
documents = documents.apply(lambda x: x/sum(x))  # Normalize the TF.
documents.head()

# To compute the IDF for all words.
num_sentences, num_words = documents.shape

idf_vector = [] # Lets save an ordered list of IDFS w.r.t. order of the column names.

for word in documents:
  word_idf = math.log(num_sentences/len(documents[word].values.nonzero()[0]))
  idf_vector.append(word_idf)

# Compute the TF-IDF table.
documents_tfidf = pd.DataFrame(documents.values * np.array(idf_vector), 
                               columns=list(documents))
documents_tfidf

Unnamed: 0,",",.,are,brown,chocolates,dog,fox,frank,jumps,lazy,mr,over,quick,red,roses,the,through
0,0.0,0.0,0.0,0.143841,0.0,0.346574,0.346574,0.0,0.095894,0.346574,0.0,0.346574,1.386294,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.071921,0.0,0.0,0.346574,0.0,0.095894,0.346574,1.386294,0.346574,0.0,0.0,0.0,0.0,0.0
2,1.386294,0.0,1.386294,0.071921,1.386294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.346574,0.346574,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.346574,0.0,1.386294,0.095894,0.0,0.0,0.0,0.0,0.346574,0.346574,0.0,1.386294



## 2.1.2 Use `Gensim` for TF-IDF

It's nice to appreciate the computation of TF-IDF given the explanation above.

But I would advise **AGAINST** using the code above in realistic NLP outside of a tutorial. 

Instead, use the well-optimized `gensim` library for computing the TF-IDF; see https://radimrehurek.com/gensim/tutorial.html 

In [0]:
%%capture
!pip install gensim

In [55]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()
sent2 = "Roses are red , the chocolates are brown .".lower().split()
sent3 = "The frank dog jumps through the red roses .".lower().split()

dataset = [sent0, sent1, sent2, sent3]
vocab = Dictionary(dataset)
corpus = [vocab.doc2bow(sent) for sent in dataset] 
model = TfidfModel(corpus)

# To retrieve the same pd.DataFrame format.
documents_tfidf_lol = [{vocab[word_idx]:tfidf_value 
                        for word_idx, tfidf_value in sent} 
                       for sent in model[corpus]]
documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

Unnamed: 0,",",are,brown,chocolates,dog,fox,frank,jumps,lazy,mr,over,quick,red,roses,through
0,0.0,0.0,0.278849,0.0,0.335932,0.335932,0.0,0.139424,0.335932,0.0,0.335932,0.671865,0.0,0.0,0.0
1,0.0,0.0,0.153146,0.0,0.0,0.368993,0.0,0.153146,0.368993,0.737987,0.368993,0.0,0.0,0.0,0.0
2,0.390939,0.781879,0.081127,0.390939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19547,0.19547,0.0
3,0.0,0.0,0.0,0.0,0.299178,0.0,0.598356,0.12417,0.0,0.0,0.0,0.0,0.299178,0.299178,0.598356


## 2.1.3 Using `sklearn` for TF-IDF

See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
 

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix


# The *TfidfVectorizer* from sklearn expects list of strings as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()

dataset = [sent0, sent1, sent2, sent3]

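# Note: the `input` parameter only selects the input type ('content' by default);
# the data itself is passed to `fit_transform()`.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=1, stop_words=None)
tfidf_matrix =  vectorizer.fit_transform(dataset)

# Format the TF-IDF table into the pd.DataFrame format.
# (On scikit-learn older than 1.0, use `vectorizer.get_feature_names()` instead.)
vocab = vectorizer.get_feature_names_out()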
documents_tfidf_lol = [{word:tfidf_value for word, tfidf_value in zip(vocab, sent)} 
                       for sent in tfidf_matrix.toarray()]

documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

Unnamed: 0,are,brown,chocolates,dog,fox,frank,jumps,lazy,mr,over,quick,red,roses,the,through
0,0.0,0.496429,0.0,0.306594,0.306594,0.0,0.248214,0.306594,0.0,0.306594,0.388876,0.0,0.0,0.405863,0.0
1,0.0,0.321079,0.0,0.0,0.396597,0.0,0.321079,0.396597,0.503033,0.396597,0.0,0.0,0.0,0.262503,0.0
2,0.760126,0.242589,0.380063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.299646,0.299646,0.198333,0.0
3,0.0,0.0,0.0,0.340495,0.0,0.431875,0.27566,0.0,0.0,0.0,0.0,0.340495,0.340495,0.450741,0.431875


**Note:** The values are different from the ones we have computed with native `numpy` because of different definitions of how TF and IDF are computed and different parameter settings. Nevertheless, we see the same trend in how salient/prominent a word is w.r.t. the sentence and across sentences.
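
To make the difference concrete, the sketch below re-implements what `TfidfVectorizer` does under its default settings, as far as those defaults go: raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, a token pattern that drops single-character tokens such as `.` and `,`, and scaling of each row to unit L2 length. It uses only the standard library, and the helper name `tfidf_l2` is made up for this illustration.

```python
import math
import re
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy brown dog .",
    "mr brown jumps over the lazy fox .",
    "roses are red , the chocolates are brown .",
    "the frank dog jumps through the red roses .",
]

# Keep only tokens of two or more word characters (assumed default pattern).
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc) for doc in docs]
n_docs = len(tokenized)

# Document frequency and smoothed IDF for every word.
df = Counter(word for tokens in tokenized for word in set(tokens))
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in df}

def tfidf_l2(tokens):
    """Raw count times smoothed IDF, scaled so the row has unit L2 norm."""
    raw = {word: count * idf[word] for word, count in Counter(tokens).items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

row1 = tfidf_l2(tokenized[1])
print(round(row1["mr"], 6))  # 0.503033
```

The printed value agrees with the `mr` cell in row 1 of the `sklearn` table above.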

### TF-IDF beyond single words


Instead of transforming the sentences into the TF-IDF values for each word, <br>
you can extract the TF-IDF for `n-grams`; think of them as phrases of size `n`. 

**Note:** `n-grams` are neither coherent phrases nor linguistically sound elements; they are just nice features that we can use.

In [57]:
# Example of n-grams where n=2, we call these bi-grams
from nltk import ngrams

_sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
list(ngrams(_sent0, n=2))

[('the', 'quick'),
 ('quick', 'brown'),
 ('brown', 'fox'),
 ('fox', 'jumps'),
 ('jumps', 'over'),
 ('over', 'the'),
 ('the', 'lazy'),
 ('lazy', 'brown'),
 ('brown', 'dog'),
 ('dog', '.')]
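
If `nltk` is not available, the same sliding window can be sketched in plain Python by zipping the token list against shifted copies of itself. The helper name `simple_ngrams` is made up for this illustration.

```python
def simple_ngrams(tokens, n):
    """Return all n-grams (as tuples) of consecutive tokens."""
    shifted = [tokens[i:] for i in range(n)]  # tokens, tokens[1:], ...
    return list(zip(*shifted))

tokens = "the quick brown fox".split()
print(simple_ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(simple_ngrams(tokens, 3))  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```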

### To extract n-gram TF-IDF, simply change the `ngram_range` parameter

In [58]:
vectorizer = TfidfVectorizer(analyzer='word', 
                             ngram_range=(2,2), # Bi-grams, 2 words.
                             min_df=1, stop_words=None)
tfidf_matrix =  vectorizer.fit_transform(dataset)

# Format the TF-IDF table into the pd.DataFrame format.
# (On scikit-learn older than 1.0, use `vectorizer.get_feature_names()` instead.)
vocab = vectorizer.get_feature_names_out()
documents_tfidf_lol = [{word:tfidf_value for word, tfidf_value in zip(vocab, sent)} 
                       for sent in tfidf_matrix.toarray()]

documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

Unnamed: 0,are brown,are red,brown dog,brown fox,brown jumps,chocolates are,dog jumps,fox jumps,frank dog,jumps over,...,quick brown,red roses,red the,roses are,the chocolates,the frank,the lazy,the quick,the red,through the
0,0.0,0.0,0.35658,0.35658,0.0,0.0,0.0,0.35658,0.0,0.281132,...,0.35658,0.0,0.0,0.0,0.0,0.0,0.281132,0.35658,0.0,0.0
1,0.0,0.0,0.0,0.0,0.453386,0.0,0.0,0.0,0.0,0.357455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.357455,0.0,0.0,0.0
2,0.408248,0.408248,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,...,0.0,0.0,0.408248,0.408248,0.408248,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.377964,0.0,0.377964,0.0,...,0.0,0.377964,0.0,0.0,0.0,0.377964,0.0,0.0,0.377964,0.377964


## 2.2 Where on earth is "mr brown"?

Let's try an easy task of assigning positive labels to sentences where `mr brown` is in the sentence and negative labels otherwise.



In [0]:
# Train sentences.
sent0, label0 = "The quick brown fox jumps over the lazy brown dog .".lower() , False
sent1, label1 = "Mr brown jumps over the lazy fox .".lower(), True
sent2, label2 = "Roses are red , the chocolates are brown .".lower(), False
sent3, label3 = "The frank dog jumps through the red roses .".lower(), False

# Test sentences.
sent4, label4 = "Mr Tan jumps on red chocolates ?".lower(), False
sent5, label5 = "Mr brown likes the lazy dog .".lower(), True

First, let's write a simple "classifier" to predict the existence of `mr brown` with an **if-else** clause.

In [60]:
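# Note: the last pair reuses sent0 (with label2) rather than (sent3, label3);
# the outputs shown in this section were produced with this training list.
train_documents = [(sent0, label0), (sent1, label1), (sent2, label2), (sent0, label2)]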
test_documents = [(sent4, label4), (sent5, label5)]

train_texts, train_labels = zip(*train_documents)
test_texts, test_labels = zip(*test_documents)

for sent, actual_label in test_documents:
  predicted_label = 'mr brown' in sent
  print(sent, '\t', predicted_label, actual_label)

mr tan jumps on red chocolates ? 	 False False
mr brown likes the lazy dog . 	 True True


## 2.2.1 `Transform`: Documents to TF-IDF vectors

Let's get "sophisticated" and try to learn a classifier to predict the presence of `mr brown` without explicitly checking the string.

Using what we've learnt, let's use the `sklearn` TF-IDF representation of the sentences as the input to train a classifier. 


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix

vectorizer = TfidfVectorizer(analyzer='word', 
                             ngram_range=(1,1), # Note this parameter!
                             min_df=1, stop_words=None)

X_train =  vectorizer.fit_transform(train_texts)
X_test =  vectorizer.transform(test_texts)

y_train = train_labels
y_test = test_labels

**Why `sklearn` not `gensim`?**

In fact, they would have yielded almost the same TF-IDF representation, but gensim tends to throw away the all-zero columns, and converting its output into a sparse matrix that the `sklearn` classifier accepts takes an extra step.

## 2.2.2 `Transfer`: Learn a Classifier from the Vectors

There's a whole range of classifiers from `sklearn` that can be used. <br>
For a detailed example, see http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html

In [62]:
# Pick your poison.
from sklearn.linear_model import Perceptron
# Initialize your classifier.
clf = Perceptron(max_iter=10)
# Train the classifier.
clf.fit(X_train, y_train)

Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=10, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)

## 2.2.3 `Predict`: Make Predictions on Test/Unseen Data

In [63]:
print(clf.predict(X_test))
print(test_labels)

[False False]
(False, True)


## Booo! So much for sophistication. 

Note that the task is to predict the existence of "mr brown" in the sentence. 

Wouldn't it make more sense if the TF-IDF took n-grams into consideration?
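
To see why word order matters here, consider two token sequences that contain exactly the same words. This small sketch (standard library only; the two example sentences and the helper `bigrams` are invented for illustration) shows that their unigram counts are identical, while only one of them contains the bigram `('mr', 'brown')`.

```python
from collections import Counter

with_mr_brown = "mr brown likes the lazy dog".split()
without_mr_brown = "mr dog likes the lazy brown".split()

def bigrams(tokens):
    """Set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

# Unigram counts cannot tell the two sentences apart...
same_unigrams = Counter(with_mr_brown) == Counter(without_mr_brown)
# ...but the bigram ('mr', 'brown') occurs in only one of them.
has_bigram = [("mr", "brown") in bigrams(s) for s in (with_mr_brown, without_mr_brown)]

print(same_unigrams)  # True
print(has_bigram)     # [True, False]
```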


## Let's try TF-IDF Transformation with Bi-grams!

In [64]:
# Transform.
vectorizer = TfidfVectorizer(analyzer='word', 
                             ngram_range=(2,2),
                             min_df=1, stop_words=None)

X_train =  vectorizer.fit_transform(train_texts)
X_test =  vectorizer.transform(test_texts)

y_train = train_labels
y_test = test_labels

# Transfer.
clf = Perceptron(max_iter=10)
clf.fit(X_train, y_train)

# Predict.
print(clf.predict(X_test))
print(y_test)

[False  True]
(False, True)


### Voilà, Mr Brown is found!


# 잠시만요... (wait a minute) Where's the Deep Learning Stuff?

> “The thing that hath been, it is that which shall be; and <br>
that which is done is that which shall be done: and<br>
there is no new thing under the sun.” <br>- Book of Ecclesiastes




In fact, the deep learning methods for solving most classification/regression tasks follow <br>the same (i) **`Transform`**, (ii) **`Transfer`** and (iii) **`Predict`** paradigm. 

Take, for instance, the use of **neural embeddings** in NLP; the general idea is to:

 1. **Transform**: Embed and encode the input text into a vector
 2. **Transfer**: Learn the model to map from the vector to the training labels
 3. **Predict**: Produce the output based on the model 
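
The three steps can be sketched end to end without any deep learning library. The toy below is an illustration under simplifying assumptions, not the method used elsewhere in this notebook: `transform` counts bigrams, `transfer` sums the training vectors of each label into a centroid, and `predict` returns `True` only when a vector is strictly closer (by cosine similarity) to the `True` centroid. All function and variable names are invented for this example.

```python
import math
from collections import Counter

def transform(text):
    """Transform: turn a text into a sparse bigram-count vector."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * v[feature] for feature, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def transfer(vectors, labels):
    """Transfer: learn one summed centroid vector per label."""
    centroids = {}
    for vector, label in zip(vectors, labels):
        centroids.setdefault(label, Counter()).update(vector)
    return centroids

def predict(centroids, vector):
    """Predict: True only if the vector is strictly closer to the True centroid."""
    return cosine(vector, centroids[True]) > cosine(vector, centroids[False])

toy_train = [
    ("The quick brown fox jumps over the lazy brown dog .", False),
    ("Mr brown jumps over the lazy fox .", True),
    ("Roses are red , the chocolates are brown .", False),
    ("The frank dog jumps through the red roses .", False),
]
toy_test = ["Mr Tan jumps on red chocolates ?", "Mr brown likes the lazy dog ."]

toy_texts, toy_labels = zip(*toy_train)
toy_model = transfer([transform(text) for text in toy_texts], toy_labels)
toy_predictions = [predict(toy_model, transform(text)) for text in toy_test]
print(toy_predictions)  # [False, True]
```

Swapping `transform` for a neural sentence encoder, or `transfer` for a stronger classifier, leaves the overall shape of the pipeline unchanged.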

@SpaCy has a nice blogpost on this: https://explosion.ai/blog/deep-learning-formula-nlp

**Cut-away:** Why does `predict` deserve its own step?

> Actually, depending on which model is used to learn the mapping from the transformed <br>vector to the gold standard, "prediction" can be more than simply predicting a single output <br>label/number; e.g. it could be generating an output sequence, which needs some sort of "decoder". 

# 3.0 Vectorization Is All You Need

For the *neural news chasers*, the **Attention Is All You Need (AIAYN)** <br>
and **Deep Averaging Network (DAN)** neural architectures have been the new kids <br>
on the block challenging the omnipotent **Recurrent Neural Network (RNN)** architectures, <br>
notably the **Long Short-Term Memory (LSTM)**. 


### Riddikulus!

If the heading above sounds like a Harry Potter spell, yes, it is one. <br>
But **AIAYN**, **DAN**, **RNN** and **LSTM** seem equally elusive...

For the mere mortals, we can think of them as vectorizers, like the TF-IDF. 

## 3.1. Universal Sentence Encoder

The [**Universal Sentence Encoder**](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1) is a pre-trained sentence encoder that comes in two variants: one using the **Transformer** architecture and one using a **Deep Averaging Network (DAN)**. 

In [0]:
# Install Tensorflow 1.x (the code below uses the TF1 `tf.Session` and `hub.Module` APIs).
!pip3 install --quiet "tensorflow>=1.7,<2"
# Install TF-Hub.
!pip3 install --quiet tensorflow-hub

In [0]:
import tensorflow as tf
import tensorflow_hub as hub

import numpy as np

# Printing niceties: truncate long arrays with ellipses
# so that they stay humanly readable.
np.set_printoptions(precision=4, threshold=10)



In [0]:
# The URL that hosts the Transformer model for Universal Sentence Encoder 
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/1"

# The URL that hosts the DAN model for Universal Sentence Encoder.
# This second assignment overrides the first, so the DAN model is the one used below.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/1"

# On a local machine, uncomment the last two lines in this cell,
# so that the module doesn't get redownloaded
# when you run the notebook in different sessions.
#
# By setting `TFHUB_CACHE_DIR` environment variable,
# it sets the directory where tf_hub will save the model.
#
##import os
##os.environ["TFHUB_CACHE_DIR"] = os.getcwd() + "/tfhub_models/"

In [68]:
# Import the Universal Sentence Encoder's TF Hub module
# This will take some time to download the model for the first time...
embed = hub.Module(module_url)

INFO:tensorflow:Initialize variable module_1/Embeddings_en/sharded_0:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with Embeddings_en/sharded_0
INFO:tensorflow:Initialize variable module_1/Embeddings_en/sharded_1:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with Embeddings_en/sharded_1
INFO:tensorflow:Initialize variable module_1/Embeddings_en/sharded_10:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with Embeddings_en/sharded_10
INFO:tensorflow:Initialize variable module_1/Embeddings_en/sharded_11:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with Embeddings_en/sharded_11
INFO:tensorflow:Initialize variable module_1/Embeddings_en/sharded_12:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with Embeddings_en/sharded_12

INFO:tensorflow:Initialize variable module_1/SNLI/Classifier/tanh_layer_0/bias:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with SNLI/Classifier/tanh_layer_0/bias
INFO:tensorflow:Initialize variable module_1/SNLI/Classifier/tanh_layer_0/weights:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with SNLI/Classifier/tanh_layer_0/weights
INFO:tensorflow:Initialize variable module_1/global_step:0 from checkpoint b'/tmp/tfhub_modules/c6f5954ffa065cdb2f2e604e740e8838bf21a2d3/variables/variables' with global_step


## 3.1.1 Transform: Encode the Text Input into a vector

In [0]:
# Train sentences.
sent0, label0 = "The quick brown fox jumps over the lazy brown dog .".lower() , False
sent1, label1 = "Mr brown jumps over the lazy fox .".lower(), True
sent2, label2 = "Roses are red , the chocolates are brown .".lower(), False
sent3, label3 = "The frank dog jumps through the red roses .".lower(), False

# Test sentences.
sent4, label4 = "Mr Tan jumps on red chocolates ?".lower(), False
sent5, label5 = "Mr brown likes the lazy dog .".lower(), True

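# Note: the last pair reuses sent0 (with label2) rather than (sent3, label3);
# the outputs shown in this section were produced with this training list.
train_documents = [(sent0, label0), (sent1, label1), (sent2, label2), (sent0, label2)]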
test_documents = [(sent4, label4), (sent5, label5)]

train_texts, train_labels = zip(*train_documents)
test_texts, test_labels = zip(*test_documents)

y_train = train_labels
y_test = test_labels

with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  X_train = sentence_embeddings = session.run(embed(train_texts))
  X_test = test_embeddings = session.run(embed(test_texts))

In [70]:
print(sent0)

the quick brown fox jumps over the lazy brown dog .


In [71]:
# Much like the TF-IDF model, the DAN embedding returns
# an array of floating-point numbers, here 512 of them.
print(len(sentence_embeddings[0]))
sentence_embeddings[0]

512


array([ 0.0208,  0.0175, -0.0132, ...,  0.0613, -0.0577, -0.0428],
      dtype=float32)

## 3.1.2 Transfer: Learn the Model to Map Sentence Vectors to Labels

In [72]:
# Pick your poison.
from sklearn.linear_model import Perceptron
# Initialize your classifier.
clf = Perceptron(max_iter=10)
# Train the classifier.
clf.fit(X_train, y_train)

Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=10, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)

## 3.1.3 Predict: Make Predictions on the Test Data

In [73]:
print(clf.predict(X_test))
print(y_test)

[False  True]
(False, True)


**Note:** In this case, we have **NOT** explicitly included any bigram information in the <br>
input features, but the classifier still learns to predict the existence of "mr brown" correctly!

# 4.0 Mirrors, it must be mirrors! 

That's what I usually say when I don't really know how/why the magic works. <br>


But the sentence vectors produced by the **Deep Averaging Network** model aren't smoke and magic, <br>
we can easily replicate the TTP trick on other NLP tasks.



# 4.1. Classifying Toxic Comments

Let's apply what we learnt to a realistic task and **fight cyber-abuse with NLP**!

From https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/

> *The threat of abuse and harassment online means that many people stop <br>*
> *expressing themselves and give up on seeking different opinions. <br>*
> *Platforms struggle to effectively facilitate conversations, leading many <br>*
> *communities to limit or completely shut down user comments.*


The goal of the task is to build a model to detect different types of toxicity:

 - threats
 - obscenity
 - insults
 - identity-based hate
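
In the training data, each comment carries one 0/1 column per toxicity type, and a comment can have several set at once. The sketch below shows one way to read such a row. The column names are those of `train.csv`; the example row is made up for illustration and the helper `active_labels` is hypothetical.

```python
# Label columns as they appear in train.csv.
LABEL_COLUMNS = ['toxic', 'severe_toxic', 'obscene',
                 'threat', 'insult', 'identity_hate']

def active_labels(row):
    """Return the names of the label columns set to 1 in `row`."""
    return [name for name in LABEL_COLUMNS if row.get(name) == 1]

# Hypothetical row, not taken from the dataset.
example_row = {'toxic': 1, 'severe_toxic': 0, 'obscene': 1,
               'threat': 0, 'insult': 1, 'identity_hate': 0}
print(active_labels(example_row))
```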
 
 
## Digging into the data...

In [74]:
!pip3 install kaggle
!mkdir -p /content/.kaggle/
# Replace the placeholders with your own Kaggle username and API key.
!echo '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}' > /content/.kaggle/kaggle.json
!chmod 600 /content/.kaggle/kaggle.json
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
# Quote the wildcard so that unzip expands it itself; otherwise the shell passes
# the extra .zip files as member names and unzip reports "filename not matched".
!unzip -o '/content/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/*.zip'

sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test_labels.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [75]:
import os
os.listdir('/content/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/')

['sample_submission.csv.zip',
 'test_labels.csv.zip',
 'test.csv.zip',
 'train.csv.zip']

In [0]:
from zipfile import ZipFile
import pandas as pd

data_dir = '/content/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/'

with ZipFile(data_dir+'train.csv.zip', 'r') as zipfin:
    train_csv = zipfin.open('train.csv')
    df_train = pd.read_csv(train_csv)

In [77]:
# This is what the training data looks like.
df_train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [0]:
with ZipFile(data_dir+'test.csv.zip', 'r') as zipfin:
    test_csv = zipfin.open('test.csv')
    df_test = pd.read_csv(test_csv)

In [79]:
# This is what the test data looks like w/o the labels.
df_test.head()

| | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


In [0]:
with ZipFile(data_dir+'test_labels.csv.zip', 'r') as zipfin:
    test_labels = zipfin.open('test_labels.csv')
    df_test_labels = pd.read_csv(test_labels)

In [81]:
# This is what the test labels look like.
df_test_labels.head()

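| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 0000247867823ef7 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 00013b17ad220c46 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | 00017563c3f7919a | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | 00017695ad8997eb | -1 | -1 | -1 | -1 | -1 | -1 |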


## 4.1.1 Transform: Let's munge the data till we get our X (input vectors) and y (output labels)

To simplify things, we can start by predicting just a single label: whether a comment is `toxic` or not.

In [0]:
# Internet connection is slow here, so let's use 100 train comments
# instead of the full dataset.
X_train_text = list(df_train['comment_text'])[:100]
y_train = df_train['toxic'][:100]

# Likewise, use only 10 test comments instead of the full dataset.
X_test_text = list(df_test['comment_text'])[:10]


def transform_text_to_vectors(train_texts, test_texts):
  """Encode the train and test texts into sentence vectors."""
  with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    X_train = session.run(embed(train_texts))
    X_test = session.run(embed(test_texts))
  return X_train, X_test


# Encode the small subsets selected above.
X_train, X_test = transform_text_to_vectors(X_train_text, X_test_text)

## To use the full dataset, un-comment these lines:
##y_train = df_train['toxic']
##X_train, X_test = transform_text_to_vectors(list(df_train['comment_text']),
##                                            list(df_test['comment_text']))
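
Encoding the full dataset in a single call may need more memory than a small machine has (an assumption, not something measured here). One option is to encode the comments in chunks. Below is a minimal sketch of a chunking helper in plain Python; the name `batches` and the batch size are hypothetical.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in for a list of comments.
comments = ['comment %d' % i for i in range(10)]
chunks = list(batches(comments, batch_size=4))
print([len(chunk) for chunk in chunks])
```

Each chunk could then be passed to `session.run(embed(chunk))` inside one session, and the resulting arrays stacked, e.g. with `numpy.vstack`.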

## 4.1.2 Transfer: Train the model to map X -> y

In [84]:
# Pick your poison.
from sklearn.linear_model import Perceptron
# Initialize your classifier.
clf = Perceptron(max_iter=10)
# Train the classifier.
clf.fit(X_train, y_train)

Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=10, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)

In [85]:
predictions = clf.predict(X_test)

for (idx, row_txt), (_, row_labels), pred in zip(df_test.iterrows(), df_test_labels.iterrows(), predictions):
  if idx > 10:
    break
  print('Input text:', [row_txt['comment_text']])
  print('Correct label:', row_labels['toxic'])
  print('Predicted label:', pred)
  print()

Input text: ["Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,"]
Correct label: -1
Predicted label: 1

Input text: ['== From RfC == \n\n The title is fine as it is, IMO.']
Correct label: -1
Predicted label: 0

Input text: ['" \n\n == Sources == \n\n * Zawe Ashton on Lapland —  /  "']
Correct label: -1
Predicted label: 0

Input text: [":If you have a look back at the source, the information I updated was the correct form. I can only guess the source hadn't updated. I shall update the information once again but thank you for your message."]
Correct label: -1
Predicted label: 0

Input text: ["I don't anonymously edit articles at all."]
Correct label: -1
Predicted label: 0


## Gotcha! What's -1? 

It's a quirk of leaderboard-driven competitions: not all of the test data is used for scoring, and participants don't know which test data points are evaluated until the competition is over. In `test_labels.csv`, a value of -1 marks a comment that was not used for scoring. For details, take a look at https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
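
Before computing any score against the test labels, the -1 rows need to be skipped. A minimal sketch in plain Python follows; the labels and predictions are made up for illustration, and the helper name `scored_pairs` is hypothetical.

```python
def scored_pairs(gold_labels, predictions):
    """Keep only (gold, predicted) pairs whose gold label is not -1."""
    return [(g, p) for g, p in zip(gold_labels, predictions) if g != -1]

# Hypothetical gold labels and predictions, not taken from the dataset.
gold = [-1, 0, 1, -1, 0]
pred = [1, 0, 1, 0, 1]

pairs = scored_pairs(gold, pred)
correct = sum(1 for g, p in pairs if g == p)
print(len(pairs), correct)
```

With pandas, the same filter can be written as a boolean mask such as `df_test_labels['toxic'] != -1`.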

Although we haven't built a full model in this workshop, you now have the tools and code to tackle almost any NLP problem with the same 3 steps.

# 5.0 And now...

- Download the notebook
- Go home and run it with proper internet
- Re-run the notebook and rebuild the toxicity model
- Transform, Transfer and Predict !!!


Hopefully, the notebook has walked you through your first steps in NLP with some "deep learning". To delve deeper, take a look at these:

 - https://www.kaggle.com/alvations/basic-nlp-with-nltk
 - https://explosion.ai/blog/deep-learning-formula-nlp
 - https://radimrehurek.com/gensim/tutorial.html
 - http://www.nltk.org/book
 
 