In [0]:
from __future__ import print_function  # __future__ imports must come before any other import

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.preprocessing import Binarizer
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

import numpy as np 

# Text Vectorization

In this exercise we will explore different ways of encoding text so that it can be used in a machine learning model.  First we must take the full text and split it into 'documents'.  In the example below we split it into individual sentences.

*Note: A text corpus is a large and unstructured set of texts*

In [0]:
text = """There are a number of ways you can organise the data so we can train a classifier on it. A TfidfVectorizer or a CountVectorizer are both good for the job since we are working with large amounts of text. A TfidfVectorizer does what a CountVectorizer does with a TfidfTransformer on top."""
corpus = text.split('. ')

# Keras Implementation
## Tokenization

We can use the [keras Tokenizer](http://faroit.com/keras-docs/2.0.6/preprocessing/text/) to tokenize all of the text.  By default this splits up all text into word tokens.  As you can see below, we don't pass in any parameters, leaving all defaults as they are.  Have a look at the official documentation to see what other parameters are set by default.

In [0]:
t = Tokenizer()
t.fit_on_texts(corpus)

Once the tokenizer has been fitted, we can inspect the tokens through the following attributes.  Based on their names and the output, what do you think each one gives us?

*Hint: Not sure? Check the official documentation!*

In [5]:
total_words = len(t.word_index)+1
print(total_words)
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

35
OrderedDict([('there', 1), ('are', 3), ('a', 7), ('number', 1), ('of', 2), ('ways', 1), ('you', 1), ('can', 2), ('organise', 1), ('the', 2), ('data', 1), ('so', 1), ('we', 2), ('train', 1), ('classifier', 1), ('on', 2), ('it', 1), ('tfidfvectorizer', 2), ('or', 1), ('countvectorizer', 2), ('both', 1), ('good', 1), ('for', 1), ('job', 1), ('since', 1), ('working', 1), ('with', 2), ('large', 1), ('amounts', 1), ('text', 1), ('does', 2), ('what', 1), ('tfidftransformer', 1), ('top', 1)])
3
{'a': 1, 'are': 2, 'of': 3, 'can': 4, 'the': 5, 'we': 6, 'on': 7, 'tfidfvectorizer': 8, 'countvectorizer': 9, 'with': 10, 'does': 11, 'there': 12, 'number': 13, 'ways': 14, 'you': 15, 'organise': 16, 'data': 17, 'so': 18, 'train': 19, 'classifier': 20, 'it': 21, 'or': 22, 'both': 23, 'good': 24, 'for': 25, 'job': 26, 'since': 27, 'working': 28, 'large': 29, 'amounts': 30, 'text': 31, 'what': 32, 'tfidftransformer': 33, 'top': 34}
defaultdict(<class 'int'>, {'the': 2, 'are': 2, 'we': 2, 'a': 3, 'organ
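
To make these attributes more concrete, here is a minimal standard-library sketch of how a frequency-ranked word index could be built.  It is not the Keras implementation: the regex tokenization and the helper name `build_word_index` are assumptions made for illustration, and the regex only approximates the Tokenizer's default filters.

```python
import re
from collections import Counter

text = (
    "There are a number of ways you can organise the data so we can train "
    "a classifier on it. A TfidfVectorizer or a CountVectorizer are both good "
    "for the job since we are working with large amounts of text. "
    "A TfidfVectorizer does what a CountVectorizer does with a "
    "TfidfTransformer on top."
)
corpus = text.split('. ')


def build_word_index(docs):
    """Rank words by total count (most frequent first), starting at index 1."""
    counts = Counter()
    for doc in docs:
        # Lowercase, then keep runs of letters and digits (assumed tokenization).
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    # most_common() sorts by count; ties keep their first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is left unused, which is why the notebook adds 1 to the length.
    return {word: rank for rank, word in enumerate(ranked, start=1)}


word_index = build_word_index(corpus)
print(len(word_index) + 1)
print(word_index['a'], word_index['are'], word_index['there'])
```

On this corpus the sketch assigns the same indexes as the `word_index` output above, which suggests the ranking is by overall frequency.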

## Keras Sequence

The default way that Keras produces vectors for use in a machine learning model is by converting the text to sequences.  Each sequence is a list of word indexes that refer to the dictionary created by `fit_on_texts`.

### Question
Have a look at the output of one of the tokenizer attributes.  Can you manually decode the first 3 words of any of the three documents?

In [7]:
X = t.texts_to_sequences(corpus)
X

[[12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1, 20, 7, 21],
 [1, 8, 22, 1, 9, 2, 23, 24, 25, 5, 26, 27, 6, 2, 28, 10, 29, 30, 3, 31],
 [1, 8, 11, 32, 1, 9, 11, 10, 1, 33, 7, 34]]
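
Once you have tried decoding by hand, the lookup can be checked with a small sketch.  It inverts a few entries copied from the `word_index` output above; the helper name `decode` and the `'<unknown>'` placeholder are illustrative assumptions, not part of Keras.

```python
# A handful of entries copied from the word_index output shown earlier.
word_index = {'a': 1, 'are': 2, 'of': 3, 'there': 12, 'number': 13}

# Invert the mapping so that an index leads back to its word.
index_word = {index: word for word, index in word_index.items()}


def decode(sequence):
    """Map each index back to its word, flagging indexes we did not copy."""
    return [index_word.get(index, '<unknown>') for index in sequence]


# First three indexes of the first document's sequence.
print(decode([12, 2, 1]))
```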

An important point to reiterate is that most machine learning algorithms require each observation/feature vector to be the same size.  However, as we can see from the output above, each of our 3 documents is a different size because each document has a different word count.  We can fix this by applying padding.

### Question
Where is the padding applied, and if the code were changed to apply it elsewhere, what would need to be added to the statement below?

In [8]:
pad_sequences(X)

array([[ 0, 12,  2,  1, 13,  3, 14, 15,  4, 16,  5, 17, 18,  6,  4, 19,
         1, 20,  7, 21],
       [ 1,  8, 22,  1,  9,  2, 23, 24, 25,  5, 26, 27,  6,  2, 28, 10,
        29, 30,  3, 31],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1,  8, 11, 32,  1,  9, 11, 10,
         1, 33,  7, 34]], dtype=int32)

In [9]:
pad_sequences(X).shape

(3, 20)
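
The idea behind padding can be sketched in plain Python.  This is a hand-written illustration, not the `pad_sequences` implementation: the function name `pad` and its `side` and `value` parameters are hypothetical.

```python
def pad(sequences, side="pre", value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (width - len(seq))
        # "pre" puts the fill in front of the tokens, anything else puts it behind.
        padded.append(fill + list(seq) if side == "pre" else list(seq) + fill)
    return padded


print(pad([[1, 2], [3]]))
print(pad([[1, 2], [3]], side="post"))
```

Comparing the two printed results with the array above should help with the question.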

As you can see, the tokenizer also performs some word counting and therefore could be used to create a frequency vector.  Be careful, however: the output above is not the same as the encodings that follow.

## Texts to Matrix

The `texts_to_matrix` function encodes each document as a fixed-length vector.  Its `mode` parameter selects which of the encodings covered so far is produced.

In [10]:
t = Tokenizer()
t.fit_on_texts(corpus)
t.texts_to_matrix(corpus, mode="binary").shape

(3, 35)

### Question

Can you explain the reason why this output is a different shape to the padded sequence above?

In [11]:
t.texts_to_matrix(corpus, mode="binary")

array([[0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 1., 1.]])

In [12]:
t.texts_to_matrix(corpus, mode="count")

array([[0., 2., 1., 1., 2., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 2., 2., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        0., 0., 0.],
       [0., 3., 0., 0., 0., 0., 0., 1., 1., 1., 1., 2., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 1., 1.]])
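
The binary and count matrices can be reproduced from the sequences with a short sketch.  The helper `to_matrix` is a hypothetical name used for illustration; it assumes one column per word index plus an unused column 0, which is what the shape `(3, 35)` suggests.

```python
# Sequences copied from the texts_to_sequences output above.
sequences = [
    [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1, 20, 7, 21],
    [1, 8, 22, 1, 9, 2, 23, 24, 25, 5, 26, 27, 6, 2, 28, 10, 29, 30, 3, 31],
    [1, 8, 11, 32, 1, 9, 11, 10, 1, 33, 7, 34],
]
vocab_size = 35  # len(word_index) + 1, so column 0 never matches a word


def to_matrix(seqs, num_columns, mode="count"):
    """Build one row per document with one column per word index."""
    matrix = [[0.0] * num_columns for _ in seqs]
    for row, seq in zip(matrix, seqs):
        for index in seq:
            if mode == "binary":
                row[index] = 1.0   # presence only
            else:
                row[index] += 1.0  # number of occurrences
    return matrix


counts = to_matrix(sequences, vocab_size, mode="count")
binary = to_matrix(sequences, vocab_size, mode="binary")
print(counts[2][1], binary[2][1])
```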

In [13]:
t.texts_to_matrix(corpus, mode="tfidf")

array([[0.        , 0.94751189, 0.69314718, 0.69314718, 1.55141507,
        0.69314718, 0.69314718, 0.69314718, 0.        , 0.        ,
        0.        , 0.        , 0.91629073, 0.91629073, 0.91629073,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.91629073, 0.91629073, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.94751189, 1.17360019, 0.69314718, 0.        ,
        0.69314718, 0.69314718, 0.        , 0.69314718, 0.69314718,
        0.69314718, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.91629073, 0.91629073, 0.91629073,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.91629073, 0.91629073, 0.        , 0.        , 0.        ],
       [0.        , 1.17441657, 0.        , 0.
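
The numbers above are consistent with the weighting sketched below.  The formula is inferred from the printed values rather than quoted from the library, and the function name `keras_style_tfidf` is hypothetical.

```python
import math


def keras_style_tfidf(count, docs_with_word, num_docs):
    """Weight one cell: (1 + ln count) * ln(1 + num_docs / (1 + docs_with_word))."""
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)
    idf = math.log(1 + num_docs / (1 + docs_with_word))
    return tf * idf


# 'a' appears twice in the first document and in all 3 documents.
print(round(keras_style_tfidf(2, 3, 3), 8))
# 'can' appears twice in the first document and in 1 document only.
print(round(keras_style_tfidf(2, 1, 3), 8))
```

Note that a word found in every document still receives a non-zero weight under this formula.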

### Question

There is another mode that Keras accepts.  What is it, and can you explain its output in the context of our corpus?

In [17]:
# TODO:
t.texts_to_matrix(corpus, mode='freq')

array([[0.        , 0.10526316, 0.05263158, 0.05263158, 0.10526316,
        0.05263158, 0.05263158, 0.05263158, 0.        , 0.        ,
        0.        , 0.        , 0.05263158, 0.05263158, 0.05263158,
        0.05263158, 0.05263158, 0.05263158, 0.05263158, 0.05263158,
        0.05263158, 0.05263158, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.1       , 0.1       , 0.05      , 0.        ,
        0.05      , 0.05      , 0.        , 0.05      , 0.05      ,
        0.05      , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.05      , 0.05      , 0.05      ,
        0.05      , 0.05      , 0.05      , 0.05      , 0.05      ,
        0.05      , 0.05      , 0.        , 0.        , 0.        ],
       [0.        , 0.25      , 0.        , 0.
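
A quick way to check an interpretation of these values is to divide each word's count by the length of its document.  The sketch below does that for the third document; `freq_row` is a hypothetical helper, not a Keras function.

```python
from collections import Counter


def freq_row(sequence, num_columns):
    """Share of the document's tokens taken up by each word index."""
    row = [0.0] * num_columns
    for index, count in Counter(sequence).items():
        row[index] = count / len(sequence)
    return row


# Third document's sequence, copied from the texts_to_sequences output.
third_doc = [1, 8, 11, 32, 1, 9, 11, 10, 1, 33, 7, 34]
row = freq_row(third_doc, 35)
print(row[1])
```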

# Scikit-learn Implementation

## Frequency Vectors

Now we are going to create an encoded frequency vector using scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).  Run the code below and see if you can verify the outputs, i.e. count the number of times the word 'are' appears in each of the sentences in the text and check that you get the same values as the output below.

In [18]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['amounts', 'are', 'both', 'can', 'classifier', 'countvectorizer', 'data', 'does', 'for', 'good', 'it', 'job', 'large', 'number', 'of', 'on', 'or', 'organise', 'since', 'so', 'text', 'tfidftransformer', 'tfidfvectorizer', 'the', 'there', 'top', 'train', 'ways', 'we', 'what', 'with', 'working', 'you']
[[0 1 0 2 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1]
 [1 2 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0]
 [0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0]]


In [19]:
X.shape

(3, 33)

### Question

Note the difference in shape and values of the frequency and sequence vectors.  Can you explain this? (Hint: Are there any words missing?)
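
One way to investigate the hint is to tokenize the corpus by hand.  The sketch below uses the regular expression that scikit-learn documents as the default `token_pattern` for CountVectorizer; the rest (lowercasing, sorting, counting with plain lists) is a simplified stand-in for what the class does, not its implementation.

```python
import re

text = (
    "There are a number of ways you can organise the data so we can train "
    "a classifier on it. A TfidfVectorizer or a CountVectorizer are both good "
    "for the job since we are working with large amounts of text. "
    "A TfidfVectorizer does what a CountVectorizer does with a "
    "TfidfTransformer on top."
)
corpus = text.split('. ')

# Default token_pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]
vocabulary = sorted({token for doc in tokenized for token in doc})
counts = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(len(vocabulary))
print('a' in vocabulary)
print([row[vocabulary.index('are')] for row in counts])
```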

## One-Hot Vectors

Another scheme discussed is one-hot encoding, which reduces the imbalance caused by the distribution of tokens.  An easy way to produce it is to take the output from the CountVectorizer and clip every positive count to 1, so that each entry records only whether a word is present in a document.  We can do this with the [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class from scikit-learn.

In [20]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

onehot = Binarizer()
X = onehot.fit_transform(X)
print(X.toarray())

['amounts', 'are', 'both', 'can', 'classifier', 'countvectorizer', 'data', 'does', 'for', 'good', 'it', 'job', 'large', 'number', 'of', 'on', 'or', 'organise', 'since', 'so', 'text', 'tfidftransformer', 'tfidfvectorizer', 'the', 'there', 'top', 'train', 'ways', 'we', 'what', 'with', 'working', 'you']
[[0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1]
 [1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0]
 [0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0]]
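
The thresholding itself is simple enough to sketch by hand.  The function `binarize` below is a hypothetical plain-Python equivalent of what Binarizer does with its default threshold of 0, shown on the first five columns of the first two count rows.

```python
def binarize(matrix, threshold=0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in matrix]


# First five columns of the first two rows of the count matrix.
counts = [
    [0, 1, 0, 2, 1],
    [1, 2, 1, 0, 0],
]
print(binarize(counts))
```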


A much easier solution is to set a particular parameter in the CountVectorizer object creation.  Look up the official documentation to find out what it is and make sure you get the same values.

In [34]:
# One-Hot encode corpus with CountVectorizer: binary=True caps every count at 1
onehot = CountVectorizer(binary=True)
X = onehot.fit_transform(corpus)

print(onehot.get_feature_names())
print(X.toarray())

['amounts', 'are', 'both', 'can', 'classifier', 'countvectorizer', 'data', 'does', 'for', 'good', 'it', 'job', 'large', 'number', 'of', 'on', 'or', 'organise', 'since', 'so', 'text', 'tfidftransformer', 'tfidfvectorizer', 'the', 'there', 'top', 'train', 'ways', 'we', 'what', 'with', 'working', 'you']
[[0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1]
 [1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0]
 [0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0]]


## TF-IDF

Term frequency–inverse document frequency (TF-IDF) is an encoding similar to the frequency vector, but the counts are normalized with respect to the rest of the corpus.  We do this with the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [35]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['amounts', 'are', 'both', 'can', 'classifier', 'countvectorizer', 'data', 'does', 'for', 'good', 'it', 'job', 'large', 'number', 'of', 'on', 'or', 'organise', 'since', 'so', 'text', 'tfidftransformer', 'tfidfvectorizer', 'the', 'there', 'top', 'train', 'ways', 'we', 'what', 'with', 'working', 'you']
[[0.         0.18504333 0.         0.48661948 0.24330974 0.
  0.24330974 0.         0.         0.         0.24330974 0.
  0.         0.24330974 0.18504333 0.18504333 0.         0.24330974
  0.         0.24330974 0.         0.         0.         0.18504333
  0.24330974 0.         0.24330974 0.24330974 0.18504333 0.
  0.         0.         0.24330974]
 [0.25170482 0.38285601 0.25170482 0.         0.         0.19142801
  0.         0.         0.25170482 0.25170482 0.         0.25170482
  0.25170482 0.         0.19142801 0.         0.25170482 0.
  0.25170482 0.         0.25170482 0.         0.19142801 0.19142801
  0.         0.         0.         0.         0.19142801 0.
  0.19142801 0.2517048
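
To see where these values come from, the sketch below recomputes them by hand.  It assumes the scikit-learn defaults (smoothed IDF, raw term counts, L2 row normalization) and a simplified tokenizer; the variable and function names are illustrative only.

```python
import math
import re

text = (
    "There are a number of ways you can organise the data so we can train "
    "a classifier on it. A TfidfVectorizer or a CountVectorizer are both good "
    "for the job since we are working with large amounts of text. "
    "A TfidfVectorizer does what a CountVectorizer does with a "
    "TfidfTransformer on top."
)
corpus = text.split('. ')

# Tokens of two or more word characters, lowercased (assumed defaults).
docs = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary}


def tfidf_row(doc):
    """Raw count times IDF per term, then scale the row to unit L2 length."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(doc) for doc in docs]
print(round(rows[0][vocabulary.index('are')], 8))
print(round(rows[0][vocabulary.index('can')], 8))
```

Because every row is scaled to unit length, the same count can give different values in different documents.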

Because TF-IDF is a count frequency vector with normalization, it can also be done in two steps by first using the CountVectorizer, and then applying the normalization step with the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

Verify that it produces the same output.

In [36]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

transformer = TfidfTransformer()
X = transformer.fit_transform(X)
X.toarray()

array([[0.        , 0.18504333, 0.        , 0.48661948, 0.24330974,
        0.        , 0.24330974, 0.        , 0.        , 0.        ,
        0.24330974, 0.        , 0.        , 0.24330974, 0.18504333,
        0.18504333, 0.        , 0.24330974, 0.        , 0.24330974,
        0.        , 0.        , 0.        , 0.18504333, 0.24330974,
        0.        , 0.24330974, 0.24330974, 0.18504333, 0.        ,
        0.        , 0.        , 0.24330974],
       [0.25170482, 0.38285601, 0.25170482, 0.        , 0.        ,
        0.19142801, 0.        , 0.        , 0.25170482, 0.25170482,
        0.        , 0.25170482, 0.25170482, 0.        , 0.19142801,
        0.        , 0.25170482, 0.        , 0.25170482, 0.        ,
        0.25170482, 0.        , 0.19142801, 0.19142801, 0.        ,
        0.        , 0.        , 0.        , 0.19142801, 0.        ,
        0.19142801, 0.25170482, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.24920411, 0.    

## N-Gram Model

So far we have been using the [Bag-of-Words model](https://en.wikipedia.org/wiki/Bag-of-words_model), where each token is one word.  Although this can work in some cases, it doesn't give much context with respect to the rest of the document.

An N-gram model, however, stores spatial information by including more than one word per token.  See the examples below.

<br>
<center>
  
![Visual representation of unigrams, bigrams, and trigrams](https://www.sqlservercentral.com/wp-content/uploads/legacy/0bf6a2bd621db172dba029ce3c712280a3f6aab3/29444.jpg)
  
</center>
<br>

In [0]:
sequence = [
    "We",
    "like",
    "AI",
    "and",
    "hope",
    "you",
    "like",
    "it",
    "too"
]

bow = {
    "We": 1,
    "AI": 1,
    "and": 1,
    "hope": 1,
    "you": 1,
    "like": 2,
    "it": 1,
    "too": 1
}

bigram = [
    "We like",
    "like AI",
    "AI and",
    "and hope",
    "hope you",
    "you like",
    "like it",
    "it too"
]

trigram = [
    "We like AI",
    "like AI and",
    "AI and hope",
    "and hope you",
    "hope you like",
    "you like it",
    "like it too"
]

ngram = [
    "We like",
    "We like AI",
    "We like AI and",
    "We like AI and hope",
    "We like AI and hope you",
    "We like AI and hope you like",
    "We like AI and hope you like it",
    "We like AI and hope you like it too"
]
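
The bigram and trigram lists above were written out by hand.  As a minimal sketch, the same lists can be generated with a sliding window; the helper name `ngrams` is an assumption made for illustration.

```python
def ngrams(tokens, n):
    """Slide a window of n tokens across the list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sequence = ["We", "like", "AI", "and", "hope", "you", "like", "it", "too"]

print(ngrams(sequence, 2))
print(ngrams(sequence, 3))
```

A sequence of 9 tokens yields 8 bigrams and 7 trigrams, matching the lengths of the lists above.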

# N-Gram Task

Write a function that gets a sequence of tokens from the corpus.  The following steps must be included.

- Tokenize the text in the corpus. (Hint: Use keras Tokenizer)
- For each document in the corpus, get its sequence and append all n-gram sequences to a list from n=2 up to the sequence length.

In [100]:
# Tokenisation: a single Tokenizer shared by the function below
t = Tokenizer()


def get_sequence_of_tokens(corpus):
  # Build the word index from every document in the corpus
  t.fit_on_texts(corpus)
  input_sequence = []

  for line in corpus:
    # Convert one document into its list of word indexes
    token_list = t.texts_to_sequences([line])[0]

    # Append every prefix of length 2 up to the full sequence length
    for ii in range(1, len(token_list)):
      n_gram_sequence = token_list[:ii+1]
      input_sequence.append(n_gram_sequence)

  # Vocabulary size, computed as at the start of the notebook (+1 for index 0)
  num_words = len(t.word_index) + 1
  return input_sequence, num_words


get_sequence_of_tokens(corpus)

([[12, 2],
  [12, 2, 1],
  [12, 2, 1, 13],
  [12, 2, 1, 13, 3],
  [12, 2, 1, 13, 3, 14],
  [12, 2, 1, 13, 3, 14, 15],
  [12, 2, 1, 13, 3, 14, 15, 4],
  [12, 2, 1, 13, 3, 14, 15, 4, 16],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1, 20],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1, 20, 7],
  [12, 2, 1, 13, 3, 14, 15, 4, 16, 5, 17, 18, 6, 4, 19, 1, 20, 7, 21],
  [1, 8],
  [1, 8, 22],
  [1, 8, 22, 1],
  [1, 8, 22, 1, 9],
  [1, 8, 22, 1, 9, 2],
  [1, 8, 22, 1, 9, 2, 23],
  [1, 8, 22, 1, 9, 2, 23, 24],
  [1, 8, 22, 1, 9, 2, 23, 24, 25],
  [1, 8, 22, 1, 9, 2, 23, 24, 25, 5],
  [1, 8, 22, 1, 9, 2, 23, 24, 25, 5, 26],
  [1, 

In [71]:
corpus

['There are a number of ways you can organise the data so we can train a classifier on it',
 'A TfidfVectorizer or a CountVectorizer are both good for the job since we are working with large amounts of text',
 'A TfidfVectorizer does what a CountVectorizer does with a TfidfTransformer on top.']

Now alter the same function to return two values...
1. input_sequences
2. total_words

You will need to calculate the total number of words within the function and return the two values together.

In [94]:
...

input_sequences, total_words = get_sequence_of_tokens(corpus)
total_words

array([50])