Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.


**Vectorizing text** is the process of transforming text into numeric tensors. It can be done in several ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. *N-grams* are overlapping groups of multiple consecutive words or characters.
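
To make the n-gram definition concrete, here is a minimal sketch in plain Python. The helper name `extract_ngrams` is hypothetical (it is not part of the original notes or of any library), and whitespace splitting is assumed as the tokenizer.

```python
def extract_ngrams(tokens, n):
    """Return all overlapping n-grams (as tuples) from a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat sat on the mat."
words = sentence.split()  # assumed tokenizer: split on whitespace

word_bigrams = extract_ngrams(words, 2)             # groups of 2 consecutive words
char_trigrams = extract_ngrams(list("cat sat"), 3)  # groups of 3 consecutive characters

print(word_bigrams)
print(char_trigrams)
```

Each n-gram could then be assigned an index and one-hot encoded in the same way as single words or characters.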


### 1.1 One-hot encoding of words and characters

In [2]:
# Word-level one-hot encoding
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']  # each sample could be a sentence or an entire document

# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    for word in sample.split():  # tokenize by whitespace; punctuation and case are kept as-is
        if word not in token_index:
            token_index[word] = len(token_index) + 1  # index 0 is left unassigned

max_length = 10  # only the first max_length words of each sample are encoded
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.


In [3]:
# Character-level one-hot encoding

import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # all printable ASCII characters
# Map each character to a unique integer index, starting at 1
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50  # only the first max_length characters of each sample are encoded
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
        


In [8]:
# Using Keras for word-level one-hot encoding

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Create a tokenizer configured to take into account only the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)  # Build the word index

sequences = tokenizer.texts_to_sequences(samples)  # Turn strings into lists of integer indices
# Directly get one-hot binary representations.
# This tokenizer also supports vectorization modes other than one-hot encoding.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index # recover the word index that was computed
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


In [9]:
# Word-level one-hot encoding with the hashing trick
# Useful when the number of unique tokens in the vocabulary is too large to handle explicitly
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000  # size of the hash space; collisions become likely if the vocabulary approaches this size
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = abs(hash(word)) % dimensionality  # hash the word into an index between 0 and 999
        results[i, j, index] = 1.
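
One caveat with the cell above: in Python 3, the built-in `hash()` of a string is salted per process by default, so the resulting indices can differ from one run to the next. The sketch below shows one possible way to get reproducible indices, using `hashlib.md5` from the standard library. The helper name `stable_hash_index` is hypothetical and the choice of MD5 is an assumption made for illustration, not something taken from the original notes.

```python
import hashlib

def stable_hash_index(word, dimensionality):
    """Map a word to an index in [0, dimensionality) that is stable across runs."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()  # hex string of the MD5 digest
    return int(digest, 16) % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000

# One list of hashed indices per sample; identical words always map to the same index
hashed_indices = [[stable_hash_index(word, dimensionality) for word in sample.split()]
                  for sample in samples]
print(hashed_indices)
```

These indices can replace `abs(hash(word)) % dimensionality` when filling the one-hot tensor. Collisions between different words remain possible, as with any hashing scheme.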

### 1.2 Using word embeddings

One-hot encoding produces sparse, very high-dimensional vectors (one dimension per token in the vocabulary), and the encoding is hardcoded. Word embeddings, by contrast, are dense, low-dimensional vectors learned from data, so they pack more information into far fewer dimensions.

Two ways to obtain word embeddings:
- Learn word embeddings jointly with the main task: start with random word vectors, then learn them in the same way as the weights of a neural network.
- Use *pretrained word embeddings*, which were precomputed on a different machine-learning task and are loaded into the model.



In [1]:
# Instantiating an Embedding layer

from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
# Two arguments: the number of possible tokens (1000) and the dimensionality of the embeddings (64)

Using TensorFlow backend.


Mechanism of *Embedding layer*: 
Word index (integer) -> Embedding layer -> Corresponding word vector
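
The lookup described above can be illustrated with plain NumPy indexing. This is only a sketch of the idea, not Keras's actual implementation: the random matrix stands in for the layer's learned weights, and the initialization range and the example indices are assumptions chosen for illustration.

```python
import numpy as np

vocab_size = 1000    # number of possible tokens, as in Embedding(1000, 64)
embedding_dim = 64   # dimensionality of each word vector

rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix; in a real model these values are learned
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 2 sequences of integer word indices (the second is padded with 0)
word_indices = np.array([[1, 2, 3, 4, 1, 5],
                         [1, 6, 7, 8, 9, 0]])

# Each integer selects the corresponding row of the matrix
word_vectors = embedding_matrix[word_indices]
print(word_vectors.shape)  # (2, 6, 64)
```

The input has shape `(samples, sequence_length)` and the output has shape `(samples, sequence_length, embedding_dim)`. The same word index always yields the same vector.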