# Natural Language Processing

One element common to all neural networks that process text is an embedding layer, which uses word embeddings to transform arrays or sequences of scalar values (integer indices) representing words into arrays of floating-point numbers called word vectors. These vectors encode information about the meanings of words and the relationships between them. The output of an embedding layer can be fed directly to a classification layer, or it can be passed to other types of neural network layers that tease more meaning from it before further processing.
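
Conceptually, an embedding layer behaves like a lookup table with one row of floats per word index. The following is a minimal plain-Python sketch of that idea, not the Keras implementation: the table size, the vector length, and the random values (standing in for weights a real layer would learn during training) are all illustrative assumptions.

```python
import random

random.seed(0)

vocab_size = 6     # hypothetical number of distinct word indices
embedding_dim = 4  # hypothetical length of each word vector

# One row of floats per word index; random values stand in for learned weights
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Map a sequence of integer word indices to a list of word vectors."""
    return [embedding_table[index] for index in sequence]

# A four-word sequence becomes four vectors of embedding_dim floats each
word_vectors = embed([1, 4, 2, 5])
```

Each integer simply selects a row, so two occurrences of the same word always map to the same vector.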

## Text Preparation

The CountVectorizer class from Scikit-Learn transforms rows of text into rows of word counts. In the process, it:

1. removes punctuation (with the default token pattern, numeric tokens of two or more characters are kept)
2. converts all characters to lowercase
3. (optional) removes stop words
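
The steps above can be sketched with the standard library alone. This is an illustrative approximation rather than Scikit-Learn's code: the function name `count_words` and the sample rows are hypothetical, and the regular expression mirrors the default token pattern documented for CountVectorizer (tokens of two or more word characters).

```python
import re
from collections import Counter

# Tokens of two or more word characters; punctuation acts as a separator
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def count_words(rows, stop_words=None):
    """Return (vocabulary, table), where table[i][j] counts vocabulary[j] in rows[i]."""
    stop_words = set(stop_words or [])
    counts_per_row = []
    for row in rows:
        tokens = TOKEN_PATTERN.findall(row.lower())  # lowercase, then tokenize
        counts_per_row.append(Counter(t for t in tokens if t not in stop_words))
    # Columns are the distinct words across all rows, sorted alphabetically
    vocabulary = sorted(set().union(*counts_per_row))
    table = [[counts[word] for word in vocabulary] for counts in counts_per_row]
    return vocabulary, table

vocabulary, table = count_words(
    ['The quick brown fox.', 'The lazy dog; the quick dog!'],
    stop_words={'the'},
)
# vocabulary → ['brown', 'dog', 'fox', 'lazy', 'quick']
# table      → [[1, 0, 1, 0, 1], [0, 2, 0, 1, 1]]
```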

For deep learning, vectorization is performed differently. Instead of creating a table of word counts, it creates a table of sequences containing tokens (integer indices) that represent individual words. Keras provides the Tokenizer class, which serves as the deep-learning counterpart of CountVectorizer.

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

lines = [
    'The quick brown fox',
    'Jumps over $$$ the lazy brown dog',
    'Who jumps high into the blue sky after counting 123',
    'And quickly returns to earth'
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [3]:
sequences

[[1, 4, 2, 5],
 [3, 6, 1, 7, 2, 8],
 [9, 3, 10, 11, 1, 12, 13, 14, 15, 16],
 [17, 18, 19, 20, 21]]
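
The indices in these sequences follow word frequency: the most common word ("the") gets index 1, and ties are broken by order of first appearance. The sketch below reproduces that numbering in plain Python. It is an approximation of the Tokenizer's default behavior, not its source code, and the helper names `split_words` and `build_word_index` are hypothetical.

```python
from collections import Counter

# Characters the Keras Tokenizer filters out by default (punctuation, tab, newline)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    # Replace filtered characters with spaces, lowercase, and split on whitespace
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Stable sort: most frequent first, ties keep first-appearance order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}

sample_lines = [
    'The quick brown fox',
    'Jumps over $$$ the lazy brown dog',
    'Who jumps high into the blue sky after counting 123',
    'And quickly returns to earth',
]

word_index = build_word_index(sample_lines)
sketch_sequences = [[word_index[w] for w in split_words(line)] for line in sample_lines]
# sketch_sequences[0] → [1, 4, 2, 5]
```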

In [4]:
words = tokenizer.sequences_to_texts(sequences)

In [5]:
words

['the quick brown fox',
 'jumps over the lazy brown dog',
 'who jumps high into the blue sky after counting 123',
 'and quickly returns to earth']

Words are converted to lowercase and symbols are removed, but stop words and numbers are still there. One way to remove them is to filter each line with NLTK before fitting the tokenizer:

In [6]:
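# word_tokenize and stopwords need NLTK data: nltk.download('punkt') and
# nltk.download('stopwords') (newer NLTK releases may ask for 'punkt_tab')
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

lines = [
    'The quick brown fox',
    'Jumps over $$$ the lazy brown dog',
    'Who jumps high into the blue sky after counting 123',
    'And quickly returns to earth'
]

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # Lowercase the text and split it into tokens
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not stop words (drops symbols and numbers)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

lines = list(map(remove_stop_words, lines))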

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
tokenizer.texts_to_sequences(lines)

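This cell fails when the NLTK package is not installed (it can be installed with pip install nltk):

ModuleNotFoundError: No module named 'nltk'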
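
Where NLTK is unavailable, the same filtering idea can be sketched with the standard library alone. This is a rough stand-in, not a replacement for NLTK: the stop-word set below is a hypothetical, deliberately tiny list chosen for these four sentences, and the function name `remove_stop_words_simple` is illustrative.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer
STOP_WORDS = {'a', 'after', 'an', 'and', 'in', 'into', 'of', 'over', 'the', 'to', 'who'}

def remove_stop_words_simple(text):
    # Keep runs of ASCII letters only, which drops symbols such as '$$$' and numbers such as '123'
    words = re.findall(r'[a-z]+', text.lower())
    return ' '.join(word for word in words if word not in STOP_WORDS)

sample = [
    'The quick brown fox',
    'Jumps over $$$ the lazy brown dog',
    'Who jumps high into the blue sky after counting 123',
    'And quickly returns to earth',
]

cleaned = [remove_stop_words_simple(line) for line in sample]
# cleaned[0] → 'quick brown fox'
```

The cleaned lines can then be passed to the Tokenizer's fit_on_texts and texts_to_sequences methods exactly as in the cell above.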