# **Tokenizer Using APIs**
Turning text data into numbers. 



1.   `Tokenizer`
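  *   `num_words`: maximum number of words to keep, based on word frequency
  *   `lower=True`: convert the text to lowercase before tokenizing
  *   `oov_token="<OOV>"`: placeholder token used for out-of-vocabulary words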
  *   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

2.   `fit_on_texts(sentences)`
3.   `tokenizer.word_index`
4.   `texts_to_sequences(sentences)`
5.   `pad_sequences`
  *  `padding='post'`: add the padding to the end
  *  `truncating='post'`: remove words from the end when a sequence is longer than `maxlen`
  *  `maxlen=50`: set the maximum length of a sequence (so that all sequences are the same size)


Simply put, tokenizing labels the words in your vocabulary with integers from 1 to 10,000 (if you've got 10,000 different words).
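To make the labelling concrete, here is a minimal pure-Python sketch of how such a word index could be built: lowercase the text, strip the filter characters, count the words, and number them from 1 with the most frequent first. The helper `build_word_index` is hypothetical and is not the Keras implementation; it only assumes frequency-ranked numbering with ties kept in first-seen order.

```python
from collections import Counter

# Same characters as the `filters` default listed above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, lower=True, filters=FILTERS):
    """Hypothetical helper: number words from 1, most frequent first."""
    # Map every filtered character to a space before splitting on whitespace.
    strip_table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        counts.update(text.translate(strip_table).split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['I love my dog', 'i love my cat', 'He love my dog!']
word_index = build_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'he': 6}
```

Index 0 is left unused here, which keeps it free to act as the padding value later on.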




In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'i love my cat',
    'He love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'he': 6}


In [2]:
sentences = [
    'I love my dog',
    'i love my cat',
    'He love my dog!'
]
type(sentences)

list

- `tokenizer.texts_to_sequences()`: converts a list of texts to a list of integer sequences.
    - If a word is missing from the tokenizer's vocabulary and no OOV token is set, it won't be encoded into the sequence.

- Words missing from the vocabulary can be handled by putting a placeholder such as `OOV` (for **Out Of Vocabulary**) in their place.
  - To use an OOV token, pass the `oov_token` parameter to the `Tokenizer` class.
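The difference between skipping unknown words and replacing them with an OOV placeholder can be sketched in a few lines of plain Python. The `encode` helper and both word indexes below are hypothetical and made up for illustration; the only assumption carried over from this notebook is that the OOV token, when used, occupies index 1.

```python
def encode(sentence, word_index, oov_token=None):
    """Hypothetical helper: map each word of a sentence to its index."""
    # Index of the OOV placeholder, or None when no OOV token is used.
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    encoded = []
    for word in sentence.lower().split():
        index = word_index.get(word, oov_index)
        if index is not None:  # unknown word and no OOV token: skip it
            encoded.append(index)
    return encoded

plain_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4}
index_with_oov = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5}

without_oov = encode('I really love my dog', plain_index)
with_oov = encode('I really love my dog', index_with_oov, oov_token='<OOV>')
print(without_oov)  # [3, 1, 2, 4]      'really' is dropped
print(with_oov)     # [4, 1, 2, 3, 5]   'really' becomes 1 (<OOV>)
```

With the placeholder, the encoded sequence keeps the same length as the original sentence, so word positions are preserved.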

- `pad_sequences` is used to make sure all sequences are the same length. That way when passed to a neural network, the matrices it accepts as input are all a uniform size.
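As a rough illustration of what padding and truncating do, the following pure-Python sketch mimics the `padding`, `truncating` and `maxlen` options on plain lists. The `pad` helper is a hypothetical stand-in, not the Keras function (which returns a NumPy array); it assumes the defaults are `'pre'` for both padding and truncating and `0` as the fill value.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical stand-in for pad_sequences that returns nested lists."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' drops items from the front, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return padded

sequences = [[5, 3], [1, 2, 3, 4]]
print(pad(sequences))                                  # [[0, 0, 5, 3], [1, 2, 3, 4]]
print(pad(sequences, padding='post'))                  # [[5, 3, 0, 0], [1, 2, 3, 4]]
print(pad(sequences, maxlen=3, padding='post',
          truncating='post'))                          # [[5, 3, 0], [1, 2, 3]]
```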

In [3]:
# Tokenizing with an out-of-vocabulary token + padded sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
sentences = ["sldkfjdlskdjf lsdkfjdlsk hello lsdkfj lsdklkfj lkslsdkfj ldksf l skdjflk"]
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("Word Index =\n" ,word_index)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post') #add the padding to the end
print("\nSequences =", sequences)
print("\nPadded Sequences:\n",padded)
print("Padded Shape:\n",padded.shape)

Word Index =
 {'<OOV>': 1, 'sldkfjdlskdjf': 2, 'lsdkfjdlsk': 3, 'hello': 4, 'lsdkfj': 5, 'lsdklkfj': 6, 'lkslsdkfj': 7, 'ldksf': 8, 'l': 9, 'skdjflk': 10}

Sequences = [[2, 3, 4, 5, 6, 7, 8, 9, 10]]

Padded Sequences:
 [[ 2  3  4  5  6  7  8  9 10]]
Padded Shape:
 (1, 9)


# **BBC News Example (Stopwords)**

In [4]:
%pip install wget



In [5]:
import wget
wget.download('https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv')

'bbc-text (1).csv'

In [6]:
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", 
             "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", 
             "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", 
             "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", 
             "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", 
             "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", 
             "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", 
             "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", 
             "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", 
             "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", 
             "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", 
             "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", 
             "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where",
             "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", 
             "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

In [7]:
import csv
sentences = []
labels = []
with open("bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row

    for row in reader:
        labels.append(row[0])  # column 0: category label
        sentence = row[1]      # column 1: article text

        # Remove every stopword that is surrounded by spaces,
        # then collapse any double space left behind.
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")
        sentences.append(sentence)

print(len(sentences))
print(sentences[0])
print(type(sentences))

2225
tv future hands viewers home theatre systems plasma high-definition tvs digital video recorders moving living room way people watch tv will radically different five years time. according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes. us leading trend programmes content will delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices. one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes like us s tivo uk s sky+ system allow people record store play pause forward wind tv programmes want. essentially technology allows much personalised tv. also built-in high-definition tv sets big business japan us slower take off europe lack high-definition programming. not can people forward wind adverts can also forget abiding network channel schedules putting together a-la-carte entertainment. us networks cable sa
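One limitation of the space-delimited `replace` approach in In [7] is that it misses a stopword at the very start or end of a text, and the second of two identical stopwords in a row. A possible alternative, sketched below with hypothetical helper names and a short made-up stopword list, is to split the text into tokens and filter them against a set.

```python
def remove_by_replace(sentence, stopwords):
    """The space-delimited replace approach used in the cell above."""
    for word in stopwords:
        sentence = sentence.replace(" " + word + " ", " ")
        sentence = sentence.replace("  ", " ")
    return sentence

def remove_by_filter(sentence, stopwords):
    """Hypothetical alternative: drop stopwords token by token."""
    stopword_set = set(stopwords)  # set membership is a constant-time check
    return " ".join(w for w in sentence.split() if w not in stopword_set)

sample_stopwords = ["the", "a", "of", "in", "is"]  # illustrative subset only
text = "the future of tv is in the the hands of viewers"

replaced = remove_by_replace(text, sample_stopwords)
cleaned = remove_by_filter(text, sample_stopwords)
print(replaced)  # the future tv the hands viewers
print(cleaned)   # future tv hands viewers
```

Neither version strips punctuation, so a stopword glued to a full stop or comma survives in both.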

In [8]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)  # fit on the BBC text before encoding it
word_index = tokenizer.word_index
print(len(word_index))
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

29714
[  96  176 1158 ...    0    0    0]
(2225, 2442)


In [9]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

[  96  176 1158 ...    0    0    0]
(2225, 2442)


In [11]:
tokenizer = Tokenizer()

tokenizer.fit_on_texts(labels)
label_word_index = tokenizer.word_index

label_seq = tokenizer.texts_to_sequences(labels)
#label_seq = pad_sequences(label_seq, padding='post')

print(label_seq)
print(label_word_index)

[[4], [2], [1], [1], [5], [3], [3], [1], [1], [5], [5], [2], [2], [3], [1], [2], [3], [1], [2], [4], [4], [4], [1], [1], [4], [1], [5], [4], [3], [5], [3], [4], [5], [5], [2], [3], [4], [5], [3], [2], [3], [1], [2], [1], [4], [5], [3], [3], [3], [2], [1], [3], [2], [2], [1], [3], [2], [1], [1], [2], [2], [1], [2], [1], [2], [4], [2], [5], [4], [2], [3], [2], [3], [1], [2], [4], [2], [1], [1], [2], [2], [1], [3], [2], [5], [3], [3], [2], [5], [2], [1], [1], [3], [1], [3], [1], [2], [1], [2], [5], [5], [1], [2], [3], [3], [4], [1], [5], [1], [4], [2], [5], [1], [5], [1], [5], [5], [3], [1], [1], [5], [3], [2], [4], [2], [2], [4], [1], [3], [1], [4], [5], [1], [2], [2], [4], [5], [4], [1], [2], [2], [2], [4], [1], [4], [2], [1], [5], [1], [4], [1], [4], [3], [2], [4], [5], [1], [2], [3], [2], [5], [3], [3], [5], [3], [2], [5], [3], [3], [5], [3], [1], [2], [3], [3], [2], [5], [1], [2], [2], [1], [4], [1], [4], [4], [1], [2], [1], [3], [5], [3], [2], [3], [2], [4], [3], [5], [3], [4], [2],
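The `Tokenizer` output above is a nested list whose ids start at 1. If flat, zero-based class ids are preferred (for example, for a loss that expects integer classes in the range 0 to `num_classes - 1`), a plain dictionary is one option. The sketch below is only an illustration: `encode_labels` is a hypothetical helper, the sample labels are made up, and ids are assigned in first-seen order rather than by frequency.

```python
def encode_labels(labels):
    """Hypothetical helper: give each distinct label a zero-based integer id."""
    label_to_id = {}
    for label in labels:
        # A label gets the next free id the first time it is seen.
        label_to_id.setdefault(label, len(label_to_id))
    return [label_to_id[label] for label in labels], label_to_id

sample_labels = ["tech", "business", "sport", "sport", "tech"]  # made-up data
label_ids, label_to_id = encode_labels(sample_labels)
print(label_ids)    # [0, 1, 2, 2, 0]
print(label_to_id)  # {'tech': 0, 'business': 1, 'sport': 2}
```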

- What is the name of the object used to tokenize sentences?
    - Tokenizer
- What is the name of the method used to tokenize a list of sentences?
    - `fit_on_texts(sentences)`
- Once you have the corpus tokenized, what's the method used to encode a list of sentences to use those tokens?
    - `texts_to_sequences(sentences)`
- When initializing the tokenizer, how do you specify a token to use for unknown (out of vocabulary) words?
    - `oov_token=<Token>`
- If you don't use a token for out of vocabulary words, what happens at encoding?
    - The word isn't encoded, and is skipped in the sequence.
- If you have a number of sequences of different lengths, how do you ensure that they are understood when fed into a neural network?
    - Use the `pad_sequences` function from the `tensorflow.keras.preprocessing.sequence` namespace.
- If you have a number of sequences of different lengths, and call `pad_sequences` on them, what's the default result?
    - They'll get padded to the length of the longest sequence by adding zeros to the beginning of shorter ones.
- When padding sequences, if you want the padding to be at the end of the sequence, how do you do it?
    - Pass `padding='post'` to `pad_sequences` when calling it