# STARTER CODE: Tokenizing Text Dataset for Modeling

---

## Setting Up

### Import Libraries

In [1]:
import numpy as np
from tqdm.notebook import tqdm
import tensorflow as tf

### Print Directory Items

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/political-though-work-corpus/political_thought_works_corpus.csv
/kaggle/input/political-though-work-corpus/all-data.csv


### Read in Data

In [3]:
import pandas as pd
data = pd.read_csv('/kaggle/input/political-though-work-corpus/all-data.csv')
# keep only rows whose Text entry is a string (drops missing/NaN entries)
data = data[data['Text'].apply(lambda x: isinstance(x, str))]
data.head(3)

| Unnamed: 0 | Subject | Medium | Link | Text | Author | Title | Date |
|---|---|---|---|---|---|---|---|
| 0 | Philosophy | Book | https://www.gutenberg.org/ebooks/1497 | Produced by Sue Asscher THE REPUBLIC By ... | Plato | The Republic | No Date |
| 1 | Philosophy | Book | https://www.gutenberg.org/ebooks/1998 | Produced by Sue Asscher THUS SPAKE ZARATH... | Friedrich Nietzsche | Thus Spake Zarathustra | No Date |
| 2 | Philosophy | Book | https://www.gutenberg.org/ebooks/4363 | Produced by John Mamoun, Charles Franks and th... | Friedrich Nietzsche | Beyond Good and Evil | No Date |


---

## Vectorize Data

This script collects a list of texts and converts them to a padded, tokenized TensorFlow dataset. The unique-word count runs inside `tf.strings`, while tokenization and padding use the Keras `Tokenizer` and `pad_sequences` utilities. On this corpus the whole cell finishes in about two-thirds of a minute (37.6 seconds in the run shown below).
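As a rough picture of the three steps (build a word index, map words to ids, pad to a common length), here is a minimal pure-Python sketch on a toy corpus. It is a simplified stand-in, not the Keras implementation: the helper names (`fit_word_index`, `texts_to_ids`, `pad_post`), the two example sentences, and the regex used to split words are all assumptions made for illustration. It does follow the same conventions as the script, with id 0 reserved for padding and the out-of-vocabulary token placed first in the index.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # assumed, simplified word pattern


def fit_word_index(texts, oov_token="<UNK>"):
    """Assign ids by descending frequency; id 1 is the OOV token, 0 is left for padding."""
    counts = Counter(w for t in texts for w in WORD_RE.findall(t.lower()))
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def texts_to_ids(texts, word_index, oov_token="<UNK>"):
    """Replace each word with its id, falling back to the OOV id for unseen words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in WORD_RE.findall(t.lower())] for t in texts]


def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    maxlen = max(len(s) for s in sequences)
    return [s + [pad_id] * (maxlen - len(s)) for s in sequences]


texts = ["The state is just.", "Justice is the interest of the stronger."]
word_index = fit_word_index(texts)
padded = pad_post(texts_to_ids(texts, word_index))
print(padded)
```

The shorter sentence is filled with trailing zeros so both rows have the same length, which is what lets the result be stored as a single rectangular tensor.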

In [4]:
import time
start = time.time()

'''
====================================================================================
START OF RELEVANT TOKENIZATION SCRIPT
====================================================================================
'''

# import the necessary functions
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# collect training data
train_data = data['Text'].tolist()

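# quickly count the number of unique whitespace-separated tokens
# (this count is case- and punctuation-sensitive, unlike the Tokenizer's default vocabulary)
# join with a space so the last word of one text is not fused with the first word of the next
complete_text = tf.strings.join([tf.constant(text) for text in data['Text']], separator=' ')
y, idx, count = tf.unique_with_counts(tf.strings.split(complete_text))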

# set important parameters
num_words = y.shape[0]
oov_token = '<UNK>'
pad_type = 'post'
trunc_type = 'post'

# define and fit tokenizer
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(train_data)
train_sequences = tokenizer.texts_to_sequences(train_data)

# pad every sequence to the length of the longest text
maxlen = max(len(x) for x in train_sequences)
train_padded = pad_sequences(train_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)
train_padded = tf.constant(train_padded)

# create tensorflow dataset
# (note: this rebinds `data`, so the DataFrame is no longer available under that name)
data = tf.data.Dataset.from_tensor_slices(train_padded)

'''
====================================================================================
END OF RELEVANT TOKENIZATION SCRIPT
====================================================================================
'''

end = time.time()
print(f'Took {round(end-start,3)} seconds.')

Took 37.622 seconds.
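Because every text is padded to the length of the longest one, the dense array holds `len(train_sequences) * maxlen` ids regardless of how short most texts are. The sketch below estimates that footprint from a list of sequence lengths. The helper names and the example lengths are hypothetical, and the 4 bytes per id assumes 32-bit integer ids; adjust it if your padded array uses another dtype.

```python
def padded_size_bytes(seq_lengths, bytes_per_id=4):
    """Bytes needed for a dense (n_texts, maxlen) matrix of fixed-width ids."""
    return len(seq_lengths) * max(seq_lengths) * bytes_per_id


def padding_fraction(seq_lengths):
    """Share of cells in the padded matrix that hold padding rather than real tokens."""
    total_cells = len(seq_lengths) * max(seq_lengths)
    return 1 - sum(seq_lengths) / total_cells


# Hypothetical token counts for three texts of very different lengths.
lengths = [120_000, 8_000, 2_000]
print(padded_size_bytes(lengths), "bytes")
print(f"{padding_fraction(lengths):.1%} of the matrix is padding")
```

With the real data, `[len(x) for x in train_sequences]` can be passed in directly. If the padding share is large, truncating to a smaller `maxlen` or splitting long texts into chunks are options worth weighing.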


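To 'detokenize' a padded sequence back into text, pass it through `tokenizer.sequences_to_texts`. The result is lowercased and stripped of punctuation, as the output below shows, so the round trip does not reproduce the original text exactly.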
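The following pure-Python sketch illustrates why decoding does not give back the original string: once words are lowercased and punctuation is dropped during encoding, only the word ids remain. The `encode` and `decode` helpers, the tiny vocabulary, and the word-splitting regex are assumptions for illustration, not the Keras internals. In particular, this `decode` skips pad ids by design; how the real tokenizer renders padded positions is worth checking on your own data.

```python
import re


def encode(text, word_index, oov_id=1):
    """Lowercase, split into words (assumed simple pattern), and map to ids."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index.get(w, oov_id) for w in words]


def decode(ids, index_word, pad_id=0):
    """Map ids back to words, skipping padding positions."""
    return " ".join(index_word[i] for i in ids if i != pad_id)


# Tiny hypothetical vocabulary; id 0 is reserved for padding.
word_index = {"<UNK>": 1, "the": 2, "republic": 3, "by": 4, "plato": 5}
index_word = {i: w for w, i in word_index.items()}

original = "THE REPUBLIC, by Plato."
ids = encode(original, word_index) + [0, 0]  # two trailing pad ids
decoded = decode(ids, index_word)
print(ids)
print(decoded)
```

Case and punctuation are gone in `decoded`, and any word missing from the vocabulary would come back as `<UNK>`.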

In [5]:
decoded_string = tokenizer.sequences_to_texts(train_padded.numpy()[0:1])[0]
decoded_string[:1000]

'produced by sue asscher the republic by plato translated by benjamin jowett note the republic by plato jowett etext 150 introduction and analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the state are more clearly drawn out in the laws as works of art the symposium and the protagoras are of higher excellence but no other dialogue of plato has the same largeness of view and the same perfection of style no other shows an equal knowledge of the world or contains more of those thoughts which are new as well as old and not of one age only but of all nowhere in plato is there a deeper irony or a greater wealth of humour or imagery or more dramatic power nor in any other of his writings is the attempt made to interweave life and speculation or to connect politics