
## Importing dataset

In [1]:
# get_file downloads the file once, caches it locally, and returns the cached path
from tensorflow.keras.utils import get_file
path = get_file('shakespeare.txt',
                origin='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [2]:
with open(path, encoding='utf-8') as f:
  text = f.read().lower()
print('Length of Text: {} characters'.format(len(text)))

Length of Text: 1115394 characters


## Creating a list of lines from the data

In [3]:
print(text[:500])

first citizen:
before we proceed any further, hear me speak.

all:
speak, speak.

first citizen:
you are all resolved rather to die than to famish?

all:
resolved. resolved.

first citizen:
first, you know caius marcius is chief enemy to the people.

all:
we know't, we know't.

first citizen:
let us kill him, and we'll have corn at our own price.
is't a verdict?

all:
no more talking on't; let it be done: away, away!

second citizen:
one word, good citizens.

first citizen:
we are accounted poor


In [4]:
# Take only the first 5000 characters to build the model
data = text[:5000]

In [5]:
print(data)
# This is the data used to build the model.

first citizen:
before we proceed any further, hear me speak.

all:
speak, speak.

first citizen:
you are all resolved rather to die than to famish?

all:
resolved. resolved.

first citizen:
first, you know caius marcius is chief enemy to the people.

all:
we know't, we know't.

first citizen:
let us kill him, and we'll have corn at our own price.
is't a verdict?

all:
no more talking on't; let it be done: away, away!

second citizen:
one word, good citizens.

first citizen:
we are accounted poor citizens, the patricians good.
what authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them let us revenge this with
our pikes, ere we become rakes: for the gods know i
speak this in hunger for bread, not in thirst for revenge.



In [6]:
# Understanding the data
data

"first citizen:\nbefore we proceed any further, hear me speak.\n\nall:\nspeak, speak.\n\nfirst citizen:\nyou are all resolved rather to die than to famish?\n\nall:\nresolved. resolved.\n\nfirst citizen:\nfirst, you know caius marcius is chief enemy to the people.\n\nall:\nwe know't, we know't.\n\nfirst citizen:\nlet us kill him, and we'll have corn at our own price.\nis't a verdict?\n\nall:\nno more talking on't; let it be done: away, away!\n\nsecond citizen:\none word, good citizens.\n\nfirst citizen:\nwe are accounted poor citizens, the patricians good.\nwhat authority surfeits on would relieve us: if they\nwould yield us but the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them let us revenge this with\nour pikes, ere we become rakes: for the gods know i\nspeak this in hunger 

In [7]:
# All lines are separated by '\n'.
# Split the data into lines to create the corpus.
corpus = data.split('\n')

In [8]:
corpus[:10]

['first citizen:',
 'before we proceed any further, hear me speak.',
 '',
 'all:',
 'speak, speak.',
 '',
 'first citizen:',
 'you are all resolved rather to die than to famish?',
 '',
 'all:']

The text still contains many punctuation marks and blank entries. We therefore use the Keras Tokenizer to strip the unnecessary punctuation marks and build a dictionary of the words in the dataset, ordered by frequency.

[Tokenizer](https://keras.io/api/preprocessing/text/#tokenizer-class):
This class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, based on tf-idf, and so on.
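
To make the word-to-index idea concrete, here is a minimal pure-Python sketch of that mapping. It is a simplified stand-in, not the Keras implementation: the names `FILTERS`, `split_words`, `build_word_index`, and `to_sequence` are assumptions introduced for illustration, index 0 is left free for padding, and index 1 is reserved for the out-of-vocabulary token.

```python
from collections import Counter

# Punctuation and whitespace characters replaced by spaces before splitting.
# Assumption: chosen to resemble the Keras default filter list (apostrophes are kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans(FILTERS, ' ' * len(FILTERS))


def split_words(line):
    """Lowercase a line, replace filtered characters with spaces, and split into words."""
    return line.lower().translate(_TABLE).split()


def build_word_index(lines, oov_token='<OOV>'):
    """Map each word to an integer index, most frequent words first."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    word_index = {oov_token: 1}  # index 0 stays unused so it can mean "padding"
    for index, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = index
    return word_index


def to_sequence(line, word_index, oov_token='<OOV>'):
    """Convert a line to word indices; unknown words get the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in split_words(line)]


sample_lines = ['all:', 'speak, speak.', 'all:', 'all:']
sample_index = build_word_index(sample_lines)
print(sample_index)
print(to_sequence('speak, citizen!', sample_index))
```

Because `citizen` never occurs in `sample_lines`, it is mapped to the out-of-vocabulary index rather than being dropped, which keeps the sequence length equal to the number of words in the line.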

## Using the Tokenizer class from tf.keras.preprocessing.text

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer
# num_words=500 keeps only the 499 (num_words - 1) most common words when texts are converted to sequences.
# Words outside that vocabulary are mapped to the '<OOV>' token by texts_to_sequences.
tk = Tokenizer(num_words=500, oov_token='<OOV>')
tk.fit_on_texts(corpus)

In [14]:
# Number of unique tokens in the vocabulary (this count includes the '<OOV>' token):
print('Total Number of words in the corpus:',len(tk.word_index))

Total Number of words in the corpus: 386


In [15]:
# Add 1 because index 0 is reserved for padding and does not appear in word_index
total_words = len(tk.word_index) + 1
print('Number of words after considering the padding:',total_words)

Number of words after considering the padding: 387
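
The extra slot can be checked with a little arithmetic. The sketch below uses a made-up four-entry vocabulary (`toy_word_index` is hypothetical, laid out like `tk.word_index` with indices starting at 1) to show that every id in a padded sequence, including the padding value 0, falls inside `range(vocab_size + 1)`.

```python
# Toy vocabulary in the same layout as tk.word_index: indices start at 1.
toy_word_index = {'<OOV>': 1, 'citizen': 2, 'first': 3, 'speak': 4}

vocab_size = len(toy_word_index)   # number of distinct tokens
total_ids = vocab_size + 1         # one extra slot for the padding index 0

# Every id that can appear in a padded sequence falls in range(total_ids).
padded_example = [0, 0, 3, 2]
all_in_range = all(0 <= token_id < total_ids for token_id in padded_example)
print(vocab_size, total_ids, all_in_range)
```

Anything indexed by word id, such as a layer with one slot per id, typically needs `total_ids` entries rather than `vocab_size` for this reason.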


In [21]:
for line in corpus:
  tokens_list = tk.texts_to_sequences([line])[0]
  print(line)
  print(tokens_list)
  break

first citizen:
[6, 4]


In [22]:
sequences = []
for line in corpus:
  tokens_list = tk.texts_to_sequences([line])[0]
  # Emit every prefix of the line that has at least two tokens:
  # tokens_list[:2], tokens_list[:3], ..., up to the full line.
  # The last token of each prefix later serves as the label for the tokens before it.
  for i in range(1, len(tokens_list)):
    sequences.append(tokens_list[:i+1])
print(sequences)

[[6, 4], [133, 10], [133, 10, 71], [133, 10, 71, 72], [133, 10, 71, 72, 134], [133, 10, 71, 72, 134, 49], [133, 10, 71, 72, 134, 49, 135], [133, 10, 71, 72, 134, 49, 135, 23], [23, 23], [6, 4], [3, 24], [3, 24, 12], [3, 24, 12, 50], [3, 24, 12, 50, 136], [3, 24, 12, 50, 136, 5], [3, 24, 12, 50, 136, 5, 137], [3, 24, 12, 50, 136, 5, 137, 73], [3, 24, 12, 50, 136, 5, 137, 73, 5], [3, 24, 12, 50, 136, 5, 137, 73, 5, 74], [50, 50], [6, 4], [6, 3], [6, 3, 51], [6, 3, 51, 75], [6, 3, 51, 75, 76], [6, 3, 51, 75, 76, 25], [6, 3, 51, 75, 76, 25, 138], [6, 3, 51, 75, 76, 25, 138, 139], [6, 3, 51, 75, 76, 25, 138, 139, 5], [6, 3, 51, 75, 76, 25, 138, 139, 5, 2], [6, 3, 51, 75, 76, 25, 138, 139, 5, 2, 77], [10, 78], [10, 78, 10], [10, 78, 10, 78], [6, 4], [52, 11], [52, 11, 140], [52, 11, 140, 40], [52, 11, 140, 40, 7], [52, 11, 140, 40, 7, 79], [52, 11, 140, 40, 7, 79, 31], [52, 11, 140, 40, 7, 79, 31, 141], [52, 11, 140, 40, 7, 79, 31, 141, 80], [52, 11, 140, 40, 7, 79, 31, 141, 80, 26], [52, 11
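
The prefix expansion is easier to follow on a single line. This sketch reuses the token list for "before we proceed any further, hear me speak." as it appears in the output above; the helper name `prefix_sequences` is an assumption introduced for illustration.

```python
def prefix_sequences(tokens):
    """Return every prefix of `tokens` that has at least two elements."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]


# Token ids of the second corpus line, copied from the printed sequences.
line_tokens = [133, 10, 71, 72, 134, 49, 135, 23]

for seq in prefix_sequences(line_tokens):
    # Everything before the last token is the context; the last token is the target.
    print(seq[:-1], '->', seq[-1])
```

A line with n tokens yields n - 1 training sequences, and a line with fewer than two tokens (including an empty line) yields none.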

In [23]:
max_sequence_length = max(len(x) for x in sequences)
print('Maximum length of sequences in the corpus:',max_sequence_length)

Maximum length of sequences in the corpus: 12


## Padding the generated sequences to a uniform length

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Left-pad every sequence with zeros up to the length of the longest one
padded_sequences = pad_sequences(sequences, padding='pre')

In [28]:
print(padded_sequences[:5])

[[  0   0   0   0   0   0   0   0   0   0   6   4]
 [  0   0   0   0   0   0   0   0   0   0 133  10]
 [  0   0   0   0   0   0   0   0   0 133  10  71]
 [  0   0   0   0   0   0   0   0 133  10  71  72]
 [  0   0   0   0   0   0   0 133  10  71  72 134]]
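
For intuition, left-padding can be written in a few lines of plain Python. This is a simplified sketch, not the library routine: `pad_pre` is a hypothetical helper that returns nested lists rather than a NumPy array, and it assumes a non-empty list of sequences and a positive `maxlen`.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` so all have the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # if too long, keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


for row in pad_pre([[6, 4], [133, 10], [133, 10, 71]], maxlen=5):
    print(row)
```

Padding on the left keeps the most recent words, and the final word in particular, aligned in the rightmost columns of every row.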


In [29]:
padded_sequences.shape

(748, 12)

The last column of each padded sequence is the label (the next word); the remaining columns are the inputs used to predict it.

## Separating X and y from the sequences for training

In [30]:
y = padded_sequences[:, -1]   # label: the last value in each sequence
X = padded_sequences[:, :-1]  # inputs: all but the last value in each sequence
i = 2
print(padded_sequences[i], X[i], y[i])

[  0   0   0   0   0   0   0   0   0 133  10  71] [  0   0   0   0   0   0   0   0   0 133  10] 71
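
The same split can be sketched without NumPy to show how each row pairs a context with its next word. The rows are the first three padded sequences from the output shown earlier; `split_inputs_and_labels` is a hypothetical helper name.

```python
def split_inputs_and_labels(rows):
    """Split each row into (all tokens but the last, last token)."""
    inputs = [row[:-1] for row in rows]
    labels = [row[-1] for row in rows]
    return inputs, labels


# First three padded rows, copied from the printed padded_sequences.
rows = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 4],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 133, 10],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 133, 10, 71],
]

inputs, labels = split_inputs_and_labels(rows)
for context, next_word in zip(inputs, labels):
    print(context, '->', next_word)
```

Each input row has 11 columns (one fewer than the padded length of 12), and each label is a single word id.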
