# Assignment 1:  Deep N-grams

Welcome to the first graded assignment of Course 3. In this assignment you will explore Recurrent Neural Networks (RNNs).

In this notebook you will apply the following steps:
- Convert a line of text into a tensor
- Create a TensorFlow dataset
- Define a GRU model using `TensorFlow`
- Train the model using `TensorFlow`
- Evaluate your model using perplexity
- Generate text using your own model

Before getting started take some time to read the following tips: 

#### TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:

- All cells are frozen except for the ones where you need to submit your solutions.

- You can add new cells to experiment, but the grader will ignore them. Don't rely on newly created cells to host your solution code; use the provided places instead.

- You can add the comment `# grade-up-to-here` in any graded cell to signal to the grader that it must only evaluate up to that point. This is helpful if you want to check whether you are on the right track before finishing the whole assignment. Remember to delete the comment afterwards!

- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.

## Table of Contents

- [Overview](#0)
- [1 - Data Preprocessing Overview](#1)
    - [1.1 - Loading in the Data](#1-1)
    - [1.2 - Create the vocabulary](#1-2)
    - [1.3 - Convert a Line to Tensor](#1-3)
        - [Exercise 1 - line_to_tensor](#ex-1)
    - [1.4 - Prepare your data for training and testing](#1-4)
    - [1.5 - TensorFlow dataset](#1-5)
    - [1.6 - Create the input and the output for your model](#1-6)
        - [Exercise 2 - data_generator](#ex-2)
    - [1.7 - Create the training dataset](#1-7)        
- [2 - Defining the GRU Language Model (GRULM)](#2)
    - [Exercise 3 - GRULM](#ex-3)
- [3 - Training](#3)
    - [Exercise 4 - train_model](#ex-4)
- [4 - Evaluation](#4)
    - [4.1 - Evaluating using the Deep Nets](#4-1)
    - [Exercise 5 - log_perplexity](#ex-5)
- [5 - Generating Language with your Own Model](#5)
    - [Optional Exercise 6 - GenerativeModel (Not graded)](#ex-6)
- [On statistical methods](#6)

<a name='0'></a>
## Overview

In this lab, you'll delve into the world of text generation using Recurrent Neural Networks (RNNs). Your primary objective is to predict the next set of characters based on the preceding ones. This seemingly straightforward task holds immense practicality in applications like predictive text and creative writing.

The journey unfolds as follows:

- Data Preprocessing: You'll start by converting lines of text into numerical tensors, making them machine-readable.

- Dataset Creation: Next, you'll create a TensorFlow dataset, which will serve as the backbone for supplying data to your model.

- Neural Network Training: Your model will be trained to predict the next set of characters, specifying the desired output length.

- Character Embeddings: Character embeddings will be employed to represent each character as a vector, a fundamental technique in natural language processing.

- GRU Model: Your model utilizes a Gated Recurrent Unit (GRU) to process character embeddings and make sequential predictions. The following figure gives you a summary of what you are about to implement. 

<img src = "images/model.png" style="width:600px;height:150px;"/>

- Prediction Process: The model's predictions are achieved through a linear layer and log-softmax computation.

This overview sets the stage for your exploration of text generation. Get ready to unravel the secrets of language and embark on a journey into the realm of creative writing and predictive text generation.

And as usual let's start by importing all the required libraries.

In [1]:
import os
import traceback
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import random as  rnd

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers import Input

from termcolor import colored

# set random seed
rnd.seed(32)

In [2]:
import w1_unittest

<a name='1'></a>
## 1 - Data Preprocessing Overview

<img src = "images/shakespeare.png" style="width:250px;height:250px;"/>

In this section, you will prepare the data for training your model. The data preparation involves the following steps:

- Dataset Import: Begin by importing the dataset. Each sentence is stored as one line in the dataset. To ensure consistency, remove leading and trailing whitespace from each line using the `strip` method.

- Data Storage: Store each cleaned line in a list. This list will serve as the foundational dataset for your text generation task.

- Character-Level Processing: Since the goal is character generation, it's essential to process the text at the character level, not the word level. This involves converting each individual character into a numerical representation. To achieve this:

  - Use the [`tf.strings.unicode_split`](https://www.tensorflow.org/api_docs/python/tf/strings/unicode_split) function to split each sentence into its constituent characters.
  - Utilize [`tf.keras.layers.StringLookup`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) to map these characters to integer values. This transformation lays the foundation for character-based modeling.

- TensorFlow Dataset Creation: Create a TensorFlow dataset capable of producing data in batches. Each batch will consist of `batch_size` sentences, with each sentence containing a maximum of `max_length` characters. This organized dataset is essential for training your character generation model.

These preprocessing steps ensure that your dataset is meticulously prepared for the character-based text generation task, allowing you to work seamlessly with the Shakespearean corpus data.

<a name='1-1'></a>
### 1.1 - Loading in the Data

In [None]:
dirname = 'data/'
filename = 'shakespeare_data.txt'
lines = [] # storing all the lines in a variable. 

counter = 0

with open(os.path.join(dirname, filename)) as files:
    for line in files:        
        # remove leading and trailing whitespace
        pure_line = line.strip()#.lower()

        # if pure_line is not the empty string,
        if pure_line:
            # append it to the list
            lines.append(pure_line)
            
n_lines = len(lines)
print(f"Number of lines: {n_lines}")

In [None]:
# Test with The Little Prince
dirname = 'data/'
filename = 'test_euc_kr.txt'
lines = [] # storing all the lines in a variable. 

counter = 0

with open(os.path.join(dirname, filename), encoding='cp949') as files:
    for line in files:        
        # remove leading and trailing whitespace
        pure_line = line.strip()#.lower()

        # if pure_line is not the empty string,
        if pure_line:
            # append it to the list
            lines.append(pure_line)
            
n_lines = len(lines)
print(f"Number of lines: {n_lines}")

In [None]:
import os

# Specify the source file path and the path of the new output file
dirname = 'data/'
filename = 'test_euc_kr.txt'
output_filename = 'test_utf_8.txt'

# Read the file as CP949
with open(os.path.join(dirname, filename), 'r', encoding='cp949') as infile:
    content = infile.read()

# Write the file as UTF-8
with open(os.path.join(dirname, output_filename), 'w', encoding='utf-8') as outfile:
    outfile.write(content)

print(f"Converted {filename} to {output_filename}.")
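To see why the file has to be read with the right codec before it can be rewritten as UTF-8, the following minimal sketch (plain Python, independent of the notebook's data files) encodes the same Korean string in both encodings. The sample string is only an illustration.

```python
# The same text has different byte representations under CP949 and UTF-8.
sample = "안녕하세요"

cp949_bytes = sample.encode("cp949")   # 2 bytes per Hangul syllable
utf8_bytes = sample.encode("utf-8")    # 3 bytes per Hangul syllable

print(len(cp949_bytes), len(utf8_bytes))

# Decoding with the codec that produced the bytes recovers the text.
assert cp949_bytes.decode("cp949") == sample
assert utf8_bytes.decode("utf-8") == sample

# Decoding CP949 bytes as UTF-8 fails, which is why the conversion step
# reads with encoding='cp949' and writes with encoding='utf-8'.
try:
    cp949_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(wrong_codec_failed)
```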


In [3]:
# Test with The Little Prince and Great Expectations
dirname = 'data/'
filename = 'test_utf_8.txt'
lines = [] # storing all the lines in a variable. 

counter = 0

with open(os.path.join(dirname, filename), encoding='utf-8') as files:
    for line in files:        
        # remove leading and trailing whitespace
        pure_line = line.strip()#.lower()

        # if pure_line is not the empty string,
        if pure_line:
            # append it to the list
            lines.append(pure_line)
            
n_lines = len(lines)
print(f"Number of lines: {n_lines}")

Number of lines: 9520


Let's examine a few lines from the corpus. The original assignment uses a Shakespeare corpus, in which character names are written in uppercase and each line begins with a capital letter. This run loads the Korean text in `test_utf_8.txt` instead, where dialogue is enclosed in quotation marks. Your task in this exercise is to build a generative model capable of emulating the structural style of its training corpus.

In [4]:
print("\n".join(lines[506:514]))

"우린 꽃은 기록하지 않아."
지리학자가 말했다.
"왜요? 그게 더 예쁜데요!"
"꽃들은 일시적인 존재니까."
"<일시적인 존재>가 뭐예요?"
"지리책은 모든 책들 중 가장 귀중한 책이야.지리책은 유행에 뒤지는 법 이 없지. 산이 위치를 바꾸는 일은 매우 드물거든. 바닷물의 물이 비어 버 리는 일도 매우 드물고. 우리는 영원한 것들을 기록하는 거야."
"하지만 불 꺼진 화산들이다시 깨어날 수도 있어요.<일시적인 존재>가 뭐예요?"
어린 왕자가 말을 가로막았다.


<a name='1-2'></a>
### 1.2 - Create the vocabulary

In the following code cell, you will create the vocabulary for text processing. The vocabulary is a crucial component for understanding and processing text data. Here's what the code does:

- Concatenate all the lines in our dataset into a single continuous text, separated by line breaks.

- Identify and collect the unique characters that make up the text. This forms the basis of our vocabulary.

- To enhance the vocabulary, introduce two special characters:

  - [UNK]: This character represents any unknown or unrecognized characters in the text.
  - "" (empty character): This character is used for padding sequences when necessary.
- The code concludes with the display of statistics, showing the total count of unique characters in the vocabulary and providing a visual representation of the complete character set.

In [5]:
text = "\n".join(lines)
# The unique characters in the file
vocab = sorted(set(text))
vocab.insert(0,"[UNK]") # Add a special character for any unknown
vocab.insert(1,"") # Add the empty character for padding.

print(f'{len(vocab)} unique characters')
print(" ".join(vocab))

1594 unique characters
[UNK]  	 
   ! " # % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P R S T U V W X Y Z [ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z ~ × ― ‘ ’ “ ” ‥ … ※ ℓ ← ↑ → ↓ ↔ ≒ ■ ▲ △ ▶ ▼ ◈ ○ ★ ♡ ♥ 〈 〉 「 」 『 』 〓 ㄱ ㅂ ㅅ ㅈ ㅋ ㅌ ㅍ ㅎ ㅜ ㅠ ㎏ ㎖ 使 儀 商 大 小 布 式 律 惑 星 的 紀 紅 詩 隊 가 각 간 갇 갈 갉 감 갑 값 갓 갔 강 갖 갗 같 갚 갛 개 객 갠 갤 갯 갰 걀 거 걱 건 걷 걸 걺 검 겁 것 겉 겊 겋 게 겐 겝 겠 겨 격 겪 견 결 겸 겹 겼 경 곁 계 곗 고 곡 곤 곧 골 곪 곯 곰 곱 곳 공 과 곽 관 괄 광 괘 괜 괴 괸 굉 교 구 국 군 굳 굴 굵 굶 굼 굽 굿 궁 궂 권 궐 궤 귀 귄 귈 귓 규 균 그 극 근 글 긁 금 급 긋 긍 기 긴 길 김 깁 깃 깊 까 깍 깎 깐 깔 깜 깝 깟 깡 깥 깨 깬 깰 깻 깼 깽 꺼 꺽 꺾 껄 껌 껍 껏 껐 껑 께 껜 껴 꼈 꼬 꼭 꼴 꼼 꼽 꼿 꽁 꽂 꽃 꽉 꽝 꽤 꽥 꾀 꾐 꾸 꾼 꿀 꿇 꿈 꿋 꿍 꿔 꿨 꿰 뀌 뀐 뀔 끄 끈 끊 끌 끓 끔 끗 끙 끝 끼 끽 낀 낄 낌 낍 낏 나 낙 낚 난 날 낡 남 납 낫 났 낭 낮 낯 낱 낳 내 낸 낼 냄 냅 냈 냉 냐 냔 냘 냥 너 넉 넋 넌 널 넓 넘 넛 넜 넝 넣 네 넥 넨 넬 넵 넷 넸 녀 녁 년 념 녔 녕 녘 녜 노 녹 논 놀 놈 놋 농 높 놓 놔 놨 뇌 뇨 누 눅 눈 눌 눔 눕 눠 눴 뉘 뉜 뉴 늄 늉 느 늑 는 늘 늙 늠 능 늦 늪 늬 니 닉 닌 닐 님 닙 닛 닝 닢 다 닥 닦 단 닫 달 닭 닮 닳 담 답 닷 당 닻 닿 대 댁 댄 댈 댐 댓 댔 댕 더 덕 던 덜 덟 덤 덥 덧 덩 덫 덮 데 덴 델 뎁 뎅 뎌 뎠 도 독 돈 돋 돌 돔 돕 동 돛 돼 됐 되 

<a name='1-3'></a>
### 1.3 - Convert a Line to Tensor

Now that you have your list of lines, you will convert each character in that list to a number using the order given by your vocabulary. You can use [`tf.strings.unicode_split`](https://www.tensorflow.org/api_docs/python/tf/strings/unicode_split) to split the text into characters. 

In [None]:
line = "Hello world!"
chars = tf.strings.unicode_split(line, input_encoding='UTF-8')
print(chars)

In [6]:
line = "안녕하세요!"
chars = tf.strings.unicode_split(line, input_encoding='UTF-8')
print(chars)

tf.Tensor(
[b'\xec\x95\x88' b'\xeb\x85\x95' b'\xed\x95\x98' b'\xec\x84\xb8'
 b'\xec\x9a\x94' b'!'], shape=(6,), dtype=string)


In [7]:
# Print as characters
decoded_chars = [char.numpy().decode('utf-8') for char in chars]

print(decoded_chars)

['안', '녕', '하', '세', '요', '!']
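For intuition, the split above can be reproduced without TensorFlow: iterating over a Python `str` yields one Unicode code point at a time, and encoding each one as UTF-8 gives byte strings like those in the tensor. This is only a plain-Python sketch of the idea, not how `tf.strings.unicode_split` is implemented.

```python
line = "안녕하세요!"

# Iterating over a str yields one Unicode code point (character) at a time.
chars = list(line)

# Encoding each character separately mirrors the byte strings in the tensor.
chars_as_bytes = [ch.encode("utf-8") for ch in chars]

print(chars_as_bytes)
# Each Hangul syllable takes 3 bytes in UTF-8, while '!' takes 1.
print([len(b) for b in chars_as_bytes])
```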


Using your vocabulary, you can convert the characters given by `unicode_split` into numbers. The number will be the index of the character in the given vocabulary. 

In [None]:
print(vocab.index('a'))
print(vocab.index('e'))
print(vocab.index('i'))
print(vocab.index('o'))
print(vocab.index('u'))
print(vocab.index(' '))
print(vocab.index('2'))
print(vocab.index('3'))

In [8]:
print(vocab.index('가'))
print(vocab.index('나'))

153
337


TensorFlow provides the layer [`tf.keras.layers.StringLookup`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup), which does this efficiently for a list of characters. Note that the output object is of type `tf.Tensor`. Here is the result of applying `StringLookup` to the characters stored in `chars`:

In [9]:
ids = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)(chars)
print(ids)

tf.Tensor([1005  386 1502  893 1093    5], shape=(6,), dtype=int64)
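Conceptually, the lookup is a dictionary from character to position in `vocab`, with index 0 (`[UNK]`) as the fallback for characters that are not in the vocabulary. The sketch below shows that idea in plain Python with a tiny hypothetical vocabulary (`toy_vocab`); it illustrates the mapping only and is not the `StringLookup` implementation.

```python
# Hypothetical toy vocabulary laid out like the notebook's: [UNK] first, then "".
toy_vocab = ["[UNK]", "", " ", "a", "b", "c"]

# Map each character to its position in the vocabulary.
char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}

def encode(text):
    # Characters missing from the vocabulary fall back to index 0 ([UNK]).
    return [char_to_id.get(ch, 0) for ch in text]

def decode(ids):
    # Inverse mapping: index back to character, joined into one string.
    return "".join(toy_vocab[i] for i in ids)

ids_example = encode("ab cz")
print(ids_example)          # 'z' is not in toy_vocab, so it maps to 0
print(decode(ids_example))  # the unknown character comes back as "[UNK]"
```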


<a name='ex-1'></a>
### Exercise 1 - line_to_tensor

**Instructions:** Write a function that takes in a single line and transforms each character into its integer index in the vocabulary. The function returns these indices as a tensor.

In [10]:
# GRADED FUNCTION: line_to_tensor
def line_to_tensor(line, vocab):
    """
    Converts a line of text into a tensor of integer values representing characters.

    Args:
        line (str): A single line of text.
        vocab (list): A list containing the vocabulary of unique characters.

    Returns:
        tf.Tensor(dtype=int64): A tensor containing integers (vocabulary indices) corresponding to the characters in the `line`.
    """
    ### START CODE HERE ###

    # Split the input line into individual characters
    chars = tf.strings.unicode_split(line, input_encoding='UTF-8')
    # Map characters to their respective integer values using StringLookup
    ids = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)(chars)
    
    ### END CODE HERE ###

    return ids

In [None]:
# Test your function
tmp_ids = line_to_tensor('abc xyz', vocab)
print(f"Result: {tmp_ids}")
print(f"Output type: {type(tmp_ids)}")

In [11]:
tmp_ids = line_to_tensor('반갑 습니다', vocab)
print(f"Result: {tmp_ids}")
print(f"Output type: {type(tmp_ids)}")

Result: [747 160   4 944 425 434]
Output type: <class 'tensorflow.python.framework.ops.EagerTensor'>


**Expected output**

```CPP
Result: [55 56 57  4 78 79 80]
Output type: <class 'tensorflow.python.framework.ops.EagerTensor'>
```

In [None]:
# UNIT TEST
w1_unittest.test_line_to_tensor(line_to_tensor)

You will also need a function that produces text from a numeric tensor. It is useful for inspection when you generate new text with your model, because you will see characters rather than lists of numbers. The function uses `tf.keras.layers.StringLookup` with `invert=True` to perform the inverse lookup.

In [12]:
def text_from_ids(ids, vocab):
    """
    Converts a tensor of integer values into human-readable text.

    Args:
        ids (tf.Tensor): A tensor containing integer values (vocabulary indices).
        vocab (list): A list containing the vocabulary of unique characters.

    Returns:
        tf.Tensor(dtype=string): A string tensor with the decoded characters joined along the last axis.
    """
    # Initialize the StringLookup layer to map integer IDs back to characters
    chars_from_ids = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True, mask_token=None)
    
    # Use the layer to decode the tensor of IDs into human-readable text
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

Use the function to decode the `ids` tensor produced above.

In [13]:
text_from_ids(ids, vocab).numpy()

b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x84\xb8\xec\x9a\x94!'

In [14]:
def text_from_ids(ids, vocab):
    """
    Converts a tensor of integer values into human-readable text.

    Args:
        ids (tf.Tensor): A tensor containing integer values (vocabulary indices).
        vocab (list): A list containing the vocabulary of unique characters.

    Returns:
        tf.Tensor(dtype=string): A string tensor with the decoded characters joined along the last axis.
    """
    # Initialize the StringLookup layer to map integer IDs back to characters
    chars_from_ids = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True, mask_token=None)
    
    # Decode the tensor of IDs into human-readable text
    decoded_chars = chars_from_ids(ids)
    
    # Use tf.strings.reduce_join to join the characters into a single string
    return tf.strings.reduce_join(decoded_chars, axis=-1)

In [15]:
text_from_ids(ids, vocab).numpy().decode('utf-8')

'안녕하세요!'

<a name='1-4'></a>
### 1.4 - Prepare your data for training and testing
As usual, you will need some data for training your model and some data for testing its performance. Here, the last 1000 lines are held out for evaluation and the remaining lines are used for training (8520 training lines for the corpus loaded in this run).

In [16]:
train_lines = lines[:-1000] # Leave the rest for training
eval_lines = lines[-1000:] # Create a holdout validation set

print(f"Number of training lines: {len(train_lines)}")
print(f"Number of validation lines: {len(eval_lines)}")

Number of training lines: 8520
Number of validation lines: 1000


<a name='1-5'></a>
### 1.5 - TensorFlow dataset

Most of the time in Natural Language Processing, and in AI in general, you use batches when training your models. Here, you will build a dataset that takes in some text and returns batches of text fragments (not necessarily full sentences) that you will use for training.
- The generator will produce text fragments encoded as numeric tensors of a desired length.

Once you create the dataset, you can iterate on it like this:

```
data_generator.take(1)
```

This generator returns the data in a format that you can feed directly to your model's forward pass. Note that the dataset built here is finite: it yields batches until the text is exhausted, and it can be iterated again for each epoch.

So, let's check how the different parts work with a corpus composed of 2 lines. Then, you will use these parts to create the next graded function of this notebook.

In order to get a dataset generator that produces batches of fragments from the corpus, you first need to convert the whole text into a single line, and then transform it into a single big tensor. This is only possible if your data fits completely into memory, but that is the case here.

In [None]:
all_ids = line_to_tensor("\n".join(["Hello world!", "Generative AI"]), vocab)
all_ids

In [17]:
all_ids = line_to_tensor("\n".join(["안녕하세요!", "생성형 AI"]), vocab)
all_ids

<tf.Tensor: shape=(13,), dtype=int64, numpy=
array([1005,  386, 1502,  893, 1093,    5,    3,  878,  892, 1541,    4,
         36,   44])>

Create a dataset from a tensor-like input. This initial dataset yields the encoded characters one at a time. For example, you can use it to get the first 10 encoded characters of your dataset. To make the output easier to read, we can use the `text_from_ids` function.

In [None]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
print([text_from_ids([ids], vocab).numpy() for ids in ids_dataset.take(10)])

In [18]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
print([text_from_ids([ids], vocab).numpy().decode('utf-8') for ids in ids_dataset.take(10)])

['안', '녕', '하', '세', '요', '!', '\n', '생', '성', '형']


But we can configure this dataset to produce batches of the same size each time. We can use this functionality to produce text fragments of a desired size (`seq_length + 1`). We will explain later why you need an extra character in the sequence.

In [19]:
seq_length = 10
data_generator = ids_dataset.batch(seq_length + 1, drop_remainder=True)

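You can verify that the data generator produces encoded fragments of text of the desired length. For example, let's ask the generator for up to 2 batches of data using `data_generator.take(2)`. In the run below the toy corpus has only 13 encoded characters, so with `drop_remainder=True` a single full fragment of 11 is produced.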

In [20]:
for seq in data_generator.take(2):
    print(seq)

tf.Tensor([1005  386 1502  893 1093    5    3  878  892 1541    4], shape=(11,), dtype=int64)


But as usual, it is easier to understand if you print it in human-readable characters using the `text_from_ids` function.

In [None]:
i = 1
for seq in data_generator.take(2):
    print(f"{i}. {text_from_ids(seq, vocab).numpy()}")
    i = i + 1

In [21]:
i = 1
for seq in data_generator.take(2):
    print(f"{i}. {text_from_ids(seq, vocab).numpy().decode('utf-8')}")
    i = i + 1

1. 안녕하세요!
생성형 
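The effect of `batch(seq_length + 1, drop_remainder=True)` can be sketched in plain Python: the id stream is cut into consecutive fragments of `seq_length + 1`, and an incomplete trailing fragment is discarded. The helper name `chunk_ids` is hypothetical.

```python
def chunk_ids(ids, fragment_length):
    # Cut the stream into consecutive fragments; drop an incomplete last one.
    n_full = len(ids) // fragment_length
    return [ids[i * fragment_length:(i + 1) * fragment_length] for i in range(n_full)]

seq_length = 10
toy_ids = list(range(13))  # stand-in for the 13 encoded characters above

fragments = chunk_ids(toy_ids, seq_length + 1)
print(len(fragments))   # 13 // 11 full fragments
print(fragments[0])
```

With 13 ids and fragments of 11, only one full fragment fits, which matches the single line printed by the generator above.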


<a name='1-6'></a>
### 1.6 - Create the input and the output for your model

In this task you have to predict the next character in a sequence. The following function creates 2 tensors, each of length `seq_length`, from an input sequence of length `seq_length + 1`. The first one contains the first `seq_length` elements and the second one contains the last `seq_length` elements. For example, if you split the sequence `['H', 'e', 'l', 'l', 'o']`, you will obtain the sequences `['H', 'e', 'l', 'l']` and `['e', 'l', 'l', 'o']`.

In [22]:
def split_input_target(sequence):
    """
    Splits the input sequence into two sequences, where one is shifted by one position.

    Args:
        sequence (tf.Tensor or list): A list of characters or a tensor.

    Returns:
        tf.Tensor, tf.Tensor: Two tensors representing the input and output sequences for the model.
    """
    # Create the input sequence by excluding the last character
    input_text = sequence[:-1]
    # Create the target sequence by excluding the first character
    target_text = sequence[1:]

    return input_text, target_text

Look at the result using the following sequence of characters:

In [None]:
split_input_target(list("Tensorflow"))

In [23]:
split_input_target(list("텐서플로우"))

(['텐', '서', '플', '로'], ['서', '플', '로', '우'])

The first sequence will be the input and the second sequence will be the expected output.
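Pairing the two sequences position by position makes the training signal explicit: at each step the model sees a character and must predict the one that follows. A small plain-Python sketch:

```python
sequence = list("Hello")

# Input drops the last character; target drops the first.
input_text, target_text = sequence[:-1], sequence[1:]

# Each (input, target) pair is "current character -> next character".
for current_char, next_char in zip(input_text, target_text):
    print(f"{current_char!r} -> {next_char!r}")
```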

Now, put all this together into a function to create your batch dataset generator.

<a name='ex-2'></a>
### Exercise 2 - data_generator
**Instructions:** Create a batch dataset from the input text. Here are some things you will need. 

- Join all the input lines into a single string. For a big dataset, it would be better to use a flow from directory or another kind of generator.
- Transform your input text into numeric tensors.
- Create a TensorFlow dataset from your numeric tensors: just feed them into the function [`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices).
- Make the dataset produce batches of data that will each form a single sample. That is, make the dataset produce a sequence of `seq_length + 1` elements rather than single numbers. You can do this using the `batch` function of the dataset you already created, specifying the length of the produced sequences (`seq_length + 1`). The extra element is needed because the input and the output sequences are both taken from the same fragment. `drop_remainder=True` drops the sequences that do not have the required length, which can happen when the dataset reaches the end of the input sequence.
- Use `split_input_target` to split each element produced by the dataset into the input and output sequences. The input will have the first `seq_length` elements, and the output will have the last `seq_length`. After this step, the dataset generator will produce (input, output) pairs of sequences.
- Create the final dataset, using `dataset_xy` as the starting point. You will configure this dataset to shuffle the data with the specified `BUFFER_SIZE` and to group the pairs into batches of `batch_size`. For performance reasons, you want TensorFlow to pre-process the data in parallel with training. This is called [`prefetching`](https://www.tensorflow.org/guide/data_performance#prefetching), and it will be configured for you.
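The role of `BUFFER_SIZE` is easier to reason about with a small model of buffered shuffling: keep a fixed-size buffer, emit a random element from it, and refill the slot from the stream. The following is a simplified plain-Python sketch of that idea (the function name `buffered_shuffle` is hypothetical); it is not TensorFlow's implementation.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Simplified model of shuffling with a bounded buffer.
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # replace it with the incoming one
    rng.shuffle(buffer)              # drain what is left in random order
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)

# With a buffer of one element there is nothing to choose from,
# so the original order is preserved.
print(list(buffered_shuffle(range(5), buffer_size=1)))
```

In this model a small buffer only shuffles locally, since an element can never be emitted before it has entered the buffer. A buffer at least as large as the dataset behaves like a full shuffle.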

In [24]:
# GRADED FUNCTION: create_batch_dataset
def create_batch_dataset(lines, vocab, seq_length=100, batch_size=64):
    """
    Creates a batch dataset from a list of text lines.

    Args:
        lines (list): A list of strings with the input data, one line per row.
        vocab (list): A list containing the vocabulary.
        seq_length (int): The desired length of each sample.
        batch_size (int): The batch size.

    Returns:
        tf.data.Dataset: A batch dataset generator.
    """
    # Buffer size to shuffle the dataset
    # (TF data is designed to work with possibly infinite sequences,
    # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
    # it maintains a buffer in which it shuffles elements).
    BUFFER_SIZE = 10000
    
    # For simplicity, just join all lines into a single line
    single_line_data  = "\n".join(lines)

    ### START CODE HERE ###
    
    # Convert your data into a tensor using the given vocab
    all_ids = line_to_tensor(single_line_data, vocab)
    # Create a TensorFlow dataset from the data tensor
    ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
    # Create a batch dataset
    data_generator = ids_dataset.batch(seq_length + 1, drop_remainder=True) 
    # Map each input sample using the split_input_target function
    dataset_xy = data_generator.map(split_input_target)
    
    # Assemble the final dataset with shuffling, batching, and prefetching
    dataset = (                                   
        dataset_xy                                
        .shuffle(BUFFER_SIZE)
        .batch(batch_size, drop_remainder=True)
        .prefetch(tf.data.AUTOTUNE)
        )            
                                     
    ### END CODE HERE ###
    
    return dataset

In [None]:
# test your function
tf.random.set_seed(1)
dataset = create_batch_dataset(train_lines[1:100], vocab, seq_length=16, batch_size=2)

print("Prints the elements into a single batch. The batch contains 2 elements: ")

for input_example, target_example in dataset.take(1):
    print("\n\033[94mInput0\t:", text_from_ids(input_example[0], vocab).numpy())
    print("\n\033[93mTarget0\t:", text_from_ids(target_example[0], vocab).numpy())
    
    print("\n\n\033[94mInput1\t:", text_from_ids(input_example[1], vocab).numpy())
    print("\n\033[93mTarget1\t:", text_from_ids(target_example[1], vocab).numpy())

In [25]:
# test your function
tf.random.set_seed(1)
dataset = create_batch_dataset(train_lines[1:100], vocab, seq_length=16, batch_size=2)

print("Prints the elements into a single batch. The batch contains 2 elements: ")

for input_example, target_example in dataset.take(1):
    print("\n\033[94mInput0\t:", text_from_ids(input_example[0], vocab).numpy().decode('utf-8'))
    print("\n\033[93mTarget0\t:", text_from_ids(target_example[0], vocab).numpy().decode('utf-8'))
    
    print("\n\n\033[94mInput1\t:", text_from_ids(input_example[1], vocab).numpy().decode('utf-8'))
    print("\n\033[93mTarget1\t:", text_from_ids(target_example[1], vocab).numpy().decode('utf-8'))

Prints the elements into a single batch. The batch contains 2 elements: 

Input0	: 포켓에서 내가 그려 준 양의 

Target0	: 켓에서 내가 그려 준 양의 그


Input1	: 아이 같은 구석이라고는 없었다

Target1	: 이 같은 구석이라고는 없었다.


**Expected output**

```CPP
Prints the elements into a single batch. The batch contains 2 elements: 

Input0	: b'and sight distra'

Target0	: b'nd sight distrac'


Input1	: b'when in his fair'

Target1	: b'hen in his fair '
```

In [None]:
# UNIT TEST
w1_unittest.test_create_batch_dataset(create_batch_dataset)

<a name='1-7'></a>
### 1.7 - Create the training dataset

Now, you can generate your training dataset using the functions defined above. This will produce pairs of input/output tensors each time the batch generator creates an entry.

In [26]:
# Batch size
BATCH_SIZE = 64
dataset = create_batch_dataset(train_lines, vocab, seq_length=100, batch_size=BATCH_SIZE)
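Because both batching steps drop incomplete remainders, the number of batches per pass over the data follows from two integer divisions. The helper below is a hypothetical plain-Python sketch, and the corpus size passed to it is made up for illustration.

```python
def batches_per_epoch(n_chars, seq_length, batch_size):
    # Fragments of seq_length + 1 characters, incomplete last fragment dropped.
    n_fragments = n_chars // (seq_length + 1)
    # Fragments grouped into batches, incomplete last batch dropped.
    return n_fragments // batch_size

# Hypothetical corpus size, for illustration only.
print(batches_per_epoch(n_chars=1_000_000, seq_length=100, batch_size=64))
```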

<a name='2'></a>
## 2 - Defining the GRU Language Model (GRULM)

Now that you have the input and output tensors, you will go ahead and initialize your model. You will implement `GRULM`, a gated recurrent unit language model, using `TensorFlow`. Instead of implementing the `GRU` from scratch (you saw this already in a lab), you will use built-in layers. You can use the following layers and functions when constructing the model: 

- `tf.keras.layers.Embedding`: Initializes the embedding layer, whose weight matrix has shape `vocab_size` by `embedding_dim`. [docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)  
    - `Embedding(vocab_size, embedding_dim)`.
    - `vocab_size` is the number of unique tokens in the given vocabulary (characters, in this assignment).
    - `embedding_dim` is the number of elements in each embedding vector (for word embeddings, some choices range from 150 to 300, for example).
___

- `tf.keras.layers.GRU`: `TensorFlow` GRU layer. [docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) Builds a traditional GRU of `rnn_units` with dense internal transformations. You can read the paper here: https://arxiv.org/abs/1412.3555
    - `units`: Number of recurrent units in the layer. It must be set to `rnn_units`
    - `return_sequences`: It specifies if the model returns a sequence of predictions. Set it to `True`
    - `return_state`: It specifies if the model must return the last internal state along with the prediction. Set it to `True` 
___

- `tf.keras.layers.Dense`: A dense layer. [docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense). You must set the following parameters:
    - `units`: Number of units in the layer. It must be set to `vocab_size`
    - `activation`: It must be set to `log_softmax` function as described in the next line.
___

- `tf.nn.log_softmax`: Log of the output probabilities. [docs](https://www.tensorflow.org/api_docs/python/tf/nn/log_softmax)
    - You don't need to set any parameters, just set the activation parameter as `activation=tf.nn.log_softmax`.
___
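Because the dense layer uses `tf.nn.log_softmax`, the model outputs log-probabilities rather than probabilities. The sketch below computes the same quantity in plain Python for one hypothetical vector of scores, using the usual max-subtraction trick for numerical stability.

```python
import math

def log_softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum_exp for s in scores]

scores = [2.0, 1.0, 0.1]          # hypothetical raw scores (logits)
log_probs = log_softmax(scores)

print(log_probs)
# Exponentiating log-probabilities gives probabilities that sum to 1.
print(sum(math.exp(lp) for lp in log_probs))
```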

<a name='ex-3'></a>
### Exercise 3 - GRULM
**Instructions:** Implement the `GRULM` class below. You should be using all the methods explained above.


In [27]:
# GRADED CLASS: GRULM
class GRULM(tf.keras.Model):
    """
    A GRU-based language model that maps from a tensor of tokens to activations over a vocabulary.

    Args:
        vocab_size (int, optional): Size of the vocabulary. Defaults to 256.
        embedding_dim (int, optional): Depth of embedding. Defaults to 256.
        rnn_units (int, optional): Number of units in the GRU cell. Defaults to 128.

    Returns:
        tf.keras.Model: A GRULM language model.
    """
    def __init__(self, vocab_size=256, embedding_dim=256, rnn_units=128):
        super().__init__()

        ### START CODE HERE ###

        # Create an embedding layer to map token indices to embedding vectors
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # Define a GRU (Gated Recurrent Unit) layer for sequence modeling
        self.gru = tf.keras.layers.GRU(units=rnn_units, return_sequences=True, return_state=True)
        # Apply a dense layer with log-softmax activation to predict next tokens
        self.dense = tf.keras.layers.Dense(units=vocab_size, activation=tf.nn.log_softmax)
        
        ### END CODE HERE ###
    
    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        # Map input tokens to embedding vectors
        x = self.embedding(x, training=training)
        if states is None:
            # Get initial state from the GRU layer
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        # Predict the next tokens and apply log-softmax activation
        x = self.dense(x, training=training)
        if return_state:
            return x, states
        else:
            return x

Now you can define a new GRULM model. Set `vocab_size` to 82, the embedding size `embedding_dim` to 256, and the number of units in your recurrent neural network `rnn_units` to 512.

In [None]:
# Length of the vocabulary in StringLookup Layer
vocab_size = 82

# The embedding dimension
embedding_dim = 256

# RNN layers
rnn_units = 512

model = GRULM(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units = rnn_units)

In [28]:
# Length of the vocabulary in StringLookup Layer
vocab_size = 1594

# The embedding dimension
embedding_dim = 512

# RNN layers
rnn_units = 1024

model = GRULM(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units = rnn_units)

In [29]:
# testing your model

try:
    # Simulate inputs of length 100. This makes it possible to compute the shapes of all inputs and outputs of the network
    model.build(input_shape=(BATCH_SIZE, 100))
    model.call(Input(shape=(100)))
    model.summary() 
except Exception:
    print("\033[91mError! \033[0mA problem occurred while building your model. This error can occur due to wrong initialization of the return_sequences parameter\n\n")
    traceback.print_exc()

Model: "grulm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 512)          816128    
                                                                 
 gru (GRU)                   [(None, 100, 1024),       4724736   
                              (None, 1024)]                      
                                                                 
 dense (Dense)               (None, 100, 1594)         1633850   
                                                                 
Total params: 7174714 (27.37 MB)
Trainable params: 7174714 (27.37 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


##### Expected output

```python
Model: "grulm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)    (None, 100, 256)          20992     
                                                                 
 gru (GRU)                [(None, 100, 512),        1182720   
                              (None, 512)]                       
                                                                 
 dense (Dense)            (None, 100, 82)           42066     
                                                                 
=================================================================
Total params: 1245778 (4.75 MB)
Trainable params: 1245778 (4.75 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
```
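The parameter counts in the summary can be checked by hand. The sketch below is a plain-Python calculation, not part of the assignment; the helper names are made up for illustration, and the GRU formula assumes the Keras default `reset_after=True`, which gives each of the three weight blocks two bias vectors.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # 3 weight blocks (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel and two bias vectors (reset_after=True)
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Kernel plus one bias per output unit
    return input_dim * units + units

def total_params(vocab_size, embedding_dim, rnn_units):
    return (embedding_params(vocab_size, embedding_dim)
            + gru_params(embedding_dim, rnn_units)
            + dense_params(rnn_units, vocab_size))

print(total_params(82, 256, 512))      # 1245778, the expected-output configuration
print(total_params(1594, 512, 1024))   # 7174714, the configuration used in the run above
```

If your own summary shows different numbers, comparing each layer against these formulas can help locate which argument was set incorrectly.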

In [None]:
# UNIT TEST
w1_unittest.test_GRULM(GRULM)

Now, let's use the untrained model to predict the next character. At the beginning, the model will generate only gibberish.

In [30]:
for input_example_batch, target_example_batch in dataset.take(1):
    print("Input: ", input_example_batch[0].numpy()) # Let's use only the first sequence in the batch
    example_batch_predictions = model(tf.constant([input_example_batch[0].numpy()]))
    print("\n",example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

Input:  [1449  197 1128    4  748 1013  437  434   17    4 1537 1135  644    4
  748 1127    4  185  660    4 1511  475    4  352  153    4 1360 1495
  940   11  862  257  309   12  216 1135    4  218  203  644    4  203
  906 1517  459    4   23   92   24  383    4 1005 1048    4   21  766
 1139  153    4   22  766    4 1175  475 1065  437  434   17    4 1502
 1215  660    4  703  491    4 1214  177  799 1184 1125  618    4    4
 1487  608  346 1044 1215   17    3  197  228   15    4  337  417    4
 1360 1495]

 (1, 100, 1594) # (batch_size, sequence_length, vocab_size)


With the assignment's default `vocab_size` of 82, the output size is (1, 100, 82); in the run shown above, which uses a vocabulary of 1594 tokens, it is (1, 100, 1594). We predicted only on the first sequence generated by the batch generator. 100 is the number of predicted characters, exactly the same length as the input. For each predicted character there is one real value per vocabulary entry, and each value is the log-likelihood of that entry being the next character in the sequence. The bigger the value, the higher the likelihood. As the network is not trained yet, all those values should be very similar and essentially random. Just check the values for the last prediction in the sequence.

In [31]:
example_batch_predictions[0][99].numpy()

array([-7.3532677, -7.376058 , -7.3669147, ..., -7.3824124, -7.37722  ,
       -7.3799896], dtype=float32)
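The values printed above are all close to -7.37. One way to see why: an untrained model spreads probability almost evenly over the vocabulary, and a perfectly uniform distribution over `V` entries assigns each one a log probability of `-log(V)`. The short sketch below (the helper name is an assumption for illustration) computes that baseline for both vocabulary sizes that appear in this notebook.

```python
import math

def uniform_log_prob(vocab_size):
    """Log probability of each entry under a uniform distribution."""
    return -math.log(vocab_size)

for vocab_size in (82, 1594):
    print(vocab_size, uniform_log_prob(vocab_size))
# vocab_size=82   gives about -4.41
# vocab_size=1594 gives about -7.37, matching the untrained outputs above
```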

The simplest way to choose the next character is to take the index of the element with the highest likelihood. So, for instance, the prediction for the last character would be:

In [32]:
last_character = tf.math.argmax(example_batch_predictions[0][99])
print(last_character.numpy())

204


And the prediction for the whole sequence would be:

In [33]:
sampled_indices = tf.math.argmax(example_batch_predictions[0], axis=1)
print(sampled_indices.numpy())

[ 649  766  232 1587  895 1143 1113 1362 1378  788 1079 1389 1366  614
 1267  517 1216  702  233  181 1337  747  360 1445 1445  360   32  204
 1002 1018 1426 1508 1130 1010  768 1366 1068  970   43 1225  614   43
  981  858 1478  832 1018  791  954  149  788 1570  742  263  658 1139
  381 1533 1068  677 1381  263 1496  747  369 1327 1040  805  788 1519
  845  233  263  970 1527 1587  588  117  401  534  409 1474  263  263
   29  664  562 1151  797  610 1434  588 1488  983  364  718 1231 1579
  134  204]


Those 100 numbers represent 100 predicted characters. However, humans cannot read this. So, let's print the input and output sequences using our `text_from_ids` function, to check what is going on.

In [None]:
print("Input:\n", text_from_ids(input_example_batch[0], vocab))
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices, vocab))

In [34]:
print("Input:\n", text_from_ids(input_example_batch[0], vocab).numpy().decode('utf-8'))
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices, vocab).numpy().decode('utf-8'))

Input:
 판결을 받았단다. 혐의를 받은 것만 해도 내가 콤피슨(사기꾼)과의 관계를 계속했던 4~5년 안에 2번인가 3번 정도였단다. 하지만 모두 증거부족으로  풀려났었지.
결국, 나는 콤피

Next Char Predictions:
 릎번굵희센임위콩킨볼옴탓쿠렷찝딛직몇굶걸캔반냔팅팅냔=곗씽앤퉁합음암범쿠옆써H짚렷H쏟쁨폭뺑앤봉십紀볼훼및깊마인녀혀옆맷킷깊핀반넘칠엄붓볼향뽀굶깊써헤희랫♡뇨땠눴폈깊깊:맘뜬작뵐련튿랫품쏴넉물짧흑ㅜ곗


As expected, the untrained model just produces random text in response to the given input. It is also important to note that taking the index of the maximum score is not always the best choice. In the last part of the notebook you will see another way to do it.

<a name='3'></a>
## 3 - Training

Now you are going to train your model. As usual, you have to define the cost function and the optimizer. You will use the following built-in functions provided by TensorFlow: 

- [`tf.losses.SparseCategoricalCrossentropy()`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy): The Sparse Categorical Cross Entropy loss. It is the loss function used for multiclass classification when the targets are integer class indices.
    - `from_logits=True`: This parameter tells the loss function that the output values generated by the model are not normalized like a probability distribution. This is our case, since our GRULM model uses a `log_softmax` activation rather than `softmax`, so its outputs are log probabilities.
- [`tf.keras.optimizers.Adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam): Adaptive Moment Estimation, a stochastic gradient descent variant that works well in most cases. Set the `learning_rate` to 0.00125.
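Treating log-softmax outputs as logits works because applying softmax to log probabilities gives back the same probabilities. The NumPy sketch below illustrates this with made-up scores; the helper functions are assumptions written for this example and are not the TensorFlow loss itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def log_softmax(z):
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

raw_scores = np.array([1.5, -0.3, 0.8, 2.1])   # made-up pre-activation scores
target = 2                                      # made-up correct class index

# Cross-entropy computed from raw scores ...
loss_from_raw = -np.log(softmax(raw_scores)[target])
# ... and from log-softmax outputs that are treated as logits again
loss_from_log_probs = -np.log(softmax(log_softmax(raw_scores))[target])

print(loss_from_raw, loss_from_log_probs)   # the two values agree
```

Both values also equal the negative log probability of the target, which is the quantity the perplexity calculation later in the notebook relies on.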

<a name='ex-4'></a>
### Exercise 4 - train_model

**Instructions:** Compile the GRULM model using a `SparseCategoricalCrossentropy` loss and the `Adam` optimizer.

In [35]:
# GRADED FUNCTION: Compile model

def compile_model(model):
    """
    Sets the loss and optimizer for the given model

    Args:
        model (tf.keras.Model): The model to compile.

    Returns:
        tf.keras.Model: The compiled model.
    """
    ### START CODE HERE ###

    # Define the loss function. Use SparseCategoricalCrossentropy
    loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
    # Define an Adam optimizer
    opt = tf.keras.optimizers.Adam(learning_rate=0.00125)
    # Compile the model using the parametrized Adam optimizer and the SparseCategoricalCrossentropy function
    model.compile(optimizer=opt, loss=loss)
    
    ### END CODE HERE ###

    return model

In [None]:
## UNIT TEST
w1_unittest.test_compile_model(compile_model)

Now, train your model for 10 epochs. With a GPU this should take about one minute. With a CPU this could take several minutes.

In [None]:
EPOCHS = 10

# Compile the model
model = compile_model(model)
# Fit the model
history = model.fit(dataset, epochs=EPOCHS)

In [36]:
EPOCHS = 30

# Compile the model
model = compile_model(model)
# Fit the model
history = model.fit(dataset, epochs=EPOCHS)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


You can run the following cell to save the weights of your model. This allows you to use the model later.

In [37]:
# If you want, you can save the weights of the final model.
import shutil

output_dir = './model_test_512/'

# Remove any previous checkpoint directory; ignore the error if it does not exist
try:
    shutil.rmtree(output_dir)
except OSError:
    pass

model.save_weights(output_dir)

The model was only trained for 10 epochs. We pretrained a model for 30 epochs, which can take about 5 minutes on a GPU.

<a name='4'></a>
## 4 - Evaluation  
<a name='4-1'></a>
### 4.1 - Evaluating using the Deep Nets

Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity, which is a measure of how well a probability model predicts a sample. Perplexity is defined as:

$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

As an implementation hack, you would usually take the log of that formula. This lets you work directly with the log probabilities that the `RNN` outputs, and it turns exponents into products and products into sums, which makes the computation simpler and more efficient.

$$\log P(W) = \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\right)$$
$$ = \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}\right)^{\frac{1}{N}}\right)$$
$$ = \log\left(\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right)^{-\frac{1}{N}}\right)$$
$$ = -\frac{1}{N}\log\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right)$$
$$ = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i| w_1,...,w_{i-1})$$
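As a small numeric check of the derivation, the sketch below computes the perplexity of a three-token sequence both from the direct definition and from the log form. The probabilities are made up for illustration and are chosen so that the result is easy to verify by hand.

```python
import numpy as np

# Made-up conditional probabilities P(w_i | w_1, ..., w_{i-1}) for a 3-token sequence
probs = np.array([0.5, 0.25, 0.125])
n = len(probs)

# Direct definition: N-th root of the product of inverse probabilities
perplexity = np.prod(1.0 / probs) ** (1.0 / n)

# Log form: negative mean of the log probabilities
log_perplexity_value = -np.mean(np.log(probs))

print(perplexity)                     # approximately 4, since (2 * 4 * 8) ** (1/3) = 4
print(np.exp(log_perplexity_value))   # same value, recovered from the log form
```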

<a name='ex-5'></a>
### Exercise 5 - log_perplexity
**Instructions:** Write a program that will help evaluate your model. Implementation hack: your program takes in `preds` and `target`. `preds` is a tensor of log probabilities. You can use [`tf.one_hot`](https://www.tensorflow.org/api_docs/python/tf/one_hot) to transform the `target` into the same dimension. You then multiply them and sum over the last axis, which picks out the log probability of each target token. For the sake of simplicity, we suggest you use the NumPy functions [`sum`](https://numpy.org/doc/stable/reference/generated/numpy.sum.html), [`mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) and [`equal`](https://numpy.org/doc/stable/reference/generated/numpy.equal.html). Good luck!

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>To convert the target into the same dimension as the predictions tensor use tf.one_hot with target and preds.shape[-1].</li>
    <li>You will also need the np.equal function to mask out the padding and properly compute perplexity.</li>
</ul>
</p>
</details>

In [38]:
# GRADED FUNCTION: log_perplexity
def log_perplexity(preds, target):
    """
    Function to calculate the log perplexity of a model.

    Args:
        preds (tf.Tensor): Predictions of a list of batches of tensors corresponding to lines of text.
        target (tf.Tensor): Actual list of batches of tensors corresponding to lines of text.

    Returns:
        float: The log perplexity of the model.
    """
    PADDING_ID = 1
    
    ### START CODE HERE ###
    
    # Calculate log probabilities for predictions using one-hot encoding
    log_p = np.sum(tf.one_hot(target, depth=preds.shape[-1]) * preds, axis= -1) # HINT: tf.one_hot(...) should replace one of the Nones
    # Identify non-padding elements in the target
    non_pad = 1.0 - np.equal(target, PADDING_ID)          # You should check if the target equals to PADDING_ID
    # Apply non-padding mask to log probabilities to exclude padding
    log_p = log_p * non_pad                             # Get rid of the padding
    # Calculate the log perplexity by taking the sum of log probabilities and dividing by the sum of non-padding elements
    log_ppx = np.sum(log_p, axis=-1) / np.sum(non_pad, axis=-1) # Remember to set the axis properly when summing up
    # Compute the mean of log perplexity
    log_ppx = np.mean(log_ppx) # Compute the mean of the previous expression
        
    ### END CODE HERE ###
    return -log_ppx

In [None]:
#UNIT TESTS
w1_unittest.test_log_perplexity(log_perplexity)
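To see what the padding mask does, here is a tiny worked example in plain NumPy. It follows the same steps as `log_perplexity` but is only a sketch: the batch is made up, and `np.eye(...)[target]` stands in for `tf.one_hot` so that the example has no TensorFlow dependency.

```python
import numpy as np

PADDING_ID = 1  # same padding id as in log_perplexity

# Made-up batch: 1 sequence, 3 time steps, vocabulary of 4 tokens
probs = np.array([[[0.1, 0.1, 0.5, 0.3],
                   [0.2, 0.1, 0.2, 0.5],
                   [0.25, 0.25, 0.25, 0.25]]])
preds = np.log(probs)             # log probabilities, shape (1, 3, 4)
target = np.array([[2, 3, 1]])    # the last position is padding

one_hot = np.eye(preds.shape[-1])[target]       # shape (1, 3, 4)
log_p = np.sum(one_hot * preds, axis=-1)        # log probability of each target token
non_pad = 1.0 - np.equal(target, PADDING_ID)    # [[1., 1., 0.]]
log_ppx = -np.mean(np.sum(log_p * non_pad, axis=-1) / np.sum(non_pad, axis=-1))

print(log_ppx)   # -(log 0.5 + log 0.5) / 2 = log 2
```

Only the two non-padding positions contribute, so the padded step with probability 0.25 does not change the result.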

Now load the provided pretrained model just to ensure that results are consistent for the upcoming parts of the notebook. You need to instantiate the GRULM model and then load the saved weights.

In [None]:
# Load the pretrained model. This step is optional. 
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 512

model = GRULM(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units = rnn_units)
model.build(input_shape=(100, vocab_size))
model.load_weights('./model/')

In [44]:
# Load the pretrained model. This step is optional. 
# No need to run this...
vocab_size = len(vocab)
embedding_dim = 512
rnn_units = 1024

model = GRULM(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units = rnn_units)
model.build(input_shape=(100, vocab_size))
model.load_weights('./model_test_512/')

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7fa004044a00>

Now, you will use the 1000 lines of the corpus that were reserved at the beginning of this notebook as test data. First, you will apply the same preprocessing as you did for the train dataset: get the numeric tensor from the input lines, and use `split_input_target` to generate the inputs and the expected outputs.

Second, you will predict the next characters for the whole dataset, and you will compute the perplexity for the expected outputs and the given predictions.

In [45]:
#for line in eval_lines[1:3]:
eval_text = "\n".join(eval_lines)
eval_ids = line_to_tensor([eval_text], vocab)
input_ids, target_ids = split_input_target(tf.squeeze(eval_ids, axis=0))

preds, status = model(tf.expand_dims(input_ids, 0), training=False, states=None, return_state=True)

#Get the log perplexity
log_ppx = log_perplexity(preds, tf.expand_dims(target_ids, 0))
print(f'The log perplexity and perplexity of your model are {log_ppx} and {np.exp(log_ppx)} respectively')

The log perplexity and perplexity of your model are 3.766624630172665 and 43.23388791326808 respectively


**Expected Output:** The log perplexity and perplexity of your model are around 1.22 and 3.40 respectively.

So, the expected log perplexity of the model is about 1.22 (the run shown above reports about 3.77). It is not an easy metric to interpret, but it can be used to compare models. The smaller the value, the better the model.
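One way to make the number more tangible is to convert it back to a perplexity by exponentiating. A perplexity of `k` roughly means the model is, on average, as uncertain as if it were choosing uniformly among `k` characters. The helper below is a small illustrative sketch, not part of the assignment.

```python
import math

def perplexity_from_log(log_ppx):
    """Convert a log perplexity (natural log) into a perplexity."""
    return math.exp(log_ppx)

print(perplexity_from_log(1.22))                # expected-output value, about 3.39
print(perplexity_from_log(3.766624630172665))   # value from the run above, about 43.23
```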

<a name='5'></a>
## 5 - Generating Language with your Own Model

Your GRULM model predicts the most likely characters in a sequence, based on log scores. However, if you always pick the highest-scoring character, generation is deterministic and can produce repetitive, monotonous outputs. For instance, the model tends to give the same answer to a question every time.

To make your language model more dynamic and versatile, you can introduce an element of randomness into its predictions. This ensures that even if you feed the model in the same way each time, it will generate different sequences of text.

To achieve this, you can use a technique known as random sampling. Given an array of log scores for the N characters in your dictionary, you add an array of random numbers to it and pick the character whose noisy score is highest. The amount of randomness is controlled by a parameter called "temperature": the random numbers are scaled by the temperature before being added, so a higher temperature lets lower-scoring characters win more often.

This doesn't imply that the model produces entirely random results on each iteration. Rather, with each prediction, there is a probability associated with choosing a character other than the one with the highest score. This concept becomes more tangible when you explore the accompanying Python code.

In [41]:
def temperature_random_sampling(log_probs, temperature=1.0):
    """Temperature Random sampling from a categorical distribution. The higher the temperature, the more 
       random the output. If temperature is close to 0, it means that the model will just return the index
       of the character with the highest input log_score
    
    Args:
        log_probs (tf.Tensor): The log scores for each character in the dictionary
        temperature (number): A value to weight the random noise. 
    Returns:
        int: The index of the selected character
    """
    # Generate uniform random numbers with a slight offset to avoid log(0)
    u = tf.random.uniform(minval=1e-6, maxval=1.0 - 1e-6, shape=log_probs.shape)
    
    # Apply the Gumbel distribution transformation for randomness
    g = -tf.math.log(-tf.math.log(u))
    
    # Adjust the logits with the temperature and choose the character with the highest score
    return tf.math.argmax(log_probs + g * temperature, axis=-1)
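The function above is an instance of what is often called the Gumbel-max trick: adding Gumbel noise scaled by the temperature and taking the `argmax` is equivalent to sampling from `softmax(log_probs / temperature)`. The NumPy sketch below checks this empirically on a made-up three-character distribution; the seed, sample count and variable names are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

log_probs = np.log(np.array([0.6, 0.3, 0.1]))   # made-up distribution over 3 characters
temperature = 0.5
n_samples = 100_000

# Same recipe as temperature_random_sampling, vectorised over many draws
u = rng.uniform(1e-6, 1.0 - 1e-6, size=(n_samples, log_probs.size))
g = -np.log(-np.log(u))
samples = np.argmax(log_probs + g * temperature, axis=-1)
empirical = np.bincount(samples, minlength=log_probs.size) / n_samples

# Distribution predicted by the Gumbel-max argument
scaled = log_probs / temperature
expected = np.exp(scaled) / np.exp(scaled).sum()

print(empirical)   # observed sampling frequencies
print(expected)    # softmax(log_probs / temperature)
```

With a temperature below 1 the most likely character is chosen even more often than its original probability, and as the temperature approaches 0 the sampling reduces to a plain `argmax`.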

Now, it's time to bring all the elements together for the exciting task of generating new text. The GenerativeModel class plays a pivotal role in this process, offering two essential functions:

1. `generate_one_step`: This function is your go-to method for generating a single character at a time. It accepts two key inputs: an initial input sequence and a state that can be thought of as the ongoing context or memory of the model. The function delivers a single character prediction and an updated state, which can be used as the context for future predictions.

2. `generate_n_chars`: This function takes text generation to the next level. It orchestrates the iterative generation of a sequence of characters. At each iteration, generate_one_step is called with the last generated character and the most recent state. This dynamic approach ensures that the generated text evolves organically, building upon the context and characters produced in previous steps. Each character generated in this process is collected and stored in the result list, forming the final output text.
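The interplay between the two functions can be sketched without any neural network. In the toy example below, a fixed lookup table stands in for the trained model and a step counter stands in for the GRU state. All names (`NEXT_CHAR`, `toy_one_step`, `toy_generate_n_chars`) are hypothetical and exist only for this illustration.

```python
# Hypothetical stand-in for the trained model: a fixed next-character table
NEXT_CHAR = {"a": "b", "b": "c", "c": "a"}

def toy_one_step(text, state=None):
    """Return (next_char, new_state) for the latest text chunk and a state.

    The state here is just a step counter; in the real class it is the GRU state.
    """
    last_char = text[-1]
    new_state = (state or 0) + 1
    return NEXT_CHAR[last_char], new_state

def toy_generate_n_chars(num_chars, prefix):
    state = None
    next_char = prefix
    result = [next_char]
    for _ in range(num_chars):
        # Feed only the most recent output plus the carried state
        next_char, state = toy_one_step(next_char, state)
        result.append(next_char)
    return "".join(result)

print(toy_generate_n_chars(5, "ab"))   # abcabca
```

The key point is that after the first call only the newest character is fed back in; everything seen earlier has to be carried by the state.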

<a name='ex-6'></a>
### Optional Exercise 6 - GenerativeModel (Not graded)
**Instructions:** Implementing the One-Step Generator

In this task, you will create a function to generate a single character based on the input text, using the provided vocabulary and the trained model. Follow these steps to complete the generate_one_step function:

1. Start by transforming your input text into a tensor using the given vocab. This will convert the text into a format that the model can understand.

2. Utilize the trained model with the input_ids and the provided states to predict the next characters. Make sure to retrieve the updated states from this prediction because they are essential for the final output.

3. Since we are only interested in the next character prediction, keep only the result for the last character in the sequence.

4. Employ the temperature random sampling technique to convert the vector of scores into a single character prediction. For this step, you will use the predicted_logits obtained in the previous step and the temperature parameter of the model.

5. To transform the numeric prediction into a human-readable character, use the text_from_ids function. Be mindful that text_from_ids expects a list as its input, so you need to wrap the output of the temperature_random_sampling function in square brackets [...]. Don't forget to use self.vocab as the second parameter for character mapping.

6. Finally, return the predicted_chars, which will be a single character, and the states tensor obtained from step 2. These components are essential for maintaining the sequence and generating subsequent characters.







In [42]:
# UNGRADED CLASS: GenerativeModel
class GenerativeModel(tf.keras.Model):
    def __init__(self, model, vocab, temperature=1.0):
        """
        A generative model for text generation.

        Args:
            model (tf.keras.Model): The underlying model for text generation.
            vocab (list): A list containing the vocabulary of unique characters.
            temperature (float, optional): A value to control the randomness of text generation. Defaults to 1.0.
        """
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.vocab = vocab
    
    @tf.function
    def generate_one_step(self, inputs, states=None):
        """
        Generate a single character and update the model state.

        Args:
            inputs (string): The input string to start with.
            states (tf.Tensor): The state tensor.

        Returns:
            tf.Tensor, states: The predicted character and the current GRU state.
        """
        # Convert strings to token IDs.
        
        ### START CODE HERE ###

        # Transform the inputs into tensors
        input_ids = line_to_tensor(inputs, self.vocab)
        # Predict the sequence for the given input_ids. Use the states and return_state=True
        predicted_logits, states = self.model(input_ids, states, return_state=True)
        # Get only last element of the sequence
        predicted_logits = predicted_logits[0, -1, :]
        # Use the temperature_random_sampling to generate the next character.
        predicted_ids = temperature_random_sampling(predicted_logits, self.temperature)
        # Use text_from_ids to transform the id into the corresponding char
        predicted_chars = text_from_ids([predicted_ids], self.vocab)
        
        ### END CODE HERE ###
        
        # Return the characters and model state.
        return tf.expand_dims(predicted_chars, 0), states
    
    def generate_n_chars(self, num_chars, prefix):
        """
        Generate a text sequence of a specified length, starting with a given prefix.

        Args:
            num_chars (int): The length of the output sequence.
            prefix (string): The prefix of the sequence (also referred to as the seed).

        Returns:
            str: The generated text sequence.
        """
        states = None
        next_char = tf.constant([prefix])
        result = [next_char]
        for n in range(num_chars):
            next_char, states = self.generate_one_step(next_char, states=states)
            result.append(next_char)

        return tf.strings.join(result)[0].numpy().decode('utf-8')

In [None]:
# UNIT TEST
# Fix the seed to get replicable results for testing
tf.random.set_seed(272)
gen = GenerativeModel(model, vocab, temperature=0.5)

print(gen.generate_n_chars(32, " "), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "Dear"), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "KING"), '\n\n' + '_'*80)

**Expected output**
```CPP
 hear he has a soldier.
Here is a 

________________________________________________________________________________
Dear gold, if thou wilt endure the e 

________________________________________________________________________________
KING OF THE SHREW
IV	I beseech you,  

________________________________________________________________________________

```

In [46]:
gen = GenerativeModel(model, vocab, temperature=0.5)

print(gen.generate_n_chars(32, " "), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "우물"), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "왕자"), '\n\n' + '_'*80)

 여기 이름이 ‘새티스 하우스’(미스 해비샴의 저택. 읍내  

________________________________________________________________________________
우물에 도착했을 때, 미스 해비샴은 에스텔라의 어머니가 다른  

________________________________________________________________________________
왕자가 자신의 팔을 자기 집으로 돌아다닐 수 있게 되자 나는  

________________________________________________________________________________


In [43]:
gen = GenerativeModel(model, vocab, temperature=0.5)

print(gen.generate_n_chars(32, " "), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "우물"), '\n\n' + '_'*80)
print(gen.generate_n_chars(32, "왕자"), '\n\n' + '_'*80)

 ??롭맨낮%깥찝8병족볶↔쏠홍미뾰도태립졸혈튼낯간첼썹꿨셋듣O짢 

________________________________________________________________________________
우물롭길E헝쯧병현셔려크띔능압촌저끔넓찬흥콧껄헸샘튄퍼릇떤판양돕8B 

________________________________________________________________________________
왕자맨벗ㄱ돔매측y꺾푼뵈삼둡폰둬멈워핵줌G똥잭늘꽂d맑투한훈긁뿐샹괜 

________________________________________________________________________________


In [None]:
w1_unittest.test_GenerativeModel(GenerativeModel, model, vocab)

Now, generate a longer text. Let's check if it looks like a Shakespeare fragment.

In [51]:
tf.random.set_seed(np.random.randint(1, 1000))
gen = GenerativeModel(model, vocab, temperature=0.8)
import time
start = time.time()
print(gen.generate_n_chars(1000, "ROMEO "), '\n\n' + '_'*80)
print('\nRun time:', time.time() - start)

ROMEO 취시  자유였죠.”
“그럼 안 하나 도착한 거지.” 비디가 ‘리치먼드(에스텔라의 이름)가 제 치수에게 오해하고 있었는지를 나는  목격했다.  “왜  하느님!  제가요.”
“이것에도 대담당 하나도 많이 처청해합니다.”라고 가장  한  말이었더랬다.
내가  대답했다. “제 마음이 바쁘시니 좋은 사람이니까요. 나는  일깨워주었다.
에스텔라가 나를 불어처 심리상태라고 상황해 보이는 것도 압니다. 나는 그것이 너무 저기에 놓인 의자에 태워있었다. 이 에스텔라는 정말로 드러믈(주인공의 라이벌)이 들었다. 그래서 미스 해비샴을 보았습니다. 때론 재거스 씨가 우리들이 그 ‘제분소(에스텔라의 친어머니)를 자신의 손으로  관어  없다는  거예요.”
“직업, 18장 그러니 이번 꿈도 꾸준히 지요.”
“좋아. 자, 이 이상 아저씨~ 수요. 당신은 변호사님께 귀를 기울이야 하는 것입니다. 또한 이제 곧 자 그래, 당신께 유산하지 않을 것 같습니다.”
“가게 ‘어라니, 라고 단락 해석의 부담’을 입으로 부드럽게 만들어야 하느니 이름을 이해하겠지만 말이다, 핍.”(← 런던  변호사의  가게에서  가는  막 나오지만  이  결혼식      중  해주신  변호사 같이  이 세상에서 가장 가격을  유지하는 대장이 있다).  그렇지만 적이 그러니까 에스텔라는 이렇게 말하는 것이 없다. 그래서 좀 부서는 다른 종류의  남자가  땅바닥에  달아오른  것이다.
그는 아주 예쁘고 정보  결치가  왔기 때문이다.  난 시기심이  오래된다,  그녀가  이렇게  말했다.
“재거스 씨가 우리의 이름을 많이 가져메는 것만 하면, 재거스 씨가 자주 사용되고 있습니다.”  조에게  그녀가  치장을  속으로  불렀다.
“이번이 마지막 이해 재거스 씨가 무척 당신을 도왔으니까요.”
“다 핍 씨가 무슨 짓을”  때가  왔다는 것입니다.”
나는 이 부분을 지켜보던 적이 없었다. 나는 그것이 일언이라도 있었기 때문에 나는 일찍이 가능한 한  많았다. 하늘 이 사이에 잘 아무런 대답도 하지 않았고 이제 집 둘 뿐

In [47]:
tf.random.set_seed(np.random.randint(1, 1000))
gen = GenerativeModel(model, vocab, temperature=0.8)
import time
start = time.time()
print(gen.generate_n_chars(1000, "상자는 "), '\n\n' + '_'*80)
print('\nRun time:', time.time() - start)

상자는 없었다.
누나의 주문자(이 일대 비즈)가 우리 시골을 내 입으로 여기  대 소동들을  어느 수 있다는 것을 나는 알고 있어.
이 말에 콤피슨이 나를 구혼자로 만들어왔다고 그녀가 마주쳤다. 그는 손을 씻은 채 또 다시 내게, 거칠한 것을 그녀가 잘 알고 있다. 구두가 나로 하여금 손들을 구경하고 있었다.
재거스  씨가 말했다.  “심지어 추운을 기다리는 것을 이해할 수 없다고 생각하겠네. 그것을 자네보고 아주 도착하고  있었는데,  이 여성들의 모든  변호사들을  만들어버릴 자도 있을 테지.
그리고 나쁜 상태를 지속하는 데까지 문 가까이에 의해했다. 그리고 자신의 지독이 받아 일단 ‘정오 당 포도 ’여 있었다.
나는 몇 년간 같이 보였다. 여관이  들어오고 있었다.
“당치 않은 건 정말 수입 있어요, 이 재판이 끝난 때였지만 말이야.”라며 에스텔라가 말했다. “그녀가 마차에 오른다고 좀 그래서 만  생각하던가!”
“아, 아니겠니, 그게.” 내가 비록 심리며 정기되어 있던 목적지로 물었다.
그가  말했다.
그래서 어린 왕자는 또래 아물 그렇게 오는 것을 그 원은 나뭇가지 않았다. 비록 그의 별이 만났을 때 그들이 그 날 매일  밤인지 이미지  않았다.
“손 인들이 너를 속에서 방 안 된 것으로 명령. 명쾌히 “그렇다고요. 그런 다음 제 소스 일을  인다는답니다.”
“그들이 다가가셨을 것 같지도 않을까요?”
나는 여전히 나를 예민했다. 왜냐하면 자신이 한 일을 자그쳤다. 에스텔라가 나를 이렇게 말했다.
“이제 자네가 내 생각을 듣기에 자주 만약 가려다 라 자네 자신의 집 의 도매에 대해 듣게 되거나 아냐! 그렇게 생각 하느님께 저 미른 질문은 보이지 않았나요, 이 정돈 나는 대단히 좀 더 많은 것을 알고 있습니다. 여기 여자 이름 다  안나절에 도두 수 있을 것 같으만 실입니다.”
“그렇고말고요.” 내가 말했다. “여기에 이 집을 바로 제 입술로 명성을 주고 있지 않았다고.”
“그럼 이제 좀 더 많아서 좀 있다고.” 에스텔라는 분명  신다 말라고  말하기 시

In [48]:
tf.random.set_seed(np.random.randint(1, 1000))
gen = GenerativeModel(model, vocab, temperature=0.8)
import time
start = time.time()
print(gen.generate_n_chars(1000, "사랑"), '\n\n' + '_'*80)
print('\nRun time:', time.time() - start)

사랑은 어떤 다른 행동도 할 말  왔다. “기억 핍(21세), 이게 다 아냐. 잘 아시겠어요!”
얼굴이 창백 피 수 있게 됐다, 마냥 나는 이 수 있다는  듯 미쳐 더 이상 말했지만 더 정확히 알려지지 않는다면 세 번이라도 남겨두고 자리를 살았다. 그리곤 그에게 자신이 나를 빤히 쳐다보며 서 있었다.
“자 이제 이 이상  말이다?”
“이 이야기를 내가 잘 이해나게 하는 일이라, 나도 이제 감옥선을 탈출해주리라. 물론 이 족쇄를 잘 내 지나친 선명을 잊기 위해선 비단어. 이런 희망을 왜 내 자신의 귀족같음, 신사들 속에 자서 있는 그 사람들에 대해 그걸 빼앗은 어느 누군가의 다  말로  읍내  가겠다는  거야. 비록 불 어린 가구들은 이 집에서 나를 자주 만들어 내야 알고 있었어요. 그 반만큼이라는 건     각 말이다.
만약 자기 주의를 등기 속해서 까밀라 뒤집어왔던 전보다, 나는 그가 내게 성공을 드러내며 올라오기 시작했다. 거기서 그 빵과는 포도주들이 무언가를 따라갈 수 있었다. 나는 그들에게 삼촌이 고마워당했다. 그는 다른 쪽에 앉아 비추를 보여주었다.
“그 말은 벌써 내 자신을 그 러뜨리던 참이지. 나도 넌 꼬마 가까이에 오르고 있을 때 너무 너만큼이나 그럼 조의 매형 조부에 짓눌려 왔던 것이 오겠습니다!”란 인사말 좀 심사숙고해보자.그런데 난 이제 그런  의도에   대해  어떠한  말인가!  고맙다는 말은  당연히  영 문이  다시 되지 않은가?
새로만 봐야 할지 없다는 것이 점을 소리 느리게 입기도 하고, 또 어떨 때는 그렇게 피곤 있었지 않아.  내  대답을  어디어  보 았단다.  그  이야기 전달      들         그            있었다.                 그들은  이렇게  달려오.  내           마.”
“그렇지 않소.” 재거스 씨가 자신의 집게손가락을 너무 자부    세며  친절히 다음과 같이 말하며  이렇게  말해줄  것 같았다.  “제  숨과  오늘은 잘  히고 있습니다.” “그럼 그들은 몰두 시각하려는

In [49]:
tf.random.set_seed(np.random.randint(1, 1000))
gen = GenerativeModel(model, vocab, temperature=0.8)
import time
start = time.time()
print(gen.generate_n_chars(1000, "공부"), '\n\n' + '_'*80)
print('\nRun time:', time.time() - start)

공부 아래 여기 숙녀 집에 들어갔다가 한 번 흔들어 보곤 했다. “그 전에는 이 수상한 관심들엔 한 번이라도 아니야, 아니야, 아니고 이놈 앞에 모습을 드러내놓고 있죠. 난 이 일이 먼저 거 어디 가구가 귀여운 어린아이 말  이야!”
“좋아.” 포킷 씨 부인이 물들 너머를 좀 더 걸어오더니 소리쳤다. “난 어떻게 불러야?”
“그래 좀 있어요!” 재거스 씨가 말했다. “그 분이라고. 나리에 두 번째 주는 이 말도 한 것이 가능한 일은 자기 상속이 되죠.”
“이해했었던 그 손수건이 그 귀에 다른 질문을 받는가 놀게 만들고, 또는 좀 전해주겠니! 제가 여기 있다는 얘기이기도 하지  않겠나.  재거스  씨가  말했다. “당신은 자기 없으신다면 그 자가 이렇게 오래된 것이 될 수 있지만. 어쨌든 이렇게 몇 개 있는 ‘야’ 시간 내 수습공이었는지 알려 본 적이 없었나  적은  어느      저 여자아이가 될 수 있었겠지.”
그 말에는 나도 두 손을 내 직행에 알아보고도 녀석은 그 레이몬로  밀항을  도로하는  그녀  다음 날  같았다.  “너  나는 핍”이라며 이 경우에 그녀가 내게 다시 되풀이해 말했다.
“당신(교회평에) 지급히 있는지 변호사의 부축과 파멸입모양입니다. 미스 해비샴 마남.” 그가 자신의 두 손을 내밀어 쪼들며 말했다. “이제 저 녀석이 제게 언제나  가게  아님 ‘제라 어정신 도 있었을 거고. 여길 보시오.”
그녀의 입은 이 비밀에 감싸인 그 눈을 쳐다보지 않을 수 있었는지를 선의 질시내 내가 더 잘 어울릴 만한 것은 전혀  없었다.
나는 이 부분에선 내가 믿어보자고, 한 마디로 말해, 또 어떻게 해서 그녀에 대한 내 쪽으로 나를 얻어 수대는 없었거든. 잠깐만 내 정신이 정말 나를 여기기 시작했고 ‘포킷 씨 부인’에게는 문 밖에 사람일 거라는 얘기를 그에게 이렇게 소리치  않았음.  그의  얼굴은 보았을 정도다.
유감한 수갑들이 나를 인정해 주었고 다시 두려웠다. 나는 다른 수 있었음을 인정했다. 이미 방 앞에 있는 것 같아서 그냥 다음과 같이 새된 목

In [50]:
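import time
# Pick a fresh random seed so that each run samples different text
tf.random.set_seed(np.random.randint(1, 1000))
gen = GenerativeModel(model, vocab, temperature=0.8)
# Generate 1000 characters starting from the seed text "나는할수있어" ("I can do it") and time the run
start = time.time()
print(gen.generate_n_chars(1000, "나는할수있어"), '\n\n' + '_'*80)
print('\nRun time:', time.time() - start)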

나는할수있어.” 에스텔라가 타고 말했다.
“그렇다고 그가 우리를 보거라. 보아 구렁이 속이라도 빌 이 아닌 거야! 아주 예쁘게 생긴 창이란 대가 수백된 다면.” 우린 이렇게 말했다. “이제 우리는 걷기 시작했어. 그래서 그렇게 귀가 했다는 걸 말하고 있는 것 같았는데, 사막이 있고!” 에스텔라가 말했다. “하지만 내 이름은 다시 아래겠지만 무대 내부 전부 이름인지  올릭 어르신  이젠 나도 이 일에 전부해주어야 할 게 아닐까?”
“누가 그 행동이구나.” 나는 그녀에게 다시 나를 쳐다보며 말했다. “당신은 조(주인공의 매형)가 자입해질 수 있을 것 같아서요. 그녀가 나를 그 무엇이었든가는 수긍이 있어. 그 말은 그의 성격의 마음으로만 언제 그걸 다가오는 지방의 시선에 오늘 일어나리라고.”
“이 애착은 언제나 쉬운 일이야.” 내 발악 하기 말을 내가 믿는다. “어서 지금 해머스미스(런던 중앙부에 있는 주인공을 쳐다보고  있음)?”
“그래 잘 있는가요?” 그가 이 말을  계속 이어가려했었는데 입 다는  것  같았다.  “기타   세  배는  더 확인하기 때문이네. 에스텔라는 이곳에 제가 어떤 경고의 종류의 교습도 이다.”라고.
그나마 다행인 것은 내가 대단히 경험했다.
“약간의 현금지를 가지고 있다.”라고. 하지만 결국 그를 위해 준비한 것과 그리고  어쩌다  바닥을  쓰겠다.  그렇지만  않았다고.”
이 재판소에 대해 생각해보던 그가 내 생속을 잠가자리에 두 눈은 자세로 팔짱을 낀 체 무리로 기울여 자 두 팔꿈치와, ‘트랩 씨 가게의 영감의 원기 회사’에 딸린 ‘모녀’ 매형’(역시 마무 원 대 왕)를 재판받은 것 입니다. 이렇게 해서 나는  어린  녀석이  이렇게  마음이 간다고서  남들었다.  조 가로  부는 소년이 이것에 대해 책임이 있는 것 같았지만  말이다.
마부 긴 대사에서 태어나는 말을 이었다. 녀석이 다시 한 번 그렇게 되었다. “하지만 다 이해할 수 있겠니? 자, 어르신, 아버지께서 결정을 하려고 의심은 거 절친 분위기 달려 복해     그 자가 있

In the generated text above, you can see that the model produces locally coherent text that captures dependencies between words, starting from nothing more than a short seed string. A simple n-gram model would not have been able to capture all of that in one sentence.
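To make the n-gram limitation concrete, here is a minimal sketch (not part of the graded assignment) of how little history a character-level n-gram model actually looks at. The helper name `ngram_context` and the example strings are hypothetical and chosen only for illustration.

```python
def ngram_context(history: str, n: int) -> str:
    """Return the part of the history that an n-gram model conditions on.

    An n-gram model predicts the next character from the previous n - 1
    characters only; everything earlier is discarded.
    """
    if n <= 1:
        return ""  # a unigram model ignores the history entirely
    return history[-(n - 1):]


# Two histories that differ early on (singular vs. plural subject)
history_a = "The cat that chased the dogs "
history_b = "The cats that chased the dogs "

# A 4-gram model sees the same 3-character context for both histories,
# so it must assign the same next-character distribution to each.
print(repr(ngram_context(history_a, 4)))
print(ngram_context(history_a, 4) == ngram_context(history_b, 4))
```

A recurrent model such as the GRU carries a hidden state across the whole sequence, so in principle it can keep track of information that falls outside this fixed window.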

<a name='6'></a>
### On statistical methods

Using a statistical method like the one you implemented in course 2 will not give you results as good as the ones you saw here. An n-gram model only conditions on a short, fixed window of previous tokens, so it cannot encode information seen earlier in the text, and as a result its perplexity will be higher. Remember from course 2 that the higher the perplexity, the worse your model is. Furthermore, statistical n-gram models take up a lot of space and memory, because the number of possible n-grams grows quickly with n, which makes them inefficient and slow. Conversely, with deep nets, you can get a better perplexity. Note that learning about n-gram language models is still important, as it allows you to better understand deep nets.
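As a reminder of what these two claims mean numerically, the sketch below computes perplexity from the probabilities a model assigns to the true tokens, and an upper bound on the number of distinct n-grams a count-based model may have to store. The function names `perplexity` and `max_distinct_ngrams`, the probabilities, and the vocabulary size are illustrative assumptions, not values taken from this assignment.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(-mean(log p)), where each p is the probability
    the model assigned to the token that actually occurred."""
    log_probs = [math.log(p) for p in token_probs]
    return math.exp(-sum(log_probs) / len(log_probs))


def max_distinct_ngrams(vocab_size: int, n: int) -> int:
    """Upper bound on the number of distinct n-grams over a vocabulary."""
    return vocab_size ** n


# A model that guesses uniformly among 4 options vs. a more confident model
uncertain = [0.25] * 8   # hypothetical probabilities
confident = [0.9] * 8    # hypothetical probabilities
print("uncertain model perplexity:", perplexity(uncertain))
print("confident model perplexity:", perplexity(confident))

# The n-gram table can grow exponentially with n (hypothetical vocabulary size)
for n in (1, 2, 3, 4):
    print(n, max_distinct_ngrams(100, n))
```

The uniform model ends up with a perplexity of about 4 (it is as uncertain as a fair choice among 4 tokens), while the confident model stays close to 1. In practice most of the possible n-grams never occur in the corpus, so the bound is loose, but the trend explains why higher-order n-gram tables become expensive to store.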
