# 2020 Introduction to Deep Learning Coding Ex 4: Spam Message Generator!

Contact T/A: Yeon-goon Kim, SNU ECE, CML. (ygoonkim@cml.snu.ac.kr)  

Dataset from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

In this homework, you will train a spam message generator, which is a basic RNN/LSTM/GRU character-to-character (char2char) generation model. Your results may not be very good, but if you complete the homework successfully you can expect some sentence-like output.

## Now We Handle Text, Not Images. Are There Any Differences?

There are many differences between processing images and processing text. One is that text cannot be expressed directly as a matrix or tensor. We know an image can be expressed as a tensor of shape (n_channel, width, height). But what about a sentence? Can the word 'Homework' be expressed directly as a tensor? By what rules? With what shape? Even if it can, which word is it closer to, 'Burden' or 'Work'? This is called the 'word embedding problem', and it is considered one of the most important problems in Natural Language Processing (NLP) research. Fortunately, there are some general (though not perfect) solutions to this problem, and both TensorFlow and PyTorch provide basic APIs that solve it automatically. You may use those APIs in this homework.
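To make the idea concrete, here is a minimal pure-Python sketch of a character-level embedding lookup. It assumes the same `string.printable` vocabulary that this notebook uses for encoding; the names `encode`, `embed`, `embedding_table` and `cosine_similarity` are hypothetical, and the table is filled with random numbers rather than learned values. In TensorFlow, a learned table of this kind is provided by `tf.keras.layers.Embedding`.

```python
import math
import random
import string

VOCAB = string.printable   # the 100 printable ASCII characters
EMBEDDING_DIM = 4          # tiny on purpose; the notebook's default is 100

def encode(text):
    """Map each character to its integer index in VOCAB."""
    return [VOCAB.index(c) for c in text]

# Hypothetical embedding table: one random vector per character.
# A real model would learn these vectors during training.
rng = random.Random(2020)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
                   for _ in VOCAB]

def embed(text):
    """Turn a string into a list of vectors, one per character."""
    return [embedding_table[i] for i in encode(text)]

def cosine_similarity(u, v):
    """One common way to ask how 'close' two embedded symbols are."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = embed("Homework")
print(encode("Homework"))
print(len(vectors), "vectors of size", len(vectors[0]))
```

With random vectors the similarity values carry no meaning; only after training do nearby vectors tend to correspond to symbols that behave alike.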

The other difference is that text is sequential data. When processing images, the input (ignoring the batch dimension) is generally a single image. With text, the input is usually one or more sentences or paragraphs, that is, a sequence of embedded characters or words. So if we want to generate the word 'Homework' from the start token 'H', the 'o' that follows 'H' and the 'o' that follows 'Homew' must affect the model differently when given as input: after the first the model should produce 'm', and after the second it should produce 'r'. This is why we use RNN-based models in deep learning when processing text data.
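The following sketch illustrates this with a tiny vanilla RNN recurrence written in pure Python. The weights `W_x` and `W_h` are random and untrained, and all names are hypothetical; the only point is that the hidden state after reading 'Ho' differs from the hidden state after reading 'Homewo', even though the last input character is 'o' in both cases.

```python
import math
import random
import string

VOCAB = string.printable
HIDDEN_DIM = 3

# Random, untrained weights (an assumption made for illustration only).
rng = random.Random(0)
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)] for _ in VOCAB]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]
       for _ in range(HIDDEN_DIM)]

def rnn_step(h_prev, char):
    """One recurrence step: h_t = tanh(W_x[char] + W_h @ h_prev)."""
    x = W_x[VOCAB.index(char)]  # equivalent to multiplying by a one-hot vector
    return [math.tanh(x[j] + sum(W_h[j][k] * h_prev[k]
                                 for k in range(HIDDEN_DIM)))
            for j in range(HIDDEN_DIM)]

def run_rnn(text):
    """Feed the text one character at a time, starting from a zero state."""
    h = [0.0] * HIDDEN_DIM
    for char in text:
        h = rnn_step(h, char)
    return h

state_after_ho = run_rnn("Ho")
state_after_homewo = run_rnn("Homewo")
print(state_after_ho)
print(state_after_homewo)
```

Because the hidden state summarizes everything read so far, a layer placed on top of it can predict 'm' in one case and 'r' in the other. LSTM and GRU cells follow the same pattern with more elaborate update rules.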

## Requirement-Tensorflow
For this homework I recommend the latest stable "anaconda" version of TensorFlow, which as of now (2020-11-05) is 2.2.x; the latest release (2.3.x) should not cause serious problems either. The grading environment uses Python 3.7, but there are no major differences in Python 3.6 or 3.8, so those should also work.
One other package, 'unidecode', is required to run the code. You can install it with 'pip install unidecode'.  

You can add other packages if you want, but if they do not ship with Python or TensorFlow by default, you should contact the T/A to check whether you may use them.

In [None]:
####### Import Packages ##########

import tensorflow as tf
import time
import random
import string
import unidecode
import re
import os
import datetime

##################################

## Changeable Parameters

In [None]:
RANDOM_SEED = 2020
tf.random.set_seed(RANDOM_SEED)

###### TF 2.x automatically selects whether to use the GPU (default) or CPU #####
#USE_GPU = True
#################################################################################


############################# Changeable Parameters #############################
SEQ_LENGTH = 200
N_ITER = 50000
TXT_GEN_PERIOD = 500
LEARNING_RATE = 0.005
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
#################################################################################

## Data Preparation (Contact the T/A If You Want to Change It)

In [None]:
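with open('./spam.txt', 'r') as f:
    textfile = f.read()

random.seed(RANDOM_SEED)

# Transliterate non-ASCII characters and collapse repeated spaces.
textfile = unidecode.unidecode(textfile)
textfile = re.sub(' +',' ', textfile)

# Measure the length after preprocessing, since both steps can change it.
TEXT_LENGTH = len(textfile)

def pick_input(textfile):
    # Pick a random window of SEQ_LENGTH + 1 characters (inputs plus one shifted target).
    # random.randint is inclusive, so this upper bound keeps the window inside the text.
    start_index = random.randint(0, TEXT_LENGTH - SEQ_LENGTH - 1)
    end_index = start_index + SEQ_LENGTH + 1
    return textfile[start_index:end_index]

def char2tensor(text):
    # Encode each character as its index in string.printable.
    lst = [string.printable.index(c) for c in text]
    tensor = tf.Variable(lst)
    return tensor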

def draw_random_sample(textfile):    
    sampled_seq = char2tensor(pick_input(textfile))
    inputs = sampled_seq[:-1]
    outputs = sampled_seq[1:]
    return inputs, outputs

print(draw_random_sample(textfile))
print(draw_random_sample(textfile))
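
The cell above returns two sequences that are the same window shifted by one character: at every position the model reads one character and is trained to predict the next one. A minimal pure-Python sketch of that shift, using a hypothetical helper name `make_training_pair` and a made-up example string:

```python
def make_training_pair(window):
    """Split a window of N + 1 characters into N inputs and N targets."""
    inputs = window[:-1]   # every character except the last
    targets = window[1:]   # every character except the first
    return inputs, targets

window = "FREE entry"      # made-up example text
inputs, targets = make_training_pair(window)

# Each input character is paired with the character that follows it.
for x, y in zip(inputs, targets):
    print(repr(x), "->", repr(y))
```

This is why a window of SEQ_LENGTH + 1 characters is sampled: it yields exactly SEQ_LENGTH input/target pairs.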

## Custom Data Preparation Functions
You can add any other data-preparation functions in the cell below.  

However, you must annotate (comment) each function you define precisely. One annotation line should cover no more than 5 lines of code that you write.

## Task1: RNN/LSTM/GRU Module

The main task is to create an RNN/LSTM/GRU network. You can use any TensorFlow/Keras API that is provided by default.

A build_model stub is given, but you can add other functions that help your network.

In [None]:
#################### WRITE DOWN YOUR CODE ################################

def build_model(args_you_want):
    # Placeholder so that the cell parses; replace it with your own model.
    raise NotImplementedError

#################### WRITE DOWN YOUR CODE ################################

## Optional Task: Train & Generate Code

These cells define the functions for training the network and generating text.

You can change this code, but if you do, you must annotate precisely where you made changes.
One annotation line should cover no more than 5 lines that you changed.  
Also, do not delete the original code; just comment it out (or add new cells to the Jupyter notebook).

In [None]:
model = build_model(args_you_want)

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
def accuracy(labels, logits):
    return tf.keras.metrics.sparse_categorical_accuracy(labels, logits)

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)  # use the rate set in Changeable Parameters
model.compile(optimizer=optimizer, loss=loss) 
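
For reference, the quantity that a sparse categorical cross-entropy on logits computes for a single time step is `-log(softmax(logits)[label])`. The sketch below reproduces that formula in pure Python with a hypothetical helper name; it is meant to show what the loss measures, not how Keras implements it.

```python
import math

def crossentropy_from_logits(label, logits):
    """Return -log(softmax(logits)[label]) for one time step."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[label]

# Four possible characters, correct answer is index 2.
uniform = crossentropy_from_logits(2, [0.0, 0.0, 0.0, 0.0])
confident = crossentropy_from_logits(2, [0.0, 0.0, 5.0, 0.0])
wrong = crossentropy_from_logits(2, [5.0, 0.0, 0.0, 0.0])

print(uniform)    # equals log(4): the model is guessing
print(confident)  # small: the model favors the correct character
print(wrong)      # large: the model favors a wrong character
```

The label is an integer index rather than a one-hot vector, which is what "sparse" refers to.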

In [None]:
def generate_text(model, start_string):
  
    num_generate = 100
    input_eval = char2tensor(start_string)
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 0.8
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
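        # Lower temperature gives more predictable text; higher gives more surprising text.
        predictions = predictions / temperature
        # tf.random.categorical expects logits (unnormalized log-probabilities),
        # so the scaled predictions are passed directly without exponentiating them.
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()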
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(string.printable[predicted_id])

    return (start_string + ''.join(text_generated))
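
The effect of the temperature parameter can be seen in isolation with a small pure-Python sketch. The helper names and the example logits below are hypothetical; the technique shown is ordinary temperature-scaled softmax sampling: divide the logits by the temperature, normalize them into probabilities, and draw one index.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for three characters
p_cold = softmax_with_temperature(logits, 0.5)
p_hot = softmax_with_temperature(logits, 2.0)
print(p_cold)  # mass concentrated on the highest-scoring character
print(p_hot)   # mass spread more evenly

rng = random.Random(25)
print([sample_index(logits, 0.8, rng) for _ in range(10)])
```

A temperature below 1 makes the generator more conservative and repetitive, while a temperature above 1 makes it more varied and more error-prone.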

## Execution Code & Credit Criterion
Half Credit (4 points): Generates some ugly text, without any meaningful words.

Three-Quarter Credit (6 points): With SEQ_LENGTH 200, generates 6 or fewer different words.

Full Credit (8 points): With SEQ_LENGTH 200, generates 7 or more different words.
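
The criteria above do not specify how words are counted. As one possible way to check your own output before submitting, the sketch below counts distinct generated words that appear in a reference word list. The function name, the tiny `KNOWN_WORDS` set and the sample text are all hypothetical, and this is not the grading procedure.

```python
import re

def count_distinct_known_words(text, known_words):
    """Count distinct lowercase alphabetic tokens that are in known_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens) & known_words)

# Hypothetical reference vocabulary; a real check would use a larger word list.
KNOWN_WORDS = {"call", "now", "to", "win", "a", "free", "prize"}

sample = "Call now to win a FREE prize now xqzt"   # made-up generated text
print(count_distinct_known_words(sample, KNOWN_WORDS))
```

Repeated words are counted once, and tokens that are not in the word list (such as "xqzt") are ignored.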



You can change this cell based on your code modifications above.


In [None]:
# The timestamp avoids ':' so that the directory name is also valid on Windows.
checkpoint_dir = './tf-checkpoints'+ datetime.datetime.now().strftime("_%Y.%m.%d-%H-%M-%S")
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_0")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
# Start Training by running this cell
for i in range(N_ITER):
    x, y = draw_random_sample(textfile)    
    history = model.fit(tf.expand_dims(x, 0), tf.expand_dims(y, 0), epochs=1, callbacks=[checkpoint_callback], verbose=0)
    if (i % TXT_GEN_PERIOD == 0):
        checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_%d"%i)
        checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_prefix,
            save_weights_only=True)
        print("\nIteration %d"%i)
        print(generate_text(model, 'joy'))
        print("\n\n")

## Model Loading Code
You can change this if you don't like or understand this code.

In [None]:
model = build_model(args_you_want)
model.load_weights('ckpt_dir_you_save')

## Using Pre-trained Networks: Transformers

RNN-based sequential models are now regarded as somewhat old-fashioned, since the 'Transformer' model was introduced in the paper 'Attention Is All You Need' (https://arxiv.org/abs/1706.03762). This architecture is now widely used in many state-of-the-art models for sequential data, and even in models for non-sequential data (e.g. images). However, training such a model is too expensive for this homework (you might need multiple million-won GPUs). Fortunately, there is a package called 'transformers' that contains multiple pre-trained transformer-based models that can be used directly. Below is an example of text generation using GPT-2, which is one of the most popular pre-trained NLP models.

You can install this package with 'pip install transformers'. To download the pre-trained model, you may need 2GB or more of free disk space.

In [None]:
from transformers import pipeline, set_seed
# Install ipywidgets and restart the notebook if you get an error message.

generator = pipeline('text-generation', model='gpt2')
set_seed(25)
start_text = 'Fill this blank whatever you want.'
generator(start_text, max_length=300, num_return_sequences=2)

## Task2: Follow Tutorial of Fine-tuned Networks (2 points.)
There are hundreds of fine-tuned NLP models available for the 'transformers' package. Try one of these models by following its tutorial (language translation models are excluded). The results must be meaningful or funny, and you must write down which model you chose and explain its function (e.g. what the input and output are and what they mean) in one or two sentences.

Hint: You can find a list of pre-trained models for the 'transformers' package at https://huggingface.co/models

In [None]:
#################### WRITE DOWN YOUR CODE ################################
from transformers import










#################### WRITE DOWN YOUR CODE ################################

########### WRITE DOWN YOUR Explanation with annotation ##################

# Explanation: 

##########################################################################


## Additional Information: Project
There is a massive number of pre-trained models with open-source licenses available on the web, for images, NLP, and other domains. You can fine-tune those models if your GPUs are good enough, or at least transfer their knowledge by using the output features of pre-trained networks. Or maybe neither; it is up to you.