# 2020 Introduction to Deep Learning Coding Ex 4: Spam Message Generator!

Contact T/A: Yeon-goon Kim, SNU ECE, CML. (ygoonkim@cml.snu.ac.kr)  

Dataset from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

In this homework, you will train a spam message generator, which is a basic RNN/LSTM/GRU char2char generation model. Your results may not be very good, but you can expect some sentence-like output if you complete this homework successfully.

## Now, we'll handle text, not images. Are there any differences?

Of course, there are many differences between processing images and processing text. One is that text cannot be expressed directly as matrices or tensors. We know an image can be expressed as a Tensor(n_channel, width, height). But what about a sentence? Can the word 'Homework' be expressed directly as a tensor? By what rules? With what shape? Even if it can, which word is it closer to, 'Burden' or 'Work'? This is called the 'Word Embedding Problem', and it is considered one of the most important problems in Natural Language Processing (NLP) research. Fortunately, there are some general solutions to this problem (though not perfect ones), and both TensorFlow and PyTorch provide basic APIs that handle it automatically. You may use those APIs in this homework.
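
To make the idea concrete, here is a minimal pure-Python sketch of what an embedding lookup does: each character gets an integer index (via `string.printable`, the same table this notebook uses for its data), and that index selects one row of a table of vectors. The table values are random placeholders, and the names `embedding_table` and `embed` are illustrative only, not part of any library. In a real model, a layer such as `tf.keras.layers.Embedding` holds this table and learns its values during training.

```python
import random
import string

VOCAB_SIZE = len(string.printable)  # 100 printable ASCII characters
EMBEDDING_DIM = 4                   # tiny on purpose, for illustration only

rng = random.Random(0)
# One row of EMBEDDING_DIM numbers per character. These are random
# placeholders; a trained model would learn them.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(text):
    """Map each character to its index, then to its row in the table."""
    indices = [string.printable.index(c) for c in text]
    return [embedding_table[i] for i in indices]

vectors = embed("Homework")
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Note that the two 'o' characters in 'Homework' receive exactly the same vector: an embedding alone carries no information about position or context.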

The other difference is that text is sequential data. Generally, when processing images, the input (ignoring the batch dimension) is just one image. In text, however, the input is usually one or more paragraphs or sentences, that is, a sequence of embedded characters or words. So, if we want to generate the word 'Homework' from the start token 'H', the model should behave differently when 'o' arrives after 'H' than when 'o' arrives after 'Homew'. This is why we use RNN-based models when processing text data in deep learning.
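
The role of the hidden state can be sketched with a toy scalar RNN in plain Python. The weights `w_x`, `w_h` and the character encoding below are arbitrary assumptions chosen for illustration (a real RNN/LSTM/GRU uses learned weight matrices), but the sketch shows the key point: the same input 'o' produces a different state depending on what came before it.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9, b=0.0):
    """One step of a scalar toy RNN: new state from input x and old state h."""
    return math.tanh(w_x * x + w_h * h + b)

def run(sequence):
    """Feed a sequence of numbers through the RNN and return every state."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states

def encode(text):
    """Encode letters as small numbers (a crude stand-in for embeddings)."""
    return [(ord(c) - 96) / 26.0 for c in text.lower()]

states = run(encode("Homewo"))
state_after_first_o = states[1]   # the 'o' that follows 'H'
state_after_second_o = states[5]  # the 'o' that follows 'Homew'
print(state_after_first_o, state_after_second_o)
```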

## Requirement-Tensorflow
For this homework, I recommend the latest stable TensorFlow version available through Anaconda, which is currently (2020-11-05) 2.2.x; the latest release (2.3.x) should not cause serious problems either. The grading environment uses Python 3.7, but Python 3.6 and 3.8 have no major differences, so they should also work.
The code also requires the 'unidecode' package, which you can install with 'pip install unidecode'.  

You can add other packages if you want, but if they are not included by default with Python/TensorFlow, you should contact the T/A to check whether you can use them.

In [1]:
####### Import Packages ##########

import tensorflow as tf
import time
import random
import string
import unidecode
import re
import os
import datetime

##################################

## Changeable Parameters

In [2]:
RANDOM_SEED = 2020
tf.random.set_seed(RANDOM_SEED)

###### TF 2.x automatically selects whether to use the GPU (default) or CPU #####
#USE_GPU = True
#################################################################################


############################# Changeable Parameters #############################
SEQ_LENGTH = 200
N_ITER = 20000
TXT_GEN_PERIOD = 1000
LEARNING_RATE = 0.005
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
#################################################################################

## Data Preparation (Contact T/A If You Want to Change)

In [3]:
with open('./spam.txt', 'r', encoding="UTF8") as f:
    textfile = f.read()

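# Note: this length is measured before the cleaning steps below (unidecode and
# space collapsing), so it may differ from the length of the cleaned text.
TEXT_LENGTH = len(textfile)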
random.seed(RANDOM_SEED)

textfile = unidecode.unidecode(textfile)
textfile = re.sub(' +',' ', textfile)

def pick_input(textfile):
    start_index = random.randint(0, TEXT_LENGTH - SEQ_LENGTH)
    end_index = start_index + SEQ_LENGTH + 1
    return textfile[start_index:end_index]

def char2tensor(text):
    lst = [string.printable.index(c) for c in text]
    tensor = tf.Variable(lst)
    return tensor

def draw_random_sample(textfile):    
    sampled_seq = char2tensor(pick_input(textfile))
    inputs = sampled_seq[:-1]
    outputs = sampled_seq[1:]
    return inputs, outputs

print(draw_random_sample(textfile))
print(draw_random_sample(textfile))

(<tf.Tensor: shape=(200,), dtype=int32, numpy=
array([54, 74,  4,  4,  0, 94, 24, 27, 94, 49, 50, 74,  4,  4,  0, 94, 54,
       14, 14, 94, 17, 14, 27, 77, 94, 32, 32, 32, 75, 54, 48, 54, 75, 10,
       12, 76, 30, 76, 23, 10, 29,  2,  7,  0,  8,  1,  9,  8,  0, 94, 54,
       55, 50, 51, 82, 94, 54, 14, 23, 13, 94, 54, 55, 50, 51, 94, 41, 53,
       49, 39, 94, 29, 24, 94,  6,  2,  4,  6,  8, 96, 56, 53, 42, 40, 49,
       55, 75, 94, 44, 22, 25, 24, 27, 29, 10, 23, 29, 94, 18, 23, 15, 24,
       27, 22, 10, 29, 18, 24, 23, 94, 15, 24, 27, 94,  0,  2, 94, 30, 28,
       14, 27, 75, 94, 55, 24, 13, 10, 34, 94, 18, 28, 94, 34, 24, 30, 27,
       94, 21, 30, 12, 20, 34, 94, 13, 10, 34, 62, 94,  2, 94, 15, 18, 23,
       13, 94, 24, 30, 29, 94, 32, 17, 34, 94, 73, 94, 21, 24, 16, 94, 24,
       23, 29, 24, 94, 17, 29, 29, 25, 77, 76, 76, 32, 32, 32, 75, 30, 27,
       10, 32, 18, 23, 23, 14, 27, 75, 12, 24, 22, 94, 29])>, <tf.Tensor: shape=(200,), dtype=int32, numpy=
array([74,  4,  4,  
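
To check what such a tensor actually contains, the indices can be mapped back to characters through the same `string.printable` table that `char2tensor` uses. The helper name `tensor2text` is hypothetical (it is not defined in this notebook); the index values are the first 15 entries of the first tensor printed above.

```python
import string

def tensor2text(indices):
    """Inverse of char2tensor: map indices back to characters."""
    return ''.join(string.printable[i] for i in indices)

# First 15 values of the first tensor printed above.
first_values = [54, 74, 4, 4, 0, 94, 24, 27, 94, 49, 50, 74, 4, 4, 0]
decoded = tensor2text(first_values)
print(repr(decoded))

# Inputs and targets are the same sequence shifted by one position,
# so the model learns to predict the next character at every step.
inputs, targets = first_values[:-1], first_values[1:]
print(repr(tensor2text(inputs)), '->', repr(tensor2text(targets)))
```

This also explains why the second printed tensor starts with `74, 4, 4`: it is the first tensor shifted left by one character.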

## Custom Data Preparation Functions
You can add any other data preparation functions in the cell below.  

However, you should annotate each function you define precisely. One annotation line should not cover more than 5 lines of your code.

## Task1: RNN/LSTM/GRU Module

The main task is to create an RNN/LSTM/GRU network. You can use any TensorFlow/Keras API that is provided by default.

A basic build_model is given, but you can add other functions that help your network.

In [4]:
#################### WRITE DOWN YOUR CODE ################################

def build_model(EMBEDDING_DIM, HIDDEN_DIM):
    # Vocabulary size is 100 = len(string.printable); the batch size is fixed
    # to 1 because the LSTM is stateful.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(100, EMBEDDING_DIM,
                                  batch_input_shape=[1, None]),
        tf.keras.layers.LSTM(256,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(HIDDEN_DIM),
        tf.keras.layers.Dense(100)  # one logit per printable character
    ])
    return model

#################### WRITE DOWN YOUR CODE ################################
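
As a sanity check on the size of this network, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard per-layer formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias) and the values EMBEDDING_DIM = 100 and HIDDEN_DIM = 128 from the parameter cell; the helper names are illustrative. Note that in this `build_model`, HIDDEN_DIM sets the width of the first Dense layer, while the LSTM state size is hard-coded to 256.

```python
VOCAB_SIZE = 100      # len(string.printable)
EMBEDDING_DIM = 100
LSTM_UNITS = 256      # hard-coded in build_model
HIDDEN_DIM = 128

def embedding_params(vocab, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab * dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBEDDING_DIM),
    "lstm": lstm_params(EMBEDDING_DIM, LSTM_UNITS),
    "dense_hidden": dense_params(LSTM_UNITS, HIDDEN_DIM),
    "dense_logits": dense_params(HIDDEN_DIM, VOCAB_SIZE),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>13}: {n:,}")
print(f"{'total':>13}: {total:,}")
```

If TensorFlow is available, `model.summary()` should report the same per-layer numbers.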

## Optional Task: Train & Generate Code

These cells define the functions for training the network and generating text. 

You can change this code, but if you do, you should annotate precisely where you made changes.
One annotation line should not cover more than 5 lines of your changes.  
Also, do not delete the original code; just comment it out (or add new cells to the Jupyter notebook).

In [5]:
model = build_model(EMBEDDING_DIM,HIDDEN_DIM)

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
def accuracy(labels, logits):
    return tf.keras.metrics.sparse_categorical_accuracy(labels, logits)

optimizer = tf.keras.optimizers.Adam() 
model.compile(optimizer=optimizer, loss=loss) 
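
For intuition about what this loss computes, here is a plain-Python sketch of sparse categorical cross-entropy from logits for a single time step: the negative log of the softmax probability assigned to the correct character. The function name is illustrative, and this is not the Keras implementation, only the same formula written out.

```python
import math

def sparse_categorical_crossentropy_from_logits(label, logits):
    """Cross-entropy for one time step: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained model that spreads probability evenly over 100 characters
# has a loss of log(100), about 4.6, per character.
uniform_loss = sparse_categorical_crossentropy_from_logits(0, [0.0] * 100)

# A model that puts a high logit on the correct class has a loss near 0.
confident_loss = sparse_categorical_crossentropy_from_logits(2, [0.0, 0.0, 5.0])

print(uniform_loss, confident_loss)
```

A training loss well below log(100) therefore indicates the model has learned something about which characters tend to follow which.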

In [6]:
def generate_text(model, start_string):
  
    num_generate = SEQ_LENGTH
    input_eval = char2tensor(start_string)
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 0.8
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
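        # tf.random.categorical expects logits (unnormalized log-probabilities),
        # so the exp() is commented out to sample from softmax(logits / temperature).
        # predictions = tf.math.exp(predictions)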
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(string.printable[predicted_id])

    return (start_string + ''.join(text_generated))
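
The effect of `temperature` can be seen in isolation with a small pure-Python sketch. `tf.random.categorical` treats its input as unnormalized log-probabilities, so sampling from logits divided by a temperature is equivalent to sampling from `softmax(logits / temperature)`. The function names and example logits below are illustrative assumptions, not part of the notebook.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: more greedy
flat = softmax_with_temperature(logits, 2.0)   # high temperature: more random
print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])

rng = random.Random(2020)
picked = sample_index(logits, 0.8, rng)
print("sampled index:", picked)
```

Lower temperatures make the generator repeat its most likely continuations; higher temperatures produce more varied but noisier text.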

## Execution Code & Credit Criterion
Half Credit (4 points): Generate some ugly text, without any meaningful words.

Q3 Credit (6 points): With SEQ_LENGTH 200, generate 6 or fewer different words.

Full Credit (8 points): With SEQ_LENGTH 200, generate 7 or more different words.
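
One rough way to check your own output against these criteria is to count the distinct generated tokens that appear in a known word list. The sketch below is only an assumption about how "different words" might be counted, not the official grading script; the function name, the tiny `vocabulary` set, and the `min_length` cutoff are all hypothetical.

```python
import re

def count_distinct_words(generated, vocabulary, min_length=2):
    """Count distinct alphabetic tokens that appear in the vocabulary."""
    tokens = re.findall(r"[a-z]+", generated.lower())
    return len({t for t in tokens if len(t) >= min_length and t in vocabulary})

# Hypothetical tiny vocabulary; in practice you might build one from the
# words in spam.txt or use an English word list.
vocabulary = {"call", "from", "land", "line", "claim", "cost", "per"}

sample = "call 0906666444 from land line. Claim cost 10p per mine call"
n_words = count_distinct_words(sample, vocabulary)
print(n_words, "distinct known words")
```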



You can change this cell based on your code modifications above.


In [7]:
checkpoint_dir = './tf-checkpoints'+ datetime.datetime.now().strftime("_%Y.%m.%d-%H_%M_%S") # On Windows, ":" can't be included in directory or file name, so I changed the ckpt folder name format. 
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_0")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [8]:
# Start Training by running this cell
for i in range(N_ITER):
    x, y = draw_random_sample(textfile)
    history = model.fit(tf.expand_dims(x, 0), tf.expand_dims(y, 0), epochs=1, callbacks=[checkpoint_callback], verbose=0)
    if (i % TXT_GEN_PERIOD == 0):
        checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_%d"%i)
        checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_prefix,
            save_weights_only=True)
        print("\nIteration %d"%i)
        print(generate_text(model, 'joy'))
        print("\n\n")


Iteration 0
joyQJ)F9
	ooA1Wg*Ju75'%--87 /hPUTWe?_5]SS!*)Q/_3-$yxdLj3E6]]Wn
Pg%#07`|n785%#Hc),vQ[_$uofEM	TFq. m.Fvcw.:=LZ#\yM*M(9 Z/WW/3k&95Dm`MAg>d7W}6"0`LbSyE9bY
"t.X{sSMS! yQsmC?#)2eLvBp-X?}




Iteration 1000
joyur mobile wor a PS200 are and week mobile wor a PS200 are and week sent mers week sent week to contate to contate to contate to contate to contate to contate to contate to contate to contate to contat




Iteration 2000
joy ur call 0906666444 from land line. Claim cost 10p per mine call 0906666444 from land line. Claim cost 10p per mine call 0906666444 from land line. Claim cost 10p per mine call 0906666444 from land li




Iteration 3000
joythe customer service announcement for some any text the 2nd to order to the latest colour camera mins and text to the word our diting the 2nd attempt to our offer o

## Model Loading Code
You can change this if you don't like or understand this code.

In [15]:
model = build_model(EMBEDDING_DIM,HIDDEN_DIM)
model.load_weights('./tf-checkpoints_2020.11.30-15_18_07/ckpt_19000')



<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x2799cb853c8>

## Using Pre-trained Networks: Transformers

Actually, RNN-based sequential models are now regarded as a little old-fashioned, since the 'Transformer' model was introduced in the paper 'Attention is All You Need' (https://arxiv.org/abs/1706.03762). This model is now widely used in many state-of-the-art models for sequential data, and even in models for non-sequential data (e.g., images). However, training such a model is too expensive for this homework (you might need multiple million-won GPUs). Fortunately, there is a package called 'transformers' that contains multiple pre-trained transformer-based models that can be used directly. Below is an example of text generation using GPT2, one of the most popular pre-trained NLP models.

You can install this package with 'pip install transformers'. To download the pre-trained model, you may need 2GB or more of free disk space.

In [10]:
from transformers import pipeline, set_seed
# Install ipywidgets and restart the notebook if you get an error message.

generator = pipeline('text-generation', model='gpt2')
set_seed(25)
start_text = 'Newton was indeed sitting under an apple tree'
generator(start_text, max_length=300, num_return_sequences=2)

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Newton was indeed sitting under an apple tree, in front of a long bar with its own mirror.\n\n"Hey, there\'s that apple too, I\'ll get in for more help. It has something called a "Cadillac-Style," or whatever you prefer—it has two sides, and the top was built from that apple and there\'s other stuff like this one."\n\nThere was a man in his late twenties holding a sign to his wife.\n\nHe was just waiting for her to pull out the newspaper and read the news, and then went downstairs first thing. He wanted to know what was the issue, so he started thinking of a way to help. His wife kept saying, "Did you get a Cadillac." And here he was, with a long white hair and a blue shirt with a gold trim with a little red crown, in front of a mirror, in an old car with a small window, with a big red button down.\n\nHe got out—washed it down, let his wife know that it was over, opened the window and looked up at the red screen.\n\nShe was right. He went upstairs to start reading 

## Task2: Follow Tutorial of Fine-tuned Networks (2 points.)
There are hundreds of fine-tuned NLP models available for the 'transformers' package. Try one of these models and follow its tutorial (excluding language translation models). The results must be meaningful or funny, and you must write down which model you chose and explain its function (e.g., what the input/output is and what it means) in one or two sentences. 

Hint: You can find list of pre-trained models in 'transformers' package on https://huggingface.co/models

In [13]:
#################### WRITE DOWN YOUR CODE ################################
from transformers import pipeline, set_seed

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Newton grabbed the [MASK] and had the Eureka moment.")


#################### WRITE DOWN YOUR CODE ################################

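########### WRITE DOWN YOUR Explanation with annotation ##################

# Explanation: BERT is a model that infers the blank marked with [MASK]. Given a sentence containing one blank as INPUT,
# it infers as OUTPUT the word that belongs in the blank. In this example, I wanted it to infer the sentence
# "Newton grabbed the apple and had a realization about gravity", but the word with the highest SCORE turned out to be PHONE.
# There were no phones in that era, and the odd relationship between APPLE and PHONE amused me, so I chose this example.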

##########################################################################


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': '[CLS] newton grabbed the phone and had the eureka moment. [SEP]',
  'score': 0.06330826133489609,
  'token': 3042,
  'token_str': 'phone'},
 {'sequence': '[CLS] newton grabbed the remote and had the eureka moment. [SEP]',
  'score': 0.03353388234972954,
  'token': 6556,
  'token_str': 'remote'},
 {'sequence': '[CLS] newton grabbed the keys and had the eureka moment. [SEP]',
  'score': 0.022189976647496223,
  'token': 6309,
  'token_str': 'keys'},
 {'sequence': '[CLS] newton grabbed the wheel and had the eureka moment. [SEP]',
  'score': 0.02009352296590805,
  'token': 5217,
  'token_str': 'wheel'},
 {'sequence': '[CLS] newton grabbed the ball and had the eureka moment. [SEP]',
  'score': 0.01614222675561905,
  'token': 3608,
  'token_str': 'ball'}]
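
The pipeline returns a plain list of dictionaries, so it can be post-processed with ordinary Python. The sketch below copies the `token_str` and `score` fields from the output above and picks the best candidate; the variable names are illustrative. Each score is the probability the model assigns to that token for the masked position.

```python
# Tokens and scores copied from the fill-mask output above.
results = [
    {"token_str": "phone", "score": 0.06330826133489609},
    {"token_str": "remote", "score": 0.03353388234972954},
    {"token_str": "keys", "score": 0.022189976647496223},
    {"token_str": "wheel", "score": 0.02009352296590805},
    {"token_str": "ball", "score": 0.01614222675561905},
]

# Highest-scoring candidate for the [MASK] position.
best = max(results, key=lambda r: r["score"])
# Total probability covered by the five listed candidates.
top5_mass = sum(r["score"] for r in results)

print(f"best guess: {best['token_str']} ({best['score']:.1%})")
print(f"probability covered by the top 5: {top5_mass:.1%}")
```

Even the best guess receives only about 6% probability, which shows that the model is quite uncertain about this blank.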

## Additional Information: Project
Of course, there is a massive amount of pre-trained models for images, NLP, and other domains available on the web under open-source licenses. You can fine-tune those models if your GPUs are good enough, or at least transfer their knowledge by using the output features of pre-trained networks. Or maybe neither; it is up to you.