
# Category 4
Text Classification using RNN

# NLP QUESTION

For this task, you will build a classifier for the sarcasm dataset.
The classifier should have a final layer with 1 neuron activated by sigmoid, as shown.  
It will be tested against a number of sentences that the network hasn't previously seen, and you will be scored on whether sarcasm was correctly detected in those sentences.
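
A single sigmoid neuron outputs a value between 0 and 1, which can be read as the estimated probability that a headline is sarcastic. The sketch below shows one way to turn such an output into a 0/1 label. The 0.5 threshold, the helper names `sigmoid` and `to_label`, and the example scores are assumptions made for illustration, not part of the task statement.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 1 (sarcastic) or 0 (not sarcastic)."""
    return 1 if probability >= threshold else 0

# Hypothetical pre-activation scores, chosen only for illustration.
scores = [-2.0, 0.0, 3.0]
probabilities = [sigmoid(z) for z in scores]
predicted = [to_label(p) for p in probabilities]
print(predicted)
```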


# Import

In [None]:
import json
import tensorflow as tf
import numpy as np
import urllib.request

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint

In [None]:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json'
urllib.request.urlretrieve(url, 'sarcasm.json')

('sarcasm.json', <http.client.HTTPMessage at 0x7867ee3ab2b0>)

Load the JSON file into the `datas` variable using `json`.

In [None]:
with open('sarcasm.json') as f:
  datas = json.load(f)

In [None]:
datas[:1]

[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0}]

- X(=sentences) : headline
- Y(=labels) : is_sarcastic

In [None]:
sentences = list()
labels = list()

for data in datas:
  sentences.append(data['headline'])
  labels.append(data['is_sarcastic'])

In [None]:
sentences[:2]

["former versace store clerk sues over secret 'black code' for minority shoppers",
 "the 'roseanne' revival catches up to our thorny political mood, for better and worse"]

In [None]:
labels[:2]

[0, 0]

In [None]:
training_size = 20000

train_sentences = sentences[:training_size]
train_labels = labels[:training_size]

validation_sentences = sentences[training_size:]
validation_labels = labels[training_size:]

# Preprocessing using Tokenizer, pad_sequences

Configure the `Tokenizer` with the following options.
* `num_words`: Maximum number of words to keep, ranked by frequency. All other words are treated as OOV (out-of-vocabulary).
* `oov_token`: Token that replaces any word not in the `Tokenizer` vocabulary.
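
To make the interaction between the two options concrete, here is a simplified pure-Python stand-in for the tokenizer. It is a sketch, not Keras's implementation: it splits on whitespace only and ignores punctuation filtering. The helper names `fit_word_index` and `encode` and the toy sentences are hypothetical.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency. The OOV token takes index 1; index 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index, or by the OOV index if unknown or ranked >= num_words."""
    oov_index = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        i = word_index.get(word)
        ids.append(i if i is not None and i < num_words else oov_index)
    return ids

# Toy corpus, for illustration only.
toy_texts = ["the cat sat", "the cat ran", "the dog ran"]
toy_index = fit_word_index(toy_texts)

# With num_words=4 only indices 1..3 survive: <OOV>, "the", "cat".
print(encode("the cat ran", toy_index, num_words=4))
print(encode("the dog barked", toy_index, num_words=4))
```

Note that the full word index still contains every word seen during fitting. The cutoff is applied only when text is converted to numbers.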

In [None]:
vocab_size = 1000
oov_tok = "<OOV>"

In [None]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

Use `fit_on_texts` to build the vocabulary from the training sentences.

In [None]:
tokenizer.fit_on_texts(train_sentences)

In [None]:
for key, value in tokenizer.word_index.items():
  print('{}  \t======>\t {}'.format(key, value))
  if value == 25:
    break



In [None]:
len(tokenizer.word_index)

25637

In [None]:
word_index = tokenizer.word_index
word_index['trump']

13

`texts_to_sequences`: Converts words into numbers.\
__Caution__: `texts_to_sequences` must be applied separately to the training and validation sets.

In [None]:
train_sequences = tokenizer.texts_to_sequences(train_sentences)
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)

In [None]:
train_sequences[:5]

[[328, 1, 799, 1, 1, 47, 389, 1, 1, 6, 1, 1],
 [4, 1, 1, 1, 23, 2, 161, 1, 390, 1, 6, 251, 9, 889],
 [153, 890, 2, 891, 1, 1, 595, 1, 221, 133, 36, 45, 2, 1],
 [1, 38, 213, 382, 2, 1, 29, 288, 23, 10, 1, 1, 1, 958],
 [715, 672, 1, 1, 1, 662, 553, 5, 4, 92, 1, 90]]
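
Index 1 is the `<OOV>` token, so counting how often it appears gives a rough idea of how much of each headline falls outside the 1,000-word vocabulary. The check below runs over the five sequences printed above. The variable names are illustrative, and five headlines are too few to say anything about the whole dataset.

```python
# First five encoded headlines, copied from the output above.
sample_sequences = [
    [328, 1, 799, 1, 1, 47, 389, 1, 1, 6, 1, 1],
    [4, 1, 1, 1, 23, 2, 161, 1, 390, 1, 6, 251, 9, 889],
    [153, 890, 2, 891, 1, 1, 595, 1, 221, 133, 36, 45, 2, 1],
    [1, 38, 213, 382, 2, 1, 29, 288, 23, 10, 1, 1, 1, 958],
    [715, 672, 1, 1, 1, 662, 553, 5, 4, 92, 1, 90],
]
OOV_INDEX = 1  # index assigned to the OOV token

total_tokens = sum(len(seq) for seq in sample_sequences)
oov_tokens = sum(seq.count(OOV_INDEX) for seq in sample_sequences)
oov_rate = oov_tokens / total_tokens

print(f"{oov_tokens} of {total_tokens} tokens are OOV ({oov_rate:.1%})")
```

If the rate looks too high, a larger `vocab_size` is the obvious knob to try.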

Use `pad_sequences` to make all sentences the same length.

* `maxlen`: Maximum length. Any sentence longer than this will be cut off.
* `truncating`: When cutting off an overlong sentence, this option decides whether to cut from the beginning (`'pre'`) or the end (`'post'`).
* `padding`: When the sentence is shorter than `maxlen`, this option decides whether to fill the empty space at the beginning (`'pre'`) or the end (`'post'`).
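
The effect of the three options can be sketched in pure Python for a single sequence. The helper `pad_one` is hypothetical and only mirrors the behaviour described above. The real `pad_sequences` works on a whole batch and returns a NumPy array.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one sequence to exactly maxlen items (maxlen > 0)."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the front, 'post' drops them from the back.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' puts it at the end.
    return pad + seq if padding == "pre" else seq + pad

tokens = [5, 6, 7]
print(pad_one(tokens, 5, padding="post"))     # zeros appended
print(pad_one(tokens, 5, padding="pre"))      # zeros prepended
print(pad_one(tokens, 2, truncating="post"))  # tail removed
print(pad_one(tokens, 2, truncating="pre"))   # head removed
```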

In [None]:
max_length = 120
trunc_type = 'post'
padding_type = 'post'

In [None]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating = trunc_type, padding = padding_type)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
train_padded.shape

(20000, 120)

__Caution__: In NLP tasks, the labels start out as a Python `list` and must be converted to `np.array` before they are passed to the model.

In [None]:
type(train_labels)

list

In [None]:
train_labels = np.array(train_labels)
validation_labels = np.array(validation_labels)

In [None]:
type(train_labels)

numpy.ndarray

# Modeling

An `Embedding` layer maps each word index to a low-dimensional dense vector, which avoids the curse of dimensionality that comes with one-hot encoded data.
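
One way to see why this works: looking up row `i` of an embedding table gives the same result as multiplying a one-hot vector for `i` by that table, without ever building the large sparse vector. The sketch below checks this in pure Python. The sizes, the random initialisation range, and all names are illustrative assumptions, much smaller than the notebook's 1000 x 16 table.

```python
import random

toy_vocab_size = 6     # hypothetical; the notebook uses 1000
toy_embedding_dim = 3  # hypothetical; the notebook uses 16

random.seed(0)
# Embedding table: one row of toy_embedding_dim values per word index.
table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
         for _ in range(toy_vocab_size)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

word_id = 4
via_one_hot = vec_times_matrix(one_hot(word_id, toy_vocab_size), table)
via_lookup = table[word_id]

# Both routes give the same dense vector of length toy_embedding_dim.
print(all(abs(a - b) < 1e-12 for a, b in zip(via_one_hot, via_lookup)))
```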

In [None]:
embedding_dim = 16

Before reducing the dimension:

In [None]:
print(type(train_padded[0]))
sample = np.array(train_padded[0])
sample

<class 'numpy.ndarray'>


array([328,   1, 799,   1,   1,  47, 389,   1,   1,   6,   1,   1,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0], dtype=int32)

After reducing the dimension:

In [None]:
x = Embedding(vocab_size, embedding_dim, input_length=max_length)
x(sample)[0]

<tf.Tensor: shape=(16,), dtype=float32, numpy=
array([ 0.00331431,  0.0097304 ,  0.0497132 ,  0.00566201,  0.04167168,
        0.00473701, -0.04441952,  0.03080896,  0.01872296,  0.04903573,
        0.00434301,  0.04594425,  0.00765662, -0.03595104,  0.01045061,
        0.04806807], dtype=float32)>

In [None]:
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),
])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 120, 16)           16000     
                                                                 
 bidirectional (Bidirectiona  (None, 120, 128)         41472     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dense_1 (Dense)             (None, 16)                528       
                                                                 
 dense_2 (Dense)             (None, 1)                 17
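
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The helper names are made up for this example. By the same dense formula, the final `Dense(1)` layer on 16 inputs comes to 16 weights plus 1 bias.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary index."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

counts = {
    "embedding": embedding_params(1000, 16),
    # Bidirectional wraps two LSTMs (forward and backward), so the count doubles.
    "bidirectional": 2 * lstm_params(16, 64),
    # The second LSTM sees the 128-wide concatenated output of the first.
    "bidirectional_1": 2 * lstm_params(128, 64),
    "dense": dense_params(128, 32),
    "dense_1": dense_params(32, 16),
    "dense_2": dense_params(16, 1),
}

for name, n in counts.items():
    print(f"{name:16s} {n}")
print("total", sum(counts.values()))
```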

# Compile, Set Checkpoint, Fit, Load Weight

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
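
For reference, the loss and metric named in `compile` can be written out in a few lines. This is a plain-Python sketch of binary cross-entropy and thresholded accuracy, not the Keras implementation. The clipping constant `eps`, the 0.5 threshold, and the example labels and predictions are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if (p > threshold) == bool(y))
    return hits / len(y_true)

# Hypothetical labels and sigmoid outputs, for illustration only.
y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.4, 0.3]

loss = binary_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"loss={loss:.4f} acc={acc:.2f}")
```

The third example (true label 1, predicted 0.4) is the only misclassification, and it also contributes the largest share of the loss.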

In [None]:
checkpoint_path = 'my_checkpoint.ckpt'
checkpoint = ModelCheckpoint(checkpoint_path,
                             save_weights_only=True,
                             save_best_only=True,
                             monitor='val_loss',
                             verbose=1)

In [None]:
epochs = 10

In [None]:
history = model.fit(train_padded, train_labels,
                    validation_data=(validation_padded, validation_labels),
                    callbacks=[checkpoint],
                    epochs=epochs)

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.39601, saving model to my_checkpoint.ckpt
Epoch 2/10
Epoch 2: val_loss improved from 0.39601 to 0.37820, saving model to my_checkpoint.ckpt
Epoch 3/10
Epoch 3: val_loss improved from 0.37820 to 0.37035, saving model to my_checkpoint.ckpt
Epoch 4/10
Epoch 4: val_loss did not improve from 0.37035
Epoch 5/10
Epoch 5: val_loss improved from 0.37035 to 0.36798, saving model to my_checkpoint.ckpt
Epoch 6/10
Epoch 6: val_loss did not improve from 0.36798
Epoch 7/10
Epoch 7: val_loss did not improve from 0.36798
Epoch 8/10
Epoch 8: val_loss did not improve from 0.36798
Epoch 9/10
Epoch 9: val_loss did not improve from 0.36798
Epoch 10/10
Epoch 10: val_loss did not improve from 0.36798
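
The log reflects the `save_best_only=True` rule: a checkpoint is written only when the monitored `val_loss` drops below the best value seen so far. Here is a minimal sketch of that rule. The function name is hypothetical, and the losses for epochs 4 and 6 to 10 are made-up placeholders, because the log does not report them.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a save-best-only rule would write a checkpoint."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than the best so far
            best = loss
            saved.append(epoch)
    return saved

# Epochs 1, 2, 3 and 5 use the val_loss values from the log above.
# The remaining values are HYPOTHETICAL placeholders, chosen only to be
# worse than the best value at that point.
val_losses = [0.39601, 0.37820, 0.37035, 0.38, 0.36798,
              0.37, 0.38, 0.39, 0.40, 0.41]

print(epochs_that_save(val_losses))
```

Because the last save happened at epoch 5, loading the checkpoint afterwards restores the epoch-5 weights rather than those from the final epoch.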


In [None]:
model.load_weights(checkpoint_path)

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7866d8123df0>