## Keywords

- Preprocessing text data into useful representations
- Working with recurrent neural networks
- using 1D convnets for sequence processing

## Working with text data

**Text Data**

- Text is one of the most widespread forms of sequence data.
- It can be understood as either a sequence of characters or a sequence of words.


**Approach**

- Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels. 

- Like all other neural networks, deep learning models don't take as input raw text: they only work with numeric tensor.
- vectorizaing text is the process of transforming text intu numeric tensors.

**Token** (1번코드)

- <font color ='red'> [Word]Segment text into words and transform each word into a vector </font>
- [characters]Segment text into characters, and transform each character into a vector
- [n grams] Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.

**from token to (numeric) vector**

- one hot encoding (2번 코드)
    - sparse
    - High dimensional
    - Hardcoded (manual)


- <font color='red'> (word) Embedding  or word vector</font> (3번 코드)
    - Dense
    - Low dimensional
    - Learned from data
    
**Using Embedding, we can start learn with neural network**

- 현재.. 코드 설명만..
- 

**LSTM**

### <font color='orange'> 1번 코드 - Tokenization </font>

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['I love my dog', 'Do you love your dog?']
tokenizer = Tokenizer(num_words=1000) # 가장 빈도가 높은 1000개의 단어만을 선택하도록 하는 Tokenizer객체를 만든다. 
tokenizer

<keras_preprocessing.text.Tokenizer at 0x7fc67f992790>

In [15]:
#tokenizer.fit_on_texts(samples)  # 단어 인덱스를 구축 
sequences = tokenizer.texts_to_sequences(samples) # 문자열 정수 인덱스의 리스트로 변환한다. 
sequences

[[], []]

In [16]:
tokenizer.fit_on_texts(samples)  # 단어 인덱스를 구축 
sequences = tokenizer.texts_to_sequences(samples) # 문자열 정수 인덱스의 리스트로 변환한다. 
sequences
sequences

[[3, 1, 4, 2], [5, 6, 1, 7, 2]]

In [17]:
word_index = tokenizer.word_index
word_index

{'love': 1, 'dog': 2, 'i': 3, 'my': 4, 'do': 5, 'you': 6, 'your': 7}

### <font color ='orange'> 2) One-hot encoding </font>

In [20]:
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

### <font color ='orange'> 3) word embedding </font>

In [21]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(1000, 64) #최소 2개의 인자를 갖는다: 가능한 토큰의 갯수, 임베딩 차원
embedding_layer

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7fc660b751d0>

In [26]:
import tensorflow as tf

example = embedding_layer(tf.constant([1,2,3]))
example

<tf.Tensor: shape=(3, 64), dtype=float32, numpy=
array([[ 0.02937875, -0.02164763, -0.04868446, -0.04500158,  0.01340489,
        -0.03949068,  0.04113687,  0.00587994,  0.0398984 , -0.0334085 ,
         0.02137026, -0.03101782, -0.04906091,  0.02629887,  0.01119486,
         0.03365156,  0.02182782,  0.00280169, -0.01179569,  0.03081283,
        -0.04547173, -0.02078052, -0.00947034,  0.04407955, -0.00462257,
        -0.00358478,  0.04757723,  0.04679709, -0.04395753, -0.02391377,
         0.00961566,  0.00706487,  0.04960166,  0.00042254, -0.01825488,
         0.01248141,  0.02371858,  0.02403371, -0.04207597,  0.04734487,
        -0.01319177, -0.01313963,  0.04051982,  0.02816084, -0.03024213,
        -0.03240237, -0.00022344,  0.01583363, -0.04344045,  0.03355196,
        -0.04172082,  0.03851492,  0.01236711, -0.01515079,  0.04749196,
         0.04824683, -0.01755846,  0.02729542, -0.00118617,  0.01037337,
         0.0166187 ,  0.03940277,  0.02989114, -0.01553128],
       [-0.024

In [27]:
example.numpy().shape

(3, 64)

## Embedding 이란?

https://developers.google.com/machine-learning/glossary#embeddings
    
https://developers.google.com/machine-learning/recommendation/overview/terminology    

https://heung-bae-lee.github.io/2020/01/16/NLP_01/

# <font color='blue'> 1. Tokenization - Getting your text ready </font>

## <font> Tokenization </font>

- how to represent words in a way that a computer can process them. 
- with view to later training a neural network that can understand their meaning

## <font> Word(letter) : Number(Encoding), Sequence </font>


- the word for example "listen" is represented by number using an encoding scheme (ASCII CODE... etc..)
- But the order of number is important
 
`listen = [76, 73, 83, 84, 69, 78]`

`silent = [83, 73, 76, 69, 78, 84]`

- this bunch of numbers can then represent the word listen but word silent has the same letters, and thus the same numbers, just in a different order.
- It makes it hard for us to understand sentiment of a word just by the letters in it

## <font > Sentence : Tokenization </font>

- So it might be easier, instead of encoding letters to encode words. 
- Consider the sentence I love my dog.
- what would happen if we start encoding the words in this sentence instead of letters in each word?

`sentence1 = {'I':1, "love":2, "my":3, "dog":4}`

`sentence2 = {'I':1, "love":2, "my":3, "cat":5}`

- two sentences are already show some form of similarity between them.
- And it's a similarity you would expect, because they're both about loving a pet.
- Given this method of encoding sentences into number, now let's kate a look at some code to archieve this for us. : This process, is called tokenization

## <font> Tensorflow Tools : Tokenizer </font>

- You will see how words can be tokenized and tools in tensorflow that handle that tokenization for you

In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [5]:
sentence = ['I love my dog',
            'I love my cat']

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


In [6]:
sentence.append('you love my dog!')

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


- Now that your words are represented by numbers like this, 
- you'll next need to represent your sentences by sequences of numbers in the correct order
- You'll then have data ready for processing by a neural network to understand or maybe oeven generate new text

## Sequencing - Turning sentences into data

- creating sequences of numbers from your sentences
- And using tools to process them to make them ready for teaching neural network

- Last time, we saw how to take a set of sentences and use the tokenizer to turn the words into numberic tokens.
- Let's build on that now by also seeing how the senteces containing these words.
- Can be turned into sequences of numbers.
- We'll add another sentence to our set of texts, and I'm doing this because the existing sentences all have four words
- and it's important to see how to manage sentences, or sequences, of different lengths
- The tokenizer supports a method called texts to sequences which performs most of the work for you.
- It creates sequences of tokens representing each sentence. 

In [9]:
sentence = ['I love my dog',
            'I love my cat',
            'You love my dog!',
            'Do you think my dog is amazing?']

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
word_index

{'my': 1,
 'love': 2,
 'dog': 3,
 'i': 4,
 'you': 5,
 'cat': 6,
 'do': 7,
 'think': 8,
 'is': 9,
 'amazing': 10}

In [10]:
sequences = tokenizer.texts_to_sequences(sentence)
sequences

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

- word_index : you can see the list of word-value pairs for the tokens.
- sequences : you can see that the sequences that texts to sequences has returned.
- We have a few new words such as amazing, think is and do, 
    - that's why this index looks a little different than before.
    - And now, we have the sequences. 

- 4,2,1,3: tokens for I, love, my, dog
- 4,2,1,6: tokens for I, love, my, cat 

## Basic tokenization done

- this is all very well for getting data ready for training a neural network, 
- but what happens when that neural network needs to classify texts, but there are words in the text that it has never seen before?
- This can confuse the tokenizer, so we'll look at how to handle that next.

In [11]:
test_data = ['i really love my dog',
             'my dog loves my manatee']

test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[4, 2, 1, 3], [1, 3, 1]]

In [12]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
word_index

{'<OOV>': 1,
 'my': 2,
 'love': 3,
 'dog': 4,
 'i': 5,
 'you': 6,
 'cat': 7,
 'do': 8,
 'think': 9,
 'is': 10,
 'amazing': 11}

In [13]:
test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

In [16]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)
padded

array([[ 0,  0,  0,  4,  2,  1,  3],
       [ 0,  0,  0,  4,  2,  1,  6],
       [ 0,  0,  0,  5,  2,  1,  3],
       [ 7,  5,  8,  1,  3,  9, 10]], dtype=int32)

In [17]:
padded = pad_sequences(sequences, padding='post')
padded

array([[ 4,  2,  1,  3,  0,  0,  0],
       [ 4,  2,  1,  6,  0,  0,  0],
       [ 5,  2,  1,  3,  0,  0,  0],
       [ 7,  5,  8,  1,  3,  9, 10]], dtype=int32)

In [18]:
padded = pad_sequences(sequences, padding='post', maxlen=5)
padded

array([[ 4,  2,  1,  3,  0],
       [ 4,  2,  1,  6,  0],
       [ 5,  2,  1,  3,  0],
       [ 8,  1,  3,  9, 10]], dtype=int32)

In [19]:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
padded

array([[4, 2, 1, 3, 0],
       [4, 2, 1, 6, 0],
       [5, 2, 1, 3, 0],
       [7, 5, 8, 1, 3]], dtype=int32)

## <font color='blue'> Training a model to recognize sentiment in text </font>

In [43]:
import json

try:
    with open('Sarcasm_Headlines_Dataset.json','r') as f:
        datastore = json.load(f)

    sentences = []
    labels = []
    urls = []
    for i in datastore:
        sentences.append(i['headline'])
        labels.append(i['is_sarcastic'])
        urls.append(i['article_link'])
except:
    print("JSONDecodeError: Extra data: line 2 column 1 (char 217)")

JSONDecodeError: Extra data: line 2 column 1 (char 217)


In [44]:
import pandas as pd
df = pd.read_json("./Sarcasm_Headlines_Dataset.json", lines=True)
df.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [65]:
sentences = list(df['article_link'])
labels = list(df['is_sarcastic'])
urls = list(df['article_link'])

In [67]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)

[    2     4     5     3     6 12731    95  2105     8 12732     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
(26709, 46)


In [68]:
training_size = 20000
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

In [69]:
vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000

In [70]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(padded[0])
print(padded.shape)

[    2     4     5     3     6 12731    95  2105     8 12732     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
(26709, 46)


In [71]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [72]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d_2 ( (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 24)                408       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 25        
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0
_________________________________________________________________


In [73]:
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

In [74]:
num_epochs = 5
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Epoch 1/5
625/625 - 1s - loss: 0.4381 - accuracy: 0.8069 - val_loss: 0.0698 - val_accuracy: 0.9999
Epoch 2/5
625/625 - 1s - loss: 0.0224 - accuracy: 1.0000 - val_loss: 0.0070 - val_accuracy: 1.0000
Epoch 3/5
625/625 - 1s - loss: 0.0032 - accuracy: 1.0000 - val_loss: 0.0023 - val_accuracy: 1.0000
Epoch 4/5
625/625 - 1s - loss: 0.0012 - accuracy: 1.0000 - val_loss: 0.0011 - val_accuracy: 1.0000
Epoch 5/5
625/625 - 1s - loss: 5.5410e-04 - accuracy: 1.0000 - val_loss: 6.0505e-04 - val_accuracy: 1.0000


In [75]:
sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))

[[0.9967818]
 [0.970959 ]]


## API Spec


### Tokenizer

- Tensorflow의 text 전처리에 있음


https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text


- Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

- text to word sequence: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence
        

### Embedding

- Tensorflow의 Layer에 있음.


https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding


- Embedding : https://www.tensorflow.org/tutorials/text/word_embeddings

## <font color='blue'> 4. Machine Learning with Recurrent Neural Networks </font>