
Recall max pooling on images: we take the maximum value from each block of pixels, which shrinks the image so that later convolutions can find patterns at multiple scales. The same operation applies to text. Here there is only one dimension, the position in the sequence, and the pooling is applied to every channel independently.

https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d3773ce8654d7058e4a8f95_5b883c8db9decad53453c791_2018-08-28%252018.12.12.gif
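To make the 1D pooling step concrete, here is a minimal NumPy sketch. It assumes non-overlapping windows of size 2, which is the default behaviour of Keras `MaxPooling1D`; the helper name `max_pool_1d` and the toy input are made up for illustration.

```python
import numpy as np

def max_pool_1d(x, pool_size=2):
    """Max-pool a (timesteps, channels) array along the time axis.

    Uses non-overlapping windows (stride == pool_size) and drops any
    leftover timesteps at the end of the sequence.
    """
    steps, channels = x.shape
    n_windows = steps // pool_size
    trimmed = x[:n_windows * pool_size]
    # Group consecutive timesteps into windows, then take the max per channel.
    windows = trimmed.reshape(n_windows, pool_size, channels)
    return windows.max(axis=1)

# A toy "sentence": 6 timesteps, 2 channels per timestep (hypothetical values).
x = np.array([[1, 9],
              [3, 2],
              [5, 4],
              [0, 8],
              [7, 6],
              [2, 1]])

pooled = max_pool_1d(x, pool_size=2)
print(pooled)
# [[3 9]
#  [5 8]
#  [7 6]]
```

The sequence length is halved from 6 to 3, while the number of channels stays at 2, because each channel is pooled on its own.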

# Experiment

In [1]:
!pip install wandb
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mphamminhkhoi[0m (use `wandb login --relogin` to force relogin)


In [2]:
import os
import shutil
import tempfile
import urllib.request


IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
OUTPUT_NAME = "aclImdb"

def download_and_extract_archive():
    """Download the IMDB archive and unpack it, unless it is already present."""
    if os.path.exists(OUTPUT_NAME):
        print("Imdb dataset download target exists at " + OUTPUT_NAME)
        return
    with urllib.request.urlopen(IMDB_URL) as response:
        with tempfile.NamedTemporaryFile(suffix=".tar.gz") as temp_archive:
            # Stream the download to disk instead of holding it all in memory.
            shutil.copyfileobj(response, temp_archive)
            # Flush so the unpacker sees the complete file.
            temp_archive.flush()
            shutil.unpack_archive(
                temp_archive.name, extract_dir=".", format="gztar")

download_and_extract_archive()



Imdb dataset download target exists at aclImdb


## CNNs + Word Embedding

In [3]:
import numpy as np
import os

def read_reviews(split, label_dir):
    """Read every .txt review in aclImdb/<split>/<label_dir>."""
    path = os.path.join('aclImdb', split, label_dir)
    texts = []
    for f in os.listdir(path):
        if f.endswith('.txt'):
            # Close each file promptly and decode it explicitly as UTF-8.
            with open(os.path.join(path, f), encoding='utf-8') as fh:
                texts.append(fh.read())
    return texts

def load_split(split):
    """Return (texts, labels) for 'train' or 'test'; 1 = positive, 0 = negative."""
    pos = read_reviews(split, 'pos')
    neg = read_reviews(split, 'neg')
    # Label by the actual number of files rather than a hard-coded 12500.
    labels = np.array([1] * len(pos) + [0] * len(neg), dtype=np.int32)
    return pos + neg, labels

def load_imdb():
    return load_split('train'), load_split('test')


In [4]:
(X, y), (X_test, y_test) = load_imdb()

In [5]:
y.shape

(25000,)

In [6]:
X[0]

'"La Maman et la putain" is the beautifulest film of all time. And what\'s most moving about it may be the relation between reality and art the movie deals with, which is directly inspired by Proust\'s "A la Recherche du temps perdu".<br /><br />Indeed, "La Maman et la putain" and "In search of lost time" apparently tell the same story : the one of the failure of love, which repeats itself endlessly. The first woman\'s name is always Gilberte, and the second woman appears like a twisted and deformed double of Gilberte : Veronika is like a "whore Gilberte", beautiful like the night, whereas Gilberte was pure, and "beautiful like the day". After the failure of the first love, a second love begins, but this one is like already doomed by the first one. Veronika takes the place of Gilberte, in Alexandre\'s life and in the movie. She progressively eclipses her, first by time to time, Gilberte\'s still coming when Alexandre waits for Veronika,then totally. That shows it\'s the same sad story 

In [7]:
from sklearn.model_selection import train_test_split
# Hold out 288 of the 25,000 training reviews (about 1.15%) for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1 - 12356/12500, random_state=42)



In [8]:
X_train[0]

"I think this piece of garbage is the best proof that good ideas can be destroyed, why all the American animators thinks that the kids this days wants stupid GI JOE versions of good stories??? the Looney Tunes are some of the most beloved characters in history, but they weren't created to be Xtreme, i mean come on!!! Tiny Toons was a great example of how an old idea can be updated without loosing it's original charm, but this piece of garbage is just an example of stupid corporate decisions that only wants to create a cheap idiotic show that kids will love because hey!!! kids loves superheroes right??? the whole show is only a waste of time in which we see the new versions of the Looney Tunes but this time in superhero form, this doesn't sound too bad but the problem is that this show tries too hard to copy series like batman the animated series, or the new justice league, the result??? bad copies of flash (the road runner) or superman (who else??? bugs bunny) the problem is that Loone

In [9]:
len(X_train)

24712
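The training-set size of 24712 follows from the `test_size` fraction used in the split. The sketch below reproduces the arithmetic with exact fractions; it assumes that `train_test_split` rounds the held-out count up when `test_size` is a fraction, and the variable names are illustrative only.

```python
from fractions import Fraction
import math

n_samples = 25000                         # reviews in the IMDB training set
test_size = 1 - Fraction(12356, 12500)    # exact form of the fraction used in the split

# Assumption: the held-out count is ceil(test_size * n_samples).
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(test_size, n_val, n_train)
# 36/3125 288 24712
```

So the split keeps 24712 reviews for training and sets 288 aside for validation.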

In [10]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding, LSTM
from keras.layers import Conv1D, Flatten, MaxPooling1D
from keras.datasets import imdb
import wandb
from wandb.keras import WandbCallback
import numpy as np
from keras.preprocessing import text
wandb.init()
config = wandb.config

# set parameters:
config.vocab_size = 1000
config.maxlen = 1000
config.batch_size = 32
config.embedding_dims = 10
config.filters = 32
config.kernel_size = 3
config.hidden_dims = 250
config.epochs = 5


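tokenizer = text.Tokenizer(num_words=config.vocab_size)
tokenizer.fit_on_texts(X_train)
# texts_to_sequences yields lists of word indices, which is what the
# Embedding layer expects; texts_to_matrix would yield a bag-of-words matrix.
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# Pad or truncate every review to exactly maxlen tokens.
X_train = sequence.pad_sequences(X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=config.maxlen)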

model = Sequential()
model.add(Embedding(config.vocab_size,
                    config.embedding_dims,
                    input_length=config.maxlen))
model.add(Dropout(0.5))
model.add(Conv1D(config.filters,
                 config.kernel_size,
                 padding='valid',
                 activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(config.filters,
                 config.kernel_size,
                 padding='valid',
                 activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(config.hidden_dims, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test), callbacks=[WandbCallback()])


[34m[1mwandb[0m: Currently logged in as: [33mphamminhkhoi[0m (use `wandb login --relogin` to force relogin)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2aa1d13e50>
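It can help to trace how the sequence length changes through the convolution and pooling layers of the model above. The following sketch does that by hand, assuming `padding='valid'`, stride 1 for the convolutions, and the default pool size of 2 for `MaxPooling1D`; the helper names are hypothetical.

```python
def conv1d_valid_len(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def max_pool1d_len(length, pool_size=2):
    """Output length of non-overlapping max pooling (stride == pool_size)."""
    return length // pool_size

# Values taken from the config used in the notebook.
maxlen, filters, kernel_size = 1000, 32, 3

lengths = [maxlen]                                            # after Embedding
lengths.append(conv1d_valid_len(lengths[-1], kernel_size))    # first Conv1D
lengths.append(max_pool1d_len(lengths[-1]))                   # first MaxPooling1D
lengths.append(conv1d_valid_len(lengths[-1], kernel_size))    # second Conv1D
lengths.append(max_pool1d_len(lengths[-1]))                   # second MaxPooling1D

# Flatten concatenates all timesteps of all filters into one vector.
flat_features = lengths[-1] * filters

print(lengths, flat_features)
# [1000, 998, 499, 497, 248] 7936
```

Under these assumptions the `Dense` layer with 250 units receives a vector of 7936 features per review.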

In [21]:
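import tensorflow as tf
# Recompile to also track precision and recall. Compiling does not reset the
# weights, so this fit continues training the model from the previous cell.
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['acc', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# fit the model
history = model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test))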

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
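For reference, the precision and recall metrics added in the last cell can be computed by hand as follows. This is a minimal NumPy sketch that assumes a decision threshold of 0.5 on the sigmoid output; the function name and the toy labels and probabilities are made up for illustration and are not results from the model.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and sigmoid outputs for six reviews.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]

precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```

In this toy case two of the three predicted positives are correct, and two of the three actual positives are found, so both metrics come out to 2/3.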
