## Movie Classification 
We classify movie reviews from the IMDB dataset as either positive or negative, based on the text of each review. This is an example of a 
binary classification problem in ML.

In [3]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb

In [4]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000) 
# num_words=10000 keeps only the 10,000 most frequent words across the dataset (not per review)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [19]:
train_data.shape # array of 25,000 reviews; each review is a list of integer word indices
train_labels[0]  # 0 = negative review, 1 = positive review

1

In [12]:
# reviewing the data 

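def decode(review):  # review: a list of integer word indices representing one IMDB review
    word_index = imdb.get_word_index()  # maps word -> integer index
    reverse_word_index = {value: key for (key, value) in word_index.items()}  # maps index -> word
    # indices are offset by 3 because 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown"
    decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in review])
    return decoded_review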

# print(decode(train_data[0]))
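
The `i - 3` offset in `decode` is easier to see on a toy example. The sketch below uses a made-up three-word vocabulary (`toy_word_index`) and a hypothetical helper `decode_with` instead of the real `imdb.get_word_index()`, so it runs without downloading anything. It assumes the default Keras encoding, in which 1 marks the start of a review and 2 stands for an unknown word.

```python
def decode_with(word_index, review):
    """Same logic as decode(), but with the word index passed in (hypothetical helper)."""
    reverse_word_index = {index: word for word, index in word_index.items()}
    # Subtract 3 because indices 0, 1 and 2 are reserved and do not map to real words
    return " ".join(reverse_word_index.get(i - 3, "?") for i in review)


# Made-up vocabulary, for illustration only
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# 1 = start marker, 4/5/6 = "the"/"movie"/"great" after the offset, 2 = unknown word
toy_review = [1, 4, 5, 6, 2]

decoded = decode_with(toy_word_index, toy_review)
print(decoded)  # ? the movie great ?
```

The reserved indices have no entry in the reversed dictionary, so they come out as `?`.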

### 1. Prepare your data 
The reviews in the dataset are lists of integers of varying length. These need to be converted to tensors 
of uniform shape before a neural network can process them. There are two ways of converting these 
lists of integers to a tensor: 
1. Padding: pad the lists so that they all have the same length, giving an integer tensor of shape (samples, max_length). The network then starts with a layer that can handle such integer tensors (an embedding layer).
2. Multi-hot encoding: turn each list into a vector of 1s and 0s. For example, the list [8, 5, 6] becomes a 10,000-dimensional vector that is 1 at indices 5, 6 and 8 and 0 everywhere else.
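
To make the two options concrete, here is a minimal sketch on two tiny made-up reviews, using plain Python lists and a 10-dimensional vocabulary instead of 10,000. The helper names `pad_right` and `multi_hot` are hypothetical and not part of Keras. The padding helper appends zeros at the end, which is an assumption made for readability; a library padding utility may pad at the front instead.

```python
def pad_right(sequences, max_length, pad_value=0):
    """Truncate or right-pad every list so that all have length max_length."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_length])
        padded.append(clipped + [pad_value] * (max_length - len(clipped)))
    return padded


def multi_hot(sequence, dimension):
    """Return a 0/1 vector with a 1 at every index that occurs in sequence."""
    vector = [0] * dimension
    for index in sequence:
        vector[index] = 1
    return vector


toy_reviews = [[8, 5, 6], [3, 1]]  # made-up data, not from IMDB

# Option 1: integer tensor of shape (samples, max_length)
padded = pad_right(toy_reviews, max_length=4)
print(padded)  # [[8, 5, 6, 0], [3, 1, 0, 0]]

# Option 2: one 0/1 vector per review, of shape (samples, dimension)
encoded = [multi_hot(review, dimension=10) for review in toy_reviews]
print(encoded[0])  # [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]
```

Note that multi-hot encoding discards word order and repetition: it records only which words appear in a review.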

In [20]:
# multi-hot encode the data
import numpy as np
def vectorize_sequences(sequences, dimension=10000):  # each review becomes one 10,000-dimensional vector
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # set the positions listed in the review to 1; all others stay 0
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32") 


In [21]:
print(f"x_train: {x_train[0]}")
print(f"y_train: {y_train}")

x_train: [0. 1. 1. ... 0. 0. 0.]
y_train: [1. 0. 0. ... 0. 1. 0.]


### 2. Build your model 
The next step is to build the model (neural network) so that it can be trained later. 
The inputs are vectors and the labels are scalars (1 or 0). This is one of the simplest problem setups in DL. We will use a plain stack of densely connected (`Dense`) layers with `relu` activations. 

**The architecture decisions you need to make** 
* how many layers to use 
* how many units to choose in each layer. 

Principles for making these architecture decisions will be covered later. For now we will choose: 
* two intermediate layers with 16 units each 
* a third layer that outputs the scalar prediction (the sentiment of the review: positive or negative)

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

mr_model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")  # outputs a single value between 0 and 1
])
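
To see what this stack computes, here is a rough sketch of the forward pass in plain Python. Each dense layer computes `activation(weights · input + bias)`. The helpers `dense` and `random_layer`, the 20-dimensional input, and the random weights are all assumptions for illustration: this is not how Keras initializes or stores its weights, and nothing here is trained.

```python
import math
import random


def relu(x):
    """Zero out negative values."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One dense layer: weights[j] holds the input weights of output unit j."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_inputs, n_units, rng):
    """Placeholder weights; a trained model would have learned values here."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_units)]
    biases = [0.0] * n_units
    return weights, biases


rng = random.Random(0)
n_features = 20  # stand-in for the 10,000-dimensional multi-hot input

w1, b1 = random_layer(n_features, 16, rng)
w2, b2 = random_layer(16, 16, rng)
w3, b3 = random_layer(16, 1, rng)

# A toy multi-hot input with 1s at indices 5, 6 and 8
x = [1.0 if i in (5, 6, 8) else 0.0 for i in range(n_features)]

h1 = dense(x, w1, b1, relu)      # first intermediate layer, 16 values
h2 = dense(h1, w2, b2, relu)     # second intermediate layer, 16 values
probability = dense(h2, w3, b3, sigmoid)[0]  # single output in (0, 1)

print(len(h1), len(h2), probability)
```

The `sigmoid` on the last layer is what lets the output be read as the probability that a review is positive.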