# Setup and version checks

In [0]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [0]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np

In [0]:
"TensorFlow version: {0}, Keras version {1}".format(tf.__version__, keras.__version__)

'TensorFlow version: 1.12.0-rc1, Keras version 2.1.6-tf'

In [0]:
if not tf.test.is_gpu_available():
  print('GPU device not found, CPU only mode.')
else:
  # gpu_device_name() returns the name of the first available GPU device
  print('Found GPU at: {}.'.format(tf.test.gpu_device_name()))

GPU device not found, CPU only mode.


# The IMDB dataset
We'll be working with the IMDB dataset, a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

In [0]:
from tensorflow.keras.datasets import imdb
import numpy as np
import pandas as pd

MAX_WORDS = 10000
SKIP_TOP = 10

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=MAX_WORDS, skip_top=SKIP_TOP)

In [0]:
train_data.shape
train_labels.shape


test_data.shape
test_labels.shape

(25000,)

(25000,)

(25000,)

(25000,)

# TL;DR
Our task is to develop a machine learning algorithm that predicts one of two classes, $0$ or $1$, which here can be interpreted as negative and positive reviews, respectively. Usually the class that is more interesting to us is denoted as $1$.


Let's formulate the task in a more precise way.

We want to predict $y \in \{0,1\}$ given a set of observations $x_1,\ldots,x_n$. Let's denote the predictor as some unknown function $y=h(x;\theta)$, where $\theta$ is a vector of parameters.

More formally, we have a random variable $Y$ with a Bernoulli distribution with parameter $\alpha$.

\begin{equation}
Y \sim \mathrm{Bernoulli}(\alpha)
\end{equation}

This means that $P(Y=1) = \alpha$, and $P(Y=0) = 1 - \alpha$.
We can write both cases as a single equation:

\begin{equation}
p(y) = \alpha^y \cdot (1-\alpha)^{1-y}
\end{equation}
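
As a quick check of the single-equation form, the sketch below evaluates it for both outcomes. The function name `bernoulli_pmf` and the value `alpha = 0.3` are illustrative assumptions, not part of the notebook.

```python
def bernoulli_pmf(y, alpha):
    """Return p(y) = alpha**y * (1 - alpha)**(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return alpha ** y * (1 - alpha) ** (1 - y)

alpha = 0.3  # illustrative value, not taken from the data
p_one = bernoulli_pmf(1, alpha)   # reduces to alpha
p_zero = bernoulli_pmf(0, alpha)  # reduces to 1 - alpha
print(p_one, p_zero)
```

For $y=1$ the second factor becomes $1$, and for $y=0$ the first factor becomes $1$, which is why one formula covers both cases.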

## Exponential family
We say that a distribution belongs to the exponential family if it has the form:

\begin{equation}
p(y;\eta) = b(y) \exp(\eta^T T(y)-a(\eta))
\end{equation}

where $\eta$ is the natural (or canonical) parameter of the distribution,
$T(y)$ is called the sufficient statistic and
$\exp(-a(\eta))$ is a normalization factor.

We can rewrite the Bernoulli distribution in the following form:

\begin{equation}
\alpha^y \cdot (1-\alpha)^{1-y} = \exp\left(y \log \frac{\alpha}{1-\alpha}+\log(1-\alpha)\right)
\end{equation}
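
The step between the two sides may be easier to follow when spelled out. Writing the left-hand side as the exponential of its logarithm and then collecting the terms multiplied by $y$ gives:

\begin{align}
\alpha^y \cdot (1-\alpha)^{1-y} &= \exp\left(y \log \alpha + (1-y) \log(1-\alpha)\right) \\
&= \exp\left(y \log \frac{\alpha}{1-\alpha} + \log(1-\alpha)\right)
\end{align}

Compared with the general form, this suggests reading off $T(y) = y$ and $b(y) = 1$, with $a(\eta) = -\log(1-\alpha)$.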

This means that the Bernoulli distribution belongs to the exponential family with:

\begin{equation}
\eta = \log \frac{\alpha}{1-\alpha}
\end{equation}

or equivalently
\begin{equation}
\alpha = \frac{1}{1+\exp(-\eta)}
\end{equation}

The last equation has exactly the form of the logistic function.
If $(Y|X;\theta) \sim \mathrm{Bernoulli}(\alpha)$ and we assume that the natural parameter $\eta$ is a linear function of the inputs $x_i$, that is $\eta = \theta^T x$, then:

\begin{align}
E(Y|X;\theta) &= \alpha \\
&= \frac{1}{1+\exp(-\eta)} \\
&= \frac{1}{1+\exp(-\theta^T x)}
\end{align}
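
The chain of equalities above can be sketched in a few lines of Python. The helper names `sigmoid` and `predict_proba`, as well as the values of `theta` and `x`, are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def sigmoid(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def predict_proba(theta, x):
    """Return E(Y | x; theta) = sigmoid(theta^T x) for plain Python lists."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(eta)

# Hypothetical parameters and input; the first component of x is a constant 1
# so that theta[0] acts as an intercept.
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.5, 0.0]

alpha = predict_proba(theta, x)  # eta = -1 + 1 + 0 = 0, so alpha = 0.5
print(alpha)
```

Note that `math.exp` overflows for very negative `eta`, so a production implementation would need a numerically safer formulation.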

In general, for this family of models the function $g(\eta) = E(T(y); \eta)$ is called the canonical response function, and we call its inverse $g^{-1}$ a link function.

Now we can establish the relationship between $\alpha$, $\eta$, and $x$.

\begin{equation}
\eta = \log \left(\frac{\alpha}{1-\alpha} \right) = \theta^T x
\end{equation}

Note that the expression $\frac{\alpha}{1-\alpha}$ is called the odds. The equation above states that the logarithm of the odds of $Y=1$ (the log-odds, or logit) is a linear function of the inputs, namely
$\theta^T x$.
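
One practical consequence of this relationship is that increasing an input by one unit multiplies the odds by $\exp(\theta_j)$. The sketch below checks this numerically for a one-feature model; the parameter values and the names `odds`, `x_low`, and `x_high` are assumptions made for illustration.

```python
import math

def sigmoid(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(alpha):
    """Odds of the event whose probability is alpha."""
    return alpha / (1.0 - alpha)

theta = [-1.0, 0.8]      # hypothetical [intercept, slope]
x_low, x_high = 1.0, 2.0  # two inputs one unit apart

odds_low = odds(sigmoid(theta[0] + theta[1] * x_low))
odds_high = odds(sigmoid(theta[0] + theta[1] * x_high))

log_odds_low = math.log(odds_low)  # recovers theta^T x = -0.2 up to rounding
odds_ratio = odds_high / odds_low  # equals exp(theta[1]) up to rounding
print(log_odds_low, odds_ratio)
```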

## Parameter estimation

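We group observations by $x_i$, where $n_i$ is the number of observations that share the input $x_i$ and $y_i$ is the number of those with label $1$. This gives us a set of: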

\begin{align}
& x_1, n_1, y_1 \\
& x_2, n_2, y_2 \\
& \vdots \\
& x_m, n_m, y_m
\end{align}

The joint probability function used for maximum likelihood estimation is

\begin{equation}
L(\alpha) = \prod_{i=1}^m P(Y_i = y_i|x_i) = \prod_{i=1}^m \binom{n_i}{y_i} \alpha_i^{y_i} (1-\alpha_i)^{n_i-y_i}
\end{equation}

and the log-likelihood is

\begin{equation}
l(\alpha) = \sum_{i=1}^m \left[ \log \binom{n_i}{y_i} + y_i \log \alpha_i + (n_i-y_i) \log(1-\alpha_i) \right]
\end{equation}

To estimate the unknown vector $\theta$ we need to find the maximum of the function above (equivalently, the minimum of the negative log-likelihood), where $\alpha_i = \frac{1}{1+\exp(-\theta^T x_i)}$.
Unfortunately there is no closed-form formula for $\theta$, so the optimum is found iteratively. In most cases a technique like Newton-Raphson or stochastic gradient descent (SGD) is used.
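
To make the iterative estimation concrete, here is a minimal NumPy sketch. It uses plain full-batch gradient ascent on the log-likelihood (equivalently, gradient descent on its negative), which is a simpler stand-in for Newton-Raphson or SGD. It assumes ungrouped data, so every $n_i = 1$ and the binomial coefficient drops out. The data set, the function names, and the hyperparameters are all made up for illustration.

```python
import numpy as np

def sigmoid(eta):
    """Logistic function applied elementwise."""
    return 1.0 / (1.0 + np.exp(-eta))

def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood for ungrouped data (every n_i equals 1)."""
    alpha = sigmoid(X @ theta)
    return np.sum(y * np.log(alpha) + (1 - y) * np.log(1 - alpha))

def fit_gradient_ascent(X, y, learning_rate=0.1, steps=500):
    """Maximize the log-likelihood with full-batch gradient ascent."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(steps):
        alpha = sigmoid(X @ theta)
        gradient = X.T @ (y - alpha)  # gradient of the log-likelihood
        theta = theta + learning_rate * gradient / len(y)
        history.append(log_likelihood(theta, X, y))
    return theta, history

# Tiny synthetic data set that is not linearly separable:
# a constant intercept column plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_hat, history = fit_gradient_ascent(X, y)
print(theta_hat, history[0], history[-1])
```

The gradient $X^T(y - \alpha)$ follows from differentiating the log-likelihood with respect to $\theta$. Because the log-likelihood is concave, a small enough step size increases it at every iteration.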


## Data preparations
Now we have to convert each list of word indices to a bag-of-words representation: a 0/1 vector with a one at every index that occurs in the list.
For example, for the list $[2, 7, 3, 5, 0]$ we want to have $[1, 0, 1, 1, 0, 1, 0, 1, 0, \ldots, 0]$.

In [0]:
def bag_of_words_matrix(words_lists):
    """Encode each list of word indices as a 0/1 vector of length MAX_WORDS."""
    results = np.zeros((len(words_lists), MAX_WORDS))
    for i, sequence in enumerate(words_lists):
        results[i, sequence] = 1.  # mark every word index present in the review
    return results

x_train = bag_of_words_matrix(train_data)
x_test = bag_of_words_matrix(test_data)
# One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1]
y_train = np.eye(2)[train_labels]
y_test = np.eye(2)[test_labels]
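
To check the encoding on the example list from the text, the following standalone sketch uses a deliberately small vocabulary. The function `bag_of_words` and the value `vocabulary_size=9` are hypothetical; the notebook itself works with `MAX_WORDS`.

```python
import numpy as np

def bag_of_words(sequence, vocabulary_size):
    """Return a 0/1 vector with ones at the word indices present in sequence."""
    vector = np.zeros(vocabulary_size)
    vector[sequence] = 1.0  # fancy indexing sets all listed positions at once
    return vector

# vocabulary_size=9 is a small illustrative value chosen for readability.
encoded = bag_of_words([2, 7, 3, 5, 0], vocabulary_size=9)
print(encoded.astype(int).tolist())

# Repeated words collapse to a single 1: the encoding records presence, not counts.
repeated = bag_of_words([2, 2, 2], vocabulary_size=4)

# One-hot labels work the same way as in the notebook: row k of the identity matrix.
one_hot = np.eye(2)[[0, 1, 1]]
```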

In [0]:
learning_rate = 0.01
training_epochs = 1
batch_size = 100
# Number of batches needed to see the whole training set once
batch_num_in_dataset = x_train.shape[0] // batch_size
output_dim = 2

In [0]:
def get_next_batch(n, x_train, y_train):
    """Return a random batch of n examples, sampled without replacement."""
    index = np.arange(0, len(x_train))
    np.random.shuffle(index)
    samples = index[:n]
    return x_train[samples], y_train[samples]

In [0]:
get_next_batch(1, x_train, y_train)

(array([[0., 0., 1., ..., 0., 0., 0.]]), array([[0., 1.]]))

In [0]:
# Parameters
learning_rate = 0.01
batch_size = 10
training_steps = 1000
display_step = 50  # print the loss every display_step steps

# tf Graph Input: bag-of-words vectors and one-hot labels
x = tf.placeholder(tf.float32, [None, MAX_WORDS])
y = tf.placeholder(tf.float32, [None, output_dim])

# Set model weights
W = tf.Variable(tf.zeros([MAX_WORDS, output_dim]))
b = tf.Variable(tf.zeros([output_dim]))

# Construct model: softmax regression trained with cross-entropy loss
y_hat = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_hat), axis=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for step in range(training_steps):
        batch_xs, batch_ys = get_next_batch(batch_size, x_train, y_train)
        _, c = sess.run([optimizer, loss], feed_dict={x: batch_xs,
                                                      y: batch_ys})

        if step % display_step == 0:
            print("Loss: {0}".format(c))
            
    print("Optimization Finished!")

    correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    train_accuracy = sess.run(accuracy, feed_dict={x: x_train, y: y_train})
    test_accuracy = sess.run(accuracy, feed_dict={x: x_test, y: y_test})
    
    print('Train acc: {0}, Test acc: {1}'.format(train_accuracy, test_accuracy))
    

Loss: 0.6931471824645996
Loss: 0.5838429927825928
Loss: 0.6378467679023743
Loss: 0.6213328838348389
Loss: 0.5538224577903748
Loss: 0.4679785370826721
Loss: 0.5026615262031555
Loss: 0.4876132011413574
Loss: 0.5052644610404968
Loss: 0.37420785427093506
Loss: 0.44141674041748047
Loss: 0.5280390977859497
Loss: 0.6124923825263977
Loss: 0.336426317691803
Loss: 0.5220907926559448
Loss: 0.5226308703422546
Loss: 0.3691233992576599
Loss: 0.5824846625328064
Loss: 0.5126842260360718
Loss: 0.5951675176620483
Optimization Finished!
Train acc: 0.8461999893188477, Test acc: 0.8339200019836426
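
The graph above uses a two-unit softmax rather than the single logistic function from the derivation. For two classes the two are equivalent: the softmax probability of class $1$ equals the logistic function applied to the difference of the two logits. The NumPy sketch below checks this, and also shows why the first printed loss is close to $\log 2 \approx 0.693$: with all-zero weights both classes get probability $0.5$. The logits are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def sigmoid(eta):
    """Logistic function applied elementwise."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical logits for three examples; columns are class 0 and class 1.
logits = np.array([[0.2, 1.5], [-1.0, -3.0], [0.0, 0.0]])

p_class1_softmax = softmax(logits)[:, 1]
p_class1_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])

# With all-zero weights both classes get probability 0.5, so the
# cross-entropy of any one-hot label is -log(0.5) = log(2).
initial_loss = -np.log(softmax(np.zeros(2))[1])
print(p_class1_softmax, p_class1_sigmoid, initial_loss)
```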
