# Introduction

![](http://www.droid-life.com/wp-content/uploads/2019/08/imdb.jpg)

Welcome to this notebook on the IMDb dataset. IMDb (Internet Movie Database) is an online database, owned by [Amazon](http://amazon.com), of information about films, television programs, home videos, video games, and streaming content. The dataset used here contains 50,000 highly polarised reviews, split into 25,000 for training and 25,000 for testing; each set consists of 50% negative and 50% positive reviews.

In this notebook, I use the Keras API to build a convolutional neural network (CNN) with an embedding layer and max pooling layers, and then evaluate the results.

<font color="green" size=4>Please upvote the notebook if you liked it. It motivates me to write more quality content :-)</font>

# Acknowledgements

1. [ggplot2 Docs - by CRAN](https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf)
2. [ggplot2 R Tutorials - by tidyverse ](https://ggplot2.tidyverse.org/)
3. [Keras Docs - by CRAN](https://cran.r-project.org/web/packages/keras/index.html)
4. [Keras R Tutorial - by RStudio](https://keras.rstudio.com/)
5. [Optimizers Docs & Tutorials - by Keras.io](https://keras.io/api/optimizers/)
6. [Losses Docs & Tutorials - by Keras.io](https://keras.io/api/losses/)

# Contents

* [<font size=4>Handling data and packages </font>](#1)
    * [Loading data and packages](#1.1)
    * [Structure of dataset](#1.2)
    * [Data Cleaning](#1.3)
    * [Splitting into train and test](#1.4)

* [<font size=4>Length and Shapes of dataset</font>](#2)
    * [Length of dataset](#2.1)
    * [Padding training data](#2.2)
    * [Defining shapes of dataset](#2.3)

* [<font size=4>Modeling</font>](#3)
    * [Preparing the ground](#3.1)
    * [Keras model](#3.2)
    * [Model Compiling](#3.4)
    * [Model Fitting](#3.5)
    * [Model Evaluation](#3.6)


* [<font size=4>Takeaways</font>](#4)


* [<font size=4>Ending note</font>](#5)

# Handling data and packages <a id="1"></a>

## Loading data and packages <a id="1.1"></a>

In [1]:
# Loading libraries
library(data.table)
library(tidyverse)
library(ggplot2)
library(caret)
library(keras)


# Initializing variables
max_features <- 11000
max_len <- 500

# Loading data
imdb_data <- dataset_imdb(num_words = max_features)
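
The `num_words` argument caps the vocabulary: as far as I understand the Keras loader, word indices at or above the cap are replaced by an out-of-vocabulary marker (index 2 by default). The sketch below mimics that rule in base R on a made-up sequence; `cap_vocab` is a hypothetical helper, not part of Keras.

```r
# Replace every word index >= num_words with the out-of-vocabulary marker.
# Assumption: oov_char = 2, the Keras default.
cap_vocab <- function(seq, num_words, oov_char = 2L) {
  ifelse(seq < num_words, seq, oov_char)
}

toy_review <- c(1L, 14L, 10999L, 11000L, 25000L)   # made-up indices
capped <- cap_vocab(toy_review, num_words = 11000L)
capped
```

This is why the integer `2` shows up repeatedly inside some of the encoded reviews.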

## Structure of dataset <a id="1.2"></a>

In [2]:
str(imdb_data)

List of 2
 $ train:List of 2
  ..$ x:List of 25000
  .. ..$ : int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
  .. ..$ : int [1:189] 1 194 1153 194 8255 78 228 5 6 1463 ...
  .. ..$ : int [1:141] 1 14 47 8 30 31 7 4 249 108 ...
  .. ..$ : int [1:550] 1 4 2 2 33 2804 4 2040 432 111 ...
  .. ..$ : int [1:147] 1 249 1323 7 61 113 10 10 13 1637 ...
  .. ..$ : int [1:43] 1 778 128 74 12 630 163 15 4 1766 ...
  .. ..$ : int [1:123] 1 6740 365 1234 5 1156 354 11 14 5327 ...
  .. ..$ : int [1:562] 1 4 2 716 4 65 7 4 689 4367 ...
  .. ..$ : int [1:233] 1 43 188 46 5 566 264 51 6 530 ...
  .. ..$ : int [1:130] 1 14 20 47 111 439 3445 19 12 15 ...
  .. ..$ : int [1:450] 1 785 189 438 47 110 142 7 6 7475 ...
  .. ..$ : int [1:99] 1 54 13 1610 14 20 13 69 55 364 ...
  .. ..$ : int [1:117] 1 13 119 954 189 1554 13 92 459 48 ...
  .. ..$ : int [1:238] 1 259 37 100 169 1653 1107 11 14 418 ...
  .. ..$ : int [1:109] 1 503 20 33 118 481 302 26 184 52 ...
  .. ..$ : int [1:129] 1 6 964 437 7 58 43 140

## Data Cleaning <a id="1.3"></a>

The IMDb dataset is bundled with the Keras library and loaded through `dataset_imdb()`. It is already preprocessed: each review is encoded as a sequence of integers, and there are no missing values to clean. :)

## Splitting into train and test <a id="1.4"></a>

In [3]:
c(c(a_train, b_train), c(a_test, b_test)) %<-% imdb_data
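
The `%<-%` operator unpacks the nested list in one step. As a rough illustration of what ends up in each variable, here is the same unpacking written with plain `$` access on a toy list; the values and the `_toy` names are made up for this sketch.

```r
# Toy list with the same nesting as the object returned by dataset_imdb()
toy_data <- list(
  train = list(x = list(c(1L, 14L, 22L), c(1L, 194L)), y = c(1L, 0L)),
  test  = list(x = list(c(1L, 591L, 202L)), y = 0L)
)

# Base R equivalent of the %<-% destructuring
a_train_toy <- toy_data$train$x   # integer-encoded reviews
b_train_toy <- toy_data$train$y   # labels: 0 = negative, 1 = positive
a_test_toy  <- toy_data$test$x
b_test_toy  <- toy_data$test$y

length(a_train_toy)   # number of training reviews in the toy list
table(b_train_toy)    # label distribution
```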

# Length and Shapes of dataset <a id="2"></a>

## Length of dataset <a id="2.1"></a>

In [4]:
cat(length(a_train), "Sequences of train dataset", "\n")
cat(length(a_test), "Sequences of test dataset")

25000 Sequences of train dataset 
25000 Sequences of test dataset

## Padding training data <a id="2.2"></a>

In [5]:
a_train <- pad_sequences(a_train, maxlen = max_len)
a_test <- pad_sequences(a_test, maxlen = max_len)
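
To make the effect of padding concrete, the sketch below mimics the default behaviour of `pad_sequences()` (zeros added at the front, long sequences truncated from the front) in base R. `pad_pre` is a hypothetical helper written for this illustration, and the sequences are toy data.

```r
# Pre-pad with zeros or pre-truncate to a fixed length
pad_pre <- function(seq, maxlen, value = 0L) {
  n <- length(seq)
  if (n >= maxlen) {
    seq[(n - maxlen + 1):n]          # keep the last maxlen tokens
  } else {
    c(rep(value, maxlen - n), seq)   # zeros go in front
  }
}

toy_seqs <- list(c(1L, 14L, 22L), c(1L, 194L, 1153L, 194L, 8255L, 78L))
padded <- t(sapply(toy_seqs, pad_pre, maxlen = 5L))
padded        # one row per review, one column per time step
dim(padded)
```

Every review becomes a row of the same length, which is what allows the data to be stored as a single integer matrix.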

## Defining shapes of dataset <a id="2.3"></a>

In [6]:
cat("Shape of a_train:", dim(a_train),"\n")
cat("Shape of a_test:", dim(a_test), "\n")

Shape of a_train: 25000 500 
Shape of a_test: 25000 500 


Both the training and test sets have shape 25000 x 500: 25,000 reviews, each padded or truncated to 500 tokens.

# Modeling <a id="3"></a>

## Preparing the ground <a id="3.1"></a>

### Convolutional Neural Network (CNN)

A **Convolutional Neural Network (ConvNet/CNN)** is a deep learning algorithm that can take an input image, assign importance (learnable weights and biases) to various aspects or objects in the image, and differentiate one from the other. A ConvNet requires much less pre-processing than other classification algorithms: whereas in primitive methods the filters are hand-engineered, ConvNets can learn these filters with enough training. The same idea carries over to text, where 1D convolutions slide over sequences of word vectors instead of image pixels.

![](https://www.researchgate.net/publication/336805909/figure/fig1/AS:817888827023360@1572011300751/Schematic-diagram-of-a-basic-convolutional-neural-network-CNN-architecture-26.ppm)

The **architecture of a ConvNet** is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field, known as the receptive field. A collection of such fields overlaps to cover the entire visual area.

The **objective of the convolution operation** is to extract features, such as edges, from the input image. ConvNets need not be limited to a single convolutional layer.

![](http://miro.medium.com/max/395/1*1VJDP6qDY9-ExTuQVEOlVg.gif)

Conventionally, the first convolutional layer is responsible for capturing low-level features such as edges, color, and gradient orientation. With added layers, the architecture adapts to high-level features as well, giving us a network with a more complete understanding of the images in the dataset, similar to our own.

The operation can produce two kinds of results: one in which the convolved feature has lower dimensionality than the input, and one in which the dimensionality is increased or stays the same. The former is obtained with valid padding, the latter with same padding.
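
The difference between valid and same padding can be sketched for the 1D case used later in this notebook. The helpers below are hypothetical and written in base R for illustration only; they assume a stride of 1, a single filter, and no bias.

```r
# Output length of a 1D convolution with stride 1
conv_out_len <- function(input_len, kernel_size, padding = c("valid", "same")) {
  padding <- match.arg(padding)
  if (padding == "valid") input_len - kernel_size + 1 else input_len
}

# A "valid" 1D convolution written out by hand
# (cross-correlation, as deep learning libraries compute it)
conv1d_valid <- function(x, kernel) {
  k <- length(kernel)
  sapply(seq_len(length(x) - k + 1),
         function(i) sum(x[i:(i + k - 1)] * kernel))
}

x <- c(1, 2, 3, 4, 5, 6)
kernel <- c(1, 0, -1)
conv1d_valid(x, kernel)          # 6 - 3 + 1 = 4 output values
conv_out_len(500, 7, "valid")    # width-7 kernel over 500 time steps
conv_out_len(500, 7, "same")     # length is preserved
```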



For more information, see the [CS231n notes on convolutional networks](https://cs231n.github.io/convolutional-networks/).

### Word Embeddings

**Word embeddings** are a family of natural language processing techniques aiming at mapping semantic meaning into a geometric space. This is done by associating a numeric vector to every word in a dictionary, such that the distance (e.g. L2 distance or more commonly cosine distance) between any two vectors would capture part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space.

In a good embedding space, the "path" (a vector) from "dinner" to "kitchen" would capture the semantic relationship between these two concepts. Here the relationship is "where x occurs", so you would expect the vector kitchen - dinner (the difference of the two embedding vectors, i.e. the path from dinner to kitchen) to capture it. In other words, we should have the vector identity dinner + (where x occurs) = kitchen, at least approximately. If that holds, we can use such a relationship vector to answer questions. For instance, starting from a new vector such as "work" and applying this relationship vector, we should get something meaningful, e.g. work + (where x occurs) = office, answering "where does work occur?".

![](https://media.geeksforgeeks.org/wp-content/uploads/20200805214427/fgf1.png)

For instance, "coconut" and "polar bear" are semantically quite different, so a reasonable embedding space would represent them as vectors that are far apart. But "kitchen" and "dinner" are related words, so they should be embedded close to each other. Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via neural networks (the "word2vec" technique) or via matrix factorization.
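
A small numeric sketch may help here. The three-dimensional vectors below are hypothetical, chosen by hand so that related words point in similar directions; real embeddings are learned and have many more dimensions.

```r
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hand-made toy embeddings, one row per word
emb <- rbind(
  kitchen = c(0.9, 0.8, 0.1),
  dinner  = c(0.8, 0.9, 0.2),
  coconut = c(0.1, 0.2, 0.9)
)

sim_related   <- cosine_sim(emb["kitchen", ], emb["dinner", ])
sim_unrelated <- cosine_sim(emb["kitchen", ], emb["coconut", ])
sim_related     # close to 1 for related words
sim_unrelated   # much lower for unrelated words

# The "relationship vector" from dinner to kitchen is a plain difference
relation <- emb["kitchen", ] - emb["dinner", ]
all.equal(emb["dinner", ] + relation, emb["kitchen", ])
```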

Ways to obtain word embeddings:

1. Learn word embeddings jointly with the main task (e.g. document classification or sentiment prediction). In this setup, you start with random word vectors and then learn them in the same way that you learn the weights of a neural network.

2. Use **pre-trained word embeddings**, i.e. load into your model word embeddings that were pre-computed on a different machine learning task than the one you are trying to solve.
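
The first option is the one this notebook takes. Conceptually, an embedding layer is a lookup table: a matrix with one row per word index, initialised randomly and then updated during training. The base R sketch below shows only the lookup step, with toy sizes that are assumptions for illustration.

```r
set.seed(42)
vocab_size <- 10   # toy value; the notebook uses max_features = 11000
embed_dim  <- 4    # toy value; the notebook uses output_dim = 128

# Randomly initialised embedding table: one row per word index
# (row i + 1 holds the vector for index i, since R is 1-based and
# index 0 is the padding value)
embedding <- matrix(rnorm(vocab_size * embed_dim, sd = 0.05),
                    nrow = vocab_size, ncol = embed_dim)

review   <- c(0L, 0L, 1L, 4L, 7L)                    # a padded toy sequence
embedded <- embedding[review + 1L, , drop = FALSE]   # lookup = row selection
dim(embedded)                                        # time steps x embed_dim
```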

For more information, see [Using word embeddings](https://jjallaire.github.io/deep-learning-with-r-notebooks/notebooks/6.1-using-word-embeddings.nb.html).

## Keras Model <a id="3.2"></a>

In [7]:
model_net <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128,
                  input_length = max_len) %>%
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 5) %>%
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 1)

summary(model_net)

Model: "sequential"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
embedding (Embedding)               (None, 500, 128)                1408000     
________________________________________________________________________________
conv1d (Conv1D)                     (None, 494, 32)                 28704       
________________________________________________________________________________
max_pooling1d (MaxPooling1D)        (None, 98, 32)                  0           
________________________________________________________________________________
conv1d_1 (Conv1D)                   (None, 92, 32)                  7200        
________________________________________________________________________________
global_max_pooling1d (GlobalMaxPool (None, 32)                      0           
________________________________________________________________________________
dense (D
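
The output shapes and parameter counts in the summary can be checked by hand. The arithmetic below assumes valid padding and a stride of 1 for the convolutions, and non-overlapping pooling windows, which I believe are the layer defaults; the dense layer count is computed here rather than read from the summary.

```r
max_features <- 11000
max_len      <- 500
embed_dim    <- 128

# Parameter counts
embedding_params <- max_features * embed_dim   # one vector per word index
conv1_params <- 7 * embed_dim * 32 + 32        # kernel * in_channels * filters + biases
conv2_params <- 7 * 32 * 32 + 32
dense_params <- 32 * 1 + 1                     # weights + bias

# Sequence length through the stack
len_conv1 <- max_len - 7 + 1     # first convolution
len_pool  <- len_conv1 %/% 5     # max pooling with pool_size = 5
len_conv2 <- len_pool - 7 + 1    # second convolution

c(embedding = embedding_params, conv1 = conv1_params,
  conv2 = conv2_params, dense = dense_params)
c(conv1 = len_conv1, pool = len_pool, conv2 = len_conv2)
```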

## Model Compiling <a id="3.4"></a>

In [8]:
# Compiling step
model_net %>% compile(
  optimizer = optimizer_rmsprop(lr = 1e-4),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
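
To show what the loss measures, here is binary cross-entropy written out in base R. With its default settings, Keras expects the predictions passed to this loss to be probabilities between 0 and 1, and a sigmoid is the usual way to turn raw scores into probabilities. The model above ends in a plain `layer_dense(units = 1)`, so adding `activation = "sigmoid"` to that layer may be worth trying. The scores below are made up.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Mean binary cross-entropy; eps guards against log(0)
binary_crossentropy <- function(y_true, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

y_true <- c(1, 0, 1, 0)
logits <- c(2.0, -1.5, 0.3, 0.8)   # hypothetical raw scores
probs  <- sigmoid(logits)

loss_toy     <- binary_crossentropy(y_true, probs)
loss_perfect <- binary_crossentropy(y_true, y_true)
loss_toy       # positive, grows as predictions get worse
loss_perfect   # near 0 for perfect predictions
```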

## Model Fitting <a id="3.5"></a>

In [9]:
# Fitting 
model_net %>% fit(
  a_train, b_train, 
  epochs = 10, 
  batch_size = 128,
  validation_split = 0.2
  )
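
The fitting arguments translate into concrete sample and batch counts. To my knowledge, `validation_split` holds out the last fraction of the training data, before any shuffling; the arithmetic below is a sketch under that assumption.

```r
n_samples        <- 25000
validation_split <- 0.2
batch_size       <- 128

n_val <- n_samples * validation_split   # samples held out for validation
n_fit <- n_samples - n_val              # samples actually used for training
val_idx <- (n_fit + 1):n_samples        # the last rows form the validation set

steps_per_epoch <- ceiling(n_fit / batch_size)   # gradient updates per epoch
c(train = n_fit, validation = n_val, steps = steps_per_epoch)
```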

## Model Evaluation <a id="3.6"></a>

In [10]:
metrics <- model_net %>% evaluate(a_test, b_test)
metrics
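
`evaluate()` reports the loss and the accuracy on the test set. As an illustration of how the accuracy figure is derived, the sketch below thresholds some made-up predicted probabilities at 0.5 and compares them with made-up labels.

```r
probs_toy  <- c(0.91, 0.12, 0.48, 0.77, 0.35)   # hypothetical predictions
labels_toy <- c(1, 0, 1, 1, 0)                  # hypothetical true labels

pred_toy <- as.integer(probs_toy > 0.5)         # 1 = positive, 0 = negative
accuracy_toy <- mean(pred_toy == labels_toy)    # share of correct predictions
accuracy_toy

# Confusion table: rows are predictions, columns are true labels
table(predicted = pred_toy, actual = labels_toy)
```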

# Takeaways  <a id="4"></a>

1. Building a deep neural network (DNN) with the Keras API.
2. Reaching good accuracy with an embedding layer and max pooling layers in a convolutional network.
3. Fitting and evaluating the model with suitable hyperparameters.

# Ending note <a id="5"></a>

The model achieved an accuracy of 86.33% on the test set.

<font color="green" size=4>This concludes the notebook. Please upvote to motivate me to write more quality content :-)</font>