<a href="https://colab.research.google.com/github/UPstartDeveloper/DS-2.4-Advanced-Topics/blob/main/Notebooks/NLP/Efficient_IMDb_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring the Data API

In this exercise we'll revisit the IMDb dataset, but this time we'll use the TensorFlow Data API, `tf.data`, to implement high-performance input pipelines.

We'll also take another look at making language models for binary classification, and use an `Embedding` layer to see if we can get a computer to learn the implicit relationships between words.

## Setup

In [5]:
# Copied from Aurélien Géron's Ch. 13 notebook
# for "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow":
# https://colab.research.google.com/github/ageron/handson-ml2/blob/master/13_loading_and_preprocessing_data.ipynb


# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To Download the Dataset (see Part 1)
from pathlib import Path

## Part 1: Get the Data

**About the IMDb Dataset**:
1. 50,000 movie reviews from the Internet Movie Database (IMDb).
2. Training and testing data are in `train/` and `test/`.
3. Each of these directories has its own `pos/` and `neg/` subdirectories, holding the positive and negative reviews.
4. The dataset is *balanced*:
   - 12,500 samples per class, in both the training and test data.
5. The samples themselves are *text files*.

In [4]:
# locating the dataset TAR file
DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
FILENAME = "aclImdb_v1.tar.gz"
# downloading it onto the client machine
filepath = keras.utils.get_file(FILENAME, DOWNLOAD_ROOT + FILENAME, 
                                extract=True)
# finding a place for it on our machine
path = Path(filepath).parent / "aclImdb"
# here it is!
print(path)

/root/.keras/datasets/aclImdb


## Part 2: Splitting the Data

In [7]:
def review_paths(dirpath):
    """Given a directory path, returns a list of all the text files present.

    Args:
      dirpath: pathlib.Path. The path to a folder on the filesystem.

    Returns: List[str]. The full path of each `.txt` file in the folder.

    Example Usage:
    review_paths(Path("/root/.foo")) ==> ["/root/.foo/bar.txt", "/root/.foo/foobar.txt"]
    """
    return [str(path) for path in dirpath.glob("*.txt")]


# collect samples for each of the training data, divided by class
train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
# do the same for test data (includes data we'll use for validation as well)
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")

In [9]:
# verify we collected all the samples for each section
len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

(12500, 12500, 12500, 12500)

In [10]:
# shuffle all the data for good measure
np.random.shuffle(test_valid_pos)
np.random.shuffle(train_pos)
np.random.shuffle(test_valid_neg)
np.random.shuffle(train_neg)

To aid our training process, we'll create a separate validation set from 15,000 of the samples in the test data (7,500 per class).

The remaining 10,000 test samples will be kept separate and, of course, not seen by the model until training is complete.

In [11]:
# keep just 5,000 samples each of the pos and neg test data
test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
# the rest of the data is for validation
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]
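
As a quick sanity check on the arithmetic of this split, here is a minimal sketch. It uses a hypothetical helper, `split_test_valid`, and placeholder file names rather than the real dataset paths, so it only confirms the sizes and that the two parts do not overlap.

```python
def split_test_valid(paths, n_test):
    """Return (test, valid): the first n_test paths and everything after them."""
    return paths[:n_test], paths[n_test:]


# Placeholder names standing in for the 12,500 shuffled paths of one class.
fake_paths = ["review_{}.txt".format(i) for i in range(12500)]
test_part, valid_part = split_test_valid(fake_paths, 5000)

# Per class: 5,000 test samples and 7,500 validation samples, with no overlap.
assert len(test_part) == 5000
assert len(valid_part) == 7500
assert set(test_part).isdisjoint(valid_part)

# Across both classes: 10,000 test samples and 15,000 validation samples.
print(2 * len(test_part), 2 * len(valid_part))
```

Because the slices `[:5000]` and `[5000:]` partition each list, no review can end up in both the test and validation sets.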

## Part 3: Using the Data API

Say hello to `tf.data`!
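
One possible shape for such a pipeline is sketched below. This is not the notebook's own solution: `make_review_dataset` and its parameters are hypothetical names chosen for illustration, and the demo reads two tiny temporary files instead of the IMDb reviews. It assumes TensorFlow 2.4 or later, where `tf.data.AUTOTUNE` is available (older 2.x releases expose it as `tf.data.experimental.AUTOTUNE`).

```python
import tempfile
from pathlib import Path

import tensorflow as tf


def make_review_dataset(pos_paths, neg_paths, batch_size=32, shuffle_buffer=None):
    """Build a batched dataset of (review_text, label) pairs from file paths.

    Hypothetical helper: positive reviews get label 1, negative reviews label 0.
    """
    paths = list(pos_paths) + list(neg_paths)
    labels = [1] * len(pos_paths) + [0] * len(neg_paths)
    dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
    if shuffle_buffer:
        # Shuffle the (path, label) pairs before reading any file contents.
        dataset = dataset.shuffle(shuffle_buffer)
    # Read each file's raw bytes, letting tf.data choose the parallelism.
    dataset = dataset.map(
        lambda path, label: (tf.io.read_file(path), label),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Batch the samples and prepare the next batch while the current one is used.
    return dataset.batch(batch_size).prefetch(1)


# Demo on two throwaway files so the sketch runs without the IMDb download.
with tempfile.TemporaryDirectory() as tmp:
    pos_file = Path(tmp) / "pos_0.txt"
    neg_file = Path(tmp) / "neg_0.txt"
    pos_file.write_text("A wonderful film.")
    neg_file.write_text("A dull film.")
    demo = make_review_dataset([str(pos_file)], [str(neg_file)], batch_size=2)
    batches = [(texts.numpy(), labels.numpy()) for texts, labels in demo]

print(batches[0])
```

With the lists built in Part 2, a call such as `make_review_dataset(train_pos, train_neg, shuffle_buffer=25000)` would follow the same pattern. Shuffling the paths before the `map` step keeps the shuffle buffer small, since it holds short path strings rather than full review texts.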

In [None]:
#