## References

* https://www.kaggle.com/rohitganji13/film-genre-classification-using-nlp
* Internal (Carted) TFRecord utilities contributed by [Nilabhra Roy Chowdhury](https://www.linkedin.com/in/nilabhraroychowdhury/)

## Setup

In [1]:
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from typing import Callable, Tuple
import pandas as pd
import numpy as np
import random
import tqdm

SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

## Data loading

The data comes from Kaggle: https://www.kaggle.com/hijest/genre-classification-dataset-imdb. Each line of the raw text files separates its fields with ` ::: `.

In [2]:
train_df = pd.read_csv(
    "./data/train_data.txt",
    engine="python",
    sep=" ::: ",
    names=["id", "movie", "genre", "summary"],
)

test_df = pd.read_csv(
    "./data/test_data_solution.txt",
    engine="python",
    sep=" ::: ",
    names=["id", "movie", "genre", "summary"],
)
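To make the ` ::: ` delimiter concrete, here is a minimal plain-Python sketch of how one raw line maps onto the four column names. It is only an illustration, not what pandas does internally: `parse_line` and the sample line are hypothetical, every field stays a string, and the split stops after the third delimiter so a summary containing ` ::: ` would stay in one piece.

```python
from typing import Dict

FIELDS = ["id", "movie", "genre", "summary"]


def parse_line(line: str) -> Dict[str, str]:
    """Split one raw line on the ' ::: ' delimiter into named fields."""
    # Limit the number of splits so the summary is never cut further.
    parts = line.rstrip("\n").split(" ::: ", maxsplit=len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Hypothetical line, shaped like the rows shown by `train_df.head()`.
example = "2 ::: Cupid (1997) ::: thriller ::: A brother and sister ...\n"
record = parse_line(example)
print(record["movie"], "|", record["genre"])
```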

In [3]:
# Viewing training data
train_df.head()

| | id | movie | genre | summary |
|---|---|---|---|---|
| 0 | 1 | Oscar et la dame rose (2009) | drama | Listening in to a conversation between his doc... |
| 1 | 2 | Cupid (1997) | thriller | A brother and sister with a past incestuous re... |
| 2 | 3 | Young, Wild and Wonderful (1980) | adult | As the bus empties the students for their fiel... |
| 3 | 4 | The Secret Sin (1915) | drama | To help their unemployed father make ends meet... |
| 4 | 5 | The Unrecovered (2007) | drama | The film's title refers not only to the un-rec... |


## Data splitting

In [4]:
# Shuffle the rows, then hold out 10% of them as a validation set using
# sklearn's `train_test_split`.
train_shuffled = train_df.sample(frac=1)
train_df_new, val_df = train_test_split(train_shuffled, test_size=0.1)

print(f"Number of training samples: {len(train_df_new)}.")
print(f"Number of validation samples: {len(val_df)}.")
print(f"Number of test examples: {len(test_df)}.")

Number of training samples: 48792.
Number of validation samples: 5422.
Number of test examples: 54200.
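The train and validation counts above follow from simple arithmetic on the 54,214 original training rows. The sketch below assumes the held-out fraction is rounded up and the remainder goes to training, which is consistent with the printed numbers; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_held_out) for a fractional hold-out size."""
    # Assumption: the held-out count is rounded up to the next integer.
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# 48792 + 5422 rows were in the original training file.
n_train, n_val = split_sizes(48792 + 5422, 0.1)
print(n_train, n_val)
```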


In [5]:
le = LabelEncoder()
le.fit(train_df_new["genre"].values)

# `train_test_split` returns slices of the shuffled frame. Copy them first so
# the column assignments below do not trigger pandas' SettingWithCopyWarning.
train_df_new = train_df_new.copy()
val_df = val_df.copy()

train_df_new["genre"] = le.transform(train_df_new["genre"].values)
val_df["genre"] = le.transform(val_df["genre"].values)
test_df["genre"] = le.transform(test_df["genre"].values)
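As a rough illustration of what the label-encoding step produces, the plain-Python sketch below maps the sorted distinct genre names to consecutive integers. The helper names and the tiny genre list are made up for the example, and this is not scikit-learn's implementation. One consequence worth noting: because the mapping is built from the training split only, a genre that appears solely in the validation or test data cannot be encoded (here it surfaces as a `KeyError`).

```python
from typing import Dict, List


def fit_label_map(labels: List[str]) -> Dict[str, int]:
    """Assign consecutive integers to the sorted distinct labels."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


def encode(labels: List[str], label_map: Dict[str, int]) -> List[int]:
    """Look up each label; an unseen label raises KeyError."""
    return [label_map[name] for name in labels]


# Hypothetical genres, taken from the first few rows shown earlier.
genres = ["drama", "thriller", "adult", "drama", "drama"]
label_map = fit_label_map(genres)
encoded = encode(genres, label_map)
print(label_map)
print(encoded)
```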


## Rough attempt at modeling with [`padded_batch()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch)

In [7]:
dataset = tf.data.Dataset.from_tensor_slices(train_df_new["summary"])

for sample in dataset.take(1):
    print(sample)

tf.Tensor(b"On a lonely stretch of a highway, Ronit - pulls up in to a desolate pump to fill petrol. The attendant informs him that his car's fan belt is broken and a new one will only arrive in the morning. Stuck in the middle of nowhere, Ronit prepares to stake the night out in his car. When another car pulls in. The driver is a dignified, well-spoken man who lives a few miles away. He offers to house Ronit for the night, promising to drop him back in the morning. Ronit agrees, believing there is a god. But then there is also the devil.", shape=(), dtype=string)


In [8]:
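# Each summary is a scalar string tensor, so there is no dimension to pad yet:
# here `padded_batch(32)` simply groups 32 strings, as the shape below shows.
dataset = dataset.padded_batch(32)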

for sample in dataset.take(1):
    print(sample.shape)

(32,)
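The `(32,)` shape shows that nothing was padded, since each element is a single string. Padding only matters once elements have a variable-length dimension, for example lists of token ids. The following plain-Python sketch illustrates that idea under made-up token ids; `pad_batch` is a hypothetical helper and not TensorFlow's implementation.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids for three summaries of different lengths.
token_ids = [[5, 8, 2], [7], [4, 1]]
padded = pad_batch(token_ids)
print(padded)
```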


In [9]:
sample

<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b"On a lonely stretch of a highway, Ronit - pulls up in to a desolate pump to fill petrol. The attendant informs him that his car's fan belt is broken and a new one will only arrive in the morning. Stuck in the middle of nowhere, Ronit prepares to stake the night out in his car. When another car pulls in. The driver is a dignified, well-spoken man who lives a few miles away. He offers to house Ronit for the night, promising to drop him back in the morning. Ronit agrees, believing there is a god. But then there is also the devil.",
       b'A young man enjoys his perfect existence. He has everything his heart desires: a beautiful girlfriend, a nice home and a good job. When his perfect life seems to slowly slip away, he wonders whether it was ever truly his... and he has to get even with himself.',
       b'A crew of deep space researchers investigates a peculiar asteroid that is covered by what appear to be hieroglyphics. In their cu

In [12]:
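# Batch the labels with the same batch size as the text so that zipping the two
# datasets pairs each batch of summaries with its matching batch of labels.
labels_ = tf.data.Dataset.from_tensor_slices(train_df_new["genre"]).batch(32)
dataset_ = tf.data.Dataset.zip((dataset, labels_))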

In [13]:
for sample_batch in dataset_.take(1):
    print(sample_batch[0])
    print(sample_batch[1])

tf.Tensor(
[b"On a lonely stretch of a highway, Ronit - pulls up in to a desolate pump to fill petrol. The attendant informs him that his car's fan belt is broken and a new one will only arrive in the morning. Stuck in the middle of nowhere, Ronit prepares to stake the night out in his car. When another car pulls in. The driver is a dignified, well-spoken man who lives a few miles away. He offers to house Ronit for the night, promising to drop him back in the morning. Ronit agrees, believing there is a god. But then there is also the devil."
 b'A young man enjoys his perfect existence. He has everything his heart desires: a beautiful girlfriend, a nice home and a good job. When his perfect life seems to slowly slip away, he wonders whether it was ever truly his... and he has to get even with himself.'
 b'A crew of deep space researchers investigates a peculiar asteroid that is covered by what appear to be hieroglyphics. In their curiosity, they cut it open and unleash a being of pure d

In [11]:
# Count the words in each summary. Note that the maximum below is the length
# of the longest summary in words. It is reused later as the `max_tokens` cap,
# but it is not the number of distinct tokens in the corpus.
# `assign` returns a new DataFrame, which avoids the SettingWithCopyWarning.
train_df_new = train_df_new.assign(
    total_words=train_df_new["summary"].str.split().str.len()
)
vocabulary_size = train_df_new["total_words"].max()
vocabulary_size


1829
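The value 1829 is the word count of the longest summary, which is a different quantity from the number of distinct words in the corpus. The sketch below contrasts the two on a few made-up sentences, assuming simple whitespace tokenization and lowercasing; `corpus_stats` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Tuple


def corpus_stats(texts: List[str]) -> Tuple[int, int]:
    """Return (longest text in words, number of distinct lowercase words)."""
    counts: Counter = Counter()
    longest = 0
    for text in texts:
        words = text.split()
        longest = max(longest, len(words))
        counts.update(word.lower() for word in words)
    return longest, len(counts)


# Made-up mini corpus for illustration.
texts = [
    "A young man enjoys his perfect existence",
    "A crew of deep space researchers",
    "A man",
]
longest, distinct = corpus_stats(texts)
print(longest, distinct)
```

On a real corpus the distinct-word count is usually much larger than the longest document, so using the latter as a vocabulary cap keeps only the most frequent terms.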

In [20]:
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf"
)

# The `TextVectorization` layer must be adapted on the training text first so
# that it can build its vocabulary (and the IDF weights used by "tf_idf").
with tf.device("/CPU:0"):
    text_vectorizer.adapt(dataset_.map(lambda text, label: text))


train_dataset = dataset_.map(
    lambda text, label: (text_vectorizer(text), label),
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)
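With `ngrams=2`, the vectorizer considers both single words and pairs of adjacent words as candidate vocabulary terms. The plain-Python sketch below shows that term set for one short phrase. It is a simplification: it only lowercases and splits on whitespace, it does not strip punctuation, and `unigrams_and_bigrams` is a hypothetical helper rather than the layer's own code.

```python
from typing import List


def unigrams_and_bigrams(text: str) -> List[str]:
    """Return the words of a text followed by its adjacent word pairs."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


terms = unigrams_and_bigrams("A crew of deep space")
print(terms)
```

Because `max_tokens` caps the vocabulary, only the most frequent of these terms across the training set end up with their own TF-IDF feature.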

In [21]:
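def make_model():
    """Build a small MLP that maps TF-IDF vectors to genre probabilities."""
    shallow_mlp_model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(256, activation="relu"),
            # One output unit per genre known to the label encoder.
            tf.keras.layers.Dense(len(le.classes_), activation="softmax"),
        ]
    )
    return shallow_mlp_model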

In [22]:
epochs = 20

shallow_mlp_model = make_model()
shallow_mlp_model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

history = shallow_mlp_model.fit(
    train_dataset, epochs=epochs
)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
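The model ends in a softmax layer and is trained with `sparse_categorical_crossentropy`, which takes integer labels such as those produced by the label encoder. As a hedged, plain-Python sketch of the underlying arithmetic for a single example (without the numerical clipping a real framework applies, and with made-up logits):

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_cross_entropy(label: int, probs: List[float]) -> float:
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probs[label])


# Made-up scores for three classes; class 0 has the highest score.
probs = softmax([2.0, 1.0, 0.1])
loss_correct = sparse_cross_entropy(0, probs)
loss_wrong = sparse_cross_entropy(2, probs)
print(round(loss_correct, 4), round(loss_wrong, 4))
```

The loss is small when the true class receives a high probability and grows as that probability shrinks.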
