# Movie Genre Predictions with Hugging Face Transformers

Uncomment and run the lines below to install any packages you do not already have.

In [97]:
# !pip install datasets
# !pip install transformers -U
# !pip install huggingface_hub
# !pip install rich
# !pip install accelerate -U
# !pip install evaluate

Follow these steps to create the Hugging Face access token that `notebook_login` below will ask for:

1. **Create a Hugging Face account (if you don't have one)**: If you don't already have an account on the Hugging Face website, you'll need to create one. Visit the Hugging Face website (https://huggingface.co/) and sign up for an account.
2. **Log in to your Hugging Face account**: Use your credentials to log in to your Hugging Face account.
3. **Open the token settings**: Hugging Face uses access tokens for authentication. Go to your account settings on the Hugging Face website and open the Access Tokens page.
4. **Generate the token**: Choose the option to create a new token, and the system will generate a unique token for you. Because this notebook pushes the model to the Hub, the token needs write permission.
5. **Copy the API token**: Once the token is generated, you'll typically see it displayed on the screen. It might be a long string of characters. Copy this token to your clipboard.
6. **Store the token securely**: API tokens are sensitive credentials, so it's essential to store them securely. You should never share your API token publicly or expose it in your code repositories.

Now, you have your Hugging Face API token, which you can use for authentication when making requests to the Hugging Face API or accessing resources on the Hugging Face Model Hub.

In [5]:
from huggingface_hub import notebook_login

notebook_login()

Let's import the following packages:

In [172]:
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from collections import Counter
import evaluate
import numpy as np
from rich import print
import pandas as pd

## Datasets

We will be using the `datadrivenscience/movie-genre-prediction` competition dataset for model training. You can read more about the competition [here](https://huggingface.co/spaces/competitions/movie-genre-prediction) and the dataset [here](https://huggingface.co/datasets/datadrivenscience/movie-genre-prediction). 

In [7]:
dataset = load_dataset("datadrivenscience/movie-genre-prediction"); dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 54000
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 36000
    })
})

The dataset has `train` and `test` splits with the following features:
- `id`
- `movie_name`
- `synopsis`
- `genre`

In [8]:
print(dataset['train'][:3])

Above we sliced and printed the first 3 rows of the training dataset.

In [28]:
labels = set(dataset['train']['genre'])
num_labels = len(labels)
labels

{'action',
 'adventure',
 'crime',
 'family',
 'fantasy',
 'horror',
 'mystery',
 'romance',
 'scifi',
 'thriller'}

There are 10 genres:
- action
- adventure
- crime
- family
- fantasy
- horror
- mystery
- romance
- scifi
- thriller

In [10]:
labels_count = Counter(dataset['train']['genre']); print(labels_count)

The labels are evenly distributed: every genre has a count of 5,400 (54,000 rows across 10 genres). That's good, since no genre dominates the training data.
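
The balance check above can also be turned into a single number. The following is a minimal sketch using only the standard library; `label_balance` and the toy genre list are hypothetical stand-ins, not part of the dataset or of any Hugging Face API.

```python
from collections import Counter


def label_balance(labels):
    """Return (counts, ratio), where ratio = smallest class count / largest class count."""
    counts = Counter(labels)
    ratio = min(counts.values()) / max(counts.values())
    return counts, ratio


# Toy stand-in for dataset['train']['genre'] (hypothetical values, not the real data)
toy_genres = ["action", "horror", "romance"] * 4

counts, ratio = label_balance(toy_genres)
print(counts)
print(ratio)  # 1.0 means every class has the same number of rows
```

A ratio well below 1.0 would suggest looking at class weights or resampling before training.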

## Tokenization

In [17]:
checkpoint = "bert-base-uncased"

A checkpoint is a saved model state, including its architecture and trained weights, which can be used for various NLP tasks and fine-tuning. 

In [63]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer('Movie Genre Predictions with Hugging Face Transformers')

{'input_ids': [101, 3185, 6907, 20932, 2007, 17662, 2227, 19081, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Above we load the tokenizer and use it on a sentence. A pretrained model must be paired with the tokenizer it was trained with, so we load the tokenizer from the same checkpoint. This ensures the input text is split and numbered exactly the way the model expects, so the pretrained weights can be used effectively.

What is `attention_mask`?
> Sentences in a batch usually have to be padded to a common length, and the model needs to know which positions hold real tokens and which are only padding. The attention mask carries that information. It's a list of 1s and 0s, where 1 means "attend to this token" and 0 means "ignore this position (padding)." For the single sentence above nothing is padded, so the mask is all 1s: [1, 1, 1, 1, 1, 1, 1, 1, 1].

What is `token_type_ids`?
> If you have multiple sentences, you'd want the computer to know which sentence each token belongs to. Token Type IDs help with that. For one sentence, it's all 0s. If you had two sentences, the first sentence would have 0s, and the second sentence would have 1s.
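
To make the two fields concrete, here is a small sketch in plain Python. It assumes a padding ID of 0 (as in `bert-base-uncased`); `pad_batch` and `pair_token_type_ids` are hypothetical helpers written for illustration and are not functions of the `transformers` library.

```python
PAD_ID = 0  # assumption: 0 is the padding ID, as in bert-base-uncased


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and mark real tokens with 1, padding with 0."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def pair_token_type_ids(len_first, len_second):
    """0 for every position of the first segment, 1 for every position of the second."""
    return [0] * len_first + [1] * len_second


batch = pad_batch([[101, 3185, 6907, 102], [101, 3185, 102]])
print(batch["attention_mask"])    # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(pair_token_type_ids(4, 3))  # [0, 0, 0, 0, 1, 1, 1]
```

The shorter sequence receives one padding position, and its mask ends in 0 so the model skips that position.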

Let's break down how `input_ids` are created into the following steps:

#### 1. Tokenize: 

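Imagine you have a sentence, "Hugging Face is awesome!" To help a computer understand it, you first split it into smaller parts, roughly words and punctuation: ["Hugging", "Face", "is", "awesome", "!"]. These smaller parts are called tokens. The BERT tokenizer used here also lowercases the text and may split a rare word into several subword pieces, marked with a `##` prefix.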


We can tokenize the synopsis of the first row of the training set:

In [12]:
dataset['train'][0]['synopsis']

'A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.'

In [13]:
tokens = tokenizer.tokenize(dataset['train'][0]['synopsis']); tokens

['a',
 'young',
 'script',
 '##writer',
 'starts',
 'bringing',
 'valuable',
 'objects',
 'back',
 'from',
 'his',
 'short',
 'nightmares',
 'of',
 'being',
 'chased',
 'by',
 'a',
 'demon',
 '.',
 'selling',
 'them',
 'makes',
 'him',
 'rich',
 '.']
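
The `##writer` piece in the output above comes from subword splitting: a word that is not in the vocabulary is broken into the longest pieces that are. The sketch below illustrates the greedy longest-match-first idea on a toy vocabulary. It is a simplified illustration, not the tokenizer's actual implementation (which also normalizes text and splits punctuation); `wordpiece` and `toy_vocab` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical; the real one has about 30,000 entries)
toy_vocab = {"script", "##writer", "young", "demon", "##s"}

print(wordpiece("scriptwriter", toy_vocab))  # ['script', '##writer']
print(wordpiece("demons", toy_vocab))        # ['demon', '##s']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```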

#### 2. Conversion to IDs: 

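Models work on numbers, so each token is replaced by its integer ID from the tokenizer's vocabulary. Every vocabulary entry has a fixed ID; in the output below, for example, "a" maps to 1037 and "young" maps to 2402. The IDs 101 and 102 that wrap the `input_ids` shown earlier are the special `[CLS]` and `[SEP]` tokens, which `tokenizer(...)` adds automatically; `convert_tokens_to_ids` does not add them.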

In [14]:
ids = tokenizer.convert_tokens_to_ids(tokens); ids

[1037,
 2402,
 5896,
 15994,
 4627,
 5026,
 7070,
 5200,
 2067,
 2013,
 2010,
 2460,
 15446,
 1997,
 2108,
 13303,
 2011,
 1037,
 5698,
 1012,
 4855,
 2068,
 3084,
 2032,
 4138,
 1012]
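
Conceptually, this conversion is a dictionary lookup. The sketch below rebuilds it with a tiny vocabulary whose IDs are copied from the outputs above; the `[UNK]` ID of 100 is an assumption about `bert-base-uncased`, and `tokens_to_ids` and `add_special_tokens` are hypothetical helpers rather than library functions.

```python
# Tiny stand-in for the full vocabulary; IDs copied from the outputs above,
# except [UNK] = 100, which is an assumption about bert-base-uncased
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "a": 1037, "young": 2402, "script": 5896, "##writer": 15994,
}


def tokens_to_ids(tokens, vocab):
    """Look up each token; anything missing from the vocabulary maps to [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


def add_special_tokens(ids, vocab):
    """Wrap a single sequence in [CLS] ... [SEP], as the full tokenizer call does."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


ids = tokens_to_ids(["a", "young", "script", "##writer"], toy_vocab)
print(ids)                                 # [1037, 2402, 5896, 15994]
print(add_special_tokens(ids, toy_vocab))  # [101, 1037, 2402, 5896, 15994, 102]
```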

In summary, the tokenizer breaks your text into tokens (smaller parts), maps each token to its vocabulary ID, builds an attention mask that separates real tokens from padding, and adds token type IDs to tell sentences apart when there is more than one.

In [54]:
# The model and Trainer expect the target column to be named "labels"
dataset = dataset.rename_column('genre', 'labels')

In [87]:
# Convert the string genres into integer class IDs (a ClassLabel feature)
dataset = dataset.class_encode_column("labels")

Casting to class labels:   0%|          | 0/54000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/36000 [00:00<?, ? examples/s]
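
Class encoding replaces each genre string with an integer and remembers the mapping in both directions. The sketch below shows the idea in plain Python, assuming IDs are assigned to the class names in sorted order; `build_label_maps` is a hypothetical helper, not the `datasets` implementation.

```python
def build_label_maps(names):
    """Assign consecutive integer IDs to the sorted unique class names."""
    class_names = sorted(set(names))
    str2int = {name: i for i, name in enumerate(class_names)}
    int2str = {i: name for name, i in str2int.items()}
    return str2int, int2str


genres = ["action", "adventure", "crime", "family", "fantasy",
          "horror", "mystery", "romance", "scifi", "thriller"]

str2int, int2str = build_label_maps(genres)
print(str2int["action"], str2int["thriller"])  # 0 9
print(int2str[5])                              # horror
```

The reverse mapping is what turns predicted class IDs back into genre names at submission time.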

In [126]:
# Hold out 20% of the training split for validation, keeping the genre proportions equal
ds = dataset["train"].train_test_split(test_size=0.2, stratify_by_column="labels")

In [127]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels'],
        num_rows: 43200
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels'],
        num_rows: 10800
    })
})
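
Stratifying means the 20% hold-out is taken separately from each genre, so both parts keep the same class proportions (here 4,320 training and 1,080 validation rows per genre). The following sketch shows the idea with the standard library only; `stratified_split` is a hypothetical helper and not how `datasets` implements it.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx); each label sends about test_size of its rows to test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)  # shuffle within the class before cutting
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# Toy labels (hypothetical): 10 rows per class
toy_labels = ["action"] * 10 + ["horror"] * 10
train_idx, test_idx = stratified_split(toy_labels)

print(len(train_idx), len(test_idx))            # 16 4
print(Counter(toy_labels[i] for i in test_idx))  # 2 rows from each class
```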

In [89]:
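def tokenize(sample):
    # `labels` is already integer-encoded by `class_encode_column`, so only the text is processed here.
    # Synopses longer than the model's maximum length are truncated; padding is applied later, per batch.
    return tokenizer(sample['synopsis'], truncation=True)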

In [130]:
tokenized_ds = ds.map(tokenize, batched=True); tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 43200
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10800
    })
})

In [131]:
tokenized_test_ds = dataset["test"].map(tokenize, batched=True); tokenized_test_ds

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'movie_name', 'synopsis', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36000
})

The code above applies the `tokenize` function to the `synopsis` feature of each split. With `batched=True`, `map` passes many rows to the function at once, so the tokenizer processes them together.

## Training

In [132]:
training_args = TrainingArguments('movie-genre-predictions', 
                                  evaluation_strategy = 'epoch',
                                  per_device_train_batch_size = 32,
                                  per_device_eval_batch_size = 64,
                                  save_strategy = 'epoch',
                                  push_to_hub = True
                                 )

The above code sets up the configuration for training a Hugging Face model, for a movie genre prediction task. Let's break it down step by step:

1. `TrainingArguments`: This is a special object or data structure that holds various settings and options for training a machine learning model.

2. `'movie-genre-predictions'`: This is the output directory (`output_dir`) where checkpoints are written. Because `push_to_hub` is enabled, it is also used by default as the name of the model repository on the Hub.

3. `evaluation_strategy = 'epoch'`: This specifies how often the model's performance is evaluated. Here it is set to 'epoch', which means after every complete pass through the training data. (Newer releases of `transformers` name this argument `eval_strategy`, so check it against your installed version.)

4. `per_device_train_batch_size = 32`: This is the number of examples processed together in one training step on each device (GPU, TPU core, or CPU). With a single device, every step processes 32 examples.

5. `per_device_eval_batch_size = 64`: Similar to the previous line, but this one specifies the batch size for evaluation (measuring how well the model is doing). It's set to 64, so 64 examples will be evaluated at once.

6. `save_strategy = 'epoch'`: This determines when the model's checkpoints (saves of the model's progress) should be saved. Again, it's set to 'epoch,' meaning after each training round.

7. `push_to_hub = True`: When set to `True`, the model is uploaded to the Hugging Face Model Hub, a place to store and share models, each time it is saved. This is why the `notebook_login` step above is needed.

In simple terms, this code is configuring how a machine learning model should be trained for movie genre prediction. It sets up details like when to check how well the model is doing, how much data to process at a time, and where to save the model's progress. It also says that the model checkpoints should be uploaded to the Hugging Face Model Hub.

You can find more details about `TrainingArguments` [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [133]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


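Above we load BERT with a sequence classification head sized for our 10 labels. The warning is expected: the classification layer (`classifier.weight`, `classifier.bias`) is not part of the pretrained checkpoint, so it starts from random values and is learned during fine-tuning.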

In [154]:
clf_metrics = evaluate.load("accuracy")

The `evaluate` library provides metrics for scoring predictions on the validation set. Here I have chosen accuracy as the metric.

In [155]:
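def compute_metrics(eval_pred):
    # The Trainer passes the logits and true labels for the whole evaluation set
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return clf_metrics.compute(predictions=predictions, references=labels)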

I have defined `compute_metrics` to compute the metric on the validation set after each epoch.
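
To see what the metric computes, here is the same argmax-then-compare logic written in plain Python on toy numbers. The logits and labels are made up for illustration, and `argmax` and `accuracy` are hypothetical helpers that mirror `np.argmax` and the accuracy metric rather than call them.

```python
def argmax(row):
    """Index of the largest value in a list (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy example: 3 samples, 3 classes (made-up values)
toy_logits = [[2.0, 0.1, -1.0],
              [0.3, 0.2, 1.5],
              [0.0, 0.9, 0.1]]
toy_labels = [0, 2, 0]

result = accuracy(toy_logits, toy_labels)
print(result)  # two of the three predictions match, so accuracy is about 0.667
```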

In [156]:
trainer = Trainer(model, 
                  args = training_args,
                  train_dataset = tokenized_ds['train'],
                  eval_dataset = tokenized_ds['test'], 
                  tokenizer = tokenizer,
                  compute_metrics = compute_metrics
                 )

The `Trainer` class in Hugging Face Transformers simplifies fine-tuning pretrained NLP models for specific tasks. It handles batching, the training loop, evaluation, and checkpoint saving. Because we pass the `tokenizer`, the `Trainer` also pads each batch to the length of its longest example, which is why the `tokenize` function above only truncates and does not pad.

In [157]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2835 | 3.31137 | 0.317315 |
| 2 | 1.0292 | 2.257011 | 0.325463 |
| 3 | 0.6767 | 2.709101 | 0.312685 |


TrainOutput(global_step=4050, training_loss=0.6523225073166835, metrics={'train_runtime': 562.6496, 'train_samples_per_second': 230.339, 'train_steps_per_second': 7.198, 'total_flos': 3860478233326848.0, 'train_loss': 0.6523225073166835, 'epoch': 3.0})
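
The `global_step=4050` in the output above can be cross-checked by hand. This is a minimal sketch, assuming a single device, no gradient accumulation, and the 3 epochs reported as `'epoch': 3.0`; `total_training_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Optimizer steps for a full run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs


# 43,200 training rows, batch size 32, 3 epochs
steps = total_training_steps(43200, 32, 3)
print(steps)  # 4050, i.e. 1350 steps per epoch

# Throughput check against 'train_runtime': 562.6496 seconds
samples_per_second = 43200 * 3 / 562.6496
print(round(samples_per_second, 1))  # 230.3
```

Both numbers agree with the `global_step` and `train_samples_per_second` values in the `TrainOutput`.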

## Submitting to the competition

In [161]:
# `predict` returns a PredictionOutput; its `.predictions` field holds the logits
test_logits = trainer.predict(tokenized_test_ds)

In [165]:
test_logits.predictions.shape

(36000, 10)

In [166]:
# Pick the highest-scoring class ID for each of the 36,000 test rows
test_predictions = np.argmax(test_logits.predictions, axis=-1)

In [170]:
# Map class IDs back to genre names using the training split's label encoding
predicted_genre = dataset["train"].features["labels"].int2str(test_predictions)

In [174]:
df = pd.DataFrame({'id':tokenized_test_ds['id'],
                  'genre': predicted_genre})

In [175]:
df.to_csv('submission.csv', index=False)  # write only the id and genre columns, not the DataFrame index