

---


# 🦙 **LLaMa 3B V2 Fine-Tuning on IMDB Top 1000 Movies 🎥**
**By: Bradley Sides**

**Overview:** Given a training set of brief movie descriptions including both genre and director, this fine-tuned instance of LLaMa 3B should be able to generate creative movie descriptions given a genre and a director.

This notebook takes about 4 minutes to run on an A100 GPU.


---



# Step 1: Construct the Dataset 💽

## Imports, build necessary packages


In [None]:
# Colab workaround: report UTF-8 as the preferred encoding so shell commands (`!pip ...`) run cleanly
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Install the required packages
!pip install -q git+https://github.com/huggingface/peft.git
!pip install mlflow
!pip install trl
!pip install datasets

import os
import random

import mlflow
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset, concatenate_datasets
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from transformers import LlamaTokenizer, LlamaForCausalLM
from transformers.trainer_callback import TrainerCallback
from trl import SFTTrainer

# Optional Spark setup, only needed for very large dataframes
#!pip install pyspark
#from pyspark.sql import SparkSession
#spark = SparkSession.builder.appName("Description Generator").getOrCreate()

# Hugging Face access token stored in the Colab secrets
from google.colab import userdata
pw = userdata.get('HF_TOKEN')
print("Access Token Loaded: ...", pw[-4:])


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load and Clean Up the Dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/LLM-Movie-Training/movies.csv')


# Drop irrelevant columns: Poster_Link, Certificate, No_of_Votes, Gross

columns_to_drop = df.columns[[0, 3, 14, 15]]
df.drop(columns=columns_to_drop, inplace=True)

# Extract runtime in minutes, convert to integer
df["Runtime"] = df["Runtime"].str.extract(r'(\d+)')
df.rename(columns={"Runtime": "Runtime-(mins)"}, inplace=True)
df['Runtime-(mins)'] = pd.to_numeric(df['Runtime-(mins)'], errors='coerce')

# Convert IMDB Rating scale to /100 (originally /10) and integer
df["IMDB_Rating"] = (df["IMDB_Rating"] * 10).astype(int)

# Some movies have no Meta_score, so we will substitute IMDB_Rating for this, also integer conversion
df["Meta_score"] = df['Meta_score'].fillna(df["IMDB_Rating"]).astype(int)

# Make Genres into a list (originally one string) for multi-label classification
df["Genre"] = df["Genre"].apply(lambda x: x.split(', '))

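# Make Released_Year into integer (originally string).
# Caution: the assignment below names no column, so it overwrites every field
# of the Apollo 13 row with '1995', not only Released_Year.
df[df["Series_Title"] == "Apollo 13"] = '1995'
df["Released_Year"] = df["Released_Year"].astype(int)

# That row (index 966) is now filled with 1995 across the board, so drop it
df.drop(index=966, inplace = True)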
df.head(5)

Unnamed: 0,Series_Title,Released_Year,Runtime-(mins),Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4
0,The Shawshank Redemption,1994,142,[Drama],93,Two imprisoned men bond over a number of years...,80,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
1,The Godfather,1972,175,"[Crime, Drama]",92,An organized crime dynasty's aging patriarch t...,100,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton
2,The Dark Knight,2008,152,"[Action, Crime, Drama]",90,When the menace known as the Joker wreaks havo...,84,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine
3,The Godfather: Part II,1974,202,"[Crime, Drama]",90,The early life and career of Vito Corleone in ...,90,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton
4,12 Angry Men,1957,96,"[Crime, Drama]",90,A jury holdout attempts to prevent a miscarria...,96,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler
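
The Apollo 13 patch in the cell above assigns through a boolean mask without naming a column, which is why the whole row ends up holding `'1995'` and has to be dropped. A minimal sketch of the difference on a toy frame follows. The two rows and the `"PG"` placeholder are made-up illustration values, not taken from `movies.csv`.

```python
import pandas as pd

# Toy stand-in for the movie table (illustrative values only)
toy = pd.DataFrame({
    "Series_Title": ["Apollo 13", "Jaws"],
    "Released_Year": ["PG", "1975"],
    "Director": ["Ron Howard", "Steven Spielberg"],
})

# Mask-only assignment: every column of the matching row becomes '1995'
whole_row = toy.copy()
whole_row[whole_row["Series_Title"] == "Apollo 13"] = "1995"

# Column-targeted assignment: only Released_Year is patched
patched = toy.copy()
patched.loc[patched["Series_Title"] == "Apollo 13", "Released_Year"] = "1995"
patched["Released_Year"] = patched["Released_Year"].astype(int)

print(whole_row)
print(patched)
```

With the column-targeted form the row keeps its other fields, so it would not need to be dropped afterwards.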


## Some Preliminary Analysis

In [None]:
# Unique Stars and Directors
star1_list_filtered = list(df['Star1'].unique())
star2_list_filtered = list(df['Star2'].unique())
star3_list_filtered = list(df['Star3'].unique())
star4_list_filtered = list(df['Star4'].unique())
director_list_filtered = list(df['Director'].unique())
print("--------------------")
print("|  Unique Counts:  |")
print("--------------------")
print("Primary Stars: " + str(len(star1_list_filtered)))
print("Secondary Stars: " + str(len(star2_list_filtered)))
print("Tertiary Stars: " + str(len(star3_list_filtered)))
print("Quaternary Stars: " + str(len(star4_list_filtered)))
print("Directors: " + str(len(director_list_filtered)))
print("____________________")

# Runtimes and actors
min_runtime = str(df["Runtime-(mins)"].min())
max_runtime = str(df["Runtime-(mins)"].max())
mean_runtime = str(int(df["Runtime-(mins)"].mean()))
difference_runtime = str(df["Runtime-(mins)"].max() - df["Runtime-(mins)"].min())
top_3_directors = df['Director'].value_counts().head(3)
most_frequent_star = df['Star1'].value_counts().idxmax()
# Note: Star4 is not included here, so the "top 4" label below actually covers Star1-Star3
all_actors = pd.concat([df['Star1'], df['Star2'], df['Star3']])
most_frequent_actor = all_actors.mode()[0]
# value_counts() sorts in descending order, so position 0 holds the highest count
direct_times = df['Director'].value_counts().iloc[0]
star_times = df['Star1'].value_counts().iloc[0]
actor_times = all_actors.value_counts().iloc[0]
# Genre holds lists, so these counts are per genre combination, not per single genre
top_3_genres = df['Genre'].value_counts().head(3)


print('The mean film runtime is: ' + mean_runtime + ' minutes with a difference of ' + difference_runtime + ' between ' + min_runtime + ' mins and ' + max_runtime + ' mins')
print("____________________")
print("The most frequent directors are:")
for director, count in top_3_directors.items():
    print(f"{director}: {count}")
print()
print(f'The most frequent star (lead) is: {most_frequent_star} with {star_times} appearances.')
print(f'The most frequent actor (top 4) is: {most_frequent_actor} with {actor_times} appearances.')
print()
print('The most frequent genres are:')
for genre, count in top_3_genres.items():
    print(f"{genre}: {count}")
print("____________________")

--------------------
|  Unique Counts:  |
--------------------
Primary Stars: 660
Secondary Stars: 840
Tertiary Stars: 891
Quaternary Stars: 938
Directors: 548
____________________
The mean film runtime is: 122 minutes with a difference of 276 between 45 mins and 321 mins
____________________
The most frequent directors are:
Alfred Hitchcock: 14
Steven Spielberg: 13
Hayao Miyazaki: 11

The most frequent star (lead) is: Tom Hanks with 11 appearances.
The most frequent actor (top 4) is: Robert De Niro with 17 appearances.

The most frequent genres are:
['Drama']: 85
['Drama', 'Romance']: 37
['Comedy', 'Drama']: 35
____________________


## Convert to Alpaca Format

Here, we take the columns of interest (Director, Genre, and Overview) and put them into Alpaca-style "Instruction, Response" format.

In [None]:
df['instruction'] = 'Create a detailed description for a movie from the following director: ' + df['Director'] + ', belonging to genre: ' + df['Genre'].apply(', '.join)
df = df[['instruction', 'Overview']].copy()  # copy so that the new columns below are not added to a slice

template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

{}

### Response:\n"""


df['prompt'] = df['instruction'].apply(lambda x: template.format(x))
df.rename(columns={'Overview': 'response'}, inplace=True)
df['response'] = df['response'] + "\n### End"
df = df[['prompt', 'response']].copy()
df.head(3)


Unnamed: 0,prompt,response
0,Below is an instruction that describes a task....,Two imprisoned men bond over a number of years...
1,Below is an instruction that describes a task....,An organized crime dynasty's aging patriarch t...
2,Below is an instruction that describes a task....,When the menace known as the Joker wreaks havo...


In [None]:
#
### NOTE: Use (uncomment) this cell with very large dataframes and skip the next cell
#

#spark_df = spark.createDataFrame(df)
#spark_df.write.saveAsTable('product_name_to_description')
#df = spark.sql("SELECT * FROM product_name_to_description").toPandas()
#df['text'] = df["prompt"]+df["response"]
#df.drop(columns=['prompt', 'response'], inplace=True)

In [None]:
# Do not use this cell if previous cell is being used

df['text'] = df["prompt"] + df["response"]
df.drop(columns=['prompt', 'response'], inplace=True)

display(df)
print(df.shape)
print(df['text'][1])

Unnamed: 0,text
0,Below is an instruction that describes a task....
1,Below is an instruction that describes a task....
2,Below is an instruction that describes a task....
3,Below is an instruction that describes a task....
4,Below is an instruction that describes a task....
...,...
995,Below is an instruction that describes a task....
996,Below is an instruction that describes a task....
997,Below is an instruction that describes a task....
998,Below is an instruction that describes a task....


(999, 1)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

Create a detailed description for a movie from the following director: Francis Ford Coppola, belonging to genre: Crime, Drama

### Response:
An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.
### End
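
Since the model is trained on this exact template, prompts at inference time work best when they reproduce it character for character, including the instruction wording. One way to guarantee that is a single builder shared by training and inference. The sketch below assumes hypothetical helper names `build_instruction` and `build_prompt`, which are not part of the notebook.

```python
# Same template string as in the Alpaca-format cell above
TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

{}

### Response:\n"""


def build_instruction(director, genres):
    """Mirror the instruction wording used for the training rows."""
    return (
        "Create a detailed description for a movie from the following director: "
        + director
        + ", belonging to genre: "
        + ", ".join(genres)
    )


def build_prompt(director, genres):
    """Return the full prompt, ending right where the response should begin."""
    return TEMPLATE.format(build_instruction(director, genres))


prompt = build_prompt("Francis Ford Coppola", ["Crime", "Drama"])
print(prompt)
```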


# Step 2: Model Setup 💻

## Define the Model, Load in Necessary Packages

In [None]:
# Model to use
model_path = 'openlm-research/open_llama_3b_v2'

tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

#!pip install bitsandbytes


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


1

In [None]:
# Load it in
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map='auto',
)


## Define Parameters and Target Modules

In [None]:
# Targeting: ALL Linear Layers

target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head']

# Use this to save training results
base_dir = "/content/drive/MyDrive/LLM-Movie-Training"

In [None]:
# Configure LoRA parameters

lora_config = LoraConfig(
    r=16, # IMPORTANT: rank of the low-rank update matrices (r=8 is a common smaller choice)
    lora_alpha=8, # IMPORTANT: scaling factor; the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.05, # IMPORTANT: dropout on the inputs to the LoRA layers (5% of activations are zeroed during training), which encourages generalization
    bias="none",
    target_modules = target_modules,
    task_type="CAUSAL_LM",
)
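
With these settings the low-rank update is multiplied by `lora_alpha / r = 8 / 16 = 0.5`. The sketch below illustrates that scaling with tiny hand-written matrices in plain Python. The shapes and values are made up for illustration, and it assumes PEFT's default scaling rule (`lora_alpha / r`, without rank-stabilized LoRA).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_B, lora_A, lora_alpha, r):
    """Scaled low-rank weight update: (lora_alpha / r) * B @ A."""
    scaling = lora_alpha / r
    return [[scaling * v for v in row] for row in matmul(lora_B, lora_A)]


# Toy example: a 3x3 weight update expressed with rank r = 2
r, lora_alpha = 2, 1          # same alpha/r ratio (0.5) as r=16, lora_alpha=8
lora_A = [[1.0, 0.0, 2.0],    # shape (r, in_features)
          [0.0, 1.0, 0.0]]
lora_B = [[1.0, 0.0],         # shape (out_features, r)
          [0.0, 2.0],
          [1.0, 1.0]]

delta_W = lora_delta(lora_B, lora_A, lora_alpha, r)
print(delta_W)
```

Raising `lora_alpha` relative to `r` makes the adapter's contribution larger; lowering it makes the fine-tuned model stay closer to the base weights.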

In [None]:
# Configure main parameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=base_dir,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    num_train_epochs=3.0, # IMPORTANT: decrease if overfitting is suspected, increase if underfitting
    per_device_train_batch_size = 4, # IMPORTANT: batch size per GPU; larger batches train faster but need more memory, smaller batches need less memory but train more slowly
    gradient_accumulation_steps = 4, # IMPORTANT: accumulates gradients over 4 batches before each optimizer step, giving an effective batch size of 4 x 4 = 16 without the memory cost of a larger batch
    optim='adamw_hf',
    learning_rate = 1e-5,
    fp16=True,
    max_grad_norm = 1, # IMPORTANT: gradient clipping threshold; gradients with a larger norm are rescaled, which guards against exploding gradients (lower values clip more aggressively)
    warmup_ratio = 0.03,
    group_by_length = False, # IMPORTANT: set to True to batch samples of similar length together when input lengths vary widely
    lr_scheduler_type = "linear",
)


In [None]:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 25,989,120 || all params: 3,452,462,720 || trainable%: 0.7527704745208661
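
The reported count of trainable parameters can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below assumes the OpenLLaMA 3B v2 dimensions (hidden size 3200, MLP size 8640, 26 decoder layers, vocabulary 32000). These values are not stated in the notebook and are best checked against the model's `config.json`.

```python
def lora_params(in_features, out_features, r):
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed architecture values (verify against the model's config.json)
hidden, mlp, layers, vocab = 3200, 8640, 26, 32000
r = 16

per_layer = (
    4 * lora_params(hidden, hidden, r)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, mlp, r)    # gate_proj, up_proj
    + lora_params(mlp, hidden, r)        # down_proj
)
total = layers * per_layer + lora_params(hidden, vocab, r)  # plus lm_head

print(f"{total:,}")                            # 25,989,120
print(f"{100 * total / 3_452_462_720:.4f}%")   # share of all parameters
```

Under those assumptions the total matches the `trainable params` figure printed above.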


In [None]:
# Split for training and testing

dataset = Dataset.from_pandas(df).train_test_split(test_size=0.05, seed=42)

In [None]:
# Define the trainer

trainer = SFTTrainer(
    model,
    train_dataset=dataset['train'],
    eval_dataset = dataset['test'],
    dataset_text_field="text",
    max_seq_length=256,
    args=training_args,
)

Map:   0%|          | 0/949 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]
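
The 949/50 split shown above, together with the batch settings, determines how many optimizer steps the run takes. A rough sketch of that arithmetic follows. It assumes a single GPU and mirrors how the `transformers` `Trainer` counted steps around the time of this notebook; the accounting details can differ between versions.

```python
import math

train_rows = 949                 # training examples after the 5% test split
per_device_batch = 4             # per_device_train_batch_size
grad_accum = 4                   # gradient_accumulation_steps
epochs = 3.0                     # num_train_epochs

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(train_rows / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = math.ceil(epochs * updates_per_epoch)

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

These numbers line up with the `checkpoint-59` and `checkpoint-177` folders written during training.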



In [None]:
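# Keep the normalization layers in float32 while training with fp16
for name, module in trainer.model.named_modules():
  if "norm" in name:
    module.to(torch.float32)  # nn.Module.to() converts the parameters in place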

# Step 3: Run The Model 🏃

In [None]:
# This is the actual Fine-Tuning

with mlflow.start_run(run_name='run_name_of_choice'):
  trainer.train()



Epoch,Training Loss,Validation Loss
0,No log,2.472192
2,No log,1.590965


Checkpoint destination directory /content/drive/MyDrive/LLM-Movie-Training/checkpoint-59 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory /content/drive/MyDrive/LLM-Movie-Training/checkpoint-119 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory /content/drive/MyDrive/LLM-Movie-Training/checkpoint-177 already exists and is non-empty.Saving will proceed but saved results may be invalid.


In [None]:
# We will use checkpoint 177, the last checkpoint written by this run
# If run in your own Google Drive, use the highest-numbered checkpoint folder that your run produced

!ls "/content/drive/MyDrive/LLM-Movie-Training"

 checkpoint-119   checkpoint-222  'Criterion Training'	  movies.csv
 checkpoint-148   checkpoint-59    data.csv		  movies.gsheet
 checkpoint-177   checkpoint-74   'LoRA Movie Training'   runs


In [None]:
model_path = 'openlm-research/open_llama_3b_v2'

In [None]:
tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = LlamaForCausalLM.from_pretrained(
    model_path, device_map='auto',
)

In [None]:
# IMPORTANT: Use the highest checkpoint from your base directory here

peft_model_id = '/content/drive/MyDrive/LLM-Movie-Training/checkpoint-177'

In [None]:
# Load the fine-tuned LoRA adapter on top of the base model

peft_model = PeftModel.from_pretrained(model, peft_model_id)

# Step 4: Test The Model 🧪

In [None]:
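def generate_prediction(test_string):
    # Note: the training template has a blank line after "### Instruction:" and a
    # newline after "### Response:"; this inference prompt omits both
    prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{test_string}

### Response:"""

    # Tokenize the prompt and move it to the GPU
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    # Generate with the adapter-wrapped model loaded above
    generation_output = peft_model.generate(input_ids=input_ids, max_new_tokens=128)

    # Decode prompt and completion together
    generated_text = tokenizer.decode(generation_output[0])

    # Keep only the text between "### Response:" and the next "###"
    response_text = extract_response_text(generated_text)

    return response_text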

def extract_response_text(input_string):
    start_marker = '### Response:'
    end_marker = '###'

    start_index = input_string.find(start_marker) + len(start_marker)
    end_index = input_string.find(end_marker, start_index)
    response_text = input_string[start_index:end_index].strip() if end_index != -1 else input_string[start_index:].strip()

    return response_text
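
To see what the extraction step does, it helps to run it on a hand-written string shaped like the decoded model output. The sketch below is a standalone variant (the name `extract_response` and the sample text are illustrative, not from the notebook) that uses `str.partition` and also handles the case where the `### Response:` marker is missing.

```python
def extract_response(decoded, start_marker="### Response:", end_marker="###"):
    """Return the text between the response marker and the next '###'."""
    _, found, tail = decoded.partition(start_marker)
    if not found:
        # Marker missing: nothing reliable to extract
        return ""
    # Cut at the next '###' (e.g. the '### End' marker appended during training)
    response, _, _ = tail.partition(end_marker)
    return response.strip()


# Made-up sample shaped like a decoded generation
sample = (
    "### Instruction:\nCreate a detailed description ...\n\n"
    "### Response:\nA retired detective takes one last case.\n### End"
)
print(extract_response(sample))
```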


In [None]:
test_string = "Create a detailed description for the following movie from Director: Stanley Kubrick, belonging to the Genre: Crime"
response = generate_prediction(test_string)
print(test_string+'\n')
print(response)

Create a detailed description for the following movie from Director: Stanley Kubrick, belonging to the Genre: Crime

A man is hired to kill a man who has the ability to predict the future.


In [None]:
test_string = "Create a detailed description for the following movie from Director: Wes Anderson, belonging to the Genre: Western"
response = generate_prediction(test_string)
print(test_string+'\n')
print(response)

Create a detailed description for the following movie from Director: Wes Anderson, belonging to the Genre: Western

A young man, who is a cowboy, is on his way to a town to find his father. He is accompanied by his dog, who is a very smart dog. The dog is very helpful to the young man. The young man is very happy to have his dog with him.


In [None]:
test_string = "Create a detailed description for the following movie from Director: Steven Spielberg, belonging to the Genre: Comedy, Thriller"
response = generate_prediction(test_string)
print(test_string+'\n')
print(response)

Create a detailed description for the following movie from Director: Steven Spielberg, belonging to the Genre: Comedy, Thriller

A young boy named Elliott is sent to live with his grandmother in New York City after his parents divorce. He is very excited to see his grandmother, but he is also very nervous. He is afraid that she will not like him and that he will not be able to make friends.


In [None]:
test_string = "Create a detailed description for the following movie from Director: Alfred Hitchcock, belonging to the Genre: Drama"
response = generate_prediction(test_string)
print(test_string+'\n')
print(response)

Create a detailed description for the following movie from Director: Alfred Hitchcock, belonging to the Genre: Drama

A young woman, who is a widow, is being stalked by a man who has been released from prison.


# All Done! 🙂