We are going to use a Llama3 model to carry out a study and use case on the categorisation of films into different categories, in this case we are going to focus on the use of promting. We will also use validation, for the output of the model in addition to regular expressions, we will combine the model with another model and we will also use a forced truncation of the words to see if it is effective.
 

In [1]:
#!pip3 install langchain_community

In [2]:
from langchain_community.llms import Ollama
from langchain.chains import LLMChain

In [3]:
import warnings
warnings.simplefilter('ignore')

# Data

In [4]:
import pandas as pd

with open("data/movie_plots_tc.csv", encoding="utf-8",errors="ignore") as csv_file :
    df = pd.read_csv(csv_file, sep=";")
    
#we mix the dataset
df_shuffled = df.sample(frac=1).reset_index(drop=True)

plots = df_shuffled["Plot"]
labels = df_shuffled["Genre"]

In [5]:
df_shuffled.head()

Unnamed: 0,Plot,Genre
0,"Helga Larson Hanson is living in Sweden, but i...",drama
1,"Set in medieval England, the plot concerns the...",comedy
2,Joe Holt works for the Armstrong Rubber Goods ...,comedy
3,"In the film's opening scene, Mark comes to Ale...",drama
4,"All About Lily Chou-Chou follows two boys, Sh?...",drama


In [6]:
categories = labels.unique()
max_length = max([len(s.split()) for s in plots ])
print(categories)
print(type(categories))
print(max_length)

['drama' 'comedy' 'western']
<class 'numpy.ndarray'>
2958


# Create a Model

In [7]:
#We create the model, we are going to try that the model is not creative, but that the answers are as controlled as possible, which is interesting for classification.
llm = Ollama(model="llama3",temperature=0, top_p=0.5, repeat_penalty=0.9)

#We create a function to facilitate the use of the model
def invokeLLama(categories,plots):
    task_llm =  f""" As a film cataloguing specialist, classify the following movie description into a single category. Choose only from the predefined categories listed. Provide your answer as a single category name, without additional comments or explanations.
    Available Categories: {categories}
    Movie Description: 
    "{plots}"
    Category:
    """
    response=llm.invoke(task_llm)
    return response

def validate_category(categories, output):
    categories = categories
    if output not in categories:
        return "Unknown"
    return output



In [8]:
yhat = []
for i in range(10):
    response=invokeLLama(categories,plots[i])
    response = validate_category(categories,response)
    yhat.append(response)


In [9]:
hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
print("acc: ", sum(hits)/len(yhat))


acc:  1.0


# Improve the model with batch processing

We first create an example to see the output and to be able to process the data correctly, then 

In [10]:
#The security of the model we only put it to see how we could validate other types of data, surely this data is not correct, although we could check it if we wanted to.
task_llm =  f""" As a film cataloguing specialist, classify the following movies descriptions into a single category per film. Choose only from the predefined categories listed. Provide your response as a single category name, without additional comments or explanations only a percentage of how confident you are of the category with the format: 'movie position' - ‘category’ - ‘safety %’
example of valid output: 
1 - {categories[0]} - 70%
2 - {categories[1]} - 90%
3 - {categories[2]} - 55%
Available Categories: {categories}

Movie Description:
1 - "{plots[0]}"
2 - "{plots[1]}"
3 - "{plots[11207]}"

Category:
"""
response=llm.invoke(task_llm)
print(response)

Based on the text, the categories that apply are:

1. **Fantasy**: The story features magical elements, such as the jester's ability to switch between different personalities, and the use of magic to manipulate the story's events.
2. **Adventure**: The story features a quest, a secret passage, and a battle to save the day, which are all elements of the adventure genre.
3. **Romance**: The story features romantic elements, such as the jester's love for the princess and the romantic tension between Rick and Jenny.
4. **Comedy**: The story features comedic elements, such as the jester's antics, the misunderstandings, and the slapstick humor.
5. **Action**: The story features action-packed scenes, such as the joust, the battle, and the chase.
6. **Mystery**: The story features elements of mystery, such as the jester's true identity, the identity of the Black Fox, and the mystery of the secret passage.
7. **Sports**: The story features skiing and snowboarding as a central plot point, which 

In [11]:
# we create a validation, the outputs will not be good if they don't have a ‘-’.

# Output validation
from pydantic import BaseModel, field_validator
from typing import List

# Validate response format - check if it actually contains hyphen ("-")
class ResponseChecks(BaseModel):
    data: List[str]

    @field_validator("data")
    def check(cls, value):
        for item in value:
            if len(item) > 0:
                assert "-" in item, "String does not contain hyphen."

In [12]:
# We use the output pattern to take only the data we are interested in, since the model can generate (as we have seen above) text that we are not interested in, this could be solved with finetuning, but it is not the aim of this example:

import re
def extract_pattern(response):
    pattern = r"(\d+)\s\-\s+(\w+)"
    matches = re.findall(pattern, response)
    return matches


In [13]:
# We create the function to facilitate the work of adding batch queries

def invokeBatchLLama(categories,plots):
    #dinamic part of the prompt
    movie_description = "\n".join([f"{i + 1} - \'{plots}'" for i,plots in enumerate(plots)])
    task_llm =  f""" As a film cataloguing specialist, classify the following movies descriptions into a single category per film. Choose only from the predefined categories listed. Provide your response as a single category name, without additional comments or explanations with the format: 'movie position' - ‘category’ 
    example of valid output: 
    1 - {categories[0]}
    2 - {categories[1]} 
    3 - {categories[2]} 
    
    Available Categories: "{categories}"

    Movie Description:
    {movie_description}

    Remember, never say anything other than position and categories.
    Category:
    """
    response=llm.invoke(task_llm)
    return response


In [14]:
#Flow test on a small batch size
response = invokeBatchLLama(categories,plots[0:3])
ResponseChecks(data = [response])
response = extract_pattern(response)
print(response)

[('1', 'comedy'), ('2', 'comedy'), ('3', 'comedy')]


## Summarise before using the model 

We use a lighter intermediate model to reduce the processing load required to perform this analysis, and by optimising the summary words we can even reduce the processing time. Also we cannot say the right batch size because each description has a very variable length, we are trying to standardise.


In [15]:
from transformers import pipeline, BartTokenizer

def summarise_bart(plots,min_length=10,max_length=35,max_text=700):
    #We do this to ensure that we never add more size to the model than it can handle
    plots = [" ".join(plot.split()[0:max_text]) for plot in plots ]
    # Initialise the summary pipeline with DistilBART, which is a lighter version of BART.
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    # Summarise descriptions only if they are larger than the maximum size we allow, otherwise we will save computation.
    summarized_plots = [summarizer(plot,min_length=min_length, max_length= max_length)[0]['summary_text'] if len(plot.split())>max_length else plot for plot in plots]

    return summarized_plots





In [21]:
Batch_size=45
Examples = 180
yhat=[]
for i in range(0,Examples,Batch_size):
    response = summarise_bart(plots[i:i+Batch_size],max_length=35)
    response = invokeBatchLLama(categories,response)
    ResponseChecks(data = [response])
    response = extract_pattern(response)
    response = [response[j][1] for j in range(len(response))]
    yhat.extend(response)
 
#we see how many times we got it right    
hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
print("acc: ", sum(hits)/len(yhat))
print(len(yhat))

acc:  0.6592592592592592
135


We can see how it works correctly, we can also see how the longer the summaries are the better the model works, although we can also use a much smaller batch size. We can search for the optimum by performing a function and saving the best values.

In [33]:
def best_values_Llama3(Batch_size,Examples,max_length,categories,plots):
    yhat=[]
    for i in range(0,Examples,Batch_size):
        response = summarise_bart(plots[i:i+Batch_size],max_length=max_length)
        response = invokeBatchLLama(categories,response)
        ResponseChecks(data = [response])
        response = extract_pattern(response)
        response = [response[j][1] for j in range(len(response))]
        yhat.extend(response)
    
    #we see how many times we got it right    
    hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
    acc = sum(hits)/len(yhat)
    return acc, yhat

In [35]:
#We are looking for the best combination of parameters to carry out this task.
Batch_size=[5,10,20,30,40,50]
max_length=[20,30,40,50,60,70]

combination = []
acc = []
for i in Batch_size:
    for j in max_length:
        try:
            accuracy , _ = best_values_Llama3(i, 100, j, categories, plots)
            acc.append(accuracy)
        except Exception as e:
            # Handles error, model does not give correct answers when saturated
            print(f"Error with Batch_size={i} and max_length={j}: {e}")
            acc.append("Error")
        finally:
            # Adds the combination of parameters used
            combination.append((i, j))
        


Error with Batch_size=30 and max_length=60: 1 validation error for ResponseChecks
data
  Assertion failed, String does not contain hyphen. [type=assertion_error, input_value=['Here is the classificat...categories listed:\n\n'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.6/v/assertion_error
Error with Batch_size=30 and max_length=70: 1 validation error for ResponseChecks
data
  Assertion failed, String does not contain hyphen. [type=assertion_error, input_value=["I'm ready to help. Plea...r the movie summaries."], input_type=list]
    For further information visit https://errors.pydantic.dev/2.6/v/assertion_error
Error with Batch_size=40 and max_length=50: 1 validation error for ResponseChecks
data
  Assertion failed, String does not contain hyphen. [type=assertion_error, input_value=['I apologize for the mis...se:\n\nCategory:\n    '], input_type=list]
    For further information visit https://errors.pydantic.dev/2.6/v/assertion_error
Error with Ba

In [45]:
import numpy as np
numeric_results = [x for x in acc if isinstance(x, (int, float))]
max_value = max(numeric_results)
max_parameters= combination[np.argmax(numeric_results)]
print("The greatest success is: ", max_value)
print("With the parameters: ", max_parameters)

The greatest success is:  0.75
With the parameters:  (10, 20)


In [17]:
#we have to use a very low batch size because some descriptions are really long, saturating the model and not allowing it to generate the response correctly.
Batch_size=1
Examples = 20
yhat=[]
for i in range(0,Examples,Batch_size):
    response = invokeBatchLLama(categories,plots[i:i+Batch_size])
    ResponseChecks(data = [response])
    response = extract_pattern(response)
    response = [response[j][1] for j in range(len(response))]
    yhat.extend(response)
 
#we see how many times we got it right    
hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
print("acc: ", sum(hits)/len(yhat))
print(len(yhat))


acc:  0.9
20


If we compare the model by analysing the movies one by one with the two models working together we see that the first one is much more accurate, it has almost no faults but the processing time becomes much larger, increasing the computational cost as well as the time. We can see that indeed a bert type model as the one done above with retraining can give better results for such simple tasks, however this is only one way of introducing a model into a data stream.

# We remove all the words that do not contribute anything in order to reduce the batch size as much as possible and optimise time.

In [26]:
import nltk 
from transformers import pipeline


#here we will have all the words that do not contribute much to a conversation, this is not typical to do in a model that uses transformers, however we want to see how it can work.
stopwords_en = nltk.corpus.stopwords.words("english")

def invokeBatchLLama_2(categories,plots,max_length=50):
    
    #We do this to ensure that we never add more size to the model than it can handle and remove words that have no value, we also truncate to maximum length so that we can run large batches.
    plots = [" ".join(plot.split()[0:max_length]) for plot in plots if plot.split() not in stopwords_en]
    
    #dinamic part of the prompt
    movie_description = "\n".join([f"{i + 1} - \'{plots}'" for i,plots in enumerate(plots)])
    task_llm =  f""" As a film cataloguing specialist, classify the following movies descriptions into a single category per film. Choose only from the predefined categories listed. Provide your response as a single category name, without additional comments or explanations with the format: 'movie position' - ‘category’ 
    example of valid output: 
    1 - {categories[0]}
    2 - {categories[1]} 
    3 - {categories[2]} 
    
    Available Categories: "{categories}"

    Movie Description:
    {movie_description}

    Remember, never say anything other than position and categories.
    Category:
    """
    response=llm.invoke(task_llm)
    return response


In [49]:
Batch_size=20
Examples = 180
yhat=[]
for i in range(0,Examples,Batch_size):
    response = invokeBatchLLama_2(categories,plots[i:i+Batch_size],max_length=55)
    ResponseChecks(data = [response])
    response = extract_pattern(response)
    response = [response[j][1] for j in range(len(response))]
    yhat.extend(response)
 
#we see how many times we got it right    
hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
print("acc: ", sum(hits)/len(yhat))
print(len(yhat))

acc:  0.6222222222222222
180


In [50]:
def best_values_Llama3_2(Batch_size,Examples,max_length,categories,plots):
    yhat=[]
    for i in range(0,Examples,Batch_size):
        response = invokeBatchLLama_2(categories,plots[i:i+Batch_size],max_length=max_length)
        ResponseChecks(data = [response])
        response = extract_pattern(response)
        response = [response[j][1] for j in range(len(response))]
        yhat.extend(response)
    
    #we see how many times we got it right    
    hits = [1 if yhat[i]==labels[i] else 0 for i in range(len(yhat))]
    acc = sum(hits)/len(yhat)
    return acc, yhat

In [56]:
Batch_size=[5,15,30]
max_length=[25,45,60]

combination_2 = []
acc_2 = []
for i in Batch_size:
    for j in max_length:
        try:
            accuracy , _ = best_values_Llama3_2(i, 60, j, categories, plots)
            acc_2.append(accuracy)
        except Exception as e:
            # Handles error, model does not give correct answers when saturated
            print(f"Error with Batch_size={i} and max_length={j}: {e}")
            acc_2.append("Error")
        finally:
            # Adds the combination of parameters used
            combination_2.append((i, j))

Error with Batch_size=30 and max_length=45: 1 validation error for ResponseChecks
data
  Assertion failed, String does not contain hyphen. [type=assertion_error, input_value=["I'm ready to help. Plea...sition and categories."], input_type=list]
    For further information visit https://errors.pydantic.dev/2.6/v/assertion_error
Error with Batch_size=30 and max_length=60: 1 validation error for ResponseChecks
data
  Assertion failed, String does not contain hyphen. [type=assertion_error, input_value=["I'm ready to help. The ... Movie\n\nPosition: 30"], input_type=list]
    For further information visit https://errors.pydantic.dev/2.6/v/assertion_error


In [55]:
import numpy as np
numeric_results = [x for x in acc_2 if isinstance(x, (int, float))]
max_value = max(numeric_results)
max_parameters= combination_2[np.argmax(numeric_results)]
print("The greatest success is: ", max_value)
print("With the parameters: ", max_parameters)

The greatest success is:  0.8
With the parameters:  (15, 60)


Finally, we can see how the model itself, by cutting the descriptions of the movies and removing the words that do not add value, can perform a fairly accurate accuracy in a fairly efficient and fast way, so we could consider it as a good alternative. However, it is highly dependent on the size of the descriptions and if the descriptions are too long, the model will fail.

# Conclusions

We have seen how to put a current model such as llama3 into a data stream for processing, and we have also seen that we can chain several models together to perform tasks more efficiently. We have also seen ways to process the output with validation and regular expressions. 

However, the best way to perform this task would be to retrain the model (as we did with the Bert model) to perform this particular task and not use a general, unmodified model. In the following task will focus on re-training the model for this particular task.