# Classification with an LLM

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach will be to build a model that labels the dataset.
In this notebook, we will try to label the data using an LLM, namely GPT-4o mini from OpenAI.

## Labeling the Data
Since labeled data is required for analysis, we manually labeled a subset of the dataset. We divided part of the data among team members and labeled each movie plot based on a categorical scale:
<ul>
    <li><b>-1</b> : Peaceful</li>
    <li><b>0</b> : Ambiguous (Uncertain level of violence)</li>
    <li><b>1</b> : Violent</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.
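One common way to quantify how much two annotators agree beyond chance is Cohen's kappa. The block below is a minimal sketch of that measure in plain Python; it is not the method used in this project, and the two label lists are hypothetical values chosen only to illustrate the computation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently,
    # following their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six plots (illustrative only, not project data).
annotator_1 = [-1, 0, 1, 1, 0, -1]
annotator_2 = [-1, 1, 1, 0, 0, -1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the annotators agree about as often as chance would predict, which would suggest the scale itself is subjective.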

## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots available, we will use most of the labeled items for the training set. We will keep five plots as the testing set to evaluate the model (alternatively, we may use the entire dataset and assess labeling quality across the final labeled set) </li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>

## Model

We will use the GPT-4o-mini model \
https://platform.openai.com/docs/overview \
https://platform.openai.com/docs/models#gpt-4o-mini

### References

https://platform.openai.com/docs/guides/prompt-engineering \
https://medium.com/discovery-at-nesta/how-to-use-gpt-4-and-openais-functions-for-text-classification-ad0957be9b25



### Imports

In [62]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline

In [2]:
# Add the directory that contains data_loading.py to sys.path
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [3]:
#load the data
data_loader = DataLoader(raw,clean)
CleanData = data_loader.clean_movie_data()
PlotData = data_loader.plot_data()


load plot data



In [6]:
#load the labelled data
ViolentLabel,ViolentData = data_loader.human_labelled_data()
display(ViolentLabel.drop(["Unnamed: 0"],axis = 1))
display(ViolentData.head())

Unnamed: 0,Violence level,Label
0,Peaceful,-1
1,Mild,0
2,Violent,1


Unnamed: 0_level_0,Answer
Wikipedia movie ID,Unnamed: 1_level_1
113454,0
909664,1
1028671,0
1336564,0
1472852,-1


In [9]:
TestSet = pd.merge(ViolentData,PlotData, left_index=True,right_index=True, how = "inner")
print("Number of test point :",TestSet.shape[0])
TestSet.head()

Number of test point : 146


Unnamed: 0_level_0,Answer,Plot
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
113454,0,The film tells the story of a mob hit man and...
909664,1,The film follows the personal relationship bet...
1028671,0,"According to Devil's Playground, at the age of..."
1336564,0,"Jim Slater , is in search of stolen money, to ..."
1472852,-1,"David ""Dave"" Whiteman and his wife, Barbara, ..."


In [10]:
TestSet.iloc[0]["Plot"]

'The film tells the story of a mob hit man  and hit woman  who fall in love, even though they have been hired to kill each other.'

In [19]:
fraction = 0.2

# Split the data between train and validation
TrainingSet,ValidationSet = train_test_split(TestSet, test_size=fraction, random_state=21)

print(TrainingSet.shape[0])
print(ValidationSet.shape[0])

116
30


In [20]:
TrainingSet.head()

Unnamed: 0_level_0,Answer,Plot
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
14168925,1,Cops take sexual advantage of the men they pul...
26057620,0,Yugoslav partisans grimly crop the hair of a v...
26015405,0,Tharadas is a ruthless smuggler whose uncle wa...
34954266,-1,"The genesis and metamorphoses of a film, from ..."
15217227,-1,An erotic drama about a writer involved in a p...


## LLM - GPT-4o-mini

In [22]:
# Now import the Classifier class
sys.path.append(os.path.abspath("../model/"))

#import the custom classifier from src/model !
from OpenIA_utility import GPT4mini_ViolenceClassifier

### Prompt Engineering

We developed a prompt for the classification task. \
The prompt contains a clear violence scale, in which each label (Peaceful, Mild, Violent) is explained, followed by a clear instruction.

If the model's performance is not good enough, we will add examples.

In [23]:
#init our classifier
Classifier = GPT4mini_ViolenceClassifier()

In [24]:
print(Classifier.Content)

### Violence scale : ###
        - **Peaceful**: The text describes no violence. There are no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers.
        - **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. Mild action or suspense is allowed without explicit harm.
        - **Violent**: The text describe violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. It a prominent feature of the film.

### Instructions ###
Assign a violence level from the scale above to each movie plot provided below.


To ensure the model returns its result in the expected format, we defined a function for it to call (OpenAI function calling).

The final function is:

```python
function = {
   "name": "Assign_violence_level",
   "description": "Predict the level of violence of a list of movie plots",
   "parameters": {
       "type": "object",
       "properties": {
           "prediction": {
               "type": "array",
               "items": {
                   "type": "string",
                   "enum": [
                       "Peaceful",
                       "Mild",
                       "Violent"
                   ]
               },
               "description": "The list of violence levels for each movie plot, in the same order as the plots were provided."
           }
       },
       "required": [
           "prediction"
       ]
   }
}

```

The model has to return an array of predictions, one for each plot.
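Because the schema only constrains the label values and not the length of the array, it can help to validate the returned arguments before using them. The sketch below assumes the arguments arrive as a JSON string (with the legacy `functions` call format, this should be the value of `completion.choices[0].message.function_call.arguments`). The helper name `parse_predictions` is hypothetical and not part of the project's classifier.

```python
import json

ALLOWED_LABELS = {"Peaceful", "Mild", "Violent"}


def parse_predictions(arguments_json, expected_count):
    """Parse function-call arguments and check them against the schema."""
    payload = json.loads(arguments_json)
    if not isinstance(payload, dict) or not isinstance(payload.get("prediction"), list):
        raise ValueError("Missing or malformed 'prediction' array.")
    predictions = payload["prediction"]
    # Every item must be one of the three labels from the enum.
    unknown = [p for p in predictions if not isinstance(p, str) or p not in ALLOWED_LABELS]
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    # The schema cannot enforce the array length, so check it here.
    if len(predictions) != expected_count:
        raise ValueError(f"Expected {expected_count} labels, got {len(predictions)}.")
    return predictions


# Example with a hand-written arguments string for two plots.
labels = parse_predictions('{"prediction": ["Mild", "Violent"]}', expected_count=2)
print(labels)
```

A length mismatch raises an error instead of silently misaligning labels and plots, so the affected batch can be retried or split.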

### Verify the number of tokens

The model has a maximum number of input tokens: for our model, the limit is 128,000 tokens. For code efficiency (and cost), we would like to avoid looping over each plot and resending the prompt every time. We therefore count tokens to see how many plots we can send in one request.

We implemented a function that tokenizes a text the same way the model does and returns the number of tokens, from which we derive the price.

In [25]:
# Count for the prompt
TotalPromt = Classifier.Prompt_size
print("For the prompt we have",TotalPromt,"tokens, pricing :",TotalPromt*Classifier.pricing)

# Count for the test set
TotalTest = TestSet["Plot"].apply(Classifier.count_tokens).sum()
print("For the test dataset we have",TotalTest+TotalPromt,"tokens, pricing :",TotalTest*Classifier.pricing,"batch",int(TotalTest/(Classifier.max_input-TotalPromt)+1))

# Count for the whole dataset
TotalData = CleanData["Plot"].apply(Classifier.count_tokens).sum()
batch = int(TotalData/(Classifier.max_input-TotalPromt))+1
print("For the whole dataset we have",TotalData,"tokens, pricing :",(TotalData+batch*TotalPromt)*Classifier.pricing,"batch",batch)

For the prompt we have 150 tokens, pricing : 2.2499999999999998e-05
For the test dataset we have 9406 tokens, pricing : 0.0013884 batch 1
For the whole dataset we have 8423387 tokens, pricing : 1.26499305 batch 66


The test data fits in a single request, so no batching is needed there. The whole dataset, however, has to be split into batches.
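The batch count and price printed for the whole dataset can be reproduced with a few lines of arithmetic. The sketch below uses the numbers from the output above; the per-token price of 1.5e-7 is inferred from the printed prompt cost (2.25e-05 for 150 tokens), and the function name is hypothetical.

```python
def estimate_batches_and_cost(data_tokens, prompt_tokens, max_input, price_per_token):
    """Rough estimate: the prompt is resent once per batch."""
    capacity = max_input - prompt_tokens        # tokens left for plots in each request
    batches = data_tokens // capacity + 1       # same as int(data / capacity) + 1
    total_tokens = data_tokens + batches * prompt_tokens
    return batches, total_tokens * price_per_token


batches, cost = estimate_batches_and_cost(
    data_tokens=8_423_387,    # whole dataset, from the output above
    prompt_tokens=150,        # prompt size, from the output above
    max_input=128_000,        # input limit stated in this notebook
    price_per_token=1.5e-7,   # inferred from the printed prompt cost
)
print(batches, round(cost, 8))
```

This is only a lower bound on the number of batches: plots cannot be split across requests and the per-plot formatting adds tokens, so the actual batching may need slightly more requests.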

### Create Batch

Since we send multiple plots at a time, we need to format them together in a way the model can understand. Here is an example:

In [26]:
TrainingSet.iloc[0]["Plot"]

'Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.'

In [27]:
#first parameter is the number of the plot, second is the text
Classifier.format_plot(0,TrainingSet.iloc[0]["Plot"])

'plot0:Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.\n\n'
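Judging only from the outputs shown in this notebook, the formatting could be as simple as the sketch below. These are hypothetical re-implementations for illustration, not the actual methods of `GPT4mini_ViolenceClassifier`; the `start=1` default is an assumption based on the batch output, which begins at `plot1`.

```python
def format_plot(number, text):
    """Prefix a plot with its number and end it with a blank line."""
    return f"plot{number}:{text}\n\n"


def format_batch(plots, start=1):
    """Concatenate several plots, numbering them from `start`."""
    return "".join(format_plot(i, plot) for i, plot in enumerate(plots, start=start))


example = format_batch(["First plot text.", "Second plot text."])
print(example)
```

Numbering each plot gives the model an explicit anchor for the order of its answers, which matters because the predictions come back as a plain array.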

How do we create the batches? Each batch must be smaller than the model's input limit. We implemented a function that combines the prompt with the formatted plots, adding plots until the limit is reached. The function returns the ID of the first plot of each batch.
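The greedy strategy described here can be sketched as follows. This is an illustration under stated assumptions, not the project's `batch_plots` implementation: it works on a list of per-plot token counts (assumed to already include the formatting overhead) and returns positional indices rather than movie IDs.

```python
def batch_start_indices(token_counts, prompt_tokens, max_input):
    """Greedily pack plots into batches; return the index of the first plot of each batch."""
    starts = []
    used = 0
    for i, n in enumerate(token_counts):
        if prompt_tokens + n > max_input:
            raise ValueError(f"Plot {i} alone exceeds the input limit.")
        # Open a new batch for the first plot or when the current one is full.
        if not starts or used + n > max_input:
            starts.append(i)
            used = prompt_tokens  # every batch starts with the prompt
        used += n
    return starts


# Toy numbers: a 10-token prompt and a 100-token limit.
starts = batch_start_indices([40, 50, 30, 60, 10], prompt_tokens=10, max_input=100)
print(starts)
```

The "batch margin" values printed by the real function presumably correspond to the unused tokens left in each batch, that is `max_input - used` at the moment a batch is closed.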

In [28]:
#for the test set (no batch needed)
Classifier.batch_plots(TestSet)

#for the dataset !
clean_batch = Classifier.batch_plots(CleanData)

prompt size 150
Final number of batchs 1
prompt size 150
batch margin: 476
batch margin: 281
batch margin: 74
batch margin: 681
batch margin: 1045
batch margin: 794
batch margin: 347
batch margin: 342
batch margin: 398
batch margin: 489
batch margin: 481
batch margin: 885
batch margin: 524
batch margin: 873
batch margin: 286
batch margin: 42
batch margin: 723
batch margin: 325
batch margin: 183
batch margin: 332
batch margin: 152
batch margin: 204
batch margin: 793
batch margin: 737
batch margin: 264
batch margin: 321
batch margin: 780
batch margin: 185
batch margin: 1411
batch margin: 196
batch margin: 522
batch margin: 384
batch margin: 240
batch margin: 297
batch margin: 734
batch margin: 22
batch margin: 525
batch margin: 297
batch margin: 64
batch margin: 80
batch margin: 170
batch margin: 143
batch margin: 1088
batch margin: 229
batch margin: 11
batch margin: 387
batch margin: 316
batch margin: 502
batch margin: 275
batch margin: 283
batch margin: 69
batch margin: 13
batch margin

### Assess the model on the training set
We now call the model on labelled plots and compare the results with the human labels. Note that there is no model to train here, but we still split the labelled data into a training set and a validation set. The training set gives us labelled data to compare against while refining the prompt and testing the model. The validation set stays unseen until the end, so that we can evaluate the final model on it. If the results are good enough, we will label the whole dataset.

In [29]:
# safeguard flag to avoid calling the API (and paying for it) by accident
Run_test = True

Here is the format of the final call to the model:

```python
completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": Content},
            {"role": "user","content": Text}
        ],
        functions=[function],
        function_call={"name": "Assign_violence_level"},
    )
```

We format the batch:

In [42]:
Text = Classifier.format_batch(TrainingSet[0:20])
Text

'plot1:Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.\n\nplot2:Yugoslav partisans grimly crop the hair of a village quintet of women believed to have consorted with the occupational Nazis. Four, for various reasons, have indeed - and their seducer is a lone, swaggering sergeant whom the partisans briskly emasculate. Escorted out of town by the sheepish Nazis, the forlorn ladies link up, patriotically and romantically, with a band of tough mountain guerrillas.\n\nplot3:Tharadas is a ruthless smuggler whose uncle was murdered by Rajesh. Rajesh is married to Thulasi who has dark past with Tharadas. Tharadas\'s cousin kills Rajesh and accuses Tharadas of the murder. Rajesh\'s partner Prasad and Thulasi get revenge on Tharadas, and Tharadas kills Chandu in turn. This film broke several Box Office Records and was the 3rd highest grosser in 1984. {{citation needed}}\n\nplot4:The genesis and metamorphose

In [43]:
prediction = None

if Run_test :
    prediction = Classifier.Call_API(Text)

In [44]:
print(prediction)

['Mild', 'Violent', 'Violent', 'Peaceful', 'Peaceful', 'Violent', 'Mild', 'Peaceful', 'Peaceful', 'Mild', 'Mild', 'Mild', 'Mild', 'Peaceful', 'Mild', 'Mild', 'Violent', 'Mild', 'Peaceful', 'Violent']


In [45]:
len(prediction)

20

In [46]:
Pred = pd.DataFrame(prediction,index=TrainingSet[0:20].index, columns=["Prediction"])

def to_level(data) :
    match data:
        case 'Peaceful':
            return -1.0
        case 'Mild':
            return 0.0
        case 'Violent':
            return 1.0
        case _:
            raise ValueError(f"Unexpected prediction label: {data!r}")

Pred["Result"] = Pred["Prediction"].apply(to_level)
Pred

Unnamed: 0_level_0,Prediction,Result
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
14168925,Mild,0.0
26057620,Violent,1.0
26015405,Violent,1.0
34954266,Peaceful,-1.0
15217227,Peaceful,-1.0
28074095,Violent,1.0
27573594,Mild,0.0
34319106,Peaceful,-1.0
24631652,Peaceful,-1.0
19278375,Mild,0.0


In [68]:
Compare = pd.DataFrame(index=TrainingSet[0:20].index)
Compare["Label"] = TrainingSet[0:20]["Answer"]
Compare["Prediction"] = Pred["Result"]
Compare

Unnamed: 0_level_0,Label,Prediction
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
14168925,1,0.0
26057620,0,1.0
26015405,0,1.0
34954266,-1,-1.0
15217227,-1,-1.0
28074095,0,1.0
27573594,1,0.0
34319106,0,-1.0
24631652,-1,-1.0
19278375,-1,0.0


In [77]:
accuracy = metrics.accuracy_score(Compare["Label"],Compare["Prediction"])
print("accuracy",accuracy*100)

m1 = abs(Compare["Label"]-Compare["Prediction"]).mean()
print("abs distance",m1)

#penalize more if opposite result 
m2 = np.power(Compare["Label"]-Compare["Prediction"], 2).mean()
print("pow distance",m2)

print("Correct label",(Compare["Label"]==Compare["Prediction"]).sum())
print("incorrect but close",(abs(Compare["Label"]-Compare["Prediction"])==1).sum())
print("opposite",(abs(Compare["Label"]-Compare["Prediction"])==2).sum())

accuracy 45.0
abs distance 0.6
pow distance 0.7
Correct label 9
incorrect but close 10
opposite 1
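A confusion table shows which kinds of errors dominate, which the aggregate metrics hide. The sketch below uses only the ten rows of `Compare` displayed above (not all twenty), so its counts are partial and differ from the totals just printed.

```python
from collections import Counter

# The ten (Label, Prediction) rows of Compare displayed above.
labels = [1, 0, 0, -1, -1, 0, 1, 0, -1, -1]
predictions = [0, 1, 1, -1, -1, 1, 0, -1, -1, 0]

# Count each (human label, model prediction) pair.
confusion = Counter(zip(labels, predictions))

levels = [-1, 0, 1]
print("label \\ pred", *levels, sep="\t")
for true_level in levels:
    row = [confusion[(true_level, pred_level)] for pred_level in levels]
    print(true_level, *row, sep="\t")
```

On these ten rows, every error is off by one level, and the most frequent one is a plot labelled 0 by humans that the model predicts as 1.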


In [78]:
Compare[(abs(TrainingSet[0:20]["Answer"]-Pred[0:20]["Result"])==1)]

Unnamed: 0_level_0,Label,Prediction
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
14168925,1,0.0
26057620,0,1.0
26015405,0,1.0
28074095,0,1.0
27573594,1,0.0
34319106,0,-1.0
19278375,-1,0.0
34573784,1,0.0
12048544,1,0.0
3868321,-1,0.0


<b>Testing history :</b>
- The first model performs poorly. Its predictions are rarely opposite to the human label, but they are often off by one level: on the first 20 training plots, 9 are correct, 10 are off by one, and 1 is opposite. Part of the gap may come from labelling errors: as discussed, the notion of violence is complex, and we should take this into account while labelling. We will try to improve accuracy by giving the model typical examples. Another problem is that when too many plots are sent in one request, the model omits some labels.
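Adding typical examples to the prompt (few-shot prompting) could look like the sketch below. The helper `add_examples` and the "### Examples ###" section layout are assumptions made for illustration. The two example plots and their human labels are taken from the labelled data shown earlier in this notebook (0 maps to Mild, 1 to Violent).

```python
def add_examples(content, examples):
    """Append labelled example plots to the system prompt."""
    lines = [content.rstrip(), "", "### Examples ###"]
    for plot, label in examples:
        lines.append(f"Plot: {plot}")
        lines.append(f"Violence level: {label}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


base_content = (
    "### Instructions ###\n"
    "Assign a violence level from the scale above to each movie plot provided below.\n"
)
examples = [
    ("The film tells the story of a mob hit man and hit woman who fall in love, "
     "even though they have been hired to kill each other.", "Mild"),
    ("Cops take sexual advantage of the men they pull over on the beat. A newbie cop "
     "is forced to choose between his emotions and his ambition.", "Violent"),
]
augmented = add_examples(base_content, examples)
print(augmented)
```

Any plot used as an example should be excluded from the evaluation sets, otherwise the measured accuracy would be inflated.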

### Apply to the whole dataset