# Classification with an LLM

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach is to build a model that labels the dataset.
In this notebook, we try to label the data using an LLM, namely GPT-4o mini from OpenAI.

## Labeling the Data
Since labeled data is required for analysis, we manually labeled a subset of the dataset. We divided part of the data among team members and labeled each movie plot based on a categorical scale:
<ul>
    <li><b>-1</b> : Peaceful</li>
    <li><b>0</b> : Ambiguous (Uncertain level of violence)</li>
    <li><b>1</b> : Violent</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.
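One simple way to quantify that subjectivity is the share of annotator pairs that agree on each plot. The sketch below is only an illustration: the `annotations` dictionary and its values are made-up placeholders, not our actual labels, and `pairwise_agreement` is a hypothetical helper.

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that gave the same label to one plot."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two labels per plot")
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical example: plot ID -> labels given by different annotators
annotations = {
    "plot_a": [1, 1, 1],    # everyone agrees
    "plot_b": [0, 1, 0],    # one annotator disagrees
    "plot_c": [-1, 0, 1],   # no two annotators agree
}

per_plot = {pid: pairwise_agreement(labels) for pid, labels in annotations.items()}
mean_agreement = sum(per_plot.values()) / len(per_plot)
```

A low mean agreement would suggest that part of the model's "errors" reflect disagreement between humans rather than a flaw in the model.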

## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots available, we will use most of the labeled items for the training set. We will keep five plots as the testing set to evaluate the model (alternatively, we may use the entire dataset and assess labeling quality across the final labeled set) </li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>

## Model

We will use the GPT-4o-mini model \
https://platform.openai.com/docs/overview \
https://platform.openai.com/docs/models#gpt-4o-mini

### References

https://platform.openai.com/docs/guides/prompt-engineering \
https://medium.com/discovery-at-nesta/how-to-use-gpt-4-and-openais-functions-for-text-classification-ad0957be9b25



### Imports

In [39]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline

In [40]:
# Add the directory containing data_loading.py to sys.path
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [41]:
#load the data
data_loader = DataLoader(raw,clean)
CleanData = data_loader.clean_movie_data()
PlotData = data_loader.plot_data()


load plot data



In [42]:
#load the labelled data
ViolentLabel,ViolentData = data_loader.human_labelled_data()
display(ViolentLabel.drop(["Unnamed: 0"],axis = 1))
display(ViolentData.head())

Unnamed: 0,Violence level,Label
0,Peaceful,-1
1,Mild,0
2,Violent,1


Unnamed: 0_level_0,Answer
Wikipedia movie ID,Unnamed: 1_level_1
113454,0
909664,1
1028671,0
1336564,0
1472852,-1


In [43]:
TestSet = pd.merge(ViolentData,PlotData, left_index=True,right_index=True, how = "inner")
print("Number of test point :",TestSet.shape[0])
TestSet.head()

Number of test point : 146


Unnamed: 0_level_0,Answer,Plot
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
113454,0,The film tells the story of a mob hit man and...
909664,1,The film follows the personal relationship bet...
1028671,0,"According to Devil's Playground, at the age of..."
1336564,0,"Jim Slater , is in search of stolen money, to ..."
1472852,-1,"David ""Dave"" Whiteman and his wife, Barbara, ..."


In [44]:
TestSet.iloc[0]["Plot"]

'The film tells the story of a mob hit man  and hit woman  who fall in love, even though they have been hired to kill each other.'

In [45]:
fraction = 0.2

# Split the data between train and validation
TrainingSet,ValidationSet = train_test_split(TestSet, test_size=fraction, random_state=21)

print(TrainingSet.shape[0])
print(ValidationSet.shape[0])

116
30


In [46]:
TrainingSet.head()

Unnamed: 0_level_0,Answer,Plot
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
14168925,1,Cops take sexual advantage of the men they pul...
26057620,0,Yugoslav partisans grimly crop the hair of a v...
26015405,0,Tharadas is a ruthless smuggler whose uncle wa...
34954266,-1,"The genesis and metamorphoses of a film, from ..."
15217227,-1,An erotic drama about a writer involved in a p...


## LLM - GPT-4o-mini

In [47]:
# Now import the Classifier class
sys.path.append(os.path.abspath("../model/"))

#import the custom classifier from src/model !
from OpenIA_utility import GPT4mini_ViolenceClassifier

### Prompt Engineering

We developed a prompt for the classification task. \
The prompt contains a clear violence scale, in which each label (Peaceful, Mild, Violent) is explained, followed by a clear instruction.

To help the model perform better, we add examples for each label.

In [48]:
#init our classifier
Classifier = GPT4mini_ViolenceClassifier()

In [49]:
print(Classifier.Content)

### Violence scale : ###
        - **Peaceful**: The text describes no physical or psychological violence. There are no aggression, conflict, or harm to people or animals. Suitable for all audiences.
        - **Mild**: The level of violence is medium or uncertain. There might be moments of tension or mild conflict, such as arguments. Mild action or suspense is allowed.
        - **Violent**: The text describe extreme physical or psychological violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, rape. It a prominent feature of the film.

### Instructions ###
Assign a level of violence to each plot movie plot below. Respond with a dictionary where the keys are the plot numbers (e.g., 'plot1', 'plot2') and the values are the levels of violence ('Peaceful', 'Mild', 'Violent')


In [50]:
print(Classifier.Example)

Here are some examples for each label :
        - **Peaceful**: plot1 :'norma and malcolm miochaels are a middle-aged married couple who are in the midst of a midlife crisis. both decide to separate and begin their lives anew away from each other. however, problems ensue once they discover that they are no longer as young as they used to be.'
        plot2:'in the 1840s, two sisters fall in love with the same man. while drunk, the man writes a letter proposing marriage to the wrong one.'
        - **Mild**: plot1:'set in the 19th century, the plot centered around a man  who is falsely accused murder. the other side of the door was shot in monterrey, mexico.{{cite web}}'
        plot2:'in a desperate, but not-too-courageous, attempt to end his life, a man hires a murderer to do the job for him. soon, though, things are looking better and the he must now avoid the hit.'
        - **Violent**: plot1:'Richard Beck  is a police detective who believed that rape victims are to blame for the c

To make sure the model returns its result in the expected format, we defined a function schema for the API's function-calling feature.

The final function is:

```python
        self.function = {
           "name": "Assign_violence_level",
           "description": "Predict the level of violence of a list of movie plots",
           "parameters": {
               "type": "object",
               "properties": {
                   "prediction": {
                       "type": "array",
                       "items": {
                           "type": "string",
                           "enum": [
                               "Peaceful",
                               "Mild",
                               "Violent"
                           ]
                       },
                       "description": "The list of violence levels for each movie plot, in the same order as the plots were provided."
                   }
               },
               "required": [
                   "prediction"
               ]
           }
        }
```

The model has to return an array of predictions, one for each plot.
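With function calling, the arguments come back as a JSON string, so they still need to be parsed and checked. The sketch below shows one way to do that. `parse_predictions` and `example_arguments` are hypothetical names for illustration, not the code in `OpenIA_utility`.

```python
import json

ALLOWED = {"Peaceful", "Mild", "Violent"}

def parse_predictions(arguments_json, n_plots):
    """Parse the JSON arguments of the function call and validate them."""
    payload = json.loads(arguments_json)
    prediction = payload["prediction"]
    # One label per plot is expected; a shorter list means labels were dropped
    if len(prediction) != n_plots:
        raise ValueError(f"expected {n_plots} labels, got {len(prediction)}")
    unknown = set(prediction) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return prediction

# Hypothetical arguments string, shaped like the schema above
example_arguments = '{"prediction": ["Violent", "Mild", "Peaceful"]}'
labels = parse_predictions(example_arguments, n_plots=3)
```

Checking the length explicitly matters here because the predictions are later aligned with the plots by position only.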

### Verify the number of tokens

The model has a maximum number of input tokens. For our model, the limit is 128,000 tokens. For code efficiency (and cost), we would like to avoid looping over each plot and resending the prompt every time, so we look at the token counts to see how many plots fit in one request.

We implemented a function that tokenizes a text in the same way as the model and returns the number of tokens, which we use to estimate the price.

In [51]:
# Count for the prompt (and example)
TotalPromt = Classifier.Prompt_size
print("For the prompt we have",TotalPromt,"tokens, pricing :",TotalPromt*Classifier.pricing)

# Count for the test set
TotalTest = TestSet["Plot"].apply(Classifier.count_tokens).sum()
print("For the test dataset we have",TotalTest+TotalPromt,"tokens, pricing :",TotalTest*Classifier.pricing,"batch",int(TotalTest/(Classifier.max_input-TotalPromt)+1))

# Count for the whole dataset
TotalData = CleanData["Plot"].apply(Classifier.count_tokens).sum()
batch = int(TotalData/(Classifier.max_input-TotalPromt))+1
print("For the whole dataset we have",TotalData,"tokens, pricing :",(TotalData+batch*TotalPromt)*Classifier.pricing,"batch",batch)

For the prompt we have 519 tokens, pricing : 7.785e-05
For the test dataset we have 9775 tokens, pricing : 0.0013884 batch 1
For the whole dataset we have 8419540 tokens, pricing : 1.26814695 batch 67


We will need to split the data into batches.
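The arithmetic behind the last output line can be reproduced with plain Python. This is a sketch under stated assumptions: `estimate_cost` is a hypothetical helper, and the per-token price is not taken from the classifier but inferred from the printed prompt line (7.785e-05 for 519 tokens).

```python
def estimate_cost(plot_tokens, prompt_tokens, max_input, price_per_token):
    """Estimate the number of batches and the input cost for a set of plots."""
    usable = max_input - prompt_tokens          # room left for plots in one request
    batches = int(plot_tokens / usable) + 1     # same formula as in the cell above
    total_tokens = plot_tokens + batches * prompt_tokens  # prompt is resent per batch
    return batches, total_tokens * price_per_token

# Per-token price inferred from the printed output (assumption)
price = 7.785e-05 / 519

batches, cost = estimate_cost(
    plot_tokens=8_419_540, prompt_tokens=519, max_input=128_000, price_per_token=price
)
```

Note that this batch count is a lower bound: plots cannot be split across requests, so packing whole plots may require a few more batches.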

### Create Batch

Since we send several plots at a time, we need to format them together in a way the model can understand. Here is an example:

In [52]:
TrainingSet.iloc[0]["Plot"]

'Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.'

In [53]:
#first parameter is the number of the plot, second is the text
Classifier.format_plot(0,TrainingSet.iloc[0]["Plot"])

'plot0:Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.\n\n'

How do we create the batches? Each batch must be smaller than the model's input limit. We implemented a function that combines the prompt with the formatted plots, adding plots until the limit is reached. The function returns the index of the first plot of each batch.

In [54]:
#for the test set (no batch needed)
Classifier.batch_plots(TestSet)

#for the dataset !
#clean_batch = Classifier.batch_plots(CleanData)

size 519
Final number of batchs 1


[0]
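The idea behind such a batching function can be sketched as a greedy packing over token counts. This is an illustration, not the implementation of `Classifier.batch_plots`; `batch_starts` and its parameters are hypothetical.

```python
def batch_starts(token_counts, prompt_tokens, max_input):
    """Return the index of the first plot of each batch (greedy packing)."""
    starts = []
    used = 0
    for i, n in enumerate(token_counts):
        if prompt_tokens + n > max_input:
            raise ValueError(f"plot {i} alone exceeds the input limit")
        # Open a new batch for the first plot, or when this plot no longer fits
        if not starts or used + n > max_input:
            starts.append(i)
            used = prompt_tokens  # every batch starts with the prompt
        used += n
    return starts

# Toy numbers: a 20-token prompt and a 100-token limit
toy_starts = batch_starts([40, 50, 30, 60], prompt_tokens=20, max_input=100)
```

With these toy numbers the plots are grouped as `[0]`, `[1, 2]` and `[3]`, so the function returns the start indices 0, 1 and 3.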

### Assess the model on the training set
Here we go! We now call the model on the training set and compare its output with the human labels. Note that we do not train the model; we still split the labeled data into a training and a validation set because we want one set of labeled plots to compare against while tuning the prompt and testing the model, and a second set that the model has never seen, kept for the final evaluation. If the results are good enough, we will label the whole dataset.

In [55]:
# Safety flag: set to True only to actually call the (paid) API
Run_test = False

Here is the format of the final call of the model

```python
    completion = self.client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": self.Content},
            {"role": "user","content": Text},
            {"role": "assistant", "content": self.Example}
        ],
        functions=[self.function],
        function_call={"name": "Assign_violence_level"},
    )
```

We format the batch

In [56]:
Classifier.format_batch(TrainingSet[0:3])

"plot1:Cops take sexual advantage of the men they pull over on the beat. A newbie cop is forced to choose between his emotions and his ambition.\n\nplot2:Yugoslav partisans grimly crop the hair of a village quintet of women believed to have consorted with the occupational Nazis. Four, for various reasons, have indeed - and their seducer is a lone, swaggering sergeant whom the partisans briskly emasculate. Escorted out of town by the sheepish Nazis, the forlorn ladies link up, patriotically and romantically, with a band of tough mountain guerrillas.\n\nplot3:Tharadas is a ruthless smuggler whose uncle was murdered by Rajesh. Rajesh is married to Thulasi who has dark past with Tharadas. Tharadas's cousin kills Rajesh and accuses Tharadas of the murder. Rajesh's partner Prasad and Thulasi get revenge on Tharadas, and Tharadas kills Chandu in turn. This film broke several Box Office Records and was the 3rd highest grosser in 1984. {{citation needed}}\n\n"

We will also use smaller batches to improve the predictions.

In [57]:
prediction = []
BatchSize = 10

if Run_test :
    for i in range(0,int(np.ceil(TrainingSet.shape[0]/BatchSize))) :
        thisBatch = Classifier.format_batch(TrainingSet[i*BatchSize:min((i+1)*BatchSize,TrainingSet.shape[0])])
        pred = Classifier.Call_API(thisBatch)
        #print("pred",len(pred))
        prediction = prediction + pred
print("finish!")

finish!
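Splitting into fixed-size batches can be isolated in a small generator. Using a ceiling division for the number of batches avoids an empty trailing batch when the number of plots is an exact multiple of the batch size. `iter_batches` is a hypothetical helper shown on a plain list rather than on our DataFrame.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    n_batches = math.ceil(len(items) / batch_size)
    for i in range(n_batches):
        # Slicing past the end is safe: the last batch is simply shorter
        yield items[i * batch_size:(i + 1) * batch_size]

# 116 items, as in the training set, split into batches of 10
sizes = [len(batch) for batch in iter_batches(list(range(116)), 10)]
```

Here `sizes` contains eleven batches of 10 items followed by one batch of 6.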


In [58]:
print(prediction)

[]


In [59]:
if Run_test :
    Compare = pd.DataFrame(prediction,index=TrainingSet.index, columns=["Result"])
    
    def to_level(data) :
        match data:
            case 'Peaceful':
                return -1.0
            case 'Mild':
                return 0.0
            case 'Violent':
                return 1.0
            case _:
                raise ValueError(f"Unexpected label: {data!r}")
    
    Compare["Prediction"] = Compare["Result"].apply(to_level)
    
    Compare["Label"] = TrainingSet["Answer"]
    Compare.head()

In [60]:
if Run_test:
    name = "model_1_training"
    Compare.to_csv(clean+"/classification_result/"+name+".csv")

In [61]:
Compare = pd.read_csv(clean+"/classification_result/model_1_training.csv")
Run_test = False

In [62]:
accuracy = metrics.accuracy_score(Compare["Label"],Compare["Prediction"])
print("accuracy",accuracy*100)
    
m1 = abs(Compare["Label"]-Compare["Prediction"]).mean()
print("abs distance",m1)
    
    #penalize more if opposite result 
m2 = np.power(Compare["Label"]-Compare["Prediction"], 2).mean()
print("pow distance",m2)
    
print("Correct label",(Compare["Label"]==Compare["Prediction"]).sum())
print("incorrect but close",(abs(Compare["Label"]-Compare["Prediction"])==1).sum())
print("opposite",(abs(Compare["Label"]-Compare["Prediction"])==2).sum())

accuracy 66.37931034482759
abs distance 0.35344827586206895
pow distance 0.3879310344827586
Correct label 77
incorrect but close 37
opposite 2
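The three scores can be recomputed by hand from the counts printed above, which makes the role of each metric explicit: accuracy only counts exact matches, the absolute distance weights every error by how far off it is, and the squared distance penalizes opposite labels (distance 2) four times as much as adjacent ones.

```python
# Counts taken from the output above: distance |label - prediction| -> number of plots
distance_counts = {0: 77, 1: 37, 2: 2}

n = sum(distance_counts.values())                                  # 116 plots
accuracy = distance_counts[0] / n                                  # exact matches only
mean_abs = sum(d * c for d, c in distance_counts.items()) / n      # (37*1 + 2*2) / 116
mean_sq = sum(d ** 2 * c for d, c in distance_counts.items()) / n  # (37*1 + 2*4) / 116
```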


In [63]:
display(Compare[(abs(Compare["Label"]-Compare["Prediction"])==1)].head())

Unnamed: 0,Wikipedia movie ID,Result,Prediction,Label
1,26057620,Violent,1.0,0
2,26015405,Violent,1.0,0
7,34319106,Peaceful,-1.0,0
10,34573784,Mild,0.0,1
12,33440196,Peaceful,-1.0,0


In [89]:
display(Compare[(abs(Compare["Label"]-Compare["Prediction"])==2)])

Unnamed: 0_level_0,Result,Prediction,Label,distance
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
27573594,Peaceful,-1.0,1,2.0
2967223,Peaceful,-1.0,1,2.0


In [91]:
Compare["distance"] = abs(Compare["Label"]-Compare["Prediction"])
Compare.head()

Unnamed: 0_level_0,Result,Prediction,Label,distance
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
14168925,Violent,1.0,1,0.0
26057620,Violent,1.0,0,1.0
26015405,Violent,1.0,0,1.0
34954266,Peaceful,-1.0,-1,0.0
15217227,Peaceful,-1.0,-1,0.0


In [96]:
Compare["distance"].value_counts()

distance
0.0    77
1.0    37
2.0     2
Name: count, dtype: int64

In [93]:
Flourish = Compare.groupby("Label").value_counts()
Flourish

Label  Result    Prediction  distance
-1     Peaceful  -1.0        0.0         36
       Mild       0.0        1.0          7
 0     Mild       0.0        0.0         16
       Peaceful  -1.0        1.0          9
       Violent    1.0        1.0          9
 1     Violent    1.0        0.0         25
       Mild       0.0        1.0         12
       Peaceful  -1.0        2.0          2
Name: count, dtype: int64
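The grouped counts above are in effect a confusion matrix. As a small sketch, the per-class recall (share of plots with a given human label that the model labeled identically) can be derived from them; the `confusion` dictionary is copied by hand from the table, and `recall` is a hypothetical helper.

```python
# (human label, predicted label) -> count, copied from the table above
confusion = {
    (-1, -1): 36, (-1, 0): 7,
    (0, 0): 16, (0, -1): 9, (0, 1): 9,
    (1, 1): 25, (1, 0): 12, (1, -1): 2,
}

def recall(label):
    """Share of plots with this human label that the model labeled identically."""
    total = sum(count for (true, _), count in confusion.items() if true == label)
    return confusion.get((label, label), 0) / total

recalls = {label: recall(label) for label in (-1, 0, 1)}
```

This gives roughly 0.84 for Peaceful, 0.47 for Mild and 0.64 for Violent, so the ambiguous middle class is the hardest one for the model to match.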

In [94]:
Flourish.to_csv(clean+"/classification_result/flourish.csv")

<b>Testing history :</b>
- <ins>First model</ins>: accuracy around 0.5 (i.e. about 50%). The predictions are rarely completely wrong; they are usually close to the human label without matching it. This could also be due to errors made during labeling: as discussed, the notion of violence is complex, and we should take this into account when labeling. We will try to improve accuracy by giving the model typical *examples*. Another problem is that with too many movies in one request, the model forgets some labels. We will return a *dictionary* instead of a list, so that each plot number is associated with its returned class, and ultimately reduce the *batch size*.
- <ins>Add examples to the prompt</ins>: the results get a little better, but there are still many incorrect-but-close responses. The model tends to rate plots as more violent than expected. Next step: adapt the class labels.
- <ins>Clearer definitions and smaller batches</ins>: accuracy around 65%; the model is likely to overestimate violence.
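The benefit of a dictionary keyed by plot number is that a forgotten label becomes detectable instead of silently shifting all the following predictions. A minimal sketch, assuming keys of the form `plot1`, `plot2`, ... as in the prompt; `dict_to_ordered_labels` is a hypothetical helper.

```python
def dict_to_ordered_labels(response, n_plots):
    """Turn {'plot1': 'Mild', ...} into a list ordered by plot number."""
    keys = [f"plot{i}" for i in range(1, n_plots + 1)]
    missing = [key for key in keys if key not in response]
    if missing:
        # With a plain list, this situation would go unnoticed
        raise KeyError(f"no label returned for: {missing}")
    return [response[key] for key in keys]

# Hypothetical response; the key order does not matter
ordered = dict_to_ordered_labels({"plot2": "Violent", "plot1": "Peaceful"}, n_plots=2)
```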

### Apply to the whole dataset

In [26]:
Run_final = False

In [27]:
D = CleanData.iloc[400:500]["Plot"].apply(lambda x: Classifier.count_tokens(x))
D.sort_values(ascending=False) #problem

Wikipedia movie ID
43307       2583
3112996     1772
2267722     1464
12576808    1181
6961461     1147
            ... 
113454        31
4278437       30
14168925      29
25105934      26
19046062      24
Name: Plot, Length: 100, dtype: int64

In [28]:
CleanData.shape[0]

17064

In [29]:
int(np.ceil(CleanData.shape[0]/BatchSize))

1707

In [30]:
prediction = []
BatchSize = 10

if Run_final :
    for i in range(0,int(np.ceil(CleanData.shape[0]/BatchSize))) :
        if i%10 == 0 :
            print("Batch {}".format(i))
        thisBatch = Classifier.format_batch(CleanData[i*BatchSize:min((i+1)*BatchSize,CleanData.shape[0])])
        pred = Classifier.Call_API(thisBatch)
        #print(len(pred))
        prediction = prediction + pred
print("finish!")

finish!


In [31]:
len(prediction)

0

In [32]:
if Run_final :
    Result = pd.DataFrame(prediction,index=CleanData.index, columns=["Result"])
    
    def to_level(data) :
        match data:
            case 'Peaceful':
                return -1.0
            case 'Mild':
                return 0.0
            case 'Violent':
                return 1.0
            case _:
                raise ValueError(f"Unexpected label: {data!r}")
    
    Result["Prediction"] = Result["Result"].apply(to_level)
    Result.head()

In [33]:
if Run_final :
    name = "LLM_result"
    Result.to_csv(clean+"/classification_result/"+name+".csv")

In [34]:
if Run_final :
    Result["Prediction"].value_counts()

## Second LLM - Binary classification

Looking at the results, we decided to switch to a binary classification. We adjusted the prompt to classify plots as violent or non-violent and pushed the description of violence further toward the extreme.

In [35]:
# Now import the Classifier class
sys.path.append(os.path.abspath("../model/"))

#import the custom classifier from src/model !
from OpenIA_utility_binary import GPT4mini_ViolenceClassifier_extreme

In [36]:
Classifier = GPT4mini_ViolenceClassifier_extreme()

print(Classifier.Content)

### Violence scale : ###
        - **Non-violent**: The text describes no extreme physical or psychological violence. There are no recurrent aggression, conflict, or harm to people or animals.
        - **Violent**: The text describe extreme physical or psychological violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, rape. It is a prominent feature of the film.

### Instructions ###
Classify each plot between violent and non-violent. Respond with a dictionary where the keys are the plot numbers (e.g., 'plot1', 'plot2') and the values are the levels of violence ('Non-violent', 'Violent')
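The source of the binary classifier is not shown in this notebook. Assuming it mirrors the three-class setup, its function schema and label mapping might look like the sketch below. `binary_function` and `LEVELS` are hypothetical names; only the two label strings and the -1/1 mapping come from the cells in this section.

```python
# Hypothetical binary counterpart of the three-class schema shown earlier
binary_function = {
    "name": "Assign_violence_level",
    "description": "Classify each movie plot as violent or non-violent",
    "parameters": {
        "type": "object",
        "properties": {
            "prediction": {
                "type": "array",
                "items": {"type": "string", "enum": ["Non-violent", "Violent"]},
                "description": "One label per plot, in the order the plots were provided.",
            }
        },
        "required": ["prediction"],
    },
}

# Numeric encoding used when post-processing the binary results
LEVELS = {"Non-violent": -1.0, "Violent": 1.0}

allowed = binary_function["parameters"]["properties"]["prediction"]["items"]["enum"]
```

Keeping the enum and the numeric mapping in sync means any label the schema allows can be converted without hitting the error branch.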


In [73]:
Run_final = False

In [62]:
prediction = []
BatchSize = 10

if Run_final :
    for i in range(0,int(np.ceil(CleanData.shape[0]/BatchSize))) :
        if i%10 == 0 :
            print("Batch {}".format(i))
        thisBatch = Classifier.format_batch(CleanData[i*BatchSize:min((i+1)*BatchSize,CleanData.shape[0])])
        pred = Classifier.Call_API(thisBatch)
        #print(len(pred))
        prediction = prediction + pred
print("finish!")

Batch 0
Batch 10
Batch 20
Batch 30
Batch 40
Batch 50
Batch 60
Batch 70
Batch 80
Batch 90
Batch 100
Batch 110
Batch 120
Batch 130
Batch 140
Batch 150
Batch 160
Batch 170
Batch 180
Batch 190
Batch 200
Batch 210
Batch 220
Batch 230
Batch 240
Batch 250
Batch 260
Batch 270
Batch 280
Batch 290
Batch 300
Batch 310
Batch 320
Batch 330
Batch 340
Batch 350
Batch 360
Batch 370
Batch 380
Batch 390
Batch 400
Batch 410
Batch 420
Batch 430
Batch 440
Batch 450
Batch 460
Batch 470
Batch 480
Batch 490
Batch 500
Batch 510
Batch 520
Batch 530
Batch 540
Batch 550
Batch 560
Batch 570
Batch 580
Batch 590
Batch 600
Batch 610
Batch 620
Batch 630
Batch 640
Batch 650
Batch 660
Batch 670
Batch 680
Batch 690
Batch 700
Batch 710
Batch 720
Batch 730
Batch 740
Batch 750
Batch 760
Batch 770
Batch 780
Batch 790
Batch 800
Batch 810
Batch 820
Batch 830
Batch 840
Batch 850
Batch 860
Batch 870
Batch 880
Batch 890
Batch 900
Batch 910
Batch 920
Batch 930
Batch 940
Batch 950
Batch 960
Batch 970
Batch 980
Batch 990
Batch 1000


In [68]:
len(prediction)

17064

In [69]:
if Run_final :
    Result = pd.DataFrame(prediction,index=CleanData.index, columns=["Result"])
    
    def to_level(data) :
        match data:
            case 'Non-violent':
                return -1.0
            case 'Violent':
                return 1.0
            case _:
                raise ValueError(f"Unexpected label: {data!r}")
    
    Result["Prediction"] = Result["Result"].apply(to_level)
    Result.head()

In [70]:
Result

Unnamed: 0_level_0,Result,Prediction
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
975900,Violent,1.0
6631279,Non-violent,-1.0
77856,Non-violent,-1.0
21926710,Non-violent,-1.0
156558,Violent,1.0
...,...,...
25011340,Violent,1.0
7761830,Violent,1.0
1918494,Non-violent,-1.0
664006,Violent,1.0


In [71]:
if Run_final :
    name = "LLM_result_binary"
    Result.to_csv(clean+"/classification_result/"+name+".csv")

In [72]:
if Run_final :
    Result["Prediction"].value_counts()