# Classification with LLM

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach will be to build a model that labels the dataset.
In this notebook, we will try to label the data using an LLM, namely GPT-4o-mini from OpenAI.

## Labeling the Data
Since labeled data is required for analysis, we manually labeled a subset of the dataset. We divided part of the data among team members and labeled each movie plot based on a categorical scale:
<ul>
    <li><b>-1</b> : Peaceful</li>
    <li><b>0</b> : Ambiguous (Uncertain level of violence)</li>
    <li><b>1</b> : Violent</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.
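One simple way to quantify this subjectivity is the share of annotator pairs that agree on a plot. The sketch below is only an illustration: the function name `pairwise_agreement` and the example labels are hypothetical and are not taken from the project data.

```python
from itertools import combinations

def pairwise_agreement(labels_by_plot):
    """Return the fraction of annotator pairs that gave the same label.

    labels_by_plot maps a plot ID to the list of labels (-1, 0 or 1)
    given by the different annotators for that plot.
    """
    agree = 0
    total = 0
    for labels in labels_by_plot.values():
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total if total else float("nan")

# Made-up labels for illustration only (not taken from the project data).
example = {
    "plot_a": [1, 1, 1],
    "plot_b": [0, -1, 0],
    "plot_c": [-1, -1],
}
agreement = pairwise_agreement(example)
print(f"pairwise agreement: {agreement:.2f}")
```

A value close to 1 would suggest the scale is applied consistently, while a low value would indicate that the labels are strongly annotator-dependent.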

## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots available, we will use most of the labeled items for the training set. We will keep five plots as the testing set to evaluate the model (alternatively, we may use the entire dataset and assess labeling quality across the final labeled set) </li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>

## Model

We will use the GPT-4o-mini model \
https://platform.openai.com/docs/overview \
https://platform.openai.com/docs/models#gpt-4o-mini

### References

https://platform.openai.com/docs/guides/prompt-engineering \
https://medium.com/discovery-at-nesta/how-to-use-gpt-4-and-openais-functions-for-text-classification-ad0957be9b25



### Imports

In [1]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

%matplotlib inline

In [2]:
# Add the directory containing data_loading.py to sys.path
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [3]:
#load the data
data_loader = DataLoader(raw,clean)
CleanData = data_loader.clean_movie_data()
PlotData = data_loader.plot_data()


load plot data



In [4]:
#load the labelled data
ViolentLabel,ViolentData = data_loader.human_labelled_data()
display(ViolentLabel)
display(ViolentData.head())

Unnamed: 0.1,Unnamed: 0,Violence level,Label
0,0,Low,-1
1,1,Medium/not sure,0
2,2,Violent,1


Unnamed: 0_level_0,Answer,Date
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
113454,0.0,12.11
2295249,0.0,12.11
2967223,1.0,11.11
3868321,-1.0,12.11
4481850,0.0,11.11


In [5]:
TestSet = PlotData.loc[ViolentData.index]
TestSet.head()

Unnamed: 0_level_0,Plot
Wikipedia movie ID,Unnamed: 1_level_1
113454,The film tells the story of a mob hit man and...
2295249,Theresa is one of the twelve jurors who have t...
2967223,Robert Crain is a German pacifist living in In...
3868321,"Zach is a soul collector, an angel who collec..."
4481850,"Adam, a 23-year-old self-employed security tec..."


In [6]:
TestSet.iloc[0]["Plot"]

'The film tells the story of a mob hit man  and hit woman  who fall in love, even though they have been hired to kill each other.'

In [7]:
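cut = 10
ValidationSet = TestSet[:cut]
# Note: slicing from cut+1 leaves the plot at position `cut` out of both sets;
# TestSet[cut:] would include it in the training set.
TrainingSet = TestSet[cut+1:]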

In [8]:
TrainingSet.head()

Unnamed: 0_level_0,Plot
Wikipedia movie ID,Unnamed: 1_level_1
14144257,NW Mounted Police Sgt. Ward is assigned to tra...
15124157,The film is an examination of libidinous baske...
15217227,An erotic drama about a writer involved in a p...
15323823,{{expand section}} The film is set in Hanky Pa...
16494542,Richard Beck is a police detective who believ...


## LLM - GPT-4o-mini

In [9]:
# Now import the Classifier class
sys.path.append(os.path.abspath("../model/"))

from OpenIA_utility import GPT4mini_ViolenceClassifier

### Prompt Engineering

We developed a prompt for the classification task. \
The prompt contains a clear violence scale, in which each label (Peaceful, Mild, Violent) is explained, followed by a clear instruction.

If the model's performance is not good enough, we will add examples to the prompt.

In [10]:
#init our classifier
Classifier = GPT4mini_ViolenceClassifier()

In [11]:
print(Classifier.Content)

### Violence scale : ###
        - **Peaceful**: The text describes no violence. There are no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers.
        - **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. TMild action or suspense is allowed without explicit harm.
        - **Violent**: The text describe violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. It a prominent feature of the film.

### Instructions ###
Assign a violence level from the scale above to each movie plot provided below.


To ensure that the model returns its result in the expected format, we defined a function schema that the model has to call.

The final function is:

```python
function = {
   "name": "Assign_violence_level",
   "description": "Predict the level of violence of a list of movie plots",
   "parameters": {
       "type": "object",
       "properties": {
           "prediction": {
               "type": "array",
               "items": {
                   "type": "string",
                   "enum": [
                       "Peaceful",
                       "Mild",
                       "Violent"
                   ]
               },
               "description": "The list of violence levels for each movie plot, in the same order as the plots were provided."
           }
       },
       "required": [
           "prediction"
       ]
   }
}

```

The model has to return an array of predictions, one for each plot.
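Because the schema cannot force the array to contain exactly one entry per plot, it can be useful to check the reply before using it. The following is a minimal sketch under stated assumptions: `parse_prediction` is a hypothetical helper, and the mapping of Peaceful, Mild and Violent to -1, 0 and 1 assumes that "Mild" plays the role of the "Ambiguous" human label.

```python
import json

# Assumed mapping from the prompt's labels to the human labelling scale.
LABEL_TO_SCORE = {"Peaceful": -1, "Mild": 0, "Violent": 1}

def parse_prediction(arguments_json, n_plots):
    """Decode the function-call arguments and check them against the batch size."""
    prediction = json.loads(arguments_json)["prediction"]
    if len(prediction) != n_plots:
        raise ValueError(f"expected {n_plots} labels, got {len(prediction)}")
    unknown = set(prediction) - set(LABEL_TO_SCORE)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return [LABEL_TO_SCORE[label] for label in prediction]

# Hand-written arguments string standing in for a real model reply.
raw_arguments = '{"prediction": ["Violent", "Peaceful", "Mild"]}'
scores = parse_prediction(raw_arguments, n_plots=3)
print(scores)
```

Raising an error on a length mismatch avoids silently shifting labels onto the wrong plots when the model skips or merges an entry.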

### Verify the number of tokens

The model has a maximum number of input tokens; for our model, the limit is 128,000 tokens. To keep the code efficient and the cost low, we would like to avoid looping over each plot and resending the prompt every time. We therefore look at the number of tokens to see how many plots we can send at once.

We implemented a function that tokenizes a text in the same way as the model and returns the number of tokens and the pricing.

In [12]:
# Count for the prompt
TotalPromt = Classifier.Prompt_size
print("For the prompt we have",TotalPromt,"tokens, pricing :",TotalPromt*Classifier.pricing)

# Count for the test set
TotalTest = TestSet["Plot"].apply(Classifier.count_tokens).sum()
print("For the test dataset we have",TotalTest+TotalPromt,"tokens, pricing :",TotalTest*Classifier.pricing,"batch",int(TotalTest/(Classifier.max_input-TotalPromt)+1))

# Count for the whole dataset
TotalData = CleanData["Plot"].apply(Classifier.count_tokens).sum()
batch = int(TotalData/(Classifier.max_input-TotalPromt))+1
print("For the whole dataset we have",TotalData,"tokens, pricing :",(TotalData+batch*TotalPromt)*Classifier.pricing,"batch",batch)

For the prompt we have 148 tokens, pricing : 2.2199999999999998e-05
For the test dataset we have 4123 tokens, pricing : 0.00059625 batch 1
For the whole dataset we have 8423387 tokens, pricing : 1.26497325 batch 66


The test data fits in a single request, so no batching is needed there. The whole dataset, however, has to be split into batches.
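The arithmetic behind the figures printed above can be reproduced without the classifier. This sketch assumes a price of 0.15 USD per million input tokens, which is inferred from the printed prompt cost (148 tokens for about 2.22e-05); `estimate_cost` is a hypothetical helper, not part of `OpenIA_utility`.

```python
def estimate_cost(total_tokens, prompt_tokens, max_input, price_per_token):
    """Estimate the number of batches and the input cost of labelling a corpus."""
    # Each batch repeats the prompt, so only the remainder is available for plots.
    capacity = max_input - prompt_tokens
    n_batches = total_tokens // capacity + 1
    # Plot tokens are paid once, prompt tokens once per batch.
    cost = (total_tokens + n_batches * prompt_tokens) * price_per_token
    return n_batches, cost

n_batches, cost = estimate_cost(
    total_tokens=8_423_387,
    prompt_tokens=148,
    max_input=128_000,
    price_per_token=0.15 / 1_000_000,  # assumed input price in USD
)
print(n_batches, round(cost, 4))
```

This count is a lower bound: plots cannot be split across requests, so real batches leave some unused margin and a few more batches may be needed.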

### Create Batch

As we send multiple plots at a time, we need to format them together in a way the model can understand. Here is an example:

In [13]:
TrainingSet.iloc[0]["Plot"]

'NW Mounted Police Sgt. Ward is assigned to track down a lethal and mysterious villain known only as The Leader, who is trying to locate a secret gold mine in the Indian territory. At the time The Leader made the decision to stop Ward, he pits the Indians against the Mounties, whom he blames for creating trouble in the area.'

In [14]:
#first parameter is the number of the plot, second is the text
Classifier.format_plot(0,TrainingSet.iloc[0]["Plot"])

'plot0:NW Mounted Police Sgt. Ward is assigned to track down a lethal and mysterious villain known only as The Leader, who is trying to locate a secret gold mine in the Indian territory. At the time The Leader made the decision to stop Ward, he pits the Indians against the Mounties, whom he blames for creating trouble in the area.\n\n'
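Judging only from the output above, the formatting could be sketched as follows. These are hypothetical re-implementations for illustration: the real `format_plot` and `format_batch` live in `OpenIA_utility` and may differ, and this `format_batch` takes a plain list of strings rather than a DataFrame.

```python
def format_plot(index, text):
    """Prefix a plot with its position in the batch and end it with a blank line."""
    return f"plot{index}:{text}\n\n"

def format_batch(plots):
    """Concatenate the formatted plots in their original order."""
    return "".join(format_plot(i, text) for i, text in enumerate(plots))

# Two made-up plots, used only to show the resulting layout.
batch_text = format_batch(["A quiet family dinner.", "Two gangs fight over a city."])
print(batch_text)
```

Numbering the plots gives the model an explicit anchor for the order of its answers, which matters because the predictions are returned as a positional array.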

How do we create the batches? Each batch must be smaller than the model's input limit. We implemented a function that combines the prompt with the formatted plots, adding plots until the limit is reached. The function returns the ID of the first plot of each batch.
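A greedy packing of this kind can be sketched as below. This is an assumption about the approach rather than the actual `batch_plots` code: `batch_starts` is a hypothetical name, it works on precomputed token counts, and it returns positions in the list instead of Wikipedia movie IDs.

```python
def batch_starts(token_counts, prompt_tokens, max_input):
    """Return the position of the first plot of each batch (greedy packing)."""
    # The prompt is sent with every batch, so it reduces the room left for plots.
    capacity = max_input - prompt_tokens
    starts = []
    used = capacity + 1  # forces a new batch to be opened on the first plot
    for position, n_tokens in enumerate(token_counts):
        if n_tokens > capacity:
            raise ValueError(f"plot {position} does not fit in an empty batch")
        if used + n_tokens > capacity:
            # The current batch is full: this plot starts the next one.
            starts.append(position)
            used = 0
        used += n_tokens
    return starts

# Small made-up token counts with a tiny limit, to make the splits visible.
starts = batch_starts([40, 50, 30, 60, 10], prompt_tokens=20, max_input=120)
print(starts)
```

With these numbers, each batch can hold 100 plot tokens, so the first two plots share a batch and the third plot opens a second one.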

In [15]:
#for the test set (no batch needed)
Classifier.batch_plots(TestSet)

#for the dataset !
clean_batch = Classifier.batch_plots(CleanData)

prompt size 148
Final number of batchs 1
prompt size 148
batch margin: 478
batch margin: 283
batch margin: 76
batch margin: 684
batch margin: 1047
batch margin: 796
batch margin: 349
batch margin: 344
batch margin: 400
batch margin: 491
batch margin: 483
batch margin: 887
batch margin: 526
batch margin: 875
batch margin: 288
batch margin: 44
batch margin: 725
batch margin: 327
batch margin: 185
batch margin: 334
batch margin: 154
batch margin: 206
batch margin: 795
batch margin: 739
batch margin: 266
batch margin: 323
batch margin: 782
batch margin: 187
batch margin: 1413
batch margin: 198
batch margin: 524
batch margin: 386
batch margin: 242
batch margin: 299
batch margin: 736
batch margin: 24
batch margin: 527
batch margin: 299
batch margin: 66
batch margin: 82
batch margin: 172
batch margin: 145
batch margin: 1090
batch margin: 231
batch margin: 13
batch margin: 389
batch margin: 318
batch margin: 504
batch margin: 277
batch margin: 285
batch margin: 71
batch margin: 15
batch margin

In [16]:
T = TrainingSet[0:10]

### Assess the model on the training set
Here we go! We now call the model on the labelled plots and compare the results with the human-labelled data. Note that we do not have to train the model, but we still split the labelled data into a training set and a validation set. This is because we want a set of labelled data to compare results against while tuning the prompt and testing the model, and a separate set that was never used during that tuning to test the final version. If the results are good enough, we will label the whole dataset.

In [17]:
# safety flag to avoid calling the (paid) API by accident
Run = True

Here is the format of the final call to the model:

```python
completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": Content},
            {"role": "user","content": Text}
        ],
        functions=[function],
        function_call={"name": "Assign_violence_level"},
    )
```
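The OpenAI API reference marks `functions` and `function_call` as deprecated in favour of `tools` and `tool_choice`. As a hedged sketch, the same request could be expressed with the newer parameters as follows; `build_request` is a hypothetical helper, and the trimmed-down schema is only there to keep the example short.

```python
def build_request(content, text, function):
    """Build keyword arguments for chat.completions.create using `tools`."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": content},
            {"role": "user", "content": text},
        ],
        # `tools` wraps the same JSON schema that `functions` accepted.
        "tools": [{"type": "function", "function": function}],
        # Force the model to call this specific function.
        "tool_choice": {"type": "function", "function": {"name": function["name"]}},
    }

# Trimmed-down schema used only to keep the example short.
function = {
    "name": "Assign_violence_level",
    "parameters": {"type": "object", "properties": {}},
}
request = build_request("system prompt", "plot0:...\n\n", function)
print(request["tool_choice"])

# With a configured client, the call would be:
#   completion = client.chat.completions.create(**request)
# and the JSON arguments would be found in:
#   completion.choices[0].message.tool_calls[0].function.arguments
```

Building the request as a plain dictionary also makes it possible to inspect or log it before any paid call is made.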

We format the batch

In [21]:
Text = Classifier.format_batch(T)

In [22]:
prediction = None

if Run :
    prediction = Classifier.Call_API(Text)

In [27]:
print(prediction)

['Violent', 'Peaceful', 'Peaceful', 'Peaceful', 'Violent', 'Peaceful', 'Peaceful', 'Peaceful', 'Peaceful']
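To compare such predictions with the human labels, one option is to map the text labels to the -1/0/1 scale and count matches. The sketch below uses made-up values and a hypothetical `compare` helper; the mapping of "Mild" to 0 is an assumption, and the length check guards against the model returning fewer labels than plots.

```python
from collections import Counter

# Assumed mapping from the prompt's labels to the human labelling scale.
LABEL_TO_SCORE = {"Peaceful": -1, "Mild": 0, "Violent": 1}

def compare(predicted_labels, human_scores):
    """Return the accuracy and a Counter of (human, predicted) pairs."""
    if not predicted_labels:
        raise ValueError("no predictions to compare")
    if len(predicted_labels) != len(human_scores):
        raise ValueError("predictions and human labels differ in length")
    predicted = [LABEL_TO_SCORE[label] for label in predicted_labels]
    confusion = Counter(zip(human_scores, predicted))
    correct = sum(n for (human, pred), n in confusion.items() if human == pred)
    return correct / len(predicted), confusion

# Made-up values for illustration only (not results from the notebook).
accuracy, confusion = compare(
    ["Violent", "Peaceful", "Mild", "Violent"],
    [1, -1, 1, 1],
)
print(accuracy)
print(confusion)
```

The `(human, predicted)` counts show which kinds of confusion dominate, for example violent plots being rated as mild, which is more informative than accuracy alone on a three-level scale.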


### Apply to the whole dataset