# Classification with LLM

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach will be to build a model that labels the dataset.
In this Notebook, we will try to label the data using a LLM, namely GPT4mini from OpenIA.

## Labeling the Data
Since labeled data is required for analysis, we manually labeled a subset of the dataset. We divided part of the data among team members and labeled each movie plot based on a categorical scale:
<ul>
    <li><b>-1</b> : Peaceful</li>
    <li><b>0</b> : Ambiguous (Uncertain level of violence)</li>
    <li><b>1</b> : Violent</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.

## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots available, we will use most of the labeled items for the training set. We will keep five plots as the testing set to evaluate the model (alternatively, we may use the entire dataset and assess labeling quality across the final labeled set) </li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>

## Model

We will use the GPT-4o-mini model \
https://platform.openai.com/docs/overview \
https://platform.openai.com/docs/models#gpt-4o-mini

### reference


https://platform.openai.com/docs/guides/prompt-engineering \
https://medium.com/discovery-at-nesta/how-to-use-gpt-4-and-openais-functions-for-text-classification-ad0957be9b25 \



### Imports

In [17]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

import tiktoken
import os
import openai
from openai import OpenAI
import json

%matplotlib inline

In [2]:
# Add the project root directory (not src) to sys.path
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [3]:
#load the data
data_loader = DataLoader(raw,clean)
CleanData = data_loader.clean_movie_data()
PlotData = data_loader.plot_data()
PlotData


load plot data



Unnamed: 0_level_0,Plot
Wikipedia movie ID,Unnamed: 1_level_1
23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
31186339,The nation of Panem consists of a wealthy Capi...
20663735,Poovalli Induchoodan is sentenced for six yea...
2231378,"The Lemon Drop Kid , a New York City swindler,..."
595909,Seventh-day Adventist Church pastor Michael Ch...
...,...
34808485,"The story is about Reema , a young Muslim scho..."
1096473,"In 1928 Hollywood, director Leo Andreyev look..."
35102018,American Luthier focuses on Randy Parsons’ tra...
8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


In [4]:
#load the labelled data
ViolentLabel,ViolentData = data_loader.human_labelled_data()
display(ViolentLabel)
display(ViolentData.head())

Unnamed: 0.1,Unnamed: 0,Violence level,Label
0,0,Low,-1
1,1,Medium/not sure,0
2,2,Violent,1


Unnamed: 0_level_0,Answer,Date
Wikipedia movie ID,Unnamed: 1_level_1,Unnamed: 2_level_1
113454,0.0,12.11
2295249,0.0,12.11
2967223,1.0,11.11
3868321,-1.0,12.11
4481850,0.0,11.11


In [5]:
TestSet = PlotData.loc[ViolentData.index]
TestSet.head()

Unnamed: 0_level_0,Plot
Wikipedia movie ID,Unnamed: 1_level_1
113454,The film tells the story of a mob hit man and...
2295249,Theresa is one of the twelve jurors who have t...
2967223,Robert Crain is a German pacifist living in In...
3868321,"Zach is a soul collector, an angel who collec..."
4481850,"Adam, a 23-year-old self-employed security tec..."


In [6]:
TestSet.iloc[0]["Plot"]

'The film tells the story of a mob hit man  and hit woman  who fall in love, even though they have been hired to kill each other.'

In [7]:
cut = 10
ValidationSet = TestSet[:cut]
TrainingSet = TestSet[cut+1:]

In [68]:
TrainingSet.head()

Unnamed: 0_level_0,Plot
Wikipedia movie ID,Unnamed: 1_level_1
14144257,NW Mounted Police Sgt. Ward is assigned to tra...
15124157,The film is an examination of libidinous baske...
15217227,An erotic drama about a writer involved in a p...
15323823,{{expand section}} The film is set in Hanky Pa...
16494542,Richard Beck is a police detective who believ...


## LLM - GPT-4-mini

### Prompt Ingenieurring

In [107]:
Label = """
- **Peaceful**: The text describes no violence. There are no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers. 
- **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. TMild action or suspense is allowed without explicit harm.
- **Violent**: The text describe violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. It a prominent feature of the film."
"""

Instruction = "The following texts are plot of movies, separated by \\ For each plot, assign a level on the violence scale."

Example = "**:**  "

Content = f"{Label}\n\n### Instructions ###\n{Instruction}"


In [108]:
print(Content)


- **Peaceful**: The text describes no violence. There are no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers. 
- **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. TMild action or suspense is allowed without explicit harm.
- **Violent**: The text describe violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. It a prominent feature of the film."


### Instructions ###
The following texts are plot of movies, separated by \ For each plot, assign a level on the violence scale.


### Verify the number of tokens

The model have a maximum number of tokens ! For code fficiency (and money), we would like to avoid having to loop on each plot and recalling the prompt every time. We will look at the number of token to see how many plot we can put at the time. 

In [72]:
#parameters 

Model_max = 128000 #maximum imput tokens for the model
pricing = 1/1000000*0.150 #price per token
encoding = tiktoken.encoding_for_model("gpt-4o-mini") #for gpt-4o-mini

def count_tokens(text):
    return len(encoding.encode(text))

# Count for the prompt
TotalPromt = count_tokens(Content)
print("For the prompt we have",TotalPromt,"tokens, pricing :",TotalPromt*pricing)

# Count for the test set
TotalTest = TestSet["Plot"].apply(count_tokens).sum()
print("For the test dataset we have",TotalTest+TotalPromt,"tokens, pricing :",TotalTest*pricing,"batch",int(TotalTest/(Model_max-TotalPromt)+1))

# Count for the whole dataset
TotalData = CleanData["Plot"].apply(count_tokens).sum()
batch = int(TotalData/(Model_max-TotalPromt))+1
print("For the whole dataset we have",TotalData,"tokens, pricing :",(TotalData+batch*TotalPromt)*pricing,"batch",batch)

For the prompt we have 147 tokens, pricing : 2.205e-05
For the test dataset we have 4122 tokens, pricing : 0.00059625 batch 1
For the whole dataset we have 8423387 tokens, pricing : 1.26496335 batch 66


### Create Batch

In [92]:
def batch_plots(Data):
    batch = [0]
    currentBatch = TotalPromt
    for i in range(0,Data.shape[0]):
        plot_tokens = count_tokens(Data.iloc[i]["Plot"])
        if currentBatch + plot_tokens > Model_max:
            batch.append(i)
            currentBatch = TotalPromt
        else:
            currentBatch += plot_tokens
    return batch

clean_batch = batch_plots(CleanData)
print(len(clean_batch))

66

In [97]:
T = TrainingSet[0:10]

### Assess the model on the training set

In [109]:
Run = True

In [110]:
#OpenIA key
openai.api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI()

Example of use :

```ruby
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Write a haiku about recursion in programming."
        }
    ]
)


print(completion.choices[0].message)

```

In [113]:
Text = ""
for plot in T["Plot"] : 
    Text += "\\" + plot

Text

"\\NW Mounted Police Sgt. Ward is assigned to track down a lethal and mysterious villain known only as The Leader, who is trying to locate a secret gold mine in the Indian territory. At the time The Leader made the decision to stop Ward, he pits the Indians against the Mounties, whom he blames for creating trouble in the area.\\The film is an examination of libidinous basketball star Hector Bloom , and contrasts his sporting prowess on the court to his bedroom antics. Most notably, Hector has an affair with his favorite professor's wife Olive  that goes nowhere. This, and many other events, occur within a heated early 1970s backdrop of university politics, sporting hijinx, and anti-war sentiments.\\An erotic drama about a writer involved in a plagiarism suit who becomes romantically involved with a woman with whom he is connected through dreams.Synopsis from {{cite web}}\\{{expand section}} The film is set in Hanky Park, a fictional settlement based on Salford, at the height of the Gre

In [125]:
function = {
   "name": "Assign_violence_level",
   "description": "Predict the level of violence of a movie given its plot",
   "parameters": {
       "type": "object",
       "properties": {
           "prediction": {
               "type": "array",
               "enum": [
                   "Peaceful",
                   "Mild",
                   "Violent"
               ],
               "description": "The level of violence of the movie."
           }
       },
       "required": [
           "prediction"
       ]
   }
}

In [119]:
prediction = None

if Run :
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": Content},
            {"role": "user","content": Text}
        ],
        functions=[function],
        function_call={"name": "Assign_violence_level"},
    )

In [131]:
#extract the answer
try:
    #prediction = json.loads(completion.choices[0].message.function_call.arguments)["prediction"]
    prediction = completion.choices[0].message
except (KeyError, json.JSONDecodeError) as e:
    print(f"Error extracting prediction: {e}")

In [132]:
print(prediction)

ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=FunctionCall(arguments='{"prediction":"Mild"}', name='Assign_violence_level'), tool_calls=None)


### Apply to the whole dataset