# Classification with LLM

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach will be to build a model that labels the dataset.
In this notebook, we will try to label the data using an LLM, namely GPT-4o mini from OpenAI.

## Labeling the Data
Since labeled data is required for analysis, we manually labeled a subset of the dataset. We divided part of the data among team members and labeled each movie plot based on a categorical scale:
<ul>
    <li><b>-1</b> : Peaceful</li>
    <li><b>0</b> : Ambiguous (Uncertain level of violence)</li>
    <li><b>1</b> : Violent</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.
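
One way to quantify this subjectivity is to compare the labels two people gave to the same plots. The sketch below is only an illustration: the function names and the example labels are hypothetical and are not taken from the project data. It computes the raw agreement rate and Cohen's kappa, which corrects the agreement for chance.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of plots on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for six plots (-1, 0, 1 scale).
annotator_1 = [-1, 0, 1, 1, 0, -1]
annotator_2 = [-1, 1, 1, 1, 0, 0]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

A kappa close to 0 would suggest that the scale is read very differently from one annotator to another, which would also limit how well any model can match the human labels.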

## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots available, we will use most of the labeled items for the training set. We will keep five plots as the testing set to evaluate the model (alternatively, we may use the entire dataset and assess labeling quality across the final labeled set) </li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>
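
As a rough illustration of how the held-out plots could be scored, the sketch below compares human labels with model labels and reports the accuracy together with a count of each (human, model) pair. The `evaluate` helper and the five example labels are hypothetical and are not results from this notebook.

```python
from collections import Counter


def evaluate(human_labels, model_labels):
    """Return the accuracy and a Counter of (human, model) label pairs."""
    if len(human_labels) != len(model_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = Counter(zip(human_labels, model_labels))
    correct = sum(count for (h, m), count in pairs.items() if h == m)
    return correct / len(human_labels), pairs


# Hypothetical labels for five held-out plots (-1, 0, 1 scale).
human = [1, 0, -1, 1, 0]
model = [1, 0, 0, 1, 1]

accuracy, confusion = evaluate(human, model)
print(f"accuracy={accuracy:.2f}")
for (h, m), count in sorted(confusion.items()):
    print(f"human={h:>2} model={m:>2} count={count}")
```

With only a handful of test plots, the accuracy is a coarse estimate, so the pair counts are often more informative than the single number.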

## Model

https://platform.openai.com/docs/overview

### References

https://medium.com/discovery-at-nesta/how-to-use-gpt-4-and-openais-functions-for-text-classification-ad0957be9b25 \
https://platform.openai.com/docs/guides/prompt-engineering



### Imports

In [17]:
import json
import math
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import openai
import pandas as pd
import seaborn as sns
import tiktoken
from openai import OpenAI

%matplotlib inline

In [2]:
# Add the data module directory (../data/) to sys.path so that data_loading can be imported
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [3]:
#load the data
data_loader = DataLoader(raw,clean)
CleanData = data_loader.clean_movie_data()
PlotData = data_loader.plot_data()
PlotData


load plot data



| Wikipedia movie ID | Plot |
|---|---|
| 23890098 | Shlykov, a hard-working taxi driver and Lyosha... |
| 31186339 | The nation of Panem consists of a wealthy Capi... |
| 20663735 | Poovalli Induchoodan is sentenced for six yea... |
| 2231378 | The Lemon Drop Kid , a New York City swindler,... |
| 595909 | Seventh-day Adventist Church pastor Michael Ch... |
| ... | ... |
| 34808485 | The story is about Reema , a young Muslim scho... |
| 1096473 | In 1928 Hollywood, director Leo Andreyev look... |
| 35102018 | American Luthier focuses on Randy Parsons’ tra... |
| 8628195 | Abdur Rehman Khan , a middle-aged dry fruit se... |


In [4]:
#load the labelled data
ViolentLabel,ViolentData = data_loader.human_labelled_data()
display(ViolentLabel)
display(ViolentData.head())

| | Unnamed: 0 | Violence level | Label |
|---|---|---|---|
| 0 | 0 | Low | -1 |
| 1 | 1 | Medium/not sure | 0 |
| 2 | 2 | Violent | 1 |


| Wikipedia movie ID | Answer | Date |
|---|---|---|
| 113454 | 0.0 | 12.11 |
| 2295249 | 0.0 | 12.11 |
| 2967223 | 1.0 | 11.11 |
| 3868321 | -1.0 | 12.11 |
| 4481850 | 0.0 | 11.11 |


In [5]:
TestSet = PlotData.loc[ViolentData.index]
TestSet.head()

| Wikipedia movie ID | Plot |
|---|---|
| 113454 | The film tells the story of a mob hit man and... |
| 2295249 | Theresa is one of the twelve jurors who have t... |
| 2967223 | Robert Crain is a German pacifist living in In... |
| 3868321 | Zach is a soul collector, an angel who collec... |
| 4481850 | Adam, a 23-year-old self-employed security tec... |


In [6]:
TestSet.iloc[0]["Plot"]

'The film tells the story of a mob hit man  and hit woman  who fall in love, even though they have been hired to kill each other.'

In [7]:
cut = 10
ValidationSet = TestSet[:cut]
# Start at `cut` (not `cut + 1`) so that no labelled plot is dropped between the two sets
TrainingSet = TestSet[cut:]

## LLM - GPT-4o mini

In [8]:
# Assess the maximum number of tokens we will have to process

# Initialize the tokenizer. cl100k_base is the GPT-4 encoding; gpt-4o-mini uses
# o200k_base, so the counts below are an approximation.
tokenizer = tiktoken.get_encoding("cl100k_base")

# Return the number of tokens in a text.
def tokenize(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

# Price in USD per input token (0.150 USD per million input tokens)
pricing = 1/1000000*0.150
# Apply the tokenizer to the datasets
Total = CleanData["Plot"].apply(tokenize).sum()
print("For the whole dataset we have",Total,"tokens, pricing :",Total*pricing)
Total = TestSet["Plot"].apply(tokenize).sum()
print("For the test dataset we have",Total,"tokens, pricing :",Total*pricing)
# Caution: the real cost will also include the prompt and the returned tokens

For the whole dataset we have 8533080 tokens, pricing : 1.279962
For the test dataset we have 3995 tokens, pricing : 0.00059925
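
The figures above cover the plot tokens only. A slightly fuller estimate, sketched below, adds the system prompt sent with every request and the tokens returned by the model. The `estimate_cost_usd` helper is hypothetical, the output price of 0.600 USD per million tokens is an assumption to check against the current OpenAI pricing page, and the request count and per-request token counts in the example are placeholders rather than measured values.

```python
def estimate_cost_usd(plot_tokens, n_requests, prompt_tokens_per_request,
                      output_tokens_per_request,
                      input_price_per_million=0.150,
                      output_price_per_million=0.600):
    """Estimate the API cost in USD for labeling a set of plots.

    input_price_per_million is the value used in this notebook;
    output_price_per_million is an assumed value.
    """
    # Every request resends the system prompt in addition to the plot itself.
    input_tokens = plot_tokens + n_requests * prompt_tokens_per_request
    output_tokens = n_requests * output_tokens_per_request
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Plot tokens only: reproduces the test-set figure printed above.
cost_plots_only = estimate_cost_usd(3995, 0, 0, 0)

# Placeholder overhead: 20 requests, 200 prompt tokens and 10 output tokens each.
cost_with_overhead = estimate_cost_usd(3995, 20, 200, 10)

print(f"plots only: {cost_plots_only:.8f} USD")
print(f"with prompt and output: {cost_with_overhead:.8f} USD")
```

For short plots, the repeated system prompt can account for a large share of the input tokens, so keeping the label definitions compact has a direct effect on cost.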


In [9]:
# OpenAI key
openai.api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI()

Example of use:

```python
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Write a haiku about recursion in programming."
        }
    ]
)


print(completion.choices[0].message)

```

In [33]:
Run = True

In [34]:
Label = """
- **Peaceful**: The text describes no violence. There is no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers.
- **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. Mild action or suspense is allowed without explicit harm.
- **Violent**: The text describes violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. Violence is a prominent feature of the film.
"""

Instruction = "The following text is the plot of a movie, assign it to a level on the violence scale."

Content = f"{Label}\n\n### Instructions ###\n{Instruction}"


In [35]:
print(Content)


- **Peaceful**: The text describes no violence. There is no aggression, conflict, or harm to people or animals. Suitable for all audiences, including children and sensitive viewers.
- **Mild**: The level of violence of the text is minimal or uncertain. There might be moments of tension or mild conflict, such as arguments. Mild action or suspense is allowed without explicit harm.
- **Violent**: The text describes violence, such as physical aggression, conflict, or harm. Scenes may include fighting, injury, or other intense confrontations. Violence is a prominent feature of the film.


### Instructions ###
The following text is the plot of a movie, assign it to a level on the violence scale.


In [36]:
Text = "news of the death of the joker during the events of city of scars results in a riot at arkham asylum. batman races to rescue nightwing, who has been abducted by poison ivy and killer croc."

In [37]:
print(Text)

news of the death of the joker during the events of city of scars results in a riot at arkham asylum. batman races to rescue nightwing, who has been abducted by poison ivy and killer croc.


In [38]:
function = {
   "name": "Assign_violence_level",
   "description": "Predict the level of violence of a movie given its plot",
   "parameters": {
       "type": "object",
       "properties": {
           "prediction": {
               "type": "string",
               "enum": [
                   "Peaceful",
                   "Mild",
                   "Violent"
               ],
               "description": "The level of violence of the movie."
           }
       },
       "required": [
           "prediction"
       ]
   }
}
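
The schema restricts `prediction` to three values, but the returned arguments arrive as a JSON string, so it can be worth checking them before use. The sketch below is one defensive way to do so; `parse_prediction` and `ALLOWED_LEVELS` are hypothetical names that are not part of the notebook or of the OpenAI library.

```python
import json

# The three levels listed in the "enum" of the function schema.
ALLOWED_LEVELS = ("Peaceful", "Mild", "Violent")


def parse_prediction(arguments_json, allowed=ALLOWED_LEVELS):
    """Return the predicted level, or None if the payload is malformed."""
    try:
        payload = json.loads(arguments_json)
    except (TypeError, json.JSONDecodeError):
        return None  # not a string, or not valid JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    prediction = payload.get("prediction")
    # Reject values outside the enum instead of passing them downstream.
    return prediction if prediction in allowed else None


print(parse_prediction('{"prediction": "Mild"}'))
print(parse_prediction('{"prediction": "Extreme"}'))
```

Returning `None` rather than raising keeps a long labeling run going when a single response is malformed; the affected plots can then be retried or inspected by hand.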

In [39]:
prediction = None

if Run:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": Content},
            {"role": "user","content": Text}
        ],
        functions=[function],
        function_call={"name": "Assign_violence_level"},
    )


In [49]:
# Extract the answer
try:
    prediction = json.loads(completion.choices[0].message.function_call.arguments)["prediction"]
except (AttributeError, KeyError, json.JSONDecodeError) as e:
    # AttributeError covers the case where the model returned no function call
    print(f"Error extracting prediction: {e}")

In [50]:
print(prediction)

Mild
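
To label many plots, the request and the extraction step can be wrapped in a single function. The sketch below is one possible shape, not the notebook's actual pipeline: `classify_plot` and `LEVEL_TO_LABEL` are hypothetical names, and it uses the `tools` and `tool_choice` parameters of the Chat Completions API in place of the older `functions` and `function_call` parameters used above. The mapping to -1, 0, 1 assumes that "Mild" corresponds to the "Ambiguous" level of the manual labeling scale.

```python
import json

# Assumed mapping from the model's levels to the manual labeling scale.
LEVEL_TO_LABEL = {"Peaceful": -1, "Mild": 0, "Violent": 1}


def classify_plot(client, plot, system_prompt, function_schema,
                  model="gpt-4o-mini"):
    """Return the violence label (-1, 0, or 1) for a plot, or None on failure."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": plot},
        ],
        tools=[{"type": "function", "function": function_schema}],
        # Force the model to answer through the classification function.
        tool_choice={"type": "function",
                     "function": {"name": function_schema["name"]}},
    )
    try:
        call = completion.choices[0].message.tool_calls[0]
        level = json.loads(call.function.arguments)["prediction"]
    except (AttributeError, IndexError, KeyError, TypeError,
            json.JSONDecodeError):
        return None  # no tool call, or arguments that cannot be parsed
    # Unknown levels map to None instead of raising.
    return LEVEL_TO_LABEL.get(level)
```

With the objects defined earlier in the notebook, a call would look like `classify_plot(client, Text, Content, function)`, and the same function could be applied to each row of `PlotData["Plot"]` once the labels on the test plots look reasonable.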
