## Zero-Shot Model

In this case, we include the plot synopsis in the prompt along with the review and classify it with a Large Language Model (LLM); the code below loads `meta-llama/Meta-Llama-3-8B`.

In [4]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=0


### Import Libraries


In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
from huggingface_hub import login

In [6]:
import datasets
from datasets import Dataset, DatasetDict

In [7]:
from torch.utils.data import DataLoader

In [8]:
import pandas as pd
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import ast
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

2024-06-22 15:55:48.706005: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [9]:
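import os

# Read the Hugging Face token from the environment instead of hard-coding it
# (the variable name HF_TOKEN is an assumption; set it before starting the notebook)
login(token=os.environ["HF_TOKEN"])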

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/f.caprari/.cache/huggingface/token
Login successful


### Read the Original Dataset

In [10]:
dataRew=pd.read_json("../Dataset/IMDB_reviews.json",lines=True)

In [11]:
dataMovie=pd.read_json("../Dataset/IMDB_movie_details.json",lines=True)

dataRew.info()

In [12]:
dataMovie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1572 entries, 0 to 1571
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       1572 non-null   object 
 1   plot_summary   1572 non-null   object 
 2   duration       1572 non-null   object 
 3   genre          1572 non-null   object 
 4   rating         1572 non-null   float64
 5   release_date   1572 non-null   object 
 6   plot_synopsis  1572 non-null   object 
dtypes: float64(1), object(6)
memory usage: 86.1+ KB


In [13]:
dataRew=dataRew[['movie_id','is_spoiler','review_text']]

### Take the last part of the plot, because it is more likely to contain the relevant part of the movie plot

In [14]:
dataMovie['last_plot'] = dataMovie['plot_synopsis'].apply(lambda x: x[-512:])
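Note that `x[-512:]` slices characters, not tokens, so `last_plot` holds the last 512 characters of each synopsis. A minimal sketch of the same slice on a plain string (`tail_chars` is a hypothetical helper name, not part of the notebook):

```python
def tail_chars(text: str, n: int = 512) -> str:
    """Return the last n characters of text (the whole text if it is shorter)."""
    return text[-n:]


# Illustrative synopsis: 600 filler characters followed by the ending
synopsis = "a" * 600 + "THE END"
tail = tail_chars(synopsis)

print(len(tail))                 # 512
print(tail[-7:])                 # THE END
print(tail_chars("short plot"))  # short plot
```

An empty synopsis stays an empty string under this slice, which is what the filter on `last_plot` in the next cells relies on.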

In [15]:
dataMovie=dataMovie[['movie_id','last_plot','plot_synopsis']]

Delete the movies where the plot is not present

In [16]:
dataMovie=dataMovie[dataMovie["last_plot"]!='']

In [17]:
dataAll=dataRew.merge(dataMovie,left_on="movie_id",right_on="movie_id",how="left")

In [18]:
dataAll['is_spoiler'] = np.where(dataAll['is_spoiler'] == True, 1, 0)


In [19]:
dataAll=dataAll[['is_spoiler','review_text','last_plot']]

In [20]:
dataAll.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573913 entries, 0 to 573912
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   is_spoiler   573913 non-null  int64 
 1   review_text  573913 non-null  object
 2   last_plot    538828 non-null  object
dtypes: int64(1), object(2)
memory usage: 13.1+ MB


In [21]:
dataAll.dropna(inplace=True)
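The left join keeps every review, so reviews whose movie has no plot end up with a missing `last_plot`; in the `info()` output above that is 573,913 - 538,828 = 35,085 rows, which `dropna` then removes. A toy sketch of the same merge-then-drop pattern (the three reviews and the single plot are made-up illustrative data):

```python
import pandas as pd

# Illustrative data: movie "tt2" has no plot entry
reviews = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2"],
    "review_text": ["great", "he dies at the end", "boring"],
})
plots = pd.DataFrame({
    "movie_id": ["tt1"],
    "last_plot": ["...and the hero dies."],
})

# Left join: all reviews are kept, unmatched ones get NaN in last_plot
merged = reviews.merge(plots, on="movie_id", how="left")
missing = int(merged["last_plot"].isna().sum())

# Drop the reviews that have no plot to put in the prompt
cleaned = merged.dropna(subset=["last_plot"])

print(missing)       # 1
print(len(cleaned))  # 2
```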

In [22]:
dataAll['prompt'] = dataAll.apply(lambda row: f"Movie plot: {row['last_plot']}\n\nDoes the review contain information that could be considered a spoiler? Review: {row['review_text']}", axis=1)


In [23]:
dataset = Dataset.from_pandas(dataAll[['prompt']])

### Create and define the model

In [24]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("GPU available!" if device.type == "cuda" else "GPU not available, using CPU.")

GPU available!


In [25]:
model_id = "meta-llama/Meta-Llama-3-8B"

In [26]:
from transformers import pipeline


In [27]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model=model_id,device=device,tokenizer=tokenizer,max_length = 512,truncation=True)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
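For context, the `zero-shot-classification` pipeline in `transformers` frames classification as entailment: each candidate label is inserted into a hypothesis template (the library default is `"This example is {}."`) and paired with the input text. The pure-Python sketch below only illustrates how those pairs are formed; `build_nli_pairs` is a hypothetical name, and the real pipeline additionally tokenizes each pair and scores it with the model.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input text with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


# Illustrative input, not taken from the dataset
pairs = build_nli_pairs("Review: he dies in the end.", ["Spoiler", "Not a Spoiler"])
for premise, hypothesis in pairs:
    print(hypothesis)
# This example is Spoiler.
# This example is Not a Spoiler.
```

Since the label scores come from the sequence-classification head, the warning above that `score.weight` is newly initialized is worth keeping in mind when reading the scores printed later.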


## Try different functions
One function classifies a single row at a time; the other works on a whole batch.

In [26]:
def classify_review(plot_text, review_text):
    prompt = f"Movie plot: {plot_text}\n\nGiven the last part of the movie's plot and user's review, does the review reveal the end of the movie? Review: {review_text}"
    result = classifier(prompt, candidate_labels=["Spoiler", "Not a Spoiler"])
    print(result['scores'][0])
    prediction = 1 if result['labels'][0] == 'Spoiler' else 0
    return prediction
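For a single input, the pipeline returns a dictionary with `sequence`, `labels`, and `scores`, with the labels sorted by descending score, so `labels[0]` is the top prediction. A small sketch of the label-to-integer step used above, run on a hand-written result (the dictionary values are illustrative, not model output; `to_binary` is a hypothetical helper):

```python
def to_binary(result: dict, positive_label: str = "Spoiler") -> int:
    """Map the top-ranked label of a zero-shot result to 1 (spoiler) or 0."""
    return 1 if result["labels"][0] == positive_label else 0


# Hand-written results in the shape the pipeline returns
not_spoiler = {"sequence": "...", "labels": ["Not a Spoiler", "Spoiler"], "scores": [0.63, 0.37]}
spoiler = {"sequence": "...", "labels": ["Spoiler", "Not a Spoiler"], "scores": [0.87, 0.13]}

print(to_binary(not_spoiler))  # 0
print(to_binary(spoiler))      # 1
```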

In [27]:
def classify_batch(batch):
    results = classifier(batch['prompt'], candidate_labels=["Spoiler", "Not a Spoiler"])
    predictions = [1 if result['labels'][0] == 'Spoiler' else 0 for result in results]
    return {'prediction': predictions}

In [28]:
def apply_classification(df):
    # Apply the classify_review function to each row of the DataFrame
    df['prediction'] = df.apply(lambda row: classify_review(row['last_plot'], row['review_text']), axis=1)
    return df

### Try first on a Small Dataset
Let's look for both positive and negative examples to test this type of model

In [29]:
dataSmall1=dataAll[:10]


In [30]:
dataSmall2=dataAll[5200 :5210]

In [31]:
dataSmall1["prompt"]

0    Movie plot:  described. Just as Andy said, the...
1    Movie plot:  described. Just as Andy said, the...
2    Movie plot:  described. Just as Andy said, the...
3    Movie plot:  described. Just as Andy said, the...
4    Movie plot:  described. Just as Andy said, the...
5    Movie plot:  described. Just as Andy said, the...
6    Movie plot:  described. Just as Andy said, the...
7    Movie plot:  described. Just as Andy said, the...
8    Movie plot:  described. Just as Andy said, the...
9    Movie plot:  described. Just as Andy said, the...
Name: prompt, dtype: object

In [32]:
dataSmall2

Unnamed: 0,is_spoiler,review_text,last_plot,prompt
5200,0,Mindblowing piece of masterpiece. It can't be ...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5201,0,"Yesterday, I was lucky to be able to watch 175...","estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5202,0,"A wonderful film. I love the history, the acto...","estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5203,0,Science fiction has been used as an indication...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5204,0,RELEASED IN 1972 and directed by Francis Ford ...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5205,0,The Godfather is a very intense viewing experi...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5206,0,"A perfect gem in the movie history, a really g...","estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5207,0,Heard a lot about this movie but i was not int...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5208,0,About the only thing I remember when it comes ...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...
5209,0,This is one of those films that made me wonder...,"estions Michael about Connie's accusation, but...",Movie plot: estions Michael about Connie's acc...


In [33]:
dataSmall=pd.concat([dataSmall1,dataSmall2],axis=0)

In [34]:
dataSmall=dataSmall[["is_spoiler","review_text","last_plot","prompt"]]

In [35]:
type(dataSmall["prompt"])

pandas.core.series.Series

In [36]:
results=apply_classification(dataSmall)

Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


0.6294594407081604
0.7514712810516357
0.6764991879463196
0.8664100170135498
0.5854402780532837
0.6585255861282349
0.6997935771942139
0.8891845345497131
0.603317141532898


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


0.6139470338821411
0.5559142827987671
0.6562832593917847
0.7625734806060791
0.6284551024436951
0.5385864973068237
0.679280698299408
0.5317032337188721
0.5074936151504517
0.6511644124984741
0.6158233284950256


In [37]:
results

Unnamed: 0,is_spoiler,review_text,last_plot,prompt,prediction
0,1,"In its Oscar year, Shawshank Redemption (writt...","described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
1,1,The Shawshank Redemption is without a doubt on...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
2,1,I believe that this film is the best story eve...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
3,1,"**Yes, there are SPOILERS here**This film has ...","described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",1
4,1,At the heart of this extraordinary movie is a ...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
5,1,In recent years the IMDB top 250 movies has ha...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",1
6,1,I have been a fan of this movie for a long tim...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
7,1,I made my account on IMDb Just to Rate this mo...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",0
8,1,"A friend of mine listed ""The Shawshank Redempt...","described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",1
9,1,Well I guess I'm a little late to the party as...,"described. Just as Andy said, there was a lar...","Movie plot: described. Just as Andy said, the...",1


### Try on a much bigger dataset
This time we run the model in batches to speed up the process.

In [38]:
BATCH_SIZE=256
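`Dataset.map(..., batched=True)` passes the function a dictionary of lists and accepts a `batch_size` argument (1000 by default), so `BATCH_SIZE` only takes effect if it is passed explicitly. The sketch below mimics that chunking in plain Python with a stand-in classifier; `predict_in_batches` and `fake_classifier` are hypothetical names used for illustration only.

```python
def predict_in_batches(prompts, classify_fn, batch_size):
    """Call classify_fn on consecutive slices of prompts and collect the predictions."""
    predictions = []
    for start in range(0, len(prompts), batch_size):
        batch = {"prompt": prompts[start:start + batch_size]}
        predictions.extend(classify_fn(batch)["prediction"])
    return predictions


def fake_classifier(batch):
    # Stand-in for the real model: flags prompts that contain the word "dies"
    return {"prediction": [1 if "dies" in p else 0 for p in batch["prompt"]]}


prompts = ["he dies", "great film", "she dies too", "loved it", "ok"]
preds = predict_in_batches(prompts, fake_classifier, batch_size=2)
print(preds)  # [1, 0, 1, 0, 0]
```

With the notebook's own objects, the equivalent call would be `dataset2.map(classify_batch, batched=True, batch_size=BATCH_SIZE)`.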

We use the model on 10,000 rows because testing it on too many rows is impractical.

In [39]:
big_data,second_part = train_test_split(dataAll, train_size=10000, stratify=dataAll['is_spoiler'])
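`stratify=dataAll['is_spoiler']` asks `train_test_split` to keep the class proportions of the full dataset in the sample. A toy sketch with a 75/25 label split (the data and `random_state=0` are assumptions added for illustration; the notebook call does not fix a seed):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Illustrative labels: 75 non-spoilers and 25 spoilers
labels = [0] * 75 + [1] * 25
rows = list(range(100))

# Draw 20 rows while preserving the 75/25 class ratio
sample, rest = train_test_split(rows, train_size=20, stratify=labels, random_state=0)
counts = Counter(labels[i] for i in sample)

print(counts[0], counts[1])  # 15 5
print(len(rest))             # 80
```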

In [40]:
big_data['is_spoiler'].value_counts()

is_spoiler
0    73441
1    26559
Name: count, dtype: int64

In [41]:
dataset2 = Dataset.from_pandas(big_data[['prompt']])

In [42]:
dataset2=dataset2.remove_columns("__index_level_0__")

In [43]:
dataset2

Dataset({
    features: ['prompt'],
    num_rows: 100000
})

In [1]:
dataset2 = dataset2.map(classify_batch,batched=True)

NameError: name 'dataset2' is not defined

In [None]:
big_data['prompt']

In [None]:
big_data['prediction'] = dataset2['prediction']

In [None]:
big_data['prediction']

In [None]:
big_data.to_csv('predictions3.csv', index=False)

In [None]:
# Compute accuracy, F1-score, recall, and precision
accuracy = accuracy_score(big_data['is_spoiler'], big_data['prediction'])
f1 = f1_score(big_data['is_spoiler'], big_data['prediction'])
recall = recall_score(big_data['is_spoiler'], big_data['prediction'])
precision = precision_score(big_data['is_spoiler'], big_data['prediction'])
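As a worked example of these four metrics, the sketch below recomputes them by hand for the ten Shawshank Redemption rows displayed in the `results` table earlier (all labelled `1`, with predictions `0, 0, 0, 1, 0, 1, 0, 0, 1, 1`). `binary_metrics` is a hypothetical helper; it uses the standard positive-class definitions and is expected to agree with the `sklearn` functions on the same inputs.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Labels and predictions copied from the first ten rows of the results table
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.4, 'precision': 1.0, 'recall': 0.4, 'f1': 0.571}
```

On these ten rows the model finds 4 of the 10 spoilers (recall 0.4) and never raises a false alarm (precision 1.0), which is why F1 lands between the two at about 0.571.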

In [2]:
print(f"Accuracy: {accuracy}, F1: {f1}, Precision: {precision}, Recall: {recall}")

NameError: name 'accuracy' is not defined

In [3]:
with open("../Output/outputZeroSH.txt", "a") as f:
    print(f"Accuracy: {accuracy}, F1: {f1}, Precision: {precision}, Recall: {recall}",file=f)

NameError: name 'accuracy' is not defined