# Lead role preprocessing

To validate our hypothesis we need to know how important each actor's role is within a movie, but this information is very difficult to obtain. To get reliable data, we surveyed freely available public sources and collected script data for the most relevant movies. Our goal is to estimate how large a role each character plays, given the movie plot data we have and the names of the characters in each movie. As a first step, we collected movie scripts to serve as gold labels and computed each character's share of the movie from them. We then mapped these script-derived labels to the plots to train and validate our AI model.

## 1. Script crawling

To crawl the scripts, we selected https://imsdb.com and wrote a crawler for it. Because the site is very old, the page templates differ considerably between script documents, which makes crawling difficult.

### A. Downloading the scripts
 
```bash
python Preprocessing/script_crawling/get_script_urls.py
```
Running the script above produces scripts_urls.json, which contains metadata for the URLs we are going to crawl.

In [245]:
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import quote
from tqdm.notebook import tqdm
import json
from collections import defaultdict
import re
import pandas as pd
from pathlib import Path
root_path = Path('../')

In [370]:
scripts_urls= json.load(open('scripts_urls.json'))
print(f"Total number of script: {len(scripts_urls)}")
print(f"Scripts with release date field : {len([ x for x in scripts_urls if 'release_date' in x])} ")
print(f"\twith script date field: {len([ x for x in scripts_urls if 'script_date' in x])} ")

Total number of script: 1187
Scripts with release date field : 641 
	with script date field: 677 


Since our movie data doesn't have any movie identifier, the release date is a very important feature for mapping each movie to TMDB data with high accuracy.

Running **download_scripts.py** will produce url2pairs.json, which stores the content parsed with BeautifulSoup from each page: a complex set of web components that includes our target scripts.

In [13]:
url2pairs = json.load(open('url2pairs.json'))

# Processing scripts
## Step 1. Mapping TMDB movie IDs by movie name and year

```bash
python Preprocessing/script_crawling/scripts_tmdb_matching.py
```
Running the command above generates the mapping of movies to TMDB, *scripts_urls.json*. The TMDB search API returns several movies when we send the movie name as a query. Among the results, we eliminate all but the one candidate that fulfills the year condition. If a release year exists, it takes top priority. If not, we pick the candidate closest to script_date, but after script_date.
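
The year rule described above could be sketched as follows. This is an illustration only, not the code in scripts_tmdb_matching.py: the function name `pick_candidate`, its arguments, and the assumption that each search result carries a `release_date` string starting with a four-digit year are all hypothetical. Treating "after script_date" as "same year or later", and returning nothing when the release year matches zero or several candidates, are also assumptions.

```python
def pick_candidate(candidates, release_year=None, script_year=None):
    """Pick one search result using the year rule (illustrative sketch)."""
    # Keep only candidates whose release_date starts with a usable year.
    dated = []
    for cand in candidates:
        date = cand.get('release_date') or ''
        if len(date) >= 4 and date[:4].isdigit():
            dated.append((int(date[:4]), cand))

    # Top priority: an exact release-year match, accepted only if unique.
    if release_year is not None:
        exact = [cand for year, cand in dated if year == release_year]
        return exact[0] if len(exact) == 1 else None

    # Fallback: the earliest release that is not before the script date.
    if script_year is not None:
        later = [(year, cand) for year, cand in dated if year >= script_year]
        if later:
            return min(later, key=lambda pair: pair[0])[1]
    return None


# Toy search results; the ids and dates are made up.
toy_results = [
    {'id': 1, 'release_date': '1999-03-31'},
    {'id': 2, 'release_date': '2003-05-15'},
    {'id': 3, 'release_date': ''},
]
print(pick_candidate(toy_results, release_year=2003))
print(pick_candidate(toy_results, script_year=1998))
```

Movies for which this kind of rule returns nothing would need to be matched by hand, which is presumably the role of manual_matching.json.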

In [110]:
manual_matching = json.load(open('manual_matching.json')) # handcrafted feature. Was so painful..

In [197]:
# Save the matched metadata (opening a file in 'w' mode and calling json.load on it would fail).
json.dump(scripts_urls, open('tmdb_matched_scripts_urls.json','w')) # total 1060 samples matched.

## Step 2. Matching movie plots
We have three major sources for the movie plots.
1. Our original CMU plot data. https://www.cs.cmu.edu/~ark/personas/
2. MPST movie plot. https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags
3. Wikipedia movie plot. https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots

The CMU data has a Freebase ID, the MPST data has an IMDb ID, and the Wikipedia plot data has only the title. So the priority is MPST > CMU > Wikipedia plot.

In [2]:
scripts_urls = json.load(open('tmdb_matched_scripts_urls.json'))

In [5]:
cmu = pd.read_csv('../Data/MovieSummaries/plot_summaries.txt',delimiter='\t',header=None)
mpst = pd.read_csv('../Data/MovieSummaries/mpst_full_data.csv')
wiki = pd.read_csv('../Data/MovieSummaries/wiki_movie_plots_deduped.csv')

In [9]:
wiki_id2tmdb_id = json.load(open('../Data/tmdb_resources/wikipedia_id2tmdb_id.json'))

In [6]:
cmu2tmdb = json.load(open('../Data/tmdb_resources/cmu_exist_tmdb_id2detail.json'))
tmdb_id2detail = json.load(open('../Data/tmdb_resources/tmdb_id2detail_imdb_rating.json'))

**Now let's prepare tmdb_id2plot**

In [7]:
target_tmdb_ids = [m['tmdb_id'] for m in scripts_urls if 'tmdb_id' in m]
tmdb_id2plot = {}

In [286]:
# 1. mpst
imdb_id2plot = {x['imdb_id']:x['plot_synopsis'] for _,x in mpst.iterrows()}
cnt = 0

for tmdb_id in tmdb_id2detail.keys():
    imdb_id = tmdb_id2detail[str(tmdb_id)]['imdb_id']
    if imdb_id in imdb_id2plot:
        tmdb_id2plot[tmdb_id] = imdb_id2plot[imdb_id]
        cnt += 1
print(f"valid matching by mpst: {cnt}")

valid matching by mpst: 942


In [10]:
# 2. cmu
wiki_id2plot = {x[0]:x[1] for _,x in cmu.iterrows()}
tmdb_id2wiki_id = {v:k for k,v in wiki_id2tmdb_id.items()}
wiki_id2tmdb_id
cnt = 0
for tmdb_id, wiki_id in tmdb_id2wiki_id.items():
    if int(wiki_id) in wiki_id2plot:
        cnt += 1
        if tmdb_id in tmdb_id2plot:
            if len(tmdb_id2plot[tmdb_id]) > len(wiki_id2plot[int(wiki_id)]): # to keep shorter plot
                tmdb_id2plot[tmdb_id] = wiki_id2plot[int(wiki_id)]
        else:
            tmdb_id2plot[tmdb_id] = wiki_id2plot[int(wiki_id)]
print(f"valid matching by cmu: {cnt}")
print(f"union to mpst is : {len(list(tmdb_id2plot.keys()))}")

valid matching by cmu: 32280
union to mpst is : 32280


In [288]:
# 3. wiki
remain_ids = [tmdb_ids for tmdb_ids in tmdb_id2detail.keys() if tmdb_ids not in tmdb_id2plot]
cnt = 0
title2plots = {x['Title'].lower():x['Plot'] for _,x in wiki.iterrows()}
for tmdb_id in remain_ids:
    movie_title = tmdb_id2detail[str(tmdb_id)]['original_title']
    if movie_title.lower() in title2plots:
        tmdb_id2plot[tmdb_id] = title2plots[movie_title.lower()]
        cnt += 1

print(f"additional backups from wiki data : {cnt}")
print(f"We have total : {len(list(tmdb_id2plot.keys()))}")

additional backups from wiki data : 30
We have total : 1041


In [11]:
# json.dump(tmdb_id2plot,open('tmdb_id2plot_cmu_only.json','w'))

## Step 3. Parsing character's script
Strategy: map each speaker in the script to the TMDB character list.

In [7]:
tmdb_id2plot = json.load(open('tmdb_id2plot.json'))

In [71]:
tmdb_id2credit = json.load(open('../Data/tmdb_resources/tmdb_id2credit_full.json'))

#### Classification rule
**Count the number of spaces or tabs ('\t') at the start of each line.**

Assumption: **lines of the same type (for example, speaker names) share the same indentation within a single script.**

In [85]:
def match_characters(A,B):
    '''
    for A,B = Set[str], return one-one matching dictionary {a:b} s.t a in A, b in B
    Only consider strict subset and one-one matching.
    If more than one correspondence is possible, omit that name.
    '''
    mapper = {}
    for a in A:
        for b in B:
            if a.lower() in b.lower() or b.lower() in a.lower():
                if a in mapper:
                    del mapper[a]
                    break
                b_overlap = False
                del_k_list=[]
                for k,v in mapper.items():
                    if v == b:
                        del_k_list.append(k)
                        b_overlap = True
                for k in del_k_list:
                    del mapper[k]
                if b_overlap:
                    break
                mapper[a] = b
    return mapper
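
To illustrate the rule stated in the docstring (substring matching, one-to-one pairs only, ambiguous names omitted), here is a simplified two-pass sketch. It is not the function above: `match_unique` is a hypothetical name, and because it drops every name involved in an ambiguous correspondence, it may behave differently from `match_characters` in edge cases.

```python
from collections import defaultdict


def match_unique(script_names, credit_names):
    """Case-insensitive substring matching that keeps only one-to-one pairs."""
    forward = defaultdict(set)   # script name -> credit names it matches
    backward = defaultdict(set)  # credit name -> script names it matches
    for a in script_names:
        for b in credit_names:
            if a.lower() in b.lower() or b.lower() in a.lower():
                forward[a].add(b)
                backward[b].add(a)

    mapper = {}
    for a, matches in forward.items():
        if len(matches) != 1:
            continue  # the script name matches several credits: ambiguous
        (b,) = matches
        if len(backward[b]) == 1:  # the credit is claimed by this name only
            mapper[a] = b
    return mapper


# Toy names: 'GUARD' matches two credits, so it is omitted from the result.
speakers = {'ALICE', 'BOB', 'GUARD'}
credits = ['Alice Smith', 'Bob', 'Guard #1', 'Guard #2']
print(match_unique(speakers, credits))
```

Collecting all candidate matches first and filtering afterwards makes the result independent of the iteration order of the input set.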

Final schema of matched scripts
```json
tmdb_id2matched_scripts = {
    tmdb_id: {
        'scripts': [{
            'tmdb_credit': tmdb_credit_obj,
            'portion': float,
            'spoken_syllables': int,
            'num_script': int,
            'name_in_script': str,
            'script_text': {
                1: "1st script. order starts from 1.",
                }
            }
        ],
        'statistics': {
            'total_spoken_syllables': int,
            'total_spoken_words': int,
            'original_character_count': int,
            'matched_character_count': int,
        },
        'url': str
    }
}
```

In [113]:
import pyphen

def get_syllables(text):
    dic = pyphen.Pyphen(lang='en')
    return sum(len(dic.inserted(word).split('-')) for word in text.split())

def count_indent(text):
    count = 0
    indent_type = ''
    
    for char in text:
        if char == ' ' or char == '\t':
            count += 1
            indent_type = 's' if char == ' ' else 't'

        else:
            break  # Stop counting when a non-indent character is encountered

    return f'{count}{indent_type}'

def remove_tags(text):
    return re.sub(r'<[^>]*>', '', text)
def remove_parenthesis(text):
    return re.sub(r'\([^)]*\)', '', text)
def reduce_adjacent_spaces(text):
    return re.sub(r'\s{2,}', ' ', text.replace('\t',' '))
def final_removal_special_characters(text):
    return re.sub(r'[,\[\]{}:;,.\^()]', '', text).strip()

url2typed_pairs = {} # typed_pairs : {'4s': [(speaker,text),(speaker,text),(speaker,text)...],'3s':[(speaker,text),...]}
for url, pairs in url2pairs.items():
    url2typed_pairs[url] = defaultdict(list)
    for pair in pairs:
        for k,v in pair.items():
            indent_code = count_indent(k)
            new_key = final_removal_special_characters(remove_parenthesis(remove_tags(k.replace('\t',''))))
            new_val = reduce_adjacent_spaces(remove_parenthesis(remove_tags(v))).strip()
            if new_key and new_val:      
                url2typed_pairs[url][indent_code].append((new_key,new_val))

In [126]:
tmdb_id2matched_scripts = {} # url2matched_scripts[url] = {}
for movie_meta in tqdm(scripts_urls):
    if 'tmdb_id' not in movie_meta:
        continue
    if movie_meta['script_url'] not in url2typed_pairs:
        continue
    tmdb_id = movie_meta['tmdb_id']
    credit = tmdb_id2credit[str(tmdb_id)]
    character_list = [c['character'] for c in credit['cast']]
    character_dict = url2typed_pairs[movie_meta['script_url']]
    max_match = -1
    max_indent_type = ''
    max_mapper = {}
    for c,v in character_dict.items():
        instruction_keys = set([x[0] for x in v]) # considering typo, several characters can be matched to one actor.
        character_mapper = match_characters(instruction_keys,character_list)
        if len(list(character_mapper.keys())) > max_match:
            max_match = len(list(character_mapper.keys()))
            max_indent_type = c
            max_mapper = character_mapper
    if max_match > 2:
        instruction_keys = set([x[0] for x in character_dict[max_indent_type]])
        syll_cnt = { k: sum([get_syllables(t[1]) for t in character_dict[max_indent_type] if t[0] == k ]) for k in instruction_keys}
        total_syllables = sum(list(syll_cnt.values()))
        char_scripts = defaultdict(dict)
        idx = 1
        for char_name,text in character_dict[max_indent_type]:
            char_scripts[char_name][idx] = text
            idx+=1
        tmdb_id2matched_scripts[tmdb_id] = {
            'scripts': [
                {
                'tmdb_credit': [_ for _ in credit['cast'] if _['character'] == v][0],
                'portion': syll_cnt[k] / total_syllables,
                'spoken_syllables': syll_cnt[k],
                'num_script': len(list(char_scripts.keys())),
                'name_in_script': v,
                'script_text': char_scripts[k],
                }
                for k,v in max_mapper.items()
            ],
            'statistics':{
                'total_spoken_syllables': total_syllables,
                'total_spoken_words': sum([sum([len(t[1].split(' ')) for t in character_dict[max_indent_type] if t[0] == k ]) for k in max_mapper.keys()]),
                'original_character_count': len(credit['cast']),
                'matched_character_count': len(list(max_mapper.keys())),
            }
        }


In [127]:
json.dump(tmdb_id2matched_scripts,open('tmdb_id2matched_scripts.json','w'))

## Step 4. Calculating each character's share of the overall script
We estimate speaking time by the number of syllables in each sentence.
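
As a rough illustration of this step, the sketch below computes each speaker's share from (speaker, text) pairs. The notebook uses `pyphen` for syllable counts. To stay dependency-free, this sketch substitutes a crude vowel-group counter (`rough_syllables`, a hypothetical helper), so its numbers will differ from those produced by the pipeline above. `speaking_shares` is likewise a hypothetical name.

```python
import re
from collections import defaultdict


def rough_syllables(text):
    """Approximate syllables as runs of vowels; each word counts at least once."""
    words = re.findall(r"[a-zA-Z']+", text)
    return sum(max(1, len(re.findall(r'[aeiouy]+', w.lower()))) for w in words)


def speaking_shares(pairs):
    """Return {speaker: fraction of all spoken syllables} for (speaker, text) pairs."""
    counts = defaultdict(int)
    for speaker, text in pairs:
        counts[speaker] += rough_syllables(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {speaker: n / total for speaker, n in counts.items()}


# Toy dialogue, made up for illustration.
toy_pairs = [
    ('ALICE', 'Where are we going?'),
    ('BOB', 'Home.'),
    ('ALICE', 'Good.'),
]
print(speaking_shares(toy_pairs))
```

Syllables are used rather than words or characters because they track spoken duration more closely, which is the quantity this step tries to approximate.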


# Training model
## 1. Training Setting (Problem Definition)
Input: f"Estimate the portion of {character} in float form from this plot:{plot}"

Output: float(0~1)


In [1]:
import json
tmdb_id2matched_scripts = json.load(open('tmdb_id2matched_scripts.json'))
tmdb_id2plot = json.load(open('tmdb_id2plot.json'))

In [2]:
print(f"Total script num: {len(tmdb_id2matched_scripts.keys())}")
print(f"Total actors num: {sum([len(x['scripts']) for x in tmdb_id2matched_scripts.values()])}")

Total script num: 942
Total actors num: 13760


Split train : evaluation : test = 8 : 1 : 1.

In [9]:
import random
from collections import defaultdict
random.seed(1)
tmdb_ids =  list(tmdb_id2matched_scripts.keys())
tmdb_ids = sorted(tmdb_ids, key=lambda x: len(tmdb_id2matched_scripts[x]['scripts']) + random.randint(-5,5))
num_docs = len(tmdb_id2matched_scripts.keys())
ratios = {'train': 0.8, 'evaluation': 0.1, 'test': 0.1}

set_sizes = {set_name: int(ratio * num_docs) for set_name, ratio in ratios.items()}

sets = defaultdict(list)
scripts_in_sets = defaultdict(list)

# Sequentially add script into each set
for tmdb_id in tmdb_ids:
    script_obj = tmdb_id2matched_scripts[tmdb_id]
    selected_set = min(set_sizes.keys(), key=lambda x: len(scripts_in_sets[x]))
    if tmdb_id not in tmdb_id2plot:
        continue
    sets[selected_set].append(tmdb_id)
    all_c = [c['tmdb_credit']['character'] for c in script_obj['scripts']]
    scripts_in_sets[selected_set] += [{
        'plot': tmdb_id2plot[tmdb_id],
        'character': c['tmdb_credit']['character'],
        'all_characters': all_c,
        'portion': c['portion'] * 100,
        'tmdb_id': tmdb_id,
        'character_id': c['tmdb_credit']['id'],
        'order': c['tmdb_credit']['order'],
    } for c in script_obj['scripts']]

    set_sizes[selected_set] -= 1
    if set_sizes[selected_set] == 0:
        del set_sizes[selected_set]
    

for set_name, docs in sets.items():
    print(f"{set_name.title()} Set: {len(docs)}")
    print(f"\tCharacter count {len(scripts_in_sets[set_name])}")

Train Set: 709
	Character count 11668
Evaluation Set: 94
	Character count 792
Test Set: 94
	Character count 791


In [10]:
json.dump(scripts_in_sets, open('plot_portion_dataset.json','w'))

Now, prepare dataset for machine learning.

Prompt: "Predict the percentage of a movie's plot that a character takes up.\nCharacter: {} \nPlot: "

In [11]:
import json
scripts_in_sets = json.load(open('plot_portion_dataset.json'))

In [2]:
from datasets import Dataset
import pandas as pd
train_df = pd.DataFrame(scripts_in_sets['train'])
train_dataset = Dataset.from_pandas(train_df).shuffle(seed=7)
evaluation_df = pd.DataFrame(scripts_in_sets['evaluation'])
evaluation_dataset = Dataset.from_pandas(evaluation_df).shuffle(seed=7)

In [3]:
import re
def get_prompt(character_name, plot):
    c_name = re.sub(r'\([^)]*\)', '', character_name).strip()
    return f"Predict the percentage of a movie's plot that a character takes up.\nCharacter: {c_name} \nPlot: {plot}"


In [12]:
scripts_in_sets['evaluation'][0]

{'plot': "NASA Space Shuttle Explorer, commanded by veteran astronaut Matt Kowalski, is in Earth orbit on mission STS-157 to service the Hubble Space Telescope. Dr. Ryan Stone is aboard on her first space mission as a mission specialist, her job being to perform a set of hardware upgrades on the Hubble. During a spacewalk, Mission Control in Houston warns Explorer's crew about a Russian missile strike on a defunct satellite, which has inadvertently caused a chain reaction forming a rapidly-expanding cloud of space debris, ordering the crew to return to Earth immediately. Communication with Mission Control is lost shortly thereafter as more and more communication satellites are knocked out by the debris.\r\nHigh-speed debris strikes the Explorer and Hubble, tearing Stone from the shuttle and leaving her tumbling through space. Kowalski, using a Manned Maneuvering Unit (MMU), rescues Stone, and they return to the Explorer, soon discovering that the Shuttle has suffered catastrophic damag

In [5]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

In [6]:
model_name = 't5-large'
tokenizer = T5Tokenizer.from_pretrained(model_name)

def preprocess_data(examples):
    inputs = [get_prompt(c_name,plot) for c_name,plot in zip(examples['character'],examples['plot'])]
    model_inputs = tokenizer(inputs, max_length=1536, truncation=True, padding='max_length')

    # Tokenize the targets with padding
    with tokenizer.as_target_tokenizer():
        labels = tokenizer([str(round(p,2)) for p in examples['portion']], max_length=128, truncation=True, padding='max_length')

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

dataset = train_dataset
tokenized_train_dataset = train_dataset.map(preprocess_data, batched=True)
tokenized_validation_dataset = evaluation_dataset.map(preprocess_data, batched=True)
# tokenized_test_dataset = test_dataset.map(preprocess_data, batched=True)



In [7]:
model = T5ForConditionalGeneration.from_pretrained(model_name).to('mps')

In [8]:
training_args = TrainingArguments(
    output_dir=f'./{model_name}_results',
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=f'./{model_name}_logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
)

trainer.train(f'./{model_name}_results/checkpoint-6000')
model.save_pretrained(f'./{model_name}_trained_model')


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


| Step | Training Loss |
|---|---|
| 6500 | 0.0493 |
| 7000 | 0.0483 |
| 7500 | 0.0492 |
| 8000 | 0.0489 |
| 8500 | 0.0501 |
| 9000 | 0.0495 |
| 9500 | 0.0492 |
| 10000 | 0.0481 |
| 10500 | 0.0482 |
| 11000 | 0.0486 |


## 2. Implement LLM models benchmark
### a. Fine-tuning T5
- T5-large: trainable.

We fine-tuned the t5-large model for one epoch on a total of 11,668 data points, using the following prompt:
```python
f"Predict the percentage of a movie's plot that a character takes up.\nCharacter: {character_name} \nPlot: {plot}"
```

### b. Instruction tuned-LLM

API inference available:
- ChatGPT-3.5
- ChatGPT-4

With ChatGPT, we force the model to generate its output in JSON format and ask it to estimate the portions of all characters at once. When ChatGPT fails, that is, when it does not return a value for a requested field, we replace the missing value with the output predicted by the log-linear regression model on cast order.

### c. Heuristic baseline
- `order` field in tmdb_credit: a cast member's position in the TMDB credit data correlates with the importance of their role.
- Linear regression tuned on order: fit a simple linear regression with `order` as the single feature.
- Counting occurrences of the character name in the plot.
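
The third heuristic, counting character names in the plot, might look like the sketch below. The source does not specify its details, so the function name `portion_by_mentions`, the choice to count any word of the character name, and the normalisation to 100 are assumptions made for illustration.

```python
import re


def portion_by_mentions(plot, characters):
    """Score each character by how often any word of its name occurs in the plot."""
    plot_words = re.findall(r"[a-z']+", plot.lower())
    counts = {}
    for name in characters:
        # Drop parenthesised notes such as "(voice)" before splitting the name.
        clean = re.sub(r'\([^)]*\)', '', name).lower()
        parts = set(re.findall(r"[a-z']+", clean))
        counts[name] = sum(1 for word in plot_words if word in parts)

    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in characters}
    # Rescale so that the portions of all listed characters sum to 100.
    return {name: 100 * c / total for name, c in counts.items()}


# Toy plot sentence, written for this example.
toy_plot = ("Ryan Stone is on her first mission. "
            "Kowalski rescues Stone, and Stone returns alone.")
print(portion_by_mentions(toy_plot, ['Ryan Stone', 'Matt Kowalski']))
```

A whole-word match like this misses possessives and pronouns, so it will tend to undercount characters whom the plot mostly refers to indirectly.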

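Fitted simple linear regression models:
```python
import numpy as np

def get_portion_by_order_logscale(order):
    # Log-linear fit: log(portion) = 0.7756578 - 0.04791 * order
    return np.power(np.e, 0.7756578 - 0.04791 * order)

def get_portion_by_order(order):
    # Linear fit, floored at 0.001 so the portion never becomes non-positive
    return max(5.689275 - 0.122673 * order, 0.001)
```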

## 3. Comparison and Model selection
For a fair comparison, every model's outputs are rescaled so that the portions within a single plot sum to 100.

As evaluation criteria, we use the correlation with the gold label, together with the accuracy and F1 score for lead-role prediction at a threshold of 10%.

In [340]:
validation_set = json.load(open(root_path / 'Preprocessing/plot_portion_dataset.json'))['evaluation']
validation_pred = {model_name: defaultdict(dict) for model_name in ['linear','log-linear','tmdb_order','T5-large','ChatGPT-3.5','ChatGPT-4']}
tmdb_id2credit = json.load(open(root_path / 'Data/tmdb_resources/tmdb_id2credit_full.json'))
target_tmdb_ids = set([v['tmdb_id'] for v in validation_set])

### Heuristics, simple model

In [341]:
import numpy as np
def get_portion_by_order_logscale(order): # load as a backup method
    return np.power(np.e, 0.7756578 - 0.04791 * order)

def get_portion_by_order(order):
    return max(5.689275 -0.122673 * order, 0.001)

def get_order(tmdb_id, actor_id):
    credit = tmdb_id2credit[str(tmdb_id)]
    for j in credit['cast']:
        if str(j['id']) == str(actor_id):
            return j['order']
    raise ValueError('invalid actor')

def scaling(portions):
    total_portions = sum(list(portions.values())) + 0.0001
    return {k: 100*v / total_portions for k,v in portions.items()}

In [353]:
for tmdb_id in target_tmdb_ids:
    credit = tmdb_id2credit[tmdb_id]
    portions_ll = {}
    portions_l = {}
    portions_o = {}
    for c in credit['cast']:
        portions_ll[str(c['id'])] = get_portion_by_order_logscale(c['order'])
        portions_l[str(c['id'])] = get_portion_by_order(c['order'])
        portions_o[str(c['id'])] = c['order']
    validation_pred['log-linear'][tmdb_id] = scaling(portions_ll)
    validation_pred['linear'][tmdb_id] = scaling(portions_l)
    validation_pred['tmdb_order'][tmdb_id] = scaling(portions_o)

### GPT-3.5 & GPT-4
For comparison, GPT-3.5 and GPT-4 were queried with the following prompt:
```python
prompt = f"""Estimate the percentage of the script that each character represents from the movie plot.
[Characters]: {characters_str}
[Plot]: {eval_tmdb_id2s[tmdb_id][0]['plot']}

Estimate the percentage of the script that each character represents from the movie plot mentioned above. Return the portion of every character in a JSON dictionary, with the character name as key and portion as value."""
```


In [354]:
chatgpt_pred = json.load(open('chatgpt_pred.json'))
chatgpt_pred_4 = json.load(open('chatgpt_4_pred.json'))

In [355]:
def hash_string(text):
    return text.lower().replace(' ','')
def name_match_and_get(tmdb_id,character_id,original_name,generated_dict):
    hashed_dict = {hash_string(k):v for k,v in generated_dict.items()}
    if hash_string(original_name) in hashed_dict:
        try:
            return float(str(hashed_dict[hash_string(original_name)]).replace('<','').replace('%','').strip())
        except ValueError:  # the generated value could not be parsed as a number
            return get_portion_by_order_logscale(get_order(tmdb_id,character_id))
    return get_portion_by_order_logscale(get_order(tmdb_id,character_id))

In [356]:
for tmdb_id,v in chatgpt_pred.items():
    output_parsed = json.loads(v['gen_text'])
    parsed_prob = {}
    for character, c_id in v['character2id'].items():
        parsed_prob[c_id] = name_match_and_get(tmdb_id,str(c_id),character,output_parsed)
    prob_sum = sum(list(parsed_prob.values()))
    validation_pred['ChatGPT-3.5'][tmdb_id] = scaling(parsed_prob)

In [357]:
for tmdb_id,v in chatgpt_pred_4.items():
    try:
        output_parsed = json.loads(v['gen_text'])
    except:
        output_parsed = {}
    parsed_prob = {}
    for character, c_id in v['character2id'].items():
        parsed_prob[c_id] = name_match_and_get(tmdb_id,c_id,character,output_parsed)
    validation_pred['ChatGPT-4'][tmdb_id] = scaling(parsed_prob)

### T5

In [358]:
t5_output = json.load(open('lead_role_inference/t5_validation_inference.json'))
for key, pred in t5_output.items():
    tmdb_id, actor_id = key.split('__')
    validation_pred['T5-large'][str(tmdb_id)][str(actor_id)]=float(pred['output_text'])
for tmdb_id in target_tmdb_ids:
    credit = tmdb_id2credit[tmdb_id]
    for c in credit['cast']:
        if str(c['id']) not in validation_pred['T5-large'][tmdb_id]:
            validation_pred['T5-large'][tmdb_id][str(c['id'])] = get_portion_by_order_logscale(c['order'])
    validation_pred['T5-large'][tmdb_id] = scaling(validation_pred['T5-large'][tmdb_id])

In [359]:
json.dump(validation_pred,open('validation_pred.json','w'))

### stats

In [360]:
corr_dict = defaultdict(list)
for v in validation_set:
    for model, lookup in validation_pred.items():
        corr_dict[model].append(lookup[v['tmdb_id']][str(v['character_id'])])
    corr_dict['Y'].append(v['portion'])
pd.DataFrame(corr_dict).corr()[['Y']].T

| | linear | log-linear | tmdb_order | T5-large | ChatGPT-3.5 | ChatGPT-4 | Y |
|---|---|---|---|---|---|---|---|
| Y | 0.372206 | 0.486409 | -0.307976 | 0.7577 | 0.777683 | 0.828316 | 1.0 |


In [363]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
result_dict = corr_dict.copy()
def classify_scores(scores):
    return [1 if score > 10 else 0 for score in scores]

for k,v in result_dict.items():
    result_dict[k] = classify_scores(v)

for model_name, model_predictions in result_dict.items():
    if model_name not in  ['Y','tmdb_order']:
        # y_true_classified was undefined; the classified gold labels live in result_dict['Y'].
        print(f"{model_name:<12}\tAcc:{accuracy_score(result_dict['Y'], model_predictions):.5f}\tF1:{f1_score(result_dict['Y'], model_predictions):.5f}\tAUC:{roc_auc_score(result_dict['Y'], model_predictions):.5f}")


linear      	Acc:0.80429	F1:0.11429	AUC:0.52219
log-linear  	Acc:0.80934	F1:0.14689	AUC:0.53291
T5-large    	Acc:0.88510	F1:0.68070	AUC:0.79233
ChatGPT-3.5 	Acc:0.90657	F1:0.72593	AUC:0.80812
ChatGPT-4   	Acc:0.90152	F1:0.73103	AUC:0.82525
