# **Graph-Policy-Induction Walkthrough**

This notebook walks through the code for the graph-based feature extraction step by step. The code has been modularised into several scripts plus a config file whose parameters can be adjusted by the user (see README.md).

# Dependencies

Warning: PyTorch Geometric (`torch_geometric`) can be hard to install, so this package may take some time. It is best installed from conda-forge using mamba.

### Standard Packages
- os
- json
- pathlib (Path)
- collections (defaultdict)
- dataclasses
- enum
- typing
- warnings

In [None]:
import json
import os
import warnings
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

### External Packages (covered in requirements.txt):

- torch → covered by torch>=2.0.0,<2.3.0
    - torch.nn
    - torch.nn.functional (F)
- torch_geometric → covered by torch-geometric>=2.4.0
    - torch_geometric.data.HeteroData
    - torch_geometric.nn (SAGEConv, HeteroConv, Linear, GATConv)
- pandas → covered by pandas>=2.0.0
- numpy → covered by numpy>=1.24.0,<2.0.0
- sklearn → covered by scikit-learn>=1.3.0
    - sklearn.model_selection.train_test_split
    - sklearn.ensemble (RandomForestClassifier, GradientBoostingClassifier)
    - sklearn.metrics (precision_score, recall_score, roc_auc_score, fbeta_score)

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from torch_geometric.data import HeteroData
from torch_geometric.nn import GATConv, HeteroConv, Linear, SAGEConv


### Configurations

In [None]:
warnings.filterwarnings('ignore')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data Processing 

The data processing code loads and combines the private and public datasets from Vela.

It parses the various data types stored as JSON inside the CSV columns, computes feature statistics, extracts baseline education and job features, and finally filters these features for redundancy using Jaccard similarity.

### Combining Data

This step checks for founders that appear in both datasets (none were found), reports the success counts, and adds a `source` column to track which dataset each founder came from.
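
The de-duplicate, tag and concatenate pattern can be sketched on toy data. This is a minimal illustration rather than the notebook's loader: the UUIDs and labels below are hypothetical, and only the column names `founder_uuid`, `success` and `source` mirror the real data.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration.
public = pd.DataFrame({"founder_uuid": ["a", "b", "c"], "success": [1, 0, 0]})
private = pd.DataFrame({"founder_uuid": ["c", "d"], "success": [0, 1]})

# Drop private rows whose UUID already appears in the public set.
overlap = set(public["founder_uuid"]) & set(private["founder_uuid"])
private = private[~private["founder_uuid"].isin(overlap)].copy()

# Tag each row with its origin, then stack the two frames.
public["source"] = "public"
private["source"] = "private"
combined = pd.concat([public, private], ignore_index=True)

print(combined)
print(f"Success rate: {combined['success'].mean():.2%}")
```

Taking a `.copy()` after filtering keeps the later `source` assignment from writing into a slice of the original frame.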

In [7]:
def load_and_combine_datasets(
    public_path: str = '/home/imm/grte4643/Documents/Vela/Inputs/vcbench_final_public.csv',
    private_path: str = '/home/imm/grte4643/Documents/Vela/Inputs/vcbench_final_private.csv'
) -> pd.DataFrame:
    """
    Load and combine public + private datasets.
    
    Returns:
        Combined DataFrame with 'source' column
    """
    print("=" * 50)
    print("LOADING & COMBINING DATASETS")
    print("=" * 50)
    
    df_public = pd.read_csv(public_path)
    df_private = pd.read_csv(private_path)
    
    print(f"Public:  {len(df_public)} founders ({df_public['success'].sum()} successful)")
    print(f"Private: {len(df_private)} founders ({df_private['success'].sum()} successful)")
    
    public_uuids = set(df_public['founder_uuid'])
    private_uuids = set(df_private['founder_uuid'])
    overlap = len(public_uuids & private_uuids)
    
    if overlap > 0:
        print(f"\n WARNING: {overlap} founders in both datasets - removing duplicates")
        # .copy() so the 'source' assignment below does not write into a slice
        df_private = df_private[~df_private['founder_uuid'].isin(public_uuids)].copy()
    
    df_public['source'] = 'public'
    df_private['source'] = 'private'
    df_combined = pd.concat([df_public, df_private], ignore_index=True)
    
    print(f"\nCombined: {len(df_combined)} founders")
    print(f"  Success rate: {df_combined['success'].mean()*100:.2f}%")
    print(f"  Successful: {df_combined['success'].sum()}")
    print("=" * 50)
    
    return df_combined

df = load_and_combine_datasets()
print(df)

LOADING & COMBINING DATASETS
Public:  4500 founders (405 successful)
Private: 4500 founders (405 successful)

Combined: 9000 founders
  Success rate: 9.00%
  Successful: 810
                              founder_uuid  success  \
0     33159ebb-97ff-43fe-a80e-31fdcf467065        1   
1     33a7bba0-2ef6-415b-b73c-3dc994b8a86e        1   
2     0fe9fcdf-eb06-4e2c-88d8-04468b427298        1   
3     4f5620d4-9db8-4cfc-a1f5-2fd917472865        1   
4     c347a753-2280-48f8-9a78-8bcff30dd0ac        1   
...                                    ...      ...   
8995  4524d31b-af37-4980-9fe6-491b8d55eb88        0   
8996  4afdd8f1-b76d-4a7c-ab5d-e26eb5a9ed91        0   
8997  7d1dcfd0-4383-43f8-be23-660dc0c214e8        0   
8998  1844623e-c893-45a6-aa60-cc91b4a4bfb2        0   
8999  1ada8d2a-8875-483e-aad2-be7b6ef5c232        0   

                                               industry  \
0          Technology, Information & Internet Platforms   
1                             Entertainment & L

### JSON Parsing Utility Functions

These are the JSON parsing helper functions.
- `parse_json_column` converts JSON strings to Python lists of dictionaries (used by feature extraction)
- `parse_qs_rank` standardises university ranking strings (e.g. '200+', '101-150') to clean integers, with 999 standing in for a missing rank
- `parse_duration` converts job duration strings to numeric years using approximate bucket midpoints

In [10]:
def parse_json_column(json_str: Any) -> List[Dict]:
    """Safely parse JSON columns."""
    if pd.isna(json_str):
        return []
    try:
        return json.loads(json_str)
    except (json.JSONDecodeError, TypeError):
        return []


def parse_qs_rank(qs_value: Any) -> int:
    """Parse QS ranking handling '200+', '101-150', etc."""
    if pd.isna(qs_value) or qs_value == '':
        return 999
    
    qs_str = str(qs_value).strip()
    
    if '+' in qs_str:
        try:
            return int(qs_str.replace('+', ''))
        except ValueError:
            return 999
    
    if '-' in qs_str:
        try:
            return int(qs_str.split('-')[0])
        except ValueError:
            return 999
    
    try:
        return int(float(qs_str))
    except ValueError:
        return 999


def parse_duration(duration_str: Any) -> float:
    """Parse job duration strings to years."""
    if pd.isna(duration_str) or duration_str == '':
        return 0.0
    
    d = str(duration_str).lower()
    
    if '10+' in d or '>10' in d:
        return 12.0
    elif '6-9' in d or '6-10' in d:
        return 7.5
    elif '4-5' in d or '4-6' in d:
        return 4.5
    elif '2-3' in d or '2-4' in d:
        return 2.5
    elif '<2' in d or '0-2' in d or '1-2' in d:
        return 1.0
    elif '<1' in d or '0-1' in d:
        return 0.5
    else:
        try:
            return float(''.join(c for c in d if c.isdigit() or c == '.'))
        except ValueError:
            return 0.0
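
To make the ranking convention concrete, here is a condensed stand-in for `parse_qs_rank` (the name `qs_lower_bound` and the sample strings are hypothetical, not part of the notebook). It assumes the same rule as above: keep the lower bound of a range, strip a trailing `+`, and fall back to the sentinel 999 when nothing parses.

```python
def qs_lower_bound(value, missing=999):
    """Condensed stand-in for parse_qs_rank: keep the lower bound of a rank string."""
    text = str(value).strip().replace("+", "")
    try:
        # '101-150' -> '101'; a plain '7' is left unchanged by the split
        return int(float(text.split("-")[0]))
    except ValueError:
        return missing


for raw in ["7", "101-150", "200+", "", "unranked"]:
    print(repr(raw), "->", qs_lower_bound(raw))
```

Note that '200+' maps to 200 rather than to the sentinel, so such a school still counts as ranked when the ranks are later filtered with `r < 999`.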



### Education and Job Feature Extraction

For each founder, we turn the JSON data into numerical features across jobs and education
- Education extraction parses the educations_json column and extracts or analyses:
    - Degrees and their number (PhD, MBA, Masters)
    - Fields (STEM, Business)
    - University Ranking (QS)

- Job extraction parses the jobs_json column and extracts or analyses:
    - Seniority: CxO, VP, director roles
    - Role types: tech, product, business
    - Company size: big company vs startup
    - Experience: total years worked
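
As a rough illustration of the education side, the sketch below turns one JSON string into a few of the features listed above. The record itself is hypothetical; the keys (`degree`, `field`, `qs_ranking`) and the degree scores follow the extraction code, while the STEM keyword list is shortened for brevity.

```python
import json

# Hypothetical record; the keys follow those read by extract_education_features.
educations_json = json.dumps([
    {"degree": "BSc", "field": "Computer Science", "qs_ranking": "101-150"},
    {"degree": "PhD", "field": "Physics", "qs_ranking": "4"},
])

records = json.loads(educations_json)
degrees = [r["degree"].lower() for r in records if r.get("degree")]
fields = [r["field"].lower() for r in records if r.get("field")]


def degree_score(degree: str) -> int:
    """PhD/doctorate = 4, MBA = 3, Master's = 2, anything else = 0."""
    if "phd" in degree or "doctor" in degree:
        return 4
    if "mba" in degree:
        return 3
    if "master" in degree or "msc" in degree:
        return 2
    return 0


stem_keywords = ("computer", "physics")  # shortened list, for illustration only
features = {
    "edu_num_degrees": len(degrees),
    "edu_highest_degree_score": max((degree_score(d) for d in degrees), default=0),
    "edu_is_stem": int(any(kw in f for f in fields for kw in stem_keywords)),
}
print(features)
```

Because matching is by substring, a bachelor's degree scores 0 here and only the PhD lifts the highest-degree score to 4.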

In [18]:
def extract_education_features(df: pd.DataFrame) -> pd.DataFrame:
    """Extract education-related features."""
    features = []
    
    print("Extracting education features...")
    
    for idx, row in df.iterrows():
        edu_data = parse_json_column(row.get('educations_json', '[]'))
        
        degrees = [e.get('degree', '') for e in edu_data if e.get('degree')]
        fields = [e.get('field', '') for e in edu_data if e.get('field')]
        qs_ranks = [parse_qs_rank(e.get('qs_ranking')) for e in edu_data]
        qs_ranks = [r for r in qs_ranks if r < 999]
        
        # degree analysis
        degree_score, has_phd, has_mba, has_masters = 0, 0, 0, 0
        for d in degrees:
            d_lower = d.lower()
            if 'phd' in d_lower or 'doctor' in d_lower:
                degree_score = max(degree_score, 4)
                has_phd = 1
            elif 'mba' in d_lower:
                degree_score = max(degree_score, 3)
                has_mba = 1
            elif 'master' in d_lower or 'msc' in d_lower:
                degree_score = max(degree_score, 2)
                has_masters = 1
        
        # field analysis
        stem_kw = ['computer', 'engineering', 'math', 'physics', 'science', 'data']
        business_kw = ['business', 'mba', 'economics', 'finance', 'management']
        is_stem = any(any(kw in f.lower() for kw in stem_kw) for f in fields)
        is_business = any(any(kw in f.lower() for kw in business_kw) for f in fields)
        
        # QS ranking
        best_qs = min(qs_ranks) if qs_ranks else 999
        
        features.append({
            'edu_num_degrees': len(degrees),
            'edu_highest_degree_score': degree_score,
            'edu_best_qs_rank': best_qs if best_qs < 999 else np.nan,
            'edu_is_top10_school': int(best_qs <= 10),
            'edu_is_top50_school': int(best_qs <= 50),
            'edu_is_top100_school': int(best_qs <= 100),
            'edu_has_phd': has_phd,
            'edu_has_mba': has_mba,
            'edu_has_masters': has_masters,
            'edu_has_advanced_degree': int(has_phd or has_mba or has_masters),
            'edu_is_stem': int(is_stem),
            'edu_is_business': int(is_business),
            'edu_is_stem_and_business': int(is_stem and is_business),
        })
    
    result = pd.DataFrame(features)
    print(f"  ✓ Extracted {len(result.columns)} education features")
    return result


def extract_job_features(df: pd.DataFrame) -> pd.DataFrame:
    """Extract job-related features."""
    features = []
    
    print("Extracting job features...")
    
    for idx, row in df.iterrows():
        jobs_data = parse_json_column(row.get('jobs_json', '[]'))
        
        num_jobs = len(jobs_data)
        roles = [j.get('role', '') for j in jobs_data]
        industries = [j.get('industry', '') for j in jobs_data if j.get('industry')]
        durations = [j.get('duration', '') for j in jobs_data]
        company_sizes = [j.get('company_size', '') for j in jobs_data]
        
        # seniority (substring matching: note that 'cto' also matches 'director')
        num_cxo = sum(1 for r in roles if any(kw in r.lower() for kw in ['ceo', 'cto', 'cfo', 'chief']))
        num_founder = sum(1 for r in roles if any(kw in r.lower() for kw in ['founder', 'co-founder']))
        num_vp = sum(1 for r in roles if any(kw in r.lower() for kw in ['vp', 'vice president']))
        num_director = sum(1 for r in roles if any(kw in r.lower() for kw in ['director', 'head of']))
        total_senior = num_cxo + num_founder + num_vp + num_director
        
        # role types (substring matching: 'pm' also matches e.g. 'development')
        num_tech = sum(1 for r in roles if any(kw in r.lower() for kw in ['engineer', 'developer', 'scientist']))
        num_product = sum(1 for r in roles if any(kw in r.lower() for kw in ['product', 'pm', 'ux']))
        num_business = sum(1 for r in roles if any(kw in r.lower() for kw in ['sales', 'marketing', 'business']))
        
        # company size
        big_co_kw = ['5001', '10001', '10000+', '1001-5000']
        startup_kw = ['1-10', '11-50', '51-200']
        has_big_co = any(any(kw in str(cs) for kw in big_co_kw) for cs in company_sizes if cs)
        has_startup = any(any(kw in str(cs) for kw in startup_kw) for cs in company_sizes if cs)
        
        total_years = sum(parse_duration(d) for d in durations)
        unique_industries = len(set(industries))
        
        features.append({
            'job_num_prior_jobs': num_jobs,
            'job_num_senior_roles': total_senior,
            'job_num_cxo_roles': num_cxo,
            'job_num_founder_roles': num_founder,
            'job_num_tech_roles': num_tech,
            'job_num_product_roles': num_product,
            'job_num_business_roles': num_business,
            'job_total_experience_years': total_years,
            'job_num_industries': unique_industries,
            'job_has_cxo_experience': int(num_cxo > 0),
            'job_has_prior_founder_exp': int(num_founder > 0),
            'job_has_big_company_exp': int(has_big_co),
            'job_has_startup_exp': int(has_startup),
            'job_is_technical': int(num_tech > 0),
            'job_is_technical_senior': int(num_tech > 0 and total_senior > 0),
            'job_is_repeat_founder': int(num_founder >= 2),
            'job_big_company_then_startup': int(has_big_co and has_startup),
        })
    
    result = pd.DataFrame(features)
    print(f"  ✓ Extracted {len(result.columns)} job features")
    return result

edu_features = extract_education_features(df)
job_features = extract_job_features(df)
X_baseline = pd.concat([edu_features, job_features], axis=1)

print(X_baseline)

Extracting education features...
  ✓ Extracted 13 education features
Extracting job features...
  ✓ Extracted 17 job features
      edu_num_degrees  edu_highest_degree_score  edu_best_qs_rank  \
0                   1                         0               1.0   
1                   0                         0               6.0   
2                   0                         0               NaN   
3                   2                         4               6.0   
4                   1                         0               4.0   
...               ...                       ...               ...   
8995                1                         0             163.0   
8996                1                         0             200.0   
8997                4                         4              48.0   
8998                3                         4               3.0   
8999                1                         0             200.0   

      edu_is_top10_school  edu_is_top50_s

### Removing Redundant Features 

This code is essentially a feature quality analysis.
- Feature stats measure how useful a feature is for predicting success
    - Precision: of the founders with this feature, what % are successful?
    - Coverage: what % of founders have the feature?
    - Lift: how much better than the base rate is this feature (precision / overall success rate)?
- Jaccard similarity measures how similar two binary features are
    - this is intersection / union
    - if two features are almost identical, one is redundant
- Removal drops near-duplicate features to reduce noise
    - Ranks features by lift, keeps high-lift features and removes those too similar to one already kept
    - Reduces overfitting, speeds up training and makes the model more interpretable
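
A small worked example may help fix the definitions. The arrays below are hypothetical toy data (10 founders, 3 of them successful), not values from the Vela datasets.

```python
import numpy as np

# Hypothetical toy data: 10 founders, 3 successes (base rate 30%).
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
feature_a = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])  # applies to 3 founders
feature_b = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0, 0])  # near-duplicate of feature_a

applies = feature_a.astype(bool)
coverage = applies.mean()          # share of founders with the feature
precision = y[applies].mean()      # success rate among those founders
lift = precision / y.mean()        # precision relative to the base rate

a, b = feature_a.astype(bool), feature_b.astype(bool)
jaccard = (a & b).sum() / (a | b).sum()  # intersection over union

print(f"coverage={coverage:.2f} precision={precision:.2f} lift={lift:.2f}")
print(f"jaccard={jaccard:.2f}")
```

Here `feature_a` has the higher lift (about 2.2x versus about 1.7x for `feature_b`), and the two overlap with a Jaccard similarity of 0.75, so a redundancy threshold of 0.6 would keep `feature_a` and drop `feature_b`.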

In [20]:
def compute_feature_stats(y: np.ndarray, feature_values: np.ndarray, 
                          threshold: Optional[float] = None) -> Dict:
    """Compute precision, coverage, and lift for a feature."""
    base_rate = y.mean()
    
    if threshold is not None:
        applies = feature_values > threshold
    else:
        applies = feature_values.astype(bool)
    
    coverage = np.mean(applies)
    n_applies = np.sum(applies)
    
    if n_applies > 0:
        precision = y[applies].mean()
        lift = precision / base_rate if base_rate > 0 else 0
    else:
        precision = 0.0
        lift = 0.0
    
    return {
        "precision": precision,
        "coverage": coverage,
        "lift": lift,
        "n_applies": int(n_applies),
        "n_success_applies": int(y[applies].sum()) if n_applies > 0 else 0,
        "base_rate": base_rate
    }

def compute_jaccard_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Compute Jaccard similarity between two binary feature vectors."""
    f1_bool = f1.astype(bool)
    f2_bool = f2.astype(bool)
    intersection = np.sum(f1_bool & f2_bool)
    union = np.sum(f1_bool | f2_bool)
    return intersection / union if union > 0 else 0.0

def remove_redundant_features(X: pd.DataFrame, y: pd.Series, 
                              threshold: float = 0.8) -> Tuple[pd.DataFrame, List]:
    """Remove features with Jaccard > threshold."""
    lifts = {}
    for col in X.columns:
        stats = compute_feature_stats(y.values, X[col].fillna(0).values)
        lifts[col] = stats['lift']
    
    sorted_cols = sorted(X.columns, key=lambda c: lifts[c], reverse=True)
    
    kept_features = []
    removed_features = []
    
    for col in sorted_cols:
        is_redundant = False
        for kept in kept_features:
            sim = compute_jaccard_similarity(
                X[col].fillna(0).values,
                X[kept].fillna(0).values
            )
            if sim > threshold:
                is_redundant = True
                removed_features.append((col, kept, sim))
                break
        
        if not is_redundant:
            kept_features.append(col)
    
    print(f"Redundancy removal: {len(X.columns)} → {len(kept_features)} features")
    return X[kept_features], removed_features

y = df['success']
X_baseline_clean, removed_log = remove_redundant_features(X_baseline, y, threshold=0.6)

print(X_baseline_clean)
print(removed_log)

Redundancy removal: 30 → 17 features
      edu_is_top10_school  edu_is_top50_school  job_is_technical_senior  \
0                       1                    1                        1   
1                       1                    1                        0   
2                       0                    0                        0   
3                       1                    1                        0   
4                       1                    1                        0   
...                   ...                  ...                      ...   
8995                    0                    0                        1   
8996                    0                    0                        0   
8997                    0                    1                        0   
8998                    1                    1                        0   
8999                    0                    0                        0   

      edu_has_phd  job_big_company_then_startup  edu_is_stem

### Saving key information

In [21]:
'''
output_dir = Path("./outputs/experiment/processed")

X_baseline_clean.to_csv(output_dir / 'baseline_features_COMBINED.csv', index=False)

pd.DataFrame({
    'founder_idx': range(len(y)),
    'success': y.values,
    'source': df['source'].values  
}).to_csv(output_dir / 'labels_COMBINED.csv', index=False)

founders_text = pd.DataFrame({
    'founder_idx': range(len(df)),
    'anonymised_prose': df['anonymised_prose'],
    'source': df['source']  
})
founders_text.to_csv(output_dir / 'founders_text_COMBINED.csv', index=False)

df.to_csv(Path("./data/raw") / 'vcbench_combined.csv', index=False)
'''

# Metrics 

In traditional ML, we want to classify everything correctly, so false positives and false negatives are equally bad.

As a VC, Vela wants to find the BEST founders to invest in with limited capital and time. This is reflected in the metrics used throughout this notebook.

### Precision at K

This is the most important metric. It:
1. Ranks all founders by their predicted success probability
2. Takes the top K (e.g. top 100) highest-scored founders
3. Checks what % of the top K actually succeeded

Lift refers to how much better your precision is than the random success rate (9%, the share of successful founders in the dataset).
- For K = 100 founders, choosing 35 successful founders would be 35% precision, which is roughly a 3.9x lift (35% / 9%) over the baseline of 9% successful founders in the dataset.

In [23]:
def compute_precision_at_k(y_true: np.ndarray, y_proba: np.ndarray, k: int) -> float:
    """Compute precision at top-K predictions."""
    top_k_idx = np.argsort(y_proba)[-k:]
    return y_true[top_k_idx].mean()
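
The ranking and lift calculation can be checked by hand on a toy example. The labels and scores below are hypothetical; the last line simply redoes the 35% versus 9% arithmetic from the text.

```python
import numpy as np

# Hypothetical scores for 10 founders; 1 marks a founder who succeeded.
y_true = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
y_proba = np.array([0.10, 0.90, 0.20, 0.80, 0.70, 0.05, 0.30, 0.40, 0.15, 0.25])

k = 3
top_k = np.argsort(y_proba)[::-1][:k]  # indices of the k highest scores
precision_at_k = y_true[top_k].mean()  # 2 of the top 3 succeeded
base_rate = y_true.mean()              # 3 of 10 succeeded
lift = precision_at_k / base_rate

print(f"P@{k} = {precision_at_k:.3f}, base rate = {base_rate:.2f}, lift = {lift:.2f}x")
print(f"35% precision on a 9% base rate: {0.35 / 0.09:.2f}x lift")
```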

### F Beta

This is the foundation of the F0.5 score used by Vela. The idea is that, when balancing precision and recall, precision is weighted more heavily when beta < 1.
- F1 score (beta = 1): Equal weight to precision and recall
- F0.5 score (beta = 0.5): Precision counts 2x more than recall
- F2 score (beta = 2): Recall counts 2x more than precision

In [22]:
def compute_f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """
    Compute F-beta score.
    
    F0.5 weights precision MORE than recall (Vela's preferred metric).
    From GPTree paper: "we prioritize precision over recall"
    """
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
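
To see the weighting numerically, the sketch below restates the formula so that it runs on its own and evaluates it for one hypothetical precision/recall pair (0.50 and 0.25). The function name `f_beta` is used only for this illustration.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.50, 0.25  # hypothetical values
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
# F0.5 = 0.417
# F1 = 0.333
# F2 = 0.278
```

With precision higher than recall, the score rises as beta falls, which is why F0.5 rewards a precise but selective model.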

### Metric Wrappers

These wrapper functions apply the metrics above.
- `compute_precision_recall_f05` counts the confusion-matrix cells and calculates precision, recall and F0.5
    - TP: invested in a founder who succeeded
    - FP: invested in a founder who failed
    - FN: passed on a founder who succeeded
    - TN: passed on a founder who failed
- `print_vela_metrics` provides a comprehensive report during training, comparing the model to benchmarks and reporting the optimal threshold
    - P@n is the precision if you invest in your top n picks
    - Precision at a threshold x is the precision when you invest only in founders scored at or above confidence x
    - These results are compared to benchmarks from Vela papers
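
The four cells can be counted directly from label and decision arrays. The arrays below are hypothetical; TN is computed only for completeness, since precision and recall do not need it.

```python
import numpy as np

# Hypothetical outcomes and decisions for 8 founders.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])  # 1 = founder succeeded
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0])  # 1 = we invested

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # invested, succeeded
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # invested, failed
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # passed, succeeded
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # passed, failed

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```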

In [None]:
def compute_precision_recall_f05(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute precision, recall, and F0.5 score."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f05 = compute_f_beta(precision, recall, beta=0.5)
    
    return {'precision': precision, 'recall': recall, 'f05': f05,
            'tp': int(tp), 'fp': int(fp), 'fn': int(fn)}

def print_vela_metrics(y_true: np.ndarray, y_proba: np.ndarray, 
                       model_name: str = "Model"):
    """
    Print metrics in Vela's preferred format.
    
    Reports: P@K, Precision, Recall, F0.5 (recall below 10% is flagged, not enforced)
    """
    base_rate = y_true.mean()
    n_positive = int(y_true.sum())
    
    print(f"\n{'='*65}")
    print(f"📊 {model_name} - VELA METRICS")
    print(f"{'='*65}")
    print(f"Base rate: {base_rate:.2%} ({n_positive} positive / {len(y_true)} total)")
    
    # P@K metrics
    print(f"\n📈 Precision @ K:")
    for k in [50, 100, 200, 500]:
        if k <= len(y_true):
            p_k = compute_precision_at_k(y_true, y_proba, k)
            lift = p_k / base_rate if base_rate > 0 else 0
            print(f"   P@{k}: {p_k:.4f} ({lift:.2f}x lift)")
    
    # Optimal threshold metrics (F0.5; min_recall=0.0, so no recall floor is enforced here)
    print(f"\n🎯 Optimal Threshold (min recall = 0%):")
    opt = find_optimal_threshold_f05(y_true, y_proba, min_recall=0.0)
    print(f"   Threshold: {opt['threshold']:.2f}")
    print(f"   Precision: {opt['precision']:.4f} ({opt['precision']/base_rate:.2f}x lift)")
    print(f"   Recall:    {opt['recall']:.4f} {'✓' if opt['recall'] >= 0.10 else '⚠️ < 10%'}")
    print(f"   F0.5:      {opt['f05']:.4f}")
    print(f"   (TP={opt['tp']}, FP={opt['fp']}, FN={opt['fn']})")
    
    # Benchmark comparison
    print(f"\n📊 Comparison to Vela Benchmarks:")
    print(f"   {'Model':<20} {'Precision':<12} {'Recall':<10} {'F0.5':<10}")
    print(f"   {'-'*52}")
    print(f"   {'Your Model':<20} {opt['precision']:<12.3f} {opt['recall']:<10.3f} {opt['f05']:<10.3f}")
    print(f"   {'RRF (paper)':<20} {'0.131':<12} {'0.101':<10} {'0.124':<10}")
    print(f"   {'GPTree (paper)':<20} {'0.373':<12} {'0.271':<10} {'0.334':<10}")
    print(f"   {'Tier-1 VCs':<20} {'0.056':<12} {'-':<10} {'-':<10}")
    
    return opt


### Threshold-based Metrics

These functions were created but not used in training, for a number of reasons:
1. It is not natural for a VC to say 'invest above 65%'
2. A fixed threshold is not flexible
3. It generalises less well to new data
4. It does not preserve order and loses the relative ranking
5. It overfits to the validation set
6. 'Precision at 0.65' is far less interpretable than '37% hit rate in top 100'

These functions apply a threshold to the probability outputs in order to increase precision.

- `find_optimal_threshold_f05` finds the optimal 'confidence threshold' for making investment decisions
    - a trained ML model outputs a probability for each founder, which has to be converted into a binary decision
    - the probability threshold is optimised on VALIDATION DATA using the validation labels
    - by default it maintains a minimum of 10% recall, since we want to catch at least some successful founders
    - the threshold maximises F0.5 (precision weighted)
    - the function tests 80 different thresholds (0.10 to 0.89 in steps of 0.01) and picks one based on the validation data
- `find_threshold_max_precision` can be used to maximise hit rate and ignore recall. This would suit a small fund that doesn't care about missing opportunities
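
The threshold search can be sketched on a toy validation set. This is a simplified illustration under stated assumptions, not the notebook's function: the labels and probabilities are hypothetical, and the loop uses the same 0.10 to 0.89 grid and the 10% recall floor described above.

```python
import numpy as np

# Hypothetical validation labels and predicted probabilities.
y_val = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_val = np.array([0.85, 0.60, 0.15, 0.55, 0.40, 0.20, 0.35, 0.30, 0.10, 0.45])

min_recall = 0.10
best = None
for thr in np.arange(0.1, 0.9, 0.01):
    pred = p_val >= thr
    tp = np.sum(pred & (y_val == 1))
    fp = np.sum(pred & (y_val == 0))
    fn = np.sum(~pred & (y_val == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = 0.25 * precision + recall  # beta^2 = 0.25 for F0.5
    f05 = 1.25 * precision * recall / denom if denom > 0 else 0.0
    # Keep the first threshold with the best F0.5 that still meets the recall floor.
    if recall >= min_recall and (best is None or f05 > best["f05"]):
        best = {"threshold": round(float(thr), 2), "precision": float(precision),
                "recall": float(recall), "f05": float(f05)}

print(best)
```

On this toy data the search settles on a high threshold that selects only the single most confident founder: precision is 1.0 but recall is only one in three, which illustrates how a precision-weighted objective trades coverage for hit rate.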
 

In [24]:
def find_optimal_threshold_f05(y_true: np.ndarray, y_proba: np.ndarray,
                                min_recall: float = 0.10) -> dict:
    """
    Find threshold that maximizes F0.5 while maintaining minimum recall.
    
    Vela's guidance: "Maintain recall at at least 10% and maximise precision"
    """
    best_f05 = 0
    best_threshold = 0.5
    best_metrics = None
    
    for threshold in np.arange(0.1, 0.9, 0.01):
        y_pred = (y_proba >= threshold).astype(int)
        metrics = compute_precision_recall_f05(y_true, y_pred)
        
        if metrics['recall'] >= min_recall and metrics['f05'] > best_f05:
            best_f05 = metrics['f05']
            best_threshold = threshold
            best_metrics = metrics
    
    if best_metrics is None:
        y_pred = (y_proba >= 0.5).astype(int)
        best_metrics = compute_precision_recall_f05(y_true, y_pred)
        best_threshold = 0.5
    
    best_metrics['threshold'] = best_threshold
    return best_metrics

def find_threshold_max_precision(y_true, y_proba, min_recall: float = 0.0):
    """
    Choose the threshold that gives highest precision,
    optionally with a minimum recall constraint.
    """
    best_prec = 0.0
    best_thr = 0.5
    best_metrics = None

    for thr in np.arange(0.1, 0.95, 0.01):
        y_pred = (y_proba >= thr).astype(int)

        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        rec  = tp / (tp + fn) if (tp + fn) > 0 else 0.0

        if rec >= min_recall and prec > best_prec:
            best_prec = prec
            best_thr = thr
            best_metrics = {
                "precision": prec,
                "recall": rec,
                "tp": int(tp),
                "fp": int(fp),
                "fn": int(fn),
                "threshold": thr,
            }

    if best_metrics is None:  # fallback
        thr = 0.5
        y_pred = (y_proba >= thr).astype(int)
        m = compute_precision_recall_f05(y_true, y_pred)
        m["threshold"] = thr
        return m

    return best_metrics

# Graph Building

In this section we build the heterogeneous graph structure.