## ISE: Defect Detection Challenge
#### Description
In this competition, your task is to develop a model that can accurately classify source code snippets as either secure or insecure. With the rise of software vulnerabilities like resource leaks, use-after-free vulnerabilities, and denial-of-service (DoS) attacks, identifying insecure code is crucial for maintaining robust software systems.
Participants will be provided with a dataset containing labeled code snippets. The labels indicate whether the code is secure (0) or insecure (1). Your goal is to create an effective machine learning model that can predict these labels with high accuracy.
#### Key Objectives 
- Analyze code snippets for potential vulnerabilities.
- Develop models to automate the classification of secure and insecure code.
- Ensure the ROC AUC score exceeds 0.63.
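
For reference, the ROC score in the last objective is the area under the ROC curve, which is computed from ranked scores rather than from hard 0/1 labels. The sketch below is illustrative only: `roc_auc` is a hypothetical helper (not part of the competition code) that estimates AUC as the probability that a randomly chosen insecure sample receives a higher score than a randomly chosen secure one, with ties counted as one half. The toy labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of ROC AUC; O(n_pos * n_neg)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = insecure) and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.30, 0.70]
y_hard = [int(p > 0.35) for p in y_prob]  # thresholded 0/1 labels

auc_prob = roc_auc(y_true, y_prob)
auc_hard = roc_auc(y_true, y_hard)
print(f"AUC from probabilities: {auc_prob:.3f}")
print(f"AUC from hard labels:   {auc_hard:.3f}")
```

On this toy data the thresholded labels score lower than the raw probabilities, because thresholding collapses the ranking into two groups and creates ties.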

In [1]:
import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import tensorflow as tf
from tensorflow import keras
from keras import layers

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import warnings
warnings.filterwarnings('ignore')


In [2]:
# data loading
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')

# data info
print("Training data shape:", train_df.shape)
print("Test data shape:", test_df.shape)
print("\nTraining data columns:", train_df.columns.tolist())

print(train_df.head(3))
print(train_df.isnull().sum())  


Training data shape: (20000, 3)
Test data shape: (7000, 2)

Training data columns: ['ID', 'code', 'Label']
   ID                                               code  Label
0   0  int page_check_range(target_ulong start, targe...      0
1   1  static void pxa2xx_lcdc_dma0_redraw_rot0(PXA2x...      0
2   2  void OPPROTO op_POWER_slq (void)\n\n{\n\n    u...      1
ID       0
code     0
Label    0
dtype: int64


## C++ Code Preprocessing Pipeline 
- Basic Text Cleaning 
- C++ Specific Normalization
- Feature Engineering
- Tokenization
- Vectorization using TF-IDF 
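
The steps listed above can be sketched end to end on two toy snippets. This is a simplified, standard-library-only illustration, not the notebook's implementation: `clean`, `tokenize`, and `tfidf` are hypothetical helpers, and the weighting is plain `tf * log(N / df)`, whereas scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default.

```python
import math
import re
from collections import Counter

def clean(code):
    code = re.sub(r'/\*.*?\*/', ' ', code, flags=re.DOTALL)  # block comments
    code = re.sub(r'//[^\n]*', ' ', code)                    # line comments
    code = re.sub(r'"[^"\n]*"', 'STR', code)                 # string literals
    return re.sub(r'\s+', ' ', code).strip()                 # whitespace

def tokenize(code):
    code = re.sub(r'\b\d+\b', 'NUM', code)                   # numeric literals
    # Identifiers stay whole; every other non-space character is its own token
    return re.findall(r'[A-Za-z_]\w*|\S', code)

def tfidf(docs):
    """Unsmoothed tf * log(N / df) per token, one dict per document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{tok: cnt / len(doc) * math.log(n / df[tok])
             for tok, cnt in Counter(doc).items()} for doc in docs]

# Made-up snippets for illustration
snippets = [
    'char buf[8]; strcpy(buf, src); // no bounds check',
    'char buf[8]; strncpy(buf, src, 7); /* bounded */',
]
tokens = [tokenize(clean(s)) for s in snippets]
weights = tfidf(tokens)
print(tokens[0])
print(sorted(tok for tok, w in weights[0].items() if w > 0))
```

Tokens shared by every document (such as `char` here) get zero weight, so only the distinguishing tokens carry signal into the vectorized representation.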

In [None]:
import re
import string

def clean_cpp_code(code):
    # Remove single-line comments
    code = re.sub(r'//.*', '', code)
    
    # Remove multi-line comments  
    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
    
    # Remove string literals 
    code = re.sub(r'"[^"]*"', '"STRING"', code)
    code = re.sub(r"'[^']*'", "'CHAR'", code)
    
    # Normalize whitespace
    code = re.sub(r'\s+', ' ', code)
    code = code.strip()
    
    return code

In [None]:
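# Keywords and security-relevant library functions that are kept verbatim
KEEP_TOKENS = {'int', 'char', 'void', 'if', 'else', 'for', 'while',
               'malloc', 'free', 'strcpy', 'strlen', 'memcpy', 'sizeof'}

def normalize_identifier(name):
    # Keep important keywords/functions and the STRING placeholder from cleaning
    if name.lower() in KEEP_TOKENS or name == 'STRING':
        return name
    elif len(name) <= 3:
        return name  # Keep short vars
    else:
        return 'VAR'  # Generic variable

def normalize_function_call(match):
    # Leave preserved names (e.g. malloc(, if () intact; only genericize the rest
    name = match.group(1)
    return match.group(0) if name.lower() in KEEP_TOKENS else 'FUNC('

def normalize_cpp_code(code):
    # Normalize identifiers
    code = re.sub(r'\b[a-zA-Z_][a-zA-Z0-9_]*\b', 
                  lambda m: normalize_identifier(m.group()), code)
    
    # Normalize numeric literals
    code = re.sub(r'\b\d+\b', 'NUM', code)
    code = re.sub(r'\b0x[0-9a-fA-F]+\b', 'HEX', code)
    
    # Normalize function calls without overwriting the preserved names
    code = re.sub(r'\b(\w+)\s*\(', normalize_function_call, code)
    
    return code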

In [None]:
def extract_security_features(code):
    features = {}
    
    dangerous_funcs = ['strcpy', 'strcat', 'gets', 'sprintf', 'scanf', 
                      'malloc', 'free', 'memcpy', 'strncpy', 'strncat',
                      'vsprintf', 'vsnprintf', 'sscanf', 'fscanf', 'fgets',
                      'alloca', 'realloc', 'calloc']
    
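    for func in dangerous_funcs:
        # Match whole identifiers so that e.g. 'gets' does not fire on 'fgets'
        features[f'has_{func}'] = int(bool(re.search(rf'\b{func}\b', code, re.IGNORECASE)))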
    
    # Memory operations 
    features['ptr_operations'] = len(re.findall(r'\*|\&', code))
    features['array_access'] = len(re.findall(r'\[.*?\]', code))
    features['memory_alloc'] = len(re.findall(r'malloc|calloc|new|alloca|realloc', code, re.IGNORECASE))
    features['memory_free'] = len(re.findall(r'free|delete', code, re.IGNORECASE))
    
    # Advanced vulnerability patterns
    features['buffer_overflow_risk'] = len(re.findall(r'(strcpy|strcat|gets|sprintf)\s*\(', code, re.IGNORECASE))
    features['format_string_vuln'] = len(re.findall(r'printf\s*\(\s*[a-zA-Z_]\w*\s*[,)]', code))
    features['use_after_free_risk'] = detect_use_after_free_pattern(code)
    features['double_free_risk'] = detect_double_free_pattern(code)
    features['memory_leak_risk'] = abs(features['memory_alloc'] - features['memory_free'])
    
    # Input validation issues
    features['unchecked_input'] = len(re.findall(r'(scanf|gets|fgets)\s*\(', code, re.IGNORECASE))
    features['missing_null_check'] = detect_missing_null_checks(code)
    features['array_bounds_risk'] = detect_array_bounds_issues(code)
    
    # Integer overflow/underflow risks
    features['integer_overflow_risk'] = len(re.findall(r'(\+\+|\-\-|\+=|\-=|\*=).*?(\[|\*)', code))
    features['signed_unsigned_mix'] = len(re.findall(r'(unsigned|signed)\s+\w+.*?(signed|unsigned)', code, re.IGNORECASE))
    
    # Control flow complexity (enhanced)
    features['if_statements'] = len(re.findall(r'\bif\b', code))
    features['nested_loops'] = detect_nested_complexity(code, r'\b(for|while)\b')
    features['switch_statements'] = len(re.findall(r'\bswitch\b', code))
    features['goto_statements'] = len(re.findall(r'\bgoto\b', code))
    features['function_calls'] = len(re.findall(r'\w+\s*\(', code))
    
    # Code quality indicators
    features['magic_numbers'] = len(re.findall(r'\b\d{2,}\b', code))
    features['long_functions'] = int(len(code.split('\n')) > 50)
    features['deep_nesting'] = calculate_max_nesting_depth(code)
    features['cyclomatic_complexity'] = estimate_cyclomatic_complexity(code)
    
    # String and file operations
    features['string_operations'] = len(re.findall(r'str(cpy|cat|cmp|len|chr|str)', code, re.IGNORECASE))
    features['file_operations'] = len(re.findall(r'(fopen|fclose|fread|fwrite|fprintf)', code, re.IGNORECASE))
    
    # Pointer arithmetic and casting
    features['pointer_arithmetic'] = len(re.findall(r'(\*\s*\w+\s*[\+\-]|\w+\s*[\+\-]\s*\d+\s*\))', code))
    features['type_casting'] = len(re.findall(r'\([a-zA-Z_]\w*\s*\*?\s*\)', code))
    features['void_pointer_usage'] = len(re.findall(r'void\s*\*', code, re.IGNORECASE))
    
    # Security-specific patterns
    features['hardcoded_values'] = detect_hardcoded_credentials(code)
    # Word boundaries keep 'su' from matching inside names such as 'result'
    features['privilege_operations'] = len(re.findall(r'\b(setuid|setgid|chmod|chown|su|sudo)\b', code, re.IGNORECASE))
    features['system_calls'] = len(re.findall(r'(system|exec|popen|fork)', code, re.IGNORECASE))
    
    # Statistical features
    features['code_length'] = len(code)
    features['line_count'] = len(code.split('\n'))
    features['avg_line_length'] = features['code_length'] / max(1, features['line_count'])
    features['char_entropy'] = calculate_entropy(code)
    features['unique_char_ratio'] = len(set(code.lower())) / max(1, len(code))
    
    return features

In [None]:
def detect_use_after_free_pattern(code):
    """Detect potential use-after-free patterns"""
    patterns = [
        r'free\s*\([^)]+\).*?\*\s*\w+',  # free followed by dereference
        r'delete\s+\w+.*?\w+\s*\[',      # delete followed by array access
        r'free\s*\([^)]+\).*?\w+\s*\(',  # free followed by function call with same var
    ]
    
    count = 0
    for pattern in patterns:
        matches = re.findall(pattern, code, re.DOTALL | re.IGNORECASE)
        count += len(matches)
    
    return count

def detect_double_free_pattern(code):
    """Detect potential double free patterns"""
    free_calls = re.findall(r'free\s*\(\s*(\w+)\s*\)', code, re.IGNORECASE)
    if len(free_calls) != len(set(free_calls)):
        return 1  # Potential double free
    return 0

def detect_missing_null_checks(code):
    """Detect pointer usage without null checks"""
    ptr_usage = len(re.findall(r'\*\s*\w+', code))
    null_checks = len(re.findall(r'if\s*\(\s*\w+\s*[!=]=\s*NULL\s*\)', code, re.IGNORECASE))
    return max(0, ptr_usage - null_checks)

def detect_array_bounds_issues(code):
    """Detect array access without bounds checking"""
    array_access = re.findall(r'\w+\s*\[\s*([^]]+)\s*\]', code)
    bounds_checks = len(re.findall(r'if\s*\([^)]*(<|>|<=|>=)[^)]*\)', code))
    return max(0, len(array_access) - bounds_checks)

def detect_nested_complexity(code, pattern):
    """Detect nested control structures"""
    lines = code.split('\n')
    max_nested = 0
    current_nested = 0
    
    for line in lines:
        if re.search(pattern, line):
            current_nested += 1
            max_nested = max(max_nested, current_nested)
        if '}' in line:
            current_nested = max(0, current_nested - 1)
    
    return max_nested

def calculate_max_nesting_depth(code):
    """Calculate maximum nesting depth"""
    depth = 0
    max_depth = 0
    
    for char in code:
        if char == '{':
            depth += 1
            max_depth = max(max_depth, depth)
        elif char == '}':
            depth = max(0, depth - 1)
    
    return max_depth

def estimate_cyclomatic_complexity(code):
    """Estimate cyclomatic complexity"""
    keywords = ['if', 'else', 'elif', 'for', 'while', 'case', 'catch']
    # Operators are not word characters, so they must be matched without \b
    operators = [r'\?', r'&&', r'\|\|']
    complexity = 1  # Base complexity
    
    for keyword in keywords:
        complexity += len(re.findall(rf'\b{keyword}\b', code, re.IGNORECASE))
    for operator in operators:
        complexity += len(re.findall(operator, code))
    
    return complexity

def detect_hardcoded_credentials(code):
    """Detect hardcoded passwords, keys, etc."""
    patterns = [
        r'(password|passwd|pwd)\s*=\s*["\'][^"\']{3,}["\']',
        r'(key|secret|token)\s*=\s*["\'][^"\']{8,}["\']',
        r'(api_key|apikey)\s*=\s*["\'][^"\']{10,}["\']',
    ]
    
    count = 0
    for pattern in patterns:
        count += len(re.findall(pattern, code, re.IGNORECASE))
    
    return count

def calculate_entropy(text):
    """Calculate Shannon entropy of text"""
    if not text:
        return 0
    
    char_counts = {}
    for char in text.lower():
        char_counts[char] = char_counts.get(char, 0) + 1
    
    entropy = 0
    text_len = len(text)
    
    for count in char_counts.values():
        probability = count / text_len
        if probability > 0:
            entropy -= probability * np.log2(probability)
    
    return entropy

In [None]:
def preprocess_cpp_dataset(df):
    """
    Complete preprocessing pipeline for C++ code dataset with enhanced features
    """
    processed_df = df.copy()
    
    print(f"Processing {len(df)} code samples...")
    
    print("Step 1: Cleaning code...")
    processed_df['cleaned_code'] = processed_df['code'].apply(clean_cpp_code)
    
    print("Step 2: Normalizing code...")
    processed_df['normalized_code'] = processed_df['cleaned_code'].apply(normalize_cpp_code)
    
    print("Step 3: Extracting enhanced security features...")
    security_features_list = []
    
    for idx, code in enumerate(processed_df['code']):
        if idx % 5000 == 0:
            print(f"  Processing sample {idx}/{len(processed_df)}")
        
        features = extract_security_features(code) 
        security_features_list.append(features)
    
    security_df = pd.DataFrame(security_features_list)
    
    result_df = pd.concat([
        processed_df[['normalized_code']],  
        security_df,  
    ], axis=1)
    
    if 'Label' in processed_df.columns:
        result_df['Label'] = processed_df['Label']
    
    print(f"Preprocessing complete!")
    print(f"Enhanced features created: {len(security_df.columns)} features")
    print(f"Final shape: {result_df.shape}")
    
    return result_df

In [None]:
# data augmentation
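def augment_features(X, y, noise_factor=0.1, augment_ratio=0.5, tfidf_end=4000):
    """Add Gaussian noise to the numerical (non-TF-IDF) features.

    tfidf_end is the number of leading TF-IDF columns to leave untouched;
    it must match the vectorizer's max_features (4000 in this notebook).
    """
    n_samples = int(len(X) * augment_ratio)
    indices = np.random.choice(len(X), n_samples, replace=False)
    
    X_aug = X[indices].copy()
    y_aug = y[indices].copy()
    
    # Perturb only the scaled security features that follow the TF-IDF block
    X_aug[:, tfidf_end:] += np.random.normal(0, noise_factor, X_aug[:, tfidf_end:].shape)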
    
    X_combined = np.vstack([X, X_aug])
    y_combined = np.hstack([y, y_aug])
    
    shuffle_idx = np.random.permutation(len(X_combined))
    return X_combined[shuffle_idx], y_combined[shuffle_idx]


In [None]:
def create_pipeline_features(train_df, test_df):
    """Enhanced pipeline using your existing preprocessing with advanced features"""
    
    print("Step 1: Applying enhanced preprocessing...")

    train_processed = preprocess_cpp_dataset(train_df.copy())
    test_processed = preprocess_cpp_dataset(test_df.copy())
    
    print("Step 2: Creating TF-IDF features...")
    vectorizer = TfidfVectorizer(
        max_features=4000,  
        ngram_range=(1, 3),
        min_df=2,
        max_df=0.95,
        sublinear_tf=True,
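        # Caveat: the English stop-word list also drops the C keywords
        # 'if', 'for', 'while' and 'else' from the vocabulary
        stop_words='english'  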
    )
    
    tfidf_train = vectorizer.fit_transform(train_processed['normalized_code'])
    tfidf_test = vectorizer.transform(test_processed['normalized_code'])
    
    print("Step 3: Using enhanced security features...")

    feature_cols = [col for col in train_processed.columns 
                   if col not in ['normalized_code', 'Label']]
    
    print(f"Total enhanced features: {len(feature_cols)}")
    print("Feature categories:")
    print(f"  - Dangerous function detection: {len([f for f in feature_cols if f.startswith('has_')])}")
    print(f"  - Vulnerability patterns: {len([f for f in feature_cols if 'risk' in f or 'vuln' in f])}")
    print(f"  - Code complexity: {len([f for f in feature_cols if any(x in f for x in ['complexity', 'nesting', 'depth'])])}")
    print(f"  - Security patterns: {len([f for f in feature_cols if any(x in f for x in ['hardcoded', 'privilege', 'system'])])}")
    
    scaler = StandardScaler()
    train_numerical = scaler.fit_transform(train_processed[feature_cols].fillna(0))
    test_numerical = scaler.transform(test_processed[feature_cols].fillna(0))
    
    X_train = np.hstack([
        tfidf_train.toarray(),
        train_numerical
    ])
    
    X_test = np.hstack([
        tfidf_test.toarray(),
        test_numerical
    ])
    
    y_train = train_processed['Label'].values
    
    print(f"Final shapes - Train: {X_train.shape}, Test: {X_test.shape}")
    print(f"TF-IDF features: {tfidf_train.shape[1]}")
    print(f"Enhanced security features: {len(feature_cols)}")
    print(f"Total features: {X_train.shape[1]}")
    
    return X_train, X_test, y_train

print("Creating enhanced features...")
X_train, X_test, y_train = create_pipeline_features(train_df, test_df)

unique, counts = np.unique(y_train, return_counts=True)
print(f"\nClass distribution:")
for label, count in zip(unique, counts):
    print(f"  Class {label}: {count} samples ({count/len(y_train)*100:.1f}%)")


Creating enhanced features...
Step 1: Applying enhanced preprocessing...
Processing 20000 code samples...
Step 1: Cleaning code...
Step 2: Normalizing code...
Step 3: Extracting enhanced security features...
  Processing sample 0/20000
  Processing sample 5000/20000
  Processing sample 10000/20000
  Processing sample 15000/20000
Preprocessing complete!
Enhanced features created: 54 features
Final shape: (20000, 56)
Processing 7000 code samples...
Step 1: Cleaning code...
Step 2: Normalizing code...
Step 3: Extracting enhanced security features...
  Processing sample 0/7000
  Processing sample 5000/7000
Preprocessing complete!
Enhanced features created: 54 features
Final shape: (7000, 55)
Step 2: Creating TF-IDF features...
Step 3: Using enhanced security features...
Total enhanced features: 54
Feature categories:
  - Dangerous function detection: 18
  - Vulnerability patterns: 7
  - Code complexity: 2
  - Security patterns: 3
Final shapes - Train: (20000, 4054), Test: (7000, 4054)
TF-I

## Prepare Data for Hybrid Model Training
- Create code sequences 
- Prepare hybrid data: sequences + security features
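
As a rough picture of what "code sequences" means here, the sketch below maps tokens to integer ids and pads or truncates each sample to a fixed length. It is a standard-library illustration under stated assumptions: `build_vocab` and `encode` are hypothetical helpers, not the Keras `Tokenizer` / `pad_sequences` API the notebook uses. It does follow the same conventions, though: id 0 is reserved for padding, id 1 for out-of-vocabulary tokens, and padding and truncation happen at the tail.

```python
from collections import Counter

def build_vocab(token_lists, max_words):
    """Map the most frequent tokens to ids; 0 = padding, 1 = out-of-vocabulary."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {'<PAD>': 0, '<OOV>': 1}
    for tok, _ in counts.most_common(max_words - 2):
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate the tail, then pad the tail with 0."""
    ids = [vocab.get(tok, vocab['<OOV>']) for tok in tokens][:max_len]
    return ids + [vocab['<PAD>']] * (max_len - len(ids))

# Made-up, already-tokenized samples
corpus = [
    ['int', 'x', '=', 'NUM', ';'],
    ['free', '(', 'p', ')', ';', 'free', '(', 'p', ')', ';'],
]
vocab = build_vocab(corpus, max_words=8)
seqs = [encode(toks, vocab, max_len=6) for toks in corpus]
print(seqs)
```

Every sample ends up with the same length, which is what lets the sequences be stacked into one array and fed to an embedding layer alongside the engineered security features.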

In [None]:
def get_codebert_token_embeddings(tokenizer, vocab_size):
    """Build an embedding matrix for the Keras tokenizer from CodeBERT's input embeddings (loaded via PyTorch)"""
    from transformers import AutoTokenizer, AutoModel
    import torch
    import numpy as np
    
    print("Loading CodeBERT embeddings with PyTorch...")
    
    codebert_tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    codebert_model = AutoModel.from_pretrained("microsoft/codebert-base")
    codebert_model.eval()
    
    codebert_embeddings = codebert_model.embeddings.word_embeddings.weight.detach().cpu().numpy()
    # Build the vocabulary dict once instead of on every lookup
    codebert_vocab = codebert_tokenizer.get_vocab()
    
    embedding_matrix = np.zeros((vocab_size, 768))
    mapped = 0
    
    for word_index, word in tokenizer.index_word.items():
        if word_index >= vocab_size:
            break
            
        # Try to find token in CodeBERT vocabulary
        if word in codebert_vocab:
            embedding_matrix[word_index] = codebert_embeddings[codebert_vocab[word]]
            mapped += 1
        else:
            # For out-of-vocabulary tokens, use random initialization
            embedding_matrix[word_index] = np.random.normal(0, 0.1, 768)
    
    print(f"Mapped {mapped} of {vocab_size} tokens to CodeBERT embeddings")
    return embedding_matrix

In [12]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_code_sequences(train_df, test_df, max_features=10000, max_len=512):
    """Convert code to sequences for LSTM processing"""
    
    print("Creating code sequences...")
    
    # Combine all code for vocabulary building
    all_code = pd.concat([train_df['code'], test_df['code']])
    
    # Advanced tokenization for code
    def preprocess_code_for_sequence(code):
        # Keep code structure but make it more uniform
        code = re.sub(r'/\*.*?\*/', ' ', code, flags=re.DOTALL)  # Remove comments
        code = re.sub(r'//.*', ' ', code)
        code = re.sub(r'"[^"]*"', ' STRING ', code)  # Replace strings
        code = re.sub(r"'[^']*'", ' CHAR ', code)   # Replace chars
        code = re.sub(r'\b\d+\b', ' NUM ', code)    # Replace numbers
        code = re.sub(r'\s+', ' ', code)            # Normalize whitespace
        return code.lower().strip()
    
    # Preprocess all code
    processed_code = all_code.apply(preprocess_code_for_sequence)
    
    # Create tokenizer
    tokenizer = Tokenizer(
        num_words=max_features,
        oov_token='<OOV>',
        filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    )
    
    tokenizer.fit_on_texts(processed_code)
    
    # Convert to sequences
    train_sequences = tokenizer.texts_to_sequences(
        train_df['code'].apply(preprocess_code_for_sequence)
    )
    test_sequences = tokenizer.texts_to_sequences(
        test_df['code'].apply(preprocess_code_for_sequence)
    )
    
    # Pad sequences
    X_train_seq = pad_sequences(train_sequences, maxlen=max_len, padding='post', truncating='post')
    X_test_seq = pad_sequences(test_sequences, maxlen=max_len, padding='post', truncating='post')
    
    print(f"Vocabulary size: {len(tokenizer.word_index)}")
    print(f"Sequence shape: {X_train_seq.shape}")
    
    return X_train_seq, X_test_seq, tokenizer

# Create sequences
X_train_seq, X_test_seq, tokenizer = create_code_sequences(train_df, test_df)

Creating code sequences...
Vocabulary size: 39678
Sequence shape: (20000, 512)


In [14]:
def prepare_hybrid_data(train_df, test_df, max_features=8000, max_len=256):
    """Prepare both sequence data AND enhanced features"""
    
    print("=== PREPARING HYBRID DATA ===")
    
    # 1. Create sequences for BiLSTM (same as before)
    print("Step 1: Creating sequences for BiLSTM...")
    X_train_seq, X_test_seq, tokenizer = create_code_sequences(
        train_df, test_df, max_features, max_len
    )
    
    # 2. Create enhanced features (use your existing pipeline!)
    print("Step 2: Creating enhanced features...")
    X_train_features, X_test_features, y_train = create_pipeline_features(
        train_df, test_df
    )
    
    print(f"Sequence data shape: {X_train_seq.shape}")
    print(f"Enhanced features shape: {X_train_features.shape}")
    print(f"Total feature combination: sequences + {X_train_features.shape[1]} engineered features")
    
    return X_train_seq, X_test_seq, X_train_features, X_test_features, y_train, tokenizer

# Prepare the hybrid data
X_train_seq, X_test_seq, X_train_features, X_test_features, y_train, tokenizer = prepare_hybrid_data(
    train_df, test_df
)

=== PREPARING HYBRID DATA ===
Step 1: Creating sequences for BiLSTM...
Creating code sequences...
Vocabulary size: 39678
Sequence shape: (20000, 256)
Step 2: Creating enhanced features...
Step 1: Applying enhanced preprocessing...
Processing 20000 code samples...
Step 1: Cleaning code...
Step 2: Normalizing code...
Step 3: Extracting enhanced security features...
  Processing sample 0/20000
  Processing sample 5000/20000
  Processing sample 10000/20000
  Processing sample 15000/20000
Preprocessing complete!
Enhanced features created: 54 features
Final shape: (20000, 56)
Processing 7000 code samples...
Step 1: Cleaning code...
Step 2: Normalizing code...
Step 3: Extracting enhanced security features...
  Processing sample 0/7000
  Processing sample 5000/7000
Preprocessing complete!
Enhanced features created: 54 features
Final shape: (7000, 55)
Step 2: Creating TF-IDF features...
Step 3: Using enhanced security features...
Total enhanced features: 54
Feature categories:
  - Dangerous fun

## Model Training 

In [9]:
test_time = 1

In [None]:
def create_hybrid_bilstm_features_model(vocab_size, max_len, num_features,
                                       tokenizer=None, 
                                       embedding_dim=128, lstm_units=128):
    """
    Hybrid model: BiLSTM for sequences + Enhanced features + FCNN
    Best of both worlds!
    """
    if tokenizer is not None:
        codebert_embeddings = get_codebert_token_embeddings(tokenizer, vocab_size)
        use_codebert = True
    else:
        codebert_embeddings = None
        use_codebert = False
        embedding_dim = 128  

    sequence_input = layers.Input(shape=(max_len,), name='sequence_input')
    
    features_input = layers.Input(shape=(num_features,), name='enhanced_features')
    
    # ============ BiLSTM BRANCH (for sequential patterns) ============
    print("Building BiLSTM branch for sequential code patterns...")
    
    with tf.device('/GPU:0'):
        # Embedding
        if use_codebert:
            # NEW: CodeBERT embeddings
            embedding = layers.Embedding(
                input_dim=vocab_size,
                output_dim=768,  # CodeBERT dimension
                weights=[codebert_embeddings],  # Pre-trained weights
                trainable=False,  # Freeze CodeBERT (or True to fine-tune)
                input_length=max_len,
                mask_zero=True
            )(sequence_input)
        else:
            # OLD: Your original embedding (as fallback)
            embedding = layers.Embedding(
                input_dim=vocab_size,
                output_dim=embedding_dim,
                input_length=max_len,
                mask_zero=True
            )(sequence_input)
            
        embedding = layers.Dropout(0.2)(embedding)
        
        # BiLSTM layers
        lstm1 = layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True, dropout=0.3)
        )(embedding)
        
        lstm2 = layers.Bidirectional(
            layers.LSTM(lstm_units//2, return_sequences=True, dropout=0.3)
        )(lstm1)
        
        # Attention mechanism
        attention = layers.MultiHeadAttention(
            num_heads=4, key_dim=lstm_units//2, dropout=0.35
        )(lstm2, lstm2)
        attention = layers.Add()([lstm2, attention])
        
        # Global pooling to get fixed-size representation
        lstm_max_pool = layers.GlobalMaxPooling1D()(attention)
        lstm_avg_pool = layers.GlobalAveragePooling1D()(attention)
        lstm_output = layers.Concatenate()([lstm_max_pool, lstm_avg_pool])
        
        # BiLSTM feature processing
        lstm_features = layers.Dense(512, activation='relu')(lstm_output)
        lstm_features = layers.BatchNormalization()(lstm_features)
        lstm_features = layers.Dropout(0.4)(lstm_features)
        
        lstm_features = layers.Dense(256, activation='relu')(lstm_features)
        lstm_features = layers.BatchNormalization()(lstm_features)
        lstm_features = layers.Dropout(0.4)(lstm_features)
        
        # ============ ENHANCED FEATURES BRANCH ============
        # print("Building enhanced features branch...")
        
        # Your enhanced feature processing (vulnerability patterns, complexity, etc.)
        enhanced_features = layers.Dense(512, activation='relu')(features_input)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(0.5)(enhanced_features)
        
        enhanced_features = layers.Dense(256, activation='relu')(enhanced_features)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(0.4)(enhanced_features)
        
        enhanced_features = layers.Dense(128, activation='relu')(enhanced_features)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(0.4)(enhanced_features)
        
        # ============ FUSION LAYER ============
        # print("Combining BiLSTM and enhanced features...")
        
        # Combine both branches
        combined_features = layers.Concatenate(name='feature_fusion')([
            lstm_features,      # Sequential patterns from BiLSTM
            enhanced_features   # Engineered vulnerability features
        ])
        
        # ============ FINAL FCNN (for classification) ============
        # print("Building final classification network...")
        
        # Deep FCNN for final classification
        x = layers.Dense(1024, activation='relu')(combined_features)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.6)(x)
        
        x = layers.Dense(512, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.5)(x)
        
        x = layers.Dense(256, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.4)(x)
        
        x = layers.Dense(64, activation='relu')(x)
        x = layers.Dropout(0.4)(x)
        
        x = layers.Dense(32, activation='relu')(x)
        x = layers.Dropout(0.35)(x)
        
        # Output layer
        output = layers.Dense(1, activation='sigmoid', name='vulnerability_prediction')(x)
        
        # Create model
        model = keras.Model(
            inputs=[sequence_input, features_input],
            outputs=output,
            name='Hybrid_BiLSTM_Features_Model'
        )
    
    return model

In [None]:
def train_hybrid_model(X_train_seq, X_train_features, y_train, 
                      validation_split=0.2):
    """Train the hybrid BiLSTM + Enhanced Features model"""
    
    print("=== TRAINING HYBRID MODEL ===")
    
    # Create model
    vocab_size = min(8000, len(tokenizer.word_index) + 1)
    model = create_hybrid_bilstm_features_model(
        vocab_size=vocab_size,
        max_len=X_train_seq.shape[1],
        num_features=X_train_features.shape[1],
        embedding_dim=128,
        lstm_units=128,
        tokenizer=tokenizer
    )
    
    # Compile
    model.compile(
        optimizer=keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01),
        loss='binary_crossentropy',
        metrics=[
            keras.metrics.AUC(name='auc')
        ]
    )
    
    # Enhanced callbacks
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_auc', patience=10, restore_best_weights=True, mode='max'
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_auc', factor=0.5, patience=7, min_lr=1e-7, mode='max'
        ),
        # keras.callbacks.ModelCheckpoint(
        #     'best_hybrid_model.h5', monitor='val_auc', save_best_only=True, mode='max'
        # )
    ]
    
    print("Training hybrid model (BiLSTM + Enhanced Features)...")
    
    # Train with BOTH inputs
    history = model.fit(
        [X_train_seq, X_train_features],  # Both sequence and feature inputs
        y_train,
        validation_split=validation_split,
        epochs=100,
        batch_size=128,  # Good balance for both LSTM and dense layers
        callbacks=callbacks,
        verbose=1
    )
    
    return model, history

# Train the hybrid model
hybrid_model, history = train_hybrid_model(X_train_seq, X_train_features, y_train)

# Model summary
print("\n=== HYBRID MODEL ARCHITECTURE ===")
hybrid_model.summary()

# Evaluate
val_split_idx = int(len(y_train) * 0.8)
X_val_seq = X_train_seq[val_split_idx:]
X_val_features = X_train_features[val_split_idx:]
y_val = y_train[val_split_idx:]

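# On the training portion only (validation_split holds out the last 20%)
train_predictions = hybrid_model.predict([X_train_seq[:val_split_idx], X_train_features[:val_split_idx]])
train_auc = roc_auc_score(y_train[:val_split_idx], train_predictions)
print(f"\nHybrid Model Train AUC: {train_auc:.4f}")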

val_predictions = hybrid_model.predict([X_val_seq, X_val_features])
val_auc = roc_auc_score(y_val, val_predictions)
print(f"\nHybrid Model Validation AUC: {val_auc:.4f}")

In [None]:
# Make final predictions on test set

test_predictions = hybrid_model.predict([X_test_seq, X_test_features])

print(f"Test predictions shape: {test_predictions.shape}")
print(f"Prediction range: {test_predictions.min():.4f} - {test_predictions.max():.4f}")

# Convert probabilities to binary predictions (0 or 1)
test_predictions_binary = (test_predictions > 0.35).astype(int)

print(f"\nAfter converting to binary:")
print(f"Binary predictions shape: {test_predictions_binary.shape}")
print(f"Binary prediction values: {np.unique(test_predictions_binary, return_counts=True)}")
print(f"Sample predictions:")
print(f"  Probabilities: {test_predictions[:10].flatten()}")
print(f"  Binary:        {test_predictions_binary[:10].flatten()}")

In [None]:
# Create submission file
submission_df = sample_submission.copy()

# Option 1: Use binary predictions (0 or 1)
submission_df['Label'] = test_predictions_binary.flatten()

# Option 2: Use probability predictions (often better for competitions)
# submission_df['Label'] = test_predictions.flatten()

print("Submission file created!")
print(f"Submission shape: {submission_df.shape}")
print(f"Label distribution:")
print(submission_df['Label'].value_counts())
print(f"\nFirst 10 predictions:")
print(submission_df.head(10))

# Save submission
test_time = test_time + 1
submission_path = f'data/hybrid_submission_0{test_time}.csv'
submission_df.to_csv(submission_path, index=False)
print(f"\nSubmission saved as '{submission_path}'")


In [15]:
# Grid Search Version for Hybrid BiLSTM Model
import itertools
from datetime import datetime
import pickle

def create_hybrid_bilstm_gridsearch_model(vocab_size, max_len, num_features,
                                         tokenizer=None, 
                                         embedding_dim=128, 
                                         lstm_units=128,
                                         dropout_rate=0.3,
                                         dense_units=(512, 256)):
    """
    Parameterized version of the hybrid model for grid search (no CodeBERT).

    `tokenizer` is accepted for signature compatibility but is not used here.
    `dense_units` is a pair (first width, second width); a list works as well.
    """
    # Always use standard embeddings for grid search (faster and more stable)
    codebert_embeddings = None
    use_codebert = False
    # embedding_dim is taken from the argument (default 128) instead of being overwritten
    print("Using standard embeddings (CodeBERT disabled for grid search)")

    # Input layers
    sequence_input = layers.Input(shape=(max_len,), name='sequence_input')
    features_input = layers.Input(shape=(num_features,), name='enhanced_features')
    
    # ============ BiLSTM BRANCH ============
    with tf.device('/GPU:0'):
        # Embedding
        if use_codebert:
            embedding = layers.Embedding(
                input_dim=vocab_size,
                output_dim=768,
                weights=[codebert_embeddings],
                trainable=False,
                input_length=max_len,
                mask_zero=True
            )(sequence_input)
        else:
            embedding = layers.Embedding(
                input_dim=vocab_size,
                output_dim=embedding_dim,
                input_length=max_len,
                mask_zero=True
            )(sequence_input)
            
        embedding = layers.Dropout(dropout_rate * 0.7)(embedding)  # Slightly less dropout on embedding
        
        # BiLSTM layers with parameterized units
        lstm1 = layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True, dropout=dropout_rate)
        )(embedding)
        
        lstm2 = layers.Bidirectional(
            layers.LSTM(lstm_units//2, return_sequences=True, dropout=dropout_rate)
        )(lstm1)
        
        # Attention mechanism
        attention = layers.MultiHeadAttention(
            num_heads=4, key_dim=lstm_units//2, dropout=dropout_rate + 0.05
        )(lstm2, lstm2)
        attention = layers.Add()([lstm2, attention])
        
        # Global pooling
        lstm_max_pool = layers.GlobalMaxPooling1D()(attention)
        lstm_avg_pool = layers.GlobalAveragePooling1D()(attention)
        lstm_output = layers.Concatenate()([lstm_max_pool, lstm_avg_pool])
        
        # BiLSTM feature processing with parameterized dense units
        lstm_features = layers.Dense(dense_units[0], activation='relu')(lstm_output)
        lstm_features = layers.BatchNormalization()(lstm_features)
        lstm_features = layers.Dropout(dropout_rate + 0.1)(lstm_features)
        
        lstm_features = layers.Dense(dense_units[1], activation='relu')(lstm_features)
        lstm_features = layers.BatchNormalization()(lstm_features)
        lstm_features = layers.Dropout(dropout_rate + 0.1)(lstm_features)
        
        # ============ ENHANCED FEATURES BRANCH ============
        enhanced_features = layers.Dense(dense_units[0], activation='relu')(features_input)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(dropout_rate + 0.2)(enhanced_features)
        
        enhanced_features = layers.Dense(dense_units[1], activation='relu')(enhanced_features)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(dropout_rate + 0.1)(enhanced_features)
        
        enhanced_features = layers.Dense(dense_units[1]//2, activation='relu')(enhanced_features)
        enhanced_features = layers.BatchNormalization()(enhanced_features)
        enhanced_features = layers.Dropout(dropout_rate + 0.1)(enhanced_features)
        
        # ============ FUSION LAYER ============
        combined_features = layers.Concatenate(name='feature_fusion')([
            lstm_features,      
            enhanced_features   
        ])
        
        # ============ FINAL FCNN ============
        # Parameterized final dense layers
        x = layers.Dense(dense_units[0] * 2, activation='relu')(combined_features)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate + 0.3)(x)
        
        x = layers.Dense(dense_units[0], activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate + 0.2)(x)
        
        x = layers.Dense(dense_units[1], activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate + 0.1)(x)
        
        x = layers.Dense(dense_units[1]//4, activation='relu')(x)
        x = layers.Dropout(dropout_rate + 0.1)(x)
        
        x = layers.Dense(dense_units[1]//8, activation='relu')(x)
        x = layers.Dropout(dropout_rate + 0.05)(x)
        
        # Output layer
        output = layers.Dense(1, activation='sigmoid', name='vulnerability_prediction')(x)
        
        # Create model
        model = keras.Model(
            inputs=[sequence_input, features_input],
            outputs=output,
            name='Hybrid_BiLSTM_GridSearch_Model'
        )
    
    return model
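
To make the layer sizes implied by `lstm_units` and `dense_units` easier to follow, here is a small sketch that recomputes the widths by hand from the definitions above. `hybrid_layer_widths` is a hypothetical helper, not part of the notebook, and it only mirrors the arithmetic in the code (a bidirectional LSTM doubles its unit count; max and average pooling are concatenated).

```python
def hybrid_layer_widths(lstm_units, dense_units):
    """Recompute feature widths of the hybrid model from its hyperparameters."""
    bilstm1 = 2 * lstm_units              # Bidirectional(LSTM(lstm_units))
    bilstm2 = 2 * (lstm_units // 2)       # Bidirectional(LSTM(lstm_units // 2))
    pooled = 2 * bilstm2                  # max pool + average pool, concatenated
    lstm_branch = dense_units[1]          # last Dense of the BiLSTM branch
    feature_branch = dense_units[1] // 2  # last Dense of the features branch
    fused = lstm_branch + feature_branch  # input width of the final stack
    head = [dense_units[0] * 2, dense_units[0], dense_units[1],
            dense_units[1] // 4, dense_units[1] // 8, 1]
    return {"bilstm1": bilstm1, "bilstm2": bilstm2, "pooled": pooled,
            "fused": fused, "head": head}

# Widths for the configuration lstm_units=96, dense_units=[768, 384]
widths = hybrid_layer_widths(96, [768, 384])
for name, value in widths.items():
    print(f"{name}: {value}")
```

The residual `Add()` after the attention layer works because `MultiHeadAttention` projects its output back to the last dimension of the query, which here is the `bilstm2` width.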


In [None]:
def grid_search_hybrid_model(X_train_seq, X_train_features, y_train, 
                            X_val_seq, X_val_features, y_val,
                            tokenizer, top_k=10):
    """
    Comprehensive grid search for hybrid BiLSTM model
    """
    
    # Define hyperparameter grid
    param_grid = {
        'learning_rate': [0.001, 0.0005, 0.002, 0.0015],
        'dropout_rate': [0.2, 0.3, 0.4, 0.5],
        'lstm_units': [64, 96, 128, 160],
        'dense_units': [
            [256, 128],   # Smaller network
            [384, 192],   # Medium network  
            [512, 256],   # Original size
            [640, 320],   # Larger network
            [768, 384]    # Even larger
        ]
    }
    
    # Generate all combinations
    param_combinations = list(itertools.product(
        param_grid['learning_rate'],
        param_grid['dropout_rate'], 
        param_grid['lstm_units'],
        param_grid['dense_units']
    ))
    
    print(f"Total parameter combinations to test: {len(param_combinations)}")
    print("This will take a while - grab some coffee! ☕")
    
    # Store results
    results = []
    
    # Model parameters (fixed)
    vocab_size = min(8000, len(tokenizer.word_index) + 1)
    max_len = X_train_seq.shape[1]
    num_features = X_train_features.shape[1]
    
    start_time = datetime.now()
    
    for i, (lr, dropout, lstm_units, dense_units) in enumerate(param_combinations):
        print(f"\n{'='*60}")
        print(f"Testing combination {i+1}/{len(param_combinations)}")
        print(f"LR: {lr}, Dropout: {dropout}, LSTM Units: {lstm_units}, Dense: {dense_units}")
        print(f"{'='*60}")
        
        try:
            # Clear any previous models from memory
            tf.keras.backend.clear_session()
            
            print("Creating model without CodeBERT embeddings...")
            # Create model with current parameters (without CodeBERT)
            model = create_hybrid_bilstm_gridsearch_model(
                vocab_size=vocab_size,
                max_len=max_len,
                num_features=num_features,
                tokenizer=None,  # Disable CodeBERT embeddings
                lstm_units=lstm_units,
                dropout_rate=dropout,
                dense_units=dense_units
            )
            
            # Compile model
            model.compile(
                optimizer=keras.optimizers.AdamW(learning_rate=lr, weight_decay=0.01),
                loss='binary_crossentropy',
                metrics=[keras.metrics.AUC(name='auc')]
            )
            
            # Callbacks for this specific run
            callbacks = [
                keras.callbacks.EarlyStopping(
                    monitor='val_auc', patience=8, restore_best_weights=True, mode='max'
                ),
                keras.callbacks.ReduceLROnPlateau(
                    monitor='val_auc', factor=0.7, patience=5, min_lr=1e-7, mode='max'
                )
            ]
            
            # Train model (fewer epochs for grid search)
            history = model.fit(
                [X_train_seq, X_train_features],
                y_train,
                validation_data=([X_val_seq, X_val_features], y_val),
                epochs=30,  # Reduced for grid search
                batch_size=128,
                callbacks=callbacks,
                verbose=0  # Silent training
            )
            
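            # Best per-epoch metrics; the train and validation maxima may come from different epochs
            best_val_auc = max(history.history['val_auc'])
            best_train_auc = max(history.history['auc'])
            # Despite the name, this is the lowest validation loss seen in any epoch
            final_val_loss = min(history.history['val_loss'])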
            
            # Store results
            result = {
                'combination_id': i,
                'learning_rate': lr,
                'dropout_rate': dropout,
                'lstm_units': lstm_units,
                'dense_units': dense_units,
                'best_val_auc': best_val_auc,
                'best_train_auc': best_train_auc,
                'final_val_loss': final_val_loss,
                'overfitting_score': best_train_auc - best_val_auc,  # Lower is better
                'epochs_trained': len(history.history['val_auc']),
                'model': model,  # Save the actual model
                'history': history.history
            }
            
            results.append(result)
            
            print(f"✅ Validation AUC: {best_val_auc:.4f}")
            print(f"📈 Train AUC: {best_train_auc:.4f}")
            print(f"🎯 Overfitting: {result['overfitting_score']:.4f}")
            print(f"⏱️  Epochs: {result['epochs_trained']}")
            
            # Show current top performers
            if len(results) >= 3:
                current_top = sorted(results, key=lambda x: x['best_val_auc'], reverse=True)[:3]
                print(f"\n🏆 Current Top 3:")
                for j, r in enumerate(current_top):
                    print(f"  {j+1}. AUC: {r['best_val_auc']:.4f} "
                          f"(LR: {r['learning_rate']}, Drop: {r['dropout_rate']}, "
                          f"LSTM: {r['lstm_units']}, Dense: {r['dense_units']})")
            
        except Exception as e:
            print(f"❌ Error with combination {i+1}: {str(e)}")
            # Continue with next combination
            continue
            
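        # Memory cleanup: `del model` only drops the local name, and models stored in
        # `results` stay alive, so release those that have fallen outside the top_k
        for stale in sorted(results, key=lambda x: x['best_val_auc'], reverse=True)[top_k:]:
            stale['model'] = None
        del model
        tf.keras.backend.clear_session()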
    
    # Sort results by validation AUC
    results.sort(key=lambda x: x['best_val_auc'], reverse=True)
    
    # Keep only top K results
    top_results = results[:top_k]
    
    end_time = datetime.now()
    total_time = end_time - start_time
    
    print(f"\n{'='*80}")
    print(f"🎉 GRID SEARCH COMPLETED!")
    print(f"⏰ Total time: {total_time}")
    print(f"🔍 Combinations tested: {len(results)}")
    print(f"🏆 Top {len(top_results)} models saved")
    print(f"{'='*80}")
    
    return top_results
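
The grid above is exhaustive, so its cost grows with the product of the list lengths. As a rough sketch, the snippet below counts the combinations for the same grid and shows one cheaper alternative, a fixed-size random subsample. The subsample size of 40 and the seed are arbitrary assumptions, and `dense_units` is written as tuples here only for convenience.

```python
import itertools
import random

param_grid = {
    'learning_rate': [0.001, 0.0005, 0.002, 0.0015],
    'dropout_rate': [0.2, 0.3, 0.4, 0.5],
    'lstm_units': [64, 96, 128, 160],
    'dense_units': [(256, 128), (384, 192), (512, 256), (640, 320), (768, 384)],
}

# Full Cartesian product, as in the exhaustive search
all_combos = list(itertools.product(*param_grid.values()))
print(f"Exhaustive grid: {len(all_combos)} combinations")

# Random search: train only a reproducible subset of the grid
random.seed(42)
sampled = random.sample(all_combos, k=40)
for lr, dropout, lstm_units, dense_units in sampled[:3]:
    print(f"LR: {lr}, Dropout: {dropout}, LSTM: {lstm_units}, Dense: {dense_units}")
```

A subsample like this could be passed through the same training loop in place of `param_combinations`; whether it finds a configuration as good as the full grid is not guaranteed.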


In [None]:
def display_grid_search_results(top_results):
    """Display comprehensive results from grid search"""
    
    print(f"\n{'='*100}")
    print(f"🏆 TOP {len(top_results)} HYBRID BiLSTM MODELS")
    print(f"{'='*100}")
    
    # Create results DataFrame for easy viewing
    results_data = []
    for i, result in enumerate(top_results):
        results_data.append({
            'Rank': i + 1,
            'Val_AUC': result['best_val_auc'],
            'Train_AUC': result['best_train_auc'],
            'Overfitting': result['overfitting_score'],
            'Learning_Rate': result['learning_rate'],
            'Dropout': result['dropout_rate'],
            'LSTM_Units': result['lstm_units'],
            'Dense_Units': str(result['dense_units']),
            'Epochs': result['epochs_trained'],
            'Val_Loss': result['final_val_loss']
        })
    
    results_df = pd.DataFrame(results_data)
    print(results_df.to_string(index=False))
    
    # Analysis
    print(f"\n{'='*60}")
    print("📊 HYPERPARAMETER ANALYSIS")
    print(f"{'='*60}")
    
    # Best learning rates
    lr_analysis = {}
    for result in top_results:
        lr = result['learning_rate']
        if lr not in lr_analysis:
            lr_analysis[lr] = []
        lr_analysis[lr].append(result['best_val_auc'])
    
    print("🎯 Learning Rate Performance:")
    for lr in sorted(lr_analysis.keys()):
        aucs = lr_analysis[lr]
        print(f"  LR {lr}: Avg AUC = {np.mean(aucs):.4f} (appeared {len(aucs)} times in top {len(top_results)})")
    
    # Best dropout rates
    dropout_analysis = {}
    for result in top_results:
        dropout = result['dropout_rate']
        if dropout not in dropout_analysis:
            dropout_analysis[dropout] = []
        dropout_analysis[dropout].append(result['best_val_auc'])
    
    print("\n💧 Dropout Rate Performance:")
    for dropout in sorted(dropout_analysis.keys()):
        aucs = dropout_analysis[dropout]
        print(f"  Dropout {dropout}: Avg AUC = {np.mean(aucs):.4f} (appeared {len(aucs)} times in top {len(top_results)})")
    
    # Best LSTM units
    lstm_analysis = {}
    for result in top_results:
        lstm_units = result['lstm_units']
        if lstm_units not in lstm_analysis:
            lstm_analysis[lstm_units] = []
        lstm_analysis[lstm_units].append(result['best_val_auc'])
    
    print("\n🧠 LSTM Units Performance:")
    for units in sorted(lstm_analysis.keys()):
        aucs = lstm_analysis[units]
        print(f"  LSTM {units}: Avg AUC = {np.mean(aucs):.4f} (appeared {len(aucs)} times in top {len(top_results)})")
    
    # Best dense architectures
    dense_analysis = {}
    for result in top_results:
        dense_str = str(result['dense_units'])
        if dense_str not in dense_analysis:
            dense_analysis[dense_str] = []
        dense_analysis[dense_str].append(result['best_val_auc'])
    
    print("\n🏗️  Dense Architecture Performance:")
    for arch in dense_analysis:
        aucs = dense_analysis[arch]
        print(f"  Dense {arch}: Avg AUC = {np.mean(aucs):.4f} (appeared {len(aucs)} times in top {len(top_results)})")
    
    print(f"\n{'='*60}")
    print("🎖️  CHAMPION MODEL DETAILS")
    print(f"{'='*60}")
    
    champion = top_results[0]
    print(f"🥇 Best Model Configuration:")
    print(f"   Validation AUC: {champion['best_val_auc']:.6f}")
    print(f"   Train AUC: {champion['best_train_auc']:.6f}")
    print(f"   Overfitting Score: {champion['overfitting_score']:.6f}")
    print(f"   Learning Rate: {champion['learning_rate']}")
    print(f"   Dropout Rate: {champion['dropout_rate']}")
    print(f"   LSTM Units: {champion['lstm_units']}")
    print(f"   Dense Units: {champion['dense_units']}")
    print(f"   Epochs Trained: {champion['epochs_trained']}")
    print(f"   Final Val Loss: {champion['final_val_loss']:.6f}")
    
    return results_df

def save_top_models(top_results, save_dir="gridsearch_models"):
    """Save the top models and results"""
    import os
    
    # Create directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)
    
    # Save models and metadata
    saved_models = []
    
    for i, result in enumerate(top_results):
        model_name = f"rank_{i+1}_auc_{result['best_val_auc']:.4f}"
        model_path = os.path.join(save_dir, f"{model_name}.h5")
        
        # Save model
        result['model'].save(model_path)
        
        # Save metadata
        metadata = {k: v for k, v in result.items() if k != 'model'}  # Exclude model object
        metadata_path = os.path.join(save_dir, f"{model_name}_metadata.pkl")
        
        with open(metadata_path, 'wb') as f:
            pickle.dump(metadata, f)
        
        saved_models.append({
            'rank': i + 1,
            'model_path': model_path,
            'metadata_path': metadata_path,
            'val_auc': result['best_val_auc'],
            'config': {
                'learning_rate': result['learning_rate'],
                'dropout_rate': result['dropout_rate'],
                'lstm_units': result['lstm_units'],
                'dense_units': result['dense_units']
            }
        })
        
        print(f"💾 Saved model {i+1}/{len(top_results)}: {model_name}")
    
    # Save overall results summary
    summary_path = os.path.join(save_dir, "gridsearch_summary.pkl")
    with open(summary_path, 'wb') as f:
        pickle.dump(saved_models, f)
    
    print(f"\n✅ All {len(top_results)} models saved to '{save_dir}/' directory")
    print(f"📋 Summary saved to: {summary_path}")
    
    return saved_models
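
For later sessions, the summary written by `save_top_models` can be read back to find the best saved entry. The sketch below assumes the `gridsearch_summary.pkl` layout produced above (a list of dicts with `rank`, `model_path`, and `val_auc`). `load_best_entry` is a hypothetical helper, and the demo entries are placeholder values written to a temporary directory, not real results.

```python
import os
import pickle
import tempfile

def load_best_entry(save_dir):
    """Load the grid search summary and return the entry with the highest val_auc."""
    summary_path = os.path.join(save_dir, "gridsearch_summary.pkl")
    with open(summary_path, "rb") as f:
        saved_models = pickle.load(f)
    return max(saved_models, key=lambda entry: entry["val_auc"])

# Demo with placeholder entries (values are illustrative only)
with tempfile.TemporaryDirectory() as demo_dir:
    demo_summary = [
        {"rank": 2, "model_path": "rank_2_demo.h5", "val_auc": 0.60},
        {"rank": 1, "model_path": "rank_1_demo.h5", "val_auc": 0.65},
    ]
    with open(os.path.join(demo_dir, "gridsearch_summary.pkl"), "wb") as f:
        pickle.dump(demo_summary, f)
    best_entry = load_best_entry(demo_dir)

print(f"Best saved model: {best_entry['model_path']} (val AUC {best_entry['val_auc']:.4f})")
```

The model itself could then be restored with `keras.models.load_model(best_entry['model_path'])`. Pickle files should only be loaded from directories you trust.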


In [None]:
# RUN GRID SEARCH - This will take several hours!
print("🚀 STARTING COMPREHENSIVE GRID SEARCH")
print("⚠️  Warning: This will test 320 parameter combinations and may take 4-8 hours!")
print("💡 Tip: Consider running on a machine with good GPU and leaving it overnight")

# Prepare validation split (use the same split as before for consistency)
val_split_idx = int(len(y_train) * 0.8)

X_train_seq_gs = X_train_seq[:val_split_idx]
X_train_features_gs = X_train_features[:val_split_idx]
y_train_gs = y_train[:val_split_idx]

X_val_seq_gs = X_train_seq[val_split_idx:]
X_val_features_gs = X_train_features[val_split_idx:]
y_val_gs = y_train[val_split_idx:]

print(f"Grid search training set: {X_train_seq_gs.shape[0]} samples")
print(f"Grid search validation set: {X_val_seq_gs.shape[0]} samples")

# Run grid search
top_models = grid_search_hybrid_model(
    X_train_seq_gs, X_train_features_gs, y_train_gs,
    X_val_seq_gs, X_val_features_gs, y_val_gs,
    tokenizer, 
    top_k=10
)

# Display results
results_df = display_grid_search_results(top_models)

# Save all top models
saved_models_info = save_top_models(top_models)

print(f"\n{'='*80}")
print("🎯 GRID SEARCH SUMMARY")
print(f"{'='*80}")
print(f"🏆 Best validation AUC achieved: {top_models[0]['best_val_auc']:.6f}")
print(f"🎪 This is {(top_models[0]['best_val_auc'] - 0.63)*100:.2f} percentage points above the 0.63 AUC target")
print(f"💾 Top {len(top_models)} models saved for ensemble/production use")
print(f"📊 Results DataFrame available as 'results_df'")
print(f"🔧 Champion model available as 'top_models[0]['model']'")

# Quick test with the champion model
print(f"\n{'='*60}")
print("🧪 TESTING CHAMPION MODEL")
print(f"{'='*60}")

champion_model = top_models[0]['model']

# Test on validation set
val_predictions = champion_model.predict([X_val_seq_gs, X_val_features_gs])
val_auc = roc_auc_score(y_val_gs, val_predictions)
print(f"✅ Champion model validation AUC: {val_auc:.6f}")

# Score on the full training set (this includes the validation rows, so it is not a held-out estimate)
train_predictions = champion_model.predict([X_train_seq, X_train_features])
train_auc = roc_auc_score(y_train, train_predictions)
print(f"📈 Champion model full training AUC: {train_auc:.6f}")

# Generate predictions for test set
test_predictions_champion = champion_model.predict([X_test_seq, X_test_features])

print(f"\n🎉 GRID SEARCH COMPLETE!")
print(f"Champion model ready for submission generation!")
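
The split above takes the last 20% of rows as validation data, which keeps the class ratio only if the rows are already shuffled. If that is in doubt, a stratified hold-out is one alternative. The sketch below is an assumption-laden illustration: `stratified_holdout_indices` is a hypothetical helper and `y_demo` is synthetic. Note that such a split would no longer match the rows Keras selects with `validation_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_holdout_indices(y, val_fraction=0.2, seed=42):
    """Return sorted train/validation index arrays with matching class ratios."""
    indices = np.arange(len(y))
    train_idx, val_idx = train_test_split(
        indices, test_size=val_fraction, stratify=y, random_state=seed
    )
    return np.sort(train_idx), np.sort(val_idx)

# Synthetic labels: 70% secure (0), 30% insecure (1), deliberately unshuffled
y_demo = np.array([0] * 700 + [1] * 300)
train_idx, val_idx = stratified_holdout_indices(y_demo)

print(f"Train size: {len(train_idx)}, positive rate: {y_demo[train_idx].mean():.3f}")
print(f"Val size:   {len(val_idx)}, positive rate: {y_demo[val_idx].mean():.3f}")
```

The same index arrays can be used to slice the sequence input, the feature matrix, and the labels, for example `X_train_seq[train_idx]` and `X_train_features[train_idx]`.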


In [None]:
# Create submission from the champion model
def create_champion_submission(test_predictions, threshold=0.5):
    """Create submission file from champion model predictions"""
    
    submission_df = sample_submission.copy()
    
    # Option 1: Binary predictions with custom threshold
    binary_predictions = (test_predictions > threshold).astype(int)
    submission_df['Label'] = binary_predictions.flatten()
    
    print(f"Champion Model Submission Summary:")
    print(f"Threshold used: {threshold}")
    print(f"Prediction distribution:")
    print(submission_df['Label'].value_counts())
    print(f"Percentage of insecure predictions: {(submission_df['Label'].sum() / len(submission_df) * 100):.2f}%")
    
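    # Save submission; the threshold is part of the filename so that files
    # written within the same second do not overwrite each other
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    submission_filename = f"champion_submission_{timestamp}_thr{threshold:.2f}.csv"
    submission_df.to_csv(submission_filename, index=False)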
    
    print(f"✅ Champion submission saved as: {submission_filename}")
    
    return submission_df, submission_filename

# Create submission with different thresholds to optimize
print("🎯 Creating optimized submissions from champion model...")

thresholds_to_try = [0.3, 0.35, 0.4, 0.45, 0.5, 0.55]
submission_files = []  # Filenames written below, recorded for the final summary

for threshold in thresholds_to_try:
    submission_df, filename = create_champion_submission(test_predictions_champion, threshold)
    submission_files.append(filename)
    print(f"Created: {filename}")

print(f"\n🎊 GRID SEARCH PROJECT COMPLETE!")
print(f"You now have:")
print(f"  ✅ Top 10 tuned models saved")
print(f"  ✅ Comprehensive hyperparameter analysis")
print(f"  ✅ Champion model ready for deployment")
print(f"  ✅ Multiple submission files with different thresholds")
print(f"  ✅ Complete grid search results for future reference")

# Save a final summary file
grid_search_ran = 'top_models' in locals()
final_summary = {
    'grid_search_completed': datetime.now().isoformat(),
    # grid_search_hybrid_model returns only the top_k results, so report how many were kept
    'top_models_kept': len(top_models) if grid_search_ran else 'Grid search not run yet',
    'best_val_auc': top_models[0]['best_val_auc'] if grid_search_ran else 'Grid search not run yet',
    # Leave out the model object and the per-epoch history; keep the scalar settings and scores
    'champion_config': ({k: v for k, v in top_models[0].items() if k not in ('model', 'history')}
                        if grid_search_ran else 'Grid search not run yet'),
    'submissions_created': submission_files
}

with open('gridsearch_final_summary.txt', 'w') as f:
    f.write("HYBRID BiLSTM GRID SEARCH - FINAL SUMMARY\n")
    f.write("="*50 + "\n\n")
    for key, value in final_summary.items():
        f.write(f"{key}: {value}\n")

print(f"\n📋 Final summary saved to: gridsearch_final_summary.txt")


In [16]:
# Train Top 3 Models from Grid Search Results
print("🏆 TRAINING TOP 3 GRID SEARCH CONFIGURATIONS")
print("="*70)

# Top 3 configurations from grid search
top_configs = [
    {"name": "Champion", "lr": 0.001, "dropout": 0.3, "lstm_units": 96, "dense_units": [768, 384], "expected_auc": 0.6977},
    {"name": "Runner-up", "lr": 0.001, "dropout": 0.2, "lstm_units": 128, "dense_units": [640, 320], "expected_auc": 0.6969},
    {"name": "Third", "lr": 0.001, "dropout": 0.2, "lstm_units": 96, "dense_units": [384, 192], "expected_auc": 0.6965}
]

def train_final_model(config, X_train_seq, X_train_features, y_train, model_name):
    """Train a final model with specific configuration"""
    
    print(f"\n🔥 Training {config['name']} Model ({model_name})")
    print(f"   LR: {config['lr']}, Dropout: {config['dropout']}")
    print(f"   LSTM Units: {config['lstm_units']}, Dense: {config['dense_units']}")
    print("-" * 50)
    
    # Clear memory
    tf.keras.backend.clear_session()
    
    # Create model
    vocab_size = min(8000, len(tokenizer.word_index) + 1)
    model = create_hybrid_bilstm_gridsearch_model(
        vocab_size=vocab_size,
        max_len=X_train_seq.shape[1],
        num_features=X_train_features.shape[1],
        tokenizer=None,  # No CodeBERT for final training
        lstm_units=config['lstm_units'],
        dropout_rate=config['dropout'],
        dense_units=config['dense_units']
    )
    
    # Compile model
    model.compile(
        optimizer=keras.optimizers.AdamW(learning_rate=config['lr'], weight_decay=0.01),
        loss='binary_crossentropy',
        metrics=[keras.metrics.AUC(name='auc')]
    )
    
    # Enhanced callbacks for final training
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_auc', patience=15, restore_best_weights=True, mode='max'
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_auc', factor=0.7, patience=8, min_lr=1e-7, mode='max'
        ),
        keras.callbacks.ModelCheckpoint(
            f'best_{model_name}_model.h5', monitor='val_auc', save_best_only=True, mode='max'
        )
    ]
    
    # Train with full epochs
    history = model.fit(
        [X_train_seq, X_train_features],
        y_train,
        validation_split=0.2,
        epochs=50,  # More epochs for final training
        batch_size=64,  # Smaller batch size to avoid memory issues
        callbacks=callbacks,
        verbose=1
    )
    
    # Get best validation AUC
    best_val_auc = max(history.history['val_auc'])
    print(f"✅ {config['name']} Model - Best Validation AUC: {best_val_auc:.6f}")
    print(f"📊 Expected AUC: {config['expected_auc']:.4f}, Achieved: {best_val_auc:.4f}")
    
    return model, history, best_val_auc

# Train all top 3 models
trained_models = []
predictions_dict = {}

for i, config in enumerate(top_configs):
    model_name = f"top_{i+1}"
    
    try:
        model, history, val_auc = train_final_model(
            config, X_train_seq, X_train_features, y_train, model_name
        )
        
        # Store model info
        trained_models.append({
            'name': config['name'],
            'model': model,
            'history': history,
            'val_auc': val_auc,
            'config': config,
            'model_file': f'best_{model_name}_model.h5'
        })
        
        # Generate predictions
        print(f"🔮 Generating predictions for {config['name']} model...")
        test_preds = model.predict([X_test_seq, X_test_features])
        predictions_dict[config['name']] = test_preds.flatten()
        
        print(f"   Prediction range: {test_preds.min():.4f} - {test_preds.max():.4f}")
        
    except Exception as e:
        print(f"❌ Error training {config['name']}: {str(e)}")
        continue

print(f"\n🎉 FINAL TRAINING COMPLETE!")
print(f"✅ Successfully trained {len(trained_models)} models")
print("="*70)


🏆 TRAINING TOP 3 GRID SEARCH CONFIGURATIONS

🔥 Training Champion Model (top_1)
   LR: 0.001, Dropout: 0.3
   LSTM Units: 96, Dense: [768, 384]
--------------------------------------------------
Using standard embeddings (CodeBERT disabled for grid search)


I0000 00:00:1754414040.539428  122191 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5563 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9


Epoch 1/50


I0000 00:00:1754414050.975981  122417 cuda_dnn.cc:529] Loaded cuDNN version 90501


[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 106ms/step - auc: 0.5007 - loss: 0.8576



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 123ms/step - auc: 0.5050 - loss: 0.7843 - val_auc: 0.5313 - val_loss: 0.6894 - learning_rate: 0.0010
Epoch 2/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 111ms/step - auc: 0.5143 - loss: 0.7035 - val_auc: 0.4760 - val_loss: 0.6919 - learning_rate: 0.0010
Epoch 3/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 103ms/step - auc: 0.5062 - loss: 0.6955



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 115ms/step - auc: 0.5128 - loss: 0.6942 - val_auc: 0.5448 - val_loss: 0.6890 - learning_rate: 0.0010
Epoch 4/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step - auc: 0.5175 - loss: 0.6919



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 118ms/step - auc: 0.5297 - loss: 0.6902 - val_auc: 0.5515 - val_loss: 0.6896 - learning_rate: 0.0010
Epoch 5/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 106ms/step - auc: 0.5411 - loss: 0.6873



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 118ms/step - auc: 0.5493 - loss: 0.6850 - val_auc: 0.5888 - val_loss: 0.6751 - learning_rate: 0.0010
Epoch 6/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step - auc: 0.5813 - loss: 0.6746



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 118ms/step - auc: 0.5901 - loss: 0.6731 - val_auc: 0.6096 - val_loss: 0.6619 - learning_rate: 0.0010
Epoch 7/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 104ms/step - auc: 0.6391 - loss: 0.6552



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 116ms/step - auc: 0.6319 - loss: 0.6571 - val_auc: 0.6211 - val_loss: 0.6718 - learning_rate: 0.0010
Epoch 8/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 107ms/step - auc: 0.6684 - loss: 0.6399



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 119ms/step - auc: 0.6684 - loss: 0.6395 - val_auc: 0.6319 - val_loss: 0.6563 - learning_rate: 0.0010
Epoch 9/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 115ms/step - auc: 0.6962 - loss: 0.6234 - val_auc: 0.6277 - val_loss: 0.6578 - learning_rate: 0.0010
Epoch 10/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 112ms/step - auc: 0.7155 - loss: 0.6100 - val_auc: 0.6155 - val_loss: 0.7376 - learning_rate: 0.0010
Epoch 11/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 102ms/step - auc: 0.7527 - loss: 0.5801



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 114ms/step - auc: 0.7452 - loss: 0.5859 - val_auc: 0.6433 - val_loss: 0.7408 - learning_rate: 0.0010
Epoch 12/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step - auc: 0.7838 - loss: 0.5529



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 118ms/step - auc: 0.7754 - loss: 0.5610 - val_auc: 0.6569 - val_loss: 0.6977 - learning_rate: 0.0010
Epoch 13/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 107ms/step - auc: 0.8211 - loss: 0.5142



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 119ms/step - auc: 0.8099 - loss: 0.5266 - val_auc: 0.6675 - val_loss: 0.7473 - learning_rate: 0.0010
Epoch 14/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step - auc: 0.8360 - loss: 0.4919



[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 117ms/step - auc: 0.8272 - loss: 0.5048 - val_auc: 0.6716 - val_loss: 0.6599 - learning_rate: 0.0010
Epoch 15/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 113ms/step - auc: 0.8468 - loss: 0.4813 - val_auc: 0.6634 - val_loss: 0.7486 - learning_rate: 0.0010
Epoch 16/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 113ms/step - auc: 0.8670 - loss: 0.4510 - val_auc: 0.6580 - val_loss: 1.1013 - learning_rate: 0.0010
Epoch 17/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 116ms/step - auc: 0.8777 - loss: 0.4352 - val_auc: 0.6646 - val_loss: 0.7792 - learning_rate: 0.0010
Epoch 18/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 113ms/step - auc: 0.8958 - loss: 0.4052 - val_auc: 0.6616 - val_loss: 0.7631 - learning_rate: 0.0010
Epoch 19/50
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 111ms/step - auc: 0.9096 - loss: 


250/250 ━━━━━━━━━━━━━━━━━━━━ 39s 124ms/step - auc: 0.5214 - loss: 0.7312 - val_auc: 0.4640 - val_loss: 0.7054 - learning_rate: 0.0010
Epoch 2/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 118ms/step - auc: 0.5359 - loss: 0.6914 - val_auc: 0.5777 - val_loss: 0.6794 - learning_rate: 0.0010
Epoch 3/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 119ms/step - auc: 0.5631 - loss: 0.6782 - val_auc: 0.6033 - val_loss: 0.6698 - learning_rate: 0.0010
Epoch 4/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 119ms/step - auc: 0.6129 - loss: 0.6626 - val_auc: 0.6119 - val_loss: 0.6644 - learning_rate: 0.0010
Epoch 5/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 118ms/step - auc: 0.6514 - loss: 0.6466 - val_auc: 0.6068 - val_loss: 0.6613 - learning_rate: 0.0010
Epoch 6/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 117ms/step - auc: 0.6819 - loss: 0.6303 - val_auc: 0.6269 - val_loss: 0.6676 - learning_rate: 0.0010
Epoch 7/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 117ms/step - auc: 0.7098 - loss: 0.6110 - val_auc: 0.6258 - val_loss: 0.6675 - learning_rate: 0.0010
Epoch 8/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 31s 124ms/step - auc: 0.7305 - loss: 0.5961 - val_auc: 0.6340 - val_loss: 0.6598 - learning_rate: 0.0010
Epoch 9/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 117ms/step - auc: 0.7503 - loss: 0.5798 - val_auc: 0.6333 - val_loss: 0.6667 - learning_rate: 0.0010
Epoch 10/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 119ms/step - auc: 0.7710 - loss: 0.5603 - val_auc: 0.6413 - val_loss: 0.6894 - learning_rate: 0.0010
Epoch 11/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 115ms/step - auc: 0.7871 - loss: 0.5443 - val_auc: 0.6303 - val_loss: 0.6909 - learning_rate: 0.0010
Epoch 12/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 116ms/step - auc: 0.7975 - loss: 0.5337 - val_auc: 0.6348 - val_loss: 0.6888 - learning_rate: 0.0010
Epoch 13/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 114ms/step - auc: 0.8177 - loss: 0.5105 - val_auc: 0.6257 - val_loss: 0.6849 - learning_rate: 0.0010
Epoch 14/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 113ms/step - auc: 0.8337 - loss: 0.4892 - val_auc: 0.6329 - val_loss: 0.7406 - learning_rate: 0.0010
Epoch 15/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 115ms/step - auc: 0.8427 - loss: 


250/250 ━━━━━━━━━━━━━━━━━━━━ 38s 120ms/step - auc: 0.5105 - loss: 0.7362 - val_auc: 0.5362 - val_loss: 0.6899 - learning_rate: 0.0010
Epoch 2/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 115ms/step - auc: 0.5179 - loss: 0.6975 - val_auc: 0.5579 - val_loss: 0.6881 - learning_rate: 0.0010
Epoch 3/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 115ms/step - auc: 0.5383 - loss: 0.6860 - val_auc: 0.5809 - val_loss: 0.6742 - learning_rate: 0.0010
Epoch 4/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 113ms/step - auc: 0.5802 - loss: 0.6742 - val_auc: 0.6050 - val_loss: 0.6672 - learning_rate: 0.0010
Epoch 5/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 112ms/step - auc: 0.6262 - loss: 0.6574 - val_auc: 0.6441 - val_loss: 0.6613 - learning_rate: 0.0010
Epoch 6/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 115ms/step - auc: 0.6884 - loss: 0.6273 - val_auc: 0.6590 - val_loss: 0.6423 - learning_rate: 0.0010
Epoch 7/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 29s 117ms/step - auc: 0.7371 - loss: 0.5926 - val_auc: 0.6819 - val_loss: 0.6291 - learning_rate: 0.0010
Epoch 8/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 30s 119ms/step - auc: 0.7766 - loss: 0.5602 - val_auc: 0.6876 - val_loss: 0.6614 - learning_rate: 0.0010
Epoch 9/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 113ms/step - auc: 0.8005 - loss: 0.5357 - val_auc: 0.6802 - val_loss: 0.6493 - learning_rate: 0.0010
Epoch 10/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 112ms/step - auc: 0.8231 - loss: 0.5071 - val_auc: 0.6808 - val_loss: 0.6649 - learning_rate: 0.0010
Epoch 11/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 113ms/step - auc: 0.8421 - loss: 0.4820 - val_auc: 0.6857 - val_loss: 0.6654 - learning_rate: 0.0010
Epoch 12/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 112ms/step - auc: 0.8577 - loss: 0.4608 - val_auc: 0.6785 - val_loss: 0.8725 - learning_rate: 0.0010
Epoch 13/50
250/250 ━━━━━━━━━━━━━━━━━━━━ 28s 111ms/step - auc: 0.8731 - loss: 0

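The three training logs above share a pattern: training AUC keeps rising while `val_auc` peaks and then stalls, and the visible part of each log stops five epochs after its best `val_auc`. That would be consistent with early stopping on `val_auc` with a patience of 5, but the callback settings are not shown in this section, so treat that as an assumption. The sketch below re-reads the last run's `val_auc` values (epochs 1 to 12, copied from the log; epoch 13 is cut off) in plain Python. The helper names `summarize_val_metric` and `stop_epoch` are hypothetical and are not part of the notebook.

```python
# val_auc for epochs 1..12 of the last run, copied from the training log.
val_auc_last_run = [0.5362, 0.5579, 0.5809, 0.6050, 0.6441, 0.6590,
                    0.6819, 0.6876, 0.6802, 0.6808, 0.6857, 0.6785]


def summarize_val_metric(history):
    """Return (best_epoch, best_value, epochs_since_best) for a metric
    where higher is better. Epochs are 1-indexed."""
    best_index = max(range(len(history)), key=history.__getitem__)
    best_epoch = best_index + 1
    return best_epoch, history[best_index], len(history) - best_epoch


def stop_epoch(history, patience):
    """Return the 1-indexed epoch at which patience-based early stopping
    would halt, or None if the patience budget never runs out."""
    best_value = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(history, start=1):
        if value > best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None


best_epoch, best_value, waited = summarize_val_metric(val_auc_last_run)
print(f"best epoch: {best_epoch}, val_auc: {best_value:.4f}, "
      f"epochs since best: {waited}")
# With only 12 recorded epochs, a patience of 5 has not run out yet;
# one more non-improving epoch (epoch 13) would exhaust it.
print("stop epoch with patience=5:", stop_epoch(val_auc_last_run, patience=5))
```

For this run the best epoch is 8 with `val_auc` 0.6876, which matches the 0.687601 reported for the model named Third in the performance summary printed by the next cell.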
In [17]:
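# Create Ensemble Predictions and Final Submissions
# Imports needed by this cell (harmless if an earlier cell already imported them)
import pickle
from datetime import datetime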
print("🎯 CREATING ENSEMBLE PREDICTIONS & SUBMISSIONS")
print("="*60)

# Display model performance summary
print("\n📊 FINAL MODEL PERFORMANCE SUMMARY:")
print("-" * 60)
for model_info in trained_models:
    print(f"{model_info['name']:>12}: Val AUC = {model_info['val_auc']:.6f}")
    print(f"{'':>12}  Config: LR={model_info['config']['lr']}, "
          f"Drop={model_info['config']['dropout']}, "
          f"LSTM={model_info['config']['lstm_units']}, "
          f"Dense={model_info['config']['dense_units']}")
    print()

# Create ensemble predictions
if len(predictions_dict) >= 2:
    print("🤝 CREATING ENSEMBLE PREDICTIONS:")
    
    # Simple average ensemble
    ensemble_simple = np.mean(list(predictions_dict.values()), axis=0)
    
    # Weighted ensemble (weight each model by its validation AUC).
    # Note: weights[i] is matched to predictions by position, so this assumes
    # predictions_dict and trained_models list the models in the same order.
    weights = np.array([model_info['val_auc'] for model_info in trained_models])
    weights = weights / weights.sum()  # Normalize so the weights sum to 1
    
    ensemble_weighted = np.zeros_like(ensemble_simple)
    for i, preds in enumerate(predictions_dict.values()):
        ensemble_weighted += weights[i] * preds
    
    print(f"   Simple Average: {len(predictions_dict)} models")
    print(f"   Weighted Average: weights = {weights}")
    
    # Add ensemble predictions
    predictions_dict['Ensemble_Simple'] = ensemble_simple
    predictions_dict['Ensemble_Weighted'] = ensemble_weighted
    
    print(f"✅ Created {len(predictions_dict)} prediction sets")

# Create submission files for all models
print(f"\n📝 CREATING SUBMISSION FILES:")
print("-" * 40)

submission_files = []
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

for pred_name, predictions in predictions_dict.items():
    # Create submission DataFrame
    submission_df = sample_submission.copy()
    submission_df['Label'] = predictions
    
    # Save submission file
    filename = f"final_{pred_name.lower()}_{timestamp}.csv"
    submission_df.to_csv(filename, index=False)
    
    submission_files.append({
        'name': pred_name,
        'filename': filename,
        'predictions': predictions,
        'mean_pred': predictions.mean(),
        'std_pred': predictions.std(),
        'secure_ratio': (predictions < 0.5).mean(),
        'insecure_ratio': (predictions >= 0.5).mean()
    })
    
    print(f"✅ {pred_name:>15}: {filename}")
    print(f"{'':>15}  Mean: {predictions.mean():.4f}, "
          f"Secure: {(predictions < 0.5).mean()*100:.1f}%, "
          f"Insecure: {(predictions >= 0.5).mean()*100:.1f}%")

print(f"\n🎊 ALL SUBMISSIONS CREATED!")
print(f"📁 Total files: {len(submission_files)}")

# Save detailed results summary
final_results = {
    'timestamp': timestamp,
    'grid_search_top3': top_configs,
    'trained_models': [
        {
            'name': m['name'],
            'val_auc': m['val_auc'],
            'config': m['config'],
            'model_file': m['model_file']
        } for m in trained_models
    ],
    'submission_files': submission_files,
    'ensemble_methods': ['Simple Average', 'Weighted by AUC'] if len(predictions_dict) >= 2 else ['None']
}

# Save to pickle for future reference
results_filename = f"final_results_{timestamp}.pkl"
with open(results_filename, 'wb') as f:
    pickle.dump(final_results, f)

print(f"\n💾 Detailed results saved to: {results_filename}")

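# Final recommendations
print(f"\n{'='*60}")
print("🏆 FINAL RECOMMENDATIONS")
print(f"{'='*60}")
# Pick the best model by validation AUC rather than by position in the list
best_model = max(trained_models, key=lambda m: m['val_auc'])
print(f"🥇 Best Single Model: {best_model['name']} (AUC: {best_model['val_auc']:.6f})")
if 'Ensemble_Weighted' in predictions_dict:
    print("🤝 Ensembles: simple and AUC-weighted averages of the trained models")
    print("📈 Ensemble performance is unverified: no validation AUC was computed for the ensembles")
# Check the target from the competition objectives instead of asserting it
target_auc = 0.63
n_above = sum(m['val_auc'] > target_auc for m in trained_models)
print(f"📊 {n_above}/{len(trained_models)} models exceeded the target AUC of {target_auc}")
print("🎯 Ready for competition submission!")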

# Show file summary
print(f"\n📋 FILES CREATED:")
for model_info in trained_models:
    print(f"   Model: {model_info['model_file']}")
for sub_file in submission_files:
    print(f"   Submission: {sub_file['filename']}")
print(f"   Results: {results_filename}")

print(f"\n🎉 COMPLETE! Best AUC achieved: {max(m['val_auc'] for m in trained_models):.6f}")

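The weighted ensemble in the cell above normalizes the raw validation AUCs. Because the three AUCs in the cell's performance summary all sit between 0.64 and 0.69, the resulting weights are close to 1/3 each, so the weighted average is nearly the same as the simple average. The sketch below reproduces those weights from the reported AUCs and compares them with one possible alternative, weighting by the margin over chance (AUC minus 0.5). The alternative scheme, the helper `weighted_average`, and the toy predictions are illustrative assumptions and are not part of the notebook; pairing weights with predictions by model name rather than by position is a design choice of this sketch.

```python
import numpy as np

# Validation AUCs copied from the performance summary output.
val_auc_by_name = {"Champion": 0.671613, "Runner-up": 0.641331, "Third": 0.687601}
names = list(val_auc_by_name)
val_auc = np.array([val_auc_by_name[n] for n in names])

# Scheme used in the notebook cell: normalize the raw AUCs.
raw_weights = val_auc / val_auc.sum()

# Hypothetical alternative: weight by the margin over chance (AUC = 0.5),
# which spreads the weights further apart when all AUCs are similar.
margin = np.clip(val_auc - 0.5, 0.0, None)
margin_weights = margin / margin.sum()


def weighted_average(preds_by_name, weight_by_name):
    """Average predictions with weights looked up by model name,
    so the result does not depend on dictionary ordering."""
    model_names = list(preds_by_name)
    stacked = np.stack([np.asarray(preds_by_name[n], dtype=float)
                        for n in model_names])
    weights = np.array([weight_by_name[n] for n in model_names])
    # np.average divides by the sum of the weights.
    return np.average(stacked, axis=0, weights=weights)


# Toy predictions for two samples (made-up values, for illustration only).
toy_preds = {
    "Champion": [0.20, 0.80],
    "Runner-up": [0.40, 0.60],
    "Third": [0.10, 0.90],
}

print("raw AUC weights:   ", np.round(raw_weights, 4))
print("margin weights:    ", np.round(margin_weights, 4))
print("margin-weighted avg:",
      weighted_average(toy_preds, dict(zip(names, margin_weights))))
```

Neither weighting scheme is validated here. Comparing ensembles properly would require scoring each one on held-out labels, for example with `roc_auc_score`.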

🎯 CREATING ENSEMBLE PREDICTIONS & SUBMISSIONS

📊 FINAL MODEL PERFORMANCE SUMMARY:
------------------------------------------------------------
    Champion: Val AUC = 0.671613
              Config: LR=0.001, Drop=0.3, LSTM=96, Dense=[768, 384]

   Runner-up: Val AUC = 0.641331
              Config: LR=0.001, Drop=0.2, LSTM=128, Dense=[640, 320]

       Third: Val AUC = 0.687601
              Config: LR=0.001, Drop=0.2, LSTM=96, Dense=[384, 192]

🤝 CREATING ENSEMBLE PREDICTIONS:
   Simple Average: 3 models
   Weighted Average: weights = [0.3357149  0.32057823 0.34370687]
✅ Created 5 prediction sets

📝 CREATING SUBMISSION FILES:
----------------------------------------
✅        Champion: final_champion_20250806_005216.csv
                 Mean: 0.4049, Secure: 76.4%, Insecure: 23.6%
✅       Runner-up: final_runner-up_20250806_005216.csv
                 Mean: 0.3856, Secure: 79.8%, Insecure: 20.2%
✅           Third: final_third_20250806_005216.csv
                 Mean: 0.4142, Secure: 6