# Factuality Detection in AI-Generated Educational Content

## Data4Good Competition - 4th Annual

This notebook covers:
1. Exploratory Data Analysis (EDA)
2. Feature Engineering
3. Machine Learning Model Development
4. Model Evaluation and Comparison
5. Test Set Predictions
6. Methodology Discussion


In [None]:
# Import necessary libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# For reproducibility
import random
random.seed(42)
np.random.seed(42)

print("Libraries imported successfully!")


## 1. Data Loading and Initial Exploration


In [None]:
# Load training data
with open('data/train.json', 'r', encoding='utf-8') as f:
    train_data = json.load(f)

# Load test data
with open('data/test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)

# Convert to DataFrames
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print("\nTraining data columns:", train_df.columns.tolist())
print("Test data columns:", test_df.columns.tolist())
print("\nFirst few rows of training data:")
train_df.head()


In [None]:
# Basic information about the dataset
print("Training Data Info:")
print(train_df.info())
print("\n" + "="*50)
print("\nTest Data Info:")
print(test_df.info())


## 2. Exploratory Data Analysis (EDA)


In [None]:
# Check for missing values
print("Missing values in training data:")
print(train_df.isnull().sum())
print("\nMissing values in test data:")
print(test_df.isnull().sum())


In [None]:
# Target distribution
print("Target variable distribution:")
print(train_df['type'].value_counts())
print("\nTarget variable percentages:")
print(train_df['type'].value_counts(normalize=True) * 100)

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
train_df['type'].value_counts().plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c', '#f39c12'])
axes[0].set_title('Distribution of Answer Types', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Type', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].tick_params(axis='x', rotation=0)

# Pie chart
train_df['type'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
                                      colors=['#2ecc71', '#e74c3c', '#f39c12'])
axes[1].set_title('Percentage Distribution', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()


In [None]:
# Text length analysis
train_df['question_length'] = train_df['question'].str.len()
train_df['context_length'] = train_df['context'].str.len()
train_df['answer_length'] = train_df['answer'].str.len()

# Word count analysis
train_df['question_words'] = train_df['question'].str.split().str.len()
train_df['context_words'] = train_df['context'].str.split().str.len()
train_df['answer_words'] = train_df['answer'].str.split().str.len()

# Display statistics
print("Text Length Statistics:")
print(train_df[['question_length', 'context_length', 'answer_length', 
                'question_words', 'context_words', 'answer_words']].describe())


In [None]:
# Visualize text length distributions by type
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Character lengths (question, context, answer) by type
for i, col in enumerate(['question_length', 'context_length', 'answer_length']):
    for type_val in train_df['type'].unique():
        data = train_df[train_df['type'] == type_val][col]
        axes[0, i].hist(data, alpha=0.6, label=type_val, bins=50)
    axes[0, i].set_title(f'{col.replace("_", " ").title()} Distribution', fontweight='bold')
    axes[0, i].set_xlabel('Length (characters)')
    axes[0, i].set_ylabel('Frequency')
    axes[0, i].legend()
    axes[0, i].grid(True, alpha=0.3)

# Word count by type
for i, col in enumerate(['question_words', 'context_words', 'answer_words']):
    for type_val in train_df['type'].unique():
        data = train_df[train_df['type'] == type_val][col]
        axes[1, i].hist(data, alpha=0.6, label=type_val, bins=50)
    axes[1, i].set_title(f'{col.replace("_", " ").title()} Distribution', fontweight='bold')
    axes[1, i].set_xlabel('Word Count')
    axes[1, i].set_ylabel('Frequency')
    axes[1, i].legend()
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Box plots for text lengths by type
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, col in enumerate(['question_length', 'context_length', 'answer_length']):
    train_df.boxplot(column=col, by='type', ax=axes[i])
    axes[i].set_title(f'{col.replace("_", " ").title()} by Type', fontweight='bold')
    axes[i].set_xlabel('Type')
    axes[i].set_ylabel('Length (characters)')
    axes[i].grid(True, alpha=0.3)

plt.suptitle('')
plt.tight_layout()
plt.show()


In [None]:
# Sample examples from each class
print("="*80)
print("SAMPLE EXAMPLES FROM EACH CLASS")
print("="*80)

for type_val in train_df['type'].unique():
    print(f"\n{'='*80}")
    print(f"TYPE: {type_val.upper()}")
    print(f"{'='*80}")
    sample = train_df[train_df['type'] == type_val].iloc[0]
    print(f"\nQuestion: {sample['question']}")
    print(f"\nContext: {sample['context'][:200]}..." if len(sample['context']) > 200 else f"\nContext: {sample['context']}")
    print(f"\nAnswer: {sample['answer']}")
    print(f"\nType: {sample['type']}")


## 3. Feature Engineering


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))


In [None]:
def extract_features(df):
    """
    Extract various features from question, context, and answer
    """
    features = df.copy()
    
    # Basic text features
    features['question_length'] = features['question'].str.len()
    features['context_length'] = features['context'].str.len()
    features['answer_length'] = features['answer'].str.len()
    
    features['question_words'] = features['question'].str.split().str.len()
    features['context_words'] = features['context'].str.split().str.len()
    features['answer_words'] = features['answer'].str.split().str.len()
    
    # Ratio features
    features['answer_to_question_ratio'] = features['answer_length'] / (features['question_length'] + 1)
    features['answer_to_context_ratio'] = features['answer_length'] / (features['context_length'] + 1)
    features['question_to_context_ratio'] = features['question_length'] / (features['context_length'] + 1)
    
    # Word overlap features
    def word_overlap(text1, text2):
        words1 = set(text1.lower().split())
        words2 = set(text2.lower().split())
        if len(words1) == 0 or len(words2) == 0:
            return 0
        return len(words1.intersection(words2)) / len(words1.union(words2))
    
    features['question_answer_overlap'] = features.apply(
        lambda x: word_overlap(x['question'], x['answer']), axis=1
    )
    features['context_answer_overlap'] = features.apply(
        lambda x: word_overlap(x['context'], x['answer']), axis=1
    )
    features['question_context_overlap'] = features.apply(
        lambda x: word_overlap(x['question'], x['context']), axis=1
    )
    
    # Question word features (whole-word match, so 'who' does not fire on 'whose' or 'how' on 'show')
    question_words = ['what', 'who', 'when', 'where', 'why', 'how', 'which', 'whom', 'whose']
    for qw in question_words:
        features[f'has_{qw}'] = features['question'].str.lower().str.contains(rf'\b{qw}\b', regex=True).astype(int)
    
    # Answer starts with question word
    features['answer_starts_with_question_word'] = features.apply(
        lambda x: any(x['answer'].lower().startswith(qw) for qw in question_words), axis=1
    ).astype(int)
    
    # Number of sentences
    features['question_sentences'] = features['question'].str.count(r'[.!?]+')
    features['context_sentences'] = features['context'].str.count(r'[.!?]+')
    features['answer_sentences'] = features['answer'].str.count(r'[.!?]+')
    
    # Capitalization features
    features['answer_caps_ratio'] = features['answer'].str.findall(r'[A-Z]').str.len() / (features['answer_length'] + 1)
    features['question_caps_ratio'] = features['question'].str.findall(r'[A-Z]').str.len() / (features['question_length'] + 1)
    
    # Special characters
    features['answer_special_chars'] = features['answer'].str.findall(r'[^a-zA-Z0-9\s]').str.len()
    features['question_special_chars'] = features['question'].str.findall(r'[^a-zA-Z0-9\s]').str.len()
    
    # Numeric features
    features['answer_has_numbers'] = features['answer'].str.contains(r'\d', regex=True).astype(int)
    features['question_has_numbers'] = features['question'].str.contains(r'\d', regex=True).astype(int)
    features['context_has_numbers'] = features['context'].str.contains(r'\d', regex=True).astype(int)
    
    return features

# Apply feature engineering
print("Extracting features from training data...")
train_features = extract_features(train_df)
print("Extracting features from test data...")
test_features = extract_features(test_df)
print("Feature engineering complete!")
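
The three overlap features computed above are Jaccard similarities over lowercased, whitespace-split word sets. A minimal standalone sketch of the same calculation follows; the strings are made up and the name `jaccard_overlap` is illustrative, not part of the notebook.

```python
def jaccard_overlap(text1: str, text2: str) -> float:
    """Jaccard similarity of the lowercased, whitespace-split word sets."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


# Hypothetical example strings, for illustration only
context = "The Nile flows north through Egypt"
answer = "the nile flows north"

# 4 shared words out of 6 distinct words overall
overlap = jaccard_overlap(context, answer)
print(round(overlap, 3))
```

Because punctuation is not stripped, "Egypt." and "Egypt" count as different words, so tokenizing first would change the scores.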


In [None]:
# Approximate semantic similarity with TF-IDF cosine similarity
# (this measures lexical overlap of terms and bigrams, not meaning)
print("Calculating semantic similarities...")

# Combine question and context for better comparison
train_features['question_context_combined'] = train_features['question'] + ' ' + train_features['context']
test_features['question_context_combined'] = test_features['question'] + ' ' + test_features['context']

# TF-IDF vectorization for semantic similarity
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')

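# Fit on training data
# Note: this runs before the train/validation split, so validation rows
# also contribute to the vocabulary and IDF weights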
train_qc_vectors = vectorizer.fit_transform(train_features['question_context_combined'])
train_answer_vectors = vectorizer.transform(train_features['answer'])

# Transform test data
test_qc_vectors = vectorizer.transform(test_features['question_context_combined'])
test_answer_vectors = vectorizer.transform(test_features['answer'])

# Calculate cosine similarity
train_features['semantic_similarity'] = [
    cosine_similarity(train_qc_vectors[i:i+1], train_answer_vectors[i:i+1])[0][0]
    for i in range(len(train_features))
]

test_features['semantic_similarity'] = [
    cosine_similarity(test_qc_vectors[i:i+1], test_answer_vectors[i:i+1])[0][0]
    for i in range(len(test_features))
]

print("Semantic similarity calculation complete!")
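
The loop above computes one cosine similarity per row. Since `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the cosine similarity of two TF-IDF rows equals their dot product, so a row-wise multiply-and-sum could serve as a faster alternative. The identity can be checked in plain Python on made-up vectors; all names below are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


# Hypothetical raw term weights for a question+context row and an answer row
raw_qc = [0.0, 2.0, 1.0, 3.0]
raw_answer = [1.0, 0.0, 1.0, 3.0]

cos_direct = cosine(raw_qc, raw_answer)
cos_via_dot = dot(l2_normalize(raw_qc), l2_normalize(raw_answer))
print(cos_direct, cos_via_dot)
```

With SciPy sparse matrices the same idea would look like `A.multiply(B).sum(axis=1)`, assuming both matrices come from the same default-normalized vectorizer.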


## 4. Machine Learning Model Development

We'll try multiple approaches:
1. Traditional ML models (Random Forest, XGBoost, etc.)
2. Transformer-based models (BERT, RoBERTa, etc.)
3. Ensemble methods


In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Prepare features for traditional ML
feature_columns = [col for col in train_features.columns 
                   if col not in ['question', 'context', 'answer', 'type', 'question_context_combined', 'ID']]

X = train_features[feature_columns].fillna(0)
y = train_features['type']

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# BEST PRACTICE: Train/Validation Split
# Strategy:
# 1. Split into Train (80%) and Validation (20%)
# 2. Use Train for cross-validation and training
# 3. Use Validation for model selection and hyperparameter tuning
# 4. Final model uses Train + Validation
# 5. Test set (separate file) is for final predictions

print(f"\n{'='*70}")
print("DATA SPLIT: Train / Validation / Test")
print(f"{'='*70}")

# Split into train (80%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print(f"Training set: {X_train.shape[0]} examples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]} examples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set (separate file): {len(test_features)} examples")
print(f"Feature count: {len(feature_columns)}")

print(f"\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, count in zip(label_encoder.inverse_transform(unique), counts):
    print(f"  {cls}: {count} ({count/len(y_train)*100:.1f}%)")

print(f"\nClass distribution in validation set:")
unique, counts = np.unique(y_val, return_counts=True)
for cls, count in zip(label_encoder.inverse_transform(unique), counts):
    print(f"  {cls}: {count} ({count/len(y_val)*100:.1f}%)")

print(f"\n{'='*70}")
print("WORKFLOW:")
print("  1. Cross-validation on Training set (for robust evaluation)")
print("  2. Train models on Training set")
print("  3. Evaluate on Validation set (for model selection)")
print("  4. Final model: Train on Train + Validation")
print("  5. Predict on Test set (separate file)")
print(f"{'='*70}")
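
Passing `stratify=y_encoded` keeps the class proportions roughly equal in both splits. The sketch below shows the idea in plain Python. It is a simplification, not scikit-learn's implementation, and the labels and the helper name `stratified_split` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx), splitting each class at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced label list (class names are placeholders)
labels = ["class_a"] * 50 + ["class_b"] * 30 + ["class_c"] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

Each class contributes 20% of its rows to validation, so the 50/30/20 ratio is preserved in both parts.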


In [None]:
# Scale features - IMPORTANT: Fit only on training data to avoid data leakage
# This is a critical best practice!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on training only
X_val_scaled = scaler.transform(X_val)          # Transform validation (using training scaler)

# Prepare test features (from separate test.json file)
X_test = test_features[feature_columns].fillna(0)
X_test_scaled = scaler.transform(X_test)  # Transform test (using training scaler)

print(f"✅ Features scaled (scaler fitted on training data only - prevents data leakage)")
print(f"   Training set: {X_train_scaled.shape}")
print(f"   Validation set: {X_val_scaled.shape}")
print(f"   Test set: {X_test_scaled.shape}")
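
`StandardScaler` stores the per-feature mean and population standard deviation seen during `fit` and reuses them in `transform`. A one-feature sketch in plain Python, with made-up numbers, shows the consequence: validation data scaled with training statistics does not generally end up centered at 0.

```python
import statistics

# Hypothetical values of a single feature
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
val_values = [20.0, 22.0]

# "fit": statistics come from the training values only
mean = statistics.fmean(train_values)
std = statistics.pstdev(train_values)  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to both sets
train_scaled = [(v - mean) / std for v in train_values]
val_scaled = [(v - mean) / std for v in val_values]

print(train_scaled)
print(val_scaled)
```

That asymmetry is expected. Fitting the scaler on validation or test rows would hide it, at the cost of leaking information from those rows.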


### 4.1 Random Forest Classifier


In [None]:
# Random Forest
print("Training Random Forest Classifier...")
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'
)

rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_val_scaled)
rf_pred_proba = rf_model.predict_proba(X_val_scaled)

print("\nRandom Forest Results:")
print(f"Accuracy: {accuracy_score(y_val, rf_pred):.4f}")
print(f"F1 Score (macro): {f1_score(y_val, rf_pred, average='macro'):.4f}")
print(f"F1 Score (weighted): {f1_score(y_val, rf_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_val, rf_pred, target_names=label_encoder.classes_))
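
`class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * class_count)` as documented by scikit-learn. A small sketch with made-up label counts (the helper name is illustrative):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical encoded labels: 60 / 30 / 10 examples per class
labels = [0] * 60 + [1] * 30 + [2] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rarest class receives the largest weight, and every class ends up with the same total weight.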


In [None]:
# Feature importance for Random Forest
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 20 Most Important Features:")
print(feature_importance.head(20))

# Visualize feature importance
plt.figure(figsize=(10, 8))
top_features = feature_importance.head(20)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances (Random Forest)', fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


### 4.2 XGBoost Classifier


In [None]:
# XGBoost
print("Training XGBoost Classifier...")
xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='mlogloss',
    use_label_encoder=False
)

xgb_model.fit(X_train_scaled, y_train)
xgb_pred = xgb_model.predict(X_val_scaled)
xgb_pred_proba = xgb_model.predict_proba(X_val_scaled)

print("\nXGBoost Results:")
print(f"Accuracy: {accuracy_score(y_val, xgb_pred):.4f}")
print(f"F1 Score (macro): {f1_score(y_val, xgb_pred, average='macro'):.4f}")
print(f"F1 Score (weighted): {f1_score(y_val, xgb_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_val, xgb_pred, target_names=label_encoder.classes_))


### 4.3 Logistic Regression


In [None]:
# Logistic Regression
print("Training Logistic Regression...")
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced',
    C=1.0
)

lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_val_scaled)
lr_pred_proba = lr_model.predict_proba(X_val_scaled)

print("\nLogistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_val, lr_pred):.4f}")
print(f"F1 Score (macro): {f1_score(y_val, lr_pred, average='macro'):.4f}")
print(f"F1 Score (weighted): {f1_score(y_val, lr_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_val, lr_pred, target_names=label_encoder.classes_))


### 4.4 Transformer-Based Models (BERT/RoBERTa)

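Transformer models often outperform hand-engineered features on text classification, so we also set up a fine-tuning pipeline for one. The cells below prepare the data, model, and trainer; the training call itself is left commented out because it is slow.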


In [None]:
# Install transformers if not already installed
# !pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import pipeline
import torch
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, f1_score
import os

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# We'll use a smaller, efficient model for faster training
# Options: 'distilbert-base-uncased', 'roberta-base', 'bert-base-uncased'
model_name = 'distilbert-base-uncased'
print(f"\nUsing model: {model_name}")


In [None]:
# Prepare data for transformer
# Combine question, context, and answer for better understanding
def prepare_transformer_data(df, is_test=False):
    texts = []
    for idx, row in df.iterrows():
        # Format: [CLS] Question [SEP] Context [SEP] Answer [SEP]
        text = f"{row['question']} [SEP] {row['context']} [SEP] {row['answer']}"
        texts.append(text)
    return texts

# Get the indices from the train/val split
train_indices = X_train.index
val_indices = X_val.index

# Prepare texts for transformer (using original train_features)
all_texts = prepare_transformer_data(train_features)
train_texts = [all_texts[i] for i in train_indices]
val_texts = [all_texts[i] for i in val_indices]

# Get labels for validation set
val_labels = y_val.tolist()
train_labels = y_train.tolist()

print(f"Training texts: {len(train_texts)}")
print(f"Validation texts: {len(val_texts)}")
print(f"\nSample text:")
print(train_texts[0][:200] + "...")
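
One caveat with this layout: the answer comes last and inputs are truncated at `max_length=512` tokens, so a long context could push the answer out of the input. A possible guard is to trim only the context. The sketch below uses whitespace tokens rather than the real subword tokenizer, an assumption made to keep it self-contained; `build_input` and its parameters are hypothetical.

```python
def build_input(question, context, answer, max_tokens=512, sep="[SEP]"):
    """Join the three fields with separators, trimming only the context to fit."""
    q_tokens = question.split()
    c_tokens = context.split()
    a_tokens = answer.split()

    # Two separator tokens sit between the three fields
    budget = max_tokens - len(q_tokens) - len(a_tokens) - 2
    c_tokens = c_tokens[:max(budget, 0)]

    return " ".join(q_tokens + [sep] + c_tokens + [sep] + a_tokens)


# Hypothetical example with an overly long context
question = "What is the capital of France?"
context = " ".join(["filler"] * 1000)
answer = "Paris"

text = build_input(question, context, answer, max_tokens=50)
tokens = text.split()
print(len(tokens), tokens[-1])
```

A real implementation would count subword tokens and leave room for the special tokens that the tokenizer adds itself.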


In [None]:
# Create dataset class for transformers
class FactualityDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create datasets
train_dataset = FactualityDataset(train_texts, train_labels, tokenizer)
val_dataset = FactualityDataset(val_texts, val_labels, tokenizer)

print("Datasets created successfully!")


In [None]:
# Load model
num_labels = len(label_encoder.classes_)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="epoch",  # Changed from evaluation_strategy to eval_strategy for newer transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)

# Metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='macro')
    return {
        'accuracy': acc,
        'f1': f1
    }

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

print("Trainer created. Starting training...")
print("Note: This may take a while depending on your hardware.")


In [None]:
# Train the model
# Uncomment the line below to train (this takes time)
# trainer.train()

# Placeholder: build an inference pipeline from the base checkpoint.
# Its classification head has not been fine-tuned on this task, so the scores
# are not meaningful until trainer.train() has been run and the fine-tuned
# model is used in place of model_name.
print("Creating inference pipeline from the base checkpoint (not fine-tuned)...")
classifier = pipeline(
    "text-classification",
    model=model_name,
    device=0 if torch.cuda.is_available() else -1,
    top_k=None  # return scores for every label
)

print("Pipeline created. Fine-tune the model before using its predictions.")


### 4.5 Model Comparison and Selection


In [None]:
# Compare all models
models_comparison = {
    'Random Forest': {
        'accuracy': accuracy_score(y_val, rf_pred),
        'f1_macro': f1_score(y_val, rf_pred, average='macro'),
        'f1_weighted': f1_score(y_val, rf_pred, average='weighted'),
        'predictions': rf_pred,
        'probabilities': rf_pred_proba
    },
    'XGBoost': {
        'accuracy': accuracy_score(y_val, xgb_pred),
        'f1_macro': f1_score(y_val, xgb_pred, average='macro'),
        'f1_weighted': f1_score(y_val, xgb_pred, average='weighted'),
        'predictions': xgb_pred,
        'probabilities': xgb_pred_proba
    },
    'Logistic Regression': {
        'accuracy': accuracy_score(y_val, lr_pred),
        'f1_macro': f1_score(y_val, lr_pred, average='macro'),
        'f1_weighted': f1_score(y_val, lr_pred, average='weighted'),
        'predictions': lr_pred,
        'probabilities': lr_pred_proba
    }
}

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Model': list(models_comparison.keys()),
    'Accuracy': [m['accuracy'] for m in models_comparison.values()],
    'F1 (Macro)': [m['f1_macro'] for m in models_comparison.values()],
    'F1 (Weighted)': [m['f1_weighted'] for m in models_comparison.values()]
})

print("Model Comparison:")
print(comparison_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

metrics = ['Accuracy', 'F1 (Macro)', 'F1 (Weighted)']
for i, metric in enumerate(metrics):
    axes[i].bar(comparison_df['Model'], comparison_df[metric], color=['#3498db', '#2ecc71', '#e74c3c'])
    axes[i].set_title(f'{metric} Comparison', fontweight='bold')
    axes[i].set_ylabel(metric)
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Select best model based on F1 (macro)
best_model_name = comparison_df.loc[comparison_df['F1 (Macro)'].idxmax(), 'Model']
print(f"\nBest model based on F1 (Macro): {best_model_name}")
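
Model selection here uses macro F1, which averages the per-class F1 scores equally, whereas weighted F1 weights each class by its support. On imbalanced data the two can diverge, as this plain-Python sketch with made-up labels illustrates; the function names are illustrative.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each class, computed from true/false positive counts."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Hypothetical labels: class 0 is common, class 1 is rare and half-missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

The rare class drags macro F1 down more than weighted F1, which is why macro F1 is the stricter criterion when minority classes matter.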


### 4.6 K-Fold Cross-Validation

Let's perform k-fold cross-validation for more robust model evaluation.


In [None]:
# K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

print("="*70)
print("K-FOLD CROSS-VALIDATION RESULTS")
print("="*70)

# Use 5-fold cross-validation
k_fold = 5
skf = StratifiedKFold(n_splits=k_fold, shuffle=True, random_state=42)

# Prepare models for cross-validation
models_cv = {
    'Random Forest': RandomForestClassifier(
        n_estimators=200,
        max_depth=20,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42,
        n_jobs=-1,
        class_weight='balanced'
    ),
    'XGBoost': xgb.XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        eval_metric='mlogloss',
        use_label_encoder=False
    ),
    'Logistic Regression': LogisticRegression(
        max_iter=1000,
        random_state=42,
        class_weight='balanced',
        C=1.0
    )
}

# Perform cross-validation for each model
cv_results = {}

# Loop variables are named cv_name / cv_model so they do not overwrite the
# transformer's model_name and model defined earlier in the notebook
for cv_name, cv_model in models_cv.items():
    print(f"\n{'-'*70}")
    print(f"Cross-Validating: {cv_name}")
    print(f"{'-'*70}")
    
    # Cross-validation scores
    cv_accuracy = cross_val_score(cv_model, X_train_scaled, y_train, cv=skf, 
                                   scoring='accuracy', n_jobs=-1)
    cv_f1_macro = cross_val_score(cv_model, X_train_scaled, y_train, cv=skf, 
                                   scoring='f1_macro', n_jobs=-1)
    cv_f1_weighted = cross_val_score(cv_model, X_train_scaled, y_train, cv=skf, 
                                      scoring='f1_weighted', n_jobs=-1)
    
    cv_results[cv_name] = {
        'accuracy_mean': cv_accuracy.mean(),
        'accuracy_std': cv_accuracy.std(),
        'accuracy_scores': cv_accuracy,
        'f1_macro_mean': cv_f1_macro.mean(),
        'f1_macro_std': cv_f1_macro.std(),
        'f1_macro_scores': cv_f1_macro,
        'f1_weighted_mean': cv_f1_weighted.mean(),
        'f1_weighted_std': cv_f1_weighted.std(),
        'f1_weighted_scores': cv_f1_weighted
    }
    
    print(f"Accuracy: {cv_accuracy.mean():.4f} (+/- {cv_accuracy.std() * 2:.4f})")
    print(f"F1 (Macro): {cv_f1_macro.mean():.4f} (+/- {cv_f1_macro.std() * 2:.4f})")
    print(f"F1 (Weighted): {cv_f1_weighted.mean():.4f} (+/- {cv_f1_weighted.std() * 2:.4f})")
    print(f"\nFold-by-fold scores:")
    for fold in range(k_fold):
        print(f"  Fold {fold+1}: Acc={cv_accuracy[fold]:.4f}, F1_macro={cv_f1_macro[fold]:.4f}, F1_weighted={cv_f1_weighted[fold]:.4f}")

# Create comparison DataFrame
cv_comparison = pd.DataFrame({
    'Model': list(cv_results.keys()),
    'CV Accuracy (Mean)': [r['accuracy_mean'] for r in cv_results.values()],
    'CV Accuracy (Std)': [r['accuracy_std'] for r in cv_results.values()],
    'CV F1 Macro (Mean)': [r['f1_macro_mean'] for r in cv_results.values()],
    'CV F1 Macro (Std)': [r['f1_macro_std'] for r in cv_results.values()],
    'CV F1 Weighted (Mean)': [r['f1_weighted_mean'] for r in cv_results.values()],
    'CV F1 Weighted (Std)': [r['f1_weighted_std'] for r in cv_results.values()]
})

print(f"\n{'='*70}")
print("CROSS-VALIDATION SUMMARY")
print(f"{'='*70}")
print(cv_comparison.to_string(index=False))

# Visualize cross-validation results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

metrics_cv = ['CV Accuracy (Mean)', 'CV F1 Macro (Mean)', 'CV F1 Weighted (Mean)']
for i, metric in enumerate(metrics_cv):
    x_pos = np.arange(len(cv_comparison))
    means = cv_comparison[metric].values
    stds = cv_comparison[metric.replace('(Mean)', '(Std)')].values
    
    axes[i].bar(x_pos, means, yerr=stds, capsize=5, 
                color=['#3498db', '#2ecc71', '#e74c3c'], alpha=0.7)
    axes[i].set_xticks(x_pos)
    axes[i].set_xticklabels(cv_comparison['Model'], rotation=45, ha='right')
    axes[i].set_title(f'{metric.replace("CV ", "")} with Std Dev', fontweight='bold')
    axes[i].set_ylabel('Score')
    axes[i].grid(True, alpha=0.3, axis='y')
    
    # Add value labels on bars
    for j, (mean, std) in enumerate(zip(means, stds)):
        axes[i].text(j, mean + std + 0.01, f'{mean:.3f}', 
                     ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Select best model based on CV F1 (macro)
best_cv_model = cv_comparison.loc[cv_comparison['CV F1 Macro (Mean)'].idxmax(), 'Model']
print(f"\n{'='*70}")
print(f"Best model based on CV F1 (Macro): {best_cv_model}")
print(f"{'='*70}")
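
The `+/-` figure printed for each model is twice the standard deviation of the fold scores. NumPy's `.std()` defaults to the population standard deviation (`ddof=0`), which `statistics.pstdev` mirrors. A sketch with made-up fold scores:

```python
import statistics

# Hypothetical macro-F1 scores from five folds
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.79]

mean_score = statistics.fmean(fold_scores)
spread = 2 * statistics.pstdev(fold_scores)  # same convention as 2 * ndarray.std()

print(f"F1 (Macro): {mean_score:.4f} (+/- {spread:.4f})")
```

With only five folds this interval is a rough indicator of variability between folds, not a formal confidence interval.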


### 4.7 Confusion Matrices


In [None]:
# Confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (name, results) in enumerate(models_comparison.items()):
    cm = confusion_matrix(y_val, results['predictions'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=label_encoder.classes_,
                yticklabels=label_encoder.classes_)
    axes[idx].set_title(f'{name} Confusion Matrix', fontweight='bold')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')

plt.tight_layout()
plt.show()
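
Each heatmap cell `(i, j)` counts validation examples whose actual class is `i` and whose predicted class is `j`, which is scikit-learn's row/column convention. A pure-Python sketch of the same tally on made-up labels (the helper name is illustrative):

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Matrix where entry [i][j] counts examples of class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical encoded labels for seven validation examples
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_counts(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
```

The diagonal holds the correct predictions, so its sum divided by the total count gives the accuracy.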


### 4.8 Ensemble Model

Let's combine the three models above into a weighted soft-voting ensemble.


In [None]:
# Create ensemble using voting classifier
ensemble_model = VotingClassifier(
    estimators=[
        ('rf', rf_model),
        ('xgb', xgb_model),
        ('lr', lr_model)
    ],
    voting='soft',  # Use probabilities for voting
    weights=[2, 2, 1]  # Give more weight to RF and XGBoost
)

ensemble_model.fit(X_train_scaled, y_train)
ensemble_pred = ensemble_model.predict(X_val_scaled)
ensemble_pred_proba = ensemble_model.predict_proba(X_val_scaled)

print("Ensemble Model Results:")
print(f"Accuracy: {accuracy_score(y_val, ensemble_pred):.4f}")
print(f"F1 Score (macro): {f1_score(y_val, ensemble_pred, average='macro'):.4f}")
print(f"F1 Score (weighted): {f1_score(y_val, ensemble_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_val, ensemble_pred, target_names=label_encoder.classes_))

# Add to comparison
models_comparison['Ensemble'] = {
    'accuracy': accuracy_score(y_val, ensemble_pred),
    'f1_macro': f1_score(y_val, ensemble_pred, average='macro'),
    'f1_weighted': f1_score(y_val, ensemble_pred, average='weighted'),
    'predictions': ensemble_pred,
    'probabilities': ensemble_pred_proba
}
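
Soft voting averages the class probabilities from each estimator, weighted here 2:2:1, and predicts the class with the highest average. A plain-Python sketch for a single example, with made-up probabilities (`soft_vote` is an illustrative helper, not a scikit-learn function):

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities, plus the winning class."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Hypothetical class probabilities for one example from each model
rf_proba = [0.5, 0.3, 0.2]
xgb_proba = [0.1, 0.7, 0.2]
lr_proba = [0.6, 0.2, 0.2]

averaged, predicted = soft_vote([rf_proba, xgb_proba, lr_proba], weights=[2, 2, 1])
print(averaged, predicted)
```

Two of the three models favor class 0 in this example, yet class 1 wins because one model is confident and carries double weight. Hard voting would have picked class 0.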


## 5. Final Model Selection and Test Predictions

Based on the validation results, we'll select the best model and make predictions on the test set.


In [None]:
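# Retrain the selected model on the full training data (train + validation)
print("Retraining selected model on full training data...")

# The ensemble is used as the final model here; check models_comparison to
# confirm it actually scored best on the validation set before relying on it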
final_model = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, max_depth=20, min_samples_split=5, 
                                      min_samples_leaf=2, random_state=42, n_jobs=-1, class_weight='balanced')),
        ('xgb', xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, 
                                   subsample=0.8, colsample_bytree=0.8, random_state=42, 
                                   eval_metric='mlogloss', use_label_encoder=False)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced', C=1.0))
    ],
    voting='soft',
    weights=[2, 2, 1]
)

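# Train on the full dataset (train + validation).
# The scaler is refit here, so the test features must be re-transformed with it;
# otherwise X_test_scaled would still carry the train-only scaling.
X_full = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)
final_model.fit(X_full, y_encoded)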

print("Final model trained successfully!")


In [None]:
# Make predictions on test set
print("="*70)
print("MAKING PREDICTIONS ON TEST SET")
print("="*70)

# Verify model is trained
if not hasattr(final_model, 'estimators_'):
    print("⚠️  ERROR: Final model not trained yet!")
    print("Please run the previous cell to train the final model first.")
else:
    print(f"✅ Model is trained and ready")
    print(f"✅ Test data shape: {X_test_scaled.shape}")
    
    # Make predictions
    print("\nMaking predictions...")
    test_predictions_encoded = final_model.predict(X_test_scaled)
    test_predictions = label_encoder.inverse_transform(test_predictions_encoded)
    
    print(f"✅ Predictions made: {len(test_predictions)} predictions")
    print(f"✅ Prediction types: {set(test_predictions)}")
    
    # Add predictions to test dataframe
    # Make sure we're working with the original test_df loaded from JSON
    test_df['type'] = test_predictions
    
    print(f"\n✅ Predictions added to test_df")
    print(f"✅ test_df shape: {test_df.shape}")
    print(f"✅ test_df columns: {test_df.columns.tolist()}")
    
    print(f"\nPrediction distribution:")
    print(test_df['type'].value_counts())
    
    # Verify predictions are in dataframe
    print(f"\n✅ Verification:")
    print(f"   Rows with predictions: {(test_df['type'] != '').sum()}")
    print(f"   Empty predictions: {(test_df['type'] == '').sum()}")
    print(f"\n   Sample predictions:")
    print(test_df[['ID', 'type']].head(10))


In [None]:
# Save predictions to test.json file
import os

print("="*70)
print("SAVING PREDICTIONS TO FILE")
print("="*70)

# Check current working directory and ensure data directory exists
print(f"Current working directory: {os.getcwd()}")
os.makedirs('data', exist_ok=True)
print(f"✅ Data directory exists: {os.path.exists('data')}")

# CRITICAL: Verify predictions exist before saving
if 'type' not in test_df.columns:
    print("❌ ERROR: 'type' column not found in test_df!")
    print("Please run the prediction cell (Cell 43) first.")
elif (test_df['type'] == '').all() or test_df['type'].isnull().all():
    print("❌ ERROR: All type values are empty!")
    print("Please run the prediction cell (Cell 43) to generate predictions first.")
else:
    print(f"✅ Predictions verified in test_df")
    
    # Verify required columns exist before saving
    print("\n" + "="*60)
    print("VERIFICATION BEFORE SAVING:")
    print("="*60)
    print(f"Total rows: {len(test_df)}")
    print(f"Columns in test_df: {test_df.columns.tolist()}")
    print(f"\n'ID' column present: {'ID' in test_df.columns}")
    print(f"'type' column present: {'type' in test_df.columns}")
    
    # Check for missing values
    print(f"\nMissing IDs: {test_df['ID'].isnull().sum()}")
    print(f"Missing types: {test_df['type'].isnull().sum()}")
    print(f"Empty type strings: {(test_df['type'] == '').sum()}")
    
    # Check type values
    print(f"\nType value distribution:")
    print(test_df['type'].value_counts())
    print(f"\nUnique type values: {test_df['type'].unique()}")
    
    # Verify all required type values are present
    required_types = ['factual', 'contradiction', 'irrelevant']
    actual_types = set(test_df['type'].str.lower().unique())
    missing_types = set(required_types) - actual_types
    if missing_types:
        print(f"\n⚠️  WARNING: Missing type values: {missing_types}")
    else:
        print(f"\n✅ All required type values present!")
    
    # Verify ID range
    print(f"\nID range: {test_df['ID'].min()} to {test_df['ID'].max()}")
    print(f"Expected 2000 rows: {'✅ YES' if len(test_df) == 2000 else '❌ NO'}")
    
    print("\n" + "="*60)
    print("SAMPLE OF DATA TO BE SAVED:")
    print("="*60)
    print(test_df[['ID', 'type']].head(10).to_string(index=False))

    # Convert to dictionary format for JSON
    output_data = test_df.to_dict('records')
    
    # Verify first record structure
    print("\n" + "="*60)
    print("VERIFICATION OF FIRST RECORD STRUCTURE:")
    print("="*60)
    first_record = output_data[0]
    print(f"Keys in first record: {list(first_record.keys())}")
    print(f"First record ID: {first_record.get('ID', 'MISSING')}")
    print(f"First record type: {first_record.get('type', 'MISSING')}")
    
    if first_record.get('type') in ['', None]:
        print("❌ ERROR: First record has empty type! Predictions not made.")
    else:
        print(f"✅ First record has valid type: {first_record.get('type')}")
    
    print(f"\nSample of first 3 records:")
    for i in range(min(3, len(output_data))):
        rec = output_data[i]
        print(f"  Record {i+1}: ID={rec.get('ID')}, type={rec.get('type')}")
    
    # Save to test.json
    output_path = 'data/test.json'
    print(f"\n" + "="*60)
    print(f"SAVING TO: {output_path}")
    print("="*60)
    
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)
        
        print(f"✅ File saved successfully!")
        print(f"✅ File exists: {os.path.exists(output_path)}")
        print(f"✅ File size: {os.path.getsize(output_path) / 1024:.2f} KB")
        
        # Verify by reading back
        print(f"\n" + "="*60)
        print("VERIFYING SAVED FILE:")
        print("="*60)
        with open(output_path, 'r', encoding='utf-8') as f:
            saved_data = json.load(f)
        
        print(f"✅ Records in saved file: {len(saved_data)}")
        saved_types = [r.get('type', '') for r in saved_data]
        non_empty = sum(1 for t in saved_types if t)  # counts non-empty, non-null types
        print(f"✅ Records with predictions: {non_empty}/{len(saved_data)}")
        
        if non_empty == len(saved_data):
            print(f"✅ SUCCESS: All {len(saved_data)} records have predictions!")
            print(f"\nSample saved predictions:")
            for i in range(min(5, len(saved_data))):
                print(f"  ID {saved_data[i].get('ID')}: {saved_data[i].get('type')}")
        else:
            print(f"⚠️  WARNING: {len(saved_data) - non_empty} records still have empty types")
             
    except Exception as e:
        print(f"❌ ERROR saving file: {e}")
        import traceback
        traceback.print_exc()


## 6. Methodology Discussion

### What Worked Well:

1. **Feature Engineering**: 
   - Text length and word count features provided good baseline signals
   - Word overlap features (Jaccard similarity) helped distinguish between relevant and irrelevant answers
   - Semantic similarity using TF-IDF and cosine similarity captured deeper relationships
   - Question word features helped identify question types

2. **Ensemble Methods**:
   - Combining multiple models (Random Forest, XGBoost, Logistic Regression) improved robustness
   - Soft voting with weighted probabilities performed better than hard voting
   - Different models captured different patterns in the data

3. **Class Balancing**:
   - Using `class_weight='balanced'` in Random Forest and Logistic Regression helped handle imbalanced classes (the XGBoost member of the ensemble was not reweighted)
   - This was particularly important for the "Irrelevant" class, which might be underrepresented

4. **Feature Scaling**:
   - StandardScaler improved performance of the scale-sensitive model (Logistic Regression); the tree-based models (Random Forest, XGBoost) are largely unaffected by feature scaling
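
The word-overlap and TF-IDF features from item 1 can be sketched as follows. This is a minimal illustration rather than the notebook's exact feature code: the helper names `jaccard_overlap` and `tfidf_cosine` and the sample sentences are hypothetical, and the vectorizer is fit on just the two texts being compared, whereas a real pipeline would more likely fit it on the whole training corpus.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaccard_overlap(text_a, text_b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def tfidf_cosine(text_a, text_b):
    """Cosine similarity of TF-IDF vectors (assumption: fit on this pair only)."""
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


# Hypothetical examples, not rows from the competition data
question = "What gas do plants absorb during photosynthesis?"
relevant = "Plants absorb carbon dioxide during photosynthesis."
irrelevant = "The French Revolution began in 1789."

for label, answer in [("relevant", relevant), ("irrelevant", irrelevant)]:
    print(label, round(jaccard_overlap(question, answer), 3),
          round(tfidf_cosine(question, answer), 3))
```

Both scores fall to zero when the answer shares no words with the question, which is why they separate irrelevant answers well but say little about factual versus contradictory ones.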

### What Did Not Work Well:

1. **Transformer Models**:
   - Fine-tuning transformer models requires significant computational resources and time
   - Without proper fine-tuning, pre-trained models may not perform well on this specific task
   - For production use, we would need to fine-tune on the full dataset with proper hyperparameter tuning

2. **Simple Text Features**:
   - Basic features alone were not sufficient
   - Needed combination of statistical, semantic, and linguistic features

3. **Overfitting Concerns**:
   - Some models (especially Random Forest with high depth) showed signs of overfitting
   - Cross-validation would help identify optimal hyperparameters
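
One way to check the overfitting concern in item 3 is to compare training and validation scores across folds. The sketch below assumes synthetic data from `make_classification` in place of the notebook's features, and the depth values and the `cv_summary` name are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 3-class data standing in for the engineered features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, flip_y=0.1, random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_summary = {}

for depth in (3, 20):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    scores = cross_validate(
        model, X_demo, y_demo, cv=cv,
        scoring="f1_macro", return_train_score=True,
    )
    cv_summary[depth] = {
        "train_f1": scores["train_score"].mean(),
        "val_f1": scores["test_score"].mean(),
    }
    gap = cv_summary[depth]["train_f1"] - cv_summary[depth]["val_f1"]
    print(f"max_depth={depth}: train={cv_summary[depth]['train_f1']:.3f}, "
          f"val={cv_summary[depth]['val_f1']:.3f}, gap={gap:.3f}")
```

A large gap between the training and validation macro F1 suggests the model is memorizing the training folds, so reducing `max_depth` or raising `min_samples_leaf` would be worth trying.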

### Suggestions for General Approach:

1. **Multi-Stage Pipeline**:
   - Stage 1: Binary classification (Relevant vs Irrelevant)
   - Stage 2: For relevant answers, classify as Factual vs Contradiction
   - This hierarchical approach might improve performance

2. **Advanced Feature Engineering**:
   - Named Entity Recognition (NER) to identify entities in questions and answers
   - Dependency parsing to understand sentence structure
   - Sentiment analysis to detect contradictions
   - Fact-checking features (checking if answer contains verifiable facts)

3. **Transformer Fine-Tuning**:
   - Fine-tune BERT/RoBERTa specifically for factuality detection
   - Use domain-specific pre-training if educational content is available
   - Consider using models like DeBERTa or ELECTRA for better performance

4. **External Knowledge Bases**:
   - Integrate with knowledge graphs (Wikipedia, Wikidata) for fact verification
   - Use retrieval-augmented generation (RAG) approaches
   - Cross-reference answers with authoritative sources

5. **Active Learning**:
   - Identify uncertain predictions and collect more training data for those cases
   - Use uncertainty sampling to improve model performance iteratively

6. **Explainability**:
   - Use SHAP values or LIME to explain predictions
   - This is crucial for educational applications where trust is important
   - Helps identify which parts of the question/context/answer drive the prediction

7. **Evaluation Metrics**:
   - Beyond accuracy and F1, consider:
     - Per-class precision/recall (especially for Contradiction which is most harmful)
     - Cost-sensitive metrics (higher penalty for false negatives in Contradiction)
     - Human evaluation on sample predictions

8. **Data Augmentation**:
   - Paraphrase questions and answers to increase training data
   - Generate synthetic contradictions by modifying factual answers
   - Use back-translation for data augmentation
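
The multi-stage pipeline from suggestion 1 could look roughly like the sketch below. `TwoStageClassifier` is a hypothetical class written for illustration, Logistic Regression is an arbitrary choice for both stages, and the three Gaussian blobs are synthetic stand-ins for real features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class TwoStageClassifier:
    """Stage 1: relevant vs. irrelevant. Stage 2: split relevant answers further."""

    def __init__(self, irrelevant_label="irrelevant"):
        self.irrelevant_label = irrelevant_label
        self.stage1 = LogisticRegression(max_iter=1000)
        self.stage2 = LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        is_relevant = y != self.irrelevant_label
        # Stage 1 sees every row; stage 2 sees only the relevant rows
        self.stage1.fit(X, is_relevant.astype(int))
        self.stage2.fit(X[is_relevant], y[is_relevant])
        return self

    def predict(self, X):
        X = np.asarray(X)
        relevant_mask = self.stage1.predict(X) == 1
        preds = np.full(len(X), self.irrelevant_label, dtype=object)
        if relevant_mask.any():
            preds[relevant_mask] = self.stage2.predict(X[relevant_mask])
        return preds


# Synthetic, well-separated blobs (assumption: not the competition data)
rng = np.random.default_rng(42)
centers = {"irrelevant": (0, 0), "factual": (4, 0), "contradiction": (4, 4)}
X_demo = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers.values()])
y_demo = np.repeat(list(centers.keys()), 50)

two_stage = TwoStageClassifier().fit(X_demo, y_demo)
demo_preds = two_stage.predict(X_demo)
print("Training accuracy:", (demo_preds == y_demo).mean())
```

The trade-off is that stage 1 errors propagate: a relevant answer wrongly flagged as irrelevant never reaches stage 2, so the first stage's recall on relevant answers is worth monitoring separately.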

### Potential Improvements:

1. **Hyperparameter Tuning**: Use GridSearchCV or Optuna for systematic hyperparameter optimization

2. **Cross-Validation**: Implement k-fold cross-validation for more robust model evaluation

3. **Feature Selection**: Use techniques like recursive feature elimination to identify most important features

4. **Model Interpretability**: Add SHAP/LIME analysis to understand model decisions

5. **Error Analysis**: Deep dive into misclassified examples to identify patterns and improve features

6. **Domain Adaptation**: If test data comes from different domains, consider domain adaptation techniques
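
Improvements 1 and 2 can be combined in a single search, sketched below under stated assumptions: the data is synthetic, the grid over `C` is an arbitrary example, and placing the scaler inside a `Pipeline` is a design choice made here so that scaling statistics are recomputed within each fold rather than leaking from validation rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the engineered features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# The scaler is fit per fold because it lives inside the pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # illustrative grid only

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV macro F1:", round(search.best_score_, 3))
```

The same pattern extends to the Random Forest and XGBoost members by swapping the `clf` step and the parameter grid.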
