

## Data Augmentation Pipeline for Score Prediction

This notebook implements a structured **data augmentation workflow** designed to enrich the score-prediction dataset.  
The goal is to generate additional training samples that preserve the statistical patterns of the original dataset while improving robustness for downstream models (NN, GPR, LightGBM, MoE, etc.).

---


In [7]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error
from sklearn.calibration import CalibratedClassifierCV
from sklearn.isotonic import IsotonicRegression
import lightgbm as lgb
from sklearn.metrics.pairwise import cosine_similarity
import torch
import json


## 1. Loading the Data

We load the essential components of the dataset:

- **Training set** (`train_data.json`)  
- **Test set** (`test_data.json`)  
- **Precomputed metric embeddings** (`metric_name_embeddings.npy`)  
- **Metric name list** (`metric_names.json`), where a name's position in the list is the row index of its embedding  
- **Sentence-Transformer model** for generating text embeddings (loaded in Section 3)  

These components allow us to merge textual information, metric structure, and scoring patterns into a single augmentation pipeline.

---


In [None]:
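with open("data/train_data.json", encoding="utf-8") as f:
    train = json.load(f)
with open("data/test_data.json", encoding="utf-8") as f:
    test = json.load(f)
metric_embs = np.load("data/metric_name_embeddings.npy")  # one row per metric
with open("data/metric_names.json", encoding="utf-8") as f:
    metric_map = json.load(f)  # list of metric names, indexed like metric_embs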
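
Later cells look up each metric's embedding by its position in `metric_map`, so the name list and the embedding matrix must line up row for row. A minimal sanity check, assuming `metric_map` is a flat list of unique names and `metric_embs` is a 2-D array (the helper name `check_metric_alignment` and the toy data are illustrative, not part of the original notebook):

```python
import numpy as np

def check_metric_alignment(metric_names, metric_embs):
    """Return a name -> row-index dict after validating the pairing.

    Hypothetical helper: assumes one embedding row per metric name,
    in the same order as the list.
    """
    if metric_embs.ndim != 2:
        raise ValueError(f"expected a 2-D embedding matrix, got ndim={metric_embs.ndim}")
    if len(metric_names) != metric_embs.shape[0]:
        raise ValueError(
            f"{len(metric_names)} metric names but {metric_embs.shape[0]} embedding rows"
        )
    if len(set(metric_names)) != len(metric_names):
        raise ValueError("duplicate metric names would make index lookups ambiguous")
    return {name: i for i, name in enumerate(metric_names)}

# Toy stand-ins for metric_names.json and metric_name_embeddings.npy
toy_names = ["fluency", "relevance", "safety"]
toy_embs = np.zeros((3, 8), dtype=np.float32)
toy_index = check_metric_alignment(toy_names, toy_embs)
```

Running the same check on the real `metric_map` and `metric_embs` right after loading would surface a mismatched file pair before any embeddings are computed.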


## 2. Augmentation Strategy

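We loop through the original training data and, for roughly 60% of the samples (`np.random.random() > 0.4`), create one synthetic sample.  
Each synthetic sample keeps the prompts and response of the original but receives a randomly chosen metric and a newly drawn score:

### **Score Generation**
For each selected sample:
- A metric name is drawn uniformly at random from the metric list  
- A new "synthetic" score is drawn from a **Gaussian noise model** (mean -1.5, standard deviation 0.5), which biases it toward the low end  
- The score is rounded to the nearest integer

Note that the cell below only rounds the score. It does not clip the result to a valid range, and it does not treat low, medium, and high score regions differently.

The resulting samples add low-score examples for randomly assigned metrics.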
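
A perturbation that treats low, medium, and high scores differently and clips the result could look like the sketch below. The 0 to 10 score range, the region boundaries, and the noise scales are assumptions made for illustration; the augmentation cell in this notebook instead draws each synthetic score directly from a single Gaussian and only rounds it.

```python
import numpy as np

# Assumed values, not taken from the notebook: a 0-10 score scale
# and one noise scale per score region.
SCORE_MIN, SCORE_MAX = 0, 10
NOISE_SCALE = {"low": 0.5, "mid": 1.0, "high": 0.5}

def score_region(score):
    """Bucket a score into the low / mid / high third of the assumed range."""
    span = SCORE_MAX - SCORE_MIN
    if score < SCORE_MIN + span / 3:
        return "low"
    if score < SCORE_MIN + 2 * span / 3:
        return "mid"
    return "high"

def perturb_score(score, rng):
    """Add region-dependent Gaussian noise, then round and clip to the range."""
    noisy = score + rng.normal(loc=0.0, scale=NOISE_SCALE[score_region(score)])
    return float(np.clip(np.round(noisy), SCORE_MIN, SCORE_MAX))

rng = np.random.default_rng(0)  # seeded so the example is repeatable
perturbed = [perturb_score(s, rng) for s in [0, 2, 5, 9, 10]]
```

Clipping after rounding guarantees that every perturbed score stays inside the assumed range, whatever the noise draw.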

### **Prompt Repurposing**
We combine:
- System prompt  
- User prompt  
- Response  

into a single string with a clear separator (`[SEP]`) between the parts, so the embedding captures the full context of each sample.
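
The embedding cell later in the notebook joins the system prompt, user prompt, and response with `[SEP]`. A small helper for that join, as a sketch (the name `build_text` and the handling of null system prompts are assumptions, not part of the original code):

```python
SEP = " [SEP] "

def build_text(record):
    """Join the text fields of one record with the [SEP] marker.

    A missing or null system prompt becomes an empty string, so the
    result always contains the same number of separators.
    """
    system_prompt = record.get("system_prompt") or ""
    return SEP.join([system_prompt, record["user_prompt"], record["response"]])

example = {"system_prompt": None, "user_prompt": "Translate: hello", "response": "namaste"}
text = build_text(example)
```

Using `or ""` rather than a `.get` default also covers records where the key exists but holds `null`, which would otherwise be formatted as the literal string `None`.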

---


In [None]:
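aug = []
for item in train:
    # Create one synthetic copy for roughly 60% of the training samples
    if np.random.random() > 0.4:
        # Assign a metric drawn uniformly at random from the metric list
        m = np.random.choice(metric_map)
        # Draw a low score (mean -1.5, std 0.5) and round to the nearest integer
        score = np.random.normal(loc=-1.5, scale=0.5)  # low score bias
        score = np.round(score, 0)
        aug.append({
            "system_prompt": item.get("system_prompt", ""),
            "user_prompt": item["user_prompt"],
            "response": item["response"],
            "metric_name": m,
            "score": score
        })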



train_aug = train + aug
print(f"Total data points: {len(train_aug)}")

Total data points: 7936
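
The augmentation cell draws from NumPy's global random state without a seed, so the total above will differ from run to run. A seeded selection could be sketched as follows; the helper name `select_for_augmentation` and the seed value are hypothetical:

```python
import numpy as np

def select_for_augmentation(n_items, keep_prob=0.6, seed=0):
    """Return a boolean mask marking which items get a synthetic copy.

    keep_prob=0.6 mirrors the `random() > 0.4` test in the notebook;
    the seed value is arbitrary.
    """
    rng = np.random.default_rng(seed)
    return rng.random(n_items) > (1.0 - keep_prob)

# Two calls with the same seed select exactly the same items.
mask_a = select_for_augmentation(1000)
mask_b = select_for_augmentation(1000)
```

With a fixed seed, the augmented dataset and its size are the same on every run, which makes downstream comparisons easier to reproduce.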



## 3. Generating Text Embeddings

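Using **Indic Sentence-BERT** (`l3cube-pune/indic-sentence-similarity-sbert`, or any configured SentenceTransformer model), we generate a text embedding for each sample in the augmented set and concatenate it with the precomputed embedding of that sample's metric.

This step provides:
- High-dimensional semantic representations of the prompt and response text  
- Dense, continuous features for downstream models to learn from  
- The same text encoding for original and synthetic samples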
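
Each feature row in the cell below is a text embedding followed by the embedding of the sample's metric. A vectorized sketch of that assembly, using random arrays as stand-ins for the real embeddings (all shapes and values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins: 5 samples with 16-dim text embeddings,
# and 3 metrics with 4-dim embeddings.
text_embs = rng.normal(size=(5, 16)).astype(np.float32)
metric_embs = rng.normal(size=(3, 4)).astype(np.float32)
metric_idx = np.array([0, 2, 1, 1, 0])  # row of metric_embs for each sample

# Integer-array indexing picks one metric row per sample;
# hstack then joins the two blocks along the feature axis.
X = np.hstack([text_embs, metric_embs[metric_idx]])
```

The resulting matrix has one row per sample and `16 + 4` columns, with the metric block occupying the last four.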

---


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
sbert_model = SentenceTransformer("l3cube-pune/indic-sentence-similarity-sbert", device=device)
X_train = []
y = []
for r in tqdm(train_aug):
    # Text features: system prompt, user prompt, and response joined by [SEP]
    txt = f"{r.get('system_prompt', '')} [SEP] {r['user_prompt']} [SEP] {r['response']}"
    text_emb = sbert_model.encode(txt, normalize_embeddings=True)
    # Metric features: the precomputed embedding row for this sample's metric
    metric_emb = metric_embs[metric_map.index(r['metric_name'])]
    X_train.append(np.concatenate([text_emb, metric_emb]))

    y.append(float(r['score']))

X_train = np.array(X_train, dtype=np.float32)
y = np.array(y, dtype=np.float32)
print("Training data preparation complete.")
# Save the prepared features and targets
np.save("data/X_train_new_augmented.npy", X_train)
np.save("data/y_train_new_augmented.npy", y)

100%|██████████| 7936/7936 [01:27<00:00, 90.40it/s] 


Training data preparation complete.
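
Anything that later loads the two saved arrays expects them to have matching row counts. A minimal round-trip check, using small stand-in arrays and a temporary directory rather than the real `data/` files:

```python
import os
import tempfile

import numpy as np

X = np.zeros((4, 6), dtype=np.float32)  # stand-in feature matrix
y = np.arange(4, dtype=np.float32)      # stand-in targets

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X.npy")
    y_path = os.path.join(tmp, "y.npy")
    np.save(x_path, X)
    np.save(y_path, y)
    # Reload from disk, as a downstream consumer would
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

rows_match = X_loaded.shape[0] == y_loaded.shape[0]
```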



## Summary

This notebook builds a controlled augmentation pipeline intended to improve:

- Data diversity  
- Embedding richness  
- Score distribution stability  
- Generalization performance  

The augmented features and targets are saved to `data/X_train_new_augmented.npy` and `data/y_train_new_augmented.npy`, ready for the downstream modeling approaches.

---
