<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/FineTuning_T2SQL_GNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes a Mistral 7B language model with the help of a Graph Neural Network (GNN), with the goal of improving SQL generation from natural language questions and database schemas. It covers the following key steps:

1.  **Environment Setup:** Installation of necessary libraries (PyTorch, Hugging Face Transformers, FlashAttention, etc.) and setting up access to the Hugging Face model hub.
   
2.  **Model Loading:** Loading the pre-trained Mistral 7B model and tokenizer, configuring it for 4-bit quantization to reduce memory usage.
   
3.  **Graph Construction:** Defining functions to convert SQL queries, database schemas, and answers into graph structures suitable for GNN processing. This involves representing elements like SELECT, FROM, WHERE as nodes and their relationships as edges.
   
4.  **GNN Model:** Creating a GNN model using Graph Attention Networks (GATConv) to process the graph representations of SQL queries and schemas. The GNN aims to learn meaningful embeddings for these graph structures.
   
5.  **Data Preparation:** Loading the "b-mc2/sql-create-context" dataset, converting it into the GNN-compatible graph format, and splitting it into training and validation sets.
   
6.  **Training and Evaluation:** Fine-tuning the Mistral model using the GNN-generated embeddings as additional input. The training process involves iteratively adjusting the model's parameters to minimize the difference between its predicted SQL queries and the ground truth answers. Evaluation is done using the BLEU metric, which measures the similarity between generated and reference SQL queries.
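
As a rough illustration of step 3, the sketch below turns a SQL string into a node list and an edge list using only the standard library. The function name `sql_to_nodes_edges`, the regex tokenizer, and the edge rule (each clause keyword links to the tokens that follow it, and consecutive clause keywords are chained) are assumptions made for illustration. The notebook's own `sql_to_graph` is still a placeholder and may end up using a real SQL parser instead.

```python
import re

# Clause keywords that act as hub nodes (an illustrative subset, not a full SQL grammar)
CLAUSE_KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "HAVING", "LIMIT"}

# Single-quoted strings, identifiers, numbers, two-character operators, or single symbols
TOKEN_PATTERN = r"'[^']*'|[A-Za-z_][\w.]*|\d+(?:\.\d+)?|<=|>=|<>|!=|[^\s\w]"


def sql_to_nodes_edges(sql):
    """Split a SQL string into token nodes and (source, target) index edges."""
    nodes = re.findall(TOKEN_PATTERN, sql)
    edges = []
    current_clause = None  # index of the most recent clause keyword
    for index, token in enumerate(nodes):
        if token.upper() in CLAUSE_KEYWORDS:
            # Chain consecutive clauses: SELECT -> FROM -> WHERE ...
            if current_clause is not None:
                edges.append((current_clause, index))
            current_clause = index
        elif current_clause is not None:
            # Attach every other token to the clause it belongs to
            edges.append((current_clause, index))
    return nodes, edges


nodes, edges = sql_to_nodes_edges("SELECT name FROM users WHERE age > 5")
print(nodes)
print(edges)
```

The resulting edge list can be transposed into the `edge_index` layout that `torch_geometric.data.Data` expects, in the same way the placeholder graph is built later in this notebook.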

In [None]:
# Install PyTorch and TensorBoard
!pip install torch tensorboard --quiet

# Install Hugging Face libraries
!pip install --upgrade transformers datasets accelerate evaluate bitsandbytes --quiet

# FlashAttention only supports Ampere GPUs or newer (A100 or L4 on Google Colab)
!pip install -U flash-attn --no-build-isolation --quiet

!pip install peft --quiet
!pip install ninja packaging --quiet
!pip install diffusers safetensors --quiet
!pip install colab-env --quiet

!pip install mistral_inference -q

# Pin TRL to version 0.8.6
!pip install trl==0.8.6 -q

# Graph libraries used by the GNN cells
!pip install torch-geometric -q
!pip install sqlparse networkx -q

In [2]:
import colab_env
import os

access_token_write = os.getenv("HUGGINGFACE_ACCESS_TOKEN_WRITE")

from huggingface_hub import login

#print(access_token_write)

login(
  token=access_token_write,
  add_to_git_credential=True
)

Mounted at /content/gdrive
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
import torch
import os
import sys
import json
import IPython
from datetime import datetime
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer

In [4]:
# set device
device = 'cuda'

In [5]:
torch.__version__

'2.3.1+cu121'

In [6]:
!python --version
!nvcc --version
!nvidia-smi

Python 3.10.12
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Sat Jul 13 08:23:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              46W / 400W |      5MiB / 40960MiB |      0%      Default |

MISTRAL

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format


from huggingface_hub import login


login(
  token=access_token_write,
  add_to_git_credential=True
)

print()

# Hugging Face model id
model_id = "mistralai/Mistral-7B-Instruct-v0.3" #24 JUNE 2024

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
mistral_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id,use_fast=True)
tokenizer.padding_side = 'right' # to prevent warnings

# Use the unknown token (unk_token) as the padding token
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id

# Set chat template to OAI ChatML; remove if you start from a fine-tuned model
mistral_model, tokenizer = setup_chat_format(mistral_model, tokenizer)
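
To make the memory saving from 4-bit quantization concrete, here is a back-of-the-envelope estimate of weight storage. The parameter count (about 7.25 billion) is an approximation, and the calculation ignores quantization constants, activations, the KV cache and optimizer state, so the results are rough lower bounds rather than measured figures.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7.25e9  # assumed approximate parameter count for Mistral 7B

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the bf16 weights, which is what makes a 7B model comfortable to load on a single 40 GB GPU.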

GNN #0

In [20]:
import colab_env
import os
access_token_write = os.getenv("HUGGINGFACE_ACCESS_TOKEN_WRITE")
from huggingface_hub import login
#print(access_token_write)
login(
 token=access_token_write,
 add_to_git_credential=True
)

import torch
import os
import sys
import json
import IPython
from datetime import datetime
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer

# set device
device = 'cuda'


tokenizer.pad_token_id = tokenizer.unk_token_id
# Set chat template to OAI ChatML; remove if you start from a fine-tuned model
mistral_model, tokenizer = setup_chat_format(mistral_model, tokenizer)

# GNN
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from torch_geometric.data import Data, Batch # Import Batch here
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool # Import global_mean_pool here
import torch.optim as optim
from tqdm.auto import tqdm
import evaluate
import numpy as np

# 1. Graph Construction
def sql_to_graph(question, schema, answer):
    # TODO: Replace with actual conversion logic (this is the most crucial part),
    # for example by using SQL parsing libraries and heuristics.
    # Until then, every example maps to the same hard-coded placeholder graph.
    nodes = ["SELECT", "*", "FROM", "table1", "WHERE", "column1", ">", "5"]
    edges = [(0, 1), (0, 3), (3, 5), (5, 6), (6, 7)]
    node_features = torch.eye(len(nodes))  # One-hot feature vector per node
    edge_features = torch.ones(len(edges), 1)
    # Placeholder labels: one zero per whitespace-separated token of the answer.
    # Replace the zeros with real token IDs once the target encoding is defined.
    answer_tokens = answer.split()
    answer_tensor = torch.tensor([0] * len(answer_tokens))
    return Data(x=node_features, edge_index=torch.tensor(edges).t().contiguous(), y=answer_tensor, edge_attr=edge_features)

class SQLGraphDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        entry = self.data[index]
        question = entry["question"]
        schema = entry["context"]
        answer = entry["answer"]
        return sql_to_graph(question, schema, answer)

# Placeholder Conversion Function
def convert_to_gnn(dataset):
    # TODO: Replace with actual conversion logic
    # This iterates through the dataset and calls sql_to_graph for each entry
    dataset_gnn = []
    for i in tqdm(range(len(dataset)), desc="Converting to GNN"):
        question = dataset[i]["question"]
        schema = dataset[i]["context"]
        answer = dataset[i]["answer"]
        graph = sql_to_graph(question, schema, answer)
        dataset_gnn.append(graph)
    return dataset_gnn

# 2. GNN Model
from torch_geometric.nn import GATConv

class SQLGNN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, heads=8):
        super(SQLGNN, self).__init__()
        self.conv1 = GATConv(input_dim, hidden_dim, heads=heads, dropout=0.6)
        self.conv2 = GATConv(hidden_dim * heads, output_dim, heads=1, concat=False, dropout=0.6)

    def forward(self, data):
        x, edge_index, batch = data.x.to(device), data.edge_index.to(device), data.batch.to(device)  # Move graph data to GPU
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        x = global_mean_pool(x, batch)  # Global Mean Pooling for graph-level representation
        return x

# 3. Load and Prepare Data
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle(seed=42).select(range(12500))
# Convert to GNN format
dataset_gnn = convert_to_gnn(dataset)
# Keep each graph together with its source row, so that the text stays aligned
# with the graph after the DataLoader shuffles the examples
examples = [
    (graph, row["question"], row["context"], row["answer"])
    for graph, row in zip(dataset_gnn, dataset)
]
# Split dataset (80% training, 20% validation)
train_size = int(0.8 * len(examples))
train_dataset = examples[:train_size]
val_dataset = examples[train_size:]

# Custom collate function: batches the graphs and passes the text through as lists
def collate_fn(batch):
    graphs, questions, schemas, answers = zip(*batch)
    return Batch.from_data_list(list(graphs)), list(questions), list(schemas), list(answers)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=4, collate_fn=collate_fn)

# 4. Initialize Model, Loss, and Optimizer
input_dim = dataset_gnn[0].num_node_features
hidden_dim = 64
output_dim = 128 # Assuming a generation task for simplicity
model = SQLGNN(input_dim, hidden_dim, output_dim).to(device)

# For generation:
criterion = torch.nn.CrossEntropyLoss(ignore_index=0) # Ignore padding in loss calculation
# For classification:
# criterion = torch.nn.BCEWithLogitsLoss() # Or other suitable loss

optimizer = optim.Adam(model.parameters(), lr=0.001)


# 5. Training and Evaluation Functions
# Import the library under an alias: the evaluate() function defined below shadows the module name
import evaluate as hf_evaluate
bleu = hf_evaluate.load("bleu")
tokenizer.padding_side = 'left'

def train(model, mistral_model, loader, optimizer, mistral_optimizer, epoch, num_epochs):
    model.train()
    mistral_model.train()
    total_loss = 0
    loop = tqdm(loader, total=len(loader), desc=f"Epoch {epoch+1}/{num_epochs}")
    for data, questions, schemas, answers in loop:
        optimizer.zero_grad()
        mistral_optimizer.zero_grad()

        # GNN forward pass
        data = data.to(device)  # Move graph data to GPU
        graph_embeddings = model(data)  # One embedding per graph in the batch

        # Prepare Mistral input.
        # NOTE: the embedding is formatted into the prompt as text, so no gradient
        # flows back to the GNN and optimizer.step() leaves its weights unchanged.
        mistral_inputs = [f"Question: {q}\nSchema: {s}\nGraph Embedding: {g}" for q, s, g in zip(questions, schemas, graph_embeddings)]

        # Set padding side to left before tokenizing
        tokenizer.padding_side = 'left'

        # Tokenize the prompts, truncating to at most 512 tokens
        tokenized_inputs = tokenizer(mistral_inputs, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

        # Use the input IDs as labels, masking padding so it does not contribute to the loss
        labels = tokenized_inputs["input_ids"].clone()
        labels[tokenized_inputs["attention_mask"] == 0] = -100
        outputs = mistral_model(**tokenized_inputs, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        # Backpropagation
        loss.backward()
        optimizer.step()
        mistral_optimizer.step()

        loop.set_postfix(loss=loss.item())

    avg_loss = total_loss / len(loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

    # Save checkpoints under /content/
    os.chdir("/content/")

    # Save the Mistral model at the end of each epoch
    save_directory = f"mistral_gnn_finetuned_epoch_{epoch+1}"
    mistral_model.save_pretrained(save_directory)
    print(f"Mistral Model saved to {save_directory}")

    # Save the GNN weights (nn.Module has no save_state_dict method)
    torch.save(model.state_dict(), f"gnn_model_epoch_{epoch+1}.pth")
    print(f"GNN Model saved to gnn_model_epoch_{epoch+1}.pth")

def evaluate(model, mistral_model, loader, bleu, epoch=None):
    model.eval()
    mistral_model.eval()

    predictions = []
    references = []

    with torch.no_grad():
        for data, questions, schemas, answers in tqdm(loader, desc="Evaluating"):
            graph_embeddings = model(data)

            # Prepare Mistral Input
            mistral_inputs = [f"Question: {q}\nSchema: {s}\nGraph Embedding: {g}" for q, s, g in zip(questions, schemas, graph_embeddings)]

            # Tokenize and generate SQL using Mistral
            tokenized_inputs = tokenizer(mistral_inputs, return_tensors="pt", padding=True, truncation=True).to(device)
            output_sequences = mistral_model.generate(**tokenized_inputs, max_new_tokens=128)  # Generate up to 128 new tokens.

            # Decode only the newly generated tokens, not the echoed prompt
            prompt_length = tokenized_inputs["input_ids"].shape[1]
            generated_sql = tokenizer.batch_decode(output_sequences[:, prompt_length:], skip_special_tokens=True)

            predictions.extend(generated_sql)

            # BLEU compares text, so each reference is passed as a string
            references.extend([[str(a)] for a in answers])

    bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
    label = "" if epoch is None else f" (epoch {epoch})"
    print(f"BLEU Score{label}: {bleu_score:.4f}")
    return bleu_score

# ... (Rest of the code for training loop and model saving)



Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Converting to GNN:   0%|          | 0/12500 [00:00<?, ?it/s]
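
Since evaluation relies on BLEU, it may help to see what the metric measures. The sketch below is a simplified sentence-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty) with whitespace tokenization and no smoothing. It is an illustrative assumption and will not match the `evaluate` library's corpus-level implementation or tokenizer exactly; `simple_bleu` and `ngrams` are hypothetical helper names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Simplified sentence-level BLEU for one prediction and one reference."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # Without smoothing, any zero precision gives BLEU 0
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty discourages predictions shorter than the reference
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "SELECT name FROM users WHERE age > 5"
print(simple_bleu("SELECT name FROM users WHERE age > 5", reference))
print(simple_bleu("SELECT name FROM users WHERE age > 10", reference))
print(simple_bleu("SELECT * FROM orders", reference))
```

Note that BLEU only rewards surface overlap: a query that differs in a single literal still scores high, and a semantically equivalent query written differently can score low, so it is best read as a rough proxy for SQL correctness.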

GNN #1

In [None]:
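# The 4-bit quantized weights are frozen, so only the trainable parameters are passed to the optimizer
mistral_optimizer = optim.Adam([p for p in mistral_model.parameters() if p.requires_grad], lr=5e-5)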

# Training Loop
num_epochs = 3  # Or your desired number of epochs
for epoch in tqdm(range(num_epochs), desc="Overall Training Progress"):
    train(model, mistral_model, train_loader, optimizer, mistral_optimizer, epoch, num_epochs)
    evaluate(model, mistral_model, val_loader, bleu)


Overall Training Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1/3:   0%|          | 0/2500 [00:00<?, ?it/s]
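
In causal language-model fine-tuning, it is common to compute the loss only on the target tokens by setting the label of every prompt and padding position to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that masking on plain Python lists. `build_labels` and the token IDs are made up for illustration and are not taken from this notebook.

```python
IGNORE_INDEX = -100  # Default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, answer_ids, pad_id, max_length):
    """Left-pad prompt + answer to max_length and mask everything except the answer."""
    sequence = prompt_ids + answer_ids
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    num_pad = max_length - len(sequence)
    input_ids = [pad_id] * num_pad + sequence
    attention_mask = [0] * num_pad + [1] * len(sequence)
    # Only the answer positions keep their token IDs as labels
    labels = [IGNORE_INDEX] * (num_pad + len(prompt_ids)) + answer_ids
    return input_ids, attention_mask, labels


# Hypothetical token IDs: three prompt tokens followed by two answer tokens
input_ids, attention_mask, labels = build_labels([11, 12, 13], [21, 22], pad_id=0, max_length=8)
print(input_ids)
print(attention_mask)
print(labels)
```

With labels built this way, the loss measures only how well the model predicts the SQL answer, not how well it reproduces the question and schema.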

In [None]:
model = SQLGNN(input_dim, hidden_dim, output_dim)  # Recreate the model architecture
# Load the parameters saved after the last completed epoch
model.load_state_dict(torch.load(f"gnn_model_epoch_{epoch+1}.pth", map_location=device))
model.to(device)  # Move the model to the desired device (e.g., GPU)

Key Improvements:

1.  **Model Saving:** The GNN model is saved after each epoch.

2.  **Best Model Selection:** The evaluate function now takes an epoch argument and returns the BLEU score. The training loop keeps track of the best epoch based on validation BLEU and loads the corresponding GNN model before the final evaluation.

3.  **Final Evaluation:** After training, the best GNN model is loaded and used for the final evaluation on the validation set.
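
The best-model selection described above can be sketched as follows. This is a minimal illustration, assuming a per-epoch evaluation callable that returns a validation BLEU score. `select_best_epoch` and the example scores are hypothetical, and the checkpoint name simply mirrors the `gnn_model_epoch_{n}.pth` pattern used in this notebook.

```python
def select_best_epoch(num_epochs, evaluate_epoch):
    """Run evaluate_epoch(epoch) for each epoch and return (best_epoch, best_score)."""
    best_epoch, best_score = None, float("-inf")
    for epoch in range(1, num_epochs + 1):
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Hypothetical validation BLEU scores, standing in for real evaluation runs
example_scores = {1: 0.41, 2: 0.57, 3: 0.52}

best_epoch, best_score = select_best_epoch(3, example_scores.get)
best_checkpoint = f"gnn_model_epoch_{best_epoch}.pth"
print(best_epoch, best_score, best_checkpoint)
```

In a real run, `evaluate_epoch` would train for one epoch, save the checkpoint, and return the validation score, so that `best_checkpoint` names a file that exists on disk.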

In [None]:
# Final evaluation on the validation set, using the GNN weights loaded above
evaluate(model, mistral_model, val_loader, bleu)