# Initial Exploration of Data Modeling Copilot

Here we test the components of this project locally, including (but not limited to) the local models downloaded from Hugging Face, the creation of a persistent vector store, and the creation of a knowledge graph.

### **Setup**

In [1]:
# Imports and Setup
import sys
sys.path.append('..')
import pandas as pd
import numpy as np
import torch
from pathlib import Path
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import chromadb

# Base directory for the locally downloaded models
MODEL_PATH = '../models/'

### **Validate Major Components**

Now we will validate the pieces of our puzzle (this is a non-exhaustive list):

1. Can we load the model locally?
2. Do both the LLM (flan-t5-base) and the embedding model (all-mpnet-base-v2) 'work'? (They don't need to be perfect; we will improve performance iteratively.)
3. Can we create a vector store?
4. Can we create a knowledge graph?
5. Can we have these components working together within an API?


In [12]:
def test_flan_t5_basic():
    """
    Basic test of FLAN-T5 model functionality.
    Tests loading, tokenization, and simple inference.
    
    Returns:
        tuple: (model, tokenizer) if successful
    """
    print("=== FLAN-T5 Basic Test ===")
    
    # 1. Setup
    MODEL_PATH = '../models/flan-t5-base'
    model_dir = Path(MODEL_PATH)
    print(f"Loading from: {model_dir.absolute()}")
    
    # 2. Load Model & Tokenizer
    try:
        print("\nLoading tokenizer and model...")
        tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
        model = AutoModelForSeq2SeqLM.from_pretrained(str(model_dir))
        print("✓ Model and tokenizer loaded successfully")
        
        # 3. Move to MPS if available
        if torch.backends.mps.is_available():
            model = model.to('mps')
            print("✓ Model moved to MPS device")
            print(f"Device: {next(model.parameters()).device}")
    except Exception as e:
        print(f"× Error in setup: {e}")
        return None, None
    
    # 4. Test Inference
    try:
        print("\nTesting inference...")
        
        # Test cases
        test_cases = [
            "Convert column name 'model_score' to standard format.",
            "Transform 'dt_last_active' to proper column name.",
            "Standardize column name: 'cust_id'"
        ]
        
        for test_input in test_cases:
            print(f"\nInput:  {test_input}")
            
            # Tokenize
            inputs = tokenizer(test_input, return_tensors="pt")
            # Move inputs to whichever device the model is on (MPS or CPU)
            inputs = inputs.to(model.device)
            
            # Generate
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_length=128,
                    num_beams=1,
                    do_sample=False
                )
            
            # Decode
            result = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"Output: {result}")
            
        print("\n✓ Inference test completed")
        return model, tokenizer
        
    except Exception as e:
        print(f"× Error during inference: {e}")
        return None, None

# Run test
if __name__ == "__main__":
    model, tokenizer = test_flan_t5_basic()

=== FLAN-T5 Basic Test ===
Loading from: /Users/adityapallipati/Developer/work/data-modeling-copilot/notebooks/../models/flan-t5-base

Loading tokenizer and model...
✓ Model and tokenizer loaded successfully
✓ Model moved to MPS device
Device: mps:0

Testing inference...

Input:  Convert column name 'model_score' to standard format.
Output: Model

Input:  Transform 'dt_last_active' to proper column name.
Output: 'dt_last_active'

Input:  Standardize column name: 'cust_id'
Output: 'cust_id'

✓ Inference test completed


It looks like flan-t5-base performs poorly: it returned only "Model" for the first input and echoed the other two column names unchanged. In an ideal world, a simple API call to one of the big players in the LLM space (Gemini, Claude, or OpenAI) would have no problem with this task. The goal of this project, however, is to work within a strict set of parameters: there may be security concerns with using an open-source tool, and this is not just an individual creating software for their own personal use.

We want to work with an LLM that is "approved" for use, possibly fine-tune it for our use case, and use the tools available to a developer working within specific constraints. These constraints aren't necessarily a bad thing. As AI/ML engineers, our task is to solve AI/ML problems efficiently, whether that means setting up the infrastructure for a model in a production deployment or applying these tools to real-world use cases such as an internal knowledge base or fraud detection. A hobbyist working with LLMs has the freedom to choose any tool without much thought for scalability, reliability, and observability. Thinking beyond the hobbyist's mindset and considering the broader implications of using LLMs in an enterprise is a valuable skill to develop, and one we intend to build with this project.
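
To make "performs poorly" measurable as we iterate on prompts or fine-tuning, the outputs can be scored automatically. The following is a minimal sketch, not part of the project code: it assumes the target convention is UPPER_SNAKE_CASE (borrowed from one of our guideline sentences), and `score_output` and the regular expression are hypothetical helpers introduced only for illustration. The `observed` pairs are copied from the FLAN-T5 run above.

```python
import re

# Assumed target convention: UPPER_SNAKE_CASE (an assumption for this sketch).
UPPER_SNAKE_CASE = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def score_output(source_name: str, model_output: str) -> dict:
    """Classify one model output against the assumed naming convention."""
    # Drop surrounding whitespace and quotes that the model may echo back.
    cleaned = model_output.strip().strip("'\"")
    return {
        "source": source_name,
        "output": model_output,
        "echoed_input": cleaned == source_name,
        "valid_format": UPPER_SNAKE_CASE.fullmatch(cleaned) is not None,
    }


# (source column name, model output) pairs copied from the run above
observed = [
    ("model_score", "Model"),
    ("dt_last_active", "'dt_last_active'"),
    ("cust_id", "'cust_id'"),
]

results = [score_output(src, out) for src, out in observed]
pass_rate = sum(r["valid_format"] for r in results) / len(results)

for r in results:
    print(r)
print(f"Valid-format rate: {pass_rate:.0%}")
```

Tracking the echo rate separately from the format check helps distinguish a model that ignores the instruction from one that attempts a conversion and gets it wrong.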

In [14]:
from sentence_transformers import SentenceTransformer

def test_mpnet_basic():
    """
    Basic test of MPNet model functionality.
    Tests loading and embedding generation with verification of output dimensions.
    
    Returns:
        SentenceTransformer: embedding_model if successful
    """
    print("=== MPNet Basic Test ===")
    
    # 1. Setup
    MODEL_PATH = '../models/all-mpnet-base-v2'
    model_dir = Path(MODEL_PATH)
    print(f"Loading from: {model_dir.absolute()}")
    
    # 2. Load Model
    try:
        print("\nLoading MPNet model...")
        embedding_model = SentenceTransformer(str(model_dir))
        print("✓ Model loaded successfully")
        
    except Exception as e:
        print(f"× Error loading model: {e}")
        return None
    
    # 3. Test Embeddings
    try:
        print("\nTesting embedding generation...")
        
        # Test cases that mirror our domain
        test_cases = [
            "Column name standards: Use UPPER_SNAKE_CASE for all database columns",
            "Data type rule: DATE columns must not allow future dates",
            "Naming convention: Add _PCT suffix for percentage columns",
            "Data modeling guideline: Primary keys should end with _ID"
        ]
        
        # Generate embeddings
        embeddings = embedding_model.encode(test_cases)
        
        # Verify embedding dimensions
        print("\nEmbedding dimensions:")
        print(f"Number of test cases: {len(test_cases)}")
        print(f"Embedding shape: {embeddings.shape}")
        print(f"Vector dimension: {embeddings.shape[1]}")  # Should be 768 for MPNet base
        
        # Test similarity (optional but useful)
        from sklearn.metrics.pairwise import cosine_similarity
        similarity_matrix = cosine_similarity(embeddings)
        print("\nSimilarity matrix shape:", similarity_matrix.shape)
        
        # Show example similarity
        print("\nExample similarities between first and other sentences:")
        for i, score in enumerate(similarity_matrix[0]):
            if i > 0:  # Skip self-similarity
                print(f"Similarity with sentence {i+1}: {score:.4f}")
        
        print("\n✓ Embedding test completed")
        return embedding_model
        
    except Exception as e:
        print(f"× Error during embedding generation: {e}")
        return None

# Run test
if __name__ == "__main__":
    embedding_model = test_mpnet_basic()

No sentence-transformers model found with name ../models/all-mpnet-base-v2. Creating a new one with mean pooling.


=== MPNet Basic Test ===
Loading from: /Users/adityapallipati/Developer/work/data-modeling-copilot/notebooks/../models/all-mpnet-base-v2

Loading MPNet model...
✓ Model loaded successfully

Testing embedding generation...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Embedding dimensions:
Number of test cases: 4
Embedding shape: (4, 768)
Vector dimension: 768

Similarity matrix shape: (4, 4)

Example similarities between first and other sentences:
Similarity with sentence 2: 0.2011
Similarity with sentence 3: 0.4276
Similarity with sentence 4: 0.3893

✓ Embedding test completed
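
The `huggingface/tokenizers` warning in this output names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed in the first notebook cell before any tokenizer is used; choosing `"false"` rather than `"true"` is an assumption made here to keep the output quiet.

```python
import os

# Set this before any tokenizer is used so the fork warning is not triggered.
# "false" disables tokenizer parallelism; whether that trade-off is acceptable
# for this project is an assumption.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```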


The model successfully generates 768-dimensional embeddings, and the similarity scores look sensible for data modeling text: the first sentence (a column naming standard) scores 0.4276 against the `_PCT` naming convention and 0.3893 against the primary-key guideline, but only 0.2011 against the data type rule. With only four sentences this is a sanity check rather than an evaluation, but it suggests MPNet can separate naming guidance from data type rules, which is what our GraphRAG implementation needs in order to retrieve relevant naming conventions and guidelines. We will still need to structure our knowledge base carefully to take advantage of these semantic relationships. One caveat: the loader reported that it found no sentence-transformers model at the local path and created a new one with mean pooling, so the local download is worth re-checking. The next step is to test this with ChromaDB and our actual guidelines.
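
Before wiring in ChromaDB, it may help to see the retrieval step in isolation. The sketch below is an in-memory stand-in, not the ChromaDB API: `cosine` and `top_k` are hypothetical helpers, and the 3-dimensional vectors are toy values standing in for the real 768-dimensional embeddings that `embedding_model.encode` would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
docs = ["naming: _PCT suffix", "data type: DATE rule", "naming: _ID suffix"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.8, 0.2, 0.1]]
query_vec = [1.0, 0.0, 0.0]  # a "naming convention" style query

for score, doc in top_k(query_vec, doc_vecs, docs):
    print(f"{score:.4f}  {doc}")
```

A vector store performs the same ranking at scale and adds persistence and metadata filtering, which is why the structure of the knowledge base matters as much as the embedding quality.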