## Context2Canvas - AI ggplot2 Code Generation

Brief Elevator Pitch of our project 

## Set up the functions for prompting

In [1]:
from openai import OpenAI
client = OpenAI()

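def prompt_model(prompt):
    """Send a single user prompt to the chat model and return the reply text."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    # Only the first choice is used
    return completion.choices[0].message.content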

## Chunk data and generate indices (setting up vector database)

User queries are matched against index entries that stand in for the text chunks used to answer them. For this assignment, you may chunk the text and then prompt the model to generate questions that the text can answer. The generated questions then serve as the "documents" stored in the vector database.
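
The question-generation step described above could be sketched as follows, assuming the text has already been split into chunks. `generate_question_index` and its `ask` parameter are hypothetical names introduced for illustration; `ask` stands in for any function that takes a prompt string and returns the model's reply. The prompt wording and the one-question-per-line reply format are also assumptions.

```python
from typing import Callable, Dict, List, Tuple


def generate_question_index(
    chunks: List[str],
    ask: Callable[[str], str],
    questions_per_chunk: int = 3,
) -> Tuple[List[str], Dict[str, int]]:
    """Return generated questions plus a map from question id to chunk index.

    The id "q<i>" refers to the question at position i in the returned list.
    """
    questions: List[str] = []
    question_to_chunk: Dict[str, int] = {}
    for chunk_index, chunk in enumerate(chunks):
        prompt = (
            f"Write {questions_per_chunk} questions that can be answered "
            "using only the text below. Return one question per line.\n\n"
            f"{chunk}"
        )
        reply = ask(prompt)
        # Assumption: the model follows the one-question-per-line format.
        for line in reply.splitlines():
            question = line.strip()
            if not question:
                continue  # skip blank lines in the reply
            question_id = f"q{len(questions)}"
            questions.append(question)
            question_to_chunk[question_id] = chunk_index
    return questions, question_to_chunk
```

The returned questions can be stored as documents under their `q<i>` ids, while the dictionary keeps the link back to the source chunk.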

In [None]:
#Load the data

import json
with open('annotations.json', 'r') as f:
    data = json.load(f)



{'models': [{'name': 'bars', 'labels': ['Coral', 'Medium Purple', 'Purple', 'Dark Turquoise', 'Light Green', 'Orange Red', 'Saddle Brown'], 'width': 0.5, 'colors': ['#FF7F50', '#9370DB', '#A020F0', '#00CED1', '#90EE90', '#FF4500', '#8B4513'], 'bboxes': [{'y': 42, 'x': 81, 'w': 49, 'h': 217}, {'y': 63, 'x': 180, 'w': 49, 'h': 196}, {'y': 84, 'x': 279, 'w': 49, 'h': 175}, {'y': 105, 'x': 377, 'w': 49, 'h': 154}, {'y': 126, 'x': 476, 'w': 49, 'h': 133}, {'y': 147, 'x': 574, 'w': 49, 'h': 111}, {'y': 169, 'x': 673, 'w': 49, 'h': 90}], 'y': [99.0, 89.36864203865082, 79.73728407730162, 70.10592611595243, 60.47456815460325, 50.84321019325406, 41.21185223190487], 'x': ['Coral', 'Medium Purple', 'Purple', 'Dark Turquoise', 'Light Green', 'Orange Red', 'Saddle Brown']}], 'type': 'vbar_categorical', 'general_figure_info': {'title': {'text': 'title', 'bbox': {'y': 9, 'x': 56.77843750000001, 'w': 50.234375, 'h': 16}}, 'x_axis': {'minor_ticks': {'values': [], 'bboxes': []}, 'major_labels': {'values'

In [6]:
# Import LangChain's character-based text splitter
from langchain_text_splitters import CharacterTextSplitter

# RecursiveCharacterTextSplitter may parse the data better; left for a future update
# from langchain_text_splitters import RecursiveCharacterTextSplitter


In [7]:
#set up text splitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2500,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)


In [33]:
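# Extract the id, figure annotation text, and chart type from each record
# (the text splitter configured above is not applied in this cell)
ids = [str(record["image_index"]) for record in data]
annotations = [str(record["general_figure_info"]) for record in data]
types = [record["type"] for record in data]

# Sanity check: all three lists must have the same length
print(len(annotations) == len(types) == len(ids))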

True


## Build the vector database

When building the vector database, be sure to maintain a mapping between the generated questions and the chunks that can be used later to retrieve the chunks from the most similar indices to the user query provided.

You may also add the function to query the vector database that you will use later.
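
A query function along these lines might look like the sketch below. `retrieve_chunks` is a hypothetical helper, and `question_to_chunk` is an assumed dictionary that maps each stored question id to the index of its source chunk. The only Chroma call used is `collection.query`, which returns one list of ids per query text.

```python
from typing import Any, Dict, List


def retrieve_chunks(
    collection: Any,
    query: str,
    question_to_chunk: Dict[str, int],
    chunks: List[str],
    n_results: int = 3,
) -> List[str]:
    """Return the chunks behind the stored questions most similar to `query`."""
    results = collection.query(query_texts=[query], n_results=n_results)
    # Chroma returns one list of ids per query text; only one query is sent.
    matched_ids = results["ids"][0]
    retrieved: List[str] = []
    seen = set()
    for question_id in matched_ids:
        chunk_index = question_to_chunk[question_id]
        # Several questions can point at the same chunk; keep each chunk once.
        if chunk_index not in seen:
            seen.add(chunk_index)
            retrieved.append(chunks[chunk_index])
    return retrieved
```

Deduplicating by chunk index keeps the retrieved context short when several matched questions were generated from the same chunk.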

In [None]:
#setting up the collection
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="c2c")



In [None]:
#add data to collection
collection.add(
    documents = annotations,
    ids = ids
)

ValueError: Expected metadata to be a dict or None, got str as metadata in add.
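
The error message suggests that a plain string reached Chroma where a metadata dict was expected. The cell as shown does not pass `metadatas`, so the sketch below assumes an earlier version passed the `types` list directly. `build_add_payload` is a hypothetical helper, and the `"type"` metadata key is an assumption.

```python
from typing import Any, Dict, List


def build_add_payload(
    ids: List[str], documents: List[str], types: List[str]
) -> Dict[str, List[Any]]:
    """Build keyword arguments for collection.add, one metadata dict per record."""
    if not (len(ids) == len(documents) == len(types)):
        raise ValueError("ids, documents, and types must have the same length")
    return {
        "ids": ids,
        "documents": documents,
        # Each metadata entry must be a dict, so wrap the chart type string.
        "metadatas": [{"type": chart_type} for chart_type in types],
    }
```

With the lists from the earlier cell, the call would then be `collection.add(**build_add_payload(ids, annotations, types))`.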

## Analyze Input Data

In [None]:
input_data = None  # placeholder: data inserted by the user
standards = None   # placeholder: summary-statistic standards, not yet defined

# What are the variables associated with this data?
variables = prompt_model(f"Using {input_data}, what are the variables associated with this dataset? Return them as a Python dictionary, in the format of 'variable_name: data_type'. Do not return any dialogue besides the Python dictionary.")

# Generate summary statistics
summary_stats = prompt_model(f"Using {input_data}, generate summary statistics of each variable according to these standards: {standards}")

# Create an interesting data question based on the variables + summary statistics
# (the "process for creating data questions" the prompt refers to is not defined yet)
data_question = prompt_model(f"Using this process for creating data questions, create an interesting data question based on these variable types: {variables} and these summary statistics: {summary_stats}")

# Potential idea: ask the user for manual input of a data question
# Provide an option for the model to select a data question on its own

# Gather the visualization type from the data question
viz_type = prompt_model(f"Given this data question: {data_question}, what is the best visualization class we should use to characterize this problem?")


## Use VectorDB to query similar examples for few-shot learning

In [None]:
# `node` is assumed to hold the user's query text; it is not defined above
results = collection.query(
    query_texts=[node["query"]],  # Chroma will embed this for you
    n_results=3,  # how many results to return
)
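
One way the retrieved examples could feed a few-shot prompt is sketched below. `build_few_shot_prompt` is a hypothetical helper and the prompt wording is an assumption. It relies only on the shape of Chroma's query result, where `results["documents"]` holds one list of documents per query text.

```python
from typing import Any, Dict, List


def build_few_shot_prompt(data_question: str, results: Dict[str, Any]) -> str:
    """Combine retrieved annotation examples and a data question into one prompt."""
    # Chroma nests results per query text; index 0 is the single query sent.
    examples: List[str] = results["documents"][0]
    parts = ["Here are annotations of similar figures:"]
    for number, example in enumerate(examples, start=1):
        parts.append(f"Example {number}:\n{example}")
    parts.append(
        "Using these examples as a guide, write ggplot2 code that answers "
        f"this data question: {data_question}"
    )
    return "\n\n".join(parts)
```

The resulting string can be passed to a prompting function such as `prompt_model` to request the plotting code.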


## Generate the graph's code based on the collected examples


In [38]:
from bert_score import BERTScorer

scorer = BERTScorer(model_type='bert-base-uncased')