## Context2Canvas - AI ggplot2 Code Generation

Brief Elevator Pitch of our project 

## Setup the functions for prompting

In [1]:
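from openai import OpenAI

#the client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def prompt_model(prompt):
    """Send a single user prompt to the chat model and return the reply text."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content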

## Chunk data and generate indices (setting up vector database)

User queries are matched against index entries that stand in for the text chunks used to compose an answer. For this assignment, you may chunk the text and then prompt the model to generate questions that the text can answer. The generated questions then serve as the "documents" stored in the vector database.
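A minimal sketch of the question-generation step described above. The function name `build_question_index`, the `ask` parameter, and the record layout are assumptions made for illustration; in this notebook `ask` would be `prompt_model`, and the stub below only stands in for it so the example runs offline.

```python
from typing import Callable

def build_question_index(chunks: list[str], ask: Callable[[str], str], per_chunk: int = 3) -> list[dict]:
    """Return one record per generated question, each pointing back to its source chunk."""
    records = []
    for chunk_id, chunk in enumerate(chunks):
        reply = ask(
            f"Write {per_chunk} questions that can be answered using only the text below. "
            f"Return one question per line with no numbering.\n\n{chunk}"
        )
        # Keep non-empty lines only; the model may pad its reply with blank lines.
        questions = [line.strip() for line in reply.splitlines() if line.strip()]
        for question in questions[:per_chunk]:
            records.append({"question": question, "chunk_id": chunk_id})
    return records

# Hypothetical stub standing in for prompt_model.
def fake_ask(prompt: str) -> str:
    return "What does the x-axis show?\n\nWhich series is largest?\n"

index = build_question_index(["chunk A", "chunk B"], fake_ask, per_chunk=2)
print(index[0])
```

Storing `chunk_id` next to each question is what later lets a matched question lead back to the text it was generated from.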

In [None]:
#Load the data

import json
with open('annotations.json', 'r') as f:
    data = json.load(f)



In [6]:
#import langchain chunking library
from langchain_text_splitters import CharacterTextSplitter

#from langchain_text_splitters import RecursiveCharacterTextSplitter 
#we can use better parsing techniques in a future update


In [7]:
#set up text splitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2500,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)


In [40]:
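#collect the figure annotations and their chart types (the text splitter above is not applied yet)
annotations = [str(record["general_figure_info"]) for record in data]
types = [record["type"] for record in data]

#sanity check: there should be exactly one type per annotation
print(len(annotations) == len(types))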

True


## Build the vector database

When building the vector database, maintain a mapping between each generated question and the chunk it came from. At query time, this mapping is what lets you retrieve the chunks behind the entries most similar to the user's query.

You may also add the function to query the vector database that you will use later.
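One way to keep that mapping is to store the chunk index as metadata on every question. The sketch below is an illustration under assumptions: the record layout and both helper names are hypothetical, and the `results` dictionary is hand-written to mimic the shape that Chroma's `collection.query` is expected to return (one inner list per query text).

```python
def to_collection_rows(records):
    """Turn question records into parallel lists of ids, documents, and metadatas."""
    ids = [f"q{i}" for i in range(len(records))]
    documents = [record["question"] for record in records]
    metadatas = [{"chunk_id": record["chunk_id"]} for record in records]
    return ids, documents, metadatas

def chunks_for_query(results, chunks):
    """Resolve a query result to its distinct source chunks, best match first."""
    seen, hits = set(), []
    # Index 0 selects the matches for the first (here, only) query text.
    for metadata in results["metadatas"][0]:
        chunk_id = metadata["chunk_id"]
        if chunk_id not in seen:
            seen.add(chunk_id)
            hits.append(chunks[chunk_id])
    return hits

chunks = ["bar chart annotation", "line plot annotation"]
records = [
    {"question": "Which bar is tallest?", "chunk_id": 0},
    {"question": "What trend does the line show?", "chunk_id": 1},
    {"question": "How many bars are there?", "chunk_id": 0},
]
ids, documents, metadatas = to_collection_rows(records)

# Hand-written stand-in for a query result that matched questions q2, q0, q1.
fake_results = {"metadatas": [[metadatas[2], metadatas[0], metadatas[1]]]}
print(chunks_for_query(fake_results, chunks))
```

The three lists could then be passed to `collection.add(ids=ids, documents=documents, metadatas=metadatas)`.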

In [None]:
#setting up the collection
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="c2c")



In [None]:
#add data to collection
collection.add(
    documents = annotations[:1000], 
    ids = [str(i) for i in range(len(annotations[:1000]))]
)

## Analyze Input Data

In [None]:
#collect user input data

import pandas as pd

filepath = input("Enter the file path of the dataset you want to analyze: ")
df = pd.read_csv(filepath)

In [1]:
#outline model for summary statistics (in the future, I'll put this in a separate json file)

statStandards = [{
  "Min": None,
  "FirstQuartile": None,
  "Median": None,
  "Mean": None,
  "ThirdQuartile": None,
  "Max": None
}]
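The six fields in this template can also be computed locally instead of being requested from the model. This is a minimal sketch using only the standard library; the helper name `summarize` is hypothetical, and `statistics.quantiles` needs at least two values on most Python versions.

```python
import statistics

def summarize(values):
    """Fill the six statStandards fields for one numeric column."""
    data = sorted(values)
    # n=4 yields the three quartile cut points; "inclusive" treats the data as the full population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Min": data[0],
        "FirstQuartile": q1,
        "Median": median,
        "Mean": statistics.fmean(data),
        "ThirdQuartile": q3,
        "Max": data[-1],
    }

stats = summarize([1, 2, 3, 4, 5])
print(stats)
```

Computing the numbers in code keeps them exact, so the model only has to interpret them rather than do arithmetic.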

#Supported Graph Classes (I will move this to a separate file in the future as well)

supportedClasses = ["line-plots",
                    "dot-plots",
                    "vertical-bar-graph",
                    "horizontal-bar-graphs",
                    "pie-charts"]

In [None]:
# what are the column names associated with this dataset?
# note: {df} interpolates the DataFrame's printed form, which pandas truncates for large tables
columns = prompt_model(f"What are the variables associated with this dataset: {df}? Return them as a python dictionary, in the format of 'variable_name: data_type'. Do not return any dialogue besides the python dictionary.")

# generate summary statistics
summaryStats = prompt_model(f"Using {df}, generate summary statistics of each variable according to these {statStandards}. Do not return any dialogue besides the summary statistics.")

# create an interesting data question based on the variables + summary statistics
question = prompt_model(f"Create an interesting data question based on {columns} and {summaryStats}. Do not return anything besides the data question.")

# gather visualization type from data question
vizType = prompt_model(f"What is the best visualization class we should use to characterize this problem, given {question}, {columns}, and {summaryStats}? Do not return anything besides the visualization type. Only return a type listed in {supportedClasses}.")
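The model is asked to return only a supported class, but nothing enforces it. Here is a small sketch of a check that could run on `vizType` before it is used; the helper name `normalize_viz_type` and its fallback behaviour are assumptions, not part of the pipeline above.

```python
SUPPORTED = ["line-plots", "dot-plots", "vertical-bar-graph", "horizontal-bar-graphs", "pie-charts"]

def normalize_viz_type(reply, supported=SUPPORTED, default=None):
    """Map a free-text model reply onto one supported class, or return default."""
    # Trim whitespace and stray quoting, then unify case and separators.
    cleaned = reply.strip().strip("'\"`.").lower().replace(" ", "-").replace("_", "-")
    if cleaned in supported:
        return cleaned
    # Fall back to the first supported name mentioned anywhere in the reply.
    for name in supported:
        if name in cleaned:
            return name
    return default

print(normalize_viz_type(" 'Pie-Charts'.\n"))
print(normalize_viz_type("I suggest line plots for this"))
print(normalize_viz_type("heatmap"))
```

A reply that cannot be mapped returns `default`, which gives the caller a clear point to re-prompt or stop.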


## Use VectorDB to query similar examples for few-shot learning

In [None]:
results = collection.query(
    query_texts=[question], # a list of strings; Chroma will embed this for you
    n_results=3 # how many results to return
)
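The retrieved annotations can then be folded into a few-shot prompt for the code-generation step. This is only a sketch: the function name, the prompt wording, and the hand-written `fake_results` are assumptions, with `fake_results` mimicking the expected layout of a Chroma query result (one inner list of documents per query text).

```python
def build_few_shot_prompt(question, viz_type, results, max_examples=3):
    """Compose a ggplot2 code-generation prompt from retrieved example annotations."""
    # Index 0 selects the documents matched for the first (here, only) query text.
    examples = results["documents"][0][:max_examples]
    lines = ["Here are annotations of similar figures:"]
    for number, example in enumerate(examples, start=1):
        lines.append(f"Example {number}: {example}")
    lines.append(f"Question: {question}")
    lines.append(
        f"Write R code using ggplot2 that answers the question with a {viz_type} figure. "
        "Return only the code."
    )
    return "\n".join(lines)

# Hand-written stand-in for a query result.
fake_results = {"documents": [["{'title': 'Sales by year'}", "{'title': 'Rainfall by month'}"]]}
prompt = build_few_shot_prompt("How did sales change over time?", "line-plots", fake_results)
print(prompt)
```

The resulting string could be passed to `prompt_model` to request the graph code.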


## Generate the graph's code based on the collected examples


In [38]:
from bert_score import BERTScorer

scorer = BERTScorer(model_type='bert-base-uncased')