In [46]:
# Install the Hugging Face datasets library and the LlamaIndex packages, which are not preinstalled in Colab
!pip install datasets
!pip install llama-index llama-index-core llama-index-multi-modal-llms-anthropic



In [20]:
# Authenticate with Hugging Face; a user access token is required
!huggingface-cli login


    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGr

In [47]:
# Necessary Imports
import pandas as pd
from datasets import load_dataset
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core import SimpleDirectoryReader, Settings
from google.colab import userdata

# Importing "The Cauldron" dataset from Hugging Face and converting to pandas
# https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
# After conversion to pandas, each image is stored as raw encoded bytes (PNG) rather than as an image object
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
print(type(ds))
df = ds['train'].to_pandas()

<class 'datasets.dataset_dict.DatasetDict'>


   images                                              texts
0  [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIH...  [{'user': 'Question: What do respiration and c...
1  [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIH...  [{'user': 'Question: From the given food web, ...
2  [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIH...  [{'user': 'Question: Anatomy One of a series o...
3  [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIH...  [{'user': 'Question: What process does this di...
4  [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIH...  [{'user': 'Question: If the Termites in the co...
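
Each entry in the images column of the preview above is a list of dicts whose `bytes` field holds the encoded image; the `\x89PNG` prefix is the PNG file signature. The following is a minimal sketch, assuming that layout, of writing one dataset image to disk so it can be loaded with `SimpleDirectoryReader` in the same way as a user upload. The helper name `save_image_bytes`, the file stem, and the stand-in record are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_image_bytes(image_record: dict, out_dir: str, stem: str = "diagram") -> Path:
    """Write the raw bytes of one image record to disk and return the path.

    Assumes image_record looks like {'bytes': b'...'}, as in the preview.
    """
    data = image_record["bytes"]
    # Choose the extension from the signature; fall back to .bin if unknown.
    suffix = ".png" if data.startswith(PNG_SIGNATURE) else ".bin"
    path = Path(out_dir) / f"{stem}{suffix}"
    path.write_bytes(data)
    return path


# Stand-in record for illustration only. With the real DataFrame this would
# presumably be df.loc[0, "images"][0] (an assumption about the cell layout).
fake_record = {"bytes": PNG_SIGNATURE + b"placeholder-not-a-real-image"}
out_dir = tempfile.mkdtemp()
saved_path = save_image_bytes(fake_record, out_dir)
print(saved_path.name)
```

The returned path could then be passed in the `input_files` list of `SimpleDirectoryReader` to test the pipeline against a diagram from the dataset itself.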


In [48]:
# Initialization of Anthropic multi-modal model, using Claude 3.5 Sonnet
# Anthropic API key is stored as a secret in Google Colab
# max_tokens caps the length of the generated response; 8192 is the output-token limit for this model
llm = AnthropicMultiModal(model="claude-3-5-sonnet-20241022",
                          max_tokens=8192,
                          api_key=userdata.get('ANTHROPIC_API_KEY'))
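
`userdata.get` is only available inside Google Colab. As a sketch for running the same notebook elsewhere, the key lookup could fall back to an environment variable. The helper name `get_api_key` is hypothetical, and reading the key from `os.environ` is an assumption about how it would be stored outside Colab.

```python
import os


def get_api_key(name: str = "ANTHROPIC_API_KEY") -> str:
    """Return a secret from Colab if available, otherwise from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; try the environment.
            key = None
        if key:
            return key

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} was not found in Colab secrets or the environment")
    return key


# Hypothetical usage with the initialization above:
# llm = AnthropicMultiModal(model="claude-3-5-sonnet-20241022",
#                           max_tokens=8192,
#                           api_key=get_api_key())
```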

In [49]:
# Initializing directory reader for user image input; uses Colab local file path
image_store = SimpleDirectoryReader(input_files=['/content/[INSERT FILE NAME]']).load_data()

In [53]:
# Full query to Claude, provides context dataset and response format
query = f"""
The user has input an arbitrary diagram (as an image) relating to some topic,
which is attached to this query. I am providing you with a dataset for context,
in the form of a Pandas DataFrame, at the end of this query.

The dataset consists of two columns: diagrams and questions. The diagrams
column contains image diagrams (in the form of bytes) similar to the one the
user has input. The questions column contains a multiple-choice exam question
with answer choices and a correct answer, which requires the use of the diagram
to answer.

Your task is to generate an exam question in a similar style and format
to those in the dataset. Your response should only contain the question,
the options for the answer, and an explanation of why one specific answer is
correct. The question can require knowledge outside of what is apparent in the
image, as long as it relates to the same topic. Ensure that all information used
in both the question and explanation is correct.

Below is an example format for a response:

//BEGIN SAMPLE RESPONSE//

Question: (Insert Question Here)
A. (Insert Option A Here)
B. (Insert Option B Here)
C. (Insert Option C Here)
D. (Insert Option D Here)
Correct Answer & Explanation: (Insert correct answer and explanation)

//END SAMPLE RESPONSE//

The explanation of the correct answer can be as detailed or simple as you feel
is necessary for the user to understand the reasoning. Use any required
external knowledge to develop your explanation. Each incorrect answer
option should be reasonably similar to the correct answer while remaining
incorrect; for example, in a question about one topic, all answer options
should be related to the topic at hand.

Do not acknowledge any of the requests I have made in your response. The
response should be directly to the user in the format specified, with no
additional text before or after.

Context Dataset: {df}
"""
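
Note that interpolating `{df}` inserts the DataFrame's printed representation, which pandas abbreviates under its default display options (rows elided, cells cut off with `...` as in the preview earlier), so the model sees only fragments of a few questions. One possible alternative, sketched below, is to build a plain-text context from a handful of complete question entries. The function name `build_text_context` and the sample rows are hypothetical; the `assistant` key is an assumption, since only `user` is visible in the preview.

```python
def build_text_context(rows, max_examples=5):
    """Join the first few question/answer pairs into a plain-text block.

    Assumes each row is a sequence of dicts with a 'user' key (the question)
    and an 'assistant' key (the answer). The 'assistant' key is an assumption.
    """
    examples = []
    for row in rows:
        for turn in row:
            if len(examples) == max_examples:
                return "\n\n".join(examples)
            question = str(turn.get("user", "")).strip()
            answer = str(turn.get("assistant", "")).strip()
            if question:
                examples.append(f"{question}\n{answer}".rstrip())
    return "\n\n".join(examples)


# Made-up stand-in rows for illustration; with the real data the argument
# would presumably be df["texts"].
sample_rows = [
    [{"user": "Question: Which label marks the root?\nChoices:\nA. X\nB. Y",
      "assistant": "Answer: A"}],
    [{"user": "Question: What eats the grass?\nChoices:\nA. Hawk\nB. Rabbit",
      "assistant": "Answer: B"}],
]
context = build_text_context(sample_rows, max_examples=1)
print(context)
```

The resulting string could replace `{df}` in the query, which keeps the examples complete and the prompt size predictable.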

In [54]:
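# Function to wrap the previous prompt in chat-template markers before passing it to .complete()
# Note: these <|system|>/<|user|>/<|assistant|> tags follow a Zephyr-style template;
# Claude receives them as plain text, so the query could likely be passed unwrapped as well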
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

In [55]:
# Requests and prints response, using text query and user image as input
response = llm.complete(prompt=completion_to_prompt(query), image_documents=image_store)
print(response)
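
Because the query requests a fixed layout (a `Question:` line, options `A.` to `D.`, then `Correct Answer & Explanation:`), the response text can be split into fields for later use, such as rendering a quiz. This is a minimal sketch under the assumption that the model follows that layout exactly; `parse_exam_response` is a hypothetical helper, and the sample text is abbreviated from the example output of this notebook.

```python
import re


def parse_exam_response(text: str) -> dict:
    """Split a response in the requested format into question, options, and answer letter."""
    question = re.search(r"^Question:\s*(.+)$", text, flags=re.MULTILINE)
    answer = re.search(
        r"^Correct Answer & Explanation:\s*([A-D])\b", text, flags=re.MULTILINE
    )
    options = {}
    for letter, body in re.findall(r"^([A-D])\.\s*(.+)$", text, flags=re.MULTILINE):
        # Keep only the first line seen for each letter, so explanation
        # lines that happen to start with "A." do not overwrite an option.
        options.setdefault(letter, body.strip())

    if not (question and answer and len(options) == 4):
        raise ValueError("response does not match the expected format")
    return {
        "question": question.group(1).strip(),
        "options": options,
        "answer": answer.group(1),
    }


sample = """Question: Which part of the leaf is primarily responsible for transporting water and minerals from the petiole to other parts of the leaf?
A. Lamina
B. Margin
C. Midrib
D. Vein

Correct Answer & Explanation: C. Midrib
"""
parsed = parse_exam_response(sample)
print(parsed["answer"], parsed["options"][parsed["answer"]])
```

With the real output, the argument would be `str(response)`.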

//EXAMPLE OUTPUT FROM TEST IMAGE//
Question: Which part of the leaf is primarily responsible for transporting water and minerals from the petiole to other parts of the leaf?
A. Lamina
B. Margin
C. Midrib
D. Vein

Correct Answer & Explanation: C. Midrib

The midrib is the correct answer because it is the main central vein of the leaf that acts as the primary channel for water and mineral transport. It is a continuation of the petiole into the leaf blade and serves as the main "highway" from which smaller veins branch out. The midrib is typically thicker than other veins and runs from the base to the tip of the leaf, providing both structural support and serving as the main conduit for the transport of water, minerals, and other substances.

The other options are incorrect because:
- Lamina is the broad, flat portion of the leaf where photosynthesis occurs
- Margin is the outer edge or border of the leaf
- While veins do transport water and minerals, they branch from the midrib and serve 