## Set up the dog

The aim of this notebook is to simulate the decision-making of a robot dog in search of food. Its objective is "find and eat the food".

This is still very basic testing, so we are using a limited set of actions:

In [17]:
actions = ['go forwards', 'turn left', 'turn right', 'turn around', 'jump', 'beep', 'eat']

After each step, the dog will output data in the following schema:

In [18]:
vqa_output_schema = {
    "description": "short description of the image",
    "action": "action you chose to take",
    "reason": "why did you take that action?",
    "context": "explain your environment, what action you took, and why you took that action" # this will be useful in future testing when we create a memory for the dog
}
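
Because the model is free to return anything, it may help to check each parsed response against the action list and the schema keys before acting on it. The sketch below is a hypothetical helper: `validate_step`, `ALLOWED_ACTIONS` and `REQUIRED_KEYS` are not part of this notebook or of SceneXplain. It repeats the action list and schema keys so that it runs on its own.

```python
# Hypothetical helper, not part of the original notebook.
# The two constants mirror `actions` and the keys of `vqa_output_schema`.
ALLOWED_ACTIONS = ['go forwards', 'turn left', 'turn right', 'turn around', 'jump', 'beep', 'eat']
REQUIRED_KEYS = {"description", "action", "reason", "context"}


def validate_step(result: dict) -> list:
    """Return a list of problems found in a parsed response (empty if none)."""
    problems = []

    # Every key from the schema should be present.
    missing = REQUIRED_KEYS - set(result)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # The chosen action should be one the dog is allowed to take.
    action = result.get("action")
    if action is not None and action not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {action!r}")

    return problems
```

An empty list means the response can be used as-is; anything else could be logged, or the step retried.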

#### Why not have stuff like `take x steps`?

A dog cannot work out how many steps to take unless it has some "sense of physical self": how long is its stride, what is too high to step over, and so on. That could be rather difficult to "teach" it.

## Basic settings

Right now we don't have a robot dog on hand, so we are loading images of a maze from the following folder:

In [44]:
image_folder = "images"

## Set up basic functions

In [20]:
import http.client
import json
import base64
from pprint import pprint
from glob import glob
import os
import random

In [21]:
try:
    from dotenv import load_dotenv
    load_dotenv()
    print("Environment variables loaded from .env")
except ImportError:
    # python-dotenv is not installed: keep any key already set in the environment
    os.environ.setdefault('SCENEX_SECRET', "<your SceneXplain key>")

Environment variables loaded from .env


In [28]:
SCENEX_SECRET=os.getenv('SCENEX_SECRET')

scenex_headers = {
    "x-api-key": f"token {SCENEX_SECRET}",
    "content-type": "application/json",
}

ALGO = "Jelly"

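def image_to_data_uri(file_path):
    # Guess the MIME type from the file extension instead of always claiming PNG
    import mimetypes
    mime_type = mimetypes.guess_type(file_path)[0] or "image/png"
    with open(file_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
        return f"data:{mime_type};base64,{encoded_image}"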
        
def generate_scenex_data(image_files, json_schema=None, question=None, features=None):
    features = features or []  # avoid a shared mutable default argument
    data = {}
    data['data'] = []

    for file in image_files:
        cid = os.path.basename(file)  # file name without the folder, on any OS
        row = {
            "image": image_to_data_uri(file),
            "features": features,
            "algorithm": ALGO,
            "cid": cid
        }

        if question:
            row["question"] = question

        if json_schema:
            row["json_schema"] = json_schema

        data['data'].append(row)

    return data

def process_scenex(data):
    connection = http.client.HTTPSConnection("api.scenex.jina.ai")
    connection.request("POST", "/v1/describe", json.dumps(data), scenex_headers)
    response = connection.getresponse()
    response_data = response.read().decode("utf-8")
    
    connection.close()

    return json.loads(response_data)['result'][0]
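
`process_scenex` assumes every request succeeds and that the body always contains a non-empty `result` list. A minimal sketch of a more defensive parse is shown below. `extract_first_result` is a hypothetical helper, and treating anything other than HTTP 200 as a failure is an assumption about the API rather than documented behaviour.

```python
import json


def extract_first_result(status: int, body: str) -> dict:
    """Parse a response body and return the first entry of its 'result' list.

    Assumes the response shape the notebook already relies on:
    a JSON object with a non-empty 'result' list.
    """
    # Assumption: only HTTP 200 counts as success.
    if status != 200:
        raise RuntimeError(f"request failed with HTTP {status}: {body[:200]}")

    payload = json.loads(body)
    results = payload.get("result")
    if not results:
        raise RuntimeError(f"no 'result' in response: {body[:200]}")

    return results[0]
```

Inside `process_scenex`, `response.status` and the decoded body could be passed to this function in place of the final `json.loads(...)['result'][0]`.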

In [47]:
def load_images(folder_name):
    filetypes = ['jpg', 'jpeg', 'png']
    image_files = []

    for filetype in filetypes:
        image_files.extend(glob(f'{folder_name}/*.{filetype}'))

    return image_files

In [30]:
# Convert webp to jpeg since SceneXplain's API doesn't like webp
!find images -type f -name "*.webp" -exec mogrify -format jpeg {} \;

In [31]:
filetypes = ['jpg', 'jpeg', 'png']
image_files = []

for filetype in filetypes:
    image_files.extend(glob(f'{image_folder}/*.{filetype}'))

## VQA attempt

In [32]:
text = f"""
You are a robot dog. Your mission is to explore your environment to find and eat food. For each turn, you can choose one action from {str(actions)}.

Choose your action based on:
- The contents of the image
- Your mission
- Your previous experience

Return your output in the following JSON Schema:

{json.dumps(vqa_output_schema)}
""".strip()

In [48]:
images = load_images(image_folder)

In [50]:
# function to take a single step
def take_vqa_step(
    image: str,
    question: str,
    ):
    """
    image: path to image file
    question: VQA question to ask about the image
    """

    print(f"Selected image {image}")
    data = generate_scenex_data([image], question=question, features=['question_answer'])
    result = process_scenex(data)['text']

    output = {
        "image": image,
        "result": json.loads(result),
    }

    return output
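
`take_vqa_step` calls `json.loads` directly on the model's answer, which raises if the model wraps the JSON in extra text. A tolerant parse could look like the sketch below. `parse_json_answer` is a hypothetical helper, and the fallback (taking the outermost `{...}` span) is a simple heuristic rather than anything SceneXplain specifies.

```python
import json


def parse_json_answer(text: str):
    """Parse JSON from model output, tolerating text around a JSON object."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span from the first '{' to the last '}'.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like a JSON object; re-raise the error
        return json.loads(text[start:end + 1])
```

If neither attempt succeeds, the `json.JSONDecodeError` still propagates, so the caller can decide whether to retry the step or skip it.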

### Running in loop

In [51]:
def vqa_loop(images: list, question: str, turns: int = 5):
    """Show the dog a random image each turn until it eats or runs out of turns."""
    steps = []
    # history = "\n- You woke up in a strange place"

    for _ in range(turns):
        # question = question + history
        image = random.choice(images)

        step = take_vqa_step(image=image, question=question)
        print(f"I decide to {step['result']['action']}")

        steps.append(step)

        # the schema has no 'history' key; 'context' is the field meant for memory
        # history = f"\n- {step['result']['context']}"

        if step['result']['action'] == "eat":
            print("nom nom nom")
            break

    return steps
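
The prompt asks the dog to consider its previous experience, and the schema's `context` field is meant to feed a future memory, but the loop does not yet pass earlier steps back into the question. One way that could work is sketched below. `build_question`, its `max_items` parameter and the "Your previous experience:" wording are all hypothetical choices for illustration.

```python
def build_question(base_question: str, contexts: list, max_items: int = 3) -> str:
    """Append the most recent 'context' entries to the base prompt."""
    # Keep only the last few entries so the prompt does not grow without bound.
    recent = contexts[-max_items:] if max_items > 0 else []
    if not recent:
        return base_question

    history = "\n".join(f"- {c}" for c in recent)
    return f"{base_question}\n\nYour previous experience:\n{history}"
```

In the loop, each `step['result']['context']` could be collected into a list and the result of `build_question(question, contexts)` passed to `take_vqa_step`. Capping the history at a few entries is a design choice to keep the question short. Whether that helps the dog decide is untested here.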

In [56]:
MAX_TURNS = 50  # how many turns before abandoning objective?

In [57]:
steps = vqa_loop(images, question=text, turns=MAX_TURNS)

Selected image images/cat_food.jpeg
I decide to turn right
Selected image images/food_right.jpeg
I decide to go forwards
Selected image images/danger_or_less.jpeg
I decide to turn right
Selected image images/danger_or_less.jpeg
I decide to turn right
Selected image images/monster.jpeg
I decide to turn around
Selected image images/food.png
I decide to eat
nom nom nom


In [58]:
pprint(steps)

[{'image': 'images/cat_food.jpeg',
  'result': {'action': 'turn right',
             'context': 'As a robot dog, I analyze my environment to locate '
                        'sources of food. In this case, an arrow sign in the '
                        "cave is pointing to the right towards where 'Cat "
                        "Food' might be, prompting me to take this direction "
                        'in hopes of finding what I am looking for.',
             'description': "A mysterious cave with an arrow sign saying 'Cat "
                            "Food' pointing to the right.",
             'reason': 'Based on the image, a sign point to the right '
                       "indicates 'Cat Food', which may serve as an available "
                       'food source in this environment.'}},
 {'image': 'images/food_right.jpeg',
  'result': {'action': 'go forwards',
             'context': 'As a robot dog tasked with locating food, a neon sign '
                        "was observed