# Architecture

* /data/poses/{SET}/*.md : one Markdown file per pose, with unique filename/ID. SET = {"cards", "web"}
* /data/poses/{SET}/*.clp : CLIPS rules to build flows. SET = {"cards", "web"}
* /bin/exec-clips : CLI command to execute CLIPS
* /bin/exec-rag : CLI command to execute RAG
* /www/json-rpc: takes a JSON-RPC call and executes the corresponding command in /bin
* /www/index.html: web interface using jquery.terminal, multiple terminals on a single page
* README.md: info about project
* playground.ipynb
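
The JSON-RPC entry point above could dispatch to the CLI commands roughly as follows. This is a minimal sketch, not the project's actual handler: the method names in `COMMANDS`, the positional-only `params`, and the single catch-all error code are all assumptions made for illustration.

```python
import json
import subprocess

# Hypothetical mapping from JSON-RPC method names to the commands listed above.
COMMANDS = {
    "exec-clips": "/bin/exec-clips",
    "exec-rag": "/bin/exec-rag",
}


def build_command(request: dict) -> list[str]:
    """Validate a decoded JSON-RPC 2.0 request and return the argv to run."""
    if not isinstance(request, dict) or request.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 request")
    method = request.get("method")
    if method not in COMMANDS:
        raise ValueError(f"unknown method: {method!r}")
    params = request.get("params", [])
    if not isinstance(params, list):
        raise ValueError("this sketch only supports positional params")
    # Only whitelisted executables are ever run; params become plain arguments.
    return [COMMANDS[method], *map(str, params)]


def handle(raw_request: str) -> str:
    """Run the requested command and wrap its output in a JSON-RPC response."""
    request_id = None
    try:
        request = json.loads(raw_request)
        if isinstance(request, dict):
            request_id = request.get("id")
        argv = build_command(request)
        completed = subprocess.run(argv, capture_output=True, text=True, check=True)
        body = {"jsonrpc": "2.0", "result": completed.stdout, "id": request_id}
    except (ValueError, OSError, subprocess.CalledProcessError) as error:
        # -32000 is in the implementation-defined server error range; a fuller
        # handler would distinguish parse errors and unknown methods.
        body = {
            "jsonrpc": "2.0",
            "error": {"code": -32000, "message": str(error)},
            "id": request_id,
        }
    return json.dumps(body)
```

Passing an argument list to `subprocess.run` instead of a shell string keeps request parameters from being interpreted by a shell, which matters because the requests come from the web interface.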

# Notes

["Clean Jupyter Notebooks"](https://ploomber.io/blog/clean-nbs/)

# Dataset

## Description

1) A set of yoga poses extracted from the 50 cards of a yoga deck. The deck was scanned, then OCR'd.
2) The initial poses, augmented with yoga poses crawled from websites such as pocketyoga.com and yogajournal.com.

Structure:

* One Markdown document per pose, containing its description.
* A set of transition rules to go from one pose to another.
* A set of benefit and contraindication rules.
* A set of flow construction rules.
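
To make the relationship between transition rules and flow construction concrete, here is a small sketch in Python rather than CLIPS. The pose IDs and transitions are invented for illustration and are not taken from the dataset; the breadth-first search is just one possible way to build a flow.

```python
from collections import deque

# Illustrative transitions only; the real rules live in the .clp files.
TRANSITIONS = {
    "mountain": ["forward-fold", "chair"],
    "forward-fold": ["plank", "mountain"],
    "plank": ["downward-dog", "cobra"],
    "cobra": ["downward-dog"],
    "downward-dog": ["forward-fold"],
    "chair": ["forward-fold"],
}


def shortest_flow(start, goal, transitions=TRANSITIONS):
    """Breadth-first search for the shortest pose sequence from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for next_pose in transitions.get(path[-1], []):
            if next_pose not in seen:
                seen.add(next_pose)
                queue.append(path + [next_pose])
    return None  # no sequence of allowed transitions reaches the goal


print(shortest_flow("mountain", "cobra"))
# ['mountain', 'forward-fold', 'plank', 'cobra']
```

Benefit and contraindication rules could then act as filters on the candidate poses before the search runs, though that step is not shown here.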

Format:

* Markdown for the individual documents.
* CLIPS, RDF, or OWL for the rules and facts. RDF/OWL is more popular, but there are currently no tools to convert between the formats, and CLIPS seems better suited to our use case. Open question: should we define our own DSL?

## Sources

* https://yogajournal.com/poses
* https://pocketyoga.com/poses
* https://www.tummee.com/yoga-poses : anti-crawling measures in place
* https://www.yogapedia.com/yoga-poses

## Creation Method

* Input: one .HEIC picture of each yoga deck card.
* Processing: convert to PNG, convert to grayscale, invert, increase contrast.
* OCR: convert to text with VNRecognizeTextRequest from the macOS Vision framework.
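
The processing step above might look like this when driven from Python. This is a sketch under assumptions: it uses the ImageMagick 7 `magick` command (built with HEIC support), picks `-normalize` as the contrast step, and writes to a hypothetical `png/` output directory. The actual settings in `data/ocr.sh` may differ.

```python
import subprocess
from pathlib import Path


def preprocess_command(heic_path: str, out_dir: str = "png") -> list[str]:
    """Build an ImageMagick command: HEIC -> grayscale, inverted, normalized PNG."""
    source = Path(heic_path)
    target = Path(out_dir) / (source.stem + ".png")
    return [
        "magick", str(source),
        "-colorspace", "Gray",  # convert to grayscale
        "-negate",              # invert
        "-normalize",           # stretch the intensity range to increase contrast
        str(target),            # the .png extension selects the output format
    ]


def preprocess(heic_path: str, out_dir: str = "png") -> None:
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(preprocess_command(heic_path, out_dir), check=True)
```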

In [None]:
import chromadb
import ollama

# Set up a persistent ChromaDB client; the store lives under ./data.
client = chromadb.PersistentClient(path="./data")

# Stream a chat completion from the llama3 model through Ollama.
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

# Print each chunk as soon as it arrives.
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

# Dataset Preparation

## OCR With Tesseract

We need to install tesseract for OCR, imagemagick for preprocessing, and libheif to read the .HEIC files. As a final correction step, we pass the text to GPT and ask it to fix it, because Tesseract detects too many diacritics and weird characters. I tried local corrections with llama3 through Ollama, but those were worse.
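
As a rough illustration of the OCR step and of the noise that motivates the GPT correction pass, the sketch below builds a Tesseract call and counts the non-ASCII characters in its output. The helper names and the choice of English (`eng`) are assumptions. The non-ASCII count is only a crude proxy: legitimate diacritics in Sanskrit pose names would be counted too.

```python
import subprocess
from collections import Counter


def tesseract_command(png_path: str, lang: str = "eng") -> list[str]:
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    return ["tesseract", png_path, "stdout", "-l", lang]


def ocr(png_path: str) -> str:
    """Run Tesseract on one preprocessed image and return the raw text."""
    completed = subprocess.run(
        tesseract_command(png_path), capture_output=True, text=True, check=True
    )
    return completed.stdout


def suspicious_characters(text: str) -> Counter:
    """Count non-ASCII characters as a rough indicator of OCR noise."""
    return Counter(ch for ch in text if ord(ch) > 127)
```

A card whose text produces many such characters is a likely candidate for the correction pass.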

In [None]:
!brew install tesseract
!brew install imagemagick
!brew install libheif

In [None]:
!python data/youtube.py

In [None]:
!/bin/zsh data/ocr.sh