# Sentence to PClean query demo
This notebook demos the sentence-to-PClean pipeline. Given a sentence that gives information about a doctor, it generates queries about that doctor and runs them against the Medicare dataset used in Alex Lew's paper, using a PClean model trained by Ian Limarta.


In [65]:
import gc
import json
import logging
from pathlib import Path
import sqlite3
import string
from typing import Any, Optional, TypeVar, Union

from IPython.display import display, Markdown
import lark
import nest_asyncio
import requests
import spacy
import torch
import transformers
import vllm

from genparse import InferenceSetup, InferenceSetupVLLM

nest_asyncio.apply()

logger = logging.getLogger("notebook")

In [2]:
debug_sentences_path = Path().resolve() / "debug_sentences.jsonl"
debug_sentences = [json.loads(line) for line in debug_sentences_path.read_text(encoding="utf-8").splitlines() if line.strip()]

## Example inputs and outputs

Taken from the first ten sentences (rows 2-11) in the [GPT-4 tweets Marjorie Freedman generated][gpt4_tweets].

[gpt4_tweets]: https://docs.google.com/spreadsheets/d/1vq_HAdbFY079vYVOppkFLk_zojJ46MJKsy-TzUm7Q3M/edit

In [3]:
sentence_table_row_template = string.Template("| $sentence | <pre>$example_output</pre> |")
def show_sentences(sentence_data):
    lines = ["| Sentence | Example PClean preamble code |", "| --- | --- |"]
    for sentence_datum in sentence_data:
        lines.append(sentence_table_row_template.substitute(sentence=sentence_datum["sentence"], example_output=sentence_datum["example_output"].replace("\n", "<br>")))
    markdown = "\n".join(lines)
    display(Markdown(markdown))


show_sentences(debug_sentences)

| Sentence | Example PClean preamble code |
| --- | --- |
| Just had an insightful consultation with Dr. Kay Ryan in Baltimore. Feeling optimistic about my health! #HealthMatters #Baltimore | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"</pre> |
| Dr. Kay Ryan's office at 321 Pine St. in Baltimore is so welcoming and efficient. Highly recommend! #CityCare | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"<br>address_key = PClean.resolve_dot_expression(trace.model, :Obs, :address)<br>row_trace[address_key] = "321 Pine St."</pre> |
| Exploring Baltimore after an informative appointment with Dr. Kay Ryan. Loving this city's vibe! #BaltimoreAdventures #DoctorVisit | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"</pre> |
| Dr. Kay Ryan, a neurologist, provided excellent care today. Grateful for her expertise! #Healthcare #BaltimoreDoctors | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"<br>occupation_key = PClean.resolve_dot_expression(trace.model, :Obs, :occupation)<br>row_trace[occupation_key] = "neurologist"</pre> |
| Feeling reassured after my visit to Dr. Kay Ryan in Baltimore. She's truly exceptional! #FeelingGood #CityOfCharm | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"</pre> |
| Had a great experience with Dr. Kay Ryan at 321 Pine St. in Baltimore. Her attention to detail is unmatched! #HealthCheck | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"<br>address_key = PClean.resolve_dot_expression(trace.model, :Obs, :address)<br>row_trace[address_key] = "321 Pine St."</pre> |
| From diagnosis to treatment, Dr. Kay Ryan, a cardiologist, covered it all. Baltimore, you're lucky to have her! #MedicalCare | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"<br>occupation_key = PClean.resolve_dot_expression(trace.model, :Obs, :occupation)<br>row_trace[occupation_key] = "cardiologist"</pre> |
| Just wrapped up my appointment with Dr. Kay Ryan in Baltimore. Her professionalism is top-notch. #HealthJourney | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"</pre> |
| Dr. Kay Ryan's office in Baltimore is so efficient and welcoming. A great experience overall! #DoctorVisit #CityCare | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"</pre> |
| Can't say enough good things about Dr. Kay Ryan, a pediatrician, in Baltimore. Truly an outstanding doctor! #Grateful | <pre>name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)<br>row_trace[name_key] = "Kay Ryan"<br>city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)<br>row_trace[city_key] = "Baltimore"<br>occupation_key = PClean.resolve_dot_expression(trace.model, :Obs, :occupation)<br>row_trace[occupation_key] = "pediatrician"</pre> |
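
Every example output above repeats the same two-line pattern once per column. A minimal sketch of producing such a preamble from a column-to-value mapping follows. The helper name `make_preamble` is hypothetical (it is not part of this notebook or of PClean), and the escaping rules are an assumption about what a Julia string literal needs.

```python
def make_preamble(values: dict[str, str]) -> str:
    """Build PClean preamble lines from a mapping of column name to value.

    Hypothetical helper, for illustration only.
    """
    lines = []
    for column, value in values.items():
        key = f"{column}_key"
        lines.append(
            f"{key} = PClean.resolve_dot_expression(trace.model, :Obs, :{column})"
        )
        # Escape backslashes, double quotes, and "$" (Julia string interpolation)
        # so the value stays a plain Julia string literal.
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("$", "\\$")
        )
        lines.append(f'row_trace[{key}] = "{escaped}"')
    return "\n".join(lines)


preamble = make_preamble({"name": "Kay Ryan", "city": "Baltimore"})
print(preamble)
```

Dictionary insertion order determines the order of the emitted pairs, which matches the name-then-city order in the table.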

## Helper code

### SpaCy code

In [5]:
# Supporting code for entity extraction.
_SPACY_MODEL_NAME = 'en_core_web_trf'
spacy_model = spacy.load(_SPACY_MODEL_NAME)
_PERSON_LABEL = 'PERSON'
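# Note: spaCy's OntoNotes-trained English models tag cities, states, and countries as 'GPE';
# 'LOC' covers non-GPE locations such as mountain ranges and bodies of water.
_LOCATION_LABEL = 'LOC'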
# spaCy labels list example thanks to Stack Overflow user 'russhoppa': https://stackoverflow.com/a/78252807
_SPACY_LABELS = spacy_model.get_pipe("ner").labels
assert _PERSON_LABEL in _SPACY_LABELS

T = TypeVar('T')

def _uniquify(items: list[T]) -> list[T]:
    """
    O(n^2) order-preserving uniquification.

    Fine for short lists like a single sentence's list of PERSON entities.
    """
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def get_people(sentence: str) -> list[str]:
    return _uniquify(
        [ent.text for ent in spacy_model(sentence).ents if ent.label_ == _PERSON_LABEL]
    )

def get_locations(sentence: str) -> list[str]:
    return _uniquify(
        [ent.text for ent in spacy_model(sentence).ents if ent.label_ == _LOCATION_LABEL]
    )

def show_ents(sentence: str) -> None:
    print(sentence)
    print(spacy_model(sentence).ents)

for sentence_datum in debug_sentences:
    show_ents(sentence_datum["sentence"])
    print()

Just had an insightful consultation with Dr. Kay Ryan in Baltimore. Feeling optimistic about my health! #HealthMatters #Baltimore
(Kay Ryan, Baltimore, Baltimore)

Dr. Kay Ryan's office at 321 Pine St. in Baltimore is so welcoming and efficient. Highly recommend! #CityCare
(Kay Ryan, Baltimore, CityCare)

Exploring Baltimore after an informative appointment with Dr. Kay Ryan. Loving this city's vibe! #BaltimoreAdventures #DoctorVisit
(Baltimore, Kay Ryan)

Dr. Kay Ryan, a neurologist, provided excellent care today. Grateful for her expertise! #Healthcare #BaltimoreDoctors
(Kay Ryan, today)

Feeling reassured after my visit to Dr. Kay Ryan in Baltimore. She's truly exceptional! #FeelingGood #CityOfCharm
(Kay Ryan, Baltimore)

Had a great experience with Dr. Kay Ryan at 321 Pine St. in Baltimore. Her attention to detail is unmatched! #HealthCheck
(Kay Ryan, Baltimore)

From diagnosis to treatment, Dr. Kay Ryan, a cardiologist, covered it all. Baltimore, you're lucky to have her! #Medical

### Extracting the generated code from a response

In [114]:
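def extract_code_from_response(text: str) -> str:
    """
    Return the text between the first and last code fences in a response.

    Strips a leading "julia" language tag if present. Raises ValueError
    if the response contains no code fence.
    """
    fence = '```'
    start = text.index(fence) + len(fence)
    julia_bit = 'julia\n'
    if text[start:].startswith(julia_bit):
        start += len(julia_bit)
    end = text.rindex(fence)
    result = text[start:end].strip()
    return result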

In [120]:
# Test to confirm it works as intended
text = '''<|start_header_id|>assistant<|end_header_id|>

Here is the Julia code to query the PClean table of records about doctors based on the given input sentence:

```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Kay Ryan"
city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)
row_trace[city_key] = "Baltimore"
```'''
extract_code_from_response(text)

'name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)\nrow_trace[name_key] = "Kay Ryan"\ncity_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)\nrow_trace[city_key] = "Baltimore"'
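
`extract_code_from_response` relies on `str.index`, which raises `ValueError` when the response has no code fence at all. Below is a minimal sketch of a more tolerant variant. The name `try_extract_code` is hypothetical, and returning `None` for "no fenced code" is an assumed convention. Unlike the function above, it takes only the first fenced block rather than everything between the first and last fence, and it expects a newline directly after the opening fence or `julia` tag.

```python
import re
from typing import Optional

# Matches the first fenced block, with an optional "julia" language tag.
_FENCED_BLOCK = re.compile(r"```(?:julia)?\n(.*?)```", re.DOTALL)


def try_extract_code(text: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if there is none."""
    match = _FENCED_BLOCK.search(text)
    if match is None:
        return None
    return match.group(1).strip()


print(try_extract_code("Here you go:\n```julia\nx = 1\n```"))
print(try_extract_code("Sorry, I cannot help with that."))
```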


### Running the server

In [118]:
_BATCH_SIZE = 1
# Ben LeBrun's WIP server running on GCP as of 2024-07-16
_DEFAULT_GENPARSE_INFERENCE_SERVER_URI = 'http://34.122.30.137:8888/infer'

def run_inference_server(
    prompt: str,
    *,
    proposal: str = 'character',
    batch_size: int = _BATCH_SIZE,
    max_tokens: int,
    n_particles: int,
    temperature: float = 1.0,
    grammar: str,
    genparse_url: str = _DEFAULT_GENPARSE_INFERENCE_SERVER_URI,
) -> requests.Response:
    """
    Run inference using a server.

    Returns the raw HTTP response; callers decode it with ``response.json()``.
    """
    params = {
        'prompt': prompt,
        'method': 'smc-standard',
        'n_particles': n_particles,
        'lark_grammar': grammar,
        'proposal_name': proposal,
        'proposal_args': {},
        'max_tokens': max_tokens,
        'temperature': temperature,
    }
    headers = {
        "Content-type": "application/json",
        "Accept": "application/json"
    }
    response = requests.post(genparse_url, json=params, headers=headers)

    return response
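
The function hands back the raw HTTP response, and later cells read `data['posterior']` from the decoded JSON as a mapping from generated query to weight. A minimal sketch of ranking that mapping follows. The helper name `rank_posterior` and the example payload are hypothetical. Only the `posterior` key comes from the cells in this notebook.

```python
def rank_posterior(payload: dict, top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k (query, weight) pairs, highest weight first."""
    posterior = payload.get("posterior", {})
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical payload for illustration; real values come from response.json().
example_payload = {"posterior": {"query_a": 0.2, "query_b": 0.7, "query_c": 0.1}}
ranked = rank_posterior(example_payload, top_k=2)
print(ranked)
```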

### PClean grammar

In [119]:
pclean_grammar = r"""
start: prefix julia_code suffix
prefix: "<|start_header_id|>assistant<|end_header_id|>" NL* FREE_TEXT? NL+ CODE_FENCE JULIA? NL
suffix: NL CODE_FENCE
julia_code: add_to_trace (NL+ add_to_trace)+

FREE_TEXT: /[a-zA-Z0-9.,\-?!;: ]+/
CODE_FENCE: "```"
JULIA: "julia"
WS: " "
NL: "\n"
STRING: /"[a-zA-Z0-9. ]*"/

add_to_trace: get_key NL set_key_in_trace
# slightly overly restrictive but good enough 
get_key: trace_key_identifier WS* "=" WS* "PClean.resolve_dot_expression(trace.model, :Obs, " column_symbol ")"
# column_symbol: /:[a-z][a-z_]+/
column_symbol: ":" ("name" | "address" | "specialty" | "city")
set_key_in_trace: "row_trace[" trace_key_identifier "]" WS* "=" WS* STRING
trace_key_identifier: /[a-z][a-z_]+/
"""
pclean_parser = lark.Lark(pclean_grammar)
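
The `FREE_TEXT` terminal places a hyphen inside a regex character class, which is easy to get wrong: an unescaped `-` between two characters denotes a range, not a literal hyphen. The sketch below uses Python's `re` module directly (not Lark) to show the difference for the `,-?` span. The sample string is made up for illustration.

```python
import re

# ",-?" is read as a range from "," to "?", which also covers / 0-9 : ; < = >
unescaped = re.compile(r"[.,-?]+")
# With the hyphen escaped, the class holds exactly four literals: . , - ?
escaped = re.compile(r"[.,\-?]+")

sample = "<=>/"
print(bool(unescaped.fullmatch(sample)))  # True
print(bool(escaped.fullmatch(sample)))    # False
```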

### PClean code template

In [4]:
pclean_template = string.Template(
    """
# Create a new row trace for the hypothetical row
row_trace = Dict{PClean.VertexID, Any}()
$preamble

# Add it to the trace
obs = trace.tables[:Obs].observations
row_id = gensym()
obs[row_id] = row_trace

samples = []
for _ in 1:$N
    # Perform a Particle Gibbs MCMC move to change our current sample of the row
    PClean.run_smc!(trace, :Obs, row_id, PClean.InferenceConfig(1, 10))
    # Accumulate the sample
    push!(samples, trace.tables[:Obs].rows[row_id][br_idx])
end

countmap(samples)
""".lstrip()
)

PCLEAN_DEFAULT_N_SAMPLES = 100
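
A minimal sketch of how the template might be filled in, using an abbreviated stand-in (`mini_template`, hypothetical) so the example runs on its own. `string.Template.substitute` requires a value for every placeholder and raises `KeyError` when one is missing, which here would catch a forgotten `N`.

```python
import string

# Abbreviated stand-in for pclean_template, for illustration only.
mini_template = string.Template(
    "row_trace = Dict{PClean.VertexID, Any}()\n$preamble\nfor _ in 1:$N\nend\n"
)

preamble = (
    "name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)\n"
    'row_trace[name_key] = "Kay Ryan"'
)

julia_code = mini_template.substitute(preamble=preamble, N=100)
print(julia_code)

# Leaving out a placeholder fails loudly instead of emitting a literal "$N".
try:
    mini_template.substitute(preamble=preamble)
    missing = None
except KeyError as error:
    missing = error.args[0]
print(missing)
```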

## Debugging

### Running the LM unconstrained

In [48]:
pclean_prompt = string.Template(
    """Write Julia code to query a PClean table of records about doctors based on the given input sentence.

In general, your output should look like pairs of lines:

```julia
blah_key = PClean.resolve_dot_expression(trace.model, :Obs, :blah_col)
row_trace[blah_key] = "Value of Blah as expressed in the sentence"
```

The dataset has just four columns to query:

- :name (the doctor's full name, first and last)
- :city (the city where the doctor practices)
- :address (the doctor's office address)
- :specialty (the doctor's specialty)

Please generate code to query all values specified in the sentence. Output the Julia code directly with no preamble or commentary. Write just two lines per column.

Input: Loved visiting Dr. Kay Ryan's neurology office at 256 Overflow St! No wait time at all. #Baltimore
Output: ```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Kay Ryan"
address_key = PClean.resolve_dot_expression(trace.model, :Obs, :address)
row_trace[address_key] = "256 Overflow St"
specialty_key = PClean.resolve_dot_expression(trace.model, :Obs, :specialty)
row_trace[specialty_key] = "neurology"
city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)
row_trace[city_key] = "Baltimore"
```

Input: Dr. Pat Rogers's orthopedics office screwed us! Took our money and Kay gave us three minutes tops. #BaltimoreSucks
Output: ```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Pat Rogers"
specialty_key = PClean.resolve_dot_expression(trace.model, :Obs, :specialty)
row_trace[specialty_key] = "orthopedics"
city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)
row_trace[city_key] = "Baltimore"
```

In other words, we query all values given in the sentence: The doctor's name, their office address, their specialty, and the city. If one of these is missing, we do not query on it.

Input: $sentence
Output:"""
)
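# Scrapped prompt line. Without it the template's only placeholder is $sentence,
# so the other keyword arguments passed by format_people_prompt have no effect.
# The doctor's name is $name. Their office address is $address. Their specialty is $specialty. The city is $city.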

def format_people_prompt(sentence: str, *, name: str, address: Optional[str], specialty: Optional[str]) -> str:
    return pclean_prompt.substitute(
        sentence=sentence,
        name=name,
        address=address if address else 'unknown',
        specialty=specialty if specialty else 'unknown',
        city='Baltimore',
    )
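
`format_people_prompt` passes `name`, `address`, `specialty`, and `city` even though the prompt template contains only `$sentence`. This works because `string.Template.substitute` ignores keyword arguments that have no matching placeholder. A small sketch with a stand-in template (`prompt_template`, hypothetical):

```python
import string

# Stand-in for pclean_prompt with the same single placeholder.
prompt_template = string.Template("Input: $sentence\nOutput:")

rendered = prompt_template.substitute(
    sentence="Dr. Kay Ryan is great.",
    name="Kay Ryan",   # no $name placeholder, so this is ignored
    city="Baltimore",  # likewise ignored
)
print(rendered)
```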

In [49]:
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

In [56]:
try:
    del model
    gc.collect()
except NameError:
    pass

In [57]:
# Taken from a magic number in the Genparse VLLM backend code.
_MAX_MODEL_LEN = 4096
# Why dtype=torch.float32? It's in the Genparse VLLM backend code.
model = vllm.LLM(model_id, dtype=torch.float32, max_model_len=_MAX_MODEL_LEN)

INFO 07-16 20:45:03 config.py:1214] Upcasting torch.bfloat16 to torch.float32.
INFO 07-16 20:45:03 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 07-16 20:45:04 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-16 20:45:14 model_runner.py:160] Loading model weights took 29.9151 GB
INFO 07-16 20:45:17 gpu_executor.py:83] # GPU blocks: 844, # CPU blocks: 1024
INFO 07-16 20:45:18 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-16 20:45:18 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-16 20:45:47 model_runner.py:965] Graph capturing finished in 29 secs.


In [58]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [59]:
sentence = debug_sentences[1]["sentence"]
print(sentence)
people = get_people(sentence)
prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': format_people_prompt(sentence=sentence, name='Kay Ryan', address='256 Overflow St', specialty='neurology')}], tokenize=False)
print(f'Prompt ({len(tokenizer(prompt)["input_ids"])} tokens): ```{prompt}```')

sampling_params = vllm.SamplingParams(temperature=1.0, max_tokens=1024, n=25)
response = model.generate(prompt, sampling_params=sampling_params)[0]
for i, output in enumerate(response.outputs, start=1):
    print(f'Generated Query {i}: ```{output.text}```')

Dr. Kay Ryan's office at 321 Pine St. in Baltimore is so welcoming and efficient. Highly recommend! #CityCare
Prompt (517 tokens): ```<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write Julia code to query a PClean table of records about doctors based on the given input sentence.

In general, your output should look like pairs of lines:

```julia
blah_key = PClean.resolve_dot_expression(trace.model, :Obs, :blah_col)
row_trace[blah_key] = "Value of Blah as expressed in the sentence"
```

The dataset has just four columns to query:

- :name (the doctor's full name, first and last)
- :city (the city where the doctor practices)
- :address (the doctor's office address)
- :specialty (the doctor's specialty)

Please generate code to query all values specified in the sentence. Output the Julia code directly with no preamble or commentary. Write just two lines per column.

Input: Loved visiting Dr. Kay Ryan's neurology office at 256 Overflow St! No wait time at all. #Baltimore
Outp

Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:13<00:00, 13.15s/it, est. speed input: 39.31 toks/s, output: 252.00 toks/s]

Generated Query 1: ```<|start_header_id|>assistant<|end_header_id|>

Here is the Julia code to query the PClean table of records about doctors based on the given input sentence:

```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Kay Ryan"
address_key = PClean.resolve_dot_expression(trace.model, :Obs, :address)
row_trace[address_key] = "321 Pine St"
city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)
row_trace[city_key] = "Baltimore"
``````
Generated Query 2: ```<|start_header_id|>assistant<|end_header_id|>

Here is the Julia code to query the PClean table:

```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Kay Ryan"
address_key = PClean.resolve_dot_expression(trace.model, :Obs, :address)
row_trace[address_key] = "321 Pine St"
city_key = PClean.resolve_dot_expression(trace.model, :Obs, :city)
row_trace[city_key] = "Baltimore"
``````
Generated Query 3: ```<|start_header_id|>as




## Running the model with constraints

In [63]:
sentence = debug_sentences[1]["sentence"]
prompt = tokenizer.apply_chat_template(
    [
        {'role': 'user', 'content': format_people_prompt(sentence=sentence, name='Kay Ryan', address='256 Overflow St', specialty='neurology')}
    ],
    tokenize=False,
)
print(f'Prompt ({len(tokenizer(prompt)["input_ids"])} tokens): ```{prompt}```')


# ignored top_p=0.95
# run_inference_server() has no model_name parameter, so passing one would raise TypeError.
server_inference_params = {'max_tokens': 128, 'n_particles': 15, 'temperature': 1.0, 'grammar': pclean_grammar}
response = run_inference_server(prompt, **server_inference_params)
try:
    data = response.json()
    for i, (query, likelihood) in enumerate(data['posterior'].items(), start=1):
        print(f'Generated Query {i} (likelihood {100 * likelihood:.2f}%): ```{query}```')
except ValueError:  # base class of json.JSONDecodeError and of the error requests raises
    print(response.status_code)
    print(response.text)

Prompt (517 tokens): ```<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write Julia code to query a PClean table of records about doctors based on the given input sentence.

In general, your output should look like pairs of lines:

```julia
blah_key = PClean.resolve_dot_expression(trace.model, :Obs, :blah_col)
row_trace[blah_key] = "Value of Blah as expressed in the sentence"
```

The dataset has just four columns to query:

- :name (the doctor's full name, first and last)
- :city (the city where the doctor practices)
- :address (the doctor's office address)
- :specialty (the doctor's specialty)

Please generate code to query all values specified in the sentence. Output the Julia code directly with no preamble or commentary. Write just two lines per column.

Input: Loved visiting Dr. Kay Ryan's neurology office at 256 Overflow St! No wait time at all. #Baltimore
Output: ```julia
name_key = PClean.resolve_dot_expression(trace.model, :Obs, :name)
row_trace[name_key] = "Kay Rya

In [None]:
sentence = debug_sentences[1]["sentence"]
prompt = tokenizer.apply_chat_template(
    [
        {'role': 'user', 'content': format_people_prompt(sentence=sentence, name='Kay Ryan', address='256 Overflow St', specialty='neurology')}
    ],
    tokenize=False,
)
print(f'Prompt ({len(tokenizer(prompt)["input_ids"])} tokens): ```{prompt}```')


# ignored top_p=0.95
# run_inference_server() has no model_name parameter, so passing one would raise TypeError.
server_inference_params = {'max_tokens': 128, 'n_particles': 15, 'temperature': 1.0, 'grammar': pclean_grammar}
response = run_inference_server(prompt, **server_inference_params)
try:
    data = response.json()
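    for i, (query, likelihood) in enumerate(data['posterior'].items(), start=1):
        # Parse the query returned by the server (not the stale `output` variable from
        # the unconstrained run); raises a lark exception if it violates the grammar.
        parse_tree = pclean_parser.parse(query)
        print(f'Generated Query {i} (likelihood {100 * likelihood:.2f}%): ```{query}```')
except ValueError:  # base class of json.JSONDecodeError and of the error requests raises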
    print(response.status_code)
    print(response.text)