### Quickstart
This notebook contains a sample program to guide you through the features of the Palimpzest (PZ) library. 
PZ provides a high-level interface for composing and executing pipelines of semantic operators.

### Pre-requisites
Because Palimpzest uses LLMs for some operations, you need to set **at least** one of the following
API keys as an environment variable:

- `OPENAI_API_KEY` to use OpenAI's GPT-3.5 and GPT-4 models
- `TOGETHER_API_KEY` to use TogetherAI's models, including Mixtral

Support for local model execution and other LLM APIs is underway!

Edit the following snippet with your API key(s) in order to run the notebook. (Providing both keys enables PZ to perform more optimizations, but this is not necessary for the demo to work.)


In [2]:
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# os.environ["TOGETHER_API_KEY"] = "your-together-api-key"
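
Before running the rest of the notebook, it can help to confirm that at least one key is visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `configured_keys` is our own and is not part of PZ.

```python
import os

# The two environment variables named in the prerequisites above.
SUPPORTED_KEYS = ("OPENAI_API_KEY", "TOGETHER_API_KEY")


def configured_keys(env=None):
    """Return the supported key names that are set to a non-empty value.

    Hypothetical helper for illustration; not part of Palimpzest.
    """
    env = os.environ if env is None else env
    return [name for name in SUPPORTED_KEYS if env.get(name, "").strip()]


found = configured_keys()
if found:
    print("Configured keys:", ", ".join(found))
else:
    print("No API key found; set at least one of:", ", ".join(SUPPORTED_KEYS))
```

This only checks that a key is present. It cannot tell whether the key is valid, so a placeholder string such as `"your-openai-api-key"` would still pass.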

### Example: Enron Email Dataset
In this demo we will work with a subset of the [Enron Email Dataset](https://www.cs.cmu.edu/~enron/). We are going to use PZ to build a semantic pipeline that does the following:

1. Load the text files that contain the emails. (Each `.txt` file contains a single email.)
2. Extract the sender, subject, and date of each email.
3. Filter for the emails that (1) mention a vacation plan and (2) were sent in the month of July.


### Step 1: Create a `pz.Dataset`

The first step in any Palimpzest program is to create a `pz.Dataset`, which represents a set of data that we can apply transformations to. In this example, we create the `pz.Dataset` by simply providing the path to our directory of text files.

In [1]:
import palimpzest as pz

# Dataset loading
dataset = pz.Dataset("testdata/enron-tiny/")

### Step 2: Extract Relevant Fields from Each Email
Since we want to extract useful information from the input files, we need to define columns that specify which attributes we are interested in. We define each column with a dictionary that specifies:
1. The column name,
2. The column type, and
3. A natural language description of what the column represents.

The names and natural language descriptions help PZ extract the column values correctly. The types are used by PZ to type check the values generated for each column.
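
To make the role of the `type` entry concrete, here is a minimal sketch of what type checking a generated record against column specs could look like. It is written in plain Python under our own assumptions; `check_spec` and `type_errors` are hypothetical helpers and do not reflect how PZ implements its checks.

```python
# Column specs in the same shape as the ones used in this notebook.
columns = [
    {"name": "sender", "type": str, "desc": "The email address of the sender"},
    {"name": "subject", "type": str, "desc": "The subject of the email"},
]

REQUIRED_KEYS = {"name", "type", "desc"}


def check_spec(col):
    """Raise ValueError if a column spec is missing a key or has a bad type."""
    missing = REQUIRED_KEYS - set(col)
    if missing:
        raise ValueError(f"column spec missing keys: {sorted(missing)}")
    if not isinstance(col["type"], type):
        raise ValueError(f"'type' must be a Python type, got {col['type']!r}")


def type_errors(record, cols):
    """Return one message per value whose type does not match its column spec."""
    errors = []
    for col in cols:
        value = record.get(col["name"])
        if not isinstance(value, col["type"]):
            errors.append(
                f"{col['name']}: expected {col['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


for col in columns:
    check_spec(col)

good = {"sender": "a@enron.com", "subject": "Vacation plans"}
bad = {"sender": "a@enron.com", "subject": 42}
print(type_errors(good, columns))  # []
print(type_errors(bad, columns))   # ['subject: expected str, got int']
```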

In this demo, we will extract the sender, subject, and date of each email. We can do this by invoking PZ's `dataset.sem_add_columns()` function.

**NOTE:** PZ uses [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation), so the `dataset` returned by `dataset.sem_add_columns()` does not yet contain the computed values of the columns you specified. This computation happens in Step 4, when we execute `dataset.run()`.
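
If lazy evaluation is unfamiliar, the toy example below may help. It records operations when they are declared and only applies them when `run()` is called. `LazyPipeline` is a hypothetical class written for this illustration; it says nothing about PZ's internals beyond the general idea.

```python
class LazyPipeline:
    """Toy stand-in that records operations and applies them only on run().

    Hypothetical class for illustration; not how Palimpzest is built.
    """

    def __init__(self, items, ops=()):
        self._items = list(items)
        self._ops = tuple(ops)

    def map(self, fn):
        # Return a new pipeline with the step recorded; nothing is computed yet.
        return LazyPipeline(self._items, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Same as map(): only record the step.
        return LazyPipeline(self._items, self._ops + (("filter", predicate),))

    def run(self):
        # Only now are the recorded steps applied, in the order they were added.
        results = self._items
        for kind, fn in self._ops:
            if kind == "map":
                results = [fn(x) for x in results]
            else:
                results = [x for x in results if fn(x)]
        return results


calls = []


def shout(text):
    calls.append(text)  # record that real work happened
    return text.upper()


pipeline = LazyPipeline(["vacation", "budget"]).map(shout).filter(lambda s: "VAC" in s)
print(len(calls))      # 0: building the pipeline did no work
print(pipeline.run())  # ['VACATION']
print(len(calls))      # 2: the work happened inside run()
```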

In [2]:

email_cols = [
    {"name": "sender", "type": str, "desc": "The email address of the sender"},
    {"name": "subject", "type": str, "desc": "The subject of the email"},
    {"name": "date", "type": str, "desc": "The date the email was sent"},
]

dataset = dataset.sem_add_columns(email_cols)

### Step 3: Apply a Filter to the Emails
Our next step is to filter for the emails that (1) mention a vacation plan and (2) were sent in the month of July.

To do this, we will use the `dataset.sem_filter()` function. This function takes a string which describes the condition we are filtering for.

In [3]:
dataset = dataset.sem_filter("The email was sent in July")
dataset = dataset.sem_filter("The email is about holidays")

### Step 4: Execute the Operations
Finally, we can execute the operations we have defined on the `dataset` by calling `dataset.run()`.

The `dataset.run()` function takes a `QueryProcessorConfig` as its sole argument. This config lets you control certain aspects of PZ's execution. For example, the `policy` option specifies what PZ should optimize for when executing your program. Some policies include:
- `MinCost`: minimize the cost of the program
- `MinTime`: minimize the runtime of the program
- `MaxQuality`: maximize the quality of the program output
- `MaxQualityAtFixedCost`: maximize the output quality subject to an upper bound on the cost
- etc.

For a full list of policies please see our documentation.

Additional config parameters control, for example, the parallelism used by PZ and the optimization strategy. The details of these parameters can also be found in our documentation.
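
To give some intuition for what a policy expresses, the sketch below picks one plan from a handful of candidates under each of the policies listed above. Everything in it is an assumption made for illustration: the `PlanEstimate` class, the candidate plans, and their numbers are invented, and PZ's actual optimizer is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical estimates for one candidate plan (made-up numbers)."""

    name: str
    cost_usd: float
    time_s: float
    quality: float  # 0.0 to 1.0, higher is better


candidates = [
    PlanEstimate("small-model", cost_usd=0.004, time_s=50.0, quality=0.80),
    PlanEstimate("large-model", cost_usd=0.060, time_s=35.0, quality=0.92),
    PlanEstimate("mixed", cost_usd=0.020, time_s=60.0, quality=0.88),
]


def min_cost(plans):
    return min(plans, key=lambda p: p.cost_usd)


def min_time(plans):
    return min(plans, key=lambda p: p.time_s)


def max_quality(plans):
    return max(plans, key=lambda p: p.quality)


def max_quality_at_fixed_cost(plans, max_cost_usd):
    # Keep only plans within budget, then pick the highest-quality one.
    affordable = [p for p in plans if p.cost_usd <= max_cost_usd]
    if not affordable:
        raise ValueError("no candidate plan fits the cost budget")
    return max(affordable, key=lambda p: p.quality)


print(min_cost(candidates).name)                         # small-model
print(min_time(candidates).name)                         # large-model
print(max_quality(candidates).name)                      # large-model
print(max_quality_at_fixed_cost(candidates, 0.03).name)  # mixed
```

The point of the sketch is that the same set of candidate plans can yield different choices depending on the objective, which is why the policy is part of the config rather than of the pipeline definition.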

In [4]:
# NOTE: PZ supports many policies including pz.MaxQuality, pz.MaxQualityAtFixedCost, and more.
#       See our documentation for more details: https://palimpzest.org/policy.html

config = pz.QueryProcessorConfig(policy=pz.MinCost(), verbose=True)
output = dataset.run(config)

Available models:  [GPT_4o, GPT_4o_MINI, GPT_4o, GPT_4o_MINI]
----------------------
PLAN[e0993984c5] (n=inf):
 0. MarshalAndScanDataOp -> TextFile 

 1. TextFile -> LLMConvertBonded -> Schema[['contents', 'date', 'filename', 'sender', 'subject']]
    (contents, filename) -> (contents, date, filename, send)
    Model: Model.GPT_4o_MINI
    Prompt Strategy: PromptStrategy.COT_QA

 2. Schema[['contents', 'date', 'filename', 'sender', 'subject']] -> LLMFilter -> Schema[['contents', 'date', 'filename', 'sender', 'subject']]
    (contents, date, filename, send) -> (contents, date, filename, send)
    Model: Model.GPT_4o_MINI
    Filter: The email was sent in July

 3. Schema[['contents', 'date', 'filename', 'sender', 'subject']] -> LLMFilter -> Schema[['contents', 'date', 'filename', 'sender', 'subject']]
    (contents, date, filename, send) -> (contents, date, filename, send)
    Model: Model.GPT_4o_MINI
    Filter: The email is about holidays


---



PROMPT:
You are a helpful assistant whose job is to generate a JSON object.
You will be presented with a context and a set of output fields to generate. Your task is to generate a JSON object which fills in the output fields with the correct values.
You will be provided with a description of each input field and each output field. All of the fields in the output JSON object can be derived using information from the context.

Remember, your answer must be a valid JSON dictionary. The dictionary should only have the specified output fields. Finish your response with a newline character followed by ---
---
INPUT FIELDS:
- contents: The contents of the file
- filename: The UNIX-style name of the file

OUTPUT FIELDS:
- date: The date the email was sent
- sender: The email address of the sender
- subject: The subject of the email

CONTEXT:
{
  "contents": "Message-ID: <1390685.1075853083264.JavaMail.evans@thyme>\nDate: Mon, 17 Sep 2001 07:56:52 -0700 (PDT)\nFrom: steven.january@enron.com\nTo

### Step 5: Displaying the Output

To print the results as a table, we can utilize the `to_df()` method of our `output` object:

In [5]:
output_df = output.to_df(cols=["date", "sender", "subject"])
display(output_df)

| | date | sender | subject |
|---|---|---|---|
| 0 | 6 Jul 2001 | sheila.nacey@enron.com | Vacation plans |
| 1 | Thu, 26 Jul 2001 | larry.berger@enron.com | Vacation Days in August |


PZ also provides a detailed report of the execution statistics, including information about the runtime and cost of each operation.
To access these statistics, you can use the `execution_stats` attribute of our `output` object:


In [6]:
execution_stats = output.execution_stats
print("Time to find an optimal plan:", execution_stats.total_optimization_time, "s")
print("Time to execute the plan:", execution_stats.total_execution_time, "s")
print("Total cost:", execution_stats.total_execution_cost, "USD")

print("Final plan executed:")
for plan, stats in execution_stats.plan_stats.items():
    print(stats)

Time to find an optimal plan: 0.0 s
Time to execute the plan: 48.99622583389282 s
Total cost: 0.003916050000000001 USD
Final plan executed:
Total_plan_time=48.99115324020386 
Total_plan_cost=0.003916050000000001 
0. MarshalAndScanDataOp time=0.0035965442657470703 cost=0.0 
1. LLMConvertBonded time=28.791292667388916 cost=0.0023439000000000003 
2. LLMFilter time=18.505988359451294 cost=0.001278 
3. LLMFilter time=1.6455256938934326 cost=0.00029414999999999997 
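
As a quick sanity check on the report above, the per-operator figures can be added up and compared with the plan totals. The sketch below copies the numbers from the printed output by hand rather than reading them from `execution_stats`, so it makes no assumptions about that object's attributes beyond what the notebook already uses.

```python
import math

# (operator, seconds, cost in USD), copied from the execution stats printed above.
operators = [
    ("0. MarshalAndScanDataOp", 0.0035965442657470703, 0.0),
    ("1. LLMConvertBonded", 28.791292667388916, 0.0023439000000000003),
    ("2. LLMFilter", 18.505988359451294, 0.001278),
    ("3. LLMFilter", 1.6455256938934326, 0.00029414999999999997),
]
total_plan_time = 48.99115324020386
total_plan_cost = 0.003916050000000001

cost_sum = sum(cost for _, _, cost in operators)
time_sum = sum(seconds for _, seconds, _ in operators)

# The operator costs add up to the reported total (up to float rounding).
print("costs match total:", math.isclose(cost_sum, total_plan_cost, rel_tol=1e-9))

# The plan time is slightly larger than the sum of the operator times.
print(f"time outside operators: {total_plan_time - time_sum:.3f} s")

# Share of the total cost and time attributable to each operator.
for name, seconds, cost in operators:
    print(
        f"{name:<24} {cost / total_plan_cost:6.1%} of cost, "
        f"{seconds / total_plan_time:6.1%} of time"
    )
```

In this run the conversion step accounts for roughly 60% of the cost, and the second filter is much cheaper than the first, presumably because it only sees the records that passed the first filter.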



We hope this notebook is only the start of your Palimpzest journey! Feel free to reach out to us for more information!