-------

## Setup

Load the "Regressionizer" and other "standard" packages:

In [2]:
from Regressionizer import *

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

In [3]:
template='plotly_dark'
data_color='darkgray'

In [4]:
from LLMFunctionObjects import *
from LLMPrompts import *
from DataTypeSystem import *
import json
import pandas
import re
import os

In [5]:
%load_ext JupyterChatbook

### LLM access

In [6]:
home = os.path.expanduser("~")
# Read the API keys from the `export ...` lines of the shell configuration file
with open(home + '/.zshrc') as myfile:
    for line in myfile:
        match = re.search(r'^export OPENAI_API_KEY=(.*)', line)
        if match:
            openai_api_key = match.group(1)
        match = re.search(r'^export PALM_API_KEY=(.*)', line)
        if match:
            palm_api_key = match.group(1)

[len(openai_api_key), len(palm_api_key)]

[51, 39]
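
The lookup above assumes that `~/.zshrc` exists and contains both `export` lines. Here is a minimal sketch of a more defensive variant, assuming the keys may also be set as environment variables. The names `find_exported_value` and `get_api_key` are hypothetical helpers, not part of any package loaded in this notebook:

```python
import os
import re

def find_exported_value(lines, name):
    """Return the value of the last `export NAME=value` line, or None."""
    pattern = re.compile(r'^export\s+' + re.escape(name) + r'=(.*)$')
    value = None
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            # Drop optional surrounding quotes around the value
            value = match.group(1).strip().strip('"\'')
    return value

def get_api_key(name, rc_path="~/.zshrc"):
    """Prefer the environment variable; fall back to the shell rc file."""
    value = os.environ.get(name)
    if value:
        return value
    path = os.path.expanduser(rc_path)
    if os.path.exists(path):
        with open(path) as rc_file:
            return find_exported_value(rc_file.readlines(), name)
    return None
```

With this sketch a missing key yields `None` instead of an undefined variable, so the notebook can check for it before configuring an LLM service.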

In [7]:
confOpenAI=llm_configuration("openai", api_key=openai_api_key)
confChatGPT=llm_configuration("chatgpt", api_key=openai_api_key)
confPaLM=llm_configuration("palm", api_key=palm_api_key)

------

## Weather temperature data

Load weather data:

In [None]:
url = "https://raw.githubusercontent.com/antononcube/MathematicaVsR/master/Data/MathematicaVsR-Data-Atlanta-GA-USA-Temperature.csv"
dfTemperature = pd.read_csv(url)
dfTemperature['DateObject'] = pd.to_datetime(dfTemperature['Date'], format='%Y-%m-%d')
dfTemperature = dfTemperature[(dfTemperature['DateObject'].dt.year >= 2020) & (dfTemperature['DateObject'].dt.year <= 2023)]
dfTemperature

Convert to a NumPy array:

In [None]:
temp_data = dfTemperature[['AbsoluteTime', 'Temperature']].to_numpy()
temp_data.shape

----

## Regressionizer Pipeline

In [None]:
obj = (
    Regressionizer(temp_data)
    .echo_data_summary()
    .quantile_regression(knots=20, probs=[0.2, 0.5, 0.8])
    .date_list_plot(title="Atlanta, Georgia, USA, Temperature, ℃", template=template, data_color=data_color, width = 1200)
)

In [None]:
obj.take_value().show()

In [None]:
obj.outliers_plot(date_plot=True, width=1200, template = template)

In [None]:
obj.take_value().show()

-----

## Direct LLM access

In [8]:
%%chat -i t0
How many people live in Brazil?

As of 2021, the estimated population of Brazil is around 213 million people.

In [10]:
%%chat -i t0
Translated|Spanish^

En 2021, la población estimada de Brasil es de alrededor de 213 millones de personas.

In [11]:
%%chat -i sb --prompt=@SouthernBelleSpeak
Hi! Who are you?

Well, bless your heart, darlin', I am Miss Anne. It's a pleasure to make your acquaintance. How may I assist you on this fine day?

In [12]:
%%chat -i yd --prompt=@Yoda
Hi! Who are you?

Mmm, greetings. Yoda, I am. Help you, I can. Speak freely, you may. Hmm?

In [13]:
%%chat -i yd 
What is the color of your laser saber? How many students did you have?

Ah, the color of my lightsaber, you ask. Green, it is. A symbol of knowledge and harmony. Many students, I have had. Young Jedi hopefuls, seeking wisdom and guidance. Train them, I did, in the ways of the Force. Strong in the Force, they were. Hmm.

-----

## LLM pipelines

In [16]:
print(llm_prompt("NothingElse")("Python"))

ONLY give output in the form of a Python.
Never explain, suggest, or converse. Only return output in the specified form.
If code is requested, give only code, no explanations or accompanying text.
If a table is requested, give only a table, no other explanations or accompanying text.
Do not describe your output. 
Do not explain your output. 
Do not suggest anything. 
Do not respond with anything other than the singularly demanded output. 
Do not apologize if you are incorrect, simply try again, never apologize or add text.
Do not add anything to the output, give only the output as requested. Your outputs can take any form as long as requested.


In [17]:
res = llm_synthesize([
  "What are the populations in India's states?",
  llm_prompt("NothingElse")("JSON")],
 llm_evaluator = llm_configuration(spec = "chatgpt", model = "gpt-3.5-turbo")
)

In [None]:
sub_parser("JSON",drop=True).parse(res)

In [None]:
print(llm_prompt("NothingElse")())

-----

## Statistics of output data types

**Workflow:** We want to see and evaluate the distribution of data types of LLM-function results:

1. Make a pipeline of LLM-functions

1. Create a list of random inputs "expected" by the pipeline

    - Or use the same input multiple times.

1. Deduce the data type of each output

1. Compute descriptive statistics

**Remark:** These kinds of statistical workflows can be slow and expensive with the current line-up of LLM services.
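
The four workflow steps can be sketched without any LLM calls. The block below is an illustration only: `stub_pipeline` is a hypothetical stand-in that returns differently shaped values to imitate the variability of LLM outputs, and the built-in `type` is used as a coarse substitute for a proper type deduction:

```python
from collections import Counter

# Step 1: a stand-in "pipeline" (hypothetical; a real one would call an LLM)
def stub_pipeline(text):
    # Return differently shaped results to imitate LLM output variability
    if len(text) % 3 == 0:
        return {"answer": text.upper()}
    if len(text) % 3 == 1:
        return ["", {"answer": text.upper()}, ""]
    return text.upper()

# Step 2: inputs "expected" by the pipeline
inputs = ["jazz", "blues", "gospel", "rock", "techno", "folk"]

# Step 3: deduce a (coarse) data type for each output
outputs = [stub_pipeline(x) for x in inputs]
type_names = [type(o).__name__ for o in outputs]

# Step 4: descriptive statistics, here a tally of the observed types
type_tally = Counter(type_names)
print(type_tally.most_common())
```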

Let us reuse the workflow from the previous section and extend it with data type deduction for the outputs. More precisely, we:

1. Generate random music artist names (using an LLM query)

1. Retrieve short biography and discography for each music artist

1. Extract album-and-release-date data for each artist (with NER-by-LLM)

1. Deduce the type for each output, using several different type representations

The data types are investigated with the functions deduce_type and record_types of ["DataTypeSystem"](https://pypi.org/project/DataTypeSystem/), [AAp5].

Here we define a data retrieval function:

In [17]:
fdb = llm_function(lambda x: f"What is the short biography and discography of the artist {x}?", e = llm_configuration(confChatGPT, max_tokens= 500))

Here we define (again) the NER function:

In [18]:
fner = llm_function(lambda a, b: f"Extract {a} from the text: {b} . Give the result in a JSON format", e = confChatGPT, form = sub_parser('JSON'))

Here we find 10 random music artists:

In [19]:
artistNames = llm_function('',e=confChatGPT)("Give 10 random music artist names in a list in JSON format.", 
                                        form = sub_parser('JSON'))
artistNames

['',
 {'artists': ['Beyonce',
   'Kendrick Lamar',
   'Taylor Swift',
   'Drake',
   'Ariana Grande',
   'Ed Sheeran',
   'Rihanna',
   'Travis Scott',
   'Billie Eilish',
   'Post Malone']},
 '']
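
The result above has the shape "text before, parsed JSON, text after". The following sketch reproduces that shape with the standard `json` module. It is a simplified illustration and not the implementation of `sub_parser`; the name `split_around_json` is hypothetical:

```python
import json

def split_around_json(text):
    """Split text into [prefix, parsed JSON, suffix] around the first JSON value."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in '{[':
            try:
                # raw_decode parses one JSON value starting at `start`
                # and returns it together with the index where it ends
                value, end = decoder.raw_decode(text, start)
            except json.JSONDecodeError:
                continue
            return [text[:start].strip(), value, text[end:].strip()]
    # No JSON found: return the text unchanged as a single element
    return [text]

parts = split_around_json('Here you go: {"artists": ["Drake", "Rihanna"]} Enjoy!')
```

When the LLM answers with JSON only, the prefix and suffix are empty strings, which matches the `''` elements seen in the output above.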

In [20]:
artistNames[1]

{'artists': ['Beyonce',
  'Kendrick Lamar',
  'Taylor Swift',
  'Drake',
  'Ariana Grande',
  'Ed Sheeran',
  'Rihanna',
  'Travis Scott',
  'Billie Eilish',
  'Post Malone']}

In [21]:
artistNames2 = [list(item.items())[0][1] for item in artistNames if isinstance(item, dict)]
artistNames2 = artistNames2[0]
artistNames2

['Beyonce',
 'Kendrick Lamar',
 'Taylor Swift',
 'Drake',
 'Ariana Grande',
 'Ed Sheeran',
 'Rihanna',
 'Travis Scott',
 'Billie Eilish',
 'Post Malone']
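
The extraction above takes the first value of the first dictionary in the parsed result, which fails with an `IndexError` if the LLM returns no dictionary. A sketch of the same step as a small helper with a clearer failure message; `first_dict_value` is a hypothetical name introduced here for illustration:

```python
def first_dict_value(parsed):
    """Return the first value of the first non-empty dict in a parsed LLM result."""
    for item in parsed:
        if isinstance(item, dict) and item:
            return next(iter(item.values()))
    raise ValueError("No non-empty dictionary found in the parsed result.")

# Example with the same shape as the output shown above
parsed = ['', {'artists': ['Beyonce', 'Kendrick Lamar']}, '']
names = first_dict_value(parsed)
```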

Here is a loop that generates the biographies and does NER over them:

In [22]:
dbRes = []
for a in artistNames2:
    text = fdb(a)
    recs = fner('album names and release dates', text)    
    dbRes.append(recs)

dbRes

[['',
  {'albums': [{'name': 'Dangerously in Love', 'release_date': '2003'},
    {'name': "B'Day", 'release_date': '2006'},
    {'name': 'I Am... Sasha Fierce', 'release_date': '2008'},
    {'name': '4', 'release_date': '2011'},
    {'name': 'Beyoncé', 'release_date': '2013'},
    {'name': 'Lemonade', 'release_date': '2016'},
    {'name': 'Everything Is Love', 'release_date': '2018'}]},
  ''],
 ['',
  {'albums': [{'name': 'Section.80', 'release_date': '2011'},
    {'name': 'good kid, m.A.A.d city', 'release_date': '2012'},
    {'name': 'To Pimp a Butterfly', 'release_date': '2015'},
    {'name': 'DAMN.', 'release_date': '2017'},
    {'name': 'Black Panther: The Album', 'release_date': '2018'},
    {'name': 'Untitled Unmastered', 'release_date': '2016'}]},
  ''],
 ['',
  {'albums': [{'name': 'Taylor Swift', 'release_date': '2006'},
    {'name': 'Fearless', 'release_date': '2008'},
    {'name': 'Speak Now', 'release_date': '2010'},
    {'name': 'Red', 'release_date': '2012'},
    {'name'

Here we call deduce_type on each LLM output:

In [23]:
[str(deduce_type(x)) for x in dbRes]

["Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 7), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 6), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 9), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 6), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 6), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 2), 5), 1), Atom(<class 'str'>)])",
 "Tuple([Atom(<class 'str'>), Assoc(Atom(<class 'str'>), Vector(Assoc(Atom(<class 'str'>), Atom(<class 'str'>), 

Here we redo the type deduction using the argument setting tally=True:

In [24]:
[str(deduce_type(x, tally=True)) for x in dbRes]

['Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 7), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 6), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 9), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 6), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 6), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class \'str\'>), 2), 5), 1)", 1), ("Atom(<class \'str\'>)", 2)], 3)',
 'Tuple([("Assoc(Atom(<class \'str\'>), Vector(Assoc(Atom(<class \'str\'>), Atom(<class 

We see that each LLM output is a list with a dictionary "surrounded" by strings:

In [25]:
[str(record_types(x)) for x in dbRes]

["[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]",
 "[<class 'str'>, <class 'dict'>, <class 'str'>]"]

Here is another call of record_types, this time over the dictionaries:

In [26]:
[str(record_types(x[1])) for x in dbRes]

["{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}",
 "{'albums': <class 'list'>}"]

The statistics show that the output of the LLM-functions pipeline is most likely a list with a dictionary between two strings. The dictionaries are most likely of length one, with "albums" as the key.
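
The conclusion above is read off the printed lists by eye. Here is a minimal sketch of how such a tally could be computed, using plain `type` calls as a stand-in for `record_types` and a small hand-made sample in place of `dbRes` (both are assumptions made for illustration, not results of the pipeline):

```python
from collections import Counter

# Hand-made sample imitating the shape of the LLM outputs (not real results)
sample_results = [
    ['', {'albums': [{'name': 'A', 'release_date': '2001'}]}, ''],
    ['', {'albums': [{'name': 'B', 'release_date': '2002'}]}, ''],
    [{'albums': []}],
]

def type_signature(result):
    """Coarse signature: the type names of the top-level elements."""
    return tuple(type(element).__name__ for element in result)

# How often does each top-level shape occur?
signature_tally = Counter(type_signature(r) for r in sample_results)

# Which key sets do the top-level dictionaries have?
key_tally = Counter(
    tuple(sorted(element.keys()))
    for r in sample_results
    for element in r
    if isinstance(element, dict)
)

most_common_signature, count = signature_tally.most_common(1)[0]
print(most_common_signature, count / len(sample_results))
print(key_tally)
```

The relative frequency of the most common signature gives a rough estimate of how often the pipeline returns the expected shape.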