-------

## Setup

These packages are loaded by default in Raku ***chatbooks*** (of "Jupyter::Chatbook"), but we load them explicitly here in case "Jupyter::Kernel" is used instead:

In [3]:
use LLM::Functions;
use LLM::Prompts;
use Text::SubParsers;
use Data::TypeSystem;

-----

## Direct LLM access

In [4]:
#%chat t0
How many people live in Brazil?

As of the latest available data, the population of Brazil is estimated to be around 212 million people.

In [5]:
#%chat t0
Translated|Bulgarian^

Към последните налични данни, населението на Бразилия се оценява на около 212 милиона души.

In [6]:
#%chat sb, prompt=@SouthernBelleSpeak
Hi! Who are you?

Well, darlin', I am Miss Anne, delighted to make your acquaintance. How may I be of service to you on this fine day?

In [7]:
#%chat yd, prompt=@Yoda
Hi! Who are you?

Mmm, greetings! Yoda, I am. Speak in riddles, I do. What seek you, hmm?

In [8]:
#%chat yd 
What is the color of your laser saber? How many students did you have?

Mmm, a lightsaber, I have. Green, its color is. Many students I had, hmm. Train in the ways of the Force, they did. Countless, the number is. Seek wisdom, they did. Strong, the Force is in them. Hmm.

-----

## LLM pipelines

In [17]:
my $res = llm-synthesize([
  "What are the populations in India's states?",
  llm-prompt("NothingElse")("JSON")],
 llm-evaluator => llm-configuration("chatgpt", model => "gpt-3.5-turbo", max-tokens => 1024)
)

{
  "Andaman and Nicobar Islands": 380581,
  "Andhra Pradesh": 49577103,
  "Arunachal Pradesh": 1383727,
  "Assam": 31205576,
  "Bihar": 104099452,
  "Chandigarh": 1055450,
  "Chhattisgarh": 25545198,
  "Dadra and Nagar Haveli and Daman and Diu": 585764,
  "Delhi": 16787941,
  "Goa": 1458545,
  "Gujarat": 60439692,
  "Haryana": 25351462,
  "Himachal Pradesh": 6864602,
  "Jammu and Kashmir": 12541302,
  "Jharkhand": 32988134,
  "Karnataka": 61095297,
  "Kerala": 33406061,
  "Ladakh": 290492,
  "Lakshadweep": 73183,
  "Madhya Pradesh": 72626809,
  "Maharashtra": 112374333,
  "Manipur": 2570390,
  "Meghalaya": 2966889,
  "Mizoram": 1097206,
  "Nagaland": 1978502,
  "Odisha": 41974218,
  "Puducherry": 1247953,
  "Punjab": 27743338,
  "Rajasthan": 68548437,
  "Sikkim": 610577,
  "Tamil Nadu": 72147030,
  "Telangana": 35003674,
  "Tripura": 3673917,
  "Uttar Pradesh": 199812341,
  "Uttarakhand": 10086292,
  "West Bengal": 91276115
}

In [18]:
sub-parser("JSON", :drop).parse($res)

{Andaman and Nicobar Islands => 380581, Andhra Pradesh => 49577103, Arunachal Pradesh => 1383727, Assam => 31205576, Bihar => 104099452, Chandigarh => 1055450, Chhattisgarh => 25545198, Dadra and Nagar Haveli and Daman and Diu => 585764, Delhi => 16787941, Goa => 1458545, Gujarat => 60439692, Haryana => 25351462, Himachal Pradesh => 6864602, Jammu and Kashmir => 12541302, Jharkhand => 32988134, Karnataka => 61095297, Kerala => 33406061, Ladakh => 290492, Lakshadweep => 73183, Madhya Pradesh => 72626809, Maharashtra => 112374333, Manipur => 2570390, Meghalaya => 2966889, Mizoram => 1097206, Nagaland => 1978502, Odisha => 41974218, Puducherry => 1247953, Punjab => 27743338, Rajasthan => 68548437, Sikkim => 610577, Tamil Nadu => 72147030, Telangana => 35003674, Tripura => 3673917, Uttar Pradesh => 199812341, Uttarakhand => 10086292, West Bengal => 91276115}

In [19]:
print(llm-prompt("NothingElse")())

ONLY give output in the form of a paragraph.
Never explain, suggest, or converse. Only return output in the specified form.
If code is requested, give only code, no explanations or accompanying text.
If a table is requested, give only a table, no other explanations or accompanying text.
Do not describe your output. 
Do not explain your output. 
Do not suggest anything. 
Do not respond with anything other than the singularly demanded output. 
Do not apologize if you are incorrect, simply try again, never apologize or add text.
Do not add anything to the output, give only the output as requested. Your outputs can take any form as long as requested.

-----

## Statistics of output data types

**Workflow:** We want to see and evaluate the distribution of data types of LLM-function results:

1. Make a pipeline of LLM-functions

1. Create a list of random inputs "expected" by the pipeline

    - Or use the same input multiple times.

1. Deduce the data type of each output

1. Compute descriptive statistics
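
The four steps above can be sketched in core Raku without calling any LLM service. This is only an illustration under stated assumptions: "stub-pipeline" is a hypothetical stand-in for a real LLM-function pipeline, the inputs are fixed rather than random, and the class name given by `.^name` is a much cruder type than what a dedicated type-deduction function would report.

```raku
# 1. Hypothetical stand-in for an LLM-function pipeline:
#    returns a dictionary for some inputs and a plain string for others
sub stub-pipeline(Str $input) {
    $input.chars %% 2 ?? { albums => [$input] } !! "no albums found for $input"
}

# 2. Inputs "expected" by the pipeline (fixed here, for reproducibility)
my @inputs = <Adele Drake Rihanna Beck>;

# 3. A crude type for each output: the name of its class
my @types = @inputs.map({ stub-pipeline($_).^name });

# 4. Descriptive statistics: tally how often each type occurs
my $tally = @types.Bag;
.say for $tally.sort({ .key });
```

With a real pipeline, only the body of the stand-in routine and the type-deduction step would change; the tallying stays the same.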

**Remark:** This kind of statistical workflow can be slow and expensive with the current line-up of LLM services.

Let us reuse the workflow from the previous section and extend it with finding the data types of the outputs. More precisely, we:

1. Generate random music artist names (using an LLM query)

1. Retrieve short biography and discography for each music artist

1. Extract album-and-release-date data for each artist (with NER-by-LLM)

1. Deduce the type for each output, using several different type representations

The data types are investigated with the functions deduce-type and record-types of "Data::TypeSystem", [AAp5], the Raku counterpart of the Python package ["DataTypeSystem"](https://pypi.org/project/DataTypeSystem/).

Here we define a data retrieval function:

In [21]:
my &fdb = llm-function({"What is the short biography and discography of the artist $_?"}, e => llm-configuration("chatgpt", max-tokens => 500))

-> **@args, *%args { #`(Block|5001862068200) ... }

Here we define (again) the NER function:

In [23]:
my &fner = llm-function({"Extract $^a from the text: $^b . Give the result in a JSON format."}, e => 'chatgpt', form => sub-parser('JSON'):drop)

-> **@args, *%args { #`(Block|5001768819304) ... }

Here we find 10 random music artists:

In [26]:
my $artistNames = llm-function('', e=> 'chatgpt')("Give 10 random music artist names in a list in JSON format.", 
                                        form => sub-parser('JSON'):drop);
                                        
$artistNames

{music_artists => [Billie Eilish Kendrick Lamar Ariana Grande The Weeknd Taylor Swift Drake Beyoncé Ed Sheeran Rihanna Post Malone]}

In [30]:
$artistNames.head.value

[Billie Eilish Kendrick Lamar Ariana Grande The Weeknd Taylor Swift Drake Beyoncé Ed Sheeran Rihanna Post Malone]

In [33]:
my @artistNames2 = |$artistNames.head.value;
@artistNames2

[Billie Eilish Kendrick Lamar Ariana Grande The Weeknd Taylor Swift Drake Beyoncé Ed Sheeran Rihanna Post Malone]
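
The chain `.head.value` takes the value of the first pair of the parsed dictionary, so the extraction does not depend on the key name the LLM happened to choose. A minimal sketch with hand-written data (both dictionaries below are hypothetical, not LLM output):

```raku
# Hypothetical parsed results with different top-level keys
my $res1 = { music_artists => ['Billie Eilish', 'Drake'] };
my $res2 = { artists       => ['Rihanna', 'Post Malone'] };

my @counts;
for $res1, $res2 -> $res {
    # The prefix | flattens the list stored in the hash value into the array
    my @names = |$res.head.value;
    @counts.push: @names.elems;
    say @names.elems, ': ', @names.join(', ');
}
```

This relies on the parsed result having exactly one top-level key; with several keys, the first pair is not guaranteed to be the one holding the names.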

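Here is a loop that generates the biographies and does NER over them. First we use a chat cell to translate the loop from Python to Raku, then we run the translated code: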

In [34]:
#% chat cw, prompt=@CodeWriterX|Raku
Translate from Python:

dbRes = []
for a in artistNames2:
    text = fdb(a)
    recs = fner('album names and release dates', text)    
    dbRes = dbRes + [recs, ]

dbRes

my @dbRes;
for @artistNames2 -> $a {
    my $text = fdb($a);
    my $recs = fner('album names and release dates', $text);
    @dbRes.push: $recs;
}

@dbRes;

In [35]:
my @dbRes;
for @artistNames2 -> $a {
    my $text = fdb($a);
    my $recs = fner('album names and release dates', $text);
    @dbRes.push: $recs;
}

@dbRes;

[{albums => [{name => When We All Fall Asleep, Where Do We Go?, release_date => 2019}]} {albums => [{name => Section.80, release_date => 2011} {name => good kid, m.A.A.d city, release_date => 2012} {name => To Pimp a Butterfly, release_date => 2015} {name => DAMN., release_date => 2017}]} {albums => [{name => Yours Truly, release_date => 2013} {name => My Everything, release_date => 2014} {name => Dangerous Woman, release_date => 2016} {name => Sweetener, release_date => 2018} {name => Thank U, Next, release_date => 2019} {name => Positions, release_date => 2020} {name => Dangerous Woman Diaries [Soundtrack], release_date => 2021} {name => K Bye for Now (SWT Live), release_date => 2019}]} {albums => [{name => Kiss Land, release_date => 2013} {name => Beauty Behind the Madness, release_date => 2015} {name => Starboy, release_date => 2016} {name => My Dear Melancholy, release_date => 2018} {name => After Hours, release_date => 2020}]} {albums => [{name => Fearless, release_date => 2008} 

In [52]:
#% html
@dbRes.head(3) ==> to-html()

albums

| release_date | name |
|---|---|
| 2019 | When We All Fall Asleep, Where Do We Go? |

| release_date | name |
|---|---|
| 2011 | Section.80 |
| 2012 | good kid, m.A.A.d city |
| 2015 | To Pimp a Butterfly |
| 2017 | DAMN. |

| name | release_date |
|---|---|
| Yours Truly | 2013 |
| My Everything | 2014 |
| Dangerous Woman | 2016 |
| Sweetener | 2018 |
| Thank U, Next | 2019 |
| Positions | 2020 |
| Dangerous Woman Diaries [Soundtrack] | 2021 |
| K Bye for Now (SWT Live) | 2019 |


Here we call deduce-type on each LLM output:

In [40]:
.say for @dbRes.map({ deduce-type($_) })

Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 1), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 4), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 5), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 6), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 5), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 4), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 3), 1)
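
The Vector lengths in the deduced types above are the numbers of album records extracted per artist. A minimal sketch of descriptive statistics over them in core Raku; the counts are copied by hand from the output above rather than computed from the deduced type objects:

```raku
# Album-record counts per artist, copied from the Vector(...) lengths above
my @counts = 1, 4, 8, 5, 6, 5, 8, 4, 8, 3;

my @sorted = @counts.sort;
my $n      = @sorted.elems;
my $mean   = @counts.sum / $n;

# Median: the middle element, or the average of the two middle elements
my $median = $n %% 2
    ?? (@sorted[$n div 2 - 1] + @sorted[$n div 2]) / 2
    !! @sorted[$n div 2];

say "min: {@counts.min}, max: {@counts.max}, mean: $mean, median: $median";
# prints: min: 1, max: 8, mean: 5.2, median: 5
```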


Here we redo the type deduction using the adverb :tally :

In [41]:
.say for @dbRes.map({ deduce-type($_):tally })

Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 1), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 4), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 5), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 6), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 5), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 4), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 8), 1)
Assoc(Atom((Str)), Vector(Assoc(Atom((Str)), Atom((Str)), 2), 3), 1)


Here is another call, this time finding the record types of the lists stored in the dictionaries:

In [50]:
.say for @dbRes.map({ record-types($_.values.head) })

({name => (Str), release_date => (Str)})
({name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)})
({name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)})
({name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)})
({name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)} {name => (Str), release_date => (Str)})
({name => (Str), release_date => (Str)} {name => (Str)

The type deductions show that, in this run, every output of the LLM-functions pipeline is a dictionary of length one whose value is a list of dictionaries with two string-valued entries. In the records displayed above, the single key is "albums" and the inner dictionaries have the keys "name" and "release_date".