# About this notebook

One of the most powerful features of LLM is the ability to compose and orchestrate calls. Python is a great language for prototyping, but when chaining operations, the lack
of strong types can make results increasingly unpredictable.

Enter F# - a functional language with strict typing and function composition as a native construct.

This notebook illustates the following:

- Automated retry operations with a `retry` Computation Expression. Simply instantiate the Retry builder with the number of retries, then use the `retry` block anywhere in code.
- Using a private type constructor with string sanitization. This demonstrates very basic URL sanitization as an example; in general, you may want to consider a more fully-fledged sanitization process. A great resource on Prompt Injection is from [Carol Anderson.](https://www.linkedin.com/pulse/newly-discovered-prompt-injection-tactic-threatens-large-anderson/)
- Serialization from OpenAI JSON response to an F# `Map<string,string>` type

# What does it do?

Imagine we have a number of job listings, and want to categorize them by sub-categories not present in the data.

We can ask ChatGPT (or another LLM) to look at the job descriptions, and provide sub-categories.

Workflow:

- Load in CSV data on job market from [Kaggle](https://www.kaggle.com/datasets/shashankshukla123123/linkedin-job-cleandata)
- Group by the job Designation (job title)
- Concatenate job details for a specific designation, and have ChatGPT analyze the batched results for sub-categories
- Obtain those sub-categories in a `Map` type for future analysis



# Getting started

- Follow instructions here on running .NET with Jupyter: https://github.com/dotnet/interactive/blob/main/docs/NotebookswithJupyter.md
- You'll need an OpenAI API key set as an environment variable (`OPENAI_API_KEY`)

In [17]:
#r "nuget:System"
#r "nuget:FSharp.Data"
#r "nuget:FSharpPlus"
#r "nuget:Azure.AI.OpenAI,*-*" // --prerelease

open System
open System.Collections.Generic
open FSharp.Data
open FSharpPlus
open Azure.AI.OpenAI


#### Define a record type that will hold model parameters

This can be easily extended to other models, including locally hosted models

In [3]:
type ModelSettings = {
    TruncateLength : int
    ModelName : string
}

#### Now, let's instantiate that record with GPT 3.5 Turbo

In [4]:
let settingsGPT3 = { TruncateLength = 3000; ModelName = "gpt-3.5-turbo" }

#### Retrieve the OpenAI API key from an environment variable:

In [19]:
let getEnvVar (name: string) =
    let value = Environment.GetEnvironmentVariable(name)
    match value with
    | null -> failwith (sprintf "Environment variable '%s' not found" name)
    | _ -> value

let AOAI_KEY = Environment.GetEnvironmentVariable("AOAI_KEY");

#### Build the GPT client with Azure OpenAI (note: this uses OpenAI, not anything Azure-specific)

In [23]:
let llmClient = new OpenAIClient(AOAI_KEY);

let callGPT settings prompt =

    let completionsOptions = new ChatCompletionsOptions( 
        [ChatMessage(role = ChatRole.System, content = "Assistant is a large language model trained by OpenAI. It returns values in JSON.");
        ChatMessage(role = ChatRole.User, content = prompt)])

    let response = llmClient.GetChatCompletions(settings.ModelName, completionsOptions)
    response

### String sanitization

In our hypothetical use case, we are concerned that job details might contain URLs which, combined with prompt injection, would provide a data exfiltration point. We can
ensure that `SanitizedString`s are always validated before construction so we can safely pass them to the LLM.

This is a simple example. Using more advanced grammar parsing could be a good choice to filter out more sophisticated attacks.

In [24]:
type SanitizedString = 
    private
    | SanitizedString of string

module SanitizedString =

    let urlRegex = System.Text.RegularExpressions.Regex(@"http[s]?://[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+")

    let tryCreate (input: string) : SanitizedString option =
        if urlRegex.IsMatch(input) then
            None
        else
            Some (SanitizedString input)

    let value (SanitizedString s) = s

    let concat (SanitizedString s1) (SanitizedString s2) : SanitizedString option =
        tryCreate (s1 + s2)


#### Invoke

This function is what calls the LLM. In our use case, it accepts a `ModelSettings` record and a `SanitizedString`. In this specific case we truncate the input prompt to 3000 characters to fit within context length for GPT 3.5. This can be tweaked by updating the record.

In [25]:
let invoke (modelSettings: ModelSettings) (sanitizedString: SanitizedString) : string =
    let prompt = (SanitizedString.value sanitizedString |> String.truncate modelSettings.TruncateLength)
    let response = (callGPT modelSettings prompt)
    let choices = response.Value.Choices
    choices[0].Message.Content 

#### Partial application

`invokeGPT3` is now a function with `settingsGPT3` already applied. We can work with this function just like any other, but don't need to worry about remembering which 
`ModelSettings` we need to use

In [26]:
let invokeGPT3 = invoke settingsGPT3 

#### RetryBuilder Computation Expression

This is the most advanced topic in this notebook. A Computation Expression, similar to a Monad in Haskell, handles high-level control flow. In our case we can
wrap calls to the LLM in this RetryBuilder which will automatically retry for us.

In [27]:
type RetryBuilder(maxRetries : int) =
    member this.Bind(x, f) =
        let rec loop retries =
            match x with
            | Ok x -> 
                try
                    f x
                with
                | ex when retries > 0 ->
                    printfn "Exception occurred, retrying. %d retries left" retries
                    loop (retries - 1)
                | ex ->
                    printfn "Exception occurred, no retries left. Rethrowing..."
                    reraise()
            | Error _ as err -> err
        loop maxRetries
    member this.Return(x) = Ok x
    member __.ReturnFrom(x) = x

    member this.Zero() = failwith "Unexpected condition in RetryBuilder"

Create an instance of the RetryBuilder, in this case with 5 retries

In [28]:
let retry = RetryBuilder(5)

#### Extract themes

Here we take in the 

In [29]:
type SpaceSeparatedFile = CsvProvider<"./job_cleanData.csv">

let loadAndGroupData () : string * list<string> =
    // Load space-separated data
    let data = SpaceSeparatedFile.Load("./job_cleanData.csv")
    
    // Group by unique "Designation" values
    let groupedData = 
        data.Rows
        |> Seq.groupBy (fun row -> row.Designation)
        |> Seq.map (fun (name, rows) -> name, Seq.toList rows) // This produces a sequence of tuples (designation, rows)
        |> Seq.map (fun (name, rows) -> name, rows |> List.map (fun row -> row.Job_details)) // For each group, transform the list of rows into a list of job details
    
    // Return the first group's designation and job details
    let firstGroup = Seq.head groupedData
    firstGroup

In [30]:
let (designation, descriptions) = loadAndGroupData()

#### Convert job descriptions to `SanitizedString`s

In [31]:
let merged = 
    descriptions
    |> Seq.map SanitizedString.tryCreate
    |> Seq.choose id
    |> Seq.map SanitizedString.value
    |> String.concat " "


#### JSON Deserialization

Here we handle deserialization the OpenAI API call JSON response. This could be improved with retry, additional error handling, and reflection to automatically provide the
desired schema to the LLM.

In [32]:
let jsonToT<'T> (json : string) : 'T =

    System.Text.Json.JsonSerializer.Deserialize<'T>(json)
    
exception Error1 of string


let verifyKeys (map : Map<string, int>) (categories : list<string>) : Result<Map<string, int>, string> =
    let keys = map |> Map.toSeq |> Seq.map fst |> Set.ofSeq
    let categorySet = Set.ofList categories
    if Set.isSubset keys categorySet then Ok map
    else Error "Some keys are not in the category list"
    
let jsonToMap (json : string) (categories: string list) : Result<Map<string, int>, string> =
    try
        let value = System.Text.Json.JsonSerializer.Deserialize<Map<string, int>>(json)
        verifyKeys value categories
    with
    | ex -> Error ex.Message



#### Putting it all together

Attempt to construct a `SanitizedString`, and then use it to identify subcategories via a call to OpenAI

In [34]:
type MaybeBuilder() =
    member this.Bind(x, f) = 
        match x with
        | Some a -> f a
        | None -> None
    member this.Return(x) = Some x
    member this.ReturnFrom(x) = x
    member this.Zero() = None

let maybe = new MaybeBuilder()

open System.Text.Json

let toJson (value: 'T) : string =
    JsonSerializer.Serialize<'T>(value)


let processDesignation<'T> (example: 'T) (designation: string) (merged: string) : 'T option =

    maybe {
        let jsonString = toJson example
        let shortMerged = merged |> String.truncate 1000
        let! sanitizedPrompt = SanitizedString.tryCreate $"""you are a helpful job classification system. Given this job designation '{designation}', and the following job details, please provide a sub-categories of this job. Details: {shortMerged} Please return a JSON array like this: `{jsonString}` without any extra values."""
        let subcategories : string = invokeGPT3 sanitizedPrompt
        printfn "%A" subcategories
        let map: 'T = jsonToT subcategories
        return map
    } 

let example = ["Data analyst"; "Data Scientist"]
let jobMap = processDesignation example designation merged


"["Data Analyst", "Machine Learning Engineer"]"


In [37]:
type RelevanceBuilder() =
    member this.Bind(x, f) = 
        match x with
        | Some a -> f a
        | None -> None
    member this.Return(x) = Some x
    member this.ReturnFrom(x) = x
    member this.Zero() = None

let relevance = new RelevanceBuilder()

let invokeAndParseGPT3<'T> (sp : SanitizedString) maxRetries : Result<'T, Exception> =
    let retries = [1 .. maxRetries]
    
    List.fold (fun currentAttempt _ -> 
        match currentAttempt with
        | Ok result -> Ok result
        | Error ex ->
            printfn "Error: %s" ex.Message
            match SanitizedString.tryCreate ex.Message with
            | Some sanitizedError ->
                let updatedSp = SanitizedString.concat sp sanitizedError
                let gptresponse = invokeGPT3 updatedSp.Value
                jsonToT gptresponse
            | None -> Error ex
    ) (Error (Exception "Start")) retries

let rec invokeAndParseGPT3Map (sp : SanitizedString) maxRetries (categories: string list) : Result<Map<string, int>, Exception> =
    let retries = [1 .. maxRetries]

    List.fold (fun currentAttempt _ -> 
        match currentAttempt with
        | Ok result -> Ok result
        | Error _ ->
            let gptresponse = invokeGPT3 sp
            let parsed = jsonToMap gptresponse categories

            match parsed with
            | Ok map -> Ok map
            | Error _ ->
                match SanitizedString.tryCreate gptresponse with
                | Some sanitizedError ->
                    let updatedSp = SanitizedString.concat sp sanitizedError
                    match updatedSp with
                    | Some query -> 
                        invokeAndParseGPT3Map query (maxRetries - 1) categories
                    | None -> Error (Exception "The query was empty")
                | None -> Error (Exception "Sanitization failed")
    ) (Error (Exception "Start")) retries



In [38]:

let findRelevant (jobMap) (descriptions: string list) (categories): Map<string, (int * string)> list =
    descriptions
    |> List.map (fun desc ->
        relevance {
            let jobMapJson = System.Text.Json.JsonSerializer.Serialize(jobMap)
            let shortDescription = desc |> String.truncate 1500
            let sanitizedPrompt = SanitizedString.tryCreate $"""Given this job description: `\n\n\n{shortDescription}`\n\n\n, please identify which of these categories are appropriate: `{jobMapJson}`. Please return a response in a JSON map, with the category as the key and a 1-5 score representing the relevance of the category to the description as the value. Only return JSON. For instance, a response could be {{'data analyst' : 5, 'ml engineer': 2}}"""
            let! sp = sanitizedPrompt
            printfn "%A" sanitizedPrompt.Value
            
            let result  = invokeAndParseGPT3Map sp 5 categories  // 5 being the maximum number of retries
            
            match result  with
            | Ok r ->
            
                let combined : Map<string, (int * string)> = Map.map (fun _ v -> (v, desc)) r
            // let combined = result
                return combined
            | Error ex ->
                printfn "%s" ex.Message
        }
    )
    |> List.choose id
    

In [1]:
let jobMapBinding = 
    match jobMap with
    | Some jm -> jm
    | None -> 
        printfn "Job map not available."
        List.empty

let safeDescriptions = 
    descriptions
    |> Seq.map SanitizedString.tryCreate
    |> Seq.choose id
    |> Seq.map SanitizedString.value
    |> Seq.toList

let result = jobMap |> Option.map (fun x -> findRelevant jobMapBinding (List.take 10 safeDescriptions) x)


let topK (data: Map<string, (int * string)> list option) (k: int) : Map<string, (int * string) list> option =
    data |> Option.map (fun maps ->
        maps
        |> List.collect Map.toList // Flatten the list of maps into a single list of key-value pairs
        |> List.groupBy fst // Group by keys (strings)
        |> List.map (fun (key, group) ->
            // For each group, sort by value (int) in descending order, take the top k and create a new map entry
            key, (group |> List.map snd |> List.sortBy (fun (score, desc) -> -score) |> List.take k)
        )
        |> Map.ofList // Convert the list of key-value pairs back to a map
    )

topK result 1


Stopped due to error

input.fsx (2,11)-(2,17) typecheck error The value or constructor 'jobMap' is not defined.

input.fsx (9,5)-(9,17) typecheck error The value or constructor 'descriptions' is not defined.

input.fsx (10,16)-(10,31) typecheck error The value, namespace, type or module 'SanitizedString' is not defined.

input.fsx (12,16)-(12,31) typecheck error The value, namespace, type or module 'SanitizedString' is not defined.

input.fsx (15,14)-(15,20) typecheck error The value or constructor 'jobMap' is not defined. Maybe you want one of the following:
   jobMapBinding

input.fsx (15,45)-(15,57) typecheck error The value or constructor 'findRelevant' is not defined.



Error: compilation error