# About this notebook

One of the most powerful features of LLMs is the ability to compose and orchestrate calls. Python is a great language for prototyping, but when chaining operations, the lack
of static types can make results increasingly unpredictable.

Enter F# - a functional language with strict typing and function composition as a native construct.

This notebook illustrates the following:

- Automated retry operations with a `retry` Computation Expression. Simply instantiate the `RetryBuilder` with the number of retries, then use the `retry` block anywhere in code.
- Using a private type constructor with string sanitization. This demonstrates very basic URL sanitization as an example. A great resource on prompt injection is this article by [Carol Anderson](https://www.linkedin.com/pulse/newly-discovered-prompt-injection-tactic-threatens-large-anderson/).
- Deserialization of the OpenAI JSON response into an F# `Map<string,string>` type

# What does it do?

Imagine we have a number of job listings, and want to categorize them by sub-categories not present in the data.

We can ask ChatGPT (or another LLM) to look at the job descriptions, and provide sub-categories.

Workflow:

- Load CSV job-market data from [Kaggle](https://www.kaggle.com/datasets/shashankshukla123123/linkedin-job-cleandata)
- Group by the job Designation (job title)
- Concatenate job details for a specific designation, and have ChatGPT analyze the batched results for sub-categories
- Obtain those sub-categories in a `Map` type for future analysis



# Getting started

- Follow instructions here on running .NET with Jupyter: https://github.com/dotnet/interactive/blob/main/docs/NotebookswithJupyter.md
- You'll need an OpenAI API key set as an environment variable (`OPENAI_API_KEY`)

In [3]:
#r "nuget:System"
#r "nuget:Newtonsoft.Json"
#r "nuget:OpenAI.Client"
#r "nuget:FSharp.Data"
#r "nuget:FSharpPlus"

open System
open Newtonsoft.Json
open Newtonsoft.Json.Linq
open System.Collections.Generic
open OpenAI
open OpenAI.Chat
open FSharp.Data
open FSharpPlus

#### Retrieve the OpenAI API key from an environment variable:

In [9]:

let getEnvVar (name: string) =
    let value = Environment.GetEnvironmentVariable(name)
    match value with
    | null -> failwith (sprintf "Environment variable '%s' not found" name)
    | _ -> value

let client =
    Config(
        { Endpoint = "https://api.openai.com/v1"
          ApiKey = getEnvVar "OPENAI_API_KEY"},
        HttpRequester()
    )


#### Define a record type that will hold model parameters

This can be easily extended to other models, including locally hosted models.

In [10]:
type ModelSettings = {
    TruncateLength : int
    ModelName : string
}

#### Now, let's instantiate that record with GPT 3.5 Turbo

In [11]:
let settingsGPT3 = { TruncateLength = 3000; ModelName = "gpt-3.5-turbo" }


#### Build the function that calls GPT (more details: https://github.com/yazeedobaid/openai-fsharp)

In [12]:
let callGPT settings prompt =
    client
    |> chat
    |> create
      { Model = settings.ModelName
        Messages = [| {Role = "user"; Content = prompt} |] }

### String sanitization

In our hypothetical use case, we are concerned that job details might contain URLs which, combined with prompt injection, would provide a data exfiltration point. We can
ensure that `SanitizedString`s are always validated before construction so we can safely pass them to the LLM.

This is a simple example. Using more advanced grammar parsing could be a good choice to filter out more sophisticated attacks.

In [13]:
type SanitizedString = 
    private
    | SanitizedString of string

module SanitizedString =

    let urlRegex = System.Text.RegularExpressions.Regex(@"http[s]?://[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+")

    let tryCreate (input: string) : SanitizedString option =
        if urlRegex.IsMatch(input) then
            printfn "Warning: Input string contains a URL!"
            printfn "URL: %s" (urlRegex.Match(input).Value)
            None
        else
            Some (SanitizedString input)

    let value (SanitizedString s) = s
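
To make the behaviour of the smart constructor concrete, here is a minimal, self-contained sketch. It redefines `SanitizedString` with a deliberately simplified URL pattern (`https?://\S+`, an assumption for illustration rather than the regex used above), and the sample strings are made up.

```fsharp
open System.Text.RegularExpressions

// Private case: callers go through tryCreate to obtain a value
type SanitizedString =
    private
    | SanitizedString of string

module SanitizedString =
    // Simplified pattern for illustration only (hypothetical, not the notebook's regex)
    let private urlRegex = Regex(@"https?://\S+")

    let tryCreate (input: string) : SanitizedString option =
        if urlRegex.IsMatch input then None else Some (SanitizedString input)

    let value (SanitizedString s) = s

// A string without a URL is accepted
let accepted = SanitizedString.tryCreate "Senior data engineer, remote"

// A string containing a URL is rejected
let rejected = SanitizedString.tryCreate "Apply at https://example.com/jobs"

printfn "%A" (accepted |> Option.map SanitizedString.value)
printfn "%b" (Option.isNone rejected)
```

Returning an `option` forces every caller to handle the rejected case explicitly before anything reaches the LLM.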


#### Invoke

This function is what calls the LLM. In our use case, it accepts a `ModelSettings` record and a `SanitizedString`. Here we truncate the input prompt to 3000 characters to fit within the context length of GPT 3.5. This can be tweaked by updating the `TruncateLength` field of the record.

In [14]:
let invoke (modelSettings: ModelSettings) (sanitizedString: SanitizedString) : string =
    let prompt = (SanitizedString.value sanitizedString |> String.truncate modelSettings.TruncateLength)
    let choices = (callGPT modelSettings prompt)
    choices.Choices[0].Message.Content


#### Partial application

`invokeGPT3` is now a function with `settingsGPT3` already applied. We can work with this function just like any other, but don't need to worry about remembering which
`ModelSettings` we need to use.

In [15]:
let invokeGPT3 = invoke settingsGPT3 
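
Partial application can be tried without calling any API. The sketch below is an assumption-laden stand-in: `describe` is a hypothetical function that only formats a string, and `"stub-model"` is a made-up model name.

```fsharp
type ModelSettings = {
    TruncateLength : int
    ModelName : string
}

// Hypothetical stand-in for `invoke`: truncates the prompt and tags it with the model name
let describe (settings: ModelSettings) (prompt: string) : string =
    let truncated =
        if prompt.Length > settings.TruncateLength then
            prompt.Substring(0, settings.TruncateLength)
        else
            prompt
    sprintf "[%s] %s" settings.ModelName truncated

// Supplying only the first argument yields a new function of type string -> string
let describeSmall = describe { TruncateLength = 5; ModelName = "stub-model" }

printfn "%s" (describeSmall "hello world") // prints "[stub-model] hello"
```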

#### RetryBuilder Computation Expression

This is the most advanced topic in this notebook. A Computation Expression, similar to a Monad in Haskell, handles high-level control flow. In our case we can
wrap calls to the LLM in this RetryBuilder which will automatically retry for us.

In [16]:
type RetryBuilder(maxRetries : int) =
    // Bind runs the continuation `f`; if `f` throws, it is re-run up to `maxRetries` more times
    member this.Bind(x, f) =
        let rec loop retries =
            match x with
            | Ok value ->
                try
                    f value
                with
                | _ when retries > 0 ->
                    printfn "Exception occurred, retrying. %d retries left" retries
                    loop (retries - 1)
                | _ ->
                    printfn "Exception occurred, no retries left. Rethrowing..."
                    reraise()
            // Rebuild the Error so the continuation may return a different Ok type than `x`
            | Error e -> Error e
        loop maxRetries
    member this.Return(x) = Ok x
    member this.ReturnFrom(x) = x

    member this.Zero() = failwith "Unexpected condition in RetryBuilder"


Create an instance of the `RetryBuilder`, in this case with 5 retries.

In [17]:
let retry = RetryBuilder(5)
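
The notebook does not go on to use the `retry` block, so here is a minimal sketch of how it might be used. It carries a compact copy of the builder so that it runs on its own, and `flaky` is a hypothetical function that fails on its first two calls to simulate a transient error.

```fsharp
// Compact copy of the builder (without logging) so the sketch is self-contained
type RetryBuilder(maxRetries : int) =
    member this.Bind(x, f) =
        let rec loop retries =
            match x with
            | Ok value ->
                try
                    f value
                with
                | _ when retries > 0 -> loop (retries - 1)
                | _ -> reraise()
            | Error e -> Error e
        loop maxRetries
    member this.Return(x) = Ok x
    member this.ReturnFrom(x) = x

let retry = RetryBuilder(5)

// Counts how many times `flaky` has been called
let attempts = ref 0

// Hypothetical operation: throws on the first two calls, succeeds on the third
let flaky (input: string) : Result<string, string> =
    attempts.Value <- attempts.Value + 1
    if attempts.Value < 3 then failwith "transient failure"
    Ok (input.ToUpper())

let result : Result<string, string> =
    retry {
        let! value = Ok "hello"
        return! flaky value
    }

printfn "%A after %d attempts" result attempts.Value
```

Note that only the code after a `let!` is retried, because the retry loop lives in `Bind` and re-runs the continuation.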

#### Extract themes

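Here we load the job listings CSV, group the rows by `Designation`, and return the first group's designation together with its list of job details.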

In [18]:
type SpaceSeparatedFile = CsvProvider<"./job_cleanData.csv">

let loadAndGroupData () : string * list<string> =
    // Load the job listings file through the CSV type provider
    let data = SpaceSeparatedFile.Load("./job_cleanData.csv")
    
    // Group by unique "Designation" values
    let groupedData = 
        data.Rows
        |> Seq.groupBy (fun row -> row.Designation)
        |> Seq.map (fun (name, rows) -> name, Seq.toList rows) // This produces a sequence of tuples (designation, rows)
        |> Seq.map (fun (name, rows) -> name, rows |> List.map (fun row -> row.Job_details)) // For each group, transform the list of rows into a list of job details
    
    // Return the first group's designation and job details
    let firstGroup = Seq.head groupedData
    firstGroup

In [19]:
let (designation, descriptions) = loadAndGroupData()
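
The grouping step can be tried without the Kaggle file. This sketch uses a hypothetical `JobRow` record and three made-up rows in place of the type provider's rows.

```fsharp
// Hypothetical stand-in for a CSV row
type JobRow = { Designation : string; JobDetails : string }

// Made-up sample data
let rows = [
    { Designation = "Data Engineer"; JobDetails = "Build pipelines" }
    { Designation = "Data Scientist"; JobDetails = "Train models" }
    { Designation = "Data Engineer"; JobDetails = "Maintain warehouse" }
]

// Group by designation, keeping only the job details for each group
let grouped =
    rows
    |> List.groupBy (fun row -> row.Designation)
    |> List.map (fun (name, group) -> name, group |> List.map (fun row -> row.JobDetails))

// Take the first group, as the notebook does
let sampleDesignation, sampleDescriptions = List.head grouped

printfn "%s: %A" sampleDesignation sampleDescriptions
```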


#### Convert job descriptions to `SanitizedString`s

In [20]:
let merged = 
    descriptions
    |> Seq.map SanitizedString.tryCreate
    |> Seq.choose id
    |> Seq.map SanitizedString.value
    |> String.concat " "


URL: https://www.crossover.com/auth/password-recovery
URL: https://www.crossover.com/auth/password-recovery
URL: https://www.crossover.com/auth/password-recovery
URL: http://www.verisk.com/careers.html
URL: http://www.launchcg.com;
URL: https://talent.uplers.com/
URL: https://talent.uplers.com/
URL: https://kpipartners.openings.co/#!/
URL: https://www.northerntrust.com/content/dam/northerntrust/pws/nt/images/careers/taleo-india.png
URL: https://www.hyqoo.com
URL: https://kpipartners.openings.co/#!/
URL: http://www.verisk.com/careers.html
URL: https://www.ibm.com/in-enJob
URL: https://docs.google.com/document/d/1ifTRyXCsoNaRWF3_w4prCLz0todRcL3q9JCPM53RtvY(copy
URL: https://www.flexmoney.inJob
URL: https://www.zineone.com/)
URL: https://hillpineconsulting.in
URL: https://www.axismyindia.org/
URL: https://www.sutherlandglobal.com/Lead
URL: https://www.doxel.ai/)
URL: https://www.micro1.ai/developer
URL: https://www.doxel.ai/)
URL: https://www.luxoft.com/)
URL: https://devon.global
URL: ht

#### JSON Deserialization

Here we handle deserialization of the JSON response from the OpenAI API call. This could be improved with retry, additional error handling, and reflection to automatically provide the
desired schema to the LLM.

In [25]:

type Category = {
    subcategories: List<string>
}

// Convert a JSON string to an F# map
let jsonToMap (json: string) : Map<string, string> =
    // Deserialize JSON string to Category type
    let category = JsonConvert.DeserializeObject<Category>(json)

    // Create a sequence of tuples, each with an index and a corresponding subcategory
    let tuples = category.subcategories |> Seq.mapi (fun i subcategory -> (string i, subcategory))

    // Create a map from the sequence of tuples
    let map = Map.ofSeq tuples

    map
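
To show the input shape this step expects and the map it produces, here is a dependency-free sketch. It uses `System.Text.Json` instead of the notebook's Newtonsoft.Json so that it runs without extra packages; `jsonToMapSketch` and the sample JSON are hypothetical.

```fsharp
open System.Text.Json

// Variant of jsonToMap built on System.Text.Json (an assumption, not the notebook's implementation)
let jsonToMapSketch (json: string) : Map<string, string> =
    use doc = JsonDocument.Parse(json)
    // Read each array element as a string
    let items =
        [ for item in doc.RootElement.GetProperty("subcategories").EnumerateArray() -> item.GetString() ]
    // Key each subcategory by its position in the list
    items
    |> List.mapi (fun i subcategory -> string i, subcategory)
    |> Map.ofList

// Made-up response in the shape the prompt asks for
let sampleJson = """{"subcategories": ["Data Scientist", "Data Engineer"]}"""

let sampleMap = jsonToMapSketch sampleJson

printfn "%A" sampleMap
```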


#### Putting it all together

Attempt to construct a `SanitizedString`, and then use it to identify subcategories via a call to OpenAI.

In [35]:
let maybePrompt = SanitizedString.tryCreate $"""You are a helpful job classification system. Given this job designation '{designation}', and the following job details, please provide sub-categories of this job. Please return a JSON list in the form of "subcategories": [(list values)]. Details: {merged}"""

match maybePrompt with
    | Some sanitizedPrompt -> 
        let subcategories = invokeGPT3 sanitizedPrompt
        let map: Map<string,string> = jsonToMap subcategories
        printfn "%A" map
        
    | None -> printfn "%s" "The prompt is empty since it could not be validated"



map
  [("0", "Machine Learning Engineer"); ("1", "Data Scientist");
   ("2", "Data Engineer")]
