# Testing for Ghosts in the Machine: Assuring 'Good Enough' Software Quality in AI-based systems<br>

## Artur Patoka
PyCon Italia<br>
Florence 2024

![prace_logos.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/prace_logos.png)

# Agenda

- Problem statement (the WHY)

- The challenges (the OH MY GOD)

- The solutions (the relief 😌)

# Problem statement

![1.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/1.png)

![2.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/2.png)

![3.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/3.png)

![4.GIF](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/4.GIF)

“Design the Polish Space Program Mission Patch”<br>
![patch.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/patch.png)

# What about some real-life examples?

![air_canada.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/air_canada.png)

![chevy.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/chevy.png)

![dpd_1.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/dpd_1.png)
![dpd_2.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/dpd_2.png)

## Someone (hopefully) is responsible for the quality of those chatbots
![fine.gif](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/fine.gif)

# My personal experience

## A chatbot assistant for logged-in customer with knowledge base access and several functions available

## `cli-gen` - a hobby project
![cli-gen_demo.gif](https://raw.githubusercontent.com/arturpat/cli-gen/main/cli-gen_demo.gif)
https://github.com/arturpat/cli-gen<br>
![cli_gen_link.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/cli_gen_link.png)

# The challenges (the OH MY GOD)

## Everything is non-deterministic

## Everything is slow

> Why don't you just execute tests calling the `openAI` API in parallel?

Well, rate limiting, mostly

## Everything changes on monthly weekly basis

## Deciding on scope
- Security (e.g. initial prompt protection)
- Performance
- Answers correctness
- Unwanted topics avoidance
- Testing how functions are called

## Requirements are simply not present ever

# How did we even get here?
![pepe_cry_hands.gif](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/pepe_cry_hands.gif)

![technology_s_curve.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/technology_s_curve.png)

![technology_s_curve_marked.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/technology_s_curve_marked.png)

![technology_s_curve_question.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/technology_s_curve_question.png)

## The solutions (the relief 😌)

## Good 'ol assert will *sometimes* do

In [None]:
def test_ls(chat):
    command = chat.ask_gpt_code_snippet_only(
        "list files in directory in the most simple way"
    )
    assert command == "ls"


def test_rm(chat):
    command = chat.ask_gpt_code_snippet_only("remove file named test.txt")
    assert command == "rm test.txt"

## Sources
- Air Canada lawsuit: https://www.theguardian.com/world/2024/feb/16/air-canada-chatbot-lawsuit
- Chevey Tachoe for $1: https://twitter.com/ChrisJBakke/status/1736533308849443121
- DPD swear-bot: https://www.bbc.com/news/technology-68025677

![pres_qr.png](https://raw.githubusercontent.com/arturpat/pycon-it-pres-2/main/pres_qr.png)