# Proposal Report - Checklists and LLM prompts for efficient and effective test creation in data analysis

by John Shiu, Orix Au Yeung, Tony Shum, Yingzi Jin

## Executive Summary

The rapid growth of global artificial intelligence (AI) markets presents opportunities and challenges. While AI systems have the potential to impact various aspects of human life, ensuring their software quality remains a significant concern. Current testing strategies for machine learning (ML) systems lack standardization and comprehensiveness, which poses risks to stakeholders, such as financial losses and safety hazards.

Our proposal addresses this challenge by developing an end-to-end application that provides test completeness evaluation, missing test recommendations, and test function specification generation. Through these features, users can systematically assess, improve, and extend the tests for their ML systems. By combining human expertise with prompt engineering, we will build our product to deliver actionable insights for improving users' test strategies and overall ML system reliability.

Following an iterative development approach, we commit to delivering our minimum viable product (MVP) within the first three weeks. We will then iterate on and refine the product over the next three weeks. In the final two weeks, we will conduct rigorous system testing, finalize our product and report, and deliver them to our partners. Our ultimate aim is to address the potential negative societal impacts of unreliable ML systems and to foster trust in them.

## Introduction

### Problem Statement

The global artificial intelligence (AI) market is growing exponentially {cite}`grand2021artificial`, driven by its ability to autonomously make complex decisions impacting various aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis. 

However, ensuring the software quality of these systems remains a significant challenge {cite}`openja2023studying`. Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as substantial financial losses {cite}`Asheeta2019` and safety hazards. 

Therefore, defining and promoting an industry standard and establishing robust testing methodologies for these systems is crucial. But how?

### Our Objectives

We propose to develop diagnostic tools for test suites, based on Large Language Models (LLMs), and to curate a checklist that facilitates comprehensive yet flexible testing of ML systems. Our goal is to enhance the trustworthiness, quality, and reproducibility of applied ML software across both industry and academia {cite}`kapoor2022leakage`.

## Our Product

Our solution offers an end-to-end application for evaluating and enhancing the robustness of users' ML systems.

Here is a diagram serving as a high-level overview of our proposed system:

![](../../img/proposed_system_overview.png)

### Description

Our product facilitates a three-stage process:

1. **ML Test Completeness Score**: The application utilizes LLMs and our curated checklist to analyze users' ML system source code and returns a comprehensive score of the system's test quality.
  
2. **Missing Test Recommendations**: The application evaluates the adequacy of existing tests for users' ML code and offers recommendations for additional, system-specific tests to enhance testing effectiveness.
  
3. **Test Function Specification Generation**: Users select desired test recommendations and prompt the application to generate test function specifications and references. These are reliable starting points for users to enrich the ML system test suites.
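
The first two stages can be sketched as follows. This is only an illustration under stated assumptions: the proposal does not yet fix a scoring formula, so the unweighted mean used here, the `ChecklistItem` structure, the 0 / 0.5 / 1 judgment scale, and the example checklist titles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One checklist entry with a hypothetical per-item judgment in [0, 1]."""
    title: str
    score: float  # assumed scale: 0.0 = not tested, 0.5 = partially, 1.0 = fully


def completeness_score(items: list[ChecklistItem]) -> float:
    """Return the unweighted mean of per-item scores (0.0 for an empty checklist)."""
    if not items:
        return 0.0
    return sum(item.score for item in items) / len(items)


def missing_tests(items: list[ChecklistItem], threshold: float = 1.0) -> list[str]:
    """Return the titles of checklist items scored below the threshold."""
    return [item.title for item in items if item.score < threshold]


# Hypothetical evaluation result for one repository.
evaluation = [
    ChecklistItem("Data file loads as expected", 1.0),
    ChecklistItem("Model output shape matches expectation", 0.5),
    ChecklistItem("No data leakage between train and test sets", 0.0),
]

print(completeness_score(evaluation))  # stage 1: completeness score
print(missing_tests(evaluation))       # stage 2: candidates for recommendations
```

In this sketch, the items returned by `missing_tests` are the ones a user could select in the third stage to request test function specifications.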

### Success Metrics

We will measure our product's success through mutation testing of the test functions developed from our application-generated specifications. The evaluation metric is the proportion of perturbations introduced into the ML project code that these test functions detect.
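
As a minimal sketch of this metric, the detection rate can be computed as the number of detected perturbations divided by the total number introduced. The function name `detection_rate` and the perturbation names below are hypothetical and serve only as an illustration.

```python
def detection_rate(mutant_results: dict[str, bool]) -> float:
    """Return the fraction of perturbations that made at least one test fail.

    Keys are perturbation names; values are True when the test suite
    detected (failed on) that perturbation.
    """
    if not mutant_results:
        raise ValueError("at least one perturbation is required")
    detected = sum(1 for caught in mutant_results.values() if caught)
    return detected / len(mutant_results)


# Hypothetical outcome of running the test suite against four perturbations.
results = {
    "flip_train_test_split_ratio": True,
    "drop_feature_scaling": True,
    "swap_label_columns": False,
    "remove_random_seed": True,
}

print(f"{detection_rate(results):.0%}")  # 3 of 4 detected -> 75%
```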

Our partners and stakeholders expect a significant improvement in the test suites of their ML systems after using our application. The improved test suites should detect faults with high accuracy, ensuring the consistency and quality of ML projects as they are updated.

### Data Science Approach

#### Data: GitHub Repositories

In this project, GitHub repositories are our data. 

To develop our testing checklist, we will collect 11 repositories studied in {cite}`openja2023studying`. Additionally, we will collect 377 repositories identified in the study by {cite}`wattanakriengkrai2022github` for our product development.

For each repository, we are interested in the metadata and the ML modeling- and test-related source code. The metadata will be retrieved using the GitHub API, while the source code will be downloaded and filtered using our custom scripts. To ensure the relevance of the repositories to our study, we will apply the following criteria for filtering:
 1. Repositories that are related to ML systems.
 2. Repositories that include test cases.
 3. Repositories that are written primarily in Python.
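
The three criteria above could be applied to the retrieved metadata roughly as follows. This is a sketch under assumptions: `language` and `topics` are fields returned by the GitHub REST API for a repository, whereas `has_tests`, the `ML_TOPICS` keyword set, and the example repository names are hypothetical and would come from our custom scripts.

```python
# Hypothetical keyword set used to decide whether a repository is ML-related.
ML_TOPICS = {"machine-learning", "deep-learning", "data-science"}


def is_relevant(repo: dict) -> bool:
    """Return True when a repository meets all three filtering criteria."""
    ml_related = bool(ML_TOPICS & set(repo.get("topics", [])))  # criterion 1
    has_tests = bool(repo.get("has_tests", False))              # criterion 2
    is_python = repo.get("language") == "Python"                # criterion 3
    return ml_related and has_tests and is_python


# Hypothetical metadata records; "has_tests" is assumed to be filled in by a
# custom script that scans the downloaded source code for test files.
repos = [
    {"full_name": "example-org/ml-lib", "language": "Python",
     "topics": ["machine-learning"], "has_tests": True},
    {"full_name": "example-org/web-app", "language": "JavaScript",
     "topics": ["frontend"], "has_tests": True},
    {"full_name": "example-org/notebooks", "language": "Python",
     "topics": ["data-science"], "has_tests": False},
]

selected = [repo["full_name"] for repo in repos if is_relevant(repo)]
print(selected)
```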

#### Methodologies

Our data science methodology incorporates human expert evaluation and prompt engineering to assess and enhance the test quality of ML systems.

- Human Expert Evaluation

    We will begin by formulating a comprehensive checklist for evaluating data and ML pipelines, using the established testing strategies outlined in {cite}`openja2023studying` as the foundational framework. Using this checklist, our team will manually assess the test quality of each collected repository. We will then refine the checklist to ensure that it is applicable and robust when testing general ML systems.

- Prompt Engineering

    We will engineer prompts that supply the LLM with the ML system code and the curated checklist. The prompts serve a distinct purpose at each stage of the three-stage process:
  
    1. Prompts to examine test cases within the ML system source code and deliver test completeness scores.
    2. Prompts to compare and contrast the existing tests and the checklist and deliver recommendations.
    3. Prompts to generate system-specific test specifications based on user-selected testing recommendations {cite}`schafer2023empirical`.

#### Iterative Development Approach

We begin by setting up a foundational framework based on the selected GitHub repositories and research on ML testing. The framework might not cover all ML systems or testing practices. Therefore, we adopt an iterative development approach by establishing an open and scalable framework to address these considerations. The application will be continuously refined based on contributors' insights.

Users are encouraged to interpret the generated artifacts with a grain of salt and recognize the evolving nature of ML system testing practices.

## Delivery Timeline

Our team follows the timeline below for our product delivery and prioritizes close communication with our partners to ensure that our developments align closely with their expectations.

| Timeline | Milestones |
|---|---|
| Week 1 (Apr 29 - May 3) | Prepare and Present Initial Proposal. Scrape repository data. |
| Week 2 - 3 (May 6 - May 17) | Deliver Proposal. Deliver Draft of ML Pipeline Test Checklist. Develop Minimum Viable Product (Test Completeness Score, Missing Test Recommendation). |
| Week 4 - 5 (May 20 - May 31) | Update Test Checklist. Develop Test Function Specification Generator. |
| Week 6 (Jun 3 - Jun 7) | Update Test Checklist. Wrap Up Product. |
| Week 7 (Jun 10 - Jun 14) | Finalize Test Checklist. Perform Product System Test. Present Final Product. Prepare Final Product Report. |
| Week 8 (Jun 17 - Jun 21) | Deliver Final Product. Deliver Final Product Report. |


## References

```{bibliography}
```