# Proposal Report: Checklists and LLM prompts for efficient and effective test creation in data analysis

by John Shiu, Orix Au Yeung, Tony Shum, Yingzi Jin

## Executive Summary

The rapid growth of the global artificial intelligence (AI) market presents both opportunities and challenges. While AI systems have the potential to impact various aspects of human life, ensuring their software quality remains a significant concern. Current testing strategies for machine learning (ML) systems lack standardization and comprehensiveness, which exposes stakeholders to risks such as financial losses and safety hazards.

Our proposal addresses this challenge by developing an end-to-end application that provides comprehensive test evaluation, tailored test recommendations, and automated test specification generation. Through these features, users can systematically assess, improve, and add tests tailored to their ML systems. By leveraging human expertise and prompt engineering, we build our product to deliver actionable insights for improving users' test strategies and overall ML system reliability.

With an iterative development approach, we aim to develop our minimum viable product (MVP) in the first two weeks. We will further iterate on and refine our product over the next three weeks. During the last two weeks, we will carry out rigorous system testing, finalize our product and report, and deliver them to our partners. Ultimately, our goal is to mitigate the potential negative societal impacts associated with unreliable ML systems and to promote their trustworthiness.

## Introduction

### Problem Statement

The global artificial intelligence (AI) market is growing exponentially {cite}`grand2021artificial`, driven by AI's ability to autonomously make complex decisions that affect many aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis.

However, ensuring the software quality of these systems remains a significant challenge {cite}`openja2023studying`. Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as substantial financial losses {cite}`Asheeta2019` and safety hazards. 

Therefore, it is crucial to define and promote an industry standard and establish robust testing methodologies for these systems. But how?

### Our Objectives

We propose to develop testing suites based on Large Language Models (LLMs) that offer flexibility and facilitate comprehensive testing of ML systems. Our goal is to enhance the trustworthiness and robustness of applied ML software and improve the quality and reproducibility of ML systems across both industry and academia {cite}`kapoor2022leakage`.

## Our Product

Our solution offers an end-to-end application for evaluating and enhancing the robustness of users' machine learning (ML) systems.

Here is a diagram serving as a high-level overview of our proposed system:

![](../../img/proposed_system_overview.png)

### Description

Our product can be used in three stages.

1. **ML Test Completeness Evaluation**: The application uses a Large Language Model (LLM) and our curated checklist to analyze users' ML system source code, and returns a comprehensive score of the system's test quality.
  
2. **Missing Test Recommendations**: The application evaluates the adequacy of existing tests for users' ML code, and offers recommendations for additional, system-specific tests to enhance testing effectiveness.
  
3. **Test Function Specification Generation**: Users select desired test recommendations in the application, which autonomously generates test function specifications and references. These serve as reliable starting points for users to incorporate into their ML system test suites.

### Success Metrics

We will measure the success of our product through mutation testing of the reference test cases generated by the application. We will apply a set of perturbations to the ML project code and record the proportion of perturbations detected by the implemented test cases as the evaluation metric.
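
As a minimal sketch of how this metric could be computed, the snippet below scores a list of mutation outcomes. The `MutationResult` class, the `detection_rate` function, and the mutant labels are hypothetical names introduced for illustration only; the proposal does not prescribe a particular mutation tool or data format.

```python
from dataclasses import dataclass


@dataclass
class MutationResult:
    """Outcome of running the test suite against one perturbed copy of the code."""
    mutant_id: str   # hypothetical label describing the perturbation
    detected: bool   # True if at least one test failed on the perturbed code


def detection_rate(results: list) -> float:
    """Return the fraction of perturbations caught by the test suite."""
    if not results:
        raise ValueError("no mutation results to score")
    detected = sum(1 for r in results if r.detected)
    return detected / len(results)


# Illustrative (made-up) outcomes for four perturbations of an ML project.
results = [
    MutationResult("flip_comparison", True),
    MutationResult("drop_normalization", True),
    MutationResult("shuffle_labels", False),
    MutationResult("off_by_one_split", True),
]
rate = detection_rate(results)
print(f"Detection rate: {rate:.2f}")
```

A higher rate would indicate that the generated reference tests catch more of the injected faults. Which perturbations to apply is a design decision left open here.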

Our partner and stakeholders expect to see a significant improvement in the testing strategies of their ML systems after using the application. Moreover, the application should demonstrate high accuracy in detecting faults, which helps ensure consistent, high-quality ML projects as they are updated.

### Data Science Approach

#### Data: GitHub Repositories

In this project, GitHub repositories are our data. 

We will collect 11 repositories studied in {cite}`openja2023studying` for the development of our testing checklist. Additionally, we will collect 377 repositories identified in the study by {cite}`wattanakriengkrai2022github` for our product development.

For each repository, we are interested in the repository metadata, as well as the ML modeling- and test-related source code. The metadata will be retrieved using the GitHub API, while the source code will be downloaded and filtered using our custom scripts. To ensure the relevance of the repositories to our study, we will apply the following criteria for filtering:
 1. Repositories that are related to ML systems.
 2. Repositories that include test cases.
 3. Repositories that are written primarily in the Python programming language.
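
The three criteria above can be sketched as a simple filter over collected repository records. This is an illustration under stated assumptions, not our final script: the `RepoInfo` record, the `ML_TOPICS` set, and the use of repository topics as a proxy for "related to ML systems" are all hypothetical, and test files are recognized by the common pytest naming patterns `test_*.py` and `*_test.py`.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed proxy for criterion 1: topics that suggest an ML-related repository.
ML_TOPICS = {"machine-learning", "deep-learning", "ml"}


@dataclass
class RepoInfo:
    """Hypothetical record built from repository metadata and a file listing."""
    full_name: str
    language: Optional[str]                      # primary language from metadata
    topics: list = field(default_factory=list)
    file_paths: list = field(default_factory=list)


def has_tests(file_paths: list) -> bool:
    """Criterion 2: at least one file follows a pytest-style test file name."""
    for path in file_paths:
        name = path.rsplit("/", 1)[-1]
        if name.endswith(".py") and (name.startswith("test_") or name.endswith("_test.py")):
            return True
    return False


def is_relevant(repo: RepoInfo) -> bool:
    """Apply all three filtering criteria to one repository."""
    is_ml = bool(ML_TOPICS & set(repo.topics))   # criterion 1 (assumed proxy)
    is_python = repo.language == "Python"        # criterion 3
    return is_ml and has_tests(repo.file_paths) and is_python


# Made-up example records.
repos = [
    RepoInfo("org/ml-lib", "Python", ["machine-learning"], ["src/model.py", "tests/test_model.py"]),
    RepoInfo("org/js-app", "JavaScript", ["machine-learning"], ["index.js", "tests/test_app.py"]),
    RepoInfo("org/no-tests", "Python", ["deep-learning"], ["train.py"]),
]
kept = [r.full_name for r in repos if is_relevant(r)]
print(kept)
```

In practice, a topic-based check may miss ML repositories that carry no topics, so manual review of borderline cases would likely still be needed.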

#### Methodologies

Our data science methodology incorporates both human expert evaluation and prompt engineering to assess and enhance the test quality of ML systems.

- Human Expert Evaluation

    We will begin by formulating a comprehensive checklist for evaluating the data and ML pipeline, using the established testing strategies outlined in {cite}`openja2023studying` as the foundational framework for assessing test quality within the selected repositories. Our team will then manually evaluate the test quality of the repository data against this checklist. The checklist will be refined during the process to ensure its applicability and robustness when testing general ML systems.

- Prompt Engineering

    We will engineer prompts for the LLM to serve different purposes across the three stages:
    1. Prompts to examine test cases within ML system source code and deliver qualitative and quantitative test scores.
    2. Prompts that incorporate the completed checklist and compare it against ML system source code to suggest potential testing strategies.
    3. Prompts to generate test cases based on suggested testing strategies and ML system task types {cite}`schafer2023empirical`.

#### Iterative Development Approach

Because we rely on data from selected GitHub repositories and on reference research into testing strategies, it is important to acknowledge that our work may not cover all ML systems or testing methodologies. To address this limitation, we adopt an iterative development approach by setting up an open and scalable framework for this project. Our application can then undergo continuous updates based on users' feedback and contributors' insights.

We encourage users to interpret the artifacts generated by the application with a grain of salt and recognize the evolving nature of ML system testing practices.

## Delivery Timeline

Our team will follow the timeline below for product delivery. We also aim to communicate closely with our partner to align our product development with their expectations.

| Timeline | Milestones |
|---|---|
| Week 1 (Apr 29 - May 3) | Prepare and Present Initial Proposal. Scrape repository data. |
| Week 2 - 3 (May 6 - May 17) | Deliver Proposal. Deliver Draft of ML Pipeline Test Checklist. Develop Minimum Viable Product (Test Completeness Score, Missing Test Recommendation). |
| Week 4 - 5 (May 20 - May 31) | Update Test Checklist. Develop Test Function Spec Generator. |
| Week 6 (Jun 3 - Jun 7) | Update Test Checklist. Wrap Up Product. |
| Week 7 (Jun 10 - Jun 14) | Finalize Test Checklist. Perform Product System Test. Present Final Product. Prepare Final Product Report. |
| Week 8 (Jun 17 - Jun 21) | Deliver Final Product. Deliver Final Product Report. |


## References

```{bibliography}
```