Refactor the report export function to fit in refactored TestEvaluato… #95

Merged
21 commits merged on May 28, 2024

Commits
507d251
add data presence and data quality checklist
jinyz8888 May 21, 2024
be43e31
Merge remote-tracking branch 'origin/add_data_presence_and_data_quali…
May 22, 2024
689ba36
undo the changes in checklist.csv
May 22, 2024
312c5a0
docs: prepare checklist for system development
May 22, 2024
092eac7
Merge remote-tracking branch 'origin/84-extend-the-checklists-loader-…
May 22, 2024
e936aa2
added topic for optional Data Quality
May 22, 2024
f18fcc9
updated example in checklist.py
May 22, 2024
dd37cbb
fix symbol typo in checklist_sys
May 22, 2024
96c705e
renamed topic
May 22, 2024
bc6ab8a
added checklist html for visualization
May 22, 2024
e2d90af
Merge branch '84-extend-the-checklists-loader-functionality-to-export…
May 22, 2024
7a3a13a
Refactor the report export function to fit in refactored TestEvaluato…
tonyshumlh May 23, 2024
e550240
added checklist_sys
May 23, 2024
abf2472
move checklist exporting script to top level
SoloSynth1 May 24, 2024
8b27235
create mixins for exporting actions; make `Checklist` and `ResponsePa…
SoloSynth1 May 24, 2024
8efb878
remove `checklist_sys.html`
SoloSynth1 May 24, 2024
0108903
fix incorrect method name; remove unnecessary prints
SoloSynth1 May 24, 2024
8580347
add checks for permission and file/directory expectations
SoloSynth1 May 24, 2024
e136089
fix incorrect checks
SoloSynth1 May 24, 2024
8d608c8
add check for extension given when exporting reports
SoloSynth1 May 27, 2024
186d21d
fix: Fix the string format and indentation in Function 'as_quarto_mar…
tonyshumlh May 28, 2024
2 changes: 2 additions & 0 deletions checklist/checklist_sys.csv/overview.csv
@@ -0,0 +1,2 @@
Title,Description
Checklist for Tests in Machine Learning Projects,This is a comprehensive checklist for evaluating the data and ML pipeline based on identified testing strategies from experts in the field.
9 changes: 9 additions & 0 deletions checklist/checklist_sys.csv/tests.csv
@@ -0,0 +1,9 @@
ID,Topic,Title,Requirement,Explanation,References
2.1,Data Presence,Test Data Fetching and File Reading,"Verify that the data fetching API or data file reading functionality works correctly. Ensure that proper error handling is in place for scenarios such as missing files, incorrect file formats, and network errors.","Ensure that the code responsible for fetching or reading data can handle errors. This means if the file is missing, the format is wrong, or there's a network issue, the system should not crash but should provide a clear error message indicating the problem.",(general knowledge)
3.1,Data Quality,Validate Data Shape and Values,"Check that the data has the expected shape and that all values meet domain-specific constraints, such as non-negative distances.","Check that the data being used has the correct structure (like having the right number of columns) and that the values within the data make sense (e.g., distances should not be negative). This ensures that the data is valid and reliable for model training.","alexander2024Evaluating, ISO/IEC5259"
3.2,Data Quality,Check for Duplicate Records in Data,Check for duplicate records in the dataset and ensure that there are none.,"Ensure that the dataset does not contain duplicate entries, as these can skew the results and reduce the model's performance. The test should identify any repeated records so they can be removed or investigated.",ISO/IEC5259
4.1,Data Ingestion,Verify Data Split Proportion,Check that the data is split into training and testing sets in the expected proportion.,"Confirm that the data is divided correctly into training and testing sets according to the intended ratio. This is crucial for ensuring that the model is trained and evaluated properly, with representative samples in each set.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
5.1,Model Fitting,Test Model Output Shape,Validate that the model's output has the expected shape.,"Ensure that the output from the model has the correct dimensions and structure. For example, in a classification task, if the model should output probabilities for each class, the test should verify that the output is an array with the correct dimensions. Ensuring the correct output shape helps prevent runtime errors and ensures consistency in how data is handled downstream.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.1,Model Evaluation,Verify Evaluation Metrics Implementation,Verify that the evaluation metrics are correctly implemented and appropriate for the model's task.,Confirm that the metrics used to evaluate the model are implemented correctly and are suitable for the specific task at hand. This helps in accurately assessing the model's performance and understanding its strengths and weaknesses.,"openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.2,Model Evaluation,Evaluate Model's Performance Against Thresholds,"Compute evaluation metrics for both the training and testing datasets and ensure that these metrics exceed predefined threshold values, indicating acceptable model performance.","This ensures that the model's performance meets or exceeds certain benchmarks. By setting thresholds for metrics like accuracy or precision, you can automatically flag models that underperform or overfit. This is crucial for maintaining a baseline quality of results and for ensuring that the model meets the requirements necessary for deployment.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
8.1,Data Quality (Optional),Validate Outliers Detection and Handling,Detect outliers in the dataset. Ensure that the outlier detection mechanism is sensitive enough to flag true outliers while ignoring minor anomalies.,The detection method should be precise enough to catch significant anomalies without being misled by minor variations. This is important for maintaining data quality and ensuring the model's reliability in certain projects.,ISO/IEC5259
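To make items 3.1 and 3.2 concrete, here is a minimal pytest sketch of what such checks could look like. The file path data/trips.csv and the column names are hypothetical stand-ins for illustration; they are not part of this PR.

import pandas as pd
import pytest


@pytest.fixture
def trips() -> pd.DataFrame:
    # Hypothetical loader; a real project would use its own data-loading code.
    return pd.read_csv("data/trips.csv")


def test_data_shape_and_value_constraints(trips):
    # Item 3.1: expected structure plus domain-specific value constraints.
    assert list(trips.columns) == ["trip_id", "distance_km", "duration_min"]
    assert (trips["distance_km"] >= 0).all(), "distances must be non-negative"


def test_no_duplicate_records(trips):
    # Item 3.2: the dataset should contain no repeated rows.
    assert not trips.duplicated().any()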
9 changes: 9 additions & 0 deletions checklist/checklist_sys.csv/topics.csv
@@ -0,0 +1,9 @@
ID,Topic,Description
1,General,The following items describe best practices for all tests to be written.
2,Data Presence,"The following items describe tests that need to be done for testing the presence of data. This area of tests mainly concerns whether the reading and saving operations behave as expected, and ensures that any unexpected behavior does not pass silently."
3,Data Quality,"The following items describe tests that need to be done for testing the quality of data. This area of tests mainly concerns whether the supplied data is in the expected format and how data containing null values or outliers is handled, to make sure that the data processing pipeline is robust."
4,Data Ingestion,The following items describe tests that need to be done for testing whether the data is ingested properly.
5,Model Fitting,The following items describe tests that need to be done for testing the model fitting process. The unit tests written for this section usually mock model load and model predictions similarly to mocking file access.
6,Model Evaluation,The following items describe tests that need to be done for testing the model evaluation process.
7,Artifact Testing,"The following items involve explicit checks for behaviors that we expect the artifacts, e.g. models, plots, etc., to follow."
8,Data Quality (Optional),"The following items describe tests that need to be done for testing the quality of data, but they may not be applicable to all projects."
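Topic 5 mentions mocking the model load and predictions in unit tests, similarly to mocking file access. Below is a minimal sketch of that pattern, assuming a predict()-style interface; the shapes and names are illustrative assumptions, not from this PR.

from unittest.mock import MagicMock

import numpy as np


def test_model_output_shape_with_mocked_model():
    # Mock the model instead of loading a real artifact, mirroring
    # how file access is typically mocked in unit tests.
    model = MagicMock()
    model.predict.return_value = np.zeros((32, 3))  # 32 samples, 3 classes

    predictions = model.predict(np.zeros((32, 10)))

    assert predictions.shape == (32, 3)
    model.predict.assert_called_once()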
55 changes: 55 additions & 0 deletions checklist/references.bib
@@ -73,3 +73,58 @@ @misc{ribeiro2020accuracy
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}

@misc{alexander2024Evaluating,
title = {Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs},
author = {Rohan Alexander and Lindsay Katz and Callandra Moore and Michaela Drouillard and Michael Wing-Cheung Wong and Zane Schwartz},
year = 2024,
eprint = {2310.01402v2},
archiveprefix = {arXiv},
primaryclass = {stat.ME}
}

@misc{ISO/IEC5259,
title = {ISO/IEC DIS 5259 Artificial intelligence — Data quality for analytics and machine learning (ML)},
author = {ISO/IEC},
year = 2024,
month = {July},
url = {https://www.iso.org/standard/81088.html}
}

@misc{hynes2017,
title = {The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets},
author = {Nick Hynes and D. Sculley and Michael Terry},
year = 2017,
url = {http://learningsys.org/nips17/assets/papers/paper_19.pdf}
}

@article{openja2023studying,
title = {Studying the Practices of Testing Machine Learning Software in the Wild},
author = {Openja, Moses and Khomh, Foutse and Foundjem, Armstrong and Jiang, Zhen Ming and Abidi, Mouna and Hassan, Ahmed E.},
journal = {arXiv preprint arXiv:2312.12604},
year = {2023}
}

@inproceedings{DBLP:conf/recsys/Kula15,
author = {Maciej Kula},
editor = {Toine Bogers and
Marijn Koolen},
title = {Metadata Embeddings for User and Item Cold-start Recommendations},
booktitle = {Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender
Systems co-located with 9th {ACM} Conference on Recommender Systems
(RecSys 2015), Vienna, Austria, September 16-20, 2015.},
series = {{CEUR} Workshop Proceedings},
volume = {1448},
pages = {14--21},
publisher = {CEUR-WS.org},
year = {2015},
url = {http://ceur-ws.org/Vol-1448/paper4.pdf},
}

@misc{singh2020mmf,
author = {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
title = {MMF: A multimodal framework for vision and language research},
howpublished = {\url{https://github.com/facebookresearch/mmf}},
year = {2020}
}
10 changes: 8 additions & 2 deletions src/test_creation/analyze.py
@@ -113,7 +113,12 @@ def evaluate(self, verbose: bool = False) -> List[dict]:


if __name__ == '__main__':
def main(checklist_path, repo_path):
def main(checklist_path, repo_path, report_output_path, report_output_format='html'):
"""
Example:
--------
$ python src/test_creation/analyze.py --checklist_path='./checklist/checklist_demo.csv' --repo_path='../lightfm/' --report_output_path='./report/evaluation_report.html' --report_output_format='html'
"""
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)
extractor = PythonTestFileExtractor(Repository(repo_path))
@@ -122,6 +127,7 @@ def main(checklist_path, repo_path):
response = evaluator.evaluate()

parser = ResponseParser(response)
parser.get_completeness_score()
parser.get_completeness_score(verbose=True)
parser.export_evaluation_report(report_output_path, report_output_format, exist_ok=True)

fire.Fire(main)
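Commit 8b27235 introduces mixins for exporting actions, and commits 8580347 and 8d608c8 add checks for permissions, file/directory expectations, and file extensions. As a rough sketch of what such a mixin could look like, assuming the class and method names below (they are illustrative guesses, not the PR's actual implementation):

import os


class ExportableMixin:
    """Illustrative export mixin; the host class must provide as_markdown()."""

    def as_markdown(self) -> str:
        raise NotImplementedError  # implemented by the host class, e.g. a checklist or parser

    def _validate_target(self, path: str, ext: str, exist_ok: bool) -> None:
        # The kinds of checks the commits above describe: extension given,
        # target directory exists, no silent overwrite.
        if not path.endswith(ext):
            raise ValueError(f"expected a '{ext}' file, got '{path}'")
        directory = os.path.dirname(path) or "."
        if not os.path.isdir(directory):
            raise FileNotFoundError(f"no such directory: '{directory}'")
        if os.path.exists(path) and not exist_ok:
            raise FileExistsError(path)

    def export_html(self, path: str, exist_ok: bool = False) -> None:
        self._validate_target(path, ".html", exist_ok)
        with open(path, "w") as f:
            # A real implementation would render the markdown to HTML;
            # wrapping it in <pre> keeps this sketch self-contained.
            f.write(f"<html><body><pre>{self.as_markdown()}</pre></body></html>")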
25 changes: 25 additions & 0 deletions src/test_creation/checklist_export.py
@@ -0,0 +1,25 @@
import fire

from modules.checklist.checklist import Checklist, ChecklistFormat


def export_checklist(checklist_path: str):
"""Example calls. To be removed later.

Example:
python src/test_creation/checklist_export.py ./checklist/test-dump-csv

Note that the supplied path must be a directory containing 3 CSV files:
1. `overview.csv`
2. `topics.csv`
3. `tests.csv`
"""
__package__ = ''
checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)
print(checklist.as_markdown())
checklist.export_html("checklist.html", exist_ok=True)
checklist.export_pdf("checklist.pdf", exist_ok=True)


if __name__ == "__main__":
fire.Fire(export_checklist)
177 changes: 177 additions & 0 deletions src/test_creation/demo_report_export.ipynb
@@ -0,0 +1,177 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "669bb292-2b53-4a28-8d5f-ef6f3687f440",
"metadata": {},
"source": [
"## Evaluation Report Export Function Demo - For Development"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d2c1ead7-9d5b-4414-80e2-07092ba180ca",
"metadata": {},
"outputs": [],
"source": [
"from analyze import *\n",
"from analyze import TestEvaluator\n",
"from modules.checklist.checklist import Checklist, ChecklistFormat\n",
"from modules.code_analyzer.repo import Repository\n",
"from modules.workflow.files import PythonTestFileExtractor, RepoFileExtractor\n",
"from modules.workflow.parse import ResponseParser\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ad0a59a9-185c-4f17-a0dd-fa2534958ecb",
"metadata": {},
"outputs": [],
"source": [
"repo_path = '../../../lightfm/'\n",
"checklist_path = '../../checklist/checklist_demo.csv'\n",
"report_output_path_html = '../../report/evaluation_report.html'\n",
"report_output_path_pdf = '../../report/evaluation_report.pdf'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d717ba5d-dc9d-477d-a9db-ccb993f48f09",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00, 7.37s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Report:\n",
" Requirement \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names Each test function should have a clear, descri... \n",
"1.2 Keep Tests Focused Each test should focus on a single scenario, u... \n",
"2.1 Ensure Data File Loads as Expected Ensure that data-loading functions correctly l... \n",
"5.1 Validate Model Input and Output Compatibility Confirm that the model accepts inputs of the c... \n",
"\n",
" is_Satisfied \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names 1 \n",
"1.2 Keep Tests Focused 1 \n",
"2.1 Ensure Data File Loads as Expected 0 \n",
"5.1 Validate Model Input and Output Compatibility 0 \n",
"\n",
" n_files_tested \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names 2 \n",
"1.2 Keep Tests Focused 2 \n",
"2.1 Ensure Data File Loads as Expected 2 \n",
"5.1 Validate Model Input and Output Compatibility 2 \n",
"\n",
" Observations \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names [(test_cross_validation.py) The test function ... \n",
"1.2 Keep Tests Focused [(test_cross_validation.py) The test function ... \n",
"2.1 Ensure Data File Loads as Expected [(test_cross_validation.py) The code does not ... \n",
"5.1 Validate Model Input and Output Compatibility [(test_cross_validation.py) The code does not ... \n",
"\n",
" Function References \n",
"ID Title \n",
"1.1 Write Descriptive Test Names [{'File Path': '../../../lightfm/tests/test_cr... \n",
"1.2 Keep Tests Focused [{'File Path': '../../../lightfm/tests/test_cr... \n",
"2.1 Ensure Data File Loads as Expected [{'File Path': '../../../lightfm/tests/test_cr... \n",
"5.1 Validate Model Input and Output Compatibility [{'File Path': '../../../lightfm/tests/test_cr... \n",
"\n",
"Score: 2/4\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"'2/4'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
"checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)\n",
"extractor = PythonTestFileExtractor(Repository(repo_path))\n",
"\n",
"evaluator = TestEvaluator(llm, extractor, checklist)\n",
"response = evaluator.evaluate()\n",
"\n",
"parser = ResponseParser(response)\n",
"parser.get_completeness_score(verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "273db18c-13c4-4c86-a4c8-f42e0b0e37c5",
"metadata": {},
"outputs": [],
"source": [
"parser.export_evaluation_report(report_output_path_html, 'html', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5a682a42-8807-48c6-9de4-0558838e3ccd",
"metadata": {},
"outputs": [],
"source": [
"parser.export_evaluation_report(report_output_path_pdf, 'pdf', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07875448-9c58-4ec0-94b8-de9be8870011",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:test-creation]",
"language": "python",
"name": "conda-env-test-creation-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}