Refactor the report export function to fit in refactored TestEvaluato… #95

Merged
21 commits merged on May 28, 2024

Commits
507d251
add data presence and data quality checklist
jinyz8888 May 21, 2024
be43e31
Merge remote-tracking branch 'origin/add_data_presence_and_data_quali…
May 22, 2024
689ba36
undo the changes in checklist.csv
May 22, 2024
312c5a0
docs: prepare checklist for system development
May 22, 2024
092eac7
Merge remote-tracking branch 'origin/84-extend-the-checklists-loader-…
May 22, 2024
e936aa2
added topic for optional Data Quality
May 22, 2024
f18fcc9
updated example in checklist.py
May 22, 2024
dd37cbb
fix symbol typo in checklist_sys
May 22, 2024
96c705e
renamed topic
May 22, 2024
bc6ab8a
added checklist html for visualization
May 22, 2024
e2d90af
Merge branch '84-extend-the-checklists-loader-functionality-to-export…
May 22, 2024
7a3a13a
Refactor the report export function to fit in refactored TestEvaluato…
tonyshumlh May 23, 2024
e550240
added checklist_sys
May 23, 2024
abf2472
move checklist exporting script to top level
SoloSynth1 May 24, 2024
8b27235
create mixins for exporting actions; make `Checklist` and `ResponsePa…
SoloSynth1 May 24, 2024
8efb878
remove `checklist_sys.html`
SoloSynth1 May 24, 2024
0108903
fix incorrect method name; remove unnecessary prints
SoloSynth1 May 24, 2024
8580347
add checks for permission and file/directory expectations
SoloSynth1 May 24, 2024
e136089
fix incorrect checks
SoloSynth1 May 24, 2024
8d608c8
add check for extension given when exporting reports
SoloSynth1 May 27, 2024
186d21d
fix: Fix the string format and indentation in Function 'as_quarto_mar…
tonyshumlh May 28, 2024
2 changes: 2 additions & 0 deletions checklist/checklist_sys.csv/overview.csv
@@ -0,0 +1,2 @@
Title,Description
Checklist for Tests in Machine Learning Projects,This is a comprehensive checklist for evaluating the data and ML pipeline based on identified testing strategies from experts in the field.
9 changes: 9 additions & 0 deletions checklist/checklist_sys.csv/tests.csv
@@ -0,0 +1,9 @@
ID,Topic,Title,Requirement,Explanation,References
2.1,Data Presence,Test Data Fetching and File Reading,"Verify that the data fetching API or data file reading functionality works correctly. Ensure that proper error handling is in place for scenarios such as missing files, incorrect file formats, and network errors.","Ensure that the code responsible for fetching or reading data can handle errors. This means if the file is missing, the format is wrong, or there's a network issue, the system should not crash but should provide a clear error message indicating the problem.",(general knowledge)
3.1,Data Quality,Validate Data Shape and Values,"Check that the data has the expected shape and that all values meet domain-specific constraints, such as non-negative distances.","Check that the data being used has the correct structure (like having the right number of columns) and that the values within the data make sense (e.g., distances should not be negative). This ensures that the data is valid and reliable for model training.","alexander2024Evaluating, ISO/IEC5259"
3.2,Data Quality,Check for Duplicate Records in Data,Check for duplicate records in the dataset and ensure that there are none.,"Ensure that the dataset does not contain duplicate entries, as these can skew the results and reduce the model's performance. The test should identify any repeated records so they can be removed or investigated.",ISO/IEC5259
4.1,Data Ingestion,Verify Data Split Proportion,Check that the data is split into training and testing sets in the expected proportion.,"Confirm that the data is divided correctly into training and testing sets according to the intended ratio. This is crucial for ensuring that the model is trained and evaluated properly, with representative samples in each set.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
5.1,Model Fitting,Test Model Output Shape,Validate that the model's output has the expected shape.,"Ensure that the output from the model has the correct dimensions and structure. For example, in a classification task, if the model should output probabilities for each class, the test should verify that the output is an array with the correct dimensions. Ensuring the correct output shape helps prevent runtime errors and ensures consistency in how data is handled downstream.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.1,Model Evaluation,Verify Evaluation Metrics Implementation,Verify that the evaluation metrics are correctly implemented and appropriate for the model's task.,Confirm that the metrics used to evaluate the model are implemented correctly and are suitable for the specific task at hand. This helps in accurately assessing the model's performance and understanding its strengths and weaknesses.,"openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.2,Model Evaluation,Evaluate Model's Performance Against Thresholds,"Compute evaluation metrics for both the training and testing datasets and ensure that these metrics exceed predefined threshold values, indicating acceptable model performance.","This ensures that the model's performance meets or exceeds certain benchmarks. By setting thresholds for metrics like accuracy or precision, you can automatically flag models that underperform or overfit. This is crucial for maintaining a baseline quality of results and for ensuring that the model meets the requirements necessary for deployment.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
8.1,Data Quality (Optional),Validate Outliers Detection and Handling,Detect outliers in the dataset. Ensure that the outlier detection mechanism is sensitive enough to flag true outliers while ignoring minor anomalies.,The detection method should be precise enough to catch significant anomalies without being misled by minor variations. This is important for maintaining data quality and ensuring the model's reliability in certain projects.,ISO/IEC5259
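To make items 3.1 and 3.2 concrete, here is a minimal pytest sketch of what such checks could look like. The file path data/trips.csv and the column names are hypothetical stand-ins for illustration; they are not part of this PR.

import pandas as pd
import pytest


@pytest.fixture
def trips() -> pd.DataFrame:
    # Hypothetical loader; a real project would use its own data-loading code.
    return pd.read_csv("data/trips.csv")


def test_data_shape_and_value_constraints(trips):
    # Item 3.1: expected structure plus domain-specific value constraints.
    assert list(trips.columns) == ["trip_id", "distance_km", "duration_min"]
    assert (trips["distance_km"] >= 0).all(), "distances must be non-negative"


def test_no_duplicate_records(trips):
    # Item 3.2: the dataset should contain no repeated rows.
    assert not trips.duplicated().any()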
9 changes: 9 additions & 0 deletions checklist/checklist_sys.csv/topics.csv
@@ -0,0 +1,9 @@
ID,Topic,Description
1,General,The following items describe best practices for all tests to be written.
2,Data Presence,"The following items describe tests that need to be done for testing the presence of data. This area of tests mainly concerns whether the reading and saving operations behave as expected, and ensures that any unexpected behavior does not pass silently."
3,Data Quality,"The following items describe tests that need to be done for testing the quality of data. This area of tests mainly concerns whether the supplied data is in the expected format and how data containing null values or outliers is handled, to make sure that the data processing pipeline is robust."
4,Data Ingestion,The following items describe tests that need to be done for testing whether the data is ingested properly.
5,Model Fitting,The following items describe tests that need to be done for testing the model fitting process. The unit tests written for this section usually mock model load and model predictions similarly to mocking file access.
6,Model Evaluation,The following items describe tests that need to be done for testing the model evaluation process.
7,Artifact Testing,"The following items involve explicit checks for behaviors that we expect the artifacts, e.g. models, plots, etc., to follow."
8,Data Quality (Optional),"The following items describe tests that need to be done for testing the quality of data, but they may not be applicable to all projects."
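Topic 5 mentions mocking the model load and predictions in unit tests, similarly to mocking file access. Below is a minimal sketch of that pattern, assuming a predict()-style interface; the shapes and names are illustrative assumptions, not from this PR.

from unittest.mock import MagicMock

import numpy as np


def test_model_output_shape_with_mocked_model():
    # Mock the model instead of loading a real artifact, mirroring
    # how file access is typically mocked in unit tests.
    model = MagicMock()
    model.predict.return_value = np.zeros((32, 3))  # 32 samples, 3 classes

    predictions = model.predict(np.zeros((32, 10)))

    assert predictions.shape == (32, 3)
    model.predict.assert_called_once()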
55 changes: 55 additions & 0 deletions checklist/references.bib
@@ -73,3 +73,58 @@ @misc{ribeiro2020accuracy
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}

@misc{alexander2024Evaluating,
title = {Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs},
author = {Rohan Alexander and Lindsay Katz and Callandra Moore and Michaela Drouillard and Michael Wing-Cheung Wong and Zane Schwartz},
year = 2024,
eprint = {2310.01402v2},
archiveprefix = {arXiv},
primaryclass = {stat.ME}
}

@misc{ISO/IEC5259,
title = {ISO/IEC DIS 5259 Artificial intelligence — Data quality for analytics and machine learning (ML)},
author = {ISO/IEC},
year = 2024,
month = {July},
url = {https://www.iso.org/standard/81088.html}
}

@misc{hynes2017,
title = {The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets},
author = {Nick Hynes and D. Sculley and Michael Terry},
year = 2017,
url = {http://learningsys.org/nips17/assets/papers/paper_19.pdf}
}

@article{openja2023studying,
title = {Studying the Practices of Testing Machine Learning Software in the Wild},
author = {Openja, Moses and Khomh, Foutse and Foundjem, Armstrong and Jiang, Zhen Ming and Abidi, Mouna and Hassan, Ahmed E.},
journal = {arXiv preprint arXiv:2312.12604},
year = {2023}
}

@inproceedings{DBLP:conf/recsys/Kula15,
author = {Maciej Kula},
editor = {Toine Bogers and
Marijn Koolen},
title = {Metadata Embeddings for User and Item Cold-start Recommendations},
booktitle = {Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender
Systems co-located with 9th {ACM} Conference on Recommender Systems
(RecSys 2015), Vienna, Austria, September 16-20, 2015.},
series = {{CEUR} Workshop Proceedings},
volume = {1448},
pages = {14--21},
publisher = {CEUR-WS.org},
year = {2015},
url = {http://ceur-ws.org/Vol-1448/paper4.pdf},
}

@misc{singh2020mmf,
author = {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
title = {MMF: A multimodal framework for vision and language research},
howpublished = {\url{https://github.com/facebookresearch/mmf}},
year = {2020}
}
10 changes: 8 additions & 2 deletions src/test_creation/analyze.py
@@ -113,7 +113,12 @@ def evaluate(self, verbose: bool = False) -> List[dict]:


if __name__ == '__main__':
def main(checklist_path, repo_path):
def main(checklist_path, repo_path, report_output_path, report_output_format='html'):
"""
Example:
--------
$ python src/test_creation/analyze.py --checklist_path='./checklist/checklist_demo.csv' --repo_path='../lightfm/' --report_output_path='./report/evaluation_report.html' --report_output_format='html'
"""
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)
extractor = PythonTestFileExtractor(Repository(repo_path))
@@ -122,6 +127,7 @@ def main(checklist_path, repo_path):
response = evaluator.evaluate()

parser = ResponseParser(response)
parser.get_completeness_score()
parser.get_completeness_score(verbose=True)
parser.export_evaluation_report(report_output_path, report_output_format, exist_ok=True)

fire.Fire(main)
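Commit 8b27235 introduces mixins for exporting actions, and commits 8580347 and 8d608c8 add checks for permissions, file/directory expectations, and file extensions. As a rough sketch of what such a mixin could look like, assuming the class and method names below (they are illustrative guesses, not the PR's actual implementation):

import os


class ExportableMixin:
    """Illustrative export mixin; the host class must provide as_markdown()."""

    def as_markdown(self) -> str:
        raise NotImplementedError  # implemented by the host class, e.g. a checklist or parser

    def _validate_target(self, path: str, ext: str, exist_ok: bool) -> None:
        # The kinds of checks the commits above describe: extension given,
        # target directory exists, no silent overwrite.
        if not path.endswith(ext):
            raise ValueError(f"expected a '{ext}' file, got '{path}'")
        directory = os.path.dirname(path) or "."
        if not os.path.isdir(directory):
            raise FileNotFoundError(f"no such directory: '{directory}'")
        if os.path.exists(path) and not exist_ok:
            raise FileExistsError(path)

    def export_html(self, path: str, exist_ok: bool = False) -> None:
        self._validate_target(path, ".html", exist_ok)
        with open(path, "w") as f:
            # A real implementation would render the markdown to HTML;
            # wrapping it in <pre> keeps this sketch self-contained.
            f.write(f"<html><body><pre>{self.as_markdown()}</pre></body></html>")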
25 changes: 25 additions & 0 deletions src/test_creation/checklist_export.py
@@ -0,0 +1,25 @@
import fire

from modules.checklist.checklist import Checklist, ChecklistFormat


def export_checklist(checklist_path: str):
"""Example calls. To be removed later.

Example:
python src/test_creation/checklist_export.py ./checklist/test-dump-csv

Note that the supplied path must be a directory containing 3 CSV files:
1. `overview.csv`
2. `topics.csv`
3. `tests.csv`
"""
__package__ = ''
checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)
print(checklist.as_markdown())
checklist.export_html("checklist.html", exist_ok=True)
checklist.export_pdf("checklist.pdf", exist_ok=True)


if __name__ == "__main__":
fire.Fire(export_checklist)
177 changes: 177 additions & 0 deletions src/test_creation/demo_report_export.ipynb
@@ -0,0 +1,177 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "669bb292-2b53-4a28-8d5f-ef6f3687f440",
"metadata": {},
"source": [
"## Evaluation Report Export Function Demo - For Development"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d2c1ead7-9d5b-4414-80e2-07092ba180ca",
"metadata": {},
"outputs": [],
"source": [
"from analyze import *\n",
"from analyze import TestEvaluator\n",
"from modules.checklist.checklist import Checklist, ChecklistFormat\n",
"from modules.code_analyzer.repo import Repository\n",
"from modules.workflow.files import PythonTestFileExtractor, RepoFileExtractor\n",
"from modules.workflow.parse import ResponseParser\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ad0a59a9-185c-4f17-a0dd-fa2534958ecb",
"metadata": {},
"outputs": [],
"source": [
"repo_path = '../../../lightfm/'\n",
"checklist_path = '../../checklist/checklist_demo.csv'\n",
"report_output_path_html = '../../report/evaluation_report.html'\n",
"report_output_path_pdf = '../../report/evaluation_report.pdf'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d717ba5d-dc9d-477d-a9db-ccb993f48f09",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00, 7.37s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Report:\n",
" Requirement \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names Each test function should have a clear, descri... \n",
"1.2 Keep Tests Focused Each test should focus on a single scenario, u... \n",
"2.1 Ensure Data File Loads as Expected Ensure that data-loading functions correctly l... \n",
"5.1 Validate Model Input and Output Compatibility Confirm that the model accepts inputs of the c... \n",
"\n",
" is_Satisfied \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names 1 \n",
"1.2 Keep Tests Focused 1 \n",
"2.1 Ensure Data File Loads as Expected 0 \n",
"5.1 Validate Model Input and Output Compatibility 0 \n",
"\n",
" n_files_tested \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names 2 \n",
"1.2 Keep Tests Focused 2 \n",
"2.1 Ensure Data File Loads as Expected 2 \n",
"5.1 Validate Model Input and Output Compatibility 2 \n",
"\n",
" Observations \\\n",
"ID Title \n",
"1.1 Write Descriptive Test Names [(test_cross_validation.py) The test function ... \n",
"1.2 Keep Tests Focused [(test_cross_validation.py) The test function ... \n",
"2.1 Ensure Data File Loads as Expected [(test_cross_validation.py) The code does not ... \n",
"5.1 Validate Model Input and Output Compatibility [(test_cross_validation.py) The code does not ... \n",
"\n",
" Function References \n",
"ID Title \n",
"1.1 Write Descriptive Test Names [{'File Path': '../../../lightfm/tests/test_cr... \n",
"1.2 Keep Tests Focused [{'File Path': '../../../lightfm/tests/test_cr... \n",
"2.1 Ensure Data File Loads as Expected [{'File Path': '../../../lightfm/tests/test_cr... \n",
"5.1 Validate Model Input and Output Compatibility [{'File Path': '../../../lightfm/tests/test_cr... \n",
"\n",
"Score: 2/4\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"'2/4'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
"checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)\n",
"extractor = PythonTestFileExtractor(Repository(repo_path))\n",
"\n",
"evaluator = TestEvaluator(llm, extractor, checklist)\n",
"response = evaluator.evaluate()\n",
"\n",
"parser = ResponseParser(response)\n",
"parser.get_completeness_score(verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "273db18c-13c4-4c86-a4c8-f42e0b0e37c5",
"metadata": {},
"outputs": [],
"source": [
"parser.export_evaluation_report(report_output_path_html, 'html', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5a682a42-8807-48c6-9de4-0558838e3ccd",
"metadata": {},
"outputs": [],
"source": [
"parser.export_evaluation_report(report_output_path_pdf, 'pdf', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07875448-9c58-4ec0-94b8-de9be8870011",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:test-creation]",
"language": "python",
"name": "conda-env-test-creation-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}