quantify consistency improvement #93

Merged
14 commits merged on May 28, 2024
1 change: 1 addition & 0 deletions environment.yml
@@ -16,6 +16,7 @@ dependencies:
- ruamel.yaml=0.18.6
- tectonic=0.15.0
- tqdm=4.66.4
- scipy=1.13
- pip:
- distro==1.9.0
- h11==0.14.0
7 changes: 7 additions & 0 deletions src/test_creation/archive/README.md
@@ -0,0 +1,7 @@
## NOTE

This `archive/` directory holds the F-score comparison between the week 3 code (before refactoring, i.e. the old code base, as of 2024-05-17) and the week 4 code (after refactoring, as of 2024-05-24). We keep the old code base (`archive/analyze.py`) and an adjusted `ConsistencyEvaluator` (`archive/llm_eval/consistency_eval.py`) so that the evaluator also works with the old code.

We keep a record of this comparison in case it is requested later.

This folder may be deleted once we compare newer versions against each other. For now, everything related to the demo and the old code base lives under `archive/` so that it does not disturb the latest code base.
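
A rough sketch of how this comparison could be reproduced, assuming each evaluator's per-item "Satisfied" decisions (old vs. refactored code base) are exported to CSV next to manually labelled ground truth; the file names (`old_scores.csv`, `new_scores.csv`) and column names are placeholders, not files in this archive:

```python
# Rough sketch only: assumes each evaluator's per-item "Satisfied" decisions
# have been exported to CSV alongside manually labelled ground truth.
# File and column names below are placeholders.
import pandas as pd


def f1_score(y_true, y_pred):
    """Plain binary F1 (1 = Satisfied, 0 = Not Satisfied)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


for label, path in [("week 3 (old)", "old_scores.csv"),
                    ("week 4 (refactored)", "new_scores.csv")]:
    df = pd.read_csv(path)  # expected columns: ID, ground_truth, predicted
    print(f"{label} F1: {f1_score(df['ground_truth'], df['predicted']):.3f}")
```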
273 changes: 273 additions & 0 deletions src/test_creation/archive/analyze.py
@@ -0,0 +1,273 @@
import json
import pprint
from collections import defaultdict

import fire
import pandas as pd
from dotenv import load_dotenv
from tqdm import tqdm
from langchain_community.document_loaders import DirectoryLoader, PythonLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.memory import ChatMessageHistory
from langchain_core.messages import AIMessage, HumanMessage

from modules.checklist.checklist import Checklist, ChecklistFormat
from modules.code_analyzer.repo import Repository

load_dotenv()


class TestEvaluator:
    def __init__(self, repo_path=None):
        self.repo = None
        self.test_fps = []  # test file paths
        self.test_dir_path = ''  # test dir path # FIXME: required by `load_test_dir`
        self.py_splits = []

        # FIXME: Tony's "Checklist - After Engineering" version
        self.checklist = """
Each test function should have a clear, descriptive name that accurately reflects the test's purpose and the specific functionality or scenario it examines.
Each test should focus on a single scenario, using only one set of mock data and testing one specific behavior or outcome to ensure clarity and isolate issues.
Assertions within tests should be focused and narrow. Ensure you are only testing relevant behaviors of complex objects and not including unrelated assertions.
Keep any modifications to objects and the corresponding assertions close together in your tests to maintain readability and clearly show the cause-and-effect relationship.
Ensure that data-loading functions correctly load files when they exist and match the expected format, handle non-existent files appropriately, and return the expected results.
Verify that functions for saving data and figures perform write operations correctly, checking that the operation succeeds and the content matches the expected format.
Ensure all data files are non-empty and contain the necessary data required for further analysis or processing tasks.
Verify that the data to be ingested matches the format expected by processing algorithms (like pd.DataFrame for CSVs or np.array for images) and adheres to the expected schema.
Check that data files are free from unexpected null values and identify any outliers that could affect the analysis. Tests should explicitly state if null values are part of expected data.
Test that a fixed input to a function or model produces the expected output, focusing on one verification per test to ensure predictable behavior.
Confirm that the model accepts inputs of the correct shapes and types and produces outputs that meet the expected shapes and types without any errors.
For parametric models, ensure that the model's weights update correctly per training iteration. For non-parametric models, verify that the data fits correctly into the model.
Ensure the shape of the model's output aligns with the expected structure based on the task, such as matching the number of labels in a classification task.
Verify that the model's output values are appropriate for its task, such as outputting probabilities that sum to 1 for classification tasks.
If using gradient descent for training, verify that a single gradient step on a batch of data results in a decrease in the model's training loss.
Confirm that there is no leakage of data between training, validation, and testing sets, or across cross-validation folds, to ensure the integrity of the splits.
"""
        self.system_message = []
        self.model = 'gpt-3.5-turbo'
        self.temperature = 0
        self.chain = None

        # self.evaluation_message = """
        # Your task is to answer each question in the checklist using only the provided test functions.
        # If an answer to the question is provided, it must be annotated with a citation of the test function(s) in the Observation section.
        # Then, decide the completion score in a fraction format based on your answers. The denominator should be the number of checklist items.
        # Desired format:
        # Checklist Evaluation:
        # ID:
        # Title:
        # Requirement:
        # Observation:
        # Evaluation: Satisfied/Partially Satisfied/Not Satisfied
        # Score: (1 for Satisfied / 0.5 for Partially Satisfied / 0 for Not Satisfied)
        # Completion Score: Number of satisfied requirements/Number of requirements
        # Number of satisfied requirements:
        # Number of partially satisfied requirements:
        # Number of not satisfied requirements:
        # """
        self.evaluation_message = """
        Your task is to answer each question in the checklist using only the provided test functions.
        If an answer to the question is provided, it must be annotated with a citation of the test function(s) in the Observation section.
        Output a JSON format:
        [{
            "ID":
            "Title":
            "Requirement":
            "Observation":
            "Functions": [ ... ]
            "Evaluation": Satisfied/Partially Satisfied/Not Satisfied
            "Score": (1 for Satisfied / 0.5 for Partially Satisfied / 0 for Not Satisfied)
        }]
        """

        self.evaluation_result = None

        if repo_path is not None:
            self.load_repo(repo_path)

    def load_repo(self, repo_path):
        self.repo = Repository(repo_path)
        self.test_fps = self.repo.list_test_files()['Python']

    def load_test_file(self, file_path, overwrite=True):
        loader = PythonLoader(file_path)
        py = loader.load()
        py_splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(py)

        if overwrite:
            self.py_splits = py_splits

        return py_splits

    # def load_all_test_files(self):
    #     self.py_splits = []
    #     for fp in self.test_fps:
    #         self.py_splits += self.load_test_file(fp, overwrite=False)

    def load_test_dir(self, dir_path):
        self.test_dir_path = dir_path

        loader = DirectoryLoader(
            dir_path,
            glob="**/*.py",
            show_progress=True,
            loader_cls=PythonLoader
        )
        docs = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
        self.py_splits = text_splitter.split_documents(docs)

    def load_checklist(self, checklist_path):
        raw_checklist = Checklist(checklist_path, checklist_format=ChecklistFormat.CSV)

        checklist = []
        for item in raw_checklist.get_all_tests():
            checklist.append({
                'ID': item['ID'],
                'Title': item['Title'],
                'Requirement': item['Requirement']
            })

        self.checklist = json.dumps(checklist).replace('{', '[').replace('}', ']')

    def init_system_message(self):
        if len(self.checklist) == 0:
            # self.load_checklist()
            raise ValueError("Checklist is empty, make sure you have configured the checklist loader right!")

        self.system_message = [
            ("system",
             "You are a senior machine learning engineer who specializes in performing Machine Learning system testing. Extract and analyze the test functions from the codes:\n\n{context}"),
            ("system",
             f"Here is the Machine Learning system testing checklist delimited by triple quotes '''{self.checklist}'''")
        ]

    def init_chain(self, system_message=None, model=None):
        if system_message is None:
            if len(self.system_message) == 0:
                self.init_system_message()
            system_message = self.system_message
        else:
            self.system_message = system_message

        if model is None:
            model = self.model
        else:
            self.model = model

        prompt = ChatPromptTemplate.from_messages(
            system_message + [
                MessagesPlaceholder(variable_name="messages")
            ]
        )
        chat = ChatOpenAI(model=model, temperature=self.temperature)

        chain = create_stuff_documents_chain(chat, prompt)
        self.chain = chain
        return chain

    def get_ai_response(self, message, context, history=None):
        if self.chain is None:
            self.init_chain()

        if history is None:
            history = ChatMessageHistory()

        history.add_user_message(message)

        response = self.chain.invoke({
            "context": context,
            "messages": history.messages
        })
        history.add_ai_message(response)

        return response, history

    def get_evaluation_response(self, py_splits=None):
        if py_splits is None:
            py_splits = self.py_splits

        return self.get_ai_response(
            message=self.evaluation_message,
            context=py_splits
        )

    # FIXME: combine evaluation
    # to be tested
    def extract_json(self, response, start='[', end=']'):
        """Extract the JSON array embedded in the LLM response and parse it.

        Takes the substring from the first `start` character to the last
        `end` character (inclusive) and passes it to `json.loads`.
        """
        start_idx = response.index(start)
        # Distance of the last `end` character from the end of the string;
        # 0 means the response already ends with `end`.
        end_idx = response[::-1].index(end)
        if end_idx == 0:
            string = response[start_idx:]
        else:
            string = response[start_idx:-end_idx]
        return json.loads(string)

    def evaluate(self, on_file=True, verbose=False):
        result = []
        if on_file:
            for fp in tqdm(self.test_fps):
                if verbose:
                    print(fp)
                self.load_test_file(fp)
                if verbose:
                    print(f"# splits: {len(self.py_splits)}")
                response, history = self.get_evaluation_response()  # FIXME: it sometimes tests only part of the checklist items
                report = self.extract_json(response)
                for item in report:
                    item['file'] = fp
                result += [{
                    'file': fp,
                    'report': report,
                    'history': history
                }]
        else:
            self.load_test_dir(self.test_dir_path)
            response, history = self.get_evaluation_response()
            report = self.extract_json(response)
            for item in report:
                item['file'] = self.test_dir_path
            result += [{
                'file': self.test_dir_path,
                'report': report,
                'history': history
            }]

        self.evaluation_result = result
        return

    def get_completeness_score(self, score_format='fraction', verbose=False):
        # One row per (file, checklist item): `explode` unpacks each file's report list.
        report_df = pd.DataFrame(self.evaluation_result)['report'].explode().apply(pd.Series)
        report_df = report_df.groupby(['ID', 'Title']).agg({
            'Score': ['max', 'count'],
            'Functions': ['sum']
        })
        report_df.columns = ['is_Satisfied', 'n_files_tested', 'functions']
        self.evaluation_report = report_df

        if score_format == 'fraction':
            score = f"{report_df['is_Satisfied'].sum()}/{report_df['is_Satisfied'].count()}"
        elif score_format == 'number':
            score = report_df['is_Satisfied'].sum() / report_df['is_Satisfied'].count()

        if verbose:
            print("Report:")
            print(report_df)
            print()
            print(f'Score: {score}')
            print()
        return score


if __name__ == '__main__':
    def main(checklist_path, repo_path):
        test = TestEvaluator(repo_path)
        test.load_checklist(checklist_path)
        test.evaluate()
        test.get_completeness_score()

    fire.Fire(main)
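
For reference, a minimal sketch of how this archived evaluator could be driven from Python (the same flow that `fire` exposes on the command line as `python analyze.py --checklist_path=... --repo_path=...`); the repository and checklist paths below are placeholders, and the script has to be run from a location where its `modules` imports resolve:

```python
# Minimal usage sketch for the archived evaluator; the repository and
# checklist paths are placeholders, not files shipped with this archive.
from analyze import TestEvaluator

evaluator = TestEvaluator(repo_path="/path/to/target/ml/repo")
evaluator.load_checklist("checklist_sys.csv")   # CSV-format checklist
evaluator.evaluate(verbose=True)                # one LLM call per test file
print(evaluator.get_completeness_score(score_format="fraction"))
```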
2 changes: 2 additions & 0 deletions src/test_creation/archive/checklist_sys.csv/overview.csv
@@ -0,0 +1,2 @@
Title,Description
Checklist for Tests in Machine Learning Projects,This is a comprehensive checklist for evaluating the data and ML pipeline based on identified testing strategies from experts in the field.
9 changes: 9 additions & 0 deletions src/test_creation/archive/checklist_sys.csv/tests.csv
@@ -0,0 +1,9 @@
ID,Topic,Title,Requirement,Explanation,References
2.1,Data Presence,Test Data Fetching and File Reading,"Verify that the data fetching API or data file reading functionality works correctly. Ensure that proper error handling is in place for scenarios such as missing files, incorrect file formats, and network errors.","Ensure that the code responsible for fetching or reading data can handle errors. This means if the file is missing, the format is wrong, or there's a network issue, the system should not crash but should provide a clear error message indicating the problem.",(general knowledge)
3.1,Data Quality,Validate Data Shape and Values,"Check that the data has the expected shape and that all values meet domain-specific constraints, such as non-negative distances.","Check that the data being used has the correct structure (like having the right number of columns) and that the values within the data make sense (e.g., distances should not be negative). This ensures that the data is valid and reliable for model training.","alexander2024Evaluating, ISO/IEC5259"
3.2,Data Quality,Check for Duplicate Records in Data,Check for duplicate records in the dataset and ensure that there are none.,"Ensure that the dataset does not contain duplicate entries, as these can skew the results and reduce the model's performance. The test should identify any repeated records so they can be removed or investigated.",ISO/IEC5259
4.1,Data Ingestion,Verify Data Split Proportion,Check that the data is split into training and testing sets in the expected proportion.,"Confirm that the data is divided correctly into training and testing sets according to the intended ratio. This is crucial for ensuring that the model is trained and evaluated properly, with representative samples in each set.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
5.1,Model Fitting,Test Model Output Shape,Validate that the model's output has the expected shape.,"Ensure that the output from the model has the correct dimensions and structure. For example, in a classification task, if the model should output probabilities for each class, the test should verify that the output is an array with the correct dimensions. Ensuring the correct output shape helps prevent runtime errors and ensures consistency in how data is handled downstream.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.1,Model Evaluation,Verify Evaluation Metrics Implementation,Verify that the evaluation metrics are correctly implemented and appropriate for the model's task.,Confirm that the metrics used to evaluate the model are implemented correctly and are suitable for the specific task at hand. This helps in accurately assessing the model's performance and understanding its strengths and weaknesses.,"openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
6.2,Model Evaluation,Evaluate Model's Performance Against Thresholds,"Compute evaluation metrics for both the training and testing datasets and ensure that these metrics exceed predefined threshold values, indicating acceptable model performance.","This ensures that the model's performance meets or exceeds certain benchmarks. By setting thresholds for metrics like accuracy or precision, you can automatically flag models that underperform or overfit. This is crucial for maintaining a baseline quality of results and for ensuring that the model meets the requirements necessary for deployment.","openja2023studying, DBLP:conf/recsys/Kula15, singh2020mmf"
8.1,Data Quality (Optional),Validate Outliers Detection and Handling,Detect outliers in the dataset. Ensure that the outlier detection mechanism is sensitive enough to flag true outliers while ignoring minor anomalies.,The detection method should be precise enough to catch significant anomalies without being misled by minor variations. This is important for maintaining data quality and ensuring the model's reliability in certain projects.,ISO/IEC5259
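
As an illustration of how a target project might satisfy one of these items, below is a hedged sketch of a test for item 4.1 (Verify Data Split Proportion); the `split_data` helper and the 80/20 ratio are hypothetical and stand in for whatever splitting code a repository under evaluation provides:

```python
# Illustrative only: `split_data` and the 80/20 ratio are hypothetical,
# standing in for a target repository's own split helper.
import numpy as np


def split_data(X, train_frac=0.8, seed=0):
    """Shuffle rows and split them into train/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * train_frac)
    return X[idx[:cut]], X[idx[cut:]]


def test_split_proportion():
    X = np.arange(100).reshape(50, 2)
    train, test = split_data(X, train_frac=0.8)
    assert len(train) == 40 and len(test) == 10   # expected 80/20 split
    assert len(train) + len(test) == len(X)       # no rows dropped or duplicated
```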
9 changes: 9 additions & 0 deletions src/test_creation/archive/checklist_sys.csv/topics.csv
@@ -0,0 +1,9 @@
ID,Topic,Description
1,General,The following items describe best practices for all tests to be written.
2,Data Presence,"The following items describe tests that need to be done for testing the presence of data. This area of tests mainly concerns whether the reading and saving operations behave as expected and whether any unexpected behavior is surfaced rather than passed over silently."
3,Data Quality,"The following items describe tests that need to be done for testing the quality of data. This area of tests mainly concerns whether the supplied data is in the expected format and whether null values or outliers are handled, so that the data processing pipeline is robust."
4,Data Ingestion,The following items describe tests that need to be done for testing whether the data is ingested properly.
5,Model Fitting,The following items describe tests that need to be done for testing the model fitting process. The unit tests written for this section usually mock model loading and model predictions similarly to mocking file access.
6,Model Evaluation,The following items describe tests that need to be done for testing the model evaluation process.
7,Artifact Testing,"The following items involve explicit checks for behaviors that we expect the artifacts, e.g. models, plots, etc., to follow."
8,Data Quality (Optional),"The following items describe tests that need to be done for testing the quality of data, but they may not be applicable to all projects."