
Meeting Minutes for Week 4 #72

Closed · 17 of 23 tasks
JohnShiuMK opened this issue May 20, 2024 · 6 comments

Labels: admin meeting related

Comments


JohnShiuMK commented May 20, 2024

Sprint Planning - 2024-05-20 Week 4


tonyshumlh commented May 23, 2024

Mentor Meeting 2024/05/23 - Week 4

  • For the checklist, we can add a flag (0 or 1) stating whether each item is ready to be used by the system or is only for human reference. This also makes it easier to select items for testing (see the first sketch after this list).
  • This is better than keeping 2 checklists, since all changes are then made in 1 single file.
  • Do we have to keep it as a CLI tool, or provide an API?
  • Do we have to containerize the tool, put it into a service, and expose it as an API?
  • The aggregated JSON output from the LLM can be dumped as one of the output types for users (e.g. researchers using the CLI).
  • For web app users (e.g. a dashboard), DataFrame/HTML/PDF outputs can be offered in the web app via a button (e.g. download).
  • Stumbling blocks with GPT-3.5-turbo: 1) it trims the response result; 2) attributes are missing from the response result.
  • Add metadata to the LLM response result, e.g. repo, date, model, retries [success, retry 1, retry 2, ... plus other useful info] (see the second sketch after this list).
  • Try to evaluate which repos are consistently working/not working, and feed the not-working repos into GPT-4o for testing.
  • Failed run results can be stored and reviewed as an error log for enhancement.
  • Having output per stage is useful for the paper Tiffany plans to publish.
  • The web app should contain 1) a repo input box; 2) a checklist item display; 3) an evaluation report display; 4) a button to download raw results for human evaluation, research-paper writing, or other development purposes.
  • For the research paper: 1) the code base; 2) results that show effectiveness (e.g. running the 11 Openja repos); 3) a summary of consistency across multiple repos and multiple checklist items; 4) a write-up of the design considerations.
  • We have to provide the tool and its outputs (e.g. the error log / raw results) to empower the researchers to do the above analysis for their paper.
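As a rough sketch of the readiness-flag idea from the first two bullets (the `is_system_ready` field name and the items are made up, not the project's actual schema):

```python
# Hypothetical single-file checklist with a 0/1 readiness flag per item.
import json

checklist = [
    {"id": "2.1", "title": "Tests are independent of each other", "is_system_ready": 1},
    {"id": "2.2", "title": "Document the testing strategy", "is_system_ready": 0},
]

# Items flagged 1 are consumed by the system; items flagged 0 remain
# human-reference notes. The same filter makes it easy to pick a subset
# of items for testing.
system_items = [item for item in checklist if item["is_system_ready"] == 1]
print(json.dumps(system_items, indent=2))
```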
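And a sketch of the response-metadata bullet, with all key names assumed for illustration:

```python
# Wrap a raw LLM response with run metadata (repo, date, model, retries)
# so failed runs can be stored and reviewed as an error log later.
from datetime import datetime, timezone

def wrap_response(raw_response, repo, model, attempts):
    return {
        "repo": repo,                                    # e.g. "owner/repo"
        "date": datetime.now(timezone.utc).isoformat(),
        "model": model,                                  # e.g. "gpt-3.5-turbo"
        "retries": attempts,                             # e.g. ["retry 1", "success"]
        "response": raw_response,
    }

record = wrap_response({"is_satisfied": True}, "owner/repo",
                       "gpt-3.5-turbo", ["retry 1", "success"])
```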

@tonyshumlh

Partner Meeting 2024/05/24 - Week 4

  • Checklist
    • Is the explanation useful for the LLM?
    • Change "1 General" to "1.0 General".
    • Expect a table of contents for click-through navigation of the checklist, OR just output several tables for the checklist visualization.
    • We currently use a custom function to convert the object into Markdown and then to HTML/PDF -> use existing functions for the conversion for maintainability, e.g. the Jinja package, and output a Quarto document (refer to the new 524 note).
    • Switch to Quarto + Python for the output formatting, OR mention it under Future Development.
  • System
    • The report displayed in the CLI should show ID, Title, is_satisfied, information about the failure, and other columns.
    • (Good to have) Line numbers to show each function's location; embed a hyperlink in the line number to open the test file at that line.
  • System Evaluation
    • The sample size is too small, which leads to a high p-value; ideally it should be 20-30.
    • Change the hypothesis and do 2-tailed tests to evaluate whether the new development is worse than the previous version, as we are not sure the new version must be better than the previous one.
    • Look into other Python stats packages.
    • Tiffany will review the false-negative solution.
    • Mutation testing could be more efficient for testing LLM accuracy (see the sketch after this list).
    • OR human-evaluate the 10 repos from Openja.
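A toy illustration of the mutation-testing idea mentioned in the list above; a real setup would lean on an existing tool such as mutmut. The `add` function and its test are invented: a useful test suite passes on the original code and fails on (kills) the mutant.

```python
# Minimal hand-rolled mutation test: plant one fault and see whether the
# test suite notices it.
import pathlib
import subprocess
import tempfile

original = "def add(a, b):\n    return a + b\n"
mutant = original.replace("a + b", "a - b")  # the single injected fault

test_code = (
    "from target import add\n"
    "def test_add():\n"
    "    assert add(2, 3) == 5\n"
)

for label, source in [("original", original), ("mutant", mutant)]:
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, "target.py").write_text(source)
        pathlib.Path(d, "test_target.py").write_text(test_code)
        passed = subprocess.run(
            ["python", "-m", "pytest", "-q", d], capture_output=True
        ).returncode == 0
        print(f"{label}: {'tests pass' if passed else 'mutant killed'}")
```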


JohnShiuMK commented May 24, 2024

Partner Meeting Minutes - May 24, 2024

Attendees: John, Orix, Simon (Mentor), Tiffany (Partner), Tony, Yingzi

Checklist for Leader Persona

  • Discussed the reasons for using Python to output the report: we plan to deliver our system as a Python package which, once installed, will provide a runnable command from the CLI. This approach offers a more streamlined experience for users, instead of creating an intermediate artifact and instructing users to render it using another tool. (Please refer to the elaboration in SoloSynth1's comment below.)
  • To consider incorporating "Explanation" into prompts to provide context for the LLM.
  • To address minor formatting issues, such as changing "1 General" to "1. General".
  • To respond to Tiffany's comment in the PR: GitHub PR #91
  • (Good to have) To implement a table of contents for easier navigation through the checklist, or alternatively, output several tables for Checklist visualization.
  • (Good to have) To consider using the Jinja package or Quarto in Python to enhance checklist visualization (refer to new note 524), but limit time spent on this feature.
    • Or, to mention these alternatives in the future development in the final report.

System for Researcher Persona

  • The Report displayed in the terminal should include ID, Title, is_Satisfied, ..., i.e. maintaining the same format as demonstrated last week.
  • To include line numbers to indicate the function's location in the source code.
  • (Good to have) To embed hyperlinks within line numbers to enable opening the test file directly at the corresponding line on GitHub (see the sketch below).
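A small sketch of the hyperlink idea above; GitHub opens a file at a given line via the `#L<n>` anchor (the repo and branch values here are placeholders):

```python
# Build a GitHub permalink that opens a test file at a specific line.
def github_line_url(repo, branch, path, line):
    return f"https://github.com/{repo}/blob/{branch}/{path}#L{line}"

print(github_line_url("owner/project", "main", "tests/test_model.py", 42))
# -> https://github.com/owner/project/blob/main/tests/test_model.py#L42
```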

System Evaluation for Ourselves (System Developer Persona)

  • Consistency:

    • To increase the sample size to 20-30 for a more robust p-value calculation.
    • To modify the hypothesis to conduct 2-tailed tests instead of 1-tailed, as it cannot be assumed that the new development is definitively better than the previous version.
    • John to post a question about Type-II error in the group; Tiffany to review.
    • (Good to have) To explore stat packages for an F-test (we're currently using scipy; see the sketch after this list)
  • Accuracy:

    • Mutation testing may be more efficient for testing LLM accuracy, but the scope may become too large for the capstone project.
    • Instead, Simon suggested maintaining a mechanism for future users to provide feedback on our accuracy, enabling continuous improvement.
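A minimal sketch of the two-tailed comparison discussed under Consistency, using scipy as the team already does. SciPy has no built-in two-sample F-test, so the statistic and p-value are computed from the F distribution directly; the sample scores below are fabricated.

```python
import numpy as np
from scipy import stats

# Fabricated per-run scores for the previous and new versions.
old_scores = np.array([0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.83, 0.80])
new_scores = np.array([0.81, 0.70, 0.88, 0.75, 0.85, 0.72, 0.86, 0.78])

# F-test comparing variances (a proxy for consistency between versions).
f_stat = np.var(new_scores, ddof=1) / np.var(old_scores, ddof=1)
dfn, dfd = len(new_scores) - 1, len(old_scores) - 1

# Two-tailed p-value: double the smaller tail probability.
p_value = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))
print(f"F = {f_stat:.2f}, two-tailed p = {p_value:.3f}")
```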

https://docs.google.com/presentation/d/16XumcVV7MLJVY4dRmjphW8WOmEysFlA4/edit#slide=id.g2721f84b46c_1_690



SoloSynth1 commented May 27, 2024

@JohnShiuMK

Considerations behind the design choice to render reports in Python instead of Quarto documents:

  • For Persona 2, i.e. users, we plan to deliver our system as a Python package which, once installed, will provide a runnable command from the CLI. For the moment, we prefer to aggregate all functionality under one single command and present the pieces individually as subcommands. Assuming our package provides the command tc, to evaluate a repo and export a report one would be expected to run tc evaluate ${repo_path} --checklist=${checklist_path} --export-to=report.html. We feel this approach offers a more streamlined experience for users, instead of creating an intermediate artifact and instructing users to render it using another tool.
  • Quarto does not offer a Python package/API for invoking the rendering action within the same Python process. It is technically still possible to run Quarto by spawning subprocess calls, but this is not ideal as we have no control over the flow.
  • As the current checklist is ingested and processed by the Python process (which is essential, since the checklist needs to be referenced by other components in our system), creating a Quarto document containing code to read in and print out the checklist would introduce code duplication, as we would inevitably need to perform the same operations inside the runnable document.
  • Using Python with a template engine, e.g. Jinja2, offers greater flexibility in how we render the documents (a minimal sketch follows below). In fact, if one so wishes, our system currently also provides functionality to render the report in Quarto Markdown format (.qmd). If time allows, at a later stage we can also explore putting the test code generated by LLMs into the .qmd along with the rest of the report, enabling users to run the code inside their development environment to validate the generated test cases.
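A minimal sketch of the Jinja2 rendering path described above, with invented template and field names. Rendering to .qmd works the same way by swapping the template; a Quarto render could then be triggered externally (e.g. `quarto render report.qmd`).

```python
from jinja2 import Template

# Invented report template; the real system would load its own templates.
template = Template(
    "# Evaluation Report\n"
    "{% for item in items %}"
    "- **{{ item.id }} {{ item.title }}**: "
    "{{ 'Satisfied' if item.is_satisfied else 'Not satisfied' }}\n"
    "{% endfor %}"
)

items = [
    {"id": "1.1", "title": "Tests are independent", "is_satisfied": True},
    {"id": "1.2", "title": "Tests are deterministic", "is_satisfied": False},
]

# Same in-memory checklist object, rendered straight to Markdown with no
# intermediate artifact.
print(template.render(items=items))
```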

@JohnShiuMK

Continue in #99
