
Meeting Minutes for Week 4 #72

Closed · 17 of 23 tasks
JohnShiuMK opened this issue May 20, 2024 · 6 comments

Labels: admin meeting related

Comments


JohnShiuMK commented May 20, 2024

Sprint Planning - 2024-05-20 Week 4


tonyshumlh commented May 23, 2024

Mentor Meeting 2024/05/23 - Week 4

  • For the checklist, we can add a flag (0 or 1) stating whether each item is ready to be used by the system or is only for human reference. This also makes it easier to select items for testing (see the first sketch after this list).
  • This is better than keeping 2 checklists, since all changes are then made in 1 single file.
  • Do we have to keep it as a CLI tool, or provide an API?
  • Do we have to containerize the tool, put it into a service, and expose it as an API?
  • The aggregated JSON output from the LLM can be dumped as one of the output types for users (e.g. researchers using the CLI).
  • For web app users (e.g. a dashboard), DataFrame/HTML/PDF outputs can be offered in the web app via a button (e.g. download).
  • Stumbling blocks with GPT-3.5-turbo: 1) it trims the response result; 2) attributes are missing from the response result.
  • Add metadata to the LLM response result, e.g. repo, date, model, retries [success, retry 1, retry 2, ... plus other useful info] (see the second sketch after this list).
  • Try to evaluate which repos are consistently working/not working, and feed the not-working repos into GPT-4o for testing.
  • Failed run results can be stored and reviewed as an error log for enhancement.
  • Having output per stage is useful for the paper Tiffany plans to publish.
  • The web app should contain 1) a repo input box; 2) a checklist item display; 3) an evaluation report display; 4) a button to download raw results for human evaluation, research-paper writing, or other development purposes.
  • For the research paper: 1) the code base; 2) results that show effectiveness (e.g. running the 11 Openja repos); 3) a summary of consistency across multiple repos and multiple checklist items; 4) a write-up of the design considerations.
  • We have to provide the tool and its outputs (e.g. the error log / raw results) to empower the researchers to do the above analysis for their paper.
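As a rough sketch of the readiness-flag idea from the first two bullets (the `is_system_ready` field name and the items are made up, not the project's actual schema):

```python
# Hypothetical single-file checklist with a 0/1 readiness flag per item.
import json

checklist = [
    {"id": "2.1", "title": "Tests are independent of each other", "is_system_ready": 1},
    {"id": "2.2", "title": "Document the testing strategy", "is_system_ready": 0},
]

# Items flagged 1 are consumed by the system; items flagged 0 remain
# human-reference notes. The same filter makes it easy to pick a subset
# of items for testing.
system_items = [item for item in checklist if item["is_system_ready"] == 1]
print(json.dumps(system_items, indent=2))
```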
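And a sketch of the response-metadata bullet, with all key names assumed for illustration:

```python
# Wrap a raw LLM response with run metadata (repo, date, model, retries)
# so failed runs can be stored and reviewed as an error log later.
from datetime import datetime, timezone

def wrap_response(raw_response, repo, model, attempts):
    return {
        "repo": repo,                                    # e.g. "owner/repo"
        "date": datetime.now(timezone.utc).isoformat(),
        "model": model,                                  # e.g. "gpt-3.5-turbo"
        "retries": attempts,                             # e.g. ["retry 1", "success"]
        "response": raw_response,
    }

record = wrap_response({"is_satisfied": True}, "owner/repo",
                       "gpt-3.5-turbo", ["retry 1", "success"])
```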

@tonyshumlh

Partner Meeting 2024/05/24 - Week 4

  • Checklist
    • Is the explanation useful for the LLM?
    • Change "1 General" to "1.0 General".
    • Expect a table of contents for click-through navigation of the checklist, OR just output several tables for the checklist visualization.
    • We currently use a custom function to convert the object into Markdown and then to HTML/PDF -> use existing functions for the conversion for maintainability, e.g. the Jinja package, and output a Quarto document (refer to the new 524 note).
    • Switch to Quarto + Python for the output formatting, OR mention it under Future Development.
  • System
    • The report displayed in the CLI should show ID, Title, is_satisfied, information about the failure, and other columns.
    • (Good to have) Line numbers to show each function's location; embed a hyperlink in the line number to open the test file at that line.
  • System Evaluation
    • The sample size is too small, which leads to a high p-value; ideally it should be 20-30.
    • Change the hypothesis and do 2-tailed tests to evaluate whether the new development is worse than the previous version, as we are not sure the new version must be better than the previous one.
    • Look into other Python stats packages.
    • Tiffany will review the false-negative solution.
    • Mutation testing could be more efficient for testing LLM accuracy (see the sketch after this list).
    • OR human-evaluate the 10 repos from Openja.
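A toy illustration of the mutation-testing idea mentioned in the list above; a real setup would lean on an existing tool such as mutmut. The `add` function and its test are invented: a useful test suite passes on the original code and fails on (kills) the mutant.

```python
# Minimal hand-rolled mutation test: plant one fault and see whether the
# test suite notices it.
import pathlib
import subprocess
import tempfile

original = "def add(a, b):\n    return a + b\n"
mutant = original.replace("a + b", "a - b")  # the single injected fault

test_code = (
    "from target import add\n"
    "def test_add():\n"
    "    assert add(2, 3) == 5\n"
)

for label, source in [("original", original), ("mutant", mutant)]:
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, "target.py").write_text(source)
        pathlib.Path(d, "test_target.py").write_text(test_code)
        passed = subprocess.run(
            ["python", "-m", "pytest", "-q", d], capture_output=True
        ).returncode == 0
        print(f"{label}: {'tests pass' if passed else 'mutant killed'}")
```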


JohnShiuMK commented May 24, 2024

Partner Meeting Minutes - May 24, 2024

Attendees: John, Orix, Simon (Mentor), Tiffany (Partner), Tony, Yingzi

Checklist for Leader Persona

  • Discussed the reasons for using Python to output the report: we plan to deliver our system as a Python package which, once installed, will provide a runnable command from the CLI. This approach offers a more streamlined experience for users, instead of creating an intermediate artifact and instructing users to render it using another tool. (Please refer to the elaboration in SoloSynth1's comment below.)
  • To consider incorporating "Explanation" into prompts to provide context for the LLM.
  • To address minor formatting issues, such as changing "1 General" to "1. General".
  • To respond to Tiffany's comment in the PR: GitHub PR #91
  • (Good to have) To implement a table of contents for easier navigation through the checklist, or alternatively, output several tables for Checklist visualization.
  • (Good to have) To consider using the Jinja package or Quarto in Python to enhance checklist visualization (refer to new note 524), but limit time spent on this feature.
    • Or, to mention these alternatives in the future development in the final report.

System for Researcher Persona

  • The Report displayed in the terminal should include ID, Title, is_Satisfied, ..., i.e. maintaining the same format as demonstrated last week.
  • To include line numbers to indicate the function's location in the source code.
  • (Good to have) To embed hyperlinks within line numbers to enable opening the test file directly at the corresponding line on GitHub (see the sketch below).
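A small sketch of the hyperlink idea above; GitHub opens a file at a given line via the `#L<n>` anchor (the repo and branch values here are placeholders):

```python
# Build a GitHub permalink that opens a test file at a specific line.
def github_line_url(repo, branch, path, line):
    return f"https://github.com/{repo}/blob/{branch}/{path}#L{line}"

print(github_line_url("owner/project", "main", "tests/test_model.py", 42))
# -> https://github.com/owner/project/blob/main/tests/test_model.py#L42
```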

System Evaluation for Ourselves (System Developer Persona)

  • Consistency:

    • To increase the sample size to 20-30 for a more robust p-value calculation.
    • To modify the hypothesis to conduct 2-tailed tests instead of 1-tailed, as it cannot be assumed that the new development is definitively better than the previous version.
    • John to post a question about Type-II error in the group; Tiffany to review.
    • (Good to have) To explore stat packages for an F-test (we're currently using scipy; see the sketch after this list)
  • Accuracy:

    • Mutation testing may be more efficient for testing LLM accuracy, but the scope may become too large for the capstone project.
    • Instead, Simon suggested maintaining a mechanism for future users to provide feedback on our accuracy, enabling continuous improvement.
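A minimal sketch of the two-tailed comparison discussed under Consistency, using scipy as the team already does. SciPy has no built-in two-sample F-test, so the statistic and p-value are computed from the F distribution directly; the sample scores below are fabricated.

```python
import numpy as np
from scipy import stats

# Fabricated per-run scores for the previous and new versions.
old_scores = np.array([0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.83, 0.80])
new_scores = np.array([0.81, 0.70, 0.88, 0.75, 0.85, 0.72, 0.86, 0.78])

# F-test comparing variances (a proxy for consistency between versions).
f_stat = np.var(new_scores, ddof=1) / np.var(old_scores, ddof=1)
dfn, dfd = len(new_scores) - 1, len(old_scores) - 1

# Two-tailed p-value: double the smaller tail probability.
p_value = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))
print(f"F = {f_stat:.2f}, two-tailed p = {p_value:.3f}")
```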

https://docs.google.com/presentation/d/16XumcVV7MLJVY4dRmjphW8WOmEysFlA4/edit#slide=id.g2721f84b46c_1_690



SoloSynth1 commented May 27, 2024

@JohnShiuMK

Considerations behind the design choice to render reports in Python instead of Quarto documents:

  • For Persona 2, i.e. users, we plan to deliver our system as a Python package which, once installed, will provide a runnable command from the CLI. For the moment, we prefer to aggregate all functionality under one single command and present the pieces individually as subcommands. Assuming our package provides the command tc, to evaluate a repo and export a report one would be expected to run tc evaluate ${repo_path} --checklist=${checklist_path} --export-to=report.html. We feel this approach offers a more streamlined experience for users, instead of creating an intermediate artifact and instructing users to render it using another tool.
  • Quarto does not offer a Python package/API for invoking the rendering action within the same Python process. It is technically still possible to run Quarto by spawning subprocess calls, but this is not ideal as we have no control over the flow.
  • As the current checklist is ingested and processed by the Python process (which is essential, since the checklist needs to be referenced by other components in our system), creating a Quarto document containing code to read in and print out the checklist would introduce code duplication, as we would inevitably need to perform the same operations inside the runnable document.
  • Using Python with a template engine, e.g. Jinja2, offers greater flexibility in how we render the documents (a minimal sketch follows below). In fact, if one so wishes, our system currently also provides functionality to render the report in Quarto Markdown format (.qmd). If time allows, at a later stage we can also explore putting the test code generated by LLMs into the .qmd along with the rest of the report, enabling users to run the code inside their development environment to validate the generated test cases.
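A minimal sketch of the Jinja2 rendering path described above, with invented template and field names. Rendering to .qmd works the same way by swapping the template; a Quarto render could then be triggered externally (e.g. `quarto render report.qmd`).

```python
from jinja2 import Template

# Invented report template; the real system would load its own templates.
template = Template(
    "# Evaluation Report\n"
    "{% for item in items %}"
    "- **{{ item.id }} {{ item.title }}**: "
    "{{ 'Satisfied' if item.is_satisfied else 'Not satisfied' }}\n"
    "{% endfor %}"
)

items = [
    {"id": "1.1", "title": "Tests are independent", "is_satisfied": True},
    {"id": "1.2", "title": "Tests are deterministic", "is_satisfied": False},
]

# Same in-memory checklist object, rendered straight to Markdown with no
# intermediate artifact.
print(template.render(items=items))
```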

@JohnShiuMK

Continue in #99
