[Paper] Building an instruction-tuning dataset for the open-source domain and implementing a large model #263

PureNatural · 2024-03-21T14:25:34Z

Description

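Since the upcoming discussion will cover many details of the experiments and of dataset construction, I am moving the research progress on the open-source-domain large model to the open-research repository. Based on the lab's current resources and GitHub's existing features, I have roughly divided the work into the following parts: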

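That is still a lot of tasks, so some may be cut. I think we should first confirm that the method works, then expand the dataset and add more tasks round by round, which is the safer route.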

Meanwhile, @衍童 has recently been looking into LLaMA-Factory, which can support the later fine-tuning and development of the large model.

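At today's group meeting, Prof. Wang pointed out that the questions and answers designed for the model may differ from repository to repository. I had considered this too: the same question can well be asked in different repositories, but because the repositories differ, the answers are not necessarily the same. This has to be handled when the tasks are first designed. I plan to collect the dataset repository by repository, starting with the following popular ones: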
https://github.com/vercel/next.js
https://github.com/gatsbyjs/gatsby
https://github.com/nodejs/node
https://github.com/tailwindlabs/tailwindcss
https://github.com/laravel/framework
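
One way to handle the "same question, different repository" case is to make the repository part of every sample. The following is only a minimal sketch under assumptions: the `instruction`/`input`/`output` keys follow the common Alpaca-style layout, and `repo_slug`, `build_sample`, and the `[Repository: ...]` prefix are hypothetical names chosen for illustration, not the format the project finally adopted.

```python
from urllib.parse import urlparse

# The five candidate repositories listed above.
REPO_URLS = [
    "https://github.com/vercel/next.js",
    "https://github.com/gatsbyjs/gatsby",
    "https://github.com/nodejs/node",
    "https://github.com/tailwindlabs/tailwindcss",
    "https://github.com/laravel/framework",
]


def repo_slug(url: str) -> str:
    """Return 'owner/name' from a GitHub repository URL."""
    path = urlparse(url).path.strip("/")
    owner, name = path.split("/")[:2]
    return f"{owner}/{name}"


def build_sample(repo_url: str, question: str, answer: str) -> dict:
    """Build one instruction-tuning record that carries its repository.

    Assumption: prefixing the instruction with the repository slug is
    enough to tell otherwise identical questions apart.
    """
    slug = repo_slug(repo_url)
    return {
        "instruction": f"[Repository: {slug}] {question}",
        "input": "",
        "output": answer,
    }
```

With this layout, the same question text asked under `vercel/next.js` and under `nodejs/node` produces two distinct instructions, so each can keep its own answer.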

PureNatural · 2024-03-25T08:59:46Z

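Proposed title: OSATG-GPT: Instruction-tuning Large Language Models with Open Source Atom Tasks on GitHub
Target journal: Information Sciences (SCI Zone 1 Top, CCF-B)
Timeline:
May-June: finish building the dataset
July-August: finish the experiments
August-September: submit the paper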

PureNatural · 2024-05-26T03:13:16Z

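Discussion data can be fetched through GraphQL. The returned data looks roughly like this:

Compared with the original content on the GitHub website:

One key field, shown in the figure above, indicates whether a comment has been marked as the answer, which is very helpful for building a Q&A dataset.

Documentation: https://docs.github.com/zh/graphql/guides/using-the-graphql-api-for-discussions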
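
As a rough illustration of how such data could be turned into Q&A pairs, here is a minimal sketch. It assumes the key field mentioned above is `isAnswer`, which the linked guide documents on `DiscussionComment`. The query shape, the page sizes, and the helper names `fetch_discussions` and `extract_qa_pairs` are illustrative assumptions rather than the project's actual crawler. Only top-level comments are inspected, and pagination is left out.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# Assumed query shape: discussions of one repository, each with its
# top-level comments and the isAnswer flag on every comment.
QUERY = """
query($owner: String!, $name: String!, $n: Int!) {
  repository(owner: $owner, name: $name) {
    discussions(first: $n) {
      nodes {
        title
        body
        comments(first: 20) {
          nodes { body isAnswer }
        }
      }
    }
  }
}
"""


def fetch_discussions(owner: str, name: str, token: str, n: int = 10) -> dict:
    """POST the query to the GitHub GraphQL endpoint (needs a personal token)."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"owner": owner, "name": name, "n": n}}
    ).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=payload,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_qa_pairs(response: dict) -> list[dict]:
    """Keep only discussions that have a comment marked as the answer."""
    pairs = []
    for discussion in response["data"]["repository"]["discussions"]["nodes"]:
        for comment in discussion["comments"]["nodes"]:
            if comment["isAnswer"]:
                pairs.append(
                    {
                        "question": discussion["title"],
                        "context": discussion["body"],
                        "answer": comment["body"],
                    }
                )
    return pairs
```

A call such as `extract_qa_pairs(fetch_discussions("vercel", "next.js", token))` would then return one record per answered discussion, and discussions without a marked answer are simply skipped.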

PureNatural · 2024-06-07T08:23:07Z

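So far 19 tasks are confirmed, and the data for them is confirmed to be obtainable. We expect to design about 24 tasks in total, and the dataset can be finalized this month.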

PureNatural · 2024-06-18T03:12:57Z

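More than 30 tasks could be defined, but to keep the work on schedule, this round of experiments prioritizes proving that the method is effective. In the first phase we therefore select the higher-quality tasks as atom tasks and will increase the number of tasks later.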

In the end, 24 atom tasks were selected:

PureNatural mentioned this issue Jun 21, 2024

[new dataset]init osatg-gpt dataset X-lab2017/open-perf#65

Merged

will-ww mentioned this issue Jul 17, 2024

Community building and sustained development of the HyperCRX project (e.g., building OSSGPT, integrating with other products) X-lab2017/open-wonderland#422

Open

6 tasks
