CSL sample noise issue #115

Closed
Jasperty opened this issue Apr 9, 2021 · 7 comments

Jasperty commented Apr 9, 2021

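Regarding the keyword recognition task:
"csl_public.zip is drawn from Chinese paper abstracts and their keywords; the papers are selected from a number of core Chinese social science and natural science journals. Fake keywords generated with tf-idf are mixed with the papers' real keywords to build abstract-keyword pairs. The model's task is to judge, from the abstract, whether all of the keywords are real."
There is a problem: a keyword generated by tf-idf may itself be a real keyword. I found some noise of this kind in the training and validation sets:
[image]
The test set may contain it too. How should this noise be handled? Could you make the keyword-mixing method public?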

Member

ydli-ai commented Apr 9, 2021 via email

Author

Jasperty commented Apr 9, 2021

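"When building the dataset, we already excluded any generated keywords that turned out to be real keywords from the fake-keyword part. When the label is 0, at least one keyword in the sequence is fake."

In my screenshot, the highlighted row has label = 0, but every one of its keywords can be found in the label-1 rows below it.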
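
A rough way to look for rows like this is sketched below. It is only an illustration, not the maintainers' procedure: it assumes each record is a dict with `id`, `abst`, `keyword` (a list) and `label` (`"0"` or `"1"`) fields, that the same abstract appears in several rows, and the sample records are made up.

```python
from collections import defaultdict

def find_suspect_negatives(records):
    """Return ids of label-0 rows whose keywords all occur in
    label-1 rows of the same abstract (field names are assumed)."""
    # Collect every keyword seen in a label-1 row, per abstract.
    real_by_abstract = defaultdict(set)
    for rec in records:
        if rec["label"] == "1":
            real_by_abstract[rec["abst"]].update(rec["keyword"])

    suspects = []
    for rec in records:
        if rec["label"] != "0":
            continue
        known_real = real_by_abstract.get(rec["abst"], set())
        # A label-0 row should contain at least one keyword outside known_real.
        if known_real and set(rec["keyword"]) <= known_real:
            suspects.append(rec["id"])
    return suspects

# Made-up records, for illustration only.
sample = [
    {"id": 1, "abst": "abstract A", "keyword": ["k1", "k2", "k3"], "label": "1"},
    {"id": 2, "abst": "abstract A", "keyword": ["k2", "k1"], "label": "0"},
    {"id": 3, "abst": "abstract A", "keyword": ["k1", "fake"], "label": "0"},
]
print(find_suspect_negatives(sample))
```

A row flagged this way is only a candidate: the label-1 rows themselves could be wrong, so flagged rows still need a manual look.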

Author

Jasperty commented Apr 9, 2021

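"When building the dataset, we already excluded any generated keywords that turned out to be real keywords from the fake-keyword part. When the label is 0, at least one keyword in the sequence is fake."
Here is a new screenshot. In the original dataset, row 3 has label = 0 and the keyword 键长 ("bond length") appears twice. One of the two must have been produced by tf-idf, and the keywords were not deduplicated. In fact, every keyword in this row is a real keyword.
[image]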
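
The repeated-keyword case can be checked mechanically. The sketch below assumes records are dicts with `id` and `keyword` (a list) fields; apart from 键长, the sample values are placeholders rather than real dataset content.

```python
from collections import Counter

def duplicated_keywords(keywords):
    """Return the keywords that occur more than once in a single row."""
    counts = Counter(keywords)
    return [kw for kw, n in counts.items() if n > 1]

def rows_with_duplicates(records):
    """Map record id -> repeated keywords, only for rows that have any."""
    result = {}
    for rec in records:
        dups = duplicated_keywords(rec["keyword"])
        if dups:
            result[rec["id"]] = dups
    return result

# Placeholder records, for illustration only.
sample = [
    {"id": 3, "keyword": ["键长", "keyword_a", "键长"], "label": "0"},
    {"id": 4, "keyword": ["keyword_a", "keyword_b"], "label": "1"},
]
print(rows_with_duplicates(sample))
```

A label-0 row with a repeated keyword is worth inspecting, since the generated keyword may simply have duplicated a real one.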

Member

ydli-ai commented Apr 9, 2021 via email


t1101675 commented May 7, 2021

Has the new dataset been released? If so, where is it published?


t1101675 commented May 8, 2021

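I have now found other noise in the dataset. For example, the two validation-set samples with id 35 and 38 differ only in the order of their keywords, yet their labels differ.

[image]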
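
Conflicts of this kind can be found by grouping samples on the abstract plus the unordered keyword set. This is a minimal sketch under assumptions: the field names `id`, `abst`, `keyword` and `label` are assumed, and the sample records are invented apart from the ids 35 and 38 mentioned above.

```python
from collections import defaultdict

def find_label_conflicts(records):
    """Return lists of ids that share an abstract and keyword set
    (ignoring order) but carry different labels."""
    groups = defaultdict(list)
    for rec in records:
        # frozenset ignores keyword order (and repeated keywords).
        key = (rec["abst"], frozenset(rec["keyword"]))
        groups[key].append(rec)

    conflicts = []
    for recs in groups.values():
        if len({r["label"] for r in recs}) > 1:
            conflicts.append(sorted(r["id"] for r in recs))
    return conflicts

# Invented records, for illustration only.
sample = [
    {"id": 35, "abst": "abstract B", "keyword": ["k1", "k2", "k3"], "label": "1"},
    {"id": 38, "abst": "abstract B", "keyword": ["k3", "k1", "k2"], "label": "0"},
    {"id": 40, "abst": "abstract C", "keyword": ["k1", "k2"], "label": "1"},
]
print(find_label_conflicts(sample))
```

Which label in a conflicting group is correct cannot be decided from the data alone, so one option is to drop or re-annotate every sample in the group.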

Member

ydli-ai commented May 9, 2021 via email
