Why use of CNN char embedding? #41

Closed
cbockman opened this issue May 9, 2019 · 6 comments
Comments

cbockman commented May 9, 2019

"To keep things simple, we use minimal task specific architectures atop BERT-Base and SCIBERT embeddings. Each token is represented as the concatenation of its BERT embedding with a
CNN-based character embedding. If the token has multiple BERT subword units, we use the first one."
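For reference, the quoted token representation corresponds roughly to the sketch below. This is an illustration only, not the authors' code: the class/function names and dimensions (16-dim char embeddings, 128 filters, width-3 convolution) are made up, and the (frozen) BERT subword outputs are assumed to be computed elsewhere.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    """Character-level CNN embedding for each token (illustrative dimensions)."""
    def __init__(self, n_chars=262, char_dim=16, num_filters=128, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)

    def forward(self, char_ids):
        # char_ids: (num_tokens, max_chars_per_token)
        x = self.char_emb(char_ids).transpose(1, 2)   # (num_tokens, char_dim, max_chars)
        x = torch.relu(self.conv(x))                  # (num_tokens, num_filters, max_chars)
        return x.max(dim=-1).values                   # max-pool over characters

def token_representations(bert_subword_embs, first_subword_idx, char_ids, char_cnn):
    """Concatenate each token's first-subword BERT embedding with its char-CNN embedding.

    bert_subword_embs: (num_subwords, hidden)  output of the (frozen) BERT encoder
    first_subword_idx: (num_tokens,)           index of each token's first wordpiece
    char_ids:          (num_tokens, max_chars) character ids per original token
    """
    bert_token_embs = bert_subword_embs[first_subword_idx]        # first subword unit per token
    char_token_embs = char_cnn(char_ids)                          # CNN over characters
    return torch.cat([bert_token_embs, char_token_embs], dim=-1)  # (num_tokens, hidden + num_filters)
```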

Why use an additional CNN-based char embedding? Many (most?) papers using BERT (or similar) use only the embeddings coming out of the LM-based model.

Was there a big additional uptick from layering in the CNN-based char embeddings?

@brendan-ai2

@ibeltagy , @kyleclo we're getting questions about this issue on AllenNLP's project. Any chance you could follow up? Thanks!

FYI @arunzz

kyleclo self-assigned this Jun 28, 2019
kyleclo (Collaborator) commented Jun 28, 2019

Hey @cbockman, your point is well taken. I'm currently redoing the evaluations without these embeddings, and there's a minor uptick in performance from including them. Since BERT-Base was also evaluated with char embeddings, the relative difference between BERT-Base and SciBERT hasn't changed. For example, removing the char embeddings changes BC5CDR performance from 88.94 -> 88.73 (SciBERT-SciVocab) and from 85.72 -> 85.08 (BERT-Base). I'll release the full set of results once they're ready.

kyleclo closed this as completed Jun 28, 2019
@cbockman (Author)

Thanks. What was the rationale for including them? (Empirical > theoretical, of course...) BERT is "supposed" to encapsulate this information (via subwords) anyway. Was this an attempt to deal with the fact that the BERT layer was frozen (and thus perhaps not able to fully integrate the domain-specific learnings)?

kyleclo (Collaborator) commented Jun 28, 2019

@cbockman There wasn't any particular rationale for including char embeddings. We ran the experiments with a standard NER configuration in AllenNLP & only noticed afterwards that it included character-level embeddings. Since experiments are a bit expensive & we felt it was still a fair comparison between BERT-Base and SciBERT, we didn't rerun everything & reported in the arXiv draft that we included char embeddings. We've since decided that it's worth redoing the experiments to exclude the char embeddings & will update the draft when they're done.
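For readers unfamiliar with that setup: in AllenNLP's 0.x releases, character-level embeddings of this kind are declared as an extra token embedder next to the BERT one inside the text-field embedder. A rough sketch of the character path only (the vocabulary size and dimensions are made up, and the BERT token embedder is omitted for brevity):

```python
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder

# Character path: embed the characters of each token, then run a CNN over them.
char_embedding = Embedding(num_embeddings=262, embedding_dim=16)
char_cnn = CnnEncoder(embedding_dim=16, num_filters=128, ngram_filter_sizes=(3,))
token_characters = TokenCharactersEncoder(char_embedding, char_cnn)

# In a full NER config this entry sits alongside the BERT token embedder, e.g.
# BasicTextFieldEmbedder({"bert": bert_embedder, "token_characters": token_characters});
# dropping the "token_characters" entry is what removes the char embeddings.
text_field_embedder = BasicTextFieldEmbedder({"token_characters": token_characters})
```

BasicTextFieldEmbedder concatenates the outputs of its token embedders per token, so removing the character entry simply shrinks the token representation to the BERT embedding alone.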

@cbockman (Author)

Thanks! Love the paper (probably obviously).

arunzz commented Jul 1, 2019

Thanks! I'll wait for the scores without char embeddings. Thanks again!
