Contributing a deep-learning, BERT-based analyzer #13065

Open
lmessinger opened this issue Feb 1, 2024 · 9 comments
@lmessinger

Description

Hi,

We are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stop-word filter) based on a BERT model, and we'd like to contribute it to this repository. How can we do that in a way that would be accepted? Can we compile it to native code and use JNI or Panama? If not, what is the best approach?

#12502 (comment)

@uschindler, we would be very happy to hear what you think.

@benwtrent
Member

For the analyzer, do you mean something that tokenizes into an embedding?

Or just creates the tokens (wordpiece + dictionary)?

@lmessinger
Author

I mean creating just the tokens, i.e. the lemmas / wordpieces.

@benwtrent
Member

@lmessinger I don't see why text tokenization would need any native code. WordPiece is pretty simple, just a dictionary lookup.

Do y'all not have a Java one?

Or does this model actually need inference to do the lemmatization (e.g. https://huggingface.co/dicta-il/dictabert-joint)?
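For reference, the dictionary lookup described above boils down to a greedy longest-match over a vocabulary. A minimal, illustrative sketch (not any Lucene API; the vocabulary set and the "##" continuation prefix follow the usual BERT WordPiece convention):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WordPieceSketch {
  /** Greedy longest-match WordPiece split of a single whitespace-separated token. */
  static List<String> wordPiece(String token, Set<String> vocab) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < token.length()) {
      String match = null;
      int end;
      for (end = token.length(); end > start; end--) {
        String candidate = token.substring(start, end);
        if (start > 0) {
          candidate = "##" + candidate; // continuation pieces carry the "##" prefix
        }
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
      }
      if (match == null) {
        return List.of("[UNK]"); // no sub-piece found: emit the unknown token
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

With a vocabulary containing "un", "##aff" and "##able", the token "unaffable" comes out as [un, ##aff, ##able].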

@lmessinger
Author

lmessinger commented Feb 6, 2024 via email

@dweiss
Contributor

dweiss commented Feb 6, 2024

It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which it's available to whatever you wish to maintain). We can point to such a project from the Lucene documentation, for example.

@lmessinger
Author

lmessinger commented Feb 8, 2024 via email

@dweiss
Contributor

dweiss commented Feb 8, 2024

> How can that be done?

This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html

An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.
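A minimal sketch of what such a placeholder package-info.java could look like (the package name and wording are hypothetical, not an existing Lucene package):

```java
/**
 * Analysis components for Hebrew.
 *
 * <p>Lucene itself does not ship a Hebrew analyzer. BERT-based lemmatization and
 * stop-word filtering for Hebrew (and Arabic) are maintained as downstream
 * projects; see the Lucene documentation for pointers to those projects.
 */
package org.apache.lucene.analysis.he;
```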

@chatman
Contributor

chatman commented Feb 8, 2024 via email

@uschindler
Contributor

> How can that be done?
>
> This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html
>
> An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.

It would be better located in the "analysis" module (which is just the parent of all analyzers). Unfortunately that module does not generate javadocs, so analysis-common is the only location.

I think it would be a good idea to add a list of external analysis components there. Lucene is a flexible library with extension points through SPI, so we can list all external contributions there (a sketch of such an SPI registration follows below).

This page is also missing an overview of the analysis submodules.

An alternative (and, in my opinion, better) idea is to put such a list as a Markdown file into the documentation package: https://github.com/apache/lucene/tree/main/lucene/documentation/src/markdown

All md files there are compiled to HTML and can also be linked from the index.html template.

> How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?

I don't think this is a good idea. It won't be tested (as we can't run the build) and it is also inconsistent.

We had that in the past for DirectIODirectory and WindowsDirectory. Those were not maintained and eventually did not build anymore, although there were build scripts: the Java parts still compiled, but the JNI parts no longer matched the Java implementations. I may be wrong, but when we looked into this, it was almost impossible to make it work again.

Luckily they were rewritten using Java 11+ APIs and are now part of the official distribution.
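For context on the SPI extension points mentioned above: a downstream analyzer project does not need any changes in Lucene itself to be pluggable. Below is a minimal sketch of how such a project could expose a token filter by name, assuming the Lucene 9.x TokenFilterFactory base class; the class, package, and filter name are hypothetical, and the actual BERT-backed filter is omitted.

```java
package org.example.hebrew;

import java.util.Map;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

/** Factory that makes a (hypothetical) BERT-backed lemma filter loadable by name. */
public class HebrewLemmaFilterFactory extends TokenFilterFactory {

  /** SPI name, usable wherever token filters are configured by name. */
  public static final String NAME = "hebrewLemma";

  public HebrewLemmaFilterFactory(Map<String, String> args) {
    super(args);
  }

  @Override
  public TokenStream create(TokenStream input) {
    // Here the downstream project would wrap 'input' in its BERT-backed lemma filter,
    // e.g. return new HebrewLemmaFilter(input); (HebrewLemmaFilter is hypothetical).
    return input;
  }
}
```

The factory is then registered by listing its fully qualified class name in META-INF/services/org.apache.lucene.analysis.TokenFilterFactory on the project's classpath, after which it can be looked up with TokenFilterFactory.forName("hebrewLemma", args).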
