Contributing a deep-learning, BERT-based analyzer #13065

Open
lmessinger opened this issue Feb 1, 2024 · 9 comments
@lmessinger

Description

Hi,

We are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stop-word filter) based on a BERT model, and we'd like to contribute it to this repository. How can we do that in a way that would be accepted? Can we compile it to native code and use JNI or Panama? If not, what is the best approach?

#12502 (comment)

@uschindler, we would be very happy to hear what you think.

@benwtrent
Member

For the analyzer, do you mean something that tokenizes into an embedding?

Or just creates the tokens (wordpiece + dictionary)?

@lmessinger
Author

I mean creating just the tokens, i.e. the lemmas / wordpieces.

@benwtrent
Member

@lmessinger I don't see why text tokenization would need any native code. WordPiece is pretty simple, just a dictionary lookup.

Do y'all not have a Java one?

Or does this model actually need inference to do the lemmatization (e.g. https://huggingface.co/dicta-il/dictabert-joint)?
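For reference, the dictionary lookup described above boils down to a greedy longest-match over a vocabulary. A minimal, illustrative sketch (not any Lucene API; the vocabulary set and the "##" continuation prefix follow the usual BERT WordPiece convention):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WordPieceSketch {
  /** Greedy longest-match WordPiece split of a single whitespace-separated token. */
  static List<String> wordPiece(String token, Set<String> vocab) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < token.length()) {
      String match = null;
      int end;
      for (end = token.length(); end > start; end--) {
        String candidate = token.substring(start, end);
        if (start > 0) {
          candidate = "##" + candidate; // continuation pieces carry the "##" prefix
        }
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
      }
      if (match == null) {
        return List.of("[UNK]"); // no sub-piece found: emit the unknown token
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

With a vocabulary containing "un", "##aff" and "##able", the token "unaffable" comes out as [un, ##aff, ##able].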

@lmessinger
Author

lmessinger commented Feb 6, 2024 via email

@dweiss
Contributor

dweiss commented Feb 6, 2024

It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which it's available to whatever you wish to maintain). We can point to such a project from the Lucene documentation, for example.

@lmessinger
Author

lmessinger commented Feb 8, 2024 via email

@dweiss
Contributor

dweiss commented Feb 8, 2024

> How can that be done?

This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html

An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.
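A minimal sketch of what such a placeholder package-info.java could look like (the package name and wording are hypothetical, not an existing Lucene package):

```java
/**
 * Analysis components for Hebrew.
 *
 * <p>Lucene itself does not ship a Hebrew analyzer. BERT-based lemmatization and
 * stop-word filtering for Hebrew (and Arabic) are maintained as downstream
 * projects; see the Lucene documentation for pointers to those projects.
 */
package org.apache.lucene.analysis.he;
```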

@chatman
Contributor

chatman commented Feb 8, 2024 via email

@uschindler
Contributor

> How can that be done?
>
> This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html
>
> An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.

It would be better located in the "analysis" module (which is just the parent of all analyzers). Unfortunately that module does not generate javadocs, so analysis-common is the only location.

I think it would be a good idea to add a list of external analysis components there. Lucene is a flexible library with extension points through SPI, so we can list all external contributions there (a sketch of such an SPI registration follows below).

This page is also missing an overview of the analysis submodules.

An alternative (and, in my opinion, better) idea is to put such a list as a Markdown file into the documentation package: https://github.com/apache/lucene/tree/main/lucene/documentation/src/markdown

All md files there are compiled to HTML and can also be linked from the index.html template.

> How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?

I don't think this is a good idea. It won't be tested (as we can't run the build) and it is also inconsistent.

We had that in the past for DirectIODirectory and WindowsDirectory. Those were not maintained and eventually did not build anymore, although there were build scripts: the Java parts still compiled, but the JNI parts no longer matched the Java implementations. I may be wrong, but when we looked into this, it was almost impossible to make it work again.

Luckily they were rewritten using Java 11+ APIs and are now part of the official distribution.
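For context on the SPI extension points mentioned above: a downstream analyzer project does not need any changes in Lucene itself to be pluggable. Below is a minimal sketch of how such a project could expose a token filter by name, assuming the Lucene 9.x TokenFilterFactory base class; the class, package, and filter name are hypothetical, and the actual BERT-backed filter is omitted.

```java
package org.example.hebrew;

import java.util.Map;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

/** Factory that makes a (hypothetical) BERT-backed lemma filter loadable by name. */
public class HebrewLemmaFilterFactory extends TokenFilterFactory {

  /** SPI name, usable wherever token filters are configured by name. */
  public static final String NAME = "hebrewLemma";

  public HebrewLemmaFilterFactory(Map<String, String> args) {
    super(args);
  }

  @Override
  public TokenStream create(TokenStream input) {
    // Here the downstream project would wrap 'input' in its BERT-backed lemma filter,
    // e.g. return new HebrewLemmaFilter(input); (HebrewLemmaFilter is hypothetical).
    return input;
  }
}
```

The factory is then registered by listing its fully qualified class name in META-INF/services/org.apache.lucene.analysis.TokenFilterFactory on the project's classpath, after which it can be looked up with TokenFilterFactory.forName("hebrewLemma", args).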
