
NLPCRAFT-11: Add auto-enrich using BERT and fastText models #1

Closed
wants to merge 2 commits

Conversation

Ifropc
Member

@Ifropc Ifropc commented May 8, 2020

This pull request should resolve NLPCRAFT-11: auto-enrich user models with synonyms.
The proposed approach uses a BERT (RoBERTa) model to generate synonym candidates for a given context by masking the target word; the output is then filtered with fastText for the specified context.
This feature could also be integrated with NLPCRAFT-41.
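For context, a minimal sketch of the masked-word idea described above is shown below. It is illustrative only and is not the code from this pull request; it assumes the Hugging Face transformers fill-mask pipeline, the fasttext Python bindings, and a pre-downloaded vector file (the model names, thresholds, and helper names are assumptions).

```python
# Illustrative sketch of "mask the target word, let a masked LM propose
# candidates, then filter them by fastText similarity" -- not the PR code.
import numpy as np
import fasttext                      # assumes: pip install fasttext
from transformers import pipeline    # assumes: pip install transformers torch

fill_mask = pipeline("fill-mask", model="roberta-base")
ft = fasttext.load_model("cc.en.300.bin")   # assumed pre-downloaded vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def suggest_synonyms(sentence, target, limit=10, min_sim=0.3):
    # Replace the target word with the model's mask token and ask the
    # masked LM for candidate fillers.
    masked = sentence.replace(target, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=50)
    # Keep only candidates that fastText considers close to the target word.
    target_vec = ft.get_word_vector(target)
    scored = []
    for c in candidates:
        word = c["token_str"].strip()
        if word.lower() == target.lower():
            continue
        sim = cosine(target_vec, ft.get_word_vector(word))
        if sim >= min_sim:
            scored.append((word, sim))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [w for w, _ in scored[:limit]]

print(suggest_synonyms("what is the local weather forecast", "weather"))
```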

@syermakov
Member

@Ifropc Thank you very much for sharing; this feature looks very promising!

I would suggest the following improvements:

  • Add the Apache license header to all files
  • Use the logging module instead of the print function (see the sketch after this list)
  • Do we need the matplotlib-related functions (mk_graph, etc.) in bertft.py? Wouldn't it be better to move them directly into the Jupyter notebook?
  • Maybe rename the enricher folder to syn_enricher (just a proposal; ideas?)
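As a minimal illustration of the logging point above (the logger name and format are illustrative, not taken from the PR):

```python
import logging

# Configure once at startup; the level and format here are only examples.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
log = logging.getLogger("bertft")

# Instead of: print("loading fastText model...")
log.info("loading fastText model...")
```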

Add apache license

Replace prints with logging

Move printing graphs to jupyter
@skhdl
Contributor

skhdl commented May 16, 2020

Hi @Ifropc! Thank you for your work; I have tried it and the results are very impressive!

Below are some remarks for discussion:

  1. start_server.sh: I think it's better to use python3 instead of python.
  2. Is it possible to download the data to the user's home folder (and use it from there)? For example, under ~/nlpcraft/<bert?>
  3. How can we clear the installation files? Maybe we can add a flag like -reset to install_dependencies.sh, or create a new script like clear_dependencies.sh?
  4. Should I have python3 already installed?
  • If yes, could you describe it in the readme (versions, etc.)?
  • If no, we should create a ticket to check it on a clean machine without any Python-related installations.
  5. I think we can drop the # usage mode, at least from the API (the indexes parameters are enough and more flexible). Supporting both modes is confusing.
  6. I think we should separate the endpoint URLs (now we have only the default one), because later we may need additional endpoints and will have to keep them apart.
  7. Is it possible to add a new endpoint (or change the current one with some parameter) that returns suggestions for each word of a sentence?
  • Right now we can do this by sending N requests for the N words of a sentence, but that can be too slow because each request takes about 0.2 seconds.
  • Would such a new endpoint give us some performance improvement?
  8. Can I control the number of suggestions in the result (maybe via some parameter)?

Let's discuss!

@skhdl
Contributor

skhdl commented May 16, 2020

  9. Could you add some basic request validation and return readable errors for invalid requests? Just for the main cases (empty request, invalid index range); right now the result is code 500, internal error.
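A rough sketch of what such validation could look like, assuming a Flask-style handler (the PR's actual server framework, route, and field names are not shown in this thread and are assumptions here):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/suggest", methods=["POST"])   # route and field names are assumed
def suggest():
    data = request.get_json(silent=True)
    # Empty or non-JSON body: return a readable 400 instead of a 500.
    if not data or not data.get("sentence"):
        return jsonify({"error": "request must contain a non-empty 'sentence'"}), 400
    words = data["sentence"].split()
    indexes = data.get("indexes", [])
    # Out-of-range indexes: reject explicitly with a clear message.
    bad = [i for i in indexes if not 0 <= i < len(words)]
    if bad:
        return jsonify({"error": f"indexes {bad} out of range 0..{len(words) - 1}"}), 400
    # ... hand the validated request off to the enricher here ...
    return jsonify({"synonyms": []})
```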

- Added validation, limit argument
@Ifropc
Member Author

Ifropc commented May 17, 2020

Hi @skhdl, thank you for your feedback.

  1. Done
  2. I think in general it's not a very good approach, as it does not follow the XDG Base Directory Specification. I understand that NLPCraft currently does store some files there, so I think this should remain open for discussion and for now I'll leave it as is. In the future, we may consider either moving the NLPCraft files to an XDG-compliant directory or making this configurable in some other way (environment variable, configuration file, etc.).
  3. It's already done after a successful installation. Are you talking about /tmp/fastText/?
  4. Yes, it's checked in the installation script.
  5. Done
  6. Done. Please use the /synonyms endpoint.
  7. This part is a bit tricky. I am going to improve performance, but it might take some time to implement proper batching into the models, so it might be out of scope for this pull request and left as an improvement for the next iteration.
  8. Yes, it was already implemented, but I forgot to expose it as part of the API. Please pass this value as "limit" (a hypothetical request sketch is at the end of this comment).
  9. Done

Also, I upgraded torch to the latest version (1.5); please re-run the install script.
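For reference, a hypothetical client call against the /synonyms endpoint with the limit parameter mentioned above; the host, port, and request body fields are assumptions, since only the endpoint path and the "limit" name appear in this thread.

```python
import requests

resp = requests.post(
    "http://localhost:5000/synonyms",   # host/port are assumed
    json={
        "sentence": "what is the local weather forecast",
        "indexes": [4],   # position(s) of the word(s) to enrich (assumed field name)
        "limit": 10,      # maximum number of suggestions, as mentioned above
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```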

@Ifropc
Member Author

Ifropc commented May 30, 2020

I'm closing this pull request. Further work will be done in the NLPCRAFT-67 branch.
Let's continue the discussion in the related issue on Jira.

@Ifropc Ifropc closed this May 30, 2020