
NLPCRAFT-11: Add auto-enrich using BERT and fastText models #1

Closed
wants to merge 2 commits

Conversation

Ifropc
Member

@Ifropc Ifropc commented May 8, 2020

This pull request should resolve NLPCRAFT-11: auto-enrich user models with synonyms.
The proposed approach uses a BERT (RoBERTa) model to generate synonym candidates for a given context by masking the target word; the output is then filtered with fastText for the specified context.
This feature could also be integrated with NLPCRAFT-41.
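For context, a minimal sketch of the masked-word idea described above is shown below. It is illustrative only and is not the code from this pull request; it assumes the Hugging Face transformers fill-mask pipeline, the fasttext Python bindings, and a pre-downloaded vector file (the model names, thresholds, and helper names are assumptions).

```python
# Illustrative sketch of "mask the target word, let a masked LM propose
# candidates, then filter them by fastText similarity" -- not the PR code.
import numpy as np
import fasttext                      # assumes: pip install fasttext
from transformers import pipeline    # assumes: pip install transformers torch

fill_mask = pipeline("fill-mask", model="roberta-base")
ft = fasttext.load_model("cc.en.300.bin")   # assumed pre-downloaded vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def suggest_synonyms(sentence, target, limit=10, min_sim=0.3):
    # Replace the target word with the model's mask token and ask the
    # masked LM for candidate fillers.
    masked = sentence.replace(target, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=50)
    # Keep only candidates that fastText considers close to the target word.
    target_vec = ft.get_word_vector(target)
    scored = []
    for c in candidates:
        word = c["token_str"].strip()
        if word.lower() == target.lower():
            continue
        sim = cosine(target_vec, ft.get_word_vector(word))
        if sim >= min_sim:
            scored.append((word, sim))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [w for w, _ in scored[:limit]]

print(suggest_synonyms("what is the local weather forecast", "weather"))
```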

@syermakov
Member

@Ifropc Thank you very much for sharing; this feature looks very promising!

I would suggest the following improvements:

  • Add the Apache license header to all files
  • Use the logging module instead of the print function (see the sketch after this list)
  • Do we need the matplotlib-related functions (mk_graph, etc.) in bertft.py? Wouldn't it be better to move them directly into the Jupyter notebook?
  • Maybe rename the enricher folder to syn_enricher (just a proposal; ideas?)
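As a minimal illustration of the logging point above (the logger name and format are illustrative, not taken from the PR):

```python
import logging

# Configure once at startup; the level and format here are only examples.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
log = logging.getLogger("bertft")

# Instead of: print("loading fastText model...")
log.info("loading fastText model...")
```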

Add apache license

Replace prints with logging

Move printing graphs to jupyter
@skhdl
Contributor

skhdl commented May 16, 2020

Hi @Ifropc! Thank you for your work; I have tried it and the results are very impressive!

Below are some remarks for discussion:

  1. start_server.sh: I think it's better to use python3 instead of python.
  2. Is it possible to download the data to the user's home folder (and use it from there)? For example, under ~/nlpcraft/<bert?>
  3. How can we clear the installation files? Maybe we can add a flag like -reset to install_dependencies.sh, or create a new script like clear_dependencies.sh?
  4. Should I have python3 already installed?
  • If yes, could you describe it in the readme (versions, etc.)?
  • If no, we should create a ticket to check it on a clean machine without any Python-related installations.
  5. I think we can drop the # usage mode, at least from the API (the indexes parameters are enough and more flexible). Supporting both modes is confusing.
  6. I think we should separate the endpoint URLs (now we have only the default one), because later we may need additional endpoints and will have to keep them apart.
  7. Is it possible to add a new endpoint (or change the current one with some parameter) that returns suggestions for each word of a sentence?
  • Right now we can do this by sending N requests for the N words of a sentence, but that can be too slow because each request takes about 0.2 seconds.
  • Would such a new endpoint give us some performance improvement?
  8. Can I control the number of suggestions in the result (maybe via some parameter)?

Let's discuss!

@skhdl
Contributor

skhdl commented May 16, 2020

  9. Could you add some basic request validation and return readable errors for invalid requests? Just for the main cases (empty request, invalid index range); right now the result is code 500, internal error.
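A rough sketch of what such validation could look like, assuming a Flask-style handler (the PR's actual server framework, route, and field names are not shown in this thread and are assumptions here):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/suggest", methods=["POST"])   # route and field names are assumed
def suggest():
    data = request.get_json(silent=True)
    # Empty or non-JSON body: return a readable 400 instead of a 500.
    if not data or not data.get("sentence"):
        return jsonify({"error": "request must contain a non-empty 'sentence'"}), 400
    words = data["sentence"].split()
    indexes = data.get("indexes", [])
    # Out-of-range indexes: reject explicitly with a clear message.
    bad = [i for i in indexes if not 0 <= i < len(words)]
    if bad:
        return jsonify({"error": f"indexes {bad} out of range 0..{len(words) - 1}"}), 400
    # ... hand the validated request off to the enricher here ...
    return jsonify({"synonyms": []})
```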

- Added validation, limit argument
@Ifropc
Member Author

Ifropc commented May 17, 2020

Hi @skhdl, thank you for your feedback.

  1. Done
  2. I think in general it's not a very good approach, as it does not follow the XDG Base Directory Specification. I understand that NLPCraft currently does store some files there, so I think this should remain open for discussion and for now I'll leave it as is. In the future, we may consider either moving the NLPCraft files to an XDG-compliant directory or making this configurable in some other way (environment variable, configuration file, etc.).
  3. It's already done after a successful installation. Are you talking about /tmp/fastText/?
  4. Yes, it's checked in the installation script.
  5. Done
  6. Done. Please use the /synonyms endpoint.
  7. This part is a bit tricky. I am going to improve performance, but it might take some time to implement proper batching into the models, so it might be out of scope for this pull request and left as an improvement for the next iteration.
  8. Yes, it was already implemented, but I forgot to expose it as part of the API. Please pass this value as "limit" (a hypothetical request sketch is at the end of this comment).
  9. Done

Also, I upgraded torch to the latest version (1.5); please re-run the install script.
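For reference, a hypothetical client call against the /synonyms endpoint with the limit parameter mentioned above; the host, port, and request body fields are assumptions, since only the endpoint path and the "limit" name appear in this thread.

```python
import requests

resp = requests.post(
    "http://localhost:5000/synonyms",   # host/port are assumed
    json={
        "sentence": "what is the local weather forecast",
        "indexes": [4],   # position(s) of the word(s) to enrich (assumed field name)
        "limit": 10,      # maximum number of suggestions, as mentioned above
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```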

@Ifropc
Member Author

Ifropc commented May 30, 2020

I'm closing this pull request. Further work will be done in the NLPCRAFT-67 branch.
Let's continue the discussion in the related issue on Jira.

@Ifropc Ifropc closed this May 30, 2020