-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added language and locale params to NLP config #98
Conversation
a41a805
to
8a4b91b
Compare
@@ -291,8 +291,10 @@ def test_convo_params_are_cleared(kwik_e_mart_nlp, kwik_e_mart_app_path): | |||
[ | |||
("en", "en_GB", {"lang": "EN", "latent": True, "locale": "en_GB"}), | |||
("es", "en_US", {"latent": True, "locale": "en_US"}), | |||
(None, None, {"latent": True, "locale": "en_US", "lang": "EN"}), | |||
(None, None, {"latent": True, "locale": "en_CA", "lang": "EN"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the kwik-e-mart test application to use the en_CA
locale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, adding some comments
mindmeld/query_factory.py
Outdated
@@ -130,7 +137,8 @@ def create_query_factory( | |||
Returns: | |||
QueryFactory: A QueryFactory object that is used to create Query objects. | |||
""" | |||
del app_path | |||
tokenizer = tokenizer or Tokenizer.create_tokenizer() | |||
stemmer = stemmer or EnglishNLTKStemmer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we pick the stemmer based on the language & locale? Should we continue to use the English Stemmer for non-English languages? @serapio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does seem weird now. It should only default to the English stemmer if the language is English, and otherwise the default stemmer should be a dummy (no op) stemmer. We should probably also make the default tokenizer less opinionated (in future work). For other languages, the app should generally be defining it's own tokenizer and stemmer, but we shouldn't use the English stemmer if they fail to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serapio What are your thoughts on using the Snowball stemmer?
In [1]: from nltk.stem.snowball import SnowballStemmer
In [2]: SnowballStemmer.languages
Out[2]:
('arabic',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'hungarian',
'italian',
'norwegian',
'porter',
'portuguese',
'romanian',
'russian',
'spanish',
'swedish')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented the above suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to add documentation here as well?
if entity['text'] == 'Thanksgiving': | ||
# American thanksgiving is 2020-11-26T00:00:00.000-08:00, | ||
# which is different from Canada's. | ||
assert entity['value'][0]['value'] == '2020-10-12T00:00:00.000-07:00' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever way to test whether duckling indeed works correctly with en-CA 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. I'd address Minh and Lucien's comment about using EnglishNLTKStemmer
. We should only use that one for en-*
locales. For other locales, we can over time test and add other trusted Python stemmers (e.g. the Snowball stemmers also included in NLTK). But maybe for now, we just don't do any stemming for non-English locales.
mindmeld/stemmers.py
Outdated
language = pycountry.languages.get(alpha_3=language_code.upper()) | ||
|
||
if not language: | ||
logger.error('Language code {} is not supported for stemming. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be a warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
edfc949
to
8ea8962
Compare
mindmeld/stemmers.py
Outdated
language = pycountry.languages.get(alpha_3=language_code) | ||
|
||
if not language: | ||
logger.warning('Language code %s is not supported for stemming. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Language code "%" is ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
mindmeld/stemmers.py
Outdated
if language_name in nltk.stem.SnowballStemmer.languages: | ||
return SnowballNLTKStemmer(language_name) | ||
|
||
logger.warning('Language code %s is not supported for stemming. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Language code "%s" is not supported...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
logger.warning('Language code %s is not supported for stemming. ' | ||
'Consider disabling the stemmer in config.py.', language_code) | ||
return NoOpStemmer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Is disabling the stemmer a different result than getting the NoOpStemmer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. So Mindmeld stems all queries but uses the stemmed word as a feature only when stemming is enabled. I will make this clear in the warning message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks!
8ea8962
to
6a6aa71
Compare
11. Romanian | ||
12. Russian | ||
13. Spanish | ||
14. Swedish |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make the messaging here more nuanced. This makes it sound like only these languages will work, but these are just the languages where stemming works well, which is not important for many languages. e.g. the Vietnamese app seems to work pretty well and Vietnamese wouldn't benefit from stemming. I see it could be more complicated to the user but it would be more accurate to break it down by resource, something like:
- The current tokenizer works well for languages like English where words are delimited by spaces, but may be confused by punctuation marks not found in English. If the language you are dealing with does not use spaces between words or has many non-English-like punctuation marks, you should preprocess your data to remove punctuation and add spaces between words.
- The stemmer supports the following languages [listed above]
If you are working with a language that is not supported by the stemmer, we recommend disabling the stemmer, increasing the use of character ngram features, and expect you will need more training data. - The default system entity recognizer supports the following languages [listed]
If you are working with a language that is not supported by the default system entity recognizer, we recommend using custom entities and resolving the entities with custom logic in the dialogue manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed your comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that apart from this PR we should do an evaluation of the effectiveness of these stemmers to provide the user with a reference of when/why they should be using stemmers.
7934abd
to
561338d
Compare
Moreover, the following ISO 3166-2 locale codes are supported per language: | ||
|
||
1. EN: [AU, BZ, CA, GB, IN, IE, JM, NZ, PH, ZA, TT, US] | ||
2. NL: [BE, R.NL] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NL: [BE, NL]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
561338d
to
98431fd
Compare
98431fd
to
d39feb1
Compare
|
||
Stemming is an important, language-dependent NLP process that transforms a word to an approximation of its root form. Stemming can be | ||
useful for some languages like English but not for others like Vietnamese. Mindmeld supports the following ISO 639-1 language codes for stemming: | ||
[EN, DA, NL, AR, FR, DE, HU, IT, NO, PT, RU, RO, ES, SV, FI]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we expand the actual names of the languages instead of the codes?
============================ | ||
|
||
Mindmeld supports most languages that can be tokenized like English. If the language does not use spaces between | ||
words or has many non-English-like punctuation marks, pre-process the data to remove punctuations and add spaces between words. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serapio : can we give an example of the languages that have many of these punctuation marks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I had heard was a problem generally with using \W regex across languages, but the ones I tried in German, Arabic and Chinese were all handled correctly.
* adding unstructured QA * lint fixes * adding exceptions for unhandled query types * Adding DialogflowConverter and RasaConverter (cisco#32) Adding the abstract Converter class and RasaConverter and DialogflowConverter which can convert a Rasa or a Dialogflow project into a MindMeld project. -- Usage -- * Rasa Users: In a python script or from the command line, run the following commands ``` import mindmeld.converter.rasa_converter rasa_project = <location of Rasa project> mm_project = <location where MindMeld project will be created> rasa_converter.convert_project(rasa_project, mm_project) ``` * Dialogflow Users: Users must first export their project out of Dialogflow. From the main console, click the Settings icon next to your project name in the top left corner. From there select Export and Import, then Export as Zip. The rest of the invocation is similar to RasaConvereter: `df_converter.convert_project(rasa_project, mm_project)` -- Limitations -- * Rasa Users: - Rasa has the ability to have custom actions, which is not supported by the converter. - Rasa has the ability to handle multiple intents per query, while Mindmeld does not. - Rasa training data may be json format, which is not currently supported. - Rasa has a feature called Rasa forms which is not currently supported. - Rasa's configuration files are not transferred, instead generic Mindmeld configuration files are copied over. * Dialogflow Users: - There is no official language support. - Some system entities are currently unsupported. - The converter currently does not automatically convert features like slot filling, contexts, and follow-up intents. Users can still implement such features and more. - Information in agent.json (such as description, timezone, etc) are not copied over. * Bump version: 4.1.5 → 4.1.6 * Default template fix * fixing spaces * Adding a section for MindMeld UI (cisco#54) * Adding a section for MindMeld UI * documentation for unstructured QA * spacing issue * added hr blueprint to cli functionality (cisco#58) * Fixed conditional check in UI code (cisco#57) * Fixing warning in console (cisco#60) * Update WebexBotServer's API and update doc (cisco#49) * Add doc for Webex Team Bot Server * Refactor of the bot server api * Adding Converter cli (cisco#56) * Added Converter CLI `mindmeld convert --rs rasa_sample_project` `mindmeld convert --df df_sample_project` * Enabled locale and language support for system entity resolution (cisco#61) * added language code and locale properties to Param and FrozenParams object * passed locale argument through mindmeld to system entity recognizer * resolved logic for various cases of invalid locale and language codes * Review edits * Added stemmers module to the platform (cisco#62) * Refactored stemmer and added a new spanish stemmer * Implement DialogueManager.reprocess(...) (cisco#24) * This allows the application to specify that dialogue should exit immediately from a dialogue flow (or other target_dialogue_state) and handle the query over again within the generic dialogue manager (or within a specified other dialogue flow if new_target is specified in the call to reject_target_dialogue_state). * more edits * adding clean load and related documentation * clean error handler * fixing lint errors * Added the HR blueprint docs (cisco#63) * added hr blueprint docs * Update history and change list (cisco#59) * Update history and change list * Consolidated tokenization logic and added documentation details (cisco#64) * consolidated tokenization logic * removed converter docs (cisco#66) * rework description of exiting flows (cisco#67) * Checking in google tracking id for build generation (cisco#65) * Bump version: 4.1.6 → 4.2.0 * fixed bracket (cisco#68) * Bug fix/docs (cisco#69) * fixed code indentation * Allow WebexBotServer to be imported directly (cisco#73) * Allow WebexBotServer to be imported directly `from mindmeld.bot import WebexBotServer` * Fix lint * Adding explicit error handling in reading query files (cisco#71) * Adding explicit error handling in reading query files * Some improvements to system entity recognizer (cisco#76) Some improvements to system entity recognizer: 1. If there is no app being passed, we can load system entity recognizer from default. 2. Explicit check for 200 status code; otherwise raise an exception. * Make markup error messages more useful (cisco#75) * make markup error messages useful * Make validate_language_code and validate_locale_code public (cisco#77) * Bump version: 4.2.0 → 4.2.1 * Add black (cisco#80) * Adding black to the MindMeld repo * Sorting the imports using isort but not adding isort at the moment because it conflicts with black. * Using black only Python 3.6 and above in our test infrastructure. * Adding .flake8 file to ignore certain exceptions (E203, E231, W503) since these stylistic changes conflict with black. * Bump eslint-utils from 1.3.1 to 1.4.3 in /mindmeld-ui (cisco#79) Bumps [eslint-utils](https://github.com/mysticatea/eslint-utils) from 1.3.1 to 1.4.3. - [Release notes](https://github.com/mysticatea/eslint-utils/releases) - [Commits](mysticatea/eslint-utils@v1.3.1...v1.4.3) Signed-off-by: dependabot[bot] <support@github.com> * Adding locale as a parameter for `get_candidates_for_text` and update tests (cisco#84) * Relax CalVer dependencies * fixed bot server conversation param (cisco#86) * Make tf optional (cisco#90) * Make Tensorflow package optional * Update README * Bump version: 4.2.1 → 4.2.2 * Fixing version number * Removing Python 3.4 (cisco#94) * Gracefully handle missing tensorflow extra * Fix bot and lstm lint errors * Bump version: 4.2.2 → 4.2.3 * Small fixes (cisco#95) 1. Small refactor to convert `get_mm_version` into a public function 2. Small doc updates * Bump handlebars from 4.1.2 to 4.5.3 in /mindmeld-ui (cisco#97) Bumps [handlebars](https://github.com/wycats/handlebars.js) from 4.1.2 to 4.5.3. - [Release notes](https://github.com/wycats/handlebars.js/releases) - [Changelog](https://github.com/wycats/handlebars.js/blob/master/release-notes.md) - [Commits](handlebars-lang/handlebars.js@v4.1.2...v4.5.3) Signed-off-by: dependabot[bot] <support@github.com> * Added language and locale params to NLP config (cisco#98) * added lang and locale * implement language specific stemming * updated documentation * Adding documentation for WhatsApp integration (cisco#99) * Adding documentation for WhatsApp integration * Update port number to 8080 * fixed doc (cisco#100) * updated mindmeld from v4.2.3 to v4.2.4 * Adding a reference implementation for a MindMeld http client (cisco#101) * Reference implementation for a HTTP conversational client * Adding frame and context to the Conversational Client (cisco#102) * Adding frame and context to the ConversationalClient object's say and process method * Resetting the conversation after the `reset` function is invoked * added fix to _get_feature_weight (cisco#107) * fixed directive names (cisco#109) * added try except block (cisco#108) * Bump version: 4.2.4 → 4.2.5 * Bug fix for xhr issue (cisco#112) 1. bug fix for xhr issue 2. apply black lint fixes * Add developing MindMeld section (cisco#114) * Fix missing `id` field * Adding a section for contributing to MindMeld * bumpversion from 4.2.5 to 4.2.6 * upgraded flask (cisco#115) * Feature/mindmeld converter (cisco#113) * added df converter changes * skipped converter test (cisco#118) * Implemented slot filling code generation logic (cisco#116) * implemented code generation logic * Augmented data (cisco#117) * fixed system entity slots * augmented data * refactored code * added request timeout (cisco#120) * Made the number of queries for df conversion configurable (cisco#121) * made the number of queries for df conversion configurable * fix exception handling (cisco#122) * code generation (cisco#123) * Bump version: 4.2.6 → 4.2.7 * Remove Language param from process (cisco#124) * cached lang and locale in the nlp object Co-authored-by: Varsha Embar <vembar@cisco.com> Co-authored-by: Juan Rodriguez <angel.neb@gmail.com> Co-authored-by: Varsha Embar <vembar@users.noreply.github.com> Co-authored-by: barret-p <43305846+barret-p@users.noreply.github.com> Co-authored-by: Minh-Tue Vo <minhtue@cisco.com> Co-authored-by: Vijay Ramakrishnan <vijayrk@cisco.com> Co-authored-by: Arushi Raghuvanshi <arushi.raghu@gmail.com> Co-authored-by: Lucien Carroll <lucienc@cisco.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: J.J. Jackson <jdot@cisco.com>
What does this PR do?
This PR adds language and locale as configuration constants to a MindMeld application, so setting these values will make sure system entities are detecting during train and inspect time. For inference, the developer can still choose to set the locale and language, but if not, the application's language configuration is used. It's useful for the developer to choose the language config during inference time if they want to resolve certain entities differently (ie
11/12/2010
mean different dates inen_GB
anden_US
).Proposed design change
The following config can be added in the
_config.py
file:The PR works for all NLP apis.
Limitations
en_US
and not the extended version ieen_XX_US
.