Added language and locale params to NLP config #98

vijay120 · 2020-01-03T22:33:03Z

What does this PR do?

This PR adds language and locale as configuration constants to a MindMeld application, so setting these values will make sure system entities are detecting during train and inspect time. For inference, the developer can still choose to set the locale and language, but if not, the application's language configuration is used. It's useful for the developer to choose the language config during inference time if they want to resolve certain entities differently (ie 11/12/2010 mean different dates in en_GB and en_US).

Proposed design change

The following config can be added in the _config.py file:

LANGUAGE_CONFIG = {
    'language': 'en',
    'locale': 'en_CA'
}

The PR works for all NLP apis.

Limitations

The PR currently only supports the abbreviated version of locale codes, ie en_US and not the extended version ie en_XX_US.
This change would mean each Mindmeld application is only trained on one language

minhtuev · 2020-01-06T19:07:03Z

tests/components/test_dialogue.py

@@ -291,8 +291,10 @@ def test_convo_params_are_cleared(kwik_e_mart_nlp, kwik_e_mart_app_path):
    [
        ("en", "en_GB", {"lang": "EN", "latent": True, "locale": "en_GB"}),
        ("es", "en_US", {"latent": True, "locale": "en_US"}),
-        (None, None, {"latent": True, "locale": "en_US", "lang": "EN"}),
+        (None, None, {"latent": True, "locale": "en_CA", "lang": "EN"}),


why change?

I have updated the kwik-e-mart test application to use the en_CA locale

tests/components/test_nlp.py

minhtuev

Looks good, adding some comments

minhtuev · 2020-01-06T19:49:15Z

mindmeld/query_factory.py

@@ -130,7 +137,8 @@ def create_query_factory(
        Returns:
            QueryFactory: A QueryFactory object that is used to create Query objects.
        """
-        del app_path
        tokenizer = tokenizer or Tokenizer.create_tokenizer()
        stemmer = stemmer or EnglishNLTKStemmer()


Do we pick the stemmer based on the language & locale? Should we continue to use the English Stemmer for non-English languages? @serapio

This does seem weird now. It should only default to the English stemmer if the language is English, and otherwise the default stemmer should be a dummy (no op) stemmer. We should probably also make the default tokenizer less opinionated (in future work). For other languages, the app should generally be defining it's own tokenizer and stemmer, but we shouldn't use the English stemmer if they fail to do so.

@serapio What are your thoughts on using the Snowball stemmer?

In [1]: from nltk.stem.snowball import SnowballStemmer In [2]: SnowballStemmer.languages Out[2]: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

Implemented the above suggestion

minhtuev

Do we also need to add documentation here as well?

ktick · 2020-01-07T06:25:11Z

tests/components/test_nlp.py

+        if entity['text'] == 'Thanksgiving':
+            # American thanksgiving is 2020-11-26T00:00:00.000-08:00,
+            # which is different from Canada's.
+            assert entity['value'][0]['value'] == '2020-10-12T00:00:00.000-07:00'


Clever way to test whether duckling indeed works correctly with en-CA 👍

ktick

Mostly looks good. I'd address Minh and Lucien's comment about using EnglishNLTKStemmer. We should only use that one for en-* locales. For other locales, we can over time test and add other trusted Python stemmers (e.g. the Snowball stemmers also included in NLTK). But maybe for now, we just don't do any stemming for non-English locales.

minhtuev · 2020-01-08T01:29:24Z

mindmeld/stemmers.py

+        language = pycountry.languages.get(alpha_3=language_code.upper())
+
+    if not language:
+        logger.error('Language code {} is not supported for stemming. '


Maybe this should be a warning

serapio · 2020-01-08T22:20:00Z

mindmeld/stemmers.py

+        language = pycountry.languages.get(alpha_3=language_code)
+
+    if not language:
+        logger.warning('Language code %s is not supported for stemming. '


Language code "%" is ...

serapio · 2020-01-08T22:27:22Z

mindmeld/stemmers.py

+    if language_name in nltk.stem.SnowballStemmer.languages:
+        return SnowballNLTKStemmer(language_name)
+
+    logger.warning('Language code %s is not supported for stemming. '


Language code "%s" is not supported...

serapio · 2020-01-08T22:36:48Z

mindmeld/stemmers.py

+
+    logger.warning('Language code %s is not supported for stemming. '
+                   'Consider disabling the stemmer in config.py.', language_code)
+    return NoOpStemmer()


This is great. Is disabling the stemmer a different result than getting the NoOpStemmer?

This is a good point. So Mindmeld stems all queries but uses the stemmed word as a feature only when stemming is enabled. I will make this clear in the warning message.

minhtuev

LGTM, Thanks!

serapio · 2020-01-09T23:17:40Z

source/userguide/internationalization.rst

+11. Romanian
+12. Russian
+13. Spanish
+14. Swedish


I think we should make the messaging here more nuanced. This makes it sound like only these languages will work, but these are just the languages where stemming works well, which is not important for many languages. e.g. the Vietnamese app seems to work pretty well and Vietnamese wouldn't benefit from stemming. I see it could be more complicated to the user but it would be more accurate to break it down by resource, something like:

The current tokenizer works well for languages like English where words are delimited by spaces, but may be confused by punctuation marks not found in English. If the language you are dealing with does not use spaces between words or has many non-English-like punctuation marks, you should preprocess your data to remove punctuation and add spaces between words.

The stemmer supports the following languages [listed above]
If you are working with a language that is not supported by the stemmer, we recommend disabling the stemmer, increasing the use of character ngram features, and expect you will need more training data.

The default system entity recognizer supports the following languages [listed]
If you are working with a language that is not supported by the default system entity recognizer, we recommend using custom entities and resolving the entities with custom logic in the dialogue manager.

Addressed your comment

I feel that apart from this PR we should do an evaluation of the effectiveness of these stemmers to provide the user with a reference of when/why they should be using stemmers.

serapio · 2020-01-10T00:52:42Z

source/userguide/internationalization.rst

+Moreover, the following ISO 3166-2 locale codes are supported per language:
+
+1. EN: [AU, BZ, CA, GB, IN, IE, JM, NZ, PH, ZA, TT, US]
+2. NL: [BE, R.NL]


NL: [BE, NL]

minhtuev · 2020-01-10T01:00:00Z

source/userguide/internationalization.rst

+
+Stemming is an important, language-dependent NLP process that transforms a word to an approximation of its root form. Stemming can be
+useful for some languages like English but not for others like Vietnamese. Mindmeld supports the following ISO 639-1 language codes for stemming:
+[EN, DA, NL, AR, FR, DE, HU, IT, NO, PT, RU, RO, ES, SV, FI].


Should we expand the actual names of the languages instead of the codes?

minhtuev · 2020-01-10T01:01:42Z

source/userguide/internationalization.rst

+============================
+
+Mindmeld supports most languages that can be tokenized like English. If the language does not use spaces between
+words or has many non-English-like punctuation marks, pre-process the data to remove punctuations and add spaces between words.


@serapio : can we give an example of the languages that have many of these punctuation marks?

This is something I had heard was a problem generally with using \W regex across languages, but the ones I tried in German, Arabic and Chinese were all handled correctly.

* adding unstructured QA * lint fixes * adding exceptions for unhandled query types * Adding DialogflowConverter and RasaConverter (cisco#32) Adding the abstract Converter class and RasaConverter and DialogflowConverter which can convert a Rasa or a Dialogflow project into a MindMeld project. -- Usage -- * Rasa Users: In a python script or from the command line, run the following commands ``` import mindmeld.converter.rasa_converter rasa_project = <location of Rasa project> mm_project = <location where MindMeld project will be created> rasa_converter.convert_project(rasa_project, mm_project) ``` * Dialogflow Users: Users must first export their project out of Dialogflow. From the main console, click the Settings icon next to your project name in the top left corner. From there select Export and Import, then Export as Zip. The rest of the invocation is similar to RasaConvereter: `df_converter.convert_project(rasa_project, mm_project)` -- Limitations -- * Rasa Users: - Rasa has the ability to have custom actions, which is not supported by the converter. - Rasa has the ability to handle multiple intents per query, while Mindmeld does not. - Rasa training data may be json format, which is not currently supported. - Rasa has a feature called Rasa forms which is not currently supported. - Rasa's configuration files are not transferred, instead generic Mindmeld configuration files are copied over. * Dialogflow Users: - There is no official language support. - Some system entities are currently unsupported. - The converter currently does not automatically convert features like slot filling, contexts, and follow-up intents. Users can still implement such features and more. - Information in agent.json (such as description, timezone, etc) are not copied over. * Bump version: 4.1.5 → 4.1.6 * Default template fix * fixing spaces * Adding a section for MindMeld UI (cisco#54) * Adding a section for MindMeld UI * documentation for unstructured QA * spacing issue * added hr blueprint to cli functionality (cisco#58) * Fixed conditional check in UI code (cisco#57) * Fixing warning in console (cisco#60) * Update WebexBotServer's API and update doc (cisco#49) * Add doc for Webex Team Bot Server * Refactor of the bot server api * Adding Converter cli (cisco#56) * Added Converter CLI `mindmeld convert --rs rasa_sample_project` `mindmeld convert --df df_sample_project` * Enabled locale and language support for system entity resolution (cisco#61) * added language code and locale properties to Param and FrozenParams object * passed locale argument through mindmeld to system entity recognizer * resolved logic for various cases of invalid locale and language codes * Review edits * Added stemmers module to the platform (cisco#62) * Refactored stemmer and added a new spanish stemmer * Implement DialogueManager.reprocess(...) (cisco#24) * This allows the application to specify that dialogue should exit immediately from a dialogue flow (or other target_dialogue_state) and handle the query over again within the generic dialogue manager (or within a specified other dialogue flow if new_target is specified in the call to reject_target_dialogue_state). * more edits * adding clean load and related documentation * clean error handler * fixing lint errors * Added the HR blueprint docs (cisco#63) * added hr blueprint docs * Update history and change list (cisco#59) * Update history and change list * Consolidated tokenization logic and added documentation details (cisco#64) * consolidated tokenization logic * removed converter docs (cisco#66) * rework description of exiting flows (cisco#67) * Checking in google tracking id for build generation (cisco#65) * Bump version: 4.1.6 → 4.2.0 * fixed bracket (cisco#68) * Bug fix/docs (cisco#69) * fixed code indentation * Allow WebexBotServer to be imported directly (cisco#73) * Allow WebexBotServer to be imported directly `from mindmeld.bot import WebexBotServer` * Fix lint * Adding explicit error handling in reading query files (cisco#71) * Adding explicit error handling in reading query files * Some improvements to system entity recognizer (cisco#76) Some improvements to system entity recognizer: 1. If there is no app being passed, we can load system entity recognizer from default. 2. Explicit check for 200 status code; otherwise raise an exception. * Make markup error messages more useful (cisco#75) * make markup error messages useful * Make validate_language_code and validate_locale_code public (cisco#77) * Bump version: 4.2.0 → 4.2.1 * Add black (cisco#80) * Adding black to the MindMeld repo * Sorting the imports using isort but not adding isort at the moment because it conflicts with black. * Using black only Python 3.6 and above in our test infrastructure. * Adding .flake8 file to ignore certain exceptions (E203, E231, W503) since these stylistic changes conflict with black. * Bump eslint-utils from 1.3.1 to 1.4.3 in /mindmeld-ui (cisco#79) Bumps [eslint-utils](https://github.com/mysticatea/eslint-utils) from 1.3.1 to 1.4.3. - [Release notes](https://github.com/mysticatea/eslint-utils/releases) - [Commits](mysticatea/eslint-utils@v1.3.1...v1.4.3) Signed-off-by: dependabot[bot] <support@github.com> * Adding locale as a parameter for `get_candidates_for_text` and update tests (cisco#84) * Relax CalVer dependencies * fixed bot server conversation param (cisco#86) * Make tf optional (cisco#90) * Make Tensorflow package optional * Update README * Bump version: 4.2.1 → 4.2.2 * Fixing version number * Removing Python 3.4 (cisco#94) * Gracefully handle missing tensorflow extra * Fix bot and lstm lint errors * Bump version: 4.2.2 → 4.2.3 * Small fixes (cisco#95) 1. Small refactor to convert `get_mm_version` into a public function 2. Small doc updates * Bump handlebars from 4.1.2 to 4.5.3 in /mindmeld-ui (cisco#97) Bumps [handlebars](https://github.com/wycats/handlebars.js) from 4.1.2 to 4.5.3. - [Release notes](https://github.com/wycats/handlebars.js/releases) - [Changelog](https://github.com/wycats/handlebars.js/blob/master/release-notes.md) - [Commits](handlebars-lang/handlebars.js@v4.1.2...v4.5.3) Signed-off-by: dependabot[bot] <support@github.com> * Added language and locale params to NLP config (cisco#98) * added lang and locale * implement language specific stemming * updated documentation * Adding documentation for WhatsApp integration (cisco#99) * Adding documentation for WhatsApp integration * Update port number to 8080 * fixed doc (cisco#100) * updated mindmeld from v4.2.3 to v4.2.4 * Adding a reference implementation for a MindMeld http client (cisco#101) * Reference implementation for a HTTP conversational client * Adding frame and context to the Conversational Client (cisco#102) * Adding frame and context to the ConversationalClient object's say and process method * Resetting the conversation after the `reset` function is invoked * added fix to _get_feature_weight (cisco#107) * fixed directive names (cisco#109) * added try except block (cisco#108) * Bump version: 4.2.4 → 4.2.5 * Bug fix for xhr issue (cisco#112) 1. bug fix for xhr issue 2. apply black lint fixes * Add developing MindMeld section (cisco#114) * Fix missing `id` field * Adding a section for contributing to MindMeld * bumpversion from 4.2.5 to 4.2.6 * upgraded flask (cisco#115) * Feature/mindmeld converter (cisco#113) * added df converter changes * skipped converter test (cisco#118) * Implemented slot filling code generation logic (cisco#116) * implemented code generation logic * Augmented data (cisco#117) * fixed system entity slots * augmented data * refactored code * added request timeout (cisco#120) * Made the number of queries for df conversion configurable (cisco#121) * made the number of queries for df conversion configurable * fix exception handling (cisco#122) * code generation (cisco#123) * Bump version: 4.2.6 → 4.2.7 * Remove Language param from process (cisco#124) * cached lang and locale in the nlp object Co-authored-by: Varsha Embar <vembar@cisco.com> Co-authored-by: Juan Rodriguez <angel.neb@gmail.com> Co-authored-by: Varsha Embar <vembar@users.noreply.github.com> Co-authored-by: barret-p <43305846+barret-p@users.noreply.github.com> Co-authored-by: Minh-Tue Vo <minhtue@cisco.com> Co-authored-by: Vijay Ramakrishnan <vijayrk@cisco.com> Co-authored-by: Arushi Raghuvanshi <arushi.raghu@gmail.com> Co-authored-by: Lucien Carroll <lucienc@cisco.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: J.J. Jackson <jdot@cisco.com>

vijay120 requested review from minhtuev, serapio, angelRev and jjjacksn January 3, 2020 22:33

vijay120 changed the title ~~Added lang and locale to NLP config~~ Added language and locale params to NLP config Jan 3, 2020

added lang and locale

8a4b91b

vijay120 force-pushed the feature/language_build branch from a41a805 to 8a4b91b Compare January 3, 2020 23:09

minhtuev reviewed Jan 6, 2020

View reviewed changes

tests/components/test_nlp.py Show resolved Hide resolved

minhtuev reviewed Jan 6, 2020

View reviewed changes

angelRev approved these changes Jan 6, 2020

View reviewed changes

ktick reviewed Jan 7, 2020

View reviewed changes

minhtuev reviewed Jan 8, 2020

View reviewed changes

vijay120 force-pushed the feature/language_build branch from edfc949 to 8ea8962 Compare January 8, 2020 21:39

serapio approved these changes Jan 8, 2020

View reviewed changes

minhtuev approved these changes Jan 8, 2020

View reviewed changes

implement language specific stemming

6a6aa71

vijay120 force-pushed the feature/language_build branch from 8ea8962 to 6a6aa71 Compare January 8, 2020 23:50

serapio reviewed Jan 9, 2020

View reviewed changes

vijay120 force-pushed the feature/language_build branch from 7934abd to 561338d Compare January 10, 2020 00:47

serapio reviewed Jan 10, 2020

View reviewed changes

vijay120 force-pushed the feature/language_build branch from 561338d to 98431fd Compare January 10, 2020 01:02

updated documentation

d39feb1

vijay120 force-pushed the feature/language_build branch from 98431fd to d39feb1 Compare January 10, 2020 01:02

minhtuev reviewed Jan 10, 2020

View reviewed changes

vijay120 merged commit cfb7b6e into master Jan 10, 2020

vijay120 deleted the feature/language_build branch January 10, 2020 01:31

minhtuev mentioned this pull request Feb 11, 2020

Expose language parameter #83

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added language and locale params to NLP config #98

Added language and locale params to NLP config #98

vijay120 commented Jan 3, 2020 •

edited

Loading

minhtuev Jan 6, 2020

vijay120 Jan 6, 2020

minhtuev left a comment

minhtuev Jan 6, 2020

serapio Jan 7, 2020

ktick Jan 7, 2020

vijay120 Jan 8, 2020

minhtuev left a comment

ktick Jan 7, 2020

ktick left a comment

minhtuev Jan 8, 2020

vijay120 Jan 8, 2020

serapio Jan 8, 2020

vijay120 Jan 8, 2020

serapio Jan 8, 2020

vijay120 Jan 8, 2020

serapio Jan 8, 2020

vijay120 Jan 8, 2020

minhtuev left a comment

serapio Jan 9, 2020

vijay120 Jan 10, 2020

minhtuev Jan 10, 2020

serapio Jan 10, 2020

vijay120 Jan 10, 2020

minhtuev Jan 10, 2020

minhtuev Jan 10, 2020

serapio Jan 10, 2020

Added language and locale params to NLP config #98

Added language and locale params to NLP config #98

Conversation

vijay120 commented Jan 3, 2020 • edited Loading

What does this PR do?

Proposed design change

Limitations

Choose a reason for hiding this comment

Choose a reason for hiding this comment

minhtuev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

minhtuev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ktick left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

minhtuev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vijay120 commented Jan 3, 2020 •

edited

Loading