Turn Chinese natural language into structured data (中文自然语言理解, Chinese natural language understanding)


Rasa NLU for Chinese, a fork of RasaHQ/rasa_nlu.

Please refer to the latest instructions in the official Rasa NLU documentation.

中文Blog (Chinese blog)

Files you should have:

  • data/total_word_feature_extractor_zh.dat

Trained on a Chinese corpus with the MITIE wordrep tool (training takes 2-3 days).

For training, please build the MITIE wordrep tool. Note that a Chinese corpus must be tokenized before it is fed into the tool for training. A closed-domain corpus that closely matches your use case works best.
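The wordrep input format can be illustrated with a short sketch: one document per line, tokens separated by spaces. The token lists below are hard-coded for illustration only; in practice they would come from a Chinese tokenizer such as jieba (e.g. `jieba.lcut(line)`), and the output filename is an arbitrary example.

```python
# Sketch of preparing a corpus for MITIE wordrep: one document per line,
# tokens separated by single spaces. Tokens are hard-coded here; a real
# pipeline would segment each line with a tokenizer such as jieba.

def to_wordrep_line(tokens):
    """Join pre-segmented tokens into the space-separated line wordrep expects."""
    return " ".join(tokens)

docs = [
    ["我", "发烧", "了"],        # "I have a fever"
    ["该", "吃", "什么", "药"],  # "what medicine should I take"
]

with open("corpus_tokenized.txt", "w", encoding="utf-8") as f:
    for tokens in docs:
        f.write(to_wordrep_line(tokens) + "\n")
```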

A model trained on the Chinese Wikipedia Dump and Baidu Baike can be downloaded from the 中文Blog (Chinese blog) linked above.

  • data/examples/rasa/demo-rasa_zh.json

Add as many examples as possible.
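For orientation, a minimal fragment in the rasa_nlu training data format, using the "medical" intent and "disease" entity that appear in the server response example below:

```json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "我发烧了该吃什么药",
        "intent": "medical",
        "entities": [
          {
            "start": 1,
            "end": 3,
            "value": "发烧",
            "entity": "disease"
          }
        ]
      }
    ]
  }
}
```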

Usage:

  1. Clone this project and run:
python setup.py install
  2. Modify the configuration.

    Currently for Chinese we have two pipelines:

    Use MITIE+Jieba (sample_configs/config_jieba_mitie.yml):

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_classifier_mitie"

RECOMMENDED: Use MITIE+Jieba+sklearn (sample_configs/config_jieba_mitie_sklearn.yml):

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
  3. (Optional) Use a Jieba user-defined dictionary or switch the Jieba default dictionary:

    You can set either a file path or a directory path as the "user_dicts" value. (sample_configs/config_jieba_mitie_sklearn_plus_dict_path.yml)

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
  default_dict: "./default_dict.big"
  user_dicts: "./jieba_userdict"
#  user_dicts: "./jieba_userdict/jieba_userdict.txt"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
  4. Train the model by running:

    If you specify a project name in the configuration file, the model will be saved at /models/your_project_name.

    Otherwise, the model will be saved at /models/default.

python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.yml --data data/examples/rasa/demo-rasa_zh.json --path models
  5. Run the rasa_nlu server:
python -m rasa_nlu.server -c sample_configs/config_jieba_mitie_sklearn.yml --path models
  6. Open a new terminal; you can now curl results from the server, for example:
$ curl -XPOST localhost:5000/parse -d '{"q":"我发烧了该吃什么药?", "project": "rasa_nlu_test", "model": "model_20170921-170911"}' | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   652    0   552  100   100    157     28  0:00:03  0:00:03 --:--:--   157
{
    "entities": [
        {
            "end": 3,
            "entity": "disease",
            "extractor": "ner_mitie",
            "start": 1,
            "value": "发烧"
        }
    ],
    "intent": {
        "confidence": 0.5397186422631861,
        "name": "medical"
    },
    "intent_ranking": [
        {
            "confidence": 0.5397186422631861,
            "name": "medical"
        },
        {
            "confidence": 0.16206323981749196,
            "name": "restaurant_search"
        },
        {
            "confidence": 0.1212448457737397,
            "name": "affirm"
        },
        {
            "confidence": 0.10333600028547868,
            "name": "goodbye"
        },
        {
            "confidence": 0.07363727186010374,
            "name": "greet"
        }
    ],
    "text": "我发烧了该吃什么药?"
}
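The same request can be made from Python with only the standard library. This is a minimal sketch: the project and model names are taken from the curl example above, and the `parse` helper name is illustrative, not part of the rasa_nlu API.

```python
# Sketch of querying the rasa_nlu /parse endpoint with the standard library.
import json
from urllib import request

def build_payload(text, project, model):
    """Build the JSON body the /parse endpoint expects."""
    return json.dumps({"q": text, "project": project, "model": model}).encode("utf-8")

def parse(text, host="http://localhost:5000",
          project="rasa_nlu_test", model="model_20170921-170911"):
    """POST a query to the server and return the decoded JSON response."""
    req = request.Request(host + "/parse",
                          data=build_payload(text, project, model),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (with the server running):
#   result = parse("我发烧了该吃什么药?")
#   result["intent"]["name"] and result["entities"] mirror the curl output above.
```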