building go-bot in russian #375

Closed
vitalyuf opened this issue Aug 16, 2018 · 15 comments

vitalyuf commented Aug 16, 2018

Hi!
I want to build a go-bot in Russian using DeepPavlov.
The bot's task is to return the phone number of a requested employee given their first name, surname, and patronymic (father's name).
I plan to use tutorial03 as a reference.
The main idea is to replace the DSTC2 dataset with a new one, which I am going to generate in DSTC2 format.
Is this approach viable?

vikmary (Contributor) commented Aug 16, 2018

Hi @vitalyuf! TL;DR: yes, it is.

I suggest first trying configs/go_bot/gobot_dstc2.json without intent_classifier (set it to null), because I guess your dataset will be small.
Then try adding intent_classifier and embedder (you will need to provide Russian embeddings, though).

To integrate a database of employee names, you should provide a dataset with appropriate db_result fields and run python -m deeppavlov train configs/go_bot/database_dstc2.json (with minor config fixes). You will then be able to use the database when training configs/go_bot/gobot_dstc2.json. See the docs for details.

If your dialogs are simple (as they seem to me), the model shouldn't require much data to train properly. The DSTC2 model was trained on 1000 dialogs; you will need fewer. I guess even 100 will be enough, but you'll have to check for yourself.
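
As a rough illustration of the "set it to null" advice, here is a minimal Python sketch that disables a component in a config that has already been loaded as a dictionary. The helper name disable_component and the trimmed config fragment are hypothetical; the real gobot_dstc2.json has a different and much larger layout, so check where intent_classifier actually sits in your copy.

```python
import json


def disable_component(config, key="intent_classifier"):
    """Return a copy of `config` where every `key` entry is None (JSON null)."""
    if isinstance(config, dict):
        return {
            k: (None if k == key else disable_component(v, key))
            for k, v in config.items()
        }
    if isinstance(config, list):
        return [disable_component(item, key) for item in config]
    return config


# Hypothetical, heavily trimmed fragment; NOT the real gobot_dstc2.json layout.
example = {
    "chainer": {
        "pipe": [
            {
                "name": "go_bot",
                "intent_classifier": {"config_path": "hypothetical/intents.json"},
                "embedder": {"name": "hypothetical_embedder"},
            }
        ]
    }
}

patched = disable_component(example)
print(json.dumps(patched, indent=2))
```

The same helper could be called again with key="embedder" if you also want to start without embeddings.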

vitalyuf (Author) commented

Thank you!

vitalyuf (Author) commented

Hi!
As I understand it, editing the configs is not enough.
First I have to provide labeled Russian datasets for training the NER and slot-filling components.
Then I have to train these components on that data.
Only after that can I build a go-bot.
Please correct me if I'm wrong.

BTW: should I open new issues in such cases, or may I reopen older ones if I run into related problems in the future?

vitalyuf reopened this Aug 22, 2018

vitalyuf (Author) commented Aug 22, 2018

Oh, I see now.
I can use configs/ner/slotfill_dstc2_raw.json as the slot-filling component, with a previously prepared dictionary in the slotfill_raw component.
In that case I don't need the NER sub-component.

vikmary (Contributor) commented Aug 22, 2018

Yes, if you want a minimal slot filler, you can use configs/ner/slotfill_dstc2_raw.json, which matches slots against your dictionary using Levenshtein distance.

vikmary (Contributor) commented Aug 22, 2018

I guess reopening the issue is the right way =)

mu-arkhipov (Contributor) commented

To use the simple pattern-matching slot_filler, you need to build a vocabulary of slot and slot-value pairs with paraphrases. You can find an example of such a vocabulary here. You need to specify the path to this file in the slotfill_raw part of the slotfill_dstc2_raw.json config file.
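
For illustration, a vocabulary like this could be generated with a few lines of Python. The layout below (slot, then canonical value, then a list of paraphrases) is an assumption, so compare it with the example vocabulary before relying on it. The helper name build_slot_vocab and the sample names are hypothetical.

```python
import json


def build_slot_vocab(values_by_slot):
    """Map slot -> canonical value -> list of surface variants (paraphrases)."""
    vocab = {}
    for slot, values in values_by_slot.items():
        vocab[slot] = {}
        for canonical, variants in values.items():
            # Keep the canonical form first and drop duplicates, preserving order.
            vocab[slot][canonical] = list(dict.fromkeys([canonical, *variants]))
    return vocab


# Hypothetical sample data: each canonical value with a few inflected forms.
vocab = build_slot_vocab({
    "name": {"иван": ["ивана", "ивану"]},
    "surname": {"иванов": ["иванова", "иванову", "иванов"]},
})

# ensure_ascii=False keeps Cyrillic readable in the resulting JSON text.
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)
print(vocab_json)
```

Writing vocab_json to a file and pointing the slotfill_raw part of the config at that file would be the next step.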

vitalyuf (Author) commented Aug 22, 2018

Thanks, I've already tried the raw filler, and it works. I plan to generate such a dictionary from a list of all possible names and surnames. I expect the resulting file to be about 30 MB.

I also tried to use ner_rus.json as a subcomponent of slotfill_dstc2 with a reference to my stc_slot_vals.json, but it failed (KeyError: 'PER' at ---> 65 for entity_name in self._slot_vals[slot]: of ~/projects/DP/DeepPavlov/deeppavlov/models/slotfill/slotfill.py in ner2slot(self, input_entity, slot)).
Maybe the output of ner_rus and the input of the dstc_slotfilling component have different formats, or dstc_slotfilling is not intended for Russian?

Generally speaking, for now I want to fill the 'name', 'surname', and 'fathersname' slots.
Maybe using ner_rus before applying the slot filler makes no sense in this case?
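
One possible reading of the traceback, sketched under assumptions: the slot filler indexes its dictionary with the tag produced by the NER model, so a 'PER' tag fails when the dictionary only has keys such as 'name' or 'surname'. The snippet below is not DeepPavlov code. TAG_TO_SLOTS and candidate_values are hypothetical names that show one way to bridge NER tags and slot names.

```python
# Hypothetical mapping from NER tags to the slots they may fill.
TAG_TO_SLOTS = {"PER": ["name", "surname", "fathersname"]}


def candidate_values(slot_vals, tag):
    """Collect canonical values from every slot that a NER tag may map to."""
    candidates = {}
    # Tags without a mapping are treated as slot names themselves.
    for slot in TAG_TO_SLOTS.get(tag, [tag]):
        # .get avoids the KeyError that slot_vals[slot] raises for unknown keys.
        for value in slot_vals.get(slot, {}):
            candidates[value] = slot
    return candidates


# Hypothetical dictionary keyed by slot names, not by NER tags.
slot_vals = {
    "name": {"иван": ["ивана"]},
    "surname": {"иванов": ["иванова"]},
    "fathersname": {"иванович": ["ивановича"]},
}

print("PER" in slot_vals)  # False: a direct slot_vals["PER"] lookup would fail
print(candidate_values(slot_vals, "PER"))
```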

vitalyuf (Author) commented Aug 23, 2018

Unfortunately, in practice the raw slot filler does not perform well for my task, for the following reasons:

  • with a large dictionary of surnames (~13 MB) the slot filler is very slow (~15 sec delay before an answer);
  • the slot filler often (also observed on a large dataset) replaces the original token with a nearby one ('Иванов'->'Живанов', for example).

mu-arkhipov (Contributor) commented Aug 23, 2018

This solution is not suited to large vocabularies at the moment. For large vocabularies, prefix trees would be much faster. We will also look into the 'Иванов'->'Живанов' case.
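
To make the prefix-tree idea concrete, here is a minimal sketch, not DeepPavlov's implementation. It stores the vocabulary in a trie and tries an exact, case-insensitive match before any fuzzy matching, so a token that is already in the vocabulary is never replaced by a neighbour. The names Trie and match_surname, and the optional fuzzy_fallback hook, are hypothetical.

```python
_END = ""  # sentinel key; cannot collide with single-character keys


class Trie:
    """Minimal prefix tree for exact membership tests."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True  # mark the end of a complete word

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node


def match_surname(token, trie, fuzzy_fallback=None):
    """Prefer an exact match; only then defer to an optional fuzzy matcher."""
    key = token.lower()
    if key in trie:
        return key
    return fuzzy_fallback(key) if fuzzy_fallback else None


surnames = Trie(["иванов", "живанов"])
print(match_surname("Иванов", surnames))
print(match_surname("Петров", surnames))
```

Lookup cost here depends on the token length rather than on the vocabulary size, which is why a structure like this tends to scale better than comparing the token against every dictionary entry.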

vitalyuf closed this as completed Sep 3, 2018

vitalyuf (Author) commented Sep 21, 2018

Hi!
For my go-bot (a phone book) I had the idea of training a NER model using your ner_dstc2.json.
I prepared dstc2-???.jsonlist files and a slot_vals.json file.
But whenever I launch python -m deeppavlov train ./ner_rus.json, a new version of slot_vals is downloaded.
So my slot_vals.json file is always overwritten.
How can I disable the downloading of this file?

vitalyuf reopened this Sep 21, 2018

mu-arkhipov (Contributor) commented Sep 21, 2018

We will fix it in the next release. In the meantime, you can try to work around it by removing a couple of lines here: https://github.com/deepmipt/DeepPavlov/blob/e91ff6c3eafbd49b6f09499d2e1b4b5675b513bb/deeppavlov/models/slotfill/slotfill.py#L115-L117
You can find the path to the installed library by running the following command in a terminal:
python -c 'import deeppavlov; print(deeppavlov.__path__)'

vitalyuf (Author) commented Sep 24, 2018

Hi!
Thanks, but the problem was in deeppavlov/dataset_iterators/dstc2_ner_iterator.py,
in the function def _build_slot_vals(slot_vals_json_path='data/':

vitalyuf (Author) commented Sep 27, 2018

Hi!
I trained a NER component (based on your DSTC2 one) to recognize names and surnames from context.
Then I added the path to the NER config file to the slot-filling config file (also based on your DSTC2 one).
Then I trained the slot-filling component.
Finally, I included the slot-filling component in an edited gobot_dstc2_full.json config file and trained that component.
But in inference mode, I guess the slot-filling component tries to find each value tagged by the NER component in its slot_vals.json dictionary.
There is no list of all possible names and surnames, so I cannot build such a dictionary; even if it were possible, the dictionary would be too large.

Please tell me if there is a way to make the slot filler skip the lookup in its slot_vals.json dictionary.

vitalyuf reopened this Sep 27, 2018
vitalyuf closed this as completed Oct 8, 2018

kshurik commented Oct 8, 2020

@vitalyuf Hi. Did you find pre-trained embeddings for the Russian language in .txt format?
