💫 Adding models for new languages master thread #3056

Open
ines opened this Issue Dec 16, 2018 · 17 comments

@ines
Member

ines commented Dec 16, 2018

This thread bundles discussion around adding pre-trained models for new languages (and improving the existing language data). A lot of information and discussion has been spread across various issues (usually specific to a single language), which made it difficult to get an overview.

See here for the available pre-trained models, and this page for all languages currently available in spaCy. Languages marked as "alpha support" usually only include tokenization rules and various other rules and language data.

How to go from alpha support to a pre-trained model

The process requires the following steps and components:

  • Language data: shipped with spaCy, see here. The tokenization should be reliable, and there should be a tag map that maps the tags used in the training data to coarse-grained tags like NOUN and optional morphological features.
  • Training corpus: the model needs to be trained on a suitable corpus, e.g. an existing Universal Dependencies treebank. Commercial-friendly treebank licenses are always a plus. Data for tagging and parsing is usually easier to find than data for named entity recognition. In the long term, we want to do more data annotation ourselves using Prodigy, but that's obviously a much bigger project. In the meantime, we have to use other available resources (academic etc.).
  • Data conversion: spaCy comes with a range of built-in converters via the spacy convert command that take .conllu files and output spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
  • Training pipeline: if we have language data plus a suitable training corpus plus a conversion pipeline, we can run spacy train to train a new model (a minimal sketch of the two commands follows below).
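As a rough illustration of these steps, here is a minimal sketch of the conversion and training commands (spaCy 2.x-style CLI), driven from Python via subprocess. The file names, the language code "xx" and the output paths are placeholders, and depending on your spaCy version you may need an extra flag to disable the NER component when the corpus has no entity annotations:

import subprocess

# Convert CoNLL-U training and development data to spaCy's JSON format.
# (Placeholder file names; spacy convert writes one .json file per input.)
subprocess.run(["python", "-m", "spacy", "convert",
                "corpus/xx-ud-train.conllu", "converted/"], check=True)
subprocess.run(["python", "-m", "spacy", "convert",
                "corpus/xx-ud-dev.conllu", "converted/"], check=True)

# Train a model for the placeholder language code "xx" on the converted data.
subprocess.run(["python", "-m", "spacy", "train", "xx", "models/",
                "converted/xx-ud-train.json", "converted/xx-ud-dev.json"],
               check=True)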

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models.

⚠️ Important note: In order to train and distribute "official" spaCy models, we need to be able to integrate and reproduce the full training pipeline whenever we release a new version of spaCy that requires new models (so we can't just upload a model trained by someone else).

Ideas for how to get involved

Contributing to the models isn't always easy, because there are a lot of different things to consider, and a big part of it comes down to sourcing suitable data and running experiments. But here are a few ideas for things that can move us forward:

1️⃣ Difficulty: good for beginners

  • Proofread and correct the existing language data for a language of your choice. There can always be typos or mistakes ported over from a different resource.
  • Write tokenizer tests with expected input / output. It's always really helpful to have examples of how things should work, to ensure we don't accidentally introduce regressions. Tests should be "fair" and representative of what's common in general-purpose texts. While edge cases and "tricky" examples can be nice, they shouldn't be the focus of the tests. Otherwise, we won't actually get a realistic picture of what works and what doesn't. See the English tests for examples, and the sketch below.
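For instance, a tokenizer test in the style of spaCy's test suite could look roughly like this sketch. The en_tokenizer fixture comes from spaCy's test conftest; for a new language you'd use the fixture for that language's code, and the example sentences here are made up:

import pytest

@pytest.mark.parametrize("text,expected_tokens", [
    ("This is a sentence.", ["This", "is", "a", "sentence", "."]),
    ("We're testing the tokenizer.", ["We", "'re", "testing", "the", "tokenizer", "."]),
])
def test_tokenizer_handles_basic_text(en_tokenizer, text, expected_tokens):
    # The fixture returns the language's tokenizer; compare the token texts.
    tokens = en_tokenizer(text)
    assert [token.text for token in tokens] == expected_tokens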

📖 Relevant documentation: Adding languages, Tokenization, Test suite Readme

2️⃣ Difficulty: advanced

  • Add a tag map for a language and its treebank (e.g. Universal Dependencies). The tag map is keyed by the fine-grained part-of-speech tag (token.tag_, e.g. "NNS"), mapped to the coarse-grained tag (token.pos_, e.g. "NOUN") and other morphological features. The tags in the tag map should be the tags used by the treebank (see the sketch after this list).
  • Experiment with training a model. Convert the training and development data using spacy convert and run spacy train to train the model. See here for an example. (Note that most corpora don't come with NER annotations, so you'll usually only be able to train the tagger and parser.) It might work out of the box straight away, or it might require some more formatting and pre-processing. Finding this out will be very helpful. You can share your results and the reproducible commands to use in this thread.
  • Prepare a raw text corpus from the CommonCrawl or a similar resource for the language you want to work on. Raw unlabelled text can be used to train the word vectors, estimate the unigram probabilities and, coming in v2.1.0, pre-train a language model similar to BERT/ELMo/ULMFiT etc. (see #2931). We only need the cleaned, raw text, for example as a .txt or .jsonl file (a short sketch of producing such a file follows after the licensing note below):
{"text": "This is a paragraph of raw text in some language"}

When using other resources, make sure the data license is compatible with spaCy's MIT license and ideally allows commercial use (since many people use spaCy commercially). Examples of suitable licenses are CC, Apache, MIT. Examples of unsuitable licenses are CC BY-NC, CC BY-SA, (A)GPL.
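To produce the .jsonl format shown above, a few lines of plain Python are enough; this is just a sketch with placeholder file names, assuming one cleaned paragraph per line of input:

import json

with open("raw_text.txt", encoding="utf8") as infile, \
        open("raw_text.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        paragraph = line.strip()
        if paragraph:
            # One {"text": ...} object per line, keeping non-ASCII characters.
            outfile.write(json.dumps({"text": paragraph}, ensure_ascii=False) + "\n")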

📖 Relevant documentation: Adding languages, Training via the CLI


If you have questions, feel free to leave a comment here. We'll also be updating this post with more tasks and ideas as we go.

@howl-anderson

Contributor

howl-anderson commented Dec 16, 2018

Currently, I am working on https://github.com/howl-anderson/Chinese_models_for_SpaCy to support a Chinese language model for spaCy. It's working pretty well for me, so how can my model make it into the official repository? Any suggestions?

@ursachi


ursachi commented Dec 17, 2018

When it comes to the Romanian language, there is a particularity which I'm not sure has been discussed elsewhere. For historical reasons, most keyboards in Romania do not have keys for the Romanian characters with diacritics (ă, â, î, ș, ț). People have therefore substituted them with their base letters (a, i, s, t) in most of the written language, often also in official documents, and I'd say in the majority of the online data (no citation available).

Looking at the RO language data, I can observe that most of the data is written using the diacritics, but there are exceptions (e.g. "aceeasi" instead of "aceeași" in the list of STOP_WORDS). This may be an issue for lemmatization and machine understanding.

Here's an example where Romanians usually understand this adaptation well, while I'm not sure how spaCy would handle it:
"Ana are doua fete" could be written as:

  • "Ana are două fete" meaning "Ana has two girls"
  • "Ana are două fețe" meaning "Ana has a duplicitous behaviour" (literally "Ana has two faces")

Thus the options:

  1. Should we correct the entries wherever diacritics haven't been added, or
  2. update all files and also add the words without the diacritics?

I believe the first option is easier and better, and users can adjust the corpora used for training accordingly. What's your take?

@honnibal

Member

honnibal commented Dec 17, 2018

@howl-anderson Thanks for your work on the Chinese model! We have a license to the OntoNotes 5 data, so I can use your scripts to convert the corpus and try to get Chinese added to our model training pipeline. In order to support a model officially, we need to have the model training with our scripts; otherwise the binary would go out of date when we make changes to the library. But from your scripts, it looks like this should be fairly easy.

One thing that would probably be good to try is using a different treebank instead of the UD_Chinese corpus. For instance, we should be able to run a dependency converter on the OntoNotes 5 Chinese parts. We also have a license to the Penn Chinese treebank.

The reason to convert another corpus is that the UD_Chinese corpus is licensed CC BY-NC, so we won't be able to distribute a commercially-friendly Chinese model if we use that data to train the tagger and parser. For OntoNotes 5 and the Penn Chinese Treebank, we'd be able to release MIT-licensed models, like we do for English and German. So, we need to find a good dependency converter that works with Chinese, and run it over the treebank. I'm not sure whether the Stanford converter works with Chinese; if so, it'd be a good choice. We could also check the ClearNLP converter, the MALT converter, and the conll-09 converter.

@honnibal

Member

honnibal commented Dec 17, 2018

@ursachi Thanks for raising this. Orthographic and dialect variation is something we need to pay more attention to.

Perhaps a good solution would be to provide a function that restores the diacritics if they're missing? I'm not sure how difficult this would be, but it might be easy for common words. If so, this would be a useful utility that people could use as a pre-process.

For the stop list, I'd be happy to have duplicate versions in the stop words, with and without diacritics. I'd say the same for the tokenizer exceptions. The lemma lookup tables already get rather large, so it seems like we probably want to do this inside a lemmatizer function, instead of duplicating the data there.

More generally, there's the question of how to handle this for statistical models. A model trained on a corpus with diacritics will not perform well on text without the diacritics, and vice versa. One solution is to apply a data augmentation process, so that the model sees both types of text. We would need to have a function that takes the training data, and returns two versions: one with diacritics, and one without, with the same gold-standard analyses in both cases.
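As an illustration of that augmentation idea, a sketch in plain Python might look like the following (not an existing spaCy helper; the function name and the assumption that the annotations are token- or offset-based are illustrative):

# Map Romanian characters with diacritics to their base letters.
STRIP_DIACRITICS = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")

def augment_diacritics(examples):
    """Yield each (text, annotations) pair as-is and with diacritics stripped.

    Because characters are replaced one-for-one, token boundaries and character
    offsets stay valid, so the same gold-standard analysis applies to both versions.
    """
    for text, annotations in examples:
        yield text, annotations
        stripped = text.translate(STRIP_DIACRITICS)
        if stripped != text:
            yield stripped, annotations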

@howl-anderson

Contributor

howl-anderson commented Dec 17, 2018

Thank you @honnibal, I have a license for the OntoNotes 5 data too, and I will try to get a license for the Penn Chinese Treebank. Meanwhile, I will check and update my scripts so they can be integrated more easily with yours. Let's keep in touch; I will keep you informed.

@honnibal

Member

honnibal commented Dec 17, 2018

@howl-anderson Could you have a look at constituency-to-dependency conversion scripts? If we can run the UD scripts on another corpus, that might be good. Alternatively, some other converter would be a good option. The Chinese corpora tend to be distributed as constituency parses, but spaCy needs to learn from dependencies.

@howl-anderson

Contributor

howl-anderson commented Dec 17, 2018

@honnibal Absolutely! I am very happy to participate in this project. When I get the results, I'll let you know.

@oroszgy

Contributor

oroszgy commented Jan 4, 2019

I have an experimental release with a UD-based Hungarian model; would this be interesting for the community?

@honnibal

Member

honnibal commented Jan 21, 2019

@oroszgy This looks really good!

I'll take a look at adding the data files for this to the model training pipeline. Currently I just need to update the machine image that has the corpora, to add new datasets. I need to make an update for Norwegian as well.

In theory, once the data files are added, it should be pretty simple to publish the model. We need to have the pipeline train the model, though, rather than just getting the artifact from you; otherwise we can't retrain when we make code changes etc.

@oroszgy

Contributor

oroszgy commented Jan 21, 2019

@honnibal Let me know if there is anything I can help with.

@howl-anderson

Contributor

howl-anderson commented Jan 28, 2019

@honnibal Just to keep you informed: I found http://nlp.cs.lth.se/software/treebank_converter/ ("The LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks"), which looks like a promising tool for converting treebanks to CoNLL format. Since I still can't get a Chinese treebank corpus, I can't test it yet, but I will keep trying to get a licensed Chinese treebank and will keep you informed.

@Shashi456


Shashi456 commented Jan 30, 2019

Can we add a basic Marathi tokenizer as well? It's a language very close to Hindi, except for a few extra words and stem suffixes. The stemmer could be ported from here and here; the latter was adapted from the same paper you mentioned for the Hindi support. The stem suffixes mentioned in the latter are listed below; although it's not a complete list, in tandem with the first resource it should cover a large part:

suffixes = {
    1: ["啷", "啷", "啷", "啷", "啶", "啶" , " 啷"  , " 啷" ,  "啶" , "啶" , "啶" , "啶" , "啶" ,  "啶"],
    2: ["啶ㄠ" , "啶む" , "啶ㄠ" , "啶ㄠ" , "啶灌" , "啶む" ,"啶ぞ" , "啶侧ぞ" , "啶ㄠぞ" , "啶娻ぃ" , "啶多" , "啶多" , "啶氞ぞ" , "啶氞" , "啶氞", "啶⑧ぞ" , "啶班" , "啶∴" ,  "啶む" , "啶距え" , " 啷啶" , "啶∴ぞ" , "啶∴" , "啶椸ぞ" , "啶侧ぞ" , "啶赤ぞ" , "啶ぞ" , "啶掂ぞ" , "啶" , "啶掂" , "啶む" ],
    3: ["啶多く啶" , "啶灌啶"],
    4: [" 啷佮ぐ啶∴ぞ"],
}

A list of basic stop words is available here, while the numbers are available here.

Should I put in a PR?

@ines

Member Author

ines commented Jan 30, 2019

@Shashi456 Yes, that sounds good! 👍

@blamm0


blamm0 commented Feb 11, 2019

Hi,

I'm trying to train NER for the Lithuanian language.

What I have already done:

  1. Created a new language using your template files.
  2. Compiled spaCy from source with support for the 'lt' language.
  3. Trained a custom model using Universal Dependencies data - UD_Lithuanian-HSE.

I have a few questions:

  1. How many sentences of training data are required? What's the minimum?
  2. I'm using a modified version of your train_ner.py script; the modification is that it loads a custom model from disk.

Below is the training data; the trained NER model should recognize these entities, right?
# training data
TRAIN_DATA = [
    ("Kas yra Valdas Adamkus?", {"entities": [(7, 22, "PERSON")]}),
    ("Man patinka Kaunas ir Vilnius.", {"entities": [(11, 18, "LOC"), (21, 29, "LOC")]}),
    ("Štai atėjo Petras.", {"entities": [(10, 17, "PERSON")]}),
]

So far when testing the trained model I get no entities.
Thanks in advance.

@ines

Member Author

ines commented Feb 11, 2019

How many sentences of training data are required? What's the minimum?
Below is the training data; the trained NER model should recognize these entities, right?

The model should learn to generalise based on those examples: the training process expects to see lots of examples and generalise from them, not see a handful of examples and memorise them. So ideally, you want a few thousand examples or more. For reference, the English model was trained on 2 million words.

One strategy would be to take the Lithuanian UD corpus you trained on and label it for named entities. This is how the new Greek model for spaCy was trained btw.

@blamm0


blamm0 commented Feb 11, 2019

Ok, thanks for the information.

@shanalikhan


shanalikhan commented Feb 21, 2019

I have one question: can we contribute the English transliteration of a language to spaCy as a new language?
