# Train spaCy Model for Marathi (mr)

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

Use the [quickstart widget](https://spacy.io/usage/training) on spaCy's website to generate a base configuration, and save it as `base_config.cfg` in this folder.

Run the following command to fill in the remaining default values and write the full `config.cfg`:

In [1]:
!python -m spacy init fill-config ./base_config.cfg ./config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
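
Before training, it can be useful to confirm what the filled config declares. The following is a minimal sketch that reads the `[nlp]` and `[paths]` sections with Python's standard `configparser`. The helper name `summarize_config` and the sample excerpt (including its component list) are assumptions for illustration; a real `config.cfg` has many more sections, and spaCy itself parses configs with its own loader.

```python
import configparser
import json

# Illustrative excerpt only (an assumption, not the generated file):
# the real config.cfg produced by `init fill-config` is much longer.
SAMPLE_CONFIG = """
[paths]
train = null
dev = null

[nlp]
lang = "mr"
pipeline = ["tok2vec","tagger","morphologizer","parser"]
"""


def summarize_config(text):
    """Return language, pipeline, and data paths from a spaCy-style config string."""
    # Interpolation is disabled so references such as ${paths.train} are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Values in spaCy configs are JSON-formatted, so json.loads can decode them.
    return {
        "lang": json.loads(parser["nlp"]["lang"]),
        "pipeline": json.loads(parser["nlp"]["pipeline"]),
        "train": json.loads(parser["paths"]["train"]),
        "dev": json.loads(parser["paths"]["dev"]),
    }


summary = summarize_config(SAMPLE_CONFIG)
print(summary)
```

To inspect the actual file, pass its contents instead of the sample, for example `summarize_config(open("config.cfg", encoding="utf-8").read())`. The `null` paths are expected here, since the training command supplies them with `--paths.train` and `--paths.dev`.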


Go to the [Universal Dependencies website](https://universaldependencies.org/) and download the [UD_Marathi-UFAL](https://github.com/UniversalDependencies/UD_Marathi-UFAL) treebank (the train and dev files). Convert the CoNLL-U files to spaCy's binary format using the following commands. The second argument is an output directory, so the converted files are written inside `./dev.spacy` and `./train.spacy`:

In [2]:
!python -m spacy convert ./mr_ufal-ud-dev.conllu ./dev.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr

ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (46 documents):
dev.spacy\mr_ufal-ud-dev.spacy
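
The warning above refers to the `-n` (`--n-sents`) option of `spacy convert`, which controls how many sentences are grouped into each document. The sketch below only illustrates the idea of such grouping in plain Python. It is not spaCy's implementation, and the helper name `group_sentences` and the placeholder sentences are assumptions.

```python
def group_sentences(sentences, n_sents):
    """Split a list of sentences into consecutive chunks of at most n_sents."""
    if n_sents < 1:
        raise ValueError("n_sents must be at least 1")
    return [sentences[i:i + n_sents] for i in range(0, len(sentences), n_sents)]


# 46 placeholder sentences, matching the size of the dev set reported above.
sentences = [f"sentence {i}" for i in range(1, 47)]

one_per_doc = group_sentences(sentences, 1)    # the default used in this notebook
ten_per_doc = group_sentences(sentences, 10)   # what `-n 10` would correspond to

# Number of documents in each setting, and the size of the last grouped document.
print(len(one_per_doc), len(ten_per_doc), len(ten_per_doc[-1]))
```

With one sentence per document, every training example is a single sentence. Grouping gives the pipeline multi-sentence documents, which is presumably why the converter recommends it.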


In [3]:
!python -m spacy convert ./mr_ufal-ud-train.conllu ./train.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr

ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (373 documents):
train.spacy\mr_ufal-ud-train.spacy
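
Since the converter grouped one sentence per document, the document counts above should equal the number of sentences in each CoNLL-U file. The following is a minimal sketch for checking this independently of spaCy. The function name `count_conllu` and the tiny sample (with all annotation columns left as `_`) are assumptions for illustration.

```python
def count_conllu(lines):
    """Count sentences and syntactic words in an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence block.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# sent_id = ..." or "# text = ...".
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
    return sentences, words


def make_row(index, form):
    """Build a 10-column CoNLL-U row with placeholder annotations."""
    return "\t".join([str(index), form] + ["_"] * 8)


# Illustrative sample: the same five-token sentence twice, annotations omitted.
forms = ["त्याला", "एक", "मुलगा", "होता", "."]
block = [make_row(i, form) for i, form in enumerate(forms, start=1)]
sample = ["# sent_id = demo-1"] + block + ["", "# sent_id = demo-2"] + block + [""]

counts = count_conllu(sample)
print(counts)
```

To check a downloaded file, pass an open file handle instead, for example `count_conllu(open("./mr_ufal-ud-dev.conllu", encoding="utf-8"))`.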


Run the training for Marathi:

In [6]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

^C
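
The run above was interrupted, so it is worth checking what was actually saved before loading anything. `spacy train` writes its pipelines to the `model-best` and `model-last` subdirectories of the `--output` directory. The sketch below looks for them using only the standard library. The helper name `find_trained_pipeline` is hypothetical, and checking for `config.cfg` is just a simple heuristic for "a saved pipeline is here".

```python
from pathlib import Path


def find_trained_pipeline(output_dir):
    """Return the first saved pipeline directory, preferring model-best, else None."""
    base = Path(output_dir)
    for name in ("model-best", "model-last"):
        candidate = base / name
        # A saved spaCy pipeline directory contains its own config.cfg.
        if (candidate / "config.cfg").is_file():
            return candidate
    return None


# Prints the directory to pass to spacy.load, or None if training saved nothing yet.
print(find_trained_pipeline("./output"))
```

If this prints `None`, the training has to be rerun (and allowed to reach at least one evaluation step) before the loading cells below can work.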


In [None]:
import spacy

In [None]:
# spacy train saves pipelines under the output directory as model-best and model-last.
nlp = spacy.load("./output/model-best")

In [None]:
text = "त्याला एक मुलगा होता."
doc = nlp(text)

In [None]:
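for token in doc:
    # One tab-separated row per token: text, character offset, lemma, coarse POS,
    # fine-grained tag, dependency label, shape, and the alpha/stop-word flags.
    fields = (token.text, str(token.idx), token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, str(token.is_alpha), str(token.is_stop))
    print("\t".join(fields))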
