Skip to content

Commit

Permalink
docs: spacy DocBin cookbook (#1642)
Browse files Browse the repository at this point in the history
(cherry picked from commit 625d153)
  • Loading branch information
ignacioct authored and frascuchon committed Aug 18, 2022
1 parent d0b311c commit 14ca818
Showing 1 changed file with 50 additions and 5 deletions.
55 changes: 50 additions & 5 deletions docs/guides/cookbook.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -760,6 +760,49 @@
"rb.log(records=record, name=\"lesmiserables-pos\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Export to spaCy Docbin\n",
"\n",
"With the examples above, we have been able to store data from spaCy models into Rubrix. But Rubrix datasets are not bound to stay always inside Rubrix, and we can export them in the [spaCy Docbin format](https://spacy.io/api/docbin). With the example below, you can export your Rubrix dataset into a Docbin and save it to disk.\n",
"\n",
"To create a Docbin, we need an spaCy language model to create the annotations."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-07-28 13:25:14.367 | WARNING | datasets.builder:download_and_prepare:577 - Reusing dataset conll2003 (/Users/ignacio/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)\n",
"2022-07-28 13:25:14.372 | WARNING | rubrix.client.datasets:_remove_unsupported_columns:252 - Following columns are not supported by the TokenClassificationRecord model and are ignored: ['pos_tags', 'chunk_tags']\n",
"2022-07-28 13:25:16.324 | WARNING | rubrix.client.datasets:from_datasets:761 - Ignoring row with no tokens.\n"
]
}
],
"source": [
"import spacy\n",
"import rubrix as rb\n",
"\n",
"from datasets import load_dataset\n",
"\n",
"# Dataset loading to export it as a Docbin\n",
"dataset_raw = load_dataset(\"conll2003\", split=\"train\")\n",
"dataset_rubrix = rb.DatasetForTokenClassification.from_datasets(dataset_raw, tags=\"ner_tags\")\n",
"\n",
"# Loading an spaCy blank language model to create the Docbin, as it works faster\n",
"nlp = spacy.blank(\"en\")\n",
"\n",
"# After this line, the file will be stored in disk\n",
"dataset_rubrix.prepare_for_training(framework=\"spacy\", lang=nlp).to_disk(\"docbin\")"
]
},
{
"cell_type": "markdown",
"metadata": {
Expand Down Expand Up @@ -1556,11 +1599,8 @@
}
],
"metadata": {
"interpreter": {
"hash": "b709380ea7d1cb2eb4650c0f11ac7e002ec6a534602815725771481b4784238c"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3.8.13 ('rubrix')",
"language": "python",
"name": "python3"
},
Expand All @@ -1574,7 +1614,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "d37ff9f4b2d8cea8fd3fc88a3420f8da5f414e64666a8add97ac41ea297827c4"
}
}
},
"nbformat": 4,
Expand Down

0 comments on commit 14ca818

Please sign in to comment.