Skip to content

Commit

Permalink
docs update 0.10.2 (#978)
Browse files Browse the repository at this point in the history
  • Loading branch information
taylorfturner committed Jul 28, 2023
1 parent 8a7f9cc commit 0420368
Show file tree
Hide file tree
Showing 297 changed files with 73,700 additions and 1 deletion.
Binary file added docs/0.10.2/doctrees/API.doctree
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added docs/0.10.2/doctrees/data_labeling.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/data_reader.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/data_readers.doctree
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added docs/0.10.2/doctrees/dataprofiler.doctree
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added docs/0.10.2/doctrees/dataprofiler.reports.doctree
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added docs/0.10.2/doctrees/dataprofiler.version.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/environment.pickle
Binary file not shown.
Binary file added docs/0.10.2/doctrees/examples.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/graph_data_demo.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/graphs.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/index.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/install.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/labeler.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/merge_profile_list.doctree
Binary file not shown.
Binary file added docs/0.10.2/doctrees/modules.doctree
Binary file not shown.
438 changes: 438 additions & 0 deletions docs/0.10.2/doctrees/nbsphinx/add_new_model_to_data_labeler.ipynb

Large diffs are not rendered by default.

364 changes: 364 additions & 0 deletions docs/0.10.2/doctrees/nbsphinx/column_name_labeler_example.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,364 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e04c382a-7c49-452b-b9bf-e448951c64fe",
"metadata": {},
"source": [
"# ColumnName Labeler Tutorial"
]
},
{
"cell_type": "markdown",
"id": "6fb3ecb9-bc51-4c18-93d5-7991bbee5165",
"metadata": {},
"source": [
"This notebook teaches how to use the existing `ColumnNameModel`:\n",
"\n",
"1. Loading and utilizing the pre-existing `ColumnNameModel`\n",
"2. Run the labeler\n",
"\n",
"First, let's import the libraries needed for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a67c197b-d3ee-4896-a96f-cc3d043601d3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import json\n",
"from pprint import pprint\n",
"\n",
"import pandas as pd\n",
"\n",
"try:\n",
" import dataprofiler as dp\n",
"except ImportError:\n",
" sys.path.insert(0, '../..')\n",
" import dataprofiler as dp"
]
},
{
"cell_type": "markdown",
"id": "35841215",
"metadata": {},
"source": [
"## Loading and predicting using a pre-existing model using `load_from_library`\n",
"\n",
"The easiest option for users is to `load_from_library` by specifying the name for the labeler in the `resources/` folder. Quickly import and start predicting with any model from the Data Profiler's library of models available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46e36dd6",
"metadata": {},
"outputs": [],
"source": [
"labeler_from_library = dp.DataLabeler.load_from_library('column_name_labeler')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfa94868",
"metadata": {},
"outputs": [],
"source": [
"labeler_from_library.predict(data=[\"ssn\"])"
]
},
{
"cell_type": "markdown",
"id": "c71356f4-9020-4862-a1e1-816effbb5443",
"metadata": {},
"source": [
"## Loading and using the pre-existing column name labeler using `load_with_components`\n",
"\n",
"For example purposes here, we will import the exsting `ColumnName` labeler via the `load_with_components` command from the `dp.DataLabeler`. This shows a bit more of the details of the data labeler's flow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "818c5b88",
"metadata": {},
"outputs": [],
"source": [
"parameters = {\n",
" \"true_positive_dict\": [\n",
" {\"attribute\": \"ssn\", \"label\": \"ssn\"},\n",
" {\"attribute\": \"suffix\", \"label\": \"name\"},\n",
" {\"attribute\": \"my_home_address\", \"label\": \"address\"},\n",
" ],\n",
" \"false_positive_dict\": [\n",
" {\n",
" \"attribute\": \"contract_number\",\n",
" \"label\": \"ssn\",\n",
" },\n",
" {\n",
" \"attribute\": \"role\",\n",
" \"label\": \"name\",\n",
" },\n",
" {\n",
" \"attribute\": \"send_address\",\n",
" \"label\": \"address\",\n",
" },\n",
" ],\n",
" \"negative_threshold_config\": 50,\n",
" \"positive_threshold_config\": 85,\n",
" \"include_label\": True,\n",
" }\n",
"\n",
"label_mapping = {\"ssn\": 1, \"name\": 2, \"address\": 3}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9098329e",
"metadata": {},
"outputs": [],
"source": [
"# pre processor \n",
"preprocessor = dp.labelers.data_processing.DirectPassPreprocessor()\n",
"\n",
"# model\n",
"from dataprofiler.labelers.column_name_model import ColumnNameModel\n",
"model = ColumnNameModel(\n",
" parameters=parameters,\n",
" label_mapping=label_mapping,\n",
")\n",
"\n",
"\n",
"# post processor\n",
"postprocessor = dp.labelers.data_processing.ColumnNameModelPostprocessor()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "113d6655-4bca-4d8e-9e6f-b972e29d5684",
"metadata": {},
"outputs": [],
"source": [
"data_labeler = dp.DataLabeler.load_with_components(\n",
" preprocessor=preprocessor,\n",
" model=model,\n",
" postprocessor=postprocessor,\n",
")\n",
"data_labeler.model.help()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b405887-2b92-44ca-b8d7-29c384f6dd9c",
"metadata": {},
"outputs": [],
"source": [
"pprint(data_labeler.label_mapping)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11916a48-098c-4056-ac6c-b9542d85fa86",
"metadata": {},
"outputs": [],
"source": [
"pprint(data_labeler.model._parameters)"
]
},
{
"cell_type": "markdown",
"id": "da0e97ee-8d6d-4631-9b55-78ed904d5f41",
"metadata": {},
"source": [
"### Predicting with the ColumnName labeler\n",
"\n",
"In the prediction below, the data will be passed into to stages in the background\n",
"- 1) `compare_negative`: The idea behind the `compare_negative` is to first filter out any possibility of flagging a false positive in the model prediction. In this step, the confidence value is checked and if the similarity is too close to being a false positive, that particular string in the `data` is removed and not returned to the `compare_positive`.\n",
"- 2) `compare_positive`: Finally the `data` is passed to the `compare_positive` step and checked for similarity with the the `true_positive_dict` values. Again, during this stage the `positive_threshold_config` is used to filter the results to only those `data` values that are greater than or equal to the `positive_threshold_config` provided by the user."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe519e65-36a7-4f42-8314-5369de8635c7",
"metadata": {},
"outputs": [],
"source": [
"# evaluate a prediction using the default parameters\n",
"data_labeler.predict(data=[\"ssn\", \"name\", \"address\"])"
]
},
{
"cell_type": "markdown",
"id": "b41d834d-e47b-45a6-8970-d2d2033e2ade",
"metadata": {},
"source": [
"## Replacing the parameters in the existing labeler\n",
"\n",
"We can achieve this by:\n",
"1. Setting the label mapping to the new labels\n",
"2. Setting the model parameters which include: `true_positive_dict`, `false_positive_dict`, `negative_threshold_config`, `positive_threshold_config`, and `include_label`\n",
"\n",
"where `true_positive_dict` and `false_positive_dict` are `lists` of `dicts`, `negative_threshold_config` and `positive_threshold_config` are integer values between `0` and `100`, and `include_label` is a `boolean` value that determines if the output should include the prediction labels or only the confidence values."
]
},
{
"cell_type": "markdown",
"id": "c6bb010a-406f-4fd8-abd0-3355a5ad0ded",
"metadata": {},
"source": [
"Below, we created 4 labels where `other` is the `default_label`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f86584cf-a7af-4bae-bf44-d87caa68833a",
"metadata": {},
"outputs": [],
"source": [
"data_labeler.set_labels({'other': 0, \"funky_one\": 1, \"funky_two\": 2, \"funky_three\": 3})\n",
"data_labeler.model.set_params(\n",
" true_positive_dict= [\n",
" {\"attribute\": \"ssn\", \"label\": \"funky_one\"},\n",
" {\"attribute\": \"suffix\", \"label\": \"funky_two\"},\n",
" {\"attribute\": \"my_home_address\", \"label\": \"funky_three\"},\n",
" ],\n",
" false_positive_dict=[\n",
" {\n",
" \"attribute\": \"contract_number\",\n",
" \"label\": \"ssn\",\n",
" },\n",
" {\n",
" \"attribute\": \"role\",\n",
" \"label\": \"name\",\n",
" },\n",
" {\n",
" \"attribute\": \"not_my_address\",\n",
" \"label\": \"address\",\n",
" },\n",
" ],\n",
" negative_threshold_config=50,\n",
" positive_threshold_config=85,\n",
" include_label=True,\n",
")\n",
"data_labeler.label_mapping"
]
},
{
"cell_type": "markdown",
"id": "1ece1c8c-18a5-46fc-b563-6458e6e71e53",
"metadata": {},
"source": [
"### Predicting with the new labels\n",
"\n",
"Here we are testing the `predict()` method with brand new labels for label_mapping. As we can see the new labels flow throught to the output of the data labeler."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92842e14-2ea6-4879-b58c-c52b607dc94c",
"metadata": {},
"outputs": [],
"source": [
"data_labeler.predict(data=[\"ssn\", \"suffix\"], predict_options=dict(show_confidences=True))"
]
},
{
"cell_type": "markdown",
"id": "261b903f-8f4c-403f-839b-ab8813f850e9",
"metadata": {},
"source": [
"## Saving the Data Labeler for future use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6ffbaf2-9400-486a-ba83-5fc9ba9334d7",
"metadata": {},
"outputs": [],
"source": [
"if not os.path.isdir('new_column_name_labeler'):\n",
" os.mkdir('new_column_name_labeler')\n",
"data_labeler.save_to_disk('new_column_name_labeler')"
]
},
{
"cell_type": "markdown",
"id": "09e40cb6-9d89-41c4-ae28-3dca498f8c68",
"metadata": {},
"source": [
"## Loading the saved Data Labeler"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52615b25-70a6-4ebb-8a32-14aaf1e747d9",
"metadata": {},
"outputs": [],
"source": [
"saved_labeler = dp.DataLabeler.load_from_disk('new_column_name_labeler')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1ccc0b3-1dc2-4847-95c2-d6b8769b1590",
"metadata": {},
"outputs": [],
"source": [
"# ensuring the parametesr are what we saved.\n",
"print(\"label_mapping:\")\n",
"pprint(saved_labeler.label_mapping)\n",
"print(\"\\nmodel parameters:\")\n",
"pprint(saved_labeler.model._parameters)\n",
"print()\n",
"print(\"postprocessor: \" + saved_labeler.postprocessor.__class__.__name__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c827f2ae-4af6-4f3f-9651-9ee9ebea9fa0",
"metadata": {},
"outputs": [],
"source": [
"# predicting with the loaded labeler.\n",
"saved_labeler.predict([\"ssn\", \"name\", \"address\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading

0 comments on commit 0420368

Please sign in to comment.