Schematizer
What is it?
The Schematizer is a schema store service that tracks and manages all the schemas used in the Data Pipeline and provides features like automatic documentation support. We use Apache Avro to represent our schemas.
How to download
git clone git@github.com:Yelp/schematizer.git
Tests
Running unit tests
make -f Makefile-opensource test
Running integration tests
make -f Makefile-opensource itest
Setup and Configuration
- Create a MySQL database for the Schematizer service::
CREATE DATABASE <db_name> DEFAULT CHARACTER SET utf8;
- Create the MySQL tables in the ``<db_name>`` database for the Schematizer service::
cat schema/tables/*.sql | mysql <db_name>
- Create a ``topology.yaml`` file::
topology:
- cluster: <schematizer_cluster_name>
  replica: master
  entries:
  - charset: utf8
    use_unicode: true
    host: <db_ip>
    db: <db_name>
    user: <db_user>
    passwd: <db_password>
    port: <db_port>
- In ``config.yaml``, assign values to the following settings::
schematizer_cluster: <schematizer_cluster_name>
topology_path: /path/to/topology.yaml
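The two files above are linked by the cluster name: ``schematizer_cluster`` in ``config.yaml`` should match a ``cluster`` value in ``topology.yaml``. The following is a minimal pre-flight sketch of that check, not part of Schematizer itself. It assumes both files have already been parsed into plain dictionaries (for example with a YAML library), and the function names, the example values, and the list of keys treated as required are all hypothetical.

```python
# Hypothetical pre-flight check; none of these names exist in Schematizer.
# Assumption: these are the connection keys we want to insist on.
REQUIRED_KEYS = ("host", "db", "user", "passwd", "port")


def find_cluster_entry(topology, cluster_name, replica="master"):
    """Return the first connection entry for the given cluster and replica."""
    for cluster in topology.get("topology", []):
        if cluster.get("cluster") == cluster_name and cluster.get("replica") == replica:
            entries = cluster.get("entries") or []
            if not entries:
                raise ValueError("cluster %r has no entries" % cluster_name)
            return entries[0]
    raise ValueError("cluster %r (%s) not found in topology" % (cluster_name, replica))


def check_config(config, topology):
    """Check that config points at a topology entry with all required keys."""
    entry = find_cluster_entry(topology, config["schematizer_cluster"])
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError("topology entry is missing: %s" % ", ".join(missing))
    return entry


# Example values only; substitute your own cluster name and credentials.
example_topology = {
    "topology": [
        {
            "cluster": "schematizer",
            "replica": "master",
            "entries": [
                {
                    "charset": "utf8",
                    "use_unicode": True,
                    "host": "127.0.0.1",
                    "db": "schematizer",
                    "user": "schematizer_user",
                    "passwd": "example_password",
                    "port": 3306,
                }
            ],
        }
    ]
}
example_config = {
    "schematizer_cluster": "schematizer",
    "topology_path": "/path/to/topology.yaml",
}

entry = check_config(example_config, example_topology)
```

A mismatch between the two files raises ``ValueError`` before the service is started, which tends to be easier to diagnose than a failed database connection.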
Usage
Use ``serviceinitd/schematizer.py`` to start the Schematizer service.
Interacting directly with the Schematizer service.
Registering a schema::
curl -X POST --header 'Content-Type: application/json' --header 'Accept: text/plain' -d '{
"namespace": "test_namespace",
"source_owner_email": "test@test.com",
"source": "test_source",
"contains_pii": false,
"schema": "{\"type\":\"record\",\"namespace\":\"test_namespace\",\"source\":\"test_source\",\"name\":\"test_name\",\"doc\":\"test_doc\",\"fields\":[{\"type\":\"string\",\"doc\":\"test_doc1\",\"name\":\"key1\"},{\"type\":\"string\",\"doc\":\"test_doc2\",\"name\":\"key2\"}]}"
}' 'http://127.0.0.1:8888/v1/schemas/avro'
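Note that the ``schema`` field in the request above is itself a JSON-encoded string, which is why its quotes are escaped. A minimal Python sketch that builds the same request and lets ``json.dumps`` handle the escaping might look like this. The helper name ``build_register_request`` is hypothetical; the endpoint, headers, and field names are taken from the curl example.

```python
import json
import urllib.request


def build_register_request(base_url, namespace, source, source_owner_email,
                           avro_schema, contains_pii=False):
    """Build (but do not send) the POST request for registering an Avro schema."""
    body = {
        "namespace": namespace,
        "source": source,
        "source_owner_email": source_owner_email,
        "contains_pii": contains_pii,
        # The Avro schema is sent as a JSON string nested inside the JSON body.
        "schema": json.dumps(avro_schema),
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/schemas/avro",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/plain"},
        method="POST",
    )


request = build_register_request(
    base_url="http://127.0.0.1:8888",
    namespace="test_namespace",
    source="test_source",
    source_owner_email="test@test.com",
    avro_schema={
        "type": "record",
        "namespace": "test_namespace",
        "source": "test_source",
        "name": "test_name",
        "doc": "test_doc",
        "fields": [
            {"type": "string", "doc": "test_doc1", "name": "key1"},
            {"type": "string", "doc": "test_doc2", "name": "key2"},
        ],
    },
)
# To send it against a running service:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the service.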
Getting Schema By ID::
curl -X GET --header 'Accept: text/plain' 'http://127.0.0.1:8888/v1/schemas/<schema_id>'
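The same lookup can be sketched in Python with the standard library. The helper names ``schema_url`` and ``fetch_schema_text`` are hypothetical, and because the response format is not described here, the sketch returns the raw response body as text rather than assuming any particular fields.

```python
import urllib.request
from urllib.parse import quote


def schema_url(base_url, schema_id):
    """Build the URL for fetching a schema by ID."""
    return "%s/v1/schemas/%s" % (base_url.rstrip("/"), quote(str(schema_id), safe=""))


def fetch_schema_text(base_url, schema_id, timeout=10):
    """Fetch a schema by ID and return the raw response body as text."""
    request = urllib.request.Request(
        schema_url(base_url, schema_id),
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Against a running service (42 is a placeholder schema ID):
# print(fetch_schema_text("http://127.0.0.1:8888", 42))
```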
Interacting with the Schematizer service using the Schematizer client library.
Registering a schema::
from data_pipeline.schematizer_clientlib.schematizer import get_schematizer

test_avro_schema_json = {
    "type": "record",
    "namespace": "test_namespace",
    "source": "test_source",
    "name": "test_name",
    "doc": "test_doc",
    "fields": [
        {"type": "string", "doc": "test_doc1", "name": "key1"},
        {"type": "string", "doc": "test_doc2", "name": "key2"}
    ]
}
schema_info = get_schematizer().register_schema_from_schema_json(
    namespace="test_namespace",
    source="test_source",
    schema_json=test_avro_schema_json,
    source_owner_email="test@test.com",
    contains_pii=False
)
Getting Schema By ID::
from data_pipeline.schematizer_clientlib.schematizer import get_schematizer

# schema_info is the object returned by the registration call above.
schema_info = get_schematizer().get_schema_by_id(
    schema_id=schema_info.schema_id
)
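The two client calls above can be combined into a simple round-trip check. This is a sketch only: ``register_and_fetch`` is a hypothetical helper, and it takes the client object as a parameter (for example, the result of ``get_schematizer()``) so that it stays independent of how the client is constructed. It uses only the two methods and the ``schema_id`` attribute shown in the examples above.

```python
def register_and_fetch(schematizer, namespace, source, schema_json,
                       source_owner_email, contains_pii=False):
    """Register a schema, then read it back by ID to confirm it was stored."""
    registered = schematizer.register_schema_from_schema_json(
        namespace=namespace,
        source=source,
        schema_json=schema_json,
        source_owner_email=source_owner_email,
        contains_pii=contains_pii,
    )
    fetched = schematizer.get_schema_by_id(schema_id=registered.schema_id)
    if fetched.schema_id != registered.schema_id:
        raise RuntimeError(
            "fetched schema %r does not match registered schema %r"
            % (fetched.schema_id, registered.schema_id)
        )
    return fetched


# Usage, assuming the client library is installed and configured:
# from data_pipeline.schematizer_clientlib.schematizer import get_schematizer
# schema_info = register_and_fetch(
#     get_schematizer(),
#     namespace="test_namespace",
#     source="test_source",
#     schema_json=test_avro_schema_json,
#     source_owner_email="test@test.com",
# )
```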
Disclaimer
We're still in the process of setting up this service to run stand-alone. Additional work may be required to run a Schematizer instance and integrate it with other applications.
License
Schematizer is licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
Contributing
Everyone is encouraged to contribute to Schematizer by forking the GitHub repository and making a pull request, or by opening an issue.