dingDONG is a metadata management infrastructure for data preparation, ETL, and machine-learning models, with complete version-control support for all data preparation, loading, and ML experiments.
Its propagation mechanism helps you design, develop, maintain, and scale complex batch and real-time data projects in a fraction of the time.
See the Ding-Dong documentation for installation and developer guides. More information can be found at http://www.biSkilled.com
Simple, because it is fast and easy to implement!
Steps for creating a data project:
- Source-to-target modeling - mostly done in Excel or external ERD mapping tools
- Implementation - done mainly with external ETL tools
- Improvements, extensions, maintenance, and deployment between environments - usually an iterative process that consumes time and effort. Metadata is constantly changing, and if the business logic involves ML/DL models, versioning must be set up to track overall result accuracy
dingDONG was developed to maintain and speed up the overall data project implementation:
- Source-to-target modeling - dingDONG tracks metadata changes such as adding, removing, or updating DB columns. Converting data types across diverse connections or adding key sequences is done automatically. Modeling can be managed in a simple Excel file; dingDONG tracks Excel updates and updates all objects using a propagation mechanism.
- Implementing E(T)L (extract, transform, load) - dingDONG's multi-threading and data-fetching optimizations provide a powerful tool for massive bulk data loading
- Transform - calculated columns can be added, and existing columns manipulated, with Python functions. A function such as scaling or normalization can be added as easily as a current-date or concatenation function
- Business logic - business logic in dingDONG is maintained in SQL files that can be executed directly from the development IDE. dingDONG's comment syntax is used to extract and execute complex queries from those SQL files, which allows managing all business logic in separate files.
- Version control - dingDONG's version control keeps track of all metadata changes and stores them in a dedicated dingDONG DB (MongoDB). dingDONG supports the GIT version-control mechanism for managing ML/DL projects and for versioning and tracking experiment history and results, with full support for distributed versioning on GitHub.
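The transform feature above mentions adding Python functions, such as scaling or normalization, to columns. As a hedged illustration (plain Python only — this is the kind of column function described, not the dingDONG registration API), a min-max normalization function might look like:

```python
def min_max_scale(values):
    """Scale a list of numeric values into the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero for constant columns
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: scale a numeric column before loading
print(min_max_scale([10, 20, 30]))  # → [0.0, 0.5, 1.0]
```

How such a function is attached to a workflow column is covered in the dingDONG documentation.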
dingDONG modules:
- DING - creates and manages the overall metadata structure for all objects listed in the work-flow
    - creating new objects
    - modifying existing objects by using a back-propagation mechanism
    - updating data into new objects
    - saving old structures
    - CI source control
- DONG - extracts and loads data from diverse connectors
    - extract - supports multi-threading for extracting massive data volumes
    - transform - adds custom functions on existing columns, or new custom calculated fields
    - merge - merges the source with the target object (supported only for objects on the same connector)
    - exec - executes PL/SQL / stored-procedure commands
- VERSION CONTROL - tracks metadata and code updates
    - version numbers are managed automatically (example: `<major-release>.<minor-release>.<fixId>`)
    - any change to object metadata is automatically stored in the dingDONG repository
    - supports managed GIT versioning of ML/DL experiments:
        - a code update creates a new version
        - experiment results are stored in the dingDONG repository for future analytics
    - roll back through revision history by version number
    - define measures such as counting total input or total output rows
    - store execution results to compare experiment performance
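Version numbers above follow a `<major-release>.<minor-release>.<fixId>` pattern. dingDONG manages these numbers automatically; purely as an illustration of the scheme (the helper below is hypothetical, not a dingDONG function), bumping one component of such a version string could look like:

```python
def bump_version(version, part="fixId"):
    """Increment one component of a '<major>.<minor>.<fixId>' version string."""
    major, minor, fix = (int(p) for p in version.split("."))
    if part == "major":
        major, minor, fix = major + 1, 0, 0
    elif part == "minor":
        minor, fix = minor + 1, 0
    else:
        fix += 1
    return "%d.%d.%d" % (major, minor, fix)

print(bump_version("1.2.3"))           # → 1.2.4
print(bump_version("1.2.3", "minor"))  # → 1.3.0
```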
Download from GitHub or install by using pip:

```shell
pip install dingDong
```
Bundling MongoDB with the pip installation is still in the oven; installation instructions will come soon. Docker support is cooking as well and will come soon.
Samples can be found in the Ding-Dong sample documentation.
The sample below loads 3 CSV files into SQLite, then uses SQL to merge them and extract the data into a report table and a CSV file.
Download the ZIP file with the 3 sample CSV files (DATAELEMENTDESCRIPTION.csv, DEMOGRAPHICS.csv, MEASURESOFBIRTHANDDEATH.csv) located at samples/sampleHealthCare/csvData.zip. In this sample, we use C:\dingDONG as the main folder for all source CSV files and dingDONG logs.
The full code sample, extractCsvToSqlLite.py, is located in the samples/sampleHealthCare/ folder.
The sample demonstrates how to load the 3 CSV files into SQLite, create a simple query based on those tables, and send the result to a new CSV file.
Configuration properties can be found in the dingDONG documentation.
```python
import logging

from dingDong import DingDong
from dingDong import Config

""" Main configuration """

""" Set log level: logging.INFO, logging.DEBUG, logging.ERROR """
Config.LOGS_DEBUG = logging.DEBUG

""" Set the connection URL for all connectors.
    key   -> general connection name or connection type (sql, oracle, file, ...)
    value -> can be a string or a dictionary
        String     -> connection string URL (key defines the connection type: sql, oracle, mySql, ...)
        Dictionary -> must have 'conn' (connection type) and 'url' (connection string)
    Available connections can be found at dingDong.misc.enumsJson.eConn """
Config.CONNECTIONS = {
    'sampleSql': {'conn': 'sql', "url": "<Sql server connection string>;UID=USER;PWD=PWD;"},
    'file': "C:\\dingDong\\",
    'sqlite': "C:\\dingDong\\sqlLiteDB.db"
}
```
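As the configuration comments describe, a `Config.CONNECTIONS` value can be either a plain URL string (the key names the connection type) or a dictionary with explicit `conn` and `url` keys. A hedged sketch of normalizing both forms into one shape (the helper name is hypothetical, not part of dingDONG):

```python
def normalize_connection(key, value):
    """Return a (connType, url) pair from either CONNECTIONS value form.

    - string value: the key itself names the connection type (e.g. 'sql', 'file')
    - dict value:   carries explicit 'conn' and 'url' entries
    """
    if isinstance(value, dict):
        return value["conn"], value["url"]
    return key, value

connections = {
    "sampleSql": {"conn": "sql", "url": "DRIVER={SQL Server};SERVER=srv;UID=u;PWD=p;"},
    "file": "C:\\dingDong\\",
}
print(normalize_connection("file", connections["file"]))
print(normalize_connection("sampleSql", connections["sampleSql"])[0])  # → sql
```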
- Creating a workflow can be done in JSON format or with Python dictionaries. For the sake of this example, we will use a Python dictionary. The sample work-flow will:
    - map and load the CSV file named DATAELEMENTDESCRIPTION into an SQLite table named dateElements_Desc
    - map and load the CSV file named DEMOGRAPHICS into an SQLite table named demographics
    - map and load the CSV file named MEASURESOFBIRTHANDDEATH into an SQLite table named birthDate
    - create a new query based on demographics and birthDate into a new table named Final
    - update sample fields in the Final table by using a direct PL/SQL query
    - extract the Final data into a CSV file. We use VARCHAR(200) as the default CSV column data type; this configuration can be found at DEFAULTS under dingDong.conn.baseBatch
```python
nodesToLoad = [
    {"source": ["file", "DATAELEMENTDESCRIPTION.csv"],
     "target": ["sqlite", "dateElements_Desc"]},

    {"source": ["file", "DEMOGRAPHICS.csv"],
     "target": ["sqlite", "demographics"]},

    {"source": ["file", "MEASURESOFBIRTHANDDEATH.csv"],
     "target": ["sqlite", "birthDate"]},

    {"query": ["sqlite", """
        SELECT  d.[State_FIPS_Code] AS A, d.[County_FIPS_Code] AS B,
                d.[County_FIPS_Code] AS G, d.[County_FIPS_Code],
                d.[CHSI_County_Name], d.[CHSI_State_Name],
                [Population_Size], [Total_Births], [Total_Deaths]
        FROM demographics d
        INNER JOIN birthDate b
            ON  d.[County_FIPS_Code] = b.[County_FIPS_Code]
            AND d.[State_FIPS_Code] = b.[State_FIPS_Code]"""],
     "target": ["sqlite", "Final", -1]},

    {"myexec": ["sqlite", "Update dateElements_Desc Set [Data_Type] = 'dingDong';"]},

    {"source": ["sqlite", "Final"],
     "target": ["file", "final.csv"]}
]
```
- Initialize the DingDong class. Its main parameters:
Parameter | Description
---|---
dicObj | Load a dictionary as a work-flow
dirData | Load JSON files from this folder
includeFiles | Include only these files from the directory folder (dirData)
notIncludeFiles | Exclude these files from the directory folder (dirData)
connDict | Equal to Config.CONNECTIONS; sets the connection URL strings
processes | Max number of parallel threads for loading data (DONG module)
```python
m = DingDong(dicObj=nodesToLoad,
             filePath=None,
             dirData=None,
             includeFiles=None,
             notIncludeFiles=None,
             connDict=None,
             processes=1)
```
- DING
    - creates the dateElements_Desc, demographics, and birthDate tables based on the CSV files
    - extracts the query structure and creates the Final table
    - flag -1 (the default) indicates that when the structure changes, the old structure is stored with all its data and the object is updated to the new structure
```python
m.ding()
```
- DONG - extracts and loads data from the CSV files into the SQLite tables using the default truncate -> insert method, and extracts the query data into the Final table
    - if the object structure changed and mode is 1 (the default mode):
        - a history table is created
        - the new object structure is created, and the new object is populated with data from the history table (matching column names only)
```python
m.dong()
```
Full sample code:
```python
import logging

from dingDong import DingDong
from dingDong import Config

Config.CONNECTIONS = {
    'x1': {'conn': 'sql', "url": "DRIVER={SQL Server};SERVER=CPX-VLQ5GA42TW2\SQLEXPRESS;DATABASE=ContosoRetailDW;UID=bpmk;PWD=bpmk;"},
    'x2': {'conn': 'sql', "url": "DRIVER={SQL Server};SERVER=CPX-VLQ5GA42TW2\SQLEXPRESS;DATABASE=ContosoRetailDW;UID=bpmk;PWD=bpmk;"},
    'file': "C:\\dingDong\\",
    'sqlite': "C:\\dingDong\\sqlLiteDB.db"
}

Config.LOGS_DEBUG = logging.DEBUG
Config.LOGS_DIR = "C:\\dingDong"

nodesToLoad = [
    {"source": ["file", "DATAELEMENTDESCRIPTION.csv"],
     "target": ["sqlite", "dateElements_Desc"]},

    {"source": ["file", "DEMOGRAPHICS.csv"],
     "target": ["sqlite", "demographics"]},

    {"source": ["file", "MEASURESOFBIRTHANDDEATH.csv"],
     "target": ["sqlite", "birthDate"]},

    {"query": ["sqlite", """
        SELECT  d.[State_FIPS_Code] AS A, d.[County_FIPS_Code] AS B,
                d.[County_FIPS_Code] AS G, d.[County_FIPS_Code],
                d.[CHSI_County_Name], d.[CHSI_State_Name],
                [Population_Size], [Total_Births], [Total_Deaths]
        FROM demographics d
        INNER JOIN birthDate b
            ON  d.[County_FIPS_Code] = b.[County_FIPS_Code]
            AND d.[State_FIPS_Code] = b.[State_FIPS_Code]"""],
     "target": ["sqlite", "Final", 2]},

    {"myexec": ["sqlite", "Update dateElements_Desc Set [Data_Type] = 'dingDong';"]},

    {"source": ["sqlite", "Final"],
     "target": ["file", "final.csv"]}
]

m = DingDong(dicObj=nodesToLoad,
             filePath=None,
             dirData=None,
             includeFiles=None,
             notIncludeFiles=None,
             connDict=None,
             processes=1)

m.ding()
m.dong()
```
We would like to create a platform that enables designing, implementing, and maintaining data integration projects, including:
- Any REST API connectivity, from any API to any API, using simple JSON mapping
- Any relational database connectivity using JSON mapping
- Any non-relational storage
- A main platform for any middleware business logic - from a simple if-then-else up to statistical algorithms using ML and DL
- Real-time and scheduled integration
- A single point of truth - maintain all changes by using git source control, enabling version comparison and rollback as needed
We will extend our connectors and metadata manager accordingly.
Connector type | Python module | Checked version | Dev status | Notes
---|---|---|---|---
sql | pyOdbc | 4.0.23 | tested, prod | slow extraction; for massive data volumes, ceODBC is preferred
sql | ceODBC | 2.0.1 | tested, prod | SQL Server connection for massive data loading; installed manually from the 3rdPart folder
access | pyOdbc | 4.0.23 | tested, prod | |
oracle | cx-oracle | 6.1 | tested, prod | |
CSV / text files | csv / csv23 | 0.1.5 | tested, prod | |
mysql | pyMySql | 0.6.3rc1 | dev | |
vertica | vertica-python | 0.9.1 | dev | |
sqlite | sqlite3 | 6.1 | tested, prod | |
mongoDb | pyMongo | 3.7.2 | dev | |
salesforce | simple_salesforce | 3.7.2 | dev | |
Hadoop/Hive | . | . | dev | |
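Most of the relational connectors above go through ODBC connection strings like the ones in the samples (`DRIVER={...};SERVER=...;UID=...;PWD=...;`). As a hedged convenience sketch (this helper is hypothetical, not part of dingDONG or pyOdbc), assembling that `KEY=value;` format could look like:

```python
def build_odbc_conn_str(driver, server, database, uid, pwd):
    """Assemble a semicolon-delimited ODBC connection string."""
    parts = {
        "DRIVER": "{%s}" % driver,  # ODBC driver names are wrapped in braces
        "SERVER": server,
        "DATABASE": database,
        "UID": uid,
        "PWD": pwd,
    }
    return ";".join("%s=%s" % (k, v) for k, v in parts.items()) + ";"

print(build_odbc_conn_str("SQL Server", "localhost", "demo", "user", "secret"))
# → DRIVER={SQL Server};SERVER=localhost;DATABASE=demo;UID=user;PWD=secret;
```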
dingDONG was created by Tal Shany (tal@biSkilled.com)
We are looking for contributions!
GNU General Public License v3.0
See COPYING to see the full text.