From 9af6972d3c81b3fa2dcd38615b0e89cffcffd8d0 Mon Sep 17 00:00:00 2001
From: Jardayn
Date: Fri, 11 Nov 2022 22:34:39 +0200
Subject: [PATCH 1/4] Minor docs imp

---
 docs/new-database-driver-guide.rst | 51 +++++++++++++++++++++---------
 1 file changed, 36 insertions(+), 15 deletions(-)

diff --git a/docs/new-database-driver-guide.rst b/docs/new-database-driver-guide.rst
index 4818e196..790d4790 100644
--- a/docs/new-database-driver-guide.rst
+++ b/docs/new-database-driver-guide.rst
@@ -24,11 +24,24 @@ Then, users can install the dependencies needed for your database driver, with `
 
 This way, data-diff can support a wide variety of drivers, without requiring our users to install libraries that they won't use.
 
-2. Implement database module
+2. Implement a database module
 ----------------------------
 
 New database modules belong in the ``data_diff/databases`` directory.
 
+The module consists of:
+1. Dialect (Class responsible for normalizing/casting fields. e.g. Numbers/Timestamps)
+2. Database class that handles connecting to the DB, querying (if the default doesn't work) , closing connectiosn and etc.
+
+Choosing a base class, based on threading Model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``.
+
+Usually, databases with cursor-based connections, like MySQL or Postgresql, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads, and creates a new connection per thread.
+
+Usually, cloud databases, such as snowflake and bigquery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``.
+
 Import on demand
 ~~~~~~~~~~~~~~~~~
 
@@ -50,16 +63,6 @@ Instead, they should be imported and initialized within a function. Example:
 
 We use the ``import_helper()`` decorator to provide a uniform and informative error. The string argument should be the name of the package, as written in ``pyproject.toml``.
 
-Choosing a base class, based on threading Model
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``.
-
-Usually, databases with cursor-based connections, like MySQL or Postgresql, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads, and creates a new connection per thread.
-
-Usually, cloud databases, such as snowflake and bigquery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``.
-
-
 :meth:`_query()`
 ~~~~~~~~~~~~~~~~~~
 
@@ -124,12 +127,10 @@ Docs:
 
 - :meth:`data_diff.databases.database_types.AbstractDatabase.close`
 
-:meth:`quote()`, :meth:`to_string()`, :meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()`
+:meth:`quote()`, :meth:`to_string()`,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-These methods are used when creating queries.
-
-They accept an SQL code fragment, and returns a new code fragment representing the appropriate computation.
+These methods are used when creating queries or normalizing fields.
 
 For more information, read their docs:
 
@@ -137,6 +138,26 @@ For more information, read their docs:
 - :meth:`data_diff.databases.database_types.AbstractDatabase.to_string`
 
 
+:meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()`
+
+Because comparing data between 2 databases requires both the data to be in the same format - we have normalization functions.
+
+Databases can have the same data in different formats, e.g. ``DECIMAL`` vs ``FLOAT`` vs ``VARCHAR``, with different precisions.
+DataDiff works by converting the values to ``VARCHAR`` and comparing it.
+Your normalize_number/normalize_timestamp functions should account for differing precions between columns.
+
+These functions accept an SQL code fragment, and returns a new code fragment representing the appropriate computation.
+
+:meth:`parse_type`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is the last resort if for type detection.
+If you added the type to the TYPE_CLASSES dict and it still isn't getting detected - you should use this function.
+
+The common approach is to use Regex to get Column Type + Precision/Width
+
+
+
 3. Add tests
 --------------
 

From 2460d4cb17b5b15a29103645bbd0eb1a592103ab Mon Sep 17 00:00:00 2001
From: Jardayn
Date: Fri, 11 Nov 2022 23:27:07 +0200
Subject: [PATCH 2/4] Added debugging section

---
 docs/new-database-driver-guide.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/new-database-driver-guide.rst b/docs/new-database-driver-guide.rst
index 790d4790..ec581f8a 100644
--- a/docs/new-database-driver-guide.rst
+++ b/docs/new-database-driver-guide.rst
@@ -157,6 +157,12 @@ If you added the type to the TYPE_CLASSES dict and it still isn't getting detect
 The common approach is to use Regex to get Column Type + Precision/Width
 
 
+4. Debugging
+-----------------------
+
+You can enable debug logging for tests by setting the logger level to ``DEBUG`` in /tests/common.py
+This will display all the queries ran + display types detected for columns.
+
 
 3. Add tests
 --------------

From b50d99b15fdf48109450b45e7eff89b18f048f5a Mon Sep 17 00:00:00 2001
From: Jardayn
Date: Mon, 14 Nov 2022 19:44:28 +0200
Subject: [PATCH 3/4] Review comments

---
 docs/new-database-driver-guide.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/new-database-driver-guide.rst b/docs/new-database-driver-guide.rst
index ec581f8a..819784cd 100644
--- a/docs/new-database-driver-guide.rst
+++ b/docs/new-database-driver-guide.rst
@@ -130,7 +130,7 @@ Docs:
 :meth:`quote()`, :meth:`to_string()`,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-These methods are used when creating queries or normalizing fields.
+These methods are used when creating queries, to cast to quote a value or cast it to VARCHAR.
 
 For more information, read their docs:
 
@@ -144,17 +144,18 @@ Because comparing data between 2 databases requires both the data to be in the s
 
 Databases can have the same data in different formats, e.g. ``DECIMAL`` vs ``FLOAT`` vs ``VARCHAR``, with different precisions.
 DataDiff works by converting the values to ``VARCHAR`` and comparing it.
-Your normalize_number/normalize_timestamp functions should account for differing precions between columns.
+Your normalize_number/normalize_timestamp functions should account for differing precisions between columns.
 
 These functions accept an SQL code fragment, and returns a new code fragment representing the appropriate computation.
 
 :meth:`parse_type`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-This is the last resort if for type detection.
-If you added the type to the TYPE_CLASSES dict and it still isn't getting detected - you should use this function.
+This is used to determine types which the system cannot effectively detect.
+Examples:
+DECIMAL(10,3) needs to be parsed by a custom algorithm. You'd be using regex to split it into Field name + Width + Scale.
+
 
-The common approach is to use Regex to get Column Type + Precision/Width
 
 
 4. Debugging

From 2e3703a3428ee28a9bc4710af6ecd362bcb9f6de Mon Sep 17 00:00:00 2001
From: Jardayn
Date: Mon, 14 Nov 2022 19:56:35 +0200
Subject: [PATCH 4/4] Review comments

---
 docs/new-database-driver-guide.rst | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/docs/new-database-driver-guide.rst b/docs/new-database-driver-guide.rst
index 819784cd..b0790325 100644
--- a/docs/new-database-driver-guide.rst
+++ b/docs/new-database-driver-guide.rst
@@ -155,16 +155,12 @@ This is used to determine types which the system cannot effectively detect.
 Examples:
 DECIMAL(10,3) needs to be parsed by a custom algorithm. You'd be using regex to split it into Field name + Width + Scale.
 
-
-
-
 4. Debugging
 -----------------------
 
 You can enable debug logging for tests by setting the logger level to ``DEBUG`` in /tests/common.py
 This will display all the queries ran + display types detected for columns.
 
-
 3. Add tests
 --------------
 
@@ -204,4 +200,3 @@ When debugging, we recommend using the `-f` flag, to stop on error. Also, use th
 -----------------------
 
 Open a pull-request on github, and we'll take it from there!
-
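
Review note on the ``parse_type`` wording in this series: the docs say a type such as DECIMAL(10,3) is split with a regex into name, width and scale, and a short illustration might help readers. The sketch below is only a possible illustration, not data-diff's actual ``parse_type`` implementation or signature. ``ParsedType``, ``split_type_repr`` and the pattern itself are hypothetical names introduced here, and the sketch assumes the database reports the column type as a plain string.

```python
import re
from typing import NamedTuple, Optional


class ParsedType(NamedTuple):
    """Hypothetical container for the pieces of a column type string."""
    name: str
    width: Optional[int]
    scale: Optional[int]


# One word for the type name, then an optional "(width)" or "(width, scale)".
_TYPE_RE = re.compile(r"^\s*(\w+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*$")


def split_type_repr(type_repr: str) -> Optional[ParsedType]:
    """Split a type string such as 'DECIMAL(10,3)' into name, width, scale.

    Returns None when the string does not match the simple pattern above.
    """
    match = _TYPE_RE.match(type_repr)
    if match is None:
        return None
    name, width, scale = match.groups()
    return ParsedType(
        name.upper(),
        int(width) if width is not None else None,
        int(scale) if scale is not None else None,
    )


print(split_type_repr("DECIMAL(10,3)"))
```

A real driver would presumably map the extracted pieces onto data-diff's own type classes; multi-word type names (for example timestamps with a time zone qualifier) would need a wider pattern than this one.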