From a5a3969f2f1c2c3240c7628ff84cf83a7df4379f Mon Sep 17 00:00:00 2001
From: Daniel Phillips
Date: Sun, 8 Dec 2024 20:05:55 +0000
Subject: [PATCH 1/2] Additional material for blog post.

---
 _posts/2024-12-08-dbt-expectations.md | 65 ++++++++++++++++----------
 images/gx_logo_horiz_color.png | Bin 0 -> 7478 bytes
 2 files changed, 40 insertions(+), 25 deletions(-)
 create mode 100644 images/gx_logo_horiz_color.png

diff --git a/_posts/2024-12-08-dbt-expectations.md b/_posts/2024-12-08-dbt-expectations.md
index 11a91b24372..d9a789248f7 100644
--- a/_posts/2024-12-08-dbt-expectations.md
+++ b/_posts/2024-12-08-dbt-expectations.md
@@ -1,48 +1,63 @@
 ---
 layout: post
-title: Using dbt Expectations as part of a dbt build.
+title: Using dbt expectations as part of a dbt build.
 ---
-
+
 The objective of the blog post is to give a practical overview of the data transformation testing tool Great Expectations/dbt expectations.
+
+### Why data testing?
+
+Having been involved in data transformations in the past (e.g. moving data from on prem to the Azure cloud) I'm aware of the potential complexity of ensuring the quality of data from source to target, verifying the transformations at each stage and maintaining data integrity.
+
+Given
+
+### Great Expectations
+
+[Great Expectations.io](https://greatexpectations.io/) and its open source version [dbt expectations](https://github.com/calogica/dbt-expectations) are frameworks that enable automated tests to be embedded in ingestion/transformation pipelines.
+
+
+
+This is a widely used tool in data engineering, and in order to try it out and evaluate this tool, I undertook the following Udemy course, the screenshots and material are based on this:
 [The Complete dbt (Data Build Tool) Bootcamp:](https://www.udemy.com/course/complete-dbt-data-build-tool-bootcamp-zero-to-hero-learn-dbt)
 ![Microsoft AI Fundamentals](/images/AI900.png)
-This course covers the theory and practical application of a data project using snowflake as the data warehouse, and the open source version of dbt. What was particularly relevant for a tester are the sections covering dbt expectations. This post will explain at a high level what dbt expectations can do, how it can enable QA in a data ingestion/data transformation project rather than a hand on how to' guide (I can recommend the aforementioned Udemy course).
+This course covers the theory and practical application of a data project using snowflake as the data warehouse, and the open source version of dbt. What was particularly relevant for a tester are the sections covering dbt expectations. This post will explain at a high level what dbt expectations can do, how it can enable QA in a data ingestion/data transformation project rather than a hand on how to' guide.
-Purpose of this post:
+### What is dbt expectations?
-Demand for data transformation testing- dbt is a widely used tool for data engineering
+dbt expectations is an open source package for dbt based on Great Expectations, to enable testing in a data warehouse.
-What is dbt?
+ How is it used to test, and why?
-What is dbt expectations?
+Using the dbt expectations package allows data to be verified in terms of quality and accuracy at specific stages of the transformation process. It includes built in tests including not_null, unique etc. and custom tests written in sql which can extend test coverage (see /tests/no_nulls_in_dim_listings for example.)
-How is it used to test, and why?
+When the package is imported etc. the tests are written in the schema.yml file. This is a breakdown of the examples in [/models/schema.yml](https://github.com/dp2020-dev/completeDbtBootcamp/blob/main/models/schema.yml):
-Using the dbt expectations package allows data to be verified in terms of quality and accuracy at specific stages of the transformation process. It includes built in tests including not_null, unique etc. and custom tests written in sql which can extend test coverage (see /tests/no_nulls_in_dim_listings for example.)
+#### Basic Expectations:
+
+not_null: Ensures that the column doesn't contain null values.
+unique: Verifies that all values in the column are distinct.
+
+#### Relationship Expectations:
+
+relationships: Checks if a foreign key relationship exists between two columns in different models.
-When the package is imported etc. the tests are written in the schema.yml file. This is a breakdown of the examples in /models/schema.yml:
+#### Value-Based Expectations:
-Basic Expectations:
+accepted_values: Ensures that the column only contains specific values from a predefined list.
+positive_value: Verifies that the column values are positive numbers.
-not_null: Ensures that the column doesn't contain null values.
-unique: Verifies that all values in the column are distinct.
-Relationship Expectations:
+#### Statistical Expectations:
-relationships: Checks if a foreign key relationship exists between two columns in different models.
-Value-Based Expectations:
+#### dbt_expectations. expect_table_row_count_to_equal_other_table: Compares the row count of two tables.
-accepted_values: Ensures that the column only contains specific values from a predefined list.
-positive_value: Verifies that the column values are positive numbers.
-Statistical Expectations:
+dbt_expectations.expect_column_values_to_be_of_type: Checks the data type of a column.
+dbt_expectations.expect_column_quantile_values_to_be_between: Verifies that quantile values fall within a specific range.
+dbt_expectations.expect_column_max_to_be_between: Ensures that the maximum value of a column is within a certain range.
-dbt_expectations.expect_table_row_count_to_equal_other_table: Compares the row count of two tables.
-dbt_expectations.expect_column_values_to_be_of_type: Checks the data type of a column.
-dbt_expectations.expect_column_quantile_values_to_be_between: Verifies that quantile values fall within a specific range.
-dbt_expectations.expect_column_max_to_be_between: Ensures that the maximum value of a column is within a certain range.
+#### Example test:
-Example test:
 Room_type, see screenshot.
 To run the tests in the schema:
@@ -54,7 +69,7 @@ To debug, the standard tool is dbt test --debug, but the advice on the bootcamp
 In a specific example, the failing sql code is run directly against the table (in Snowflake in this example) to find where exactly the failure is.
-Lineage Graph (Data Flow DAG)
+### Lineage Graph (Data Flow DAG)
 Source data in green -> dependencies
diff --git a/images/gx_logo_horiz_color.png b/images/gx_logo_horiz_color.png
new file mode 100644
index 0000000000000000000000000000000000000000..8e493c87ec318753cfc94c50a9d52f24f447f95d
GIT binary patch
literal 7478
[base85-encoded PNG data omitted]

Date: Sun, 8 Dec 2024 20:08:02 +0000
Subject: [PATCH 2/2] add image.

---
 _posts/2024-12-08-dbt-expectations.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/_posts/2024-12-08-dbt-expectations.md b/_posts/2024-12-08-dbt-expectations.md
index d9a789248f7..6cd9fff2689 100644
--- a/_posts/2024-12-08-dbt-expectations.md
+++ b/_posts/2024-12-08-dbt-expectations.md
@@ -16,6 +16,7 @@ Given
 [Great Expectations.io](https://greatexpectations.io/) and its open source version [dbt expectations](https://github.com/calogica/dbt-expectations) are frameworks that enable automated tests to be embedded in ingestion/transformation pipelines.
+![Great Expectations logo', December 2024](/images/gx_logo_horiz_color.png)
 This is a widely used tool in data engineering, and in order to try it out and evaluate this tool, I undertook the following Udemy course, the screenshots and material are based on this: