Skip to content

Explore wholesale liquor purchases in Iowa since 2012, detailing orders from various retailers. Access store information, liquor brands, sizes, and order quantities. Discuss business questions, data pipeline architecture, and real-time data solutions.

Notifications You must be signed in to change notification settings

geanpannellini/iowa_liquor_retail_sales

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

About Business Questions

You can utilize the gold area to address business inquiries effectively and you can access the dashboard in Lightdash using the gold layer.

image

The primary model, which is Sales Model, provides the means to answer all the previous questions mentioned.

1_a. Calculate the total products and revenue sold over time by quarter Analyses, Answer 1_a

Filter by 2023

quarter total_products_sold total_revenue
1 1259492 204854684.62999138
2 1539291 261004104.8299984
3 1455564 252157732.39999866
4 1513183 272791124.0000002

1_b. Identify the month where the revenue sold was 10% above the average. Analyses, Answer 1_b

Month Revenue (USD)
12 923,849,153.14
10 883,852,446.06

2. List the counties where the amount (in dollars) of purchase transactions went over $100K. Analyses, Answer 2

The first 5 results, please, click to expand the other 94 results.

County Total Sales (USD)
POLK 2,231,412,892.36
LINN 836,910,248.29
SCOTT 676,234,211.45
JOHNSON 565,370,294.14
BLACK HAWK 534,805,393.40
Click to expand
County Total Sales (USD)
POTTAWATTAMIE 344,816,728.64
WOODBURY 331,924,475.41
DUBUQUE 292,460,322.00
STORY 282,700,828.80
DALLAS 212,462,454.04
CERRO GORDO 197,362,522.19
CLINTON 160,594,434.33
DICKINSON 129,971,091.06
DES MOINES 112,432,035.59
LEE 110,011,286.23
WEBSTER 106,324,858.67
MUSCATINE 101,206,725.64
MARSHALL 99,134,272.76
WAPELLO 88,687,270.34
WARREN 85,707,404.71
CARROLL 75,186,042.64
MARION 69,511,268.57
JASPER 58,216,971.84
BREMER 57,598,483.14
BOONE 56,919,088.90
BUENA VISTA 55,605,695.10
HARDIN 55,075,908.40
WASHINGTON 53,667,863.50
SIOUX 52,269,122.47
PLYMOUTH 50,742,597.68
KOSSUTH 50,668,584.63
CLAY 46,798,509.56
POWESHIEK 44,701,992.93
WINNESHIEK 41,423,876.39
CRAWFORD 40,263,521.47
IOWA 39,845,654.96
JONES 38,421,427.24
O'BRIEN 38,120,603.43
MAHASKA 37,446,433.83
BUCHANAN 37,060,008.02
CASS 37,006,619.89
HENRY 36,977,532.05
PAGE 36,960,459.53

3_a. Identify the top 10 stores with more revenue in sold products Analyses, Answer 3_a

Store Name Total Revenue (USD)
HY-VEE #3 / BDI / DES MOINES 325,980,403.20
CENTRAL CITY 2 289,904,813.05
HY-VEE WINE AND SPIRITS / IOWA CITY 133,628,636.04
SAM'S CLUB 8162 / CEDAR RAPIDS 98,795,093.18
SAM'S CLUB 6344 / WINDSOR HEIGHTS 93,223,874.89
COSTCO WHOLESALE #788 91,098,556.02
BENZ DISTRIBUTING 88,565,113.22
WILKIE LIQUORS 87,044,126.23
LOT-A-SPIRITS 75,775,915.28
SAM'S CLUB 8238 / DAVENPORT 67,925,426.94

3_b. The bottom stores with the least revenue in sold products Analyses, Answer 3_b

Store Name Total Revenue
HONEY CREEK DISTILLERY 235.68
BEST CASE WINES 312.00
3-OAKS DISTILLERY, LLC 388.80
THOME ENTERPRISES LLC 576.00
DIRECT LIQUOR / ALVORD 1,250.18
Y-KNOT 1,373.60
HARTIG DRUG #12 / WAUKON 1,848.42
WILDCAT DISTILLING 2,016.00
CATS EYE DISTILLERY, LLC 2,772.48
JCL STORE 3,015.72

About the Data Pipeline - Model architecture

All tables are possibly found here.

I am using the Medallion Architecture, and because of that I split the layers using bronze, silver, and gold.

Bronze

The bronze layer is the data ingestion and I don't modify the data. There is a staging layer, where we can make a small transformation like switching the typo.

All tables are materialized by view in this stage.

Silver

The schema is above:

image

Important

There are a lot of modes to represent the silver layer, using Star Schema or Snowflake methods. In this case, I am using Star Schema because the main motivation is to organize the data. Please, read my article about Performance of Different Data Modeling Approaches in Modern Storage Architecture

Some characteristic of the silver layer is the Data Catalog can use these tables to incentive a data-driven mindset to the company Every dimension is here making "easy" access.

Gold

The gold layer is the business layer and can be used for both business queries and platform data views(Lightdash, Looker, Metabase).

Please, access the topic "About business questions" above.

Some tables can use another configuration, in this case, we don't have an example but here is another configuration for incremental tables: materialized='incremental', unique_key='order_id', strategy='timestamp', updated_at='order_date'

Lineage graphs

image

Data quality

Looking at the dbt test: Some data was duplicating the results. The dim_vendor was duplicated because there is some vendor_name that is equal. The same thing was happening in dim_store and I understood that we need deleted the null data.

Tips around macros, data validation, and documentation

👉 [Macros] I created a simple "dates macro" to help with monthly, quarterly, semiannually. To check a case, click here. To check an application, click here.

👉 [Data Validation] In all tables, tests are implemented on the primary key and important columns. To check a case, click here.

👉 [Documentation] In all tables, the implemented tables and columns are described, and the schema can be integrated with the Data Catalog. To check a case, click here.

Additional information: Stack

image

About the Data architecture to "real-time" solution

image

The data pipeline for real-time data involves a continuous flow of data that is captured, processed, and delivered instantly. Firstly, data is collected in real-time from various sources such as sensors, applications, or transactional systems. Next, this data is processed in real-time, typically using streaming platforms or event processing. Finally, the processed data is delivered to storage systems or visualization applications for immediate analysis and real-time decision-making. This pipeline enables organizations to gain instant insights and react swiftly to ongoing events.

Real-time data flow is typically costly, hence it is essential to identify which data truly needs to be processed in real-time. Additionally, it needs to be understood if it's feasible to collect data and place it on the same data modeling pipeline as batch data.

In general, my focus would be on the initial part of data ingestion for real-time data collection.

About

Explore wholesale liquor purchases in Iowa since 2012, detailing orders from various retailers. Access store information, liquor brands, sizes, and order quantities. Discuss business questions, data pipeline architecture, and real-time data solutions.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published