Project-ETL-Retail

This sample project builds a data pipeline that loads files in different formats (csv, json, and xml) into a PostgreSQL database acting as a data warehouse. The 'retail' PostgreSQL database is configured as a DWH for analysis and reporting. The pipeline was built around the following use case:

The Retail.SA company wishes to build a data repository that allows it to learn more about its customers and make decisions to improve its service and operation.

Requirements:

Design a basic architecture for the solution
Apply Extraction, Transformation and Load with Python
Respond to the following statements:

• Statement 1: Show the top 20 customers who bought the most products, with their respective amounts.

• Statement 2: Show the categories with the total number of products sold and the total amounts per category.

• Statement 3: Show the best-selling category by city

• Statement 4: Show the 5 best-selling products for each city and the amount collected

Data Model

Solution

1. DATA ARCHITECTURE - RETAIL

Source:

The first step in designing the data pipeline is to identify where the source data is located, whether in a database, flat files, the web, or elsewhere. In this case, the sources are files in different formats: csv, tsv, json, and xml.

Extract:

Since the pipeline is written in Python, the next step is to choose a library that can read the data into a data frame. This project uses 'pandas'; other libraries such as 'pyspark' could also be used.
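As a rough illustration of the extract step, the sketch below reads each of the listed formats into a pandas data frame. The `extract` helper, the inline sample data, and the column names are hypothetical and are not taken from the project code; the real pipeline would pass file paths instead of in-memory strings.

```python
import io

import pandas as pd

# Hypothetical inline samples standing in for the real source files.
CSV_TEXT = "customer_id,name\n1,Ana\n2,Luis\n"
JSON_TEXT = '[{"product_id": 10, "category": "Toys"}, {"product_id": 11, "category": "Books"}]'
XML_TEXT = """<orders>
  <order><order_id>100</order_id><customer_id>1</customer_id></order>
  <order><order_id>101</order_id><customer_id>2</customer_id></order>
</orders>"""


def extract(source, fmt):
    """Read one source (a path or file-like object) into a DataFrame."""
    if fmt == "csv":
        return pd.read_csv(source)
    if fmt == "tsv":
        return pd.read_csv(source, sep="\t")
    if fmt == "json":
        return pd.read_json(source)
    if fmt == "xml":
        # parser="etree" relies on the standard library, so lxml is not needed.
        return pd.read_xml(source, parser="etree")
    raise ValueError(f"unsupported format: {fmt}")


customers = extract(io.StringIO(CSV_TEXT), "csv")
products = extract(io.StringIO(JSON_TEXT), "json")
orders = extract(io.StringIO(XML_TEXT), "xml")
```

Keeping the format dispatch in one function means a new source format only needs one extra branch.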

Transform:

To develop the transformations and answer the business questions, we use Python classes and 'pandasql'.
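The following sketch shows one way a class can wrap SQL-based transformations over data frames. It is not the project's code: pandasql runs its queries through SQLite, and to keep the example dependency-free it uses the standard-library `sqlite3` module directly instead of pandasql. The `RetailTransformer` class, the `sales` table, and its columns are assumptions made for illustration. The query corresponds to Statement 2 (totals per category).

```python
import sqlite3

import pandas as pd


class RetailTransformer:
    """Registers DataFrames as SQL tables and runs queries against them."""

    def __init__(self):
        # In-memory SQLite database, used here as a stand-in for pandasql.
        self.conn = sqlite3.connect(":memory:")

    def register(self, name, df):
        """Expose a DataFrame as a SQL table called `name`."""
        df.to_sql(name, self.conn, index=False, if_exists="replace")

    def query(self, sql):
        """Run a SQL query and return the result as a DataFrame."""
        return pd.read_sql_query(sql, self.conn)

    def sales_by_category(self):
        """Units sold and total amount per category (hypothetical schema)."""
        return self.query(
            """
            SELECT category,
                   SUM(quantity) AS units_sold,
                   SUM(quantity * unit_price) AS total_amount
            FROM sales
            GROUP BY category
            ORDER BY total_amount DESC
            """
        )


# Hypothetical sample data.
sales = pd.DataFrame(
    {
        "category": ["Toys", "Books", "Toys"],
        "quantity": [2, 1, 3],
        "unit_price": [10.0, 20.0, 10.0],
    }
)

transformer = RetailTransformer()
transformer.register("sales", sales)
by_category = transformer.sales_by_category()
```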

Load/DWH:

To load the data into the data warehouse, we use the SQLAlchemy library, which lets us connect to the PostgreSQL database and store the gold data.
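A minimal sketch of the load step, assuming the gold data is a pandas data frame written with `DataFrame.to_sql` over a SQLAlchemy connection. The `load_table` helper, the table name, and the sample rows are hypothetical. So that the example runs without a database server, it connects to an in-memory SQLite engine; for the 'retail' warehouse the URL would instead follow the PostgreSQL form shown in the comment, with credentials that are placeholders here.

```python
import pandas as pd
from sqlalchemy import create_engine, text


def load_table(df, table_name, connection, if_exists="replace"):
    """Write a DataFrame to a warehouse table and return the row count."""
    df.to_sql(table_name, connection, index=False, if_exists=if_exists)
    return len(df)


# Assumed PostgreSQL URL shape (placeholder credentials):
#   postgresql+psycopg2://user:password@localhost:5432/retail
# SQLite in memory is used here only so the sketch is self-contained.
engine = create_engine("sqlite:///:memory:")

# Hypothetical gold data.
gold = pd.DataFrame({"category": ["Toys", "Books"], "total_amount": [50.0, 20.0]})

# engine.begin() opens a transaction that commits when the block exits.
with engine.begin() as conn:
    rows_written = load_table(gold, "sales_by_category", conn)
    loaded = pd.read_sql_query(text("SELECT * FROM sales_by_category"), conn)
```

Passing `if_exists="append"` instead would add rows to an existing table rather than recreate it.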

Note:

Since this is a simple data pipeline, other features such as data orchestration and data governance are not considered.

Finally, the data architecture would be the following:

2. ETL

To see the solution, check the "ProyectoETL" folder.

3. ANSWER

STATEMENT 1

Show the top 20 customers who bought the most products, with their respective amounts.
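One possible way to compute this with pandas is sketched below. The `sales` frame and its columns (`customer`, `quantity`, `amount`) are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd


def top_customers(sales, n=20):
    """Rank customers by products bought, keeping their total amounts."""
    totals = sales.groupby("customer", as_index=False).agg(
        products_bought=("quantity", "sum"),
        total_amount=("amount", "sum"),
    )
    ranked = totals.sort_values(
        ["products_bought", "total_amount"], ascending=False
    )
    return ranked.head(n).reset_index(drop=True)


# Hypothetical sample data.
sales = pd.DataFrame(
    {
        "customer": ["Ana", "Luis", "Ana", "Eva"],
        "quantity": [2, 5, 4, 1],
        "amount": [20.0, 50.0, 40.0, 10.0],
    }
)

top2 = top_customers(sales, n=2)
```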

STATEMENT 2

Show the categories with the total number of products sold and the total amounts per category.

STATEMENT 3

Show the best-selling category by city
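A hedged pandas sketch of this query follows. The columns `city`, `category`, and `quantity` are assumed names, and "best-selling" is interpreted here as the highest number of units sold; if two categories tie within a city, this version keeps whichever appears first.

```python
import pandas as pd


def best_category_by_city(sales):
    """Return, for each city, the category with the most units sold."""
    totals = sales.groupby(["city", "category"], as_index=False)["quantity"].sum()
    # idxmax gives the row label of the top category within each city.
    best_rows = totals.groupby("city")["quantity"].idxmax()
    return totals.loc[best_rows].reset_index(drop=True)


# Hypothetical sample data.
sales = pd.DataFrame(
    {
        "city": ["Lima", "Lima", "Cusco", "Cusco", "Lima"],
        "category": ["Toys", "Books", "Toys", "Books", "Toys"],
        "quantity": [3, 5, 4, 1, 1],
    }
)

best = best_category_by_city(sales)
```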

STATEMENT 4

Show the 5 best-selling products for each city and the amount collected
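The top-N-per-group pattern behind this statement can be sketched in pandas as follows. The column names (`city`, `product`, `quantity`, `amount`) are hypothetical, and products are ranked by units sold.

```python
import pandas as pd


def top_products_by_city(sales, n=5):
    """Return the n best-selling products per city with the amount collected."""
    totals = sales.groupby(["city", "product"], as_index=False).agg(
        units_sold=("quantity", "sum"),
        amount=("amount", "sum"),
    )
    # Sort so the best sellers come first within each city, then keep n per city.
    ranked = totals.sort_values(["city", "units_sold"], ascending=[True, False])
    return ranked.groupby("city").head(n).reset_index(drop=True)


# Hypothetical sample data.
sales = pd.DataFrame(
    {
        "city": ["Lima", "Lima", "Lima", "Cusco", "Cusco"],
        "product": ["A", "B", "C", "A", "C"],
        "quantity": [3, 5, 1, 2, 4],
        "amount": [30.0, 25.0, 9.0, 20.0, 36.0],
    }
)

top2_per_city = top_products_by_city(sales, n=2)
```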

eladioyovera/Project_002__ETL_with_Python
