# Data Ingestion for Reporting Optimization

...or how to convert files into tables of a database in just __5 steps!!!__

![Image](./images/data_ingestion_banner.jpg)

---

## Step 1: Import the libraries (only 3 libraries!!!)

__[DuckDB](https://duckdb.org/)__ is a fast in-process analytical database. DuckDB supports a feature-rich SQL dialect complemented with deep integrations into client APIs. It provides a full-featured SQL engine designed for analytics and __OLAP (Online Analytical Processing)__ tasks. It can run SQL queries on large datasets efficiently.

__Tip:__ _Don't forget to install DuckDB._

In [None]:
# import libraries

import duckdb     # pip install duckdb

import pandas as pd
import os

---

## Step 2: List your `.csv` data files and create the relative paths list

Using the `os` library, we can list the filenames we want to load as tables into our database.

__Tip:__ _Remember to be clean and neat when naming the `.csv` files._

In [None]:
# List .csv filenames

dir_list = os.listdir('./datasets/modeling/')
dir_list

In [None]:
# Create relative paths list (i.e.: tables)

tables = [f'./datasets/modeling/{file}' for file in dir_list]
tables
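
`os.listdir` returns every entry in the folder, so a stray non-CSV file or subfolder would end up in `tables`. A minimal sketch of a filtered listing follows. `list_csv_paths` is a hypothetical helper name, not part of the original notebook.

```python
import os


def list_csv_paths(directory):
    """Return the sorted paths of the .csv files directly inside `directory`."""
    paths = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Keep regular files only, and match the extension case-insensitively.
        if os.path.isfile(path) and name.lower().endswith(".csv"):
            paths.append(path)
    return sorted(paths)
```

Calling `list_csv_paths('./datasets/modeling/')` should give the same kind of list as `tables` above, with the `./datasets/modeling/` prefix preserved, so the rest of the notebook keeps working.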

---

## Step 3: Create the DDL queries

In SQL we have the __Data Definition Language (DDL)__ to define, alter, or manage the structure of database objects like tables, indexes, views, and schemas.

__Tip:__ _The code is set to work with a specific path length: `table[20:-4]` drops the 20-character prefix `./datasets/modeling/` and the 4-character `.csv` extension, leaving the filename as the table name. Bear in mind that you might need to adjust it for your use case._

In [None]:
# Create queries

queries = [f"CREATE OR REPLACE TABLE {table[20:-4]} AS SELECT * FROM '{table}';" for table in tables]
queries
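
The fixed slice only works while the folder prefix is exactly 20 characters long. As an alternative sketch, the table name can be derived from the filename itself. `table_name_from_path` and `build_query` are hypothetical helpers, and the sanitizing rule (replace anything other than ASCII letters, digits, and underscores) is an assumption about what makes a convenient table name.

```python
import os
import re


def table_name_from_path(path):
    """Derive a table name from a file path, whatever the folder depth."""
    # './datasets/modeling/Sales_short.csv' -> 'Sales_short'
    stem = os.path.splitext(os.path.basename(path))[0]
    # Assumption: replace characters that are awkward in unquoted SQL names.
    name = re.sub(r"[^0-9A-Za-z_]", "_", stem)
    # Unquoted SQL identifiers cannot start with a digit.
    if not name or name[0].isdigit():
        name = f"t_{name}"
    return name


def build_query(path):
    """Build the same CREATE OR REPLACE TABLE statement used in this step."""
    escaped_path = path.replace("'", "''")  # escape single quotes in the path
    return (
        f"CREATE OR REPLACE TABLE {table_name_from_path(path)} "
        f"AS SELECT * FROM '{escaped_path}';"
    )
```

With these helpers, `queries = [build_query(table) for table in tables]` would replace the list comprehension above. Filenames that collide with SQL keywords (for example `order.csv`) would still need renaming or quoting.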

---

## Step 4: Load your tables into your Database

__DuckDB__ is embedded into applications, meaning it doesn't require a separate server to run. It operates within the same process as the application using it. It can work with in-memory data for fast processing, but it also supports persistent storage for saving data to disk. In our case, we're going to persist the data in a `.db` file.

__Tip:__ _Be careful where you store your `.db` files; don't lose them._


In [None]:
# Create database connection and .db file

con = duckdb.connect('./datasets/sales_database.db')

In [None]:
# Create tables in database and load data from .csv files

for query in queries:
    con.sql(query)

---

## Step 5: Check your tables

The __DESCRIBE__ SQL command retrieves the structure of a specific table. It returns details like column names, data types, and basic constraints such as nullability. In some SQL dialects (MySQL, for example) `EXPLAIN` is a synonym for `DESCRIBE`; in DuckDB, however, `EXPLAIN` shows the query plan instead.

__Tip:__ _Remember to check every table of your database. Double checking is always a good practice... and don't forget to close the connection._


In [None]:
# Check the result for a specific table.

con.sql("DESCRIBE Sales_short;")

In [None]:
# Close the connection explicitly

con.close()

---

# Now we're ready to rumble!!!

![Image](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdGcwbHlpMmVvaTdrYmdsYmhqYzZ3YzNqd3o5bGxsNDgzY3I3MW43cCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ZJPSFNLmADueHvzoZ8/giphy.gif)