# Data Ingestion for Reporting Optimization

...or how to convert files into tables of a database in just __4 steps!!!__

![Image](./images/data_ingestion_banner.jpg)

---

## Step 1: Import the libraries (only 3 libraries!!!)

__[DuckDB](https://duckdb.org/)__ is a fast in-process analytical database. DuckDB supports a feature-rich SQL dialect complemented with deep integrations into client APIs. It provides a full-featured SQL engine designed for analytics and __OLAP (Online Analytical Processing)__ tasks. It can run SQL queries on large datasets efficiently.

__Tip:__ _Don't forget to install DuckDB._

In [1]:
# Import libraries
# Uncomment the next line to install DuckDB from inside the notebook
#%pip install duckdb

import os

import duckdb
import pandas as pd

---

## Step 2: List your `.csv` data files and create the relative paths list

Using the `os` library we can list the filenames we want to load as tables into our database.

__Tip:__ _Remember to be clean and neat when naming the `.csv` files._

In [2]:
# List .csv filenames

dir_list = os.listdir('data/')
dir_list

['agents.csv',
 'bookings.csv',
 'customers.csv',
 'destinations.csv',
 'flights.csv',
 'hotels.csv',
 'Locations.csv',
 'payments.csv',
 'promotions.csv',
 'reviews.csv',
 'tickets.csv']

In [7]:
# Create relative paths list (i.e.: tables)

tables = [f'./data/{file}' for file in dir_list]
tables

['./data/agents.csv',
 './data/bookings.csv',
 './data/customers.csv',
 './data/destinations.csv',
 './data/flights.csv',
 './data/hotels.csv',
 './data/Locations.csv',
 './data/payments.csv',
 './data/promotions.csv',
 './data/reviews.csv',
 './data/tickets.csv']
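
`os.listdir` returns everything in the folder, so a stray file (a README, a hidden system file) would end up in the `tables` list too. Below is a minimal sketch of one way to keep only `.csv` files, using `pathlib` from the standard library. The helper name `list_csv_paths` and the throwaway demo folder are assumptions made for illustration; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_csv_paths(data_dir):
    """Return the sorted paths of the .csv files directly inside data_dir."""
    return sorted(
        path for path in Path(data_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".csv"
    )


# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("agents.csv", "notes.txt", "bookings.csv"):
        (Path(tmp) / name).write_text("id\n1\n")
    demo_names = [path.name for path in list_csv_paths(tmp)]

print(demo_names)  # notes.txt is skipped
```

In the notebook itself, something like `list_csv_paths('data')` could stand in for the `os.listdir('data/')` call, with the resulting paths converted to strings before building the queries.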

---

## Step 3: Create the DDL queries

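SQL's __Data Definition Language (DDL)__ is the set of statements used to define, alter, and manage the structure of database objects such as tables, indexes, views, and schemas. Here, each `CREATE OR REPLACE TABLE ... AS SELECT` statement creates one table from one `.csv` file.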

__Tip:__ _The code is set to work with a specific path length (i.e.: `table[7:-4]` strips the `./data/` prefix and the `.csv` extension). Bear in mind that you might need to adjust it for your use case._
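
If a hard-coded slice feels fragile, the table name could instead be derived from the filename itself, so the query keeps working when the folder changes. The sketch below is one possible approach, not the notebook's method: `table_name_from_path` and `build_query` are hypothetical helpers, and the rule of replacing non-alphanumeric characters with underscores is an assumption about what counts as a "clean" table name.

```python
import re
from pathlib import Path


def table_name_from_path(csv_path):
    """Derive a SQL-friendly table name from a file path (hypothetical helper)."""
    stem = Path(csv_path).stem  # filename without folder or extension
    name = re.sub(r"\W+", "_", stem).strip("_")
    # Unquoted SQL identifiers cannot start with a digit, so add a prefix.
    if not name or name[0].isdigit():
        name = f"t_{name}"
    return name


def build_query(csv_path):
    """Build the same kind of DDL statement used in this step."""
    table = table_name_from_path(csv_path)
    return f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM '{csv_path}';"


print(build_query("./data/agents.csv"))
print(table_name_from_path("./exports/2024 sales-report.csv"))
```

For the paths listed in Step 2 this yields the same table names as the slice, while also coping with a different folder depth.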

In [10]:
# Create queries

queries = [f"CREATE OR REPLACE TABLE {table[7:-4]} AS SELECT * FROM '{table}';" for table in tables]
queries

["CREATE OR REPLACE TABLE agents AS SELECT * FROM './data/agents.csv';",
 "CREATE OR REPLACE TABLE bookings AS SELECT * FROM './data/bookings.csv';",
 "CREATE OR REPLACE TABLE customers AS SELECT * FROM './data/customers.csv';",
 "CREATE OR REPLACE TABLE destinations AS SELECT * FROM './data/destinations.csv';",
 "CREATE OR REPLACE TABLE flights AS SELECT * FROM './data/flights.csv';",
 "CREATE OR REPLACE TABLE hotels AS SELECT * FROM './data/hotels.csv';",
 "CREATE OR REPLACE TABLE Locations AS SELECT * FROM './data/Locations.csv';",
 "CREATE OR REPLACE TABLE payments AS SELECT * FROM './data/payments.csv';",
 "CREATE OR REPLACE TABLE promotions AS SELECT * FROM './data/promotions.csv';",
 "CREATE OR REPLACE TABLE reviews AS SELECT * FROM './data/reviews.csv';",
 "CREATE OR REPLACE TABLE tickets AS SELECT * FROM './data/tickets.csv';"]

---

## Step 4: Load your tables into your Database

__DuckDB__ is embedded into applications, meaning it doesn't require a separate server to run. It operates within the same process as the application using it. It can work with in-memory data for fast processing, but it also supports persistent storage for saving data to disk. In our case we're going to use the option of persisting the data in `.db` files.

__Tip:__ _Be careful where you store your `.db` files; don't lose them._


In [11]:
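# Create database connection and .db file
# (the dataset/ folder must exist before DuckDB can create the file in it)

os.makedirs('dataset', exist_ok=True)
con = duckdb.connect('dataset/database.db')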

In [13]:
# Create tables in database and load data from .csv files

for query in queries:
    #print(query)
    con.sql(query)

In [14]:
# Close the connection explicitly

con.close()

---

# Now we're ready to rumble!!!

![Image](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdGcwbHlpMmVvaTdrYmdsYmhqYzZ3YzNqd3o5bGxsNDgzY3I3MW43cCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ZJPSFNLmADueHvzoZ8/giphy.gif)