# Data Ingestion Notebook

This notebook demonstrates the data ingestion step of the case study selection process. It covers the following steps:

- Loading configuration files
- Reading data files into pandas DataFrames
- Connecting to a MySQL database
- Uploading data to the database

Each step is explained in detail to ensure clarity and reproducibility.

## 1. Import Required Libraries

We start by importing all necessary libraries for data manipulation, environment variable management, and database connection.

In [None]:
import json
import os
import pandas as pd
from dotenv import load_dotenv # Used to securely load environment variables from a .env file.
from sqlalchemy import create_engine # Provides tools for connecting to and interacting with SQL databases.
from urllib.parse import quote_plus # Ensures that special characters in the database password are safely encoded for use in the connection string.

## 2. Load Ingestion Configuration

The ingestion configuration is stored in a JSON file. Each entry names a target table (`table`) and the path of the file to load into it (`path`).

In [None]:
with open("../config/ingestion.json", "r") as open_json:
    ingestions = json.load(open_json)
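
The loop in the next section reads `item["table"]` and `item["path"]`, so the configuration is presumably a list of objects with those two keys. The following is a minimal sketch of that shape with a basic validation step. The table names, the paths, and the `validate_ingestions` helper are hypothetical and are not taken from the actual `ingestion.json`.

```python
import json

# Hypothetical example content; the real ingestion.json may list different tables and paths.
example_config = """
[
    {"table": "user_table", "path": "../data/user_table.csv"},
    {"table": "orders_table", "path": "../data/orders_table.csv"}
]
"""


def validate_ingestions(entries):
    """Return the entries unchanged, raising ValueError if one lacks a required key."""
    required = {"table", "path"}
    for index, entry in enumerate(entries):
        missing = required - set(entry)
        if missing:
            raise ValueError(f"Entry {index} is missing keys: {sorted(missing)}")
    return entries


example_ingestions = validate_ingestions(json.loads(example_config))
print([entry["table"] for entry in example_ingestions])
```

Validating the configuration up front turns a malformed entry into a clear error before any file is read, rather than a `KeyError` partway through the loop.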


## 3. Read Data Files

For each table specified in the configuration, we read the corresponding CSV file into a pandas DataFrame. All DataFrames are stored in a dictionary for easy access.

In [None]:
dfs = {}

for item in ingestions:
    table = item["table"]
    path = item["path"]

    try:
        df = pd.read_csv(path, encoding="utf-8", sep=",")
        dfs[table] = df
        print(f"Table {table} read.")
    except Exception as e:
        # Report the cause so a failed read can be diagnosed; the table is skipped.
        print(f"Error reading table {table}: {e}")

## 4. Data Preview

(Optional) Preview any of the loaded DataFrames by uncommenting and editing the following line to select a table. If it is left commented out, `df` still refers to the last table successfully read in the previous step.

In [None]:
#df = dfs["user_table"]

In [None]:
df.head()

## 5. Load Database Credentials

We use environment variables to securely load the MySQL database password. The password is URL-encoded to ensure compatibility with the connection string.

In [None]:
load_dotenv()
# The password should be stored in the .env file; os.environ raises KeyError if DB_PASSWORD is missing.
password = quote_plus(os.environ["DB_PASSWORD"])
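
To illustrate why the encoding matters: characters such as `@`, `/`, and `#` have special meaning in a URL, so a raw password containing them would break the connection string. Below is a small sketch using a made-up password, not a real credential.

```python
from urllib.parse import quote_plus, unquote_plus

# Made-up password for illustration only.
raw_password = "p@ss/word#1"
encoded_password = quote_plus(raw_password)

print(encoded_password)  # p%40ss%2Fword%231

# Decoding restores the original value, so nothing is lost by encoding.
assert unquote_plus(encoded_password) == raw_password
```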

## 6. Create SQLAlchemy Engine

An SQLAlchemy engine manages the connection to the MySQL database and is used to upload the DataFrames. The `mysql+pymysql` scheme in the connection string requires the PyMySQL driver to be installed.

In [None]:
engine = create_engine(f"mysql+pymysql://root:{password}@localhost/case_clara")

## 7. Upload Data to Database

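Each DataFrame is uploaded to its corresponding table in the MySQL database. With `if_exists="append"`, rows are added to an existing table (or the table is created if it does not exist) without overwriting existing records. Note that rerunning the upload inserts the same rows again unless a database constraint rejects them.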

You can also upload a single DataFrame by uncommenting and modifying the following line.

In [None]:
#df.to_sql("user_table", con=engine, if_exists="append", index=False)

In [None]:
for table_name, df in dfs.items():
    df.to_sql(table_name, con=engine, if_exists="append", index=False)
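
Because `if_exists="append"` adds rows on every call, running the upload cell twice inserts the data twice. The sketch below shows this with an in-memory SQLite database standing in for MySQL, purely so that it runs without a server. The `demo_table` name and its rows are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL database (an assumption for this demo).
conn = sqlite3.connect(":memory:")
demo_df = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Bruno"]})

# The first call creates the table and inserts two rows; the second appends two more.
demo_df.to_sql("demo_table", con=conn, if_exists="append", index=False)
demo_df.to_sql("demo_table", con=conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM demo_table").fetchone()[0]
print(row_count)  # 4
conn.close()
```

If reruns are expected, one option is to empty the target tables beforehand, or to use `if_exists="replace"`, which drops and recreates the table before inserting.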