## **Data Warehouse Loader Notebook**

This notebook loads data into a SQLite schema. SQLite is used so the data can be stored and queried more efficiently than with pandas DataFrames alone.

#### ***Environment Configuration***

A helper Python file, `setup_project`, handles the database integration setup: `setup_environment()` returns the project paths, including `DB_PATH`.

In [1]:
# Setup project environment
import sys
from pathlib import Path

# Import setup_project
sys.path.append(str(Path.cwd()))
from setup_project import setup_environment

# Run setup
paths = setup_environment()

db_path = paths['DB_PATH']

✅ Environment configured successfully!


#### ***Data Warehouse Connection***
Check whether the data warehouse file exists, create it if it does not, then connect to it.

In [2]:
from sqlalchemy import create_engine, inspect, MetaData

# Create database file if missing
if not db_path.exists():
    print(f"Database not found at {db_path}, creating a new one...")
    db_path.parent.mkdir(parents=True, exist_ok=True)
    db_path.touch()

engine = create_engine(f"sqlite:///{db_path}")
connection = engine.connect()
inspector = inspect(engine)
metadata = MetaData()

print(f"✅ Connected to database at {db_path}")

✅ Connected to database at /home/falatfernando/Desktop/bdq_resistance_study/mtb_resistance_db/mtb_resistance.db
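
The check-and-create step above could also be wrapped in a reusable function. The following is a minimal sketch, not part of the original notebook: the name `connect_warehouse` and the `created` return flag are hypothetical, and it assumes the same `sqlite:///` URL form used in the cell above.

```python
from pathlib import Path

from sqlalchemy import create_engine, inspect


def connect_warehouse(db_path):
    """Return (engine, inspector, created) for the SQLite file at db_path.

    Hypothetical helper: creates the parent directory and an empty database
    file when they are missing, mirroring the notebook cell above.
    """
    db_path = Path(db_path)
    created = not db_path.exists()
    if created:
        db_path.parent.mkdir(parents=True, exist_ok=True)
        db_path.touch()  # an empty file is a valid, empty SQLite database

    engine = create_engine(f"sqlite:///{db_path}")
    return engine, inspect(engine), created
```

With the notebook's `db_path`, this would be called as `engine, inspector, created = connect_warehouse(db_path)`; the `created` flag makes it easy to tell a freshly created warehouse from an existing one.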


#### ***Data Warehouse Quick Overview***

A quick debug step to list the database contents and confirm that the connection works.

In [3]:
# Get all table names
table_names = inspector.get_table_names()
print("Tables in database:", table_names)

# Get schema information for each table
for table_name in table_names:
    columns = inspector.get_columns(table_name)
    print(f"\nColumns in {table_name}:")
    for column in columns:
        print(f"  {column['name']}: {column['type']}")

Tables in database: ['reference_genome']

Columns in reference_genome:
  id: INTEGER
  seq_id: VARCHAR
  source: VARCHAR
  feature: VARCHAR
  start: INTEGER
  end: INTEGER
  score: FLOAT
  strand: VARCHAR
  frame: VARCHAR
  attribute: VARCHAR
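
The overview above lists columns but not how much data each table holds. A small sketch that adds row counts is shown here; the function name `table_row_counts` is hypothetical. Table names are taken from the inspector rather than from user input, which is why they are interpolated directly into the query.

```python
from sqlalchemy import inspect, text


def table_row_counts(engine):
    """Return a dict mapping each table name in the database to its row count."""
    counts = {}
    with engine.connect() as conn:
        for name in inspect(engine).get_table_names():
            # Names come from the database itself, so quoting them is sufficient here
            result = conn.execute(text(f'SELECT COUNT(*) FROM "{name}"'))
            counts[name] = result.scalar_one()
    return counts
```

Calling `table_row_counts(engine)` in the notebook would return one entry per table, which helps distinguish an empty table from a loaded one.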


#### ***Actual Schema Table Materialization*** ⚠️

These cells materialize tables in the connected SQLite data warehouse. Be careful when running them!

In [None]:
from sqlalchemy import Table, Column, Integer, String, Float, MetaData

metadata = MetaData()

# Table name matches the "reference_genome" table that the load and query cells use
reference_genome = Table(
    "reference_genome",
    metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("seq_id", String),
    Column("source", String),
    Column("feature", String),
    Column("start", Integer),
    Column("end", Integer),
    Column("score", Float),
    Column("strand", String),
    Column("frame", String),
    Column("attribute", String),
)

In [None]:
metadata.create_all(engine)
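
Since these cells change the warehouse, it can help to see which tables would actually be created before running them. The sketch below is an illustration only: `create_missing_tables` is a hypothetical helper. It relies on `Inspector.has_table` and on the `tables` argument of `MetaData.create_all`, and it leaves existing tables untouched.

```python
from sqlalchemy import inspect


def create_missing_tables(engine, metadata):
    """Create only the tables in `metadata` that are not yet in the database.

    Returns the names of the tables that were created.
    """
    inspector = inspect(engine)
    missing = [
        table for table in metadata.sorted_tables
        if not inspector.has_table(table.name)
    ]
    if missing:
        metadata.create_all(engine, tables=missing)
    return [table.name for table in missing]
```

Running `create_missing_tables(engine, metadata)` twice in a row should return the new table names the first time and an empty list the second time.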

In [None]:
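import os
import pandas as pd

def parse_gtf(gtf_path):
    """Read a gzip-compressed GTF file into a DataFrame with the nine GTF columns."""
    columns = [
        "seq_id", "source", "feature",
        "start", "end", "score",
        "strand", "frame", "attribute"
    ]
    gtf = pd.read_csv(
        gtf_path,
        sep="\t",
        comment="#",    # skip the header/comment lines
        names=columns,  # GTF files have no header row
        compression="gzip"
    )
    return gtf

# Load the GTF
# NOTE: raw_data_dir is not defined in the cells shown in this notebook; set it before running this cell
gtf_df = parse_gtf(os.path.join(raw_data_dir, "GCF_000195955.2_ASM19595v2_genomic.gtf.gz"))

# Insert into database (if_exists="append" adds rows on every run, so re-running duplicates the data)
gtf_df.to_sql("reference_genome", con=engine, if_exists="append", index=False)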
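
Because `if_exists="append"` adds the rows again on every run, one possible guard is to load only when the target table is missing or empty. This is a sketch under that assumption, not the notebook's own method; `load_if_empty` is a hypothetical name.

```python
import pandas as pd
from sqlalchemy import inspect, text


def load_if_empty(df, table_name, engine):
    """Append df to table_name only if the table is missing or has no rows.

    Returns the number of rows inserted (0 when the load was skipped).
    """
    if inspect(engine).has_table(table_name):
        with engine.connect() as conn:
            existing = conn.execute(
                text(f'SELECT COUNT(*) FROM "{table_name}"')
            ).scalar_one()
        if existing > 0:
            return 0  # table already populated, skip to avoid duplicates

    df.to_sql(table_name, con=engine, if_exists="append", index=False)
    return len(df)
```

In this notebook the call would look like `load_if_empty(gtf_df, "reference_genome", engine)`. The trade-off is that a partially loaded table is also skipped, so this guard suits full reloads better than incremental ones.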


In [None]:
import pandas as pd

# Read back the first 10 rows to verify the load
df = pd.read_sql("SELECT * FROM reference_genome LIMIT 10", engine)
df
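
Beyond previewing ten rows, an aggregate query gives a quicker sanity check of the load. The sketch below counts rows per value of the `feature` column; `feature_counts` is a hypothetical helper, and the default table name is taken from the query above.

```python
import pandas as pd


def feature_counts(engine, table_name="reference_genome"):
    """Return a DataFrame with one row per feature type and its row count."""
    query = (
        f'SELECT feature, COUNT(*) AS n FROM "{table_name}" '
        "GROUP BY feature ORDER BY n DESC"
    )
    return pd.read_sql(query, engine)
```

Calling `feature_counts(engine)` after the load would show how many records each feature type contributes, which makes accidental duplicate loads easier to spot.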