<img src="./images/logo.png" alt="Drawing" style="width: 500px;"/>

# **Exercise 1:** Deploying and Initializing PostgreSQL on Kubernetes

In this exercise, we walk through deploying a PostgreSQL database in a Kubernetes environment using Python. The process runs `kubectl` commands to manage persistent storage (Persistent Volume Claims), deployments, and services, and then initializes the database with sample data.

You will:
- Automate the deployment of PostgreSQL on Kubernetes.
- Handle resource management like Persistent Volume Claims (PVC) and deployments.
- Set up a PostgreSQL database, create tables, and load sample data.
- Implement retry logic for connecting to PostgreSQL and ensure a robust setup.

**Steps Overview:**
- Step 1: Define functions to interact with Kubernetes resources like PVCs, deployments, and services.
- Step 2: Set up PostgreSQL on Kubernetes with proper permission handling.
- Step 3: Write code to initialize the database by creating tables and loading sample data.
- Step 4: Use Python's psycopg2 library to connect and interact with the PostgreSQL database.
- Step 5: Generate and load realistic sample data into PostgreSQL for use in subsequent analysis.

### **Prerequisites:**

As instructed in the [Introductory notebook](./00.introduction.ipynb), ensure that you have run `pip install -r requirements.txt` in a Terminal window, located in the same working directory, prior to running this notebook. 

In [None]:
!pip install llama_index llama_index.llms.nvidia llama_index.embeddings.nvidia matplotlib polars

## **1. Prepare the environment**

<div class="alert alert-block alert-danger">
    <b>Important:</b> Set your <b>Username</b> here!
</div>

In [5]:
USERNAME="vince"

We start by defining a function to read the Kubernetes namespace from the service account mount point, allowing us to deploy resources in the correct namespace. If not running in a Kubernetes environment, it defaults to the "default" namespace.

In [6]:
def get_namespace_from_service_account():
    """
    Reads the Kubernetes namespace from the service account mount point.
    Returns 'default' if not running in a Kubernetes pod or if the file doesn't exist.
    """
    namespace_file = '/var/run/secrets/kubernetes.io/serviceaccount/namespace'
    try:
        with open(namespace_file, 'r') as f:
            return f.read().strip()
    except IOError:
        return 'default'

Set the global variables required to run the exercise smoothly.

In [8]:
# Global configuration
NAMESPACE = get_namespace_from_service_account()
POSTGRES_PASSWORD = "postgres"
PG_SERVICE_NAME = f"{USERNAME}-retailers-postgres"
PG_DATABASE_NAME = f"{USERNAME}_retailers"

# Print the result
print("NAMESPACE:", NAMESPACE)
print("POSTGRES_PASSWORD:", POSTGRES_PASSWORD)
print("PG_SERVICE_NAME:", PG_SERVICE_NAME)
print("PG_DATABASE_NAME:", PG_DATABASE_NAME)

NAMESPACE: admin-901d042c
POSTGRES_PASSWORD: postgres
PG_SERVICE_NAME: vince-retailers-postgres
PG_DATABASE_NAME: vince_retailers


## **2. PostgreSQL Deployment Logic:**

We deploy PostgreSQL on Kubernetes by using Python's `subprocess` to execute `kubectl` commands. <br>
This ensures the deployment is automated, and the correct Persistent Volume Claim (PVC) and deployment configuration are applied.

In [3]:
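import subprocess
import time
from datetime import datetime

# resource_exists, create_pvc, create_postgres_deployment, create_postgres_service
# and check_pod_status are helper functions that must be defined before calling this.


def deploy_postgresql():
    """Deploy PostgreSQL with proper permission handling and PVC management"""
    # The timestamp suffix gives each run a new PVC name, so the "existing PVC"
    # branch below is only taken if a PVC with this exact name already exists.
    pvc_name = f"postgres-pvc-{datetime.now().strftime('%Y%m%d%H%M%S')}"

    # Step 1: Delete any existing deployment to start fresh (if we have permissions)
    try:
        subprocess.run(
            ["kubectl", "delete", "deployment", "-n", NAMESPACE,
             PG_SERVICE_NAME, "--ignore-not-found"],
            check=True,
        )
    except subprocess.CalledProcessError as e:
        print(f"Warning: Could not delete existing deployment (may not have permissions): {e}")

    # Step 2: Create PVC only if it doesn't exist
    if not resource_exists("pvc", pvc_name, NAMESPACE):
        # Create PVC dynamically
        create_pvc(pvc_name)
    else:
        print(f"Using existing PVC: {pvc_name}")

    # Step 3: Create PostgreSQL deployment
    create_postgres_deployment(pvc_name)

    # Step 4: Create the Service that exposes PostgreSQL inside the cluster
    create_postgres_service()

    # Give the pod a fixed 30 seconds to start before checking it
    print("Waiting for PostgreSQL to initialize...")
    time.sleep(30)

    # Step 5: Check Pod status
    check_pod_status()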
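
The deployment function relies on helpers such as `resource_exists` and `create_pvc`. As an illustration only, here is a minimal sketch of how two such helpers could be written around `kubectl get` and `kubectl apply -f -`. The names `build_pvc_manifest` and `apply_pvc`, the `run` parameter, the `1Gi` default size, and the `ReadWriteOnce` access mode are assumptions made for this sketch, not necessarily what the workshop helpers use.

```python
import json
import subprocess


def resource_exists(kind, name, namespace, run=subprocess.run):
    """Return True if `kubectl get` finds the named resource.

    `run` is an assumption of this sketch: it lets you inject a fake
    runner so the function can be exercised without a cluster.
    """
    result = run(
        ["kubectl", "get", kind, name, "-n", namespace],
        capture_output=True,
        text=True,
    )
    # kubectl exits with a non-zero code when the resource is not found
    return result.returncode == 0


def build_pvc_manifest(name, namespace, size="1Gi"):
    """Build a PersistentVolumeClaim manifest as a plain dict (hypothetical defaults)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def apply_pvc(name, namespace, size="1Gi", run=subprocess.run):
    """Send the manifest to `kubectl apply -f -` as JSON on stdin."""
    manifest = json.dumps(build_pvc_manifest(name, namespace, size))
    run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```

Passing the command as a list instead of a shell string avoids quoting problems if a name ever contains unexpected characters.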

## **4. Database Connection and Retry Logic:**

To interact with the PostgreSQL database, we use `psycopg2` to connect to the database. <br>
We implement retry logic in case the database is not yet available after the deployment.

In [None]:
import time

import psycopg2


def get_db_connection(retries=3, delay=5):
    """Get connection with retries"""
    for attempt in range(retries):
        try:
            conn = psycopg2.connect(
                # In-cluster DNS name of the PostgreSQL service
                host=f"{PG_SERVICE_NAME}.{NAMESPACE}.svc.cluster.local",
                database=PG_DATABASE_NAME,
                user="postgres",
                password=POSTGRES_PASSWORD,
                port=5432,
                connect_timeout=5
            )
            return conn
        except psycopg2.OperationalError as e:
            # Re-raise on the last attempt so the caller sees the real error
            if attempt == retries - 1:
                raise
            print(f"Connection failed (attempt {attempt + 1}/{retries}): {e}; retrying in {delay}s...")
            time.sleep(delay)
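
The same retry pattern can be separated from `psycopg2` so that it is easy to test without a running database. The sketch below is one possible way to do that; the helper name `retry_call` and its `exceptions` and `sleep` parameters are hypothetical and not part of the notebook.

```python
import time


def retry_call(operation, retries=3, delay=5, exceptions=(Exception,), sleep=time.sleep):
    """Call `operation()` up to `retries` times, sleeping `delay` seconds between failures.

    Only the exception types listed in `exceptions` trigger a retry.
    The last failure is re-raised so the caller sees the original error.
    """
    for attempt in range(retries):
        try:
            return operation()
        except exceptions:
            if attempt == retries - 1:
                raise
            sleep(delay)
```

With `psycopg2`, this could be used as `retry_call(lambda: psycopg2.connect(...), exceptions=(psycopg2.OperationalError,))`, which mirrors the behaviour of `get_db_connection`.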

## **5. Generate Sample Data:**

The `generate_sample_data` function simulates realistic product, customer, stock, and order data with controlled imperfections. <br>
This data can be used for testing and analysis within the PostgreSQL database.

In [None]:
def generate_sample_data():
    """Generate realistic sample data with controlled imperfections"""
    # Random data generation for products, customers, orders, and stock entries
    # Includes deliberate imperfections (e.g., missing emails, invalid product categories)
    # NOTE: the generation logic is omitted in this cell. The empty lists below are
    # placeholders so the function runs and shows the shape of the returned data.
    products, customers, stock, orders, order_products = [], [], [], [], []

    # The returned data is structured in a way that makes it easy to insert into the database
    return {
        "products": products,
        "customers": customers,
        "stock": stock,
        "orders": orders,
        "order_products": order_products
    }
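
To make "controlled imperfections" concrete, here is a small sketch that generates customer records where a configurable fraction has no email address. The function name `generate_customers`, the field names, and the 10% default rate are assumptions for illustration and are not taken from the notebook's actual generator.

```python
import random


def generate_customers(n=100, missing_email_rate=0.1, seed=42):
    """Generate `n` customer dicts; roughly `missing_email_rate` of them lack an email.

    A seeded random.Random instance keeps the output reproducible between runs.
    """
    rng = random.Random(seed)
    customers = []
    for i in range(1, n + 1):
        # Deliberate imperfection: drop the email for a fraction of the customers
        if rng.random() < missing_email_rate:
            email = None
        else:
            email = f"customer{i}@example.com"
        customers.append({"customer_id": i, "name": f"Customer {i}", "email": email})
    return customers
```

Fixing the seed means the same imperfect rows appear on every run, which makes downstream data-cleaning steps easier to check.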

# **Conclusion**

By following these steps, you'll deploy a PostgreSQL database to a Kubernetes cluster, initialize it with tables, and load sample data for analysis. 

This exercise helps you become familiar with automating Kubernetes deployments using Python, handling PostgreSQL databases, and generating and loading data for testing purposes.

In the next exercise, you will learn how to use Spark on HPE Private Cloud AI to prepare these datasets for visualization and modelling. 

