# Databricks Clusters
- In Databricks, clusters are the compute engines that run your code. Databricks supports several types of clusters, each optimized for specific use cases.

## 🧱 Databricks Cluster Types and Instance Pool (Table Format)

| **Cluster Type** | **Key Characteristics and Use Cases** | **Description** |
|------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Interactive Cluster (All-Purpose)** | - **Best For:** Notebooks, development, exploration, ML experiments  <br> - **Lifespan:** Manual start/stop  <br> - **UI Access:** Workspace → Compute | Provisioned compute used to analyze data in notebooks. You can create, terminate, and restart this compute using the UI, CLI, or REST API. Mainly used for development. |
| **Job Cluster** | - **Usage:** Scheduled ETL or batch jobs and workflows  <br> - **Lifespan:** Auto-created at job start, terminated after job completion  <br> - **Provisioned via:** Jobs API, UI | Automatically created by Databricks when a job starts and automatically terminated when the job finishes. Mainly used for production jobs. |
| **SQL Warehouse** (formerly SQL Endpoint) | - **Usage:** BI dashboards, SQL editor queries  <br> - **Photon-powered:** Yes (by default)  <br> - Auto-scaling and on-demand start | Used to run SQL queries in dashboards or interactive notebooks. |
| **Shared Cluster**                 | - Manually configured to allow multiple users or jobs to share a single cluster                                                                                                 | Option available under All-Purpose cluster to allow multiple users to share the same cluster.                                                                                      |
| **Single Node Cluster**           | - **Usage:** Non-distributed jobs, model training, file processing  <br> - **Configuration:** Enable "Single node" in cluster settings      | Created under Unrestricted policy. Suitable for lightweight and simple jobs.       |
| **Delta Live Tables Cluster**      | - **Usage:** DLT pipelines (ETL workflows / managed pipelines)  <br> - **Managed Automatically:** Databricks provisions and scales cluster                                              | Managed via **Workflows → Delta Live Tables**. Cluster mode options include: <br> - Default (Managed) <br> - Photon <br> - Single-node.         |
| **Photon Cluster**                 | - **Usage:** Accelerated SQL performance  <br> - **Compatibility:** Spark SQL, Databricks SQL  <br> - **Hardware:** Optimized for x86 vectorized execution                                 | High-performance engine (Photon) used for speeding up SQL query execution. Enabled by default in SQL Warehouses and configurable in clusters.                                     |
| **Serverless Cluster**            | - **Usage:** Auto-scaling and zero management  <br> - **Benefit:** Pay only for execution time  <br> - **Note:** Availability depends on region/plan     | Databricks provides one low-config serverless cluster per workspace for quick Python or SQL tasks. No manual setup required.    |
| **Instance Pool**                 | - **Purpose:** Speed up cluster startup, improve efficiency  <br> - **Use with:** All-purpose and job clusters  <br> - **Benefit:** Lower cost, faster jobs, better control                | Pre-creates and manages VMs that can be reused by multiple clusters to reduce startup time and optimize cost.                                                                     |
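
When clusters are created through the REST API rather than the UI, several of the types in the table come down to which fields are set in the cluster specification. The sketch below builds such a specification as a plain dictionary. It assumes the Clusters API field names `spark_version`, `node_type_id`, `num_workers`, and `instance_pool_id`, plus the commonly documented single-node settings. The helper `build_cluster_spec` and all values (runtime version, node type, pool ID) are hypothetical placeholders, not part of these notes.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id=None, num_workers=2,
                       instance_pool_id=None, single_node=False):
    """Build a cluster specification dict (hypothetical helper, not a Databricks API)."""
    spec = {"cluster_name": name, "spark_version": spark_version}

    if instance_pool_id:
        # A pool-backed cluster takes its VMs (and node type) from the pool.
        spec["instance_pool_id"] = instance_pool_id
    else:
        spec["node_type_id"] = node_type_id

    if single_node:
        # Single node: no workers, Spark runs in local mode on the driver.
        spec["num_workers"] = 0
        spec["spark_conf"] = {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        }
        spec["custom_tags"] = {"ResourceClass": "SingleNode"}
    else:
        spec["num_workers"] = num_workers
    return spec


# Placeholder values; replace them with ones valid for your workspace and cloud.
pooled = build_cluster_spec("etl-job", "15.4.x-scala2.12",
                            instance_pool_id="pool-1234", num_workers=4)
single = build_cluster_spec("light-task", "15.4.x-scala2.12",
                            node_type_id="i3.xlarge", single_node=True)
print(json.dumps(single, indent=2))
```

The resulting dictionary could be used as the `new_cluster` block of a job definition or as the body of a cluster creation request; check the exact field set against the API reference for your cloud before relying on it.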


# ⚙️ Init Scripts in Databricks
- An init script (**initialization script**) is a **shell script** that runs during startup of each cluster node before the Apache Spark driver or executor JVM starts.

## 🔧 Common Uses:
- Install custom libraries or dependencies
- Mount external volumes or secrets
- Set environment variables
- Modify Spark or JVM configurations
- Download system tools or install Python packages


## 🔧 Types of Init Scripts
- **Cluster-scoped:** run on every cluster configured with the script. This is the recommended way to run an init script.
- **Global:** run on all clusters in the workspace configured with dedicated access mode or no-isolation shared access mode. These init scripts can cause unexpected issues, such as library conflicts. Only workspace admin users can create global init scripts.


## 🛠️ How to Create and Use an Init Script

### Example: Install Python Package

```bash
#!/bin/bash
# Install a Python package into the cluster's Python environment on each node.
/databricks/python/bin/pip install pandas-profiling
```

### Upload to DBFS, a Volume, or an External Location

```python
# Write the init script to DBFS; the final argument (True) overwrites an existing file.
dbutils.fs.put("dbfs:/databricks/init/install_profiler.sh", "<script content>", True)
```

### Attach to Cluster

1. Go to **Cluster > Advanced Options > Init Scripts**
2. Add path: `dbfs:/databricks/init/install_profiler.sh`
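
The same attachment can be expressed in a cluster specification instead of the UI. The sketch below assumes the `init_scripts` field of the Clusters API, where each entry names a storage type (`dbfs`, `volumes`, or `workspace`) and a `destination`. The helper `init_script_entry` and its prefix-based routing are hypothetical, written only for illustration.

```python
def init_script_entry(path):
    """Map a script path to an init_scripts entry (hypothetical helper)."""
    if path.startswith("dbfs:/"):
        storage = "dbfs"
    elif path.startswith("/Volumes/"):
        storage = "volumes"
    else:
        # Assumption: any other path is treated as a workspace file path.
        storage = "workspace"
    return {storage: {"destination": path}}


# Fragment to merge into a cluster specification.
cluster_fragment = {
    "init_scripts": [
        init_script_entry("dbfs:/databricks/init/install_profiler.sh"),
    ]
}
print(cluster_fragment)
```

Keeping the path-to-entry mapping in one function means a script can later move from DBFS to a volume or workspace file without changing every cluster definition by hand.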

---

## ✅ Benefits of Init Scripts

| Benefit                  | Description                                 |
| ------------------------ | ------------------------------------------- |
| 🔄 Customization         | Configure node behavior before Spark starts |
| 🤝 Consistency           | Reproduce environment across all nodes      |
| ⚖️ Dependency Management | Install packages, Python libs, or drivers   |
| 🔐 Secrets Integration   | Use Vault or Key Vault for secure access    |
| 🚀 Enable External Tools | Monitoring agents, logging, etc.            |

---

## ⚠ Limitations of Init Scripts

| Limitation             | Description                              |
| ---------------------- | ---------------------------------------- |
| ⏳ Startup Delay        | Long scripts slow down cluster launch    |
| ❌ Limited Debugging    | Hard to trace failures without logging   |
| 🧡 Scaling Issues      | Hard to manage many scripts manually     |
| 📄 Bash Only           | No direct Python support                 |
| ❎ Immutable at Runtime | Cannot re-run without restarting cluster |

---

## 📒 Summary

| Feature   | Description                                       |
| --------- | ------------------------------------------------- |
| What      | Shell script run on each node before Spark starts |
| Use Cases | Install tools, mount volumes, configure system    |
| Types     | Cluster-scoped or global; stored in DBFS, workspace files, or cloud storage |
| Benefits  | Reusable, automated, secure setup                 |
| Drawbacks | Slower startup, debugging, shell-only             |

---

# 🖊️ Python Logger in Databricks

Python's built-in `logging` module is a powerful tool for structured and level-based logging in **Databricks notebooks and production pipelines**. It is far more flexible and maintainable than using simple `print()` statements.

The `logging` module in Python allows you to:

* Record events and messages during execution
* Classify messages by severity (INFO, ERROR, etc.)
* Send logs to different outputs (console, files, etc.)
* Control log formatting, level, and filtering
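
A minimal sketch of these capabilities, using only the standard library. The helper name `get_logger`, the logger name `etl_demo`, and the format string are illustrative choices, not something prescribed by Databricks.

```python
import logging
import sys


def get_logger(name, level=logging.INFO):
    """Return a logger that writes formatted messages to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Notebook cells are often re-run; only add the handler once
    # so that messages are not printed multiple times.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
        )
        logger.addHandler(handler)
    # Do not pass records on to the root logger, which may have its own handlers.
    logger.propagate = False
    return logger


logger = get_logger("etl_demo")
logger.info("Loaded %d rows", 1200)
logger.error("Failed to write table %s", "sales_daily")
```

Passing values as arguments (`"Loaded %d rows", 1200`) instead of pre-formatting the string lets the logging module skip the formatting work when the message is filtered out.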

---

## 🔒 Logging Levels

| Level      | Use Case                         |
| ---------- | -------------------------------- |
| `DEBUG`    | Verbose output for developers    |
| `INFO`     | General runtime events           |
| `WARNING`  | Recoverable issues               |
| `ERROR`    | Runtime errors, can continue     |
| `CRITICAL` | Fatal errors, aborting execution |
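
Each level has a numeric value, and a logger discards any message below its configured level. The short sketch below sets a logger (the name `level_demo` is arbitrary) to `WARNING` and checks which levels would be emitted.

```python
import logging

logger = logging.getLogger("level_demo")
logger.setLevel(logging.WARNING)

levels = (logging.DEBUG, logging.INFO, logging.WARNING,
          logging.ERROR, logging.CRITICAL)

# Map each level name to whether this logger would emit a message at that level.
enabled = {logging.getLevelName(lvl): logger.isEnabledFor(lvl) for lvl in levels}

for lvl in levels:
    print(lvl, logging.getLevelName(lvl), enabled[logging.getLevelName(lvl)])
# 10 DEBUG False
# 20 INFO False
# 30 WARNING True
# 40 ERROR True
# 50 CRITICAL True
```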

---

## ✅ Benefits of Using Logger in Databricks

| Benefit              | Description                                           |
| -------------------- | ----------------------------------------------------- |
| 📊 Structured Output | Includes timestamp, severity, etc.                    |
| 🔄 Reusability       | Can be reused across jobs and notebooks               |
| 🚨 Better Debugging  | Helps trace errors and warnings in production         |
| 📂 File Logging      | Logs can be saved for audits or monitoring            |
| 🚀 Scalable          | Can log to multiple destinations (DBFS, stdout, etc.) |

---

## 🛠️ Common Use Cases

| Scenario          | Description                                       |
| ----------------- | ------------------------------------------------- |
| ETL Pipelines     | Track each stage of ingestion/transformation      |
| Data Validation   | Log schema mismatches or null checks              |
| ML Model Training | Log metrics, hyperparameters, convergence         |
| Alerts            | Warn on threshold breaches (e.g., null rate > 5%) |
| Auditing          | Log user actions or sensitive data access         |
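
As an illustration of the data validation and alert scenarios, the sketch below logs a warning when the share of nulls in a column exceeds 5%. It works on a plain Python list so it runs anywhere; with a Spark DataFrame the counts would come from the DataFrame instead. The function `check_null_rate` and the column name `customer_id` are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("validation_demo")


def check_null_rate(values, column, threshold=0.05):
    """Log the null rate for a column; return True if it is within the threshold."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    rate = nulls / total if total else 0.0
    if rate > threshold:
        logger.warning("Null rate for %s is %.1f%% (threshold %.1f%%)",
                       column, rate * 100, threshold * 100)
        return False
    logger.info("Null rate for %s is %.1f%%, within threshold", column, rate * 100)
    return True


sample = [1, None, 3, 4, 5, 6, 7, 8, 9, 10]  # 1 null out of 10 values
passed = check_null_rate(sample, "customer_id")
```

Returning a boolean as well as logging lets a pipeline decide whether to stop, while the log line records why.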

---

## 💪 Best Practices

* Use `logger` over `print()` in production
* Set log level according to environment (e.g., DEBUG for dev, INFO for prod)
* Combine with `try-except` to log errors
* Use file rotation or cleanup jobs to manage log files
* Keep logging setup in a shared module for consistency
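
These practices can be combined in one small shared module. The sketch below is one possible arrangement, not a prescribed setup: `build_file_logger`, the `APP_ENV` environment variable, the file name, and the rotation sizes are all assumptions made for the example.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_file_logger(name, log_path, env="prod"):
    """Create a logger with a rotating file handler; DEBUG in dev, INFO otherwise."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if env == "dev" else logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        # Rotate at roughly 1 MB and keep three old files.
        handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=3)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.gettempdir(), "pipeline_demo.log")
logger = build_file_logger("pipeline_demo", log_path,
                           env=os.environ.get("APP_ENV", "prod"))


def safe_divide(a, b):
    """Divide a by b; log the full traceback and return None on failure."""
    try:
        return a / b
    except ZeroDivisionError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Division failed for a=%s, b=%s", a, b)
        return None


result = safe_divide(10, 0)
```

Other notebooks and jobs can import `build_file_logger` from the shared module so that format, level, and rotation stay consistent.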

---
