In [0]:
# Step 1: Create a sample dataset
sample_data = [
    ("Fran", "2025-06-01", 3),
    ("Fran", "2025-06-02", 5),
    ("Fran", "2025-06-03", 7),
    ("Databricks", "2025-06-01", 2),
    ("Databricks", "2025-06-02", 4),
    ("Databricks", "2025-06-03", 6)
]

# Step 2: Define column names and create the DataFrame (column types are inferred)
columns = ["user", "date", "engagement_score"]
df = spark.createDataFrame(sample_data, columns)

# Step 3: Show the data
df.show()


+----------+----------+----------------+
|      user|      date|engagement_score|
+----------+----------+----------------+
|      Fran|2025-06-01|               3|
|      Fran|2025-06-02|               5|
|      Fran|2025-06-03|               7|
|Databricks|2025-06-01|               2|
|Databricks|2025-06-02|               4|
|Databricks|2025-06-03|               6|
+----------+----------+----------------+



In [0]:
# Group by user and calculate average engagement
avg_df = df.groupBy("user").avg("engagement_score")

# Rename the resulting column
avg_df = avg_df.withColumnRenamed("avg(engagement_score)", "avg_score")

# Show the results
avg_df.show()


+----------+---------+
|      user|avg_score|
+----------+---------+
|      Fran|      5.0|
|Databricks|      4.0|
+----------+---------+



In [0]:
# Save the transformed data as a Delta Table
avg_df.write.format("delta").mode("overwrite").saveAsTable("user_avg_scores")


In [0]:
%sql

SELECT * FROM user_avg_scores;


user,avg_score
Fran,5.0
Databricks,4.0


Databricks visualization. Run in Databricks to view.

## 📝 Project Summary: Mini ETL Pipeline in Databricks

**Objective**: Practice foundational Databricks and Spark skills by simulating a simple ETL pipeline using PySpark and Delta Lake.

### Steps Performed:
1. **Ingested data**: Created a small sample dataset representing user engagement over time.
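2. **Transformed data**: Calculated each user's average engagement score with `groupBy("user")` and `avg("engagement_score")`, then renamed the result column to `avg_score`.
3. **Loaded data**: Saved the results as a Delta table named `user_avg_scores`, overwriting any existing table of that name.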
4. **Queried data using SQL**: Verified results via SQL and visualized the outcome as a bar chart.
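
The averages produced in step 2 can be sanity-checked without a Spark session. The sketch below recomputes them in plain Python from the same sample rows; the helper name `average_by_user` is hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Same rows as the notebook's sample dataset: (user, date, engagement_score)
sample_data = [
    ("Fran", "2025-06-01", 3),
    ("Fran", "2025-06-02", 5),
    ("Fran", "2025-06-03", 7),
    ("Databricks", "2025-06-01", 2),
    ("Databricks", "2025-06-02", 4),
    ("Databricks", "2025-06-03", 6),
]


def average_by_user(rows):
    """Hypothetical helper: mean engagement score per user."""
    # Track [running total, row count] for each user
    totals = defaultdict(lambda: [0, 0])
    for user, _date, score in rows:
        totals[user][0] += score
        totals[user][1] += 1
    return {user: total / count for user, (total, count) in totals.items()}


averages = average_by_user(sample_data)
print(averages)  # {'Fran': 5.0, 'Databricks': 4.0}
```

The values match the `avg_score` column returned by the SQL query against `user_avg_scores`.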

### Key Skills Demonstrated:
- Spark DataFrame operations in PySpark
- Aggregation and data transformation
- Writing and querying Delta Tables
- Querying tables with SQL and visualizing the results in a Databricks notebook
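
As a variation on the DataFrame operations listed above, the sketch below declares column types up front and names the aggregate column directly instead of renaming it afterwards. It assumes a running Spark session is passed in as `spark`; the function names `build_engagement_df` and `average_engagement` are hypothetical, and converting `date` to a real date type is an optional extra the notebook does not perform.

```python
def build_engagement_df(spark, rows):
    """Hypothetical helper: create the DataFrame with explicit column types."""
    # Imported here so the module can be loaded where PySpark is not installed
    from pyspark.sql import functions as F

    # A DDL-style schema string avoids relying on type inference
    df = spark.createDataFrame(
        rows, "user string, date string, engagement_score int"
    )
    # Assumption: parse the ISO-formatted strings into a DateType column
    return df.withColumn("date", F.to_date("date"))


def average_engagement(df):
    """Hypothetical helper: one row per user with the mean engagement score."""
    from pyspark.sql import functions as F

    # alias() names the output column, so no withColumnRenamed step is needed
    return df.groupBy("user").agg(
        F.avg("engagement_score").alias("avg_score")
    )
```

In a Databricks notebook this could be used as `average_engagement(build_engagement_df(spark, sample_data)).show()`, which should yield the same two rows as the original `avg_df`.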
