# 4 - Lab Solution - Analyzing Transaction Data with DataFrames

In this lab, you'll analyze transactions from the Bakehouse dataset using Spark DataFrames. You'll apply the concepts from the lecture to solve real business problems and gain insights from the data.

### Objectives
- Reading data into a DataFrame and exploring its contents and structure
- Filtering records and projecting columns from a DataFrame
- Saving a DataFrame to a table

## Initial Setup and Data Loading

First, let's load our data and examine its structure.

In [0]:
# Read the Bakehouse transaction data
transactions_df = spark.read.table("samples.bakehouse.sales_transactions")

# Examine the schema and display first few rows
print("DataFrame Schema:")
transactions_df.printSchema()

print("\nSample Data:")
display(transactions_df.limit(5))

## Data Exploration

Let's explore the basic characteristics of the dataset.

### Total Transactions Count
Getting a count of all transactions helps us understand the size of the dataset.

In [0]:
total_transactions = transactions_df.count()
print(f"Total number of transactions: {total_transactions}")

### Transactions over $100
Find the transactions with a total price over $100 and save them to a new DataFrame named `large_transactions_df`. Then display the contents of this new DataFrame.

In [0]:
from pyspark.sql.functions import col

large_transactions_df = transactions_df.filter(col("totalPrice") > 100)
display(large_transactions_df)

### Save the DataFrame to a table
Save the `large_transactions_df` DataFrame to a table called `large_transactions`, overwriting the table if it already exists.

In [0]:
# Save the large transactions DataFrame to a table, replacing any existing copy
large_transactions_df.write.mode("overwrite").saveAsTable("large_transactions")

### Use Spark SQL to count the number of large transactions
Count the total number of rows in the `large_transactions` table.

In [0]:
spark.sql("select count(*) as cnt_large_trnsc from large_transactions").display()