# PySpark Basics: Reading Data and Visualization - Interactive Demo

Welcome! This notebook will teach you the fundamentals of working with PySpark in Databricks.

**What you'll learn:**
* 📊 **Databricks Sample Datasets** - Access built-in sample data for learning
* 📖 **Reading Data with PySpark** - Load data into Spark DataFrames
* 🔍 **df.show() vs display()** - Understand the key differences
* 🖱️ **Interactive UI Features** - Sort, filter, and search data interactively
* 📈 **Built-in Plotting** - Create visualizations without code
* ✅ **Best Practices** - When to use each method

**Prerequisites:** Basic Python knowledge

---

*Follow along by running each cell and experimenting with the features!*

## 1. Databricks Sample Datasets

Databricks provides a collection of sample datasets through the **`samples` catalog** in Unity Catalog!

### 📚 **What is the Samples Catalog?**

* Pre-loaded datasets available in every Databricks workspace
* Accessed through **Unity Catalog** (modern best practice)
* Organized as proper tables with schemas
* Include various domains: retail, IoT, TPC benchmarks, NYC taxi data, etc.
* No file system access needed - just query like any table!

### 📂 **Popular Sample Datasets**

* **`samples.nyctaxi.trips`** - NYC taxi trip data
* **`samples.tpch.*`** - TPC-H benchmark tables (customer, orders, lineitem, etc.)
* **`samples.tpcds.*`** - TPC-DS benchmark tables

### 🔍 **How to Browse Datasets**

You can explore available datasets using SQL commands or the Catalog Explorer in the UI.

In [0]:
# List all schemas (databases) in the samples catalog
# This shows what types of sample data are available

schemas = spark.sql("SHOW SCHEMAS IN samples")
print("Available schemas in the 'samples' catalog:")
print("")
display(schemas)

In [0]:
# Let's explore the nyctaxi schema which contains NYC taxi trip data
# We'll see what tables are available

tables = spark.sql("SHOW TABLES IN samples.nyctaxi")
print("Tables in the 'samples.nyctaxi' schema:")
print("")
display(tables)

## 2. Reading Data with PySpark

Let's read the **NYC Taxi trips dataset** from the samples catalog - a real-world dataset with taxi trip information.

### 📊 **About the NYC Taxi Dataset**

* Real NYC taxi trip data
* Multiple columns including:
  * `tpep_pickup_datetime` - When the trip started
  * `tpep_dropoff_datetime` - When the trip ended
  * `passenger_count` - Number of passengers
  * `trip_distance` - Distance traveled in miles
  * `fare_amount` - Base fare amount
  * `total_amount` - Total amount charged
  * `pickup_zip` / `dropoff_zip` - Location information

### 📝 **Reading from Unity Catalog**

We'll use `spark.table()` or `spark.read.table()` to load data from Unity Catalog tables.

In [0]:
# Read the NYC taxi trips dataset from the samples catalog
# This is the modern best practice - reading from Unity Catalog tables

# Method 1: Using spark.table() - simplest approach
taxi_df = spark.table("samples.nyctaxi.trips")

# Method 2: Using spark.read.table() - equivalent
# taxi_df = spark.read.table("samples.nyctaxi.trips")

print("✅ Data loaded successfully from Unity Catalog!")
print(f"Number of rows: {taxi_df.count():,}")
print(f"Number of columns: {len(taxi_df.columns)}")

In [0]:
# Check the schema (structure) of the DataFrame
# This shows column names and data types

print("DataFrame Schema:")
taxi_df.printSchema()

## 3. Using `df.show()`

`df.show()` is the standard PySpark method for displaying DataFrame contents.

### 📝 **Characteristics of df.show()**

**Advantages:**
* ✅ Standard PySpark method (works everywhere)
* ✅ Simple and straightforward
* ✅ Good for quick data inspection
* ✅ Works in any Spark environment

**Limitations:**
* ❌ Plain text output (not interactive)
* ❌ Shows only 20 rows by default
* ❌ Truncates long values
* ❌ No sorting, filtering, or searching
* ❌ No built-in visualizations
* ❌ Difficult to read with many columns

### 📊 **Common Usage**

```python
df.show()           # Show 20 rows
df.show(10)         # Show 10 rows
df.show(5, False)   # Show 5 rows without truncation
```
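
The truncation that `df.show()` applies can be illustrated without a cluster. The sketch below is a plain-Python approximation of how Spark shortens cell text when `truncate` is left at its default of 20 characters. `truncate_cell` is a hypothetical helper written for this notebook, not part of PySpark.

```python
def truncate_cell(value, width=20):
    """Approximate how df.show() shortens a cell when truncation is enabled.

    Hypothetical helper for illustration; PySpark does this internally.
    """
    text = str(value)
    if width <= 0 or len(text) <= width:
        return text  # short enough, or truncation disabled
    if width < 4:
        return text[:width]  # too narrow to fit an ellipsis
    return text[: width - 3] + "..."  # keep width-3 characters, then "..."


print(truncate_cell("2016-02-14 16:52:13.000000"))
print(truncate_cell("short value"))
```

In a real notebook, `taxi_df.show(5, truncate=False)` prints full values, an integer such as `truncate=40` raises the limit, and `vertical=True` prints one column per line, which can help with wide tables.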

In [0]:
# Default df.show() - displays 20 rows with truncation
print("Using df.show() - Default (20 rows):")
taxi_df.show()

In [0]:
# Show only 5 rows
print("Using df.show(5) - Only 5 rows:")
taxi_df.show(5)

In [0]:
# Show 5 rows without truncating column values
print("Using df.show(5, False) - No truncation:")
taxi_df.show(5, False)

## 4. Using `display()` - The Databricks Way!

`display()` is Databricks' enhanced function for showing DataFrames with rich interactive features.

### ✨ **Advantages of display()**

**Interactive Features:**
* ✅ **Rich HTML table** with beautiful formatting
* ✅ **Sort columns** by clicking column headers
* ✅ **Search and filter** data interactively
* ✅ **Pagination** - navigate through large datasets
* ✅ **Column reordering** - drag and drop columns
* ✅ **Built-in visualizations** - create charts with clicks
* ✅ **Export data** - download as CSV
* ✅ **Shows far more rows** - up to 10,000 rows or 2 MB by default (vs 20 for df.show())

**Better Readability:**
* ✅ Formatted numbers and dates
* ✅ Proper column alignment
* ✅ Color-coded data types
* ✅ Responsive layout

### 🎯 **When to Use Each**

| Use Case | df.show() | display() |
|----------|-----------|----------|
| Quick inspection | ✅ | ✅ |
| Interactive exploration | ❌ | ✅ |
| Creating visualizations | ❌ | ✅ |
| Sorting/filtering data | ❌ | ✅ |
| Non-Databricks environments | ✅ | ❌ |
| Production logging | ✅ | ❌ |
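
One way to keep a notebook portable across the environments in this table is a small wrapper that picks the renderer at run time. This is a minimal sketch: `preview` and `in_databricks` are hypothetical helper names, and checking the `DATABRICKS_RUNTIME_VERSION` environment variable is an assumed detection heuristic that may not cover every setup (for example, remote connections to a cluster).

```python
import os


def in_databricks():
    """Heuristic (assumption): Databricks clusters set this environment variable."""
    return "DATABRICKS_RUNTIME_VERSION" in os.environ


def preview(df, n=20):
    """Show the first n rows with display() on Databricks, else with df.show()."""
    if in_databricks():
        display(df.limit(n))  # noqa: F821 - provided by Databricks notebooks
        return "display"
    df.show(n, truncate=False)  # plain-text fallback that works in any Spark environment
    return "show"
```

Calling `preview(taxi_df, 10)` would then render an interactive table inside Databricks and fall back to plain text elsewhere.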

In [0]:
# Use display() to show the taxi dataset
# Notice the rich interactive table that appears!

print("Using display() - Interactive table with all features:")
display(taxi_df)

### 🔄 **Side-by-Side Comparison**

**What you should notice:**

**df.show():**
* Plain text output
* 20 rows by default
* Truncated values (shown with `...`)
* No interaction possible
* Difficult to read with many columns

**display():**
* Beautiful HTML table
* Pagination controls at the bottom
* Full values visible (hover for long text)
* Click column headers to sort
* Search box in top right
* Filter icon on each column
* "+" button to create visualizations

**Try it yourself:** 
* Click on the "fare_amount" column header to sort by fare
* Use the search box to find specific trips
* Click the filter icon on "passenger_count" to filter by number of passengers

## 5. Interactive UI Features in display()

The `display()` function provides powerful interactive features for data exploration!

### 🔼 **Sorting**

* **Click any column header** to sort ascending
* **Click again** to sort descending
* **Click a third time** to remove sorting
* Sort by multiple columns by clicking headers in sequence

**Try it:**
1. Click the "fare_amount" column to sort by fare
2. Click "trip_distance" to sort by distance
3. Notice the arrow indicator showing sort direction

### 🔍 **Searching**

* **Search box** in the top-right corner
* Searches across **all columns**
* Real-time filtering as you type
* Case-insensitive search

**Try it:**
1. Type a number in the search box to find specific values
2. Search for pickup locations
3. Clear the search to see all data again

### 📊 **Filtering**

* **Filter icon** appears on each column header
* Click to open filter options
* Multiple filter types based on data type:
  * **Text columns**: Contains, equals, starts with, ends with
  * **Numeric columns**: Greater than, less than, equals, range
  * **Date columns**: Before, after, between dates

**Try it:**
1. Click the filter icon on the "passenger_count" column
2. Select only trips with 1 or 2 passengers
3. Click Apply to filter the data

In [0]:
# Let's create a display with the full dataset for you to practice
# Run this cell and then try the interactive features described above!

print("✨ Interactive Table - Try these features:")
print("1. 🔼 SORT: Click 'fare_amount' column header to sort by fare")
print("2. 🔍 SEARCH: Type a specific pickup location in the search box (top-right)")
print("3. 📊 FILTER: Click filter icon on 'passenger_count' column")
print("4. 📊 PAGINATION: Use controls at bottom to navigate pages")
print("")

display(taxi_df)

### 📊 **Additional Interactive Features**

**Pagination:**
* Navigate through large datasets page by page
* Controls at the bottom of the table
* Shows current page and total rows
* Adjustable rows per page

**Column Management:**
* **Reorder columns** - Drag column headers to rearrange
* **Resize columns** - Drag column borders to adjust width
* **Hide columns** - Right-click column header for options

**Data Export:**
* **Download as CSV** - Export filtered/sorted data
* Click the download icon in the table toolbar
* Useful for sharing results or further analysis

**Cell Details:**
* **Hover over cells** to see full content
* Especially useful for long text values
* Shows complete value in tooltip

### 📊 **Row Limits**

* `display()` renders up to **10,000 rows** (or 2 MB of data, whichever is smaller) by default
* Far more than `df.show()`'s default of 20 rows
* For larger datasets, use filtering or sampling
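
When sampling, the fraction to request depends on how many rows the table has. The sketch below shows one way to derive it from a target row count. `sample_fraction` is a hypothetical helper, and the row counts in the example are made up.

```python
def sample_fraction(total_rows, target_rows):
    """Return the fraction to pass to DataFrame.sample() for roughly target_rows rows."""
    if total_rows <= 0 or target_rows <= 0:
        return 0.0
    return min(1.0, target_rows / total_rows)  # never ask for more than 100%


# Example with made-up sizes: aim for about 5,000 rows out of 2,000,000
print(sample_fraction(2_000_000, 5_000))  # 0.0025
```

In the notebook this could feed `taxi_df.sample(fraction=sample_fraction(taxi_df.count(), 5000), seed=42)`. Because `sample()` is probabilistic, the resulting size is approximate, so a trailing `.limit(5000)` keeps a hard cap.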

### ✍️ **Hands-On Exercise**

**Challenge:** Use the interactive features to answer these questions:

1. **What is the highest fare amount in the dataset?**
   * Hint: Sort by "fare_amount" column descending

2. **How many trips had 5 or more passengers?**
   * Hint: Filter the "passenger_count" column for values >= 5

3. **What are the characteristics of trips with fare over $100?**
   * Hint: Filter the "fare_amount" column for values > 100

4. **What's the average trip distance?**
   * Hint: Look at the "trip_distance" column values

**Try it yourself in the table above!**

No code needed - just use the interactive UI features! 🎉
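
To cross-check the answers you found in the UI, the same questions can be expressed as SQL aggregate expressions over the whole DataFrame. This is a sketch under assumptions: `exercise_answers` is a hypothetical helper, and it assumes the columns named in the exercise (`fare_amount`, `passenger_count`, `trip_distance`) exist in your copy of the table.

```python
def exercise_answers(df):
    """Compute the exercise answers over the full DataFrame with SQL expressions."""
    row = df.selectExpr(
        "max(fare_amount) AS highest_fare",                     # question 1
        "count_if(passenger_count >= 5) AS trips_with_5_plus",  # question 2
        "count_if(fare_amount > 100) AS trips_over_100",        # question 3 (count only)
        "round(avg(trip_distance), 2) AS avg_trip_distance",    # question 4
    ).first()
    return row.asDict()
```

Calling `exercise_answers(taxi_df)` returns a dictionary with one entry per question. Unlike filters in the results table, which typically act on the rows already returned to the notebook, these expressions run against the entire dataset.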

## 6. Built-in Plotting with display()

One of the most powerful features of `display()` is the ability to create visualizations **without writing any code**!

### 📈 **Available Chart Types**

* **Bar Chart** - Compare categories
* **Line Chart** - Show trends over time
* **Pie Chart** - Show proportions
* **Scatter Plot** - Show relationships between variables
* **Map** - Geographic visualizations
* **Histogram** - Show distributions
* **Box Plot** - Show statistical distributions
* **And more!**

### 🎨 **How to Create Visualizations**

1. Run `display()` on your DataFrame
2. Click the **"+"** button above the table
3. Select **"Visualization"**
4. Choose your chart type
5. Configure axes and groupings
6. Click **"Save"** to add the chart

### ✨ **Key Features**

* **No code required** - Point and click interface
* **Multiple visualizations** - Create several charts from one table
* **Interactive charts** - Hover for details, zoom, pan
* **Customizable** - Colors, labels, legends, titles
* **Tabs** - Switch between table and chart views

In [0]:
# Let's prepare some aggregated data that's perfect for visualization
# We'll analyze taxi trips by passenger count

from pyspark.sql.functions import avg, count, sum as spark_sum, round as spark_round

# Aggregate data by passenger count
trips_by_passengers = taxi_df.groupBy("passenger_count").agg(
    count("*").alias("trip_count"),
    spark_round(avg("fare_amount"), 2).alias("avg_fare"),
    spark_round(avg("trip_distance"), 2).alias("avg_distance")
).filter("passenger_count IS NOT NULL AND passenger_count > 0").orderBy("passenger_count")

print("📈 Taxi Trip Statistics by Passenger Count")
print("Run this cell, then create visualizations using the steps below!")
print("")
display(trips_by_passengers)

### 📊 **Exercise: Create a Bar Chart**

**Follow these steps to create your first visualization:**

1. **Look at the output above** - You should see a table with passenger counts and trip statistics

2. **Click the "+" button** - Located above the table, next to the table icon

3. **Select "Visualization"** - A visualization editor will open

4. **Choose "Bar" chart type** - From the visualization type dropdown

5. **Configure the chart:**
   * **X-axis**: Select "passenger_count" (number of passengers)
   * **Y-axis**: Select "trip_count" (number of trips)
   * **Title**: Enter "Taxi Trips by Passenger Count"

6. **Click "Save"** - Your chart appears as a new tab!

7. **Switch between views** - Click the "Table" and chart tabs to toggle

**Result:** You should see a bar chart showing how many trips had different numbers of passengers!

In [0]:
# Let's create another dataset - trips by hour of day
# This will be great for different chart types!

from pyspark.sql.functions import hour

trips_by_hour = taxi_df.groupBy(hour("tpep_pickup_datetime").alias("hour")).agg(
    count("*").alias("trip_count"),
    spark_round(avg("fare_amount"), 2).alias("avg_fare"),
    spark_round(spark_sum("fare_amount"), 2).alias("total_revenue")
).orderBy("hour")

print("🎨 Taxi Trip Statistics by Hour of Day")
print("Try creating different chart types with this data!")
print("")
display(trips_by_hour)

### 💡 **Try These Visualizations**

**Using the data above, create these charts:**

**1. Line Chart - Trips Throughout the Day**
* Chart type: Line
* X-axis: hour
* Y-axis: trip_count
* Shows: How does taxi demand change throughout the day?

**2. Bar Chart - Revenue by Hour**
* Chart type: Bar
* X-axis: hour
* Y-axis: total_revenue
* Shows: Which hours generate the most revenue?

**3. Line Chart - Average Fare by Hour**
* Chart type: Line
* X-axis: hour
* Y-axis: avg_fare
* Shows: How do fares vary throughout the day?

### 🎯 **Advanced: Scatter Plot**

For scatter plots, use the original `taxi_df`:
* X-axis: trip_distance
* Y-axis: fare_amount
* Shows: Relationship between distance and fare

In [0]:
# For the scatter plot, sample the data so the chart renders quickly
# Take a roughly 1% random sample of valid trips, capped at 5,000 rows

taxi_sample = taxi_df.select("trip_distance", "fare_amount", "passenger_count").filter(
    "trip_distance > 0 AND fare_amount > 0 AND fare_amount < 200"
).sample(fraction=0.01, seed=42).limit(5000)

print("🔵 Sample of up to 5,000 Taxi Trips for Scatter Plot")
print("")
print("Create a Scatter Plot:")
print("  X-axis: trip_distance")
print("  Y-axis: fare_amount")
print("  Group by: passenger_count (optional)")
print("")
print("This will show how trip distance affects fare amount!")
print("")

display(taxi_sample)

### ✨ **Visualization Best Practices**

**Choosing the Right Chart:**

* **Bar Chart** - Comparing categories (e.g., sales by region)
* **Line Chart** - Trends over time (e.g., daily revenue)
* **Pie Chart** - Parts of a whole (e.g., market share)
* **Scatter Plot** - Relationships between variables (e.g., price vs size)
* **Histogram** - Distribution of values (e.g., age distribution)

**Tips for Better Visualizations:**

✅ **Aggregate data first** - Charts work best with summarized data  
✅ **Limit categories** - Too many bars/slices make charts hard to read  
✅ **Use meaningful labels** - Add clear titles and axis labels  
✅ **Choose appropriate colors** - Use color to highlight insights  
✅ **Sample large datasets** - For scatter plots with many points  

**Advantages over Code-Based Plotting:**

* ✅ **No library imports** - No matplotlib, seaborn, or plotly needed
* ✅ **Instant feedback** - See results immediately
* ✅ **Easy experimentation** - Try different chart types quickly
* ✅ **Interactive by default** - Hover, zoom, pan built-in
* ✅ **Shareable** - Charts save with the notebook

## 7. Summary: df.show() vs display()

### 📊 **Complete Comparison**

| Feature | df.show() | display() |
|---------|-----------|----------|
| **Output Format** | Plain text | Rich HTML table |
| **Default Rows** | 20 | Up to 10,000 (or 2 MB) |
| **Truncation** | Yes (with ...) | No (full values) |
| **Sorting** | ❌ No | ✅ Click column headers |
| **Filtering** | ❌ No | ✅ Filter icon per column |
| **Searching** | ❌ No | ✅ Search box |
| **Pagination** | ❌ No | ✅ Navigate pages |
| **Visualizations** | ❌ No | ✅ Built-in charts |
| **Export Data** | ❌ No | ✅ Download CSV |
| **Column Reordering** | ❌ No | ✅ Drag and drop |
| **Interactive** | ❌ No | ✅ Fully interactive |
| **Works Everywhere** | ✅ Yes | ❌ Databricks only |
| **Best For** | Quick checks | Data exploration |

## ✅ Best Practices

### **When to Use df.show()**

✅ Quick data inspection during development  
✅ Debugging in non-Databricks environments  
✅ Production logging and monitoring  
✅ When you need plain text output  
✅ Automated scripts and jobs  

**Example:**
```python
# Quick check during development
df.show(5)

# Verify data loaded correctly
print(f"Loaded {df.count()} rows")
df.show(3, False)
```
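
`df.show()` writes to standard output, which a job's logging configuration may not capture. For production logging, one option is to pull a few rows to the driver and pass them to Python's `logging` module. This is a minimal sketch: `log_preview` and the logger name are hypothetical, and it assumes the rows are small enough to collect safely.

```python
import logging

logger = logging.getLogger("pyspark_basics_demo")  # hypothetical logger name


def log_preview(df, n=3, log=logger):
    """Log the first n rows of df, one dictionary per line, and return them."""
    rows = [row.asDict() for row in df.take(n)]  # take() brings at most n rows to the driver
    for i, row in enumerate(rows, start=1):
        log.info("row %d/%d: %s", i, len(rows), row)
    return rows
```

Because the rows come back as plain dictionaries, they can also be asserted on in tests or attached to structured log records.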

### **When to Use display()**

✅ Interactive data exploration in notebooks  
✅ Creating visualizations without code  
✅ Presenting results to stakeholders  
✅ Browsing larger result sets (up to 10,000 rows by default)  
✅ When you need sorting/filtering/searching  
✅ Exploratory Data Analysis (EDA)  

**Example:**
```python
# Explore data interactively
display(df)

# Create aggregations for visualization
from pyspark.sql.functions import avg

summary = df.groupBy("category").agg(avg("sales"))
display(summary)  # Then create charts!
```

## 📚 Quick Reference Guide

### **Reading Data**

```python
# Unity Catalog tables (RECOMMENDED - Modern Best Practice)
df = spark.table("catalog.schema.table_name")
df = spark.read.table("catalog.schema.table_name")

# Sample datasets from Unity Catalog
df = spark.table("samples.nyctaxi.trips")
df = spark.table("samples.tpch.customer")

# CSV files (when needed)
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Parquet files
df = spark.read.parquet("path/to/file.parquet")

# JSON files
df = spark.read.json("path/to/file.json")

# Delta tables
df = spark.read.format("delta").load("path/to/delta/table")
```

### **Displaying Data**

```python
# Using df.show()
df.show()              # Show 20 rows
df.show(10)            # Show 10 rows
df.show(5, False)      # Show 5 rows, no truncation

# Using display()
display(df)            # Interactive table
display(df.limit(100)) # Limit rows for performance
```

### **Common DataFrame Operations**

```python
# View schema
df.printSchema()

# Count rows
df.count()

# Select columns
df.select("col1", "col2")

# Filter rows
df.filter(df.price > 1000)

# Group and aggregate (requires: from pyspark.sql.functions import avg)
df.groupBy("category").agg(avg("price"))

# Sort
df.orderBy("price", ascending=False)
```

## 🎯 Key Takeaways

### **What You Learned Today:**

1. 📚 **Databricks Sample Datasets**
   * Available through the **`samples` catalog** in Unity Catalog
   * Modern best practice - no file system access needed
   * Browse with `SHOW SCHEMAS IN samples` and `SHOW TABLES IN samples.<schema_name>`

2. 📖 **Reading Data with PySpark**
   * Use `spark.table("catalog.schema.table")` for Unity Catalog tables
   * Use `spark.read.table()` as an alternative
   * Check schema with `printSchema()`

3. 🔍 **df.show() - Standard PySpark**
   * Plain text output
   * Good for quick checks
   * Works in any Spark environment
   * Limited to 20 rows by default

4. ✨ **display() - Databricks Enhanced**
   * Rich interactive HTML tables
   * Sort, filter, search capabilities
   * Up to 10,000 rows by default
   * Built-in visualization creation

5. 📈 **Built-in Plotting**
   * No code required for charts
   * Multiple chart types available
   * Interactive and customizable
   * Perfect for quick insights

### **Remember:**

✅ Use **Unity Catalog** (`spark.table()`) for reading data - modern best practice  
✅ Use `display()` for **exploration** in Databricks notebooks  
✅ Use `df.show()` for **quick checks** and production code  
✅ Aggregate data before creating visualizations  
✅ Take advantage of interactive features for EDA  

## 🎉 Congratulations!

You've completed the PySpark Basics demo!

### **Next Steps:**

1. **Practice** - Try loading different sample datasets
2. **Experiment** - Create various visualizations
3. **Explore** - Use interactive features for data analysis
4. **Learn More** - Check out advanced PySpark operations
5. **Build** - Start working on real data projects!

### **Additional Resources:**

* [Databricks Sample Datasets](https://docs.databricks.com/discover/databricks-datasets.html)
* [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
* [Visualization Types](https://docs.databricks.com/visualizations/visualization-types.html)
* [Databricks Notebooks Guide](https://docs.databricks.com/notebooks/)

### **Related Demos:**

* **Notebook Basics** - Learn fundamental notebook operations
* **Notebook Collaboration** - Work with teams effectively
* **Advanced PySpark** - Deep dive into transformations and actions

---

**Pro Tip:** Combine `display()` with PySpark aggregations for powerful data exploration! 🚀

*Happy data exploring!*