# 2. Querying Databricks

This sample notebook explains how to read data from Databricks tables and perform simple transformations with PySpark and SQL.

The instructions in this notebook assume that this is the user's first time using Databricks and that this notebook has not been run in their environment before.

## Preview Databricks Data

<br>
To preview your Databricks data and its schema in the UI, click the "Catalog" option in the left sidebar, then select `atr_workshop`, then `bronze_trips`.

![preview-data.png](./screenshots/preview-data.png)

Once you click on that table, you need to select an active cluster to preview the data and see its schema. The cluster selector is located just below the name of the table. This view also shows a description of the table (if one exists), the date the table was created, the date it was last modified, the partition columns defined on the data, the number of underlying files the data is stored in, and the size of the data.

## Query Databricks Data

In [0]:
%run "./1. Configure"

In [0]:
# 'my_database' is a Python variable, so the current database must also be set explicitly for the SQL context
sql_command = f"USE {my_database}"
spark.sql(sql_command)

<br>
Data can be queried in Databricks using SQL or Python. Because Databricks clusters run on Spark, you can use Spark SQL and PySpark to query data. Data in Azure Government Databricks is stored in a two-level namespace: `database`.`table`. When querying these tables, specify both names to ensure that you retrieve the correct data. To see which databases and tables are available to query, click the "Catalog" option in the left sidebar.

![catalog.png](./screenshots/catalog.png)
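As a minimal sketch of the two-level naming described above, the snippet below builds a fully qualified, backtick-quoted table name in plain Python. The helper `qualified_table_name` is hypothetical (it is not part of Databricks or PySpark), and the example assumes that `atr_workshop` is the database that contains `bronze_trips`.

```python
def qualified_table_name(database: str, table: str) -> str:
    """Return a backtick-quoted `database`.`table` identifier for Spark SQL."""

    def quote(part: str) -> str:
        # Inside a backtick-quoted identifier, a literal backtick is escaped by doubling it.
        return "`" + part.replace("`", "``") + "`"

    return f"{quote(database)}.{quote(table)}"


# Assumption: 'atr_workshop' is the database and 'bronze_trips' is the table.
full_name = qualified_table_name("atr_workshop", "bronze_trips")
print(full_name)

# In a Databricks notebook, where `spark` is already defined, the name could be used as:
#   df = spark.table(full_name)
#   spark.sql(f"SELECT * FROM {full_name} LIMIT 100")
```

Using the qualified name means a query does not depend on which database is currently selected with `USE`.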

### Query Data using SQL

The code below shows how to query data in a Databricks table using standard SQL. When you hover over the cell below, a label reading "SQL" appears in its top right corner. The cell language must be set to SQL in order to execute SQL code in a notebook. When you create a new cell, click this label to change the cell language to "SQL" if you want to write your code in SQL.

#### Simple Query

In [0]:
%sql
-- select data from a table and only return a limited number of rows. this query will only return 100 records from the table
SELECT
  *
FROM
  bronze_trips
LIMIT 100;

#### Time Interval Query

In [0]:
%sql
-- select data from a table where a certain condition is met. this query will return all records that occurred within the last 10 years
SELECT
  *
FROM
  bronze_trips
WHERE
  tpep_pickup_datetime >= current_date() - INTERVAL 10 YEARS;

When using SQL in a Databricks notebook, you can run multiple commands in a single cell as long as the commands are separated by semicolons.

#### Aggregate Grouping Query

The following SQL query groups trips by their fare amount rounded down to the nearest $10, then counts how many trips are in each fare range.

How It Works
* For each trip:
  * It divides the fare amount by 10, uses the FLOOR function to round down to the nearest whole number, then multiplies back by 10. This gets the nearest lower $10 multiple (for example, $23 becomes $20, $35 becomes $30).

* It calls this result bin_fares (the fare "bin" for the trip).

* It counts all trips in each bin_fares group.

* It lists each fare bin and the number of trips in that bin, ordered from the lowest fare up.

Why Use It
* This makes a histogram of fares, showing how many trips fall into each $10 chunk.

* The result is useful for quickly seeing fare patterns, like where most taxi fares fall on the price scale.
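The binning arithmetic described above can be checked with a small plain-Python sketch that needs no cluster. The helper `bin_fare` and the sample fares are made up for illustration and are not taken from `bronze_trips`.

```python
import math
from collections import Counter


def bin_fare(fare_amount: float, width: int = 10) -> int:
    """Round a fare down to the nearest multiple of `width`, like FLOOR(fare_amount / 10) * 10."""
    return math.floor(fare_amount / width) * width


# Hypothetical fares, made up for illustration.
sample_fares = [4.5, 9.99, 12.0, 23.0, 27.5, 35.0]

# Count the fares in each bin, then list the bins from lowest to highest
# (the equivalent of GROUP BY bin_fares followed by ORDER BY bin_fares).
histogram = sorted(Counter(bin_fare(fare) for fare in sample_fares).items())
for bin_start, trips in histogram:
    print(bin_start, trips)
# prints:
# 0 2
# 10 1
# 20 2
# 30 1
```

Because the value is floored rather than truncated, a negative fare such as -3 would land in the -10 bin, not the 0 bin.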


In [0]:
%sql
-- select data from a table and create aggregations. this query will return an aggregated count of fares grouped into bins of 10
SELECT 
  FLOOR(fare_amount / 10) * 10 AS bin_fares,
  COUNT(*) AS count
FROM bronze_trips
GROUP BY bin_fares
ORDER BY bin_fares


### Query Data using PySpark

The code below shows how to query data in a Databricks table using PySpark. When you hover over the cell below, a label reading "Python" appears in its top right corner. The cell language must be set to Python in order to execute PySpark code in a notebook. When you create a new cell, click this label to change the cell language to "Python" if you want to write your code in Python.

#### Simple Query

In [0]:
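# select data from a table and only return a limited number of rows. this query will only return 100 records from the table
df = spark.table("bronze_trips")
# limit() keeps the result as a DataFrame, whereas head() collects a list of Row objects to the driver
display(df.limit(100))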

#### Time Interval Query

In [0]:
# select data from a table where a certain condition is met. this query will return all records that occurred within the last 10 years
from pyspark.sql.functions import current_date, expr

df = spark.table("bronze_trips")
display(
    df.filter(df.tpep_pickup_datetime >= (current_date() - expr("INTERVAL 10 YEARS")))
)
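To make the filter condition concrete, here is a plain-Python sketch of the same comparison: compute a cutoff ten years before a given date, then keep only the timestamps on or after it. The helper `years_before`, the fixed "today", and the sample pickup times are all hypothetical. Clamping February 29 to February 28 is an assumption made for this sketch and is not verified against Spark's behavior here.

```python
from datetime import date, datetime, time


def years_before(day: date, years: int) -> date:
    """Return the date `years` years before `day`, clamping Feb 29 to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # `day` is Feb 29 and the target year is not a leap year.
        return day.replace(year=day.year - years, day=28)


# A fixed "today" keeps the example reproducible; the notebook uses current_date().
today = date(2024, 6, 15)

# Comparing a timestamp with a date treats the date as midnight at the start of that day.
cutoff = datetime.combine(years_before(today, 10), time.min)

# Hypothetical pickup timestamps, made up for illustration.
pickups = [
    datetime(2013, 1, 1, 8, 30),
    datetime(2014, 6, 15, 0, 0),
    datetime(2016, 2, 1, 12, 0),
]
recent = [pickup for pickup in pickups if pickup >= cutoff]
print(cutoff, len(recent))
# prints: 2014-06-15 00:00:00 2
```

The comparison is inclusive (`>=`), so a trip that starts exactly at the cutoff is kept.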

#### Aggregate Grouping Query

The following PySpark code works as follows:

* It adds a column called bin_fares to the data (df), rounding each fare down to the nearest $10 using the floor function (for example, $23 becomes $20).

* It groups all trips by their bin_fares value.

* It counts how many trips are in each group.

* It sorts the result so all fare ranges appear in order.

In [0]:
# select data from a table and create aggregations. this query will return an aggregated count of fares grouped into bins of 10

from pyspark.sql import functions as F

df = spark.table("bronze_trips")

display(
    df.withColumn("bin_fares", F.floor(F.col("fare_amount") / 10) * 10)
    .groupBy("bin_fares")
    .count()
    .orderBy("bin_fares")
)

## Additional Databricks Training

If you are interested in learning more about programming with SQL and PySpark in Databricks, please check out the following material online at https://customer-academy.databricks.com/learn. You can set up an account with our customer academy using your `.gov` email address.

- [Transform Data with Spark](https://customer-academy.databricks.com/learn/courses/1878/transform-data-with-spark)
- [SQL Programming and Procedural Logic in Databricks](https://customer-academy.databricks.com/learn/course/4214/sql-programming-and-procedural-logic-in-databricks)
- [Introduction to Python for Data Science and Data Engineering](https://customer-academy.databricks.com/learn/course/view/elearning/1211/introduction-to-python-for-data-science-and-data-engineering)
- [Introduction to Apache Spark](https://customer-academy.databricks.com/learn/courses/3901/introduction-to-apache-spark)