# 1. Setup Data

## Attach the Notebook to a Cluster

### Compute
You need `compute` in order to execute notebooks in Databricks. 

What is compute? When you run a notebook in Databricks, your code needs processing power and memory to perform calculations and work with data. Compute provides the underlying resources, such as virtual machines or clusters, that execute your notebook commands so you can analyze data and get results.

For this workshop, you may already have a `shared compute` configured, or you may need to create your own `personal compute`.

> **If needed**, an example configuration is shown in this screenshot:

![compute](./screenshots/compute.png)

> Optional
* It's good practice to **tag** your compute for cost tracking. Two recommended default tags are `COST_CENTER` and `PROJECT`.
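
As a rough sketch of what tagging looks like outside the UI: when compute is defined through a JSON spec or the Databricks Clusters API, tags are supplied as a `custom_tags` map. The tag values below (`"1234"` and `"taxi-workshop"`) are placeholders, not values from this workshop, and the snippet only builds and prints the fragment. It does not create any compute.

```python
import json

# Placeholder tag values (assumptions); replace them with your organization's own.
custom_tags = {
    "COST_CENTER": "1234",
    "PROJECT": "taxi-workshop",
}

# Fragment of a compute spec; only the tags are shown here.
compute_spec_fragment = {"custom_tags": custom_tags}

print(json.dumps(compute_spec_fragment, indent=2))
```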

To run code in a Databricks notebook, the notebook must be attached to a compute cluster. To attach it, click the "Connect" button in the right-hand corner.


![compute-connect](./screenshots/compute-connect.png)


When you click "Connect", you will see a drop-down list of the compute clusters available to you. Select any active resource. If no options appear, reach out to your Databricks admin team.


![compute-starting](./screenshots/compute-starting.png)


Once you select an option and see a cluster with a green dot associated with your notebook, you are ready to move on to the next steps.


![compute-green](./screenshots/compute-green.png)



## Create the Database/Schema

In [0]:
%run "./1. Configure"

In [0]:
sql_command = f"CREATE SCHEMA IF NOT EXISTS {my_database}"
spark.sql(sql_command)
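
The f-string above inserts `my_database` into the SQL text as-is, so a name containing a hyphen or a space would fail to parse. One possible guard, sketched below, wraps the name in backticks (Spark SQL's identifier quoting) and doubles any embedded backtick. `quote_identifier` is a hypothetical helper, not part of Spark or this workshop, and `example_database` is a made-up value. The sketch also assumes a single-part schema name, not a `catalog.schema` pair.

```python
def quote_identifier(name: str) -> str:
    """Wrap a single-part identifier in backticks, doubling embedded backticks."""
    if not name:
        raise ValueError("identifier must not be empty")
    return "`" + name.replace("`", "``") + "`"


# Made-up example value; in the notebook, my_database comes from the Configure notebook.
example_database = "workshop-taxi"

example_sql = f"CREATE SCHEMA IF NOT EXISTS {quote_identifier(example_database)}"
print(example_sql)
```

In the notebook itself, the resulting string would be passed to `spark.sql(...)` in the same way as `sql_command`.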

## Load the Sample Data

#### Set the file location
The code below uses the Databricks utilities (`dbutils`) to find the folder that contains this notebook, then builds the locations of the sample data files relative to that folder.

In [0]:
# This code gets the current path of this notebook
context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
notebook_path = context.notebookPath().get()

# Since the notebook_path includes the notebook name, only keep up to the last '/' and ignore the filename
folder_path = notebook_path[:notebook_path.rfind("/")]

csv_location = f"file:/Workspace{folder_path}/data/taxi_raw.csv"
parquet_location = f"file:/Workspace{folder_path}/data/taxi_raw.parquet"
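
To make the string handling concrete, here is the same derivation run on a made-up notebook path. The path `/Users/someone@example.com/workshop/1. Setup Data` is purely illustrative. On Databricks, the real value comes from the notebook context above.

```python
import posixpath

# Made-up notebook path, standing in for context.notebookPath().get()
example_notebook_path = "/Users/someone@example.com/workshop/1. Setup Data"

# Keep everything before the last '/', which drops the notebook name
example_folder_path = example_notebook_path[: example_notebook_path.rfind("/")]

# posixpath.dirname gives the same result for a path like this one
assert example_folder_path == posixpath.dirname(example_notebook_path)

example_csv_location = f"file:/Workspace{example_folder_path}/data/taxi_raw.csv"

print(example_folder_path)
print(example_csv_location)
```

The slice works because `rfind("/")` returns the index of the last slash, so everything from that slash onward (the notebook name) is dropped.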

#### Read in a CSV

In [0]:
# Read the CSV, using the first row as column names and letting Spark infer the column types
df = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_location)
)

display(df)

#### Read in an existing Parquet file

In [0]:
# Read the Parquet file; its schema is stored in the file, so no read options are needed
df = (
    spark.read.parquet(parquet_location)
)

display(df)

In [0]:
# Save the DataFrame as a table, replacing the table if it already exists
df.write.mode("overwrite").saveAsTable(f"{my_database}.bronze_trips")
