## GOLD LAYER SCRIPT

#### DEFINING CREDENTIALS TO ACCESS THE DATA FROM DATALAKE

In [0]:
spark.conf.set("fs.azure.account.auth.type.nycdatalakestore.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.nycdatalakestore.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.nycdatalakestore.dfs.core.windows.net", "854a7cbb-3823-433d-90c7-79d255f4394f")
spark.conf.set("fs.azure.account.oauth2.client.secret.nycdatalakestore.dfs.core.windows.net", "Yju8Q~PqQCDQQa9jq0EzeSZrgdvvk3TqO3puNbCi")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.nycdatalakestore.dfs.core.windows.net", "https://login.microsoftonline.com/18ffd786-c707-4f2c-b8b0-5f49463d2c32/oauth2/token")

##### When I re-run the silver and gold notebook next day morning, the records in the delta tables got duplicated since the write mode was append. so i dropped the tables and then dropped entire database. Then i created a database and performed the read write steps again. The last command is to check if my database still exists or not. It returns False if the database is successfully dropped.

Commands i used to drop the delta tables and database:
 1. spark.sql("DROP TABLE IF EXISTS nyc_gold.Trip_Type")
 2. spark.sql("DROP TABLE IF EXISTS nyc_gold.trip_zone")
 3. spark.sql("DROP TABLE IF EXISTS nyc_gold.Trip_Data")
 4. spark.sql("DROP DATABASE IF EXISTS nyc_gold CASCADE")
 5. spark.catalog.databaseExists("nyc_gold")

Update:
 Even then, my records are duplicated. What i did?
 
 Root Cause Analysis: I checked the folders in my silver container and i noticed that the files are duplicated since i had re-run the code with the append mode this morning again. Hence, when i wrote the data to gold container, the records are duplicated in the delta tables that are being created. However, the actual problem was in silver notebook and was not with database.

 Fix: I deleted the folders in the silver container and run the cells again in the silver notebook but, this time I did only once. And here my write mode was 'append' because i want to write this data newly as i had deleted all the folders.

 Suggestion: If you have to re-run the cells again in your notebook whenever in future, make sure to keep your write mode as 'overwrite' and this doesn't duplicate the records/ files.


#### DATABASE CREATION USING SQL

In [0]:
%sql
CREATE DATABASE nyc_gold

#### IMPORTING NECESSARY LIBRARIES

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### STORAGE VARIABLES FOR PATH

In [0]:
silver = 'abfss://nyc-silver@nycdatalakestore.dfs.core.windows.net'
gold = 'abfss://nyc-gold@nycdatalakestore.dfs.core.windows.net'

#### DATA READING, WRITING AND CRAFTING DELTA TABLES
 1. Read the parquet format data stored in the silver container.
 2. Write this data to the Gold container in Delta format.
 3. Create a delta table using saveastable()

#### TRIP ZONE DATA
 Creating a dataframe to read and write the Trip_Zone data stored in the silver container

In [0]:
df_tripZone = spark.read.format('parquet')\
                        .option('inferschema', True)\
                        .option('header', True)\
                        .load(f'{silver}/Trip_Zone')

In [0]:
df_tripZone.display()

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
1,EWR,Newark Airport,EWR,Newark Airport,
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
3,Bronx,Allerton/Pelham Gardens,Boro Zone,Allerton,Pelham Gardens
4,Manhattan,Alphabet City,Yellow Zone,Alphabet City,
5,Staten Island,Arden Heights,Boro Zone,Arden Heights,
6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone,Arrochar,Fort Wadsworth
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,


In [0]:
df_tripZone.write.format('delta')\
                 .mode('append')\
                 .option('path', f'{gold}/Trip_Zone')\
                 .saveAsTable('nyc_gold.Trip_Zone')

##### Query the trip zone data

In [0]:
%sql
select * from nyc_gold.trip_zone
where Borough = 'Queens'

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,
15,Queens,Bay Terrace/Fort Totten,Boro Zone,Bay Terrace,Fort Totten
16,Queens,Bayside,Boro Zone,Bayside,
19,Queens,Bellerose,Boro Zone,Bellerose,
27,Queens,Breezy Point/Fort Tilden/Riis Beach,Boro Zone,Breezy Point,Fort Tilden
28,Queens,Briarwood/Jamaica Hills,Boro Zone,Briarwood,Jamaica Hills


#### TRIP TYPE DATA

In [0]:
df_tripType = spark.read.format('parquet')\
                        .option('inferschema', True)\
                        .option('header', True)\
                        .load(f'{silver}/Trip_type')

In [0]:
df_tripType.write.format('delta')\
                 .mode('append')\
                 .option('path', f'{gold}/Trip_Type')\
                 .saveAsTable('nyc_gold.Trip_Type')

#### NYC GREEN TAXI TRIPS DATA -2024

In [0]:
df_tripData = spark.read.format('parquet')\
                        .option('inferschema', True)\
                        .option('header', True)\
                        .load(f'{silver}/trip_data')

In [0]:
df_tripData.write.format('delta')\
                 .mode('append')\
                 .option('path', f'{gold}/Trip_Data')\
                 .saveAsTable('nyc_gold.Trip_Data')

#### LEARNING DELTA LAKE

In [0]:
%sql
select * from nyc_gold.trip_zone

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
1,EWR,Newark Airport,EWR,Newark Airport,
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
3,Bronx,Allerton/Pelham Gardens,Boro Zone,Allerton,Pelham Gardens
4,Manhattan,Alphabet City,Yellow Zone,Alphabet City,
5,Staten Island,Arden Heights,Boro Zone,Arden Heights,
6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone,Arrochar,Fort Wadsworth
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,


In [0]:
%sql
UPDATE nyc_gold.trip_zone
SET Borough = 'EMR'
WHERE LocationID = 1;

num_affected_rows
1


In [0]:
%sql
select * from nyc_gold.trip_zone
where LocationID = 1;

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
1,EMR,Newark Airport,EWR,Newark Airport,


In [0]:
%sql
DELETE FROM nyc_gold.trip_zone
WHERE LocationID = 1;

num_affected_rows
1


In [0]:
%sql
select * from nyc_gold.trip_zone
where LocationID = 1;

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2


In [0]:
%sql
select * from nyc_gold.trip_zone

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
3,Bronx,Allerton/Pelham Gardens,Boro Zone,Allerton,Pelham Gardens
4,Manhattan,Alphabet City,Yellow Zone,Alphabet City,
5,Staten Island,Arden Heights,Boro Zone,Arden Heights,
6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone,Arrochar,Fort Wadsworth
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,
11,Brooklyn,Bath Beach,Boro Zone,Bath Beach,


#### VERSIONING

In [0]:
%sql
DESCRIBE HISTORY nyc_gold.trip_zone

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
2,2025-05-01T09:18:12Z,148574835917075,vennela.kappallishivanna@gmail.com,DELETE,"Map(predicate -> [""(LocationID#9557 = 1)""])",,List(3537168131613602),0428-171015-cyf1ga8g,1.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numRemovedBytes -> 1953, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 901, numDeletionVectorsUpdated -> 0, numDeletedRows -> 1, scanTimeMs -> 739, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 162)",,Databricks-Runtime/14.3.x-scala2.12
1,2025-05-01T09:17:57Z,148574835917075,vennela.kappallishivanna@gmail.com,UPDATE,"Map(predicate -> [""(LocationID#8461 = 1)""])",,List(3537168131613602),0428-171015-cyf1ga8g,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1807, numDeletionVectorsUpdated -> 0, scanTimeMs -> 634, numAddedFiles -> 1, numUpdatedRows -> 1, numAddedBytes -> 1953, rewriteTimeMs -> 1159)",,Databricks-Runtime/14.3.x-scala2.12
0,2025-05-01T09:16:21Z,148574835917075,vennela.kappallishivanna@gmail.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> false, properties -> {""delta.enableDeletionVectors"":""true""}, statsOnLoad -> false)",,List(3537168131613602),0428-171015-cyf1ga8g,,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 265, numOutputBytes -> 10050)",,Databricks-Runtime/14.3.x-scala2.12


#### TIME TRAVEL

In [0]:
%sql
RESTORE nyc_gold.trip_zone TO VERSION AS OF 0

table_size_after_restore,num_of_files_after_restore,num_removed_files,num_restored_files,removed_files_size,restored_files_size
10050,1,1,1,10050,10050


In [0]:
%sql
select * from nyc_gold.trip_zone

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
1,EWR,Newark Airport,EWR,Newark Airport,
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
3,Bronx,Allerton/Pelham Gardens,Boro Zone,Allerton,Pelham Gardens
4,Manhattan,Alphabet City,Yellow Zone,Alphabet City,
5,Staten Island,Arden Heights,Boro Zone,Arden Heights,
6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone,Arrochar,Fort Wadsworth
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,


#### DELTA TABLES

 **Trip Type**

In [0]:
%sql
select * from nyc_gold.Trip_Type

trip_type,trip_description
1,Street-hail
2,Dispatch


**Trip Zone**

In [0]:
%sql
select * from nyc_gold.Trip_Zone

LocationID,Borough,Zone,service_zone,Zone_1,Zone_2
1,EWR,Newark Airport,EWR,Newark Airport,
2,Queens,Jamaica Bay,Boro Zone,Jamaica Bay,
3,Bronx,Allerton/Pelham Gardens,Boro Zone,Allerton,Pelham Gardens
4,Manhattan,Alphabet City,Yellow Zone,Alphabet City,
5,Staten Island,Arden Heights,Boro Zone,Arden Heights,
6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone,Arrochar,Fort Wadsworth
7,Queens,Astoria,Boro Zone,Astoria,
8,Queens,Astoria Park,Boro Zone,Astoria Park,
9,Queens,Auburndale,Boro Zone,Auburndale,
10,Queens,Baisley Park,Boro Zone,Baisley Park,


**Trip Data 2024**

In [0]:
%sql
select * from nyc_gold.Trip_Data

VendorId,PickUp_Date,DropOff_Date,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,tip_amount,total_amount
2,2024-05-01,2024-05-01,65,49,1,1.24,9.3,2.0,13.8
2,2024-05-01,2024-05-01,7,179,1,0.94,7.2,1.94,11.64
2,2024-05-01,2024-05-01,74,42,1,0.84,6.5,0.0,9.0
2,2024-05-01,2024-05-01,75,235,1,6.07,25.4,5.0,32.9
2,2024-05-01,2024-05-01,256,49,2,2.06,12.1,2.92,17.52
1,2024-05-01,2024-05-01,210,210,1,1.3,9.3,1.0,12.8
2,2024-05-01,2024-05-01,66,4,5,4.35,19.8,3.0,28.05
2,2024-05-01,2024-05-01,95,95,1,2.02,13.5,0.0,16.0
2,2024-05-01,2024-05-01,24,143,1,2.35,12.8,3.0,21.05
2,2024-05-01,2024-05-01,210,210,1,1.3,8.0,0.0,9.0
