Hi, I'm Jobert Gutierrez. Below you'll find the logic and code I used to complete the assignment for the dlt workshop in the Data Engineering Zoomcamp, offered by DataTalks.Club.

# __Workshop: Data Load Tool (dlt)__

__Dataset & API:__

We’ll use NYC Taxi data via the same custom API from the workshop:

🔹 Base API URL:

> https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api

🔹 Data format: Paginated JSON (1,000 records per page).

🔹 API Pagination: Stop when an empty page is returned.
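
To make the stop rule concrete, here is a minimal sketch of "request pages until one comes back empty". It does not call the real API: `fetch_page` is a hypothetical callable that takes a page number and returns a list of records, and `fake_fetch` is a made-up stand-in so the loop can run offline.

```python
from typing import Callable, Iterator

Page = list[dict]


def paginate(fetch_page: Callable[[int], Page], first_page: int = 1) -> Iterator[Page]:
    """Yield pages in order, stopping at the first empty one."""
    page_number = first_page
    while True:
        page = fetch_page(page_number)
        if not page:  # an empty page marks the end of the data
            return
        yield page
        page_number += 1


# Hypothetical stand-in for the API: two full pages, one short page, then nothing.
def fake_fetch(page_number: int) -> Page:
    sizes = {1: 1000, 2: 1000, 3: 250}
    return [{"id": i} for i in range(sizes.get(page_number, 0))]


pages = list(paginate(fake_fetch))
total_records = sum(len(page) for page in pages)
print(len(pages), total_records)
```

In the actual solution this loop is not written by hand; dlt's `PageNumberPaginator` takes care of it.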

### __Question 1.__

Install dlt:

> !pip install dlt[duckdb]

Or choose a different bracket—bigquery, redshift, etc.—if you prefer another primary destination. For this assignment, we’ll still do a quick test with DuckDB.

Check the version:

> !dlt --version

or:

```
import dlt
print("dlt version:", dlt.__version__)
```
Provide the version you see in the output.

### Answer: 
The version I see is __dlt 1.6.1__.

![Q1](Q1.png "Dlt version")
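
As an alternative check, the version can also be read from the installed package metadata without importing `dlt` itself. This is only a sketch using the standard library; `installed_version` is a helper name I made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


dlt_version = installed_version("dlt")
print("dlt version:", dlt_version or "not installed")
```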

### __Question 2. Define & Run the Pipeline (NYC Taxi API)__
Use dlt to extract all pages of data from the API.

Steps:

1️⃣ Use the @dlt.resource decorator to define the API source.

2️⃣ Implement automatic pagination using dlt's built-in REST client.

3️⃣ Load the extracted data into DuckDB for querying.

```
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


# your code is here


pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)
```

Load the data into DuckDB to test:

```
load_info = pipeline.run(ny_taxi)
print(load_info)
```

Open a connection to your database using the native DuckDB connection and see which tables were generated:

```
import duckdb
from google.colab import data_table
data_table.enable_dataframe_formatter()

# A database '<pipeline_name>.duckdb' was created in the working directory, so just connect to it

# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset
conn.sql("DESCRIBE").df()
```

How many tables were created?
- 2
- 4
- 6
- 8

### Answer: 
To answer this question I used this code for loading the data:

```
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

# Define the API resource for NYC taxi data
@dlt.resource(name="rides")
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        # Page numbers start at 1; with no total count available,
        # pagination stops when an empty page is returned
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )
    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page

# Define the new dlt pipeline
pipeline = dlt.pipeline(
    destination="duckdb",
)

# Run the pipeline with the new resource
load_info = pipeline.run(ny_taxi, write_disposition="replace")
print(load_info)
```

Then I used this code to list the tables:

```
# Show the outcome
import duckdb

# Create a connector
connector = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Let's see the tables
connector.sql(f"SET search_path = '{pipeline.dataset_name}'")
print('Loaded tables: ')
display(connector.sql("show tables"))
```

This produced the following tables:

![Q2](Q2.png "Tables uploaded")

So the answer is __4 tables__, as shown in the screenshot above.
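
The count is higher than one because dlt creates its own bookkeeping tables next to the data table. The sketch below separates the two groups by name prefix. The table names are an assumption on my part: they are what dlt typically creates for a single flat resource named `rides`, so compare them with the screenshot rather than taking them as given.

```python
# Assumed table names (typical dlt output for one flat resource called "rides");
# verify against the actual "show tables" result.
tables = ["_dlt_loads", "_dlt_pipeline_state", "_dlt_version", "rides"]

# dlt's internal tables share the "_dlt_" prefix; everything else is loaded data
internal_tables = [name for name in tables if name.startswith("_dlt_")]
data_tables = [name for name in tables if not name.startswith("_dlt_")]

print("internal:", internal_tables)
print("data:", data_tables)
print("total:", len(tables))
```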

### __Question 3. Explore the loaded data__
Inspect the table `rides`:

```
# Explore the data
pipeline.dataset(dataset_type="default").rides.df()
```

What is the total number of records extracted?

### Answer:
Using the code below:

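```
df = pipeline.dataset(dataset_type="default").rides.df()  # load the rides table into a DataFrame
df.shape  # (number of rows, number of columns)
```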

![Q3](Q3.png "Number of records in the dataset")

The number of records in the dataset is __10,000__.

### __Question 4. Trip Duration Analysis__
Run the SQL query below to calculate the average trip duration in minutes:

```
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
            FROM rides;
            """
        )
    # Prints the result rows (a single row holding the average)
    print(res)
```

What is the average trip duration in minutes?

### Answer:
Using the query above, I obtained an average trip duration of __12.3049 minutes__.

![Q4](Q4.png "Difference in minutes")
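
One detail worth keeping in mind when reading this number: as I understand DuckDB's `date_diff`, it counts the minute boundaries crossed between the two timestamps rather than the exact elapsed time, so each trip contributes a whole number of minutes to the average. The sketch below mimics that behavior in plain Python and compares it with the exact duration. The function names and the example timestamps are hypothetical, chosen only to show the difference.

```python
from datetime import datetime


def minute_boundaries(start: datetime, end: datetime) -> int:
    """Count minute boundaries crossed between start and end (seconds are ignored)."""
    start_floor = start.replace(second=0, microsecond=0)
    end_floor = end.replace(second=0, microsecond=0)
    return int((end_floor - start_floor).total_seconds() // 60)


def exact_minutes(start: datetime, end: datetime) -> float:
    """Exact elapsed time in minutes, including fractional minutes."""
    return (end - start).total_seconds() / 60


# Hypothetical trip lasting 2 seconds that happens to cross a minute boundary
pickup = datetime(2009, 6, 14, 23, 23, 59)
dropoff = datetime(2009, 6, 14, 23, 24, 1)

boundary_result = minute_boundaries(pickup, dropoff)
exact_result = exact_minutes(pickup, dropoff)
print(boundary_result)          # 1
print(round(exact_result, 4))   # 0.0333
```

Over many trips these effects largely even out, but the average of boundary counts is not guaranteed to equal the average of exact durations.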