# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice
Docker and SQL.

This repository should contain the code for solving the homework.

## Question 1. Understanding Docker images

Run docker with the `python:3.13` image. Use an entrypoint `bash` to interact with the container.

What's the version of `pip` in the image?

- 25.3
- 24.3.1
- 24.2.1
- 23.3.1

In [10]:
!docker run --rm python:3.13 pip --version

pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)
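
The cell above runs `pip --version` with the image's default entrypoint. The same check can be scripted with the `bash` entrypoint the question mentions. This is a minimal sketch, assuming Docker is installed locally; the helper names `parse_pip_version` and `pip_version_in_image` are hypothetical.

```python
import re
import subprocess

# Override the image's default entrypoint with bash, as the question asks,
# and let bash run a single command instead of opening an interactive shell.
DOCKER_CMD = ["docker", "run", "--rm", "--entrypoint", "bash",
              "python:3.13", "-c", "pip --version"]


def parse_pip_version(output: str) -> str:
    """Extract the version number from `pip --version` output."""
    match = re.match(r"pip (\d+(?:\.\d+)*)", output.strip())
    if match is None:
        raise ValueError(f"unexpected pip output: {output!r}")
    return match.group(1)


def pip_version_in_image() -> str:
    """Run the container (needs a local Docker daemon) and return pip's version."""
    result = subprocess.run(DOCKER_CMD, capture_output=True, text=True, check=True)
    return parse_pip_version(result.stdout)


# Parsing the output line recorded above, without starting a container.
sample = "pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)"
print(parse_pip_version(sample))  # prints 25.3
```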


## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that pgadmin should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432
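
Containers on the same Compose network reach each other by service name (or container name) and the container-side port; the published host port only matters for connections from the host machine. The sketch below applies that rule to the `db` service. It assumes the default Compose network and the two-part `HOST:CONTAINER` port syntax only, and the function names are hypothetical.

```python
def split_port_mapping(mapping: str) -> tuple[int, int]:
    """Split a short-syntax 'HOST:CONTAINER' port mapping into two integers."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


def in_network_addresses(service_name: str, service: dict) -> list[str]:
    """List the host:port pairs another container on the same network could use."""
    _, container_port = split_port_mapping(service["ports"][0])
    names = [service_name]
    if "container_name" in service:
        names.append(service["container_name"])
    return [f"{name}:{container_port}" for name in names]


# The relevant parts of the `db` service from the compose file above.
db_service = {"container_name": "postgres", "ports": ["5433:5432"]}
print(in_network_addresses("db", db_service))  # ['db:5432', 'postgres:5432']
```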

## Prepare the Data

Download the green taxi trips data for November 2025:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet
```

You will also need the dataset with zones:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
```

In [1]:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

In [59]:
df = pd.read_sql("green_taxi_data", engine)
df.head()

| | index | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | ... | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 2025-11-01 00:34:48 | 2025-11-01 00:41:39 | N | 1.0 | 74 | 42 | 1.0 | 0.74 | ... | 0.5 | 1.94 | 0.0 | | 1.0 | 11.64 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1 | 2 | 2025-11-01 00:18:52 | 2025-11-01 00:24:27 | N | 1.0 | 74 | 42 | 2.0 | 0.95 | ... | 0.5 | 0.0 | 0.0 | | 1.0 | 9.7 | 2.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 2 | 2025-11-01 01:03:14 | 2025-11-01 01:15:24 | N | 1.0 | 83 | 160 | 1.0 | 2.19 | ... | 0.5 | 5.0 | 0.0 | | 1.0 | 21.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 3 | 2 | 2025-11-01 00:10:57 | 2025-11-01 00:24:53 | N | 1.0 | 166 | 127 | 1.0 | 5.44 | ... | 0.5 | 0.5 | 0.0 | | 1.0 | 27.7 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 4 | 1 | 2025-11-01 00:03:48 | 2025-11-01 00:19:38 | N | 1.0 | 166 | 262 | 1.0 | 3.2 | ... | 1.5 | 1.0 | 0.0 | | 1.0 | 24.65 | 1.0 | 1.0 | 2.75 | 0.0 |


In [60]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 46912 entries, 0 to 46911
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  46912 non-null  int64  
 1   VendorID               46912 non-null  int64  
 2   lpep_pickup_datetime   46912 non-null  str    
 3   lpep_dropoff_datetime  46912 non-null  str    
 4   store_and_fwd_flag     41343 non-null  str    
 5   RatecodeID             41343 non-null  float64
 6   PULocationID           46912 non-null  int64  
 7   DOLocationID           46912 non-null  int64  
 8   passenger_count        41343 non-null  float64
 9   trip_distance          46912 non-null  float64
 10  fare_amount            46912 non-null  float64
 11  extra                  46912 non-null  float64
 12  mta_tax                46912 non-null  float64
 13  tip_amount             46912 non-null  float64
 14  tolls_amount           46912 non-null  float64
 15  ehail_fee    

## Question 3. Counting short trips

For the trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a `trip_distance` of less than or equal to 1 mile?

- 7,853
- 8,007
- 8,254
- 8,421

In [41]:
query = '''
SELECT * FROM green_taxi_data t
WHERE DATE(t.lpep_pickup_datetime) >= '2025-11-01' 
AND DATE(t.lpep_pickup_datetime) < '2025-12-01'
AND t.trip_distance <= 1.0;
'''

df = pd.read_sql(query, engine)
len(df)

8007
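
Fetching every matching row just to call `len()` moves the whole result set to the client. A possible alternative is to let the database count, using a half-open interval on the raw column. The sketch below demonstrates the idea on an in-memory SQLite table with hypothetical rows (not the real dataset), storing timestamps as text the way `df.info()` reports them.

```python
import sqlite3

# Hypothetical sample rows (pickup timestamp, trip distance); not the real data.
rows = [
    ("2025-10-31 23:59:59", 0.5),   # before the window
    ("2025-11-01 00:00:00", 0.7),   # first instant of the window
    ("2025-11-15 12:00:00", 1.0),   # exactly 1 mile still counts
    ("2025-11-15 13:00:00", 1.01),  # longer than 1 mile
    ("2025-11-30 23:59:59", 0.2),   # last second of the window
    ("2025-12-01 00:00:00", 0.3),   # upper bound is excluded
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_data (lpep_pickup_datetime TEXT, trip_distance REAL)"
)
conn.executemany("INSERT INTO green_taxi_data VALUES (?, ?)", rows)

# Half-open interval: lower bound inclusive, upper bound exclusive.
count_query = """
SELECT COUNT(*)
FROM green_taxi_data
WHERE lpep_pickup_datetime >= '2025-11-01'
  AND lpep_pickup_datetime <  '2025-12-01'
  AND trip_distance <= 1.0
"""
short_trips = conn.execute(count_query).fetchone()[0]
print(short_trips)  # 3 for the sample rows above
```

Against the Postgres table, the same `COUNT(*)` query text should be usable through `pd.read_sql`, which would then return a single-row result instead of thousands of rows.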

## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance? Only consider trips with `trip_distance` less than 100 miles (to exclude data errors).

Use the pick up time for your calculations.

- 2025-11-14
- 2025-11-20
- 2025-11-23
- 2025-11-25

In [42]:
query = '''
SELECT DATE(t.lpep_pickup_datetime), t.trip_distance FROM green_taxi_data t
WHERE t.trip_distance < 100.0
ORDER BY t.trip_distance DESC
LIMIT 1;
'''

df = pd.read_sql(query, engine)
df.head()

| | date | trip_distance |
|---|---|---|
| 0 | 2025-11-14 | 88.03 |
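
The query above sorts individual trips. To see the longest trip for each day, as the question title suggests, the trips can be grouped by pickup day first. This is a sketch on an in-memory SQLite table with hypothetical rows, not the real dataset; it groups by the `DATE(...)` expression, which the Postgres query above already uses.

```python
import sqlite3

# Hypothetical sample rows (pickup timestamp, trip distance).
trips = [
    ("2025-11-14 08:00:00", 12.5),
    ("2025-11-14 09:30:00", 40.0),
    ("2025-11-20 10:00:00", 35.0),
    ("2025-11-23 11:00:00", 250.0),  # filtered out as a data error (>= 100 miles)
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_data (lpep_pickup_datetime TEXT, trip_distance REAL)"
)
conn.executemany("INSERT INTO green_taxi_data VALUES (?, ?)", trips)

# One row per pickup day with that day's longest valid trip, longest first.
daily_max_query = """
SELECT DATE(lpep_pickup_datetime) AS pickup_day,
       MAX(trip_distance) AS longest_trip
FROM green_taxi_data
WHERE trip_distance < 100.0
GROUP BY DATE(lpep_pickup_datetime)
ORDER BY longest_trip DESC
LIMIT 1
"""
best_day = conn.execute(daily_max_query).fetchone()
print(best_day)  # ('2025-11-14', 40.0)
```

Dropping the `LIMIT 1` would list every day with its maximum, which makes it easier to compare the candidate answers side by side.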


## Question 5. Biggest pickup zone

Which was the pickup zone with the largest `total_amount` (sum of all trips) on November 18th, 2025?

- East Harlem North
- East Harlem South
- Morningside Heights
- Forest Hills

In [22]:
df_zones = pd.read_sql("taxi_zone_lookup", engine)
df_zones.head()

| | index | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|---|
| 0 | 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 4 | 5 | Staten Island | Arden Heights | Boro Zone |


In [61]:
query = '''
SELECT z."Zone" as zone, SUM(t.total_amount) as total_amount_sum
FROM green_taxi_data t

JOIN taxi_zone_lookup z
ON t."PULocationID" = z."LocationID"

WHERE DATE(t.lpep_pickup_datetime) = '2025-11-18'
GROUP BY zone
ORDER BY total_amount_sum DESC
LIMIT 1;
'''

df = pd.read_sql(query, engine)
df.head()

| | zone | total_amount_sum |
|---|---|---|
| 0 | East Harlem North | 9281.92 |


## Question 6. Largest tip

For the passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip?

Note: it's `tip`, not `trip`. We need the name of the zone, not the ID.

- JFK Airport
- Yorkville West
- East Harlem North
- LaGuardia Airport

In [72]:
query = '''
SELECT
    zdo."Zone" AS dropoff_zone,
    t.tip_amount
FROM green_taxi_data t

JOIN taxi_zone_lookup zpu
    ON t."PULocationID" = zpu."LocationID"
    
JOIN taxi_zone_lookup zdo
    ON t."DOLocationID" = zdo."LocationID"
    
WHERE zpu."Zone" = 'East Harlem North'
  AND DATE(t.lpep_pickup_datetime) >= '2025-11-01' AND DATE(t.lpep_pickup_datetime) <= '2025-11-30'
ORDER BY t.tip_amount DESC
LIMIT 1;
'''

df = pd.read_sql(query, engine)
df.head()

| | dropoff_zone | tip_amount |
|---|---|---|
| 0 | Yorkville West | 81.89 |
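
The query above joins the zone lookup table twice, once per location column, under two aliases. The sketch below reproduces that pattern on a tiny in-memory SQLite database so it can be run without the Postgres container. The zone names and tips are hypothetical, the date filter is omitted for brevity, and the pickup zone is passed as a bound parameter (`?` is `sqlite3`'s placeholder style; other drivers use different ones).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxi_zone_lookup (LocationID INTEGER, Zone TEXT);
CREATE TABLE green_taxi_data (
    PULocationID INTEGER, DOLocationID INTEGER, tip_amount REAL
);
INSERT INTO taxi_zone_lookup VALUES (1, 'Zone A'), (2, 'Zone B'), (3, 'Zone C');
INSERT INTO green_taxi_data VALUES (1, 2, 5.0), (1, 3, 12.5), (2, 3, 99.0);
""")

# zpu resolves the pickup zone name, zdo resolves the drop-off zone name.
largest_tip_query = """
SELECT zdo.Zone AS dropoff_zone, t.tip_amount
FROM green_taxi_data t
JOIN taxi_zone_lookup zpu ON t.PULocationID = zpu.LocationID
JOIN taxi_zone_lookup zdo ON t.DOLocationID = zdo.LocationID
WHERE zpu.Zone = ?
ORDER BY t.tip_amount DESC
LIMIT 1
"""
result = conn.execute(largest_tip_query, ("Zone A",)).fetchone()
print(result)  # ('Zone C', 12.5)
```

The 99.0 tip is ignored because that trip was picked up in Zone B, which shows that the filter applies to the pickup alias only.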


## Terraform

In this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.

Install Terraform on your GCP VM, laptop, or GitHub Codespace.
Copy the files from the course repo
[here](../../../01-docker-terraform/terraform/terraform) to that machine.

Modify the files as necessary to create a GCP Bucket and BigQuery Dataset.


## Question 7. Terraform Workflow

Which of the following sequences, respectively, describes the workflow for:

1. Downloading the provider plugins and setting up the backend
2. Generating proposed changes and auto-executing the plan
3. Removing all resources managed by Terraform

Answers:
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm
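
The three stages can also be scripted. The sketch below is one way to chain them from Python, assuming the `terraform` binary is on `PATH` and the working directory contains the `.tf` files. The names `TERRAFORM_WORKFLOW` and `run_workflow` are hypothetical, and the `-auto-approve` flag on `destroy` mirrors the notebook cells in this write-up rather than anything the question requires.

```python
import subprocess

# The three workflow stages from the question, in order.
TERRAFORM_WORKFLOW = [
    ["terraform", "init"],                      # download provider plugins, set up the backend
    ["terraform", "apply", "-auto-approve"],    # generate the plan and apply it without a prompt
    ["terraform", "destroy", "-auto-approve"],  # remove every resource Terraform manages
]


def run_workflow(commands=TERRAFORM_WORKFLOW, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    `runner` is injectable so the sequence can be exercised without
    Terraform installed (for example in a test).
    """
    for command in commands:
        runner(command, check=True)
```

Calling `run_workflow()` as written creates the resources and then immediately destroys them, so in practice the last entry would usually be run separately once the resources are no longer needed.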


In [14]:
!dir

 Volume in drive D is Mis cosas
 Volume Serial Number is 1EA7-E63F

 Directory of D:\Cursos\DEZoomcamp2026\homework\01-docker-terraform\pipeline

26/01/2026  01:37    <DIR>          .
21/01/2026  23:47    <DIR>          ..
25/01/2026  21:40               129 .gitignore
25/01/2026  21:39    <DIR>          .ipynb_checkpoints
22/01/2026  00:06                 5 .python-version
22/01/2026  17:24    <DIR>          .venv
26/01/2026  01:36            28.287 Answers_about_SQL.ipynb
25/01/2026  20:32               833 docker-compose.yaml
22/01/2026  21:09               496 Dockerfile
25/01/2026  20:36             2.855 ingest_data.py
22/01/2026  00:06                86 main.py
26/01/2026  01:25               490 main.tf
26/01/2026  00:31             2.358 mycreds.json
22/01/2026  12:03                77 pipeline.py
22/01/2026  20:58               369 pyproject.toml
22/01/2026  20:59           245.702 uv.lock
              12 File(s)        281.687 bytes
               4 Dir(s)

In [15]:
!terraform init

Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/google versions matching "7.16.0"...
- Installing hashicorp/google v7.16.0...
- Installed hashicorp/google v7.16.0 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

In [20]:
%env GOOGLE_APPLICATION_CREDENTIALS=D:\Cursos\DEZoomcamp2026\homework\01-docker-terraform\pipeline\mycreds.json

env: GOOGLE_APPLICATION_CREDENTIALS=D:\Cursos\DEZoomcamp2026\homework\01-docker-terraform\pipeline\mycreds.json


In [21]:
!echo %GOOGLE_APPLICATION_CREDENTIALS%

D:\Cursos\DEZoomcamp2026\homework\01-docker-terraform\pipeline\mycreds.json
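
The Google provider can pick up credentials through the `GOOGLE_APPLICATION_CREDENTIALS` variable, so a quick check that it points at an existing file can save a failed `terraform apply`. This is a minimal sketch; the helper name `credentials_path` is hypothetical, and it only verifies that the file exists, not that the key is valid.

```python
import os
from pathlib import Path


def credentials_path(env=os.environ) -> Path:
    """Return the credentials file path, or raise if it is unset or missing."""
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"credentials file not found: {path}")
    return path
```

Calling `credentials_path()` in a notebook cell right after the `%env` line would surface a typo in the path before Terraform is invoked.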


In [22]:
!terraform apply -auto-approve


Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_storage_bucket.terra-test will be created
  + resource "google_storage_bucket" "terra-test" {
      + effective_labels            = {
          + "goog-terraform-provisioned" = "true"
        }
      + force_destroy               = true
      + id                          = (known after apply)
      + location                    = "US"
      + name                        = "terra-test-485502"
      + project                     = (known after apply)
      + project_number              = (known after apply)
      + public_access_prevention    = (known after apply)
      + rpo                         = (known after a

In [23]:
!terraform destroy -auto-approve

google_storage_bucket.terra-test: Refreshing state... [id=terra-test-485502]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  # google_storage_bucket.terra-test will be destroyed
  - resource "google_storage_bucket" "terra-test" {
      - default_event_based_hold    = false -> null
      - effective_labels            = {
          - "goog-terraform-provisioned" = "true"
        } -> null
      - enable_object_retention     = false -> null
      - force_destroy               = true -> null
      - id                          = "terra-test-485502" -> null
      - labels                      = {} -> null
      