https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2023/week_1_docker_sql/homework.md

## Week 1 Homework

In this homework we'll prepare the environment
and practice with Docker and SQL.

## Question 1. Knowing docker tags

Run the command to get information on Docker 

```docker --help```

Now run the command to get help on the `docker build` command.

Which tag has the following text? - *Write the image ID to the file* 

In [1]:
!docker build --help | grep 'Write the image ID to the file'

      --iidfile string          Write the image ID to the file


## Question 2. Understanding docker first run 

Run Docker with the `python:3.9` image in interactive mode, with `bash` as the entrypoint.
Now check the Python modules that are installed (use `pip list`).
How many Python packages/modules are installed?

```bash
docker run -it python:3.9 bash
```

In [2]:
!docker run -t --rm python:3.9 pip list

Package    Version
---------- -------
pip        22.0.4
setuptools 58.1.0
wheel      0.38.4
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.

# Prepare Postgres

Run Postgres and load the data as shown in the videos.
We'll use the green taxi trips from January 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and put it into Postgres (with Jupyter notebooks or with a pipeline).
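
For the pipeline route, one possible shape is to stream the gzipped CSV in batches rather than loading it all at once. The sketch below is only an illustration under stated assumptions: it uses the standard-library `sqlite3` as a stand-in for Postgres so that it runs without a server, the helper name `load_csv_gz` is hypothetical, and the three sample rows are made up rather than taken from the real file.

```python
import csv
import gzip
import io
import sqlite3
from itertools import islice


def load_csv_gz(raw_bytes, conn, table, batch_size=1000):
    """Stream a gzipped CSV into `table` in batches; return the rows inserted."""
    total = 0
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        while True:
            # Pull at most batch_size rows so memory use stays bounded
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', batch)
            total += len(batch)
    conn.commit()
    return total


# Hypothetical sample standing in for green_tripdata_2019-01.csv.gz
sample = gzip.compress(
    b"lpep_pickup_datetime,trip_distance\n"
    b"2019-01-15 19:27:58,117.99\n"
    b"2019-01-01 00:10:00,1.5\n"
    b"2019-01-25 08:00:00,3.2\n"
)

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
inserted = load_csv_gz(sample, conn, "green_tripdata", batch_size=2)
print(inserted)
```

With the real file, the bytes would come from disk and the connection would point at the `ny_taxi` database instead of an in-memory SQLite database.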

In [3]:
import pandas as pd

# The wget above saves a .csv.gz file; pandas infers gzip compression from the extension
df = pd.read_csv('green_tripdata_2019-01.csv.gz')
df_zones = pd.read_csv('taxi+_zone_lookup.csv')

In [4]:
df[['lpep_pickup_datetime', 'lpep_dropoff_datetime']].dtypes

lpep_pickup_datetime     object
lpep_dropoff_datetime    object
dtype: object

In [5]:
# Data transformation: parse the timestamp columns (read as strings) into datetimes
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
df['lpep_dropoff_datetime'] = pd.to_datetime(df['lpep_dropoff_datetime'])

In [6]:
df[['lpep_pickup_datetime', 'lpep_dropoff_datetime']].dtypes

lpep_pickup_datetime     datetime64[ns]
lpep_dropoff_datetime    datetime64[ns]
dtype: object

```python
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

df.to_sql(name='green_tripdata', con=engine, if_exists='replace')
df_zones.to_sql(name='zones', con=engine, if_exists='replace')
```

In [7]:
%load_ext sql
%sql postgresql://root:root@localhost:5432/ny_taxi

## Question 3. Count records 

How many taxi trips were made entirely on January 15?

Tip: count trips that both started and finished on 2019-01-15.

Remember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns hold timestamps (date plus hour, minute, and second), not dates.

In [8]:
from datetime import date

date_trips = date.fromisoformat('2019-01-15')
df[(df['lpep_pickup_datetime'].dt.date == date_trips) & (df['lpep_dropoff_datetime'].dt.date == date_trips)].shape[0]

20530

In [9]:
%%sql
SELECT COUNT(*) FROM green_tripdata
WHERE CAST(lpep_pickup_datetime AS DATE) = '2019-01-15' AND CAST(lpep_dropoff_datetime AS DATE) = '2019-01-15'

 * postgresql://root:***@localhost:5432/ny_taxi
1 rows affected.


count
20530
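
The `CAST(... AS DATE)` filter above can also be written as a half-open timestamp range, which leaves the column itself unwrapped. The following is a minimal sketch, not the homework's reference solution: it uses an in-memory `sqlite3` database with four made-up rows, and SQLite compares the ISO-formatted strings lexicographically, whereas Postgres would compare real timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_tripdata ("
    "lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT)"
)

# Hypothetical trips, chosen to cover the boundary cases
rows = [
    ("2019-01-15 00:00:00", "2019-01-15 00:20:00"),  # entirely on the 15th
    ("2019-01-15 23:50:00", "2019-01-16 00:05:00"),  # finishes the next day
    ("2019-01-14 23:55:00", "2019-01-15 00:10:00"),  # starts the day before
    ("2019-01-15 12:00:00", "2019-01-15 23:59:59"),  # entirely on the 15th
]
conn.executemany("INSERT INTO green_tripdata VALUES (?, ?)", rows)

# Half-open range [2019-01-15, 2019-01-16) applied to both timestamps
query = """
SELECT COUNT(*) FROM green_tripdata
WHERE lpep_pickup_datetime  >= '2019-01-15' AND lpep_pickup_datetime  < '2019-01-16'
  AND lpep_dropoff_datetime >= '2019-01-15' AND lpep_dropoff_datetime < '2019-01-16'
"""
same_day_trips = conn.execute(query).fetchone()[0]
print(same_day_trips)  # 2
```

Only the two trips that both start and finish on the 15th are counted, which is the same condition the pandas and SQL cells above express with a date cast.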


## Question 4. Largest trip for each day

Which was the day with the largest trip distance?
Use the pick-up time for your calculations.

In [10]:
df[df['trip_distance'] == df['trip_distance'].max()]['lpep_pickup_datetime']

297377   2019-01-15 19:27:58
Name: lpep_pickup_datetime, dtype: datetime64[ns]

In [11]:
%%sql
SELECT CAST(lpep_pickup_datetime AS DATE) AS "date" FROM green_tripdata
ORDER BY trip_distance DESC
LIMIT 1

 * postgresql://root:***@localhost:5432/ny_taxi
1 rows affected.


date
2019-01-15


In [12]:
%%sql
SELECT CAST(lpep_pickup_datetime AS DATE) AS "date", MAX(trip_distance) AS "max_dist" FROM green_tripdata
GROUP BY CAST(lpep_pickup_datetime AS DATE)
ORDER BY "max_dist" DESC
LIMIT 1

 * postgresql://root:***@localhost:5432/ny_taxi
1 rows affected.


date,max_dist
2019-01-15,117.99


In [13]:
# Note: this ranks days by total distance (sum), not by the single longest trip
df2 = pd.DataFrame()
df2['lpep_pickup_date'] = df['lpep_pickup_datetime'].dt.date
df2['trip_distance'] = df['trip_distance']

df3 = df2.groupby('lpep_pickup_date').sum()
df3[df3['trip_distance'] == df3['trip_distance'].max()]

lpep_pickup_date,trip_distance
2019-01-25,83745.79
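
The cells above give two different days because they answer two different questions: the day containing the single longest trip (2019-01-15) and the day with the largest summed distance (2019-01-25). A small standard-library sketch with made-up trips shows how the two aggregations can disagree; the values are hypothetical and chosen only to reproduce that effect.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trips: (pickup timestamp, distance)
trips = [
    ("2019-01-15 19:27:58", 117.99),
    ("2019-01-15 08:00:00", 1.20),
    ("2019-01-25 09:00:00", 60.00),
    ("2019-01-25 18:30:00", 70.00),
]

longest = {}                # day -> longest single trip that day
total = defaultdict(float)  # day -> summed distance that day
for ts, dist in trips:
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
    longest[day] = max(longest.get(day, 0.0), dist)
    total[day] += dist

day_of_longest_trip = max(longest, key=longest.get)
day_of_largest_total = max(total, key=total.get)
print(day_of_longest_trip, day_of_largest_total)  # 2019-01-15 2019-01-25
```

The question asks about the largest trip distance, so the maximum per day (or simply the overall maximum) is the relevant aggregation.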


## Question 5. The number of passengers

On 2019-01-01, how many trips had 2 passengers, and how many had 3?

In [14]:
date_trips = date.fromisoformat('2019-01-01')
df[df['lpep_pickup_datetime'].dt.date == date_trips].groupby('passenger_count').size()

passenger_count
0       21
1    12415
2     1282
3      254
4      129
5      616
6      273
dtype: int64

In [15]:
%%sql
SELECT passenger_count, count(*) FROM green_tripdata
WHERE CAST(lpep_pickup_datetime AS DATE) = '2019-01-01'
GROUP BY passenger_count 

 * postgresql://root:***@localhost:5432/ny_taxi
7 rows affected.


passenger_count,count
0,21
1,12415
2,1282
3,254
4,129
5,616
6,273


## Question 6. Largest tip

For the passengers picked up in the Astoria zone, which drop-off zone had the largest tip?
We want the name of the zone, not the ID.

Note: it's not a typo, it's `tip`, not `trip`.

- Central Park
- Jamaica
- South Ozone Park
- Long Island City/Queens Plaza

In [16]:
id_to_zone = dict(zip(df_zones['LocationID'], df_zones['Zone'])) 
zone_to_id = dict(zip(df_zones['Zone'], df_zones['LocationID'])) 

zone = 'Astoria'
df2 = df[df['PULocationID'] == zone_to_id[zone]]
id_to_zone[df2[df2['tip_amount'] == df2['tip_amount'].max()]['DOLocationID'].values[0]]

'Long Island City/Queens Plaza'

In [17]:
%%sql
SELECT
    zdo."Zone", MAX(tip_amount) AS "max_tip"
FROM
    green_tripdata t
    JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
    JOIN zones zdo ON t."DOLocationID" = zdo."LocationID"
WHERE
    zpu."Zone" = 'Astoria'
GROUP BY
    zdo."Zone"
ORDER BY
    "max_tip" DESC
LIMIT 1

 * postgresql://root:***@localhost:5432/ny_taxi
1 rows affected.


Zone,max_tip
Long Island City/Queens Plaza,88.0
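
The query above joins `zones` twice, once for the pick-up ID and once for the drop-off ID. To see that pattern in isolation, here is a minimal sketch that runs the same shape of query against an in-memory `sqlite3` database. The location IDs and tip amounts are hypothetical and do not come from the real lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT)')
conn.execute(
    'CREATE TABLE green_tripdata ('
    '"PULocationID" INTEGER, "DOLocationID" INTEGER, tip_amount REAL)'
)

# Hypothetical IDs and tips, for illustration only
conn.executemany(
    "INSERT INTO zones VALUES (?, ?)",
    [(1, "Astoria"), (2, "Jamaica"), (3, "Central Park")],
)
conn.executemany(
    "INSERT INTO green_tripdata VALUES (?, ?, ?)",
    [
        (1, 2, 5.0),   # Astoria -> Jamaica
        (1, 3, 12.5),  # Astoria -> Central Park
        (1, 3, 2.0),   # Astoria -> Central Park
        (2, 3, 99.0),  # not picked up in Astoria, so filtered out
    ],
)

# zpu resolves the pick-up zone name, zdo resolves the drop-off zone name
query = """
SELECT zdo."Zone", MAX(t.tip_amount) AS max_tip
FROM green_tripdata t
JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo ON t."DOLocationID" = zdo."LocationID"
WHERE zpu."Zone" = 'Astoria'
GROUP BY zdo."Zone"
ORDER BY max_tip DESC
LIMIT 1
"""
top_zone, top_tip = conn.execute(query).fetchone()
print(top_zone, top_tip)  # Central Park 12.5
```

The 99.0 tip is ignored because its trip did not start in Astoria, which is why the pick-up filter has to be applied before taking the maximum.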


## Submitting the solutions

* Form for submitting: [form](https://forms.gle/EjphSkR1b3nsdojv7)
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 26 January (Thursday), 22:00 CET