# Ingesting NYC taxi data into a PostgreSQL database

## Load packages and NYC taxi data from January 2021

Load packages.

In [1]:
from pathlib import Path

import pandas as pd
import sqlalchemy as sa

Load NYC taxi data from January 2021.

In [2]:
# Paths are relative to the notebook's working directory.
DATA_PATH = Path("../data/")
PGSERVER_PATH = Path("../../pg-server/")
HOST_PATH = Path("../../host/")
CERTS_PATH = Path("../certs/")

In [None]:
nyc_taxi = pd.read_parquet(DATA_PATH/"yellow_tripdata_2021-01_prepared.parquet")

nyc_taxi

## Define PostgreSQL schema for the table storing NYC taxi data

Define the SQLAlchemy engine to enable communications between a client and our PostgreSQL server.

In [4]:
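username = "fmerinocasallo_writer"
# Read the password from the first line of the password file.
with open(PGSERVER_PATH/"passwds/pg-fmerinocasallo_writer-passwd.txt") as f:
    passwd = f.readline().rstrip()

hostname = "172.19.0.70"
# Take the port from the last active `port = <value>` line in postgresql.conf.
with open(PGSERVER_PATH/"conf/postgresql.conf") as f:
    port = [
        line.split(" ")[2].rstrip()
        for line in f
        if line.startswith("port")
    ][-1]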

database = "de_zoomcamp"
schema = "nyc_taxi"

url = f"postgresql://{username}:{passwd}@{hostname}/{database}"
connect_args = {
    "port": port,
    "sslmode": "verify-full",
    "sslrootcert": PGSERVER_PATH/"certs/ca/server/server-ca.crt",
    "sslcert": HOST_PATH/"certs/client/writer/fmerinocasallo_writer.crt",
    "sslkey": HOST_PATH/"certs/client/writer/fmerinocasallo_writer.key",
}

engine = sa.create_engine(url=url, connect_args=connect_args, echo=True)
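The cell above reads the port by splitting on single spaces and interpolates the password directly into the URL. As a minimal sketch, assuming a `postgresql.conf` that may use tabs, trailing comments, or no equals sign, and a password that may contain characters such as `@` or `/`, the same two steps could be made more tolerant. The helpers `read_port` and `build_url` and the sample text are hypothetical, not part of the original notebook, and the fallback of 5432 assumes the stock PostgreSQL default.

```python
import re
from urllib.parse import quote_plus

# Matches an active (uncommented) `port` setting, e.g. "port = 5433  # note".
PORT_RE = re.compile(r"^\s*port\s*(?:=\s*|\s)'?(\d+)'?\s*(?:#.*)?$")


def read_port(conf_text, default=5432):
    """Return the last active port setting in the text, or `default` if none is set."""
    port = default
    for line in conf_text.splitlines():
        match = PORT_RE.match(line)
        if match:
            port = int(match.group(1))  # later settings override earlier ones
    return port


def build_url(username, passwd, hostname, database):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return f"postgresql://{quote_plus(username)}:{quote_plus(passwd)}@{hostname}/{database}"


# Hypothetical configuration text and credentials, for illustration only.
sample_conf = "#port = 5432\t\t# default\nport = 5433\t\t\t# (change requires restart)\n"
print(read_port(sample_conf))
print(build_url("writer", "p@ss/word", "172.19.0.70", "de_zoomcamp"))
```

In the notebook, `read_port` would receive the contents of `PGSERVER_PATH/"conf/postgresql.conf"`, and `build_url` would replace the f-string that defines `url`.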

Check the SQL statement that pandas suggests for creating a new table to store the processed data.

In [None]:
print(pd.io.sql.get_schema(nyc_taxi, "nyc_taxi"))

Choose the column types for the new PostgreSQL table that will store the NYC taxi trips from January 2021. According
to [PostgreSQL's official documentation](https://www.postgresql.org/docs/current/datatype-numeric.html):

1. On all currently supported platforms, the `REAL` type has a range of around 1E-37 to 1E+37 with a
precision of at least 6 decimal digits. The `DOUBLE PRECISION` type has a range of around 1E-307 to 1E+308 with a
precision of at least 15 digits.
2. The type `INTEGER` is the common choice, as it offers the best balance between range, storage size, and performance.
The `SMALLINT` type is generally only used if disk space is at a premium. The `BIGINT` type is designed to be used when
the range of the `INTEGER` type is insufficient.

For our use case, we consider `REAL` and `INTEGER` the most suitable data types for all numerical columns except `dt`.
The `dt` column stored the duration of each trip as 'timedelta' values in our Parquet file and will now store it as
integer values (in nanoseconds) in the database, so we opt for `BIGINT`: a 15-minute trip already equals 9E+11 ns,
which exceeds the `INTEGER` maximum of about 2.1E+9. Using the inexact `REAL` type instead of `NUMERIC`/`DECIMAL`
offers noticeably faster arithmetic at the expense of negligible precision losses, as monetary amounts in this sector
are stored with at most 2 decimal digits.
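The sizing argument above can be checked with a few lines of arithmetic. This is a minimal sketch using only the standard library: the bounds are the documented limits of PostgreSQL's 4-byte `INTEGER` and 8-byte `BIGINT`, while the helper names `fits` and `through_real` and the sample amounts are illustrative assumptions. `through_real` approximates what `REAL` does to a value by round-tripping it through a 4-byte IEEE 754 float.

```python
import struct

INTEGER_MAX = 2**31 - 1  # upper bound of PostgreSQL's 4-byte INTEGER
BIGINT_MAX = 2**63 - 1   # upper bound of PostgreSQL's 8-byte BIGINT
NS_PER_SECOND = 10**9

fifteen_minutes_ns = 15 * 60 * NS_PER_SECOND  # 900_000_000_000


def fits(value, upper_bound):
    """Return True if a non-negative value fits below the given upper bound."""
    return 0 <= value <= upper_bound


def through_real(value):
    """Round-trip a Python float through a 4-byte IEEE 754 float (like REAL)."""
    return struct.unpack("f", struct.pack("f", value))[0]


print(fits(fifteen_minutes_ns, INTEGER_MAX))  # a 15-minute trip vs. INTEGER
print(fits(fifteen_minutes_ns, BIGINT_MAX))   # a 15-minute trip vs. BIGINT
print(INTEGER_MAX / NS_PER_SECOND)            # longest duration (s) INTEGER could hold

# Illustrative monetary amounts: do they survive REAL once rounded to cents?
for amount in (2.5, 12.34, 1234.56, 9999.99):
    print(amount, round(through_real(amount), 2) == amount)
```

An `INTEGER` column could hold just over two seconds of nanoseconds, which is why `dt` needs `BIGINT`. The amounts shown, which have at most six significant digits, round-trip to the cent. Amounts with more significant digits, far above typical fares, may not.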

Define the PostgreSQL schema for the table storing the NYC taxi data associated with the trips from January 2021.

In [6]:
table_name = "yellow_taxi_trips"

schema_name = "nyc_taxi"
schema_dtypes = {
    "tpep_pickup_datetime": sa.types.TIMESTAMP,
    "tpep_dropoff_datetime": sa.types.TIMESTAMP,
    "dt": sa.types.BIGINT,
    "trip_distance": sa.types.REAL,
    "avg_speed": sa.types.REAL,
    "PULocationID": sa.types.INTEGER,
    "DOLocationID": sa.types.INTEGER,
    "RatecodeID": sa.types.INTEGER,
    "passenger_count": sa.types.INTEGER,
    "total_amount": sa.types.REAL,
    "fare_amount": sa.types.REAL,
    "tip_amount": sa.types.REAL,
    "tolls_amount": sa.types.REAL,
    "extra": sa.types.REAL,
    "mta_tax": sa.types.REAL,
    "improvement_surcharge": sa.types.REAL,
    "congestion_surcharge": sa.types.REAL,
    "airport_fee": sa.types.REAL,
    "payment_type": sa.types.INTEGER,
    "VendorID": sa.types.INTEGER,
}
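Because the type mapping is keyed by column name, a misspelled or differently cased key would leave that column with the default type pandas picks. A small sanity check, sketched here with a hypothetical helper `compare_columns` and made-up stand-ins for `nyc_taxi.columns` and `schema_dtypes`, could catch such mismatches before writing to the database.

```python
def compare_columns(frame_columns, dtype_map):
    """Return (columns without an explicit type, typed names absent from the frame)."""
    frame_set = set(frame_columns)
    typed_set = set(dtype_map)
    untyped = sorted(frame_set - typed_set)
    unknown = sorted(typed_set - frame_set)
    return untyped, unknown


# Hypothetical stand-ins for `nyc_taxi.columns` and `schema_dtypes`;
# note the deliberately wrong case of "vendorid".
example_columns = ["tpep_pickup_datetime", "dt", "trip_distance", "VendorID"]
example_dtypes = {
    "tpep_pickup_datetime": "TIMESTAMP",
    "dt": "BIGINT",
    "trip_distance": "REAL",
    "vendorid": "INTEGER",
}

untyped, unknown = compare_columns(example_columns, example_dtypes)
print(untyped)  # columns that would fall back to a default type
print(unknown)  # mapping keys that match no column
```

In the notebook, the call would be `compare_columns(nyc_taxi.columns, schema_dtypes)`, and both returned lists are expected to be empty.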

## Ingest NYC taxi data from January 2021 into our PostgreSQL database

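Create a new, empty table `yellow_taxi_trips`. Writing a zero-row slice of the DataFrame creates the table with the column types defined above. Note that `if_exists="replace"` drops any existing table with the same name first.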

In [None]:
nyc_taxi.head(n=0).to_sql(
    name=table_name, con=engine, schema=schema_name,
    if_exists="replace", index=False, dtype=schema_dtypes,
)

Ingest the January 2021 trips into the newly created PostgreSQL table `yellow_taxi_trips`.

In [None]:
nyc_taxi.to_sql(
    name=table_name, con=engine, schema=schema_name,
    if_exists="append", index=False, dtype=schema_dtypes,
)

Grant `SELECT` (read-only) permission on the newly created `nyc_taxi.yellow_taxi_trips` table to the `reader` role. Otherwise, members of `reader` won't be able to query it.

In [None]:
query = f"GRANT SELECT ON TABLE {schema_name}.{table_name} TO reader"
with engine.connect() as conn:
    conn.execute(sa.text(query))
    conn.commit()
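The statement above interpolates the schema and table names as-is, which works here because both are lowercase. Mixed-case column names such as `PULocationID` are created as quoted identifiers, so readers have to double-quote them in their queries. The following is a minimal sketch of building such statements with quoted identifiers. The helpers `quote_ident`, `grant_select`, and `select_columns` are hypothetical and only build strings; they do not connect to the database.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_select(schema_name, table_name, role):
    """Build a GRANT SELECT statement with every identifier quoted."""
    target = f"{quote_ident(schema_name)}.{quote_ident(table_name)}"
    return f"GRANT SELECT ON TABLE {target} TO {quote_ident(role)}"


def select_columns(schema_name, table_name, columns):
    """Build a SELECT over the given columns, preserving their exact case."""
    cols = ", ".join(quote_ident(column) for column in columns)
    return f"SELECT {cols} FROM {quote_ident(schema_name)}.{quote_ident(table_name)}"


print(grant_select("nyc_taxi", "yellow_taxi_trips", "reader"))
print(select_columns("nyc_taxi", "yellow_taxi_trips", ["PULocationID", "total_amount"]))
```

Either string could be passed to `sa.text(...)` and executed as in the cell above.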