# Prerequisite

In this notebook, we will:
- Install and update the required third-party libraries
- Load our dataset into an S3 bucket

In [None]:
!python -m pip install -Uq pip
!python -m pip install -Uq sagemaker boto3 awswrangler
!python -m pip install geopandas

In [None]:
from zipfile import ZipFile
import pandas as pd  # noqa: E402
import geopandas as gpd  # noqa: E402
import os

In [None]:
from sagemaker.s3 import S3Downloader, S3Uploader

In [None]:
import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/DEMO-xgboost-tripfare'

In [None]:
!aws s3 cp --recursive ../glue/ s3://$bucket/scripts/ 

In [None]:
input_source = f's3://{bucket}/{prefix}/input/'
input_data = input_source + 'data'
input_zones = input_source + 'zones/'
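
The concatenation above only works because `input_source` ends with a slash. As a minimal sketch, a small helper can join S3 URI fragments without depending on where the slashes sit. `s3_join` is a hypothetical helper, not part of the SageMaker SDK, and the bucket name is a placeholder.

```python
def s3_join(*parts: str) -> str:
    """Join S3 URI fragments with exactly one '/' between them (hypothetical helper)."""
    head, *rest = parts
    # Keep the scheme in the first fragment intact; trim slashes everywhere else
    cleaned = [head.rstrip("/")] + [p.strip("/") for p in rest if p.strip("/")]
    return "/".join(cleaned)


# Placeholder bucket; in the notebook it comes from sagemaker.Session().default_bucket()
example_bucket = "my-example-bucket"
example_prefix = "sagemaker/DEMO-xgboost-tripfare"

example_source = s3_join(f"s3://{example_bucket}", example_prefix, "input")
print(s3_join(example_source, "data"))
print(s3_join(example_source, "zones/"))  # note: the trailing slash is dropped
```

Because the helper strips trailing slashes, a caller that needs a "folder-style" URI would have to append the final `/` explicitly.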


In [None]:
%store input_source
%store input_zones

## Load dataset into S3

In the following section, we will copy the New York City taxi trip data and the taxi zone data into the S3 bucket and prefix defined above.

In [None]:
!aws s3 cp --recursive 's3://nyc-tlc/trip data/' $input_data/green --exclude '*' --include 'green_tripdata_2018-1*'
!aws s3 cp --recursive 's3://nyc-tlc/trip data/' $input_data/yellow --exclude '*' --include 'yellow_tripdata_2018-1*'
!aws s3 cp 's3://nyc-tlc/misc/taxi_zones.zip' $input_zones
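
The `--exclude '*' --include '...'` pair works because the AWS CLI applies filters in order, with later filters taking precedence over earlier ones. The sketch below approximates that rule with `fnmatch` to show which objects a filter list would select. `is_copied` is a hypothetical function, and the object names are illustrative rather than a listing of the real bucket.

```python
from fnmatch import fnmatchcase


def is_copied(key, filters):
    """Apply (action, pattern) filters in order; the last matching filter wins."""
    included = True  # with no filters, every object is copied
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            included = action == "include"
    return included


# Same filter order as the green-taxi command above
green_filters = [("exclude", "*"), ("include", "green_tripdata_2018-1*")]

# Illustrative object names (assumed, not read from S3)
sample_keys = [
    "green_tripdata_2018-09.csv",
    "green_tripdata_2018-10.csv",
    "yellow_tripdata_2018-10.csv",
]
for key in sample_keys:
    print(key, is_copied(key, green_filters))
```

Note that the pattern `2018-1*` can only match months 10, 11, and 12, so the commands copy at most the last quarter of 2018.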

In [None]:
zones_dir = os.path.join('.', "input/zones")
zones_file = os.path.join(zones_dir, "taxi_zones.zip")
zones_file_csv = os.path.join(zones_dir, "taxi_zones.csv")

In [None]:
# Download the taxi zones archive to the local input folder
download_uri = "s3://nyc-tlc/misc/taxi_zones.zip"
S3Downloader().download(download_uri, zones_dir)

In [None]:
# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(zones_file, "r") as archive:
    archive.extractall(zones_dir)
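
A shapefile is a set of sidecar files (`.shp`, `.shx`, `.dbf`, and others) that must sit together, so it can help to check what an archive contains before extracting it. The sketch below lists the members of a zip file. It builds a tiny stand-in archive so it runs without the real download; `list_archive` and the `demo_zones.*` names are hypothetical.

```python
import os
import tempfile
from zipfile import ZipFile


def list_archive(path):
    """Return the member names stored in a zip archive without extracting it."""
    with ZipFile(path, "r") as archive:
        return archive.namelist()


# Build a throwaway archive with empty placeholder members
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo_zones.zip")
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo_zones.shp", b"")
        archive.writestr("demo_zones.dbf", b"")
    members = list_archive(demo_zip)

print(members)
```

In the notebook, the same call would be `list_archive(zones_file)` once the download cell has run.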

In [None]:
# Load the taxi zone shapefile into a GeoDataFrame (it is written to CSV further down)
zone_df = gpd.read_file(os.path.join(zones_dir, "taxi_zones.shp"))

In [None]:
# Compute each zone's centroid in the shapefile's native CRS
centroids = zone_df.geometry.centroid
# Store the centroids in EPSG:3310, a metre-based CRS, so distances can be measured
zone_df["centroid"] = centroids.to_crs(epsg=3310)
# Convert the centroids to WGS84 lat/long (EPSG:4326); x is longitude, y is latitude
centroids_wgs84 = centroids.to_crs(epsg=4326)
zone_df["latitude"] = centroids_wgs84.y
zone_df["longitude"] = centroids_wgs84.x
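
Once each zone has a latitude and a longitude, the distance between two zone centroids can be estimated directly from those columns. The following is a minimal sketch using the haversine great-circle formula; it is an alternative to measuring in a projected CRS, not the method this notebook uses. `haversine_km` is a hypothetical helper and the coordinates are illustrative points in the New York area, not values read from the dataset.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points (degrees)."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Illustrative coordinates (assumed): argument order is latitude, then longitude
distance = haversine_km(40.64, -73.78, 40.77, -73.87)
print(f"{distance:.1f} km")
```

The argument order matters: passing longitude where latitude is expected gives a plausible-looking but wrong distance.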

In [None]:
zone_df.to_csv(zones_file_csv)

In [None]:
# Upload the zones CSV file to S3
S3Uploader().upload(zones_file_csv, input_source + 'zones')

In [None]:
!aws s3 ls $input_data/green/
!aws s3 ls $input_data/yellow/

Now that the data is loaded, take note of the S3 locations of the two datasets we will be using:

- Input source: the S3 prefix containing our taxi trip data
- Input zones: the S3 prefix containing the taxi zones, which we will join with the taxi trip data to enrich it
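
The enrichment join mentioned above can be sketched in plain Python. This assumes the trip records carry `PULocationID` and `DOLocationID` columns and that the zones are keyed by `LocationID`. The sample values are made up, `enrich_trip` is a hypothetical helper, and the later steps may implement the join differently.

```python
# Zone lookup keyed by LocationID (made-up coordinates for illustration)
zones_by_id = {
    1: {"latitude": 40.69, "longitude": -74.17},
    2: {"latitude": 40.62, "longitude": -73.83},
}

# Trip records with pickup and drop-off zone IDs (made-up values)
trips = [
    {"PULocationID": 1, "DOLocationID": 2, "fare_amount": 12.5},
    {"PULocationID": 2, "DOLocationID": 99, "fare_amount": 7.0},  # 99 has no zone
]


def enrich_trip(trip, zones):
    """Return a copy of the trip with pickup and drop-off coordinates attached."""
    enriched = dict(trip)
    for side, key in (("pickup", "PULocationID"), ("dropoff", "DOLocationID")):
        zone = zones.get(trip[key])  # None when the zone ID is unknown
        enriched[f"{side}_latitude"] = zone["latitude"] if zone else None
        enriched[f"{side}_longitude"] = zone["longitude"] if zone else None
    return enriched


enriched_trips = [enrich_trip(trip, zones_by_id) for trip in trips]
for row in enriched_trips:
    print(row)
```

Unknown zone IDs yield `None` coordinates here, which mirrors a left join; whether to keep or drop such trips is a modelling choice.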

In [None]:
print("Input source", input_source)
print("Input zones", input_zones)