# Part 4 - Integrate Multiple Sources

In this notebook we integrate the preaggregated NYC taxi trip data with the weather data and the holiday data. This gives us a data set rich in additional features, which can be used for the final machine learning task.

The enriched data, which contains information from multiple independent sources (taxi trips, weather and holidays), will be stored in the *integrated zone*.

In [None]:
dwh_basedir = "/user/hadoop/nyc-dwh"
structured_basedir = dwh_basedir + "/structured"
refined_basedir = dwh_basedir + "/refined"
integrated_basedir = dwh_basedir + "/integrated"

# 0. Setup Environment

## 0.1 Spark Session

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","64G") \
        .getOrCreate()

spark.version

## 0.2 Matplotlib

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

## 0.3 Geopandas and friends

In [None]:
import pandas as pd
import geopandas as gpd
import contextily as ctx
from shapely.geometry import Point

# Helper function to fetch background map tiles
def add_basemap(ax, zoom, url='http://tile.stamen.com/terrain/tileZ/tileX/tileY.png'):
    xmin, xmax, ymin, ymax = ax.axis()
    basemap, extent = ctx.bounds2img(xmin, ymin, xmax, ymax, zoom=zoom, url=url)
    ax.imshow(basemap, extent=extent, interpolation='bilinear')
    # restore original x/y limits
    ax.axis((xmin, xmax, ymin, ymax))

# 1. Read Taxi Data

Now we can read in the hourly preaggregated taxi data from the refined zone.

In [None]:
taxi_aggregates = spark.read.parquet(refined_basedir + "/taxi-trips-hourly")
taxi_aggregates.limit(10).toPandas()

## 1.1 NYC Taxi Trip location

We also load the taxi trips from the refined zone, which contains one record per individual taxi trip. We use this data to calculate the average geo location of all taxi pickups. To exclude outliers, we reuse the quantiles of the pickup geo locations that were calculated previously.

In [None]:
# 95% quantiles of the pickup geo location, as calculated in the previous notebook
min_pickup_longitude=-74.007629
max_pickup_longitude=-73.77668
min_pickup_latitude=40.705612
max_pickup_latitude=40.840221

Now we calculate the average pickup geo location.

In [None]:
taxi_trips = spark.read.parquet(refined_basedir + "/taxi-trip")

# Calculate average pickup location. The result should be stored in two columns avg_pickup_longitude and avg_pickup_latitude
result = taxi_trips\
    .filter((taxi_trips["pickup_longitude"] > min_pickup_longitude) & (taxi_trips["pickup_longitude"] < max_pickup_longitude)) \
    .filter((taxi_trips["pickup_latitude"] > min_pickup_latitude) & (taxi_trips["pickup_latitude"] < max_pickup_latitude)) \
    .select(
        # YOUR CODE HERE
    )

# Extract numerical values from single-record DataFrame
first_result = result.first()
avg_pickup_longitude = first_result["avg_pickup_longitude"]
avg_pickup_latitude = first_result["avg_pickup_latitude"]

print("avg_pickup_latitude=" + str(avg_pickup_latitude))
print("avg_pickup_longitude=" + str(avg_pickup_longitude))
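The filter-then-average logic of this step can be illustrated without Spark. The following is a minimal plain-Python sketch, not the solution to the exercise: the bounding box is copied from the quantile cell above, while the pickup points and the helper `inside_box` are made up for illustration.

```python
from statistics import mean

# Bounding box copied from the quantile cell above
MIN_LON, MAX_LON = -74.007629, -73.77668
MIN_LAT, MAX_LAT = 40.705612, 40.840221

# Hypothetical (longitude, latitude) pickup points; the last two are outliers
pickups = [
    (-73.98, 40.75),
    (-73.96, 40.77),
    (-73.99, 40.73),
    (0.0, 0.0),        # e.g. a zeroed-out coordinate
    (-75.50, 41.20),   # far outside the bounding box
]

def inside_box(lon, lat):
    """True if the point lies strictly inside the bounding box."""
    return MIN_LON < lon < MAX_LON and MIN_LAT < lat < MAX_LAT

# Drop outliers first, then average each coordinate separately
kept = [(lon, lat) for lon, lat in pickups if inside_box(lon, lat)]
avg_lon = mean(lon for lon, _ in kept)
avg_lat = mean(lat for _, lat in kept)
print(len(kept), round(avg_lon, 4), round(avg_lat, 4))
```

Without the filter, the two outliers would pull the average far away from the city, which is why the quantile bounds are applied before averaging.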

# 2. Weather Data

Now we load the preaggregated hourly and daily weather data for 2013. We will then try to find the weather station nearest to the average pickup location.

In [None]:
hourly_weather = spark.read.parquet(refined_basedir + "/weather-hourly/2013")
daily_weather = spark.read.parquet(refined_basedir + "/weather-daily/2013")

## 2.1 Station Master Data

In order to find an appropriate weather station (which will be used for all taxi trips, since we only analyze data from NYC), we use the weather station master data, which also contains the geo location of every weather station.

In [None]:
weather_stations = spark.read.parquet(structured_basedir + "/weather-stations")
weather_stations.limit(10).toPandas()

## 2.2 Find Corresponding Weather Station

Using the master data of all weather stations, we now try to find a station which is close to the center of all taxi trips.

In [None]:
# Step 1: Calculate distance of every weather station to the average pickup location
weather_stations_with_distance = weather_stations\
    .filter((weather_stations["BEGIN"] <= "20130101") & ((weather_stations["END"] >= "20131231") | weather_stations["END"].isNull())) \
    .filter(weather_stations["WBAN"] != "99999") \
    .select(
        "*",
        (f.pow(avg_pickup_longitude - weather_stations["LON"],2) + f.pow(avg_pickup_latitude - weather_stations["LAT"],2)).alias("geo_distance")
    )

# Step 2: Pick the nearest station by sorting the result by distance and then picking the first record
nyc_station = # YOUR CODE HERE

# Extract relevant information for later
nyc_station_usaf = nyc_station["USAF"]
nyc_station_wban = nyc_station["WBAN"]
nyc_station_longitude = float(nyc_station["LON"])
nyc_station_latitude = float(nyc_station["LAT"])

print(nyc_station) 
print(nyc_station["LAT"] + "," + nyc_station["LON"])
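The `geo_distance` column is a squared Euclidean distance measured in degrees, which is sufficient for ranking stations. A minimal plain-Python sketch of the same idea follows; it is not the Spark solution to the exercise. Only the Central Park coordinates come from this notebook. The average pickup location and the two airport entries are rough, illustrative assumptions.

```python
# Hypothetical average pickup location (longitude, latitude)
avg_lon, avg_lat = -73.977, 40.750

# (name, longitude, latitude); only the Central Park values come from the
# notebook, the two airport entries are rough illustrative coordinates
stations = [
    ("LA GUARDIA (approx.)", -73.880, 40.779),
    ("CENTRAL PARK", -73.969, 40.779),
    ("JFK (approx.)", -73.762, 40.639),
]

def squared_degree_distance(lon, lat):
    """Squared Euclidean distance in degrees, mirroring geo_distance."""
    return (avg_lon - lon) ** 2 + (avg_lat - lat) ** 2

# Sorting by distance and taking the first element is equivalent to min()
nearest = min(stations, key=lambda s: squared_degree_distance(s[1], s[2]))
print(nearest[0])
```

No square root is needed because it does not change the ordering. Note that one degree of longitude covers less ground than one degree of latitude at New York's latitude, so this measure is a ranking heuristic rather than a true distance. For stations spread over a large area, a great-circle formula such as haversine would be the more careful choice.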

### Sanity Check

The code above should give us the following weather station:

* USAF='725053'
* WBAN='94728'
* STATION NAME='CENTRAL PARK'
* CTRY='US'
* STATE='NY'
* LAT='+40.779'
* LON='-073.969'

Please make sure to continue with these values, as the following code is tailored specifically to that weather station!

In [None]:
nyc_station_usaf = "725053"
nyc_station_wban = "94728"
nyc_station_latitude = 40.779
nyc_station_longitude = -73.969

### Visualization

Let's make a picture again, showing the average geo coordinate of our data and the weather station. They should match pretty well!

In [None]:
geo_min_x=-8239719.95065924
geo_max_x=-8212678.623952549
geo_min_y=4968029.278728969
geo_max_y=4989775.66725539

In [None]:
df = pd.DataFrame({
    'LAT'  :[avg_pickup_latitude, nyc_station_latitude],
    'LONG' :[avg_pickup_longitude, nyc_station_longitude]
})

# Convert DataFrame to GeoDataFrame  
coords = pd.Series(zip(df["LONG"], df["LAT"]))
geo_df = gpd.GeoDataFrame(df, crs = {'init': 'epsg:4326'}, geometry = coords.apply(Point)).to_crs(epsg=3857)

# ... and make the plot
ax = geo_df.plot(figsize=(15, 10), alpha=1, color="red")
ax.set(ylim=(geo_min_y, geo_max_y), xlim=(geo_min_x, geo_max_x))

# Add basemap below
add_basemap(ax, 12)
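The call `to_crs(epsg=3857)` reprojects longitude/latitude into Web Mercator meters, which is why the plot limits `geo_min_x` to `geo_max_y` are such large numbers. As a rough sketch, the spherical Web Mercator forward formula can be written by hand. The function `to_web_mercator` is a hypothetical helper for illustration and is not what GeoPandas calls internally.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude in degrees to EPSG:3857 meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Central Park station coordinates from the sanity check in this notebook
x, y = to_web_mercator(-73.969, 40.779)

# Plot limits copied from the notebook
geo_min_x, geo_max_x = -8239719.95065924, -8212678.623952549
geo_min_y, geo_max_y = 4968029.278728969, 4989775.66725539

# The projected station should fall inside the plotted area
print(geo_min_x < x < geo_max_x and geo_min_y < y < geo_max_y)
```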

# 3. Holidays

The last data set that we want to integrate is the list of bank holidays.

In [None]:
holidays = spark.read.parquet(structured_basedir + "/holidays")
holidays.limit(10).toPandas()

In [None]:
holidays.printSchema()

# 4. Join Data

Finally we join together all four data sets:
* Preaggregated taxi trips
* Hourly weather data
* Daily weather data
* Holidays

### NYC Weather

We filter the weather data to the NYC weather station.

In [None]:
nyc_daily_weather = daily_weather.filter((daily_weather["usaf"] == nyc_station_usaf) & (daily_weather["wban"] == nyc_station_wban)).cache()
nyc_hourly_weather = # YOUR CODE HERE

### Join Data Sets

Now we carefully join all enrichment information to the preaggregated hourly taxi trips.

In [None]:
joined_data = # YOUR CODE HERE

joined_data.limit(10).toPandas()

Since writing all the joins is tedious and error-prone, the following code block already contains the complete join of all sources.

In [None]:
all_data = taxi_aggregates \
    .join(f.broadcast(holidays), ["date"], how="leftOuter") \
    .drop(holidays["date"]) \
    .drop(holidays["id"]) \
    .withColumnRenamed("description", "holiday_description") \
    .join(f.broadcast(nyc_hourly_weather), ["date", "hour"], how="leftOuter") \
    .withColumnRenamed("precipitation", "hourly_precipitation") \
    .withColumnRenamed("wind_speed", "hourly_wind_speed") \
    .withColumnRenamed("temperature", "hourly_temperature") \
    .drop(nyc_hourly_weather["date"]) \
    .drop(nyc_hourly_weather["hour"]) \
    .drop(nyc_hourly_weather["usaf"]) \
    .drop(nyc_hourly_weather["wban"]) \
    .join(f.broadcast(nyc_daily_weather), ["date"], how="leftOuter") \
    .withColumnRenamed("precipitation", "daily_precipitation") \
    .withColumnRenamed("wind_speed", "daily_wind_speed") \
    .withColumnRenamed("temperature", "daily_temperature") \
    .drop(nyc_daily_weather["date"]) \
    .drop(nyc_daily_weather["usaf"]) \
    .drop(nyc_daily_weather["wban"]) \
    .orderBy("date", "hour") \
    .cache()

all_data.limit(10).toPandas()
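The join chain above follows one pattern twice: a left outer join keeps every taxi row even when no weather record matches, and the weather columns are renamed right after each join because the hourly and daily tables share the same column names. The following plain-Python sketch mimics that pattern on made-up rows. The helper `left_join` and the column `trip_count` are hypothetical, and the sketch assumes that the join keys are unique on the right-hand side.

```python
def left_join(left_rows, right_rows, keys, rename):
    """Left outer join of lists of dicts; right columns are renamed via `rename`."""
    # Index the right side by its key tuple (assumes unique keys)
    index = {tuple(r[k] for k in keys): r for r in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys))
        out = dict(row)
        for old, new in rename.items():
            # Unmatched rows get None, like NULL in a left outer join
            out[new] = match[old] if match else None
        joined.append(out)
    return joined

# Made-up sample rows
taxi = [
    {"date": "2013-01-01", "hour": 0, "trip_count": 120},
    {"date": "2013-01-01", "hour": 1, "trip_count": 95},
]
hourly = [{"date": "2013-01-01", "hour": 0, "temperature": 2.5}]
daily = [{"date": "2013-01-01", "temperature": 4.0}]

# Rename during the first join so the second join cannot clash on "temperature"
step1 = left_join(taxi, hourly, ["date", "hour"], {"temperature": "hourly_temperature"})
result = left_join(step1, daily, ["date"], {"temperature": "daily_temperature"})
for row in result:
    print(row)
```

The second taxi row has no hourly weather record, so its `hourly_temperature` stays empty while its `daily_temperature` is still filled in.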

### Write to Integrated Zone

The result will be written to the `taxi-trips-hourly` subdirectory of the integrated zone.

In [None]:
all_data.write.mode("overwrite").parquet(integrated_basedir + "/taxi-trips-hourly")

In [None]:
all_data = spark.read.parquet(integrated_basedir + "/taxi-trips-hourly")
all_data.limit(10).toPandas()

# 5. Pictures

In [None]:
# Calculate average temperature and total amount per date, sorted by date.
# The plotting code below expects the columns "date", "amount" and "temperature".
daily = # YOUR CODE HERE

# Convert to Pandas    
pdf = daily.toPandas()

fig, ax1 = plt.subplots(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k')

# Plot fare amount
ax1.plot(pdf["date"],pdf["amount"], color="red")

# Plot temperature
ax2 = ax1.twinx() 
ax2.plot(pdf["date"],pdf["temperature"], color="green")

# Plot legends
plt.legend(handles=[
    mpatches.Patch(color='red', label='Total Fare Amount per Day'),
    mpatches.Patch(color='green', label='Average Temperature per Day')
])
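The aggregation requested at the top of this cell is a group-by on the date with a sum and an average. A minimal plain-Python sketch of that logic on made-up rows is shown here; it is not the Spark solution. The input column names `amount` and `daily_temperature` are assumptions and may differ from the actual schema of `all_data`.

```python
from collections import defaultdict

# Made-up hourly rows; `amount` and `daily_temperature` are assumed column names
hourly_rows = [
    {"date": "2013-01-02", "hour": 0, "amount": 1500.0, "daily_temperature": 1.0},
    {"date": "2013-01-01", "hour": 0, "amount": 1000.0, "daily_temperature": 3.0},
    {"date": "2013-01-01", "hour": 1, "amount": 800.0, "daily_temperature": 3.0},
]

# Collect all hourly rows belonging to the same date
groups = defaultdict(list)
for row in hourly_rows:
    groups[row["date"]].append(row)

# One record per date: summed amount, averaged temperature, sorted by date
daily = [
    {
        "date": date,
        "amount": sum(r["amount"] for r in rows),
        "temperature": sum(r["daily_temperature"] for r in rows) / len(rows),
    }
    for date, rows in sorted(groups.items())
]
print(daily)
```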