# Streamlit

This notebook contains the final code used to create the Streamlit app file:

``streamlit_app.py``

### 0. [Requirements](#Requirements)
- [About Streamlit and Bokeh versions](#About-Streamlit-and-Bokeh-versions)
- [Chromedriver.exe](#Chromedriver.exe)

### 1. [Helper functions](#Helper-functions)
- ``get_LocationIDs()``
- ``datetimeInfo_and_LocID()``
- ``scrape_data()``
- ``get_input_data()``
- ``get_output_data()``
- ``load_shape_data()``
- ``load_taxis_data()``

# Requirements

### About Streamlit and Bokeh versions

Streamlit versions before 0.57 only work with Bokeh 1.0, and Streamlit 0.57+ only works with Bokeh 2.<br>
I used **Streamlit 0.62.0** and **Bokeh 2.1.1**.
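
The compatibility rule above can be written as a small check. This is only a sketch: `required_bokeh_major` is a hypothetical helper name (it is not part of Streamlit or Bokeh), and it encodes nothing more than the rule stated in this notebook.

```python
def required_bokeh_major(streamlit_version: str) -> int:
    """Return the Bokeh major version this notebook pairs with a Streamlit version.

    Encodes only the rule stated above: Streamlit < 0.57 -> Bokeh 1,
    Streamlit >= 0.57 -> Bokeh 2.
    """
    # Compare (major, minor) as integers so that "0.9" sorts before "0.57".
    major, minor = (int(part) for part in streamlit_version.split(".")[:2])
    return 2 if (major, minor) >= (0, 57) else 1


# Version used in this notebook
print(required_bokeh_major("0.62.0"))
```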

Install Streamlit and Bokeh if needed.

In [None]:
#!pip install streamlit
#!pip install bokeh

Check which versions are installed on your computer.

In [None]:
!streamlit --version
!bokeh info

### Chromedriver.exe

In order to scrape weather data from ``wunderground.com`` I had to use ``WebDriver`` with ``Chromedriver.exe``.  

**Why?**  

``wunderground.com`` seems to have a security feature that blocks known spider/bot user agents (such as the default one sent by Python's ``urllib``).

I didn't want to pay for their API, so I simulate access from a known browser user agent (i.e. Chrome).

This is why I use **Selenium WebDriver**. ``webdriver`` drives a browser natively, as a user would.

Make sure that you have downloaded [Chromedriver.exe](./chromedriver.exe) and that its path, relative to the folder where the server was started, is correct.  
I run the Jupyter server inside the ``./notebooks`` folder, so I save ``chromedriver.exe`` there.
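
A quick way to catch a wrong relative path before Selenium fails is to resolve it explicitly. The sketch below is an illustration only: `resolve_chromedriver` is a hypothetical helper that is not used in `streamlit_app.py`, and it assumes the driver sits in the folder where the server was started.

```python
from pathlib import Path


def resolve_chromedriver(base_dir: str = ".", filename: str = "chromedriver.exe") -> Path:
    """Return the absolute path to the driver, or raise if it is missing."""
    driver_path = (Path(base_dir) / filename).resolve()
    if not driver_path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {driver_path.parent}; "
            "download it or start the server from the folder that contains it."
        )
    return driver_path
```

Calling `resolve_chromedriver()` before creating the WebDriver turns a confusing Selenium error into a clear message about the missing file.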

# Helper functions
<div style = "float:right"><a style="text-decoration:none" href = "#Streamlit">Up</a></div>

The functions used in the script are explained below:

**``get_LocationIDs()``**:  
Creates a DataFrame with the LocationIDs of the Manhattan zones.


**``datetimeInfo_and_LocID(df_LocIds, start_date, NoOfDays)``**:  
Creates a DataFrame with the datetime info and LocationID, and applies the transforms needed to pass it on to the predictive model.
In the script it starts tomorrow and covers 3 days in total.
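
To make the shape of that table concrete, here is a pure-Python sketch of the same idea: one record per hour and per zone. It is not the pandas implementation used in the script; `hourly_grid` is a hypothetical name, the three LocationIDs are sample values, and the `week` and `isholiday` columns are left out for brevity.

```python
from datetime import date, datetime, timedelta


def hourly_grid(location_ids, start_date, no_of_days):
    """Build one record per (hour, LocationID) for `no_of_days` days."""
    first_hour = datetime.combine(start_date, datetime.min.time())
    records = []
    for offset in range(24 * no_of_days):
        ts = first_hour + timedelta(hours=offset)
        for loc_id in location_ids:
            records.append({
                "month": ts.month,
                "hour": ts.hour,
                "dayhour": ts.strftime("%d%H"),  # join key: day of month + hour
                "dayofweek": ts.weekday(),       # Monday = 0, as in pandas
                "LocationID": loc_id,
            })
    return records


# 24 hours * 3 days * 3 sample zones
grid = hourly_grid([4, 12, 13], date(2020, 8, 26), 3)
print(len(grid))
```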


**``scrape_data(today, days_in)``**:  
Scrapes the precipitation forecast from *wunderground.com*.  
It covers 3 days, starting tomorrow.
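
The pages visited by the scraper follow a simple date pattern. The sketch below only builds the URLs (no browser involved); `forecast_urls` is a hypothetical helper, while the URL template is the one used in the script.

```python
from datetime import date, timedelta

LOOKUP_URL = "https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html"


def forecast_urls(today, days_in):
    """Return one hourly-forecast URL per day, starting tomorrow."""
    urls = []
    for offset in range(1, days_in + 1):
        day = today + timedelta(days=offset)
        # The script formats month and day as plain integers (no zero padding).
        urls.append(LOOKUP_URL.format(day.year, day.month, day.day))
    return urls


for url in forecast_urls(date(2020, 8, 25), 3):
    print(url)
```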


**``get_input_data(start_date, NoOfDays)``**:  
Merges the outputs of ``datetimeInfo_and_LocID`` and ``scrape_data`` into a single DataFrame with the shape the predictive model expects.
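
The two tables are joined on the ``dayhour`` key (day of month plus hour). A dictionary-based sketch of that left join is shown below; `attach_precipitation` is a hypothetical name and the sample values are made up for illustration. The key is only unique because the forecast window spans a few days.

```python
def attach_precipitation(rows, forecast):
    """Left-join `forecast` onto `rows` using the shared 'dayhour' key."""
    by_dayhour = {item["dayhour"]: item["precipitation"] for item in forecast}
    merged = []
    for row in rows:
        # Copy every field except the join key, which is dropped after merging.
        joined = {key: value for key, value in row.items() if key != "dayhour"}
        # A missing forecast hour yields None, like NaN in a pandas left merge.
        joined["precipitation"] = by_dayhour.get(row["dayhour"])
        merged.append(joined)
    return merged


# Illustrative sample data, not real forecast values
sample_rows = [
    {"hour": 0, "dayhour": "2600", "LocationID": 4},
    {"hour": 1, "dayhour": "2601", "LocationID": 4},
]
sample_forecast = [{"dayhour": "2600", "precipitation": "0.1"}]
print(attach_precipitation(sample_rows, sample_forecast))
```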


**``get_output_data(pickle_file, input_data)``**:  
Passes the output of ``get_input_data()`` to the predictive model and returns a DataFrame with the columns ``dayofweek``, ``hour``, ``LocationID`` and ``pickups``.


**``load_shape_data()``**:  
Creates a DataFrame with LocationIDs and their associated (X,Y) coordinates so that they can be plotted as a map.


**``load_taxis_data(output_data, shape_data)``**:  
Takes the outputs of ``get_output_data`` and ``load_shape_data`` and reshapes them into a table that can be plotted, with one column per combination of day and hour, so that the sidebar selections map to columns.
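
The link between the sidebar widgets and the table is the column naming scheme ``Passenger_<dayofweek>_<hour>``. The sketch below reproduces that scheme with plain dictionaries; `passenger_column` and `pivot_pickups` are hypothetical names and the records are sample values.

```python
def passenger_column(dayofweek: int, hour: int) -> str:
    """Name of the wide-table column holding pickups for a weekday and hour."""
    return "Passenger_%d_%d" % (dayofweek, hour)


def pivot_pickups(records):
    """Turn long records into {LocationID: {column_name: pickups}}."""
    wide = {}
    for rec in records:
        column = passenger_column(rec["dayofweek"], rec["hour"])
        wide.setdefault(rec["LocationID"], {})[column] = rec["pickups"]
    return wide


# Illustrative sample records
sample_records = [
    {"hour": 7, "dayofweek": 2, "LocationID": 4, "pickups": 10},
    {"hour": 8, "dayofweek": 2, "LocationID": 4, "pickups": 14},
]
wide = pivot_pickups(sample_records)
# Selecting weekday 2 and hour 7 in the sidebar maps to this column:
print(wide[4][passenger_column(2, 7)])
```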

In [1]:
%%writefile streamlit_app.py

import streamlit as st
import pandas as pd
import numpy as np
from datetime import date, timedelta
pd.options.display.max_columns = None
pd.options.display.max_rows = None

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import geopandas as gpd
from shapely.geometry import Polygon, MultiPolygon
import matplotlib.pyplot as plt

from bokeh.io import output_notebook, output_file, show
from bokeh.plotting import figure
from bokeh.models import HoverTool, Select, ColumnDataSource, WheelZoomTool, LogColorMapper, LinearColorMapper, ColorBar, BasicTicker
from bokeh.palettes import Viridis256 as palette
from bokeh.layouts import row
import altair as alt


#############################   DEFINE FUNCTIONS START   #############################

# GET LOCATION ID DATA FRAME
# when deploying to an external server, consider creating the LocationIDs manually instead of reading the csv
@st.cache(show_spinner=False)
def get_LocationIDs():
    # 1. Import Location and Borough columns from NY TAXI ZONES dataset
    dfzones = pd.read_csv('https://raw.github.com/angelrps/MasterDataScience_FinalProject/master/data/NY_taxi_zones.csv', sep=',',
                          usecols=['LocationID', 'borough'])

    # 2. Filter Manhattan zones
    dfzones = dfzones[dfzones['borough']=='Manhattan']\
                    .drop(['borough'], axis=1)\
                    .sort_values(by='LocationID')\
                    .drop_duplicates('LocationID').reset_index(drop=True)    
    return dfzones

# CREATE DATETIME INFO AND APPEND LOCATION IDs
@st.cache(show_spinner=False)
def datetimeInfo_and_LocID(df_LocIds, start_date, NoOfDays):   

    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    
    # repeat LocationIDs. All of them... for each hour
    location_id_col = pd.concat([df_LocIds]*24*NoOfDays).reset_index(drop=True)

    # create data frame with range of days with hourly period
    df_pred = pd.DataFrame()
    dates = pd.date_range(start = start_date, end = start_date + timedelta(days=NoOfDays), freq = "H")
    df_pred['datetime'] = dates
    df_pred.drop([df_pred.shape[0]-1], inplace=True)

    # Create new columns from datetime
    df_pred['month'] = df_pred['datetime'].dt.month
    df_pred['hour'] = df_pred['datetime'].dt.hour
    # 'dayhour' will serve as index to perform the join
    df_pred['dayhour'] = df_pred['datetime'].dt.strftime('%d%H')
    df_pred['week'] = df_pred['datetime'].dt.week
    df_pred['dayofweek'] = df_pred['datetime'].dt.dayofweek


    # Create date time index calendar
    drange = pd.date_range(start=str(start_date.year)+'-01-01', end=str(start_date.year)+'-12-31')
    cal = calendar()
    holidays = cal.holidays(start=drange.min(), end=drange.max())
    
    # create new columns 'date' and 'isholiday'
    df_pred['date'] = pd.to_datetime(df_pred['datetime'].dt.date)
    # compare the normalized date: the hourly 'datetime' would only match holidays at midnight
    df_pred['isholiday'] = df_pred['date'].isin(holidays).astype(int)
    
    # drop 'date' and 'datetime' columns
    df_pred.drop(['datetime', 'date'], axis=1, inplace=True)

    # repeat rows. 67 rows per hour
    df_pred = df_pred.iloc[np.arange(len(df_pred)).repeat(len(df_LocIds))].reset_index(drop=True)
    #df_index = df_index.iloc[np.arange(len(df_index)).repeat(67)].reset_index(drop=True)

    df_pred = df_pred.join(location_id_col)
    
    return df_pred

# SCRAPE PRECIPITATION FORECAST FROM wunderground.com
@st.cache(show_spinner=False)
def scrape_data(today, days_in):
    with st.spinner("I am scraping weather data from wunderground.com... please wait."):
        # Use .format(YYYY, M, D)
        lookup_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html'

        options = webdriver.ChromeOptions()
        options.add_argument('headless')  # run Chrome in the background

        driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)

        start_date = today + pd.Timedelta(days=1)
        end_date = today + pd.Timedelta(days=days_in + 1)

        df_prep = pd.DataFrame()

        while start_date != end_date:
            timestamp = pd.Timestamp(str(start_date)+' 00:00:00')

            print('gathering data from: ', start_date)

            formatted_lookup_URL = lookup_URL.format(start_date.year,
                                                     start_date.month,
                                                     start_date.day)

            driver.get(formatted_lookup_URL)
            rows = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, '//td[@class="mat-cell cdk-cell cdk-column-liquidPrecipitation mat-column-liquidPrecipitation ng-star-inserted"]')))
            for row in rows:
                hour = timestamp.strftime('%H')
                day = timestamp.strftime('%d')
                prep = row.find_element_by_xpath('.//span[@class="wu-value wu-value-to"]').text
                # append new row to table
                # 'dayhour' column will serve as column index to perform the Join
                df_prep = df_prep.append(pd.DataFrame({"dayhour":[day+hour], 'precipitation':[prep]}),
                                         ignore_index = True)

                timestamp += pd.Timedelta('1 hour')

            start_date += timedelta(days=1)

        # close the browser once all days have been scraped
        driver.quit()
    return df_prep

# GET INPUT DATA USING THE FUNCTIONS ABOVE: LocationIDs and Datetime info
@st.cache(show_spinner=False)
def get_input_data(start_date, NoOfDays):
    # get LocationIDs data frame
    df_LocIds = get_LocationIDs()

    # create datetime info and append LocationsIDs
    dtInfo_and_LocID = datetimeInfo_and_LocID(df_LocIds,start_date,NoOfDays)

    # get precipitation forecast
    prep_forecast = scrape_data(date.today(), NoOfDays)

    # merge both data frames
    df_merged = dtInfo_and_LocID.merge(prep_forecast, on="dayhour", how="left")

    # drop dayhour column
    df_merged = df_merged.drop(['dayhour'], axis=1)
    
    return df_merged

# GET OUTPUT DATA: get predictions, append to input_data and format it to be processed
@st.cache(show_spinner=False)
def get_output_data(pickle_file, input_data):
    with st.spinner("Making predictions..."):
        import pickle

        # open the file in a context manager so the handle is closed after loading
        with open(pickle_file, 'rb') as f:
            model = pickle.load(f)

        # get prediction, convert to integer and convert Array into DataFrame
        model_predict = (model.predict(input_data)).astype(int)
        df_predict = pd.DataFrame({'pickups':model_predict})

        # join input_data with DataFrame
        joined = input_data.join(df_predict)

        output_data = joined[['hour','dayofweek','LocationID','pickups']]
    
    return output_data

# GET DATA FRAME WITH SHAPE GEOMETRY INFO
@st.cache(show_spinner=False)
def load_shape_data():
    path = '../data/taxi_zones/taxi_zones.shp'
    shape_data = gpd.read_file(path)

    # filter Manhattan zones
    shape_data = shape_data[shape_data['borough'] == 'Manhattan'].reset_index(drop=True)

    shape_data = shape_data.drop(['borough'], axis=1)

    # EPSG code of Web Mercator
    shape_data.to_crs(epsg=3785, inplace=True)

    # Simplify the zone shapes (otherwise the plot performs slowly)
    shape_data["geometry"] = shape_data["geometry"].simplify(100)

    data = []
    for zonename, LocationID, shape in shape_data[["zone", "LocationID", "geometry"]].values:
        #If shape is polygon, extract X and Y coordinates of boundary line:
        if isinstance(shape, Polygon):
            X, Y = shape.boundary.xy
            X = [int(x) for x in X]
            Y = [int(y) for y in Y]
            data.append([LocationID, zonename, X, Y])

        #If shape is Multipolygon, extract X and Y coordinates of each sub-Polygon:
        if isinstance(shape, MultiPolygon):
            for poly in shape.geoms:
                X, Y = poly.boundary.xy
                X = [int(x) for x in X]
                Y = [int(y) for y in Y]
                data.append([LocationID, zonename, X, Y])

    # Create new DataFrame with X and Y coordinates separated:
    shape_data = pd.DataFrame(data, columns=["LocationID", "ZoneName", "X", "Y"])
    return shape_data

@st.cache(allow_output_mutation=True, show_spinner=False)
def load_taxis_data(output_data, shape_data):
    df_to_visualize = shape_data.copy()
    pickups = output_data.groupby(['hour','dayofweek','LocationID']).sum()
    #start_day = pd.unique(output_data['dayofweek']).min()
    #end_day = pd.unique(output_data['dayofweek']).max()
    listofdays = pd.unique(output_data['dayofweek'])

    for hour in range(24):
        #for dayofweek in range(start_day,end_day+1,1):
        for dayofweek in listofdays:
            # get pickups for this hour and weekday
            p = pd.DataFrame(pickups.loc[(hour, dayofweek)]).reset_index()
        
            # add pickups to the Taxi Zones DataFrame       
            df_to_visualize = pd.merge(df_to_visualize, p, on="LocationID", how="left").fillna(0)
            # rename column as per day and hour
            df_to_visualize.rename(columns={"pickups" : "Passenger_%d_%d"%(dayofweek, hour)}, inplace=True)

    return df_to_visualize


#############################   DEFINE FUNCTIONS END   #############################
    
# DECLARE VARIABLES: start date, NoOfDays, pickle_file
start_date = date.today() + timedelta(days=1) # start day is tomorrow
NoOfDays = 3 # number of days for prediction
pickle_file = './model_regGB.pickle'

# RUN FUNCTIONS
input_data = get_input_data(start_date, NoOfDays)

output_data = get_output_data(pickle_file, input_data)

shape_data = load_shape_data()

df_to_visualize = load_taxis_data(output_data,shape_data)

# INITIAL SET PAGE CONFIG
page_title = 'Taxi Demand Predictor'
layout='wide'
initial_sidebar_state = 'expanded'

# SHOW TITLE AND DESCRIPTION
st.title("Manhattan Taxi Demand Predictor")
"""
This is my Master's final project (Master in Data Science - KSCHOOL).
This Machine Learning app lets you predict taxi pickup demand in Manhattan for the next 3 days!

Just choose a day and an hour from the sidebar and hover the mouse over the map.
"""

# SIDE BAR
st.sidebar.title('Choose DAY and TIME')
# add slider widget: Hours
hour = st.sidebar.slider("Hour to look at:",min_value=0, max_value=23, value=7, step=1)
# Buttons title
st.sidebar.text('Day to look at:')

# add buttons widget: Dayofweek
button1_day = date.today() + pd.Timedelta(days=1)
button2_day = date.today() + pd.Timedelta(days=2)
button3_day = date.today() + pd.Timedelta(days=3)
selected_day = str(button1_day)
    
button1 = st.sidebar.button(str(button1_day))
button2 = st.sidebar.button(str(button2_day))
button3 = st.sidebar.button(str(button3_day))
weekday = button1_day.weekday()
if button1:
    weekday = button1_day.weekday()
    selected_day = str(button1_day)
if button2:
    weekday = button2_day.weekday()
    selected_day = str(button2_day)
if button3:
    weekday = button3_day.weekday()
    selected_day = str(button3_day)
    

# ColumnDataSource transforms the data into something that Bokeh and JavaScript understand
df_to_visualize["Passengers"] = df_to_visualize["Passenger_" + str(weekday) + "_" + str(hour)]

source = ColumnDataSource(df_to_visualize)

max_passengers_per_hour = df_to_visualize[filter(lambda x: "Passenger_" in x, df_to_visualize.columns)].max().max()

color_mapper = LinearColorMapper(palette=palette[::-1], high=max_passengers_per_hour, low=0)


##### Color Bar
color_bar = ColorBar(color_mapper=color_mapper,
                     ticker=BasicTicker(),
                     label_standoff=8,
                     location=(0, 0),
                     orientation='vertical')

p = figure(plot_width=450, plot_height=750,
           toolbar_location=None,
           tools='pan,wheel_zoom,box_zoom,reset,save')
p.xaxis.visible = False
p.yaxis.visible = False

p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

# Get rid of zoom on axes:
for t in p.tools:
    if isinstance(t, WheelZoomTool):
        t.zoom_on_axis = False

patches = p.patches(xs="X", ys="Y", source=source,fill_alpha=1,
                  fill_color={'field': 'Passengers',
                              'transform': color_mapper},
                  line_color="black", alpha=0.5)

hovertool = HoverTool(tooltips=[('Zone:', "@ZoneName"),
                                ("Passengers:", "@Passengers")])
p.add_tools(hovertool)

p.add_layout(color_bar, 'right')

st.subheader("Pickups: " + selected_day + " between %i:00 and %i:00" % (hour, (hour + 1) % 24))
st.bokeh_chart(p)

Overwriting streamlit_app.py


# TESTS

gathering data from:  2020-08-26
gathering data from:  2020-08-27
gathering data from:  2020-08-28


ValueError: Expected 2D array, got 1D array instead:
array=[  4.  12.  13. ... 261. 262. 263.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [130]:
def get_LocationIDs():
    # 1. Import Location and Borough columns from NY TAXI ZONES dataset
    dfzones = pd.read_csv('../data/NY_taxi_zones.csv', sep=',',
                          usecols=['LocationID', 'borough'])

    # 2. Filter Manhattan zones
    dfzones = dfzones[dfzones['borough']=='Manhattan']\
                    .drop(['borough'], axis=1)\
                    .sort_values(by='LocationID')\
                    .drop_duplicates('LocationID').reset_index(drop=True)    
    return dfzones

# CREATE DATETIME INFO AND APPEND LOCATION IDs
def datetimeInfo_and_LocID(df_LocIds, start_date, NoOfDays):   

    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    
    # repeat LocationIDs. All of them... for each hour
    location_id_col = pd.concat([df_LocIds]*24*NoOfDays).reset_index(drop=True)

    # create data frame with range of days with hourly period
    df_pred = pd.DataFrame()
    dates = pd.date_range(start = start_date, end = start_date + timedelta(days=NoOfDays), freq = "H")
    df_pred['datetime'] = dates
    df_pred.drop([df_pred.shape[0]-1], inplace=True)

    # Create new columns from datetime
    df_pred['month'] = df_pred['datetime'].dt.month
    df_pred['hour'] = df_pred['datetime'].dt.hour
    # 'dayhour' will serve as index to perform the join
    df_pred['dayhour'] = df_pred['datetime'].dt.strftime('%d%H')
    df_pred['week'] = df_pred['datetime'].dt.week
    df_pred['dayofweek'] = df_pred['datetime'].dt.dayofweek


    # Create date time index calendar
    drange = pd.date_range(start=str(start_date.year)+'-01-01', end=str(start_date.year)+'-12-31')
    cal = calendar()
    holidays = cal.holidays(start=drange.min(), end=drange.max())
    
    # create new columns 'date' and 'isholiday'
    df_pred['date'] = pd.to_datetime(df_pred['datetime'].dt.date)
    # compare the normalized date: the hourly 'datetime' would only match holidays at midnight
    df_pred['isholiday'] = df_pred['date'].isin(holidays).astype(int)
    
    # drop 'date' and 'datetime' columns
    df_pred.drop(['datetime', 'date'], axis=1, inplace=True)

    # repeat rows. 67 rows per hour
    df_pred = df_pred.iloc[np.arange(len(df_pred)).repeat(len(df_LocIds))].reset_index(drop=True)
    #df_index = df_index.iloc[np.arange(len(df_index)).repeat(67)].reset_index(drop=True)

    df_pred = df_pred.join(location_id_col)
    
    return df_pred

# SCRAPE PRECIPITATION FORECAST FROM wunderground.com
def scrape_data(today, days_in):
    # Use .format(YYYY, M, D)
    lookup_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html'

    options = webdriver.ChromeOptions()
    options.add_argument('headless')  # run Chrome in the background

    driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)

    start_date = today + pd.Timedelta(days=1)
    end_date = today + pd.Timedelta(days=days_in + 1)

    df_prep = pd.DataFrame()

    while start_date != end_date:
        timestamp = pd.Timestamp(str(start_date)+' 00:00:00')
        
        print('gathering data from: ', start_date)
        
        formatted_lookup_URL = lookup_URL.format(start_date.year,
                                                 start_date.month,
                                                 start_date.day)

        driver.get(formatted_lookup_URL)
        rows = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, '//td[@class="mat-cell cdk-cell cdk-column-liquidPrecipitation mat-column-liquidPrecipitation ng-star-inserted"]')))
        for row in rows:
            hour = timestamp.strftime('%H')
            day = timestamp.strftime('%d')
            prep = row.find_element_by_xpath('.//span[@class="wu-value wu-value-to"]').text
            # append new row to table
            # 'dayhour' column will serve as column index to perform the Join
            df_prep = df_prep.append(pd.DataFrame({"dayhour":[day+hour], 'precipitation':[prep]}),
                                     ignore_index = True)
            
            timestamp += pd.Timedelta('1 hour')

        start_date += timedelta(days=1)

    # close the browser once all days have been scraped
    driver.quit()
    return df_prep

# GET INPUT DATA USING THE FUNCTIONS ABOVE: LocationIDs and Datetime info
def get_input_data(start_date, NoOfDays):
    # get LocationIDs data frame
    df_LocIds = get_LocationIDs()

    # create datetime info and append LocationsIDs
    dtInfo_and_LocID = datetimeInfo_and_LocID(df_LocIds,start_date,NoOfDays)

    # get precipitation forecast
    prep_forecast = scrape_data(date.today(), NoOfDays)

    # merge both data frames
    df_merged = dtInfo_and_LocID.merge(prep_forecast, on="dayhour", how="left")

    # drop dayhour column
    df_merged = df_merged.drop(['dayhour'], axis=1)
    
    return df_merged

# GET OUTPUT DATA: get predictions, append to input_data and format it to be processed
def get_output_data(pickle_file, input_data):
    import pickle

    # open the file in a context manager so the handle is closed after loading
    with open(pickle_file, 'rb') as f:
        model = pickle.load(f)

    # get prediction, convert to integer and convert Array into DataFrame
    model_predict = (model.predict(input_data)).astype(int)
    df_predict = pd.DataFrame({'pickups':model_predict})

    # join input_data with DataFrame
    joined = input_data.join(df_predict)
    
    output_data = joined[['hour','dayofweek','LocationID','pickups']]
    
    return output_data

# GET DATA FRAME WITH SHAPE GEOMETRY INFO
def load_shape_data():
    shape_data = gpd.read_file('../data/taxi_zones/taxi_zones.shp')

    # filter Manhattan zones
    shape_data = shape_data[shape_data['borough'] == 'Manhattan'].reset_index(drop=True)

    shape_data = shape_data.drop(['borough'], axis=1)

    # EPSG code of Web Mercator
    shape_data.to_crs(epsg=3785, inplace=True)

    # Simplify the zone shapes (otherwise the plot performs slowly)
    shape_data["geometry"] = shape_data["geometry"].simplify(100)

    data = []
    for zonename, LocationID, shape in shape_data[["zone", "LocationID", "geometry"]].values:
        #If shape is polygon, extract X and Y coordinates of boundary line:
        if isinstance(shape, Polygon):
            X, Y = shape.boundary.xy
            X = [int(x) for x in X]
            Y = [int(y) for y in Y]
            data.append([LocationID, zonename, X, Y])

        #If shape is Multipolygon, extract X and Y coordinates of each sub-Polygon:
        if isinstance(shape, MultiPolygon):
            for poly in shape.geoms:
                X, Y = poly.boundary.xy
                X = [int(x) for x in X]
                Y = [int(y) for y in Y]
                data.append([LocationID, zonename, X, Y])

    # Create new DataFrame with X and Y coordinates separated:
    shape_data = pd.DataFrame(data, columns=["LocationID", "ZoneName", "X", "Y"])
    return shape_data

def load_taxis_data(output_data, shape_data):
    df_to_visualize = shape_data.copy()
    pickups = output_data.groupby(['hour','dayofweek','LocationID']).sum()
    listofdays = pd.unique(output_data['dayofweek'])

    for hour in range(24):
        # loop over the weekdays actually present: a numeric range breaks when the window wraps past Sunday
        for dayofweek in listofdays:
            # get pickups for this hour and weekday
            p = pd.DataFrame(pickups.loc[(hour, dayofweek)]).reset_index()
        
            # add pickups to the Taxi Zones DataFrame       
            df_to_visualize = pd.merge(df_to_visualize, p, on="LocationID", how="left").fillna(0)
            # rename column as per day and hour
            df_to_visualize.rename(columns={"pickups" : "Passenger_%d_%d"%(dayofweek, hour)}, inplace=True)

    return df_to_visualize


#############################   DEFINE FUNCTIONS END   #############################

# DECLARE VARIABLES: start date, NoOfDays, pickle_file
start_date = date.today() + timedelta(days=1) # start day is tomorrow
NoOfDays = 3 # number of days for prediction
pickle_file = './model_regGB.pickle'

# RUN FUNCTIONS
input_data = get_input_data(start_date, NoOfDays)

output_data = get_output_data(pickle_file, input_data)

shape_data = load_shape_data()

df_to_visualize = load_taxis_data(output_data,shape_data)

gathering data from:  2020-08-26
gathering data from:  2020-08-27
gathering data from:  2020-08-28


In [157]:
import streamlit as st
import pandas as pd
import numpy as np
from datetime import date, timedelta
pd.options.display.max_columns = None
pd.options.display.max_rows = None

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import geopandas as gpd
from shapely.geometry import Polygon, MultiPolygon
import matplotlib.pyplot as plt

from bokeh.io import output_notebook, output_file, show
from bokeh.plotting import figure
from bokeh.models import HoverTool, Select, ColumnDataSource, WheelZoomTool, LogColorMapper, LinearColorMapper, ColorBar, BasicTicker
from bokeh.palettes import Viridis256 as palette
from bokeh.layouts import row
import altair as alt

input_data.head()
output_data.head()
shape_data.head()
df_to_visualize.head()
df_to_visualize_test = df_to_visualize.copy()
df_to_visualize_test.drop(['X', 'Y'], axis=1, inplace=True)

selected_day = 3
# every predicted weekday except the selected one
days_to_loop = [d for d in pd.unique(output_data['dayofweek']) if d != selected_day]

for hour in range(24):
    for dayofweek in days_to_loop:
        column_to_drop = "Passenger_%d_%d"%(dayofweek, hour)
        df_to_visualize_test.drop([column_to_drop], axis=1, inplace=True)

df_to_visualize_test.drop_duplicates(subset='LocationID', inplace=True)
df_to_visualize_test

Unnamed: 0,LocationID,ZoneName,Passenger_3_0,Passenger_3_1,Passenger_3_2,Passenger_3_3,Passenger_3_4,Passenger_3_5,Passenger_3_6,Passenger_3_7,Passenger_3_8,Passenger_3_9,Passenger_3_10,Passenger_3_11,Passenger_3_12,Passenger_3_13,Passenger_3_14,Passenger_3_15,Passenger_3_16,Passenger_3_17,Passenger_3_18,Passenger_3_19,Passenger_3_20,Passenger_3_21,Passenger_3_22,Passenger_3_23
0,4,Alphabet City,10,5,1,0,0,2,6,4,14,10,5,3,6,6,5,4,2,10,14,13,11,12,13,12
1,12,Battery Park,2,0,0,-1,-1,0,1,-4,0,-1,0,1,6,7,7,7,3,9,9,5,3,0,-2,-3
2,13,Battery Park City,23,10,4,0,1,10,41,79,132,118,96,102,114,114,114,110,100,121,143,136,132,112,84,48
3,24,Bloomingdale,11,4,2,0,0,7,18,35,38,26,24,22,25,25,24,24,19,32,33,28,23,17,14,8
4,41,Central Harlem,15,9,5,2,2,10,24,44,44,26,21,21,25,27,28,28,24,39,39,35,34,28,27,23
5,42,Central Harlem North,8,3,1,0,0,6,15,23,18,5,3,2,6,6,6,6,2,14,14,11,10,5,4,3
6,43,Central Park,22,5,0,-2,0,11,39,86,137,136,169,163,218,224,260,264,249,254,210,176,150,123,92,49
7,45,Chinatown,14,8,3,0,0,0,0,-3,9,10,13,15,21,22,21,21,19,24,24,19,23,19,18,14
8,48,Clinton East,189,131,82,64,72,112,253,291,299,290,254,245,243,234,231,225,222,289,393,420,407,483,545,394
9,50,Clinton West,37,21,11,7,6,11,38,71,111,110,96,85,85,86,83,80,74,89,98,96,92,87,84,75


In [166]:
long_df = pd.wide_to_long(df_to_visualize_test, ["Passenger_3_"], i="LocationID", j="hour")
long_df = long_df.reset_index()
long_df

Unnamed: 0,LocationID,hour,ZoneName,Passenger_3_
0,4,0,Alphabet City,10
1,12,0,Battery Park,2
2,13,0,Battery Park City,23
3,24,0,Bloomingdale,11
4,41,0,Central Harlem,15
5,42,0,Central Harlem North,8
6,43,0,Central Park,22
7,45,0,Chinatown,14
8,48,0,Clinton East,189
9,50,0,Clinton West,37


In [167]:
alt.Chart(long_df).mark_line().encode(
    x='hour',
    y='Passenger_3_',
    color='ZoneName',
    strokeDash='ZoneName',
)

In [164]:
from vega_datasets import data
source = data.stocks()
source.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 3 columns):
symbol    560 non-null object
date      560 non-null datetime64[ns]
price     560 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 13.2+ KB


In [195]:
import json
import altair as alt

shape_data = gpd.read_file('../data/taxi_zones/taxi_zones.shp')

# filter Manhattan zones
shape_data = shape_data[shape_data['borough'] == 'Manhattan'].reset_index(drop=True)

shape_data = shape_data.drop(['borough'], axis=1)

# EPSG code of Web Mercator
shape_data.to_crs(epsg=3785, inplace=True)

# Simplify the zone shapes (otherwise the plot performs slowly)
shape_data["geometry"] = shape_data["geometry"].simplify(100)


choro_json = json.loads(shape_data.to_json())
choro_data = alt.Data(values=choro_json['features'])

color_column ='properties.OBJECTID:Q'

# Add Base Layer
base = alt.Chart(choro_data, title = 'title').mark_geoshape(
        stroke='black',
        strokeWidth=1
        ).encode(
        alt.Color(color_column)
        )



shape_data

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,geometry
0,4,0.043567,0.000112,Alphabet City,4,"POLYGON ((-8234500.227 4971984.094, -8234690.0..."
1,12,0.036661,4.2e-05,Battery Park,12,"POLYGON ((-8239385.311 4968901.615, -8239229.5..."
2,13,0.050281,0.000149,Battery Park City,13,"POLYGON ((-8239027.255 4970990.635, -8239307.0..."
3,24,0.047,6.1e-05,Bloomingdale,24,"POLYGON ((-8233137.952 4982697.872, -8233194.5..."
4,41,0.052793,0.000143,Central Harlem,41,"POLYGON ((-8231824.746 4984298.100, -8231160.3..."
5,42,0.092709,0.000264,Central Harlem North,42,"POLYGON ((-8230335.443 4988211.231, -8230425.1..."
6,43,0.099739,0.00038,Central Park,43,"POLYGON ((-8234586.991 4977725.736, -8234638.3..."
7,45,0.045907,9.1e-05,Chinatown,45,"POLYGON ((-8237364.516 4970257.968, -8236814.3..."
8,48,0.043747,9.4e-05,Clinton East,48,"POLYGON ((-8236660.189 4976319.581, -8236710.8..."
9,50,0.055748,0.000173,Clinton West,50,"POLYGON ((-8237272.410 4978991.629, -8236313.4..."


In [189]:
choro_json['features']

[{'id': '0',
  'type': 'Feature',
  'properties': {'LocationID': 4,
   'OBJECTID': 4,
   'Shape_Area': 0.000111871946192,
   'Shape_Leng': 0.0435665270921,
   'zone': 'Alphabet City'},
  'geometry': {'type': 'Polygon',
   'coordinates': [[[-8234500.226961649, 4971984.09353498],
     [-8234690.098616078, 4970961.811378246],
     [-8235841.600481576, 4971345.374732405],
     [-8235196.293304638, 4972514.643612085],
     [-8234530.955966442, 4972139.749360216],
     [-8234500.226961649, 4971984.09353498]]]}},
 {'id': '1',
  'type': 'Feature',
  'properties': {'LocationID': 12,
   'OBJECTID': 12,
   'Shape_Area': 4.15116236727e-05,
   'Shape_Leng': 0.0366613013579,
   'zone': 'Battery Park'},
  'geometry': {'type': 'Polygon',
   'coordinates': [[[-8239385.3109764205, 4968901.614988105],
     [-8239229.580352136, 4968851.152871054],
     [-8239175.040926091, 4968359.800895385],
     [-8239225.352210234, 4968208.300939143],
     [-8239606.882300402, 4968705.067854418],
     [-8239385.3109764