# NYC Yellow cab dataset -- Main


This notebook walks through a use case for `craftai.pandas`.

It is based on the dataset `yellow.csv`, located in the directory _data/_. (The dataset can be regenerated with the notebook `NYC_Yellow_Cabs_Preprocessing.ipynb`.)

`yellow.csv` has been extracted from the data available on the ___NYC Taxi and Limousine Commission (TLC)___ [webpage](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

In [1]:
import pandas as pd
import numpy as np
import craftai.pandas
import os
from time import time
from multiprocessing import Pool

## Load Data

In [2]:
PATH = 'data/'

In [3]:
yellow = pd.read_csv(PATH + 'yellow.csv')

yellow.pickup_datetime = pd.to_datetime(yellow.pickup_datetime, utc=True)
yellow.set_index('pickup_datetime', drop=True, inplace=True)

yellow.index = yellow.index.tz_convert('America/New_York')

In [4]:
yellow.head()

pickup_datetime,taxi_zone,trip_counter,timezone
2017-01-01 00:00:00-05:00,1,2.0,-05:00
2017-01-01 00:00:00-05:00,4,54.0,-05:00
2017-01-01 00:00:00-05:00,7,66.0,-05:00
2017-01-01 00:00:00-05:00,10,1.0,-05:00
2017-01-01 00:00:00-05:00,12,3.0,-05:00


## Focus on Bronx areas

<img src="http://www.nyc.gov/html/tlc/images/features/taxi_zone_map_bronx.jpg" title="Bronx Taxi Zones" alt="Bronx Taxi Zones" style="width: 300px;"/>

To save computation time, we focus on only 5 of the 43 Bronx taxi zones.

The selected taxi zones are the following: $126$, $147$, $159$, $168$, $247$.

__If you want to compute all the taxi zones, skip the following cell.__

In [5]:
nyc_zones = pd.read_csv('data/taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
bronx_zones = nyc_zones[nyc_zones.Borough == 'Bronx'].LocationID.unique()

selected_zones = [z for z in bronx_zones if z in [126, 147, 159, 168, 247]]

# Keep only the trips picked up in the selected Bronx zones
yellow = yellow[yellow.taxi_zone.isin(selected_zones)]

## Build Agent Datasets

In [6]:
TIME_INDEX = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h", tz='America/New_York')

def build_agent_df(taxi_zone):
    trips = []
    data = yellow[yellow.taxi_zone == taxi_zone]
    for t in TIME_INDEX:
        if t in data.index:
            trips.append(data.trip_counter[data.index == t].values[0])
        else:
            trips.append(0)
    print('--| Zone', taxi_zone, 'computed')
    agent_id = "taxi_zone_{:0>3}".format(taxi_zone)
    return agent_id, {'data': pd.DataFrame(data={'trip_counter':trips, 'timezone':'-05:00'}, index=TIME_INDEX)}
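
The loop in `build_agent_df` looks up every hour of the year one at a time. As a minimal sketch of an alternative, assuming each zone has at most one row per hour, `Series.reindex` can align the counts on the full hourly index and fill the missing hours with 0 in a single call. The data and the helper name `build_hourly_counts` below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical miniature of one zone's data: hourly counts with one missing hour
time_index = pd.date_range("2017-01-01 00:00", periods=4, freq="h", tz="America/New_York")
zone_data = pd.DataFrame(
    {"trip_counter": [2.0, 5.0, 1.0]},
    index=time_index[[0, 1, 3]],  # the 02:00 slot has no recorded trip
)

def build_hourly_counts(data, index):
    """Align trip counts on `index`, filling hours without trips with 0."""
    # reindex requires the timestamps of `data` to be unique
    return data["trip_counter"].reindex(index, fill_value=0)

hourly = build_hourly_counts(zone_data, time_index)
print(hourly.tolist())
```

If a zone could contain several rows for the same hour, the rows would need to be aggregated first (for example with a `groupby` on the index) before reindexing.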

In [7]:
%%time 

# The context manager releases the worker processes once the map is done
with Pool(10) as p:
    full = dict(p.map(build_agent_df, yellow.taxi_zone.unique()))

agent_ids = full.keys()

--| Zone 147 computed
--| Zone 126 computed
--| Zone 159 computed
--| Zone 247 computed
--| Zone 168 computed
CPU times: user 48.8 ms, sys: 79.6 ms, total: 128 ms
Wall time: 3.46 s


for k in sorted(agent_ids):
    print(k)
    print(full[k]['data'].head())
    print('\n\n')

## 2. Connect to the craft ai API


In [8]:
client = craftai.pandas.Client({
  "token": os.environ.get("CRAFT_TOKEN")
})
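
`os.environ.get` returns `None` when the variable is not set, so the client would then receive `None` as its token. The sketch below shows one way to fail early with an explicit message. The helper `read_token` is hypothetical and not part of `craftai`.

```python
import os

def read_token(env=os.environ, name="CRAFT_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is missing."""
    token = env.get(name)
    if not token:
        raise RuntimeError(f"Environment variable {name} is not set")
    return token

# Usage with an explicit mapping, so the example does not depend on the real environment
print(read_token({"CRAFT_TOKEN": "dummy-token"}))
```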

## 3. Create agents


In [9]:
CONFIGURATION = {
    "context": {
        "month_of_year": {
            "type" : "month_of_year"
        },
        "day_of_week": {                  
            "type" : "day_of_week"
        },
        "time": {                    
            "type": "time_of_day"
        },
        "timezone": {                       
            "type" : "timezone",
        },
        "trip_counter": {                          
            "type": "continuous"
        }
    },
    "output": ["trip_counter"],                    # the output is continuous
}

def create_agent(agent_id):
    # Delete older version of the agent
    client.delete_agent(agent_id)

    # Add the new agent
    agent = client.create_agent(CONFIGURATION, agent_id)
    full[agent_id]['agent'] = agent
    print("Agent", agent["id"], "has successfully been created")
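
Before sending a configuration to the API, it can help to check locally that it is consistent. The sketch below only verifies that every property listed in `output` is declared in `context`. The helper `check_configuration` is hypothetical and does not reproduce the validation done by the API itself.

```python
def check_configuration(configuration):
    """Raise ValueError if an output property is not declared in the context."""
    context = configuration.get("context", {})
    missing = [name for name in configuration.get("output", []) if name not in context]
    if missing:
        raise ValueError(f"Outputs missing from the context: {missing}")
    return True

# Same structure as the CONFIGURATION used in this notebook
configuration = {
    "context": {
        "month_of_year": {"type": "month_of_year"},
        "day_of_week": {"type": "day_of_week"},
        "time": {"type": "time_of_day"},
        "timezone": {"type": "timezone"},
        "trip_counter": {"type": "continuous"},
    },
    "output": ["trip_counter"],
}

print(check_configuration(configuration))
```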
    

### Create Agents

In [10]:
%%time

any(map(create_agent, agent_ids))

Agent taxi_zone_126 has successfully been created
Agent taxi_zone_147 has successfully been created
Agent taxi_zone_159 has successfully been created
Agent taxi_zone_168 has successfully been created
Agent taxi_zone_247 has successfully been created
CPU times: user 71.5 ms, sys: 24 ms, total: 95.5 ms
Wall time: 894 ms


False

## 4. Add Agent operations

In [11]:
def add_operations(agent_id):
    print(client.add_operations(agent_id, full[agent_id]['data'])['message'])
    

In [12]:
%%time

any(map(add_operations, agent_ids))

Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_126" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_147" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_159" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_168" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_247" context.
CPU times: user 8.11 s, sys: 37.4 ms, total: 8.14 s
Wall time: 15 s


False

## 5. Get Last Decision Tree

This step is the most time-consuming: in the run below, retrieving the five decision trees takes more than four minutes.

In [13]:
# Unix timestamp (in seconds) of the last operation; all agents share the same index
ts = full[list(agent_ids)[0]]['data'].index.astype(np.int64).values[-1] // 10**9


def get_DT(agent_id):
    
    #create temporary client
    tmp_client = craftai.pandas.Client({
      "token": os.environ.get("CRAFT_TOKEN")
    })
    
    now = time()
    full[agent_id]['tree'] = tmp_client.get_decision_tree(agent_id, timestamp=ts)
    print('DT', agent_id, '--> ok | ', int(time() - now), 'seconds.')
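
The timestamp `ts` passed to `get_decision_tree` is the Unix time, in seconds, of the last hour of the index. As a small sketch of that conversion on its own, the block below rebuilds the same hourly index and uses `Timestamp.timestamp()`, which returns POSIX seconds directly. The variable names are chosen for the example only.

```python
import pandas as pd

# Same hourly index as TIME_INDEX in this notebook
index = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h", tz="America/New_York")

# Timestamp.timestamp() returns POSIX seconds as a float
last_ts = int(index[-1].timestamp())

# 2017-12-31 23:00 in New York is 2018-01-01 04:00 UTC
print(len(index), last_ts)
```

The index holds 8760 hourly steps, which matches the number of operations reported for each agent.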

In [14]:
%%time

any(map(get_DT, agent_ids))

DT taxi_zone_126 --> ok |  53 seconds.
DT taxi_zone_147 --> ok |  33 seconds.
DT taxi_zone_159 --> ok |  66 seconds.
DT taxi_zone_168 --> ok |  62 seconds.
DT taxi_zone_247 --> ok |  38 seconds.
CPU times: user 647 ms, sys: 137 ms, total: 784 ms
Wall time: 4min 15s


False

## 6. Decide 

In [15]:
DECISION_DF = pd.DataFrame(
    ['-05:00'],
    columns=['timezone'],
    index=pd.date_range("2018-01-01 00:00", periods=1, freq="h").tz_localize("America/New_York")
)

DECISION_DF.head()

,timezone
2018-01-01 00:00:00-05:00,-05:00
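
The `timezone` value above is written by hand. As a sketch, assuming decision contexts were needed for other dates as well, the offset string can be derived from the timezone-aware index, which accounts for daylight saving time in New York. The helper `utc_offsets` is hypothetical.

```python
import pandas as pd

def utc_offsets(index):
    """Format the UTC offset of each timestamp as '+HH:MM' or '-HH:MM'."""
    raw = index.strftime("%z")  # e.g. '-0500'
    return [f"{s[:3]}:{s[3:]}" for s in raw]

# One winter date and one summer date, localized to New York time
dates = pd.DatetimeIndex(["2018-01-01 00:00", "2018-07-01 00:00"]).tz_localize("America/New_York")
contexts = pd.DataFrame({"timezone": utc_offsets(dates)}, index=dates)
print(contexts["timezone"].tolist())
```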


In [16]:
def agent_decide(agent_id):
    tree = full[agent_id]['tree']
    decision = client.decide_from_contexts_df(tree, DECISION_DF)
    full[agent_id]['decision'] = decision
    print('--| Decision ok for ', agent_id)

In [17]:
%%time

any(map(agent_decide, agent_ids))

--| Decision ok for  taxi_zone_126
--| Decision ok for  taxi_zone_147
--| Decision ok for  taxi_zone_159
--| Decision ok for  taxi_zone_168
--| Decision ok for  taxi_zone_247
CPU times: user 68.2 ms, sys: 8.02 ms, total: 76.2 ms
Wall time: 73.5 ms


False

## 7. Evaluate Best Taxi Zone 

In [18]:
get_value = lambda df : df.iloc[0].trip_counter_predicted_value

In [19]:
best_agent = list(agent_ids)[0]
max_value = get_value(full[best_agent]['decision'])

for agent_id in list(agent_ids)[1:]:
    agent_value = get_value(full[agent_id]['decision'])
    print('--|', agent_id, agent_value)
    if agent_value > max_value:
        max_value = agent_value
        best_agent = agent_id

--| taxi_zone_147 0.093793064
--| taxi_zone_159 0.43959767
--| taxi_zone_168 2.3933916
--| taxi_zone_247 0.7254565


In [20]:
print('Best Taxi Zone:', best_agent, max_value)

Best Taxi Zone: taxi_zone_168 2.3933916
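
The selection loop can also be expressed with the built-in `max`. The sketch below uses the four predicted values printed by the loop; the value for `taxi_zone_126` is not printed there, so it is left out rather than guessed.

```python
# Predicted values copied from the output of the selection loop
predicted = {
    "taxi_zone_147": 0.093793064,
    "taxi_zone_159": 0.43959767,
    "taxi_zone_168": 2.3933916,
    "taxi_zone_247": 0.7254565,
}

# max iterates over the keys and compares them by their predicted value
best_agent = max(predicted, key=predicted.get)
print("Best Taxi Zone:", best_agent, predicted[best_agent])
```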
