# NYC Yellow cab dataset -- Main


This notebook walks through a `craftai.pandas` use case.

This use case is based on the dataset `yellow.csv` located in the directory _data/_. (It is possible to regenerate this dataset by using the notebook `NYC_Yellow_Cabs_Preprocessing.ipynb`.)

`yellow.csv` has been extracted from the data available on the ___NYC Taxi and Limousine Commission (TLC)___ [webpage](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

In [1]:
import craftai.pandas
import pandas as pd
import numpy as np

import os
from time import time
from multiprocessing import Pool

## 1. Load `yellow` Dataset

`yellow` contains the number of yellow taxis hired in each `taxi_zone`, hour by hour, for the whole of 2017.

In [4]:
PATH = '../data/' # Modify this to fit your data folder

In [5]:
yellow = pd.read_csv(PATH + 'yellow.csv')

# Index on time
yellow.pickup_datetime = pd.to_datetime(yellow.pickup_datetime, utc=True)
yellow.set_index('pickup_datetime', drop=True, inplace=True)

# Set NYC timezone
yellow.index = yellow.index.tz_convert('America/New_York')

In [6]:
yellow.head(3)

pickup_datetime,taxi_zone,trip_counter,timezone
2017-01-01 00:00:00-05:00,1,2.0,-05:00
2017-01-01 00:00:00-05:00,4,54.0,-05:00
2017-01-01 00:00:00-05:00,7,66.0,-05:00


## 2. Focus on Bronx areas (Recommended)

__If you want to compute all the taxi zones, skip the following cell.__

To save computation time, we focus on only 5 of the 43 Bronx taxi zones.

<img src="http://www.nyc.gov/html/tlc/images/features/taxi_zone_map_bronx.jpg" title="Bronx Taxi Zones" alt="Bronx Taxi Zones" style="width: 300px;"/>

The selected taxi zones are the following: $126$, $147$, $159$, $168$, $247$.

In [5]:
selected_zones = [126, 147, 159, 168, 247] # Modify this list to fit your needs

# filter dataset by selected_zones
yellow = yellow[yellow.taxi_zone.isin(selected_zones)]

To refine this filtering step, you can check the correspondence between each `Borough` and its `taxi_zone`s (= `LocationID`) with the following lines:

```
nyc_zones = pd.read_csv('data/taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
boroughs = nyc_zones.Borough.unique()
bronx_zones = nyc_zones[nyc_zones.Borough == 'Bronx'].LocationID.unique()
```

## 3. Build Agent Datasets

In this notebook, each `taxi_zone` is handled independently by its own Craft AI Agent. In other words, each Agent learns from the historical records of a single `taxi_zone` to predict the taxi demand in that specific area.

Each Agent therefore only needs a dataframe filtered on its `taxi_zone`.

In [6]:
TIME_INDEX = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h", tz='America/New_York')

def build_agent_df(taxi_zone):
    """ Filter the yellow dataset by the given taxi_zone
    """
    
    trips = []
    data = yellow[yellow.taxi_zone == taxi_zone]
    
    for t in TIME_INDEX:
        if t in data.index:
            trips.append(data.trip_counter[data.index == t].values[0])
        else:
            trips.append(0)
            
    print('--| Zone', taxi_zone, 'ready')
    agent_id = "taxi_zone_{:0>3}".format(taxi_zone)
    return agent_id, {'data': pd.DataFrame(data={'trip_counter':trips, 'timezone':'-05:00'}, index=TIME_INDEX)}
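
The hour-by-hour loop above can also be expressed with `reindex`. The following is a minimal sketch on a made-up miniature dataframe, not the real `yellow` data; `toy`, `hours` and `build_agent_series` are hypothetical names introduced for illustration. It assumes there is at most one row per hour for a given zone, since `reindex` requires a unique index.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `yellow` dataframe (not the real data)
idx = pd.to_datetime(
    ["2017-01-01 00:00", "2017-01-01 02:00", "2017-01-01 00:00"]
).tz_localize("America/New_York")
toy = pd.DataFrame(
    {"taxi_zone": [126, 126, 147], "trip_counter": [3.0, 5.0, 7.0]},
    index=idx,
)

# Complete hourly index, analogous to TIME_INDEX but only four hours long
hours = pd.date_range("2017-01-01 00:00", periods=4, freq="h", tz="America/New_York")


def build_agent_series(df, taxi_zone, time_index):
    """Return the hourly trip counts of one zone, with 0 for hours without trips."""
    counts = df.loc[df.taxi_zone == taxi_zone, "trip_counter"]
    # Assumes one row per hour at most: reindex fails on a duplicated index
    return counts.reindex(time_index, fill_value=0)


zone_126 = build_agent_series(toy, 126, hours)
print(zone_126.tolist())
```

Hours missing from the filtered data are filled with 0, as in `build_agent_df`, but without scanning the index once per hour.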

In [7]:
%%time 

# The context manager releases the worker processes once the map is done
with Pool(10) as p:
    agents = dict(p.map(build_agent_df, yellow.taxi_zone.unique()))

--| Zone 126 ready
--| Zone 147 ready
--| Zone 159 ready
--| Zone 247 ready
--| Zone 168 ready
CPU times: user 38.7 ms, sys: 101 ms, total: 140 ms
Wall time: 3.36 s


In [8]:
agent_ids = list(agents.keys())

del yellow # free memory 

## 2. Connect to the Craft AI API

Open a link to the Craft AI API by creating a `client` from the user's token, read here from the `CRAFT_TOKEN` environment variable.

In [9]:
client = craftai.pandas.Client({
  "token": os.environ.get("CRAFT_TOKEN")
})

## 3. Create Agents

As mentioned above, each agent deals with a single `taxi_zone`.

### 3.1 Setup the configuration

In [10]:
# CONFIGURATION is the same for all agents
CONFIGURATION = {
    "context": {
        "day_of_week": {                # feature generated by the API from the DataFrame index 
            "type" : "day_of_week"
        },
        "time": {                       # feature generated by the API from the DataFrame index
            "type": "time_of_day"
        },
        "timezone": {                   # timezone for trip_counter      
            "type" : "timezone",        
        },
        "trip_counter": {               # taxi trips counter            
            "type": "continuous"
        }
    },
    "output": ["trip_counter"],         # the output is continuous
}

def create_agent(agent_id):
    """ Initiate Agent with the given id
    """
    
    # Delete older version of the agent
    client.delete_agent(agent_id)

    # Add the new agent
    agent = client.create_agent(CONFIGURATION, agent_id)
    agents[agent_id]['agent'] = agent
    print("Agent", agent["id"], "has successfully been created")
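
To get a feel for the two generated features, the sketch below computes comparable values locally with pandas. It is an illustration only: it assumes `day_of_week` counts from Monday = 0 to Sunday = 6 and `time_of_day` is the local hour expressed as a decimal number, which should be checked against the Craft AI documentation. The `features` dataframe is a hypothetical name and is not sent to the API.

```python
import pandas as pd

index = pd.date_range("2017-01-01 00:00", periods=3, freq="h", tz="America/New_York")

# Local approximation of the features the API derives from the index
# (assumed conventions: Monday = 0 ... Sunday = 6, time as decimal hours)
features = pd.DataFrame(
    {
        "day_of_week": index.dayofweek.to_numpy(),
        "time": (index.hour + index.minute / 60.0).to_numpy(),
    },
    index=index,
)
print(features)
```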
    

In [11]:
%%time

any(map(create_agent, agent_ids))

Agent taxi_zone_126 has successfully been created
Agent taxi_zone_147 has successfully been created
Agent taxi_zone_159 has successfully been created
Agent taxi_zone_168 has successfully been created
Agent taxi_zone_247 has successfully been created
CPU times: user 90.5 ms, sys: 23.3 ms, total: 114 ms
Wall time: 948 ms


False

### 3.2 Add Agent operations

Send each Agent its dataframe as context operations.

In [12]:
def add_operations(agent_id):
    """ Send the agent's dataframe to the API and print the returned message
    """
    print(client.add_operations(agent_id, agents[agent_id]['data'])['message'])

In [13]:
%%time

any(map(add_operations, agent_ids))

Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_126" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_147" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_159" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_168" context.
Successfully added 8760 operation(s) to the agent "yrieix.leprince/NYC-taxi/taxi_zone_247" context.
CPU times: user 8.74 s, sys: 61.9 ms, total: 8.8 s
Wall time: 15.9 s


False

### 3.3 Retrieve Last Decision Tree

Each Agent learns from its operations. We then ask the API for the most recent Decision Tree of each Agent.

__This step is the most time consuming.__

__TODO__ use asynchronous requests to the API: https://docs.python.org/3/library/asyncio.html.

In [14]:
# Last timestamp of the data, converted from nanoseconds to seconds
TS = agents[agent_ids[0]]['data'].index.astype(np.int64).values[-1] // 10**9


def get_DT(agent_id):
    """ Retrieve the last Decision Tree from the API
    """
    
    now = time()
    agents[agent_id]['tree'] = client.get_decision_tree(agent_id, timestamp=TS)
    print('DT', agent_id, '--> ok | ', int(time() - now), 'seconds.')
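
The `TS` computation relies on pandas storing timestamps as nanoseconds since the Unix epoch in UTC, whatever the timezone of the index. A small check on a two-hour index (a sketch, independent of the agents' data; `index`, `ts_from_int` and `ts_from_method` are illustrative names):

```python
import pandas as pd

index = pd.date_range("2017-12-31 22:00", periods=2, freq="h", tz="America/New_York")

# Integer view of the index: nanoseconds since 1970-01-01 00:00 UTC
ts_from_int = int(index.astype("int64")[-1] // 10**9)

# Same value obtained from the Timestamp object itself
ts_from_method = int(index[-1].timestamp())

print(ts_from_int, ts_from_method)
```

Both values correspond to 2017-12-31 23:00 in New York, that is 2018-01-01 04:00 UTC.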

In [15]:
%%time

any(map(get_DT, agent_ids))

DT taxi_zone_126 --> ok |  55 seconds.
DT taxi_zone_147 --> ok |  31 seconds.
DT taxi_zone_159 --> ok |  50 seconds.
DT taxi_zone_168 --> ok |  48 seconds.
DT taxi_zone_247 --> ok |  32 seconds.
CPU times: user 381 ms, sys: 110 ms, total: 491 ms
Wall time: 3min 39s


False
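
Since most of the time is spent waiting for the API, the requests could be issued concurrently. The sketch below uses a thread pool from `concurrent.futures` rather than `asyncio`, and `fetch_tree` is a hypothetical stand-in that only sleeps; it does not call Craft AI. Whether the `craftai` client can safely be shared between threads is an assumption to verify before replacing the stand-in with `client.get_decision_tree`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_tree(agent_id):
    """Hypothetical stand-in for client.get_decision_tree(agent_id, timestamp=TS)."""
    time.sleep(0.1)  # simulates the time spent waiting for the API
    return {"agent": agent_id}


agent_ids_demo = ["taxi_zone_126", "taxi_zone_147", "taxi_zone_159"]

with ThreadPoolExecutor(max_workers=len(agent_ids_demo)) as pool:
    # pool.map preserves the input order, so results line up with agent_ids_demo
    trees = dict(zip(agent_ids_demo, pool.map(fetch_tree, agent_ids_demo)))

print(sorted(trees))
```

With real requests, the total wall time would be closer to the slowest single request than to the sum of all requests, provided the API accepts parallel calls.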

## 4. Decision 

For a given timestamp (here _2018-01-01 00:00_), ask each Agent to make a prediction.

### 4.1 Decision Dataframe setup

In [16]:
# DECISION_DF is the same for all Agents as we want to compare taxi zones
DECISION_DF = pd.DataFrame(
    ['-05:00'],
    columns=['timezone'],
    index=pd.date_range("2018-01-01 00:00", periods=1, freq="h").tz_localize("America/New_York")
)

DECISION_DF.head()

,timezone
2018-01-01 00:00:00-05:00,-05:00
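
The same construction extends to several timestamps. As a sketch, the dataframe below covers the 24 hours of 2018-01-01 and derives the `timezone` column from the index instead of hard-coding `'-05:00'`; `decision_df_24h` is a hypothetical name, and passing it to `decide_from_contexts_df` in place of `DECISION_DF` is an assumption that has not been run here.

```python
import pandas as pd

index = pd.date_range("2018-01-01 00:00", periods=24, freq="h", tz="America/New_York")

# strftime('%z') yields offsets such as '-0500'; insert the colon to get '-05:00'
raw_offsets = index.strftime("%z")
offsets = [s[:3] + ":" + s[3:] for s in raw_offsets]

decision_df_24h = pd.DataFrame({"timezone": offsets}, index=index)
print(decision_df_24h.head(3))
```

Deriving the offset from the index keeps the column correct for dates that fall in daylight saving time, when New York is at `-04:00`.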


### 4.2 Make Decision

Ask each Agent to estimate the taxi demand for its own `taxi_zone` using `DECISION_DF`.

In [17]:
def agent_decide(agent_id):
    tree = agents[agent_id]['tree']
    decision = client.decide_from_contexts_df(tree, DECISION_DF)
    agents[agent_id]['decision'] = decision
    print('--| Decision ok for ', agent_id)

In [18]:
%%time

any(map(agent_decide, agent_ids))

--| Decision ok for  taxi_zone_126
--| Decision ok for  taxi_zone_147
--| Decision ok for  taxi_zone_159
--| Decision ok for  taxi_zone_168
--| Decision ok for  taxi_zone_247
CPU times: user 51 ms, sys: 163 µs, total: 51.2 ms
Wall time: 48.1 ms


False

## 5. Evaluate Best Taxi Zone 

Based on the estimates of all Agents, find the `taxi_zone` with the most people looking for a taxi.

In [19]:
get_value = lambda df : df.iloc[0].trip_counter_predicted_value

In [20]:
# Initiate
best_agent = list(agent_ids)[0]
max_value = get_value(agents[best_agent]['decision'])

# Search for the best taxi_zone
for agent_id in list(agent_ids):
    agent_value = get_value(agents[agent_id]['decision'])
    print('--|', agent_id, int(agent_value*10)/10)
    if agent_value > max_value:
        max_value = agent_value
        best_agent = agent_id

--| taxi_zone_126 0.0
--| taxi_zone_147 0.0
--| taxi_zone_159 0.3
--| taxi_zone_168 2.4
--| taxi_zone_247 1.0


__Result:__

In [21]:
print('Best Taxi Zone:', best_agent)
print('Number of clients:', int(max_value*10)/10)

Best Taxi Zone: taxi_zone_168
Number of clients: 2.4
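
The search loop can also be written with a pandas `Series`, which gives the full ranking as well as the best zone. A minimal sketch, using the rounded values printed above as sample data (`predictions` is a hypothetical name, not a variable of this notebook):

```python
import pandas as pd

# Rounded predictions copied from the output above, used here as sample data
predictions = pd.Series({
    "taxi_zone_126": 0.0,
    "taxi_zone_147": 0.0,
    "taxi_zone_159": 0.3,
    "taxi_zone_168": 2.4,
    "taxi_zone_247": 1.0,
})

ranking = predictions.sort_values(ascending=False)  # highest demand first
best_zone = predictions.idxmax()                    # label of the largest value

print(ranking.head(3))
print("Best Taxi Zone:", best_zone)
```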


## 6. Summary

Using the Craft AI API and the 2017 NYC yellow taxi data, we trained one Agent per taxi zone and compared their predictions to find the zone where taxis are most likely to be needed at a given time.