# NYC Yellow cab dataset -- Main


This notebook walks through a use case for `craftai.pandas`.

The use case is based on the dataset `yellow.csv`, located in the _data/_ directory. (The dataset can be regenerated with the notebook `NYC_Yellow_Cabs_Preprocessing.ipynb`.)

`yellow.csv` was extracted from the data available on the ___NYC Taxi and Limousine Commission (TLC)___ [webpage](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

In [1]:
import craftai.pandas
import pandas as pd
import numpy as np

import os
from time import time
from multiprocessing import Pool

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

## 1. Load `yellow` Dataset

`yellow` contains the number of yellow taxis hired in each `taxi_zone`, hour by hour, for the whole of 2017.

In [2]:
PATH = '../data/' # Modify this to fit your data folder

In [3]:
yellow = pd.read_csv(PATH + 'yellow.csv')
yellow.columns = yellow.columns[1:].insert(0, 'timestamp')
yellow.timestamp = pd.to_datetime(yellow.timestamp, utc=True)
yellow.set_index('timestamp', drop=True, inplace=True)
yellow.index = yellow.index.tz_convert('America/New_York')

In [4]:
yellow.head(3)

timestamp,timezone,1,2,3,4,5,6,7,8,9,...,256,257,258,259,260,261,262,263,264,265
2017-01-01 00:00:00-05:00,-05:00,2.0,0.0,0.0,54.0,0.0,0.0,66.0,0.0,0.0,...,50.0,1.0,0.0,2.0,49.0,23.0,35.0,133.0,135.0,13.0
2017-01-01 01:00:00-05:00,-05:00,2.0,0.0,0.0,36.0,0.0,0.0,54.0,1.0,0.0,...,30.0,0.0,4.0,2.0,31.0,17.0,49.0,106.0,86.0,10.0
2017-01-01 02:00:00-05:00,-05:00,2.0,0.0,0.0,28.0,0.0,0.0,42.0,0.0,0.0,...,6.0,0.0,0.0,0.0,16.0,14.0,72.0,92.0,80.0,6.0
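
The loading steps above can be tried on a small in-memory sample. This is a minimal sketch: the two-row `csv_text` is a hypothetical excerpt, and it assumes the file stores naive UTC timestamps in an unnamed first column, which the notebook does not state explicitly.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same wide layout as yellow.csv:
# unnamed timestamp column, then `timezone`, then one column per taxi zone.
csv_text = """,timezone,1,2
2017-01-01 05:00:00,-05:00,2.0,0.0
2017-01-01 06:00:00,-05:00,2.0,0.0
"""

df = pd.read_csv(io.StringIO(csv_text))
# The empty header cell is read as 'Unnamed: 0'; rename it explicitly.
df = df.rename(columns={df.columns[0]: 'timestamp'})
# Parse as UTC first, then convert the index to New York local time.
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
df = df.set_index('timestamp').tz_convert('America/New_York')

print(df)
```

Parsing in UTC before converting avoids ambiguous local times around daylight saving changes.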


## 2. Focus on Queens areas (Recommended)

__If you want to compute all the taxi zones, skip the following cell.__

To save computation time, we focus on only 5 of the 69 Queens taxi zones.

<img src="http://www.nyc.gov/html/tlc/images/features/taxi_zone_map_queens.jpg" title="Queens Taxi Zones" alt="Queens Taxi Zones" style="width: 300px;"/>

The selected taxi zones are the following: $7$, $145$, $146$, $193$, $226$ (NW areas).

In [5]:
selected_zones = [7, 145, 146, 193, 226] # Modify this list to fit your needs

columns = ['timezone'] + [str(z) for z in selected_zones]
yellow = yellow[columns]

For a finer-grained filter, you can check the correspondence between each `Borough` and its `taxi_zone`s (= `LocationID`) with the following lines:

```
nyc_zones = pd.read_csv('../data/taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
boroughs = nyc_zones.Borough.unique()
bronx_zones = nyc_zones[nyc_zones.Borough == 'Bronx'].LocationID.unique()
```
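
As a rough sketch of how such a lookup could drive the zone selection, the example below keeps only the Queens zones that are actually present as columns. The `nyc_zones` frame and the `yellow_columns` list are small hypothetical stand-ins for `taxi_zone_lookup.csv` and `yellow.columns`.

```python
import pandas as pd

# Hypothetical excerpt of the lookup table (the real file has one row per LocationID).
nyc_zones = pd.DataFrame({
    'LocationID': [3, 7, 145, 146, 193, 226],
    'Borough': ['Bronx', 'Queens', 'Queens', 'Queens', 'Queens', 'Queens'],
})
# Stand-in for yellow.columns: zone ids are stored as strings.
yellow_columns = ['timezone', '3', '7', '145', '146', '193']

queens_zones = nyc_zones.loc[nyc_zones.Borough == 'Queens', 'LocationID'].tolist()

# Split the Queens zones into those with data and those without.
available = [z for z in queens_zones if str(z) in yellow_columns]
missing = [z for z in queens_zones if str(z) not in yellow_columns]

print('available:', available)
print('missing:', missing)
```

Checking for missing columns before indexing `yellow` avoids a `KeyError` if a zone id is mistyped.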

## 3. Build Agent Datasets

In this notebook, each `taxi_zone` is handled by its own Craft AI Agent. In other words, each Agent learns independently from the historical records of one `taxi_zone` to predict the taxi demand in that specific area.

Each Agent therefore only needs a dataframe filtered on its own `taxi_zone`.

def build_agent_df(taxi_zone):
    """ Extract the records of the given taxi_zone from the yellow dataset
    """

    # `yellow` is in wide format: one column per taxi zone, named after its id
    data = yellow[['timezone', str(taxi_zone)]].copy()
    data.columns = ['timezone', 'trip_counter']

    print('--| Zone', taxi_zone, 'ready')
    agent_id = "taxi_zone_{:0>3}".format(taxi_zone)
    return agent_id, {'data': data}

### Agents Datasets visualization

```
data = []
for zone in selected_zones:
    agent_id = "taxi_zone_{:0>3}".format(zone)
    trip_counter = yellow[str(zone)]  # hourly trip counts of this zone
    print(agent_id, trip_counter.index[0], trip_counter.iloc[0])
    data.append(go.Scatter(
        x=trip_counter.index,
        y=trip_counter,
        name=agent_id,
        opacity=0.5,
    ))

layout = {
    'title': 'Selected Taxi Zones',
    'xaxis': {
        'title': 'Time',
        # initial view: first day of 2017
        'range': [pd.to_datetime('2017-01-01 00:00'), pd.to_datetime('2017-01-02 00:00')]},
    'yaxis': {'title': '#Clients'},
    'font': dict(size=16)}

iplot({'data': data, 'layout': layout})
```

## 2. Connect to the Craft AI API

Connect to the Craft AI API by creating a `client` from the user's token, read here from the `CRAFT_TOKEN` environment variable.

In [6]:
client = craftai.pandas.Client({
  "token": os.environ.get("CRAFT_TOKEN")
})
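
If `CRAFT_TOKEN` is not set, `os.environ.get` silently returns `None`. A small guard makes that failure explicit before any request is sent. This is a minimal sketch; `read_token` is a hypothetical helper, not part of `craftai`.

```python
import os

def read_token(env=os.environ, name="CRAFT_TOKEN"):
    """Return the token stored under `name`, or raise a readable error."""
    token = env.get(name)
    if not token:
        raise RuntimeError(
            "Environment variable {} is not set; export it before "
            "starting the notebook.".format(name))
    return token

# Demonstration with an explicit mapping instead of the real environment.
print(read_token({"CRAFT_TOKEN": "demo-token"}))
```

The client could then be created with `craftai.pandas.Client({"token": read_token()})`.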

## 3. Create Agents

As mentioned above, each Agent deals with a single `taxi_zone`.

### 3.1 Setup the configuration

In [7]:
agents = pd.DataFrame(
    data={
        'zone'    :selected_zones,
        'agent_id':["taxi_zone_{:0>3}".format(z) for z in selected_zones]},
)
agents

,zone,agent_id
0,7,taxi_zone_007
1,145,taxi_zone_145
2,146,taxi_zone_146
3,193,taxi_zone_193
4,226,taxi_zone_226


In [8]:
# CONFIGURATION is the same for all agents
CONFIGURATION = {
    "context": {
        "day_of_week": {                # feature generated by the API from the DataFrame index 
            "type" : "day_of_week"
        },
        "time": {                       # feature generated by the API from the DataFrame index
            "type": "time_of_day"
        },
        "timezone": {                   # timezone for trip_counter      
            "type" : "timezone",        
        },
        "trip_counter": {               # taxi trips counter            
            "type": "continuous"
        }
    },
    "output": ["trip_counter"],         # the output is continuous
}


def setup_agent(row):
    """ Create the Agent with the given id and
        send it the records of its own taxi zone
    """

    # Delete older version of the agent
    client.delete_agent(row.agent_id)

    # Add the new agent
    client.create_agent(CONFIGURATION, row.agent_id)
    
    # Add operations
    data = yellow[['timezone', str(row.zone)]]
    data.columns = ['timezone', 'trip_counter']
    client.add_operations(row.agent_id, data)
    
    return True
    #print("Agent", row.agent_id, "has successfully been setup")
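
To get a feel for the two features that `CONFIGURATION` marks as generated from the DataFrame index, the sketch below computes comparable values locally with pandas. It assumes that `day_of_week` runs from 0 (Monday) to 6 (Sunday) and that `time_of_day` is expressed in decimal hours; both encodings are assumptions that this notebook does not confirm.

```python
import pandas as pd

index = pd.date_range("2018-01-01 00:00", periods=3, freq="90min").tz_localize("America/New_York")

# Local approximation of the generated features (encodings are assumptions).
features = pd.DataFrame({
    "day_of_week": index.dayofweek,            # 0 = Monday
    "time": index.hour + index.minute / 60,    # decimal hours, local time
}, index=index)

print(features)
```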
    

In [9]:
%%time

agents['setup'] = agents.apply(setup_agent, axis=1)
agents

CPU times: user 15.6 s, sys: 82.6 ms, total: 15.7 s
Wall time: 23.7 s


### 3.2 Retrieve Last Decision Tree

Each Agent learns from its operations; we then ask the API for its latest Decision Tree.

__This step is the most time consuming.__

__TODO__ use asynchronous requests to the API: https://docs.python.org/3/library/asyncio.html.

In [10]:
TS = int(yellow.index[-1].timestamp())  # Unix timestamp (in seconds) of the last record

get_DT = lambda agent_id : client.get_decision_tree(agent_id, timestamp=TS)
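
`TS` is the last record's time expressed in seconds since 1970-01-01 UTC. The short sketch below shows that conversion on a hypothetical two-hour index ending at the last hour of 2017, New York time.

```python
import pandas as pd

index = pd.date_range("2017-12-31 22:00", periods=2, freq="h", tz="America/New_York")

last = index[-1]                      # 2017-12-31 23:00 in New York
ts_seconds = int(last.timestamp())    # seconds since 1970-01-01 UTC

# Round trip: the integer maps back to the same instant.
restored = pd.Timestamp(ts_seconds, unit="s", tz="UTC")
print(ts_seconds, restored)
```

The time zone of the index does not change the result, since a Unix timestamp always refers to UTC.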

In [11]:
%%time

agents['decision_tree'] = agents.agent_id.apply(get_DT)

CPU times: user 70.4 ms, sys: 3.93 ms, total: 74.3 ms
Wall time: 3min 46s
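
Since almost all of the wall time is spent waiting for the API, one possible way to address the TODO is to issue the requests from a thread pool. This is a sketch under assumptions: `fetch_tree` is a hypothetical stand-in that only simulates latency, and whether the real `client` can safely be shared between threads, or whether the API limits concurrent requests, is not covered by this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_tree(agent_id):
    """Stand-in for client.get_decision_tree(agent_id, timestamp=TS)."""
    time.sleep(0.1)  # simulate network latency
    return {"agent": agent_id}

agent_ids = ["taxi_zone_007", "taxi_zone_145", "taxi_zone_146"]

start = time.time()
with ThreadPoolExecutor(max_workers=len(agent_ids)) as pool:
    # map() preserves the input order, so results line up with agent_ids
    trees = list(pool.map(fetch_tree, agent_ids))

print(len(trees), "trees in", round(time.time() - start, 2), "s")
```

In the notebook, the result could be assigned with `agents['decision_tree'] = trees`, provided the list is built from `agents.agent_id` in the same order.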


## 4. Decision 

For a given timestamp (here _2018-01-01 00:00_), ask each Agent to make a prediction.

### 4.1 Decision Dataframe setup

In [13]:
# DECISION_DF is the same for all Agents as we want to compare taxi zones
DECISION_DF = pd.DataFrame(
    ['-05:00'],
    columns=['timezone'],
    index=pd.date_range("2018-01-01 00:00", periods=1, freq="h").tz_localize("America/New_York")
)

DECISION_DF.head()

,timezone
2018-01-01 00:00:00-05:00,-05:00
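
The `-05:00` offset is hard-coded here, which is only valid in winter for New York. For decision dates spread over the year, the offset could be derived from the index itself. A minimal sketch, where `utc_offsets` is a hypothetical helper and the `±HH:MM` string format is simply the one this notebook already uses:

```python
import pandas as pd

def utc_offsets(index):
    """Format the UTC offset of each timestamp as '+HH:MM' or '-HH:MM'."""
    raw = index.strftime("%z")                   # e.g. '-0500'
    return [s[:3] + ":" + s[3:] for s in raw]    # e.g. '-05:00'

index = pd.DatetimeIndex(["2018-01-01 00:00", "2018-07-01 00:00"]).tz_localize("America/New_York")
decision_df = pd.DataFrame({"timezone": utc_offsets(index)}, index=index)

print(decision_df)
```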


### 4.2 Make Decision

Ask each Agent to estimate the taxi demand in its own `taxi_zone` from the context given in `DECISION_DF`.

In [14]:
def decide(tree):
    decision = client.decide_from_contexts_df(tree, DECISION_DF)
    return pd.Series({c:decision[c].values[0] for c in decision.columns})


agents = agents.merge(
    agents.decision_tree.apply(decide), 
    left_index=True, 
    right_index=True
)

agents

,zone,agent_id,setup,decision_tree,trip_counter_predicted_value,trip_counter_confidence,trip_counter_decision_rules,trip_counter_standard_deviation
0,7,taxi_zone_007,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",18.160643,0.675409,"[{'property': 'time', 'operator': '[in[', 'ope...",8.715281
1,145,taxi_zone_145,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",7.193333,0.669811,"[{'property': 'time', 'operator': '[in[', 'ope...",3.534098
2,146,taxi_zone_146,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",11.977649,0.721389,"[{'property': 'day_of_week', 'operator': '[in[...",4.623188
3,193,taxi_zone_193,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",5.577778,0.6514,"[{'property': 'time', 'operator': '[in[', 'ope...",2.969146
4,226,taxi_zone_226,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",23.36001,0.808327,"[{'property': 'time', 'operator': '[in[', 'ope...",5.475126
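
The merge above relies on a pandas behaviour that is easy to miss: applying a function that returns a `Series` to each element of a column produces a DataFrame, which can then be joined back on the index. The sketch below reproduces the pattern offline; `decide_demo` and the toy `tree` dictionaries are hypothetical stand-ins for `decide` and the real decision trees.

```python
import pandas as pd

agents_demo = pd.DataFrame({
    "zone": [7, 145],
    "tree": [{"mean": 18.0}, {"mean": 7.0}],
})

def decide_demo(tree):
    """Stand-in for decide(): one value per output column."""
    return pd.Series({"predicted_value": tree["mean"], "confidence": 0.5})

# Series.apply with a Series-returning function yields one column per key.
decisions = agents_demo.tree.apply(decide_demo)
merged = agents_demo.merge(decisions, left_index=True, right_index=True)

print(merged)
```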


## 5. Evaluate Best Taxi Zone 

Based on the estimates of all the Agents, find the `taxi_zone` with the most people looking for a taxi.

In [15]:
print('Zones sorted by predicted number of clients:')
agents.sort_values('trip_counter_predicted_value', ascending=False)

Zones sorted by predicted number of clients:


,zone,agent_id,setup,decision_tree,trip_counter_predicted_value,trip_counter_confidence,trip_counter_decision_rules,trip_counter_standard_deviation
4,226,taxi_zone_226,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",23.36001,0.808327,"[{'property': 'time', 'operator': '[in[', 'ope...",5.475126
0,7,taxi_zone_007,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",18.160643,0.675409,"[{'property': 'time', 'operator': '[in[', 'ope...",8.715281
2,146,taxi_zone_146,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",11.977649,0.721389,"[{'property': 'day_of_week', 'operator': '[in[...",4.623188
1,145,taxi_zone_145,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",7.193333,0.669811,"[{'property': 'time', 'operator': '[in[', 'ope...",3.534098
3,193,taxi_zone_193,True,"{'_version': '1.1.0', 'trees': {'trip_counter'...",5.577778,0.6514,"[{'property': 'time', 'operator': '[in[', 'ope...",2.969146


__Result:__

A taxi driver working in the north-western Queens areas at 00:00 on 2018-01-01 should therefore drive through zones $226$ and $7$ to maximize the chances of picking up clients.
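
The two best zones can also be extracted programmatically instead of being read off the sorted table. A minimal sketch, using the predicted values displayed above copied into a standalone `predictions` frame:

```python
import pandas as pd

# Predicted values copied from the table above.
predictions = pd.DataFrame({
    "zone": [7, 145, 146, 193, 226],
    "trip_counter_predicted_value": [18.160643, 7.193333, 11.977649, 5.577778, 23.36001],
})

# nlargest keeps the rows with the highest predicted demand, best first.
top2 = predictions.nlargest(2, "trip_counter_predicted_value").zone.tolist()
print(top2)
```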

## 6. Summary

Using the Craft AI API and the 2017 NYC yellow taxi data, we were able to identify the New York areas where taxis are most likely to be needed at a given time.