# ACM: Night at Lyft

## Python Starter Notebook

<strong>Prompt:</strong> In a city served only by taxis from the Bauer Taxi Service, your team is in charge of earning Lyft more business and winning riders over. The ACM Executive team has provided this starter notebook, which shows how to use the RESTful API for the event. The API has 5 public endpoints, listed below:

<ul>
    <li><strong>/time:</strong> GET</li>
        <ul>
            <li>Returns the simulation period you are developing your pricing model for.</li>
        </ul>
    <li><strong>/trips:</strong> GET</li>
        <ul>
            <li>Returns the taxi trips that occurred and match the specified criteria.</li>
        </ul>
    <li><strong>/count:</strong> GET</li>
        <ul>
            <li>Returns the total number of trips that occurred during a given time period.</li>
        </ul>
    <li><strong>/pricing:</strong> POST</li>
        <ul>
            <li>Submits your team's general pricing strategy.</li>
        </ul>
    <li><strong>/zones:</strong> POST</li>
        <ul>
            <li>Submits the second component of your pricing strategy: the areas you believe are power zones.</li>
        </ul>
</ul>

<strong>Game Overview:</strong> There are 5 simple steps to follow to succeed at this HackNight!
<ol>
    <li>Wait for a "simulation" (access to taxi data of a specific week and prior)</li>
    <li>Use the endpoints with associated GET requests to do data analysis to develop your pricing model.</li>
    <li>Submit your pricing model</li>
    <li>View your revenue generated (the results) on the diagram shown at the end of each simulation.</li>
    <li>Repeat this process for every simulation</li>
</ol>

## Learning Objective: Using a RESTful API

An API (Application Programming Interface) is a set of methods that controls how a programmer can access a resource (in this case, the taxi dataset). Here, the Lyft HackNight API bounds access to taxi data by the simulation: you can only query taxi data for a date or date range that falls within the simulation date range or earlier.

Now, let's get started with the setup process.

### Initial Setup and Imports

In [1]:
# Required imports
import requests # Python's go-to HTTP request library
from datetime import datetime

In [2]:
# Environment and other variables for the API to know who you are
URL = "https://lyftserver--acmiit.repl.co"
TEAM = "YOUR_TEAM_SECRET"

### Our first GET request: /time

Format of a GET request with Python's requests library:

requests.get(SERVER_URL + "/ENDPOINT_NAME/", params=QUERY_DICT).json()
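To make the role of `params` concrete, here is a minimal sketch of how a query dictionary ends up in the URL. It uses the standard library's `urlencode` to approximate the encoding that requests performs; the team value `"EXAMPLE_TEAM"` is a placeholder, not a real secret.

```python
from urllib.parse import urlencode

URL = "https://lyftserver--acmiit.repl.co"

# Placeholder query; "EXAMPLE_TEAM" is not a real team secret
query = {
    "team": "EXAMPLE_TEAM",
    "start": "10/20/2019 2:00 PM",
    "limit": 1,
}

# urlencode percent-encodes reserved characters and turns spaces into "+"
full_url = URL + "/trips/?" + urlencode(query)
print(full_url)
```

Passing the dictionary through `params` saves you from doing this encoding by hand.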

In [3]:
requests.get(URL + "/time/").json()

{'sucess': True,
 'time': 1573365600000,
 'message': 'The simulation is not over, the time is 11/10/2019.'}

Note that the /time endpoint doesn't take any parameters, so there's no need to specify "params" in the method call. However, in the /trips endpoint below, we'll do just that.
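The `time` value looks like a Unix timestamp in milliseconds. That is an assumption, but it is consistent with the date in the message (midnight on 11/10/2019 in US Central time). A minimal sketch of converting it to a readable datetime, using a hypothetical helper named `simulation_time`:

```python
from datetime import datetime, timezone

def simulation_time(time_ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc)

# Value copied from the /time response above
sim_time = simulation_time(1573365600000)
print(sim_time.isoformat())
```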

### /trips endpoint

The /trips endpoint allows us to find all trips that occurred in a specific time range. Note, however, that you can only search for trips that happened during or before the current simulation period. We don't want you to predict the future!

<strong>Endpoint: /trips</strong>
<ul>
    <li>team (str): the string name of your team as indicated in the "TEAM" variable above</li>
    <li>start (str): the date and time you want to start searching from, in format %m/%d/%Y %I:%M %p (12-hour clock followed by AM or PM)</li>
        <ul>
            <li>Example: "10/07/2017 5:00 PM"</li>
        </ul>
    <li>end (str): the date and time you want to end the search on, in the same format as start</li>
    <li>limit (int): maximum number of results to return</li>
    <li>offset (int): indicates you only want to receive data points past this index</li>
</ul>
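If you build time ranges programmatically, a small formatter avoids typos in the date strings. The sketch below uses a hypothetical helper, `format_api_time`, that produces strings in the same style as the queries in this notebook (for example "10/20/2019 2:00 PM"). Whether the server also accepts other spellings is not something this notebook confirms.

```python
from datetime import datetime

def format_api_time(dt):
    """Format a datetime like "10/20/2019 2:00 PM" (12-hour clock, no leading zero on the hour)."""
    hour12 = dt.hour % 12 or 12                 # 0 -> 12 (AM), 13 -> 1 (PM)
    suffix = "AM" if dt.hour < 12 else "PM"
    return f"{dt.month:02d}/{dt.day:02d}/{dt.year} {hour12}:{dt.minute:02d} {suffix}"

start = format_api_time(datetime(2019, 10, 20, 14, 0))
end = format_api_time(datetime(2019, 10, 20, 15, 0))
print(start, "->", end)
```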

In [4]:
# Build the query
trips_query = {
    "team": TEAM,
    "start": "10/20/2019 2:00 PM",
    "end": "10/20/2019 3:00 PM",
    "limit": 1
}

# Send the request
trips_response = requests.get(URL + "/trips/", params=trips_query).json()
trips_response

{'success': True,
 'length': 1,
 'response': [{'dropoff_centroid_latitude': '41.944226601',
   'dropoff_centroid_longitude': '-87.655998182',
   'dropoff_community_area': '6',
   'extras': '0',
   'fare': '4.5',
   'pickup_centroid_latitude': '41.944226601',
   'pickup_centroid_longitude': '-87.655998182',
   'pickup_community_area': '6',
   'taxi_id': 'be435bc4d6744155b3272e9edded016bd4afb34777470147a81d6a7e77f17dd155fdd830169271731a5316e9a6c565a619903ae4f116598fcfb9bf22591850a2',
   'tips': '0',
   'tolls': '0',
   'trip_end_timestamp': '2019-10-20T14:45:00',
   'trip_id': '001bf6e8b30ad25e067f639473cc8591a3ab2e4e',
   'trip_miles': '0.5',
   'trip_seconds': '120',
   'trip_start_timestamp': '2019-10-20T14:45:00',
   'trip_total': '4.5',
   'company': 'Bauer Taxi Service',
   'entry_idx': 0}]}

The JSON (JavaScript Object Notation) object returned by our call to /trips contains 1 ride that occurred between 2:00 PM and 3:00 PM. In Python, this JSON object is interpreted as a dictionary. <strong>We care about the "response" key.</strong> To retrieve it, we use Python's dictionary bracket notation, which gives us the list of data points we want.

In [5]:
# Gives us a list of the taxi rides that fit the criteria we wanted. Only one here since we set limit = 1.
trips_response["response"]

[{'dropoff_centroid_latitude': '41.944226601',
  'dropoff_centroid_longitude': '-87.655998182',
  'dropoff_community_area': '6',
  'extras': '0',
  'fare': '4.5',
  'pickup_centroid_latitude': '41.944226601',
  'pickup_centroid_longitude': '-87.655998182',
  'pickup_community_area': '6',
  'taxi_id': 'be435bc4d6744155b3272e9edded016bd4afb34777470147a81d6a7e77f17dd155fdd830169271731a5316e9a6c565a619903ae4f116598fcfb9bf22591850a2',
  'tips': '0',
  'tolls': '0',
  'trip_end_timestamp': '2019-10-20T14:45:00',
  'trip_id': '001bf6e8b30ad25e067f639473cc8591a3ab2e4e',
  'trip_miles': '0.5',
  'trip_seconds': '120',
  'trip_start_timestamp': '2019-10-20T14:45:00',
  'trip_total': '4.5',
  'company': 'Bauer Taxi Service',
  'entry_idx': 0}]
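Every numeric value in the response arrives as a string, so it helps to convert the values once, up front. The following is a minimal sketch, not part of the API: `extract_trips` and `NUMERIC_FIELDS` are hypothetical names, and the list of numeric fields is an assumption based on the sample output above.

```python
# Fields assumed to hold numbers, based on the sample output
NUMERIC_FIELDS = ("fare", "tips", "tolls", "extras",
                  "trip_total", "trip_miles", "trip_seconds")

def extract_trips(payload):
    """Return the trips in a /trips payload with numeric strings converted to floats."""
    if not payload.get("success"):
        raise ValueError(f"request failed: {payload}")
    trips = []
    for raw in payload["response"]:
        trip = dict(raw)  # copy so the original payload is untouched
        for field in NUMERIC_FIELDS:
            if trip.get(field) not in (None, ""):
                trip[field] = float(trip[field])
        trips.append(trip)
    return trips

# Abbreviated copy of the payload shown above
sample_payload = {
    "success": True,
    "length": 1,
    "response": [{"fare": "4.5", "tips": "0", "trip_miles": "0.5",
                  "trip_seconds": "120", "trip_total": "4.5",
                  "company": "Bauer Taxi Service", "entry_idx": 0}],
}
trips = extract_trips(sample_payload)
print(trips[0]["fare"], trips[0]["trip_seconds"])
```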

#### get_trips(query) method

Since your goal is to look at many different taxi rides, we can simplify things by creating a function that takes only the query we want to process. To do this, we put the repeated information inside a function that *wraps* the actual GET request.

In [6]:
def get_trips(query):
    # Copy the query so the caller's dict is not modified, then add the team secret
    params = dict(query, team=TEAM)
    response = requests.get(URL + "/trips/", params=params)
    return response.json()

In [7]:
# Example usage of get_trips - no need to specify the team or GET request anymore!
get_trips({
        "start": "10/20/2019 3:45 AM",
        "end": "10/20/2019 3:46 AM",
        "limit": 5
    })

{'success': True,
 'length': 5,
 'response': [{'dropoff_centroid_latitude': '41.972667956',
   'dropoff_centroid_longitude': '-87.663865496',
   'dropoff_community_area': '3',
   'extras': '0',
   'fare': '17.75',
   'pickup_centroid_latitude': '41.942577185',
   'pickup_centroid_longitude': '-87.647078509',
   'pickup_community_area': '6',
   'taxi_id': 'cc5330b266a2b3e042e5bc50b1dadb4f4e03db62f8cbd1010e986946aae90f1fa60344c044657706c127b1111e0cc2d7273caa006e37c9d39aa150a34ce3b049',
   'tips': '0',
   'tolls': '0',
   'trip_end_timestamp': '2019-10-20T04:15:00',
   'trip_id': '0047a654a3e08761b1f11aa9e9ec57f7a2fc44e6',
   'trip_miles': '4.3',
   'trip_seconds': '1620',
   'trip_start_timestamp': '2019-10-20T03:45:00',
   'trip_total': '17.75',
   'company': 'Bauer Taxi Service',
   'entry_idx': 0},
  {'dropoff_centroid_latitude': '41.89967018',
   'dropoff_centroid_longitude': '-87.669837798',
   'dropoff_community_area': '24',
   'extras': '0',
   'fare': '8.75',
   'pickup_centroid_

#### Sending SQL queries

The GET request for /trips can accept a SQL WHERE clause when it is passed as a string.

<strong>Parameters:</strong>
<ul>
    <li>team</li>
    <li>where : Indicates the SQL WHERE clause which defines the query.</li>
    <li>limit</li>
</ul>

In [8]:
get_trips({
        "where": "(trip_start_timestamp BETWEEN {10/20/2019 2:00 PM} AND {10/20/2019 9:00 PM}) AND (trip_total BETWEEN 10 AND 20)",
        "limit": 1
})

{'success': True,
 'length': 1,
 'response': [{'dropoff_centroid_latitude': '41.870607372',
   'dropoff_centroid_longitude': '-87.622172937',
   'dropoff_community_area': '32',
   'extras': '1.5',
   'fare': '7.25',
   'pickup_centroid_latitude': '41.849246754',
   'pickup_centroid_longitude': '-87.624135298',
   'pickup_community_area': '33',
   'taxi_id': '4d1dbd80c3b4c74b6441906c81e74c1150c532a88fd32572dfa9b23506ad6c388f45c453b7ddc2506042ec16187a698cd3c5f160a968286633c7bfebc8f2553f',
   'tips': '4',
   'tolls': '0',
   'trip_end_timestamp': '2019-10-20T17:30:00',
   'trip_id': '001791f85cac9569f420be00624202ab8fd63f09',
   'trip_miles': '1.5',
   'trip_seconds': '300',
   'trip_start_timestamp': '2019-10-20T17:15:00',
   'trip_total': '12.75',
   'company': 'Bauer Taxi Service',
   'entry_idx': 0}]}
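Writing WHERE strings by hand gets error-prone once you vary the time window or fare range. Below is a minimal sketch of a hypothetical helper, `build_where`, that reproduces the exact shape of the query above, including the curly braces around timestamps. It assumes the server expects that shape and nothing more.

```python
def build_where(start, end, min_total=None, max_total=None):
    """Build a WHERE string in the same shape as the query above."""
    # Doubled braces emit literal { and } around each timestamp
    clauses = [f"(trip_start_timestamp BETWEEN {{{start}}} AND {{{end}}})"]
    if min_total is not None and max_total is not None:
        clauses.append(f"(trip_total BETWEEN {min_total} AND {max_total})")
    return " AND ".join(clauses)

where = build_where("10/20/2019 2:00 PM", "10/20/2019 9:00 PM", 10, 20)
print(where)
```

The result can be passed as the "where" value in the query dictionary.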

NOTE: Times are rounded to the quarter hour. By specifying that you want the rides between 3:00 PM and 3:15 PM, you're also getting back rides from 3:15 PM to ~3:23 PM. This is because rides that happened during those minutes were rounded down to 3:15 PM.
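The ~3:23 PM cutoff suggests that timestamps are rounded to the nearest quarter hour. That reading is an assumption, not something the API documents. Under it, the rounding can be sketched as follows (`round_to_quarter_hour` is a hypothetical helper):

```python
from datetime import datetime, timedelta

QUARTER = timedelta(minutes=15)

def round_to_quarter_hour(dt):
    """Round a datetime to the nearest 15-minute mark (ties round up)."""
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = dt - midnight
    quarters = (elapsed + QUARTER / 2) // QUARTER  # whole number of quarter hours
    return midnight + quarters * QUARTER

print(round_to_quarter_hour(datetime(2019, 10, 20, 15, 22)))  # lands on 3:15 PM
print(round_to_quarter_hour(datetime(2019, 10, 20, 15, 23)))  # lands on 3:30 PM
```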

You might be asking... well, how do we parse this data? We'll take a look at that in the data analysis section.

### /count endpoint

In [9]:
def get_count(query):
    # Copy the query so the caller's dict is not modified, then add the team secret
    params = dict(query, team=TEAM)
    response = requests.get(URL + "/count/", params=params)
    return response.json()

#### /count example : What if our simulation lands on New Year's?

In [10]:
# New Years
new_years_count = get_count({
        "start": "12/31/2018 12:00 AM",
        "end": "01/01/2019 05:00 AM"
    })
print("# of rides in 29-hour span of New Years Eve -> Day: {}".format(new_years_count["count"]))

# of rides in 29-hour span of New Years Eve -> Day: 48389


## Learning Objective: Data Preprocessing and Analysis

Before applying any data analysis, it's always important to preprocess your data. This means looking for missing values, identifying categorical variables, and accounting for outliers (Y) or leverage points (X). We'll go through the process of getting the data for a specific simulation, processing it with the stack we're familiar with (the SciPy stack), and beginning some exploratory data analysis on it.

In [11]:
# What data can we currently analyze?
requests.get(URL + "/time/").json()

{'sucess': True,
 'time': 1573365600000,
 'message': 'The simulation is not over, the time is 11/10/2019.'}

Since the simulation time is 11/10/2019, we can look at data before 11/10/2019 12:00 AM. For this example analysis, we'll look at one week of data from shortly before the simulation time: 10/27/2019 to 11/3/2019.

In [12]:
# How much data should we expect?
get_count({
        "start": "10/27/2019 12:00 AM",
        "end": "11/3/2019 12:00 AM"
    })

{'success': True, 'count': '349665'}

In [13]:
# How many trips does /trips actually return for the same one-week range?
get_trips({
        "start": "10/27/2019 12:00 AM",
        "end": "11/3/2019 12:00 AM"
    })["length"]

689

<strong>Well... that's odd.</strong> The call to /count tells us to expect 349,665 taxi rides in the week of 10/27 - 11/3. However, we only get 689 trips back from /trips for the same date range. This is because, on the backend, the Lyft RESTful API sets an internal <strong>limit</strong> of 1000 taxi rides per request.

This presents an interesting problem: how do we get back all of the taxi rides in a specific time range if we're bound to just 1000 per call? We have to use the offset parameter, increasing it by 1000 on each call to get the next set of taxi rides. To get all 349,665 rides, we would have to repeat this process about 350 times. This is simply unreasonable, so instead we need a good <strong>sampling strategy</strong>.
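One possible sampling strategy is to fetch a handful of randomly chosen pages instead of all of them. The sketch below is only an illustration: `fetch_pages`, `sample_offsets`, and `fake_fetch` are hypothetical names, `fake_fetch` stands in for `get_trips` so the example runs offline, and it assumes that offset counts rows and that the server returns rows in a stable order.

```python
import math
import random

PAGE_SIZE = 1000  # per-request cap described above

def fetch_pages(fetch, query, offsets, page_size=PAGE_SIZE):
    """Call fetch once per offset and collect all returned trips."""
    trips = []
    for offset in offsets:
        page = fetch(dict(query, limit=page_size, offset=offset))
        trips.extend(page["response"])
    return trips

def sample_offsets(total_count, n_pages, page_size=PAGE_SIZE, seed=0):
    """Pick n_pages distinct page offsets uniformly at random."""
    total_pages = math.ceil(total_count / page_size)
    rng = random.Random(seed)
    pages = rng.sample(range(total_pages), min(n_pages, total_pages))
    return sorted(p * page_size for p in pages)

# Offline stand-in for get_trips: pretends the week holds 349,665 rows
def fake_fetch(query):
    start = query["offset"]
    stop = min(start + query["limit"], 349665)
    return {"success": True,
            "response": [{"entry_idx": i} for i in range(start, stop)]}

week = {"start": "10/27/2019 12:00 AM", "end": "11/3/2019 12:00 AM"}
offsets = sample_offsets(349665, n_pages=5)
sample = fetch_pages(fake_fetch, week, offsets)
print(len(offsets), len(sample))
```

To use it against the real server, pass `get_trips` in place of `fake_fetch`. Keep in mind that sampling whole pages gives you clusters of adjacent rows rather than independent rides, so more, smaller pages may represent the week better than a few large ones.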

### Converting JSON to Pandas DataFrame

In [14]:
week_before = get_trips({
        "start": "10/27/2019 12:00 AM",
        "end": "11/3/2019 12:00 AM",
        "limit": 100
    })

In [15]:
week_before = week_before["response"]

In [16]:
import pandas as pd
import numpy as np

week_before_dataset = pd.DataFrame(week_before)
week_before_dataset.head()

Unnamed: 0,company,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_community_area,entry_idx,extras,fare,pickup_centroid_latitude,pickup_centroid_longitude,pickup_community_area,taxi_id,tips,tolls,trip_end_timestamp,trip_id,trip_miles,trip_seconds,trip_start_timestamp,trip_total
0,Bauer Taxi Service,41.879255084,-87.642648998,28,0,0,9.25,41.891971508,-87.612945414,8,77b496e34fee1f04931fa20ccfdf9ac29676b5bb383241...,0.0,0,2019-10-27T00:00:00,0005c5f0e853b98e15145d5c1e486edf0e1375a2,2.1,600,2019-10-27T00:00:00,9.25
1,Bauer Taxi Service,41.87101588,-87.631406525,32,1,0,9.0,41.89503345,-87.619710672,8,f451123ff58e8ef1ca5e64de85da57d7cf463c5ee73e6a...,2.0,0,2019-10-27T03:30:00,001336f22cfe5fcd65688092a4212634049be5db,1.8,600,2019-10-27T03:15:00,11.0
2,Bauer Taxi Service,42.001571027,-87.695012589,2,2,0,9.0,42.009622881,-87.670166857,1,cc5330b266a2b3e042e5bc50b1dadb4f4e03db62f8cbd1...,0.0,0,2019-10-27T03:15:00,000f74f315b5fed07638a9e2a576db566fa36e62,2.4,480,2019-10-27T03:15:00,9.0
3,Bauer Taxi Service,41.934762456,-87.639853859,6,3,1,12.25,41.892042136,-87.63186395,8,1768c6bcff09706e1ee0a66b2053afe14107881e986f2e...,3.0,0,2019-10-27T03:45:00,00084cdb7b20b0f96a100744f01bb3f7b4ef33ec,3.5,720,2019-10-27T03:30:00,16.75
4,Bauer Taxi Service,41.922686284,-87.649488729,7,4,0,12.5,41.878865584,-87.625192142,32,4000efc92e3eb07271a33ec1681f0d46ec447199764d62...,3.25,0,2019-10-27T10:30:00,00043b8ec257b5488aa21c3351caf8996ec0b0de,3.4,840,2019-10-27T10:15:00,16.25


### Finding the Average "fare"

Let's try to isolate the fare column so we can get the average. It's at column index 6, and we can use the <strong>iloc</strong> indexer to select it. The <strong>describe()</strong> method will give us summary statistics of the data.

In [17]:
# Let's isolate the "fare" col
week_before_dataset.iloc[:, 6].describe()

count     62
unique    40
top        7
freq       7
Name: fare, dtype: object

Well... that's odd. Where's the mean or median output? It turns out that the data returned to us in "fare" is in string format, so we need to convert the column to be a float.

In [18]:
fare_feature = week_before_dataset.iloc[:, [6]].copy()
fare_feature = fare_feature.astype("float64")
fare_feature.head()

Unnamed: 0,fare
0,9.25
1,9.0
2,9.0
3,12.25
4,12.5


In [19]:
# Confirming that we've converted to a float
fare_feature.dtypes

fare    float64
dtype: object
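Positional indexing breaks if the column order ever changes, so selecting by name is a more robust alternative. As a minimal sketch on a tiny made-up frame (the rows are illustrative, not real API data), `pd.to_numeric` can convert several string columns at once:

```python
import pandas as pd

# Tiny illustrative frame; real data would come from get_trips(...)["response"]
sample = pd.DataFrame({
    "fare": ["9.25", "9.0", "12.5"],
    "trip_miles": ["2.1", "1.8", "3.4"],
    "company": ["Bauer Taxi Service"] * 3,
})

numeric_cols = ["fare", "trip_miles"]
converted = sample.copy()
# errors="coerce" turns values that cannot be parsed into NaN instead of raising
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

mean_fare = converted["fare"].mean()
print(converted.dtypes)
print(mean_fare)
```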

#### Summary Statistics

In [20]:
fare_feature.describe()

Unnamed: 0,fare
count,62.0
mean,14.233871
std,13.271045
min,4.0
25%,6.75
50%,9.25
75%,12.5
max,50.75


### Finding the Most Common Zone

Power zones are areas you can configure that require riders to pay an extra cost to take a Lyft, but with the benefit of it coming faster than, say, a taxi. The zone mappings can be seen in the <strong>"dropoff_community_area"</strong> feature. Let's try to find the most common zone, and then simply use that as the only power zone we want to apply for this simulation's pricing model.

In [21]:
week_before_dataset.iloc[:, 3].describe()

count     62
unique    12
top        8
freq      25
Name: dropoff_community_area, dtype: object
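The same count can be cross-checked without pandas. Here is a minimal sketch using `collections.Counter` on a list of trip dictionaries; the helper name `most_common_zones` and the sample rows are made up for illustration.

```python
from collections import Counter

def most_common_zones(trips, key="dropoff_community_area", n=1):
    """Return the n most frequent zone values as (zone, count) pairs."""
    counts = Counter(trip[key] for trip in trips if trip.get(key))
    return counts.most_common(n)

sample_trips = [
    {"dropoff_community_area": "8"},
    {"dropoff_community_area": "32"},
    {"dropoff_community_area": "8"},
    {"dropoff_community_area": None},  # missing zone, skipped
]
print(most_common_zones(sample_trips))
```

Raising `n` gives a ranked shortlist of candidate power zones instead of a single one.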

Let's take half of the average fare we found in the last section as the base price, and use the most frequent zone as our only power zone!

In [22]:
def set_pricing(pricing):
    # Copy the pricing dict so the caller's dict is not modified, then add the team secret
    query = dict(pricing, team=TEAM)
    response = requests.post(URL + "/pricing/", params=query)
    return response.json()

In [23]:
def set_zones(zones):
    zone_list = ",".join(str(z) for z in zones)
    query = {
        "team": TEAM,
        "zones": zone_list
    }
    response = requests.post(URL + "/zones/", params=query)
    return response.json()

In [24]:
my_price_model = set_pricing({
    "base": 17.24 / 2,  # half of a sampled mean fare; substitute the mean from your own sample
    "pickup": 0.00,
    "per_mile": 0.10,
    "per_minute": 0.10
})

In [25]:
my_power_zones = set_zones([8])
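Before submitting, it can help to sanity-check what a typical ride would cost under your model. This notebook does not say how the server combines the four pricing fields, so the sketch below assumes, purely for illustration, an additive formula: base + pickup + per_mile × miles + per_minute × minutes. The helper `estimate_price` is hypothetical.

```python
def estimate_price(pricing, trip_miles, trip_seconds):
    """Estimate a ride price under an ASSUMED additive formula (not confirmed by the API)."""
    minutes = trip_seconds / 60
    return (pricing["base"] + pricing["pickup"]
            + pricing["per_mile"] * trip_miles
            + pricing["per_minute"] * minutes)

pricing = {"base": 17.24 / 2, "pickup": 0.00, "per_mile": 0.10, "per_minute": 0.10}

# A 2.1-mile, 600-second ride, like the first row of the sample table
price = estimate_price(pricing, trip_miles=2.1, trip_seconds=600)
print(round(price, 2))
```

Comparing this estimate with the taxi's trip_total for the same ride gives a rough sense of whether your prices undercut the competition.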