# Nearest Neighbors Lab

### Introduction

In this lab, you will apply the nearest neighbors technique to help a taxi company predict the length of its rides.  Imagine that we are hired to consult for LiftOff, a limo and taxi service that is just opening up.  It wants to do some initial research on NYC trips.  LiftOff has a theory that the pickup location of a taxi ride can help predict the length of the ride.  So the hypothesis is that trips with a similar pickup location will have a similar trip length.  It wants to target the locations that generally have longer rides, as it makes more money that way.

LiftOff asks us to do some analysis. Lucky for us, information about NYC taxi trips is available on the [NYC Open Data website](https://data.cityofnewyork.us/Transportation/2014-Yellow-Taxi-Trip-Data/gn7m-em8n).  

### A little different

Before we get started, note that our problem here is a little bit different from what we worked with previously.  

Before, our job was complete once we found the closest trips to a given location, that is, once we found our nearest neighbors.  Now, we still need to find the closest trips, but we also need to use this data to predict the length of a trip.  As you'll see, to predict a trip length from a given location, we'll find the trips that occurred nearest to that location, then take the median of their lengths as our prediction.

The second new thing that we'll see with our problem is the task of choosing the correct number of neighbors, or trips.  Say we choose the 500 closest neighbors to a given point.  In a dataset of only 1000 trips, we would be including trips from all over the map, and our nearest neighbors formula wouldn't tell us much about how a specific point is different.  However, if we choose only one neighbor for a given point and assume that this one neighbor's trip distance predicts the length of our trip, then we run the risk of that trip being a special case and not the norm for that area.  So how to choose the correct number of neighbors, referred to as $k$, is something that we'll need to explore.   
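To make the trade-off concrete, here is a minimal sketch on made-up one-dimensional data (the `toy_trips` values and the `predict_length` helper are hypothetical and are not part of the lab's dataset or tests). It predicts a trip length as the median length of the `k` closest pickups.

```python
import statistics

# Hypothetical toy data: (pickup position along a street, trip length in miles)
toy_trips = [(0, 1.0), (1, 9.0), (2, 1.2), (3, 0.8), (4, 1.1),
             (50, 17.0), (51, 18.0), (52, 16.5), (53, 17.5), (54, 19.0)]

def predict_length(position, trips, k):
    """Median trip length of the k trips whose pickup is closest to `position`."""
    closest = sorted(trips, key=lambda trip: abs(trip[0] - position))[:k]
    return statistics.median(length for _, length in closest)

for k in (1, 3, len(toy_trips)):
    print(k, predict_length(1, toy_trips, k))
```

With `k = 1` the prediction is the single unusual 9.0 mile trip, with `k = 3` it is 1.2, which is typical for that end of the street, and with every trip included it is 12.75, which no longer says anything specific about position 1.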

### Exploring and Gathering the Data

If you go to [NYC Open Data](https://opendata.cityofnewyork.us/), you can find NYC taxi data after a quick search.  On the [dataset's page](https://data.cityofnewyork.us/Transportation/2014-Yellow-Taxi-Trip-Data/gn7m-em8n), click the "API" button to find the data that we'll be working with.  For your reading pleasure, the data has already been moved to the "trips.json" file in this lab.

```python
[
  {
    "dropoff_datetime": "2014-11-26T22:31:00.000",
    "dropoff_latitude": "40.746769999999998",
    "dropoff_longitude": "-73.997450000000001",
    "fare_amount": "52",
    "imp_surcharge": "0",
    "mta_tax": "0.5",
    "passenger_count": "1",
    "payment_type": "CSH",
    "pickup_datetime": "2014-11-26T21:59:00.000",
    "pickup_latitude": "40.64499",
    "pickup_longitude": "-73.781149999999997",
    "rate_code": "2",
    "tip_amount": "0",
    "tolls_amount": "5.3300000000000001",
    "total_amount": "57.829999999999998",
    "trip_distance": "18.379999999999999",
    "vendor_id": "VTS"
  },
...
...
]
```

### Document Retrieval 

Now, we like the amount of data, but we don't need all of the attributes provided.  We decide that all we need for this exploration is `trip_distance`, `pickup_latitude`, and `pickup_longitude`.

The first step is to load the data from a JSON file, which can be a little tricky in Python.  We'll write a function to do it for you, using the pandas library. 

In [1]:
import pandas

def parse_file(fileName):
    trips_df = pandas.read_json(fileName)
    return trips_df.to_dict('records')

trips = parse_file('trips.json')



In [2]:
trips[0]

{'dropoff_datetime': '2014-11-26T22:31:00.000',
 'dropoff_latitude': 40.74677,
 'dropoff_longitude': -73.99745,
 'fare_amount': 52.0,
 'imp_surcharge': 0.0,
 'mta_tax': 0.5,
 'passenger_count': 1,
 'payment_type': 'CSH',
 'pickup_datetime': '2014-11-26T21:59:00.000',
 'pickup_latitude': 40.64499,
 'pickup_longitude': -73.78115,
 'rate_code': 2,
 'store_and_fwd_flag': nan,
 'tip_amount': 0.0,
 'tolls_amount': 5.33,
 'total_amount': 57.83,
 'trip_distance': 18.38,
 'vendor_id': 'VTS'}

In [3]:
len(trips)

1000
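
If you would rather not depend on pandas, the standard `json` module can do a similar job. The sketch below is an illustration rather than the lab's own approach: `raw` holds a few fields copied from the sample record above, and `to_number` is a hypothetical helper that turns the API's numeric strings into floats.

```python
import json

# A few fields copied from the sample record; the API stores numbers as strings
raw = '[{"pickup_latitude": "40.64499", "trip_distance": "18.379999999999999", "vendor_id": "VTS"}]'

def to_number(value):
    """Return value as a float when it parses as one, otherwise leave it unchanged."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

records = [{key: to_number(value) for key, value in record.items()}
           for record in json.loads(raw)]
print(records[0])
```

Unlike `pandas.read_json`, this sketch converts every numeric string to a float, so a field such as `passenger_count` would come back as `1.0` rather than `1`.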

Ok, so as you can see from above, the `trips` variable holds a list of dictionaries, with each dictionary representing a trip.  Write a function called `parse_trips(trips)` that returns a list of the trips with just the following attributes: `trip_distance`, `pickup_latitude`, `pickup_longitude`.  

Run the `index-tests.py` file to ensure that you wrote it correctly.

In [4]:
def parse_trips(trips):
    # Keep only the attributes we need from each trip
    your_keys = {'trip_distance', 'pickup_latitude', 'pickup_longitude'}
    return [{your_key: trip[your_key] for your_key in your_keys} for trip in trips]


In [5]:
parsed_trips = parse_trips(trips)
# parsed_trips[0]
    # {'pickup_latitude': 40.64499,
    #  'pickup_longitude': -73.78115,
    #  'trip_distance': 18.38}

# len(parsed_trips)
    # 1000


# set([key for trip in parsed_trips for key in list(trip.keys())])
    # {'pickup_latitude', 'pickup_longitude', 'trip_distance'}

In [6]:
parsed_trips

[{'pickup_longitude': -73.78115,
  'pickup_latitude': 40.64499,
  'trip_distance': 18.38},
 {'pickup_longitude': -73.982098,
  'pickup_latitude': 40.766931,
  'trip_distance': 1.3},
 {'pickup_longitude': -73.951902,
  'pickup_latitude': 40.77773,
  'trip_distance': 4.5},
 {'pickup_longitude': -73.971049,
  'pickup_latitude': 40.795678,
  'trip_distance': 2.4},
 {'pickup_longitude': -73.967782,
  'pickup_latitude': 40.762912,
  'trip_distance': 0.84},
 {'pickup_longitude': -73.991572,
  'pickup_latitude': 40.731176,
  'trip_distance': 0.8},
 {'pickup_longitude': -73.968098,
  'pickup_latitude': 40.800219,
  'trip_distance': 0.5},
 {'pickup_longitude': -73.783508,
  'pickup_latitude': 40.648509,
  'trip_distance': 17.3},
 {'pickup_longitude': -73.983493,
  'pickup_latitude': 40.721897,
  'trip_distance': 0.63},
 {'pickup_longitude': -73.972224,
  'pickup_latitude': 40.791566,
  'trip_distance': 2.8},
 {'pickup_longitude': -73.978619,
  'pickup_latitude': 40.744896,
  'trip_distance': 0.6

### Exploring the Data

Now that we have pared down our data, let's answer some initial questions.  Here is the map where our data will go. 

In [7]:
!pip install gmplot
import gmplot
gmap = gmplot.GoogleMapPlotter(40.758896, -73.985130, 12)
gmap.draw("mymap.html")



![](./Manhattan.png)

Now, to plot the data, we feed it into the following function.
```python
gmap.plot(latitudes, longitudes, 'cornflowerblue', edge_width=10)
```
So we'll need a list of latitudes, each element representing the pickup latitude of a trip, and a list of longitudes, each element representing the pickup longitude of a trip.  Write a function called `trip_latitudes` that, given a list of trips, returns a list of latitudes, and a function called `trip_longitudes` that, given a list of trips, returns a list of longitudes.  Run the file `nearest-neighbor-lab-tests.py` to get feedback.  

In [8]:
def trip_latitudes(trips):
    return [trip['pickup_latitude'] for trip in trips]

In [9]:
def trip_longitudes(trips):
    return [trip['pickup_longitude'] for trip in trips]

In [10]:
latitudes = trip_latitudes(parsed_trips)
longitudes = trip_longitudes(parsed_trips)

In [11]:
len(latitudes)

1000

In [12]:
latitudes

[40.64499,
 40.766931,
 40.77773,
 40.795678,
 40.762912,
 40.731176,
 40.800219,
 40.648509,
 40.721897,
 40.791566,
 40.744896,
 40.721951,
 40.732382,
 40.768339,
 40.775933,
 40.794829,
 40.758647,
 40.713638,
 40.77403,
 40.728127,
 40.759671,
 40.772651,
 40.770319,
 40.720457,
 40.754789,
 40.755654,
 40.741568,
 40.736907,
 40.76003,
 40.721366,
 40.760147,
 40.745847,
 40.753015,
 40.644657,
 40.738767,
 40.773797,
 40.743411,
 40.758905,
 40.733877,
 40.726162,
 40.775122,
 40.75987,
 40.707087,
 40.741912,
 40.769112,
 40.785807,
 40.7513,
 40.733385,
 40.760257,
 40.741249,
 40.759795,
 40.756636,
 40.73925,
 40.751055,
 40.723584,
 40.757037,
 40.743385,
 40.737486,
 40.77225,
 40.732882,
 40.774567,
 40.757437,
 40.789637,
 40.754317,
 40.745007,
 40.734982,
 0.0,
 40.738617,
 40.790712,
 40.750167,
 40.732141,
 40.765758,
 40.792594,
 40.762908,
 40.750722,
 40.769887,
 40.77247,
 40.764545,
 40.641805,
 40.75224,
 40.758835,
 40.722785,
 40.720338,
 40.72973,
 40.760442
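
Notice that the list above contains a latitude of `0.0`, which cannot be a pickup in New York. One way to screen out such records is a bounding-box check. The sketch below rests on assumptions: the bounds are rough illustrative values rather than anything defined by the lab, `in_nyc` is a hypothetical helper, and the second sample record is invented to stand in for a bad row.

```python
# Assumed rough bounding box around New York City (illustrative values only)
NYC_LAT_RANGE = (40.4, 41.0)
NYC_LON_RANGE = (-74.3, -73.6)

def in_nyc(trip):
    """True when the pickup coordinates fall inside the assumed bounding box."""
    lat, lon = trip['pickup_latitude'], trip['pickup_longitude']
    return (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]
            and NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1])

sample_trips = [
    {'pickup_latitude': 40.64499, 'pickup_longitude': -73.78115, 'trip_distance': 18.38},
    {'pickup_latitude': 0.0, 'pickup_longitude': 0.0, 'trip_distance': 1.0},  # invented bad record
]
clean_trips = [trip for trip in sample_trips if in_nyc(trip)]
print(len(clean_trips))  # 1
```

Whether to filter before plotting or before searching for neighbors is a judgment call. A record at latitude `0.0` is far from every real pickup, so it is unlikely to be chosen as a near neighbor, but it can still distort a map.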

In [13]:
gmap.plot(latitudes or [], longitudes or [], 'cornflowerblue', edge_width=10)
gmap.draw("myplot.html")

Plotting the trips gives you the following.

![](./map-plotting.png)

### Using Nearest Neighbors

Ok, let's write a function that, given a latitude and longitude, will predict the trip distance for us.  We'll do this by first finding the trips nearest to that latitude and longitude. 

First write a function `distance_location` that calculates the distance between two individuals (here, two trips).

In [14]:
import math

def distance_location(selected_individual, neighbor):
    lat_ind , lon_ind = selected_individual['pickup_latitude'], selected_individual['pickup_longitude']
    lat_neigh, lon_neigh = neighbor['pickup_latitude'], neighbor['pickup_longitude']
    return math.sqrt((lat_ind-lat_neigh)**2 + (lon_ind-lon_neigh)**2)

In [15]:
first_trip = parsed_trips[0]
second_trip = parsed_trips[1]

distance_location(first_trip, second_trip) 
#     0.23505256047318146

0.23505256047318146

In [16]:
first_trip

{'pickup_longitude': -73.78115,
 'pickup_latitude': 40.64499,
 'trip_distance': 18.38}

Write the `nearest_neighbors` function.  If no number is provided, it should return the 3 nearest neighbors.

In [17]:
def nearest_neighbors(selected_individual, neighbors, number = 3):
    lst = []
    for neighbor in neighbors:
        # Note: this stores the distance on each neighbor dictionary in place
        neighbor.update({'distance': distance_location(selected_individual, neighbor)})
        lst.append(neighbor)

    # Sort closest first, then keep the requested number of neighbors
    order_of_things = sorted(lst, key=lambda x: x['distance'])
    return order_of_things[:number]

In [18]:
selected_trip = {'pickup_latitude': 40.64499,
'pickup_longitude': -73.78115,
'trip_distance': 18.38}


nearest_neighbors(selected_trip, parsed_trips or [], number = 3)

# [{'distance': 0.0004569288784918792,
#   'pickup_latitude': 40.64483,
#   'pickup_longitude': -73.781578,
#   'trip_distance': 7.78},
#  {'distance': 0.0011292165425673159,
#   'pickup_latitude': 40.644657,
#   'pickup_longitude': -73.782229,
#   'trip_distance': 12.7},
#  {'distance': 0.0042359798158141185,
#   'pickup_latitude': 40.648509,
#   'pickup_longitude': -73.783508,
#   'trip_distance': 17.3}]

[{'pickup_longitude': -73.78115,
  'pickup_latitude': 40.64499,
  'trip_distance': 18.38,
  'distance': 0.0},
 {'pickup_longitude': -73.781578,
  'pickup_latitude': 40.64483,
  'trip_distance': 7.78,
  'distance': 0.0004569288784918792},
 {'pickup_longitude': -73.782229,
  'pickup_latitude': 40.644657,
  'trip_distance': 12.7,
  'distance': 0.0011292165425673159}]
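
Two details of the result above are worth noticing: the selected trip shows up as its own nearest neighbor (distance `0.0`) because it is part of `parsed_trips`, and `nearest_neighbors` writes a `distance` key onto every dictionary it is given. Here is a sketch of a variant that avoids both, under stated assumptions: `nearest_neighbors_copy`, `distance_between`, and the `skip_exact` flag are hypothetical names that the lab's tests do not expect, and `sample` reuses three records shown earlier.

```python
import heapq
import math

def distance_between(a, b):
    # Straight-line distance in degrees, as in distance_location above
    return math.hypot(a['pickup_latitude'] - b['pickup_latitude'],
                      a['pickup_longitude'] - b['pickup_longitude'])

def nearest_neighbors_copy(selected, neighbors, number=3, skip_exact=False):
    """Return copies of the `number` closest trips, leaving `neighbors` untouched."""
    scored = ({**n, 'distance': distance_between(selected, n)} for n in neighbors)
    if skip_exact:
        # Drop trips at exactly the selected location, such as the selected trip itself
        scored = (n for n in scored if n['distance'] > 0)
    return heapq.nsmallest(number, scored, key=lambda n: n['distance'])

sample = [
    {'pickup_latitude': 40.64499, 'pickup_longitude': -73.78115, 'trip_distance': 18.38},
    {'pickup_latitude': 40.64483, 'pickup_longitude': -73.781578, 'trip_distance': 7.78},
    {'pickup_latitude': 40.766931, 'pickup_longitude': -73.982098, 'trip_distance': 1.3},
]
closest = nearest_neighbors_copy(sample[0], sample, number=1, skip_exact=True)
print(closest[0]['trip_distance'])  # 7.78
```

`heapq.nsmallest` avoids sorting the whole list when only a few neighbors are needed. For 1000 trips the difference is negligible, so this is a matter of taste.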

### Choosing the correct number of neighbors

Now, in working with a nearest neighbors formula, one tricky question is how many neighbors we should use.  Remember that our guess is that trips with similar pickup locations will have similar trip lengths.  To make a prediction, we will just take the median of the trip lengths in this group.  

If we choose too many neighbors, then we'll be combining distances from all over town, and we won't really be differentiating between locations.  But if we choose too few, such as the three neighbors above, our sample may be too small.  Take a look at the `trip_distance` of each of the 3 neighbors above.  It's hard to tell if a trip distance of around 7 is more typical than a distance of around 17.  In other words, our sample size is small.

The choice of the correct number of neighbors is called choosing the correct $k$, as that is the variable often assigned to the number of neighbors.  We'll experiment with our $k$ size throughout the rest of this lab.

Let's increase the number of our neighbors to see what happens. 

In [19]:
seven_closest = nearest_neighbors(selected_trip, parsed_trips or [], number = 7)
seven_closest

[{'pickup_longitude': -73.78115,
  'pickup_latitude': 40.64499,
  'trip_distance': 18.38,
  'distance': 0.0},
 {'pickup_longitude': -73.781578,
  'pickup_latitude': 40.64483,
  'trip_distance': 7.78,
  'distance': 0.0004569288784918792},
 {'pickup_longitude': -73.782229,
  'pickup_latitude': 40.644657,
  'trip_distance': 12.7,
  'distance': 0.0011292165425673159},
 {'pickup_longitude': -73.783508,
  'pickup_latitude': 40.648509,
  'trip_distance': 17.3,
  'distance': 0.0042359798158141185},
 {'pickup_longitude': -73.776808,
  'pickup_latitude': 40.645316,
  'trip_distance': 17.5,
  'distance': 0.004354220940644754},
 {'pickup_longitude': -73.776765,
  'pickup_latitude': 40.645718,
  'trip_distance': 20.5,
  'distance': 0.004445020697364217},
 {'pickup_longitude': -73.77668,
  'pickup_latitude': 40.64534,
  'trip_distance': 21.44,
  'distance': 0.004483681523031949}]

Notice that all seven neighbors are within about 0.0045 degrees of the selected trip, so going to the 7 nearest neighbors didn't pull in trips from far away, which is a good sign.  Still, it's hard to know what a distance in latitude and longitude really looks like, so let's try mapping the data.  

In [20]:
seven_lats = trip_latitudes(seven_closest)
seven_longs = trip_longitudes(seven_closest)

In [21]:
gmap = gmplot.GoogleMapPlotter(first_trip['pickup_latitude'], first_trip['pickup_longitude'], 15)
gmap.scatter(seven_lats, seven_longs, 'cornflowerblue', edge_width=10)
gmap.draw("nearestneighbors.html")

![](./airportdata.png)

Well, it looks like we can't really make an assessment of a good $k$ size with this data.  Our location is the airport, which is probably not a very typical place to see if our $k$ size is good for predicting trip lengths.

Let's choose another spot that we expect to be less atypical.  Fifty-first street and 7th Avenue is at $40.761710, -73.982760$.  Now let's again try to see if seven locations is a good spread, but this time starting from midtown.

In [22]:
midtown_loc = {'pickup_latitude': 40.761710, 'pickup_longitude': -73.982760}
midtown_neighbors = nearest_neighbors(midtown_loc, parsed_trips, number = 7)
list(map(lambda trip: trip['distance'], midtown_neighbors))

[0.00037310588309379025,
 0.00080072217404248,
 0.0011555682584735844,
 0.0012508768924205918,
 0.0018118976240381972,
 0.002067074502774709,
 0.0020684557041472677]

The distance jumps by nearly half as our $k$ goes from four to five (from about 0.00125 to about 0.00181).  How far is this distance really?

In [23]:
gmap = gmplot.GoogleMapPlotter(midtown_loc['pickup_latitude'], midtown_loc['pickup_longitude'], 15)
closest_lats = trip_latitudes(midtown_neighbors)
closest_longs = trip_longitudes(midtown_neighbors)

gmap.scatter(closest_lats, closest_longs, 'cornflowerblue', edge_width=10)
gmap.draw("nearestmidtown.html")

![](./midtown.png)

So essentially the neighbors are within one or two blocks of our location at 51st and 7th.  Not too bad.  Looking at the trip lengths for our seven neighbors, it seems like our neighborhood size, $k$, is large enough that we can start to see what an expected trip distance would be.
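
As a rough cross-check of that reading of the map, a distance in degrees can be converted to meters with the haversine formula. This is a sketch, not part of the lab: `haversine_m` is a hypothetical helper, and it assumes a spherical Earth with a radius of 6,371 km. The second point is the farthest of the seven midtown neighbors, taken from the `midtown_neighbors` output in this lab.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# 51st and 7th, and the farthest of the seven midtown neighbors
meters = haversine_m(40.761710, -73.982760, 40.762107, -73.98479)
print(round(meters))
```

The result is a bit under 200 meters. Note that `distance_location` treats a degree of latitude and a degree of longitude as equal, while at New York's latitude a degree of longitude covers only about three quarters of the ground distance of a degree of latitude, so the ranking of neighbors by degrees is an approximation.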

In [24]:
midtown_neighbors

[{'pickup_longitude': -73.982602,
  'pickup_latitude': 40.761372,
  'trip_distance': 0.58,
  'distance': 0.00037310588309379025},
 {'pickup_longitude': -73.98244,
  'pickup_latitude': 40.762444,
  'trip_distance': 0.8,
  'distance': 0.00080072217404248},
 {'pickup_longitude': -73.982293,
  'pickup_latitude': 40.762767,
  'trip_distance': 1.4,
  'distance': 0.0011555682584735844},
 {'pickup_longitude': -73.983233,
  'pickup_latitude': 40.762868,
  'trip_distance': 8.3,
  'distance': 0.0012508768924205918},
 {'pickup_longitude': -73.983502,
  'pickup_latitude': 40.760057,
  'trip_distance': 1.26,
  'distance': 0.0018118976240381972},
 {'pickup_longitude': -73.984531,
  'pickup_latitude': 40.760644,
  'trip_distance': 0.0,
  'distance': 0.002067074502774709},
 {'pickup_longitude': -73.98479,
  'pickup_latitude': 40.762107,
  'trip_distance': 1.72,
  'distance': 0.0020684557041472677}]

### Calculating an expected trip distance

Another way of thinking about the number of neighbors we should choose is to look at how far the neighbors' median trip distance deviates from the median over all trips.  We want to make sure our number is not so large that, whichever location we choose, the prediction just looks like the expected distance across all taxi trips in Manhattan. 

Let's write a function called `median_of` that takes a list of trips, and returns the median `trip_distance`.

In [25]:
import statistics
def median_of(neighbors):
    # Use the length of each trip, not its `distance` from the selected location
    return statistics.median([x['trip_distance'] for x in neighbors])

In [26]:
median_of(parsed_trips or [])

This is the median over all 1000 trips.  Compare it with the median of our `midtown_neighbors`: if the two numbers differ, our number of neighbors is not so large that the midtown prediction just looks like the overall median.

In [27]:
median_of(midtown_neighbors)

1.26

Next, change the number of neighbors from seven to ten and see how much the median moves.

In [28]:
median_of(nearest_neighbors(midtown_loc, parsed_trips, number = 10))

Then try 15, 20, and 25 neighbors, and check whether the formula begins to give a similar result.  

In [29]:
median_of(nearest_neighbors(midtown_loc, parsed_trips, number = 25))

If the median settles down in that range, around 20 could be a sweet spot. Let's try another location to see how we do.

In [30]:
uws_loc = {'pickup_latitude': 40.786430, 'pickup_longitude': -73.975979}

In [31]:
median_of(nearest_neighbors(uws_loc, parsed_trips, number = 20))

In [141]:
# downtown_loc = {'pickup_latitude': 40.713186, 'pickup_longitude': -74.007243}

In [142]:
# median_of(nearest_neighbors(downtown_loc, parsed_trips, number = 20))
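
Rather than rerunning one cell per value of $k$, the comparison can be done in a single pass. The following sketch assumes a hypothetical helper `median_by_k` and uses made-up `demo_trips` so that it runs on its own; in the lab you would pass `parsed_trips` and a location such as `midtown_loc` instead.

```python
import math
import statistics

def median_by_k(location, trips, ks):
    """Map each k in `ks` to the median trip_distance of the k trips nearest to `location`."""
    def degrees_away(trip):
        return math.hypot(trip['pickup_latitude'] - location['pickup_latitude'],
                          trip['pickup_longitude'] - location['pickup_longitude'])
    ranked = sorted(trips, key=degrees_away)  # closest first, sorted only once
    return {k: statistics.median(t['trip_distance'] for t in ranked[:k]) for k in ks}

# Made-up stand-in data: ten trips, each a little farther north and a little longer
demo_trips = [
    {'pickup_latitude': 40.7617 + 0.001 * i, 'pickup_longitude': -73.9828, 'trip_distance': float(i)}
    for i in range(1, 11)
]
demo_loc = {'pickup_latitude': 40.7617, 'pickup_longitude': -73.9828}
print(median_by_k(demo_loc, demo_trips, ks=[1, 3, 5]))  # {1: 1.0, 3: 2.0, 5: 3.0}
```

A range of $k$ over which the median stays roughly constant is a reasonable candidate for the number of neighbors to use.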

### Summary

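Alright, at this point we should be pretty happy, as we have what we need to make a recommendation to LiftOff.  For any pickup location, we can estimate an expected trip distance from its nearest neighbors, and so tell LiftOff which neighborhoods in Manhattan have the largest expected distance.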