# Pandas Performance
### Cecelia Henson

In this notebook we compare the performance of several ways to compute a value for every row of a pandas DataFrame, from explicit Python loops to vectorized operations.

First we load our data. It comes from Lyft's Bay Wheels bike-share program (formerly Ford GoBike) and includes every trip from 2017: https://www.lyft.com/bikes/bay-wheels/system-data

In [1]:
import pandas as pd

df = pd.read_csv('2017-fordgobike-tripdata.csv', 
                 dtype={"start_station_latitude":float, "start_station_longitude":float,
                       "end_station_latitude":float, "end_station_longitude":float})

Next we define a function that calculates the great-circle distance, in miles, between two GPS locations.

In [2]:
import numpy as np

# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
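
Before timing anything, it can help to sanity-check the formula on inputs whose answer is known. The sketch below repeats the function so that it runs on its own; the coordinates are hypothetical test points, not values from the trip data.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # Earth radius in miles, as in the cell above
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Identical points should be zero miles apart.
same_point = haversine(37.77, -122.42, 37.77, -122.42)

# Moving one degree of latitude along a meridian traces an arc of
# radius * pi / 180, which is roughly 69.1 miles for this radius.
one_degree = haversine(37.0, -122.0, 38.0, -122.0)
expected_one_degree = 3959 * np.pi / 180

print(same_point, one_degree, expected_one_degree)
```

If both checks agree, the function is at least consistent with the arc-length definition it is built on.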

First, let's try iterating through the DataFrame row by row using `iterrows()`.

In [3]:
%%time
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(row['start_station_latitude'], row['start_station_longitude'], 
                                      row['end_station_latitude'], row['end_station_longitude']))
df['distance'] = haversine_series

Wall time: 34.5 s


The next approach is to loop over integer positions and read each value with `iloc`.

In [4]:
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(df['start_station_latitude'].iloc[i], df['start_station_longitude'].iloc[i], 
                      df['end_station_latitude'].iloc[i], df['end_station_longitude'].iloc[i])
        distance_list.append(d)
    return distance_list
%time df['distance'] = haversine_looping(df)

Wall time: 27 s


Add benchmarking to the previous cells, and take a moment to reflect on these results. Is `iterrows()` or `iloc` faster? In the runs above, the `iloc` loop (27 s) finished ahead of `iterrows()` (34.5 s).
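
Outside a notebook, where `%time` is not available, one way to benchmark the two loops is with `time.perf_counter`. This is a minimal sketch under stated assumptions: `sample` is a small synthetic stand-in for the trip data, and `with_iterrows`, `with_iloc`, and `best_of` are hypothetical helper names introduced here for illustration.

```python
import time

import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Synthetic stand-in for the trip data: random points in a small box.
rng = np.random.default_rng(0)
n = 2_000
sample = pd.DataFrame({
    "start_station_latitude": rng.uniform(37.7, 37.8, n),
    "start_station_longitude": rng.uniform(-122.5, -122.4, n),
    "end_station_latitude": rng.uniform(37.7, 37.8, n),
    "end_station_longitude": rng.uniform(-122.5, -122.4, n),
})

def with_iterrows(frame):
    # Each iteration builds a full Series for the row.
    return [haversine(row["start_station_latitude"], row["start_station_longitude"],
                      row["end_station_latitude"], row["end_station_longitude"])
            for _, row in frame.iterrows()]

def with_iloc(frame):
    # Each iteration does four positional lookups on the columns.
    return [haversine(frame["start_station_latitude"].iloc[i],
                      frame["start_station_longitude"].iloc[i],
                      frame["end_station_latitude"].iloc[i],
                      frame["end_station_longitude"].iloc[i])
            for i in range(len(frame))]

def best_of(func, frame, repeats=3):
    """Return the fastest wall time in seconds and the last result."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(frame)
        timings.append(time.perf_counter() - start)
    return min(timings), result

iterrows_time, iterrows_result = best_of(with_iterrows, sample)
iloc_time, iloc_result = best_of(with_iloc, sample)
print(f"iterrows: {iterrows_time:.3f} s, iloc: {iloc_time:.3f} s")
```

Taking the best of several repeats reduces noise from other processes. The absolute numbers will depend on the machine and the pandas version, so only the relative ordering is worth comparing.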

Next, let's use some functional programming! Try using `apply` with `axis=1`, which calls the function once per row.

In [5]:
%time df['distance'] = df.apply(lambda row: haversine(row['start_station_latitude'], \
                                                      row['start_station_longitude'], \
                                                      row['end_station_latitude'], \
                                                      row['end_station_longitude']), axis=1)

Wall time: 17.8 s


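Let's vectorize! Rather than looping, pass entire columns (pandas Series) to `haversine` so that NumPy does the arithmetic on whole arrays at once.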

In [6]:
%time df['distance'] = haversine(df['start_station_latitude'], df['start_station_longitude'], \
                                 df['end_station_latitude'], df['end_station_longitude'])

Wall time: 75.7 ms
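
The vectorized call works without changing `haversine` because every operation in it is a NumPy ufunc or plain arithmetic, both of which act element by element on a Series. The sketch below checks, on a small synthetic frame (an assumption, not the trip data), that the vectorized result matches the row-by-row `apply` result.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Synthetic stand-in for the trip data.
rng = np.random.default_rng(42)
n = 500
sample = pd.DataFrame({
    "start_station_latitude": rng.uniform(37.7, 37.8, n),
    "start_station_longitude": rng.uniform(-122.5, -122.4, n),
    "end_station_latitude": rng.uniform(37.7, 37.8, n),
    "end_station_longitude": rng.uniform(-122.5, -122.4, n),
})

# One call on whole columns: each argument is a Series of length n.
vectorized = haversine(sample["start_station_latitude"],
                       sample["start_station_longitude"],
                       sample["end_station_latitude"],
                       sample["end_station_longitude"])

# n calls on scalars: one per row.
row_wise = sample.apply(
    lambda row: haversine(row["start_station_latitude"],
                          row["start_station_longitude"],
                          row["end_station_latitude"],
                          row["end_station_longitude"]),
    axis=1,
)

print(type(vectorized).__name__, np.allclose(vectorized, row_wise))
```

Both approaches compute the same distances. The difference is that the vectorized version pays the Python function-call overhead once rather than once per row.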


Let's try NumPy vectorization by passing underlying NumPy arrays (via `.values`) instead of pandas Series.

In [7]:
%time df['distance'] = haversine(df['start_station_latitude'], df['start_station_longitude'], \
                                 df['end_station_latitude'].values, df['end_station_longitude'].values)

Wall time: 74 ms


Is there anything you can do to the cell above to further improve the performance?  Look carefully!
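
One possible reading of the hint is that only the two end-station columns are converted with `.values`, so the start-station columns still go through pandas. The sketch below, which assumes a small synthetic frame rather than the trip data, converts all four columns so that `haversine` receives plain NumPy arrays and returns one.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Synthetic stand-in for the trip data.
rng = np.random.default_rng(1)
n = 1_000
sample = pd.DataFrame({
    "start_station_latitude": rng.uniform(37.7, 37.8, n),
    "start_station_longitude": rng.uniform(-122.5, -122.4, n),
    "end_station_latitude": rng.uniform(37.7, 37.8, n),
    "end_station_longitude": rng.uniform(-122.5, -122.4, n),
})

# Series in, Series out: pandas handles index alignment at each step.
as_series = haversine(sample["start_station_latitude"],
                      sample["start_station_longitude"],
                      sample["end_station_latitude"],
                      sample["end_station_longitude"])

# Arrays in, array out: .values on all four columns bypasses the pandas layer.
as_arrays = haversine(sample["start_station_latitude"].values,
                      sample["start_station_longitude"].values,
                      sample["end_station_latitude"].values,
                      sample["end_station_longitude"].values)

print(type(as_series).__name__, type(as_arrays).__name__)
```

The results are numerically the same either way. Whether the all-array version is measurably faster on the full data set is something to confirm by timing it, since the gap between the two vectorized runs above is already small.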

Create a table summarizing the performance results.  

**Numbers may differ slightly from the timings above because I had to rerun the notebook to create an ordered kernel.**

In [8]:
dfpref = pd.DataFrame({
    "Type": ["iloc", "iterrows()", "apply", "vectorize", "numpy vectorize"],
    "Wall Time (s)": [23.1, 29.7, 15.7, 0.0548, 0.054],
})
dfpref.head()

| | Type | Wall Time (s) |
|---|---|---|
| 0 | iloc | 23.1 |
| 1 | iterrows() | 29.7 |
| 2 | apply | 15.7 |
| 3 | vectorize | 0.0548 |
| 4 | numpy vectorize | 0.054 |
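
To make the gap easier to read, the wall times can be turned into speedup factors relative to the slowest approach. This is a small sketch using the numbers from the table; the `summary` frame and the `Speedup` column are names introduced here for illustration.

```python
import pandas as pd

# Wall times in seconds, copied from the summary table.
summary = pd.DataFrame({
    "Type": ["iloc", "iterrows()", "apply", "vectorize", "numpy vectorize"],
    "Wall Time (s)": [23.1, 29.7, 15.7, 0.0548, 0.054],
})

# Speedup relative to the slowest approach (a value of 1.0 marks the slowest).
slowest = summary["Wall Time (s)"].max()
summary["Speedup"] = slowest / summary["Wall Time (s)"]

print(summary.sort_values("Speedup", ascending=False).round(1))
```

On these numbers, both vectorized versions come out several hundred times faster than the slowest loop, while `apply` is less than twice as fast.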
