# 5 ways to improve your loop speed

- Basic looping
- Loop using iterrows()
- Loop using apply()
- Vectorization with pandas series
- Vectorization using numpy arrays

<br/>

* Reference [here](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/s-heisler/pycon2017-optimizing-pandas/master/pyCon%20materials/new_york_hotels.csv", encoding='windows-1254')


def haversine(lat1, lon1, lat2, lon2):
    '''Return the great-circle distance in miles between two locations given as lat/lon in degrees.'''
    MILES = 3959  # approximate radius of the Earth in miles
    # convert degrees to radians (works on scalars, pandas series and numpy arrays)
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine formula
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    total_miles = MILES * c
    return total_miles
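
Before timing anything, it can help to sanity-check `haversine` on plain scalars. The sketch below repeats the function so it runs on its own. The names `origin` and `hotel` are only illustrative labels: the first is the fixed reference point used in the cells of this notebook, the second is the first hotel row of the dataset.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

origin = (40.671, -73.985)     # fixed reference point (illustrative label)
hotel = (42.68751, -73.81643)  # latitude/longitude of the first hotel row

same_point = haversine(*origin, *origin)  # a point is zero miles from itself
there = haversine(*origin, *hotel)
back = haversine(*hotel, *origin)         # distance should not depend on direction

print(same_point, there, back)
```

If the two directions disagree, or a point is not zero miles from itself, the function has a problem that is worth fixing before any benchmarking.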

In [2]:
df.head()

| Unnamed: 0 | ean_hotel_id | name | address1 | city | state_province | postal_code | latitude | longitude | star_rating | high_rate | low_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 269955 | Hilton Garden Inn Albany/SUNY Area | 1389 Washington Ave | Albany | NY | 12206 | 42.68751 | -73.81643 | 3.0 | 154.0272 | 124.0216 |
| 1 | 113431 | Courtyard by Marriott Albany Thruway | 1455 Washington Avenue | Albany | NY | 12206 | 42.68971 | -73.82021 | 3.0 | 179.01 | 134.0 |
| 2 | 108151 | Radisson Hotel Albany | 205 Wolf Rd | Albany | NY | 12205 | 42.7241 | -73.79822 | 3.0 | 134.17 | 84.16 |
| 3 | 254756 | Hilton Garden Inn Albany Medical Center | 62 New Scotland Ave | Albany | NY | 12208 | 42.65157 | -73.77638 | 3.0 | 308.2807 | 228.4597 |
| 4 | 198232 | CrestHill Suites SUNY University Albany | 1415 Washington Avenue | Albany | NY | 12206 | 42.68873 | -73.81854 | 3.0 | 169.39 | 89.39 |


## Basic looping
* Create a new column and populate it with a basic `for` loop over the row positions

In [3]:
# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

In [4]:
%%timeit

# create new column, and populate it using basic loop
df['distance'] = haversine_looping(df)

260 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Loop using iterrows()
* Collect the results in a new list, then assign that list to a new column of the dataframe


In [5]:
%%timeit

# collect the results in a new list
haversine_series = []

for index, row in df.iterrows():
    # first location is the fixed lat/lon; second location comes from the current row
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))

# assign the list to a new column in the dataframe
df['distance1'] = haversine_series

91.1 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
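
As an aside that is not benchmarked in this notebook: `iterrows()` builds a full pandas Series for every row, while `DataFrame.itertuples()` yields lightweight namedtuples. A minimal sketch of the same loop written with `itertuples()` follows. The frame `sample` is a hypothetical three-row stand-in whose coordinates are copied from the `df.head()` output, and `haversine` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

# hypothetical stand-in for the hotels dataframe (first three rows only)
sample = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.7241],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# each row is a namedtuple, so columns are read as attributes
distances = [
    haversine(40.671, -73.985, row.latitude, row.longitude)
    for row in sample.itertuples(index=False)
]
sample["distance"] = distances
print(sample)
```

Whether this is faster on the real dataset would need its own `%%timeit` run; it is still a Python-level loop, so it should not be expected to compete with vectorization.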


## Loop using apply()
* Same logic as the basic for loop, but the function is passed to `apply()` as a lambda and the result is assigned directly to the new column

In [6]:
%%timeit

# create the new column by applying haversine to every row (axis=1)
df['distance2'] = df.apply(lambda x: haversine(40.671, -73.985, x['latitude'], x['longitude']), axis=1)


40.5 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Vectorization with pandas series
```
Let latitude be vector 1 and longitude be vector 2. We get them simply as df['latitude'] and df['longitude'],
and haversine then works on the whole columns at once instead of row by row.

latitude         longitude
23                  99
32                  98
22        x         95
..                  ..
..                  ..
100                 59

```


In [7]:
%%timeit

df['distance3'] = haversine(40.671, -73.985, df['latitude'], df['longitude']) 

1.89 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
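
To make the idea concrete, the sketch below calls `haversine` once with two whole columns and checks that the result matches an element-by-element loop. The frame `sample` is a hypothetical three-row stand-in with coordinates copied from the `df.head()` output, and `haversine` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

# hypothetical stand-in for the hotels dataframe (first three rows only)
sample = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.7241],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# one call on whole columns: the numpy functions inside haversine work element-wise
vectorized = haversine(40.671, -73.985, sample["latitude"], sample["longitude"])

# the same distances computed one pair at a time, for comparison
looped = [
    haversine(40.671, -73.985, lat, lon)
    for lat, lon in zip(sample["latitude"], sample["longitude"])
]

print(type(vectorized).__name__)
print(np.allclose(vectorized, looped))
```

The function body did not change at all. It vectorizes only because it is built entirely from numpy operations that accept scalars and arrays alike.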


## Vectorization using numpy arrays
* Same as the pandas series version, but we pass only the underlying numpy arrays via `.values`, which skips pandas overhead such as index handling and is therefore faster



In [9]:
%%timeit

df['distance4'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

354 µs ± 6.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [11]:
# numpy vectorization
df['latitude'].values

array([42.68751, 42.68971, 42.7241 , ..., 40.92625, 40.95375, 40.97308])

In [12]:
# pandas series vectorization
df['latitude']

0       42.68751
1       42.68971
2       42.72410
3       42.65157
4       42.68873
          ...   
1626    40.97275
1627    40.95466
1628    40.92625
1629    40.95375
1630    40.97308
Name: latitude, Length: 1631, dtype: float64
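
A side note on `.values`: the pandas documentation recommends `Series.to_numpy()` for new code. The sketch below, on a hypothetical three-row frame with coordinates copied from the `df.head()` output, suggests the two are interchangeable for a plain float column like `latitude`. `haversine` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

# hypothetical stand-in for the hotels dataframe (first three rows only)
sample = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.7241],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

lat_values = sample["latitude"].values     # attribute used in this notebook
lat_numpy = sample["latitude"].to_numpy()  # method recommended by the pandas docs

print(type(lat_values).__name__, type(lat_numpy).__name__)
print(np.array_equal(lat_values, lat_numpy))

# with array inputs, haversine returns a plain numpy array (no index attached)
result = haversine(40.671, -73.985, lat_numpy, sample["longitude"].to_numpy())
print(type(result).__name__, result.shape)
```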

## Final notes
* Use the numpy array method (`df['col_name'].values`) where possible; it was the fastest approach timed above (354 µs, versus 260 ms for the basic loop)
* Replace for-loops with `.apply()`, as the two have a similar structure; it is a good habit and gives faster code
* Explore numpy more
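
Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module instead of `%%timeit`. The data below is synthetic (random coordinates in a made-up bounding box, held in a hypothetical frame named `synthetic`), so absolute timings will differ from the ones recorded above. The helper names `with_apply`, `with_series` and `with_numpy` are illustrative only.

```python
import timeit

import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

# synthetic stand-in for the hotels data: 1000 random points in an assumed bounding box
rng = np.random.default_rng(0)
synthetic = pd.DataFrame({
    "latitude": rng.uniform(40.5, 45.0, 1000),
    "longitude": rng.uniform(-79.0, -72.0, 1000),
})

def with_apply():
    return synthetic.apply(
        lambda r: haversine(40.671, -73.985, r["latitude"], r["longitude"]), axis=1
    )

def with_series():
    return haversine(40.671, -73.985, synthetic["latitude"], synthetic["longitude"])

def with_numpy():
    return haversine(
        40.671, -73.985, synthetic["latitude"].values, synthetic["longitude"].values
    )

for fn in (with_apply, with_series, with_numpy):
    # best of 3 repeats, 5 calls per repeat, reported per call
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes. All three helpers should return the same distances, so only the time per call differs.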
