### Challenge questions

Easy questions:

 1. How many total pings are in the Ocearch shark data?
 2. How many unique species of sharks are in the data set?
 3. What are the name, weight, and species of the heaviest shark(s)?
 4. When and where was the very first ping?
 5. Excluding results with 0 distance traveled: what's the minimum, average, and maximum travel distances?
 
Intermediate questions:

 1. Which shark had the most pings?
 2. Which shark has been pinging the longest, and how long has that been?
 3. Which shark species has the most individual sharks tagged?
 4. What is the average length and weight of each shark species?
 5. Which shark has the biggest geographic box (largest distance from min lat/lon to max lat/lon, not dist_traveled)?
 
Hard questions:

 1. Use folium to plot the first ping, last ping, and a line connecting each ping for the Tiger shark Emma.  Make the first ping marker a 'play' icon, and last ping icon a 'stop' icon.
 2. Resample Emma data to have a daily lat/lon average, and interpolate missing results.  Plot a marker for each day, and color them blue for hard data, green for interpolated lat/lons
 3. Resample all shark data for daily lat/lon averages, and interpolate missing results
 4. Calculate distance between Emma and other sharks on a daily basis
 5. Identify the shark that has the shortest average distance to Emma per day (minimum 50 days of pings with Emma)
 6. Plot Emma and her closest buddy: interpolated results for each in green, Emma as circle icons and her buddy as square icons

### Load data

In [5]:
import pandas as pd
import datetime as dt
df = pd.read_csv('data/sharks.csv')
df.shape

(65793, 12)

#### Clean

In [6]:
#cleans datetime

df['datetime'] = pd.to_datetime(df['datetime'])
df.datetime[0]

#cleans weight

def clean_weight(value):
    if not value:
        return value
    # most values are like "123 lb"
    value = str(value)
    for character in 'lbs,+':
        value = value.replace(character, '')
    return float(value)

#cleans length

def clean_length(value):
    if not value:
        return value
    # most length values are like '3 ft 4 in.'
    value = str(value)
    total = 0
    if 'ft' in value:
        ft, inches = value.split('ft')
        total += int(ft.strip()) * 12
    else:
        inches = value
    if inches.strip():
        total += float(inches.strip().split()[0])
    return total

df['weight'] = df.weight.apply(clean_weight)
df['length'] = df.length.apply(clean_length)

numeric_cols = ['latitude', 'longitude', 'dist_total', 'weight', 'length']
# convert column by column (the default axis) rather than row by row
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
df.head()

Unnamed: 0,active,datetime,id,latitude,longitude,name,gender,species,weight,length,tagDate,dist_total
0,1,2014-07-06 04:57:28,3,-34.60661,21.15244,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
1,1,2014-06-23 02:40:09,3,-34.78752,19.42479,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
2,1,2014-06-15 13:15:44,3,-34.42487,21.09754,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
3,1,2014-06-03 02:23:57,3,-34.704323,20.210134,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
4,1,2014-05-28 19:53:57,3,-34.65556,19.37459,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
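
As a cross-check on the cleaning functions above, the same parsing can be sketched with vectorized string methods. The sample strings below are made up for illustration and only follow the formats mentioned in the comments ("123 lb", "3 ft 4 in."); the regular expressions are an assumption about which variants occur in the real file.

```python
import pandas as pd

# Hypothetical raw values, patterned on the formats described in the comments.
raw = pd.DataFrame({
    'weight': ['686 lb', '1,200 lbs', '2,000+ lbs', None],
    'length': ['9 ft 10 in.', '12 ft', '40 in.', None],
})

# Weight: drop everything that is not a digit or a decimal point, then convert.
weight = pd.to_numeric(
    raw['weight'].str.replace(r'[^0-9.]', '', regex=True), errors='coerce')

# Length: capture optional feet and optional inches, then express both in inches.
parts = raw['length'].str.extract(
    r'(?:(?P<ft>\d+)\s*ft)?\s*(?:(?P<inch>[\d.]+)\s*in)?')
length = (pd.to_numeric(parts['ft']).fillna(0) * 12
          + pd.to_numeric(parts['inch']).fillna(0))
length = length.where(raw['length'].notna())  # keep missing lengths missing

print(pd.DataFrame({'weight': weight, 'length': length}))
```

Missing values stay missing in both columns, so the later `pd.to_numeric` step has nothing left to coerce.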


#### Query Ocearch API

In [7]:
import requests
url = 'http://www.ocearch.org/tracker/ajax/filter-sharks'

resp = requests.get(url)
resp

<Response [200]>

#### Transform data
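A rough sketch of how a JSON response might be flattened into one row per ping. The payload shape below (a list of sharks, each carrying a nested `pings` list) is purely an assumption for illustration, as are the field names; the real response may be organised differently.

```python
import pandas as pd

# ASSUMED payload shape, for illustration only: not the documented API response.
sample_payload = [
    {'id': 1, 'name': 'Shark A', 'species': 'Tiger Shark',
     'pings': [
         {'datetime': '2014-01-31 22:10:18', 'latitude': '-0.47', 'longitude': '-90.30'},
         {'datetime': '2014-02-01 08:31:34', 'latitude': '-0.43', 'longitude': '-89.65'},
     ]},
    {'id': 2, 'name': 'Shark B', 'species': 'White Shark',
     'pings': [
         {'datetime': '2014-07-06 04:57:28', 'latitude': '-34.61', 'longitude': '21.15'},
     ]},
]

# One row per ping, with the shark-level fields repeated on every row.
flat = pd.json_normalize(sample_payload, record_path='pings',
                         meta=['id', 'name', 'species'])

# Convert the string fields into proper datetime and numeric columns.
flat['datetime'] = pd.to_datetime(flat['datetime'])
flat[['latitude', 'longitude']] = flat[['latitude', 'longitude']].apply(pd.to_numeric)

print(flat)
```

With a live response, `resp.json()` would take the place of `sample_payload`, provided the payload really has this shape.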

### Explore data
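
Several of the easy and intermediate questions reduce to one-line pandas operations. The sketch below runs on a small made-up table that borrows the notebook's column names (`name`, `species`, `weight`, `length`, `datetime`, `latitude`, `longitude`); the values are hypothetical, and it assumes weight and length are constant per shark.

```python
import pandas as pd

# Hypothetical pings: three sharks, six pings in total.
pings = pd.DataFrame({
    'name': ['Ana', 'Ana', 'Ana', 'Bo', 'Bo', 'Cy'],
    'species': ['Tiger Shark'] * 3 + ['White Shark'] * 2 + ['Tiger Shark'],
    'weight': [300.0] * 3 + [900.0] * 2 + [250.0],
    'length': [99.0] * 3 + [150.0] * 2 + [90.0],
    'datetime': pd.to_datetime(['2014-01-01', '2014-01-05', '2014-03-01',
                                '2014-02-01', '2014-02-03', '2014-01-15']),
    'latitude': [0.1, 0.2, 0.3, -34.0, -34.5, 1.0],
    'longitude': [-90.0, -90.1, -90.2, 21.0, 21.5, -91.0],
})

# Easy questions
total_pings = len(pings)
n_species = pings['species'].nunique()
heaviest = pings.loc[pings['weight'] == pings['weight'].max(),
                     ['name', 'weight', 'species']].drop_duplicates()
first_ping = pings.loc[pings['datetime'].idxmin(),
                       ['datetime', 'latitude', 'longitude']]

# Intermediate questions
most_pings = pings['name'].value_counts().idxmax()
by_shark = pings.groupby('name')['datetime']
span = by_shark.max() - by_shark.min()          # time between first and last ping
longest = span.idxmax()
sharks = pings.drop_duplicates('name')          # one row per individual shark
sharks_per_species = sharks['species'].value_counts()
avg_size = sharks.groupby('species')[['length', 'weight']].mean()

print(total_pings, n_species, most_pings, longest)
print(avg_size)
```

Averaging over `sharks` rather than `pings` keeps a frequently pinging shark from dominating its species average.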

### Challenge Questions

#### Hard questions

##### Plot Emma locations
Plot the ping locations for the shark named Emma as a `PolyLine` in folium.  Include the first and last ping location as markers.

In [8]:
import folium as fm

In [9]:
emma = df[df.name == 'Emma'].copy().sort_values(by='datetime')
emma.head()

Unnamed: 0,active,datetime,id,latitude,longitude,name,gender,species,weight,length,tagDate,dist_total
34075,1,2014-01-31 22:10:18,102,-0.466747,-90.30005,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34074,1,2014-01-31 22:51:31,102,-0.41101,-90.32783,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34073,1,2014-01-31 23:49:34,102,-0.47808,-90.36889,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34072,1,2014-02-01 00:25:07,102,-0.24096,-89.920682,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34071,1,2014-02-01 08:31:34,102,-0.42935,-89.64942,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906


In [10]:
avg_lat = emma.latitude.mean()
avg_long = emma.longitude.mean()

mymap = fm.Map(tiles='stamenwatercolor',
              location=(avg_lat,avg_long),
              zoom_start=5)

latlong = list(zip(emma.latitude.values, emma.longitude.values))
latlong[:5]

fm.PolyLine(latlong,color='black').add_to(mymap)
fm.Marker(latlong[0],
          icon=fm.Icon(color='darkgreen',
                      icon='play')).add_to(mymap)
fm.Marker(latlong[-1],
          icon=fm.Icon(color='darkred',
                      icon='stop')).add_to(mymap)

mymap

##### Plot interpolated locs
Resample the Emma locations on a per-day basis and interpolate missing locations.  Then, plot the daily markers in folium along with a `PolyLine`.

In [11]:
emma['day'] = emma['datetime'].dt.date
emma.head()
avlang = emma.groupby('day').agg({'latitude':'mean','longitude':'mean'}).reset_index()

def daygap(day1,day2):
    return abs((day2 - day1).days)

for ind, row in avlang.iterrows():
    if ind < 1:
        print("skipping first line")
    else:
        day1 = row.day
        last_row = avlang.iloc[ind-1]
        day2 = last_row.day
        gap = daygap(day1,day2)
        if gap>1:
            print("measured gap of {} days between {} and {}".format(gap, day1, day2))

skipping first line
measured gap of 27 days between 2014-04-30 and 2014-04-03
measured gap of 2 days between 2014-05-09 and 2014-05-07
measured gap of 4 days between 2014-05-13 and 2014-05-09
measured gap of 3 days between 2014-05-17 and 2014-05-14
measured gap of 3 days between 2014-05-21 and 2014-05-18
measured gap of 7 days between 2014-05-28 and 2014-05-21
measured gap of 2 days between 2014-05-31 and 2014-05-29
measured gap of 27 days between 2014-06-27 and 2014-05-31
measured gap of 14 days between 2014-07-11 and 2014-06-27
measured gap of 5 days between 2014-07-17 and 2014-07-12
measured gap of 2 days between 2014-07-24 and 2014-07-22
measured gap of 2 days between 2014-07-31 and 2014-07-29
measured gap of 14 days between 2014-08-16 and 2014-08-02
measured gap of 5 days between 2014-08-21 and 2014-08-16
measured gap of 2 days between 2014-08-23 and 2014-08-21
measured gap of 2 days between 2014-08-25 and 2014-08-23
measured gap of 3 days between 2014-08-29 and 2014-08-26
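
The same gap check can be sketched without an explicit loop by differencing the sorted days. The dates below are a short made-up series, and the names `days` and `report` are illustrative only.

```python
import pandas as pd

# Hypothetical sorted list of days with at least one ping.
days = pd.Series(pd.to_datetime(
    ['2014-04-02', '2014-04-03', '2014-04-30', '2014-05-01']))

# Days elapsed since the previous row (NaN for the first row).
gap_days = days.diff().dt.days

report = pd.DataFrame({
    'previous_day': days.shift(),
    'day': days,
    'gap_days': gap_days,
})

# Keep only the rows where more than one day passed since the previous ping day.
report = report[report['gap_days'] > 1]
print(report)
```

Only the row for 2014-04-30 survives the filter here, since it is the only day that follows a gap longer than one day.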


In [12]:
interpolatedemma = emma.set_index('datetime').resample('1D')[['latitude','longitude']].mean()
interpolatedemma['interpolated'] = interpolatedemma['latitude'].isnull()
fullyinterpolatedemma = interpolatedemma.interpolate(method ='linear')
fullyinterpolatedemma.tail()

Unnamed: 0_level_0,latitude,longitude,interpolated
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-08-25,0.18386,-80.49963,False
2014-08-26,0.81609,-81.22973,False
2014-08-27,1.245737,-81.090107,True
2014-08-28,1.675383,-80.950483,True
2014-08-29,2.10503,-80.81086,False


In [13]:
emmainterpolatedmap = fm.Map(tiles='cartodbpositron')

interpolatedvalues = list(zip(fullyinterpolatedemma.latitude.values,
                   fullyinterpolatedemma.longitude.values))

#interpolatedvalues[:5]

fm.PolyLine(interpolatedvalues,color='black').add_to(emmainterpolatedmap)
fm.Marker(interpolatedvalues[0],
          icon=fm.Icon(color='lightred',
                      icon='play')).add_to(emmainterpolatedmap)

emmainterpolatedmap

##### Resample all shark data
Resample all shark data for daily lat/lon averages, and interpolate missing results.

In [14]:
#interpolatedsharks = df.set_index('datetime').resample('1D')[['latitude','longitude']].mean()
#interpolatedsharks['interpolated'] = interpolatedsharks['latitude'].isnull()
#fullyinterpolatedsharks = interpolatedsharks.interpolate(method ='linear')
#fullyinterpolatedsharks.head()

In [15]:
dtgroup = df.set_index('datetime').groupby('name').resample('1D')[['latitude','longitude']].mean()
dtgroup['interpolated'] = dtgroup['latitude'].isnull()
interpolatedsharks = dtgroup.interpolate(method ='linear')


#dtgroup = df.groupby('datetime').agg({'latitude':'mean','longitude':'mean'}).head()
#dtgroup['name'] = df['name']

interpolatedsharks.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,latitude,longitude,interpolated
name,datetime,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
iSimangaliso,2017-04-26,-27.071329,32.903404,True
iSimangaliso,2017-04-27,-27.069021,32.901411,True
iSimangaliso,2017-04-28,-27.066714,32.899417,True
iSimangaliso,2017-04-29,-27.064407,32.897424,True
iSimangaliso,2017-04-30,-27.0621,32.89543,False
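
The cell above interpolates the whole frame in one call, which appears to be safe so long as each shark's first and last day holds a real ping. A sketch that makes the per-shark boundary explicit is shown below on made-up data; the shark names and coordinates are hypothetical.

```python
import pandas as pd

# Hypothetical pings for two sharks; shark A has no ping on 2014-01-02.
pings = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'datetime': pd.to_datetime(['2014-01-01 06:00', '2014-01-03 18:00',
                                '2014-01-01 12:00', '2014-01-02 12:00']),
    'latitude': [0.0, 2.0, 10.0, 11.0],
    'longitude': [-90.0, -88.0, 20.0, 21.0],
})
coords = ['latitude', 'longitude']

# Daily mean position per shark; days without pings come out as NaN.
daily = pings.set_index('datetime').groupby('name').resample('1D')[coords].mean()
daily['interpolated'] = daily['latitude'].isnull()

# Interpolate inside each shark's own track so values never bleed between sharks.
daily[coords] = daily.groupby(level='name')[coords].transform(
    lambda s: s.interpolate(method='linear'))

print(daily)
```

Grouping before interpolating costs little and removes the dependence on how each shark's block happens to begin and end.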


##### Distance between Emma and other sharks

In [16]:
# extract Emma's rows into their own frame (emmainterpolated)
# keep every other shark in a separate frame (interpolatedwithoutemma)
# then join the two on the day

emmainterpolated = interpolatedsharks.loc['Emma']
interpolatedwithoutemma = interpolatedsharks.copy().drop("Emma", axis=0)
emmalocations = emmainterpolated.copy().drop("interpolated",axis=1)
sharklocations = interpolatedwithoutemma.copy().drop("interpolated", axis=1).reset_index()

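# NOTE: the trailing .head() keeps only Emma's first five days, so the merge in the
# next cell covers just those days; drop .head() to compare every day in her track.
mergedemmalocations = emmalocations.rename(columns={"latitude": "emmalat","longitude": "emmalon"}).head()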

In [17]:
merged = mergedemmalocations.merge(sharklocations,on='datetime')
merged.head()

# emmalat = lat1, emmalon = lon1, latitude = lat2, longitude = lon2
# run each row through 'kmetres_between_two_points'
# save the result as a new column

Unnamed: 0,datetime,emmalat,emmalon,name,latitude,longitude
0,2014-01-31,-0.451946,-90.332257,Albertina,-34.570036,21.851467
1,2014-01-31,-0.451946,-90.332257,Andre,-33.205522,24.464655
2,2014-01-31,-0.451946,-90.332257,April,37.62394,-68.063633
3,2014-01-31,-0.451946,-90.332257,Beamer,11.384565,-76.018893
4,2014-01-31,-0.451946,-90.332257,Beatriz,1.084768,-91.843975


In [19]:
from haversine import haversine

# haversine() takes two (lat, lon) tuples and returns the great-circle
# distance between them, in kilometres by default
# (lat1, lon1) = point A, (lat2, lon2) = point B

def kmetres_between_two_points(emmalat,emmalon,latitude,longitude):
    km_distance = haversine((emmalat, emmalon), (latitude, longitude))
    return km_distance

kmetres_between_two_points

<function __main__.kmetres_between_two_points(emmalat, emmalon, latitude, longitude)>

In [20]:
merged['separation'] = merged.apply(lambda r: kmetres_between_two_points(r.emmalat, r.emmalon, r.latitude, r.longitude),axis=1)

In [21]:
merged.head()

Unnamed: 0,datetime,emmalat,emmalon,name,latitude,longitude,separation
0,2014-01-31,-0.451946,-90.332257,Albertina,-34.570036,21.851467,11991.711003
1,2014-01-31,-0.451946,-90.332257,Andre,-33.205522,24.464655,12262.46139
2,2014-01-31,-0.451946,-90.332257,April,37.62394,-68.063633,4811.478862
3,2014-01-31,-0.451946,-90.332257,Beamer,11.384565,-76.018893,2057.449641
4,2014-01-31,-0.451946,-90.332257,Beatriz,1.084768,-91.843975,239.691121
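
For readers without the `haversine` package, the great-circle formula can be sketched with the standard library alone. The radius constant and the function name `haversine_km` are assumptions made for this sketch; results may differ slightly from the package if it uses a different radius.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)

# The Emma and Beatriz positions from the first merged rows above.
emma_to_beatriz = haversine_km(-0.451946, -90.332257, 1.084768, -91.843975)

print(round(one_degree, 2), round(emma_to_beatriz, 2))
```

The second value should land close to the separation listed for Beatriz in the table above, which gives a quick sanity check on the `separation` column.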


##### Emma's buddy
Identify the shark that has the shortest average distance to Emma per day (minimum 50 days of pings with Emma).

In [22]:
# Find sharks that pinged on the same days as Emma at least 50 times:
#   1. keep only the days on which Emma actually pinged (interpolated is False)
#   2. filter 'interpolatedsharks' the same way
#   3. merge the two on the day and count the shared days per shark

In [23]:
sharkbuddy = interpolatedsharks[interpolatedsharks.interpolated == 0].reset_index()
sharkbuddy.head()

Unnamed: 0,name,datetime,latitude,longitude,interpolated
0,AB,2016-03-30,30.49353,-80.37539,False
1,AB,2016-03-31,30.495155,-80.328785,False
2,AB,2016-04-01,30.466347,-80.129087,False
3,AB,2016-04-02,30.36282,-80.23531,False
4,AB,2016-04-03,30.31351,-80.223723,False


In [24]:
emmareadytobuddy = emmainterpolated.rename(columns={"latitude": "emmalat","longitude": "emmalon"})
emmabuddy = emmareadytobuddy[emmareadytobuddy.interpolated == 0].reset_index()
emmabuddy.head()

Unnamed: 0,datetime,emmalat,emmalon,interpolated
0,2014-01-31,-0.451946,-90.332257,False
1,2014-02-01,-0.407268,-90.016025,False
2,2014-02-02,-0.487691,-90.294378,False
3,2014-02-03,-0.454803,-90.3243,False
4,2014-02-04,-0.43929,-90.294217,False


In [25]:
buddies = emmabuddy.merge(sharkbuddy,on='datetime')
buddies.head()

In [26]:
sharedpings = buddies.groupby('name').size().reset_index()
sharedpings.columns=['name','number_of_shared_pings']
closebuds = sharedpings[sharedpings.number_of_shared_pings >= 50]

buddyinfo = closebuds.merge(merged,on='name')
buddyinfo
#groupby name and find mean separation
closestbud = buddyinfo.groupby('name').agg({'separation':'mean'}).sort_values(by='separation')
closestbud

Unnamed: 0_level_0,separation
name,Unnamed: 1_level_1
Itabaca,10.213953
Guayasamin,10.316017
Floreana,12.373984
Esperanza,12.463062
Yolanda,13.316236
Lonesome Jorgita,14.070417
Beatriz,277.264441
April,4771.560983
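
One way to read "minimum 50 days of pings with Emma" is to count only days on which both sharks have a real, non-interpolated position, and to average the separation over those same days. The sketch below takes that reading on made-up data; the column names `emma_interpolated` and `buddy_interpolated` and the helper `closest_buddy` are hypothetical.

```python
import pandas as pd


def closest_buddy(pairs, min_shared_days=50):
    """Rank sharks by mean separation over days where both positions are measured.

    `pairs` holds one row per (day, shark) with a 'name', a 'separation' in km,
    and boolean 'emma_interpolated' / 'buddy_interpolated' flags.
    """
    real = pairs[~pairs['emma_interpolated'] & ~pairs['buddy_interpolated']]
    stats = real.groupby('name')['separation'].agg(shared_days='size',
                                                   mean_km='mean')
    stats = stats[stats['shared_days'] >= min_shared_days]
    return stats.sort_values('mean_km')


# Hypothetical per-day separations for three sharks.
pairs = pd.DataFrame({
    'name': ['X', 'X', 'X', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'separation': [10.0, 12.0, 14.0, 1.0, 2.0, 50.0, 60.0, 70.0],
    'emma_interpolated': [False] * 8,
    'buddy_interpolated': [False, False, False, False, False, False, False, True],
})

ranked = closest_buddy(pairs, min_shared_days=3)
print(ranked)
```

With a threshold of three shared days only shark X qualifies in this toy table, even though shark Y is closer on the two days it shares.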


##### Plot Emma and Buddy
Plot Emma and her closest buddy on folium.  Emma should be blue/green (known/interpolated) and her buddy should be red/black (known/interpolated).

In [32]:
buddies = fm.Map(tiles='stamenwatercolor',
              location=(avg_lat,avg_long),
              zoom_start=5)

emmabudlatlng = list(zip(fullyinterpolatedemma.latitude.values,
                   fullyinterpolatedemma.longitude.values))

fm.PolyLine(emmabudlatlng,color='blue').add_to(buddies)
#fm.Marker(latlong[0],
 #         icon=fm.Icon(color='darkgreen',
  #                    icon='play')).add_to(buddies)
#fm.Marker(latlong[-1],
 #         icon=fm.Icon(color='darkred',
  #                    icon='stop')).add_to(buddies)

<folium.vector_layers.PolyLine at 0x26b64a8e860>

In [33]:
itabaca = interpolatedsharks.loc['Itabaca'].sort_values(by='datetime')
itabaca.head()

Unnamed: 0_level_0,latitude,longitude,interpolated
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-01-31,-0.47325,-90.25485,False
2014-02-01,-0.480699,-90.288665,False
2014-02-02,-0.478765,-90.309573,False
2014-02-03,-0.484776,-90.300258,False
2014-02-04,-0.460771,-90.328676,False


In [34]:
itabacalatlng = list(zip(itabaca.latitude.values,
                   itabaca.longitude.values))

fm.PolyLine(itabacalatlng,color='red').add_to(buddies)
#fm.Marker(latlong[0],
 #         icon=fm.Icon(color='darkgreen',
  #                    icon='play')).add_to(buddies)
#fm.Marker(latlong[-1],
 #         icon=fm.Icon(color='darkred',
  #                    icon='stop')).add_to(buddies)

buddies