### Challenge questions

Easy questions:

 1. How many total pings are in the Ocearch shark data?
 2. How many unique species of sharks are in the data set?
 3. What is the name, weight, and species of the heaviest shark(s)?
 4. When and where was the very first ping?
 5. Excluding results with 0 distance traveled: what's the minimum, average, and maximum travel distances?
 
Intermediate questions:

 1. Which shark had the most pings?
 2. Which shark has been pinging the longest, and how long has that been?
 3. Which shark species has the most individual sharks tagged?
 4. What is the average length and weight of each shark species?
 5. Which shark has the biggest geographic box (largest distance from min lat/lon to max lat/lon, not dist_traveled)?
 
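Hard questions:

 1. Plot the ping locations for the shark named Emma as a `PolyLine` in folium. Include the first and last ping locations as markers.
 2. Resample the Emma locations on a per-day basis and interpolate missing locations. Then plot the daily markers in folium along with a `PolyLine`.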

### Load data

#### Query Ocearch API

In [26]:
import requests
url = 'http://www.ocearch.org/tracker/ajax/filter-sharks'

resp = requests.get(url)
resp

<Response [200]>

##### Turn json into dataframe

In [27]:
import pandas as pd
df = pd.DataFrame(resp.json())
columns = ['id', 'name', 'gender', 'species', 'weight', 'length', 'tagDate', 'dist_total']
df[columns].head()

Unnamed: 0,id,name,gender,species,weight,length,tagDate,dist_total
0,3,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662
1,4,Albertina,Female,White Shark (Carcharodon carcharias),1110 lb,11 ft 6 in.,8 March 2012,1830.593
2,5,Helen,Female,White Shark (Carcharodon carcharias),765 lb,10 ft 2 in.,8 March 2012,4436.661
3,6,Brenda,Female,White Shark (Carcharodon carcharias),1310 lb,12 ft 2 in.,8 March 2012,2966.902
4,7,Madiba,Male,White Shark (Carcharodon carcharias),659 lb,9 ft 8 in.,8 March 2012,3537.423


In [28]:
df.shape

(275, 30)

##### Filter out non-shark data

In [29]:
df.species.value_counts()

Tiger Shark  (Galeocerdo cuvier)                   82
White Shark (Carcharodon carcharias)               74
Blue Shark (Prionace glauca)                       27
Mako Shark (Isurus oxyrinchus)                     18
Hammerhead Shark (Sphyrna)                         18
Olive Ridley Turtle (Lepidochelys olivacea)        16
Loggerhead Sea Turtle (Caretta caretta)             9
Blacktip Shark (Carcharhinus limbatus)              9
Silky Shark (Carcharhinus falciformis)              4
Guadalupe Fur Seals (Arctocephalus townsendi)       4
Bull Shark (Carcharhinus leucas)                    4
Whale Shark (Rhincodon Typus)                       3
American alligator (Alligator mississippiensis)     2
Pilot Whale (Globicephala)                          1
 Harbor Seal (Phoca vitulina)                       1
Ship (Motor Vessel)                                 1
Dolphin (Delphinus capensis)                        1
Green Sea Turtle (Chelonia mydas)                   1
Name: species, dtype: int64

In [30]:
df = df[df.species.fillna('').str.contains('shark', case=False)]
df.shape

(239, 30)

##### Extract ping data

In [31]:
ping_frames = []
for row in df.itertuples():
    ping_frame = pd.DataFrame(row.pings)
    ping_frame['id'] = row.id
    ping_frames.append(ping_frame)
    
len(ping_frames)

239

##### Merge shark/ping data

In [32]:
pings = pd.concat(ping_frames)
pings.shape

(65871, 6)
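
As an alternative to the explicit loop, the nested `pings` lists could probably be flattened in one call with `pd.json_normalize`. The sketch below uses made-up toy records (the names `records` and `toy_pings` are illustrative, not part of the notebook) and assumes every record has a `pings` list.

```python
import pandas as pd

# Toy records shaped like the API response; values are made up for illustration.
records = [
    {"id": 3, "name": "Oprah", "pings": [
        {"latitude": "-34.6", "longitude": "21.1"},
        {"latitude": "-34.7", "longitude": "19.4"},
    ]},
    {"id": 4, "name": "Albertina", "pings": [
        {"latitude": "-34.1", "longitude": "22.4"},
    ]},
]

# One row per ping, with the parent shark's id carried along as a column.
toy_pings = pd.json_normalize(records, record_path="pings", meta=["id"])
print(toy_pings.shape)
```

With the real data this would be called on the list returned by `resp.json()` rather than on the DataFrame.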

In [10]:
joined = pings.merge(df[columns], on='id')
joined.shape

(65871, 13)
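
The row count staying at 65871 suggests each ping matched exactly one shark. A hedged way to make that expectation explicit is the `validate` argument of `merge`; the toy frames below are illustrative stand-ins for `pings` and `df[columns]`.

```python
import pandas as pd

toy_pings = pd.DataFrame({"id": [3, 3, 4], "latitude": [-34.6, -34.7, -34.1]})
toy_sharks = pd.DataFrame({"id": [3, 4], "name": ["Oprah", "Albertina"]})

# validate="many_to_one" raises pandas.errors.MergeError if `id` is
# duplicated on the right-hand (shark) side, which would multiply ping rows.
toy_joined = toy_pings.merge(toy_sharks, on="id", validate="many_to_one")
print(toy_joined.shape)
```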

In [11]:
joined.head()

Unnamed: 0,active,datetime,id,latitude,longitude,tz_datetime,name,gender,species,weight,length,tagDate,dist_total
0,1,6 July 2014 1:57:28 PM,3,-34.60661,21.15244,6 July 2014 1:57:28 PM +0900,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662
1,1,23 June 2014 11:40:09 AM,3,-34.78752,19.42479,23 June 2014 11:40:09 AM +0900,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662
2,1,15 June 2014 10:15:44 PM,3,-34.42487,21.09754,15 June 2014 10:15:44 PM +0900,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662
3,1,3 June 2014 11:23:57 AM,3,-34.70432271674724,20.21013441406251,3 June 2014 11:23:57 AM +0900,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662
4,1,29 May 2014 4:53:57 AM,3,-34.65556,19.37459,29 May 2014 4:53:57 AM +0900,Oprah,Female,White Shark (Carcharodon carcharias),686 lb,9 ft 10 in.,7 March 2012,2816.662


##### Clean data

In [38]:
df = joined # don't need a reference to the original resp.json() df anymore
df.shape

(65871, 13)

In [39]:
def clean_weight(value):
    if not value:
        return value
    # most values are like "123 lb"
    value = str(value)
    for character in 'lbs,+':
        value = value.replace(character, '')
    return float(value)

def clean_length(value):
    if not value:
        return value
    # most length values are like '3 ft 4 in.'
    value = str(value)
    total = 0
    if 'ft' in value:
        ft, inches = value.split('ft')
        total += int(ft.strip()) * 12
    else:
        inches = value
    if inches.strip():
        total += float(inches.strip().split()[0])
    return total

df['weight'] = df.weight.apply(clean_weight)
df['length'] = df.length.apply(clean_length)
df['datetime'] = pd.to_datetime(df.tz_datetime)

numeric_cols = ['latitude', 'longitude', 'dist_total', 'weight', 'length']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)  # convert column by column
df = df.drop(columns=['tz_datetime'])
df.head()

Unnamed: 0,active,datetime,id,latitude,longitude,name,gender,species,weight,length,tagDate,dist_total
0,1,2014-07-06 13:57:28+09:00,3,-34.60661,21.15244,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
1,1,2014-06-23 11:40:09+09:00,3,-34.78752,19.42479,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
2,1,2014-06-15 22:15:44+09:00,3,-34.42487,21.09754,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
3,1,2014-06-03 11:23:57+09:00,3,-34.704323,20.210134,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
4,1,2014-05-29 04:53:57+09:00,3,-34.65556,19.37459,Oprah,Female,White Shark (Carcharodon carcharias),686.0,118.0,7 March 2012,2816.662
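
The length parser above can be spot-checked against a few known strings. The following is a separate, regex-based sketch of the same idea (the helper name `length_to_inches` is hypothetical); it assumes lengths are written with `ft` and `in` units, as in the sample rows.

```python
import re

def length_to_inches(text):
    """Convert strings such as '9 ft 10 in.' to total inches (hypothetical helper)."""
    feet = re.search(r"(\d+(?:\.\d+)?)\s*ft", text)
    inches = re.search(r"(\d+(?:\.\d+)?)\s*in", text)
    if not feet and not inches:
        return None  # no recognizable unit in the string
    total = 0.0
    if feet:
        total += float(feet.group(1)) * 12
    if inches:
        total += float(inches.group(1))
    return total

print(length_to_inches("9 ft 10 in."))
```

For `'9 ft 10 in.'` this gives 118.0 inches, which agrees with the cleaned `length` shown for Oprah.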


### Explore data

In [53]:
# are all the names unique?
counts = df.groupby('name').id.nunique()
counts.head()

name
AB          1
ANZAC       1
Adelaide    1
Al          1
Albert      1
Name: id, dtype: int64

In [54]:
counts[counts > 1]

Series([], Name: id, dtype: int64)

### Challenge Questions

#### Easy Questions
 1. How many total pings are in the Ocearch shark data?
 2. How many unique species of sharks are in the data set?
 3. What is the name, weight, and species of the heaviest shark(s)?
 4. When and where was the very first ping?
 5. Excluding results with 0 distance traveled: what's the minimum, average, and maximum travel distances?

##### Total pings
How many total pings are in the Ocearch shark data?

In [40]:
len(df)

65871

##### Unique species
How many unique species of sharks are in the data set?

In [41]:
df.species.nunique()

9

##### Heaviest shark(s)
What is the name, weight, and species of the heaviest shark(s)?

In [42]:
max_weight = df.weight.max()
max_weight

25000.0

In [43]:
heavy_sharks = df[df.weight == max_weight]
heavy_sharks.drop_duplicates('name')

Unnamed: 0,active,datetime,id,latitude,longitude,name,gender,species,weight,length,tagDate,dist_total
56485,1,2016-10-24 12:12:24+09:00,233,38.5389,-68.8206,Rocky Mazzanti,Female,Whale Shark (Rhincodon Typus),25000.0,300.0,24 August 2016,1753.524
59317,1,2017-11-01 09:11:15+09:00,253,40.97405,-69.55461,Canyon,Male,Whale Shark (Rhincodon Typus),25000.0,360.0,9 August 2017,2093.576


##### First ping
When and where was the very first ping?

In [46]:
first_pings = df.sort_values('datetime').head(5)
first_pings[['name', 'datetime', 'latitude', 'longitude']]

Unnamed: 0,name,datetime,latitude,longitude
519,Oprah,2012-03-10 00:35:31+09:00,-34.132,22.123
518,Oprah,2012-03-10 06:30:24+09:00,-34.16,22.18
517,Oprah,2012-03-10 06:42:45+09:00,-34.158,22.182
516,Oprah,2012-03-10 10:53:02+09:00,-34.165,22.179
752,Albertina,2012-03-10 17:23:50+09:00,-34.179,22.417
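
Every timestamp carries a `+09:00` offset, even though these first pings sit near latitude -34, longitude 22. If a location-independent answer to "when" is wanted, one option is to parse the strings straight to UTC. This is a sketch under the assumption that the raw `tz_datetime` strings look like the sample rows shown earlier; the second string is reconstructed from the first-ping row for illustration.

```python
import pandas as pd

# Strings in the same layout as the tz_datetime column (illustrative values).
raw = pd.Series([
    "6 July 2014 1:57:28 PM +0900",
    "10 March 2012 12:35:31 AM +0900",
])

# An explicit format avoids relying on format inference; utc=True converts
# the +0900 offset to UTC.
utc = pd.to_datetime(raw, format="%d %B %Y %I:%M:%S %p %z", utc=True)
print(utc)
```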


##### Distance travelled
Excluding results with 0 distance traveled: what's the minimum, average, and maximum travel distances?

In [47]:
df.dist_total[df.dist_total > 0].describe()

count    65863.000000
mean     12571.051412
std      12751.112685
min          8.127000
25%       3048.274000
50%       8177.352000
75%      17811.853000
max      46553.182000
Name: dist_total, dtype: float64
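
Note that `dist_total` is a per-shark value repeated on every ping row, so the statistics above are weighted by ping count (the `count` of 65863 is pings, not sharks). If per-shark figures are what the question intends, a minimal sketch is to keep one row per `id` first. The toy frame below is made up to show the difference.

```python
import pandas as pd

# Toy ping-level frame: dist_total repeats on every ping of the same shark.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 3],
    "dist_total": [100.0, 100.0, 100.0, 400.0, 0.0],
})

# Ping-weighted mean, as computed in the cell above: (100*3 + 400) / 4
ping_weighted_mean = toy.dist_total[toy.dist_total > 0].mean()

# One row per shark, then drop zero-distance sharks: (100 + 400) / 2
per_shark = toy.drop_duplicates("id")
per_shark = per_shark[per_shark.dist_total > 0]
print(per_shark.dist_total.agg(["min", "mean", "max"]))
```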

#### Intermediate questions

 1. Which shark had the most pings?
 2. Which shark has been pinging the longest, and how long has that been?
 3. Which shark species has the most individual sharks tagged?
 4. What is the average length and weight of each shark species?
 5. Which shark has the biggest geographic box (largest distance from min lat/lon to max lat/lon, not dist_traveled)?

##### Most pings
Which shark had the most pings?

In [19]:
groups = df.groupby('id')
sizes = groups.size()
names = groups.name.first()
species = groups.species.first()
first_ping = groups.datetime.min()
last_ping = groups.datetime.max()
combined = pd.concat([sizes, names, species, first_ping, last_ping], axis=1).reset_index()
combined.columns = ['id', 'ping_count', 'name', 'species', 'first_ping', 'last_ping']
combined.sort_values('ping_count', ascending=False).head()

Unnamed: 0,id,ping_count,name,species,first_ping,last_ping
35,41,3240,Mary Lee,White Shark (Carcharodon carcharias),2012-09-18 18:34:28+09:00,2017-06-17 19:54:32+09:00
36,56,2946,Lydia,White Shark (Carcharodon carcharias),2013-03-03 17:03:13+09:00,2017-03-15 11:31:34+09:00
154,202,2366,Oscar,Mako Shark (Isurus oxyrinchus),2016-07-09 09:14:38+09:00,2019-01-30 05:32:35+09:00
40,60,2134,April,Mako Shark (Isurus oxyrinchus),2013-07-29 02:00:04+09:00,2014-06-17 20:17:03+09:00
26,32,1851,Lisha,White Shark (Carcharodon carcharias),2012-05-15 00:43:21+09:00,2014-04-03 21:48:57+09:00
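
The same per-shark summary can also be written with named aggregation, which avoids the `concat` and manual column renaming. This is an equivalent sketch on toy data, not a change to the result above.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [41, 41, 56],
    "name": ["Mary Lee", "Mary Lee", "Lydia"],
    "datetime": pd.to_datetime(["2012-09-18", "2017-06-17", "2013-03-03"]),
})

# Each keyword is an output column: (source column, aggregation).
summary = toy.groupby("id").agg(
    ping_count=("name", "size"),
    name=("name", "first"),
    first_ping=("datetime", "min"),
    last_ping=("datetime", "max"),
).reset_index()
print(summary.sort_values("ping_count", ascending=False))
```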


##### Longest duration pinger
Which shark has been pinging the longest, and how long has that been?

In [20]:
combined['duration'] = combined.last_ping - combined.first_ping
combined.sort_values('duration', ascending=False).head()

Unnamed: 0,id,ping_count,name,species,first_ping,last_ping,duration
45,65,1816,Katharine,White Shark (Carcharodon carcharias),2013-08-21 13:42:26+09:00,2019-01-15 08:49:00+09:00,1972 days 19:06:34
2,5,204,Helen,White Shark (Carcharodon carcharias),2012-03-11 00:15:10+09:00,2017-01-05 14:22:39+09:00,1761 days 14:07:29
35,41,3240,Mary Lee,White Shark (Carcharodon carcharias),2012-09-18 18:34:28+09:00,2017-06-17 19:54:32+09:00,1733 days 01:20:04
36,56,2946,Lydia,White Shark (Carcharodon carcharias),2013-03-03 17:03:13+09:00,2017-03-15 11:31:34+09:00,1472 days 18:28:21
19,25,1578,Cyndi,White Shark (Carcharodon carcharias),2012-04-15 00:50:25+09:00,2015-09-22 00:00:43+09:00,1254 days 23:10:18


##### Individual count by species
Which shark species has the most individual sharks tagged?

In [21]:
df.groupby('species').id.nunique().sort_values(ascending=False).head()

species
Tiger Shark  (Galeocerdo cuvier)        82
White Shark (Carcharodon carcharias)    74
Blue Shark (Prionace glauca)            27
Mako Shark (Isurus oxyrinchus)          18
Hammerhead Shark (Sphyrna)              18
Name: id, dtype: int64

##### Average length/weight by species
What is the average length and weight of each shark species?

In [70]:
groups = df.groupby('species').agg({'weight' : 'mean', 
                                    'length' : 'mean', 
                                    'id' : 'nunique'})
groups.rename(columns={'id' : 'shark_count'}).sort_values('shark_count', ascending=False)

species,weight,length,shark_count
Tiger Shark (Galeocerdo cuvier),468.163203,119.183113,82
White Shark (Carcharodon carcharias),1554.997406,147.128146,74
Blue Shark (Prionace glauca),243.634091,106.028852,27
Hammerhead Shark (Sphyrna),126.554715,93.813373,18
Mako Shark (Isurus oxyrinchus),240.818318,82.472855,18
Blacktip Shark (Carcharhinus limbatus),138.37891,80.316209,9
Bull Shark (Carcharhinus leucas),290.4,89.781022,4
Silky Shark (Carcharhinus falciformis),132.881671,76.965197,4
Whale Shark (Rhincodon Typus),25000.0,327.906977,3
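
These means are taken over ping rows, so a shark with many pings counts more than one with few. If an unweighted per-shark average is wanted instead, one approach is to reduce to one row per `id` before grouping. The toy numbers below are made up to show how the two can differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "species": ["Whale Shark"] * 4,
    "length": [300.0, 300.0, 300.0, 360.0],
})

# Mean over ping rows: shark 1 is counted three times.
ping_weighted = toy.groupby("species").length.mean()

# Mean over sharks: each shark is counted once.
per_shark = toy.drop_duplicates("id").groupby("species").length.mean()
print(ping_weighted, per_shark, sep="\n")
```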


##### Biggest geographic box
Which shark has the biggest geographic box (largest area calculated from min lat/lon to max lat/lon, not dist_traveled)?

In [23]:
groups = df.groupby('id')
combined = pd.concat([groups.latitude.min(), 
                      groups.longitude.min(), 
                      groups.latitude.max(), 
                      groups.longitude.max(), 
                      groups.name.first(), 
                      groups.species.first()], axis=1).reset_index()
combined.columns = ['id', 'min_lat', 'min_lon', 'max_lat', 'max_lon', 'name', 'species']
combined.head()

Unnamed: 0,id,min_lat,min_lon,max_lat,max_lon,name,species
0,3,-34.88268,19.37459,-34.05394,22.64236,Oprah,White Shark (Carcharodon carcharias)
1,4,-36.703,20.535038,-34.063,22.74626,Albertina,White Shark (Carcharodon carcharias)
2,5,-37.23623,18.53635,-19.50057,37.84922,Helen,White Shark (Carcharodon carcharias)
3,6,-34.986,19.06158,-24.77363,34.84301,Brenda,White Shark (Carcharodon carcharias)
4,7,-35.461,17.91681,-32.743,27.97646,Madiba,White Shark (Carcharodon carcharias)


In [24]:
combined['lat_diff'] = combined.max_lat - combined.min_lat
combined['lon_diff'] = combined.max_lon - combined.min_lon
combined['area'] = combined['lat_diff'] * combined['lon_diff']
combined.sort_values('area', ascending=False).head()

Unnamed: 0,id,min_lat,min_lon,max_lat,max_lon,name,species,lat_diff,lon_diff,area
29,35,-41.37174,18.515,-6.15888,71.0983,Kathryn,White Shark (Carcharodon carcharias),35.21286,52.5833,1851.608381
36,56,23.53902,-81.3818,53.65843,-27.48272,Lydia,White Shark (Carcharodon carcharias),30.11941,53.89908,1623.408489
24,30,-43.21756,8.06196,-19.11709,66.72966,Vindication,White Shark (Carcharodon carcharias),24.10047,58.6677,1413.919144
19,25,-45.61157,18.23305,-14.95129,61.87323,Cyndi,White Shark (Carcharodon carcharias),30.66028,43.64018,1338.020138
30,36,-38.82461,17.47565,-10.52038,62.65514,Success,White Shark (Carcharodon carcharias),28.30423,45.17949,1278.770676
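
The `area` above is in square degrees, which shrinks in real terms toward the poles. If the question is read as a distance across the box, a hedged alternative is the great-circle length of the box diagonal. The helper `haversine_km` below is a standard haversine sketch (spherical Earth, mean radius 6371 km assumed), not part of the original analysis.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(a))

# Diagonal of Oprah's box, using the min/max corners from the table above.
diagonal = haversine_km(-34.88268, 19.37459, -34.05394, 22.64236)
print(round(diagonal, 1))
```

Applied row by row to `combined` (for example with `combined.apply(..., axis=1)`), this would give a ranking in kilometres that may differ from the square-degree ranking.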


#### Hard questions



##### Plot Emma on folium
Plot the ping locations for the shark named Emma as a `PolyLine` in folium. Include the first and last ping locations as markers.

In [129]:
emma = df.query("name == 'Emma'").sort_values('datetime')
emma.head()

Unnamed: 0,datetime,active,id,latitude,longitude,name,gender,species,weight,length,tagDate,dist_total
34075,2014-02-01 07:10:18+09:00,1,102,-0.466747,-90.30005,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34074,2014-02-01 07:51:31+09:00,1,102,-0.41101,-90.32783,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34073,2014-02-01 08:49:34+09:00,1,102,-0.47808,-90.36889,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34072,2014-02-01 09:25:07+09:00,1,102,-0.24096,-89.920682,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906
34071,2014-02-01 17:31:34+09:00,1,102,-0.42935,-89.64942,Emma,Female,Tiger Shark (Galeocerdo cuvier),,99.0,20 January 2014,4368.906


In [136]:
latlons = list(zip(emma.latitude.values, emma.longitude.values))
latlons[:5]

[(-0.4667466227579547, -90.30004977783199),
 (-0.41101000000000004, -90.32783),
 (-0.47808, -90.36889000000001),
 (-0.24095957143994035, -89.92068179687499),
 (-0.42935, -89.64941999999999)]

In [139]:
import folium

avglat = emma.latitude.mean()
avglon = emma.longitude.mean()

m = folium.Map(location=(avglat, avglon), zoom_start=5)

folium.Marker(latlons[0], popup="First ping").add_to(m)
folium.Marker(latlons[-1], popup="Last ping").add_to(m)

folium.PolyLine(latlons).add_to(m)
    
m

##### Resample pings by day
Resample the Emma locations on a per-day basis and interpolate missing locations. Then plot the daily markers in folium along with a `PolyLine`.

In [145]:
resampled = emma.set_index('datetime').resample('1D')
resampled

DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]

In [153]:
averages = resampled[['latitude', 'longitude']].mean()
averages.tail(10)

datetime,latitude,longitude
2014-08-20 00:00:00+09:00,,
2014-08-21 00:00:00+09:00,-0.200443,-80.537042
2014-08-22 00:00:00+09:00,,
2014-08-23 00:00:00+09:00,,
2014-08-24 00:00:00+09:00,-0.331778,-80.571889
2014-08-25 00:00:00+09:00,0.18386,-80.49963
2014-08-26 00:00:00+09:00,0.27332,-80.9063
2014-08-27 00:00:00+09:00,1.35886,-81.55316
2014-08-28 00:00:00+09:00,,
2014-08-29 00:00:00+09:00,2.10503,-80.81086


In [154]:
averages = averages.interpolate()  # linear fill for days without pings
averages.tail(10)

datetime,latitude,longitude
2014-08-20 00:00:00+09:00,0.110734,-80.304884
2014-08-21 00:00:00+09:00,-0.200443,-80.537042
2014-08-22 00:00:00+09:00,-0.244221,-80.548658
2014-08-23 00:00:00+09:00,-0.288,-80.560273
2014-08-24 00:00:00+09:00,-0.331778,-80.571889
2014-08-25 00:00:00+09:00,0.18386,-80.49963
2014-08-26 00:00:00+09:00,0.27332,-80.9063
2014-08-27 00:00:00+09:00,1.35886,-81.55316
2014-08-28 00:00:00+09:00,1.731945,-81.18201
2014-08-29 00:00:00+09:00,2.10503,-80.81086
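
To see what the resample-then-interpolate step does in isolation, here is a small sketch on two made-up pings three days apart. With a regular daily index, the default linear interpolation spaces the missing days evenly between the known points.

```python
import pandas as pd

# Two toy pings on different days (illustrative values only).
idx = pd.to_datetime(["2014-08-21 03:00", "2014-08-24 12:00"])
toy = pd.DataFrame({"latitude": [0.0, 3.0], "longitude": [-80.0, -83.0]}, index=idx)

# Daily bins: days without pings become NaN, then get filled linearly.
daily = toy.resample("1D").mean().interpolate()
print(daily)
```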


In [155]:
latlons = list(zip(averages.latitude.values, averages.longitude.values))
latlons[:5]

[(-0.42216068824421055, -90.121435730523),
 (-0.4876914252146526, -90.29437802124022),
 (-0.4548028346632858, -90.3243004268392),
 (-0.4391161607804883, -90.28637246972656),
 (-0.4616266666666667, -90.336575)]

In [157]:
import folium

avglat = averages.latitude.mean()
avglon = averages.longitude.mean()

m = folium.Map(location=(avglat, avglon), zoom_start=5)

for loc in latlons:
    folium.Marker(loc).add_to(m)

folium.PolyLine(latlons).add_to(m)
    
m