### Challenge questions

Easy questions:

 1. How many total pings are in the Ocearch shark data?
 2. How many unique species of sharks are in the data set?
 3. What is the name, weight, and species of the heaviest shark(s)?
 4. When and where was the very first ping?
 5. Excluding results with 0 distance traveled: what are the minimum, average, and maximum travel distances?
 
Intermediate questions:

 1. Which shark had the most pings?
 2. Which shark has been pinging the longest, and how long has that been?
 3. Which shark species has the most individual sharks tagged?
 4. What is the average length and weight of each shark species?
 5. Which shark has the biggest geographic box (largest area calculated from min lat/lon to max lat/lon, not dist_traveled)?
 
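Hard questions:

 1. Use folium to plot the first ping, the last ping, and a line connecting each ping for the Tiger shark Emma. Make the first ping marker a 'play' icon and the last ping marker a 'stop' icon.
 2. Resample the Emma data to a daily lat/lon average and interpolate missing results. Plot a marker for each day, colored blue for hard data and green for interpolated lat/lons.
 3. Resample all shark data to daily lat/lon averages and interpolate missing results.
 4. Calculate the distance between Emma and the other sharks on a daily basis.
 5. Identify the shark that has the shortest average distance to Emma per day.
 6. Plot Emma and her closest buddy: interpolated results for each in green, Emma as circle icons and her buddy as square icons.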

### Load data

#### Query Ocearch API

In [None]:
import requests
url = 'http://www.ocearch.org/tracker/ajax/filter-sharks'

resp = requests.get(url)
resp

##### Turn json into dataframe

In [None]:
import pandas as pd
df = pd.DataFrame(resp.json())
columns = ['id', 'name', 'gender', 'species', 'weight', 'length', 'tagDate', 'dist_total']
df[columns].head()

In [None]:
df.shape

##### Filter out non-shark data

In [None]:
df.species.value_counts()

In [None]:
df = df[df.species.fillna('').str.contains('shark', case=False)]
df.shape

##### Extract ping data

In [None]:
ping_frames = []
for row in df.itertuples():
    ping_frame = pd.DataFrame(row.pings)
    ping_frame['id'] = row.id
    ping_frames.append(ping_frame)
    
len(ping_frames)
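As an aside, the loop above can also be expressed with `pd.json_normalize`, which flattens a nested list and carries selected parent fields along. The sketch below runs on a tiny made-up payload (the shark names and coordinates are invented, and the field names only mirror the ones used in this notebook), so treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Made-up sample shaped like the API response: one dict per shark,
# each carrying a nested list of ping dicts.
sample_sharks = [
    {"id": 1, "name": "Alpha", "pings": [
        {"latitude": "10.0", "longitude": "-70.0", "tz_datetime": "2015-01-01 00:00:00"},
        {"latitude": "11.0", "longitude": "-71.0", "tz_datetime": "2015-01-02 00:00:00"},
    ]},
    {"id": 2, "name": "Beta", "pings": [
        {"latitude": "20.0", "longitude": "-60.0", "tz_datetime": "2015-01-03 00:00:00"},
    ]},
]

# record_path selects the nested list to expand into rows;
# meta lists the parent fields to repeat on every expanded row.
flat_pings = pd.json_normalize(sample_sharks, record_path="pings", meta=["id", "name"])
print(flat_pings)
```

Each ping becomes one row tagged with its shark's `id`, which is what the `itertuples` loop builds by hand.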

##### Merge shark/ping data

In [None]:
pings = pd.concat(ping_frames)
pings.shape

In [None]:
joined = pings.merge(df[columns], on='id')
joined.shape

In [None]:
joined.head()

##### Clean data

In [None]:
df = joined # don't need a reference to the original resp.json() df anymore
df.shape

In [None]:
def clean_weight(value):
    if not value:
        return value
    # most values are like "123 lb"
    value = str(value)
    for character in 'lbs,+':
        value = value.replace(character, '')
    return float(value)

def clean_length(value):
    if not value:
        return value
    # most length values are like '3 ft 4 in.'
    value = str(value)
    total = 0
    if 'ft' in value:
        ft, inches = value.split('ft')
        total += int(ft.strip()) * 12
    else:
        inches = value
    if inches.strip():
        total += float(inches.strip().split()[0])
    return total

df['weight'] = df.weight.apply(clean_weight)
df['length'] = df.length.apply(clean_length)
df['datetime'] = pd.to_datetime(df.tz_datetime)

numeric_cols = ['latitude', 'longitude', 'dist_total', 'weight', 'length']
# convert column by column; axis=1 would call pd.to_numeric once per row, which is much slower
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
df = df.drop(columns=['tz_datetime'])
df.head()
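For comparison, here is a regex-based sketch of the length conversion. The function name `length_to_inches` is hypothetical, and the sample strings are assumptions based on the `'3 ft 4 in.'` format noted in the comment above. Unlike `clean_length`, it accepts fractional feet but returns 0.0 for a bare number with no unit.

```python
import re

def length_to_inches(text):
    """Convert strings such as '3 ft 4 in.' to total inches.

    Returns None for empty input. Hypothetical helper for illustration only.
    """
    if not text:
        return None
    # look for a number followed by 'ft' and a number followed by 'in'
    feet = re.search(r"(\d+(?:\.\d+)?)\s*ft", text)
    inches = re.search(r"(\d+(?:\.\d+)?)\s*in", text)
    total = 0.0
    if feet:
        total += float(feet.group(1)) * 12
    if inches:
        total += float(inches.group(1))
    return total

for sample in ["3 ft 4 in.", "10 ft", "8 in.", ""]:
    print(repr(sample), "->", length_to_inches(sample))
```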

### Explore data

In [None]:
# are all the names unique?
counts = df.groupby('name').id.nunique()
counts.head()

In [None]:
counts[counts > 1]

### Challenge Questions

#### Easy Questions
 1. How many total pings are in the Ocearch shark data?
 2. How many unique species of sharks are in the data set?
 3. What is the name, weight, and species of the heaviest shark(s)?
 4. When and where was the very first ping?
 5. Excluding results with 0 distance traveled: what are the minimum, average, and maximum travel distances?

##### Total pings
How many total pings are in the Ocearch shark data?

In [None]:
len(df)

##### Unique species
How many unique species of sharks are in the data set?

In [None]:
df.species.nunique()

##### Heaviest shark(s)
What is the name, weight, and species of the heaviest shark(s)?

In [None]:
max_weight = df.weight.max()
max_weight

In [None]:
heavy_sharks = df[df.weight == max_weight]
heavy_sharks.drop_duplicates('name')

##### First ping
When and where was the very first ping?

In [None]:
first_pings = df.sort_values('datetime').head(5)
first_pings[['name', 'datetime', 'latitude', 'longitude']]

##### Distance traveled
Excluding results with 0 distance traveled: what are the minimum, average, and maximum travel distances?

In [None]:
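# dist_total is a per-shark total repeated on every ping row, so keep one row per shark
df.drop_duplicates('id').dist_total.loc[lambda s: s > 0].describe()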

#### Intermediate questions

 1. Which shark had the most pings?
 2. Which shark has been pinging the longest, and how long has that been?
 3. Which shark species has the most individual sharks tagged?
 4. What is the average length and weight of each shark species?
 5. Which shark has the biggest geographic box (largest area calculated from min lat/lon to max lat/lon, not dist_traveled)?

##### Most pings
Which shark had the most pings?

In [None]:
groups = df.groupby('id')
sizes = groups.size()
names = groups.name.first()
species = groups.species.first()
first_ping = groups.datetime.min()
last_ping = groups.datetime.max()
combined = pd.concat([sizes, names, species, first_ping, last_ping], axis=1).reset_index()
combined.columns = ['id', 'ping_count', 'name', 'species', 'first_ping', 'last_ping']
combined.sort_values('ping_count', ascending=False).head()

##### Longest duration pinger
Which shark has been pinging the longest, and how long has that been?

In [None]:
combined['duration'] = combined.last_ping - combined.first_ping
combined.sort_values('duration', ascending=False).head()
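The per-shark summary built above with `pd.concat` can also be written with named aggregation, which names the output columns in the same call. The sketch below uses a small invented frame (the sharks and timestamps are made up) and uses `count`, which skips missing timestamps, so it is only an approximation of `groups.size()`.

```python
import pandas as pd

# Invented ping-level data with the same column names used in this notebook.
toy_pings = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "name": ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "species": ["Tiger Shark"] * 3 + ["White Shark"] * 2,
    "datetime": pd.to_datetime([
        "2015-01-01", "2015-01-02", "2015-01-03",
        "2015-02-01", "2015-02-05",
    ]),
})

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    toy_pings.groupby("id")
    .agg(
        ping_count=("datetime", "count"),
        name=("name", "first"),
        species=("species", "first"),
        first_ping=("datetime", "min"),
        last_ping=("datetime", "max"),
    )
    .reset_index()
)
summary["duration"] = summary["last_ping"] - summary["first_ping"]
print(summary.sort_values("ping_count", ascending=False))
```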

##### Individual count by species
Which shark species has the most individual sharks tagged?

In [None]:
df.groupby('species').id.nunique().sort_values(ascending=False).head()

##### Average length/weight by species
What is the average length and weight of each shark species?

In [None]:
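# keep one row per shark so sharks with many pings don't dominate the averages
sharks = df.drop_duplicates('id')
groups = sharks.groupby('species').agg({'weight' : 'mean', 
                                        'length' : 'mean', 
                                        'id' : 'nunique'})
groups.rename(columns={'id' : 'shark_count'}).sort_values('shark_count', ascending=False)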

##### Biggest geographic box
Which shark has the biggest geographic box (largest area calculated from min lat/lon to max lat/lon, not dist_traveled)?

In [None]:
groups = df.groupby('id')
combined = pd.concat([groups.latitude.min(), 
                      groups.longitude.min(), 
                      groups.latitude.max(), 
                      groups.longitude.max(), 
                      groups.name.first(), 
                      groups.species.first()], axis=1).reset_index()
combined.columns = ['id', 'min_lat', 'min_lon', 'max_lat', 'max_lon', 'name', 'species']
combined.head()

In [None]:
combined['lat_diff'] = combined.max_lat - combined.min_lat
combined['lon_diff'] = combined.max_lon - combined.min_lon
combined['area'] = combined['lat_diff'] * combined['lon_diff']
combined.sort_values('area', ascending=False).head()
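The `area` above is in square degrees, which overstates boxes far from the equator because a degree of longitude shrinks with latitude. As a rough alternative, the sketch below computes the surface area of a lat/lon box on a sphere. The function name `box_area_km2` is hypothetical, and it assumes a spherical Earth of radius 6371 km and a box that does not cross the antimeridian.

```python
import math

def box_area_km2(min_lat, min_lon, max_lat, max_lon, radius_km=6371.0):
    """Area of a latitude/longitude box on a sphere, in square kilometres.

    Uses A = R^2 * delta_lon * (sin(max_lat) - sin(min_lat)), with
    delta_lon in radians. Assumes the box does not wrap across +/-180 degrees.
    """
    delta_lon = math.radians(max_lon - min_lon)
    band = math.sin(math.radians(max_lat)) - math.sin(math.radians(min_lat))
    return radius_km ** 2 * delta_lon * band

# The same 10 x 10 degree box covers less area at high latitude.
print(box_area_km2(0, 0, 10, 10))
print(box_area_km2(60, 0, 70, 10))
```

With the notebook's frame this could be applied row by row, for example with `combined.apply(lambda r: box_area_km2(r.min_lat, r.min_lon, r.max_lat, r.max_lon), axis=1)`.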

#### Hard questions

 1. Use folium to plot the first ping, the last ping, and a line connecting each ping for the Tiger shark Emma. Make the first ping marker a 'play' icon and the last ping marker a 'stop' icon.
 2. Resample the Emma data to a daily lat/lon average and interpolate missing results. Plot a marker for each day, colored blue for hard data and green for interpolated lat/lons.
 3. Resample all shark data to daily lat/lon averages and interpolate missing results.
 4. Calculate the distance between Emma and the other sharks on a daily basis.
 5. Identify the shark that has the shortest average distance to Emma per day.
 6. Plot Emma and her closest buddy: interpolated results for each in green, Emma as circle icons and her buddy as square icons.

##### Plot Emma locations
Plot the ping locations for the shark named Emma as a `PolyLine` in folium.  Include the first and last ping location as markers.

In [None]:
emma = df.query("name == 'Emma'").sort_values('datetime')
emma.head()

In [None]:
latlons = list(zip(emma.latitude.values, emma.longitude.values))
latlons[:5]

In [None]:
import folium

avglat = emma.latitude.mean()
avglon = emma.longitude.mean()

m = folium.Map(location=(avglat, avglon), zoom_start=5)

folium.Marker(latlons[0], popup="First ping", icon=folium.Icon(icon="play")).add_to(m)
folium.Marker(latlons[-1], popup="Last ping", icon=folium.Icon(icon="stop")).add_to(m)

folium.PolyLine(latlons).add_to(m)
    
m

##### Plot interpolated locs
Resample the Emma locations on a per-day basis and interpolate missing locations. Then, plot the daily markers in folium along with a `PolyLine`.

In [None]:
resampler = emma.set_index('datetime').resample('1D')
resampler

In [None]:
averages = resampler[['latitude', 'longitude']].mean()
averages.tail()

In [None]:
averages['interpolated'] = averages.latitude.isnull()
averages.tail()

In [None]:
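# interpolate only the coordinate columns; the boolean flag column is left untouched
averages[['latitude', 'longitude']] = averages[['latitude', 'longitude']].interpolate()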
averages.tail()
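To make the resample, flag, and interpolate sequence easier to follow, here is the same idea on three invented pings spread over four days. The values are made up so the result can be checked by hand.

```python
import pandas as pd

# Invented pings: two on Jan 1, none on Jan 2-3, one on Jan 4.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2015-01-01 06:00", "2015-01-01 18:00", "2015-01-04 12:00"]),
    "latitude": [10.0, 12.0, 17.0],
    "longitude": [-70.0, -72.0, -77.0],
})

# Daily mean position; days without pings come back as NaN.
daily = toy.set_index("datetime").resample("1D")[["latitude", "longitude"]].mean()

# Flag the empty days BEFORE filling them, otherwise the information is lost.
daily["interpolated"] = daily["latitude"].isnull()

# Linear interpolation between the surrounding known days.
daily[["latitude", "longitude"]] = daily[["latitude", "longitude"]].interpolate()
print(daily)
```

Jan 1 averages to (11, -71), Jan 4 stays at (17, -77), and the two flagged days in between are filled in evenly along the line connecting them.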

In [None]:
avglat = averages.latitude.mean()
avglon = averages.longitude.mean()

m = folium.Map(location=(avglat, avglon), zoom_start=5)

for row in averages.itertuples():
    if row.interpolated:
        color = 'green'
    else:
        color = 'blue'
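    folium.Marker([row.latitude, row.longitude], icon=folium.Icon(color=color)).add_to(m)

# connect the daily positions with a line, as the task description asks
folium.PolyLine(list(zip(averages.latitude, averages.longitude))).add_to(m)

m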

##### Resample all shark data
Resample all shark data to daily lat/lon averages and interpolate missing results.

In [None]:
data = []

groups = df.groupby('name')
for label, frame in groups:
    resampler = frame.set_index('datetime').resample('1D')
    averages = resampler[['latitude', 'longitude']].mean()
    averages['interpolated'] = averages.latitude.isnull()
    averages[['latitude', 'longitude']] = averages[['latitude', 'longitude']].interpolate()
    averages['name'] = label
    data.append(averages)
    
resampled = pd.concat(data)
resampled.shape

In [None]:
resampled = resampled.set_index('name', append=True).sort_values('datetime')
resampled.head(20)

##### Distance to Emma
Calculate the distance between Emma and the other sharks on a daily basis.

In [None]:
emma = resampled.query('name == "Emma"')
emma.head()

In [None]:
emma_locs = emma.reset_index(level='name')
emma_locs.head()

In [None]:
emma_locs = emma_locs.drop(columns=['name', 'interpolated'])\
                     .rename(columns={'latitude' : 'emma_lat', 'longitude' : 'emma_lon'})
emma_locs.head()

In [None]:
joined = resampled.reset_index(level='name').join(emma_locs)
joined.head()

In [None]:
joined[joined.emma_lat.notnull()].head()

In [None]:
# approximate distance in degrees, treating lat/lon as flat x/y like a triangle hypotenuse: a**2 + b**2 = c**2
joined['diff_lat'] = joined.latitude - joined.emma_lat
joined['diff_lon'] = joined.longitude - joined.emma_lon
joined['distance'] = (joined.diff_lat**2 + joined.diff_lon**2) ** .5
joined.head()

In [None]:
joined[joined.distance.notnull()].head()
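The hypotenuse calculation gives a distance in degrees, which is fine for ranking nearby sharks but is not a physical distance. If kilometres are wanted, one common option is the haversine great-circle formula. The sketch below is a standalone version; the function name `haversine_km` is hypothetical and it assumes a spherical Earth of radius 6371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points (degrees), in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude is shorter at 60 degrees north than at the equator.
print(haversine_km(0, 0, 0, 1))
print(haversine_km(60, 0, 60, 1))
```

Applied to the frame above, it could be called per row with the `latitude`, `longitude`, `emma_lat`, and `emma_lon` columns after dropping rows where Emma's position is missing.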

##### Emma's buddy
Identify the shark that has the shortest average distance to Emma per day

In [None]:
joined[joined.distance.notnull()].groupby('name').distance.mean().sort_values().head(10)

In [None]:
joined[joined.name.str.contains('Sherril')]

In [None]:
joined[joined.distance.notnull()]

In [None]:
joined