# Species Distribution Heatmaps

Modeling the geographical distribution of species is important in conservation and research. In this notebook I'll use the ***species distributions*** dataset from sklearn to map the distributions of two species in South America.

## The Dataset

This dataset was originally created by Phillips et al. (2006). Their original paper can be found [here](http://rob.schapire.net/papers/ecolmod.pdf).

There are two species in this dataset:

### Bradypus variegatus (*the Brown-throated Sloth*)

<img src="images/Bradypus_variegatus.jpg" alt="drawing" style="max-height:300px;"/>

### Microryzomys minutus (*the Forest Small Rice Rat*)

<img src="images/Microryzomys_minutus.jpg" alt="drawing" style="max-height:300px;"/>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_species_distributions
import re
import folium
from folium.plugins import HeatMap

## Fetching Data

In [2]:
data = fetch_species_distributions()

This function loads the data and returns a dictionary-like sklearn **Bunch** object.

In [3]:
type(data)

sklearn.utils.Bunch
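To make "dictionary-like" concrete, here is a minimal sketch with a hand-built `Bunch`. The name `toy` and its values are placeholders for illustration and are not part of the real dataset.

```python
from sklearn.utils import Bunch

# A tiny stand-in for the real dataset object; the values are placeholders.
toy = Bunch(train=[1, 2, 3], test=[4, 5])

# Dictionary-style and attribute-style access return the same object.
print(toy["train"] is toy.train)

# Like a dict, a Bunch exposes its field names through keys().
print(sorted(toy.keys()))
```

This is why the cells below can write `data.train` and `data.test` instead of `data['train']` and `data['test']`.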

We can convert that Bunch to a Pandas dataframe for easier analysis. The dataset contains a separate training set and test set. For the purpose of this notebook we can concatenate the two.

In [4]:
df_train = pd.DataFrame(data.train)
df_test = pd.DataFrame(data.test)
df = pd.concat([df_train, df_test])
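One detail worth knowing: `pd.concat` keeps each frame's own row labels, so the combined frame can contain repeated index values. The sketch below shows this on two toy frames (the species labels `a` and `b` and the rounded coordinates are made up for illustration), along with the `ignore_index=True` option that renumbers the rows. The filtering steps in this notebook should work either way.

```python
import pandas as pd

# Two toy frames standing in for the training and test sets.
train = pd.DataFrame({
    "species": ["a", "b"],
    "dd long": [-79.05, -76.03],
    "dd lat": [-6.9, -9.7],
})
test = pd.DataFrame({
    "species": ["a"],
    "dd long": [-73.5],
    "dd lat": [10.92],
})

# Default behaviour: the original row labels are kept, so 0 appears twice.
stacked = pd.concat([train, test])

# With ignore_index=True the rows are relabelled 0..n-1.
renumbered = pd.concat([train, test], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```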

In [5]:
df.shape

(2244, 3)

The dataframe contains 2244 rows and 3 columns. Let's see what those columns are.

In [6]:
df.columns

Index(['species', 'dd long', 'dd lat'], dtype='object')

## Data Preprocessing

In the column index, the longitude column comes before the latitude column. However, folium functions expect a location as a list whose first value is the latitude and whose second is the longitude. To minimize confusion in the later steps, the dataframe columns can be reordered with `reindex`.

In [7]:
df = df.reindex(columns=['species', 'dd lat', 'dd long'])
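As a small illustration of why the order matters, the sketch below reorders a toy frame and extracts `[lat, long]` pairs, the shape later passed to folium. The coordinates are copied from the sample rows in this notebook; the frame name `toy` is just for illustration.

```python
import pandas as pd

# Toy frame in the original column order (longitude before latitude).
toy = pd.DataFrame({
    "species": ["microryzomys_minutus", "bradypus_variegatus"],
    "dd long": [-79.050003, -74.766701],
    "dd lat": [-6.9, -0.2],
})

# Reorder the columns so latitude comes first.
reordered = toy.reindex(columns=["species", "dd lat", "dd long"])

# Each row now yields a [lat, long] pair.
points = reordered[["dd lat", "dd long"]].values.tolist()
print(points)
```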

Now let's see how the data is represented in the dataframe.

In [8]:
df.sample(5)

|      | species | dd lat | dd long |
|------|---------|--------|---------|
| 102  | b'microryzomys_minutus_1' | -6.9 | -79.050003 |
| 290  | b'microryzomys_minutus_4' | -9.7 | -76.033302 |
| 1301 | b'microryzomys_minutus_7' | -0.45 | -77.883301 |
| 521  | b'bradypus_variegatus_2' | -0.2 | -74.766701 |
| 478  | b'microryzomys_minutus_1' | 10.91667 | -73.5 |


It looks like the species labels are stored as byte strings, with a numeric suffix that might be related to the time period when the data was taken. We can convert the labels so that they contain only the names of the two species.

In [9]:
df.species.unique()

array([b'microryzomys_minutus', b'bradypus_variegatus',
       b'bradypus_variegatus_0', b'microryzomys_minutus_0',
       b'bradypus_variegatus_1', b'microryzomys_minutus_1',
       b'bradypus_variegatus_2', b'microryzomys_minutus_2',
       b'bradypus_variegatus_3', b'microryzomys_minutus_3',
       b'bradypus_variegatus_4', b'microryzomys_minutus_4',
       b'bradypus_variegatus_5', b'microryzomys_minutus_5',
       b'bradypus_variegatus_6', b'microryzomys_minutus_6',
       b'bradypus_variegatus_7', b'microryzomys_minutus_7',
       b'bradypus_variegatus_8', b'microryzomys_minutus_8',
       b'bradypus_variegatus_9', b'microryzomys_minutus_9'], dtype=object)

In [10]:
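# Decode each byte string and strip the "_<digit>" suffix (raw string avoids an invalid-escape warning).
df['species'] = df['species'].apply(lambda x: re.sub(r'_\d', '', x.decode('ascii')))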
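The same cleaning step can be tried on a few labels in isolation. This sketch uses an anchored pattern, `_\d+$`, which strips digits only at the end of the label; that is an assumption about the label format, and `clean_name` is a hypothetical helper, not part of the notebook.

```python
import re

# A few labels copied from the df.species.unique() output above.
raw = [b"microryzomys_minutus", b"bradypus_variegatus_0", b"microryzomys_minutus_9"]


def clean_name(label: bytes) -> str:
    """Decode a byte-string label and strip a trailing '_<digits>' suffix."""
    return re.sub(r"_\d+$", "", label.decode("ascii"))


cleaned = [clean_name(label) for label in raw]
print(cleaned)
```

Labels without a suffix pass through unchanged, so the function is safe to apply to every row.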

After that, the data looks like this:

In [11]:
df.sample(10)

|      | species | dd lat | dd long |
|------|---------|--------|---------|
| 404  | bradypus_variegatus | 9.2833 | -79.599998 |
| 1169 | microryzomys_minutus | 4.433333 | -75.366699 |
| 1125 | bradypus_variegatus | 8.95 | -79.583298 |
| 1088 | bradypus_variegatus | 9.1167 | -74.733299 |
| 235  | bradypus_variegatus | 12.35 | -71.316704 |
| 818  | bradypus_variegatus | 3.6833 | -77.083298 |
| 5    | bradypus_variegatus | -3.45 | -55.283298 |
| 541  | bradypus_variegatus | 9.3 | -82.400002 |
| 589  | bradypus_variegatus | -3.7667 | -73.25 |
| 96   | bradypus_variegatus | 9.15 | -72.599998 |


### Here is a summary of the data

In [12]:
df.species.unique()

array(['microryzomys_minutus', 'bradypus_variegatus'], dtype=object)

In [13]:
df.species.value_counts()

bradypus_variegatus     1276
microryzomys_minutus     968
Name: species, dtype: int64

The dataframe contains location data for the two species. Before we use this data on the map, we need to make sure that the location data makes sense and does not contain any out-of-range values.

In [14]:
df.describe()

|       | dd lat | dd long |
|-------|--------|---------|
| count | 2244.0 | 2244.0 |
| mean  | 1.097714 | -73.237053 |
| std   | 7.953531 | 8.20381 |
| min   | -23.450001 | -85.933296 |
| 25%   | -3.437475 | -78.420851 |
| 50%   | 1.799984 | -75.633297 |
| 75%   | 8.525 | -71.066648 |
| max   | 13.95 | -40.0667 |


The minimum and maximum values are within valid limits: latitudes fall inside [-90, 90] and longitudes inside [-180, 180]. Next, we can look at the area that covers all the location points.
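The range check described above can also be written as code rather than read off the table. A minimal sketch, using only the extremes copied from the `df.describe()` output (the frame name `extremes` is for illustration; on the real data the same `between` calls would be applied to `df`):

```python
import pandas as pd

# Minimum and maximum values copied from the df.describe() output.
extremes = pd.DataFrame({
    "dd lat": [-23.450001, 13.95],
    "dd long": [-85.933296, -40.0667],
})

# Valid latitudes lie in [-90, 90]; valid longitudes lie in [-180, 180].
lat_ok = extremes["dd lat"].between(-90, 90).all()
lon_ok = extremes["dd long"].between(-180, 180).all()
print(lat_ok, lon_ok)
```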

## Coverage Map

In [15]:
max_vals = list(df.max(numeric_only=True).values)
min_vals = list(df.min(numeric_only=True).values)
center_vals = [mn + ((mx-mn)/2) for mx,mn in zip(max_vals, min_vals)]
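As a worked example of the midpoint formula, the sketch below plugs in the extremes from the `df.describe()` output by hand instead of reading them from `df`. The resulting map center is roughly latitude -4.75 and longitude -63.0.

```python
# Extremes copied from the df.describe() output, in [lat, long] order.
min_vals = [-23.450001, -85.933296]
max_vals = [13.95, -40.0667]

# Midpoint of each coordinate: min + (max - min) / 2.
center_vals = [mn + (mx - mn) / 2 for mx, mn in zip(max_vals, min_vals)]
print([round(v, 4) for v in center_vals])
```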

In [16]:
m = folium.Map(location= center_vals, zoom_start=3)
folium.Rectangle(bounds=[min_vals,max_vals],color='#9da0a4', fill=True, fill_color='#3f96ee').add_to(m)
m

All the location data falls in the northern part of South America, which matches the documentation in the paper. Now we are ready to visualize the distribution heatmaps.

## Brown-throated Sloth distribution heatmap

In [17]:
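# Keep only the sloth records as [lat, long] pairs (named heat_data so the Bunch in `data` is not overwritten).
heat_data = df[df['species'] == 'bradypus_variegatus'][['dd lat', 'dd long']].values.tolist()
# 'Stamen Terrain' may be unavailable in recent folium releases; a built-in tileset such as 'OpenStreetMap' is a possible substitute.
m = folium.Map(location=center_vals, tiles='Stamen Terrain', zoom_start=4)
HeatMap(heat_data).add_to(m)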
m

## Forest Small Rice Rat distribution heatmap

In [18]:
# Keep only the rice rat records as [lat, long] pairs.
heat_data = df[df['species'] == 'microryzomys_minutus'][['dd lat', 'dd long']].values.tolist()
m = folium.Map(location=center_vals, tiles='Stamen Terrain', zoom_start=4)
HeatMap(heat_data).add_to(m)
m
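Both heatmap cells repeat the same filtering step, so it could be factored into a small helper. This is only a sketch: `heat_points` is a hypothetical name, and the toy frame uses coordinates copied from the sample output earlier in the notebook rather than the full dataset.

```python
import pandas as pd


def heat_points(frame: pd.DataFrame, species: str) -> list:
    """Return [lat, long] pairs for one species, the shape passed to HeatMap above."""
    subset = frame.loc[frame["species"] == species, ["dd lat", "dd long"]]
    return subset.values.tolist()


# Toy frame with three records taken from the cleaned sample output.
toy = pd.DataFrame({
    "species": ["bradypus_variegatus", "microryzomys_minutus", "bradypus_variegatus"],
    "dd lat": [9.2833, 4.433333, 8.95],
    "dd long": [-79.599998, -75.366699, -79.583298],
})

print(heat_points(toy, "bradypus_variegatus"))
```

With the real dataframe, the returned list could be passed to `HeatMap` in the same way as in the two cells above.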