# Prokaryotic genomic submissions map

Who submits the most, and from where?

## Setup

Import files

`lat_long_loc.tsv` contains every unique submitter center name with its latitude and longitude. The coordinates were obtained by geocoding with Google Maps (need to write up how I did this).

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os


summary = 'prokaryotes.txt'
locs = 'lat_long_loc.tsv'

# read in the initial data sets
data_frame = pd.read_csv(summary, sep='\t', header='infer', low_memory=False)
lat_lon = pd.read_csv(locs, sep='\t')
# add latitude and longitude for each submitter
# (default inner join: rows whose Center has no coordinates are dropped)
lat_lon.columns = ['Center', 'lat', 'lon']
df = pd.merge(data_frame, lat_lon, on='Center')

df.head()

,Center,lat,lon
0,"University of California, Santa Cruz",36.991585,-122.058277
1,Changwon National University,35.245602,128.691895
2,Houston Methodist Hospital,29.711448,-95.399986
3,NRL,38.822204,-77.019723
4,NIH,39.027697,-77.136618
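
Because the merge above keeps only centers that appear in both tables, it can be worth checking which submitters have no coordinates. A minimal sketch on toy data, assuming a hypothetical `Assembly` column and made-up center names; only `Center`, `lat` and `lon` come from the notebook:

```python
import pandas as pd

# toy stand-ins for the submission table and the coordinate table
submissions = pd.DataFrame({
    "Center": ["NIH", "NRL", "Unknown Lab"],
    "Assembly": ["a1", "a2", "a3"],  # hypothetical column, for illustration only
})
coords = pd.DataFrame({
    "Center": ["NIH", "NRL"],
    "lat": [39.027697, 38.822204],
    "lon": [-77.136618, -77.019723],
})

# a left join keeps every submission; indicator=True adds a "_merge" column
# that records whether a matching coordinate row was found
checked = submissions.merge(coords, on="Center", how="left", indicator=True)

# centers present in the submissions but missing from the coordinate table
unmatched = checked.loc[checked["_merge"] == "left_only", "Center"].tolist()
print(unmatched)
```

Any names listed in `unmatched` would be silently dropped by an inner merge.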


In [35]:
# convert the 'Release Date' column to datetime
df['Release Date'] = pd.to_datetime(df['Release Date'])

# new dataframe with fewer columns
# (explicit copy avoids SettingWithCopyWarning on later column assignments)
df2 = df.loc[:, ['Release Date','lat','lon']].copy()
df2.index = df2['Release Date']

df2.head()

Release Date (index),Release Date,lat,lon
2001-11-07,2001-11-07,52.079717,0.185587
2001-09-27,2001-09-27,52.079717,0.185587
2001-09-07,2001-09-07,52.079717,0.185587
2001-10-15,2001-10-15,52.079717,0.185587
2008-06-19,2008-06-19,52.079717,0.185587


To make grouping easier, combine latitude and longitude into a single string key.

In [37]:
df2['latlon'] = "(" + df2['lat'].astype(str) + "," + df2['lon'].astype(str) + ")"

# Count submissions over time by submitter location

Latitude and longitude identify each submitter here because the submitter names are not clean.

Group the dataframe by location (`latlon`) and count the submissions per month for each location. A cumulative sum of those monthly counts then gives a running total of submissions for each location over time.

In [38]:
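# monthly submission counts per location
# (newer pandas versions spell the month-end alias "ME" instead of "M")
df2 = df2.groupby(['latlon', pd.Grouper(freq="M")]).size().to_frame('Counts').reset_index()
df2 = df2.set_index('Release Date')
# running total of submissions per location
df2['Cumsum'] = df2.groupby('latlon')['Counts'].cumsum()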

df2.tail()

Release Date,latlon,Counts,Cumsum
2016-09-30,"(9.9403922,78.01053739999998)",2,19
2017-01-31,"(9.9403922,78.01053739999998)",1,20
2017-04-30,"(9.9403922,78.01053739999998)",2,22
2017-06-30,"(9.9403922,78.01053739999998)",1,23
2018-06-30,"(9.9403922,78.01053739999998)",10,33
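
The monthly count and running total can be sketched on a few toy rows. This variant buckets dates with `dt.to_period("M")` rather than `pd.Grouper`, which is a substitution made here to keep the example independent of the pandas version; the dates and locations are made up:

```python
import pandas as pd

# toy submissions: three from one location, one from another
toy = pd.DataFrame({
    "Release Date": pd.to_datetime(
        ["2001-09-07", "2001-09-27", "2001-11-07", "2001-09-15"]),
    "latlon": ["(52.0,0.1)", "(52.0,0.1)", "(52.0,0.1)", "(9.9,78.0)"],
})

# count submissions per location and calendar month
monthly = (
    toy.assign(month=toy["Release Date"].dt.to_period("M"))
       .groupby(["latlon", "month"])
       .size()
       .to_frame("Counts")
       .reset_index()
)

# running total within each location, in month order
monthly["Cumsum"] = monthly.groupby("latlon")["Counts"].cumsum()
print(monthly)
```

Months with no submissions produce no row, so the running total only changes in months where a location actually submitted something.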


# Plotting all submitters on world map

Using a Bokeh interactive map with Web Mercator map tiles.

Latitude/longitude coordinates have to be converted to the projection's coordinates (in meters) before plotting.
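
As a rough illustration of that conversion, the spherical Web Mercator formulas are x = R * lon and y = R * ln(tan(pi/4 + lat/2)), with angles in radians and R the Earth's equatorial radius. A minimal sketch, where the function name `to_web_mercator` is a hypothetical helper and not part of Bokeh:

```python
import math

R_MAJOR = 6378137.0  # WGS84 equatorial radius in meters


def to_web_mercator(lat, lon):
    """Convert latitude/longitude in degrees to Web Mercator (x, y) in meters."""
    x = R_MAJOR * math.radians(lon)
    y = R_MAJOR * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y


# the equator / prime meridian intersection maps to the origin
x0, y0 = to_web_mercator(0.0, 0.0)

# longitude 180 maps to the right-hand edge of the projected world
x_edge, _ = to_web_mercator(0.0, 180.0)
print(x0, y0, x_edge)
```

The y value grows without bound toward the poles, which is why the plot's `y_range` is limited to a band around the equator.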

In [39]:
from bokeh.plotting import figure, show
from bokeh.tile_providers import CARTODBPOSITRON
from bokeh.io import output_notebook
import math
from ast import literal_eval


# Convert a "(lat,lon)" string to Web Mercator coordinates (meters)
# Adapted from https://towardsdatascience.com/exploring-and-visualizing-chicago-transit-data-using-pandas-and-bokeh-part-ii-intro-to-bokeh-5dca6c5ced10

def merc(Coords):
    lat, lon = literal_eval(Coords)

    r_major = 6378137.000  # WGS84 equatorial radius in meters
    x = r_major * math.radians(lon)
    # y is computed from the radius directly (no division by lon),
    # so a longitude of exactly 0 is handled
    y = r_major * math.log(math.tan(math.pi/4.0 + math.radians(lat)/2.0))
    return (x, y)


# range bounds supplied in web mercator coordinates
p = figure(x_range=(-20000000, 20000000), y_range=(-6000000, 6000000),
           x_axis_type="mercator", y_axis_type="mercator")
p.add_tile(CARTODBPOSITRON)

df2['coords_lon'] = df2['latlon'].apply(lambda x: merc(x)[0])
df2['coords_lat'] = df2['latlon'].apply(lambda x: merc(x)[1])


p.circle(df2['coords_lon'],df2['coords_lat'],alpha = 0.5, size = 1)
output_notebook()
show(p)

# To-do
* Map counts back to original dataframe
* Make datapoints proportional to submission number
* Add some interactivity/mouseover functionality to find out info about datapoints
* Do some fancy viz/animation
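
One possible starting point for the marker-size item is to derive a size column from the latest running total per location. This is only a sketch on toy data: the `size` column, the pixel range and the square-root scaling are all assumptions, not choices made in the notebook.

```python
import numpy as np
import pandas as pd

# toy stand-in for df2: one row per location and month
toy = pd.DataFrame({
    "latlon": ["(52.0,0.1)", "(52.0,0.1)", "(9.9,78.0)"],
    "Cumsum": [10, 100, 25],
})

# latest running total per location
totals = toy.groupby("latlon", as_index=False)["Cumsum"].max()

# square-root scaling so marker area grows roughly with submission count;
# the 2 to 20 pixel range is an arbitrary choice
min_px, max_px = 2.0, 20.0
totals["size"] = min_px + (max_px - min_px) * np.sqrt(
    totals["Cumsum"] / totals["Cumsum"].max()
)
print(totals)
```

A column like this could then replace the fixed `size = 1` when drawing the points.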
