# Prokaryotic genomic submissions map

Who submits the most, and from where?

## Setup

Import files

'lat_long_loc.tsv' contains every unique submitter center name together with its latitude and longitude. The coordinates were obtained via Google Maps geocoding (the procedure still needs to be written up).

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os


summary = 'prokaryotes.txt'
locs = 'lat_long_loc.tsv'

# read in the initial data sets (both are tab-separated)
data_frame = pd.read_csv(summary, sep='\t', header='infer', low_memory=False)
lat_lon = pd.read_csv(locs, sep='\t')

# add latitude and longitude for each submitter;
# pd.merge defaults to an inner join, so rows without a matching 'Center' are dropped
lat_lon.columns = ['Center', 'lat', 'lon']
df = pd.merge(data_frame, lat_lon, on='Center')

df.head()

,#Organism/Name,TaxID,BioProject Accession,BioProject ID,Group,SubGroup,Size (Mb),GC%,Replicons,WGS,...,Status,Center,BioSample Accession,Assembly Accession,Reference,FTP Path,Pubmed ID,Strain,lat,lon
0,Salmonella enterica subsp. enterica serovar Ty...,220341,PRJNA236,236,Proteobacteria,Gammaproteobacteria,5.13371,51.8776,chromosome:NC_003198.1/AL513382.1; plasmid pHC...,-,...,Complete Genome,Sanger Institute,SAMEA1705914,GCA_000195995.1,REFR,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...,11677608,CT18,52.079717,0.185587
1,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,192222,PRJNA8,8,Proteobacteria,delta/epsilon subdivisions,1.64148,30.5,chromosome:NC_002163.1/AL111168.1,-,...,Complete Genome,Sanger Institute,SAMEA1705929,GCA_000009085.1,REFR,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...,1068820417565669,NCTC 11168,52.079717,0.185587
2,Mycobacterium tuberculosis H37Rv,83332,PRJNA224,224,Terrabacteria group,Actinobacteria,4.41153,65.6,chromosome:NC_000962.3/AL123456.3,-,...,Complete Genome,Sanger Institute,SAMEA3138326,GCA_000195955.2,REFR,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...,96342301236843020980199,H37Rv,52.079717,0.185587
3,Yersinia pestis CO92,214092,PRJNA34,34,Proteobacteria,Gammaproteobacteria,4.82986,47.6065,chromosome:NC_003143.1/AL590842.1; plasmid pCD...,-,...,Complete Genome,Sanger Institute,SAMEA1705942,GCA_000009065.1,REFR,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...,115863601283453919055764,CO92,52.079717,0.185587
4,Burkholderia cenocepacia J2315,216591,PRJNA339,339,Proteobacteria,Betaproteobacteria,8.05578,66.9165,chromosome 1:NC_011000.1/AM747720.1; chromosom...,-,...,Complete Genome,Sanger Institute,SAMEA1705928,GCA_000009485.1,-,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...,18931103,J2315,52.079717,0.185587
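
Because `pd.merge` performs an inner join by default, any row whose 'Center' value has no match in the coordinates table (a missing or differently spelled name, for example) disappears without warning. The following is a minimal sketch of one way to check for that, run on small made-up frames rather than the real files; the names `genomes` and `coords` and the 'Unknown Lab' entry are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the summary table and the coordinates table
genomes = pd.DataFrame({
    "Center": ["Sanger Institute", "Unknown Lab", "Sanger Institute"],
    "TaxID": [220341, 1, 192222],
})
coords = pd.DataFrame({
    "Center": ["Sanger Institute"],
    "lat": [52.079717],
    "lon": [0.185587],
})

# A left join keeps every genome row; indicator=True adds a '_merge' column
# that records whether a match was found in the coordinates table
checked = genomes.merge(coords, on="Center", how="left", indicator=True)

# Center names that would have been dropped by the default inner join
unmatched = checked.loc[checked["_merge"] == "left_only", "Center"].unique()
print(unmatched)
```

If `unmatched` comes back empty for the real data, the inner join above loses nothing.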


In [81]:
#convert release date column data type to datetime
df['Release Date'] = pd.to_datetime(df['Release Date'])

#new dataframe with fewer columns
df2 = df.loc[:, ['Release Date','lat','lon']]
df2.index = df2['Release Date']

df2.head()

Release Date (index),Release Date,lat,lon
2001-11-07,2001-11-07,52.079717,0.185587
2001-09-27,2001-09-27,52.079717,0.185587
2001-09-07,2001-09-07,52.079717,0.185587
2001-10-15,2001-10-15,52.079717,0.185587
2008-06-19,2008-06-19,52.079717,0.185587


To make grouping easier, combine latitude and longitude into a single string key. The two values are concatenated without a separator.

In [None]:
df2['latlon'] = df2['lat'].astype(str) + df2['lon'].astype(str)
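
One caveat with joining the two strings directly is that different coordinate pairs can, in principle, produce the same key. The sketch below uses two made-up points to show the collision and one possible fix (a comma separator); the `pts`, `naive` and `safe` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Two different made-up locations
pts = pd.DataFrame({"lat": [1.5, 1.51], "lon": [12.5, 2.5]})

# Concatenating without a separator: both rows become "1.512.5"
pts["naive"] = pts["lat"].astype(str) + pts["lon"].astype(str)

# Adding a separator keeps the two locations distinct
pts["safe"] = pts["lat"].astype(str) + "," + pts["lon"].astype(str)

print(pts)
```

With coordinates carrying six or more decimals a collision is unlikely, but a separator also makes the key easy to split back into latitude and longitude later.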

Group the dataframe by location ('latlon') and by month, and count the submissions in each group. A cumulative sum of those monthly counts within each location then gives a running total of submissions per location over time.

In [79]:
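# count submissions per location and calendar month (the index holds the release dates)
df2 = df2.groupby(['latlon', pd.Grouper(freq="M")]).size().to_frame('Counts').reset_index()
df2 = df2.set_index(['Release Date'])
# running total of the monthly counts within each location
df2['Cumsum'] = df2.groupby(['latlon'])['Counts'].cumsum()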

df2.tail()

Release Date (index),latlon,Counts,Cumsum
2016-09-30,9.940392278.0105374,1,15
2017-01-31,9.940392278.0105374,1,16
2017-04-30,9.940392278.0105374,1,17
2017-06-30,9.940392278.0105374,1,18
2018-06-30,9.940392278.0105374,1,19
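
To see what the count-then-cumulate step does, here is a small sketch on made-up data. It is not the notebook's exact code: it groups on the 'lat' and 'lon' columns directly instead of a combined string, and it buckets dates with `dt.to_period("M")` instead of `pd.Grouper`. The `toy` and `monthly` names and all values are hypothetical.

```python
import pandas as pd

# Four made-up submissions from two locations
toy = pd.DataFrame({
    "Release Date": pd.to_datetime(
        ["2001-09-07", "2001-09-27", "2001-11-07", "2001-09-15"]),
    "lat": [52.08, 52.08, 52.08, 9.94],
    "lon": [0.19, 0.19, 0.19, 78.01],
})

# Bucket each release date into its calendar month
toy["month"] = toy["Release Date"].dt.to_period("M")

# Submissions per location per month (groupby sorts by lat, lon, month)
monthly = (toy.groupby(["lat", "lon", "month"])
              .size()
              .rename("Counts")
              .reset_index())

# Running total within each location, in chronological order
monthly["Cumsum"] = monthly.groupby(["lat", "lon"])["Counts"].cumsum()

print(monthly)
```

The first location ends with a running total of 3 (two submissions in September 2001, one in November), while the second has a single submission.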


## To-do
* Map counts back to the original dataframe
* Figure out how to plot location points on a map
* Make data points proportional to the number of submissions
* Do some fancy viz/animation


In [6]:
## junk code
#output_file("tile.html")
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, output_file
from bokeh.tile_providers import CARTODBPOSITRON

# range bounds supplied in web mercator coordinates
p = figure(x_range=(-2000000, 6000000), y_range=(-1000000, 7000000),
           x_axis_type="mercator", y_axis_type="mercator")
p.add_tile(CARTODBPOSITRON)

output_notebook()
show(p)
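
The tile map above expects Web Mercator coordinates in metres, whereas the dataframe stores latitude and longitude in degrees. A minimal sketch of the standard spherical Web Mercator conversion is given below as a possible starting point for plotting the submitter locations; the helper name `to_web_mercator` is hypothetical, and the example point is the Sanger Institute row from the table earlier in the notebook.

```python
import numpy as np

# Radius of the sphere used by Web Mercator (WGS84 semi-major axis, in metres)
EARTH_RADIUS_M = 6378137.0

def to_web_mercator(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to Web Mercator x/y in metres."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    x = EARTH_RADIUS_M * lon
    y = EARTH_RADIUS_M * np.log(np.tan(np.pi / 4 + lat / 2))
    return x, y

# Example: the Sanger Institute coordinates from the merged dataframe
x, y = to_web_mercator([52.079717], [0.185587])
print(x, y)
```

The resulting arrays could then be passed to a glyph method such as `p.scatter(x, y, size=...)`, with `size` derived from the submission counts; how best to scale the sizes is still an open item on the to-do list.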