# Pollen Data Visualisation
The purpose of this notebook is to:
- Load fossilised pollen assemblage and chronology data from the European Pollen Database (EPD)
- Build interactive visualisations to explore the data
- Derive hypotheses as to the causal drivers of trends observed in the data

In [193]:
from sqlalchemy import create_engine
import pandas as pd
import os

Prior work has identified six sites as being of particular interest because they cover most of the Holocene, are spread around Iberia, and in many cases provide evidence of human involvement in changing the landscape.

These sites all have records stored in the EPD, and can be uniquely identified using the `sitecode` field in the `siteloc` table. The site names and corresponding sitecodes are stored in the dictionary `ssites`.

In [19]:
ssites = {'San Rafael' : 'ESP-01000-SANR',
          'Albufera de Alcudia' : 'ESP-04000-ALBU',
          'Navarres' : 'ESP-11000-NAVA',
          'Laguna Guallar' : 'ESP-02000-LAGU',
          'Monte Areo mire' : 'ESP-03000-AREO',
          'Sanabria Marsh' : 'ESP-08000-SANA'}

Create a database engine for the session

In [3]:
engine = create_engine('postgresql://andrew@localhost:5432/epd95')

### Get site location information

Function to return the site location information for a list of site codes

In [41]:
def get_site_loc_info(site_code_list):
    # construct string for PostgreSQL WHERE condition based on the input list
    where_condition = ' OR '.join(["sitecode = '"+x+"'" for x in site_code_list])
    query = "SELECT * FROM siteloc WHERE " + where_condition
    
    # return results in pandas dataframe
    return pd.read_sql_query(query, con=engine)   
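
The function above builds its WHERE clause by string concatenation. As a hedged alternative, the sketch below passes the site codes as query parameters instead. It is not the notebook's method: it uses an in-memory SQLite table as a stand-in for `siteloc`, the rows and the helper name `build_site_query` are made up for illustration, and SQLite's `?` placeholder would need to be swapped for the style the PostgreSQL driver expects (for example `%s` with psycopg2).

```python
import sqlite3


def build_site_query(site_codes):
    """Return (sql, params) selecting siteloc rows for the given site codes.

    Hypothetical helper: one placeholder is generated per site code so the
    values are never spliced into the SQL text itself.
    """
    codes = list(site_codes)
    if not codes:
        raise ValueError("at least one site code is required")
    placeholders = ", ".join("?" for _ in codes)
    sql = "SELECT * FROM siteloc WHERE sitecode IN ({0})".format(placeholders)
    return sql, codes


# Stand-in for the EPD: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE siteloc (site_ INTEGER, sitename TEXT, sitecode TEXT)")
conn.executemany("INSERT INTO siteloc VALUES (?, ?, ?)", [
    (1, "Site A", "AAA-01"),
    (2, "Site B", "BBB-02"),
    (3, "Site C", "CCC-03"),
])

sql, params = build_site_query(["AAA-01", "CCC-03"])
rows = conn.execute(sql + " ORDER BY site_", params).fetchall()
print(rows)
```

With pandas, the same `sql` and `params` pair could be handed to `pd.read_sql_query(sql, con=engine, params=params)`, provided the placeholder style matches the driver.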

In [76]:
site_loc_info = get_site_loc_info(ssites.values())
site_loc_info.set_index('site_',inplace=True)
print(site_loc_info)

                     sitename        sitecode siteexists poldiv1 poldiv2  \
site_                                                                      
44             Sanabria Marsh  ESP-08000-SANA       None     ESP      08   
396                  Navarrés  ESP-11000-NAVA       None     ESP      11   
486                San Rafael  ESP-01000-SANR       None     ESP      01   
759          Albufera Alcudia  ESP-04000-ALBU       None     ESP      04   
761            Laguna Guallar  ESP-02000-LAGU       None     ESP      02   
784    Laguna Salada Chiprana  ESP-02000-LAGU       None     ESP      02   
1252          Monte Areo mire  ESP-03000-AREO       None     ESP      03   

      poldiv3  latdeg  latmin  latsec latns      latdd     latdms  londeg  \
site_                                                                       
44        000      42       6       0     N  42.100000  42.06.00N       6   
396       000      39       6       0     N  39.100000  39.06.00N       0   
486    

Note that site no. 784 (Laguna Salada Chiprana) appears to share its sitecode with our study site no. 761 (Laguna Guallar). Let's drop site 784 to avoid confusion later on.
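Shared sitecodes like this one can also be found with a query rather than by eye. The sketch below is illustrative only: it runs against an in-memory SQLite table holding three of the rows printed above, not against the EPD itself, and the variable names are made up.

```python
import sqlite3

# Stand-in table containing three of the siteloc rows shown in the output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE siteloc (site_ INTEGER, sitename TEXT, sitecode TEXT)")
conn.executemany("INSERT INTO siteloc VALUES (?, ?, ?)", [
    (761, "Laguna Guallar", "ESP-02000-LAGU"),
    (784, "Laguna Salada Chiprana", "ESP-02000-LAGU"),
    (486, "San Rafael", "ESP-01000-SANR"),
])

# Group rows by sitecode and keep only the codes used by more than one site.
duplicate_query = """
    SELECT sitecode, COUNT(*) AS n_sites
    FROM siteloc
    GROUP BY sitecode
    HAVING COUNT(*) > 1
    ORDER BY sitecode
"""
duplicates = conn.execute(duplicate_query).fetchall()
print(duplicates)
```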

In [77]:
site_loc_info.drop(784,inplace=True)
print(site_loc_info)

               sitename        sitecode siteexists poldiv1 poldiv2 poldiv3  \
site_                                                                        
44       Sanabria Marsh  ESP-08000-SANA       None     ESP      08     000   
396            Navarrés  ESP-11000-NAVA       None     ESP      11     000   
486          San Rafael  ESP-01000-SANR       None     ESP      01     000   
759    Albufera Alcudia  ESP-04000-ALBU       None     ESP      04     000   
761      Laguna Guallar  ESP-02000-LAGU       None     ESP      02     000   
1252    Monte Areo mire  ESP-03000-AREO       None     ESP      03     000   

       latdeg  latmin  latsec latns      latdd     latdms  londeg  lonmin  \
site_                                                                       
44         42       6       0     N  42.100000  42.06.00N       6      44   
396        39       6       0     N  39.100000  39.06.00N       0      41   
486        36      46      25     N  36.773611  36.46.25N       2  

### Identify sediment cores associated with each site

The EPD [documentation](http://www.europeanpollendatabase.net/data/downloads/image/pollen-database-manual-20071011.doc) explains that there are potentially multiple 'entities' associated with each site location. An 'entity' simply refers to a sediment core, section or surface sample.

Define a function to return all entities (or cores) associated with a given site number:

In [78]:
def get_entity_info(site_number):
    # query finds all fields from entity table for records 
    # matching given site_number.
    query = "SELECT * FROM entity WHERE site_ = {0}".format(site_number)
             
    # return results in pandas dataframe
    return pd.read_sql_query(query, con=engine)   

Create a dataframe containing entity information for all study sites

In [141]:
ssite_numbers = site_loc_info.index
# query each study site in turn and stack the results into a single dataframe
site_entity_info = pd.concat([get_entity_info(n) for n in ssite_numbers], ignore_index=True)
site_entity_info.set_index('e_',inplace=True)

In [142]:
site_entity_info = site_entity_info.join(site_loc_info, on='site_', how='left')

site_entity_info = site_entity_info.drop(['latns','latdd','latdms','londeg','lonmin',u'poldiv1',
                      u'poldiv2',u'poldiv3',u'latdeg',u'latmin',u'latsec',
                      u'lonsec', u'lonew', u'londd', u'londms', u'elevation', u'areaofsite',
                      u'depthatloc', u'icethickcm', u'sampdevice', u'corediamcm',
                      u'c14depthadj', u'notes','siteexists','localveg','hasanlam','coll_'],axis=1)

In [143]:
print(site_entity_info)

      site_     sigle    name iscore issect isssamp descriptor  \
e_                                                               
44       44  SANABRIA    None      Y      N       Y       PMSH   
469     396     NAVA1  core 1      Y      N       Y       PMIR   
470     396     NAVA2  core 2      Y      N       Y       PMIR   
471     396  NAVARRE3    None      Y      N       Y       UNKN   
574     486   SANRAFA    None      Y      N       Y       MCOA   
891     759   ALCUDIA    None      Y      N       Y       MCOA   
893     761     N-GUA  core 1      Y      N       Y       LNPL   
1562   1252      AREO    None      Y      N       N       PMIR   

                     entloc    sampdate          sitename        sitecode  
e_                                                                         
44                     None  1981-05-00    Sanabria Marsh  ESP-08000-SANA  
469        Center of valley  1993-06-15          Navarrés  ESP-11000-NAVA  
470        Center of valley  1993-0

### Get pollen counts

Function which extracts pollen count data for a given entity (i.e. sediment core) number and joins it to the p_vars table to incorporate species data useful to subsequent analysis.

In [199]:
def get_p_count(entity):
    query="""SELECT p_counts.e_,p_counts.sample_,p_counts.var_,
             p_counts.count,p_vars.varcode,p_vars.varname 
             FROM p_counts
             LEFT JOIN p_vars
             ON p_counts.var_=p_vars.var_
             WHERE p_counts.e_={0};""".format(entity)
    return pd.read_sql_query(query, con=engine)   
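
To make the effect of the `LEFT JOIN` concrete, here is a small sketch on in-memory SQLite stand-ins for `p_counts` and `p_vars`. The first two count rows are taken from the San Rafael sample output; the third row, with `var_` 9999, is hypothetical and has no match in `p_vars`, so it comes back with empty `varcode` and `varname` rather than being dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "count" is quoted because it is also the name of an SQL function.
conn.execute('CREATE TABLE p_counts (e_ INTEGER, sample_ INTEGER, '
             'var_ INTEGER, "count" REAL)')
conn.execute("CREATE TABLE p_vars (var_ INTEGER, varcode TEXT, varname TEXT)")
conn.executemany("INSERT INTO p_counts VALUES (?, ?, ?, ?)", [
    (574, 1, 66, 27.0),
    (574, 1, 95, 1.0),
    (574, 1, 9999, 3.0),  # hypothetical row with no entry in p_vars
])
conn.executemany("INSERT INTO p_vars VALUES (?, ?, ?)", [
    (66, "Art", "Artemisia"),
    (95, "Bet", "Betula"),
])

query = """SELECT p_counts.e_, p_counts.sample_, p_counts.var_,
                  p_counts."count", p_vars.varcode, p_vars.varname
           FROM p_counts
           LEFT JOIN p_vars ON p_counts.var_ = p_vars.var_
           WHERE p_counts.e_ = ?
           ORDER BY p_counts.var_"""
rows = conn.execute(query, (574,)).fetchall()

# Rows whose variable could not be resolved keep their count but have no name.
unmatched = [r for r in rows if r[4] is None]
print(unmatched)
```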

Example of output for the San Rafael entity (number 574)

In [198]:
get_p_count(574).head()

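| | e_ | sample_ | var_ | count | varcode | varname |
|---|---|---|---|---|---|---|
| 0 | 574 | 1 | 66 | 27.0 | Art | Artemisia |
| 1 | 574 | 1 | 95 | 1.0 | Bet | Betula |
| 2 | 574 | 1 | 165 | 2.0 | Crl-T | Cerealia-type |
| 3 | 574 | 1 | 185 | 108.0 | Cheae | Chenopodiaceae |
| 4 | 574 | 1 | 196 | 1.0 | Ciu | Cistus |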


Write pollen count data to file for each of the entities identified in the `site_entity_info` dataframe.

In [195]:
for e in site_entity_info.index:
    e_dat = site_entity_info.loc[e]
    # construct file name string using data from site_entity_info dataframe
    s = 'e'+str(e)+'_s'+str(e_dat.site_)+'_'+e_dat.sitename.replace(' ','_').replace(u'é','e')+\
        '_pcounts.csv'
        
    p_dat = get_p_count(e) # dataframe containing pollen count data for entity
    p_dat.drop('e_',axis=1).to_csv(os.path.join('data',s),index=False)
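
The file name is assembled inline in the loop above. One possible refactoring, sketched here under the assumption that a small helper is acceptable, moves that logic into a function and strips accents with `unicodedata` instead of replacing `é` by hand. The name `entity_filename` and its signature are hypothetical.

```python
import unicodedata


def entity_filename(entity, site, sitename, suffix):
    """Build a name like 'e469_s396_Navarres_pcounts.csv' (hypothetical helper)."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    ascii_name = (unicodedata.normalize("NFKD", sitename)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    return "e{0}_s{1}_{2}_{3}.csv".format(
        entity, site, ascii_name.replace(" ", "_"), suffix)


print(entity_filename(469, 396, u"Navarr\u00e9s", "pcounts"))
print(entity_filename(893, 761, u"Laguna Guallar", "chron"))
```

Unlike the explicit `replace`, this approach also handles accented characters other than `é`, at the cost of silently dropping any character that has no ASCII decomposition.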

### Get radiocarbon dates

Chronologies apply to *samples* within entities (cores). They are the output of models that interpolate the age of every sample in a core from C14 dating of a subset of those samples.

For some cores there are multiple chronologies supplied in the EPD. In these instances, the chronology which is marked as **default** in the database is the one which is extracted (see [documentation](http://www.europeanpollendatabase.net/data/downloads/image/pollen-database-manual-20071011.doc) for details).

Information about chronologies for each core is provided in the `chron` table. The inferred fitted chronology for pollen samples is contained in the `p_agedpt` table.

Additionally, we can find the dates corresponding to the samples which were actually tested in the `c14` table, but this will not enter into our analyses here.
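As a rough illustration of how ages for undated samples can be interpolated from a handful of dated ones, the sketch below applies simple linear interpolation between dated depths. This is an assumption made for illustration only: the chronologies stored in the EPD may come from other age-depth models, and the function name, depths and ages here are made up.

```python
from bisect import bisect_right


def interpolate_age(depth, dated_depths, dated_ages):
    """Linearly interpolate an age at `depth` between dated control points.

    `dated_depths` must be strictly increasing and the same length as
    `dated_ages`. Depths outside the dated range are rejected, not extrapolated.
    """
    if len(dated_depths) != len(dated_ages) or len(dated_depths) < 2:
        raise ValueError("need at least two matching depth/age pairs")
    if not dated_depths[0] <= depth <= dated_depths[-1]:
        raise ValueError("depth lies outside the dated range")
    # Index of the right-hand control point of the bracketing interval.
    i = min(bisect_right(dated_depths, depth), len(dated_depths) - 1)
    d0, d1 = dated_depths[i - 1], dated_depths[i]
    a0, a1 = dated_ages[i - 1], dated_ages[i]
    return a0 + (a1 - a0) * (depth - d0) / float(d1 - d0)


# Hypothetical control points: depth in cm, age in years BP.
depths = [0, 100, 200]
ages = [0, 1000, 3000]
print(interpolate_age(50, depths, ages))
print(interpolate_age(150, depths, ages))
```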

We now define a function for extracting the fitted chronology for a given entity:

In [220]:
def get_chronology(entity):
    # get id of database default chronology for selected entity
    def_chron_no=pd.read_sql_query("SELECT chron_ FROM chron WHERE e_={0} AND defaultchron='Y'".format(entity), 
                                   con=engine)['chron_'].values[0]  
    
    query = "SELECT * FROM p_agedpt WHERE e_={0} AND chron_={1}".format(entity,def_chron_no)
    return pd.read_sql_query(query,con=engine).drop(['chron_'],axis=1)
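
Note that `.values[0]` raises an `IndexError` if an entity has no chronology flagged as default. A defensive lookup might look like the sketch below, which again uses an in-memory SQLite stand-in for the `chron` table with made-up entity numbers; `get_default_chron_id` is a hypothetical name, not part of the notebook.

```python
import sqlite3


def get_default_chron_id(conn, entity):
    """Return the default chron_ for an entity, or None if none is flagged."""
    row = conn.execute(
        "SELECT chron_ FROM chron WHERE e_ = ? AND defaultchron = 'Y'",
        (entity,),
    ).fetchone()
    return None if row is None else row[0]


# Stand-in chron table: entity 1 has a default chronology, entity 2 does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chron (e_ INTEGER, chron_ INTEGER, defaultchron TEXT)")
conn.executemany("INSERT INTO chron VALUES (?, ?, ?)", [
    (1, 1, "N"),
    (1, 2, "Y"),
    (2, 1, "N"),
])

print(get_default_chron_id(conn, 1))
print(get_default_chron_id(conn, 2))
```

Returning `None` lets the caller decide whether to skip the entity or fall back to another chronology.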


Sample output for Laguna Guallar (entity 893)

In [223]:
get_chronology(893).head()

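| | e_ | sample_ | agebp | ageup | agelo | deptime |
|---|---|---|---|---|---|---|
| 0 | 893 | 1 | 0.0 | | | |
| 1 | 893 | 2 | 982.0 | | | |
| 2 | 893 | 3 | 1963.0 | | | |
| 3 | 893 | 4 | 2945.0 | | | |
| 4 | 893 | 5 | 3927.0 | | | |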


Write chronologies for each entity to data folder

In [225]:
for e in site_entity_info.index:
    e_dat = site_entity_info.loc[e]
    # construct file name string using data from site_entity_info dataframe
    s = 'e'+str(e)+'_s'+str(e_dat.site_)+'_'+e_dat.sitename.replace(' ','_').replace(u'é','e')+\
        '_chron.csv'
        
    p_dat = get_chronology(e) # dataframe containing chronology data for entity
    p_dat.drop('e_',axis=1).to_csv(os.path.join('data',s),index=False)