# Darwin Core Conversion of eDNA Sequence Data From the AOML_MIMARKS metadata template 

**Version:** 1.0.0

**Author:** Katherine Silliman

**Last Updated:** 2-Oct-2023

This notebook converts a [MIMARKS](https://fairsharing.org/FAIRsharing.zvrep1)-based data sheet to Darwin Core for submission to OBIS. It has been tested on a Mac M1 laptop running in Rosetta mode, with Python 3.11.

[Metadata template Google Sheet](https://docs.google.com/spreadsheets/d/1jof9MBEll7Xluu8-_znLRBIP9JpyAd_5YvdioZ-REoY/edit?usp=sharing)

**Requirements:**
- Python 3
- Python 3 packages:
    - os
- External packages:
    - Bio.Entrez from biopython
    - numpy
    - pandas
    - openpyxl
    - pyworms
    - multiprocess
- Custom modules:
    - WoRMS_matching

**Resources:**
- Abarenkov K, Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Provoost P, Schigel D, Suominen S, Svenningsen C & Frøslev TG (2023) Publishing DNA-derived data through biodiversity data platforms, v1.3. Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22
- [OBIS manual](https://manual.obis.org/dna_data.html)
- [TDWG Darwin Core Occurrence Core](https://dwc.tdwg.org/terms/#occurrence)
- [GBIF DNA Derived Data Extension](https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/DNADerivedData)
- https://github.com/iobis/dataset-edna

**Citation**  
Silliman K, Anderson S, Storo R, Thompson L (2023) A Case Study in Sharing Marine eDNA Metabarcoding Data to OBIS. Biodiversity Information Science and Standards 7: e111048. https://doi.org/10.3897/biss.7.111048


## Installation
```
conda create -n edna2obis
conda activate edna2obis
conda install -c conda-forge notebook
conda install -c conda-forge nb_conda_kernels

conda install -c conda-forge numpy pandas
conda install -c conda-forge openpyxl

#worms conversion
conda install -c conda-forge pyworms
conda install -c conda-forge multiprocess
conda install -c conda-forge biopython
```
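
After installation, it can help to confirm that the packages listed under Requirements are importable from the active kernel. The following is a minimal sketch and not part of the original notebook: the helper name `find_missing_modules` is hypothetical, and the import names (for example `Bio` for biopython) are taken from the requirements list above.

```python
import importlib.util

# Import names for the packages in the Requirements list.
# biopython is imported as `Bio`; WoRMS_matching is the custom module
# that is expected to sit next to the notebook.
REQUIRED_MODULES = ["numpy", "pandas", "openpyxl", "pyworms",
                    "multiprocess", "Bio", "WoRMS_matching"]


def find_missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```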

In [1]:
## Imports
import os

import numpy as np
import pandas as pd

import WoRMS_matching # custom functions for querying WoRMS API

In [31]:
# jupyter notebook parameters
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_columns', None)

Note that in a Jupyter Notebook, the current directory is always where the .ipynb file is being run.

## Prepare input data 

**Project data and metadata**  
This workflow assumes that you have your project metadata in an Excel sheet formatted like the template located [here](https://docs.google.com/spreadsheets/d/1jof9MBEll7Xluu8-_znLRBIP9JpyAd_5YvdioZ-REoY/edit?usp=sharing). Instructions for filling out the metadata template are located in the 'Readme' sheet.

**eDNA and taxonomy data**  
The eDNA data and assigned taxonomy should be in a specific tab-delimited format. ![asv_table format](../images/asv_table.png)

This file is generated automatically by [Tourmaline v2023.5+](https://github.com/aomlomics/tourmaline), in X location. If your data were generated with QIIME 2 or a previous version of Tourmaline, you can convert the `table.qza`, `taxonomy.qza`, and `repseqs.qza` outputs to the correct format using the `create_asv_seq_taxa_obis.sh` shell script.

Example:  

``` 
#Run this with a qiime2 environment. 
bash create_asv_seq_taxa_obis.sh -f \
../gomecc_v2_raw/table-16S-merge.qza -t ../gomecc_v2_raw/taxonomy-16S-merge.qza -r ../gomecc_v2_raw/repseqs-16S-merge.qza \
-o ../gomecc_v2_raw/gomecc-16S-asv.tsv
```


## Set configs  

Below you can set definitions for parameters used in the code. 

| Parameter           | Description                                                                                                       | Example                                                                                              |
|---------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `sample_data`       | Name of sheet in project data Excel file with sample data.                                                        | "water_sample_data"                                                                                  |
| `prep_data`         | Name of sheet in project data Excel file with data about molecular preparation methods.                           | "amplicon_prep_data"                                                                                 |
| `analysis_data`     | Name of sheet in project data Excel file with data about analysis methods.                                        | "analysis_data"                                                                                      |
| `study_data`        | Name of sheet in project data Excel file with metadata about the study.                                           | "study_data"                                                                                         |
| `msmt_metadata`     | Name of sheet in project data Excel file with metadata about additional measurements. Not used in current code.   | "measurement_metadata"                                                                               |
| `excel_file`        | Path of project data Excel file.                                                                                  | "../raw/gomecc4_AOML_MIMARKS.survey.water.6.0.xlsx"                                                  |
| `md_excel`          | Path of data dictionary Excel file.                                                                               | "../raw/gomecc_AOML2DwC standards.xlsx"                                                              |
| `datafiles`         | Python dictionary, where keys are the amplicon names and the values are the paths to the corresponding ASV table. | {'16S V4-V5': '../raw/gomecc-16S-asv.tsv', '18S V9': '../raw/gomecc-18S-asv.tsv'}                    |
| `skip_sample_types` | Python list of sample_type values to skip from OBIS submission, such as controls or blanks.                       | ['mock community','distilled water blank','extraction blank','PCR no-template control','RTSF blank'] |
| `skip_columns`      | Python list of columns to ignore when submitting to OBIS.                                                         | ['notes_sampling']                                                                                   |

In [43]:
params = {}
params['sample_data'] = "water_sample_data"
params['prep_data']= "amplicon_prep_data"
params['analysis_data'] = "analysis_data"
params['study_data'] = "study_data"
params['msmt_metadata'] = "measurement_metadata"
params['excel_file'] = "../raw/gomecc4_AOML_MIMARKS.survey.water.6.0.xlsx"

params['datafiles'] = {'16S V4-V5': '../raw/gomecc-16S-asv.tsv',
                       '18S V9': '../raw/gomecc-18S-asv.tsv'}

params['skip_sample_types'] = ['mock community','distilled water blank','extraction blank','PCR no-template control','RTSF blank']
params['skip_columns']= ['notes_sampling']
params['md_excel'] = "../raw/gomecc_AOML2DwC standards.xlsx"
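
Because every later cell depends on these paths, a quick check that each configured file exists can save a confusing error further down. This sketch is an addition, not part of the original workflow; `missing_input_files` is a hypothetical helper that assumes only the `params` keys defined above.

```python
import os


def missing_input_files(params):
    """Return the configured input paths that do not exist on disk."""
    paths = [params["excel_file"], params["md_excel"]]
    paths.extend(params["datafiles"].values())
    return [p for p in paths if not os.path.isfile(p)]


# Example usage with the `params` dictionary defined above:
# for path in missing_input_files(params):
#     print("File not found:", path)
```

Paths are resolved relative to the notebook's working directory, so a non-empty result often means the notebook was started from a different folder.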


## Load data

### Load project data Excel file

In [34]:

data = pd.read_excel(
    params['excel_file'], 
    [params['study_data'],params['sample_data'],params['prep_data'],params['analysis_data'],params['msmt_metadata']],
    index_col=None, na_values=[""], comment="#"
)
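
Because `pd.read_excel` is given a list of sheet names here, it returns a dictionary of DataFrames keyed by sheet name, and the load fails if a configured name does not match a tab in the workbook. A minimal sketch for checking the names first is shown below. `missing_sheets` and `SHEET_KEYS` are hypothetical names introduced for illustration; the commented usage relies on the `sheet_names` attribute of `pd.ExcelFile`.

```python
# Keys in `params` that hold sheet names of the project data workbook.
SHEET_KEYS = ["study_data", "sample_data", "prep_data",
              "analysis_data", "msmt_metadata"]


def missing_sheets(params, available_sheets):
    """Return configured sheet names that are absent from the workbook."""
    wanted = [params[key] for key in SHEET_KEYS]
    return [name for name in wanted if name not in available_sheets]


# Example usage (requires pandas and openpyxl):
# import pandas as pd
# available = pd.ExcelFile(params["excel_file"]).sheet_names
# print(missing_sheets(params, available))
```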

In [35]:
#rename keys in data dictionary to a general term
data['sample_data'] = data.pop(params['sample_data'])
data['prep_data'] = data.pop(params['prep_data'])
data['analysis_data'] = data.pop(params['analysis_data'])
data['study_data'] = data.pop(params['study_data'])

In [36]:
#remove * from headers (was required for NCBI submission, but no longer needed)
# regex=False makes pandas treat "*" as a literal character, not a regex quantifier
data['sample_data'].columns = data['sample_data'].columns.str.replace("*","", regex=False)

#### sample_data  
Contextual data about the samples collected, such as when and where each sample was collected, what kind of sample it is, and the properties of the environment or experimental condition from which it was taken. Each row is a distinct sample, or Event. Most of this information is recorded during sample collection. This sheet contains terms from the MIMARKS survey water 6.0 package.

In [37]:
data['sample_data'].head()

Unnamed: 0,sample_name,serial_number,cruise_id,line_id,station,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,amplicon_sequenced,metagenome_sequenced,organism,collection_date_local,collection_date,depth,env_broad_scale,env_local_scale,env_medium,geo_loc_name,lat_lon,decimalLatitude,decimalLongitude,samp_vol_we_dna_ext,samp_collect_device,samp_mat_process,sample_type,samp_size,size_frac,collection_method,basisOfRecord,cluster_16s,cluster_18s,notes_sampling,notes_bottle_metadata,line_position,offshore_inshore_200m_isobath,depth_category,ocean_acidification_status,seascape_class,seascape_probability,seascape_window,dna_sample_number,dna_conc,dna_yield,extraction_plate_name,extraction_well_number,extraction_well_position,ship_crs_expocode,woce_sect,ammonium,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,date_sheet_modified,modified_by
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],"USA: Atlantic Ocean, east of Florida (27 N)",26.997 N 79.618 W,26.997,-79.618,1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,seawater,,0.22 µm,CTD rosette,MaterialSample,Cluster 3,Cluster 3,DCM = deep chlorophyl max.,,Offshore,offshore,Deep,Low,13,0.507214,8-day,1,0.08038 ng/µl,12.057 ng,GOMECC2021_Plate4,52,D7,WBTSRHB,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,2023-10-03 13:28:31.916,luke.thompson@noaa.gov
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],"USA: Atlantic Ocean, east of Florida (27 N)",26.997 N 79.618 W,26.997,-79.618,1940 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,seawater,,0.22 µm,CTD rosette,MaterialSample,Cluster 3,Cluster 3,DCM was around 80 m and not well defined.,,Offshore,offshore,Deep,Low,13,0.507214,8-day,2,0.1141 ng/µl,17.115 ng,GOMECC2021_Plate4,60,D8,WBTSRHB,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,NaT,
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],"USA: Atlantic Ocean, east of Florida (27 N)",26.997 N 79.618 W,26.997,-79.618,2000 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,seawater,,0.22 µm,CTD rosette,MaterialSample,Cluster 3,Cluster 3,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,,Offshore,offshore,Deep,Low,13,0.507214,8-day,3,0.07223 ng/µl,10.8345 ng,GOMECC2021_Plate4,62,F8,WBTSRHB,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,NaT,
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],"USA: Atlantic Ocean, east of Florida (27 N)",26.997 N 79.618 W,26.997,-79.618,1540 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,seawater,,0.22 µm,CTD rosette,MaterialSample,Cluster 1,Cluster 2,Only enough water for 2 surface replicates.,,Offshore,offshore,DCM,High,13,0.507214,8-day,4,1.49 ng/µl,223.5 ng,GOMECC2021_Plate4,53,E7,WBTSRHB,RB2103,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,NaT,
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],"USA: Atlantic Ocean, east of Florida (27 N)",26.997 N 79.618 W,26.997,-79.618,1720 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,seawater,,0.22 µm,CTD rosette,MaterialSample,Cluster 1,Cluster 2,,,Offshore,offshore,DCM,High,13,0.507214,8-day,5,0.6884 ng/µl,103.26 ng,GOMECC2021_Plate4,46,F6,WBTSRHB,RB2103,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,NaT,


#### prep_data  
Contextual data about how the samples were prepared for sequencing, including how they were extracted, what amplicon was targeted, and how they were sequenced. Each row is a separate sequencing library preparation, distinguished by a unique library_id.

In [38]:
data['prep_data'].head(2)

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,filetype,filename,filename2,drive_location,biosample_accession,sra_accession,seq_method,nucl_acid_ext,amplicon_sequenced,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode,date_sheet_modified,modified_by
0,GOMECC4_NegativeControl_1,GOMECC16S_Neg1,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,"Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic bead-based extraction kits (ZymoBIOMICS 96 DNA/RNA MagBead kit). Following bead beating in the cartridges, extractions were finished on an automated KingFisher Flex instrument (Thermo Fisher) in 96-well plates. A two-step PCR approach was used, targeting 16S V4-V5 rRNA with primers 515F (5-GTGYCAGCMGCCGCGGTAA-3) and 926R (5-CCGYCAATTYMTTTRAGTTT-3). Primers were constructed with Fluidigm common oligos CS1 forward (CS1-TS-F: 5-ACACTGACGACATGGTTCTACA-3) and CS2 reverse (CS2-TS-R: 5-TACGGTAGCAGAGACTTGGTCT-3) fused to their 5' ends. PCR products were sent to the Michigan State University Research Technology Support Facility Genomics Core for secondary PCR and sequencing. Secondary PCR used dual-indexed, Illumina-compatible primers, targeting the Fluidigm CS1/CS2 oligomers at the ends of the PCR products. Sequencing runs were performed on an Illumina MiSeq to produce 250+250 nt paired reads.",fastq,GOMECC16S_Neg1_S499_L001_R1_001.fastq.gz,GOMECC16S_Neg1_S499_L001_R2_001.fastq.gz,,SAMN37516589,SRR26148505,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,NaT,
1,GOMECC4_27N_Sta1_DCM_A,GOMECC18S_Plate4_53,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,"Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic bead-based extraction kits (ZymoBIOMICS 96 DNA/RNA MagBead kit). Following bead beating in the cartridges, extractions were finished on an automated KingFisher Flex instrument (Thermo Fisher) in 96-well plates. A two-step PCR approach was used, targeting 18S V0 rRNA with primers 1391f; 5’-GTACACACCGCCCGTC-3’ and EukBr; 5’-TGATCCTTCTGCAGGTTCACCTAC-3’. Primers were constructed with Fluidigm common oligos CS1 forward (CS1-TS-F: 5-ACACTGACGACATGGTTCTACA-3) and CS2 reverse (CS2-TS-R: 5-TACGGTAGCAGAGACTTGGTCT-3) fused to their 5' ends. PCR products were sent to the Michigan State University Research Technology Support Facility Genomics Core for secondary PCR and sequencing. Secondary PCR used dual-indexed, Illumina-compatible primers, targeting the Fluidigm CS1/CS2 oligomers at the ends of the PCR products. Sequencing runs were performed on an Illumina MiSeq to produce 250+250 nt paired reads",fastq,GOMECC18S_Plate4_53_S340_L001_R1_001.fastq.gz,GOMECC18S_Plate4_53_S340_L001_R2_001.fastq.gz,,SAMN37516094,SRR26161153,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,2023-10-03 12:49:23.878,katherine.silliman@noaa.gov


### Drop samples with unwanted sample types  

Often with eDNA projects, we have control samples that are sequenced along with our survey samples. These can include filtering distilled water, using pure water instead of DNA in a PCR or DNA extraction protocol, or a mock community of known microbial taxa. Controls can help identify and mitigate contaminant DNA in our samples, but are not useful for biodiversity platforms like OBIS. You can select which sample_type values to drop with the `skip_sample_types` parameter.

In [8]:
samps_to_remove = data['sample_data']['sample_type'].isin(params['skip_sample_types'])
#data['sample_data'][samps_to_remove]
# list of samples to drop
samples_to_drop = data['sample_data']['sample_name'][samps_to_remove]

You can view the list of samples to be dropped below.

In [9]:
samples_to_drop

26     GOMECC4_Blank_DIW_20210915_A
27     GOMECC4_Blank_DIW_20210915_B
28     GOMECC4_Blank_DIW_20210915_C
200    GOMECC4_Blank_DIW_20210930_A
201    GOMECC4_Blank_DIW_20210930_B
202    GOMECC4_Blank_DIW_20210930_C
334    GOMECC4_Blank_DIW_20211011_A
335    GOMECC4_Blank_DIW_20211011_B
336    GOMECC4_Blank_DIW_20211011_C
409    GOMECC4_Blank_DIW_20211016_A
410    GOMECC4_Blank_DIW_20211016_B
411    GOMECC4_Blank_DIW_20211016_C
484       GOMECC4_ExtractionBlank_1
485      GOMECC4_ExtractionBlank_11
486      GOMECC4_ExtractionBlank_12
487       GOMECC4_ExtractionBlank_3
488       GOMECC4_ExtractionBlank_5
489       GOMECC4_ExtractionBlank_7
490       GOMECC4_ExtractionBlank_9
491            GOMECC4_MSUControl_1
492            GOMECC4_MSUControl_2
493            GOMECC4_MSUControl_3
494            GOMECC4_MSUControl_4
495            GOMECC4_MSUControl_5
496            GOMECC4_MSUControl_6
497            GOMECC4_MSUControl_7
498       GOMECC4_NegativeControl_1
499       GOMECC4_NegativeCo

In [10]:
# remove samples from sample_data sheet
data['sample_data'] = data['sample_data'][~samps_to_remove]

In [12]:
# remove samples from prep_data
prep_samps_to_remove = data['prep_data']['sample_name'].isin(samples_to_drop)
data['prep_data'] = data['prep_data'][~prep_samps_to_remove]
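
After filtering both sheets, it may be worth confirming that every library left in `prep_data` still points to a sample in `sample_data`. The helper below is a hypothetical sanity check, not part of the original notebook; it only assumes that both sheets share the `sample_name` column used above.

```python
def orphan_sample_names(prep_names, sample_names):
    """Return prep_data sample names with no matching row in sample_data."""
    known = set(sample_names)
    return sorted(set(prep_names) - known)


# Example usage with the filtered sheets:
# orphans = orphan_sample_names(data['prep_data']['sample_name'],
#                               data['sample_data']['sample_name'])
# print(orphans)  # an empty list means the two sheets agree
```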

In [14]:
# check the sample_type values left in your sample_data. We only want seawater.
data['sample_data']['sample_type'].unique()

array(['seawater'], dtype=object)

### Drop columns with all NAs  

If your project data file has columns with only NAs, this code will check for those, provide their column headers for verification, then remove them.

In [15]:
# which have all NAs?
dropped = pd.DataFrame()
for sheet in ['sample_data','prep_data','analysis_data']:
    res = pd.Series(data[sheet].columns[data[sheet].isnull().all(0)],
                name=sheet)
    dropped=pd.concat([dropped,res],axis=1)
    

Which columns in each sheet have only NA values?

In [16]:
dropped

Unnamed: 0,sample_data,prep_data,analysis_data
0,samp_size,drive_location,sop


If you are fine with leaving these columns out, proceed:

In [17]:
for sheet in ['sample_data','prep_data','analysis_data']:
    data[sheet].dropna(axis=1, how='all',inplace=True)

Now let's check which columns have missing values in some of the rows. These should be filled in on the Excel sheet with the appropriate term ('not applicable', 'missing', or 'not collected'). Alternatively, you can drop the column if it is not needed for submission to OBIS.

In [18]:
# which columns have missing data (NAs) in some rows
some = pd.DataFrame()
for sheet in ['sample_data','prep_data','analysis_data']:
    res = pd.Series(data[sheet].columns[data[sheet].isnull().any()].tolist(),
                name=sheet)
    some=pd.concat([some,res],axis=1)

In [19]:
some

Unnamed: 0,sample_data,prep_data,analysis_data
0,notes_bottle_metadata,date_sheet_modified,date_sheet_modified
1,date_sheet_modified,modified_by,modified_by
2,modified_by,,
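
If a column listed above should be kept rather than dropped, its missing values can also be filled in code instead of in the Excel sheet. The sketch below is an illustration on toy data and not part of the original workflow; the `fill_terms` mapping and the choice of 'not applicable' are assumptions that should be replaced with the term appropriate for each column.

```python
import pandas as pd

# Toy stand-in for a sheet such as data['sample_data'].
sheet = pd.DataFrame({
    "sample_name": ["S1", "S2", "S3"],
    "notes_bottle_metadata": [None, "bottle leaked", None],
})

# Hypothetical mapping of column name -> placeholder term to use for NAs.
fill_terms = {"notes_bottle_metadata": "not applicable"}

# A dict passed to fillna fills each named column with its own value.
filled = sheet.fillna(value=fill_terms)
print(filled["notes_bottle_metadata"].tolist())
```

Columns filled this way no longer contain NAs, so they would survive a later `dropna(axis=1, how='any')`.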


In [20]:
# drop columns with any missing data
for sheet in ['sample_data','prep_data','analysis_data']:
    data[sheet].dropna(axis=1, how='any',inplace=True)

### Load data dictionary Excel file 
This Excel file is used as a data dictionary for converting between terms used in the project data Excel file and Darwin Core terms for submission to OBIS. Currently, we are only preparing an Occurrence core file and a DNA-derived extension file, with Event information in the Occurrence file. Future versions of this workflow will prepare an extendedMeasurementOrFact file as well.

In [22]:
# read in data dictionary excel file
dwc_data = pd.read_excel(
    params['md_excel'], 
    ['event','occurrence','dna'],
    index_col=0, na_values=[""]
)

In [26]:
#example of a sheet in the data dictionary
dwc_data['event'].head()

Unnamed: 0_level_0,AOML_term,AOML_file,DwC_definition
DwC_term,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
eventID,sample_name,sample_data,An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:eventID
eventDate,collection_date_local,sample_data,this is the date-time when the dwc:Event was recorded. Recommended best practice is to use a date that conforms to ISO 8601-1:2019. https://dwc.tdwg.org/terms/#dwc:eventDate
samplingProtocol,collection_method,sample_data,"The names of, references to, or descriptions of the methods or protocols used during a dwc:Event."
locationID,station,sample_data,An identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.
decimalLatitude,decimalLatitude,sample_data,"The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. https://dwc.tdwg.org/terms/#dwc:decimalLatitude"
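
To illustrate how a sheet like this can drive the conversion, the sketch below builds a rename map from `AOML_term` to `DwC_term` and applies it to a toy sample table. It is a simplified assumption about how the dictionary might be used, not the notebook's actual conversion code; the toy values mirror the rows shown above.

```python
import pandas as pd

# Toy excerpt of the 'event' sheet, indexed by DwC_term as in dwc_data['event'].
event_dict = pd.DataFrame(
    {
        "AOML_term": ["sample_name", "collection_date_local", "station"],
        "AOML_file": ["sample_data", "sample_data", "sample_data"],
    },
    index=pd.Index(["eventID", "eventDate", "locationID"], name="DwC_term"),
)

# Toy sample sheet with AOML column names (line_id has no mapping here).
samples = pd.DataFrame({
    "sample_name": ["GOMECC4_27N_Sta1_Deep_A"],
    "collection_date_local": ["2021-09-14T11:00-04:00"],
    "station": ["Sta1"],
    "line_id": ["27N"],
})

# Build {AOML_term: DwC_term} for terms that come from sample_data.
from_samples = event_dict[event_dict["AOML_file"] == "sample_data"]
rename_map = dict(zip(from_samples["AOML_term"], from_samples.index))

# Keep only the mapped columns and rename them to Darwin Core terms.
event = samples[list(rename_map)].rename(columns=rename_map)
print(event.columns.tolist())
```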


### Load ASV data  
The ASV data files have one row for each unique amplicon sequence variant (ASV). There is one ASV file for each marker sequence. Each file contains the ASV DNA sequence, a unique hash identifier, the taxonomic assignment for each ASV, the confidence given to that assignment by the naive Bayes classifier, and the number of reads observed in each sample.

| column name    | definition                                                                                                                                                                                                                                                                                                                                                                                              |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| featureid      | A hash of the ASV sequence, used as a unique identifier for the ASV.                                                                                                                                                                                                                                                                                                                                    |
| sequence       | The DNA sequence of the ASV                                                                                                                                                                                                                                                                                                                                                                             |
| taxonomy       | The full taxonomy assigned to an ASV sequence. This string could be formatted in very different ways depending on the reference database used during classification, however it should always be in reverse rank order separated by ;. We provide examples for how to process results from a Silva classifier and the PR2 18S classifier. For other taxonomy formats, the code will need to be adapted. |
| Confidence     | This is the confidence score assigned the taxonomic classification with a naive-bayes classifier.                                                                                                                                                                                                                                                                                                       |
| sample columns | The next columns each represent a sample (or eventID), and the number of reads for that ASV observed in the sample.                                                                                                                                                                                                                                                                                     |
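
As an illustration of the Silva-style `taxonomy` string, the following sketch splits it into named ranks. `split_silva_taxonomy` and `RANK_PREFIXES` are hypothetical names and not functions from `WoRMS_matching`; the prefix-to-rank mapping is inferred from the example strings in this notebook, and PR2 strings would need different handling.

```python
# Rank prefixes as they appear in the Silva-style strings in this notebook.
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def split_silva_taxonomy(taxonomy):
    """Split 'd__Bacteria; p__Proteobacteria; ...' into a {rank: name} dict."""
    ranks = {}
    for part in taxonomy.split(";"):
        part = part.strip()
        if "__" not in part:
            continue  # skip empty or unprefixed fragments
        prefix, name = part.split("__", 1)
        if prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


example = ("d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
           "o__Vibrionales; f__Vibrionaceae; g__Vibrio")
print(split_silva_taxonomy(example))
```

Ranks that are absent from the string, such as species in this example, are simply left out of the result.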

In [44]:
# read in ASV tables, looping through amplicons
asv_tables = {}

for gene in params['datafiles'].keys():
    asv_tables[gene] = pd.read_table(params['datafiles'][gene])


In [45]:
asv_tables.keys()

dict_keys(['16S V4-V5', '18S V9'])

In [53]:
asv_tables['16S V4-V5'].iloc[:,0:20].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C,GOMECC4_27N_Sta4_Deep_A,GOMECC4_27N_Sta4_Deep_B,GOMECC4_27N_Sta4_Deep_C,GOMECC4_27N_Sta4_Surface_A,GOMECC4_27N_Sta4_Surface_B
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGGTTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAATTGCATTTGAAACTGGCAGACTAGAGTACTGTAGAGGGGGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAGATACTGACACTCAGATGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCTACTTGGAGGTTGTGGCCTTGAGCCGTGGCTTTCGGAGCTAACGCGTTAAGTAGACCGCCTGGGGAGTACGGTCGCAAGATTA,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio,0.978926,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGGTATTTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATCCGAAACTATTTAACTAGAGTATGGGAGAGGTAAGTAGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAATACCAGTGGCGAAGGCGGCTTACTGGACCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTCGTAGTCCATGCCGTAAACGATGTGTGTTAGACGTTGGAAATTTATTTTCAGTGTCGCAGCGAAAGCAGTAAACACACCGCCTGGGGAGTACGACCGCAAGGTTA,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus,0.881698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTTCTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAATACTGGAAGACTAGAAAACGGAAGAGGGTAGTGGAATTCCCAGTGTAGAGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATCACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAGGGATTAGATACCCCTGTAGTCCTGGCCGTAAACGATGAACACTAGGTGTCGGGGGAATCGACCCCCTCGGTGTCGTAGCCAACGCGTTAAGTGTTCCGCCTGGGGAGTACGCACGCAAGTGTG,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Synechococcales; f__Cyanobiaceae; g__Cyanobium_PCC-6307,0.762793,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAAATAAGTCTGATGTGAAGGCCCAGGGCTCAACCCTGGAAGTGCATCGGAAACTGTTTGGCTCGAGTCCCGGAGGGGGTGGTGGAATTCCTGGTGTAGAGGTGAAATTCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGACCACCTGGACGGTGACTGACGCTGAGGTGCGAAAGCATGGGTAGCAAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGATGAGTACTAGGCGCTGCGGGTATTGACCCCTGCGGTGCCGAAGTTAACGCATTAAGTACTCCGCCTGGGAAGTACGGCCGCAAGGTTA,d__Bacteria; p__Myxococcota; c__Myxococcia; o__Myxococcales; f__Myxococcaceae; g__P3OB-42; s__uncultured_bacterium,0.997619,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTTTTAAGTTGGATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATCCAAAACTAGATGACTAGAGTACGAAAGAGGGAAGTAGAATTCACAGTGTAGCGGTGGAATGCGTAGATATTGTGAAGAATACCAATGGCGAAGGCAGCTTCCTGGTTCTGTACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGTCACTAGCTGTTTGGACTTCGGTCTGAGTGGCTAAGCGAAAGTGATAAGTGACCCACCTGGGGAGTACGTTCGCAAGAATG,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__SAR86_clade; f__SAR86_clade; g__SAR86_clade,0.999961,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Convert to Occurrence file
To link the DNA-derived extension metadata to our OBIS occurrence records, we have to use the Occurrence core.
For this data set, a `parentEvent` is a filtered water sample, taken from a larger Niskin bottle grab, from which DNA was extracted; an `event` is a sequencing library prepared from that DNA extraction; and an `occurrence` is an ASV observed within a library. We will produce an occurrence file, a DNA-derived data file, and a measurements file.  
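
As a rough illustration of how the three levels nest, the sketch below builds the identifiers for a single ASV observation. The `eventID` suffix follows the pattern used later in this notebook (`<sample_name>_16S`); the `occurrenceID` pattern and the helper name `build_ids` are assumptions made for illustration only.

```python
def build_ids(sample_name, marker, featureid):
    """Return (parentEventID, eventID, occurrenceID) for one ASV observation."""
    parent_event_id = sample_name                    # the filtered water sample
    event_id = f"{sample_name}_{marker}"             # one sequencing library of that sample
    occurrence_id = f"{event_id}_occ_{featureid}"    # hypothetical naming pattern for an ASV in a library
    return parent_event_id, event_id, occurrence_id


ids = build_ids(
    "GOMECC4_27N_Sta1_Deep_A",
    "16S",
    "000094731d4984ed41435a1bf65b7ef2",
)
print(ids)
```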
**Define files**


### Sampling event info

In [58]:
dwc_data['event']

DwC_term,AOML_term,AOML_file,DwC_definition
eventID,sample_name,sample_data,An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:eventID
eventDate,collection_date_local,sample_data,this is the date-time when the dwc:Event was recorded. Recommended best practice is to use a date that conforms to ISO 8601-1:2019. https://dwc.tdwg.org/terms/#dwc:eventDate
samplingProtocol,collection_method,sample_data,"The names of, references to, or descriptions of the methods or protocols used during a dwc:Event."
locationID,station,sample_data,An identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.
decimalLatitude,decimalLatitude,sample_data,"The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. https://dwc.tdwg.org/terms/#dwc:decimalLatitude"
decimalLongitude,decimalLongitude,sample_data,"The geographic longitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive. https://dwc.tdwg.org/list/#dwc_decimalLongitude"
geodeticDatum,none,pipeline,"The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decimalLongitude are based."
countryCode,none,pipeline,
minimumDepthInMeters,depth,sample_data,
maximumDepthInMeters,derived: depth,sample_data,


In [55]:
event_dict = dwc_data['event'].to_dict('index')

In [57]:
event_dict['eventID']

{'AOML_term': 'sample_name',
 'AOML_file': 'sample_data',
 'DwC_definition': 'An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:eventID'}

In [59]:
# check which event terms are not in sample_data sheet
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] == 'sample_data':
        if event_dict[key]['AOML_term'] not in data['sample_data'].columns:
            print(key,event_dict[key])

maximumDepthInMeters {'AOML_term': 'derived: depth', 'AOML_file': 'sample_data', 'DwC_definition': nan}
waterBody {'AOML_term': 'derived', 'AOML_file': 'sample_data', 'DwC_definition': 'The name of the water body in which the dcterms:Location occurs.         Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.'}


In [60]:
# custom add waterBody

data['sample_data'].loc[data['sample_data']['geo_loc_name'].str.contains("Atlantic Ocean"), 'waterBody']= "Atlantic Ocean"
data['sample_data'].loc[data['sample_data']['geo_loc_name'].str.contains("Gulf"), 'waterBody']= "Mexico, Gulf of"
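
The same assignment can be driven by a lookup table, which may be easier to extend to other cruises. This is only a sketch on a toy data frame: the dictionary `water_body_patterns` and the second and third `geo_loc_name` values are made up for illustration. Passing `na=False` keeps rows with a missing `geo_loc_name` from breaking the boolean mask.

```python
import pandas as pd

# hypothetical lookup: substring of geo_loc_name -> waterBody value
water_body_patterns = {
    "Atlantic Ocean": "Atlantic Ocean",
    "Gulf": "Mexico, Gulf of",
}

# toy stand-in for data['sample_data']
sample = pd.DataFrame({
    "geo_loc_name": [
        "USA: Atlantic Ocean, east of Florida (27 N)",
        "USA: Gulf of Mexico",
        None,
    ]
})

for pattern, water_body in water_body_patterns.items():
    # literal substring match; missing values count as "no match"
    mask = sample["geo_loc_name"].str.contains(pattern, regex=False, na=False)
    sample.loc[mask, "waterBody"] = water_body

print(sample)
```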




In [61]:
# change locationID to line_id+station
data['sample_data']['station'] = data['sample_data']['line_id']+ "_"+data['sample_data']['station'] 


In [62]:
# rename sample_data columns to fit DwC standard
# map each AOML column name to its DwC term (sample_data terms only)
rename_dict = {
    terms['AOML_term']: dwc_term
    for dwc_term, terms in event_dict.items()
    if terms['AOML_file'] == 'sample_data'
}

# rename, then keep only the columns that are now DwC terms
event_sample = data['sample_data'].rename(columns=rename_dict)
event_sample = event_sample.drop(columns=[col for col in event_sample if col not in rename_dict.values()])


In [63]:
# minimumDepthInMeters: remove the "m" unit suffix from the depth values
event_sample['minimumDepthInMeters'] = event_sample['minimumDepthInMeters'].str.strip(" m")
# maximumDepthInMeters: set equal to the minimum depth
event_sample['maximumDepthInMeters'] = event_sample['minimumDepthInMeters']
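
Note that `.str.strip(" m")` leaves the depths as strings. If numeric depths are wanted, one option is to extract the number and convert it, as in the sketch below. It assumes depth values look like `"618 m"`; the toy series `depth_raw` is made up for illustration, and placeholder text such as `not applicable` becomes `NaN`.

```python
import pandas as pd

# toy depth values in the assumed "<number> m" format
depth_raw = pd.Series(["618 m", "49 m", "not applicable", "3.5m"])

# pull out the first integer or decimal number, then convert to float
depth_num = pd.to_numeric(
    depth_raw.str.extract(r"(\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)
print(depth_num)
```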

In [64]:

event_sample.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters
0,GOMECC4_27N_Sta1_Deep_A,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618
1,GOMECC4_27N_Sta1_Deep_B,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618
2,GOMECC4_27N_Sta1_Deep_C,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618
3,GOMECC4_27N_Sta1_DCM_A,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49
4,GOMECC4_27N_Sta1_DCM_B,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49


In [65]:
# add amplicon_sequenced back 
event_sample['amplicon_sequenced'] = data['sample_data']['amplicon_sequenced']

Now add an event for each sequencing library, with the replicate water sample as its parentEvent.  

**Future update**: build these child events in a loop over the amplicons instead of one cell per marker.
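
A minimal sketch of what such a loop might look like, run here on a toy data frame. The function name `make_library_events` and the `markers` dictionary (amplicon label mapped to eventID suffix) are hypothetical; the per-marker logic mirrors the 16S and 18S cells below.

```python
import pandas as pd


def make_library_events(event_sample, markers):
    """Create one child event per (sample, amplicon) pair.

    markers maps the label found in 'amplicon_sequenced' to the eventID suffix.
    """
    children = []
    for amplicon, suffix in markers.items():
        # literal match so characters like '-' or '|' are not treated as regex
        mask = event_sample["amplicon_sequenced"].str.contains(
            amplicon, regex=False, na=False
        )
        child = event_sample[mask].copy()
        child["parentEventID"] = child["eventID"]
        child["eventID"] = child["eventID"] + "_" + suffix
        children.append(child)
    return pd.concat(children, axis=0, ignore_index=True)


# toy stand-in for event_sample
toy = pd.DataFrame({
    "eventID": ["S1", "S2"],
    "amplicon_sequenced": ["16S V4-V5 | 18S V9", "18S V9"],
})
all_events = make_library_events(toy, {"16S V4-V5": "16S", "18S V9": "18S"})
print(all_events)
```

Adding a marker then only requires another entry in the dictionary.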

In [71]:
child_data_16S = event_sample[event_sample['amplicon_sequenced'].str.contains('16S V4-V5')].copy()
child_data_16S['parentEventID'] = child_data_16S['eventID']
child_data_16S['eventID'] = child_data_16S['eventID']+"_16S"
child_data_16S.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,amplicon_sequenced,parentEventID
0,GOMECC4_27N_Sta1_Deep_A_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_A
1,GOMECC4_27N_Sta1_Deep_B_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_B
2,GOMECC4_27N_Sta1_Deep_C_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_C
3,GOMECC4_27N_Sta1_DCM_A_16S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_A
4,GOMECC4_27N_Sta1_DCM_B_16S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_B


In [72]:
child_data_18S = event_sample[event_sample['amplicon_sequenced'].str.contains('18S V9')].copy()
child_data_18S['parentEventID'] = child_data_18S['eventID']
child_data_18S['eventID'] = child_data_18S['eventID']+"_18S"
child_data_18S.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,amplicon_sequenced,parentEventID
0,GOMECC4_27N_Sta1_Deep_A_18S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_A
1,GOMECC4_27N_Sta1_Deep_B_18S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_B
2,GOMECC4_27N_Sta1_Deep_C_18S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_C
3,GOMECC4_27N_Sta1_DCM_A_18S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_A
4,GOMECC4_27N_Sta1_DCM_B_18S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_B


In [73]:
# this is your full event file
all_event_data = pd.concat([child_data_16S,child_data_18S],axis=0,ignore_index=True)

In [74]:
all_event_data = all_event_data.drop(columns=['amplicon_sequenced'])

In [75]:
all_event_data.tail()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,parentEventID
997,GOMECC4_MSUControl_5_18S,not applicable_not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,,not applicable,GOMECC4_MSUControl_5
998,GOMECC4_MSUControl_6_18S,not applicable_not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,,not applicable,GOMECC4_MSUControl_6
999,GOMECC4_MSUControl_7_18S,not applicable_not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,,not applicable,GOMECC4_MSUControl_7
1000,GOMECC4_NegativeControl_1_18S,not applicable_not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,,not applicable,GOMECC4_NegativeControl_1
1001,GOMECC4_NegativeControl_2_18S,not applicable_not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,not applicable,,not applicable,GOMECC4_NegativeControl_2


In [76]:
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] != 'sample_data':
        print(key,event_dict[key])

geodeticDatum {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': 'The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decimalLongitude are based.'}
countryCode {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': nan}
datasetID {'AOML_term': 'project_id_external', 'AOML_file': 'study_data', 'DwC_definition': 'An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.'}
eventRemarks {'AOML_term': 'derived: controls_used', 'AOML_file': 'analysis_data', 'DwC_definition': 'Comments or notes about the dwc:Event.'}


countryCode: left blank because the data set spans multiple countries.

In [77]:
#datasetID
all_event_data['datasetID'] = data['study_data']['project_id_external'].values[0]

In [78]:
#geodeticDatum
all_event_data['geodeticDatum'] = "WGS84"


In [79]:
all_event_data.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,parentEventID,datasetID,geodeticDatum
0,GOMECC4_27N_Sta1_Deep_A_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,GOMECC4_27N_Sta1_Deep_A,noaa-aoml-gomecc4,WGS84
1,GOMECC4_27N_Sta1_Deep_B_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,GOMECC4_27N_Sta1_Deep_B,noaa-aoml-gomecc4,WGS84
2,GOMECC4_27N_Sta1_Deep_C_16S,27N_Sta1,2021-09-14T11:00-04:00,618,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,618,GOMECC4_27N_Sta1_Deep_C,noaa-aoml-gomecc4,WGS84
3,GOMECC4_27N_Sta1_DCM_A_16S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
4,GOMECC4_27N_Sta1_DCM_B_16S,27N_Sta1,2021-09-14T11:00-04:00,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_B,noaa-aoml-gomecc4,WGS84


### Occurrence file

In [45]:
occ = {}

#### 18S

##### drop unwanted samples


In [82]:
asv_tables['18S V9'] = asv_tables['18S V9'].drop(columns=samples_to_drop,errors='ignore')

In [83]:
asv_tables['18S V9'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C,GOMECC4_27N_Sta4_Deep_A,GOMECC4_27N_Sta4_Deep_B,GOMECC4_27N_Sta4_Deep_C,GOMECC4_27N_Sta4_Surface_A,GOMECC4_27N_Sta4_Surface_B,GOMECC4_27N_Sta4_Surface_C,GOMECC4_27N_Sta6_DCM_A,GOMECC4_27N_Sta6_DCM_B,GOMECC4_27N_Sta6_DCM_C,GOMECC4_27N_Sta6_Deep_A,GOMECC4_27N_Sta6_Deep_B,GOMECC4_27N_Sta6_Deep_C,GOMECC4_27N_Sta6_Surface_A,GOMECC4_27N_Sta6_Surface_B,GOMECC4_27N_Sta6_Surface_C,GOMECC4_BROWNSVILLE_Sta63_DCM_A,GOMECC4_BROWNSVILLE_Sta63_DCM_B,GOMECC4_BROWNSVILLE_Sta63_DCM_C,GOMECC4_BROWNSVILLE_Sta63_Deep_A,GOMECC4_BROWNSVILLE_Sta63_Deep_B,GOMECC4_BROWNSVILLE_Sta63_Deep_C,GOMECC4_BROWNSVILLE_Sta63_Surface_A,GOMECC4_BROWNSVILLE_Sta63_Surface_B,GOMECC4_BROWNSVILLE_Sta63_Surface_C,GOMECC4_BROWNSVILLE_Sta66_DCM_A,GOMECC4_BROWNSVILLE_Sta66_DCM_B,GOMECC4_BROWNSVILLE_Sta66_DCM_C,GOMECC4_BROWNSVILLE_Sta66_Surface_A,GOMECC4_BROWNSVILLE_Sta66_Surface_B,GOMECC4_BROWNSVILLE_Sta66_Surface_C,GOMECC4_BROWNSVILLE_Sta66_Deep_A,GOMECC4_BROWNSVILLE_Sta66_Deep_B,GOMECC4_BROWNSVILLE_Sta66_Deep_C,GOMECC4_BROWNSVILLE_Sta71_DCM_A,GOMECC4_BROWNSVILLE_Sta71_DCM_B,GOMECC4_BROWNSVILLE_Sta71_DCM_C,GOMECC4_BROWNSVILLE_Sta71_Deep_A,GOMECC4_BROWNSVILLE_Sta71_Deep_B,GOMECC4_BROWNSVILLE_Sta71_Deep_C,GOMECC4_BROWNSVILLE_Sta71_Surface_A,GOMECC4_BROWNSVILLE_Sta71_Surface_B,GOMECC4_BROWNSVILLE_Sta71_Surface_C,GOMECC4_CAMPECHE_Sta90_DCM_A,GOMECC4_CAMPECHE_Sta90_DCM_B,GOMECC4_CAMPECHE_Sta90_DCM_C,GOMECC4_CAMPECHE_Sta90_Deep_A,GOMECC4_CAMPECHE_Sta90_Deep_B,GOMECC4_CAMPECHE_Sta90_Deep_C,GOMECC4_CAMPECHE_Sta90_Surface_A,GOMECC4_CAMPECHE_Sta90_Surface_B,GOMECC4_CAMPECHE_Sta90_Surface_C,GOMECC4_CAMPECHE_Sta91_DCM_A,GOMECC4_CAMPECHE_Sta91_DCM_B,GOMECC4_CAMPECHE_Sta91_DCM_C,GOMECC4_CAMPECHE_Sta91_Deep_A,GOMECC4_C
AMPECHE_Sta91_Deep_B,GOMECC4_CAMPECHE_Sta91_Deep_C,GOMECC4_CAMPECHE_Sta91_Surface_A,GOMECC4_CAMPECHE_Sta91_Surface_B,GOMECC4_CAMPECHE_Sta91_Surface_C,GOMECC4_CAMPECHE_Sta93_Deep_A,GOMECC4_CAMPECHE_Sta93_Deep_B,GOMECC4_CAMPECHE_Sta93_Deep_C,GOMECC4_CAMPECHE_Sta93_Surface_A,GOMECC4_CAMPECHE_Sta93_Surface_B,GOMECC4_CAMPECHE_Sta93_Surface_C,GOMECC4_CANCUN_Sta117_DCM_A,GOMECC4_CANCUN_Sta117_DCM_B,GOMECC4_CANCUN_Sta117_DCM_C,GOMECC4_CANCUN_Sta117_Deep_A,GOMECC4_CANCUN_Sta117_Deep_B,GOMECC4_CANCUN_Sta117_Deep_C,GOMECC4_CANCUN_Sta117_Surface_A,GOMECC4_CANCUN_Sta117_Surface_B,GOMECC4_CANCUN_Sta117_Surface_C,GOMECC4_CANCUN_Sta118_DCM_A,GOMECC4_CANCUN_Sta118_DCM_B,GOMECC4_CANCUN_Sta118_DCM_C,GOMECC4_CANCUN_Sta118_Deep_A,GOMECC4_CANCUN_Sta118_Deep_B,GOMECC4_CANCUN_Sta118_Deep_C,GOMECC4_CANCUN_Sta118_Surface_A,GOMECC4_CANCUN_Sta118_Surface_B,GOMECC4_CANCUN_Sta118_Surface_C,GOMECC4_CAPECORAL_Sta131_DCM_A,GOMECC4_CAPECORAL_Sta131_DCM_B,GOMECC4_CAPECORAL_Sta131_DCM_C,GOMECC4_CAPECORAL_Sta131_Deep_A,GOMECC4_CAPECORAL_Sta131_Deep_B,GOMECC4_CAPECORAL_Sta131_Deep_C,GOMECC4_CAPECORAL_Sta131_Surface_A,GOMECC4_CAPECORAL_Sta131_Surface_B,GOMECC4_CAPECORAL_Sta131_Surface_C,GOMECC4_CAPECORAL_Sta132_DCM_A,GOMECC4_CAPECORAL_Sta132_DCM_B,GOMECC4_CAPECORAL_Sta132_DCM_C,GOMECC4_CAPECORAL_Sta132_Deep_A,GOMECC4_CAPECORAL_Sta132_Deep_B,GOMECC4_CAPECORAL_Sta132_Deep_C,GOMECC4_CAPECORAL_Sta132_Surface_A,GOMECC4_CAPECORAL_Sta132_Surface_B,GOMECC4_CAPECORAL_Sta132_Surface_C,GOMECC4_CAPECORAL_Sta135_DCM_A,GOMECC4_CAPECORAL_Sta135_DCM_B,GOMECC4_CAPECORAL_Sta135_DCM_C,GOMECC4_CAPECORAL_Sta135_Deep_A,GOMECC4_CAPECORAL_Sta135_Deep_B,GOMECC4_CAPECORAL_Sta135_Deep_C,GOMECC4_CAPECORAL_Sta135_Surface_A,GOMECC4_CAPECORAL_Sta135_Surface_B,GOMECC4_CAPECORAL_Sta135_Surface_C,GOMECC4_CAPECORAL_Sta140_DCM_A,GOMECC4_CAPECORAL_Sta140_DCM_B,GOMECC4_CAPECORAL_Sta140_DCM_C,GOMECC4_CAPECORAL_Sta140_Deep_A,GOMECC4_CAPECORAL_Sta140_Deep_B,GOMECC4_CAPECORAL_Sta140_Deep_C,GOMECC4_CAPECORAL_Sta140_Surface_A,GOMECC4_CAPECORAL_Sta
140_Surface_B,GOMECC4_CAPECORAL_Sta140_Surface_C,GOMECC4_CAPECORAL_Sta141_DCM_A,GOMECC4_CAPECORAL_Sta141_DCM_B,GOMECC4_CAPECORAL_Sta141_DCM_C,GOMECC4_CAPECORAL_Sta141_Deep_A,GOMECC4_CAPECORAL_Sta141_Deep_B,GOMECC4_CAPECORAL_Sta141_Deep_C,GOMECC4_CAPECORAL_Sta141_Surface_A,GOMECC4_CAPECORAL_Sta141_Surface_B,GOMECC4_CAPECORAL_Sta141_Surface_C,GOMECC4_CATOCHE_Sta107_DCM_A,GOMECC4_CATOCHE_Sta107_DCM_B,GOMECC4_CATOCHE_Sta107_DCM_C,GOMECC4_CATOCHE_Sta107_Deep_A,GOMECC4_CATOCHE_Sta107_Deep_B,GOMECC4_CATOCHE_Sta107_Deep_C,GOMECC4_CATOCHE_Sta107_Surface_A,GOMECC4_CATOCHE_Sta107_Surface_B,GOMECC4_CATOCHE_Sta107_Surface_C,GOMECC4_CATOCHE_Sta109_Deep_C,GOMECC4_CATOCHE_Sta109_DCM_B,GOMECC4_CATOCHE_Sta109_DCM_C,GOMECC4_CATOCHE_Sta109_Deep_A,GOMECC4_CATOCHE_Sta109_Deep_B,GOMECC4_CATOCHE_Sta109_DCM_A,GOMECC4_CATOCHE_Sta109_Surface_A,GOMECC4_CATOCHE_Sta109_Surface_B,GOMECC4_CATOCHE_Sta109_Surface_C,GOMECC4_CATOCHE_Sta115_DCM_A,GOMECC4_CATOCHE_Sta115_DCM_B,GOMECC4_CATOCHE_Sta115_DCM_C,GOMECC4_CATOCHE_Sta115_Deep_A,GOMECC4_CATOCHE_Sta115_Deep_B,GOMECC4_CATOCHE_Sta115_Deep_C,GOMECC4_CATOCHE_Sta115_Surface_A,GOMECC4_CATOCHE_Sta115_Surface_B,GOMECC4_CATOCHE_Sta115_Surface_C,GOMECC4_FLSTRAITS_Sta121_DCM_A,GOMECC4_FLSTRAITS_Sta121_DCM_B,GOMECC4_FLSTRAITS_Sta121_DCM_C,GOMECC4_FLSTRAITS_Sta121_Deep_A,GOMECC4_FLSTRAITS_Sta121_Deep_B,GOMECC4_FLSTRAITS_Sta121_Deep_C,GOMECC4_FLSTRAITS_Sta121_Surface_A,GOMECC4_FLSTRAITS_Sta121_Surface_B,GOMECC4_FLSTRAITS_Sta121_Surface_C,GOMECC4_FLSTRAITS_Sta122_DCM_A,GOMECC4_FLSTRAITS_Sta122_DCM_B,GOMECC4_FLSTRAITS_Sta122_DCM_C,GOMECC4_FLSTRAITS_Sta122_Deep_A,GOMECC4_FLSTRAITS_Sta122_Deep_B,GOMECC4_FLSTRAITS_Sta122_Deep_C,GOMECC4_FLSTRAITS_Sta122_Surface_A,GOMECC4_FLSTRAITS_Sta122_Surface_B,GOMECC4_FLSTRAITS_Sta122_Surface_C,GOMECC4_FLSTRAITS_Sta123_DCM_A,GOMECC4_FLSTRAITS_Sta123_DCM_B,GOMECC4_FLSTRAITS_Sta123_DCM_C,GOMECC4_FLSTRAITS_Sta123_Deep_A,GOMECC4_FLSTRAITS_Sta123_Deep_B,GOMECC4_FLSTRAITS_Sta123_Deep_C,GOMECC4_FLSTRAITS_Sta123_Surface_A,GOMECC4_FLSTRAITS
_Sta123_Surface_B,GOMECC4_FLSTRAITS_Sta123_Surface_C,GOMECC4_GALVESTON_Sta49_DCM_A,GOMECC4_GALVESTON_Sta49_DCM_B,GOMECC4_GALVESTON_Sta49_DCM_C,GOMECC4_GALVESTON_Sta49_Deep_A,GOMECC4_GALVESTON_Sta49_Deep_B,GOMECC4_GALVESTON_Sta49_Deep_C,GOMECC4_GALVESTON_Sta49_Surface_A,GOMECC4_GALVESTON_Sta49_Surface_B,GOMECC4_GALVESTON_Sta49_Surface_C,GOMECC4_GALVESTON_Sta50_DCM_A,GOMECC4_GALVESTON_Sta50_DCM_B,GOMECC4_GALVESTON_Sta50_DCM_C,GOMECC4_GALVESTON_Sta50_Deep_A,GOMECC4_GALVESTON_Sta50_Deep_B,GOMECC4_GALVESTON_Sta50_Deep_C,GOMECC4_GALVESTON_Sta50_Surface_A,GOMECC4_GALVESTON_Sta50_Surface_B,GOMECC4_GALVESTON_Sta50_Surface_C,GOMECC4_GALVESTON_Sta54_DCM_A,GOMECC4_GALVESTON_Sta54_DCM_B,GOMECC4_GALVESTON_Sta54_DCM_C,GOMECC4_GALVESTON_Sta54_Deep_A,GOMECC4_GALVESTON_Sta54_Deep_B,GOMECC4_GALVESTON_Sta54_Deep_C,GOMECC4_GALVESTON_Sta54_Surface_A,GOMECC4_GALVESTON_Sta54_Surface_B,GOMECC4_GALVESTON_Sta54_Surface_C,GOMECC4_GALVESTON_Sta59_DCM_A,GOMECC4_GALVESTON_Sta59_DCM_B,GOMECC4_GALVESTON_Sta59_DCM_C,GOMECC4_GALVESTON_Sta59_Deep_A,GOMECC4_GALVESTON_Sta59_Deep_B,GOMECC4_GALVESTON_Sta59_Deep_C,GOMECC4_GALVESTON_Sta59_Surface_A,GOMECC4_GALVESTON_Sta59_Surface_B,GOMECC4_GALVESTON_Sta59_Surface_C,GOMECC4_LA_Sta38_DCM_A,GOMECC4_LA_Sta38_DCM_B,GOMECC4_LA_Sta38_DCM_C,GOMECC4_LA_Sta38_Deep_A,GOMECC4_LA_Sta38_Deep_B,GOMECC4_LA_Sta38_Deep_C,GOMECC4_LA_Sta38_Surface_A,GOMECC4_LA_Sta38_Surface_B,GOMECC4_LA_Sta38_Surface_C,GOMECC4_LA_Sta39_DCM_A,GOMECC4_LA_Sta39_DCM_B,GOMECC4_LA_Sta39_DCM_C,GOMECC4_LA_Sta39_Deep_A,GOMECC4_LA_Sta39_Deep_B,GOMECC4_LA_Sta39_Deep_C,GOMECC4_LA_Sta39_Surface_A,GOMECC4_LA_Sta39_Surface_B,GOMECC4_LA_Sta39_Surface_C,GOMECC4_LA_Sta40_DCM_A,GOMECC4_LA_Sta40_DCM_B,GOMECC4_LA_Sta40_DCM_C,GOMECC4_LA_Sta40_Deep_A,GOMECC4_LA_Sta40_Deep_B,GOMECC4_LA_Sta40_Deep_C,GOMECC4_LA_Sta40_Surface_A,GOMECC4_LA_Sta40_Surface_B,GOMECC4_LA_Sta40_Surface_C,GOMECC4_LA_Sta45_DCM_A,GOMECC4_LA_Sta45_DCM_B,GOMECC4_LA_Sta45_DCM_C,GOMECC4_LA_Sta45_Deep_A,GOMECC4_LA_Sta45_Deep_B,GOMECC4_LA_Sta45_Deep_C,
GOMECC4_LA_Sta45_Surface_A,GOMECC4_LA_Sta45_Surface_B,GOMECC4_LA_Sta45_Surface_C,GOMECC4_MERIDA_Sta94_DCM_A,GOMECC4_MERIDA_Sta94_DCM_B,GOMECC4_MERIDA_Sta94_DCM_C,GOMECC4_MERIDA_Sta94_Deep_A,GOMECC4_MERIDA_Sta94_Deep_B,GOMECC4_MERIDA_Sta94_Deep_C,GOMECC4_MERIDA_Sta94_Surface_A,GOMECC4_MERIDA_Sta94_Surface_B,GOMECC4_MERIDA_Sta94_Surface_C,GOMECC4_MERIDA_Sta97_DCM_A,GOMECC4_MERIDA_Sta97_DCM_B,GOMECC4_MERIDA_Sta97_DCM_C,GOMECC4_MERIDA_Sta97_Deep_A,GOMECC4_MERIDA_Sta97_Deep_B,GOMECC4_MERIDA_Sta97_Deep_C,GOMECC4_MERIDA_Sta97_Surface_A,GOMECC4_MERIDA_Sta97_Surface_B,GOMECC4_MERIDA_Sta97_Surface_C,GOMECC4_MERIDA_Sta98_DCM_A,GOMECC4_MERIDA_Sta98_DCM_B,GOMECC4_MERIDA_Sta98_DCM_C,GOMECC4_MERIDA_Sta98_Deep_A,GOMECC4_MERIDA_Sta98_Deep_B,GOMECC4_MERIDA_Sta98_Deep_C,GOMECC4_MERIDA_Sta98_Surface_A,GOMECC4_MERIDA_Sta98_Surface_B,GOMECC4_MERIDA_Sta98_Surface_C,GOMECC4_PAISNP_Sta61_DCM_A,GOMECC4_PAISNP_Sta61_DCM_B,GOMECC4_PAISNP_Sta61_DCM_C,GOMECC4_PAISNP_Sta61_Deep_A,GOMECC4_PAISNP_Sta61_Deep_B,GOMECC4_PAISNP_Sta61_Deep_C,GOMECC4_PAISNP_Sta61_Surface_A,GOMECC4_PAISNP_Sta61_Surface_B,GOMECC4_PAISNP_Sta61_Surface_C,GOMECC4_PANAMACITY_Sta19_DCM_A,GOMECC4_PANAMACITY_Sta19_DCM_B,GOMECC4_PANAMACITY_Sta19_DCM_C,GOMECC4_PANAMACITY_Sta19_Deep_A,GOMECC4_PANAMACITY_Sta19_Deep_B,GOMECC4_PANAMACITY_Sta19_Deep_C,GOMECC4_PANAMACITY_Sta19_Surface_A,GOMECC4_PANAMACITY_Sta19_Surface_B,GOMECC4_PANAMACITY_Sta19_Surface_C,GOMECC4_PANAMACITY_Sta21_DCM_A,GOMECC4_PANAMACITY_Sta21_DCM_B,GOMECC4_PANAMACITY_Sta21_DCM_C,GOMECC4_PANAMACITY_Sta21_Deep_A,GOMECC4_PANAMACITY_Sta21_Deep_B,GOMECC4_PANAMACITY_Sta21_Deep_C,GOMECC4_PANAMACITY_Sta21_Surface_A,GOMECC4_PANAMACITY_Sta21_Surface_B,GOMECC4_PANAMACITY_Sta21_Surface_C,GOMECC4_PANAMACITY_Sta23_DCM_A,GOMECC4_PANAMACITY_Sta23_DCM_B,GOMECC4_PANAMACITY_Sta23_DCM_C,GOMECC4_PANAMACITY_Sta23_Deep_A,GOMECC4_PANAMACITY_Sta23_Deep_B,GOMECC4_PANAMACITY_Sta23_Deep_C,GOMECC4_PANAMACITY_Sta23_Surface_A,GOMECC4_PANAMACITY_Sta23_Surface_B,GOMECC4_PANAMACITY_Sta23_Surface_C,GOMEC
C4_PANAMACITY_Sta28_DCM_A,GOMECC4_PANAMACITY_Sta28_DCM_B,GOMECC4_PANAMACITY_Sta28_DCM_C,GOMECC4_PANAMACITY_Sta28_Deep_A,GOMECC4_PANAMACITY_Sta28_Deep_B,GOMECC4_PANAMACITY_Sta28_Deep_C,GOMECC4_PANAMACITY_Sta28_Surface_A,GOMECC4_PANAMACITY_Sta28_Surface_B,GOMECC4_PANAMACITY_Sta28_Surface_C,GOMECC4_PENSACOLA_Sta31_DCM_A,GOMECC4_PENSACOLA_Sta31_DCM_B,GOMECC4_PENSACOLA_Sta31_DCM_C,GOMECC4_PENSACOLA_Sta31_Deep_A,GOMECC4_PENSACOLA_Sta31_Deep_B,GOMECC4_PENSACOLA_Sta31_Deep_C,GOMECC4_PENSACOLA_Sta31_Surface_A,GOMECC4_PENSACOLA_Sta31_Surface_B,GOMECC4_PENSACOLA_Sta31_Surface_C,GOMECC4_PENSACOLA_Sta33_DCM_A,GOMECC4_PENSACOLA_Sta33_DCM_B,GOMECC4_PENSACOLA_Sta33_DCM_C,GOMECC4_PENSACOLA_Sta33_Deep_A,GOMECC4_PENSACOLA_Sta33_Deep_B,GOMECC4_PENSACOLA_Sta33_Deep_C,GOMECC4_PENSACOLA_Sta33_Surface_A,GOMECC4_PENSACOLA_Sta33_Surface_B,GOMECC4_PENSACOLA_Sta33_Surface_C,GOMECC4_PENSACOLA_Sta35_DCM_A,GOMECC4_PENSACOLA_Sta35_DCM_B,GOMECC4_PENSACOLA_Sta35_DCM_C,GOMECC4_PENSACOLA_Sta35_Deep_A,GOMECC4_PENSACOLA_Sta35_Deep_B,GOMECC4_PENSACOLA_Sta35_Deep_C,GOMECC4_PENSACOLA_Sta35_Surface_A,GOMECC4_PENSACOLA_Sta35_Surface_B,GOMECC4_PENSACOLA_Sta35_Surface_C,GOMECC4_TAMPA_Sta10_DCM_A,GOMECC4_TAMPA_Sta10_DCM_B,GOMECC4_TAMPA_Sta10_DCM_C,GOMECC4_TAMPA_Sta10_Deep_A,GOMECC4_TAMPA_Sta10_Deep_B,GOMECC4_TAMPA_Sta10_Deep_C,GOMECC4_TAMPA_Sta10_Surface_A,GOMECC4_TAMPA_Sta10_Surface_B,GOMECC4_TAMPA_Sta10_Surface_C,GOMECC4_TAMPA_Sta13_DCM_A,GOMECC4_TAMPA_Sta13_DCM_B,GOMECC4_TAMPA_Sta13_DCM_C,GOMECC4_TAMPA_Sta13_Deep_A,GOMECC4_TAMPA_Sta13_Deep_B,GOMECC4_TAMPA_Sta13_Deep_C,GOMECC4_TAMPA_Sta13_Surface_A,GOMECC4_TAMPA_Sta13_Surface_B,GOMECC4_TAMPA_Sta13_Surface_C,GOMECC4_TAMPA_Sta18_Deep_A,GOMECC4_TAMPA_Sta18_Deep_B,GOMECC4_TAMPA_Sta18_Deep_C,GOMECC4_TAMPA_Sta18_Surface_A,GOMECC4_TAMPA_Sta18_Surface_B,GOMECC4_TAMPA_Sta18_Surface_C,GOMECC4_TAMPA_Sta9_Sediment_A,GOMECC4_TAMPA_Sta9_Sediment_B,GOMECC4_TAMPA_Sta9_Sediment_C,GOMECC4_TAMPICO_Sta76_DCM_A,GOMECC4_TAMPICO_Sta76_DCM_B,GOMECC4_TAMPICO_Sta76_DCM_C,GOMECC4_TAMPI
CO_Sta76_Deep_A,GOMECC4_TAMPICO_Sta76_Deep_B,GOMECC4_TAMPICO_Sta76_Deep_C,GOMECC4_TAMPICO_Sta76_Surface_A,GOMECC4_TAMPICO_Sta76_Surface_B,GOMECC4_TAMPICO_Sta76_Surface_C,GOMECC4_TAMPICO_Sta79_DCM_A,GOMECC4_TAMPICO_Sta79_DCM_B,GOMECC4_TAMPICO_Sta79_DCM_C,GOMECC4_TAMPICO_Sta79_Deep_A,GOMECC4_TAMPICO_Sta79_Deep_B,GOMECC4_TAMPICO_Sta79_Deep_C,GOMECC4_TAMPICO_Sta79_Surface_A,GOMECC4_TAMPICO_Sta79_Surface_B,GOMECC4_TAMPICO_Sta79_Surface_C,GOMECC4_TAMPICO_Sta82_DCM_A,GOMECC4_TAMPICO_Sta82_DCM_B,GOMECC4_TAMPICO_Sta82_DCM_C,GOMECC4_TAMPICO_Sta82_Deep_A,GOMECC4_TAMPICO_Sta82_Deep_B,GOMECC4_TAMPICO_Sta82_Deep_C,GOMECC4_TAMPICO_Sta82_Surface_A,GOMECC4_TAMPICO_Sta82_Surface_B,GOMECC4_TAMPICO_Sta82_Surface_C,GOMECC4_VERACRUZ_Sta85_DCM_A,GOMECC4_VERACRUZ_Sta85_DCM_B,GOMECC4_VERACRUZ_Sta85_DCM_C,GOMECC4_VERACRUZ_Sta85_Deep_A,GOMECC4_VERACRUZ_Sta85_Deep_B,GOMECC4_VERACRUZ_Sta85_Deep_C,GOMECC4_VERACRUZ_Sta85_Surface_A,GOMECC4_VERACRUZ_Sta85_Surface_B,GOMECC4_VERACRUZ_Sta87_DCM_A,GOMECC4_VERACRUZ_Sta87_DCM_B,GOMECC4_VERACRUZ_Sta87_DCM_C,GOMECC4_VERACRUZ_Sta87_Deep_A,GOMECC4_VERACRUZ_Sta87_Deep_B,GOMECC4_VERACRUZ_Sta87_Deep_C,GOMECC4_VERACRUZ_Sta87_Surface_A,GOMECC4_VERACRUZ_Sta87_Surface_B,GOMECC4_VERACRUZ_Sta87_Surface_C,GOMECC4_VERACRUZ_Sta89_DCM_A,GOMECC4_VERACRUZ_Sta89_DCM_B,GOMECC4_VERACRUZ_Sta89_DCM_C,GOMECC4_VERACRUZ_Sta89_Deep_A,GOMECC4_VERACRUZ_Sta89_Deep_B,GOMECC4_VERACRUZ_Sta89_Deep_C,GOMECC4_VERACRUZ_Sta89_Surface_A,GOMECC4_VERACRUZ_Sta89_Surface_B,GOMECC4_VERACRUZ_Sta89_Surface_C,GOMECC4_YUCATAN_Sta100_DCM_A,GOMECC4_YUCATAN_Sta100_DCM_B,GOMECC4_YUCATAN_Sta100_DCM_C,GOMECC4_YUCATAN_Sta100_Deep_A,GOMECC4_YUCATAN_Sta100_Deep_B,GOMECC4_YUCATAN_Sta100_Deep_C,GOMECC4_YUCATAN_Sta100_Surface_A,GOMECC4_YUCATAN_Sta100_Surface_B,GOMECC4_YUCATAN_Sta100_Surface_C,GOMECC4_YUCATAN_Sta102_DCM_A,GOMECC4_YUCATAN_Sta102_DCM_B,GOMECC4_YUCATAN_Sta102_DCM_C,GOMECC4_YUCATAN_Sta102_Deep_A,GOMECC4_YUCATAN_Sta102_Deep_B,GOMECC4_YUCATAN_Sta102_Deep_C,GOMECC4_YUCATAN_Sta102_Surface_A,GOMECC4_YUCATAN
_Sta102_Surface_B,GOMECC4_YUCATAN_Sta102_Surface_C,GOMECC4_YUCATAN_Sta106_DCM_A,GOMECC4_YUCATAN_Sta106_DCM_B,GOMECC4_YUCATAN_Sta106_DCM_C,GOMECC4_YUCATAN_Sta106_Deep_A,GOMECC4_YUCATAN_Sta106_Deep_B,GOMECC4_YUCATAN_Sta106_Deep_C,GOMECC4_YUCATAN_Sta106_Surface_A,GOMECC4_YUCATAN_Sta106_Surface_B,GOMECC4_YUCATAN_Sta106_Surface_C
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,1516,0,0,6,0,0,0,4257,2005,0,14,0,0,0,9527,1934,2037,0,0,11,0,0,1696,645,0,0,0,0,0,0,0,4479,0,0,0,0,149,0,11573,7644,3480,0,0,0,0,0,0,0,0,0,40,0,0,0,0,0,0,0,0,19,3617,15,0,0,0,0,0,0,0,10276,7354,0,12,0,0,0,0,0,0,0,0,0,0,0,4055,7992,20,0,90,23,38,14,7300,13,6136,0,0,21,89,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1061,0,0,0,0,0,0,0,1579,2340,0,0,0,6956,30,0,0,0,0,2531,0,0,0,0,21,150,0,0,0,308,0,0,25,0,12,0,0,7,0,0,999,0,989,0,0,0,1524,0,10,0,0,0,0,0,0,10,0,0,11,13,0,15,0,8001,0,2820,4403,0,62,0,0,1067,0,0,0,0,0,0,0,0,0,1148,0,618,0,0,617,0,0,0,4062,0,0,0,0,0,7,0,8804,0,0,0,0,0,0,101,0,0,0,68,0,0,19,0,0,0,8245,0,14,0,26,0,14,0,0,0,0,2776,0,1485,1471,0,0,0,5763,8484,14,0,0,0,0,128,41,2342,0,0,0,0,0,0,0,0,0,49,0,0,0,0,0,15,4341,6942,3170,14,0,0,0,0,18,36,51,0,40,0,0,78,0,0,0,0,0,729,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,1145,0,0,0,0,0,0,0,0,61,0,0,0,928,0,0,0,0,0,140,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1873,0,0,0,0,6,0,1847,1481,2783,0,0,0,0,7,22,0,0,0,0,0,0,0,0,0,10052,9663,11791,0,0,0,0,0,0,1418,0,0,0,963,0,0,0,0,3440,1978,0,7,0,1470,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3576,0,804,0,0,0,353,0,584,2310,0,13111,0,0,0,0,12,0,0,0,333,0,0,0,0,0,32,0,0,571,0,0,1155,4719,194,1451,0,1268,0,61,53,2294,0,2744,0,0,0,111,0,0,30,0,8480
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,962,316,548,19,10,0,0,0,613,561,434,0,397,297,76,916,1444,140,0,0,174,186,249,138,117,842,322,20,0,85,370,58,0,0,7,0,77,0,95,326,2506,10,918,0,380,0,468,15,0,0,4536,3888,0,655,171,0,0,0,8,539,915,590,1828,203,164,361,148,483,1851,1936,0,223,3289,1215,3992,1958,1898,56,400,786,116,6,42,0,0,0,741,1888,740,88,158,96,1020,305,288,371,8,0,702,0,47,42,0,380,1461,1283,893,2339,1854,1000,237,99,657,17,0,163,135,550,2086,3315,1302,2470,30,129,795,0,13,2966,1438,0,663,273,0,19,0,0,0,2731,2071,2386,9,18,0,198,40,0,238,0,13,59,1312,60,0,0,0,0,0,0,0,132,0,65,266,341,0,82,45,0,0,0,0,80,0,396,0,10,96,293,50,214,399,0,1880,0,679,1002,1082,28,19,69,0,87,0,526,655,412,0,506,827,769,192,1977,2583,2521,736,2091,2944,111,1243,33,244,20,0,0,0,0,4,7,1511,4301,1672,0,12,0,0,13,0,0,18,31,1019,1736,398,0,81,37,250,1199,299,192,322,327,38,0,23,557,29,1085,1732,3539,1424,0,0,18,163,512,2093,37,18,47,5,126,0,19,0,14,888,20,540,885,192,527,496,136,327,1093,190,0,0,436,10,1367,1708,0,6272,4382,82,210,36,268,3249,3755,524,125,6,494,16,120,9,42,87,17,0,0,0,0,0,0,8,12,0,526,1593,246,2499,1207,1160,553,634,2675,2379,1728,463,0,7,0,1790,832,1860,64,52,1133,9,0,0,55,808,108,612,595,826,1128,3659,0,0,0,30,992,1878,487,0,0,0,1534,3186,1040,267,221,1512,0,534,9,134,268,0,274,0,0,4,0,0,181,5,0,0,0,1096,617,196,180,0,464,0,0,0,0,5,0,52,62,0,0,1448,335,1997,2398,734,2585,821,120,1047,1096,694,674,112,64,74,2850,2926,2455,0,103,89,75,0,0,1841,3754,2194,0,131,0,0,0,580,3094,544,577,0,820,0,0,0,0,2826,447,1695,632,1133,1999,1321,417,378,0,150,17,1385,1253,2696,2413,2758,385,410,2987,2746,1562,731,2951,1359,511,2022,0,1412,777,65,287,0,0,0,1530,0,954
2,5d4df37251121c08397c6fbc27b06175,GCTACTACCGATTGAGTGTTTTAGTGAGGTCCTCGGATTGCTTTCCTGGCGGTTAACGCTGCCTAGTTGGCGAAAAGACGACCAAACTGTAGCACTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Sinocalanus;Sinocalanus_sinensis;,0.9923,0,4,0,12,5,0,0,0,9,0,0,0,0,0,0,300,0,864,408,286,395,324,1893,1700,5115,0,0,1553,251,0,0,384,67,0,0,0,43,3,0,0,0,0,7,0,0,0,3,15,0,0,0,0,12,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,9,3671,972,215,634,3144,0,0,0,32,0,0,0,0,0,0,0,59,0,135,40,0,1214,0,1031,0,19,0,3492,856,681,455,602,3420,0,566,2563,4019,0,2730,2120,2584,0,0,0,0,0,0,0,0,0,0,0,0,11,0,0,0,1347,0,0,0,0,0,0,0,0,0,0,0,0,0,187,0,0,0,0,0,0,0,0,25,0,0,0,0,0,413,0,217,566,1107,141,0,0,0,0,0,0,0,0,27,266,0,7,0,0,0,56,0,9,0,0,0,69,470,0,0,0,0,353,9,1704,392,2487,0,0,0,0,0,5,0,85,48,0,0,0,0,0,0,0,0,0,0,0,8,0,1403,1930,3023,4658,1322,3726,3049,1159,2491,524,557,293,0,356,18,18,0,687,315,617,1506,0,0,16,324,0,0,115,132,9,0,0,0,0,304,0,0,0,0,774,331,1803,21,43,0,2013,142,1162,2987,6178,2887,2617,3531,1536,1468,2136,1079,0,0,0,0,0,0,3297,319,0,9,26,0,1523,754,888,48,1630,0,0,1773,11,418,107,295,511,354,104,1393,3355,3011,0,56,419,5994,5230,4584,0,0,682,0,0,0,2169,6181,1781,0,0,8,0,1596,0,342,1591,278,0,0,4,0,0,5,969,3922,1257,524,428,399,0,0,0,355,674,1104,23,0,0,0,955,0,0,0,0,0,0,0,0,477,0,1335,0,1315,0,12,0,0,0,0,1848,765,0,0,0,0,0,0,0,0,0,0,3010,421,269,387,0,1240,0,0,0,0,2364,2408,0,4214,1233,1296,1271,1025,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2391,0,0,0,0,0,0,15,18,0,0,0,0,0,45,0,0,842,0,0,0,124,989,446,825,4441,3916,4151,1088,203,2121,2015,0,356,0,0,67,0,0,0,101,8,20,0,0,0,0,17,0,0,0,0,0,0,0
3,f863f671a575c6ab587e8de0190d3335,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCCAGGCGGGTCGCCCTGCCTGGTCTACGGGAAGACGACCAAACTGTAGTGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Paracalanus;Paracalanus_parvus;,0.998393,0,0,0,0,0,0,0,0,0,0,5,0,0,0,92,2668,2423,2604,1920,2277,0,0,466,3362,3622,1765,1249,372,502,3168,1682,523,804,441,2007,0,186,0,1472,0,269,0,0,0,0,0,0,0,0,0,1298,3681,3210,0,0,0,0,0,0,0,44,0,0,0,0,0,4,0,0,33,1342,1839,58,366,1409,482,786,0,0,0,0,0,0,0,1655,0,0,0,0,0,0,140,0,472,0,2718,436,282,0,564,1386,751,831,1472,393,752,738,678,138,0,0,592,254,0,0,0,0,0,0,0,0,0,0,0,0,31,60,0,0,328,0,0,0,0,0,0,0,528,0,0,0,0,0,406,0,5,0,0,0,21,0,0,0,0,7,0,0,0,1050,499,1427,1143,1051,1381,1635,2096,1389,0,0,0,49,0,48,0,0,71,0,0,0,81,0,0,0,0,0,2189,166,2287,14,0,0,1027,0,0,450,1240,935,0,0,243,3436,392,671,142,338,0,0,0,0,1761,273,1444,0,0,0,0,0,0,0,933,355,1285,1379,1959,1337,410,1778,1360,931,1352,323,62,165,16,6,444,71,0,1337,0,0,0,948,6,323,1567,574,0,0,2475,0,0,836,0,0,0,0,498,214,53,273,0,0,117,93,106,5592,477,3707,2674,3967,2291,958,2013,1677,0,13,0,0,0,669,0,0,0,0,477,0,3185,49,1620,943,795,54,2328,821,2364,1353,1305,545,1416,1628,4270,162,114,187,392,42,623,1672,1014,1925,0,0,0,0,0,0,648,2448,1580,0,0,0,65,0,8,428,372,815,0,0,0,0,0,0,991,428,799,0,221,0,0,0,6,1592,865,1066,0,0,0,0,17,30,365,189,0,0,0,0,0,0,0,491,1558,825,0,0,0,0,0,0,3862,3369,4562,0,0,0,8,0,0,800,0,0,1878,1095,1284,1718,1199,1236,0,0,0,603,905,0,978,453,180,0,165,759,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,83,0,0,0,0,0,0,0,85,0,0,0,0,0,0,0,0,0,11,0,0,0,0,0,0,0,776,1190,846,1627,281,949,1377,711,677,896,1505,135,0,0,0,112,0,0,0,0,0,0,0,0,47,0,75,0,0,0
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,1164,2272,2208,2,0,0,0,0,0,0,0,0,2,0,1103,157,0,104,938,305,0,53,0,275,654,632,1277,1960,1344,0,0,0,1132,376,1174,0,20,343,181,74,44,0,0,0,0,0,0,0,0,0,588,7,0,518,0,0,1,0,3,22,190,115,0,0,0,0,0,0,338,0,0,234,964,121,129,111,0,196,0,422,18,5,0,0,792,6,69,0,1217,20,0,64,0,1233,0,1081,540,93,0,2005,248,1599,704,341,144,133,203,13,42,500,1318,452,477,820,0,0,0,0,0,118,61,265,480,0,0,45,0,0,0,34,94,3473,0,0,7,0,0,1615,119,0,0,0,0,0,0,0,850,0,94,0,2,382,32,11,204,529,0,334,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,796,0,0,174,216,328,70,0,7,1931,0,0,826,855,348,12,0,0,0,7,157,1383,825,1942,0,0,0,287,1768,5452,45,46,0,0,0,0,544,687,586,0,0,0,0,12,0,73,0,0,0,77,0,0,42,0,14,12,40,4651,3589,1438,0,0,79,88,0,557,1758,2206,72,957,4,46,546,184,2611,833,77,0,0,9,10,437,334,1678,64,53,0,24,35,0,27,13,0,1558,1700,1669,1849,1217,2387,53,256,0,0,0,0,0,94,0,0,0,0,51,560,0,1030,214,1605,202,40,0,1854,1139,2005,107,459,70,1632,700,347,0,0,0,0,0,0,0,0,0,38,0,0,2,0,19,0,95,77,0,374,0,0,0,0,591,852,457,321,647,189,0,9,5,0,413,161,642,966,850,0,0,0,280,345,20,0,0,61,5,0,0,1055,881,103,0,0,49,3,193,0,95,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,520,0,0,0,0,0,0,0,0,4,0,0,1922,1218,2227,759,886,819,440,300,486,220,0,0,0,0,0,791,2453,24,0,0,0,0,6,2,1520,175,1392,0,0,0,3,0,121,253,1240,0,244,1423,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1439,744,0,5,0,130,100,529,206,0,1008,70,0,0,739,1743,526,0,1953,0,0,0,0,2107,3962,1627


##### format taxonomy

How to automate this? Everyone's taxonomy might be different?
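
One possible way to generalize this step is sketched below. It assumes every taxonomy string uses a single delimiter and lists its ranks in a fixed order; `split_taxonomy` and its parameters are hypothetical names, not part of this notebook or of any library.

```python
import pandas as pd

def split_taxonomy(df, ranks, sep=";", column="taxonomy"):
    """Hypothetical helper: split a delimited taxonomy string into one column per rank."""
    rows = []
    for tax in df[column]:
        if pd.isna(tax):
            names = []
        else:
            # drop leading/trailing separators and spaces, then split on the delimiter
            names = [n.strip() for n in str(tax).strip(sep + " ").split(sep)]
        names = names[:len(ranks)]                      # ignore ranks beyond the list
        rows.append(names + [None] * (len(ranks) - len(names)))  # pad missing ranks
    ranked = pd.DataFrame(rows, columns=ranks, index=df.index)
    return pd.concat([df, ranked], axis=1)

demo = pd.DataFrame({"taxonomy": ["Eukaryota;Obazoa;Opisthokonta;",
                                  "d__Bacteria; p__Proteobacteria"]})
result = split_taxonomy(demo, ["domain", "supergroup", "division"])
```

The rank list and delimiter still have to be supplied per reference database, so this only removes the row-by-row loop, not the need to know the taxonomy layout.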

In [48]:
#18S 
taxa_ranks_18S = ['domain','supergroup','division','subdivision','class','order','family','genus','species']

# use the same '18S rRNA' key as the rest of the notebook
# add one empty column per rank, then fill each from the semicolon-delimited taxonomy string
asv_tables['18S rRNA'][['domain','supergroup','division','subdivision','class','order','family','genus','species']] = ["","","","","","","","",""]
for index, row in asv_tables['18S rRNA'].iterrows():
    taxa = row['taxonomy'].split(";")
    for i in range(0,len(taxa)):
        if i < len(taxa_ranks_18S):
            asv_tables['18S rRNA'].loc[index,taxa_ranks_18S[i]] = taxa[i]

    

In [49]:
# replace None with NA
asv_tables['18S rRNA'] = asv_tables['18S rRNA'].fillna(value=np.nan)
## Replace 'unknown', 'unassigned', etc. in species and taxonomy columns with NaN

asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace({'unassigned':np.nan,
                            'Unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan,
                               '':np.nan})
asv_tables['18S rRNA'].head(20)

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,...,GOMECC4_YUCATAN_Sta106_Surface_C,domain,supergroup,division,subdivision,class,order,family,genus,species
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.922099,1516,0,0,6,0,0,...,8480,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus_cristatus
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.999947,962,316,548,19,10,0,...,954,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus_furcatus
2,5d4df37251121c08397c6fbc27b06175,GCTACTACCGATTGAGTGTTTTAGTGAGGTCCTCGGATTGCTTTCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.9923,0,4,0,12,5,0,...,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Sinocalanus,Sinocalanus_sinensis
3,f863f671a575c6ab587e8de0190d3335,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.998393,0,0,0,0,0,0,...,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Paracalanus,Paracalanus_parvus
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.779948,1164,2272,2208,2,0,0,...,1627,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus_sp.
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.999931,287,414,335,195,228,298,...,373,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,
6,d70494a723d85d66aa88d2d8a975aeec,GCTACTACCGATTGAATGGTTCCGTGAATTCTTGAGATCGGCGCGG...,Eukaryota;Obazoa;Opisthokonta,0.992451,0,0,0,4,0,0,...,0,Eukaryota,Obazoa,Opisthokonta,,,,,,
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.986908,250,323,194,51,59,55,...,305,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,
8,bbaaf7bb4e71c80de970677779e3bf3a,GCTACTACCGATTGAATGGTTTAGTGAGATCTTCGGATTGGCACAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Cnidaria...,0.864777,212,50,237,552,1278,480,...,67,Eukaryota,Obazoa,Opisthokonta,Metazoa,Cnidaria,Cnidaria_X,Hydrozoa,Sulculeolaria,Sulculeolaria_quadrivalvis
9,7a8324bb4448b65f7adc73d70e5901da,GCTACTACCGATTGAACGTTTTAGTGAGGTATTTGGACTGGGCCTT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.992088,0,0,0,15,0,0,...,405,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Delibus,Delibus_sp.


In [50]:
# replace _, - and / with a space; remove " sp." and " spp."
# (raw strings avoid invalid-escape warnings for the regex patterns)

asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace('_',' ',regex=True)
asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace(r' sp\.','',regex=True)
asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace(r' spp\.','',regex=True)
asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace('-',' ',regex=True)
asv_tables['18S rRNA'][taxa_ranks_18S] = asv_tables['18S rRNA'][taxa_ranks_18S].replace('/',' ',regex=True)
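
The placeholder and punctuation clean-up can also be wrapped in one reusable function, which may help when the same steps are repeated for each marker. This is a minimal sketch; `clean_rank_columns` and `PLACEHOLDERS` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# exact cell values treated as "no assignment" (same list as used in this notebook)
PLACEHOLDERS = ["unassigned", "Unassigned", "s_", "g_", "unknown", "no_hit", ""]

def clean_rank_columns(df, ranks):
    """Hypothetical helper: blank out placeholders and tidy names in the rank columns."""
    cleaned = df[ranks].replace(PLACEHOLDERS, np.nan)       # exact matches become NaN
    cleaned = cleaned.replace(r"[_\-/]", " ", regex=True)   # _, - and / become spaces
    cleaned = cleaned.replace(r" spp?\.", "", regex=True)   # drop " sp." and " spp."
    out = df.copy()
    out[ranks] = cleaned
    return out

demo = pd.DataFrame({"genus": ["Acrocalanus", "g_"],
                     "species": ["Acrocalanus_sp.", "Neocalanus_cristatus"]})
cleaned = clean_rank_columns(demo, ["genus", "species"])
```

Placeholders are removed before underscores are replaced, so values such as `g_` are still recognized as exact matches.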

In [51]:
asv_tables['18S rRNA'].shape


(24067, 485)

In [52]:
occ['18S rRNA'] = pd.melt(asv_tables['18S rRNA'],id_vars=['featureid','sequence','taxonomy','Confidence','domain','supergroup','division','subdivision','class','order','family','genus','species'],
               var_name='eventID',value_name='organismQuantity')

In [53]:
occ['18S rRNA'].shape

(11359624, 15)

In [54]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ['18S rRNA'] = occ['18S rRNA'][occ['18S rRNA']['organismQuantity'] > 0]
print(occ['18S rRNA'].shape)

(146232, 15)
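
The shapes above follow from how `pd.melt` works: every ASV row is repeated once per sample column, so 24067 ASVs across 472 sample columns (485 columns minus the 13 identifier columns) give 11359624 long-format rows before zeros are dropped. A toy example with made-up values shows the same pattern:

```python
import pandas as pd

# toy wide table: 2 ASVs x 3 samples (values are illustrative only)
wide = pd.DataFrame({"featureid": ["asv1", "asv2"],
                     "sampleA": [5, 0],
                     "sampleB": [0, 3],
                     "sampleC": [2, 0]})

# one row per ASV/sample pair
long = pd.melt(wide, id_vars=["featureid"],
               var_name="eventID", value_name="organismQuantity")

# keep only detections
present = long[long["organismQuantity"] > 0]
```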


In [55]:
occ['18S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,supergroup,division,subdivision,class,order,family,genus,species,eventID,organismQuantity
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.922099,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus cristatus,GOMECC4_27N_Sta1_DCM_A,1516
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.999947,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus furcatus,GOMECC4_27N_Sta1_DCM_A,962
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.779948,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus,GOMECC4_27N_Sta1_DCM_A,1164
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.999931,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,,GOMECC4_27N_Sta1_DCM_A,287
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.986908,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,,GOMECC4_27N_Sta1_DCM_A,250


Add occurrenceID

In [56]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['18S rRNA']['occurrenceID'] = occ['18S rRNA']['featureid']
occ['18S rRNA']['occurrenceID'] = occ['18S rRNA']['eventID'] + '_occ' + occ['18S rRNA']['occurrenceID'].astype(str)
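
Since `occurrenceID` must be unique, a quick check after building it can catch duplicated ASV/sample pairs early. The sketch below uses a small made-up table rather than the real `occ` data.

```python
import pandas as pd

# toy long-format table (feature IDs shortened for illustration)
occ_demo = pd.DataFrame({
    "eventID": ["GOMECC4_27N_Sta1_DCM_A", "GOMECC4_27N_Sta1_DCM_A", "GOMECC4_27N_Sta1_DCM_B"],
    "featureid": ["36aa75f9", "4e38e8ce", "36aa75f9"],
})

# same construction as above: sample name + '_occ' + ASV hash
occ_demo["occurrenceID"] = occ_demo["eventID"] + "_occ" + occ_demo["featureid"].astype(str)

# True when no two rows share an occurrenceID
ids_are_unique = occ_demo["occurrenceID"].is_unique
```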

In [57]:
occ['18S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,supergroup,division,subdivision,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.922099,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus cristatus,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d...
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.999947,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus furcatus,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314...
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.779948,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e...
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.999931,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1...
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.986908,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c82600...


#### 16S

##### 1st, format ASV file

##### drop unwanted samples


In [58]:
asv_tables['16S rRNA'] = asv_tables['16S rRNA'].drop(columns=samples_to_drop,errors='ignore')

In [59]:
asv_tables['16S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,...,GOMECC4_YUCATAN_Sta100_Surface_C,GOMECC4_YUCATAN_Sta102_DCM_A,GOMECC4_YUCATAN_Sta102_DCM_B,GOMECC4_YUCATAN_Sta102_DCM_C,GOMECC4_YUCATAN_Sta102_Deep_A,GOMECC4_YUCATAN_Sta102_Deep_B,GOMECC4_YUCATAN_Sta102_Deep_C,GOMECC4_YUCATAN_Sta102_Surface_A,GOMECC4_YUCATAN_Sta102_Surface_B,GOMECC4_YUCATAN_Sta102_Surface_C
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCA...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.978926,0,0,0,0,0,0,...,5,0,0,0,0,0,0,11,0,7
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.881698,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,0.762793,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCG...,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,0.997619,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.999961,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
asv_tables['16S rRNA']['taxonomy'][0]

'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio'

In [61]:
taxa_ranks_16S = ['domain','phylum','class','order','family','genus','species']


In [62]:
asv_tables['16S rRNA'][['domain','phylum','class','order','family','genus','species']] = asv_tables['16S rRNA']['taxonomy'].str.split("; ",expand=True)
asv_tables['16S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,...,GOMECC4_YUCATAN_Sta102_Surface_A,GOMECC4_YUCATAN_Sta102_Surface_B,GOMECC4_YUCATAN_Sta102_Surface_C,domain,phylum,class,order,family,genus,species
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCA...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.978926,0,0,0,0,0,0,...,11,0,7,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Vibrionales,f__Vibrionaceae,g__Vibrio,
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.881698,0,0,0,0,0,0,...,0,0,0,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__HOC36,f__HOC36,g__HOC36,s__Candidatus_Thioglobus
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,0.762793,0,0,0,0,0,0,...,0,0,0,d__Bacteria,p__Cyanobacteria,c__Cyanobacteriia,o__Synechococcales,f__Cyanobiaceae,g__Cyanobium_PCC-6307,
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCG...,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,0.997619,0,0,0,0,0,0,...,0,0,0,d__Bacteria,p__Myxococcota,c__Myxococcia,o__Myxococcales,f__Myxococcaceae,g__P3OB-42,s__uncultured_bacterium
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.999961,0,0,0,0,0,0,...,0,0,0,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__SAR86_clade,f__SAR86_clade,g__SAR86_clade,


In [63]:
asv_tables['16S rRNA']['domain'] = asv_tables['16S rRNA']['domain'].str.replace("d__", "")
asv_tables['16S rRNA']['phylum'] = asv_tables['16S rRNA']['phylum'].str.replace("p__", "")
asv_tables['16S rRNA']['class'] = asv_tables['16S rRNA']['class'].str.replace("c__", "")
asv_tables['16S rRNA']['order'] = asv_tables['16S rRNA']['order'].str.replace("o__", "")
asv_tables['16S rRNA']['family'] = asv_tables['16S rRNA']['family'].str.replace("f__", "")
asv_tables['16S rRNA']['genus'] = asv_tables['16S rRNA']['genus'].str.replace("g__", "")
asv_tables['16S rRNA']['species'] = asv_tables['16S rRNA']['species'].str.replace("s__", "")
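
The seven prefix removals can also be written as a single anchored regular expression applied to all rank columns at once. This is a sketch on a toy frame, assuming every prefix is one lowercase letter followed by two underscores.

```python
import pandas as pd

ranks_demo = ["domain", "genus", "species"]
demo = pd.DataFrame({"domain": ["d__Bacteria", "d__Bacteria"],
                     "genus": ["g__Vibrio", "g__HOC36"],
                     "species": ["s__uncultured_bacterium", None]})

# '^' anchors the match to the start, so only the leading rank prefix is removed
demo[ranks_demo] = demo[ranks_demo].replace(r"^[a-z]__", "", regex=True)
```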

In [64]:
# replace None with NA
asv_tables['16S rRNA'] = asv_tables['16S rRNA'].fillna(value=np.nan)
## Replace 'unknown', 'unassigned', etc. in species and taxonomy columns with NaN

asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace({'unassigned':np.nan,'Unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan,
                               '':np.nan})
asv_tables['16S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,...,GOMECC4_YUCATAN_Sta102_Surface_A,GOMECC4_YUCATAN_Sta102_Surface_B,GOMECC4_YUCATAN_Sta102_Surface_C,domain,phylum,class,order,family,genus,species
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCA...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.978926,0,0,0,0,0,0,...,11,0,7,Bacteria,Proteobacteria,Gammaproteobacteria,Vibrionales,Vibrionaceae,Vibrio,
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.881698,0,0,0,0,0,0,...,0,0,0,Bacteria,Proteobacteria,Gammaproteobacteria,HOC36,HOC36,HOC36,Candidatus_Thioglobus
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,0.762793,0,0,0,0,0,0,...,0,0,0,Bacteria,Cyanobacteria,Cyanobacteriia,Synechococcales,Cyanobiaceae,Cyanobium_PCC-6307,
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCG...,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,0.997619,0,0,0,0,0,0,...,0,0,0,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,P3OB-42,uncultured_bacterium
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.999961,0,0,0,0,0,0,...,0,0,0,Bacteria,Proteobacteria,Gammaproteobacteria,SAR86_clade,SAR86_clade,SAR86_clade,


In [65]:
# replace _, - and / with a space; remove " sp." and " spp."
# (raw strings avoid invalid-escape warnings for the regex patterns)

asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace('_',' ',regex=True)
asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace(r' sp\.','',regex=True)
asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace('-',' ',regex=True)
asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace(r' spp\.','',regex=True)
asv_tables['16S rRNA'][taxa_ranks_16S] = asv_tables['16S rRNA'][taxa_ranks_16S].replace('/',' ',regex=True)

##### Melt asv_tables to long format


In [66]:
asv_tables['16S rRNA'].shape


(65048, 483)

In [67]:
occ['16S rRNA'] = pd.melt(asv_tables['16S rRNA'],id_vars=['featureid','sequence','taxonomy','Confidence','domain','phylum','class','order','family','genus','species'],
               var_name='eventID',value_name='organismQuantity')

In [68]:
occ['16S rRNA'].shape

(30702656, 13)

In [69]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ['16S rRNA'] = occ['16S rRNA'][occ['16S rRNA']['organismQuantity'] > 0]
print(occ['16S rRNA'].shape)

(165158, 13)


In [70]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['16S rRNA']['occurrenceID'] = occ['16S rRNA']['featureid']
occ['16S rRNA']['occurrenceID'] = occ['16S rRNA']['eventID'] + '_16S_occ' + occ['16S rRNA']['occurrenceID'].astype(str)

In [71]:
occ['16S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,phylum,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
182,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.83219,Bacteria,Proteobacteria,Alphaproteobacteria,Parvibaculales,OCS116 clade,OCS116 clade,uncultured marine,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...
225,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.86704,Bacteria,Proteobacteria,Alphaproteobacteria,Puniceispirillales,SAR116 clade,SAR116 clade,,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...
347,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,1.0,Archaea,Thermoplasmatota,Thermoplasmata,Marine Group II,Marine Group II,Marine Group II,,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...
412,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.952911,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Mitochondria,Mitochondria,uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...
719,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,0.818195,Bacteria,Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...


##### WoRMS conversion
Note: the standard `multiprocessing` module can't be used in a Jupyter notebook; the `multiprocess` package is needed instead. See [here](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror).

The data providers for this dataset used the [NCBI taxonomy database](https://www.ncbi.nlm.nih.gov/taxonomy) as their reference database when assigning taxonomies to ASVs. **It's important to note** that this taxonomy database is not a taxonomic authority, and the taxonomic ranks for any given scientific name on WoRMS may not directly compare. There are ongoing discussions about this problem (see [this](https://github.com/iobis/Project-team-Genetic-Data/issues/5) GitHub issue). At the moment, I don't see a way to definitively ensure that a given scientific name actually has the same taxonomic ranks on both platforms without going case-by-case.

In addition, there are still names in the data that will not match on WoRMS at all, despite appearing to be Linnaean names. This is because the name may not have been fully and officially adopted by the scientific community. I therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the `scientificName` column. The following few code blocks do this - they're clunky, but they were sufficient for this data set.
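
The core of that search, stripped of the WoRMS-specific handling, can be sketched as below. `lowest_matching_rank` and `fake_worms` are hypothetical: the dictionary stands in for a real name lookup so the example runs offline, and it ignores the accepted/unaccepted and multiple-match cases handled in the actual function.

```python
import math

def lowest_matching_rank(row, ordered_ranks, lookup):
    """Return (rank, record) for the first rank whose name the lookup resolves."""
    for rank in ordered_ranks:
        name = row.get(rank)
        # skip ranks with no assigned name (None or NaN)
        if name is None or (isinstance(name, float) and math.isnan(name)):
            continue
        record = lookup(name)
        if record is not None:
            return rank, record
    return None, None

# stand-in for a name service: only the class-level name is known
fake_worms = {"Gammaproteobacteria": {"scientificname": "Gammaproteobacteria", "rank": "Class"}}

row = {"genus": "HOC36", "family": "HOC36", "order": "HOC36",
       "class": "Gammaproteobacteria", "phylum": "Proteobacteria", "domain": "Bacteria"}

rank, record = lowest_matching_rank(
    row, ["genus", "family", "order", "class", "phylum", "domain"], fake_worms.get)
```

Ranks are passed from lowest to highest, so the first successful lookup is the most specific name available.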

In [72]:
import multiprocess as mp
import pyworms

In [73]:
def get_worms_from_scientific_name(tax_df, ordered_rank_columns, queue,full_tax_column="taxonomy", like=False, marine_only=False,full_tax_vI = False):
    matches = []
    w_ranks = ['kingdom','phylum','class','order','family','genus']
    for index, row in tax_df.iterrows():
        full_tax = row[full_tax_column]
        if full_tax_vI:
            row_data = {'full_tax':full_tax,'verbatimIdentification': full_tax}
        else:   
            row_data = {'full_tax':full_tax,'verbatimIdentification': 'Null'}
        for i in ordered_rank_columns:
            rank = i
            old_name = row[i]
            if pd.isna(old_name):
                continue 
            else:
                row_data.update({'old_taxonRank': rank, 'old name': old_name})
                if row_data['verbatimIdentification'] == 'Null':
                    row_data['verbatimIdentification'] = old_name
                s_match = pyworms.aphiaRecordsByName(old_name,like=like,marine_only=marine_only)
                #time.sleep(1)
                if s_match is None:
                    row_data['scientificName'] = "No match"
                    row_data['scientificNameID'] = "None"
                    print(old_name+": No match, "+rank)
                    continue
                elif len(s_match) > 1:
                    mult = []
                    for m in s_match:
                        if m['status'] == 'accepted':
                            mult = mult + [m]
                    if len(mult) > 1:
                        row_data['scientificName'] = "Multiple matches"
                        row_data['scientificNameID'] = "None"
                        print(old_name+": Multiple matches, "+rank+" ")
                    elif len(mult) < 1:
                        row_data['scientificName'] = "Multiple unaccepted matches"
                        row_data['scientificNameID'] = "None"
                        print(old_name+": Multiple unaccepted matches, "+rank+" ")
                    elif len(mult) == 1:
                        row_data['scientificName'] = mult[0]['scientificname']
                        row_data['scientificNameID'] = mult[0]['lsid']
                        row_data.update(dict(zip(w_ranks, [mult[0].get(key) for key in w_ranks])))
                        row_data.update({'taxonRank': mult[0]['rank']})
                        break
                elif len(s_match) == 1:
                    if s_match[0]['status'] == 'accepted':
                        row_data['scientificName'] = s_match[0]['scientificname']
                        row_data['scientificNameID'] = s_match[0]['lsid']
                        row_data.update(dict(zip(w_ranks, [s_match[0].get(key) for key in w_ranks])))
                        row_data.update({'taxonRank': s_match[0]['rank']})
                        break
                    elif s_match[0]['status'] == 'unaccepted':
                        valid_name = s_match[0]['valid_name']
                        if valid_name is not None:
                            v_match = pyworms.aphiaRecordsByName(valid_name,like=like,marine_only=marine_only)
                            # guard against an empty lookup so a worker process cannot crash on v_match[0]
                            if v_match:
                                row_data['scientificName'] = v_match[0]['scientificname']
                                row_data['scientificNameID'] = v_match[0]['lsid']
                                row_data.update(dict(zip(w_ranks, [v_match[0].get(key) for key in w_ranks])))
                                row_data.update({'taxonRank': v_match[0]['rank']})
                                print(old_name+": Unaccepted, using "+valid_name+", "+rank+" ")
                                # stop here so a higher rank does not overwrite the valid name
                                break
                            else:
                                print(old_name+": Unaccepted, valid name "+valid_name+" not found, "+rank+" ")
                        else:
                            print(old_name+": Unaccepted, no valid name, "+rank+" ")
        matches += [row_data]
    matches = pd.DataFrame.from_dict(matches)
    queue.put(matches)
                        

In [74]:
def get_worms_from_scientific_name_parallel(tax_df, ordered_rank_columns, full_tax_column="taxonomy",like=False, marine_only=False,full_tax_vI = False,n_proc=0):
    queue = mp.Queue()
    if n_proc == 0:
        # create as many processes as there are CPUs on your machine
        num_processes = mp.cpu_count()
    else:
        num_processes = n_proc

    # calculate the chunk size, rounding up so that the last rows are not
    # dropped when the row count is not a multiple of num_processes
    chunk_size = -(-tax_df.shape[0] // num_processes)
    procs = []
    for job in range(num_processes):
        start = job * chunk_size
        end = start + chunk_size
        df_chunk = tax_df.iloc[start:end]
        proc = mp.Process(
            target=get_worms_from_scientific_name,
            args=(df_chunk,ordered_rank_columns, queue,full_tax_column,like,marine_only,full_tax_vI)
        )
        procs.append(proc)
        proc.start()

    # collect one result per process before joining, so the queue is drained
    results = [queue.get() for _ in procs]

    for proc in procs:
        proc.join()

    return pd.concat(results)
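
When splitting a table across processes, it is worth confirming that the chunks cover every row, because integer division alone leaves a remainder. The sketch below computes near-equal chunk boundaries with `numpy.array_split`; `chunk_bounds` is a hypothetical helper, and 2729 rows with 7 processes are the values used for 16S in this notebook.

```python
import numpy as np

def chunk_bounds(n_rows, n_chunks):
    """Return (start, end) positions that split n_rows into near-equal chunks."""
    pieces = np.array_split(np.arange(n_rows), n_chunks)
    # skip empty pieces, which occur when n_chunks > n_rows
    return [(int(p[0]), int(p[-1]) + 1) for p in pieces if len(p) > 0]

bounds = chunk_bounds(2729, 7)

# rows covered by the chunks
covered = sum(end - start for start, end in bounds)

# rows left over if every chunk had exactly floor(2729 / 7) rows
missed_by_floor = 2729 - (2729 // 7) * 7
```

Each `(start, end)` pair can be used directly as `tax_df.iloc[start:end]`.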


I had some [issues with the parallelization](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr) on a Mac M1. Adding `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to .bash_profile and then applying [this workaround](https://github.com/python/cpython/issues/74570) fixed it.   
TODO: try running without the .bash_profile fix.

In [75]:
os.environ["no_proxy"]="*"

### 16S worms

Species-level IDs may be unreliable ([see here](https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494)), so only genus and higher ranks are matched.

In [76]:
tax_16S = asv_tables['16S rRNA'][['taxonomy','domain','phylum','class','order','family','genus','species']]

In [77]:
tax_16S = tax_16S.drop_duplicates()

In [78]:
tax_16S.shape

(2729, 8)

In [79]:
if __name__ == '__main__':
    worms_16s = get_worms_from_scientific_name_parallel(
        tax_df=tax_16S,
        ordered_rank_columns=['genus','family','order','class','phylum','domain'],
        full_tax_column="taxonomy", full_tax_vI=True, n_proc=7)

Arenicellaceae: No match, family
Blfdi19: No match, genus
Blfdi19: No match, family
Blfdi19: No match, order
Mitochondria: No match, genus
Mitochondria: No match, family
uncultured: No match, genus
uncultured: No match, genus
uncultured: No match, family
uncultured: No match, order
Arenicellales: No match, order
Polyangia: No match, class
Candidatus Tenderia: No match, genus
HOC36: No match, genus
HOC36: No match, family
HOC36: No match, order
vadinHA49: No match, genus
Tenderiaceae: No match, family
vadinHA49: No match, family
Myxococcota: No match, phylum
OM60(NOR5) clade: No match, genus
vadinHA49: No match, order
vadinHA49: No match, class
Planctomycetota: No match, phylum
Cyanobium PCC 6307: No match, genus
Halieaceae: No match, family
Tenderiales: No match, order
Zixibacteria: No match, genus
Zixibacteria: No match, family
Zixibacteria: No match, order
Zixibacteria: No match, class
Zixibacteria: No match, phylum
Cellvibrionales: No match, order
Cyanobiaceae: No match, family
BD1 

In [80]:
worms_16s.head(15)

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,genus,Vibrio,Vibrio,urn:lsid:marinespecies.org:taxname:480248,Bacteria,Proteobacteria,Gammaproteobacteria,Vibrionales,Vibrionaceae,Vibrio,Genus
1,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,class,Gammaproteobacteria,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
2,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,order,Synechococcales,Synechococcales,urn:lsid:marinespecies.org:taxname:345514,Bacteria,Cyanobacteria,Cyanophyceae,Synechococcales,,,Order
3,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,family,Myxococcaceae,Myxococcaceae,urn:lsid:marinespecies.org:taxname:570956,Bacteria,Proteobacteria,Deltaproteobacteria,Myxococcales,Myxococcaceae,,Family
4,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,class,Gammaproteobacteria,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
5,d__Archaea; p__Crenarchaeota; c__Nitrososphaer...,d__Archaea; p__Crenarchaeota; c__Nitrososphaer...,family,Nitrosopumilaceae,Nitrosopumilaceae,urn:lsid:marinespecies.org:taxname:559432,Archaea,Thaumarchaeota,Thaumarchaeota incertae sedis,Nitrosopumilales,Nitrosopumilaceae,,Family
6,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,class,Gammaproteobacteria,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
7,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,phylum,Cyanobacteria,Cyanobacteria,urn:lsid:marinespecies.org:taxname:146537,Bacteria,Cyanobacteria,,,,,Phylum
8,d__Bacteria; p__Planctomycetota; c__Pla3_linea...,d__Bacteria; p__Planctomycetota; c__Pla3_linea...,domain,Bacteria,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
9,d__Bacteria; p__Actinobacteriota; c__Acidimicr...,d__Bacteria; p__Actinobacteriota; c__Acidimicr...,domain,Bacteria,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom


In [81]:
worms_16s[worms_16s["scientificName"]=="No match"]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
242,d__Eukaryota,d__Eukaryota,domain,Eukaryota,No match,,,,,,,,


In [82]:
worms_16s.loc[worms_16s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_16s.loc[worms_16s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [83]:
worms_16s[worms_16s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
97,Unassigned,Unassigned,,,,,,,,,,,


In [84]:

# rows with no scientificName at all (e.g. 'Unassigned') are set to 'incertae sedis'
print(worms_16s[worms_16s['scientificName'].isna()].shape)
worms_16s.loc[worms_16s['scientificName'].isna(),'scientificName'] = 'incertae sedis'
worms_16s.loc[worms_16s['scientificName'] == 'incertae sedis','scientificNameID'] =  'urn:lsid:marinespecies.org:taxname:12'
print(worms_16s[worms_16s['scientificName'].isna()].shape)

(1, 13)
(0, 13)
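
The two fallback steps above (unmatched names to Biota, missing names to incertae sedis) could be gathered into one helper. This is a minimal sketch on a toy DataFrame: the function name `apply_fallback_taxa` is hypothetical, and the LSIDs are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

# LSIDs copied from the cells above
BIOTA_LSID = "urn:lsid:marinespecies.org:taxname:1"
INCERTAE_SEDIS_LSID = "urn:lsid:marinespecies.org:taxname:12"


def apply_fallback_taxa(df):
    """Hypothetical helper: fill unmatched and missing names with fallback taxa."""
    out = df.copy()
    # compute both masks before changing anything
    no_match = out["scientificName"] == "No match"
    missing = out["scientificName"].isna()
    out.loc[no_match, "scientificName"] = "Biota"
    out.loc[no_match, "scientificNameID"] = BIOTA_LSID
    out.loc[missing, "scientificName"] = "incertae sedis"
    out.loc[missing, "scientificNameID"] = INCERTAE_SEDIS_LSID
    return out


# toy rows shaped like the WoRMS matching table
toy = pd.DataFrame({
    "full_tax": ["d__Eukaryota", "Unassigned", "d__Bacteria"],
    "scientificName": ["No match", np.nan, "Bacteria"],
    "scientificNameID": [np.nan, np.nan, "urn:lsid:marinespecies.org:taxname:6"],
})
fixed = apply_fallback_taxa(toy)
print(fixed[["full_tax", "scientificName", "scientificNameID"]])
```

Working on a copy keeps the original table available for comparison, which may help when reviewing how many rows fell back to each placeholder.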


In [85]:
worms_16s.to_csv("../gomecc_v2_processed/worms_16S_matching.tsv",sep="\t",index=False)

In [86]:
worms_16s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_16s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,Vibrio,urn:lsid:marinespecies.org:taxname:480248,Bacteria,Proteobacteria,Gammaproteobacteria,Vibrionales,Vibrionaceae,Vibrio,Genus
1,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
2,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteri...,Synechococcales,urn:lsid:marinespecies.org:taxname:345514,Bacteria,Cyanobacteria,Cyanophyceae,Synechococcales,,,Order
3,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,d__Bacteria; p__Myxococcota; c__Myxococcia; o_...,Myxococcaceae,urn:lsid:marinespecies.org:taxname:570956,Bacteria,Proteobacteria,Deltaproteobacteria,Myxococcales,Myxococcaceae,,Family
4,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,d__Bacteria; p__Proteobacteria; c__Gammaproteo...,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class


In [87]:
occ['16S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,phylum,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
182,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.83219,Bacteria,Proteobacteria,Alphaproteobacteria,Parvibaculales,OCS116 clade,OCS116 clade,uncultured marine,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...
225,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.86704,Bacteria,Proteobacteria,Alphaproteobacteria,Puniceispirillales,SAR116 clade,SAR116 clade,,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...
347,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,1.0,Archaea,Thermoplasmatota,Thermoplasmata,Marine Group II,Marine Group II,Marine Group II,,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...
412,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.952911,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Mitochondria,Mitochondria,uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...
719,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,0.818195,Bacteria,Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...


#### Merge Occurrence and worms

In [88]:
occ['16S rRNA'].shape

(165158, 14)

In [213]:

occ16_test = occ['16S rRNA'].copy()
occ16_test.drop(columns=['domain','phylum','class','order','family','genus','species'],inplace=True)
#occ16_test.drop(columns=['old name'],inplace=True)

occ16_test = occ16_test.merge(worms_16s, how='left', left_on ='taxonomy', right_on='full_tax')
occ16_test.drop(columns='full_tax', inplace=True)
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
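
A left merge on the taxonomy string only preserves the number of occurrence rows if each `full_tax` value appears once in the WoRMS table. The sketch below shows one way to guard against that, using the `validate` argument of `DataFrame.merge` on toy data (the toy column values are made up).

```python
import pandas as pd

# toy occurrence rows: two ASVs share one taxonomy string
occ_toy = pd.DataFrame({
    "featureid": ["a1", "b2", "c3"],
    "taxonomy": ["d__Bacteria; p__Proteobacteria",
                 "d__Bacteria; p__Proteobacteria",
                 "d__Archaea"],
})
# toy WoRMS table: one row per unique taxonomy string
worms_toy = pd.DataFrame({
    "full_tax": ["d__Bacteria; p__Proteobacteria", "d__Archaea"],
    "scientificName": ["Proteobacteria", "Archaea"],
})

# validate="many_to_one" raises pandas.errors.MergeError if full_tax is duplicated
merged = occ_toy.merge(
    worms_toy, how="left", left_on="taxonomy", right_on="full_tax",
    validate="many_to_one",
)

# one output row per occurrence row
assert len(merged) == len(occ_toy)
# taxonomy strings without a WoRMS row would show up as missing names
print(merged["scientificName"].isna().sum())
```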


#### identificationRemarks

In [214]:
data['analysis_data'].head()

Unnamed: 0,target_gene,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S rRNA,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/ao...,12 distilled water blanks | 2 PCR no-template ...
1,18S rRNA,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database....,12 distilled water blanks | 2 PCR no-template ...


In [215]:
occ16_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '16S rRNA','taxa_class_method'].item()
occ16_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '16S rRNA','taxa_ref_db'].item()

occ16_test['identificationRemarks'] = occ16_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ16_test['Confidence'].astype(str) +", against reference database: "+occ16_test['taxa_ref_db']

In [216]:
occ16_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695'
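
The `.item()` calls above succeed only when exactly one row of the analysis sheet matches the target gene. A minimal sketch of the same lookup and string assembly on toy data follows; `lookup_single` and `build_identification_remarks` are hypothetical helper names, and the toy values are placeholders.

```python
import pandas as pd


def lookup_single(df, gene, column):
    """Hypothetical helper: return the single value of `column` for `gene`."""
    values = df.loc[df["target_gene"] == gene, column]
    if len(values) != 1:
        raise ValueError(f"expected one row for {gene!r}, found {len(values)}")
    return values.item()


def build_identification_remarks(confidence, method, ref_db):
    """Hypothetical helper mirroring the concatenation in the cell above."""
    return (
        method
        + ", confidence (at lowest specified taxon): "
        + confidence.astype(str)
        + ", against reference database: "
        + ref_db
    )


# toy stand-in for data['analysis_data']
analysis_toy = pd.DataFrame({
    "target_gene": ["16S rRNA", "18S rRNA"],
    "taxa_class_method": ["method A", "method B"],
    "taxa_ref_db": ["db A", "db B"],
})
method = lookup_single(analysis_toy, "16S rRNA", "taxa_class_method")
ref_db = lookup_single(analysis_toy, "16S rRNA", "taxa_ref_db")
remarks = build_identification_remarks(pd.Series([0.83219, 1.0]), method, ref_db)
print(remarks[0])
```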

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [217]:
occ16_test['taxonID'] = 'ASV:'+occ16_test['featureid']
occ16_test['basisOfRecord'] = 'MaterialSample'
occ16_test['nameAccordingTo'] = "WoRMS"
occ16_test['organismQuantityType'] = "DNA sequence reads"
occ16_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [218]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_method',
       'nucl_acid_ext', 'target_gene', 'target_subfragment',
       'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward',
       'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond',
       'nucl_acid_amp', 'adapters', 'mid_barcode'],
      dtype='object')

In [219]:
occ16_test = occ16_test.merge(data['prep_data'].loc[data['prep_data']['target_gene'] == '16S rRNA',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

In [220]:
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,...,taxa_ref_db,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,sample_name,sra_accession,biosample_accession
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,...,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,...,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,...,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,...,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,Bacteria,urn:lsid:marinespecies.org:taxname:6,...,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094


#### eventID

In [221]:
occ16_test['eventID'] = occ16_test['eventID']+"_16S"

In [222]:
# get sampleSize by total number of reads per sample
x = asv_tables['16S rRNA'].sum(numeric_only=True).astype('int')
x.index = x.index+"_16S"
occ16_test['occurrenceRemarks'] = "Total sampleSize in DNA sequence reads: "+occ16_test['eventID'].map(x).astype('str')
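
Mapping per-sample read totals onto `eventID` assumes every event has a column in the ASV table. The sketch below, on a toy table whose layout (one row per ASV, one numeric column per sample) is an assumption, shows how unmatched events could be listed before the remark string is built, since `map` returns NaN for them and `astype('str')` would turn that into the literal text "nan".

```python
import pandas as pd

# toy ASV table: one row per ASV, one numeric column per sample (assumed layout)
asv_toy = pd.DataFrame({
    "featureid": ["a1", "b2"],
    "sampleA": [10, 5],
    "sampleB": [0, 7],
})
# total reads per sample, re-labelled to match the suffixed eventIDs
totals = asv_toy.sum(numeric_only=True).astype("int")
totals.index = totals.index + "_16S"

occ_toy = pd.DataFrame({"eventID": ["sampleA_16S", "sampleB_16S", "sampleC_16S"]})
mapped = occ_toy["eventID"].map(totals)

# eventIDs with no matching sample column
missing = occ_toy.loc[mapped.isna(), "eventID"].tolist()
print(missing)
```

If the list is empty, the mapped totals stay integers and the remark text contains whole numbers, as in the output further down.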

In [223]:
# drop unneeded columns
occ16_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

In [224]:
occ16_test['associatedSequences'] = occ16_test['sra_accession']+' | '+ occ16_test['biosample_accession']+' | '+data['study_data']['bioproject_accession'].values[0]

In [225]:
occ16_test.rename(columns={'biosample_accession': 'materialSampleID',
                  'sequence': 'DNA_sequence'},inplace=True)
                   

In [226]:
# drop unneeded columns
occ16_test.drop(columns=['sra_accession'],inplace=True)

In [227]:
occ16_test.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'occurrenceRemarks', 'associatedSequences'],
      dtype='object')

In [228]:
occ16_test.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,...,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,occurrenceRemarks,associatedSequences
0,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,...,Class,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 16187,SRR26148187 | SAMN37516094 | PRJNA887898
1,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,...,Class,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 16187,SRR26148187 | SAMN37516094 | PRJNA887898
2,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,...,Class,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 16187,SRR26148187 | SAMN37516094 | PRJNA887898
3,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,...,Order,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 16187,SRR26148187 | SAMN37516094 | PRJNA887898
4,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,...,Kingdom,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 16187,SRR26148187 | SAMN37516094 | PRJNA887898


### merge event and occurrence

In [229]:
all_event_data.tail()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,parentEventID,datasetID,geodeticDatum
939,GOMECC4_CAPECORAL_Sta141_DCM_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,USA: Gulf of Mexico,25.574,-84.843,CTD rosette,"Mexico, Gulf of",59,GOMECC4_CAPECORAL_Sta141_DCM_B,noaa-aoml-gomecc4,WGS84
940,GOMECC4_CAPECORAL_Sta141_DCM_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,USA: Gulf of Mexico,25.574,-84.843,CTD rosette,"Mexico, Gulf of",59,GOMECC4_CAPECORAL_Sta141_DCM_C,noaa-aoml-gomecc4,WGS84
941,GOMECC4_CAPECORAL_Sta141_Surface_A_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,USA: Gulf of Mexico,25.574,-84.843,CTD rosette,"Mexico, Gulf of",4,GOMECC4_CAPECORAL_Sta141_Surface_A,noaa-aoml-gomecc4,WGS84
942,GOMECC4_CAPECORAL_Sta141_Surface_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,USA: Gulf of Mexico,25.574,-84.843,CTD rosette,"Mexico, Gulf of",4,GOMECC4_CAPECORAL_Sta141_Surface_B,noaa-aoml-gomecc4,WGS84
943,GOMECC4_CAPECORAL_Sta141_Surface_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,USA: Gulf of Mexico,25.574,-84.843,CTD rosette,"Mexico, Gulf of",4,GOMECC4_CAPECORAL_Sta141_Surface_C,noaa-aoml-gomecc4,WGS84


In [230]:
occ16_merged = occ16_test.merge(all_event_data,how='left',on='eventID')
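
Because this is a left merge, occurrence rows whose `eventID` is absent from the event table are kept with empty event columns. A small sketch using the `indicator` argument of `DataFrame.merge` to list such rows, on made-up toy data:

```python
import pandas as pd

# toy occurrence and event tables (values are made up)
occ_toy = pd.DataFrame({
    "occurrenceID": ["o1", "o2", "o3"],
    "eventID": ["ev1_16S", "ev1_16S", "ev9_16S"],
})
event_toy = pd.DataFrame({
    "eventID": ["ev1_16S", "ev2_16S"],
    "decimalLatitude": [26.997, 25.574],
})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge
checked = occ_toy.merge(event_toy, how="left", on="eventID", indicator=True)
orphans = checked.loc[checked["_merge"] == "left_only", "eventID"].unique().tolist()
print(orphans)

# drop the helper column once the check is done
merged = checked.drop(columns="_merge")
```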

In [231]:
occ16_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,...,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,parentEventID,datasetID,geodeticDatum
0,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCG...,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
1,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
2,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTC...,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca9094...,d__Archaea; p__Thermoplasmatota; c__Thermoplas...,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
3,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTC...,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f7...,d__Bacteria; p__Proteobacteria; c__Alphaproteo...,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
4,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTC...,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf54...,d__Bacteria; p__Marinimicrobia_(SAR406_clade);...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84


In [232]:
occ16_merged.drop(columns=['DNA_sequence']).to_csv("../gomecc_v2_processed/occurrence_16S.tsv",sep="\t",index=False)

### 18S worms

The PR2 18S database provides WoRMS IDs (AphiaIDs) for species that are in WoRMS. We read in that file, assign the known WoRMS IDs, then search WoRMS by name for the remaining unannotated taxa.

In [110]:
pr2_18S = pd.read_excel("../../../databases/18S_PR2/pr2_v5.0.0_SSU/pr2_version_5.0.0_taxonomy.xlsx",
    index_col=None, na_values=[""])
pr2_18S = pr2_18S.dropna(subset=['worms_id'])
pr2_18S['worms_id'] = pr2_18S['worms_id'].astype('int').astype('str')
pr2_18S['species'] = pr2_18S['species'].replace('_',' ',regex=True)
# raw strings avoid invalid-escape warnings for the regex patterns
pr2_18S['species'] = pr2_18S['species'].replace(r' sp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace(r' spp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('-',' ',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('/',' ',regex=True)

In [111]:
pr2_18S_dict = dict(zip(pr2_18S.species,pr2_18S.worms_id))


In [112]:
(pr2_18S_dict['Aphanocapsa feldmannii'])

'614894'
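
The name cleaning and dictionary lookup above can be sketched end to end on a toy table. `clean_pr2_species` is a hypothetical helper that applies the same substitutions in the same order; the second toy row (`Examplea_sp.`, ID 999999) is invented for illustration and is not a real PR2 or WoRMS entry.

```python
import re

import pandas as pd


def clean_pr2_species(name):
    """Hypothetical helper: same substitutions as the cell above, in order."""
    name = name.replace("_", " ")
    name = re.sub(r" sp\.", "", name)
    name = re.sub(r" spp\.", "", name)
    name = name.replace("-", " ")
    name = name.replace("/", " ")
    return name


# toy stand-in for the PR2 taxonomy sheet; worms_id is read as float by pandas
pr2_toy = pd.DataFrame({
    "species": ["Aphanocapsa_feldmannii", "Examplea_sp."],
    "worms_id": [614894.0, 999999.0],
})
pr2_toy["species"] = pr2_toy["species"].map(clean_pr2_species)
pr2_toy["worms_id"] = pr2_toy["worms_id"].astype("int").astype("str")

lookup = dict(zip(pr2_toy["species"], pr2_toy["worms_id"]))
print(lookup.get("Aphanocapsa feldmannii"))
# .get returns None instead of raising KeyError for names that are not in PR2
print(lookup.get("Not in PR2"))
```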

#### code to get record from aphia id

In [113]:
import multiprocess as mp
import pyworms

In [114]:
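def get_worms_from_aphiaid_or_name(tax_df, worms_dict,ordered_rank_columns, queue,full_tax_column="taxonomy", like=False, marine_only=False,full_tax_vI = False):
    """Match each row of tax_df to WoRMS and put the resulting DataFrame on queue.

    Species found in worms_dict are looked up by AphiaID; all other rows are
    searched by name, rank by rank, in the order given by ordered_rank_columns.
    """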
    matches = []
    w_ranks = ['kingdom','phylum','class','order','family','genus']
    for index, row in tax_df.iterrows():
        full_tax = row[full_tax_column]
        if row['species'] in worms_dict.keys():
            row_data = {'full_tax':full_tax,'verbatimIdentification': row['species']}
            aid = worms_dict[row['species']]
            record = pyworms.aphiaRecordByAphiaID(aid)
            row_data.update({'taxonRank': 'species', 'old name': 'aphiaID'})
            if record['status'] == 'accepted':
                row_data['scientificName'] = record['scientificname']
                row_data['scientificNameID'] = record['lsid']
                row_data.update(dict(zip(w_ranks, [record.get(key) for key in w_ranks])))
                row_data.update({'taxonRank': record['rank']})
            elif record['status'] == 'unaccepted':
                valid_name = record['valid_name']
                if valid_name is not None:
                    v_match = pyworms.aphiaRecordsByName(valid_name,like=like,marine_only=marine_only)
                    row_data['scientificName'] = v_match[0]['scientificname']
                    row_data['scientificNameID'] = v_match[0]['lsid']
                    row_data.update(dict(zip(w_ranks, [v_match[0].get(key) for key in w_ranks])))
                    row_data.update({'taxonRank': v_match[0]['rank']})
                    print(aid+": Unaccepted, using "+valid_name)
                else:
                    print(aid+": Unaccepted, no valid name ")
        else:
            if full_tax_vI:
                row_data = {'full_tax':full_tax,'verbatimIdentification': full_tax}
            else:   
                row_data = {'full_tax':full_tax,'verbatimIdentification': 'Null'}
            for i in ordered_rank_columns:
                rank = i
                old_name = row[i]
                if pd.isna(old_name):
                    continue 
                else:
                    row_data.update({'old_taxonRank': rank, 'old name': old_name})
                    if row_data['verbatimIdentification'] == 'Null':
                        row_data['verbatimIdentification'] = old_name
                    s_match = pyworms.aphiaRecordsByName(old_name,like=like,marine_only=marine_only)
                    #time.sleep(1)
                    if s_match is None:
                        row_data['scientificName'] = "No match"
                        row_data['scientificNameID'] = "None"
                        print(old_name+": No match, "+rank)
                        continue
                    elif len(s_match) > 1:
                        mult = []
                        for m in s_match:
                            if m['status'] == 'accepted':
                                mult = mult + [m]
                        if len(mult) > 1:
                            row_data['scientificName'] = "Multiple matches"
                            row_data['scientificNameID'] = "None"
                            print(old_name+": Multiple matches, "+rank+" ")
                        elif len(mult) < 1:
                            row_data['scientificName'] = "Multiple unaccepted matches"
                            row_data['scientificNameID'] = "None"
                            print(old_name+": Multiple unaccepted matches, "+rank+" ")
                        elif len(mult) == 1:
                            row_data['scientificName'] = mult[0]['scientificname']
                            row_data['scientificNameID'] = mult[0]['lsid']
                            row_data.update(dict(zip(w_ranks, [mult[0].get(key) for key in w_ranks])))
                            row_data.update({'taxonRank': mult[0]['rank']})
                            break
                    elif len(s_match) == 1:
                        if s_match[0]['status'] == 'accepted':
                            row_data['scientificName'] = s_match[0]['scientificname']
                            row_data['scientificNameID'] = s_match[0]['lsid']
                            row_data.update(dict(zip(w_ranks, [s_match[0].get(key) for key in w_ranks])))
                            row_data.update({'taxonRank': s_match[0]['rank']})
                            break
                        elif s_match[0]['status'] == 'unaccepted':
                            valid_name = s_match[0]['valid_name']
                            if valid_name is not None:
                                v_match = pyworms.aphiaRecordsByName(valid_name,like=like,marine_only=marine_only)
                                row_data['scientificName'] = v_match[0]['scientificname']
                                row_data['scientificNameID'] = v_match[0]['lsid']
                                row_data.update(dict(zip(w_ranks, [v_match[0].get(key) for key in w_ranks])))
                                row_data.update({'taxonRank': v_match[0]['rank']})
                                print(old_name+": Unaccepted, using "+valid_name+", "+rank+" ")
                            else:
                                print(old_name+": Unaccepted, no valid name, "+rank+" ")
        matches += [row_data]
    matches = pd.DataFrame.from_dict(matches)
    queue.put(matches)
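
The branching on the number and status of name matches can be examined without any network calls. The following is a simplified sketch, not the function above: `classify_matches` is a hypothetical helper, it assumes WoRMS-style records are dicts with the `status`, `scientificname` and `valid_name` keys used above, and the records are made up.

```python
def classify_matches(records):
    """Hypothetical helper: summarise a list of WoRMS-style match records.

    Returns (label, record); record is None when no single record can be used.
    """
    if not records:
        return "No match", None
    accepted = [r for r in records if r.get("status") == "accepted"]
    if len(accepted) == 1:
        return accepted[0]["scientificname"], accepted[0]
    if len(accepted) > 1:
        return "Multiple matches", None
    if len(records) > 1:
        return "Multiple unaccepted matches", None
    # a single record that is not accepted: the caller can follow its valid_name
    return "Unaccepted", records[0]


# made-up records for illustration only (not real WoRMS data)
acc = {"status": "accepted", "scientificname": "Taxon a", "lsid": "urn:lsid:example:1"}
acc2 = {"status": "accepted", "scientificname": "Taxon b", "lsid": "urn:lsid:example:2"}
unacc = {"status": "unaccepted", "scientificname": "Taxon c", "valid_name": "Taxon a"}

print(classify_matches(None))
print(classify_matches([acc, unacc]))
print(classify_matches([acc, acc2]))
print(classify_matches([unacc]))
```

Keeping this decision separate from the `pyworms` calls would make it possible to test the matching rules offline.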
                        

In [115]:
def get_worms_from_aphiaid_or_name_parallel(tax_df, worms_dict,ordered_rank_columns, full_tax_column="taxonomy",like=False, marine_only=False,full_tax_vI = False,n_proc=0):
    queue = mp.Queue()
    if n_proc == 0:
    # create as many processes as there are CPUs on your machine
        num_processes = mp.cpu_count()
    else:
        num_processes = n_proc
        
    # calculate the chunk size, rounding up so that trailing rows are not dropped
    chunk_size = -(-tax_df.shape[0] // num_processes)
    procs = []
    for job in range(num_processes):
        start = job * chunk_size
        end = start + chunk_size
        df_chunk = tax_df.iloc[start:end]
        proc = mp.Process(
            target=get_worms_from_aphiaid_or_name,
            args=(df_chunk,worms_dict,ordered_rank_columns, queue,full_tax_column,like,marine_only,full_tax_vI)
        )
        procs.append(proc)
        proc.start()
    
    new_df = pd.DataFrame()
    for _ in procs:
        new_df = pd.concat([new_df,queue.get()])
    
    #new_df = queue.get()
    
    for proc in procs:
        proc.join()
    
    return new_df
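
When the number of rows is not a multiple of the number of processes, the chunk boundaries determine whether every row is handed to a worker. A minimal sketch of boundaries that cover each row exactly once, using `numpy.array_split`; `chunk_bounds` is a hypothetical helper name.

```python
import numpy as np


def chunk_bounds(n_rows, n_chunks):
    """Hypothetical helper: (start, end) pairs that cover every row exactly once."""
    parts = np.array_split(np.arange(n_rows), n_chunks)
    # skip empty parts, which occur when there are more chunks than rows
    return [(int(p[0]), int(p[-1]) + 1) for p in parts if len(p) > 0]


# the 18S taxonomy table below has 1374 unique rows, split over 6 processes
bounds = chunk_bounds(1374, 6)
print(bounds)

# an uneven case: 10 rows over 4 processes
bounds_uneven = chunk_bounds(10, 4)
print(bounds_uneven)
```

Each `(start, end)` pair could then be passed to `tax_df.iloc[start:end]` in place of the fixed-size slices.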


The parallelization initially [crashed Python](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr) on the Mac M1. Adding `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to `.bash_profile` and then applying the workaround from [this CPython issue](https://github.com/python/cpython/issues/74570) (the `no_proxy` setting in the next cell) fixed it.  
TODO: test whether this runs without the `.bash_profile` change.

In [116]:
os.environ["no_proxy"]="*"

In [117]:
tax_18S = asv_tables['18S rRNA'][['taxonomy','domain','supergroup','division','subdivision','class','order','family','genus','species']]

In [118]:
tax_18S = tax_18S.drop_duplicates()
tax_18S.shape

(1374, 10)

In [119]:
if __name__ == '__main__':
    worms_18s = get_worms_from_aphiaid_or_name_parallel(
        tax_df=tax_18S,
        worms_dict=pr2_18S_dict,
        ordered_rank_columns=['species','genus','family','order','class','subdivision','division','supergroup'],
        full_tax_column="taxonomy",
        full_tax_vI=True,
        n_proc=6)
    

Aspergillus penicillioides: No match, species
Protoscenium cf intricatum: No match, species
Euglypha acanthophora: No match, species
RAD B X Group IVe X: No match, species
RAD B X Group IVe X: No match, genus
RAD B X Group IVe: No match, family
Eimeriida: No match, order
Nibbleromonas: No match, genus
RAD B X: No match, order
Coccidiomorphea: No match, class
MAST 12A: No match, species
MAST 12A: No match, genus
Solemya reidi: Unaccepted, using Petrasma pervernicosa, species 
Nibbleridae: No match, family
RAD B: No match, class
MAST 12: No match, family
Nibbleridida: No match, order
Opalozoa X: No match, order
Malus x: No match, species
Nibbleridea: No match, class
Radiolaria: Unaccepted, using Radiozoa, subdivision 
Euduboscquella cachoni: No match, species
Obazoa: No match, supergroup
Malus: No match, genus
Nibbleridia X: No match, subdivision
Pectinoida: No match, family
Embryophyceae XX: No match, family
Nibbleridia: No match, division
Skeletonema menzellii: No match, species
Embryo

In [120]:
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;Obazoa;Opisthokonta;Fungi;Ascomycota...,Eukaryota;Obazoa;Opisthokonta;Fungi;Ascomycota...,genus,Aspergillus,Aspergillus,urn:lsid:marinespecies.org:taxname:100211,Fungi,Ascomycota,Eurotiomycetes,Eurotiales,Trichocomaceae,Aspergillus,Genus
1,Eukaryota;Cryptista;Cryptophyta;Cryptophyta_X;...,Eukaryota;Cryptista;Cryptophyta;Cryptophyta_X;...,species,Goniomonas,Goniomonas,urn:lsid:marinespecies.org:taxname:106286,Chromista,Cryptophyta,Cryptophyceae,Cryptomonadales,Cryptomonadaceae,Goniomonas,Genus
2,Eukaryota;TSAR;Alveolata;Ciliophora;Spirotrich...,Eukaryota;TSAR;Alveolata;Ciliophora;Spirotrich...,species,Strombidium,Strombidium,urn:lsid:marinespecies.org:taxname:101195,Chromista,Ciliophora,Oligotrichea,Oligotrichida,Strombidiidae,Strombidium,Genus
3,Eukaryota;Obazoa;Opisthokonta;Metazoa;Annelida...,Prionospio dubia,,aphiaID,Prionospio dubia,urn:lsid:marinespecies.org:taxname:131155,Animalia,Annelida,Polychaeta,Spionida,Spionidae,Prionospio,Species
4,Eukaryota;TSAR;Stramenopiles;Bigyra;Opalozoa;O...,Eukaryota;TSAR;Stramenopiles;Bigyra;Opalozoa;O...,class,Opalozoa,Opalozoa,urn:lsid:marinespecies.org:taxname:582466,Chromista,Bigyra,,,,,Subphylum


In [121]:
worms_18s[worms_18s["scientificName"]=="No match"]['old name'].unique()

array(['Haptista', 'Archaeplastida', 'TSAR', 'Cryptista:nucl', 'Obazoa',
       'Provora'], dtype=object)

In [122]:
worms_18s[worms_18s["scientificName"]=="No match"].head(20)

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
121,Eukaryota;Haptista;Centroplasthelida;Centropla...,Eukaryota;Haptista;Centroplasthelida;Centropla...,supergroup,Haptista,No match,,,,,,,,
4,Eukaryota;Archaeplastida,Eukaryota;Archaeplastida,supergroup,Archaeplastida,No match,,,,,,,,
19,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X...,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X...,supergroup,TSAR,No match,,,,,,,,
46,Eukaryota:nucl;Cryptista:nucl;Cryptophyta:nucl...,Eukaryota:nucl;Cryptista:nucl;Cryptophyta:nucl...,supergroup,Cryptista:nucl,No match,,,,,,,,
60,Eukaryota;TSAR,Eukaryota;TSAR,supergroup,TSAR,No match,,,,,,,,
65,Eukaryota;Obazoa;Opisthokonta;Metazoa;Ctenopho...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Ctenopho...,supergroup,Obazoa,No match,,Animalia,,,,,,Kingdom
81,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X...,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X...,supergroup,TSAR,No match,,,,,,,,
119,Eukaryota;Obazoa;Opisthokonta;Opisthokonta_X;O...,Eukaryota;Obazoa;Opisthokonta;Opisthokonta_X;O...,supergroup,Obazoa,No match,,,,,,,,
177,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae,supergroup,TSAR,No match,,,,,,,,
203,Eukaryota;TSAR;Telonemia;Telonemia_X;Telonemia...,Eukaryota;TSAR;Telonemia;Telonemia_X;Telonemia...,supergroup,TSAR,No match,,,,,,,,


In [123]:
worms_18s.loc[worms_18s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"
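
The two assignments above can be wrapped in a small helper so the same fallback can be reused for other markers. This is a minimal sketch on toy data, not the notebook's own method: the helper name `assign_fallback` and the toy rows are assumptions, while the Biota LSID is the one used in the cell above.

```python
import pandas as pd

BIOTA_LSID = "urn:lsid:marinespecies.org:taxname:1"  # same LSID as in the cell above

def assign_fallback(df, mask, name, lsid):
    """Return a copy of df with scientificName/scientificNameID set where mask is True."""
    out = df.copy()
    out.loc[mask, "scientificName"] = name
    out.loc[mask, "scientificNameID"] = lsid
    return out

# toy stand-in for worms_18s (hypothetical rows)
toy = pd.DataFrame({
    "old name": ["TSAR", "Aspergillus"],
    "scientificName": ["No match", "Aspergillus"],
    "scientificNameID": [None, "urn:lsid:marinespecies.org:taxname:100211"],
})
fixed = assign_fallback(toy, toy["scientificName"] == "No match", "Biota", BIOTA_LSID)
print(fixed[["scientificName", "scientificNameID"]])
```

Returning a copy leaves the input table untouched, which makes it easier to compare the table before and after the fallback.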


In [124]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
109,Unassigned,Unassigned,,,,,,,,,,,
120,Eukaryota;Haptista,Eukaryota;Haptista,supergroup,Haptista,,,,,,,,,
77,Eukaryota,Eukaryota,,,,,,,,,,,


In [125]:
# Rows resolved no further than Eukaryota or Eukaryota;Haptista are also assigned to Biota
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [126]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
109,Unassigned,Unassigned,,,,,,,,,,,


In [127]:

# Remaining rows with no scientificName (here only "Unassigned") become incertae sedis
print(worms_18s[worms_18s['scientificName'].isna()].shape)
worms_18s.loc[worms_18s['scientificName'].isna(),'scientificName'] = 'incertae sedis'
worms_18s.loc[worms_18s['scientificName'] == 'incertae sedis','scientificNameID'] = 'urn:lsid:marinespecies.org:taxname:12'
print(worms_18s[worms_18s['scientificName'].isna()].shape)

(1, 13)
(0, 13)


In [128]:
worms_18s[worms_18s["old name"]=="aphiaID"].shape

(332, 13)

In [129]:
worms_18s.to_csv("../gomecc_processed/worms_18S_matching.tsv",sep="\t",index=False)

In [130]:
worms_18s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;Obazoa;Opisthokonta;Fungi;Ascomycota...,Eukaryota;Obazoa;Opisthokonta;Fungi;Ascomycota...,Aspergillus,urn:lsid:marinespecies.org:taxname:100211,Fungi,Ascomycota,Eurotiomycetes,Eurotiales,Trichocomaceae,Aspergillus,Genus
1,Eukaryota;Cryptista;Cryptophyta;Cryptophyta_X;...,Eukaryota;Cryptista;Cryptophyta;Cryptophyta_X;...,Goniomonas,urn:lsid:marinespecies.org:taxname:106286,Chromista,Cryptophyta,Cryptophyceae,Cryptomonadales,Cryptomonadaceae,Goniomonas,Genus
2,Eukaryota;TSAR;Alveolata;Ciliophora;Spirotrich...,Eukaryota;TSAR;Alveolata;Ciliophora;Spirotrich...,Strombidium,urn:lsid:marinespecies.org:taxname:101195,Chromista,Ciliophora,Oligotrichea,Oligotrichida,Strombidiidae,Strombidium,Genus
3,Eukaryota;Obazoa;Opisthokonta;Metazoa;Annelida...,Prionospio dubia,Prionospio dubia,urn:lsid:marinespecies.org:taxname:131155,Animalia,Annelida,Polychaeta,Spionida,Spionidae,Prionospio,Species
4,Eukaryota;TSAR;Stramenopiles;Bigyra;Opalozoa;O...,Eukaryota;TSAR;Stramenopiles;Bigyra;Opalozoa;O...,Opalozoa,urn:lsid:marinespecies.org:taxname:582466,Chromista,Bigyra,,,,,Subphylum


In [131]:
occ['18S rRNA'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,supergroup,division,subdivision,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.922099,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus cristatus,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d...
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.999947,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus furcatus,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314...
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.779948,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e...
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.999931,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1...
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.986908,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c82600...


#### Merge occurrence and WoRMS

In [132]:
occ['18S rRNA'].shape

(146232, 16)

In [195]:
# Drop the PR2 rank columns, then merge in the WoRMS-matched taxonomy
occ18_test = occ['18S rRNA'].copy()
occ18_test.drop(columns=['domain','supergroup','division','subdivision','class','order','family','genus','species'],inplace=True)
#occ18_test.drop(columns=['old name'],inplace=True)

occ18_test = occ18_test.merge(worms_18s, how='left', left_on ='taxonomy', right_on='full_tax')
occ18_test.drop(columns='full_tax', inplace=True)
occ18_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.922099,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d...,Neocalanus cristatus,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.999947,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314...,Clausocalanus furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species
2,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,0.779948,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus
3,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.999931,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class
4,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,0.986908,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c82600...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family
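
A left merge like the one above silently duplicates occurrence rows if the lookup table contains the same taxonomy string twice. The sketch below, on toy data with hypothetical values, shows one way to guard against that using the `validate` argument of `DataFrame.merge` and a row-count check.

```python
import pandas as pd

# toy stand-ins for occ['18S rRNA'] and worms_18s (hypothetical values)
occ_toy = pd.DataFrame({
    "featureid": ["a1", "b2", "c3"],
    "taxonomy": ["Eukaryota;TSAR", "Eukaryota;TSAR", "Unassigned"],
})
worms_toy = pd.DataFrame({
    "full_tax": ["Eukaryota;TSAR", "Unassigned"],
    "scientificName": ["Biota", "incertae sedis"],
})

# validate="many_to_one" raises pandas.errors.MergeError if full_tax is duplicated
merged = occ_toy.merge(worms_toy, how="left", left_on="taxonomy",
                       right_on="full_tax", validate="many_to_one")

# a left merge against unique keys must preserve the row count
assert len(merged) == len(occ_toy)

# occurrences whose taxonomy string found no WoRMS row would show up here
unmatched = merged[merged["scientificName"].isna()]
print(len(merged), len(unmatched))
```

In this notebook the row count is unchanged by the merge (146232 rows before and after), which is what the check would confirm.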


#### identificationRemarks

In [196]:
data['analysis_data'].head()

Unnamed: 0,target_gene,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S rRNA,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/ao...,12 distilled water blanks | 2 PCR no-template ...
1,18S rRNA,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database....,12 distilled water blanks | 2 PCR no-template ...


In [197]:
occ18_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '18S rRNA','taxa_class_method'].item()
occ18_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '18S rRNA','taxa_ref_db'].item()

occ18_test['identificationRemarks'] = occ18_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ18_test['Confidence'].astype(str) +", against reference database: "+occ18_test['taxa_ref_db']

In [198]:
occ18_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706'
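
The string concatenation above can also be expressed as a small function, which keeps the wording of `identificationRemarks` in one place if several markers are processed. This is a sketch only: the function name `build_identification_remarks` is an assumption, and the method and database strings are copied from the output above.

```python
import pandas as pd

def build_identification_remarks(confidence, method, ref_db):
    """Compose identificationRemarks per row, using the same wording as the cell above."""
    return (method + ", confidence (at lowest specified taxon): "
            + confidence.astype(str)
            + ", against reference database: " + ref_db)

conf = pd.Series([0.92209885, 0.5])  # second value is hypothetical
remarks = build_identification_remarks(
    conf,
    "Tourmaline; qiime2-2021.2; naive-bayes classifier",
    "PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706",
)
print(remarks[0])
```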

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [199]:
occ18_test['taxonID'] = 'ASV:'+occ18_test['featureid']
occ18_test['basisOfRecord'] = 'MaterialSample'
occ18_test['nameAccordingTo'] = "WoRMS"
occ18_test['organismQuantityType'] = "DNA sequence reads"
occ18_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [200]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_method',
       'nucl_acid_ext', 'target_gene', 'target_subfragment',
       'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward',
       'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond',
       'nucl_acid_amp', 'adapters', 'mid_barcode'],
      dtype='object')

In [201]:
occ18_test = occ18_test.merge(data['prep_data'].loc[data['prep_data']['target_gene'] == '18S rRNA',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

#### eventID

In [202]:
occ18_test['eventID'] = occ18_test['eventID']+"_18S"

In [203]:
# get sampleSize by total number of reads per sample
x = asv_tables['18S rRNA'].sum(numeric_only=True).astype('int')
x.index = x.index+"_18S"
occ18_test['occurrenceRemarks'] = "Total sampleSize in DNA sequence reads: "+occ18_test['eventID'].map(x).astype('str')
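
To make the read-total step concrete, here is the same logic on a toy ASV table. The table layout (one row per ASV, one column per sample) mirrors how `asv_tables` is used above; the sample names and counts are hypothetical.

```python
import pandas as pd

# toy ASV table: one row per ASV, one column per sample (hypothetical counts)
asv_toy = pd.DataFrame({
    "featureid": ["a1", "b2"],
    "sample_A": [10, 5],
    "sample_B": [0, 7],
})

# column sums over the numeric columns give the total reads per sample
reads_per_sample = asv_toy.sum(numeric_only=True).astype("int")
reads_per_sample.index = reads_per_sample.index + "_18S"  # match the suffixed eventIDs

occ_toy = pd.DataFrame({"eventID": ["sample_A_18S", "sample_B_18S", "sample_A_18S"]})
occ_toy["occurrenceRemarks"] = (
    "Total sampleSize in DNA sequence reads: "
    + occ_toy["eventID"].map(reads_per_sample).astype("str")
)
print(occ_toy)
```

Note that if an eventID were missing from the summed index, `map` would return NaN for that row and the integer totals would be rendered as floats, so it is worth confirming that every eventID is present.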

In [204]:
# drop unneeded columns
occ18_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

In [205]:
occ18_test['associatedSequences'] = occ18_test['sra_accession']+' | '+ occ18_test['biosample_accession']+' | '+data['study_data']['bioproject_accession'].values[0]
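
One caveat with `+` concatenation is that a missing accession in any column turns the whole `associatedSequences` value into NaN. The sketch below shows a possible NaN-tolerant alternative; the helper `join_accessions` is hypothetical, and the accession values are taken from the example output in this notebook.

```python
import pandas as pd

def join_accessions(df, columns, extra, sep=" | "):
    """Join the non-missing values of `columns` row by row, then append `extra`."""
    def _join(row):
        parts = [str(v) for v in row if pd.notna(v)]
        return sep.join(parts + [extra])
    return df[columns].apply(_join, axis=1)

toy = pd.DataFrame({
    "sra_accession": ["SRR26161153", None],  # second row deliberately missing
    "biosample_accession": ["SAMN37516094", "SAMN37516094"],
})
toy["associatedSequences"] = join_accessions(
    toy, ["sra_accession", "biosample_accession"], "PRJNA887898")
print(toy["associatedSequences"].tolist())
```

Whether a partial accession list is preferable to an empty value is a judgment call for the data publisher; the original approach makes missing accessions easier to spot.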

In [206]:
occ18_test.rename(columns={'biosample_accession': 'materialSampleID',
                           'sequence': 'DNA_sequence'},inplace=True)

In [207]:
# drop unneeded columns
occ18_test.drop(columns=['sra_accession'],inplace=True)

In [208]:
occ18_test.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'occurrenceRemarks', 'associatedSequences'],
      dtype='object')

In [209]:
occ18_test.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,...,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,occurrenceRemarks,associatedSequences
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d...,Neocalanus cristatus,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,...,Species,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:36aa75f9b28f5f831c2d631ba65c2bcb,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 9838,SRR26161153 | SAMN37516094 | PRJNA887898
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314...,Clausocalanus furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,...,Species,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:4e38e8ced9070952b314e1880bede1ca,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 9838,SRR26161153 | SAMN37516094 | PRJNA887898
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,...,Genus,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:2a31e5c01634165da99e7381279baa75,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 9838,SRR26161153 | SAMN37516094 | PRJNA887898
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,...,Class,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:ecee60339b2fb88ea6d1c8d18054bed4,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 9838,SRR26161153 | SAMN37516094 | PRJNA887898
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c82600...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,...,Family,Tourmaline; qiime2-2021.2; naive-bayes classif...,ASV:fa1f1a97dd4ae7c826009186bad26384,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,Total sampleSize in DNA sequence reads: 9838,SRR26161153 | SAMN37516094 | PRJNA887898


### Merge event and occurrence

In [210]:
occ18_merged = occ18_test.merge(all_event_data,how='left',on='eventID')

In [211]:
occ18_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,...,minimumDepthInMeters,locality,decimalLatitude,decimalLongitude,samplingProtocol,waterBody,maximumDepthInMeters,parentEventID,datasetID,geodeticDatum
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCC...,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d...,Neocalanus cristatus,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGT...,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314...,Clausocalanus furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAA...,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e...,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropo...,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAG...,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAA...,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c82600...,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinoph...,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,...,49,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,CTD rosette,Atlantic Ocean,49,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,WGS84
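
Because the occurrence eventIDs were suffixed with `_18S` before this merge, any mismatch with the event table would leave location fields empty without raising an error. A minimal sketch of how to detect that, using the `indicator` argument of `DataFrame.merge` on toy data (all values hypothetical):

```python
import pandas as pd

occ_toy = pd.DataFrame({"eventID": ["s1_18S", "s2_18S", "s3_18S"],
                        "organismQuantity": [5, 3, 9]})
event_toy = pd.DataFrame({"eventID": ["s1_18S", "s2_18S"],
                          "decimalLatitude": [26.997, 27.5]})

# indicator=True adds a "_merge" column marking where each row was found
checked = occ_toy.merge(event_toy, how="left", on="eventID", indicator=True)

# occurrences with no matching event row are flagged as "left_only"
missing_events = checked.loc[checked["_merge"] == "left_only", "eventID"].unique()
print(list(missing_events))
```

An empty list here would mean every occurrence picked up its event metadata.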


In [212]:
occ18_merged.drop(columns=['DNA_sequence']).to_csv("../gomecc_v2_processed/occurrence_18S.tsv",sep="\t",index=False)

In [238]:
occ18_merged.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'occurrenceRemarks', 'associatedSequences', 'locationID', 'eventDate',
       'minimumDepthInMeters', 'locality', 'decimalLatitude',
       'decimalLongitude', 'samplingProtocol', 'waterBody',
       'maximumDepthInMeters', 'parentEventID', 'datasetID', 'geodeticDatum'],
      dtype='object')

In [None]:
# Unfinished, never-executed cell: sampleSize is derived from occurrenceRemarks once the 16S and 18S tables are combined.

### Combine 16S and 18S occurrence

In [233]:
occ18_merged.shape

(146232, 35)

In [234]:
occ_all = pd.concat([occ16_merged,occ18_merged],axis=0, ignore_index=True)

In [235]:
occ_all.shape

(311390, 35)

In [236]:
occ_all.drop(columns=['DNA_sequence']).to_csv("../gomecc_v2_processed/occurrence.csv",index=False)

In [None]:
# scratch test: take the last space-separated token of each string
temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
temp['ticker'].str.split(' ').str[-1]

In [251]:
# sampleSize is the read total stored after ": " in occurrenceRemarks (kept as a string)
occ_all['sampleSize'] = occ_all['occurrenceRemarks'].str.split(": ").str[1]


In [252]:
occ_all['sampleSizeUnit'] = 'DNA sequence reads'

In [253]:
occ_all.drop(columns=['DNA_sequence']).to_csv("../gomecc_v2_processed/occurrence_sampsizeAdded.csv",index=False)
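
Splitting on `": "` leaves `sampleSize` as a string and depends on the exact wording of `occurrenceRemarks`. As an alternative sketch, assuming the remark always ends with the read total, a regular expression can capture the trailing digits and convert them to integers:

```python
import pandas as pd

remarks = pd.Series([
    "Total sampleSize in DNA sequence reads: 9838",
    "Total sampleSize in DNA sequence reads: 120",  # hypothetical second value
])

# capture the trailing run of digits; expand=False returns a Series
sample_size = remarks.str.extract(r"(\d+)\s*$", expand=False).astype("int64")
print(sample_size.tolist())
```

Storing the value as an integer avoids surprises if the column is later sorted or summed.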

### DNA-derived

In [155]:
dna_dict = dwc_data['dna'].to_dict('index')

In [156]:
dna_dict.keys()

dict_keys(['eventID', 'occurrenceID', 'DNA_sequence', 'sop', 'nucl_acid_ext', 'samp_vol_we_dna_ext', 'samp_mat_process', 'nucl_acid_amp', 'target_gene', 'target_subfragment', 'ampliconSize', 'lib_layout', 'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond', 'seq_meth', 'otu_class_appr', 'otu_seq_comp_appr', 'otu_db', 'env_broad_scale', 'env_local_scale', 'env_medium', 'size_frac', 'concentration', 'concentrationUnit', 'samp_collect_device', 'source_mat_id'])
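
For readers unfamiliar with `to_dict('index')`: it produces a nested dictionary keyed first by index label (here the Darwin Core term) and then by column name. A small sketch on a toy table follows; the two rows mirror entries printed later in this notebook, but the table itself is a stand-in for `dwc_data['dna']`.

```python
import pandas as pd

# toy stand-in for dwc_data['dna'], indexed by DwC term
dna_toy = pd.DataFrame(
    {"AOML_term": ["samp_vol_we_dna_ext", "env_medium"],
     "AOML_file": ["sample_data", "sample_data"]},
    index=["samp_vol_we_dna_ext", "env_medium"],
)
dna_dict_toy = dna_toy.to_dict("index")

# outer keys are index labels (DwC terms); inner dicts map column name -> value
print(list(dna_dict_toy.keys()))
print(dna_dict_toy["env_medium"]["AOML_file"])
```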

##### sample_data

In [157]:
# list the DNA-derived terms that are sourced from sample_data
for key in dna_dict.keys():
    if dna_dict[key]['AOML_file'] == 'sample_data':
        print(key,dna_dict[key])

samp_vol_we_dna_ext {'AOML_term': 'samp_vol_we_dna_ext', 'AOML_file': 'sample_data', 'DwC_definition': 'Volume (ml) or mass (g) of total collected sample processed for DNA extraction.MIXS:0000111', 'Example': nan}
samp_mat_process {'AOML_term': 'samp_mat_process', 'AOML_file': 'sample_data', 'DwC_definition': 'Any processing applied to the sample during or after retrieving the sample from environment. This field accepts OBI, for a browser of OBI (v 2018-02-12) terms please see http://purl.bioontology.org/ontology/OBI', 'Example': nan}
env_broad_scale {'AOML_term': 'env_broad_scale', 'AOML_file': 'sample_data', 'DwC_definition': nan, 'Example': nan}
env_local_scale {'AOML_term': 'env_local_scale', 'AOML_file': 'sample_data', 'DwC_definition': nan, 'Example': nan}
env_medium {'AOML_term': 'env_medium', 'AOML_file': 'sample_data', 'DwC_definition': nan, 'Example': nan}
size_frac {'AOML_term': 'size_frac', 'AOML_file': 'sample_data', 'DwC_definition': 'Filtering pore size used in sample pr

In [158]:
# rename sample_data, prep_data, and analysis_data columns to fit the DwC standard
rename_dict = {}
for aoml_file in ['sample_data', 'prep_data', 'analysis_data']:
    for term in dna_dict.keys():
        if dna_dict[term]['AOML_file'] == aoml_file:
            # map the AOML column name to its DwC term
            rename_dict[dna_dict[term]['AOML_term']] = term

dna_sample = data['sample_data'].rename(columns=rename_dict).copy()
dna_prep = data['prep_data'].rename(columns=rename_dict).copy()
dna_analysis = data['analysis_data'].rename(columns=rename_dict).copy()

#dna_sample = dna_sample.drop(columns=[col for col in dna_sample if col not in rename_dict.values()])
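
The mapping built in this cell can be illustrated on a toy dictionary. The sketch below uses a dict comprehension and shows that `rename` leaves columns without a mapping untouched; every term, file name, and value in it is hypothetical.

```python
import pandas as pd

# toy mapping in the same shape as dna_dict (hypothetical entries)
dna_dict_toy = {
    "eventID": {"AOML_term": "sample_name", "AOML_file": "prep_data"},
    "seq_meth": {"AOML_term": "seq_method", "AOML_file": "prep_data"},
    "DNA_sequence": {"AOML_term": "derived", "AOML_file": "asv_table"},
}
source_files = {"sample_data", "prep_data", "analysis_data"}

# AOML column name -> DwC term, restricted to the three metadata sheets
rename_toy = {
    entry["AOML_term"]: dwc_term
    for dwc_term, entry in dna_dict_toy.items()
    if entry["AOML_file"] in source_files
}

prep_toy = pd.DataFrame({"sample_name": ["s1"], "seq_method": ["MiSeq"],
                         "platform": ["ILLUMINA"]})
renamed = prep_toy.rename(columns=rename_toy)
print(list(renamed.columns))
```

Because unmapped columns survive the rename, a later step is still needed to keep only the terms the DNA-derived extension expects.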

In [159]:
dna_16 = dna_sample[dna_sample['amplicon_sequenced'].str.contains('16S rRNA')].copy()
dna_16['eventID'] = dna_16['eventID']+"_16S"
dna_16.head()

Unnamed: 0,eventID,serial_number,cruise_id,line_id,station,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,...,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,waterBody
0,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_001,GOMECC-4 (2021),27N,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
1,GOMECC4_27N_Sta1_Deep_B_16S,GOMECC4_002,GOMECC-4 (2021),27N,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
2,GOMECC4_27N_Sta1_Deep_C_16S,GOMECC4_003,GOMECC-4 (2021),27N,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
3,GOMECC4_27N_Sta1_DCM_A_16S,GOMECC4_004,GOMECC-4 (2021),27N,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean
4,GOMECC4_27N_Sta1_DCM_B_16S,GOMECC4_005,GOMECC-4 (2021),27N,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean


In [160]:
dna_18 = dna_sample[dna_sample['amplicon_sequenced'].str.contains('18S rRNA')].copy()
dna_18['eventID'] = dna_18['eventID']+"_18S"
dna_18.head()

Unnamed: 0,eventID,serial_number,cruise_id,line_id,station,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,...,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,waterBody
0,GOMECC4_27N_Sta1_Deep_A_18S,GOMECC4_001,GOMECC-4 (2021),27N,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
1,GOMECC4_27N_Sta1_Deep_B_18S,GOMECC4_002,GOMECC-4 (2021),27N,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
2,GOMECC4_27N_Sta1_Deep_C_18S,GOMECC4_003,GOMECC-4 (2021),27N,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
3,GOMECC4_27N_Sta1_DCM_A_18S,GOMECC4_004,GOMECC-4 (2021),27N,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean
4,GOMECC4_27N_Sta1_DCM_B_18S,GOMECC4_005,GOMECC-4 (2021),27N,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean


In [161]:
dna_sample = pd.concat([dna_16,dna_18],axis=0,ignore_index=True)
dna_sample.head()

Unnamed: 0,eventID,serial_number,cruise_id,line_id,station,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,...,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,waterBody
0,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_001,GOMECC-4 (2021),27N,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
1,GOMECC4_27N_Sta1_Deep_B_16S,GOMECC4_002,GOMECC-4 (2021),27N,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
2,GOMECC4_27N_Sta1_Deep_C_16S,GOMECC4_003,GOMECC-4 (2021),27N,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,...,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,Atlantic Ocean
3,GOMECC4_27N_Sta1_DCM_A_16S,GOMECC4_004,GOMECC-4 (2021),27N,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean
4,GOMECC4_27N_Sta1_DCM_B_16S,GOMECC4_005,GOMECC-4 (2021),27N,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,...,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,Atlantic Ocean
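
The per-marker filter, suffix, and concatenate pattern used above can be sketched as one helper applied to each marker. The function name `subset_by_marker`, the toy rows, and the `" | "` separator inside `amplicon_sequenced` are assumptions made for illustration.

```python
import pandas as pd

def subset_by_marker(df, marker_col, marker, suffix):
    """Return the rows whose marker_col mentions `marker`, with eventID suffixed."""
    out = df[df[marker_col].str.contains(marker, regex=False)].copy()
    out["eventID"] = out["eventID"] + suffix
    return out

sample_toy = pd.DataFrame({
    "eventID": ["s1", "s2"],
    "amplicon_sequenced": ["16S rRNA | 18S rRNA", "16S rRNA"],
})
markers = {"16S rRNA": "_16S", "18S rRNA": "_18S"}

combined = pd.concat(
    [subset_by_marker(sample_toy, "amplicon_sequenced", m, s) for m, s in markers.items()],
    axis=0, ignore_index=True,
)
print(combined["eventID"].tolist())
```

Taking a `.copy()` of each subset before assigning to `eventID` avoids pandas' chained-assignment warning and leaves the source table unmodified.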


In [162]:
prep_16 = dna_prep[dna_prep['target_gene'].str.contains('16S rRNA')].copy()
prep_16['eventID'] = prep_16['eventID']+"_16S"
prep_16.head()

Unnamed: 0,eventID,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,...,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode
4,GOMECC4_BROWNSVILLE_Sta66_DCM_B_16S,GOMECC16S_Plate1_1,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
6,GOMECC4_GALVESTON_Sta54_DCM_B_16S,GOMECC16S_Plate1_10,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
8,GOMECC4_GALVESTON_Sta54_Deep_A_16S,GOMECC16S_Plate1_11,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
10,GOMECC4_GALVESTON_Sta49_Deep_A_16S,GOMECC16S_Plate1_12,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
12,GOMECC4_BROWNSVILLE_Sta66_DCM_C_16S,GOMECC16S_Plate1_13,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided


In [163]:
prep_18 = dna_prep[dna_prep['target_gene'].str.contains('18S rRNA')].copy()
prep_18['eventID'] = prep_18['eventID']+"_18S"
prep_18.head()

Unnamed: 0,eventID,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,...,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode
1,GOMECC4_27N_Sta1_DCM_A_18S,GOMECC18S_Plate4_53,18S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
3,GOMECC4_27N_Sta1_DCM_B_18S,GOMECC18S_Plate4_46,18S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
5,GOMECC4_27N_Sta1_DCM_C_18S,GOMECC18S_Plate4_54,18S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
7,GOMECC4_27N_Sta1_Deep_A_18S,GOMECC18S_Plate4_52,18S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
9,GOMECC4_27N_Sta1_Deep_B_18S,GOMECC18S_Plate4_60,18S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided


In [164]:
dna_prep = pd.concat([prep_16,prep_18],axis=0,ignore_index=True)
dna_prep.head()

Unnamed: 0,eventID,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,...,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode
0,GOMECC4_BROWNSVILLE_Sta66_DCM_B_16S,GOMECC16S_Plate1_1,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
1,GOMECC4_GALVESTON_Sta54_DCM_B_16S,GOMECC16S_Plate1_10,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
2,GOMECC4_GALVESTON_Sta54_Deep_A_16S,GOMECC16S_Plate1_11,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
3,GOMECC4_GALVESTON_Sta49_Deep_A_16S,GOMECC16S_Plate1_12,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided
4,GOMECC4_BROWNSVILLE_Sta66_DCM_C_16S,GOMECC16S_Plate1_13,16S amplicon metabarcoding of marine metagenom...,AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Steri...,...,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided


In [165]:
# merge prep and sample
dna = dna_sample.merge(dna_prep, how='outer', on='eventID')
dna = dna.merge(dna_analysis,how='outer',on='target_gene')
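An outer merge keeps rows whose `eventID` appears in only one sheet, so mismatched identifiers can pass unnoticed. The following is a minimal sketch, on hypothetical toy tables (`sample`, `prep`) rather than the real sheets, of how pandas' `indicator=True` option might be used to list such rows.

```python
import pandas as pd

# Hypothetical miniature versions of the sample and prep sheets.
sample = pd.DataFrame({
    "eventID": ["S1_16S", "S2_16S", "S3_16S"],
    "env_medium": ["sea water", "sea water", "sea water"],
})
prep = pd.DataFrame({
    "eventID": ["S1_16S", "S2_16S", "S4_16S"],
    "target_gene": ["16S rRNA", "16S rRNA", "16S rRNA"],
})

# indicator=True adds a _merge column recording where each row came from.
merged = sample.merge(prep, how="outer", on="eventID", indicator=True)

# Rows present in only one sheet point to eventIDs that did not line up.
unmatched = merged.loc[merged["_merge"] != "both", ["eventID", "_merge"]]
print(unmatched)
```

An empty `unmatched` table would suggest that every sample has a matching library prep row and vice versa.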

In [166]:
rename_dict.values()

dict_values(['samp_vol_we_dna_ext', 'samp_mat_process', 'env_broad_scale', 'env_local_scale', 'env_medium', 'size_frac', 'concentration', 'concentrationUnit', 'samp_collect_device', 'source_mat_id', 'eventID', 'nucl_acid_ext', 'nucl_acid_amp', 'target_gene', 'target_subfragment', 'lib_layout', 'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond', 'seq_meth', 'sop', 'ampliconSize', 'otu_class_appr', 'otu_seq_comp_appr', 'otu_db'])

In [167]:
[col for col in dna if col not in rename_dict.values()]

['serial_number',
 'cruise_id',
 'line_id',
 'station',
 'ctd_bottle_no',
 'sample_replicate',
 'biological_replicates',
 'extract_number',
 'sample_title',
 'bioproject_accession',
 'biosample_accession_x',
 'study_id',
 'study_title',
 'amplicon_sequenced',
 'metagenome_sequenced',
 'organism',
 'collection_date_local',
 'collection_date',
 'depth',
 'geo_loc_name',
 'lat_lon',
 'decimalLatitude',
 'decimalLongitude',
 'sample_type',
 'collection_method',
 'basisOfRecord',
 'cluster_16s',
 'cluster_18s',
 'notes_sampling',
 'notes_bottle_metadata',
 'line_position',
 'offshore_inshore_200m_isobath',
 'depth_category',
 'ocean_acidification_status',
 'seascape_class',
 'seascape_probability',
 'seascape_window',
 'dna_sample_number',
 'dna_yield',
 'extraction_plate_name',
 'extraction_well_number',
 'extraction_well_position',
 'ship_crs_expocode',
 'woce_sect',
 'ammonium',
 'carbonate',
 'diss_inorg_carb',
 'diss_oxygen',
 'fluor',
 'hydrogen_ion',
 'nitrate',
 'nitrite',
 'nitrate

In [168]:
dna = dna.drop(columns=[col for col in dna if col not in rename_dict.values()])

In [169]:
dna.tail()

Unnamed: 0,eventID,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collect_device,samp_mat_process,size_frac,concentration,...,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db
939,GOMECC4_CAPECORAL_Sta141_DCM_B_18S,GOMECC4_CAPECORAL_Sta141_DCM,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2040 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,1.634 ng/µl,...,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...
940,GOMECC4_CAPECORAL_Sta141_DCM_C_18S,GOMECC4_CAPECORAL_Sta141_DCM,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2080 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,2.307 ng/µl,...,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...
941,GOMECC4_CAPECORAL_Sta141_Surface_A_18S,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2100 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,1.286 ng/µl,...,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...
942,GOMECC4_CAPECORAL_Sta141_Surface_B_18S,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2000 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,1.831 ng/µl,...,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...
943,GOMECC4_CAPECORAL_Sta141_Surface_C_18S,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2000 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,1.849 ng/µl,...,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75...,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zen...


#### merge with occurrenceID

In [170]:
dna.shape

(944, 25)

In [171]:
dna_occ = dna.merge(occ_all[['eventID','occurrenceID','DNA_sequence']],how='left',on='eventID')

In [172]:
dna_occ.shape

(311390, 27)
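The row count grows from 944 to 311,390 because each `eventID` in the metadata table matches many occurrence rows, so the metadata is repeated once per occurrence. The sketch below uses hypothetical toy tables (`dna_small`, `occ_small`) to illustrate this, and shows how the `validate` argument of `merge` could guard against accidental duplicates on the metadata side.

```python
import pandas as pd

# Hypothetical toy tables: one metadata row per eventID, many occurrences per eventID.
dna_small = pd.DataFrame({"eventID": ["E1", "E2"], "target_gene": ["16S rRNA", "18S rRNA"]})
occ_small = pd.DataFrame({
    "eventID": ["E1", "E1", "E1", "E2"],
    "occurrenceID": ["E1_occA", "E1_occB", "E1_occC", "E2_occA"],
})

# validate="one_to_many" raises pandas.errors.MergeError if eventID is duplicated on the left side.
out = dna_small.merge(occ_small, how="left", on="eventID", validate="one_to_many")

# Each metadata row is repeated once per matching occurrence.
print(dna_small.shape, "->", out.shape)
```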

In [173]:
dna_occ.head()

Unnamed: 0,eventID,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collect_device,samp_mat_process,size_frac,concentration,...,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,occurrenceID,DNA_sequence
0,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038 ng/µl,...,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ009257b156ab4a9...,TACGAGGGGTGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTT...
1,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038 ng/µl,...,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ01398067b1d323b...,TACGGAGGGTGCAAGCGTTGTTCGGAATTATTGGGCGTAAAGCGGA...
2,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038 ng/µl,...,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ01770ea2fb7f041...,TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTC...
3,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038 ng/µl,...,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ017dbdc8b62705b...,TACTAGGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTG...
4,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038 ng/µl,...,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ069f375524db781...,TACGTAGGAGGCTAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAG...


In [178]:
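# remove the unit suffix as a literal string (str.strip would treat it as a set of characters)
dna_occ['concentration'] = dna_occ['concentration'].str.replace(" ng/µl", "", regex=False)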
dna_occ['concentrationUnit'] = "ng/µl"
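Note that `Series.str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as a literal suffix. As an alternative, here is a minimal sketch, on hypothetical values (`conc`), of splitting a "number unit" string into two columns with `str.extract`; the regular expression is an assumption about the value format.

```python
import pandas as pd

# Hypothetical values in the same "<number> <unit>" shape as the concentration column.
conc = pd.Series(["0.08038 ng/µl", "1.634 ng/µl", "2.307 ng/µl"])

# Capture the numeric part and the unit part separately.
parts = conc.str.extract(r"^\s*(?P<value>[0-9.]+)\s*(?P<unit>\S+)\s*$")
parts["value"] = parts["value"].astype(float)
print(parts)
```

Values that do not match the pattern come back as missing, which makes malformed entries easy to spot.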

In [179]:
# check if all DwC terms are in dna file
for key in dna_dict.keys():
    if key not in dna_occ.columns:
        print(key,dna_dict[key])

sop {'AOML_term': 'sop', 'AOML_file': 'analysis_data', 'DwC_definition': 'Standard operating procedures used in assembly and/or annotation of genomes, metagenomes or environmental sequences. Or A reference to a well documented protocol, e.g. using protocols.io', 'Example': nan}
otu_class_appr {'AOML_term': 'derived: cluster_method, pid_clustering', 'AOML_file': 'analysis_data', 'DwC_definition': 'Approach/algorithm when defining OTUs or ASVs, include version and parameters separated by semicolons', 'Example': '"dada2; 1.14.0; ASV"'}


In [180]:
data['analysis_data']['cluster_method'][0]

'Tourmaline; qiime2-2021.2; dada2'

In [181]:
dna_occ['seq_meth'] = 'Illumina MiSeq 2x250'
dna_occ['otu_class_appr']= data['analysis_data']['cluster_method'][0]+"; "+data['analysis_data']['pid_clustering'][0]

In [183]:
# check if all DwC terms are in dna file
for key in dna_dict.keys():
    if key not in dna_occ.columns:
        print(key,dna_dict[key])

sop {'AOML_term': 'sop', 'AOML_file': 'analysis_data', 'DwC_definition': 'Standard operating procedures used in assembly and/or annotation of genomes, metagenomes or environmental sequences. Or A reference to a well documented protocol, e.g. using protocols.io', 'Example': nan}


In [184]:
dna_occ.head()

Unnamed: 0,eventID,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collect_device,samp_mat_process,size_frac,concentration,...,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,occurrenceID,DNA_sequence,concentrationUnit,otu_class_appr
0,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038,...,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ009257b156ab4a9...,TACGAGGGGTGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTT...,ng/µl,Tourmaline; qiime2-2021.2; dada2; ASV
1,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038,...,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ01398067b1d323b...,TACGGAGGGTGCAAGCGTTGTTCGGAATTATTGGGCGTAAAGCGGA...,ng/µl,Tourmaline; qiime2-2021.2; dada2; ASV
2,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038,...,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ01770ea2fb7f041...,TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTC...,ng/µl,Tourmaline; qiime2-2021.2; dada2; ASV
3,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038,...,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ017dbdc8b62705b...,TACTAGGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTG...,ng/µl,Tourmaline; qiime2-2021.2; dada2; ASV
4,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using...,0.22 µm,0.08038,...,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75...,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; ...,GOMECC4_27N_Sta1_Deep_A_16S_occ069f375524db781...,TACGTAGGAGGCTAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAG...,ng/µl,Tourmaline; qiime2-2021.2; dada2; ASV


In [185]:
dna_occ.to_csv("../gomecc_v2_processed/dna-derived.csv",index=False)

I also wanted to carry the original name from the NCBI Taxonomy database into the Darwin Core-converted data set. To do this, I queried the NCBI taxonomy database with each name from the original data to obtain its taxonomy ID (taxID).

In [189]:
## Get set up to query NCBI taxonomy 

from Bio import Entrez

# ----- Insert your email here -----
Entrez.email = 'ksil91@gmail.com'
# ----------------------------------

# Get list of all databases available through this tool
record = Entrez.read(Entrez.einfo())
all_dbs = record['DbList']
all_dbs

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [210]:
import multiprocess as mp
import numpy as np

def get_ncbi_txid_from_name_parallel(all_names,n_proc=0):
    """Look up NCBI taxIDs for the unique names in all_names using several processes."""
    all_names = all_names.unique()
    queue = mp.Queue()
    # n_proc=0 means: create as many processes as there are CPUs on your machine
    num_processes = mp.cpu_count() if n_proc == 0 else n_proc

    # split the names into near-equal chunks so that no name is dropped
    # when the count is not a multiple of num_processes
    name_chunks = np.array_split(all_names, num_processes)
    procs = []
    for name_chunk in name_chunks:
        proc = mp.Process(
            target=get_ncbi_txid_from_name,
            args=(name_chunk,queue)
        )
        procs.append(proc)
        proc.start()

    # collect one result dictionary per process
    names_dict = {}
    for _ in procs:
        names_dict.update(queue.get())

    for proc in procs:
        proc.join()

    return names_dict
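When names are divided across processes, every name should land in exactly one chunk, including the remainder when the count is not a multiple of the number of processes. The following is a minimal pure-Python sketch of such a split; `split_evenly` is a hypothetical helper, not part of the notebook.

```python
def split_evenly(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    items = list(items)
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

names = ["Clupeidae", "Eucinostomus gula", "Oncorhynchus gorbuscha",
         "Lutjanus griseus", "Cheilopogon agoo", "Scomber scombrus",
         "Haemulon plumierii"]
chunks = split_evenly(names, 3)
print([len(c) for c in chunks])
```

If there are more chunks than names, the extra chunks are simply empty, so a worker that receives one would return an empty dictionary.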


In [177]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS
def get_ncbi_txid_from_name(names,queue):
    
    name_ncbiid_dict = {}
    for name in names:
        if name not in name_ncbiid_dict.keys():
            taxon2 = '"' + name + '"'
            handle = Entrez.esearch(db='taxonomy', retmax=10, term=taxon2)
            record = Entrez.read(handle)
            if not record['IdList'] or not record['IdList'][0]:
                print(name+": not found")
                name_ncbiid_dict[name] = "None"
            else:
                name_ncbiid_dict[name] = record['IdList'][0]
            handle.close()
    queue.put(name_ncbiid_dict)


In [186]:
n =worms_12s['verbatimIdentification'][0:10]
new = get_ncbi_txid_from_name_parallel(n,7)

In [188]:
n

0                 Clupeidae
1         Eucinostomus gula
2    Oncorhynchus gorbuscha
3          Lutjanus griseus
4         Eucinostomus gula
5          Cheilopogon agoo
6    Oncorhynchus gorbuscha
7    Oncorhynchus gorbuscha
8          Scomber scombrus
9        Haemulon plumierii
Name: verbatimIdentification, dtype: object

In [187]:
new

{'Oncorhynchus gorbuscha': '8017',
 'Scomber scombrus': '13677',
 'Eucinostomus gula': '435273',
 'Haemulon plumierii': '334415',
 'Clupeidae': '55118',
 'Cheilopogon agoo': '123223',
 'Lutjanus griseus': '40503'}

In [None]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS

name_ncbiid_dict = {}

for name in worms_12s['verbatimIdentification']:
    if name not in name_ncbiid_dict.keys():
        taxon2 = '"' + name + '"'
        handle = Entrez.esearch(db='taxonomy', retmax=10, term=taxon2)
        record = Entrez.read(handle)
        if not record['IdList'] or not record['IdList'][0]:
            print(name+": not found")
            name_ncbiid_dict[name] = "None"
        else:
            name_ncbiid_dict[name] = record['IdList'][0]
        handle.close()


Names reported as "not found" were looked up by hand in the NCBI Taxonomy database to get their taxIDs, which are added to the dictionary below.
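Before searching by hand, it may help to list which names were left unresolved. This is a minimal sketch on a hypothetical dictionary (`name_to_txid`) that follows the same convention as the lookup code, where the string `"None"` marks a name with no match.

```python
# Hypothetical lookup result; "None" (a string) marks names the search did not resolve.
name_to_txid = {
    "Clupeidae": "55118",
    "Lobianchia sp. CBM:ZF:14789": "None",
    "Scomber scombrus": "13677",
}

# Collect the unresolved names so they can be searched manually.
not_found = sorted(name for name, txid in name_to_txid.items() if txid == "None")
print(not_found)
```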

In [157]:
name_ncbiid_dict['Lobianchia sp. CBM:ZF:14789'] = '2057084'
name_ncbiid_dict['Brotula sp. CBM:ZF:20276']= '2768194'
name_ncbiid_dict['Scopelarchus sp. CBM:DNA No. 2000-014']= '2608515'

'None'

In [153]:
list(name_ncbiid_dict.keys())[0:10]

['Clupeidae',
 'Eucinostomus gula',
 'Oncorhynchus gorbuscha',
 'Lutjanus griseus',
 'Cheilopogon agoo',
 'Scomber scombrus',
 'Haemulon plumierii',
 'Archosargus probatocephalus',
 'Actinopteri',
 'Carangidae']

In [154]:
name_ncbiid_dict['Clupeidae']

'55118'

In [156]:
## Add NCBI taxonomy IDs under taxonConceptID

# Map indicators that say no taxonomy was assigned to empty strings
#name_ncbiid_dict['unassigned'], name_ncbiid_dict['s_'], name_ncbiid_dict['no_hit'], name_ncbiid_dict['unknown'], name_ncbiid_dict['g_'] = '', '', '', '', ''

# Map each verbatim name to its NCBI taxonomy ID;
# names with no match ("None" or missing from the dictionary) become empty strings
txids = worms_12s['verbatimIdentification'].map(name_ncbiid_dict).fillna('').replace('None', '')

# Add the 'NCBI:txid' prefix, leaving unresolved names blank
worms_12s['taxonConceptID'] = ('NCBI:txid' + txids).where(txids != '', '')
worms_12s.head()

Unnamed: 0,verbatimIdentification,asv,rank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonConceptID
0,Clupeidae,951a3746cb4c488d657fe3c64bdd3d75,family,Clupeidae,Clupeidae,urn:lsid:marinespecies.org:taxname:125464,Animalia,Chordata,Teleostei,Clupeiformes,Clupeidae,,NCBI:txid55118
1,Eucinostomus gula,f278a917dedcc2e0434be6ecb605a642,species,Eucinostomus gula,Eucinostomus gula,urn:lsid:marinespecies.org:taxname:159733,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Gerreidae,Eucinostomus,NCBI:txid435273
2,Oncorhynchus gorbuscha,1d0fcad8f69709ed9ea8dd6e6a3c627a,species,Oncorhynchus gorbuscha,Oncorhynchus gorbuscha,urn:lsid:marinespecies.org:taxname:127182,Animalia,Chordata,Teleostei,Salmoniformes,Salmonidae,Oncorhynchus,NCBI:txid8017
3,Lutjanus griseus,cc6451b251afdd97ef4f11e47abc53a7,species,Lutjanus griseus,Lutjanus griseus,urn:lsid:marinespecies.org:taxname:159797,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Lutjanidae,Lutjanus,NCBI:txid40503
4,Eucinostomus gula,d16bc4ff5741c0ba4a5d0bb2e21e5589,species,Eucinostomus gula,Eucinostomus gula,urn:lsid:marinespecies.org:taxname:159733,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Gerreidae,Eucinostomus,NCBI:txid435273


#### Merge Occurrence and worms

In [193]:
# Get identificationRemarks
occ12_test = occ['12S rRNA'].copy()
occ12_test.drop(columns=['domain','phylum','class','order','family','genus','species'],inplace=True)
#worms_12s.drop(columns=['old name'],inplace=True)

occ12_test = occ12_test.merge(worms_12s, how='left', left_on ='featureid', right_on='asv')
occ12_test.drop(columns='asv', inplace=True)
occ12_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,rank,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonConceptID
0,4008ca75e225240b385ec167d0e0c9b4,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCAA...,Eukaryota;Chordata;Actinopteri;Mugiliformes;Mu...,0.920924,SEAMAP2019_Brownsville_Sta103_39m_A,29,SEAMAP2019_Brownsville_Sta103_39m_A_occ4008ca7...,Mugil curema,species,Mugil curema,urn:lsid:marinespecies.org:taxname:159416,Animalia,Chordata,Teleostei,Mugiliformes,Mugilidae,Mugil,NCBI:txid48194
1,b8b3d9e1b83477e4927fd8d4c6f1cd2e,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCTA...,Eukaryota;Chordata;Actinopteri;Salmoniformes;S...,0.912878,SEAMAP2019_Brownsville_Sta103_39m_A,4041,SEAMAP2019_Brownsville_Sta103_39m_A_occb8b3d9e...,Oncorhynchus gorbuscha,species,Oncorhynchus gorbuscha,urn:lsid:marinespecies.org:taxname:127182,Animalia,Chordata,Teleostei,Salmoniformes,Salmonidae,Oncorhynchus,NCBI:txid8017
2,bfe5abf81199e96bbc51918d76b77e8c,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGACCCAA...,Eukaryota;Chordata;Actinopteri;Lutjaniformes;L...,0.99183,SEAMAP2019_Brownsville_Sta103_39m_A,37,SEAMAP2019_Brownsville_Sta103_39m_A_occbfe5abf...,Lutjanus griseus,species,Lutjanus griseus,urn:lsid:marinespecies.org:taxname:159797,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Lutjanidae,Lutjanus,NCBI:txid40503
3,208ffed2e333d449c29995ca62b808cc,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCAA...,Eukaryota;Chordata;Actinopteri;NA;Opistognathi...,0.999966,SEAMAP2019_Brownsville_Sta103_39m_A,47,SEAMAP2019_Brownsville_Sta103_39m_A_occ208ffed...,Lonchopisthus micrognathus,species,Lonchopisthus micrognathus,urn:lsid:marinespecies.org:taxname:281386,Animalia,Chordata,Teleostei,Ovalentaria incertae sedis,Opistognathidae,Lonchopisthus,NCBI:txid1311555
4,2f6cd78f1f099544eb8f1b252fba924c,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGGGGCCCAA...,Eukaryota;Chordata;Actinopteri;Spariformes;Spa...,0.993729,SEAMAP2019_Brownsville_Sta103_39m_A,4,SEAMAP2019_Brownsville_Sta103_39m_A_occ2f6cd78...,Archosargus probatocephalus,species,Archosargus probatocephalus,urn:lsid:marinespecies.org:taxname:159238,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Sparidae,Archosargus,NCBI:txid119682


In [194]:
occ12_test.to_csv("occ12_test.tsv",sep="\t")

#### identificationRemarks

In [207]:
data['analysis_data'].head()

Unnamed: 0,target_gene,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,bioproject_accession,sop,identificationReferences
0,16S rRNA,X,"q2-cutadapt v2023.5 (--p-front-f FWD primer, -...",test,test,test,test,https://github.com/ksilnoaa/seamap-edna,test,test,test
1,18S rRNA,X,"q2-cutadapt v2023.5 (--p-front-f FWD primer, -...",test,test,test,test,https://github.com/ksilnoaa/seamap-edna,test,test,test
2,12S rRNA,X,"q2-cutadapt v2023.5 (--p-front-f FWD primer, -...",test,test,test,test,https://github.com/ksilnoaa/seamap-edna,test,test,test


In [228]:
occ12_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '12S rRNA','taxa_class_method'].item()
occ12_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '12S rRNA','taxa_ref_db'].item()

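# astype(str) converts each confidence value separately; str() on the column would embed the whole Series in every row
occ12_test['identificationRemarks'] = occ12_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ12_test['Confidence'].astype(str) +", against reference database: "+occ12_test['taxa_ref_db']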

In [231]:
occ12_test['identificationRemarks'].head()

0    test, confidence (at lowest specified taxon): ...
1    test, confidence (at lowest specified taxon): ...
2    test, confidence (at lowest specified taxon): ...
3    test, confidence (at lowest specified taxon): ...
4    test, confidence (at lowest specified taxon): ...
Name: identificationRemarks, dtype: object
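Each remark is meant to carry the confidence value of its own row. The sketch below, on a hypothetical two-row table with placeholder method and database names, shows one way to build the string element-wise with `astype(str)`.

```python
import pandas as pd

# Hypothetical rows; the method and database strings are placeholders.
df = pd.DataFrame({
    "Confidence": [0.920924, 0.99183],
    "taxa_class_method": ["method-x", "method-x"],
    "taxa_ref_db": ["db-y", "db-y"],
})

# astype(str) converts each value separately, so every row gets its own confidence.
# By contrast, str(df["Confidence"]) returns the printed form of the whole column.
df["identificationRemarks"] = (
    df["taxa_class_method"]
    + ", confidence (at lowest specified taxon): "
    + df["Confidence"].astype(str)
    + ", against reference database: "
    + df["taxa_ref_db"]
)
print(df["identificationRemarks"].tolist())
```

Printing the full list, rather than relying on the truncated `head()` display, makes it easier to confirm that the values differ between rows.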

#### now convert other occurrence info

In [None]:
occ12_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '12S rRNA','taxa_class_method'].item()
occ12_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['target_gene'] == '12S rRNA','taxa_ref_db'].item()

# astype(str) converts each confidence value separately; str() on the column would embed the whole Series in every row
occ12_test['identificationRemarks'] = occ12_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ12_test['Confidence'].astype(str) +", against reference database: "+occ12_test['taxa_ref_db']
# drop the helper columns that are now folded into identificationRemarks
occ12_test.drop(columns=['Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

In [241]:
occ12_test['identificationRemarks'].head()

0    test, confidence (at lowest specified taxon): ...
1    test, confidence (at lowest specified taxon): ...
2    test, confidence (at lowest specified taxon): ...
3    test, confidence (at lowest specified taxon): ...
4    test, confidence (at lowest specified taxon): ...
Name: identificationRemarks, dtype: object

#### taxonID

In [242]:
occ12_test['taxonID'] = 'ASV:'+occ12_test['featureid']
occ12_test['basisOfRecord'] = 'MaterialSample'
occ12_test['eventID']

In [246]:
occ12_test.head()

Unnamed: 0,featureid,sequence,taxonomy,eventID,organismQuantity,occurrenceID,verbatimIdentification,rank,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonConceptID,identificationRemarks,taxonID
0,4008ca75e225240b385ec167d0e0c9b4,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCAA...,Eukaryota;Chordata;Actinopteri;Mugiliformes;Mu...,SEAMAP2019_Brownsville_Sta103_39m_A,29,SEAMAP2019_Brownsville_Sta103_39m_A_occ4008ca7...,Mugil curema,species,Mugil curema,urn:lsid:marinespecies.org:taxname:159416,Animalia,Chordata,Teleostei,Mugiliformes,Mugilidae,Mugil,NCBI:txid48194,"test, confidence (at lowest specified taxon): ...",ASV:4008ca75e225240b385ec167d0e0c9b4
1,b8b3d9e1b83477e4927fd8d4c6f1cd2e,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCTA...,Eukaryota;Chordata;Actinopteri;Salmoniformes;S...,SEAMAP2019_Brownsville_Sta103_39m_A,4041,SEAMAP2019_Brownsville_Sta103_39m_A_occb8b3d9e...,Oncorhynchus gorbuscha,species,Oncorhynchus gorbuscha,urn:lsid:marinespecies.org:taxname:127182,Animalia,Chordata,Teleostei,Salmoniformes,Salmonidae,Oncorhynchus,NCBI:txid8017,"test, confidence (at lowest specified taxon): ...",ASV:b8b3d9e1b83477e4927fd8d4c6f1cd2e
2,bfe5abf81199e96bbc51918d76b77e8c,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGACCCAA...,Eukaryota;Chordata;Actinopteri;Lutjaniformes;L...,SEAMAP2019_Brownsville_Sta103_39m_A,37,SEAMAP2019_Brownsville_Sta103_39m_A_occbfe5abf...,Lutjanus griseus,species,Lutjanus griseus,urn:lsid:marinespecies.org:taxname:159797,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Lutjanidae,Lutjanus,NCBI:txid40503,"test, confidence (at lowest specified taxon): ...",ASV:bfe5abf81199e96bbc51918d76b77e8c
3,208ffed2e333d449c29995ca62b808cc,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGAGGCCCAA...,Eukaryota;Chordata;Actinopteri;NA;Opistognathi...,SEAMAP2019_Brownsville_Sta103_39m_A,47,SEAMAP2019_Brownsville_Sta103_39m_A_occ208ffed...,Lonchopisthus micrognathus,species,Lonchopisthus micrognathus,urn:lsid:marinespecies.org:taxname:281386,Animalia,Chordata,Teleostei,Ovalentaria incertae sedis,Opistognathidae,Lonchopisthus,NCBI:txid1311555,"test, confidence (at lowest specified taxon): ...",ASV:208ffed2e333d449c29995ca62b808cc
4,2f6cd78f1f099544eb8f1b252fba924c,GCCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGGGGCCCAA...,Eukaryota;Chordata;Actinopteri;Spariformes;Spa...,SEAMAP2019_Brownsville_Sta103_39m_A,4,SEAMAP2019_Brownsville_Sta103_39m_A_occ2f6cd78...,Archosargus probatocephalus,species,Archosargus probatocephalus,urn:lsid:marinespecies.org:taxname:159238,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Sparidae,Archosargus,NCBI:txid119682,"test, confidence (at lowest specified taxon): ...",ASV:2f6cd78f1f099544eb8f1b252fba924c


### merge 12S and 18S occurrence

In [232]:
occ_dict = dwc_data['occurrence'].to_dict('index')

In [233]:
occ_dict.keys()

dict_keys(['eventID', 'occurrenceID', 'basisOfRecord', 'eventDate', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'scientificName', 'scientificNameID', 'taxonID', 'nameAccordingTo', 'taxonConceptID', 'verbatimIdentification', 'identificationRemarks', 'identificationReferences', 'taxonRank', 'organismQuantity', 'organismQuantityType', 'associatedSequences', 'materialSampleID'])

In [247]:
# check if all occurrence file terms are already columns in occ12_test
for key in occ_dict.keys():
    if key not in occ12_test.columns:
        print(key,occ_dict[key])

basisOfRecord {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': nan, 'Example': nan, 'notes': nan}
eventDate {'AOML_term': 'collection_date', 'AOML_file': 'sample_data', 'DwC_definition': nan, 'Example': nan, 'notes': nan}
nameAccordingTo {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': nan, 'Example': nan, 'notes': nan}
identificationReferences {'AOML_term': 'identificationReferences', 'AOML_file': 'analysis_data', 'DwC_definition': 'A list (concatenated and separated) of references (publication, global unique identifier, URI) used in the Identification. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).', 'Example': nan, 'notes': nan}
taxonRank {'AOML_term': 'derived', 'AOML_file': 'tourmaline', 'DwC_definition': 'The taxonomic rank of the most specific name in the dwc:scientificName. Recommended best practice is to use a controlled vocabulary. The taxon ranks of algae, fungi and plants are defined in the 

In [116]:
occ_dict['scientificNameID']

{'AOML_term': 'derived',
 'AOML_file': 'tourmaline',
 'DwC_definition': 'The scientific name ID of "Balaenoptera musculus" as per the WoRMS database.',
 'Example': nan,
 'notes': nan}

In [254]:
occ_dict

{'eventID': {'AOML_term': 'sample_name',
  'AOML_file': 'sample_data',
  'DwC_definition': nan,
  'Example': nan,
  'notes': nan},
 'occurrenceID': {'AOML_term': 'derived: sample_name, featureid',
  'AOML_file': 'tourmaline, sample_data',
  'DwC_definition': 'A unique identifier for the occurrence, allowing the same occurrence to be recognized across dataset versions as well as through data downloads and use. May be a global unique identifier or an identifier specific to the data set.',
  'Example': nan,
  'notes': 'unique: with eventID, sample ID and ASV, generated in script'},
 'basisOfRecord': {'AOML_term': 'none',
  'AOML_file': 'pipeline',
  'DwC_definition': nan,
  'Example': nan,
  'notes': nan},
 'eventDate': {'AOML_term': 'collection_date',
  'AOML_file': 'sample_data',
  'DwC_definition': nan,
  'Example': nan,
  'notes': nan},
 'kingdom': {'AOML_term': 'derived',
  'AOML_file': 'tourmaline',
  'DwC_definition': nan,
  'Example': nan,
  'notes': nan},
 'phylum': {'AOML_term':

In [127]:
# check if all event file terms sourced from sample_data are present in its columns
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] == 'sample_data':
        if event_dict[key]['AOML_term'] not in data['sample_data'].columns:
            print(key,event_dict[key])

waterBody {'AOML_term': 'waterBody', 'AOML_file': 'sample_data', 'DwC_definition': 'The name of the water body in which the dcterms:Location occurs.         Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.'}


In [128]:
# waterBody is not in sample_data, so drop it from the event terms
event_dict.pop('waterBody', None)

{'AOML_term': 'waterBody',
 'AOML_file': 'sample_data',
 'DwC_definition': 'The name of the water body in which the dcterms:Location occurs.         Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.'}

In [143]:
# rename sample_data columns to fit DwC standard
gen = (x for x in event_dict.keys() if event_dict[x]['AOML_file'] == 'sample_data')
rename_dict = {}
for x in gen:
    #print(x)
    rename_dict[event_dict[x]['AOML_term']] = x

event_sample = data['sample_data'].rename(columns=rename_dict)
event_sample = event_sample.drop(columns=[col for col in event_sample if col not in rename_dict.values()])

In [144]:

event_sample.head()

Unnamed: 0,eventID,parentEventID,eventDate,maximumDepthInMeters,locality,decimalLongitude,decimalLongitude.1,samplingProtocol,eventRemarks
0,SEAMAP2019_Brownsville_Sta103_39m_A,Bot,2019-04-23T22:05:07-05:00,38.8 m,Gulf of Mexico: Brownsville,26.0262,-96.8373,CTD rosette,not applicable
1,SEAMAP2019_Brownsville_Sta103_39m_B,Bot,2019-04-23T22:05:07-05:00,38.8 m,Gulf of Mexico: Brownsville,26.0262,-96.8373,CTD rosette,not applicable
2,SEAMAP2019_Brownsville_Sta103_39m_C,Bot,2019-04-23T22:05:07-05:00,38.8 m,Gulf of Mexico: Brownsville,26.0262,-96.8373,CTD rosette,PM cast
3,SEAMAP2019_Brownsville_Sta103_4m_A,Sur,2019-04-23T22:05:07-05:00,3.7 m,Gulf of Mexico: Brownsville,26.0262,-96.8373,CTD rosette,not applicable
4,SEAMAP2019_Brownsville_Sta103_4m_B,Sur,2019-04-23T22:05:07-05:00,3.7 m,Gulf of Mexico: Brownsville,26.0262,-96.8373,CTD rosette,not applicable
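
The renaming step above can be sketched on toy data. Here `term_map` and `sample_data` are hypothetical stand-ins for `event_dict` and `data['sample_data']`. The sketch builds the rename mapping with a dict comprehension, then keeps only the renamed columns.

```python
import pandas as pd

# Hypothetical term table: Darwin Core term -> source column and source sheet.
term_map = {
    "eventID": {"AOML_term": "sample_name", "AOML_file": "sample_data"},
    "eventDate": {"AOML_term": "collection_date", "AOML_file": "sample_data"},
    "sampleSizeUnit": {"AOML_term": "none", "AOML_file": "pipeline"},
}

# Hypothetical sample sheet with one column that has no Darwin Core counterpart.
sample_data = pd.DataFrame({
    "sample_name": ["S1", "S2"],
    "collection_date": ["2019-04-23", "2019-04-24"],
    "extra_col": [1, 2],
})

# Map template column names to Darwin Core terms, for sample_data terms only.
rename_map = {
    info["AOML_term"]: dwc_term
    for dwc_term, info in term_map.items()
    if info["AOML_file"] == "sample_data"
}

# Rename, then select just the Darwin Core columns (drops extra_col).
events = sample_data.rename(columns=rename_map)[list(rename_map.values())]
print(events)
```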


In [145]:
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] != 'sample_data':
        print(key,event_dict[key])

sampleSizeValue {'AOML_term': 'derived', 'AOML_file': 'tourmaline', 'DwC_definition': 'In the context of eDNA data, `sampleSizeValue` should be the total number of reads for a given sample.'}
sampleSizeUnit {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': 'Should be DNA sequence reads'}
locationID {'AOML_term': 'optional', 'AOML_file': '?', 'DwC_definition': 'An identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.'}
geodeticDatum {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': 'The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decimalLongitude are based.'}
countryCode {'AOML_term': 'none', 'AOML_file': 'pipeline', 'DwC_definition': nan}
datasetID {'AOML_term': 'project_id', 'AOML_file': 'project_data', 'DwC_definition': 'An identifier for the set of data. May be a global unique identifi

In the context of eDNA data, `sampleSizeValue` should be the total number of reads for a given sample.

This is where the terms that do not come from `sample_data` (those printed above) would be added to the event table. For now, they are left out.
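
As a minimal sketch of how `sampleSizeValue` could be derived, assume a hypothetical ASV table with one row per ASV and one column per sample. The table layout and the names `asv_table` and `events` are assumptions, not the notebook's actual objects. The total reads per sample is then a column sum.

```python
import pandas as pd

# Hypothetical ASV table: rows are ASVs, columns are samples, values are read counts.
asv_table = pd.DataFrame(
    {"sample_A": [29, 4041, 0], "sample_B": [5, 0, 12]},
    index=["asv1", "asv2", "asv3"],
)
events = pd.DataFrame({"eventID": ["sample_A", "sample_B"]})

# Total number of reads per sample (sum down each column).
reads_per_sample = asv_table.sum(axis=0)

# Attach the totals to the event table by eventID.
events["sampleSizeValue"] = events["eventID"].map(reads_per_sample)
events["sampleSizeUnit"] = "DNA sequence reads"
print(events)
```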

# boneyard


I also wanted to persist the original name from the NCBI taxonomy database into the Darwin Core-converted data set. To do this, I queried the database based on the name in the original data to obtain its taxonomic ID number.

In [189]:
## Get set up to query NCBI taxonomy 

from Bio import Entrez

# ----- Insert your email here -----
Entrez.email = 'ksil91@gmail.com'
# ----------------------------------

# Get list of all databases available through this tool
record = Entrez.read(Entrez.einfo())
all_dbs = record['DbList']
all_dbs

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [210]:
import multiprocess as mp  # assumed: the `multiprocess` package listed under Requirements

def get_ncbi_txid_from_name_parallel(all_names,n_proc=0):
    # query each distinct name only once
    all_names = list(all_names.unique())
    queue = mp.Queue()
    if n_proc == 0:
        # create as many processes as there are CPUs on your machine
        num_processes = mp.cpu_count()
    else:
        num_processes = n_proc
        
    # ceiling division (at least 1) so the trailing names are not dropped
    chunk_size = max(1, -(-len(all_names) // num_processes))
    procs = []
    for start in range(0, len(all_names), chunk_size):
        name_chunk = all_names[start:start + chunk_size]
        proc = mp.Process(
            target=get_ncbi_txid_from_name,
            args=(name_chunk,queue)
        )
        procs.append(proc)
        proc.start()
    
    # collect one result dict per process before joining
    names_dict = {}
    for _ in procs:
        names_dict.update(queue.get())
    
    for proc in procs:
        proc.join()
    
    return names_dict
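
When names are divided among workers, flooring the list length by the worker count leaves the remainder unassigned. For example, 10 names over 4 workers gives a chunk size of 2 and covers only 8. The helper below is a hypothetical, standard-library sketch of a chunking scheme that keeps every item. It is not part of the notebook.

```python
import math

def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous chunks without dropping any."""
    items = list(items)
    if not items:
        return []
    # Ceiling division: the last chunk absorbs whatever is left over.
    size = math.ceil(len(items) / n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

names = [
    "Clupeidae", "Eucinostomus gula", "Oncorhynchus gorbuscha",
    "Lutjanus griseus", "Cheilopogon agoo", "Scomber scombrus",
    "Haemulon plumierii",
]
chunks = split_into_chunks(names, 3)
print([len(c) for c in chunks])
```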


In [177]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS
def get_ncbi_txid_from_name(names,queue):
    
    name_ncbiid_dict = {}
    for name in names:
        if name not in name_ncbiid_dict.keys():
            taxon2 = '"' + name + '"'
            handle = Entrez.esearch(db='taxonomy', retmax=10, term=taxon2)
            record = Entrez.read(handle)
            if not record['IdList'] or not record['IdList'][0]:
                print(name+": not found")
                name_ncbiid_dict[name] = "None"
            else:
                name_ncbiid_dict[name] = record['IdList'][0]
            handle.close()
    queue.put(name_ncbiid_dict)


In [186]:
n =worms_12s['verbatimIdentification'][0:10]
new = get_ncbi_txid_from_name_parallel(n,7)

In [188]:
n

0                 Clupeidae
1         Eucinostomus gula
2    Oncorhynchus gorbuscha
3          Lutjanus griseus
4         Eucinostomus gula
5          Cheilopogon agoo
6    Oncorhynchus gorbuscha
7    Oncorhynchus gorbuscha
8          Scomber scombrus
9        Haemulon plumierii
Name: verbatimIdentification, dtype: object

In [187]:
new

{'Oncorhynchus gorbuscha': '8017',
 'Scomber scombrus': '13677',
 'Eucinostomus gula': '435273',
 'Haemulon plumierii': '334415',
 'Clupeidae': '55118',
 'Cheilopogon agoo': '123223',
 'Lutjanus griseus': '40503'}

In [182]:
n

0                 Clupeidae
1         Eucinostomus gula
2    Oncorhynchus gorbuscha
3          Lutjanus griseus
4         Eucinostomus gula
5          Cheilopogon agoo
6    Oncorhynchus gorbuscha
7    Oncorhynchus gorbuscha
8          Scomber scombrus
9        Haemulon plumierii
Name: verbatimIdentification, dtype: object

In [181]:
new

{'Oncorhynchus gorbuscha': '8017',
 'Lutjanus griseus': '40503',
 'Scomber scombrus': '13677',
 'Haemulon plumierii': '334415',
 'Eucinostomus gula': '435273',
 'Cheilopogon agoo': '123223',
 'Clupeidae': '55118'}

In [None]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS

name_ncbiid_dict = {}

for name in worms_12s['verbatimIdentification']:
    if name not in name_ncbiid_dict.keys():
        taxon2 = '"' + name + '"'
        handle = Entrez.esearch(db='taxonomy', retmax=10, term=taxon2)
        record = Entrez.read(handle)
        if not record['IdList'] or not record['IdList'][0]:
            print(name+": not found")
            name_ncbiid_dict[name] = "None"
        else:
            name_ncbiid_dict[name] = record['IdList'][0]
        handle.close()


Look up the names that were not found by hand and add their taxids manually.
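
To see which names still need a manual lookup, the entries whose value is the string `"None"` can be listed first. The sketch below uses a small hypothetical copy of the lookup dictionary. The variable names `unresolved`, `manual_ids`, and `still_unresolved` are illustrative.

```python
# Hypothetical excerpt of the lookup result; "None" marks names the search did not resolve.
name_ncbiid_dict = {
    "Clupeidae": "55118",
    "Lobianchia sp. CBM:ZF:14789": "None",
    "Lutjanus griseus": "40503",
}

# Names that need a manual lookup.
unresolved = sorted(name for name, txid in name_ncbiid_dict.items() if txid == "None")
print(unresolved)

# Taxids found by hand are merged back in.
manual_ids = {"Lobianchia sp. CBM:ZF:14789": "2057084"}
name_ncbiid_dict.update(manual_ids)

# Confirm nothing is left unresolved.
still_unresolved = [n for n in unresolved if name_ncbiid_dict[n] == "None"]
```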

In [157]:
name_ncbiid_dict['Lobianchia sp. CBM:ZF:14789'] = '2057084'
name_ncbiid_dict['Brotula sp. CBM:ZF:20276']= '2768194'
name_ncbiid_dict['Scopelarchus sp. CBM:DNA No. 2000-014']= '2608515'

'None'

In [153]:
list(name_ncbiid_dict.keys())[0:10]

['Clupeidae',
 'Eucinostomus gula',
 'Oncorhynchus gorbuscha',
 'Lutjanus griseus',
 'Cheilopogon agoo',
 'Scomber scombrus',
 'Haemulon plumierii',
 'Archosargus probatocephalus',
 'Actinopteri',
 'Carangidae']

In [154]:
name_ncbiid_dict['Clupeidae']

'55118'

In [156]:
## Add NCBI taxonomy IDs under taxonConceptID

# Map indicators that say no taxonomy was assigned to empty strings
#name_ncbiid_dict['unassigned'], name_ncbiid_dict['s_'], name_ncbiid_dict['no_hit'], name_ncbiid_dict['unknown'], name_ncbiid_dict['g_'] = '', '', '', '', ''

# Create column by replacing each verbatim name with its NCBI taxid
# (assign the result back; inplace=True on a selected column may not update the dataframe)
worms_12s['taxonConceptID'] = worms_12s['verbatimIdentification'].replace(name_ncbiid_dict)

# Add remainder of text and clean (a bare prefix with no taxid becomes an empty string)
worms_12s['taxonConceptID'] = 'NCBI:txid' + worms_12s['taxonConceptID']
worms_12s['taxonConceptID'] = worms_12s['taxonConceptID'].replace('NCBI:txid', '')
worms_12s.head()

Unnamed: 0,verbatimIdentification,asv,rank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonConceptID
0,Clupeidae,951a3746cb4c488d657fe3c64bdd3d75,family,Clupeidae,Clupeidae,urn:lsid:marinespecies.org:taxname:125464,Animalia,Chordata,Teleostei,Clupeiformes,Clupeidae,,NCBI:txid55118
1,Eucinostomus gula,f278a917dedcc2e0434be6ecb605a642,species,Eucinostomus gula,Eucinostomus gula,urn:lsid:marinespecies.org:taxname:159733,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Gerreidae,Eucinostomus,NCBI:txid435273
2,Oncorhynchus gorbuscha,1d0fcad8f69709ed9ea8dd6e6a3c627a,species,Oncorhynchus gorbuscha,Oncorhynchus gorbuscha,urn:lsid:marinespecies.org:taxname:127182,Animalia,Chordata,Teleostei,Salmoniformes,Salmonidae,Oncorhynchus,NCBI:txid8017
3,Lutjanus griseus,cc6451b251afdd97ef4f11e47abc53a7,species,Lutjanus griseus,Lutjanus griseus,urn:lsid:marinespecies.org:taxname:159797,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Lutjanidae,Lutjanus,NCBI:txid40503
4,Eucinostomus gula,d16bc4ff5741c0ba4a5d0bb2e21e5589,species,Eucinostomus gula,Eucinostomus gula,urn:lsid:marinespecies.org:taxname:159733,Animalia,Chordata,Teleostei,Eupercaria incertae sedis,Gerreidae,Eucinostomus,NCBI:txid435273
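
An alternative way to build the `taxonConceptID` values is `Series.map`, which leaves names missing from the dictionary as NaN so they can be blanked explicitly. This is a sketch on toy data, not the notebook's own method. The name `unassigned` is just an example of an unmatched entry.

```python
import pandas as pd

names = pd.Series(
    ["Clupeidae", "Lutjanus griseus", "unassigned"],
    name="verbatimIdentification",
)
name_to_txid = {"Clupeidae": "55118", "Lutjanus griseus": "40503"}

# Names missing from the dictionary map to NaN.
txid = names.map(name_to_txid)

# Prefix real taxids; leave unmatched rows as empty strings.
taxon_concept_id = txid.map(lambda t: f"NCBI:txid{t}" if pd.notna(t) else "")
print(taxon_concept_id.tolist())
```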



## WoRMS

### Use SOAP access to the WoRMS API. See Python examples [here](https://marinespecies.org/aphia.php?p=webservice&type=python).

In [164]:
array_of_results_array = []

In [169]:
from suds import null, WebFault
from suds.client import Client
cl = Client('https://marinespecies.org/aphia.php?p=soap&wsdl=1')

scinames = cl.factory.create('scientificnames')
scinames["_arrayType"] = "string[]"
scinames["scientificname"] = ["Buccinum fusiforme", "Abra alba","random"]

# like = wildcard after name
array_of_results_array = cl.service.matchAphiaRecordsByNames(scinames, like=False, fuzzy=False, marine_only=False)
for results_array in array_of_results_array:
    for aphia_object in results_array:
        print('%s %s %s' % (aphia_object.AphiaID, aphia_object.scientificname, aphia_object.genus))


531014 Buccinum fusiforme Buccinum
510389 Buccinum fusiforme Buccinum
141433 Abra alba Abra
1492457 Randia Randia
1442158 Randomia Randomia


In [153]:
for results_array in array_of_results_array:
    for aphia_object in results_array:
        if aphia_object.status == "accepted":
            print('%s %s %s' % (aphia_object.lsid, aphia_object.scientificname, aphia_object.genus))

urn:lsid:marinespecies.org:taxname:141433 Abra alba Abra


### Use [pyworms](https://pyworms.readthedocs.io/en/latest/)

In [177]:
x = [2]
y = [1]
x + y

[2, 1]

In [193]:
s_match = pyworms.aphiaRecordsByName("Abra alba",like=False,marine_only=False)

In [195]:
s_match[0]

{'AphiaID': 141433,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=141433',
 'scientificname': 'Abra alba',
 'authority': '(W. Wood, 1802)',
 'status': 'accepted',
 'unacceptreason': None,
 'taxonRankID': 220,
 'rank': 'Species',
 'valid_AphiaID': 141433,
 'valid_name': 'Abra alba',
 'valid_authority': '(W. Wood, 1802)',
 'parentNameUsageID': 138474,
 'kingdom': 'Animalia',
 'phylum': 'Mollusca',
 'class': 'Bivalvia',
 'order': 'Cardiida',
 'family': 'Semelidae',
 'genus': 'Abra',
 'citation': 'MolluscaBase eds. (2023). MolluscaBase. Abra alba (W. Wood, 1802). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=141433 on 2023-09-13',
 'lsid': 'urn:lsid:marinespecies.org:taxname:141433',
 'isMarine': 1,
 'isBrackish': None,
 'isFreshwater': None,
 'isTerrestrial': None,
 'isExtinct': None,
 'match_type': 'exact',
 'modified': '2010-09-23T10:34:21.967Z'}
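
A record like the one above already holds most of the taxonomy fields used in the occurrence table. The helper below is a hypothetical sketch, not part of the notebook or of pyworms. It copies those fields from a record dictionary into Darwin Core term names, using the `lsid` as `scientificNameID`, and works on a hard-coded excerpt of the record so it runs offline.

```python
def aphia_to_dwc(record):
    """Map a WoRMS-style record dict to Darwin Core taxonomy terms."""
    ranks = ("kingdom", "phylum", "class", "order", "family", "genus")
    dwc = {
        "scientificName": record["scientificname"],
        "scientificNameID": record["lsid"],
        "taxonRank": record["rank"].lower(),
    }
    # Higher ranks may be absent or None for records above genus level.
    dwc.update({rank: record.get(rank) for rank in ranks})
    return dwc

# Excerpt of the record shown above.
record = {
    "scientificname": "Abra alba",
    "status": "accepted",
    "rank": "Species",
    "kingdom": "Animalia",
    "phylum": "Mollusca",
    "class": "Bivalvia",
    "order": "Cardiida",
    "family": "Semelidae",
    "genus": "Abra",
    "lsid": "urn:lsid:marinespecies.org:taxname:141433",
}
dwc_fields = aphia_to_dwc(record)
print(dwc_fields)
```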

In [710]:

data = [{'ASV': '1', 'species': "Abra alba", 'genus': "Abra"},
        {'ASV': '2',  'species': "Caldanaerobacter fake", 'genus': "Caldanaerobacter"},
        {'ASV': '3',  'species': "fake",  'genus': "fake" },
       {'ASV': '4',  'species': "NaN",  'genus': "Caldanaerobacter"}]
df = pd.DataFrame.from_dict(data)

In [711]:
df

Unnamed: 0,ASV,species,genus
0,1,Abra alba,Abra
1,2,Caldanaerobacter fake,Caldanaerobacter
2,3,fake,fake
3,4,,Caldanaerobacter


In [201]:
x

[2]

In [None]:
def main(tax_df, ordered_rank_columns, ASV_column_name = "NaN", like=False, marine_only=False,verbose=False, n_proc=0):
    # don't forget to import
    import pandas as pd
    import multiprocessing
    from functools import partial

    if n_proc == 0:
        # create as many processes as there are CPUs on your machine
        num_processes = multiprocessing.cpu_count()
    else:
        num_processes = n_proc

    # calculate the chunk size as an integer, from tax_df rather than the global df;
    # at least 1 so that range() below gets a valid step
    chunk_size = max(1, int(tax_df.shape[0]/num_processes))

    # this solution was reworked from the above link.
    # will work even if the length of the dataframe is not evenly divisible by num_processes
    # (slices by position, so it does not depend on the index labels)
    chunks = [tax_df.iloc[i:i + chunk_size] for i in range(0, tax_df.shape[0], chunk_size)]

    # create our pool with `num_processes` processes
    pool = multiprocessing.Pool(processes=num_processes)

    # apply our function to each chunk in the list
    func = partial(get_worms_from_scientific_name, ordered_rank_columns=ordered_rank_columns,ASV_column_name=ASV_column_name,like=like,marine_only=marine_only)
    result = pool.map(func,chunks)
    pool.close()
    pool.join()
    # result is a list with one dataframe per chunk
    return result

if __name__ == "__main__":
    main(df,['genus','species'],ASV_column_name="ASV",n_proc=2)


In [237]:
df

Unnamed: 0,ASV,species,genus
0,1,Abra alba,Abra
1,2,Caldanaerobacter fake,Caldanaerobacter
2,3,fake,fake
3,4,,Caldanaerobacter


In [245]:
get_worms_from_scientific_name(df,['genus','species'],ASV_column_name="ASV")

Unnamed: 0,asv,rank,old name,new name,lsid
0,1,genus,Abra,Abra,urn:lsid:marinespecies.org:taxname:138474
1,2,genus,Caldanaerobacter,Caldanaerobacter,urn:lsid:marinespecies.org:taxname:571044
2,3,species,fake,No match,No match
3,4,genus,Caldanaerobacter,Caldanaerobacter,urn:lsid:marinespecies.org:taxname:571044


In [244]:
def get_worms_from_scientific_name(tax_df, ordered_rank_columns, ASV_column_name = "NaN", like=False, marine_only=False,verbose=False):
    import time
    matches = []
    # iterate over the dataframe passed in, not the global df
    for index, row in tax_df.iterrows():
        asv = row[ASV_column_name]
        # try each rank column in the order given; stop at the first accepted match
        for i in ordered_rank_columns:
            rank = i
            old_name = row[i]
            row_data = {'asv':asv,'rank': rank, 'old name': old_name}
            if pd.isna(old_name):
                continue 
            else:
                s_match = pyworms.aphiaRecordsByName(old_name,like=like,marine_only=marine_only)
                #time.sleep(1)
                if s_match is None:
                    row_data['new name'] = "No match"
                    row_data['lsid'] = "No match"
                    continue
                elif len(s_match) > 1:
                    # keep only the accepted records among the multiple matches
                    mult = [m for m in s_match if m['status'] == 'accepted']
                    if len(mult) > 1:
                        row_data['new name'] = "Multiple matches"
                        row_data['lsid'] = "Multiple matches"
                    elif len(mult) < 1:
                        row_data['new name'] = "Unaccepted"
                        row_data['lsid'] = "Unaccepted"
                    else:
                        # exactly one accepted record: use it and stop searching
                        row_data['new name'] = mult[0]['scientificname']
                        row_data['lsid'] = mult[0]['lsid']
                        break
                elif len(s_match) == 1:
                    if s_match[0]['status'] == 'accepted':
                        row_data['new name'] = s_match[0]['scientificname']
                        row_data['lsid'] = s_match[0]['lsid']
                        break
                    elif s_match[0]['status'] == 'unaccepted':
                        row_data['new name'] = "Unaccepted"
                        row_data['lsid'] = "Unaccepted"
        matches += [row_data]
    matches = pd.DataFrame.from_dict(matches)
    return matches
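
The decision made for each name (no match, one accepted record, several accepted records, or only unaccepted records) can be isolated in a small pure function and checked without calling the WoRMS service. The function `summarize_matches` below is a hypothetical sketch of that logic, and the record dictionaries are trimmed to the fields it needs.

```python
def summarize_matches(records):
    """Return (new_name, lsid) for a list of WoRMS-style record dicts, or None."""
    if not records:
        return ("No match", "No match")
    accepted = [r for r in records if r.get("status") == "accepted"]
    if len(accepted) == 1:
        return (accepted[0]["scientificname"], accepted[0]["lsid"])
    if len(accepted) > 1:
        return ("Multiple matches", "Multiple matches")
    return ("Unaccepted", "Unaccepted")

abra = {
    "scientificname": "Abra alba",
    "status": "accepted",
    "lsid": "urn:lsid:marinespecies.org:taxname:141433",
}
# Made-up unaccepted record for illustration.
old_synonym = {"scientificname": "Example name", "status": "unaccepted", "lsid": "example-lsid"}

print(summarize_matches([abra, old_synonym]))
```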
                        

In [159]:
pyworms.aphiaRecordByExternalID(6544,"ncbi")

{'AphiaID': 105,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=105',
 'scientificname': 'Bivalvia',
 'authority': 'Linnaeus, 1758',
 'status': 'accepted',
 'unacceptreason': None,
 'taxonRankID': 60,
 'rank': 'Class',
 'valid_AphiaID': 105,
 'valid_name': 'Bivalvia',
 'valid_authority': 'Linnaeus, 1758',
 'parentNameUsageID': 51,
 'kingdom': 'Animalia',
 'phylum': 'Mollusca',
 'class': 'Bivalvia',
 'order': None,
 'family': None,
 'genus': None,
 'citation': 'WoRMS (2023). Bivalvia. Accessed at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105 on 2023-09-13',
 'lsid': 'urn:lsid:marinespecies.org:taxname:105',
 'isMarine': 1,
 'isBrackish': 1,
 'isFreshwater': 1,
 'isTerrestrial': 0,
 'isExtinct': 0,
 'match_type': 'exact',
 'modified': '2019-08-08T17:02:58.460Z'}

In [None]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

In [103]:
## eventID - the sample_name column in the plate dataframe uniquely identifies a water sample

aterm = event_dict['eventID']['AOML_term']
afile = event_dict['eventID']['AOML_file']

event_df = pd.DataFrame({'eventID':data[afile][aterm]})

for key in event_dict.keys():
    
    
    res = pd.Series(data[params[sheet]].columns[data[params[sheet]].isnull().any()].tolist(),
                name=sheet)
    some=pd.concat([some,res],axis=1)

In [5]:
## Merge with plate_meta to obtain columns that can be added directly from metadata

metadata_cols = [
    'seqID',
    'eventDate', 
    'decimalLatitude', 
    'decimalLongitude',
    'env_broad_scale',
    'env_local_scale',
    'env_medium',
    'target_gene',
    'primer_sequence_forward',
    'primer_sequence_reverse',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'sop',
    'seq_meth',
    'samp_vol_we_dna_ext',
    'nucl_acid_ext', 
    'nucl_acid_amp',
]

dwc_cols = metadata_cols.copy()
dwc_cols[0] = 'eventID'

occ = occ.merge(meta[metadata_cols], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.columns = dwc_cols
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6


In [6]:
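## Format eventDate
import pytz
from datetime import datetime

# localize the naive timestamps to US Pacific time, then write them as ISO 8601 strings
pst = pytz.timezone('America/Los_Angeles')
eventDate = [pst.localize(datetime.strptime(dt, '%m/%d/%y %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate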

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
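
The same conversion can also be sketched with pandas alone, which may be convenient for long columns. This is an alternative under the assumption that all timestamps are local US Pacific times, not the approach used in the cell above. The two input strings are taken from the table shown earlier.

```python
import pandas as pd

raw = pd.Series(["2/20/14 15:33", "4/21/16 14:39"])

# Parse the naive timestamps, then attach the local time zone
# (standard or daylight offset is chosen per date).
localized = pd.to_datetime(raw, format="%m/%d/%y %H:%M").dt.tz_localize("America/Los_Angeles")

# ISO 8601 strings with UTC offset, as expected for eventDate.
event_date = [ts.isoformat() for ts in localized]
print(event_date)
```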


In [7]:
## Clean sop

occ['sop'] = occ['sop'].str.replace('|', ' | ', regex=False)
occ['sop'].iloc[0]

'dx.doi.org/10.17504/protocols.io.xjufknw | dx.doi.org/10.17504/protocols.io.n2vdge6 | https://github.com/MBARI-BOG/BOG-Banzai-Dada2-Pipeline'

In [8]:
## Update seq_meth

occ['seq_meth'] = 'Illumina MiSeq 2x250'

In [9]:
## Change column names as needed

occ = occ.rename(columns = {'primer_sequence_forward':'pcr_primer_forward',
                            'primer_sequence_reverse':'pcr_primer_reverse'})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6


In [11]:
## Add extension terms that weren't in metadata file (obtained by asking data provider)

occ['target_subfragment'] = 'V9'
occ['lib_layout'] = 'paired'
occ['otu_class_appr'] = 'dada2;1.14.0;ASV'
occ['otu_seq_comp_appr'] = 'blast;2.9.0+;80% identity;e-value cutoff: 0.00001 | MEGAN6;6.18.5;bitscore:100:2%'
occ['otu_db'] = 'Genbank nr;221'

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221


In [12]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['occurrenceID'] = plate.groupby('Sequence_ID')['ASV'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db,occurrenceID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_1_S_occ1
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_2_S_occ1
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Illumina MiSeq 2x250,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_3_S_occ1
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,11216c01_12_edna_1_S_occ1
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Illumina MiSeq 2x250,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,11216c01_12_edna_2_S_occ1
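
Since `occurrenceID` is meant to be unique across the whole data set, it can be worth checking that before moving on. The sketch below rebuilds the ID on a small toy table (the sample IDs and sequences are invented for illustration, not taken from the real data) and checks uniqueness with `Series.is_unique`.

```python
import pandas as pd

# Toy stand-in for the plate table; the values are invented for illustration
toy = pd.DataFrame({
    'Sequence_ID': ['s1', 's1', 's2', 's2', 's2'],
    'ASV': ['AAA', 'CCC', 'AAA', 'GGG', 'TTT'],
})

# Number the ASVs within each sample, starting at 1
toy['occurrenceID'] = toy.groupby('Sequence_ID')['ASV'].cumcount() + 1
toy['occurrenceID'] = toy['Sequence_ID'] + '_occ' + toy['occurrenceID'].astype(str)

# Every ID should appear exactly once
ids_are_unique = toy['occurrenceID'].is_unique
print(toy['occurrenceID'].tolist(), ids_are_unique)
```

On the real data, the equivalent check would be `assert occ['occurrenceID'].is_unique`.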


In [13]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db,occurrenceID,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;1.14.0;ASV,blast;2.9.0+;80% identity;e-value cutoff: 0.00...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...


In [14]:
## Add scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,otu_db,occurrenceID,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


For the purpose of submitting data to OBIS, all the variations on missing data (e.g. "unknown," "no_hit," etc.) do not add information. We can replace these with NaN, which is easy to work with in pandas.

In [15]:
## Replace 'unknown', 'unassigned', etc. in scientificName and taxonomy columns with NaN

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,otu_db,occurrenceID,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [16]:
## Get unique species names

names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
print(len(names))

323


OBIS uses the World Register of Marine Species (or [WoRMS](http://www.marinespecies.org/)) as its taxonomic backbone, so scientific names have to be WoRMS-approved in order to show up as valid occurrences. But there are a number of entries in the `scientificName` column, like "uncultured marine eukaryote," "eukaryote clone OLI11007," and "Acantharian sp. 6201," that are **not proper Linnaean species names**. Since these essentially indicate that a more precise name is unknown, it seemed reasonable to replace these with NaN as well. 

**NOTE: I used a simple rule to filter out non-Linnaean names, but it's important to check and see if any true species names are being removed.**

To visually inspect names that are being filtered out, use:
```python
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        print(name)
```

In [17]:
## Replace non-Linnaean species names with NaN

# Get non-Linnaean names
non_latin_names = []
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        non_latin_names.append(name)
non_latin_names_dict = {i:np.nan for i in non_latin_names}

# Add any names that didn't get caught in the simple filter
non_latin_names_dict['phototrophic eukaryote'] = np.nan
non_latin_names_dict['Candida <clade Candida/Lodderomyces clade>'] = np.nan

# Replace
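# Assign the result back; calling replace(..., inplace=True) on a selected column
# may not update occ in recent pandas versions
occ['scientificName'] = occ['scientificName'].replace(non_latin_names_dict)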
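
The word-count rule misses two-word names such as "phototrophic eukaryote", which is why they are added by hand above. One possible complement, sketched here as an assumption rather than as part of the original workflow, is a regular expression that only accepts names shaped like "Genus" or "Genus species". The pattern and the helper name `looks_linnaean` are hypothetical. It is still only a heuristic: it rejects legitimate trinomials and would accept something like "Acantharian sp" (without the period), so the flagged names should still be inspected by eye.

```python
import re

# Assumed pattern: a capitalized genus, optionally followed by one lowercase epithet
LINNAEAN_PATTERN = re.compile(r'^[A-Z][a-z]+(?: [a-z][a-z-]+)?$')

def looks_linnaean(name):
    """Return True if name is shaped like 'Genus' or 'Genus species'."""
    return bool(LINNAEAN_PATTERN.match(name))

examples = ['Paracalanus parvus', 'Paracalanus', 'phototrophic eukaryote',
            'uncultured marine eukaryote', 'Acantharian sp. 6201',
            'Candida <clade Candida/Lodderomyces clade>']

# Names that fail the shape check and deserve a manual look
flagged = [name for name in examples if not looks_linnaean(name)]
print(flagged)
```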

In addition, many records **only give "Eukaryota" as the scientific name** (i.e. Eukaryota is in the kingdom field, and there is no more taxonomic information). These should be replaced with [Biota](http://marinespecies.org/aphia.php?p=taxdetails&id=1), which is WoRMS's most general taxonomic designation.

In [18]:
## Replace entries where kingdom = 'Eukaryota' with the WoRMS-approved 'Biota'

occ.loc[occ['kingdom'] == 'Eukaryota', 'kingdom'] = 'Biota'

The data providers for this dataset used the [NCBI taxonomy database](https://www.ncbi.nlm.nih.gov/taxonomy) as their reference database when assigning taxonomies to ASVs. **It's important to note** that this database is not a taxonomic authority, and the taxonomic ranks it gives for a scientific name may not correspond directly to the ranks given for that name on WoRMS. There are ongoing discussions about this problem (see [this GitHub issue](https://github.com/iobis/Project-team-Genetic-Data/issues/5)). At the moment, I don't see a way to definitively ensure that a given scientific name actually has the same taxonomic ranks on both platforms without going case-by-case.

In addition, there are still names in the data that will not match on WoRMS at all, despite appearing to be Linnaean names. This is because the name may not have been fully and officially adopted by the scientific community. I therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the `scientificName` column. The following few code blocks do this - they're clunky, but they were sufficient for this data set.

In [19]:
## Define functions for finding the lowest available taxonomic rank that will match on WoRMS

def fill_lowest_taxon(df, cols):
    """ Takes the occurrence pandas data frame and fills missing values in scientificName 
    with values from the first non-missing taxonomic rank column. The names of the taxonomic
    rank columns are listed in cols. """
    
    cols.reverse()
    
    for col in cols[:-1]:
        df['scientificName'] = df['scientificName'].combine_first(df[col])
    
    cols.reverse()
    
    return(df)

def find_not_matched(df, name_dict):
    """ Takes the occurrence pandas data frame and name_dict matching scientificName values 
    with names on WoRMS and returns a list of names that did not match on WoRMS. """
    
    # Skip missing values explicitly: list.remove(np.nan) only succeeds when the stored
    # NaN is the very same object as np.nan, which pandas does not guarantee
    not_matched = [name for name in df['scientificName'].unique()
                   if pd.notna(name) and name not in name_dict]
            
    return(not_matched)

def replace_not_matched(df, not_matched, cols):
    """ Takes the occurrence pandas data frame and a list of scientificName values that 
    did not match on WoRMS and replaces those values with NaN in the columns specified by cols. """
    
    df[cols] = df[cols].replace(not_matched, np.nan)
    
    return(df)  
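
To make the filling step concrete, here is a small toy example of what `combine_first` does when applied from the lowest rank upward. The three rows are invented for illustration, and only three rank columns are used to keep it short.

```python
import numpy as np
import pandas as pd

# Toy occurrence table: row 0 has a species name, row 1 only a family, row 2 nothing
toy = pd.DataFrame({
    'scientificName': ['Paracalanus parvus', np.nan, np.nan],
    'kingdom': ['Eukaryota', 'Eukaryota', np.nan],
    'family': ['Paracalanidae', 'Paracalanidae', np.nan],
    'genus': ['Paracalanus', np.nan, np.nan],
})

# Walk from the lowest rank (genus) up to the highest (kingdom);
# combine_first only fills positions that are still missing
for col in ['genus', 'family', 'kingdom']:
    toy['scientificName'] = toy['scientificName'].combine_first(toy[col])

filled = toy['scientificName'].tolist()
print(filled)
```

Row 0 keeps its species name, row 1 picks up the family because the genus is missing, and row 2 stays NaN because no rank is available.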

In [20]:
## Iterate to match lowest possible taxonomic rank on WoRMS (takes ~8 minutes when starting with ~750 names)

# Note that cols (list of taxonomic column names) was defined in a previous code block 

# Initialize dictionaries
name_name_dict = {}
name_id_dict = {}
name_taxid_dict = {}
name_class_dict = {}

# Initialize not_matched
not_matched = [1]

# Iterate
while len(not_matched) > 0:
    
    # Step 1 - fill
    occ = fill_lowest_taxon(occ, cols)

    # Step 2 - get names to match
    to_match = find_not_matched(occ, name_name_dict)

    # Step 3 - match
    print('Matching {num} names on WoRMS.'.format(num = len(to_match)))
    name_id, name_name, name_taxid, name_class = WoRMS.run_get_worms_from_scientific_name(to_match, verbose_flag=False)
    name_id_dict = {**name_id_dict, **name_id}
    name_name_dict = {**name_name_dict, **name_name}
    name_taxid_dict = {**name_taxid_dict, **name_taxid}
    name_class_dict = {**name_class_dict, **name_class}
    print('Length of name_name_dict: {length}'.format(length = len(name_name_dict)))

    # Step 4 - get names that didn't match
    not_matched = find_not_matched(occ, name_name_dict)
    print('Number of names not matched: {num}'.format(num = len(not_matched)))

    # Step 5 - replace these values with NaN
    occ = replace_not_matched(occ, not_matched, cols)

Matching 756 names on WoRMS.
Length of name_name_dict: 696
Number of names not matched: 60
Matching 35 names on WoRMS.
Length of name_name_dict: 716
Number of names not matched: 15
Matching 11 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 6
Matching 3 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 2
Matching 1 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 1
Matching 0 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 0


There are, I'm sure, a vast number of ways to improve on this. A couple that have crossed my mind are:
- Avoid the paired `.reverse()` calls in `fill_lowest_taxon()`, which modify `cols` in place
- Add a progress bar
- Consider better and/or additional stopping criteria. Importantly, what if not all names can be matched?
- Consider using pyworms instead of my custom WoRMS functions
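
As a rough sketch of the first and third ideas, the loop could be wrapped in a function that iterates over `reversed(cols[1:])` instead of reversing the list in place, and that gives up after a fixed number of rounds. Everything here is an assumption for illustration: `match_until_done`, `max_rounds`, and the `match_names` argument are hypothetical names, and `fake_match` is an offline stand-in for the real WoRMS lookup so that the example runs without network access.

```python
import numpy as np
import pandas as pd

def fill_lowest_taxon(df, cols):
    """Fill missing scientificName from the lowest available rank, leaving cols untouched."""
    for col in reversed(cols[1:]):  # cols[0] is scientificName itself
        df['scientificName'] = df['scientificName'].combine_first(df[col])
    return df

def match_until_done(df, cols, match_names, max_rounds=10):
    """Repeat fill / match / blank-out until every name matches or max_rounds is hit."""
    matched = {}
    for round_number in range(1, max_rounds + 1):
        df = fill_lowest_taxon(df, cols)
        to_match = [n for n in df['scientificName'].unique()
                    if pd.notna(n) and n not in matched]
        matched.update(match_names(to_match))
        not_matched = [n for n in to_match if n not in matched]
        if not not_matched:
            return df, matched, round_number
        # Blank out unmatched names so the next round falls back to a higher rank
        df[cols] = df[cols].mask(df[cols].isin(not_matched))
    raise RuntimeError('Names still unmatched after {} rounds'.format(max_rounds))

# Hypothetical stand-in for the WoRMS lookup: only these names "match"
known = {'Paracalanus', 'Calanoida'}
def fake_match(names):
    return {n: n for n in names if n in known}

cols = ['scientificName', 'kingdom', 'order', 'genus']
toy = pd.DataFrame({
    'scientificName': ['Paracalanus sp. X', np.nan, 'Paracalanus'],
    'kingdom': ['Eukaryota', 'Eukaryota', 'Eukaryota'],
    'order': ['Calanoida', 'Calanoida', 'Calanoida'],
    'genus': ['Paracalanus', 'Madeupgenus', 'Paracalanus'],
})
toy, matched, rounds = match_until_done(toy, cols, fake_match)
print(toy['scientificName'].tolist(), rounds)
```

Raising an error when the round limit is reached is just one option; logging the leftover names and continuing would be another reasonable choice.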

There are quite a few records where no taxonomic information was obtained at all (i.e., after this whole process, `scientificName` is still NaN). I set `scientificName` to 'Biota' for these records.

In [21]:
## Change scientificName to Biota in cases where all taxonomic information is missing

print(occ[occ['scientificName'].isna()].shape)
occ.loc[occ['scientificName'].isna(), 'scientificName'] = 'Biota'
occ[occ['scientificName'].isna()].shape

(33360, 32)


(0, 32)

Finally, during the above process, **I altered the taxonomy columns in order to obtain the best possible `scientificName` column**. I chose to re-populate these columns with the taxonomy from the original data set, rather than altering some of the names to match taxonomy retrieved from WoRMS. In this case, it seemed best to adhere as closely as possible to the original data.

In [22]:
## Fix taxonomy columns

# Replace with original data
occ[cols[1:]] = plate[['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']].copy()

# Replace missing data indicators in original data with empty strings ('')
occ[cols[1:]] = occ[cols[1:]].replace({
    'unassigned':'',
    's_':'',
    'g_':'',
    'unknown':'',
    'no_hit':''})

In [23]:
## Add scientific name-related columns

# Build each column by assigning the result of replace(); calling replace(..., inplace=True)
# on a selected column may not update occ in recent pandas versions
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)

occ['nameAccordingTo'] = 'WoRMS'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS


I also wanted to persist the original name from the NCBI taxonomy database into the Darwin Core-converted data set. To do this, I queried the database based on the name in the original data to obtain its taxonomic ID number.

In [24]:
## Get set up to query NCBI taxonomy 

from Bio import Entrez

# ----- Insert your email here -----
Entrez.email = 'dianalg@mbari.org'
# ----------------------------------

# Get list of all databases available through this tool
record = Entrez.read(Entrez.einfo())
all_dbs = record['DbList']
all_dbs

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [25]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS

name_ncbiid_dict = {}

for name in names:
    handle = Entrez.esearch(db='taxonomy', retmax=10, term=name)
    record = Entrez.read(handle)
    name_ncbiid_dict[name] = record['IdList'][0]
    handle.close()

**Note** that this code raises `IndexError: list index out of range` if a term is not found, because `record['IdList']` is then empty and has no first element.
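
One way to avoid stopping on the first missing term is to collect the names with no hit separately. The sketch below is an assumption, not the original code: `lookup_taxids` is a hypothetical helper that takes the search step as a function, and `fake_search` is an offline stand-in with a made-up ID so the example runs without contacting NCBI.

```python
def lookup_taxids(names, search):
    """Return ({name: first taxonomy ID}, [names with no hit]).

    search is any callable that takes a name and returns a parsed esearch
    record, i.e. a dict-like object with an 'IdList' entry.
    """
    found, missing = {}, []
    for name in names:
        id_list = search(name)['IdList']
        if id_list:
            found[name] = id_list[0]
        else:
            missing.append(name)
    return found, missing

# Offline stand-in for the Entrez call; the ID below is made up for illustration
fake_records = {'Paracalanus': {'IdList': ['12345']}}
def fake_search(name):
    return fake_records.get(name, {'IdList': []})

found, missing = lookup_taxids(['Paracalanus', 'not a real taxon'], fake_search)
print(found, missing)
```

In the notebook, `search` would wrap the same calls used above, for example a small function returning `Entrez.read(Entrez.esearch(db='taxonomy', retmax=10, term=name))`, and `missing` could then be reviewed by hand.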

In [26]:
## Add NCBI taxonomy IDs under taxonConceptID

# Map indicators that say no taxonomy was assigned to empty strings
for indicator in ['unassigned', 's_', 'no_hit', 'unknown', 'g_']:
    name_ncbiid_dict[indicator] = ''

# Create column (replace returns a new Series, so plate is left untouched)
occ['taxonConceptID'] = plate['Species'].replace(name_ncbiid_dict)

# Add remainder of text and clean
occ['taxonConceptID'] = 'NCBI:txid' + occ['taxonConceptID']
occ['taxonConceptID'] = occ['taxonConceptID'].replace('NCBI:txid', '')
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,


In [27]:
## identificationRemarks

# Get identificationRemarks
occ = occ.merge(meta[['seqID', 'identificationRemarks']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)

# Add name that matched in GenBank - i.e. the species name from the original data
occ['identificationRemarks'] = plate['Species'].copy() + ', ' + occ['identificationRemarks']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
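
The left merge above assumes that each `seqID` appears only once in `meta`. If it appeared more than once, the merge would duplicate occurrence rows, and the index-based alignment with `plate` in the next line would no longer hold. As a hedged suggestion, the `validate` argument of `DataFrame.merge` can make that assumption explicit. The toy tables below are invented for illustration.

```python
import pandas as pd

occ_toy = pd.DataFrame({'eventID': ['s1', 's1', 's2']})
meta_toy = pd.DataFrame({'seqID': ['s1', 's2'],
                         'identificationRemarks': ['remark A', 'remark B']})

# validate='many_to_one' raises pandas.errors.MergeError if seqID repeats in the right table
merged = occ_toy.merge(meta_toy, how='left', left_on='eventID',
                       right_on='seqID', validate='many_to_one')
merged = merged.drop(columns='seqID')

# Show the failure mode with a duplicated seqID
meta_dup = pd.concat([meta_toy, meta_toy.iloc[[0]]], ignore_index=True)
try:
    occ_toy.merge(meta_dup, how='left', left_on='eventID',
                  right_on='seqID', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged) == len(occ_toy), duplicate_detected)
```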


In [28]:
## basisOfRecord

occ['basisOfRecord'] = 'MaterialSample'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample


In [29]:
## Add identificationReferences 

occ = occ.merge(meta[['seqID', 'identificationReferences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ['identificationReferences'] = occ['identificationReferences'].str.replace('| ', ' | ', regex=False)

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
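
The left merge in the cell above preserves the row count of `occ` only if each `seqID` appears once in `meta`. The following is a minimal sketch, on toy frames that are not taken from the dataset, of how the `validate` argument of `DataFrame.merge` can guard that assumption. The names `occ_demo`, `meta_demo`, and `merged` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `occ` and `meta`; the values are illustrative only.
occ_demo = pd.DataFrame({
    'eventID': ['s1', 's1', 's2'],
    'occurrenceID': ['s1_occ1', 's1_occ2', 's2_occ1'],
})
meta_demo = pd.DataFrame({
    'seqID': ['s1', 's2'],
    'identificationReferences': ['ref_a| ref_b', 'ref_c'],
})

# validate='many_to_one' raises pandas.errors.MergeError if seqID is
# duplicated in meta_demo, which would otherwise silently add rows.
merged = occ_demo.merge(
    meta_demo,
    how='left',
    left_on='eventID',
    right_on='seqID',
    validate='many_to_one',
).drop(columns='seqID')

# Same separator clean-up as in the cell above.
merged['identificationReferences'] = (
    merged['identificationReferences'].str.replace('| ', ' | ', regex=False)
)
print(merged)
```

The same guard would apply to the later `associatedSequences` merge, which uses the same keys.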


In [30]:
## organismQuantity (number of reads)

occ['organismQuantity'] = plate['Reads']
occ['organismQuantityType'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads
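
Assigning `plate['Reads']` to a column of `occ` aligns on the row index, not on position, so the cell above relies on `occ` and `plate` still sharing the same index. A small sketch with made-up frames (all names ending in `_demo`, and `shifted`, are hypothetical) shows the behaviour and a cheap check:

```python
import pandas as pd

plate_demo = pd.DataFrame({'Reads': [10, 0, 7]}, index=[0, 1, 2])
occ_demo = pd.DataFrame({'eventID': ['s1', 's1', 's2']}, index=[0, 1, 2])

# A cheap guard: the assignment is only row-for-row if the indexes match.
assert occ_demo.index.equals(plate_demo.index)
occ_demo['organismQuantity'] = plate_demo['Reads']

# If the indexes differ, pandas aligns by label and fills gaps with NaN.
shifted = occ_demo[['eventID']].copy()
shifted.index = [1, 2, 3]
shifted['organismQuantity'] = plate_demo['Reads']
print(shifted)
```

If the two frames had been filtered or re-sorted independently, a merge on explicit key columns would be the safer route.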


In the context of eDNA data, `sampleSizeValue` should be the total number of reads for a given sample.

In [31]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

(280440, 42)


Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419
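
Because `sampleSizeValue` is the per-sample total, the `organismQuantity` values within one `eventID` should add up to it, as long as zero-read rows have not yet been dropped. The sketch below uses toy data and computes the same total with `groupby(...).transform('sum')`. This is an alternative to the merge above, not the notebook's method, and the `relative_abundance` column is a hypothetical addition for illustration.

```python
import pandas as pd

occ_demo = pd.DataFrame({
    'eventID': ['s1', 's1', 's2', 's2'],
    'organismQuantity': [10, 0, 7, 3],
})

# transform('sum') broadcasts each sample's total back onto its rows,
# so no merge or column rename is needed.
occ_demo['sampleSizeValue'] = (
    occ_demo.groupby('eventID')['organismQuantity'].transform('sum')
)

# Relative abundance follows directly from the two columns.
occ_demo['relative_abundance'] = (
    occ_demo['organismQuantity'] / occ_demo['sampleSizeValue']
)
print(occ_demo)
```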


In [32]:
## sampleSizeUnit

occ['sampleSizeUnit'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads


In [33]:
## associatedSequences

occ = occ.merge(meta[['seqID', 'associatedSequences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit,associatedSequences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads,NCBI BioProject accession number PRJNA433203
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads,NCBI BioProject accession number PRJNA433203
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads,NCBI BioProject accession number PRJNA433203
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads,NCBI BioProject accession number PRJNA433203
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads,NCBI BioProject accession number PRJNA433203


In [34]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ = occ[occ['organismQuantity'] > 0].copy()  # copy so later in-place edits act on an independent frame
print(occ.shape)

(64903, 44)
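
Two details of this filter are easy to miss: a comparison with NaN is false, so rows with a missing count are dropped together with the zeros, and in some pandas versions later in-place edits on a filtered frame can raise chained-assignment warnings unless it is copied. A minimal sketch on toy data (the names `occ_demo`, `keep`, and `present` are hypothetical):

```python
import numpy as np
import pandas as pd

occ_demo = pd.DataFrame({
    'occurrenceID': ['a', 'b', 'c', 'd'],
    'organismQuantity': [12, 0, np.nan, 5],
})

# NaN > 0 evaluates to False, so missing counts are dropped along with zeros.
keep = occ_demo['organismQuantity'] > 0

# .copy() makes the result an independent frame.
present = occ_demo[keep].copy()
print(f"kept {len(present)} of {len(occ_demo)} rows")
```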


In [35]:
## Check for NaN values in string fields - if there are any, replace them with empty strings ('')

occ.isna().sum()

eventID                     0
eventDate                   0
decimalLatitude             0
decimalLongitude            0
env_broad_scale             0
env_local_scale             0
env_medium                  0
target_gene                 0
pcr_primer_forward          0
pcr_primer_reverse          0
pcr_primer_name_forward     0
pcr_primer_name_reverse     0
pcr_primer_reference        0
sop                         0
seq_meth                    0
samp_vol_we_dna_ext         0
nucl_acid_ext               0
nucl_acid_amp               0
target_subfragment          0
lib_layout                  0
otu_class_appr              0
otu_seq_comp_appr           0
otu_db                      0
occurrenceID                0
DNA_sequence                0
scientificName              0
kingdom                     0
phylum                      0
class                       0
order                       0
family                      0
genus                       0
scientificNameID            0
taxonID   
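
The comment in the cell above calls for replacing NaN in string fields with empty strings, but the replacement step is not shown. One possible sketch, assuming the string fields are listed explicitly (the column list and the toy frame are hypothetical):

```python
import pandas as pd

occ_demo = pd.DataFrame({
    'scientificName': ['Paracalanus', None],
    'taxonConceptID': [None, None],
    'organismQuantity': [14825, 16094],
})

# Name the text columns explicitly: an all-missing column may not be stored
# with a string dtype, so dtype-based selection could skip it.
str_cols = ['scientificName', 'taxonConceptID']
occ_demo[str_cols] = occ_demo[str_cols].fillna('')

print(occ_demo.isna().sum())
```

Numeric columns are left untouched, so a missing count would still show up in the `isna()` summary.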

In [36]:
## Divide into occurrence and DNADerivedDataExt

ddd_cols = [
    'eventID',
    'occurrenceID',
    'DNA_sequence',
    'sop',
    'nucl_acid_ext',
    'samp_vol_we_dna_ext',
    'nucl_acid_amp',
    'target_gene',
    'target_subfragment',
    'lib_layout',
    'pcr_primer_forward',
    'pcr_primer_reverse',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'seq_meth',
    'otu_class_appr',
    'otu_seq_comp_appr',
    'otu_db',
    'env_broad_scale',
    'env_local_scale',
    'env_medium',
]

DNADerivedData = occ[ddd_cols].copy()

occ.drop(ddd_cols[2:], axis=1, inplace=True)
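
In the cell above, the first two entries of `ddd_cols` (`eventID` and `occurrenceID`) stay in both tables, which is what lets each extension row be linked back to its occurrence record. The following sketch repeats the split on toy data and checks that link. All names in it are hypothetical, and it uses a non-in-place `drop` only to keep the example self-contained.

```python
import pandas as pd

occ_demo = pd.DataFrame({
    'eventID': ['s1', 's1'],
    'occurrenceID': ['s1_occ1', 's1_occ2'],
    'DNA_sequence': ['ACGT', 'TTGA'],
    'target_gene': ['18S', '18S'],
    'scientificName': ['Paracalanus', 'Calanoida'],
})

key_cols = ['eventID', 'occurrenceID']       # kept in both tables
ext_only = ['DNA_sequence', 'target_gene']   # moved to the extension

dna_demo = occ_demo[key_cols + ext_only].copy()
core_demo = occ_demo.drop(columns=ext_only)

# Every extension row should point at exactly one core record.
assert dna_demo['occurrenceID'].is_unique
assert set(dna_demo['occurrenceID']) == set(core_demo['occurrenceID'])
print(core_demo.columns.tolist())
```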

## Save

In [37]:
## Save

# Get path
folder = os.getcwd().replace('src', 'processed')
occ_filename = os.path.join(folder, 'occurrence.csv')
ddd_filename = os.path.join(folder, 'dna_extension.csv')

# Create folder (no error if it already exists)
os.makedirs(folder, exist_ok=True)

# Save
# occ.to_csv(occ_filename, index=False, na_rep='NaN')
# DNADerivedData.to_csv(ddd_filename, index=False, na_rep='NaN')

# Boneyard

**update**
Plate data contains the ASV sequence, the number of reads (number of times that ASV was observed in the sample), and the taxonomy associated with that ASV.

| Column name| Column definition                                                                                                                                                                           |
|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ASV        | The sequence of the Amplicon Sequence Variant observed                                                                                                                                       |
| FilterID   | A unique identifier for the filter the sample was obtained from, composed of: <br>- cruise number <br>- CTD cast number<br>- CTD bottle number <br>- filter indicator <br>- replicate number |
| Sequence_ID| The FilterID plus a letter indicating which plate the sample was on when sequenced                                                                                                           |
| Reads      | The number of reads for the ASV                                                                                                                                                             |
| Kingdom    | The Kingdom of the taxonomic identity assigned to the ASV, if known                                                                                                                          |
| Phylum     | The Phylum of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Class      | The Class of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Order      | The Order of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Family     | The Family of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Genus      | The Genus of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Species    | The Species of the taxonomic identity assigned to the ASV, if known                                                                                                                          |

Additionally, taxonomic columns may include the following designations:
- **unknown** = GenBank couldn't give a scientifically agreed-upon name for a given taxonomic rank: either the name doesn't exist, or there isn't enough scientific consensus to give one.
- **no_hit** = BLAST did not find any hits for the ASV.
- **unassigned** = The ASV got BLAST hits, but the post-processing program Megan6 didn't assign the ASV to any taxonomic group.
- **g_** or **s_** = Megan6 assigned the ASV to a genus or species, but not with high enough confidence to include it. 
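
These designations are placeholders rather than taxon names, so they generally need handling before any name matching. The sketch below shows one way to blank them on a toy frame. The helper name `clean_taxon`, the exact-match rule, and the choice to blank rather than record the placeholder elsewhere are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Placeholder values from the list above; exact matches only (an assumption).
PLACEHOLDERS = {'unknown', 'no_hit', 'unassigned', 'g_', 's_'}

def clean_taxon(value):
    """Return '' for placeholder or missing values, else the stripped name."""
    if pd.isna(value) or str(value).strip() in PLACEHOLDERS:
        return ''
    return str(value).strip()

plate_demo = pd.DataFrame({
    'Genus': ['Paracalanus', 'g_', 'no_hit'],
    'Species': ['s_', 'unassigned', 'no_hit'],
})

# Apply the helper to every cell, column by column.
cleaned = plate_demo.apply(lambda col: col.map(clean_taxon))
print(cleaned)
```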

In [3]:
## Plate metadata

filename = os.getcwd().replace('src', os.path.join('raw', 'metadata_table.csv'))  
meta = pd.read_csv(filename)
print(meta.shape)
meta.head()

(60, 67)


Unnamed: 0,sample_name,library,tag_sequence,primer_sequence_forward,primer_sequence_reverse,R1,R2,PlateID,sample_type,target_gene,...,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,seq_meth,sequencing_facility,seqID,identificationRemarks,identificationReferences,FilterID,associatedSequences
0,14213c01_12_eDNA_1,S1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_1_S1_L001_R1_001.fastq.gz,14213c01_12_edna_1_S1_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
1,14213c01_12_eDNA_2,S2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_2_S2_L001_R1_001.fastq.gz,14213c01_12_edna_2_S2_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
2,14213c01_12_eDNA_3,S3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_3_S3_L001_R1_001.fastq.gz,14213c01_12_edna_3_S3_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_3_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
3,22013c01_12_eDNA_1,S4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_1_S4_L001_R1_001.fastq.gz,22013c01_12_edna_1_S4_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203
4,22013c01_12_eDNA_2,S5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_2_S5_L001_R1_001.fastq.gz,22013c01_12_edna_2_S5_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203


Metadata contains information on sample collection, DNA extraction, DNA amplification, and DNA sequencing. 

Definitions of relevant columns:

| Column name              | Column definition                                                                       |
|--------------------------|-----------------------------------------------------------------------------------------|
| primer_sequence_forward  | The sequence of the forward primer used during PCR                                      |
| primer_sequence_reverse  | The sequence of the reverse primer used during PCR                                      |
| target_gene              | The gene being targeted for amplification during PCR                                    |
| eventDate                | The date (and time, if available) the water sample was collected                        |
| decimalLatitude          | The latitude in decimal degrees where the water sample was collected (WGS84)            |
| decimalLongitude         | The longitude in decimal degrees where the water sample was collected (WGS84)           |
| env_broad_scale          | The broadest descriptor of the environment from which the water sample was collected    |
| env_local_scale          | A more specific descriptor of the environment from which the water sample was collected |
| env_medium               | A descriptor of the medium from which the DNA was collected                             |
| minimumDepthInMeters     | The minimum depth at which the water sample was collected                               |
| maximumDepthInMeters     | The maximum depth at which the water sample was collected                               |
| samp_vol_we_dna_ext      | The volume of the water sample that was processed during DNA extraction                 |
| nucl_acid_ext            | Reference to the DNA extraction protocol                                                |
| nucl_acid_amp            | Reference to the DNA amplification protocol                                             |
| sop                      | Links or references to standard operating protocols used to obtain the data             |
| pcr_primer_name_forward  | Name of the forward primer used during PCR                                              |
| pcr_primer_name_reverse  | Name of the reverse primer used during PCR                                              |
| pcr_primer_reference     | Reference for PCR primers                                                               |
| seq_meth                 | The sequencing method used                                                              |
| identificationRemarks    | Information on the taxonomic identification process                                     |
| identificationReferences | References to procedures and/or code used during the taxonomic identification process   |
| associatedSequences      | The identifier of the published raw DNA sequences from the water sample, if available   |

In [None]:
data['sample_data']