# Darwin Core Conversion of eDNA Sequence Data From the AOML_MIMARKS metadata template 

**Version:** 1.0.8

**Author:** Katherine Silliman

**Last Updated:** 15-April-2024

This notebook converts a [MIMARKS](https://fairsharing.org/FAIRsharing.zvrep1)-based data sheet to Darwin Core for submission to OBIS. It has been tested on a Mac M1 laptop running in Rosetta mode with Python 3.11.

[Metadata template Google Sheet](https://docs.google.com/spreadsheets/d/1jof9MBEll7Xluu8-_znLRBIP9JpyAd_5YvdioZ-REoY/edit?usp=sharing)

**Requirements:**
- Python 3
- Python 3 packages:
    - os
- External packages:
    - Bio.Entrez from biopython
    - numpy
    - pandas
    - openpyxl
    - pyworms
    - multiprocess
- Custom modules:
    - WoRMS_matching

**Resources:**
- Abarenkov K, Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Provoost P, Schigel D, Suominen S, Svenningsen C & Frøslev TG (2023) Publishing DNA-derived data through biodiversity data platforms, v1.3. Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22
- [OBIS manual](https://manual.obis.org/dna_data.html)
- [TDWG Darwin Core Occurrence Core](https://dwc.tdwg.org/terms/#occurrence)
- [GBIF DNA Derived Data Extension](https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/DNADerivedData)
- https://github.com/iobis/dataset-edna

**Citation**  
Silliman K, Anderson S, Storo R, Thompson L (2023) A Case Study in Sharing Marine eDNA Metabarcoding Data to OBIS. Biodiversity Information Science and Standards 7: e111048. https://doi.org/10.3897/biss.7.111048


## Installation  

```bash
conda create -n edna2obis
conda activate edna2obis
conda install -c conda-forge notebook
conda install -c conda-forge nb_conda_kernels

conda install -c conda-forge numpy pandas
conda install -c conda-forge openpyxl

#worms conversion
conda install -c conda-forge pyworms
conda install -c conda-forge multiprocess
conda install -c conda-forge biopython
```
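After creating the environment, it can help to confirm that the external packages listed under Requirements are importable before running the notebook. The following is a minimal sketch, not part of the original workflow; the helper name `find_missing_modules` is hypothetical, and the list uses import names (biopython is imported as `Bio`).

```python
import importlib.util

# Import names for the external packages listed under Requirements.
# Note that biopython is imported as "Bio".
REQUIRED_MODULES = ["Bio", "numpy", "pandas", "openpyxl", "pyworms", "multiprocess"]


def find_missing_modules(module_names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

Any name reported as missing can be installed with the matching `conda install -c conda-forge` command above.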

In [70]:
## Imports
import os

import numpy as np
import pandas as pd

import WoRMS_matching # custom functions for querying WoRMS API

In [2]:
# jupyter notebook parameters
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_columns', 50)

Note that in a Jupyter Notebook, the current directory is always where the .ipynb file is being run.

## Prepare input data 

**Project data and metadata**  
This workflow assumes that you have your project metadata in an Excel sheet formatted like the template located [here](https://docs.google.com/spreadsheets/d/1YBXFU9PuMqm7IT1tp0LTxQ1v2j0tlCWFnhSpy-EBwPw/edit?usp=drive_link). Instructions for filling out the metadata template are located in the 'Readme' sheet and at the [documentation website](https://noaa-omics-templates.readthedocs.io/en/latest/).

**eDNA and taxonomy data**  
The eDNA data and assigned taxonomy should be in a specific tab-delimited format. ![asv_table format](../images/asv_table.png)

This file is generated automatically by [Tourmaline v2023.5+](https://github.com/aomlomics/tourmaline) at `01-taxonomy/asv_taxa_sample_table.tsv`. If your data were generated with QIIME 2 or an earlier version of Tourmaline, you can convert the `table.qza`, `taxonomy.qza`, and `repseqs.qza` outputs to the correct format using the `create_asv_seq_taxa_obis.sh` shell script.

Example:  

``` bash
#Run this with a qiime2 environment. 
bash create_asv_seq_taxa_obis.sh -f \
../gomecc_v2_raw/table-16S-merge.qza -t ../gomecc_v2_raw/taxonomy-16S-merge.qza -r ../gomecc_v2_raw/repseqs-16S-merge.qza \
-o ../gomecc_v2_raw/gomecc-16S-asv.tsv
```


## Set configs  

Below you can set definitions for parameters used in the code. 

| Parameter           | Description                                                                                                       | Example                                                                                              |
|---------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `sample_data`       | Name of sheet in project data Excel file with sample data.                                                        | "water_sample_data"                                                                                  |
| `prep_data`         | Name of sheet in project data Excel file with data about molecular preparation methods.                           | "amplicon_prep_data"                                                                                 |
| `analysis_data`     | Name of sheet in project data Excel file with data about analysis methods.                                        | "analysis_data"                                                                                      |
| `study_data`        | Name of sheet in project data Excel file with metadata about the study.                                           | "study_data"                                                                                         |
| `msmt_metadata`     | Name of sheet in project data Excel file with metadata about additional measurements. Not used in current code.   | "measurement_metadata"                                                                               |
| `excel_file`        | Path of project data Excel file.                                                                                  | "../raw/gomecc4_AOML_MIMARKS.survey.water.6.0.xlsx"                                                  |
| `md_excel`          | Path of data dictionary Excel file.                                                                               | "../raw/gomecc_AOML2DwC standards.xlsx"                                                              |
| `datafiles`         | Python dictionary, where keys are the amplicon names and the values are the paths to the corresponding ASV table. | {'16S V4-V5': '../raw/gomecc-16S-asv.tsv', '18S V9': '../raw/gomecc-18S-asv.tsv'}                    |
| `skip_sample_types` | Python list of sample_type values to skip from OBIS submission, such as controls or blanks.                       | ['mock community','distilled water blank','extraction blank','PCR no-template control','RTSF blank'] |
| `skip_columns`      | Python list of columns to ignore when submitting to OBIS.                                                         | ['notes_sampling']                                                                                   |

In [3]:
params = {}
params['sample_data'] = "water_sample_data"
params['prep_data']= "amplicon_prep_data"
params['analysis_data'] = "analysis_data"
params['study_data'] = "study_data"
# measurement metadata is currently not processed by the workflow
#params['msmt_metadata'] = "measurement_metadata"
params['excel_file'] = "../raw/gomecc4_NOAA_MIMARKS.survey.water.6.0_v1.0.8_sharing.xlsx"

params['datafiles'] = {'16S V4-V5': '../raw/gomecc-16S-asv.tsv',
                       '18S V9': '../raw/gomecc-18S-asv.tsv'}

params['skip_sample_types'] = ['mock community','distilled water blank','extraction blank','PCR no-template control','RTSF blank']
params['skip_columns']= ['notes_sampling','date_modified','modified_by']
params['md_excel'] = "../raw/gomecc_NOAA2DwC_standards_v1.8.xlsx"
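
Because every later cell depends on these paths, a quick existence check can catch a typo before any file is loaded. This is a sketch under the assumption that `params` uses the keys shown above; the function name `find_missing_files` and the example paths are hypothetical.

```python
from pathlib import Path


def find_missing_files(params):
    """Return the configured input paths that do not exist on disk."""
    paths = [params["excel_file"], params["md_excel"], *params["datafiles"].values()]
    return [p for p in paths if not Path(p).is_file()]


# Example with the notebook's own keys; the paths here are only illustrative.
example_params = {
    "excel_file": "../raw/project_metadata.xlsx",
    "md_excel": "../raw/data_dictionary.xlsx",
    "datafiles": {"16S V4-V5": "../raw/gomecc-16S-asv.tsv"},
}
print(find_missing_files(example_params))
```

An empty list means all configured files were found relative to the notebook's working directory.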


## Load data

### Load project data Excel file

In [4]:

data = pd.read_excel(
    params['excel_file'], 
    [params['study_data'],params['sample_data'],params['prep_data'],params['analysis_data']],
    index_col=None, na_values=[""], comment="#"
)

In [5]:
#rename keys in data dictionary to a general term
data['sample_data'] = data.pop(params['sample_data'])
data['prep_data'] = data.pop(params['prep_data'])
data['analysis_data'] = data.pop(params['analysis_data'])
data['study_data'] = data.pop(params['study_data'])

#### sample_data  
Contextual data about the collected samples, such as when and where each sample was collected, what kind of sample it is, and the properties of the environment or experimental condition from which it was taken. Each row is a distinct sample, or Event. Most of this information is recorded during sample collection. This sheet contains terms from the MIMARKS survey water 6.0 package.

In [6]:
data['sample_data'].head()

Unnamed: 0,sample_name,serial_number,cruise_id,line_id,station,locationID,ctd_bottle_no,sample_replicate,source_material_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,notes_sampling,project_id,amplicon_sequenced,metagenome_sequenced,organism,collection_date_local,collection_date,depth,minimumDepthInMeters,maximumDepthInMeters,env_broad_scale,...,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,date_modified,modified_by
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,DCM = deep chlorophyl max.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,2024-04-19 06:48:45.888,aomlomics@gmail.com
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,DCM was around 80 m and not well defined.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,NaT,
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,NaT,
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,Only enough water for 2 surface replicates.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],...,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,NaT,
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],...,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,NaT,


#### prep_data  
Contextual data about how the samples were prepared for sequencing, including how they were extracted, which amplicon was targeted, and how they were sequenced. Each row is a separate sequencing library preparation, distinguished by a unique library_id.

In [7]:
data['prep_data'].head(2)

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,filetype,filename,filename2,drive_location,biosample_accession,sra_accession,date_dna_extracted,extraction_personnel,date_pcr,pcr_personnel,seq_facility,seq_meth,nucl_acid_ext,amplicon_sequenced,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode,date_modified,modified_by
0,GOMECC4_NegativeControl_1,GOMECC16S_Neg1,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Neg1_S499_L001_R1_001.fastq.gz,GOMECC16S_Neg1_S499_L001_R2_001.fastq.gz,,SAMN37516589,SRR26148505,,,,,,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,2024-04-15 12:58:06.355,aomlomics@gmail.com
1,GOMECC4_27N_Sta1_DCM_A,GOMECC18S_Plate4_53,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_53_S340_L001_R1_001.fastq.gz,GOMECC18S_Plate4_53_S340_L001_R2_001.fastq.gz,,SAMN37516094,SRR26161153,,,,,,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,NaT,


### Load ASV data  
There is one ASV file for each marker that was sequenced. Each ASV data file has one row per unique amplicon sequence variant (ASV). It contains the ASV DNA sequence, a unique hash identifier, the taxonomic assignment for each ASV, the confidence given to that assignment by the naive-bayes classifier, and the number of reads observed in each sample.

This file is created automatically with [Tourmaline v2023.5+](https://github.com/aomlomics/tourmaline), and is found in `01-taxonomy/asv_taxa_sample_table.tsv`.

| column name    | definition                                                                                                                                                                                                                                                                                                                                                                                              |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| featureid      | A hash of the ASV sequence, used as a unique identifier for the ASV.                                                                                                                                                                                                                                                                                                                                    |
| sequence       | The DNA sequence of the ASV                                                                                                                                                                                                                                                                                                                                                                             |
| taxonomy       | The full taxonomy assigned to an ASV sequence. This string could be formatted in very different ways depending on the reference database used during classification, however it should always be in reverse rank order separated by ;. We provide examples for how to process results from a Silva classifier and the PR2 18S classifier. For other taxonomy formats, the code will need to be adapted. |
| Confidence     | This is the confidence score assigned the taxonomic classification with a naive-bayes classifier.                                                                                                                                                                                                                                                                                                       |
| sample columns | The next columns each represent a sample (or eventID), and the number of reads for that ASV observed in the sample.                                                                                                                                                                                                                                                                                     |
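
To illustrate what processing the `taxonomy` column involves, the sketch below splits a Silva-style string (rank prefixes such as `d__` and `p__`, separated by `;`) into a dictionary keyed by rank prefix. It is not the notebook's own parsing code; the function name `parse_silva_taxonomy` is hypothetical, and other reference databases such as PR2 would need different handling.

```python
def parse_silva_taxonomy(taxonomy):
    """Split a 'd__X; p__Y; ...' string into a {rank_prefix: name} dictionary."""
    ranks = {}
    for field in taxonomy.split(";"):
        field = field.strip()
        if "__" not in field:
            continue  # skip empty or unprefixed fields
        prefix, name = field.split("__", 1)
        ranks[prefix] = name
    return ranks


example = ("d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
           "o__Vibrionales; f__Vibrionaceae; g__Vibrio")
parsed = parse_silva_taxonomy(example)
print(parsed["g"])  # Vibrio
```

Ranks that are absent from the string (here, species) are simply missing from the dictionary, so downstream code should use `parsed.get("s")` rather than direct indexing.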

In [8]:
# read in ASV tables, looping through amplicons
asv_tables = {}

for gene in params['datafiles'].keys():
    asv_tables[gene] = pd.read_table(params['datafiles'][gene])


In [9]:
asv_tables.keys()

dict_keys(['16S V4-V5', '18S V9'])

In [10]:
asv_tables['16S V4-V5'].iloc[:,0:20].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C,GOMECC4_27N_Sta4_Deep_A,GOMECC4_27N_Sta4_Deep_B,GOMECC4_27N_Sta4_Deep_C,GOMECC4_27N_Sta4_Surface_A,GOMECC4_27N_Sta4_Surface_B
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGGTTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAATTGCATTTGAAACTGGCAGACTAGAGTACTGTAGAGGGGGGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio,0.978926,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGGTATTTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATCCGAAACTATTTAACTAGAGTATGGGAGAGGTAAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus,0.881698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTTCTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAATACTGGAAGACTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Synechococcales; f__Cyanobiaceae; g__Cyanobium_PCC-6307,0.762793,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAAATAAGTCTGATGTGAAGGCCCAGGGCTCAACCCTGGAAGTGCATCGGAAACTGTTTGGCTCGAGTCCCGGAGGGGGTGGTGGAATTC...,d__Bacteria; p__Myxococcota; c__Myxococcia; o__Myxococcales; f__Myxococcaceae; g__P3OB-42; s__uncultured_bacterium,0.997619,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTTTTAAGTTGGATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATCCAAAACTAGATGACTAGAGTACGAAAGAGGGAAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__SAR86_clade; f__SAR86_clade; g__SAR86_clade,0.999961,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Drop samples with unwanted sample types  

Often with eDNA projects, we have control samples that are sequenced along with our survey samples. These can include filtering distilled water, using pure water instead of DNA in a PCR or DNA extraction protocol, or a mock community of known microbial taxa. Controls can help identify and mitigate contaminant DNA in our samples, but are not useful for biodiversity platforms like OBIS. You can select which sample_type values to drop with the `skip_sample_types` parameter.

In [11]:
samps_to_remove = data['sample_data']['sample_type'].isin(params['skip_sample_types'])
#data['sample_data'][samps_to_remove]
# list of samples to drop
samples_to_drop = data['sample_data']['sample_name'][samps_to_remove]

You can view the list of samples to be dropped below.

In [12]:
samples_to_drop

26     GOMECC4_Blank_DIW_20210915_A
27     GOMECC4_Blank_DIW_20210915_B
28     GOMECC4_Blank_DIW_20210915_C
200    GOMECC4_Blank_DIW_20210930_A
201    GOMECC4_Blank_DIW_20210930_B
202    GOMECC4_Blank_DIW_20210930_C
334    GOMECC4_Blank_DIW_20211011_A
335    GOMECC4_Blank_DIW_20211011_B
336    GOMECC4_Blank_DIW_20211011_C
409    GOMECC4_Blank_DIW_20211016_A
410    GOMECC4_Blank_DIW_20211016_B
411    GOMECC4_Blank_DIW_20211016_C
484       GOMECC4_ExtractionBlank_1
485      GOMECC4_ExtractionBlank_11
486      GOMECC4_ExtractionBlank_12
487       GOMECC4_ExtractionBlank_3
488       GOMECC4_ExtractionBlank_5
489       GOMECC4_ExtractionBlank_7
490       GOMECC4_ExtractionBlank_9
491            GOMECC4_MSUControl_1
492            GOMECC4_MSUControl_2
493            GOMECC4_MSUControl_3
494            GOMECC4_MSUControl_4
495            GOMECC4_MSUControl_5
496            GOMECC4_MSUControl_6
497            GOMECC4_MSUControl_7
498       GOMECC4_NegativeControl_1
499       GOMECC4_NegativeCo

In [13]:
# remove samples from sample_data sheet
data['sample_data'] = data['sample_data'][~samps_to_remove]

In [14]:
# check the sample_type values left in your sample_data. We only want seawater.
data['sample_data']['sample_type'].unique()

array(['seawater'], dtype=object)

In [15]:
# remove samples from prep_data
prep_samps_to_remove = data['prep_data']['sample_name'].isin(samples_to_drop)
data['prep_data'] = data['prep_data'][~prep_samps_to_remove]

#### Drop unwanted samples from ASV files


In [48]:
for gene in params['datafiles'].keys():
    asv_tables[gene] = asv_tables[gene].drop(columns=samples_to_drop,errors='ignore')
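
Because `errors='ignore'` silently skips names that are not present, it may be worth checking afterwards that every remaining sample column in an ASV table has a matching `sample_name`. The sketch below runs on toy data; the helper `samples_without_metadata` is hypothetical, and it assumes the four non-sample columns are `featureid`, `sequence`, `taxonomy`, and `Confidence`, as in the column definitions above.

```python
import pandas as pd

# Non-sample columns of the ASV table, as listed in the column definitions.
ASV_INFO_COLUMNS = ["featureid", "sequence", "taxonomy", "Confidence"]


def samples_without_metadata(asv_table, sample_names):
    """Return ASV-table sample columns that have no matching sample_name."""
    sample_columns = [c for c in asv_table.columns if c not in ASV_INFO_COLUMNS]
    known = set(sample_names)
    return [c for c in sample_columns if c not in known]


# Toy data standing in for one ASV table and the filtered sample sheet.
toy_asv = pd.DataFrame({
    "featureid": ["a1"], "sequence": ["ACGT"], "taxonomy": ["d__Bacteria"],
    "Confidence": [0.99], "sample_1": [10], "blank_1": [0],
})
toy_samples = pd.Series(["sample_1", "sample_2"])
print(samples_without_metadata(toy_asv, toy_samples))  # ['blank_1']
```

In the notebook this could be called as `samples_without_metadata(asv_tables[gene], data['sample_data']['sample_name'])` for each gene.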

### Drop columns with all NAs  

If your project data file has columns with only NAs, this code will check for those, provide their column headers for verification, then remove them.

In [16]:
# which have all NAs?
dropped = pd.DataFrame()
for sheet in ['sample_data','prep_data','analysis_data']:
    res = pd.Series(data[sheet].columns[data[sheet].isnull().all(0)],
                name=sheet)
    dropped=pd.concat([dropped,res],axis=1)
    

Which columns in each sheet have only NA values?

In [17]:
dropped

Unnamed: 0,sample_data,prep_data,analysis_data
0,,drive_location,assembly_qual
1,,date_dna_extracted,assembly_software
2,,extraction_personnel,annot
3,,date_pcr,number_contig
4,,pcr_personnel,sop
5,,seq_facility,compl_score
6,,date_modified,compl_software
7,,modified_by,compl_appr
8,,,contam_score
9,,,contam_screen_input


If you are fine with leaving these columns out, proceed:

In [18]:
for sheet in ['sample_data','prep_data','analysis_data']:
    data[sheet].dropna(axis=1, how='all',inplace=True)

Now let's check which columns have missing values in some of the rows. These should be filled in on the Excel sheet with the appropriate term ('not applicable', 'missing', or 'not collected'). Alternatively, you can drop the column if it is not needed for submission to OBIS.

In [19]:
# which columns have missing data (NAs) in some rows
some = pd.DataFrame()
for sheet in ['sample_data','prep_data','analysis_data']:
    res = pd.Series(data[sheet].columns[data[sheet].isnull().any()].tolist(),
                name=sheet)
    some=pd.concat([some,res],axis=1)

In [20]:
some

Unnamed: 0,sample_data,prep_data,analysis_data
0,notes_bottle_metadata,,date_modified
1,date_modified,,modified_by
2,modified_by,,


Here I'm going to drop all the columns with some missing data, as I don't need them for submission to OBIS.

In [21]:
# drop columns with any missing data
for sheet in ['sample_data','prep_data','analysis_data']:
    data[sheet].dropna(axis=1, how='any',inplace=True)
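
The two `dropna` calls above differ only in the `how` argument, and the second is much more aggressive. This toy example (made-up columns, not project data) shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sample_name": ["s1", "s2", "s3"],
    "notes": ["ok", np.nan, np.nan],           # some values missing
    "seq_facility": [np.nan, np.nan, np.nan],  # all values missing
})

# how="all" drops only columns in which every value is missing
all_na_dropped = toy.dropna(axis=1, how="all")
# how="any" drops every column that has at least one missing value
any_na_dropped = toy.dropna(axis=1, how="any")

print(list(all_na_dropped.columns))  # ['sample_name', 'notes']
print(list(any_na_dropped.columns))  # ['sample_name']
```

If a partially filled column is needed for OBIS, fill its gaps in the Excel sheet before running the `how='any'` step.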

### Load data dictionary Excel file 
This Excel file is used as a data dictionary for converting between terms used in the project data Excel file and Darwin Core terms for submission to OBIS. Currently, we are only preparing an Occurrence core file and a DNA-derived extension file, with Event information in the Occurrence file. Future versions of this workflow will prepare an extendedMeasurementOrFact file as well.

In [22]:
# read in data dictionary excel file
dwc_data = pd.read_excel(
    params['md_excel'], 
    ['event','occurrence','dna'],
    index_col=0, na_values=[""]
)

In [23]:
#example of a sheet in the data dictionary
dwc_data['event'].head()

Unnamed: 0_level_0,AOML_term,AOML_file,DwC_definition
DwC_term,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
eventID,sample_name,water_sample_data,An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:...
eventDate,collection_date_local,water_sample_data,this is the date-time when the dwc:Event was recorded. Recommended best practice is to use a date that conforms to ISO 8601-1:2019. https://dwc.td...
samplingProtocol,collection_method,water_sample_data,"The names of, references to, or descriptions of the methods or protocols used during a dwc:Event."
locationID,locationID,water_sample_data,An identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.
decimalLatitude,decimalLatitude,water_sample_data,"The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:..."


## Convert to Occurrence file
In order to link the DNA-derived extension metadata to our OBIS occurrence records, we have to use the Occurrence core. For this data set, a `parentEvent` is a filtered water sample that was DNA extracted, a sequencing library from that DNA extraction is an `event`, and an `occurrence` is an ASV observed within a library. We will have an occurrence file and a DNA-derived data file. Future versions will generate a measurements file.

**Define files**


### Sampling event info 



In [24]:
dwc_data['event']

Unnamed: 0_level_0,AOML_term,AOML_file,DwC_definition
DwC_term,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
eventID,sample_name,water_sample_data,An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:...
eventDate,collection_date_local,water_sample_data,this is the date-time when the dwc:Event was recorded. Recommended best practice is to use a date that conforms to ISO 8601-1:2019. https://dwc.td...
samplingProtocol,collection_method,water_sample_data,"The names of, references to, or descriptions of the methods or protocols used during a dwc:Event."
locationID,locationID,water_sample_data,An identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.
decimalLatitude,decimalLatitude,water_sample_data,"The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:..."
decimalLongitude,decimalLongitude,water_sample_data,"The geographic longitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms..."
geodeticDatum,geodeticDatum,water_sample_data,"The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decima..."
countryCode,countryCode,water_sample_data,
minimumDepthInMeters,minimumDepthInMeters,water_sample_data,
maximumDepthInMeters,maximumDepthInMeters,water_sample_data,


In [25]:
event_dict = dwc_data['event'].to_dict('index')

In [26]:
event_dict['eventID']

{'AOML_term': 'sample_name',
 'AOML_file': 'water_sample_data',
 'DwC_definition': 'An identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). https://dwc.tdwg.org/terms/#dwc:eventID'}

In [27]:
# check which event terms are not in sample_data sheet
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] == params['sample_data']:
        if event_dict[key]['AOML_term'] not in data['sample_data'].columns:
            print(key,event_dict[key])

Here is where you can add missing terms with custom code. For example, if the data lacked a `locationID` column, we could create one by combining the `line_id` and `station` columns.
```
# build locationID from line_id + station
data['sample_data']['locationID'] = data['sample_data']['line_id'] + "_" + data['sample_data']['station']
```

In [28]:
# rename sample_data columns to fit DwC standard
gen = (x for x in event_dict.keys() if event_dict[x]['AOML_file'] == params['sample_data'])
rename_dict = {}
for x in gen:
    #print(x)
    rename_dict[event_dict[x]['AOML_term']] = x

event_sample = data['sample_data'].rename(columns=rename_dict)
event_sample = event_sample.drop(columns=[col for col in event_sample if col not in rename_dict.values()])


In [29]:

event_sample.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol
0,GOMECC4_27N_Sta1_Deep_A,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette
1,GOMECC4_27N_Sta1_Deep_B,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette
2,GOMECC4_27N_Sta1_Deep_C,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette
3,GOMECC4_27N_Sta1_DCM_A,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette
4,GOMECC4_27N_Sta1_DCM_B,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette


Add amplicon_sequenced back, as we need this to link prep data to the correct samples.

In [30]:
event_sample['amplicon_sequenced'] = data['sample_data']['amplicon_sequenced']

Now add an event for each sequencing library, with the replicate water sample as the parent event (`parentEventID`).  

**Future Update**: make this a for loop
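
One possible shape for that loop, sketched on a toy table rather than the real `event_sample`. The `marker_suffix` mapping and the demo values are assumptions made for illustration; the per-marker cells below remain the notebook's actual method.

```python
import pandas as pd

# Toy stand-in for event_sample (values are illustrative only)
event_sample_demo = pd.DataFrame({
    'eventID': ['S1_A', 'S1_B', 'S2_A'],
    'amplicon_sequenced': ['16S V4-V5 | 18S V9', '16S V4-V5 | 18S V9', '18S V9'],
})

# Hypothetical mapping: marker name as written in amplicon_sequenced -> eventID suffix
marker_suffix = {'16S V4-V5': '_16S', '18S V9': '_18S'}

child_frames = []
for marker, suffix in marker_suffix.items():
    # regex=False so the marker name is matched literally
    has_marker = event_sample_demo['amplicon_sequenced'].str.contains(marker, regex=False)
    child = event_sample_demo[has_marker].copy()
    child['parentEventID'] = child['eventID']      # the water sample is the parent
    child['eventID'] = child['eventID'] + suffix   # one child event per library
    child_frames.append(child)

all_event_demo = pd.concat(child_frames, axis=0, ignore_index=True)
```

Adding a third marker would then only require a new entry in the mapping.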

In [31]:
child_data_16S = event_sample[event_sample['amplicon_sequenced'].str.contains('16S V4-V5')].copy()
child_data_16S['parentEventID'] = child_data_16S['eventID']
child_data_16S['eventID'] = child_data_16S['eventID']+"_16S"
child_data_16S.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,amplicon_sequenced,parentEventID
0,GOMECC4_27N_Sta1_Deep_A_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_A
1,GOMECC4_27N_Sta1_Deep_B_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_B
2,GOMECC4_27N_Sta1_Deep_C_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_C
3,GOMECC4_27N_Sta1_DCM_A_16S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_A
4,GOMECC4_27N_Sta1_DCM_B_16S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_B


In [32]:
child_data_18S = event_sample[event_sample['amplicon_sequenced'].str.contains('18S V9')].copy()
child_data_18S['parentEventID'] = child_data_18S['eventID']
child_data_18S['eventID'] = child_data_18S['eventID']+"_18S"
child_data_18S.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,amplicon_sequenced,parentEventID
0,GOMECC4_27N_Sta1_Deep_A_18S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_A
1,GOMECC4_27N_Sta1_Deep_B_18S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_B
2,GOMECC4_27N_Sta1_Deep_C_18S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_Deep_C
3,GOMECC4_27N_Sta1_DCM_A_18S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_A
4,GOMECC4_27N_Sta1_DCM_B_18S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,16S V4-V5 | 18S V9,GOMECC4_27N_Sta1_DCM_B


In [33]:
# this is your full event file
all_event_data = pd.concat([child_data_16S,child_data_18S],axis=0,ignore_index=True)

In [34]:
all_event_data = all_event_data.drop(columns=['amplicon_sequenced'])

In [35]:
all_event_data.tail()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID
939,GOMECC4_CAPECORAL_Sta141_DCM_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_B
940,GOMECC4_CAPECORAL_Sta141_DCM_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_C
941,GOMECC4_CAPECORAL_Sta141_Surface_A_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_A
942,GOMECC4_CAPECORAL_Sta141_Surface_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_B
943,GOMECC4_CAPECORAL_Sta141_Surface_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_C


Which terms are still missing from the event info?

In [36]:
for key in event_dict.keys():
    if event_dict[key]['AOML_file'] != params['sample_data']:
        print(key,event_dict[key])

datasetID {'AOML_term': 'project_id_external', 'AOML_file': 'study_data', 'DwC_definition': 'An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.'}
eventRemarks {'AOML_term': 'controls_used', 'AOML_file': 'analysis_data', 'DwC_definition': 'Comments or notes about the dwc:Event.'}


eventRemarks will be added later.

In [37]:
#datasetID
all_event_data['datasetID'] = data['study_data']['project_id_external'].values[0]

In [38]:
all_event_data.head()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
0,GOMECC4_27N_Sta1_Deep_A_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_Deep_A,noaa-aoml-gomecc4
1,GOMECC4_27N_Sta1_Deep_B_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_Deep_B,noaa-aoml-gomecc4
2,GOMECC4_27N_Sta1_Deep_C_16S,27N_Sta1,2021-09-14T11:00-04:00,618,618,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_Deep_C,noaa-aoml-gomecc4
3,GOMECC4_27N_Sta1_DCM_A_16S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
4,GOMECC4_27N_Sta1_DCM_B_16S,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_B,noaa-aoml-gomecc4


### Occurrence file  
Now get the occurrence info from the ASV tables, format it, then merge it with the event info.

In [39]:
# create a dictionary to hold both markers
occ = {}

#### 18S

In [40]:
asv_tables['18S V9'].iloc[0:10,0:15]

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,1516,0,0,6,0,0,0,4257,2005,0,14
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,962,316,548,19,10,0,0,0,613,561,434
2,5d4df37251121c08397c6fbc27b06175,GCTACTACCGATTGAGTGTTTTAGTGAGGTCCTCGGATTGCTTTCCTGGCGGTTAACGCTGCCTAGTTGGCGAAAAGACGACCAAACTGTAGCACTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Sinocalanus;Sinocalanus_sinensis;,0.9923,0,4,0,12,5,0,0,0,9,0,0
3,f863f671a575c6ab587e8de0190d3335,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCCAGGCGGGTCGCCCTGCCTGGTCTACGGGAAGACGACCAAACTGTAGTGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Paracalanus;Paracalanus_parvus;,0.998393,0,0,0,0,0,0,0,0,0,0,5
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,1164,2272,2208,2,0,0,0,0,0,0,0
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,287,414,335,195,228,298,252,349,175,102,216
6,d70494a723d85d66aa88d2d8a975aeec,GCTACTACCGATTGAATGGTTCCGTGAATTCTTGAGATCGGCGCGGGAACAACTGGCAACGGTTGATCCCGATTGCTGAGAACTTGTGTAAACGCGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta,0.992451,0,0,0,4,0,0,0,0,5,0,22
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,250,323,194,51,59,55,222,250,230,163,214
8,bbaaf7bb4e71c80de970677779e3bf3a,GCTACTACCGATTGAATGGTTTAGTGAGATCTTCGGATTGGCACAATCGCGGCCTAACGGAAGTGATGGTGCCGAAAAGTTGCTCAAACTTGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Cnidaria;Cnidaria_X;Hydrozoa;Sulculeolaria;Sulculeolaria_quadrivalvis;,0.864777,212,50,237,552,1278,480,0,0,26,24,21
9,7a8324bb4448b65f7adc73d70e5901da,GCTACTACCGATTGAACGTTTTAGTGAGGTATTTGGACTGGGCCTTTGGAGGATTCGTTCTCCAATGTTGCTCGGGAAGACTCCCAAACTTGAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Delibus;Delibus_sp.;,0.992088,0,0,0,15,0,0,0,0,0,0,0


##### format taxonomy

Open question: how can this step be automated, given that taxonomy strings may be formatted differently in each dataset?

In [41]:
#18S 
taxa_ranks_18S = ['domain','supergroup','division','subdivision','class','order','family','genus','species']

asv_tables['18S V9'][['domain','supergroup','division','subdivision','class','order','family','genus','species']] = ["","","","","","","","",""]
for index, row in asv_tables['18S V9'].iterrows():
    taxa = row['taxonomy'].split(";")
    for i in range(0,len(taxa)):
        if i < len(taxa_ranks_18S):
            asv_tables['18S V9'].loc[index,taxa_ranks_18S[i]] = taxa[i]
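
The row-by-row loop above works, but it can be slow on tens of thousands of ASVs. As a hedged alternative, the sketch below splits the taxonomy string in one vectorized step and takes the rank names and separator as parameters, which may help when another dataset uses a different taxonomy layout. `split_taxonomy` is a hypothetical helper, not part of this notebook or of pandas.

```python
import pandas as pd

def split_taxonomy(taxonomy, ranks, sep=';'):
    """Split a delimited taxonomy Series into one column per rank.

    Hypothetical helper: ranks beyond len(ranks) are discarded and
    missing ranks are left as NaN.
    """
    # Drop a trailing separator so it does not create an empty extra column
    parts = taxonomy.str.strip().str.rstrip(sep).str.split(sep, expand=True)
    # Force exactly len(ranks) columns (truncate extras, pad with NaN)
    parts = parts.reindex(columns=range(len(ranks)))
    parts.columns = ranks
    return parts

ranks_demo = ['domain', 'supergroup', 'division', 'subdivision',
              'class', 'order', 'family', 'genus', 'species']
taxonomy_demo = pd.Series([
    'Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;',
    'Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae',
])
parts_demo = split_taxonomy(taxonomy_demo, ranks_demo)
```

Unlike the loop, this sketch leaves unassigned ranks as missing values rather than empty strings, so the later replacement of `''` with NaN would be a no-op for those cells.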

    

In [42]:
# replace None with NA
asv_tables['18S V9'] = asv_tables['18S V9'].fillna(value=np.nan)
## Replace 'unknown', 'unassigned', etc. in species and taxonomy columns with NaN

asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace({'unassigned':np.nan,
                            'Unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan,
                               '':np.nan})
asv_tables['18S V9'].iloc[0:10,[0,1,2,3,4,5,6,-9,-8,-7,-6,-5,-4,-3,-2,-1]]

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,domain,supergroup,division,subdivision,class,order,family,genus,species
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,1516,0,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus_cristatus
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,962,316,548,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus_furcatus
2,5d4df37251121c08397c6fbc27b06175,GCTACTACCGATTGAGTGTTTTAGTGAGGTCCTCGGATTGCTTTCCTGGCGGTTAACGCTGCCTAGTTGGCGAAAAGACGACCAAACTGTAGCACTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Sinocalanus;Sinocalanus_sinensis;,0.9923,0,4,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Sinocalanus,Sinocalanus_sinensis
3,f863f671a575c6ab587e8de0190d3335,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCCAGGCGGGTCGCCCTGCCTGGTCTACGGGAAGACGACCAAACTGTAGTGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Paracalanus;Paracalanus_parvus;,0.998393,0,0,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Paracalanus,Paracalanus_parvus
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,1164,2272,2208,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus_sp.
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,287,414,335,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,
6,d70494a723d85d66aa88d2d8a975aeec,GCTACTACCGATTGAATGGTTCCGTGAATTCTTGAGATCGGCGCGGGAACAACTGGCAACGGTTGATCCCGATTGCTGAGAACTTGTGTAAACGCGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta,0.992451,0,0,0,Eukaryota,Obazoa,Opisthokonta,,,,,,
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,250,323,194,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,
8,bbaaf7bb4e71c80de970677779e3bf3a,GCTACTACCGATTGAATGGTTTAGTGAGATCTTCGGATTGGCACAATCGCGGCCTAACGGAAGTGATGGTGCCGAAAAGTTGCTCAAACTTGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Cnidaria;Cnidaria_X;Hydrozoa;Sulculeolaria;Sulculeolaria_quadrivalvis;,0.864777,212,50,237,Eukaryota,Obazoa,Opisthokonta,Metazoa,Cnidaria,Cnidaria_X,Hydrozoa,Sulculeolaria,Sulculeolaria_quadrivalvis
9,7a8324bb4448b65f7adc73d70e5901da,GCTACTACCGATTGAACGTTTTAGTGAGGTATTTGGACTGGGCCTTTGGAGGATTCGTTCTCCAATGTTGCTCGGGAAGACTCCCAAACTTGAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Delibus;Delibus_sp.;,0.992088,0,0,0,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Delibus,Delibus_sp.


In [43]:
# replace "_", "-" and "/" with a space; remove " sp." and " spp."
# raw strings (r'...') avoid invalid-escape warnings for the regex patterns

asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace('_',' ',regex=True)
asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace(r' sp\.','',regex=True)
asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace(r' spp\.','',regex=True)
asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace('-',' ',regex=True)
asv_tables['18S V9'][taxa_ranks_18S] = asv_tables['18S V9'][taxa_ranks_18S].replace('/',' ',regex=True)

In [44]:
asv_tables['18S V9'].shape


(24067, 513)

Now we change the ASV data from wide to long format, renaming the read counts to `organismQuantity` and the sample names to `eventID`.

In [45]:
occ['18S V9'] = pd.melt(asv_tables['18S V9'],id_vars=['featureid','sequence','taxonomy','Confidence','domain','supergroup','division','subdivision','class','order','family','genus','species'],
               var_name='eventID',value_name='organismQuantity')

In [46]:
occ['18S V9'].shape

(12033500, 15)

In [47]:
## Drop records where organismQuantity = 0 (absences are not meaningful for submitting to OBIS)

occ['18S V9'] = occ['18S V9'][occ['18S V9']['organismQuantity'] > 0]
print(occ['18S V9'].shape)

(149182, 15)
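
The shapes above follow from the reshape: 24067 ASVs times 500 sample columns (513 columns minus the 13 `id_vars`) gives 12,033,500 long-format rows, of which only the non-zero counts are kept. A minimal sketch of the same two steps on a toy table, with made-up ASV and sample names:

```python
import pandas as pd

# Toy wide ASV table: one row per ASV, one column per sample (illustrative values)
asv_demo = pd.DataFrame({
    'featureid': ['asv1', 'asv2'],
    'sampleA': [10, 0],
    'sampleB': [0, 3],
})

# Wide -> long: every (ASV, sample) pair becomes its own row
long_demo = pd.melt(asv_demo, id_vars=['featureid'],
                    var_name='eventID', value_name='organismQuantity')
n_long = long_demo.shape[0]  # 2 ASVs x 2 samples

# Keep only the pairs where the ASV was actually detected
present_demo = long_demo[long_demo['organismQuantity'] > 0]
```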


In [48]:
occ['18S V9'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,supergroup,division,subdivision,class,order,family,genus,species,eventID,organismQuantity
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus cristatus,GOMECC4_27N_Sta1_DCM_A,1516
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus furcatus,GOMECC4_27N_Sta1_DCM_A,962
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus,GOMECC4_27N_Sta1_DCM_A,1164
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,,GOMECC4_27N_Sta1_DCM_A,287
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,,GOMECC4_27N_Sta1_DCM_A,250


Add an `occurrenceID` by combining the `eventID` with the ASV `featureid`, for example: GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb

In [49]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['18S V9']['occurrenceID'] = occ['18S V9']['featureid']
occ['18S V9']['occurrenceID'] = occ['18S V9']['eventID'] + '_occ' + occ['18S V9']['occurrenceID'].astype(str)
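
Because the identifier is meant to be unique per ASV per water sample, it may be worth checking that after building it. The sketch below shows one way to do so on a toy table; the demo values are assumptions, and the same check could be pointed at `occ['18S V9']`.

```python
import pandas as pd

# Toy long-format occurrence table (illustrative values)
occ_demo = pd.DataFrame({
    'eventID': ['S1_A', 'S1_A', 'S1_B'],
    'featureid': ['asv1', 'asv2', 'asv1'],
})

# Same construction as above: sample name + '_occ' + ASV hash
occ_demo['occurrenceID'] = occ_demo['eventID'] + '_occ' + occ_demo['featureid'].astype(str)

# True when no (sample, ASV) pair appears twice
ids_unique = occ_demo['occurrenceID'].is_unique

# Rows sharing an ID, if any, for inspection
duplicate_rows = occ_demo[occ_demo['occurrenceID'].duplicated(keep=False)]
```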

In [50]:
occ['18S V9'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,supergroup,division,subdivision,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Neocalanus,Neocalanus cristatus,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Clausocalanus,Clausocalanus furcatus,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca
4,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,Eukaryota,Obazoa,Opisthokonta,Metazoa,Arthropoda,Crustacea,Maxillopoda,Acrocalanus,Acrocalanus,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75
5,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,,,,,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4
7,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,Eukaryota,TSAR,Alveolata,Dinoflagellata,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384


#### 16S

##### First, format the ASV file

In [51]:
asv_tables['16S V4-V5'].iloc[0:10,0:20]

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C,GOMECC4_27N_Sta4_Deep_A,GOMECC4_27N_Sta4_Deep_B,GOMECC4_27N_Sta4_Deep_C,GOMECC4_27N_Sta4_Surface_A,GOMECC4_27N_Sta4_Surface_B
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGGTTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAATTGCATTTGAAACTGGCAGACTAGAGTACTGTAGAGGGGGGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio,0.978926,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGGTATTTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATCCGAAACTATTTAACTAGAGTATGGGAGAGGTAAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus,0.881698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTTCTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAATACTGGAAGACTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Synechococcales; f__Cyanobiaceae; g__Cyanobium_PCC-6307,0.762793,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAAATAAGTCTGATGTGAAGGCCCAGGGCTCAACCCTGGAAGTGCATCGGAAACTGTTTGGCTCGAGTCCCGGAGGGGGTGGTGGAATTC...,d__Bacteria; p__Myxococcota; c__Myxococcia; o__Myxococcales; f__Myxococcaceae; g__P3OB-42; s__uncultured_bacterium,0.997619,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTTTTAAGTTGGATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATCCAAAACTAGATGACTAGAGTACGAAAGAGGGAAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__SAR86_clade; f__SAR86_clade; g__SAR86_clade,0.999961,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0003aeafc4bc0522877d4804829e65b7,CACCGGCATCTCGAGTGGTATCCACTTTTATTGGGCCTAAAGCATCCGTAGCCTGTTCTGTAAGTTTTCGGTTAAATCCATAAGCTCAACTTATGGGCTGCCGAAAATACTGCAGAACTAGGGAGTGGGAGAGGTAGACGGTACTC...,d__Archaea; p__Crenarchaeota; c__Nitrososphaeria; o__Nitrosopumilales; f__Nitrosopumilaceae; g__Candidatus_Nitrosopelagicus; s__marine_metagenome,0.779476,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0003b46ae196127658c07aeb11b36b1a,TACGGAGGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTAGCAAGTTGAATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATTCAAAACTACTAAGCTAGAGTACGAGAGAGGAGAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Thiomicrospirales; f__Thioglobaceae; g__SUP05_cluster,0.810269,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,00065c7f5701f8db77fc2c50a1204c71,TACGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAAAAATAAGTCTATTGTTAAATCCAGAAGCTTAACTTCTGTCAAGCGATATGAAACTATTCTTCTTGAGAATGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0006da4e1ff162826badd8bdcfaf9dfe,GACGGAGGATGCAAGTGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAGAAATAAGCCTATTGTTAAATCCAGGAGCTTAACTTCTGTCCAGCGATATGAAACTATTTTTCTTGAGGGTGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0006fe6033cca4da30ea5ce9cba446f0,TACGAATGCTGCAAGCGTAGTTCGGAATCACTGGGCATAAAGAGCACGTAGGCGGCCTATTAAGTCAGCTGTGAAATCCCTCGGCTTAACCGAGGAACTGCAGCTGATACTGATAGGCTTGAGTACGGGAGGGGAGAGCGGAATTC...,d__Bacteria; p__Planctomycetota; c__Pla3_lineage; o__Pla3_lineage; f__Pla3_lineage; g__Pla3_lineage; s__uncultured_Planctomycetaceae,0.949296,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [52]:
asv_tables['16S V4-V5']['taxonomy'][0]

'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio'

In [53]:
taxa_ranks_16S = ['domain','phylum','class','order','family','genus','species']


In [54]:
asv_tables['16S V4-V5'][['domain','phylum','class','order','family','genus','species']] = asv_tables['16S V4-V5']['taxonomy'].str.split("; ",expand=True)
asv_tables['16S V4-V5'].iloc[0:10,[0,1,2,3,4,5,6,-8,-7,-6,-5,-4,-3,-2,-1]]

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_YUCATAN_Sta102_Surface_C,domain,phylum,class,order,family,genus,species
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGGTTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAATTGCATTTGAAACTGGCAGACTAGAGTACTGTAGAGGGGGGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio,0.978926,0,0,0,7,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Vibrionales,f__Vibrionaceae,g__Vibrio,
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGGTATTTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATCCGAAACTATTTAACTAGAGTATGGGAGAGGTAAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus,0.881698,0,0,0,0,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__HOC36,f__HOC36,g__HOC36,s__Candidatus_Thioglobus
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTTCTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAATACTGGAAGACTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Synechococcales; f__Cyanobiaceae; g__Cyanobium_PCC-6307,0.762793,0,0,0,0,d__Bacteria,p__Cyanobacteria,c__Cyanobacteriia,o__Synechococcales,f__Cyanobiaceae,g__Cyanobium_PCC-6307,
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAAATAAGTCTGATGTGAAGGCCCAGGGCTCAACCCTGGAAGTGCATCGGAAACTGTTTGGCTCGAGTCCCGGAGGGGGTGGTGGAATTC...,d__Bacteria; p__Myxococcota; c__Myxococcia; o__Myxococcales; f__Myxococcaceae; g__P3OB-42; s__uncultured_bacterium,0.997619,0,0,0,0,d__Bacteria,p__Myxococcota,c__Myxococcia,o__Myxococcales,f__Myxococcaceae,g__P3OB-42,s__uncultured_bacterium
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTTTTAAGTTGGATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATCCAAAACTAGATGACTAGAGTACGAAAGAGGGAAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__SAR86_clade; f__SAR86_clade; g__SAR86_clade,0.999961,0,0,0,0,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__SAR86_clade,f__SAR86_clade,g__SAR86_clade,
5,0003aeafc4bc0522877d4804829e65b7,CACCGGCATCTCGAGTGGTATCCACTTTTATTGGGCCTAAAGCATCCGTAGCCTGTTCTGTAAGTTTTCGGTTAAATCCATAAGCTCAACTTATGGGCTGCCGAAAATACTGCAGAACTAGGGAGTGGGAGAGGTAGACGGTACTC...,d__Archaea; p__Crenarchaeota; c__Nitrososphaeria; o__Nitrosopumilales; f__Nitrosopumilaceae; g__Candidatus_Nitrosopelagicus; s__marine_metagenome,0.779476,0,0,0,0,d__Archaea,p__Crenarchaeota,c__Nitrososphaeria,o__Nitrosopumilales,f__Nitrosopumilaceae,g__Candidatus_Nitrosopelagicus,s__marine_metagenome
6,0003b46ae196127658c07aeb11b36b1a,TACGGAGGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTAGCAAGTTGAATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATTCAAAACTACTAAGCTAGAGTACGAGAGAGGAGAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Thiomicrospirales; f__Thioglobaceae; g__SUP05_cluster,0.810269,0,0,0,0,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Thiomicrospirales,f__Thioglobaceae,g__SUP05_cluster,
7,00065c7f5701f8db77fc2c50a1204c71,TACGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAAAAATAAGTCTATTGTTAAATCCAGAAGCTTAACTTCTGTCAAGCGATATGAAACTATTCTTCTTGAGAATGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,d__Bacteria,p__Cyanobacteria,c__Cyanobacteriia,o__Chloroplast,f__Chloroplast,g__Chloroplast,s__Prasinoderma_coloniale
8,0006da4e1ff162826badd8bdcfaf9dfe,GACGGAGGATGCAAGTGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAGAAATAAGCCTATTGTTAAATCCAGGAGCTTAACTTCTGTCCAGCGATATGAAACTATTTTTCTTGAGGGTGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,d__Bacteria,p__Cyanobacteria,c__Cyanobacteriia,o__Chloroplast,f__Chloroplast,g__Chloroplast,s__Prasinoderma_coloniale
9,0006fe6033cca4da30ea5ce9cba446f0,TACGAATGCTGCAAGCGTAGTTCGGAATCACTGGGCATAAAGAGCACGTAGGCGGCCTATTAAGTCAGCTGTGAAATCCCTCGGCTTAACCGAGGAACTGCAGCTGATACTGATAGGCTTGAGTACGGGAGGGGAGAGCGGAATTC...,d__Bacteria; p__Planctomycetota; c__Pla3_lineage; o__Pla3_lineage; f__Pla3_lineage; g__Pla3_lineage; s__uncultured_Planctomycetaceae,0.949296,0,0,0,0,d__Bacteria,p__Planctomycetota,c__Pla3_lineage,o__Pla3_lineage,f__Pla3_lineage,g__Pla3_lineage,s__uncultured_Planctomycetaceae


In [55]:
asv_tables['16S V4-V5']['domain'] = asv_tables['16S V4-V5']['domain'].str.replace("d__", "")
asv_tables['16S V4-V5']['phylum'] = asv_tables['16S V4-V5']['phylum'].str.replace("p__", "")
asv_tables['16S V4-V5']['class'] = asv_tables['16S V4-V5']['class'].str.replace("c__", "")
asv_tables['16S V4-V5']['order'] = asv_tables['16S V4-V5']['order'].str.replace("o__", "")
asv_tables['16S V4-V5']['family'] = asv_tables['16S V4-V5']['family'].str.replace("f__", "")
asv_tables['16S V4-V5']['genus'] = asv_tables['16S V4-V5']['genus'].str.replace("g__", "")
asv_tables['16S V4-V5']['species'] = asv_tables['16S V4-V5']['species'].str.replace("s__", "")
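
The seven prefix replacements above could also be written as a single pattern applied to every rank column. This is only a sketch on toy data; it assumes each prefix is one lowercase letter followed by two underscores at the start of the value, as in the taxonomy strings shown earlier.

```python
import pandas as pd

ranks_16S_demo = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

# Toy taxonomy strings copied from the table above
taxonomy_16S_demo = pd.Series([
    'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio',
    'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus',
])

# Split into one column per rank (the second row supplies all seven ranks)
ranks_df_demo = taxonomy_16S_demo.str.split('; ', expand=True)
ranks_df_demo.columns = ranks_16S_demo

# Strip a leading rank prefix such as "d__" or "s__" from every column at once
ranks_df_demo = ranks_df_demo.apply(
    lambda col: col.str.replace(r'^[a-z]__', '', regex=True)
)
```

Anchoring the pattern with `^` keeps underscores inside names such as `Candidatus_Thioglobus` untouched.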

In [56]:
# replace None with NA
asv_tables['16S V4-V5'] = asv_tables['16S V4-V5'].fillna(value=np.nan)
## Replace 'unknown', 'unassigned', etc. in species and taxonomy columns with NaN

asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace({'unassigned':np.nan,'Unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan,
                               '':np.nan})
asv_tables['16S V4-V5'].iloc[0:10,[0,1,2,3,4,5,6,-8,-7,-6,-5,-4,-3,-2,-1]]

Unnamed: 0,featureid,sequence,taxonomy,Confidence,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_YUCATAN_Sta102_Surface_C,domain,phylum,class,order,family,genus,species
0,00006f0784f7dbb2f162408abb6da629,TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGGTTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAATTGCATTTGAAACTGGCAGACTAGAGTACTGTAGAGGGGGGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales; f__Vibrionaceae; g__Vibrio,0.978926,0,0,0,7,Bacteria,Proteobacteria,Gammaproteobacteria,Vibrionales,Vibrionaceae,Vibrio,
1,000094731d4984ed41435a1bf65b7ef2,TACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGGTATTTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATCCGAAACTATTTAACTAGAGTATGGGAGAGGTAAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__HOC36; f__HOC36; g__HOC36; s__Candidatus_Thioglobus,0.881698,0,0,0,0,Bacteria,Proteobacteria,Gammaproteobacteria,HOC36,HOC36,HOC36,Candidatus_Thioglobus
2,0001a3c11fcef1b1b8f4c72942efbbac,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTTCTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAATACTGGAAGACTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Synechococcales; f__Cyanobiaceae; g__Cyanobium_PCC-6307,0.762793,0,0,0,0,Bacteria,Cyanobacteria,Cyanobacteriia,Synechococcales,Cyanobiaceae,Cyanobium_PCC-6307,
3,0001ceef5162e6d689ef30418cfcc164,TACAGAGGGTGCAAGCGTTGTTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAAATAAGTCTGATGTGAAGGCCCAGGGCTCAACCCTGGAAGTGCATCGGAAACTGTTTGGCTCGAGTCCCGGAGGGGGTGGTGGAATTC...,d__Bacteria; p__Myxococcota; c__Myxococcia; o__Myxococcales; f__Myxococcaceae; g__P3OB-42; s__uncultured_bacterium,0.997619,0,0,0,0,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,P3OB-42,uncultured_bacterium
4,000235534662df05bb30219a4b978dac,TACGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTTTTAAGTTGGATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATCCAAAACTAGATGACTAGAGTACGAAAGAGGGAAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__SAR86_clade; f__SAR86_clade; g__SAR86_clade,0.999961,0,0,0,0,Bacteria,Proteobacteria,Gammaproteobacteria,SAR86_clade,SAR86_clade,SAR86_clade,
5,0003aeafc4bc0522877d4804829e65b7,CACCGGCATCTCGAGTGGTATCCACTTTTATTGGGCCTAAAGCATCCGTAGCCTGTTCTGTAAGTTTTCGGTTAAATCCATAAGCTCAACTTATGGGCTGCCGAAAATACTGCAGAACTAGGGAGTGGGAGAGGTAGACGGTACTC...,d__Archaea; p__Crenarchaeota; c__Nitrososphaeria; o__Nitrosopumilales; f__Nitrosopumilaceae; g__Candidatus_Nitrosopelagicus; s__marine_metagenome,0.779476,0,0,0,0,Archaea,Crenarchaeota,Nitrososphaeria,Nitrosopumilales,Nitrosopumilaceae,Candidatus_Nitrosopelagicus,marine_metagenome
6,0003b46ae196127658c07aeb11b36b1a,TACGGAGGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTAGCAAGTTGAATGTGAAAGCCCTGGGCTCAACCTAGGAACTGCATTCAAAACTACTAAGCTAGAGTACGAGAGAGGAGAGTAGAATTT...,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Thiomicrospirales; f__Thioglobaceae; g__SUP05_cluster,0.810269,0,0,0,0,Bacteria,Proteobacteria,Gammaproteobacteria,Thiomicrospirales,Thioglobaceae,SUP05_cluster,
7,00065c7f5701f8db77fc2c50a1204c71,TACGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAAAAATAAGTCTATTGTTAAATCCAGAAGCTTAACTTCTGTCAAGCGATATGAAACTATTCTTCTTGAGAATGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,Bacteria,Cyanobacteria,Cyanobacteriia,Chloroplast,Chloroplast,Chloroplast,Prasinoderma_coloniale
8,0006da4e1ff162826badd8bdcfaf9dfe,GACGGAGGATGCAAGTGTTATCCGGAATTATTGGGCGTAAAGCGTTTGTAGGTGGAGAAATAAGCCTATTGTTAAATCCAGGAGCTTAACTTCTGTCCAGCGATATGAAACTATTTTTCTTGAGGGTGGTAGGGGTAGAAGGAATT...,d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__Prasinoderma_coloniale,1.0,0,0,0,0,Bacteria,Cyanobacteria,Cyanobacteriia,Chloroplast,Chloroplast,Chloroplast,Prasinoderma_coloniale
9,0006fe6033cca4da30ea5ce9cba446f0,TACGAATGCTGCAAGCGTAGTTCGGAATCACTGGGCATAAAGAGCACGTAGGCGGCCTATTAAGTCAGCTGTGAAATCCCTCGGCTTAACCGAGGAACTGCAGCTGATACTGATAGGCTTGAGTACGGGAGGGGAGAGCGGAATTC...,d__Bacteria; p__Planctomycetota; c__Pla3_lineage; o__Pla3_lineage; f__Pla3_lineage; g__Pla3_lineage; s__uncultured_Planctomycetaceae,0.949296,0,0,0,0,Bacteria,Planctomycetota,Pla3_lineage,Pla3_lineage,Pla3_lineage,Pla3_lineage,uncultured_Planctomycetaceae


In [57]:
# Replace '_', '-' and '/' with a space; remove ' sp.' and ' spp.'
# Raw strings (r'...') avoid invalid-escape warnings for the regex patterns

asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace('_',' ',regex=True)
asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace(r' sp\.','',regex=True)
asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace('-',' ',regex=True)
asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace(r' spp\.','',regex=True)
asv_tables['16S V4-V5'][taxa_ranks_16S] = asv_tables['16S V4-V5'][taxa_ranks_16S].replace('/',' ',regex=True)
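
To see what these substitutions do, here is a small sketch on a made-up table. `Cyanobium_PCC-6307` and `SAR86_clade` appear in the data above; `Vibrio_sp.` is a hypothetical value added only to exercise the ` sp.` rule.

```python
import pandas as pd

# Toy rank columns (illustration only, not the real ASV table)
toy = pd.DataFrame({
    "genus": ["Cyanobium_PCC-6307", "SAR86_clade"],
    "species": ["Vibrio_sp.", "uncultured_bacterium"],
})

# Same substitutions, in the same order, as the cleaning cell
substitutions = [(r"_", " "), (r" sp\.", ""), (r"-", " "), (r" spp\.", ""), (r"/", " ")]
for pattern, repl in substitutions:
    toy = toy.replace(pattern, repl, regex=True)

print(toy.values.tolist())
```

The order matters: underscores are turned into spaces first, so that `_sp.` becomes ` sp.` before the ` sp\.` pattern is applied.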

##### Melt asv_tables to long format


In [58]:
asv_tables['16S V4-V5'].shape


(65048, 504)

In [59]:
occ['16S V4-V5'] = pd.melt(asv_tables['16S V4-V5'],id_vars=['featureid','sequence','taxonomy','Confidence','domain','phylum','class','order','family','genus','species'],
               var_name='eventID',value_name='organismQuantity')

In [60]:
occ['16S V4-V5'].shape

(32068664, 13)

In [61]:
## Drop records where organismQuantity = 0 (absences are not meaningful for OBIS/GBIF)

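# .copy() makes the filtered table independent, avoiding SettingWithCopyWarning when columns are added later
occ['16S V4-V5'] = occ['16S V4-V5'][occ['16S V4-V5']['organismQuantity'] > 0].copy()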
print(occ['16S V4-V5'].shape)

(169470, 13)
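
The melt-then-filter step can be illustrated on a tiny made-up table. The sample names `S1` and `S2` and the ASV IDs are hypothetical, chosen only to keep the example readable.

```python
import pandas as pd

# Hypothetical 2-ASV x 2-sample wide table (toy data)
toy_wide = pd.DataFrame({
    "featureid": ["asv1", "asv2"],
    "taxonomy": ["d__Bacteria", "d__Archaea"],
    "S1": [5, 0],
    "S2": [0, 3],
})

# Wide -> long: one row per (ASV, sample) pair
toy_long = pd.melt(
    toy_wide,
    id_vars=["featureid", "taxonomy"],
    var_name="eventID",
    value_name="organismQuantity",
)

# Keep detections only; zero-read rows are dropped
toy_occ = toy_long[toy_long["organismQuantity"] > 0].copy()

print(toy_long.shape, toy_occ.shape)
```

Every column not listed in `id_vars` is treated as a sample column, which is why the long table has (number of ASVs) x (number of sample columns) rows before filtering.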


In [62]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['16S V4-V5']['occurrenceID'] = occ['16S V4-V5']['featureid']
occ['16S V4-V5']['occurrenceID'] = occ['16S V4-V5']['eventID'] + '_16S_occ' + occ['16S V4-V5']['occurrenceID'].astype(str)
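
A quick sanity check on this ID scheme, sketched on toy data (the event and feature IDs below are hypothetical): because each ASV appears at most once per sample, concatenating `eventID` and `featureid` should give a unique `occurrenceID`.

```python
import pandas as pd

# Toy long-format occurrences: the same ASV may occur in several samples
toy = pd.DataFrame({
    "eventID": ["S1", "S1", "S2"],
    "featureid": ["aaa", "bbb", "aaa"],
})

# Same pattern as the notebook: <eventID>_16S_occ<featureid>
toy["occurrenceID"] = toy["eventID"] + "_16S_occ" + toy["featureid"].astype(str)

# One ID per (sample, ASV) pair
assert toy["occurrenceID"].is_unique
print(toy["occurrenceID"].tolist())
```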

In [63]:
occ['16S V4-V5'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,phylum,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
182,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,Bacteria,Proteobacteria,Alphaproteobacteria,Parvibaculales,OCS116 clade,OCS116 clade,uncultured marine,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d
225,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,Bacteria,Proteobacteria,Alphaproteobacteria,Puniceispirillales,SAR116 clade,SAR116 clade,,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916
347,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,Archaea,Thermoplasmatota,Thermoplasmata,Marine Group II,Marine Group II,Marine Group II,,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3
412,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Mitochondria,Mitochondria,uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564
719,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,Bacteria,Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196


##### WoRMS conversion
Note: the standard `multiprocessing` library can't be used in a Jupyter notebook; use `multiprocess` instead. See [here](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror).

OBIS currently requires taxonomy assignments that match WoRMS; however, none of the commonly used metabarcoding reference databases use WoRMS as the basis of their taxonomy. This means the taxonomic ranks for any given scientific name on WoRMS may not correspond directly to the ranks assigned by the reference database. There are ongoing discussions about this problem (see [this](https://github.com/iobis/Project-team-Genetic-Data/issues/5) GitHub issue).

Many of these databases, especially for microbes, include taxa that aren't on WoRMS at all, because the name may not have been fully and officially adopted by the scientific community (or at least not adopted by WoRMS yet). We therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the `scientificName` column. The originally assigned taxonomy is then recorded in `verbatimIdentification`.
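
The fallback idea can be sketched as follows. This is not the implementation in `WoRMS_matching`; the `KNOWN_NAMES` dictionary is a hypothetical stand-in for a real WoRMS lookup, and `lowest_matching_rank` is an illustrative name.

```python
# Hypothetical stand-in for a WoRMS query: names the register "knows" (toy data)
KNOWN_NAMES = {"Gammaproteobacteria": "Class", "Bacteria": "Kingdom"}

def lowest_matching_rank(row, ordered_ranks, known=KNOWN_NAMES):
    """Return (name, rank) for the first rank in ordered_ranks whose name is known."""
    for rank in ordered_ranks:
        name = row.get(rank)
        if name and name in known:
            return name, rank
    return "No match", None

# Assigned taxonomy for one ASV, after prefix stripping and name cleaning
row = {
    "domain": "Bacteria", "phylum": "Proteobacteria", "class": "Gammaproteobacteria",
    "order": "Cellvibrionales", "family": "Halieaceae", "genus": "OM60(NOR5) clade",
}

# Search from the lowest rank upward
ranks = ["genus", "family", "order", "class", "phylum", "domain"]
print(lowest_matching_rank(row, ranks))
```

In this toy case the genus, family, and order are unknown to the stand-in lookup, so the match falls back to the class.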

There were some [issues with the parallelization](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr) on the Mac M1. Adding `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to `.bash_profile` and then applying [this workaround](https://github.com/python/cpython/issues/74570) fixed it.  
TODO: try running without the `.bash_profile` fix.

In [64]:
os.environ["no_proxy"]="*"

### 16S WoRMS

Species-level IDs may be unreliable ([see here](https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494)), so matching starts at genus and works upward.

In [71]:
import WoRMS_matching

In [72]:
import importlib
importlib.reload(WoRMS_matching)

<module 'WoRMS_matching' from '/Users/katherine.silliman/Projects/NOAA/DMG/edna2obis/src/WoRMS_matching.py'>

In [73]:
tax_16S = asv_tables['16S V4-V5'][['taxonomy','domain','phylum','class','order','family','genus','species']]

In [74]:
#ignore_index is important!
tax_16S = tax_16S.drop_duplicates(ignore_index=True)

In [75]:
tax_16S.shape

(2729, 8)
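
A small sketch of what `ignore_index=True` changes, on toy data. The notebook does not state why it matters; presumably downstream code relies on a contiguous 0..n-1 index, which is an assumption here.

```python
import pandas as pd

# Toy taxonomy column with repeated values
toy_tax = pd.DataFrame({"taxonomy": ["a", "a", "b", "a", "c"]})

kept = toy_tax.drop_duplicates()                    # keeps the original row labels
reset = toy_tax.drop_duplicates(ignore_index=True)  # relabels rows 0..n-1

print(kept.index.tolist())
print(reset.index.tolist())
```

Without `ignore_index=True`, the surviving rows keep the gaps left by the dropped duplicates, so label-based and position-based access no longer agree.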

In [76]:
if __name__ == '__main__':
    # Ranks are searched from genus upward; species is deliberately left out (see note above)
    worms_16s = WoRMS_matching.get_worms_from_scientific_name_parallel(
        tax_df=tax_16S,
        ordered_rank_columns=['genus','family','order','class','phylum','domain'],
        full_tax_column="taxonomy",
        full_tax_vI=True,
        n_proc=7)

Litoribacillus: No match, genus
BD1 7 clade: No match, genus
uncultured: No match, genus
uncultured: No match, family
uncultured: No match, order
uncultured: No match, genus
OM60(NOR5) clade: No match, genus
uncultured: No match, class
Halieaceae: No match, family
Spongiibacteraceae: No match, family
HOC36: No match, genus
HOC36: No match, family
HOC36: No match, order
Desulfobacterota: No match, phylum
Candidatus Omnitrophus: No match, genus
Cellvibrionales: No match, order
Cellvibrionales: No match, order
Aliikangiella: No match, genus
Schekmanbacteria: No match, genus
Schekmanbacteria: No match, family
Schekmanbacteria: No match, order
Schekmanbacteria: No match, class
Schekmanbacteria: No match, phylum
Omnitrophaceae: No match, family
Kangiellaceae: No match, family
Omnitrophales: No match, order
Cyanobium PCC 6307: No match, genus
Chitinophagales: No match, order
Phaeodactylibacter: No match, genus
P.palmC41: No match, genus
P.palmC41: No match, family
P.palmC41: No match, order
O

In [77]:
worms_16s.head()

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,class,Gammaproteobacteria,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
1,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,class,Bacteroidia,Bacteroidia,urn:lsid:marinespecies.org:taxname:559846,Bacteria,Bacteroidetes,Bacteroidia,,,,Class
2,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,domain,Bacteria,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
3,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,order,Rhodospirillales,Rhodospirillales,urn:lsid:marinespecies.org:taxname:392751,Bacteria,Proteobacteria,Alphaproteobacteria,Rhodospirillales,,,Order
4,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,genus,Sphingobium,Sphingobium,urn:lsid:marinespecies.org:taxname:571470,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Genus


In [78]:
worms_16s[worms_16s["scientificName"]=="No match"]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
242,d__Eukaryota,d__Eukaryota,domain,Eukaryota,No match,,,,,,,,


In [79]:
worms_16s.loc[worms_16s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_16s.loc[worms_16s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [80]:
worms_16s[worms_16s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
97,Unassigned,Unassigned,,,,,,,,,,,


In [81]:

# Unassigned ASVs have no scientificName; label them 'incertae sedis'
print(worms_16s[worms_16s['scientificName'].isna()].shape)
worms_16s.loc[worms_16s['scientificName'].isna(),'scientificName'] = 'incertae sedis'
worms_16s.loc[worms_16s['scientificName'] == 'incertae sedis','scientificNameID'] =  'urn:lsid:marinespecies.org:taxname:12'
print(worms_16s[worms_16s['scientificName'].isna()].shape)

(1, 13)
(0, 13)
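
The two fallback assignments above ("No match" to Biota, missing names to incertae sedis) can be sketched together on a toy table. The three rows below are illustrative; the LSIDs are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy matching result: one real match, one "No match", one unassigned row
toy = pd.DataFrame({
    "scientificName": ["Bacteria", "No match", np.nan],
    "scientificNameID": ["urn:lsid:marinespecies.org:taxname:6", np.nan, np.nan],
})

# Compute both masks first so the second rule cannot be affected by the first
no_match = toy["scientificName"] == "No match"
missing = toy["scientificName"].isna()

toy.loc[no_match, "scientificName"] = "Biota"
toy.loc[no_match, "scientificNameID"] = "urn:lsid:marinespecies.org:taxname:1"
toy.loc[missing, "scientificName"] = "incertae sedis"
toy.loc[missing, "scientificNameID"] = "urn:lsid:marinespecies.org:taxname:12"

print(toy)
```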


In [82]:
worms_16s.to_csv("../processed/worms_16S_matching.tsv",sep="\t",index=False)

In [83]:
worms_16s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_16s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
1,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,Bacteroidia,urn:lsid:marinespecies.org:taxname:559846,Bacteria,Bacteroidetes,Bacteroidia,,,,Class
2,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
3,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,Rhodospirillales,urn:lsid:marinespecies.org:taxname:392751,Bacteria,Proteobacteria,Alphaproteobacteria,Rhodospirillales,,,Order
4,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,Sphingobium,urn:lsid:marinespecies.org:taxname:571470,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Genus


In [84]:
occ['16S V4-V5'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,phylum,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
182,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,Bacteria,Proteobacteria,Alphaproteobacteria,Parvibaculales,OCS116 clade,OCS116 clade,uncultured marine,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d
225,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,Bacteria,Proteobacteria,Alphaproteobacteria,Puniceispirillales,SAR116 clade,SAR116 clade,,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916
347,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,Archaea,Thermoplasmatota,Thermoplasmata,Marine Group II,Marine Group II,Marine Group II,,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3
412,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Mitochondria,Mitochondria,uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564
719,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,Bacteria,Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196


#### Merge occurrence and WoRMS tables

In [85]:
occ['16S V4-V5'].shape

(169470, 14)

In [86]:

occ16_test = occ['16S V4-V5'].copy()
occ16_test.drop(columns=['domain','phylum','class','order','family','genus','species'],inplace=True)

occ16_test = occ16_test.merge(worms_16s, how='left', left_on ='taxonomy', right_on='full_tax')
occ16_test.drop(columns='full_tax', inplace=True)
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
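
One way to guard a merge like this is with the `validate` and `indicator` options of `DataFrame.merge`. The sketch below uses toy tables with hypothetical IDs; it is a suggestion, not part of the original workflow.

```python
import pandas as pd

# Toy occurrence and matching tables
toy_occ = pd.DataFrame({"featureid": ["a1", "a2", "a3"],
                        "taxonomy": ["t1", "t1", "t2"]})
toy_worms = pd.DataFrame({"full_tax": ["t1", "t2"],
                          "scientificName": ["Bacteria", "Archaea"]})

merged = toy_occ.merge(
    toy_worms, how="left", left_on="taxonomy", right_on="full_tax",
    validate="many_to_one",  # raises MergeError if full_tax is not unique
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left merge against unique keys must not change the row count
assert len(merged) == len(toy_occ)

# Occurrences whose taxonomy string found no match
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

If `full_tax` contained duplicates, a plain left merge would silently multiply occurrence rows; `validate="many_to_one"` turns that into an error.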


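#### identificationRemarks  

The `identificationRemarks` string combines the classification method, the classifier confidence, and the reference database (the `taxa_class_method` and `taxa_ref_db` columns are added to `occ16_test` in the cells below):

```
occ16_test['identificationRemarks'] = occ16_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ16_test['Confidence'].astype(str) +", against reference database: "+occ16_test['taxa_ref_db']
```

Example result:

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695'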

In [88]:
data['analysis_data'].head()

Unnamed: 0,amplicon_sequenced,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S V4-V5,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 12 2nd PCR no-template controls | 3 Zymo mock community
1,18S V9,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database.org/ | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 7 2nd PCR no-template controls


In [89]:
occ16_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '16S V4-V5','taxa_class_method'].item()
occ16_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '16S V4-V5','taxa_ref_db'].item()

occ16_test['identificationRemarks'] = occ16_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ16_test['Confidence'].astype(str) +", against reference database: "+occ16_test['taxa_ref_db']

In [90]:
occ16_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695'

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [91]:
occ16_test['taxonID'] = 'ASV:'+occ16_test['featureid']
occ16_test['basisOfRecord'] = 'MaterialSample'
occ16_test['nameAccordingTo'] = "WoRMS"
occ16_test['organismQuantityType'] = "DNA sequence reads"
occ16_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [92]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_meth',
       'nucl_acid_ext', 'amplicon_sequenced', 'target_gene',
       'target_subfragment', 'pcr_primer_forward', 'pcr_primer_reverse',
       'pcr_primer_name_forward', 'pcr_primer_name_reverse',
       'pcr_primer_reference', 'pcr_cond', 'nucl_acid_amp', 'adapters',
       'mid_barcode'],
      dtype='object')

In [93]:
occ16_test = occ16_test.merge(data['prep_data'].loc[data['prep_data']['amplicon_sequenced'] == '16S V4-V5',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

In [94]:
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,taxa_class_method,taxa_ref_db,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,sample_name,sra_accession,biosample_accession
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094


#### eventID

In [95]:
occ16_test['eventID'] = occ16_test['eventID']+"_16S"

#### sampleSize 

In [97]:
# get sampleSize by total number of reads per sample
x = asv_tables['16S V4-V5'].sum(numeric_only=True).astype('int')
x.index = x.index+"_16S"
occ16_test['sampleSizeValue'] = occ16_test['eventID'].map(x).astype('str')
occ16_test['sampleSizeUnit'] = 'DNA sequence reads'
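
The cell above maps per-sample read totals onto each occurrence row. As a minimal sketch of that pattern, the toy example below uses made-up sample names and counts (`sampleA`, `sampleB` are hypothetical, not GOMECC4 samples) to show how the column sums of an ASV table become `sampleSizeValue`:

```python
import pandas as pd

# Toy ASV table: rows are ASVs, columns are samples (hypothetical names and counts)
asv = pd.DataFrame({
    "featureid": ["asv1", "asv2", "asv3"],
    "sampleA": [10, 0, 5],
    "sampleB": [2, 3, 0],
})

# Total reads per sample; numeric_only skips the featureid column
reads_per_sample = asv.sum(numeric_only=True).astype("int")
# Match the marker suffix that was appended to eventID
reads_per_sample.index = reads_per_sample.index + "_16S"

# Toy occurrence table keyed by eventID
occ = pd.DataFrame({"eventID": ["sampleA_16S", "sampleA_16S", "sampleB_16S"]})
occ["sampleSizeValue"] = occ["eventID"].map(reads_per_sample).astype("str")
```

Any eventID missing from the totals would map to NaN and be written as the string `'nan'`, so it may be worth checking for that before export.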

In [98]:
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,taxa_class_method,taxa_ref_db,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,sample_name,sra_accession,biosample_accession,sampleSizeValue,sampleSizeUnit
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads


In [99]:
# drop unneeded columns
occ16_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

#### associatedSequences  
https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887898

In [None]:
occ16_test['associatedSequences'] = "https://www.ncbi.nlm.nih.gov/sra/"+occ16_test['sra_accession']+' | '+ "https://www.ncbi.nlm.nih.gov/biosample/"+occ16_test['biosample_accession']+' | '+"https://www.ncbi.nlm.nih.gov/bioproject/"+data['study_data']['bioproject_accession'].values[0]
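
As an illustration of the string assembly above, here is a small sketch on a one-row toy table. The helper name `build_associated_sequences` is hypothetical; the accessions are the example values shown under the heading.

```python
import pandas as pd

NCBI = "https://www.ncbi.nlm.nih.gov"

def build_associated_sequences(df: pd.DataFrame, bioproject: str) -> pd.Series:
    """Join SRA, BioSample and BioProject URLs with ' | ' (hypothetical helper)."""
    sra = NCBI + "/sra/" + df["sra_accession"]
    biosample = NCBI + "/biosample/" + df["biosample_accession"]
    return sra + " | " + biosample + " | " + NCBI + "/bioproject/" + bioproject

toy = pd.DataFrame({
    "sra_accession": ["SRR26148187"],
    "biosample_accession": ["SAMN37516094"],
})
toy["associatedSequences"] = build_associated_sequences(toy, "PRJNA887898")
```

Note that a missing accession (NaN) in either column makes the whole concatenated value NaN for that row.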

In [103]:
occ16_test.rename(columns={'biosample_accession': 'materialSampleID',
                           'sequence': 'DNA_sequence'}, inplace=True)

In [104]:
# drop unneeded columns
occ16_test.drop(columns=['sra_accession'],inplace=True)

In [105]:
occ16_test.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'sampleSizeValue', 'sampleSizeUnit', 'associatedSequences'],
      dtype='object')

In [106]:
occ16_test.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences
0,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
1,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
2,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
3,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
4,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...


### merge event and occurrence

In [107]:
all_event_data.tail()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
939,GOMECC4_CAPECORAL_Sta141_DCM_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_B,noaa-aoml-gomecc4
940,GOMECC4_CAPECORAL_Sta141_DCM_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_C,noaa-aoml-gomecc4
941,GOMECC4_CAPECORAL_Sta141_Surface_A_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_A,noaa-aoml-gomecc4
942,GOMECC4_CAPECORAL_Sta141_Surface_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_B,noaa-aoml-gomecc4
943,GOMECC4_CAPECORAL_Sta141_Surface_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_C,noaa-aoml-gomecc4


In [108]:
occ16_merged = occ16_test.merge(all_event_data,how='left',on='eventID')
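
A left merge silently leaves event columns empty when an eventID has no counterpart in the event table. One way to check for that, sketched here on toy data with hypothetical IDs, is to use the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy occurrence and event tables (hypothetical IDs)
occ = pd.DataFrame({
    "eventID": ["s1_16S", "s1_16S", "s2_16S"],
    "organismQuantity": [18, 36, 2],
})
events = pd.DataFrame({
    "eventID": ["s1_16S", "s3_16S"],
    "locationID": ["Sta1", "Sta3"],
})

# validate raises if eventID is duplicated in the event table;
# indicator adds a _merge column recording where each row was found
merged = occ.merge(events, how="left", on="eventID",
                   validate="many_to_one", indicator=True)

# Occurrence rows whose eventID was not present in the event table
unmatched = merged.loc[merged["_merge"] == "left_only", "eventID"].unique().tolist()
merged = merged.drop(columns="_merge")
```

An empty `unmatched` list means every occurrence row picked up its event data.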

In [109]:
occ16_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
0,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
1,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
2,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
3,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
4,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4


In [110]:
occ16_merged.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence_16S.tsv",sep="\t",index=False)

### 18S worms

The PR2 18S database provides WoRMS IDs (AphiaIDs) for species that are in WoRMS. We will read in that file, assign the known WoRMS IDs, then search WoRMS for the taxa that remain unannotated.

In [111]:
pr2_18S = pd.read_excel("../../../databases/18S_PR2/pr2_v5.0.0_SSU/pr2_version_5.0.0_taxonomy.xlsx",
    index_col=None, na_values=[""])
pr2_18S = pr2_18S.dropna(subset=['worms_id'])
pr2_18S['worms_id'] = pr2_18S['worms_id'].astype('int').astype('str')
pr2_18S['species'] = pr2_18S['species'].replace('_',' ',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace(r' sp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace(r' spp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('-',' ',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('/',' ',regex=True)
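
To make the effect of the name clean-up above easier to see, the sketch below applies the same substitutions, in the same order, to three PR2-style names taken from this notebook's outputs. The function name `clean_species_names` is hypothetical, and it uses `Series.str.replace`, which should behave the same as the cell above for string values.

```python
import pandas as pd

def clean_species_names(names: pd.Series) -> pd.Series:
    """Apply the PR2 species-name substitutions in order (hypothetical helper)."""
    cleaned = names.str.replace("_", " ", regex=False)       # underscores to spaces
    cleaned = cleaned.str.replace(r" sp\.", "", regex=True)   # drop trailing ' sp.'
    cleaned = cleaned.str.replace(r" spp\.", "", regex=True)  # drop trailing ' spp.'
    cleaned = cleaned.str.replace("-", " ", regex=False)      # hyphens to spaces
    cleaned = cleaned.str.replace("/", " ", regex=False)      # slashes to spaces
    return cleaned

examples = pd.Series(["Paraphysomonas_sp.", "Gonyaulax_polygramma", "MAST-2C_sp."])
cleaned_examples = clean_species_names(examples)
```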

In [112]:
pr2_18S_dict = dict(zip(pr2_18S.species,pr2_18S.worms_id))


In [113]:
(pr2_18S_dict['Aphanocapsa feldmannii'])

'614894'
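
The two-step idea (use a known ID where the dictionary has one, otherwise fall back to a name search) can be illustrated on toy data. This is only a sketch of the concept and does not reproduce the internals of `WoRMS_matching`; the `known_ids` dictionary is a one-entry stand-in for `pr2_18S_dict`.

```python
import pandas as pd

# One-entry stand-in for pr2_18S_dict: cleaned species name -> AphiaID string
known_ids = {"Aphanocapsa feldmannii": "614894"}

# Toy taxa table; the second name has no entry in the dictionary
taxa = pd.DataFrame({"species": ["Aphanocapsa feldmannii", "Nibbleromonas"]})

# Step 1: assign IDs that are already known
taxa["aphia_id"] = taxa["species"].map(known_ids)

# Step 2: names without a known ID still need a name-based WoRMS search
needs_search = taxa.loc[taxa["aphia_id"].isna(), "species"].tolist()
```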

#### code to get record from aphia id

Parallelization initially crashed on the Mac M1 (see [this Stack Overflow thread](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr)). Adding `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to `.bash_profile` and then applying the workaround from [this CPython issue](https://github.com/python/cpython/issues/74570) fixed it.  
TODO: try running without the `.bash_profile` fix.

In [114]:
os.environ["no_proxy"]="*"

In [115]:
tax_18S = asv_tables['18S V9'][['taxonomy','domain','supergroup','division','subdivision','class','order','family','genus','species']]

In [116]:
tax_18S = tax_18S.drop_duplicates(ignore_index=True)
tax_18S.shape

(1374, 10)

In [117]:
if __name__ == '__main__':
    worms_18s = WoRMS_matching.get_worms_from_aphiaid_or_name_parallel(
        tax_df=tax_18S,
        worms_dict=pr2_18S_dict,
        ordered_rank_columns=['species','genus','family','order','class','subdivision','division','supergroup'],
        full_tax_column="taxonomy",
        full_tax_vI=True,
        n_proc=6)

Aspergillus penicillioides: No match, species
Protoscenium cf intricatum: No match, species
Euglypha acanthophora: No match, species
RAD B X Group IVe X: No match, species
RAD B X Group IVe X: No match, genus
Eimeriida: No match, order
RAD B X Group IVe: No match, family
Coccidiomorphea: No match, class
MAST 12A: No match, species
MAST 12A: No match, genus
Nibbleromonas: No match, genus
RAD B X: No match, order
MAST 12: No match, family
Nibbleridae: No match, family
RAD B: No match, class
Opalozoa X: No match, order
Nibbleridida: No match, order
Malus x: No match, species
Nibbleridea: No match, class
Euduboscquella cachoni: No match, species
Malus: No match, genus
Obazoa: No match, supergroup
Nibbleridia X: No match, subdivision
Pectinoida: No match, family
Embryophyceae XX: No match, family
Nibbleridia: No match, division
Embryophyceae X: No match, order
Provora: No match, supergroup
Skeletonema menzellii: No match, species
Dictyochales X: No match, species
Dictyochales X: No match, g

In [132]:
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Paraphysomonas,urn:lsid:marinespecies.org:taxname:291417,Chromista,Ochrophyta,Chrysophyceae,Chromulinales,Paraphysomonadaceae,Paraphysomonas,Genus
1,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Gonyaulax polygramma,urn:lsid:marinespecies.org:taxname:110035,Chromista,Myzozoa,Dinophyceae,Gonyaulacales,Gonyaulacaceae,Gonyaulax,Species
2,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Mamiellaceae,urn:lsid:marinespecies.org:taxname:17663,Plantae,Chlorophyta,Mamiellophyceae,Mamiellales,Mamiellaceae,,Family
3,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Blastodinium galatheanum,urn:lsid:marinespecies.org:taxname:623673,Chromista,Myzozoa,Dinophyceae,Blastodiniales,Blastodinidae,Blastodinium,Species
4,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Apicomplexa,urn:lsid:marinespecies.org:taxname:22565,Chromista,Myzozoa,,,,,Subphylum


In [None]:
# which taxa had absolutely no matches
worms_18s[worms_18s["scientificName"]=="No match"]['old name'].unique()

In [120]:
worms_18s[worms_18s["scientificName"]=="No match"].head(20)

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
18,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Haliphthorales;Haliphthorales_X;Haliphthorales_X_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Haliphthorales;Haliphthorales_X;Haliphthorales_X_sp.;,supergroup,TSAR,No match,,,,,,,,
19,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Peronosporomycetes_XX;Peronosporomycetes_XXX;Peronosporomycetes_XXX_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Peronosporomycetes_XX;Peronosporomycetes_XXX;Peronosporomycetes_XXX_sp.;,supergroup,TSAR,No match,,,,,,,,
28,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X,supergroup,TSAR,No match,,,,,,,,
91,Eukaryota;TSAR;Stramenopiles;Stramenopiles_X;Stramenopiles_X-Group-7;Stramenopiles_X-Group-7_X;Stramenopiles_X-Group-7_XX;Stramenopiles_X-Group-7_...,Eukaryota;TSAR;Stramenopiles;Stramenopiles_X;Stramenopiles_X-Group-7;Stramenopiles_X-Group-7_X;Stramenopiles_X-Group-7_XX;Stramenopiles_X-Group-7_...,supergroup,TSAR,No match,,,,,,,,
116,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae;Mediophyceae_X;Mediophyceae_XX;Mediophyceae_XXX;Mediophyceae_XXX_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae;Mediophyceae_X;Mediophyceae_XX;Mediophyceae_XXX;Mediophyceae_XXX_sp.;,supergroup,TSAR,No match,,,,,,,,
125,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2,supergroup,TSAR,No match,,,,,,,,
131,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2C;MAST-2C_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2C;MAST-2C_sp.;,supergroup,TSAR,No match,,,,,,,,
185,Eukaryota;Haptista;Centroplasthelida;Centroplasthelida_X,Eukaryota;Haptista;Centroplasthelida;Centroplasthelida_X,supergroup,Haptista,No match,,,,,,,,
187,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2_X;MAST-2_X_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2_X;MAST-2_X_sp.;,supergroup,TSAR,No match,,,,,,,,
189,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2B;MAST-2B_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2B;MAST-2B_sp.;,supergroup,TSAR,No match,,,,,,,,


In [121]:
worms_18s.loc[worms_18s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [122]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
120,Eukaryota;Haptista,Eukaryota;Haptista,supergroup,Haptista,,,,,,,,,
77,Eukaryota,Eukaryota,,,,,,,,,,,
109,Unassigned,Unassigned,,,,,,,,,,,
189,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,,aphiaID,,,,,,,,,species


In [123]:
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [124]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
109,Unassigned,Unassigned,,,,,,,,,,,
189,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,,aphiaID,,,,,,,,,species


In [125]:

print(worms_18s[worms_18s['scientificName'].isna()].shape)
worms_18s.loc[worms_18s['scientificName'].isna(),'scientificName'] = 'incertae sedis'
worms_18s.loc[worms_18s['scientificName'] == 'incertae sedis','scientificNameID'] =  'urn:lsid:marinespecies.org:taxname:12'
print(worms_18s[worms_18s['scientificName'].isna()].shape)

(2, 13)
(0, 13)
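
The fallback assignments in the preceding cells ("No match" to Biota, missing names to incertae sedis) follow one pattern: select rows with a boolean mask, then set the name and its LSID together. A minimal sketch of that pattern on toy data follows; `assign_fallback` is a hypothetical helper, not part of this notebook or of `WoRMS_matching`.

```python
import pandas as pd

def assign_fallback(df: pd.DataFrame, mask: pd.Series, name: str, lsid: str) -> pd.DataFrame:
    """Set scientificName and scientificNameID on the masked rows (hypothetical helper)."""
    df.loc[mask, "scientificName"] = name
    df.loc[mask, "scientificNameID"] = lsid
    return df

# Toy matching result: one 'No match', one missing name, one resolved taxon
toy = pd.DataFrame({
    "full_tax": ["Eukaryota;TSAR", "Unassigned", "d__Bacteria"],
    "scientificName": ["No match", None, "Bacteria"],
    "scientificNameID": [None, None, "urn:lsid:marinespecies.org:taxname:6"],
})

toy = assign_fallback(toy, toy["scientificName"] == "No match",
                      "Biota", "urn:lsid:marinespecies.org:taxname:1")
toy = assign_fallback(toy, toy["scientificName"].isna(),
                      "incertae sedis", "urn:lsid:marinespecies.org:taxname:12")
```

Unlike the cells above, this sketch sets the ID using the same mask as the name, so rows that already carried the fallback name are left untouched.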


In [126]:
worms_18s[worms_18s["old name"]=="aphiaID"].shape

(332, 13)

In [127]:
worms_18s.to_csv("../processed/worms_18S_matching.tsv",sep="\t",index=False)

In [128]:
worms_18s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Paraphysomonas,urn:lsid:marinespecies.org:taxname:291417,Chromista,Ochrophyta,Chrysophyceae,Chromulinales,Paraphysomonadaceae,Paraphysomonas,Genus
1,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Gonyaulax polygramma,urn:lsid:marinespecies.org:taxname:110035,Chromista,Myzozoa,Dinophyceae,Gonyaulacales,Gonyaulacaceae,Gonyaulax,Species
2,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Mamiellaceae,urn:lsid:marinespecies.org:taxname:17663,Plantae,Chlorophyta,Mamiellophyceae,Mamiellales,Mamiellaceae,,Family
3,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Blastodinium galatheanum,urn:lsid:marinespecies.org:taxname:623673,Chromista,Myzozoa,Dinophyceae,Blastodiniales,Blastodinidae,Blastodinium,Species
4,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Apicomplexa,urn:lsid:marinespecies.org:taxname:22565,Chromista,Myzozoa,,,,,Subphylum


#### Merge Occurrence and worms

In [129]:
occ['18S V9'].shape

(149182, 16)

In [130]:
# Replace the PR2 rank columns with the WoRMS-matched taxonomy, joined on the full taxonomy string
occ18_test = occ['18S V9'].copy()
occ18_test.drop(columns=['domain','supergroup','division','subdivision','class','order','family','genus','species'],inplace=True)
#occ18_test.drop(columns=['old name'],inplace=True)

occ18_test = occ18_test.merge(worms_18s, how='left', left_on ='taxonomy', right_on='full_tax')
occ18_test.drop(columns='full_tax', inplace=True)
occ18_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species
2,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus
3,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class
4,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family
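
A left merge keeps every occurrence row even when its taxonomy string has no WoRMS match, so it can be worth counting the rows that came back unmatched. The sketch below is not part of the original workflow: `occ_toy` and `worms_toy` are hypothetical stand-ins for `occ['18S V9']` and `worms_18s`, with invented values, and it relies on the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for occ['18S V9'] and worms_18s (all values are invented)
occ_toy = pd.DataFrame({
    "featureid": ["a1", "b2", "c3"],
    "taxonomy": ["Eukaryota;TSAR", "Eukaryota;Obazoa", "Eukaryota;Unknown"],
})
worms_toy = pd.DataFrame({
    "full_tax": ["Eukaryota;TSAR", "Eukaryota;Obazoa"],
    "scientificName": ["Chromista", "Animalia"],
})

# validate="many_to_one" raises if full_tax is duplicated in worms_toy,
# which would otherwise silently multiply occurrence rows.
merged = occ_toy.merge(
    worms_toy,
    how="left",
    left_on="taxonomy",
    right_on="full_tax",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" had no match in the WoRMS table
unmatched = merged.loc[merged["_merge"] == "left_only", "taxonomy"].tolist()
print(len(merged), unmatched)
```

If `unmatched` is non-empty in a real run, those taxonomy strings would end up with an empty `scientificName` in the occurrence file.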


#### identificationRemarks

In [131]:
data['analysis_data'].head()

Unnamed: 0,amplicon_sequenced,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S V4-V5,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 12 2nd PCR no-template controls | 3 Zymo mock community
1,18S V9,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database.org/ | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 7 2nd PCR no-template controls


In [134]:
occ18_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '18S V9','taxa_class_method'].item()
occ18_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '18S V9','taxa_ref_db'].item()

occ18_test['identificationRemarks'] = occ18_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ18_test['Confidence'].astype(str) +", against reference database: "+occ18_test['taxa_ref_db']

In [135]:
occ18_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706'
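
The concatenation above can also be written as a small function, which makes the order of the three pieces explicit. This is only an illustrative sketch; `build_identification_remarks` is a hypothetical helper, not a function from this notebook or its modules.

```python
def build_identification_remarks(method: str, confidence: float, ref_db: str) -> str:
    """Assemble the remarks string in the same order as the cell above."""
    return (
        f"{method}, confidence (at lowest specified taxon): {confidence}"
        f", against reference database: {ref_db}"
    )


# Values copied from the analysis_data table and the first occurrence row
remark = build_identification_remarks(
    "Tourmaline; qiime2-2021.2; naive-bayes classifier",
    0.92209885,
    "PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706",
)
print(remark)
```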

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [136]:
occ18_test['taxonID'] = 'ASV:'+occ18_test['featureid']
occ18_test['basisOfRecord'] = 'MaterialSample'
occ18_test['nameAccordingTo'] = "WoRMS"
occ18_test['organismQuantityType'] = "DNA sequence reads"
occ18_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [137]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_meth',
       'nucl_acid_ext', 'amplicon_sequenced', 'target_gene',
       'target_subfragment', 'pcr_primer_forward', 'pcr_primer_reverse',
       'pcr_primer_name_forward', 'pcr_primer_name_reverse',
       'pcr_primer_reference', 'pcr_cond', 'nucl_acid_amp', 'adapters',
       'mid_barcode'],
      dtype='object')

In [138]:
occ18_test = occ18_test.merge(data['prep_data'].loc[data['prep_data']['amplicon_sequenced'] == '18S V9',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

#### eventID

In [139]:
occ18_test['eventID'] = occ18_test['eventID']+"_18S"

#### sampleSize

In [140]:
# get sampleSize by total number of reads per sample
x = asv_tables['18S V9'].sum(numeric_only=True).astype('int')
x.index = x.index+"_18S"
occ18_test['sampleSizeValue'] = occ18_test['eventID'].map(x).astype('str')
occ18_test['sampleSizeUnit'] = 'DNA sequence reads'
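
To make the read-total lookup concrete, here is a minimal sketch on an invented ASV table. `asv_toy` and `occ_toy` are hypothetical; the real `asv_tables['18S V9']` is assumed to have one row per ASV and one numeric column per sample, as the cell above implies.

```python
import pandas as pd

# Toy ASV table: one row per ASV, one numeric column per sample (invented counts)
asv_toy = pd.DataFrame({
    "featureid": ["a1", "b2"],
    "sampleA": [10, 5],
    "sampleB": [0, 7],
})

# Total reads per sample; numeric_only skips the featureid column
reads_per_sample = asv_toy.sum(numeric_only=True).astype("int")
# Match the eventID convention used for this marker
reads_per_sample.index = reads_per_sample.index + "_18S"

occ_toy = pd.DataFrame({"eventID": ["sampleA_18S", "sampleB_18S", "sampleA_18S"]})
occ_toy["sampleSizeValue"] = occ_toy["eventID"].map(reads_per_sample)
print(occ_toy["sampleSizeValue"].tolist())
```

Any eventID missing from the totals maps to NaN, and casting to `str` afterwards turns that into the literal text `nan`, so an `isna()` check before the cast may be worthwhile.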

In [141]:
# drop unneeded columns
occ18_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

In [None]:
occ18_test['associatedSequences'] = "https://www.ncbi.nlm.nih.gov/sra/"+occ18_test['sra_accession']+' | '+ "https://www.ncbi.nlm.nih.gov/biosample/"+occ18_test['biosample_accession']+' | '+"https://www.ncbi.nlm.nih.gov/bioproject/"+data['study_data']['bioproject_accession'].values[0]
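
The `associatedSequences` value is three NCBI links joined with ` | `. A minimal sketch of the same pattern for a single record follows; `associated_sequences` is a hypothetical helper, and the accession values are taken from tables shown in this notebook.

```python
def associated_sequences(sra: str, biosample: str, bioproject: str) -> str:
    """Join the SRA, BioSample and BioProject links with ' | '."""
    base = "https://www.ncbi.nlm.nih.gov"
    links = [
        f"{base}/sra/{sra}",
        f"{base}/biosample/{biosample}",
        f"{base}/bioproject/{bioproject}",
    ]
    return " | ".join(links)


links = associated_sequences("SRR26161153", "SAMN37516094", "PRJNA887898")
print(links)
```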

In [143]:
occ18_test.rename(columns={'biosample_accession': 'materialSampleID',
                           'sequence': 'DNA_sequence'}, inplace=True)

In [144]:
# drop unneeded columns
occ18_test.drop(columns=['sra_accession'],inplace=True)

In [145]:
occ18_test.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'sampleSizeValue', 'sampleSizeUnit', 'associatedSequences'],
      dtype='object')

In [146]:
occ18_test.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 ...",ASV:36aa75f9b28f5f831c2d631ba65c2bcb,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999946735, against reference database: PR2 v5.0.1; V9...",ASV:4e38e8ced9070952b314e1880bede1ca,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.779948049, against reference database: PR2 v5.0.1; V9...",ASV:2a31e5c01634165da99e7381279baa75,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999930607, against reference database: PR2 v5.0.1; V9...",ASV:ecee60339b2fb88ea6d1c8d18054bed4,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.98690791, against reference database: PR2 v5.0.1; V9 ...",ASV:fa1f1a97dd4ae7c826009186bad26384,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...


### Merge event and occurrence

In [147]:
occ18_merged = occ18_test.merge(all_event_data,how='left',on='eventID')

In [148]:
occ18_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 ...",ASV:36aa75f9b28f5f831c2d631ba65c2bcb,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999946735, against reference database: PR2 v5.0.1; V9...",ASV:4e38e8ced9070952b314e1880bede1ca,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.779948049, against reference database: PR2 v5.0.1; V9...",ASV:2a31e5c01634165da99e7381279baa75,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999930607, against reference database: PR2 v5.0.1; V9...",ASV:ecee60339b2fb88ea6d1c8d18054bed4,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.98690791, against reference database: PR2 v5.0.1; V9 ...",ASV:fa1f1a97dd4ae7c826009186bad26384,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4


In [149]:
occ18_merged.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence_18S.tsv",sep="\t",index=False)

### Combine 16S and 18S occurrence

In [150]:
occ18_merged.shape

(149182, 37)

In [151]:
occ_all = pd.concat([occ16_merged,occ18_merged],axis=0, ignore_index=True)

In [152]:
occ_all['occurrenceStatus'] = 'present' 

In [153]:
occ_all.shape

(318652, 38)
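
After stacking the two markers, two quick checks are possible: the row count should equal the sum of the inputs, and `occurrenceID` should stay unique. The sketch below uses invented toy frames (`occ16_toy`, `occ18_toy`) rather than the real tables and is offered only as an optional sanity check.

```python
import pandas as pd

# Toy stand-ins for occ16_merged and occ18_merged (IDs are invented)
occ16_toy = pd.DataFrame({
    "occurrenceID": ["s1_occ_a", "s1_occ_b"],
    "eventID": ["s1_16S", "s1_16S"],
})
occ18_toy = pd.DataFrame({
    "occurrenceID": ["s1_occ_c"],
    "eventID": ["s1_18S"],
})

occ_all_toy = pd.concat([occ16_toy, occ18_toy], axis=0, ignore_index=True)
occ_all_toy["occurrenceStatus"] = "present"

# Row count should equal the sum of the two inputs
rows_match = len(occ_all_toy) == len(occ16_toy) + len(occ18_toy)
# occurrenceID must be unique across the combined table
has_duplicates = bool(occ_all_toy["occurrenceID"].duplicated().any())
print(occ_all_toy.shape, rows_match, has_duplicates)
```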

In [154]:
occ_all.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence.csv",index=False)

### DNA-derived

In [155]:
dna_dict = dwc_data['dna'].to_dict('index')

In [156]:
dna_dict.keys()

dict_keys(['eventID', 'samp_name', 'occurrenceID', 'DNA_sequence', 'sop', 'nucl_acid_ext', 'samp_vol_we_dna_ext', 'samp_mat_process', 'nucl_acid_amp', 'target_gene', 'target_subfragment', 'ampliconSize', 'lib_layout', 'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond', 'seq_meth', 'otu_class_appr', 'otu_seq_comp_appr', 'otu_db', 'env_broad_scale', 'env_local_scale', 'env_medium', 'size_frac', 'concentration', 'concentrationUnit', 'samp_collec_device', 'source_mat_id'])

##### sample_data

In [157]:
# check which dna file terms are in sample_data
for key in dna_dict.keys():
    if dna_dict[key]['AOML_file'] == params['sample_data']:
        print(key,dna_dict[key])

samp_vol_we_dna_ext {'AOML_term': 'samp_vol_we_dna_ext', 'AOML_file': 'water_sample_data', 'DwC_definition': 'Volume (ml) or mass (g) of total collected sample processed for DNA extraction.MIXS:0000111', 'Example': nan}
samp_mat_process {'AOML_term': 'samp_mat_process', 'AOML_file': 'water_sample_data', 'DwC_definition': 'Any processing applied to the sample during or after retrieving the sample from environment. This field accepts OBI, for a browser of OBI (v 2018-02-12) terms please see http://purl.bioontology.org/ontology/OBI', 'Example': nan}
env_broad_scale {'AOML_term': 'env_broad_scale', 'AOML_file': 'water_sample_data', 'DwC_definition': nan, 'Example': nan}
env_local_scale {'AOML_term': 'env_local_scale', 'AOML_file': 'water_sample_data', 'DwC_definition': nan, 'Example': nan}
env_medium {'AOML_term': 'env_medium', 'AOML_file': 'water_sample_data', 'DwC_definition': nan, 'Example': nan}
size_frac {'AOML_term': 'size_frac', 'AOML_file': 'water_sample_data', 'DwC_definition': 'F

In [158]:
# rename sample_data, prep_data, and analysis_data columns to fit DwC standard
rename_dict = {}
# collect AOML term -> DwC term pairs for each source sheet, in the same order as before
for aoml_file in (params['sample_data'], 'prep_data', 'analysis_data'):
    for x in dna_dict.keys():
        if dna_dict[x]['AOML_file'] == aoml_file:
            rename_dict[dna_dict[x]['AOML_term']] = x

dna_sample = data['sample_data'].rename(columns=rename_dict).copy()
dna_prep = data['prep_data'].rename(columns=rename_dict).copy()
dna_analysis = data['analysis_data'].rename(columns=rename_dict).copy()

#dna_sample = dna_sample.drop(columns=[col for col in dna_sample if col not in rename_dict.values()])
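
Because the rename dictionary is keyed by the AOML column name, two DwC terms that point at the same AOML column would overwrite each other and only the last one survives. The sketch below shows that behavior and a way to detect it; `dna_dict_toy` and its entries are hypothetical and do not reproduce the real mapping sheet.

```python
from collections import Counter

# Toy version of dna_dict: DwC term -> info about the AOML source column (invented)
dna_dict_toy = {
    "eventID": {"AOML_term": "sample_name", "AOML_file": "sample_data"},
    "samp_name": {"AOML_term": "sample_name", "AOML_file": "sample_data"},
    "seq_meth": {"AOML_term": "seq_meth", "AOML_file": "prep_data"},
    "DNA_sequence": {"AOML_term": "sequence", "AOML_file": "asv_table"},
}
wanted_files = {"sample_data", "prep_data", "analysis_data"}

# (AOML column, DwC term) pairs for the sheets of interest
selected = [
    (info["AOML_term"], dwc_term)
    for dwc_term, info in dna_dict_toy.items()
    if info["AOML_file"] in wanted_files
]
rename_toy = dict(selected)  # later pairs overwrite earlier ones with the same key

# AOML columns requested by more than one DwC term
source_counts = Counter(src for src, _ in selected)
reused_sources = sorted(src for src, n in source_counts.items() if n > 1)
print(rename_toy, reused_sources)
```

A column listed in `reused_sources` would need to be copied explicitly (as is done for `eventID` in the cells that follow) rather than renamed.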

In [160]:
dna_sample.head()

Unnamed: 0,samp_name,serial_number,cruise_id,line_id,station,locationID,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,notes_sampling,project_id,amplicon_sequenced,metagenome_sequenced,organism,collection_date_local,collection_date,depth,minimumDepthInMeters,maximumDepthInMeters,env_broad_scale,...,woce_sect,ammonium,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,DCM = deep chlorophyl max.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,DCM was around 80 m and not well defined.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],...,RB2103,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,Only enough water for 2 surface replicates.,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],...,RB2103,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,,gomecc4,16S V4-V5 | 18S V9,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],...,RB2103,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665


In [161]:
dna_16 = dna_sample[dna_sample['amplicon_sequenced'].str.contains('16S V4-V5')].copy()
dna_16['eventID'] = dna_16['samp_name']+"_16S"
dna_16.drop(columns=['amplicon_sequenced'],inplace=True)
dna_16.head()

Unnamed: 0,samp_name,serial_number,cruise_id,line_id,station,locationID,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,notes_sampling,project_id,metagenome_sequenced,organism,collection_date_local,collection_date,depth,minimumDepthInMeters,maximumDepthInMeters,env_broad_scale,env_local_scale,...,ammonium,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,eventID
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,DCM = deep chlorophyl max.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_A_16S
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,DCM was around 80 m and not well defined.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_B_16S
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_C_16S
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,Only enough water for 2 surface replicates.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_A_16S
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_B_16S
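
The per-marker subsetting uses the same three steps each time: filter on `amplicon_sequenced`, build `eventID` from `samp_name` plus a suffix, and drop the filter column. A minimal sketch of a reusable helper follows; `subset_by_marker` and the toy table are hypothetical and are not part of the notebook.

```python
import pandas as pd


def subset_by_marker(samples: pd.DataFrame, marker: str, suffix: str) -> pd.DataFrame:
    """Keep samples sequenced for `marker` and add a marker-specific eventID."""
    # regex=False so the marker text is matched literally
    mask = samples["amplicon_sequenced"].str.contains(marker, regex=False)
    out = samples.loc[mask].copy()
    out["eventID"] = out["samp_name"] + suffix
    return out.drop(columns=["amplicon_sequenced"])


# Toy sample sheet (invented names)
samples_toy = pd.DataFrame({
    "samp_name": ["s1", "s2", "s3"],
    "amplicon_sequenced": ["16S V4-V5 | 18S V9", "16S V4-V5", "18S V9"],
})
dna_16_toy = subset_by_marker(samples_toy, "16S V4-V5", "_16S")
dna_18_toy = subset_by_marker(samples_toy, "18S V9", "_18S")
print(dna_16_toy["eventID"].tolist(), dna_18_toy["eventID"].tolist())
```

Passing `regex=False` is a design choice here: marker names are plain text, so a literal match avoids surprises if a name ever contains a regex metacharacter.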


In [162]:
dna_18 = dna_sample[dna_sample['amplicon_sequenced'].str.contains('18S V9')].copy()
dna_18['eventID'] = dna_18['samp_name']+"_18S"
dna_18.drop(columns=['amplicon_sequenced'],inplace=True)
dna_18.head()

Unnamed: 0,samp_name,serial_number,cruise_id,line_id,station,locationID,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,notes_sampling,project_id,metagenome_sequenced,organism,collection_date_local,collection_date,depth,minimumDepthInMeters,maximumDepthInMeters,env_broad_scale,env_local_scale,...,ammonium,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,eventID
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,DCM = deep chlorophyl max.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_A_18S
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,DCM was around 80 m and not well defined.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_B_18S
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_C_18S
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,Only enough water for 2 surface replicates.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_A_18S
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_B_18S


In [163]:
dna_sample = pd.concat([dna_16,dna_18],axis=0,ignore_index=True)
dna_sample.head()

Unnamed: 0,samp_name,serial_number,cruise_id,line_id,station,locationID,ctd_bottle_no,sample_replicate,source_mat_id,biological_replicates,extract_number,sample_title,bioproject_accession,biosample_accession,notes_sampling,project_id,metagenome_sequenced,organism,collection_date_local,collection_date,depth,minimumDepthInMeters,maximumDepthInMeters,env_broad_scale,env_local_scale,...,ammonium,carbonate,diss_inorg_carb,diss_oxygen,fluor,hydrogen_ion,nitrate,nitrite,nitrate_plus_nitrite,omega_arag,pco2,ph,phosphate,pressure,salinity,samp_store_loc,samp_store_temp,silicate,size_frac_low,size_frac_up,temp,tot_alkalinity,tot_depth_water_col,transmittance,eventID
0,GOMECC4_27N_Sta1_Deep_A,GOMECC4_001,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,A,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_B, GOMECC4_27N_Sta1_Deep_C",Plate4_52,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_A,PRJNA887898,SAMN37516091,DCM = deep chlorophyl max.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_A_16S
1,GOMECC4_27N_Sta1_Deep_B,GOMECC4_002,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,B,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_C",Plate4_60,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_B,PRJNA887898,SAMN37516092,DCM was around 80 m and not well defined.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_B_16S
2,GOMECC4_27N_Sta1_Deep_C,GOMECC4_003,GOMECC-4 (2021),27N,Sta1,27N_Sta1,3,C,GOMECC4_27N_Sta1_Deep,"GOMECC4_27N_Sta1_Deep_A, GOMECC4_27N_Sta1_Deep_B",Plate4_62,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_Deep_C,PRJNA887898,SAMN37516093,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,618 m,618,618,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],...,0.25971 µmol/kg,88.434 µmol/kg,2215.45 µmol/kg,129.44 µmol/kg,0.0308,0.0000000142 M,29.3256 µmol/kg,0.00391 µmol/kg,29.3295 µmol/kg,1.168,624 µatm,7.849,1.94489 µmol/kg,623 dbar,34.946 psu,NOAA/AOML Room 248,-20 °C,20.3569 µmol/kg,no pre-filter,0.22 µm,7.479 °C,2318.9 µmol/kg,623 m,4.7221,GOMECC4_27N_Sta1_Deep_C_16S
3,GOMECC4_27N_Sta1_DCM_A,GOMECC4_004,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,A,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_B, GOMECC4_27N_Sta1_DCM_C",Plate4_53,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_A,PRJNA887898,SAMN37516094,Only enough water for 2 surface replicates.,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_A_16S
4,GOMECC4_27N_Sta1_DCM_B,GOMECC4_005,GOMECC-4 (2021),27N,Sta1,27N_Sta1,14,B,GOMECC4_27N_Sta1_DCM,"GOMECC4_27N_Sta1_DCM_A, GOMECC4_27N_Sta1_DCM_C",Plate4_46,Atlantic Ocean seawater sample GOMECC4_27N_Sta1_DCM_B,PRJNA887898,SAMN37516095,,gomecc4,planned for FY24,seawater metagenome,2021-09-14T11:00-04:00,2021-09-14T07:00,49 m,49,49,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],...,0.32968 µmol/kg,229.99 µmol/kg,2033.19 µmol/kg,193.443 µmol/kg,0.036,0.0000000094 M,0 µmol/kg,0 µmol/kg,0 µmol/kg,3.805,423 µatm,8.027,0.0517 µmol/kg,49 dbar,36.325 psu,NOAA/AOML Room 248,-20 °C,1.05635 µmol/kg,no pre-filter,0.22 µm,28.592 °C,2371 µmol/kg,623 m,4.665,GOMECC4_27N_Sta1_DCM_B_16S
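
The filter, tag, and concatenate pattern used for the 16S and 18S subsets can also be written as a loop over markers. The following is a minimal sketch on toy data, not the notebook's own code: the sample names and the `" | "` separator inside `amplicon_sequenced` are assumptions made for illustration, and the `markers` dictionary is a hypothetical helper.

```python
import pandas as pd

# Toy metadata: one row per sample, listing the amplicons sequenced from it.
# The " | " separator is an assumption; only substring matching matters here.
dna_sample = pd.DataFrame({
    "samp_name": ["S1", "S2", "S3"],
    "amplicon_sequenced": ["16S V4-V5 | 18S V9", "16S V4-V5", "18S V9"],
})

# Hypothetical helper: search string -> eventID suffix for each marker.
markers = {"16S V4-V5": "_16S", "18S V9": "_18S"}

parts = []
for pattern, suffix in markers.items():
    # regex=False treats the pattern as a literal substring.
    mask = dna_sample["amplicon_sequenced"].str.contains(pattern, regex=False)
    subset = dna_sample[mask].copy()
    subset["eventID"] = subset["samp_name"] + suffix
    parts.append(subset.drop(columns=["amplicon_sequenced"]))

combined = pd.concat(parts, axis=0, ignore_index=True)

# Each sample x marker combination should yield exactly one eventID.
assert combined["eventID"].is_unique
```

Checking `is_unique` right after the concatenation is a cheap way to catch duplicated sample names before `eventID` is used as a merge key.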


In [165]:
prep_16 = dna_prep[dna_prep['amplicon_sequenced'].str.contains('16S V4-V5')].copy()
prep_16['eventID'] = prep_16['samp_name']+"_16S"
prep_16.head()

Unnamed: 0,samp_name,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,filetype,filename,filename2,biosample_accession,sra_accession,seq_meth,nucl_acid_ext,amplicon_sequenced,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode,eventID
4,GOMECC4_BROWNSVILLE_Sta66_DCM_B,GOMECC16S_Plate1_1,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_1_S1_L001_R1_001.fastq.gz,GOMECC16S_Plate1_1_S1_L001_R2_001.fastq.gz,SAMN37516307,SRR26148474,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_BROWNSVILLE_Sta66_DCM_B_16S
6,GOMECC4_GALVESTON_Sta54_DCM_B,GOMECC16S_Plate1_10,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_10_S10_L001_R1_001.fastq.gz,GOMECC16S_Plate1_10_S10_L001_R2_001.fastq.gz,SAMN37516268,SRR26148413,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta54_DCM_B_16S
8,GOMECC4_GALVESTON_Sta54_Deep_A,GOMECC16S_Plate1_11,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_11_S11_L001_R1_001.fastq.gz,GOMECC16S_Plate1_11_S11_L001_R2_001.fastq.gz,SAMN37516264,SRR26148140,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta54_Deep_A_16S
10,GOMECC4_GALVESTON_Sta49_Deep_A,GOMECC16S_Plate1_12,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_12_S12_L001_R1_001.fastq.gz,GOMECC16S_Plate1_12_S12_L001_R2_001.fastq.gz,SAMN37516246,SRR26148197,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta49_Deep_A_16S
12,GOMECC4_BROWNSVILLE_Sta66_DCM_C,GOMECC16S_Plate1_13,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_13_S13_L001_R1_001.fastq.gz,GOMECC16S_Plate1_13_S13_L001_R2_001.fastq.gz,SAMN37516308,SRR26148464,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_BROWNSVILLE_Sta66_DCM_C_16S


In [166]:
prep_18 = dna_prep[dna_prep['amplicon_sequenced'].str.contains('18S V9')].copy()
prep_18['eventID'] = prep_18['samp_name']+"_18S"
prep_18.head()

Unnamed: 0,samp_name,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,filetype,filename,filename2,biosample_accession,sra_accession,seq_meth,nucl_acid_ext,amplicon_sequenced,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode,eventID
1,GOMECC4_27N_Sta1_DCM_A,GOMECC18S_Plate4_53,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_53_S340_L001_R1_001.fastq.gz,GOMECC18S_Plate4_53_S340_L001_R2_001.fastq.gz,SAMN37516094,SRR26161153,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_27N_Sta1_DCM_A_18S
3,GOMECC4_27N_Sta1_DCM_B,GOMECC18S_Plate4_46,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_46_S333_L001_R1_001.fastq.gz,GOMECC18S_Plate4_46_S333_L001_R2_001.fastq.gz,SAMN37516095,SRR26161138,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_27N_Sta1_DCM_B_18S
5,GOMECC4_27N_Sta1_DCM_C,GOMECC18S_Plate4_54,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_54_S341_L001_R1_001.fastq.gz,GOMECC18S_Plate4_54_S341_L001_R2_001.fastq.gz,SAMN37516096,SRR26160919,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_27N_Sta1_DCM_C_18S
7,GOMECC4_27N_Sta1_Deep_A,GOMECC18S_Plate4_52,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_52_S339_L001_R1_001.fastq.gz,GOMECC18S_Plate4_52_S339_L001_R2_001.fastq.gz,SAMN37516091,SRR26160709,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_27N_Sta1_Deep_A_18S
9,GOMECC4_27N_Sta1_Deep_B,GOMECC18S_Plate4_60,18S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC18S_Plate4_60_S347_L001_R1_001.fastq.gz,GOMECC18S_Plate4_60_S347_L001_R2_001.fastq.gz,SAMN37516092,SRR26160970,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S V9,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_27N_Sta1_Deep_B_18S


In [167]:
dna_prep = pd.concat([prep_16,prep_18],axis=0,ignore_index=True)
dna_prep.head()

Unnamed: 0,samp_name,library_id,title,library_strategy,library_source,library_selection,lib_layout,platform,instrument_model,design_description,filetype,filename,filename2,biosample_accession,sra_accession,seq_meth,nucl_acid_ext,amplicon_sequenced,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,adapters,mid_barcode,eventID
0,GOMECC4_BROWNSVILLE_Sta66_DCM_B,GOMECC16S_Plate1_1,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_1_S1_L001_R1_001.fastq.gz,GOMECC16S_Plate1_1_S1_L001_R2_001.fastq.gz,SAMN37516307,SRR26148474,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_BROWNSVILLE_Sta66_DCM_B_16S
1,GOMECC4_GALVESTON_Sta54_DCM_B,GOMECC16S_Plate1_10,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_10_S10_L001_R1_001.fastq.gz,GOMECC16S_Plate1_10_S10_L001_R2_001.fastq.gz,SAMN37516268,SRR26148413,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta54_DCM_B_16S
2,GOMECC4_GALVESTON_Sta54_Deep_A,GOMECC16S_Plate1_11,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_11_S11_L001_R1_001.fastq.gz,GOMECC16S_Plate1_11_S11_L001_R2_001.fastq.gz,SAMN37516264,SRR26148140,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta54_Deep_A_16S
3,GOMECC4_GALVESTON_Sta49_Deep_A,GOMECC16S_Plate1_12,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_12_S12_L001_R1_001.fastq.gz,GOMECC16S_Plate1_12_S12_L001_R2_001.fastq.gz,SAMN37516246,SRR26148197,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_GALVESTON_Sta49_Deep_A_16S
4,GOMECC4_BROWNSVILLE_Sta66_DCM_C,GOMECC16S_Plate1_13,16S amplicon metabarcoding of marine metagenome: Gulf of Mexico (USA),AMPLICON,METAGENOMIC,PCR,paired,ILLUMINA,Illumina MiSeq,Samples were collected and filtered onto Sterivex 0.22 um cartridge filters. DNA was extracted from Sterivex by adding lysis buffer and magnetic b...,fastq,GOMECC16S_Plate1_13_S13_L001_R1_001.fastq.gz,GOMECC16S_Plate1_13_S13_L001_R2_001.fastq.gz,SAMN37516308,SRR26148464,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S V4-V5,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,ACACTGACGACATGGTTCTACA;TACGGTAGCAGAGACTTGGTCT,missing: not provided,GOMECC4_BROWNSVILLE_Sta66_DCM_C_16S


In [176]:
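# Merge sample and prep metadata on eventID (one row per sample and amplicon),
# then attach the per-amplicon analysis metadata via amplicon_sequenced.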
dna = dna_sample.merge(dna_prep, how='outer', on='eventID')
dna = dna.merge(dna_analysis,how='outer',on='amplicon_sequenced')
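
Because `dna_sample` and `dna_prep` share columns other than the merge key (for example `samp_name` and `biosample_accession`), pandas suffixes them with `_x` and `_y`, which is why `samp_name_x` and `biosample_accession_x` appear in the column listing further down. The sketch below uses toy data to show that behaviour, plus two optional checks (`indicator` and `validate`). It assumes `eventID` is unique on both sides, which may not hold for every dataset.

```python
import pandas as pd

# Toy stand-ins for dna_sample and dna_prep (values are illustrative only).
sample = pd.DataFrame({
    "samp_name": ["S1", "S2"],
    "eventID": ["S1_16S", "S2_16S"],
    "env_medium": ["sea water", "sea water"],
})
prep = pd.DataFrame({
    "samp_name": ["S1", "S3"],
    "eventID": ["S1_16S", "S3_16S"],
    "target_gene": ["16S rRNA", "16S rRNA"],
})

# indicator=True adds a '_merge' column; validate raises if eventID is duplicated.
merged = sample.merge(prep, how="outer", on="eventID",
                      indicator=True, validate="one_to_one")

# Shared non-key columns are suffixed, so a plain 'samp_name' no longer exists.
suffixed = [c for c in merged.columns if c.endswith(("_x", "_y"))]

# eventIDs found on only one side of the outer merge.
unmatched = merged.loc[merged["_merge"] != "both", "eventID"].tolist()
```

Any `eventID` listed in `unmatched` has sample metadata without prep metadata, or the reverse, and is worth inspecting before building the extension table.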

In [177]:
rename_dict.values()

dict_values(['samp_vol_we_dna_ext', 'samp_mat_process', 'env_broad_scale', 'env_local_scale', 'env_medium', 'size_frac', 'concentration', 'concentrationUnit', 'samp_collec_device', 'source_mat_id', 'samp_name', 'nucl_acid_ext', 'nucl_acid_amp', 'target_gene', 'target_subfragment', 'lib_layout', 'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond', 'seq_meth', 'sop', 'ampliconSize', 'otu_class_appr', 'otu_seq_comp_appr', 'otu_db'])

In [178]:
# Which columns are not DNA-derived extension terms (i.e., not among rename_dict.values())?
[col for col in dna if col not in rename_dict.values()]

['samp_name_x',
 'serial_number',
 'cruise_id',
 'line_id',
 'station',
 'locationID',
 'ctd_bottle_no',
 'sample_replicate',
 'biological_replicates',
 'extract_number',
 'sample_title',
 'bioproject_accession',
 'biosample_accession_x',
 'notes_sampling',
 'project_id',
 'metagenome_sequenced',
 'organism',
 'collection_date_local',
 'collection_date',
 'depth',
 'minimumDepthInMeters',
 'maximumDepthInMeters',
 'geo_loc_name',
 'waterBody',
 'countryCode',
 'lat_lon',
 'decimalLatitude',
 'decimalLongitude',
 'geodeticDatum',
 'sample_type',
 'collection_method',
 'basisOfRecord',
 'cluster_16s',
 'cluster_18s',
 'line_position',
 'offshore_inshore_200m_isobath',
 'depth_category',
 'ocean_acidification_status',
 'seascape_class',
 'seascape_probability',
 'seascape_window',
 'dna_sample_number',
 'dna_conc.1',
 'dna_yield',
 'extraction_plate_name',
 'extraction_well_number',
 'extraction_well_position',
 'ship_crs_expocode',
 'woce_sect',
 'ammonium',
 'carbonate',
 'diss_inorg_ca

In [179]:
# eventID is not among rename_dict.values(), so the filter below drops it.
# Save it first and re-attach it after dropping the other columns.
event = dna['eventID']
dna = dna.drop(columns=[col for col in dna if col not in rename_dict.values()])
dna['eventID']=event
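
`eventID` does not appear in the `rename_dict.values()` output above, so any filter based only on those values removes it. An alternative is to add `eventID` to the set of columns to keep and filter once. This is a sketch under assumptions: the two-entry `rename_dict` and its keys are hypothetical placeholders, and the toy `dna` frame stands in for the merged table.

```python
import pandas as pd

# Hypothetical two-entry mapping; the notebook's rename_dict is much longer
# and its keys are not shown in this section.
rename_dict = {"term_a": "target_gene", "term_b": "seq_meth"}

# Toy stand-in for the merged dna table.
dna = pd.DataFrame({
    "eventID": ["S1_16S"],
    "target_gene": ["16S rRNA"],
    "seq_meth": ["Illumina MiSeq 2x250"],
    "station": ["Sta1"],
})

# Keep the extension terms plus the eventID key in a single filtering step.
keep = set(rename_dict.values()) | {"eventID"}
trimmed = dna[[col for col in dna.columns if col in keep]]
```

Note that this keeps `eventID` in its original column position, whereas saving and re-assigning the column appends it as the last column.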

In [180]:
dna.tail()

Unnamed: 0,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collec_device,samp_mat_process,size_frac,concentration,concentrationUnit,lib_layout,seq_meth,nucl_acid_ext,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,eventID
939,GOMECC4_CAPECORAL_Sta141_DCM,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2040 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,GOMECC4_CAPECORAL_Sta141_DCM_B_18S
940,GOMECC4_CAPECORAL_Sta141_DCM,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2080 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,GOMECC4_CAPECORAL_Sta141_DCM_C_18S
941,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2100 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,GOMECC4_CAPECORAL_Sta141_Surface_A_18S
942,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2000 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,GOMECC4_CAPECORAL_Sta141_Surface_B_18S
943,GOMECC4_CAPECORAL_Sta141_Surface,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],2000 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,18S rRNA,V9,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,10.1371/journal.pone.0006372,initial denaturation:94_3;denaturation:94_0.75;annealing:65_0.25;57_0.5;elongation:72_1.5;final elongation:72_10;35,10.1371/journal.pone.0006372,260,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,GOMECC4_CAPECORAL_Sta141_Surface_C_18S


#### Merge with occurrenceID and DNA_sequence

In [181]:
dna.shape

(944, 26)

In [182]:
dna_occ = dna.merge(occ_all[['eventID','occurrenceID','DNA_sequence']],how='left',on='eventID')

In [183]:
dna_occ.shape

(311390, 28)
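
The jump from 944 rows to 311,390 rows is expected for a one-to-many merge: each `eventID` row in `dna` is repeated once for every occurrence recorded for that event. The sketch below reproduces this on toy data and shows one way to sanity-check the resulting row count. All values are illustrative, and `validate="one_to_many"` assumes `eventID` is unique in `dna`.

```python
import pandas as pd

# Toy stand-ins for dna (one row per event) and occ_all (one row per occurrence).
dna = pd.DataFrame({
    "eventID": ["S1_16S", "S2_16S", "S3_18S"],
    "target_gene": ["16S rRNA", "16S rRNA", "18S rRNA"],
})
occ_all = pd.DataFrame({
    "eventID": ["S1_16S", "S1_16S", "S1_16S", "S2_16S"],
    "occurrenceID": ["S1_16S_occ1", "S1_16S_occ2", "S1_16S_occ3", "S2_16S_occ1"],
    "DNA_sequence": ["ACGT", "TTGA", "GGCA", "ACGT"],
})

# validate raises MergeError if eventID is duplicated on the left side.
dna_occ = dna.merge(occ_all, how="left", on="eventID", validate="one_to_many")

# Expected rows: one per occurrence, plus one per event without any occurrence.
per_event = occ_all["eventID"].value_counts()
expected_rows = int(dna["eventID"].map(per_event).fillna(1).sum())

# Events that matched no occurrence end up with a missing occurrenceID.
n_without_occ = int(dna_occ["occurrenceID"].isna().sum())
```

A non-zero `n_without_occ` would point to events in the metadata that have no rows in the occurrence table.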

In [184]:
dna_occ.head()

Unnamed: 0,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collec_device,samp_mat_process,size_frac,concentration,concentrationUnit,lib_layout,seq_meth,nucl_acid_ext,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,eventID,occurrenceID,DNA_sequence
0,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ009257b156ab4a9dd2f0b0dd33100b7e,TACGAGGGGTGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTTCGTAGGCGTCTTGCCAAGTTGATCGTTAAAGCCACCGGCTTAACCGGTGATCTGCGATCAAAACTGGCGAGATAGAATATGTGAGGGGAATGTGGAATTC...
1,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01398067b1d323b7f992a6764fa69e97,TACGGAGGGTGCAAGCGTTGTTCGGAATTATTGGGCGTAAAGCGGATGTAGGCGGTCTGTCAAGTCGGATGTGAAATCCCTGGGCTCAACCCAGGAACTGCATTCGAAACTGTCAGACTAGAGTCTCGGAGGGGGTGGCGGAATTC...
2,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01770ea2fb7f041c787e5a481888c27e,TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTCCGCAGGCGGACTATTAAGTCAGTGGTGAAAGTCTGCAGCTTAACTGTAGAATTGCCATTGAAACTGATAGTCTTGAGTGTGGTTGAAGTGGGCGGAATAT...
3,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ017dbdc8b62705bdf3f93218ac93a030,TACTAGGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTGCGTAGGCGTCTACGTAAGTTGTTTGTTAAATCCATCGGCTTAACCGATGATCTGCAAACAAAACTGCATAGATAGAGTTTGGAAGAGGAAAGTGGAATTC...
4,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ069f375524db7812103fe73fdefb7d2b,TACGTAGGAGGCTAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGCAGGTGGCTGAGTTCGTCCGTGGTGCAAGCTCCAGGCCTAACCTGGAGAGGTCTACGGATACTGCTCGGCTTGAGGGCGGTAGAGGAGCACGGAATTC...


In [241]:
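# Remove only a trailing " ng/µl" unit; str.strip(" ng/µl") would strip any of those characters from both ends
dna_occ['concentration'] = dna_occ['concentration'].str.replace(r"\s*ng/µl$", "", regex=True)
dna_occ['concentrationUnit'] = "ng/µl"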
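
A note on unit stripping: `strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix, and pandas `.str.strip` follows the same semantics as Python's built-in `str.strip`. The sketch below uses hypothetical values (not taken from the dataset) to show how that differs from removing an exact suffix.

```python
# Hypothetical concentration strings, not taken from the dataset
values = ["12.5 ng/µl", "missing: not provided", "not applicable"]

# strip(chars) removes any of ' ', 'n', 'g', '/', 'µ', 'l' from both ends
stripped = [v.strip(" ng/µl") for v in values]

# removesuffix (Python 3.9+) removes only the exact trailing text
suffix_removed = [v.removesuffix(" ng/µl") for v in values]

print(stripped)        # ['12.5', 'missing: not provided', 'ot applicable']
print(suffix_removed)  # ['12.5', 'missing: not provided', 'not applicable']
```

The leading `n` of "not applicable" is lost with `strip` because `n` is in the character set, so placeholder values that begin or end with one of those characters can be silently altered.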

In [185]:
# List the DwC terms in dna_dict that are not yet columns in dna_occ
for key in dna_dict:
    if key not in dna_occ.columns:
        print(key, dna_dict[key])

samp_name {'AOML_term': 'sample_name', 'AOML_file': 'prep_data', 'DwC_definition': nan, 'Example': nan}
sop {'AOML_term': 'sop', 'AOML_file': 'analysis_data', 'DwC_definition': 'Standard operating procedures used in assembly and/or annotation of genomes, metagenomes or environmental sequences. Or A reference to a well documented protocol, e.g. using protocols.io', 'Example': nan}
otu_class_appr {'AOML_term': 'derived: cluster_method, pid_clustering', 'AOML_file': 'analysis_data', 'DwC_definition': 'Approach/algorithm when defining OTUs or ASVs, include version and parameters separated by semicolons', 'Example': '"dada2; 1.14.0; ASV"'}


In [186]:
data['analysis_data']['cluster_method'][0]

'Tourmaline; qiime2-2021.2; dada2'

In [187]:
# Derive otu_class_appr from cluster_method and pid_clustering (first row of analysis_data)
dna_occ['otu_class_appr'] = data['analysis_data']['cluster_method'][0] + "; " + data['analysis_data']['pid_clustering'][0]

In [188]:
# Re-check: list the DwC terms in dna_dict that are still missing from dna_occ
for key in dna_dict:
    if key not in dna_occ.columns:
        print(key, dna_dict[key])

samp_name {'AOML_term': 'sample_name', 'AOML_file': 'prep_data', 'DwC_definition': nan, 'Example': nan}
sop {'AOML_term': 'sop', 'AOML_file': 'analysis_data', 'DwC_definition': 'Standard operating procedures used in assembly and/or annotation of genomes, metagenomes or environmental sequences. Or A reference to a well documented protocol, e.g. using protocols.io', 'Example': nan}
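
Since the same missing-term check is run more than once, it could be wrapped in a small helper. The following is only a sketch: `missing_terms` is a hypothetical name, and the dictionary and column list are stand-ins for `dna_dict` and `dna_occ.columns`, not the real data.

```python
def missing_terms(term_dict, columns):
    """Return the keys of term_dict that are absent from columns, in dict order."""
    present = set(columns)
    return [term for term in term_dict if term not in present]

# Hypothetical stand-ins for dna_dict and dna_occ.columns
example_dict = {"samp_name": {}, "sop": {}, "otu_class_appr": {}, "target_gene": {}}
example_columns = ["target_gene", "otu_class_appr", "eventID"]

print(missing_terms(example_dict, example_columns))  # ['samp_name', 'sop']
```

In the notebook this would be called as `missing_terms(dna_dict, dna_occ.columns)`, and an empty list would mean every term in the dictionary has a matching column.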


In [189]:
dna_occ.head()

Unnamed: 0,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collec_device,samp_mat_process,size_frac,concentration,concentrationUnit,lib_layout,seq_meth,nucl_acid_ext,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,eventID,occurrenceID,DNA_sequence,otu_class_appr
0,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ009257b156ab4a9dd2f0b0dd33100b7e,TACGAGGGGTGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTTCGTAGGCGTCTTGCCAAGTTGATCGTTAAAGCCACCGGCTTAACCGGTGATCTGCGATCAAAACTGGCGAGATAGAATATGTGAGGGGAATGTGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
1,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01398067b1d323b7f992a6764fa69e97,TACGGAGGGTGCAAGCGTTGTTCGGAATTATTGGGCGTAAAGCGGATGTAGGCGGTCTGTCAAGTCGGATGTGAAATCCCTGGGCTCAACCCAGGAACTGCATTCGAAACTGTCAGACTAGAGTCTCGGAGGGGGTGGCGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
2,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01770ea2fb7f041c787e5a481888c27e,TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTCCGCAGGCGGACTATTAAGTCAGTGGTGAAAGTCTGCAGCTTAACTGTAGAATTGCCATTGAAACTGATAGTCTTGAGTGTGGTTGAAGTGGGCGGAATAT...,Tourmaline; qiime2-2021.2; dada2; ASV
3,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ017dbdc8b62705bdf3f93218ac93a030,TACTAGGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTGCGTAGGCGTCTACGTAAGTTGTTTGTTAAATCCATCGGCTTAACCGATGATCTGCAAACAAAACTGCATAGATAGAGTTTGGAAGAGGAAAGTGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
4,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ069f375524db7812103fe73fdefb7d2b,TACGTAGGAGGCTAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGCAGGTGGCTGAGTTCGTCCGTGGTGCAAGCTCCAGGCCTAACCTGGAGAGGTCTACGGATACTGCTCGGCTTGAGGGCGGTAGAGGAGCACGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV


In [190]:
dna_occ.to_csv("../processed/dna-derived.csv",index=False)
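
The `index=False` argument keeps the pandas row index out of the written file. As a minimal sketch of why that matters, the example below round-trips a hypothetical two-row frame (a stand-in for `dna_occ`) through an in-memory buffer with and without the index.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for dna_occ
df = pd.DataFrame({"occurrenceID": ["occ_a", "occ_b"],
                   "target_gene": ["16S rRNA", "16S rRNA"]})

# Default (index=True): the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'occurrenceID', 'target_gene']
print(cols_without_index)  # ['occurrenceID', 'target_gene']
```

A stray `Unnamed: 0` column in a file read back by pandas is usually a sign that the file was written with the index included.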