# Darwin Core Conversion of eDNA Sequence Data From the FAIRe (NOAA Version) metadata template 

**Version:** 3.0

**Authors:** Katherine Silliman, Bayden Willms

**Last Updated:** 2-June-2025

This notebook converts a [FAIR-eDNA](https://fair-edna.github.io/index.html)-based data sheet to DarwinCore for submission to OBIS. It has been tested on a Windows 11 laptop with Python 3.11.

To generate the input files for edna2obis, please run [FAIRe2NODE](https://github.com/aomlomics/FAIReSheets/tree/FAIRe2NODE) to generate your own FAIR-eDNA (NOAA) template. Once you've filled in your data, you are ready to begin.

This newest version of edna2obis takes the same input files as the ODE, the [Ocean DNA Explorer](https://www.oceandnaexplorer.org/). Explore your data (publicly or privately) with visualizations, API capabilities, and more through ODE.

[FAIR-eDNA NOAA Google Sheet](https://docs.google.com/spreadsheets/d/1mkjfUQW3gTn3ezhMQmFDQn4EBoQ2Xv4SZeSd9sqagoU/edit?gid=0#gid=0)

**Requirements:**
- Python 3
- Python 3 packages:
    - os
- External packages:
    - Bio.Entrez from biopython
    - numpy
    - pandas
    - openpyxl
    - pyworms
    - multiprocess
- Custom modules:
    - WoRMS_matching
    - analysis_helpers

**Resources:**
- Abarenkov K, Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Provoost P, Schigel D, Suominen S, Svenningsen C & Frøslev TG (2023) Publishing DNA-derived data through biodiversity data platforms, v1.3. Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22
- [OBIS manual](https://manual.obis.org/dna_data.html)
- [TDWG Darwin Core Occurrence Core](https://dwc.tdwg.org/terms/#occurrence)
- [GBIF DNA Derived Data Extension](https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/DNADerivedData)
- https://github.com/iobis/dataset-edna

**Citation**  
Silliman K, Anderson S, Storo R, Thompson L (2023) A Case Study in Sharing Marine eDNA Metabarcoding Data to OBIS. Biodiversity Information Science and Standards 7: e111048. https://doi.org/10.3897/biss.7.111048


## Installation  

```bash
conda create -n edna2obis
conda activate edna2obis
conda install -c conda-forge notebook
conda install -c conda-forge nb_conda_kernels

conda install -c conda-forge numpy pandas
conda install -c conda-forge openpyxl

#worms conversion
conda install -c conda-forge pyworms
conda install -c conda-forge multiprocess
conda install -c conda-forge biopython
```
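
After installation, it can help to confirm that the environment actually contains the packages listed under Requirements. The sketch below is one way to do that from Python; the helper name `find_missing_modules` is hypothetical, and the list uses import names (for example `Bio` for biopython), which is an assumption about how each package is imported.

```python
import importlib.util

# Import names (not conda package names) for the external requirements.
REQUIRED_MODULES = ["numpy", "pandas", "openpyxl", "pyworms", "multiprocess", "Bio"]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

Run this inside the activated `edna2obis` environment; any name it reports can be installed with the matching `conda install` line above.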

In [1]:
## Imports
import os

import numpy as np
import pandas as pd

import WoRMS_matching # custom functions for querying WoRMS API
import WoRMS_v3_matching # new custom functions for querying WoRMS API


In [2]:
# jupyter notebook parameters
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_columns', 50)

Note that in a Jupyter Notebook, the working directory defaults to the folder containing the .ipynb file, so the relative paths used below are resolved from there.
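
If a relative path does not seem to resolve as expected, printing the working directory is a quick check. This is a minimal sketch; the helper `resolve_from_cwd` and the example file name are hypothetical.

```python
import os
from pathlib import Path

# Directory that relative paths such as "../raw-v3/..." are resolved against.
working_dir = Path(os.getcwd())
print(f"Working directory: {working_dir}")


def resolve_from_cwd(relative_path):
    """Return the absolute path that a relative path points to from the working directory."""
    return (Path.cwd() / relative_path).resolve()


# Hypothetical example path; substitute one of your own input files.
example = resolve_from_cwd("../raw-v3/example.xlsx")
print(f"Resolved: {example} (exists: {example.exists()})")
```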

## Prepare Input Data 

**Project data and metadata**  
This workflow assumes that you have your project metadata in an Excel sheet formatted like the FAIR-eDNA template located **TODO NEED TO ADD CORRECT LINK**[here](https://docs.google.com/spreadsheets/d/1YBXFU9PuMqm7IT1tp0LTxQ1v2j0tlCWFnhSpy-EBwPw/edit?usp=drive_link). Instructions for filling out the metadata template are located in the 'README' sheet and at the [documentation website](https://noaa-omics-templates.readthedocs.io/en/latest/).

**eDNA and taxonomy data**  
The eDNA data and assigned taxonomy should be in a specific tab-delimited format. ![asv_table format](../images/asv_table.png)

This file is generated automatically by [Tourmaline 2](https://github.com/aomlomics/tourmaline/tree/develop), in X location. If your data was generated with Qiime2 or a previous version of Tourmaline, you can convert the `table.qza`, `taxonomy.qza`, and `repseqs.qza` outputs to the correct format using the `create_asv_seq_taxa_obis.sh` shell script.

Example:  

``` bash
#Run this with a qiime2 environment. 
bash create_asv_seq_taxa_obis.sh -f \
../gomecc_v2_raw/table-16S-merge.qza -t ../gomecc_v2_raw/taxonomy-16S-merge.qza -r ../gomecc_v2_raw/repseqs-16S-merge.qza \
-o ../gomecc_v2_raw/gomecc-16S-asv.tsv
```


## Set configs  

Below you can set definitions for parameters used in the code. 

| Parameter           | Description                                                                                                       | Example                                                                                              |
|---------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `sampleMetadata`    | Name of sheet in FAIRe template data Excel file with sample metadata.                                                    | "sampleMetadata"                                                                                  |
| `experimentRunMetadata`         | Name of sheet in FAIRe template data Excel file with data about molecular preparation methods.                           | "experimentRunMetadata"                                                                                 |
| `projectMetadata`        | Name of sheet in FAIRe template data Excel file with metadata about the study.                                           | "projectMetadata"                                                                                         |
| `excel_file`        | Path of the FAIRe data Excel file.                                                                                  | "../raw-v3/FAIRe_noaa-aoml-gomecc4.xlsx"                                                  |
| `FAIRe_NOAA_checklist`        | Path of the FAIRe NOAA Checklist, which contains information on mapping to DarwinCore and the expected files for OBIS submission.                                                                                  | "../raw-v3/FAIRe_NOAA_checklist_v1.0.xlsx"                                                  |
| `datafiles`         | Python dictionary, where keys are the analysis run names and values are dictionaries holding the paths to the corresponding `taxonomy_file` and `occurrence_file`. |   See example below to format raw data per analysis  |
| `skip_sample_types` | Python list of `samp_category` values to skip from OBIS submission, such as controls or blanks.                       | [`negative control`, `positive control`] |
| `skip_columns`      | Python list of columns to ignore when submitting to OBIS.                                                         | [`samp_collect_notes`]                                                                                   |
| `taxonomic_api_source`      | Specify whether you want taxonomic assignment from either WoRMS or GBIF APIs.     |  `WoRMS` | 
<!-- | `analysisMetadata`     | Name of sheet in FAIRe template data Excel file with data about analysis methods.                                        | "analysisMetadata_gomecc4_16s_p1"                                                                                      | -->

In [3]:
# EDIT THIS CELL
# Here, you specify where all of your data is, set a few other parameters, and set parameters related to the taxonomic assignment.

# STEP 1
# Assign sheets from your FAIRe Excel metadata file
# NOTE: Left side is what edna2obis calls that data. Right side is the actual sheetname from your FAIRe Excel metadata file:
params = {}
params['sampleMetadata'] = "sampleMetadata"
params['experimentRunMetadata']= "experimentRunMetadata"
params['projectMetadata'] = "projectMetadata"

params['excel_file'] = "../raw-v3/FAIRe_NOAA_noaa-aoml-gomecc4_SHARING.xlsx"
params['FAIRe_NOAA_checklist'] = "../raw-v3/FAIRe_NOAA_checklist_v1.0.xlsx"
# Be sure to also include the paths for the FAIRe metadata Excel sheet, and the FAIRe NOAA Checklist


# STEP 2
# Assign pathnames for your raw data. Each analysis should have 2 raw data files associated with it.
params['datafiles'] = {
    'gomecc4_18s_p1-6_v2024.10_241122': {
        'taxonomy_file': '../raw-v3/asvTaxaFeatures_gomecc4_18s_p1-6_v2024.10_241122.tsv',
        'occurrence_file': '../raw-v3/table_gomecc4_18s_p1-6_v2024.10_241122.tsv'
    }, 
    'gomecc4_16s_p3-6_v2024.10_241122': {
        'taxonomy_file': '../raw-v3/asvTaxaFeatures_gomecc4_16s_p3-6_v2024.10_241122.tsv',
        'occurrence_file': '../raw-v3/table_gomecc4_16s_p3-6_v2024.10_241122.tsv'
    }, 
    'gomecc4_16s_p1-2_v2024.10_241122': {
        'taxonomy_file': '../raw-v3/asvTaxaFeatures_gomecc4_16s_p1-2_v2024.10_241122.tsv',
        'occurrence_file': '../raw-v3/table_gomecc4_16s_p1-2_v2024.10_241122.tsv'
    },
    # Add other analysis runs here, following the pattern:
    # 'your_analysis_run_name': {
    #     'taxonomy_file': 'path/to/your/asvTaxaFeatures_your_analysis_run_name.tsv',
    #     'occurrence_file': 'path/to/your/table_your_analysis_run_name.tsv'
    # },
}


# STEP 3:
# However you denote control / blank samples, specify here. 
params['skip_sample_types'] = ['negative control','positive control']
params['skip_columns']= ['samp_collect_notes','date_modified','modified_by']


# STEP 4:
# Specify which API you would like to use to assign taxonomy. Options are either WoRMS or GBIF
# Taxonomic Assignment Parameters:
params['taxonomic_api_source'] = 'WoRMS'
# params['taxonomic_api_source'] = 'GBIF'


# STEP 5:
# Define which assays should skip the 'species' rank during taxonomic assignment.
# Species-level assignments from some assays (for example, 16S) are often not useful or correct,
# whereas species-level assignments from others (for example, 18S) are good and should be kept.
# Use the exact 'assay_name' value as found in your analysisMetadata sheets (cell D3),
# which also appears in the 'assay_name' column of the intermediate occurrence.csv
params['user_defined_assays_to_skip_species'] = [
    'ssu16sv4v5-emp',  # Example, replace with your actual 16S assay names
    # 'another_16s_assay_name_if_any',
]
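
Before loading anything, it may be worth checking that the configuration is complete and that every path points to a real file. The sketch below is not part of edna2obis; `check_params` is a hypothetical helper that only inspects the keys described in the table above, and the demo dictionary uses made-up values.

```python
import os


def check_params(params):
    """Return a list of human-readable problems found in a params dictionary."""
    problems = []
    required_keys = ["sampleMetadata", "experimentRunMetadata", "projectMetadata",
                     "excel_file", "FAIRe_NOAA_checklist", "datafiles"]
    for key in required_keys:
        if key not in params:
            problems.append(f"missing parameter: {key}")

    # Both Excel files must exist on disk.
    for key in ("excel_file", "FAIRe_NOAA_checklist"):
        path = params.get(key)
        if path is not None and not os.path.isfile(path):
            problems.append(f"{key} not found: {path}")

    # Each analysis run needs a taxonomy file and an occurrence file.
    for run_name, files in params.get("datafiles", {}).items():
        for file_key in ("taxonomy_file", "occurrence_file"):
            path = files.get(file_key)
            if path is None:
                problems.append(f"{run_name}: {file_key} not specified")
            elif not os.path.isfile(path):
                problems.append(f"{run_name}: {file_key} not found: {path}")

    if params.get("taxonomic_api_source") not in ("WoRMS", "GBIF"):
        problems.append("taxonomic_api_source should be 'WoRMS' or 'GBIF'")
    return problems


# Made-up, deliberately incomplete configuration to show the kind of messages produced.
demo_params = {
    "sampleMetadata": "sampleMetadata",
    "datafiles": {"run1": {"taxonomy_file": "no_such_taxonomy_file_xyz.tsv"}},
}
for problem in check_params(demo_params):
    print(problem)
```

In the notebook itself, calling `check_params(params)` right after the configuration cell would list any problems in one place before the loading cells run.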


## Load data


### Load project, sample, experimentRun, and analysis data from the FAIRe Excel file

projectMetadata, sampleMetadata, and experimentRunMetadata can be loaded normally, but we dynamically load the analysisMetadata sheet(s). The user may have any number of analysisMetadata sheets in their submission, and the cell below will detect each one automatically and load their data. 

In [4]:
# Discover all sheets in the Excel file
excel = pd.ExcelFile(params['excel_file'])
all_sheets = excel.sheet_names

# Find analysis metadata sheets
analysis_sheets = [sheet for sheet in all_sheets if sheet.startswith('analysisMetadata')]
print(f"Discovered {len(analysis_sheets)} analysis metadata sheets:")
for sheet_name_iter in analysis_sheets:
    print(f"  - {sheet_name_iter}")

# Load the main data sheets (projectMetadata, sampleMetadata, experimentRunMetadata)
data = pd.read_excel(
    params['excel_file'],
    [params['projectMetadata'], params['sampleMetadata'], params['experimentRunMetadata']],
    index_col=None, na_values=[""], comment="#"
)

# Load all analysis metadata sheets
# This dictionary stores the actual DataFrames from each analysisMetadata sheet.
analysis_data_by_assay = {}

for sheet_name_iter in analysis_sheets:
    # Load the sheet
    analysis_df = pd.read_excel(params['excel_file'], sheet_name_iter)
    
    # Get assay_name and analysis_run_name from specific cells (D3 and D4 in FAIRe template)
    # Ensure these are strings for reliable dictionary keys.
    assay_name = str(analysis_df.iloc[1, 3])        # Corresponds to Excel cell D3
    analysis_run_name = str(analysis_df.iloc[2, 3]) # Corresponds to Excel cell D4
    
    print(f"  - Processing sheet '{sheet_name_iter}': Found assay '{assay_name}' with analysis run '{analysis_run_name}'")
    
    # Store the analysis DataFrame in analysis_data_by_assay, organized by assay_name then analysis_run_name
    if assay_name not in analysis_data_by_assay:
        analysis_data_by_assay[assay_name] = {}
    analysis_data_by_assay[assay_name][analysis_run_name] = analysis_df

# Add the structured analysis data to the main 'data' dictionary
data['analysis_data_by_assay'] = analysis_data_by_assay

# For backward compatibility or general reference, store the DataFrame of the first analysis sheet found
if analysis_sheets:
    data['analysisMetadata'] = pd.read_excel(params['excel_file'], analysis_sheets[0])
else:
    print("Warning: No analysis metadata sheets found! 'data['analysisMetadata']' will not be populated.")

# Print summary of analyses by assay from the loaded metadata sheets
print("\nSummary of analyses by assay (from analysisMetadata sheets):")
for assay, analyses_dict in analysis_data_by_assay.items():
    print(f"  - Assay '{assay}': {len(analyses_dict)} analysis run(s)")
    for run_name_key in analyses_dict.keys():
        print(f"    - {run_name_key}")

# Verify the user-defined params['datafiles'] set in the configuration cell
print("\nUser-defined params['datafiles'] content (from Cell 8):")
if 'datafiles' in params and params['datafiles']:
    for run_name, files_dict in params['datafiles'].items():
        print(f"  Analysis Run Name: '{run_name}'")
        print(f"    Taxonomy File: {files_dict.get('taxonomy_file', 'Not specified')}")
        print(f"    Occurrence File: {files_dict.get('occurrence_file', 'Not specified')}")
        # Check if this run_name from params matches one found in the Excel sheets
        found_in_excel = any(run_name in an_dict for an_dict in analysis_data_by_assay.values())
        if not found_in_excel:
            print(f"    Warning: Analysis run name '{run_name}' from params['datafiles'] was not found as an analysis_run_name in any analysisMetadata sheet.")
else:
    print("  params['datafiles'] is empty or not defined in Cell 8.")

Discovered 3 analysis metadata sheets:
  - analysisMetadata_gomecc4_16s_p1
  - analysisMetadata_gomecc4_16s_p3
  - analysisMetadata_gomecc4_18s_p1
  - Processing sheet 'analysisMetadata_gomecc4_16s_p1': Found assay 'ssu16sv4v5-emp' with analysis run 'gomecc4_16s_p1-2_v2024.10_241122'
  - Processing sheet 'analysisMetadata_gomecc4_16s_p3': Found assay 'ssu16sv4v5-emp' with analysis run 'gomecc4_16s_p3-6_v2024.10_241122'
  - Processing sheet 'analysisMetadata_gomecc4_18s_p1': Found assay 'ssu18sv9-emp' with analysis run 'gomecc4_18s_p1-6_v2024.10_241122'

Summary of analyses by assay (from analysisMetadata sheets):
  - Assay 'ssu16sv4v5-emp': 2 analysis run(s)
    - gomecc4_16s_p1-2_v2024.10_241122
    - gomecc4_16s_p3-6_v2024.10_241122
  - Assay 'ssu18sv9-emp': 1 analysis run(s)
    - gomecc4_18s_p1-6_v2024.10_241122

User-defined params['datafiles'] content (from Cell 8):
  Analysis Run Name: 'gomecc4_18s_p1-6_v2024.10_241122'
    Taxonomy File: ../raw-v3/asvTaxaFeatures_gomecc4_18s_p1
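
The loading cell reads `assay_name` and `analysis_run_name` with `iloc[1, 3]` and `iloc[2, 3]`, described as Excel cells D3 and D4. That mapping holds because `pd.read_excel` uses the first sheet row as the header by default, so data rows start at Excel row 2. The sketch below makes the arithmetic explicit; `excel_cell_to_iloc` is a hypothetical helper, not something edna2obis provides.

```python
import re


def excel_cell_to_iloc(cell, header_rows=1):
    """Convert an Excel reference such as 'D3' to a zero-based (row, col) pair for DataFrame.iloc.

    header_rows is the number of leading sheet rows consumed as the header.
    """
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", cell)
    if match is None:
        raise ValueError(f"Not a valid cell reference: {cell!r}")
    letters, row_number = match.group(1).upper(), int(match.group(2))

    # Column letters are base-26 digits: A=1 ... Z=26, AA=27, ...
    col = 0
    for letter in letters:
        col = col * 26 + (ord(letter) - ord("A") + 1)

    row = row_number - 1 - header_rows
    if row < 0:
        raise ValueError(f"{cell} falls inside the header rows")
    return row, col - 1


print(excel_cell_to_iloc("D3"))  # (1, 3)
print(excel_cell_to_iloc("D4"))  # (2, 3)
```

If a template revision moves these fields, or a sheet is read with a different `header` setting, the positional indices in the loading cell would need to change accordingly.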

In [5]:
#rename keys in data dictionary to a general term
data['sampleMetadata'] = data.pop(params['sampleMetadata'])
data['experimentRunMetadata'] = data.pop(params['experimentRunMetadata'])
# The line below is already done in Cell 4. It is treated differently because the exact sheet name is not 'analysisMetadata'
# data['analysisMetadata'] = data.pop(params['analysisMetadata'])
data['projectMetadata'] = data.pop(params['projectMetadata'])

#### sampleMetadata 
Contextual data about the samples collected, such as when and where each sample was collected, what kind of sample it is, and the properties of the environment or experimental condition from which it was taken. Each row is a distinct sample, or Event. Most of this information is recorded during sample collection. This sheet contains terms from the FAIRe NOAA data template.

In [6]:
data['sampleMetadata'].head()

Unnamed: 0,samp_name,samp_category,neg_cont_type,pos_cont_type,materialSampleID,sample_derived_from,sample_composed_of,rel_cont_id,biological_rep_relation,decimalLongitude,decimalLatitude,verbatimLongitude,verbatimLatitude,verbatimCoordinateSystem,verbatimSRS,geo_loc_name,eventDate,eventDurationValue,verbatimEventDate,verbatimEventTime,env_broad_scale,env_local_scale,env_medium,habitat_natural_artificial_0_1,samp_collect_method,...,phosphate,phosphate_unit,pressure,pressure_unit,silicate,silicate_unit,tot_alkalinity,tot_alkalinity_unit,transmittance,transmittance_unit,serial_number,line_id,station_id,ctd_cast_number,ctd_bottle_number,replicate_number,extract_id,extract_plate,extract_well_number,extract_well_position,biosample_accession,organism,samp_collect_notes,dna_yield,dna_yield_unit
0,GOMECC4_27N_Sta1_Deep_A,sample,,,GOMECC4_27N_Sta1_Deep,,,,,-79.618,26.997,,,,WGS84,"USA: Atlantic Ocean, east of Florida (27 N)",2021-09-14T11:00-04:00,,,,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],0,https://zenodo.org/records/14224755 (v1.1.0) protocol_sampling_sterivex_dry.md,...,1.94489,µmol/kg,623,dbar,20.3569,µmol/kg,2318.9,µmol/kg,4.7221,,GOMECC4_001,27N,Sta1,not provided,3,A,Plate4_52,GOMECC2021_Plate4,52,D7,SAMN37516091,seawater metagenome,DCM = deep chlorophyl max.,12.057,ng
1,GOMECC4_27N_Sta1_Deep_B,sample,,,GOMECC4_27N_Sta1_Deep,,,,,-79.618,26.997,,,,WGS84,"USA: Atlantic Ocean, east of Florida (27 N)",2021-09-14T11:00-04:00,,,,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],0,https://zenodo.org/records/14224755 (v1.1.0) protocol_sampling_sterivex_dry.md,...,1.94489,µmol/kg,623,dbar,20.3569,µmol/kg,2318.9,µmol/kg,4.7221,,GOMECC4_002,27N,Sta1,not provided,3,B,Plate4_60,GOMECC2021_Plate4,60,D8,SAMN37516092,seawater metagenome,DCM was around 80 m and not well defined.,17.115,ng
2,GOMECC4_27N_Sta1_Deep_C,sample,,,GOMECC4_27N_Sta1_Deep,,,,,-79.618,26.997,,,,WGS84,"USA: Atlantic Ocean, east of Florida (27 N)",2021-09-14T11:00-04:00,,,,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],0,https://zenodo.org/records/14224755 (v1.1.0) protocol_sampling_sterivex_dry.md,...,1.94489,µmol/kg,623,dbar,20.3569,µmol/kg,2318.9,µmol/kg,4.7221,,GOMECC4_003,27N,Sta1,not provided,3,C,Plate4_62,GOMECC2021_Plate4,62,F8,SAMN37516093,seawater metagenome,Surface CTD bottles did not fire correctly; hand niskin bottle used for the surface cast. PM cast.,10.8345,ng
3,GOMECC4_27N_Sta1_DCM_A,sample,,,GOMECC4_27N_Sta1_DCM,,,,,-79.618,26.997,,,,WGS84,"USA: Atlantic Ocean, east of Florida (27 N)",2021-09-14T11:00-04:00,,,,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],0,https://zenodo.org/records/14224755 (v1.1.0) protocol_sampling_sterivex_dry.md,...,0.0517,µmol/kg,49,dbar,1.05635,µmol/kg,2371.0,µmol/kg,4.665,,GOMECC4_004,27N,Sta1,not provided,14,A,Plate4_53,GOMECC2021_Plate4,53,E7,SAMN37516094,seawater metagenome,Only enough water for 2 surface replicates.,223.5,ng
4,GOMECC4_27N_Sta1_DCM_B,sample,,,GOMECC4_27N_Sta1_DCM,,,,,-79.618,26.997,,,,WGS84,"USA: Atlantic Ocean, east of Florida (27 N)",2021-09-14T11:00-04:00,,,,marine biome [ENVO:00000447],marine photic zone [ENVO:00000209],sea water [ENVO:00002149],0,https://zenodo.org/records/14224755 (v1.1.0) protocol_sampling_sterivex_dry.md,...,0.0517,µmol/kg,49,dbar,1.05635,µmol/kg,2371.0,µmol/kg,4.665,,GOMECC4_005,27N,Sta1,not provided,14,B,Plate4_46,GOMECC2021_Plate4,46,F6,SAMN37516095,seawater metagenome,,103.26,ng


#### experimentRunMetadata  
Contextual data about how the samples were prepared for sequencing, including how they were extracted, which amplicon was targeted, and how they were sequenced. Each row is a separate sequencing library preparation, distinguished by a unique lib_id. **TODO: MIGHT NEED HELP WITH THIS DESCRIPTION**
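
Because each row is expected to carry a unique `lib_id`, a quick consistency check can catch copy-paste errors early. The sketch below uses toy DataFrames in place of `data['experimentRunMetadata']` and `data['sampleMetadata']`; the helper `check_experiment_runs` is hypothetical, and treating `samp_name` as the link between the two sheets is an assumption based on both sheets having that column.

```python
import pandas as pd

# Toy stand-ins for data['sampleMetadata'] and data['experimentRunMetadata'].
sample_meta = pd.DataFrame({"samp_name": ["S1", "S2"]})
run_meta = pd.DataFrame({
    "samp_name": ["S1", "S2", "S3"],
    "assay_name": ["ssu16sv4v5-emp"] * 3,
    "lib_id": ["lib_A", "lib_B", "lib_B"],
})


def check_experiment_runs(run_df, sample_df):
    """Return duplicated lib_id values and samp_name values absent from the sample sheet."""
    is_duplicated = run_df["lib_id"].duplicated(keep=False)
    duplicated_ids = run_df.loc[is_duplicated, "lib_id"].unique().tolist()
    known_samples = set(sample_df["samp_name"])
    orphans = sorted(set(run_df["samp_name"]) - known_samples)
    return duplicated_ids, orphans


dupes, orphans = check_experiment_runs(run_meta, sample_meta)
print(f"Duplicated lib_id values: {dupes}")
print(f"samp_name values missing from sampleMetadata: {orphans}")
```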

In [7]:
data['experimentRunMetadata'].head(2)

Unnamed: 0,samp_name,assay_name,pcr_plate_id,lib_id,seq_run_id,lib_conc,lib_conc_unit,lib_conc_meth,phix_perc,mid_forward,mid_reverse,filename,filename2,checksum_filename,checksum_filename2,associatedSequences,input_read_count
0,GOMECC4_NegativeControl_1,ssu16sv4v5-emp,not applicable,GOMECC16S_Neg1,20220613_Amplicon_PE250,,,,,TAGCAGCT,CGTCGCTA,GOMECC16S_Neg1_S499_L001_R1_001.fastq.gz,GOMECC16S_Neg1_S499_L001_R2_001.fastq.gz,,,https://www.ncbi.nlm.nih.gov/sra/SRR26148505 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516589 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,29319
1,GOMECC4_NegativeControl_2,ssu16sv4v5-emp,not applicable,GOMECC16S_Neg2,20220613_Amplicon_PE250,,,,,TAGCAGCT,CTAGAGCT,GOMECC16S_Neg2_S500_L001_R1_001.fastq.gz,GOMECC16S_Neg2_S500_L001_R2_001.fastq.gz,,,https://www.ncbi.nlm.nih.gov/sra/SRR26148503 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516590 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,30829


### Load ASV data  
Each analysis run has two ASV data files, a taxonomy file and an occurrence table, each with one row per unique amplicon sequence variant (ASV). Together they contain the ASV DNA sequence, a unique hash identifier, the taxonomic assignment for each ASV, the confidence given to that assignment by the naive Bayes classifier, and the number of reads observed in each sample.

This file is created automatically with [Tourmaline v.2023.5+](https://github.com/aomlomics/tourmaline), and is found in `01-taxonomy/asv_taxa_sample_table.tsv`. 

| column name    | definition                                                                                                                                                                                                                                                                                                                                                                                              |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| featureid      | A hash of the ASV sequence, used as a unique identifier for the ASV.                                                                                                                                                                                                                                                                                                                                    |
| taxonomy       | The full taxonomy assigned to an ASV sequence. This string could be formatted in very different ways depending on the reference database used during classification, however it should always be in reverse rank order separated by ;. We provide examples for how to process results from a Silva classifier and the PR2 18S classifier. For other taxonomy formats, the code will need to be adapted. |
| Confidence     | This is the confidence score assigned the taxonomic classification with a naive-bayes classifier.                                                                                                                                                                                                                                                                                                       |
| sample columns | The next columns each represent a sample (or eventID), and the number of reads for that ASV observed in the sample.                                                                                                                                                                                                                                                                                     |
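
Since the `taxonomy` column is a single delimited string, most downstream steps start by splitting it into rank labels. The sketch below shows that first step only. The function names and the example string are hypothetical, and it assumes the string lists the broadest rank first; reverse the list if your classifier writes it the other way. Real Silva or PR2 strings may also carry rank prefixes that need extra handling.

```python
def split_taxonomy(taxonomy, separator=";"):
    """Split a delimited taxonomy string into a list of cleaned, non-empty rank labels."""
    return [part.strip() for part in taxonomy.split(separator) if part.strip()]


def most_specific_name(taxonomy, separator=";"):
    """Return the last non-empty label in the string, or None if there is none."""
    parts = split_taxonomy(taxonomy, separator)
    return parts[-1] if parts else None


# Hypothetical taxonomy string, written broadest rank first.
example = "Eukaryota; Alveolata; Dinoflagellata; Dinophyceae;"
print(split_taxonomy(example))
print(most_specific_name(example))
```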

In [8]:
# Read in ASV tables for each analysis run.
# We now have separate taxonomy and occurrence files per analysis_run_name.

raw_data_tables = {} # This will store DataFrames for taxonomy and occurrences for each run

print("Loading raw data files based on params['datafiles']:")
if 'datafiles' in params and params['datafiles']:
    for analysis_run_name, file_paths in params['datafiles'].items():
        print(f"  Processing analysis run: {analysis_run_name}")
        raw_data_tables[analysis_run_name] = {}
        
        # Load taxonomy file
        if 'taxonomy_file' in file_paths:
            tax_path = file_paths['taxonomy_file']
            try:
                raw_data_tables[analysis_run_name]['taxonomy'] = pd.read_table(tax_path, sep='\t', low_memory=False)
                print(f"    Successfully loaded taxonomy file: {tax_path}")
                # Optional: print shape or head if useful for verification
                # print(f"      Shape: {raw_data_tables[analysis_run_name]['taxonomy'].shape}")
            except FileNotFoundError:
                print(f"    ERROR: Taxonomy file not found at {tax_path}")
            except Exception as e:
                print(f"    ERROR: Could not load taxonomy file {tax_path}. Error: {e}")
        else:
            print(f"    Warning: Taxonomy file path not specified for {analysis_run_name}")
            
        # Load occurrence file
        if 'occurrence_file' in file_paths:
            occ_path = file_paths['occurrence_file']
            try:
                # Row 1 is a comment, Row 2 is header, Data starts Row 3.
                # skiprows=1 means skip the first row (the comment).
                # header=0 (after skipping 1 row) means use the NEW first row (original Row 2) as headers.
                df_occ = pd.read_table(occ_path, 
                                       sep='\t', 
                                       skiprows=1,  # Skip the comment line (original Row 1)
                                       header=0,    # Use the next line (original Row 2) as column headers
                                       low_memory=False)

                # It's common for TSV tools to export the first column header with a '#' prefix, e.g., "#OTU ID".
                # If so, remove the leading '#'.
                if df_occ.columns[0].startswith('#'):
                    df_occ.rename(columns={df_occ.columns[0]: df_occ.columns[0][1:]}, inplace=True)
                
                raw_data_tables[analysis_run_name]['occurrence'] = df_occ
                print(f"    Successfully loaded occurrence file: {occ_path}")
                print(f"      Shape after loading: {raw_data_tables[analysis_run_name]['occurrence'].shape}")
                print(f"      Column names (first 5): {raw_data_tables[analysis_run_name]['occurrence'].columns.tolist()[:5]}")
            except FileNotFoundError:
                print(f"    ERROR: Occurrence file not found at {occ_path}")
            except Exception as e:
                print(f"    ERROR: Could not load occurrence file {occ_path}. Error: {e}")
        else:
            print(f"    Warning: Occurrence file path not specified for {analysis_run_name}")

# Downstream code will need to be adapted to use 'raw_data_tables'.
# The old 'asv_tables' variable, which assumed one merged table per "gene",
# is now replaced by 'raw_data_tables' which holds separate 'taxonomy' and 'occurrence'
# DataFrames for each 'analysis_run_name'.

# For example, to access the taxonomy DataFrame for 'gomecc4_18s_p1-6_v2024.10_241122':
# raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']['taxonomy']
# And its occurrence DataFrame:
# raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']['occurrence']


Loading raw data files based on params['datafiles']:
  Processing analysis run: gomecc4_18s_p1-6_v2024.10_241122
    Successfully loaded taxonomy file: ../raw-v3/asvTaxaFeatures_gomecc4_18s_p1-6_v2024.10_241122.tsv
    Successfully loaded occurrence file: ../raw-v3/table_gomecc4_18s_p1-6_v2024.10_241122.tsv
      Shape after loading: (24473, 501)
      Column names (first 5): ['OTU ID', 'GOMECC4_27N_Sta1_DCM_A', 'GOMECC4_27N_Sta1_DCM_B', 'GOMECC4_27N_Sta1_DCM_C', 'GOMECC4_27N_Sta1_Deep_A']
  Processing analysis run: gomecc4_16s_p3-6_v2024.10_241122
    Successfully loaded taxonomy file: ../raw-v3/asvTaxaFeatures_gomecc4_16s_p3-6_v2024.10_241122.tsv
    Successfully loaded occurrence file: ../raw-v3/table_gomecc4_16s_p3-6_v2024.10_241122.tsv
      Shape after loading: (49523, 312)
      Column names (first 5): ['OTU ID', 'GOMECC4_27N_Sta1_DCM_A', 'GOMECC4_27N_Sta1_DCM_B', 'GOMECC4_27N_Sta1_DCM_C', 'GOMECC4_27N_Sta1_Deep_A']
  Processing analysis run: gomecc4_16s_p1-2_v2024.10_241122
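
The occurrence loader above depends on a specific file layout: a comment on line 1, the header on line 2, and a first column header that may start with `#`. The sketch below reproduces that handling on a tiny in-memory table so the effect of `skiprows=1` and `header=0` can be seen without real data. The toy content, including the wording of the comment line, is made up for illustration.

```python
import io

import pandas as pd

# Toy table in the layout the loading cell expects:
# line 1 is a comment, line 2 is the header, data starts on line 3.
toy_tsv = (
    "# Constructed from biom file\n"
    "#OTU ID\tsample_A\tsample_B\n"
    "asv1\t10.0\t0.0\n"
    "asv2\t0.0\t5.0\n"
)

df_occ = pd.read_table(io.StringIO(toy_tsv), sep="\t", skiprows=1, header=0)

# Strip the leading '#' that some exports put on the first column header.
first_col = df_occ.columns[0]
if first_col.startswith("#"):
    df_occ = df_occ.rename(columns={first_col: first_col[1:]})

print(df_occ.columns.tolist())
print(df_occ.shape)
```

If your occurrence file has no leading comment line, `skiprows=1` would discard the real header, so it is worth opening the file in a text editor once to confirm the layout.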
   

In [9]:
# Inspect the keys of the newly loaded raw_data_tables
# This will show the analysis_run_names for which data was loaded.
if 'raw_data_tables' in locals() and raw_data_tables: # Check if it exists and is not empty
    print("Analysis runs for which data was loaded:")
    for run_name in raw_data_tables.keys():
        print(f"  - {run_name}")
        if 'taxonomy' in raw_data_tables[run_name]:
            print(f"    - Taxonomy table shape: {raw_data_tables[run_name]['taxonomy'].shape}")
        else:
            print(f"    - Taxonomy table: Not loaded or error during load.")
        if 'occurrence' in raw_data_tables[run_name]:
            print(f"    - Occurrence table shape: {raw_data_tables[run_name]['occurrence'].shape}")
        else:
            print(f"    - Occurrence table: Not loaded or error during load.")
else:
    print("raw_data_tables dictionary is not defined or is empty.")

# Example of how to access a specific table (optional, for your direct inspection):
# if 'gomecc4_18s_p1-6_v2024.10_241122' in raw_data_tables:
#     print("\nHead of taxonomy table for 'gomecc4_18s_p1-6_v2024.10_241122':")
#     if 'taxonomy' in raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']:
#         print(raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']['taxonomy'].head(2))
#     print("\nHead of occurrence table for 'gomecc4_18s_p1-6_v2024.10_241122':")
#     if 'occurrence' in raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']:
#         print(raw_data_tables['gomecc4_18s_p1-6_v2024.10_241122']['occurrence'].head(2))

Analysis runs for which data was loaded:
  - gomecc4_18s_p1-6_v2024.10_241122
    - Taxonomy table shape: (24473, 14)
    - Occurrence table shape: (24473, 501)
  - gomecc4_16s_p3-6_v2024.10_241122
    - Taxonomy table shape: (49523, 12)
    - Occurrence table shape: (49523, 312)
  - gomecc4_16s_p1-2_v2024.10_241122
    - Taxonomy table shape: (19540, 12)
    - Occurrence table shape: (19540, 198)


In [10]:
# Choose an analysis run name that you have defined in params['datafiles']
# For example, the first one:
analysis_to_inspect = list(params['datafiles'].keys())[0] if params['datafiles'] else None

if analysis_to_inspect and analysis_to_inspect in raw_data_tables:
    if 'occurrence' in raw_data_tables[analysis_to_inspect]:
        print(f"Head of OCCURRENCE table for '{analysis_to_inspect}' (first 20 columns):")
        display(raw_data_tables[analysis_to_inspect]['occurrence'].iloc[:, 0:20].head())
    else:
        print(f"Occurrence table not found for '{analysis_to_inspect}'.")
else:
    if not analysis_to_inspect:
        print("params['datafiles'] is empty. No analysis run to inspect.")
    else:
        print(f"Analysis run '{analysis_to_inspect}' not found in loaded raw_data_tables.")

Head of OCCURRENCE table for 'gomecc4_18s_p1-6_v2024.10_241122' (first 20 columns):


Unnamed: 0,OTU ID,GOMECC4_27N_Sta1_DCM_A,GOMECC4_27N_Sta1_DCM_B,GOMECC4_27N_Sta1_DCM_C,GOMECC4_27N_Sta1_Deep_A,GOMECC4_27N_Sta1_Deep_B,GOMECC4_27N_Sta1_Deep_C,GOMECC4_27N_Sta1_Surface_A,GOMECC4_27N_Sta1_Surface_B,GOMECC4_27N_Sta4_DCM_A,GOMECC4_27N_Sta4_DCM_B,GOMECC4_27N_Sta4_DCM_C,GOMECC4_27N_Sta4_Deep_A,GOMECC4_27N_Sta4_Deep_B,GOMECC4_27N_Sta4_Deep_C,GOMECC4_27N_Sta4_Surface_A,GOMECC4_27N_Sta4_Surface_B,GOMECC4_27N_Sta4_Surface_C,GOMECC4_27N_Sta6_DCM_A,GOMECC4_27N_Sta6_DCM_B
0,36aa75f9b28f5f831c2d631ba65c2bcb,1518.0,0.0,0.0,6.0,0.0,0.0,0.0,4268.0,2002.0,0.0,14.0,0.0,0.0,0.0,9532.0,1930.0,2037.0,0.0,0.0
1,4e38e8ced9070952b314e1880bede1ca,963.0,316.0,543.0,19.0,10.0,0.0,0.0,0.0,613.0,561.0,434.0,0.0,395.0,297.0,76.0,915.0,1447.0,140.0,0.0
2,5d4df37251121c08397c6fbc27b06175,0.0,4.0,0.0,12.0,5.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,300.0,0.0,864.0,409.0
3,f863f671a575c6ab587e8de0190d3335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,92.0,2672.0,2424.0,2605.0,1918.0
4,2a31e5c01634165da99e7381279baa75,1165.0,2267.0,2206.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1103.0,157.0,0.0,104.0,941.0


### Drop samples with unwanted sample types  

eDNA projects often sequence control samples alongside survey samples. These can include filtered distilled water, pure water used in place of DNA in a PCR or DNA extraction protocol, or a mock community of known microbial taxa. Controls help identify and mitigate contaminant DNA in our samples, but they are not useful for biodiversity platforms like OBIS. You can choose which `samp_category` values to drop with the `skip_sample_types` parameter.

In [11]:
# Build a boolean mask of control/blank samples, then collect their sample names
samps_to_remove = data['sampleMetadata']['samp_category'].isin(params['skip_sample_types'])
samples_to_drop = data['sampleMetadata']['samp_name'][samps_to_remove].astype(str).str.strip().tolist() 
print(f"DEBUG in Cell 23: samples_to_drop (first 5): {samples_to_drop[:5]}") 

DEBUG in Cell 23: samples_to_drop (first 5): ['GOMECC4_Blank_DIW_20210915_A', 'GOMECC4_Blank_DIW_20210915_B', 'GOMECC4_Blank_DIW_20210915_C', 'GOMECC4_Blank_DIW_20210930_A', 'GOMECC4_Blank_DIW_20210930_B']


You can view the list of samples to be dropped below.

In [12]:
samples_to_drop

['GOMECC4_Blank_DIW_20210915_A',
 'GOMECC4_Blank_DIW_20210915_B',
 'GOMECC4_Blank_DIW_20210915_C',
 'GOMECC4_Blank_DIW_20210930_A',
 'GOMECC4_Blank_DIW_20210930_B',
 'GOMECC4_Blank_DIW_20210930_C',
 'GOMECC4_Blank_DIW_20211011_A',
 'GOMECC4_Blank_DIW_20211011_B',
 'GOMECC4_Blank_DIW_20211011_C',
 'GOMECC4_Blank_DIW_20211016_A',
 'GOMECC4_Blank_DIW_20211016_B',
 'GOMECC4_Blank_DIW_20211016_C',
 'GOMECC4_ExtractionBlank_1',
 'GOMECC4_ExtractionBlank_11',
 'GOMECC4_ExtractionBlank_12',
 'GOMECC4_ExtractionBlank_3',
 'GOMECC4_ExtractionBlank_5',
 'GOMECC4_ExtractionBlank_7',
 'GOMECC4_ExtractionBlank_9',
 'GOMECC4_MSUControl_1',
 'GOMECC4_MSUControl_2',
 'GOMECC4_MSUControl_3',
 'GOMECC4_MSUControl_4',
 'GOMECC4_MSUControl_5',
 'GOMECC4_MSUControl_6',
 'GOMECC4_MSUControl_7',
 'GOMECC4_NegativeControl_1',
 'GOMECC4_NegativeControl_2',
 'GOMECC4_PositiveControl_1',
 'GOMECC4_PositiveControl_2']

In [13]:
# remove samples from sampleMetadata sheet
data['sampleMetadata'] = data['sampleMetadata'][~samps_to_remove]

In [14]:
# check the samp_category values left in your sampleMetadata. We only want 'sample' (indicating it is not a control or blank).
data['sampleMetadata']['samp_category'].unique()

array(['sample'], dtype=object)

In [15]:
# remove samples from experimentRunMetadata
prep_samps_to_remove = data['experimentRunMetadata']['samp_name'].isin(samples_to_drop)
data['experimentRunMetadata'] = data['experimentRunMetadata'][~prep_samps_to_remove]

##### Drop unwanted samples from ASV files


In [16]:
# This cell REMOVES blank/control samples from the ALREADY LOADED occurrence tables

print("Attempting to remove blank/control samples from loaded occurrence tables...")
if 'raw_data_tables' in locals() and raw_data_tables:
    if 'samples_to_drop' in locals() and samples_to_drop:
        for analysis_run_name, tables_dict in raw_data_tables.items():
            if 'occurrence' in tables_dict and not tables_dict['occurrence'].empty:
                occ_df = tables_dict['occurrence'] # Get a reference to the occurrence DataFrame
                original_cols_count = len(occ_df.columns)
                
                # Identify which of the samples_to_drop are actual columns in THIS occurrence table
                cols_to_remove_in_this_df = [col for col in samples_to_drop if col in occ_df.columns]
                
                if cols_to_remove_in_this_df:
                    # Perform the drop operation. This MODIFIES the DataFrame in raw_data_tables.
                    raw_data_tables[analysis_run_name]['occurrence'] = occ_df.drop(columns=cols_to_remove_in_this_df)
                    
                    print(f"  For analysis run '{analysis_run_name}':")
                    print(f"    Original columns: {original_cols_count}, Columns after removal: {len(raw_data_tables[analysis_run_name]['occurrence'].columns)}")
                    print(f"    Removed ({len(cols_to_remove_in_this_df)} total for this run): {cols_to_remove_in_this_df}") # Print all removed
                else:
                    print(f"  For analysis run '{analysis_run_name}': No specified blank/control samples found to remove in this table's columns.")
            else:
                print(f"  Skipping analysis run '{analysis_run_name}': No 'occurrence' table found or it is empty.")
        print("Finished blank/control sample removal process.")
    else:
        print("WARNING: 'samples_to_drop' list not found or empty. No columns removed from occurrence tables.")
else:
    print("WARNING: 'raw_data_tables' not found or empty. Cannot remove columns.")

# You can add a verification step here if you like, for example:
# target_run_to_check = 'gomecc4_18s_p1-6_v2024.10_241122' # Use one of your actual analysis run names
# if target_run_to_check in raw_data_tables and 'occurrence' in raw_data_tables[target_run_to_check]:
# print(f"\nColumns in occurrence table for '{target_run_to_check}' AFTER blank removal:")
# print(raw_data_tables[target_run_to_check]['occurrence'].columns.tolist()[:30])
# else:
# print(f"\nCould not verify columns for '{target_run_to_check}'.")

Attempting to remove blank/control samples from loaded occurrence tables...
  For analysis run 'gomecc4_18s_p1-6_v2024.10_241122':
    Original columns: 501, Columns after removal: 473
    Removed (28 total for this run): ['GOMECC4_Blank_DIW_20210915_A', 'GOMECC4_Blank_DIW_20210915_B', 'GOMECC4_Blank_DIW_20210915_C', 'GOMECC4_Blank_DIW_20210930_A', 'GOMECC4_Blank_DIW_20210930_B', 'GOMECC4_Blank_DIW_20210930_C', 'GOMECC4_Blank_DIW_20211011_A', 'GOMECC4_Blank_DIW_20211011_B', 'GOMECC4_Blank_DIW_20211011_C', 'GOMECC4_Blank_DIW_20211016_A', 'GOMECC4_Blank_DIW_20211016_B', 'GOMECC4_Blank_DIW_20211016_C', 'GOMECC4_ExtractionBlank_1', 'GOMECC4_ExtractionBlank_11', 'GOMECC4_ExtractionBlank_12', 'GOMECC4_ExtractionBlank_3', 'GOMECC4_ExtractionBlank_5', 'GOMECC4_ExtractionBlank_7', 'GOMECC4_ExtractionBlank_9', 'GOMECC4_MSUControl_1', 'GOMECC4_MSUControl_2', 'GOMECC4_MSUControl_3', 'GOMECC4_MSUControl_4', 'GOMECC4_MSUControl_5', 'GOMECC4_MSUControl_6', 'GOMECC4_MSUControl_7', 'GOMECC4_NegativeCon
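
One way to confirm the removal worked is to check that none of the names in `samples_to_drop` remain as columns in an occurrence table. The sketch below runs on a small made-up table. `find_remaining_controls` is a hypothetical helper written for this illustration and is not part of edna2obis.

```python
import pandas as pd

def find_remaining_controls(occ_df, control_names):
    """Return the control/blank sample names still present as columns."""
    # Strip whitespace so ' Blank_A' and 'Blank_A' count as the same name
    cols = {str(c).strip() for c in occ_df.columns}
    return [name for name in control_names if name in cols]

# Toy occurrence table: one feature ID column plus three sample columns (made-up values)
toy_occ = pd.DataFrame({
    "featureid": ["asv1", "asv2"],
    "Sta1_A": [10.0, 0.0],
    "Blank_DIW_A": [0.0, 3.0],
    "Sta1_B": [5.0, 7.0],
})
toy_controls = ["Blank_DIW_A", "ExtractionBlank_1"]

before = find_remaining_controls(toy_occ, toy_controls)   # controls present before cleaning
cleaned = toy_occ.drop(columns=before)                     # drop only the ones that exist
after = find_remaining_controls(cleaned, toy_controls)     # should be empty after cleaning
print(before, after)
```

With the real data, the same check could be run on each `raw_data_tables[run]['occurrence']` using `samples_to_drop`; an empty list means no control columns are left in that run.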

### Drop columns with all NAs  

If your project data file has columns with only NAs, this code will check for those, provide their column headers for verification, then remove them.

In [17]:
# Identifies columns that have ALL NA values in 'sampleMetadata' and 'experimentRunMetadata'.
# This is based on your original code, modified to exclude 'analysisMetadata'.
# TODO: remove empty rows from analysisMetadata. Inform user of rows / fields deleted 

dropped = pd.DataFrame()

# Only check 'sampleMetadata' and 'experimentRunMetadata'
for sheet_name in ['sampleMetadata', 'experimentRunMetadata']:
    # Safety check: ensure the sheet exists in data and is not empty
    if sheet_name in data and not data[sheet_name].empty:
        all_na_cols = data[sheet_name].columns[data[sheet_name].isnull().all(axis=0)]
        res = pd.Series(all_na_cols, name=sheet_name)
        dropped = pd.concat([dropped, res], axis=1)
    elif sheet_name not in data:
        print(f"FYI: Sheet '{sheet_name}' not found in 'data' dictionary. Cannot check for all-NA columns.")
    else: # In data but empty
        print(f"FYI: Sheet '{sheet_name}' is empty. Cannot check for all-NA columns.")
    

Which columns in each sheet have only NA values?

In [18]:
dropped

Unnamed: 0,sampleMetadata,experimentRunMetadata
0,neg_cont_type,lib_conc
1,pos_cont_type,lib_conc_unit
2,sample_derived_from,lib_conc_meth
3,sample_composed_of,phix_perc
4,rel_cont_id,checksum_filename
...,...,...
71,org_matter,
72,org_matter_unit,
73,org_nitro,
74,org_nitro_unit,


If you are fine with leaving these columns out, proceed:

In [19]:
# Drops all-NA columns from 'sampleMetadata' and 'experimentRunMetadata'
# based on the 'dropped' DataFrame.

# These are the sheets the earlier check may have recorded in the 'dropped' DataFrame
sheets_to_clean = ['sampleMetadata', 'experimentRunMetadata']

for sheet_name in sheets_to_clean:
    # Check if 'dropped' has a column for this sheet AND if that column lists any actual columns to drop
    if sheet_name in dropped.columns and not dropped[sheet_name].dropna().empty:
        cols_to_drop = list(dropped[sheet_name].dropna())
        
        # Ensure the target DataFrame exists in 'data'
        if sheet_name in data:
            print(f"Dropping from data['{sheet_name}']: {cols_to_drop}")
            data[sheet_name].drop(columns=cols_to_drop, inplace=True, errors='ignore')
        # else: # Optional: warning if data[sheet_name] is missing, but Cell 19 should prevent this
            # print(f"Warning: data['{sheet_name}'] not found, cannot drop columns.")

Dropping from data['sampleMetadata']: ['neg_cont_type', 'pos_cont_type', 'sample_derived_from', 'sample_composed_of', 'rel_cont_id', 'biological_rep_relation', 'verbatimLongitude', 'verbatimLatitude', 'verbatimCoordinateSystem', 'eventDurationValue', 'verbatimEventDate', 'verbatimEventTime', 'samp_store_method_additional', 'stationed_sample_dur', 'pump_flow_rate', 'pump_flow_rate_unit', 'prefilter_material', 'filter_diameter', 'filter_surface_area', 'prepped_samp_store_temp', 'prepped_samp_store_sol', 'prepped_samp_store_dur', 'prep_method_additional', 'date_ext', 'nucl_acid_ext_modify', 'dna_cleanup_0_1', 'dna_cleanup_method', 'concentration_method', 'ratioOfAbsorbance260_280', 'pool_dna_num', 'nucl_acid_ext_method_additional', 'samp_weather', 'elev', 'light_intensity', 'suspend_part_matter', 'tidal_stage', 'turbidity', 'water_current', 'solar_irradiance', 'wind_direction', 'wind_speed', 'diss_inorg_nitro', 'diss_inorg_nitro_unit', 'diss_org_carb', 'diss_org_carb_unit', 'diss_org_nitr

### Now let's drop rows with empty values from each analysisMetadata sheet

In [20]:
# Cell to identify and report rows with empty 'values' in analysisMetadata sheets

analysis_rows_to_drop_info = {} 
expected_value_col_name = 'values' # As per your Excel sheet (column D header)

print(f"Identifying rows with empty '{expected_value_col_name}' column in analysisMetadata sheets:")
if 'analysis_data_by_assay' in data and data['analysis_data_by_assay']:
    for assay_name, analyses_dict in data['analysis_data_by_assay'].items():
        if not isinstance(analyses_dict, dict): continue
        for run_name, analysis_df in analyses_dict.items():
            if not isinstance(analysis_df, pd.DataFrame) or analysis_df.empty: continue

            # Ensure the 'values' column exists
            if expected_value_col_name not in analysis_df.columns:
                print(f"  Warning: Column '{expected_value_col_name}' not found in Assay: '{assay_name}', Run: '{run_name}'. Skipping this sheet for row deletion.")
                continue
            
            # Identify rows where the 'values' column is NA
            # This will correctly ignore the 'assay_name' and 'analysis_run_name' rows
            # as their 'values' column is populated.
            empty_values_mask = analysis_df[expected_value_col_name].isnull()
            rows_to_drop_indices = analysis_df[empty_values_mask].index.tolist()

            if rows_to_drop_indices:
                print(f"  In analysis sheet - Assay: '{assay_name}', Run: '{run_name}':")
                # To provide more context, let's also show the 'term_name' for rows to be dropped
                terms_to_drop = analysis_df.loc[rows_to_drop_indices, 'term_name'].tolist() if 'term_name' in analysis_df.columns else ["N/A"]*len(rows_to_drop_indices)
                print(f"    Found {len(rows_to_drop_indices)} row(s) where '{expected_value_col_name}' is empty.")
                for i, idx in enumerate(rows_to_drop_indices):
                    print(f"      - Index: {idx}, Term Name: {terms_to_drop[i]}")
                analysis_rows_to_drop_info[(assay_name, run_name)] = rows_to_drop_indices

if not analysis_rows_to_drop_info:
    print(f"\nNo rows with an empty '{expected_value_col_name}' column found in any analysisMetadata sheet.")
else:
    print(f"\nSummary of rows with an empty '{expected_value_col_name}' column that can be dropped:")
    for (assay, run), indices in analysis_rows_to_drop_info.items():
        print(f"  - Assay: '{assay}', Run: '{run}' - {len(indices)} row(s) at Indices: {indices}")
    print("Proceed to the next cell to remove these rows if desired.")

Identifying rows with empty 'values' column in analysisMetadata sheets:
  In analysis sheet - Assay: 'ssu16sv4v5-emp', Run: 'gomecc4_16s_p1-2_v2024.10_241122':
    Found 36 row(s) where 'values' is empty.
      - Index: 30, Term Name: screen_contam_method
      - Index: 31, Term Name: screen_geograph_method
      - Index: 32, Term Name: screen_nontarget_method
      - Index: 33, Term Name: screen_other
      - Index: 36, Term Name: bioinfo_method_additional
      - Index: 37, Term Name: output_read_count
      - Index: 38, Term Name: output_otu_num
      - Index: 39, Term Name: otu_num_tax_assigned
      - Index: 40, Term Name: discard_untrimmed
      - Index: 41, Term Name: qiime2_version
      - Index: 42, Term Name: tourmaline_asv_method
      - Index: 43, Term Name: dada2_trunc_len_f
      - Index: 44, Term Name: dada2pe_trunc_len_r
      - Index: 45, Term Name: dada2_trim_left_f
      - Index: 46, Term Name: dada2pe_trim_left_r
      - Index: 47, Term Name: dada2_max_ee_f
      - 

### If you are okay with those rows being deleted, continue:

In [21]:
# Cell to perform the deletion of rows with empty 'values' from analysisMetadata sheets

if 'analysis_rows_to_drop_info' in locals() and analysis_rows_to_drop_info:
    print(f"\nRemoving identified rows with empty 'values' from analysisMetadata sheets...")
    for (assay_name, run_name), indices_to_drop in analysis_rows_to_drop_info.items():
        if indices_to_drop: 
            try:
                original_count = len(data['analysis_data_by_assay'][assay_name][run_name])
                data['analysis_data_by_assay'][assay_name][run_name].drop(index=indices_to_drop, inplace=True)
                new_count = len(data['analysis_data_by_assay'][assay_name][run_name])
                print(f"  For Assay: '{assay_name}', Run: '{run_name}': Removed {original_count - new_count} row(s).")
            except Exception as e:
                print(f"  An error occurred while dropping rows for Assay: '{assay_name}', Run: '{run_name}': {e}")
    print("\nFinished removing rows with empty 'values' from analysisMetadata sheets.")
    # del analysis_rows_to_drop_info # Optional: clear the info variable
else:
    print("\nNo rows with empty 'values' were previously identified for deletion from analysisMetadata sheets, or 'analysis_rows_to_drop_info' not found.")


Removing identified rows with empty 'values' from analysisMetadata sheets...
  For Assay: 'ssu16sv4v5-emp', Run: 'gomecc4_16s_p1-2_v2024.10_241122': Removed 36 row(s).
  For Assay: 'ssu16sv4v5-emp', Run: 'gomecc4_16s_p3-6_v2024.10_241122': Removed 36 row(s).
  For Assay: 'ssu18sv9-emp', Run: 'gomecc4_18s_p1-6_v2024.10_241122': Removed 37 row(s).

Finished removing rows with empty 'values' from analysisMetadata sheets.
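
For reference, the two-step approach above (identify, then delete) can be condensed for a single sheet with `DataFrame.dropna(subset=...)`. This is a minimal sketch on a made-up long-format sheet; the term names are borrowed from the output above, and the values are invented for illustration.

```python
import pandas as pd

# Toy long-format analysisMetadata sheet (made-up values)
toy_analysis = pd.DataFrame({
    "term_name": ["assay_name", "qiime2_version", "screen_other"],
    "values": ["ssu16sv4v5-emp", None, None],
})

# Keep only rows whose 'values' cell is filled in
toy_clean = toy_analysis.dropna(subset=["values"]).reset_index(drop=True)

# Report which terms were removed so the user can verify them
removed_terms = sorted(set(toy_analysis["term_name"]) - set(toy_clean["term_name"]))
print(removed_terms)
```

The notebook keeps the identification and deletion in separate cells so that the rows can be reviewed before anything is removed; the one-liner trades that review step for brevity.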


### Now let's check for columns with SOME missing values. OBIS wants those handled as well

The next cells list the columns that have missing values in some, but not all, rows. These should be filled in on the Excel sheet with the appropriate term ('not applicable', 'missing', or 'not collected'). Alternatively, you can drop a column if it is not needed for submission to OBIS.

In [22]:
# Identifies columns in 'sampleMetadata' and 'experimentRunMetadata'
# that have *some* (but not all) missing NA values.

some = pd.DataFrame()

# Sheets to check for columns with *some* NAs.
# We focus on the wide-format sheets where this is most informative.
sheets_to_examine_for_some_na = ['sampleMetadata', 'experimentRunMetadata']

for sheet_name in sheets_to_examine_for_some_na:
    if sheet_name in data and not data[sheet_name].empty:
        # Find columns with at least one NA using .any().
        # Columns that were entirely NA have already been dropped above,
        # so anything flagged here is only partially empty.
        cols_with_some_na = data[sheet_name].columns[data[sheet_name].isnull().any(axis=0)]
        
        res = pd.Series(cols_with_some_na.tolist(), name=sheet_name) # .tolist() for cleaner Series
        some = pd.concat([some, res], axis=1)

In [23]:
some

Unnamed: 0,sampleMetadata,experimentRunMetadata
0,ph_meth,
1,carbonate_unit,
2,pco2_unit,
3,samp_collect_notes,


Here I'm going to drop all the columns with some missing data, as I don't need them for submission to OBIS.

In [24]:
# Drops any column from 'sampleMetadata' and 'experimentRunMetadata'
# if that column contains ANY missing NA values. Lists the dropped columns.

sheets_to_aggressively_clean = ['sampleMetadata', 'experimentRunMetadata']

for sheet_name in sheets_to_aggressively_clean:
    if sheet_name in data and not data[sheet_name].empty: # Check if DataFrame exists and is not empty
        original_columns = data[sheet_name].columns.tolist() # Get column names before dropping
        
        data[sheet_name].dropna(axis=1, how='any', inplace=True)
        
        current_columns = data[sheet_name].columns.tolist() # Get column names after dropping
        dropped_column_names = [col for col in original_columns if col not in current_columns]
        
        if dropped_column_names:
            print(f"From data['{sheet_name}']: Dropped columns (had any NA): {dropped_column_names}")
        else: # Optional: if no columns were dropped
            print(f"No columns dropped from data['{sheet_name}'] by 'any NA' rule.")
    elif sheet_name not in data:
        print(f"FYI: Sheet '{sheet_name}' not found in 'data'. No columns dropped.")
    else: # Sheet is in data but empty
        print(f"FYI: Sheet '{sheet_name}' is empty. No columns dropped.")

From data['sampleMetadata']: Dropped columns (had any NA): ['ph_meth', 'carbonate_unit', 'pco2_unit', 'samp_collect_notes']
No columns dropped from data['experimentRunMetadata'] by 'any NA' rule.


### Load data dictionary Excel file 
This FAIRe NOAA Checklist Excel file also contains columns for mapping FAIRe fields to the appropriate Darwin Core terms which OBIS is expecting. Currently, we are only preparing an Occurrence core file and a DNA-derived extension file, with Event information in the Occurrence file. Future versions of this workflow will prepare an extendedMeasurementOrFact file as well.

In [25]:
dwc_data = {}
checklist_df = pd.DataFrame()

try:
    # Load the 'checklist' sheet from the FAIRe NOAA Checklist Excel file
    checklist_df = pd.read_excel(
        params['FAIRe_NOAA_checklist'], # Ensure this param is set in your params cell
        sheet_name='checklist',
        na_values=[""]
    )
except Exception as e:
    print(f"Error loading 'checklist' sheet: {e}")

# Define relevant column names from your checklist
col_faire_term = 'term_name'
col_output_spec = 'edna2obis_output_file'
col_dwc_mapping = 'dwc_term'

occurrence_maps = []
dna_derived_maps = []

# Process the checklist if it loaded successfully and has the required columns
if not checklist_df.empty and all(col in checklist_df.columns for col in [col_faire_term, col_output_spec, col_dwc_mapping]):
    for _, row in checklist_df.iterrows():
        faire_term = row[col_faire_term]
        output_file = str(row[col_output_spec]).lower() # Convert to string and lowercase
        dwc_term = row[col_dwc_mapping]

        # Add to lists if terms are valid
        if pd.notna(faire_term) and pd.notna(dwc_term) and str(faire_term).strip() and str(dwc_term).strip():
            if 'occurrence' in output_file:
                occurrence_maps.append({'DwC_term': dwc_term, 'FAIRe_term': faire_term})
            if 'dnaderived' in output_file:
                dna_derived_maps.append({'DwC_term': dwc_term, 'FAIRe_term': faire_term})
    
    # Create DataFrames, using DwC_term as index
    dwc_data['occurrence'] = pd.DataFrame(occurrence_maps).drop_duplicates().set_index('DwC_term') if occurrence_maps else \
                             pd.DataFrame(columns=['FAIRe_term']).set_index(pd.Index([], name='DwC_term'))
    
    dwc_data['dnaDerived'] = pd.DataFrame(dna_derived_maps).drop_duplicates().set_index('DwC_term') if dna_derived_maps else \
                             pd.DataFrame(columns=['FAIRe_term']).set_index(pd.Index([], name='DwC_term'))
else:
    # If checklist is empty or missing columns, create empty structures for dwc_data
    if checklist_df.empty and 'FAIRe_NOAA_checklist' in params: # Only report when a checklist path was actually configured
        print("Checklist DataFrame is empty or required columns are missing. Creating empty DwC mappings.")
    dwc_data['occurrence'] = pd.DataFrame(columns=['FAIRe_term']).set_index(pd.Index([], name='DwC_term'))
    dwc_data['dnaDerived'] = pd.DataFrame(columns=['FAIRe_term']).set_index(pd.Index([], name='DwC_term'))

# Print summary
print(f"dwc_data created. Occurrence mappings: {len(dwc_data['occurrence'])}, dnaDerived mappings: {len(dwc_data['dnaDerived'])}")

# The original notebook's Cell 26 was dwc_data['event'].head(). This will now be skipped or adapted.
# Original Cell 27 was dwc_data['occurrence'].head().
# Original Cell 28 was dwc_data['dna'].head() (now 'dnaDerived').

dwc_data created. Occurrence mappings: 26, dnaDerived mappings: 26


In [26]:
# Print the mapping for the Occurrence Core
# Keep in mind that some terms are hard-coded later in this workflow, or are derived from more than one FAIRe term.
# This means fewer fields may be listed below than the Occurrence Core will have upon completion of edna2obis.
dwc_data['occurrence']

DwC_term,FAIRe_term
recordedBy,recordedBy
datasetID,project_id
parentEventID,samp_name
materialSampleID,materialSampleID
decimalLongitude,decimalLongitude
decimalLatitude,decimalLatitude
geodeticDatum,verbatimSRS
locality,geo_loc_name
eventDate,eventDate
sampleSizeValue,samp_size


In [27]:
# Print the mapping for the DNA Derived Extension
# Keep in mind that some terms are hard-coded later in this workflow, or are derived from more than one FAIRe term.
# This means fewer fields may be listed below than the DNA Derived Extension will have upon completion of edna2obis.
dwc_data['dnaDerived']

DwC_term,FAIRe_term
env_broad_scale,env_broad_scale
env_local_scale,env_local_scale
env_medium,env_medium
samp_collect_device,samp_collect_device
samp_mat_process,samp_mat_process
size_frac,size_frac
samp_vol_we_dna_ext,samp_vol_we_dna_ext
nucl_acid_ext,nucl_acid_ext
concentration,concentration
concentrationUnit,concentration_unit


## Convert to Occurrence file
To link the DNA-derived extension metadata to our OBIS occurrence records, we have to use the Occurrence core. For this data set, a `parentEvent` is a filtered water sample that was DNA extracted, an `event` is a sequencing library from that DNA extraction, and an `occurrence` is an ASV observed within a library. We will produce an occurrence file and a DNA-derived data file. Future versions will generate a measurements file.   
**Define files**


### Build a combined Occurrence Core for all Analyses
This code creates an Occurrence Core for each analysis in the submission, then combines them into one Occurrence Core. The operation is complex: it merges dataframes together, adds the missing fields that OBIS and GBIF expect, and parses the raw taxonomy data.
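
The core reshaping step is easier to follow on a small example. The sketch below, using made-up feature IDs, sample names, and taxonomy strings, melts a wide abundance table into one row per ASV and sample, drops zero counts, and attaches the taxonomy with a left merge.

```python
import pandas as pd

# Toy wide abundance table: one row per ASV, one column per sample (made-up values)
toy_abundance = pd.DataFrame({
    "featureid": ["asv1", "asv2"],
    "S1": [12.0, 0.0],
    "S2": [0.0, 4.0],
})

# Toy taxonomy table keyed by the same feature IDs
toy_taxonomy = pd.DataFrame({
    "featureid": ["asv1", "asv2"],
    "verbatimIdentification": ["Eukaryota;Alveolata", "Eukaryota;Rhizaria"],
})

# Wide -> long: one row per (ASV, sample) pair
long_df = pd.melt(
    toy_abundance,
    id_vars=["featureid"],
    var_name="samp_name",
    value_name="organismQuantity",
)

# An occurrence only exists where reads were actually observed
long_df = long_df[long_df["organismQuantity"] > 0]

# Attach taxonomy to every remaining occurrence row
occ = long_df.merge(toy_taxonomy, on="featureid", how="left").reset_index(drop=True)
print(occ)
```

Each row of `occ` is one ASV detected in one sample, which is the unit that later becomes an `occurrence` record.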

In [28]:
# --- MAIN DATA PROCESSING LOOP --- Loop through each analysis run defined in params['datafiles']
import pandas as pd
import numpy as np
import os
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

print(f"🚀 Starting data processing for {len(params['datafiles'])} analysis run(s) to generate occurrence records.")

# Define the desired final columns for occurrence.csv IN THE SPECIFIC ORDER REQUIRED
DESIRED_OCCURRENCE_COLUMNS_IN_ORDER = [
    'eventID', 'organismQuantity', 'assay_name', 'occurrenceID', 'verbatimIdentification',
    'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species',  
    'scientificName', 'scientificNameID', 
    'taxonRank', 'identificationRemarks',
    'taxonID', 'basisOfRecord', 'nameAccordingTo', 'organismQuantityType',
    'recordedBy', 'materialSampleID', 'sampleSizeValue', 'sampleSizeUnit',
    'associatedSequences', 'locationID', 'eventDate', 'minimumDepthInMeters', 'maximumDepthInMeters',
    'locality', 'decimalLatitude', 'decimalLongitude',
    'geodeticDatum', 'parentEventID', 'datasetID', 'occurrenceStatus'
]

output_dir = "../processed-v3/" 
os.makedirs(output_dir, exist_ok=True)
output_filename = "occurrence.csv"
output_path = os.path.join(output_dir, output_filename)

all_processed_occurrence_dfs = []
successful_runs = 0
failed_runs = 0

# --- Fetch Project-Level Metadata ONCE ---
project_recorded_by = "recordedBy_NotProvided"
project_dataset_id = "DatasetID_NotProvided"

def get_project_meta_value(project_meta_df, term_to_find, default_val=pd.NA):
    if not all(col in project_meta_df.columns for col in ['term_name', 'project_level']):
        return default_val
    term_to_find_stripped = str(term_to_find).strip()
    match = project_meta_df[project_meta_df['term_name'].astype(str).str.strip().str.lower() == term_to_find_stripped.lower()]
    if not match.empty:
        value = match['project_level'].iloc[0]
        if pd.notna(value):
            return str(value).strip()
    return default_val

if 'projectMetadata' in data and not data['projectMetadata'].empty:
    project_meta_df = data['projectMetadata']
    project_recorded_by = get_project_meta_value(project_meta_df, 'recordedBy', project_recorded_by)
    project_dataset_id = get_project_meta_value(project_meta_df, 'project_id', project_dataset_id)
    print(f"  ℹ️ Project Metadata Fetched: recordedBy='{project_recorded_by}', datasetID (from project_id)='{project_dataset_id}'")
else:
    print("  ⚠️ Warning: projectMetadata is empty or not found. 'recordedBy' and 'datasetID' will use default placeholder values.")

# --- MAIN DATA PROCESSING LOOP ---
for analysis_run_name, file_paths_dict in params['datafiles'].items():
    print(f"\nProcessing Analysis Run: {analysis_run_name}")
    try:
        # --- STEP 1: Load and Prepare Raw Taxonomy and Abundance Data ---
        if not (analysis_run_name in raw_data_tables and
                'taxonomy' in raw_data_tables[analysis_run_name] and not raw_data_tables[analysis_run_name]['taxonomy'].empty and
                'occurrence' in raw_data_tables[analysis_run_name] and not raw_data_tables[analysis_run_name]['occurrence'].empty):
            print(f"  Skipping {analysis_run_name}: Raw taxonomy or occurrence data is missing or empty.")
            failed_runs += 1
            continue

        current_tax_df_raw = raw_data_tables[analysis_run_name]['taxonomy'].copy()
        current_abundance_df_raw = raw_data_tables[analysis_run_name]['occurrence'].copy()

        featureid_col_tax = current_tax_df_raw.columns[0]
        current_tax_df_raw.rename(columns={featureid_col_tax: 'featureid'}, inplace=True)
        featureid_col_abun = current_abundance_df_raw.columns[0] 
        current_abundance_df_raw.rename(columns={featureid_col_abun: 'featureid'}, inplace=True)

        sequence_col_dwc = 'DNA_sequence'
        sequence_col_input = 'sequence' 
        confidence_col_original_case = 'Confidence' 
        
        if 'dna_sequence' in current_tax_df_raw.columns and sequence_col_input not in current_tax_df_raw.columns:
             current_tax_df_raw.rename(columns={'dna_sequence': sequence_col_input}, inplace=True)
        elif sequence_col_input not in current_tax_df_raw.columns:
            current_tax_df_raw[sequence_col_input] = pd.NA
            
        # Determine the source for DwC 'verbatimIdentification'.
        # We use the column named 'taxonomy' from the input asvTaxaFeatures file.
        # NOTE: the Occurrence Core's verbatimIdentification is filled from the 'taxonomy' string rather than the input's verbatimIdentification column.
        # This helps with taxonomic matching: 'taxonomy' holds the same information as verbatimIdentification but is more machine readable.
        
        # This is the name of the FAIRe column in current_tax_df_raw that holds the string to be used.
        # It will be renamed to 'verbatimIdentification' in current_tax_df_processed.
        source_column_for_verbatim_id = 'taxonomy' 

        if source_column_for_verbatim_id in current_tax_df_raw.columns:
            verbatim_id_source_col = source_column_for_verbatim_id 
            # print(f"  Using column '{verbatim_id_source_col}' from input as source for DwC 'verbatimIdentification'.") # Kept for debug, can be commented
        else:
            # If the explicitly desired 'taxonomy' column is MISSING from the input file.
            print(f"  CRITICAL WARNING: Column '{source_column_for_verbatim_id}' not found in input taxonomy table for '{analysis_run_name}'. DwC 'verbatimIdentification' will use a placeholder.")
            current_tax_df_raw['verbatimIdentification_placeholder'] = f"Data from '{source_column_for_verbatim_id}' column not available in source"
            verbatim_id_source_col = 'verbatimIdentification_placeholder'
            # This placeholder column will be picked up by tax_cols_to_keep and then renamed.                               
        
        tax_cols_to_keep = ['featureid', sequence_col_input, verbatim_id_source_col]
        if confidence_col_original_case in current_tax_df_raw.columns:
            tax_cols_to_keep.append(confidence_col_original_case)
        else:
            current_tax_df_raw[confidence_col_original_case] = pd.NA 
            tax_cols_to_keep.append(confidence_col_original_case)

        current_tax_df_processed = current_tax_df_raw[[col for col in tax_cols_to_keep if col in current_tax_df_raw.columns]].copy()
        current_tax_df_processed.rename(columns={verbatim_id_source_col: 'verbatimIdentification', 
                                                 sequence_col_input: sequence_col_dwc}, inplace=True)

        # --- STEP 2: Melt Abundance and Merge with Taxonomy ---
        current_assay_occ_melted = pd.melt(
            current_abundance_df_raw, id_vars=['featureid'],
            var_name='samp_name', value_name='organismQuantity'
        )
        current_assay_occ_melted = current_assay_occ_melted[current_assay_occ_melted['organismQuantity'] > 0.0]
        
        current_assay_occurrence_intermediate_df = pd.merge(
            current_assay_occ_melted, current_tax_df_processed,
            on='featureid', how='left'
        )

        # --- STEP 3: Initialize ALL DwC Fields from DESIRED_OCCURRENCE_COLUMNS_IN_ORDER ---
        for col in DESIRED_OCCURRENCE_COLUMNS_IN_ORDER:
            if col not in current_assay_occurrence_intermediate_df.columns:
                 current_assay_occurrence_intermediate_df[col] = pd.NA
        
        # Set values for fields that are constructed or have fixed values for this process
        current_assay_occurrence_intermediate_df['taxonID'] = 'ASV:' + current_assay_occurrence_intermediate_df['featureid'].astype(str)
        current_assay_occurrence_intermediate_df['organismQuantityType'] = 'DNA sequence reads'
        current_assay_occurrence_intermediate_df['occurrenceStatus'] = 'present'
        current_assay_occurrence_intermediate_df['basisOfRecord'] = 'MaterialSample'
        current_assay_occurrence_intermediate_df['nameAccordingTo'] = 'Original Classification; WoRMS/GBIF (pending further matching)'
        
        # --- STEP 4: Assign Project-Level Metadata ---
        current_assay_occurrence_intermediate_df['recordedBy'] = project_recorded_by
        current_assay_occurrence_intermediate_df['datasetID'] = project_dataset_id

        # --- STEP 5: Merge `sampleMetadata` and map FAIRe terms to DwC terms ---
        if 'sampleMetadata' in data and not data['sampleMetadata'].empty:
            sm_df_to_merge = data['sampleMetadata'].copy()
            sm_df_to_merge['samp_name'] = sm_df_to_merge['samp_name'].astype(str).str.strip()
            current_assay_occurrence_intermediate_df['samp_name'] = current_assay_occurrence_intermediate_df['samp_name'].astype(str).str.strip()

            current_assay_occurrence_intermediate_df = pd.merge(
                current_assay_occurrence_intermediate_df, sm_df_to_merge,
                on='samp_name', how='left', suffixes=('', '_sm') 
            )
            
            # These columns have specific assignment logic/calculation later or are core IDs
            # They are filled by mapping only if currently NA. They all have different, specific logic to construct.
            cols_with_specific_logic_or_origin = [
                'datasetID', 'recordedBy', 'eventID', 'occurrenceID', 'taxonID', 
                'organismQuantityType', 'occurrenceStatus', 'basisOfRecord', 'nameAccordingTo',
                'parentEventID', 'associatedSequences', 
                'sampleSizeValue', 'sampleSizeUnit', 'identificationRemarks'
            ]

            # Populate DwC columns using the dwc_data['occurrence'] mapping
            for dwc_col_target, faire_row in dwc_data['occurrence'].iterrows():
                if dwc_col_target not in DESIRED_OCCURRENCE_COLUMNS_IN_ORDER: # Ensure we only care about desired output columns
                    continue 

                faire_col_source_original = str(faire_row['FAIRe_term']).strip()
                source_col_in_df = None
                
                # Check for the column with _sm suffix first (if a clash occurred during merge with sampleMetadata)
                if faire_col_source_original + '_sm' in current_assay_occurrence_intermediate_df.columns:
                    source_col_in_df = faire_col_source_original + '_sm'
                # Else, check for the original FAIRe term name (if no clash)
                elif faire_col_source_original in current_assay_occurrence_intermediate_df.columns:
                    source_col_in_df = faire_col_source_original
                
                if source_col_in_df:
                    # If the DwC target column has specific logic for its creation or is a core ID,
                    # only fill it from sampleMetadata if it's currently NA.
                    if dwc_col_target in cols_with_specific_logic_or_origin:
                        current_assay_occurrence_intermediate_df[dwc_col_target] = current_assay_occurrence_intermediate_df[dwc_col_target].fillna(current_assay_occurrence_intermediate_df[source_col_in_df])
                    else: 
                        # For other "standard" DwC columns (like locality, lat, lon, geodeticDatum, etc.),
                        # directly assign from the source FAIRe column.
                        current_assay_occurrence_intermediate_df[dwc_col_target] = current_assay_occurrence_intermediate_df[source_col_in_df]
                elif dwc_col_target in ['locality', 'decimalLatitude', 'decimalLongitude', 'geodeticDatum', 'eventDate']: # Only print diagnostic for key terms if mapping is missing in checklist
                     print(f"  DIAGNOSTIC: For DwC term '{dwc_col_target}', its mapped FAIRe term '{faire_col_source_original}' (from checklist) was NOT found as a column in the merged sample data (checked as '{faire_col_source_original}' and '{faire_col_source_original}_sm'). The DwC column '{dwc_col_target}' will likely be empty if not populated by other means.")
        else:
            print(f"  Warning: 'sampleMetadata' is empty or not found. Cannot merge for DwC term population for run {analysis_run_name}.")

        # Construct 'locationID' 
        line_id_col_sm = 'line_id_sm' if 'line_id_sm' in current_assay_occurrence_intermediate_df.columns else 'line_id'
        station_id_col_sm = 'station_id_sm' if 'station_id_sm' in current_assay_occurrence_intermediate_df.columns else 'station_id'

        if line_id_col_sm in current_assay_occurrence_intermediate_df.columns and \
           station_id_col_sm in current_assay_occurrence_intermediate_df.columns:
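            # Fill missing values BEFORE casting: astype(str) would turn NaN into the literal 'nan',
            # so a later fillna() would never see anything to fill.
            line_ids = current_assay_occurrence_intermediate_df[line_id_col_sm].fillna('NoLineID').astype(str)
            station_ids = current_assay_occurrence_intermediate_df[station_id_col_sm].fillna('NoStationID').astype(str)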
            current_assay_occurrence_intermediate_df['locationID'] = line_ids + "_" + station_ids
            # print(f"    Constructed 'locationID'.") # Reduced verbosity
        else:
            print(f"    Warning: Could not construct 'locationID' using '{line_id_col_sm}' or '{station_id_col_sm}'.")
            if 'locationID' not in current_assay_occurrence_intermediate_df.columns or current_assay_occurrence_intermediate_df['locationID'].isna().all():
                 current_assay_occurrence_intermediate_df['locationID'] = "LocationID_NotAvailable"


        # --- STEP 6: Merge `experimentRunMetadata` & Define `eventID`, `associatedSequences` ---
        assay_name_for_current_run = next((an_key for an_key, runs_dict_val in data.get('analysis_data_by_assay', {}).items() if isinstance(runs_dict_val, dict) and analysis_run_name in runs_dict_val), None)
        
        # Record the assay name on every row; the taxonomic matching step uses it for assay-specific rules
        current_assay_occurrence_intermediate_df['assay_name'] = assay_name_for_current_run

        if not assay_name_for_current_run:
            print(f"    ERROR: Could not determine assay_name for '{analysis_run_name}'.")
            current_assay_occurrence_intermediate_df['eventID'] = current_assay_occurrence_intermediate_df['eventID'].fillna(f"ERROR_eventID_for_{analysis_run_name}")
            # Also ensure a placeholder for assay_name if it couldn't be found, though it should be an error condition
            current_assay_occurrence_intermediate_df['assay_name'] = current_assay_occurrence_intermediate_df['assay_name'].fillna(f"UNKNOWN_ASSAY_FOR_{analysis_run_name}")
        elif 'experimentRunMetadata' in data and not data['experimentRunMetadata'].empty:
            erm_df = data['experimentRunMetadata'].copy()
            erm_df['samp_name'] = erm_df['samp_name'].astype(str).str.strip()
            erm_df_assay_specific = erm_df[erm_df['assay_name'].astype(str).str.strip() == str(assay_name_for_current_run).strip()]

            if not erm_df_assay_specific.empty:
                faire_lib_id_col = str(dwc_data['occurrence'].loc['eventID', 'FAIRe_term']).strip() if 'eventID' in dwc_data['occurrence'].index else 'lib_id'
                faire_assoc_seq_col = str(dwc_data['occurrence'].loc['associatedSequences', 'FAIRe_term']).strip() if 'associatedSequences' in dwc_data['occurrence'].index else 'associatedSequences'

                cols_to_select_from_erm = {'samp_name'}
                if faire_lib_id_col in erm_df_assay_specific.columns: cols_to_select_from_erm.add(faire_lib_id_col)
                if faire_assoc_seq_col in erm_df_assay_specific.columns: cols_to_select_from_erm.add(faire_assoc_seq_col)
                
                erm_to_merge = erm_df_assay_specific[list(cols_to_select_from_erm)].drop_duplicates(subset=['samp_name']).copy()
                
                current_assay_occurrence_intermediate_df = pd.merge(
                    current_assay_occurrence_intermediate_df, erm_to_merge,
                    on='samp_name', how='left', suffixes=('', '_erm')
                )
                
                source_lib_id_col_actual = faire_lib_id_col + '_erm' if faire_lib_id_col + '_erm' in current_assay_occurrence_intermediate_df.columns else faire_lib_id_col
                if source_lib_id_col_actual in current_assay_occurrence_intermediate_df.columns:
                    current_assay_occurrence_intermediate_df['eventID'] = current_assay_occurrence_intermediate_df['eventID'].fillna(current_assay_occurrence_intermediate_df[source_lib_id_col_actual])
                
                source_assoc_seq_col_actual = faire_assoc_seq_col + '_erm' if faire_assoc_seq_col + '_erm' in current_assay_occurrence_intermediate_df.columns else faire_assoc_seq_col
                if source_assoc_seq_col_actual in current_assay_occurrence_intermediate_df.columns:
                     current_assay_occurrence_intermediate_df['associatedSequences'] = current_assay_occurrence_intermediate_df['associatedSequences'].fillna(current_assay_occurrence_intermediate_df[source_assoc_seq_col_actual])
            else:
                current_assay_occurrence_intermediate_df['eventID'] = current_assay_occurrence_intermediate_df['eventID'].fillna(f"NoExpMeta_eventID_for_{analysis_run_name}")
        
        current_assay_occurrence_intermediate_df['eventID'] = current_assay_occurrence_intermediate_df['eventID'].astype(str)
        
        # --- STEP 7: Construct `occurrenceID` ---
        current_assay_occurrence_intermediate_df['occurrenceID'] = \
            current_assay_occurrence_intermediate_df['eventID'] + \
            '_occ_' + \
            current_assay_occurrence_intermediate_df['featureid'].astype(str)

        # --- STEP 8: `identificationRemarks`, `sampleSizeValue`/`Unit`, `parentEventID` ---
        otu_seq_comp_appr_str = "Unknown sequence comparison approach"
        taxa_class_method_str = ""
        taxa_ref_db_str = "Unknown reference DB"

        if assay_name_for_current_run and analysis_run_name in data.get('analysis_data_by_assay', {}).get(assay_name_for_current_run, {}):
            analysis_meta_df_for_run = data['analysis_data_by_assay'][assay_name_for_current_run][analysis_run_name]
            if 'term_name' in analysis_meta_df_for_run.columns and 'values' in analysis_meta_df_for_run.columns:
                def get_analysis_meta(term, df, default):
                    val_series = df[df['term_name'].astype(str).str.strip() == term]['values']
                    return str(val_series.iloc[0]).strip() if not val_series.empty and pd.notna(val_series.iloc[0]) else default
                otu_seq_comp_appr_str = get_analysis_meta('otu_seq_comp_appr', analysis_meta_df_for_run, otu_seq_comp_appr_str)
                taxa_class_method_str = get_analysis_meta('taxa_class_method', analysis_meta_df_for_run, taxa_class_method_str)
                taxa_ref_db_str = get_analysis_meta('otu_db', analysis_meta_df_for_run, taxa_ref_db_str)
        
        confidence_value_series = pd.Series(["unknown confidence"] * len(current_assay_occurrence_intermediate_df), index=current_assay_occurrence_intermediate_df.index, dtype=object)
        if confidence_col_original_case in current_assay_occurrence_intermediate_df.columns:
            # Fill missing confidences first; casting to str first would leave 'nan'/'<NA>' strings behind
            confidence_value_series = current_assay_occurrence_intermediate_df[confidence_col_original_case].fillna("unknown confidence").astype(str)
        
        current_assay_occurrence_intermediate_df['identificationRemarks'] = \
            f"{otu_seq_comp_appr_str}, confidence: " + \
            confidence_value_series + \
            f", against reference database: {taxa_ref_db_str}"
        
        if 'eventID' in current_assay_occurrence_intermediate_df.columns and not current_assay_occurrence_intermediate_df['eventID'].isna().all():
            sample_size_map = current_assay_occurrence_intermediate_df.groupby('eventID')['organismQuantity'].sum().to_dict()
            current_assay_occurrence_intermediate_df['sampleSizeValue'] = current_assay_occurrence_intermediate_df['eventID'].map(sample_size_map)
        current_assay_occurrence_intermediate_df['sampleSizeUnit'] = 'DNA sequence reads'
        
        if 'parentEventID' in dwc_data['occurrence'].index:
            faire_parent_event_id_col_name = str(dwc_data['occurrence'].loc['parentEventID','FAIRe_term']).strip() 
            actual_parent_event_col_name = faire_parent_event_id_col_name + "_sm" if faire_parent_event_id_col_name + "_sm" in current_assay_occurrence_intermediate_df.columns else faire_parent_event_id_col_name
            if actual_parent_event_col_name in current_assay_occurrence_intermediate_df.columns :
                 current_assay_occurrence_intermediate_df['parentEventID'] = current_assay_occurrence_intermediate_df['parentEventID'].fillna(current_assay_occurrence_intermediate_df[actual_parent_event_col_name])

        # --- STEP 9: Final Column Selection and Order for this assay's DataFrame ---
        for col_final_desired in DESIRED_OCCURRENCE_COLUMNS_IN_ORDER:
            if col_final_desired not in current_assay_occurrence_intermediate_df.columns:
                current_assay_occurrence_intermediate_df[col_final_desired] = pd.NA
        
        current_assay_occurrence_final_df = current_assay_occurrence_intermediate_df[DESIRED_OCCURRENCE_COLUMNS_IN_ORDER].copy()
        
        all_processed_occurrence_dfs.append(current_assay_occurrence_final_df)
        successful_runs += 1
        print(f"  Successfully processed {analysis_run_name}: Generated {len(current_assay_occurrence_final_df)} records.")

    except Exception as e:
        import traceback
        print(f"  ❌ Error processing {analysis_run_name}: {str(e)}")
        print(f"  Traceback for {analysis_run_name}: {traceback.format_exc()}")
        failed_runs += 1

# --- POST-LOOP CONCATENATION & FINALIZATION ---
print(f"\n🏁 LOOP COMPLETED: Successful runs: {successful_runs}, Failed runs: {failed_runs}, Total DataFrames to combine: {len(all_processed_occurrence_dfs)}")

if all_processed_occurrence_dfs:
    occ_all_final_combined = pd.concat(all_processed_occurrence_dfs, ignore_index=True, sort=False)
    
    for col_final_desired in DESIRED_OCCURRENCE_COLUMNS_IN_ORDER:
        if col_final_desired not in occ_all_final_combined.columns:
            occ_all_final_combined[col_final_desired] = pd.NA
    
    occ_all_final_output = occ_all_final_combined.reindex(columns=DESIRED_OCCURRENCE_COLUMNS_IN_ORDER)
    
    original_rows_before_dedup = len(occ_all_final_output)
    if 'occurrenceID' in occ_all_final_output.columns and not occ_all_final_output['occurrenceID'].isna().all():
        num_duplicates = occ_all_final_output.duplicated(subset=['occurrenceID']).sum()
        if num_duplicates > 0:
            occ_all_final_output.drop_duplicates(subset=['occurrenceID'], keep='first', inplace=True)
            print(f"🔄 Dropped {num_duplicates} duplicate occurrenceID records. Final rows: {len(occ_all_final_output)}.")
        else:
            print("🔄 No duplicate occurrenceID records found to drop.")
    else:
        print("  ⚠️ WARNING: 'occurrenceID' column not found or is all NA. Cannot effectively drop duplicates based on it.")

    try:
        occ_all_final_output.to_csv(output_path, index=False, na_rep='') 
        print(f"\n💾 Combined occurrence file '{output_filename}' saved to '{output_path}' with {len(occ_all_final_output)} records.")
        # final_cols_list_written = occ_all_final_output.columns.tolist() # Removed for brevity
        # print(f"📋 Final columns written ({len(final_cols_list_written)} total), in order: {final_cols_list_written}")

        print(f"\n👀 Preview of final combined occurrence data (first 5 rows, selected columns):")
        preview_cols_subset = ['eventID', 'occurrenceID', 'assay_name', 'parentEventID', 'datasetID', 'recordedBy', 
                               'locality', 'decimalLatitude', 'decimalLongitude', 'geodeticDatum', 
                               'identificationRemarks', 'locationID']
        preview_cols_to_show = [col for col in preview_cols_subset if col in occ_all_final_output.columns]
        display(occ_all_final_output[preview_cols_to_show].head())
    except Exception as e:
        print(f"  ❌ Error saving combined occurrence file: {str(e)}")
else:
    print(f"❌ No data to combine - all analysis runs may have failed or yielded no occurrence records.")


🚀 Starting data processing for 3 analysis run(s) to generate occurrence records.
  ℹ️ Project Metadata Fetched: recordedBy='Luke Thompson', datasetID (from project_id)='noaa-aoml-gomecc4'

Processing Analysis Run: gomecc4_18s_p1-6_v2024.10_241122
  Successfully processed gomecc4_18s_p1-6_v2024.10_241122: Generated 147083 records.

Processing Analysis Run: gomecc4_16s_p3-6_v2024.10_241122
  Successfully processed gomecc4_16s_p3-6_v2024.10_241122: Generated 118671 records.

Processing Analysis Run: gomecc4_16s_p1-2_v2024.10_241122
  Successfully processed gomecc4_16s_p1-2_v2024.10_241122: Generated 46460 records.

🏁 LOOP COMPLETED: Successful runs: 3, Failed runs: 0, Total DataFrames to combine: 3
🔄 No duplicate occurrenceID records found to drop.

💾 Combined occurrence file 'occurrence.csv' saved to '../processed-v3/occurrence.csv' with 312214 records.

👀 Preview of final combined occurrence data (first 5 rows, selected columns):


,eventID,occurrenceID,assay_name,parentEventID,datasetID,recordedBy,locality,decimalLatitude,decimalLongitude,geodeticDatum,identificationRemarks,locationID
0,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_36aa75f9b28f5f831c2d631ba65c2bcb,ssu18sv9-emp,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,Luke Thompson,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,WGS84,"qiime2-2021.2; naive bayes classifier; scikit-learn 0.23.1, confidence: 0.99999999990944, against reference database: PR2 v5.0.1; V9 1391f-1510r r...",27N_Sta1
1,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_4e38e8ced9070952b314e1880bede1ca,ssu18sv9-emp,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,Luke Thompson,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,WGS84,"qiime2-2021.2; naive bayes classifier; scikit-learn 0.23.1, confidence: 0.999067062720315, against reference database: PR2 v5.0.1; V9 1391f-1510r ...",27N_Sta1
2,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_2a31e5c01634165da99e7381279baa75,ssu18sv9-emp,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,Luke Thompson,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,WGS84,"qiime2-2021.2; naive bayes classifier; scikit-learn 0.23.1, confidence: 0.8911679667827849, against reference database: PR2 v5.0.1; V9 1391f-1510r...",27N_Sta1
3,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_ecee60339b2fb88ea6d1c8d18054bed4,ssu18sv9-emp,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,Luke Thompson,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,WGS84,"qiime2-2021.2; naive bayes classifier; scikit-learn 0.23.1, confidence: 0.9996383713806553, against reference database: PR2 v5.0.1; V9 1391f-1510r...",27N_Sta1
4,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_fa1f1a97dd4ae7c826009186bad26384,ssu18sv9-emp,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4,Luke Thompson,"USA: Atlantic Ocean, east of Florida (27 N)",26.997,-79.618,WGS84,"qiime2-2021.2; naive bayes classifier; scikit-learn 0.23.1, confidence: 0.97066987594629, against reference database: PR2 v5.0.1; V9 1391f-1510r r...",27N_Sta1
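
The reshaping done in the cell above (melting the ASV table in Step 2, building `occurrenceID` in Step 7, and summing reads per event in Step 8) can be illustrated on a toy table. This is a minimal sketch rather than the notebook's own code: the feature IDs, sample names, and library IDs are made up, and the small `lib_ids` lookup is a hypothetical stand-in for the `experimentRunMetadata` merge.

```python
import pandas as pd

# Toy ASV table: one row per feature, one column per sample (all values hypothetical).
abundance = pd.DataFrame({
    "featureid": ["asv1", "asv2"],
    "sampleA": [12, 0],
    "sampleB": [0, 7],
})

# Wide -> long: one row per (feature, sample) pair, then keep only detections.
long_df = abundance.melt(id_vars=["featureid"], var_name="samp_name",
                         value_name="organismQuantity")
long_df = long_df[long_df["organismQuantity"] > 0].reset_index(drop=True)

# Hypothetical sample -> library ID lookup, standing in for experimentRunMetadata.
lib_ids = pd.DataFrame({"samp_name": ["sampleA", "sampleB"],
                        "eventID": ["lib_A", "lib_B"]})
long_df = long_df.merge(lib_ids, on="samp_name", how="left")

# occurrenceID = eventID + '_occ_' + featureid, mirroring Step 7.
long_df["occurrenceID"] = long_df["eventID"] + "_occ_" + long_df["featureid"].astype(str)

# Total reads per event, mirroring the sampleSizeValue calculation in Step 8.
long_df["sampleSizeValue"] = long_df.groupby("eventID")["organismQuantity"].transform("sum")

print(long_df)
```

Because each (library, feature) pair appears at most once after the melt, the constructed `occurrenceID` values are unique within a run, which is why the duplicate check after concatenation normally finds nothing to drop.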


## NOTE:
At this step (before taxonomic assignment through WoRMS or GBIF), the Occurrence Core contains an `assay_name` column. The taxonomic assignment code uses it to decide for which assays the 'species' rank should be removed from consideration, and the column is dropped from the final Occurrence Core after taxonomic assignment. This matters because some assays (16S, for example) return unusable assignments at species level, whereas others (18S, for example) return species assignments that ARE useful.
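
As an illustration of how an `assay_name` column can drive such a rule, here is a minimal sketch. It is not the logic inside the matching script: `ranks_for_assay` is a hypothetical helper, and the rank list and skip list are assumptions modelled on the parameters used later in this notebook.

```python
import pandas as pd

# Assumed rank order and skip list; the real values come from the notebook parameters.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
assays_to_skip_species = ["ssu16sv4v5-emp"]

def ranks_for_assay(assay_name: str) -> list[str]:
    """Return the ranks eligible for matching, dropping 'species' for skipped assays."""
    if assay_name in assays_to_skip_species:
        return [rank for rank in RANKS if rank != "species"]
    return list(RANKS)

occ = pd.DataFrame({"assay_name": ["ssu18sv9-emp", "ssu16sv4v5-emp"]})
# Deepest rank that would be considered for each row's assay.
occ["deepest_rank_considered"] = occ["assay_name"].map(lambda a: ranks_for_assay(a)[-1])
print(occ)
```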

# DEBUGGING ONLY: Create Smaller Subset for Faster Taxonomic Matching

**Run this cell ONLY if you want to test taxonomic matching on a small subset.**
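This cell reloads the full `occurrence.csv` written by the occurrence-generation cell above, keeps the first `N_ROWS_PER_ASSAY_SUBSET` rows of each assay, and stores the result in `occ_df_for_actual_matching`, the DataFrame that the taxonomic matching cell reads. If no subset can be built, it falls back to the full dataset.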
If you want to run taxonomic matching on the **full dataset, simply DO NOT RUN THIS CELL.**
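
Taking a fixed number of rows per assay can also be expressed with `groupby(...).head(n)`. The following is a minimal sketch on toy data, assuming only that the table has an `assay_name` column; `subset_per_assay` is a hypothetical helper, not a function defined in this notebook.

```python
import pandas as pd

def subset_per_assay(df: pd.DataFrame, n_rows: int = 10) -> pd.DataFrame:
    """Keep the first n_rows rows of each assay, preserving the original row order."""
    return df.groupby("assay_name", sort=False).head(n_rows).reset_index(drop=True)

# Toy occurrence table with two assays (values are made up).
toy = pd.DataFrame({
    "assay_name": ["18S", "18S", "18S", "16S", "16S"],
    "occurrenceID": ["a", "b", "c", "d", "e"],
})
small = subset_per_assay(toy, n_rows=2)
print(small)
```

Unlike a loop that concatenates one slice per assay, this keeps rows in their original file order, which makes no difference to matching but can make the subset easier to compare against the full file.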

In [29]:
# Create a small subset of occurrence.csv for faster testing of taxonomic matching

import pandas as pd
import os

# Determine the correct output directory path
current_output_dir = None
if 'output_dir' in globals() and isinstance(output_dir, str):
    current_output_dir = output_dir # From a previous run of cell 28
elif 'params' in globals() and 'output_dir' in params and isinstance(params['output_dir'], str):
    current_output_dir = params['output_dir'] # From cell 8
else:
    current_output_dir = "../processed-v3/" # Fallback
    print(f"Warning: 'output_dir' not found in globals or params, using default: {current_output_dir}")

full_occurrence_csv_path = os.path.join(current_output_dir, "occurrence.csv")
occ_df_subset_for_matching = pd.DataFrame() # Initialize
full_occ_df = pd.DataFrame() # Initialize

N_ROWS_PER_ASSAY_SUBSET = 10 # Number of rows to take per assay for the subset
print(f"Attempting to load full occurrence data from: {full_occurrence_csv_path}")
if os.path.exists(full_occurrence_csv_path):
    try:
        full_occ_df = pd.read_csv(full_occurrence_csv_path, low_memory=False, dtype=str)
        if 'organismQuantity' in full_occ_df.columns:
            full_occ_df['organismQuantity'] = pd.to_numeric(full_occ_df['organismQuantity'], errors='coerce').fillna(0).astype(float)
        print(f"Successfully loaded {full_occurrence_csv_path} with {len(full_occ_df)} records.")

        if not full_occ_df.empty and 'assay_name' in full_occ_df.columns and 'verbatimIdentification' in full_occ_df.columns:
            available_assays = full_occ_df['assay_name'].dropna().unique()
            print(f"  Available assays in occurrence.csv: {available_assays}")
            
            subset_dfs = []
            for assay in available_assays:
                assay_subset = full_occ_df[full_occ_df['assay_name'] == assay].head(N_ROWS_PER_ASSAY_SUBSET)
                if not assay_subset.empty:
                    subset_dfs.append(assay_subset)
                    print(f"  Selected {len(assay_subset)} rows for assay '{assay}' in the subset.")
                else:
                    print(f"  No rows found for assay '{assay}' to include in subset.")

            if subset_dfs:
                occ_df_subset_for_matching = pd.concat(subset_dfs, ignore_index=True)
                print(f"Successfully created subset with {len(occ_df_subset_for_matching)} total rows for taxonomic matching.")
                print(f"  Unique assay_names in subset: {occ_df_subset_for_matching['assay_name'].unique()}")
                # For inspection:
                # display(occ_df_subset_for_matching[['eventID', 'occurrenceID', 'verbatimIdentification', 'assay_name']].head())
            else:
                print("Could not create a valid subset. No assay data found or all assays resulted in empty subsets.")
        else:
            print("Full occurrence DataFrame is empty or missing 'assay_name'/'verbatimIdentification' columns. Cannot create subset.")

    except Exception as e:
        print(f"Error loading or processing {full_occurrence_csv_path} for subsetting: {e}")
        import traceback
        print(traceback.format_exc())
else:
    print(f"ERROR: Full occurrence file not found at {full_occurrence_csv_path}. Cannot create subset. Ensure Cell 28 (occurrence.csv generation) ran successfully.")

# Determine which DataFrame to use for the actual matching process
current_run_type = "NO DATA"
if not occ_df_subset_for_matching.empty:
    occ_df_for_actual_matching = occ_df_subset_for_matching
    current_run_type = "SUBSET"
    print(f"\nProceeding with the {current_run_type} ({len(occ_df_for_actual_matching)} rows) for taxonomic matching.")
elif not full_occ_df.empty: # Fallback to full if subset is empty but full loaded
    print("\nWARNING: Subset creation failed or resulted in an empty DataFrame. Proceeding with FULL dataset.")
    occ_df_for_actual_matching = full_occ_df
    current_run_type = "FULL DATASET"
else:
    print("\nERROR: Neither subset nor full occurrence data is available for matching. Please check previous cells.")
    occ_df_for_actual_matching = pd.DataFrame() # Ensure it's an empty DF

Attempting to load full occurrence data from: ../processed-v3/occurrence.csv
Successfully loaded ../processed-v3/occurrence.csv with 312214 records.
  Available assays in occurrence.csv: ['ssu18sv9-emp' 'ssu16sv4v5-emp']
  Selected 10 rows for assay 'ssu18sv9-emp' in the subset.
  Selected 10 rows for assay 'ssu16sv4v5-emp' in the subset.
Successfully created subset with 20 total rows for taxonomic matching.
  Unique assay_names in subset: ['ssu18sv9-emp' 'ssu16sv4v5-emp']

Proceeding with the SUBSET (20 rows) for taxonomic matching.


## Taxonomic Assignment

The following cells will perform taxonomic matching using the API specified in the notebook parameters (`params['taxonomic_api_source']`). This process will:
1. Dynamically import the appropriate matching script (e.g., `WoRMS_v3_matching.py` or a future `GBIF_v3_matching.py`).
2. Read the `occurrence.csv` file generated previously (which includes raw `verbatimIdentification` strings and `assay_name`).
3. Use the imported script to query the selected API.
4. Implement caching to avoid redundant API calls.
5. Handle assay-specific rules (e.g., skipping species-level matching for specified assays).
6. Update the occurrence data with API-derived `scientificName`, `scientificNameID`, `taxonRank`, and hierarchical rank columns.
7. Set `nameAccordingTo` to reflect the API source.
8. Handle specific post-processing rules, such as for "Eukaryota"-only verbatim strings.
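
The caching idea in step 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the matching script's implementation: `match_unique_names` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a real API call so the example runs offline.

```python
from typing import Callable, Optional

def match_unique_names(names: list[str],
                       lookup: Callable[[str], Optional[dict]]) -> dict:
    """Call `lookup` once per distinct name and reuse the result for repeats."""
    cache: dict[str, Optional[dict]] = {}
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
    return cache

# Hypothetical stand-in for an API call; it records how often it is invoked.
calls: list[str] = []

def fake_lookup(name: str) -> Optional[dict]:
    calls.append(name)
    # Pretend the last element of the lineage string is the matched name.
    return {"scientificName": name.split(";")[-1]} if name else None

verbatim = ["Eukaryota;Chlorophyta", "Eukaryota;Chlorophyta", "Bacteria;Cyanobacteria"]
results = match_unique_names(verbatim, fake_lookup)
print(len(calls), "lookups for", len(verbatim), "rows")
```

Since many occurrence rows share the same `verbatimIdentification` string, looking up each distinct string once and mapping the results back onto the rows keeps the number of API requests far below the number of records.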

In [33]:
import importlib
import pandas as pd
import os
import WoRMS_v3_matching # Import the new script

# Reload the module to pick up any changes if you edit the .py file
importlib.reload(WoRMS_v3_matching)

print("Successfully imported and reloaded WoRMS_v3_matching.py")

# --- Define Parameters for Taxonomic Matching ---

# This should be defined in your main parameters cell (e.g., Cell 8 of your notebook)
# Ensure 'params' dictionary exists from that cell.
if 'params' not in globals():
    print("CRITICAL ERROR: 'params' dictionary not found. It should be defined in an early cell (e.g., Cell 8).")
    params = {} # Initialize to prevent immediate crash, but this is not ideal

# 1. API Source (already in your params from Cell 8)
# params['taxonomic_api_source'] = 'WoRMS' # or 'GBIF' when ready

# 2. Assays to skip SPECIES-level matching for (already in your params from Cell 8)
# params['user_defined_assays_to_skip_species'] = ['ssu16sv4v5-emp'] # Example

# 3. Number of processes for matching (0 means use all available CPUs)
# This can also be in your main params cell or set here.
params['worms_n_proc'] = params.get('worms_n_proc', 0) 
params['gbif_n_proc'] = params.get('gbif_n_proc', 0) # For future GBIF use

# 4. Output directory (should be consistent with where occurrence.csv was saved)
# This should also ideally come from your main params cell (Cell 8) or the occurrence.csv generation cell (Cell 28)
if 'output_dir' not in params:
    if 'output_dir' in globals() and isinstance(output_dir, str): # If set by cell 28
         params['output_dir'] = output_dir
    else:
        params['output_dir'] = "../processed-v3/" # Fallback
        print(f"Warning: 'output_dir' not found in params, using fallback: {params['output_dir']}")
else: # If it is in params, ensure it's a string
    if not isinstance(params['output_dir'], str):
        params['output_dir'] = "../processed-v3/"
        print(f"Warning: params['output_dir'] was not a string, reset to default: {params['output_dir']}")


# --- Consolidate parameters for the matching function ---
# The WoRMS_v3_matching.py script will primarily use:
# - params_dict['taxonomic_api_source']
# - params_dict['assays_to_skip_species_match'] (derived from user_defined_assays_to_skip_species)
# - params_dict['worms_n_proc'] (or 'gbif_n_proc' for GBIF)

current_api_source = params.get('taxonomic_api_source', 'WoRMS') # Default to WoRMS if not set
print(f"\nTaxonomic Matching Setup:")
print(f"  API Source: {current_api_source}")
print(f"  Assays configured by user to skip species-level matching: {params.get('user_defined_assays_to_skip_species', 'NONE DEFINED')}")
if current_api_source == 'WoRMS':
    print(f"  Number of processes for WoRMS: {params['worms_n_proc'] if params['worms_n_proc'] > 0 else 'All available CPUs'}")
elif current_api_source == 'GBIF':
    print(f"  Number of processes for GBIF: {params['gbif_n_proc'] if params['gbif_n_proc'] > 0 else 'All available CPUs'} (GBIF matching not yet fully implemented)")
print(f"  Output directory for reading/writing files: {params['output_dir']}")

# Standard DwC ranks that the matching script will try to populate
DWC_RANKS_STD = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

Successfully imported and reloaded WoRMS_v3_matching.py

Taxonomic Matching Setup:
  API Source: WoRMS
  Assays configured by user to skip species-level matching: ['ssu16sv4v5-emp']
  Number of processes for WoRMS: All available CPUs
  Output directory for reading/writing files: ../processed-v3/
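
The convention that a process count of `0` means "use all available CPUs" can be resolved as in the sketch below. `resolve_n_proc` is a hypothetical helper, and capping the request at the CPU count is an assumption made for this example; the matching script may resolve the value differently.

```python
import os

def resolve_n_proc(requested: int) -> int:
    """Translate a requested process count into a concrete number of workers.

    0 (or any non-positive value) means 'all available CPUs'. Positive values
    are capped at the CPU count (an assumption of this sketch).
    """
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    if requested <= 0:
        return available
    return min(requested, available)

print(resolve_n_proc(0), resolve_n_proc(1))
```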


### Perform Taxonomic Matching

This cell loads the intermediate `occurrence.csv` file (which should include the `assay_name` column) and applies the taxonomic matching using the selected API and its parameters. DEBUG statements will show progress.

NOTE: This operation may take a few minutes

In [34]:
# Perform Taxonomic Matching

# 'occ_df_for_actual_matching' should be available from the subsetting cell (or fallback to full_occ_df)
# 'params' dictionary should be available from the previous cell and Cell 8.

if 'occ_df_for_actual_matching' not in globals() or occ_df_for_actual_matching.empty:
    print("ERROR: 'occ_df_for_actual_matching' is not defined or is empty. Cannot proceed with taxonomic matching.")
    print("Please ensure the subsetting cell (or the cell loading occurrence.csv) ran correctly.")
    occ_df_matched_with_api = pd.DataFrame() # Initialize as empty to prevent error in next cell
elif 'params' not in globals():
    print("ERROR: 'params' dictionary not found. Cannot proceed with taxonomic matching.")
    occ_df_matched_with_api = pd.DataFrame()
else:
    api_to_use = params.get('taxonomic_api_source')
    print(f"\nStarting taxonomic matching using '{api_to_use}' API on {current_run_type} data ({len(occ_df_for_actual_matching)} rows)...")

    if api_to_use == 'WoRMS':
        if hasattr(WoRMS_v3_matching, 'get_worms_match_for_dataframe'):
            # The function now directly uses the 'user_defined_assays_to_skip_species' from params for clarity
            # It also needs 'worms_n_proc' from params
            matching_params_for_script = {
                'taxonomic_api_source': 'WoRMS', # Explicitly pass for clarity inside the script
                'assays_to_skip_species_match': params.get('user_defined_assays_to_skip_species', []),
                'worms_n_proc': params.get('worms_n_proc', 0) # Pass n_proc for WoRMS
            }
            occ_df_matched_with_api = WoRMS_v3_matching.get_worms_match_for_dataframe(
                occurrence_df=occ_df_for_actual_matching,
                params_dict=matching_params_for_script, # Pass the consolidated dict
                n_proc=params.get('worms_n_proc', 0) # Also passed directly for Pool
            )
        else:
            print("ERROR: WoRMS_v3_matching does not have a 'get_worms_match_for_dataframe' function.")
            occ_df_matched_with_api = occ_df_for_actual_matching.copy() # Avoid error
    
    elif api_to_use == 'GBIF':
        print("GBIF matching is selected but not yet implemented in this workflow.")
        print("The 'occ_df_matched_with_api' will be a copy of the input or empty.")
        # Placeholder for GBIF call when ready:
        # matching_params_for_script = {
        #     'taxonomic_api_source': 'GBIF',
        #     'assays_to_skip_species_match': params.get('user_defined_assays_to_skip_species', []), # or a GBIF specific list
        #     'gbif_n_proc': params.get('gbif_n_proc', 0)
        # }
        # occ_df_matched_with_api = GBIF_OBIS_matcher.get_gbif_match_for_dataframe(
        #     occurrence_df=occ_df_for_actual_matching,
        #     params_dict=matching_params_for_script
        # )
        occ_df_matched_with_api = occ_df_for_actual_matching.copy() # For now
    else:
        print(f"ERROR: Unknown or unsupported taxonomic_api_source: {api_to_use}")
        occ_df_matched_with_api = occ_df_for_actual_matching.copy() # Avoid error

    print(f"\nTaxonomic matching process via '{api_to_use}' finished.")
    
    if not occ_df_matched_with_api.empty:
        print("\nPreview of DataFrame after matching (selected columns):")
        preview_cols = ['eventID', 'occurrenceID', 'verbatimIdentification', 'assay_name', 
                        'scientificName', 'scientificNameID', 'taxonRank', 'nameAccordingTo', 'match_type_debug'] + DWC_RANKS_STD
        display_cols = [col for col in preview_cols if col in occ_df_matched_with_api.columns]
        display(occ_df_matched_with_api[display_cols].head())
        
        if 'match_type_debug' in occ_df_matched_with_api.columns:
            print(f"\nCounts of '{api_to_use}' match_type_debug:")
            print(occ_df_matched_with_api['match_type_debug'].value_counts(dropna=False))
        if 'scientificName' in occ_df_matched_with_api.columns:
            print(f"\nCounts of resulting '{api_to_use}' scientificName (top 10 unique non-NA values):")
            print(occ_df_matched_with_api['scientificName'].dropna().value_counts().head(10))
    else:
        print(f"DataFrame 'occ_df_matched_with_api' is empty after '{api_to_use}' matching attempt.")



Starting taxonomic matching using 'WoRMS' API on SUBSET data (20 rows)...
Found 13 unique, non-empty (verbatimIdentification, assay_name) combinations for WoRMS matching.
Starting WoRMS queries with 8 processes using multiprocess.Pool...
  Processed 13/13 unique combinations from API/pool.

Finished applying WoRMS taxonomic matches to DataFrame.

Taxonomic matching process via 'WoRMS' finished.

Preview of DataFrame after matching (selected columns):


Unnamed: 0,eventID,occurrenceID,verbatimIdentification,assay_name,scientificName,scientificNameID,taxonRank,nameAccordingTo,match_type_debug,kingdom,phylum,class,order,family,genus,species
0,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda,ssu18sv9-emp,Crustacea,urn:lsid:marinespecies.org:taxname:1066,Subphylum,WoRMS,Success_Query_Crustacea_EffectiveRank_genus_WoRMSRank_Subphylum,Animalia,Arthropoda,,,,,
1,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,ssu18sv9-emp,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Species,WoRMS,Success_Query_Clausocalanus_furcatus_EffectiveRank_species_WoRMSRank_Species,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,
2,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Oithona;Oithona_sp.;,ssu18sv9-emp,Oithona,urn:lsid:marinespecies.org:taxname:106485,Genus,WoRMS,Success_Query_Oithona_EffectiveRank_genus_WoRMSRank_Genus,Animalia,Arthropoda,Copepoda,Cyclopoida,Oithonidae,Oithona,
3,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_ecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,ssu18sv9-emp,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Class,WoRMS,Success_Query_Dinophyceae_EffectiveRank_species_WoRMSRank_Class,Chromista,Myzozoa,Dinophyceae,,,,
4,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_fa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,ssu18sv9-emp,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Family,WoRMS,Success_Query_Gymnodiniaceae_EffectiveRank_species_WoRMSRank_Family,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,



Counts of 'WoRMS' match_type_debug:
match_type_debug
Success_Query_Alphaproteobacteria_EffectiveRank_order_WoRMSRank_Class           6
Success_Query_Dinophyceae_EffectiveRank_species_WoRMSRank_Class                 3
Success_Query_Synechococcales_EffectiveRank_family_WoRMSRank_Order              3
Success_Query_Hydrozoa_EffectiveRank_species_WoRMSRank_Class                    2
Success_Query_Clausocalanus_furcatus_EffectiveRank_species_WoRMSRank_Species    1
Success_Query_Crustacea_EffectiveRank_genus_WoRMSRank_Subphylum                 1
Success_Query_Gymnodiniaceae_EffectiveRank_species_WoRMSRank_Family             1
Success_Query_Oithona_EffectiveRank_genus_WoRMSRank_Genus                       1
Success_Query_Pelagomonas_calceolata_EffectiveRank_species_WoRMSRank_Species    1
Success_Query_Alphaproteobacteria_EffectiveRank_family_WoRMSRank_Class          1
Name: count, dtype: int64

Counts of resulting 'WoRMS' scientificName (top 10 unique non-NA values):
scientificName
Alphaprote

### Post-Matching Processing and Final Save

This cell handles API-specific post-processing. For example, if WoRMS was used, it checks for verbatim strings that were effectively "Eukaryota"-only and ensures they are set to "incertae sedis" if not already resolved to a lower rank. The temporary `assay_name` and `match_type_debug` columns are then removed, and the file is saved.
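
The "effectively Eukaryota-only" test described above can be read as a small predicate on the semicolon-delimited string. The following is a minimal sketch of that rule under the same conditions as the cell below (case-insensitive, `unassigned` terms ignored when they precede the final term). The helper name `is_eukaryota_only` and the demo strings are assumptions for illustration.

```python
import pandas as pd

def is_eukaryota_only(verbatim):
    """True when 'eukaryota' is the final term and nothing meaningful precedes it."""
    terms = [t.strip() for t in str(verbatim).strip().lower().split(';') if t.strip()]
    if not terms or terms[-1] != 'eukaryota':
        return False
    # Every term before the last must be the placeholder 'unassigned'
    return all(t == 'unassigned' for t in terms[:-1])

demo = pd.DataFrame({'verbatimIdentification': [
    'Eukaryota',
    'Eukaryota;',
    'Unassigned;Eukaryota',
    'Eukaryota;TSAR;Alveolata',
]})
mask = demo['verbatimIdentification'].map(is_eukaryota_only)
print(mask.tolist())
```

A boolean mask built this way can then be combined with checks on `scientificName` and `taxonRank` before any values are overwritten.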

In [32]:
# Post-Matching Processing and Final Save

if 'occ_df_matched_with_api' not in globals() or occ_df_matched_with_api.empty:
    print("Skipping post-matching and save: 'occ_df_matched_with_api' is not defined or is empty.")
else:
    print("Starting post-matching processing.")
    api_source_used = params.get('taxonomic_api_source', 'UnknownAPI') # Get from params
    
    # --- Handle "Eukaryota"-only verbatim strings specifically after WoRMS matching ---
    # This logic is simplified here. The matching script itself tries to be robust.
    # However, a final check can be useful.
    if api_source_used == 'WoRMS':
        # Example condition: if scientificName is 'Eukaryota' and taxonRank is 'kingdom' (or similar high rank)
        # after matching, and the original verbatim was simple like "Eukaryota", convert to "incertae sedis".
        # This specific rule might need more nuanced conditions based on your actual WoRMS script's behavior.
        
        # A more robust check for "effectively Eukaryota only" verbatim strings that resulted in high-level Eukaryota match:
        eukaryota_verbatim_mask = pd.Series([False] * len(occ_df_matched_with_api), index=occ_df_matched_with_api.index)
        if 'verbatimIdentification' in occ_df_matched_with_api.columns and \
           'scientificName' in occ_df_matched_with_api.columns and \
           'taxonRank' in occ_df_matched_with_api.columns:

            for index, row in occ_df_matched_with_api.iterrows():
                vi = str(row.get('verbatimIdentification', '')).strip().lower()
                parsed_vi = [term.strip() for term in vi.split(';') if term.strip()]
                
                is_effectively_eukaryota_only_verbatim = False
                if len(parsed_vi) == 1 and parsed_vi[0] == 'eukaryota':
                    is_effectively_eukaryota_only_verbatim = True
                elif len(parsed_vi) > 0 and parsed_vi[-1] == 'eukaryota': 
                    preceding_meaningful = [term for term in parsed_vi[:-1] if term and term != 'unassigned']
                    if not preceding_meaningful: # If only "eukaryota" is meaningful at the end
                        is_effectively_eukaryota_only_verbatim = True
                
                if is_effectively_eukaryota_only_verbatim:
                    current_sn = str(row.get('scientificName', '')).lower()
                    current_tr = str(row.get('taxonRank', '')).lower()
                    high_ranks = ['kingdom', 'superkingdom', 'domain', 'subkingdom', 'no rank', 'unranked', ''] # Include empty/None as high rank
                    
                    if current_sn == 'eukaryota' and current_tr in high_ranks:
                        eukaryota_verbatim_mask[index] = True
                    # Also, if WoRMS already returned incertae sedis for such a string, ensure it's standardized.
                    elif current_sn == 'incertae sedis' and is_effectively_eukaryota_only_verbatim:
                         eukaryota_verbatim_mask[index] = True


            num_eukaryota_verbatim_to_convert = eukaryota_verbatim_mask.sum()
            if num_eukaryota_verbatim_to_convert > 0:
                print(f"  Post-processing ({api_source_used}): Found {num_eukaryota_verbatim_to_convert} records with 'Eukaryota'-only verbatim strings that matched to high-level Eukaryota or were already 'incertae sedis'. Standardizing to 'incertae sedis'.")
                occ_df_matched_with_api.loc[eukaryota_verbatim_mask, 'scientificName'] = 'incertae sedis'
                occ_df_matched_with_api.loc[eukaryota_verbatim_mask, 'scientificNameID'] = 'urn:lsid:marinespecies.org:taxname:12' # AphiaID for Incertae Sedis in WoRMS
                occ_df_matched_with_api.loc[eukaryota_verbatim_mask, 'taxonRank'] = 'no rank'
                for rank_col in DWC_RANKS_STD: # Clear out other ranks
                    if rank_col in occ_df_matched_with_api.columns:
                        occ_df_matched_with_api.loc[eukaryota_verbatim_mask, rank_col] = None 
                if 'match_type_debug' in occ_df_matched_with_api.columns:
                     occ_df_matched_with_api.loc[eukaryota_verbatim_mask, 'match_type_debug'] = 'Standardized_IncertaeSedis_From_EukaryotaOnlyVerbatim'
                if 'nameAccordingTo' in occ_df_matched_with_api.columns:
                    occ_df_matched_with_api.loc[eukaryota_verbatim_mask, 'nameAccordingTo'] = api_source_used + "; Local rule for Eukaryota-only verbatim"


    # --- Remove temporary/intermediate columns ---
    columns_to_drop_final = []
    if 'assay_name' in occ_df_matched_with_api.columns: columns_to_drop_final.append('assay_name')
    if 'match_type_debug' in occ_df_matched_with_api.columns: columns_to_drop_final.append('match_type_debug')
    
    if columns_to_drop_final:
        occ_df_matched_with_api.drop(columns=columns_to_drop_final, inplace=True, errors='ignore')
        print(f"  Removed temporary columns: {columns_to_drop_final}")

    # --- Define final column order (should match DESIRED_OCCURRENCE_COLUMNS_IN_ORDER from Cell 28, *without* 'assay_name') ---
    # This list is defined in your cell 28 (occurrence core generation)
    # We re-use it here, excluding 'assay_name' if it was present.
    if 'DESIRED_OCCURRENCE_COLUMNS_IN_ORDER' in globals():
        final_dwc_columns_ordered_post_match = [col for col in DESIRED_OCCURRENCE_COLUMNS_IN_ORDER if col != 'assay_name']
    else: # Fallback if the list isn't found (shouldn't happen if cell 28 ran)
        print("Warning: DESIRED_OCCURRENCE_COLUMNS_IN_ORDER not found. Using columns present in DataFrame.")
        final_dwc_columns_ordered_post_match = occ_df_matched_with_api.columns.tolist()

    # Ensure all desired columns exist, add as NA if missing
    for col in final_dwc_columns_ordered_post_match:
        if col not in occ_df_matched_with_api.columns:
            occ_df_matched_with_api[col] = pd.NA 
            print(f"  Warning: Final column '{col}' was missing from matched data and added as NA.")

    occ_df_final_output = occ_df_matched_with_api.reindex(columns=final_dwc_columns_ordered_post_match)
    
    # --- Save the final, taxonomically enriched occurrence file ---
    # Filename depends on whether it was a subset or full run for clarity
    file_prefix = "occurrence_SUBSET" if current_run_type == "SUBSET" else "occurrence"
    final_occurrence_filename = f"{file_prefix}_{api_source_used.lower()}_matched.csv" 
    
    # Ensure params['output_dir'] is a valid path
    output_directory_for_save = params.get('output_dir', "../processed-v3/")
    if not isinstance(output_directory_for_save, str): # Safeguard
        output_directory_for_save = "../processed-v3/"
    os.makedirs(output_directory_for_save, exist_ok=True) # Ensure directory exists
    
    final_output_path = os.path.join(output_directory_for_save, final_occurrence_filename)

    try:
        occ_df_final_output.to_csv(final_output_path, index=False, na_rep='') 
        print(f"\n💾 Taxonomically updated {current_run_type} occurrence file '{final_occurrence_filename}' saved to '{final_output_path}' with {len(occ_df_final_output)} records.")
        print(f"  Final columns written ({len(occ_df_final_output.columns)} total): {occ_df_final_output.columns.tolist()}")
        
        print(f"\n👀 Preview of final taxonomically updated {current_run_type} occurrence data (first 5 rows, selected columns):")
        preview_cols_final = ['eventID', 'occurrenceID', 'verbatimIdentification', 
                              'scientificName', 'scientificNameID', 'taxonRank', 'nameAccordingTo'] + DWC_RANKS_STD
        display_cols_final = [col for col in preview_cols_final if col in occ_df_final_output.columns]
        display(occ_df_final_output[display_cols_final].head())

    except Exception as e:
        print(f"  ❌ ERROR saving final taxonomically updated {current_run_type} occurrence file: {str(e)}")
        import traceback
        print(traceback.format_exc())

Starting post-matching processing.
  Removed temporary columns: ['assay_name', 'match_type_debug']

💾 Taxonomically updated SUBSET occurrence file 'occurrence_SUBSET_worms_matched.csv' saved to '../processed-v3/occurrence_SUBSET_worms_matched.csv' with 20 records.
  Final columns written (35 total): ['eventID', 'organismQuantity', 'occurrenceID', 'verbatimIdentification', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'scientificName', 'scientificNameID', 'taxonRank', 'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo', 'organismQuantityType', 'recordedBy', 'materialSampleID', 'sampleSizeValue', 'sampleSizeUnit', 'associatedSequences', 'locationID', 'eventDate', 'minimumDepthInMeters', 'maximumDepthInMeters', 'locality', 'decimalLatitude', 'decimalLongitude', 'geodeticDatum', 'parentEventID', 'datasetID', 'occurrenceStatus']

👀 Preview of final taxonomically updated SUBSET occurrence data (first 5 rows, selected columns):


Unnamed: 0,eventID,occurrenceID,verbatimIdentification,scientificName,scientificNameID,taxonRank,nameAccordingTo,kingdom,phylum,class,order,family,genus,species
0,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda,Crustacea,urn:lsid:marinespecies.org:taxname:1066,Subphylum,WoRMS,Animalia,Arthropoda,,,,,
1,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Species,WoRMS,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,
2,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Oithona;Oithona_sp.;,Oithona,urn:lsid:marinespecies.org:taxname:106485,Genus,WoRMS,Animalia,Arthropoda,Copepoda,Cyclopoida,Oithonidae,Oithona,
3,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_ecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Class,WoRMS,Chromista,Myzozoa,Dinophyceae,,,,
4,GOMECC18S_Plate4_53,GOMECC18S_Plate4_53_occ_fa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Family,WoRMS,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,


# WoRMS conversion
Note: the standard `multiprocessing` library cannot be used for this in a Jupyter notebook, so `multiprocess` is needed instead. See [this Stack Overflow thread](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror).

OBIS currently requires taxonomy assignments that match WoRMS; however, none of the commonly used metabarcoding reference databases use WoRMS as the basis of their taxonomy. This means the taxonomic ranks for any given scientific name on WoRMS may not correspond directly to the ranks in the assigned taxonomy. There are ongoing discussions about this problem (see [this GitHub issue](https://github.com/iobis/Project-team-Genetic-Data/issues/5)).

Many of these reference databases, especially those for microbes, include taxa that aren't on WoRMS at all, because the name may not have been fully and officially adopted by the scientific community (or at least not yet by WoRMS). We therefore need a system that searches through the higher taxonomic ranks given, finds the lowest one that matches on WoRMS, and puts that name in the `scientificName` column. The assigned taxonomy is then recorded in `verbatimIdentification`.
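
The rank-fallback idea can be sketched without any network calls. In the example below, `lowest_matching_rank` and `toy_lookup` are hypothetical names, the returned keys `queried_rank` and `queried_name` are illustrative, and the lookup table is a one-entry stand-in for a WoRMS query. It is not the implementation in `WoRMS_matching`.

```python
def lowest_matching_rank(taxon_row, ordered_ranks, lookup):
    """Walk ranks from lowest to highest and return the first name the lookup resolves."""
    for rank in ordered_ranks:
        name = taxon_row.get(rank)
        if not isinstance(name, str) or not name.strip():
            continue  # rank missing or empty in the assigned taxonomy
        record = lookup(name.strip())
        if record is not None:
            return {'queried_rank': rank, 'queried_name': name.strip(), **record}
    return {'queried_rank': None, 'queried_name': None, 'scientificName': 'No match'}

# Stand-in for a WoRMS query: a single known name and its LSID
KNOWN = {'Gammaproteobacteria': 'urn:lsid:marinespecies.org:taxname:393018'}

def toy_lookup(name):
    lsid = KNOWN.get(name)
    if lsid is None:
        return None
    return {'scientificName': name, 'scientificNameID': lsid}

row = {
    'genus': 'OM60(NOR5) clade',
    'family': 'Halieaceae',
    'order': 'Cellvibrionales',
    'class': 'Gammaproteobacteria',
    'phylum': 'Proteobacteria',
    'domain': 'Bacteria',
}
ranks = ['genus', 'family', 'order', 'class', 'phylum', 'domain']
result = lowest_matching_rank(row, ranks, toy_lookup)
print(result)
```

With this toy table the genus, family, and order are skipped and the match lands on the class, while the original string would still be kept in `verbatimIdentification`.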

We had some [issues with the parallelization](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr) on a Mac M1. Adding `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to `.bash_profile` and then applying [this fix](https://github.com/python/cpython/issues/74570) resolved it.
TODO: try running without the `.bash_profile` fix.

In [67]:
os.environ["no_proxy"]="*"

### 16S WoRMS

Species-level IDs from the reference database may be unreliable ([see here](https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494)), so matching starts at genus and works upward.

In [68]:
import WoRMS_matching

In [69]:
import importlib
importlib.reload(WoRMS_matching)

<module 'WoRMS_matching' from 'c:\\Users\\bayde\\OneDrive\\Documents\\NOAA_AOML_Code\\edna2obis\\edna2obis\\src\\WoRMS_matching.py'>

In [70]:
tax_16S = asv_tables['16S V4-V5'][['taxonomy','domain','phylum','class','order','family','genus','species']]

In [71]:
#ignore_index is important!
tax_16S = tax_16S.drop_duplicates(ignore_index=True)

In [72]:
tax_16S.shape

(2729, 8)
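
The `ignore_index=True` argument used when de-duplicating `tax_16S` resets the row labels. The notebook does not say which later step depends on this, so the sketch below only shows the difference on toy data: without the argument, the surviving rows keep their original labels, which leaves gaps that can misalign any code assuming a clean `0..n-1` index.

```python
import pandas as pd

tax = pd.DataFrame({'taxonomy': ['A', 'A', 'B', 'A', 'C']})

# Default behaviour: the first occurrence of each value keeps its original label
kept_labels = tax.drop_duplicates().index.tolist()

# With ignore_index=True the result is relabelled from 0
fresh_labels = tax.drop_duplicates(ignore_index=True).index.tolist()

print(kept_labels, fresh_labels)
```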

In [76]:
if __name__ == '__main__':
    worms_16s = WoRMS_matching.get_worms_from_scientific_name_parallel(
        tax_df=tax_16S,
        ordered_rank_columns=['genus', 'family', 'order', 'class', 'phylum', 'domain'],
        full_tax_column="taxonomy",
        full_tax_vI=True,
        n_proc=7,
    )

In [77]:
worms_16s.head()

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,class,Gammaproteobacteria,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
1,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,class,Bacteroidia,Bacteroidia,urn:lsid:marinespecies.org:taxname:559846,Bacteria,Bacteroidetes,Bacteroidia,,,,Class
2,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,domain,Bacteria,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
3,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,order,Rhodospirillales,Rhodospirillales,urn:lsid:marinespecies.org:taxname:392751,Bacteria,Proteobacteria,Alphaproteobacteria,Rhodospirillales,,,Order
4,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,genus,Sphingobium,Sphingobium,urn:lsid:marinespecies.org:taxname:571470,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Genus


In [78]:
worms_16s[worms_16s["scientificName"]=="No match"]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
242,d__Eukaryota,d__Eukaryota,domain,Eukaryota,No match,,,,,,,,


In [79]:
worms_16s.loc[worms_16s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_16s.loc[worms_16s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [80]:
worms_16s[worms_16s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
97,Unassigned,Unassigned,,,,,,,,,,,


# incertae sedis is the controlled-vocabulary term for unassigned
# also insert it if the name matched only to Eukaryota

In [81]:

# Count rows with no scientificName, fill them with 'incertae sedis', then confirm none remain
print(worms_16s[worms_16s['scientificName'].isna()].shape)
worms_16s.loc[worms_16s['scientificName'].isna(), 'scientificName'] = 'incertae sedis'
worms_16s.loc[worms_16s['scientificName'] == 'incertae sedis', 'scientificNameID'] = 'urn:lsid:marinespecies.org:taxname:12'
print(worms_16s[worms_16s['scientificName'].isna()].shape)

(1, 13)
(0, 13)


In [82]:
worms_16s.to_csv("../processed/worms_16S_matching.tsv",sep="\t",index=False)

In [83]:
worms_16s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_16s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Halieaceae; g__OM60(NOR5)_clade; s__uncultured_Haliea,Gammaproteobacteria,urn:lsid:marinespecies.org:taxname:393018,Bacteria,Proteobacteria,Gammaproteobacteria,,,,Class
1,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Chitinophagales,Bacteroidia,urn:lsid:marinespecies.org:taxname:559846,Bacteria,Bacteroidetes,Bacteroidia,,,,Class
2,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,d__Bacteria; p__Verrucomicrobiota; c__Omnitrophia; o__Omnitrophales; f__Omnitrophales; g__Omnitrophales; s__uncultured_bacterium,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
3,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__AEGEAN-169_marine_group; g__AEGEAN-169_marine_group; s__alpha_prot...,Rhodospirillales,urn:lsid:marinespecies.org:taxname:392751,Bacteria,Proteobacteria,Alphaproteobacteria,Rhodospirillales,,,Order
4,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingobium,Sphingobium,urn:lsid:marinespecies.org:taxname:571470,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Genus


In [84]:
occ['16S V4-V5'].head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,domain,phylum,class,order,family,genus,species,eventID,organismQuantity,occurrenceID
182,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,Bacteria,Proteobacteria,Alphaproteobacteria,Parvibaculales,OCS116 clade,OCS116 clade,uncultured marine,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d
225,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,Bacteria,Proteobacteria,Alphaproteobacteria,Puniceispirillales,SAR116 clade,SAR116 clade,,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916
347,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,Archaea,Thermoplasmatota,Thermoplasmata,Marine Group II,Marine Group II,Marine Group II,,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3
412,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Mitochondria,Mitochondria,uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564
719,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,Bacteria,Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),Marinimicrobia (SAR406 clade),uncultured bacterium,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196


#### Merge Occurrence and worms

In [85]:
occ['16S V4-V5'].shape

(169470, 14)

In [86]:

occ16_test = occ['16S V4-V5'].copy()
occ16_test.drop(columns=['domain','phylum','class','order','family','genus','species'],inplace=True)

occ16_test = occ16_test.merge(worms_16s, how='left', left_on ='taxonomy', right_on='full_tax')
occ16_test.drop(columns='full_tax', inplace=True)
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom
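
Because the merge above is a left join from many occurrence rows to one row per unique taxonomy string, it can be worth checking that no rows were duplicated and that every taxonomy string found a partner. The sketch below shows one way to do that with the `validate` and `indicator` options of `DataFrame.merge`. The two small tables are made-up toy rows, not values from this dataset.

```python
import pandas as pd

occ_demo = pd.DataFrame({
    'featureid': ['f1', 'f2', 'f3'],
    'taxonomy': ['d__Bacteria; p__Proteobacteria',
                 'd__Bacteria; p__Proteobacteria',
                 'Unassigned'],
})
worms_demo = pd.DataFrame({
    'full_tax': ['d__Bacteria; p__Proteobacteria'],
    'scientificName': ['Proteobacteria'],
})

merged = occ_demo.merge(
    worms_demo, how='left', left_on='taxonomy', right_on='full_tax',
    validate='many_to_one',  # raises if full_tax is not unique on the right
    indicator=True,          # adds a '_merge' column describing each row's origin
)
# Taxonomy strings that found no partner in the matching table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'taxonomy'].unique().tolist()
merged = merged.drop(columns=['full_tax', '_merge'])
print(len(merged), unmatched)
```

If `unmatched` is not empty, those strings were missing from the matching table and their `scientificName` will be empty after the merge.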


#### identificationRemarks  

```
occ16_test['identificationRemarks'] = occ16_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ16_test['Confidence'].astype(str) +", against reference database: "+occ16_test['taxa_ref_db']
```

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695'

In [88]:
data['analysis_data'].head()

Unnamed: 0,amplicon_sequenced,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S V4-V5,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 12 2nd PCR no-template controls | 3 Zymo mock community
1,18S V9,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database.org/ | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 7 2nd PCR no-template controls


In [89]:
occ16_test['taxa_class_method'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '16S V4-V5','taxa_class_method'].item()
occ16_test['taxa_ref_db'] = data['analysis_data'].loc[data['analysis_data']['amplicon_sequenced'] == '16S V4-V5','taxa_ref_db'].item()

occ16_test['identificationRemarks'] = occ16_test['taxa_class_method'] +", confidence (at lowest specified taxon): "+occ16_test['Confidence'].astype(str) +", against reference database: "+occ16_test['taxa_ref_db']

In [90]:
occ16_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695'
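
The lookup in cell 89 relies on `.item()`, which raises a `ValueError` unless exactly one row of `analysis_data` matches the marker. Below is a minimal sketch of the same steps wrapped in a helper and run on toy data. The function name `build_identification_remarks` and the toy values are assumptions for illustration and are not part of edna2obis.

```python
import pandas as pd


def build_identification_remarks(occ, analysis, marker):
    """Return an identificationRemarks Series for one marker (hypothetical helper)."""
    rows = analysis.loc[analysis["amplicon_sequenced"] == marker]
    if len(rows) != 1:
        # .item() would raise a less descriptive ValueError in this situation
        raise ValueError(
            f"expected exactly one analysis row for {marker!r}, found {len(rows)}"
        )
    method = rows["taxa_class_method"].item()
    ref_db = rows["taxa_ref_db"].item()
    return (
        method
        + ", confidence (at lowest specified taxon): "
        + occ["Confidence"].astype(str)
        + ", against reference database: "
        + ref_db
    )


# Toy stand-ins for data['analysis_data'] and occ16_test
analysis = pd.DataFrame({
    "amplicon_sequenced": ["16S V4-V5", "18S V9"],
    "taxa_class_method": ["naive-bayes classifier", "naive-bayes classifier"],
    "taxa_ref_db": ["Silva SSU Ref NR 99 v138.1", "PR2 v5.0.1"],
})
occ = pd.DataFrame({"Confidence": [0.83, 1.0]})
remarks = build_identification_remarks(occ, analysis, "16S V4-V5")
print(remarks[0])
```

Checking the number of matching rows explicitly makes a misspelled or duplicated `amplicon_sequenced` value easier to diagnose than the bare `.item()` error.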

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [91]:
occ16_test['taxonID'] = 'ASV:'+occ16_test['featureid']
occ16_test['basisOfRecord'] = 'MaterialSample'
occ16_test['nameAccordingTo'] = "WoRMS"
occ16_test['organismQuantityType'] = "DNA sequence reads"
occ16_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [92]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_meth',
       'nucl_acid_ext', 'amplicon_sequenced', 'target_gene',
       'target_subfragment', 'pcr_primer_forward', 'pcr_primer_reverse',
       'pcr_primer_name_forward', 'pcr_primer_name_reverse',
       'pcr_primer_reference', 'pcr_cond', 'nucl_acid_amp', 'adapters',
       'mid_barcode'],
      dtype='object')

In [93]:
occ16_test = occ16_test.merge(data['prep_data'].loc[data['prep_data']['amplicon_sequenced'] == '16S V4-V5',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

In [94]:
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,taxa_class_method,taxa_ref_db,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,sample_name,sra_accession,biosample_accession
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094
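
The cells that populate `materialSampleID` and `associatedSequences` are not shown in this section, but both columns appear in the merged output further down: `materialSampleID` holds the BioSample accession and `associatedSequences` is a pipe-delimited list of NCBI URLs. The following is a hedged sketch of how such values could be assembled from the accessions merged above. The helper name and the BioProject placeholder `PRJNAXXXXXX` are assumptions, since the real BioProject accession is truncated in the output.

```python
NCBI_BASE = "https://www.ncbi.nlm.nih.gov"


def build_associated_sequences(sra_accession, biosample_accession, bioproject_accession):
    """Join NCBI record URLs with ' | ' (hypothetical helper)."""
    urls = [
        f"{NCBI_BASE}/sra/{sra_accession}",
        f"{NCBI_BASE}/biosample/{biosample_accession}",
        f"{NCBI_BASE}/bioproject/{bioproject_accession}",
    ]
    return " | ".join(urls)


# One occurrence row represented as a plain dict for illustration
record = {"sra_accession": "SRR26148187", "biosample_accession": "SAMN37516094"}
record["materialSampleID"] = record["biosample_accession"]
record["associatedSequences"] = build_associated_sequences(
    record["sra_accession"],
    record["biosample_accession"],
    "PRJNAXXXXXX",  # placeholder, not a real accession
)
print(record["associatedSequences"])
```

On a DataFrame, the same function could be applied row-wise with `DataFrame.apply(..., axis=1)`.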


#### eventID

In [95]:
occ16_test['eventID'] = occ16_test['eventID']+"_16S"

#### sampleSize 

In [97]:
# get sampleSize by total number of reads per sample
x = asv_tables['16S V4-V5'].sum(numeric_only=True).astype('int')
x.index = x.index+"_16S"
occ16_test['sampleSizeValue'] = occ16_test['eventID'].map(x).astype('str')
occ16_test['sampleSizeUnit'] = 'DNA sequence reads'
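
`sampleSizeValue` is the total read count of the sample an occurrence came from, so every occurrence in the same event shares one value. Here is a small sketch of the sum-then-map step on a toy ASV table (sample names and counts are made up). It also checks that no event was left unmapped, because a missing value would otherwise become the string `nan` after the cast.

```python
import pandas as pd

# Toy ASV table: one row per ASV, one numeric column per sample
asv_table = pd.DataFrame({
    "featureid": ["asv1", "asv2", "asv3"],
    "sampleA": [18, 36, 0],
    "sampleB": [5, 0, 7],
})

# Total reads per sample; non-numeric columns are skipped
reads_per_sample = asv_table.sum(numeric_only=True).astype("int")
reads_per_sample.index = reads_per_sample.index + "_16S"

# Long-format occurrences that already carry the suffixed eventID
occ = pd.DataFrame({"eventID": ["sampleA_16S", "sampleA_16S", "sampleB_16S"]})
mapped = occ["eventID"].map(reads_per_sample)

# Fail early instead of silently writing "nan" after the string cast
assert mapped.notna().all(), "some eventIDs have no read total"
occ["sampleSizeValue"] = mapped.astype("int").astype("str")
occ["sampleSizeUnit"] = "DNA sequence reads"
print(occ)
```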

In [98]:
occ16_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,taxa_class_method,taxa_ref_db,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,sample_name,sra_accession,biosample_accession,sampleSizeValue,sampleSizeUnit
0,00c4c1c65d8669ed9f07abe149f9a01d,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,0.83219,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
1,00e6c13fe86364a5084987093afa1916,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,0.86704,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
2,015dad1fafca90944d905beb2a980bc3,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,1.0,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
3,019c88c6ade406f731954f38e3461564,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,0.952911,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads
4,02dfb0869af4bf549d290d48e66e2196,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,0.818195,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,GOMECC4_27N_Sta1_DCM_A,SRR26148187,SAMN37516094,16187,DNA sequence reads


In [99]:
# drop unneeded columns
occ16_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

### merge event and occurrence

In [107]:
all_event_data.tail()

Unnamed: 0,eventID,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
939,GOMECC4_CAPECORAL_Sta141_DCM_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_B,noaa-aoml-gomecc4
940,GOMECC4_CAPECORAL_Sta141_DCM_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,59,59,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_DCM_C,noaa-aoml-gomecc4
941,GOMECC4_CAPECORAL_Sta141_Surface_A_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_A,noaa-aoml-gomecc4
942,GOMECC4_CAPECORAL_Sta141_Surface_B_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_B,noaa-aoml-gomecc4
943,GOMECC4_CAPECORAL_Sta141_Surface_C_18S,CAPECORAL_Sta141,2021-10-20T12:47-04:00,4,4,USA: Gulf of Mexico,"Mexico, Gulf of",US,25.574,-84.843,WGS84,CTD rosette,GOMECC4_CAPECORAL_Sta141_Surface_C,noaa-aoml-gomecc4


In [108]:
occ16_merged = occ16_test.merge(all_event_data,how='left',on='eventID')
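
A left merge keeps every occurrence row even when its `eventID` is missing from the event table, which leaves the event columns empty. Below is a sketch of one way to guard against that, using the `validate` and `indicator` options of `DataFrame.merge` on toy frames (the data and variable names are illustrative only).

```python
import pandas as pd

events = pd.DataFrame({
    "eventID": ["s1_16S", "s2_16S"],
    "decimalLatitude": [26.997, 25.574],
})
occurrences = pd.DataFrame({
    "occurrenceID": ["s1_16S_occA", "s1_16S_occB", "s2_16S_occA"],
    "eventID": ["s1_16S", "s1_16S", "s2_16S"],
})

merged = occurrences.merge(
    events,
    how="left",
    on="eventID",
    validate="many_to_one",  # raises MergeError if eventID is duplicated in events
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Occurrences whose eventID was not found in the event table
unmatched = merged.loc[merged["_merge"] == "left_only", "eventID"].unique()
assert len(unmatched) == 0, f"occurrences without an event: {unmatched}"
merged = merged.drop(columns="_merge")
print(merged.shape)
```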

In [109]:
occ16_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
0,TACGGAGGGGGCTAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTAGACAGTTGAGGGTGAAATCCCGGAGCTTAACTTCGGAACTGCCCCCAATACTACTAATCTAGAGTTCGGAAGAGGTGAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,18,GOMECC4_27N_Sta1_DCM_A_16S_occ00c4c1c65d8669ed9f07abe149f9a01d,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Parvibaculales; f__OCS116_clade; g__OCS116_clade; s__uncultured_marine,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.832189583, against reference database: Silva SSU Ref ...",ASV:00c4c1c65d8669ed9f07abe149f9a01d,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
1,TACGAAGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTCTTTAAGTTAGGCGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGCTTAAGACTGGAGAGCTAGAAAACGGAAGAGGGTAGTGGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,36,GOMECC4_27N_Sta1_DCM_A_16S_occ00e6c13fe86364a5084987093afa1916,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Puniceispirillales; f__SAR116_clade; g__SAR116_clade,Alphaproteobacteria,urn:lsid:marinespecies.org:taxname:392750,Bacteria,Proteobacteria,Alphaproteobacteria,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.867040054, against reference database: Silva SSU Ref ...",ASV:00e6c13fe86364a5084987093afa1916,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
2,TACCGGCGCCTCAAGTGGTAGTCGCTTTTATTGGGCCTAAAACGTCCGTAGCCGGTCTGGTACATTCGTGGGTAAATCAACTCGCTTAACGAGTTGAATTCTGCGAGGACGGCCAGACTTGGGACCGGGAGAGGTGTGGGGTACTC...,GOMECC4_27N_Sta1_DCM_A_16S,49,GOMECC4_27N_Sta1_DCM_A_16S_occ015dad1fafca90944d905beb2a980bc3,d__Archaea; p__Thermoplasmatota; c__Thermoplasmata; o__Marine_Group_II; f__Marine_Group_II; g__Marine_Group_II,Thermoplasmata,urn:lsid:marinespecies.org:taxname:416268,Archaea,Euryarchaeota,Thermoplasmata,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 1.0, against reference database: Silva SSU Ref NR 99 v1...",ASV:015dad1fafca90944d905beb2a980bc3,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
3,TACAGGAGGGACGAGTGTTACTCGGAATGATTAGGCGTAAAGGGTCATTTAAGCGGTCCGATAAGTTAAAAGCCAACAGTTAGAGCCTAACTCTTTCAAGCTTTTAATACTGTCAGACTAGAGTATATCAGAGAATAGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,2,GOMECC4_27N_Sta1_DCM_A_16S_occ019c88c6ade406f731954f38e3461564,d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__uncultured_bacterium,Rickettsiales,urn:lsid:marinespecies.org:taxname:570969,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,,,Order,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.952910602, against reference database: Silva SSU Ref ...",ASV:019c88c6ade406f731954f38e3461564,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
4,TACGAGGGGTGCTAGCGTTGTCCGGAATAACTGGGCGTAAAGGGTCCGTAGGCGTTTTGCTAAGTTGATCGTTAAATCCATCGGCTTAACCGATGACATGCGATCAAAACTGGCAGAATAGAATATGTGAGGGGAATGTAGAATTC...,GOMECC4_27N_Sta1_DCM_A_16S,3,GOMECC4_27N_Sta1_DCM_A_16S_occ02dfb0869af4bf549d290d48e66e2196,d__Bacteria; p__Marinimicrobia_(SAR406_clade); c__Marinimicrobia_(SAR406_clade); o__Marinimicrobia_(SAR406_clade); f__Marinimicrobia_(SAR406_clade...,Bacteria,urn:lsid:marinespecies.org:taxname:6,Bacteria,,,,,,Kingdom,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.818195053, against reference database: Silva SSU Ref ...",ASV:02dfb0869af4bf549d290d48e66e2196,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,16187,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26148187 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4


In [110]:
occ16_merged.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence_16S.tsv",sep="\t",index=False)

### 18S worms

The PR2 18S database provides WoRMS IDs for species that are in WoRMS. We will read in that file, assign the known WoRMS IDs, then search WoRMS for the taxa that remain unannotated.

In [111]:
# Update the path to the PR2 database
pr2_18S = pd.read_excel("C:/Users/bayde/OneDrive/Documents/NOAA_AOML_Code/reference_DBs/pr2_version_5.0.0_taxonomy.xlsx",
    index_col=None, na_values=[""])
pr2_18S = pr2_18S.dropna(subset=['worms_id'])
pr2_18S['worms_id'] = pr2_18S['worms_id'].astype('int').astype('str')
# Normalize species names; raw strings avoid invalid-escape warnings in the regexes
pr2_18S['species'] = pr2_18S['species'].replace('_',' ',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace(r' sp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace(r' spp\.','',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('-',' ',regex=True)
pr2_18S['species'] = pr2_18S['species'].replace('/',' ',regex=True)

In [112]:
pr2_18S_dict = dict(zip(pr2_18S.species,pr2_18S.worms_id))


In [113]:
(pr2_18S_dict['Aphanocapsa feldmannii'])

'614894'
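
Lookups in `pr2_18S_dict` only succeed when a query name has been cleaned the same way as the PR2 `species` column. The sketch below applies the same five substitutions as a plain function, so that names taken from the ASV taxonomy can be normalized before the lookup. The function name and the one-entry toy dictionary are assumptions.

```python
import re


def normalize_pr2_species(name):
    """Apply the substitutions used on the PR2 species column (hypothetical helper)."""
    name = name.replace("_", " ")
    name = re.sub(r" sp\.", "", name)
    name = re.sub(r" spp\.", "", name)
    name = name.replace("-", " ")
    name = name.replace("/", " ")
    return name


# Toy dictionary with a single entry taken from the lookup shown above
toy_dict = {"Aphanocapsa feldmannii": "614894"}
print(normalize_pr2_species("Paraphysomonas_sp."))
print(toy_dict.get(normalize_pr2_species("Aphanocapsa_feldmannii")))
```

Using `dict.get` instead of indexing returns `None` for names that are absent, which is convenient when the next step is a fallback search.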

#### code to get record from aphia id

We had some [issues with the parallelization](https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr) on a Mac M1. Adding `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` to `.bash_profile` and then applying the workaround from [this CPython issue](https://github.com/python/cpython/issues/74570) (the `no_proxy` setting in the next cell) fixed it.  
TODO: try running without the `.bash_profile` fix.

In [114]:
os.environ["no_proxy"]="*"

In [115]:
tax_18S = asv_tables['18S V9'][['taxonomy','domain','supergroup','division','subdivision','class','order','family','genus','species']]

In [116]:
tax_18S = tax_18S.drop_duplicates(ignore_index=True)
tax_18S.shape

(1374, 10)

In [117]:
if __name__ == '__main__':
    worms_18s = WoRMS_matching.get_worms_from_aphiaid_or_name_parallel(
        tax_df=tax_18S,
        worms_dict=pr2_18S_dict,
        ordered_rank_columns=['species','genus','family','order','class','subdivision','division','supergroup'],
        full_tax_column="taxonomy",
        full_tax_vI=True,
        n_proc=6)

Aspergillus penicillioides: No match, species
Protoscenium cf intricatum: No match, species
Euglypha acanthophora: No match, species
RAD B X Group IVe X: No match, species
RAD B X Group IVe X: No match, genus
Eimeriida: No match, order
RAD B X Group IVe: No match, family
Coccidiomorphea: No match, class
MAST 12A: No match, species
MAST 12A: No match, genus
Nibbleromonas: No match, genus
RAD B X: No match, order
MAST 12: No match, family
Nibbleridae: No match, family
RAD B: No match, class
Opalozoa X: No match, order
Nibbleridida: No match, order
Malus x: No match, species
Nibbleridea: No match, class
Euduboscquella cachoni: No match, species
Malus: No match, genus
Obazoa: No match, supergroup
Nibbleridia X: No match, subdivision
Pectinoida: No match, family
Embryophyceae XX: No match, family
Nibbleridia: No match, division
Embryophyceae X: No match, order
Provora: No match, supergroup
Skeletonema menzellii: No match, species
Dictyochales X: No match, species
Dictyochales X: No match, g
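
The log above suggests the matching order: each taxon is tried at the most specific rank first and, on failure, at successively higher ranks from `ordered_rank_columns`. The following is a sketch of that fallback idea only, not the implementation in `WoRMS_matching`. Both `match_lowest_rank` and the `lookup` callable are hypothetical stand-ins for the real WoRMS queries.

```python
def match_lowest_rank(taxon, ordered_ranks, lookup):
    """Return (rank, name, match) for the first rank that resolves, else None.

    `taxon` maps rank -> name; `lookup` returns a match or None and stands in
    for a WoRMS query (an assumption made for this sketch).
    """
    for rank in ordered_ranks:
        name = taxon.get(rank)
        if not name:
            continue  # rank not assigned for this taxon
        match = lookup(name)
        if match is not None:
            return rank, name, match
        print(f"{name}: No match, {rank}")
    return None


# Toy lookup table standing in for WoRMS
known = {"Apicomplexa": "urn:lsid:marinespecies.org:taxname:22565"}
taxon = {
    "division": "Alveolata",
    "subdivision": "Apicomplexa",
    "class": "Coccidiomorphea",
    "order": "Eimeriida",
}
ranks = ["species", "genus", "family", "order", "class",
         "subdivision", "division", "supergroup"]
result = match_lowest_rank(taxon, ranks, known.get)
print(result)
```

Returning `None` when nothing resolves leaves the choice of a default, such as assigning the record to a root taxon, to the caller.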

In [132]:
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Paraphysomonas,urn:lsid:marinespecies.org:taxname:291417,Chromista,Ochrophyta,Chrysophyceae,Chromulinales,Paraphysomonadaceae,Paraphysomonas,Genus
1,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Gonyaulax polygramma,urn:lsid:marinespecies.org:taxname:110035,Chromista,Myzozoa,Dinophyceae,Gonyaulacales,Gonyaulacaceae,Gonyaulax,Species
2,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Mamiellaceae,urn:lsid:marinespecies.org:taxname:17663,Plantae,Chlorophyta,Mamiellophyceae,Mamiellales,Mamiellaceae,,Family
3,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Blastodinium galatheanum,urn:lsid:marinespecies.org:taxname:623673,Chromista,Myzozoa,Dinophyceae,Blastodiniales,Blastodinidae,Blastodinium,Species
4,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Apicomplexa,urn:lsid:marinespecies.org:taxname:22565,Chromista,Myzozoa,,,,,Subphylum


In [None]:
# which taxa had absolutely no matches
worms_18s[worms_18s["scientificName"]=="No match"]['old name'].unique()

In [120]:
worms_18s[worms_18s["scientificName"]=="No match"].head(20)

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
18,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Haliphthorales;Haliphthorales_X;Haliphthorales_X_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Haliphthorales;Haliphthorales_X;Haliphthorales_X_sp.;,supergroup,TSAR,No match,,,,,,,,
19,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Peronosporomycetes_XX;Peronosporomycetes_XXX;Peronosporomycetes_XXX_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X;Peronosporomycetes_XX;Peronosporomycetes_XXX;Peronosporomycetes_XXX_sp.;,supergroup,TSAR,No match,,,,,,,,
28,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X,Eukaryota;TSAR;Stramenopiles;Gyrista;Peronosporomycetes;Peronosporomycetes_X,supergroup,TSAR,No match,,,,,,,,
91,Eukaryota;TSAR;Stramenopiles;Stramenopiles_X;Stramenopiles_X-Group-7;Stramenopiles_X-Group-7_X;Stramenopiles_X-Group-7_XX;Stramenopiles_X-Group-7_...,Eukaryota;TSAR;Stramenopiles;Stramenopiles_X;Stramenopiles_X-Group-7;Stramenopiles_X-Group-7_X;Stramenopiles_X-Group-7_XX;Stramenopiles_X-Group-7_...,supergroup,TSAR,No match,,,,,,,,
116,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae;Mediophyceae_X;Mediophyceae_XX;Mediophyceae_XXX;Mediophyceae_XXX_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Mediophyceae;Mediophyceae_X;Mediophyceae_XX;Mediophyceae_XXX;Mediophyceae_XXX_sp.;,supergroup,TSAR,No match,,,,,,,,
125,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2,supergroup,TSAR,No match,,,,,,,,
131,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2C;MAST-2C_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2C;MAST-2C_sp.;,supergroup,TSAR,No match,,,,,,,,
185,Eukaryota;Haptista;Centroplasthelida;Centroplasthelida_X,Eukaryota;Haptista;Centroplasthelida;Centroplasthelida_X,supergroup,Haptista,No match,,,,,,,,
187,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2_X;MAST-2_X_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2_X;MAST-2_X_sp.;,supergroup,TSAR,No match,,,,,,,,
189,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2B;MAST-2B_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Gyrista_X;Gyrista_XX;MAST-2;MAST-2B;MAST-2B_sp.;,supergroup,TSAR,No match,,,,,,,,


In [121]:
worms_18s.loc[worms_18s["scientificName"]=="No match",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["scientificName"]=="Biota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [122]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
120,Eukaryota;Haptista,Eukaryota;Haptista,supergroup,Haptista,,,,,,,,,
77,Eukaryota,Eukaryota,,,,,,,,,,,
109,Unassigned,Unassigned,,,,,,,,,,,
189,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,,aphiaID,,,,,,,,,species


In [123]:
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota;Haptista",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificName'] = "Biota"
worms_18s.loc[worms_18s["full_tax"]=="Eukaryota",'scientificNameID'] = "urn:lsid:marinespecies.org:taxname:1"


In [124]:
worms_18s[worms_18s['scientificName'].isna()]

Unnamed: 0,full_tax,verbatimIdentification,old_taxonRank,old name,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
109,Unassigned,Unassigned,,,,,,,,,,,
189,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Spiniferites;Spiniferites_mirabilis;,,aphiaID,,,,,,,,,species


In [125]:

# Remaining rows without a scientificName are assigned to incertae sedis
print(worms_18s[worms_18s['scientificName'].isna()].shape)
worms_18s.loc[worms_18s['scientificName'].isna(),'scientificName'] = 'incertae sedis'
worms_18s.loc[worms_18s['scientificName'] == 'incertae sedis','scientificNameID'] = 'urn:lsid:marinespecies.org:taxname:12'
print(worms_18s[worms_18s['scientificName'].isna()].shape)

(2, 13)
(0, 13)
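
After these fallback assignments, every row should carry both a `scientificName` and a `scientificNameID`. Here is a sketch of the two-step fallback followed by a completeness check, run on a toy frame. The toy rows and the constant names are illustrative, while the two LSIDs are the ones used in the cells above.

```python
import pandas as pd

BIOTA_ID = "urn:lsid:marinespecies.org:taxname:1"
INCERTAE_SEDIS_ID = "urn:lsid:marinespecies.org:taxname:12"

toy = pd.DataFrame({
    "full_tax": ["Eukaryota;TSAR", "Unassigned", "Eukaryota;TSAR;Alveolata;Apicomplexa"],
    "scientificName": ["No match", None, "Apicomplexa"],
    "scientificNameID": [None, None, "urn:lsid:marinespecies.org:taxname:22565"],
})

# Step 1: taxa that were searched but never matched become Biota
no_match = toy["scientificName"] == "No match"
toy.loc[no_match, ["scientificName", "scientificNameID"]] = ["Biota", BIOTA_ID]

# Step 2: rows that still have no name become incertae sedis
missing = toy["scientificName"].isna()
toy.loc[missing, ["scientificName", "scientificNameID"]] = ["incertae sedis", INCERTAE_SEDIS_ID]

# Completeness check before writing the matching table to disk
assert toy[["scientificName", "scientificNameID"]].notna().all().all()
print(toy[["scientificName", "scientificNameID"]])
```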


In [126]:
worms_18s[worms_18s["old name"]=="aphiaID"].shape

(332, 13)

In [127]:
worms_18s.to_csv("../processed/worms_18S_matching.tsv",sep="\t",index=False)

In [128]:
worms_18s.drop(columns=['old name','old_taxonRank'],inplace=True)
worms_18s.head()

Unnamed: 0,full_tax,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Eukaryota;TSAR;Stramenopiles;Gyrista;Chrysophyceae;Paraphysomonadales;Paraphysomonadaceae;Paraphysomonas;Paraphysomonas_sp.;,Paraphysomonas,urn:lsid:marinespecies.org:taxname:291417,Chromista,Ochrophyta,Chrysophyceae,Chromulinales,Paraphysomonadaceae,Paraphysomonas,Genus
1,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gonyaulacales;Gonyaulacaceae;Gonyaulax;Gonyaulax_polygramma;,Gonyaulax polygramma,urn:lsid:marinespecies.org:taxname:110035,Chromista,Myzozoa,Dinophyceae,Gonyaulacales,Gonyaulacaceae,Gonyaulax,Species
2,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Eukaryota;Archaeplastida;Chlorophyta;Chlorophyta_X;Mamiellophyceae;Mamiellales;Mamiellaceae,Mamiellaceae,urn:lsid:marinespecies.org:taxname:17663,Plantae,Chlorophyta,Mamiellophyceae,Mamiellales,Mamiellaceae,,Family
3,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Blastodiniaceae;Blastodinium;Blastodinium_galatheanum;,Blastodinium galatheanum,urn:lsid:marinespecies.org:taxname:623673,Chromista,Myzozoa,Dinophyceae,Blastodiniales,Blastodinidae,Blastodinium,Species
4,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Eukaryota;TSAR;Alveolata;Apicomplexa;Coccidiomorphea;Eimeriida,Apicomplexa,urn:lsid:marinespecies.org:taxname:22565,Chromista,Myzozoa,,,,,Subphylum


#### Merge Occurrence and worms

In [129]:
occ['18S V9'].shape

(149182, 16)

In [130]:
# Merge occurrence records with the WoRMS matches on the full taxonomy string
occ18_test = occ['18S V9'].copy()
# Drop the reference-database rank columns; the ranks from worms_18s replace them
occ18_test.drop(columns=['domain','supergroup','division','subdivision','class','order','family','genus','species'],inplace=True)

occ18_test = occ18_test.merge(worms_18s, how='left', left_on ='taxonomy', right_on='full_tax')
occ18_test.drop(columns='full_tax', inplace=True)
occ18_test.head()

Unnamed: 0,featureid,sequence,taxonomy,Confidence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank
0,36aa75f9b28f5f831c2d631ba65c2bcb,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,0.922099,GOMECC4_27N_Sta1_DCM_A,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species
1,4e38e8ced9070952b314e1880bede1ca,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,0.999947,GOMECC4_27N_Sta1_DCM_A,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species
2,2a31e5c01634165da99e7381279baa75,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,0.779948,GOMECC4_27N_Sta1_DCM_A,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus
3,ecee60339b2fb88ea6d1c8d18054bed4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,0.999931,GOMECC4_27N_Sta1_DCM_A,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class
4,fa1f1a97dd4ae7c826009186bad26384,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,0.986908,GOMECC4_27N_Sta1_DCM_A,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family
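
A left merge like the one above silently adds rows if the right-hand table contains a taxonomy string more than once, and silently leaves blanks if a string is missing. The sketch below, which uses hypothetical toy tables and placeholder taxon names, shows how the `validate` and `indicator` options of `DataFrame.merge` can surface both problems:

```python
import pandas as pd

# Hypothetical miniature versions of the occurrence and WoRMS tables.
occ_toy = pd.DataFrame({
    "featureid": ["a1", "b2", "c3"],
    "taxonomy": ["tax_string_1", "tax_string_1", "tax_string_2"],
})
worms_toy = pd.DataFrame({
    "full_tax": ["tax_string_1", "tax_string_2"],
    "scientificName": ["Taxon A", "Taxon B"],
})

merged = occ_toy.merge(
    worms_toy,
    how="left",
    left_on="taxonomy",
    right_on="full_tax",
    validate="many_to_one",  # raises MergeError if full_tax is not unique
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows without a WoRMS match would be flagged 'left_only'.
n_without_match = int((merged["_merge"] == "left_only").sum())
print(len(occ_toy), len(merged), n_without_match)
```

With `validate="many_to_one"` the row count after the merge should equal the row count before it, which is the property the `shape` checks in this notebook rely on.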


#### identificationRemarks

In [131]:
data['analysis_data'].head()

Unnamed: 0,amplicon_sequenced,ampliconSize,trim_method,cluster_method,pid_clustering,taxa_class_method,taxa_ref_db,code_repo,identificationReferences,controls_used
0,16S V4-V5,411,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392695 | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 12 2nd PCR no-template controls | 3 Zymo mock community
1,18S V9,260,cutadapt,Tourmaline; qiime2-2021.2; dada2,ASV,Tourmaline; qiime2-2021.2; naive-bayes classifier,PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706,https://github.com/aomlomics/gomecc,10.5281/zenodo.8392706 | https://pr2-database.org/ | https://github.com/aomlomics/tourmaline,12 distilled water blanks | 2 PCR no-template controls | 7 extraction blanks | 7 2nd PCR no-template controls


In [134]:
# Look up the classification method and reference database for the 18S V9 assay
is_18s = data['analysis_data']['amplicon_sequenced'] == '18S V9'
occ18_test['taxa_class_method'] = data['analysis_data'].loc[is_18s, 'taxa_class_method'].item()
occ18_test['taxa_ref_db'] = data['analysis_data'].loc[is_18s, 'taxa_ref_db'].item()

# Combine the method, per-ASV confidence, and reference database into one remark
occ18_test['identificationRemarks'] = occ18_test['taxa_class_method'] + ", confidence (at lowest specified taxon): " + occ18_test['Confidence'].astype(str) + ", against reference database: " + occ18_test['taxa_ref_db']

In [135]:
occ18_test['identificationRemarks'][0]

'Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706'
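
The remark is a fixed template filled with three values. As an illustration only, the helper below (`build_identification_remarks` is a hypothetical name, not part of edna2obis) reproduces the template for a single record, using the values shown in the output above:

```python
def build_identification_remarks(method: str, confidence: float, ref_db: str) -> str:
    """Fill the identificationRemarks template for one occurrence record."""
    return (
        f"{method}, confidence (at lowest specified taxon): {confidence}"
        f", against reference database: {ref_db}"
    )


remark = build_identification_remarks(
    "Tourmaline; qiime2-2021.2; naive-bayes classifier",
    0.92209885,
    "PR2 v5.0.1; V9 1391f-1510r region; 10.5281/zenodo.8392706",
)
print(remark)
```

Keeping the template in one place makes it easier to apply the same wording to the 16S and 18S tables.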

#### taxonID, basisOfRecord, nameAccordingTo, organismQuantityType, recordedBy

In [136]:
occ18_test['taxonID'] = 'ASV:'+occ18_test['featureid']
occ18_test['basisOfRecord'] = 'MaterialSample'
occ18_test['nameAccordingTo'] = "WoRMS"
occ18_test['organismQuantityType'] = "DNA sequence reads"
occ18_test['recordedBy'] = data['study_data']['recordedBy'].values[0]

#### associatedSequences, materialSampleID

In [137]:
data['prep_data'].columns

Index(['sample_name', 'library_id', 'title', 'library_strategy',
       'library_source', 'library_selection', 'lib_layout', 'platform',
       'instrument_model', 'design_description', 'filetype', 'filename',
       'filename2', 'biosample_accession', 'sra_accession', 'seq_meth',
       'nucl_acid_ext', 'amplicon_sequenced', 'target_gene',
       'target_subfragment', 'pcr_primer_forward', 'pcr_primer_reverse',
       'pcr_primer_name_forward', 'pcr_primer_name_reverse',
       'pcr_primer_reference', 'pcr_cond', 'nucl_acid_amp', 'adapters',
       'mid_barcode'],
      dtype='object')

In [138]:
occ18_test = occ18_test.merge(data['prep_data'].loc[data['prep_data']['amplicon_sequenced'] == '18S V9',['sample_name','sra_accession','biosample_accession']], how='left', left_on ='eventID', right_on='sample_name')

#### eventID

In [139]:
occ18_test['eventID'] = occ18_test['eventID']+"_18S"

#### sampleSize

In [140]:
# get sampleSize by total number of reads per sample
x = asv_tables['18S V9'].sum(numeric_only=True).astype('int')
x.index = x.index+"_18S"
occ18_test['sampleSizeValue'] = occ18_test['eventID'].map(x).astype('str')
occ18_test['sampleSizeUnit'] = 'DNA sequence reads'
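
The sample size is the column total of the ASV table, looked up by `eventID`. A small sketch with a hypothetical two-sample ASV table (`asv_toy`, `sampleA`, and `sampleB` are made-up names) shows the same sum-then-map pattern and a check for events that found no total:

```python
import pandas as pd

# Hypothetical ASV table: one row per ASV, one numeric column per sample.
asv_toy = pd.DataFrame({
    "featureid": ["a1", "b2"],
    "sampleA": [10, 5],
    "sampleB": [0, 7],
})

# Total reads per sample; the non-numeric featureid column is skipped.
reads_per_sample = asv_toy.sum(numeric_only=True).astype(int)
# Match the "_18S" suffix that was appended to eventID.
reads_per_sample.index = reads_per_sample.index + "_18S"

occ_toy = pd.DataFrame({"eventID": ["sampleA_18S", "sampleB_18S", "sampleA_18S"]})
occ_toy["sampleSizeValue"] = occ_toy["eventID"].map(reads_per_sample)

# Any eventID absent from the ASV table would map to NaN.
n_unmapped = int(occ_toy["sampleSizeValue"].isna().sum())
print(occ_toy["sampleSizeValue"].tolist(), n_unmapped)
```

Checking for NaN before casting to `str` matters, because an unmapped value would otherwise be written out as the literal text `nan`.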

In [141]:
# drop unneeded columns
occ18_test.drop(columns=['sample_name','featureid','taxonomy','Confidence','taxa_class_method','taxa_ref_db'],inplace=True)

In [None]:
occ18_test['associatedSequences'] = "https://www.ncbi.nlm.nih.gov/sra/"+occ18_test['sra_accession']+' | '+ "https://www.ncbi.nlm.nih.gov/biosample/"+occ18_test['biosample_accession']+' | '+"https://www.ncbi.nlm.nih.gov/bioproject/"+data['study_data']['bioproject_accession'].values[0]
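
When strings are concatenated column-wise in pandas, a missing accession in any one column typically turns the whole concatenated value into NaN. The sketch below is one possible alternative, not the notebook's method: `associated_sequences` is a hypothetical helper that joins only the links whose accession is present.

```python
from typing import Optional

NCBI_BASE = "https://www.ncbi.nlm.nih.gov"


def associated_sequences(sra: Optional[str],
                         biosample: Optional[str],
                         bioproject: Optional[str]) -> str:
    """Join the available NCBI links with ' | ', skipping missing accessions."""
    links = []
    for section, accession in (("sra", sra),
                               ("biosample", biosample),
                               ("bioproject", bioproject)):
        # Only non-empty strings count; None and NaN floats are skipped.
        if isinstance(accession, str) and accession:
            links.append(f"{NCBI_BASE}/{section}/{accession}")
    return " | ".join(links)


# Accessions taken from the output shown in this notebook; bioproject omitted.
example = associated_sequences("SRR26161153", "SAMN37516094", None)
print(example)
```

Such a helper could be applied row-wise with `DataFrame.apply`, at the cost of being slower than vectorized concatenation on large tables.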

In [143]:
occ18_test.rename(columns={'biosample_accession': 'materialSampleID',
                           'sequence': 'DNA_sequence'},inplace=True)

In [144]:
# drop unneeded columns
occ18_test.drop(columns=['sra_accession'],inplace=True)

In [145]:
occ18_test.columns

Index(['DNA_sequence', 'eventID', 'organismQuantity', 'occurrenceID',
       'verbatimIdentification', 'scientificName', 'scientificNameID',
       'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'taxonRank',
       'identificationRemarks', 'taxonID', 'basisOfRecord', 'nameAccordingTo',
       'organismQuantityType', 'recordedBy', 'materialSampleID',
       'sampleSizeValue', 'sampleSizeUnit', 'associatedSequences'],
      dtype='object')

In [146]:
occ18_test.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 ...",ASV:36aa75f9b28f5f831c2d631ba65c2bcb,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999946735, against reference database: PR2 v5.0.1; V9...",ASV:4e38e8ced9070952b314e1880bede1ca,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.779948049, against reference database: PR2 v5.0.1; V9...",ASV:2a31e5c01634165da99e7381279baa75,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999930607, against reference database: PR2 v5.0.1; V9...",ASV:ecee60339b2fb88ea6d1c8d18054bed4,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.98690791, against reference database: PR2 v5.0.1; V9 ...",ASV:fa1f1a97dd4ae7c826009186bad26384,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...


### Merge event and occurrence

In [147]:
occ18_merged = occ18_test.merge(all_event_data,how='left',on='eventID')

In [148]:
occ18_merged.head()

Unnamed: 0,DNA_sequence,eventID,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID,kingdom,phylum,class,order,family,genus,taxonRank,identificationRemarks,taxonID,basisOfRecord,nameAccordingTo,organismQuantityType,recordedBy,materialSampleID,sampleSizeValue,sampleSizeUnit,associatedSequences,locationID,eventDate,minimumDepthInMeters,maximumDepthInMeters,locality,waterBody,countryCode,decimalLatitude,decimalLongitude,geodeticDatum,samplingProtocol,parentEventID,datasetID
0,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGCCTGGCGGATTACTCTGCCTGGCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1516,GOMECC4_27N_Sta1_DCM_A_occ36aa75f9b28f5f831c2d631ba65c2bcb,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Neocalanus;Neocalanus_cristatus;,Neocalanus cristatus,urn:lsid:marinespecies.org:taxname:104470,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Neocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.92209885, against reference database: PR2 v5.0.1; V9 ...",ASV:36aa75f9b28f5f831c2d631ba65c2bcb,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
1,GCTACTACCGATTGAACGTTTTAGTGAGGTCCTCGGACTGTTTGGTAGTCGGATCACTCTGACTGCCTGGCGGGAAGACGACCAAACTGTAGCGTTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,962,GOMECC4_27N_Sta1_DCM_A_occ4e38e8ced9070952b314e1880bede1ca,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Clausocalanus;Clausocalanus_furcatus;,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,Animalia,Arthropoda,Copepoda,Calanoida,Clausocalanidae,Clausocalanus,Species,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999946735, against reference database: PR2 v5.0.1; V9...",ASV:4e38e8ced9070952b314e1880bede1ca,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
2,GCTACTACCGATTGGACGTTTTAGTGAGACATTTGGACTGGGTTAAGATAGTCGCAAGACTACCTTTTCTCCGGAAAGACTTTCAAACTTGAGCGTCTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,1164,GOMECC4_27N_Sta1_DCM_A_occ2a31e5c01634165da99e7381279baa75,Eukaryota;Obazoa;Opisthokonta;Metazoa;Arthropoda;Crustacea;Maxillopoda;Acrocalanus;Acrocalanus_sp.;,Acrocalanus,urn:lsid:marinespecies.org:taxname:104192,Animalia,Arthropoda,Copepoda,Calanoida,Paracalanidae,Acrocalanus,Genus,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.779948049, against reference database: PR2 v5.0.1; V9...",ASV:2a31e5c01634165da99e7381279baa75,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
3,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAGTGTTCAGTTCCTGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,287,GOMECC4_27N_Sta1_DCM_A_occecee60339b2fb88ea6d1c8d18054bed4,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae,Dinophyceae,urn:lsid:marinespecies.org:taxname:19542,Chromista,Myzozoa,Dinophyceae,,,,Class,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.999930607, against reference database: PR2 v5.0.1; V9...",ASV:ecee60339b2fb88ea6d1c8d18054bed4,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4
4,GCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC,GOMECC4_27N_Sta1_DCM_A_18S,250,GOMECC4_27N_Sta1_DCM_A_occfa1f1a97dd4ae7c826009186bad26384,Eukaryota;TSAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniales;Gymnodiniaceae,Gymnodiniaceae,urn:lsid:marinespecies.org:taxname:109410,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,,Family,"Tourmaline; qiime2-2021.2; naive-bayes classifier, confidence (at lowest specified taxon): 0.98690791, against reference database: PR2 v5.0.1; V9 ...",ASV:fa1f1a97dd4ae7c826009186bad26384,MaterialSample,WoRMS,DNA sequence reads,Luke Thompson | Katherine Silliman,SAMN37516094,9838,DNA sequence reads,https://www.ncbi.nlm.nih.gov/sra/SRR26161153 | https://www.ncbi.nlm.nih.gov/biosample/SAMN37516094 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA...,27N_Sta1,2021-09-14T11:00-04:00,49,49,"USA: Atlantic Ocean, east of Florida (27 N)","Mexico, Gulf of",US,26.997,-79.618,WGS84,CTD rosette,GOMECC4_27N_Sta1_DCM_A,noaa-aoml-gomecc4


In [149]:
occ18_merged.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence_18S.tsv",sep="\t",index=False)

### Combine 16S and 18S occurrence

In [150]:
occ18_merged.shape

(149182, 37)

In [151]:
occ_all = pd.concat([occ16_merged,occ18_merged],axis=0, ignore_index=True)

In [152]:
occ_all['occurrenceStatus'] = 'present' 

In [153]:
occ_all.shape

(318652, 38)
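
After stacking the two marker tables, it is worth confirming that no `occurrenceID` appears twice, since every occurrence record needs a unique identifier. A minimal sketch with hypothetical toy tables and made-up IDs:

```python
import pandas as pd

# Hypothetical stand-ins for occ16_merged and occ18_merged.
occ16_toy = pd.DataFrame({"occurrenceID": ["s1_16S_occA", "s1_16S_occB"]})
occ18_toy = pd.DataFrame({"occurrenceID": ["s1_18S_occA"]})

combined = pd.concat([occ16_toy, occ18_toy], axis=0, ignore_index=True)
combined["occurrenceStatus"] = "present"

# keep=False flags every member of a duplicated group, not just the repeats.
duplicates = combined[combined["occurrenceID"].duplicated(keep=False)]
print(combined.shape, len(duplicates))
```

An empty `duplicates` frame, together with a row count equal to the sum of the two inputs, indicates the concatenation behaved as expected.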

In [154]:
occ_all.drop(columns=['DNA_sequence']).to_csv("../processed/occurrence.csv",index=False)


## Create DNA Derived Extension

First we clearly define the desired columns for the DNA Derived Extension output file:

In [155]:
# Define desired columns for DNA derived extension in output order
DESIRED_DNA_DERIVED_COLUMNS = [
    'eventID', 'source_mat_id', 'env_broad_scale', 'env_local_scale', 'env_medium', 
    'samp_vol_we_dna_ext', 'samp_collect_device', 'samp_mat_process', 'size_frac', 
    'concentration', 'lib_layout', 'seq_meth', 'nucl_acid_ext', 'target_gene', 
    'target_subfragment', 'pcr_primer_forward', 'pcr_primer_reverse', 
    'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 
    'pcr_cond', 'nucl_acid_amp', 'ampliconSize', 'otu_seq_comp_appr', 'otu_db', 
    'occurrenceID', 'DNA_sequence', 'concentrationUnit', 'otu_class_appr'
]

print(f"DNA derived extension will have {len(DESIRED_DNA_DERIVED_COLUMNS)} columns")

Start building the DNA Derived Extension by merging the occurrence data with the experiment run metadata, which supplies the library and sample information.

In [156]:
# Start with occurrence data and merge with experimentRunMetadata on eventID (lib_id)
dna_base = occ_all_final_output[['eventID', 'occurrenceID', 'DNA_sequence']].copy()

# Merge with experimentRunMetadata to get samp_name and library metadata
dna_with_exp = dna_base.merge(data['experimentRunMetadata'], left_on='eventID', right_on='lib_id', how='left')

print(f"DNA base rows: {len(dna_base)}")
print(f"After exp merge: {len(dna_with_exp)}")
print(f"Columns added: {set(dna_with_exp.columns) - set(dna_base.columns)}")

dict_keys(['eventID', 'samp_name', 'occurrenceID', 'DNA_sequence', 'sop', 'nucl_acid_ext', 'samp_vol_we_dna_ext', 'samp_mat_process', 'nucl_acid_amp', 'target_gene', 'target_subfragment', 'ampliconSize', 'lib_layout', 'pcr_primer_forward', 'pcr_primer_reverse', 'pcr_primer_name_forward', 'pcr_primer_name_reverse', 'pcr_primer_reference', 'pcr_cond', 'seq_meth', 'otu_class_appr', 'otu_seq_comp_appr', 'otu_db', 'env_broad_scale', 'env_local_scale', 'env_medium', 'size_frac', 'concentration', 'concentrationUnit', 'samp_collec_device', 'source_mat_id'])

Merge with sample metadata to get sample collection information.

In [None]:
# Merge with sampleMetadata to get sample collection metadata
dna_with_sample = dna_with_exp.merge(data['sampleMetadata'], on='samp_name', how='left')

print(f"After sample merge: {len(dna_with_sample)}")
print(f"Total columns now: {len(dna_with_sample.columns)}")

Apply Darwin Core mapping to transform raw field names to standardized terms.

In [None]:
# Apply Darwin Core mapping using the dwc_data['dnaDerived'] dictionary
dna_mapped = dna_with_sample.copy()

# Apply mappings for each column that has a Darwin Core equivalent
for col in dna_mapped.columns:
    if col in dwc_data['dnaDerived']:
        dwc_term = dwc_data['dnaDerived'][col]
        if dwc_term != col:  # Only rename if different
            dna_mapped = dna_mapped.rename(columns={col: dwc_term})
            print(f"Mapped: {col} -> {dwc_term}")

print(f"Mapping complete. Final columns: {len(dna_mapped.columns)}")
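
The loop above renames one column at a time. An equivalent pattern, sketched here with a hypothetical mapping dictionary and toy table (`mapping_toy` and its entries are invented for illustration and are not the real `dwc_data['dnaDerived']` contents), builds the rename map first and applies it in a single `rename` call:

```python
import pandas as pd

# Hypothetical mapping from input field names to Darwin Core terms.
mapping_toy = {
    "lib_id": "eventID",
    "seq": "DNA_sequence",
    "target_gene": "target_gene",  # identical name, so no rename needed
}

df = pd.DataFrame({
    "lib_id": ["x"],
    "seq": ["ACGT"],
    "target_gene": ["18S rRNA"],
    "extra": [1],  # not in the mapping, left untouched
})

# Keep only columns that are mapped to a different name.
rename_map = {
    col: mapping_toy[col]
    for col in df.columns
    if col in mapping_toy and mapping_toy[col] != col
}
renamed = df.rename(columns=rename_map)

for old, new in rename_map.items():
    print(f"Mapped: {old} -> {new}")
```

Building the map up front also makes it easy to log or test the mapping before any column is changed.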

Add hardcoded fields and create the final DNA Derived Extension table.

**TODO:** Decide whether this mapping should be included in the checklist, and record why or why not.

In [158]:
# Add source_mat_id using materialSampleID 
if 'materialSampleID' in dna_mapped.columns:
    dna_mapped['source_mat_id'] = dna_mapped['materialSampleID']
    print("Added source_mat_id from materialSampleID")

# Select only the desired columns that exist in our data
available_columns = [col for col in DESIRED_DNA_DERIVED_COLUMNS if col in dna_mapped.columns]
missing_columns = [col for col in DESIRED_DNA_DERIVED_COLUMNS if col not in dna_mapped.columns]

print(f"Available columns: {len(available_columns)}")
print(f"Missing columns: {missing_columns}")

# Create final DNA derived extension with only available columns
dna_derived_final = dna_mapped[available_columns].copy()
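
Selecting only the available columns drops any desired column that is absent from the data. If the output file should always carry the full column set, one option (an assumption about the desired behavior, not something the notebook does) is `DataFrame.reindex`, which keeps the requested order and fills absent columns with NaN. A sketch with hypothetical toy data:

```python
import pandas as pd

desired = ["eventID", "occurrenceID", "otu_db"]  # illustrative subset
df = pd.DataFrame({
    "occurrenceID": ["o1"],
    "eventID": ["e1"],
    "unused": [0],
})

# Same bookkeeping as the notebook cell: what is present, what is missing.
present = [col for col in desired if col in df.columns]
missing = [col for col in desired if col not in df.columns]

# Option 1 (as in the notebook): keep only the columns that exist.
subset = df[present].copy()

# Option 2: keep every desired column, with NaN where data is absent.
padded = df.reindex(columns=desired)

print(present, missing, list(padded.columns))
```

Either way, printing the missing list, as the cell above does, is a useful prompt to check whether a field was misspelled in the input template.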

Inspect the final DNA Derived Extension table and save it to a file.

In [None]:
print(f"Final DNA derived extension shape: {dna_derived_final.shape}")
print(f"Columns: {list(dna_derived_final.columns)}")

# Display first few rows
print("\nFirst 3 rows:")
display(dna_derived_final.head(3))

# Save to CSV
output_filename = 'dna_derived_extension.csv'
dna_derived_final.to_csv(output_filename, index=False)
print(f"\nDNA derived extension saved as: {output_filename}")

# Below is from the old version of edna2obis

In [189]:
dna_occ.head()

Unnamed: 0,source_mat_id,env_broad_scale,env_local_scale,env_medium,samp_vol_we_dna_ext,samp_collec_device,samp_mat_process,size_frac,concentration,concentrationUnit,lib_layout,seq_meth,nucl_acid_ext,target_gene,target_subfragment,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,pcr_cond,nucl_acid_amp,ampliconSize,otu_seq_comp_appr,otu_db,eventID,occurrenceID,DNA_sequence,otu_class_appr
0,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ009257b156ab4a9dd2f0b0dd33100b7e,TACGAGGGGTGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTTCGTAGGCGTCTTGCCAAGTTGATCGTTAAAGCCACCGGCTTAACCGGTGATCTGCGATCAAAACTGGCGAGATAGAATATGTGAGGGGAATGTGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
1,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01398067b1d323b7f992a6764fa69e97,TACGGAGGGTGCAAGCGTTGTTCGGAATTATTGGGCGTAAAGCGGATGTAGGCGGTCTGTCAAGTCGGATGTGAAATCCCTGGGCTCAACCCAGGAACTGCATTCGAAACTGTCAGACTAGAGTCTCGGAGGGGGTGGCGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
2,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ01770ea2fb7f041c787e5a481888c27e,TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTCCGCAGGCGGACTATTAAGTCAGTGGTGAAAGTCTGCAGCTTAACTGTAGAATTGCCATTGAAACTGATAGTCTTGAGTGTGGTTGAAGTGGGCGGAATAT...,Tourmaline; qiime2-2021.2; dada2; ASV
3,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ017dbdc8b62705bdf3f93218ac93a030,TACTAGGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTGCGTAGGCGTCTACGTAAGTTGTTTGTTAAATCCATCGGCTTAACCGATGATCTGCAAACAAAACTGCATAGATAGAGTTTGGAAGAGGAAAGTGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
4,GOMECC4_27N_Sta1_Deep,marine biome [ENVO:00000447],marine mesopelagic zone [ENVO:00000213],sea water [ENVO:00002149],1920 ml,Niskin bottle,Pumped through Sterivex filter (0.22-µm) using peristaltic pµmp,0.22 µm,missing: not provided,missing: not provided,paired,Illumina MiSeq 2x250,https://github.com/aomlomics/protocols/blob/main/protocol_DNA_extraction_Sterivex.md,16S rRNA,V4-V5,GTGYCAGCMGCCGCGGTAA,CCGYCAATTYMTTTRAGTTT,515F-Y,926R,10.1111/1462-2920.13023,initial denaturation:95_2;denaturation:95_0.75;annealing:50_0.75;elongation:68_1.5;final elongation:68_5;25,10.1111/1462-2920.13023,411,Tourmaline; qiime2-2021.2; naive-bayes classifier,Silva SSU Ref NR 99 v138.1; 515f-926r region; 10.5281/zenodo.8392695,GOMECC4_27N_Sta1_Deep_A_16S,GOMECC4_27N_Sta1_Deep_A_16S_occ069f375524db7812103fe73fdefb7d2b,TACGTAGGAGGCTAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGCAGGTGGCTGAGTTCGTCCGTGGTGCAAGCTCCAGGCCTAACCTGGAGAGGTCTACGGATACTGCTCGGCTTGAGGGCGGTAGAGGAGCACGGAATTC...,Tourmaline; qiime2-2021.2; dada2; ASV
