This notebook explores the environmental metagenomics datasets I found through my literature search.

In [13]:
import pandas as pd
import janitor as jn

In [14]:
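# Tab-separated export of the spreadsheet where I track candidate datasets
fname = '../../data/raw/Environmental metagenomics datasets - Datasets.tsv'
df = pd.read_csv(fname, sep='\t')
# Normalize column names to snake_case and trim stray underscores (pyjanitor)
df = jn.clean_names(df, strip_underscores=True)
df.head()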

Unnamed: 0,person,date_added,paper_source,environment,selective_pressure,environment_detail,n_samples,sample_details,first_author_last_author,all_authors,paper_year,paper_title,paper_link,where_is_data_database_online_email,data_detail_accession_date_emailed_etc,data_link,where_is_metadata_database_supplement_etc,metadata_detail,interest,availability
0,Claire,5/1,AMR forum report,human_fecal,"yes, abx",before and after antibiotic exposure,401,84 preterm infants sampled longitudinally and ...,"Molly Gibson, Gautam Dantas","Molly K. Gibson, Bin Wang, Sara Ahmadi, Carey-...",2016,Developmental dynamics of the preterm infant g...,https://www.nature.com/articles/nmicrobiol201624,SRA,PRJNA301903,https://www.ncbi.nlm.nih.gov/Traces/study/?Web...,SRA and Table S1,great metadata,high,high
1,Claire,5/1,AMR forum report,drinking_water,"yes, water treatment","drinking water treatment plant, before and aft...",3,,"Peng Shi, Aimin Li","Peng Shi, Shuyu Jia, Xu-Xiang Zhang, Tong Zhan...",2013,Metagenomic insights into chlorination effects...,https://www.sciencedirect.com/science/article/...,SRA,SRA050945,https://www.ncbi.nlm.nih.gov/Traces/study/?acc...,SRA,"filtered water (FW), disinfected water (DW) an...",low,high
2,Claire,5/1,AMR forum report,wwtp,,"activated sludge, Danish WWTP",15,,"Christian Munck, Morten Sommer","Christian Munck, Mads Albertsen, Amar Telke, M...",2015,Limited dissemination of the wastewater treatm...,https://www.nature.com/articles/ncomms9452,ENA,PRJEB8087,https://www.ebi.ac.uk/ena/data/view/PRJEB8087,sample IDs,,medium,high
3,Claire,5/1,AMR forum report,animal,"yes, abx feed additives",cattle,10,"5 controls, 5 steers with additivies","Milton Thomas, Joy Scaria","Milton Thomas, Megan Webb, Sudeep Ghimire, Ama...",2017,Metagenomic characterization of the effect of ...,https://www.nature.com/articles/s41598-017-124...,SRA,PRJNA390551,https://www.ncbi.nlm.nih.gov/Traces/study/?Web...,SRA,,medium,high
4,Claire,5/1,AMR forum report,toilet_waste,,from international airplanes from 9 cities,18,,"Thomas Petersen, Frank Aarestrup","Thomas Nordahl Petersen, Simon Rasmussen, Henr...",2015,Meta-genomic analysis of toilet waste from lon...,https://www.nature.com/articles/srep11444,ENA,PRJEB12466,https://www.ebi.ac.uk/ena/data/view/PRJEB12466,sample IDs,,medium,high


# Easily available datasets

First, let's see how many datasets should be easy to download (availability marked "high") and what types they are.

In [15]:
df.query('availability == "high"').shape

(16, 20)

In [16]:
(df.query('availability == "high"')
     [['environment', 'n_samples', 'selective_pressure']]
     .astype({'n_samples': int})
     .sort_values(by='n_samples', ascending=False)
)

Unnamed: 0,environment,n_samples,selective_pressure
0,human_fecal,401,"yes, abx"
26,drinking_water,25,
28,"sediment, lake, wwtp",22,"yes, different distance to wwtp"
4,toilet_waste,18,
21,air,16,
2,wwtp,15,
3,animal,10,"yes, abx feed additives"
16,river,8,"yes, upstream and downstream of dumping"
20,animal,7,"yes, abx vs no abx"
33,"river, wwtp",6,"yes, anthropogenic gradient"


Not very impressive: nothing from soil or manure, and only one dataset has more than 25 samples. But most of these have a selective pressure, which is useful.
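A quick way to back up both observations (which environments are covered, and how many datasets have a selective pressure) might look like the sketch below. It uses a toy frame, `shown`, copied from the ten rows displayed above rather than the full `df`. The helper name `count_environments` and the rule that a `selective_pressure` entry starting with "yes" counts as selective pressure are my assumptions, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy frame copied from the ten rows displayed above (not the full `df`).
shown = pd.DataFrame({
    'environment': [
        'human_fecal', 'drinking_water', 'sediment, lake, wwtp', 'toilet_waste',
        'air', 'wwtp', 'animal', 'river', 'animal', 'river, wwtp',
    ],
    'selective_pressure': [
        'yes, abx', None, 'yes, different distance to wwtp', None,
        None, None, 'yes, abx feed additives',
        'yes, upstream and downstream of dumping', 'yes, abx vs no abx',
        'yes, anthropogenic gradient',
    ],
})


def count_environments(frame):
    """Count datasets per environment, splitting comma-separated entries."""
    return (frame['environment']
            .str.split(',')   # 'river, wwtp' -> ['river', ' wwtp']
            .explode()        # one row per environment
            .str.strip()      # drop the space left after each comma
            .value_counts())


env_counts = count_environments(shown)

# Environments I noted as missing; check them against what is present.
wanted = ['soil', 'manure']
missing = [env for env in wanted if env not in env_counts.index]

# Assumption: an entry starting with "yes" means a selective pressure exists.
n_pressure = int(shown['selective_pressure'].str.startswith('yes', na=False).sum())

print(env_counts)
print('missing:', missing)
print(f'{n_pressure} of {len(shown)} datasets have a selective pressure')
```

Running the same helper on `df.query('availability == "high"')` would cover all 16 datasets instead of only the rows displayed.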

Let's see what we gain by also including the "medium" availability datasets.

In [17]:
df.query('(availability == "high") | (availability == "medium")').shape

(24, 20)

In [18]:
(df.query('(availability == "high") | (availability == "medium")')
     [['environment', 'n_samples', 'selective_pressure', 'environment_detail', 'sample_details']]
     .astype({'n_samples': int})
     .sort_values(by='n_samples', ascending=False)
     .reset_index(drop=True)
)

Unnamed: 0,environment,n_samples,selective_pressure,environment_detail,sample_details
0,human_fecal,401,"yes, abx",before and after antibiotic exposure,84 preterm infants sampled longitudinally and ...
1,wwtp,72,"yes, abx concentrations measured","different steps of 3 swedish WWTPs, water and ...",21 unique sites sampled - 72 samples are eithe...
2,human_fecal,72,"yes, abx",before and after antibiotic exposure,"18 participants + 6 controls, sampled before a..."
3,animal,50,"no, abx-free cattle (but different diets)","rumen, four cattle breeds with two different d...",
4,drinking_water,25,,"tap water from 25 different cities, 20 in China",
5,"sediment, lake, wwtp",22,"yes, different distance to wwtp","wwtp effluent (2), sediment samples near wwtp ...",
6,animal,20,"yes, different herds have different abx usage","pig, pen floor and pig samples, samples are po...",
7,toilet_waste,18,,from international airplanes from 9 cities,
8,air,16,,NYC and San Diego indoor and outdoor air,
9,"hospital_effluent, farm_effluent, river",15,"yes, abx concentrations measured",,


Much better. How many of these high- or medium-availability datasets have at least 10 samples?

In [19]:
# Sorting is not needed here, since we only want the row count
(df.query('(availability == "high") | (availability == "medium")')
     [['environment', 'n_samples', 'selective_pressure']]
     .astype({'n_samples': int})
     .query('n_samples >= 10')
).shape

(13, 3)
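The `.astype({'n_samples': int})` step only works while every kept row has a clean integer in `n_samples`. A more defensive version of the same filter could look like the sketch below. `usable_datasets` is a hypothetical helper, and the toy frame combines three rows shown above with one made-up row that has a missing sample count.

```python
import pandas as pd

# Three rows taken from the outputs above, plus one HYPOTHETICAL row
# (soil, no sample count) to show how missing values are handled.
toy = pd.DataFrame({
    'environment': ['human_fecal', 'wwtp', 'animal', 'soil'],
    'n_samples': ['401', '15', '7', None],
    'availability': ['high', 'high', 'high', 'medium'],
})


def usable_datasets(frame, levels=('high', 'medium'), min_samples=10):
    """Keep rows with an accepted availability level and enough samples.

    Sample counts that cannot be parsed as numbers become NaN and are
    dropped, instead of raising an error as a plain int cast would.
    """
    n = pd.to_numeric(frame['n_samples'], errors='coerce')
    keep = frame['availability'].isin(list(levels)) & (n >= min_samples)
    return frame.loc[keep].assign(n_samples=n[keep].astype(int))


result = usable_datasets(toy)
print(result)
print(result.shape)
```

On the real data this would be called as `usable_datasets(df)`, with `min_samples` adjusted to try other cutoffs.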

Things I'd like to see more of:

- hospital effluent
- manure
- recreational swimming sites near potential contamination sources
- adults on antibiotics (abx)
- adults with C. diff or another infectious disease?