In [39]:
# Check if python is 3.9.5
import sys
print(sys.version)
%load_ext autoreload
%autoreload 2

3.9.5 (default, May 18 2021, 12:31:01) 
[Clang 10.0.0 ]
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Geographic distribution of the Earth Microbiome Project (EMP)

(The importing step below may take some time, as country assignments are reevaluated.)

In [40]:
import utils
import pandas as pd

The [Earth Microbiome Project](https://earthmicrobiome.org/) provides a large number of metagenomic samples. Metadata for the large dataset of Thompson et al. (2017) can be found [here](https://zenodo.org/record/890000#.YMx_05P7Rp8).

The samples of the EMP dataset are annotated by origin using the environment ontology framework [ENVO](https://sites.google.com/site/environmentontology/home) (which could also be interesting for land use categorization). As we are interested in terrestrial soil samples, here is a short summary:

In [None]:
utils.total_n_samples(utils.emp_df)

Three classes of categories are used: biomes (the overall ecosystem), material (the matrix that was sampled) and features (key features of the biome). Here is how many samples are stored for each ontology term:

In [None]:
utils.overall_env_features(utils.emp_df)

We can observe that multiple ontology terms have "soil" in their name. And what about "rhizosphere"?
As a first step, let us filter for all terms containing "soil" and summarize the resulting subset of soil samples:

In [None]:
utils.subset_for_soil()
utils.total_n_samples(utils.emp_sdf)
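
The internals of `utils.subset_for_soil()` are not shown in this notebook. As a minimal sketch of how such a term filter could look, assuming the ontology labels live in columns named `env_biome`, `env_feature` and `env_material` (the helper `rows_mentioning` and the demo rows are hypothetical):

```python
import pandas as pd

# Assumed column names for the three ENVO classes.
ENV_COLUMNS = ["env_biome", "env_feature", "env_material"]


def rows_mentioning(df, term, columns=ENV_COLUMNS):
    """Keep rows where any ontology column contains `term` (case-insensitive)."""
    hits = pd.DataFrame({
        col: df[col].str.contains(term, case=False, na=False, regex=False)
        for col in columns
    })
    return df[hits.any(axis=1)]


# Made-up demo rows, only to show the behaviour of the filter.
demo = pd.DataFrame({
    "env_biome": ["forest biome", "marine biome", "cropland biome"],
    "env_feature": ["agricultural soil", "coral reef", None],
    "env_material": ["soil", "sea water", "rhizosphere"],
})
soil_demo = rows_mentioning(demo, "soil")
rhizo_demo = rows_mentioning(demo, "rhizosphere")
print(soil_demo.index.tolist(), rhizo_demo.index.tolist())
```

Passing `na=False` treats missing labels as non-matches, so samples without any annotation are dropped rather than raising an error.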

In [None]:
utils.overall_env_features(utils.emp_sdf)

So, now it is becoming interesting. Where do these samples come from?

In [None]:
utils.summarize_cntr_and_ftrs(utils.emp_sdf)

<div class="alert alert-block alert-danger">
<b>Attention</b> - many samples have a bad resolution in their coordinates.
</div>

Once more, let us check where the samples are located on a map. The colors reflect the type of biome (some colors are shared by two types). Orange icons represent locations with multiple biome types (maybe not good?!).

In [None]:
utils.map_the_data(utils.emp_sdf)

<b>Okay...</b> USA and Canada seem promising somehow.

# Geographic distribution of soil samples from EBI metagenomics

As a next step, we try to extend the dataset to the whole [EBI metagenomics database](https://www.ebi.ac.uk/metagenomics/). Again, samples are annotated with ENVO biomes. Thus, we can select samples that come from soil, rhizosphere or rhizoplane.

<div class="alert alert-block alert-info">
    <b>Wait!</b><br>What is the difference between soil, rhizosphere and rhizoplane? <a href="http://www.ontobee.org/ontology/ENVO">OntoBEE</a> is a good source for finding information about ENVO biomes. It defines:
<ul>
    <li><b>Soil:</b> Soil is an environmental material which is primarily composed of minerals, varying proportions of sand, silt, and clay, organic material such as humus, gases, liquids, and a broad range of resident micro- and macroorganisms.</li>
    <li><b>Rhizosphere:</b> The narrow region of soil that is directly influenced by root secretions and associated soil microorganisms.</li>
    <li><b>Rhizoplane:</b> A surface layer which is composed of the external surface of a root, together with closely adhering soil particles and debris.</li>
    </ul>    
    So, somehow all of these are related to soil. As a real soil inspector knows, there are always roots within the soil. Also, there is always soil around the roots. Therefore, we use all three categories to build the dataset.
</div>

<div class="alert alert-block alert-warning">
    <b>Possible problem!</b><br>What if some samples of the rhizosphere/rhizoplane were from water plants?
</div>

Using the [MGnify REST API](https://www.ebi.ac.uk/metagenomics/api/v1/), metadata for all these samples can be obtained.
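
How `utils.load_ebi_data()` talks to the API is not shown here. The following is only a rough sketch of paginated retrieval, under the assumptions that the soil samples are listed under the biome lineage used in `SOIL_SAMPLES_URL` and that each response page carries its records in `data` and the address of the next page in `links.next`. The names `fetch_json` and `iter_samples` are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: samples listed under the soil biome lineage.
SOIL_SAMPLES_URL = (
    "https://www.ebi.ac.uk/metagenomics/api/v1/"
    "biomes/root:Environmental:Terrestrial:Soil/samples"
)


def fetch_json(url):
    """Download one page and decode it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_samples(start_url, fetch=fetch_json, max_pages=None):
    """Yield sample records page by page, following the assumed `links.next` field."""
    url, page = start_url, 0
    while url and (max_pages is None or page < max_pages):
        payload = fetch(url)
        yield from payload.get("data", [])
        url = payload.get("links", {}).get("next")
        page += 1


# Example call (needs network access, so it is left commented out):
# first_page = list(iter_samples(SOIL_SAMPLES_URL, max_pages=1))
```

The `fetch` argument is injectable so that the paging logic can be tried offline with canned pages before sending real requests.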

In [50]:
utils.load_ebi_data()

Do you want to reload the metadata from EBI (or MG-RAST) database? [y/n] n


Similarly to the EMP dataset, it can be imported and stored globally.

In [51]:
utils.import_ebi()
utils.total_n_samples(utils.ebi_df)

/Users/Thomsn/Desktop/island_in_the_sun/jupyters/2021_06_ranker/Ranker/data/ebi_ena_soil_dataset_2021_07_07_processed.csv
Overall there are 24945 samples within the dataset.


As many of the samples do not even have coordinates, we filter for samples with coordinates. More precisely, we only use samples whose coordinates are given with at least three decimal places of a degree, i.e. a precision of 0.001° (for the reasoning see "Decision making > Coordinates' precision").

In [52]:
precise_soil_ebi_df = utils.subset_for_coordprec(utils.ebi_df)
utils.total_n_samples(precise_soil_ebi_df)

Overall there are 15979 samples within the dataset.
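
The implementation of `utils.subset_for_coordprec` is not part of this notebook. As an illustration of the idea, here is a small sketch that counts the decimal places of a coordinate and keeps only pairs with at least three. The helpers `n_decimals` and `is_precise` are hypothetical, and requiring the precision for both latitude and longitude is an assumption.

```python
import math
from decimal import Decimal


def n_decimals(value):
    """Count the digits after the decimal point in the shortest text form of `value`."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    exponent = Decimal(str(value)).normalize().as_tuple().exponent
    return max(0, -exponent)


def is_precise(latitude, longitude, min_decimals=3):
    """Assumption: both coordinates must reach the required precision."""
    return (n_decimals(latitude) >= min_decimals
            and n_decimals(longitude) >= min_decimals)


coords = [
    (37.566, 126.9784),      # precise enough
    (37.5, 127.0),           # too coarse
    (float("nan"), 10.123),  # missing latitude
    (80.00048, -85.83945),   # precise enough
]
precise = [pair for pair in coords if is_precise(*pair)]
print(precise)
```

On a DataFrame the same function could be applied per column, for example with `df["latitude_deg"].map(n_decimals)`. One caveat of this approach: trailing zeros are lost once a value is stored as a float, so a coordinate recorded as 37.500 counts as having one decimal place.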


Again, we can inspect the dataset a bit more regarding ecosystems and countries:

In [53]:
utils.overall_env_features(precise_soil_ebi_df)

--- Biomes ---
millet root                                2876
Rhizosphere                                1362
ENVO:urban biome                           1003
Bog Forest                                  838
urban biome                                 823
grassland                                   576
cropland biome                              492
soil                                        403
cropland                                    338
terrestrial biome                           300
montane grassland biome                     250
boreal forest                               233
rangeland                                   224
Temperate grasslands                        216
tundra biome                                172
Temperate Conifer Forest                    140
clay soil                                   134
Temperate grassland                         119
forest biome                                105
temperate woodland                          101
Tundra                   

In [None]:
utils.summarize_cntr_and_ftrs(precise_soil_ebi_df)

Again, let us plot the data on an interactive global map:

In [None]:
utils.map_the_data(precise_soil_ebi_df)

Check out the Republic of Korea. It is a large project on agricultural soils.

In [54]:
precise_soil_ebi_df[precise_soil_ebi_df.std_country=="Korea, Republic of"]

Unnamed: 0.1,Unnamed: 0,accession,sample_name,longitude_deg,latitude_deg,country,studies,env_biome,env_feature,env_material,experiment_type,collection_date,geometry,std_country
6299,6807,ERS2540919,V4-5TPH Taxonomy ID:256318,126.9784,37.566,South Korea,MGYS00003722,biome,environmental zone,soil,amplicon,2018-05-31,POINT (126.97840 37.56600),"Korea, Republic of"
6300,6808,ERS2540920,V4-5NTPD Taxonomy ID:256318,126.9784,37.566,South Korea,MGYS00003722,biome,environmental zone,soil,amplicon,2018-05-31,POINT (126.97840 37.56600),"Korea, Republic of"
6301,6809,ERS2540921,V4-5NTPH Taxonomy ID:256318,126.9784,37.566,South Korea,MGYS00003722,biome,environmental zone,soil,amplicon,2018-05-31,POINT (126.97840 37.56600),"Korea, Republic of"
12803,14707,SRS3994526,14005,127.6074,37.1125,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.60740 37.11250),"Korea, Republic of"
12804,14708,SRS3994527,14004,127.515,37.1816,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.51500 37.18160),"Korea, Republic of"
12805,14709,SRS3994528,14003,127.2084,37.6069,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.20840 37.60690),"Korea, Republic of"
12806,14710,SRS3994529,14002,127.2141,37.6183,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.21410 37.61830),"Korea, Republic of"
12807,14711,SRS3994530,14009,127.2473,36.9441,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.24730 36.94410),"Korea, Republic of"
12808,14712,SRS3994531,14008,127.2463,36.9452,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.24630 36.94520),"Korea, Republic of"
12809,14713,SRS3994532,14007,127.5055,37.1819,,MGYS00005205,,,,amplicon,2014-04-01,POINT (127.50550 37.18190),"Korea, Republic of"


# Geographic distribution of soil samples from MG-RAST

<div class="alert alert-block alert-warning">
<b>EBI enough?</b> Nope, that was not all.
</div>

Well, it seemed as if that was everything. But then I read the paper by Mendes et al. (2015) in more detail (while building the notebook `metagenomethodo/how_to_compare_the_metagenomes.ipynb`) and found out that their data is not stored in EBI. Instead, it is in the [MG-RAST](https://www.mg-rast.org/) database (see the other notebook for more info). Luckily, it also offers an API, but its biome assignments are a bit different: it provides ENVO labels for `biome`, `feature` and `material`, and also a property called `env_package`, which includes the category "soil".

<div class="alert alert-block alert-danger">
<b>Attention, mislabelled data!</b> Some samples have the `material` "soil" but the `env_package` "human-associated". For instance, "mge777641" is from a study about lung metagenomics. This should not be soil ;-)
</div>

For more elaborate analysis we can download all MG-RAST samples:

In [41]:
utils.load_mg_rast_data(total=True)

Do you want to reload the metadata from EBI (or MG-RAST) database? [y/n] n


There are fewer samples than stated on the website, but I cannot tell why.
But, no matter... 

For now, we stick to the smaller dataset restricted to `env_package` "soil":

In [42]:
utils.load_mg_rast_data()

Do you want to reload the metadata from EBI (or MG-RAST) database? [y/n] n


Again, let us load some summaries:

In [43]:
utils.import_ebi(mg_rast=True)
utils.total_n_samples(utils.mgrast_df)

/Users/Thomsn/Desktop/island_in_the_sun/jupyters/2021_06_ranker/Ranker/data/mgrast_soil_dataset_2021_07_08_processed.csv
Overall there are 3332 samples within the dataset.


In [44]:
precise_soil_mgrast_df = utils.subset_for_coordprec(utils.mgrast_df)
utils.total_n_samples(precise_soil_mgrast_df)

Overall there are 2962 samples within the dataset.


In [45]:
utils.overall_env_features(precise_soil_mgrast_df)

--- Biomes ---
Temperate grasslands, savannas, and shrubland biome                          735
terrestrial biome                                                            662
Temperate grasslands                                                         163
Temperate coniferous forest biome                                            115
Deserts and xeric shrubland biome                                            114
Udvardy biome                                                                 94
soil                                                                          93
urban biome                                                                   88
coral reef                                                                    75
Temperate broadleaf and mixed forest biome                                    71
forest biome                                                                  68
temperate grassland                                                           65
Tropical and 

In [46]:
precise_soil_mgrast_df

Unnamed: 0.1,Unnamed: 0,accession,sample_name,longitude_deg,latitude_deg,country,studies,env_biome,env_feature,env_material,experiment_type,collection_date,env_package_type,geometry,std_country
0,0,mgm4441091.3,Waseca Farm Soil,93.6623,43.9614,United States of America,Waseca County Farm Soil Metagenome,soil,soil,soil,WGS,2001-09-05 12:00:00 UTC,soil,POINT (93.66230 43.96140),China
1,1,mgm4443231.3,A,-85.83945,80.00048,Canada,Meta data,soil,soil,soil,WGS,2003-05-00 00:00:00 UTC,soil,POINT (-85.83945 80.00048),Canada
2,2,mgm4443232.3,2M,-85.83945,80.00048,Canada,Meta data,soil,soil,soil,WGS,2003-05-00 00:00:00 UTC,soil,POINT (-85.83945 80.00048),Canada
9,9,mgm4449125.3,Composite,-62.283,81.517,Canada,Meta data,soil,soil,soil,WGS,2005-08-00 00:00:00 UTC-5,soil,POINT (-62.28300 81.51700),
10,10,mgm4449126.3,Biopiles-2006,-62.283,81.517,Canada,Meta data,soil,soil,soil,WGS,2006-08-00 00:00:00 UTC-5,soil,POINT (-62.28300 81.51700),
11,11,mgm4449249.3,1_8_RL1,-93.26529,45.39926,USA,"Comparative metagenomic, phylogenetic, and ph...",grassland biome,meadow soil,bulk soil,WGS,2008-07-15 12:00:00 UTC-7,soil,POINT (-93.26529 45.39926),United States
12,12,mgm4449252.3,1_17_RL2,-93.26529,45.39926,USA,"Comparative metagenomic, phylogenetic, and ph...",grassland biome,meadow soil,bulk soil,WGS,2008-07-15 12:00:00 UTC-7,soil,POINT (-93.26529 45.39926),United States
13,13,mgm4449255.3,1_22_RL3,-93.26529,45.39926,USA,"Comparative metagenomic, phylogenetic, and ph...",grassland biome,meadow soil,bulk soil,WGS,2008-07-15 12:00:00 UTC-7,soil,POINT (-93.26529 45.39926),United States
14,14,mgm4449256.3,1_23_RL4,-93.26529,45.39926,USA,"Comparative metagenomic, phylogenetic, and ph...",grassland biome,meadow soil,bulk soil,WGS,2008-07-15 12:00:00 UTC-7,soil,POINT (-93.26529 45.39926),United States
15,15,mgm4449258.3,1_38_RL5,-93.26529,45.39926,USA,"Comparative metagenomic, phylogenetic, and ph...",grassland biome,meadow soil,bulk soil,WGS,2008-07-15 12:00:00 UTC-7,soil,POINT (-93.26529 45.39926),United States


In [None]:
utils.summarize_cntr_and_ftrs(precise_soil_mgrast_df)

In [None]:
utils.map_the_data(precise_soil_mgrast_df)

Again, we have many samples from North America. We also see that the dataset needs some more filtering (what is soil from coral reefs?).

# Sequence types?

So a combination of the EBI and MG-RAST databases would probably be cool. Which kinds of sequence data do we have? As we can learn in the `how_to_compare_the_metagenomes.ipynb` notebook, Mendes et al. worked with WGS data.

In [None]:
precise_soil_total_df = pd.concat([precise_soil_mgrast_df, precise_soil_ebi_df], keys = ["MG_RAST", "EBI"])
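
Because `pd.concat` is called with `keys`, the combined frame gets an outer index level naming the source database. A small sketch with made-up rows shows how that level can be used to count experiment types per database (the demo frames are hypothetical, and only the column `experiment_type` is taken from the tables above):

```python
import pandas as pd

# Made-up stand-ins for the two real frames.
mgrast_demo = pd.DataFrame({
    "accession": ["mgm1", "mgm2"],
    "experiment_type": ["WGS", "WGS"],
})
ebi_demo = pd.DataFrame({
    "accession": ["ERS1", "ERS2", "SRS3"],
    "experiment_type": ["amplicon", "amplicon", "WGS"],
})

# `keys` adds an outer index level: "MG_RAST" or "EBI".
total_demo = pd.concat([mgrast_demo, ebi_demo], keys=["MG_RAST", "EBI"])

# Group by that outer level and count the experiment types within each source.
counts = total_demo.groupby(level=0)["experiment_type"].value_counts()
print(counts)
```

Selecting a single source afterwards works with `total_demo.loc["EBI"]`.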

In [None]:
utils.map_the_data(precise_soil_total_df, experiment_type=True)

## The states

After reading the notebook `dbexploration/what_about_geolifeclef.ipynb` we know that there is a lot of great remote sensing and land use data for the central USA (and France); GeoLifeClef even provides code to load and process the dataset. So we can subsample the metagenomic data from the central USA.

In [None]:
COUNTRY = '"United States"'
SOMEWHERE_IN_CANADA = 55.166670  # to avoid samples from alaska
WEST_OF_CALI = -147.69870  # to avoid samples from hawaii

full_df = utils.summarize_cntr_and_ftrs(precise_soil_total_df)
cus_df = full_df.query(
    f"std_country == {COUNTRY} & latitude_deg < {SOMEWHERE_IN_CANADA} & longitude_deg > {WEST_OF_CALI}")

print(f"For the subsampled {COUNTRY} we have samples from {cus_df.shape[0]} locations")

That is a lot. These are all the locations:

In [None]:
cus_df
# cus_df[cus_df.biome == "ENVO:urban biome"].shape

Still, we see that most sample locations (565) belong to urban biomes near 40.769700 °N, -73.979800 °E, which is Central Park, New York (from 2014, [project page](https://www.ebi.ac.uk/ena/browser/view/PRJEB6596)).

We repeat the mapping of biome types only for central USA.

In [None]:
map_df = precise_soil_total_df.query(
    f"std_country == 'United States' & latitude_deg < {SOMEWHERE_IN_CANADA} & longitude_deg > {WEST_OF_CALI}")
utils.map_the_data(map_df)

---

# Decision making

### Country assignments

As the country annotation is not curated consistently, we reassign the country of sample origin by mapping the coordinates to shapefiles of all countries (and continents) worldwide (from [thematicmapping.org](http://thematicmapping.org/downloads/world_borders.php)). In doing so, we also observe that longitude and latitude are sometimes mistaken...
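
Some coordinate mistakes can be caught before the shapefile mapping with a plain range check. The sketch below is not part of `utils`. The function `coordinate_issue` is hypothetical and only flags values that cannot be valid as given, plus the case where latitude and longitude look swapped.

```python
import math


def coordinate_issue(latitude, longitude):
    """Return a short label for implausible coordinates, or None if they look valid."""
    if any(v is None or math.isnan(v) for v in (latitude, longitude)):
        return "missing"
    if abs(longitude) > 180:
        return "longitude out of range"
    if abs(latitude) > 90:
        # A latitude beyond 90 degrees that would be a valid longitude,
        # paired with a longitude that would be a valid latitude, hints at a swap.
        looks_swapped = abs(latitude) <= 180 and abs(longitude) <= 90
        return "possibly swapped" if looks_swapped else "latitude out of range"
    return None


checks = {
    "valid": coordinate_issue(37.566, 126.9784),
    "swapped": coordinate_issue(126.9784, 37.566),
    "sign error": coordinate_issue(43.9614, 93.6623),
}
print(checks)
```

A sign error stays within the valid range and is not caught by such a check. The MG-RAST sample "Waseca Farm Soil", for example, is reported for the United States of America with a positive longitude and ends up assigned to China, so it only shows up when the reported and the reassigned country are compared.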

### Coordinates' precision
As environments can vary on small spatial scales, we should ensure that the coordinates of our samples really lie within the ecosystem that we assume them to come from. Assume we have two places, Rucola (0.000° N, 0.000 °E) and Oregano (0.001° N, 0.001 °E), close to the equator. We use the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) to compute the geospatial distance between them (adapted from [this post](https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points)).

In [None]:
rucola = [0.000, 0.000]
oregano = [0.001, 0.001]
utils.dist_between_coord(rucola, oregano)
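
For reference, a self-contained sketch of the Haversine computation is given here. It is not the code of `utils.dist_between_coord`. The function name `haversine_m`, the (latitude, longitude) order and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(point_a, point_b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


rucola = (0.000, 0.000)
oregano = (0.001, 0.001)
equator_m = haversine_m(rucola, oregano)

# The same offset in degrees at 60° N, for comparison.
north_m = haversine_m((60.000, 0.000), (60.001, 0.001))
print(f"equator: {equator_m:.0f} m, 60° N: {north_m:.0f} m")
```

Under these assumptions the two places are roughly 157 m apart, and the same offset in degrees covers a shorter distance at higher latitudes because the meridians converge.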

Near the equator, a given difference in degrees of longitude corresponds to the largest metric distance. As the distance between Rucola and Oregano shows, it would be good to use samples with a precision of at least 0.001° (meaning three decimal places of a degree). It would be even better to require one more decimal place.

# Update for David's grant (Jan 2022)
To support the grant, I should collect the soil data in CSV files. Here we go:

In [None]:
soil_emp_df = utils.subset_for_coordprec(utils.emp_sdf)
for group, df in soil_emp_df.groupby(["latitude_deg", "longitude_deg"]):
    print(group)
    print(df[["#SampleID", "Description"]], end="\n\n")

In [None]:
soil_emp_df.to_csv("emp_dataset_david.csv")

In [49]:
precise_soil_mgrast_df.to_csv("mg_rast_dataset_david.csv", index=False)

In [55]:
precise_soil_ebi_df.to_csv("enaebi_dataset_david.csv", index=False)

# References
Thompson, L.R., Sanders, J.G., McDonald, D., Amir, A., Ladau, J., Locey, K.J., Prill, R.J., Tripathi, A., Gibbons, S.M., Ackermann, G., Navas-Molina, J.A., et al., 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature, 551(7681), pp.457-463.