In [1]:
# Check if python is 3.9.5
import sys
print(sys.version)
%load_ext autoreload
%autoreload 2

3.9.5 (default, May 18 2021, 12:31:01) 
[Clang 10.0.0 ]


# Geographic distribution of the Earth Microbiome Project (EMP)

(The importing step below may take some time, as country assignments are reevaluated.)

In [26]:
import utils
import pandas as pd

The [Earth Microbiome Project](https://earthmicrobiome.org/) provides a large collection of metagenomic samples. Metadata for the large dataset of Thompson et al. (2017) can be found [here](https://zenodo.org/record/890000#.YMx_05P7Rp8).

The samples of the EMP dataset are annotated by origin using the environmental ontology framework [ENVO](https://sites.google.com/site/environmentontology/home) (which could also be interesting for land use categorization). As we are interested in terrestrial soil samples, here is a short summary:

In [6]:
utils.total_n_samples(utils.emp_df)

Overall there are 27738 samples within the dataset.


The dataset uses three classes of categories: biomes (the overall ecosystem), material (the matrix that was sampled) and features (key features of the biome). Here is how many samples are stored per ontology term:

In [7]:
utils.overall_env_features(utils.emp_df)

--- Biomes ---
urban biome                              7588
Small lake biome                         3787
marine biome                             2972
Large river biome                        2139
cropland biome                           1919
freshwater biome                         1538
Large lake biome                         1192
rangeland biome                          1097
village biome                             614
mediterranean shrubland biome             608
tundra biome                              585
mediterranean woodland biome              570
aquatic biome                             438
tropical moist broadleaf forest biome     387
forest biome                              285
marine benthic biome                      283
montane grassland biome                   230
shrubland biome                           207
Small river biome                         192
dense settlement biome                    189
polar desert biome                        166
tropical shrubland 

We can observe that multiple ontologies have the term "soil" in their name. And what about "rhizosphere"?
As a first step, let us filter for all ontologies containing "soil" and summarize the resulting subset of soil samples:
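
The keyword filter described here can be sketched in plain Python. This is only an illustration, not the code behind `utils.subset_for_soil()`: the column names `env_biome`, `env_feature` and `env_material`, the helper `mentions` and the three toy records are assumptions made for the example.

```python
# Toy records, NOT taken from the EMP metadata; column names are assumed.
samples = [
    {"env_biome": "cropland biome", "env_feature": "agricultural soil", "env_material": "soil"},
    {"env_biome": "marine biome", "env_feature": "ocean", "env_material": "sea water"},
    {"env_biome": "forest biome", "env_feature": "plant-associated habitat", "env_material": "rhizosphere"},
]

ENV_COLUMNS = ("env_biome", "env_feature", "env_material")


def mentions(record, keyword, columns=ENV_COLUMNS):
    """Return True if any ontology column of the record contains the keyword."""
    keyword = keyword.lower()
    return any(keyword in str(record.get(col, "")).lower() for col in columns)


# Filtering for "soil" alone misses the rhizosphere record ...
soil_only = [s for s in samples if mentions(s, "soil")]
# ... while adding "rhizosphere" as a second keyword keeps it.
soil_or_rhizo = [s for s in samples if mentions(s, "soil") or mentions(s, "rhizosphere")]

print(len(soil_only), len(soil_or_rhizo))
```

A case-insensitive substring match is the simplest choice here; it would also match terms that merely contain the keyword, so the resulting subset is worth inspecting by eye.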

In [15]:
utils.subset_for_soil()
utils.total_n_samples(utils.emp_sdf)

Overall there are 3573 samples within the dataset.


In [16]:
utils.overall_env_features(utils.emp_sdf)

--- Biomes ---
cropland biome                           1111
urban biome                               864
tundra biome                              385
forest biome                              254
montane grassland biome                   230
shrubland biome                           128
tropical shrubland biome                  127
tropical moist broadleaf forest biome     123
Large river biome                          93
grassland biome                            57
polar desert biome                         45
rangeland biome                            35
desert biome                               27
temperate grassland biome                  25
montane shrubland biome                    23
coniferous forest biome                    19
tropical coniferous forest biome            6
temperate mixed forest biome                6
tropical broadleaf forest biome             4
temperate coniferous forest biome           4
tropical grassland biome                    4
dense settlement bi

So, now it is becoming interesting. Where do these samples come from?

In [17]:
utils.summarize_cntr_and_ftrs(utils.emp_sdf)

| std_country | latitude_deg | longitude_deg | biome | n_samples |
|---|---|---|---|---|
| Antarctica | -78.02 | 163.88 | tundra biome | 1 |
| Antarctica | -77.73 | 161.31 | tundra biome | 1 |
| Antarctica | -77.72 | 162.31 | tundra biome | 1 |
| Antarctica | -77.65 | 162.89 | tundra biome | 1 |
| Antarctica | -77.63 | 162.88 | tundra biome | 1 |
| Antarctica | -77.61 | 162.25 | tundra biome | 2 |
| Antarctica | -77.53 | 161.7 | tundra biome | 3 |
| Antarctica | -77.52 | 162.89 | tundra biome | 2 |
| Antarctica | -62.05 | -58.24 | polar desert biome | 45 |
| Argentina | -27.73 | -55.68 | grassland biome | 4 |


<div class="alert alert-block alert-danger">
<b>Attention</b> - many samples have coordinates with poor resolution.
</div>

Next, let us check where the samples are located on a map. The colors reflect the type of biome (some colors are shared by two types). Orange icons represent locations with multiple biome types (maybe not good?!).

In [18]:
utils.map_the_data(utils.emp_sdf)

<b>Okay...</b> USA and Canada seem promising somehow.

# Geographic distribution of soil samples from EBI metagenomics

As a next step, we try to extend the dataset to the whole [EBI metagenomics database](https://www.ebi.ac.uk/metagenomics/). Again, samples are stored with ENVO biome annotations. Thus, we can select data that comes from soil, rhizosphere or rhizoplane.

<div class="alert alert-block alert-info">
    <b>Wait!</b><br>What is the difference between soil, rhizosphere and rhizoplane? <a href="http://www.ontobee.org/ontology/ENVO">OntoBEE</a> is a good source for finding information about ENVO biomes. It defines:
<ul>
    <li><b>Soil:</b> Soil is an environmental material which is primarily composed of minerals, varying proportions of sand, silt, and clay, organic material such as humus, gases, liquids, and a broad range of resident micro- and macroorganisms.</li>
    <li><b>Rhizosphere:</b> The narrow region of soil that is directly influenced by root secretions and associated soil microorganisms.</li>
    <li><b>Rhizoplane:</b> A surface layer which is composed of the external surface of a root, together with closely adhering soil particles and debris.</li>
    </ul>    
    So, somehow all of these are related to soil. As a real soil inspector knows, there are always roots within the soil. Also, there is always soil around the roots. Therefore, we use all three categories to build the dataset.
</div>

<div class="alert alert-block alert-warning">
    <b>Possible problem!</b><br>What if some samples of the rhizosphere/rhizoplane were from water plants?
</div>

Using the [MGnify REST API](https://www.ebi.ac.uk/metagenomics/api/v1/), metadata for all these samples can be obtained.
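
As a rough idea of what such a download involves, the sketch below walks through a paginated listing. It is not the code behind `utils.load_ebi_data()`. The endpoint path in `SOIL_SAMPLES_URL` and the JSON:API-style response layout (a `data` list plus a `links.next` URL) are assumptions about the API, and `fetch_json` and `iter_records` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://www.ebi.ac.uk/metagenomics/api/v1/"
# Assumed endpoint layout: samples listed under a biome lineage.
SOIL_SAMPLES_URL = API_ROOT + "biomes/root:Environmental:Terrestrial:Soil/samples"


def fetch_json(url):
    """Download one page and parse it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_records(first_url, fetch=fetch_json):
    """Yield every record, following `links.next` until no next page is given."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        url = (page.get("links") or {}).get("next")


# Offline demonstration with two fake pages instead of real network calls.
fake_pages = {
    "page1": {"data": [{"id": "A"}, {"id": "B"}], "links": {"next": "page2"}},
    "page2": {"data": [{"id": "C"}], "links": {"next": None}},
}
print([record["id"] for record in iter_records("page1", fetch=fake_pages.__getitem__)])
```

Passing the fetch function as a parameter keeps the pagination logic testable without touching the network; calling `iter_records(SOIL_SAMPLES_URL)` would issue real requests.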

In [19]:
utils.load_ebi_data()

Do you want to reload the metadata from EBI database? [y/n] n


Similarly to the EMP dataset, it can be imported and stored globally.

In [21]:
utils.import_ebi()
utils.total_n_samples(utils.ebi_df)

Overall there are 3411 samples within the dataset.


As many of the samples do not even have coordinates, we filter for samples with coordinates. More precisely, we only use samples whose coordinates are given to at least 3 decimal places of a degree (for the reasoning, see "Decision making > Coordinates' precision").
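
One way to express such a precision check is to count the digits after the decimal point. The sketch below is an assumption about how a filter like this could work, not the implementation of `utils.subset_for_coordprec`. The helpers `decimal_places` and `is_precise` are hypothetical, and requiring the precision for both latitude and longitude is a choice made for the example.

```python
from decimal import Decimal


def decimal_places(value):
    """Count the digits after the decimal point; missing values count as 0."""
    number = Decimal(str(value))
    if not number.is_finite():
        return 0
    # normalize() drops trailing zeros, so 10.0 counts as 0 places.
    exponent = number.normalize().as_tuple().exponent
    return max(0, -exponent)


def is_precise(latitude, longitude, min_places=3):
    """Assumed rule: both coordinates need at least `min_places` decimals."""
    return min(decimal_places(latitude), decimal_places(longitude)) >= min_places


# Coordinate pairs taken from the summary tables in this notebook.
rows = [(-28.833, 153.417), (-77.53, 161.7), (51.6522, -128.1297)]
kept = [row for row in rows if is_precise(*row)]
print(kept)
```

Because trailing zeros are lost once a coordinate is stored as a float, a value such as 12.500 is counted as having one decimal place. The check is therefore conservative and may discard a few samples that were in fact recorded precisely.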

In [22]:
precise_soil_df = utils.subset_for_coordprec(utils.ebi_df)
utils.total_n_samples(precise_soil_df)

Overall there are 2661 samples within the dataset.


Again, we can inspect the dataset a bit more regarding ecosystems and countries:

In [23]:
utils.overall_env_features(precise_soil_df)

--- Biomes ---
urban biome                              823
cropland biome                           480
Bog Forest                               288
montane grassland biome                  230
tundra biome                             172
clay soil                                134
forest biome                             104
Tundra                                    96
tropical moist broadleaf forest biome     69
large river biome                         44
prarie                                    41
marine salt marsh biome                   38
mountain forest soil                      37
terrestrial                               34
Semiarid                                  18
cool tempreate                            18
terrestrial biome                          7
desert                                     6
montane shrubland biome                    4
tropical broadleaf forest biome            4
temperate grassland biome                  4
sediment                                

In [24]:
utils.summarize_cntr_and_ftrs(precise_soil_df)

| std_country | latitude_deg | longitude_deg | biome | n_samples |
|---|---|---|---|---|
| Australia | -28.833 | 153.417 | cropland biome | 295 |
| Canada | 51.6522 | -128.1297 | Bog Forest | 288 |
| Chile | -40.7767 | -72.1978 | mountain forest soil | 3 |
| Chile | -40.7592 | -72.2976 | mountain forest soil | 3 |
| Chile | -40.1986 | -73.4314 | mountain forest soil | 3 |
| Chile | -40.1963 | -73.4355 | mountain forest soil | 3 |
| Chile | -40.1962 | -73.436 | mountain forest soil | 3 |
| Chile | -40.1961 | -73.4351 | mountain forest soil | 3 |
| Chile | -40.1697 | -73.549 | mountain forest soil | 3 |
| Chile | -39.6013 | -72.0966 | mountain forest soil | 3 |


Again, let us plot the data on an interactive global map:

In [25]:
utils.map_the_data(precise_soil_df)

# Decision making

### Country assignments

As country annotations are not curated consistently, we reassign each sample's country of origin by mapping its coordinates onto shapefiles of all countries (and continents) worldwide (from [thematicmapping.org](http://thematicmapping.org/downloads/world_borders.php)). In doing so, we also observe that longitude and latitude are sometimes mixed up...
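
The idea of mapping a coordinate onto country outlines can be illustrated with a point-in-polygon test. The sketch below is not the code in `utils`: it uses two made-up rectangular "countries" instead of the real shapefiles, and the retry with exchanged latitude and longitude is only one assumed way to flag the mix-ups mentioned above. All names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: is (lon, lat) inside a polygon given as [(lon, lat), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Toy borders: made-up boxes, NOT real country outlines.
TOY_BORDERS = {
    "Boxland": [(-10.0, 40.0), (10.0, 40.0), (10.0, 60.0), (-10.0, 60.0)],
    "Squaria": [(100.0, -40.0), (150.0, -40.0), (150.0, -10.0), (100.0, -10.0)],
}


def assign_country(lat, lon, borders=TOY_BORDERS):
    """Return the first country whose polygon contains the point, else None."""
    for name, polygon in borders.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


def assign_with_swap_check(lat, lon, borders=TOY_BORDERS):
    """Return (country, swapped); swapped is True if exchanging lat/lon was needed."""
    country = assign_country(lat, lon, borders)
    if country is not None:
        return country, False
    # Retry with exchanged values only if they form a valid coordinate.
    if abs(lon) <= 90 and abs(lat) <= 180:
        country = assign_country(lon, lat, borders)
        if country is not None:
            return country, True
    return None, False


print(assign_with_swap_check(50.0, 5.0))
print(assign_with_swap_check(120.0, -25.0))
```

A hit after swapping is only a hint, since a swapped coordinate can also land in a plausible but wrong country. Such samples are better flagged for manual review than corrected silently.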

### Coordinates' precision
As environments can vary on small spatial scales, we should ensure that the coordinates of our samples really lie within the ecosystem we assume them to come from. Assume we have two places, Rucola (0.000° N, 0.000° E) and Oregano (0.001° N, 0.001° E), close to the equator. We use the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) to compute the geospatial distance between them (adapted from [this post](https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points)).

In [40]:
rucola = [0.000, 0.000]
oregano = [0.001, 0.001]
utils.dist_between_coord(rucola, oregano)

The points [ 0.0 °N, 0.0 °E ] and [ 0.001 °N, 0.001 °E ] have a distance of:
157.25 m
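
For reference, the distance printed above can be reproduced with a minimal Haversine sketch. This is a stand-alone illustration and not necessarily identical to `utils.dist_between_coord`. The function name `haversine_m` is hypothetical and the mean Earth radius of 6371 km is an assumption.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(point_a, point_b):
    """Great-circle distance in metres between two [lat, lon] points in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


rucola = [0.000, 0.000]
oregano = [0.001, 0.001]
print(f"{haversine_m(rucola, oregano):.2f} m")
```

With the assumed radius, the result agrees with the roughly 157 m reported by the notebook.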


Near the equator, the metric distance corresponding to a given difference in longitude is at its largest. As the distance between Rucola and Oregano shows, samples should have a precision of at least 0.001° (meaning 3 decimal places of a degree). One further order of magnitude (0.0001°) would be even better.
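
The latitude dependence can be made concrete with a small calculation. On a spherical Earth, a step in longitude covers a distance proportional to the cosine of the latitude, while a step in latitude covers the same distance everywhere. The sketch assumes a sphere with a radius of 6371 km, and `east_west_metres` is a hypothetical helper.

```python
from math import cos, radians

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def east_west_metres(delta_lon_deg, latitude_deg):
    """Approximate east-west distance of a longitude difference at a given latitude."""
    return radians(delta_lon_deg) * EARTH_RADIUS_M * cos(radians(latitude_deg))


for latitude in (0, 30, 60, 80):
    metres = east_west_metres(0.001, latitude)
    print(f"{latitude:>2} deg latitude: 0.001 deg of longitude = {metres:6.1f} m")
```

The equator is therefore the worst case: a threshold that is fine enough there is also fine enough at every other latitude.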

# References
Thompson, L.R., Sanders, J.G., McDonald, D., Amir, A., Ladau, J., Locey, K.J., Prill, R.J., Tripathi, A., Gibbons, S.M., Ackermann, G. and Navas-Molina, J.A., 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature, 551(7681), pp.457-463.