<a href="https://colab.research.google.com/github/alldominguez/isee_young_rennes_ws1/blob/main/ws1_isee_young_rennes_version1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://github.com/alldominguez/isee_young_rennes_ws1/blob/main/figures/logo_isse_young_rennes.PNG?raw=1" alt="ISGlobal logo" width="1000"/>  

# **Workshop 1: Statistical methods for studying mixtures and the exposome"**  

The study of mixtures and the exposome in the context of environmental epidemiological research is rapidly growing. Investigating mixtures and the exposome allows researchers to assess the independent and combined effects of various exposures, as well as their potential synergistic or antagonistic effects, on health outcomes. However, the complexity of exploring these questions requires the use of specific statistical models to account for aspects that single-exposure models cannot typically handle (e.g. multicollinearity).

This workshop therefore aims at summarizing and presenting the main models used for studying mixtures and the exposome, and discussing the pros and cons of each method in relation to a specific study objectives.




##**Introduction to the NoteBook** 📚

Within this **NoteBook**, you will be guided step by step from loading a dataset to running some mixture and exposome analysis.

The [Jupyter notebook](https://github.com/jupyter/notebook/tree/main) is an interactive computing environment that allows users to author notebook documents. Notebooks consist of **linear sequence of cells** that combines **code cells (input and output of live code that is run)**, and **markdown cells (narative text)**.

The components of the notebook are:

- **notebook web application:** an interactive web application for writing and running code interectively.
- **kernels**: separate processes started by the notebook application that run users' code in an specific language (Python, R, Julia, Ruby, Scala, etc).
- **notebook documents:** documents that contain a representation of all content visibile in the notebook web application.

## **Step-by-step**

The order of the instructions is **essential**, so each cell in this notebook must be executed **sequentially**. If you omit any, you could have an error in your notebook, so you should start running cells from the beginning.

🔴 It is **very very important** that at the beginning you select **"*Open in test mode*" (draft mode)**, at the top left. Otherwise, it will not allow any block of code to be executed, for security reasons. When the first block is executed, the following message will appear: "**Warning: This notebook was not created by Google.**". Don't worry, you will have to trust the contents of the  (*NoteBook*) and click "Run anyway".

Click the "**run**" button on the left side of each code cell. Lines of code that begin with a **#** are comments and do not affect the execution of the code in the different chunks across the notebook.




## **INDEX**
1. [Installation of the R environment and libraries for analysis](#install-libraries)    
2. [Load data](#load-data)
3. [Análisis descriptivo del Exposoma](#descriptivo)   
4. [Análisis de asociación del Exposoma](#asociacion)
  


## **1. Installation of the R environment and libraries for analysis** <a name="install-libraries"></a>

* **Install R environment**

Installing R in our Google Colab environment will be done in the following code block. Remember that all library installations that we perform in the Google Colab environment will only remain active for a few hours, after which the installed libraries are deleted. Therefore, you will need to rerun the library installation codes in this section when you need to run notebook again after this time.

In [3]:
# Check R version
R.version

               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          4.0                         
year           2024                        
month          04                          
day            24                          
svn rev        86474                       
language       R                           
version.string R version 4.4.0 (2024-04-24)
nickname       Puppy Cup                   

* **Install/load libraries for the session**

We will install/load the necessary libraries for the practical session, for this we will use the `pacman` package, this package is an administration tool that combines functionalities of the `install.packages` + `library` functions.

In the context of exposome analysis, R libraries offer us a much more convenient way to process, manipulate and analyze data. Some of the  libraries that we will use in this session are: `rexposome`, `bkmr`, `gWQS`.

In [4]:
# Execution time: 3 sec.
install.packages("pacman")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [6]:
# Execution time: 23 min aprox.
pacman::p_load(Biobase, mice, MultiDataSet, lsr, FactoMiner,
               stingr, circlize, reshape2, pryr, scales, imputeLCMD,
               scatterplot3d, glmnet, gridExtra, grid, Hmisc, gplots,
               gtools, S4Vectors, tidyverse, corrplot, RColorBrewer,
               skimr, bkmr, gWQS, ggridges, rexposome, MASS, caret, partDSA)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

“package ‘Biobase’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages”
“'BiocManager' not available.  Could not check Bioconductor.

Please use `install.packages('BiocManager')` and then retry.”
“”
“there is no package called ‘Biobase’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

“package ‘MultiDataSet’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages”
“'BiocManager' not available.  Could not check Bioconductor.

Please use `install.packages('BiocManager')` and then retry.”
“”
“there is no package called ‘MultiDataSet’”
Installing package

## **2. Load the data** <a name="cargar-datos"></a>

Below are the **lines of code** required to **load** the Exposoma data set into the R environment. For this hands-on session we will use data from the HELIX exposome study. The HELIX study is a collaborative project between six population-based longitudinal birth cohort studies from six European countries (France, Greece, Lithuania, Norway, Spain and the United Kingdom).

<img src="https://github.com/alldominguez/isee_young_rennes_ws1/blob/main/figures/HELIX.png?raw=1" alt="HELIX logo" width="600"/>

**Note:** The data provided in this introductory course were simulated using data from the HELIX subcohort. Details of the HELIX project and the origin of the data collected can be consulted in the following publication: https://bmjopen.bmj.com/content/8/9/e021311 and website: https://www.projecthelix.eu/es .

* The **exposome data (n = 1301)** that we will use is contained in an Rdata file, the file contains the following files:

1. `phenotype` (outcomes)
2. `exposome`
3. `covariates` (covariates)


The `exposome` database contains more than **200 exposures**.

<img src="https://github.com/alldominguez/isee_young_rennes_ws1/blob/main/figures/HELIX_exposures.png?raw=1" alt="HELIX exposures" width="700"/>

The description of each variable (name, structure, variable type, transformation, ...) is detailed in the [codebook](https://github.com/alldominguez/isee_young_rennes/blob/main/data/codebook.csv).


* Load the neccesary data for the session

In [7]:
# This RData file contains (phenotype, exposoe, covariates and codebook)
load(url("https://raw.githubusercontent.com/alldominguez/ISGlobal.sesion4.Exposoma/main/data/exposome.RData"))

In [10]:
dplyr::glimpse(phenotype) # outcomes
dplyr::glimpse(exposome) # exposures
dplyr::glimpse(covariates) # covariates

Rows: 1,301
Columns: 7
$ ID               [3m[90m<int>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ e3_bw            [3m[90m<int>[39m[23m 4100, 4158, 4110, 3270, 3950, 2900, 3350, 3580, 3000,…
$ hs_asthma        [3m[90m<dbl>[39m[23m 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,…
$ hs_zbmi_who      [3m[90m<dbl>[39m[23m 0.30, 0.41, 3.33, -0.76, 0.98, -0.08, 0.04, -0.10, -1…
$ hs_correct_raven [3m[90m<int>[39m[23m 18, 25, 13, 28, 19, 19, 34, 16, 35, 32, 18, 24, 30, 3…
$ hs_Gen_Tot       [3m[90m<dbl>[39m[23m 84.0000, 39.0000, 40.0000, 54.5000, 18.0000, 4.0000, …
$ hs_bmi_c_cat     [3m[90m<fct>[39m[23m 2, 2, 4, 2, 2, 2, 2, 2, 2, 3, 4, 4, 2, 2, 4, 3, 2, 2,…
Rows: 1,301
Columns: 223
$ ID                           [3m[90m<int>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
$ h_abs_ratio_preg_Log         [3m[90m<dbl>[39m[23m 0.89671105, 0.89253797, 0.77872299, 0.089…
$ h_no2_ratio_preg_Log         [3m[90m<dbl>[39m[23m 2.872

In [14]:
codebook

Unnamed: 0_level_0,variable_name,domain,family,subfamily,period,location,period_postnatal,description,var_type,transformation,labels,labelsshort
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
h_abs_ratio_preg_Log,h_abs_ratio_preg_Log,Outdoor exposures,Air Pollution,PMAbsorbance,Pregnancy,Home,,abs value (extrapolated back in time using ratio method)duringpregnancy,numeric,Natural Logarithm,PMabs,PMabs
h_no2_ratio_preg_Log,h_no2_ratio_preg_Log,Outdoor exposures,Air Pollution,NO2,Pregnancy,Home,,no2 value (extrapolated back in time using ratio method)during pregnancy,numeric,Natural Logarithm,NO2,NO2
h_pm10_ratio_preg_None,h_pm10_ratio_preg_None,Outdoor exposures,Air Pollution,PM10,Pregnancy,Home,,pm10 value (extrapolated back in time using ratio method)duringpregnancy,numeric,,PM10,PM10
h_pm25_ratio_preg_None,h_pm25_ratio_preg_None,Outdoor exposures,Air Pollution,PM2.5,Pregnancy,Home,,pm25 value (extrapolated back in time using ratio method)duringpregnancy,numeric,,PM2.5,PM2.5
hs_no2_dy_hs_h_Log,hs_no2_dy_hs_h_Log,Outdoor exposures,Air Pollution,NO2,Postnatal,Home,Day before examination,no2 value (extrapolated back in time using ratio method)one day before hs test at home,numeric,Natural Logarithm,NO2(day),NO2(day)
hs_no2_wk_hs_h_Log,hs_no2_wk_hs_h_Log,Outdoor exposures,Air Pollution,NO2,Postnatal,Home,Week before examination,no2 value (extrapolated back in time using ratio method)one week before hs test at home,numeric,Natural Logarithm,NO2(week),NO2(week)
hs_no2_yr_hs_h_Log,hs_no2_yr_hs_h_Log,Outdoor exposures,Air Pollution,NO2,Postnatal,Home,Year before examination,no2 value (extrapolated back in time using ratio method)one year before hs test at home,numeric,Natural Logarithm,NO2(year),NO2(year)
hs_pm10_dy_hs_h_None,hs_pm10_dy_hs_h_None,Outdoor exposures,Air Pollution,PM10,Postnatal,Home,Day before examination,pm10 value (extrapolated back in time using ratio method)one day before hs test at home,numeric,,PM10(day),PM10(day)
hs_pm10_wk_hs_h_None,hs_pm10_wk_hs_h_None,Outdoor exposures,Air Pollution,PM10,Postnatal,Home,Week before examination,pm10 value (extrapolated back in time using ratio method)one week before hs test at home,numeric,,PM10(week),PM10(week)
hs_pm10_yr_hs_h_None,hs_pm10_yr_hs_h_None,Outdoor exposures,Air Pollution,PM10,Postnatal,Home,Year before examination,pm10 value (extrapolated back in time using ratio method)one year before hs test at home,numeric,,PM10(year),PM10(year)


We are going to use the `rexposome::loadExposome` function to create a single dataset (`ExposomeSet`) through the `data.frames` that we initially loaded. First we will organize the data in the appropriate format for our analysis.

In [26]:
# Time windows of exposure availables
levels(codebook$period)

In [27]:
# Exposure families availables for the analysis
levels(codebook$family)

In [22]:
expo.list <- as.character(codebook$variable_name[(codebook$family == "Organochlorines" |
                                                  codebook$family == "Metals" |
                                                  codebook$family == "Built environment") &
                                                  codebook$period == "Postnatal"]) # we can also select "Pregnancy"
expo.list

In [18]:
# We can exclude innecesary information
expo.list <- expo.list[-which(expo.list == "hs_tl_cdich_None")]
expo.list <- expo.list[-which(expo.list == "hs_sumPCBs5_cadj_Log2")]

In [23]:
# Select specific columns (variables) from the families that we selected in the previous step and add the identifier per subject (ID)
expo2 <- exposome[ ,c("ID", expo.list)]

In [24]:
# Now we scale the continous variables
index.cont <- c(3:9,11:ncol(expo2))
for (i in index.cont) {
  expo2[,i] <- expo2[,i]/IQR(expo2[,i],na.rm=T)
}

“‘/’ not meaningful for factors”


In [25]:
# check the selected exposure variables
dplyr::glimpse(expo2)

Rows: 1,301
Columns: 35
$ ID                         [3m[90m<int>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ hs_accesslines300_h_dic0   [3m[90m<dbl>[39m[23m 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ hs_accesspoints300_h_Log   [3m[90m<dbl>[39m[23m 1.0410206, 1.7236128, 0.3585836, 2.8483688,…
$ hs_builtdens300_h_Sqrt     [3m[90m<dbl>[39m[23m 2.5614489, 2.4102779, 2.3607646, 3.0238301,…
$ hs_connind300_h_Log        [3m[90m<dbl>[39m[23m 4.761417, 3.915685, 3.002645, 5.709277, 5.0…
$ hs_fdensity300_h_Log       [3m[90m<dbl>[39m[23m 4.935381, 4.935381, 4.935381, 6.721231, 5.2…
$ hs_landuseshan300_h_None   [3m[90m<dbl>[39m[23m 1.9739687, 1.7899665, 2.6763931, 2.5376134,…
$ hs_popdens_h_Sqrt          [3m[90m<dbl>[39m[23m 1.54657294, 0.18646883, 0.18646883, 1.21078…
$ hs_walkability_mean_h_None [3m[90m<dbl>[39m[23m 3.75, 2.00, 2.50, 5.25, 3.00, 3.75, 3.00, 3…
$ hs_accesslines300_s_dic0   [3m[90m<dbl>[39m[23m 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0

## **Mixture analysis**

The main idea in mixture analysis is that low levels of exposure to a given contaminant may produce no health effects (or effects that are too small to be detected), but combined exposure to multiple contaminants can generate an effect.

<img src="https://github.com/alldominguez/isee_young_rennes_ws1/blob/main/figures/PRIME.png?raw=1" alt="ISGlobal logo" width="500"/>  

Approaches used in environmental epidemiology fail to capture the complexity when evaluating the combined effect of multiple exposures due to some limitations:

- They will not evaluate the joint effect of multiple exposures.
- The interaction between different exposures is not considered.

Therefore, other methods are needed to investigate the health effects of mixtures or multiple exposures. In recent years, various methods have been proposed to estimate the independent and joint effects of multiple exposures.

The selection of the **correct method** in **mixture analysis** should be guided by the **research question we want to answer**.
