# Preprocessing Step
---

This notebook carries out the preprocessing steps for the metabolomics data:
- Imputation
- Normalization
- Log2 Transformation

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall 
    using Pkg
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-42~1.1" # from R.home() in R
    Pkg.build("RCall")
end     

In [2]:
using CSV, DataFrames
using RCall

In [3]:
using CSV, DataFrames, Missings #, CategoricalArrays
using StatsBase, Statistics #, MultivariateStats#, RCall
using FreqTables #, Plots, StatsPlots

### Ext. Functions

In [4]:
include(joinpath(@__DIR__,"..","..","src","preprocessing.jl" ));
include(joinpath(@__DIR__,"..","..","src","wrangle_utils.jl" ));

### Load data

In [5]:
# Get reference metabolite file
fileRef = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","refMeta.csv");
dfRef = CSV.read(fileRef, DataFrame);
println("The reference metabolite dataset contains $(size(dfRef, 1)) different metabolites.")

The reference metabolite dataset contains 1174 different metabolites.


In [6]:
# Get negative metabolite file
dfNegMetabo = readCOPDdata(realpath(joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","negMeta.csv"))) 
println("The negative metabolite dataset contains $(size(dfNegMetabo, 1)) samples and $(size(dfNegMetabo, 2)-1) metabolites.")

The negative metabolite dataset contains 372 samples and 588 metabolites.


In [7]:
# Get polar metabolite file
dfPolarMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","polarMeta.csv"));
println("The polar metabolite dataset contains $(size(dfPolarMetabo, 1)) samples and $(size(dfPolarMetabo, 2)-1) metabolites.")

The polar metabolite dataset contains 372 samples and 96 metabolites.


In [8]:
# Get positive early metabolite file
dfPosEarlyMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","posEarlyMeta.csv"));
println("The positive early metabolite dataset contains $(size(dfPosEarlyMetabo, 1)) samples and $(size(dfPosEarlyMetabo, 2)-1) metabolites.")

The positive early metabolite dataset contains 372 samples and 258 metabolites.


In [9]:
# Get positive late metabolite file
dfPosLateMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","posLateMeta.csv"));
println("The positive late metabolite dataset contains $(size(dfPosLateMetabo, 1)) samples and $(size(dfPosLateMetabo, 2)-1) metabolites.")

The positive late metabolite dataset contains 372 samples and 232 metabolites.


### Join dataframes

The dataframe `dfPolarMetabo` contains Integer-typed values instead of Float-typed values, which produces an error during the imputation. We need to convert the values to Float type.

In [10]:
typeof(dfPolarMetabo[:, 2])

Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})

In [11]:
dfPolarMetabo[!,2:end] .= convert.(Union{Missing, Float64}, dfPolarMetabo[:, 2:end]);
typeof(dfPolarMetabo[:, 2])

Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})
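
The same kind of conversion can be tried on a toy table. This is only a sketch: the `toy` dataframe and its column names are made up for illustration, and it converts one column at a time rather than broadcasting over the whole block as the cell above does.

```julia
using DataFrames

# Hypothetical toy data: one column allows missing values, both hold integers.
toy = DataFrame(SampleID = ["s1", "s2"],
                compA = [1, missing],
                compB = [3, 4])

# Convert every column after SampleID to Union{Missing, Float64}.
for col in names(toy)[2:end]
    toy[!, col] = convert(Vector{Union{Missing, Float64}}, toy[!, col])
end

eltype(toy.compA), eltype(toy.compB)
```

After the loop, both columns share the element type `Union{Missing, Float64}`, so later steps see a uniform numeric type.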

In [12]:
df = leftjoin(dfNegMetabo, dfPolarMetabo, on = :SampleID)
leftjoin!(df, dfPosEarlyMetabo, on = :SampleID)
leftjoin!(df, dfPosLateMetabo, on = :SampleID)
size(df)

(372, 1175)

## Imputation

In [13]:
names(dfRef)

6-element Vector{String}:
 "metabolite_name"
 "CompID"
 "SubPathway"
 "SuperPathway"
 "SubClassID"
 "SuperClassID"

In [14]:
df = imputeSPIROMICS(df, dfRef);

The metabolite cotinine contains 44.35% missing samples.
We dropped 392 metabolites due to a missingness greater than 20%.
We preserved 782 metabolites.
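
The internals of `imputeSPIROMICS` are not shown in this notebook. As a rough illustration of the missingness filter that its output reports, the sketch below drops columns with more than 20% missing values from a toy table. The `toy` dataframe, the variable names, and the exact comparison (`<=` at the threshold) are assumptions, not the project's implementation.

```julia
using DataFrames

# Hypothetical toy data with different amounts of missingness per metabolite.
toy = DataFrame(SampleID = ["s1", "s2", "s3", "s4", "s5"],
                compA = [1.0, 2.0, missing, 4.0, 5.0],      # 20% missing
                compB = [1.0, missing, missing, 4.0, 5.0],  # 40% missing
                compC = [1.0, 2.0, 3.0, 4.0, 5.0])          # 0% missing

threshold = 0.20  # drop metabolites whose missing fraction is greater than this

# Fraction of missing entries for each metabolite column.
missFrac = Dict(n => count(ismissing, toy[!, n]) / nrow(toy)
                for n in names(toy)[2:end])

# Keep SampleID plus every metabolite at or below the threshold.
keep = [n for n in names(toy)[2:end] if missFrac[n] <= threshold]
toyKept = toy[:, vcat("SampleID", keep)]

println("We dropped $(length(missFrac) - length(keep)) metabolites and preserved $(length(keep)).")
```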


## Normalization
---

### Probabilistic Quotient Normalization

> 1. Perform an integral normalization (typically a constant integral of 100 is used).
> 2. Choose/calculate the reference spectrum (the best approach is the calculation of the median spectrum of control samples).
> 3. Calculate the quotients of all variables of interest of the test spectrum with those of the reference spectrum.
> 4. Calculate the median of these quotients.
> 5. Divide all variables of the test spectrum by this median.
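
The five steps above can be sketched on a plain matrix. This is a minimal illustration, not the `pqnorm` function from `src/preprocessing.jl`: the name `pqn_sketch` is hypothetical, it assumes strictly positive values with no missing entries, and it uses the median over all samples as the reference spectrum rather than the median over control samples.

```julia
using Statistics

# Hypothetical helper. X has samples in rows and metabolites in columns.
function pqn_sketch(X::AbstractMatrix{<:Real}; total = 100.0)
    # 1. Integral normalization: scale each sample (row) to a constant total.
    Xint = X ./ sum(X, dims = 2) .* total
    # 2. Reference spectrum: per-metabolite median across all samples (assumption).
    ref = median(Xint, dims = 1)
    # 3. Quotients of every sample against the reference spectrum.
    Q = Xint ./ ref
    # 4. Median of the quotients for each sample.
    m = median(Q, dims = 2)
    # 5. Divide each sample by its median quotient.
    return Xint ./ m
end

# Toy data: the three samples differ only by a dilution factor.
X = [1.0 2.0  4.0;
     2.0 4.0  8.0;
     3.0 6.0 12.0]
Xn = pqn_sketch(X)
```

Because the toy samples are pure dilutions of one another, their rows become identical after normalization, which is the effect PQN is meant to achieve.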


In [15]:
df[!,2:end] .= convert.(Union{Missing, Float64}, df[:, 2:end]);
df = pqnorm(df, startCol = 2);

## Transformation
---

A simple and widely used transformation to make data more symmetric and homoscedastic is the log-transformation.
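
As a small illustration of the idea, the sketch below applies a plain log2 to every column after the sample identifier. The helper `log2_sketch` and the toy data are hypothetical; the project's `log2tx` is defined in `src/preprocessing.jl` and may treat zeros or offsets differently.

```julia
using DataFrames

# Hypothetical stand-in for illustration; assumes strictly positive values
# in all columns from startCol onward.
function log2_sketch(df::DataFrame; startCol::Int = 2)
    out = copy(df)
    for col in names(out)[startCol:end]
        out[!, col] = log2.(out[!, col])
    end
    return out
end

toy = DataFrame(SampleID = ["s1", "s2"],
                comp1 = [1.0, 8.0],
                comp2 = [0.5, 4.0])
toyLog = log2_sketch(toy)
```

On the log2 scale the 8-fold difference between the two `comp1` values becomes a difference of 3 units, so large fold changes are compressed and treated symmetrically with small ones.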

In [16]:
df = log2tx(df, startCol = 2);

In [17]:
first(df)

Row,SampleID,comp0588,comp0001,comp0002,comp0003,comp0014,comp0016,comp0034,comp0036,comp0037,comp0038,comp0040,comp0048,comp0049,comp0050,comp0051,comp0057,comp0070,comp0071,comp0090,comp0091,comp0109,comp0110,comp0112,comp0113,comp0114,comp0115,comp0116,comp0117,comp0118,comp0122,comp0123,comp0124,comp0125,comp0126,comp0131,comp0132,comp0134,comp0140,comp0144,comp0147,comp0148,comp0150,comp0153,comp0156,comp0157,comp0158,comp0160,comp0161,comp0162,comp0174,comp0180,comp0184,comp0185,comp0192,comp0193,comp0195,comp0202,comp0203,comp0204,comp0206,comp0208,comp0209,comp0214,comp0217,comp0222,comp0227,comp0228,comp0229,comp0233,comp0234,comp0238,comp0239,comp0240,comp0244,comp0245,comp0246,comp0252,comp0253,comp0254,comp0255,comp0259,comp0260,comp0262,comp0268,comp0269,comp0271,comp0273,comp0274,comp0278,comp0281,comp0282,comp0284,comp0285,comp0288,comp0289,comp0290,comp0291,comp0297,comp0298,⋯
Unnamed: 0_level_1,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,NJHC-01517,0.0,0.0373768,0.00206656,0.00116913,0.137739,0.00588922,0.0071546,0.000116814,0.000145573,0.000516897,0.000348006,0.0248131,0.000455229,0.0060276,0.00259249,0.000648517,0.00398141,0.00672786,0.00016807,0.00124711,0.00719697,0.0483302,0.0120918,0.00737435,0.000941944,0.000396282,0.00337326,0.00156687,0.00259645,0.000331377,0.000270747,0.0304043,0.000132622,0.00423969,0.170115,0.00747499,0.00457436,0.00160028,0.000442446,0.000242977,0.000462785,0.00250707,0.134793,0.000922225,0.000789232,0.0266301,0.00245236,0.000247344,0.0104925,0.000106631,0.027733,0.00107934,0.00196985,0.000130318,0.0010145,0.0133458,0.000882351,0.0417764,0.0537596,0.000324172,0.00151231,0.00209107,0.0127671,0.001996,0.000590868,0.00178039,0.00123887,0.00151059,0.00527466,0.00032863,0.000200496,0.0627678,0.000765749,0.00152187,0.0136611,0.0416024,0.000618873,0.00471194,0.00858642,0.0223975,0.00168698,0.00227623,0.000704866,0.000315561,0.00320773,0.000548317,0.00154223,0.0102585,0.00732224,0.000740931,0.000428641,0.000254889,0.000475495,0.0618251,8.02272e-05,0.00603715,9.515e-05,0.00859314,0.000335704,⋯


## Save pretreatments

In [18]:
fileMeta = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","inl2_Meta.csv");
df |> CSV.write(fileMeta)

"C:\\git\\gregfa\\Metabolomic\\COPDstudy\\notebooks\\preprocessing\\..\\..\\data\\processed\\SPIROMICS\\inl2_Meta.csv"

In [19]:
versioninfo()

Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 4 × Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores


In [20]:
R"""
sessionInfo()
"""

RObject{VecSxp}
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] impute_1.70.0

loaded via a namespace (and not attached):
[1] compiler_4.2.1
