# Preprocessing Step
---

This notebook carries out the preprocessing steps for the metabolomics data:
- Imputation
- Normalization
- Log2 Transformation

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall
    using Pkg # needed for Pkg.build
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-40~1.4" # from R.home() in R
    Pkg.build("RCall")
end

In [2]:
using CSV, DataFrames, Missings
using RCall

In [3]:
# using CSV, DataFrames, Missings, CategoricalArrays
# using StatsBase, Statistics, MultivariateStats#, RCall
# using FreqTables, Plots, StatsPlots

### Ext. Functions

In [4]:
include(joinpath(@__DIR__,"..","..","src","preprocessing.jl" ));
include(joinpath(@__DIR__,"..","..","src","wrangle_utils.jl" ));

### Load data

In [5]:
# Get reference metabolite file
fileRef = joinpath(@__DIR__,"..","..","data","processed","COPDGene","refMeta.csv");
dfRef = CSV.read(fileRef, DataFrame);

In [6]:
# Get negative metabolite file
dfNegMetabo = readCOPDdata(realpath(joinpath(@__DIR__,"..","..","data","processed","COPDGene","negMeta.csv"))) 
println("The negative metabolite dataset contains $(size(dfNegMetabo, 1)) samples and $(size(dfNegMetabo, 2)-1) metabolites.")

The negative metabolite dataset contains 784 samples and 739 metabolites.


In [7]:
# Get polar metabolite file
dfPolarMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","COPDGene","polarMeta.csv"));
println("The polar metabolite dataset contains $(size(dfPolarMetabo, 1)) samples and $(size(dfPolarMetabo, 2)-1) metabolites.")

The polar metabolite dataset contains 784 samples and 83 metabolites.


In [8]:
# Get positive early metabolite file
dfPosEarlyMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","COPDGene","posEarlyMeta.csv"));
println("The positive early metabolite dataset contains $(size(dfPosEarlyMetabo, 1)) samples and $(size(dfPosEarlyMetabo, 2)-1) metabolites.")

The positive early metabolite dataset contains 784 samples and 319 metabolites.


In [9]:
# Get positive late metabolite file
dfPosLateMetabo = readCOPDdata(joinpath(@__DIR__,"..","..","data","processed","COPDGene","posLateMeta.csv"));
println("The positive late metabolite dataset contains $(size(dfPosLateMetabo, 1)) samples and $(size(dfPosLateMetabo, 2)-1) metabolites.")

The positive late metabolite dataset contains 784 samples and 251 metabolites.


### Join dataframes

The dataframe `dfPolarMetabo` contains integer values (`Int64`) instead of floating-point values, which produces an error during the imputation. We need to convert the values to `Float64`.

In [10]:
typeof(dfPolarMetabo[:, 2])

Vector{Union{Missing, Int64}} (alias for Array{Union{Missing, Int64}, 1})

In [11]:
dfPolarMetabo[!,2:end] .= convert.(Union{Missing, Float64}, dfPolarMetabo[:, 2:end]);
typeof(dfPolarMetabo[:, 2])

Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})

In [12]:
df = leftjoin(dfNegMetabo, dfPolarMetabo, on = :SampleID)
leftjoin!(df, dfPosEarlyMetabo, on = :SampleID)
leftjoin!(df, dfPosLateMetabo, on = :SampleID)
size(df)

(784, 1393)

## Imputation

In [13]:
df = imputeCOPD(df, dfRef);

The metabolite cotinine contains 28.7% missing samples.
We dropped 393 metabolites due to a missingness greater than 20%.
We preserved 999 metabolites.
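
The log above reports that metabolites with more than 20% missing values were dropped. The internals of `imputeCOPD` are not shown in this notebook, so the following is only a minimal sketch of such a missingness filter on a toy matrix. The names `missing_fraction` and `drop_sparse_columns` are hypothetical, and the imputation step itself is not reproduced here.

```julia
# Hypothetical helper: fraction of missing entries in each column
# of a samples × metabolites matrix.
missing_fraction(X) = vec(sum(ismissing.(X), dims = 1)) ./ size(X, 1)

# Hypothetical helper: keep only the columns whose missing fraction
# does not exceed `threshold` (assumed here to be 20%).
function drop_sparse_columns(X, colnames; threshold = 0.20)
    keep = missing_fraction(X) .<= threshold
    return X[:, keep], colnames[keep]
end

# Toy data: column "b" has 2 of 5 values missing (40%), column "c" has 1 of 5 (20%).
X = Union{Missing, Float64}[1.0 missing 3.0;
                            2.0 missing 1.0;
                            3.0 5.0     missing;
                            4.0 6.0     2.0;
                            5.0 7.0     4.0]

Xkept, kept = drop_sparse_columns(X, ["a", "b", "c"])
println("Kept columns: ", kept)
```

In this toy example, column `b` is dropped because its missingness is greater than the threshold, while `a` and `c` are retained for imputation.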


## Normalization
---

### Probabilistic Quotient Normalization

> 1. Perform an integral normalization (typically a constant
integral of 100 is used).
> 2. Choose/calculate the reference spectrum (the best approach
is the calculation of the median spectrum of control samples).
> 3. Calculate the quotients of all variables of interest of the test
spectrum with those of the reference spectrum.
> 4. Calculate the median of these quotients.
> 5. Divide all variables of the test spectrum by this median.
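
The five steps quoted above can be sketched on a small samples × metabolites matrix. This is an illustration under stated assumptions, not the `pqnorm` function from `src/preprocessing.jl`: the name `pqn_sketch` and the keyword `target` are hypothetical, the reference spectrum is taken as the median over all samples rather than over control samples, and the input is assumed to be complete (already imputed) with nonzero reference values.

```julia
using Statistics

# Hypothetical sketch of probabilistic quotient normalization.
# Rows are samples, columns are metabolites.
function pqn_sketch(X::AbstractMatrix{<:Real}; target = 100.0)
    # Step 1: integral normalization, scaling each sample to a constant total.
    Xint = X .* (target ./ sum(X, dims = 2))
    # Step 2: reference spectrum (assumption: median over all samples).
    ref = median(Xint, dims = 1)
    # Step 3: quotients of every sample against the reference spectrum.
    Q = Xint ./ ref
    # Step 4: median quotient for each sample.
    d = median(Q, dims = 2)
    # Step 5: divide each sample by its median quotient.
    return Xint ./ d
end

# Toy data: sample 2 is sample 1 at twice the concentration;
# sample 3 matches sample 1 except for one elevated metabolite.
X = [1.0 2.0 3.0 4.0;
     2.0 4.0 6.0 8.0;
     1.0 2.0 3.0 8.0]

Xn = pqn_sketch(X)
```

After normalization, samples 1 and 2 coincide because the dilution factor is removed. In sample 3 the three unchanged metabolites line up with the reference, so only the elevated metabolite still stands out. That is the motivation for dividing by the median quotient rather than by the total signal alone.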


In [14]:
df[!,2:end] .= convert.(Union{Missing, Float64}, df[:, 2:end]);
df = pqnorm(df, startCol = 2);

## Transformation
---

A simple and widely used transformation to make data more symmetric and homoscedastic is the log-transformation.

In [15]:
df = log2tx(df, startCol = 2);
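
As a rough illustration of the transformation, a log2 transform can be written as an element-wise operation. The function `log2_sketch` and its `offset` keyword are hypothetical. Whether `log2tx` adds an offset before taking the logarithm is not shown in this notebook, so the default of `0.0` is an assumption.

```julia
# Hypothetical sketch: element-wise log2 with an optional offset
# (the offset guards against log2(0) when zeros are present).
log2_sketch(X; offset = 0.0) = log2.(X .+ offset)

# Each value doubles the previous one, so the raw gaps grow (1, 2, 4)
# while the gaps on the log2 scale are constant (1, 1, 1).
x = [1.0, 2.0, 4.0, 8.0]
y = log2_sketch(x)
println(y)
```

Fold changes of equal size become equally spaced on the log2 scale, which is why the transformation tends to reduce right skew and stabilize variance in intensity data.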

In [16]:
first(df)

Row,SampleID,comp553,comp38768,comp38296,comp62533,comp48762,comp34404,comp32391,comp20675,comp34400,comp33971,comp33972,comp32497,comp38395,comp37752,comp38168,comp39609,comp34214,comp34397,comp62558,comp62559,comp62566,comp62564,comp62562,comp57783,comp27447,comp54885,comp36594,comp30460,comp34389,comp21184,comp45968,comp36602,comp45970,comp35305,comp34437,comp19324,comp57547,comp62805,comp46115,comp43266,comp52602,comp36746,comp61700,comp57663,comp42489,comp18281,comp52916,comp61698,comp22036,comp35675,comp17945,comp48445,comp62520,comp57655,comp35253,comp41220,comp35635,comp32197,comp53026,comp34399,comp48693,comp54805,comp43507,comp61871,comp31787,comp62796,comp62863,comp32397,comp22053,comp53230,comp39600,comp32457,comp21158,comp22001,comp61843,comp48448,comp31943,comp52938,comp27672,comp48763,comp48752,comp46165,comp46164,comp44526,comp15676,comp32445,comp15749,comp1558,comp44620,comp37181,comp36099,comp48441,comp37445,comp35527,comp541,comp1669,comp48457,comp22116,comp43592,⋯
Unnamed: 0_level_1,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,NJHC-00611,0.0,0.175727,0.00981606,0.00216631,0.000143143,0.000757537,0.000366451,0.206559,0.00610898,0.279433,0.0701939,0.0261717,0.000660977,0.00868807,0.00315498,0.00875556,0.0117311,0.000250954,0.000376762,0.000312595,0.00211941,0.000631249,0.00138704,0.00180572,0.00183406,0.000732361,0.00382076,0.00248375,0.00373848,0.0147009,0.00151826,0.00337741,0.000791023,0.00134178,0.000560886,0.0112193,0.00126629,0.00972796,0.00165068,0.00078933,0.000691694,0.00190041,0.000634838,0.000967366,0.00919252,0.238593,0.000839806,0.00240433,0.00187065,0.0755403,0.021469,0.000412431,0.000209064,0.000779322,0.0474,0.00279417,0.00288782,0.0594119,0.0014656,0.000850798,0.000114371,0.00996112,0.000207818,0.102752,0.562724,0.000346287,0.00246062,0.010895,0.00740511,0.00434027,0.00022776,0.0073275,0.00648744,0.00463509,0.0241953,0.000732125,0.000770882,0.00366298,0.0555253,0.000768289,0.000195272,0.000848938,0.000223549,0.0651193,0.10632,0.0198293,0.00261911,0.00219569,0.000158985,0.0155482,0.00198651,0.0186191,0.000162417,0.00500991,0.00175287,0.00211678,0.000702884,0.14612,0.000515757,⋯


## Save pretreatments

In [17]:
fileMeta = joinpath(@__DIR__,"..","..","data","processed","COPDGene","inl2_Meta.csv");
df |> CSV.write(fileMeta)

"C:\\git\\gregfa\\Metabolomic\\COPDstudy\\notebooks\\preprocessing\\..\\..\\data\\processed\\COPDGene\\inl2_Meta.csv"

In [18]:
versioninfo()

Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 4 × Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores


In [19]:
R"""
sessionInfo()
"""

RObject{VecSxp}
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] impute_1.70.0

loaded via a namespace (and not attached):
[1] compiler_4.2.1
