# <p style="text-align: justify;"><div class="alert alert-info" role="alert">Transforming physical predictors for multiple linear regression</div></p>

## `Ana González Guerra` 

### ` Student of the master in Data Science at the University of Cantabria`  

## Index<a class="anchor" id="index"></a>
* [Loading physical data](#1)
* [Analysis of predictors with zero or close to zero variance](#2)
* [Analysis of linear dependencies between predictors using QR matrix decomposition](#3)
* [Centering and scaling](#4)
* [References](#ref)

## Loading physical data <a class="anchor" id="1"></a>

* [Back to the index](#index)

In [2]:
# Read the data, using the first column of the file as row names
handle_physical <- read.csv('handle_physical_fusion_31_03_20.csv', row.names = 1)
head(handle_physical)
names(handle_physical)

| | W1Fc | W1Ox | W1Lac | W1tFc | W1tOx | W1tLac | W1tBorg | W1t5Fc | W1t5Ox | W1t5Lac | ... | PerCue | PerCin | FM1 | FM2 | X.FFM1 | X.FFM2 | FFM1 | FFM2 | FFMI1 | FFMI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 92 | 1.9 | 133 | 95 | 2.3 | 8 | 59 | 90 | 4.7 | ... | 31.0 | 65.0 | 7.883768 | 8.432778 | 83.04566 | 81.86499 | 38.61623 | 38.06722 | 16.71409 | 16.47646 |
| 4 | 84 | 90 | 3.4 | 151 | 82 | 4.3 | 8 | 74 | 97 | 6.3 | ... | 36.0 | 67.5 | 6.914187 | 10.373099 | 87.86985 | 81.80158 | 50.08581 | 46.6269 | 17.33073 | 16.13388 |
| 7 | 136 | 98 | 1.3 | 164 | 97 | 2.8 | 10 | 121 | 97 | 9.0 | ... | 31.0 | 65.5 | 4.91771 | 10.003592 | 90.45105 | 80.57555 | 46.58229 | 41.49641 | 17.74969 | 15.81177 |
| 8 | 120 | 97 | 1.8 | 150 | 76 | 4.2 | 8 | 141 | 98 | 9.7 | ... | 35.5 | 68.5 | 11.222758 | 14.984471 | 81.60204 | 75.43529 | 49.77724 | 46.01553 | 18.5073 | 17.10869 |
| 9 | 62 | 97 | 2.8 | 172 | 91 | 4.9 | 9 | 119 | 88 | 10.8 | ... | 33.5 | 69.8 | 15.642411 | 16.063989 | 72.55717 | 71.81756 | 41.35759 | 40.93601 | 16.56689 | 16.39802 |
| 11 | 103 | 99 | 2.4 | 174 | 96 | 3.6 | 10 | 115 | 99 | 9.3 | ... | 35.0 | 69.0 | 8.131694 | 11.411066 | 85.73387 | 79.98059 | 48.86831 | 45.58893 | 17.94979 | 16.74525 |


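R does not allow the `%` symbol in syntactically valid column names, so `read.csv` (with its default `check.names = TRUE`) converts it when loading the file: a name that started with `%` now starts with `X.` (for example, `X.FFM1`).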
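
As a small illustration of that renaming, the sketch below uses a made-up one-row CSV and base R only. The `id` column, the value `83.0` and the assumption that the original header was `%FFM1` are hypothetical and are not taken from the data file.

```r
# Hypothetical one-row CSV whose header contains a '%' (illustration only)
csv_text <- "id,%FFM1\n0,83.0"

# Default behaviour: check.names = TRUE passes the headers through make.names()
fixed <- read.csv(text = csv_text, row.names = 1)
names(fixed)

# With check.names = FALSE the header is kept exactly as written
raw <- read.csv(text = csv_text, row.names = 1, check.names = FALSE)
names(raw)

# make.names() is the base R function that performs the conversion
converted <- make.names("%FFM1")
converted
```

Keeping the default conversion is usually the more convenient choice, because a column called `X.FFM1` can be accessed with `$` without backticks.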

The regression model will use W1_2MedDifWRelFFMI as its target variable. To avoid interference from the physical predictors related to FFM1 and similar variables, we remove every physical predictor whose name contains the substring "FM":

In [3]:
handle_physical <- handle_physical[!grepl('FM', names(handle_physical))] # drop the variables whose names contain 'FM'
names(handle_physical)
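
To make the effect of the pattern explicit, here is a minimal sketch on a handful of column names taken from the `head()` output of the first cell. The vector `example_names` is only an illustrative subset, not the full list of predictors.

```r
# A few column names copied from the head() output (illustrative subset)
example_names <- c("W1Fc", "PerCue", "PerCin", "FM1", "FM2",
                   "X.FFM1", "X.FFM2", "FFM1", "FFM2", "FFMI1", "FFMI2")

# grepl() returns TRUE for every name that contains the substring "FM"
has_fm <- grepl("FM", example_names, fixed = TRUE)

# Names that would be dropped and names that would be kept
dropped <- example_names[has_fm]
kept <- example_names[!has_fm]
dropped
kept
```

Note that the substring match also catches `FFM*`, `FFMI*` and `X.FFM*`, which is the intended behaviour here.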

## Analysis of predictors with zero or close to zero variance<a class="anchor" id="2"></a>

* [Back to the index](#index)

A predictor is considered to have near-zero variance when two conditions hold: its frequency ratio (the frequency of the most common value divided by the frequency of the second most common one) exceeds a preset threshold (95/5 by default), and its percentage of unique values is below the corresponding threshold (10 by default). By default, the `nearZeroVar` function returns the positions of the predictors flagged as problematic. With the argument `saveMetrics = TRUE`, it instead returns the metrics computed for each predictor; the problematic ones are those with a `TRUE` value in the `nzv` (near zero variance) column [[1]](#kuhn2019).
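
The two metrics can be reproduced by hand. The following is a minimal base R sketch on a made-up vector `x`; it mirrors the rule described above but is not caret's own implementation, and the variable names are illustrative.

```r
# Made-up predictor: 97 zeros and three other values (illustration only)
x <- c(rep(0, 97), 1, 2, 3)

# Frequency ratio: most common value vs. second most common value
counts <- sort(table(x), decreasing = TRUE)
freq_ratio <- as.numeric(counts[1] / counts[2])

# Percentage of distinct values over the number of observations
percent_unique <- 100 * length(unique(x)) / length(x)

# Default thresholds described in the text: 95/5 and 10
is_nzv <- freq_ratio > 95 / 5 && percent_unique <= 10

freq_ratio
percent_unique
is_nzv
```

Here the frequency ratio is 97 and only 4% of the values are distinct, so this toy predictor would be flagged.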

In [4]:
library(caret)

"package 'caret' was built under R version 3.6.2"
Loading required package: lattice
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.6.3"

In [5]:
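# Compute the variance metrics for every predictor
handle_physical_nzv <- nearZeroVar(handle_physical, saveMetrics = TRUE)

# Show only the predictors flagged as near-zero variance
handle_physical_nzv[handle_physical_nzv$nzv, ]

# Keep the names of the problematic predictors
physical_problematicas <- rownames(handle_physical_nzv[handle_physical_nzv$nzv, ])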

freqRatio,percentUnique,zeroVar,nzv


The result is empty, so none of the physical predictors shows a near-zero-variance problem.

## Analysis of linear dependencies between predictors using QR matrix decomposition<a class="anchor" id="3"></a>

* [Back to the index](#index)

The `findLinearCombos()` function uses the QR decomposition of a matrix to enumerate sets of linear combinations, if any exist. For each linear combination, it incrementally removes columns from the matrix and tests whether the dependencies have been resolved. `findLinearCombos()` also returns a vector with the column positions that can be removed to eliminate the linear dependencies [[2]](#kuhn2019b).
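
The idea behind the QR check can be sketched with base R alone. The example below builds a made-up matrix `toy` in which the column `total` is the exact sum of the other two; it only shows how a rank deficiency reveals a dependency and is not caret's implementation.

```r
# Made-up data: 'total' is an exact linear combination of 'a' and 'b'
set.seed(1)
a <- rnorm(10)
b <- rnorm(10)
toy <- cbind(a = a, b = b, total = a + b)

# QR decomposition: the rank is lower than the number of columns
qr_toy <- qr(toy)
toy_rank <- qr_toy$rank
toy_rank
ncol(toy)

# Dropping the dependent column restores full column rank
reduced_rank <- qr(toy[, c("a", "b")])$rank
reduced_rank
```

A rank of 2 for a three-column matrix signals one linear dependency, which disappears once `total` is removed.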


In [6]:
comboInfo <- findLinearCombos(handle_physical)
comboInfo

In [7]:
comb_lineal <- comboInfo$linearCombos
names(comb_lineal) <- 'cb'

In [8]:
comb_lineal$cb

There is a linear combination among the following physical predictors:

In [9]:
head(handle_physical[comb_lineal$cb])

| | PliegueMus | Sumat | PlieguePec | PliegueAx | PliegueTri | PliegueSub | PliegueAbd | PliegueSup |
|---|---|---|---|---|---|---|---|---|
| 0 | 30.3 | 74.3 | 4.05 | 4.5 | 18.35 | 6.85 | 6.0 | 4.25 |
| 4 | 17.26667 | 46.76667 | 3.2 | 3.633333 | 7.866667 | 5.3 | 5.533333 | 3.966667 |
| 7 | 11.9 | 34.4 | 2.85 | 2.6 | 6.1 | 3.8 | 4.6 | 2.55 |
| 8 | 31.15 | 83.2 | 4.2 | 5.75 | 14.4 | 9.15 | 11.1 | 7.45 |
| 9 | 34.4 | 146.4 | 15.2 | 13.3 | 23.0 | 18.0 | 26.5 | 16.0 |
| 11 | 17.2 | 62.6 | 4.4 | 4.7 | 15.45 | 8.15 | 7.3 | 5.4 |


The predictor that the function suggests removing is 'PliegueMus':

In [11]:
head(handle_physical[comboInfo$remove])

| | PliegueMus |
|---|---|
| 0 | 30.3 |
| 4 | 17.26667 |
| 7 | 11.9 |
| 8 | 31.15 |
| 9 | 34.4 |
| 11 | 17.2 |


By default, the function always proposes removing the first predictor of the combination. However, 'Sumat' is itself a combination of the remaining skinfold variables (it is the sum of all of them). To preserve the biological meaning, it therefore seems more reasonable to remove 'Sumat' than one of the variables that contribute to it.
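
The claim that 'Sumat' is the sum of the other skinfold variables can be checked on the six rows shown in the `head()` output above. This is a minimal sketch on those rows only; the data frame `folds` is rebuilt by hand for illustration and a small tolerance is used because the printed values are rounded.

```r
# The six rows of the skinfold columns, copied from the head() output above
folds <- data.frame(
  PliegueMus = c(30.3, 17.26667, 11.9, 31.15, 34.4, 17.2),
  Sumat      = c(74.3, 46.76667, 34.4, 83.2, 146.4, 62.6),
  PlieguePec = c(4.05, 3.2, 2.85, 4.2, 15.2, 4.4),
  PliegueAx  = c(4.5, 3.633333, 2.6, 5.75, 13.3, 4.7),
  PliegueTri = c(18.35, 7.866667, 6.1, 14.4, 23.0, 15.45),
  PliegueSub = c(6.85, 5.3, 3.8, 9.15, 18.0, 8.15),
  PliegueAbd = c(6.0, 5.533333, 4.6, 11.1, 26.5, 7.3),
  PliegueSup = c(4.25, 3.966667, 2.55, 7.45, 16.0, 5.4),
  row.names  = c("0", "4", "7", "8", "9", "11")
)

# Sum every skinfold column except 'Sumat' and compare with 'Sumat'
parts <- setdiff(names(folds), "Sumat")
recomputed <- rowSums(folds[parts])
sum_matches <- isTRUE(all.equal(unname(recomputed), folds$Sumat, tolerance = 1e-6))
sum_matches
```

On these rows the recomputed sum agrees with 'Sumat', which is consistent with treating it as the redundant variable.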

Let's check what happens if we eliminate 'Sumat':

In [11]:
comboInfo_2 <- findLinearCombos(handle_physical[-comb_lineal$cb[2]])
comboInfo_2

Without 'Sumat' there are no linear combinations left among the physical predictors, so we now remove it:

In [12]:
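# Remove 'Sumat' (second column of the combination) rather than the default suggestion in comboInfo$remove
handle_physical <- handle_physical[, -comb_lineal$cb[2]]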

## Centering and scaling <a class="anchor" id ="4"></a>

* [Back to the index](#index)

In [13]:
summary(handle_physical)

      W1Fc             W1Ox           W1Lac           W1tFc      
 Min.   : 62.00   Min.   :69.00   Min.   :0.800   Min.   : 88.0  
 1st Qu.: 90.25   1st Qu.:91.00   1st Qu.:1.500   1st Qu.:157.8  
 Median :105.50   Median :97.00   Median :2.000   Median :171.5  
 Mean   :106.87   Mean   :94.34   Mean   :2.123   Mean   :164.2  
 3rd Qu.:120.75   3rd Qu.:98.00   3rd Qu.:2.675   3rd Qu.:176.8  
 Max.   :162.00   Max.   :99.00   Max.   :4.500   Max.   :194.0  
     W1tOx           W1tLac         W1tBorg           W1t5Fc          W1t5Ox  
 Min.   :61.00   Min.   :1.700   Min.   : 7.000   Min.   : 59.0   Min.   :51  
 1st Qu.:87.00   1st Qu.:3.400   1st Qu.: 8.250   1st Qu.:111.5   1st Qu.:95  
 Median :92.00   Median :3.900   Median : 9.000   Median :122.5   Median :98  
 Mean   :90.61   Mean   :3.987   Mean   : 9.171   Mean   :120.8   Mean   :95  
 3rd Qu.:97.00   3rd Qu.:4.675   3rd Qu.:10.000   3rd Qu.:133.0   3rd Qu.:98  
 Max.   :99.00   Max.   :6.700   Max.   :10.000   Max.   :155.0 

This summary shows that the physical predictors span very different ranges, so they need to be centered and scaled.
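
As a reminder of what centering and scaling does, the sketch below standardizes the first six `W1Fc` values shown in the first `head()` output. Because only six rows are used, the resulting numbers differ from those obtained on the full data set; the variable names are illustrative.

```r
# First six W1Fc values, copied from the head() output of the first cell
x <- c(109, 84, 136, 120, 62, 103)

# Manual standardization: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() with center = TRUE and scale = TRUE performs the same operation
scaled <- as.numeric(scale(x, center = TRUE, scale = TRUE))

# Both versions agree; the result has mean 0 and standard deviation 1
same_result <- isTRUE(all.equal(manual, scaled))
same_result
mean(scaled)
sd(scaled)
```

After this transformation every predictor is expressed in units of its own standard deviation, so no predictor dominates simply because of its measurement scale.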

In [14]:
handle_physical <- scale(handle_physical, center = TRUE, scale = TRUE)
summary(handle_physical)

      W1Fc               W1Ox             W1Lac             W1tFc        
 Min.   :-2.03714   Min.   :-3.9100   Min.   :-1.6363   Min.   :-3.5572  
 1st Qu.:-0.75444   1st Qu.:-0.5156   1st Qu.:-0.7706   1st Qu.:-0.3019  
 Median :-0.06202   Median : 0.4102   Median :-0.1523   Median : 0.3398  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.63041   3rd Qu.: 0.5645   3rd Qu.: 0.6824   3rd Qu.: 0.5848  
 Max.   : 2.50337   Max.   : 0.7188   Max.   : 2.9393   Max.   : 1.3899  
     W1tOx             W1tLac           W1tBorg            W1t5Fc        
 Min.   :-3.7007   Min.   :-2.1444   Min.   :-2.2771   Min.   :-3.38052  
 1st Qu.:-0.4512   1st Qu.:-0.5501   1st Qu.:-0.9658   1st Qu.:-0.50724  
 Median : 0.1738   Median :-0.0812   Median :-0.1791   Median : 0.09477  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.7987   3rd Qu.: 0.6456   3rd Qu.: 0.8699   3rd Qu.: 0.66943  
 Max.   : 1.0486   Max.   : 2.5447   M

We export the transformed data to a .csv file.

In [15]:
write.csv(handle_physical, file = "handle_physical_transformed_for_multiple_linear_regression.csv", row.names = FALSE)

## References <a class="anchor" id="ref"></a>

* [Back to the index](#index)

[1] Kuhn, M. (2019) Zero-and near zero-variance predictors. Available at: https://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors (Accessed: April 20, 2020).<a class="anchor" id="kuhn2019"></a>

[2] Kuhn, M. (2019) Linear dependencies. Available at: https://topepo.github.io/caret/pre-processing.html#linear-dependencies (Accessed: April 20, 2020).<a class="anchor" id="kuhn2019b"></a>