# Exploratory Analysis & Data Preprocessing

# 1. Data Import

In [2]:
library(datasets)

In [3]:
data(iris)

In [4]:
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


# 2. Exploratory analysis

## 2.1. Data Summary

Motivation:

* To better understand the data. 

* To identify the data centrality and dispersion characteristics (median, max, min, quantiles, outliers, variance)

Descriptive measures:

* Centrality: mean, median, mode

* Dispersion: variance, standard deviation, quartiles

Mean and median: measures used to identify the asymmetry of the series.

* $median < mean$: positive skewness

* $median > mean$: negative skewness
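The rule above can be checked directly on the numeric columns. The following is a minimal sketch; the names `num` and `centers` are illustrative and not part of the original notebook.

```r
# Compare mean and median of each numeric iris column
num <- iris[, 1:4]
centers <- data.frame(mean = sapply(num, mean), median = sapply(num, median))

# Apply the rule: median < mean suggests positive skew, median > mean negative skew
centers$skew <- ifelse(centers$median < centers$mean, "positive",
                ifelse(centers$median > centers$mean, "negative", "none"))
centers
```

This is only a rough indicator of asymmetry; a histogram or density plot gives a fuller picture.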

In [5]:
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Another way to view the statistical summary.

In [10]:
install.packages("https://cran.r-project.org/src/contrib/Archive/rlang/rlang_0.4.4.tar.gz", repos = NULL, type = "source")

"installation of package 'C:/Users/Gea/AppData/Local/Temp/RtmpkzGfGJ/downloaded_packages/rlang_0.4.4.tar.gz' had non-zero exit status"

In [9]:
library(Hmisc)
library(ggplot2)

"package 'Hmisc' was built under R version 3.6.3"Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.6.3"Error: package or namespace load failed for 'ggplot2' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 there is no package called 'rlang'


ERROR: Error: package 'ggplot2' could not be loaded


In [None]:
describe(iris)

# missing => missing values
# distinct => distinct values
# .05, .10, .25, .75, .90, .95 => percentiles
# .50 => median
# Gmd => Gini's mean difference (a dispersion measure: the mean absolute difference between all pairs of observations)
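To make the Gmd column less opaque, here is a minimal base-R sketch of the mean absolute difference between all pairs of observations. The helper name `gmd` is hypothetical; it is meant to illustrate the definition, not to reproduce the Hmisc implementation.

```r
# Mean absolute difference over all ordered pairs (i != j)
gmd <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# One value per numeric column
sapply(iris[, 1:4], gmd)
```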

Data visualization:

* Provide insight into an information space.

* Provide a qualitative overview of large data sets.

* Search for patterns, trends, structure, irregularities, relationships among data.

* Help find interesting regions and suitable parameters for further quantitative analysis.

## 2.2. Histogram analysis

**Histogram:** displays tabulated frequencies. It shows what proportion of cases falls into each category (bin).

Histograms may tell more than boxplots. Two variables can have similar boxplots and still have different data distributions.
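The tabulated frequencies behind a histogram can also be inspected as numbers. This is a small sketch using base R's `hist()` with plotting turned off; the variable names `h` and `prop` are illustrative.

```r
# Tabulate Sepal.Length into bins without drawing the plot
# (breaks = 20 is only a suggestion; R may choose a nearby "pretty" number of bins)
h <- hist(iris$Sepal.Length, breaks = 20, plot = FALSE)

# Proportion of cases that fall into each bin
prop <- h$counts / sum(h$counts)
data.frame(bin_start = head(h$breaks, -1), count = h$counts, proportion = round(prop, 3))
```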

In [None]:
# Packages
library(gridExtra)
library(ggplot2)
library(dplyr)

In [None]:
histA1 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins=20)

histB1 <- ggplot(iris, aes(x = Sepal.Width)) + geom_histogram(bins=20)

histC1 <- ggplot(iris, aes(x = Petal.Length)) + geom_histogram(bins=20)

histD1 <- ggplot(iris, aes(x = Petal.Width)) + geom_histogram(bins=20)

grid.arrange(histA1, histB1, histC1, histD1, ncol=2, nrow=2)

**Label analysis**

In [None]:
histA2 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins=20) + facet_grid(~ Species)

histB2 <- ggplot(iris, aes(x = Sepal.Width)) + geom_histogram(bins=20) + facet_grid(~ Species)

histC2 <- ggplot(iris, aes(x = Petal.Length)) + geom_histogram(bins=20) + facet_grid(~ Species)

histD2 <- ggplot(iris, aes(x = Petal.Width)) + geom_histogram(bins=20) + facet_grid(~ Species)

grid.arrange(histA2, histB2, histC2, histD2, ncol=2, nrow=2)

## 2.3. Box-plot analysis

**Boxplot:** shows the data dispersion (Q1, Q2, Q3) and possible outliers. The height of the box (the interquartile range) indicates how spread out the data are.
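The numbers that a boxplot draws can be obtained with base R's `boxplot.stats()`. A minimal sketch for one column (the name `bs` is illustrative):

```r
# Five-number summary used by the boxplot and the points flagged as outliers
bs <- boxplot.stats(iris$Sepal.Width)

bs$stats  # lower whisker, lower hinge (about Q1), median, upper hinge (about Q3), upper whisker
bs$out    # values beyond the whiskers (possible outliers)
```

Note that `boxplot.stats()` uses hinges, which can differ slightly from the quartiles returned by `quantile()`.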

In [None]:
boxplotA1 <- ggplot(iris, aes(y = Sepal.Length)) + geom_boxplot()

boxplotB1 <- ggplot(iris, aes(y = Sepal.Width)) + geom_boxplot()

boxplotC1 <- ggplot(iris, aes(y = Petal.Length)) + geom_boxplot()

boxplotD1 <- ggplot(iris, aes(y = Petal.Width)) + geom_boxplot()

grid.arrange(boxplotA1, boxplotB1, boxplotC1, boxplotD1, ncol=2, nrow=2)

**Label analysis**

In [None]:
boxplotA2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

boxplotB2 <- ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_boxplot()

boxplotC2 <- ggplot(iris, aes(x = Species, y = Petal.Length)) + geom_boxplot()

boxplotD2 <- ggplot(iris, aes(x = Species, y = Petal.Width)) + geom_boxplot()

grid.arrange(boxplotA2, boxplotB2, boxplotC2, boxplotD2, ncol=2, nrow=2)

## 2.4. Density distribution

**Density distribution:** shows the probability density function of a variable.
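As a rough illustration of what a density estimate is, the sketch below uses base R's `density()` (a kernel density estimate with default settings) and checks numerically that the area under the curve is close to 1. The names `d` and `area` are illustrative.

```r
# Kernel density estimate of Petal.Length
d <- density(iris$Petal.Length)

# Trapezoidal approximation of the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```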

In [None]:
denA1 <- ggplot(iris, aes(x = Sepal.Length)) + geom_density()

denB1 <- ggplot(iris, aes(x = Sepal.Width)) + geom_density()

denC1 <- ggplot(iris, aes(x = Petal.Length)) + geom_density()

denD1 <- ggplot(iris, aes(x = Petal.Width)) + geom_density()

grid.arrange(denA1, denB1, denC1, denD1, ncol=2, nrow=2)

**Label analysis**

In [None]:
denA2 <- ggplot(iris, aes(x = Sepal.Length)) + geom_density() + facet_grid(~ Species)

denB2 <- ggplot(iris, aes(x = Sepal.Width)) + geom_density() + facet_grid(~ Species)

denC2 <- ggplot(iris, aes(x = Petal.Length)) + geom_density() + facet_grid(~ Species)

denD2 <- ggplot(iris, aes(x = Petal.Width)) + geom_density() + facet_grid(~ Species)

grid.arrange(denA2, denB2, denC2, denD2, ncol=2, nrow=2)

## 2.5. Density + Histogram

In [None]:
denhistA1 <- ggplot(iris, aes(x = Sepal.Length)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Sepal Length") +  ylab("Density")

denhistB1 <- ggplot(iris, aes(x = Sepal.Width)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Sepal Width") +  ylab("Density")

denhistC1 <- ggplot(iris, aes(x = Petal.Length)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Petal Length") +  ylab("Density")

denhistD1 <- ggplot(iris, aes(x = Petal.Width)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Petal Width") +  ylab("Density")

grid.arrange(denhistA1, denhistB1, denhistC1, denhistD1, ncol=2, nrow=2)

**Label analysis**

In [None]:
denhistA2 <- ggplot(iris, aes(x = Sepal.Length, fill=Species)) +
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Sepal Length") +  ylab("Density") +
             facet_grid(~ Species)

denhistB2 <- ggplot(iris, aes(x = Sepal.Width, fill=Species)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Sepal Width") +  ylab("Density") +
             facet_grid(~ Species)

denhistC2 <- ggplot(iris, aes(x = Petal.Length, fill=Species)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Petal Length") +  ylab("Density") +
             facet_grid(~ Species)

denhistD2 <- ggplot(iris, aes(x = Petal.Width, fill=Species)) + 
             geom_histogram(binwidth = 0.2, color = "black", fill = "steelblue", aes(y = ..density..)) +
             geom_density(stat = "density", alpha = I(0.2), fill = "blue") +
             xlab("Petal Width") +  ylab("Density") +
             facet_grid(~ Species)

grid.arrange(denhistA2, denhistB2, denhistC2, denhistD2, ncol=2, nrow=2)

## 2.6. Scatter plot

**Scatter plot:** provides a first look at bivariate data, revealing clusters of points and outliers.

Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.

In [None]:
scatterA <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Sepal Length", y = "Sepal Width") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

scatterB <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Sepal Length", y = "Petal Length") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

scatterC <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Width)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Sepal Length", y = "Petal Width") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

scatterD <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Length)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Sepal Width", y = "Petal Length") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

scatterE <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Sepal Width", y = "Petal Width") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

scatterF <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
            geom_point(aes(colour = Species)) +
            labs(x = "Petal Length", y = "Petal Width") + 
            scale_colour_brewer(palette = "Dark2", name = "Species") +
            theme_bw()

grid.arrange(scatterA, scatterB, scatterC, scatterD, scatterE, scatterF, ncol=2, nrow=3)

## 2.7. Scatterplot matrix

In [None]:
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07") 

pairs(iris[,1:4], pch = 19,  cex = 0.5, col = my_cols[iris$Species], lower.panel=NULL)

# 3. Preprocessing

Major tasks in data preprocessing

* Data cleaning: verifying that the data are in reasonable condition

* Data integration

* Data reduction: involve operations such as eliminating unneeded variables, transforming variables, and creating new variables. Make sure that you know what each variable means and whether it is sensible to include it in the model.

* Data transformation

The generalization ability of a classifier refers to its performance when classifying test patterns that were not used during training. Deficiencies in generalization can be attributed to the following factors: overfitting, overtraining and the curse of dimensionality.

* Overfitting: loss of generalization caused by the classifier adapting to the specific peculiarities of the training set. When the number of features is large, the classifier tends to adapt to specific details of the training set, which can reduce its accuracy on new data.

* Overtraining: occurs when the classifier is trained with a very large set of example patterns with little intra-class variation (in the case of statistical classifiers) or with too many training iterations (in the case of neural classifiers). As a consequence, the classifier's generalization capacity is reduced, and it makes many errors when classifying patterns that do not belong to the training set.

* Curse of dimensionality: problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. This implies that, for a given sample size, there is a maximum number of features beyond which the classifier's performance degrades rather than improves.
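One symptom of the curse of dimensionality can be shown with a small simulation: as the number of dimensions grows, pairwise distances between random points become nearly indistinguishable. This is only an illustrative sketch on uniform random data; the function name `contrast`, the sample size and the seed are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Relative contrast between the largest and smallest pairwise distance
contrast <- function(d, n = 200) {
  pts <- matrix(runif(n * d), nrow = n)   # n random points in d dimensions
  dists <- as.vector(dist(pts))           # all pairwise Euclidean distances
  (max(dists) - min(dists)) / min(dists)
}

dims <- c(2, 10, 100, 1000)
res <- sapply(dims, contrast)
data.frame(dimensions = dims, relative_contrast = res)
```

The contrast shrinks as dimensions are added, which is one reason distance-based classifiers tend to need many more samples in high-dimensional spaces.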

## 3.1. Outlier removal

In [None]:
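# Boxplot rule: flag values outside [Q1 - const*IQR, Q3 + const*IQR].
# Returns the outlier values, or their positions in x when positions = TRUE.
bpRule <- function(x, const = 1.5, positions = FALSE) {
    qs <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
    iqr <- qs[2] - qs[1]
    out <- which(x < qs[1] - const * iqr | x > qs[2] + const * iqr)
    if (positions) out else x[out]
}

In [None]:
iris2 <- iris[,-5]

# Collect the rows that hold an outlier in at least one column
outRows <- c()
for (column in colnames(iris2)){
    outRows <- union(outRows, bpRule(iris2[,column], positions = TRUE))
}

# Drop those rows (if there are none, keep the data unchanged)
iris.out <- if (length(outRows) > 0) iris2[-outRows,] else iris2

In [None]:
head(iris.out)

In [None]:
boxplot(iris.out)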

## 3.2. Data transformation

### (a) Normalization Min-Max

In [None]:
mxs <- apply(select(iris,-Species), 2, max, na.rm = TRUE)
mns <- apply(select(iris,-Species), 2, min, na.rm = TRUE)
iris_new1 <- cbind(scale(select(iris,-Species), center = mns, scale = mxs-mns), select(iris,Species))

head(iris_new1)
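For reference, min-max normalization maps each value to $(x - min) / (max - min)$, so every column ends up in the range [0, 1]. The sketch below does this in base R, without dplyr; the names `minmax` and `iris_mm` are illustrative and not part of the original notebook.

```r
# Rescale a numeric vector to [0, 1]
minmax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

iris_mm <- as.data.frame(lapply(iris[, 1:4], minmax))
iris_mm$Species <- iris$Species

# Each numeric column should now range from 0 to 1
sapply(iris_mm[, 1:4], range)
```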

### (b) Normalization Z-score

In [None]:
# mutate_each_() has been removed from recent dplyr versions; mutate_at() does the same job
iris_new2 <- iris %>% mutate_at(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                                ~ as.vector(scale(.)))

head(iris_new2)
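Z-score normalization subtracts the column mean and divides by the column standard deviation. As a quick sanity check, the following base-R sketch standardizes the numeric columns with `scale()` and confirms that the result has mean 0 and standard deviation 1 (the name `z` is illustrative).

```r
# Standardize the four numeric columns
z <- scale(iris[, 1:4])

round(colMeans(z), 10)   # means are (numerically) zero
apply(z, 2, sd)          # standard deviations are one
```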

## 3.3. Feature selection

The selection of a subset of variables (feature selection) is a common task in data analysis. It may have several motivations, for instance removing irrelevant variables or variables that are highly correlated with others. Another motivation for feature selection is simply reducing the dimensionality of the dataset.

There are many methods we can use to select features.

* **Filter methods:** involve looking at variables individually and assessing their value using some metric, which is then used to rank them and remove the less relevant ones (in terms of the selected metric).

* **Wrapper methods:** work by taking into consideration the objectives of the analysis we plan to carry out with the dataset. This means that they search for the subset of variables that is most adequate in terms of the criteria used to evaluate the results of the subsequent modeling stages. These methods typically involve an iterative search procedure where, at each step, a candidate set of features is used to obtain a model, which is evaluated, and the results of this evaluation are used to decide whether the features are good enough or whether we need to try another set.

* **Unsupervised methods:** look at each feature individually and calculate its relevance using only the values of the variable.

* **Supervised methods:** exploit the existence of a “special” variable in the dataset, the so-called target variable. These supervised methods evaluate each feature by looking at its relationship with the target variable. This may be as simple as calculating the correlation of each feature with the target, but it may also involve other metrics.

**Obs.:** Wrapper methods are most of the time supervised methods because they typically use some predictive model to assess the value of a set of candidate features.

**Obs.:** Because of this iterative search process, wrapper methods are typically more demanding in computation terms.
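As a concrete example of a supervised filter method, the sketch below ranks each iris feature by a one-way ANOVA F statistic against the target `Species`. The choice of the F statistic as the metric and the names `f_stat` and `ranking` are assumptions made for illustration; any other relevance metric could be used in the same way.

```r
# Score one feature: F statistic of a one-way ANOVA of the feature on Species
f_stat <- function(x) summary(aov(x ~ iris$Species))[[1]][["F value"]][1]

# Score every numeric feature individually and rank them
ranking <- sort(sapply(iris[, 1:4], f_stat), decreasing = TRUE)
ranking
```

Features at the bottom of the ranking are the candidates for removal. Because each feature is scored on its own, this approach ignores redundancy between features.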

### (a) Principal components analysis (PCA)

PCA:

* Way to reduce the number of noninformative dimensions and to eliminate correlated variables. Once PCA has been applied, we select the most informative variables for the problem considered.

* Statistical procedure that transforms and converts a data set into a new data set containing linearly uncorrelated variables, known as principal components. The basic idea is that the data set is transformed into a set of components where each one attempts to capture as much of the variance (information) in data as possible.

* Unsupervised learning technique and it is used to reduce the dimension of the data with minimum loss of information.

* Transforms the feature from original space to a new feature space to increase the separation between data.

* Useful method for dimension reduction, especially when the number of variables is large.

* **Valuable when we have subsets of measurements that are measured on the same scale and are highly correlated**

* Provides a few variables (often as few as three) that are weighted linear combinations of the original variables, and that retain the majority of the information of the full original set.

In short, this method searches for a set of “new” variables, each being a linear combination of the original variables. The idea is that a smaller set of these new variables could be able to “explain” most of the variability of the original data, and if that is the case we can carry out our analysis using only this subset.

Some important points to note before using PCA:

* PCA looks for linear combinations of the variables, so it will not work well if the data have non-linear relationships.

* **Data should be normalized before performing PCA. PCA is sensitive to scaling of data as higher variance data will drive the principal component**.

**Formalization:**

Denote the original $p$ variables by $X_{1},X_{2},...,X_{p}$. In PCA, we are looking for a set of new variables $PCA_{1},PCA_{2},...,PCA_{p}$ that are weighted averages of the original variables (after subtracting their mean):

$$
\begin{matrix}
PCA_{i}=a_{i,1}(X_{1}-\bar{X_{1}})+a_{i,2}(X_{2}-\bar{X_{2}})+...+a_{i,p}(X_{p}-\bar{X_{p}}), & i=1,...,p
\end{matrix}
$$

where each pair of PCAs has correlation = 0. We then order the resulting PCAs by their variance, with $PCA_{1}$ having the largest variance and $PCA_{p}$ having the smallest variance.

A further advantage of the principal components over the original variables is that they are uncorrelated (correlation coefficient = 0). If we construct regression models using these principal components as predictors (independent variables), we will not encounter problems of multicollinearity.
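The formalization can be reproduced by hand in a few lines. The sketch below centers the data, takes the eigenvectors of the covariance matrix as the weights $a_{i,j}$, and checks that the resulting components are uncorrelated and ordered by variance. The names `X`, `eig` and `scores` are illustrative, and the data are left unscaled here to match the formula.

```r
# Center each variable (subtract its mean), as in the formula
X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)

# Eigen-decomposition of the covariance matrix: columns of eig$vectors are the weights
eig <- eigen(cov(X))

# New variables: weighted combinations of the centered originals
scores <- X %*% eig$vectors

round(cor(scores), 10)   # off-diagonal correlations are zero
eig$values               # variances of the components, in decreasing order
```

The eigenvalues agree with the squared `sdev` values returned by `prcomp(iris[, 1:4])`; the signs of individual weight vectors may differ, since they are only defined up to sign.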

In [None]:
# whole magnitude (whole variance)
pca <- prcomp(x = iris[,1:4], center = TRUE, scale. = TRUE)
summary(pca)

# Choose the principal components with highest variances

# summary: gives the reallocated variance (Proportion of Variance)
# 72.96% of the total variability for PC1 (can capture most of the variability in the data)
# 22.85% of the total variability for PC2
# 3.67% of the total variability for PC3
# 0.52% of the total variability for PC4

# the first two principal components alone capture 95.81% of the total variation.

The object returned by prcomp contains, among others, the following four components.

* sdev: the standard deviations of the points projected onto PC1, PC2, PC3 and PC4, in decreasing order from PC1 to PC4.

* rotation: defines the principal component axes. Here there are four principal components because there are four input features. It is the rotation matrix, whose columns give the weights used to project the original points onto the new directions.

* center: the means of the input features in the original feature space (without any transformation).

* scale: the standard deviations of the input features in the original feature space (without any transformation).

In [None]:
print('Sdev')
pca$sdev

print('Rotation')
pca$rotation

print('Center')
pca$center

print('Scale')
pca$scale

In [None]:
# scores for the dimensions
head(pca$x)

The following figure shows the points projected onto each principal component. The projections on PC1 clearly separate the species, whereas the projections on the lower principal components (PC2, PC3 and PC4) do not separate them nearly as convincingly.

In [None]:
par(mfrow = c(2,2))
plot(pca$x[,1], col = iris[,5])
plot(pca$x[,2], col = iris[,5])
plot(pca$x[,3], col = iris[,5])
plot(pca$x[,4], col = iris[,5])

The following plot shows the dominance of PC1. The bar graph shows the variance of each principal component. PC1 explains about 73% of the variance, PC2 about 23%, and so on. PC1 and PC2 together explain around 96% of the variance, so we can discard PC3 and PC4, whose combined contribution is only about 4%.

In [None]:
plot(pca)
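The proportions of variance quoted in the text can be recomputed from `sdev`. A minimal sketch (the name `pve` is illustrative; the PCA is refit here so that the snippet is self-contained):

```r
pca <- prcomp(x = iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(rbind(proportion = pve, cumulative = cumsum(pve)), 4)
```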

In [None]:
# Visualization of data in the new reduced dimension
pcar <- princomp(iris[,1:4])
loadings(pcar)

**loadings()** was used to inspect the (rotated) axes that were found.

Note that the "Proportion Var" and "Cumulative Var" rows printed below the loadings are computed from the squared loadings, not from the data. Because each loading vector has unit length, every component gets the same value (0.25 with four features), so the cumulative value of 0.75 for the first three components does not mean that they capture only 75% of the original variance. The proportion of variance captured by each component is reported by summary(pcar).

The first part of the output produced by **loadings()** shows us that each of the new variables is a linear combination of the original (centered) features. For instance, in the above example we see that the 1st component (PCA1) is calculated as $0.361 \times Sepal.Length + 0.857 \times Petal.Length + 0.358 \times Petal.Width$; the loading of Sepal.Width is small enough that the printed output leaves it blank.
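To see how much of the original variance each component of this princomp fit captures, the standard deviations stored in the fitted object can be used. A minimal sketch (the name `prop_var` is illustrative; the model is refit so that the snippet is self-contained):

```r
pcar <- princomp(iris[, 1:4])

# Variance captured by each component as a share of the total
prop_var <- pcar$sdev^2 / sum(pcar$sdev^2)
round(rbind(proportion = prop_var, cumulative = cumsum(prop_var)), 4)
```

Because the data are not scaled in this fit, the first component carries a larger share of the variance than in the scaled prcomp fit used earlier.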

In [None]:
# the values of the new features for the first 5 cases
pcar$scores[1:5,]

Suppose we are happy with the proportion of variance explained by a small subset of the components (say the first 2). We could carry out the posterior modeling stages on this new (and reduced) feature space. For instance, instead of using the original Iris dataset we could use the scores of the first two components.

In [None]:
reduced.iris <- data.frame(pcar$scores[,1:2], Species = iris$Species)

In [None]:
print('Original dataset:')
dim(iris)[2]

print('Reduced dataset:')
dim(reduced.iris)[2]

### (b) Boruta

Boruta is a feature ranking and selection algorithm based on the random forest algorithm.

The advantage of Boruta is that it clearly decides whether a variable is important or not and helps to select variables that are statistically significant. Besides, you can adjust the strictness of the algorithm through pValue, which defaults to 0.01, and maxRuns.

maxRuns is the maximum number of runs the algorithm performs. The higher the maxRuns, the more selective you get in picking the variables. The default value is 100.

In the process of deciding whether a feature is important or not, some features may be marked by Boruta as 'Tentative'. Sometimes increasing maxRuns helps resolve the tentative status of such features.

In [None]:
library(Boruta)

In [None]:
# Perform Boruta search
boruta.output <- Boruta(Species ~., data = iris, doTrace = 0)
names(boruta.output)

In [None]:
# Get significant variables including tentatives
boruta.signif <- getSelectedAttributes(boruta.output, withTentative = TRUE)
print(boruta.signif)  

If you do not want to take the tentative variables for granted, you can apply **TentativeRoughFix** to **boruta.output**, which makes a final decision for each of them.

In [None]:
# Do a tentative rough fix
roughFixMod <- TentativeRoughFix(boruta.output)
boruta.signif <- getSelectedAttributes(roughFixMod)
print(boruta.signif)

In [None]:
# Variable Importance Scores
imps <- attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ])  # descending sort

In [None]:
# Plot variable importance
plot(boruta.output, cex.axis = .7, las = 2, xlab ="", main = "Variable Importance")  

The boxes in green are ‘confirmed’ and the ones in red are not. There are also a couple of blue boxes representing **ShadowMax** and **ShadowMin**. They are not actual features, but are used by the Boruta algorithm to decide whether a variable is important or not.
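The idea behind the shadow features can be illustrated without the Boruta package. In the sketch below, each feature is shuffled to create a "shadow" copy that has no real relation to the target, and a real feature is considered promising only if it scores higher than the best shadow. This is a deliberately simplified stand-in: it uses a one-way ANOVA F statistic as the importance score and a single shuffle, whereas Boruta relies on random forest importance, many runs and a statistical test. All names (`shadows`, `f_stat`, `imp_real`, `imp_shadow`) and the seed are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

features <- iris[, 1:4]

# Shadow features: each column shuffled independently, breaking any link to Species
shadows <- as.data.frame(lapply(features, sample))
names(shadows) <- paste0("shadow_", names(features))

# Stand-in importance score (NOT what Boruta uses): one-way ANOVA F statistic
f_stat <- function(x) summary(aov(x ~ iris$Species))[[1]][["F value"]][1]
imp_real <- sapply(features, f_stat)
imp_shadow <- sapply(shadows, f_stat)

# A real feature looks important when it beats the best shadow feature
imp_real > max(imp_shadow)
```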

### (c) Variable Importance from Machine Learning Algorithms

Another way to look at feature selection is to treat the variables that various ML algorithms use the most as the important ones.

Depending on how a machine learning algorithm learns the relationship between the X's and Y, different algorithms may end up using different variables (although mostly the same ones) to various degrees.

In [None]:
library(caret)
library(e1071)

The variable importances from Recursive Partitioning (rpart) algorithm (decision tree method).

In [None]:
set.seed(100)
rPartMod <- train(Species ~., data = iris, method = "rpart")
rpartImp <- varImp(rPartMod)
print(rpartImp)

Only 3 of the 4 features were used by rpart and, if you look closely, the variables used here are among the top ones that Boruta selected.

In [None]:
plot(rpartImp, main='Variable Importance')

The variable importances from Regularized Random Forest (RRF) algorithm.

In [None]:
library(RRF)

In [None]:
# Train an RRF model and compute variable importance.
set.seed(100)
rrfMod <- train(Species ~., data = iris, method="RRF")
rrfImp <- varImp(rrfMod, scale=F)
rrfImp

In [None]:
plot(rrfImp, main='Variable Importance')

### (d) Recursive Feature Elimination (RFE)

Recursive feature elimination (RFE) offers a rigorous way to determine the important variables before you even feed them into an ML algorithm.

It can be implemented using the rfe() function from the caret package.

In [None]:
ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 5, verbose = FALSE)
# functions = rfFuncs => random forest is used to fit the models and rank the features
# method = "repeatedcv" => repeated k-fold cross-validation

lmProfile <- rfe(x = iris[,-5], y = iris[,5], sizes = 3, rfeControl = ctrl)
# sizes => subset sizes (numbers of features retained) that rfe should evaluate
# rfeControl => receives the output of rfeControl()

lmProfile

In [None]:
lmProfile$optVariables

### (e) Genetic Algorithm

You can perform supervised feature selection with genetic algorithms using the **gafs()** function from the caret package. This is quite resource-intensive, so keep that in mind when choosing the number of iterations (iters) and the number of repeats in **gafsControl()**.

In [None]:
# Define control function
ga_ctrl <- gafsControl(functions = rfGA,  # another option is `caretGA`.
                        method = "cv",     # use "repeatedcv" for `repeats` to take effect
                        repeats = 3)

In [None]:
# Genetic Algorithm feature selection
set.seed(100)
ga_obj <- gafs(x = iris[,-5], 
               y = iris[, 5], 
               iters = 3,   # normally much higher (100+)
               gafsControl = ga_ctrl)

ga_obj

In [None]:
# Optimal variables
ga_obj$optVariables