<img src="http://oproject.org/tiki-download_file.php?fileId=8&display&x=450&y=128" height="200" width="200"  />
<img class="middle-img" src="http://gfif.udea.edu.co/root/tmva/img/tmva_logo.gif" height="200" width="200" />
# Variance Threshold Transformation
<hr style="border-top-width: 6px; border-top-color: #34609b"> 
This notebook demonstrates the variable transformation method (VarTransform) of the DataLoader class using code samples and visual plots.

## Introduction

In high energy physics and machine learning problems, we often encounter data with a large number of input variables. To extract the maximum information from the data, we need to select the relevant input variables for the multivariate classification and regression methods implemented in TMVA. Variance Threshold is a simple unsupervised variable selection method that automates this process.

It computes the weighted variance $\sigma^2_V$ of each variable $V$ and drops the variables whose variance does not exceed a specified threshold. The weighted variance of a variable is defined as:
$$\sigma^2_V = \frac{\sum_{i=1}^N w_i (x_i - \mu_V)^2}{\sum_{i=1}^N w_i}$$

where $N$ is the number of events in the dataset, $x_i$ is the value of the variable in the $i$th event, $w_i$ is the weight of the $i$th event, and $\mu_V$ is the weighted mean of the variable:
$$\mu_V = \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}$$
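
As a minimal sketch of the two formulas above, the following standalone C++ functions compute the weighted mean and the weighted variance of a single variable. The names `weightedMean` and `weightedVariance` and the `std::vector` interface are assumptions made for illustration; this is not TMVA's internal implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted mean: mu_V = sum_i(w_i * x_i) / sum_i(w_i).
// Hypothetical helper for illustration, not part of the TMVA API.
double weightedMean(const std::vector<double>& x, const std::vector<double>& w)
{
    assert(x.size() == w.size() && !x.empty());
    double sumW = 0.0;
    double sumWX = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumW += w[i];
        sumWX += w[i] * x[i];
    }
    return sumWX / sumW; // assumes the weights do not sum to zero
}

// Weighted variance: sigma^2_V = sum_i(w_i * (x_i - mu_V)^2) / sum_i(w_i).
// Two passes over the data: one for the mean, one for the squared deviations.
double weightedVariance(const std::vector<double>& x, const std::vector<double>& w)
{
    const double mu = weightedMean(x, w);
    double sumW = 0.0;
    double sumWD2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = x[i] - mu;
        sumW += w[i];
        sumWD2 += w[i] * d * d;
    }
    return sumWD2 / sumW;
}
```

For the values {1, 2, 3, 4} with unit weights, these functions give a mean of 2.5 and a variance of 1.25. A variable that is constant across all events has variance zero regardless of the weights.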

The user can set a variance threshold $T$. If none is given, the default threshold is zero, which removes only the variables that have the same value in every event. The resulting set of variables $S$ is formally defined as:

$$S = \{V  \mid \sigma^2_V > T \}$$
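
The selection rule can be sketched in standalone C++ as follows. The `Column` alias, the `selectByVariance` function and the in-memory layout (one vector of values per variable, one shared vector of event weights) are hypothetical choices for illustration and do not reflect how TMVA stores events. The variance helper is included so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A named variable with one value per event (hypothetical layout).
using Column = std::pair<std::string, std::vector<double>>;

// Weighted variance of one variable, following the definition of sigma^2_V.
double weightedVariance(const std::vector<double>& x, const std::vector<double>& w)
{
    assert(x.size() == w.size() && !x.empty());
    double sumW = 0.0;
    double sumWX = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumW += w[i];
        sumWX += w[i] * x[i];
    }
    const double mu = sumWX / sumW; // assumes the weights do not sum to zero
    double sumWD2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumWD2 += w[i] * (x[i] - mu) * (x[i] - mu);
    }
    return sumWD2 / sumW;
}

// Build S = { V | sigma^2_V > T }: keep the names of the variables whose
// weighted variance is strictly greater than the threshold (default T = 0).
std::vector<std::string> selectByVariance(const std::vector<Column>& columns,
                                          const std::vector<double>& weights,
                                          double threshold = 0.0)
{
    std::vector<std::string> selected;
    for (const auto& column : columns) {
        if (weightedVariance(column.second, weights) > threshold) {
            selected.push_back(column.first);
        }
    }
    return selected;
}
```

Because the comparison is strict, a variable whose variance equals $T$ is dropped. With the default $T = 0$, only constant variables are removed.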

## Initialize DataLoader and Factory 

In [1]:
TMVA::Tools::Instance();
TFile *inputFile = TFile::Open( "../../datasets/mydataset.root"); 
TFile* outputFile = TFile::Open( "mydataset_output.root", "RECREATE" );
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputFile, 
                                           "!V:ROC:!Correlations:!Silent:Color:!DrawProgressBar:AnalysisType=Classification" );
TMVA::DataLoader *loader1=new TMVA::DataLoader("mydataset");

## Setup DataLoader

In [2]:
loader1->AddVariable("var0", 'F');
loader1->AddVariable("var1", 'F');
loader1->AddVariable("var2", 'F');
loader1->AddVariable("var3 := var0-var1", 'F');
loader1->AddVariable("var4 := var0*var2", 'F');
loader1->AddVariable("var5 := var1+var2", 'F');
TTree *tsignal = (TTree*)inputFile->Get("MyMCSig");
TTree *tbackground = (TTree*)inputFile->Get("MyMCBkg");
TCut mycuts = "";
TCut mycutb = "";
loader1->AddSignalTree( tsignal,     1.0 );
loader1->AddBackgroundTree( tbackground, 1.0 );
loader1->PrepareTrainingAndTestTree( mycuts, mycutb,"nTrain_Signal=3000:nTrain_Background=3000:nTest_Signal=1449:nTest_Background=1449:SplitMode=Random:NormMode=NumEvents:!V");

DataSetInfo              : [mydataset] : Added class "Signal"
                         : Add Tree MyMCSig of type Signal with 5449 events
DataSetInfo              : [mydataset] : Added class "Background"
                         : Add Tree MyMCBkg of type Background with 5449 events


## Apply Variance Threshold 

With the dataset and all its variables loaded in the DataLoader, we are ready to apply the Variance Threshold transformation. It is implemented in the VarTransform method of the [DataLoader](https://root.cern.ch/doc/master/classTMVA_1_1DataLoader.html) class.

### Method Definition
Parameters: Transformation definition string  
Returns: DataLoader with selected subset of variables   

The transformation definition string **must follow** one of the following formats; otherwise the method raises an error.

|String            | Description                                                            |
|------------------|------------------------------------------------------------------------|
|"VT"              | Select variables whose variance is above threshold value = 0 (Default) |
|"VT(float_value)" | Select variables whose variance lies above float_value passed.         |
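
To make the two accepted formats concrete, here is a small standalone sketch that parses such a string into a threshold value. The function `parseVarianceThreshold` is hypothetical and only mirrors the table above; it is not the parser used inside TMVA, and the real method may accept or reject edge cases differently.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical parser for the two formats in the table: "VT" and "VT(float_value)".
// Returns true and sets `threshold` on success; returns false for any other string.
bool parseVarianceThreshold(const std::string& definition, double& threshold)
{
    if (definition == "VT") {
        threshold = 0.0; // default threshold
        return true;
    }
    const std::string prefix = "VT(";
    if (definition.size() <= prefix.size() + 1 ||
        definition.compare(0, prefix.size(), prefix) != 0 ||
        definition.back() != ')') {
        return false;
    }
    // Text between "VT(" and the closing ")".
    const std::string number =
        definition.substr(prefix.size(), definition.size() - prefix.size() - 1);
    assert(!number.empty()); // guaranteed by the size check above
    char* end = nullptr;
    const double value = std::strtod(number.c_str(), &end);
    if (end != number.c_str() + number.size()) {
        return false; // not a number, or trailing characters after the number
    }
    threshold = value;
    return true;
}
```

With this sketch, "VT" yields a threshold of 0 and "VT(2.95)" yields 2.95, while strings such as "VT()" or "VT(abc)" are rejected.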

In [3]:
TMVA::DataLoader* loader2 = loader1->VarTransform("VT(2.95)");

DataSetFactory           : [mydataset] : Number of events in input trees
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 3000
                         : Signal     -- testing events             : 1449
                         : Signal     -- training and testing events: 4449
                         : Background -- training events            : 3000
                         : Background -- testing events             : 1449
                         : Background -- training and testing events: 4449
                         : 
DataSetInfo              : Correlation matrix (Signal):
                         : ----------------------------------------------------------------
                         :               var0    var1    var2 var0-var1 var0*var2 var1+var2
                         :      var0:  +1

In [4]:
//Boosted Decision Trees
factory->BookMethod(loader1,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=200:MinNodeSize=2.5%:MaxDepth=2:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
factory->BookMethod(loader1, TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

//Support Vector Machine
factory->BookMethod(loader1, TMVA::Types::kSVM, "SVM", "Gamma=0.25:Tol=0.001" );

Factory                  : Booking method: BDT
                         : 
Factory                  : Booking method: MLP
                         : 
MLP                      : [mydataset] : Create Transformation "N" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var0' <---> Output : variable 'var0'
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
                         : Input : variable 'var3' <---> Output : variable 'var3'
                         : Input : variable 'var4' <---> Output : variable 'var4'
                         : Input : variable 'var5' <---> Output : variable 'var5'
MLP                      : Building Network. 
                         : Initializing weights
Factory                  : Booking method: SVM
   

In [5]:
//Boosted Decision Trees
factory->BookMethod(loader2,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=200:MinNodeSize=2.5%:MaxDepth=2:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
factory->BookMethod(loader2, TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

//Fisher Discriminant
factory->BookMethod(loader2, TMVA::Types::kFisher, "Fisher",
                   "H:!V:Fisher:CreateMVAPdfs:PDFInterpolMVAPdf=Spline2:NbinsMVAPdf=60:NsmoothMVAPdf=10" );

//Support Vector Machine
factory->BookMethod(loader2, TMVA::Types::kSVM, "SVM", "Gamma=0.25:Tol=0.001" );

Factory                  : Booking method: BDT
                         : 
DataSetFactory           : [vt_transformed_dataset] : Number of events in input trees
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 3000
                         : Signal     -- testing events             : 1449
                         : Signal     -- training and testing events: 4449
                         : Background -- training events            : 3000
                         : Background -- testing events             : 1449
                         : Background -- training and testing events: 4449
                         : 
DataSetInfo              : Correlation matrix (Signal):
                         : ----------------------------------------
                         :            var0-var1 var0*var2 var

In [6]:
factory->TrainAllMethods();

Factory                  : [mydataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var0' <---> Output : variable 'var0'
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
                         : Input : variable 'var3' <---> Output : variable 'var3'
                         : Input : variable 'var4' <---> Output : variable 'var4'
                         : Input : variable 'var5' <---> Output : variable 'var5'
TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var0:     3.0588     1.6838   [   0.043380     9.9950 ]
                         :     var1:     1.5648     1.422

In [7]:
factory->TestAllMethods();
factory->EvaluateAllMethods();

Factory                  : Test method: BDT for Classification performance
                         : 
BDT                      : [mydataset] : Evaluation of BDT on testing sample (2898 events)
                         : Elapsed time for evaluation of 2898 events: 0.0604 sec
Factory                  : Test method: MLP for Classification performance
                         : 
MLP                      : [mydataset] : Evaluation of MLP on testing sample (2898 events)
                         : Elapsed time for evaluation of 2898 events: 0.00599 sec
Factory                  : Test method: SVM for Classification performance
                         : 
SVM                      : [mydataset] : Evaluation of SVM on testing sample (2898 events)
                         : Elapsed time for evaluation of 2898 events: 0.226 sec
Factory                  : Test method: BDT for Classification performance
                         : 
BDT            