<img src="http://oproject.org/tiki-download_file.php?fileId=8&display&x=450&y=128" height="200" width="200"  />
<img class="middle-img" src="http://gfif.udea.edu.co/root/tmva/img/tmva_logo.gif" height="200" width="200" />
# Variance Threshold Transformation
<hr style="border-top-width: 6px; border-top-color: #34609b"> 
This notebook demonstrates the Variance Threshold variable transformation, available through the VarTransform method of the DataLoader class, using code samples and plots.

## Introduction

In high energy physics and machine learning problems, we often encounter data with a large number of input variables. To extract the maximum information from such data, we need to select the relevant input variables for the multivariate classification and regression methods implemented in TMVA. Variance Threshold is a simple unsupervised variable selection method that automates this process.

It computes the weighted variance $\sigma^2_V$ of each variable $V$ and drops the variables whose variance does not exceed a given threshold. The weighted variance of a variable is defined as:
$$\sigma^2_V = \frac{\sum_{i=1}^N w_i (x_i - \mu_V)^2}{\sum_{i=1}^N w_i}$$

where $N$ is the number of events in the dataset, $x_i$ is the value of the variable $V$ in the $i$th event, $w_i$ is the weight of the $i$th event, and $\mu_V$ is the weighted mean of the variable:
$$\mu_V = \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}$$

The user can set a threshold $T$ for the variance. If none is given, the threshold defaults to zero, which removes only the variables that have the same value in all events. The result is a new set of variables $S$, formally defined as:

$$S = \{V  \mid \sigma^2_V > T \}$$
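The definitions above can be sketched in plain C++ without ROOT. This is only an illustration of the formulas, not TMVA's implementation; the names `Variable`, `WeightedMean`, `WeightedVariance` and `SelectByVariance` are hypothetical and chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for illustration: one input variable and its
// value in every event.
struct Variable {
    std::string name;
    std::vector<double> values;
};

// Weighted mean: mu_V = sum(w_i * x_i) / sum(w_i).
double WeightedMean(const std::vector<double>& x, const std::vector<double>& w) {
    assert(x.size() == w.size() && !x.empty());
    double sumW = 0.0, sumWX = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumW += w[i];
        sumWX += w[i] * x[i];
    }
    return sumWX / sumW;
}

// Weighted variance: sigma^2_V = sum(w_i * (x_i - mu_V)^2) / sum(w_i).
double WeightedVariance(const std::vector<double>& x, const std::vector<double>& w) {
    const double mu = WeightedMean(x, w);
    double sumW = 0.0, sumSq = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumW += w[i];
        sumSq += w[i] * (x[i] - mu) * (x[i] - mu);
    }
    return sumSq / sumW;
}

// S = { V | sigma^2_V > T }: keep the names of the variables whose
// weighted variance is strictly above the threshold (default T = 0).
std::vector<std::string> SelectByVariance(const std::vector<Variable>& vars,
                                          const std::vector<double>& weights,
                                          double threshold = 0.0) {
    std::vector<std::string> selected;
    for (const Variable& v : vars) {
        if (WeightedVariance(v.values, weights) > threshold) {
            selected.push_back(v.name);
        }
    }
    return selected;
}
```

For example, with event weights (1, 1, 2) and values (1, 2, 3), the weighted mean is 2.25 and the weighted variance is 0.6875. A variable equal to 5 in every event has variance 0, so it is dropped at the default threshold.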

## Initialize DataLoader and Factory 

In [1]:
TMVA::Tools::Instance();
TFile *inputFile = TFile::Open( "../datasets/mydataset.root"); 
TFile* outputFile = TFile::Open( "mydataset_output.root", "RECREATE" );
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputFile, 
                                           "!V:ROC:!Correlations:!Silent:Color:!DrawProgressBar:AnalysisType=Classification" );
TMVA::DataLoader *loader1=new TMVA::DataLoader("mydataset");

--- Factory                  : You are running ROOT Version: 6.07/07, Apr 1, 2016
--- Factory                  : 
--- Factory                  : _/_/_/_/_/ _|      _|  _|      _|    _|_|   
--- Factory                  :    _/      _|_|  _|_|  _|      _|  _|    _| 
--- Factory                  :   _/       _|  _|  _|  _|      _|  _|_|_|_| 
--- Factory                  :  _/        _|      _|    _|  _|    _|    _| 
--- Factory                  : _/         _|      _|      _|      _|    _| 
--- Factory                  : 
--- Factory                  : ___________TMVA Version 4.2.1, Feb 5, 2015
--- Factory                  : 


## Setup DataLoader

In [2]:
loader1->AddVariable("var0", 'F');
loader1->AddVariable("var1", 'F');
loader1->AddVariable("var2", 'F');
loader1->AddVariable("var3 := var0-var1", 'F');
loader1->AddVariable("var4 := var0*var2", 'F');
loader1->AddVariable("var5 := var1+var2", 'F');
TTree *tsignal = (TTree*)inputFile->Get("MyMCSig");
TTree *tbackground = (TTree*)inputFile->Get("MyMCBkg");
TCut mycuts = "";
TCut mycutb = "";
loader1->AddSignalTree( tsignal,     1.0 );
loader1->AddBackgroundTree( tbackground, 1.0 );
loader1->PrepareTrainingAndTestTree( mycuts, mycutb,"nTrain_Signal=3000:nTrain_Background=3000:nTest_Signal=1449:nTest_Background=1449:SplitMode=Random:NormMode=NumEvents:!V");

--- DataSetInfo              : Dataset[mydataset] : Added class "Signal"	 with internal class number 0
--- Configurable             : Add Tree MyMCSig of type Signal with 5449 events
--- DataSetInfo              : Dataset[mydataset] : Added class "Background"	 with internal class number 1
--- Configurable             : Add Tree MyMCBkg of type Background with 5449 events
--- Configurable             : Preparing trees for training and testing...


## Apply Variance Threshold 

With the dataset and all its variables loaded in the DataLoader, we are ready to apply the Variance Threshold transformation. It is implemented in the VarTransform method of the [DataLoader](https://root.cern.ch/doc/master/classTMVA_1_1DataLoader.html) class.

### Method Definition
Parameters: Transformation definition string  
Returns: DataLoader with selected subset of variables   

The transformation definition string **must follow** one of the formats below; otherwise the method raises an error.

|String            | Description                                                            |
|------------------|------------------------------------------------------------------------|
|"VT"              | Select variables whose variance is above threshold value = 0 (Default) |
|"VT(float_value)" | Select variables whose variance lies above float_value passed.         |
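To make the two accepted formats concrete, here is a minimal sketch of how such a string could be validated and turned into a threshold. It is not TMVA's actual parser; `ParseVarianceThreshold` is a hypothetical helper, and signalling a malformed string with `std::invalid_argument` is an assumption made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of TMVA: returns the threshold encoded in
// a definition string of the form "VT" or "VT(float_value)".
double ParseVarianceThreshold(const std::string& definition) {
    // Bare "VT" selects the default threshold of zero.
    if (definition == "VT") {
        return 0.0;
    }

    // Otherwise the string must look like "VT(" + number + ")".
    const std::string prefix = "VT(";
    if (definition.compare(0, prefix.size(), prefix) != 0 || definition.back() != ')') {
        throw std::invalid_argument("expected \"VT\" or \"VT(float_value)\"");
    }

    // Extract the text between the parentheses.
    const std::string number =
        definition.substr(prefix.size(), definition.size() - prefix.size() - 1);

    std::size_t consumed = 0;
    double threshold = 0.0;
    try {
        threshold = std::stod(number, &consumed);
    } catch (const std::exception&) {
        throw std::invalid_argument("threshold is not a valid number");
    }

    // Reject trailing characters such as "VT(2.95abc)".
    if (consumed != number.size()) {
        throw std::invalid_argument("threshold is not a valid number");
    }
    return threshold;
}
```

Under these assumptions, "VT" yields 0, "VT(2.95)" yields 2.95, and strings such as "VT(abc)" or "VT(" are rejected.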

In [3]:
TMVA::DataLoader* loader2 = loader1->VarTransform("VT(2.95)");

--- DataSetFactory           : Dataset[mydataset] : Splitmode is: "RANDOM" the mixmode is: "SAMEASSPLITMODE"
--- DataSetFactory           : Dataset[mydataset] : Create training and testing trees -- looping over class "Signal" ...
--- DataSetFactory           : Dataset[mydataset] : Weight expression for class 'Signal': ""
--- DataSetFactory           : Dataset[mydataset] : Create training and testing trees -- looping over class "Background" ...
--- DataSetFactory           : Dataset[mydataset] : Weight expression for class 'Background': ""
--- DataSetFactory           : Dataset[mydataset] : Number of events in input trees (after possible flattening of arrays):
--- DataSetFactory           : Dataset[mydataset] :     Signal      -- number of events       : 5449  / sum of weights: 5449
--- DataSetFactory           : Dataset[mydataset] :     Background      -- number of events       : 5449  / sum of weights: 5449
--- DataSetFactory           : Dataset[mydataset] :     Signal tree -- total n

In [4]:
//Boosted Decision Trees
factory->BookMethod(loader1,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=200:MinNodeSize=2.5%:MaxDepth=2:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
factory->BookMethod(loader1, TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

//Fisher discriminant
factory->BookMethod(loader1, TMVA::Types::kFisher, "Fisher",
                   "H:!V:Fisher:CreateMVAPdfs:PDFInterpolMVAPdf=Spline2:NbinsMVAPdf=60:NsmoothMVAPdf=10" );

//Cut optimisation using Monte Carlo sampling
// factory->BookMethod(loader1, TMVA::Types::kCuts, "Cuts",
// "!H:!V:FitMethod=MC:EffSel:SampleSize=200000:VarProp=FSmart" );

//Support Vector Machine
factory->BookMethod(loader1, TMVA::Types::kSVM, "SVM", "Gamma=0.25:Tol=0.001" );

// DNN 
TString layoutString ("Layout=TANH|100,TANH|50,TANH|10,LINEAR");
TString training0 ("LearningRate=1e-1,Momentum=0.0,Repetitions=1,ConvergenceSteps=300,BatchSize=20,TestRepetitions=15,WeightDecay=0.001,Regularization=NONE,DropConfig=0.0+0.5+0.5+0.5,DropRepetitions=1,Multithreading=True");
TString training1 ("LearningRate=1e-2,Momentum=0.5,Repetitions=1,ConvergenceSteps=300,BatchSize=30,TestRepetitions=7,WeightDecay=0.001,Regularization=L2,Multithreading=True,DropConfig=0.0+0.1+0.1+0.1,DropRepetitions=1");
TString trainingStrategyString ("TrainingStrategy=");
trainingStrategyString += training0 + "|" + training1;
TString nnOptions ("!H:V:ErrorStrategy=CROSSENTROPY:VarTransform=G:WeightInitialization=XAVIERUNIFORM");
nnOptions.Append (":");
nnOptions.Append (layoutString);
nnOptions.Append (":");
nnOptions.Append (trainingStrategyString);
factory->BookMethod(loader1, TMVA::Types::kDNN, "DNN", nnOptions );

--- Factory                  : Booking method: BDT DataSet Name: mydataset
--- Factory                  : Booking method: MLP DataSet Name: mydataset
--- MLP                      : Dataset[mydataset] : Create Transformation "N" with events from all classes.
--- Norm                     : Transformation, Variable selection : 
--- Norm                     : Input : variable 'var0' (index=0).   <---> Output : variable 'var0' (index=0).
--- Norm                     : Input : variable 'var1' (index=1).   <---> Output : variable 'var1' (index=1).
--- Norm                     : Input : variable 'var2' (index=2).   <---> Output : variable 'var2' (index=2).
--- Norm                     : Input : variable 'var3' (index=3).   <---> Output : variable 'var3' (index=3).
--- Norm                     : Input : variable 'var4' (index=4).   <---> Output : variable 'var4' (index=4).
--- Norm                     : Input : variable 'var5' (index=5).   <---> Output : variable

In [5]:
//Boosted Decision Trees
factory->BookMethod(loader2,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=200:MinNodeSize=2.5%:MaxDepth=2:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
factory->BookMethod(loader2, TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

//Fisher discriminant
factory->BookMethod(loader2, TMVA::Types::kFisher, "Fisher",
                   "H:!V:Fisher:CreateMVAPdfs:PDFInterpolMVAPdf=Spline2:NbinsMVAPdf=60:NsmoothMVAPdf=10" );

//Support Vector Machine
factory->BookMethod(loader2, TMVA::Types::kSVM, "SVM", "Gamma=0.25:Tol=0.001" );

//DNN
factory->BookMethod(loader2, TMVA::Types::kDNN, "DNN", nnOptions );

--- Factory                  : Booking method: BDT DataSet Name: transformed_dataset
--- DataSetFactory           : Dataset[default] : Splitmode is: "RANDOM" the mixmode is: "SAMEASSPLITMODE"
--- DataSetFactory           : Dataset[default] : Create training and testing trees -- looping over class "Signal" ...
--- DataSetFactory           : Dataset[default] : Weight expression for class 'Signal': ""
--- DataSetFactory           : Dataset[default] : Create training and testing trees -- looping over class "Background" ...
--- DataSetFactory           : Dataset[default] : Weight expression for class 'Background': ""
--- DataSetFactory           : Dataset[default] : Number of events in input trees (after possible flattening of arrays):
--- DataSetFactory           : Dataset[default] :     Signal      -- number of events       : 5449  / sum of weights: 5449
--- DataSetFactory           : Dataset[default] :     Background      -- number of events       : 5449  / sum of weights

In [6]:
factory->TrainAllMethods();

--- Factory                  :  
--- Factory                  : Train all methods for Classification ...
--- Factory                  : 
--- Factory                  : current transformation string: 'I'
--- Factory                  : Dataset[mydataset] : Create Transformation "I" with events from all classes.
--- Id                       : Transformation, Variable selection : 
--- Id                       : Input : variable 'var0' (index=0).   <---> Output : variable 'var0' (index=0).
--- Id                       : Input : variable 'var1' (index=1).   <---> Output : variable 'var1' (index=1).
--- Id                       : Input : variable 'var2' (index=2).   <---> Output : variable 'var2' (index=2).
--- Id                       : Input : variable 'var3' (index=3).   <---> Output : variable 'var3' (index=3).
--- Id                       : Input : variable 'var4' (index=4).   <---> Output : variable 'var4' (index=4).
--- Id                       : Input : variable 'var5' (index=5).   <-

In [None]:
factory->TestAllMethods();
factory->EvaluateAllMethods();

--- Factory                  : Test all methods...
--- Factory                  : Test method: BDT for Classification performance
--- BDT                      : Dataset[mydataset] : Evaluation of BDT on testing sample (2898 events)
--- BDT                      : Dataset[mydataset] : Elapsed time for evaluation of 2898 events: 0.0584 sec
--- Factory                  : Test method: MLP for Classification performance
--- MLP                      : Dataset[mydataset] : Evaluation of MLP on testing sample (2898 events)
--- MLP                      : Dataset[mydataset] : Elapsed time for evaluation of 2898 events: 0.00749 sec
--- Factory                  : Test method: Fisher for Classification performance
--- Fisher                   : Dataset[mydataset] : Evaluation of Fisher on testing sample (2898 events)
--- Fisher                   : Dataset[mydataset] : Elapsed time for evaluation of 2898 events: 0.000514 sec
--- Fisher            

In [None]:
%jsroot on
auto c1 = factory->VarTransformROCPlot(loader1, loader2);
c1->Draw();

[<unknown binary>]
[<unknown binary>]



 *** Break *** segmentation violation


[/Users/testuser/root-build/lib/libTMVA.so] TMVA::Factory::VarTransformROCPlot(TMVA::DataLoader*, TMVA::DataLoader*) /Users/testuser/root/tmva/tmva/src/Factory.cxx:2224
[<unknown binary>]


In [None]:
auto c2 = factory->GetROCCurve(loader2);
c2->Draw();