# Ames House Price Data - Extraction of Important Features
>Jupyter notebook, running a Julia 0.5.2 kernel, with the help of Machine Learning modules written by the author

*Here we rank the importance of each feature using the decision tree regularization method of [Deng and Runger (2012)](https://arxiv.org/abs/1201.1587v3). Another way to extract important features, using the Lasso-regularized linear model, is described in [The Elastic Net Model](ElasticNet.ipynb).*

## Transforming the data

The data is read in as a `DataFrame` instance, but the regularized decision tree we build requires its input in `DataTable` form. The `DataTable` structure is defined in the `TreeCollections` module, and the query `?DataTable` describes it in detail.

Note that we have no need to one-hot encode categoricals as our decision tree algorithms handle mixed data types.


In [None]:
using ADBUtilities, Preprocess, Regressors, TreeCollections
import DataFrames: DataFrame, head, readtable, writetable

df = readtable("2.cleaned/train_randomized.csv")

const X = DataTable(df[2:end-1]) # drop the identifying feature :Id and the target
const y = collect(df[:target]);

## Ranking the features

To build a Deng-Runger regularized tree we simply give the basic decision tree model a `penalty` keyword argument. We use the default value suggested by the authors:

In [32]:
tree = TreeRegressor(X,y,penalty=0.5)

TreeRegressor@...7531
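
For reference, the idea behind the `penalty` keyword can be written down in the notation of Deng and Runger. In their method, the gain of a candidate split on feature $X_i$ is discounted when $X_i$ has not yet been used at any earlier node, so the tree prefers to reuse features and a new feature enters only when it is clearly informative. Here $F$ denotes the set of features already used for splitting and $\lambda$ is a coefficient between 0 and 1. How the module's `penalty` value maps onto $\lambda$ is not shown in this notebook, so treat that correspondence as an assumption.

```latex
\mathrm{Gain}_R(X_i) =
\begin{cases}
\lambda \cdot \mathrm{Gain}(X_i), & X_i \notin F \\
\mathrm{Gain}(X_i), & X_i \in F
\end{cases}
```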

In [33]:
@more # shorthand for `showall(ans)`

TreeRegressor@...7531
  Hyperparameters:
Dict{Symbol,Any} with 6 entries:
  :max_features       => 0
  :extreme            => false
  :regularization     => 0.0
  :min_patterns_split => 2
  :cutoff             => 0
  :penalty            => 0.5

                            Feature importance
                 ┌────────────────────────────────────────┐
     OverallQual │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 1.0 │
       GrLivArea │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.824     │
      GarageCars │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.824     │
      MSSubClass │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.765       │
          MoSold │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.706         │
       LandSlope │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.706         │

By definition, the *importance* of a feature is one minus the normalized depth at which the feature first appears at a decision node. A dictionary of feature importances, keyed on feature index, is stored as the `importance_given_feature` attribute of `tree`. We record the 40 most important features:

In [34]:
importance_given_feature = tree.importance_given_feature

important_feature_indices = reverse(collect(keys_ordered_by_values(importance_given_feature)))[1:40]
important_features = [X.names[j] for j in important_feature_indices]


40-element Array{Symbol,1}:
 :OverallQual 
 :GrLivArea   
 :GarageCars  
 :MSSubClass  
 :MoSold      
 :LandSlope   
 :BsmtFinSF1  
 :Fireplaces  
 :TotRmsAbvGrd
 :MSZoning    
 :SaleType    
 :TotalBsmtSF 
 :BsmtUnfSF   
 ⋮            
 :BsmtFinType2
 :BsmtFinType1
 :Condition1  
 :x1stFlrSF   
 :GarageType  
 :OverallCond 
 :x2ndFlrSF   
 :BsmtHalfBath
 :HalfBath    
 :GarageFinish
 :MasVnrType  
 :MasVnrArea  
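
The importance definition and the ranking step above can be sketched in plain Julia, without the author's modules. Everything in this block is hypothetical: the depths in `first_depth`, the helper names `importance`, `ranked` and `top_k`, and the normalizing constant `max_depth = 17`. The value 17 is chosen only because it reproduces the displayed scores 0.824, 0.765 and 0.706 for depths 3, 4 and 5; the module's actual normalization is not shown in this notebook. The helper `ranked` is a stand-in for the `reverse(collect(keys_ordered_by_values(...)))` idiom, not its implementation.

```julia
# Hypothetical first-appearance depths, keyed on feature index (root = depth 0).
first_depth = Dict(1 => 0, 2 => 3, 3 => 5, 4 => 4)

# Assumed normalizing constant; not taken from the TreeCollections module.
max_depth = 17

# Importance = one minus the normalized depth of first appearance.
importance(depth, max_depth) = 1 - depth / max_depth

importance_given_feature =
    Dict(j => importance(d, max_depth) for (j, d) in first_depth)

# Feature indices ordered from most to least important.
ranked(d) = sort(collect(keys(d)), by = j -> d[j], rev = true)

# Keep at most the k highest-ranked indices.
top_k(d, k) = ranked(d)[1:min(k, length(d))]

# Illustrative names only; the index-to-name pairing here is made up.
feature_names = [:OverallQual, :GrLivArea, :MoSold, :MSSubClass]
top = [feature_names[j] for j in top_k(importance_given_feature, 3)]
```

Features that tie on importance come out in whatever order the dictionary yields them, so a tie-breaking rule would be needed if the ranking has to be reproducible.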

We see that about half of the 40 most important features are ordinal and the rest are categorical:

In [35]:
sum(X[important_features].scheme.is_ordinal)

22

## Writing results to file

In [36]:
dg = DataFrame([important_features,],[:field,])

writetable("3.important_features/important_features.csv", dg)