## **RULE FIT**

* The linear regression model does not account for interactions between features. Would it not be convenient to have a model that is as simple and interpretable as linear models, but also integrates feature interactions? 
* RuleFit automatically generates such interaction features from decision trees. Each path through a tree can be transformed into a decision rule by combining its split decisions.
* The RuleFit algorithm is implemented in R by Fokkema and Christoffersen (2017), and you can find a Python version on GitHub: https://github.com/christophM/rulefit.
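
To make the path-to-rule idea concrete, here is a minimal sketch in plain Python. The tree is written by hand with hypothetical splits (it is not fitted to the Boston data), and `extract_rules` is an illustrative helper, not part of the `rulefit` package. It emits one rule for every node below the root, by joining the split conditions met on the way down.

```python
# A tiny hand-written regression tree (hypothetical splits, not fitted to data).
# Internal nodes hold a feature, a threshold and two children; leaves hold a value.
tree = {
    "feature": "rm", "threshold": 7.0,
    "left": {
        "feature": "lstat", "threshold": 10.0,
        "left": {"value": 25.0},
        "right": {"value": 17.0},
    },
    "right": {"value": 40.0},
}


def extract_rules(node, conditions=()):
    """Return one rule string for every node below `node`."""
    if "value" in node:  # leaf: no further splits to add
        return []
    left_cond = conditions + (f"{node['feature']} <= {node['threshold']}",)
    right_cond = conditions + (f"{node['feature']} > {node['threshold']}",)
    # Each child of this split defines a rule: all conditions on its path.
    rules = [" & ".join(left_cond), " & ".join(right_cond)]
    rules += extract_rules(node["left"], left_cond)
    rules += extract_rules(node["right"], right_cond)
    return rules


rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Rules with more than one condition, such as the ones combining `rm` and `lstat` here, are what let the final linear model capture interactions.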

## **Install package**

In [1]:
# GitHub no longer serves the unencrypted git:// protocol; install over HTTPS instead.
#!pip install git+https://github.com/christophM/rulefit.git

Collecting git+git://github.com/christophM/rulefit.git
  Cloning git://github.com/christophM/rulefit.git to /tmp/pip-req-build-jbg9pgq2
  Running command git clone -q git://github.com/christophM/rulefit.git /tmp/pip-req-build-jbg9pgq2
Building wheels for collected packages: RuleFit
  Building wheel for RuleFit (setup.py) ... done
  Created wheel for RuleFit: filename=RuleFit-0.3-cp37-none-any.whl size=7766 sha256=f5774a1df8e679ac3e8cc287ee20b7ccb31b123197826e1471c1a0ea6c9584dd
  Stored in directory: /tmp/pip-ephem-wheel-cache-zqy42efk/wheels/0b/51/b8/9dc135361d610b383e5029f82ceb5b73eef717e0c1212c8cd1
Successfully built RuleFit
Installing collected packages: RuleFit
Successfully installed RuleFit-0.3


## **LOAD DATA**

Data from: https://www.kaggle.com/puxama/bostoncsv

In [3]:
import numpy as np
import pandas as pd
from rulefit import RuleFit

boston_data = pd.read_csv("data/Boston.csv", index_col=0)

In [4]:
boston_data.head()

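| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |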


In [9]:
y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values

rf = RuleFit()
rf.fit(X, y, feature_names=features)



RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_rules=2000, memory_par=0.01,
        model_type='rl', random_state=None, rfmode='regress',
        sample_fract='default',
        tree_generator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.01,
                                                 loss='ls', max_depth=100,
                                                 max_features=None,
                                                 max_leaf_nodes=2,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_wei

## **Predict**

In [12]:
PRED = rf.predict(X)

## **INTERPRETATION**

RuleFit is a linear model, so the interpretation is the same as for regular linear models. The only difference is that the model has new features derived from decision rules. So, there are two types of terms: binary rules and linear terms.
* **Binary rules** (type rule): A value of 1 means that all conditions of the rule are met; otherwise the value is 0. The rule's weight is added to the prediction only when the rule is met.
* **Linear terms** (type linear): The interpretation is the same as in linear regression models. If the feature increases by one unit, the predicted outcome changes by the corresponding feature weight.
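
The two kinds of terms can be sketched as a plain additive prediction. Everything below is hypothetical and chosen for illustration: the intercept, the weights, the rule, and the three observations are not taken from the fitted model, and the sketch ignores any internal scaling the `rulefit` package may apply to linear terms.

```python
# Hypothetical fitted terms (illustrative numbers, not taken from a real fit).
intercept = 20.0
linear_weights = {"age": -0.04}  # per-unit effect of a raw feature
rule_weights = [
    # (rule as a function of one observation, weight)
    (lambda row: row["rm"] <= 7.0 and row["dis"] > 1.2, -2.0),
]


def predict(row):
    """Intercept + linear terms + weights of the rules that are met."""
    total = intercept
    for name, weight in linear_weights.items():
        total += weight * row[name]
    for rule, weight in rule_weights:
        total += weight * int(rule(row))  # the rule feature is 1 or 0
    return total


inside = {"age": 50.0, "rm": 6.5, "dis": 4.0}   # rule is met
outside = {"age": 50.0, "rm": 7.5, "dis": 4.0}  # rule is not met
older = {"age": 51.0, "rm": 6.5, "dis": 4.0}    # same as `inside`, age + 1

print(predict(inside) - predict(outside))  # effect of the binary rule
print(predict(older) - predict(inside))    # effect of one extra unit of age
```

The first difference equals the rule weight and the second equals the linear weight, which is exactly how the two bullet points above read the coefficients.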

## **EXAMPLE**
There are 215 terms, linear terms and binary rules combined, that were created from the original features. Applying Lasso decreases the number of rules, because it shrinks the weights of unhelpful terms to exactly zero.
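
Why Lasso removes terms can be seen in a simplified setting. Assuming an orthonormal design, the Lasso solution is the least-squares weight of each term pulled toward zero by the penalty and set to zero once it crosses it (soft-thresholding). The sketch below uses hypothetical weights and a hypothetical penalty; it illustrates the mechanism only and is not how the `rulefit` package solves the problem.

```python
def soft_threshold(coef, penalty):
    """Shrink `coef` toward zero by `penalty`; small weights become exactly zero."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


# Hypothetical unpenalized weights for five candidate terms.
raw_weights = [2.5, -0.3, 0.1, -1.2, 0.05]
penalty = 0.5  # hypothetical regularization strength

shrunk = [soft_threshold(w, penalty) for w in raw_weights]
kept = [w for w in shrunk if w != 0.0]

print(shrunk)
print(f"{len(kept)} of {len(raw_weights)} terms keep a nonzero weight")
```

A larger penalty removes more terms, which is the trade-off between sparsity and fit that the regularization strength controls.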

The most important binary rule is rm <= 8.157999992370605 & dis > 1.1716500520706177, with a weight of -1.935740e+00. When both conditions of this rule are met, the predicted medv is lower by 1.935740e+00 than when they are not, with all other feature values held fixed.

## **INSPECT RESULTS**

In [17]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # no truncation (-1 is deprecated); or 199
rules = rf.get_rules()
rules = rules[rules.coef != 0].sort_values("support", ascending=False)
rules

| | rule | type | coef | support | importance |
|---|---|---|---|---|---|
| 1 | zn | linear | 0.0105898 | 1.0 | 0.2390258 |
| 11 | black | linear | -0.00266732 | 1.0 | 0.2396465 |
| 6 | age | linear | -0.03703631 | 1.0 | 1.035915 |
| 1031 | rm <= 8.157999992370605 & dis > 1.1716500520706177 | rule | -1.93574 | 0.974359 | 0.3059668 |
| 1037 | rm <= 8.157999992370605 & dis > 1.1736000180244446 | rule | -0.005828717 | 0.965812 | 0.001059147 |
| 1612 | rm <= 8.752500057220459 & rm <= 8.281499862670898 & crim <= 33.15884971618652 | rule | 0.6711883 | 0.961538 | 0.1290747 |
| 595 | rm <= 7.437000036239624 & dis > 1.274150013923645 | rule | -0.2139402 | 0.935897 | 0.05240157 |
| 593 | rm <= 7.437000036239624 & dis > 1.2736499905586243 | rule | -0.2512372 | 0.92735 | 0.06521124 |
| 1637 | rm <= 8.331999778747559 & dis > 1.1736000180244446 & lstat > 4.0299999713897705 | rule | -0.3852174 | 0.918803 | 0.105217 |
| 989 | crim <= 9.820364952087402 | rule | 0.6949606 | 0.888889 | 0.218405 |
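
For the rows of type rule, the importance column above appears consistent with the rule-importance measure from the original RuleFit paper, |coef| * sqrt(support * (1 - support)). The sketch below checks this against four rows copied from the output (rule strings shortened for display). It is a consistency check on the printed numbers, not a statement about the package's internals, and it does not cover the linear rows.

```python
import math


def rule_importance(coef, support):
    """Rule importance: absolute weight times the rule's standard deviation."""
    return abs(coef) * math.sqrt(support * (1.0 - support))


# (rule, coef, support, importance) copied from the output above.
rows = [
    ("rm <= 8.158 & dis > 1.172", -1.93574, 0.974359, 0.3059668),
    ("rm <= 8.753 & rm <= 8.281 & crim <= 33.159", 0.6711883, 0.961538, 0.1290747),
    ("rm <= 7.437 & dis > 1.274", -0.2139402, 0.935897, 0.05240157),
    ("crim <= 9.820", 0.6949606, 0.888889, 0.218405),
]

for rule, coef, support, reported in rows:
    computed = rule_importance(coef, support)
    print(f"{rule}: computed={computed:.6f} reported={reported:.6f}")
```

Under this measure, a rule that is met by almost every observation needs a large weight to rank as important, because its 0/1 feature barely varies across the data.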
