In [1]:
#import warnings
#warnings.simplefilter('ignore')

## **RULE FIT**

* A linear regression model does not account for interactions between features. RuleFit is a convenient and interpretable method that adds feature interactions to a linear model.
* RuleFit automatically generates these interaction features from decision trees: each path through a tree is turned into a decision rule by combining its split decisions.
* The RuleFit algorithm is implemented in R by Fokkema and Christoffersen (2017), and you can find a Python version on GitHub: https://github.com/christophM/rulefit.
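
To make the idea of "a path through a tree becomes a rule" concrete, here is a minimal sketch using scikit-learn's `export_text`. The toy data, the feature names `x0` and `x1`, and the tree depth are assumptions chosen for illustration; they are not part of the RuleFit package.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (hypothetical): the target is 3 only when x0 > 0.5 and x1 > 0.5 together.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 3.0 * ((X[:, 0] > 0.5) & (X[:, 1] > 0.5))

# A shallow tree; every root-to-leaf path is a conjunction of split decisions.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
text = export_text(tree, feature_names=["x0", "x1"])
print(text)
```

Reading the printed tree from the root down to any leaf and joining the conditions with `&` gives one decision rule, which is the kind of term RuleFit adds to the linear model.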


## **Definition**

The principal idea of RuleFit is to train a series of diverse decision trees, extract every decision rule from the trees into a matrix of binary dummy variables, and run a (penalized) linear regression of the target variable on the original features combined with that dummy matrix.

## **STEPS**

The algorithm is a multi-step process:

1. Generate a tree ensemble using gradient boosting
2. Use the trees to form rules, with each decision path in a tree forming one rule.
3. Prune the rules and the original input features using L1-regularised regression (LASSO)
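
The three steps can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the implementation used by the `rulefit` package: the helpers `tree_rules` and `apply_rule` are hypothetical names, the data is synthetic, and taking one rule per non-root node (rather than per leaf only) is an assumption of this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV


def tree_rules(tree):
    """Hypothetical helper: one rule per non-root node, as a list of conditions."""
    t = tree.tree_
    rules = []

    def walk(node, conds):
        if conds:
            rules.append(list(conds))
        if t.children_left[node] != -1:  # -1 marks a leaf
            f, thr = t.feature[node], t.threshold[node]
            walk(t.children_left[node], conds + [(f, "<=", thr)])
            walk(t.children_right[node], conds + [(f, ">", thr)])

    walk(0, [])
    return rules


def apply_rule(rule, X):
    """Hypothetical helper: 1.0 where every condition of the rule holds, else 0.0."""
    mask = np.ones(X.shape[0], dtype=bool)
    for f, op, thr in rule:
        mask &= (X[:, f] <= thr) if op == "<=" else (X[:, f] > thr)
    return mask.astype(float)


# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

# Step 1: tree ensemble from gradient boosting.
gb = GradientBoostingRegressor(n_estimators=20, max_leaf_nodes=4, random_state=0).fit(X, y)

# Step 2: turn decision paths into binary rule features.
rules = [r for est in gb.estimators_.ravel() for r in tree_rules(est)]
R = np.column_stack([apply_rule(r, X) for r in rules])

# Step 3: Lasso on original features plus rule features; zero coefficients are pruned.
lasso = LassoCV(cv=3, max_iter=10000, random_state=0).fit(np.hstack([X, R]), y)
kept = int(np.sum(lasso.coef_ != 0))
print(f"{kept} of {lasso.coef_.size} terms kept")
```

The L1 penalty is what does the pruning: any linear term or rule whose coefficient is shrunk to exactly zero drops out of the final model.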

## **Install package**

In [2]:
#!pip install git+https://github.com/christophM/rulefit.git

## **LOAD DATA**

Data from: https://www.kaggle.com/puxama/bostoncsv

In [3]:
import numpy as np
import pandas as pd
from rulefit import RuleFit

boston_data = pd.read_csv("data/Boston.csv", index_col=0)

In [4]:
boston_data.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [5]:
y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values

rf = RuleFit()
rf.fit(X, y, feature_names=features)


RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_rules=2000, memory_par=0.01,
        model_type='rl', random_state=None, rfmode='regress',
        sample_fract='default',
        tree_generator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.01,
                                                 loss='ls', max_depth=100,
                                                 max_features=None,
                                                 max_leaf_nodes=4,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_wei

## **Predict**

In [6]:
PRED = rf.predict(X)

## **INTERPRETATION**

RuleFit is a linear model, so the interpretation is the same as for regular linear models. The only difference is that the model has new features derived from decision rules. So there are two types of terms: binary rules and linear terms.
* **Binary rules** (named `rule`): a value of 1 means that all conditions of the rule are met; otherwise the value is 0.
* **Linear terms** (named `linear`): the interpretation is the same as in linear regression models: if the feature increases by one unit, the predicted outcome changes by the corresponding feature weight.

## **EXAMPLE**
RuleFit created 215 terms (linear terms and binary rules) from the original features. Applying the Lasso reduces the number of rules, because many coefficients are shrunk to zero.

The most important binary rule was `rm <= 8.157999992370605 & dis > 1.1716500520706177`, with a weight of -1.935740e+00. When all conditions of this rule are met, the predicted `medv` decreases by 1.935740, with all other feature values held fixed.
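
A small sketch of how such a rule enters the prediction: the rule is evaluated to 0 or 1 for each row and multiplied by its weight. The three rows below are hypothetical values chosen to show both outcomes; only the thresholds and the weight come from the text above.

```python
import pandas as pd

# Hypothetical rows, not taken from the Boston data.
toy = pd.DataFrame({"rm": [6.5, 8.5, 7.0], "dis": [4.0, 3.0, 1.0]})

# Thresholds and weight quoted in the text.
rule_active = (
    (toy["rm"] <= 8.157999992370605) & (toy["dis"] > 1.1716500520706177)
).astype(int)
weight = -1.935740

# Contribution of this single rule to each row's prediction.
contribution = weight * rule_active
print(rule_active.tolist())
print(contribution.tolist())
```

Only the first row satisfies both conditions, so only its prediction is shifted by the rule's weight; the other rows receive no contribution from this term.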

In [7]:
## **INSPECT RESULTS**
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # show full rule text; -1 is deprecated since pandas 1.0
rules = rf.get_rules()
rules = rules[rules.coef != 0].sort_values("support", ascending=False)

In [8]:
print(type(rules))

<class 'pandas.core.frame.DataFrame'>


In [9]:
rules.head()

Unnamed: 0,rule,type,coef,support,importance
1,zn,linear,0.010859,1.0,0.245109
11,black,linear,-0.007337,1.0,0.659214
6,age,linear,-0.032589,1.0,0.91151
1312,lstat <= 28.684999465942383,rule,0.301599,0.948718,0.066524
321,rm <= 8.752500057220459 & tax <= 688.5 & ptratio > 14.549999713897705,rule,0.262999,0.944444,0.060243
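
The table above is sorted by `support`, while the discussion of the "most important" rule refers to the `importance` column. As a sketch, the same rows can be ranked by importance instead. The DataFrame below is rebuilt by hand from the five rows shown, so it covers only that excerpt and not the full set of rules.

```python
import pandas as pd

# Rows copied from the output above (excerpt only).
rules_head = pd.DataFrame(
    {
        "rule": [
            "zn",
            "black",
            "age",
            "lstat <= 28.684999465942383",
            "rm <= 8.752500057220459 & tax <= 688.5 & ptratio > 14.549999713897705",
        ],
        "type": ["linear", "linear", "linear", "rule", "rule"],
        "coef": [0.010859, -0.007337, -0.032589, 0.301599, 0.262999],
        "support": [1.0, 1.0, 1.0, 0.948718, 0.944444],
        "importance": [0.245109, 0.659214, 0.911510, 0.066524, 0.060243],
    }
)

# Rank terms by importance rather than by support.
by_importance = rules_head.sort_values("importance", ascending=False)
print(by_importance[["rule", "type", "coef", "importance"]])
```

On the full `rules` DataFrame the equivalent call would be `rules.sort_values("importance", ascending=False)`.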


## **Cabify experiment**

In [10]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import os

APP_NAME = 'pyspark_python'
MASTER = 'local[*]'

sc = SparkContext(MASTER, APP_NAME)  # use the constants defined above
sqlc = SQLContext(sc)

path = os.getcwd()
file = os.path.join(path, 'data', 'intervals_challenge.json')

## **Data description**

* **type**: can be going_to_pickup, waiting_for_rider or driving_to_destination
* **trip_id**: uniquely identifies the trip
* **duration**: how long the interval lasts, **in seconds**
* **distance**: how far the vehicle moved in this interval, **in meters**
* **city_id**: one of bravos, pentos or volantis
* **started_at**: when the interval started, UTC Time
* **vehicle_id**: uniquely identifies the vehicle
* **rider_id:** uniquely identifies the rider

In [11]:
# Load the data with Spark because it is easy and fast.
df = sqlc.read.json(file)
# Register the JSON dataset as a temporary view
df.createOrReplaceTempView("json_view")
# Select the columns needed for the analysis
dataset = sqlc.sql("select duration, distance, started_at, trip_id, vehicle_id, city_id, type from json_view")
# Show the first rows
dataset.show()

+--------+--------+----------------+--------------------+--------------------+--------+--------------------+
|duration|distance|      started_at|             trip_id|          vehicle_id| city_id|                type|
+--------+--------+----------------+--------------------+--------------------+--------+--------------------+
|     857|    5384|1.475499600287E9|c00cee6963e0dc66e...|52d38cf1a3240d5cb...|  pentos|driving_to_destin...|
|     245|    1248|1.475499600853E9|427425e1f4318ca24...|8336b28f24c3e7a1e...|volantis|     going_to_pickup|
|    1249|    5847| 1.47549960167E9|757867f6d7c00ef92...|8885c59374cc53916...|  pentos|driving_to_destin...|
|     471|    2585|1.475499601841E9|d09d1301d361f7359...|81b63920454f70b67...|  bravos|     going_to_pickup|
|     182|     743| 1.47549960197E9|00f20a701f0ec2519...|b73030977cbad61c9...|  pentos|     going_to_pickup|
|     599|    1351|1.475499602154E9|158e7bc8d42e1d8c9...|126e868fb282852c2...|volantis|     going_to_pickup|
|     529|    3297|

In [12]:
# Convert to pandas once and reuse the result
dataset = dataset.toPandas()
print(dataset.shape)
dataset.head()

(165170, 7)


Unnamed: 0,duration,distance,started_at,trip_id,vehicle_id,city_id,type
0,857,5384,1475500000.0,c00cee6963e0dc66e50e271239426914,52d38cf1a3240d5cbdcf730f2d9a47d6,pentos,driving_to_destination
1,245,1248,1475500000.0,427425e1f4318ca2461168bdd6e4fcbd,8336b28f24c3e7a1e3d582073b164895,volantis,going_to_pickup
2,1249,5847,1475500000.0,757867f6d7c00ef92a65bfaa3895943f,8885c59374cc539163e83f01ed59fd16,pentos,driving_to_destination
3,471,2585,1475500000.0,d09d1301d361f7359d0d936557d10f89,81b63920454f70b6755a494e3b28b3a7,bravos,going_to_pickup
4,182,743,1475500000.0,00f20a701f0ec2519353ef3ffaf75068,b73030977cbad61c9db55418909864fa,pentos,going_to_pickup


## **CHANGE TYPES**

In [13]:
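# Convert the numeric columns to floats; values that cannot be parsed become NaN.
# Note: downcast='float' yields float32, which cannot hold a Unix timestamp to the second.
col = ['duration', 'distance', 'started_at']
dataset[col] = dataset[col].apply(pd.to_numeric, downcast='float', errors='coerce')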

In [14]:
dataset.dtypes

duration      float32
distance      float32
started_at    float32
trip_id       object 
vehicle_id    object 
city_id       object 
type          object 
dtype: object
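
One consequence of the `float32` dtype is worth checking: a 32-bit float has a 24-bit significand, so values around 1.4755e9 can only be stored in steps of 128. The sketch below uses the first `started_at` value from the Spark output above; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

raw = 1475499600.287               # first started_at value shown by Spark
as_f32 = float(np.float32(raw))    # what survives the float32 downcast

print(as_f32)
print(pd.to_datetime(raw, unit="s"))
print(pd.to_datetime(as_f32, unit="s"))
```

The rounded value lands on 13:00:48, which is why several consecutive rows end up with the same timestamp after the conversion. For the hour-level features derived later this rounding is small, but keeping `started_at` as `float64` (no `downcast`) avoids it if exact timestamps matter.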

In [15]:
# drop rows with missing values
dataset = dataset.dropna()

In [16]:
#pd.Timestamp(dataset.started_at, unit='s')
dataset['started_at_date'] = pd.to_datetime(dataset['started_at'], unit='s')
dataset['Month'] = dataset['started_at_date'].dt.month
dataset['year'] = dataset['started_at_date'].dt.year
dataset['day'] = dataset['started_at_date'].dt.day
dataset['hour'] = dataset['started_at_date'].dt.hour
dataset['time_of_day'] = np.where(dataset.hour < 24 , 'night', 'evening')
dataset['time_of_day'] = np.where(dataset.hour < 20 , 'evening', dataset['time_of_day'])
dataset['time_of_day'] = np.where(dataset.hour < 16 , 'afternoon', dataset['time_of_day'])
dataset['time_of_day'] = np.where(dataset.hour < 12 , 'morning', dataset['time_of_day'] )
dataset['time_of_day'] = np.where(dataset.hour < 6 , 'early morning', dataset['time_of_day'] )
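
The chained `np.where` calls above amount to binning the hour into five half-open intervals. As an alternative sketch, assuming the same boundaries (6, 12, 16, 20) and the same labels, `pd.cut` expresses this in one step. The `hours` Series here is a stand-in for `dataset['hour']`.

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")  # stand-in for dataset['hour']

labels = ["early morning", "morning", "afternoon", "evening", "night"]
# right=False gives intervals [0, 6), [6, 12), [12, 16), [16, 20), [20, 24)
time_of_day = pd.cut(
    hours, bins=[0, 6, 12, 16, 20, 24], labels=labels, right=False
).astype(str)

print(time_of_day.value_counts().sort_index())
```

Casting to `str` keeps the column as plain strings, matching what the `np.where` version produces.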

## **EXPERIMENT DESIGN**

The experiment design is very simple. For a period of 5 days, all trips in 3 cities (Bravos, Pentos and Volantis) were randomly assigned using either linear or road distance:

* Trips whose trip_id starts with a character in 0-8 were assigned using road distance
* Trips whose trip_id starts with a character in 9-f were assigned using linear distance

In [17]:
# label each trip with its experiment group, based on the first character of trip_id
dataset['trip_id_exp'] = dataset['trip_id'].str[:1]
values = ['0', '1', '2', '3', '4', '5', '6', '7', '8']
dataset['trip_id_exp'] = np.where(dataset.trip_id_exp.isin(values), 'road', 'linear')

In [18]:
dataset.head()

Unnamed: 0,duration,distance,started_at,trip_id,vehicle_id,city_id,type,started_at_date,Month,year,day,hour,time_of_day,trip_id_exp
0,857.0,5384.0,1475500000.0,c00cee6963e0dc66e50e271239426914,52d38cf1a3240d5cbdcf730f2d9a47d6,pentos,driving_to_destination,2016-10-03 13:00:48,10,2016,3,13,afternoon,linear
1,245.0,1248.0,1475500000.0,427425e1f4318ca2461168bdd6e4fcbd,8336b28f24c3e7a1e3d582073b164895,volantis,going_to_pickup,2016-10-03 13:00:48,10,2016,3,13,afternoon,road
2,1249.0,5847.0,1475500000.0,757867f6d7c00ef92a65bfaa3895943f,8885c59374cc539163e83f01ed59fd16,pentos,driving_to_destination,2016-10-03 13:00:48,10,2016,3,13,afternoon,road
3,471.0,2585.0,1475500000.0,d09d1301d361f7359d0d936557d10f89,81b63920454f70b6755a494e3b28b3a7,bravos,going_to_pickup,2016-10-03 13:00:48,10,2016,3,13,afternoon,linear
4,182.0,743.0,1475500000.0,00f20a701f0ec2519353ef3ffaf75068,b73030977cbad61c9db55418909864fa,pentos,going_to_pickup,2016-10-03 13:00:48,10,2016,3,13,afternoon,road


In [19]:
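# keep only the pickup intervals
dataset = dataset[dataset['type'] == 'going_to_pickup']
# keep the target, the distance and the categorical features
dataset = dataset[['duration', 'distance', 'city_id',  'time_of_day', 'trip_id_exp']]
# one-hot encode the categorical features
dataset = pd.get_dummies(dataset, columns=['city_id',  'time_of_day', 'trip_id_exp'])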

In [20]:
dataset.head()

Unnamed: 0,duration,distance,city_id_bravos,city_id_pentos,city_id_volantis,time_of_day_afternoon,time_of_day_early morning,time_of_day_evening,time_of_day_morning,time_of_day_night,trip_id_exp_linear,trip_id_exp_road
1,245.0,1248.0,0,0,1,1,0,0,0,0,0,1
3,471.0,2585.0,1,0,0,1,0,0,0,0,1,0
4,182.0,743.0,0,1,0,1,0,0,0,0,0,1
5,599.0,1351.0,0,0,1,1,0,0,0,0,0,1
9,1525.0,2674.0,1,0,0,1,0,0,0,0,1,0


In [21]:
#dataset = dataset.iloc[0:2000, :]
print(dataset.shape)

(58211, 12)


In [22]:
dataset = dataset[['duration',  'trip_id_exp_linear', 'distance'
                   #, 'city_id_bravos', 'city_id_pentos', 'city_id_volantis'
                  ]]

In [23]:
y = dataset.duration.values
X = dataset.drop(["duration"], axis=1)
features = X.columns
X = X.values

In [24]:
rf = RuleFit()
rf.fit(X, y, feature_names=features)

RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_rules=2000, memory_par=0.01,
        model_type='rl', random_state=None, rfmode='regress',
        sample_fract='default',
        tree_generator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.01,
                                                 loss='ls', max_depth=100,
                                                 max_features=None,
                                                 max_leaf_nodes=4,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_wei

In [25]:
## **INSPECT RESULTS**
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # show full rule text; -1 is deprecated since pandas 1.0
rules = rf.get_rules()
rules = rules[rules.coef != 0].sort_values("support", ascending=False)

In [26]:
rules

Unnamed: 0,rule,type,coef,support,importance
0,trip_id_exp_linear,linear,-0.879591,1.0,0.435871
