Here we try to predict the tip from the total bill and the party size.

We turn the regression problem into a classification problem by binning the target.

In [185]:
%matplotlib inline
from pydataset import data
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings("ignore")

Load the data:

In [186]:
tips = data('tips')
tips = tips[['total_bill', 'tip', 'size']]
tips.head()

,total_bill,tip,size
1,16.99,1.01,2
2,10.34,1.66,3
3,21.01,3.5,3
4,23.68,3.31,2
5,24.59,3.61,4


Our goal is to predict tip amount.

Let's bin the target variable so we can solve this as a classification problem:

In [187]:
# Bin the target into three quantile-based classes
tips['tip_bin'] = pd.qcut(tips.tip, 3, labels=['bad tip', 'okay tip', 'good tip'])
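
To see what `pd.qcut` does, here is a minimal sketch on a small made-up series (the values in `toy_tips` are hypothetical, not taken from the tips data). Passing `retbins=True` also returns the cut points, which shows where one class ends and the next begins:

```python
import pandas as pd

# Hypothetical tip amounts, standing in for the real `tips.tip` column.
toy_tips = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0])

# qcut splits at quantiles, so each bin gets roughly the same number of rows.
# retbins=True additionally returns the bin edges that were computed.
toy_bins, edges = pd.qcut(
    toy_tips, 3, labels=['bad tip', 'okay tip', 'good tip'], retbins=True
)

print(toy_bins.value_counts(sort=False))
print(edges)
```

Because the edges come from quantiles of the data, "good tip" here means "in the top third of observed tips", not a fixed dollar amount.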

In [179]:
# Check the mean tip and the number of rows in each bin
tips.groupby('tip_bin').tip.agg(['mean', 'count'])

tip_bin,mean,count
bad tip,1.73012,83
okay tip,2.771125,80
good tip,4.522099,81


Split our data into X and y

In [190]:
X = tips[['total_bill', 'size']]
y = tips['tip_bin']

In [191]:
X.head()

,total_bill,size
1,16.99,2
2,10.34,3
3,21.01,3
4,23.68,2
5,24.59,4


In [192]:
y.head()

1     bad tip
2     bad tip
3    good tip
4    good tip
5    good tip
Name: tip_bin, dtype: category
Categories (3, object): [bad tip < okay tip < good tip]

Train-test split

In [193]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
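
By default `train_test_split` holds out 25% of the rows for testing. The sketch below uses synthetic stand-in data with the same number of rows as the tips data (244); the names `X_demo` and `y_demo` are made up for illustration. It also passes `stratify`, an optional argument the notebook does not use, which keeps the class proportions similar in both parts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the notebook's X and y.
rng = np.random.default_rng(0)
n_rows = 244
X_demo = pd.DataFrame({
    'total_bill': rng.uniform(3, 50, n_rows),
    'size': rng.integers(1, 7, n_rows),
})
y_demo = pd.Series(rng.choice(['bad tip', 'okay tip', 'good tip'], n_rows))

# Default test_size is 0.25; stratify=y_demo is an optional extra.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=123, stratify=y_demo
)

print(len(X_tr), len(X_te))
```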

Fit the model

In [200]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
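
A fitted tree can be printed as a set of if/else rules with `sklearn.tree.export_text` (available in scikit-learn 0.21 and later). This is a sketch on synthetic data, where the labels are derived directly from the bill with arbitrary thresholds, so the rules are easy to read; it is not the tree fitted above:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: the label depends only on the bill.
rng = np.random.default_rng(0)
bill = rng.uniform(3, 50, 200)
X_demo = pd.DataFrame({
    'total_bill': bill,
    'size': rng.integers(1, 7, 200),
})
# Arbitrary illustrative thresholds, not the bins used in the notebook.
y_demo = pd.cut(
    pd.Series(bill), [0, 15, 30, 50],
    labels=['bad tip', 'okay tip', 'good tip'],
)

demo_model = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_model.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth.
rules = export_text(demo_model, feature_names=list(X_demo.columns))
print(rules)
```

Running the same `export_text` call on the notebook's `model` with `feature_names=list(X_train.columns)` would show which thresholds on `total_bill` and `size` it learned.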

Evaluate performance:

In [201]:
model.score(X_train, y_train)


0.6338797814207651

In [202]:
model.score(X_test, y_test)


0.639344262295082
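
An accuracy figure is easier to judge next to a baseline. One option, sketched here on synthetic data, is scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The data-generating rule (tip of roughly 15% of the bill plus noise) is an assumption made only for this example:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the 15% rule is an illustrative assumption.
rng = np.random.default_rng(0)
n_rows = 244
bill = rng.uniform(3, 50, n_rows)
X_demo = pd.DataFrame({
    'total_bill': bill,
    'size': rng.integers(1, 7, n_rows),
})
tip = pd.Series(0.15 * bill + rng.normal(0, 0.5, n_rows))
y_demo = pd.qcut(tip, 3, labels=['bad tip', 'okay tip', 'good tip'])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# Baseline: ignore the features and always predict the most common class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print(f'baseline: {baseline_acc:.2f}, tree: {tree_acc:.2f}')
```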

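Hooray!? The test accuracy (about 64%) is essentially the same as the training accuracy (about 63%), so the shallow tree does not appear to be overfitting. With three roughly equal-sized bins, always guessing one class would be right only about a third of the time.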

