<html>
<head>
<link rel="stylesheet" type="text/css" href="nv.d3.css">
<link rel="stylesheet" type="text/css" href="my.css">
<link rel="stylesheet" type="text/css" href="d3.slider.css">
<link rel="stylesheet" type="text/css" href="ProfitHeatMap.css">
<title>Project: Target Marketing for a Bank</title>
<!--script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script-->
<script src="http://d3js.org/d3.v3.js" charset="utf-8"></script>
<!--script src="nv.d3.min.js" charset="utf-8"></script)-->
<script src="nv.d3.js" charset="utf-8"></script>
<script src="d3.slider.js" charset="utf-8"></script>
</head>
<body>
<p>During weeks 4, 5 and 6, the Metis Data Science Bootcamp zoomed in on the following technologies:</p>
<ul>
<li>SQL and MySQL in the <a href="https://www.digitalocean.com" target="_blank">cloud</a> (0.5 day);</li>
<li>Supervised learning with <a href="http://scikit-learn.org/" target="_blank">scikit-learn</a>
and <a href="http://statsmodels.sourceforge.net/" target="_blank">statsmodels</a> (1.5 weeks);</li>
<li>Interactive visualization with <a href="http://matplotlib.org/" target="_blank">matplotlib</a>, <a href="http://nbviewer.ipython.org/github/esss/ipython/blob/master/examples/Interactive%20Widgets/Index.ipynb" target="_blank">interactive widgets in IPython</a>, <a href="http://mpld3.github.io/" target="_blank">mpld3</a> and especially <a href="http://d3js.org/" target="_blank">D3.js</a> (1.5 weeks).</li>
</ul>
<p>The remainder of this article reports on the project and business context in which these technologies were applied.</p>
<h2>Business questions</h2>
<p>A bank wants to run a targeted marketing campaign in order to <strong>sell a term deposit product</strong>.</p>
<p>Given a certain (but unspecified) <strong>campaign budget</strong>:</p>
<ul>
<li><strong>Which customers</strong> should it target?</li>
<li>In <strong>what order</strong>?</li>
<li>Is it wise to spend the <strong>complete</strong> campaign budget?</li>
</ul>
<p>The bank has a <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing" target="_blank">database of 41K past customer
contact records</a>, with an indication of successful and failed sales of this product.</p>
<h2>Technical assignment</h2>
<p>Using the historic customer database, build a model to:</p>
<ul>
<li><strong>predict</strong> which customers are (more) likely to buy the term deposit product</li>
<li><strong>rank</strong> customers by their propensity to buy the product</li>
</ul>
<p>To evaluate model performance, produce for each machine learning algorithm:</p>
<ul>
<li>an <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">ROC</a> curve</li>
<li>a <a href="http://en.wikipedia.org/wiki/Lift_(data_mining)">lift</a> curve</li>
<li>a profit visualization</li>
</ul>
<h2>Data preparation, feature selection and model building</h2>
<p>For these steps, please check the <a href="http://nbviewer.ipython.org/github/fdurant/mcnulty_banking/blob/master/project_mcnulty_banking.ipynb" target="_blank">IPython notebook</a>, also available in our <a href="https://github.com/fdurant/mcnulty_banking" target="_">GitHub repository</a>.</p>
<h2>Model evaluation</h2>
<p>How well do our models perform? And much more importantly: how do they contribute to answering the business questions?</p>
<p>Let's start by looking at the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">ROC curve</a>.</p>
<h3>ROC curve: between a rock and a hard place</h3>
<p>Hint: move the mouse over the graph to highlight individual data points.</p>
<svg id="roccurve" style="height:400px; width:100%; border-style:solid; border-width:1px; border-color:black;"></svg>
<script src="ROCCurve.js" type="text/javascript" charset="utf-8"></script>
<p align="right"><small>[<a href="ROCCurve.html" target="_roc">view in separate window</a>]</small></p>
<p>Inside the <a href="http://en.wikipedia.org/wiki/False_positives_and_false_negatives" target="_blank">False Positive Rate</a> intervals [0-0.07] and [0.54-1.00] (horizontal axis), the
<a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank">logistic regression</a> model
scores best in terms of <a href="http://en.wikipedia.org/wiki/False_positives_and_false_negatives#true_positive" target="_blank">True Positive Rate</a> (vertical axis).
In between these two intervals, <a href="http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" target="_blank">KNN</a> (with k = 30) is the winning model.
But all in all, the three algorithms play in the same league.</p>
<p>The selection of a specific FPR/TPR combination (and of a corresponding best algorithm) is for our customer to make.
In business terms, this means making a trade-off between:</p>
<ul>
<li>Reaching enough interested customers, which will bring in revenue</li>
<li>In the process, annoying too many uninterested ones, and incurring the cost of making these contacts</li>
</ul>
<p>On its own, the ROC curve does not tell us how to set the optimal threshold. Therefore, let's try to formulate the trade-off
more <strong>in operational terms</strong>,
by means of a <a href="http://en.wikipedia.org/wiki/Lift_(data_mining)" target="_blank">lift curve</a>.</p>
<h2 align="center">Lift curve: trained model vs. baseline performance</h2>
<p>From our <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing" target="_blank">reference dataset</a>, we know that only a small fraction (&lt;10%) of our potential customers will accept an
offer to buy the term deposit product. The question, of course, is <strong>who</strong> these customers are. For each customer
in the test set, our trained models return an <strong>estimated probability</strong> that the customer will purchase the product.
This probability allows us to <strong>rank our customers</strong> accordingly, which is key to making smart(er) decisions.</p>
<p>But first, for the sake of the argument, let's assume we only know the distribution of yes/no answers in the training data set,
and nothing more. The (near-perfect) diagonal line from (0,0) to (100, 100) in the graph below
represents such a baseline case. The model behind it was built by <a href="http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html" target="_">randomly guessing</a>
who might accept or reject the offer, according to this yes/no distribution.</p>
<p>So, in the baseline case, if we were to contact 20% of the customers (x axis), we could expect to
hit approximately 20% of all customers <em>who would actually buy the product</em> (y axis). The same holds
approximately for any other percentage point along this (near-perfect) diagonal.</p>
<svg id="liftcurve" style="height:400px; width:100%; border-style:solid; border-width:1px;"></svg>
<p align="right"><small>[<a href="LiftCurve.html" target="_lift">view in separate window</a>]</small></p>
<script src="LiftCurve.js" type="text/javascript" charset="utf-8"></script>
<p>Luckily, we <strong>do</strong> have more information, so we <strong>can</strong> do better! For example, hover
over data point (20.0, 63.2) on the upper KNN line. This data point means that by merely selecting the top-ranked
<strong>one fifth</strong> of our customers, we can expect to hit almost <strong>two thirds</strong> (63.2%) of all customers
who would actually buy the product!</p>
<p>Each table that appears when hovering over the data points also mentions <strong>cumulative lift</strong>.
This measure is simply defined as the ratio of the y over the x value in that point.
For the example data point at (20.0, 63.2), lift is equal to the trained model yield (63.2 percent)
divided by the (theoretical) baseline yield at the same x value, so lift is 63.2 / 20.0 = 3.16. In the steep left parts
of the trained model curves, cumulative lift rises sharply. On the KNN curve, it reaches its maximum value (4.13) at x=13.0.
</p>
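<p>As a minimal sketch of this definition: the helper below computes one point of a lift curve from a list of
outcomes sorted by descending predicted probability. The function name <code>cumulativeLift</code> and the toy data are
assumptions made for illustration; they are not taken from LiftCurve.js or from the bank dataset.</p>

```javascript
// Hypothetical helper, not the code behind the chart above.
// outcomes: 1 = customer bought, 0 = did not buy, best-ranked customer first.
// fractionContacted must be large enough to select at least one customer.
function cumulativeLift(outcomes, fractionContacted) {
  const totalBuyers = outcomes.reduce((sum, y) => sum + y, 0);
  const nContacted = Math.round(outcomes.length * fractionContacted);
  const buyersReached = outcomes
    .slice(0, nContacted)
    .reduce((sum, y) => sum + y, 0);
  const x = 100 * nContacted / outcomes.length;  // % of customers contacted
  const y = 100 * buyersReached / totalBuyers;   // % of all buyers reached
  return { x: x, y: y, lift: y / x };
}

// Toy ranking of 10 customers, 4 of whom buy.
const toyLiftRanking = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0];
console.log(cumulativeLift(toyLiftRanking, 0.2));  // { x: 20, y: 50, lift: 2.5 }
```

<p>Applied to the data point discussed above, the same ratio gives 63.2 / 20.0 = 3.16.</p>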
<p>So based on the lift graph alone, the bank could decide to <strong>contact the top-ranked 13% of its customers, and then stop</strong>.</p>
<p>Alternatively, it could keep contacting customers until the <strong>local lift</strong> (i.e. the lift in
the percentile immediately preceding the point) drops below 1, i.e. until the model performs worse than the (theoretical) baseline.
On the KNN curve, local lift fluctuates around 1 in the x interval [19.0-32.0], so this is not such a clear-cut decision.</p>
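<p>The stopping rule based on local lift can be sketched as follows. This is only an illustration under assumptions:
the helper <code>localLift</code>, the choice of equally sized bins (rather than exact percentiles) and the toy data are
hypothetical and do not come from the project code.</p>

```javascript
// Hypothetical helper: lift within each consecutive bin of the ranked list.
// sortedOutcomes: 1 = bought, 0 = did not buy, best-ranked customer first.
function localLift(sortedOutcomes, nBins) {
  const overallRate =
    sortedOutcomes.reduce((sum, y) => sum + y, 0) / sortedOutcomes.length;
  const binSize = sortedOutcomes.length / nBins;
  return Array.from({ length: nBins }, (unused, i) => {
    const bin = sortedOutcomes.slice(
      Math.round(i * binSize),
      Math.round((i + 1) * binSize)
    );
    const binRate = bin.reduce((sum, y) => sum + y, 0) / bin.length;
    return binRate / overallRate;  // above 1: this bin beats random contact
  });
}

// Toy ranking of 10 customers split into 5 bins of 2.
const toyLocalRanking = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0];
const localLifts = localLift(toyLocalRanking, 5);
// Index of the first bin whose local lift falls below 1 (-1 if none does).
const firstWeakBin = localLifts.findIndex((lift) => 1 > lift);
console.log(localLifts, firstWeakBin);
```

<p>On real data the local lift is noisy, as the KNN curve shows, so a rule like this would probably need some smoothing
or a minimum bin size before it can be relied on.</p>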
<p>Now, despite all these niceties, there's still one ingredient missing: <strong>money</strong>!</p>
<h2>Profit curve: knowing when to stop the campaign</h2>
<p>Each customer contact costs money, but also carries a potential reward in the form of future revenue.
Let's bring in two new variables:</p>
<ul>
<li>the <strong>average cost per contact</strong>. This cost is incurred irrespective of whether the
contact leads to a sale or not.</li>
<li>the <strong>average revenue per <em>successful</em> contact</strong>. An unsuccessful contact brings
in zero revenue, by definition.</li>
</ul>
<p>Even though we have to limit ourselves to averages, these variables do allow us to improve on our rather abstract lift curve.
The default <strong>profit curve</strong> below assumes that each contact (successful or not) costs $10 on average,
while the average <em>successful</em> one carries $50 in revenue.
</p>
<p>In this default configuration, the KNN profit curve peaks at x=14, bringing in $6840 of cumulative profit.
Compare this to the cumulative loss of $2410 that we would incur if we just contacted customers in random order.
As a matter of fact, with the current cost/revenue configuration the baseline case stays in the red throughout.</p>
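<p>The bookkeeping behind such a curve can be sketched as below, assuming every contact costs the average cost and every
sale earns the average revenue. The helper <code>cumulativeProfit</code> and the toy outcomes are hypothetical; this is not
the code in calcProfit.js.</p>

```javascript
// Hypothetical helper: running profit after each contact in ranking order.
// rankedOutcomes: 1 = bought, 0 = did not buy, best-ranked customer first.
function cumulativeProfit(rankedOutcomes, costPerContact, revenuePerSale) {
  let profit = 0;
  return rankedOutcomes.map((y) => {
    profit += y * revenuePerSale - costPerContact;  // cost is always incurred
    return profit;
  });
}

// Toy ranking of 10 customers with the default $10 cost and $50 revenue.
const toyProfitOutcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0];
const toyProfitCurve = cumulativeProfit(toyProfitOutcomes, 10, 50);
const maxProfit = Math.max(...toyProfitCurve);
const stopAfter = toyProfitCurve.indexOf(maxProfit) + 1;
console.log(maxProfit, stopAfter);  // 130 7
```

<p>If the maximum of the curve is negative, as in the baseline case above, the sketch still reports a stopping point, but the
sensible reading is that no contact is worth making under that configuration.</p>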
<div id="containerForSlidersAndProfitCurve" style="border-style:solid; border-width:1px; border-color:black; padding:25px 5px 5px 0px">
<div id="containerFor2Sliders" style="height:100; padding-left:10px;">
<div style="clear:both;">
<div style="float:left;"> avg. cost per customer contact (USD):</div>
<span style="margin-left: 10px; margin-top: -20px; width:30%; float:left; margin-bottom:0px;" id="avgCostPerContactSlider"></span>
<div style="margin-left: 10px; margin-top: 0px; float:left;"><strong id="avgCostPerContactSliderText">10</strong></div>
</div>
<div style="clear:both;">
<div style="float:left;">avg. revenue per successful sale (USD):</div>
<span style="margin-left: 2px; margin-top:-20px; width:60%; float:left;" id="avgRevenuePerContactSlider"></span>
<div style="margin-left: 10px; margin-top:0px; float:left"><strong id="avgRevenuePerContactSliderText">50</strong></div>
</div>
</div><!-- containerFor2Sliders -->
<svg id="profitcurve" style="clear:both; height:400px; width:100%;"></svg>
<script src="calcProfit.js" type="text/javascript" charset="utf-8"></script>
<script src="ProfitCurve.js" type="text/javascript" charset="utf-8"></script>
</div><!-- containerForSlidersAndProfitCurve -->
<p align="right"><small>[<a href="ProfitCurve.html" target="_profit">view in separate window</a>]</small></p>
<p>Now increase the average revenue per successful contact to $90, using the second slider
on top of the graph above, and watch closely. With the profit per successful contact raised to
<span style="white-space:nowrap">$90-$10=$80</span>, the profit curve
takes on a different shape. Cumulative profit now peaks at $17,530, but to get there we have to contact 20%
of all customers. Also, if we were to contact <em>all</em> (100%) customers, we would more or less break even, with
a cumulative loss of only $330.</p>
<p>Finally, increase both sliders by $10, so their respective values become $20 and $100. While this has no effect
on the profit per successful contact <span style="white-space:nowrap">($100-$20 = $80 = $90-$10)</span>, the loss per
<em>failed</em> contact has <em>doubled</em> from $10 to $20. As a result, the trained model profit curves now cross
the break-even line again (around x=40%), while the baseline curve is permanently in the red again.</p>
<p>The bottom line is that any specific configuration of cost and revenue greatly influences the shape of the curves,
the profitability intervals, and therefore the decision boundaries.</p>
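<p>One way to see why the three configurations behave so differently is to look at the success rate a contact needs
in order to pay for itself on average, which is simply cost divided by revenue. The helpers below are a hypothetical
illustration of that arithmetic and are not part of the project code.</p>

```javascript
// Hypothetical helper: minimum success rate at which an average contact breaks even.
function breakEvenRate(costPerContact, revenuePerSale) {
  return costPerContact / revenuePerSale;
}

// Hypothetical helper: expected profit of one contact with success probability p.
function expectedProfitPerContact(p, costPerContact, revenuePerSale) {
  return p * revenuePerSale - costPerContact;
}

// The three cost/revenue configurations discussed above.
const sliderConfigs = [[10, 50], [10, 90], [20, 100]];
sliderConfigs.forEach(([cost, revenue]) => {
  console.log(cost, revenue, breakEvenRate(cost, revenue).toFixed(3));
});
// 10 50 0.200
// 10 90 0.111
// 20 100 0.200
```

<p>So although the second and third configurations earn the same $80 per successful contact, the third one needs
roughly twice the success rate to break even, which is why a randomly ordered campaign falls back into the red.</p>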
<h2>Profit heat map: impressionistic view on individual and cumulative profit contributions</h2>
<p>How much does each <em>individual</em> customer contribute to cumulative profit? In the heatmap below, each cell
represents one customer from the test set. Customers are displayed in the ranking order defined by the model, with
the most probable buyer first. Reading order is left to right, top to bottom, the same as for written English.</p>
<p>In the baseline case, with default cost and revenue values, the heatmap progressively takes on deeper shades of red.
Since the baseline lists customers in random order, there are more negative than positive contributions,
no matter which region of the heatmap we look at. This is aptly demonstrated if you select
profit: "contribution per individual customer".</p>
<div id="containerForControlsAndProfitHeatMap" style="width:840px; border-style:solid; border-width:1px; border-color:black; padding:5px 5px 5px 10px">
<div id="DropDownBoxes" style="height:20px;">
<span style="float:left; padding-left:10px; vertical-align:middle; padding-top:5px">Model</span>
<select id="model" style="float:left; margin-left:20px;">
<option value="baseline" selected="selected">Baseline</option>
<option value="knn">KNN</option>
<option value="logres">Logistic Regression</option>
<option value="gaussianNB">Gaussian Naive Bayes</option>
</select>
<span style="float:left; margin-left:20px; padding-left:10px; vertical-align:middle; padding-top:5px">Profit</span>
<select id="profittype" style="float:left; margin-left:10px;">
<option value="cumulative" selected="selected">Cumulative: includes contributions by higher-ranked customers</option>
<option value="single">Contribution per individual customer</option>
</select>
</div>
<div id="containerFor2Sliders4Heatmap" style="clear:both; height:110; padding-left:10px; padding-top:30px;">
<div style="clear:both;">
<div style="float:left;"> avg. cost per customer contact (USD):</div>
<span style="margin-left: 10px; margin-top: -20px; width:30%; float:left; margin-bottom:0px;" id="avgCostPerContactSliderForHeatmap"></span>
<div style="margin-left: 10px; float:left;"><strong id="avgCostPerContactSliderForHeatmapText">10</strong></div>
</div>
<div style="clear:both; margin-top:30px;">
<div style="float:left;">avg. revenue per successful sale (USD):</div>
<span style="margin-left: 2px; margin-top:-20px; width:60%; float:left;" id="avgRevenuePerContactSliderForHeatmap"></span>
<div style="margin-left: 10px; float:left"><strong id="avgRevenuePerContactSliderForHeatmapText">50</strong></div>
</div>
</div><!-- containerFor2Sliders4Heatmap -->
<div id="profitheatmap" style="clear:both; width:830px;"></div>
<script src="ProfitHeatMap.js"></script>
<div id="heatmaptooltip" class="hidden">
<p><span id="value"></span></p>
</div>
</div><!-- containerForControlsAndProfitHeatMap -->
<p align="right" style="width:840px;"><small>[<a href="ProfitHeatMap.html" target="_profitheatmap">view in separate window</a>]</small></p>
<p>Now select a non-baseline model, and see how the random distribution of individual positive profit contributions
transforms into a more ordered view: the positive profit contributions are indeed pushed towards the beginning
(top-left corner) of the heatmap. This is the power of the learned models at work.</p>
<p>Finally, also select different configurations of the cost and revenue parameters, and see how this influences the heatmap.
For example, try the baseline model with cost 10 and revenue 90, as we did in the profit curve before. This configuration
confirms that the cumulative profit remains near the break-even line.</p>
<h2>Recommendations</h2>
<p>Customers must be targeted in descending order of their probability of buying the product. Running the trained models on
unseen customer data will provide such a ranking.</p>
<p>To enable an informed decision about when to halt the campaign, the bank must first provide information on cost and revenue
per (successful) contact.</p>
<ul>
<li>Without such information, we advise stopping when maximum cumulative lift is achieved,
or when the budget runs out, whichever comes first.</li>
<li>With such information, we advise stopping when maximum profit is achieved,
or when the budget runs out, whichever comes first.</li>
</ul>
<p>Having set the cost and revenue variables, define the optimization criterion (e.g. maximum cumulative profit).
Then consult the profit curve and heatmap to identify which and how many customers to contact, most probable
buyers first.</p>
<h2>Additional resources</h2>
<ul>
<li>Foster Provost &amp; Tom Fawcett: <a href="http://shop.oreilly.com/product/0636920028918.do" target="_blank">Data Science for Business. What you need to know about data mining and data-analytic thinking</a>. O'Reilly, 2013.<br/>Chapter 8: "Visualizing Model Performance" introduces the ROC, Lift and Profit curves.</li>
<li><a href="http://nvd3.org" target="_blank">NVD3. Re-usable charts for D3.js</a></li>
<li><a href="https://github.com/sujeetsr/d3.slider" target="_blank">D3.slider</a></li>
<li><a href="http://bl.ocks.org/ianyfchang/8119685" target="_blank">Heatmap example</a></li>
</ul>
</body>
</html>