# Book Club: Logit

**“You will lose money sending a terrific piece of mail to a lousy list, but make money sending a lousy piece of mail to a terrific list.” (On database marketing)**

MindBook was established in 1986 and sells specialty books and selected other merchandise through direct marketing. New members are acquired by advertising in specialty magazines, newspapers and TV. After joining, members receive regular mailings offering new titles and, occasionally, related merchandise. Right from the start, MindBook made a strategic decision to build and maintain a detailed database of its club members containing all the relevant information about them.

Initially, MindBook mailed each offer to all its members. However, as MindBook has grown, the cost of mailing offers to the full customer list has grown as well. In an effort to improve profitability and the return on his marketing dollars, Stan Lawton, MindBook's marketing director, was eager to assess the effectiveness of database marketing techniques. Stan proposes to conduct live market tests, involving a random sample of customers from the database, for new book titles in order to analyze customers' responses and calibrate a response model for each new book offering. The response model's results will then be used to "score" the remaining customers (i.e., those not selected for the test) and to select which customers to mail the offer to.

MindBook's customer database provides a complete record of purchasing history for each customer. This includes how long they have been a customer, the specific titles ordered and summary totals by category, such as cooking or children’s books. MindBook also keeps a record of the number of months since the last purchase, the total number of purchases made and the total dollars spent by each customer.

Stan had a random sample of customers drawn from the MindBook customer database. By selecting a random sample, Stan could be confident that all types of customers would be represented: both recent and not-so-recent purchasers, frequent and infrequent purchasers, and customers spanning a range of total dollars spent.
This random sample of customers was mailed an offer to purchase The Art History of Florence and their response – either purchase or no purchase – was recorded. Stan’s objective is to use the results of the test mailing to identify which groups of customers are more likely to respond. Then, for the ‘rollout’ mailing, he will only target customers who fit the profile of those more likely to respond. By carefully targeting which customers to mail the offer to, Stan hopes to reach the majority of the responders while significantly reducing costs by not mailing to those with a low likelihood of responding. A secondary benefit is that customers with little interest in a title such as The Art History of Florence will not get the mailing – which, had they received it, may leave them wondering why they are getting such unappealing offers.  


### Use “Book club.csv”
- Buy: 1 for a purchase of The Art History of Florence and 0 otherwise
- Male: 0 = Female and 1 = Male
- Amount purchased (buyamt): Total money spent on MindBook books
- Frequency (freq): Total number of purchases in the chosen period
- Last purchase (lastbuy): Months since last purchase
- First purchase (firstbuy): Months since first purchase
- Child: Number of children’s books purchased
- Youth: Number of youth books purchased
- Cook: Number of cookbooks purchased
- DIY: Number of do-it-yourself books purchased
- Art: Number of art books purchased

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.metrics import accuracy_score

### 1. Report what percentage of male and female customers purchased the book, The Art History of Florence.

In [9]:
df = pd.read_csv('Book club.csv')
df.head()

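| | id | buy | male | buyamt | freq | lastbuy | firstbuy | child | youth | cook | diy | art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 113 | 8 | 1 | 8 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | 418 | 6 | 11 | 66 | 0 | 2 | 3 | 2 | 3 |
| 2 | 3 | 1 | 1 | 336 | 18 | 6 | 32 | 2 | 0 | 1 | 1 | 2 |
| 3 | 4 | 1 | 1 | 180 | 16 | 5 | 42 | 2 | 0 | 0 | 1 | 1 |
| 4 | 5 | 1 | 0 | 320 | 2 | 3 | 18 | 0 | 0 | 0 | 1 | 2 |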


In [10]:
df.groupby('male').buy.mean()

male
0    0.220994
1    0.123054
Name: buy, dtype: float64
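
The group means above are purchase proportions, so multiplying by 100 gives the percentages asked for: roughly 22.1% of women and 12.3% of men bought the book. A minimal sketch on a made-up toy frame (the rows are illustrative and not taken from `Book club.csv`) shows the same computation with readable labels:

```python
import pandas as pd

# Toy data for illustration only; not taken from "Book club.csv".
toy = pd.DataFrame({
    "male": [0, 0, 0, 0, 1, 1, 1, 1],
    "buy":  [1, 0, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones; scale it to percent.
pct = toy.groupby("male")["buy"].mean().mul(100).round(1)

# Replace the 0/1 codes with labels for reporting.
pct.index = pct.index.map({0: "female", 1: "male"})
print(pct)
```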

### 2. Divide the data into 70% training and 30% test set (use random_state=10) and run the following logit model on training data.

Buy = β0 + β1 Male + β2 Buyamt + β3 Freq + β4 Lastbuy + β5 Firstbuy + β6 Child + β7 Youth + β8 Cook + β9 DIY + β10 Art + e
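
The equation above is shorthand. In a logit model the linear combination of predictors describes the log-odds of a purchase rather than the 0/1 outcome itself, and there is no additive error term. Written out with $p = P(\text{Buy}=1)$:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{Male} + \beta_2\,\text{Buyamt} + \beta_3\,\text{Freq} + \beta_4\,\text{Lastbuy} + \beta_5\,\text{Firstbuy} + \beta_6\,\text{Child} + \beta_7\,\text{Youth} + \beta_8\,\text{Cook} + \beta_9\,\text{DIY} + \beta_{10}\,\text{Art}
$$

Equivalently, writing $\eta$ for the right-hand side, the purchase probability is

$$
p = \frac{1}{1 + e^{-\eta}}
$$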

In [18]:
y=df.buy
x=df[['male', 'buyamt' ,'freq', 'lastbuy', 'firstbuy', 'child', 'youth', 'cook', 'diy', 'art']]
x1=df.iloc[:, 2:]  # alternative: [all rows, 3rd-end columns]

xtrain, xtest, ytrain, ytest=train_test_split(x, y, random_state=10, train_size=0.7)
    # default: 75%, 25% split
    # random_state: A different result will be returned if not specified

print(len(xtrain))
print (len(ytrain))# sample size

len(xtest)

2730
2730


1170
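
To see what `random_state` does, the sketch below repeats the split on synthetic data with the same number of rows as the case data (2730 + 1170 = 3900). The arrays are random stand-ins, not the book club data. It also shows the optional `stratify` argument, which the notebook does not use, as one way to keep the share of buyers nearly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 3900 rows; values are random, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(3900, 3))
y = rng.integers(0, 2, size=3900)

# Same random_state -> identical split on every run.
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, train_size=0.7, random_state=10
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, train_size=0.7, random_state=10
)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

# Optional: stratify=y keeps the proportion of ones (nearly) equal in both parts.
s_train, s_test, ys_train, ys_test = train_test_split(
    X, y, train_size=0.7, random_state=10, stratify=y
)

print(len(a_train), len(a_test), same_split)
print(round(ys_train.mean(), 3), round(ys_test.mean(), 3))
```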

- There are two ways of running the logit model: (1) "Logit" from statsmodels.api package, (2) "LogisticRegression" from sklearn.linear_model package

- Logistic regression with the "training" sample using statsmodels.api (i.e., sm.Logit)

In [12]:
import statsmodels.api as sm

m1 = sm.Logit(ytrain, sm.add_constant(xtrain)).fit(maxiter=300)
 # maxiter is an argument of fit(); passing it to sm.Logit(...) instead generates a warning
 # m1 = sm.Logit(ytrain, xtrain).fit(maxiter=300)   # alternative: model without a constant
 # add_constant: add an intercept column to the predictors
m1.summary()

Optimization terminated successfully.
         Current function value: 0.339163
         Iterations 7




| | | | |
|---|---|---|---|
| Dep. Variable: | buy | No. Observations: | 2730 |
| Model: | Logit | Df Residuals: | 2719 |
| Method: | MLE | Df Model: | 10 |
| Date: | Sun, 17 Dec 2023 | Pseudo R-squ.: | 0.2065 |
| Time: | 19:32:04 | Log-Likelihood: | -925.91 |
| converged: | True | LL-Null: | -1166.9 |
| Covariance Type: | nonrobust | LLR p-value: | 3.059e-97 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.7737 | 0.187 | -4.133 | 0.000 | -1.141 | -0.407 |
| male | -0.8910 | 0.122 | -7.295 | 0.000 | -1.130 | -0.652 |
| buyamt | 0.0018 | 0.001 | 2.595 | 0.009 | 0.000 | 0.003 |
| freq | -0.0925 | 0.015 | -6.228 | 0.000 | -0.122 | -0.063 |
| lastbuy | 0.5232 | 0.085 | 6.175 | 0.000 | 0.357 | 0.689 |
| firstbuy | -0.0020 | 0.011 | -0.176 | 0.861 | -0.024 | 0.020 |
| child | -0.7651 | 0.105 | -7.264 | 0.000 | -0.972 | -0.559 |
| youth | -0.4871 | 0.126 | -3.862 | 0.000 | -0.734 | -0.240 |
| cook | -0.8853 | 0.107 | -8.240 | 0.000 | -1.096 | -0.675 |


- Logistic regression with the "training" sample using sklearn.linear_model (i.e., LogisticRegression)

In [13]:
m = LogisticRegression().fit(xtrain, ytrain) 
 # Increase max_iter if it fails to converge
 # Default: solver='lbfgs',max_iter=100

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
m = LogisticRegression(max_iter=300).fit(xtrain, ytrain)

In [15]:
m.n_iter_  # Number of actual iterations

array([192])

In [16]:
m.intercept_ 

array([-0.77558773])

In [17]:
m.coef_

array([[-0.87526891,  0.0018063 , -0.09269807,  0.50560136, -0.001637  ,
        -0.74408867, -0.4675809 , -0.86309565, -0.95754447,  0.65776539]])

- Logistic regression with the "training" sample using statsmodels.api (i.e., sm.Logit)
  - Advantage: the output is displayed as a readable summary table.
  - Disadvantage: the constant term has to be added as an extra step (sm.add_constant).

- Logistic regression with the "training" sample using sklearn.linear_model (i.e., LogisticRegression)
  - Advantage: a constant term is included by default, so there is no need to add it.
  - Disadvantage: extra work is required to display the output nicely.

- Note:
  - Use sklearn's 'LogisticRegression' for predicting the binary responses and their probabilities, since it includes the constant by default.
  - The results from sklearn's 'LogisticRegression' and statsmodels' 'Logit' differ slightly. The main reason is that 'LogisticRegression' applies an L2 penalty by default (C=1.0), whereas 'Logit' computes the unpenalized maximum likelihood estimate; the two also use different optimizers. In this example the signs of the estimated coefficients agree and their magnitudes are close. sklearn does not report standard errors or p-values, so interpret the results based on the output from statsmodels' 'Logit', which is nicely displayed.
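
One concrete source of the difference is that sklearn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`), which shrinks coefficients toward zero. The sketch below uses synthetic data (the "true" coefficients are arbitrary assumptions) to compare the default fit with a fit where `C` is set very large, so that the penalty becomes negligible and the result approximates plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration; the coefficients below are arbitrary assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_beta = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(0.3 + X @ true_beta)))
y = rng.binomial(1, p)

# Default fit: L2 penalty with C=1.0 shrinks coefficients toward zero.
default_fit = LogisticRegression(max_iter=1000).fit(X, y)

# Very large C makes the penalty negligible (close to unpenalized maximum likelihood).
weak_penalty_fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)

print("default C=1.0:", default_fit.coef_.round(4))
print("C=1e10       :", weak_penalty_fit.coef_.round(4))
```

The two coefficient vectors should have the same signs, with the default fit slightly smaller in magnitude.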

### 3. Interpret statistically significant coefficients. (Use 5% significance level)

All variables are significant except firstbuy (p-value is greater than 5%).

- Men are less likely to buy than women.
- The greater the amount previously purchased, the greater the likelihood of responding to the current mailing.
- The more frequently the customer buys, the lower the likelihood the customer would respond to the current mailing, i.e., The Art History of Florence has the potential to attract customers who have fewer interactions with the firm.
- The longer the time since last purchase, the greater the likelihood the customer would respond to the current mailing. The book seems to appeal to less loyal customers and may help retain them. This is unusual: in most direct marketing programs, the recency variable has a negative coefficient.
- The greater the number of child, youth, cook, and DIY books bought, the less the likelihood of responding to the current mailing.
- The greater the number of art books bought, the greater the likelihood of responding to the current mailing.
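
The signs above give the direction of each effect. To get a sense of size, a logit coefficient can be exponentiated into an odds ratio: the factor by which the odds of buying change for a one-unit increase in the variable, holding the others fixed. A minimal sketch using the significant coefficients shown in the sm.Logit summary table (rows not shown there are left out):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the sm.Logit summary above (only the rows shown there).
coef = pd.Series({
    "male": -0.8910, "buyamt": 0.0018, "freq": -0.0925, "lastbuy": 0.5232,
    "child": -0.7651, "youth": -0.4871, "cook": -0.8853,
})

# exp(coefficient) = multiplicative change in the odds for a one-unit increase.
odds_ratio = np.exp(coef)
pct_change_in_odds = (odds_ratio - 1) * 100

print(pd.DataFrame({
    "odds_ratio": odds_ratio.round(3),
    "pct_change_in_odds": pct_change_in_odds.round(1),
}))
```

For example, being male multiplies the odds of buying by about 0.41, and each additional month since the last purchase multiplies them by about 1.69.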

### 4. What is the percentage of correct in-sample prediction? 

In [11]:
pred1 = m.predict(xtrain)
print('In-sample accuracy', accuracy_score(ytrain, pred1))

In-sample accuracy 0.8641025641025641


### 5. What is the percentage of correct out-of-sample prediction? 

In [12]:
pred2 = m.predict(xtest)
print('Out-of-sample accuracy', accuracy_score(ytest, pred2))

Out-of-sample accuracy 0.8666666666666667
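
Accuracy is easier to judge against a naive baseline. Only 187 of the 1170 test customers bought (the counts reported under question 7), so a rule that always predicts "no purchase" is already right most of the time. The sketch below computes that baseline from the notebook's own counts; the comparison itself is an addition, not part of the original exercise.

```python
# Counts reported in this notebook for the test set: 1170 customers, 187 buyers.
n_test = 1170
n_buyers = 187

# A rule that always predicts 0 is correct for every non-buyer.
baseline_accuracy = (n_test - n_buyers) / n_test
model_accuracy = 0.8666666666666667  # out-of-sample accuracy reported above

print(f"baseline: {baseline_accuracy:.4f}, model: {model_accuracy:.4f}")
print(f"gain over baseline: {model_accuracy - baseline_accuracy:.4f}")
```

The gain in raw accuracy is small, which is why the later questions rank customers by predicted probability instead of relying on the 0/1 predictions.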


### 6. Predict response probabilities on the test data. Save the probabilities as a dataframe.

In [13]:
prob=pd.DataFrame(m.predict_proba(xtest))
prob.head()

| | 0 | 1 |
|---|---|---|
| 0 | 0.963705 | 0.036295 |
| 1 | 0.952067 | 0.047933 |
| 2 | 0.944481 | 0.055519 |
| 3 | 0.559974 | 0.440026 |
| 4 | 0.961441 | 0.038559 |


In [14]:
prob = prob.rename(columns={0:'pred_zero', 1:'pred_one'})
prob.head()

| | pred_zero | pred_one |
|---|---|---|
| 0 | 0.963705 | 0.036295 |
| 1 | 0.952067 | 0.047933 |
| 2 | 0.944481 | 0.055519 |
| 3 | 0.559974 | 0.440026 |
| 4 | 0.961441 | 0.038559 |
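
Renaming columns 0 and 1 to `pred_zero` and `pred_one` relies on the column order of `predict_proba`, which follows the fitted model's `classes_` attribute. A minimal sketch on synthetic data (not the book club data) that checks the order explicitly and shows how the probabilities relate to `predict`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Column j of predict_proba corresponds to clf.classes_[j].
print(clf.classes_)
col_of_one = list(clf.classes_).index(1)
p_one = proba[:, col_of_one]

# For two classes, predict() is equivalent to a 0.5 cutoff on P(y=1).
agrees = np.array_equal(clf.predict(X), (p_one > 0.5).astype(int))
print(agrees)
```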


### 7. (Dumb scenario: Random mailing) How many customers can you expect to respond to the current mailing from a random 10% mailing to the test data?

In [15]:
print(len(ytest)) 
sum(ytest)  # sum of ones in ytest (=buy)

1170


187

In [16]:
# % of buy

print(187/1170) 
sum(ytest)/len(ytest)  # alternative

0.15982905982905982


0.15982905982905982

In [17]:
# We expect the same % of people to buy in a random 10% mailing (117 of 1170 customers)

print(117*(187/1170))
0.1*len(ytest)*sum(ytest)/len(ytest)  # alternative

18.7


18.7

### 8. (Smart scenario: Data science approach) You want to target top 10% of customers in the test data in terms of predicted probabilities by the logit model. What would be the response rate from the top 10% people?

In [18]:
# Need to match index to combine dataframes: ytest, xtest, prob 

prob # index: 0~1169

| | pred_zero | pred_one |
|---|---|---|
| 0 | 0.963705 | 0.036295 |
| 1 | 0.952067 | 0.047933 |
| 2 | 0.944481 | 0.055519 |
| 3 | 0.559974 | 0.440026 |
| 4 | 0.961441 | 0.038559 |
| ... | ... | ... |
| 1165 | 0.228135 | 0.771865 |
| 1166 | 0.889768 | 0.110232 |
| 1167 | 0.859120 | 0.140880 |
| 1168 | 0.938547 | 0.061453 |


In [19]:
xtest  # index is not from 0 due to random sampling

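| | male | buyamt | freq | lastbuy | firstbuy | child | youth | cook | diy | art |
|---|---|---|---|---|---|---|---|---|---|---|
| 193 | 1 | 80 | 14 | 1 | 14 | 0 | 0 | 0 | 1 | 0 |
| 936 | 0 | 269 | 16 | 3 | 30 | 0 | 0 | 2 | 1 | 0 |
| 2625 | 0 | 80 | 16 | 2 | 18 | 1 | 0 | 0 | 1 | 0 |
| 234 | 0 | 174 | 10 | 1 | 10 | 0 | 0 | 0 | 0 | 1 |
| 3823 | 1 | 303 | 16 | 2 | 20 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1792 | 0 | 290 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 1 |
| 3074 | 1 | 50 | 14 | 2 | 22 | 0 | 0 | 1 | 0 | 1 |
| 1051 | 1 | 268 | 2 | 1 | 2 | 0 | 0 | 0 | 1 | 0 |
| 306 | 1 | 274 | 22 | 1 | 22 | 0 | 0 | 0 | 0 | 0 |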


In [20]:
# reset index to 0~1169

ytest.reset_index(drop=True, inplace=True) 
xtest.reset_index(drop=True, inplace=True) 
 # inplace=True: modify the existing dataframe instead of returning a new one
 # drop=True: Do not insert the old index into the dataframe columns

xtest  # index starts from 0

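| | male | buyamt | freq | lastbuy | firstbuy | child | youth | cook | diy | art |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 80 | 14 | 1 | 14 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 269 | 16 | 3 | 30 | 0 | 0 | 2 | 1 | 0 |
| 2 | 0 | 80 | 16 | 2 | 18 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 174 | 10 | 1 | 10 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 303 | 16 | 2 | 20 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1165 | 0 | 290 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 1 |
| 1166 | 1 | 50 | 14 | 2 | 22 | 0 | 0 | 1 | 0 | 1 |
| 1167 | 1 | 268 | 2 | 1 | 2 | 0 | 0 | 0 | 1 | 0 |
| 1168 | 1 | 274 | 22 | 1 | 22 | 0 | 0 | 0 | 0 | 0 |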


In [21]:
df1=pd.concat([ytest, xtest, prob], axis=1) 
df1

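| | buy | male | buyamt | freq | lastbuy | firstbuy | child | youth | cook | diy | art | pred_zero | pred_one |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 80 | 14 | 1 | 14 | 0 | 0 | 0 | 1 | 0 | 0.963705 | 0.036295 |
| 1 | 0 | 0 | 269 | 16 | 3 | 30 | 0 | 0 | 2 | 1 | 0 | 0.952067 | 0.047933 |
| 2 | 0 | 0 | 80 | 16 | 2 | 18 | 1 | 0 | 0 | 1 | 0 | 0.944481 | 0.055519 |
| 3 | 1 | 0 | 174 | 10 | 1 | 10 | 0 | 0 | 0 | 0 | 1 | 0.559974 | 0.440026 |
| 4 | 0 | 1 | 303 | 16 | 2 | 20 | 1 | 0 | 1 | 0 | 0 | 0.961441 | 0.038559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1165 | 1 | 0 | 290 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 1 | 0.228135 | 0.771865 |
| 1166 | 0 | 1 | 50 | 14 | 2 | 22 | 0 | 0 | 1 | 0 | 1 | 0.889768 | 0.110232 |
| 1167 | 0 | 1 | 268 | 2 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0.859120 | 0.140880 |
| 1168 | 1 | 1 | 274 | 22 | 1 | 22 | 0 | 0 | 0 | 0 | 0 | 0.938547 | 0.061453 |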
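
The index has to match because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The toy sketch below (made-up values, hypothetical variable names) shows what happens when the labels differ, and one possible alternative to `reset_index`: giving the probability frame the test set's index.

```python
import pandas as pd

# Toy stand-ins: after a random split, the test rows keep their original labels.
ytest_toy = pd.Series([1, 0, 1], index=[193, 936, 2625], name="buy")
prob_toy = pd.DataFrame({"pred_one": [0.04, 0.05, 0.44]})  # default index 0, 1, 2

# concat(axis=1) aligns on index labels, so mismatched labels give NaN-filled rows.
misaligned = pd.concat([ytest_toy, prob_toy], axis=1)

# Alternative to reset_index: reuse the test set's index for the probabilities.
prob_aligned = prob_toy.copy()
prob_aligned.index = ytest_toy.index
aligned = pd.concat([ytest_toy, prob_aligned], axis=1)

print(misaligned)
print(aligned)
```

Either approach works as long as the row order of the probabilities matches the row order of the test set, which is the case for `predict_proba(xtest)`.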


In [22]:
# Sort by descending order of prob

df1=df1.sort_values(by='pred_one', ascending=False)
df1

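| | buy | male | buyamt | freq | lastbuy | firstbuy | child | youth | cook | diy | art | pred_zero | pred_one |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 589 | 1 | 0 | 202 | 10 | 6 | 40 | 0 | 0 | 0 | 0 | 3 | 0.026476 | 0.973524 |
| 1108 | 1 | 0 | 414 | 2 | 8 | 46 | 3 | 1 | 0 | 0 | 3 | 0.046111 | 0.953889 |
| 776 | 1 | 1 | 474 | 12 | 12 | 46 | 2 | 1 | 1 | 1 | 4 | 0.050238 | 0.949762 |
| 304 | 1 | 0 | 148 | 6 | 9 | 48 | 0 | 1 | 1 | 1 | 2 | 0.080430 | 0.919570 |
| 853 | 1 | 1 | 428 | 2 | 9 | 22 | 2 | 1 | 1 | 0 | 2 | 0.124782 | 0.875218 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 601 | 0 | 1 | 192 | 34 | 6 | 52 | 0 | 0 | 3 | 2 | 1 | 0.995293 | 0.004707 |
| 554 | 0 | 1 | 248 | 36 | 2 | 40 | 0 | 0 | 1 | 1 | 0 | 0.995572 | 0.004428 |
| 568 | 0 | 1 | 246 | 36 | 2 | 40 | 0 | 0 | 1 | 1 | 0 | 0.995588 | 0.004412 |
| 1007 | 0 | 1 | 350 | 30 | 9 | 72 | 3 | 2 | 3 | 1 | 0 | 0.997721 | 0.002279 |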


In [23]:
# The number of buy=1 in top 10%

sum(df1[0:117].buy)  # df1[0:117]=0~116

72

In [24]:
72/117  # hit rate

0.6153846153846154
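
The hit rate is most informative next to the overall response rate. The sketch below wraps the "sort and take the top fraction" step in a small helper (`top_fraction_hits` is a hypothetical name, not part of the notebook), runs it on made-up toy data, and then computes the lift and the share of buyers reached from the notebook's own counts.

```python
import pandas as pd

def top_fraction_hits(scored, prob_col, target_col, frac):
    """Return (n_mailed, n_buyers) among the top `frac` of rows by predicted probability."""
    n_mailed = int(round(frac * len(scored)))
    top = scored.nlargest(n_mailed, prob_col)
    return n_mailed, int(top[target_col].sum())

# Toy example with made-up numbers, just to show the helper running.
toy = pd.DataFrame({
    "buy": [1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "pred_one": [0.9, 0.1, 0.8, 0.3, 0.2, 0.05, 0.4, 0.7, 0.15, 0.25],
})
print(top_fraction_hits(toy, "pred_one", "buy", 0.3))

# Figures from this notebook: 72 buyers in the top 117, 187 buyers among 1170 overall.
hit_rate = 72 / 117
base_rate = 187 / 1170
lift = hit_rate / base_rate      # how many times better than a random mailing
share_captured = 72 / 187        # share of all test-set buyers reached
print(round(lift, 2), round(share_captured, 3))
```

With these counts, targeting the top 10% responds at about 3.85 times the random rate and reaches roughly 38.5% of all buyers in the test set.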

### 9. The direct marketing costs \$1 per customer. The company earns \$10 (revenue) if a customer purchases the book. Compare the profits from the campaign under the dumb scenario and smart scenario. Assume that there is no additional cost.

In [25]:
print('Profit from random sampling', 18*10-117*1)  # profit=revenue-cost; 18.7 expected buyers rounded down to 18
print('Profit from data science', 72*10-117*1) 

Profit from random sampling 63
Profit from data science 603
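
The same arithmetic can be wrapped in a small function so other mailing sizes or prices are easy to try. `campaign_profit` is a hypothetical helper, and the break-even rule at the end is an extension that the original exercise does not ask for: under the stated \$10 revenue and \$1 cost, mailing a customer pays off in expectation when the predicted purchase probability exceeds cost / revenue.

```python
def campaign_profit(n_mailed, n_buyers, revenue_per_sale=10.0, cost_per_mail=1.0):
    """Profit = revenue from buyers minus total mailing cost."""
    return n_buyers * revenue_per_sale - n_mailed * cost_per_mail

# Scenarios from this notebook: 117 mailings each.
random_profit = campaign_profit(117, 18)  # 18.7 expected buyers, rounded down to 18 as above
smart_profit = campaign_profit(117, 72)

# Expected profit from one mailing is p * revenue - cost, which is positive when p > cost / revenue.
break_even_prob = 1.0 / 10.0

print(random_profit, smart_profit, break_even_prob)
```

One possible alternative to a fixed top-10% cutoff would therefore be to mail every customer whose `pred_one` is above 0.1, though that assumes the predicted probabilities are reasonably well calibrated.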


## Why Online Retailers Like Bonobos, Boden, Athleta Mail So Many Catalogs (WSJ 2014)

When everything is available for sale on your smartphone, why do catalogs still clutter your mailbox? The old-school marketing format has survived to play a crucial creative role in modern e-commerce. Today, the catalog is bait for customers, like a store window display, and a source of inspiration, the way roaming through store aisles can be. The hope is shoppers will mark pages they like and then head online, or into a store, to buy.

Today's catalogs are no longer phone-book-size compilations of every item a retailer sells. Instead, they have fewer pages and merchandise descriptions, and more and bigger photos and lifestyle images. For retailers, creating the inspiration comes with hefty costs, including expensive photo shoots and rising postage rates. And with catalogs produced many months in advance, they lock retailers into specific trends and merchandise, unlike digital marketing pieces that can be updated in minutes.

"It's still a very, very important part of our marketing mix," says Pat Connolly, chief marketing officer at Williams-Sonoma Inc., parent company to seven brands with catalogs including Pottery Barn and West Elm. Consumers "look through it to get ideas and inspiration. And if we do a good job, they get ideas for things they didn't even know they wanted before they got there."
Williams-Sonoma maintains a database of 2,000 privately owned houses that serve as locations for catalog photo shoots. More than half the company's marketing budget goes to catalog production and mailing.

Marketers mailed 11.9 billion catalogs in 2013, according to the Direct Marketing Association, marking the first uptick in years. Total catalog circulation is still far below the 2007 peak of 19.6 billion. The 2008 recession forced catalog companies to cut dead wood out of their mailing lists and get smarter about how and when they mail.

Bonobos mailed a test catalog just over a year ago to a small number of current and potential customers. Results prompted the brand to try several more, gradually increasing circulation each time. Now, some 20% of the website's first-time customers are placing their order after having received a catalog, says Craig Elbert, vice president of marketing for Bonobos. They spend 1.5 times as much as new shoppers who didn't receive a catalog first.

Bonobos has studied catalog responses to understand sales patterns, such as what was driving strong sales of casual shirts. Its first catalog, in March 2013, featured a model wearing a blue-and-green checked shirt with white jeans. Many men ordered both. As a result, the brand now routinely emphasizes full-outfit shots.

Many retailers can pinpoint exactly when their catalogs land in mailboxes because of a spike in activity in stores and online. "We see an immediate sales lift," says John Koryl, president of stores and online at Neiman Marcus. The catalog's halo effect reaches beyond the contents of the book to the brand's broader offerings.

Shoppers "may not buy what's on the cover of the catalog. They may not even buy in the category that the catalog covered," Mr. Koryl says. "But it is this inspirational moment to remind them" to shop.

The average catalog costs much less than a dollar to produce, including printing, mailing, the purchase of new addresses and fees for an outside mailing house or project management, says Polly Wong, managing partner for strategic e-commerce and creative services at Belardi/Ostroy, a retail marketing consulting firm. Response rates and order sizes run the gamut, but typically each catalog mailed results in about $4 in sales, she says.

Boden, the U.K.-based clothing retailer, ships millions of catalogs around the world each year. Shoppers spend up to 15 to 20 minutes with the catalog, says Shanie Cunningham, head of U.S. marketing, compared with an average of just eight seconds for a Boden email and about five minutes with the Boden iPad app.

More catalogs are tailored for individuals, meaning the one you get could look very unlike the one your next-door neighbor gets. "We definitely are targeting and personalizing," says Ms. Cunningham. Boden will change the theme, the size of the book and even the discount it offers to the same address. A recent catalog offered one spouse 15% off and the other just 11% off.

L.L.Bean is playing with the page count of catalogs it sends to regular website shoppers, says Steve Fuller, chief marketing officer at the outdoor and apparel retailer. Many of its catalogs come in different versions. So instead of sending every customer the largest book, Mr. Fuller looks for frequent website visitors and asks, "Can I only send her 50 pages, or 20, as a reminder of, 'Oh, I've got to go to the website'?"