# Logistic Regression

## Theoretical Introduction

Logistic regression is a type of regression used when the response variable *y* is not continuous in nature, but dichotomous: it can only take on two values.

More generally, while a classic linear regression model assumes that the response variable follows a normal distribution, a generalized linear model allows it to follow other distributions; in the case of logistic regression, it is the Bernoulli distribution. A random variable with a Bernoulli distribution can take on only two values, which we can numerically encode as 0 and 1, and its only parameter is *p*, the probability of success, that is, the probability that the random variable takes on the value 1.

The coefficients of the explanatory variables estimated in a logistic regression indicate how the values of the regressors affect the value taken by *y*. Their interpretation differs somewhat from classic regression, though, for the following reason: a linear combination of the regressors and the coefficients can take on any real value, so it cannot directly describe the probability that *y* equals 1, which must lie between 0 and 1. To solve this problem, the linear combination is passed through the *logistic function*, which maps any real value into the interval between 0 and 1; the result is interpreted as the probability of success *p*.
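As a minimal sketch of this mapping, the logistic function can be written in R as follows. The helper name `logistic` and the sample values of the linear predictor `eta` are illustrative assumptions, not part of the original analysis; base R provides the same function as `plogis`.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical values of the linear predictor, chosen only for illustration
eta <- c(-5, -1, 0, 1, 5)
p <- logistic(eta)
print(round(p, 3))

# Base R's plogis() computes the same quantity
print(all.equal(p, plogis(eta)))
```

Large negative inputs give values near 0, large positive inputs give values near 1, and an input of 0 gives exactly 0.5.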

With these assumptions, for each unit increase in the value of a regressor, holding the others constant, the odds of success, $p/(1-p)$, are multiplied by $e^{\beta}$, where $\beta$ is the corresponding coefficient. Thus, a positive coefficient means that an increase in the regressor raises the chance of success, pushing the predicted probability closer to 1; conversely, a negative coefficient means that the increase lowers the chance of success, pushing the predicted probability closer to 0.
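The multiplicative effect on the odds can be checked numerically. The sketch below uses made-up coefficients (`beta0`, `beta1`) and a hypothetical regressor value `x`; none of them come from the dataset.

```r
# Hypothetical intercept and slope, chosen only for illustration
beta0 <- -1.0
beta1 <- 0.5

# Probability of success and odds at a given value of the regressor
prob <- function(x) plogis(beta0 + beta1 * x)
odds <- function(x) prob(x) / (1 - prob(x))

# Compare the odds before and after a one-unit increase in x
x <- 2
odds_ratio <- odds(x + 1) / odds(x)

# The ratio of the odds equals exp(beta1), whatever the starting x
print(c(odds_ratio = odds_ratio, exp_beta1 = exp(beta1)))
```

Note that the odds change by a constant factor, while the change in the probability itself depends on the starting value of `x`.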

## Business Application - DVD purchase predictions

In a business setting, logistic regression can be very useful for analyzing how a set of variables affects a certain outcome, such as the purchase of a product. Let's see a basic example using a dataset on DVD sales.

In [1]:
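# Load the dataset and drop the training column, which is not used in this analysis
data <- read.csv("dvd.csv")
data$training <- NULL
# Inspect the structure of the remaining columns
str(data)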

'data.frame':	20000 obs. of  4 variables:
 $ buy   : chr  "yes" "no" "no" "no" ...
 $ coupon: int  5 5 4 3 1 5 2 4 3 5 ...
 $ purch : int  2 2 11 5 1 10 1 6 9 2 ...
 $ last  : int  5 33 11 25 15 27 11 25 3 27 ...


In [2]:
head(data)

| | buy | coupon | purch | last |
|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` |
| 1 | yes | 5 | 2 | 5 |
| 2 | no | 5 | 2 | 33 |
| 3 | no | 4 | 11 | 11 |
| 4 | no | 3 | 5 | 25 |
| 5 | no | 1 | 1 | 15 |
| 6 | no | 5 | 10 | 27 |


The data contain information on a sample of 20,000 customers who received an instant coupon with a value between \$1 and \$5, chosen at random.

Other explanatory variables include *purch*, which is the number of purchases made by the customer in the past year, and *last*, which is the number of days since the customer's last purchase.

The response variable is *buy*, which takes on the value *yes* if the customer decided to buy the DVD and *no* otherwise.

Let's estimate a logistic regression model to investigate the effects of the explanatory variables in determining the purchase of the DVD.

In [3]:
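# Recode the response as 0/1, with 1 marking a purchase (the "success")
data$buy <- ifelse(data$buy == "yes", 1, 0)
# Fit a logistic regression: binomial family with its default logit link
logistic_model <- glm(buy ~ coupon + purch + last, data = data, family = "binomial")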

In [4]:
summary(logistic_model)


Call:
glm(formula = buy ~ coupon + purch + last, family = "binomial", 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0010  -0.7065  -0.4173   0.7215   3.0337  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.037797   0.063109  -48.14   <2e-16 ***
coupon       0.774109   0.015108   51.24   <2e-16 ***
purch        0.091110   0.005096   17.88   <2e-16 ***
last        -0.069112   0.001953  -35.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23018  on 19999  degrees of freedom
Residual deviance: 18221  on 19996  degrees of freedom
AIC: 18229

Number of Fisher Scoring iterations: 5


Here is our model. All of the coefficients of the explanatory variables are statistically significant (each p-value is below 2e-16). Their signs also make sense: holding the other variables constant, a higher coupon value increases the probability of purchase; the more purchases a customer has made in the past year, meaning they are a regular customer, the more likely they are to buy the DVD; and finally, the more days have passed since the last purchase, suggesting that the customer has somewhat lost interest in our products, the less likely they are to buy the DVD.
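To put numbers on these effects, the estimates can be converted into odds ratios. The sketch below copies the coefficients from the summary output above rather than refitting the model, so it runs without `dvd.csv`; with the fitted object available, `exp(coef(logistic_model))` should give the same quantities. The customer profile used for the predicted probability is hypothetical.

```r
# Estimates copied from the summary output above
coefs <- c("(Intercept)" = -3.037797, coupon = 0.774109,
           purch = 0.091110, last = -0.069112)

# Odds ratios: multiplicative change in the odds per unit increase
odds_ratios <- exp(coefs[c("coupon", "purch", "last")])
print(round(odds_ratios, 3))

# Hypothetical customer: $3 coupon, 5 purchases in the past year,
# 10 days since the last purchase (the leading 1 pairs with the intercept)
customer <- c(1, coupon = 3, purch = 5, last = 10)
eta <- sum(coefs * customer)   # linear predictor
p_buy <- plogis(eta)           # predicted probability of purchase
print(round(p_buy, 3))
```

Under these estimates, each additional dollar of coupon value multiplies the odds of purchase by about 2.17, holding the other variables constant, while each additional day since the last purchase multiplies them by about 0.93. With the fitted model in memory, `predict(logistic_model, newdata = ..., type = "response")` is the usual way to obtain predicted probabilities.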