# Building Logistic Regression Models

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline 


## Exploring the base table

Before diving into model building, it is important to understand the data you are working with. In this exercise, you will learn how to obtain the population size, number of targets and target incidence from a given basetable.

In [21]:
basetable = pd.read_csv('40Data1.csv', index_col = 0, parse_dates = True)

In [22]:
# Assign the number of rows in the basetable to the variable 'population_size'.
population_size  = len(basetable)

# Print the population size.
print(population_size)

# Assign the number of targets to the variable 'targets_count'.
targets_count = sum(basetable["target"])

# Print the number of targets.
print(targets_count)

# Print the incidence, i.e. the number of targets divided by the population size.
print(targets_count/population_size)

100000
4990
0.0499
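
Because the target is coded as 0 or 1, the incidence can also be obtained as the mean of the target column. A minimal sketch on a made-up toy dataframe (the values below are illustrative and are not taken from the basetable):

```python
import pandas as pd

# Hypothetical toy basetable: 8 observations, 2 of them targets.
toy = pd.DataFrame({"target": [0, 0, 1, 0, 0, 0, 1, 0]})

population_size = len(toy)
targets_count = toy["target"].sum()

# For a 0/1 target, the mean equals the number of targets
# divided by the population size.
incidence = toy["target"].mean()
print(population_size, targets_count, incidence)
```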


## Exploring the predictive variables

It is always useful to get a better understanding of the population. Therefore, one can have a closer look at the predictive variables. Recall that you can select a column in a pandas dataframe by indexing as follows:

    basetable["variable"]

To count how often a certain value occurs in a column, compare the column with that value and pass the resulting boolean Series to the built-in sum function, which counts the True entries:

    sum(basetable["variable"]==value)

In this exercise you will find out whether there are more males than females in the population.

In [24]:
# Count and print the number of females.
print(sum(basetable["gender"]=="F"))

# Count and print the number of males.
print(sum(basetable["gender"]=="M"))

50624
49376
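
As an alternative to one comparison per value, the pandas method value_counts tallies every distinct value of a column in one call. A small sketch on hypothetical data (the toy dataframe is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy population with a gender column.
toy = pd.DataFrame({"gender": ["F", "M", "F", "F", "M"]})

# Absolute count per distinct value.
counts = toy["gender"].value_counts()
print(counts)

# Relative share per distinct value.
shares = toy["gender"].value_counts(normalize=True)
print(shares)
```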


## Building a logistic regression model

You can build a logistic regression model using the module linear_model from sklearn. First, you create a logistic regression model by instantiating the LogisticRegression class:

    logreg = linear_model.LogisticRegression()

Next, you need to feed data to the logistic regression model so that it can be fit. X contains the predictive variables, whereas y holds the target as a single column (a pandas Series), which is the one-dimensional shape that fit expects.

    X = basetable[["predictor_1","predictor_2","predictor_3"]]
    y = basetable["target"]
    logreg.fit(X,y)

In this exercise you will build your first predictive model using three predictors.

In [33]:
basetable = pd.read_csv('basetable_ex2_4.csv', parse_dates = True)
basetable.head()

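| Unnamed: 0 | target | gender_F | income_high | income_low | country_USA | country_India | country_UK | age | time_since_last_gift | time_since_first_gift | max_gift | min_gift | mean_gift | number_gift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 65 | 530 | 2265 | 166.0 | 87.0 | 116.0 | 7 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 71 | 715 | 715 | 90.0 | 90.0 | 90.0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 28 | 150 | 1806 | 125.0 | 74.0 | 96.0 | 9 |
| 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 52 | 725 | 2274 | 117.0 | 97.0 | 104.25 | 4 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 82 | 805 | 805 | 80.0 | 80.0 | 80.0 | 1 |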


In [34]:
# Import linear_model from sklearn.
from sklearn import linear_model

# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[["age","gender_F","time_since_last_gift"]]

# Create a series y that contains the target (one-dimensional, as fit expects).
y = basetable["target"]

# Create a logistic regression model logreg and fit it to the data.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

## Showing the coefficients and intercept

Once the logistic regression model is ready, it can be interesting to have a look at the coefficients to check whether the model makes sense.

Given a fitted logistic regression model logreg, you can retrieve the coefficients using the attribute coef_. The coefficients appear in the same order as the variables were fed to the model. The intercept can be retrieved using the attribute intercept_.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [35]:
# Construct a logistic regression model that predicts the target using age, gender_F and time_since_last_gift
predictors = ["age","gender_F","time_since_last_gift"]
X = basetable[predictors]
y = basetable["target"]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)

# Assign the coefficients to coef (an array with one row) and print one value per predictor
coef = logreg.coef_
for p,c in zip(predictors,list(coef[0])):
    print(p + '\t' + str(c))

# Assign the intercept to the variable intercept
intercept = logreg.intercept_
print(intercept)

age	0.007178355659086629
gender_F	0.11430414536348246
time_since_last_gift	-0.00130875011331457
[-2.54149728]
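
In a logistic regression, each coefficient is the change in the log-odds of being a target per one-unit increase in the predictor, so exponentiating it gives an odds ratio. A small sketch using the coefficient values printed above (copied by hand here, so treat the numbers as illustrative):

```python
import math

# Coefficients copied from the output above, in the order of `predictors`.
predictors = ["age", "gender_F", "time_since_last_gift"]
coef = [0.007178355659086629, 0.11430414536348246, -0.00130875011331457]

# exp(coefficient) is the multiplicative change in the odds of being a target
# for a one-unit increase in the predictor, other predictors held fixed.
odds_ratios = {name: math.exp(c) for name, c in zip(predictors, coef)}
for name, ratio in odds_ratios.items():
    print(f"{name}\t{ratio:.4f}")
```

An odds ratio above 1 (age, gender_F) means the odds of being a target go up with the predictor; a value below 1 (time_since_last_gift) means they go down. For gender_F the ratio is about 1.12.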



## Making predictions

Once your model is ready, you can use it to make predictions for a campaign. It is important to always use the latest information to make predictions.

In this exercise you will, given a fitted logistic regression model, learn how to make predictions for a new, updated basetable.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [37]:
current_data = basetable.copy()

In [38]:
# Fit a logistic regression model
from sklearn import linear_model
X = basetable[["age","gender_F","time_since_last_gift"]]
y = basetable["target"]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)

# Create a dataframe new_data from current_data that has only the relevant predictors
new_data = current_data[["age","gender_F","time_since_last_gift"]]

# Predict the class probabilities for each observation in new_data and assign them to predictions.
# Column 0 is the probability of not being a target, column 1 the probability of being a target.
predictions = logreg.predict_proba(new_data)
print(predictions[0:5])

[[0.93427169 0.06572831]
 [0.9454883  0.0545117 ]
 [0.9185279  0.0814721 ]
 [0.95269877 0.04730123]
 [0.94745512 0.05254488]]
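
The second column of predict_proba is the logistic function applied to the intercept plus the weighted sum of the predictors. The sketch below redoes that calculation by hand for the first row shown by basetable.head() (age 65, gender_F 1, time_since_last_gift 530), using the coefficients and intercept printed in the previous section. The numbers are copied by hand and the intercept is rounded as printed, so the result should only be expected to land close to the 0.0657 in the first row of the output above.

```python
import math

# Coefficients and intercept copied from the earlier output
# (order: age, gender_F, time_since_last_gift).
coef = [0.007178355659086629, 0.11430414536348246, -0.00130875011331457]
intercept = -2.54149728

# First observation from basetable.head().
x = [65, 1, 530]

# Linear predictor (log-odds), then the logistic function.
log_odds = intercept + sum(c * v for c, v in zip(coef, x))
probability = 1 / (1 + math.exp(-log_odds))
print(round(probability, 4))
```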



## Donor that is most likely to donate

The predictions that result from the predictive model reflect how likely it is that someone is a target. For instance, assume that you constructed a model to predict whether a donor will donate more than 50 Euro for a certain campaign. If the prediction for a certain donor is 0.82, the model estimates an 82% chance that this donor will donate more than 50 Euro.

In this exercise you will find the donor that is most likely to donate more than 50 Euro.

Recall that you can sort a pandas dataframe df according to a certain column c using

    df_sorted = df.sort_values(["c"])

and that you can select the first and last row of a pandas dataframe using

    first_row = df.head(1)
    last_row = df.tail(1)
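
A short sketch of both steps on hypothetical data, using sort_values (the sorting method in current pandas versions) and, as an alternative that avoids sorting, idxmax. The donor labels and probabilities are made up for illustration:

```python
import pandas as pd

# Hypothetical predicted probabilities for four donors.
df = pd.DataFrame(
    {"probability": [0.07, 0.82, 0.05, 0.31]},
    index=["donor_a", "donor_b", "donor_c", "donor_d"],
)

# Sorting in ascending order puts the most likely donor in the last row.
last_row = df.sort_values(["probability"]).tail(1)
print(last_row)

# idxmax returns the index label of the highest probability without sorting.
best_donor = df["probability"].idxmax()
print(best_donor)
```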


In [39]:
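# predictions is a NumPy array, so first store the probability of being a target (second column) in a dataframe
predictions_df = pd.DataFrame({"probability": predictions[:,1]}, index=new_data.index)

# Sort the predictions from low to high probability
predictions_sorted = predictions_df.sort_values(["probability"])

# Select the last row of the sorted predictions, i.e. the donor most likely to donate
print(predictions_sorted.tail(1))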

# Forward stepwise variable selection for logistic regression

## Calculating AUC

The AUC value assesses how well a model ranks observations, from those least likely to be a target to those most likely to be a target. In Python, the roc_auc_score function from sklearn.metrics can be used to calculate the AUC of a model. It takes the true values of the target and the predicted probabilities of being a target as arguments.
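
One way to read the AUC is as the share of (target, non-target) pairs in which the target receives the higher prediction. The sketch below computes that share by hand on four hypothetical observations; it is meant to illustrate the idea, not to replace roc_auc_score, which should give the same value for these inputs.

```python
from itertools import product

# Hypothetical true targets and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Split the predictions by true class.
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

# Count the pairs where the target scores higher; ties count as half.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))
print(auc)
```

Three of the four pairs are ordered correctly here, so the AUC is 0.75. A model that ranks every target above every non-target would score 1, and random ordering would score about 0.5.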

You will make predictions again and then calculate the model's AUC with roc_auc_score.

In [40]:
# Import roc_auc_score from sklearn.
from sklearn.metrics import roc_auc_score

# Make predictions and keep the probability of being a target (second column)
predictions = logreg.predict_proba(X)
predictions_target = predictions[:,1]

# Calculate the AUC value
auc = roc_auc_score(y, predictions_target)
print(round(auc,2))