# Foundations of Predictive Analytics in Python 

## Introduction and base table structure

### Structure of the base table
Consider the predictive modeling problem where you want to predict whether a candidate donor will make a donation in the next year. To build the model, you use historical data and calculate the target in 2017. The target is 1 if a donation is made in 2017 and 0 otherwise. 

![Screen%20Shot%202019-01-22%20at%209.32.18%20AM.png](attachment:Screen%20Shot%202019-01-22%20at%209.32.18%20AM.png)

### Exploring the base table


In [1]:
import pandas as pd

In [2]:
basetable = pd.read_csv('data/basetable_ex2_4.csv')
basetable.head()

Unnamed: 0,target,gender_F,income_high,income_low,country_USA,country_India,country_UK,age,time_since_last_gift,time_since_first_gift,max_gift,min_gift,mean_gift,number_gift
0,0,1,0,1,0,1,0,65,530,2265,166.0,87.0,116.0,7
1,0,1,0,0,0,1,0,71,715,715,90.0,90.0,90.0,1
2,0,1,0,0,0,1,0,28,150,1806,125.0,74.0,96.0,9
3,0,1,0,1,1,0,0,52,725,2274,117.0,97.0,104.25,4
4,0,1,1,0,1,0,0,82,805,805,80.0,80.0,80.0,1


In [3]:
# Assign the number of rows in the basetable 
basetable_size = len(basetable)
print(basetable_size)

25000


In [4]:
# Assign the number of targets to the variable 'target_count'
target_count = sum(basetable['target'])
print(target_count)

1187


In [5]:
# Print the target incidence.
print(target_count / basetable_size )

0.04748
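
Because the target is coded as 0 or 1, the incidence is also the mean of the target column. A minimal sketch on a small made-up dataframe (the name `toy` and its values are hypothetical and not taken from the base table):

```python
import pandas as pd

# Hypothetical toy data: 2 donors out of 8 candidates
toy = pd.DataFrame({"target": [0, 0, 1, 0, 0, 0, 1, 0]})

# The mean of a 0/1 column equals the share of ones
incidence = toy["target"].mean()
print(incidence)
```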


In [6]:
# Find out how many males and females are in the population
female = sum(basetable['gender_F'] == 1)
male = sum(basetable['gender_F'] == 0)
print('Female:{} , Male:{}'.format(female, male))

Female:12579 , Male:12421


In [7]:
per_female = (female / basetable_size) * 100
per_male = (male / basetable_size) * 100
print('Female: {} %, Male: {} %'.format(per_female, per_male))

Female: 50.316 %, Male: 49.684 %
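
The same percentages can likely be obtained in one step with `value_counts(normalize=True)`. A small sketch, assuming a 0/1 dummy column like `gender_F`; the dataframe `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: 3 women (1) and 1 man (0)
toy = pd.DataFrame({"gender_F": [1, 0, 1, 1]})

# normalize=True returns proportions instead of counts
shares = toy["gender_F"].value_counts(normalize=True) * 100
print(shares)
```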


### Logistic regression
![Screen%20Shot%202019-01-22%20at%2011.58.08%20AM.png](attachment:Screen%20Shot%202019-01-22%20at%2011.58.08%20AM.png)

If we plot the target as a function of age for all donors, we see that a 1 occurs more often to the right, where the older donors are. If we fit a regression line through these points, it is of the form a*x + b, where a is a positive number. Here a is called the coefficient of age, and b is called the intercept. If we plot the target as a function of the time since the last donation for each donor, we see that donors who donated recently are more likely to donate. In this case the coefficient of recency is negative.

In [8]:
base_copy = basetable.copy()

In [9]:
base_copy.head()

Unnamed: 0,target,gender_F,income_high,income_low,country_USA,country_India,country_UK,age,time_since_last_gift,time_since_first_gift,max_gift,min_gift,mean_gift,number_gift
0,0,1,0,1,0,1,0,65,530,2265,166.0,87.0,116.0,7
1,0,1,0,0,0,1,0,71,715,715,90.0,90.0,90.0,1
2,0,1,0,0,0,1,0,28,150,1806,125.0,74.0,96.0,9
3,0,1,0,1,1,0,0,52,725,2274,117.0,97.0,104.25,4
4,0,1,1,0,1,0,0,82,805,805,80.0,80.0,80.0,1


### Interpretation of coefficients
Assume you built a logistic regression model to predict which donors are most likely to donate for a project, using age and time_since_last_gift (number of months since the last gift) as predictors. The output of the logistic regression model is as follows:
```python
y = 0.3 + 4.5*age - 2.3*time_since_last_gift
```
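
To read such an output, it can help to look at the sign of each coefficient. The sketch below assumes that `y` is the linear score of the model and that the logistic function turns it into a probability. The input values are arbitrary and hypothetical; they only illustrate the direction of each effect.

```python
import math

def linear_score(age, time_since_last_gift):
    # Coefficients copied from the model output above
    return 0.3 + 4.5 * age - 2.3 * time_since_last_gift

def probability(score):
    # Logistic (sigmoid) function maps any score to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical inputs, chosen only to show the direction of each effect
base = probability(linear_score(age=1.0, time_since_last_gift=2.0))
older = probability(linear_score(age=2.0, time_since_last_gift=2.0))
less_recent = probability(linear_score(age=1.0, time_since_last_gift=3.0))

print(base, older, less_recent)
```

Under these assumptions, the positive coefficient of age raises the predicted probability for the older donor, and the negative coefficient of time_since_last_gift lowers it for the donor whose last gift is longer ago.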
### Building a logistic regression model
You can build a logistic regression model using the module `linear_model` from sklearn. First, you create a logistic regression model using the `LogisticRegression()` constructor:
```python
logreg = linear_model.LogisticRegression()
```
Next, you need to feed data to the logistic regression model, so that it can be fit. X contains the predictive variables, whereas y has the target.
```python
X = basetable[["predictor_1", "predictor_2", "predictor_3"]]
y = basetable["target"]
logreg.fit(X,y)
```
### Showing the coefficients and intercept
Once the logistic regression model is ready, it can be interesting to have a look at the coefficients to check whether the model makes sense.

Given a fitted logistic regression model `logreg`, you can retrieve the coefficients using the attribute `coef_`. The coefficients appear in the same order as the variables that were fed to the model. The intercept can be retrieved using the attribute `intercept_`.

In [10]:
# Import linear_model from sklearn.
from sklearn import linear_model

In [11]:
# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[['age', 'gender_F', 'time_since_last_gift']]
# Create a series y that contains the target
y = basetable['target']
# Create a logistic regression model logreg and fit it to the data.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
predictors = ["age", "gender_F", "time_since_last_gift"]

# Assign the coefficients to an array coef of shape (1, number of predictors)
coef = logreg.coef_

# Print each coefficient next to the name of its predictor
for p, c in zip(predictors, coef[0]):
    print(c, p)
print(logreg.intercept_)

    


0.007178355658921441 age
0.11430414536794431 gender_F
-0.0013087501133203447 time_since_last_gift
[-2.54149728]


In [13]:
basetable_predict = basetable[predictors]
# Each row holds the probability of target 0 and the probability of target 1
prediction = logreg.predict_proba(basetable_predict)
print(prediction[0:5])

[[0.93427169 0.06572831]
 [0.9454883  0.0545117 ]
 [0.9185279  0.0814721 ]
 [0.95269877 0.04730123]
 [0.94745512 0.05254488]]
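
The second column of `predict_proba` can be reproduced by hand from `coef_` and `intercept_`. The sketch below checks this on synthetic data; the names `X_demo`, `y_demo` and `model` are hypothetical and the data is randomly generated, not taken from the base table.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical synthetic data: 200 rows, 2 predictors, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

model = linear_model.LogisticRegression()
model.fit(X_demo, y_demo)

# Linear score: intercept + sum of coefficient * predictor
score = model.intercept_[0] + X_demo @ model.coef_[0]
# The logistic function turns the score into the probability of class 1
manual = 1.0 / (1.0 + np.exp(-score))

proba = model.predict_proba(X_demo)
print(np.allclose(manual, proba[:, 1]))
```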


### Calculating AUC

The AUC value assesses how well a model can order observations from a low probability of being a target to a high probability of being a target. In Python, the `roc_auc_score` function can be used to calculate the AUC of the model. It takes the true values of the target and the predicted probabilities as arguments.
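
One way to read the AUC is as the share of (target, non-target) pairs in which the target receives the higher predicted probability. The toy values below are hypothetical and only illustrate that this pairwise count agrees with `roc_auc_score`:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical toy example: true targets and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for t, s in zip(y_true, y_score) if t == 1]
negatives = [s for t, s in zip(y_true, y_score) if t == 0]

# Count pairs where the target gets the higher score (ties count as half)
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(positives, negatives))
pairwise_auc = wins / (len(positives) * len(negatives))

print(pairwise_auc, roc_auc_score(y_true, y_score))
```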



In [14]:
from sklearn.metrics import roc_auc_score
# Make predictions
prediction_X = logreg.predict_proba(X)
prediction_X_target = prediction_X[:,1]

# Calculate the AUC value
auc = roc_auc_score(y, prediction_X_target)
print(round(auc,2))

0.63


In [17]:
# Trying different models
variable_1 = ['mean_gift', 'income_low']
variable_2 = ['mean_gift', 'income_low', 'gender_F', 'country_India', 'age']

X_1 = basetable[variable_1]
X_2 = basetable[variable_2]

In [19]:
logreg.fit(X_1, y)
prediction_1 = logreg.predict_proba(X_1)[:,1]
auc_1 = roc_auc_score(y, prediction_1)

In [20]:
logreg.fit(X_2,y)
prediction_2 = logreg.predict_proba(X_2)[:,1]
auc_2 = roc_auc_score(y, prediction_2)

In [21]:
print(round(auc_1,2))
print(round(auc_2,2))

0.68
0.69


### Forward stepwise variable selection
#### Selecting the next best variable
The forward stepwise variable selection method starts with an empty variable set and proceeds in steps, where in each step the next best variable is added. To implement this procedure, two handy functions are defined below.

The `auc_score` function calculates, for a given set of variables, the AUC of the model that uses this set as predictors. The `next_best_variable` function calculates which variable should be added to the variable list in the next step.


In [26]:
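def auc_score(variable, target, df):
    # Select the predictors and the target
    X = df[variable]
    y = df[target]

    # Fit a logistic regression and predict the probability of target 1
    logreg = linear_model.LogisticRegression()
    logreg.fit(X, y)
    predict = logreg.predict_proba(X)[:, 1]
    # AUC on the same data the model was fit on
    auc = roc_auc_score(y, predict)
    return auc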
    

In [30]:
# Loop over the candidate variables to find the one that gives the best AUC
# when added to the current variables
def next_best_variable(current_variables, candidate_variables, target, df):
    best_auc = -1
    best_variable = None

    for v in candidate_variables:
        auc_v = auc_score(current_variables + [v], target, df)

        # Keep the candidate with the highest AUC so far
        if auc_v >= best_auc:
            best_auc = auc_v
            best_variable = v
    return best_variable

In [28]:
# Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors
current_auc = auc_score(["max_gift", "mean_gift","min_gift"],'target', basetable)
print(round(current_auc, 4))


0.7125


In [31]:
# Calculate which variable among "age" and "gender_F" should be added to the variables 
#"max_gift", "mean_gift" and "min_gift"

next_variable = next_best_variable(["max_gift", "mean_gift","min_gift"],["age","gender_F"], 'target', basetable)
print(next_variable)


age


In [32]:
# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors
current_auc_age = auc_score(["max_gift", "mean_gift","min_gift","age"],'target', basetable)
print(round(current_auc_age,4))


0.7148


In [33]:
current_auc_gender_F = auc_score(["max_gift", "mean_gift","min_gift","gender_F"],'target', basetable)
print(round(current_auc_gender_F,4))

0.713


**Nice! The model that has age as next variable has a better AUC than the model that has gender_F as next variable. Therefore, age is selected as the next best variable.**

#### Finding the order of variables
The forward stepwise variable selection procedure starts with an empty set of variables, and adds predictors one by one. In each step, the predictor that has the highest AUC in combination with the current variables is selected.

In [65]:
# Create the list of candidate variables from the column names
candidate_variable = list(basetable.columns.values)
print(candidate_variable)

['target', 'gender_F', 'income_high', 'income_low', 'country_USA', 'country_India', 'country_UK', 'age', 'time_since_last_gift', 'time_since_first_gift', 'max_gift', 'min_gift', 'mean_gift', 'number_gift']


In [66]:
candidate_variable.remove('target')

In [67]:
# Initialize the current variables
current_variable =[]

In [68]:
# The forward stepwise variable selection procedure
number_of_iteration = 5

for i in range(number_of_iteration):
    # Add the candidate that gives the best AUC with the current variables
    next_variable = next_best_variable(current_variable, candidate_variable, 'target', basetable)
    current_variable = current_variable + [next_variable]
    candidate_variable.remove(next_variable)

    print('Variable added in step ' + str(i + 1) + ' is ' + next_variable)
print(current_variable)

Variable added in step 1 is max_gift
Variable added in step 2 is number_gift
Variable added in step 3 is time_since_last_gift
Variable added in step 4 is mean_gift
Variable added in step 5 is age
['max_gift', 'number_gift', 'time_since_last_gift', 'mean_gift', 'age']


In [69]:
current_list_auc = auc_score(current_variable, 'target', basetable)
print(current_list_auc)

0.768756710130262


#### Correlated variables
You can check whether two variables are correlated by calculating the correlation between them:
```python
import numpy
numpy.corrcoef(basetable["variable_1"],basetable["variable_2"])[0,1]
```
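
As a runnable illustration, the sketch below builds a hypothetical dataframe `demo` in which column `b` is derived from column `a`, while column `c` is independent noise. It then computes a single pairwise correlation with `numpy.corrcoef` and the full correlation matrix with the pandas `corr` method.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: b is built from a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),
    "c": rng.normal(size=1000),
})

# Pairwise correlation between two columns
corr_ab = np.corrcoef(demo["a"], demo["b"])[0, 1]
print(corr_ab)

# Full correlation matrix for all columns at once
corr_matrix = demo.corr()
print(corr_matrix)
```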

In [70]:
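# Repeat the procedure for the first 10 variables.
# Note: 'target' is not removed from the candidates here, so it is selected in step 1 below.
# Remove it first, as was done for candidate_variable, to get a meaningful ordering.
candidate_variable_10 = list(basetable.columns.values)
current_variable_10 = []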

In [71]:
for i in range(10):
    next_variable_10 = next_best_variable(current_variable_10, candidate_variable_10, 'target', basetable)
    current_variable_10 = current_variable_10 + [next_variable_10]
    candidate_variable_10.remove(next_variable_10)

    print('Variable added in step ' + str(i + 1) + ' is ' + next_variable_10)
print(current_variable_10)

Variable added in step 1 is target
Variable added in step 2 is number_gift
Variable added in step 3 is mean_gift
Variable added in step 4 is min_gift
Variable added in step 5 is max_gift
Variable added in step 6 is time_since_last_gift
Variable added in step 7 is time_since_first_gift
Variable added in step 8 is age
Variable added in step 9 is country_UK
Variable added in step 10 is country_India
['target', 'number_gift', 'mean_gift', 'min_gift', 'max_gift', 'time_since_last_gift', 'time_since_first_gift', 'age', 'country_UK', 'country_India']
