# Foundations of Predictive Analytics in Python (Part 2)

Building good models only succeeds if you have a decent base table to start with. In this course you will learn how to construct a good base table, create variables and prepare your data for modeling. We finish with advanced topics on the matter. 

## Crucial base table concepts

In this chapter you will learn how to construct the foundations of your base table, namely the population and the target.

A predictive model can be used to predict an event.

All information needed to make these predictions is stored in the basetable. There are three important concepts in the basetable. <br>
**1- Population** is the group of people or objects you want to make a prediction for.<br>
**2- Candidate predictors** describe the objects in the population (age, gender, etc.).<br>
**3- Target** holds information about the event to predict. It is one if the event occurs, and zero otherwise.
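
As a minimal sketch of these three concepts, the toy basetable below has one row per object in the population, two candidate predictors and a binary target. All values, and the column names `age` and `gender`, are made up for illustration.

```python
import pandas as pd

# Hypothetical toy basetable: one row per donor in the population
basetable = pd.DataFrame({
    "donor_id": [1, 2, 3, 4],        # population
    "age": [34, 51, 27, 63],         # candidate predictor
    "gender": ["F", "M", "F", "M"],  # candidate predictor
    "target": [1, 0, 0, 1],          # 1 if the event occurred, 0 otherwise
})

# Target incidence: the share of the population for which the event occurred
print(basetable["target"].mean())
```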

**Draw a timeline** When building a basetable for predictive modelling, the first thing you should do is draw a timeline. On this timeline, you depict the situation in which you want to use the predictive model.


We want to construct a model that predicts `which donors are most likely to donate more than 50 Euro in April 2018`. To build the predictive model, we reconstruct the timeline one year back in time, so the target period of the basetable is April 2017.
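
The shift can be sketched in code. The variable names below are illustrative and not part of the original exercise; the sketch only assumes that the target period is a full calendar month.

```python
import pandas as pd

# Period in which the model will be used: April 2018
deploy_start = pd.Timestamp(2018, 4, 1)
deploy_end = pd.Timestamp(2018, 5, 1)  # exclusive end of the period

# Move the whole timeline one year back to build the basetable
target_start = deploy_start - pd.DateOffset(years=1)
target_end = deploy_end - pd.DateOffset(years=1)

print(target_start.date(), target_end.date())
```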

Assume that you want to construct a model that predicts whether someone will donate in a certain year. The timeline to construct the basetable has 2017 as the target period. This means that the target is based on donations made in 2017, and that the predictive variables are based on donations made before 2017. All donations are given in a pandas dataframe gifts (loaded below as `gift`) with three columns: the donor id, the donation date and the amount donated. In this exercise you will learn to construct a new pandas dataframe that excludes donations made in 2017 or later.




In [1]:
#load the libraries
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [62]:
#import the data
gift = pd.read_csv('data/gifts.csv', index_col=0)
print(gift.head())
print(type(gift['date'][0]))

   id        date  amount
0   1  2015-10-16    75.0
1   1  2014-02-11   111.0
2   1  2012-03-28    93.0
3   1  2013-12-13   113.0
4   1  2012-01-10    93.0
<class 'str'>


In [112]:
#convert the date column to datetime 
gift['date'] = pd.to_datetime(gift['date'])
print(type(gift['date'][0]))
print(gift.head(10))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
   id       date  amount
0   1 2015-10-16    75.0
1   1 2014-02-11   111.0
2   1 2012-03-28    93.0
3   1 2013-12-13   113.0
4   1 2012-01-10    93.0
5   1 2015-04-22    85.0
6   1 2014-03-05   109.0
7   2 2015-04-30    53.0
8   2 2012-06-05   101.0
9   2 2014-01-18    86.0


In [64]:
# Start of the target is January 1st 2017
start_date = datetime(year=2017, month=1, day=1)
# Select gifts made before start_date
gift_before_2017 = gift[gift['date'] < start_date]
print(gift_before_2017.count())

id        145045
date      145045
amount    145045
dtype: int64


### Timeline violation
To illustrate the importance of the timeline, consider an example where you violate the timeline and use information from the target period to construct the predictive variables.

Let's create a basetable with two columns: "amount_2017" is the total amount of donations in 2017, and "target" is 1 if this amount is larger than 30 and 0 otherwise.

Construct a logistic regression model that uses "amount_2017" as the single predictive variable to predict the target, and calculate the AUC.

In [48]:
start_date = datetime(year=2017, month=1, day=1)
end_date = datetime(year=2018, month=1, day=1)  # exclusive upper bound

# create a dataframe that only holds the gifts made in 2017
# (.copy() so that a column can be added later without a SettingWithCopyWarning)
year_2017 = gift[(gift['date'] >= start_date) & (gift['date'] < end_date)].copy()


In [49]:
# create the target variable: 1 if the gift is 30 euro or more, 0 otherwise
year_2017['target'] = np.where(year_2017['amount']>=30, 1,0)

In [50]:
# Select the relevant predictors and the target
X_year_2017 = np.array(year_2017[['amount']])
# reshape the data to 2D, as scikit-learn expects for a single feature
X_year_2017 = X_year_2017.reshape(-1,1)
y_year_2017 = year_2017['target']


# Build the logistic regression model
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
#fit the model
logreg.fit(X_year_2017,y_year_2017)
# Make predictions for X
predictions = logreg.predict_proba(X_year_2017)[:,1]
# Calculate and print the AUC value
from sklearn.metrics import roc_auc_score
auc_year =roc_auc_score(y_year_2017, predictions)
print(round(auc_year,2))
# Note: the model makes perfect predictions, but this is not realistic:
# the predictor (the amount given) is not available if the gift has not been made yet.

1.0
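
For contrast, a timeline-respecting version builds the predictors only from gifts made before the target period. The sketch below uses a small made-up `gifts` dataframe with the same three columns as above; the column name `amount_before_2017` is an assumption for illustration.

```python
import pandas as pd

# Made-up gifts with the same columns as the real data
gifts = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2016-03-01", "2017-05-10", "2015-07-20",
                            "2017-02-02", "2016-11-30"]),
    "amount": [40.0, 60.0, 20.0, 10.0, 80.0],
})

start_target = pd.Timestamp(2017, 1, 1)
end_target = pd.Timestamp(2018, 1, 1)  # exclusive

before = gifts[gifts["date"] < start_target]
during = gifts[(gifts["date"] >= start_target) & (gifts["date"] < end_target)]

# Predictor: total amount donated before the target period
basetable = (
    before.groupby("id")["amount"].sum().rename("amount_before_2017").reset_index()
)

# Target: 1 if the donor gave more than 30 in total during 2017, 0 otherwise
amount_2017 = during.groupby("id")["amount"].sum()
basetable["target"] = (basetable["id"].map(amount_2017).fillna(0) > 30).astype(int)

print(basetable)
```

Here the predictor no longer contains information from the target period, so a model trained on it cannot simply read off the answer.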

## The population
The population should be eligible for being targeted:
* the address should be available
* the privacy settings should allow sending a letter

*Time-dependent population: if you want to target people younger than 25, make sure they are younger than 25 at the point on the timeline that you are modeling.*
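
A time-dependent condition such as age can be evaluated at a reference date on the timeline rather than today. The sketch below assumes a hypothetical `date_of_birth` column and uses the start of the target period as the reference date; both are illustrative choices, not part of the original data.

```python
import pandas as pd

# Hypothetical donor table; date_of_birth is an assumed column
donors_age = pd.DataFrame({
    "donor_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1990-06-15", "1993-01-01", "1992-12-31"]),
})

reference_date = pd.Timestamp(2017, 1, 1)  # start of the target period

# Age in completed years at the reference date
dob = donors_age["date_of_birth"]
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)
donors_age["age"] = reference_date.year - dob.dt.year - (~had_birthday).astype(int)

# Donors younger than 25 at the reference date
young_population = set(donors_age.loc[donors_age["age"] < 25, "donor_id"])
print(young_population)
```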

 
Assume that you want to construct a basetable for a predictive model that predicts whether donors will donate in 2018. The timeline indicates that the population should contain `all donors that donated at least once since January 1st 2013`, `but made no donations after January 1st 2017`. Given is a pandas dataframe gifts with all the donations made since 2010. In this exercise, you will construct a set with the donor ids of all donors in the population.
* find the donors who made a donation in 2013 or later
* find the donors who made a donation in 2017 or later
* exclude the second group from the first
* the result is the set of donors who donated since January 1st 2013 but not in 2017 or later



In [114]:
# Gifts made in 2013 or later
gift_include = gift[gift['date'].dt.year >=2013]

# Gifts made in 2017 or later
gift_exclude =  gift[gift['date'].dt.year >=2017]

# use set() to get the unique donor ids
donors_include = set(gift_include['id'])
donors_exclude = set(gift_exclude['id'])
print(len(donors_include), len(donors_exclude))
# population: unique donor ids with a donation since 2013 but none in 2017 or later
population = donors_include.difference(donors_exclude)
print('Population of the donor is: ',len(population))
#print(gift_exclude)

27362 3843
Population of the donor is:  23519


### Removing duplicate objects
Assume that you want to construct a predictive model in order to select donors that are most likely to respond to a letter. The population of the basetable should contain donors that have an address available, and that have privacy settings that allow sending them a letter. All candidate donors are given in a dataframe donors with three columns: the donor_id, a flag address that is 1 if the address is available and 0 otherwise, and a flag letter_allowed that is 1 if one can send this donor a letter and 0 otherwise. In this exercise you will construct a set with the donors that should go in the population.

In [9]:
# load the dataframe with the address and letter_allowed flags
donors = pd.read_csv('data/gift_address_letter.csv', index_col=0)
print(donors.head())
print('Length of the original data', len(donors))


   id        date  amount  address  letter_allowed
0   1  2015-10-16    75.0        1               1
1   1  2014-02-11   111.0        1               1
2   1  2012-03-28    93.0        0               1
3   1  2013-12-13   113.0        0               1
4   1  2012-01-10    93.0        1               1
Length of the original data 150000


In [10]:
# donor_population: rows with an address available and letters allowed
donor_population = donors[(donors['address'] == 1) & (donors['letter_allowed'] == 1)]
population_list = list(donor_population['id'])
unique_population  = set(population_list)
print('Selected population ', len(unique_population))

Selected population  20698
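
Because the table has one row per gift, the same donor id can appear many times, which is why the ids are collapsed with `set()`. An equivalent sketch with pandas' own `drop_duplicates`, on made-up data with the same column names:

```python
import pandas as pd

# Made-up rows: donor 1 appears twice, donor 3 does not allow letters
donors_toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "address": [1, 1, 1, 1],
    "letter_allowed": [1, 1, 1, 0],
})

eligible = donors_toy[(donors_toy["address"] == 1) & (donors_toy["letter_allowed"] == 1)]

# Keep each donor id once, in order of first appearance
unique_ids = eligible["id"].drop_duplicates().tolist()
print(unique_ids)
```

Unlike a set, the resulting list keeps a stable order, which can be convenient when the ids become the rows of the basetable.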


## The Target

Once the timeline is set and the population is in place, you are ready to add the target to the basetable. The target is a special column in the basetable, namely the value, zero or one, that you want to predict. In predictive modeling, the target is equal to one if a certain event happens during the target period for the observation, and zero otherwise.

It is an unknown event that you want to predict.


In [11]:
# load the donor id information
donor_id = pd.read_csv('data/basetable.csv')
# note: list() on a dataframe returns its column labels, not its rows
donot_id = list(donor_id)

#### Calculate an event target
You are organising a charity event and want to predict which donors are most likely to attend this event. You organized a similar event in the past, so you can use that information to construct a predictive model. Given is a list population with unique donor ids for this basetable and a list attend_event with donors in the population that attended this past event. In this exercise you will construct a basetable with two columns: the donor_id and the target, which is 1 if the donor attended the event and 0 otherwise.
```python
# Basetable with one column: donor_id
# (sorted() gives a stable row order and also works if population is a set)
basetable = pd.DataFrame(sorted(population), columns=["donor_id"])
# Add target to the basetable: 1 if the donor attended the past event, 0 otherwise
basetable["target"] = pd.Series([1 if donor_id in attend_event else 0 for donor_id in basetable["donor_id"]])

# Calculate and print the target incidence
print(round(basetable["target"].sum() / len(basetable['target']), 2))
```
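
The list comprehension can also be written with `Series.isin`, which avoids the Python-level loop. A minimal sketch with made-up ids for `population` and `attend_event`:

```python
import pandas as pd

population = [1, 2, 3, 4, 5]  # made-up donor ids
attend_event = [2, 5]         # made-up attendees of the past event

basetable = pd.DataFrame(population, columns=["donor_id"])
# True/False membership converted to 1/0
basetable["target"] = basetable["donor_id"].isin(attend_event).astype(int)

# Target incidence: share of donors with target 1
print(round(basetable["target"].mean(), 2))
```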
Assume you want to construct a predictive model that predicts which donors are most likely to donate more than 50 euro in a certain month.

Given is a dataframe `basetable` that already has one row for each donor in the population; the column donor_id identifies the donor. The timeline indicates that the target should be 1 if the donor has donated more than 50 euro in January 2017 and 0 otherwise.
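
The code below relies on a dataframe `gifts_201701` that holds only the gifts made in January 2017. How it was created is not shown here; one possible sketch, on made-up data with the same columns as the gifts data, is:

```python
import pandas as pd

# Made-up gifts with the same columns as the real data
gifts = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2017-01-05", "2017-01-20", "2017-02-01", "2016-12-31"]),
    "amount": [30.0, 25.0, 70.0, 90.0],
})

start_target = pd.Timestamp(2017, 1, 1)
end_target = pd.Timestamp(2017, 2, 1)  # exclusive end of the target period

# Keep only the gifts made inside the target period
gifts_201701 = gifts[(gifts["date"] >= start_target) & (gifts["date"] < end_target)]
print(gifts_201701)
```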


```python
# Sum of donations for each donor in gifts_201701
gifts_summed = gifts_201701.groupby("id")["amount"].sum().reset_index()

# List with targets
targets = list(gifts_summed["id"][gifts_summed["amount"] > 50])

# Add targets to the basetable
basetable["target"] = pd.Series([1 if donor_id in targets else 0 for donor_id in basetable["donor_id"]])

# Calculate and print the target incidence
print(round(basetable["target"].sum() / len(basetable), 2))
```


## The basetable timeline


#### Adding the donor segment
Besides age, you also want to add the segment of a donor to the basetable. A selected group of donors that has made many donations in the past is assigned a segment: bronze, silver or gold. Given is an early stage basetable and a pandas dataframe segments that contains the segments for a selected group of the donors in the basetable. In this exercise you will add the segment to the basetable.

You can left join two pandas dataframes using the following code:
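
A minimal sketch of such a left join with `pd.merge`, assuming both dataframes share a `donor_id` column; the data and column names are made up for illustration. Donors without a segment get a missing value.

```python
import pandas as pd

# Made-up early stage basetable and segments
basetable = pd.DataFrame({"donor_id": [1, 2, 3], "age": [34, 51, 27]})
segments = pd.DataFrame({"donor_id": [1, 3], "segment": ["gold", "bronze"]})

# how="left" keeps every row of the basetable, matched or not
basetable = pd.merge(basetable, segments, on="donor_id", how="left")
print(basetable)
```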

