### Retrieving information from the predictor insight graph table
- The predictor insight graph table contains all the information needed to construct the predictor insight graph. For each value the predictor takes, it holds the number of observations with this value (`Size`) and the target incidence within this group (`Incidence`). The predictor insight graph table of the predictor `Country` is loaded as a pandas DataFrame `pig_table`.

In [21]:
import pandas as pd

In [22]:
pig_table = pd.DataFrame({'Country':['India','UK','USA'], 'Size':[49849,10057,40094], 'Incidence':[0.05,0.05,0.05]})
pig_table

  Country   Size  Incidence
0   India  49849       0.05
1      UK  10057       0.05
2     USA  40094       0.05


In [23]:
# Inspect the predictor insight graph table of Country
print(pig_table)

  Country   Size  Incidence
0   India  49849       0.05
1      UK  10057       0.05
2     USA  40094       0.05


In [24]:
# Print the number of UK donors
print(pig_table["Size"][pig_table["Country"]=="UK"])

1    10057
Name: Size, dtype: int64
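
The boolean-mask lookup above returns a one-element Series (index label plus value). When only the number itself is needed, one option is to index the table by `Country` and use `.loc`. This is a minimal sketch; `by_country`, `uk_size` and `usa_incidence` are names introduced here for illustration.

```python
import pandas as pd

# Same values as pig_table above
pig_table = pd.DataFrame({
    "Country": ["India", "UK", "USA"],
    "Size": [49849, 10057, 40094],
    "Incidence": [0.05, 0.05, 0.05],
})

# Index by Country so that .loc can address a single cell (hypothetical name)
by_country = pig_table.set_index("Country")

# Scalar lookups: row label first, column label second
uk_size = by_country.loc["UK", "Size"]
usa_incidence = by_country.loc["USA", "Incidence"]
print(uk_size, usa_incidence)
```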


In [25]:
# Check the target incidence of USA and India donors
print(pig_table["Incidence"][pig_table["Country"]=="USA"])
print(pig_table["Incidence"][pig_table["Country"]=="India"])

2    0.05
Name: Incidence, dtype: float64
0    0.05
Name: Incidence, dtype: float64


- The target incidence of USA and India donors is the same (0.05), indicating that `Country` is not a good variable to predict donations.
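
Here `pig_table` was typed in by hand. As a sketch of how such a table could be derived from raw data, the snippet below groups a small, made-up donor table by the predictor and computes the group size and the mean of the 0/1 target. The function name `create_pig_table` and the toy data `donors` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data: one row per donor, with a 0/1 target
donors = pd.DataFrame({
    "Country": ["India", "India", "UK", "USA", "USA", "USA"],
    "target": [0, 1, 0, 0, 0, 1],
})

def create_pig_table(df, target, predictor):
    """Return one row per predictor value with group size and target incidence."""
    groups = df.groupby(predictor)[target]
    # The mean of a 0/1 target is the share of targets in the group
    pig = pd.DataFrame({"Size": groups.size(), "Incidence": groups.mean()})
    return pig.reset_index()

pig = create_pig_table(donors, "target", "Country")
print(pig)
```

The result has the same three columns as `pig_table`, so the lookups shown above work on it unchanged.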

### Discretizing a single variable
- In order to make predictor insight graphs for continuous variables, we first need to discretize them. In Python, we can discretize pandas columns using the pandas function `qcut`, which builds bins from quantiles.
- To check whether the variable was nicely discretized, we can verify that the bins have approximately equal size using the `groupby` method:
`print(basetable.groupby("discretized_variable").size())`
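
To see what "equal size" means here, the following sketch applies `pd.qcut` to a made-up column holding the values 1 to 100. The names `toy`, `value` and `disc_value` are hypothetical and not part of the basetable.

```python
import pandas as pd

# Hypothetical toy column with the values 1..100
toy = pd.DataFrame({"value": range(1, 101)})

# Four quantile-based bins: each bin should receive a quarter of the rows
toy["disc_value"] = pd.qcut(toy["value"], 4)

# observed=False keeps every bin in the result, including empty ones
sizes = toy.groupby("disc_value", observed=False).size()
print(sizes)
```

Because the toy values contain no ties, every bin ends up with the same number of rows; on real data with repeated values the sizes are only approximately equal.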

In [26]:

basetable = pd.read_csv('basetable_ex2_4.csv')
basetable.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   target                 25000 non-null  int64  
 1   gender_F               25000 non-null  int64  
 2   income_high            25000 non-null  int64  
 3   income_low             25000 non-null  int64  
 4   country_USA            25000 non-null  int64  
 5   country_India          25000 non-null  int64  
 6   country_UK             25000 non-null  int64  
 7   age                    25000 non-null  int64  
 8   time_since_last_gift   25000 non-null  int64  
 9   time_since_first_gift  25000 non-null  int64  
 10  max_gift               25000 non-null  float64
 11  min_gift               25000 non-null  float64
 12  mean_gift              25000 non-null  float64
 13  number_gift            25000 non-null  int64  
dtypes: float64(3), int64(11)
memory usage: 2.7 MB


In [27]:
# Discretize the variable time_since_last_gift in 10 bins
basetable["bins_recency"] = pd.qcut(basetable['time_since_last_gift'], 10)

# Print the group sizes of the discretized variable
print(basetable.groupby("bins_recency").size())


bins_recency
(31.999, 315.0]     2509
(315.0, 459.0]      2492
(459.0, 571.0]      2506
(571.0, 656.0]      2538
(656.0, 736.0]      2461
(736.0, 832.0]      2501
(832.0, 931.0]      2507
(931.0, 1047.0]     2499
(1047.0, 1211.0]    2498
(1211.0, 2305.0]    2489
dtype: int64


### Discretizing all variables
- Instead of discretizing the continuous variables one by one, it is easier to discretize them automatically. 
- Only variables that are continuous should be discretized. You can verify whether variables should be discretized by checking whether they have more than a predefined number of different values.



- Make a list `variables` containing all the column names of the basetable except `target`.
- Create a loop that checks all the variables in the list `variables`.
- Complete the `if` statement such that only variables with more than 5 different values are discretized.
- Group the continuous variables in 10 bins using the `qcut` function.
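
The "more than 5 different values" check can also be written with `nunique`. The sketch below uses a small made-up table (`toy`) with one dummy column and one continuous column; the threshold of 5 is the one used in this section.

```python
import pandas as pd

# Hypothetical toy data: a 0/1 dummy and a continuous variable
toy = pd.DataFrame({
    "gender_F": [0, 1, 0, 1, 1, 0, 1, 0],
    "age": [23, 35, 41, 52, 29, 64, 47, 38],
})

# Number of distinct values per column
n_unique = toy.nunique()
print(n_unique)

# Keep only the columns with more than 5 different values
to_discretize = [col for col in toy.columns if n_unique[col] > 5]
print(to_discretize)
```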


In [28]:
# Print the columns in the original basetable
print(basetable.columns)

# Get all the variable names except "target"
variables = list(basetable.columns)
variables.remove("target")

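# Loop through the numeric variables and discretize in (at most) 10 bins
# if there are more than 5 different values
for variable in variables:
    is_numeric = pd.api.types.is_numeric_dtype(basetable[variable])
    if is_numeric and basetable[variable].nunique() > 5:
        new_variable = "disc_" + variable
        # duplicates="drop" merges repeated quantile edges (e.g. in number_gift),
        # which would otherwise raise "Bin edges must be unique"
        basetable[new_variable] = pd.qcut(basetable[variable], 10, duplicates="drop")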
        
# Print the columns in the new basetable
print(basetable.columns)

Index(['target', 'gender_F', 'income_high', 'income_low', 'country_USA',
       'country_India', 'country_UK', 'age', 'time_since_last_gift',
       'time_since_first_gift', 'max_gift', 'min_gift', 'mean_gift',
       'number_gift', 'bins_recency'],
      dtype='object')


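Without `duplicates="drop"`, `pd.qcut` fails on `number_gift`, because several of its decile edges coincide (1 and 3 each appear twice). Passing `duplicates="drop"` merges the repeated edges, at the cost of fewer than 10 bins:

ValueError: Bin edges must be unique: array([ 1.,  1.,  3.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 18.]).
You can drop duplicate edges by setting the 'duplicates' kwarg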
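
The `duplicates` keyword named in the error message can be tried on a small example. The sketch below uses a made-up Series (`gifts`) in which the value 1 is so frequent that several quartile edges coincide.

```python
import pandas as pd

# Hypothetical toy data: the value 1 occurs so often that several quartiles equal 1
gifts = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# With the default duplicates="raise", repeated edges are an error
try:
    pd.qcut(gifts, 4)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# Dropping the duplicate edges works, but leaves fewer bins than requested
binned = pd.qcut(gifts, 4, duplicates="drop")
n_bins = len(binned.cat.categories)
print(n_bins)
```

The trade-off is that the resulting bins are no longer equally sized, so it is worth checking the group sizes afterwards.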

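For comparison, this is the output of the loop on a version of the basetable with a slightly different set of columns (for example `median_gift` instead of `number_gift`):

```python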
Index(['target', 'gender_F', 'gender_M', 'income_average', 'income_low',
       'income_high', 'country_USA', 'country_India', 'country_UK', 'age',
       'time_since_last_gift', 'time_since_first_gift', 'max_gift', 'min_gift',
       'mean_gift', 'median_gift'],
      dtype='object')
Index(['target', 'gender_F', 'gender_M', 'income_average', 'income_low',
       'income_high', 'country_USA', 'country_India', 'country_UK', 'age',
       'time_since_last_gift', 'time_since_first_gift', 'max_gift', 'min_gift',
       'mean_gift', 'median_gift', 'disc_age', 'disc_time_since_last_gift',
       'disc_time_since_first_gift', 'disc_max_gift', 'disc_min_gift',
       'disc_mean_gift', 'disc_median_gift'],
      dtype='object')
```

### Making clean cuts
- The `qcut` function divides the variable into `n_bins` bins of approximately equal size. In some cases, however, it is nicer to choose our own bin borders. The pandas function `cut` allows us to do so.

- Discretize the variable `number_gift` in three bins with borders 0 and 5, 5 and 10, 10 and 20 and assign this variable to a new column called `disc_number_gift`.
- Count the number of observations in each group.
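
One detail worth knowing before choosing borders: by default `pd.cut` builds intervals that are open on the left and closed on the right, and values outside all intervals become missing. The sketch below shows this on a made-up Series; the values in `number_gift` here are illustrative and not taken from the basetable.

```python
import pandas as pd

# Hypothetical values, including border values and one value above the last border
number_gift = pd.Series([1, 5, 6, 10, 11, 20, 25])

# Bins (0, 5], (5, 10], (10, 20]
binned = pd.cut(number_gift, [0, 5, 10, 20])
print(binned)

# Category codes: 0, 1, 2 for the three bins, -1 for values outside every bin
codes = list(binned.cat.codes)
print(codes)

# Values outside the borders are not assigned to any bin
n_missing = binned.isna().sum()
print(n_missing)
```

A border value such as 5 falls in the bin that ends at 5, and 25 is left unassigned because no bin covers it.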


In [29]:
# Discretize number_gift in the bins (0, 5], (5, 10] and (10, 20]
basetable["disc_number_gift"] = pd.cut(basetable["number_gift"], [0, 5, 10, 20])

# Count the number of observations per group
print(basetable.groupby("disc_number_gift").size())

disc_number_gift
(0, 5]      13808
(5, 10]     10220
(10, 20]      972
dtype: int64


For comparison, the same cut on a basetable with 100,000 observations gives:

```python
disc_number_gift
(0, 5]      55063
(5, 10]     41120
(10, 20]     3817
dtype: int64
```

- Notice that the bins aren't approximately equally sized anymore.
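
To put numbers on this, the group sizes printed for the 25,000-row basetable can be turned into shares. This is a small sketch; the counts are copied from the output above and the names `sizes` and `shares` are introduced here.

```python
import pandas as pd

# Group sizes copied from the printed output of the 25,000-row basetable
sizes = pd.Series(
    [13808, 10220, 972],
    index=["(0, 5]", "(5, 10]", "(10, 20]"],
)

# Share of observations per bin
shares = sizes / sizes.sum()
print(shares.round(3))
```

The first two bins hold roughly 55% and 41% of the observations, while the last one holds only about 4%.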