# Module 4 - Data Science for Business

## Cole Bailey - colebailey@sandiego.edu

### Background

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling is used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
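The figures above allow a quick sanity check, sketched here as plain arithmetic: what one mailing is expected to earn if the whole list is mailed without a model. The variable names are illustrative and do not come from the original notebook.

```python
# Figures quoted in the Background paragraph
response_rate = 0.051   # overall response rate
avg_donation = 13.00    # average donation among responders, in dollars
mailing_cost = 0.68     # cost to produce and send one mailing, in dollars

# Expected net profit per mailing when everyone on the list is mailed
expected_net = response_rate * avg_donation - mailing_cost

# Response rate at which a mailing exactly pays for itself
break_even_rate = mailing_cost / avg_donation

print(f"Expected net profit per mailing: {expected_net:.3f} dollars")
print(f"Break-even response rate: {break_even_rate:.1%}")
```

Under these figures, mailing everyone loses about 1.7 cents per piece, and a mailing only pays for itself when the response rate is about 5.2% or higher. That is why a model that ranks likely donors is needed.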


In [1]:
# Load Packages
%matplotlib inline
import random
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import statsmodels.api as sm
import scikitplot as skplt
from scipy import stats

from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score

from dmba import gainsChart

DATA = Path('.').resolve().parent / 'data'

no display found. Using non-interactive Agg backend


### 1. Data preparation: Load the data and prepare it for predictive analysis.

In [2]:
# Load Data

fun = pd.read_csv('/Users/colebailey/Documents/USD/Data Science for Business/Module 4/Fundraising.csv')

fun.head(5)

Unnamed: 0,Row Id,Row Id.,zipconvert_2,zipconvert_3,zipconvert_4,zipconvert_5,homeowner dummy,NUMCHLD,INCOME,gender dummy,...,IC15,NUMPROM,RAMNTALL,MAXRAMNT,LASTGIFT,totalmonths,TIMELAG,AVGGIFT,TARGET_B,TARGET_D
0,1,17,0,1,0,0,1,1,5,1,...,1,74,102.0,6.0,5.0,29,3,4.857143,1,5.0
1,2,25,1,0,0,0,1,1,1,0,...,4,46,94.0,12.0,12.0,34,6,9.4,1,10.0
2,3,29,0,0,0,1,0,2,5,1,...,13,32,30.0,10.0,5.0,29,7,4.285714,1,5.0
3,4,38,0,0,0,1,1,1,3,0,...,4,94,177.0,10.0,8.0,30,3,7.08,0,0.0
4,5,40,0,1,0,0,1,1,4,0,...,7,20,23.0,11.0,11.0,30,6,7.666667,0,0.0


In [3]:
# Data exploration: summary statistics for the numerical columns
fun.nunique(axis=0)
fun.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

Unnamed: 0,Row Id,Row Id.,zipconvert_2,zipconvert_3,zipconvert_4,zipconvert_5,homeowner dummy,NUMCHLD,INCOME,gender dummy,...,IC15,NUMPROM,RAMNTALL,MAXRAMNT,LASTGIFT,totalmonths,TIMELAG,AVGGIFT,TARGET_B,TARGET_D
count,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,...,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0,3120.0
mean,1560.5,11615.770833,0.214423,0.185256,0.214423,0.384615,0.770192,1.069231,3.89391,0.609295,...,14.702885,49.089423,110.399875,16.651397,13.522917,31.136859,6.861859,10.690713,0.5,6.499612
std,900.810746,6698.678131,0.410487,0.388568,0.410487,0.486582,0.420777,0.347688,1.636186,0.487987,...,12.079882,22.71713,147.299933,22.223521,10.581439,4.132952,5.561209,7.44398,0.50008,10.597849
min,1.0,17.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,11.0,15.0,5.0,0.0,17.0,0.0,2.138889,0.0,0.0
25%,780.75,5820.75,0.0,0.0,0.0,0.0,1.0,1.0,3.0,0.0,...,5.0,29.0,45.0,10.0,7.0,29.0,3.0,6.356092,0.0,0.0
50%,1560.5,11735.5,0.0,0.0,0.0,0.0,1.0,1.0,4.0,1.0,...,12.0,48.0,81.0,15.0,10.0,31.0,5.0,9.0,0.5,0.5
75%,2340.25,17435.75,0.0,0.0,0.0,1.0,1.0,1.0,5.0,1.0,...,21.0,65.0,134.625,20.0,16.0,34.0,9.0,12.811652,1.0,10.0
max,3120.0,23293.0,1.0,1.0,1.0,1.0,1.0,5.0,7.0,1.0,...,90.0,157.0,5674.9,1000.0,219.0,37.0,77.0,122.166667,1.0,200.0


In [4]:
fun.isnull().sum()

Row Id             0
Row Id.            0
zipconvert_2       0
zipconvert_3       0
zipconvert_4       0
zipconvert_5       0
homeowner dummy    0
NUMCHLD            0
INCOME             0
gender dummy       0
WEALTH             0
HV                 0
Icmed              0
Icavg              0
IC15               0
NUMPROM            0
RAMNTALL           0
MAXRAMNT           0
LASTGIFT           0
totalmonths        0
TIMELAG            0
AVGGIFT            0
TARGET_B           0
TARGET_D           0
dtype: int64

There are no null values in the data set.

In [5]:
fun.dtypes

Row Id               int64
Row Id.              int64
zipconvert_2         int64
zipconvert_3         int64
zipconvert_4         int64
zipconvert_5         int64
homeowner dummy      int64
NUMCHLD              int64
INCOME               int64
gender dummy         int64
WEALTH               int64
HV                   int64
Icmed                int64
Icavg                int64
IC15                 int64
NUMPROM              int64
RAMNTALL           float64
MAXRAMNT           float64
LASTGIFT           float64
totalmonths          int64
TIMELAG              int64
AVGGIFT            float64
TARGET_B             int64
TARGET_D           float64
dtype: object

All columns are already numeric, so there are no categorical variables that need to be converted into dummies or numerical values.

In [6]:
fun['TARGET_B'].value_counts()

1    1560
0    1560
Name: TARGET_B, dtype: int64

In [7]:
fun = fun.drop('TARGET_D', axis=1)

### 2. Step 1—Partitioning: Partition the dataset into 60% training and 40% validation (set the seed to 12345). 

In [24]:
#Creation of training and validation sets
X = fun.drop('TARGET_B', axis=1)
y = fun['TARGET_B']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=12345)

### 3.	Step 2—Model Building: Follow the following steps to build, evaluate, and choose a model.

#### Select classification tool and parameters: Run at least two classification models of your choosing. Be sure NOT to use TARGET_D in your analysis. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be replicated. 

In [25]:
# Random Forest
#Target D was dropped already
model = RandomForestClassifier(random_state=12345)
model.fit(X_train, y_train)
pred1 = model.predict(X_val)

print(f'Accuracy = {accuracy_score(y_val, pred1):.2f} Recall = {recall_score(y_val, pred1):.2f}')
cm = confusion_matrix(y_val, pred1)
print(cm)

Accuracy = 0.55 Recall = 0.55
[[351 287]
 [276 334]]


In [26]:
# K Nearest Neighbors

model3 = KNeighborsClassifier(n_neighbors=25)
model3.fit(X_train, y_train)
pred3 = model3.predict(X_val)

print(f'Accuracy = {accuracy_score(y_val, pred3):.2f} Recall = {recall_score(y_val, pred3):.2f}')
cm3 = confusion_matrix(y_val, pred3)
print(cm3)

Accuracy = 0.50 Recall = 0.53
[[293 345]
 [284 326]]


The first model is a random forest classifier, fit with scikit-learn's default settings and random_state=12345. A random forest combines many decision trees, each trained on a different bootstrap sample of the training data, and aggregates their votes into the final classification. Since all the predictor variables were numerical, they were all included in the model. On the validation set it reached an accuracy of 55% and a recall of 55%, with 351 true negatives, 287 false positives, 276 false negatives, and 334 true positives.

The second model is K nearest neighbors (KNN). For each record, the model finds the closest training records by distance across the predictors and assigns the class held by the majority of those neighbors. The number of neighbors was set to 25. KNN itself has no random seed, so the seed of 12345 applies only to the train/validation split. This resulted in an accuracy of 50% and a recall of 53%. Reading the confusion matrix with rows as actual classes and columns as predicted classes, the model produced 293 true negatives, 345 false positives, 284 false negatives, and 326 true positives.
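To make the cells of a confusion matrix explicit, the sketch below unpacks the KNN matrix printed in the output above and recomputes the reported metrics from it. The matrix values are copied from that output. The precision figure is an extra metric not reported in the notebook.

```python
import numpy as np

# KNN validation confusion matrix copied from the notebook output.
# scikit-learn's layout: rows are actual classes, columns are predicted classes.
cm3 = np.array([[293, 345],
                [284, 326]])

# Row-major unpacking gives TN, FP, FN, TP for a binary 0/1 target
tn, fp, fn, tp = (int(v) for v in cm3.ravel())

accuracy = (tp + tn) / cm3.sum()
recall = tp / (tp + fn)       # share of actual donors the model caught
precision = tp / (tp + fp)    # share of predicted donors who really donated

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The same unpacking applied to the random forest matrix gives 351 true negatives and 334 true positives.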

### 3.2. Classification under asymmetric response and cost: What is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset? 

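Weighted sampling is used because donors are rare: at a 5.1% response rate, a simple random sample would be dominated by non-donors. A model trained on such a sample could score roughly 95% accuracy by predicting that no one donates, while learning very little about what distinguishes a donor. This bias toward the majority class can also produce results that look strong in training but do not carry over to the validation or testing set. Balancing the sample gives the model as many donor records as non-donor records to learn from, and the oversampling weights are applied afterward to restore the true response distribution when estimating profit.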
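The imbalance argument can be illustrated with a few lines of arithmetic. This is a hypothetical scenario, not a result from the notebook: it assumes a simple random sample of 3,120 records drawn at the real 5.1% response rate and a "model" that labels everyone a non-donor.

```python
# Illustrative only: hypothetical simple random sample at the true response rate
n_records = 3120
response_rate = 0.051

expected_donors = round(n_records * response_rate)
expected_non_donors = n_records - expected_donors

# Labeling everyone a non-donor is right on every non-donor
# and wrong on every donor.
naive_accuracy = expected_non_donors / n_records
naive_recall = 0 / expected_donors

print(expected_donors, expected_non_donors)
print(f"accuracy={naive_accuracy:.3f} recall={naive_recall:.3f}")
```

High accuracy with zero recall is exactly the outcome the balanced sample is meant to avoid, since the campaign only earns money from donors that are found.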

### 3.3. Calculate net profit: For each method, calculate the cumulative gains of net profit for both the training and validation sets based on the actual response rate (5.1%.) Again, the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. (Hint: To calculate estimated net profit, we will need to undo the effects of the weighted sampling and calculate the net profit that would reflect the actual response distribution of 5.1% donors and 94.9% non-donors. To do this, divide each row’s net profit by the oversampling weights applicable to the actual status of that row. The oversampling weight for actual donors is 50%/5.1% = 9.8. The oversampling weight for actual non-donors is 50%/94.9% = 0.53.) 

In [34]:
print(round(X_train['AVGGIFT'].sum(),2))
print(round(X_val['AVGGIFT'].sum(),2))
print(X_train.size)

19914.58
13440.44
41184


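The AVGGIFT values in the training set sum to 19,914.58. AVGGIFT is each record's average past gift, so this total is a proxy for giving rather than the amount received from this mailing.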

In [36]:
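# Calculate the mailing cost for the training set: one $0.68 mailing per record.
# X_train.size counts cells (rows x columns = 41,184), so count rows with len(X_train).
mailing_cost = len(X_train) * 0.68
print(f"There was a total of {mailing_cost:,.2f} dollars spent on mailings for the training set.")

There was a total of 1,272.96 dollars spent on mailings for the training set.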


In [38]:
print("The cumulative net for the training set at 5.1% is ""dollars")

The cumulative net for the training set at 5.1% is dollars
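The calculation described in the hint can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own computation: the function name `cumulative_net_profit` and the toy arrays are hypothetical, every donor is assumed to give the $13.00 average, and each row's net profit is divided by the oversampling weight of its actual class.

```python
import numpy as np

AVG_DONATION = 13.00
MAILING_COST = 0.68
# Oversampling weights from the hint: sample share / actual share
WEIGHT_DONOR = 0.50 / 0.051       # about 9.8
WEIGHT_NON_DONOR = 0.50 / 0.949   # about 0.53


def cumulative_net_profit(actual, prob):
    """Cumulative weight-adjusted net profit, mailing in descending order of prob."""
    actual = np.asarray(actual)
    prob = np.asarray(prob, dtype=float)
    # Row-level net profit: a donor returns the donation minus the mailing cost,
    # a non-donor only costs the mailing.
    profit = np.where(actual == 1, AVG_DONATION - MAILING_COST, -MAILING_COST)
    # Undo the oversampling by dividing by the weight of the row's actual class
    weight = np.where(actual == 1, WEIGHT_DONOR, WEIGHT_NON_DONOR)
    adjusted = profit / weight
    # Mail the most likely donors first
    order = np.argsort(-prob, kind="stable")
    return np.cumsum(adjusted[order])


# Hypothetical toy data, not taken from the fundraising file
toy_actual = [1, 0, 1, 0, 0]
toy_prob = [0.9, 0.8, 0.6, 0.4, 0.2]
gains = cumulative_net_profit(toy_actual, toy_prob)
print(np.round(gains, 3))
```

With the notebook's objects, the validation curve for KNN could be obtained with `cumulative_net_profit(y_val, model3.predict_proba(X_val)[:, 1])`, and the training curve by passing `y_train` and `X_train` instead.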


### 3.4. Draw cumulative gains curves: Draw the different models’ net profit cumulative gains curves for the validation set in a single plot (net profit on the y-axis, proportion of list or number mailed on the x-axis). Is there a model that dominates? 

In [39]:
#KNN Gains Chart
gains_df = pd.DataFrame({
    'actual': y_val,
    'prob': model3.predict_proba(X_val)[:, 1]
})

gains_df = gains_df.sort_values(by=['prob'], ascending=False).reset_index(drop=True)

gainsChart(gains_df.actual)
plt.show()



In [40]:
#Random Forest Gains Chart

gains_df = pd.DataFrame({
    'actual': y_val,
    'prob': model.predict_proba(X_val)[:, 1]
})

gains_df = gains_df.sort_values(by=['prob'], ascending=False).reset_index(drop=True)

gainsChart(gains_df.actual)
plt.show()
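The charts above plot cumulative counts of actual donors, one model per figure. A possible way to put both models' net profit curves into a single table, ready for one plot, is sketched below. The helper `profit_curves`, the toy labels, and the toy scores are hypothetical stand-ins for `y_val` and the models' `predict_proba` output, and the per-row profits use the weight adjustment from section 3.3.

```python
import numpy as np
import pandas as pd

# Weight-adjusted net profit per mailed record (figures from section 3.3)
DONOR_PROFIT = (13.00 - 0.68) / (0.50 / 0.051)
NON_DONOR_PROFIT = -0.68 / (0.50 / 0.949)


def profit_curves(actual, probs_by_model):
    """Return one cumulative net-profit column per model, indexed by number mailed."""
    actual = np.asarray(actual)
    row_profit = np.where(actual == 1, DONOR_PROFIT, NON_DONOR_PROFIT)
    curves = {}
    for name, prob in probs_by_model.items():
        # Each model ranks the same records in its own order
        order = np.argsort(-np.asarray(prob, dtype=float), kind="stable")
        curves[name] = np.cumsum(row_profit[order])
    index = pd.RangeIndex(1, len(actual) + 1, name="number_mailed")
    return pd.DataFrame(curves, index=index)


# Hypothetical labels and scores, not taken from the fundraising file
toy_actual = [1, 0, 1, 0, 1, 0]
toy_probs = {
    "Random Forest": [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],
    "KNN": [0.6, 0.7, 0.5, 0.8, 0.4, 0.3],
}
curves = profit_curves(toy_actual, toy_probs)
print(curves.round(2))

# A model dominates if its curve is never below the other's (small float tolerance)
rf_dominates = bool((curves["Random Forest"] >= curves["KNN"] - 1e-9).all())
print(rf_dominates)
```

Calling `curves.plot()` draws every column on one set of axes, with the number mailed on the x-axis and net profit on the y-axis. The dominance check on the toy data says nothing about the real models.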



### 3.5. Select the best model: From your answer in (4), what do you think is the “best” model? 

The random forest's validation accuracy and recall (55% and 55%) are only slightly higher than the KNN model's (50% and 53%), so I would opt to move forward with the KNN model. It is computationally less expensive to fit than the random forest, and the random forest is more difficult to tune because it has more hyperparameters. Also, since the data is numerical, the KNN model seems to be a better fit for this particular case.

### 4.1. Using your “best” model from Step 2 (number 5), which of these candidates do you predict as donors and non-donors? List them in descending order of the probability of being a donor. Starting at the top of this sorted list, roughly how far down would you go in a mailing campaign? 

In [41]:
# Load Data

fund = pd.read_csv('/Users/colebailey/Documents/USD/Data Science for Business/Module 4/FutureFundraising.csv')

fund.head(5)

Unnamed: 0,Row Id,Row Id.,zipconvert_2,zipconvert_3,zipconvert_4,zipconvert_5,homeowner dummy,NUMCHLD,INCOME,gender dummy,...,IC15,NUMPROM,RAMNTALL,MAXRAMNT,LASTGIFT,totalmonths,TIMELAG,AVGGIFT,TARGET_B,TARGET_D
0,1,3,0,1,0,0,1,1,1,1,...,3,42,92.0,29.0,15.0,17,8,15.333333,,
1,2,4,0,0,1,0,0,1,2,1,...,4,21,30.0,20.0,20.0,33,9,15.0,,
2,3,5,0,0,0,1,0,1,1,0,...,10,61,220.0,35.0,25.0,31,9,24.444444,,
3,4,1,0,0,0,0,1,1,4,0,...,21,32,41.0,19.0,19.0,31,13,13.666667,,
4,5,4,0,0,1,0,1,1,7,1,...,1,47,46.0,10.0,10.0,28,8,5.75,,


In [45]:
fund.shape

(2000, 24)

In [42]:
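# Score the future candidates with the selected KNN model (model3).
# The future file has no known outcomes, so the target columns are dropped instead of splitting.
X_future = fund.drop(['TARGET_B', 'TARGET_D'], axis=1)

# List candidates in descending order of predicted donor probability
scored = fund[['Row Id']].copy()
scored['prob_donor'] = model3.predict_proba(X_future)[:, 1]
scored['predicted_donor'] = model3.predict(X_future)
scored = scored.sort_values('prob_donor', ascending=False).reset_index(drop=True)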

After applying the original models to the new future fundraising set, the optimal depth is roughly the top 600 records of the list sorted by donor probability. This is the region where the maximum profit is secured before spending additional funds on less likely donors. Though more donations would be collected by proceeding further down the list, each additional mailing adds cost, so it is optimal to stop where the profit is highest, or roughly after 600 of the 2,000 future fundraising candidates.
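One way to turn "how far down the list" into a rule is to mail a candidate only while the expected weight-adjusted net profit of that mailing is positive. The sketch below is an illustration under assumptions, not the method used to reach the figure above: the helper `mailing_depth` and the toy probabilities are hypothetical, and the probabilities are assumed to come from a model trained on the balanced sample, which is why the weight-adjusted profits from section 3.3 are used.

```python
import numpy as np

# Weight-adjusted net profit per mailed record (figures from section 3.3)
DONOR_PROFIT = (13.00 - 0.68) / (0.50 / 0.051)
NON_DONOR_PROFIT = -0.68 / (0.50 / 0.949)


def mailing_depth(prob_donor):
    """Count candidates whose expected adjusted net profit is positive.

    Expected profit rises with the donor probability, so these candidates
    are exactly the top of the list sorted by probability.
    """
    prob = np.asarray(prob_donor, dtype=float)
    expected = prob * DONOR_PROFIT + (1 - prob) * NON_DONOR_PROFIT
    return int((expected > 0).sum())


# Hypothetical probabilities for five future candidates
toy_prob = [0.72, 0.64, 0.52, 0.48, 0.36]
depth = mailing_depth(toy_prob)
print(depth)
```

Under these assumptions the cutoff falls just above a predicted probability of one half, so the depth depends on how many candidates the model scores above that level.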

### Briefly explain, in two to three paragraphs, the business objective, the data mining models used, why they were used, the model results, and your recommendations to your non-technical stakeholder team. 

The business objective is to maximize the net profit of the direct-mail campaign by identifying which customers to reach out to. Each mailing costs $0.68 and the average response rate is only 5.1%, so the goal of the models is to estimate each customer's likelihood of responding. Individuals with a high likelihood of donating are worth the 68-cent investment.

The data mining models used were random forest and K nearest neighbors (KNN), both classification methods. The random forest was used because averaging many decision trees reduces the overfitting and errors of any single tree. The KNN model was used because it can be fine-tuned, through the number of neighbors, to group individuals into a potential donor or non-donor list. On the balanced validation set, the random forest reached 55% accuracy and 55% recall, and KNN reached 50% accuracy and 53% recall, so the models were modestly successful in predicting donors and non-donors.

When the models are applied to the future fundraising dataset, they produce an individual donation probability for each candidate. As a result, my recommendation is to rank candidates by this probability and mail those with the highest potential first in order to maximize profits.