# Marketing Campaign Success for a Portuguese Bank

The business in this study, a Portuguese bank, ran marketing campaigns to persuade clients to subscribe to a deposit that offered good interest rates. Seventeen campaigns were conducted between May 2008 and November 2010, for a total of 79354 contacts. The campaigns resulted in 6499 successful subscriptions, a success rate of approximately 8%.

The business goal of this project is to develop a model that identifies the factors that determine whether a marketing contact will be successful, i.e. whether a client will subscribe to the deposit.
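The success rate quoted above follows directly from the two reported counts. A minimal check, using only the figures stated in the text:

```python
# Figures reported in the text above.
total_contacts = 79354
successful_subscriptions = 6499

# Success rate = successful subscriptions / total contacts.
success_rate = successful_subscriptions / total_contacts
print(f"Overall success rate: {success_rate:.2%}")  # prints: Overall success rate: 8.19%
```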

## Data

The data are a subset of the contacts consisting of 41187 records. Each record describes a potential client using seven attributes: age, job type, marital status, education level, history of credit default, housing loan status, and personal loan status. The distribution of the target variable ('success'), which records whether the contact succeeded in subscribing the client, is shown below. The target is highly imbalanced: 88.73% of the contacts in this subset were unsuccessful and 11.27% were successful.

![success.png](attachment:success.png)

##### Numerical Predictor Variable

Age was the only numerical predictor variable. The boxplot below shows the distribution of age for each success category. The two distributions overlap, so the success categories do not form separate populations by age. Age probably has little predictive value for the success of a contact.

![age_success.png](attachment:age_success.png)
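The comparison in the boxplot can also be made numerically by looking at the age quartiles per success category. The sketch below is only an illustration: the `demo` frame is synthetic stand-in data generated at random, not the bank's records, and the column names `age` and `success` are assumed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the bank's records); column names are assumptions.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "success": rng.choice(["no", "yes"], size=1000, p=[0.89, 0.11]),
})

# Quartiles of age per success category. Heavily overlapping quartile
# ranges suggest that age alone separates the two classes poorly.
age_summary = demo.groupby("success")["age"].describe()[["25%", "50%", "75%"]]
print(age_summary)
```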

##### Categorical Predictor Variables

There are six categorical predictor variables: job, marital, education, default, housing and loan. The chi-square statistic for each variable with respect to success is tabulated below. A higher chi-square value indicates stronger evidence of an association between the variable and the response variable, although values for variables with different numbers of categories are not directly comparable. Based on the data, job and education are most strongly associated with the success of a contact. Barplots of these variables are shown below.

![chi_square.PNG](attachment:chi_square.PNG)

![job.png](attachment:job.png) 

![eduction.png](attachment:eduction.png)
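A chi-square table of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the code used for the study: the data are synthetic, the helper `chi_square_table` is a hypothetical name, and only two of the six predictors are mocked up. It uses `scipy.stats.chi2_contingency` on a contingency table built with `pandas.crosstab`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data (NOT the bank's records).
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "job": rng.choice(["admin.", "blue-collar", "student", "retired"], size=n),
    "housing": rng.choice(["yes", "no"], size=n),
    "success": rng.choice(["no", "yes"], size=n, p=[0.89, 0.11]),
})

def chi_square_table(df, predictors, target):
    """Hypothetical helper: chi-square test of each predictor against the target."""
    rows = []
    for col in predictors:
        contingency = pd.crosstab(df[col], df[target])
        chi2, p_value, dof, _expected = chi2_contingency(contingency)
        rows.append({"variable": col, "chi2": chi2, "dof": dof, "p_value": p_value})
    table = pd.DataFrame(rows).sort_values("chi2", ascending=False)
    return table.reset_index(drop=True)

chi_table = chi_square_table(demo, ["job", "housing"], "success")
print(chi_table)
```

Reporting the degrees of freedom and p-value next to each statistic makes it easier to compare variables that have different numbers of categories.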

## Models

##### Zero Model or Baseline Model

The baseline model classifies every contact as 'no' success. This gives an accuracy of 88.73%, the share of unsuccessful contacts in the data. Any fitted model should improve on this accuracy. Because the baseline never classifies an observation as positive, its recall is zero and its precision, which is strictly undefined with no positive predictions, is reported as zero percent.
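The baseline figures can be reproduced with scikit-learn's `DummyClassifier`. The labels below are illustrative only: 10,000 made-up rows constructed to match the class proportions reported above, not the actual records.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative labels: 88.73% 'no' (0) and 11.27% 'yes' (1), as reported above.
y = np.array([0] * 8873 + [1] * 1127)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class, i.e. 'no'.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
# Precision is 0/0 here; zero_division=0 reports it as 0 without a warning.
baseline_precision = precision_score(y, y_pred, zero_division=0)
baseline_recall = recall_score(y, y_pred)

print(baseline_accuracy, baseline_precision, baseline_recall)
```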

##### Classification Models

K-nearest neighbors (KNN), logistic regression, decision tree and support vector machine (SVC) models were used to classify the contacts as successful or not. The results of the models are shown below. Hyperparameters were tuned with grid search, using accuracy as the selection metric, but the scores did not improve.

![model_metrics-2.PNG](attachment:model_metrics-2.PNG)
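A comparison of the four model families might be set up roughly as follows. This is a sketch on synthetic data from `make_classification` with a similar class imbalance. The default hyperparameters, the split and the resulting scores are assumptions for illustration and do not reproduce the table above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 11% positives); NOT the bank's records.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.89, 0.11], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svc": SVC(),
}

# Fit each model and collect the three metrics on the held-out test set.
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, scores)
```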

The logistic regression model reproduced the zero model: it did not classify any observations as 'yes' for success. The KNN, decision tree and SVC models were slightly less accurate than the baseline model, but they improved on its precision and recall. Confusion matrices on the test data are shown below.

![knnCM.png](attachment:knnCM.png)

![dtCM.png](attachment:dtCM.png)

![svcCM.png](attachment:svcCM.png)

Looking at the confusion matrices, the decision tree model correctly predicted the most 'yes' contacts. If the goal is to identify as many of the successful contacts as possible, recall is the most useful metric, and by that measure the decision tree model has the best performance.
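Recall is the share of actual 'yes' contacts that a model labels 'yes', and it can be read straight off a confusion matrix. The toy labels below are made up for illustration and are unrelated to the matrices shown above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (1 = 'yes', 0 = 'no').
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)      # fraction of actual 'yes' that were found
precision = tp / (tp + fp)   # fraction of predicted 'yes' that were correct
print(f"recall={recall:.2f}, precision={precision:.2f}")
```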

##### Feature Selection

Although the logistic regression reproduced the zero model, its coefficients can still be used to rank features with the SelectFromModel function. The table below shows the top ten features ranked by the absolute value of their coefficients. Job types of student, retired, unemployed and unknown were positively correlated with success. Default_unknown, marital status of divorced and married, and job types of blue-collar, entrepreneur, and services were negatively correlated with success.

![feature_coef-2.PNG](attachment:feature_coef-2.PNG)
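One way to obtain such a ranking is sketched below. It assumes one-hot encoded predictors and uses synthetic data in which students are made more likely to subscribe by construction, so the output says nothing about the real clients. `SelectFromModel` is used with `threshold=-np.inf` so that selection depends only on `max_features`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the bank's records).
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "job": rng.choice(["admin.", "blue-collar", "student", "retired"], size=n),
    "marital": rng.choice(["single", "married", "divorced"], size=n),
})
# By construction, students subscribe more often than everyone else.
p_yes = np.where(demo["job"] == "student", 0.4, 0.1)
y = (rng.random(n) < p_yes).astype(int)

# One-hot encode so every category gets its own coefficient.
X = pd.get_dummies(demo, dtype=float)

# Keep the 3 features with the largest absolute coefficients.
selector = SelectFromModel(LogisticRegression(max_iter=1000),
                           threshold=-np.inf, max_features=3).fit(X, y)

coefs = pd.Series(selector.estimator_.coef_[0], index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
selected = list(X.columns[selector.get_support()])

print(ranked)
print(selected)
```

The sign of each coefficient gives the direction of the association, and the absolute value gives the ranking. Comparing magnitudes this way is reasonable here because all one-hot features share the same 0/1 scale.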

## Findings

Recall was the most useful metric, and the decision tree model performed the best of all the classification models. The data were highly imbalanced, and it was difficult to improve on the baseline model.

##### Factors most important in the success of a contact

* A client having no history of credit default is more likely to accept a subscription.
* A client's job type may be predictive of whether they accept a subscription. 
    + Student, retired, unemployed and unknown are positively correlated with success.
    + Blue-collar, entrepreneur, services are negatively correlated with success.
* A client's marital status may be predictive of success.
    + Married and divorced clients were less likely to accept a subscription.
* An unknown credit default history was negatively correlated with success.

## Next Steps and Recommendations

Accuracy was the metric used for hyperparameter tuning with grid search. As discussed above, recall is likely a more useful metric. Re-running model selection with recall as the scoring metric could yield a better model and improve the findings of this study.
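A recall-driven grid search could look roughly like the sketch below. It runs on synthetic data, and the parameter grid, including the `class_weight` option, is an assumption chosen for illustration rather than the grid used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 11% positives); NOT the bank's records.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.89, 0.11], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical grid; class_weight="balanced" up-weights the rare 'yes' class.
param_grid = {
    "max_depth": [3, 5, None],
    "class_weight": [None, "balanced"],
}

# scoring="recall" selects hyperparameters by recall instead of accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_recall, 3))
```

Optimizing recall alone can favor models that label many contacts as 'yes', so it may be worth checking precision alongside it, or scoring with F1 (`scoring="f1"`) as a compromise.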