# 1. Describe your solution to the problem with any assumptions you may make

Since Group B is a random sample from Group A, which was itself randomly selected from the entire population of eligible customers, we can assume that both Group A and Group B (the treatment group) are representative samples of the population. With a sample size of 17642, about 43% of Group A, Group B is a very good representation of Group A. The remainder (Group A - Group B, our control group), with a sample size of 23546, is also a very good representation of the non-campaigned users.


A chi-squared significance test (available in the `data_exploration` notebook) shows that the effect of the campaign was statistically significant. The campaign produced a 31% relative increase in credit card uptake: ~13% for campaigned users against ~10% for non-campaigned users. I also checked that, given the baseline conversion rate of non-campaigned customers of 9.9380%, the detected effect of the campaign, (13.0371-9.9380)%=3.0991%, comfortably satisfied the minimum sample size criterion (at 5% type 1 and 20% type 2 error). The sample required to detect a 3.0991% change was only ~1520 (we had 17642 samples). I used https://www.evanmiller.org/ab-testing/sample-size.html for this check.
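
As a rough cross-check of the figures above, the sketch below recomputes the test and the sample-size estimate using only the standard library. The conversion counts are reconstructed from the quoted rates and group sizes (an assumption, the raw counts live in the notebook), and the sample-size formula is one common two-proportion approximation that happens to land near the ~1520 quoted above; it is not necessarily the exact calculation behind the linked calculator.

```python
import math
from statistics import NormalDist

# Counts reconstructed from the quoted rates (assumption, not raw data):
# 13.0371% of 17642 campaigned and 9.9380% of 23546 non-campaigned customers.
n_treat, n_ctrl = 17642, 23546
yes_treat = round(0.130371 * n_treat)
yes_ctrl = round(0.099380 * n_ctrl)


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def sample_size_per_group(p_base, delta, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_base + delta
    sd_null = math.sqrt(2 * p_base * (1 - p_base))
    sd_alt = math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / delta ** 2


stat, p_value = chi2_2x2(yes_treat, n_treat - yes_treat, yes_ctrl, n_ctrl - yes_ctrl)
p_treat, p_ctrl = yes_treat / n_treat, yes_ctrl / n_ctrl
relative_lift = p_treat / p_ctrl - 1
required_n = sample_size_per_group(p_ctrl, p_treat - p_ctrl)

print(f"chi2={stat:.1f}, p={p_value:.2e}, lift={relative_lift:.1%}, n={required_n:.0f}")
```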

With those checks out of the way, we can proceed to modelling the classifiers that can be used to predict outcomes for the remaining eligible customers.

I have written an engine that can take any sklearn-compatible classifier and optimise its performance by selecting model hyperparameters and a classifier threshold, with a campaign-cost-specific optimisation goal in mind. This setup makes it easy to optimise not only the hyperparameters, but also the choice of classifier algorithm itself. I am running out of time to try a neural-network-based binary classifier, but it can be done relatively easily. A Bayesian model would also be interesting to explore given a bit more time, as it would provide not only a mean prediction but also the uncertainty of that prediction. This can help with decisions: a class probability with a narrow spread gives us more confidence in a prediction than one with a wide spread. A smaller spread allows us to confidently include or discard a customer from future campaign considerations, and vice versa.

Once a classifier is chosen and the general relationships in the data are established, the customer base can be filtered further for future, more tailored campaigns. While the classification models are useful for providing general insights on the entire dataset, a quick data exploration shows various important features that selectively raise the effectiveness of the campaign. With a bit more time in hand, I would build a recommender system, in addition to these classifiers, that is designed to work on the whole dataset.


### Cost optimisation: A cost assumption of misclassification


A binary classification problem is a compromise between type 1 and type 2 
error, i.e., if we prioritise the prediction of the positive class, the 
negative class will tend to have lower accuracy. 

In order to optimise the model, we have to assume 

    (a) cost of missing a sale due to a customer not being campaigned, and
    (b) a cost of the campaign

Let's assume (as these costs have not been supplied) that the cost of the 
campaign is only 10% of the cost of missing a customer who would have 
purchased the product if campaigned. This gives our classifier target 
misclassification levels for both the 'no' (our majority class) and 'yes' 
(minority class) categories. 

This allows us to set the target balance of FNs and FPs for our classification.
If we define `yes` in the response variable as our positive class (the card 
is taken), then with the above assumption our (cost) optimal classifier will 
allow 10 times as many false positives as false negatives. This also means 
that a high recall is more important than precision. So while picking 
optimisation thresholds and hyperparameters for classification, we will pick 
the one that is as low as possible in both FN and FP, with FP ~ 10FN, 
reflecting that an FN is 10x less desirable than an FP.
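
To make the cost assumption concrete, here is a minimal sketch of the cost model described above. The function names and the example counts are hypothetical and chosen for illustration only; the only input taken from the text is the assumed 10% cost ratio.

```python
def campaign_cost(fp, fn, cost_fn=1.0, cost_ratio=0.1):
    """Total misclassification cost in units of one missed sale.

    cost_ratio is the assumed cost of campaigning to one customer (a false
    positive) relative to the cost of one missed sale (a false negative).
    """
    return fn * cost_fn + fp * cost_fn * cost_ratio


def fp_fn_ratio(fp, fn):
    """Ratio used as the FP ~ 10FN target; infinite when there are no FNs."""
    return fp / fn if fn else float("inf")


# Hypothetical counts: 1000 false positives and 100 false negatives.
total = campaign_cost(fp=1000, fn=100)
ratio = fp_fn_ratio(fp=1000, fn=100)
print(total, ratio)
```

At FP = 10FN the two error types contribute equally to the total cost, which is the balance point the text targets.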
 
To address this, we can experiment with the custom `config.fbeta2` scoring 
function to see if it helps speed up the optimisation. 
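
The body of `config.fbeta2` is not reproduced here. A plausible shape, assuming it is an F-beta score with `beta=2` (which weights recall more heavily than precision), is sketched below with scikit-learn's `fbeta_score` and `make_scorer`. The name `fbeta2_sketch` and the toy labels are hypothetical.

```python
from sklearn.metrics import f1_score, fbeta_score, make_scorer


def fbeta2_sketch(y_true, y_pred):
    """F-beta with beta=2: recall counts more than precision."""
    return fbeta_score(y_true, y_pred, beta=2)


# A scorer object like this can be passed as `scoring=` to GridSearchCV.
fbeta2_scorer = make_scorer(fbeta2_sketch)

# Toy labels: TP=2, FN=1, FP=2, so recall (2/3) is higher than precision (1/2).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
f2 = fbeta2_sketch(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f2, f1)
```

Because recall exceeds precision in the toy example, the F2 value comes out above F1, which is the behaviour wanted when FNs are the more expensive error.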

With that out of the way, we can now choose the classifier threshold and 
hyperparameters, and with them the precision and recall.

## Narrowing down the customer base for future campaigns

We should only campaign to customers that are otherwise unlikely to get the 
credit card without a campaign, and those who can be influenced to buy the 
credit card.

### Who we don't need to campaign to

There are two types of these customers from the two extremes:

(a) Those who would take the card without a campaign: for this we take the 
customers in the provided dataset who were not campaigned (Group A - 
Group B). We model these non-campaigned customers and find the customers 
with a high (`>70%`, arbitrarily set) probability of getting the card 
without any campaign. 
 
 Since these are customers with a high probability of taking the card without a 
 campaign, we should not be wasting campaign budget on these customers.

(b) Those who would not take the card even when campaigned: customers who 
were campaigned (Group B) but still came back with a low probability `<30%` 
(again arbitrarily set) of getting the card will have an even lower 
probability of getting the card if not campaigned. These are customers who 
would not get the card irrespective of the campaign. However, if this 
analysis is to be applied to non-campaigned customers, we need to remove the 
campaign variables ("contact", "month", "day_of_week", "duration") from it, 
as these won't be available for non-campaigned customers. 

These (`70%` and `30%`) probabilities will be informed by business 
requirements and can probably be optimised further based on the cost of the 
campaign vs the cost of a missed credit card sale.
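
The two exclusion rules can be sketched as a small function. It assumes two probabilities per customer are already available, one from the non-campaigned model and one from the campaigned model; the function name and segment labels are hypothetical, and the `70%`/`30%` cut-offs are the arbitrary values from the text.

```python
def segment_customer(p_without_campaign, p_with_campaign, high=0.70, low=0.30):
    """Assign a customer to a campaign segment.

    p_without_campaign: probability of taking the card from the control model.
    p_with_campaign: probability of taking the card from the treatment model.
    """
    if p_without_campaign > high:
        return "self_converter"   # (a) likely to take the card anyway
    if p_with_campaign < low:
        return "unresponsive"     # (b) unlikely to take it even if campaigned
    return "campaign"             # everyone else is worth campaigning to


print(segment_customer(0.85, 0.90))
print(segment_customer(0.05, 0.10))
print(segment_customer(0.20, 0.55))
```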


### Who we need to campaign to

Customers not selected in the previous step should be looked into further. 
These are the customers that can potentially be influenced by a campaign, 
and campaign budget should be spent on this group. During model building and 
data exploration we found several factors that affect the effectiveness of 
the campaign. For example, if the campaign is rolled out to the non-campaigned users who have not already taken out the card, a good place to start is the False Positives from the non-campaigned users, as these customers have shown a high likelihood of taking the card.

Starting from the FPs above, a further level of customisation can be achieved by finding non-campaigned customers whose scores are similar to those of successfully campaigned customers, using a similarity scoring technique such as a recommender system. 
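
One simple form such similarity scoring could take is cosine similarity between each non-campaigned customer and the average profile of successfully campaigned customers. This is only a sketch of one possible technique, not the recommender system itself; it assumes customers are already encoded as numeric feature vectors, and the tiny arrays below are made up.

```python
import numpy as np


def cosine_similarity_to_centroid(candidates, converters):
    """Cosine similarity of each candidate row to the mean converter profile."""
    centroid = converters.mean(axis=0)
    numerator = candidates @ centroid
    denominator = np.linalg.norm(candidates, axis=1) * np.linalg.norm(centroid)
    # Guard against all-zero rows, which would otherwise divide by zero.
    return numerator / np.where(denominator == 0, 1.0, denominator)


# Made-up encoded features: rows are customers, columns are features.
converters = np.array([[1.0, 0.0], [1.0, 0.2]])   # campaigned and took the card
candidates = np.array([[2.0, 0.2], [0.0, 1.0]])   # non-campaigned customers

similarity = cosine_similarity_to_centroid(candidates, converters)
ranking = np.argsort(-similarity)                 # most similar customers first
print(similarity, ranking)
```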


# 2. Implement your solution in Python with a well-commented out code, by emphasising the following points:
#    • If you select to build a statistical model:
# -  Specify reasoning behind the selection of your algorithm

The attached Python code uses the file `config.py` for settings, written directly in Python.
Parameters from `config.py` (often referred to as the `config` file) are used in `campaign.py` for various jobs. The two main functions in `campaign.py` are driven by `config.py` and help us perform various analyses. 

Since the majority and minority classes in this dataset are imbalanced, there is provision in `config.py` to use various classifiers and data resampling techniques from the very useful [imbalanced-learn package](https://github.com/scikit-learn-contrib/imbalanced-learn). A simple class-weight-based approach is also attempted for each algorithm, in addition to the more advanced oversampling techniques from `imblearn`.
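
For the class-weight-based approach, the usual "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`, which is what scikit-learn applies when `class_weight="balanced"` is set. The sketch below computes it by hand on a hypothetical 90/10 label split, roughly matching the ~10% uptake in this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 90 'no' (majority) and 10 'yes' (minority).
labels = ["no"] * 90 + ["yes"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each minority-class error costs the classifier nine times as much as a majority-class error during training.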

The reason behind choosing a tree-based classifier (RF, XGBoost) is that these are very good general-purpose classifiers that can be used to quickly establish a benchmark for our dataset. 

Once a benchmark is available, one can choose other classifiers based on the requirements (data volume, speed of training vs prediction). When data volume is low and speed is also a concern, something like LogisticRegression can offer a good compromise.

When data volume is large and speed of training is less important, one can use a stochastic gradient descent based classifier (SGDClassifier being one of them). Prediction with SGDClassifier is very fast.


# -  From the given dataset, which variables you use to build such a model/solution

I have considered three different types of models.

    1. Treatment group model: a classifier based only on campaigned customer (Group B) data,
    2. Control group model: a classifier based only on non-campaigned customer (Group A - Group B) data,
    3. All data model: a classifier based on all of Group A, including both campaigned and non-campaigned customers

For 1, we can use all of the following:

    Categorical: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact','month', 'day_of_week', 'poutcome'
    Ordinal: 'age', "cons.price.idx", "cons.conf.idx"

For 2, we have to discard the campaign-related features:

    Categorical: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome'
    Ordinal: 'age', "cons.price.idx", "cons.conf.idx"

Of the above, the monthly data looks suspicious: whenever the campaign success rate (column `y`) is high, the volume (count) is low (see the `data_exploration` notebook). I would go back and ask for clarification on how this data is being reported. 

Also, the campaign ran for a few months (apr-dec), during which the consumer price index and consumer confidence index varied. Non-campaigned users who did not take up the offer still have a value for "cons.price.idx" and "cons.conf.idx", so the significance of these features for non-campaigned customers is questionable. A natural question comes to mind: exactly on which day was this data recorded for non-campaigned customers? We should seek further clarification of our data for these features. 

Model 3 may also shed some light, as in this case we will include a boolean variable for the campaign. 
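
The feature choices for the three models can be written down as a small lookup. This is an illustrative sketch rather than the contents of `config.py`: the constant names, the `features_for` helper and the `campaigned` column name are hypothetical, and dropping the campaign-related features from model 3 is an assumption (they are not available for non-campaigned customers).

```python
# Feature names are taken from the lists above.
CUSTOMER_CATEGORICAL = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]
CAMPAIGN_CATEGORICAL = ["contact", "month", "day_of_week"]
NUMERIC = ["age", "cons.price.idx", "cons.conf.idx"]  # listed as 'Ordinal' above


def features_for(model):
    """Return the feature list for 'treatment', 'control' or 'all'."""
    if model == "treatment":
        return CUSTOMER_CATEGORICAL + CAMPAIGN_CATEGORICAL + NUMERIC
    if model == "control":
        return CUSTOMER_CATEGORICAL + NUMERIC
    if model == "all":
        # Hypothetical boolean flag marking whether the customer was campaigned.
        return CUSTOMER_CATEGORICAL + NUMERIC + ["campaigned"]
    raise ValueError(f"unknown model: {model!r}")


print(features_for("control"))
```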


# - How would you do the hyper-parameter selection of the selected algorithm

A grid search can be performed to optimise the hyperparameters of a 
classifier. Use `optimise=True` in the `config.py`

    optimise=True

Optimisation grid `p_grid` specific to a classifier can be supplied when 
`GridSearch` is run. Example for `LogisticRegression` and 
`RandomForestClassifier` are provided in the config file.

The `GridSearchCV` implementation in the code allows optimisation based on any
of 'precision', 'recall', 'accuracy', 'f1-score', or another custom score 
function, which can be integrated into the setup relatively easily. An 
example of a custom scorer can be found in the `fbeta2` function in the 
config file.

If `optimise=True` is set in config, the `campaign.analyze` function will perform a hyperparameter search in the space set by `config.p_grid`, and will select the best hyperparameter combination found by the grid search as another potential classifier (amongst others).
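
A minimal sketch of what such a search might look like with scikit-learn is shown below. It is not the code in `campaign.analyze`: the grid values are made up, the data is synthetic (generated with `make_classification` to mimic the class imbalance), and recall is used as the scoring metric purely as an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the campaign data (about 12% positives).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.88, 0.12], random_state=0
)

# Hypothetical grid; the real one would come from config.p_grid.
p_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=p_grid,
    scoring="recall",  # could also be a custom scorer built with make_scorer
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Stratified folds keep the minority-class share roughly constant across splits, which matters when only about one customer in ten is a positive.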

# - What are the most important variables in the model/solution you built?

This depends on the options chosen and the classifier algorithm. 

#### Feature Selection
A feature selection can be performed using [Recursive Feature Elimination (RFE)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)
for classifiers that expose a `coef_` or `feature_importances_`
attribute. Caution should be exercised because, in the presence of 
collinear features, a process like feature importance simply replaces one 
feature with a dependent one (for example, age vs job categories in this dataset).
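
As an illustration of the RFE step, the sketch below runs scikit-learn's `RFE` around a `LogisticRegression` (which exposes `coef_`) on synthetic data. The number of features to keep is an arbitrary choice for the example, not a value taken from `config.py`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 carry independent signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# Repeatedly fit the model and drop the weakest feature until 4 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

selected = selector.support_   # boolean mask of the retained features
ranking = selector.ranking_    # 1 = retained, larger = eliminated earlier
print(selected, ranking)
```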

#### Feature significance test
On the features selected in the previous step, a further significance test 
is performed using the `statsmodels` `Logit` class, which provides `pvalues` 
in addition to estimating the regression coefficients. Only features with 
`pvalues` less than `alpha` are accepted. Note that this step can result in 
a linear algebra error if the data matrix supplied to the `Logit` class is 
singular or low rank. This should integrate well with `LogisticRegression`.

#### Choosing the optimum classifier threshold
After hyperparameter optimisation, the best model is chosen. The optimum threshold for the best model is then selected from a precision-recall curve with our special cost model in mind. Given the condition we are working with (recall more important than precision), for the RF model with campaigned-only customers, a threshold of 0.18 was found to be optimal (see `campaign-rf.ipynb`). 

For LogisticRegression, a decision threshold of `t=0.43` allowed us to hit the FP~10FN criterion we set, with the following confusion matrix for non-campaigned customers (rows are actual classes, columns are predicted classes):

        pred_0  pred_1
    0   3780    2598
    1   256     430

For campaigned users (Group B), the optimal decision threshold for LogisticRegression was found to be `t=0.435`, with the following confusion matrix (rows are actual classes, columns are predicted classes):

        pred_0  pred_1
    0   3014    1604
    1   184     491
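
The threshold search can be sketched as a scan over candidate thresholds that keeps the one whose FP/FN ratio is closest to the target of 10. This is a simplified stand-in for the precision-recall-curve based selection in the notebooks. The function names are hypothetical and the scores are synthetic; only the `reported` counts are read off the two confusion matrices above, assuming rows are actual classes.

```python
import numpy as np


def confusion_counts(y_true, scores, threshold):
    """Return (FP, FN) when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    return fp, fn


def threshold_for_ratio(y_true, scores, target_ratio=10.0):
    """Scan thresholds and return the one with FP/FN closest to target_ratio."""
    best_t, best_gap = None, float("inf")
    for t in np.linspace(0.01, 0.99, 99):
        fp, fn = confusion_counts(y_true, scores, t)
        if fn == 0:
            continue  # ratio undefined without false negatives
        gap = abs(fp / fn - target_ratio)
        if gap < best_gap:
            best_t, best_gap = float(t), gap
    return best_t


# FP and FN from the reported matrices: the ratios are roughly 10.1 and 8.7.
reported = {"non-campaigned": (2598, 256), "campaigned": (1604, 184)}
reported_ratios = {name: fp / fn for name, (fp, fn) in reported.items()}

# Synthetic scores for illustration only (about 12% positives).
rng = np.random.default_rng(0)
y_demo = (rng.random(5000) < 0.12).astype(int)
scores_demo = np.clip(rng.normal(0.35, 0.15, 5000) + 0.25 * y_demo, 0.0, 1.0)
t_star = threshold_for_ratio(y_demo, scores_demo)
print(reported_ratios, t_star)
```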


Some of the variables that come out as important from the `RandomForestClassifier`
trained only on campaigned customers (Group B):

['age', 'cons.conf.idx', 'cons.price.idx', 'previous', 'job_admin.', 'job_blue-collar', 'job_services', 'job_technician', 'marital_married', 'marital_single', 'education_basic.4y', 'education_basic.9y', 'education_high.school', 'education_university.degree', 'default_no', 'default_unknown', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes']

The above variables, however, are dependent on the config parameters (feature selection parameters, whether the significance test was performed, etc.).

# - Any other considerations you may have to make given the characteristics of the dataset

Several considerations:
    
    1. Imbalanced dataset, so use either a weighted classifier, or use a class balancer
    2. The data is a mix of ordinal and categorical features, so we need to one hot encode the categories.
    3. To be able to use the same data prep for classifiers that are scale dependent (like SVC, or GPClassifier), we normalise the ordinal data. This works well with one hot encoded data for most classifiers.
    4. Ignore the 'duration' column. The duration of a call is only available once a campaign has been performed, and it reflects the outcome: for example, someone who does not want a card at all will not want to talk for a long time.
    5. Given the campaign was on for a few months (apr-dec), while the consumer price index and consumer confidence index varied, the significance of these features for non-campaigned users who did not take up the offer is questionable. We should seek further clarification of our data for these features.
    6. I would have liked to take a closer look at collinearity of features, as some features may be correlated: age and job type (think students/retirees vs age), education and job type, housing loan and age.
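
Points 2 to 4 can be combined into a single preprocessing step. The sketch below uses scikit-learn's `ColumnTransformer` with `OneHotEncoder` and `StandardScaler`; it is an assumption about how the data prep could be organised, not the code in `campaign.py`, and the four rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using a few of the dataset's column names.
df = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "cons.price.idx": [93.2, 93.9, 94.5, 93.2],
    "job": ["admin.", "retired", "retired", "student"],
    "marital": ["single", "married", "divorced", "single"],
    "duration": [120, 300, 45, 600],  # only known after a call, so it is dropped
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "cons.price.idx"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"]),
    ],
    remainder="drop",  # anything not listed (here 'duration') is discarded
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Setting `handle_unknown="ignore"` means a category unseen at training time is encoded as all zeros instead of raising an error, which does not make the prediction for such a customer reliable.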

# - How would you validate your model/solution
    
I have used several techniques already in this exercise:
    
    1. Train/test split
    2. GridSearchCV, which uses cross-validation; I used StratifiedKFold
    3. With a bit more time, I would check train/test accuracy differential, bias/variance trade offs etc.
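
A minimal sketch of the train/test differential check from point 3, on synthetic data and with recall as the metric (an arbitrary choice for the example): a large gap between train and test scores would suggest overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the campaign data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.88, 0.12], random_state=0
)

# Stratify so that train and test keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
).fit(X_train, y_train)

train_recall = recall_score(y_train, clf.predict(X_train))
test_recall = recall_score(y_test, clf.predict(X_test))
gap = train_recall - test_recall
print(train_recall, test_recall, gap)
```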
    

# 3. How would you use the proposed model in the actual campaign?

While a generalised classifier works reasonably well on the whole dataset, the recommendation can be further improved by selectively choosing the customers using various classification models. Several considerations have been proposed in the first section already. 

Here I will summarise the important categories that showed a significant campaign-specific effect; the details can be found in the `data_exploration` notebook.

Job:

    1. Campaign was almost 2x more effective on `unknown` and `retired` people than on any other job type. Maybe the `unknown` job category covers people with multiple jobs, or people with high income who did not disclose their job type
    2. Job type `admin` also showed significant campaign effectiveness
    3. Campaign was least effective on `services`, `students` and `blue-collar` job types

Education:

    1. Campaign had a higher impact on people with unknown, and illiterate education types and least impact on people with `basic education`
    
Marital:

    1. Campaign had a higher effect on unknown and single marital types, and the least effect on divorced customers
    
Default:

    1. Campaign had higher effect with 'no' default customers, and negligible effect on 'unknown' default types.

Loan:

    1. Campaign was 2x more effective on customers with 'unknown' loan type
    
Month:

    1. In general, the month effect was ambiguous, as for each month with a lower campaign count the conversion rate was much higher than in the months with a much larger campaign count. I would go back and ask for clarification on how this data is being reported.
    
    
Contact:

    1. Campaign was ~3x more effective with the cellular contact type

Day_of_the_week: There was no significant difference

Previous:

    1. The number of `previous` attempted contacts was a very good predictor, with campaign success going up with the number of 'previous' contacts.

Poutcome:
    
    1. Campaign was more successful for previously 'success' poutcome customers and least successful for previously 'failure' poutcome customers.


I am addressing this point here:

# """Now the marketing team wants to roll-out this campaign to the entire customer population who are eligible for this product."""

So our classification models can be used on the entire set of eligible customers as follows:

    1. Discard customers that have already taken up this card with or without the campaign (nothing to win here).
    2. Classification model based on non-campaigned customers (control set): Prediction from this model on all of the eligible customers can be used to discard customers who are predicted to have a high probability (say 70%) of getting the card without a campaign. Let's call this the over investment category.
    3. Classification model based on campaigned customers (treatment set, not including campaign features): Prediction using this model on all of the eligible customers can be used to discard customers who have very low chance of getting the card (say 30%). We can say this is the bad investment category and the campaign will be mostly wasted on this group.
    4. Customers with a moderately high probability level (say 30-70%) from the non-campaigned customers model should be campaigned.
    5. A recommendation system can be set up to find similar customers that are likely to take the card if campaigned based on the campaign data and customer features that are influenced by campaigns as identified above (during data exploration).
    6. The data exploration notebook insights also tell us that we can be more specific by campaigning to (a) 'cellular' contacts, (b) 'unknown' loan type, (c) 'no' default, (d) 'unknown' and 'single' marital status, (e) 'unknown' and 'illiterate' education type, (f) 'unknown', 'retired' and 'admin' job type customers.
    7. Although we don't have this data, it would be possible to design a more personalised campaign if a customer's past campaign history was available for each previous campaign rather than just the `poutcome` binary. It's quite possible some customers are rolling over credit card balances regularly, and given the right incentives (via a campaign), these customers may take this card offer as opposed to a competitor's offer.


# 4. Now assume that the Bank decides to change the eligibility criteria of this product so that they could include more eligible customers in the campaign they want to run using your solution, where some of these customers could look different from the customers they ran the random campaign on: 
#        a. Do you think whether you would be able to apply the same solution/model you build above as it is with this new extended customer set?
        
Changing the eligibility criteria implies a change in the feature space of the model. So customers that come from a different feature space cannot be modelled using the current model(s). 

For example, if a new feature is introduced, or a new category is introduced in an existing feature, then the current model will not be applicable. For continuous features, if the new eligibility criteria select customers from outside the range of the continuous features modelled, the model predictions can be very erroneous.

There is another way we can argue that the modelling assumptions are not valid when the eligibility criteria is changed. The random sampling of customers (Group B) was performed from Group A, which was randomly selected from the population of eligible customers. If the eligibility criteria is changed, the population will change, and the conclusions drawn based on a sample from a different population can not be applied.
        
#        b. If you think you’ll have to do any adjustments to your solution, specify why you would have to do that and what those adjustments are (no need of implementation).
        
For new categorical features, we will need to remodel. For continuous features that are outside our range, we will also need to remodel, as the model performance will be very bad outside the standardised range of the ordinal features.
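
Before deciding whether to remodel, a simple check can flag which new customers fall outside the feature space seen in training. The sketch below is illustrative: the function name is hypothetical, and the ranges and categories are made-up placeholders that would in practice be read from the training data.

```python
def out_of_support(customer, numeric_ranges, known_categories):
    """Return the names of features whose values were not seen in training."""
    problems = []
    for name, (low, high) in numeric_ranges.items():
        if not low <= customer[name] <= high:
            problems.append(name)
    for name, categories in known_categories.items():
        if customer[name] not in categories:
            problems.append(name)
    return problems


# Placeholder training support (hypothetical values).
numeric_ranges = {"age": (18, 90)}
known_categories = {"job": {"admin.", "retired", "student", "unknown"}}

seen = out_of_support({"age": 35, "job": "retired"}, numeric_ranges, known_categories)
unseen = out_of_support({"age": 16, "job": "gig-worker"}, numeric_ranges, known_categories)
print(seen, unseen)
```

Customers with an empty result could still be scored with the current models, while any flagged feature points to a need for new campaign data and a refit.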