# Introduction to the Churn Business Problem

In this project, we predict customer churn in the banking industry. Consider the following scenario:

- Bank XYZ has been observing a lot of customers closing their accounts or switching to competitor banks over the past couple of quarters.

- This has put a huge dent in quarterly revenues and might drastically affect annual revenues for the ongoing financial year, causing the stock price to plunge and market cap to fall by X%.

- Consequently, the leadership team has sprung into action, building a team of people from business, product, engineering and data science to arrest this slide. In other words, some interventions will be put in place to reduce the number of customers churning.

- However, the organization doesn't have unlimited resources to apply these interventions to everyone, so the first step is to identify the customers at risk of churning, so that the interventions can be targeted.

- Hence, the question to the data science team is: **Can we build a model to predict with reasonable accuracy the customers who are going to churn in the near future?**

## Definitions

- **Churn**: a customer is said to have churned in our scenario if they have closed all of their active accounts with the bank.

- However, keep in mind that churn can be defined in a variety of ways depending on the situation and what is most appropriate for the organization. For example, in some cases a customer who has not transacted for 90 days/6 months/1 year can be said to have churned.
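
The inactivity-based definition above can be sketched in a few lines. This is only an illustration: the function name, the customer IDs, the dates, and the 90-day default window are all assumptions made for the example, not values from the project's dataset.

```python
from datetime import date, timedelta


def is_churned_by_inactivity(last_transaction: date, as_of: date,
                             window_days: int = 90) -> bool:
    """Return True if the last transaction is older than the inactivity window."""
    return (as_of - last_transaction) > timedelta(days=window_days)


# Hypothetical customers and the date of their last transaction.
last_seen = {
    "C001": date(2024, 1, 5),   # 148 days before the reference date
    "C002": date(2024, 5, 20),  # 12 days before the reference date
}
as_of = date(2024, 6, 1)

churned = {cid: is_churned_by_inactivity(d, as_of) for cid, d in last_seen.items()}
print(churned)
```

Changing `window_days` (for example to 180 or 365) switches between the alternative definitions without touching the rest of the logic.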

## Data Science Workflow

Note that when solving this kind of problem in a real-world setting within an organisation, a data science team would, apart from the modelling work itself, also have to collaborate with:
- Business or Product teams to define the problem statement, metrics, etc.
- Engineering teams to get the data
- DevOps teams to monitor the model once it is launched to production

### With Business and Product teams
1. **Defining the business goal** => Arresting the slide in revenues caused by loss of active bank customers
2. **Identifying the data source** => Transactional systems, captured as event-based logs. This data can be stored in data warehouses (MySQL DBs, AWS Redshift), data lakes, NoSQL DBs, etc.
3. **Auditing data quality**, including aspects such as:
    - Deleting duplicate events/transactions (deduplication)
    - Handling the absence of data for chunks of time in between (handling missing values)
    - Obscuring PII (personally identifiable information). Data science problems often require private customer features, and leaving them unobscured can lead to privacy issues.
4. **Defining metrics**. There are two types of metrics:
    - Business metrics => the business and data science teams are jointly responsible for deciding on the relevant metrics. In our case, business metrics could be:
        - churn rate, tracked over time (at a weekly, monthly, or quarterly level)
            (we want this metric to decrease)
        - trend of average number of products per customer tracked over time
            (we want this metric to increase)
        - percentage of dormant customers tracked over time
            (we want this metric to decrease)
        - other such descriptive metrics tracked over time
    - Data-related metrics => responsibility of the data science team to define the relevant metrics
         - Recall = TP/(TP + FN)
         - Precision = TP/(TP + FP)
         - F1-Score = Harmonic mean of Recall and Precision
         - (where: TP=True Positive, FP=False Positive, FN=False Negative, TN=True Negative)
         - (we're not using Accuracy because we will most likely have an imbalanced dataset)
5. **Deciding on the model's output format** => Since this isn't going to be an online (real-time) model, it doesn't need to be deployed as a live service. Instead, periodic (e.g. monthly) model runs could produce a list of customers with their propensity to churn, to be shared with the business (Sales/Marketing/Product) teams
    - Note: it's important to decide right at the beginning on the output format that the data science team will hand over to the sales/marketing team, so that they can take the relevant interventions
6. **Deciding the actions to be taken based on the model's output/insights**. Based on the output from the data science team, which would be the list of customers with a high propensity to churn in the near future, various business interventions can be made to keep those customers from churning, for example:
    - Customer-centric bank offers
    - Getting in touch with customers to address any grievances
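
The data-quality audit in step 3 could look roughly like the sketch below. Everything here is assumed for illustration: the event fields (`event_id`, `customer_email`, `day`, `amount`), the two-day gap threshold, and the salt are hypothetical. Salted hashing is shown as one simple way to obscure an identifier; it is pseudonymization rather than full anonymization, and a real project would follow the organisation's own privacy policy.

```python
import hashlib
from datetime import date

# Hypothetical event log; the field names are illustrative only.
events = [
    {"event_id": "e1", "customer_email": "a@example.com", "day": date(2024, 1, 1), "amount": 50.0},
    {"event_id": "e1", "customer_email": "a@example.com", "day": date(2024, 1, 1), "amount": 50.0},  # duplicate
    {"event_id": "e2", "customer_email": "b@example.com", "day": date(2024, 1, 2), "amount": 20.0},
    {"event_id": "e3", "customer_email": "a@example.com", "day": date(2024, 1, 6), "amount": 75.0},
]


def deduplicate(rows):
    """Keep the first occurrence of each event_id."""
    seen, unique = set(), []
    for row in rows:
        if row["event_id"] not in seen:
            seen.add(row["event_id"])
            unique.append(row)
    return unique


def find_gaps(rows, max_gap_days=2):
    """Return (start, end) pairs of consecutive event days further apart than max_gap_days."""
    days = sorted({row["day"] for row in rows})
    return [(a, b) for a, b in zip(days, days[1:]) if (b - a).days > max_gap_days]


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


clean = deduplicate(events)
gaps = find_gaps(clean)
masked = [
    {
        "event_id": r["event_id"],
        "customer_key": pseudonymize(r["customer_email"], salt="demo-salt"),
        "day": r["day"],
        "amount": r["amount"],
    }
    for r in clean
]
print(len(clean), gaps)
```

The same customer always maps to the same `customer_key`, so per-customer features can still be built without exposing the raw identifier.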
    
    
(TODO: add a diagram of the workflow between the Business, Data Science, DevOps and Engineering teams)
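
As an illustration of the output format discussed in step 5, a periodic model run might hand over a ranked list like the one built below. The scores, the 0.5 threshold, and the column names are assumptions made for the example; the real format should be agreed with the sales/marketing teams.

```python
import csv
import io

# Hypothetical propensity scores, as a monthly model run might produce them.
scores = {"C001": 0.82, "C002": 0.15, "C003": 0.64, "C004": 0.91}


def build_churn_report(scores, threshold=0.5):
    """Keep customers at or above the threshold, highest propensity first."""
    rows = [(cid, p) for cid, p in scores.items() if p >= threshold]
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows


def to_csv(rows):
    """Serialise the report as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "churn_propensity"])
    writer.writerows(rows)
    return buffer.getvalue()


report = build_churn_report(scores)
csv_text = to_csv(report)
print(csv_text)
```

Sorting by propensity lets the business teams work down the list until their intervention budget runs out.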

### Data-related metrics: Intuition behind TP/FP/TN/FN

Fun explanation:
- True Positive (TP): Reality: A wolf threatened. Shepherd said: "Wolf." Outcome: Shepherd is a hero.
- True Negative (TN): Reality: No wolf threatened. Shepherd said: "No wolf." Outcome: Everyone is fine.
- False Positive (FP): Reality: No wolf threatened. Shepherd said: "Wolf." Outcome: Villagers are angry at shepherd for waking them up.
- False Negative (FN): Reality: A wolf threatened. Shepherd said: "No wolf." Outcome: The wolf ate all the sheep.

As a concrete example, say we have a hot news classifier:
- True Positive (TP): Reality: a piece of hot news. Classifier predicts: hot.
- True Negative (TN): Reality: not a piece of hot news. Classifier predicts: not hot.
- False Positive (FP): Reality: not a piece of hot news. Classifier predicts: hot.
- False Negative (FN): Reality: a piece of hot news. Classifier predicts: not hot.
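
The four outcomes above can be counted directly from paired labels and predictions. The sketch below assumes binary labels where 1 is the positive class (churn, wolf, hot news) and 0 is the negative class; the sample lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # reality positive, predicted positive
        elif actual == 0 and predicted == 0:
            tn += 1  # reality negative, predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1  # reality negative, predicted positive
        else:
            fn += 1  # reality positive, predicted negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical ground truth and predictions.
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```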

## Part 1: Setting up the target/goal for the metrics

#### Data-related metrics:
- **Recall** = TP/(TP + FN) => out of all the samples in the positive class, how many did we predict correctly?
- (so, out of all the customers who actually go on to churn, how many did we identify correctly?)
- (if we identify 50% of them correctly, then recall is 50%)
- **Precision** = TP/(TP + FP) => out of all the positive predictions we made, how many were correct?
- (if we predict 100 customers as likely to churn, we need to check how many of them actually churn)
- (if only 30 of them actually churn, our precision is 30/100 = 30%)
- **F1-Score** = Harmonic mean of Recall and Precision
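
A minimal sketch of these three formulas, using the numbers from the examples above: 100 customers flagged, of whom 30 actually churn (TP = 30, FP = 70). The FN = 30 value is an assumption added here so that recall comes out at the 50% mentioned above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# TP = 30, FP = 70 come from the precision example; FN = 30 is assumed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts precision is 0.30, recall is 0.50 and F1 is 0.375. The harmonic mean sits closer to the smaller of the two values, so a model cannot score well on F1 by maximising only one of them.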


Although we can't know the maximum/minimum achievable on this dataset without exploring the data, we can set a rough, conservative estimate. A good approach to setting targets for these metrics is:

First, find the minimum and maximum values (to create a range):
- To find the minimum, suppose we predict every row as 1 (churn). Recall would then be 100%, but precision would equal the proportion of positives in the dataset. For example, if 20% of customers in the dataset have actually churned, then precision would be 20%. The F1-Score, which is the harmonic mean of Recall and Precision, would be 2 × 1.0 × 0.2 / (1.0 + 0.2) ≈ 33%. So, not a great score at all.
- The maximum would ideally be 100%, but we know that is not realistically achievable.

So, a conservative target would be around 70%.
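
The lower bound described above (predicting every customer as churned) can be checked with a short calculation. The 20% churn rate is the illustrative figure from the text, not a measured value from the dataset, and the function name is made up for this sketch.

```python
def all_positive_baseline(churn_rate):
    """Metrics for a trivial model that predicts churn (1) for every customer.

    Assumes 0 < churn_rate <= 1.
    """
    recall = 1.0             # every actual churner is flagged
    precision = churn_rate   # only the true churners among all flagged rows are correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = all_positive_baseline(churn_rate=0.20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

For a 20% churn rate the baseline F1 is about 0.333, so any useful model should clear that floor comfortably.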


#### Business metrics:

Actual values/thresholds for business metrics usually come from the leadership team, so we should try to achieve the given targets. At the same time, we should make sure that the value/threshold isn't unrealistic.

For example:
- if we take the recall target to be 70%, we correctly identify 70% of the customers who are going to churn in the near future
- suppose that business interventions (offers, getting in touch with customers) save 50% of those identified customers from churning
- then about 0.70 × 0.50 = 35% of would-be churners are retained, which means roughly a 35% relative improvement in the churn rate
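
The arithmetic behind this example is sketched below. It assumes that interventions only reach customers the model correctly flags, and that the save rate applies uniformly to them. The 20% baseline churn rate is a hypothetical figure used only to show the effect on the rate itself.

```python
def expected_churn_reduction(recall, save_rate):
    """Fraction of would-be churners retained: those identified times those saved."""
    return recall * save_rate


reduction = expected_churn_reduction(recall=0.70, save_rate=0.50)

# Hypothetical baseline churn rate, for illustration only.
baseline_churn_rate = 0.20
new_churn_rate = baseline_churn_rate * (1 - reduction)

print(f"relative reduction: {reduction:.0%}")
print(f"churn rate: {baseline_churn_rate:.1%} -> {new_churn_rate:.1%}")
```

Under these assumptions a 70% recall and a 50% save rate would take a 20% churn rate down to 13%. A lower recall or save rate scales the improvement down proportionally, which is a quick way to check whether a leadership target is plausible.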