## Credit Card Data Analysis

Segmentation in marketing is a technique used to divide customers or other entities into groups based on attributes such as behaviour or demographics.

The top challenge faced by marketers is understanding who they are selling to. Once you know your buyer personas, you can tailor your targeting and offerings to increase their satisfaction and your revenue as a result. When you already have a pool of customers and plenty of data, it can be incredibly useful to segment them.

Here we use credit card data to segment the customers.

## About the dataset
The credit card data has 18 attributes for each customer, including the balance (credit owed by the customer), cash advance (cash the customer withdraws using the credit card), the customer’s credit limit, minimum payment, percentage of full payments, and tenure. The complete data dictionary is given below:




|Feature|Description|
|-----|-----|
|CUST_ID | Identification of credit card holder (categorical) |
|BALANCE | Balance amount left in the account to make purchases |
|BALANCE_FREQUENCY | How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) |
|PURCHASES | Amount of purchases made from the account |
|ONEOFF_PURCHASES | Maximum purchase amount done in one go |
|INSTALLMENTS_PURCHASES | Amount of purchases done in installments |
|CASH_ADVANCE | Cash in advance given by the user |
|PURCHASES_FREQUENCY | How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) |
|ONEOFF_PURCHASES_FREQUENCY | How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased) |
|PURCHASES_INSTALLMENTS_FREQUENCY | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) |
|CASH_ADVANCE_FREQUENCY | How frequently cash in advance is being paid |
|CASH_ADVANCE_TRX | Number of transactions made with "Cash in Advance" |
|PURCHASES_TRX | Number of purchase transactions made |
|CREDIT_LIMIT | Credit card limit for the user |
|PAYMENTS | Amount of payments made by the user |
|MINIMUM_PAYMENTS | Minimum amount of payments made by the user |
|PRC_FULL_PAYMENT | Percent of full payment paid by the user |
|TENURE | Tenure of credit card service for the user |

### Importing necessary libraries

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
import warnings
warnings.filterwarnings("ignore")
```

### Loading the dataset

```python
df = pd.read_csv("customer_seg.csv")
df.head()
```

|Unnamed: 0|CUST_ID|BALANCE|BALANCE_FREQUENCY|PURCHASES|ONEOFF_PURCHASES|INSTALLMENTS_PURCHASES|CASH_ADVANCE|PURCHASES_FREQUENCY|ONEOFF_PURCHASES_FREQUENCY|PURCHASES_INSTALLMENTS_FREQUENCY|CASH_ADVANCE_FREQUENCY|CASH_ADVANCE_TRX|PURCHASES_TRX|CREDIT_LIMIT|PAYMENTS|MINIMUM_PAYMENTS|PRC_FULL_PAYMENT|TENURE|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|0|C10001|40.900749|0.818182|95.4|0.0|95.4|0.0|0.166667|0.0|0.083333|0.0|0|2|1000.0|201.802084|139.509787|0.0|12|
|1|C10002|3202.467416|0.909091|0.0|0.0|0.0|6442.945483|0.0|0.0|0.0|0.25|4|0|7000.0|4103.032597|1072.340217|0.222222|12|
|2|C10003|2495.148862|1.0|773.17|773.17|0.0|0.0|1.0|1.0|0.0|0.0|0|12|7500.0|622.066742|627.284787|0.0|12|
|3|C10004|1666.670542|0.636364|1499.0|1499.0|0.0|205.788017|0.083333|0.083333|0.0|0.083333|1|1|7500.0|0.0|NaN|0.0|12|
|4|C10005|817.714335|1.0|16.0|16.0|0.0|0.0|0.083333|0.083333|0.0|0.0|0|1|1200.0|678.334763|244.791237|0.0|12|


### Let's check the descriptive statistics of the data
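
The code for this step is not shown, so here is a minimal sketch of one way to do it with pandas. The helper name `describe_numeric` is hypothetical and not part of the original notebook; it assumes the data is already loaded into a DataFrame.

```python
import pandas as pd


def describe_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return summary statistics with one row per numeric column."""
    # describe() reports count, mean, std, min, quartiles and max for
    # numeric columns; transposing makes a wide table easier to scan.
    return frame.describe().T


# In the notebook this would be called as: describe_numeric(df)
```

A `count` lower than the number of rows in any column is a first hint that the column has missing values.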

### Dealing with missing values
Let's check the number of missing values in the given dataset.
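
A short sketch of how the missing values could be counted, assuming the loaded DataFrame. The helper `missing_value_counts` is a hypothetical name introduced for illustration. The sample rows above already show one gap: the fourth customer has no `MINIMUM_PAYMENTS` value.

```python
import pandas as pd


def missing_value_counts(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isnull().sum()
    # Drop columns without gaps and list the worst offenders first.
    return counts[counts > 0].sort_values(ascending=False)


# In the notebook this would be called as: missing_value_counts(df)
```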

### Impute the missing values with the column mean and remove `CUST_ID`, which is not useful for clustering
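
One possible way to implement this step is sketched below. The helper `impute_and_drop_id` and its `id_cols` parameter are assumptions made for illustration. The loaded file also carries an `Unnamed: 0` index column; whether the original notebook dropped it is not shown, but it can be passed in `id_cols` as well.

```python
import pandas as pd


def impute_and_drop_id(frame: pd.DataFrame, id_cols=("CUST_ID",)) -> pd.DataFrame:
    """Drop identifier columns and fill numeric gaps with the column mean."""
    # errors="ignore" keeps the call safe if an id column is already gone.
    out = frame.drop(columns=list(id_cols), errors="ignore")
    # fillna with a Series fills each column with its own mean.
    return out.fillna(out.mean(numeric_only=True))


# Possible notebook usage (assumption):
# df = impute_and_drop_id(df, id_cols=("CUST_ID", "Unnamed: 0"))
```

The function returns a new DataFrame and leaves the input untouched, so the raw data stays available for comparison.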

### Perform log transformation on the data
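
The transformation code is not shown. The sketch below assumes `np.log1p`, which computes log(1 + x), because many columns in this dataset contain zeros and a plain logarithm is undefined there. This choice and the helper name `log_transform` are assumptions, not necessarily what the original notebook used.

```python
import numpy as np
import pandas as pd


def log_transform(frame: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Apply log(1 + x) to the given columns (default: all numeric columns)."""
    out = frame.copy()
    if columns is None:
        columns = out.select_dtypes("number").columns
    for col in columns:
        # log1p maps 0 to 0 and compresses the long right tail of
        # monetary features such as balances and payments.
        out[col] = np.log1p(out[col])
    return out


# In the notebook this would be called as: df_log = log_transform(df)
```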

### Detect outliers in the continuous columns

Because this is a clustering problem, I decided to test without replacing the outliers: the aim is to get meaningful clusters that still make sense when plotted in a pair plot.

We will use the IQR score to identify outlier values in the dataset. The IQR method is the one a box plot uses to flag possible outliers.

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot of the data. It is a measure of dispersion similar to standard deviation or variance, but is much more robust against outliers.

For now, we'll do nothing with the outliers, because changing them may harm the clustering.
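
The IQR rule described above can be sketched as follows. The multiplier `k = 1.5` is the usual box plot convention and is an assumption here, as is the helper name `iqr_outlier_counts`. The function only counts outliers; it does not remove or replace them.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    num = frame.select_dtypes("number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1  # IQR = Q3 - Q1, computed per column
    # Comparisons align the per-column bounds with the DataFrame columns.
    mask = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return mask.sum()


# In the notebook this would be called as: iqr_outlier_counts(df)
```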

### Visualize the outliers using box plots
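
A minimal plotting sketch using matplotlib, with one box plot per numeric column. The helper `plot_boxplots` and its grid layout are assumptions made for illustration; the original notebook may have used a different plotting library.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_boxplots(frame: pd.DataFrame, ncols: int = 4):
    """Draw one box plot per numeric column on a grid and return the figure."""
    num = frame.select_dtypes("number")
    n = num.shape[1]
    nrows = max(1, -(-n // ncols))  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, num.columns):
        # Points beyond the whiskers are the IQR-rule outliers.
        ax.boxplot(num[col].dropna().to_numpy())
        ax.set_title(col)
    for ax in flat[n:]:
        fig.delaxes(ax)  # remove unused grid cells
    fig.tight_layout()
    return fig


# In the notebook this would be called as: plot_boxplots(df)
```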

### Scale the features using the `scale` function

This function puts all variables on the same scale, with mean zero and standard deviation equal to one.
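
A short sketch of the scaling step with `sklearn.preprocessing.scale`, which the notebook imports. The wrapper `standardize` is a hypothetical helper that keeps the column names, and it assumes the frame holds only numeric columns at this point.

```python
import pandas as pd
from sklearn.preprocessing import scale


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to mean 0 and (population) standard deviation 1."""
    scaled = scale(frame.to_numpy(dtype=float))
    # scale() returns a NumPy array, so restore the labels for readability.
    return pd.DataFrame(scaled, columns=frame.columns, index=frame.index)


# In the notebook this would be called as: df_scaled = standardize(df)
```

Scaling matters for K-means because the algorithm relies on Euclidean distance, so a feature measured in thousands would otherwise dominate one measured between 0 and 1.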

### Use the elbow method to find the optimal number of clusters
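
The elbow method fits K-means for a range of cluster counts and records the within-cluster sum of squares (the `inertia_` attribute in scikit-learn). A sketch follows; the range of `k` values, `n_init=10`, `random_state=42`, and the helper name `elbow_inertias` are all assumptions.

```python
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values=range(1, 11), random_state=42):
    """Fit KMeans once per k and return a {k: inertia} mapping."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        # inertia_ is the sum of squared distances to the nearest centroid.
        inertias[k] = model.inertia_
    return inertias


# In the notebook this would be called on the scaled features.
```

Inertia always shrinks as `k` grows, so the value to look for is the `k` after which the decrease flattens out.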

### Plot the elbow curve to visualize the optimal number of clusters
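
A sketch of the elbow plot, assuming the inertia of each fitted model has been collected in a mapping from `k` to inertia. The helper `plot_elbow` is a hypothetical name.

```python
import matplotlib.pyplot as plt


def plot_elbow(inertias):
    """Plot inertia against the number of clusters and return the figure."""
    ks = sorted(inertias)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(ks, [inertias[k] for k in ks], marker="o")
    ax.set_xlabel("Number of clusters (k)")
    ax.set_ylabel("Inertia (within-cluster sum of squares)")
    ax.set_title("Elbow method")
    return fig


# Example call with made-up values: plot_elbow({1: 900.0, 2: 400.0, 3: 350.0})
```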

### Apply KMeans to the dataset with the optimal number of clusters and store the clusters in the dataframe
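
The chosen number of clusters is not stated here, so the sketch below leaves it as a parameter. The column name `CLUSTER`, the helper `assign_clusters`, and the `n_init` and `random_state` settings are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans


def assign_clusters(frame: pd.DataFrame, X_scaled, n_clusters: int,
                    random_state: int = 42) -> pd.DataFrame:
    """Fit KMeans on the scaled features and attach the labels to a copy of frame."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(X_scaled)
    out = frame.copy()
    out["CLUSTER"] = labels  # one integer label per customer
    return out


# Possible notebook usage (k chosen from the elbow plot):
# df = assign_clusters(df, df_scaled, n_clusters=k)
```

Fitting on the scaled features while storing the labels on the unscaled frame keeps the original units available for interpreting each cluster.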

### Interpretation of Clusters
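
A common starting point for interpreting the clusters is to compare the average of each feature per cluster together with the cluster sizes. The sketch below assumes the labels are stored in a column named `CLUSTER` (a hypothetical name); `cluster_profile` is likewise a helper introduced for illustration.

```python
import pandas as pd


def cluster_profile(frame: pd.DataFrame, label_col: str = "CLUSTER") -> pd.DataFrame:
    """Return per-cluster feature means with the cluster size as first column."""
    grouped = frame.groupby(label_col)
    profile = grouped.mean(numeric_only=True)
    # The size shows whether a cluster is a broad segment or a small niche.
    profile.insert(0, "n_customers", grouped.size())
    return profile


# In the notebook this would be called as: cluster_profile(df)
```

Reading the result row by row, for example a cluster with high `CASH_ADVANCE` but low `PURCHASES`, is what turns the numeric labels into customer segments.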

### Visualize the clusters, using PCA to transform the data to 2 dimensions
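
A minimal sketch of this visualization: project the scaled features onto the first two principal components and color each point by its cluster label. The helper `plot_clusters_2d`, the colormap, and the marker size are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def plot_clusters_2d(X_scaled, labels):
    """Project the features to 2D with PCA and scatter-plot them by cluster."""
    coords = PCA(n_components=2).fit_transform(X_scaled)
    fig, ax = plt.subplots(figsize=(7, 5))
    points = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    fig.colorbar(points, ax=ax, label="Cluster")
    return fig, coords


# In the notebook this would be called with the scaled features and the
# cluster labels produced by KMeans.
```

The projection keeps only the two directions of greatest variance, so clusters that overlap in the plot may still be separated in the full feature space.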