## Motivation
Industries with a subscriber-based service model, including banking, inevitably suffer from customer attrition, or churn. Identifying customers with a high probability of churn can significantly impact profits if the motivations for churn can be identified and addressed with retention strategies. Historical customer data is extremely valuable for both inferring motivations of past customers and building models to predict which customers are at risk of churn. 

## Data
The dataset used for this analysis contained historical data for 500 customers. Each customer record included banking-activity metrics and demographic data, including (but not limited to):
- Whether the account is inactive
- Credit scores 
- Account open date
- Monthly deposit balances for the period of one year (year previous to churn)
- Counts for several types of banking products (VISA, IRA, Loans, etc)
- Demographic data such as Zip Code, Age, and Membership Group

Half of the customers in the dataset were current customers at the time of data collection and half had closed their accounts. This account status served as the classification target for churn.

## Model Building and Model Evaluation 
The predictive model was built using a Random Forest Classifier. Most provided metrics were used as features and some aggregated features were created for use in the model. 
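
The notebook cells below do not show how the aggregated features were built, so the following is only a minimal sketch of one plausible approach: summarizing the twelve monthly deposit balances per customer as a mean, a spread, and a linear trend. The column naming (`'Monthly Deposit Balance $ 0'` through `'Monthly Deposit Balance $ 11'`), the helper `add_balance_aggregates`, and the output column names are assumptions made for illustration, not the names used in the client data.

```python
import numpy as np
import pandas as pd

def add_balance_aggregates(df, prefix="Monthly Deposit Balance $", n_months=12):
    """Append mean, std, and linear trend of the monthly balance columns.

    Hypothetical helper: the column naming (prefix + " " + month index)
    is an assumption for this sketch.
    """
    cols = ["%s %d" % (prefix, m) for m in range(n_months)]
    balances = df[cols].astype(float)
    out = df.copy()
    out["balance_mean"] = balances.mean(axis=1)
    out["balance_std"] = balances.std(axis=1)
    # Slope of a least-squares line through each customer's monthly values
    months = np.arange(n_months)
    out["balance_trend"] = np.polyfit(months, balances.values.T, 1)[0]
    return out

# Toy data: one customer with a steadily changing balance, one flat
toy = pd.DataFrame(
    [[100.0 + 10 * m for m in range(12)],
     [500.0] * 12],
    columns=["Monthly Deposit Balance $ %d" % m for m in range(12)])
features = add_balance_aggregates(toy)
print(features[["balance_mean", "balance_std", "balance_trend"]])
```

The sign of the trend only has meaning once it is known whether month index 0 is the oldest or the most recent month, which the source does not state.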

The model consistently produced high accuracy scores on data held out from model building (via cross-validation), which indicates that it can predict churn from the historical customer data described above.
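
As a rough illustration of the evaluation described above, the sketch below runs stratified 10-fold cross-validation for a random forest on synthetic data. The synthetic dataset, the variable names, and the smaller `n_estimators` value are assumptions chosen to keep the example self-contained and quick to run; they are not the client data or the exact settings used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 500 rows with roughly balanced classes
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Each fold keeps the class proportions of the full dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
demo_scores = cross_val_score(demo_forest, X_demo, y_demo,
                              cv=cv, scoring="accuracy")

print("CV accuracy: %.3f +/- %.3f" % (demo_scores.mean(), demo_scores.std()))
```

Each score comes from a fold the model did not see during fitting, so the mean approximates accuracy on new customers.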

## Actionable Insights 
The following features were identified as predictive for churn: 

(redacted for client)



## Import and Summarize Data

In [None]:
# Imports
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pdb

# Options for display purposes
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# Grab data and preview
data = pd.read_excel('DataSet.xlsx', sheet_name='Data')  # 'sheetname' was renamed to 'sheet_name' in pandas
data.head()

In [None]:
data.describe()

**Summary of data types**  
Columns are grouped by measurement level to show which ones need cleanup before being passed to the model.

*Nominal* - 'Membership Status' 'Inactive' 'Membership Zip Code' 'Membership Branch (Historical)' 'Membership Branch (Current)' 

*Nominal (text)* - 'Membership Account Type' 'Membership Group' 

*Ordinal (text)* - 'Age Band' 'Original Credit Score Band' 'Recent Fico Credit Score Band'

*Interval* - 'Membership Account Open Date' 

*Ratio* - 'Duration (Years)' 'Deposit Transaction Count' 0-11 'Monthly Deposit Balance $' 0-11 'Loan Balance $' 0-11 'Total Savings Products' 'Total Checking Products' 'Total CD Product Count' 'Total Money Market' 'Total IRA Products' 'Total Personal Loans' 'Total New Auto Loans' 'Total Used Auto Loans' 'Total VISA Products', 'Total Mortgage Products' 'Total Other Loans' 'Total Products' 'Loan App Denied Count' 'Loan App Cancelled Count' 'Loan App Approval Count' 'Online Banking Count' 0-11 'Online Bill Pay Count' 0-11 'Online Direct Deposit Count' 0-11

*Garbage* - 'Average Monthly Transactions'? 'Identifier' 'Membership Account Close Date'

## Data Cleanup

In [None]:
# The first row is not needed for the analysis, so drop it
data = data.iloc[1:]

In [None]:
# Map ordinal text bands to integer codes

# Observed values: 0, 999, '0-17', '18-34', '35-54', '55+'
print(data['Age Band'].unique())
# Fold the 999 code into 0 so both share one integer code
data['Age Band'] = data['Age Band'].replace(999, 0)
# Sort by string form so the integer 0 and the text bands can be ordered together
agefeaturemap = {label: idx for idx, label in
                 enumerate(sorted(data['Age Band'].unique(), key=str))}
print(agefeaturemap)
data['Age Band'] = data['Age Band'].map(agefeaturemap)

# Observed values: 0, '300-553', '554-602', '603-640', '641-685', '686-733', '734-850'
print(data['Original Credit Score Band'].unique())
creditscoremap = {label: idx for idx, label in
                  enumerate(sorted(data['Original Credit Score Band'].unique(), key=str))}
print(creditscoremap)
data['Original Credit Score Band'] = data['Original Credit Score Band'].map(creditscoremap)

# Same bands as the original score, so the same mapping is reused
print(data['Recent Fico Credit Score Band'].unique())
data['Recent Fico Credit Score Band'] = data['Recent Fico Credit Score Band'].map(creditscoremap)

In [None]:
# One-hot encoding of nominal text data 
data = pd.get_dummies(data, columns=['Membership Account Type', 'Membership Group'])
data.head()

In [None]:
# Interval data - split dates into several features
data['year'] = data['Membership Account Open Date'].dt.year
data['month'] = data['Membership Account Open Date'].dt.month
data['dayofyear'] = data['Membership Account Open Date'].dt.dayofyear
data['dayofweek'] = data['Membership Account Open Date'].dt.dayofweek
data['day'] = data['Membership Account Open Date'].dt.day

In [None]:
# Drop columns to avoid GIGO
data.drop(['Average Monthly Transactions', 'Identifier', 'Membership Account Close Date'],
          axis=1, inplace=True)
# The open date has already been split into separate features above
data.drop(['Membership Account Open Date'], axis=1, inplace=True)

In [None]:
# Dealing with some obvious data entry errors / hangups... 

# A weird non-number zip code: replace any non-integer value with 0
data['Membership Zip Code'] = [x if isinstance(x, (int, np.integer)) else 0
                               for x in data['Membership Zip Code']]

# Some parentheses added to account balances (probably not important for this analysis)
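
The cell above notes the parenthesized balances but leaves them untouched. If they did need handling, one possible approach is sketched below. It assumes the parentheses follow the accounting convention for negative amounts, which the source does not confirm, and `parse_balance` is a hypothetical helper rather than part of the original analysis. The `str` check targets Python 3.

```python
import pandas as pd

def parse_balance(value):
    """Convert values such as '(1,234.50)' to -1234.5; pass plain numbers through.

    Assumption: parentheses mark a negative amount (accounting convention).
    """
    if isinstance(value, str):
        text = value.strip().replace(",", "").replace("$", "")
        if text.startswith("(") and text.endswith(")"):
            return -float(text[1:-1])
        return float(text)
    return float(value)

# Toy values mixing text and numeric entries
raw = pd.Series(["(1,234.50)", "250", 99.5, "(10)"])
cleaned = raw.map(parse_balance)
print(cleaned.tolist())
```

Applied to a balance column, `data[col].map(parse_balance)` would yield a float column, provided every entry follows one of the formats handled here.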



In [None]:
# Fix stray data types: convert columns stored as generic objects to float
object_cols = list(data.select_dtypes(include=['object']).columns)
data[object_cols] = data[object_cols].astype(float)
data[object_cols].head()

## Data Visualization

In [None]:
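# Histograms of columns 1 through 6 (column 0 is skipped)
for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.hist(data.iloc[:, i].values, bins=20)
    plt.title(data.columns[i], fontsize=8)
plt.tight_layout()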

## Predictive Model

In [None]:
# Split into training and validation data subsets
y = data['Membership Status']
X = data.drop(['Membership Status'], axis=1)
#X = X.iloc[:,0:9]
#X = pd.concat([X.iloc[:,3:9], X.iloc[:,:3]], axis=1)

# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


In [None]:
# Training-set size and number of positive-class rows (class balance check)
print(len(y_train))
print(sum(y_train))

In [None]:
# Random Forest model 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)

#forest.fit(X_train, y_train)
#print(forest.score(X_test, y_test))


# Stratified 10-fold cross-validation on the training set
# (folds are not shuffled, so no random_state is needed)
kfold = StratifiedKFold(n_splits=10)

scores = []
for k, (train, test) in enumerate(kfold.split(X_train, y_train)):
    # train/test hold positional row indices, so select rows with .iloc
    forest.fit(X_train.iloc[train], y_train.iloc[train])
    score = forest.score(X_train.iloc[test], y_train.iloc[test])
    scores.append(score)
    print('Fold: %s, Acc: %.3f' % (k+1, score))
    
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

In [None]:
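# The CV loop leaves the model fit on the last fold's subset only,
# so refit on the full training set before scoring on the held-out test set
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))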

In [None]:
# Evaluate feature importances

feat_labels = X.columns.values.tolist()
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

# plt.title('Feature Importances')
# plt.bar(range(X_train.shape[1]), 
#         importances[indices],
#         color='lightblue', 
#         align='center')

# plt.xticks(range(X_train.shape[1]), 
#            [feat_labels[i] for i in indices], rotation=90)  # labels in the same sorted order as the bars
# plt.xlim([-1, X_train.shape[1]])
# plt.tight_layout()
# # plt.savefig('./figures/random_forest.png', dpi=300)
# plt.show()