# Objective

**Predict customer churn for a tour and travel company based on customer data. What are the key indicators of customer churn?**

# Main results summary:

The key indicators of customer churn were age, frequent flyer status and income class. Specifically, 

 * younger customers (27-28 years old) churn proportionally more often
 * frequent flyers churn more often than non-frequent flyers
 * high-income customers churn more often than low- and middle-income customers

Of the compared models, the balanced bagging classifier predicted customer churn best. It reached an overall accuracy of 90%, with an F1 score of 0.81 and all other performance metrics above 0.70 for the minority class. Because it is likely most important to correctly identify the customers who churn, predicting the minority class correctly takes priority. A classifier that balances the data set, and thereby improves the detection of churning customers, is therefore the preferred choice, so that these customers can be targeted with measures to improve customer satisfaction.

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# data pre-processing 
from sklearn.preprocessing import StandardScaler, LabelEncoder

# models
from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier 

# model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# import data
df = pd.read_csv('../input/tour-travels-customer-churn-prediction/Customertravel.csv')
df.head() 

In [None]:
# rename churn column for clarity
df.rename(columns={'Target': 'Churn'}, inplace=True)

In [None]:
df.info()

# Exploratory Data Analysis

In [None]:
df.groupby('Churn').describe()

In [None]:
ax = sns.countplot(data=df, x='Churn')
# sort by class label so the percentages line up with the bars (0, then 1)
percentage = df['Churn'].value_counts(normalize=True).sort_index().values * 100
lbls = [f'{p:.1f}%' for p in percentage]

ax.bar_label(container=ax.containers[0], labels=lbls)   
plt.ylim(top=800)
plt.title('Churned (0=no, 1=yes)');  

The data is imbalanced: around 77% of the customers are in the "no-churn" class (a modest class imbalance of roughly 3:1). This needs to be taken into account when training and evaluating the models.
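
One common way to account for the imbalance when splitting the data is a stratified split. The sketch below uses synthetic labels with roughly the same 77/23 ratio (the names `X_demo` and `y_demo` and the sample size are assumptions for illustration, not the notebook's data) and shows that `stratify` keeps the churn share nearly identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn labels: 77% class 0, 23% class 1 (assumed sizes)
rng = np.random.default_rng(0)
y_demo = np.array([0] * 770 + [1] * 230)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the class ratio (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print('churn share overall:', round(y_demo.mean(), 3))
print('churn share train:  ', round(y_tr.mean(), 3))
print('churn share test:   ', round(y_te.mean(), 3))
```

Applied to this notebook, that would amount to passing `stratify=y` to `train_test_split`; doing so would change the reported results slightly.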

In [None]:
sns.countplot(data=df, x='Age', hue='Churn').set_title('Churn by Age');

It looks like younger customers (27-28y) tend to churn proportionally more often.
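
Count plots show absolute numbers, so a "proportionally more often" claim is easier to check with the churn rate per group. A minimal sketch on toy data (the values below are hypothetical and not taken from `Customertravel.csv`):

```python
import pandas as pd

# Toy data (hypothetical values for illustration only)
toy = pd.DataFrame({
    'Age':   [27, 27, 27, 28, 28, 30, 30, 30, 30, 35],
    'Churn': [1,  1,  0,  1,  0,  0,  0,  1,  0,  0],
})

# The mean of a 0/1 column is the share of churned customers in each group
churn_rate = toy.groupby('Age')['Churn'].mean()
print(churn_rate)
```

On the notebook's data the same idea would be `df.groupby('Age')['Churn'].mean()`, and likewise for the other categorical columns.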

In [None]:
sns.countplot(data=df, x='ServicesOpted', hue='Churn').set_title('Churn by Services Opted');

In [None]:
sns.countplot(data=df, x='FrequentFlyer', hue='Churn').set_title('Churn by Frequent Flyer Status');

It looks like frequent flyers churn more than non-frequent flyers. 

In [None]:
sns.countplot(data=df, x='AnnualIncomeClass', order=['Low Income','Middle Income','High Income'], hue='Churn').set_title('Churn by Annual Income Class');

It appears that high income individuals churn more than low and middle income classes.

In [None]:
sns.countplot(data=df, x='AccountSyncedToSocialMedia', hue='Churn').set_title('Churn by Account Synced To Social Media');

In [None]:
sns.countplot(data=df, x='BookedHotelOrNot', hue='Churn').set_title('Churn by Booked Hotel'); 


# Data pre-processing

In [None]:
df.isnull().sum()

There are no missing values.

In [None]:
# create copy for encoding
df_coded = df.copy()

In [None]:
# Label Encoding ordinal features for services 
ordinals = ['ServicesOpted']
df_coded[ordinals] = df_coded[ordinals].apply(LabelEncoder().fit_transform)

In [None]:
# map Annual Income Class manually so that the codes follow the ordinal scale (Low < Middle < High)
df_coded = df_coded.replace({'AnnualIncomeClass': {'Low Income': 0, 'Middle Income': 1, 'High Income': 2}})
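
Why the manual mapping matters: `LabelEncoder` assigns codes in alphabetical order of the labels, which does not match the income order. A small sketch on a hypothetical series (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the income classes
income = pd.Series(['Low Income', 'Middle Income', 'High Income', 'Low Income'])

# LabelEncoder sorts the labels alphabetically: High=0, Low=1, Middle=2
alphabetical = LabelEncoder().fit_transform(income)

# An explicit mapping keeps the intended order: Low < Middle < High
order = {'Low Income': 0, 'Middle Income': 1, 'High Income': 2}
ordinal = income.map(order)

print(list(alphabetical))  # [1, 2, 0, 1]
print(list(ordinal))       # [0, 1, 2, 0]
```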

In [None]:
# One-Hot Encoding non-ordinal features
dummies = ['BookedHotelOrNot', 'AccountSyncedToSocialMedia', 'FrequentFlyer']
df_coded = pd.get_dummies(df_coded, columns = dummies, drop_first=True)

In [None]:
#rename some cols for clarity
df_coded.rename(columns={'BookedHotelOrNot_Yes':'BookedHotel', 'AccountSyncedToSocialMedia_Yes':'AccountSyncedToSocialMedia'}, inplace=True)

In [None]:
df_coded.head(6)

Correlation matrix to explore relationships between variables.  

In [None]:
sns.heatmap(np.round(df_coded.corr(method ='spearman'), 2), annot=True,  cmap='Blues');

The correlation matrix needs to be interpreted with caution, because it is based on encoded ordinal and binary variables (which is also why Spearman's rank correlation is used). Still, it gives some indication that, for example, annual income class and frequent flyer status are correlated. It also indicates that frequent flyer status is associated with churning.

# Model Creation and Evaluation

Split the data into training set and test set:

In [None]:
X = df_coded.drop(columns='Churn')
y = df_coded['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# Generic function to fit data and display results/predictions
def fit_evaluate(clf, X_train, X_test, y_train, y_test):
    # fit model to training data
    clf.fit(X_train, y_train)
    # make predictions for test data
    y_pred = clf.predict(X_test)
    # print evaluation
    print(classification_report(y_test, y_pred))
    print('\nConfusion Matrix: \n')
    s = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g', cmap='YlGnBu');
    s.set(xlabel='Predicted class', ylabel='True class')

Note: imbalanced data is not optimal for machine learning. Here I want to test some classic algorithms but also explore some classifiers that specifically account for imbalanced data (using imblearn).

In [None]:
modelLR = LogisticRegression()
print('* Logistic regression * \n')
fit_evaluate(modelLR, X_train, X_test, y_train, y_test)

In [None]:
modelLR = LogisticRegression(class_weight='balanced')
print('* Logistic regression (balanced class weights) * \n')
fit_evaluate(modelLR, X_train, X_test, y_train, y_test)

Balancing the class weights does not substantially improve the model overall. The prediction of the minority class gets better (at the cost of the majority class), because the model is penalized more heavily for errors in the minority class.
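
To make the penalty concrete: with `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_per_class)`. The sketch below reproduces this on synthetic labels with an assumed 770/230 split (not the notebook's exact counts).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with an assumed 77/23 class ratio
y_demo = np.array([0] * 770 + [1] * 230)
classes = np.unique(y_demo)

# The 'balanced' heuristic: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print('manual: ', np.round(manual, 3))
print('sklearn:', np.round(sklearn_weights, 3))
```

With these counts, an error on a churned customer weighs about 3.3 times as much as an error on a non-churned customer.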

In [None]:
modelRF = RandomForestClassifier()
print('* Random Forest Classifier * \n')
fit_evaluate(modelRF, X_train, X_test, y_train, y_test)

The random forest classifier performs better overall than logistic regression. Next, I test whether balancing the data improves the performance, particularly for the underrepresented class 1 (churned), using a balanced random forest classifier that randomly under-samples each bootstrap sample to balance it.

In [None]:
modelRF_bal = BalancedRandomForestClassifier()
print('* Balanced Random Forest Classifier * \n')
fit_evaluate(modelRF_bal, X_train, X_test, y_train, y_test)

The balanced random forest classifier often classifies the minority class correctly, but at the cost of many false positives (non-churners predicted as churned). This shows that downsampling favors class 1 compared to the other models.
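
The idea of balancing each bootstrap sample by under-sampling can be sketched with NumPy alone. This is a simplified illustration on synthetic labels, not imblearn's actual implementation, whose sampling details depend on its parameters and version.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labels with an assumed 77/23 class ratio
y_demo = np.array([0] * 770 + [1] * 230)

idx_major = np.flatnonzero(y_demo == 0)
idx_minor = np.flatnonzero(y_demo == 1)

# Draw a bootstrap sample of the minority class, then the same number
# of majority-class rows, so the resulting sample is balanced 1:1
n = len(idx_minor)
boot_minor = rng.choice(idx_minor, size=n, replace=True)
boot_major = rng.choice(idx_major, size=n, replace=True)
balanced_idx = np.concatenate([boot_minor, boot_major])

counts = np.bincount(y_demo[balanced_idx])
print(counts)  # [230 230]
```

Each tree then sees only a fraction of the majority class, which is why such models tend to predict class 1 more readily.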

In [None]:
modelGB = GradientBoostingClassifier()
print('* Gradient Boosting Classifier * \n')
fit_evaluate(modelGB, X_train, X_test, y_train, y_test)

The GB classifier performs quite well. Next, I test a bagging classifier with additional balancing. Bagging draws several random subsets of the training data, fits one estimator on each subset, and combines their predictions. The base estimator is a decision tree.
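
As a rough illustration of plain bagging (without the balancing step), the sketch below fits ten decision trees on bootstrap samples of synthetic data and combines them by majority vote. The data set, the number of trees and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the churn data set
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(10):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(y_demo), size=len(y_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# majority vote over the individual tree predictions (ties go to class 1)
votes = np.mean([t.predict(X_demo) for t in trees], axis=0)
y_bagged = (votes >= 0.5).astype(int)

print('training accuracy:', (y_bagged == y_demo).mean())
```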

In [None]:
modelBBC = BalancedBaggingClassifier()
print('* Balanced Bagging Classifier * \n')
fit_evaluate(modelBBC, X_train, X_test, y_train, y_test)

As expected, the balanced bagging classifier also favors the underrepresented class 1 (churned), providing the best F1 score for the minority class and a good overall accuracy of 90%.
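
For reference, this is how precision, recall and the F1 score for the minority class follow from a confusion matrix. The labels below are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels (1 = churned)
y_true = np.array([0] * 8 + [1] * 4)
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners who were found
f1 = 2 * precision * recall / (precision + recall)

print('precision:', precision, 'recall:', recall, 'F1:', f1)
```

Accuracy alone would be 10/12 here, which says little about how well the churned customers are identified.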

In [None]:
modelKNN = KNeighborsClassifier() 
print('* K Nearest Neighbors Classifier * \n')
fit_evaluate(modelKNN, X_train, X_test, y_train, y_test)

In [None]:
# finding the best k 
error_rate = []
for i in range(1,25):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.plot(range(1,25), error_rate, color='b', linestyle='--', marker='o', markerfacecolor='r', markeredgecolor='r', markersize=8)
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.title('Error Rate vs. K Value')

# report the k value that minimizes the error
print('Minimum error:', np.round(min(error_rate), 3),'at K =', (error_rate.index(min(error_rate)) + 1), '\n');

The error is minimized at k = 5, which is also the default `n_neighbors` in `KNeighborsClassifier`, so the default KNN model above already uses the best k.
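
One caveat: the loop above selects k using the test set, so the reported minimum error is likely optimistic. A possible alternative is to choose k by cross-validation on the training data only, and to scale the features first, since KNN is distance-based. The sketch below does this on synthetic data; the data set, the scoring choice and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, not the churn data set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.77, 0.23], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 25))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X_tr, y_tr)

best_k = search.best_params_['kneighborsclassifier__n_neighbors']
test_f1 = search.score(X_te, y_te)  # uses the 'f1' scorer chosen above
print('best k (cross-validated):', best_k)
print('test F1:', round(test_f1, 3))
```

The test set is touched only once at the end, so the final score is an honest estimate for the selected k.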