In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
plt.rc('figure', figsize=(10,5))
sns.set_theme(style='darkgrid')
sns.set_palette('plasma')

In [3]:
df_dict = pd.read_excel("../raw_data/Customer_Churn_Data_Large.xlsx", sheet_name=None)

In [3]:
# listing names of sheets in the excel file
df_dict.keys()

dict_keys(['Customer_Demographics', 'Transaction_History', 'Customer_Service', 'Online_Activity', 'Churn_Status'])

In [4]:
# listing columns of dataframes
for key, df in df_dict.items():
    print("key: ", key)
    print("Columns: ", df.columns.values, "\n")

key:  Customer_Demographics
Columns:  ['CustomerID' 'Age' 'Gender' 'MaritalStatus' 'IncomeLevel'] 

key:  Transaction_History
Columns:  ['CustomerID' 'TransactionID' 'TransactionDate' 'AmountSpent'
 'ProductCategory'] 

key:  Customer_Service
Columns:  ['CustomerID' 'InteractionID' 'InteractionDate' 'InteractionType'
 'ResolutionStatus'] 

key:  Online_Activity
Columns:  ['CustomerID' 'LastLoginDate' 'LoginFrequency' 'ServiceUsage'] 

key:  Churn_Status
Columns:  ['CustomerID' 'ChurnStatus'] 



### Data Selection Rationale
A quick look at the data suggests that all of the dataframes should be used for model building, since each one provides relevant information.  
The datasets were selected using domain knowledge: each has a plausible relation to customer churn. The rationale is as follows:

- __Customer Demographics__ : Features such as MaritalStatus and IncomeLevel are too important to ignore, since customer churn may be influenced by a change in marital status or income level, or by Age.
- __Transaction History__ : Transaction dates and spending history could reveal a change in spending that might lead to churn. Some customers may have left (or stayed) due to better offers or experiences.
- __Customer Service__ : Frequent customer service interactions are likely to affect churn. The resolution status of an interaction might correlate directly with the churn status.
- __Online Activity__ : The login frequency and a very old last login date could indicate potential customer churn.
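
The rationale above implies that the sheets will eventually be combined into one row per customer. The following is a minimal sketch of that step, not the notebook's actual pipeline. It uses small made-up frames in place of the real sheets, and the `ResolutionStatus` values (`"Resolved"`, `"Unresolved"`) and the derived column names (`total_spent`, `n_transactions`, `n_interactions`, `n_unresolved`) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for three of the sheets; all values are made up for illustration.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "AmountSpent": [100.0, 50.0, 20.0],
})
service = pd.DataFrame({
    "CustomerID": [1, 3],
    "ResolutionStatus": ["Resolved", "Unresolved"],  # assumed category labels
})
churn = pd.DataFrame({"CustomerID": [1, 2, 3], "ChurnStatus": [0, 1, 0]})

# One row per customer: total spend and number of transactions.
spend = transactions.groupby("CustomerID").agg(
    total_spent=("AmountSpent", "sum"),
    n_transactions=("AmountSpent", "size"),
).reset_index()

# One row per customer: number of interactions and how many were unresolved.
support = (
    service.assign(unresolved=service["ResolutionStatus"].eq("Unresolved"))
    .groupby("CustomerID")
    .agg(
        n_interactions=("ResolutionStatus", "size"),
        n_unresolved=("unresolved", "sum"),
    )
    .reset_index()
)

# Left-join onto the target so every customer is kept, then treat
# "no rows in a sheet" as a count or total of zero.
features = (
    churn.merge(spend, on="CustomerID", how="left")
    .merge(support, on="CustomerID", how="left")
    .fillna({"total_spent": 0, "n_transactions": 0,
             "n_interactions": 0, "n_unresolved": 0})
)
print(features)
```

Joining on `CustomerID` with `how="left"` keeps customers that have no transactions or service interactions, which an inner join would silently drop.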

#### Customer Churn

In [4]:
target_df = df_dict['Churn_Status']
target_df.head()

   CustomerID  ChurnStatus
0           1            0
1           2            1
2           3            0
3           4            0
4           5            0


In [5]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   CustomerID   1000 non-null   int64
 1   ChurnStatus  1000 non-null   int64
dtypes: int64(2)
memory usage: 15.8 KB


There are no null values in the target variable.
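
The same conclusion can be checked programmatically rather than read off the `info()` output. The sketch below uses a toy frame built from the five rows shown by `head()` in place of the real `target_df`. The duplicate check on `CustomerID` is an extra sanity check added here, not something the notebook performs.

```python
import pandas as pd

# Toy stand-in for target_df, using the five rows shown by head().
target_df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "ChurnStatus": [0, 1, 0, 0, 0],
})

# Count missing values per column.
null_counts = target_df.isna().sum()
print(null_counts)

# Check that each customer appears only once in the target table.
has_duplicate_ids = target_df["CustomerID"].duplicated().any()
print("Duplicate CustomerID values:", has_duplicate_ids)
```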

In [6]:
target_df['ChurnStatus'].value_counts()
# 0 = No Churn, 1 = Churn

ChurnStatus
0    796
1    204
Name: count, dtype: int64
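
The counts above are easier to interpret as proportions. The sketch below rebuilds a `ChurnStatus` column from the reported counts (796 zeros, 204 ones) instead of loading the Excel file. With the real data, the same calls could be applied to `target_df['ChurnStatus']`.

```python
import pandas as pd

# Rebuild a ChurnStatus column with the counts reported above.
churn_status = pd.Series([0] * 796 + [1] * 204, name="ChurnStatus")

# Share of each class instead of raw counts.
proportions = churn_status.value_counts(normalize=True)
print(proportions)

# With 0/1 labels, the mean is the churn rate.
churn_rate = churn_status.mean()
print(f"Churn rate: {churn_rate:.1%}")  # Churn rate: 20.4%
```

Roughly one customer in five churned, so the classes are imbalanced. That is worth keeping in mind when choosing evaluation metrics later.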