***KNOWN***

// general info

- Binary Classification Task
- features = ['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
- target = ['Exited']
- no external test set: 3:1:1 - training:valid:test - 60%:20%:20%

// feature types

- Categorical Features: ['Surname', 'Geography', 'Gender']
- Numerical Features: ['RowNumber', 'CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
- Target is Numerical

// misc observations

- Tenure column has 9091 / 10000 non-null entries (909 missing, 9.09%)
- Surname column contains entries with special characters: data['Surname'][9] = 'H?'
- no duplicate CustomerId, 10000 different customers
- 'IsActiveMember' not mutually exclusive with 'Exited'

// class balance

- 79.63% of customers have not exited ('Exited' == 0)
- 20.37% of customers have exited ('Exited' == 1)

***UNKNOWN***
- model type? DecisionTreeClassifier, RandomForestClassifier, LogisticRegression
- hyperparameters?
- class balancing methods?
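
**One way to narrow down the open questions above is to fit each candidate model and compare F1 and AUC-ROC on a validation set. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the churn data, and the hyperparameter values (`max_depth=5`, `n_estimators=50`, `max_iter=1000`) are placeholder assumptions, not tuned choices.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the churn data (assumption): roughly 80% negative, 20% positive
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

# the three model types under consideration, with placeholder hyperparameters
candidates = {
    'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=5, random_state=12345),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=12345),
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=12345),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_valid)
    # probability of the positive class is what roc_auc_score needs
    proba_one = model.predict_proba(X_valid)[:, 1]
    scores[name] = {
        'f1': f1_score(y_valid, predicted),
        'auc_roc': roc_auc_score(y_valid, proba_one),
    }
    print(name, scores[name])
```

**The same loop could be wrapped around a grid of hyperparameter values once a model type looks promising.**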

***OBJECTIVES***
- F1 score: at least 0.59 on the test set
- Plot ROC and measure AUC-ROC
- methods of balance to try: upsampling & class_weight='balanced'
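
**The two balancing methods listed above could look roughly like this. The `upsample` helper, its `repeat` argument, and the toy data are hypothetical names made up for illustration; only `class_weight='balanced'` and `sklearn.utils.shuffle` are real scikit-learn features. Upsampling would be applied to the training set only, never to the validation or test sets.**

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle


def upsample(features, target, repeat, random_state=12345):
    """Repeat the positive-class rows `repeat` times, then shuffle (hypothetical helper)."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)

    # shuffle features and target together so the rows stay aligned
    features_up, target_up = shuffle(features_up, target_up, random_state=random_state)
    return features_up, target_up


# toy training data (made up): 4 negatives, 1 positive
toy_features = pd.DataFrame({'Age': [25, 40, 33, 51, 62],
                             'Balance': [0.0, 1200.5, 300.0, 0.0, 9800.0]})
toy_target = pd.Series([0, 0, 0, 0, 1], name='Exited')

features_up, target_up = upsample(toy_features, toy_target, repeat=4)
print(target_up.value_counts())

# alternative: leave the data alone and let the model reweight the classes
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=12345)
```

**Both approaches aim at the same thing, giving the minority class more influence during training, so they would normally be compared against each other rather than stacked.**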

In [None]:
import pandas as pd

data = pd.read_csv('Churn.csv')

#print(data.dtypes)
#data.info(verbose=True)
#display(data.head(10))
#print(data['Surname'][9])
#print(data.isna().sum())
#print(data['CustomerId'].duplicated().sum())
#print(data['Surname'].value_counts())
#print(data[data['Surname'] == 'Smith'])
#print(data.duplicated(subset='CustomerId').value_counts())
#print(data[data['Tenure'].isna()])
#print(data[(data['Tenure'] > 0) & (data['Tenure'] < 1)])
#print(data[(data['Tenure'] < 1)])
#print(data['Geography'].value_counts())

data['Tenure_Missing'] = data['Tenure'].isnull().astype(int)
data['Tenure'] = data['Tenure'].fillna(0)
#print(data[data['Tenure_Missing'] == 1])
#print(data.info(verbose=True))

#print(data['Exited'].value_counts(normalize=True))

***data analysis & cleaning, adding surface-level observations to known & unknown***

**Missing values in the 'Tenure' column have been replaced with 0. 'Tenure' is measured in whole years, which raises a question: is a customer with 11 months of loan history rounded up to 1.0, or left at 0.0? The missing values could also mean different things from one observation to the next: human error, a new customer, a bank system glitch, or no active loan.**

**Either way, I've added a new column, 'Tenure_Missing', that flags the rows where 'Tenure' was NaN.**
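
**A quick way to see whether the missing-value flag carries any signal is to compare the churn rate between flagged and unflagged rows. The sketch below runs on a small made-up frame (`toy`), not the real data, so the numbers it prints say nothing about the actual customers.**

```python
import numpy as np
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    'Tenure': [2.0, np.nan, 5.0, np.nan, 0.0, 7.0],
    'Exited': [0, 1, 0, 0, 1, 0],
})

# same two steps as in the cell above: flag first, then fill
toy['Tenure_Missing'] = toy['Tenure'].isnull().astype(int)
toy['Tenure'] = toy['Tenure'].fillna(0)

# share of exited customers among rows with and without a recorded Tenure
churn_by_missing = toy.groupby('Tenure_Missing')['Exited'].mean()
print(churn_by_missing)
```

**If the two rates turned out to be close on the real data, that would suggest the values are missing more or less at random.**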

**Class balance is 79.63% negative, 20.37% positive. We NEED to account for this when training our models later, since predicting negative for every observation would already yield ~80% accuracy.**
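
**To make the point concrete, here is a small sketch of a constant model that always predicts the majority class. The class counts are derived from the 79.63% / 20.37% split over 10000 rows, with the majority class coded as 0; no real model is trained here.**

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 7963 majority-class rows (0) and 2037 minority-class rows (1)
target = np.array([0] * 7963 + [1] * 2037)

# constant model: predict the majority class for every observation
constant_prediction = np.zeros_like(target)

accuracy = accuracy_score(target, constant_prediction)
# zero_division=0 silences the warning raised when no positives are predicted
f1 = f1_score(target, constant_prediction, zero_division=0)

print('accuracy:', accuracy)
print('f1:', f1)
```

**Accuracy comes out at 0.7963 while F1 is 0.0, which is why F1 and AUC-ROC are better yardsticks for this task than accuracy alone.**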

In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

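# our categorical features are nominal, so we one-hot encode them (drop_first=True drops one redundant dummy per feature)
# note: get_dummies also encodes 'Surname', creating one column per distinct surname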
data_ohe = pd.get_dummies(data, drop_first=True)

# separate features from target
features = data_ohe.drop('Exited', axis=1)
target = data_ohe['Exited']

# 60% train set, 40% temporary set
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=.4, random_state=12345)

# 20% valid set, 20% test set
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=.5, random_state=12345)

# specify numeric features
numeric = ['RowNumber', 'CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

# fit scaler on the training features only, so no valid/test statistics leak into it
scaler = StandardScaler()
scaler.fit(features_train[numeric])

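# copy the split frames first so the assignments below don't trigger SettingWithCopyWarning
features_train, features_valid, features_test = features_train.copy(), features_valid.copy(), features_test.copy()

# apply scaling to numeric columns in our feature sets
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])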

#print(features_train.head())
#print(features_valid.head())
#print(features_test.head())