# Machine Learning Algorithms for Predicting Client Churn (Prototype)

> Disclaimer: The data presented here is sensitive and for internal use only

TODO: Description


## Technologies
---

- Jupyter Notebooks
- Python3
  - `pandas`
  - `sklearn`
  - `numpy`
  - `matplotlib`

## Libraries
---

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
!pip install scikit-learn



In [4]:
import sklearn

In [5]:
!pip install matplotlib



In [6]:
import matplotlib.pyplot as plt

In [7]:
%matplotlib inline

## Data
---

### Raw Data

In [8]:
raw_data = pd.read_csv('./datasets/sample-raw-data.csv')

The raw data must be cleaned before a machine learning algorithm can be trained on it. The required steps are:

- Concatenate data based on each unique client
- Remove columns that are clearly poor feature choices
- Create "dummy" values for each column of strings; many ML implementations can't handle string values out-of-the-box

Further steps that can be taken:

- Separate "cohorts" based on purchase date, or normalize older and newer clients versus each other
  - Explanation: Because we're trying to predict the likelihood of cancellation, newer clients will naturally (and mistakenly) appear to cancel at a lower rate than older clients; this is because older clients have been given more "opportunity" to cancel
- Convert dates to a simple scalar value
- Convert location addresses to scalar or vector values; i.e., avoid (or be explicit about) basing the prediction on simple population density
- Normalize each scalar value
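
As a rough illustration of the "convert dates to a simple scalar value" step, the sketch below turns a start date into a tenure in days. The sample rows, the `TENURE_DAYS` column name, and the `as_of` snapshot date are assumptions made for the example; they are not part of the dataset above.

```python
import pandas as pd

# Hypothetical sample using the same timestamp format as the raw data.
sample = pd.DataFrame({
    'START_DATE': [
        '2018-11-28 16:07:31.000 +0000',
        '2016-10-18 21:55:55.000 +0000',
        None,
    ],
})

# Assumed "as of" snapshot date; choose whatever reference point fits the export.
as_of = pd.Timestamp('2019-11-18', tz='UTC')

# Parse to timezone-aware datetimes; missing values become NaT.
start = pd.to_datetime(sample['START_DATE'], utc=True)

# Tenure in whole days; rows without a start date stay NaN.
sample['TENURE_DAYS'] = (as_of - start).dt.days
```

A tenure value like this could also serve as the basis for the cohort split mentioned above, since it makes explicit how much "opportunity" each client has had to cancel.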

In [9]:
# Current "dirty" data
raw_data.head(5)

Unnamed: 0,CHURN_DATE,HEALTH_SCORE,HEALTH_SCORE_UPDATED_DATE,RELATIONSHIP_STRENGTH,SENTIMENT,LAST_ZENDESK_TICKET_DATE,NPS_SCORE,NPS_SCORE_DATE,LAST_LOGIN,IS_EDITED,...,LAUNCH_DATE,PAYMENT_PLAN,FREQUENCY,SUBSCRIPTION_ID,SUBSCRIPTION_ITEM_ID,PRODUCT_ID,PRODUCT_QUANTITY,PRODUCT_PRICE,START_DATE,END_DATE
0,,73.0,2019-11-08 00:31:00.000 +0000,Good,neutral,2019-11-05 19:54:48.000 +0000,10.0,2019-04-29 04:00:00.000 +0000,2019-11-18 05:00:00.000 +0000,True,...,2019-04-15 00:00:00.000 +0000,Monthly,Monthly,sub_E3cpvGPbhWYXL9,si_E3cpsucoXGLgRq,addon_catering_store_monthly,1,139.0,2018-11-28 16:07:31.000 +0000,
1,,73.0,2019-11-08 00:31:00.000 +0000,Good,neutral,2019-11-05 19:54:48.000 +0000,10.0,2019-04-29 04:00:00.000 +0000,2019-11-18 05:00:00.000 +0000,True,...,2019-04-15 00:00:00.000 +0000,Monthly,Monthly,sub_E3cpvGPbhWYXL9,si_E3cpSjCw4yRknW,base_plus_monthly,1,199.0,2018-11-28 16:07:31.000 +0000,
2,,50.0,2019-11-08 00:31:00.000 +0000,Neutral,,,,,2019-11-18 05:00:00.000 +0000,False,...,2016-10-31 00:00:00.000 +0000,Monthly,Monthly,sub_9OtLNjGdGik2ge,si_FkYuZKZWOAhqdC,base_plus_monthly,1,199.0,2019-09-04 19:05:07.000 +0000,
3,,50.0,2019-11-08 00:31:00.000 +0000,Neutral,,,,,2019-11-18 05:00:00.000 +0000,False,...,2016-10-31 00:00:00.000 +0000,Monthly,Monthly,sub_9OtLNjGdGik2ge,si_1965Ys2vS2BuzRKbWsoEErwQ,baby,1,99.0,2016-10-18 21:55:55.000 +0000,2019-09-18 21:55:54.000 -0700
4,,86.0,2019-11-08 00:31:00.000 +0000,Good,neutral,2019-07-29 17:01:59.000 +0000,,,2019-07-24 04:00:00.000 +0000,True,...,2017-10-24 00:00:00.000 +0000,Monthly,Monthly,sub_BGEa6oquQ3Wrhh,si_1Ati6w2vS2BuzRKbP1djq3kT,plus,1,199.0,2017-08-22 19:32:26.000 +0000,


In [10]:
raw_data.count()

CHURN_DATE                           160
HEALTH_SCORE                        4538
HEALTH_SCORE_UPDATED_DATE           4875
RELATIONSHIP_STRENGTH               4993
SENTIMENT                           1778
LAST_ZENDESK_TICKET_DATE            2404
NPS_SCORE                            950
NPS_SCORE_DATE                       942
LAST_LOGIN                          4919
IS_EDITED                           5814
IS_DELINQUENT                       5814
IS_CONCEPT                          5814
NUMBER_OF_CONCEPTS                  2442
NUMBER_OF_LOCATIONS                 5636
ANNUAL_REVENUE                      4954
ANNUAL_HOSPITALITY_GROUP_REVENUE    5814
PARTNERSHIP                         1878
BILLING_CITY                        5795
BILLING_COUNTRY                     5810
BILLING_STATE                       5777
BUSINESS_MARKET                     4115
BUSINESS_TYPE                       5793
SERVICE_STYLE                       5790
NEW_OR_EXISTING                     5790
CUISINE         

### Filtered columns

The following dataset concatenates duplicate clients and keeps only the columns we'll use for the prototype.

In a live environment, these actions can be performed programmatically.
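
A minimal sketch of how that could look with `pandas`, assuming one row per subscription item as in the raw export. The toy rows and the `MONTHLY_PRICE` aggregate are hypothetical and only illustrate the idea; the actual filtered file may have been produced differently.

```python
import pandas as pd

# Hypothetical rows shaped like the raw export: one row per subscription item.
raw = pd.DataFrame({
    'SUBSCRIPTION_ID': ['sub_A', 'sub_A', 'sub_B'],
    'RELATIONSHIP_STRENGTH': ['Good', 'Good', 'Neutral'],
    'PRODUCT_PRICE': [139.0, 199.0, 199.0],
    'CHURN_DATE': [None, None, '2019-09-18'],
})

# Client-level attributes repeat on every item row, so 'first' collapses them;
# item-level values (here the price) need an explicit aggregate such as 'sum'.
per_client = raw.groupby('SUBSCRIPTION_ID', as_index=False).agg(
    RELATIONSHIP_STRENGTH=('RELATIONSHIP_STRENGTH', 'first'),
    CHURN_DATE=('CHURN_DATE', 'first'),
    MONTHLY_PRICE=('PRODUCT_PRICE', 'sum'),  # hypothetical derived column
)
```

Columns that are not listed in the aggregation are dropped, which covers the column-filtering step at the same time.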

In [11]:
filtered_data = pd.read_csv('./datasets/sample-filtered-data.csv')

In [12]:
filtered_data.head(5)

Unnamed: 0,SUBSCRIPTION_ID,RELATIONSHIP_STRENGTH,SENTIMENT,NPS_SCORE,IS_DELINQUENT,NUMBER_OF_LOCATIONS,ANNUAL_REVENUE,PARTNERSHIP,BILLING_CITY,BUSINESS_TYPE,CUISINE,PAYMENT_PLAN,CHURN_DATE
0,sub_E3cpvGPbhWYXL9,Good,neutral,10.0,False,1.0,4056.0,,Philadelphia,Restaurant,American (Traditional),Monthly,
1,sub_9OtLNjGdGik2ge,Neutral,,,False,1.0,1188.0,,New York,Restaurant,American (New),Monthly,
2,sub_BGEa6oquQ3Wrhh,Good,neutral,,False,1.0,1908.0,US Foods,South Bend,Restaurant,American (Traditional),Monthly,
3,sub_EtPCVOt1zlHjQj,Good,,,False,1.0,1188.0,US Foods,Walhalla,Restaurant,"[""Diner"",""Sandwiches"",""American"",""Frozen Yogur...",,
4,sub_EWvgOoZikMZEzh,Neutral,neutral,,False,3.0,2388.0,,Vancouver,Restaurant,Other,Monthly,


In [13]:
filtered_data.count()

SUBSCRIPTION_ID          4638
RELATIONSHIP_STRENGTH    4019
SENTIMENT                1363
NPS_SCORE                 731
IS_DELINQUENT            4638
NUMBER_OF_LOCATIONS      4498
ANNUAL_REVENUE           3972
PARTNERSHIP              1645
BILLING_CITY             4621
BUSINESS_TYPE            4621
CUISINE                  4619
PAYMENT_PLAN             4473
CHURN_DATE                 96
dtype: int64

### Further cleaning

In [14]:
current_data = pd.DataFrame(filtered_data).copy(deep=True) 

In [15]:
# Show column names
print(current_data.columns.values)

['SUBSCRIPTION_ID' 'RELATIONSHIP_STRENGTH' 'SENTIMENT' 'NPS_SCORE'
 'IS_DELINQUENT' 'NUMBER_OF_LOCATIONS' 'ANNUAL_REVENUE' 'PARTNERSHIP'
 'BILLING_CITY' 'BUSINESS_TYPE' 'CUISINE' 'PAYMENT_PLAN' 'CHURN_DATE']


#### Set the index

In [16]:
current_data = current_data.set_index('SUBSCRIPTION_ID')

#### Handle string transformations

In [17]:
import re

In [18]:
def clean_string(x):
    def convert_letters(x):
        subbed_commas = re.sub(',', ' ', x)
        subbed_non_alpha = re.sub(r'([^\s\w]|_)+', '', subbed_commas)
        return subbed_non_alpha

    if isinstance(x, str):
        return convert_letters(x).strip().lower()
    
    return x

In [19]:
def clean_and_get_first_two_words(x):
    if isinstance(x, str):
        # Keep at most the first two words. A string that is empty after
        # cleaning (e.g. punctuation only) becomes '' instead of raising IndexError
        words = clean_string(x).split()
        return ' '.join(words[:2])

    return x

In [20]:
current_data['BILLING_CITY'] = current_data['BILLING_CITY'].apply(clean_string)

In [21]:
current_data['PAYMENT_PLAN'] = current_data['PAYMENT_PLAN'].apply(clean_string)

In [22]:
current_data['PARTNERSHIP'] = current_data['PARTNERSHIP'].apply(clean_string)

In [23]:
current_data['BUSINESS_TYPE'] = current_data['BUSINESS_TYPE'].apply(clean_and_get_first_two_words)

In [24]:
current_data['CUISINE'] = current_data['CUISINE'].apply(clean_and_get_first_two_words)

In [25]:
current_data = pd.get_dummies(
    current_data, 
    columns=['BILLING_CITY', 'BUSINESS_TYPE', 'CUISINE', 'PAYMENT_PLAN', 'PARTNERSHIP']
)
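
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy column (the values are made up). It also shows that a missing value produces a row of all zeros unless `dummy_na=True` is passed. `dtype=int` is set explicitly because the default dummy dtype differs between pandas versions.

```python
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({'PAYMENT_PLAN': ['monthly', 'annual', None]})

# One indicator column per distinct string; the missing value gets no column.
encoded = pd.get_dummies(toy, columns=['PAYMENT_PLAN'], dtype=int)

# encoded columns: PAYMENT_PLAN_annual, PAYMENT_PLAN_monthly
# The row that had a missing value is 0 in both columns.
```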

#### Handle the "feelings" columns

In [26]:
def handle_feelings(x):
    cleaned = clean_string(x)
    if cleaned == 'good' or cleaned == 'positive':
        return 1.0
    if cleaned == 'neutral':
        return 0.0
    if cleaned == 'bad' or cleaned == 'negative':
        return -1.0
    return None

In [27]:
current_data['RELATIONSHIP_STRENGTH'] = current_data['RELATIONSHIP_STRENGTH'].apply(handle_feelings)

In [28]:
current_data['SENTIMENT'] = current_data['SENTIMENT'].apply(handle_feelings)

In [29]:
current_data = pd.get_dummies(current_data, columns=['RELATIONSHIP_STRENGTH', 'SENTIMENT'])

#### Normalize the scalar columns

In [30]:
def normalize(df):
    return (df - df.mean()) / df.std()

In [31]:
current_data['NPS_SCORE'] = normalize(current_data['NPS_SCORE'])

In [32]:
current_data['NUMBER_OF_LOCATIONS'] = normalize(current_data['NUMBER_OF_LOCATIONS'])

In [33]:
current_data['ANNUAL_REVENUE'] = normalize(current_data['ANNUAL_REVENUE'])
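
A small worked example of the z-score used by `normalize`, on made-up values. Note that `pandas` skips NaNs when computing the mean and the standard deviation, and that `Series.std()` uses the sample standard deviation (`ddof=1`) by default. Because the notebook later fills NaNs with 0, a missing value effectively ends up at the column mean.

```python
import pandas as pd

def normalize(series):
    # Z-score: subtract the mean, divide by the sample standard deviation.
    return (series - series.mean()) / series.std()

# Made-up scores with one missing entry: mean is 8.0, sample std is 2.0.
scores = pd.Series([8.0, 10.0, None, 6.0])

z = normalize(scores)    # 0.0, 1.0, NaN, -1.0
z_filled = z.fillna(0)   # the missing score now sits at the mean (0.0)
```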

#### Handle misc transformations

In [34]:
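# Renaming the following column, so as to remain parallel with the other columns' "Higher == Good, Lower == Bad" theme
# NOTE: the rename alone does not flip the values; as written, IS_CURRENT is 1.0 for delinquent clients.
#     To truly invert, handle_bools would need to map True -> 0.0 and False -> 1.0
current_data = current_data.rename(columns={'IS_DELINQUENT': 'IS_CURRENT'})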

In [35]:
def handle_bools(x):
    if x == True:
        return 1.0
    if x == False:
        return 0.0
    return x

In [36]:
current_data['IS_CURRENT'] = current_data['IS_CURRENT'].apply(handle_bools)

#### Handle the prediction value

In [37]:
# Because we're running a logistic regression, we need to convert "churn date" to a binary label,
#     from which the model estimates the likelihood of cancellation
# Also, because most of the feature values are "Higher == Good, Lower == Bad", we should invert this
#     field as well: 1 == still a client, 0 == churned
current_data = current_data.rename(columns={'CHURN_DATE': 'HAPPINESS_SCORE'})

In [38]:
def handle_churn_date(x):
    if pd.isnull(x):
        return 1
    return 0

In [39]:
current_data['HAPPINESS_SCORE'] = current_data['HAPPINESS_SCORE'].apply(handle_churn_date)

#### Handle NaNs

In [40]:
current_data = current_data.fillna(0)

### Final Dataset to process

In [41]:
dataset = pd.DataFrame(current_data).copy(deep=True) 

In [42]:
dataset.head(5)

SUBSCRIPTION_ID,NPS_SCORE,IS_CURRENT,NUMBER_OF_LOCATIONS,ANNUAL_REVENUE,HAPPINESS_SCORE,BILLING_CITY_33037,BILLING_CITY_98101,BILLING_CITY_abilene,BILLING_CITY_accokeek,BILLING_CITY_acton,...,PARTNERSHIP_resy,PARTNERSHIP_rsi,PARTNERSHIP_square,PARTNERSHIP_us foods,RELATIONSHIP_STRENGTH_-1.0,RELATIONSHIP_STRENGTH_0.0,RELATIONSHIP_STRENGTH_1.0,SENTIMENT_-1.0,SENTIMENT_0.0,SENTIMENT_1.0
sub_E3cpvGPbhWYXL9,0.650948,0.0,-0.24905,1.811271,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
sub_9OtLNjGdGik2ge,0.0,0.0,-0.24905,-0.750732,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
sub_BGEa6oquQ3Wrhh,0.0,0.0,-0.24905,-0.107551,1,0,0,0,0,0,...,0,0,0,1,0,0,1,0,1,0
sub_EtPCVOt1zlHjQj,0.0,0.0,-0.24905,-0.750732,1,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
sub_EWvgOoZikMZEzh,0.0,0.0,0.71583,0.321236,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


In [43]:
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Index: 4638 entries, sub_E3cpvGPbhWYXL9 to sub_EmYCrnN9rzQ7xU
Columns: 1867 entries, NPS_SCORE to SENTIMENT_1.0
dtypes: float64(4), int64(1), uint8(1862)
memory usage: 8.4+ MB
None


## Making a prediction
---

### Splitting the data

In [45]:
dataset["HAPPINESS_SCORE"] = dataset["HAPPINESS_SCORE"].astype(int)

In [46]:
data_x = dataset.drop(labels=["HAPPINESS_SCORE"], axis=1)

In [47]:
data_y = dataset['HAPPINESS_SCORE'].values

In [48]:
from sklearn.model_selection import train_test_split

In [84]:
x_train, x_test, y_train, y_test = train_test_split(
    data_x, 
    data_y, 
    test_size=0.2, 
    random_state=101,
)
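
Churned clients are rare in this dataset (96 of 4,638 rows have a `CHURN_DATE` in the filtered counts above, roughly 2%), so a purely random split can leave very few of them in the test set. One option, sketched here on synthetic data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both splits. The arrays below are stand-ins, not the real `data_x` and `data_y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 2% "churned" (label 0), mirroring the imbalance.
rng = np.random.default_rng(101)
X = rng.normal(size=(1000, 3))
y = np.ones(1000, dtype=int)
y[:20] = 0

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=101,
    stratify=y,  # preserve the churned / not-churned ratio in both splits
)

# The test set keeps the 2% churn share: 4 of its 200 rows are churned.
churned_in_test = int((y_test == 0).sum())
```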

### Logistic Regression

In [50]:
from sklearn.linear_model import LogisticRegression

In [51]:
model = LogisticRegression(solver='lbfgs')

In [77]:
result = model.fit(x_train, y_train)

In [80]:
print(result)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


### Testing the prediction

In [53]:
from sklearn import metrics

In [54]:
prediction_test = model.predict(x_test)

In [55]:
print('{}%'.format(round(metrics.accuracy_score(y_test, prediction_test) * 100)))

97.0%
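
The accuracy figure is worth reading with the class balance in mind: about 97.9% of clients in the filtered data have no churn date (4,542 of 4,638), so a model that always predicts "not churned" would score in the same range. The sketch below, on made-up labels, compares accuracy with the recall for the churned class using a majority-class baseline. It is only an illustration of the check, not a result from this notebook.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Made-up labels with the same encoding as HAPPINESS_SCORE:
# 1 = still a client, 0 = churned.
y_true = np.array([1] * 98 + [0] * 2)
X_placeholder = np.zeros((100, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

accuracy = metrics.accuracy_score(y_true, y_pred)
# Share of churned clients that the baseline actually catches.
churn_recall = metrics.recall_score(y_true, y_pred, pos_label=0)
# Rows are true classes, columns are predicted classes, ordered [0, 1].
matrix = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
```

Here the baseline reaches 98% accuracy while catching none of the churned clients, which is why recall (or the confusion matrix) for the churned class tends to be a more informative check than accuracy alone.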


In [68]:
weights = pd.Series(model.coef_[0])

In [69]:
weights.sort_values(ascending = False)

1853    2.463294
1852    1.914058
1629    0.759923
1725    0.756618
1827    0.740226
          ...   
1851   -0.994658
1      -1.118687
279    -1.178399
510    -1.275523
682    -1.710798
Length: 1866, dtype: float64
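
The integer index in the output above is the column position in `data_x`, which makes the ranking hard to read. One possible way to label the weights, sketched on a toy model with hypothetical column names, is to pass the feature names as the index of the `Series`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for data_x / data_y; the column names are hypothetical.
toy_x = pd.DataFrame({
    'NPS_SCORE': [1.0, 0.5, -1.0, -0.5, 0.8, -0.9],
    'PAYMENT_PLAN_monthly': [1, 1, 0, 0, 1, 0],
})
toy_y = [1, 1, 0, 0, 1, 0]

toy_model = LogisticRegression(solver='lbfgs').fit(toy_x, toy_y)

# Label each coefficient with its feature name instead of its column position.
named_weights = pd.Series(toy_model.coef_[0], index=toy_x.columns)
ranked = named_weights.sort_values(ascending=False)
```

Applied to the notebook's own objects, the equivalent call would presumably be `pd.Series(model.coef_[0], index=data_x.columns)`.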