The following is an example of a Decision Tree Classifier.
Goal: to PREDICT on Day 0 whether or not someone will convert to a paying customer at the end of their Free Trial, based on three input variables:

1. Channel Acquisition
2. State
3. Whether or not they Completed the Product Onboarding on the first day

In [1]:
# Importing the necessary libraries
import pandas as pd
import numpy as np 

# Importing the csv data and exploring the data structure
df = pd.read_csv("cvr_data.csv")
df.head()

Unnamed: 0,Channel,State,Onboarding_Completed,converted_to_paid
0,Stripe,California,Y,1
1,Instagram,Washington,N,0
2,Facebook,Oregon,Y,0
3,Shopify,California,Y,1
4,Stripe,California,N,1


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Channel               56 non-null     object
 1   State                 56 non-null     object
 2   Onboarding_Completed  56 non-null     object
 3   converted_to_paid     56 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.9+ KB


In [4]:
# scikit-learn's Decision Tree can only handle numbers, so we need to import a library that helps us encode all values into numbers
from sklearn.preprocessing import LabelEncoder

# Now we're using the encoder to convert string columns into numbers
Channel_enc = LabelEncoder()
State_enc = LabelEncoder()
Onboarding_Completed_enc = LabelEncoder()

# Creating new columns that match the encoded values from above
df['channel_values'] = Channel_enc.fit_transform(df['Channel'])
df['State_values'] = State_enc.fit_transform(df['State'])
df['Onboarding_Completed_values'] = Onboarding_Completed_enc.fit_transform(df['Onboarding_Completed'])

# Checking to ensure our encoding worked properly on our independent variable set
df.head()

Unnamed: 0,Channel,State,Onboarding_Completed,converted_to_paid,channel_values,State_values,Onboarding_Completed_values
0,Stripe,California,Y,1,3,0,1
1,Instagram,Washington,N,0,1,2,0
2,Facebook,Oregon,Y,0,0,1,1
3,Shopify,California,Y,1,2,0,1
4,Stripe,California,N,1,3,0,0


Creating a mapping for reference later, when we want to see the Decision Tree Classifier at work

# Channel mapping: FB = 0, IG = 1, Shopify = 2, Stripe = 3
# State mapping: CA = 0, OR = 1, WA = 2
# Onboarding Completed mapping: Y = 1, N = 0
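
The mapping above does not have to be written out by hand. A minimal sketch, assuming a small hypothetical sample of channel names rather than the full cvr_data.csv: a fitted LabelEncoder stores the original labels in sorted order in its `classes_` attribute, so each label's position is its encoded value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample mirroring the first rows shown above (not the full cvr_data.csv)
sample = pd.DataFrame({"Channel": ["Stripe", "Instagram", "Facebook", "Shopify", "Stripe"]})

channel_enc = LabelEncoder()
sample["channel_values"] = channel_enc.fit_transform(sample["Channel"])

# classes_ holds the original labels in sorted order; position == encoded value
channel_mapping = {label: int(code) for code, label in enumerate(channel_enc.classes_)}
print(channel_mapping)

# inverse_transform turns encoded values back into the original strings
decoded = list(channel_enc.inverse_transform([3, 0]))
print(decoded)
```

The same approach works for the State and Onboarding_Completed encoders.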

In [5]:
# Creating a new dataframe that drops the original string columns, now that they have been encoded into numbers
df_clean = df.drop(['Channel','State','Onboarding_Completed'],axis='columns')

# Checking it all looks good and we haven't missed anything
df_clean.head()

Unnamed: 0,converted_to_paid,channel_values,State_values,Onboarding_Completed_values
0,1,3,0,1
1,0,1,2,0
2,0,0,1,1
3,1,2,0,1
4,1,3,0,0


In [6]:
# Creating the feature set for the decision tree and the target variable we're trying to predict

# We drop the target variable from the features so the model is not trained on the 'answer'
features = df_clean.drop(['converted_to_paid'],axis = 'columns')
target = df_clean['converted_to_paid']

# Showing that the field we're predicting has been removed
features.head()

Unnamed: 0,channel_values,State_values,Onboarding_Completed_values
0,3,0,1
1,1,2,0
2,0,1,1
3,2,0,1
4,3,0,0


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,target,test_size=0.2, random_state=50) 
# test_size=0.2 means 20% of the dataset will be withheld for testing purposes
# random_state fixes the seed for the shuffle, so re-running this gives the same split and the results are reproducible
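
To illustrate what random_state does, here is a small sketch on hypothetical stand-in data with the same number of rows (56) as the dataset. The arrays `X` and `y` are made up for the example. Two calls with the same random_state should return identical splits, and a 20% test size on 56 rows is rounded up to 12 test rows, leaving 44 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows (56) as the notebook's dataset
X = np.arange(56 * 3).reshape(56, 3)
y = np.arange(56) % 2

# Two calls with the same random_state give the same split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.2, random_state=50)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2, random_state=50)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(len(X_train_a), len(X_test_a), same_split)
```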

In [8]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 44 to 52
Data columns (total 3 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   channel_values               12 non-null     int64
 1   State_values                 12 non-null     int64
 2   Onboarding_Completed_values  12 non-null     int64
dtypes: int64(3)
memory usage: 384.0 bytes


In [9]:
# Importing the necessary library to create the Decision Tree
from sklearn import tree

# Defining the model and the maximum depth of the tree (at most 3 levels of splits)
model = tree.DecisionTreeClassifier(max_depth=3)

In [10]:
# We fit on NumPy arrays instead of DataFrames so that predicting on plain lists later does not raise feature-name warnings
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values

In [11]:
# Training the model
# We fit only on the training portion (80%) of the train/test split created above

model.fit(X_train,y_train)

In [12]:
# Let's check the score of the model. For a classifier, the score is the accuracy: the share of rows predicted correctly. The closer to 1, the more accurate.
train_data_score = model.score(X_train,y_train)
test_data_score = model.score(X_test,y_test)

print("The score of the decision tree in Training data is " + str(train_data_score))
print("The score of the decision tree in Testing data is " + str(test_data_score))

The score of the decision tree in Training data is 0.8863636363636364
The score of the decision tree in Testing data is 0.9166666666666666
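
For reference, the score reported by a scikit-learn classifier is its mean accuracy. A minimal sketch on hypothetical toy data (the rows and the name `clf` are made up for illustration) shows that `score` matches both a manual calculation and `accuracy_score`:

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

# Hypothetical toy data: three encoded features, binary target
X = np.array([[3, 0, 1], [1, 2, 0], [0, 1, 1], [2, 0, 1], [3, 0, 0], [1, 1, 0]])
y = np.array([1, 0, 1, 1, 0, 0])

clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# score() is the fraction of rows whose predicted class matches the true class
score = clf.score(X, y)
manual = float(np.mean(clf.predict(X) == y))
library = accuracy_score(y, clf.predict(X))
print(score, manual, library)
```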


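A score of 1 would be perfect.
A test score of about 92% (11 of the 12 test rows predicted correctly) is pretty great, although with only 12 test rows it is a rough estimate.
Let's test out predicting a few different cases for the algorithm.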

In [13]:
# First case: 
# Stripe channel acquisition from California who completed Onboarding on Day 0 (3,0,1)
model.predict([[3,0,1]])

array([1])

A value of 1 above means they would be likely to convert.  Hooray, let's double down on Stripe and California marketing.

In [14]:
# Second case: 
# Instagram channel acquisition from Oregon who did not complete Onboarding on Day 0 (1,1,0)
model.predict([[1,1,0]])

array([0])

A value of 0 above means they would not be likely to convert.  But is that because of location and channel or onboarding?
Let's find out by changing the onboarding completed variable to 1 below.

In [15]:
# Third case: 
# Instagram channel acquisition from Oregon who did complete Onboarding on Day 0 (1,1,1)
model.predict([[1,1,1]])

array([1])

A 1!  For this user profile, completing Onboarding flips the prediction, which suggests that completing Onboarding is pretty important for whether or not a user is likely to convert.
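
Comparing individual predictions is one way to probe the model. Another is to look at the fitted tree directly. The sketch below is an illustration on hypothetical data, not the notebook's dataset: it builds every combination of the three encoded features and sets conversion equal to onboarding, so the outcome is known in advance. It then reads `feature_importances_` and prints the split rules with `tree.export_text`. Applied to the notebook's `model`, the same two calls would show how much each feature actually contributes.

```python
from itertools import product
from sklearn import tree

# Hypothetical data: every combination of channel (0-3), state (0-2), onboarding (0-1),
# with conversion set equal to onboarding. This is NOT the notebook's dataset.
X = [list(row) for row in product(range(4), range(3), range(2))]
y = [row[2] for row in X]

names = ["channel_values", "State_values", "Onboarding_Completed_values"]
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# feature_importances_ sums to 1; a larger share means the feature drove more of the splits
importances = dict(zip(names, clf.feature_importances_))
print(importances)

# export_text prints the learned split rules as indented text
rules = tree.export_text(clf, feature_names=names)
print(rules)
```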