# LoL Churn Predictor [Part 3 - Data Modeling]

**David Skarbrevik - 2018**

In part 2 we cleaned and analyzed our League of Legends data. Now we want to use that data to build a model that makes a churn-like prediction.

<a id="toc"></a>

<br>
<hr style="background-color: black; padding: 1px;">
<br>

<h2>Table of Contents</h2>

<br>

<ol>
    <h3><li><a href="#section1">Planning</a></li></h3>
    <br>
    <h3><li><a href="#section2">Prepping the data for modeling</a></li></h3>
    <br>
    <h3><li><a href="#section3">Modeling the data</a></li></h3>
</ol>

<br>
<hr style="background-color: black; padding: 1px;">
<br>

<a id='section1'></a>

## Step 1) Planning

### What preprocessing steps may be necessary before fitting data to a model?

**There are five things we may need/want to do before training our models:**

**1)** One-hot encode (OHE) categorical features and convert binary features to 1/0

**2) [optional]** normalize values (many columns have large values)

**3)** create a "label" feature that we will predict on

**4)** remove unwanted features from the dataset

**5)** randomize and split data into train/test sets


### What prediction tasks to model?

Some possibilities:

* Will the summoner get to level 3 or higher within the first month of play?
* Did the summoner play more than 1 match?
* Did the summoner play at least X matches?



### What types of models to try?

I will start with logistic regression, as it seems the most fitting for the binary prediction tasks above. I will also try some other common models such as random forests, and finally a simple neural network if I have time.

### Import needed libraries:

In [44]:
import pandas as pd
import numpy as np
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score

import tensorflow as tf

import math
import matplotlib.pyplot as plt
from tensorflow.python.framework import ops

%matplotlib inline
np.random.seed(1)

***

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section2'></a>

## Step 2) Prepping the data for modeling

**First, read cleaned dataframe from file**

In [4]:
df = pd.read_csv("./data/cleaned_riot_data.csv", encoding="ISO-8859-1")

**1) One-hot encoding for categorical features**

In [7]:
[(feature, Counter(df[feature].head(n=3))) for feature in list(df)] # which features are categorical?

[('summoner_id', Counter({92201075: 1, 93650017: 1, 92729877: 1})),
 ('summoner_name', Counter({'TrEx18': 1, 'iMain N01': 1, 'luexolu99': 1})),
 ('summoner_level', Counter({5: 3})),
 ('total_matches', Counter({7: 1, 6: 1, 4: 1})),
 ('first_match_time',
  Counter({'2018-01-16T05:50:54.986000+00:00': 1,
           '2018-03-10T06:24:33.477000+00:00': 1,
           '2018-03-10T06:56:11.750000+00:00': 1})),
 ('first_match_duration',
  Counter({'0 days 00:14:00.000000000': 1,
           '0 days 00:16:41.000000000': 1,
           '0 days 00:22:54.000000000': 1})),
 ('first_match_id', Counter({2695060245: 1, 2736710108: 1, 2736716714: 1})),
 ('assists', Counter({16.0: 1, 4.0: 1, 10.0: 1})),
 ('champLevel', Counter({12.0: 1, 11.0: 1, 13.0: 1})),
 ('combatPlayerScore', Counter({0.0: 3})),
 ('creepsPerMinDeltas_0-10',
  Counter({5.0999999999999996: 1, 3.5: 1, 0.80000000000000004: 1})),
 ('creepsPerMinDeltas_10-20', Counter({0.0: 2, 3.6000000000000001: 1})),
 ('creepsPerMinDeltas_20-30', Counter({

Looks like just the 'lane' and 'role' features are categorical, so let's one-hot encode those.

Note also that there are a couple of True/False columns. Python and its libraries usually treat them as 1/0 already, but we'll explicitly convert them in case they cause trouble later.

In [8]:
dummy_lane = pd.get_dummies(df['lane'])
dummy_role = pd.get_dummies(df['role'])

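# rename the fourth dummy column of each frame (by position) so the two don't collide after the join
dummy_role = dummy_role.rename(columns={dummy_role.columns[3]: "NO_ROLE"})
dummy_lane = dummy_lane.rename(columns={dummy_lane.columns[3]: "NO_LANE"})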

In [9]:
dummy_lane.head()

Unnamed: 0,BOTTOM,JUNGLE,MIDDLE,NO_LANE,TOP
0,0,0,1,0,0
1,0,0,0,1,0
2,0,1,0,0,0
3,0,0,0,0,1
4,0,1,0,0,0


In [10]:
dummy_role.head()

Unnamed: 0,DUO,DUO_CARRY,DUO_SUPPORT,NO_ROLE,SOLO
0,0,0,0,0,1
1,0,0,1,0,0
2,0,0,0,1,0
3,0,0,1,0,0
4,0,0,0,1,0


In [11]:
df = df.drop(['lane', 'role'], axis=1)
df = df.join([dummy_role,dummy_lane])

In [12]:
boolean_features = ['firstBloodAssist', 'firstBloodKill', 'firstInhibitorAssist', 
                    'firstInhibitorKill', 'firstTowerAssist', 'firstTowerKill', 'win']

df[boolean_features] = df[boolean_features].astype(int)
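
As a minimal sketch of the same encoding step, `pd.get_dummies` can also prefix the new columns with the source column name, which avoids renaming columns by position. The toy dataframe and the `"NONE"` category name below are assumptions made up for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataframe; all values are invented for illustration.
toy = pd.DataFrame({
    "lane": ["MIDDLE", "NONE", "JUNGLE", "TOP"],
    "role": ["SOLO", "DUO_SUPPORT", "NONE", "DUO_SUPPORT"],
    "win": [True, False, True, False],
})

# prefix= keeps the two hypothetical "NONE" categories apart (lane_NONE vs role_NONE).
encoded = pd.get_dummies(toy, columns=["lane", "role"],
                         prefix=["lane", "role"], dtype=int)

# Cast the boolean column to 0/1 explicitly, as in the cell above.
encoded["win"] = encoded["win"].astype(int)

print(sorted(encoded.columns))
```

The trade-off is that the resulting column names differ from the `NO_ROLE`/`NO_LANE` names used in the rest of this notebook.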

Let's just make sure all our features are numeric now:

In [13]:
df.head()

Unnamed: 0,summoner_id,summoner_name,summoner_level,total_matches,first_match_time,first_match_duration,first_match_id,assists,champLevel,combatPlayerScore,...,DUO,DUO_CARRY,DUO_SUPPORT,NO_ROLE,SOLO,BOTTOM,JUNGLE,MIDDLE,NO_LANE,TOP
0,92201075,TrEx18,5,7,2018-01-16T05:50:54.986000+00:00,0 days 00:14:00.000000000,2695060245,16.0,12.0,0.0,...,0,0,0,0,1,0,0,1,0,0
1,93650017,iMain N01,5,6,2018-03-10T06:24:33.477000+00:00,0 days 00:16:41.000000000,2736710108,4.0,11.0,0.0,...,0,0,1,0,0,0,0,0,1,0
2,92729877,luexolu99,5,4,2018-03-10T06:56:11.750000+00:00,0 days 00:22:54.000000000,2736716714,10.0,13.0,0.0,...,0,0,0,1,0,0,1,0,0,0
3,93839689,Md95359,1,1,2018-03-09T13:05:59.377000+00:00,0 days 00:28:54.000000000,2736217337,0.0,1.0,0.0,...,0,0,1,0,0,0,0,0,0,1
4,93599676,lwvvs,5,6,2018-03-09T11:42:58.309000+00:00,0 days 00:30:05.000000000,2736195812,7.0,17.0,0.0,...,0,0,0,1,0,0,1,0,0,0


Looks good, let's move on.

**2) Normalize feature values**

In [14]:
# skipping this step for now
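
If normalization turns out to be needed later, one common option is scikit-learn's `StandardScaler`. The sketch below uses invented data with two features on very different scales. It assumes the scaler is fit only on the training split (so it would run after step 5), which keeps test-set statistics out of training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)

# Invented data: one large-valued feature (gold-like) and one small-valued feature.
X_train_toy = rng.normal(loc=[10000.0, 5.0], scale=[3000.0, 2.0], size=(100, 2))
X_test_toy = rng.normal(loc=[10000.0, 5.0], scale=[3000.0, 2.0], size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training statistics on the test data

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```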

**3) Create a prediction task label feature**

We are trying to predict, from a player's first match stats, if they will play enough to reach summoner level 3 or greater. 

Let's see what the summoner level breakdown for players is in this dataset.

In [15]:
level_counts = Counter(df['summoner_level'])
level_counts

Counter({5: 453, 1: 192, 4: 377, 3: 279, 2: 194})

We see a good amount of each summoner level 1-5. Let's make prediction labels where `level < 3` gets `0` and `level >= 3` gets `1`.

In [16]:
level_data = df['summoner_level'].tolist()
labels = []

for value in level_data:
    if value < 3:
        labels.append(0)
    else:
        labels.append(1)

In [17]:
Counter(labels)

Counter({1: 1109, 0: 386})

While there may not be as many players under level 3 as we'd like, there are still over 300 examples in this group, so this may be a reasonable dataset to prototype the viability of our prediction task.
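
The labeling loop above can also be written as a single vectorized comparison. The sketch below rebuilds a `summoner_level` column from the level counts printed earlier (the row order is made up) and also reports the share of each class, which is worth keeping in mind when reading accuracy numbers later.

```python
import pandas as pd

# Level counts copied from the Counter output above; the row order is arbitrary.
levels = pd.Series([5] * 453 + [1] * 192 + [4] * 377 + [3] * 279 + [2] * 194,
                   name="summoner_level")

# level < 3 -> 0, level >= 3 -> 1
labels_vec = (levels >= 3).astype(int)

class_share = labels_vec.value_counts(normalize=True)
print(labels_vec.value_counts().to_dict())
print("Share of label 1: {:.4f}".format(class_share.loc[1]))
```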

**4) Removing unwanted features**

Some features like "summoner_name" aren't relevant to training our model and others like "total_matches" and "summoner_level" give the model information we don't want it to have access to.

In [18]:
list(df)

['summoner_id',
 'summoner_name',
 'summoner_level',
 'total_matches',
 'first_match_time',
 'first_match_duration',
 'first_match_id',
 'assists',
 'champLevel',
 'combatPlayerScore',
 'creepsPerMinDeltas_0-10',
 'creepsPerMinDeltas_10-20',
 'creepsPerMinDeltas_20-30',
 'creepsPerMinDeltas_30-end',
 'csDiffPerMinDeltas_0-10',
 'csDiffPerMinDeltas_10-20',
 'csDiffPerMinDeltas_20-30',
 'csDiffPerMinDeltas_30-end',
 'damageDealtToObjectives',
 'damageDealtToTurrets',
 'damageSelfMitigated',
 'damageTakenDiffPerMinDeltas_0-10',
 'damageTakenDiffPerMinDeltas_10-20',
 'damageTakenDiffPerMinDeltas_20-30',
 'damageTakenDiffPerMinDeltas_30-end',
 'damageTakenPerMinDeltas_0-10',
 'damageTakenPerMinDeltas_10-20',
 'damageTakenPerMinDeltas_20-30',
 'damageTakenPerMinDeltas_30-end',
 'deaths',
 'doubleKills',
 'firstBloodAssist',
 'firstBloodKill',
 'firstInhibitorAssist',
 'firstInhibitorKill',
 'firstTowerAssist',
 'firstTowerKill',
 'goldEarned',
 'goldPerMinDeltas_0-10',
 'goldPerMinDeltas_10-

In [19]:
extra_features = ['summoner_id', 'summoner_level', 'summoner_name', 'id', 'first_match_time',
                  'first_match_id', 'first_match_duration', 'latest_match_time', 'total_matches']

num_full_features = df.shape[1]

df = df.drop(extra_features, axis=1)

num_training_features = df.shape[1]

print("Number of features in full dataset: {}".format(num_full_features))
print("Number of features in model training dataset: {}".format(num_training_features))

Number of features in full dataset: 124
Number of features in model training dataset: 115


**5) Randomize and split data into train/test sets**

In [20]:
# define our training data and label data
X = np.array(df)
Y = np.array(labels)

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, test_size=0.20, random_state=1)

print("Training examples: {}".format(X_train.shape[0]))
print("Training labels: {}".format(Y_train.shape[0]))
print("Test examples: {}".format(X_test.shape[0]))
print("Test labels: {}".format(Y_test.shape[0]))

Training examples: 1196
Training labels: 1196
Test examples: 299
Test labels: 299
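
Because the two classes are imbalanced, it may be worth keeping the class ratio equal in both splits. A minimal sketch, assuming the same 80/20 split and label counts as above and using placeholder features, passes `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as this dataset; the features are placeholders.
Y_toy = np.array([1] * 1109 + [0] * 386)
X_toy = np.arange(len(Y_toy)).reshape(-1, 1)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.20, random_state=1, stratify=Y_toy)

# With stratification, the positive rate is nearly identical in both splits.
print("Train positive rate: {:.4f}".format(Y_tr.mean()))
print("Test positive rate: {:.4f}".format(Y_te.mean()))
```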


**Great! We're finally ready to fit our data to some models!**

***

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section3'></a>

## Step 3) Modeling the data

### Logistic Regression Model

In [25]:
log_model = LogisticRegression()
log_model.fit(X_train,Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [40]:
logistic_accuracy = log_model.score(X_test, Y_test)
preds = log_model.predict(X_test)
log_f1 = f1_score(Y_test, preds)
print("Accuracy of logistic regression model on test data: {:.2f}%".format(logistic_accuracy*100))
print("F1-score for logistic regression model on test data: {:.2f}".format(log_f1))

Accuracy of logistic regression model on test data: 74.25%
F1-score for logistic regression model on test data: 0.84


**Not bad at all!**

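There are a lot of qualifiers we should point out about this result before we jump for joy. The most important one is class imbalance: 1109 of the 1495 players (about 74%) have label `1`, so a model that always predicts `1` would reach roughly the same accuracy on the full dataset. Still, getting to 74% on this relatively small dataset with just a vanilla, out-of-the-box logistic regression model is an encouraging start.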
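
To make the class-imbalance qualifier concrete, here is a minimal sketch of a majority-class baseline using scikit-learn's `DummyClassifier`. It is computed on the full label distribution from step 3 (1109 ones, 386 zeros), not on the actual test split, so the numbers are only a rough reference point.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Labels with the class counts found in step 3; the features are irrelevant to this baseline.
Y_all = np.array([1] * 1109 + [0] * 386)
X_dummy = np.zeros((len(Y_all), 1))

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline.fit(X_dummy, Y_all)

baseline_preds = baseline.predict(X_dummy)
baseline_accuracy = baseline.score(X_dummy, Y_all)
baseline_f1 = f1_score(Y_all, baseline_preds)

print("Baseline accuracy: {:.2f}%".format(baseline_accuracy * 100))
print("Baseline F1-score: {:.2f}".format(baseline_f1))
```

Any model in this notebook should be read against this reference rather than against 50%.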

Next, let's try some other out-of-the-box models.

### Decision Tree Model

In [27]:
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [41]:
tree_accuracy = tree_model.score(X_test, Y_test)
preds = tree_model.predict(X_test)
tree_f1 = f1_score(Y_test, preds)
print("Accuracy of decision tree model on test data: {:.2f}%".format(tree_accuracy*100))
print("F1-score for decision tree model on test data: {:.2f}".format(tree_f1))

Accuracy of decision tree model on test data: 61.87%
F1-score for decision tree model on test data: 0.74


### Random Forest Model

In [29]:
forest_model = RandomForestClassifier()
forest_model.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [42]:
forest_accuracy = forest_model.score(X_test, Y_test)
preds = forest_model.predict(X_test)
forest_f1 = f1_score(Y_test, preds)
print("Accuracy of random forest model on test data: {:.2f}%".format(forest_accuracy*100))
print("F1-score for random forest model on test data: {:.2f}".format(forest_f1))

Accuracy of random forest model on test data: 70.23%
F1-score for random forest model on test data: 0.81


### Gradient Boosting Model

In [32]:
gradient_model = GradientBoostingClassifier()
gradient_model.fit(X_train, Y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [43]:
gradient_accuracy = gradient_model.score(X_test, Y_test)
preds = gradient_model.predict(X_test)
gradient_f1 = f1_score(Y_test, preds)
print("Accuracy of gradient boosting model on test data: {:.2f}%".format(gradient_accuracy*100))
print("F1-score for gradient boosting model on test data: {:.2f}".format(gradient_f1))

Accuracy of gradient boosting model on test data: 74.25%
F1-score for gradient boosting model on test data: 0.84


### Ada Boosting Model

In [34]:
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, Y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [39]:
ada_accuracy = ada_model.score(X_test, Y_test)
preds = ada_model.predict(X_test)
ada_f1 = f1_score(Y_test, preds)
print("Accuracy of ada boosted model on test data: {:.2f}%".format(ada_accuracy*100))
print("F1-score for ada boosted model on test data: {:.2f}".format(ada_f1))

Accuracy of ada boosted model on test data: 70.90%
F1-score for ada boosted model on test data: 0.81


<h3>Summary of Basic Models</h3>

<br>

<table align="left" style="width:30%">
    <tr>
        <th>Model Type</th>
        <th>Accuracy</th>
        <th>F1-Score</th>
    </tr>
    <tr>
        <td>Logistic Regression</td>
        <td>74.25%</td>
        <td>0.84</td>
    </tr>
    <tr>
        <td>Decision Tree</td>
        <td>61.87%</td>
        <td>0.74</td>
    </tr>
    <tr>
        <td>Random Forest</td>
        <td>70.23%</td>
        <td>0.81</td>
    </tr>
    <tr>
        <td>Gradient Boosting</td>
        <td>74.25%</td>
        <td>0.84</td>
    </tr>
    <tr>
        <td>Ada Boosting</td>
        <td>70.90%</td>
        <td>0.81</td>
    </tr>
</table>

<br>

<p style="padding-left:350px;"> Logistic regression and gradient boosting both reached 74.25% accuracy without any hyperparameter tuning. Because logistic regression worked so well, I'm curious how a small feed-forward neural network will fare on this dataset. </p>
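
The tree-based models above were created without a fixed `random_state`, so their scores can shift between runs. One way to get steadier comparisons is to fix the seed and average over cross-validation folds. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, so its numbers say nothing about the League of Legends dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (roughly 74% positives); not the real dataset.
X_syn, Y_syn = make_classification(n_samples=300, n_features=10,
                                   weights=[0.26, 0.74], random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

cv_results = {}
for name, estimator in models.items():
    # 5-fold cross-validated accuracy; mean and spread are more stable than one split.
    scores = cross_val_score(estimator, X_syn, Y_syn, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print("{}: {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```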

### Neural Network Model

In [45]:
# transpose train/test set to prepare for neural network
X_train_network = X_train.T
X_test_network = X_test.T
Y_train_network = Y_train.reshape(1, Y_train.shape[0])
Y_test_network = Y_test.reshape(1, Y_test.shape[0])

In [46]:
def create_placeholders(n_x, n_y):

    X = tf.placeholder(tf.float32, shape=[n_x, None], name="X_data")
    Y = tf.placeholder(tf.float32, shape=[n_y, None], name="Y_data")
    
    return X, Y

In [47]:
def initialize_parameters():
    
    tf.set_random_seed(1)                   
        
    W1 = tf.get_variable("W1", [25,115], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [12,25], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [1,12], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b3 = tf.get_variable("b3", [1,1], initializer = tf.zeros_initializer())

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    
    return parameters

In [48]:
def forward_propagation(X, parameters):
    
    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']
    

    Z1 = tf.add(tf.matmul(W1,X),b1)                        
    A1 = tf.nn.relu(Z1)                                   
    Z2 = tf.add(tf.matmul(W2,A1),b2)                       
    A2 = tf.nn.relu(Z2)                                    
    Z3 = tf.add(tf.matmul(W3,A2),b3)                       

    
    return Z3
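
To check the layer shapes used above (115 inputs, then 25, 12 and 1 units), the same forward pass can be sketched in plain NumPy. The weights and inputs here are random placeholders rather than trained parameters, and examples are stored as columns, as in the TensorFlow version.

```python
import numpy as np

rng = np.random.RandomState(1)
m = 4                                  # number of toy examples (columns)
X_toy = rng.rand(115, m)               # placeholder inputs, shape (features, examples)
layer_sizes = [115, 25, 12, 1]

A = X_toy
shapes = []
for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
    W = rng.randn(n_out, n_in) * 0.01  # random placeholder weights
    b = np.zeros((n_out, 1))           # bias broadcasts across the example columns
    Z = W @ A + b
    shapes.append(Z.shape)
    is_last = i == len(layer_sizes) - 2
    A = Z if is_last else np.maximum(Z, 0)  # ReLU on hidden layers, raw logits at the output

print(shapes)
```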

In [49]:
def compute_cost(Z3, Y):

    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)
    
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits,labels=labels))

    
    return cost
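
For intuition about what the cost function computes, here is a NumPy sketch of mean sigmoid cross-entropy from raw logits. It uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` and compares it to the textbook formula on a few invented values. The function names are hypothetical helpers for this sketch, not part of TensorFlow.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean binary cross-entropy from raw logits, in a numerically stable form."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    per_example = np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return per_example.mean()

def naive_cross_entropy(logits, labels):
    """Textbook form: -[z*log(p) + (1-z)*log(1-p)] with p = sigmoid(x)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    z = np.asarray(labels, dtype=float)
    return (-(z * np.log(p) + (1 - z) * np.log(1 - p))).mean()

toy_logits = [2.0, -1.0, 0.5]   # invented values
toy_labels = [1.0, 0.0, 1.0]

stable_cost = sigmoid_cross_entropy(toy_logits, toy_labels)
naive_cost = naive_cross_entropy(toy_logits, toy_labels)
print(stable_cost, naive_cost)
```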

In [50]:
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):

    np.random.seed(seed)            
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) 
    for k in range(0, num_complete_minibatches):

        mini_batch_X = shuffled_X[:, k*mini_batch_size:(k+1)*mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k*mini_batch_size:(k+1)*mini_batch_size]

        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:

        mini_batch_X = shuffled_X[:, (num_complete_minibatches)*mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, (num_complete_minibatches)*mini_batch_size:]

        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches
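
As a quick sanity check of the partitioning logic, the sketch below splits the indices of 1196 examples (the training set size from step 5) into batches of 32 (the default used by `model` below). It is a compact index-based illustration, not a replacement for the function above.

```python
import numpy as np

m, batch_size = 1196, 32

# Shuffle the example indices, then slice them into consecutive batches.
perm = np.random.RandomState(0).permutation(m)
batches = [perm[start:start + batch_size] for start in range(0, m, batch_size)]

num_full = m // batch_size          # complete batches
last_size = len(batches[-1])        # size of the final, smaller batch

print(len(batches), num_full, last_size)
```

With these sizes there is one partial batch in addition to the complete ones, which matters when averaging the per-batch cost over an epoch.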

In [51]:
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
          num_epochs = 1500, minibatch_size = 32, print_cost = True):

    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep consistent results
    seed = 3                                          # to keep consistent results
    (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0]                            # n_y : output size
    costs = []                                        # To keep track of the cost
    

    X, Y = create_placeholders(n_x,n_y)


    # Initialize parameters
    parameters = initialize_parameters()

    
    # Forward propagation: Build the forward propagation in the tensorflow graph
    Z3 = forward_propagation(X, parameters)
    
    # Cost function: Add cost function to tensorflow graph
    cost = compute_cost(Z3, Y)
    
    # Backpropagation: Define the tensorflow optimizer (plain gradient descent).
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)
    
    # Initialize all the variables
    init = tf.global_variables_initializer()

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:
        
        # Run the initialization
        sess.run(init)
        
        # Do the training loop
        for epoch in range(num_epochs):

            epoch_cost = 0.                       # Defines a cost related to an epoch
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)
            num_minibatches = len(minibatches)    # counts the final, smaller minibatch too

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                
                # IMPORTANT: The line that runs the graph on a minibatch.
                # Run the session to execute the "optimizer" and the "cost", the feedict should contain a minibatch for (X,Y).

                _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
                
                epoch_cost += minibatch_cost / num_minibatches

            # Print the cost every 100 epochs and record it every 5 epochs
            if print_cost and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost and epoch % 5 == 0:
                costs.append(epoch_cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs (per fives)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # lets save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")

        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
        
        return parameters

In [None]:
parameters = model(X_train_network, Y_train_network, X_test_network, Y_test_network, num_epochs=101)

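In essentially no time, this simple neural network reports perfect accuracy on both the training set and the test set. That result is almost certainly an artifact of the accuracy check rather than real performance: `Z3` and `Y` each have a single row, so `tf.argmax` (which reduces along axis 0 by default) returns 0 for every example on both sides, and the comparison is always true. I'll need to fix that check before trusting any accuracy number from this network.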
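
For a network with a single sigmoid output, accuracy is usually computed by thresholding the predicted probability rather than by taking an argmax. The NumPy sketch below contrasts the two on invented logits. The 0.5 threshold is an assumption, not something chosen in this notebook.

```python
import numpy as np

# Invented logits and labels with shape (1, m), matching the network's output layout.
Z3_toy = np.array([[2.0, -1.5, 0.3, -0.2]])
Y_toy = np.array([[1.0, 0.0, 0.0, 1.0]])

# argmax along axis 0 of a one-row array is 0 for every column, so this is always 100%.
argmax_accuracy = np.mean(np.argmax(Z3_toy, axis=0) == np.argmax(Y_toy, axis=0))

# Thresholding the sigmoid output gives a meaningful per-example prediction.
probabilities = 1.0 / (1.0 + np.exp(-Z3_toy))
predictions = (probabilities > 0.5).astype(float)
threshold_accuracy = np.mean(predictions == Y_toy)

print("argmax-based accuracy:", argmax_accuracy)
print("threshold-based accuracy:", threshold_accuracy)
```

In the TensorFlow graph above, the equivalent check would presumably be something like `tf.equal(tf.cast(tf.sigmoid(Z3) > 0.5, tf.float32), Y)`, though that has not been run against this notebook.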

***

<div align="right">
    <a href="#toc">back to top</a>
</div>

## That's it!

We've seen that it is possible to predict, at roughly 74% test accuracy with out-of-the-box models, whether a new League of Legends player will get through their tutorial matches (reach summoner level 3) or not, based only on the gameplay data of that player's first match!

***