<div class="alert alert-success alertinfo" style="margin-top: 0px">
<h1>  Part 2. Data Modelling </h1>    
</div>

# 1. Imports

In [1]:
# Turning off warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Data Manipulation
import sys
import random
import pandas as pd
import numpy as np

# Visualization 
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
plt.style.use('seaborn-whitegrid')

# Import data
clean_data=pd.read_csv(r"C:\Users\giuse\Desktop\job seeking\DATAscience_interview\clean_data.csv")

# 2. Data View

In [3]:
clean_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,BET_AMOUNT_IN_CURRENCY,BET_AMOUNT_IN_EUR,BET_COUNT
0,1,BTC,Y,Fire Lightning,slots,enigmatic,5e-05,0.472637,3.0
1,1,BTC,Y,Slotomon Go,slots,enigmatic,1.5e-05,0.151416,1.0
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,0.001485,5.057532,33.0
3,5,BTC,Y,Bac Agin,card,asiagaming,0.024,75.11976,1.0
4,5,BTC,Y,Book of Lords,slots,amatic,0.0002,0.683304,1.0


In [4]:
# Data info
print('There are {} rows and {} columns.'.format(clean_data.shape[0],clean_data.shape[1]))
print('          {} unique users.'.format(len(clean_data['USER_ID'].unique())))
print('          {} unique game titles.'.format(len(clean_data['GAME_TITLE'].unique())))

There are 984359 rows and 9 columns.
          39326 unique users.
          2569 unique game titles.


# 3. Data Modelling

In [5]:
# note: this binds a second name to the same DataFrame; no copy is made
model_data = clean_data

## 3a Investigating bets and creating implicit feedback

<font size="4">
    Recommendation systems are a great revenue driver, but they are tricky to implement for iGaming. Normally, when building a recommender system, we would take explicit feedback into account. Explicit feedback is data where we have some sort of rating, like the 1 to 5 ratings from the Netflix dataset. There we know how much a user likes or dislikes an item: with star ratings, a 1 means the user did not like the item and a 5 means they really loved it. With iGaming we have no such rating: a user who played a game might have hated it, loved it, or felt somewhere in between. If they did not play a game, it might be because they don’t like it, or they might love it if they just knew about it.</font> 
<br></br>
<br></br>
<font size="4"> 
    So here we are dealing with implicit feedback, which is vaguer and harder to interpret. We will focus on what we do know, namely bet count and bet amount, and on the confidence these give us about whether a player likes a given game. We assume higher confidence if a player has placed many bets on a game rather than just one, and we also give more weight to the feedback if the player has placed large bets. </font>  

In [6]:
def outliers_info(bounds_calculated_on, outlier_search_in, feature): 
    
    # defining bounds
    global lower_bound, upper_bound
    mean = bounds_calculated_on[feature].mean()       
    std = bounds_calculated_on[feature].std()           
    lower_bound = mean - (3 * std)
    upper_bound = mean + (3 * std)
    
    # searching for outliers     
    index_list_lower = outlier_search_in[(outlier_search_in[feature] <  lower_bound)].index
    index_list_upper = outlier_search_in[(outlier_search_in[feature] >  upper_bound)].index
    how_many_outliers_in_lower = len(index_list_lower)
    how_many_outliers_in_upper = len(index_list_upper)
    
    # creating a global list of outlier indexes
    global outlier_indices
    outlier_indices =[]
    outlier_indices.extend(index_list_lower)
    outlier_indices.extend(index_list_upper)
    
    # getting name of data frame as a string
    global x
    name =[x for x in globals() if globals()[x] is outlier_search_in][0]    
      
    print("Outliers detected in {} for feature {}:".format(name,feature))
    print("_______________________________________________________________")    
    print("The lower bound value is {}. Number of outliers found in lower end: {}".format(lower_bound,how_many_outliers_in_lower))
    print("_______________________________________________________________")      
    print("The upper bound value is {}. Number of outliers found in upper end: {}".format(upper_bound,how_many_outliers_in_upper))
    print("_______________________________________________________________")
    print("The mean {} is {}".format(feature,mean))
    print("_______________________________________________________________")
    print("The standard deviation of {} is {}".format(feature,std))
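
The same three-sigma rule can be written without global variables. The sketch below is only an illustration: the helper name `three_sigma_outliers` and the toy series are assumptions, not part of the original notebook. It returns the bounds and the outlier index instead of storing them globally.

```python
import pandas as pd

def three_sigma_outliers(reference: pd.Series, target: pd.Series):
    """Return (lower, upper, outlier_index) using mean +/- 3 std of `reference`."""
    mean = reference.mean()
    std = reference.std()  # pandas default is the sample standard deviation (ddof=1)
    lower, upper = mean - 3 * std, mean + 3 * std
    # rows of `target` that fall outside the bounds
    mask = (target < lower) | (target > upper)
    return lower, upper, target.index[mask]

# toy data (hypothetical): twenty small values and one extreme value
toy = pd.Series([1.0] * 20 + [1000.0])
lower, upper, outlier_idx = three_sigma_outliers(toy, toy)
print(lower, upper, list(outlier_idx))
```

Returning the values keeps each call independent, so the bounds computed for one feature cannot be overwritten by a later call on another feature.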
    

### Bet count

In [7]:
feature = 'BET_COUNT'
new_feature = 'FEEDBACK_bet_count'
outliers_info(model_data, model_data, feature)

Outliers detected in clean_data for feature BET_COUNT:
_______________________________________________________________
The lower bound value is -10038.520338424467. Number of outliers found in lower end: 0
_______________________________________________________________
The upper bound value is 10930.158289619103. Number of outliers found in upper end: 5095
_______________________________________________________________
The mean BET_COUNT is 445.81897559731766
_______________________________________________________________
The standard deviation of BET_COUNT is 3494.779771340595


In [8]:
average = model_data[feature].mean()
plus_one = model_data[feature].mean() + model_data[feature].std()
plus_two = model_data[feature].mean() + (2*model_data[feature].std())
plus_three = model_data[feature].mean() + (3*model_data[feature].std())

In [9]:
model_data[[new_feature]] = model_data[[feature]]

def feedback(row):           
    i = row[new_feature]
    if i > plus_three:
        return 5
    elif i > plus_two:
        return 4 
    elif i > plus_one:
        return 3      
    elif i > average:
        return 2
    elif i > 0.5*average:
        return 1
    elif i > 0:
        return -1   
    else:
        return 0

model_data[new_feature] = model_data.apply(feedback, axis=1)
model_data = model_data.drop(feature,axis=1)
model_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,BET_AMOUNT_IN_CURRENCY,BET_AMOUNT_IN_EUR,FEEDBACK_bet_count
0,1,BTC,Y,Fire Lightning,slots,enigmatic,5e-05,0.472637,-1
1,1,BTC,Y,Slotomon Go,slots,enigmatic,1.5e-05,0.151416,-1
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,0.001485,5.057532,-1
3,5,BTC,Y,Bac Agin,card,asiagaming,0.024,75.11976,-1
4,5,BTC,Y,Book of Lords,slots,amatic,0.0002,0.683304,-1
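
Applying a Python function row by row can be slow on roughly a million rows. As a possible alternative, the sketch below reproduces the same thresholds with `numpy.select`, which picks the first condition that holds for each element. The function name `feedback_scores` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def feedback_scores(values: pd.Series) -> pd.Series:
    """Map raw values to feedback scores with the thresholds used in feedback()."""
    mean, std = values.mean(), values.std()
    v = values.to_numpy()
    # conditions are checked in order; the first match wins
    conditions = [
        v > mean + 3 * std,
        v > mean + 2 * std,
        v > mean + std,
        v > mean,
        v > 0.5 * mean,
        v > 0,
    ]
    choices = [5, 4, 3, 2, 1, -1]
    # anything that is not greater than 0 (including NaN) gets 0
    return pd.Series(np.select(conditions, choices, default=0), index=values.index)

# toy data (hypothetical)
toy_values = pd.Series([0.0, 1.0, 6.0, 11.0])
toy_scores = feedback_scores(toy_values)
print(toy_scores.tolist())
```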


### Bet amount in Euro

In [10]:
feature = 'BET_AMOUNT_IN_EUR'
new_feature = 'FEEDBACK_bet_amount_euro'
outliers_info(model_data, model_data, feature)

Outliers detected in model_data for feature BET_AMOUNT_IN_EUR:
_______________________________________________________________
The lower bound value is -113720.23712941381. Number of outliers found in lower end: 0
_______________________________________________________________
The upper bound value is 115498.04405198811. Number of outliers found in upper end: 755
_______________________________________________________________
The mean BET_AMOUNT_IN_EUR is 888.9034612871523
_______________________________________________________________
The standard deviation of BET_AMOUNT_IN_EUR is 38203.04686356699


In [11]:
average = model_data[feature].mean()
plus_one = model_data[feature].mean() + model_data[feature].std()
plus_two = model_data[feature].mean() + (2*model_data[feature].std())
plus_three = model_data[feature].mean() + (3*model_data[feature].std())

In [12]:
model_data[[new_feature]] = model_data[[feature]]
model_data[new_feature] = model_data.apply(feedback, axis=1)
model_data = model_data.drop(feature,axis=1)
model_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,BET_AMOUNT_IN_CURRENCY,FEEDBACK_bet_count,FEEDBACK_bet_amount_euro
0,1,BTC,Y,Fire Lightning,slots,enigmatic,5e-05,-1,-1
1,1,BTC,Y,Slotomon Go,slots,enigmatic,1.5e-05,-1,-1
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,0.001485,-1,-1
3,5,BTC,Y,Bac Agin,card,asiagaming,0.024,-1,-1
4,5,BTC,Y,Book of Lords,slots,amatic,0.0002,-1,-1


### Bet amount in Currency

In [13]:
feature = 'BET_AMOUNT_IN_CURRENCY'
new_feature = 'FEEDBACK_bet_amount_currency'
outliers_info(model_data, model_data, feature)

Outliers detected in model_data for feature BET_AMOUNT_IN_CURRENCY:
_______________________________________________________________
The lower bound value is -490279.8659445058. Number of outliers found in lower end: 0
_______________________________________________________________
The upper bound value is 499041.42242295505. Number of outliers found in upper end: 969
_______________________________________________________________
The mean BET_AMOUNT_IN_CURRENCY is 4380.778239224621
_______________________________________________________________
The standard deviation of BET_AMOUNT_IN_CURRENCY is 164886.8813945768


In [14]:
average = model_data[feature].mean()
plus_one = model_data[feature].mean() + model_data[feature].std()
plus_two = model_data[feature].mean() + (2*model_data[feature].std())
plus_three = model_data[feature].mean() + (3*model_data[feature].std())

In [15]:
model_data[[new_feature]] = model_data[[feature]]
model_data[new_feature] = model_data.apply(feedback, axis=1)
model_data = model_data.drop(feature,axis=1)
model_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,FEEDBACK_bet_count,FEEDBACK_bet_amount_euro,FEEDBACK_bet_amount_currency
0,1,BTC,Y,Fire Lightning,slots,enigmatic,-1,-1,-1
1,1,BTC,Y,Slotomon Go,slots,enigmatic,-1,-1,-1
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,-1,-1,-1
3,5,BTC,Y,Bac Agin,card,asiagaming,-1,-1,-1
4,5,BTC,Y,Book of Lords,slots,amatic,-1,-1,-1


In [16]:
model_data.shape

(984359, 9)

In [17]:
# model_data[['FEEDBACK']] = model_data[['FEEDBACK_bet_count']]

def feedback(row):
    bet_count = row['FEEDBACK_bet_count']
    euro = row['FEEDBACK_bet_amount_euro']
    currency = row ['FEEDBACK_bet_amount_currency']
    row['FEEDBACK'] = max([bet_count, euro, currency])
    return row
model_data_df = model_data.apply(feedback, axis=1)
model_data = model_data_df
model_data_df.tail()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,FEEDBACK_bet_count,FEEDBACK_bet_amount_euro,FEEDBACK_bet_amount_currency,FEEDBACK
984354,2583592,AUD,N,Chinese Tigers,slots,platipus,-1,-1,-1,-1
984355,2583636,AUD,N,Mega Drago,slots,platipus,-1,-1,-1,-1
984356,2583636,AUD,N,Wolf Treasure,slots,igtech,-1,-1,-1,-1
984357,2583705,USD,N,40 Lucky Fruits,slots,spinomenal,-1,-1,-1,-1
984358,2583705,USD,N,Chinese Tigers,slots,platipus,-1,-1,-1,-1
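
The combined score is simply the row-wise maximum of the three feedback columns. A minimal vectorised sketch, using a small made-up frame in place of `model_data`, could look like this:

```python
import pandas as pd

feedback_cols = [
    'FEEDBACK_bet_count',
    'FEEDBACK_bet_amount_euro',
    'FEEDBACK_bet_amount_currency',
]

# toy data (hypothetical) standing in for model_data
toy = pd.DataFrame({
    'FEEDBACK_bet_count':           [-1, 2, 0],
    'FEEDBACK_bet_amount_euro':     [-1, 1, 3],
    'FEEDBACK_bet_amount_currency': [-1, 5, 1],
})

# row-wise maximum across the three columns
toy['FEEDBACK'] = toy[feedback_cols].max(axis=1)
print(toy['FEEDBACK'].tolist())
```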


## 3b Converting data into the format expected by the model

In [18]:
# Create a numeric user_id and game_id column

model_data['USER_ID'] = model_data['USER_ID'].astype("category")
model_data['GAME_TITLE'] = model_data['GAME_TITLE'].astype("category")
model_data['USER_CODE'] = model_data['USER_ID'].cat.codes
model_data['GAME_CODE'] = model_data['GAME_TITLE'].cat.codes
model_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,FEEDBACK_bet_count,FEEDBACK_bet_amount_euro,FEEDBACK_bet_amount_currency,FEEDBACK,USER_CODE,GAME_CODE
0,1,BTC,Y,Fire Lightning,slots,enigmatic,-1,-1,-1,-1,0,839
1,1,BTC,Y,Slotomon Go,slots,enigmatic,-1,-1,-1,-1,0,2027
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,-1,-1,-1,-1,1,189
3,5,BTC,Y,Bac Agin,card,asiagaming,-1,-1,-1,-1,1,198
4,5,BTC,Y,Book of Lords,slots,amatic,-1,-1,-1,-1,1,396
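
Once the model works with numeric codes, recommendations need to be translated back into game titles. The sketch below shows one way to build that lookup from the categorical column; the frame `toy` and the name `code_to_title` are assumptions for illustration.

```python
import pandas as pd

# toy data (hypothetical) standing in for model_data
toy = pd.DataFrame({
    'GAME_TITLE': ['Slotomon Go', 'Bac Agin', 'Slotomon Go', 'Book of Lords'],
})
toy['GAME_TITLE'] = toy['GAME_TITLE'].astype('category')
toy['GAME_CODE'] = toy['GAME_TITLE'].cat.codes

# code i corresponds to the i-th entry of .cat.categories
code_to_title = dict(enumerate(toy['GAME_TITLE'].cat.categories))
print(toy['GAME_CODE'].tolist(), code_to_title)
```

String categories are sorted when the column is converted, so the codes follow alphabetical order of the titles rather than order of appearance.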


## 3c Train/Test split

<font size="4">In data science we normally split the data into an 80% training set (for the model to learn from) and a 20% test set (for us to check the accuracy of our recommendations). Since our chosen approach is collaborative filtering, we need to be more careful about how we split the data. We have to make sure that every user and every game is represented in the training set, so we split the data using the following methodology.</font>

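- <font size="4"> For each user we randomly select one observation.</font>
- <font size="4"> For each game not yet covered by the previous step, we randomly select one observation from the remaining data.</font>
- <font size="4"> We merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the games.</font>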
- <font size="4"> We calculate how many observations we need to fill our training set up to 80% and we randomly select enough of the remaining observations to fill the training set.</font>
- <font size="4"> The remaining observations are our testing set.</font>  
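
The methodology above can be sketched on a toy frame as follows. This is an illustration under assumptions, not the notebook's own implementation: the function name `coverage_split`, its parameters and the toy data are made up, and it relies on the frame having a unique index.

```python
import pandas as pd

def coverage_split(df, user_col, item_col, train_frac=0.8, seed=0):
    """Split df so that every user and every item appears in the training set."""
    # one random observation per user
    per_user = df.groupby(user_col).sample(n=1, random_state=seed)

    # one random observation for each item that is not covered yet
    remaining = df.drop(per_user.index)
    missing = remaining[~remaining[item_col].isin(per_user[item_col])]
    if not missing.empty:
        per_item = missing.groupby(item_col).sample(n=1, random_state=seed)
    else:
        per_item = missing
    train = pd.concat([per_user, per_item])

    # top up at random until the training set reaches train_frac of the data
    remaining = df.drop(train.index)
    n_extra = max(int(train_frac * len(df)) - len(train), 0)
    train = pd.concat([train, remaining.sample(n=n_extra, random_state=seed)])

    # everything that is left becomes the test set
    test = df.drop(train.index)
    return train, test

# toy data (hypothetical): 3 users, 5 games, 10 observations
toy = pd.DataFrame({
    'USER_ID':    [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'GAME_TITLE': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'a', 'b', 'e'],
})
train, test = coverage_split(toy, 'USER_ID', 'GAME_TITLE')
print(len(train), len(test))
```

Dropping rows by index avoids having to compare whole rows to tell the two sets apart, which is why a unique index is required here.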

In [19]:
print('We need {} observations for our training set.'.format(int(0.8*model_data.shape[0])))
print('We need {} observations for our test set.'.format(model_data.shape[0] - int(0.8*model_data.shape[0])))
print('We should have {} unique users'.format(model_data['USER_ID'].nunique()))
print('We should have {} unique games'.format(model_data['GAME_TITLE'].nunique()))

We need 787487 observations for our training set.
We need 196872 observations for our test set.
We should have 39326 unique users
We should have 2569 unique games


In [20]:
# storing the index as a column makes every row distinct,
# so drop_duplicates(keep=False) below removes exactly the rows already in train
model_data['unique'] = model_data.index

# selecting one random observation for each user
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
users = list(np.sort(model_data.USER_ID.unique()))
train = pd.concat([model_data[model_data.USER_ID == i].sample(1) for i in users])
test = pd.concat([model_data, train]).drop_duplicates(keep=False)

# from the remaining data, selecting one random observation for each game that hasn't been selected
list1 = train['GAME_TITLE'].unique()
list2 = model_data['GAME_TITLE'].unique()
list3 = list(set(list2) - set(list1))
train = pd.concat([train] + [test[test.GAME_TITLE == i].sample(1) for i in list3])
test = pd.concat([model_data, train]).drop_duplicates(keep=False)

# defining how many more observations we need to make up 80% of the data
additional = int(0.8*model_data.shape[0]) - train.shape[0]

# adding additional observations at random
train = pd.concat([train, test.sample(additional)])
test = pd.concat([model_data, train]).drop_duplicates(keep=False)

# checking the size
print('We have {} observations in our training set'.format(train.shape[0]))
print('We have {} observations in our test set'.format(test.shape[0]))
print('We have {} unique users'.format(train['USER_ID'].nunique()))
print('We have {} unique games'.format(train['GAME_TITLE'].nunique()))

We have 787487 observations in our training set
We have 196872 observations in our test set
We have 39326 unique users
We have 2569 unique games


In [21]:
# Saving data
model_data.to_csv('model_data.csv', index=False)
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)