<div class="alert alert-success alertinfo" style="margin-top: 0px">
<h1>  Part 2. Train/test split </h1>
</div>

# 1. Imports

In [1]:
# Turning off warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Data Manipulation
import sys
import random
import pandas as pd
import numpy as np

# Visualization 
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')  # this style was named 'seaborn-whitegrid' before matplotlib 3.6

# Import data
clean_data=pd.read_csv(r"C:\Users\giuse\Desktop\job seeking\DATAscience_interview\clean_data.csv")

# 2. Data View

In [3]:
clean_data.head()

Unnamed: 0,USER_ID,CURRENCY,CURRENCY_IS_CRYPTO,GAME_TITLE,GAME_TYPE,GAME_PROVIDER,BET_AMOUNT_IN_CURRENCY,BET_AMOUNT_IN_EUR,BET_COUNT
0,1,BTC,Y,Fire Lightning,slots,enigmatic,5e-05,0.472637,3.0
1,1,BTC,Y,Slotomon Go,slots,enigmatic,1.5e-05,0.151416,1.0
2,5,BTC,Y,Aztec Magic Deluxe,slots,enigmatic,0.001485,5.057532,33.0
3,5,BTC,Y,Bac Agin,card,asiagaming,0.024,75.11976,1.0
4,5,BTC,Y,Book of Lords,slots,amatic,0.0002,0.683304,1.0


In [4]:
# Data info
print('There are {} rows and {} columns.'.format(clean_data.shape[0],clean_data.shape[1]))
print('          {} unique users.'.format(len(clean_data['USER_ID'].unique())))
print('          {} unique game titles.'.format(len(clean_data['GAME_TITLE'].unique())))

There are 984359 rows and 9 columns.
          39326 unique users.
          2569 unique game titles.


# 3. Train/Test split

<font size="4">A common convention in data science is to split the data into an 80% training set (for the model to learn from) and a 20% test set (for us to check the accuracy of our recommendations). Since our chosen approach is collaborative filtering, we need to be more careful about how we split the data. Every user and every game must be represented in the training set, since a collaborative filtering model generally cannot score users or games it never saw during training. For that reason we split the data using the following methodology.</font>

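- <font size="4"> For each user we randomly select one observation.</font>
- <font size="4"> From the remaining data, for each game that has not been selected yet, we randomly select one observation.</font>
- <font size="4"> We merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the games.</font>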
- <font size="4"> We calculate how many observations we need to fill our training set up to 80% and we randomly select enough of the remaining observations to fill the training set.</font>
- <font size="4"> The remaining observations are our testing set.</font>  
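
<font size="4">The steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the helper name <code>coverage_split</code>, its parameters and the toy data are hypothetical, and the frame is assumed to have a unique index.</font>

```python
import pandas as pd


def coverage_split(df, user_col, item_col, train_frac=0.8, seed=0):
    """Split df so that every user and every item has at least one training row.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # Step 1: one random row per user
    train = df.groupby(user_col).sample(n=1, random_state=seed)
    rest = df.drop(train.index)

    # Step 2: one random row for each item that step 1 did not cover
    missing = set(df[item_col]) - set(train[item_col])
    if missing:
        item_rows = (rest[rest[item_col].isin(missing)]
                     .groupby(item_col)
                     .sample(n=1, random_state=seed))
        train = pd.concat([train, item_rows])
        rest = df.drop(train.index)

    # Step 3: top up at random until the training set reaches train_frac
    additional = max(int(train_frac * len(df)) - len(train), 0)
    train = pd.concat([train, rest.sample(n=additional, random_state=seed)])

    # Step 4: everything else is the test set
    test = df.drop(train.index)
    return train, test


# Toy data (made up for this example): 4 users, 5 games, 10 rows
toy = pd.DataFrame({
    'USER_ID':    [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'GAME_TITLE': ['A', 'B', 'C', 'A', 'D', 'B', 'E', 'A', 'C', 'A'],
})

toy_train, toy_test = coverage_split(toy, 'USER_ID', 'GAME_TITLE')
print('train rows:', len(toy_train), 'test rows:', len(toy_test))
print('users in train:', toy_train['USER_ID'].nunique())
print('games in train:', toy_train['GAME_TITLE'].nunique())
```

<font size="4">Passing a fixed <code>seed</code> makes the toy split repeatable. If the coverage rows alone already exceed the target fraction, this sketch simply skips the top-up step.</font>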

In [5]:
print('We need {} observations for our training set.'.format(int(0.8*clean_data.shape[0])))
print('We need {} observations for our test set.'.format(clean_data.shape[0] - int(0.8*clean_data.shape[0])))
print('We should have {} unique users'.format(clean_data['USER_ID'].nunique()))
print('We should have {} unique games'.format(clean_data['GAME_TITLE'].nunique()))

We need 787487 observations for our training set.
We need 196872 observations for our test set.
We should have 39326 unique users
We should have 2569 unique games


In [6]:
# keeping the original row index as a column so that every row is distinct
clean_data['unique']=clean_data.index

# selecting one random observation for each user
train = clean_data.groupby('USER_ID').sample(n=1)
test = clean_data.drop(train.index)

# from the remaining data, selecting one random observation for each game that hasn't been selected
missing_games = set(clean_data['GAME_TITLE'].unique()) - set(train['GAME_TITLE'].unique())
if missing_games:
    game_rows = test[test['GAME_TITLE'].isin(missing_games)].groupby('GAME_TITLE').sample(n=1)
    train = pd.concat([train, game_rows])
    test = clean_data.drop(train.index)

# defining how many more observations we need to make up 80% of the data
additional = int(0.8*clean_data.shape[0]) - train.shape[0]

# adding additional observations at random
train = pd.concat([train, test.sample(additional)])
test = clean_data.drop(train.index)

# checking the size
print('We have {} observations in our training set'.format(train.shape[0]))
print('We have {} observations in our test set'.format(test.shape[0]))
print('We have {} unique users'.format(train['USER_ID'].nunique()))
print('We have {} unique games'.format(train['GAME_TITLE'].nunique()))

We have 787487 observations in our training set
We have 196872 observations in our test set
We have 39326 unique users
We have 2569 unique games
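
<font size="4">The counts above confirm the sizes and the coverage of the training set. A few extra checks can confirm that the two sets do not overlap and that every user and game in the test set also appears in the training set. The sketch below is a hypothetical helper (<code>check_split</code> is not part of the notebook) shown on made-up data. It assumes both sets keep the row index of the full frame, as they do while they are still in memory.</font>

```python
import pandas as pd


def check_split(full, train, test, user_col, item_col):
    """Return basic sanity checks for a train/test split (hypothetical helper)."""
    return {
        # no row index appears in both sets
        'disjoint': train.index.intersection(test.index).empty,
        # the two sets together contain as many rows as the full data
        'complete': len(train) + len(test) == len(full),
        # every user and item seen in test also appears in train
        'users_covered': set(test[user_col]) <= set(train[user_col]),
        'items_covered': set(test[item_col]) <= set(train[item_col]),
    }


# Toy data (made up for this example)
toy = pd.DataFrame({
    'USER_ID':    [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'GAME_TITLE': ['A', 'B', 'C', 'A', 'D', 'B', 'E', 'A', 'C', 'A'],
})

# A split that covers every user and game in the training set
good_train = toy.loc[[0, 1, 2, 3, 4, 5, 6, 8]]
good_test = toy.drop(good_train.index)
good = check_split(toy, good_train, good_test, 'USER_ID', 'GAME_TITLE')
print(good)

# A split that leaves users 2, 3, 4 and games D, E out of the training set
bad_train = toy.loc[[0, 1, 2]]
bad_test = toy.drop(bad_train.index)
bad = check_split(toy, bad_train, bad_test, 'USER_ID', 'GAME_TITLE')
print(bad)
```

<font size="4">On the real data the call would be <code>check_split(clean_data, train, test, 'USER_ID', 'GAME_TITLE')</code>, run before the sets are written to disk.</font>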


In [7]:
# Saving data
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)