# Verifying [*gamba*](https://github.com/gamba-dev/gamba)'s LaBrie et al 2008 Replication
In [this notebook]() we reproduced LaBrie et al.'s 2008 analysis of casino gamblers. The figures [*gamba*](https://github.com/gamba-dev/gamba) produces match those of the original paper; nevertheless, it is good to confirm that our calculated measures also match the original authors'.

To do this, [download their Analysis Dataset as text](http://www.thetransparencyproject.org/download_index.php) and extract it to the same folder as this notebook. You should then see the file **AnalyticDataSet_VirtualCasinoTXT.txt**.

Now run the following cell, which gives the columns more intuitive names and displays only those needed for the replication:

In [1]:
# load the original tab-separated analytic data set and give it descriptive column names
import pandas as pd
analytics_set = pd.read_csv('AnalyticDataSet_VirtualCasinoTXT.txt', delimiter='\t')
#print('original columns:', list(analytics_set.columns))
analytics_set.columns = ['player', #
                         'country','language_code','gender','registration_date',
                         'age_at_registration','first_action_date','first_deposit_date',
                         'total_wagered', #
                         'total_payouts',
                         'num_bets', #
                         'first_bet',
                         'last_bet',
                         'active_days',
                         'duration', #
                         'net_loss', #
                         'average_bet_size', #
                         'frequency', #
                         'in_top5_total_bet',
                         'in_top5_net_loss']
#print('better columns:', list(analytics_set.columns))

# average bets per day and percent loss are missing - these are the calculations from the paper (see Measures section)
analytics_set['average_bets_per_day'] = analytics_set['num_bets'] / analytics_set['active_days']
analytics_set['percent_loss'] = analytics_set['net_loss'] / analytics_set['total_wagered'] * 100

useful_data = analytics_set[['player','duration','frequency','num_bets','average_bets_per_day','average_bet_size','total_wagered','net_loss','percent_loss']]

display(useful_data.head(3))

Unnamed: 0,player,duration,frequency,num_bets,average_bets_per_day,average_bet_size,total_wagered,net_loss,percent_loss
0,1324364,173,2.890173,1430,286.0,1.524825,2180.5,102.5,4.700757
1,1324369,677,8.271787,1738,31.035714,5.478136,9521.0,315.0,3.308476
2,1324377,580,0.862069,7,1.4,8.428571,59.0,31.0,52.542373
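
As a quick sanity check on the two derived columns, the values displayed for player 1324377 can be recomputed by hand. The sketch below copies the numbers from the row above. Because `active_days` is not among the displayed columns, it is recovered here by inverting the `average_bets_per_day` formula, which is an assumption made purely for illustration.

```python
# Values copied from the displayed row for player 1324377.
num_bets = 7
total_wagered = 59.0
net_loss = 31.0
average_bets_per_day = 1.4

# percent_loss as computed in the cell above: net loss as a share of total wagered.
percent_loss = net_loss / total_wagered * 100

# active_days is not displayed, so infer it by inverting
# average_bets_per_day = num_bets / active_days (illustrative assumption).
implied_active_days = round(num_bets / average_bets_per_day)

print(round(percent_loss, 6))
print(implied_active_days)
```

The recomputed percent loss should agree with the `percent_loss` column to the six decimal places shown, and the implied number of active days should be a whole number.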


With the original analytics set loaded, let's compare a player's measures against those calculated in our replication notebook. The same comparison can be repeated for any player in the set. Matching values show that our code computes the same measures as LaBrie et al.'s, which verifies the replication beyond the figures being identical.

In [2]:
player = 1324377

labrie_measures = pd.read_csv('gamba_labrie_measures.csv')
display(labrie_measures[labrie_measures['player'] == player])
display(useful_data[useful_data['player'] == player])

Unnamed: 0,player,duration,frequency,num_bets,average_bets_per_day,average_bet_size,total_wagered,net_loss,percent_loss
3993,1324377,580,0.862069,7,1.4,8.428571,59.0,31.0,52.542373


Unnamed: 0,player,duration,frequency,num_bets,average_bets_per_day,average_bet_size,total_wagered,net_loss,percent_loss
2,1324377,580,0.862069,7,1.4,8.428571,59.0,31.0,52.542373
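
Checking players one at a time can be extended to the whole set. The following is a minimal sketch, not part of the original notebook: the helper name `find_mismatches`, the tolerance `rtol`, and the one-row demonstration frames (built from the values displayed above) are all assumptions made for illustration. It joins the two tables on `player` and returns any rows where a measure differs beyond a small relative tolerance.

```python
import numpy as np
import pandas as pd

# Measures shared by both tables (column names as used in this notebook).
MEASURES = ['duration', 'frequency', 'num_bets', 'average_bets_per_day',
            'average_bet_size', 'total_wagered', 'net_loss', 'percent_loss']

def find_mismatches(ours, theirs, key='player', columns=MEASURES, rtol=1e-5):
    """Hypothetical helper: rows of the inner join where any measure differs."""
    merged = ours.merge(theirs, on=key, suffixes=('_ours', '_theirs'))
    differs = np.zeros(len(merged), dtype=bool)
    for col in columns:
        # np.isclose tolerates the rounding introduced by saving to CSV.
        differs |= ~np.isclose(merged[col + '_ours'], merged[col + '_theirs'], rtol=rtol)
    return merged[differs]

# Demonstration with the single row displayed above for player 1324377.
row = {'player': 1324377, 'duration': 580, 'frequency': 0.862069, 'num_bets': 7,
       'average_bets_per_day': 1.4, 'average_bet_size': 8.428571,
       'total_wagered': 59.0, 'net_loss': 31.0, 'percent_loss': 52.542373}
ours_demo = pd.DataFrame([row])
theirs_demo = pd.DataFrame([row])
print(len(find_mismatches(ours_demo, theirs_demo)))  # 0 mismatching rows
```

In the notebook itself, `find_mismatches(labrie_measures, useful_data)` would return an empty frame if every shared player agrees. Because the join is an inner join, players present in only one table are silently dropped, so it is worth comparing the row counts as well.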
