In [None]:
# default_exp labrie2007

# LaBrie et al 2007

> Full study replication

This notebook reproduces every table in LaBrie et al.'s 2007 paper on internet sports gambling activity. To get started, download the raw data from the transparency project's website. The data we need is **Raw Dataset 2 (text version)**, listed under the title 'Actual Internet Sports Gambling Activity: February 2005 through September 2005' towards the bottom of the page.

Once you've downloaded and extracted it, you should see a file called **RawDataIIUserDailyAggregation.txt** - copy this into the same directory as this notebook to begin.

- [Data Download (thetransparencyproject.org)](http://www.thetransparencyproject.org/download_index.php)
- [Original data description](http://www.thetransparencyproject.org/codebooks/Codebook_for_Actual_Internet_Sports_Gambling_Activity_from_February_2005_through_September_2005.pdf)
- [Original paper link](https://www.researchgate.net/profile/Howard_Shaffer/publication/6261653_Assessing_the_Playing_Field_A_Prospective_Longitudinal_Study_of_Internet_Sports_Gambling_Behavior/links/0fcfd50a6bc690d200000000/Assessing-the-Playing-Field-A-Prospective-Longitudinal-Study-of-Internet-Sports-Gambling-Behavior.pdf)

<img style='float:left; margin-right: 7px;' src='https://img.shields.io/badge/-FLAG-red?style=flat-square'> Minor discrepancy in the number of players taken forward after data cleaning. This has minor effects on the fixed-odds figures below.

In [None]:
import gamba as gb

The first step is to split the raw data into CSV files for each player, which lets us calculate measures on a per-player basis later on. For this particular study, the data must first be split by product ID (1 for fixed odds, 2 for live action betting)...

In [None]:
daily_data = gb.data.read_csv('RawDataIIUserDailyAggregation.txt')
daily_data[daily_data['ProductID'] == 1].to_csv('fixed_odds_daily.csv', index=False)
daily_data[daily_data['ProductID'] == 2].to_csv('live_action_daily.csv', index=False)

In [None]:
fo_data = gb.data.prepare_labrie_data('fixed_odds_daily.csv', year=2007)
la_data = gb.data.prepare_labrie_data('live_action_daily.csv', year=2007)

Next, calculate the behavioural measures used in the paper for each player: the **duration**, **frequency**, **number of bets**, **average bets per day**, **average bet size**, **total amount wagered**, **net loss**, and **percentage loss**.

In [None]:
gb.measures.calculate_labrie_measures(fo_data, filename='fo_labrie_measures.csv', loud=True)

100%|██████████| 42157/42157 [1:01:43<00:00, 11.38it/s]


LaBrie measures saved
fixed odd labrie measures saved.


In [None]:
gb.measures.calculate_labrie_measures(la_data, filename='la_labrie_measures.csv', loud=True)

100%|██████████| 26198/26198 [25:31<00:00, 17.10it/s]


LaBrie measures saved
live action labrie measures saved.
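
To make the eight measures more concrete, here is a minimal pandas sketch for a single player's daily aggregates. It is not gamba's implementation. The column names `date`, `bets`, `stake` and `winnings` are hypothetical, as are the output names other than `num_bets`, `total_wagered` and `net_loss`. The definitions are assumptions too: duration counts days from the first to the last active day inclusive, frequency is the percentage of those days with at least one bet, and net loss is stakes minus winnings.

```python
import pandas as pd

def sketch_measures(daily: pd.DataFrame) -> pd.Series:
    """Compute the eight measures for one player's daily aggregates.

    Assumes hypothetical columns: 'date', 'bets', 'stake', 'winnings'.
    """
    daily = daily[daily["bets"] > 0]  # keep only days with betting activity
    dates = pd.to_datetime(daily["date"])
    # assumed definition: first to last active day, inclusive
    duration = (dates.max() - dates.min()).days + 1
    active_days = dates.dt.normalize().nunique()
    num_bets = int(daily["bets"].sum())
    total_wagered = float(daily["stake"].sum())
    net_loss = total_wagered - float(daily["winnings"].sum())
    return pd.Series({
        "duration": duration,
        "frequency": 100 * active_days / duration,
        "num_bets": num_bets,
        "average_bets_per_day": num_bets / active_days,
        "average_bet_size": total_wagered / num_bets,
        "total_wagered": total_wagered,
        "net_loss": net_loss,
        "percent_loss": 100 * net_loss / total_wagered,
    })

# made-up example data for one player
example = pd.DataFrame({
    "date": ["2005-02-01", "2005-02-03", "2005-02-10"],
    "bets": [2, 1, 5],
    "stake": [20.0, 10.0, 50.0],
    "winnings": [15.0, 0.0, 45.0],
})
measures = sketch_measures(example)
print(measures)
```

A player with no bets or no stakes would cause a division by zero in this sketch, so such players would need to be filtered out beforehand.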


Next, for this replication we take the user IDs from the original analytic data set and keep only those players from the measures calculated above. This requires **AnalyticDataInternetGambling.txt** to be in the same directory as this notebook.

In [None]:
fo_gamba_measures = gb.data.read_csv('fo_labrie_measures.csv')
la_gamba_measures = gb.data.read_csv('la_labrie_measures.csv')
original = gb.data.read_csv('AnalyticDataInternetGambling.txt')

fo_bettors = original[original['FOTotalBets'] > 0]
la_bettors = original[original['LATotalBets'] > 0]

gamba_fo = fo_gamba_measures[fo_gamba_measures['player_id'].isin(fo_bettors['USERID'].values)]
gamba_la = la_gamba_measures[la_gamba_measures['player_id'].isin(la_bettors['USERID'].values)]

FileNotFoundError: [Errno 2] File AnalyticDataInternetGambling.txt does not exist: 'AnalyticDataInternetGambling.txt'

In [None]:
t1a = gb.statistics.descriptive_table(gamba_fo)
t1b = gb.statistics.descriptive_table(gamba_la)
display(t1a.round())
display(t1b.round())
    
fo_spearmans = gb.statistics.spearmans_r(gamba_fo)
la_spearmans = gb.statistics.spearmans_r(gamba_la)
display(fo_spearmans)
display(la_spearmans)
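
For intuition about the inter-measure correlation table, Spearman's coefficient is Pearson's correlation computed on ranks. The sketch below uses plain pandas on made-up numbers and is only an illustration. It is not how `gb.statistics.spearmans_r` is implemented, and its output layout may differ.

```python
import pandas as pd

# made-up measures for four players
measures = pd.DataFrame({
    "num_bets": [10, 40, 25, 5],
    "total_wagered": [100.0, 900.0, 300.0, 20.0],
    "net_loss": [30.0, 10.0, 20.0, 5.0],
})

# rank each column (ties get average ranks), then take Pearson's r on the ranks
spearman = measures.rank().corr(method="pearson")
print(spearman.round(2))
```

Because only ranks matter, a few extreme bettors do not dominate the coefficient the way they would with Pearson's correlation on the raw values.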

With both the descriptive and inter-measure correlation tables complete, each player can be labelled according to whether they fall in the top 1% of their cohort on a given measure. The measures used here are the **net loss**, **total amount wagered**, and **number of bets**. This is done for both the fixed odds (fo) and live action (la) data...

In [None]:
fo_labelled = gb.labelling.top_split(gamba_fo, 'net_loss', percentile=99)
fo_labelled = gb.labelling.top_split(fo_labelled, 'total_wagered', percentile=99)
fo_labelled = gb.labelling.top_split(fo_labelled, 'num_bets', percentile=99)

t3a = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_net_loss'] == 1])
t3b = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_total_wagered'] == 1])
t3c = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_num_bets'] == 1])

la_labelled = gb.labelling.top_split(gamba_la, 'net_loss', percentile=99)
la_labelled = gb.labelling.top_split(la_labelled, 'total_wagered', percentile=99)
la_labelled = gb.labelling.top_split(la_labelled, 'num_bets', percentile=99)

t3d = gb.statistics.descriptive_table(la_labelled[la_labelled['top_net_loss'] == 1])
t3e = gb.statistics.descriptive_table(la_labelled[la_labelled['top_total_wagered'] == 1])
t3f = gb.statistics.descriptive_table(la_labelled[la_labelled['top_num_bets'] == 1])

t3_top = gb.data.concat([t3a, t3b, t3c], axis=1).reindex(t3a.index)
t3_bottom = gb.data.concat([t3d, t3e, t3f], axis=1).reindex(t3d.index)
t3_top.drop(t3_top.tail(3).index,inplace=True)
t3_bottom.drop(t3_bottom.tail(3).index,inplace=True)

display(t3_top)
display(t3_bottom)
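
The labelling step can be pictured as a percentile threshold on one column. The function below is a hypothetical stand-in written in plain pandas, not `gb.labelling.top_split` itself. It assumes players at or above the threshold are flagged, and that the new column is named `top_<measure>` as in the cells above. Whether gamba uses the same tie handling or quantile interpolation is not confirmed here.

```python
import pandas as pd

def sketch_top_split(df: pd.DataFrame, measure: str, percentile: float = 99) -> pd.DataFrame:
    """Return a copy of df with a 0/1 column 'top_<measure>' (assumed naming)."""
    # pandas interpolates linearly between observations by default
    threshold = df[measure].quantile(percentile / 100)
    labelled = df.copy()
    # assumption: values equal to the threshold count as 'top'
    labelled["top_" + measure] = (labelled[measure] >= threshold).astype(int)
    return labelled

# made-up cohort of 200 players with net losses 1, 2, ..., 200
cohort = pd.DataFrame({"player_id": range(200), "net_loss": range(1, 201)})
cohort_labelled = sketch_top_split(cohort, "net_loss", percentile=99)
print(cohort_labelled["top_net_loss"].sum())
```

Applying the function once per measure, as in the cells above, yields one label column per measure on the same dataframe.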

Finally, explore the overlap between players labelled under the different schemes, as in the original paper:

In [None]:
fo_table = gb.label_overlap_table(fo_labelled, ['top_net_loss','top_total_wagered','top_num_bets'])
la_table = gb.label_overlap_table(la_labelled, ['top_net_loss','top_total_wagered','top_num_bets'])
display(fo_table)
display(la_table)
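
One way to think about label overlap is as a pairwise count of players who carry both labels. The sketch below computes that with a matrix product on made-up 0/1 flags. It is a hypothetical illustration, and the table produced by gamba's `label_overlap_table` may be laid out differently.

```python
import pandas as pd

def sketch_overlap(df: pd.DataFrame, labels: list) -> pd.DataFrame:
    """Count players holding both labels for every pair of labels.

    The diagonal holds the size of each labelled group.
    """
    flags = df[labels].astype(int)
    # (labels x players) dot (players x labels) = pairwise co-occurrence counts
    return flags.T.dot(flags)

# made-up labels for five players
toy_labelled = pd.DataFrame({
    "top_net_loss":      [1, 1, 0, 0, 1],
    "top_total_wagered": [1, 0, 1, 0, 1],
    "top_num_bets":      [0, 0, 1, 0, 1],
})
overlap = sketch_overlap(toy_labelled, ["top_net_loss", "top_total_wagered", "top_num_bets"])
print(overlap)
```

Dividing each row by its diagonal entry would turn the counts into the share of one group that also belongs to another.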

...