# [Kaggle - TalkingData AdTracking Fraud Detection Challenge](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection)

Our task is to predict whether a click on an ad is fraudulent, given a few basic attributes of the device that made the click. What sets this competition apart is the sheer scale of the dataset: **240 million rows**.

Looking at the evaluation page, we can see that the evaluation metric is **ROC-AUC** (the area under the Receiver Operating Characteristic curve). In other words:

- This competition is a **binary classification** problem, i.e. our target variable is binary (is the click fraudulent or not?) and our goal is to separate "fraudulent" from "not fraudulent" clicks as well as possible.

- Unlike metrics such as [LogLoss](http://www.exegetic.biz/blog/2015/12/making-sense-logarithmic-loss/), the AUC score depends only on **how well you can separate the two classes**. In practice, this means that only the order of your predictions matters.

  - As a result, any order-preserving rescaling of your model's output probabilities has no effect on your score. In some other competitions, adding a constant or applying a multiplier to your predictions to match the target distribution can help, but that doesn't apply here.
  
If you want a more intuitive explanation of how AUC works, I recommend [this post](https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it).
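
To make the rescaling point concrete, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs ranked correctly, then applies two order-preserving transformations to the scores. The labels and scores are made up for illustration, and `roc_auc` is a hypothetical helper written for this example rather than a library function.

```python
import math

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up toy data, not taken from the competition
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.20, 0.70]

auc_raw = roc_auc(labels, scores)
# Linear rescaling: multiply and shift
auc_scaled = roc_auc(labels, [0.5 * s + 0.1 for s in scores])
# Non-linear but strictly increasing: the logit transform
auc_logit = roc_auc(labels, [math.log(s / (1 - s)) for s in scores])

print(auc_raw, auc_scaled, auc_logit)
```

All three values are identical because neither transformation changes which score in any pair is larger. A metric like LogLoss would change under both.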
  
Let's dive right in by looking at the data we're given:

Given the scale of the dataset, it most likely won't fit in memory on most laptops. One solution is to use **Dask**, which splits the data into partitions and evaluates computations lazily, one chunk at a time.
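
The idea behind out-of-core processing can be sketched with the standard library alone: read a bounded number of rows at a time, compute a partial result per chunk, and combine the partial results at the end. This is only an illustration of the principle, not how Dask is implemented. The CSV content, the `iter_chunks` helper, and the chunk size are all made up for this example.

```python
import csv
import io

# Toy stand-in for train.csv; these rows are made up for illustration
raw = io.StringIO(
    "ip,app,is_attributed\n"
    "1,3,0\n"
    "2,3,0\n"
    "3,5,1\n"
    "4,3,0\n"
    "5,5,0\n"
    "6,9,1\n"
    "7,3,0\n"
)

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is held in memory."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

total_rows = 0
total_positive = 0
for chunk in iter_chunks(csv.DictReader(raw), chunk_size=3):
    # Partial results for this chunk only
    total_rows += len(chunk)
    total_positive += sum(int(row["is_attributed"]) for row in chunk)

# Combine the partial results into the global statistic
mean_attributed = total_positive / total_rows
print(total_rows, total_positive, mean_attributed)
```

A mean decomposes cleanly into per-chunk sums and counts, which is why it can be computed without ever loading the full file.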

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
import pathlib
import dask.dataframe as dd
import pandas as pd  # used further down to tabulate the per-column results
from distributed import Client, progress


In [2]:
DATA = pathlib.Path('data')

In [3]:
ls {DATA}

sample_submission.csv  test.csv  train.csv


In [4]:
train_filepath = DATA / 'train.csv'

In [5]:
client = Client(processes=False)

In [6]:
client

| Client | Cluster |
|---|---|
| Scheduler: inproc://144.167.111.156/21629/1 | Workers: 1 |
| Dashboard: http://localhost:8787/status | Cores: 4 |
| | Memory: 12.50 GB |


In [7]:
dtypes = {
        'ip':'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
        }


In [8]:
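# blocksize is given as an integer number of bytes (about 100 MB per partition).
# storage_options={'anon': True} only applies to remote filesystems such as S3,
# so it is dropped here because the file is local.
train_df = dd.read_csv(train_filepath, blocksize=int(100e6),
                       parse_dates=['click_time', 'attributed_time'], dtype=dtypes)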

In [9]:
train_df.head()

| | ip | app | device | os | channel | click_time | attributed_time | is_attributed |
|---|---|---|---|---|---|---|---|---|
| 0 | 83230 | 3 | 1 | 13 | 379 | 2017-11-06 14:32:21 | NaT | 0 |
| 1 | 17357 | 3 | 1 | 19 | 379 | 2017-11-06 14:33:34 | NaT | 0 |
| 2 | 35810 | 3 | 1 | 13 | 379 | 2017-11-06 14:34:12 | NaT | 0 |
| 3 | 45745 | 14 | 1 | 13 | 478 | 2017-11-06 14:34:52 | NaT | 0 |
| 4 | 161007 | 3 | 1 | 13 | 379 | 2017-11-06 14:35:08 | NaT | 0 |


In [10]:
train_df.tail()

| | ip | app | device | os | channel | click_time | attributed_time | is_attributed |
|---|---|---|---|---|---|---|---|---|
| 925194 | 121312 | 12 | 1 | 10 | 340 | 2017-11-09 16:00:00 | NaT | 0 |
| 925195 | 46894 | 3 | 1 | 19 | 211 | 2017-11-09 16:00:00 | NaT | 0 |
| 925196 | 320126 | 1 | 1 | 13 | 274 | 2017-11-09 16:00:00 | NaT | 0 |
| 925197 | 189286 | 12 | 1 | 37 | 259 | 2017-11-09 16:00:00 | NaT | 0 |
| 925198 | 106485 | 11 | 1 | 19 | 137 | 2017-11-09 16:00:00 | NaT | 0 |


## Looking at the features

Each row of the training data contains a click record, with the following features:

- ip: ip address of click
- app: app id for marketing
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded

**NOTE:**

- By looking at the data samples above, you'll notice that the categorical variables (ip, app, device, os, channel) are encoded: we don't know what the actual values correspond to, because each value has been replaced by an ID. This has likely been done because data such as IP addresses are sensitive, although it does unfortunately reduce the amount of feature engineering we can do on these.
    
- The attributed_time variable is only available in the training set - it's not immediately useful for classification but it could be used for some interesting analysis (for example, one could fill in the variable in the test set by building a model to predict it).
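
As one example of the kind of analysis attributed_time allows, the sketch below computes the delay between a click and the download that followed it. The rows are made up for illustration (the sample shown above contains no attributed clicks), and `seconds_to_download` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Made-up rows for illustration; None plays the role of NaT (no download)
rows = [
    {"click_time": "2017-11-06 14:32:21", "attributed_time": None},
    {"click_time": "2017-11-06 14:33:34", "attributed_time": "2017-11-06 14:35:04"},
    {"click_time": "2017-11-06 14:34:12", "attributed_time": "2017-11-06 16:34:12"},
]

def seconds_to_download(row):
    """Return the click-to-download delay in seconds, or None if there was no download."""
    if row["attributed_time"] is None:
        return None
    click = datetime.strptime(row["click_time"], FMT)
    download = datetime.strptime(row["attributed_time"], FMT)
    return (download - click).total_seconds()

delays = [seconds_to_download(row) for row in rows]
print(delays)
```

Since attributed_time is missing from the test set, a quantity like this can only be studied on the training data.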


In [11]:
train_df.is_attributed.mean().compute()

0.002470721410998979

In [12]:
%time len(train_df)

CPU times: user 5min 55s, sys: 18.2 s, total: 6min 13s
Wall time: 4min 19s


184903890

We can see that the training set consists of **184,903,890 rows**.
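
Combining this row count with the mean of is_attributed computed earlier gives a sense of how imbalanced the classes are. The short calculation below uses only the two numbers printed above.

```python
n_rows = 184_903_890                    # len(train_df), from the output above
positive_rate = 0.002470721410998979    # train_df.is_attributed.mean(), from the output above

# Number of clicks with is_attributed == 1
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

# How many negative clicks there are for each positive one
neg_per_pos = n_negative / n_positive

print(n_positive, n_negative, round(neg_per_pos, 1))
```

Fewer than half a million of the roughly 185 million clicks are positive, so there are about 400 negative rows for every positive one.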

In [16]:
%%time 
means = {}    # per column: Series mapping category ID -> mean of is_attributed
weights = {}  # per column: Series mapping category ID -> number of clicks
cols = ['ip', 'app', 'device', 'os', 'channel']
for col in cols:
    means[col] = train_df.groupby(col)['is_attributed'].mean().compute()
    weights[col] = train_df[col].value_counts().compute()

CPU times: user 1h 1min 17s, sys: 2min 39s, total: 1h 3min 57s
Wall time: 44min 37s
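
To show what these two aggregations produce, here is a pure-Python sketch of the same per-category mean and count on a handful of made-up (app, is_attributed) pairs. The names `toy_means` and `toy_counts` are hypothetical and chosen so they do not clash with the notebook's `means` and `weights`.

```python
from collections import defaultdict

# Made-up (app, is_attributed) pairs for illustration
clicks = [(3, 0), (3, 0), (3, 1), (12, 0), (12, 0), (99, 1)]

sums = defaultdict(int)
toy_counts = defaultdict(int)
for app, is_attributed in clicks:
    sums[app] += is_attributed   # number of attributed clicks per app
    toy_counts[app] += 1         # total number of clicks per app

# Mean of is_attributed per app, the analogue of groupby(col).mean()
toy_means = {app: sums[app] / toy_counts[app] for app in toy_counts}

for app in sorted(toy_means):
    print(app, toy_counts[app], toy_means[app])
```

A category seen only once always gets a mean of exactly 0 or 1, which is why the counts are worth keeping next to the means: they indicate how much each mean can be trusted.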


In [17]:
means

{'app': app
 0      0.309421
 1      0.000212
 2      0.000262
 3      0.000303
 4      0.000040
 5      0.072598
 6      0.000083
 7      0.000670
 8      0.001842
 9      0.001144
 10     0.050549
 11     0.001555
 12     0.000109
 13     0.000167
 14     0.000250
 15     0.000214
 16     0.230298
 17     0.000640
 18     0.000503
 19     0.143450
 20     0.002176
 21     0.000133
 22     0.000245
 23     0.000019
 24     0.000491
 25     0.000047
 26     0.000467
 27     0.001720
 28     0.000082
 29     0.061275
          ...   
 768    0.300000
 753    0.250000
 748    0.000000
 742    0.000000
 755    0.000000
 745    0.000000
 756    0.000000
 763    1.000000
 760    0.000000
 761    0.000000
 424    0.000000
 766    0.000000
 410    0.000000
 757    0.000000
 759    0.000000
 438    0.000000
 743    0.000000
 765    0.000000
 767    0.000000
 532    0.000000
 744    0.000000
 746    0.000000
 747    0.000000
 749    0.000000
 764    0.000000
 201    0.000000
 741    0.000000
 7

In [18]:
weights

{'app': 3      33911780
 12     24179003
 2      21642136
 9      16458268
 15     15958970
 18     15756587
 14     10027169
 1       5796274
 13      4329409
 8       3731948
 21      3616407
 11      3466971
 26      3126136
 23      2675259
 6       2464136
 64      1893969
 7       1764954
 20      1758934
 25      1467907
 28      1311496
 27      1296189
 24      1259100
 19       922902
 17       797335
 22       684604
 10       684043
 29       652090
 32       485426
 5        375533
 151      188490
          ...   
 678           1
 679           1
 558           1
 681           1
 404           1
 653           1
 684           1
 410           1
 687           1
 608           1
 689           1
 691           1
 673           1
 671           1
 571           1
 669           1
 668           1
 667           1
 572           1
 665           1
 664           1
 578           1
 661           1
 580           1
 659           1
 582           1
 657           1
 656   

In [46]:
means_df = pd.DataFrame(means)
means_df.head()

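| | app | channel | device | ip | os |
|---|---|---|---|---|---|
| 0 | 0.309421 | 0.077345 | 0.098525 | NaN | 0.104272 |
| 1 | 0.000212 | NaN | 0.001758 | 0.191489 | 0.001035 |
| 2 | 0.000262 | NaN | 0.000274 | NaN | 0.000249 |
| 3 | 0.000303 | 0.000413 | NaN | NaN | 0.000845 |
| 4 | 4e-05 | 0.085847 | 0.186807 | NaN | 0.009622 |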


In [47]:
means_df.describe()

| | app | channel | device | ip | os |
|---|---|---|---|---|---|
| count | 706.0 | 202.0 | 3475.0 | 277396.0 | 800.0 |
| mean | 0.064204 | 0.043271 | 0.149752 | 0.256061 | 0.007943 |
| std | 0.161892 | 0.138098 | 0.230485 | 0.351494 | 0.057189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.000153 | 0.0 | 0.0022 | 0.0 |
| 50% | 0.0 | 0.000457 | 0.076923 | 0.066667 | 0.0 |
| 75% | 0.008699 | 0.003615 | 0.2 | 0.333333 | 0.0 |
| max | 1.0 | 0.95245 | 1.0 | 1.0 | 0.925237 |


In [51]:
# Keep the index when saving: it holds the category IDs the means belong to
means_df.to_csv('cols_means.csv')

In [64]:
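# Note: this counts the distinct click-count values across IPs, not the number
# of distinct IPs (which is len(weights['ip']))
weights['ip'].nunique()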

10499

In [48]:
weights_df = pd.DataFrame(weights)
weights_df.head()

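| | app | channel | device | ip | os |
|---|---|---|---|---|---|
| 0 | 3248.0 | 1642.0 | 1033413.0 | NaN | 364804.0 |
| 1 | 5796274.0 | NaN | 174330052.0 | 47.0 | 2215593.0 |
| 2 | 21642136.0 | NaN | 8105054.0 | NaN | 691125.0 |
| 3 | 33911780.0 | 875627.0 | NaN | NaN | 2904808.0 |
| 4 | 126275.0 | 862.0 | 1713.0 | NaN | 593103.0 |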


In [52]:
weights_df.describe()

| | app | channel | device | ip | os |
|---|---|---|---|---|---|
| count | 706.0 | 202.0 | 3475.0 | 277396.0 | 800.0 |
| mean | 261903.5 | 915365.8 | 53209.75 | 666.5701 | 231129.9 |
| std | 2118369.0 | 1815411.0 | 2960519.0 | 5446.831 | 2221556.0 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 4234.5 | 1.0 | 3.0 | 1.0 |
| 50% | 29.0 | 107788.0 | 4.0 | 13.0 | 3.0 |
| 75% | 753.25 | 1019082.0 | 18.0 | 149.0 | 73.5 |
| max | 33911780.0 | 15065930.0 | 174330100.0 | 1238734.0 | 44181910.0 |


In [65]:
# Keep the index when saving: it holds the category IDs the counts belong to
weights_df.to_csv('weights_df.csv')

For each of our encoded columns, let's look at the number of unique values:

In [21]:
%%time 
uniques = [len(train_df[col].unique().compute()) for col in cols]

CPU times: user 30min 45s, sys: 1min 10s, total: 31min 56s
Wall time: 23min 2s


In [22]:
uniques

[277396, 706, 3475, 800, 202]

In [66]:
import pandas as pd

In [39]:
cols_df = pd.DataFrame({'cols': cols, 'unique_counts': uniques})
cols_df.head()

| | cols | unique_counts |
|---|---|---|
| 0 | ip | 277396 |
| 1 | app | 706 |
| 2 | device | 3475 |
| 3 | os | 800 |
| 4 | channel | 202 |


In [44]:
cols_df.to_csv('cols_unique_counts.csv', index=False)

In [45]:
pd.read_csv('cols_unique_counts.csv')

| | cols | unique_counts |
|---|---|---|
| 0 | ip | 277396 |
| 1 | app | 706 |
| 2 | device | 3475 |
| 3 | os | 800 |
| 4 | channel | 202 |


In [68]:
%time target = train_df.is_attributed.values.compute()

CPU times: user 5min 59s, sys: 14.1 s, total: 6min 13s
Wall time: 4min 21s


In [73]:
1 - (target == 0).mean()

0.0024707214109990216