# TalkingData AdTracking Fraud Detection Challenge
This notebook uses a dataset from [Kaggle's TalkingData AdTracking Fraud Detection Challenge](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection) competition:
> In this competition you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days.

## Load data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

In [3]:
dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'is_attributed' : 'uint8',
        'click_id'      : 'uint32'
        }

train = pd.read_csv('./input/train.csv', dtype=dtypes, usecols=['ip', 'click_time'])

print('Size of the dataframe: {} rows and {} columns'.format(*train.shape))
train.head()

Size of the dataframe: 184903890 rows and 2 columns


| | ip | click_time |
|---|---|---|
| 0 | 83230 | 06 14:32:21 |
| 1 | 17357 | 06 14:33:34 |
| 2 | 35810 | 06 14:34:12 |
| 3 | 45745 | 06 14:34:52 |
| 4 | 161007 | 06 14:35:08 |


## Click time

The first click in the train set is at 2017-11-06 14:32:21, and the last is on 2017-11-09 at 16:00:00. Test clicks start on 2017-11-10 at 04:00:00. According to the data specification, train covers a 4-day period. The train and test sets therefore do not overlap: test is drawn from the day after train ends. Train is ordered by timestamp, so a batch of consecutive rows covers only a limited time span.
2017-11-06 was a Monday and 2017-11-10 was a Friday, i.e. train runs Monday to Thursday and test is Friday. There is no missing data in test. Missing values in train appear only in `attributed_time`, which is empty when the click did not lead to an app download.
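The weekday and ordering claims above can be checked with a few pandas calls. The sketch below is a minimal illustration on a toy frame, not the notebook's actual data: the three timestamps are made up for the example and are assumed to be in full `YYYY-MM-DD HH:MM:SS` form.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, full timestamps assumed).
sample = pd.DataFrame({
    "click_time": [
        "2017-11-06 14:32:21",
        "2017-11-07 09:15:00",
        "2017-11-09 16:00:00",
    ]
})
sample["click_time"] = pd.to_datetime(sample["click_time"])

# Weekday name of each click.
weekdays = sample["click_time"].dt.day_name().tolist()

# True when the rows are already sorted by timestamp.
is_sorted = sample["click_time"].is_monotonic_increasing

# With time-ordered rows, the first n rows span only a short window.
first_batch = sample["click_time"].iloc[:2]
first_batch_span = first_batch.max() - first_batch.min()

print(weekdays, is_sorted, first_batch_span)
```

On the full data the same checks apply, but a time-ordered file means that sampling the first N rows gives a sample biased toward Monday; a random or strided sample would be needed to cover all four days.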

In [20]:
print("min train click_time: {}, max train click_time {}".format(train.click_time.min(), train.click_time.max()))

min train click_time: 06 14:32:21, max train click_time 09 16:00:00


In [19]:
test = pd.read_csv('./input/test.csv', dtype=dtypes, usecols=['ip', 'click_time'])

print('Size of the dataframe: {} rows and {} columns'.format(*test.shape))
test.head()

Size of the dataframe: 18790469 rows and 2 columns


| | ip | click_time |
|---|---|---|
| 0 | 5744 | 2017-11-10 04:00:00 |
| 1 | 119901 | 2017-11-10 04:00:00 |
| 2 | 72287 | 2017-11-10 04:00:00 |
| 3 | 78477 | 2017-11-10 04:00:00 |
| 4 | 123080 | 2017-11-10 04:00:00 |


In [21]:
print("min test click_time: {}, max test click_time {}".format(test.click_time.min(), test.click_time.max()))

min test click_time: 2017-11-10 04:00:00, max test click_time 2017-11-10 15:00:00


## IP

In [22]:
print("Number of train unique ip's: {}, number of test unique ip's: {}".format(
    train.ip.nunique(), test.ip.nunique()))

Number of train unique ip's: 277396, number of test unique ip's: 93936


In [17]:
print("min train ip: {}, max train ip {}".format(train.ip.min(), train.ip.max()))

min train ip: 1, max train ip 364778


In [18]:
print("min test ip: {}, max test ip {}".format(test.ip.min(), test.ip.max()))

min test ip: 0, max test ip 126413


In [23]:
train[train.ip.isin(test.ip)].shape

(147863120, 2)
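The count above means that 147,863,120 of the 184,903,890 train rows (about 80%) come from an IP that also appears in test. A row count and a unique-IP count answer different questions, so it can help to compute both. The following is a minimal sketch on toy series; `train_ip` and `test_ip` are hypothetical stand-ins for `train.ip` and `test.ip`.

```python
import pandas as pd

# Toy stand-ins for train.ip and test.ip (hypothetical values).
train_ip = pd.Series([1, 1, 2, 3, 3, 3, 9], name="ip")
test_ip = pd.Series([0, 1, 3, 3], name="ip")

# Row-level overlap: share of train clicks whose IP also occurs in test.
in_test = train_ip.isin(test_ip.unique())
row_share = in_test.mean()

# IP-level overlap: which distinct IPs are shared, and which are test-only.
train_unique = set(train_ip.unique().tolist())
test_unique = set(test_ip.unique().tolist())
shared = train_unique & test_unique
test_only = test_unique - train_unique

print("train rows with an IP seen in test: {} ({:.1%})".format(in_test.sum(), row_share))
print("shared IPs: {}, test-only IPs: {}".format(len(shared), len(test_only)))
```

Test-only IPs matter for any IP-based feature, since those rows have no click history in train.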