# Data fields

Each row of the training data contains a click record, with the following features.

    ip: ip address of click.
    app: app id for marketing.
    device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
    os: os version id of user mobile phone
    channel: channel id of mobile ad publisher
    click_time: timestamp of click (UTC)
    attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
    is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:

    click_id: reference for making predictions
    is_attributed: not included

## Goal

Build an algorithm that predicts the probability of a user downloading an app after clicking a mobile app ad.
   
# Submission

Submission File

For each click_id in the test set, you must predict a probability for the target is_attributed variable. The file should contain a header and have the following format:

click_id,is_attributed
1,0.003
2,0.001
3,0.000
etc.
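
A minimal sketch of writing a file in this format, assuming the predictions are already available as a list of probabilities. The helper name `make_submission`, the output path, and the demo values are hypothetical and not part of the competition material.

```python
import pandas as pd

def make_submission(click_ids, probabilities, path="submission.csv"):
    """Write a two-column CSV with the header click_id,is_attributed."""
    submission = pd.DataFrame(
        {"click_id": list(click_ids), "is_attributed": list(probabilities)},
        columns=["click_id", "is_attributed"],
    )
    # Keep the predictions inside the valid probability range [0, 1]
    submission["is_attributed"] = submission["is_attributed"].clip(0.0, 1.0)
    # index=False so that only the two expected columns are written
    submission.to_csv(path, index=False)
    return submission

# Toy values copied from the format example above, not real predictions
demo = make_submission([1, 2, 3], [0.003, 0.001, 0.0], path="submission_demo.csv")
```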

   
   
## Notes about the data - brainstorm
   The timespan is short (4 days).

   IMO there are very few columns / potential features in the data.

   The data may contain fraudulent clicks that never result in an app download (but I would assume the test data does too?).

   device might be interesting: I would assume a device is linked to a type of consumer, so device type could be a proxy for some kind of consumer behavior.

   channel id is interesting as well.

   os: not sure what that would give us.

   ip could be used as a "user ID" so we could build some history / aggregates on it. We could also use it to infer countries with an external database that maps IPs to countries, though since ip is encoded that mapping likely won't work.

   click_time and attributed_time: it could be interesting to see how long a user took to download an app (attributed_time - click_time). Maybe there are some classes there (short, medium, long), but not sure how that would help yet.
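
Two of the ideas above (per-IP aggregates and the click-to-download delay) can be sketched as follows. This is only an illustration on a made-up five-row frame with the same column names as the training data. The new column names and the delay class boundaries (1 minute, 1 hour) are assumptions. Since the delay only exists for rows with a download, it is more useful for exploration than as a direct model input.

```python
import pandas as pd

# Tiny made-up sample using the column names of the training data
clicks = pd.DataFrame({
    "ip": [10, 10, 10, 20, 20],
    "click_time": ["2017-11-07 09:30:38", "2017-11-07 09:31:00",
                   "2017-11-08 10:00:00", "2017-11-07 12:00:00",
                   "2017-11-09 01:00:00"],
    "attributed_time": [None, "2017-11-07 09:41:00", None, None, None],
    "is_attributed": [0, 1, 0, 0, 0],
})

# Parse the timestamp strings; missing download times become NaT
clicks["click_time"] = pd.to_datetime(clicks["click_time"])
clicks["attributed_time"] = pd.to_datetime(clicks["attributed_time"])

# Per-IP aggregate: number of clicks recorded for the same ip
clicks["ip_click_count"] = clicks.groupby("ip")["ip"].transform("count")

# Delay between click and download in seconds (NaN when there was no download)
delay = clicks["attributed_time"] - clicks["click_time"]
clicks["download_delay_s"] = delay.dt.total_seconds()

# Hypothetical classes: up to 1 minute, up to 1 hour, longer
clicks["delay_class"] = pd.cut(
    clicks["download_delay_s"],
    bins=[0, 60, 3600, float("inf")],
    labels=["short", "medium", "long"],
    include_lowest=True,
)

print(clicks[["ip", "ip_click_count", "download_delay_s", "delay_class"]])
```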


In [1]:
# imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [2]:
df = pd.read_csv("../data/train_sample.csv.zip")
df

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,87540,12,1,13,497,2017-11-07 09:30:38,,0
1,105560,25,1,17,259,2017-11-07 13:40:27,,0
2,101424,12,1,19,212,2017-11-07 18:05:24,,0
3,94584,13,1,13,477,2017-11-07 04:58:08,,0
4,68413,12,1,1,178,2017-11-09 09:00:09,,0
5,93663,3,1,17,115,2017-11-09 01:22:13,,0
6,17059,1,1,17,135,2017-11-09 01:17:58,,0
7,121505,9,1,25,442,2017-11-07 10:01:53,,0
8,192967,2,2,22,364,2017-11-08 09:35:17,,0
9,143636,3,1,19,135,2017-11-08 12:35:26,,0


In [3]:
# trying to figure out if there are some fraudulent IPs we should get rid of

# getting all ips which downloaded at least one app
# sum only is_attributed, so the timestamp columns are not summed as well
grouped_by_ip_sum = df.groupby("ip")["is_attributed"].sum()
ips_that_downloaded = grouped_by_ip_sum[grouped_by_ip_sum > 0].index

df_fraud_explo = df.loc[~df["ip"].isin(ips_that_downloaded)].groupby("ip").size().sort_values(ascending=False).to_frame('counts')

print(df_fraud_explo.describe())
print(df_fraud_explo.head())

# number of IPs that never downloaded an app but have more than 20 ad clicks
print("Potentially fraudulent IPs:",len(df_fraud_explo.loc[df_fraud_explo["counts"] > 20].index))

# bin by count of clicks (starting at 25, TODO: reposition x axis..)
df_fraud_explo.hist(column="counts", bins=50, range=(25,800))

             counts
count  34634.000000
mean       2.826760
std        6.374741
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max      439.000000
        counts
ip            
73487      439
73516      399
53454      280
114276     219
26995      218
('Potentially fraudulent IPs:', 275)


NameError: global name '_converter' is not defined
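
The same binning can also be inspected as a table, which does not depend on the plotting call. A minimal sketch, assuming a `counts` column like the one in `df_fraud_explo`; the values below are a made-up stand-in and the bin edges are hypothetical.

```python
import pandas as pd

# Made-up click counts per IP, standing in for df_fraud_explo["counts"]
counts = pd.Series([439, 399, 280, 219, 218, 30, 26, 3, 2, 1], name="counts")

# Hypothetical bin edges; IPs with 25 clicks or fewer are left out
edges = [25, 50, 100, 200, 400, 800]
binned = pd.cut(counts[counts > 25], bins=edges)

# Number of IPs per bin, listed in bin order (empty bins are kept)
table = binned.value_counts().sort_index()
print(table)
```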