# CTR Prediction

	https://www.kaggle.com/c/avazu-ctr-prediction/data

## File descriptions
**train** - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies. 

	https://www.kaggle.com/c/avazu-ctr-prediction/download/train.gz

**test** - Test set. 1 day of ads for testing your model predictions.

	https://www.kaggle.com/c/avazu-ctr-prediction/download/test.gz

**sampleSubmission.csv** - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark. 

	https://www.kaggle.com/c/avazu-ctr-prediction/download/sampleSubmission.gz
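
As a rough illustration of what an "All-0.5" submission looks like, the sketch below builds a frame that predicts a click probability of 0.5 for every ad. It assumes the submission has exactly two columns named `id` and `click`; the helper `make_constant_submission` and the toy ids are hypothetical and only serve the example.

```python
import pandas as pd


def make_constant_submission(test_df: pd.DataFrame, value: float = 0.5) -> pd.DataFrame:
    """Return an (id, click) frame that predicts the same probability for every row."""
    # Assumption: the expected submission columns are "id" and "click".
    return pd.DataFrame({"id": test_df["id"], "click": value})


# Toy stand-in for the real test set; the ids are made up for illustration.
toy_test = pd.DataFrame({"id": ["a1", "a2", "a3"], "hour": [14102100] * 3})

submission = make_constant_submission(toy_test)
csv_text = submission.to_csv(index=False)  # CSV text with an "id,click" header
```

On the real data, the same helper could be applied to the frame read from the test file and the result written out with `to_csv(..., index=False)`.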

## Data fields
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
- C1: anonymized categorical variable
- banner_pos, site_id, site_domain, site_category, app_id, app_domain, app_category, device_id, device_ip, device_model, device_type, device_conn_type
- C14-C21: anonymized categorical variables
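
The `hour` encoding described above can be decoded with a standard date format string. The following is a minimal sketch using only the standard library; the helper name `parse_hour` is hypothetical, and the result is a naive `datetime` that should be read as UTC per the field description.

```python
from datetime import datetime


def parse_hour(value) -> datetime:
    """Parse a YYMMDDHH value (int or str) into a naive datetime, read as UTC."""
    # %y = two-digit year, %m = month, %d = day, %H = hour on a 24-hour clock
    return datetime.strptime(str(value), "%y%m%d%H")


parsed = parse_hour(14091123)
print(parsed)  # 2014-09-11 23:00:00
```

For a whole column, `pd.to_datetime(training_set["hour"].astype(str), format="%y%m%d%H")` should give the equivalent result in one vectorized call.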

# Load Data

In [1]:
import pandas as pd


# Initial setup
train_filename = "train_small.csv"
test_filename = "test.csv"
submission_filename = "submit.csv"

training_set = pd.read_csv(train_filename)



# Explore Data

In [2]:
training_set.describe()

Unnamed: 0,click,hour,C1,banner_pos,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
count,2999.0,2999,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0,2999.0
mean,0.164055,14102100,1005.03968,0.190063,1.056686,0.19073,17725.745915,318.507503,57.323775,1967.792264,0.787929,125.41147,39050.640213,87.887963
std,0.370387,0,1.042825,0.397484,0.564801,0.623881,3062.115201,9.110243,38.256554,376.396854,1.229925,228.664376,48834.382147,44.755433
min,0.0,14102100,1001.0,0.0,0.0,0.0,377.0,216.0,36.0,112.0,0.0,35.0,-1.0,13.0
25%,0.0,14102100,1005.0,0.0,1.0,0.0,15704.0,320.0,50.0,1722.0,0.0,35.0,-1.0,61.0
50%,0.0,14102100,1005.0,0.0,1.0,0.0,17653.0,320.0,50.0,1955.0,0.0,35.0,-1.0,79.0
75%,0.0,14102100,1005.0,0.0,1.0,0.0,20362.0,320.0,50.0,2283.0,2.0,39.0,100084.0,117.0
max,1.0,14102100,1010.0,4.0,5.0,5.0,21704.0,320.0,480.0,2497.0,3.0,1835.0,100248.0,157.0
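
Two things can be read off this summary: the mean of `click` (0.164055) is the overall click-through rate of the sample, and `hour` has a standard deviation of 0 with identical minimum and maximum, so every row comes from the same hour. The sketch below shows one way to check both directly; it runs on a small toy frame whose values are illustrative only, not taken from the data.

```python
import pandas as pd

# Toy frame standing in for training_set; values are illustrative only.
toy = pd.DataFrame({
    "click": [0, 1, 0, 0],
    "hour": [14102100, 14102100, 14102100, 14102100],
})

ctr = toy["click"].mean()        # fraction of rows with click == 1
n_hours = toy["hour"].nunique()  # number of distinct hour values

# Columns holding a single value carry no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
```

The same three expressions can be applied to `training_set` unchanged.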


In [4]:
training_set.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
1,10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
2,10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
3,10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15706,320,50,1722,0,35,100084,79
4,10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,...,1,0,18993,320,50,2161,0,35,-1,157
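
Some `id` values shown above have 20 digits (for example 10000169349117863715), which is larger than the signed 64-bit maximum of 9223372036854775807. Since `id` is an identifier rather than a quantity, one option is to read it as text. The sketch below assumes that approach and uses two rows copied from the output above, reduced to three columns.

```python
import io

import pandas as pd

# Two rows copied from the head() output, reduced to three columns.
sample_csv = io.StringIO(
    "id,click,hour\n"
    "1000009418151094273,0,14102100\n"
    "10000169349117863715,0,14102100\n"
)

# Read id as text so it is never coerced to a numeric type.
df = pd.read_csv(sample_csv, dtype={"id": str})

INT64_MAX = 2**63 - 1
# Flag ids that would not fit in a signed 64-bit integer.
exceeds_int64 = [int(v) > INT64_MAX for v in df["id"]]
```

Passing the same `dtype={"id": str}` argument when reading `train_filename` would keep the identifiers exact.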


In [7]:
training_set['hour'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x10b2a4f10>