## Click Through Rate Prediction    
### Data Available:    
I have log files for bidding, impressions, clicks and conversions, and my aim is to maximize the click-through rate (CTR), which is defined as:
$$\text{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}} \times 100$$
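
As a quick check of the formula, here is a minimal sketch in Python. The function name `click_through_rate` and the counts are made up for illustration and do not come from the logs.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when there are no impressions."""
    if impressions == 0:
        return 0.0
    return 100 * clicks / impressions


# Hypothetical counts, for illustration only.
print(click_through_rate(clicks=25, impressions=10_000))  # prints 0.25
```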

These logs give four tables: Bidding, Impression, Click and Conversion.

### Problem Formulation:    
In online advertising, an impression is a countable event recorded when an ad is fetched from its source or viewed.
For an ad to be shown, several bids are made and the bid with the greatest monetary benefit is chosen. To maximize the click-through rate we need to increase the number of clicks per impression, so selecting bids that are more likely to be clicked should raise the click-through rate.
I will therefore take the Bidding table, add a column marking which bids were clicked, and treat this as a supervised learning problem: estimating the probability of a click for each bid.
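
The labelling step described above could be sketched as follows. This assumes that `bidID` identifies the same bid in both the Bidding and Click tables; the toy frames, their values, and the column name `clicked` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the Bidding and Click tables (hypothetical values).
bids = pd.DataFrame({
    "bidID": ["a1", "b2", "c3", "d4"],
    "bidprice": [294, 277, 277, 294],
})
clicks = pd.DataFrame({"bidID": ["b2", "d4"]})

# A bid is a positive example if its bidID appears in the click log.
bids["clicked"] = bids["bidID"].isin(clicks["bidID"]).astype(int)
print(bids)
```

The resulting 0/1 column is the target for a binary classifier whose predicted probability serves as the click likelihood of each bid.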

### Breakdown of Data into Training, Validation and Holdout Sets:
The bidding logs cover 19 to 28 October 2013 and the click logs cover 19 to 27 October 2013, so the bidding data is split as follows:
- Training Data: bidding data from 19 to 25
- Validation Data: bidding data for 26 and 27 
- Holdout sample: bidding data for 28
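
The split above can be expressed as a small rule on the day of the month. The sketch below assumes the `Timestamp` values are integers shaped like `yyyymmddHHMMSSmmm` (17 digits), as they appear in the logs; the function name `assign_split` is hypothetical.

```python
def assign_split(timestamp: int) -> str:
    """Map a yyyymmddHHMMSSmmm timestamp to its data split by day of month."""
    # Dropping the last 9 digits (HHMMSSmmm) leaves yyyymmdd; % 100 keeps dd.
    day = timestamp // 10**9 % 100
    if 19 <= day <= 25:
        return "train"
    if 26 <= day <= 27:
        return "validation"
    if day == 28:
        return "holdout"
    raise ValueError(f"day {day} is outside the 19 to 28 window")


print(assign_split(20131019100000067))  # prints train
```

The notebook itself selects days by file name rather than by timestamp; this is only an alternative view of the same split, which could be applied to the `Timestamp` column with `Series.map`.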

In [1]:
from pathlib import Path
import pandas as pd
import dask.dataframe as dd

In [2]:
dataFolder = Path.cwd().joinpath('Data')

In [3]:
# Column names for the bid logs, supplied explicitly via names= when reading the files.
columnHeaders = ['bidID', 'Timestamp', 'XYZID', 'useragent', 'ip', 'region', 'city', 'adexchange', 'domain',
                 'url', 'anonURLID', 'adSlotID', 'width', 'height',
                 'visibility', 'format', 'slotPrice', 'creativeId', 'bidprice', 'adverId', 'userTag']
# Not used for the bid logs: 'logType', 'payPrice', 'keypageUrl'

In [4]:
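# Alternative that loads every bid log in the folder:
# biddataframes = [dd.read_csv(f, sep='\t', names=columnHeaders, blocksize=None, compression='bz2')
#                  for f in dataFolder.glob("bid*.bz2")]
# Load only the training window (19 to 25 October 2013).
# blocksize=None reads each bz2 file as a single partition.
biddataframes = [dd.read_csv(f, sep='\t', names=columnHeaders, blocksize=None, compression='bz2')
                 for f in dataFolder.glob('bid.201310[1-2][9,0,1,2,3,4,5].txt.bz2')]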

In [5]:
for f in dataFolder.glob('bid.201310[1-2][9,0,1,2,3,4,5].txt.bz2'):
    print(f)

/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131019.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131020.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131021.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131022.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131023.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131024.txt.bz2
/mnt/New Volume/Projects/Challenges/zypmedia/Click-Through-Rate-Prediction/Data/bid.20131025.txt.bz2
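
One caveat about the glob pattern above: inside square brackets the commas are literal characters rather than separators, and the class also admits days outside the intended range. The sketch below uses `fnmatch.fnmatchcase` from the standard library to list which days the pattern would accept, on the assumption that `Path.glob` applies the same character-class rules. The helper name `matched_days` is hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "bid.201310[1-2][9,0,1,2,3,4,5].txt.bz2"


def matched_days(pattern: str, days=range(1, 32)) -> list:
    """Return the October days whose bid file name matches the pattern."""
    return [d for d in days if fnmatchcase(f"bid.201310{d:02d}.txt.bz2", pattern)]


print(matched_days(pattern))
# prints [10, 11, 12, 13, 14, 15, 19, 20, 21, 22, 23, 24, 25, 29]
```

The pattern also accepts days 10 to 15 and 29, so it yields exactly 19 to 25 only because the Data folder holds no files for those extra days.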


In [6]:
biddf = dd.concat(biddataframes, axis=0)
biddf.head()

| | bidID | Timestamp | XYZID | useragent | ip | region | city | adexchange | domain | url | ... | adSlotID | width | height | visibility | format | slotPrice | creativeId | bidprice | adverId | userTag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35fd1ad90bf35fadd3047b0bfc7be326 | 20131019100000067 | DAB8cXDkxdy | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3... | 113.107.229.* | 216 | 237 | 1 | b5f57062ae7f4ba7f1489b9133b991d6 | 51773ab0ae291eb88b2740e781a99099 | ... | mm_30646014_3428401_13076720 | 300 | 250 | Na | Fixed | 0 | 7323 | 294 | 2259 | |
| 1 | 79b665070e59ebb2b07a7592334ce75a | 20131019100000039 | D96MEd7Dxp9 | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1... | 119.32.95.* | 216 | 217 | 2 | ed5dfe7e38655bf6defecf2f284e79ba | 11f05bd65774f9b52400e73e76dde41b | ... | 1058395156 | 250 | 250 | OtherView | Na | 4 | 7321 | 277 | 2259 | |
| 2 | 633707a960a18e07fc639afa6a4afd97 | 20131019100000062 | | Mozilla/4.0 (Windows; U; Windows NT 5.1; zh-TW... | 113.83.245.* | 216 | 227 | 2 | f78321209a9738b2e198ddc508045c99 | | ... | 221332875 | 300 | 250 | OtherView | Na | 5 | 7323 | 277 | 2259 | |
| 3 | 264b38e31c1d6978802292a7815c8b80 | 20131019100000768 | DAIJrqA8xbC | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | 113.111.33.* | 216 | 217 | 3 | 369fbd5dd0636d8b3e8199bf027854b1 | 6bf50b173a5827b99f9ddde1056ee59a | ... | discuz_18316225_006 | 728 | 90 | Na | Na | 20 | 7330 | 294 | 2259 | |
| 4 | 55c2625f4872a9246fde4b719067424b | 20131019100002089 | | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3... | 183.1.70.* | 216 | 217 | 3 | eb6cf22883534275b3c9b6af2307e28d | 25561276d5e8edaec04853fb455c0504 | ... | Game_F_Width1 | 1000 | 90 | Na | Na | 50 | 7336 | 294 | 2259 | |


In [7]:
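# Column names for the click logs: the bid-log columns plus 'logType', 'payPrice' and 'keypageUrl'.
# Note that this rebinds columnHeaders, so the bid-log column list is no longer available under this name.
columnHeaders = ['bidID', 'Timestamp', 'logType', 'XYZID', 'useragent', 'ip', 'region', 'city', 'adexchange', 'domain',
                 'url', 'anonURLID', 'adSlotID', 'width', 'height', 'visibility',
                 'format', 'slotPrice', 'creativeId', 'bidprice', 'payPrice', 'keypageUrl', 'adverId', 'userTag']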

In [8]:
# Click logs for the same training window (19 to 25 October 2013).
clkdataframes = [dd.read_csv(f, sep='\t', names=columnHeaders, blocksize=None, compression='bz2')
                 for f in dataFolder.glob('clk.201310[1-2][9,0,1,2,3,4,5].txt.bz2')]

In [9]:
clkdf = dd.concat(clkdataframes, axis=0)

In [10]:
clkdf.head()

| | bidID | Timestamp | logType | XYZID | useragent | ip | region | city | adexchange | domain | ... | height | visibility | format | slotPrice | creativeId | bidprice | payPrice | keypageUrl | adverId | userTag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c671f4c2919831696be3efa912ca1bd5 | 20131024175602152 | 2 | DAOETC97vz2 | Mozilla/5.0 (Linux; U; Android 2.3.6; zh-cn; O... | 117.136.24.* | 201 | 202 | | | ... | 50 | FirstView | Na | 6 | 11908 | 277 | 6 | | 2997 | |
| 1 | 44f58336fb0dbeaf7f0b2be1b7c6dc45 | 20131024181101117 | 2 | DAOIB11rvfj | Mozilla/5.0 (Linux; U; Android 2.3.4; zh-cn; A... | 123.151.186.* | 2 | 2 | | | ... | 50 | FirstView | Na | 131 | 11908 | 277 | 131 | | 2997 | |
| 2 | c002cddb8118d4f5683a16c74b5d6a24 | 20131024170802729 | 2 | DAOH82BPz61 | Mozilla/5.0 (Linux; U; Android 4.1.2; zh-cn; S... | 115.218.67.* | 94 | 97 | | | ... | 50 | FirstView | Na | 6 | 11908 | 277 | 28 | | 2997 | |
| 3 | 8b52831079abbaf317c53cf6eb15d524 | 20131024143700840 | 2 | DAOEZ0AMs9x | Mozilla/5.0 (Linux; U; Android 4.1.2; zh-cn; G... | 113.94.124.* | 216 | 225 | | | ... | 50 | FirstView | Na | 170 | 11908 | 277 | 170 | | 2997 | |
| 4 | 8cb74b8c3242e651e44825d2307de7c6 | 20131024173001601 | 2 | DAOHU19Pxav | Mozilla/5.0 (iPhone; CPU iPhone OS 6_0_2 like ... | 180.106.148.* | 80 | 85 | 1.0 | a5f6c3952b9b24fa5c6cc608aa1625a2 | ... | 90 | Na | Na | 0 | 12626 | 294 | 12 | | 2261 | |
