## The Data Science Process

1. Frame the problem
2. Collect raw data
3. Process the data
4. Explore the data
5. Perform in-depth analysis
6. Communicate results

### 1. Frame the problem

We would like to better understand how users interact with an e-commerce website, in particular which actions drive a user to purchase a product (propensity to purchase). The goal is to react to users' behavior in a timely and more appropriate way, e.g. through personalization or couponing.
We have one year's worth of clickstream data from an e-commerce website, capturing the interaction of users with the online shop and their purchasing behavior. We would like to build and compare different models that predict whether a user purchased something after a given visit or within 7 days of a given visit (an alternative target could have the labels *purchase*, *abandoned cart*, and *browsing-only*).
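
A minimal sketch of how the 7-day target could be derived once the data is at session level. The column names `visitor_id`, `visit_start`, and `purchase` and the toy rows are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

def label_purchase_within(sessions: pd.DataFrame, days: int = 7) -> pd.Series:
    """Return 1 for every session whose visitor purchases in that session or in
    any session starting within `days` days afterwards, else 0."""
    window = pd.Timedelta(days=days)
    labels = pd.Series(0, index=sessions.index, dtype="int64")
    for _, group in sessions.groupby("visitor_id"):
        # start times of this visitor's sessions that contain a purchase
        purchase_times = group.loc[group["purchase"] == 1, "visit_start"]
        for idx, start in group["visit_start"].items():
            in_window = (purchase_times >= start) & (purchase_times <= start + window)
            labels.loc[idx] = int(in_window.any())
    return labels

# hypothetical toy sessions
sessions = pd.DataFrame({
    "visitor_id": ["a", "a", "a", "b"],
    "visit_start": pd.to_datetime(["2016-10-01", "2016-10-05", "2016-10-20", "2016-10-02"]),
    "purchase": [0, 1, 0, 0],
})
sessions["purchase_within_7d"] = label_purchase_within(sessions)
```

The nested loop is easy to read but slow; for the full dataset a vectorized or merge-based variant would probably be needed.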

### 2. Collect raw data

Information on raw data:
- Entire dataset: 1 year's worth of Adobe clickstream data from an e-commerce website (~300-400 GB)
- Sample: 1 day (~1 GB)
- File format: TSV
- Granularity: 1 row = 1 hit (server call)
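
Since the full dataset is unlikely to fit in memory on a single machine, one option is to stream the files in chunks. The sketch below uses the `chunksize` parameter of `pd.read_csv` on a tiny in-memory stand-in; the helper name `count_hits_in_chunks` and the two demo columns are hypothetical.

```python
import csv
import io
import pandas as pd

def count_hits_in_chunks(source, column_names, chunksize=100_000):
    """Count rows (hits) of a headerless TSV without loading it all at once."""
    reader = pd.read_csv(
        source,
        sep="\t",
        names=column_names,
        quoting=csv.QUOTE_NONE,  # same as quoting=3: treat quote characters literally
        chunksize=chunksize,     # yields DataFrames of at most `chunksize` rows
    )
    total = 0
    for chunk in reader:
        total += len(chunk)
    return total

# tiny stand-in for a hit_data file (hypothetical content)
demo = io.StringIO("1\tfoo\n2\tbar\n3\tbaz\n")
n_hits = count_hits_in_chunks(demo, ["hit_id", "pagename"], chunksize=2)
```

The same loop body could apply the cleaning steps per chunk and append the result to disk instead of only counting rows.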

### 3. Process the data

##### Step 1: Examine data at a high-level (data understanding)
- understand all data and lookup files
- understand every column in hit_data.tsv
- identify important and unimportant columns
  - can be dropped: pre columns where a post column exists, NA-only columns, static columns, technical columns such as the JS version
  - important features/columns: referrer_type, device_type, post_event_list, post_product_list, nps, va_closer_id (marketing channels), possibly post_campaign
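
A small sketch of how NA-only and static columns could be found automatically. The helper name `find_uninformative_columns` and the toy values are assumptions; here "static" is taken to mean at most one distinct non-missing value.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(df):
    """Return (na_only, static): columns that are entirely missing, and columns
    with at most one distinct non-missing value."""
    na_only = [c for c in df.columns if df[c].isna().all()]
    static = [
        c for c in df.columns
        if c not in na_only and df[c].nunique(dropna=True) <= 1
    ]
    return na_only, static

# hypothetical toy frame
toy = pd.DataFrame({
    "carrier": [np.nan, np.nan, np.nan],
    "code_ver": ["JS-1.6.3"] * 3,
    "device_type": ["mobile", "desktop", "mobile"],
})
na_only, static = find_uninformative_columns(toy)
```

A column that is static in a one-day sample may still vary over the full year, so the result is better treated as a list of candidates than as a final decision.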

##### Step 2: Clean the data (data cleaning)
- throw away, replace, and/or filter corrupt, error-prone, or missing values and unnecessary columns
- identify errors
- handle missing values
- handle corrupt records
- drop all rows where exclude_hit > 0
- drop all rows where hit_source is 5, 7, 8 or 9
- drop all rows where user is internal (event 30)
- concat post_visid_high and post_visid_low (and visit_start_time_gmt) to get unique visit ids
- map browser, country, va_closer_id (marketing channels) and post_event_list (also os and search engine)
- split post_product_list
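
Two of the steps above, sketched on toy data: building a visit id and splitting the product list. The helper names and toy values are hypothetical. The split assumes the usual Adobe products syntax (products separated by commas, fields by semicolons, with category, product, quantity, and price in the first four fields), which should be checked against the actual feed.

```python
import pandas as pd

def add_visit_id(df):
    """Concatenate visitor id parts and visit start time into one visit id."""
    out = df.copy()
    out["visit_id"] = (
        out["post_visid_high"].astype(str) + "_"
        + out["post_visid_low"].astype(str) + "_"
        + out["visit_start_time_gmt"].astype(str)
    )
    return out

def split_product_list(value):
    """Split a product list string into a list of dicts (empty list if missing)."""
    if not isinstance(value, str) or not value:
        return []
    products = []
    for entry in value.split(","):
        fields = entry.split(";")
        fields += [""] * (4 - len(fields))  # pad entries with fewer than 4 fields
        products.append({
            "category": fields[0],
            "product": fields[1],
            "quantity": fields[2],
            "price": fields[3],
        })
    return products

# hypothetical toy hits; ids kept as strings on purpose
hits = pd.DataFrame({
    "post_visid_high": ["111", "111", "222"],
    "post_visid_low": ["9", "9", "8"],
    "visit_start_time_gmt": [1477260059, 1477260059, 1477260021],
    "post_product_list": ["Toys;Brick Set;1;49.90", None,
                          "Audio;Speaker;2;199.00,Audio;Cable;1;9.90"],
})
hits = add_visit_id(hits)
hits["products"] = hits["post_product_list"].apply(split_product_list)
```

If pandas infers float64 for the id columns, digits beyond float precision are lost, so reading them as strings (for example via the `dtype` argument of `pd.read_csv`) is probably the safer choice before concatenating.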

##### Step 3: Aggregate data (data aggregation)
- hit-level
- visit-level (session-level)
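
Going from hit level to session level could look roughly like the following, using named aggregation in `groupby(...).agg(...)`. The flag columns `is_product_view` and `is_purchase` are hypothetical and would first have to be derived from post_event_list; counting every hit as a page view is a simplification.

```python
import pandas as pd

# hypothetical hit-level toy data
hits = pd.DataFrame({
    "visit_id": ["v1", "v1", "v1", "v2"],
    "hit_time_gmt": [100, 130, 190, 500],
    "is_product_view": [1, 0, 1, 0],
    "is_purchase": [0, 0, 1, 0],
})

sessions = hits.groupby("visit_id").agg(
    session_pageviews=("hit_time_gmt", "size"),                       # hits per visit
    session_timespent=("hit_time_gmt", lambda s: s.max() - s.min()),  # seconds between first and last hit
    session_productViews=("is_product_view", "sum"),
    session_purchase=("is_purchase", "max"),                          # 1 if any hit is a purchase
).reset_index()
```

Time spent computed this way is zero for single-hit visits, because the duration of the last hit is not observable.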

##### Step 4: Get entire dataset and process it

### Important features (aggregated to session level)

- visitorid
- visit_num_period
- visit_num_lifetime
- datetime
- device_type
- channel
- fullcountry
- geo_region
- gender
- yearofbirth
- session_pageviews
- session_timespent
- session_productViews
- session_cartAdditions
- session_cartRemovals
- session_cartCheckouts
- session_purchase
- session_revenue
- session_voucherInPurchase
- purchase_product_categories

### Load data and get a first impression

In [1]:
# import essential libraries

import pandas as pd
import numpy as np

In [2]:
# get column names

column_names = list(pd.read_csv('column_headers.tsv', delimiter='\t'))

In [3]:
# load sample data

df = pd.read_csv('hit_data.tsv.proc.anonymized.gz', compression='gzip', sep='\t', encoding='utf-8', names=column_names, quoting=3, low_memory=False)

In [4]:
# number of rows and columns in the sample

df.shape

(241279, 668)

In [5]:
# basic information on the sample

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241279 entries, 0 to 241278
Columns: 668 entries, accept_language to zip
dtypes: float64(381), int64(40), object(247)
memory usage: 1.2+ GB


Basic information on the sample
- 668 columns in total
- 381 columns with dtype float
- 40 columns with dtype int
- 247 columns with dtype object

In [6]:
# check out the head of the sample and display all columns

pd.set_option('display.max_columns', df.shape[1])
df.head()

(output: table of the first 5 hits, 5 rows × 668 columns, accept_language … zip)


## Data cleaning

In [7]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# drop rows where exclude_hit > 0

df = df.drop(df[df.exclude_hit > 0].index)

# drop rows where hit_source is 5, 7, 8 or 9

df = df.drop(df[df.hit_source.isin([5, 7, 8, 9])].index)

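# drop all non-post columns that have a post_ counterpart

def drop_columns(df):
    """Keep every post_ column plus all columns without a post_ counterpart."""
    
    # columns holding post-processed values, e.g. post_pagename
    post_columns = [x for x in df.columns if x.lower().startswith('post_')]
    # names of the raw columns shadowed by a post_ column, e.g. pagename
    post_columns_stripped = {x[len('post_'):] for x in post_columns}
    not_post_columns = [x for x in df.columns if not x.lower().startswith('post_')]
    columns_without_post_column = [x for x in not_post_columns if x not in post_columns_stripped]
    
    # build a new list instead of extending post_columns in place
    relevant_columns = post_columns + columns_without_post_column
    
    return df.loc[:, relevant_columns]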

df = drop_columns(df)

### Country mapping

In [11]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for country mapping

country_mapping = pd.read_csv('country.tsv', sep='\t', header=None)
country_mapping.columns = ['country_id', 'country_name']

# drop duplicate countries

country_mapping = country_mapping.drop_duplicates('country_name').reset_index(drop=True)

# create dictionary for country mapping

country_mapping_dict = dict(zip(country_mapping.country_id, country_mapping.country_name))

# map countries

df['country'] = df['country'].map(country_mapping_dict).fillna(df['country'])
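The same map-then-fallback pattern recurs for every lookup file in this notebook. As a minimal sketch, assuming a header-less two-column lookup file, it could be wrapped in one helper. The function name `apply_lookup`, the in-memory file, and the sample ids are hypothetical. Reading both sides as strings is an assumption made here so that unmatched ids do not produce a mixed-type column.

```python
import io
import pandas as pd

def apply_lookup(values, lookup_file, sep='\t'):
    """Map ids to names using a two-column, header-less lookup file.

    Ids without a match in the lookup are kept unchanged.
    """
    lookup = pd.read_csv(lookup_file, sep=sep, header=None,
                         names=['id', 'name'], dtype=str)
    mapping = dict(zip(lookup['id'], lookup['name']))
    return values.map(mapping).fillna(values)

# Made-up lookup content and ids, for illustration only
lookup_file = io.StringIO("1\tAlpha\n2\tBeta\n")
ids = pd.Series(['1', '2', '99'])

names = apply_lookup(ids, lookup_file)
print(names.tolist())
```

The id `'99'` has no entry in the lookup and is passed through as-is, which mirrors the `.map(...).fillna(...)` calls used in the cells of this notebook.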

### Browser mapping

In [16]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for browser mapping

browser_mapping = pd.read_csv('browser.tsv', sep='\t', header=None)
browser_mapping.columns = ['browser_id', 'browser_name']

# create dictionary for browser mapping

browser_mapping_dict = dict(zip(browser_mapping.browser_id, browser_mapping.browser_name))

# map browsers

df['browser'] = df['browser'].map(browser_mapping_dict).fillna(df['browser'])

# coerce browser column to dtype string

df['browser'] = df['browser'].astype(str)

# generalize browser

def generalize_browser(row):
    if 'Internet Explorer' in row['browser']:
        return 'Internet Explorer'
    elif 'Chrome' in row['browser']:
        return 'Chrome'
    elif 'Firefox' in row['browser']:
        return 'Firefox'
    elif 'Opera' in row['browser']:
        return 'Opera'
    elif 'Safari' in row['browser']:
        return 'Safari'
    else:
        return 'Other'
    
df['browser'] = df.apply(generalize_browser, axis=1)
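Row-wise `apply` can become slow on a full day of hits. A possible vectorized alternative, sketched here under the assumption that the column has already been coerced to strings, checks each family name with `str.contains`. The helper name `generalize` and the sample browser names are made up for illustration.

```python
import pandas as pd

def generalize(names, families, default='Other'):
    """Return the first family whose name occurs in each value, else the default."""
    result = pd.Series(default, index=names.index)
    # Iterate in reverse so that families listed first win when several match
    for family in reversed(families):
        result = result.mask(names.str.contains(family, regex=False), family)
    return result

# Made-up sample of mapped browser names
browsers = pd.Series(['Google Chrome 53.0', 'Mozilla Firefox 49.0',
                      'Microsoft Internet Explorer 11', 'Safari 10.0',
                      'Opera Mini', 'unknown'])

browser_families = ['Internet Explorer', 'Chrome', 'Firefox', 'Opera', 'Safari']
browser_generalized = generalize(browsers, browser_families)
print(browser_generalized.tolist())
```

The same helper would also cover the operating system and search engine generalizations, since they follow the same first-match-wins logic.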

### Marketing channel mapping

In [22]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# drop rows where va_closer_id is not an id but some string

df = df.drop(df[df['va_closer_id'].map(len) > 2].index)

# coerce va_closer_id to dtype int64

df['va_closer_id'] = df['va_closer_id'].astype(np.int64)

# load file for marketing channels mapping

marketing_channels_mapping = pd.read_csv('siroop_marketingchannels_181009.tsv', sep='\t')

# create dictionary for marketing channel mapping

marketing_channels_mapping_dict = dict(zip(marketing_channels_mapping.channel_id, marketing_channels_mapping.name))

# map marketing channels

df['marketing_channel'] = df['va_closer_id'].map(marketing_channels_mapping_dict).fillna(df['va_closer_id'])

### OS mapping

In [23]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for operating system mapping

operating_system_mapping = pd.read_csv('operating_systems.tsv', sep='\t', header=None)
operating_system_mapping.columns = ['os_id', 'os_name']

# create dictionary for operating system mapping

operating_system_mapping_dict = dict(zip(operating_system_mapping.os_id, operating_system_mapping.os_name))

# map operating systems

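df['operating_system'] = df['os'].map(operating_system_mapping_dict).fillna(df['os'])

# coerce operating_system column to dtype string (unmapped ids would otherwise break the substring checks below)

df['operating_system'] = df['operating_system'].astype(str)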

# generalize operating system

def generalize_operating_system(row):
    if 'Windows' in row['operating_system']:
        return 'Windows'
    elif 'Linux' in row['operating_system']:
        return 'Linux'
    elif 'Android' in row['operating_system']:
        return 'Android'
    elif 'Mobile iOS' in row['operating_system']:
        return 'Apple'
    elif 'Macintosh' in row['operating_system']:
        return 'Apple'
    elif 'OS X' in row['operating_system']:
        return 'Apple'
    else:
        return 'Other'
    
df['operating_system_generalized'] = df.apply(generalize_operating_system, axis=1)

# get device type

def get_device_type(row):
    if 'Windows Phone' in row['operating_system']:
        return 'Mobile'
    elif 'Windows' in row['operating_system']:
        return 'Desktop'
    elif 'Android' in row['operating_system']:
        return 'Mobile'
    elif 'Linux' in row['operating_system']:
        return 'Desktop'
    elif 'Mobile iOS' in row['operating_system']:
        return 'Mobile'
    elif 'Macintosh (iPhone)' in row['operating_system']:
        return 'Mobile'
    elif 'Macintosh' in row['operating_system']:
        return 'Desktop'
    elif 'OS X' in row['operating_system']:
        return 'Desktop'
    elif 'Nokia' in row['operating_system']:
        return 'Mobile'
    else:
        return 'Mobile'
    
df['device_type'] = df.apply(get_device_type, axis=1)

# open questions: combine OS and device type into one feature? how should tablets and other devices be handled (everything unmatched currently falls into 'Mobile')?

### Search engine mapping (post_search_engine vs. visit_search_engine?)

In [24]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for search engine mapping

search_engine_mapping = pd.read_csv('search_engines.tsv', sep='\t', header=None)
search_engine_mapping.columns = ['search_engine_id', 'search_engine_name']

# create dictionary for search engine mapping

search_engine_mapping_dict = dict(zip(search_engine_mapping.search_engine_id, search_engine_mapping.search_engine_name))

# map search engines

df['search_engine'] = df['post_search_engine'].map(search_engine_mapping_dict).fillna(df['post_search_engine'])
#df.rename(columns={'post_search_engine' : 'search_engine'}, inplace=True)

# coerce search_engine column to dtype string

df['search_engine'] = df['search_engine'].astype(str)

# generalize search engine

def generalize_search_engine(row):
    if 'Google' in row['search_engine']:
        return 'Google'
    elif 'Yahoo' in row['search_engine']:
        return 'Yahoo'
    elif 'Bing' in row['search_engine']:
        return 'Bing'
    elif 'Baidu' in row['search_engine']:
        return 'Baidu'
    elif 'DuckDuckGo' in row['search_engine']:
        return 'DuckDuckGo'
    else:
        return 'Other'
    
df['search_engine_generalized'] = df.apply(generalize_search_engine, axis=1)

### Referrer type mapping (ref_type vs. visit_ref_type?)

In [25]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for referrer type mapping

referrer_type_mapping = pd.read_csv('referrer_type.tsv', sep='\t', header=None)
referrer_type_mapping.columns = ['referrer_type_id', 'referrer_type_name', 'referrer_type']

# create dictionary for referrer type mapping

referrer_type_mapping_dict = dict(zip(referrer_type_mapping.referrer_type_id, referrer_type_mapping.referrer_type))

# map referrer types

df['referrer_type'] = df['ref_type'].map(referrer_type_mapping_dict).fillna(df['ref_type'])

### Connection type mapping

In [61]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for connection type mapping

connection_type_mapping = pd.read_csv('connection_type.tsv', sep='\t', header=None)
connection_type_mapping.columns = ['connection_type_id', 'connection_type_name']

# create dictionary for connection type mapping

connection_type_mapping_dict = dict(zip(connection_type_mapping.connection_type_id, connection_type_mapping.connection_type_name))

# map connection types

df['connection_type'] = df['connection_type'].map(connection_type_mapping_dict).fillna(df['connection_type'])

### Standard post event list mapping

In [26]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for event mapping

events = pd.read_csv('event.tsv', sep='\t', header=None)
events.columns = ['event_id', 'event_name']

# make sure there are no whitespaces in post_event_list

df['post_event_list'] = df['post_event_list'].apply(lambda x: x.replace(' ',''))

# create list with standard events

standard_events_list = events.iloc[:8,]

# iterate through standard events and create dummies
# (the list is padded with commas so that events at its start or end are matched as well)

for event_id, event_name in zip(standard_events_list.iloc[:,0], standard_events_list.iloc[:,1]):
    df[str.lower(event_name).replace(' ','_')] = df['post_event_list'].apply(lambda x: 1 if ','+str(event_id)+',' in ','+x+',' else 0)

### Custom post event list mapping

In [27]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for custom event mapping

siroop_events = pd.read_csv('siroop_events_181009.tsv', sep='\t')

# custom events are numbered from 200 upwards in post_event_list (e1 = 200, e2 = 201, ...)

siroop_events['event_id'] = siroop_events.index + 200
events_siroop_events = pd.merge(events, siroop_events, how='inner', on='event_id')
events_siroop_events_list = events_siroop_events[['event_id', 'name']]

# keep only the custom events that are relevant for the analysis

relevant_custom_event_ids = [201, 202, 203, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 219, 222, 223, 225, 226, 234, 
                             235, 236, 237, 247, 248, 249, 251, 270, 271, 272, 273, 274, 279, 280, 281, 282]
relevant_events_siroop_events_list = events_siroop_events_list[events_siroop_events_list['event_id'].isin(relevant_custom_event_ids)]

# iterate through custom events and create dummies
# (the list is padded with commas so that events at its start or end are matched as well)

for event_id, event_name in zip(relevant_events_siroop_events_list.iloc[:,0], relevant_events_siroop_events_list.iloc[:,1]):
    df[str.lower(event_name).replace(' ','_')] = df['post_event_list'].apply(lambda x: 1 if ','+str(event_id)+',' in ','+x+',' else 0)

# add flag for internal users (event 229, i.e. custom event e30)

df['internal_user'] = df['post_event_list'].apply(lambda x: 1 if ',229,' in ','+x+',' else 0)

# drop internal users

df = df.drop(df[df.internal_user == 1].index)
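The dummy columns in this section are derived from substring checks against `post_event_list`. An alternative, sketched below, is to split the list into tokens first, which also handles entries that carry a value such as `256=1.00`. Whether an event with a value should count as "occurred" is an assumption here, and the helper name `parse_event_ids`, the sample lists, and the column `event_256` are hypothetical.

```python
import pandas as pd

def parse_event_ids(event_list):
    """Split a comma-separated event list into a set of integer event ids.

    A trailing '=value' part (e.g. '256=1.00') is dropped; non-string input
    such as NaN yields an empty set.
    """
    if not isinstance(event_list, str):
        return set()
    ids = set()
    for token in event_list.replace(' ', '').split(','):
        event_id = token.split('=')[0]
        if event_id.isdigit():
            ids.add(int(event_id))
    return ids

# Made-up sample lists, shortened for illustration
sample = pd.DataFrame({'post_event_list': ['1,200,201',
                                           '259,257,2,200,201,269,256=1.00',
                                           None]})

event_sets = sample['post_event_list'].apply(parse_event_ids)
sample['purchase'] = event_sets.apply(lambda s: int(1 in s))
sample['product_view'] = event_sets.apply(lambda s: int(2 in s))
sample['event_256'] = event_sets.apply(lambda s: int(256 in s))
print(sample[['purchase', 'product_view', 'event_256']])
```

Parsing once into sets avoids repeating the string search for every event id, and membership tests are not affected by where an id sits in the list.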

### Custom post evar mapping

In [28]:
# reset index to make sure that index values are unique

df = df.reset_index(drop=True)

# load file for custom evar mapping

evars = pd.read_csv('siroop_evars_181009.tsv', sep='\t')

evars_list = evars[['id', 'name']]

# select relevant evars and prefix their ids with post_ to match the column names in df

relevant_evar_ids = [10, 11, 12, 26, 31, 32, 33, 34, 37, 38, 47, 49, 50, 54, 56, 58, 60, 61, 62, 65, 69, 82, 83, 84, 85, 88,
                     96, 97]
relevant_evars = ['evar' + str(evar_id) for evar_id in relevant_evar_ids]
relevant_evars_list = evars[evars['id'].isin(relevant_evars)]
relevant_evars_list = relevant_evars_list[['id', 'name']].copy()
relevant_evars_list['id'] = relevant_evars_list['id'].apply(lambda x: 'post_' + x)
relevant_evars_list = relevant_evars_list.reset_index(drop=True)

# rename relevant evar columns

for i in range(relevant_evars_list.shape[0]):
    df.rename(columns={relevant_evars_list.iloc[i,0] : str.lower(relevant_evars_list.iloc[i,1]).replace(' ','_')}, inplace=True)

### Custom post prop mapping

In [None]:
props = pd.read_csv('siroop_props_181009.tsv', sep='\t') # same variables as in siroop_evars_181009.tsv?

### Select relevant columns

### Important features (aggregated to session level)

- visitorid = post_visid_high + post_visid_low
- visit_num_period = visit_num
- visit_num_lifetime = first_hit_time_gmt, visid_timestamp, visit_num
- datetime = date_time, hit_time_gmt, (post_t_time_info)
- device_type = (post_evar31 - ?), User Agent, mobile_id, os (lookup)
- channel = marketing_channel (lookup)
- fullcountry = country (lookup) MESSY
- geo_region = geo_region MESSY
- gender : post_evar61 (lookup)
- yearofbirth : post_evar62 (lookup) MESSY
- session_pageviews = event 201 (E2) (lookup)
- session_timespent = last_hit_time_gmt, visit_start_time_gmt
- session_productViews : event 2 (Product View) (lookup)
- session_cartAdditions : event 12 (Cart Add) (lookup) empty
- session_cartRemovals : event 13 (Cart Remove) (lookup) empty
- session_cartCheckouts : event 11 (Checkout) (lookup) empty
- session_purchase : event 1 (Purchase), post_evar50, post_prop50 (lookup)
- session_revenue : event 211 (E12) - empty, event 206 (E7), event 207 (E8), event 209 (E10) - empty, event 210 (E11), event 214 (E15) - empty (lookup)
- session_voucherInPurchase : post_evar56 (V56) - empty, event 212 (E13) - empty, event 213 (E14) - empty,  post_prop56 (P56) (lookup)
- purchase_product_categories : post_evar42, post_evar43, post_evar44, event 247 (E48) - empty, event 248 (E49) - empty, event 249 (E50) - empty (lookup)
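
As a rough illustration of how hit-level rows could be rolled up to the session level, the sketch below groups a small made-up sample by visitor and visit number. The `visitor_id` column, the sample values, and the use of the minimum and maximum `hit_time_gmt` for time spent are assumptions for illustration, not necessarily the aggregation used in this notebook.

```python
import pandas as pd

# Made-up hit-level sample; column names follow the notebook, values do not
hits = pd.DataFrame({
    'visitor_id': ['a', 'a', 'a', 'b', 'b'],
    'visit_num': [1, 1, 2, 1, 1],
    'hit_time_gmt': [100, 160, 900, 50, 80],
    'page_view_counter_(e2)': [1, 1, 1, 1, 0],
    'product_view': [0, 1, 1, 0, 0],
    'purchase': [0, 0, 1, 0, 0],
})

# One row per session: sums for counters, max for the purchase flag
sessions = (
    hits.groupby(['visitor_id', 'visit_num'])
        .agg(
            session_pageviews=('page_view_counter_(e2)', 'sum'),
            session_productViews=('product_view', 'sum'),
            session_purchase=('purchase', 'max'),
            session_start=('hit_time_gmt', 'min'),
            session_end=('hit_time_gmt', 'max'),
        )
        .reset_index()
)

# Time spent in seconds between the first and last hit of the session
sessions['session_timespent'] = sessions['session_end'] - sessions['session_start']
print(sessions)
```

Taking the maximum of the purchase dummy yields a session-level flag that could serve as the prediction target described in the problem framing.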

### Other potentially relevant features
- browser
- referrer type
- nps 
- campaign
- operating system
- search engine
- logged in/registered user, login/registration status (success, fail)
- newsletter subscription/subscriber
- payment type
- wishlist view, add, remove, to cart
- time features (day of week, hour of day, days since last purchase, days since last visit, life time order count, life time purchase count, datetime according to time zone)
- geographical features (city, zip code, region/state, country)
- buckets for number of product views, number of sessions, days since last session, days since last purchase etc.
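
The time features listed above could be derived from `hit_time_gmt`, which holds Unix timestamps in seconds. The following is a minimal sketch; the time zone `Europe/Zurich` is an assumption based on the Swiss traffic in the sample, and the new column names are hypothetical.

```python
import pandas as pd

# Two hit timestamps taken from the sample output in this notebook
hits = pd.DataFrame({'hit_time_gmt': [1477260059, 1477260021]})

# Interpret the values as UTC seconds, then convert to local time (assumed zone)
local_time = (
    pd.to_datetime(hits['hit_time_gmt'], unit='s', utc=True)
      .dt.tz_convert('Europe/Zurich')
)

hits['day_of_week'] = local_time.dt.dayofweek  # Monday = 0
hits['hour_of_day'] = local_time.dt.hour
print(hits)
```

Converting to local time before extracting the hour matters here because the raw timestamps are in GMT, while shopping behavior follows the user's local clock.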

### Aggregations

In [62]:
df.head()

Unnamed: 0,post_browser_height,post_browser_width,post_campaign,post_channel,post_cookies,post_currency,post_cust_hit_time_gmt,post_cust_visid,post_ef_id,post_evar1,post_evar2,post_evar3,post_evar4,post_evar5,post_evar6,post_evar7,post_evar8,post_evar9,net_promoter_score_raw_(v10)_-_user,user_purchase_history_(v11),content_language_(v12),post_evar13,post_evar14,post_evar15,post_evar16,post_evar17,post_evar18,post_evar19,post_evar20,post_evar21,post_evar22,post_evar23,post_evar24,post_evar25,referrer_of_current_page_(v26),post_evar27,post_evar28,post_evar29,post_evar30,device_viewport_(v31),weekday_(v32),date_(yyyymmdd)_(v33),registered_user_(user)_(v34),post_evar35,post_evar36,login_status_(hit)_(v37),payment_type_of_completed_order_(v38),post_evar39,post_evar40,post_evar42,post_evar43,post_evar44,post_evar45,post_evar46,prod_price_during_hit_(v47-merch),post_evar48,prod_delivery_(v49-merch),cart_value_(v50),post_evar51,post_evar52,post_evar53,products_in_cart_(v54),post_evar55,prod_voucher_code:_descr_(v56-merch),post_evar57,prod_ad_label_(v58),post_evar59,user_zip_code_(v60),user_gender_(v61),user_age_(v62),post_evar63,post_evar64,mktg_touchpoint_-_last_(v65),post_evar66,post_evar67,post_evar68,campaign_name_(utm_campaign)_(v69),post_evar70,post_evar71,post_evar72,post_evar73,post_evar74,post_evar75,post_evar76,post_evar77,post_evar78,post_evar79,post_evar80,post_evar81,lifetime_devices_used_(v82),lifetime_browsers_used_(v83),user_lifetime_order_count_(v84),raw:_days_since_last_purchase_(v85),post_evar86,post_evar87,utc_timestamp_(v88),post_evar89,post_evar90,post_evar91,post_evar92,post_evar93,post_evar94,post_evar95,session_hit_counter_(v96),session_pageview_counter_(v97),post_evar99,post_evar100,post_event_list,post_hier1,post_hier2,post_hier3,post_hier4,post_hier5,post_java_enabled,post_keywords,post_mc_audiences,post_mobileaction,post_mobileappid,post_mobilecampaigncontent,post_mobilecampaignmedium,post_mobilecampaignname,post_mobilecampaignsource,post_mobile
campaignterm,post_mobiledayofweek,post_mobiledayssincefirstuse,post_mobiledayssincelastuse,post_mobiledevice,post_mobilehourofday,post_mobileinstalldate,post_mobilelaunchnumber,post_mobileltv,post_mobilemessageid,post_mobilemessageonline,post_mobileosversion,post_mobilepushoptin,post_mobilepushpayloadid,post_mobileresolution,post_mvvar1,post_mvvar2,post_mvvar3,post_page_event,post_page_event_var1,post_page_event_var2,post_page_event_var3,post_page_type,post_page_url,post_pagename,post_pagename_no_url,post_partner_plugins,post_persistent_cookie,post_pointofinterest,post_pointofinterestdistance,post_product_list,post_prop1,post_prop2,post_prop3,post_prop4,post_prop5,post_prop6,post_prop7,post_prop8,post_prop9,post_prop10,post_prop11,post_prop12,post_prop13,post_prop14,post_prop15,post_prop16,post_prop17,post_prop18,post_prop19,post_prop20,post_prop21,post_prop22,post_prop23,post_prop24,post_prop25,post_prop26,post_prop27,post_prop28,post_prop29,post_prop30,post_prop31,post_prop32,post_prop33,post_prop34,post_prop35,post_prop36,post_prop37,post_prop38,post_prop39,post_prop40,post_prop41,post_prop42,post_prop43,post_prop44,post_prop45,post_prop46,post_prop47,post_prop48,post_prop49,post_prop50,post_prop51,post_prop52,post_prop53,post_prop54,post_prop55,post_prop56,post_prop57,post_prop58,post_prop59,post_prop60,post_prop61,post_prop62,post_prop63,post_prop64,post_prop65,post_prop66,post_prop67,post_prop68,post_prop69,post_prop70,post_prop71,post_prop72,post_prop73,post_prop74,post_prop75,post_purchaseid,post_referrer,post_s_kwcid,post_search_engine,post_socialaccountandappids,post_socialassettrackingcode,post_socialauthor,post_socialcontentprovider,post_socialfbstories,post_socialfbstorytellers,post_socialinteractioncount,post_socialinteractiontype,post_sociallanguage,post_sociallatlong,post_sociallikeadds,post_socialmentions,post_socialowneddefinitioninsighttype,post_socialowneddefinitioninsightvalue,post_socialowneddefinitionmetric,post_socialowneddefinitionpropertyvs
post,post_socialownedpostids,post_socialownedpropertyid,post_socialownedpropertyname,post_socialownedpropertypropertyvsapp,post_socialpageviews,post_socialpostviews,post_socialpubcomments,post_socialpubposts,post_socialpubrecommends,post_socialpubsubscribers,post_socialterm,post_socialtotalsentiment,post_state,post_survey,post_t_time_info,post_tnt,post_tnt_action,post_transactionid,post_video,post_videoad,post_videoadinpod,post_videoadplayername,post_videoadpod,post_videochannel,post_videochapter,post_videocontenttype,post_videoplayername,post_videoqoebitrateaverageevar,post_videoqoebitratechangecountevar,post_videoqoebuffercountevar,post_videoqoebuffertimeevar,post_videoqoedroppedframecountevar,post_videoqoeerrorcountevar,post_videoqoetimetostartevar,post_videosegment,post_visid_high,post_visid_low,post_visid_type,post_zip,accept_language,browser,c_color,carrier,click_action,click_action_type,click_context,click_context_type,click_sourceid,click_tag,code_ver,color,connection_type,country,ct_connect_type,curr_factor,curr_rate,daily_visitor,date_time,domain,duplicate_events,duplicate_purchase,duplicated_from,exclude_hit,first_hit_page_url,first_hit_pagename,first_hit_referrer,first_hit_time_gmt,geo_city,geo_country,geo_dma,geo_region,geo_zip,hit_source,hit_time_gmt,hitid_high,hitid_low,homepage,hourly_visitor,ip,ip2,j_jscript,javascript,language,last_hit_time_gmt,last_purchase_num,last_purchase_time_gmt,mcvisid,mobile_id,monthly_visitor,namespace,new_visit,os,p_plugins,paid_search,plugins,prev_page,product_merchandising,quarterly_visitor,ref_domain,ref_type,resolution,s_resolution,sampled_hit,search_page_num,secondary_hit,service,sourceid,stats_server,tnt_post_vista,truncated_hit,ua_color,ua_os,ua_pixels,user_agent,user_hash,user_server,userid,username,va_closer_detail,va_closer_id,va_finder_detail,va_finder_id,va_instance_event,va_new_engagement,visid_new,visid_timestamp,visit_keywords,visit_num,visit_page_num,visit_referrer,visit_search_engine,visit_start_page_url,
visit_start_pagename,visit_start_time_gmt,weekly_visitor,yearly_visitor,marketing_channel,operating_system,operating_system_generalized,device_type,search_engine,search_engine_generalized,referrer_type,purchase,product_view,cart_open,checkout,cart_add,cart_remove,cart_view,campaign_view,page_view_counter_(e2),checkout_step:_payment_(e3),checkout_step:_overview_(e4),fast_checkout_(e6),order_commission_(e7),product_commission_(e8),repeat_orders_(e9),order_tax_(e10),order_shipping_(e11),product_revenue_with_discount_(e12),order_voucher_discount_(e13),product_voucher_discount_(e14),order_discounted_tax_(e15),registration_(any_form)_(e20),hit_of_logged_in_user_(e23),hit_of_registered_user_(e24)_-_notyet_active,newsletter_signup_(any_form)_(e26),newsletter_subscriber_(e27),impressions_(e35),clicks_(e36),ad_cost_(e37),view-through_conversions_(e38),unique_lvl1_categories_(e48),unique_lvl2_categories_(e49),unique_lvl1-3_categories_(e50),unique_products_(e52),visit_during_tv_spot_(e71),login_success_(e72),logout_success_(e73),login_fail_(e74),registration_fail_(e75),wishlist_view_(e80),wishlist_add_(e81),wishlist_remove_(e82),cart_add_from_wishlist_(e83),internal_user
0,0,0,programmatic:cpc:dbm_de_pro_pe_topseller_sta-deal,Product,Y,CHF,1477260059,,,,,,,,,,,,,has_not_bought,,,,,,LEGO Technic Schaufelradbagger (42055),Product,ecommerce-shop.de/baby-spielzeug/spielzeug/bau...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,ecommerce-shop.de,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,utm_source=programmatic&utm_medium=cpc&utm_cam...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,,,,P:/baby-spielzeug/spielzeug/bausteine-lego/leg...,P:/baby-spielzeug/spielzeug/bausteine-lego/leg...,ecommerce-shoplive,prod,2.0,20161024,,580d33198015c2.77500686,,n,,,,,,,,,,,,,,,,,,,,,,,,,,,dbm_pro,dbm_pro,dbm_de_pro_pe_topseller_sta-deal,dbm_de_pro_pe_topseller_sta-deal,,programmatic:cpc:dbm_de_pro_pe_topseller_sta-deal,,sta-deal,,,,,,,,,,,,,,,,,,240001.0,,,,,Mozilla/5.0 (Linux; Android 6.0.1; SM-G930F Bu...,,,y,,"259,257,2,200,201,269,256=1.00,258=1.00,255=1....",,,,,,U,,,,,,,,,,,,,,,,,,,,,,,,,,,70.0,,,,,https://ecommerce-shop.de/baby-spielzeug/spiel...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,,Y,,,Baby & Spielzeug/Spielzeug/Bausteine & LEGO;21...,,,,,,,,,,,has_not_bought,,,,,,LEGO Technic Schaufelradbagger (42055),Product,ecommerce-shop.de/baby-spielzeug/spielzeug/bau...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,ecommerce-shop.de,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,utm_source=programmatic&utm_medium=cpc&utm_cam...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,,,,P:/baby-spielzeug/spielzeug/bausteine-lego/leg...,P:/baby-spielzeug/spielzeug/bausteine-lego/leg...,ecommerce-shoplive,prod,2,20161024,,580d33198015c2.77500686,,n,,,,,,,,,,,,,,,,,,pdp_recomm,,,,,,,,,,dbm_pro,dbm_pro,dbm_de_pro_pe_topseller_sta-deal,dbm_de_pro_pe_topseller_sta-deal,,programmatic:cpc:dbm_de_pro_pe_topseller_sta-deal,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,24/9/2016 0:1:1 1 
-120,70051:0:0,"70051:0:0|0,70051:0:0|2,70051:0:0|1",,,,,,,,,,,,,,,,,,,3.728291e+18,4.642425e+16,5,::hash::0,"fr-CH,en-US;q=0.8",Chrome,32.0,swisscom.ch:swisscom schweiz ag,,0,,0,0,,JS-1.6.3,0,Mobile Carrier,Switzerland,,2,1.0,1,2016-10-24 00:00:59,swisscom.ch,,0,,0,https://ecommerce-shop.de/baby-spielzeug/spiel...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,,1477260059,heimberg,che,0,be,3627,1,1477260059,3172391910827393024,184263871271240045,U,1,,,1.6,0,54,0,0,0,75059068711088727562364561556761025758,11873536,1,,1,3874308541,,0,,0.0,,1.0,,6.0,0.0,360x640,Y,0,0,ss,0.0,www30.ams1.omniture.com,,N,,,,Mozilla/5.0 (Linux; Android 6.0.1; SM-G930F Bu...,1841007000.0,,100030943,ecommerce-shoplive,dbm_de_pro_pe_topseller_sta-deal,13,dbm_de_pro_pe_topseller_sta-deal,13,1,1.0,N,0,,1,1,,0,https://ecommerce-shop.de/baby-spielzeug/spiel...,/baby-spielzeug/spielzeug/bausteine-lego/lego-...,1477260059,1,1,DBM Prospecting,Android 6.0.1,Android,Mobile,0.0,Other,bookmarked,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,paid search - adwords:ci:670707020,Search Results,Y,CHF,1477260021,,,,,,,,,,,,,has_not_bought,,,,,,Suchergebnisse,Search Results,ecommerce-shop.de/search?brand_name=mykronoz,/search?brand_name=MyKronoz&network=g&campaign...,ecommerce-shop.de,/search,brand_name=MyKronoz&network=g&campaignid=67070...,/searchbrand_name=mykronoz,,,brand_name=mykronoz,P:/search?,P:/search,ecommerce-shoplive,prod,2.0,20161024,,580d32f00b12b8.44726186,,n,,,,,,,,,,,,,,,,,,,,,,,,,,,srch_p_oth,srch_p_oth,srch_p_oth:670707020,srch_p_oth:670707020,,paid search - adwords:ci:670707020,network=g&campaignid=670707020&adgroupid=34625...,,,,,,,,,,,,,,,,,,,240000.0,,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_2 like...,,,y,,"200,201,269,20,110,116,119,120,121,122,126,130...",,,,,,U,,,,,,,,,,,,,,,,,,,,,,,,,,,70.0,,,,,https://ecommerce-shop.de/search,/search?brand_name=mykronoz,/search?brand_name=mykronoz,,Y,,,;;;;;139=::hash::0|141=::hash::0|142=::hash::0...,,,,,,,,,,,has_not_bought,,,,,,Suchergebnisse,Search Results,ecommerce-shop.de/search?brand_name=mykronoz,/search?brand_name=MyKronoz&network=g&campaign...,ecommerce-shop.de,/search,brand_name=MyKronoz&network=g&campaignid=67070...,/searchbrand_name=mykronoz,,,brand_name=mykronoz,P:/search?,P:/search,ecommerce-shoplive,prod,2,20161024,,580d32f00b12b8.44726186,,n,,,,,,,,,,,,,,,,,,Search Results,,,,,,,,,,srch_p_oth,srch_p_oth,srch_p_oth:670707020,srch_p_oth:670707020,,paid search - adwords:ci:670707020,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,24/9/2016 0:0:28 1 -120,70051:1:0,"70051:1:0|0,70051:1:0|2,70051:1:0|1",,,,,,,,,,,,,,,,,,,4.895265e+18,6.469507e+18,5,::hash::0,de-ch,Safari,32.0,swisscom.ch:swisscom schweiz ag,,0,,0,0,,JS-1.6.3,0,Mobile Carrier,Switzerland,,2,1.0,1,2016-10-24 
00:00:21,swisscom.ch,,0,,0,https://ecommerce-shop.de/search?brand_name=My...,/search?brand_name=mykronoz,,1477260021,zurich,che,0,zh,8075,1,1477260021,3172391756208570368,471368334630643351,U,1,,,1.6,0,61,0,0,0,01343120194577155900070852375174225289,205202,1,,1,828421107,,0,,0.0,,1.0,,6.0,0.0,320x568,Y,0,0,ss,0.0,www6.ams1.omniture.com,,N,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_2 like...,1841007000.0,,100030943,ecommerce-shoplive,paid search - adwords:ci:670707020,10,paid search - adwords:ci:670707020,10,1,1.0,N,0,,1,1,,0,https://ecommerce-shop.de/search?brand_name=My...,/search?brand_name=mykronoz,1477260021,1,1,Paid Search Other,Mobile iOS 10.0.2,Apple,Mobile,0.0,Other,bookmarked,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,paid search - adwords:ci:393800826,Product,Y,CHF,1477260065,,,,,,,,,,,,,has_not_bought,,,,,,UE Megaboom Bluetooth Speaker Rot,Product,ecommerce-shop.de/computer-elektronik/audio/po...,/computer-elektronik/audio/portable-lautsprech...,ecommerce-shop.de,/computer-elektronik/audio/portable-lautsprech...,network=g&campaignid=393800826&adgroupid=26304...,/computer-elektronik/audio/portable-lautsprech...,,https://www.google.ch/search?q=ue+megaboom&cli...,,P:/computer-elektronik/audio/portable-lautspre...,P:/computer-elektronik/audio/portable-lautspre...,ecommerce-shoplive,prod,2.0,20161024,,580d3320193f63.14173598,,n,,,,,,,,,,,,,,,,,,,,,,,,,,,srch_p_gsh,srch_p_gsh,srch_p_gsh:393800826,srch_p_gsh:393800826,,paid search - adwords:ci:393800826,network=g&campaignid=393800826&adgroupid=26304...,,,,,,,,,,,,,,,,,,,240001.0,,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like...,,,y,,"259,257,2,200,201,269,256=1.00,258=1.00,255=1....",,,,,,U,ue megaboom,,,,,,,,,,,,,,,,,,,,,,,,,,70.0,,,,,https://ecommerce-shop.de/computer-elektronik/...,/computer-elektronik/audio/portable-lautsprech...,/computer-elektronik/audio/portable-lautsprech...,,Y,,,Computer & Elektronik/Audio/Portable Lautsprec...,,,,,,,,,,,has_not_bought,,,,,,UE Megaboom Bluetooth Speaker Rot,Product,ecommerce-shop.de/computer-elektronik/audio/po...,/computer-elektronik/audio/portable-lautsprech...,ecommerce-shop.de,/computer-elektronik/audio/portable-lautsprech...,network=g&campaignid=393800826&adgroupid=26304...,/computer-elektronik/audio/portable-lautsprech...,,https://www.google.ch/search?q=ue+megaboom&cli...,,P:/computer-elektronik/audio/portable-lautspre...,P:/computer-elektronik/audio/portable-lautspre...,ecommerce-shoplive,prod,2,20161024,,580d3320193f63.14173598,,n,,,,,,,,,,,,,,,,,,pdp_recomm,,,,,,,,,,srch_p_gsh,srch_p_gsh,srch_p_gsh:393800826,srch_p_gsh:393800826,,paid search - adwords:ci:393800826,,,,,,,https://www.google.ch/search?q=ue+megaboom&cli...,,229.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,24/9/2016 0:1:6 1 
-120,70051:0:0,"70051:0:0|0,70051:0:0|2,70051:0:0|1",,,,,,,,,,,,,,,,,,,8.875371e+18,1.528051e+18,5,::hash::0,de-ch,Safari,32.0,swisscom.ch:swisscom schweiz ag,,0,,0,0,,JS-1.6.3,0,Mobile Carrier,Switzerland,,2,1.0,1,2016-10-24 00:01:05,swisscom.ch,,0,,0,https://ecommerce-shop.de/computer-elektronik/...,/computer-elektronik/audio/portable-lautsprech...,https://www.google.ch/search?q=ue+megaboom&cli...,1477260065,aarburg,che,0,ag,4663,1,1477260065,3172391835665465344,239010760146072976,U,1,,,1.6,0,61,0,0,0,85461445190431956123742270092956469695,205202,1,,1,1448537301,,1,,0.0,,1.0,google.ch,3.0,0.0,375x667,Y,1,0,ss,0.0,www41.ams1.omniture.com,,N,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like...,1841007000.0,,100030943,ecommerce-shoplive,paid search - adwords:ci:393800826,1,paid search - adwords:ci:393800826,1,1,1.0,N,0,ue megaboom,1,1,https://www.google.ch/search?q=ue+megaboom&cli...,229,https://ecommerce-shop.de/computer-elektronik/...,/computer-elektronik/audio/portable-lautsprech...,1477260065,1,1,Paid Search G. Shopping,Mobile iOS 10.0.1,Apple,Mobile,Google - Switzerland,Google,search_engine,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,facebook.com:cpc:fb_de_pro_pe_sta_wohnen-haush...,Themeworld,Y,CHF,1477260015,,,,,,,,,,,,,has_not_bought,,,,,,Stehlampen fürs passende herbstliche Ambiente,Themeworld,ecommerce-shop.de/inspiration/licht-im-herbst-...,/inspiration/licht-im-herbst-stehlampen-fuers-...,ecommerce-shop.de,/inspiration/licht-im-herbst-stehlampen-fuers-...,utm_medium=cpc&utm_source=facebook.com&utm_cam...,/inspiration/licht-im-herbst-stehlampen-fuers-...,,,,P:/inspiration/licht-im-herbst-stehlampen-fuer...,P:/inspiration/licht-im-herbst-stehlampen-fuer...,ecommerce-shoplive,prod,2.0,20161024,,578c4b4c626cf5.88180099,,n,,,,,,,,,,,,,,,,,,,,,,,,,,,soc_pro,soc_pro<soc_pro,fb_de_pro_pe_sta_wohnen-haushalt_lights,fb_de_pro_pe_sta_wohnen-haushalt_lights<fb_de_...,,facebook.com:cpc:fb_de_pro_pe_sta_wohnen-haush...,,herbstliches-heim-lights_2-2,57ead549b9449f24718b4593,,,,,,,,,,,,,,,,,240000.0,,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_2 like...,,,y,,"200,201,269,20,110,116,119,120,121,122,130,131...",,,,,,U,,,,,,,,,,,,,,,,,,,,,,,,,,,70.0,,,,,https://ecommerce-shop.de/inspiration/licht-im...,/inspiration/licht-im-herbst-stehlampen-fuers-...,/inspiration/licht-im-herbst-stehlampen-fuers-...,,Y,,,;;;;;139=::hash::0|141=::hash::0|142=::hash::0...,,,,,,,,,,,has_not_bought,,,,,,Stehlampen fürs passende herbstliche Ambiente,Themeworld,ecommerce-shop.de/inspiration/licht-im-herbst-...,/inspiration/licht-im-herbst-stehlampen-fuers-...,ecommerce-shop.de,/inspiration/licht-im-herbst-stehlampen-fuers-...,utm_medium=cpc&utm_source=facebook.com&utm_cam...,/inspiration/licht-im-herbst-stehlampen-fuers-...,,,,P:/inspiration/licht-im-herbst-stehlampen-fuer...,P:/inspiration/licht-im-herbst-stehlampen-fuer...,ecommerce-shoplive,prod,2,20161024,,578c4b4c626cf5.88180099,,n,,,,,,,,,,,,,,,,,,TW - Stehlampen fürs passende herbstliche 
Ambi...,,,,,,,,,,soc_pro,soc_pro<soc_pro,fb_de_pro_pe_sta_wohnen-haushalt_lights,fb_de_pro_pe_sta_wohnen-haushalt_lights<fb_de_...,,facebook.com:cpc:fb_de_pro_pe_sta_wohnen-haush...,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,24/9/2016 0:0:16 1 -120,70051:1:0,"70051:1:0|0,70051:1:0|2,70051:1:0|1",,,,,,,,,,,,,,,,,,,7.862428e+18,6.669572e+18,5,::hash::0,de-ch,Safari,32.0,,,0,,0,0,,JS-1.6.3,0,LAN/Wifi,Switzerland,,2,1.0,1,2016-10-24 00:00:15,adslplus.ch,,0,,0,https://ecommerce-shop.de/inspiration/licht-im...,/inspiration/licht-im-herbst-stehlampen-fuers-...,,1477260015,basel,che,0,bs,4053,1,1477260015,3172391730438799360,1150952337800982458,U,1,,,1.6,0,61,0,0,0,02693539464992941202959228461067897809,205202,1,,1,828421107,,0,,0.0,,1.0,,6.0,0.0,375x667,Y,0,0,ss,0.0,www109.lon5.omniture.com,,N,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_2 like...,1841007000.0,,100030943,ecommerce-shoplive,fb_de_pro_pe_sta_wohnen-haushalt_lights,12,fb_de_pro_pe_sta_wohnen-haushalt_lights,12,1,1.0,N,0,,1,1,,0,https://ecommerce-shop.de/inspiration/licht-im...,/inspiration/licht-im-herbst-stehlampen-fuers-...,1477260015,1,1,Social Media Prospecting,Mobile iOS 10.0.2,Apple,Mobile,0.0,Other,bookmarked,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,950,1920,paid search - adwords:ci:420952937,Product,Y,CHF,1477260028,,,,,,,,,,,,,has_not_bought,,,,,,Feller EDIZIOdue Steckdose 1xRJ45 ungeschirmt ...,Product,ecommerce-shop.de/baumarkt-garten/bauen-renovi...,/baumarkt-garten/bauen-renovieren/elektromater...,ecommerce-shop.de,/baumarkt-garten/bauen-renovieren/elektromater...,network=g&campaignid=420952937&adgroupid=27933...,/baumarkt-garten/bauen-renovieren/elektromater...,,https://www.google.ch/,,P:/baumarkt-garten/bauen-renovieren/elektromat...,P:/baumarkt-garten/bauen-renovieren/elektromat...,ecommerce-shoplive,prod,2.0,20161024,,580d32fae29bb6.17871311,,n,,,,,,,,,,,,,,,,,,,,,,,,,,,srch_p_gsh,srch_p_gsh,srch_p_gsh:420952937,srch_p_gsh:420952937,,paid search - adwords:ci:420952937,network=g&campaignid=420952937&adgroupid=27933...,,,,,,,,,,,,,,,,,,,240000.0,,,,,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,,,y,,"259,257,2,200,201,269,256=1.00,258=1.00,255=1....",,,,,,N,::empty::,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,https://ecommerce-shop.de/baumarkt-garten/baue...,/baumarkt-garten/bauen-renovieren/elektromater...,/baumarkt-garten/bauen-renovieren/elektromater...,,Y,,,Baumarkt & Garten/Bauen & Renovieren/Elektroma...,,,,,,,,,,,has_not_bought,,,,,,Feller EDIZIOdue Steckdose 1xRJ45 ungeschirmt ...,Product,ecommerce-shop.de/baumarkt-garten/bauen-renovi...,/baumarkt-garten/bauen-renovieren/elektromater...,ecommerce-shop.de,/baumarkt-garten/bauen-renovieren/elektromater...,network=g&campaignid=420952937&adgroupid=27933...,/baumarkt-garten/bauen-renovieren/elektromater...,,https://www.google.ch/,,P:/baumarkt-garten/bauen-renovieren/elektromat...,P:/baumarkt-garten/bauen-renovieren/elektromat...,ecommerce-shoplive,prod,2,20161024,,580d32fae29bb6.17871311,,n,,,,,,,,,,,,,,,,,,pdp_recomm,,,,,,,,,,srch_p_gsh,srch_p_gsh,srch_p_gsh:420952937,srch_p_gsh:420952937,,paid search - adwords:ci:420952937,,,,,,,https://www.google.ch/,,229.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,24/9/2016 0:0:15 1 
-120,,,,,,,,,,,,,,,,,,,,,3.861464e+18,7.478641e+18,5,::hash::0,"de-DE,de;q=0.8,en-US",Chrome,24.0,,,0,,0,0,,JS-1.6.3,2,LAN/Wifi,Switzerland,,2,1.0,1,2016-10-24 00:00:28,breitband.ch,,0,,0,https://ecommerce-shop.de/baumarkt-garten/baue...,/baumarkt-garten/bauen-renovieren/elektromater...,https://www.google.ch/,1477260028,binningen,che,0,bl,4102,1,1477260028,3172391754061119488,5218195508153067873,U,1,,,1.6,7,60,0,0,0,76591058396162060604068554955823483130,0,1,,1,1240087047,,1,,0.0,,1.0,google.ch,3.0,186.0,1920x1080,Y,1,0,ss,0.0,www287.lon5.omniture.com,,N,,,,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,1841007000.0,,100030943,ecommerce-shoplive,paid search - adwords:ci:420952937,1,paid search - adwords:ci:420952937,1,1,1.0,N,0,::empty::,1,1,https://www.google.ch/,229,https://ecommerce-shop.de/baumarkt-garten/baue...,/baumarkt-garten/bauen-renovieren/elektromater...,1477260028,1,1,Paid Search G. Shopping,Windows 10,Windows,Desktop,Google - Switzerland,Google,search_engine,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
df.shape

(239679, 435)

In [113]:
df['repeat_orders_(e9)'].dtype

dtype('int64')

In [124]:
df['page_view_counter_(e2)'].unique()

array([1, 0], dtype=int64)

In [115]:
df['repeat_orders_(e9)'].isnull().sum()

0

In [140]:
# coerce visit_page_num to dtype int64

df['visit_page_num'] = df['visit_page_num'].astype(np.int64)

In [141]:
# replace cart value with 0 where nan

df['cart_value_(v50)'] = df['cart_value_(v50)'].fillna(0)

In [221]:
# select columns for aggregation

agg_cols = ['post_visid_high', 'post_visid_low', 'visit_num', 'visit_page_num', 'hit_time_gmt', 'last_hit_time_gmt', 
            'last_purchase_time_gmt', 'purchase', 'product_view', 'campaign_view', 'cart_value_(v50)', 'page_view_counter_(e2)']

agg_df = df.loc[:, df.columns.isin(agg_cols)].copy()

In [222]:
# combine post_visid_high and post_visid_low into one visitor id (note: this adds the two values, it does not concatenate them)

agg_df['post_visid_high_low'] = agg_df['post_visid_high'] + agg_df['post_visid_low']
agg_df.drop(['post_visid_high', 'post_visid_low'], axis=1, inplace=True)

In [223]:
# number of unique visitors

agg_df['post_visid_high_low'].nunique()

54271
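
Adding the two ID halves can map different visitors to the same value, and float64 cannot represent every 64-bit ID exactly. A minimal sketch of an alternative, assuming the two columns can be read as integers or strings; the toy values and the names `hits` and `visitor_id` are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy hit-level frame; the ID values are made up for illustration.
hits = pd.DataFrame({
    'post_visid_high': [3172391730438799360, 3172391730438799361],
    'post_visid_low':  [1150952337800982458, 1150952337800982457],
})

# Adding the halves gives both (different) visitors the same value ...
summed = hits['post_visid_high'] + hits['post_visid_low']
print(summed.nunique())      # one distinct value for two visitors

# ... whereas joining them as strings keeps every (high, low) pair distinct.
visitor_id = (hits['post_visid_high'].astype(str)
              + '_'
              + hits['post_visid_low'].astype(str))
print(visitor_id.nunique())  # two distinct values
```

A string key is larger in memory than a number, but it can be used in `groupby` and `merge` the same way.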

In [224]:
# import modules to work with dates and times

from datetime import datetime, date

In [225]:
# convert columns from Unix format to datetime format

agg_df['hit_time_gmt'] = pd.to_datetime(agg_df['hit_time_gmt'], unit='s')
agg_df['last_hit_time_gmt'] = pd.to_datetime(agg_df['last_hit_time_gmt'], unit='s')
agg_df['last_purchase_time_gmt'] = pd.to_datetime(agg_df['last_purchase_time_gmt'], unit='s')
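
A Unix timestamp of 0 converts to 1970-01-01, which is easy to mistake for a real date. If 0 stands for "no value" in these columns (an assumption about this dataset, not something confirmed here), it may be safer to turn those entries into `NaT`. A small sketch on toy data:

```python
import pandas as pd

# Toy timestamps in seconds since the epoch; 0 is assumed to mean "missing".
ts = pd.Series([1477260015, 0])

# Plain conversion: the 0 becomes 1970-01-01 00:00:00.
converted = pd.to_datetime(ts, unit='s')

# Mask the zero entries so they show up as NaT instead of an epoch date.
converted_masked = converted.mask(ts == 0)

print(converted_masked)
```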

In [226]:
# first aggregation

agg1 = agg_df.groupby(by = ['post_visid_high_low', 'visit_num'], as_index=False).agg({'visit_page_num' : 'max',
                                                                                      'hit_time_gmt': ['min', 'max'],
                                                                                      'last_hit_time_gmt': ['min', 'max'],
                                                                                      'last_purchase_time_gmt': ['min', 'max'],
                                                                                      'purchase' : 'sum',
                                                                                      'product_view' : 'sum',
                                                                                      'campaign_view' : 'sum', 
                                                                                      'cart_value_(v50)' : 'sum', 
                                                                                      'page_view_counter_(e2)': 'sum'})
agg1 = agg1.reset_index(drop=True)
agg1.columns = list(agg1.columns)
agg1.columns = ['post_visid_high_low', 'visit_num', 'visit_page_num', 'hit_time_gmt_min', 'hit_time_gmt_max', 
                'last_hit_time_gmt_min', 'last_hit_time_gmt_max', 'last_purchase_time_gmt_min', 'last_purchase_time_gmt_max',
                'purchase', 'product_view', 'campaign_view', 'cart_value', 'page_view']
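
The manual renaming above depends on the column order that `agg` produces. As a sketch of an alternative, pandas named aggregation lets each output column be named where it is defined. The frame `hits` and the column `visitor_id` below are toy stand-ins, not the notebook's data:

```python
import pandas as pd

# Toy hit-level data: visitor 'a' has two hits in visit 1, visitor 'b' has one.
hits = pd.DataFrame({
    'visitor_id': ['a', 'a', 'b'],
    'visit_num': [1, 1, 1],
    'visit_page_num': [1, 2, 1],
    'hit_time_gmt': pd.to_datetime(['2016-10-24 05:58:21',
                                    '2016-10-24 05:58:24',
                                    '2016-10-24 12:06:56']),
    'purchase': [0, 1, 0],
})

# One row per visit; each keyword is output_name=(input_column, function).
visits = hits.groupby(['visitor_id', 'visit_num'], as_index=False).agg(
    visit_page_num=('visit_page_num', 'max'),
    hit_time_gmt_min=('hit_time_gmt', 'min'),
    hit_time_gmt_max=('hit_time_gmt', 'max'),
    purchase=('purchase', 'sum'),
)

print(visits)
```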

In [227]:
agg1.head()

Unnamed: 0,post_visid_high_low,visit_num,visit_page_num,hit_time_gmt_min,hit_time_gmt_max,last_hit_time_gmt_min,last_hit_time_gmt_max,last_purchase_time_gmt_min,last_purchase_time_gmt_max,purchase,product_view,campaign_view,cart_value,page_view
0,2191213000.0,2,1,2016-10-24 12:06:56,2016-10-24 12:06:56,2016-10-18 09:48:58,2016-10-18 09:48:58,1970-01-01,1970-01-01,0,0,1,0.0,1
1,2576949000.0,1,2,2016-10-24 05:58:21,2016-10-24 05:58:24,1970-01-01 00:00:00,2016-10-24 05:58:21,1970-01-01,1970-01-01,0,0,1,0.0,1
2,3141938000.0,1,1,2016-10-24 15:30:24,2016-10-24 15:30:24,1970-01-01 00:00:00,1970-01-01 00:00:00,1970-01-01,1970-01-01,0,1,1,0.0,1
3,3872539000.0,1,1,2016-10-24 08:15:25,2016-10-24 08:15:25,1970-01-01 00:00:00,1970-01-01 00:00:00,1970-01-01,1970-01-01,0,0,1,0.0,1
4,4087063000.0,8,2,2016-10-24 10:39:26,2016-10-24 10:39:40,2016-10-22 12:30:25,2016-10-24 10:39:26,1970-01-01,1970-01-01,0,2,2,0.0,2


In [228]:
agg1.dtypes

post_visid_high_low                  float64
visit_num                              int64
visit_page_num                         int64
hit_time_gmt_min              datetime64[ns]
hit_time_gmt_max              datetime64[ns]
last_hit_time_gmt_min         datetime64[ns]
last_hit_time_gmt_max         datetime64[ns]
last_purchase_time_gmt_min    datetime64[ns]
last_purchase_time_gmt_max    datetime64[ns]
purchase                               int64
product_view                           int64
campaign_view                          int64
cart_value                           float64
page_view                              int64
dtype: object

In [229]:
agg1.isnull().sum()

post_visid_high_low           0
visit_num                     0
visit_page_num                0
hit_time_gmt_min              0
hit_time_gmt_max              0
last_hit_time_gmt_min         0
last_hit_time_gmt_max         0
last_purchase_time_gmt_min    0
last_purchase_time_gmt_max    0
purchase                      0
product_view                  0
campaign_view                 0
cart_value                    0
page_view                     0
dtype: int64

In [230]:
# sort dataframe by post_visid_high_low, visit_num, and hit_time_gmt_min

agg1 = agg1.sort_values(['post_visid_high_low', 'visit_num', 'hit_time_gmt_min'], ascending=[True, True, True])

In [231]:
# add bounce features

agg1['bounce'] = agg1['visit_page_num'].apply(lambda x: 1 if x == 1 else 0)

In [232]:
# get categorical features

categorical_cols = ['post_visid_high', 'post_visid_low', 'visit_num', 'browser', 'operating_system_generalized', 'device_type', 
                    'country', 'marketing_channel', 'search_engine_generalized', 'referrer_type', 'geo_city', 'geo_region',
                   'geo_zip', 'user_purchase_history_(v11)', 'registered_user_(user)_(v34)', 'login_status_(hit)_(v37)', 
                   'user_gender_(v61)', 'user_age_(v62)', 'net_promoter_score_raw_(v10)_-_user', 'connection_type', 'mobile_id',
                   'new_visit', 'hourly_visitor', 'daily_visitor', 'weekly_visitor', 'monthly_visitor', 'quarterly_visitor',
                   'yearly_visitor', 'hit_of_logged_in_user_(e23)', 'visit_during_tv_spot_(e71)', 'repeat_orders_(e9)']

categorical_df = df.loc[:, df.columns.isin(categorical_cols)].copy()
categorical_df['post_visid_high_low'] = categorical_df['post_visid_high'] + categorical_df['post_visid_low']
categorical_df.drop(['post_visid_high', 'post_visid_low'], axis=1, inplace=True)

In [233]:
categorical_df.drop(['net_promoter_score_raw_(v10)_-_user', 'user_gender_(v61)', 'user_age_(v62)'], axis=1, inplace=True)

In [234]:
#categorical_df['has_bought_before'] = categorical_df['user_purchase_history_(v11)'].apply(lambda x: 1 if x == 'has_bought'
#                                                                                          else (0 if x == 'has_not_bought'
#                                                                                               else np.nan))
categorical_df['has_bought_before'] = categorical_df['user_purchase_history_(v11)'].apply(lambda x: 1 if x == 'has_bought' else 0)
categorical_df.drop('user_purchase_history_(v11)', axis=1, inplace=True)

In [235]:
#categorical_df['logged_in'] = categorical_df['login_status_(hit)_(v37)'].apply(lambda x: 1 if x == 'y'
#                                                                                          else (0 if x == 'n'
#                                                                                               else np.nan))
categorical_df['logged_in'] = categorical_df['login_status_(hit)_(v37)'].apply(lambda x: 1 if x == 'y' else 0)
categorical_df.drop('login_status_(hit)_(v37)', axis=1, inplace=True)

In [236]:
categorical_df['registered_user'] = categorical_df['registered_user_(user)_(v34)'].apply(lambda x: 1 if x == 'y' else 0)
categorical_df.drop('registered_user_(user)_(v34)', axis=1, inplace=True)

In [237]:
categorical_df['is_mobile'] = categorical_df['mobile_id'].apply(lambda x: 1 if x != 0 else 0)
categorical_df.drop('mobile_id', axis=1, inplace=True)

In [238]:
categorical_df['quarterly_visitor'] = categorical_df['quarterly_visitor'].astype(np.int64)
categorical_df['weekly_visitor'] = categorical_df['weekly_visitor'].astype(np.int64)

In [239]:
categorical_df['geo_city'] = categorical_df['geo_city'].apply(lambda x: np.nan if x == '?' else x)
categorical_df['geo_region'] = categorical_df['geo_region'].apply(lambda x: np.nan if x == '?' else x)

In [240]:
categorical_df = categorical_df.dropna()

In [241]:
categorical_df.head()

Unnamed: 0,browser,connection_type,country,daily_visitor,geo_city,geo_region,geo_zip,hourly_visitor,monthly_visitor,new_visit,quarterly_visitor,visit_num,weekly_visitor,yearly_visitor,marketing_channel,operating_system_generalized,device_type,search_engine_generalized,referrer_type,repeat_orders_(e9),hit_of_logged_in_user_(e23),visit_during_tv_spot_(e71),post_visid_high_low,has_bought_before,logged_in,registered_user,is_mobile
0,Chrome,Mobile Carrier,Switzerland,1,heimberg,be,3627,1,1,1,1,1,1,1,DBM Prospecting,Android,Mobile,Other,bookmarked,0,0,0,3.774716e+18,0,0,0,1
1,Safari,Mobile Carrier,Switzerland,1,zurich,zh,8075,1,1,1,1,1,1,1,Paid Search Other,Apple,Mobile,Other,bookmarked,0,0,0,1.136477e+19,0,0,0,1
2,Safari,Mobile Carrier,Switzerland,1,aarburg,ag,4663,1,1,1,1,1,1,1,Paid Search G. Shopping,Apple,Mobile,Google,search_engine,0,0,0,1.040342e+19,0,0,0,1
3,Safari,LAN/Wifi,Switzerland,1,basel,bs,4053,1,1,1,1,1,1,1,Social Media Prospecting,Apple,Mobile,Other,bookmarked,0,0,0,1.4532e+19,0,0,0,1
4,Chrome,LAN/Wifi,Switzerland,1,binningen,bl,4102,1,1,1,1,1,1,1,Paid Search G. Shopping,Windows,Desktop,Google,search_engine,0,0,0,1.13401e+19,0,0,0,0


In [242]:
agg1_cat_df = agg1.merge(categorical_df, on=['post_visid_high_low', 'visit_num'], how='left')

In [243]:
agg1_cat_df.head()

Unnamed: 0,post_visid_high_low,visit_num,visit_page_num,hit_time_gmt_min,hit_time_gmt_max,last_hit_time_gmt_min,last_hit_time_gmt_max,last_purchase_time_gmt_min,last_purchase_time_gmt_max,purchase,product_view,campaign_view,cart_value,page_view,bounce,browser,connection_type,country,daily_visitor,geo_city,geo_region,geo_zip,hourly_visitor,monthly_visitor,new_visit,quarterly_visitor,weekly_visitor,yearly_visitor,marketing_channel,operating_system_generalized,device_type,search_engine_generalized,referrer_type,repeat_orders_(e9),hit_of_logged_in_user_(e23),visit_during_tv_spot_(e71),has_bought_before,logged_in,registered_user,is_mobile
0,2191213000.0,2,1,2016-10-24 12:06:56,2016-10-24 12:06:56,2016-10-18 09:48:58,2016-10-18 09:48:58,1970-01-01,1970-01-01,0,0,1,0.0,1,1,Safari,LAN/Wifi,Switzerland,1.0,thayngen,sh,8240,1.0,0.0,1.0,0.0,1.0,0.0,Paid Search Other,Apple,Mobile,Other,bookmarked,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2576949000.0,1,2,2016-10-24 05:58:21,2016-10-24 05:58:24,1970-01-01 00:00:00,2016-10-24 05:58:21,1970-01-01,1970-01-01,0,0,1,0.0,1,0,Other,LAN/Wifi,Switzerland,1.0,grenchen,so,2540,1.0,1.0,1.0,1.0,1.0,1.0,Paid Search Other,Windows,Mobile,Other,bookmarked,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,2576949000.0,1,2,2016-10-24 05:58:21,2016-10-24 05:58:24,1970-01-01 00:00:00,2016-10-24 05:58:21,1970-01-01,1970-01-01,0,0,1,0.0,1,0,Other,LAN/Wifi,Switzerland,0.0,grenchen,so,2540,0.0,0.0,0.0,0.0,0.0,0.0,Paid Search Other,Windows,Mobile,Other,bookmarked,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,3141938000.0,1,1,2016-10-24 15:30:24,2016-10-24 15:30:24,1970-01-01 00:00:00,1970-01-01 00:00:00,1970-01-01,1970-01-01,0,1,1,0.0,1,1,Safari,LAN/Wifi,Switzerland,1.0,zug,zg,6300,1.0,1.0,1.0,1.0,1.0,1.0,Display Prospecting,Apple,Mobile,Other,bookmarked,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,3872539000.0,1,1,2016-10-24 08:15:25,2016-10-24 08:15:25,1970-01-01 00:00:00,1970-01-01 00:00:00,1970-01-01,1970-01-01,0,0,1,0.0,1,1,Chrome,Mobile Carrier,Switzerland,1.0,olten,so,4600,1.0,1.0,1.0,1.0,1.0,1.0,Paid Search Other,Android,Mobile,Google,search_engine,0.0,0.0,0.0,0.0,0.0,0.0,1.0
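
Because `categorical_df` still holds one row per hit, the left merge repeats a visit once for every matching hit (visitor 2576949000.0 appears twice above). A sketch of one way to avoid this, assuming the first non-null value per visit is an acceptable representative; the frames and the `visitor_id` column are toy stand-ins:

```python
import pandas as pd

# Visit-level table (one row per visit).
visits = pd.DataFrame({'visitor_id': ['a', 'b'],
                       'visit_num': [1, 1],
                       'purchase': [0, 1]})

# Hit-level attributes: visitor 'a' has two hits in the same visit.
hit_attrs = pd.DataFrame({'visitor_id': ['a', 'a', 'b'],
                          'visit_num': [1, 1, 1],
                          'browser': ['Safari', 'Safari', 'Chrome'],
                          'daily_visitor': [1, 0, 1]})

keys = ['visitor_id', 'visit_num']

# Naive merge: the visit of 'a' is duplicated.
naive = visits.merge(hit_attrs, on=keys, how='left')

# Reduce to one row per visit first, then merge; validate raises if
# either side still has duplicate keys.
visit_attrs = hit_attrs.groupby(keys, as_index=False).first()
merged = visits.merge(visit_attrs, on=keys, how='left', validate='one_to_one')

print(len(naive), len(merged))
```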


#### Time features

In [157]:
# calculate visit duration in seconds

# total_seconds() also counts full days, unlike the .seconds component
agg1['visit_duration_seconds'] = (agg1['hit_time_gmt_max'] - agg1['hit_time_gmt_min']).dt.total_seconds()

# create lag columns for post_visid_high_low and hit_time_gmt_max

agg1['post_visid_high_low_lag'] = agg1['post_visid_high_low'].shift(1)
agg1['hit_time_gmt_max_lag'] = agg1['hit_time_gmt_max'].shift(1)

# calculate days since last visit

# only keep the difference where the previous row belongs to the same visitor (NaN otherwise)
same_visitor = agg1['post_visid_high_low'] == agg1['post_visid_high_low_lag']
agg1['days_since_last_visit'] = (agg1['hit_time_gmt_min'] - agg1['hit_time_gmt_max_lag']).where(same_visitor).dt.days

# calculate days since last purchase

# note: a last_purchase_time_gmt of 0 was converted to 1970-01-01 above, which yields very large values here
agg1['days_since_last_purchase'] = (agg1['hit_time_gmt_min'] - agg1['last_purchase_time_gmt_max']).dt.days
# has bought before
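
The time features can also be derived without row-wise `apply`. The sketch below runs on toy data with a hypothetical `visitor_id` column: `dt.total_seconds()` gives the full duration (the `.seconds` component would drop whole days), and a per-visitor `shift` yields the end of the previous visit.

```python
import pandas as pd

# Toy visit-level data, sorted by visitor and visit number.
visits = pd.DataFrame({
    'visitor_id': ['a', 'a', 'b'],
    'visit_num': [1, 2, 1],
    'hit_time_gmt_min': pd.to_datetime(['2016-10-20 10:00:00',
                                        '2016-10-24 09:00:00',
                                        '2016-10-24 23:59:50']),
    'hit_time_gmt_max': pd.to_datetime(['2016-10-20 10:05:00',
                                        '2016-10-24 09:00:00',
                                        '2016-10-26 00:00:00']),
}).sort_values(['visitor_id', 'visit_num'])

# Full visit duration in seconds, including any whole days.
duration = visits['hit_time_gmt_max'] - visits['hit_time_gmt_min']
visits['visit_duration_seconds'] = duration.dt.total_seconds()

# End of the previous visit of the same visitor (NaT for a first visit).
prev_end = visits.groupby('visitor_id')['hit_time_gmt_max'].shift(1)
visits['days_since_last_visit'] = (visits['hit_time_gmt_min'] - prev_end).dt.days

print(visits[['visitor_id', 'visit_duration_seconds', 'days_since_last_visit']])
```

Grouping before shifting means no lag column for the visitor ID is needed, since a shift never crosses from one visitor to the next.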

In [None]:
target = agg1_cat_df['purchase']

In [None]:
feature_cols = ['visit_num', 'visit_page_num', 'product_view', 'campaign_view', 'visit_duration_seconds', 'bounce', 'browser', 
                'country', 'geo_city', 'geo_region', 'geo_zip', 'marketing_channel', 'operating_system_generalized', 
               'device_type', 'search_engine_generalized', 'referrer_type', 'repeat_buyer']

features = agg1_cat_df.loc[:, agg1_cat_df.columns.isin(feature_cols)].copy()

# convert
# drop na
# encoding
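
The remaining steps (drop missing values, encode categoricals) are only outlined above. One common option, shown here as a sketch on toy data and not necessarily the encoding intended for this project, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy feature frame with one numeric and two categorical columns.
features = pd.DataFrame({
    'visit_page_num': [1, 3, 2],
    'device_type': ['Mobile', 'Desktop', None],
    'browser': ['Safari', 'Chrome', 'Chrome'],
})

# Drop rows with missing values, then one-hot encode the categoricals.
features = features.dropna()
encoded = pd.get_dummies(features, columns=['device_type', 'browser'])

print(encoded.columns.tolist())
```

High-cardinality columns such as `geo_city` or `geo_zip` would produce very many dummy columns, so they may need grouping or a different encoding.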

## Inspect missing values

In [None]:
# get na only and static columns

na_only_columns = []
static_columns = []

for col in column_headers:
    n_unique_values = df[str(col)].nunique()

    if n_unique_values == 0:
        na_only_columns.append(col)
    elif n_unique_values == 1:
        static_columns.append(col)
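
The same classification can be sketched without an explicit loop, because `DataFrame.nunique()` counts distinct non-null values for every column at once. The toy frame below is illustrative only:

```python
import numpy as np
import pandas as pd

# Toy frame: one all-NaN column, one static column, one varying column.
df = pd.DataFrame({
    'all_nan': [np.nan, np.nan],
    'static': ['x', 'x'],
    'varying': [1, 2],
})

# Distinct non-null values per column (NaN is ignored by default).
n_unique = df.nunique()

na_only_columns = n_unique[n_unique == 0].index.tolist()
static_columns = n_unique[n_unique == 1].index.tolist()

print(na_only_columns, static_columns)
```

Note that a column holding a single value plus some NaNs also counts as static here, since NaN is not counted as a value.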

In [None]:
social_columns = [x for x in df.columns if x.lower().startswith('social')]

In [None]:
# columns that are not used anymore according to the Adobe documentation

columns_no_longer_used = ['click_action', # no longer used
                         'click_action_type',
                         'click_context',
                         'click_context_type',
                         'click_sourceid',
                         'click_tag',
                         'homepage',
                         'p_plugins',
                         'page_event_var3',
                         'plugins',
                         'sampled_hit',
                         'tnt_post_vista',
                         'ua_color',
                         'ua_os',
                         'ua_pixels',
                         'ip2', # not used
                         'namespace',
                         'partner_plugins',
                         'prev_page',
                         'product_merchandising',
                         'service',
                         'sourceid',
                         'stats_server', # not of use
                         'user_hash',
                         'userid']
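
A list like this can be passed directly to `DataFrame.drop`. The sketch below uses a toy frame and a shortened list; `errors='ignore'` is used on the assumption that some listed columns may not be present in a given extract:

```python
import pandas as pd

# Toy frame containing two of the deprecated columns.
df = pd.DataFrame({'browser': ['Chrome'], 'click_action': [0], 'ua_os': [None]})

# Shortened version of the list above, for illustration.
columns_no_longer_used = ['click_action', 'ua_os', 'userid']

# errors='ignore' skips names that are not in the frame ('userid' here).
reduced = df.drop(columns=columns_no_longer_used, errors='ignore')

print(reduced.columns.tolist())
```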

In [None]:
# columns that are neither NaN-only, nor static, nor unnecessary ids

columns_to_keep = ['accept_language',
                   'browser', # lookup
                   'browser_height',
                   'browser_width',
                   'c_color',
                   'campaign',
                   'carrier', # lookup
                   'channel',
                   'code_ver',
                   'color', # lookup
                   'connection_type',
                   'cookies',
                   'country', # lookup
                   'daily_visitor',
                   'date_time',
                   'domain',
                   # check out evar columns
                   'event_list', # lookup
                   'exclude_hit',
                   'first_hit_page_url',
                   'first_hit_pagename',
                   'first_hit_ref_domain',
                   'first_hit_ref_type', # lookup
                   'first_hit_referrer',
                   'first_hit_time_gmt',
                   'geo_city',
                   'geo_country',
                   'geo_dma',
                   'geo_region',
                   'geo_zip',
                   'hit_source',
                   'hit_time_gmt',
                   'hitid_high',
                   'hitid_low',
                   'hourly_visitor',
                   'ip',
                   'j_script',
                   'java_enabled',
                   'javascript', # lookup
                   'language', # lookup
                   'last_hit_time_gmt',
                   'last_purchase_num',
                   'last_purchase_time_gmt',
                   # check out mobile columns
                   'monthly_visitor',
                   'new_visit',
                   'os', # lookup
                   'page_event', # lookup
                   'page_url',
                   'page_name',
                   'paid_search',
                   'persistent_cookie',
                   'pointofinterest',
                   'pointofinterestdistance',
                   'product_list',
                   # check out prop columns
                   'purchaseid',
                   'quarterly_visitor',
                   'ref_domain',
                   'ref_type',
                   'referrer',
                   'resolution', # lookup
                   's_resolution', 
                   'search_engine', # lookup
                   'search_page_num',
                   'secondary_hit',
                   'state',
                   't_time_info',
                   'tnt',
                   'tnt_action',
                   'transactionid',
                   'truncated_hit',
                   'user_agent',
                   'user_server',
                   'username',
                   'va_closer_detail',
                   'va_closer_id',
                   'va_finder_detail',
                   'va_finder_id',
                   'va_instance_event',
                   'va_new_engagement',
                   # check out video columns
                   'visid_high',
                   'visid_low',
                   'visid_new',
                   'visid_timestamp',
                   'visid_type',
                   'visit_keywords',
                   'visit_num',
                   'visit_page_num',
                   'visit_ref_domain',
                   'visit_ref_type', # lookup
                   'visit_referrer',
                   'visit_search_engine', # lookup
                   'visit_start_page_url',
                   'visit_start_pagename',
                   'visit_start_time_gmt',
                   'weekly_visitor',
                   'yearly_visitor',
                   'zip']
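
Before selecting with a hand-written list of this length, it can help to check the names against the actual header, since a misspelled name would otherwise raise a `KeyError` or be silently skipped. A sketch on a toy frame with a shortened list:

```python
import pandas as pd

# Toy frame standing in for the hit data.
df = pd.DataFrame({'browser': [1], 'country': [2], 'click_action': [0]})

# Shortened keep-list; 'zip' is deliberately absent from the toy frame.
columns_to_keep = ['browser', 'country', 'zip']

# Split the list into names that exist in the frame and names that do not.
present = [c for c in columns_to_keep if c in df.columns]
missing = [c for c in columns_to_keep if c not in df.columns]

reduced = df[present]

print('missing:', missing)
```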