### explore the datasets and develop an understanding of the challenge
[project url](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings)
* objective: predict in which country a new user will make his or her first booking

In [1]:
import sys
import os
import glob
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

%matplotlib inline
%precision 4

'%.4f'

In [3]:
root_dir = os.getcwd()
data_dir = os.path.join(root_dir, 'data')
output_dir = os.path.join(root_dir, 'output')
images_dir = os.path.join(root_dir, 'images')

In [6]:
!ls ./data

age_gender_bkts.csv  sample_submission.csv  test_users.csv
countries.csv	     sessions.csv	    train_users.csv


In [11]:
!head -n 5 ./data/age_gender_bkts.csv

age_bucket,country_destination,gender,population_in_thousands,year
100+,AU,male,1.0,2015.0
95-99,AU,male,9.0,2015.0
90-94,AU,male,47.0,2015.0
85-89,AU,male,118.0,2015.0


In [14]:
!head -n 5 ./data/sessions.csv

user_id,action,action_type,action_detail,device_type,secs_elapsed
ailzdefy6o,similar_listings,data,similar_listings,Windows Desktop,255.0
ailzdefy6o,similar_listings,data,similar_listings,Windows Desktop,183.0
ailzdefy6o,ajax_refresh_subtotal,click,change_trip_characteristics,Windows Desktop,175570.0
ailzdefy6o,show,,,Windows Desktop,86.0


In [17]:
!head -n 5 ./data/countries.csv

country_destination,lat_destination,lng_destination,distance_km,destination_km2,destination_language ,language_levenshtein_distance
AU,-26.853388,133.27516,15297.744,7741220.0,eng,0.0
CA,62.393303,-96.818146,2828.1333,9984670.0,eng,0.0
DE,51.165707,10.452764,7879.568,357022.0,deu,72.61
ES,39.896027,-2.4876945,7730.724,505370.0,spa,92.25


In [18]:
!head -n 5 ./data/test_users.csv

id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser
qe9gwamyfk,2014-04-01,20140401000102,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Safari
16aco8ay46,2014-04-01,20140401000238,2014-04-10,-unknown-,27.0,basic,0,en,other,other,linked,Web,Windows Desktop,Firefox
e3sr92jphf,2014-04-01,20140401000319,2014-04-01,MALE,22.0,facebook,12,en,api,other,untracked,iOS,iPhone,Mobile Safari
0clg3c9hw9,2014-04-01,20140401000343,,MALE,35.0,basic,0,en,sem-non-brand,google,omg,Web,Mac Desktop,Safari


In [19]:
!head -n 5 ./data/train_users.csv

id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
ccu7c3q7h3,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
0xqosmub05,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
syiid9h31c,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
4uid7lk4z3,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other


### check the relationships between the datasets

In [21]:
from os.path import join
import pandas as pd

train_user = pd.read_csv(join(data_dir, 'train_users.csv'), header=0)
test_user  = pd.read_csv(join(data_dir, 'test_users.csv'), header=0)
session_df = pd.read_csv(join(data_dir, 'sessions.csv'), header=0)

In [31]:
print("train_user's dimensions:{}".format(train_user.shape))
print("test_user's dimensions:{}".format(test_user.shape))
print("session's dimensions:{}".format(session_df.shape))

train_user's dimensions:(171239, 16)
test_user's dimensions:(43673, 15)
session's dimensions:(5600850, 6)


In [24]:
train_user.head()
# Some rows have date_account_created later than date_first_booking.
# How does Airbnb link bookings made before registration to the newly registered account?

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,ccu7c3q7h3,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,0xqosmub05,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,syiid9h31c,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,4uid7lk4z3,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,bibf93h56j,2010-09-14,20091208061105,2010-02-18,-unknown-,40.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US
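
The note in the cell above flags rows where `date_first_booking` falls before `date_account_created` (for example `syiid9h31c`). A minimal sketch of how such rows could be counted, using a small hand-built frame with the same column names instead of the real file; the helper name `booking_before_signup` is hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy rows copied from the head() output above; the real frame would be train_user.
sample = pd.DataFrame({
    "id": ["ccu7c3q7h3", "syiid9h31c", "4uid7lk4z3"],
    "date_account_created": ["2010-06-28", "2010-09-28", "2011-12-05"],
    "date_first_booking": [None, "2010-08-02", "2012-09-08"],
})

def booking_before_signup(df):
    """Return a boolean mask: True where the first booking predates account creation."""
    created = pd.to_datetime(df["date_account_created"])
    booked = pd.to_datetime(df["date_first_booking"])  # missing values become NaT
    # Comparisons against NaT evaluate to False, so users without a booking are excluded.
    return booked < created

mask = booking_before_signup(sample)
print(sample.loc[mask, "id"].tolist())
```

Applied to `train_user`, `mask.sum()` would give the number of affected accounts, which helps judge whether this is a handful of data-entry quirks or a systematic pattern.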


In [56]:
all_sessioned_uid = session_df['user_id'].unique()

sc_train_users = train_user['id'][train_user['id'].isin(all_sessioned_uid)]
sc_test_users  = test_user['id'][test_user['id'].isin(all_sessioned_uid)]

In [57]:
print("session covered users: {}".format(len(all_sessioned_uid)))
# Format the coverage ratio as a percentage so the printed unit matches the value.
print("sc_train_users: {} ({:.2%})".format(len(sc_train_users), len(sc_train_users) / train_user.shape[0]))
print("sc_test_users: {} ({:.2%})".format(len(sc_test_users), len(sc_test_users) / test_user.shape[0]))

session covered users: 74610
sc_train_users: 31202 (18.22%)
sc_test_users: 43408 (99.39%)
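
One possible way to investigate the gap between train and test coverage is to break session coverage down by account-creation year. The sketch below uses invented toy rows, so its numbers say nothing about the real data; `coverage_by_year` and `sessioned_ids` are hypothetical names, the latter standing in for `all_sessioned_uid`:

```python
import pandas as pd

# Invented toy users; the real frames would be train_user / test_user.
users = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "date_account_created": ["2012-03-01", "2013-07-15", "2014-01-02", "2014-05-20"],
})
sessioned_ids = ["c", "d"]  # stand-in for all_sessioned_uid

def coverage_by_year(df, covered_ids):
    """Percentage of users per account-creation year that appear in covered_ids."""
    year = pd.to_datetime(df["date_account_created"]).dt.year
    covered = df["id"].isin(covered_ids)
    # The mean of a boolean Series is the fraction of True values in each group.
    return covered.groupby(year).mean() * 100

coverage = coverage_by_year(users, sessioned_ids)
print(coverage)
```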


#### hierarchical modeling
* modeling without other+session data
* modeling with other+session data
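
A minimal sketch of how the users might be partitioned for the two cases listed above, assuming the split is made purely on whether a user has any rows in the session table. The toy frames and the helper `split_by_session_coverage` are hypothetical; this is one reading of the plan, not the notebook's actual implementation:

```python
import pandas as pd

# Toy stand-ins for train_user and session_df.
users = pd.DataFrame({"id": ["u1", "u2", "u3"], "age": [38.0, None, 42.0]})
sessions = pd.DataFrame({
    "user_id": ["u2", "u2", "u3"],
    "secs_elapsed": [255.0, 183.0, 86.0],
})

def split_by_session_coverage(user_df, session_df):
    """Return (users with session rows, users without session rows)."""
    has_session = user_df["id"].isin(session_df["user_id"].unique())
    return user_df[has_session], user_df[~has_session]

with_sessions, without_sessions = split_by_session_coverage(users, sessions)
print(len(with_sessions), len(without_sessions))
```

Each partition could then be fed to its own model, with session-derived features available only for the first one.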