# Airbnb New User Bookings
*Where will a new guest book their first travel experience?*

* [Kaggle Page](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings)

**Outline**

* [Read Data](#read)
* [Exploratory Data Analysis](#eda)

In [1]:
%load_ext watermark

In [67]:
%matplotlib inline

import os
import pandas as pd
import numpy as np
import seaborn as sns
import math

In [3]:
%watermark -a 'PredictiveII' -d -t -v -p pandas,numpy,sklearn,watermark

PredictiveII 2018-02-18 12:58:34 

CPython 3.6.3
IPython 6.1.0

pandas 0.20.3
numpy 1.13.3
sklearn 0.19.1
watermark 1.6.0


---

## <a id="read">Read Data</a>

In [44]:
def data_reader():
    """
    Read the competition CSV files from ../data into DataFrames.
    """
    data_dir = os.path.join('..', 'data')

    # build the path to each file
    session_path = os.path.join(data_dir, 'sessions.csv')
    train_path = os.path.join(data_dir, 'train_users_2.csv')
    test_path = os.path.join(data_dir, 'test_users.csv')
    age_gender_bkt_path = os.path.join(data_dir, 'age_gender_bkts.csv')
    country_path = os.path.join(data_dir, 'countries.csv')
    sample_submission_path = os.path.join(data_dir, 'sample_submission_NDF.csv')

    # load each file with pandas defaults (no date parsing)
    session = pd.read_csv(session_path)
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    age_gender = pd.read_csv(age_gender_bkt_path)
    country = pd.read_csv(country_path)
    sample_submission = pd.read_csv(sample_submission_path)

    return session, train, test, age_gender, country, sample_submission

In [45]:
session, train, test, age_gender, country, sample_submission = data_reader()

---

## <a id="eda">Exploratory Data Analysis</a>

In [23]:
train.head()

| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |
| 3 | bjjt8pjhuk | 2011-12-05 | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Firefox | other |
| 4 | 87mebub9p4 | 2010-09-14 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | US |


> **What is the distribution of the outcome variable?**

In [24]:
# train.country_destination.value_counts()/train.shape[0]
# sns.countplot(x="country_destination", 
#               data=train[train.country_destination!='NDF'],
#              order = train['country_destination'].value_counts().index)

<img src="pic/Country.png" style="width: 600px;height: 450px;"/>

> **What is the distribution of the outcome variable when NDF is excluded?**

In [25]:
# train_noNDF = train[train.country_destination!='NDF']
# train_noNDF.country_destination.value_counts()/train_noNDF.shape[0]
# sns.countplot(x="country_destination", 
#               data=train_noNDF,
#              order = train_noNDF['country_destination'].value_counts().index)

<img src="pic/CountryFilter.png" style="width: 600px;height: 450px;"/>
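The two distributions plotted above can also be read as proportions. The following is a minimal sketch, not the notebook's original cell: `destination_shares` is a hypothetical helper, and `toy` is a made-up stand-in for `train` so the snippet runs without the competition files.

```python
import pandas as pd

def destination_shares(users, drop_ndf=False):
    """Return the share of each country_destination, largest first."""
    dest = users["country_destination"]
    if drop_ndf:
        # keep only users who actually booked somewhere
        dest = dest[dest != "NDF"]
    return dest.value_counts(normalize=True)

# Toy stand-in for `train`; the values are illustrative only.
toy = pd.DataFrame(
    {"country_destination": ["NDF", "NDF", "US", "US", "US", "FR"]}
)

shares_all = destination_shares(toy)                    # NDF included
shares_booked = destination_shares(toy, drop_ndf=True)  # NDF excluded
print(shares_all)
print(shares_booked)
```

On the real data the same calls would be `destination_shares(train)` and `destination_shares(train, drop_ndf=True)`.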

> **For users whose date_first_booking is NaN, does that mean their country_destination is NDF?**

**Answer**: Yes. Every user whose `date_first_booking` is `NaN` has destination NDF, and every user with a non-null `date_first_booking` booked a real destination; about 70% of those bookings (62,376 of 88,908) are in the US. However, `date_first_booking` is `NaN` for every user in the test data, so this column cannot be used for prediction.

In [26]:
train[train.date_first_booking.isnull()]['country_destination'].value_counts()

NDF    124543
Name: country_destination, dtype: int64

In [27]:
train[~train.date_first_booking.isnull()]['country_destination'].value_counts()

US       62376
other    10094
FR        5023
IT        2835
GB        2324
ES        2249
CA        1428
DE        1061
NL         762
AU         539
PT         217
Name: country_destination, dtype: int64
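Both halves of the answer above (a missing booking date coincides with NDF, and the column is empty in the test set) can be checked directly. This is a minimal sketch: the helper names `booking_date_matches_ndf` and `share_missing` are hypothetical, and the toy frames stand in for `train` and `test`.

```python
import pandas as pd

def booking_date_matches_ndf(users):
    """True when date_first_booking is missing exactly for the NDF users."""
    no_booking = users["date_first_booking"].isnull()
    is_ndf = users["country_destination"] == "NDF"
    return bool((no_booking == is_ndf).all())

def share_missing(frame, column="date_first_booking"):
    """Fraction of rows where `column` is missing."""
    return float(frame[column].isnull().mean())

# Toy stand-ins for `train` and `test`; values are illustrative only.
toy_train = pd.DataFrame({
    "date_first_booking": [None, "2010-08-02", None, "2012-09-08"],
    "country_destination": ["NDF", "US", "NDF", "other"],
})
toy_test = pd.DataFrame({"date_first_booking": [None, None, None]})

print(booking_date_matches_ndf(toy_train))
print(share_missing(toy_test))
```

A result of `1.0` from `share_missing(test)` on the real test set would confirm that the column carries no information at prediction time.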

> **Does the session data contain only users who have made a booking?**

**Answer**: No. Only 73,815 of the 213,451 training users can be matched to session data, and about 45,000 of those matched users (45,041) have NDF as their destination.

In [28]:
train[train.id.isin(session.id.unique())].country_destination.value_counts()

NDF      45041
US       20095
other     3655
FR        1435
IT         979
GB         731
ES         707
CA         440
DE         250
NL         247
AU         152
PT          83
Name: country_destination, dtype: int64

In [29]:
len(train[train.id.isin(session.id.unique())].id.unique())

73815
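The match counts above can be summarised as rates in one step. The sketch below assumes, as the cells above do, that users are keyed by an `id` column; `session_coverage` is a hypothetical helper and the toy inputs are illustrative only.

```python
import pandas as pd

def session_coverage(users, session_ids):
    """Summarise how many users appear in the session log and how many of those are NDF."""
    matched = users[users["id"].isin(set(session_ids))]
    n_matched = int(matched["id"].nunique())
    return {
        "matched_users": n_matched,
        "match_rate": n_matched / users["id"].nunique(),
        "ndf_share_of_matched": float(
            (matched["country_destination"] == "NDF").mean()
        ),
    }

# Toy stand-ins for `train` and the session user ids.
toy_users = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "country_destination": ["NDF", "US", "NDF", "FR"],
})
toy_session_ids = ["a", "a", "b", "z"]  # repeated ids and an unknown id are fine

coverage = session_coverage(toy_users, toy_session_ids)
print(coverage)
```

With the counts reported above, the match rate is 73815 / 213451, roughly 35%, and the NDF share among matched users is 45041 / 73815, roughly 61%.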

> **Does the affiliate channel affect the destination? Do people who sign up through different channels tend to choose different destinations?**

<img src="pic/AffiliateChannel_Destination.png" style="width: 600px;height: 450px;"/>

> **What is the effect of each feature on the destination?**

**Some Takeaways**

* Browser language seems to be a good predictor. Users whose browser language is `fr` tend to book in France, and a similar pattern holds for `es`, `de`, etc.
* The effect of the other features in the training data is hard to see from the following plots.
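The language pattern noted in the takeaways can be put in numbers with a row-normalised cross-tabulation. This is a sketch under assumptions: `destination_mix_by` is a hypothetical helper, it drops NDF by default so the rows describe booked destinations only, and the toy frame is illustrative rather than real data.

```python
import pandas as pd

def destination_mix_by(users, feature, drop_ndf=True):
    """For each value of `feature`, return the share of users per destination."""
    if drop_ndf:
        users = users[users["country_destination"] != "NDF"]
    # normalize="index" makes every row sum to 1
    return pd.crosstab(
        users[feature], users["country_destination"], normalize="index"
    )

# Toy stand-in for `train`; values are illustrative only.
toy = pd.DataFrame({
    "language": ["en", "en", "en", "fr", "fr", "es"],
    "country_destination": ["US", "US", "NDF", "FR", "US", "ES"],
})

mix = destination_mix_by(toy, "language")
print(mix)
```

The same helper works for any categorical column shown in `train.head()`, for example `destination_mix_by(train, "affiliate_channel")`.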

<img src="pic/AffiliateProvider_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/FirstAffiliateTracked_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/FirstBrowser_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/FirstDeviceType_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/Gender_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/Language_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/SignupApp_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/SignupFlow_Destination.png" style="width: 600px;height: 450px;"/>

<img src="pic/SignupMethod_Destination.png" style="width: 600px;height: 450px;"/>

> **What are the account creation and booking trends for the top 5 destinations?**

* The last month of account creation is June 2014, and the monthly count of first bookings drops sharply after that month. This suggests that most people make their first booking shortly after creating their account.

<img src="pic/AccountCreateBookingts.png" style="width: 600px;height: 450px;"/>
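The idea that most first bookings follow shortly after account creation can be checked by measuring the gap in days between the two dates. The sketch below is one possible way to do it: `booking_lag_days` is a hypothetical helper, and the toy frame stands in for `train`.

```python
import pandas as pd

def booking_lag_days(users):
    """Days from account creation to first booking, for users who booked."""
    created = pd.to_datetime(users["date_account_created"])
    booked = pd.to_datetime(users["date_first_booking"])
    # users without a booking give NaT here and are dropped;
    # a negative lag means the booking date precedes the account date
    return (booked - created).dt.days.dropna()

# Toy stand-in for `train`; values are illustrative only.
toy = pd.DataFrame({
    "date_account_created": ["2014-01-01", "2014-02-01", "2014-03-01"],
    "date_first_booking": ["2014-01-03", None, "2014-03-31"],
})

lags = booking_lag_days(toy)
print(lags.median())
```

On the real data, `booking_lag_days(train).describe()` would show how concentrated the lags are near zero. Note that the sample rows from `train.head()` include bookings dated before `date_account_created`, so negative values should be expected.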