## Capstone Project: Analyzing Airbnb New User Bookings
What will a new Airbnb user's first booking destination be?

## I. Overview

* ### Source: 
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data
* All the users in the data set are from the USA.
* The data was provided in the form of multiple data sets by Airbnb itself as a challenge on Kaggle.
* I will use only the training file (loaded below as `train_data`) and perform my own train_test_split.

This project will have two parts.
* ### Part I: Binary Classification Model
Will a new Airbnb user end up booking a destination? True or False. 
For this, we will create a new feature called 'effective_booking'.

* ### Part II: Multi-Class Classification 
What will a new Airbnb user's first booking destination be? There are 12 possible outcomes for the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. 
Please note that 'NDF' is different from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there wasn't a booking.

## II. Business Problem

1. Predict whether a new Airbnb user will effectively book a destination or not.
2. Predict which country a new Airbnb user's first booking destination will be.

* We will create a feature, 'effective_booking' (True or False), and build a binary classification model to predict whether a user ends up booking.
* What defines whether a user ends up booking, at a granular or overall level?
* Only 42% of users end up booking.
* We will then build a classifier to predict where the users who book are going.

* 128070 observations (users) in the training split (60% of the 213451 users in `train_data`).
* 74878 of them (58%) are NDF (no destination found).
* Number of actual bookings: 53192 (42%).
* 'US' represents domestic travel, which is 70% of all bookings in our data set.

In [46]:
X_train1.country_destination.value_counts()

NDF      74878
US       37333
other     6096
FR        3030
IT        1673
ES        1353
GB        1325
CA         862
DE         626
NL         469
AU         308
PT         117
Name: country_destination, dtype: int64
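
The percentages quoted in the bullets above can be recomputed from these class counts. The snippet below is a small arithmetic check, not part of the original notebook; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "NDF": 74878, "US": 37333, "other": 6096, "FR": 3030, "IT": 1673,
    "ES": 1353, "GB": 1325, "CA": 862, "DE": 626, "NL": 469,
    "AU": 308, "PT": 117,
}

total_users = sum(counts.values())        # all users in the training split
bookings = total_users - counts["NDF"]    # users with any destination

ndf_share = counts["NDF"] / total_users   # share of users who never booked
booking_share = bookings / total_users    # share of users who booked
us_share = counts["US"] / bookings        # share of bookings that are domestic

print(f"users: {total_users}, bookings: {bookings}")
print(f"NDF: {ndf_share:.1%}, booked: {booking_share:.1%}, "
      f"US among bookings: {us_share:.1%}")
```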

## III. Obtaining and Scrubbing Data

In [3]:
import pandas as pd
import numpy as np

In [4]:
age_bkts = pd.read_csv('../data/age_gender_bkts.csv.zip')
countries = pd.read_csv('../data/countries.csv.zip')
sessions = pd.read_csv('../data/sessions.csv.zip')
train_data = pd.read_csv('../data/train_users_2.csv.zip')

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       213451 non-null  object 
 1   date_account_created     213451 non-null  object 
 2   timestamp_first_active   213451 non-null  int64  
 3   date_first_booking       88908 non-null   object 
 4   gender                   213451 non-null  object 
 5   age                      125461 non-null  float64
 6   signup_method            213451 non-null  object 
 7   signup_flow              213451 non-null  int64  
 8   language                 213451 non-null  object 
 9   affiliate_channel        213451 non-null  object 
 10  affiliate_provider       213451 non-null  object 
 11  first_affiliate_tracked  207386 non-null  object 
 12  signup_app               213451 non-null  object 
 13  first_device_type        213451 non-null  object 
 14  firs

In [6]:
train_data.head()

| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |
| 3 | bjjt8pjhuk | 2011-12-05 | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Firefox | other |
| 4 | 87mebub9p4 | 2010-09-14 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | US |


Let's look into our features one by one, to better understand what the data is telling us and to manually select the ones that could be most useful.
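
Before going feature by feature, it helps to quantify the gaps already visible in the `info()` output above. The snippet below is a small arithmetic check using the non-null counts printed there; it is not part of the original notebook, and on the real frame `train_data.isna().sum()` would give the same missing counts.

```python
# Row count and non-null counts copied from the train_data.info() output above.
n_rows = 213451
non_null = {
    "date_first_booking": 88908,
    "age": 125461,
    "first_affiliate_tracked": 207386,
}

# Missing values per column = total rows minus non-null entries.
missing = {col: n_rows - n for col, n in non_null.items()}

for col, n_missing in missing.items():
    print(f"{col}: {n_missing} missing ({n_missing / n_rows:.1%})")
```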

* ### gender

In [7]:
train_data.gender.value_counts()

-unknown-    95688
FEMALE       63041
MALE         54440
OTHER          282
Name: gender, dtype: int64

In [8]:
train_data['gender'] = train_data['gender'].str.replace('-unknown-', 'UNKNOWN')
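
The `-unknown-` placeholder also appears in `first_browser` further down. If the same relabeling were wanted for several columns at once, one option is `DataFrame.replace`, which matches whole cell values. This is a sketch on a toy frame (the `toy` data is made up for illustration), not what the notebook does.

```python
import pandas as pd

# Toy stand-in for train_data; only the placeholder handling matters here.
toy = pd.DataFrame({
    "gender": ["-unknown-", "MALE", "FEMALE", "-unknown-"],
    "first_browser": ["Chrome", "-unknown-", "IE", "Safari"],
})

# DataFrame.replace compares whole cell values, so no regex handling is involved.
cleaned = toy.replace("-unknown-", "UNKNOWN")
print(cleaned)
```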

* ### language

In [9]:
train_data.language.value_counts()

en    206314
zh      1632
fr      1172
es       915
ko       747
de       732
it       514
ru       389
pt       240
ja       225
sv       122
nl        97
tr        64
da        58
pl        54
cs        32
no        30
th        24
el        24
id        22
hu        18
fi        14
is         5
ca         5
hr         2
Name: language, dtype: int64

* ### signup_method

In [10]:
train_data.signup_method.value_counts()

basic       152897
facebook     60008
google         546
Name: signup_method, dtype: int64

* ### signup_flow

In [11]:
train_data.signup_flow.value_counts()

0     164739
25     14659
12      9329
3       8822
2       6881
24      4328
23      2835
1       1047
6        301
8        240
21       196
5         36
20        14
16        11
15        10
10         2
4          1
Name: signup_flow, dtype: int64

* ### affiliate_channel

In [12]:
train_data.affiliate_channel.value_counts()

direct           137727
sem-brand         26045
sem-non-brand     18844
other              8961
seo                8663
api                8167
content            3948
remarketing        1096
Name: affiliate_channel, dtype: int64

* ### affiliate_provider

In [13]:
train_data.affiliate_provider.value_counts()

direct                 137426
google                  51693
other                   12549
craigslist               3471
bing                     2328
facebook                 2273
vast                      829
padmapper                 768
facebook-open-graph       545
yahoo                     496
gsp                       453
meetup                    347
email-marketing           166
naver                      52
baidu                      29
yandex                     17
wayn                        8
daum                        1
Name: affiliate_provider, dtype: int64

In [14]:
train_data['affiliate_provider'] = train_data['affiliate_provider'].replace(
    to_replace= ['vast', 'padmapper', 'yahoo', 'gsp', 'meetup', 'email-marketing', 'naver', 'baidu', 'yandex', 
                'wayn', 'daum'], value = 'other')

In [15]:
train_data['affiliate_provider'] = train_data['affiliate_provider'].replace('facebook-open-graph', 'facebook')

In [16]:
train_data.affiliate_provider.value_counts()

direct        137426
google         51693
other          15715
craigslist      3471
facebook        2818
bing            2328
Name: affiliate_provider, dtype: int64
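
The grouping above lists the rare providers by hand. A count-based alternative is sketched below: `lump_rare` is a hypothetical helper (not from the notebook or from pandas) that sends every category below a chosen frequency to 'other'. Note that a threshold alone would not merge 'facebook-open-graph' into 'facebook'; that step would still need an explicit mapping.

```python
import pandas as pd

def lump_rare(values: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with `other`."""
    counts = values.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent categories as they are; overwrite the rest with `other`.
    return values.where(values.isin(keep), other)

# Toy data for illustration only.
toy = pd.Series(["direct"] * 5 + ["google"] * 3 + ["naver", "daum"])
lumped = lump_rare(toy, min_count=3)
print(lumped.value_counts())
```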

* ### first_affiliate_tracked

In [17]:
train_data.first_affiliate_tracked.value_counts()

untracked        109232
linked            46287
omg               43982
tracked-other      6156
product            1556
marketing           139
local ops            34
Name: first_affiliate_tracked, dtype: int64

* ### signup_app

In [18]:
train_data.signup_app.value_counts()

Web        182717
iOS         19019
Moweb        6261
Android      5454
Name: signup_app, dtype: int64

* ### first_device_type

In [19]:
train_data.first_device_type.value_counts()

Mac Desktop           89600
Windows Desktop       72716
iPhone                20759
iPad                  14339
Other/Unknown         10667
Android Phone          2803
Android Tablet         1292
Desktop (Other)        1199
SmartPhone (Other)       76
Name: first_device_type, dtype: int64

In [20]:
train_data['first_device_type'] = train_data['first_device_type'].replace(
    to_replace=['Desktop (Other)', 'SmartPhone (Other)'], value='Other/Unknown')

In [21]:
train_data.first_device_type.value_counts()

Mac Desktop        89600
Windows Desktop    72716
iPhone             20759
iPad               14339
Other/Unknown      11942
Android Phone       2803
Android Tablet      1292
Name: first_device_type, dtype: int64

* ### first_browser

In [22]:
train_data.first_browser.value_counts()

Chrome                  63845
Safari                  45169
Firefox                 33655
-unknown-               27266
IE                      21068
Mobile Safari           19274
Chrome Mobile            1270
Android Browser           851
AOL Explorer              245
Opera                     188
Silk                      124
Chromium                   73
BlackBerry Browser         53
Maxthon                    46
IE Mobile                  36
Apple Mail                 36
Sogou Explorer             33
Mobile Firefox             30
SiteKiosk                  24
RockMelt                   24
Iron                       17
IceWeasel                  13
Pale Moon                  12
Yandex.Browser             11
CometBird                  11
SeaMonkey                  11
Camino                      9
TenFourFox                  8
CoolNovo                    6
wOSBrowser                  6
Opera Mini                  4
Avant Browser               4
Mozilla                     3
OmniWeb   

In [23]:
train_data['first_browser'] = train_data['first_browser'].replace(
    to_replace=['AOL Explorer','Opera', 'Silk', 'Chromium', 'BlackBerry Browser', 'Maxthon', 'Apple Mail', 
                'IE Mobile', 'Sogou Explorer', 'Mobile Firefox', 'SiteKiosk', 'RockMelt', 'Iron', 'IceWeasel',
                'Pale Moon', 'SeaMonkey', 'Yandex.Browser', 'CometBird', 'Camino', 'TenFourFox', 'wOSBrowser',
                'CoolNovo', 'Avant Browser', 'Opera Mini', 'Mozilla', 'TheWorld Browser', 'OmniWeb', 'Epic',
                'SlimBrowser', 'Opera Mobile', 'Crazy Browser', 'Comodo Dragon', 'Flock', 'PS Vita browser',
                'Googlebot', 'Outlook 2007', 'Stainless', 'Conkeror', 'Palm Pre web browser', 'IceDragon', 
                'NetNewsWire', 'Kindle Browser', 'Google Earth', 'Arora'], value='Other')

In [24]:
train_data.first_browser.value_counts()

Chrome             63845
Safari             45169
Firefox            33655
-unknown-          27266
IE                 21068
Mobile Safari      19274
Chrome Mobile       1270
Other               1053
Android Browser      851
Name: first_browser, dtype: int64

* ## Feature Engineering

* ### english_lan

We will create a new boolean feature: is English the user's preferred language? True or False.

In [25]:
# creating a new feature:
train_data['english_lan'] = train_data['language'] == 'en'
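
As a quick illustration of what this flag captures, the sketch below builds the same kind of boolean column on toy data and recomputes the share of English users from the `language` counts shown earlier (206314 of 213451). The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the language column.
toy = pd.DataFrame({"language": ["en", "en", "fr", "zh", "en"]})
toy["english_lan"] = toy["language"].eq("en")  # same test as `== 'en'`
print(toy)

# Share of English users in the full data, from the value_counts() above.
english_share = 206314 / 213451
print(f"English share: {english_share:.1%}")
```

With almost all users set to English, the flag is heavily imbalanced, which is worth keeping in mind when judging its usefulness.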

* ### age

We will use `pd.cut()` to create age bins and assign each user's age to the corresponding bin. Ages that are missing or fall outside the (14, 99] range get no bin and are labeled 'unknown' below.

In [26]:
train_data['age_bins'] = pd.cut(x=train_data['age'], bins=[14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 
                                                           79, 84, 89, 94, 99])

In [27]:
train_data['age_bins'] = train_data.age_bins.astype(str)

In [28]:
age_mapper = {'nan':'unknown',
'(29.0, 34.0]':'30-34', 
'(24.0, 29.0]':'25-29', 
'(34.0, 39.0]':'35-39', 
'(39.0, 44.0]':'40-44', 
'(19.0, 24.0]':'20-24', 
'(44.0, 49.0]':'45-49', 
'(49.0, 54.0]':'50-54', 
'(54.0, 59.0]':'55-59', 
'(59.0, 64.0]':'60-64', 
'(64.0, 69.0]':'65-69', 
'(14.0, 19.0]':'15-19', 
'(69.0, 74.0]':'70-74', 
'(74.0, 79.0]':'75+', 
'(79.0, 84.0]':'75+', 
'(94.0, 99.0]':'75+', 
'(84.0, 89.0]':'75+', 
'(89.0, 94.0]':'75+',} 

In [29]:
train_data['age_bins'] = train_data['age_bins'].replace(age_mapper)

In [30]:
train_data.age_bins.value_counts()

unknown    90418
30-34      28551
25-29      27143
35-39      19019
40-44      11740
20-24       8906
45-49       8470
50-54       6051
55-59       4470
60-64       3129
65-69       2002
15-19       1872
70-74        900
75+          780
Name: age_bins, dtype: int64
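
The binning and relabeling steps above can also be done in one call by passing `labels=` to `pd.cut()`, which avoids the round trip through interval strings. This is an alternative sketch on toy ages, not the notebook's method; the labels follow the interval contents (for example '60-64' for (59, 64]), and everything above 74 is folded into a single '75+' bin.

```python
import numpy as np
import pandas as pd

# Bin edges: right-closed intervals (14, 19], (19, 24], ..., (69, 74], (74, 99].
edges = [14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 99]
labels = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49",
          "50-54", "55-59", "60-64", "65-69", "70-74", "75+"]

# Toy ages, including a missing value and two out-of-range values.
ages = pd.Series([17, 30, 62, 72, 80, np.nan, 105, 5])

binned = pd.cut(ages, bins=edges, labels=labels)
# Missing and out-of-range ages come back as NaN; give them an explicit label.
binned = binned.cat.add_categories("unknown").fillna("unknown")
print(binned.tolist())
```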

* ### effective_booking

In [31]:
countries_list = train_data['country_destination'].unique().tolist()

In [32]:
countries_list

['NDF', 'US', 'other', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU']

'NDF' means there wasn't a booking. We will use this to split our data in two categories: 
1. It is a booking,
2. It is not a booking.

In [33]:
countries_list.remove('NDF')

In [34]:
train_data['effective_booking'] = train_data['country_destination'].isin(countries_list)

In [35]:
train_data.effective_booking.value_counts()

False    124543
True      88908
Name: effective_booking, dtype: int64
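
Since 'NDF' is the only non-booking label, the same flag can be written as a single comparison. The sketch below checks on toy data that both formulations agree and recomputes the booking share from the counts above; the `toy` frame is made up for illustration. The 88908 True values also match the number of non-null `date_first_booking` entries reported by `info()`.

```python
import pandas as pd

# Toy stand-in for the country_destination column.
toy = pd.DataFrame({"country_destination": ["NDF", "US", "other", "NDF", "FR"]})

# Notebook approach: membership in the list of booked destinations.
via_isin = toy["country_destination"].isin(["US", "other", "FR"])
# Equivalent shortcut: anything that is not 'NDF' is a booking.
via_ne = toy["country_destination"].ne("NDF")
print(via_isin.equals(via_ne))

# Booking share in the full data, from the value_counts() above.
booking_share = 88908 / 213451
print(f"Booking share: {booking_share:.1%}")
```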

In [36]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       213451 non-null  object 
 1   date_account_created     213451 non-null  object 
 2   timestamp_first_active   213451 non-null  int64  
 3   date_first_booking       88908 non-null   object 
 4   gender                   213451 non-null  object 
 5   age                      125461 non-null  float64
 6   signup_method            213451 non-null  object 
 7   signup_flow              213451 non-null  int64  
 8   language                 213451 non-null  object 
 9   affiliate_channel        213451 non-null  object 
 10  affiliate_provider       213451 non-null  object 
 11  first_affiliate_tracked  207386 non-null  object 
 12  signup_app               213451 non-null  object 
 13  first_device_type        213451 non-null  object 
 14  firs

* ## Defining Our Case Problems

In [37]:
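## Part I: binary target
# Note: X1 still contains 'id' and 'country_destination'. Drop both before fitting a model,
# because country_destination != 'NDF' reproduces effective_booking exactly (target leakage).
X1 = train_data.drop(columns=['date_account_created', 'timestamp_first_active', 'date_first_booking', 
                            'first_affiliate_tracked', 'language', 'age', 'effective_booking', 'signup_flow'])
y1 = train_data['effective_booking']

## Part II: multi-class target
# Note: X2 still contains 'id' and 'effective_booking', which reveals whether the target is 'NDF'.
X2 = train_data.drop(columns=['date_account_created', 'timestamp_first_active', 'date_first_booking', 
                            'first_affiliate_tracked', 'language', 'age', 'country_destination', 'signup_flow'])
y2 = train_data['country_destination']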
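
One thing to watch with these definitions is that X1 keeps `country_destination` and X2 keeps `effective_booking`, each of which gives away part or all of the other problem's target. The sketch below demonstrates the issue on toy data and shows one possible way to build a modeling frame; `X1_model` is a hypothetical name, and the notebook itself may handle this in a later step.

```python
import pandas as pd

# Toy stand-in for train_data with just enough columns to show the issue.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "gender": ["MALE", "FEMALE", "UNKNOWN", "FEMALE"],
    "country_destination": ["NDF", "US", "NDF", "FR"],
})
toy["effective_booking"] = toy["country_destination"].ne("NDF")

X1 = toy.drop(columns=["effective_booking"])
y1 = toy["effective_booking"]

# country_destination reproduces the target exactly, so a model could "cheat".
leaks = X1["country_destination"].ne("NDF").equals(y1)
print("country_destination reveals the target:", leaks)

# Hypothetical modeling frame: drop the identifier and the leaking column.
X1_model = X1.drop(columns=["id", "country_destination"])
print(X1_model.columns.tolist())
```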

* ## Splitting Data

In [38]:
from sklearn.model_selection import train_test_split

In [39]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size = 0.4, random_state = 42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size = 0.4, random_state = 25)
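
The split above is a plain random split. Because some classes are very small (only 117 'PT' rows land in the training split), a stratified split is one option for keeping class proportions identical in train and test. This is a sketch on toy data using the `stratify` parameter of `train_test_split`, not what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 non-bookers and 20 bookers, with a dummy feature.
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(["NDF"] * 80 + ["US"] * 20)

# stratify=y keeps the 80/20 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

print(y_train.value_counts())
print(y_test.value_counts())
```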

In [41]:
%store X_train1
%store X_test1
%store y_train1
%store y_test1
%store X_train2
%store X_test2
%store y_train2
%store y_test2

Stored 'X_train1' (DataFrame)
Stored 'X_test1' (DataFrame)
Stored 'y_train1' (Series)
Stored 'y_test1' (Series)
Stored 'X_train2' (DataFrame)
Stored 'X_test2' (DataFrame)
Stored 'y_train2' (Series)
Stored 'y_test2' (Series)


In [45]:
X_train1.country_destination.value_counts()

NDF      74878
US       37333
other     6096
FR        3030
IT        1673
ES        1353
GB        1325
CA         862
DE         626
NL         469
AU         308
PT         117
Name: country_destination, dtype: int64