## Capstone Project: Analyzing Airbnb New User Bookings
What will a new Airbnb user's first booking destination be?

## I. Overview

This project has two parts.
* ### Part I: Binary Classification Model
Will a new Airbnb user end up making a booking? Yes or no.

* ### Part II: Multi-Class Classification
What will a new Airbnb user's first booking destination be? Choose from 12 possible classes.

## II. Business Problem

In [147]:
## Add a binary feature "booked" (yes or no) and build a binary classification model to predict whether
## a customer will end up booking. What defines whether a customer books, at a granular or overall level?
## Only 42% of users end up booking. Why?
## Then build a classifier to predict, among those who book, where they are booking and why.

In [148]:
## Predict the country of a new Airbnb user's first booking destination.

In [149]:
## 213451 users in the training set
## 124543 users with no destination found (NDF), more than half of the users in our dataset (58%)
## 88908 actual bookings
## US represents domestic travel, which accounts for 70% of all bookings in our dataset.
## New York, Paris, Rome, London, Madrid
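
The shares quoted in these notes follow from the two counts alone. A quick arithmetic check, using only the numbers stated above (the variable names are illustrative):

```python
# Counts taken from the notes above.
total_users = 213451
no_destination = 124543  # users labelled NDF (no destination found)

# Everyone who is not NDF made a booking.
bookings = total_users - no_destination

booking_rate = bookings / total_users
ndf_rate = no_destination / total_users

print(f"bookings:     {bookings}")
print(f"booking rate: {booking_rate:.1%}")
print(f"NDF rate:     {ndf_rate:.1%}")
```

The two rates round to the 42% and 58% quoted in the notes.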

## III. Obtaining and Scrubbing Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
age_bkts = pd.read_csv('../data/age_gender_bkts.csv.zip')
countries = pd.read_csv('../data/countries.csv.zip')
sessions = pd.read_csv('../data/sessions.csv.zip')
train_data = pd.read_csv('../data/train_users_2.csv.zip')
test_data = pd.read_csv('../data/test_users.csv.zip')

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       213451 non-null  object 
 1   date_account_created     213451 non-null  object 
 2   timestamp_first_active   213451 non-null  int64  
 3   date_first_booking       88908 non-null   object 
 4   gender                   213451 non-null  object 
 5   age                      125461 non-null  float64
 6   signup_method            213451 non-null  object 
 7   signup_flow              213451 non-null  int64  
 8   language                 213451 non-null  object 
 9   affiliate_channel        213451 non-null  object 
 10  affiliate_provider       213451 non-null  object 
 11  first_affiliate_tracked  207386 non-null  object 
 12  signup_app               213451 non-null  object 
 13  first_device_type        213451 non-null  object 
 14  firs

In [4]:
train_data.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


Let's examine the features one by one and manually select the ones most likely to be useful.
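
One way to make this feature-by-feature review repeatable is a small helper that reports counts and shares for any column. This is only a sketch: `summarize_feature` is a hypothetical helper, not part of the notebook, and the frame below is made-up toy data rather than the Airbnb data.

```python
import pandas as pd


def summarize_feature(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and share of each value in `column`, missing values included."""
    counts = df[column].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Tiny made-up frame, for illustration only.
toy = pd.DataFrame({"gender": ["FEMALE", "MALE", "-unknown-", "FEMALE"]})
summary = summarize_feature(toy, "gender")
print(summary)
```

Reporting shares next to raw counts makes it easier to spot placeholder values such as `-unknown-` that dominate a column.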

* ### gender

In [5]:
train_data.gender.value_counts()

-unknown-    95688
FEMALE       63041
MALE         54440
OTHER          282
Name: gender, dtype: int64

In [6]:
# Relabel the '-unknown-' placeholder as 'UNKNOWN' (exact match on the whole value).
train_data['gender'] = train_data['gender'].replace('-unknown-', 'UNKNOWN')

* ### language

In [7]:
train_data.language.value_counts()

en    206314
zh      1632
fr      1172
es       915
ko       747
de       732
it       514
ru       389
pt       240
ja       225
sv       122
nl        97
tr        64
da        58
pl        54
cs        32
no        30
el        24
th        24
id        22
hu        18
fi        14
is         5
ca         5
hr         2
Name: language, dtype: int64

* ### english_lan

In [157]:
# Create a new feature:
## Is English the user's preferred language? True or False
train_data['english_lan'] = train_data['language'] == 'en'
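
In the language counts above, 206,314 of 213,451 users have `en`, roughly 97%, so this flag is heavily skewed toward `True`. The sketch below shows how the share of a boolean flag can be read off with `.mean()`. It uses a made-up toy frame, not the real data.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({"language": ["en", "en", "fr", "zh", "en"]})

# Same construction as in the notebook: an element-wise comparison yields a boolean column.
toy["english_lan"] = toy["language"] == "en"

# The mean of a boolean column is the fraction of True values.
english_share = toy["english_lan"].mean()
print(f"share of English-language users: {english_share:.0%}")
```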

* ### age

In [159]:
train_data['age_bins'] = pd.cut(x=train_data['age'], bins=[14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 
                                                           79, 84, 89, 94, 99])

In [160]:
train_data['age_bins'] = train_data.age_bins.astype(str)

In [163]:
# Map the interval strings produced by pd.cut to readable labels.
# Missing ages and ages outside (14, 99] both come out of pd.cut as NaN, hence 'unknown'.
age_labels = {
    'nan': 'unknown',
    '(14.0, 19.0]': '15-19',
    '(19.0, 24.0]': '20-24',
    '(24.0, 29.0]': '25-29',
    '(29.0, 34.0]': '30-34',
    '(34.0, 39.0]': '35-39',
    '(39.0, 44.0]': '40-44',
    '(44.0, 49.0]': '45-49',
    '(49.0, 54.0]': '50-54',
    '(54.0, 59.0]': '55-59',
    '(59.0, 64.0]': '60-64',
    '(64.0, 69.0]': '65-69',
    '(69.0, 74.0]': '70-74',
    '(74.0, 79.0]': '75+',
    '(79.0, 84.0]': '75+',
    '(84.0, 89.0]': '75+',
    '(89.0, 94.0]': '75+',
    '(94.0, 99.0]': '75+',
}
train_data['age_bins'] = train_data['age_bins'].replace(age_labels)

In [164]:
train_data.age_bins.value_counts()

unknown    90418
30-34      28551
25-29      27143
35-39      19019
40-44      11740
20-24       8906
45-49       8470
50-54       6051
55-59       4470
60-64       3129
65-69       2002
15-19       1872
70-74        900
75+          780
Name: age_bins, dtype: int64
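
As an alternative to relabelling interval strings after the fact, `pd.cut` accepts a `labels` argument, so the readable names can be attached at binning time. The sketch below assumes the same bin edges as above, with a single `(74, 99]` bin standing in for the merged `75+` group. The sample ages are made up.

```python
import numpy as np
import pandas as pd

# Edges 14, 19, ..., 74, then one wide (74, 99] bin for everyone aged 75 and over.
edges = list(range(14, 75, 5)) + [99]

# Bins are right-closed, so (14, 19] holds ages 15-19, (59, 64] holds 60-64, and so on.
labels = [f"{lo + 1}-{hi}" for lo, hi in zip(edges[:-2], edges[1:-1])] + ["75+"]

# Made-up sample: includes a missing age and an out-of-range age.
ages = pd.Series([17.0, 32.0, 60.0, 70.0, 88.0, np.nan, 105.0])

binned = pd.cut(ages, bins=edges, labels=labels)

# Values that fall in no bin are NaN; give them an explicit 'unknown' category.
age_bins = binned.cat.add_categories("unknown").fillna("unknown").astype(str)
print(age_bins.tolist())
```

Generating the labels from the edges keeps the two in sync, which avoids off-by-one label mistakes when the edges change.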

* ### signup_method

In [165]:
train_data.signup_method.value_counts()

basic       152897
facebook     60008
google         546
Name: signup_method, dtype: int64

* ### signup_flow

In [8]:
train_data.signup_flow.value_counts()

0     164739
25     14659
12      9329
3       8822
2       6881
24      4328
23      2835
1       1047
6        301
8        240
21       196
5         36
20        14
16        11
15        10
10         2
4          1
Name: signup_flow, dtype: int64

* ### effective_booking

In [107]:
countries_list = train_data['country_destination'].unique().tolist()

In [108]:
countries_list

['NDF', 'US', 'other', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU']

'NDF' (no destination found) means the user did not make a booking. We will use this to split our data into two categories:
1. The user made a booking.
2. The user did not make a booking.

In [109]:
countries_list.remove('NDF')

In [110]:
train_data['effective_booking'] = train_data['country_destination'].isin(countries_list)

In [111]:
train_data.effective_booking.value_counts()

False    124543
True      88908
Name: effective_booking, dtype: int64
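
Because every value other than `'NDF'` counts as a booking, the same flag can also be built with a direct comparison, without maintaining a separate list of countries. A minimal sketch on made-up data, checking that both constructions agree:

```python
import pandas as pd

# Made-up destinations, for illustration only.
toy = pd.DataFrame({"country_destination": ["NDF", "US", "other", "NDF", "FR"]})

# Approach used in the notebook: every destination except 'NDF' counts as a booking.
destinations = toy["country_destination"].unique().tolist()
destinations.remove("NDF")
via_isin = toy["country_destination"].isin(destinations)

# Equivalent one-liner: anything that is not 'NDF' is a booking.
toy["effective_booking"] = toy["country_destination"] != "NDF"

print(toy["effective_booking"].equals(via_isin))
```

The two versions differ only if the column contains missing values: `!= 'NDF'` marks them as bookings, while `isin` with a list built from non-missing values does not.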

In [112]:
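## Preview the data without the date columns. drop() returns a new DataFrame, so train_data itself
## is unchanged; alternatively, select only the columns we need when creating X and y.
train_data.drop(columns=['date_account_created','timestamp_first_active','date_first_booking'])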

Unnamed: 0,id,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination,english_lan,age_bins,effective_booking
0,gxn3p5htnn,UNKNOWN,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF,True,unknown,False
1,820tgsjxq7,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF,True,"(34.0, 39.0]",False
2,4ft3gnwmtx,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US,True,"(54.0, 59.0]",True
3,bjjt8pjhuk,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other,True,"(39.0, 44.0]",True
4,87mebub9p4,UNKNOWN,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US,True,"(39.0, 44.0]",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213446,zxodksqpep,MALE,32.0,basic,0,en,sem-brand,google,omg,Web,Mac Desktop,Safari,NDF,True,"(29.0, 34.0]",False
213447,mhewnxesx9,UNKNOWN,,basic,0,en,direct,direct,linked,Web,Windows Desktop,Chrome,NDF,True,unknown,False
213448,6o3arsjbb4,UNKNOWN,32.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,NDF,True,"(29.0, 34.0]",False
213449,jh95kwisub,UNKNOWN,,basic,25,en,other,other,tracked-other,iOS,iPhone,Mobile Safari,NDF,True,unknown,False
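
The second option mentioned in the comment, selecting only the needed columns when creating X and y, could look like the sketch below. The choice of feature columns is a hypothetical example rather than the notebook's final selection. The toy rows mirror the first four users shown above, with ages written as bin labels.

```python
import pandas as pd

# Toy rows mirroring the first four users shown above.
toy = pd.DataFrame({
    "gender": ["UNKNOWN", "MALE", "FEMALE", "FEMALE"],
    "signup_method": ["facebook", "facebook", "basic", "facebook"],
    "english_lan": [True, True, True, True],
    "age_bins": ["unknown", "35-39", "55-59", "40-44"],
    "effective_booking": [False, False, True, True],
})

# Hypothetical feature selection for the binary model.
feature_cols = ["gender", "signup_method", "english_lan", "age_bins"]
categorical_cols = ["gender", "signup_method", "age_bins"]

# One-hot encode the categorical columns; english_lan is already boolean.
X = pd.get_dummies(toy[feature_cols], columns=categorical_cols)

# Binary target: 1 if the user booked, 0 otherwise.
y = toy["effective_booking"].astype(int)

print(X.columns.tolist())
print(y.tolist())
```

Selecting columns this way leaves `train_data` intact, so the date columns remain available if they are needed later.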
