
## **Project Abstract** ## 
To help people get started on their project and to make sure you are selecting an appropriate task, we will have all the teams submit an abstract. Please only submit one abstract per team.

The abstract should include (at least):

- Team members
- Problem statement
- Data you will use to solve the problem
- Outline of how you plan on solving the problem with the data. For example, what pre-processing steps might you need to do, what models, etc.
- Supporting documents if necessary citing past research in the area and methods used to solve the problem.

The goal of this abstract is for you to think deeply about the project you will be undertaking and convince yourself (and us) that it is a meaningful and achievable project for this class.

This homework is due March 1, 2018 by midnight Utah time and will be submitted on Learning Suite.

# Airbnb New User Bookings

## Team Members

- Alex Fabiano 
- Michael Clawson
- Elijah Broadbent 


## Problem Statement



With 34,000+ cities across 190+ countries, Airbnb users have a multitude of destinations from which to choose. This vast array of possibilities creates problems for both users and Airbnb. New users may suffer choice overload and delay their first booking. Irregular and delayed first bookings can cause demand lags and make demand harder for Airbnb to predict.
	
The goal of this data project is to accurately predict where new users will book their first Airbnb. This will enable Airbnb to share more personalized content and better forecast demand as well as improve user experience.


## Data

The data for this project comes from four separate files: age and gender buckets, countries, web sessions, and users. We will join the users and sessions sets into a training set, while the remaining sets will serve as supplementary information to inform our data cleaning and analysis.

In [1]:
import pandas as pd
import numpy as np

In [5]:
test = pd.read_csv('test_users.csv')
XY_Age = pd.read_csv('age_gender_bkts.csv') #complicated and messy...consider doing last
countries = pd.read_csv('countries.csv')
users = pd.read_csv('train_users_2.csv')
sessions = pd.read_csv('sessions.csv')

In [7]:
users.first_affiliate_tracked.value_counts(dropna=False)
#Maybe make a binary variable, tracked versus untracked?  Fill NaN's according to probability
#numpy.random.choice()

untracked        109232
linked            46287
omg               43982
tracked-other      6156
NaN                6065
product            1556
marketing           139
local ops            34
Name: first_affiliate_tracked, dtype: int64
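The two ideas in the cell comments above (filling NaNs according to the observed category probabilities, and a binary tracked versus untracked variable) could be sketched roughly as follows. This is a minimal sketch on a small made-up Series rather than the real users file; the names `filled` and `is_tracked` are hypothetical, and it uses NumPy's `Generator.choice` in place of `numpy.random.choice` so the draw is reproducible.

```python
import numpy as np
import pandas as pd

# Toy stand-in for users.first_affiliate_tracked (values are illustrative only)
tracked = pd.Series(
    ["untracked", "linked", "omg", np.nan, "untracked", "linked", np.nan, "untracked"],
    name="first_affiliate_tracked",
)

# Empirical distribution of the observed (non-missing) categories
probs = tracked.value_counts(normalize=True)

# Draw one replacement per missing row, weighted by observed frequency
rng = np.random.default_rng(0)  # fixed seed so the fill is reproducible
missing = tracked.isna()
filled = tracked.copy()
filled[missing] = rng.choice(
    probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
)

# Binary indicator: 1 for any category other than "untracked", 0 otherwise
is_tracked = (filled != "untracked").astype(int)
```

Sampling keeps the category proportions roughly unchanged, whereas filling every NaN with the most common value would inflate the untracked group.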

In [8]:
users.country_destination.value_counts(dropna=False)

NDF      124543
US        62376
other     10094
FR         5023
IT         2835
GB         2324
ES         2249
CA         1428
DE         1061
NL          762
AU          539
PT          217
Name: country_destination, dtype: int64

In [9]:
users.first_browser.value_counts(dropna=False)

Chrome                  63845
Safari                  45169
Firefox                 33655
-unknown-               27266
IE                      21068
Mobile Safari           19274
Chrome Mobile            1270
Android Browser           851
AOL Explorer              245
Opera                     188
Silk                      124
Chromium                   73
BlackBerry Browser         53
Maxthon                    46
IE Mobile                  36
Apple Mail                 36
Sogou Explorer             33
Mobile Firefox             30
RockMelt                   24
SiteKiosk                  24
Iron                       17
IceWeasel                  13
Pale Moon                  12
Yandex.Browser             11
SeaMonkey                  11
CometBird                  11
Camino                      9
TenFourFox                  8
wOSBrowser                  6
CoolNovo                    6
Avant Browser               4
Opera Mini                  4
Mozilla                     3
SlimBrowse

We can create dummy variables from this column: one for each of the major browsers, plus an other category and an unknown category. We could also group some of the Google offshoots (e.g., Chromium) and Amazon offshoots (e.g., Silk) into their own categories.
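One way the browser grouping might look in pandas is sketched below. The list of "major" browsers and the helper `group_browser` are assumptions made for illustration; the actual cutoff would be chosen from the counts above.

```python
import pandas as pd

# Toy stand-in for users.first_browser
browsers = pd.Series(
    ["Chrome", "Safari", "Firefox", "-unknown-", "IE", "Mobile Safari", "Opera", "Silk"],
    name="first_browser",
)

# Assumed cutoff: treat the five most common named browsers as "major"
major = ["Chrome", "Safari", "Firefox", "IE", "Mobile Safari"]

def group_browser(name):
    """Collapse rare browsers into 'other' and keep '-unknown-' separate."""
    if name == "-unknown-":
        return "unknown"
    if name in major:
        return name
    return "other"

grouped = browsers.map(group_browser)

# One indicator column per remaining category
dummies = pd.get_dummies(grouped, prefix="browser")
```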

In [10]:
users.first_device_type.value_counts(dropna=False)

Mac Desktop           89600
Windows Desktop       72716
iPhone                20759
iPad                  14339
Other/Unknown         10667
Android Phone          2803
Android Tablet         1292
Desktop (Other)        1199
SmartPhone (Other)       76
Name: first_device_type, dtype: int64

We can create dummies for desktop, phone, Apple, Android, and Microsoft devices.
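Because a device can fall into more than one of these groups (an iPhone is both a phone and an Apple device), overlapping indicator columns may fit better than a single one-hot encoding. A rough sketch, with hypothetical column names and the device labels copied from the output above:

```python
import pandas as pd

# The nine labels observed in users.first_device_type
devices = pd.Series(
    [
        "Mac Desktop", "Windows Desktop", "iPhone", "iPad", "Other/Unknown",
        "Android Phone", "Android Tablet", "Desktop (Other)", "SmartPhone (Other)",
    ],
    name="first_device_type",
)

# Overlapping 0/1 flags derived from the label text
flags = pd.DataFrame(
    {
        "is_desktop": devices.str.contains("Desktop"),
        "is_phone": devices.str.contains("Phone"),  # also matches "iPhone"
        "is_apple": devices.isin(["Mac Desktop", "iPhone", "iPad"]),
        "is_android": devices.str.contains("Android"),
        "is_microsoft": devices.str.contains("Windows"),
    }
).astype(int)
```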

Note that in the country_destination counts above, other refers to bookings made to a country not on that list, while NDF corresponds to cases in which no booking was ultimately made.

In [10]:
sessions.head()

Unnamed: 0,user_id,action,action_type,action_detail,device_type,secs_elapsed
0,d1mm9tcy42,lookup,,,Windows Desktop,319.0
1,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,67753.0
2,d1mm9tcy42,lookup,,,Windows Desktop,301.0
3,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,22141.0
4,d1mm9tcy42,lookup,,,Windows Desktop,435.0


In [71]:
joined = pd.merge(users, sessions, left_on='id', right_on='user_id', how='inner')
joined.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,...,signup_app,first_device_type,first_browser,country_destination,user_id,action,action_type,action_detail,device_type,secs_elapsed
0,d1mm9tcy42,2014-01-01,20140101000936,2014-01-04,MALE,62.0,basic,0,en,sem-non-brand,...,Web,Windows Desktop,Chrome,other,d1mm9tcy42,lookup,,,Windows Desktop,319.0
1,d1mm9tcy42,2014-01-01,20140101000936,2014-01-04,MALE,62.0,basic,0,en,sem-non-brand,...,Web,Windows Desktop,Chrome,other,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,67753.0
2,d1mm9tcy42,2014-01-01,20140101000936,2014-01-04,MALE,62.0,basic,0,en,sem-non-brand,...,Web,Windows Desktop,Chrome,other,d1mm9tcy42,lookup,,,Windows Desktop,301.0
3,d1mm9tcy42,2014-01-01,20140101000936,2014-01-04,MALE,62.0,basic,0,en,sem-non-brand,...,Web,Windows Desktop,Chrome,other,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,22141.0
4,d1mm9tcy42,2014-01-01,20140101000936,2014-01-04,MALE,62.0,basic,0,en,sem-non-brand,...,Web,Windows Desktop,Chrome,other,d1mm9tcy42,lookup,,,Windows Desktop,435.0


In [72]:
print("Joined Length: {}\tSessions Length: {}\tUsers Length: {}".format(len(joined), len(sessions), len(users)))

Joined Length: 5537957	Sessions Length: 10567737	Users Length: 213451


In [23]:
joined.id.describe() #there are 73,815 matching users between the sessions and users files

count        5537957
unique         73815
top       0hjoc5q8nf
freq            2644
Name: id, dtype: object
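The inner join above yields one row per session action (5,537,957 rows for 73,815 users), so user-level fields are repeated many times. An alternative we may consider is to summarize sessions to one row per user before merging. The sketch below uses tiny made-up frames, and the feature names (`n_actions`, `n_unique_actions`, `total_secs`) are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the users and sessions files
users = pd.DataFrame(
    {"id": ["a", "b", "c"], "country_destination": ["US", "NDF", "FR"]}
)
sessions = pd.DataFrame(
    {
        "user_id": ["a", "a", "b", "b", "b"],
        "action": ["lookup", "search_results", "lookup", "lookup", "show"],
        "secs_elapsed": [319.0, 67753.0, 301.0, None, 435.0],
    }
)

# Collapse sessions to one row per user
per_user = (
    sessions.groupby("user_id")
    .agg(
        n_actions=("action", "size"),
        n_unique_actions=("action", "nunique"),
        total_secs=("secs_elapsed", "sum"),  # NaN values are skipped by sum
    )
    .reset_index()
)

# A left join keeps users without sessions; _merge records which rows matched
merged = users.merge(
    per_user, left_on="id", right_on="user_id", how="left", indicator=True
)
```

With `how="left"` the result keeps one row per user, and the `_merge` column shows how many users have no session data at all.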

In [75]:
joined.dtypes

id                                 object
date_account_created               object
timestamp_first_active              int64
date_first_booking         datetime64[ns]
gender                             object
age                               float64
signup_method                      object
signup_flow                         int64
language                           object
affiliate_channel                  object
affiliate_provider                 object
first_affiliate_tracked            object
signup_app                         object
first_device_type                  object
first_browser                      object
country_destination                object
user_id                            object
action                             object
action_type                        object
action_detail                      object
device_type                        object
secs_elapsed                      float64
dtype: object

In [55]:
joined.date_first_booking.value_counts(dropna=False) #need to extract month as variable

NaT           3057710
2014-06-11      26954
2014-06-25      22184
2014-06-10      21829
2014-05-22      21540
2014-06-13      21442
2014-06-24      21220
2014-05-02      21109
2014-05-08      21057
2014-06-27      19882
2014-05-28      19756
2014-06-05      19151
2014-06-12      19061
2014-06-06      18584
2014-06-20      18202
2014-05-21      18130
2014-06-23      17979
2014-06-26      17951
2014-05-14      17843
2014-05-15      17690
2014-05-20      17563
2014-06-03      17194
2014-06-16      16963
2014-06-09      16815
2014-05-13      16571
2014-06-29      16514
2014-06-15      16464
2014-06-30      16455
2014-04-28      16324
2014-04-03      16279
               ...   
2014-10-18        254
2015-06-18        248
2015-05-08        241
2015-05-17        231
2015-05-10        231
2015-05-26        225
2015-04-05        217
2015-06-11        215
2015-05-16        209
2015-04-26        206
2015-05-05        197
2015-05-22        177
2015-06-02        146
2015-05-24        145
2015-06-17

In [74]:
joined.date_first_booking = pd.to_datetime(joined.date_first_booking) #Casts object as datetime

In [76]:
joined['day_of_week_1stbook'] = joined.date_first_booking.dt.weekday #Create indicators for day of week (0=Mon, 6=Sun)
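The month variable mentioned in the earlier cell comment can be pulled out the same way as the weekday. A small sketch on a made-up frame; the column names `month_1stbook` and `has_booking` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the date_first_booking column, including a missing value
bookings = pd.DataFrame({"date_first_booking": ["2014-06-11", None, "2015-05-08"]})
bookings["date_first_booking"] = pd.to_datetime(bookings["date_first_booking"])

# Month of first booking (1=Jan, 12=Dec); stays NaN where no booking exists
bookings["month_1stbook"] = bookings["date_first_booking"].dt.month

# Indicator for whether a booking date is present at all
bookings["has_booking"] = bookings["date_first_booking"].notna().astype(int)
```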

# Project Research

Based on the other Kaggle kernels for this project, a few different methods have been used to predict the first booking location for new Airbnb users. Some competitors used ensembles with up to three layers. One kaggler implemented a three-tiered ensemble with six different models in each layer (Support Vector Machines, Logistic Regression, Random Forest, Gradient Boosting, Extra Trees Classifier, and K-Nearest Neighbors). Another kaggler coded Normalized Discounted Cumulative Gain (NDCG), which is a ranking measure rather than a model: it scores a ranked list of predictions, discounting each position by a logarithmic factor. NDCG-type measures have been used widely in empirical work, but comparatively little is known about their theoretical properties. Further information can be found at https://arxiv.org/abs/1304.6480.
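To make the logarithmic discount concrete, here is a minimal sketch of NDCG for a single user. It assumes exactly one correct destination per user, so the ideal DCG is 1, and a cutoff of k=5; both are assumptions made for illustration, and the function name `ndcg_at_k` is hypothetical.

```python
import numpy as np

def ndcg_at_k(ranked_predictions, truth, k=5):
    """NDCG@k for one user when exactly one label is relevant.

    With a single relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) if the truth is in the top k, and 0 otherwise.
    """
    top_k = list(ranked_predictions)[:k]
    if truth not in top_k:
        return 0.0
    rank = top_k.index(truth) + 1  # 1-based position of the correct label
    return float(1.0 / np.log2(rank + 1))

# The correct country in first place scores 1.0; second place is discounted
first = ndcg_at_k(["US", "FR", "IT"], "US")
second = ndcg_at_k(["US", "FR", "IT"], "FR")
missed = ndcg_at_k(["US", "FR", "IT"], "DE")
```

Under this measure a model is rewarded for ranking the true destination near the top even when its first guess is wrong.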

# Strategy

After merging our data and doing a minimal amount of initial cleaning, we will use a K-Neighbors Classifier to get a naive baseline for our classification problem. From there we will refine our cleaning as needed to handle outliers, missing data, and the choice of control variables more robustly. Other models we plan to try include Random Forest, regression, and Gradient Boosting. Depending on the results we obtain from these methods, we can extend our approach using tiered ensembles of multiple models.
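A rough sketch of what the K-Neighbors baseline might look like with scikit-learn is given below. The features and labels are synthetic placeholders, not the Airbnb data, and `n_neighbors=5` is simply the library default rather than a tuned choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data: 200 users, 4 numeric features, 2 destinations
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, "US", "NDF")

# Hold out a stratified 25% of users to estimate baseline accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)

# Rank destinations per user from most to least probable
proba = knn.predict_proba(X_test)
ranked = knn.classes_[np.argsort(-proba, axis=1)]
```

Producing a ranked list per user, rather than a single label, keeps the baseline compatible with ranking measures such as NDCG.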