# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Part 2: Problem Statement + EDA

### Overview

For the capstone project I have chosen to work on a Kaggle competition that aims to predict the destination country of a new user's first booking. From the Kaggle page:

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

*"New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand. In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking."*

The assumption is that all new users are from the USA.

### Data

The data provided are as follows:

In [2]:
import pandas as pd
import numpy as np

---------------------------------------------------
* Age/gender dataset: population counts (in thousands) by age bucket, gender and destination country. All data is from the year 2015:

In [3]:
age_gender_df = pd.read_csv('./assets/age_gender_bkts.csv/age_gender_bkts.csv')
print(age_gender_df.shape)
age_gender_df.head(1)

(420, 5)


|  | age_bucket | country_destination | gender | population_in_thousands | year |
|---|---|---|---|---|---|
| 0 | 100+ | AU | male | 1.0 | 2015.0 |


In [4]:
print(age_gender_df['age_bucket'].unique())
print('')
print(len(age_gender_df['country_destination'].unique()))
print(age_gender_df['country_destination'].unique())

['100+' '95-99' '90-94' '85-89' '80-84' '75-79' '70-74' '65-69' '60-64'
 '55-59' '50-54' '45-49' '40-44' '35-39' '30-34' '25-29' '20-24' '15-19'
 '10-14' '5-9' '0-4']

10
['AU' 'CA' 'DE' 'ES' 'FR' 'GB' 'IT' 'NL' 'PT' 'US']


--------------------------------------------
* Countries dataset: information about each destination country, including its distance from the USA (measured from the center of each country), its language, and the Levenshtein distance of that language from English. There are 10 countries in total.

In [7]:
countries_df = pd.read_csv('./assets/countries.csv/countries.csv')
print(countries_df.shape)
countries_df.head()

(10, 7)


|  | country_destination | lat_destination | lng_destination | distance_km | destination_km2 | destination_language | language_levenshtein_distance |
|---|---|---|---|---|---|---|---|
| 0 | AU | -26.853388 | 133.27516 | 15297.744 | 7741220.0 | eng | 0.0 |
| 1 | CA | 62.393303 | -96.818146 | 2828.1333 | 9984670.0 | eng | 0.0 |
| 2 | DE | 51.165707 | 10.452764 | 7879.568 | 357022.0 | deu | 72.61 |
| 3 | ES | 39.896027 | -2.487694 | 7730.724 | 505370.0 | spa | 92.25 |
| 4 | FR | 46.232193 | 2.209667 | 7682.945 | 643801.0 | fra | 92.06 |


--------------------------------------------
* Train/Test: since this is a Kaggle competition, the data comes as separate train and test datasets. The date columns need to be converted to datetime values. I will cover the train set here:

In [15]:
train_data_df = pd.read_csv('./assets/train_users_2.csv/train_users_2.csv')
print(train_data_df.shape)
train_data_df.head(2)

(213451, 16)


|  | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |


* **id**: user id
* **date_account_created**: the date of account creation
* **timestamp_first_active**: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* **date_first_booking**: date of first booking
* **signup_flow**: the page a user came to sign up from
* **language**: international language preference
* **affiliate_channel**: what kind of paid marketing
* **affiliate_provider**: where the marketing is, e.g. google, craigslist, other
* **first_affiliate_tracked**: the first marketing the user interacted with before signing up
* **country_destination**: this is the target variable you are to predict
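
The date columns described above are read in as strings or integers. A minimal sketch of the conversion follows, assuming `timestamp_first_active` is an integer in `YYYYMMDDHHMMSS` form (as the sample rows suggest). The frame `sample_df` and its booking date are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the date columns of the train set.
sample_df = pd.DataFrame({
    'date_account_created': ['2010-06-28', '2011-05-25'],
    'timestamp_first_active': [20090319043255, 20090523174809],
    'date_first_booking': [None, '2011-06-01'],  # made-up booking date
})

# ISO-formatted date strings parse directly; missing values become NaT.
sample_df['date_account_created'] = pd.to_datetime(sample_df['date_account_created'])
sample_df['date_first_booking'] = pd.to_datetime(sample_df['date_first_booking'])

# The integer timestamp needs an explicit format string (assumed layout).
sample_df['timestamp_first_active'] = pd.to_datetime(
    sample_df['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
```

Once converted, components such as year, month or weekday are available through the `.dt` accessor and could serve as features.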

In [16]:
train_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
id                         213451 non-null object
date_account_created       213451 non-null object
timestamp_first_active     213451 non-null int64
date_first_booking         88908 non-null object
gender                     213451 non-null object
age                        125461 non-null float64
signup_method              213451 non-null object
signup_flow                213451 non-null int64
language                   213451 non-null object
affiliate_channel          213451 non-null object
affiliate_provider         213451 non-null object
first_affiliate_tracked    207386 non-null object
signup_app                 213451 non-null object
first_device_type          213451 non-null object
first_browser              213451 non-null object
country_destination        213451 non-null object
dtypes: float64(1), int64(2), object(13)
memory usage: 26.1+ MB


--------------------------------------------
* Sessions: this dataset records the web sessions of SOME users: what they did while they were online, how much time they spent on each *task*, and on what type of device.

In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010.
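
Because only some users have session data, it may help to measure how many train users are covered. A minimal sketch on made-up toy frames (`toy_train` and `toy_sessions` are hypothetical stand-ins for the train and sessions dataframes):

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and sessions data.
toy_train = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})
toy_sessions = pd.DataFrame({'user_id': ['a', 'a', 'c', None]})

# Distinct user ids seen in the sessions data (missing ids dropped, if any).
session_users = toy_sessions['user_id'].dropna().unique()

# Flag each train user that appears at least once in the sessions data.
has_session = toy_train['id'].isin(session_users)

# Fraction of train users with at least one session row.
coverage = has_session.mean()
```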


In [17]:
sessions_df = pd.read_csv('./assets/sessions.csv/sessions.csv')
sessions_df.head(2)

|  | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |


In [18]:
print(sessions_df.shape)
print(sessions_df.info())

(10567737, 6)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10567737 entries, 0 to 10567736
Data columns (total 6 columns):
user_id          object
action           object
action_type      object
action_detail    object
device_type      object
secs_elapsed     float64
dtypes: float64(1), object(5)
memory usage: 483.8+ MB
None


In [19]:
print (len(sessions_df.action.unique()))
print (len(sessions_df.action_type.unique()))
print (len(sessions_df.action_detail.unique()))
print (len(sessions_df.device_type.unique()))

360
11
156
14


Probably the most important question here is whether we care about the individual actions, action types and action details. Do we want to preserve the information they carry and pass it on to the train set, or would a groupby aggregate per user be sufficient?

For the initial phase I will compute summary statistics on the session data to gather the following information for the available users:

* How much time does a user spend on each session? (mean, std, sum)
* How many actions (actions, action types and action details) are involved per user?
* What are the top actions (actions, action types and action details) among users? Do they have any value for our calculations?
* What are the most time-consuming actions (actions, action types and action details)?
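
The questions above could be approached with a per-user groupby aggregate. The sketch below runs on a made-up `toy_sessions` frame and treats each row's `secs_elapsed` as the time spent on that action. The data has no explicit session identifier in the columns shown, so the statistics are per user rather than per session.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the sessions dataset.
toy_sessions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'action': ['lookup', 'search_results', 'lookup', 'show', 'show'],
    'action_type': [None, 'click', None, 'view', 'view'],
    'secs_elapsed': [10.0, 20.0, 30.0, 5.0, None],
})

grouped = toy_sessions.groupby('user_id')

# Time statistics per user (missing durations are skipped).
time_stats = grouped['secs_elapsed'].agg(['mean', 'std', 'sum'])

# Number of rows and of distinct actions / action types per user.
counts = pd.DataFrame({
    'n_rows': grouped.size(),
    'n_actions': grouped['action'].nunique(),
    'n_action_types': grouped['action_type'].nunique(),
})

# One row per user, ready to be joined to the train set later.
user_summary = time_stats.join(counts)

# Most frequent actions overall, and the actions with the largest total time.
top_actions = toy_sessions['action'].value_counts()
slowest_actions = (toy_sessions.groupby('action')['secs_elapsed']
                   .sum().sort_values(ascending=False))
```

The same pattern extends to `action_detail` and `device_type`. Whether these aggregates keep enough of the signal is exactly the open question raised above.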

One notable detail is that there is a 'booking_request' action type. Among users who booked a room (cross-referenced with the train dataset), the majority have this action among their activities, but not all. Why? Is it important?
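
One way to quantify this cross-reference is sketched below on made-up toy frames. It assumes that a `country_destination` of `NDF` marks a user without a booking, which the sample rows suggest but the data description above does not state.

```python
import pandas as pd

# Hypothetical toy frames; column names follow the train and sessions data.
toy_train = pd.DataFrame({
    'id': ['u1', 'u2', 'u3', 'u4'],
    'country_destination': ['US', 'FR', 'NDF', 'US'],
})
toy_sessions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u4'],
    'action_type': ['click', 'booking_request', 'view', 'click', 'booking_request'],
})

# Users with at least one 'booking_request' row.
requesters = toy_sessions.loc[
    toy_sessions['action_type'] == 'booking_request', 'user_id'].unique()

# Users who booked (assumed: target other than NDF) and who have session data.
booked = toy_train[(toy_train['country_destination'] != 'NDF')
                   & toy_train['id'].isin(toy_sessions['user_id'])]

# Share of those bookers with a booking request among their activities.
share_with_request = booked['id'].isin(requesters).mean()
```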

To answer these questions for this dataset, I might use PCA or XGBoost to find the important or salient features.
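
As a rough illustration of the PCA option only, the sketch below computes the share of variance explained by each principal component using plain NumPy (an SVD of the standardized matrix) on random toy data. The feature matrix `X` is hypothetical; a real run would use the per-user session features, and a library implementation of PCA could be swapped in.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical per-user feature matrix: 200 users, 4 features.
# The third column is almost a copy of the first, so it adds little information.
X = rng.normal(size=(200, 4))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)

# Standardize each column, then take the SVD of the standardized matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(X_std, full_matrices=False)

# Share of total variance captured by each principal component (descending).
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
```

A component with a near-zero share, as the redundant column produces here, indicates features that carry almost no independent information.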

### Preprocessing

All datasets have some preprocessing to be done, including replacing NAs, fixing ages and dates, and getting dummies or encoding for categorical data.
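
A minimal sketch of these steps on a made-up `toy_users` frame. The age bounds (`MIN_AGE`, `MAX_AGE`), the median fill and the `'missing'` label are assumptions chosen for illustration, not decisions taken in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using column names from the train set.
toy_users = pd.DataFrame({
    'gender': ['-unknown-', 'MALE', 'FEMALE', 'MALE'],
    'age': [np.nan, 38.0, 2014.0, 25.0],  # 2014.0 is a made-up bad value
    'first_affiliate_tracked': ['untracked', None, 'linked', 'untracked'],
})

# Treat the '-unknown-' placeholder as a missing value.
toy_users['gender'] = toy_users['gender'].replace('-unknown-', np.nan)

# Assumed plausibility range for ages; values outside it are set to NaN.
MIN_AGE, MAX_AGE = 15, 100
valid_age = toy_users['age'].between(MIN_AGE, MAX_AGE)
toy_users['age'] = toy_users['age'].where(valid_age)

# Fill the gaps: median for age, an explicit label for categoricals.
toy_users['age'] = toy_users['age'].fillna(toy_users['age'].median())
categorical = ['gender', 'first_affiliate_tracked']
toy_users[categorical] = toy_users[categorical].fillna('missing')

# One-hot encode the categorical columns.
encoded = pd.get_dummies(toy_users, columns=categorical)
```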

### Next Steps

Upon completing the analysis of the sessions dataset and the preprocessing, the three other sets will be merged into the train data to create the final dataframe, which will then be used to train the model.
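
For the sessions part of that merge, a left join on the user id is one possible approach. The sketch below uses made-up frames; `toy_session_summary` and its `total_secs` column are hypothetical stand-ins for whatever per-user aggregate is eventually built.

```python
import pandas as pd

# Hypothetical toy train set and per-user session summary.
toy_train = pd.DataFrame({
    'id': ['u1', 'u2', 'u3'],
    'country_destination': ['US', 'NDF', 'FR'],
})
toy_session_summary = pd.DataFrame({
    'user_id': ['u1', 'u3'],
    'total_secs': [60.0, 5.0],
})

# Left join keeps every train user; users without sessions get NaN features.
merged = toy_train.merge(toy_session_summary, how='left',
                         left_on='id', right_on='user_id')
merged = merged.drop('user_id', axis=1)

# Keep track of which users actually had session data.
merged['has_sessions'] = merged['total_secs'].notnull()
```

A left join preserves the row count of the train set, which matters because only some users appear in the sessions data.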