# Detailed walk through the Airbnb Dataset

## Get the Setup out of the way

* Setup matplotlib for inline plotting
* Setup math-related, statistics, machine learning, DataFrames
* airbnb_tools is a set up wrappers  to do some basic jobs written by myself

In [49]:
# Setup plot environment
%matplotlib inline

# Imports
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from airbnb_tools import *
import math

from xgboost.sklearn import XGBClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier




## Read in the data and start some Cleaning

* Read in with pandas
* At first, I was limiting to those with bookings. Turns out that's the wrong way to do it, they want a non-booking to be included
* Replace missing data with -1 (this is for quick analysis)

In [27]:
# Read in the training data
train = pd.read_csv("data/train_users_2.csv")
test = pd.read_csv("data/test_users.csv")

# Join test and train so they get treated the same during cleaning.
# We'll separate them for model testing.
piv_train = train.shape[0]

# We're going to drop the ID from the test data eventually. Need to make 
# sure we keep track of it
testId = test['id']
test["dataset"] = "test"
train["dataset"] = "train"
alldata = pd.concat((train, test), axis=0, ignore_index=True)

# Start some preliminary cleaning of the data.
# Fill in missing data with -1 for now
alldata.fillna(-1, inplace=True)

## Create a function that takes two YYYY-MM-DD dates and calculates the difference between them.

Note that we allow an optional keyword (noNeg) to force the minimum output to be 0. We do this because some of the inputs we're dealing with should always be > 0 and a negative is probably fubar.

In [28]:
# Create some features from others that may be telling
def days_between(d1, d2, noNeg=False):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    outval = (d2-d1).days
    if (outval > 0 or noNeg == False):
        return outval
    else:
        return 0

# Where were no bookings made?
d1 = alldata["date_account_created"].values
d2 = alldata["date_first_booking"].values
elapsed = []
for ii in range(0, len(alldata)):
    if (d2[ii] != -1):
        elapsed.append(days_between(d1[ii],d2[ii], noNeg=True))
    else:
        elapsed.append(10000.)
# Add this to the DataFrame
alldata["elapsed_booking_time"] = elapsed

## Create another little function to convert the (numeric) time stamp keyword to useful values.

In [29]:
# What does the timestamp_first_active say?
def tfa(x):
    """ Return YYYY-MM-DD format for timestamp formatted data."""
    output = []
    x = str(x)
    return str(x[:4]) + '-' + str(x[4:6]) + '-' + str(x[6:8])



In [30]:
# Split the initial Time Stamp into useful bits. The creation delay is the time between
# Their first search (which could be first) to when they created their account.
tfa_year = []
tfa_month = []
tfa_day = []
creation_delay = []
tfa_vector = alldata["timestamp_first_active"].values
for ii in range(0, len(alldata)):
    tfa_out = tfa(tfa_vector[ii])
    creation_delay.append(days_between(tfa_out, d1[ii]))
    
    
alldata["creation_delay"] = creation_delay
alldata.drop(["timestamp_first_active", "date_account_created", 
            "date_first_booking"], axis=1, inplace=True)

## Clean up out of whack age data and remove unneeded features

* For now, limit age to 15 < age < 100. Set others to -1
* Remove ID which isn't a tracer of anything in this data set.

In [31]:
# Clean up Age a bit.  Assume anything with age < 18 and age > 100 are fubar and set to -1
alldata['age'][(alldata['age'] < 15) | (alldata['age'] > 100)] = -1

# ID is likely not super informative, so let's drop it
alldata.drop(['id'], inplace=True, axis=1)

## Define and split the categorical values you want to use in a fit.

* for now, I'm fitting nearly everything that's left.
* Note that gender has M/F/prefer not to answer/Unknown, so not truely degenerate 
* Some of these may end up going later, but for now, let's see what we can do with ever feature.
* This likely *will* overfit the data

## Finally, pop the destination to y

* Rather than the string labels for countries, let's go a value between 0 and max_coutnries


In [33]:
# What Categorical Variables are we interested in?
categorical_variables = ['gender', 'language', 'signup_method',  'signup_flow', 
                         'affiliate_channel', 'affiliate_provider', 
                         'first_affiliate_tracked', 'signup_app', 
                         'first_device_type', 'first_browser']
alldataSplit = split_categorical_variables(alldata, categorical_variables)

In [52]:
# Now, resplit the data and test set
X = alldataSplit[alldataSplit["dataset"]=="train"]
X_test = alldataSplit[alldataSplit["dataset"]=="test"]
y = X.pop("country_destination")

X.drop(["dataset"], axis=1, inplace=True)
X_test.drop(["dataset", "country_destination"], axis=1, inplace=True)



In [65]:
le = LabelEncoder()
vals = le.fit_transform(y.values)
lb = LabelBinarizer()
lb.fit(range(np.amax(vals)+1))
T = lb.transform(y)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

# Let's try a gradiant boost classifier

* This wasn't behaving like I wanted. Namely, I need to read up on scoring a method like this, as I wasn't able to quickly 
    * see what features were important
    * get quantitative measurements of how each paramter was affecting us, which meant we didn't get an idea of how to tune the model.
* Need to read up on this algorithm

In [8]:
#xgb_model = XGBClassifier(max_depth=3, n_estimators=10, learning_rate=0.1)
#xgb_model.fit(X, y)

## Let's try a random forest classifier and see how that works.

In [43]:
rfc_model = OneVsRestClassifier(RandomForestClassifier(
                                n_estimators = 100, oob_score=True, n_jobs=-1, 
                                max_features_options="sqrt"))

In [None]:
rfc_model.fit(X, y)

In [50]:
fpr = dict()
trp = dict()
roc_auc = dict()
y

array([ 7,  7, 10, ...,  7,  7,  7])