## Expedia Hotel Recommendation 

In this challenge we need to predict which hotel cluster a user will book, based on the attributes of a search that user made on Expedia.

Before getting into the prediction problem let's first understand the columns.

### Expedia - Contextualizing

Looking at the Expedia website gives us the following information.

The text box labelled *Going To* maps to the *srch_destination_type_id*, *hotel_continent*, *hotel_country*, and *hotel_market* fields in the data.

The text box labelled *Check-in* maps to the *srch_ci* field in the data, and the box labelled *Check out* maps to *srch_co* field in the data.

The box labelled *Guests* maps to the *srch_adults_cnt*, *srch_children_cnt*, and *srch_rm_cnt* fields in the data.

The text box labelled *Add a Flight* maps to the *is_package* field in the data.

*site_name* is the name of the site the user visited; it can be the main *Expedia* site or another one.

In [2]:
import pandas as pd

destinations = pd.read_csv("../../Kaggle Data/Expedia/Data/destinations.csv")
test = pd.read_csv("../../Kaggle Data/Expedia/Data/test.csv")
train = pd.read_csv("../../Kaggle Data/Expedia/Data/train.csv")


In [3]:
train.shape

(37670293, 24)

In [4]:
test.shape

(2528243, 22)

We have about 37.7 million training rows and about 2.5 million test rows, which makes this data set challenging to work with.

In [5]:
train.head()

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 07:46:59,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,0,3,2,50,628,1
1,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
2,2014-08-11 08:24:33,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,8250,1,0,1,2,50,628,1
3,2014-08-09 18:05:16,2,3,66,442,35390,913.1932,93,0,0,...,0,1,14984,1,0,1,2,50,1457,80
4,2014-08-09 18:08:18,2,3,66,442,35390,913.6259,93,0,0,...,0,1,14984,1,0,1,2,50,1457,21


Some analysis on the data: 

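 * *date_time* is very useful, but it needs to be converted from a string to a datetime first.
 * Most of the columns are integers or floats whose meaning is not documented, so there is little feature engineering we can do with them directly because we don't know exactly what each value means.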

In [6]:
test.head()

Unnamed: 0,id,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,...,srch_ci,srch_co,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,hotel_continent,hotel_country,hotel_market
0,0,2015-09-03 17:09:54,2,3,66,174,37449,5539.0567,1,1,...,2016-05-19,2016-05-23,2,0,1,12243,6,6,204,27
1,1,2015-09-24 17:38:35,2,3,66,174,37449,5873.2923,1,1,...,2016-05-12,2016-05-15,2,0,1,14474,7,6,204,1540
2,2,2015-06-07 15:53:02,2,3,66,142,17440,3975.9776,20,0,...,2015-07-26,2015-07-27,4,0,1,11353,1,2,50,699
3,3,2015-09-14 14:49:10,2,3,66,258,34156,1508.5975,28,0,...,2015-09-14,2015-09-16,2,0,1,8250,1,2,50,628
4,4,2015-07-17 09:32:04,2,3,66,467,36345,66.7913,50,0,...,2015-07-22,2015-07-23,2,0,1,11812,1,2,50,538


Some analysis on the test data:

* Dates in test.csv are later than dates in train.csv.
* User ids in test.csv appear to be a subset of the user ids in train.csv, given the overlapping integer ranges.


### What we're predicting

We'll be predicting which *hotel_cluster* a user will book after a given search. There are 100 clusters.

**Scoring metric**

The scoring metric is Mean Average Precision at 5 (MAP@5). We make 5 cluster predictions for each row, and each row is scored on whether the correct cluster appears in our list. The earlier the correct cluster appears in the list, the more points we get.

For example, if the "correct" cluster is 3, and we predict 4,43,60,3,20 our score will be lower than if we predict 3,4,43,60,20. So, we should put predictions which we're more certain about earlier in our list of predictions.
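To make the ranking effect concrete, here is a minimal sketch of average precision at k written from the definition of the metric. The function names `average_precision_at_k` and `mean_average_precision_at_k` are made up for this illustration; this is not the scoring code used by Kaggle or by any package.

```python
def average_precision_at_k(actual, predicted, k=5):
    """Average precision at k for a single row.

    actual: list of correct labels (one cluster per row in this challenge).
    predicted: ranked list of predicted labels, best guess first.
    """
    if not actual:
        return 0.0
    predicted = predicted[:k]
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted):
        # Count a label as a hit only the first time it appears in the list.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k)


def mean_average_precision_at_k(actual, predicted, k=5):
    """Mean of the per-row average precision."""
    total = sum(average_precision_at_k(a, p, k) for a, p in zip(actual, predicted))
    return total / len(actual)


# The two orderings from the example above, with 3 as the correct cluster.
early = average_precision_at_k([3], [3, 4, 43, 60, 20])
late = average_precision_at_k([3], [4, 43, 60, 3, 20])
print(early, late)  # prints 1.0 0.25
```

With a single correct cluster per row, the score for a row is simply 1 divided by the position of the correct cluster in the list, or 0 if it is missing.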

**Exploring hotel clusters**

In [7]:
train["hotel_cluster"].value_counts().head(20)

91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
98     589178
59     570291
42     551605
21     550092
70     545572
18     545284
83     534132
46     534038
25     530591
62     518809
95     509266
28     507016
68     503797
82     503755
Name: hotel_cluster, dtype: int64

In [10]:
train["hotel_cluster"].value_counts().tail(20)

7     252447
54    250745
92    244343
89    243560
45    241408
49    240124
3     225250
80    220218
60    217919
71    216054
93    214293
86    209054
14    192299
75    165226
24    164127
35    139122
53    134812
88    107784
27    105040
74     48355
Name: hotel_cluster, dtype: int64

The number of rows in each cluster is not evenly distributed: the counts above range from about 48 thousand (cluster 74) to just over 1 million (cluster 91). There doesn't appear to be any relationship between cluster number and number of items.

**Exploring train and test user ids**

We hypothesized that all the user ids in the test DataFrame can also be found in the train DataFrame. We can check this by finding the unique values of user_id in test and seeing whether they all exist in train.

In [11]:
test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)

True

Yay! So we were correct.

**Downsampling our Kaggle data**

There are about 37 million rows in our training set, which makes it hard to experiment with different techniques.

In [12]:
train["date_time"] = pd.to_datetime(train["date_time"])
train["year"] = train["date_time"].dt.year
train["month"] = train["date_time"].dt.month

**Pick 10,000 users**

In [14]:
import random

unique_users = train.user_id.unique()

sel_user_ids = [unique_users[i] for i in sorted(
        random.sample(range(len(unique_users)),10000))]
sel_train = train[train.user_id.isin(sel_user_ids)]

MemoryError: 

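The above code is meant to create a DataFrame called sel_train that only contains data from 10,000 randomly chosen users. In the run recorded here it raised a MemoryError instead, most likely because of the size of train.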
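Filtering the full 37-million-row DataFrame in memory can fail, as the MemoryError above shows. One possible workaround, sketched here under assumptions rather than taken from the original notebook, is to read the CSV in chunks and keep only the rows of the sampled users. The helper name `sample_users_in_chunks`, the chunk size, and the toy file are all hypothetical; `chunksize` and `usecols` are standard `pd.read_csv` parameters.

```python
import os
import random
import tempfile

import pandas as pd


def sample_users_in_chunks(csv_path, n_users, chunksize=1_000_000, seed=0):
    """Return the rows of a random sample of users without loading the whole file."""
    # Pass 1: read only the user_id column to collect the distinct ids.
    user_ids = set()
    for chunk in pd.read_csv(csv_path, usecols=["user_id"], chunksize=chunksize):
        user_ids.update(chunk["user_id"].unique().tolist())

    rng = random.Random(seed)
    chosen = rng.sample(sorted(user_ids), min(n_users, len(user_ids)))

    # Pass 2: keep only the rows that belong to the chosen users.
    parts = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        parts.append(chunk[chunk["user_id"].isin(chosen)])
    return pd.concat(parts, ignore_index=True)


# Tiny stand-in for train.csv so the sketch can be run anywhere.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3, 4],
    "hotel_cluster": [5, 5, 9, 1, 1, 2, 7],
})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_train.csv")
    toy.to_csv(path, index=False)
    sel_train_demo = sample_users_in_chunks(path, n_users=2, chunksize=3)

print(sel_train_demo["user_id"].nunique())  # prints 2
```

Reading the file twice trades extra I/O for a much smaller peak memory footprint, since only one chunk plus the selected rows are held at any time.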

**Pick new training and testing sets**

In [None]:
t1 = sel_train[(sel_train.year == 2013) | ((sel_train.year == 2014) 
                                           & (sel_train.month < 8))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >=8))]

Here, we picked new training and testing sets from *sel_train*, named t1 and t2 respectively.

In the original train and test dataframes, test contained data from 2015, and train contained data from 2013 and 2014. We split the data so that anything after *July 2014* is in *t2*, and anything before is in *t1*. 

**Remove click events**

If is_booking is 0, the row represents a click; a 1 represents a booking. We need to filter t2 so that it contains only bookings, as the original test dataset does.

In [None]:
t2 = t2[t2.is_booking == True]

One of the simplest things we can try on this data is to find the most common clusters and use them as predictions.

In [None]:
most_common_clusters = list(train.hotel_cluster.value_counts().head().index)

In [None]:
most_common_clusters

**Generating predictions**

We can turn *most_common_clusters* into a list of predictions by making the same predictions for each row.

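Note that *most_common_clusters* is computed from the full *train* DataFrame, which also contains the rows that end up in t2, so this baseline leaks a little information from the evaluation period.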

In [None]:
del train

In [None]:
predictions = [most_common_clusters for i in range(t2.shape[0])]

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.

**Evaluating error**

In order to evaluate error, we'll need to figure out how to compute Mean Average Precision. Ben Hamner has written an implementation in the *ml_metrics* package.

We will use this package and compute our error metric with the mapk method in ml_metrics.

In [None]:
import ml_metrics as metrics
target = [[l] for l in t2["hotel_cluster"]]
metrics.mapk(target,predictions, k=5)

Our target needs to be in list-of-lists format for mapk to work, so we convert the *hotel_cluster* column of t2 into a list of lists. Then we call the mapk method with our target, our predictions, and the number of predictions we want to evaluate.

**Principal Component Analysis**

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 3)
dest_small = pca.fit_transform(
    destinations[["d{0}".format(i+1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
dest_small["srch_destination_id"] = destinations["srch_destination_id"]

The above code compresses the 149 columns in destinations down to 3 columns and creates a new DataFrame called dest_small. The aim is to preserve most of the variance in destinations so that we don't lose a lot of information, while saving a lot of runtime for a machine learning algorithm.
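How much variance 3 components actually keep can be checked with the `explained_variance_ratio_` attribute of a fitted PCA. The sketch below uses synthetic data as a stand-in for destinations (the random matrix and the names `demo` and `pca_check` are assumptions for this illustration), so the number it prints says nothing about the real destinations file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for destinations: 149 columns named d1..d149,
# built from 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 149))
noise = 0.01 * rng.normal(size=(200, 149))
demo = pd.DataFrame(latent @ mixing + noise,
                    columns=["d{0}".format(i + 1) for i in range(149)])

pca_check = PCA(n_components=3)
pca_check.fit(demo)

# Fraction of the total variance captured by the 3 retained components.
retained = pca_check.explained_variance_ratio_.sum()
print("variance retained by 3 components: {:.3f}".format(retained))
```

Running the same check on the real destinations columns would show whether 3 components are enough or whether n_components should be raised.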

**Generating features**

Now that we have preprocessed our data to some extent, we will generate and clean our features.

* Generate new date features based on date_time, srch_ci and srch_co.
* Remove non-numeric columns like date_time.
* Add in features from dest_small.
* Replace any missing values with -1.

In [None]:
def calc_fast_features(df):
    # Work on a copy so the caller's DataFrame (a slice of sel_train) is not modified.
    df = df.copy()
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format='%Y-%m-%d', errors='coerce')
    df["srch_co"] = pd.to_datetime(df["srch_co"], format='%Y-%m-%d', errors='coerce')

    # Calendar features from the search timestamp.
    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)

    # Carry over every column that is not a raw date column.
    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]

    # Calendar features from the check-in and check-out dates.
    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    # Length of stay in hours (NaN when either date is missing).
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).dt.total_seconds() / 3600

    ret = pd.DataFrame(props)

    # DataFrame.join matches `on` against the other frame's index, so index
    # dest_small by srch_destination_id before joining.
    ret = ret.join(dest_small.set_index("srch_destination_id"),
                   on="srch_destination_id", how='left')
    return ret

df = calc_fast_features(t1)
df.fillna(-1, inplace=True)

**Machine Learning**

In [15]:
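# sklearn.cross_validation was removed; cross_val_score now lives in model_selection.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# dest_small contributes integer column names; recent scikit-learn versions
# expect all feature names to be strings.
df.columns = df.columns.astype(str)
predictors = [c for c in df.columns if c not in ["hotel_cluster"]]

clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores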

NameError: name 'df' is not defined