# Expedia Kaggle Competition

https://www.kaggle.com/c/expedia-hotel-recommendations

### We worked with a lot of customer behavior data
- The logs include what customers searched for, how they interacted with the search results (click or book), and whether or not the search result was a travel package

### Expedia was interested in predicting
- which hotel group a user is going to book
- The important thing here is that the prediction target is the hotel group
- In other words, a characteristic of the actual hotel

## Data Leakage

The leak is in `destination_distance`: the distance from `user_city` to the actual hotel the user clicked on or booked. Our prediction target is a characteristic of that actual hotel. Furthermore, `destination_distance` was very precise, so unique (`user_city`, `destination_distance`) pairs corresponded to unique hotels.

### We could treat the (`user_city`, `destination_distance`) pair as a proxy for our target

* (`destination_distance`, `user_city`) pair 
  * is a leak to **true hotel location**
  * A lot of matches between train and test set
  * **However, there are new observations in the test set which do not appear in the train set**
* How to improve on that?
  1. Features based on counts over tuples of this kind
    * (ex) **how many hotels of which group** there are for each (`user_city`, `hotel_country`, `hotel_city`) triplet.
    * then we could train some machine learning model on such features
  2. Try to find the true coordinates
    * Find the true coordinates of `user_cities` and `hotel_cities`
    * Guess the `destination_distance` feature from those
    * Find a good approximation for the coordinates of the actual hotels
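
As a minimal sketch of using the leak directly, the lookup below maps each (`user_city`, `destination_distance`) pair seen in train to its most frequent hotel groups. The target column name `hotel_cluster`, the function name `build_leak_lookup`, and the `top_k` parameter are assumptions made for illustration; the notes do not show the original code.

```python
import pandas as pd

def build_leak_lookup(train, keys=("user_city", "destination_distance"),
                      target="hotel_cluster", top_k=5):
    """Map each key tuple to its most frequent target values (hypothetical helper)."""
    keys = list(keys)
    # Count how often each hotel group occurs for each key tuple.
    counts = (train.groupby(keys + [target]).size()
                   .rename("n").reset_index()
                   .sort_values("n", ascending=False, kind="stable"))
    # Rows are sorted by count, so head(top_k) keeps the most frequent groups.
    ranked = counts.groupby(keys)[target].apply(lambda s: s.head(top_k).tolist())
    return ranked.to_dict()
```

A test row whose pair is missing from the lookup gets no answer from the leak, which is why the count features and the coordinate reconstruction are still needed.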

## Spherical Geometry

### Haversine formula
source : https://en.wikipedia.org/wiki/Haversine_formula<br>

For any two points on a sphere, the haversine of the central angle between them is given by

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/47a496cca1b6d57e0ae7b462c1678660392d1057)

where

* hav is the haversine function: $\operatorname{hav}(\theta) = \sin^{2}\left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2}$
* d is the distance between the two points (along a great circle of the sphere; see spherical distance),
* r is the radius of the sphere,
* φ1, φ2: latitude of point 1 and latitude of point 2, in radians
* λ1, λ2: longitude of point 1 and longitude of point 2, in radians
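
Solving the formula for d gives d = 2r·arcsin(√(hav(φ2 − φ1) + cos φ1 · cos φ2 · hav(λ2 − λ1))). A minimal sketch in Python follows; the Earth radius of 6371 km and the choice of kilometers are assumptions, since the notes do not state the units of `destination_distance`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the competition data

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # hav(d / r) = hav(dphi) + cos(phi1) * cos(phi2) * hav(dlam)
    hav = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(hav))
```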

![spherical-geometry](../img/spherical-geometry.png)

## Hotel cities. Old version

Now suppose we have the true coordinates of three points (`user_city`, `hotel_city`, `hotel_country`) and the distances from a fourth point with unknown coordinates to each of them (`destination_distance`).
* If we write down a system of three equations, one for each distance, we can solve it unambiguously and get the true coordinates of the fourth point.
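
The notes do not say how this system was solved. One way to sketch it, as an assumption, is a small nonlinear least-squares fit with `scipy.optimize.least_squares`: the unknowns are the latitude and longitude of the fourth point, and each residual is the haversine distance to a known point minus the observed distance. The function name `locate` and the starting guess (the mean of the known points) are illustrative choices.

```python
import numpy as np
from scipy.optimize import least_squares

EARTH_RADIUS_KM = 6371.0  # assumed radius and units

def haversine_km(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance, inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    hav = (np.sin((lat2 - lat1) / 2) ** 2
           + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(hav, 0.0, 1.0)))

def locate(anchors, distances_km):
    """Estimate (lat, lon) of a point from its distances to known anchor points."""
    anchors = np.asarray(anchors, dtype=float)
    distances_km = np.asarray(distances_km, dtype=float)

    def residuals(p):
        # One equation per known point: modeled distance minus observed distance.
        return haversine_km(p[0], p[1], anchors[:, 0], anchors[:, 1]) - distances_km

    start = anchors.mean(axis=0)  # crude initial guess
    return least_squares(residuals, start).x
```

With exact distances the fit recovers the point almost exactly; with rounded distances the error of each newly located city feeds into the next one.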

![hotel-cities-old-version](../img/hotel-cities-old-version.png)

Now we have four points with known coordinates. So we first reverse engineer, by hook or by crook, the true coordinates of a few big cities. After that, we can iteratively find the coordinates of more and more cities.

### But as you see in the picture, some cities end up in oceans
* It means our algorithm is not very precise
* A rounding error accumulates after every iteration and everything starts to fall apart
* We need a different method, and indeed we can do better

## Hotel cities. New version

Remember how in the iterative method we solved a system of three equations to unambiguously find the coordinates of the fourth, unknown point. But why limit ourselves to three equations?

![hotel-cities-new-version](../img/hotel-cities-new-version.png)

### Let's create a giant system of equations from all known distances, with the true coordinates being the unknown variables.

We end up with literally hundreds of thousands of equations and tens of thousands of unknown variables. The good thing is that the system is very sparse: each equation involves only the coordinates of two cities.<br>

We can apply special methods from SciPy to solve such a system efficiently. In the end, after solving that system of equations, we get very precise coordinates for both `hotel_city` and `user_city`.
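
The notes do not name the SciPy routines that were used. As one possible sketch, `scipy.optimize.least_squares` accepts a `jac_sparsity` pattern, which lets it exploit the fact that each distance equation touches only two cities. The function name `solve_city_coordinates`, the split into fixed `known` cities and a `guess` for the rest, and the toy data layout are all assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

EARTH_RADIUS_KM = 6371.0  # assumed radius and units

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    hav = (np.sin((lat2 - lat1) / 2) ** 2
           + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(hav, 0.0, 1.0)))

def solve_city_coordinates(known, guess, pairs, distances_km):
    """Fit unknown city coordinates to all pairwise distances at once.

    known: (k, 2) fixed (lat, lon) of cities 0..k-1
    guess: (u, 2) initial (lat, lon) of cities k..k+u-1
    pairs: (m, 2) city index pairs with an observed distance
    """
    known = np.asarray(known, dtype=float)
    guess = np.asarray(guess, dtype=float)
    pairs = np.asarray(pairs)
    distances_km = np.asarray(distances_km, dtype=float)
    k = len(known)

    def residuals(x):
        coords = np.vstack([known, x.reshape(-1, 2)])
        a, b = coords[pairs[:, 0]], coords[pairs[:, 1]]
        return haversine_km(a[:, 0], a[:, 1], b[:, 0], b[:, 1]) - distances_km

    # Each equation depends only on the two coordinates of its unknown cities.
    sparsity = lil_matrix((len(pairs), guess.size), dtype=int)
    for row, (i, j) in enumerate(pairs):
        for city in (i, j):
            if city >= k:
                col = 2 * (city - k)
                sparsity[row, col] = 1
                sparsity[row, col + 1] = 1

    fit = least_squares(residuals, guess.ravel(), jac_sparsity=sparsity)
    return fit.x.reshape(-1, 2)
```

Because every city is constrained by many distances at once, rounding errors average out instead of accumulating from one iteration to the next.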

![user-cities-new-version](../img/user-cities-new-version.png)





### We're predicting a type (characteristic) of a hotel using city coordinates and destination distance

It's possible to find an **approximation of the true coordinates of an actual hotel**.
* When we fix a `user_city` and draw a circumference around it with a radius of `destination_distance`, it's obvious that the true hotel location must be somewhere on that circumference.
* Now, let's fix some `hotel_city` and draw such circumferences around every `user_city` with records for that fixed `hotel_city`, one for every given `destination_distance`.

After doing so, we end up with the pictures below.

![circumferences](../img/circumferences.png)

A city contains a limited number of hotels, so the intuition here is that **hotels actually are at the intersection points, and the more circumferences intersect at such a point, the more likely it is that a hotel is located there.**

## Counters in grid cells

![circum-messy](../img/circum-messy.png)

It's pretty messy and it seems impossible to operate in terms of individual points. However, there are **explicit clusters of points** and this information can be of use.<br>

We can do some kind of integration for every city. Let's create a grid around its center, something like 10 km by 10 km with a step size of 100 meters.

![counters-in-grid-cells](../img/counters-in-grid-cells.png)

We can count how many hotels of which type are present there. 
* If a circumference goes through a cell, we give plus one to the hotel type corresponding to that circumference.
* During inference, we also draw a circumference based on the `destination_distance` feature.
* We see which grid cells it went through and use the information from those cells to create features like **a sum of all counters, average of all counters, maximum of all counters, and so on**.
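
The counting scheme above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: coordinates are planar offsets in kilometers from the city center, a cell counts as crossed when its center lies within half a cell diagonal of the circumference (which may include a few cells the circumference only passes near), and all function names are hypothetical.

```python
import numpy as np

def cell_centers(half_size_km=5.0, step_km=0.1):
    """1-D coordinates of cell centers for a square grid centered on the city."""
    n = int(round(2 * half_size_km / step_km))
    return -half_size_km + step_km * (np.arange(n) + 0.5)

def crossed_cells(center_xy, radius_km, half_size_km=5.0, step_km=0.1):
    """Boolean (n, n) mask of cells that the circumference (approximately) crosses."""
    axis = cell_centers(half_size_km, step_km)
    gx, gy = np.meshgrid(axis, axis)
    dist = np.hypot(gx - center_xy[0], gy - center_xy[1])
    return np.abs(dist - radius_km) <= step_km * np.sqrt(2) / 2

def build_counters(circles, n_groups, half_size_km=5.0, step_km=0.1):
    """circles: iterable of (user_city_xy, destination_distance, hotel_group)."""
    n = len(cell_centers(half_size_km, step_km))
    counters = np.zeros((n_groups, n, n), dtype=int)
    for center_xy, radius_km, group in circles:
        # Plus one for this hotel group in every crossed cell.
        counters[group] += crossed_cells(center_xy, radius_km, half_size_km, step_km)
    return counters

def circle_features(counters, center_xy, radius_km, half_size_km=5.0, step_km=0.1):
    """Aggregate the counters of the cells crossed by a new circumference."""
    mask = crossed_cells(center_xy, radius_km, half_size_km, step_km)
    hits = counters[:, mask]  # shape: (n_groups, n_crossed_cells)
    if hits.shape[1] == 0:
        zeros = np.zeros(len(counters))
        return {"sum": zeros, "mean": zeros, "max": zeros}
    return {"sum": hits.sum(axis=1), "mean": hits.mean(axis=1), "max": hits.max(axis=1)}
```

Each returned array has one entry per hotel group, so the sum, mean and maximum become three feature columns per group.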



## Final model

We have covered the feature engineering part. Note that all of these features directly use the target label. **We cannot use them as is in training. We should generate them in out-of-fold fashion for the train data.**

### So we had training data for years 2013 and 2014. 
To generate features for 2014, we used labelled data from 2013, and vice versa. For the test set, which was from 2015, we naturally used all of the training data.<br>
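
A minimal sketch of this scheme for simple count features is shown below. The column names `year` and `hotel_cluster` and both function names are assumptions made for illustration; the same pattern would apply to the grid-based features.

```python
import pandas as pd

def group_counts(labelled, keys, target="hotel_cluster"):
    """Count how many rows of each hotel group exist per key tuple."""
    counts = labelled.groupby(keys + [target]).size().unstack(target, fill_value=0)
    return counts.add_prefix("cnt_cluster_")

def out_of_fold_counts(train, test, keys, fold_col="year", target="hotel_cluster"):
    """Train features come from the other fold; test features use all of train."""
    parts = []
    for fold in train[fold_col].unique():
        inside = train[train[fold_col] == fold]
        outside = train[train[fold_col] != fold]  # labels never come from the same fold
        feats = inside[keys].join(group_counts(outside, keys, target), on=keys)
        parts.append(feats.drop(columns=keys))
    train_feats = pd.concat(parts).reindex(train.index).fillna(0)
    test_feats = test[keys].join(group_counts(train, keys, target), on=keys)
    test_feats = test_feats.drop(columns=keys).reindex(columns=train_feats.columns).fillna(0)
    return train_feats, test_feats
```

With two folds (2013 and 2014) the loop reproduces the 2013 <-> 2014 swap described above.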

In the end we calculated a lot of features and fed them into an XGBoost model.

* Out-of-fold feature generation. 2013 <-> 2014
* XGBoost
* 16 hours of training
* Result : `3rd place in public / 4th place in private leaderboard`