# Location Prediction for Airbnb Berlin Data

We predict whether a given combination of features (price, bedrooms, room_type, accommodates) and location (latitude, longitude) exists in the Airbnb listings for Berlin.
More precisely, we estimate the probability that such a combination exists in the listings.

## Core assumptions

#### Date of request has to be similar to some dates in the listings

We assume that Rob tells us the features of his whereabouts at a date that is close to the dates covered by the listings. This means we are not going to extrapolate the evolution of prices. We therefore learn only from existing places and locations at dates close to the date at which Rob gives us the information.

#### Room type cannot be different from those in the training data

Because we are training on available data only and do not make assumptions concerning extrapolation, we can only predict on categorical data that is in the range of the training data. 

Strictly speaking, the same restriction applies to non-categorical data such as price, number of bedrooms, accommodates, and geographical position. However, we assume a reasonably smooth dependence on these variables, which allows us to predict on values of price, bedrooms, accommodates, latitude and longitude that were not present in the original data. Note that combinations of features and locations that do not exist in the original dataset are treated as negative samples: we build rows from existing feature values and existing locations and label them 'not in the listings'.

## Main results

We train a random forest on the available data to make the predictions. Core results:
* Predictions are better than those generated by random guesses
* Forests trained on balanced data can make useful predictions on unbalanced data
* The probability attributed to a given combination of features and locations **depends on the multiplicity of negative samples in the dataset to be evaluated**.

## Methods

The problem is treated as a binary classification problem. We consider each sample that exists in the listings a positive sample. Each combination of features and locations that is not in the listings is considered a negative sample. We construct negative samples by combining existing features and existing locations such that these combinations are not present in the listings. We then train random forests on datasets with a freely chosen ratio between negative and positive samples (within the limits given by the total number of possible combinations of distinct features and distinct locations). The resulting forest can estimate the probability that an unseen combination of features/locations (meaning one not present in the listings, possibly including unseen features and/or unseen locations) is present in the listings.
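The construction of negative samples can be sketched on a toy table. This is a minimal illustration that uses a pandas cross join followed by a left merge with an indicator column; it is not the code in `preprocess_dataClass`, and the table contents are made up.

```python
import pandas as pd

# Toy listings: each row is a positive sample (features and location seen together).
listings = pd.DataFrame({
    "price": [40.0, 40.0, 55.0],
    "room_type": ["Private room", "Private room", "Entire home/apt"],
    "latitude": [52.50, 52.56, 52.50],
    "longitude": [13.43, 13.30, 13.43],
})
feature_cols = ["price", "room_type"]
location_cols = ["latitude", "longitude"]

# Distinct feature sets and distinct locations observed in the listings.
features = listings[feature_cols].drop_duplicates()
locations = listings[location_cols].drop_duplicates()

# All feature/location combinations, flagged by whether they occur in the listings.
combos = features.merge(locations, how="cross")
combos = combos.merge(listings.drop_duplicates(), how="left",
                      on=feature_cols + location_cols, indicator=True)
combos["label"] = (combos["_merge"] == "both").astype(int)
combos = combos.drop(columns="_merge")

negatives = combos[combos["label"] == 0]
print(combos)
```

The full cross join is only feasible for small tables; for the real data the product of distinct features and distinct locations can be too large to hold in memory, so the negatives have to be sampled instead of enumerated.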

If the features/location combination that is requested exists in the listings (given a range of dates), we will simply perform a lookup on the existing data in the listings.

## Results

The predictive power of random forests was far better than that of neural networks or gradient boosting (results not shown). In particular, they are far faster to train for this simple classification task.

The forest does not directly predict whether a set of features/locations is in the listings. Instead, it yields a number between 0 and 1 that can be interpreted as a probability (the score requested in the homework). To predict a class from that probability (in the listings [1] or not [0]), one has to choose a **threshold** on the probability. In the evaluation section, we compare metrics such as precision, recall, and f1 as a function of the chosen threshold.
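The effect of the threshold can be illustrated with a few lines of NumPy. The helper `metrics_at_threshold` and the toy labels and scores are made up for this sketch, and counting a score equal to the threshold as positive is an assumption.

```python
import numpy as np

def metrics_at_threshold(y_true, y_proba, threshold):
    """Return (precision, recall, f1) for the class 'in the listings'."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_proba) >= threshold  # assumption: ties count as positive
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(precision), float(recall), float(f1)

y_true = [1, 1, 0, 0, 1, 0]                 # toy labels
y_proba = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]    # toy forest scores
for threshold in (0.3, 0.5, 0.8):
    print(threshold, metrics_at_threshold(y_true, y_proba, threshold))
```

Raising the threshold in this toy example trades recall for precision, which is the behaviour the curves in the evaluation section trace out.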

We find that the predictive power of the forest (measured in terms of the metrics at threshold 0.5) is mostly higher than that of a model that randomly decides whether or not a set of features/locations is in the listings. Here, the probability $p$ of a given combination being in the listings is taken to be the number of positive samples in the dataset ($N_+$) divided by the total number of samples in the dataset ($N_t$): $p=N_+/N_t$.

Under this assumption, we can calculate the expected value of the metrics $precision$, $recall$, and $f_1$. The predictive power of the random forest ($f_{precision}$) is then given as $$f_{precision} = precision_{forest}/precision_{random}\quad.$$

We also calculate the corresponding quantities for $recall$ and $f_1$.
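One way to arrive at these expected values is sketched below. It assumes that the random model labels each sample as positive independently with probability $p$, and it approximates the expected value of each ratio by the ratio of expected counts; the actual calculation used for the plots may differ in detail.

$$\begin{aligned}
E[TP] &= p\,N_+, \qquad E[FP] = p\,(N_t - N_+), \qquad E[FN] = (1-p)\,N_+ \\
precision_{random} &\approx \frac{p\,N_+}{p\,N_+ + p\,(N_t - N_+)} = \frac{N_+}{N_t} = p \\
recall_{random} &\approx \frac{p\,N_+}{p\,N_+ + (1-p)\,N_+} = p \\
f_{1,random} &\approx \frac{2\,p \cdot p}{p + p} = p
\end{aligned}$$

Under these assumptions all three baselines equal $p$, so that, for example, $f_{precision} \approx precision_{forest}/p$.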

We find that the metrics for a given balancedness of the validation data depend strongly on both the threshold **and the balancedness of the training data**. While the best performance is achieved when the same balancedness is used for both training and validation data, we find that we can get close to those values using balanced training and unbalanced validation data for a threshold of 0.5, if **we perform an analytical transformation of the probabilities of the forest that depends on how strongly we undersampled the possible negative samples in order to achieve a balanced training dataset __[link to publication](https://www3.nd.edu/~dial/publications/dalpozzolo2015calibrating.pdf)__**.
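The transformation from the linked publication maps a probability $p_s$ obtained on the undersampled data to $p = \beta p_s / (\beta p_s - p_s + 1)$, where $\beta$ is the fraction of negative samples kept by the undersampling. The following is a minimal sketch of that formula; the function name is made up here and the value of `beta` is purely illustrative.

```python
import numpy as np

def adjust_for_undersampling(p_s, beta):
    """Map probabilities from a model trained on undersampled negatives
    back to the original class ratio.

    p_s  : probabilities predicted by the model trained on undersampled data
    beta : fraction of negative samples kept when undersampling (0 < beta <= 1)
    """
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

beta = 0.1  # illustrative: one in ten negatives was kept for training
raw = np.array([0.0, 0.25, 0.5, 0.9, 1.0])
print(adjust_for_undersampling(raw, beta))
```

The transform leaves 0 and 1 unchanged, is the identity for `beta = 1`, and multiplies the odds $p_s/(1-p_s)$ by `beta`, so a raw score of 0.5 is pulled down to `beta / (1 + beta)`.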

We can understand the dependence of the predicted probability on the ratio in the **validation** data in the following way: Imagine that you are to check whether a list of combinations of one feature set and one million locations matches any of the listings in the Airbnb database. Imagine that all the locations are closely clustered around just one house in Berlin, and that only one of these combinations actually corresponds to a listing in the database. A prediction machine trained on a dataset with a vastly different ratio cannot be expected to make adequate predictions, because it has never had to discriminate between locations at this scale.

As a consequence, this means that we **cannot give** useful probabilities unless **we know beforehand the ratio of positive to negative samples in the evaluation dataset**. In other words, we would assign a **different** probability to the same combination of features/locations if the ratio of positive samples to negative samples in the dataset to be evaluated changes.

Finally, the predictive power also depends on the granularity of location discretization. While it is technically absolutely possible to use training data with one degree of granularity, and predict on valuation data of a different granularity, our tests have shown that the loss of power of prediction (as compared to the best achievable power at the same granularity and balancedness) is quite large (not shown).

We conclude:

**using the analytic transform, the probabilities generated by a forest trained with balanced data can be adjusted to yield reasonable values on unbalanced evaluation data; however, it is necessary to know how unbalanced the evaluation data is in order to make the adjustment**


## Code section

We start by importing some needed libraries.

In [16]:
import LocationPrediction.source_data
import LocationPrediction.preprocess_data
import LocationPrediction.predictor
from LocationPrediction.GmapsAirbnbExplorer import AirbnbExplorer
import sys
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle
import numpy as np
from IPython.display import display
import gmaps
%matplotlib inline

### Data downloading and cleaning

We are now downloading the data using an instance of the custom-made class `source_dataClass`.

In [17]:
sd = LocationPrediction.source_data.source_dataClass()
# download and clean data
# df_clean_csv = sd.download_and_clean() # takes time because it reads individual source csv files, combines them, cleans them
# read the cleaned csv into memory
df_clean_csv = sd.read_clean_csv()

At the end of this procedure, we have produced a clean csv file consisting of the concatenation of all source zip files. It is saved to disk and also exists in memory as a pandas DataFrame (`df_clean_csv`).

In [18]:
df_clean_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292651 entries, 0 to 292650
Data columns (total 11 columns):
room_type         292651 non-null object
accommodates      292651 non-null float64
bedrooms          292651 non-null float64
price             292651 non-null float64
latitude          292651 non-null float64
longitude         292651 non-null float64
last_modified     292651 non-null datetime64[ns]
survey_id         292651 non-null int64
year              292651 non-null int64
yearmonth         292651 non-null int64
yearmonthIndex    292651 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(4), object(1)
memory usage: 24.6+ MB


### Data preprocessing

The most demanding task in this section is the sampling of negative samples: how do we draw a number of samples without replacement from a list that is too large to be kept in memory (number of distinct features multiplied by number of distinct locations), while a small number of entries in this list are marked as positive and must not be drawn? See the code for details of our approach.
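One possible approach is sketched below; it is not necessarily the one implemented in `preprocess_dataClass`. Every combination is identified by a single integer `feature_index * n_locations + location_index`, random integers are drawn, and those that were already drawn or that are marked positive are rejected. The function name and the toy inputs are made up for illustration.

```python
import numpy as np

def sample_negative_pairs(n_features, n_locations, positive_pairs, n_samples, seed=0):
    """Draw distinct (feature_idx, location_idx) pairs that are not positive,
    without building the full n_features x n_locations table."""
    positives = {f * n_locations + l for f, l in positive_pairs}
    n_total = n_features * n_locations
    n_available = n_total - len(positives)
    n_samples = min(n_samples, n_available)  # cap at the available negatives

    rng = np.random.default_rng(seed)
    chosen = set()
    while len(chosen) < n_samples:
        # draw a batch of flat indices; keep those that are new and not positive
        batch = rng.integers(0, n_total, size=2 * (n_samples - len(chosen)) + 1)
        for flat in batch.tolist():
            if flat not in positives and flat not in chosen:
                chosen.add(flat)
                if len(chosen) == n_samples:
                    break
    # convert flat indices back to (feature_idx, location_idx)
    return [divmod(flat, n_locations) for flat in sorted(chosen)]

positive_pairs = [(0, 0), (1, 2), (3, 1)]
negatives = sample_negative_pairs(n_features=4, n_locations=3,
                                  positive_pairs=positive_pairs, n_samples=5)
print(negatives)
```

Only the positives and the drawn indices are held in memory. Rejection sampling is efficient as long as the requested number is small compared to the available negatives; when nearly all negatives are requested, many draws are rejected and enumerating the combinations in chunks is likely the better choice.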


In the following, we draw samples using an instance of the custom class `preprocess_dataClass`. It has 5 parameters:
* `min_date`: only sample from dates on or after this date
* `max_date`: only sample from dates on or before this date
* `n_cuts`: the level of granularity of the location data. `n_cuts`=2 means that all available location data will be mapped to a 2x2 grid of Berlin. If this setting is set to -1, no discretization is performed.
* `N_positive_samples_to_draw`: the number of positive samples that should exist in the training dataset. If this number is -1, all samples in the given date range are drawn. If the number is larger than the total number of available rows in the given date range, only the available number is being drawn (i.e., all samples are drawn but not more, we require uniqueness).
* `Neg_Multiplier`: gives the ratio of negative samples to positive samples in the final training data. If this ratio is 2, there will be twice as many negative samples as positive samples in the training data. If it is -1, the ratio of positive to negative samples in the training data will be identical to the ratio in the base data (all possible combinations of distinct features and distinct locations). If the resulting target number of negative samples is larger than the number of available negative samples (`N_distinct_features`*`N_distinct_locations`-`N_positive`), it is capped at the number of available negative samples.
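To make the role of `n_cuts` concrete, here is a minimal sketch of a grid discretization for one coordinate. It assumes that the grid spans the observed minimum and maximum and that every cell is represented by its midpoint; the convention actually used in `preprocess_dataClass` may differ, and the function name is made up.

```python
import numpy as np

def discretize_to_grid(values, n_cuts):
    """Replace each coordinate by the midpoint of its grid cell.

    The range [min, max] of `values` is split into `n_cuts` equal cells;
    n_cuts = -1 returns the values unchanged.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if n_cuts == -1 or hi == lo:
        return values
    edges = np.linspace(lo, hi, n_cuts + 1)
    mids = (edges[:-1] + edges[1:]) / 2
    # cell index of each value; the maximum is assigned to the last cell
    idx = np.minimum(((values - lo) / (hi - lo) * n_cuts).astype(int), n_cuts - 1)
    return mids[idx]

coords = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # toy values, not real coordinates
print(discretize_to_grid(coords, n_cuts=2))
```

Applying the same function to latitude and longitude separately yields an `n_cuts` x `n_cuts` grid, so the number of distinct locations is at most `n_cuts**2`.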

In [19]:
# data preprocessing up to the generation of a training dataframe
prpdata_train = LocationPrediction.preprocess_data.preprocess_dataClass(df=df_clean_csv)

# create a reduced dataset by restricting observations to fall into a specific date range
# this also creates distinct indices
min_date = '2017-02-01'
max_date = None
n_cuts = 5
N_positive_samples_to_draw = -1
Neg_Multiplier = 1
prpdata_train.create_training_data(min_date=min_date,
                                   n_cuts=n_cuts,
                                   N_positive_samples_to_draw=N_positive_samples_to_draw,
                                   Neg_Multiplier=Neg_Multiplier)

print(prpdata_train)
display(prpdata_train.df_training_categorical.head())

N_selection_positives: 16732
N_selection_distinct_features: 7114
N_selection_distinct_locations: 22
N_selection_distinct_combinations: 156508
N_selection_possible_negatives: 139776
N_training positives: 16732
N_training negatives: 16732
N_training total: 33464
N_training distinct features: 7114
N_training distinct locations: 22
beta_for_balanced_data: 0.11970581501831502
beta_for_training_data: 0.11970581501831502



| | label | price | bedrooms | accommodates | room_type | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 44.0 | 1.0 | 3.0 | Private room | 52.5595 | 13.2985 |
| 1 | 1 | 56.0 | 1.0 | 3.0 | Shared room | 52.4985 | 13.4255 |
| 2 | 1 | 24.0 | 1.0 | 1.0 | Private room | 52.3755 | 13.5525 |
| 3 | 1 | 97.0 | 0.0 | 4.0 | Entire home/apt | 52.5595 | 13.4255 |
| 4 | 1 | 16.0 | 1.0 | 4.0 | Shared room | 52.4985 | 13.4255 |


The results of this operation are summarized above. The resulting dataset contains 7 columns: the label, the feature columns, and the location columns. Only the room_type column is considered categorical and will be transformed to a one_hot_vector for training.
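The one-hot encoding of `room_type` can be sketched with `pandas.get_dummies`. Whether the package uses this function internally is not shown here, so treat the following as an illustration with made-up rows. It also shows why a room type unknown to the training data cannot be handled: a request has to be mapped onto the columns seen during training.

```python
import pandas as pd

df = pd.DataFrame({
    "label": [1, 1, 0],
    "price": [44.0, 97.0, 24.0],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# One indicator column per room type; numeric columns are left untouched.
df_one_hot = pd.get_dummies(df, columns=["room_type"])
train_columns = df_one_hot.columns.drop("label")

# A request at prediction time is mapped onto the same columns, in the same order.
request = pd.DataFrame({"price": [50.5], "room_type": ["Entire home/apt"]})
request_one_hot = (pd.get_dummies(request, columns=["room_type"])
                   .reindex(columns=train_columns, fill_value=0))
print(request_one_hot)
```

With `reindex`, indicator columns missing from the request are filled with 0, and a column for a room type that never occurred in training would be dropped silently, which is why such requests should be rejected beforehand.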

## Prepare training dataset and fit a random forest

We continue by taking the training data and splitting it into predictive features (`X`: features and locations) and the resulting class (`y`: `label`). Note that we are not creating a separate validation dataset. The validation has been performed beforehand (not shown); some of the results are shown in the evaluation section.

We first create a random forest (prior tests have shown that a number of estimators of 500 is reasonable for most situations; for better performance, the number of estimators should be higher than the expected ratio of Negative to Positive samples).

The default settings of the RandomForestClassifier have proven to show reasonable results. 

Finally, we save the resulting forest to disk. Note that this training, performed on large datasets, can take quite a while (in particular if one performs training on unbalanced data which can be quite large).

In [20]:
# create a random forest and train it with the training data
do_train = True
if do_train:
    rf = RandomForestClassifier(n_estimators=500,
                                random_state=42,
                                class_weight='balanced',
                                n_jobs=-1)

    df_train = prpdata_train.df_training_one_hot_vector.copy()
    X_df_train = df_train.drop(columns=[prpdata_train.label_column_name])
    X_train = X_df_train.values

    y_df_train = df_train.drop(columns=df_train.columns.difference([prpdata_train.label_column_name]))
    y_train = y_df_train.values

    rf.fit(X_train,y_train.ravel())

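    # save the trained forest to disk
    filename = 'my_rf.pkl'
    with open(filename, 'wb') as f:
        pickle.dump(rf, f)
else:
    # load a previously trained forest from disk
    filename = 'my_rf.pkl'
    with open(filename, 'rb') as f:
        rf = pickle.load(f)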

## Use the random forest for a prediction

We now proceed to the prediction process. We first create a dataframe that only contains the samples that are considered to be positive (as the prediction method will try to find matching combinations of features/locations in this dataset).

In [21]:
# collect the distinct positive samples (known listings);
# the predictor looks for exact matches in this dataframe
df_with_known_matches = prpdata_train.df_selection_categorical_discretized_locations.copy()
df_with_known_matches = df_with_known_matches[prpdata_train.columns_to_keep_categorical].drop_duplicates()
df_with_known_matches = df_with_known_matches[df_with_known_matches[prpdata_train.label_column_name]==1].drop(columns=[prpdata_train.label_column_name])
# display(df_with_known_matches.head(n=10))

We then create an instance of the custom class `predictorClass`.

In [22]:
# instantiate a prediction object
predictor = LocationPrediction.predictor.predictorClass(model=rf,
                                                        preprocessed_data_for_training=prpdata_train,
                                                        df_with_known_matches=df_with_known_matches)

We now come to the specification of the validation data as required by this homework task. Rob gives us a specification of features (saved in `my_dict`). Any number of locations can be specified by passing a dataframe to the method `set_locations_for_prediction`.

For simplicity, by default, we take all existing locations that are present in the training dataset (argument `None` to that method).

In [23]:
# define the features and locations for which we wish to predict
my_dict = {'price': 50.5,
           'room_type':'Entire home/apt',
           'bedrooms': 1,
           'accommodates': 2}
predictor.set_features_for_prediction(int_or_dict=my_dict)
predictor.set_locations_for_prediction(df_with_locations=None)

We then create all possible combinations of the set of features (one row) and the set of locations (arbitrary number of rows). The resulting dataframe contains one feature/location combination per row and has as many rows as the number of locations passed to the object.

In [24]:
# create the cross join of features and locations to generate a dataframe for prediction
predictor.set_features_and_locations()
# display(predictor.df_features_and_locations_to_predict)
# display(predictor.df_features_and_locations_one_hot_vector_to_predict)

We now proceed to the prediction. Given the trained forest and the dataframe that combines the features given by Rob with the locations that we passed, we calculate for each row the probability that this combination exists in the listings. We adjust these probabilities using an analytic function of `beta` that corrects for undersampling (as explained above). `beta` would have to be changed if the ratio of negative to positive samples in the evaluation set were different from that of the training set. This is not done here: we only implemented an automatic choice of `beta` based on the training dataset, that is, we assume that the data to be validated has the same ratio of positive to negative samples as the training data. This can easily be changed to reflect the balancedness of the validation set (not done).
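How `beta` could be derived for an evaluation set with a known class ratio is sketched below. The function is hypothetical and not part of `predictorClass`. It relies on the fact that the adjustment multiplies the odds of the forest's probability by `beta`, so `beta` has to bridge the negative-to-positive ratio of the training data and that of the evaluation data.

```python
def beta_for_evaluation(n_pos_eval, n_neg_eval, neg_multiplier_train=1.0):
    """Odds scaling factor for the undersampling adjustment.

    n_pos_eval, n_neg_eval : positive / negative counts expected in the evaluation data
    neg_multiplier_train   : negatives per positive in the training data (1 = balanced)
    """
    if n_pos_eval <= 0 or n_neg_eval <= 0:
        raise ValueError("both class counts must be positive")
    return neg_multiplier_train * n_pos_eval / n_neg_eval

# Counts printed by the preprocessing step above:
# 16732 positives and 139776 possible negatives, balanced training data.
beta = beta_for_evaluation(n_pos_eval=16732, n_neg_eval=139776)
print(beta)
```

For the counts printed by the preprocessing step this gives roughly 0.1197, in line with the reported `beta_for_training_data`. An evaluation set with relatively more negatives would lead to a smaller `beta` and therefore to lower adjusted probabilities for the same forest output.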

In [25]:
# calculate the probabilities to find a given combination of features and locations
y_probas =  predictor.predict_probas(beta=prpdata_train.beta_for_training_data)
# y_probas = predictor.predict(beta=1)

# fill the predictor's dataframe with the results
predictor.fill_prediction_dataframe_with_probas(y_probas=y_probas)

df_predictions = predictor.df_features_and_locations_to_predict

The result of this operation is a dataframe containing features/locations, probabilities, and classes (obtained by applying a threshold of 0.5 to the `beta`-adjusted probabilities of the forest).

In [26]:
# print results
display(df_predictions.head(n=5))

| | price | bedrooms | accommodates | room_type | latitude | longitude | probability | class | prediction_method |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4985 | 13.4255 | 1.0 | True | prediction |
| 1 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4375 | 13.4255 | 0.983535 | True | prediction |
| 2 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4985 | 13.1705 | 0.359155 | False | prediction |
| 3 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.5595 | 13.4255 | 1.0 | True | prediction |
| 4 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4985 | 13.2985 | 1.0 | True | prediction |


We sort the dataframe by probability so that the most likely locations come first.

In [27]:
display(df_predictions.sort_values(by='probability',ascending=False).head(n=5))

| | price | bedrooms | accommodates | room_type | latitude | longitude | probability | class | prediction_method |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4985 | 13.4255 | 1.0 | True | prediction |
| 9 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4375 | 13.2985 | 1.0 | True | prediction |
| 3 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.5595 | 13.4255 | 1.0 | True | prediction |
| 4 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.4985 | 13.2985 | 1.0 | True | prediction |
| 5 | 50.5 | 1.0 | 2.0 | Entire home/apt | 52.5595 | 13.2985 | 1.0 | True | prediction |


## Graphical representation of results

We have written a custom class that shows a heatmap of the results. Predictions and actual results can be compared graphically by slightly changing, for example, the price such that it takes on a value that is not in the listings (but close to one that is) and running the above operation twice: once for a feature set in the listings and once for a feature set slightly outside.

In [28]:
AirbnbExplorer(df_predictions).render()

VBox(children=(HTML(value='<h3>Airbnb Berlin Location Prediction</h3><h4>Data from <a href="http://tomslee.net…

In [29]:
AirbnbExplorer(df_predictions).render()

VBox(children=(HTML(value='<h3>Airbnb Berlin Location Prediction</h3><h4>Data from <a href="http://tomslee.net…

## Evaluation

Finally, we show some evaluation of the data. The following are only images, no live evaluation is implemented. The code for this evaluation is based on jupyter notebooks that are organised a little bit differently from the simplified classes implemented here.

The following graphs show the dependence of three metrics ($precision$, $recall$, $f_1$) on the threshold. The main idea is to compare the predictive performance of a forest trained on balanced data (green lines) (which is vastly faster than training on complete unbalanced data) to a forest trained on unbalanced data (blue lines).

To show the success of the analytical transform mentioned above, we also show the effect of applying this transform (red lines). We see that the application of this transform is very successful, increasing the predictive power of the forest at the threshold 0.5 greatly.

We conclude that the transform is a useful tool to take the probabilities given by a forest trained on balanced data and transform them into probabilities that are suitable for evaluating unbalanced datasets.

The plots below show both the metric ($prec$, $reca$, $f_1$) as a function of the threshold and how far the method is above a random guess ($f\_prec$, $f\_reca$, $f\_f_1$), where the $f\_$-values are defined as $metric_{forest}/expected\_metric\_when\_assuming\_random\_guessing$.

The curves below correspond to training data between 2017-02-01 and 2017-07-31 (the last available date in the listings). Locations have been discretized using a 10x10 grid. Validation data covers the date range from 2017-01-01 to 2017-01-31 and is unbalanced.

Metric f1: harmonic mean of recall and precision.
![f1](img/f1.png)

Metric f_f1: f1 (the harmonic mean of recall and precision) relative to the random baseline. The plot shows how far this measure is above the purely random approach that uses just the probability of positive samples in the validation dataset.
![f_f1](img/f_f1.png)

Metric precision 
![prec](img/prec.png)

Metric f_precision 
![f_prec](img/f_prec.png)

Metric recall 
![reca](img/reca.png)

Metric f_recall 
![f_reca](img/f_reca.png)