<h1><center> PROPENSITY TO INVEST </center></h1>
<h2><center> ENSEMBLE CLASSIFICATION METHODS </center></h2>

## 1. VOTING CLASSIFIERS

Supervised learning algorithms search through a hypothesis space to find a hypothesis that makes good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are very well suited to the problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. The broader term multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used in ensemble methods (for example, random forests), although slower algorithms can benefit from ensemble techniques as well.

By analogy, ensemble techniques have been used also in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.

Empirically, ensembles tend to yield better results when there is significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[9] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.
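To make the diversity argument concrete, here is a minimal sketch under a strong, idealized assumption that the text above does not make: every model is correct with the same probability `p`, and the models err independently of each other. Under that assumption the accuracy of a simple majority vote is a binomial tail probability. The function name `majority_vote_accuracy` is purely illustrative.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n_models independent voters is correct.

    Idealized assumption: each model is right with probability p and the
    errors are independent. n_models must be odd so that ties cannot occur.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Accuracy of the vote as the ensemble grows, for individually weak models.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Models trained on the same data make correlated errors, so the gain in practice is smaller than this calculation suggests, which is one way to see why diversity among the models matters.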

## 2. PREDICTING REAL ESTATE BUYER'S LEVEL OF INTEREST

As you may well know, in most Kaggle competitions the winners resort to stacking or meta-ensembling, a technique that combines several 1st level predictive models to generate a 2nd level model which tends to outperform all of them.

This usually happens because the 2nd level model is somewhat able to exploit the strengths of each 1st level model where they perform best, while smoothing the impact of their weaknesses in other parts of the dataset.

There are different methods and "schools of thought" on how stacking can be performed. If you are interested in this topic, I suggest you have a look at this and this to start.
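The idea of feeding 1st level predictions into a 2nd level model can be sketched with scikit-learn's `StackingClassifier` (available in scikit-learn 0.22 and later). This is only one of the possible approaches and is not what the rest of this notebook uses. The synthetic data, the choice of models and the names `X_demo`, `y_demo`, `level1` and `stack` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for a real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# 1st level models: their out-of-fold class probabilities become the
# input features of the 2nd level model.
level1 = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
]

# 2nd level model: a logistic regression fitted on the 1st level outputs.
stack = StackingClassifier(
    estimators=level1,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)
stack_proba = stack.predict_proba(X_te)
print(stack_proba.shape)
```

The `cv=5` setting means the 2nd level model is trained on predictions the 1st level models made for samples they had not seen, which limits leakage from the training labels.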

Here I will show a simple technique known as "Soft Voting", which can be implemented with scikit-learn's `VotingClassifier`.

More can be found at https://www.kaggle.com/den3b81/better-predictions-stacking-with-votingclassifier/notebook

In [4]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier 
from xgboost import XGBClassifier



In [5]:
df = pd.read_json("train.json")  # pandas opens and closes the file itself

In [6]:
df["num_photos"] = df["photos"].apply(len)
df["num_features"] = df["features"].apply(len)
df["num_description_words"] = df["description"].apply(lambda x: len(x.split(" ")))
df["created"] = pd.to_datetime(df["created"])
df["created_year"] = df["created"].dt.year
df["created_month"] = df["created"].dt.month
df["created_day"] = df["created"].dt.day

In [7]:
num_feats = ["bathrooms", "bedrooms", "latitude", "longitude", "price",
             "num_photos", "num_features", "num_description_words",
             "created_year", "created_month", "created_day"]
X = df[num_feats]
y = df["interest_level"]
X.head()

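| | bathrooms | bedrooms | latitude | longitude | price | num_photos | num_features | num_description_words | created_year | created_month | created_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1.0 | 1 | 40.7108 | -73.9539 | 2400 | 12 | 7 | 77 | 2016 | 6 | 16 |
| 6 | 1.0 | 2 | 40.7513 | -73.9722 | 3800 | 6 | 6 | 131 | 2016 | 6 | 1 |
| 9 | 1.0 | 2 | 40.7575 | -73.9625 | 3495 | 6 | 6 | 119 | 2016 | 6 | 14 |
| 10 | 1.5 | 3 | 40.7145 | -73.9425 | 3000 | 5 | 0 | 95 | 2016 | 6 | 24 |
| 15 | 1.0 | 0 | 40.7439 | -73.9743 | 2795 | 4 | 4 | 41 | 2016 | 6 | 28 |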


In [8]:
# random state for reproducing same results
random_state = 54321

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=random_state)

For this little experiment we will combine the predictions of 4 different classifiers, trained with basic parametrization. Tuning the parameters is beyond the scope of this notebook.

The classifiers are:

>-  RandomForestClassifier with "entropy" criterion 
>-  RandomForestClassifier with "gini" criterion 
>-  Sklearn GradientBoostingClassifier 
>-  XGBoost classifier

In [9]:
rf1 = RandomForestClassifier(n_estimators=250, criterion='entropy', n_jobs=-1, random_state=random_state)
rf1.fit(X_train, y_train)
y_val_pred = rf1.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.6307891949598801

In [10]:
rf2 = RandomForestClassifier(n_estimators=250, criterion='gini', n_jobs=-1, random_state=random_state)
rf2.fit(X_train, y_train)
y_val_pred = rf2.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.632009390517281

In [11]:
gbc = GradientBoostingClassifier(random_state=random_state)
gbc.fit(X_train, y_train)
y_val_pred = gbc.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.6392365307706523

In [12]:
xgb = XGBClassifier(random_state=random_state)  # random_state replaces the older seed argument
xgb.fit(X_train, y_train)
y_val_pred = xgb.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.6479794418970433

#### SOFT VOTING

It looks like rf1 tops all the other classifiers, with XGB coming last (as I said before, we are not concerned with tuning the parameters).

Let's now see whether we can improve performance by combining model predictions through "Soft Voting". Soft voting entails computing a weighted sum of the predicted probabilities of all models for each class. If the weights are equal for all classifiers, then the probabilities are simply the averages for each class.

The predicted label is the argmax of the summed prediction probabilities.
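Written out, with $p_m(c \mid x)$ the probability that classifier $m$ assigns to class $c$ for sample $x$, and $w_m$ the weight of that classifier (this notation is introduced here only for illustration), the rule reads:

$$
\hat{y}(x) = \arg\max_{c} \sum_{m=1}^{M} w_m \, p_m(c \mid x)
$$

Dividing the sum by $\sum_{m} w_m$ turns it into a weighted average, which is again a probability distribution over the classes, and does not change the argmax.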

Soft voting can be easily performed using a `VotingClassifier`, which will "retrain" the model prototypes we specified before. Please note that one could avoid retraining the models (in this case) and manually perform the soft voting using the output of each classifier's `predict_proba` method.
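The manual route could look roughly like the sketch below. It assumes that every probability array has shape `(n_samples, n_classes)` with the class columns in the same order for all models. The helper name `soft_vote` and the toy numbers are made up for illustration.

```python
import numpy as np


def soft_vote(probas, weights=None):
    """Weighted average of per-model class probabilities.

    probas  : list of arrays, each of shape (n_samples, n_classes), with the
              class columns in the same order for every model.
    weights : optional list with one non-negative weight per model;
              None means equal weights.
    """
    stacked = np.stack(probas)  # shape (n_models, n_samples, n_classes)
    return np.average(stacked, axis=0, weights=weights)


# Toy probabilities for 2 samples and 3 classes (made-up numbers).
p_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_b = np.array([[0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])

equal = soft_vote([p_a, p_b])
weighted = soft_vote([p_a, p_b], weights=[3, 1])
print(equal.argmax(axis=1), weighted.argmax(axis=1))
```

In this notebook the list would be something like `[rf1.predict_proba(X_val), rf2.predict_proba(X_val), gbc.predict_proba(X_val), xgb.predict_proba(X_val)]`, after checking that the `classes_` attribute of each fitted model lists the classes in the same order.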

In [13]:
eclf = VotingClassifier(estimators=[
    ('rf1', rf1), ('rf2', rf2), ('gbc', gbc), ('xgb',xgb)], voting='soft')
eclf.fit(X_train, y_train)
y_val_pred = eclf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.6228273354540116

#### SOFT VOTING WITH WEIGHTED COEFFICIENTS

We can already see a decent improvement with respect to the best performing model. We could improve further by using a different coefficient for each classifier, based on its performance.

For instance, it is common to assign greater weight to the best performing one.

In [14]:
eclf = VotingClassifier(estimators=[
    ('rf1', rf1), ('rf2', rf2), ('gbc', gbc), ('xgb',xgb)], voting='soft', weights = [3,1,1,1])
eclf.fit(X_train, y_train)
y_val_pred = eclf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.6205717377933158
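The weights `[3,1,1,1]` above were picked by hand. As one hypothetical alternative, the weights could be derived from the validation log losses reported earlier, for example by taking the inverse of each loss. This heuristic is an assumption for illustration and is not part of the original experiment.

```python
# Validation log losses reported above (lower is better).
val_log_loss = {
    "rf1": 0.6307891949598801,
    "rf2": 0.632009390517281,
    "gbc": 0.6392365307706523,
    "xgb": 0.6479794418970433,
}

# Hypothetical heuristic: weight each model by the inverse of its log loss,
# then rescale so that the weights sum to 1.
inverse = {name: 1.0 / loss for name, loss in val_log_loss.items()}
total = sum(inverse.values())
weights = {name: value / total for name, value in inverse.items()}

for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Because the four losses are close to each other, this heuristic yields nearly equal weights, far flatter than `[3,1,1,1]`. Whichever scheme is used, weights chosen by looking at the validation set should be confirmed on data that played no part in choosing them.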