## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset, the same one we used in homeworks 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

Let's load the data:

In [31]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [32]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv

--2021-10-18 10:16:28--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7077973 (6.8M) [text/plain]
Saving to: ‘AB_NYC_2019.csv.2’


2021-10-18 10:16:29 (34.6 MB/s) - ‘AB_NYC_2019.csv.2’ saved [7077973/7077973]



In [33]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]

df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

df.head()

| | neighbourhood_group | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | Manhattan | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | Manhattan | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | 0.0 | 1 | 365 |
| 3 | Brooklyn | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | Manhattan | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.1 | 1 | 0 |


* Apply the log transform to `price`
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction import DictVectorizer

In [35]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [36]:
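# Note: this builds a binary target (1 if price >= 152), not the continuous log-price.
# np.log1p is monotonic, so the comparison is equivalent to price >= 152.
y_train = (np.log1p(df_train.price.values) >= np.log1p(152)).astype('int')
y_val = (np.log1p(df_val.price.values) >= np.log1p(152)).astype('int')
y_test = (np.log1p(df_test.price.values) >= np.log1p(152)).astype('int')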

del df_train['price']
del df_val['price']
del df_test['price']
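
The task statement asks for a log-transformed regression target with a 60%/20%/20% split. A minimal sketch of that variant follows, run on a small synthetic frame rather than the Airbnb data. The `toy` frame, its values, and the `toy_*` / `y_toy_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Airbnb frame (values are illustrative only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'minimum_nights': rng.integers(1, 10, size=100),
    'price': rng.integers(30, 500, size=100),
})

# 60%/20%/20%: hold out 20% for test, then 25% of the remaining 80% for validation.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=1)

# Continuous regression target: log(1 + price).
y_toy_train = np.log1p(toy_train.price.values)
y_toy_val = np.log1p(toy_val.price.values)
y_toy_test = np.log1p(toy_test.price.values)

# np.expm1 inverts np.log1p, so predictions can be mapped back to prices.
recovered = np.expm1(y_toy_train)

print(len(toy_train), len(toy_val), len(toy_test))
```

Using `test_size=0.25` in the second call is what turns the remaining 80% into 60% train and 20% validation.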

Now, use `DictVectorizer` to turn train and validation into matrices:

In [37]:
train_dicts = df_train.fillna(0).to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [38]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

In [39]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=1,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [40]:
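# DictVectorizer.get_feature_names() was removed in scikit-learn 1.2;
# on newer versions pass list(dv.get_feature_names_out()) instead.
print(export_text(dt, feature_names=dv.get_feature_names()))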

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [0.06]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [0.53]



Which feature is used for splitting the data?

* **`room_type`**
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [41]:
from sklearn.ensemble import RandomForestClassifier

In [42]:
min_samples_leaf = 3

In [49]:
rf = RandomForestClassifier(n_estimators=10,
                            random_state=1,
                            n_jobs=-1)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [50]:
val_dicts = df_val.fillna(0).to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [53]:
from sklearn import metrics

In [55]:
y_pred = rf.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
auc

0.8756060048516114

In [54]:
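# y_pred holds predicted probabilities of the positive class,
# so these errors are measured against the 0/1 labels.
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))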

Mean Absolute Error: 0.23092340730136005
Mean Squared Error: 0.12816443399120567
Root Mean Squared Error: 0.3580006061324557


What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659
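
The question asks for the RMSE of a regression model. A minimal sketch of that computation with `RandomForestRegressor` is shown below on synthetic data; the arrays `X_demo` and `y_demo` and the 200/100 split are assumptions made for the example, not the homework data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem: the target depends mostly on the first column.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 4.5 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

rf_reg = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf_reg.fit(X_tr, y_tr)

# RMSE is the square root of the mean squared error on the validation part.
pred_va = rf_reg.predict(X_va)
rmse = np.sqrt(mean_squared_error(y_va, pred_va))
print('RMSE:', rmse)
```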

## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [58]:
scores = []


# Only three candidate values are checked here, not the full 10..200 range.
for n in [50, 70, 120]:
    rf = RandomForestClassifier(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)

    y_pred = rf.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)

    print(n, auc)
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))

    scores.append((n, auc))

scores

50 0.8923617437617177
Root Mean Squared Error: 0.34583059441923164
70 0.8933876053276424
Root Mean Squared Error: 0.34500406169949305
120 0.8958900237896477
Root Mean Squared Error: 0.3431236362302584


[(50, 0.8923617437617177), (70, 0.8933876053276424), (120, 0.8958900237896477)]

After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- **70**
- 120
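
The full sweep described in the question (10 to 200 with step 10) can be sketched as follows. This uses a regressor and synthetic data, and rounding RMSE to three decimals before comparing is an assumption of this example; the names `rmse_by_n` and `best_n` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the prepared train/validation matrices.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 4.5 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

rmse_by_n = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_va, rf.predict(X_va)))
    # Round so that tiny fluctuations do not count as improvements.
    rmse_by_n.append((n, round(rmse, 3)))

# min() returns the first minimum, i.e. the smallest n reaching the lowest RMSE.
best_n, best_rmse = min(rmse_by_n, key=lambda pair: pair[1])
print(best_n, best_rmse)
```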

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [59]:
scores_2 = []
estimators = np.arange(10, 210, 10)


for n in estimators:
    scores_md = []

    # Entries in each inner list follow the max_depth order: 10, 15, 20, 25.
    for md in [10, 15, 20, 25]:
        rf = RandomForestClassifier(n_estimators=n, random_state=1, n_jobs=-1, max_depth=md)
        rf.fit(X_train, y_train)

        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)

        scores_md.append((n, auc))

    # Append once per n_estimators value, after all depths are scored.
    scores_2.append(scores_md)

scores_2

[[(10, 0.8925139024350549),
  (10, 0.8933937829747419),
  (10, 0.8858894055287411),
  (10, 0.8764001426267374)],
 [(20, 0.8959060261285197),
  (20, 0.8974466668095713),
  (20, 0.8945349110002011),
  (20, 0.8870090482434048)],
 [(30, 0.8966332021786602),
  (30, 0.8993004819457892),
  (30, 0.8969182919088603),
  (30, 0.890

What's the best `max_depth`?

* 10
* **15**
* 20
* 25
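
A sketch of the same grid for a regressor, keyed by `(max_depth, n_estimators)` so the depth is not lost. It runs on synthetic data, and two choices here are assumptions rather than part of the homework: `warm_start=True` is used so that each `fit` call only adds trees to the existing forest, and "best" is taken to mean the lowest RMSE averaged over all `n_estimators` values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the prepared train/validation matrices.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 4.5 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

depths = [10, 15, 20, 25]
estimators = list(range(10, 201, 10))

rmse_grid = {}
for md in depths:
    # warm_start=True keeps already-trained trees between fit calls.
    rf = RandomForestRegressor(max_depth=md, random_state=1, warm_start=True)
    for n in estimators:
        rf.set_params(n_estimators=n)
        rf.fit(X_tr, y_tr)
        pred = rf.predict(X_va)
        rmse_grid[(md, n)] = np.sqrt(mean_squared_error(y_va, pred))

# Average RMSE per depth across all forest sizes, then pick the lowest.
mean_rmse = {md: np.mean([rmse_grid[(md, n)] for n in estimators]) for md in depths}
best_md = min(mean_rmse, key=mean_rmse.get)
print(best_md, mean_rmse[best_md])
```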

Bonus question (not graded):

Will the answer be different if we change the seed for the model?

## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing it, we can calculate "gain": the reduction in impurity achieved by the split. 
This gain is quite useful for understanding which features are important 
for tree-based models.
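
To make "gain" concrete, here is a small NumPy sketch for the regression case, assuming variance (mean squared error around the node mean) as the impurity measure. The helper `variance_gain` and the example arrays are hypothetical and not part of Scikit-Learn.

```python
import numpy as np

def variance_gain(y, left_mask):
    """Impurity reduction of a split: parent variance minus weighted child variance."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    weighted_children = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted_children

y_example = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# A split that separates the two groups perfectly removes all variance.
perfect_split = np.array([True, True, True, False, False, False])
# A split that mixes the groups removes very little.
poor_split = np.array([True, False, True, False, True, False])

gain_perfect = variance_gain(y_example, perfect_split)
gain_poor = variance_gain(y_example, poor_split)
print(gain_perfect, gain_poor)
```

The tree learner prefers the split with the larger gain, so here it would choose the first one.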

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`	
* `longitude`
* `latitude`
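
A minimal sketch of reading `feature_importances_` and pairing it with feature names. The data is synthetic and the feature names `strong`, `weak` and `noise` are made up; with the homework data, the names would come from the fitted `DictVectorizer`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first column drives the target, the last is pure noise.
rng = np.random.default_rng(1)
feature_names = ['strong', 'weak', 'noise']  # hypothetical names
X_imp = rng.normal(size=(300, 3))
y_imp = 3.0 * X_imp[:, 0] + 0.5 * X_imp[:, 1] + rng.normal(scale=0.1, size=300)

rf_imp = RandomForestRegressor(n_estimators=10, max_depth=20,
                               random_state=1, n_jobs=-1)
rf_imp.fit(X_imp, y_imp)

# Importances are aligned with the columns of X; sort from most to least important.
ranked = sorted(zip(feature_names, rf_imp.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```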

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` first to `0.1` and then to `0.01`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

