## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homework 2 and 3.

You can get it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

Let's load the data:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]

df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

* Apply the log transform to `price`
* Do a train/validation/test split with a 60%/20%/20% distribution
* Use the `train_test_split` function and set the `random_state` parameter to 1

In [3]:
df.price = np.log1p(df.price)
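
Because of the log transform, the model predicts `log(1 + price)`, so predictions have to be mapped back with the inverse function, `expm1`, before they can be read as prices. A minimal sketch using only the standard library (the sample prices are made up for illustration):

```python
import math

# Hypothetical nightly prices, including a free listing (price 0)
prices = [0, 50, 110, 2000]

# log1p(x) = log(1 + x): defined at 0 and compresses the long right tail
log_prices = [math.log1p(p) for p in prices]

# expm1(y) = exp(y) - 1 is the inverse of log1p
restored = [math.expm1(y) for y in log_prices]

for p, y, r in zip(prices, log_prices, restored):
    print(f"price={p:>5}  log1p={y:.4f}  expm1(log1p)={r:.2f}")
```

The same round trip applies to the RMSE values below: they are measured in log space, not in dollars.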

In [4]:
RANDOM_STATE=1
FEATURE_COLUMNS = ['neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',]

In [5]:
from sklearn.model_selection import train_test_split

train_validation_df, test_df = train_test_split(
    df, test_size=0.2, train_size=0.8, random_state=RANDOM_STATE)
# 25% of the remaining 80% gives the 20% validation set
train_df, val_df = train_test_split(
    train_validation_df, test_size=0.25, train_size=0.75, random_state=RANDOM_STATE)

train_df  = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

y_train = train_df.price
X_train = train_df[FEATURE_COLUMNS]

y_val = val_df.price
X_val = val_df[FEATURE_COLUMNS]
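
The second call uses `test_size=0.25` because a quarter of the remaining 80% equals 20% of the full dataset. The arithmetic can be checked with a plain shuffle-and-slice stand-in for `train_test_split` (the helper `split_60_20_20` and the row count are hypothetical, and scikit-learn's exact rounding may differ by a row):

```python
import random

def split_60_20_20(n_rows, seed=1):
    """Shuffle row indices and cut them into 60/20/20 parts in two steps."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)

    # Step 1: hold out 20% of all rows for testing
    n_test = round(0.2 * n_rows)
    test, full_train = idx[:n_test], idx[n_test:]

    # Step 2: 25% of the remaining 80% is 20% of the original data
    n_val = round(0.25 * len(full_train))
    val, train = full_train[:n_val], full_train[n_val:]
    return train, val, test

train, val, test = split_60_20_20(1000)
print(len(train), len(val), len(test))
```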

In [23]:
train_df.head()

| | neighbourhood_group | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Queens | 40.76434 | -73.92132 | Entire home/apt | 4.70953 | 4 | 30 | 0.32 | 1 | 363 |
| 1 | Brooklyn | 40.73442 | -73.95854 | Private room | 4.26268 | 5 | 2 | 0.16 | 1 | 0 |
| 2 | Brooklyn | 40.66359 | -73.99487 | Entire home/apt | 6.133398 | 1 | 33 | 2.75 | 5 | 113 |
| 3 | Brooklyn | 40.63766 | -74.02626 | Private room | 4.60517 | 3 | 1 | 0.12 | 2 | 362 |
| 4 | Brooklyn | 40.65118 | -74.00842 | Private room | 7.601402 | 2 | 0 | 0.0 | 2 | 365 |


Now, use `DictVectorizer` to turn train and validation into matrices:

In [7]:
from sklearn.feature_extraction import DictVectorizer

dicts = X_train.to_dict(orient="records")
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(dicts)

val_dicts = X_val.to_dict(orient="records")
X_val = dv.transform(val_dicts)
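
`DictVectorizer` leaves numeric values as they are and one-hot encodes string values into columns named `feature=value`, which is why names such as `room_type=Entire home/apt` appear later. A simplified pure-Python illustration of that behaviour (the function `one_hot_records` is hypothetical and is not scikit-learn's implementation):

```python
def one_hot_records(records):
    """Turn a list of dicts into (column_names, rows) with one-hot strings."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            # String values become their own 'key=value' indicator column
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)

    rows = []
    for rec in records:
        row = dict.fromkeys(names, 0.0)
        for key, value in rec.items():
            if isinstance(value, str):
                row[f"{key}={value}"] = 1.0
            else:
                row[key] = float(value)
        rows.append([row[name] for name in names])
    return names, rows

names, rows = one_hot_records([
    {"room_type": "Private room", "minimum_nights": 5},
    {"room_type": "Entire home/apt", "minimum_nights": 1},
])
print(names)
print(rows)
```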

## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [40]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=1)

In [41]:
from sklearn.tree import export_text
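# feature_names_ is available in all scikit-learn versions,
# unlike get_feature_names(), which newer releases removed
print(export_text(dt, feature_names=dv.feature_names_))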

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [4.29]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [5.15]
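
The printed tree is a single rule: one threshold on the one-hot column `room_type=Entire home/apt`. Written out by hand it looks like the sketch below (the function name `stump_predict` is hypothetical; the leaf values are copied from the output above and are in log space, so `expm1` converts them back to approximate prices):

```python
import math

def stump_predict(is_entire_home):
    """Reproduce the depth-1 tree: one threshold on a 0/1 feature."""
    # Each leaf holds the mean log1p(price) of the training rows in that branch
    return 4.29 if is_entire_home <= 0.5 else 5.15

for flag in (0, 1):
    log_price = stump_predict(flag)
    print(flag, log_price, round(math.expm1(log_price), 2))
```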





Which feature is used for splitting the data?

* `room_type`
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [28]:
from sklearn.ensemble import RandomForestRegressor

In [42]:
rfc = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_val)

In [43]:
from sklearn.metrics import mean_squared_error
from math import sqrt

round(sqrt(mean_squared_error(y_val, y_pred)),3)

0.462
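
RMSE is the square root of the mean squared difference between targets and predictions, which is what `sqrt(mean_squared_error(...))` computes. A dependency-free sketch with made-up numbers:

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical log-prices and predictions
y_true = [4.0, 5.0, 6.0]
y_pred = [4.5, 5.0, 5.5]
print(round(rmse(y_true, y_pred), 3))
```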

What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659

## Question 3

Now let's experiment with the `n_estimators` parameter

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [49]:
prev_rmse= 0
for n in range(10,201,10):
    rfc = RandomForestRegressor(n_estimators=n, random_state=RANDOM_STATE, n_jobs=-1)
    rfc.fit(X_train, y_train)

    y_pred = rfc.predict(X_val)
    rmse = round(sqrt(mean_squared_error(y_val, y_pred)),4)
    diff = round(rmse-prev_rmse,4)
    print(f"N={n} RMSE:{rmse} diff: {diff}")
    prev_rmse=rmse

N=10 RMSE:0.4616 diff: 0.4616
N=20 RMSE:0.4482 diff: -0.0134
N=30 RMSE:0.4455 diff: -0.0027
N=40 RMSE:0.4436 diff: -0.0019
N=50 RMSE:0.4423 diff: -0.0013
N=60 RMSE:0.4416 diff: -0.0007
N=70 RMSE:0.4412 diff: -0.0004
N=80 RMSE:0.4411 diff: -0.0001
N=90 RMSE:0.4405 diff: -0.0006
N=100 RMSE:0.44 diff: -0.0005
N=110 RMSE:0.4395 diff: -0.0005
N=120 RMSE:0.4392 diff: -0.0003
N=130 RMSE:0.4393 diff: 0.0001
N=140 RMSE:0.439 diff: -0.0003
N=150 RMSE:0.4389 diff: -0.0001
N=160 RMSE:0.4387 diff: -0.0002
N=170 RMSE:0.4386 diff: -0.0001
N=180 RMSE:0.4388 diff: 0.0002
N=190 RMSE:0.4387 diff: -0.0001
N=200 RMSE:0.4387 diff: 0.0
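
Whether RMSE has "stopped improving" depends on how small a change still counts as an improvement. The sketch below makes that threshold explicit (the helper `last_improving_n` and the tolerance values are assumptions for illustration, not part of the assignment):

```python
# (n_estimators, validation RMSE) pairs copied from the output above
results = [
    (10, 0.4616), (20, 0.4482), (30, 0.4455), (40, 0.4436), (50, 0.4423),
    (60, 0.4416), (70, 0.4412), (80, 0.4411), (90, 0.4405), (100, 0.44),
    (110, 0.4395), (120, 0.4392), (130, 0.4393), (140, 0.439), (150, 0.4389),
    (160, 0.4387), (170, 0.4386), (180, 0.4388), (190, 0.4387), (200, 0.4387),
]

def last_improving_n(results, tol):
    """Return the last n whose RMSE beats the best so far by more than tol."""
    last_n, best_rmse = results[0]
    for n, rmse in results[1:]:
        if best_rmse - rmse > tol:
            last_n = n
        # Track the lowest RMSE seen, even when the step was below tol
        best_rmse = min(best_rmse, rmse)
    return last_n

for tol in (0.0, 0.001):
    print(f"tol={tol}: last improvement at n_estimators={last_improving_n(results, tol)}")
```

A stricter tolerance picks an earlier value, so it is worth deciding how many decimal places matter before reading off the answer.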


After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- 70
- 120

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [51]:
for md in [10, 15, 20, 25]:
    best = float('inf')  # lowest validation RMSE seen for this max_depth
    for n in range(10, 201, 10):
        rfc = RandomForestRegressor(n_estimators=n, max_depth=md,
                                    random_state=RANDOM_STATE, n_jobs=-1)
        rfc.fit(X_train, y_train)

        y_pred = rfc.predict(X_val)
        rmse = round(sqrt(mean_squared_error(y_val, y_pred)), 4)
        best = min(best, rmse)
        print(f"maxdepth= {md} N={n} RMSE:{rmse}")
    print(f"\t\t best RMSE for maxdepth {md} : {best}")

maxdepth= 10 N=10 RMSE:0.4456
maxdepth= 10 N=20 RMSE:0.442
maxdepth= 10 N=30 RMSE:0.4414
maxdepth= 10 N=40 RMSE:0.4415
maxdepth= 10 N=50 RMSE:0.4411
maxdepth= 10 N=60 RMSE:0.441
maxdepth= 10 N=70 RMSE:0.4408
maxdepth= 10 N=80 RMSE:0.4406
maxdepth= 10 N=90 RMSE:0.4403
maxdepth= 10 N=100 RMSE:0.4401
maxdepth= 10 N=110 RMSE:0.44
maxdepth= 10 N=120 RMSE:0.4398
maxdepth= 10 N=130 RMSE:0.4399
maxdepth= 10 N=140 RMSE:0.4398
maxdepth= 10 N=150 RMSE:0.4397
maxdepth= 10 N=160 RMSE:0.4396
maxdepth= 10 N=170 RMSE:0.4396
maxdepth= 10 N=180 RMSE:0.4397
maxdepth= 10 N=190 RMSE:0.4397
maxdepth= 10 N=200 RMSE:0.4397
		 best RMSE for maxdepth 10 : 0.4396
maxdepth= 15 N=10 RMSE:0.4501
maxdepth= 15 N=20 RMSE:0.4414
maxdepth= 15 N=30 RMSE:0.4399
maxdepth= 15 N=40 RMSE:0.4393
maxdepth= 15 N=50 RMSE:0.4384
maxdepth= 15 N=60 RMSE:0.438
maxdepth= 15 N=70 RMSE:0.4375
maxdepth= 15 N=80 RMSE:0.4373
maxdepth= 15 N=90 RMSE:0.4369
maxdepth= 15 N=100 RMSE:0.4365
maxdepth= 15 N=110 RMSE:0.4363
maxdepth= 15 N=120 RMSE:

What's the best `max_depth`?

* 10
* 15
* 20
* 25

Bonus question (not graded):

Will the answer be different if we change the seed for the model?

## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
While doing so, we can calculate the "gain": the reduction in impurity achieved by the split. 
This gain is quite useful for understanding which features are important 
for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 
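
For regression trees the usual impurity is the variance of the target within a node, so the gain of a split is the parent's variance minus the size-weighted variance of the two children. A small sketch of that calculation (the function names and numbers are illustrative; Scikit-Learn aggregates such reductions per feature and normalises them, which this sketch does not do):

```python
def variance(values):
    """Mean squared deviation from the mean (the MSE impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = left + right
    weighted_children = (
        len(left) / len(parent) * variance(left)
        + len(right) / len(parent) * variance(right)
    )
    return variance(parent) - weighted_children

# Hypothetical log-prices split by a yes/no feature
left = [4.0, 4.2]    # e.g. rows where the feature is 0
right = [5.0, 5.2]   # e.g. rows where the feature is 1
print(round(split_gain(left, right), 4))
```

A split that separates low and high targets well has a large gain; a split that leaves both children as mixed as the parent has a gain near zero.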

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

In [53]:
rfc_q5 = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rfc_q5.fit(X_train, y_train)

RandomForestRegressor(max_depth=20, n_estimators=10, n_jobs=-1, random_state=1)

In [72]:
df_importance = pd.DataFrame(data=rfc_q5.feature_importances_, index= dv.feature_names_, columns=["importance"])

In [73]:
df_importance.sort_values(by="importance")

| | importance |
|---|---|
| neighbourhood_group=Staten Island | 8.4e-05 |
| neighbourhood_group=Bronx | 0.000265 |
| neighbourhood_group=Brooklyn | 0.000966 |
| neighbourhood_group=Queens | 0.001166 |
| room_type=Private room | 0.004032 |
| room_type=Shared room | 0.005023 |
| calculated_host_listings_count | 0.030102 |
| neighbourhood_group=Manhattan | 0.034047 |
| number_of_reviews | 0.041594 |
| minimum_nights | 0.053252 |


What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`	
* `longitude`
* `latitude`

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

In [1]:
!pip install xgboost




In [None]:
import xgboost as xgb
# feature_names_ works across scikit-learn versions; get_feature_names() was removed in newer ones
features = dv.feature_names_
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

### My kernel dies at this step, so I didn't finish this task.

In [None]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

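# Watchlist: datasets evaluated after each boosting round
watchlist = [(dtrain, 'train'), (dval, 'val')]

model = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist)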

Now change `eta` first to `0.1` and then to `0.01`

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01
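
`eta` is the learning rate: each new tree's contribution is scaled by it before being added to the ensemble. The toy loop below is not XGBoost, only a hedged illustration of shrinkage on a single constant target (the function `fraction_learned` is hypothetical). It shows why a very small `eta` may need more than 100 rounds to catch up:

```python
def fraction_learned(eta, n_rounds):
    """Share of a constant target recovered by shrunken boosting steps.

    Toy setting: every round fits the current residual perfectly and the
    model adds eta times that fit, so the residual shrinks by (1 - eta).
    """
    prediction, target = 0.0, 1.0
    for _ in range(n_rounds):
        residual = target - prediction
        prediction += eta * residual
    return prediction

for eta in (0.3, 0.1, 0.01):
    print(f"eta={eta}: {fraction_learned(eta, 100):.4f} of the target after 100 rounds")
```

On real data a smaller `eta` can also reduce overfitting, so the comparison on the validation set is still needed to pick a value.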

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

