In [56]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: Monday, October 31st, 2022 at 11:59pm**

## Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Data splitting](#2)
3. [EDA](#3)
4. (Challenging) [Feature engineering](#4)
5. [Preprocessing and transformations](#5) 
6. [Baseline model](#6)
7. [Linear models](#7)
8. [Different models](#8)
9. (Challenging) [Feature selection](#9)
10. [Hyperparameter optimization](#10)
11. [Interpretation and feature importances](#11) 
12. [Results on the test set](#12)
13. [Summary of the results](#13)
14. (Challenging) [Your takeaway from the course](#15)

## Instructions <a name="si"></a>
<hr>
rubric={points:6}

Follow the [homework submission instructions](https://ubc-cs.github.io/cpsc330-2023W1/docs/homework_instructions.html). 

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 4. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline.
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


**When you are ready to submit your assignment do the following:**

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

**Final Score**

The final score is 0.3315139892355451 using the $R^2$ metric. 

## Imports

In [57]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it in functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done"; in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course, if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and we hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>
rubric={points:3}

In this assignment we'll be exploring a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset. In this assignment we'll try to predict `reviews_per_month`, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts create more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

> Note there is an updated version of this dataset with more features available [here](http://insideairbnb.com/). The features we are using are in `listings.csv.gz` for the New York City datasets. You will also see some other files like `reviews.csv.gz`. For your own interest you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find an explanation of the features [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe.

<div class="alert alert-warning">
    
Solution_1
    
</div>

_Points:_ 3


Some features seem redundant, such as the names that accompany the IDs (`name` and `host_name`). The features I expect to be the hardest to deal with are `latitude` and `longitude`, which seem like they may need an SVM or a linear regressor. 

In [58]:
df = pd.read_csv("data/AB_NYC_2019.csv")
df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

<div class="alert alert-warning">
    
Solution_2
    
</div>

_Points:_ 2

In [59]:
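from sklearn.model_selection import train_test_split

# 60/40 split instead of the suggested 70/30: a larger training split
# was too slow to fit on my machine (see the note above)
train_df, test_df = train_test_split(df, test_size=0.40, random_state=123)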

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

_Points:_ 10

_Manhattan and Brooklyn appear to be the most popular boroughs, with entire homes/apartments and private rooms being preferred. Fewer reviews are more common than more reviews._

In [60]:
# Shows some examples of the train set
train_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
17877,14010200,"Scandinavean design in Crown Heights, BK",1683437,Boram,Brooklyn,Crown Heights,40.66477,-73.9506,Entire home/apt,89,5,9,2018-05-28,0.25,1,0
14638,11563821,Private bedroom located in the heart of Chelsea,10307134,Anna,Manhattan,Chelsea,40.74118,-74.00012,Private room,110,1,48,2019-06-16,1.8,2,67
7479,5579629,Lovely sunlit room in Brooklyn,329917,Clémentine,Brooklyn,Greenpoint,40.72905,-73.95755,Private room,53,2,5,2016-10-21,0.13,1,0
47058,35575853,"Great view, 1 BR right next to Central Park!",35965489,Meygan,Manhattan,East Harlem,40.79755,-73.94797,Private room,100,2,0,,,1,7
9769,7509362,Great BIG Upper West Side Apartment,29156329,Andrew,Manhattan,Upper West Side,40.8012,-73.96382,Private room,87,3,9,2018-09-19,0.19,1,0


In [61]:
# size of the train_set
train_df.shape

(29337, 16)

In [62]:
# lower review rates seem to be more common
train_df["reviews_per_month"].value_counts(normalize=True)

reviews_per_month
0.02     0.023219
0.05     0.022920
1.00     0.022706
0.03     0.020824
0.16     0.017874
           ...   
11.68    0.000043
7.94     0.000043
9.15     0.000043
11.72    0.000043
9.03     0.000043
Name: proportion, Length: 867, dtype: float64

In [63]:
# number of occurrences of neighbourhoods
train_df["neighbourhood"].value_counts()

neighbourhood
Williamsburg          2334
Bedford-Stuyvesant    2228
Harlem                1595
Bushwick              1481
Upper West Side       1191
                      ... 
Sea Gate                 1
Neponsit                 1
Silver Lake              1
Eltingville              1
Woodrow                  1
Name: count, Length: 217, dtype: int64

In [64]:
# number of occurrences of neighbourhood groups
train_df["neighbourhood_group"].value_counts()

neighbourhood_group
Manhattan        12972
Brooklyn         12091
Queens            3394
Bronx              658
Staten Island      222
Name: count, dtype: int64

In [65]:
# number of occurrences of room types
train_df["room_type"].value_counts()

room_type
Entire home/apt    15318
Private room       13340
Shared room          679
Name: count, dtype: int64

In [66]:
# many different data types, though most are numeric
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29337 entries, 17877 to 15725
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              29337 non-null  int64  
 1   name                            29328 non-null  object 
 2   host_id                         29337 non-null  int64  
 3   host_name                       29325 non-null  object 
 4   neighbourhood_group             29337 non-null  object 
 5   neighbourhood                   29337 non-null  object 
 6   latitude                        29337 non-null  float64
 7   longitude                       29337 non-null  float64
 8   room_type                       29337 non-null  object 
 9   price                           29337 non-null  int64  
 10  minimum_nights                  29337 non-null  int64  
 11  number_of_reviews               29337 non-null  int64  
 12  last_review                     2

In [67]:
# some listings are available for 0 days per year
train_df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,29337.0,29337.0,29337.0,29337.0,29337.0,29337.0,29337.0,23386.0,29337.0,29337.0
mean,18919880.0,67145790.0,40.729013,-73.952217,150.939121,7.141971,23.354501,1.369867,7.00334,112.803627
std,11021550.0,78354040.0,0.054594,0.046091,228.224188,22.27211,44.69248,1.706732,32.511623,131.544488
min,2539.0,2438.0,40.50641,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9350729.0,7740184.0,40.69009,-73.98303,69.0,1.0,1.0,0.19,1.0,0.0
50%,19517510.0,30719070.0,40.72314,-73.95553,107.0,3.0,5.0,0.71,1.0,45.0
75%,29165310.0,106442900.0,40.76328,-73.93643,175.0,5.0,23.0,2.01,2.0,227.0
max,36485610.0,274321300.0,40.91234,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [68]:
# some hosts have multiple listings
train_df["host_id"].value_counts()

host_id
219517861    186
107434423    145
30283594      83
61391963      57
137358866     56
            ... 
6291714        1
168495229      1
134901180      1
3783926        1
159769278      1
Name: count, Length: 23964, dtype: int64
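
The task above asks for at least two visualizations in addition to summary statistics. A minimal sketch of what these could look like is given here, assuming `matplotlib` and a small hypothetical stand-in frame `toy_df` with made-up values (in the notebook this would be `train_df`). It plots the distribution of the target and the mean target per room type.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "reviews_per_month": [0.21, 0.38, None, 4.64, 0.10, 1.80],
        "room_type": [
            "Private room", "Entire home/apt", "Private room",
            "Entire home/apt", "Entire home/apt", "Shared room",
        ],
    }
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of the target, with missing values dropped explicitly.
toy_df["reviews_per_month"].dropna().plot.hist(bins=20, ax=axes[0])
axes[0].set_xlabel("reviews_per_month")
axes[0].set_title("Distribution of the target")

# Mean target per room type (NaN targets are skipped by mean()).
mean_by_room = toy_df.groupby("room_type")["reviews_per_month"].mean()
mean_by_room.plot.bar(ax=axes[1])
axes[1].set_ylabel("mean reviews_per_month")
axes[1].set_title("Mean target by room type")

fig.tight_layout()
```

Each plot would still need a sentence of interpretation once it is drawn from the real training data.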


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## (Challenging) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

In [71]:
train_df = train_df.assign(
    price_per_min_night=train_df["price"] / train_df["minimum_nights"])
test_df = test_df.assign(
    price_per_min_night=test_df["price"] / test_df["minimum_nights"])

<div class="alert alert-warning">
    
Solution_4
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify any data cleaning that needs to be done and perform it.
2. Identify different feature types and the transformations you would apply on each feature type. 
3. Define a column transformer, if necessary.
4. You have likely noticed the `number_of_reviews` feature will be highly informative for the target `reviews_per_month`. To make this assignment more interesting **drop** the `number_of_reviews` feature.

<div class="alert alert-warning">
    
Solution_5
    
</div>

_Points:_ 10


In [72]:
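# split X and y for train and test data
# reviews_per_month appears to be missing for listings with no reviews
# (see the rows displayed above), so missing targets are filled with 0
X_train = train_df.drop(columns=["reviews_per_month"])
y_train = train_df["reviews_per_month"]
y_train = y_train.fillna(0)

X_test = test_df.drop(columns=["reviews_per_month"])
y_test = test_df["reviews_per_month"]
y_test = y_test.fillna(0)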

In [73]:
drop_features = ["number_of_reviews", "id", "host_id", "name", "host_name",
                 "last_review", "latitude", "longitude", "calculated_host_listings_count", "neighbourhood"]
numeric_features = ["price", "minimum_nights", "availability_365", "price_per_min_night"]
categorical_features = ["neighbourhood_group", "room_type"]

In [74]:
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, KBinsDiscretizer

# adapted from lecture 12, 13
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant"),
    OneHotEncoder(handle_unknown="ignore")
)
# column transformer
preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Baseline model <a name="6"></a>
<hr>
rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

_Points:_ 2

In [77]:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

dummy = DummyRegressor()
dummy.fit(X_train, y_train)
dr = pd.DataFrame(cross_validate(dummy, X_train, y_train, cv=10, return_train_score=True))
dr

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.008546,0.0,-7.7e-05,0.0
1,0.008466,0.0,-0.000311,0.0
2,0.008187,0.0,-4e-05,0.0
3,0.008001,0.0,-0.000734,0.0
4,0.010255,0.0,-0.000166,0.0
5,0.0,0.0,-5.5e-05,0.0
6,0.007989,0.0,-2.7e-05,0.0
7,0.008012,0.0,-2.2e-05,0.0
8,0.008,0.0,-0.000125,0.0
9,0.00807,0.0,-0.002337,0.0


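The cross-validation scores are slightly negative and the train scores are exactly 0. This is expected: `DummyRegressor` always predicts the mean of the training target, which gives an $R^2$ of 0 on the training data and at most 0 on held-out data. 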
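
The baseline behaviour can be reproduced on made-up data. The sketch below uses hypothetical random arrays (`X_toy`, `y_toy`) rather than the Airbnb data and shows that a mean-predicting `DummyRegressor` scores an $R^2$ of 0 on the data it was fit on and no more than 0 on held-out data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 3))             # features are ignored by the dummy model
y_toy = rng.exponential(scale=1.4, size=200)  # made-up, right-skewed target

X_fit, X_val = X_toy[:150], X_toy[150:]
y_fit, y_val = y_toy[:150], y_toy[150:]

dummy_toy = DummyRegressor(strategy="mean").fit(X_fit, y_fit)

train_r2 = dummy_toy.score(X_fit, y_fit)  # 0 up to floating-point error
val_r2 = dummy_toy.score(X_val, y_val)    # a constant prediction cannot exceed 0
print(train_r2, val_r2)
```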


<!-- END QUESTION -->



<!-- BEGIN QUESTION -->

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_7
    
</div>

_Points:_ 10

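_The linear model performs poorly: Ridge reaches a mean CV $R^2$ of only about 0.06, and tuning `alpha` barely changes this. The best value appears to be `alpha=100` (mean CV $R^2$ of 0.0621 with a standard deviation of 0.0048), but the differences for `alpha` up to 1000 are well within one standard deviation._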

In [79]:
from sklearn.linear_model import Ridge

pipe = make_pipeline(preprocessor, Ridge())
pipe.fit(X_train, y_train)
ridge = pd.DataFrame(cross_validate(pipe, X_train, y_train, cv=10, return_train_score=True))
ridge

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.140337,0.016611,0.070586,0.062599
1,0.113371,0.016575,0.066795,0.063018
2,0.138171,0.02436,0.06366,0.063439
3,0.124457,0.016109,0.049095,0.065537
4,0.1156,0.016315,0.055524,0.064229
5,0.130478,0.016373,0.059056,0.06378
6,0.111775,0.017551,0.068739,0.062905
7,0.112223,0.019816,0.066836,0.06308
8,0.102496,0.024136,0.0653,0.063158
9,0.125013,0.017133,0.05213,0.064181


In [80]:
# adapted from lecture 07
scores_dict = {
    "alpha": 10.0 ** np.arange(-3, 6, 1),
    "mean_train_scores": list(),
    "mean_cv_scores": list(),
    "std_train_scores": list(),
    "std_cv_scores": list(),
}
for alpha in scores_dict["alpha"]:
    pipe_ridge = make_pipeline(preprocessor, Ridge(alpha=alpha))
    scores = cross_validate(pipe_ridge, X_train, y_train, return_train_score=True)
    scores_dict["mean_train_scores"].append(scores["train_score"].mean())
    scores_dict["mean_cv_scores"].append(scores["test_score"].mean())
    scores_dict["std_train_scores"].append(scores["train_score"].std())
    scores_dict["std_cv_scores"].append(scores["test_score"].std())

results_df = pd.DataFrame(scores_dict)
results_df

Unnamed: 0,alpha,mean_train_scores,mean_cv_scores,std_train_scores,std_cv_scores
0,0.001,0.063663,0.062076,0.001336,0.004931
1,0.01,0.063663,0.062076,0.001336,0.004931
2,0.1,0.063663,0.062076,0.001336,0.004931
3,1.0,0.063663,0.062077,0.001336,0.00493
4,10.0,0.063663,0.062082,0.001336,0.004916
5,100.0,0.063645,0.06211,0.001334,0.004787
6,1000.0,0.062971,0.061726,0.001278,0.003974
7,10000.0,0.05215,0.051724,0.000802,0.002721
8,100000.0,0.016892,0.016771,0.000199,0.001322



<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Different models <a name="8"></a>
<hr>
rubric={points:12}

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_8
    
</div>

_Points:_ 12

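_KNN had the shortest fit time (about 0.35 s per fold) and the decision tree had the shortest score time (about 0.03 s per fold). Overall, the decision tree had the lowest combined fit and score time, while SVR was by far the slowest (about 81 s to fit and 32 s to score per fold). The decision tree overfits badly (mean train $R^2$ of 0.91 against a mean CV $R^2$ of -0.26) and KNN overfits moderately (0.45 against 0.20), while SVR (0.20 against 0.20) and gradient boosting (0.33 against 0.32) show almost no gap. Gradient boosting had the best CV score of the four, and every model except the decision tree beats the linear model (about 0.06)._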

In [81]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

pipe_gb = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=123))
pipe_dt = make_pipeline(preprocessor, DecisionTreeRegressor())
pipe_knn = make_pipeline(preprocessor, KNeighborsRegressor())
pipe_svr = make_pipeline(preprocessor, SVR())

In [82]:
gb = pd.DataFrame(cross_validate(pipe_gb, X_train, y_train, cv=5, return_train_score=True))

In [84]:
gb

Unnamed: 0,fit_time,score_time,test_score,train_score
0,5.828171,0.049363,0.323115,0.332212
1,5.964062,0.049114,0.276839,0.344798
2,5.952469,0.048393,0.34058,0.329027
3,5.903961,0.048656,0.328229,0.333273
4,5.77912,0.048073,0.325044,0.332162


In [85]:
dt = pd.DataFrame(cross_validate(pipe_dt, X_train, y_train, cv=5, return_train_score=True))

In [86]:
dt

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.659782,0.028381,-0.22749,0.899842
1,0.675487,0.033501,-0.207903,0.935621
2,0.671735,0.024577,-0.264871,0.901791
3,0.651551,0.025804,-0.307125,0.893763
4,0.68399,0.021786,-0.276997,0.901047


In [87]:
knn = pd.DataFrame(cross_validate(pipe_knn, X_train, y_train, cv=5, return_train_score=True))

In [88]:
knn

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.347407,0.484935,0.181731,0.447782
1,0.345642,0.474813,0.164476,0.468545
2,0.361978,0.50495,0.239724,0.439332
3,0.351722,0.469495,0.205943,0.444654
4,0.360742,0.41232,0.201984,0.448432


In [90]:
svr = pd.DataFrame(cross_validate(pipe_svr, X_train, y_train, cv=5, return_train_score=True))

In [91]:
svr

Unnamed: 0,fit_time,score_time,test_score,train_score
0,80.141119,32.379562,0.199767,0.198581
1,80.571169,32.329581,0.167749,0.209415
2,80.608231,32.821936,0.219728,0.194319
3,80.683723,32.430051,0.202114,0.19816
4,83.369543,32.454965,0.196467,0.203983
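
Since the same `cross_validate` call is repeated for every model, it could be wrapped in a small helper. The sketch below uses a hypothetical function `summarize_cv` and synthetic data from `make_regression`; it reports the mean and standard deviation of every column that `cross_validate` returns.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor


def summarize_cv(model, X, y, cv=5):
    """Return the mean and std of each cross_validate column as a DataFrame."""
    scores = pd.DataFrame(
        cross_validate(model, X, y, cv=cv, return_train_score=True)
    )
    return pd.DataFrame({"mean": scores.mean(), "std": scores.std()})


# Synthetic stand-in data; not the Airbnb dataset.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=123
)

toy_models = {
    "dummy": DummyRegressor(),
    "decision tree": DecisionTreeRegressor(random_state=123),
}

# Columns are (model name, "mean"/"std"); rows are fit_time, score_time, etc.
summary = pd.concat(
    {name: summarize_cv(model, X_toy, y_toy) for name, model in toy_models.items()},
    axis=1,
)
print(summary)
```

With the notebook's pipelines in `toy_models`, this would give one table with both the scores and their spread for all models.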



<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## (Challenging) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:2}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

<div class="alert alert-warning">
    
Solution_9
    
</div>

_Points:_ 2

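_The results do not improve with feature selection: the mean CV $R^2$ with `RFECV` is about 0.312, compared with about 0.319 for gradient boosting alone, and the train scores have not really changed either. In the run recorded below, the fit time is about 2.2 s per fold. Since there is no improvement, I leave feature selection out of the pipeline in the following exercises._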

In [30]:
from sklearn.feature_selection import RFECV

#adapted from lecture 13
pipe_rfe = make_pipeline(
    preprocessor, 
    RFECV(Ridge(), cv=10),
    GradientBoostingRegressor(random_state=123),
)
rfe_results = pd.DataFrame(cross_validate(pipe_rfe, X_train, y_train, cv=5, return_train_score=True))

In [31]:
rfe_results

Unnamed: 0,fit_time,score_time,test_score,train_score
0,2.276093,0.020333,0.315026,0.324646
1,2.318155,0.018073,0.269362,0.33802
2,2.332094,0.016,0.330389,0.321438
3,2.164092,0.015996,0.321072,0.32508
4,2.151462,0.014545,0.326648,0.334657



<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:10}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

<div class="alert alert-warning">
    
Solution_10
    
</div>

_Points:_ 10

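_For SVR, the search selected the largest `C` in the grid (100) together with `gamma=1.0`, which corresponds to a more complex model. For gradient boosting, only `alpha` was searched; this parameter is used only by the `huber` and `quantile` losses, so with the default squared-error loss it does not change the model, and the best CV score (0.3188) is the same as the untuned score. For decision trees and k-nearest neighbours, the selected values (`max_depth=10` and `n_neighbors=25`) are neither too low nor too high, which avoids both underfitting and overfitting._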

In [37]:
from sklearn.model_selection import RandomizedSearchCV

param_grid_svr = {
    "svr__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svr__C": [0.001, 0.01, 0.1, 1.0, 10, 100],
}

rs_svr = RandomizedSearchCV(pipe_svr, param_distributions = param_grid_svr, n_iter=10, n_jobs=-1, return_train_score=True)
rs_svr.fit(X_train, y_train)

In [38]:
rs_svr.best_score_

0.24500536854705635

In [39]:
rs_svr.best_params_

{'svr__gamma': 1.0, 'svr__C': 100}

In [45]:
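# Note: `alpha` is only used when loss is "huber" or "quantile"; with the
# default squared-error loss it does not change the fitted model
param_grid_gb = {
    "gradientboostingregressor__alpha": [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99],
}

rs_gb = RandomizedSearchCV(pipe_gb, param_distributions = param_grid_gb, n_iter=8, n_jobs=-1, return_train_score=True)
rs_gb.fit(X_train, y_train)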



In [46]:
rs_gb.best_score_

0.31876137802076465

In [47]:
rs_gb.best_params_

{'gradientboostingregressor__alpha': 0.001}
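
A search over hyperparameters that do affect gradient boosting with the default loss could look like the sketch below. This is an illustration only, on synthetic data from `make_regression`; the names `toy_pipe_gb` and `param_dist_gb` and the ranges are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Airbnb dataset.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=6, noise=15.0, random_state=123
)

toy_pipe_gb = make_pipeline(
    StandardScaler(), GradientBoostingRegressor(random_state=123)
)

# Three hyperparameters that control model complexity (illustrative ranges).
param_dist_gb = {
    "gradientboostingregressor__learning_rate": loguniform(1e-2, 5e-1),
    "gradientboostingregressor__n_estimators": randint(20, 80),
    "gradientboostingregressor__max_depth": randint(1, 5),
}

toy_search_gb = RandomizedSearchCV(
    toy_pipe_gb,
    param_distributions=param_dist_gb,
    n_iter=5,
    cv=3,
    random_state=123,
)
toy_search_gb.fit(X_toy, y_toy)
print(toy_search_gb.best_params_, toy_search_gb.best_score_)
```

Searching several hyperparameters of one model in this way would also cover the requirement to optimize multiple hyperparameters for a single model.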

In [52]:
# adapted from lecture 08
param_grid_dt = {
    "decisiontreeregressor__max_depth": [1, 2, 3, 5, 7, 10, 25, 50, 100, 250],
}

rs_dt = RandomizedSearchCV(pipe_dt, param_distributions = param_grid_dt, n_iter=10, n_jobs=-1, return_train_score=True)
rs_dt.fit(X_train, y_train)

In [50]:
rs_dt.best_score_

0.2630832468085913

In [51]:
rs_dt.best_params_

{'decisiontreeregressor__max_depth': 10}

In [54]:
param_grid_knn = {
    "kneighborsregressor__n_neighbors": [1, 2, 3, 5, 7, 10, 25, 50, 100, 250],
}

rs_knn = RandomizedSearchCV(pipe_knn, param_distributions = param_grid_knn, n_iter=10, n_jobs=-1, return_train_score=True)
rs_knn.fit(X_train, y_train)

In [57]:
rs_knn.best_score_

0.27388137473643737

In [58]:
rs_knn.best_params_

{'kneighborsregressor__n_neighbors': 25}

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 

<div class="alert alert-warning">
    
Solution_11
    
</div>

_Points:_ 10

_In the gradient boosted regressor model, the most important feature is `availability_365` (importance of about 0.50), followed by `minimum_nights` (about 0.39). Every other feature has an importance below 0.04._

In [69]:
pipe_gb.fit(X_train, y_train);

In [94]:
# adapted from lecture 12
data = {
    "Importance": pipe_gb.named_steps["gradientboostingregressor"].feature_importances_,
}

ohe_feature_names = (
    pipe_gb.named_steps["columntransformer"]
    .named_transformers_["pipeline-2"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
    .tolist()
)
feature_names = (numeric_features + ohe_feature_names)
gb_imp_df = pd.DataFrame(
    data=data,
    index=feature_names,
).sort_values(by="Importance", ascending=False)

gb_imp_df

Unnamed: 0,Importance
availability_365,0.504665
minimum_nights,0.392392
price,0.03755
price_per_min_night,0.029165
room_type_Entire home/apt,0.016707
neighbourhood_group_Queens,0.008911
neighbourhood_group_Manhattan,0.006358
room_type_Shared room,0.00159
room_type_Private room,0.001232
neighbourhood_group_Brooklyn,0.000931


In [95]:
np.sum(pipe_gb.named_steps["gradientboostingregressor"].feature_importances_)

0.9999999999999999
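
The impurity-based `feature_importances_` used above are computed from the training data. As a cross-check, permutation importance on held-out data is one other method that could be used. The sketch below applies `sklearn.inspection.permutation_importance` to synthetic data with hypothetical column names `f0` to `f3`; it is not a result for the Airbnb model.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_toy = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=123
)
toy_cols = ["f0", "f1", "f2", "f3"]
X_toy = pd.DataFrame(X_arr, columns=toy_cols)
X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, random_state=123)

toy_gb = GradientBoostingRegressor(random_state=123).fit(X_fit, y_fit)

# Drop in validation R^2 when each column is shuffled, averaged over repeats.
perm = permutation_importance(toy_gb, X_val, y_val, n_repeats=5, random_state=123)

perm_df = pd.DataFrame(
    {"importance_mean": perm.importances_mean, "importance_std": perm.importances_std},
    index=toy_cols,
).sort_values("importance_mean", ascending=False)
print(perm_df)
```

Passing a fitted pipeline together with the raw validation frame would give importances per original column rather than per one-hot column.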


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:10}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  

<div class="alert alert-warning">
    
Solution_12
    
</div>

_Points:_ 10

2. The test score ($R^2$ of about 0.332) agrees with the best cross-validation score from before (about 0.319). I do not think there is much optimization bias: the test score is similar to the best cross-validation score, and the selected model shows little gap between train and validation scores, so I trust these results to a large extent.
3. `availability_365` and `minimum_nights` seem to be the biggest determining factors for the target. 

In [37]:
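# alpha only affects the "huber" and "quantile" losses, so with the default loss
# this model is equivalent to the untuned gradient boosting model above
pipe_gb_best = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=123, alpha=0.001))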
pipe_gb_best.fit(X_train, y_train)
pipe_gb_best.score(X_test, y_test)

0.3315139892355451

In [38]:
# adapted from lecture 12
data = {
    "Importance": pipe_gb_best.named_steps["gradientboostingregressor"].feature_importances_,
}

ohe_feature_names = (
    pipe_gb_best.named_steps["columntransformer"]
    .named_transformers_["pipeline-2"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
    .tolist()
)
feature_names = (numeric_features + ohe_feature_names)
gb_best_imp_df = pd.DataFrame(
    data=data,
    index=feature_names,
).sort_values(by="Importance", ascending=False)

gb_best_imp_df

Unnamed: 0,Importance
availability_365,0.504665
minimum_nights,0.392392
price,0.03755
price_per_min_night,0.029165
room_type_Entire home/apt,0.016707
neighbourhood_group_Queens,0.008911
neighbourhood_group_Manhattan,0.006358
room_type_Shared room,0.00159
room_type_Private room,0.001232
neighbourhood_group_Brooklyn,0.000931



<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Summary of results <a name="13"></a>
<hr>
rubric={points:12}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability. 
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

<div class="alert alert-warning">
    
Solution_13
    
</div>

_Points:_ 12

2. The dataset had many features that were not needed, such as IDs and names. There were also many neighbourhoods (217 in the training set), which would have been very expensive to fit, though luckily the neighbourhood groups (boroughs) conveniently group them together. Overall, I got quite low scores, which I attribute to a lack of time to use a more complex model. 
3. I wanted to try a bigger training split, but my PC was not able to handle it and certain models took a grueling amount of time to fit. I also wanted to try SHAP plots, but I was not able to in the time I had allotted.

In [93]:
results = {
    "Dummy": dr.mean(),
    "Ridge": ridge.mean(),
    "GradientBoosting": gb.mean(),
    "KNeighbors": knn.mean(),
    "SupportingVectors": svr.mean(),
    "DecisionTree": dt.mean(),
}
pd.DataFrame(results)

Unnamed: 0,Dummy,Ridge,GradientBoosting,KNeighbors,SupportingVectors,DecisionTree
fit_time,0.007553,0.121392,5.885557,0.353498,81.074757,0.668509
score_time,0.0,0.018498,0.04872,0.469303,32.483219,0.02681
test_score,-0.000389,0.061772,0.318761,0.198772,0.197165,-0.256877
train_score,0.0,0.063593,0.334294,0.449749,0.200892,0.906413



<!-- END QUESTION -->

<br><br>

<br><br>

<!-- BEGIN QUESTION -->

## (Challenging) 14. Your takeaway <a name="15"></a>
<hr>
rubric={points:2}

**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers.  

<div class="alert alert-warning">
    
Solution_14
    
</div>

_Points:_ 2

My biggest takeaway is that preprocessing the data and being able to explain the process and results of supervised machine learning are the most important aspects of the material. Preprocessing takes a great deal of time and determines what happens in the fitting portion, and being able to explain models and results is necessary so as not to be misled. 

<!-- END QUESTION -->

<br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.**

This was a tricky one but you did it! Have a great weekend! 

![](img/eva-well-done.png)