# CPSC 330 hw5

**IMPORTANT NOTE:** This assignment, in its original form, is quite long. In Covid times, I am wary of overloading you. At the same time, I am also hesitant to delete useful material. My compromise is that I have marked some questions as optional and made them not part of your grade. If you have time, I recommend doing them! However, if you're short on time, you can skip them for now - later parts of the assignment don't depend on them. If you skip these questions, please look them over later as part of your exam studying - that content will still be fair game for exams. Solutions for these questions will be posted as part of the usual solutions posting.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Non-sklearn
from lightgbm import LGBMRegressor
import shap

# Add more imports below


In [None]:
plt.rcParams['font.size'] = 16

In [None]:
def cross_validate_std(*args, **kwargs):
    """Like cross_validate, except also gives the standard deviation of the score"""
    res = pd.DataFrame(cross_validate(*args, **kwargs))
    res_mean = res.mean()

    res_mean["std_test_score"] = res["test_score"].std()
    if "train_score" in res:
        res_mean["std_train_score"] = res["train_score"].std()
    return res_mean

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

## Writing quality/quantity
rubric={points:5}

The TAs have reported a couple of issues with the first few assignments: in some cases, submissions simply show the code output with no commentary; please write at least a sentence explaining your output in each question. In other cases, the TAs have come across multi-paragraph answers where a couple of sentences would have sufficed. Thus, we are now allocating the above points for well-structured answers of a reasonable length. In general, 1-3 sentences is good.

## Introduction

In this assignment we'll be exploring a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset.

In [None]:
df = pd.read_csv('AB_NYC_2019.csv')

In [None]:
df_train, df_test = train_test_split(df, random_state=123)

In [None]:
df_train.shape

In [None]:
df_train.head()

In this assignment we'll try to predict `reviews_per_month`, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help hosts create more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

I'll take care of some missing values for you here. 

- It seems that listings with 0 reviews have `reviews_per_month` set to NaN instead of 0.
- For listings without a listing name I set the name to the empty string.

In [None]:
df_train = df_train.fillna({"reviews_per_month" : 0, "name" : ""})
df_test  = df_test.fillna({"reviews_per_month" : 0, "name" : ""})

In [None]:
target_name = 'reviews_per_month'

In [None]:
X_train = df_train.drop(columns=[target_name])
y_train = df_train[target_name]

X_test  = df_test.drop(columns=[target_name])
y_test  = df_test[target_name]

In [None]:
y_train = y_train.fillna(value=0)
y_test  = y_test.fillna(value=0)

In [None]:
categorical_features = ['neighbourhood', 'neighbourhood_group', 'room_type']
drop_features        = ['id', 'name', 'host_id', 'host_name', 'last_review', 'number_of_reviews']
numeric_features     = ['latitude', 'longitude', 'price', 'minimum_nights', 'calculated_host_listings_count', 'availability_365']

Make sure all columns are accounted for:

In [None]:
assert sorted(X_train.columns) == sorted(categorical_features + drop_features + numeric_features)

Let's check for missing values (I already took care of some):

In [None]:
X_train.info()

It seems they are only present in `host_name` and `last_review`, both of which we are dropping, so we can skip imputation here.

In [None]:
preprocessor = ColumnTransformer([
    ('scale',  StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

In [None]:
preprocessor.fit(X_train);

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
ohe_feature_names = list(preprocessor.named_transformers_['ohe'].get_feature_names_out(categorical_features))
new_columns = numeric_features + ohe_feature_names

In [None]:
X_train_enc = pd.DataFrame(preprocessor.transform(X_train).toarray(), index=X_train.index, columns=new_columns)

In [None]:
X_train_enc.head()

Next, we'll try training some models:

In [None]:
dr = DummyRegressor()

In [None]:
pd.DataFrame(cross_validate(dr, X_train, y_train, return_train_score=True)).mean()

Remember, the score here is the $R^2$. Let's try linear regression:

In [None]:
lr = make_pipeline(preprocessor, Ridge())

In [None]:
pd.DataFrame(cross_validate(lr, X_train, y_train, return_train_score=True)).mean()
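As a reminder of what the $R^2$ scores above measure, here is a minimal sketch on synthetic data (the names `X_toy`, `y_toy` and `manual_r2` are made up for illustration and are not part of the assignment). It computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand and checks that a `DummyRegressor`, which always predicts the training mean, gets a training $R^2$ of essentially zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only, not the Airbnb data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)


def manual_r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


dummy_toy = DummyRegressor().fit(X_toy, y_toy)  # always predicts mean(y_toy)
ridge_toy = Ridge().fit(X_toy, y_toy)

dummy_r2 = dummy_toy.score(X_toy, y_toy)        # .score returns R^2 for regressors
ridge_r2 = ridge_toy.score(X_toy, y_toy)
manual_ridge_r2 = manual_r2(y_toy, ridge_toy.predict(X_toy))

print(dummy_r2, ridge_r2, manual_ridge_r2)
```

On held-out folds the dummy model's $R^2$ is typically slightly negative, because the mean of the training fold is not exactly the mean of the validation fold.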

And a LightGBM regressor (using this because it runs much faster than `RandomForestRegressor`!):

In [None]:
lgbm = make_pipeline(preprocessor, LGBMRegressor())

In [None]:
cross_validate_std(lgbm, X_train, y_train, return_train_score=True)

## Exercise 1: data preparation

#### 1(a)
rubric={points:5}

Above I dropped the feature `number_of_reviews`. And yet, knowing the number of reviews seems quite useful in predicting the number of reviews per month. Why would it be problematic to leave this feature in?

#### (optional) 1(b)

I noticed there were no missing values in the columns we're using, so I didn't put any imputation into the pipeline. In what scenario could that result in a problem down the line?

#### 1(c)
rubric={points:5}

After one-hot encoding, the number of columns increased significantly:

In [None]:
len(numeric_features) + len(categorical_features)

In [None]:
len(new_columns)

For each of the categorical variables printed below, how many new columns were generated?

In [None]:
categorical_features

#### (optional) 1(d)

In the previous part you (hopefully) found that the vast majority of categories come from the `neighbourhood` column. Let's look at how many of each category we have in the training set:

In [None]:
df_train['neighbourhood'].value_counts()

Perhaps it doesn't make sense to have separate columns for some of these very rare cases. The code below modifies the preprocessor so that the categories are specified inside the `OneHotEncoder`. It uses the 30 most frequent neighbourhoods in the training set as the fixed categories.

In [None]:
categories = df_train['neighbourhood'].value_counts()[:30].index.values
categories

In [None]:
categorical_features

In [None]:
preprocessor_cats = ColumnTransformer([
    ('scale',  StandardScaler(), numeric_features),
    ('ohe_cats', OneHotEncoder(handle_unknown='ignore', categories=[categories]), ["neighbourhood"]),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ["neighbourhood_group", "room_type"])
])
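To see what fixing the categories does, here is a toy sketch (the data and the names `toy`, `kept` and `enc` are made up for illustration). With `handle_unknown='ignore'`, any value that is not in the specified `categories` list is encoded as a row of all zeros rather than getting its own column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with a rare value ("Tiny Hamlet" is a made-up name)
toy = np.array(["Harlem", "Williamsburg", "Harlem", "Tiny Hamlet"], dtype=object).reshape(-1, 1)

# Only these categories get their own column
kept = ["Harlem", "Williamsburg"]

enc = OneHotEncoder(handle_unknown="ignore", categories=[kept])
encoded = enc.fit_transform(toy).toarray()

print(encoded)
# Rows for "Harlem" and "Williamsburg" have a single 1;
# the row for "Tiny Hamlet" is all zeros.
```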

We then train two linear regression models with small `alpha` values, one with the original preprocessor and one with the modified preprocessor:

In [None]:
lr_small = make_pipeline(preprocessor, Ridge(0.01))
cross_validate_std(lr_small, X_train, y_train, return_train_score=True)

In [None]:
lr_small_cats = make_pipeline(preprocessor_cats, Ridge(0.01))
cross_validate_std(lr_small_cats, X_train, y_train, return_train_score=True)

In class I briefly mentioned that for our linear models (`LogisticRegression` and `Ridge`), having more features can be considered as having a more complex model because we have more coefficients. Given this, discuss the results above with respect to the fundamental tradeoff. Do the results align with your expectations?

-----------------

Earlier we dropped the `name` column, but perhaps the names contain useful information.

In [None]:
X_train["name"]

For example, maybe listings with the word "new" are more popular than listings with the word "rustic". Let's update the feature preprocessing so that `name` is not dropped, but instead encoded with `CountVectorizer`. Let's update the preprocessor:

In [None]:
preprocessor_text = ColumnTransformer([
    ('scale',  StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('text', CountVectorizer(max_features=100, stop_words='english'), "name")
])
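As a quick illustration of what `CountVectorizer` produces, here is a toy sketch on three made-up listing names (the names `toy_names`, `vec` and `toy_counts` are hypothetical). Each remaining word becomes a column, and each entry counts how many times that word appears in the listing name. Note that in the preprocessor above the column is given as the string `"name"` rather than a list, so that `CountVectorizer` receives a 1-D sequence of strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up listing names (illustrative only)
toy_names = pd.Series([
    "Cozy room in Brooklyn",
    "Sunny loft in Brooklyn",
    "Cozy cozy studio",
])

vec = CountVectorizer(stop_words="english")  # lowercases text and drops words like "in"
counts = vec.fit_transform(toy_names).toarray()

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
toy_counts = pd.DataFrame(counts, columns=vocab)

print(toy_counts)
```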

We can take a look at the transformed training data:

In [None]:
preprocessor_text.fit(X_train, y_train);

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
word_feature_names = ['name_' + word for word in preprocessor_text.named_transformers_['text'].get_feature_names_out()]
new_columns_text = new_columns + word_feature_names

In [None]:
X_train_enc_text = pd.DataFrame(data=preprocessor_text.transform(X_train).toarray(), columns=new_columns_text, index=X_train.index)
X_train_enc_text.head()

We now have a lot of features generated by the OHE and the word counts:

In [None]:
ohe_feature_names[:20]

In [None]:
word_feature_names[:20]

In fact, it seems we now have a feature called `neighbourhood_Williamsburg` and a feature called `name_williamsburg`:

In [None]:
X_train_enc_text[['neighbourhood_Williamsburg', 'name_williamsburg']]

#### 1(e)
rubric={points:5}

- What is the difference between our features  `neighbourhood_Williamsburg` and `name_williamsburg`? 
- Is it possible that one could be zero and the other nonzero?
- Is it redundant to have both?

#### (optional) 1(f)

Next, we explore the target values. The number of reviews per month ranges quite a bit:

In [None]:
y_train.describe()

In [None]:
y_train.hist(bins=100);

We decide to log-transform these targets, because we care more about relative than absolute error:

In [None]:
y_train_log = np.log(y_train)
y_test_log  = np.log(y_test)

In [None]:
y_train_log.hist()

Oh no, an error!

- Explain the issue.
- Fix the issue.

## Exercise 2: model selection

#### 2(a)
rubric={points:15}

- Use `GridSearchCV` or `RandomizedSearchCV` to tune a `LGBMRegressor` pipeline on this problem, using `preprocessor_text` with un-transformed y-values. 
- Tune `num_leaves` of the `LGBMRegressor` and `max_features` of the `CountVectorizer` at least (you're welcome to include more hyperparameters if you want).
- What hyperparameters were chosen and what is the final score? 

#### 2(b)
rubric={points:5}

Briefly comment on the "error bars" from the different folds of cross-validation. Do you trust the ordering of the hyperparameters from the previous part? Briefly discuss.

#### 2(c)
rubric={points:5}

Run your chosen pipeline on the test set. How does your test error compare to your cv error? Do you think you overfit the (cross-)validation set? Briefly discuss.

## Exercise 3: feature importances

For this exercise we'll consider the random forest created below, which performs somewhat decently (though still not great):

In [None]:
rf = RandomForestRegressor(max_depth=20, max_features=20, n_estimators=20, random_state=20)
rf_ttr = TransformedTargetRegressor(rf, func=np.log1p, inverse_func=np.expm1)
rf_ttr_pipe = make_pipeline(preprocessor_text, rf_ttr)

In [None]:
cross_validate_std(rf_ttr_pipe, X_train, y_train, return_train_score=True)
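To make the role of `TransformedTargetRegressor` concrete, here is a minimal sketch on synthetic data (the names `X_toy`, `y_toy` and `ttr` are made up, and `Ridge` is swapped in for the random forest purely to keep the example fast). The wrapper fits the inner regressor on `func(y)` and applies `inverse_func` to its predictions, so the predictions come back in the original units of the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic non-negative, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = rng.exponential(size=200) * (1 + X_toy[:, 0] ** 2)

ttr = TransformedTargetRegressor(regressor=Ridge(), func=np.log1p, inverse_func=np.expm1)
ttr.fit(X_toy, y_toy)  # internally fits Ridge on np.log1p(y_toy)

pred = ttr.predict(X_toy)                        # already mapped back with np.expm1
manual = np.expm1(ttr.regressor_.predict(X_toy))  # same thing, done by hand

print(np.allclose(pred, manual))
```

The fitted inner model is available as `ttr.regressor_`, which is where attributes such as `feature_importances_` live when the inner model is a random forest.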

#### 3(a)
rubric={points:10}

Look at the feature importances for this random forest pipeline trained on the training set. What features seem to be driving your predictions most?

#### 3(b)
rubric={points:5}

For the two most important features in the above model, do you think increasing these features increases or decreases the predicted number of reviews per month? Briefly justify your answer. **Maximum 2 sentences.**

Note: the results from part (a) don't tell you the answer here - I'm asking you to use your domain knowledge about short-term vacation rentals. What do you think makes a rental attractive?

#### 3(c)
rubric={points:10}

Use SHAP to try to answer the above question more definitively, for predictions on the training set. You will likely need to take a subset of the training examples for speed when computing the SHAP values. **Maximum 1 paragraph.**

#### (optional) 3(d)

The following code adds a column of random noise to `X` and re-trains.

In [None]:
random_noise = np.random.randn(X_train_enc_text.shape[0],1)
random_noise

In [None]:
X_train_words_noise = pd.concat((X_train_enc_text, pd.DataFrame(random_noise, columns=['noise'], index=X_train_enc_text.index)), axis=1)
X_train_words_noise.head()

In [None]:
rf_noise = RandomForestRegressor(max_depth=20, max_features=20, n_estimators=20, random_state=20)
rf_noise.fit(X_train_words_noise, y_train);

Next, we can look at the feature importances of this noise column:

In [None]:
pd.DataFrame(data=rf_noise.feature_importances_, index=X_train_words_noise.columns, columns=["Importance"]).tail()

But wait, we know this feature can't possibly be useful - why is the importance of the random noise feature non-zero (and in fact larger than for some real features)?  **Maximum 2 sentences.**

#### 3(e)
rubric={points:10}

One of the numeric features is `price`, which is presumably in U.S. dollars per night. We can fit a `Ridge` model to the data and look at the coefficient corresponding to `price`. 

In [None]:
lr = make_pipeline(preprocessor_text, Ridge(alpha=100))
lr.fit(X_train, y_train);

In [None]:
coefs = pd.DataFrame(data=lr[1].coef_, index=new_columns_text, columns=["Coefficient"])
coefs.head()

In [None]:
price_coef = coefs.loc["price"].values[0]
price_coef

Normally, we can interpret a linear regression coefficient as the increase/decrease in the predicted target per unit change of the feature: for example, increasing the price by \$1/night would decrease the predicted reviews/month by the above amount (decrease because of the negative coefficient).

However, that would be incorrect because we scaled the numeric features in the preprocessing pipeline, as discussed in lecture. 
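For reference, the plain per-unit interpretation can be checked on unscaled toy data. The sketch below uses made-up numbers (not the Airbnb data) and an unregularized `LinearRegression` (an assumption for illustration; the assignment uses `Ridge`) so that the fit is exact: raising the feature by one unit changes the prediction by exactly the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, unscaled feature (think "dollars") and an exactly linear target
x_toy = np.array([[50.0], [100.0], [150.0], [200.0]])
y_toy = 5.0 - 0.01 * x_toy.ravel()

model = LinearRegression().fit(x_toy, y_toy)
coef = model.coef_[0]

# Change in prediction when the feature goes up by one unit
delta = model.predict([[101.0]])[0] - model.predict([[100.0]])[0]

print(coef, delta)
```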

In [None]:
scaler = preprocessor_text.named_transformers_['scale']

Taking into account the scaling, find the relationship implied by the model. Your answer should be in the units of (reviews per month) / (dollars per night); that is, your answer should be in the form of "Increasing the price by \$1 increases/decreases the predicted target by ___ reviews/month".


#### (optional) 3(f)

Perform an open-ended exploration of the feature importances using the model of your choosing. What is the most fun or interesting result you found? You could examine, for example:

- Different room types
- The importance of various numeric features
- The "most positive" and "most negative" words.

Recommended time spent: 5-15 min.

#### 3(g)
rubric={points:10}

Given your exploration in the previous part, can you draw conclusions about the world, such as the desirability of various neighbourhoods in New York? Could you now give advice to a friend on what words are best to use in an Airbnb listing? My answer, as you may have guessed, is no! (Or, at least, not without extreme caution and/or statistical training that goes beyond the scope of this course.) Give **3 reasons** why you might be wrong if you drew such conclusions about the world from the above analysis. **Maximum 1 sentence** per reason.

## Exercise 4: very short answer questions
rubric={points:25}

Answer each of the following questions in **at most 1 sentence**. Each one is worth 5 points.

1. If `neighbourhood_group` is already a feature, why might `neighbourhood` still be worth including?
2. If `neighbourhood` is already a feature, why might `neighbourhood_group` still be worth including?
3. If you had an enormous training set and could only keep one of `neighbourhood` or `neighbourhood_group`, which one would you keep, and why?
4. How does the size of your training set influence how much you "trust" your feature importances?
5. If you wanted to find the "most positive" and "most negative" words in terms of reviews per month, would you use `Ridge` or `RandomForestRegressor`? Assume you're using only scikit-learn (no SHAP, etc.).


## Submission to Canvas

**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Save your notebook.
3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.
4. Run the code `submit()` below to go through an interactive submission process to Canvas.
>For this step, you will need a Canvas *Access Token*. If you haven't already got one, log in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. 

Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.


In [None]:
from canvasutils.submit import submit, convert_notebook

# Note: the canvasutils package should have been installed as part of your environment setup - 
# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md

In [None]:
# convert_notebook("hw5.ipynb", "html")  # uncomment and run when you want to convert your notebook to HTML (or you can convert manually from the File menu)

In [None]:
# submit(course_code=53561, token=False)  # uncomment and run when ready to submit 