# CPSC 330 hw5



In this assignment we'll be exploring a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer  # used in Exercise 1(d)
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split

In [None]:
plt.rcParams['font.size'] = 14

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.students.cs.ubc.ca/cpsc330-2019w-t2/home/blob/master/docs/homework_instructions.md). 

**Additional requirement**: if you are working with a partner, please write a couple of sentences explaining the contribution of each team member. You should refer to yourselves by your CSIDs (because seeing names can cause bias during grading). Here is an example:

> a1b2c did Exercise 1, checked over Exercise 2, and pair-programmed for Exercise 3. z9y8x checked over Exercise 1, did Exercise 2, and pair-programmed for Exercise 3. 

Our ideal scenario is that you worked together on all the exercises, but you are not required to do so, and for now we are only collecting this information because we are curious. If you are working alone, you can ignore this section.

_YOUR TEAMWORK CONTRIBUTION STATEMENT GOES HERE_

## Writing quality/quantity
rubric={points:5}

The TAs have reported a couple of issues with the first few assignments. In some cases, submissions simply show the code output with no commentary; please write at least a sentence explaining your output in each question. In other cases, the TAs have come across multi-paragraph answers where a couple of sentences would have sufficed. Thus, we are now allocating the above points for well-structured answers of a reasonable length. In general, 1-3 sentences is good.

In [None]:
df = pd.read_csv('AB_NYC_2019.csv')

In [None]:
df_train, df_test = train_test_split(df, test_size=0.1, random_state=123)
df_train, df_valid = train_test_split(df_train, test_size=0.25, random_state=123)
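
The two calls above split the data in stages, so the second `test_size` applies to what is left after the first split. A quick arithmetic check of the resulting fractions (the actual row counts can differ by a row or so because scikit-learn rounds to whole rows):

```python
from fractions import Fraction

test_frac = Fraction(1, 10)               # test_size=0.1 in the first split
remaining = 1 - test_frac                 # 9/10 of the rows are left
valid_frac = remaining * Fraction(1, 4)   # test_size=0.25 of the remainder
train_frac = remaining - valid_frac       # everything else is training data

print(float(train_frac), float(valid_frac), float(test_frac))
```

So roughly 67.5% of the rows end up in `df_train`, 22.5% in `df_valid`, and 10% in `df_test`.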

In [None]:
df_train.head()

In this assignment we'll try to predict `reviews_per_month`, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts toward creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

In [None]:
target_name = 'reviews_per_month'

In [None]:
y_train = df_train[target_name]
y_valid = df_valid[target_name]
y_test  = df_test[target_name]

It seems that listings with 0 reviews have `reviews_per_month` set to NaN instead of 0. I will fix this:

In [None]:
y_train = y_train.fillna(value=0)
y_valid = y_valid.fillna(value=0)
y_test  = y_test.fillna(value=0)

In [None]:
categorical_features = ['neighbourhood', 'neighbourhood_group', 'room_type']
drop_features        = ['id', 'name', 'host_id', 'host_name', 'last_review']
numeric_features     = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']

Make sure all columns are accounted for (this is better than what I was doing previously with sets, because it also finds duplicates):

In [None]:
assert sorted(df.columns) == sorted(categorical_features + drop_features + numeric_features + [target_name])
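
To see why comparing sorted lists is stricter than comparing sets, here is a small sketch with made-up column names (not the real dataset columns). A feature accidentally listed in two groups slips past the set comparison but fails the sorted-list comparison:

```python
# Hypothetical column names, for illustration only.
columns = ['a', 'b', 'c']
group_1 = ['a', 'b']
group_2 = ['b', 'c']  # 'b' is accidentally listed in both groups

combined = group_1 + group_2  # ['a', 'b', 'b', 'c']

# Sets collapse duplicates, so this check passes despite the mistake.
set_check = set(columns) == set(combined)

# Sorted lists keep duplicates, so this check catches the mistake.
sorted_check = sorted(columns) == sorted(combined)

print(set_check, sorted_check)
```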

In [None]:
# Note: scikit-learn >= 1.2 renames OneHotEncoder's `sparse` argument to `sparse_output`.
transformers=[
    ('scale',  StandardScaler(), numeric_features),
    ('ohe',    OneHotEncoder(drop='first', sparse=False), categorical_features[1:]),
    ('ohe-nodrop', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_features[:1])]

In [None]:
preprocessor = ColumnTransformer(transformers=transformers)

In [None]:
preprocessor.fit(df_train);

In [None]:
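# Note: scikit-learn >= 1.0 provides `get_feature_names_out`; `get_feature_names` was removed in 1.2.
ohe_names = preprocessor.named_transformers_['ohe'].get_feature_names(categorical_features[1:])
ohe_nodrop_names = preprocessor.named_transformers_['ohe-nodrop'].get_feature_names(categorical_features[:1])
new_columns = numeric_features + list(ohe_names) + list(ohe_nodrop_names)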
new_columns;

In [None]:
X_train = pd.DataFrame(preprocessor.transform(df_train), index=df_train.index, columns=new_columns)
X_valid = pd.DataFrame(preprocessor.transform(df_valid), index=df_valid.index, columns=new_columns)
X_test  = pd.DataFrame(preprocessor.transform(df_test),  index=df_test.index,  columns=new_columns)

In [None]:
X_train.head()

Next, we'll try training some models:

In [None]:
dr = DummyRegressor()
dr.fit(X_train, y_train);

In [None]:
dr.score(X_train, y_train)

In [None]:
dr.score(X_valid, y_valid)

Remember, the score here is $R^2$. Let's try linear regression with L2 regularization (`Ridge`):

In [None]:
lr = Ridge(alpha=100)
lr.fit(X_train, y_train);

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_valid, y_valid)
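
As a reminder of how the $R^2$ scores above are computed, the sketch below evaluates $R^2 = 1 - SS_{res}/SS_{tot}$ by hand on made-up numbers. It assumes the default `DummyRegressor` strategy of always predicting the training mean, which is why that model scores 0 on the data it was fit to.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [0.5, 1.0, 3.0, 1.5]  # made-up targets, for illustration only

# Always predicting the mean of y (what a mean-strategy dummy model does).
mean_prediction = [sum(y) / len(y)] * len(y)

r2_mean = r2(y, mean_prediction)  # no better than the mean
r2_perfect = r2(y, y)             # perfect predictions

print(r2_mean, r2_perfect)
```

On a different split, such as the validation set, the mean differs slightly from the training mean, so the dummy model's $R^2$ is typically slightly negative rather than exactly 0.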

In [None]:
lr_coefs = pd.DataFrame(data=lr.coef_, index=X_train.columns, columns=["Coefficient"])
lr_coefs.sort_values(by="Coefficient", ascending=False).head()

## Exercise 1: data preparation

#### 1(a)
rubric={points:5}

Looking at the top coefficients of the linear regression model reveals an issue with the way we set up the data. One of those features should not be used in the analysis. 

- Explain the issue.
- Fix the issue.

#### 1(b)
rubric={points:10}

In my data preparation, I set `handle_unknown='ignore'` in the `OneHotEncoder` for the `neighbourhood` feature, but not the `neighbourhood_group` feature. 

- Why was this necessary for `neighbourhood`? 
- Why was this **not** necessary for `neighbourhood_group`?

#### 1(c)
rubric={points:10}

Perhaps it would have been reasonable to take all neighbourhoods across all 3 splits (train/valid/test) and use those to train the `OneHotEncoder` for `neighbourhood`. In general it's a violation of the Golden Rule to look at the validation/test data in our feature preprocessing. 

- Make an argument for why it might be OK in this case.
- But explain what might go wrong if we did it this way.

#### 1(d)
rubric={points:10}

Earlier we dropped the `name` column, but perhaps the names contain useful information. For example, maybe listings with the word "new" are more popular than listings with the word "rustic". Let's update the feature preprocessing so that `name` is not dropped, but instead encoded with `CountVectorizer`.

In [None]:
count_vec = CountVectorizer(max_features=100, stop_words='english')
count_vec.fit(df_train['name']);

- Explain the issue.
- Modify `df_train`, `df_valid`, and `df_test` to fix the issue, so that the same line of code (above) runs successfully (below).

In [None]:
count_vec.fit(df_train['name']); # this should work now

Once the issue is fixed, we'll proceed to transform the data:

In [None]:
word_feature_names = ['name_' + word for word in count_vec.get_feature_names()]

In [None]:
X_train_name = pd.DataFrame(data=count_vec.transform(df_train["name"]).toarray(), columns=word_feature_names, index=X_train.index)
X_valid_name = pd.DataFrame(data=count_vec.transform(df_valid["name"]).toarray(), columns=word_feature_names, index=X_valid.index)
X_test_name  = pd.DataFrame(data=count_vec.transform(df_test["name"]).toarray(),  columns=word_feature_names, index=X_test.index)

In [None]:
X_train_words = pd.concat((X_train, X_train_name), axis=1)
X_valid_words = pd.concat((X_valid, X_valid_name), axis=1)
X_test_words  = pd.concat((X_test, X_test_name),   axis=1)
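
As a rough picture of what the `name_*` columns contain, here is a simplified bag-of-words sketch in plain Python. It is not scikit-learn's implementation: the listing names are made up, there is no stop-word removal, and `max_features` is set to 3 to keep the output small.

```python
import re
from collections import Counter

# Made-up listing names, for illustration only.
names = [
    "Cozy room in Williamsburg",
    "Sunny Williamsburg loft",
    "Cozy private room",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Count each word across all names and keep only the most frequent ones,
# mimicking the role of max_features.
max_features = 3
totals = Counter(word for name in names for word in tokenize(name))
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# Each name becomes a row of counts, one column per vocabulary word.
rows = [[tokenize(name).count(word) for word in vocabulary] for name in names]

feature_names = ['name_' + word for word in vocabulary]
print(feature_names)
for row in rows:
    print(row)
```

Each row plays the role of one line of `X_train_name`: a count per kept word, with all other words ignored.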

We now have a lot of features generated by the OHE and the word counts:

In [None]:
X_train_words.columns[X_train_words.columns.str.startswith("neighbourhood")]

In [None]:
X_train_words.columns[X_train_words.columns.str.startswith("name")]

In fact, it seems we now have a feature called `neighbourhood_Williamsburg` and a feature called `name_williamsburg`:

In [None]:
X_train_words[['neighbourhood_Williamsburg', 'name_williamsburg']]

#### 1(e)
rubric={points:5}

- Would it make sense to remove the feature `neighbourhood_Williamsburg`? Briefly explain.
- Would it make sense to remove the feature `name_williamsburg`? Briefly explain.

Next, we explore the target values. The number of reviews per month ranges quite a bit:

In [None]:
y_train.describe()

In [None]:
y_train.hist(bins=100);

We decide to log-transform these targets because we care more about relative than absolute error:

In [None]:
y_train_log = np.log(y_train)
y_valid_log = np.log(y_valid)
y_test_log  = np.log(y_test)
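
To illustrate the "relative rather than absolute error" motivation, the sketch below uses made-up review rates. Two predictions that are each off by a factor of 2 have very different absolute errors but the same error on the log scale:

```python
import math

# Made-up review rates (reviews per month), for illustration only.
small_true, small_pred = 1.0, 2.0
large_true, large_pred = 10.0, 20.0

# The absolute errors differ by a factor of 10...
abs_err_small = abs(small_pred - small_true)
abs_err_large = abs(large_pred - large_true)

# ...but both predictions are off by the same factor of 2, so the
# errors on the log scale match (both equal log 2).
log_err_small = abs(math.log(small_pred) - math.log(small_true))
log_err_large = abs(math.log(large_pred) - math.log(large_true))

print(abs_err_small, abs_err_large)
print(log_err_small, log_err_large)
```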

In [None]:
y_train_log.hist()

#### 1(f)
rubric={points:5}

- Explain the issue.
- Fix the issue.

## Exercise 2: model selection
rubric={points:15}

Spend 5-20 minutes tuning a `Ridge` and a `RandomForestRegressor` on this problem, using `X_*_words` and `y_*_log` (where `*` is a split name). Which model do you think is better here?

## Exercise 3: feature importances

For this exercise we'll consider the random forest created below, which performs somewhat decently (though still not great):

In [None]:
rf = RandomForestRegressor(max_depth=20, max_features=20, n_estimators=20, random_state=20)
rf.fit(X_train_words, y_train_log);

In [None]:
rf.score(X_train_words, y_train_log)

In [None]:
rf.score(X_valid_words, y_valid_log)

#### 3(a)
rubric={points:10}

Look at the feature importances for this random forest. What features seem to be driving your predictions most?

#### 3(b)
rubric={points:5}

For the two most important features in the above model, do you think increasing these features increases or decreases the predicted number of reviews per month? Briefly justify your answer. **Maximum 2 sentences.**

#### 3(c)
rubric={points:10}

Use SHAP to try to answer the above question, for predictions on the training set. You will likely need to take a subset of the training examples for speed when computing the SHAP values. **Maximum 1 paragraph.**

The following code adds a column of random noise to `X` and re-trains.

In [None]:
X_train_words_noise = pd.concat((X_train_words, pd.DataFrame(np.random.randn(X_train_words.shape[0],1), columns=['noise'], index=X_train_words.index)), axis=1)
X_valid_words_noise = pd.concat((X_valid_words, pd.DataFrame(np.random.randn(X_valid_words.shape[0],1), columns=['noise'], index=X_valid_words.index)), axis=1)

In [None]:
rf_noise = RandomForestRegressor(max_depth=20, max_features=20, n_estimators=20, random_state=20)
rf_noise.fit(X_train_words_noise, y_train_log);

Next, we can look at the feature importances of this noise column:

In [None]:
pd.DataFrame(data=rf_noise.feature_importances_, index=X_train_words_noise.columns, columns=["Importance"]).tail()

#### 3(d)
rubric={points:5}

Why is the importance of the random noise feature non-zero, and in fact larger than that of other features? **Maximum 2 sentences.**

#### 3(e)
rubric={points:15}

One of the numeric features is `price`, which is presumably in U.S. dollars per night. Using a `Ridge` model, look at the coefficient corresponding to `price`. Find the relationship implied by the model, in the units of (reviews per month) / (dollars per night). 

Hint: should you use the log-transformed target values or the originals?

#### 3(f)
rubric={points:10}

Perform an open-ended exploration of the feature importances using the model of your choosing. What is the most fun or interesting result you found? You could examine, for example:

- Different room types
- The importance of various numeric features
- The "most positive" and "most negative" words.

#### 3(g)
rubric={points:10}

Given your exploration in the previous part, can you draw conclusions about the world, such as the desirability of various neighbourhoods in New York? Could you now give advice to a friend on what words are best to use in an Airbnb listing? My answer, as you may have guessed, is no! (Or, at least, not without extreme caution and/or statistical training that goes beyond the scope of this course.) Give **3 reasons** why you might be wrong if you drew such conclusions about the world from the above analysis. **Maximum 1 sentence** per reason.

## Exercise 4: very short answer questions
rubric={points:40}

Answer each of the following questions in **at most 1 sentence**. Each one is worth 5 points.

1. If `neighbourhood_group` is already a feature, why might `neighbourhood` still be worth including?
2. If `neighbourhood` is already a feature, why might `neighbourhood_group` still be worth including?
3. If you had an enormous training set and could only keep one of `neighbourhood` or `neighbourhood_group`, which one would you keep, and why?
4. How does the size of your training set influence how much you "trust" your feature importances?
5. If you wanted to find the "most positive" and "most negative" words in terms of reviews per month, would you use `Ridge` or `RandomForestRegressor`? Assume you're using only scikit-learn (no SHAP, etc.).
6. What is an advantage of ensembling multiple models as opposed to just choosing one of them?
7. What is a disadvantage of ensembling multiple models as opposed to just choosing one of them?
8. By default, `StackingRegressor` uses `Ridge` as its "meta-model". Explain the significance of the coefficients learned by this `Ridge` model.