## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics, impute
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.metrics import root_mean_squared_error, mean_absolute_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.

In [5]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

train.dtypes

id                                                int64
listing_location                                 object
description                                      object
host_since                                       object
host_location                                    object
host_about                                       object
host_response_time                               object
host_response_rate                               object
host_acceptance_rate                             object
host_neighbourhood                               object
host_listings_count                             float64
host_total_listings_count                       float64
host_verifications                               object
host_has_profile_pic                             object
host_identity_verified                           object
neighbourhood_cleansed                           object
latitude                                        float64
longitude                                       

### Cleaning

1. Drop unnecessary columns (same list for train and test)
   1. High-VIF columns, room_type
2. Clean messy columns: parse bathrooms_text, convert true/false to 1/0
   1. bathrooms_text, host_identity_verified, property_type
   2. description: flag luxury words
3. One-hot encode
4. Impute
5. Scale

In [8]:
# many cols flagged as multicollinear (VIF above 8)
# others are just unnecessary, too messy, have too many categories, etc.


cols_to_drop = ['has_availability', 'host_verifications', 'host_response_time',
                'host_since', 'host_location', 'host_neighbourhood', 'host_has_profile_pic',
                'neighbourhood_cleansed', 'first_review', 'last_review', 'room_type',
                'amenities', 'host_about','minimum_minimum_nights', 'minimum_maximum_nights',
                'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
                'availability_30', 'availability_60', 'availability_90', 'calculated_host_listings_count']

train = train.drop(cols_to_drop, axis=1)
test = test.drop(cols_to_drop, axis=1)

In [10]:
# clean $ and % from numeric cols

def clean_cols(val):
    if isinstance(val, str):
        val = val.replace('$', '').replace('%', '').replace(',', '')
    return float(val)

train[['price','host_acceptance_rate','host_response_rate']] = train[['price','host_acceptance_rate','host_response_rate']].map(clean_cols)
test[['host_acceptance_rate','host_response_rate']] = test[['host_acceptance_rate','host_response_rate']].map(clean_cols)

In [12]:
# clean bathrooms_text - convert half-bath to 0.5
train['bathrooms_text'] = train['bathrooms_text'].replace('Half-bath', '0.5')
test['bathrooms_text'] = test['bathrooms_text'].replace('Half-bath', '0.5')

# extract numeric part
train['bathrooms_text'] = train['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)
test['bathrooms_text'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)

test = test.rename(columns={'bathrooms_text': 'bathrooms'})
train = train.rename(columns={'bathrooms_text': 'bathrooms'})
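The exact-match `replace('Half-bath', '0.5')` above only converts cells whose whole value is `Half-bath`. If the column also holds variants such as `Shared half-bath` (an assumption about the data, not checked here), the digit extraction would leave them as NaN. A minimal sketch of a more tolerant parser, using a hypothetical helper `parse_bathrooms` that is not part of the notebook:

```python
import re

def parse_bathrooms(text):
    """Return the bathroom count as a float, or None if it cannot be parsed.

    Hypothetical helper: any 'half-bath' variant without a number counts as 0.5.
    """
    if not isinstance(text, str):
        return None  # missing values (NaN) stay missing
    match = re.search(r'\d+\.?\d*', text)
    if match:
        return float(match.group())
    if 'half-bath' in text.lower():
        return 0.5
    return None

# Illustrative values only; the real column may contain other spellings.
examples = ['1 bath', '2.5 baths', 'Half-bath', 'Shared half-bath', '1.5 shared baths']
parsed = [parse_bathrooms(t) for t in examples]
print(parsed)
```

It could be applied to both frames with something like `train['bathrooms_text'].map(parse_bathrooms).astype(float)`, so that unparsed cells end up as NaN for the imputer.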


In [14]:
# host identity verified - convert to 0/1
# instant bookable - convert to 0/1

train['instant_bookable'] = train['instant_bookable'].fillna(False).astype(int)
test['instant_bookable'] = test['instant_bookable'].fillna(False).astype(int)

train['host_identity_verified'] = train['host_identity_verified'].fillna(False).astype(int)
test['host_identity_verified'] = test['host_identity_verified'].fillna(False).astype(int)



In [16]:
# property type -- make new col, 1 if entire property and 0 otherwise

train['is_entire_place'] = train['property_type'].str.contains('entire', case=False, na=False).astype(int)
test['is_entire_place'] = test['property_type'].str.contains('entire', case=False, na=False).astype(int)

train = train.drop('property_type', axis=1)
test = test.drop('property_type', axis=1)

In [18]:
# description -- make new col, 1 if contains 'luxury' words

keywords = ['luxury', 'luxurious', 'penthouse', 'exclusive', 'elegant',
            'premium', 'high-end', 'designer', 'upscale', 'chic',
            'modern', 'deluxe', 'sophisticated', 'breathtaking','custom-built', 'architect-designed',
            'state-of-the-art','prestigious', 'top-tier', '5-star', 'five-star']

pattern = '|'.join(keywords)
train['luxury_description'] = train['description'].str.contains(pattern, case=False, na=False).astype(int)
test['luxury_description'] = test['description'].str.contains(pattern, case=False, na=False).astype(int)

train = train.drop('description', axis=1)
test = test.drop('description', axis=1)

In [19]:
# OHE location

location_dummies = pd.get_dummies(train['listing_location'], prefix='location', drop_first=True)
train = pd.concat([train, location_dummies], axis=1).drop('listing_location', axis=1)

location_dummies = pd.get_dummies(test['listing_location'], prefix='location', drop_first=True)
test = pd.concat([test, location_dummies], axis=1).drop('listing_location', axis=1)

In [22]:
train[['location_chicago', 'location_kauai']] = train[['location_chicago', 'location_kauai']].astype(int)
test[['location_chicago', 'location_kauai']] = test[['location_chicago', 'location_kauai']].astype(int)
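Because the dummies are built separately for each frame, train and test only end up with identical columns when both contain the same set of locations. A minimal sketch of one way to guard against a mismatch, on toy frames; the value `asheville` is a made-up placeholder for the dropped category, not taken from the data:

```python
import pandas as pd

# Toy frames standing in for train/test; 'asheville' is a hypothetical location.
train_toy = pd.DataFrame({'listing_location': ['asheville', 'chicago', 'kauai']})
test_toy = pd.DataFrame({'listing_location': ['chicago', 'asheville']})  # no 'kauai'

train_dummies = pd.get_dummies(train_toy['listing_location'], prefix='location',
                               drop_first=True, dtype=int)
test_dummies = pd.get_dummies(test_toy['listing_location'], prefix='location', dtype=int)

# Force the test dummies to have exactly the training columns, in the same order;
# categories missing from test become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```

Passing `dtype=int` to `get_dummies` also makes a separate `astype(int)` step unnecessary.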

In [24]:
train.isna().sum()

id                                                 0
host_response_rate                               557
host_acceptance_rate                             331
host_listings_count                                3
host_total_listings_count                          3
host_identity_verified                             0
latitude                                           0
longitude                                          0
accommodates                                       0
bathrooms                                          3
bedrooms                                          23
beds                                              10
price                                              0
minimum_nights                                     0
maximum_nights                                     0
maximum_minimum_nights                             0
availability_365                                   0
number_of_reviews                                  0
number_of_reviews_ltm                         

In [26]:
# IMPUTE MISSING VALS

# fit to train data, transform train and test data

imputer = KNNImputer(n_neighbors=3)

train_impute_cols = [col for col in train.columns if col not in ['id', 'price']]

#scale
scaler = MinMaxScaler()
scaled_train_array = scaler.fit_transform(train[train_impute_cols])
scaled_train_df = pd.DataFrame(scaled_train_array, columns=train_impute_cols, index=train.index)

train_imputed_arr = imputer.fit_transform(scaled_train_df) # impute on the scaled data
train_unscaled_data = scaler.inverse_transform(train_imputed_arr) # then undo the scaling

train_imputed = train.copy()
train_imputed[train_impute_cols] = pd.DataFrame(train_unscaled_data, columns=train_impute_cols, index=train.index)


# test - use same scale
test_impute_cols = [col for col in test.columns if col != 'id']


scaled_test_array = scaler.transform(test[test_impute_cols])
scaled_test_df = pd.DataFrame(scaled_test_array, columns=test_impute_cols, index=test.index)

test_imputed_arr = imputer.transform(scaled_test_df)
test_unscaled_data = scaler.inverse_transform(test_imputed_arr)

test_imputed = test.copy()
test_imputed[test_impute_cols] = pd.DataFrame(test_unscaled_data, columns = test_impute_cols, index=test.index)

In [27]:
# slice predictors - no id !! but save for predictions

X_train = train_imputed.drop(columns=['price','id'])
y_train = np.log(train_imputed['price'])

X_test = test_imputed.drop(columns=['id'])
test_ids = test_imputed['id']



In [28]:
# scale

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [274]:
'''
model = Ridge(alpha = 0.005)
model.fit(X_train_scaled, y_train)
y_pred = np.exp(model.predict(X_test_scaled))

submission = pd.DataFrame({'id': test_ids,'predicted': y_pred})

submission.to_csv('submission.csv', index=False)'''




In [294]:
# create an array of hyperparameter values
alphas = 10**np.linspace(10,-2,200)*0.5 # 200 alpha values

# create an empty list
cv_scores = []

# now, we will use Ridge with cross_val_score
for alpha in alphas: # for each hyperparam
    model = Ridge(alpha=alpha) # create a model with that hyperparam
    cv_score = cross_val_score(model, X_train_scaled, y_train, scoring = 'neg_mean_absolute_error')
    cv_scores.append(cv_score)

avg_cv_results = -np.array(cv_scores).mean(axis=1)

# first, find the best score: lowest MAE
print('Best avg cv performance:', np.min(avg_cv_results))
print('Best hyperparam:', alphas[np.argmin(avg_cv_results)])




model = Ridge(alpha = alphas[np.argmin(avg_cv_results)])
model.fit(X_train_scaled, y_train)
y_preds = np.exp(model.predict(X_test_scaled))

submission = pd.DataFrame({
    'id': test_ids,           # original id column from the test data
    'predicted': y_preds         # your model’s predictions
})

submission.to_csv('sub1.csv', index=False)

Best avg cv performance: 0.3936951100339348
Best hyperparam: 440.2441790821732


In [286]:
alphas = 10**np.linspace(10,-2,200)*0.5
cv_preds = []

for alpha in alphas: # for each alpha
    model = Ridge(alpha=alpha) # create model with that alpha
    cv_pred = cross_val_predict(model, X_train_scaled, y_train) #no scoring because the output will be the preds, not performances

    cv_preds.append(cv_pred)

# each row: the predictions for every obs in the training data, made WHEN that obs was in the assessment fold

# convert the log predictions into the linear scale 
cv_pred_errors = np.exp(np.array(cv_preds)) - np.array(np.exp(y_train))

# absolute value of the mean signed error for each alpha (errors can cancel, so this is not MAE)
cv_maes = np.abs((cv_pred_errors).mean(axis=1))

print('Price error value:', np.min(cv_maes))
print('Best alpha:', alphas[np.argmin(cv_maes)]) # selected on the price scale

Price error value: 45.031844341816175
Best alpha: 0.005
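Note that `np.abs(cv_pred_errors.mean(axis=1))` takes the absolute value of the mean signed error, so over- and under-predictions can cancel before the absolute value is applied. A minimal sketch of how that differs from a mean absolute error, on made-up numbers rather than the notebook's data:

```python
import numpy as np

# Made-up errors (prediction minus truth) for two alphas, three observations each.
errors = np.array([[10.0, -10.0, 0.0],
                   [ 2.0,   3.0, 1.0]])

abs_of_mean = np.abs(errors.mean(axis=1))   # signed errors cancel first
mean_of_abs = np.abs(errors).mean(axis=1)   # mean absolute error (MAE)

print('abs of mean error:', abs_of_mean)
print('mean absolute error:', mean_of_abs)
print('row picked by each:', np.argmin(abs_of_mean), np.argmin(mean_of_abs))
```

In this toy case the two criteria pick different rows, so the alpha selected by one need not match the alpha selected by the other.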


In [300]:
#LASSO


# create an array of hyperparameter values
alphas = 10**np.linspace(10,-2,200)*0.5 # 200 alpha values

# create an empty list
cv_scores = []

# now, we will use Lasso with cross_val_score
for alpha in alphas: # for each hyperparam
    model = Lasso(alpha=alpha) # create a model with that hyperparam
    cv_score = cross_val_score(model, X_train_scaled, y_train, scoring = 'neg_mean_absolute_error')
    cv_scores.append(cv_score)

avg_cv_results = -np.array(cv_scores).mean(axis=1)

# first, find the best score: lowest MAE
print('Best avg cv performance:', np.min(avg_cv_results))
print('Best hyperparam:', alphas[np.argmin(avg_cv_results)])




model = Lasso(alpha = alphas[np.argmin(avg_cv_results)])
model.fit(X_train_scaled, y_train)
y_preds = np.exp(model.predict(X_test_scaled))

submission = pd.DataFrame({
    'id': test_ids,           # original id column from the test data
    'predicted': y_preds         # your model’s predictions
})

submission.to_csv('sub2.csv', index=False)

Best avg cv performance: 0.38836468890107834
Best hyperparam: 0.011502150598864608


In [308]:
# KNN model

knn_model = KNeighborsRegressor(n_neighbors = 5, weights = 'distance')

# train
knn_model.fit(X_train_scaled, y_train)

y_preds = np.exp(knn_model.predict(X_test_scaled))

#y_pred_train = knn_model.predict(X_train_scaled)
#print(mean_absolute_error(y_train, y_pred_train))

# evaluate
#print(root_mean_squared_error(y_test, y_pred))


submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('knn1.csv', index=False)


In [312]:
# KNN model 2

knn_model = KNeighborsRegressor(n_neighbors = 6, weights = 'distance')

# train
knn_model.fit(X_train_scaled, y_train)

y_preds = np.exp(knn_model.predict(X_test_scaled))

#y_pred_train = knn_model.predict(X_train_scaled)
#print(mean_absolute_error(y_train, y_pred_train))

# evaluate
#print(root_mean_squared_error(y_test, y_pred))


submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('knn3.csv', index=False)

In [318]:
# KNN training-set MAE for k = 1..19 (not cross-validated)

for i in range(1,20):
    knn_model = KNeighborsRegressor(n_neighbors=i, weights='distance')
    knn_model.fit(X_train_scaled, y_train)
    #y_preds = np.exp(knn_model.predict(X_test_scaled))
    y_pred_train = knn_model.predict(X_train_scaled)
    print(i, mean_absolute_error(y_train, y_pred_train))

    #sub = pd.DataFrame()

1 0.006491575585352442
2 0.005611904898415031
3 0.005545588882155145
4 0.005104618310302251
5 0.00505232653268439
6 0.0051739539253849165
7 0.005197984986656579
8 0.005096219877715136
9 0.005051749273379211
10 0.005023153359506428
11 0.005033191619682778
12 0.0050634266116792075
13 0.005107170894904329
14 0.005029630292161799
15 0.005033615855686972
16 0.005024790510806664
17 0.005014891466792658
18 0.0050199232593925525
19 0.005018959100357977
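The errors printed above are measured on the same rows the model was fit on. With `weights='distance'`, each training row is its own nearest neighbour at distance zero, which is why the values are close to zero for every k and say little about test performance. A minimal sketch of picking k by cross-validation instead, on synthetic data from `make_regression` standing in for `X_train_scaled` and `y_train`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train_scaled / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

ks = range(1, 20)
cv_maes = []
for k in ks:
    knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
    # Each fold is scored on rows the model did not see during fitting.
    scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring='neg_mean_absolute_error')
    cv_maes.append(-scores.mean())

best_k = list(ks)[int(np.argmin(cv_maes))]
print('Best k by 5-fold CV MAE:', best_k)
```

The fold count `cv=5` is an arbitrary choice for the sketch.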


In [320]:
# KNN model 3

knn_model = KNeighborsRegressor(n_neighbors = 17, weights = 'distance')

# train
knn_model.fit(X_train_scaled, y_train)

y_preds = np.exp(knn_model.predict(X_test_scaled))

#y_pred_train = knn_model.predict(X_train_scaled)
#print(mean_absolute_error(y_train, y_pred_train))

# evaluate
#print(root_mean_squared_error(y_test, y_pred))


submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('knn4.csv', index=False)

In [79]:
# GRIDSEARCH CV

model = KNeighborsRegressor() # Any object input that will be fixed (not tuned) can be inputted.

# Second, create a hyperparam grid.
grid = {'n_neighbors':np.arange(10,160,10), 'weights':['uniform', 'distance']}

# Create the grid search
gscv = GridSearchCV(model, grid, scoring='neg_mean_absolute_error')

gscv.fit(X_train_scaled, y_train)

y_preds = np.exp(gscv.best_estimator_.predict(X_test_scaled))

submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('gscv1.csv', index=False)

In [51]:
# grid search on the unscaled features

model = KNeighborsRegressor() # hyperparams that stay fixed (not tuned) can be set here

# Second, create a hyperparam grid.
grid = {'n_neighbors':np.arange(10,160,10), 'weights':['uniform', 'distance']}

# Create the grid search
gscv = GridSearchCV(model, grid, scoring='neg_mean_absolute_error')

gscv.fit(X_train, y_train)

y_preds = np.exp(gscv.best_estimator_.predict(X_test))

submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('gscv2.csv', index=False)

In [34]:
## poly features
# GRIDSEARCH CV

for i in range(2,10):

    poly = PolynomialFeatures(degree=i, include_bias=False)
    X_train_poly = poly.fit_transform(X_train_scaled) # fit on train only
    X_test_poly = poly.transform(X_test_scaled) # reuse the fitted transformer

    model = KNeighborsRegressor()

    grid = {'n_neighbors': np.arange(10,160,10), 'weights': ['uniform', 'distance']}

    gscv = GridSearchCV(model, grid, scoring='neg_mean_absolute_error')

    gscv.fit(X_train_poly, y_train)

    y_preds = np.exp(gscv.best_estimator_.predict(X_test_poly))

    submission = pd.DataFrame({'id': test_ids, 'predicted': y_preds})

    submission.to_csv(f'poly_{i}_gscv.csv', index=False)
    print('Degree', i, 'done.')

Degree 2 done.


KeyboardInterrupt: 

In [36]:
# GRIDSEARCH CV again, different n_neighbors grid

model = KNeighborsRegressor() # Any object input that will be fixed (not tuned) can be inputted.

# Second, create a hyperparam grid.
grid = {'n_neighbors':np.arange(5,150,5), 'weights':['uniform', 'distance']}

# Create the grid search
gscv = GridSearchCV(model, grid, scoring='neg_mean_absolute_error')

gscv.fit(X_train_scaled, y_train)

y_preds = np.exp(gscv.best_estimator_.predict(X_test_scaled))

submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('gscv_4_22.csv', index=False)

In [38]:
# GRIDSEARCH CV again, finer n_neighbors grid plus distance metric

model = KNeighborsRegressor() # Any object input that will be fixed (not tuned) can be inputted.

# Second, create a hyperparam grid.
grid = {'n_neighbors':np.arange(5,50,1), 'weights':['uniform', 'distance'], 'metric':['manhattan', 'euclidean']}

# Create the grid search
gscv = GridSearchCV(model, grid, scoring='neg_mean_absolute_error')

gscv.fit(X_train_scaled, y_train)

y_preds = np.exp(gscv.best_estimator_.predict(X_test_scaled))

submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('gscv_4_22_2.csv', index=False)

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.
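A minimal sketch of the export step, assuming the two-column `id` / `predicted` layout used in the cells above. The ids, prices and file name here are toy stand-ins, not real predictions:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_ids and the model's log-price predictions.
test_ids_demo = pd.Series([101, 102, 103], name='id')
log_preds_demo = np.log(np.array([120.0, 85.5, 300.0]))

submission_demo = pd.DataFrame({
    'id': test_ids_demo,
    'predicted': np.exp(log_preds_demo),  # back-transform from log scale to price
})

# index=False keeps the row index out of the file.
submission_demo.to_csv('submission_demo.csv', index=False)
print(submission_demo)
```

Since the model is trained on `np.log(price)`, the `np.exp` back-transform is what puts the predictions on the price scale.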