## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [67]:
import pandas as pd
import numpy as np
from sklearn import metrics, impute
from sklearn.impute import KNNImputer

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_absolute_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.neighbors import KNeighborsRegressor

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.

In [71]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

train.dtypes

id                                                int64
listing_location                                 object
description                                      object
host_since                                       object
host_location                                    object
host_about                                       object
host_response_time                               object
host_response_rate                               object
host_acceptance_rate                             object
host_neighbourhood                               object
host_listings_count                             float64
host_total_listings_count                       float64
host_verifications                               object
host_has_profile_pic                             object
host_identity_verified                           object
neighbourhood_cleansed                           object
latitude                                        float64
longitude                                       

### Cleaning

1. Drop unnecessary columns (the same set in both datasets)
   1. High-VIF columns, room_type
2. Clean messy columns (bathrooms_text, true/false to 1/0)
   1. bathrooms_text, host_identity_verified, property_type
   2. description: flag luxury words
3. One-hot encode (OHE)
4. Impute
5. Scale

In [74]:
# Many columns were flagged as multicollinear (VIF values above 8);
# others were simply unnecessary, too messy, or had too many categories.


cols_to_drop = ['has_availability', 'host_verifications', 'host_response_time',
                'host_since', 'host_location', 'host_neighbourhood', 'host_has_profile_pic',
                'neighbourhood_cleansed', 'first_review', 'last_review', 'room_type',
                'amenities', 'host_about','minimum_minimum_nights', 'minimum_maximum_nights',
                'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
                'availability_30', 'availability_60', 'availability_90', 'calculated_host_listings_count']

train = train.drop(cols_to_drop, axis=1)
test = test.drop(cols_to_drop, axis=1)
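
The VIF screening mentioned in the comment above is not shown in this notebook. A minimal sketch of how such a check could look is given below, assuming a fully numeric frame with no missing values. The helper `vif_table` and the toy columns `a`, `b`, `c` are hypothetical, and the VIF is computed by hand with NumPy as 1 / (1 - R^2) rather than with the `statsmodels` function imported earlier.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for j, col in enumerate(df.columns):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)

# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

vifs = vif_table(toy)
# Same threshold as the comment above: flag columns with VIF above 8.
high_vif_cols = vifs[vifs > 8].index.tolist()
```

In this toy frame both `a` and `b` are flagged, since each is nearly a linear function of the other. In practice one would usually drop only one column of such a pair and recompute.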

In [76]:
# clean $ and % from numeric cols

def clean_cols(val):
    if isinstance(val, str):
        val = val.replace('$', '').replace('%', '').replace(',', '')
    return float(val)

train[['price','host_acceptance_rate','host_response_rate']] = train[['price','host_acceptance_rate','host_response_rate']].map(clean_cols)
test[['host_acceptance_rate','host_response_rate']] = test[['host_acceptance_rate','host_response_rate']].map(clean_cols)
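
To make the behaviour of `clean_cols` concrete, here is a small standalone check on hand-written inputs. The sample strings are illustrative and not taken from the dataset.

```python
import math

def clean_cols(val):
    # Same logic as the cell above: strip '$', '%' and ',' from strings.
    if isinstance(val, str):
        val = val.replace('$', '').replace('%', '').replace(',', '')
    return float(val)

price = clean_cols('$1,250.00')     # currency string with a thousands separator
rate = clean_cols('95%')            # percentage stays on the 0-100 scale
missing = clean_cols(float('nan'))  # missing values pass through as NaN
```

Note that a non-numeric string such as `'N/A'` would raise a `ValueError` in `float()`, so this helper assumes every non-missing entry is a number once the symbols are removed.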

In [78]:
# clean bathrooms_text - convert half-bath to 0.5
train['bathrooms_text'] = train['bathrooms_text'].replace('Half-bath', '0.5')
test['bathrooms_text'] = test['bathrooms_text'].replace('Half-bath', '0.5')

# extract numeric part
train['bathrooms_text'] = train['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)
test['bathrooms_text'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)

test = test.rename(columns={'bathrooms_text': 'bathrooms'})
train = train.rename(columns={'bathrooms_text': 'bathrooms'})
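
The sketch below runs the same two steps on a toy Series to show what they produce. `Series.replace` with a plain string only matches whole values, so a value like `'Shared half-bath'` would be left alone and end up as NaN after extraction. Whether such values occur in this dataset is an assumption; the `loose` variant is a hypothetical alternative, not the method used above.

```python
import pandas as pd

raw = pd.Series(['1 bath', '2.5 shared baths', 'Half-bath', 'Shared half-bath', None])

# Approach used above: exact-match replace, then pull out the first number.
exact = (raw.replace('Half-bath', '0.5')
            .str.extract(r'(\d+\.?\d*)')[0]
            .astype(float))

# Hypothetical variant: treat any text containing 'half-bath' as 0.5.
is_half = raw.str.contains('half-bath', case=False, na=False)
loose = (raw.mask(is_half, '0.5')
            .str.extract(r'(\d+\.?\d*)')[0]
            .astype(float))
```

With the exact-match version, the NaN produced for `'Shared half-bath'` would later be filled by the imputer rather than set to 0.5.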


In [80]:
# host identity verified - convert to 0/1
# instant bookable - convert to 0/1

train['instant_bookable'] = train['instant_bookable'].fillna(False).astype(int)
test['instant_bookable'] = test['instant_bookable'].fillna(False).astype(int)

train['host_identity_verified'] = train['host_identity_verified'].fillna(False).astype(int)
test['host_identity_verified'] = test['host_identity_verified'].fillna(False).astype(int)



In [82]:
# property type -- make new col, 1 if entire property and 0 otherwise

train['is_entire_place'] = train['property_type'].str.contains('entire', case=False, na=False).astype(int)
test['is_entire_place'] = test['property_type'].str.contains('entire', case=False, na=False).astype(int)

train = train.drop('property_type', axis=1)
test = test.drop('property_type', axis=1)

In [84]:
# description -- make new col, 1 if contains 'luxury' words

keywords = ['luxury', 'luxurious', 'penthouse', 'exclusive', 'elegant',
            'premium', 'high-end', 'designer', 'upscale', 'chic',
            'modern', 'deluxe', 'sophisticated', 'breathtaking','custom-built', 'architect-designed',
            'state-of-the-art','prestigious', 'top-tier', '5-star', 'five-star']

pattern = '|'.join(keywords)
train['luxury_description'] = train['description'].str.contains(pattern, case=False, na=False).astype(int)
test['luxury_description'] = test['description'].str.contains(pattern, case=False, na=False).astype(int)

train = train.drop('description', axis=1)
test = test.drop('description', axis=1)
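
The keyword flag is a case-insensitive substring search over an alternation pattern. The sketch below reproduces that logic with the standard `re` module on a shortened, illustrative keyword list. The helper `has_luxury_word` is hypothetical, and `re.escape` is added as a precaution in case a keyword ever contains a regex metacharacter.

```python
import re

keywords = ['luxury', 'high-end', '5-star']  # shortened, illustrative list
# re.escape guards against keywords that contain regex metacharacters.
pattern = '|'.join(re.escape(k) for k in keywords)

def has_luxury_word(text):
    # Missing descriptions count as "no luxury word", like na=False above.
    if not isinstance(text, str):
        return 0
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

flag_match = has_luxury_word('A HIGH-END loft near the lake')
flag_none = has_luxury_word('Cozy room close to downtown')
flag_missing = has_luxury_word(None)
```

Because this is a substring match, a keyword also fires inside longer words (for example `modern` inside `modernized`), which may or may not be desired.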

In [86]:
# OHE location

location_dummies = pd.get_dummies(train['listing_location'], prefix='location', drop_first=True)
train = pd.concat([train, location_dummies], axis=1).drop('listing_location', axis=1)

location_dummies = pd.get_dummies(test['listing_location'], prefix='location', drop_first=True)
test = pd.concat([test, location_dummies], axis=1).drop('listing_location', axis=1)

In [88]:
train[['location_chicago', 'location_kauai']] = train[['location_chicago', 'location_kauai']].astype(int)
test[['location_chicago', 'location_kauai']] = test[['location_chicago', 'location_kauai']].astype(int)
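
Encoding train and test separately works when both files contain the same set of locations. If a location were missing from the test file, `drop_first=True` would drop a different category there and the columns would no longer line up. A hedged sketch of one way to guard against this is shown below; the toy frames and the location name `austin` are hypothetical.

```python
import pandas as pd

train_toy = pd.DataFrame({'listing_location': ['austin', 'chicago', 'kauai']})
test_toy = pd.DataFrame({'listing_location': ['chicago', 'kauai']})  # 'austin' absent

train_d = pd.get_dummies(train_toy['listing_location'], prefix='location',
                         drop_first=True, dtype=int)

# Naive encoding of test: drop_first now removes 'chicago' instead of 'austin'.
naive = pd.get_dummies(test_toy['listing_location'], prefix='location',
                       drop_first=True, dtype=int)

# Safer: encode test without drop_first, then force the train column layout.
test_d = pd.get_dummies(test_toy['listing_location'], prefix='location', dtype=int)
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
```

Passing `dtype=int` to `get_dummies` also yields 0/1 integers directly, so a separate `astype(int)` step is not needed in this variant.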

In [90]:
train.isna().sum()

id                                                 0
host_response_rate                               557
host_acceptance_rate                             331
host_listings_count                                3
host_total_listings_count                          3
host_identity_verified                             0
latitude                                           0
longitude                                          0
accommodates                                       0
bathrooms                                          3
bedrooms                                          23
beds                                              10
price                                              0
minimum_nights                                     0
maximum_nights                                     0
maximum_minimum_nights                             0
availability_365                                   0
number_of_reviews                                  0
number_of_reviews_ltm                         

In [92]:
# IMPUTE MISSING VALS

# fit to train data, transform train and test data

imputer = KNNImputer(n_neighbors=3)

train_impute_cols = [col for col in train.columns if col not in ['id', 'price']]

# Scale to [0, 1] first so KNN distances are not dominated by large-valued columns
scaler = MinMaxScaler()
scaled_train_array = scaler.fit_transform(train[train_impute_cols])
scaled_train_df = pd.DataFrame(scaled_train_array, columns=train_impute_cols, index=train.index)

train_imputed_arr = imputer.fit_transform(scaled_train_df)  # impute on the scaled data
train_unscaled_data = scaler.inverse_transform(train_imputed_arr)  # then undo the scaling

train_imputed = train.copy()
train_imputed[train_impute_cols] = pd.DataFrame(train_unscaled_data, columns=train_impute_cols, index=train.index)


# Test: reuse the scaler and imputer fitted on train (column order must match train)
test_impute_cols = [col for col in test.columns if col != 'id']


scaled_test_array = scaler.transform(test[test_impute_cols])
scaled_test_df = pd.DataFrame(scaled_test_array, columns=test_impute_cols, index=test.index)

test_imputed_arr = imputer.transform(scaled_test_df)
test_unscaled_data = scaler.inverse_transform(test_imputed_arr)

test_imputed = test.copy()
test_imputed[test_impute_cols] = pd.DataFrame(test_unscaled_data, columns = test_impute_cols, index=test.index)
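
The scale, impute, and unscale sequence can be checked on a toy matrix. The values below are hypothetical; the point is that observed entries come back unchanged and only the missing ones are filled.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy matrix (hypothetical values): column 0 is small-scale, column 1 is large-scale.
X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [3.0, 300.0],
              [np.nan, 400.0]])

scaler = MinMaxScaler()
imputer = KNNImputer(n_neighbors=2)

X_scaled = scaler.fit_transform(X)               # NaNs are ignored in fit, kept in transform
X_filled = imputer.fit_transform(X_scaled)       # neighbours are found on the [0, 1] scale
X_restored = scaler.inverse_transform(X_filled)  # back to the original units

observed = ~np.isnan(X)
```

Scaling before imputation matters because `KNNImputer` uses a Euclidean-style distance, so without it the large-valued column would dominate the neighbour search.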

In [94]:
# Separate predictors and target (log of price); drop id from the features but keep the test ids for the submission

X_train = train_imputed.drop(columns=['price','id'])
y_train = np.log(train_imputed['price'])

X_test = test_imputed.drop(columns=['id'])
test_ids = test_imputed['id']



In [96]:
# scale

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [100]:
model = KNeighborsRegressor(n_neighbors=15, weights='distance', metric='manhattan')

model.fit(X_train_scaled, y_train)

y_preds = np.exp(model.predict(X_test_scaled))
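
The model is fitted on `log(price)` and its predictions are mapped back with `exp`. A minimal sketch of that round trip on toy data follows; the feature values and prices are hypothetical, and `n_neighbors=3` is used only because the toy set has five rows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (hypothetical): one feature, one very expensive listing.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
prices = np.array([50.0, 80.0, 120.0, 200.0, 900.0])

model = KNeighborsRegressor(n_neighbors=3, weights='distance', metric='manhattan')
model.fit(X, np.log(prices))  # train on the log scale

# Predict on the log scale, then map back to prices with exp.
preds = np.exp(model.predict(np.array([[1.5], [3.5]])))

# With weights='distance', a query that coincides with a training point
# returns that point's own target.
train_preds = np.exp(model.predict(X))
```

Averaging neighbours on the log scale amounts to a weighted geometric mean of their prices, which is less sensitive to a single very expensive neighbour than an arithmetic mean, and the `exp` step guarantees positive predictions.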

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [102]:
submission = pd.DataFrame({
    'id': test_ids,        
    'predicted': y_preds   
})

submission.to_csv('final_regression_submission1.csv', index=False)