## Instructions {-}

- This is the template for the code and report on the Prediction Problem assignments.

- Your code in steps 1, 3, 4, and 5 will be executed sequentially, and must produce the RMSE / accuracy claimed on Kaggle.

- Your code in step 2 will also be executed, and must produce the optimal hyperparameter values used to train the model.

## Read data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import ast

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, cross_validate, GridSearchCV, RandomizedSearchCV, KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, mean_squared_error, r2_score
from scipy.stats import uniform
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from IPython import display

In [2]:
raw_train = pd.read_csv('../../Datasets/train_regression.csv')
raw_test = pd.read_csv('../../Datasets/test_regression.csv')

## 1) Data pre-processing

Put the data pre-processing code. You don't need to explain it. You may use the same code from last quarter.

In [3]:
# Create copies of the raw datasets
train = raw_train.copy()
test = raw_test.copy()

# Clean 'price' column: remove '$' and ',' characters, and convert to float
train['price'] = train['price'].str.replace(',', '').str.replace('$', '', regex=False).astype(float)

In [4]:
# Convert 'host_acceptance_rate' and 'host_response_rate' columns to float and scale by dividing by 100
train['acceptance_rate'] = train['host_acceptance_rate'].str.replace('%', '').astype(float) / 100
train['response_rate'] = train['host_response_rate'].str.replace('%', '').astype(float) / 100

test['acceptance_rate'] = test['host_acceptance_rate'].str.replace('%', '').astype(float) / 100
test['response_rate'] = test['host_response_rate'].str.replace('%', '').astype(float) / 100

# Drop unnecessary columns
train.drop(columns=['host_acceptance_rate', 'host_response_rate'], inplace=True)
test.drop(columns=['host_acceptance_rate', 'host_response_rate'], inplace=True)



# Extract numeric values from 'bathrooms_text' column and convert to float
train['bathrooms_num'] = train['bathrooms_text'].str.extract(r'(\d+)').astype(float)
test['bathrooms_num'] = test['bathrooms_text'].str.extract(r'(\d+)').astype(float)

# Fill missing values in 'bathrooms_num' where 'Half-bath' is mentioned in 'bathrooms_text' with 0.5
train.loc[train['bathrooms_text'].str.contains('Half-bath', case=False, na=False) & train['bathrooms_num'].isna(), 'bathrooms_num'] = 0.5
test.loc[test['bathrooms_text'].str.contains('Half-bath', case=False, na=False) & test['bathrooms_num'].isna(), 'bathrooms_num'] = 0.5
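
Note that the pattern `(\d+)` captures only the leading integer, so if the column contains fractional counts, a value such as "1.5 baths" is stored as 1.0. A minimal sketch of the difference, using Python's `re` module on made-up strings (the sample strings, the helper `parse_bathrooms`, and the alternative pattern are illustrative assumptions, not part of the submitted pipeline):

```python
import re

def parse_bathrooms(text, pattern):
    """Return the first numeric match in `text` as a float, or None if nothing matches."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None

INTEGER_ONLY = r'(\d+)'              # same pattern as the cell above
WITH_DECIMALS = r'(\d+(?:\.\d+)?)'   # hypothetical alternative that keeps the fraction

samples = ['1 bath', '1.5 baths', '2.5 shared baths', 'Half-bath']  # made-up values
for s in samples:
    print(s, parse_bathrooms(s, INTEGER_ONLY), parse_bathrooms(s, WITH_DECIMALS))
```

Switching patterns would change the feature values, so the alternative is shown for illustration only.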


In [5]:
# Convert date columns to datetime format
def strip_date(row):
    if isinstance(row, str):
        row = datetime.strptime(row, '%Y-%m-%d').date()
    return row

# Apply date conversion to train dataset
train['host_since'] = train['host_since'].apply(strip_date)
train['first_review'] = train['first_review'].apply(strip_date)
train['last_review'] = train['last_review'].apply(strip_date)

# Apply date conversion to test dataset
test['host_since'] = test['host_since'].apply(strip_date)
test['first_review'] = test['first_review'].apply(strip_date)
test['last_review'] = test['last_review'].apply(strip_date)

# ----- #

# Calculate months since various dates for train dataset
train['host_since_in_months'] = round(((datetime.now().date() - train['host_since']).dt.days) / 30, 2)
train['first_review_in_months'] = round(((datetime.now().date() - train['first_review']).dt.days) / 30, 2)
train['last_review_in_months'] = round(((datetime.now().date() - train['last_review']).dt.days) / 30, 2)

# Calculate months since various dates for test dataset
test['host_since_in_months'] = round(((datetime.now().date() - test['host_since']).dt.days) / 30,  2)
test['first_review_in_months'] = round(((datetime.now().date() - test['first_review']).dt.days) / 30, 2)
test['last_review_in_months'] = round(((datetime.now().date() - test['last_review']).dt.days) / 30, 2)


# Because the review values are extremely collinear, calculate average review scores and fill missing values with 0
train['review_scores_avg'] = np.mean(train[['review_scores_accuracy', 'review_scores_checkin', 'review_scores_communication', 'review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness']], axis=1)
test['review_scores_avg'] = np.mean(test[['review_scores_accuracy', 'review_scores_checkin', 'review_scores_communication', 'review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness']], axis=1)

train.drop(columns=['review_scores_accuracy', 'review_scores_checkin', 'review_scores_communication', 'review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness'], inplace=True)
test.drop(columns=['review_scores_accuracy', 'review_scores_checkin', 'review_scores_communication', 'review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness'], inplace=True)

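# Assign the result back instead of calling fillna(inplace=True) on a column selection
train['review_scores_avg'] = train['review_scores_avg'].fillna(0)
test['review_scores_avg'] = test['review_scores_avg'].fillna(0)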
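
The month features in the cell above are measured from `datetime.now()`, so their raw values depend on the day the notebook is run. A minimal sketch of pinning the reference date instead; the date `2024-01-01` and the names `SNAPSHOT_DATE` and `months_since` are placeholders assumed for illustration, not values from the assignment:

```python
from datetime import date, datetime

SNAPSHOT_DATE = date(2024, 1, 1)  # hypothetical fixed reference date

def months_since(date_string, reference=SNAPSHOT_DATE):
    """Approximate months (30-day units) between a 'YYYY-MM-DD' string and `reference`."""
    parsed = datetime.strptime(date_string, '%Y-%m-%d').date()
    return round((reference - parsed).days / 30, 2)

print(months_since('2023-01-01'))
```

With a fixed reference date, re-running the notebook on a different day yields the same raw feature values.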

In [6]:
## Identify outliers in 'price' and 'minimum_nights'

# top and bottom 0.07% of price
lower_val = np.percentile(train[['price']], 0.07)
upper_val = np.percentile(train[['price']], 99.93)
outliers_idx_price = list(train[(train['price'] >= upper_val) | (train['price'] <= lower_val)].index)
print("Price outliers:", list(outliers_idx_price))
outliers_idx = outliers_idx_price

# # top 0.1% of minimum_nights
# # upper_lim = np.percentile(train[['minimum_nights']], 99.9)
# outliers_idx_nights = []  # list(train[train['minimum_nights'] >= upper_lim].index)
# outliers_idx = list(outliers_idx_price) + list(outliers_idx_nights)
# # print("Min nights outliers:", list(train[train['minimum_nights'] >= upper_lim].index))


print(f"\n{len(train.iloc[outliers_idx, :]['price'])} observations dropped\n")
# train.loc[outliers_idx_price, :]['price'].sort_values()

Price outliers: [523, 1626, 1823, 1848, 2067, 2380, 3129, 4865]

8 observations dropped



In [7]:
train_clean = train.drop(outliers_idx).reset_index(drop=True)
test_clean = test.copy()

Clean Transform

In [8]:
def clean_vars(row):
    # Check if 'shared' is in 'bathrooms_text' to identify shared bathrooms
    if 'shared' in str(row['bathrooms_text']):
        row['bathrooms_shared'] = "t"
        
    # Check if 'bathrooms_text' is empty and 'room_type' is 'Shared' to identify shared bathrooms
    elif pd.isna(row['bathrooms_text']):
        if 'Shared' in row['room_type']:
            row['bathrooms_shared'] = "t"              
        else:
            row['bathrooms_shared'] = "f"
    else: 
        row['bathrooms_shared'] = "f"
        
    # Convert 'Hotel room' room type to 'Private room'
    if row.loc['room_type'] == 'Hotel room':
        row['room_type'] = 'Private room'
        
    return row

# Apply the function to clean variables to train and test datasets
train_clean = train_clean.apply(clean_vars, axis=1)
test_clean = test_clean.apply(clean_vars, axis=1)


clean neighbourhoods

In [9]:
# Group small occurrences into 'Other'
neighbourhood_counts = train_clean['neighbourhood_cleansed'].value_counts()

other_hoods = [i for i in neighbourhood_counts.index if neighbourhood_counts[i] < 100]

test_only_hoods = [i for i in test_clean['neighbourhood_cleansed'].unique() 
                   if i not in neighbourhood_counts 
                   and i != 'Other']
    

In [10]:
# Create DataFrame with unique neighbourhoods
hood_df = pd.DataFrame(index=train_clean['neighbourhood_cleansed'].unique())

# Compute mean and standard deviation for each neighbourhood
grouped = train_clean.groupby('neighbourhood_cleansed')['price']
all_mean = grouped.mean()
all_std = grouped.std()

# Add mean and std to DataFrame
hood_df['mean_price'] = all_mean
hood_df['std_price'] = all_std

# Merge with counts
hood_df = hood_df.merge(neighbourhood_counts, left_index=True, right_index=True)
hood_df.rename(columns={'neighbourhood_cleansed': 'count'}, inplace=True)


# Get the 10th percentile of standard deviations
std_p10 = np.percentile(hood_df.dropna(how='any')['std_price'], 10)


# Filter DataFrame: keep neighbourhoods with a low price spread or more than 100 listings,
# provided they have more than 20 listings
filtered_df = hood_df[((hood_df['std_price'] < std_p10) | (hood_df['count'] > 100)) & (hood_df['count'] > 20)]

keep_hoods = filtered_df.index.tolist()


In [11]:
# Keep a neighbourhood if it is in keep_hoods (small price std or more than 100 listings,
# and always more than 20 listings); group every other neighbourhood into 'Other'
def clean_hoods(row):
    if row.loc['neighbourhood_cleansed'] not in keep_hoods:
        row['neighbourhood_grouped'] = 'Other'
        
    else:    
        row['neighbourhood_grouped'] = row.loc['neighbourhood_cleansed']
        
    return row

train_clean = train_clean.apply(clean_hoods, axis=1)
test_clean = test_clean.apply(clean_hoods, axis=1)

Clean property type

In [12]:
words_to_remove = ['place', 'room', 'private', 'shared', 'entire', ' in', ' room', ' private', ' shared', ' entire', ' in',]

# remove filler and unnecessary words from property
def remove_words(text):
    text=text.lower()
    for word in words_to_remove:
        word = word.lower()
        text = text.replace(word, '')
    return text.strip()


train_clean['property_type'] = train_clean['property_type'].apply(remove_words)
test_clean['property_type'] = test_clean['property_type'].apply(remove_words)


In [13]:
# Identify value counts and make a list of property types with more than 10 listings
property_counts = train_clean['property_type'].value_counts()
keep = [i for i in property_counts.index if property_counts[i] > 10]

def clean_property(row):
    if row not in keep or row == "":
        row = 'Other'
      
    return row


train_clean['property_type_cleansed'] = train_clean['property_type'].apply(clean_property)
test_clean['property_type_cleansed'] = test_clean['property_type'].apply(clean_property)

train_filter_2 = train_clean.copy()
test_filter_2 = test_clean.copy()


In [14]:
train_filter_2.drop(columns=['host_id', 'host_since', 'first_review', 'last_review', 'neighbourhood_cleansed', 'property_type', 'bathrooms_text'], inplace=True)
test_filter_2.drop(columns=['host_id', 'host_since', 'first_review', 'last_review', 'neighbourhood_cleansed', 'property_type', 'bathrooms_text'], inplace=True)

number of verifications

In [15]:
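# Parse the stringified lists; skip if the column has already been parsed
# (literal_eval raises ValueError on non-string input, SyntaxError on malformed strings)
try:
    train_filter_2['host_verifications'] = train_filter_2['host_verifications'].apply(ast.literal_eval)
except (ValueError, SyntaxError):
    pass

try:
    test_filter_2['host_verifications'] = test_filter_2['host_verifications'].apply(ast.literal_eval)
except (ValueError, SyntaxError):
    pass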

In [16]:
train_filter_2['num_verifications'] = train_filter_2['host_verifications'].apply(len)
test_filter_2['num_verifications'] = test_filter_2['host_verifications'].apply(len)

In [17]:
def split_vers(df):
    def update_verification(row):
        ver_phone = 't' if 'phone' in row['host_verifications'] else 'f'
        ver_email = 't' if 'email' in row['host_verifications'] else 'f'
        ver_work_email = 't' if 'work_email' in row['host_verifications'] else 'f'
        return pd.Series({'ver_phone': ver_phone, 'ver_email': ver_email, 'ver_work_email': ver_work_email})

    df[['ver_phone', 'ver_email', 'ver_work_email']] = df.apply(update_verification, axis=1)

    return df


train_filter_2 = split_vers(train_filter_2).drop('host_verifications', axis=1)
test_filter_2 = split_vers(test_filter_2).drop('host_verifications', axis=1)
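
The verification features above can be illustrated on a single raw value. A minimal sketch, assuming `host_verifications` is stored as the string form of a Python list (the sample string and the helper name `verification_features` are made up for illustration):

```python
import ast

def verification_features(raw):
    """Parse a stringified list and derive the count and per-method flags."""
    methods = ast.literal_eval(raw) if isinstance(raw, str) else raw
    return {
        'num_verifications': len(methods),
        'ver_phone': 't' if 'phone' in methods else 'f',
        'ver_email': 't' if 'email' in methods else 'f',
        'ver_work_email': 't' if 'work_email' in methods else 'f',
    }

print(verification_features("['email', 'phone']"))
```

Membership is tested on the parsed list. On an unparsed string, `'email' in "['work_email']"` would be a substring match and return `True`, which is why the parsing step matters.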

Group small occurrences into 'Other'

In [18]:
host_hood_counts = train_filter_2['host_neighbourhood'].value_counts()
keep_host_hood = host_hood_counts[host_hood_counts >= 5].index

train_filter_2['host_neighbourhood'] = train_filter_2['host_neighbourhood'].apply(lambda x: 'Other' if x not in keep_host_hood else x)
test_filter_2['host_neighbourhood'] = test_filter_2['host_neighbourhood'].apply(lambda x: 'Other' if x not in keep_host_hood else x)
# train_final[['host_neighbourhood']].value_counts()
# test_final[['host_neighbourhood']].value_counts()

# ----- #

host_loc_counts = train_filter_2['host_location'].value_counts()
keep_host_loc = host_loc_counts[host_loc_counts >= 10].index

train_filter_2['host_location'] = train_filter_2['host_location'].apply(lambda x: 'Other' if x not in keep_host_loc else x)
test_filter_2['host_location'] = test_filter_2['host_location'].apply(lambda x: 'Other' if x not in keep_host_loc else x)
# train_final['host_location'].value_counts()
# test_final['host_location'].value_counts()
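
The grouping rule above learns the frequent categories from the training data only and maps everything else, including categories that appear only in the test set, to 'Other'. A minimal sketch in plain Python, with made-up category names and a hypothetical helper `group_rare`:

```python
from collections import Counter

def group_rare(train_values, other_values, min_count):
    """Keep categories seen at least `min_count` times in training; map the rest to 'Other'."""
    counts = Counter(train_values)
    keep = {category for category, n in counts.items() if n >= min_count}

    def relabel(values):
        return [v if v in keep else 'Other' for v in values]

    return relabel(train_values), relabel(other_values)

train_hoods = ['Loop'] * 5 + ['Uptown'] * 2      # made-up training categories
test_hoods = ['Loop', 'Uptown', 'Hyde Park']     # 'Hyde Park' never appears in training
train_grouped, test_grouped = group_rare(train_hoods, test_hoods, min_count=5)
print(test_grouped)
```

Learning the keep-list from the training set alone means the test set cannot introduce categories the model has never seen.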

Columns with missing Values

In [19]:
# Create a temporary dataframe to manipulate
train_filter_temp = train_filter_2.copy()
test_filter_temp = test_filter_2.copy()

# Change t/f to numeric 1/0
train_filter_temp['host_is_superhost'] = train_filter_2['host_is_superhost'].replace({'f': 0, 't': 1})
test_filter_temp['host_is_superhost'] = test_filter_2['host_is_superhost'].replace({'f': 0, 't': 1})

# Create model
superhost_model = smf.logit(formula="host_is_superhost ~ calculated_host_listings_count*number_of_reviews_ltm + response_rate", data=train_filter_temp).fit()

# Predict all values 
impute_superhost_train = (superhost_model.predict(train_filter_temp) > 0.5).replace({False:'f', True:'t'})
impute_superhost_test = (superhost_model.predict(test_filter_temp) > 0.5).replace({False:'f', True:'t'})

# Fill NAs with the corresponding value from the model imputation
train_filter_2['host_is_superhost'] = train_filter_2['host_is_superhost'].fillna(impute_superhost_train)
test_filter_2['host_is_superhost'] = test_filter_2['host_is_superhost'].fillna(impute_superhost_test)


Optimization terminated successfully.
         Current function value: 0.586516
         Iterations 8


In [20]:
# create model to impute acceptance rate
acceptance_model = smf.logit(formula="acceptance_rate ~ calculated_host_listings_count + accommodates", data=train_filter_2).fit()


# fill in missing values with the predictions from the model
predicted_acceptance = acceptance_model.predict(train_filter_2)
train_filter_2['acceptance_rate'] = train_filter_2['acceptance_rate'].fillna(predicted_acceptance)

predicted_acceptance_test = acceptance_model.predict(test_filter_2)
test_filter_2['acceptance_rate'] = test_filter_2['acceptance_rate'].fillna(predicted_acceptance_test)


# ----- #


# Create model to impute response rate
response_model = smf.logit(formula="response_rate ~ accommodates", data=train_filter_2).fit()


# fill in missing values with the predictions from the model
predicted_response = response_model.predict(train_filter_2)
train_filter_2['response_rate'] = train_filter_2['response_rate'].fillna(predicted_response)

predicted_response_test = response_model.predict(test_filter_2)
test_filter_2['response_rate'] = test_filter_2['response_rate'].fillna(predicted_response_test)


Optimization terminated successfully.
         Current function value: 0.192005
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.067442
         Iterations 8


naive imputation

In [21]:
# Fill in remaining missing values with median for numerical columns
train_filter_2.fillna(train_filter_2.median(numeric_only=True), inplace=True)
test_filter_2.fillna(test_filter_2.median(numeric_only=True), inplace=True)

In [22]:
import statsmodels.api as sm

non_numeric_columns = train_filter_2.select_dtypes(exclude=[np.number]).columns
data_numeric = train_filter_2.drop(columns=non_numeric_columns)

X = data_numeric.drop(columns=['price', 'id'])
y = data_numeric.price

vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

vif[vif['VIF'] >= 5].sort_values('VIF', ascending=False)

| | Predictor | VIF |
|---|---|---|
| 13 | maximum_nights_avg_ntm | 414390800000.0 |
| 10 | minimum_maximum_nights | 220332700000.0 |
| 11 | maximum_maximum_nights | 43100360000.0 |
| 2 | latitude | 648687.9 |
| 3 | longitude | 647502.7 |
| 21 | calculated_host_listings_count | 36115.02 |
| 22 | calculated_host_listings_count_entire_homes | 35978.06 |
| 15 | availability_60 | 256.598 |
| 12 | minimum_nights_avg_ntm | 186.2627 |
| 16 | availability_90 | 157.4214 |
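
For reference, the variance inflation factor of predictor $j$ is computed from the $R_j^2$ obtained by regressing that predictor on all the other predictors:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A VIF of 5 therefore corresponds to $R_j^2 = 0.8$, the cutoff used in the filter above. Note that `variance_inflation_factor` does not add an intercept on its own, so with no constant column in `X`, predictors whose values sit far from zero (such as `latitude` and `longitude`) can show very large VIFs.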


In [23]:
train_filter_3 = train_filter_2.drop(columns=['maximum_maximum_nights', 'minimum_maximum_nights'])
test_filter_3 = test_filter_2.drop(columns=['maximum_maximum_nights', 'minimum_maximum_nights'])

In [24]:
# train_filter_3 = train_filter_2.drop(columns =['calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'maximum_maximum_nights', 'maximum_minimum_nights', 'maximum_nights_avg_ntm', 'minimum_maximum_nights', 'minimum_minimum_nights', 'minimum_nights_avg_ntm', 'review_scores_cleanliness', 'review_scores_location', 'review_scores_rating', 'review_scores_value', 'review_scores_accuracy', 'review_scores_communication', 'review_scores_checkin', 'reviews_per_month'])
# test_filter_3 = test_filter_2.drop(columns =['calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'maximum_maximum_nights', 'maximum_minimum_nights', 'maximum_nights_avg_ntm', 'minimum_maximum_nights', 'minimum_minimum_nights', 'minimum_nights_avg_ntm', 'review_scores_cleanliness', 'review_scores_location', 'review_scores_rating', 'review_scores_value', 'review_scores_accuracy', 'review_scores_communication', 'review_scores_checkin', 'reviews_per_month'])

In [25]:
non_numeric_columns = train_filter_3.select_dtypes(exclude=[np.number]).columns
data_numeric = train_filter_3.drop(columns=non_numeric_columns)

X = data_numeric.drop(columns=['price', 'id', 'latitude', 'longitude'])
y = data_numeric.price

# Calculate VIF for each predictor
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

vif.sort_values('VIF', ascending=False)

| | Predictor | VIF |
|---|---|---|
| 17 | calculated_host_listings_count | 36084.551154 |
| 18 | calculated_host_listings_count_entire_homes | 35939.563943 |
| 11 | availability_60 | 255.700208 |
| 8 | minimum_nights_avg_ntm | 185.499125 |
| 12 | availability_90 | 156.826298 |
| 0 | host_listings_count | 127.235334 |
| 19 | calculated_host_listings_count_private_rooms | 104.270151 |
| 23 | response_rate | 57.185994 |
| 1 | host_total_listings_count | 49.980148 |
| 4 | minimum_nights | 42.328162 |


In [26]:
train_filter_4 = train_filter_3.drop(columns=['host_listings_count', 'availability_90', 'availability_60', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms'])
test_filter_4 = test_filter_3.drop(columns=['host_listings_count', 'availability_90', 'availability_60',  'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms'])

In [27]:
non_numeric_columns = train_filter_4.select_dtypes(exclude=[np.number]).columns
data_numeric = train_filter_4.drop(columns=non_numeric_columns)

X = data_numeric.drop(columns=['price', 'id', 'latitude', 'longitude'])
y = data_numeric.price

vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

vif.sort_values('VIF', ascending=False)

| | Predictor | VIF |
|---|---|---|
| 17 | response_rate | 56.141121 |
| 16 | acceptance_rate | 35.762314 |
| 23 | num_verifications | 24.350194 |
| 7 | minimum_nights_avg_ntm | 23.195004 |
| 3 | minimum_nights | 18.162502 |
| 5 | minimum_minimum_nights | 16.376433 |
| 15 | reviews_per_month | 13.211356 |
| 1 | accommodates | 11.532231 |
| 2 | beds | 11.189253 |
| 6 | maximum_minimum_nights | 9.213347 |


## 2) Hyperparameter tuning

### How many attempts did it take you to tune the model hyperparameters?

Counting re-uploads to fix typos and minor errors, it took me about 18 attempts to tune the model.

### Which tuning method did you use (grid search / Bayes search / etc.)?

I used RandomizedSearchCV as a coarse search to find a promising range, then GridSearchCV as a finer search within that range.
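
The fine grid is centred on the best `n_neighbors` from the first search and extends one step of that search in each direction. A minimal sketch of that bookkeeping in plain Python (the helper names `fine_grid` and `total_fits` and the lower clamp at 1 are assumptions added for illustration; the numbers mirror the search settings in the tuning code in this section):

```python
def fine_grid(best_k, coarse_step, fine_step=1):
    """Candidate n_neighbors values within one coarse step of the coarse optimum."""
    low = max(1, best_k - coarse_step)   # assumed clamp: n_neighbors must be at least 1
    high = best_k + coarse_step
    return list(range(low, high + 1, fine_step))

def total_fits(n_k, n_weights, n_splits, n_repeats=1):
    """Number of model fits a full grid search performs."""
    return n_k * n_weights * n_splits * n_repeats

k_values = fine_grid(best_k=20, coarse_step=10)
print(k_values[0], k_values[-1], len(k_values))
print(total_fits(len(k_values), n_weights=5, n_splits=5, n_repeats=3))
```

With a best value of 20 and a step of 10, this gives 21 values of `n_neighbors`; combined with 5 weight functions and 5 folds repeated 3 times, that is 1575 fits.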

### What challenges did you face while tuning the hyperparameters, and what actions did you take to address those challenges?

Whenever any part of my code changed, the optimal hyperparameters changed, and so did the ranges I needed to search.

### How many hours did you spend on hyperparameter tuning?

I spent about 8 hours in total tuning and adjusting the hyperparameters.

**Paste the hyperparameter tuning code below. You must show at least one hyperparameter tuning procedure.**

In [28]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, cross_validate, GridSearchCV, RandomizedSearchCV, KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [29]:
numeric_columns = train_filter_4.select_dtypes(include=['number']).drop(columns=['price', 'id']).columns

X_train = train_filter_4.drop(columns=['price', 'id'])
X_test = test_filter_4.drop(columns=['id'])

X_train_num = X_train[numeric_columns]
y_train = train_filter_4.price

sc = StandardScaler()
sc.fit(X_train_num)

X_train_scaled = sc.transform(X_train[numeric_columns])
X_test_scaled = sc.transform(X_test[numeric_columns])

X_train_num_scaled = pd.DataFrame(X_train_scaled, columns=numeric_columns)
X_test_num_scaled = pd.DataFrame(X_test_scaled, columns=numeric_columns)

In [30]:

train_testing = train_filter_4.drop(columns=['price']) 
test_testing = test_filter_4  

train_testing_cat = train_testing.select_dtypes(exclude=['number'])
test_testing_cat = test_testing.select_dtypes(exclude=['number'])


In [31]:
enc = OneHotEncoder(drop='if_binary', handle_unknown='ignore')
enc.fit(train_testing_cat)

drop_enc = enc.transform(train_testing_cat)
drop_enc_test = enc.transform(test_testing_cat)

train_encoded_df = pd.DataFrame(drop_enc.toarray(), columns=enc.get_feature_names_out(train_testing_cat.columns))
test_encoded_df = pd.DataFrame(drop_enc_test.toarray(), columns=enc.get_feature_names_out(test_testing_cat.columns))

X_train_final = pd.concat([X_train_num_scaled, train_encoded_df], axis=1)
X_test_final = pd.concat([X_test_num_scaled, test_encoded_df], axis=1)

In [54]:
def dist_power_2(distance):
    return 1/(1e-10+distance**2)
def dist_power_3(distance):
    return 1/(1e-10+distance**3)
def dist_power_4(distance):
    return 1/(1e-10+distance**4)
def dist_power_5(distance):
    return 1/(1e-10+distance**5)

def dist_power_6(distance):
    return 1/(1e-10+distance**6)
def dist_power_7(distance):
    return 1/(1e-10+distance**7)
def dist_power_8(distance):
    return 1/(1e-10+distance**8)
def dist_power_9(distance):
    return 1/(1e-10+distance**9)
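def dist_power_10(distance):
    return 1/(1e-10+distance**10)

# dist_power_11 to dist_power_16 are referenced by the search grids below;
# they are defined here, following the same pattern, so the cells run sequentially
def dist_power_11(distance):
    return 1/(1e-10+distance**11)
def dist_power_12(distance):
    return 1/(1e-10+distance**12)
def dist_power_13(distance):
    return 1/(1e-10+distance**13)
def dist_power_14(distance):
    return 1/(1e-10+distance**14)
def dist_power_15(distance):
    return 1/(1e-10+distance**15)
def dist_power_16(distance):
    return 1/(1e-10+distance**16)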

def three_dist_power_7(distance):
    return (2/3)*(1/(1e-10+distance**7))
def five_dist_power_7(distance):
    return (4/5)*(1/(1e-10+distance**7))
def sev_dist_power_7(distance):
    return (6/7)*(1/(1e-10+distance**7))
def ten_dist_power_7(distance):
    return (9/10)*(1/(1e-10+distance**7))

def three_1_dist_power_7(distance):
    return (4/3)*(1/(1e-10+distance**7))
def five_1_dist_power_7(distance):
    return (6/5)*(1/(1e-10+distance**7))
def sev_1_dist_power_7(distance):
    return (8/7)*(1/(1e-10+distance**7))
def ten_1_dist_power_7(distance):
    return (11/10)*(1/(1e-10+distance**7))
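
A note on the scaled variants above: a weighted KNN prediction is a weighted average, so the weights are divided by their sum and any constant multiplier cancels. A minimal sketch in plain Python with made-up distances and prices (the helper `weighted_prediction` is hypothetical and only mimics the weighted-average step, not scikit-learn's implementation):

```python
def dist_power_7(distance):
    return 1 / (1e-10 + distance ** 7)

def three_dist_power_7(distance):
    return (2 / 3) * (1 / (1e-10 + distance ** 7))

def weighted_prediction(distances, targets, weight_fn):
    """Weighted average of neighbour targets, with weights normalised by their sum."""
    weights = [weight_fn(d) for d in distances]
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)

distances = [0.5, 1.0, 2.0]      # made-up neighbour distances
prices = [100.0, 150.0, 300.0]   # made-up neighbour prices

plain = weighted_prediction(distances, prices, dist_power_7)
scaled = weighted_prediction(distances, prices, three_dist_power_7)
print(abs(plain - scaled) < 1e-9)
```

Under this reasoning the scaled functions should give the same predictions as `dist_power_7`, so only the exponent matters in the search.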


### VIF for Predictor Selection

#### Naive grid to get a range

In [45]:
step_1 = 10
range_1 = 100

cv_settings_naive = KFold(n_splits=3, shuffle=True, random_state=12)

model = KNeighborsRegressor()
grid = {'n_neighbors':np.arange(step_1, step_1+range_1, step_1), 'weights':[dist_power_2, dist_power_3, dist_power_4, dist_power_5, dist_power_6, dist_power_7, dist_power_8, dist_power_9, dist_power_10, dist_power_11, dist_power_12, dist_power_13, dist_power_14, dist_power_15, dist_power_16]}

rscv = RandomizedSearchCV(model, grid, n_iter=100, cv=cv_settings_naive, scoring='neg_root_mean_squared_error', verbose=2)
rscv.fit(X_train_final, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END n_neighbors=80, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.7s
[CV] END n_neighbors=80, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=80, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=60, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=60, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=60, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s
...

[CV] END n_neighbors=90, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.1s
[CV] END n_neighbors=90, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.1s
[CV] END n_neighbors=70, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.1s
[CV] END n_neighbors=90, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=90, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=90, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=40, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.1s
[CV] END n_neighbors=40, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.1s
...

[CV] END n_neighbors=50, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=50, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=50, weights=<function dist_power_5 at 0x000001E296353060>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.0s
[CV] END n_neighbors=20, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.0s
[CV] END n_neighbors=20, weights=<function dist_power_6 at 0x000001E296353B00>; total time=   0.0s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=30, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.2s
...

[CV] END n_neighbors=40, weights=<function dist_power_3 at 0x000001E29983F1A0>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.2s
[CV] END n_neighbors=50, weights=<function dist_power_16 at 0x000001E296EDF240>; total time=   0.2s
[CV] END n_neighbors=50, weights=<function dist_power_16 at 0x000001E296EDF240>; total time=   0.2s
[CV] END n_neighbors=50, weights=<function dist_power_16 at 0x000001E296EDF240>; total time=   0.2s
[CV] END n_neighbors=80, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.3s
[CV] END n_neighbors=80, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.2s
[CV] END n_neighbors=80, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
...

In [46]:
print(rscv.best_params_)
print(rscv.best_score_)

{'weights': <function dist_power_8 at 0x000001E294661800>, 'n_neighbors': 20}
-111.09295094637345


#### Zoom in on previous grid

In [47]:
prev_k = rscv.best_params_['n_neighbors']
gscv_step = 1
search_radii = step_1
print(prev_k, search_radii, '-', prev_k-search_radii, prev_k+search_radii+gscv_step, gscv_step)

cv_settings = RepeatedKFold(n_splits=5, n_repeats=3, random_state=12)  # alternative: KFold(n_splits=5, shuffle=True, random_state=12)

model = KNeighborsRegressor()
grid = {'n_neighbors':np.arange(prev_k-search_radii, prev_k+search_radii+gscv_step, gscv_step), 'weights': [dist_power_7, dist_power_8, dist_power_9, dist_power_10, dist_power_11]}

gscv = GridSearchCV(model, grid, cv=cv_settings, scoring='neg_root_mean_squared_error', verbose=2)
gscv.fit(X_train_final, y_train)


20 10 - 10 31 1
Fitting 15 folds for each of 105 candidates, totalling 1575 fits
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=10, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
...

[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=11, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
...

[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=12, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
...

[CV] END n_neighbors=13, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=13, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s

[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=14, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s

[CV] END n_neighbors=15, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=15, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s

[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=16, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s


[CV] END n_neighbors=17, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=17, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s


[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.2s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.2s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=18, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=19, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=19, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=19, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s

[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s
[CV] END n_neighbors=20, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s

[CV] END n_neighbors=21, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=21, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s

[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s
[CV] END n_neighbors=22, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.0s

[CV] END n_neighbors=23, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.1s
[CV] END n_neighbors=23, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.1s
[CV] END n_neighbors=23, weights=<function dist_power_8 at 0x000001E294661800>; total time=   0.1s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=23, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s

[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=24, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.1s

[CV] END n_neighbors=25, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_9 at 0x000001E294663D80>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.1s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.1s
[CV] END n_neighbors=25, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.1s

[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_10 at 0x000001E294663880>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=26, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s


[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=27, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s


[CV] END n_neighbors=28, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=28, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.0s
[CV] END n_neighbors=28, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=28, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=28, weights=<function dist_power_11 at 0x000001E294540360>; total time=   0.1s
[CV] END n_neighbors=29, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=29, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=29, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=29, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.2s
[CV] END n_neighbors=29, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.1s

[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s
[CV] END n_neighbors=30, weights=<function dist_power_7 at 0x000001E294663B00>; total time=   0.0s

log functions

In [36]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

In [38]:
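# 5-fold CV repeated 3 times, with a fixed seed so the folds are reproducible
cv_settings = RepeatedKFold(n_splits=5, n_repeats=3, random_state=12)

model = KNeighborsRegressor()
# Search space: k between 10 and 30, and the custom distance-weighting functions
grid = {
    'n_neighbors': Integer(10, 30),
    'weights': Categorical([dist_power_5, dist_power_6, dist_power_7, dist_power_8]),
}

# Bayesian search over 75 candidates, scored by negative RMSE
gscv_bayes = BayesSearchCV(model, grid, cv=cv_settings, n_iter=75, random_state=1,
                           scoring='neg_root_mean_squared_error', n_jobs=12, verbose=2)
gscv_bayes.fit(X_train_final, y_train)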

Fitting 15 folds for each of 1 candidates, totalling 15 fits

In [52]:
print(gscv_bayes.best_score_)
print(gscv_bayes.best_params_)

-112.56164983730932
OrderedDict([('n_neighbors', 25), ('weights', <function dist_power_7 at 0x0000014FCEAF1E40>)])
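
The two numbers in the output above can be read as follows. The log line "Fitting 15 folds for each of 1 candidates" comes from the `RepeatedKFold` settings (5 splits repeated 3 times), and `best_score_` is negative because the scoring is `neg_root_mean_squared_error`. A minimal sketch of both checks, where `cv_check`, `reported_score`, and `cv_rmse` are names introduced here only for illustration:

```python
from sklearn.model_selection import RepeatedKFold

# Same CV settings as in the search above
cv_check = RepeatedKFold(n_splits=5, n_repeats=3, random_state=12)

# Number of train/validation fits per candidate: n_splits * n_repeats
n_fits_per_candidate = cv_check.get_n_splits()
print(n_fits_per_candidate)

# best_score_ copied from the output above; negating it gives the cross-validated RMSE
reported_score = -112.56164983730932
cv_rmse = -reported_score
print(round(cv_rmse, 2))
```

Negating the score gives a cross-validated RMSE of about 112.56, which is the figure to compare against the Kaggle RMSE.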


In [55]:
cv_settings = RepeatedKFold(n_splits=5, n_repeats=3, random_state=12)

model = KNeighborsRegressor()
# Narrow k to 20-30 and compare dist_power_7 against its variants
grid = {
    'n_neighbors': Integer(20, 30),
    'weights': Categorical([dist_power_7, sev_dist_power_7, three_dist_power_7, five_dist_power_7,
                            sev_1_dist_power_7, three_1_dist_power_7, five_1_dist_power_7]),
}

gscv_bayes_weights = BayesSearchCV(model, grid, cv=cv_settings, n_iter=25, random_state=1,
                                   scoring='neg_root_mean_squared_error', n_jobs=12, verbose=2)
gscv_bayes_weights.fit(X_train_final, y_train)

Fitting 15 folds for each of 1 candidates, totalling 15 fits

In [56]:
print(gscv_bayes_weights.best_score_, gscv_bayes_weights.best_params_)

-112.5616498373093 OrderedDict([('n_neighbors', 25), ('weights', <function sev_dist_power_7 at 0x0000014FDA987560>)])
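
The best score here matches the earlier search (about -112.5616) even though a different weight function was selected. One possible explanation, assuming `sev_dist_power_7` is a constant multiple of `dist_power_7` (its definition is not shown in this part of the notebook): `KNeighborsRegressor` predicts a weighted average that is normalised by the sum of the weights, so multiplying every weight by the same constant leaves the predictions unchanged. A minimal sketch on synthetic data, where both weight functions below are hypothetical stand-ins rather than the functions used above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def dist_power_7_sketch(distances):
    # Hypothetical stand-in: inverse of distance to the 7th power
    return 1.0 / (distances ** 7 + 1e-10)

def scaled_dist_power_7_sketch(distances):
    # Same weights multiplied by a constant (assumed form of a "scaled" variant)
    return 7.0 * dist_power_7_sketch(distances)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
X_query = rng.normal(size=(20, 3))

pred_plain = (KNeighborsRegressor(n_neighbors=25, weights=dist_power_7_sketch)
              .fit(X_demo, y_demo).predict(X_query))
pred_scaled = (KNeighborsRegressor(n_neighbors=25, weights=scaled_dist_power_7_sketch)
               .fit(X_demo, y_demo).predict(X_query))

# The constant factor cancels in the weighted average
same_predictions = np.allclose(pred_plain, pred_scaled)
print(same_predictions)
```

If the variants are defined differently (for example, with an added offset rather than a multiplier), this argument does not apply and the identical score would need another explanation.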


**Paste the optimal hyperparameter values below.**

In [59]:
print(gscv_bayes_weights.best_params_)
print(gscv_bayes_weights.best_score_)

best_k = gscv_bayes_weights.best_params_['n_neighbors']
best_weight = gscv_bayes_weights.best_params_['weights']

OrderedDict([('n_neighbors', 25), ('weights', <function sev_dist_power_7 at 0x0000014FDA987560>)])
-112.5616498373093


[('n_neighbors', 25), ('weights', <function sev_dist_power_7 at 0x0000014FDA987560>)]

## 3) Model

Using the optimal model hyperparameters, train the model, and paste the code below.

In [81]:
best_model = KNeighborsRegressor(n_neighbors=best_k, weights=best_weight).fit(X_train_final, y_train)

## 4) Put any ad-hoc steps for further improving model accuracy
For example, scaling up or scaling down the predictions, capping predictions, etc.
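
A minimal sketch of what capping and scaling could look like, on made-up numbers. The bounds and the scaling factor are hypothetical, and this adjustment is not part of the pipeline in this notebook:

```python
import numpy as np

# Made-up predictions, for illustration only
demo_preds = np.array([-15.0, 80.0, 240.0, 5000.0])

# Hypothetical bounds: clip implausibly low or high predictions
lower_cap, upper_cap = 10.0, 1000.0
capped = np.clip(demo_preds, lower_cap, upper_cap)

# Hypothetical 2% upscaling of the capped predictions
scaled = capped * 1.02
print(capped)
```

Any such bounds or factors would need to be chosen with cross-validation on the training data, not tuned against the Kaggle leaderboard.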

## 5) Export the predictions in the format required to submit on Kaggle

#### Getting Test Predictions

In [57]:
# Predict with the model trained in step 3 (same hyperparameters as the best estimator from the search)
y_preds_test = best_model.predict(X_test_final)

In [58]:
predicted_values = pd.DataFrame(y_preds_test, columns=['predicted'])

# Add the listing id to the predictions and use it as the index
predicted_values = predicted_values.merge(test_filter_4['id'], left_index=True, right_index=True).set_index('id')

predicted_values.to_csv('KNN_reg_model_results.csv')