## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [4]:
import numpy as np
import pandas as pd
import ast
import math
# sklearn tools
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
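
One way to guarantee that both datasets undergo the same sequence of operations is to wrap the steps in a single function and call it on each frame. The sketch below is only an illustration under that assumption: `preprocess` and the toy frames are hypothetical, and it covers just two of the cleaning steps used in this section.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply identical cleaning steps to any frame."""
    out = df.copy()
    # Strip the percent sign and convert the rate to a float.
    out['host_response_rate'] = (
        out['host_response_rate'].str.replace('%', '', regex=False).astype(float)
    )
    # Flag listings whose property type mentions an entire place.
    out['is_entire_place'] = (
        out['property_type'].str.contains('entire', case=False, na=False).astype(int)
    )
    return out.drop(columns='property_type')

# Toy data (made up for illustration only).
toy_train = pd.DataFrame({'host_response_rate': ['90%', None],
                          'property_type': ['Entire home', 'Private room']})
toy_test = pd.DataFrame({'host_response_rate': ['75%'],
                         'property_type': ['Entire condo']})

clean_train, clean_test = preprocess(toy_train), preprocess(toy_test)
```

Calling the same function on both frames means a step cannot be applied to one dataset and forgotten on the other.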

In [7]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# initial cols to drop

cols_to_drop = ['has_availability', 'host_neighbourhood', 'neighbourhood_cleansed', 'first_review',
               'last_review', 'room_type', 'host_response_time', 'host_location', 'host_about']

train = train.drop(columns = cols_to_drop)
test = test.drop(columns=cols_to_drop)


# clean numeric vals with symbols

def clean_cols(val):
    if isinstance(val, str):
        val = val.replace('$', '').replace('%', '').replace(',', '')
    return float(val)

train[['host_acceptance_rate','host_response_rate']] = train[['host_acceptance_rate','host_response_rate']].map(clean_cols)
test[['host_acceptance_rate','host_response_rate']] = test[['host_acceptance_rate','host_response_rate']].map(clean_cols)


# clean bathrooms_text - convert half-bath to 0.5
train['bathrooms_text'] = train['bathrooms_text'].replace('Half-bath', '0.5')
test['bathrooms_text'] = test['bathrooms_text'].replace('Half-bath', '0.5')

# extract numeric part
train['bathrooms_text'] = train['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)
test['bathrooms_text'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)')[0].astype(float)

test = test.rename(columns={'bathrooms_text': 'bathrooms'})
train = train.rename(columns={'bathrooms_text': 'bathrooms'})


# host identity verified - convert to 0/1
# instant bookable - convert to 0/1
# host profile pic - convert to 0/1

train['instant_bookable'] = train['instant_bookable'].fillna(False).astype(int)
test['instant_bookable'] = test['instant_bookable'].fillna(False).astype(int)

train['host_identity_verified'] = train['host_identity_verified'].fillna(False).astype(int)
test['host_identity_verified'] = test['host_identity_verified'].fillna(False).astype(int)

train['host_has_profile_pic'] = train['host_has_profile_pic'].fillna(False).astype(int)
test['host_has_profile_pic'] = test['host_has_profile_pic'].fillna(False).astype(int)





# property type -- make new col, 1 if entire property and 0 otherwise

train['is_entire_place'] = train['property_type'].str.contains('entire', case=False, na=False).astype(int)
test['is_entire_place'] = test['property_type'].str.contains('entire', case=False, na=False).astype(int)

train = train.drop('property_type', axis=1)
test = test.drop('property_type', axis=1)




# description -- make new col, 1 if contains 'luxury' words

keywords = ['luxury', 'luxurious', 'penthouse', 'exclusive', 'elegant',
            'premium', 'high-end', 'designer', 'upscale', 'chic',
            'modern', 'deluxe', 'sophisticated', 'breathtaking','custom-built', 'architect-designed',
            'state-of-the-art','prestigious', 'top-tier', '5-star', 'five-star']

pattern = '|'.join(keywords)
train['luxury_description'] = train['description'].str.contains(pattern, case=False, na=False).astype(int)
test['luxury_description'] = test['description'].str.contains(pattern, case=False, na=False).astype(int)

train = train.drop('description', axis=1)
test = test.drop('description', axis=1)


# OHE location

location_dummies = pd.get_dummies(train['listing_location'], prefix='location', drop_first=True)
train = pd.concat([train, location_dummies], axis=1).drop('listing_location', axis=1)

location_dummies = pd.get_dummies(test['listing_location'], prefix='location', drop_first=True)
test = pd.concat([test, location_dummies], axis=1).drop('listing_location', axis=1)

train[['location_chicago', 'location_kauai']] = train[['location_chicago', 'location_kauai']].astype(int)
test[['location_chicago', 'location_kauai']] = test[['location_chicago', 'location_kauai']].astype(int)



# host_verifications, amenities -- convert each list to its length

def make_len(val):
    if isinstance(val, float) and math.isnan(val):
        return np.nan
    if isinstance(val, str):
        try:
            val = ast.literal_eval(val)
        except (ValueError, SyntaxError):
            return np.nan
    return len(val)
    

train['host_verifications'] = train['host_verifications'].map(make_len)
train['amenities'] = train['amenities'].map(make_len)

test['host_verifications'] = test['host_verifications'].map(make_len)
test['amenities'] = test['amenities'].map(make_len)



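
Encoding `listing_location` separately on each dataset assumes both contain the same set of locations. If that might not hold, one possible safeguard is to reindex the test dummies onto the training columns. This is a sketch on toy data, not the approach used above; the location name `asheville` is made up for the example.

```python
import pandas as pd

# Toy data: the test set is missing one location level.
toy_train = pd.DataFrame({'listing_location': ['asheville', 'chicago', 'kauai']})
toy_test = pd.DataFrame({'listing_location': ['chicago', 'kauai']})

train_dummies = pd.get_dummies(toy_train['listing_location'], prefix='location',
                               drop_first=True, dtype=int)
# Encode every test level, then keep exactly the training columns;
# levels absent from the test set are filled with 0.
test_dummies = pd.get_dummies(toy_test['listing_location'], prefix='location', dtype=int)
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

With `drop_first=True` applied to the toy test set on its own, `location_chicago` would have been dropped instead, leaving the two frames with different columns.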


In [8]:
# host tenure in whole years; note that "today" makes this value depend on the run date
today = pd.to_datetime("today")
train['host_since'] = pd.to_datetime(train['host_since'], errors='coerce')
train['host_since'] = (today - train['host_since']).dt.days // 365

test['host_since'] = pd.to_datetime(test['host_since'], errors='coerce')
test['host_since'] = (today - test['host_since']).dt.days // 365
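
Because the tenure is measured from the day the notebook runs, the feature can shift by a year between runs. A fixed reference date is one way to keep it reproducible. In the sketch below, `REFERENCE_DATE`, its value, and the `host_years` column are hypothetical choices, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical fixed reference date; any date works as long as it never changes.
REFERENCE_DATE = pd.Timestamp('2025-01-01')

toy = pd.DataFrame({'host_since': ['2015-06-30', 'not a date', None]})
toy['host_since'] = pd.to_datetime(toy['host_since'], errors='coerce')
# Whole years of tenure as of the reference date (NaN where the date is missing).
toy['host_years'] = (REFERENCE_DATE - toy['host_since']).dt.days // 365
```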

In [9]:
# IMPUTE MISSING VALS

# fit to train data, transform train and test data

imputer = KNNImputer(n_neighbors=3)

train_impute_cols = [col for col in train.columns if col not in ['id', 'host_is_superhost']]

# scale to [0, 1] so that no feature dominates the KNN distances
scaler = MinMaxScaler()
scaled_train_array = scaler.fit_transform(train[train_impute_cols])
scaled_train_df = pd.DataFrame(scaled_train_array, columns=train_impute_cols, index=train.index)

train_imputed_arr = imputer.fit_transform(scaled_train_df)  # impute on the scaled data
train_unscaled_data = scaler.inverse_transform(train_imputed_arr)  # then undo the scaling

train_imputed = train.copy()
train_imputed[train_impute_cols] = pd.DataFrame(train_unscaled_data, columns=train_impute_cols, index=train.index)


# test - use same scale
test_impute_cols = [col for col in test.columns if col != 'id']


scaled_test_array = scaler.transform(test[test_impute_cols])
scaled_test_df = pd.DataFrame(scaled_test_array, columns=test_impute_cols, index=test.index)

test_imputed_arr = imputer.transform(scaled_test_df)
test_unscaled_data = scaler.inverse_transform(test_imputed_arr)

test_imputed = test.copy()
test_imputed[test_impute_cols] = pd.DataFrame(test_unscaled_data, columns = test_impute_cols, index=test.index)
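
The scale, impute, and unscale sequence above can be checked on a small example. The sketch below uses made-up columns (`beds`, `price`) and a smaller `n_neighbors`; both transformers are fitted on the toy training frame only and then reused on the toy test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy frames with hypothetical columns; one missing value in each.
toy_train = pd.DataFrame({'beds': [1.0, 2.0, np.nan, 4.0],
                          'price': [100.0, 200.0, 300.0, 400.0]})
toy_test = pd.DataFrame({'beds': [np.nan], 'price': [250.0]})

toy_scaler = MinMaxScaler()
toy_imputer = KNNImputer(n_neighbors=2)

# Fit both transformers on the training data, impute in scaled space, then unscale.
train_scaled = toy_scaler.fit_transform(toy_train)
train_filled = toy_scaler.inverse_transform(toy_imputer.fit_transform(train_scaled))

# Reuse the fitted transformers on the test data.
test_scaled = toy_scaler.transform(toy_test)
test_filled = toy_scaler.inverse_transform(toy_imputer.transform(test_scaled))

toy_train_imputed = pd.DataFrame(train_filled, columns=toy_train.columns, index=toy_train.index)
toy_test_imputed = pd.DataFrame(test_filled, columns=toy_test.columns, index=toy_test.index)
```

Observed values come back unchanged after the inverse transform; only the missing entries are filled.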

In [10]:
X = train_imputed.drop(columns=['id', 'host_is_superhost'])

# Adding constant: necessary for the calculations
X_const = add_constant(X)

vif_data = pd.DataFrame() # empty df
vif_data['variable'] = X_const.columns # Put the names in the first col

for i in range(len(X_const.columns)):
    vif_data.loc[i,'VIF'] = variance_inflation_factor(X_const.values, i)

vif_data


# minimum_nights and maximum_minimum_nights have VIF scores above 7, so drop them from both datasets

train_imputed = train_imputed.drop(columns=['minimum_nights','maximum_minimum_nights'])
test_imputed = test_imputed.drop(columns=['minimum_nights','maximum_minimum_nights'])
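
For intuition on what the VIF measures: the score for a column is 1 / (1 - R²), where R² comes from regressing that column on all the others plus an intercept. The sketch below recomputes this with NumPy on synthetic data; `vif_scores` is a hypothetical helper for illustration, not the statsmodels function used above.

```python
import numpy as np

def vif_scores(X: np.ndarray) -> np.ndarray:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for i in range(n_cols):
        target = X[:, i]
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores[i] = 1.0 / (1.0 - r_squared)
    return scores

# Synthetic data: c is nearly a copy of a, while b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
toy_vif = vif_scores(np.column_stack([a, b, c]))
```

In this toy setup the two nearly duplicate columns get large scores while the independent one stays close to 1, which is the pattern that motivates dropping highly collinear predictors.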

In [11]:
# separate predictors from the target; exclude id from the features but keep the test ids for the submission

X_train = train_imputed.drop(columns=['host_is_superhost','id'])
y_train = train_imputed['host_is_superhost']

X_test = test_imputed.drop(columns=['id'])
test_ids = test_imputed['id']

## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [14]:
model = DecisionTreeClassifier(random_state=12, criterion='entropy', max_depth=13,
                               max_leaf_nodes=94, min_samples_leaf=5, min_samples_split=5)

model.fit(X_train, y_train)
y_preds = model.predict_proba(X_test)[:, 1]



## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [17]:
submission = pd.DataFrame({
    'id': test_ids,
    'predicted': y_preds
})

submission.to_csv('FINAL_CLASS_SUB.csv', index=False)