## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)
- **Submit both your ipynb and your html file for grading purposes.**

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [35]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
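
One way to keep the two datasets in step is to write each transformation once and apply it to both splits, while computing any imputation statistic from the training split only. The sketch below is illustrative: the `preprocess` helper and the toy frames are assumptions made up for this example, and only the `host_response_rate` column name comes from the dataset.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply identical cleaning steps to any split."""
    out = df.copy()
    # Strip the trailing '%' and convert the rate to float.
    out['host_response_rate'] = out['host_response_rate'].str.rstrip('%').astype(float)
    return out

# Toy stand-ins for the real train/test files (made-up values).
toy_train = pd.DataFrame({'host_response_rate': ['100%', '90%', None]})
toy_test = pd.DataFrame({'host_response_rate': ['75%', None]})

toy_train = preprocess(toy_train)
toy_test = preprocess(toy_test)

# The fill value is computed on the training split only,
# then the same value is applied to both splits.
fill_value = toy_train['host_response_rate'].mean()
toy_train['host_response_rate'] = toy_train['host_response_rate'].fillna(fill_value)
toy_test['host_response_rate'] = toy_test['host_response_rate'].fillna(fill_value)
```

Reusing the training mean for the test split keeps test-set information out of preprocessing.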

In [36]:
test = pd.read_csv('test_classification.csv')
train = pd.read_csv('train_classification.csv')

# Converting the date columns in train and test to datetime objects
date_cols = ['host_since', 'first_review', 'last_review']
train[date_cols] = train[date_cols].apply(pd.to_datetime)
test[date_cols] = test[date_cols].apply(pd.to_datetime)


# Stripping the trailing % from the rate columns and converting them to float
train['host_response_rate'] = train['host_response_rate'].str.rstrip('%').astype(float)
train['host_acceptance_rate'] = train['host_acceptance_rate'].str.rstrip('%').astype(float)

test['host_response_rate'] = test['host_response_rate'].str.rstrip('%').astype(float)
test['host_acceptance_rate'] = test['host_acceptance_rate'].str.rstrip('%').astype(float)

X_train = train.drop(columns=['host_is_superhost'])
y_train = train['host_is_superhost']

X_test = test

# Filling missing availability and review values with 0
zero_fill_cols = ['has_availability', 'review_scores_rating', 'review_scores_accuracy',
                  'review_scores_cleanliness', 'review_scores_checkin',
                  'review_scores_communication', 'review_scores_location',
                  'review_scores_value', 'reviews_per_month']
X_train[zero_fill_cols] = X_train[zero_fill_cols].fillna(0)
X_test[zero_fill_cols] = X_test[zero_fill_cols].fillna(0)

def extract_bathrooms(text):
    if pd.isnull(text):
        return np.nan
    # Treat any entry containing "Half" as half a bathroom
    if 'Half' in text:
        return 0.5
    # Otherwise parse the leading number; non-numeric text becomes NaN
    try:
        return float(text.split(' ')[0])
    except ValueError:
        return np.nan

X_train['bathrooms'] = X_train['bathrooms_text'].apply(extract_bathrooms)
X_test['bathrooms'] = X_test['bathrooms_text'].apply(extract_bathrooms)

def count_verifications(text):
    if pd.isnull(text):
        return 0
    # Remove square brackets and split by comma
    items = text.strip('[]').split(',')
    return len(items)

X_train['num_host_verifications'] = X_train['host_verifications'].apply(count_verifications)
X_test['num_host_verifications'] = X_test['host_verifications'].apply(count_verifications)


# Fixing NaN values in numerical columns
numerical_cols = X_train.select_dtypes(include=['float64', 'int64', 'bool']).columns
for col in numerical_cols:
    mean_value = X_train[col].mean()
    X_train[col] = X_train[col].fillna(mean_value)
    X_test[col] = X_test[col].fillna(mean_value)


# Fixing NaN values in categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
for col in categorical_cols:
    X_train[col] = X_train[col].fillna("missing")
    X_test[col] = X_test[col].fillna("missing")


# Fixing NaN values in datetime columns
datetime_cols = X_train.select_dtypes(include=['datetime64[ns]']).columns
for col in datetime_cols:
    earliest_date = X_train[col].min()
    X_train[col] = X_train[col].fillna(earliest_date)
    X_test[col] = X_test[col].fillna(earliest_date)

X_train['bedrooms_per_accommodates'] = X_train['bedrooms'] / (X_train['accommodates'] + 1)
X_test['bedrooms_per_accommodates'] = X_test['bedrooms'] / (X_test['accommodates'] + 1)

X_train['beds_per_accommodates'] = X_train['beds'] / (X_train['accommodates'] + 1)
X_test['beds_per_accommodates'] = X_test['beds'] / (X_test['accommodates'] + 1)

X_train['bathrooms_per_accommodates'] = X_train['bathrooms'] / (X_train['accommodates'] + 1)
X_test['bathrooms_per_accommodates'] = X_test['bathrooms'] / (X_test['accommodates'] + 1)

cat_cols = ['listing_location', 'host_response_time', 'host_has_profile_pic',
            'host_identity_verified', 'room_type', 'property_type',
            'instant_bookable']

# Selecting the same categorical and numerical columns from both datasets
X_train_cat_attempt1 = X_train[cat_cols]
X_test_cat_attempt1 = X_test[cat_cols]

X_train_num_attempt1 = X_train.select_dtypes(include=['float64', 'int64', 'bool'])
X_test_num_attempt1 = X_test.select_dtypes(include=['float64', 'int64', 'bool'])

X_train_cat_attempt1_ohe = pd.get_dummies(X_train_cat_attempt1, drop_first=True)
X_test_cat_attempt1_ohe = pd.get_dummies(X_test_cat_attempt1, drop_first=True)

for col in X_train_cat_attempt1_ohe.columns:
    mean_value = X_train_cat_attempt1_ohe[col].mean()
    X_train_cat_attempt1_ohe[col] = X_train_cat_attempt1_ohe[col].fillna(mean_value)

for col in X_test_cat_attempt1_ohe.columns:
    mean_value = X_test_cat_attempt1_ohe[col].mean()
    X_test_cat_attempt1_ohe[col] = X_test_cat_attempt1_ohe[col].fillna(mean_value)

# Align columns to ensure consistency between train and test sets
X_train_cat_attempt1_ohe, X_test_cat_attempt1_ohe = X_train_cat_attempt1_ohe.align(
    X_test_cat_attempt1_ohe, join='left', axis=1, fill_value=0
)

# Combining numerical and one-hot encoded categorical features
X_train_complete = pd.concat([X_train_num_attempt1, X_train_cat_attempt1_ohe], axis=1)
X_test_complete = pd.concat([X_test_num_attempt1, X_test_cat_attempt1_ohe], axis=1)


## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [40]:
model = GradientBoostingClassifier(learning_rate=0.5, max_depth=12, n_estimators=125, subsample=1.0, random_state=12)
model.fit(X_train_complete, y_train)
y_pred_test = model.predict_proba(X_test_complete)[:, 1]
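
The `[:, 1]` slice relies on the positive class being the second entry of `model.classes_`, since `predict_proba` orders its columns the same way. A minimal sketch of looking the column up explicitly is given here, assuming a boolean target; the toy data and the names `X_toy`, `y_toy`, and `toy_model` are made up for illustration and are not part of the submission.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data (made-up): one feature, label is True when the feature is positive.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 1))
y_toy = X_toy[:, 0] > 0

toy_model = GradientBoostingClassifier(n_estimators=10, random_state=0)
toy_model.fit(X_toy, y_toy)

# Columns of predict_proba follow the order of classes_,
# so the positive-class column can be found by position in classes_.
proba = toy_model.predict_proba(X_toy)
positive_col = list(toy_model.classes_).index(True)
p_positive = proba[:, positive_col]
```

For a sorted boolean or 0/1 target the positive class lands in column 1, which matches the slice used above.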

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [41]:
submission = pd.DataFrame({'id': test['id'], 'predicted': y_pred_test})
submission.to_csv('classification_submission.csv', index=False)
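
Before uploading, a quick sanity check on the exported file can catch a wrong header or an out-of-range probability. The sketch below uses only the standard library and assumes the two-column `id`/`predicted` layout written above; `validate_submission` is a hypothetical helper, and the toy file holds made-up values.

```python
import csv
import os
import tempfile

def validate_submission(path, expected_columns=('id', 'predicted')):
    """Hypothetical check: verify the header and probability range of a submission CSV."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        if tuple(reader.fieldnames or ()) != tuple(expected_columns):
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        n_rows = 0
        for row in reader:
            p = float(row['predicted'])
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability out of range for id {row['id']}: {p}")
            n_rows += 1
    return n_rows

# Toy file with made-up ids and probabilities, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy_submission.csv')
    with open(toy_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'predicted'])
        writer.writerows([[1, 0.12], [2, 0.87]])
    n_checked = validate_submission(toy_path)
```

The same function could be pointed at `classification_submission.csv` once it has been written.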