# Business Analytics II – Project II
### 2020120083 손영진

## Overview
This project involves building classification models to predict whether an Airbnb host is a "Superhost".
The process includes:
1. Data Preprocessing (Date features, Amenity embeddings, Handling missing values, Encoding, Balancing, Scaling)
2. Model Building & Hyperparameter Tuning (Logistic Regression, Decision Tree, Random Forest, MLP, KNN, Naive Bayes)
3. Final Prediction on Test Data using the best model

In [1]:
# Install requirements
!pip install -r requirements.txt

Collecting numpy==2.1.1 (from -r requirements.txt (line 2))
  Downloading numpy-2.1.1-cp312-cp312-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting matplotlib==3.10.1 (from -r requirements.txt (line 3))
  Downloading matplotlib-3.10.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (11 kB)
Collecting scikit-learn==1.6.1 (from -r requirements.txt (line 5))
  Downloading scikit_learn-1.6.1-cp312-cp312-macosx_12_0_arm64.whl.metadata (31 kB)
INFO: pip is looking at multiple versions of contourpy to determine which version is compatible with other requirements. This could take a while.
Collecting contourpy>=1.0.1 (from matplotlib==3.10.1->-r requirements.txt (line 3))
  Downloading contourpy-1.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.5 kB)
INFO: pip is looking at multiple versions of scipy to determine which version is compatible with other requirements. This could take a while.
Collecting scipy>=1.6.0 (from scikit-learn==1.6.1->-r requirements.txt (line 5))
  Downloading scipy-1.16.3-c

# 1. Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score, f1_score
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings('ignore')

# 2. Data Load

In [12]:
# Load datasets
train_path = 'train_data.csv'
test_path = 'test_f25.xlsx'

df_train = pd.read_csv(train_path)
df_test = pd.read_excel(test_path)

print(f"Train Shape: {df_train.shape}")
print(f"Test Shape: {df_test.shape}")
df_train.head()

Train Shape: (26304, 54)
Test Shape: (130, 55)


| Unnamed: 0 | estimated_revenue_l365d | last_review | calculated_host_listings_count_shared_rooms | review_scores_communication | review_scores_location | availability_60 | host_listings_count | host_identity_verified | host_response_rate | number_of_reviews_l30d | ... | reviews_per_month | maximum_minimum_nights | availability_30 | beds | maximum_maximum_nights | host_response_time | has_availability | host_is_superhost | bedrooms | review_scores_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13890.0 | 2024-10-19 | 0 | 4.94 | 4.94 | 52 | 8.0 | t | 80% | 0 | ... | 0.33 | 2 | 25 | 4.0 | 1125 | within a day | t | f | 3.0 | 4.85 |
| 1 | 0.0 |  | 0 |  |  | 0 | 7.0 | t |  | 0 | ... |  | 4 | 0 | 5.0 | 730 |  | t | f | 3.0 |  |
| 2 | 0.0 | 2022-06-16 | 0 | 4.71 | 4.86 | 58 | 7.0 | t | 100% | 0 | ... | 0.05 | 14 | 29 | 2.0 | 1125 | within a few hours | t | f | 1.0 | 5.0 |
| 3 | 0.0 | 2016-04-15 | 0 | 5.0 | 5.0 | 60 | 2.0 | t |  | 0 | ... | 0.01 | 14 | 30 | 1.0 | 30 |  | t |  | 1.0 | 4.0 |
| 4 |  |  | 0 |  |  | 0 | 1.0 | f |  | 0 | ... |  | 3 | 0 |  | 730 |  |  | f | 2.0 |  |


# 3. Data Preprocessing

In [13]:
# Target Variable
target_col = 'host_is_superhost'

# Convert target to binary (t/f -> 1/0)
df_train[target_col] = df_train[target_col].map({'t': 1, 'f': 0})

# Check for missing values in target and drop them
df_train = df_train.dropna(subset=[target_col])
print(f"Train Shape after dropping missing targets: {df_train.shape}")

# Separate Target and Features
y = df_train[target_col]
X = df_train.drop(columns=[target_col])

# Drop columns that are entirely empty (all NaNs)
X = X.dropna(axis=1, how='all')

# --- Date Processing ---
print("Processing Dates...")
date_cols = ['last_review', 'first_review']
ref_date = pd.Timestamp('2024-12-01')  # Fixed reference date so the feature is reproducible
for col in date_cols:
    if col in X.columns:
        # Convert to datetime, derive "days since" the reference date, drop the raw date
        X[col] = pd.to_datetime(X[col], errors='coerce')
        X[f'days_since_{col}'] = (ref_date - X[col]).dt.days
        X = X.drop(columns=[col])

        # Apply the same transformation to the test set only if the column exists there
        if col in df_test.columns:
            df_test[col] = pd.to_datetime(df_test[col], errors='coerce')
            df_test[f'days_since_{col}'] = (ref_date - df_test[col]).dt.days
            df_test = df_test.drop(columns=[col])

# --- Text Embedding (Amenities) ---
print("Embedding Amenities (this may take a while)...")
if 'amenities' in X.columns:
    # Load pre-trained model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Fill NaNs with empty string
    X['amenities'] = X['amenities'].fillna('')
    df_test['amenities'] = df_test['amenities'].fillna('')
    
    # Encode
    train_embeddings = model.encode(X['amenities'].tolist(), show_progress_bar=True)
    test_embeddings = model.encode(df_test['amenities'].tolist(), show_progress_bar=True)
    
    # Create DataFrame from embeddings
    embedding_cols = [f'amenity_emb_{i}' for i in range(train_embeddings.shape[1])]
    train_emb_df = pd.DataFrame(train_embeddings, columns=embedding_cols, index=X.index)
    test_emb_df = pd.DataFrame(test_embeddings, columns=embedding_cols, index=df_test.index)
    
    # Concatenate and drop original column
    X = pd.concat([X, train_emb_df], axis=1).drop(columns=['amenities'])
    df_test = pd.concat([df_test, test_emb_df], axis=1).drop(columns=['amenities'])

# Identify categorical and numerical columns (Refresh after new features)
cat_cols = X.select_dtypes(include=['object']).columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns

print(f"Categorical Columns: {len(cat_cols)}")
print(f"Numerical Columns: {len(num_cols)}")

Train Shape after dropping missing targets: (25386, 54)
Categorical Columns: 15
Numerical Columns: 38


In [14]:
# Handle Missing Values
# For numerical, fill with median
for col in num_cols:
    median_val = X[col].median()  # Calculate median on Train
    X[col] = X[col].fillna(median_val)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(median_val)  # Apply Train median to Test

# For categorical, fill with mode
for col in cat_cols:
    mode_val = X[col].mode()[0]  # Calculate mode on Train
    X[col] = X[col].fillna(mode_val)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(mode_val)  # Apply Train mode to Test

In [15]:
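# One-Hot Encoding
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)
# Encode only the categorical columns that actually exist in the test set
test_cat_cols = [col for col in cat_cols if col in df_test.columns]
df_test_encoded = pd.get_dummies(df_test, columns=test_cat_cols, drop_first=True)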

# Align columns (Ensure train and test have same features)
X_encoded, df_test_encoded = X_encoded.align(df_test_encoded, join='left', axis=1, fill_value=0)

print(f"Encoded Train Features Shape: {X_encoded.shape}")
print(f"Encoded Test Features Shape: {df_test_encoded.shape}")

Encoded Train Features Shape: (25386, 34086)
Encoded Test Features Shape: (130, 34086)
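
To make the alignment step concrete, here is a minimal sketch on two hypothetical toy frames (the column names and values are illustrative and not taken from the dataset). It shows what `align(..., join='left', axis=1, fill_value=0)` does to categories that appear in only one of the two sets.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
train = pd.DataFrame({
    "beds": [2, 1, 1],
    "room_type": ["Entire home", "Private room", "Shared room"],
})
test = pd.DataFrame({
    "beds": [1, 3, 2],
    "room_type": ["Private room", "Tent", "Entire home"],
})

train_enc = pd.get_dummies(train, columns=["room_type"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["room_type"], drop_first=True)

# join='left' keeps exactly the training columns:
#   - categories seen only in train ('Shared room') become all-zero columns in test
#   - categories seen only in test ('Tent') are dropped
train_aligned, test_aligned = train_enc.align(
    test_enc, join="left", axis=1, fill_value=0
)

print(list(test_aligned.columns))
```

One caveat worth checking on real data: with `drop_first=True`, each frame drops its own first category. The toy frames both drop `'Entire home'`, so the columns line up. If the test set were missing the training baseline category, a different column would be dropped and some test rows could silently be encoded as the baseline.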


# 4. Data Balancing

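We handle class imbalance (17,165 non-superhosts vs. 8,221 superhosts after dropping missing targets) by resampling the training data.
- **Option**: Set `SAMPLING_METHOD` to `'upsample'` or `'downsample'`.
- **Upsampling (Default)**: Samples the minority class with replacement until it matches the majority class. **Pros**: Keeps every original row. **Cons**: Larger dataset made of exact duplicate rows, which encourages overfitting and can inflate cross-validation scores when copies of the same row land in both the training and validation folds.
- **Downsampling**: Samples the majority class without replacement down to the minority class size. **Pros**: Faster training. **Cons**: Discards potentially valuable majority-class rows.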
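
The two options can be illustrated without any libraries. The sketch below is only an illustration: it uses row ids instead of real rows, takes the class counts from this notebook's output, and uses `random.choices` / `random.sample` as stand-ins for `resample(..., replace=True)` / `resample(..., replace=False)`.

```python
import random
from collections import Counter

random.seed(42)

# Class counts from the notebook output: 17165 non-superhosts, 8221 superhosts.
majority_ids = list(range(17165))
minority_ids = list(range(17165, 17165 + 8221))

# Upsampling: draw minority rows WITH replacement until both classes match.
upsampled_minority = random.choices(minority_ids, k=len(majority_ids))

# Downsampling: draw majority rows WITHOUT replacement down to the minority size.
downsampled_majority = random.sample(majority_ids, k=len(minority_ids))

up_counts = {"majority": len(majority_ids), "minority": len(upsampled_minority)}
down_counts = {"majority": len(downsampled_majority), "minority": len(minority_ids)}

# After upsampling, some original rows appear more than once (exact duplicates).
copies = Counter(upsampled_minority)
duplicated_rows = sum(1 for n in copies.values() if n > 1)

print(up_counts, down_counts)
print(f"distinct minority rows kept: {len(copies)}, rows with duplicates: {duplicated_rows}")
```

Because upsampling draws 17,165 times from only 8,221 distinct rows, duplicates are unavoidable. If the balanced set is split into folds afterwards, copies of one row can sit on both sides of a split.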

In [16]:
SAMPLING_METHOD = 'upsample'  # Options: 'upsample', 'downsample'

# Combine X and y for resampling
train_data = pd.concat([X_encoded, y], axis=1)

# Check class distribution
print("Original Class Distribution:")
print(y.value_counts())

# Separate majority and minority classes
df_majority = train_data[train_data[target_col] == 0]
df_minority = train_data[train_data[target_col] == 1]

if SAMPLING_METHOD == 'upsample':
    print("\nPerforming Upsampling...")
    df_minority_upsampled = resample(df_minority, 
                                     replace=True,     # sample with replacement
                                     n_samples=len(df_majority),    # to match majority class
                                     random_state=42)
    df_balanced = pd.concat([df_majority, df_minority_upsampled])
    
elif SAMPLING_METHOD == 'downsample':
    print("\nPerforming Downsampling...")
    df_majority_downsampled = resample(df_majority, 
                                       replace=False,    # sample without replacement
                                       n_samples=len(df_minority),    # to match minority class
                                       random_state=42)
    df_balanced = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts
print("\nBalanced Class Distribution:")
print(df_balanced[target_col].value_counts())

# Separate X and y again
X_balanced = df_balanced.drop(columns=[target_col])
y_balanced = df_balanced[target_col]

Original Class Distribution:
host_is_superhost
0.0    17165
1.0     8221
Name: count, dtype: int64

Performing Upsampling...

Balanced Class Distribution:
host_is_superhost
0.0    17165
1.0    17165
Name: count, dtype: int64


# 5. Feature Scaling

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_balanced)
X_test_scaled = scaler.transform(df_test_encoded)

X_scaled = pd.DataFrame(X_scaled, columns=X_balanced.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=df_test_encoded.columns)
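
As a per-column sketch of what the scaling cell does, the example below learns the mean and standard deviation from training values only and reuses them for the test values. The `train_beds` / `test_beds` numbers and the helper names are hypothetical. The population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

def fit_standardizer(train_values):
    """Learn (mean, std) from training values only, using the population std."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Constant columns have std == 0; leave them unscaled instead of dividing by zero.
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

train_beds = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical training column
test_beds = [3.0, 8.0]                  # hypothetical test column

mean, std = fit_standardizer(train_beds)
train_scaled = standardize(train_beds, mean, std)
test_scaled = standardize(test_beds, mean, std)  # reuses the TRAIN statistics

print(train_scaled)
print(test_scaled)
```

The test values are not re-centered on their own mean, so a test value far outside the training range (here `8.0`) simply maps to a large z-score.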

# 6. Model Building & Hyperparameter Tuning

### Hyperparameter Tuning Strategy
We use `GridSearchCV` to find the optimal hyperparameters for each model. Here is the rationale for the chosen parameter ranges:

- **MLP (Neural Network)**:
    - `hidden_layer_sizes`: Three architectures. `(50,)` is a single small hidden layer, `(100,)` widens it, and `(50, 50)` adds a second layer to capture more complex interactions.
    - `activation`: `relu` is the scikit-learn default and usually trains quickly; `tanh` is a bounded alternative that can work better on some datasets.
    - `alpha`: L2 regularization strength, used to limit overfitting. We test the small values `0.0001` (the scikit-learn default) and `0.001`.

- **Random Forest & Decision Tree**:
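    - `max_depth`: Controls tree complexity. `None` allows full growth (risk of overfitting), while `10` and `20` (plus `30` for the Decision Tree) cap the depth.
    - `min_samples_split`: Minimum number of samples a node needs before it can be split. Higher values (`5`, and `10` for the Decision Tree) keep the model from learning overly specific patterns (noise).
    - `n_estimators` (RF only): Number of trees (`50`, `100`, `200`). More trees generally improve performance, with diminishing returns and higher computation cost.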

- **KNN**:
    - `n_neighbors`: `3` captures local patterns but is sensitive to noise; `5` and `7` give progressively smoother decision boundaries.
    - `weights`: `distance` gives more weight to closer neighbors, `uniform` treats all equally.

- **Logistic Regression**:
    - `C`: Inverse of regularization strength. Smaller values (e.g., `0.1`) specify stronger regularization.
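
Since every grid is searched exhaustively with `cv=5` and `n_jobs=1`, it helps to know how many model fits each search implies. The sketch below counts them with the standard library, using the parameter grids as they appear in the model cells of this notebook. It excludes the single final refit that `GridSearchCV` performs on the full data by default.

```python
from itertools import product

CV_FOLDS = 5  # matches cv=5 in the GridSearchCV calls

# Parameter grids copied from the model cells in this notebook.
param_grids = {
    "Logistic Regression": {"C": [0.1, 1, 10]},
    "Decision Tree": {
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
    },
    "MLP (Neural Network)": {
        "hidden_layer_sizes": [(50,), (100,), (50, 50)],
        "activation": ["relu", "tanh"],
        "alpha": [0.0001, 0.001],
    },
    "KNN": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    "Naive Bayes": {},  # empty grid: one "combination" using the default settings
}

def count_combinations(grid):
    # product() over zero lists yields one empty tuple, matching an empty grid.
    return len(list(product(*grid.values())))

fit_counts = {
    name: count_combinations(grid) * CV_FOLDS for name, grid in param_grids.items()
}
total_fits = sum(fit_counts.values())

for name, fits in fit_counts.items():
    print(f"{name}: {fits} cross-validation fits")
print(f"Total: {total_fits}")
```

The Random Forest grid is the largest (18 combinations, 90 fits), so it is the natural place to trim values first if runtime becomes a problem.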

In [None]:
# Initialize results list and best_models dictionary
# Run this cell once before running individual model cells
results = []
best_models = {}


In [None]:
print("--- Processing Logistic Regression ---")
lr_params = {
    'C': [0.1, 1, 10]
}
clf = GridSearchCV(LogisticRegression(max_iter=100, random_state=42), lr_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['Logistic Regression'] = clf.best_estimator_
results.append({
    'Model': 'Logistic Regression',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
print("--- Processing Decision Tree ---")
dt_params = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
clf = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['Decision Tree'] = clf.best_estimator_
results.append({
    'Model': 'Decision Tree',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
print("--- Processing Random Forest ---")
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
clf = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['Random Forest'] = clf.best_estimator_
results.append({
    'Model': 'Random Forest',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
print("--- Processing MLP (Neural Network) ---")
mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001]
}
clf = GridSearchCV(MLPClassifier(max_iter=100, random_state=42), mlp_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['MLP (Neural Network)'] = clf.best_estimator_
results.append({
    'Model': 'MLP (Neural Network)',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
print("--- Processing KNN ---")
knn_params = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}
clf = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['KNN'] = clf.best_estimator_
results.append({
    'Model': 'KNN',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
print("--- Processing Naive Bayes ---")
nb_params = {}
clf = GridSearchCV(GaussianNB(), nb_params, cv=5, scoring='accuracy', n_jobs=1, verbose=3)
clf.fit(X_scaled, y_balanced)

# Additional metrics, computed on the training data itself (optimistic, not cross-validated)
y_pred = clf.best_estimator_.predict(X_scaled)
rec = recall_score(y_balanced, y_pred)
f1 = f1_score(y_balanced, y_pred)

best_models['Naive Bayes'] = clf.best_estimator_
results.append({
    'Model': 'Naive Bayes',
    'Best Score': clf.best_score_,
    'Best Params': clf.best_params_,
    'Recall': rec,
    'F1 Score': f1
})
print(f"  Best Score: {clf.best_score_:.4f}")
print(f"  Best Params: {clf.best_params_}")
print(f"  Recall: {rec:.4f}")
print(f"  F1 Score: {f1:.4f}")

In [None]:
# Display results dataframe
results_df = pd.DataFrame(results).sort_values(by='Best Score', ascending=False)
print("\n--- Model Selection Report ---")
print(results_df)

# Select the best performing model overall
if not results_df.empty:
    best_model_name = results_df.iloc[0]['Model']
    best_model_score = results_df.iloc[0]['Best Score']
    final_model = best_models[best_model_name]

    print(f"\nSelected Best Model: {best_model_name}")
    print(f"Reason: It achieved the highest cross-validation accuracy of {best_model_score:.4f} among all tested models.")
    
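    # Compare with Rule-Based (requires the Rule-Based Validation cell to have been run first)
    print("\n--- Comparison with Rule-Based Logic ---")
    if 'rule_based_preds' in globals():
        final_preds = final_model.predict(X_test_scaled)
        agreement = (final_preds == rule_based_preds).mean()
        print(f"Agreement between Best Model and Rule-Based Logic: {agreement:.2%}")
    else:
        print("rule_based_preds is not defined yet. Run the Rule-Based Validation cell, then re-run this cell.")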
else:
    print("No results found. Please run the model cells above.")

In [None]:
# --- Rule-Based Validation ---
print("\n--- Rule-Based Validation ---")

def check_superhost_criteria(row):
    # Criteria based on Airbnb Superhost requirements
    # 1. 10+ stays (or 3 stays + 100 nights) -> approximated by number_of_reviews_ltm >= 10
    # 2. Response rate >= 90%
    # 3. Rating >= 4.8
    # 4. Cancellation rate < 1% (Not available in dataset, assumed met)
    
    # Rating Check
    rating_ok = False
    if pd.notna(row.get('review_scores_rating')):
        rating_ok = row['review_scores_rating'] >= 4.8
    elif pd.notna(row.get('review_scores_value')):
        rating_ok = row['review_scores_value'] >= 4.8
        
    # Response Rate Check
    response_ok = False
    if pd.notna(row.get('host_response_rate')):
        # Convert '100%' string to 100 number
        try:
            rate = float(str(row['host_response_rate']).replace('%', ''))
            response_ok = rate >= 90
        except ValueError:
            pass  # Unparseable value: response_ok stays False
    else:
        # If missing, assume ok if other criteria met (lenient)
        response_ok = True
            
    # Stays Check
    stays_ok = False
    if pd.notna(row.get('number_of_reviews_ltm')):
        stays_ok = row['number_of_reviews_ltm'] >= 10
    elif pd.notna(row.get('number_of_reviews')):
        stays_ok = row['number_of_reviews'] >= 10
        
    return 1 if (rating_ok and response_ok and stays_ok) else 0

# Apply to Test Data
# The criteria need the original columns, which preprocessing modified or dropped,
# so we reload a fresh copy of the test file for this validation
df_test_raw = pd.read_excel(test_path)
rule_based_preds = df_test_raw.apply(check_superhost_criteria, axis=1)

print(f"Rule-Based Predictions (First 10): {rule_based_preds.head(10).tolist()}")


# 7. Final Prediction

In [None]:
# Predict on Test Data
final_predictions = final_model.predict(X_test_scaled)

# Create submission dataframe
submission = pd.DataFrame({
    'No': df_test['No'],
    'host_is_superhost_pred': final_predictions
})

# Map 1/0 back to t/f
submission['host_is_superhost_pred'] = submission['host_is_superhost_pred'].map({1: 't', 0: 'f'})

submission.head()

In [None]:
# Save to Excel
submission.to_excel('prediction_result.xlsx', index=False)
print("Prediction saved to prediction_result.xlsx")

# 8. Summary

**Model Building Process:**
1.  **Data Preprocessing**: 
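    -   Target variable `host_is_superhost` was converted to binary, and rows with a missing target were dropped.
    -   Date columns (`last_review`, `first_review`) were converted to the number of days before a fixed reference date (2024-12-01).
    -   The `amenities` text was embedded with a pre-trained SentenceTransformer model (`all-MiniLM-L6-v2`).
    -   Missing values were filled using training-set statistics (median for numerical, mode for categorical).
    -   Categorical variables were One-Hot Encoded.
    -   Train and Test features were aligned to ensure consistency.
    -   Data was balanced using upsampling to address class imbalance.
    -   Features were scaled using StandardScaler.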

2.  **Model Evaluation & Tuning**:
    -   We performed GridSearchCV for multiple models: Logistic Regression, Decision Tree, Random Forest, MLP, KNN, and Naive Bayes.
    -   Hyperparameters such as hidden layers for MLP, tree depth for Decision Tree/Random Forest, and neighbors for KNN were tuned.
    -   5-fold cross-validation accuracy was used to compare the models; recall and F1 were additionally computed on the training data.

3.  **Prediction**:
    -   The model with the highest cross-validation accuracy was selected.
    -   Predictions were generated for the test dataset and saved to `prediction_result.xlsx`.
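
One possible refinement of the evaluation summarized above is to hold out a validation set before resampling, so that duplicated rows cannot appear on both sides of the split. The sketch below is a hedged illustration on synthetic data from `make_classification`, not the Airbnb data or the notebook's actual procedure; the class weights only roughly mimic the observed imbalance, and Logistic Regression stands in for whichever model is selected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the real features (assumption, not the dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.68, 0.32], random_state=42
)

# 1) Hold out a validation set BEFORE any resampling.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Upsample the minority class (label 1) inside the training portion only.
minority = y_tr == 1
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=42,
)
X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

# 3) Fit the scaler on the balanced training data, then reuse it for validation.
scaler = StandardScaler().fit(X_bal)
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(scaler.transform(X_bal), y_bal)

# 4) Score on untouched rows that keep their original class proportions.
y_val_pred = model.predict(scaler.transform(X_val))
val_recall = recall_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred)
print(f"Validation recall: {val_recall:.4f}, F1: {val_f1:.4f}")
```

Metrics obtained this way are usually lower than training-set recall and F1, but they are a fairer estimate of how the model behaves on unseen listings.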