# Business Analytics II – Project II
### 2020120083 손영진

## Overview
This project builds and compares classification models that predict whether an Airbnb host is a "Superhost".
The process includes:
1. Data Preprocessing (Handling missing values, Encoding, Balancing, Scaling)
2. Model Building & Hyperparameter Tuning (Logistic Regression, Decision Tree, Random Forest, MLP, KNN, Naive Bayes)
3. Final Prediction on Test Data using the best model

In [None]:
# Install requirements
!pip install -r requirements.txt

# 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# 2. Data Load

In [None]:
# Load datasets
train_path = 'train_data.csv'
test_path = 'test_f25.xlsx'

df_train = pd.read_csv(train_path)
df_test = pd.read_excel(test_path)

print(f"Train Shape: {df_train.shape}")
print(f"Test Shape: {df_test.shape}")
df_train.head()

# 3. Data Preprocessing

In [None]:
# Target Variable
target_col = 'host_is_superhost'

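# Convert target to binary (t/f -> 1/0); missing or unexpected values become NaN
df_train[target_col] = df_train[target_col].map({'t': 1, 'f': 0})

# Drop rows with a missing target, then store the label as an integer
df_train = df_train.dropna(subset=[target_col])
df_train[target_col] = df_train[target_col].astype(int)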
print(f"Train Shape after dropping missing targets: {df_train.shape}")

# Separate Target and Features
y = df_train[target_col]
X = df_train.drop(columns=[target_col])

# Identify categorical and numerical columns
cat_cols = X.select_dtypes(include=['object']).columns
num_cols = X.select_dtypes(include=['number']).columns  # all numeric dtypes, not only int64/float64

print(f"Categorical Columns: {len(cat_cols)}")
print(f"Numerical Columns: {len(num_cols)}")

In [None]:
# Handle Missing Values
# For numerical, fill with median
for col in num_cols:
    median_val = X[col].median()  # Calculate median on Train
    X[col] = X[col].fillna(median_val)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(median_val)  # Apply Train median to Test

# For categorical, fill with mode
for col in cat_cols:
    mode_val = X[col].mode()[0]  # Calculate mode on Train
    X[col] = X[col].fillna(mode_val)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(mode_val)  # Apply Train mode to Test

In [None]:
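# One-Hot Encoding
# Train drops the first level of each categorical column as the reference level
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)

# Test keeps every level: dropping its own "first" level could remove a
# different category than train did. The alignment below discards the extras.
test_cat_cols = [col for col in cat_cols if col in df_test.columns]
df_test_encoded = pd.get_dummies(df_test, columns=test_cat_cols)

# Align columns: keep exactly the train features; levels absent from test are filled with 0
X_encoded, df_test_encoded = X_encoded.align(df_test_encoded, join='left', axis=1, fill_value=0)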

print(f"Encoded Train Features Shape: {X_encoded.shape}")
print(f"Encoded Test Features Shape: {df_test_encoded.shape}")
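
To illustrate what the alignment step does when the test set contains a category that never appears in training, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and the `room_type` values are hypothetical and are not taken from the project files.

```python
import pandas as pd

# Hypothetical toy data: 'Hotel' appears only in the test frame.
toy_train = pd.DataFrame({"price": [100, 80, 120],
                          "room_type": ["Entire", "Private", "Shared"]})
toy_test = pd.DataFrame({"price": [90, 60],
                         "room_type": ["Private", "Hotel"]})

# Train drops its first level ('Entire') as the reference level.
train_enc = pd.get_dummies(toy_train, columns=["room_type"], drop_first=True)
# Test keeps all levels; the alignment decides which columns survive.
test_enc = pd.get_dummies(toy_test, columns=["room_type"])

# join='left' keeps the train columns: 'room_type_Hotel' is discarded and
# 'room_type_Shared' (missing in test) is created and filled with 0.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

One consequence worth noting: the unseen `Hotel` row ends up with zeros in every dummy column, so the model treats it like the reference level.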

# 4. Data Balancing

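We handle class imbalance by resampling the training data.
- **Option**: Set `SAMPLING_METHOD` to `'upsample'` or `'downsample'`.
- **Upsampling (Default)**: Replicates minority class samples with replacement until both classes are the same size. **Pros**: Retains every original row. **Cons**: Increases dataset size, and the duplicated rows can cause overfitting. Because resampling happens before cross-validation, copies of one row can also fall in both the training and validation folds, so CV scores may be optimistic.
- **Downsampling**: Randomly removes majority class samples until both classes are the same size. **Pros**: Faster training. **Cons**: Discards potentially valuable information.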

In [None]:
SAMPLING_METHOD = 'upsample'  # Options: 'upsample', 'downsample'

# Combine X and y for resampling
train_data = pd.concat([X_encoded, y], axis=1)

# Check class distribution
print("Original Class Distribution:")
print(y.value_counts())

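# Separate majority and minority classes (derived from the counts rather than assumed)
class_counts = train_data[target_col].value_counts()  # sorted from most to least frequent
majority_label, minority_label = class_counts.index[0], class_counts.index[-1]
df_majority = train_data[train_data[target_col] == majority_label]
df_minority = train_data[train_data[target_col] == minority_label]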

if SAMPLING_METHOD == 'upsample':
    print("\nPerforming Upsampling...")
    df_minority_upsampled = resample(df_minority, 
                                     replace=True,     # sample with replacement
                                     n_samples=len(df_majority),    # to match majority class
                                     random_state=42)
    df_balanced = pd.concat([df_majority, df_minority_upsampled])
    
elif SAMPLING_METHOD == 'downsample':
    print("\nPerforming Downsampling...")
    df_majority_downsampled = resample(df_majority, 
                                       replace=False,    # sample without replacement
                                       n_samples=len(df_minority),    # to match minority class
                                       random_state=42)
    df_balanced = pd.concat([df_majority_downsampled, df_minority])

else:
    raise ValueError(f"Unknown SAMPLING_METHOD: {SAMPLING_METHOD!r}")

# Display new class counts
print("\nBalanced Class Distribution:")
print(df_balanced[target_col].value_counts())

# Separate X and y again
X_balanced = df_balanced.drop(columns=[target_col])
y_balanced = df_balanced[target_col]
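
As a point of comparison, the sketch below upsamples only the training part of each cross-validation fold, so validation rows are never duplicates of training rows. It is not the approach used in this notebook. The synthetic data from `make_classification`, the helper `upsample_minority`, and the use of balanced accuracy (chosen because the validation folds stay imbalanced) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (hypothetical stand-in for the encoded features).
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.8, 0.2], random_state=42)

def upsample_minority(X_part, y_part, random_state=42):
    """Hypothetical helper: replicate minority rows until classes are equal in size."""
    classes, counts = np.unique(y_part, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y_part == minority)
    maj_idx = np.flatnonzero(y_part != minority)
    # Draw minority row indices with replacement to match the majority count.
    drawn = resample(min_idx, replace=True, n_samples=len(maj_idx),
                     random_state=random_state)
    idx = np.concatenate([maj_idx, drawn])
    return X_part[idx], y_part[idx]

fold_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Balance the training fold only; the validation fold is left untouched.
    X_tr, y_tr = upsample_minority(X_toy[train_idx], y_toy[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    preds = model.predict(X_toy[val_idx])
    fold_scores.append(balanced_accuracy_score(y_toy[val_idx], preds))

print(f"Mean balanced accuracy: {np.mean(fold_scores):.4f}")
```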

# 5. Feature Scaling

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_balanced)
X_test_scaled = scaler.transform(df_test_encoded)

X_scaled = pd.DataFrame(X_scaled, columns=X_balanced.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=df_test_encoded.columns)
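
An alternative way to organize scaling is to place the scaler inside a scikit-learn `Pipeline`, so that `GridSearchCV` refits it on the training part of every fold. The sketch below shows the pattern on synthetic data. The step names `scaler` and `knn` and the toy arrays are assumptions for illustration, not part of the project code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical stand-in for the balanced training set).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)

# The scaler is a pipeline step, so it is fit only on each fold's training rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a step are addressed as '<step name>__<parameter>'.
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)
print(search.best_params_)
```

The fitted `search.best_estimator_` then applies the same scaling automatically when `predict` is called on new rows.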

# 6. Model Building & Hyperparameter Tuning

### Hyperparameter Tuning Strategy
We use `GridSearchCV` to find the best hyperparameter combination within a predefined grid for each model. Here is the rationale for the chosen parameter ranges:

- **MLP (Neural Network)**:
    - `hidden_layer_sizes`: We test different architectures. `(50,)` is a simple model, `(100,)` adds width, and `(50, 50)` adds depth to capture more complex patterns.
    - `activation`: `relu` is the `MLPClassifier` default; `tanh` is tested as an alternative because it can perform better on some datasets.
    - `alpha`: L2 regularization strength; larger values penalize large weights more strongly and reduce overfitting. We test `0.0001` (the default) and `0.001`.

- **Random Forest & Decision Tree**:
    - `max_depth`: Controls tree complexity. `None` allows full growth (risk of overfitting), while `10` and `20` (plus `30` for the Decision Tree) cap it.
    - `min_samples_split`: Higher values (e.g., `10`) prevent the model from learning overly specific patterns (noise).
    - `n_estimators` (RF only): Number of trees. More trees generally improve performance but increase computation.

- **KNN**:
    - `n_neighbors`: `3` captures local patterns (sensitive to noise), `5` is the scikit-learn default, and `7` gives a smoother decision boundary.
    - `weights`: `distance` gives more weight to closer neighbors, `uniform` treats all equally.

- **Logistic Regression**:
    - `C`: Inverse of regularization strength. Smaller values (e.g., `0.1`) specify stronger regularization.
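
To get a feel for the cost of this search, the sketch below counts the candidate combinations per model with `ParameterGrid` and multiplies by the number of folds. The grids are copied from the model definition cell of this notebook; the variable names `grids`, `n_folds`, `candidates`, and `total_fits` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the model definition cell.
grids = {
    "Logistic Regression": {"C": [0.1, 1, 10]},
    "Decision Tree": {"max_depth": [None, 10, 20, 30],
                      "min_samples_split": [2, 5, 10]},
    "Random Forest": {"n_estimators": [50, 100, 200],
                      "max_depth": [None, 10, 20],
                      "min_samples_split": [2, 5]},
    "MLP (Neural Network)": {"hidden_layer_sizes": [(50,), (100,), (50, 50)],
                             "activation": ["relu", "tanh"],
                             "alpha": [0.0001, 0.001]},
    "KNN": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    "Naive Bayes": {},  # empty grid: one candidate with default settings
}

n_folds = 5
# Number of hyperparameter combinations tried for each model.
candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
# Cross-validation fits only; the final refit of each best model is not counted.
total_fits = sum(candidates.values()) * n_folds

for name, n in candidates.items():
    print(f"{name}: {n} candidates, {n * n_folds} fits")
print(f"Total cross-validation fits: {total_fits}")
```

Under these grids the search covers 52 candidates, or 260 cross-validation fits, with the Random Forest and MLP grids accounting for most of the runtime.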

In [None]:
# Define models and their hyperparameter grids
model_params = {
    "Logistic Regression": {
        "model": LogisticRegression(max_iter=1000, random_state=42),
        "params": {
            'C': [0.1, 1, 10]
        }
    },
    "Decision Tree": {
        "model": DecisionTreeClassifier(random_state=42),
        "params": {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    "Random Forest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5]
        }
    },
    "MLP (Neural Network)": {
        "model": MLPClassifier(max_iter=1000, random_state=42),
        "params": {
            'hidden_layer_sizes': [(50,), (100,), (50, 50)],
            'activation': ['relu', 'tanh'],
            'alpha': [0.0001, 0.001]
        }
    },
    "KNN": {
        "model": KNeighborsClassifier(),
        "params": {
            'n_neighbors': [3, 5, 7],
            'weights': ['uniform', 'distance']
        }
    },
    "Naive Bayes": {
        "model": GaussianNB(),
        "params": {}  # Empty grid: GaussianNB is evaluated once with its default settings
    }
}

best_models = {}
results = []

print("--- Hyperparameter Tuning & Evaluation ---")

for name, mp in model_params.items():
    print(f"Processing {name}...")
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, scoring='accuracy', n_jobs=-1)
    clf.fit(X_scaled, y_balanced)
    
    best_models[name] = clf.best_estimator_
    results.append({
        'Model': name,
        'Best Score': clf.best_score_,
        'Best Params': clf.best_params_
    })
    print(f"  Best Score: {clf.best_score_:.4f}")
    print(f"  Best Params: {clf.best_params_}")

# Display results dataframe
results_df = pd.DataFrame(results).sort_values(by='Best Score', ascending=False)
print("\n--- Model Selection Report ---")
print(results_df)

# Select the best performing model overall
best_model_name = results_df.iloc[0]['Model']
best_model_score = results_df.iloc[0]['Best Score']
final_model = best_models[best_model_name]

print(f"\nSelected Best Model: {best_model_name}")
print(f"Reason: It achieved the highest cross-validation accuracy of {best_model_score:.4f} among all tested models.")
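
Accuracy alone hides how a model treats each class. As a complementary check, the sketch below holds out a stratified validation split and prints a confusion matrix and per-class report, using the `confusion_matrix` and `classification_report` functions already imported above. The synthetic data, the split size, and the Random Forest settings are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the Airbnb features).
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.75, 0.25], random_state=42)

# Stratified hold-out split keeps the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
val_pred = model.predict(X_val)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, val_pred)
print(cm)
# Precision, recall, and F1 for each class.
print(classification_report(y_val, val_pred, digits=3))
```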

# 7. Final Prediction

In [None]:
# Predict on Test Data
final_predictions = final_model.predict(X_test_scaled)

# Create submission dataframe
submission = pd.DataFrame({
    'No': df_test['No'],
    'host_is_superhost_pred': final_predictions
})

# Map 1/0 back to t/f
submission['host_is_superhost_pred'] = submission['host_is_superhost_pred'].map({1: 't', 0: 'f'})

submission.head()

In [None]:
# Save to Excel
submission.to_excel('prediction_result.xlsx', index=False)
print("Prediction saved to prediction_result.xlsx")

# 8. Summary

**Model Building Process:**
1.  **Data Preprocessing**: 
    -   Target variable `host_is_superhost` was converted to binary.
    -   Missing values were filled (median for numerical, mode for categorical).
    -   Categorical variables were One-Hot Encoded.
    -   Train and Test features were aligned to ensure consistency.
    -   Data was balanced using upsampling to address class imbalance.
    -   Features were scaled using StandardScaler.

2.  **Model Evaluation & Tuning**:
    -   We performed GridSearchCV for multiple models: Logistic Regression, Decision Tree, Random Forest, MLP, KNN, and Naive Bayes.
    -   Hyperparameters such as hidden layers for MLP, tree depth for Decision Tree/Random Forest, and neighbors for KNN were tuned.
    -   5-fold cross-validation accuracy on the balanced training data was used to compare the tuned models.

3.  **Prediction**:
    -   The model with the highest cross-validation accuracy was selected.
    -   Predictions were generated for the test dataset and saved to `prediction_result.xlsx`.