### Import Libraries

This cell imports all the necessary libraries for data manipulation, machine learning models, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import shap
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import re

In [2]:
import zipfile

uploaded_file_name = '/content/Airbnb-price-predictor.zip'  # Replace with the actual uploaded file name
with zipfile.ZipFile(uploaded_file_name, 'r') as zf:
    zf.extractall('/content/')


### Concatenate Big Cities DataFrames

This cell combines the DataFrames of big cities (NYC, Chicago, LA, SF) into a single DataFrame named `big_cities` and prints the number of columns in the resulting DataFrame.

### Display Shape of `big_cities`

This cell shows the dimensions (rows, columns) of the `big_cities` DataFrame after concatenation.

### Display Shapes of Medium Cities DataFrames

This cell defines a list of medium cities DataFrames and then iterates through them to print the shape of each DataFrame.

### Concatenate Medium Cities DataFrames

This cell combines the DataFrames of medium cities (Denver, Portland, Austin, Seattle) into a single DataFrame named `medium_cities`.

### Display Shape of `medium_cities`

This cell shows the dimensions (rows, columns) of the `medium_cities` DataFrame after concatenation.

### Display Shapes of Small Cities DataFrames

This cell defines a list of small cities DataFrames and then iterates through them to print the shape of each DataFrame.

### Concatenate Small Cities DataFrames

This cell combines the DataFrames of small cities (Asheville, Salem, Columbus, Santa Cruz) into a single DataFrame named `small_cities`.

### Display Shape of `small_cities`

This cell shows the dimensions (rows, columns) of the `small_cities` DataFrame after concatenation.

### Define Base Numeric Columns and All Features

This cell defines the list of base numerical columns and constructs the `ALL_FEATURES` list, which includes engineered features, and specifies the `TARGET_COL` (price).

### Data Cleaning and Feature Engineering Functions

This cell defines several functions:
- `parse_price`: Cleans and converts price strings to floats.
- `parse_bathrooms`: Extracts and converts bathroom text to numeric values.
- `create_features_and_clean`: Applies the parsing functions, handles missing values, creates `log_price`, `amenities_count`, `bath_bed_ratio`, `is_entire_home`, and `avg_sub_review_score` features, and filters the DataFrame to include only the relevant columns.

### Model Definitions

This cell defines functions to create and configure different machine learning models:
- `get_xgb_model`: Returns a pre-configured XGBoost Regressor model.
- `get_nn_model_arch1`: Returns a Sequential Keras model with a deeper MLP architecture and Dropout regularization.
- `get_nn_model_arch2`: Returns a Sequential Keras model with a wider MLP architecture and L2 regularization.

### Model Evaluation and Training Function

This cell defines two key functions:
- `evaluate_model`: Calculates RMSE, MAE, and R2 scores for a given model and test set.
- `train_and_evaluate_composite`: Orchestrates the entire training and evaluation process for a given DataFrame. It performs data preparation, scaling, trains XGBoost and two Neural Network architectures, and stores their results, trained models, and scaled test data.
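
For reference, the three metrics can be written out directly. The sketch below computes them with NumPy on made-up numbers; `regression_metrics` is a hypothetical helper, not a function from this notebook. It uses the standard definition R2 = 1 - SS_res / SS_tot, which means a model that predicts worse than the mean of the test targets gets a negative R2.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: RMSE, MAE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))

    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}


# Made-up targets, not taken from the listings data
y_true = [4.0, 5.0, 6.0]

perfect = regression_metrics(y_true, [4.0, 5.0, 6.0])
shifted = regression_metrics(y_true, [6.0, 7.0, 8.0])  # every prediction is off by +2

print(perfect)  # RMSE 0.0, MAE 0.0, R2 1.0
print(shifted)  # RMSE 2.0, MAE 2.0, R2 -5.0
```

The shifted case has the right shape but a constant offset, and that alone is enough to push R2 well below zero.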

### Execute Training and Evaluation for Composite DataFrames

This cell executes the `train_and_evaluate_composite` function for each of the composite DataFrames (Big, Medium, Small cities). It stores the evaluation results, trained models, and test data for each composite, printing status messages during execution.
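
The stored results form a nested dictionary (tier, then model, then metric). One way to compare tiers and models side by side is to flatten that dictionary into a single table. The sketch below assumes the same nesting as `composite_results`; the numbers are placeholders chosen only to show the structure and are not results from this notebook, and `summary` is a hypothetical name.

```python
import pandas as pd

# Placeholder values only; they are NOT the notebook's actual metrics.
composite_results = {
    'Big_Composite': {
        'XGBoost': {'RMSE': 0.50, 'MAE': 0.40, 'R2': 0.60},
        'NN_Arch1': {'RMSE': 0.55, 'MAE': 0.45, 'R2': 0.55},
    },
    'Small_Composite': {
        'XGBoost': {'RMSE': 0.60, 'MAE': 0.50, 'R2': 0.50},
        'NN_Arch1': {'RMSE': 0.65, 'MAE': 0.55, 'R2': 0.45},
    },
}

# One row per (tier, model) pair, one column per metric
rows = [
    {'tier': tier, 'model': model_name, **metrics}
    for tier, per_model in composite_results.items()
    for model_name, metrics in per_model.items()
]
summary = pd.DataFrame(rows).set_index(['tier', 'model'])

print(summary)
```

With the real dictionary in place of the placeholder, `summary.sort_values('R2', ascending=False)` would rank the tier/model pairs by R2.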

### Load City DataFrames

This cell loads the `listings.csv` file for each specified city into individual pandas DataFrames. These DataFrames will be used for further analysis and model training.

In [None]:

# 1. Asheville
df_asheville = pd.read_csv('./data/Asheville/listings.csv')

# 2. Austin
df_austin = pd.read_csv('./data/Austin/listings.csv')

# 3. Chicago
df_chicago = pd.read_csv('./data/Chicago/listings.csv')

# 4. Columbus
df_columbus = pd.read_csv('./data/Columbus/listings.csv')

# 5. Denver
df_denver = pd.read_csv('./data/Denver/listings.csv')

# 6. LA
df_la = pd.read_csv('./data/LA/listings.csv')

# 7. NYC
df_nyc = pd.read_csv('./data/NYC/listings.csv')

# 8. Portland
df_portland = pd.read_csv('./data/Portland/listings.csv')

# 9. Salem
df_salem = pd.read_csv('./data/Salem/listings.csv')

# 10. Santacruz
df_santacruz = pd.read_csv('./data/Santacruz/listings.csv')

# 11. Seattle
df_seattle = pd.read_csv('./data/Seattle/listings.csv')

# 12. SF
df_sf = pd.read_csv('./data/SF/listings.csv')

### Display Shape of `df_nyc`

This cell shows the dimensions (rows, columns) of the `df_nyc` DataFrame.

In [None]:
df_nyc.shape

(36111, 79)

### Display Shape of `df_la`

This cell shows the dimensions (rows, columns) of the `df_la` DataFrame.

In [None]:
df_la.shape

(45886, 79)

### Display Shape of `df_sf`

This cell shows the dimensions (rows, columns) of the `df_sf` DataFrame.

In [None]:
df_sf.shape

(7780, 79)

### Display Shape of `df_chicago`

This cell shows the dimensions (rows, columns) of the `df_chicago` DataFrame.

In [None]:
df_chicago.shape

(8604, 79)

In [None]:
big_cities = pd.concat([df_nyc , df_chicago , df_la , df_sf])
len(big_cities.columns)

79

In [None]:
big_cities.shape

(98381, 79)
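
A plain `pd.concat` keeps each city's original row index, so index labels repeat across cities, and nothing in the combined frame records which city a row came from. If that information is needed later, one option is to tag each frame before concatenating. The sketch below uses tiny stand-in frames (`df_a`, `df_b`) and a hypothetical `city` column; it is not what the notebook does.

```python
import pandas as pd

# Tiny stand-in frames; the notebook would use df_nyc, df_chicago, df_la, df_sf
df_a = pd.DataFrame({'id': [1, 2], 'price': ['$100.00', '$80.00']})
df_b = pd.DataFrame({'id': [3], 'price': ['$120.00']})

frames = {'NYC': df_a, 'Chicago': df_b}

# assign() returns a new frame with the extra column, leaving the originals unchanged;
# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(
    [df.assign(city=name) for name, df in frames.items()],
    ignore_index=True,
)

print(combined.shape)            # (3, 3)
print(combined.index.is_unique)  # True
```

Applied to the real data this would add one column (80 instead of 79). Because `create_features_and_clean` filters down to `ALL_FEATURES` plus `log_price`, the extra column would not reach the models.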

In [None]:
medium_cities_list = [df_denver , df_portland , df_austin, df_seattle]

for city in medium_cities_list:
    print(f"The shape of the df is {city.shape}")

The shape of the df is (4910, 79)
The shape of the df is (4425, 79)
The shape of the df is (15187, 79)
The shape of the df is (6996, 79)


In [None]:
medium_cities = pd.concat(medium_cities_list)

In [None]:
medium_cities.shape

(31518, 79)

In [None]:
small_cities_list = [df_asheville , df_salem, df_columbus, df_santacruz]

for city in small_cities_list:
    print(f"The shape of the df is {city.shape}")

The shape of the df is (2876, 79)
The shape of the df is (351, 79)
The shape of the df is (2877, 79)
The shape of the df is (1739, 79)


In [None]:
small_cities = pd.concat(small_cities_list)

In [None]:
small_cities.shape

(7843, 79)

# Base Numerical Columns

In [None]:

BASE_NUMERIC_COLS = [
    'accommodates', 'bedrooms', 'beds',
    'review_scores_rating', 'review_scores_accuracy',
    'review_scores_cleanliness', 'review_scores_checkin',
    'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'number_of_reviews',
    'availability_365', 'minimum_nights', 'maximum_nights'
]
ALL_FEATURES = BASE_NUMERIC_COLS + ['bathrooms', 'amenities_count', 'bath_bed_ratio', 'is_entire_home', 'avg_sub_review_score']
TARGET_COL = 'price'
n_features = len(ALL_FEATURES)

In [None]:
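def parse_price(price_str):
    """Convert a price string such as '$1,250.00' to a float; return NaN if it cannot be parsed."""
    if pd.isna(price_str):
        return np.nan
    clean_price = str(price_str).replace('$', '').replace(',', '')
    try:
        return float(clean_price)
    except ValueError:
        return np.nan


def parse_bathrooms(text):
    """Extract the number of bathrooms from free text such as '1.5 shared baths'."""
    if pd.isna(text):
        return 0.0
    match = re.search(r"(\d+(\.\d+)?)", str(text))
    if match:
        return float(match.group(1))
    # Entries like 'Half-bath' contain no digits
    if 'half-bath' in str(text).lower():
        return 0.5
    return 0.0


def create_features_and_clean(df, tier_name):
    # Work on a copy so the caller's DataFrame is left untouched
    # (tier_name is accepted but not used here)
    df = df.copy()
    df[TARGET_COL] = df[TARGET_COL].apply(parse_price)
    df['bathrooms'] = df['bathrooms_text'].apply(parse_bathrooms)

    # Keep only rows with a parsed price above 10, then log-transform the target
    df = df.dropna(subset=[TARGET_COL])
    df = df[df[TARGET_COL] > 10].copy()
    df['log_price'] = np.log1p(df[TARGET_COL])

    # Impute missing size features with the median and review columns with the mean
    for col in ['bedrooms', 'beds', 'accommodates', 'bathrooms']:
        df[col] = df[col].fillna(df[col].median())
    review_cols = [c for c in BASE_NUMERIC_COLS if 'review' in c]
    for col in review_cols:
        df[col] = df[col].fillna(df[col].mean())

    # Engineered features
    # amenities_count: number of comma-separated entries in the amenities string
    df['amenities_count'] = df['amenities'].apply(lambda x: len(str(x).split(',')) if pd.notna(x) else 0)
    # bath_bed_ratio: bathrooms per bedroom, set to 0 when there are no bedrooms
    df['bath_bed_ratio'] = np.where(df['bedrooms'] > 0, df['bathrooms'] / df['bedrooms'], 0)
    df['bath_bed_ratio'] = df['bath_bed_ratio'].replace([np.inf, -np.inf], 0).fillna(0)
    df['is_entire_home'] = (df['room_type'] == 'Entire home/apt').astype(int)
    sub_scores = ['review_scores_accuracy', 'review_scores_cleanliness',
                  'review_scores_checkin', 'review_scores_communication',
                  'review_scores_location', 'review_scores_value']
    df['avg_sub_review_score'] = df[sub_scores].mean(axis=1)

    # Keep only the model inputs and the log-transformed target
    return df.filter(items=ALL_FEATURES + ['log_price'])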
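
To make the behaviour of the two parsing helpers concrete, the sketch below re-declares them with the same logic and runs them on a few made-up strings. The sample inputs are illustrative and are not taken from the listings files.

```python
import re

import numpy as np
import pandas as pd


def parse_price(price_str):
    """Strip '$' and ',' and cast to float; NaN when that is not possible."""
    if pd.isna(price_str):
        return np.nan
    clean_price = str(price_str).replace('$', '').replace(',', '')
    try:
        return float(clean_price)
    except ValueError:
        return np.nan


def parse_bathrooms(text):
    """Return the first number found in the text, 0.5 for a half-bath, else 0.0."""
    if pd.isna(text):
        return 0.0
    match = re.search(r"(\d+(\.\d+)?)", str(text))
    if match:
        return float(match.group(1))
    if 'half-bath' in str(text).lower():
        return 0.5
    return 0.0


# Made-up sample inputs
print(parse_price('$1,250.00'))             # 1250.0
print(parse_price('n/a'))                   # nan
print(parse_bathrooms('1.5 shared baths'))  # 1.5
print(parse_bathrooms('Shared half-bath'))  # 0.5
print(parse_bathrooms(None))                # 0.0
```

Note that a missing `bathrooms_text` becomes `0.0` rather than NaN, so the later median imputation only affects the `bathrooms` column if NaN values arrive some other way.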

In [None]:
# MODEL DEFINITION
def get_xgb_model(random_state=42):
    return xgb.XGBRegressor(
        objective='reg:squarederror', n_estimators=100, learning_rate=0.1,
        max_depth=6, random_state=random_state)

# NN Architecture 1 (Deeper MLP with Dropout regularization)
def get_nn_model_arch1(n_features):
    model = Sequential([
        Dense(128, activation='relu', input_shape=(n_features,)),
        Dropout(0.1),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1)])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

# NN Architecture 2 (Wider MLP with L2 Regularization)
def get_nn_model_arch2(n_features):
    model = Sequential([
        Dense(256, activation='relu', input_shape=(n_features,)),
        Dense(128, activation='relu', kernel_regularizer=keras.regularizers.l2(0.01)),
        Dense(64, activation='relu'),
        Dense(1)])
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

In [None]:
def evaluate_model(model, X_test, y_test, is_keras=False):
    if is_keras:
        _, mae = model.evaluate(X_test, y_test, verbose=0)
        y_pred = model.predict(X_test).flatten()
    else:
        y_pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}

def train_and_evaluate_composite(df, tier_name):
    # Data Preparation and Scaling (essential for NN)
    df_clean = create_features_and_clean(df, tier_name).dropna()
    X = df_clean[ALL_FEATURES].values
    y = df_clean['log_price'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    results = {}
    trained_models = {}

    # XGBoost
    xgb_model = get_xgb_model()
    xgb_model.fit(X_train_scaled, y_train)
    results['XGBoost'] = evaluate_model(xgb_model, X_test_scaled, y_test)
    trained_models['XGBoost'] = xgb_model

    # NN Architecture 1
    nn1_model = get_nn_model_arch1(n_features)
    nn1_model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, verbose=0)
    results['NN_Arch1'] = evaluate_model(nn1_model, X_test_scaled, y_test, is_keras=True)
    trained_models['NN_Arch1'] = nn1_model

    # NN Architecture 2
    nn2_model = get_nn_model_arch2(n_features)
    nn2_model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, verbose=0)
    results['NN_Arch2'] = evaluate_model(nn2_model, X_test_scaled, y_test, is_keras=True)
    trained_models['NN_Arch2'] = nn2_model

    return results, trained_models, X_test_scaled, y_test, scaler
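
Because the target is `log_price = log1p(price)`, the RMSE and MAE reported by `evaluate_model` are in log units, not dollars. If dollar-scale errors are of interest, predictions can be mapped back with `np.expm1`, the inverse of `np.log1p`. The sketch below uses made-up prices and made-up prediction errors purely to illustrate the round trip.

```python
import numpy as np

# Made-up nightly prices in dollars
prices = np.array([50.0, 120.0, 300.0])

# Target transform used in the notebook, and its inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))  # True

# Pretend model output in log space: small, made-up errors around the truth
log_pred = log_prices + np.array([0.1, -0.1, 0.0])

# The same kind of error, expressed on two different scales
mae_log = np.mean(np.abs(log_pred - log_prices))
mae_dollars = np.mean(np.abs(np.expm1(log_pred) - prices))

print(f"MAE in log units: {mae_log:.4f}")
print(f"MAE in dollars:   {mae_dollars:.2f}")
```

An equal error in log space corresponds to a larger dollar error for expensive listings than for cheap ones, which is worth keeping in mind when reading the log-scale metrics.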

In [None]:
# EXECUTION

composite_dfs = {
    'Big_Composite': globals().get('big_cities'),
    'Medium_Composite': globals().get('medium_cities'),
    'Small_Composite': globals().get('small_cities')
}

composite_results = {}
composite_models = {}
composite_test_data = {}

for name, df in composite_dfs.items():
    if df is not None:
        print(f"Starting training for {name}...")
        results, models, X_test, y_test, scaler = train_and_evaluate_composite(df, name.split('_')[0])

        composite_results[name] = results
        composite_models[name] = models

        # Store test data and scaler for Phase 3 (Cross-Tier Analysis)
        composite_test_data[name] = {'X_test': X_test, 'y_test': y_test, 'scaler': scaler}
        print(f"Finished {name}.")
    else:
        print(f"Skipping {name}: DataFrame not found. Please ensure it's loaded.")

Starting training for Big_Composite...
448/448 ━━━━━━━━━━━━━━━━━━━━ 0s 305us/step
448/448 ━━━━━━━━━━━━━━━━━━━━ 0s 334us/step
Finished Big_Composite.
Starting training for Medium_Composite...
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 413us/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 449us/step
Finished Medium_Composite.
Starting training for Small_Composite...
45/45 ━━━━━━━━━━━━━━━━━━━━ 0s 817us/step
45/45 ━━━━━━━━━━━━━━━━━━━━ 0s 813us/step
Finished Small_Composite.


# Cross-Tier Neural Network Analysis

### Cross-Tier Analysis Function and Execution

This cell redefines `evaluate_model` so that all three metrics are computed from the predictions, then defines and executes `perform_cross_tier_analysis`. This function evaluates how well a model trained on one city tier predicts prices in the other two tiers. The analysis is run for `NN_Arch2`, and the resulting R2 scores are printed as a Markdown table.

In [None]:


def evaluate_model(model, X_test, y_test, is_keras=False):
    """Calculates RMSE, MAE, and R2 for a given model and test set."""
    if is_keras:
        # Suppressing detailed Keras output during prediction
        y_pred = model.predict(X_test, verbose=0).flatten()
    else:
        y_pred = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}

def perform_cross_tier_analysis(composite_models, composite_test_data, model_type='NN_Arch1'):
    """
    Evaluates how well models trained on one composite tier predict prices
    in the other two tiers.
    """
    cross_tier_results = {}
    tier_names = ['Big', 'Medium', 'Small']
    is_keras_model = (model_type == 'NN_Arch1' or model_type == 'NN_Arch2')

    for train_tier in tier_names:
        # Get the model trained on the composite tier data
        model_name = f'{train_tier}_Composite'
        trained_model = composite_models.get(model_name, {}).get(model_type)

        if trained_model is None:
            continue

        for test_tier in tier_names:
            if train_tier == test_tier:
                continue # Skip self-comparison

            test_data_key = f'{test_tier}_Composite'

            # Note: this stored test set was scaled with the test tier's own scaler,
            # not with the scaler fitted on the training tier
            test_data_entry = composite_test_data.get(test_data_key, {})
            X_test_scaled = test_data_entry.get('X_test')
            y_test = test_data_entry.get('y_test')

            if X_test_scaled is None:
                continue

            # Evaluate the model on the cross-tier test data
            results = evaluate_model(trained_model, X_test_scaled, y_test, is_keras=is_keras_model)

            key = f'Trained on {train_tier} | Tested on {test_tier}'
            cross_tier_results[key] = results

    # Format the results into a readable DataFrame
    if not cross_tier_results:
        return pd.DataFrame()

    final_df = pd.DataFrame(cross_tier_results).T
    final_df = final_df.apply(lambda x: pd.Series(x).map(lambda y: f'{y:.4f}'))
    return final_df

#  EXECUTE ANALYSIS

try:
    cross_tier_results_df = perform_cross_tier_analysis(
        composite_models,
        composite_test_data,
        model_type='NN_Arch2'
    )

    print("\n" + "="*70)
    print("Phase 3 Complete: Cross-Tier NN Generalization Analysis (R-Squared)")
    print("="*70)
    if not cross_tier_results_df.empty:
        # Displaying only the R2 values for key insight
        print(cross_tier_results_df['R2'].to_markdown(numalign="left", stralign="left"))
    else:
        print("Analysis failed. Please confirm Phase 2 executed successfully and variables are defined.")

except NameError as e:
    print(f"\nERROR: A required variable is missing. Did you run the full Phase 2 code? Missing variable: {e}")


Phase 3 Complete: Cross-Tier NN Generalization Analysis (R-Squared)
|                                     | R2      |
|:------------------------------------|:--------|
| Trained on Big | Tested on Medium   | -1.4242 |
| Trained on Big | Tested on Small    | -1.4058 |
| Trained on Medium | Tested on Big   | 0.4128  |
| Trained on Medium | Tested on Small | 0.5059  |
| Trained on Small | Tested on Big    | 0.2362  |
| Trained on Small | Tested on Medium | 0.4244  |
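
One caveat when reading these numbers: each stored `X_test` was standardized with the scaler fitted on its own tier, while each model was trained on features standardized with a different tier's scaler. Since `composite_test_data` also stores every tier's `scaler`, the test features could be mapped back to raw units and re-scaled with the training tier's scaler before evaluation. The sketch below shows that idea on synthetic data; `rescale_for_model` is a hypothetical helper, and whether this changes the R2 values above has not been checked here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two tiers whose features have different distributions
X_big = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_small = rng.normal(loc=1.0, scale=0.5, size=(100, 3))

scaler_big = StandardScaler().fit(X_big)      # scaler the "Big" model was trained with
scaler_small = StandardScaler().fit(X_small)  # scaler stored with the "Small" test set

# What the notebook stores: test rows scaled with their own tier's scaler
X_small_scaled_own = scaler_small.transform(X_small)


def rescale_for_model(X_scaled, test_scaler, train_scaler):
    """Hypothetical helper: undo the test tier's scaling, then apply the training tier's."""
    X_raw = test_scaler.inverse_transform(X_scaled)
    return train_scaler.transform(X_raw)


X_small_for_big_model = rescale_for_model(X_small_scaled_own, scaler_small, scaler_big)

# Matches scaling the raw rows directly with the training tier's scaler
print(np.allclose(X_small_for_big_model, scaler_big.transform(X_small)))  # True

# The two versions of the same rows differ substantially
print(np.abs(X_small_for_big_model - X_small_scaled_own).mean())
```

In the notebook, the equivalent call would pass `composite_test_data[test_key]['scaler']` as `test_scaler` and `composite_test_data[train_key]['scaler']` as `train_scaler`, where `test_key` and `train_key` are the two composite names being compared.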
