# **Data Preprocessing**

This notebook handles data cleaning, feature engineering, and preparation for modeling, based on insights from the exploratory data analysis. To avoid data leakage, the train/test split is performed before any step that learns statistics from the data (outlier thresholds, target encoding, scaling).

The cleaned dataset is saved locally and uploaded to S3 for use in the modeling phase.

In [1]:
import pandas as pd
import boto3
import io
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [2]:
# FETCH DATA FROM AWS

# Create the AWS S3 Client
s3_client = boto3.client('s3')
# get_object returns a dict; the file content is under the 'Body' key
response = s3_client.get_object(Bucket='software-tools-ai', Key='raw_data/listings.csv')
# Read the raw CSV bytes from the response body
csv_content = response['Body'].read()

In [3]:
# Load fetched data into the dataset
df = pd.read_csv(io.BytesIO(csv_content), header = 0, sep = ',')
print(df.shape)
df.head(2)

(48895, 16)


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |


### Outlier Handling Strategy

Based on the EDA findings, price has significant outliers (up to $10,000). This pipeline removes listings priced above the 99th percentile from the training set only: the threshold is computed from training data, and the test set is left unchanged.

**Outlier handling for features (not target):** Winsorization will be tested during the modeling phase, since it can improve linear model performance without hurting tree-based models.

In [4]:
# Data cleaning before split

# Remove invalid prices
print(f"Original dataset: {df.shape}")
df = df[df['price'] > 0]
print(f"After removing $0 prices: {df.shape}")

# Drop unnecessary columns
# Note: neighbourhood kept for target encoding, will be dropped later
cols_to_drop = ['id', 'host_id', 'name', 'host_name']
df = df.drop(columns=cols_to_drop)
print(f"After dropping ID columns: {df.shape}")
print(f"Remaining columns: {list(df.columns)}")

Original dataset: (48895, 16)
After removing $0 prices: (48884, 16)
After dropping ID columns: (48884, 12)
Remaining columns: ['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']


### Train-Test Split

In [5]:
# Separate features and target
X = df.drop('price', axis=1)
y = df['price']

# Split 80/20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Split ratio: {X_train.shape[0]/len(df)*100:.1f}% train, {X_test.shape[0]/len(df)*100:.1f}% test")

Training set: 39107 samples
Test set: 9777 samples
Split ratio: 80.0% train, 20.0% test


### Outlier Removal - Target Variable

In [6]:
# Calculate outlier threshold on TRAIN only
outlier_threshold = y_train.quantile(0.99)
print(f"Outlier threshold (99th percentile from train): ${outlier_threshold:.2f}")

# Count outliers in train
n_outliers_train = (y_train > outlier_threshold).sum()
print(f"Outliers in training set: {n_outliers_train}")

# Remove outliers from TRAIN
mask = y_train <= outlier_threshold
X_train = X_train[mask]
y_train = y_train[mask]

print(f"Training set after outlier removal: {X_train.shape[0]} samples")
print(f"Test set (unchanged): {X_test.shape[0]} samples")

Outlier threshold (99th percentile from train): $798.76
Outliers in training set: 392
Training set after outlier removal: 38715 samples
Test set (unchanged): 9777 samples


### Missing Value Handling

In [7]:
# Check missing values in train
missing_df = pd.DataFrame({
    'column': X_train.columns,
    'missing_count': X_train.isnull().sum().values,
    'missing_pct': (X_train.isnull().sum() / len(X_train) * 100).round(2).values
})

missing_df = missing_df[missing_df['missing_count'] > 0].sort_values('missing_count', ascending=False)

print("Missing values in training set:")
missing_df

Missing values in training set:


| | column | missing_count | missing_pct |
| --- | --- | --- | --- |
| 7 | last_review | 7833 | 20.23 |
| 8 | reviews_per_month | 7833 | 20.23 |


In [8]:
# Handle missing values

# Impute reviews_per_month with 0 (listings without reviews)
X_train['reviews_per_month'] = X_train['reviews_per_month'].fillna(0)
X_test['reviews_per_month'] = X_test['reviews_per_month'].fillna(0)

# Drop last_review (not useful for prediction)
X_train = X_train.drop(columns=['last_review'])
X_test = X_test.drop(columns=['last_review'])

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nMissing values remaining in train: {X_train.isnull().sum().sum()}")
print(f"Missing values remaining in test: {X_test.isnull().sum().sum()}")

Training set shape: (38715, 10)
Test set shape: (9777, 10)

Missing values remaining in train: 0
Missing values remaining in test: 0
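
Filling reviews_per_month with 0 assumes that a missing rate always means the listing has no reviews. The identical missing counts for last_review and reviews_per_month (7,833 each) are consistent with that assumption, and it can be checked directly. The sketch below uses a small made-up frame (`toy` is hypothetical, not the notebook's data) to show one way to run the check before imputing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train before imputation
toy = pd.DataFrame({
    'number_of_reviews': [9, 0, 45, 0],
    'reviews_per_month': [0.21, np.nan, 0.38, np.nan],
})

# Rows where the monthly rate is missing
missing_rate = toy['reviews_per_month'].isnull()
# Rows where the listing has no reviews at all
no_reviews = toy['number_of_reviews'] == 0

# Count rows where the two conditions disagree
mismatches = int((missing_rate != no_reviews).sum())
print(f"Rows where missing rate and zero reviews disagree: {mismatches}")
```

If the same check on the real training data reports zero mismatches, imputing with 0 records a true value rather than a guess.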


## Feature Engineering

### Geographic Distance

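Based on EDA findings showing Manhattan as the most expensive area, a new feature is created: the distance from each listing to Manhattan's center (Times Square coordinates: 40.7580, -73.9855). The distance is Euclidean and computed directly on latitude and longitude, so it is expressed in degrees rather than kilometers.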

In [9]:
# Manhattan center coordinates
manhattan_lat = 40.7580
manhattan_lon = -73.9855

# Calculate Euclidean distance for train
X_train['distance_to_manhattan'] = np.sqrt(
    (X_train['latitude'] - manhattan_lat)**2 + 
    (X_train['longitude'] - manhattan_lon)**2
)

# Calculate for test
X_test['distance_to_manhattan'] = np.sqrt(
    (X_test['latitude'] - manhattan_lat)**2 + 
    (X_test['longitude'] - manhattan_lon)**2
)

print(f"Distance range in train: {X_train['distance_to_manhattan'].min():.4f} to {X_train['distance_to_manhattan'].max():.4f}")

Distance range in train: 0.0007 to 0.3631
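
Because the feature above is measured in degrees, one degree of longitude counts the same as one degree of latitude, although at New York's latitude it covers less ground. If a distance in kilometers is preferred, a great-circle (haversine) distance is a common alternative. The sketch below is not what the notebook uses; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat, lon, ref_lat=40.7580, ref_lon=-73.9855):
    """Great-circle distance in km from (lat, lon) to a reference point.

    The default reference is the Times Square coordinate used in the notebook.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(ref_lat), np.radians(ref_lon)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the two sample listings shown by df.head(2)
lats = np.array([40.64749, 40.75362])
lons = np.array([-73.97237, -73.98377])
dist_km = haversine_km(lats, lons)
print(dist_km)
```

The function accepts NumPy arrays or pandas Series, so it could be applied to X_train['latitude'] and X_train['longitude'] in the same way as the Euclidean version.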


### Target Encoding - Neighbourhood

The neighbourhood column (221 unique categories) contains valuable pricing information. Instead of one-hot encoding (which would create 221 features), we use target encoding: replacing each neighbourhood with the mean price of listings in that neighbourhood.

**Critical:** Statistics are calculated ONLY from training data to prevent data leakage.

In [10]:
# Calculate mean price per neighbourhood in TRAIN only
neighbourhood_means = X_train.join(y_train).groupby('neighbourhood')['price'].mean()

print(f"Number of unique neighbourhoods: {len(neighbourhood_means)}")
print(f"\nTop 5 most expensive neighbourhoods:")
print(neighbourhood_means.sort_values(ascending=False).head())
print(f"\nTop 5 least expensive neighbourhoods:")
print(neighbourhood_means.sort_values(ascending=True).head())

# Map to train
X_train['neighbourhood_price_encoded'] = X_train['neighbourhood'].map(neighbourhood_means)

# Map to test using SAME statistics from train
X_test['neighbourhood_price_encoded'] = X_test['neighbourhood'].map(neighbourhood_means)

# Handle neighbourhoods in test that weren't in train (use global mean)
global_mean = y_train.mean()
n_missing = X_test['neighbourhood_price_encoded'].isnull().sum()
if n_missing > 0:
    print(f"\nWarning: {n_missing} test samples had neighbourhoods not seen in train")
    print(f"Filled with global mean: ${global_mean:.2f}")
    # Assign the result back instead of using inplace=True
    X_test['neighbourhood_price_encoded'] = X_test['neighbourhood_price_encoded'].fillna(global_mean)
else:
    print(f"\n✓ All test neighbourhoods were present in train data")

print(f"\nTarget encoding complete:")
print(f"Train range: ${X_train['neighbourhood_price_encoded'].min():.2f} to ${X_train['neighbourhood_price_encoded'].max():.2f}")
print(f"Test range: ${X_test['neighbourhood_price_encoded'].min():.2f} to ${X_test['neighbourhood_price_encoded'].max():.2f}")

Number of unique neighbourhoods: 217

Top 5 most expensive neighbourhoods:
neighbourhood
Tribeca              299.070866
NoHo                 287.390625
Flatiron District    262.048387
Midtown              253.051452
Willowbrook          249.000000
Name: price, dtype: float64

Top 5 least expensive neighbourhoods:
neighbourhood
Westerleigh    40.000000
Mount Eden     40.200000
Bull's Head    48.800000
Hunts Point    51.333333
Soundview      51.916667
Name: price, dtype: float64

Filled with global mean: $137.38

Target encoding complete:
Train range: $40.00 to $299.07
Test range: $40.00 to $299.07
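
A plain per-neighbourhood mean can be noisy for neighbourhoods with only a few training listings, since one or two prices determine the encoded value. A common mitigation is to shrink each neighbourhood mean toward the global mean in proportion to how few listings it has. The sketch below illustrates that idea on made-up data; `smoothed_target_encode` and the smoothing weight `m` are hypothetical and not part of this notebook's pipeline.

```python
import pandas as pd

def smoothed_target_encode(train_keys, train_target, apply_keys, m=10.0):
    """Encode categories as a weighted blend of category mean and global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    All statistics come from the training data only.
    """
    global_mean = train_target.mean()
    stats = train_target.groupby(train_keys).agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    # Categories unseen in training fall back to the global mean
    return apply_keys.map(smoothed).fillna(global_mean)

# Hypothetical toy data: 'A' has four listings, 'B' has one
toy_train = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'A', 'B'],
    'price': [100.0, 100.0, 100.0, 100.0, 300.0],
})
toy_test_keys = pd.Series(['A', 'B', 'C'])  # 'C' never appears in training

encoded = smoothed_target_encode(
    toy_train['neighbourhood'], toy_train['price'], toy_test_keys, m=1.0
)
print(encoded.tolist())
```

With `m=1.0`, the single-listing neighbourhood 'B' is pulled from 300 toward the global mean of 140, while 'A' moves only slightly. The value of `m` would need to be tuned as part of the modeling experiments.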


### Interaction Features

Create an interaction feature between room_type and neighbourhood_group to capture patterns like "entire homes in Manhattan are more expensive than entire homes elsewhere".

In [11]:
# Create interaction feature
X_train['room_borough'] = X_train['room_type'] + '_' + X_train['neighbourhood_group']
X_test['room_borough'] = X_test['room_type'] + '_' + X_test['neighbourhood_group']

print(f"Interaction feature created: {X_train['room_borough'].nunique()} unique combinations")
print(f"\nExamples:")
print(X_train['room_borough'].value_counts().head())

Interaction feature created: 15 unique combinations

Examples:
room_borough
Entire home/apt_Manhattan    10341
Private room_Brooklyn         8089
Entire home/apt_Brooklyn      7578
Private room_Manhattan        6312
Private room_Queens           2685
Name: count, dtype: int64


### Drop Original Neighbourhood Column

Now that we've extracted the pricing information via target encoding, we can drop the original neighbourhood column.

In [12]:
# Drop neighbourhood (already encoded)
X_train = X_train.drop(columns=['neighbourhood'])
X_test = X_test.drop(columns=['neighbourhood'])

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nRemaining columns: {list(X_train.columns)}")

Training set shape: (38715, 12)
Test set shape: (9777, 12)

Remaining columns: ['neighbourhood_group', 'latitude', 'longitude', 'room_type', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'distance_to_manhattan', 'neighbourhood_price_encoded', 'room_borough']


### Feature Outlier Handling

**Note:** Winsorization of numeric features (clipping at the 1st and 99th percentiles) will be tested during the modeling phase. This transformation can improve linear model performance and has little effect on tree-based models. It will be logged as a separate experiment in MLflow to compare results.
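
To stay consistent with the leakage rule used elsewhere in this notebook, the clipping bounds would be learned from the training set and then applied unchanged to the test set. The following is a minimal sketch of that idea on made-up data; `winsorize_from_train` is a hypothetical helper, not the implementation that will be used in the modeling phase.

```python
import pandas as pd

def winsorize_from_train(train_df, test_df, cols, lower=0.01, upper=0.99):
    """Clip `cols` in both frames to percentile bounds learned from train only."""
    bounds = train_df[cols].quantile([lower, upper])
    lo, hi = bounds.iloc[0], bounds.iloc[1]  # one bound per column
    train_out = train_df.copy()
    test_out = test_df.copy()
    train_out[cols] = train_df[cols].clip(lower=lo, upper=hi, axis=1)
    test_out[cols] = test_df[cols].clip(lower=lo, upper=hi, axis=1)
    return train_out, test_out

# Hypothetical toy data: train holds 1..100, test holds two extreme values
toy_train = pd.DataFrame({'minimum_nights': [float(v) for v in range(1, 101)]})
toy_test = pd.DataFrame({'minimum_nights': [0.0, 50.0, 1000.0]})

train_w, test_w = winsorize_from_train(toy_train, toy_test, ['minimum_nights'])
print(test_w['minimum_nights'].tolist())
```

The extreme test values are pulled to the training bounds, while the value inside the bounds is left untouched.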

## Encoding and Scaling

In [13]:
# Define column types
categorical_cols = ['neighbourhood_group', 'room_type', 'room_borough']

numeric_cols = ['latitude', 'longitude', 'minimum_nights', 
                'number_of_reviews', 'reviews_per_month', 
                'calculated_host_listings_count', 'availability_365',
                'distance_to_manhattan', 'neighbourhood_price_encoded']

print(f"Categorical: {categorical_cols}")
print(f"Numeric: {numeric_cols}")
print(f"Total: {len(categorical_cols) + len(numeric_cols)} features")

Categorical: ['neighbourhood_group', 'room_type', 'room_borough']
Numeric: ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'distance_to_manhattan', 'neighbourhood_price_encoded']
Total: 12 features


In [14]:
# Create preprocessing pipeline
# - OneHotEncoder: converts categorical variables to binary columns (drop first to avoid multicollinearity)
# - StandardScaler: normalizes numeric features to mean=0, std=1
preprocessor = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_cols),
    ('stdscaler', StandardScaler(), numeric_cols)
])

# Fit preprocessor on training data (learn statistics)
preprocessor.fit(X_train)

# Transform both train and test using the same statistics
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f'X_train_processed shape: {X_train_processed.shape}')
print(f'X_test_processed shape: {X_test_processed.shape}')
print(f'\nX_train_processed type: {type(X_train_processed)}')
print(f'X_test_processed type: {type(X_test_processed)}')

X_train_processed shape: (38715, 29)
X_test_processed shape: (9777, 29)

X_train_processed type: <class 'numpy.ndarray'>
X_test_processed type: <class 'numpy.ndarray'>


### Saving Preprocessed Dataset - S3 and Local

In [15]:
# Save processed datasets

# Convert numpy arrays back to DataFrames and combine with target variable
# Note: X_processed has no column names (it's a numpy array after ColumnTransformer)
# Create generic feature names
feature_names = [f'feature_{i}' for i in range(X_train_processed.shape[1])]

# Combine X and y for train
train_processed = pd.DataFrame(X_train_processed, columns=feature_names, index=X_train.index)
train_processed['price'] = y_train

# Combine X and y for test
test_processed = pd.DataFrame(X_test_processed, columns=feature_names, index=X_test.index)
test_processed['price'] = y_test

# Create output directory
output_dir = '../data/processed'
os.makedirs(output_dir, exist_ok=True)

# Save locally
train_path = os.path.join(output_dir, 'train_processed_v2.csv')
test_path = os.path.join(output_dir, 'test_processed_v2.csv')

train_processed.to_csv(train_path, index=False)
test_processed.to_csv(test_path, index=False)
print(f"✓ Saved locally: {train_path}")
print(f"✓ Saved locally: {test_path}")

# Upload to S3
for filename, df_data in [('train_processed_v2.csv', train_processed), ('test_processed_v2.csv', test_processed)]:
    csv_buffer = io.StringIO()
    df_data.to_csv(csv_buffer, index=False)
    
    s3_client.put_object(
        Bucket='software-tools-ai',
        Key=f'processed_data/{filename}',
        Body=csv_buffer.getvalue()
    )
    print(f"✓ Uploaded to S3: s3://software-tools-ai/processed_data/{filename}")

print(f"\nFinal datasets:")
print(f"Train: {train_processed.shape}")
print(f"Test: {test_processed.shape}")

✓ Saved locally: ../data/processed\train_processed_v2.csv
✓ Saved locally: ../data/processed\test_processed_v2.csv
✓ Uploaded to S3: s3://software-tools-ai/processed_data/train_processed_v2.csv
✓ Uploaded to S3: s3://software-tools-ai/processed_data/test_processed_v2.csv

Final datasets:
Train: (38715, 30)
Test: (9777, 30)


## Preprocessing Complete

The dataset has been successfully preprocessed and is ready for modeling. The pipeline applied:

**Data cleaning:**
- Removed 11 listings with price = $0
- Dropped identifier columns (id, host_id, name, host_name)

**Train/test split:**
- 80/20 split (39,107 train / 9,777 test samples), leaving 38,715 training samples after outlier removal
- Outliers removed from train only (392 listings above $798.76)

**Feature engineering:**
- Created distance_to_manhattan from geographic coordinates
- Target encoded neighbourhood (221 categories, 217 of them present in the training set) using train statistics only
- Created room_type × neighbourhood_group interaction features
- Imputed missing reviews_per_month values with 0 and dropped last_review

**Encoding and scaling:**
- One-hot encoded categorical variables (neighbourhood_group, room_type, room_borough) with `drop='first'`
- StandardScaler applied to numeric features including neighbourhood_price_encoded
- All transformations fit on train data only to prevent leakage

**Final datasets (V2):**
- Training: 38,715 samples × 30 columns (29 features + price)
- Test: 9,777 samples × 30 columns (29 features + price)
- Saved locally: `../data/processed/train_processed_v2.csv`, `../data/processed/test_processed_v2.csv`
- Uploaded to S3: `s3://software-tools-ai/processed_data/`

**Feature count breakdown:**
- One-hot encoded: 20 features (4 from neighbourhood_group, 2 from room_type, 14 from room_borough, after dropping the first category of each)
- Numeric scaled: 9 features (including neighbourhood_price_encoded)
- Total: 29 predictor features + price

**Note:** Feature outlier handling (winsorization) will be tested as an experiment during modeling phase.

Next step: Model development with MLflow experiment tracking.