
# Airbnb Price Prediction — Professional (Internship-ready)
**Style:** Data Scientist / Machine Learning Engineer (Professional)
**Dataset:** https://www.kaggle.com/datasets/stevezhenghp/airbnb-price-prediction

**What this notebook contains (high level)**
- Clear problem statement and success metrics
- Data ingestion (Kaggle API + manual upload instructions)
- Exploratory Data Analysis (EDA) with visualizations
- Robust preprocessing & advanced feature engineering:
  - Price cleaning + log-transform target
  - Amenities parsing and amenity scoring
  - Bathroom/bedroom parsing, capacity features
  - Geospatial features (latitude/longitude → distance to the median listing coordinate, used as a proxy for the city center)
  - Outlier handling and missing value strategy
- Modeling & Evaluation:
  - Train/test split + K-Fold CV
  - Model comparison: RandomForest, XGBoost, ExtraTrees, GradientBoosting
  - Hyperparameter tuning (RandomizedSearchCV for speed)
  - Cross-validated RMSE on the training set, holdout RMSE/MAE/R2, and final model selection
- Explainability & Interpretation:
  - Feature importance + SHAP summary plots
- Deliverables:
  - Save final model, processed CSV for dashboards, and helper `predict_price()` function
  - PPT and PDF report generation is an optional follow-up; no cells for it are included in this notebook
- Notes: Run in Google Colab. Some cells install packages (xgboost, shap, python-pptx) — allow them to run.

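The outline lists outlier handling, but none of the cells in this notebook implements a dedicated step for it. As a minimal sketch, assuming a Tukey-style IQR rule is acceptable for nightly prices, the helpers below compute the fences and filter a list of values. The names `iqr_bounds` and `drop_price_outliers`, the multiplier `k=1.5`, and the sample prices are illustrative assumptions, not part of the notebook.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences for a sequence of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_price_outliers(prices, k=1.5):
    """Keep only prices that fall inside the IQR fences (hypothetical helper)."""
    lower, upper = iqr_bounds(prices, k)
    return [p for p in prices if lower <= p <= upper]

# Made-up nightly prices; 5000 is an obvious outlier
prices = [50, 60, 70, 80, 90, 100, 110, 120, 5000]
print(iqr_bounds(prices))           # (10.0, 170.0)
print(drop_price_outliers(prices))  # the 5000 entry is dropped
```

In the notebook, the same rule could be applied to `df['price_clean']` before the log transform. Computing the fences on the training rows only would keep the holdout set out of the preprocessing.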

## 1) Download dataset (Kaggle API) or Upload CSV
Follow one of the two options:

**A) Upload manually**: Drag & drop `listings.csv` into Colab `/content` (left Files pane).

**B) Kaggle API**: Upload your `kaggle.json` to `/content` and run the cell below to download and unzip the dataset.


In [None]:

# Optional: Download dataset using Kaggle API (if you uploaded kaggle.json to /content)
!pip install -q kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'
# After you upload kaggle.json to /content, run:
!kaggle datasets download -d stevezhenghp/airbnb-price-prediction -p /content --unzip
!ls -lh /content


In [None]:

# Install extra packages not always present in Colab
!pip install -q xgboost shap python-pptx


In [None]:

# Standard imports
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import joblib, os, json
import warnings; warnings.filterwarnings('ignore')
print('imports done')


In [None]:

# Collect candidate CSVs, skipping this notebook's own export so a rerun does not load it by mistake
csvs = sorted(f for f in os.listdir('/content')
              if f.lower().endswith('.csv') and f != 'airbnb_processed_for_dashboard.csv')
print('CSV files found:', csvs)
# Prefer listings.csv; otherwise fall back to the first CSV alphabetically
fn = '/content/listings.csv' if 'listings.csv' in csvs else ('/content/' + csvs[0] if csvs else None)
if fn is None:
    raise FileNotFoundError('No CSV found. Upload listings.csv or use Kaggle API cell.')
df = pd.read_csv(fn, low_memory=False)
print('Loaded', fn, 'shape=', df.shape)
df.head()


In [None]:

# --- EDA (quick) ---
print('Columns:', len(df.columns))
print(df.select_dtypes(include=['object']).columns[:30])
print('\nPrice column candidates:')
candidates = [c for c in df.columns if 'price' in c.lower()]
print(candidates)
# Pick the price column: prefer an exact 'price' match, else the first candidate
price_col = next((c for c in candidates if c.lower() == 'price'), candidates[0] if candidates else None)
print('Using price column:', price_col)
if price_col:
    # Preview raw values without modifying df; cleaning happens in the preprocessing cell
    print(df[price_col].head(10))

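The preview above prints raw price values; the actual cleaning happens in the preprocessing cell. As a small illustration of that step, assuming prices may arrive as strings such as `$1,250.00`, the sketch below mirrors the regex-based approach on a few made-up inputs. `clean_price` is a stand-in name used only for this example.

```python
import math
import re

def clean_price(value):
    """Strip everything except digits and dots; return NaN if no number remains."""
    if value is None:
        return math.nan
    digits = re.sub(r"[^0-9.]", "", str(value))
    try:
        return float(digits)
    except ValueError:
        # Covers empty strings, a lone '.', and malformed input like '1.2.3'
        return math.nan

samples = ["$1,250.00", "85", " 99.5 ", "N/A", None]
cleaned = [clean_price(s) for s in samples]
print(cleaned)
```

One caveat of this approach: it assumes a dot as the decimal separator, so a value formatted as `1.250,00` would be misread as 1.25.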

In [None]:

# ----------------------
# Preprocessing & Feature Engineering
# ----------------------

import re
from math import radians, cos, sin, asin, sqrt

# Helper: clean price to float
def clean_price_col(s):
    try:
        if pd.isna(s): return np.nan
        s = str(s)
        s = re.sub(r'[^0-9.]', '', s)
        if s=='' or s=='.': return np.nan
        return float(s)
    except:
        return np.nan

# Identify likely column names with heuristics
cols = df.columns.tolist()
def find_col(options):
    for o in options:
        for c in cols:
            if o.lower()==c.lower(): return c
    return None

# find_col matches case-insensitively, so one spelling per name is enough
col_price = find_col(['price','price_per_night','price_night'])
col_lat = find_col(['latitude','lat'])
col_lng = find_col(['longitude','lng','long'])
col_amen = find_col(['amenities'])
col_room = find_col(['room_type','room type','property_type','property type'])
col_bedrooms = find_col(['bedrooms'])
col_bath = find_col(['bathrooms','bathrooms_text'])
col_accom = find_col(['accommodates','guests'])

print('Mapped:', col_price, col_lat, col_lng, col_amen, col_room, col_bedrooms, col_bath, col_accom)

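# Clean price and create target
if col_price is not None:
    df['price_clean'] = df[col_price].apply(clean_price_col)
elif 'log_price' in df.columns:
    # Assumption: the file ships only `log_price`, taken here to be the natural log of the nightly price
    df['price_clean'] = np.exp(pd.to_numeric(df['log_price'], errors='coerce'))
else:
    raise KeyError('No price column found; expected price, price_per_night, price_night, or log_price.')
print('price_clean na:', df['price_clean'].isna().sum())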

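# Basic numeric features
# (the bathroom column is left untouched here: it may hold text such as "1.5 shared baths",
#  which the bath_numeric extraction further down needs in its raw form)
for c in [col_bedrooms, col_accom]:
    if c and c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')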

# Amenities: count and create amenity score
if col_amen and col_amen in df.columns:
    # Parse the raw string, e.g. "{'Wifi', 'Kitchen'}", into a plain list of amenity names
    def parse_amenities(x):
        parts = str(x).strip('{}[] ').split(',')
        return [p.strip(' "\'') for p in parts if p.strip(' "\'')]
    df['amen_list'] = df[col_amen].fillna('').apply(parse_amenities)
    # amenities_count below uses a separate splitter that ignores commas inside parentheses
    def amen_count_raw(x):
        try:
            s = str(x)
            s = s.strip('{}[] ')
            if s=='' or s.lower()=='nan': return 0
            # split by comma not within quotes approx
            return len([a for a in re.split(r',\s*(?![^\(]*\))', s) if a.strip()!=''])
        except:
            return 0
    df['amenities_count'] = df[col_amen].fillna('').apply(amen_count_raw)
    amenities_lower = df[col_amen].fillna('').str.lower()
    # create flags for common amenities
    for a in ['wifi','kitchen','heater','heating','washer','dryer','parking','parking space','air conditioning','ac']:
        colname = 'amen_' + re.sub(r'\W+','_',a)
        # match whole words only, so 'ac' does not fire on 'space' or 'backyard'
        df[colname] = amenities_lower.str.contains(r'\b' + re.escape(a) + r'\b', na=False).astype(int)
else:
    df['amenities_count'] = 0

# Bathrooms: sometimes in text "1 bath", "1.5 shared baths"
if col_bath and col_bath in df.columns:
    df['bath_numeric'] = pd.to_numeric(df[col_bath].astype(str).str.extract(r'([0-9\.]+)')[0], errors='coerce')

# Latitude/longitude distance to centroid (approx center) if available
def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6371 * c
    return km

if col_lat and col_lat in df.columns and col_lng and col_lng in df.columns:
    # compute city centroid as median of available coords
    center_lat = df[col_lat].median()
    center_lng = df[col_lng].median()
    df['dist_to_center_km'] = df.apply(lambda row: haversine(row[col_lng], row[col_lat], center_lng, center_lat) if pd.notna(row[col_lat]) and pd.notna(row[col_lng]) else np.nan, axis=1)
else:
    df['dist_to_center_km'] = np.nan

# Create final feature list candidates
feature_candidates = ['amenities_count','dist_to_center_km','bath_numeric']
for c in df.columns:
    if c.startswith('amen_'): feature_candidates.append(c)
if col_bedrooms and col_bedrooms in df.columns: feature_candidates.append(col_bedrooms)
if col_accom and col_accom in df.columns: feature_candidates.append(col_accom)
if col_room and col_room in df.columns: feature_candidates.append(col_room)
# drop na target rows
df_model = df.copy()
df_model = df_model.dropna(subset=['price_clean']).reset_index(drop=True)
print('df_model shape after dropping NA price:', df_model.shape)

# Target log-transform
df_model['log_price'] = np.log1p(df_model['price_clean'])

# Preview a few processed rows (transposed) for inspection; nothing is written to disk here
df_model.sample(3).T.head(50)

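The haversine helper in the cell above takes longitude before latitude, which is easy to get backwards. A quick check against a known distance can catch that: along the equator, one degree of longitude should come out at roughly 111.19 km for an Earth radius of 6371 km. The sketch below restates the same formula so that it runs on its own; `haversine_km` and `EARTH_RADIUS_KM` are names chosen for this example.

```python
from math import radians, sin, cos, asin, sqrt, pi

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude on the equator equals one degree of arc on the sphere
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
expected = EARTH_RADIUS_KM * pi / 180
print(round(one_degree, 2))  # 111.19

# The same longitude step at 60 degrees latitude covers about half that distance
at_60_north = haversine_km(0.0, 60.0, 1.0, 60.0)
print(round(at_60_north, 1))
```

If the second value were not close to half of the first, the latitude and longitude arguments would most likely be swapped.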

In [None]:

# ----------------------
# Modeling: comparison + tuning with K-Fold CV
# ----------------------

# Select features for modeling (mix numeric + categorical)
# is_numeric_dtype covers every int/float width, not only int64/float64
numeric_features = [c for c in feature_candidates if c in df_model.columns and pd.api.types.is_numeric_dtype(df_model[c])]
categorical_features = [col_room] if (col_room and col_room in df_model.columns) else []
print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)

X = df_model[numeric_features + categorical_features].copy()
y = df_model['log_price']  # model on log target

# Missing numeric values are imputed inside the pipeline (SimpleImputer),
# so the medians are learned from training data only and never from the holdout rows

# Simple train-test split (holdout)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

# Preprocessor: scale numeric, one-hot categorical
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Models to compare
models = {
    'RandomForest': RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1),
    'ExtraTrees': ExtraTreesRegressor(n_estimators=200, random_state=42, n_jobs=-1),
    'GradientBoost': GradientBoostingRegressor(n_estimators=200, random_state=42),
}

# XGBoost optionally
try:
    import xgboost as xgb
    models['XGBoost'] = xgb.XGBRegressor(n_estimators=200, random_state=42, verbosity=0, n_jobs=-1, objective='reg:squarederror')
except Exception as e:
    print('xgboost not available:', e)

# Evaluate each model with 5-fold cross-validation on the training set
results = []
for name, model in models.items():
    pipe = Pipeline(steps=[('pre', preprocessor), ('model', model)])
    # scoring returns negative RMSE (in log-price units); flip the sign of the mean, std is already non-negative
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)
    results.append((name, -scores.mean(), scores.std()))
    print(f'{name} CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}')

# Fit each model on full train and evaluate on test
test_results = []
for name, model in models.items():
    pipe = Pipeline(steps=[('pre', preprocessor), ('model', model)])
    pipe.fit(X_train, y_train)
    y_pred_log = pipe.predict(X_test)
    # invert log transform
    y_pred = np.expm1(y_pred_log)
    y_true = np.expm1(y_test)
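    # take the square root explicitly; `squared=False` was deprecated and later removed in newer scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))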
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    test_results.append((name, rmse, mae, r2))
    # save model file
    joblib.dump(pipe, f'/content/{name}_pipeline.joblib')
    print(f'{name} Test RMSE: {rmse:.2f}, MAE: {mae:.2f}, R2: {r2:.4f} -- saved to /content/{name}_pipeline.joblib')

# Present test results as dataframe
res_df = pd.DataFrame(test_results, columns=['model','RMSE','MAE','R2']).sort_values('RMSE')
res_df

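The loop above trains on `log1p(price)` and inverts predictions with `expm1` before scoring, so the reported RMSE and MAE are in price units rather than log units. A minimal pure-Python sketch of that round trip, using made-up prices and a pretend model that is always 0.1 too high in log space, shows why the two scales can tell different stories:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

prices = [100.0, 200.0, 400.0]                  # made-up nightly prices
log_targets = [math.log1p(p) for p in prices]   # what the model is trained on
log_preds = [t + 0.1 for t in log_targets]      # pretend every prediction is 0.1 too high
preds = [math.expm1(v) for v in log_preds]      # back to price units

abs_errors = [p_hat - p for p_hat, p in zip(preds, prices)]
print(round(rmse(log_targets, log_preds), 4))   # 0.1 in log units
print([round(e, 2) for e in abs_errors])        # error in price units grows with the price level
print(round(rmse(prices, preds), 2))
```

A constant error in log space is a constant percentage error in price space, so expensive listings dominate the price-scale RMSE. That is also why a ranking by log-scale CV RMSE can differ from a ranking by price-scale test RMSE.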

In [None]:

# ----------------------
# Hyperparameter tuning with RandomizedSearchCV on the best model (lowest CV RMSE)
# ----------------------
# Select on cross-validated RMSE rather than on res_df, so the holdout set is not used for model choice
best_model_name = min(results, key=lambda r: r[1])[0]
print('Best model to tune:', best_model_name)
if best_model_name == 'RandomForest':
    param_dist = {
        'model__n_estimators': [100,200,400],
        'model__max_depth': [None, 10, 20, 30],
        'model__min_samples_split': [2,5,10],
        'model__min_samples_leaf': [1,2,4]
    }
elif best_model_name == 'XGBoost':
    param_dist = {
        'model__n_estimators': [100,200,400],
        'model__learning_rate': [0.01,0.05,0.1],
        'model__max_depth': [3,6,10],
        'model__subsample': [0.6,0.8,1.0]
    }
else:
    # generic for tree models
    param_dist = {
        'model__n_estimators': [100,200,400],
        'model__max_depth': [None, 6, 10, 20],
        'model__min_samples_split': [2,5,10]
    }

from sklearn.model_selection import RandomizedSearchCV
# load the pipeline for best model
best_pipe = joblib.load(f'/content/{best_model_name}_pipeline.joblib')
rs = RandomizedSearchCV(best_pipe, param_distributions=param_dist, n_iter=20, cv=3, scoring='neg_root_mean_squared_error', n_jobs=-1, random_state=42, verbose=1)
rs.fit(X_train, y_train)
print('Best params:', rs.best_params_)
print('Best CV score (neg RMSE):', rs.best_score_)
# evaluate on test
y_pred_log = rs.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_true = np.expm1(y_test)
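# take the square root explicitly; `squared=False` was deprecated and later removed in newer scikit-learn releases
print('Tuned Test RMSE:', np.sqrt(mean_squared_error(y_true, y_pred)))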
joblib.dump(rs.best_estimator_, f'/content/{best_model_name}_tuned_pipeline.joblib')
print('Saved tuned model to', f'/content/{best_model_name}_tuned_pipeline.joblib')

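`RandomizedSearchCV` evaluates only `n_iter` sampled combinations instead of the full grid. To get a feel for the coverage, the sketch below enumerates the RandomForest grid from the cell above with `itertools.product` and draws 20 combinations without replacement. It illustrates the idea only: it is not scikit-learn's sampler and will not reproduce the exact combinations the search picks.

```python
import itertools
import random

param_dist = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [None, 10, 20, 30],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
}

# Every possible combination of the listed values
keys = list(param_dist)
grid = [dict(zip(keys, combo)) for combo in itertools.product(*param_dist.values())]

# Draw 20 distinct combinations, mirroring n_iter=20
rng = random.Random(42)
sampled = rng.sample(grid, k=20)

print(len(grid))  # 108
print(f"{len(sampled) / len(grid):.1%} of the grid is evaluated")
```

With `cv=3`, the 20 sampled candidates cost 60 model fits, compared with 324 for an exhaustive search over the same grid.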

In [None]:

# ----------------------
# SHAP explainability for the tuned model (or best estimator)
# ----------------------
try:
    import shap
    tuned_path = f'/content/{best_model_name}_tuned_pipeline.joblib'
    if os.path.exists(tuned_path):
        tuned = joblib.load(tuned_path)
    else:
        tuned = joblib.load(f'/content/{best_model_name}_pipeline.joblib')
    pre = tuned.named_steps['pre']
    model = tuned.named_steps['model']
    # shap.Explainer selects a suitable algorithm for the model (a tree explainer for these tree ensembles)
    explainer = shap.Explainer(model)
    # sample a small subset for speed; densify it if the preprocessor returned a sparse matrix
    sample = X_train.sample(min(100, len(X_train)), random_state=42)
    X_sample = pre.transform(sample)
    if hasattr(X_sample, 'toarray'):
        X_sample = X_sample.toarray()
    # names must cover numeric AND one-hot columns, in the order the preprocessor emits them
    feature_names = list(pre.get_feature_names_out())
    shap_values = explainer(X_sample)
    print('Plotting SHAP summary (may open in notebook)')
    shap.summary_plot(shap_values.values, features=X_sample, feature_names=feature_names)
except Exception as e:
    print('SHAP step skipped or failed:', e)


In [None]:

# ----------------------
# Helper: load final model and helper predict function
# ----------------------
# Use tuned model if exists, else best model
final_path = None
if os.path.exists(f'/content/{best_model_name}_tuned_pipeline.joblib'):
    final_path = f'/content/{best_model_name}_tuned_pipeline.joblib'
else:
    final_path = f'/content/{best_model_name}_pipeline.joblib'
final_model = joblib.load(final_path)
print('Final model loaded from', final_path)

def predict_price(row_dict):
    # row_dict should contain keys for numeric_features + categorical_features
    x = pd.DataFrame([row_dict])
    # ensure numeric features present
    for f in numeric_features:
        if f not in x.columns: x[f] = np.nan
    for f in categorical_features:
        if f not in x.columns: x[f] = None
    x[numeric_features] = x[numeric_features].fillna(df_model[numeric_features].median())
    pred_log = final_model.predict(x)[0]
    return float(np.expm1(pred_log))

# Example usage (modify values):
example = {f: float(df_model[f].median()) if f in numeric_features else (df_model[f].mode()[0] if f in categorical_features else None) for f in numeric_features + categorical_features}
print('Example input:', example)
print('Predicted price for example (USD):', predict_price(example))

# Export processed CSV for dashboards
processed_path = '/content/airbnb_processed_for_dashboard.csv'
df_model.to_csv(processed_path, index=False)
print('Saved processed dataset to', processed_path)



## Deliverables included
- `*_pipeline.joblib` files for each candidate model (in /content)  
- `*_tuned_pipeline.joblib` for the tuned best model (if tuning ran)  
- `airbnb_processed_for_dashboard.csv` — exported processed dataset for Power BI/Tableau  
- Notebook (`.ipynb`) — this file (downloadable)  

### Possible follow-up deliverables:
- Automated **Project PPT** (slides) and **PDF documentation** generated from notebook results.
- Readme and `requirements.txt` for GitHub repo.
- A short script (`present_me.txt`) with bullet points for your interview presentation.
