# NYC Airbnb Analysis

**Dataset:** `1730285881-Airbnb_Open_Data.csv`. The notebook loads it from `/mnt/data/1730285881-Airbnb_Open_Data.csv`; if your copy lives elsewhere, update `DATA_PATH` in the first code cell.

## Problem Statement
Airbnb has transformed the lodging industry by offering diverse and affordable accommodations. This project analyzes NYC Airbnb data to identify popular property types, neighborhood and pricing trends, and relationships among location, host behavior, reviews, and availability.

## Project Description
This notebook performs data cleaning, exploratory data analysis, visualization, and basic statistical modeling to answer research questions about room type popularity, neighborhood distributions, price drivers, top hosts, and host-availability patterns.

## End Users
- **Travelers** – Find affordable and suitable lodging options.
- **Hosts** – Optimize pricing, availability, and customer satisfaction.
- **Policy Makers** – Understand market dynamics for regulation and planning.
- **Researchers & Analysts** – Study sharing-economy and urban tourism trends.
- **Airbnb Platform** – Improve pricing and user experience.

## Technology Used
- Python, Pandas, NumPy
- Matplotlib, Seaborn
- scikit-learn
- Jupyter Notebook

---


In [None]:
# Imports and load dataset
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

plt.rcParams['figure.figsize'] = (10,6)
sns.set(style='whitegrid')

DATA_PATH = '/mnt/data/1730285881-Airbnb_Open_Data.csv'  # path to uploaded CSV
OUTPUT_PLOTS_DIR = 'plots'
os.makedirs(OUTPUT_PLOTS_DIR, exist_ok=True)

# Load CSV
df = pd.read_csv(DATA_PATH)
print('Loaded:', DATA_PATH)
print('Shape:', df.shape)
df.head()

In [None]:
# Basic cleaning & feature engineering
data = df.copy()

# Standardize column names (strip whitespace, lower-case, spaces to underscores)
data.columns = [c.strip().lower().replace(' ', '_') for c in data.columns]

# Convert common columns if present
if 'last_review' in data.columns:
    data['last_review'] = pd.to_datetime(data['last_review'], errors='coerce')

if 'reviews_per_month' in data.columns:
    data['reviews_per_month'] = data['reviews_per_month'].fillna(0)

if 'neighbourhood_group' in data.columns:
    data['neighbourhood_group'] = data['neighbourhood_group'].fillna('Unknown')

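# Ensure numeric price (strip currency symbols and thousands separators first)
if 'price' in data.columns:
    price_text = data['price'].astype(str).str.replace(r'[$,\s]', '', regex=True)
    data['price'] = pd.to_numeric(price_text, errors='coerce')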

# Save a full copy and a trimmed copy (drop missing or out-of-range prices)
data_full = data.copy()
if 'price' in data.columns:
    price_mask = (data['price'] >= 1) & (data['price'] <= 2000)
    data = data.loc[price_mask].copy()
    print('Filtered rows (1 <= price <= 2000):', data.shape[0], '/', data_full.shape[0])
else:
    print('Warning: "price" column not found. Some analyses will be skipped.')
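
Every analysis cell below checks for its columns and skips quietly when one is absent, so a header mismatch can hide most of the notebook's output. A minimal sketch of an up-front audit follows. The helper name `audit_columns`, the `EXPECTED_COLUMNS` list, and the toy headers are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical list of the columns the later cells look for.
EXPECTED_COLUMNS = [
    'id', 'host_id', 'host_name', 'neighbourhood_group', 'room_type',
    'price', 'latitude', 'longitude', 'number_of_reviews',
    'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
]

def audit_columns(frame, expected=EXPECTED_COLUMNS):
    """Return (present, missing) lists of expected column names."""
    present = [c for c in expected if c in frame.columns]
    missing = [c for c in expected if c not in frame.columns]
    return present, missing

# Toy frame standing in for the real CSV; these headers are made up.
toy = pd.DataFrame(columns=['id', 'host id', 'price', 'room_type'])
present, missing = audit_columns(toy)
print('Present:', present)
print('Missing:', missing)
```

Running a check like this right after cleaning shows which questions will be skipped before any plots are produced.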

In [None]:
# Q1: Most popular room types
if 'room_type' in data.columns:
    room_counts = data['room_type'].value_counts()
    display(room_counts)
    plt.figure()
    room_counts.plot(kind='barh')
    plt.title('Room type counts')
    plt.xlabel('Number of listings')
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'room_type_counts.png'))
    plt.show()
else:
    print('room_type column not found.')
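
Raw counts answer which room type is most common, but shares are easier to compare across datasets of different sizes. A small sketch, using toy data with illustrative labels rather than values from the CSV, puts both in one table via `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy listings; the room-type labels are illustrative, not taken from the CSV.
toy = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 5 + ['Private room'] * 4 + ['Shared room'],
})

counts = toy['room_type'].value_counts()
shares = toy['room_type'].value_counts(normalize=True)

# Both Series share the same index, so they align row by row.
room_summary = pd.DataFrame({
    'listings': counts,
    'share_pct': (shares * 100).round(1),
})
print(room_summary)
```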

In [None]:
# Q2: Which neighbourhood group has the highest number of listings?
if 'neighbourhood_group' in data.columns:
    nbhd_counts = data['neighbourhood_group'].value_counts()
    display(nbhd_counts)
    plt.figure()
    nbhd_counts.plot(kind='bar')
    plt.title('Listings per Neighbourhood Group')
    plt.ylabel('Number of listings')
    plt.xlabel('Neighbourhood Group')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'listings_per_neighbourhood_group.png'))
    plt.show()
else:
    print('neighbourhood_group column not found.')

In [None]:
# Q3: Which neighbourhood group has the highest average prices?
if 'neighbourhood_group' in data.columns and 'price' in data.columns:
    avg_price_by_group = data.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)
    display(avg_price_by_group)
    plt.figure()
    avg_price_by_group.plot(kind='bar')
    plt.title('Average Price by Neighbourhood Group')
    plt.ylabel('Average Price (USD)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'avg_price_by_neighbourhood_group.png'))
    plt.show()
    # Also show median
    median_price_by_group = data.groupby('neighbourhood_group')['price'].median().sort_values(ascending=False)
    print('\nMedian price by group:')
    display(median_price_by_group)
else:
    print('Required columns missing for average price analysis.')
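
The cell above reports mean and median separately. As a sketch, assuming the same `neighbourhood_group` and `price` columns, one `groupby(...).agg(...)` call can return both together with the listing count, and the gap between them gives a rough indication of how skewed each group is. The toy prices below are invented for illustration.

```python
import pandas as pd

# Toy data: one very expensive listing pulls the Manhattan mean upward.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 4 + ['Brooklyn'] * 4,
    'price': [100, 120, 140, 1640, 80, 90, 100, 110],
})

price_summary = (
    toy.groupby('neighbourhood_group')['price']
       .agg(['count', 'mean', 'median'])
       .sort_values('median', ascending=False)
)
# A large positive gap suggests a right-skewed price distribution in that group.
price_summary['mean_minus_median'] = price_summary['mean'] - price_summary['median']
print(price_summary)
```

In this toy frame the Manhattan mean is 500 while its median is 130, which is why the median is often the safer headline figure for prices.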

In [None]:
# Q4: Correlation between location/property features and price
numeric_cols = []
for c in ['price','latitude','longitude','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']:
    if c in data.columns:
        numeric_cols.append(c)

if 'price' in data.columns and len(numeric_cols) > 1:
    cor_df = data[numeric_cols].corr(method='pearson')
    print('\nPearson correlations with price:')
    display(cor_df['price'].sort_values(ascending=False))
    print('\nSpearman correlations with price:')
    display(data[numeric_cols].corr(method='spearman')['price'].sort_values(ascending=False))

    # Simple scatter plots for lat/lon vs price
    if 'latitude' in data.columns and 'longitude' in data.columns:
        plt.figure()
        plt.scatter(data['longitude'], data['price'], alpha=0.3, s=10)
        plt.xlabel('Longitude')
        plt.ylabel('Price')
        plt.title('Price vs Longitude')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'price_vs_longitude.png'))
        plt.show()

        plt.figure()
        plt.scatter(data['latitude'], data['price'], alpha=0.3, s=10)
        plt.xlabel('Latitude')
        plt.ylabel('Price')
        plt.title('Price vs Latitude')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'price_vs_latitude.png'))
        plt.show()

    # Linear regression using numeric features
    X = data[[c for c in numeric_cols if c != 'price']].fillna(0)
    y = data['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    reg = LinearRegression().fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    print('\nLinear model R^2 on test set:', r2_score(y_test, y_pred))
    coef_df = pd.Series(reg.coef_, index=X.columns).sort_values(key=abs, ascending=False)
    print('Top coefficients (by absolute value):')
    display(coef_df.head(10))
else:
    print('Not enough numeric columns for correlation analysis.')
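
Prices tend to be right-skewed, and a linear model fitted to raw price can be dominated by a few expensive listings. One common variation, shown here only as a sketch on synthetic data, is to fit on `log1p(price)` and back-transform predictions with `expm1`. The data-generating process below is an assumption chosen to make the effect visible; it is not derived from the Airbnb CSV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one feature with a multiplicative effect on price.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
price = np.exp(4.0 + 0.8 * x[:, 0] + rng.normal(scale=0.3, size=2000))

X_train, X_test, p_train, p_test = train_test_split(
    x, price, test_size=0.25, random_state=42
)

# Model 1: fit and score on the raw price scale.
raw_model = LinearRegression().fit(X_train, p_train)
r2_raw = r2_score(p_test, raw_model.predict(X_test))

# Model 2: fit and score on the log1p(price) scale.
log_model = LinearRegression().fit(X_train, np.log1p(p_train))
log_pred = log_model.predict(X_test)
r2_log = r2_score(np.log1p(p_test), log_pred)

# Back-transform to dollars when predicted prices are needed.
pred_price = np.expm1(log_pred)

print('R^2 on raw price:  ', round(r2_raw, 3))
print('R^2 on log1p price:', round(r2_log, 3))
```

The two R^2 values are measured on different target scales, so they are not strictly comparable. The comparison only illustrates that a log target can suit multiplicative price effects better.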

In [None]:
# Q5: Top 10 hosts by hosted listing count
if {'host_id', 'host_name'}.issubset(data_full.columns):
    # dropna=False keeps hosts whose host_name is missing
    host_counts = data_full.groupby(['host_id','host_name'], dropna=False).size().reset_index(name='n_listings')
    top10_hosts = host_counts.sort_values('n_listings', ascending=False).head(10)
    display(top10_hosts)
else:
    print('host_id/host_name columns not found.')

In [None]:
# Q6: Relationship between host listing frequency and reviews
if 'host_id' in data_full.columns:
    df_hosts = data_full.copy()
    host_metrics = df_hosts.groupby('host_id').agg(
        host_name=('host_name','first'),
        total_listings=('id','count'),
        avg_reviews_per_listing=('number_of_reviews','mean'),
        avg_reviews_per_month=('reviews_per_month','mean'),
        avg_price=('price','mean'),
        avg_availability=('availability_365','mean')
    ).reset_index()

    display(host_metrics.describe().T)

    # Correlations
    corr_pearson = host_metrics['total_listings'].corr(host_metrics['avg_reviews_per_listing'], method='pearson')
    corr_spearman = host_metrics['total_listings'].corr(host_metrics['avg_reviews_per_listing'], method='spearman')
    print('Pearson corr (total_listings vs avg_reviews_per_listing):', corr_pearson)
    print('Spearman corr (total_listings vs avg_reviews_per_listing):', corr_spearman)

    # Scatter
    plt.figure()
    plt.scatter(host_metrics['total_listings'], host_metrics['avg_reviews_per_listing'], alpha=0.5, s=10)
    plt.xscale('symlog')
    plt.xlabel('Total listings (host)')
    plt.ylabel('Avg number_of_reviews per listing')
    plt.title('Host total listings vs avg reviews per listing')
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'host_listings_vs_avg_reviews.png'))
    plt.show()
else:
    print('host_id not present in dataset.')
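
The cell above reports both Pearson and Spearman coefficients. A short sketch on toy data shows why they can disagree: Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship. That difference matters when one variable, such as listings per host, is heavy-tailed. The series below are synthetic and chosen only to illustrate the gap.

```python
import numpy as np
import pandas as pd

# Toy monotonic but strongly non-linear relationship.
x = pd.Series(np.arange(1, 51), dtype=float)
y = np.exp(x / 5.0)

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')

print('Pearson: ', round(pearson, 3))
print('Spearman:', round(spearman, 3))
```

When the two coefficients differ sharply on the real data, a rank-based reading or a log-scaled axis (as in the scatter plot above) is usually the more informative view.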

In [None]:
# Q7: Price vs service fee (if present)
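if 'service_fee' in data.columns:
    data_sample = data.sample(min(len(data), 5000), random_state=1)
    # Coerce the fee to numeric in case it was read as text (e.g. a string such as '$193')
    fee_text = data_sample['service_fee'].astype(str).str.replace(r'[$,\s]', '', regex=True)
    data_sample['service_fee'] = pd.to_numeric(fee_text, errors='coerce')
    corr = data_sample['price'].corr(data_sample['service_fee'])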
    print('Correlation (price, service_fee):', corr)
    plt.figure()
    plt.scatter(data_sample['price'], data_sample['service_fee'], alpha=0.4, s=10)
    plt.xlabel('Price')
    plt.ylabel('Service Fee')
    plt.title('Price vs Service Fee (sample)')
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'price_vs_service_fee.png'))
    plt.show()
else:
    print('No service_fee column in dataset. If you have one, add it and re-run.')

In [None]:
# Q8: Average listings per host and variation by neighbourhood group and room type
if 'host_id' in data_full.columns:
    avg_listings_per_host = data_full.groupby('host_id').size().mean()
    median_listings_per_host = data_full.groupby('host_id').size().median()
    print('Average listings per host (mean):', avg_listings_per_host)
    print('Median listings per host:', median_listings_per_host)

    # By neighbourhood group (hosts may appear in multiple groups)
    if 'neighbourhood_group' in data_full.columns:
        group_avg = data_full.groupby(['neighbourhood_group','host_id']).size().reset_index(name='listings_in_group')
        group_avg2 = group_avg.groupby('neighbourhood_group')['listings_in_group'].mean().sort_values(ascending=False)
        print('\nAverage listings per host within each neighbourhood_group:')
        display(group_avg2)

    # By room type
    if 'room_type' in data_full.columns:
        room_host_stats = data_full.groupby(['room_type','host_id']).size().reset_index(name='listings_per_host_room')
        room_avg = room_host_stats.groupby('room_type')['listings_per_host_room'].mean().sort_values(ascending=False)
        print('\nAverage listings per host by room type:')
        display(room_avg)
else:
    print('host_id not present; cannot compute host averages.')

In [None]:
# Q9: Availability vs hosted listing count
if 'calculated_host_listings_count' in data.columns and 'availability_365' in data.columns:
    corr_listings_avail = data['calculated_host_listings_count'].corr(data['availability_365'])
    print('Listing-level Pearson correlation (calculated_host_listings_count vs availability_365):', corr_listings_avail)

    # Host-level aggregation
    if 'host_id' in data.columns:
        host_avail = data.groupby('host_id').agg(
            host_listings_count=('calculated_host_listings_count','first'),
            avg_availability=('availability_365','mean'),
            median_availability=('availability_365','median')
        ).reset_index()
        host_corr = host_avail['host_listings_count'].corr(host_avail['avg_availability'])
        print('Host-level Pearson correlation (host_listings_count vs avg_availability):', host_corr)

        plt.figure()
        plt.scatter(host_avail['host_listings_count'], host_avail['avg_availability'], alpha=0.5, s=10)
        plt.xscale('symlog')
        plt.xlabel('Host listings count')
        plt.ylabel('Average availability (days per year)')
        plt.title('Host listings count vs avg availability')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_PLOTS_DIR, 'host_listings_vs_availability.png'))
        plt.show()
else:
    print('Required columns not present for availability analysis.')

In [None]:
# Save outputs
try:
    if 'host_metrics' in globals():
        host_metrics.to_csv('host_metrics_summary.csv', index=False)
        print('Saved host_metrics_summary.csv')
except Exception as e:
    print('Could not save host_metrics_summary.csv:', e)

try:
    data.head(100).to_csv('cleaned_sample.csv', index=False)
    print('Saved cleaned_sample.csv')
except Exception as e:
    print('Could not save cleaned_sample.csv:', e)

## Conclusion

- **Room types:** Entire homes/apartments and private rooms are typically the most common listing types.
- **Neighbourhoods & price:** Central areas (e.g., Manhattan) usually have higher average prices; medians are less sensitive than means to the right-skewed price distribution.
- **Price drivers:** Simple numeric features (latitude/longitude, reviews, availability) explain only a small portion of price variance.
- **Hosts & reviews:** A few high-volume hosts control many listings; having more listings does not necessarily mean more reviews per listing.
- **Availability:** A weak-to-moderate relationship may exist between host listing counts and availability, but it varies with host behavior.

**Next steps:** parse amenities, run sentiment analysis on review text, cluster listings geospatially, and try tree-based models for price prediction.
