# Group AECA - Airbnb in NYC: Market Trends & Impact

## Introduction

Our project explores the **Airbnb landscape in New York City** using the publicly available *Inside Airbnb* dataset. The dataset contains **37,784 listings**, each representing a unique short-term rental in one of NYC’s five boroughs. Its **75 features**, which range from price, availability, and room type to host verification, reviews, and neighborhood, make it well suited to analyzing the competitive Airbnb market.

We focus on **pricing strategies, location and listing features, host behaviours, and guest satisfaction** to uncover what drives success on Airbnb. By examining how review scores influence pricing, how multiple listings affect visibility, and how location plays into guest preferences, we aim to extract insights that can inform host strategy and guest expectations.

Our work is especially relevant for:
- **Airbnb hosts** looking to refine their listings and pricing models.
- **Short-term rental operators** seeking a competitive edge.
- **Policymakers** examining the platform’s impact on urban housing and tourism dynamics.

By using interactive visualizations, we aim to turn raw data into actionable insights that support smarter decision-making in the short-term rental market.

## Code

### Imports

In [1]:
import os
import ast
import altair as alt
import pandas as pd
from toolz.curried import pipe
import numpy as np
import sys
import geopandas as gpd
import json
from datetime import datetime

# Create a new data transformer that stores the files in a directory
def json_dir(data, data_dir='altairdata'):
    os.makedirs(data_dir, exist_ok=True)
    return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}') )

# Register and enable the new transformer
alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir')

# Handle large data sets (by default Altair raises MaxRowsError above 5,000 rows)
# See here: https://altair-viz.github.io/user_guide/data_transformers.html
alt.data_transformers.disable_max_rows()

alt.renderers.enable('jupyterlab')

sys.path.append(os.path.abspath("../code"))
from cleaning_workflows import prepare_dataset

### Data Cleaning

In [2]:
df = pd.read_csv('../data/raw/listings.csv', parse_dates=['first_review', 'last_review'])

In [3]:
df_cleaned = prepare_dataset(df)
df_cleaned.head()

Unnamed: 0,name,description,neighborhood_overview,host_id,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,1 br in a 2 br apt (Midtown West),No description available,No overview available,169927,2010-07-17,"Saint-Aubin-sur-Scie, France","Facebook Likes:\r\nNew York French Geek, David...",within a day,1.0,0.88,...,4.98,5.0,4.98,4.86,False,2,1,1,0,0.25
1,A lovely room w/ a Manhattan view,"A private, furnished large room to rent Jan/F...","Nate Silver called this super safe, clean, qui...",110506,2010-04-18,"New York, NY","I grew up in South Korea, moved to Montreal, C...",within a few hours,1.0,0.6,...,4.96,4.96,4.79,4.93,False,1,0,1,0,0.2
2,"Private, Large & Sunny 1BR w/W&D",It's a No Brainer:<br />•Terrific Space For Le...,The Neighborhood<br />• Rich History <br />• B...,170510,2010-07-18,"New York, United States",I am a self employed licensed real estate brok...,No response time,1.0,0.88,...,4.89,4.92,4.38,4.72,False,2,2,0,0,1.93
3,Beautiful Lower East Side Loft,Architect-owned loft is a corner unit in a bea...,"The apartment is in the border of Soho, LES an...",184755,2010-07-29,"New York, NY",I am an architect living in NYC and have my ow...,within a day,1.0,1.0,...,4.85,4.87,4.57,4.62,False,1,1,0,0,0.4
4,@HouseOnHenrySt - Private 2nd bedroom w/shared...,No description available,"Lovely old Brooklyn neighborhood, with brick/b...",11481,2009-03-26,"New York, NY",I have been a host with Airbnb since its intro...,within a day,0.67,0.33,...,4.71,4.73,4.58,4.64,False,4,1,3,0,1.26


In [4]:
# Transformation 1: Convert the list of amenities into separate binary columns
# Convert the string representation of lists back into actual lists
df_amenities = df_cleaned['amenities'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Calculate the frequency of each amenity
amenity_counts = df_amenities.explode().value_counts()

# Select the top 25 amenities by frequency
top_25_amenities = amenity_counts.head(25).index.tolist()

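# Build one binary column per top amenity, testing membership in the parsed
# lists (df_amenities) rather than substring matches in the raw strings
amenity_columns = pd.DataFrame(
    [[1 if isinstance(x, list) and amenity in x else 0 for amenity in top_25_amenities] for x in df_amenities],
    columns=top_25_amenities,
    index=df_cleaned.index)

# Concatenate the new amenity columns to df_cleaned (indexes are aligned above)
df_cleaned = pd.concat([df_cleaned, amenity_columns], axis=1)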
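
To make the amenity transformation concrete, here is a minimal, standard-library-only sketch on three made-up listings. The amenity strings and the helper names `parse_amenities` and `encode_top_amenities` are illustrative assumptions, not part of the project code. It parses each stringified list, counts frequencies, and builds 0/1 indicators by list membership, which avoids the substring pitfall where "TV" would also match "HDTV".

```python
import ast
from collections import Counter

# Made-up amenity strings in the same stringified-list format as the dataset.
raw_amenities = [
    '["Wifi", "HDTV", "Kitchen"]',
    '["Wifi", "TV"]',
    '["Kitchen", "Wifi"]',
]

def parse_amenities(raw):
    """Turn a stringified list into a real list; anything else becomes []."""
    return ast.literal_eval(raw) if isinstance(raw, str) else []

def encode_top_amenities(raw_values, top_n):
    """Return the top_n amenity names and one 0/1 row per listing."""
    parsed = [parse_amenities(raw) for raw in raw_values]
    counts = Counter(amenity for amenities in parsed for amenity in amenities)
    top = [name for name, _ in counts.most_common(top_n)]
    rows = [[int(name in amenities) for name in top] for amenities in parsed]
    return top, rows

top, rows = encode_top_amenities(raw_amenities, top_n=2)

# Counting amenities means counting list entries, not characters in the string.
num_amenities = [len(parse_amenities(raw)) for raw in raw_amenities]

print(top)
print(rows)
print(num_amenities)

# Substring test on the raw string vs. membership test on the parsed list.
print("TV" in raw_amenities[0], "TV" in parse_amenities(raw_amenities[0]))
```

The same idea carries over to pandas by iterating over the parsed Series instead of the raw column.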

In [5]:
# Transformation 2: Create a new column counting the number of amenities in each listing
# Count entries in the parsed lists (df_amenities); len() on the raw string counts characters
df_cleaned['num_amenities'] = df_amenities.apply(lambda x: len(x) if isinstance(x, list) else 0)
print(df_cleaned[['amenities', 'num_amenities']].head())


In [6]:
# Function to count the number of words in a string
def word_count(text):
    if isinstance(text, str):
        return len(text.split())
    return 0  # If the text is not a string, return 0

df_cleaned['name_word_count'] = df_cleaned['name'].apply(word_count)
df_cleaned['description_word_count'] = df_cleaned['description'].apply(word_count)
df_cleaned['neighborhood_overview_word_count'] = df_cleaned['neighborhood_overview'].apply(word_count)

print(df_cleaned[['name_word_count', 'description_word_count', 'neighborhood_overview_word_count']].head())

   name_word_count  description_word_count  neighborhood_overview_word_count
0                9                       3                                 3
1                7                      58                               153
2                6                      32                                68
3                5                      88                                21
4                7                       3                                16


### Erhan

In [7]:
gdf = gpd.read_file('../analysis/Erhan/new-york-city-boroughs.geojson')

# Convert any datetime columns to strings (in case)
for col in gdf.columns:
    if pd.api.types.is_datetime64_any_dtype(gdf[col]):
        gdf[col] = gdf[col].astype(str)

# Convert GeoDataFrame to GeoJSON dictionary for Altair
geojson = json.loads(gdf.to_json())

In [8]:
# Erhan Vis

# --- UI CONTROLS ---
price_param = alt.param(
    name='MaxPrice',
    bind=alt.binding_range(min=0, max=1000, step=25, name='Max Price:'),
    value=800
)

# Click-based selection only
borough_select = alt.selection_point(
    fields=['neighbourhood_group_cleansed'],
    name='BoroughSelect'
)

# --- MAP BACKGROUND ---
nyc_map = alt.Chart(alt.Data(values=geojson['features'])).mark_geoshape(
    fill='lightgray',
    stroke='white'
).encode(
    tooltip=alt.Tooltip('properties.name:N', title='Borough')
).project(type='mercator').properties(width=550, height=600, title='🗺️ NYC Borough Map')

# --- MAP POINTS (clickable to select) ---
map_points = alt.Chart(df_cleaned).transform_filter(
    alt.datum.price <= price_param
).mark_circle(size=10).encode(
    longitude=alt.Longitude('longitude:Q'),
    latitude=alt.Latitude('latitude:Q'),
    color=alt.condition(
        borough_select,
        alt.Color('price:Q', scale=alt.Scale(scheme='reds'), title='Price ($)'),
        alt.value('lightgray')
    ),
    opacity=alt.condition(
        borough_select,
        alt.value(0.9),
        alt.value(0.2)
    ),
    tooltip=[
        alt.Tooltip('price:Q', title='Price ($)'),
        alt.Tooltip('room_type:N', title='Room Type'),
        alt.Tooltip('neighbourhood_group_cleansed:N', title='Borough')
    ]
).add_params(price_param, borough_select)

map_view = (nyc_map + map_points).properties(
    title='📍 Airbnb Listings by Location (Filtered by Price & Borough)'
)

# --- HISTOGRAM ---
histogram = alt.Chart(df_cleaned).transform_filter(
    (alt.datum.price <= price_param)
).transform_filter(
    borough_select
).mark_bar(opacity=0.8).encode(
    x=alt.X('price:Q', bin=alt.Bin(maxbins=30), title='Price ($)'),
    y=alt.Y('count()', title='Number of Listings'),
    color=alt.Color('neighbourhood_group_cleansed:N', scale=alt.Scale(scheme='purples'), legend=None),
    tooltip=[
        alt.Tooltip('neighbourhood_group_cleansed:N', title='Borough'),
        alt.Tooltip('count():Q', title='Listings Count')
    ]
).properties(
    width=300,
    height=200,
    title='📊 Price Distribution in Selected Borough'
).add_params(price_param, borough_select)

# --- BAR CHART (clickable) ---
avg_price_chart = alt.Chart(df_cleaned).transform_filter(
    alt.datum.price <= price_param
).transform_aggregate(
    avg_price='mean(price)',
    groupby=['neighbourhood_group_cleansed']
).mark_bar().encode(
    x=alt.X('neighbourhood_group_cleansed:N', title='Borough'),
    y=alt.Y('avg_price:Q', title='Average Price ($)'),
    color=alt.condition(
        borough_select,
        alt.Color('neighbourhood_group_cleansed:N', scale=alt.Scale(scheme='purples'), legend=None),
        alt.value('lightgray')
    ),
    opacity=alt.condition(
        borough_select,
        alt.value(1),
        alt.value(0.3)
    ),
    tooltip=[
        alt.Tooltip('neighbourhood_group_cleansed:N', title='Borough'),
        alt.Tooltip('avg_price:Q', title='Average Price ($)')
    ]
).add_params(price_param, borough_select).properties(
    width=300,
    height=200,
    title='📉 Average Listing Price by Borough'
)

# --- TITLE ---
# One-row frame so the text mark is drawn exactly once
title_text = alt.Chart(pd.DataFrame({'row': [0]})).mark_text(
    align='center',
    baseline='middle',
    fontSize=24,
    #font='Georgia',
    fontWeight='bold'
).encode(
    text=alt.value('📊 Airbnb Price Dashboard — New York City')
).properties(width=900, height=40)

# --- DESCRIPTION ---
# One-row frame so the text mark is drawn exactly once
description = alt.Chart(pd.DataFrame({'row': [0]})).mark_text(
    align='center',
    baseline='top',
    #font='Georgia',
    fontSize=14
).encode(
    text=alt.value("Explore how Airbnb prices vary across NYC neighborhoods through interactive maps, histograms, and comparisons.")
).properties(width=900, height=30)

# --- FINAL VIEW ---
right_col = alt.vconcat(avg_price_chart, histogram)

vis1 = alt.vconcat(
    title_text,
    description,
    alt.hconcat(map_view, right_col).resolve_scale(color='independent')
)

### Carol 

In [9]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning, module="numpy")

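# Function to bin ratings
def bin_ratings(rating):
    # Missing ratings get no bin instead of falling through to "0.0 - 0.5"
    if pd.isna(rating):
        return np.nan
    if rating == 5.0:
        return "5.0"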
    elif rating >= 4.5:
        return "4.5 - 5.0"
    elif rating >= 4.0:
        return "4.0 - 4.5"
    elif rating >= 3.5:
        return "3.5 - 4.0"
    elif rating >= 3.0:
        return "3.0 - 3.5"
    elif rating >= 2.5:
        return "2.5 - 3.0"
    elif rating >= 2.0:
        return "2.0 - 2.5"
    elif rating >= 1.5:
        return "1.5 - 2.0"
    elif rating >= 1.0:
        return "1.0 - 1.5"
    elif rating >= 0.5:
        return "0.5 - 1.0"
    else:
        return "0.0 - 0.5"

# Features to analyze
features = ["price", "accommodates", "beds", "bathrooms", "instant_bookable", "host_is_superhost", 
            "host_response_rate", "number_of_reviews", "reviews_per_month", 
            "availability_365", "num_amenities", "name_word_count", "description_word_count"]

# Ratings columns for dropdown
rating_columns = [
    "review_scores_rating",
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value"]

# Find correlations for each bin and rating type
correlation_data = []
for rating_col in rating_columns:
    df_cleaned[f"{rating_col}_bin"] = df_cleaned[rating_col].apply(bin_ratings)  
    
    for rating_bin, group in df_cleaned.groupby(f"{rating_col}_bin"):
        group = group.dropna(subset=[rating_col] + features) 
        
        for feature in features:
            # Skip features that are constant within the bin (correlation undefined)
            if group[feature].nunique() > 1:
                corr = group[feature].corr(group[rating_col])
                if pd.notna(corr):
                    correlation_data.append({
                        "rating_bin": rating_bin,
                        "feature": feature,
                        "correlation": corr,
                        "rating_type": rating_col
                    })

correlation_df = pd.DataFrame(correlation_data)

# Dropdown for rating selection
dropdown = alt.binding_select(options=rating_columns, name="Rating Type: ")
selection = alt.selection_point(fields=["rating_type"], bind=dropdown, value="review_scores_rating")

# Click selection for feature details
feature_selection = alt.selection_point(fields=["feature"])

# Heatmap 
heatmap = alt.Chart(correlation_df).mark_rect().encode(
    alt.Y("rating_bin:N", title="Guest Rating Bin",
            sort=["5.0", "4.5 - 5.0", "4.0 - 4.5", "3.5 - 4.0", "3.0 - 3.5",
                  "2.5 - 3.0", "2.0 - 2.5", "1.5 - 2.0", "1.0 - 1.5", "0.5 - 1.0", "0.0 - 0.5"]),
    alt.X("feature:N", title="Feature", 
          axis=alt.Axis(labelAngle=-45)),  
    color=alt.Color("correlation:Q", title="Correlation", scale=alt.Scale(domain=[-1, 0, 1], range=["#7d6387", "#ffffff", "#a31a2a"])),  
    tooltip=["feature", "rating_bin", "correlation"]
).add_params(feature_selection).add_params(selection).transform_filter(selection).properties(
    width=500,
    height=300,
    title="Correlation Heatmap of Airbnb Listing Features with Different Rating Types")

# Linked Bar Chart 
detail_chart = alt.Chart(correlation_df).mark_bar().encode(
    y=alt.Y("rating_bin:N", title="Guest Rating Bin", sort="-x"),
    x=alt.X("correlation:Q", title="Correlation Value"),
    color=alt.Color("correlation:Q", scale=alt.Scale(domain=[-1, 0, 1], range=["#7d6387", "#ffffff", "#a31a2a"])),
    tooltip=["feature", "rating_bin", "correlation"]
).transform_filter(feature_selection).transform_filter(selection).properties(
    width=300,
    height=200,
    title="Correlation Values for Selected Feature")

vis2 = alt.hconcat(heatmap, detail_chart).properties(
    title={"text": "Correlation between Airbnb Listing Features and Different Rating Types",
        "fontSize": 18,
        "fontWeight": "bold",
        "anchor": "middle"})
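
As a compact alternative to a chain of `elif` tests, half-point rating bins can be computed arithmetically. The sketch below is an assumption-laden illustration rather than the project's helper: the name `bin_rating` and its `width` and `top` parameters are hypothetical, and it returns `None` for missing ratings so they are not assigned to the lowest bin.

```python
import math

def bin_rating(rating, width=0.5, top=5.0):
    """Return a label such as '4.5 - 5.0', or None for a missing rating."""
    if rating is None or math.isnan(rating):
        return None
    if rating >= top:
        # A perfect score gets its own bin.
        return f"{top:.1f}"
    # Lower edge of the half-point bin that contains the rating.
    lower = math.floor(rating / width) * width
    return f"{lower:.1f} - {lower + width:.1f}"

for value in [5.0, 4.7, 3.0, 0.3, float("nan")]:
    print(value, "->", bin_rating(value))
```

Because every comparison against NaN is false, an `if`/`elif` chain without an explicit missing-value check sends NaN to its final `else` branch; the guard at the top of this sketch makes that case explicit.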

### Aaron

In [10]:
selection = alt.selection_point(name='review_select', fields=['bin_number_of_reviews'])
# Select the column before summing: the first positional argument of
# GroupBy.sum() is numeric_only, not a column name
sort_by_popularity = (
    df_cleaned.groupby('neighbourhood_group_cleansed')['number_of_reviews']
    .sum()
    .sort_values(ascending=False)
)
popular_groups = sort_by_popularity.index.to_list()
multiple_grouped_df = (
    df_cleaned.groupby(['neighbourhood_group_cleansed', 'instant_bookable'], as_index=False)['number_of_reviews']
    .sum()
)

boroughs = ['All'] + sorted(df_cleaned['neighbourhood_group_cleansed'].unique().tolist())
neighbourhood_param = alt.param(
    name='NeighborhoodSelect',
    bind=alt.binding_select(options=boroughs, name='Neighbourhood Group:'),
    value='All'
)

high_fid1 = alt.Chart(multiple_grouped_df).mark_rect().encode(
    x=alt.X('neighbourhood_group_cleansed:N', sort=popular_groups, title= 'Neighbourhood Group', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('instant_bookable:N', title = 'Instant Bookability Status'),
    color=alt.Color('number_of_reviews:Q', scale=alt.Scale(scheme='reds'), title = 'Number of Reviews'),
    tooltip=[
            alt.Tooltip('neighbourhood_group_cleansed', title='Neighbourhood Group'),
            alt.Tooltip('number_of_reviews', title='Number of Reviews'),
            alt.Tooltip('instant_bookable:N', title='Instant Bookability Status')
        ],
    opacity=alt.condition((selection), alt.value(1), alt.value(0.1))
).transform_filter(
    (alt.datum.neighbourhood_group_cleansed == neighbourhood_param) | (neighbourhood_param == 'All')
).transform_bin(
    'bin_number_of_reviews', 
    field='number_of_reviews', 
    bin=alt.Bin(maxbins=10)
).add_params(selection, neighbourhood_param).properties(
    title=alt.Title('Number of Reviews for most popular neighbourhood groups with instant bookability'),
    width=400,
    height=300
)

# Total reviews per neighbourhood, keeping the borough for colouring and filtering
neighbourhood_reviews = (
    df_cleaned
    .groupby(['neighbourhood_cleansed', 'neighbourhood_group_cleansed'], as_index=False)['number_of_reviews']
    .sum()
)

# Keep rows in the selected borough, or every row when 'All' is chosen.
# Each comparison needs its own parentheses because | binds tighter than ==.
filter_condition = (
    (alt.datum.neighbourhood_group_cleansed == neighbourhood_param)
    | (neighbourhood_param == 'All')
)

high_fid_hist1 = (
    alt.Chart(neighbourhood_reviews)
    .transform_filter(filter_condition)
    .transform_window(
        rank='rank()', 
        sort=[alt.SortField('number_of_reviews', order='descending')],
    )
    .transform_filter(alt.datum.rank <= 10)
    .mark_bar()
    .encode(
        y=alt.Y('neighbourhood_cleansed:N', sort='-x', title='Neighbourhood'),
        x=alt.X('number_of_reviews:Q', title='Total Number of Reviews', axis=alt.Axis(tickCount=4)),
        color=alt.Color('neighbourhood_group_cleansed', title='Neighbourhood Group', scale=alt.Scale(scheme='purples')),
        tooltip=[
            alt.Tooltip('neighbourhood_group_cleansed', title='Neighbourhood Group'),
            alt.Tooltip('neighbourhood_cleansed', title='Neighbourhood'), 
            alt.Tooltip('number_of_reviews', title='Number of Reviews')
        ]
    )
    .add_params(neighbourhood_param)
    .properties(
        title="Top 10 Neighbourhoods by Number of Reviews",
        width=400,
        height=400
    )
)


vis3 = (high_fid1 | high_fid_hist1).configure_view(
    strokeWidth=0
).configure_title(
    anchor='middle',
    offset=20
).properties(title=alt.Title('The effect of Instant Bookability on the Total Number of Reviews per Neighbourhood Group', anchor='middle', fontSize=18, fontWeight='bold'), 
             padding={'top': 20, 'bottom': 30, 'left': 30, 'right': 30})
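
The filter, rank, and "rank <= 10" steps in the bar chart can be cross-checked outside Altair. The following pandas sketch uses a made-up four-row table; the column names `neighbourhood` and `borough` and the helper `top_n` are hypothetical stand-ins for the project's data. It assumes no ties at the cutoff, where a rank-based filter and `nlargest` could return different numbers of rows.

```python
import pandas as pd

# Made-up review totals; the real table comes from the grouped listings data.
reviews = pd.DataFrame({
    "neighbourhood": ["A", "B", "C", "D"],
    "borough": ["X", "X", "Y", "Y"],
    "number_of_reviews": [10, 30, 20, 5],
})

def top_n(df, borough="All", n=2):
    """Return the n most-reviewed neighbourhoods, optionally within one borough."""
    subset = df if borough == "All" else df[df["borough"] == borough]
    return subset.nlargest(n, "number_of_reviews")

print(top_n(reviews))               # across all boroughs
print(top_n(reviews, borough="Y"))  # restricted to one borough
```

Filtering before ranking matters: ranking first and filtering afterwards would keep only those neighbourhoods of the chosen borough that also rank in the citywide top n.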

### Ayuho

In [11]:
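# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
df_listings_price_reviews = df_cleaned[["calculated_host_listings_count", "price", "review_scores_rating"]].copy()

# Right-closed intervals (0, 5], (5, 20], ... so that each edge matches its label (5 falls in "1-5")
bins = [0, 5, 20, 100, 500, 1000, float("inf")]
labels = ["1-5", "6-20", "21-100", "101-500", "501-1000", "1001+"]
df_listings_price_reviews["listings_count_bin"] = pd.cut(
    df_listings_price_reviews["calculated_host_listings_count"],
    bins=bins,
    labels=labels,
    right=True
)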

# Define a custom palette running from yellow through red to purple.
custom_palette = ["#fee08b", "#fdae61", "#e34a33", "#b30000", "#8856a7", "#810f7c"]


# Sliders
rating_min_slider = alt.binding_range(
    min=0,
    max=5,
    step=0.1,
    name="Min Guest Rating: "
)
rating_min_param = alt.param(
    name="RatingMin",
    bind=rating_min_slider,
    value=0.0
)

rating_max_slider = alt.binding_range(
    min=0,
    max=5,
    step=0.1,
    name="Max Guest Rating: "
)
rating_max_param = alt.param(
    name="RatingMax",
    bind=rating_max_slider,
    value=5.0
)

# Selection (in Altair 5, `empty` is a boolean; True keeps every bin highlighted when nothing is selected)
selection = alt.selection_point(
    fields=["listings_count_bin"],
    bind="legend",
    empty=True
)

# Binned Bar Chart
bar_graph = (
    alt.Chart(df_listings_price_reviews)
    .transform_filter(
        "(datum.review_scores_rating >= RatingMin) && (datum.review_scores_rating <= RatingMax)"
    )
    .transform_aggregate(
        avg_price="mean(price)",
        groupby=["listings_count_bin"]
    )
    .mark_bar()
    .encode(
        x=alt.X("listings_count_bin:N", sort=labels, title="Number of Listings per Host", axis=alt.Axis(labelAngle=0)),
        y=alt.Y("avg_price:Q", title="Average Price (USD)"),
        color=alt.condition(
            selection,
            alt.Color("listings_count_bin:N", sort=labels, title="Number of Listings", scale=alt.Scale(range=custom_palette)),
            alt.value("lightgray")
        ),
        tooltip=["listings_count_bin:N", "avg_price:Q"]
    )
    .add_params(rating_min_param, rating_max_param, selection)
    .properties(title="Average Price by Number of Listings per Host (Binned)", width=350)
)

# Density Plot
density_plot = (
    alt.Chart(df_listings_price_reviews)
    .transform_density(
        "review_scores_rating",
        as_=["review_scores_rating", "density"],
        extent=[0, 5],
        groupby=["listings_count_bin"]
    )
    .mark_area()
    .encode(
        x=alt.X("review_scores_rating:Q", title="Guest Review Score"),
        y=alt.Y("density:Q", title="Density of Listings"),
        color=alt.Color("listings_count_bin:N", sort=labels, title="Number of Listings", scale=alt.Scale(range=custom_palette)),
        opacity=alt.condition(selection, alt.value(0.7), alt.value(0.05)),
        tooltip=["listings_count_bin:N", "review_scores_rating:Q", "density:Q"]
    )
    .add_params(selection)
    .properties(title="Density of Guest Ratings by Number of Listings per Host (Binned)")
)

vis4 = (bar_graph | density_plot).properties(
    title=alt.TitleParams(
        text="How Host Pricing Strategies Vary by Number of Listings and Impact Guest Satisfaction",
        anchor="middle",
        fontWeight="bold",  
        fontSize=18         
    )
)
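
How `pd.cut` treats bin edges determines which label a boundary value such as 5 receives. The sketch below uses made-up listing counts and a shortened label list (both assumptions for illustration) to compare left-closed intervals (`right=False`) with right-closed intervals (the default).

```python
import pandas as pd

# Made-up host listing counts that sit on and around the bin edges.
counts = pd.Series([1, 4, 5, 6, 20, 21, 100, 101])
labels = ["1-5", "6-20", "21-100", "101+"]

# Left-closed: [1, 5), [5, 20), [20, 100), [100, inf)
left_closed = pd.cut(
    counts, bins=[1, 5, 20, 100, float("inf")], labels=labels, right=False
)

# Right-closed: (0, 5], (5, 20], (20, 100], (100, inf]
right_closed = pd.cut(
    counts, bins=[0, 5, 20, 100, float("inf")], labels=labels
)

comparison = pd.DataFrame({
    "count": counts,
    "right=False": left_closed,
    "right=True": right_closed,
})
print(comparison)
```

With the left-closed edges, a host with exactly 5 listings is labelled "6-20"; with the right-closed edges the same host is labelled "1-5", which is what the label text suggests.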



## Final Visualizations 

### Erhan: Pricing & Affordability Trends

In [12]:
vis1

<VegaLite 5 object>

If you see this message, it means the renderer has not been properly enabled
for the frontend that you are using. For more information, see
https://altair-viz.github.io/user_guide/display_frontends.html#troubleshooting


### Carol: Guest Experience & Satisfaction

In [13]:
vis2

<VegaLite 5 object>



### Aaron: Location & Listing Features, Host Decision-Making, and Booking Preferences

In [14]:
vis3

<VegaLite 5 object>



### Ayuho: Host Behaviour & Market Competitiveness

In [15]:
vis4

<VegaLite 5 object>

