# Group AECA - "Airbnb in NYC: Market Trends & Impact"

## 1. Introduction 

### 1.1 Dataset Description
The Inside Airbnb dataset is a comprehensive collection of Airbnb listings in many major cities around the world. For our project, we focus on the New York City dataset, which includes 37,784 records, with each row representing an individual listing. The dataset has 75 columns that cover essential details such as listing name, host identity verification, neighbourhood, room type, price per night, and availability. It also includes review metrics such as the number of reviews, last review date, and ratings, which provide insight into customer experiences. This dataset is useful for analyzing Airbnb market dynamics, host behaviours, pricing strategies, and customer preferences in New York City.

### 1.2 Data Source
The dataset is publicly available on [Inside Airbnb](https://insideairbnb.com/get-the-data/) under a Creative Commons Attribution 4.0 International License. The data is scraped from publicly available information on the Airbnb site. Afterwards, it is analyzed, cleansed, and aggregated by the collaborators of the project. The dataset is updated quarterly and includes information for the last twelve months. The version that we are using for our project was last updated on March 1, 2025. 


### 1.3 Team Members 

**Carol Zhang**: I am a fifth-year Commerce student with a minor in Data Science. I am interested in this dataset because Airbnb is a major player in the travel industry. While travelling on exchange, I came to realize how important guest satisfaction is and how heavily other guests' reviews influenced my choice of accommodation. I want to better understand the factors that lead to positive guest experiences and make wiser choices the next time I travel.
  
**Ayuho Negishi**: I am a fifth-year Psychology student minoring in Data Science. My academic background has made me very interested in understanding human behaviour. I am especially curious how host strategies on Airbnb, like being a verified host, having multiple listings, or offering instant booking, influence guest satisfaction, booking frequency, and pricing. I want to find patterns in how these host behaviours affect the success of listings and the overall competitiveness in the Airbnb market.

**Erhan Asad Javed**: I'm a fourth-year Mathematics student with a minor in Data Science. This dataset is particularly interesting to me because it allows me to apply my practical skills in data cleaning, organization, and visualization to explore patterns in short-term rentals and urban housing dynamics. My goal is to uncover underlying trends in Airbnb pricing, host behaviour, guest satisfaction, and location-based factors affecting listings in New York City. Through this analysis, I aim to provide insights that can help travellers make informed decisions, hosts optimize their strategies, and policymakers better understand Airbnb’s impact on the housing market.

**Aaron Ma**: I’m a third-year Computer Science student. This dataset is particularly interesting to me because Airbnb is so prevalent in a very common activity: travel. As someone who relies heavily on public reviews, I would like to see how hosts can use the tools available to them, such as being verified and responding quickly, to gain an advantage in such a competitive space. I would like to better understand how Airbnb hosts gain advantages over one another through means that are less obvious than price.

### 1.4 Intended Audience 
The intended audience for this project is Airbnb hosts and short-term rental operators. By analyzing location, pricing strategies, host behaviour, and guest satisfaction factors, this project aims to provide hosts with valuable insights into how they can optimize their listings for better performance. Hosts will learn how elements such as neighbourhood, room type, cancellation policies, and review ratings influence pricing and booking frequency. This information will help them refine their strategies, improve guest experiences, and increase bookings to stay competitive in a dynamic market.


## 2. About the Data 

### 2.1 Data Abstraction 

| Attribute Name                      | Attribute Type               | Data Semantics                                        | Cardinality                          |
|-------------------------------------|------------------------------|------------------------------------------------------|--------------------------------------|
| id                                  | Nominal                      | Unique identifier for each listing                   | 37,784                               |
| listing_url                         | Nominal                      | URL for listing on Airbnb                            | 37,784                               |
| scrape_id                           | Nominal                      | Unique identifier for data collection/scrape session | 1                                    |
| last_scraped                        | Temporal                     | Date of last data collection/scrape session          | 1                                    |
| source                              | Nominal                      | Where the data was sourced from                      | 2                                    |
| name                                | Nominal                      | Name of the listing                                  | 36,057                               |
| description                         | Nominal                      | Description of the listing                           | 31,144                               |
| neighborhood_overview               | Nominal                      | Description of the neighbourhood                     | 15,119                               |
| picture_url                         | Nominal                      | URL link to pictures of the property                 | 36,983                               |
| host_id                             | Nominal                      | Unique identifier for each host                      | 22,323                               |
| host_url                            | Nominal                      | URL link to the host’s profile                       | 22,323                               |
| host_name                           | Nominal                      | Name of the host                                     | 8,495                                |
| host_since                          | Temporal                     | Date when the host started listing on Airbnb         | 5,095                                |
| host_location                       | Nominal                      | Location of the host                                 | 987                                  |
| host_about                          | Nominal                      | Host bios                                            | 11,679                               |
| host_response_time                  | Ordinal                      | Time taken by the host to respond                    | 4                                    |
| host_response_rate                  | Quantitative (Continuous)    | The response rate of the host                        | 59                                   |
| host_acceptance_rate                | Quantitative (Continuous)    | Acceptance rate of booking requests by the host      | 100                                  |
| host_is_superhost                   | Binary (Boolean)             | Whether the host is a Superhost                      | 2                                    |
| host_neighbourhood                  | Nominal                      | Neighbourhood of the host's primary location         | 521                                  |
| host_thumbnail_url                  | Nominal                      | URL link to the host’s thumbnail                     | 21,723                               |
| host_picture_url                    | Nominal                      | URL link to the host’s profile picture               | 21,723                               |
| host_listings_count                 | Quantitative (Discrete)      | Number of listings the host has on Airbnb            | 121                                  |
| host_total_listings_count           | Quantitative (Discrete)      | Total number of listings the host has (including other platforms) | 145                                  |
| host_verifications                  | Nominal                      | Types of verification that the host has gone through | 7                                    |
| host_has_profile_pic                | Binary (Boolean)             | Whether the host has a profile pic                   | 2                                    |
| host_identity_verified              | Binary (Boolean)             | Whether the host's identity is verified              | 2                                    |
| neighbourhood                        | Nominal                      | Unclear, no semantic meaning                         | 1                                    |
| neighbourhood_cleansed              | Nominal                      | Cleaned version of the specific neighbourhood location | 223                                  |
| neighbourhood_group_cleansed        | Nominal                      | Cleaned version of the neighbourhood group/area     | 5                                    |
| latitude                            | Quantitative (Continuous)    | Latitude of the listing                              | 23,085                               |
| longitude                           | Quantitative (Continuous)    | Longitude of the listing                             | 20,843                               |
| property_type                       | Nominal                      | Type of the property                                 | 69                                   |
| room_type                           | Nominal                      | Type of the room                                     | 4                                    |
| accommodates                        | Quantitative (Discrete)      | Number of people that the listing can accommodate    | 16                                   |
| bathrooms                           | Quantitative (Discrete)      | Number of bathrooms in the listing                   | 17                                   |
| bathrooms_text                      | Nominal                      | Description of the bathroom type e.g. shared or private bathroom | 31                                   |
| bedrooms                            | Quantitative (Discrete)      | Number of bedrooms in the listing                    | 14                                   |
| beds                                | Quantitative (Discrete)      | Number of beds in the listing                        | 19                                   |
| amenities                           | Nominal                      | List of amenities offered by the listing             | 30,453                               |
| price                               | Quantitative (Continuous)    | Price per night for the listing                      | 897                                  |
| minimum_nights                      | Quantitative (Discrete)      | Minimum number of nights required to book            | 121                                  |
| maximum_nights                      | Quantitative (Discrete)      | Maximum number of nights available for booking       | 255                                  |
| minimum_minimum_nights              | Quantitative (Discrete)      | Minimum value for minimum nights                     | 118                                  |
| maximum_minimum_nights              | Quantitative (Discrete)      | Maximum value for minimum nights                     | 140                                  |
| minimum_maximum_nights              | Quantitative (Discrete)      | Minimum value for maximum nights                     | 241                                  |
| maximum_maximum_nights              | Quantitative (Discrete)      | Maximum value for maximum nights                     | 240                                  |
| minimum_nights_avg_ntm              | Quantitative (Continuous)    | Average minimum number of nights required per month | 429                                  |
| maximum_nights_avg_ntm              | Quantitative (Continuous)    | Average maximum number of nights available per month | 989                                  |
| calendar_updated                    | Nominal                      | Unclear, no semantic meaning                         | 0                                    |
| has_availability                    | Binary (Boolean)             | Whether the listing has availability                 | 2                                    |
| availability_30                      | Quantitative (Discrete)      | Availability over the next 30 days                   | 31                                   |
| availability_60                      | Quantitative (Discrete)      | Availability over the next 60 days                   | 61                                   |
| availability_90                      | Quantitative (Discrete)      | Availability over the next 90 days                   | 91                                   |
| availability_365                     | Quantitative (Discrete)      | Availability over the next 365 days                  | 366                                  |
| calendar_last_scraped               | Temporal                     | Date when the availability information was last scraped from the calendar | 1                                    |
| number_of_reviews                   | Quantitative (Discrete)      | Number of reviews for the listing                    | 492                                  |
| number_of_reviews_ltm               | Quantitative (Discrete)      | Number of reviews in the last twelve months          | 175                                  |
| number_of_reviews_l30d              | Quantitative (Discrete)      | Number of reviews in the last 30 days                | 34                                   |
| first_review                         | Temporal                     | Date of the first review for the listing             | 4,284                                |
| last_review                          | Temporal                     | Date of the most recent review                       | 3,204                                |
| review_scores_rating                | Quantitative (Continuous)    | Average rating score from reviews                    | 163                                  |
| review_scores_accuracy              | Quantitative (Continuous)    | Accuracy rating from reviews                         | 152                                  |
| review_scores_cleanliness           | Quantitative (Continuous)    | Cleanliness rating from reviews                      | 180                                  |
| review_scores_checkin               | Quantitative (Continuous)    | Check-in process rating from reviews                 | 133                                  |
| review_scores_communication         | Quantitative (Continuous)    | Communication rating from reviews                    | 144                                  |
| review_scores_location              | Quantitative (Continuous)    | Location rating from reviews                         | 149                                  |
| review_scores_value                 | Quantitative (Continuous)    | Value for money rating from reviews                  | 166                                  |
| license                             | Nominal                      | Property license if available                        | 1,970                                |
| instant_bookable                    | Binary (Boolean)             | Whether the listing can be booked instantly          | 2                                    |
| calculated_host_listings_count      | Quantitative (Discrete)      | Number of listings listed by the host                | 73                                   |
| calculated_host_listings_count_entire_homes | Quantitative (Discrete)  | Number of entire home listings by the host           | 53                                   |
| calculated_host_listings_count_private_rooms | Quantitative (Discrete) | Number of private room listings hosted by the host  | 42                                   |
| calculated_host_listings_count_shared_rooms  | Quantitative (Discrete)  | Number of shared room listings hosted by the host   | 5                                    |
| reviews_per_month                   | Quantitative (Continuous)    | Average number of reviews per month for the listing  | 801                                  |
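
The cardinality column in the table above can be reproduced by counting the distinct non-null values in each column. The snippet below is a minimal sketch of that idea, not the code we used to build the table: the helper name `cardinality_table` and the small `sample` frame are hypothetical stand-ins for the raw `listings.csv` data.

```python
import pandas as pd


def cardinality_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and count of distinct non-null values."""
    return pd.DataFrame({
        "attribute": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        # nunique(dropna=True) ignores missing values, so an all-empty column counts as 0
        "cardinality": [df[c].nunique(dropna=True) for c in df.columns],
    })


# Hypothetical miniature frame standing in for the raw listings data
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "room_type": ["Private room", "Entire home/apt", "Private room", None],
    "calendar_updated": [None, None, None, None],
})

card = cardinality_table(sample)
print(card)
```

Because missing values are excluded from the count, a column that is entirely empty (as `calendar_updated` appears to be) ends up with a cardinality of 0.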


### 2.2 Exploratory Data Analysis

#### General 
**Dataset Cleaning Workflow:**
1. Drop unnecessary columns
2. Fill NA values with desired fill values
3. Convert datetime columns to pandas datetime format
4. Convert percentage columns to [0,1] scale
5. Clean price column to remove special characters and symbols
6. Impute missing numerical values with median
7. Impute missing categorical values with the mode
8. Convert binary valued columns to boolean

The cleaning workflow is implemented in `code/cleaning_workflows.py`.
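
To make the eight steps listed above concrete, here is a minimal pandas sketch of how each one might look. It is an illustration under stated assumptions, not the contents of `cleaning_workflows.py`: the function name `clean_listings_sketch`, the choice of which columns to drop or impute, and the tiny `sample` frame are all hypothetical.

```python
import pandas as pd


def clean_listings_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the eight cleaning steps (column choices are assumptions)."""
    out = df.copy()

    # 1. Drop columns that are not needed for analysis (hypothetical selection)
    out = out.drop(columns=["listing_url", "scrape_id"], errors="ignore")

    # 2. Fill missing free-text values with a chosen placeholder
    out["description"] = out["description"].fillna("No description available")

    # 3. Convert date strings to pandas datetime
    out["host_since"] = pd.to_datetime(out["host_since"], errors="coerce")

    # 4. Convert percentage strings such as "95%" to the [0, 1] scale
    rate = out["host_response_rate"].str.rstrip("%")
    out["host_response_rate"] = pd.to_numeric(rate, errors="coerce") / 100

    # 5. Strip "$" and "," from price strings such as "$1,250.00"
    price = out["price"].str.replace(r"[$,]", "", regex=True)
    out["price"] = pd.to_numeric(price, errors="coerce")

    # 6. Impute missing numerical values with each column's median
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # 7. Impute missing categorical values with the mode (hypothetical column choice)
    for col in ["host_neighbourhood"]:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # 8. Convert 't'/'f' flags to booleans
    out["instant_bookable"] = out["instant_bookable"].map({"t": True, "f": False})

    return out


# Hypothetical three-row frame imitating the raw column formats
sample = pd.DataFrame({
    "listing_url": ["u1", "u2", "u3"],
    "description": ["Sunny loft", None, "Cozy room"],
    "host_since": ["2010-07-17", "2015-01-02", None],
    "host_response_rate": ["100%", None, "50%"],
    "price": ["$1,250.00", "$80.00", None],
    "host_neighbourhood": ["Harlem", None, "Harlem"],
    "instant_bookable": ["t", "f", "f"],
})

cleaned = clean_listings_sketch(sample)
print(cleaned)
```

The order matters in this sketch: percentages and prices are converted to numbers (steps 4 and 5) before the median imputation in step 6, since a median cannot be computed on the raw strings.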

In [1]:
import os
import ast
import altair as alt
import pandas as pd
from toolz.curried import pipe
import numpy as np
import sys

# # Create a new data transformer that stores the files in a directory
# def json_dir(data, data_dir='altairdata'):
#     os.makedirs(data_dir, exist_ok=True)
#     return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}') )

# # Register and enable the new transformer
# alt.data_transformers.register('json_dir', json_dir)
# alt.data_transformers.enable('json_dir')

# Handle large data sets (default shows only 5000)
# See here: https://altair-viz.github.io/user_guide/data_transformers.html
alt.data_transformers.disable_max_rows()

alt.renderers.enable('jupyterlab')

sys.path.append(os.path.abspath("../../code"))
from cleaning_workflows import prepare_dataset

In [3]:
df = pd.read_csv('../../data/raw/listings.csv', parse_dates=['first_review', 'last_review'])
df_cleaned = prepare_dataset(df)

In [4]:
df_cleaned.head()

Unnamed: 0,name,description,neighborhood_overview,host_id,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,1 br in a 2 br apt (Midtown West),No description available,No overview available,169927,2010-07-17,"Saint-Aubin-sur-Scie, France","Facebook Likes:\r\nNew York French Geek, David...",within a day,1.0,0.88,...,4.98,5.0,4.98,4.86,False,2,1,1,0,0.25
1,A lovely room w/ a Manhattan view,"A private, furnished large room to rent Jan/F...","Nate Silver called this super safe, clean, qui...",110506,2010-04-18,"New York, NY","I grew up in South Korea, moved to Montreal, C...",within a few hours,1.0,0.6,...,4.96,4.96,4.79,4.93,False,1,0,1,0,0.2
2,"Private, Large & Sunny 1BR w/W&D",It's a No Brainer:<br />•Terrific Space For Le...,The Neighborhood<br />• Rich History <br />• B...,170510,2010-07-18,"New York, United States",I am a self employed licensed real estate brok...,No response time,1.0,0.88,...,4.89,4.92,4.38,4.72,False,2,2,0,0,1.93
3,Beautiful Lower East Side Loft,Architect-owned loft is a corner unit in a bea...,"The apartment is in the border of Soho, LES an...",184755,2010-07-29,"New York, NY",I am an architect living in NYC and have my ow...,within a day,1.0,1.0,...,4.85,4.87,4.57,4.62,False,1,1,0,0,0.4
4,@HouseOnHenrySt - Private 2nd bedroom w/shared...,No description available,"Lovely old Brooklyn neighborhood, with brick/b...",11481,2009-03-26,"New York, NY",I have been a host with Airbnb since its intro...,within a day,0.67,0.33,...,4.71,4.73,4.58,4.64,False,4,1,3,0,1.26


In [5]:
df_cleaned.shape

(37784, 57)

In [6]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37784 entries, 0 to 37783
Data columns (total 57 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   name                                          37784 non-null  object        
 1   description                                   37784 non-null  object        
 2   neighborhood_overview                         37784 non-null  object        
 3   host_id                                       37784 non-null  int64         
 4   host_since                                    37784 non-null  datetime64[ns]
 5   host_location                                 37784 non-null  object        
 6   host_about                                    37784 non-null  object        
 7   host_response_time                            37784 non-null  object        
 8   host_response_rate                            37784 non-null  floa

**Comments** 
- There are 37,784 observations and 57 columns after cleaning 
- There are no missing values in the dataset after imputation 
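
The two comments above can be checked programmatically. The following is a small sketch of one way to do so; `summarize_missing` is a hypothetical helper and `sample` is a made-up frame, so the printed numbers here are not results from the actual dataset.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return the count of missing values for each column that has at least one."""
    missing = df.isna().sum()
    return missing[missing > 0].sort_values(ascending=False)


# Hypothetical frame with a few gaps, standing in for a partially cleaned dataset
sample = pd.DataFrame({
    "price": [100.0, None, 80.0],
    "room_type": ["Private room", "Entire home/apt", None],
    "beds": [1, 2, 1],
})

report = summarize_missing(sample)
print(sample.shape)  # (rows, columns)
print(report)        # columns that still contain missing values
```

Applied to `df_cleaned`, an empty result from `summarize_missing` would be consistent with the statement that no missing values remain after imputation, and `df_cleaned.shape` gives the observation and column counts.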

In [10]:
exclude_columns = ['latitude', 'longitude', 'host_id']
df_numerical = df_cleaned.drop(columns=exclude_columns)

pd.set_option('display.max_columns', None)
df_numerical.describe()

Unnamed: 0,host_since,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,37784,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784,37784.0,37784.0,37784.0,37784,37784,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0,37784.0
mean,2017-03-06 09:48:26.373067776,0.952508,0.812859,263.054864,351.429229,2.754896,1.119654,1.321723,1.383946,170.434126,28.882172,57939.5,29.49784,850127.7,12.314075,27.011275,42.416552,163.400963,2025-01-03 00:00:00,25.658639,3.731447,0.294357,2021-02-19 02:42:05.259369216,2023-05-11 06:14:56.379419904,4.763849,4.805924,4.704505,4.870094,4.867555,4.775639,4.676217,71.636354,45.435555,23.985232,0.005187,0.685365
min,2008-08-11 00:00:00,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,8.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2025-01-03 00:00:00,0.0,0.0,0.0,2009-05-25 00:00:00,2011-05-12 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01
25%,2014-07-07 00:00:00,1.0,0.84,1.0,1.0,2.0,1.0,1.0,1.0,113.75,30.0,150.0,30.0,365.0,0.0,0.0,0.0,0.0,2025-01-03 00:00:00,0.0,0.0,0.0,2019-01-21 00:00:00,2023-03-19 00:00:00,4.75,4.8175,4.69,4.89,4.9,4.75,4.67,1.0,0.0,0.0,0.0,0.16
50%,2016-08-02 00:00:00,1.0,0.88,2.0,3.0,2.0,1.0,1.0,1.0,132.0,30.0,365.0,30.0,567.65,2.0,23.0,41.0,155.0,2025-01-03 00:00:00,3.0,0.0,0.0,2022-12-11 00:00:00,2024-10-11 00:00:00,4.85,4.9,4.81,4.95,4.96,4.85,4.76,2.0,1.0,1.0,0.0,0.29
75%,2019-10-29 00:00:00,1.0,0.95,10.0,15.0,4.0,1.0,1.0,1.0,156.0,30.0,1125.0,30.0,1125.0,29.0,58.0,88.0,329.0,2025-01-03 00:00:00,22.0,1.0,0.0,2023-01-01 00:00:00,2024-12-15 00:00:00,4.94,4.97,4.91,5.0,5.0,4.95,4.85,9.0,2.0,2.0,0.0,0.55
max,2024-12-27 00:00:00,1.0,1.0,5079.0,9048.0,16.0,15.5,16.0,42.0,20000.0,1250.0,2147484000.0,1250.0,2147484000.0,30.0,60.0,90.0,365.0,2025-01-03 00:00:00,2485.0,1779.0,137.0,2025-01-02 00:00:00,2025-01-02 00:00:00,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1154.0,1154.0,739.0,4.0,116.3
std,,0.173547,0.232501,1000.361076,1196.815995,1.9039,0.442243,0.869987,0.984719,277.145183,29.90515,11048280.0,30.858242,42521980.0,13.408262,26.656382,39.790892,148.521232,,62.619846,18.798642,1.485928,,,0.376341,0.371005,0.423313,0.317409,0.345344,0.331185,0.413567,224.585038,200.899492,109.136674,0.086848,1.58403


In [11]:
df_cleaned.describe(include=['object'])

Unnamed: 0,name,description,neighborhood_overview,host_location,host_about,host_response_time,host_neighbourhood,host_verifications,neighbourhood_cleansed,neighbourhood_group_cleansed,property_type,room_type,amenities
count,37784,37784,37784,37784,37784,37784,37784,37784,37784,37784,37784,37784,37784
unique,36058,31140,15117,988,11654,5,521,7,223,5,69,4,30453
top,Water View King Bed Hotel Room,No description available,No overview available,"New York, NY","We’re Blueground, a global proptech company wi...",No response time,Bedford-Stuyvesant,"['email', 'phone']",Bedford-Stuyvesant,Manhattan,Entire rental unit,Entire home/apt,"[""Wifi"", ""TV"", ""Smoke alarm"", ""Carbon monoxide..."
freq,30,973,17177,22469,17326,15585,9498,29303,2678,16819,15887,20160,237


#### Erhan 
Detailed EDA can be found in the directory: `analysis/Erhan/analysis.ipynb`. 

#### Carol
Detailed EDA can be found in the directory: `analysis/Carol/analysis2.ipynb`. 

#### Aaron 
Detailed EDA can be found in the directory: `analysis/Erhan/analysis3.ipynb`. 

#### Ayuho 
Detailed EDA can be found in the directory: `analysis/Erhan/analysis3.ipynb`. 