# Airbnb Data Exploration Notebook
_A Fall 2025 Team Project_

## 1. File Format Overview


This project uses multiple file formats, each with tradeoffs:

### ✅ CSV (Comma-Separated Values)
- Human-readable, plain-text format.
- No embedded type information: every field is text until the reader infers a type or is told one.
- Typically larger on disk than Parquet; no built-in compression.

### ✅ TSV (Tab-Separated Values)
- Same as CSV but uses tabs instead of commas.
- Used here for metadata description files.

### ✅ Parquet
- Binary, columnar format optimized for analytics.
- Built-in support for compression (e.g., Snappy).
- Preserves data types and schema.
- Excellent with Spark, DuckDB, and modern data tools.
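
To make the point about missing type information in CSV and TSV concrete, here is a minimal sketch using only Python's standard library. The sample rows are made up for illustration and are not taken from the project data. It shows that a CSV reader returns every field as a string, and that TSV differs only in the delimiter.

```python
import csv
import io

# Hypothetical sample rows, for illustration only.
csv_text = "listing_id,photos_count,superhost\n101,15,True\n102,35,False\n"
tsv_text = csv_text.replace(",", "\t")

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Every value comes back as str; the file itself carries no schema.
print(csv_rows[0])
print({key: type(value).__name__ for key, value in csv_rows[0].items()})

# Same content, different delimiter.
print(csv_rows == tsv_rows)
```

Parquet avoids this step because the schema travels with the file, which is why the later cells can read typed columns without any conversion.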


## 2. Tools for Data Manipulation and Exploration


We will use three main tools:

### ✅ Pandas
- Python-native
- Best for local, small-to-medium datasets
- Simple and expressive syntax

### ✅ PySpark
- Distributed data processing engine (Apache Spark)
- Best for large-scale data (millions of rows)
- Lazy evaluation and optimized execution

### ✅ SQL (via DuckDB or Spark SQL)
- Universal language for querying structured data
- Easily integrates with both Pandas and Spark
- Great for expressing aggregations and joins


## 3. Imports and Setup

In [2]:
import pandas as pd
from pyspark.sql import SparkSession
import duckdb

In [3]:
# Start Spark
spark = SparkSession.builder \
    .appName("AirbnbExploration") \
    .config("spark.driver.host", "localhost") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .master("local[*]") \
    .getOrCreate()


## 4. Load NYC Listings Dataset

In [4]:
# Load parquet using pandas
df_pd = pd.read_parquet("../data/nyc/nyc-listings.parquet")
df_pd.head()

Unnamed: 0,listing_id,listing_name,listing_type,room_type,cover_photo_url,photos_count,host_id,host_name,cohost_ids,cohost_names,...,l90d_occupancy,l90d_adjusted_occupancy,l90d_revpar,l90d_revpar_native,l90d_adjusted_revpar,l90d_adjusted_revpar_native,l90d_reserved_days,l90d_blocked_days,l90d_available_days,l90d_total_days
0,6848,Only 2 stops to Manhattan studio,Entire rental unit,entire_home,https://a0.muscache.com/im/pictures/e4f031a7-f...,15,15991,Allen,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0,81,90,90
1,6990,UES Beautiful Blue Room,Private room in rental unit,private_room,https://a0.muscache.com/im/pictures/86985bff-b...,35,16800,Cynthia,,,...,0.356,0.8,26.3,26.3,59.2,59.2,32,50,58,90
2,7097,"Perfect for Your Parents, With Garden & Patio",Private room in guest suite,private_room,https://a0.muscache.com/im/pictures/107655534/...,29,17571,Jane,,,...,0.944,0.0,214.9,214.9,0.0,0.0,85,0,5,90
3,7801,Sunny Williamsburg Loft with Sauna,Entire place,entire_home,https://a0.muscache.com/im/pictures/miso/Hosti...,16,21207,Chaya,313584.0,Elie,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,90,90
4,8490,"Maison des Sirenes1,bohemian, luminous apartment",Entire loft,entire_home,https://a0.muscache.com/im/pictures/1d0d9773-c...,58,25183,Nathalie,,,...,0.078,0.304,13.1,13.1,51.3,51.3,7,67,83,90


In [5]:
# Load parquet using PySpark
df_spark = spark.read.parquet("../data/nyc/nyc-listings.parquet")
df_spark.show(5)

25/09/25 09:52:01 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+----------+--------------------+--------------------+------------+--------------------+------------+-------+---------+----------+------------+---------+--------+---------+------+--------+----+-----+------------+--------------------+------------+----------+-------------------+--------+------------+---------------+-----------+--------------+---------------+--------------+------------------+--------------------+---------------+------------+-----------+------------------+------------+-------------------+-------------+----------------------+----------+-----------------+-------------------+--------------------------+-----------------+----------------+------------------+--------------+------------+-------------------+-------------+--------------------+--------------+-----------------------+-----------+------------------+--------------------+---------------------------+------------------+-----------------+-------------------+---------------+
|listing_id|        listing_name|        listing_ty

In [6]:
# Load parquet using DuckDB SQL
duckdb.sql("CREATE TABLE listings AS SELECT * FROM '../data/nyc/nyc-listings.parquet'")
duckdb.sql("SELECT * FROM listings LIMIT 5")

┌────────────┬──────────────────────────────────────────────────┬─────────────────────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────┬─────────┬───────────┬────────────┬──────────────┬───────────┬───────────────┬───────────────┬────────┬──────────┬───────┬──────────────┬──────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

## 5. Initial Exploration Questions

### Q1: How many rows and columns are in the dataset?

In [7]:
# Pandas
df_pd.shape

(300, 61)

In [8]:
# PySpark
df_spark.count(), len(df_spark.columns)

(300, 61)

In [12]:
# SQL (DuckDB): count rows and columns
duckdb.sql("""
SELECT 
    (SELECT COUNT(*) FROM listings) AS rows,
    (SELECT COUNT(*) FROM pragma_table_info('listings')) AS cols
""")

┌───────┬───────┐
│ rows  │ cols  │
│ int64 │ int64 │
├───────┼───────┤
│   300 │    61 │
└───────┴───────┘

### Q2: What are the column names and data types?

In [13]:
df_pd.dtypes

listing_id                       int64
listing_name                    object
listing_type                    object
room_type                       object
cover_photo_url                 object
                                ...   
l90d_adjusted_revpar_native    float64
l90d_reserved_days               int64
l90d_blocked_days                int64
l90d_available_days              int64
l90d_total_days                  int64
Length: 61, dtype: object

In [14]:
df_spark.printSchema()

root
 |-- listing_id: long (nullable = true)
 |-- listing_name: string (nullable = true)
 |-- listing_type: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- cover_photo_url: string (nullable = true)
 |-- photos_count: integer (nullable = true)
 |-- host_id: long (nullable = true)
 |-- host_name: string (nullable = true)
 |-- cohost_ids: string (nullable = true)
 |-- cohost_names: string (nullable = true)
 |-- superhost: boolean (nullable = true)
 |-- latitude: decimal(10,4) (nullable = true)
 |-- longitude: decimal(10,4) (nullable = true)
 |-- guests: integer (nullable = true)
 |-- bedrooms: integer (nullable = true)
 |-- beds: integer (nullable = true)
 |-- baths: decimal(4,1) (nullable = true)
 |-- registration: boolean (nullable = true)
 |-- amenities: string (nullable = true)
 |-- instant_book: boolean (nullable = true)
 |-- min_nights: integer (nullable = true)
 |-- cancellation_policy: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- cle

In [15]:
duckdb.sql('PRAGMA table_info(listings)')

┌───────┬─────────────────────────────┬─────────┬─────────┬────────────┬─────────┐
│  cid  │            name             │  type   │ notnull │ dflt_value │   pk    │
│ int32 │           varchar           │ varchar │ boolean │  varchar   │ boolean │
├───────┼─────────────────────────────┼─────────┼─────────┼────────────┼─────────┤
│     0 │ listing_id                  │ BIGINT  │ false   │ NULL       │ false   │
│     1 │ listing_name                │ VARCHAR │ false   │ NULL       │ false   │
│     2 │ listing_type                │ VARCHAR │ false   │ NULL       │ false   │
│     3 │ room_type                   │ VARCHAR │ false   │ NULL       │ false   │
│     4 │ cover_photo_url             │ VARCHAR │ false   │ NULL       │ false   │
│     5 │ photos_count                │ INTEGER │ false   │ NULL       │ false   │
│     6 │ host_id                     │ BIGINT  │ false   │ NULL       │ false   │
│     7 │ host_name                   │ VARCHAR │ false   │ NULL       │ false   │
│   

### Q3: How many listings per room type?

In [16]:
df_pd['room_type'].value_counts()

room_type
entire_home     174
private_room    120
hotel_room        3
shared_room       3
Name: count, dtype: int64

In [19]:
df_spark.groupBy("room_type").count().orderBy("count", ascending=False).show()

+------------+-----+
|   room_type|count|
+------------+-----+
| entire_home|  174|
|private_room|  120|
|  hotel_room|    3|
| shared_room|    3|
+------------+-----+



In [26]:
duckdb.sql("""
    SELECT
        room_type,
        COUNT(*) AS count
    FROM listings
    GROUP BY room_type
    ORDER BY count DESC""")
           

┌──────────────┬───────┐
│  room_type   │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ entire_home  │   174 │
│ private_room │   120 │
│ hotel_room   │     3 │
│ shared_room  │     3 │
└──────────────┴───────┘

### Q4: What are the average and median daily rates over the past 12 months?

In [30]:
df_pd['ttm_avg_rate'].describe()[['mean', '50%']]

mean    185.103
50%     146.000
Name: ttm_avg_rate, dtype: float64

In [32]:
from pyspark.sql.functions import avg, expr
df_spark.select(avg("ttm_avg_rate"), expr("percentile(ttm_avg_rate, 0.5)")).show()

+------------------+--------------------------------+
| avg(ttm_avg_rate)|percentile(ttm_avg_rate, 0.5, 1)|
+------------------+--------------------------------+
|185.10299999999987|                           146.0|
+------------------+--------------------------------+



In [None]:
duckdb.sql("SELECT AVG(ttm_avg_rate), MEDIAN(ttm_avg_rate) FROM listings")

┌────────────────────┬──────────────────────┐
│ avg(ttm_avg_rate)  │ median(ttm_avg_rate) │
│       double       │        double        │
├────────────────────┼──────────────────────┤
│ 185.10299999999987 │                146.0 │
└────────────────────┴──────────────────────┘
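
The mean (about 185) sits well above the median (146). That pattern typically indicates a right-skewed distribution, where a few expensive listings pull the mean up while the median barely moves. The sketch below reproduces the effect with made-up rates (not values from the dataset) using Python's `statistics` module.

```python
import statistics

# Hypothetical daily rates, for illustration only.
rates = [90, 120, 140, 150, 160]
with_outlier = rates + [1500]   # add one very expensive listing

# Without the outlier, mean and median are close together.
print(statistics.mean(rates), statistics.median(rates))

# With the outlier, the mean jumps while the median shifts only slightly.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

This is one reason to report the median alongside the mean when summarizing rates.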


## 6. Deeper Data Exploration (Exercises)

### Q5: Which columns have missing values, and how many?

In [None]:
df_pd.isnull().sum().sort_values(ascending=False).head(10)

In [None]:

from pyspark.sql.functions import col, count, when
# count() ignores NULLs, so each expression counts only the rows where the column is NULL
df_spark.select([count(when(col(c).isNull(), c)).alias(c) for c in df_spark.columns]).show()


In [None]:

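# UNPIVOT drops NULLs by default and needs one common value type, so count NULLs
# per column instead: COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the null count.
cols = duckdb.sql("SELECT * FROM listings LIMIT 0").columns
per_column = " UNION ALL ".join(
    f"SELECT '{c}' AS column_name, COUNT(*) - COUNT(\"{c}\") AS null_count FROM listings"
    for c in cols
)
duckdb.sql(f"SELECT * FROM ({per_column}) t ORDER BY null_count DESC LIMIT 10")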


### Q6: How many listings are there in each neighborhood?

In [None]:
df_pd['neighbourhood_cleansed'].value_counts().head(10)

In [None]:

df_spark.groupBy("neighbourhood_cleansed").count().orderBy("count", ascending=False).show(10)


In [None]:

duckdb.sql("SELECT neighbourhood_cleansed, COUNT(*) AS count FROM listings GROUP BY 1 ORDER BY 2 DESC LIMIT 10")


### Q7: What is the distribution of `availability_365`?

In [None]:
df_pd['availability_365'].hist()

In [None]:

df_spark.groupBy("availability_365").count().orderBy("availability_365").show(10)


In [None]:

duckdb.sql("SELECT availability_365, COUNT(*) FROM listings GROUP BY availability_365 ORDER BY availability_365 LIMIT 10")


### Q8: How many listings are available for most of the year (availability_365 > 300)?

In [None]:
(df_pd['availability_365'] > 300).sum()

In [None]:

df_spark.filter("availability_365 > 300").count()


In [None]:

duckdb.sql("SELECT COUNT(*) FROM listings WHERE availability_365 > 300")


### Q9: How many listings have zero reviews?

In [None]:
(df_pd['number_of_reviews'] == 0).sum()

In [None]:

df_spark.filter("number_of_reviews = 0").count()


In [None]:

duckdb.sql("SELECT COUNT(*) FROM listings WHERE number_of_reviews = 0")


### Q10: Who are the top 10 hosts by number of listings?

In [None]:
df_pd['host_id'].value_counts().head(10)

In [None]:

df_spark.groupBy("host_id").count().orderBy("count", ascending=False).show(10)


In [None]:

duckdb.sql("SELECT host_id, COUNT(*) AS count FROM listings GROUP BY 1 ORDER BY 2 DESC LIMIT 10")


### Q11: What's the average price for listings with different minimum night requirements?

In [None]:

# The column is min_nights in this dataset; ttm_avg_rate (the average daily rate from Q4) serves as the price measure
df_pd.groupby("min_nights")["ttm_avg_rate"].mean().sort_values(ascending=False).head(10)


In [None]:

df_spark.groupBy("min_nights").avg("ttm_avg_rate").orderBy("avg(ttm_avg_rate)", ascending=False).show(10)


In [None]:

duckdb.sql("SELECT min_nights, AVG(ttm_avg_rate) FROM listings GROUP BY min_nights ORDER BY AVG(ttm_avg_rate) DESC LIMIT 10")


### Q12: Are there listings with suspiciously high prices (e.g., > $1000)?

In [None]:
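# listing_name is the name column in this dataset; ttm_avg_rate (see Q4) serves as the price measure
df_pd[df_pd['ttm_avg_rate'] > 1000][['listing_name', 'ttm_avg_rate']].head()

In [None]:

df_spark.filter("ttm_avg_rate > 1000").select("listing_name", "ttm_avg_rate").show(5)


In [None]:

duckdb.sql("SELECT listing_name, ttm_avg_rate FROM listings WHERE ttm_avg_rate > 1000 LIMIT 5")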


### Q13: What's the average number of reviews per month by room type?

In [None]:
df_pd.groupby('room_type')['reviews_per_month'].mean()

In [None]:

df_spark.groupBy("room_type").avg("reviews_per_month").show()


In [None]:

duckdb.sql("SELECT room_type, AVG(reviews_per_month) FROM listings GROUP BY room_type")


### Q14: What are the most common values for `minimum_nights`?

In [None]:
# The schema in Q2 names this column min_nights
df_pd['min_nights'].value_counts().head(10)

In [None]:

df_spark.groupBy("min_nights").count().orderBy("count", ascending=False).show(10)


In [None]:

duckdb.sql("SELECT min_nights, COUNT(*) FROM listings GROUP BY min_nights ORDER BY COUNT(*) DESC LIMIT 10")


### Q15: Which neighborhoods have the highest average price?

In [None]:

# ttm_avg_rate (the average daily rate from Q4) serves as the price measure
df_pd.groupby("neighbourhood_cleansed")["ttm_avg_rate"].mean().sort_values(ascending=False).head(10)


In [None]:

df_spark.groupBy("neighbourhood_cleansed").avg("ttm_avg_rate").orderBy("avg(ttm_avg_rate)", ascending=False).show(10)


In [None]:

duckdb.sql("SELECT neighbourhood_cleansed, AVG(ttm_avg_rate) FROM listings GROUP BY 1 ORDER BY 2 DESC LIMIT 10")


## 7. Advanced Queries and Joins


We'll now perform some more complex operations by joining the `listings` dataset with the `reviews` dataset.

This lets us explore relationships between listings and their reviews, for example:

- Average review scores by neighborhood
- Number of reviews per listing
- Listings with unusually high/low review scores
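
Before running the join on the real data, the idea can be sketched on two toy tables. The frames and values below are hypothetical; only the `listing_id` key mirrors this notebook. The sketch counts reviews per listing, left-joins the counts onto the listings, and fills in zero for listings that have no reviews.

```python
import pandas as pd

# Hypothetical toy tables, for illustration only.
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "room_type": ["entire_home", "private_room", "entire_home"],
})
reviews = pd.DataFrame({
    "listing_id": [1, 1, 3],
    "comment": ["great", "ok", "nice"],
})

# One row per listing that has at least one review.
review_counts = reviews.groupby("listing_id").size().reset_index(name="num_reviews")

# A left join keeps every listing; listing 2 has no match and gets NaN, then 0.
joined = listings.merge(review_counts, on="listing_id", how="left")
joined["num_reviews"] = joined["num_reviews"].fillna(0).astype(int)
print(joined)
```

Using an inner join here would silently drop listings without reviews, which would bias any per-listing average upward.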


### Step 1: Load the Reviews Dataset

In [None]:

# Load reviews using pandas
df_reviews_pd = pd.read_parquet("../data/nyc/nyc-reviews.parquet")
df_reviews_pd.head()


In [None]:

# Load reviews in Spark
df_reviews_spark = spark.read.parquet("../data/nyc/nyc-reviews.parquet")
df_reviews_spark.show(5)


In [None]:

# Load reviews in DuckDB
duckdb.sql("CREATE TABLE reviews AS SELECT * FROM '../data/nyc/nyc-reviews.parquet'")
duckdb.sql("SELECT * FROM reviews LIMIT 5")


### Q16: How many reviews does each listing have?

In [None]:

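# Pandas
# Count reviews per listing, then left-join so listings without reviews are kept
review_counts_pd = df_reviews_pd.groupby("listing_id").size().reset_index(name="num_reviews")
df_joined_pd = df_pd.merge(review_counts_pd, how="left", on="listing_id")
df_joined_pd["num_reviews"] = df_joined_pd["num_reviews"].fillna(0).astype(int)
df_joined_pd[["listing_id", "listing_name", "num_reviews"]].head()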


In [None]:

# PySpark
from pyspark.sql.functions import count
review_counts_spark = df_reviews_spark.groupBy("listing_id").agg(count("*").alias("num_reviews"))
# Join on the shared key; listings without reviews get num_reviews = 0
df_joined_spark = df_spark.join(review_counts_spark, on="listing_id", how="left").fillna(0, subset=["num_reviews"])
df_joined_spark.select("listing_id", "listing_name", "num_reviews").show(5)


In [None]:

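# DuckDB
# Aggregate reviews first, then left-join; the LIMIT belongs on the preview, not on the table
duckdb.sql("""
CREATE OR REPLACE TABLE joined AS
SELECT l.*, COALESCE(rc.num_reviews, 0) AS num_reviews
FROM listings l
LEFT JOIN (
    SELECT listing_id, COUNT(*) AS num_reviews
    FROM reviews
    GROUP BY listing_id
) rc ON l.listing_id = rc.listing_id
""")
duckdb.sql("SELECT listing_id, listing_name, num_reviews FROM joined LIMIT 5")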


### Q17: What's the average number of reviews per room type?

In [None]:

df_joined_pd.groupby("room_type")["num_reviews"].mean()


In [None]:

df_joined_spark.groupBy("room_type").avg("num_reviews").show()


In [None]:

duckdb.sql("SELECT room_type, AVG(num_reviews) FROM joined GROUP BY room_type")


### Q18: What are the top 5 listings by number of reviews?

In [None]:

df_joined_pd.sort_values("num_reviews", ascending=False)[["listing_name", "host_name", "num_reviews"]].head()


In [None]:

df_joined_spark.select("listing_name", "host_name", "num_reviews").orderBy("num_reviews", ascending=False).show(5)


In [None]:

duckdb.sql("SELECT listing_name, host_name, num_reviews FROM joined ORDER BY num_reviews DESC LIMIT 5")
