# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
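
As a minimal sketch of the dtype hint above, assuming a toy frame whose columns imitate the Inside Airbnb files (the rows are made up for illustration, and the column selection is an assumption rather than a required schema):

```python
import pandas as pd

# Toy rows that imitate raw CSV content: everything arrives as strings/objects.
raw = pd.DataFrame({
    "date": ["2025-09-15", "2025-09-16", "2025-09-17"],
    "available": ["f", "t", "t"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "price": ["$1,250.00", "$80.00", None],
})

typed = raw.assign(
    date=pd.to_datetime(raw["date"]),
    # Map the 't'/'f' flags to a nullable boolean column.
    available=raw["available"].map({"t": True, "f": False}).astype("boolean"),
    room_type=raw["room_type"].astype("category"),
    # Strip currency symbols and thousands separators, then parse as float.
    price=pd.to_numeric(
        raw["price"].str.replace(r"[\$,]", "", regex=True), errors="coerce"
    ),
)

print(typed.dtypes)
```

Converting once, early in the pipeline, keeps later group-bys and plots from silently treating flags or prices as text.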




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it only if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [8]:
import pandas as pd

df_calendar = pd.read_csv("/Users/sabeenaawan/Desktop/Hackathon_1/data_local/calendar.csv")
df_listings = pd.read_csv("/Users/sabeenaawan/Desktop/Hackathon_1/data_local/listings.csv")
df_reviews = pd.read_csv("/Users/sabeenaawan/Desktop/Hackathon_1/data_local/reviews.csv")

df_calendar.head(), df_listings.head(), df_reviews.head()



(   listing_id        date available  price  adjusted_price  minimum_nights  \
 0       18674  2025-09-15         f    NaN             NaN               3   
 1       18674  2025-09-16         t    NaN             NaN               2   
 2       18674  2025-09-17         t    NaN             NaN               2   
 3       18674  2025-09-18         t    NaN             NaN               2   
 4       18674  2025-09-19         f    NaN             NaN               3   
 
    maximum_nights  
 0             999  
 1             999  
 2             999  
 3             999  
 4             999  ,
       id                         listing_url       scrape_id last_scraped  \
 0  18674  https://www.airbnb.com/rooms/18674  20250914152803   2025-09-15   
 1  23197  https://www.airbnb.com/rooms/23197  20250914152803   2025-09-14   
 2  32711  https://www.airbnb.com/rooms/32711  20250914152803   2025-09-15   
 3  34241  https://www.airbnb.com/rooms/34241  20250914152803   2025-09-15   
 4  349

In [9]:
df = df_calendar.merge(
    df_listings,
    how="left",
    left_on="listing_id",
    right_on="id"
)

# We no longer need the duplicated id column
df = df.drop(columns=["id"])

# -------------------------
# 2) Process REVIEWS (aggregate at the listing level)
# -------------------------

df_reviews_agg = df_reviews.groupby("listing_id").agg(
    n_reviews=("id", "count"),
    first_review=("date", "min"),
    last_review=("date", "max")
).reset_index()

# -------------------------
# 3) Merge CALENDAR+LISTINGS with the AGGREGATED REVIEWS
# -------------------------

df = df.merge(
    df_reviews_agg,
    how="left",
    on="listing_id"
)

# -------------------------
# 4) Check the result
# -------------------------
print(df.shape)
df.head()

(7084654, 88)


Unnamed: 0,listing_id,date,available,price_x,adjusted_price,minimum_nights_x,maximum_nights_x,listing_url,scrape_id,last_scraped,...,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_reviews,first_review_y,last_review_y
0,18674,2025-09-15,f,,,3,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
1,18674,2025-09-16,t,,,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
2,18674,2025-09-17,t,,,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
3,18674,2025-09-18,t,,,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
4,18674,2025-09-19,f,,,3,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
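
The merge above joins many calendar rows to one listing row, and pandas appends `_x` / `_y` wherever both tables share a column name (for example `price_x` and `minimum_nights_x`). A minimal sketch on made-up toy frames of how the standard `suffixes` and `validate` parameters of `DataFrame.merge` could make this explicit; the suffix names `_cal` and `_lst` are illustrative choices, not part of the original pipeline:

```python
import pandas as pd

# Toy stand-ins for calendar.csv and listings.csv.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2025-09-15", "2025-09-16", "2025-09-15"],
    "minimum_nights": [3, 2, 1],
})
listings = pd.DataFrame({
    "id": [1, 2],
    "minimum_nights": [2, 1],
    "room_type": ["Entire home/apt", "Private room"],
})

merged = calendar.merge(
    listings,
    how="left",
    left_on="listing_id",
    right_on="id",
    suffixes=("_cal", "_lst"),   # readable names instead of _x / _y
    validate="many_to_one",      # raises MergeError if listings has duplicate ids
).drop(columns="id")

print(merged.columns.tolist())
```

With `validate` set, an accidental duplicate listing ID fails loudly instead of silently multiplying calendar rows.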


### 1. Exploratory Data Analysis (EDA)
This section provides an exploratory overview of the Airbnb dataset after merging the calendar, listings, and aggregated reviews data.
The goal of the EDA is to understand the structure of the dataset, detect patterns, identify potential issues, and gain insights that will inform the cleaning, feature engineering, and modeling steps in later tasks.
The analysis focuses on availability, price structures, geographical distributions, host activity, and temporal review patterns.

In [10]:
# Convert date columns
df['date'] = pd.to_datetime(df['date'])
df['last_scraped'] = pd.to_datetime(df['last_scraped'], errors='coerce')
df['first_review_y'] = pd.to_datetime(df['first_review_y'], errors='coerce')
df['last_review_y'] = pd.to_datetime(df['last_review_y'], errors='coerce')

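# Clean price_x and adjusted_price (they come as strings with $ and thousands separators)
for col in ["price_x", "adjusted_price"]:
    df[col] = pd.to_numeric(
        # Raw string avoids the "invalid escape sequence" warning for \$
        df[col].astype(str).str.replace(r"[\$,]", "", regex=True),
        errors="coerce",  # "nan" and other unparseable strings become NaN
    )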




In [12]:
# Drop these columns because they contain no data.

cols_to_drop = ["calendar_updated", "price_x", "adjusted_price"]

df = df.drop(columns=cols_to_drop, errors="ignore")

df.head()


Unnamed: 0,listing_id,date,available,minimum_nights_x,maximum_nights_x,listing_url,scrape_id,last_scraped,source,name,...,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_reviews,first_review_y,last_review_y
0,18674,2025-09-15,f,3,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
1,18674,2025-09-16,t,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
2,18674,2025-09-17,t,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
3,18674,2025-09-18,t,2,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31
4,18674,2025-09-19,f,3,999,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,...,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34,51.0,2013-05-27,2025-07-31


This analysis examines how the availability of listings changes over time. By calculating the proportion of listings that are available on each day, we can uncover seasonal behavior, peak tourism periods, and other fluctuations in demand.

In [13]:
import plotly.express as px

availability_over_time = (
    df.groupby("date", as_index=False)
      .agg(availability_rate=("available", lambda x: (x == "t").mean()))
)

fig = px.line(
    availability_over_time,
    x="date",
    y="availability_rate",
    title="Availability Rate Over Time",
    labels={"availability_rate": "Share of Listings Available"}
)
fig.show()


#### Interpretation
This plot highlights whether certain months or specific periods show reduced availability, which typically corresponds to higher demand. Sharp dips in availability may indicate major events, holidays, or weekends with unusually high booking activity.
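
The daily availability rate described above can also be computed without a Python-level lambda, which tends to be noticeably faster on millions of calendar rows. A minimal sketch on made-up toy data; the helper column name `is_available` is an assumption introduced for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-09-15", "2025-09-15", "2025-09-16", "2025-09-16"]),
    "available": ["f", "t", "t", "t"],
})

# Convert the flag once; the mean of a boolean column is the share of True values.
toy["is_available"] = toy["available"].eq("t")

availability_over_time = (
    toy.groupby("date", as_index=False)["is_available"]
       .mean()
       .rename(columns={"is_available": "availability_rate"})
)
print(availability_over_time)
```

The result has the same two columns as the original aggregation, so it can be passed to `px.line` unchanged.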

In [16]:
[p for p in df.columns if "price" in p.lower()]


['price_y', 'listing_price']

#### Distribution of Listing Prices
Understanding the distribution of listing prices is essential for characterizing the overall pricing structure of the market.
The distribution reveals whether prices are concentrated around certain values, whether there are multiple pricing segments, and how many properties fall into the premium or budget categories.
Since calendar-level prices were not available in this dataset, the listing-level price serves as the primary indicator of cost.
This variable typically reflects the base nightly rate set by the host and is therefore an appropriate target for exploratory analysis.
The histogram below shows the frequency of listings across different price ranges. The distribution is expected to be right-skewed, with a large number of affordable listings and a long tail representing more expensive properties.

In [19]:
df['listing_price'] = (
    df['price_y']
    .astype(str)
    .replace(r'[\$,]', '', regex=True)  # raw string avoids the invalid-escape warning
    .astype(float)
)

df_unique = df.drop_duplicates("listing_id")

fig = px.histogram(
    df_unique,
    x="listing_price",
    nbins=50,
    title="Distribution of Listing Prices"
)
fig.show()






#### Interpretation
The distribution provides a clear view of the pricing dynamics within the city.
Several patterns typically emerge:

- The majority of listings cluster within the lower to mid-range price categories, indicating that most hosts position their properties competitively to attract a wide set of guests.
- The right-skewed tail reflects a smaller number of premium listings that charge substantially higher prices. These may represent luxury apartments, unique accommodations, or properties located in highly desirable areas.
- The presence of outliers can indicate either genuinely high-value properties or potentially misconfigured price settings by hosts.
- The width of the distribution suggests how diverse the market is. A wide spread indicates a mix of budget-friendly, mid-range, and high-end listings, whereas a narrow distribution would suggest more uniform pricing.

Overall, this distribution helps characterize the market structure and provides a foundation for modeling tasks, such as predicting prices or segmenting listings based on pricing strategy.
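
To put numbers on the skew and the outliers discussed above, one option is a skewness statistic plus Tukey's 1.5 × IQR fence. A minimal sketch on made-up prices; the 1.5 × IQR threshold is a conventional rule of thumb and an assumption here, not something the dataset prescribes:

```python
import numpy as np
import pandas as pd

# Made-up nightly prices with one extreme value.
prices = pd.Series(
    [40, 55, 60, 70, 80, 90, 110, 150, 2000], name="listing_price", dtype=float
)

# Positive skewness indicates a long right tail.
skewness = prices.skew()

# Tukey's rule: flag values above Q3 + 1.5 * IQR as potential outliers.
q1, q3 = prices.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = prices > upper_fence

# A log transform compresses the tail and often makes the histogram easier to read.
log_prices = np.log1p(prices)

print(f"skewness={skewness:.2f}, upper fence={upper_fence:.1f}")
print(prices[is_outlier])
```

Flagged listings are candidates for manual inspection rather than automatic removal, since a very high price can be genuine.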

#### Availability by Weekday and Month
In order to better understand the temporal structure of availability, the availability signal was aggregated by weekday and month.
This representation is more compact and interpretable than a day-of-month heatmap, which tends to be visually busy and includes many empty or irregular dates.
Aggregating by weekday and month provides answers to questions such as:

- Are certain weekdays consistently less available than others?
- Do weekends behave differently from weekdays across the year?
- Which months show overall lower availability (higher demand)?
- Do weekday patterns change seasonally?

This two-dimensional summary reveals clear behavioral patterns that daily charts often obscure, and it is particularly useful when availability is relatively stable within weeks but varies across seasons.


In [27]:
import plotly.express as px
import pandas as pd

# Create month and weekday columns (copy only the columns needed, to save memory)
tmp = df[["date", "available"]].copy()
tmp["month"] = tmp["date"].dt.to_period("M").astype(str)
tmp["weekday"] = tmp["date"].dt.day_name()

# Order weekdays for nicer display
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
tmp["weekday"] = pd.Categorical(tmp["weekday"], categories=weekday_order, ordered=True)

# Aggregate availability by month and weekday
# (observed=True keeps only weekday/month combinations that occur in the data)
availability_mw = (
    tmp.groupby(["weekday", "month"], as_index=False, observed=True)
       .agg(availability_rate=("available", lambda x: (x == "t").mean()))
)

# Pivot to weekday rows, month columns
pivot_mw = availability_mw.pivot(index="weekday", columns="month", values="availability_rate")

fig = px.imshow(
    pivot_mw,
    aspect="auto",
    color_continuous_scale="RdBu",
    zmin=0,
    zmax=1,
    title="Availability by Weekday and Month"
)

fig.update_xaxes(title="Month")
fig.update_yaxes(title="Weekday")
fig.show()









#### Interpretation of the Availability by Weekday and Month Heatmap
This heatmap summarizes the average availability of listings across weekdays for each month in the dataset.
Instead of focusing on daily fluctuations, this view aggregates availability by weekday (Monday to Sunday) and month.
Darker red areas indicate lower availability (higher occupancy), while lighter or blue areas indicate higher availability.
This visualization is effective for identifying recurring weekly patterns, seasonality across months, and how demand shifts over time.

**Key Observations**

1. **September 2025 shows the lowest availability overall**
   - The darkest red tones appear in September 2025 across all weekdays.
   - This suggests unusually high demand throughout the observed part of the month.
   - Such a pattern may indicate a peak tourism season or a major local event. It may also partly reflect that the calendar rows shown earlier begin on 2025-09-15, so September appears to cover only the dates closest to the scrape, which tend to be the most heavily booked.
2. **Availability rises sharply from autumn into winter (Nov 2025 – Feb 2026)**
   - From November 2025 onward, most cells become significantly lighter.
   - This indicates a shift toward higher availability and lower demand.
   - The market appears much less active during the winter period.
3. **Early 2026 maintains moderate, stable availability**
   - January, February, March, and April 2026 show mostly light blue tones.
   - Availability is more uniform across all weekdays, suggesting consistent demand patterns.
   - No major spikes or dips occur during this period.
4. **Mid-2026 sees very high availability, indicating reduced demand**
   - May 2026 and July 2026 show very pale or near-white tones.
   - This suggests that a large share of listings remained unbooked during this period.
   - Such patterns may occur due to seasonality, oversupply, or reduced tourism activity.
5. **Weekday differences are relatively small**
   - Unlike many Airbnb markets where weekends show clear demand differences, this dataset displays similar availability levels across all weekdays.
   - This indicates that demand may be driven more by monthly seasonality than by day-of-week patterns.

**Summary of Market Dynamics**

- High demand: September 2025
- Moderate demand: Late autumn 2025 and early spring 2026
- Low demand / high availability: May–July 2026
- Minimal weekday effects: Availability is shaped more by monthly seasonality than by weekly cycles

This aggregated weekday–month view provides a clear, high-level understanding of how booking pressure changes across months and helps identify broader seasonal patterns that are difficult to detect in daily-level visualizations.
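
The observation that weekday differences are small can be checked numerically rather than by eye. A minimal sketch on made-up toy data that compares weekend and weekday availability per month; treating Saturday and Sunday as the weekend is an assumption (Friday and Saturday nights would be an equally defensible choice):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2025-09-19", "2025-09-20", "2025-09-21", "2025-09-22",  # Fri, Sat, Sun, Mon
        "2025-10-03", "2025-10-04", "2025-10-05", "2025-10-06",  # Fri, Sat, Sun, Mon
    ]),
    "available": ["t", "f", "f", "t", "t", "t", "f", "t"],
})

toy["month"] = toy["date"].dt.to_period("M").astype(str)
toy["is_weekend"] = toy["date"].dt.dayofweek >= 5   # Saturday=5, Sunday=6
toy["is_available"] = toy["available"].eq("t")

# Mean availability per month, split into weekday and weekend columns.
gap = (
    toy.groupby(["month", "is_weekend"])["is_available"]
       .mean()
       .unstack("is_weekend")
       .rename(columns={False: "weekday", True: "weekend"})
)
gap["weekend_minus_weekday"] = gap["weekend"] - gap["weekday"]
print(gap)
```

A gap close to zero in every month would support the claim; a consistently negative gap would point to weekend-driven demand.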

#### Calendar View of Availability for a Single Month
While summarizing availability across the entire dataset is useful, it can also be valuable to focus on a single month and visualize the availability pattern in a calendar-like layout.
In this representation:

- Columns correspond to days of the week.
- Rows correspond to weeks within the month.
- Each cell represents the average availability for that weekday/week combination.

This format closely resembles a standard monthly calendar and is effective for presenting a detailed view of booking pressure during a specific period, such as a peak tourism month or a month containing major events.
This visualization is more illustrative than analytical, but it helps communicate concrete patterns in a simple, intuitive format.

In [28]:
import numpy as np

# choose a month you care about
target_month = "2025-09"  # change if needed

month_df = df[df["date"].dt.to_period("M") == target_month].copy()

month_df["weekday"] = month_df["date"].dt.dayofweek  # Monday=0
month_df["week_of_month"] = ((month_df["date"].dt.day - 1) // 7)  # 0,1,2,3,4

# aggregate availability per cell
cal = (
    month_df.groupby(["week_of_month", "weekday"], as_index=False)
            .agg(availability_rate=("available", lambda x: (x == "t").mean()))
)

# build matrix 5 weeks × 7 days
calendar_matrix = np.full((cal["week_of_month"].max()+1, 7), np.nan)
for _, row in cal.iterrows():
    calendar_matrix[int(row["week_of_month"]), int(row["weekday"])] = row["availability_rate"]

fig = px.imshow(
    calendar_matrix,
    color_continuous_scale="RdBu",
    zmin=0,
    zmax=1,
    title=f"Availability Calendar Heatmap – {target_month}",
)

fig.update_xaxes(
    tickmode="array",
    tickvals=list(range(7)),
    ticktext=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    title="Day of Week",
)
fig.update_yaxes(
    tickmode="array",
    tickvals=list(range(calendar_matrix.shape[0])),
    ticktext=[f"Week {i+1}" for i in range(calendar_matrix.shape[0])],
    title="Week of Month",
)
fig.show()
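
The row index used above, `(day - 1) // 7`, maps days 1 to 7 to row 0, days 8 to 14 to row 1, and so on, so the rows coincide with true calendar weeks only when the month starts on a Monday (which September 2025 does). A minimal sketch on made-up toy data of building the same matrix with a group-by and `reindex` instead of a row-by-row loop; it is an alternative formulation, not the original author's method:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-09-15", "2025-09-15", "2025-09-16", "2025-09-30"]),
    "available": ["f", "t", "t", "f"],
})
toy["weekday"] = toy["date"].dt.dayofweek             # Monday=0
toy["week_of_month"] = (toy["date"].dt.day - 1) // 7  # days 1-7 -> 0, 8-14 -> 1, ...

calendar_matrix = (
    toy.assign(is_available=toy["available"].eq("t"))
       .groupby(["week_of_month", "weekday"])["is_available"]
       .mean()
       .unstack("weekday")
       # Force a full 5 x 7 grid; cells without data stay NaN.
       .reindex(index=range(5), columns=range(7))
       .to_numpy()
)

print(calendar_matrix.shape)
print(int(np.isnan(calendar_matrix).sum()), "cells without data")
```

Forcing the full grid keeps the axis labels stable across months and makes cells without data explicit as NaN.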


#### Interpretation of the Availability Calendar Heatmap (September 2025)
The calendar-style heatmap provides a detailed view of listing availability for the selected month (September 2025), broken down by day of the week and week of the month. Each cell represents the average share of listings that were available on that weekday during the corresponding week. Darker red cells indicate lower availability (higher occupancy), while lighter or blue tones indicate higher availability.
Several patterns emerge from the visualization:

1. **Weeks 1 and 2 are mostly outside the data range**
   - Most cells in the first half of the month are very light or blank.
   - The calendar rows shown earlier begin on 2025-09-15 and the listings were scraped on 2025-09-14 and 2025-09-15, so these cells most likely contain no data.
   - They should be read as missing values, not as evidence of high availability or low demand.
2. **Week 3 displays the lowest availability across the month**
   - The third week contains several dark red cells, especially on Monday and Sunday.
   - This suggests a period of unusually high demand, which is expected for the dates immediately after the scrape.
   - The pattern may also be associated with an event, festival, or other travel-intensive period.
3. **Week 4 remains moderately booked**
   - Availability increases slightly compared to Week 3.
   - This indicates a gradual easing of booking pressure after the peak in Week 3.
4. **Week 5 shows mixed patterns**
   - The beginning of Week 5 shows moderate availability.
   - Later days in the week are empty because September 2025 ends on Tuesday the 30th; the remaining weekdays fall in October.
   - The earlier part of the week suggests availability recovering toward the end of the month.
5. **Weekend effects are visible**
   - Sunday in Week 2 is one of the darkest cells, indicating strong booking pressure.
   - That cell corresponds to 2025-09-14 and is probably based only on listings scraped that day, so it should be treated with caution.
   - It may reflect common turnover cycles (check-ins or check-outs) or an event concentrated on that specific date.

**Summary of September 2025 Booking Dynamics**

- Little or no data in Weeks 1 and 2, so no conclusion about demand can be drawn for them.
- A sharp and pronounced peak in demand during Week 3.
- Moderate demand in Week 4.
- Mixed availability in Week 5, which contains only two days of September.

This visualization highlights how booking behavior varies both across and within weeks. It also illustrates how certain periods experience concentrated demand, which can be valuable for understanding seasonal patterns, event-driven spikes, and overall market dynamics.
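
Because some cells in a single-month view can be empty simply for lack of data, it can help to check which dates actually have calendar rows before interpreting pale or blank cells. A minimal sketch on made-up toy data; the variable names `coverage` and `missing_days` are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2025-09-15", "2025-09-16", "2025-09-14", "2025-09-15", "2025-09-16"]
    ),
})

target_month = "2025-09"
month = pd.Period(target_month, freq="M")
all_days = pd.date_range(month.start_time, month.end_time.normalize(), freq="D")

# Number of distinct listings with a calendar row on each day of the month.
coverage = (
    toy.groupby("date")["listing_id"].nunique()
       .reindex(all_days, fill_value=0)
       .rename("n_listings")
)
missing_days = coverage[coverage == 0].index

print(f"first day with data: {coverage[coverage > 0].index.min().date()}")
print(f"days without any rows: {len(missing_days)}")
```

Days with zero or very few listings are better masked or annotated in the heatmap than read as high or low availability.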