# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline, then refine it if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = "https://github.com/dacobri/Python-Hackathon---Group-A10.git"
TEAM_MEMBERS = [
    "Aumkar Prasad Wagle",
    "Brice Da Costa",
    "Giorgio Fiorentino",
    "Jakob Kohrgruber",
    "Georgii Runko",
]

GITHUB_REPO, TEAM_MEMBERS


('https://github.com/dacobri/Python-Hackathon---Group-A10.git',
 ['Aumkar Prasad Wagle',
  'Brice Da Costa',
  'Giorgio Fiorentino',
  'Jakob Kohrgruber',
  'Georgii Runko'])

In [2]:
from pathlib import Path
import pandas as pd

# Local folder holding the three Inside Airbnb CSV files; adjust to your machine.
DATA_DIR = Path("/Users/giorgiofiorentino/Documents/Hackathon/Data")

df_calendar = pd.read_csv(DATA_DIR / "calendar.csv")
df_listings = pd.read_csv(DATA_DIR / "listings.csv")
df_reviews = pd.read_csv(DATA_DIR / "reviews.csv")
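Right after loading is a natural point to fix data types, as the hints recommend. The sketch below runs on a toy frame shaped like `calendar.csv`. The `"$1,200.00"` price format is an assumption about how prices may be stored as strings. If the price column is already numeric or entirely empty, skip the string-cleaning step. The derived column names (`day_of_week`, `is_weekend`, `month`) are illustrative.

```python
import pandas as pd

# Toy rows shaped like calendar.csv; the string price format is an assumption.
calendar = pd.DataFrame({
    "listing_id": [5427997, 5427997],
    "date": ["2026-07-13", "2026-07-14"],
    "available": ["f", "t"],
    "price": ["$1,200.00", None],
})

# Parse dates explicitly so malformed values raise instead of passing silently.
calendar["date"] = pd.to_datetime(calendar["date"], format="%Y-%m-%d")

# Map the 't'/'f' flags to a nullable boolean column.
calendar["available"] = (
    calendar["available"].map({"t": True, "f": False}).astype("boolean")
)

# Strip currency symbols and thousands separators, then convert to float.
calendar["price"] = pd.to_numeric(
    calendar["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Simple temporal features derived from the parsed date.
calendar["day_of_week"] = calendar["date"].dt.dayofweek  # Monday = 0
calendar["is_weekend"] = calendar["day_of_week"] >= 5
calendar["month"] = calendar["date"].dt.month

print(calendar.dtypes)
```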


In [3]:
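# Work on 1,000-row samples to keep the notebook fast (fixed seed for reproducibility).
# Caveat: the three tables are sampled independently, so most sampled calendar rows
# have no matching listing or review in the merge below.
df_calendar_sample = df_calendar.sample(n=1000, random_state=42)
df_listings_sample = df_listings.sample(n=1000, random_state=42)
df_reviews_sample = df_reviews.sample(n=1000, random_state=42)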
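Sampling each table independently leaves very few shared `listing_id` values across the three samples. In the merged output of this notebook, only 45 of 1,004 rows carry listing information. One possible alternative, sketched here on toy data, is to sample listings once and then keep every calendar and review row that belongs to those listings. The frame contents and the sample size `n=2` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up).
listings = pd.DataFrame({"id": [10, 20, 30, 40]})
calendar = pd.DataFrame({
    "listing_id": [10, 10, 20, 30, 40, 40],
    "date": ["2025-11-01", "2025-11-02"] * 3,
})
reviews = pd.DataFrame({
    "listing_id": [10, 30, 30],
    "comments": ["a", "b", "c"],
})

# 1. Sample at the listing level only.
listings_sample = listings.sample(n=2, random_state=42)

# 2. Keep all calendar and review rows that belong to the sampled listings.
calendar_sample = calendar[calendar["listing_id"].isin(listings_sample["id"])]
reviews_sample = reviews[reviews["listing_id"].isin(listings_sample["id"])]

print(len(listings_sample), len(calendar_sample), len(reviews_sample))
```

With this approach every sampled calendar row is guaranteed a matching listing, so a left merge no longer produces mostly empty listing columns.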

In [5]:
# 1. Rename the listing ID column in listings (df_listings)
df_listings_sample = df_listings_sample.rename(columns={'id': 'listing_id'})

# 2. Merge calendar + listings
merged_1 = df_calendar_sample.merge(df_listings_sample, on='listing_id', how='left')

# 3. Merge reviews
final_df = merged_1.merge(df_reviews_sample, on='listing_id', how='left')

# 4. Inspect the final merged dataset
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 90 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   listing_id                                    1004 non-null   int64  
 1   date_x                                        1004 non-null   object 
 2   available                                     1004 non-null   object 
 3   price_x                                       0 non-null      float64
 4   adjusted_price                                0 non-null      float64
 5   minimum_nights_x                              1004 non-null   int64  
 6   maximum_nights_x                              1004 non-null   int64  
 7   listing_url                                   45 non-null     object 
 8   scrape_id                                     45 non-null     float64
 9   last_scraped                                  45 non-null     o

| | listing_id | date_x | available | price_x | adjusted_price | minimum_nights_x | maximum_nights_x | listing_url | scrape_id | last_scraped | ... | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | id | date_y | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 833368081295949152 | 2025-11-13 | t | | | 3 | 365 | | | | ... | | | | | | | | | | |
| 1 | 5427997 | 2026-07-13 | f | | | 7 | 1125 | | | | ... | | | | | | | | | | |
| 2 | 1192010287930228289 | 2025-11-04 | t | | | 4 | 365 | | | | ... | | | | | | | | | | |
| 3 | 1226721657844314323 | 2025-11-18 | t | | | 115 | 116 | | | | ... | | | | | | | | | | |
| 4 | 823789398926242617 | 2025-09-28 | f | | | 32 | 70 | | | | ... | | | | | | | | | | |
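The merged frame above has 1,004 rows for 1,000 sampled calendar rows, and it contains suffixed columns such as `date_x` and `date_y`. Both are symptoms of joining the review table row by row. A possible alternative, sketched on toy data, is to aggregate reviews to one row per listing first and let `merge` verify the expected key relationship with `validate`. The feature names `n_reviews` and `last_review_date` and all toy values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up).
calendar = pd.DataFrame({
    "listing_id": [10, 10, 20],
    "date": ["2025-11-01", "2025-11-02", "2025-11-01"],
    "available": ["t", "f", "t"],
})
listings = pd.DataFrame({
    "id": [10, 20],
    "room_type": ["Entire home/apt", "Private room"],
})
reviews = pd.DataFrame({
    "listing_id": [10, 10],
    "date": ["2025-01-05", "2025-03-09"],
    "comments": ["Lovely stay", "Would return"],
})

# Collapse reviews to one row per listing so the merge cannot duplicate rows.
review_features = (
    reviews.groupby("listing_id")
    .agg(n_reviews=("comments", "size"), last_review_date=("date", "max"))
    .reset_index()
)

# validate="many_to_one" raises if the right-hand table has duplicate keys.
dataset = (
    calendar
    .merge(listings.rename(columns={"id": "listing_id"}),
           on="listing_id", how="left", validate="many_to_one")
    .merge(review_features, on="listing_id", how="left", validate="many_to_one")
)

# Listings without reviews get a count of zero instead of NaN.
dataset["n_reviews"] = dataset["n_reviews"].fillna(0).astype(int)

print(dataset)
```

Because the review date is renamed during aggregation, the result keeps a single `date` column and one row per listing-date pair.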
