# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it only if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [1]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [2]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = "https://github.com/dacobri/Python-Hackathon---Group-A10.git"
TEAM_MEMBERS = [
    "Aumkar Prasad Wagle",
    "Brice Da Costa",
    "Giorgio Fiorentino",
    "Jakob Kohrgruber",
    "Georgii Runko",
]

GITHUB_REPO, TEAM_MEMBERS


('https://github.com/dacobri/Python-Hackathon---Group-A10.git',
 ['Aumkar Prasad Wagle',
  'Brice Da Costa',
  'Giorgio Fiorentino',
  'Jakob Kohrgruber',
  'Georgii Runko'])

In [3]:
import pandas as pd

# URLs
url_listings = "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/listings.csv.gz"
url_calendar = "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/calendar.csv.gz"
url_reviews = "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/reviews.csv.gz"

# Load with pandas (handles .gz automatically)
df_listings = pd.read_csv(url_listings, compression='gzip')
df_calendar = pd.read_csv(url_calendar, compression='gzip')
df_reviews = pd.read_csv(url_reviews, compression='gzip')

# Quick check
print(df_listings.head())
print(df_calendar.head())
print(df_reviews.head())


      id                         listing_url       scrape_id last_scraped  \
0  18674  https://www.airbnb.com/rooms/18674  20250914152803   2025-09-15   
1  23197  https://www.airbnb.com/rooms/23197  20250914152803   2025-09-14   
2  32711  https://www.airbnb.com/rooms/32711  20250914152803   2025-09-15   
3  34241  https://www.airbnb.com/rooms/34241  20250914152803   2025-09-15   
4  34981  https://www.airbnb.com/rooms/34981  20250914152803   2025-09-15   

        source                                               name  \
0  city scrape    Huge flat for 8 people close to Sagrada Familia   
1  city scrape  Forum CCIB DeLuxe, Spacious, Large Balcony, relax   
2  city scrape                   Sagrada Familia area - Còrsega 1   
3  city scrape   Stylish Top Floor Apartment - Ramblas Plaza Real   
4  city scrape               VIDRE HOME PLAZA REAL on LAS RAMBLAS   

                                         description  \
0  110m2 apartment to rent in Barcelona. Located ...   
1  Beautif
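
If the full files strain memory, one option is to restrict columns and set dtypes at load time. The sketch below is only an illustration: it reads a tiny inline CSV with made-up rows instead of the real `calendar.csv.gz`, and the choice of columns to keep is an assumption.

```python
import io
import pandas as pd

# Tiny inline stand-in for calendar.csv (made-up rows, same column names)
raw_csv = io.StringIO(
    "listing_id,date,available,price,minimum_nights,maximum_nights\n"
    "18674,2025-09-15,f,,1,1125\n"
    "18674,2025-09-16,t,,1,1125\n"
    "23197,2025-09-15,t,,3,365\n"
)

calendar = pd.read_csv(
    raw_csv,
    # Load only the columns that are needed downstream
    usecols=["listing_id", "date", "available", "minimum_nights", "maximum_nights"],
    # Declare dtypes up front instead of fixing them after loading
    dtype={"listing_id": "int64", "available": "category"},
    parse_dates=["date"],
)

print(calendar.dtypes)
print(f"{calendar.memory_usage(deep=True).sum()} bytes")
```

The same keyword arguments work when the first argument is a URL or a `.gz` path.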

In [4]:
df_calendar_sample = df_calendar.sample(n=1000, random_state=42)
df_listings_sample = df_listings.sample(n=1000, random_state=42)
df_reviews_sample = df_reviews.sample(n=1000, random_state=42)
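
Sampling the three tables independently means that few sampled calendar rows share a `listing_id` with a sampled listing or review. A possible alternative, sketched here on made-up toy tables, is to sample listings once and then keep only the calendar and review rows that belong to those listings. The variable names are hypothetical and the sample size is arbitrary.

```python
import pandas as pd

# Toy stand-ins for the three raw tables (made-up rows)
listings = pd.DataFrame({"id": [1, 2, 3, 4], "room_type": ["a", "b", "a", "b"]})
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "date": ["2025-09-15", "2025-09-16"] * 4,
    "available": ["t", "f"] * 4,
})
reviews = pd.DataFrame({"listing_id": [1, 1, 3], "comments": ["ok", "great", "fine"]})

# 1. Sample listings only
listings_sample = listings.sample(n=2, random_state=42)
sampled_ids = listings_sample["id"]

# 2. Keep the calendar and review rows of the sampled listings
calendar_sample = calendar[calendar["listing_id"].isin(sampled_ids)]
reviews_sample = reviews[reviews["listing_id"].isin(sampled_ids)]

print(len(listings_sample), len(calendar_sample), len(reviews_sample))
```

With this approach every calendar row in the sample has a matching listing, so a later left merge would not fill the listing columns with missing values.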

In [5]:
# 1. Rename the listing ID column in the listings sample so it matches the key used by calendar and reviews
df_listings_sample = df_listings_sample.rename(columns={'id': 'listing_id'})

# 2. Merge calendar + listings
merged_1 = df_calendar_sample.merge(df_listings_sample, on='listing_id', how='left')

# 3. Merge reviews
final_df = merged_1.merge(df_reviews_sample, on='listing_id', how='left')

# 4. Inspect the final merged dataset
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 90 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   listing_id                                    1004 non-null   int64  
 1   date_x                                        1004 non-null   object 
 2   available                                     1004 non-null   object 
 3   price_x                                       0 non-null      float64
 4   adjusted_price                                0 non-null      float64
 5   minimum_nights_x                              1004 non-null   int64  
 6   maximum_nights_x                              1004 non-null   int64  
 7   listing_url                                   45 non-null     object 
 8   scrape_id                                     45 non-null     float64
 9   last_scraped                                  45 non-null     o

Unnamed: 0,listing_id,date_x,available,price_x,adjusted_price,minimum_nights_x,maximum_nights_x,listing_url,scrape_id,last_scraped,...,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,id,date_y,reviewer_id,reviewer_name,comments
0,833368081295949152,2025-11-13,t,,,3,365,,,,...,,,,,,,,,,
1,5427997,2026-07-13,f,,,7,1125,,,,...,,,,,,,,,,
2,1192010287930228289,2025-11-04,t,,,4,365,,,,...,,,,,,,,,,
3,1226721657844314323,2025-11-18,t,,,115,116,,,,...,,,,,,,,,,
4,823789398926242617,2025-09-28,f,,,32,70,,,,...,,,,,,,,,,
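
The merged frame has 1004 rows although the calendar sample has 1000, which suggests that merging raw reviews duplicates calendar rows for listings with several reviews. A hedged sketch of one way to guard against this, on made-up toy data: aggregate reviews to one row per listing first, then let `validate` and `indicator` check the merge. The aggregate names `n_reviews` and `last_review_date` are illustrative.

```python
import pandas as pd

# Toy tables (made-up rows); listing 5 has no listing record on purpose
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 5],
    "date": ["2025-09-15", "2025-09-16", "2025-09-15", "2025-09-15"],
})
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2024-01-01", "2024-06-01", "2023-03-10"],
    "comments": ["Great", "Nice", "Fine"],
})

# Collapse reviews to one row per listing so the merge cannot duplicate calendar rows
reviews_agg = (
    reviews.groupby("listing_id")
    .agg(n_reviews=("comments", "size"), last_review_date=("date", "max"))
    .reset_index()
)

merged = (
    calendar
    # validate raises if the right side has duplicate keys; indicator records the match status
    .merge(listings, on="listing_id", how="left",
           validate="many_to_one", indicator="listing_match")
    .merge(reviews_agg, on="listing_id", how="left", validate="many_to_one")
)

# Share of calendar rows that found a listing
match_rate = (merged["listing_match"] == "both").mean()
print(f"rows: {len(merged)}, listing match rate: {match_rate:.0%}")
```

A low match rate at this point is an early signal that the inputs were not sampled consistently.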


In [6]:
# --- 3. Initial data audit ---

print("Shape (rows, columns):", final_df.shape)

print("\nColumn names:")
print(final_df.columns.tolist())

print("\nData types:")
print(final_df.dtypes)

print("\nPreview of the dataset:")
display(final_df.head())

print("\nMissing values per column:")
print(final_df.isna().sum().sort_values(ascending=False))

print("\nBasic descriptive statistics (numeric columns):")
display(final_df.describe().T)

Shape (rows, columns): (1004, 90)

Column names:
['listing_id', 'date_x', 'available', 'price_x', 'adjusted_price', 'minimum_nights_x', 'maximum_nights_x', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price_y', 'minimum_nights_y', 'maximum_nights_y', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'min

Unnamed: 0,listing_id,date_x,available,price_x,adjusted_price,minimum_nights_x,maximum_nights_x,listing_url,scrape_id,last_scraped,...,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,id,date_y,reviewer_id,reviewer_name,comments
0,833368081295949152,2025-11-13,t,,,3,365,,,,...,,,,,,,,,,
1,5427997,2026-07-13,f,,,7,1125,,,,...,,,,,,,,,,
2,1192010287930228289,2025-11-04,t,,,4,365,,,,...,,,,,,,,,,
3,1226721657844314323,2025-11-18,t,,,115,116,,,,...,,,,,,,,,,
4,823789398926242617,2025-09-28,f,,,32,70,,,,...,,,,,,,,,,



Missing values per column:
calendar_updated      1004
price_x               1004
adjusted_price        1004
host_neighbourhood     988
neighbourhood          985
                      ... 
date_x                   0
maximum_nights_x         0
minimum_nights_x         0
available                0
listing_id               0
Length: 90, dtype: int64

Basic descriptive statistics (numeric columns):


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| listing_id | 1004.0 | 6.455239e+17 | 5.886974e+17 | 32711.0 | 30750540.0 | 7.903025e+17 | 1.189004e+18 | 1.508131e+18 |
| price_x | 0.0 | | | | | | | |
| adjusted_price | 0.0 | | | | | | | |
| minimum_nights_x | 1004.0 | 17.61056 | 29.34399 | 1.0 | 2.0 | 4.0 | 31.0 | 365.0 |
| maximum_nights_x | 1004.0 | 555.5189 | 428.3142 | 2.0 | 330.0 | 365.0 | 1125.0 | 1125.0 |
| scrape_id | 45.0 | 20250910000000.0 | 0.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 |
| host_id | 45.0 | 193826300.0 | 189157800.0 | 154251.0 | 31576590.0 | 121495100.0 | 357946500.0 | 588931700.0 |
| host_listings_count | 45.0 | 107.6222 | 213.5668 | 1.0 | 1.0 | 10.0 | 77.0 | 904.0 |
| host_total_listings_count | 45.0 | 136.5111 | 244.7168 | 1.0 | 2.0 | 16.0 | 161.0 | 966.0 |
| latitude | 45.0 | 41.3938 | 0.01480159 | 41.35871 | 41.38265 | 41.39287 | 41.40269 | 41.4393 |


In [7]:
# --- 4. Basic summary statistics ---

# Summary statistics for numeric columns
numeric_summary = final_df.describe().T

# Summary statistics for non numeric columns
non_numeric_summary = final_df.describe(include=["object"])

print("Numeric summary statistics:")
display(numeric_summary)

print("\nNon-numeric summary statistics:")
display(non_numeric_summary)

# Check unique value counts for all categorical/object columns
print("\nUnique value counts for categorical columns:")
cat_cols = final_df.select_dtypes(include="object").columns
for col in cat_cols:
    print(f"{col}: {final_df[col].nunique()} unique values")

Numeric summary statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| listing_id | 1004.0 | 6.455239e+17 | 5.886974e+17 | 32711.0 | 30750540.0 | 7.903025e+17 | 1.189004e+18 | 1.508131e+18 |
| price_x | 0.0 | | | | | | | |
| adjusted_price | 0.0 | | | | | | | |
| minimum_nights_x | 1004.0 | 17.61056 | 29.34399 | 1.0 | 2.0 | 4.0 | 31.0 | 365.0 |
| maximum_nights_x | 1004.0 | 555.5189 | 428.3142 | 2.0 | 330.0 | 365.0 | 1125.0 | 1125.0 |
| scrape_id | 45.0 | 20250910000000.0 | 0.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 | 20250910000000.0 |
| host_id | 45.0 | 193826300.0 | 189157800.0 | 154251.0 | 31576590.0 | 121495100.0 | 357946500.0 | 588931700.0 |
| host_listings_count | 45.0 | 107.6222 | 213.5668 | 1.0 | 1.0 | 10.0 | 77.0 | 904.0 |
| host_total_listings_count | 45.0 | 136.5111 | 244.7168 | 1.0 | 2.0 | 16.0 | 161.0 | 966.0 |
| latitude | 45.0 | 41.3938 | 0.01480159 | 41.35871 | 41.38265 | 41.39287 | 41.40269 | 41.4393 |



Non-numeric summary statistics:


Unnamed: 0,date_x,available,listing_url,last_scraped,source,name,description,neighborhood_overview,picture_url,host_url,...,price_y,has_availability,calendar_last_scraped,first_review,last_review,license,instant_bookable,date_y,reviewer_name,comments
count,1004,1004,45,45,45,45,43,19,45,45,...,26,41,45,32,32,33,45,44,44,44
unique,344,2,42,2,2,42,39,18,42,38,...,23,1,2,31,31,22,2,44,43,44
top,2026-03-01,t,https://www.airbnb.com/rooms/1379697197172209746,2025-09-15,city scrape,Cozy Single Room in Coliving near Sagrada Familia,Welcome to Room Picnic in Casa Bosque! A warm ...,"Sagrada Familia is a vibrant neighbourhood, bl...",https://a0.muscache.com/pictures/prohost-api/H...,https://www.airbnb.com/users/show/370718107,...,$171.00,t,2025-09-15,2017-01-01,2017-09-29,Exempt,f,2019-03-21,David,Marta and Cesc are the best of hosts. A very ...
freq,8,535,2,38,26,2,2,2,2,3,...,2,41,38,2,2,11,27,1,2,1



Unique value counts for categorical columns:
date_x: 344 unique values
available: 2 unique values
listing_url: 42 unique values
last_scraped: 2 unique values
source: 2 unique values
name: 42 unique values
description: 39 unique values
neighborhood_overview: 18 unique values
picture_url: 42 unique values
host_url: 38 unique values
host_name: 38 unique values
host_since: 38 unique values
host_location: 5 unique values
host_about: 25 unique values
host_response_time: 4 unique values
host_response_rate: 9 unique values
host_acceptance_rate: 16 unique values
host_is_superhost: 2 unique values
host_thumbnail_url: 38 unique values
host_picture_url: 38 unique values
host_neighbourhood: 9 unique values
host_verifications: 4 unique values
host_has_profile_pic: 2 unique values
host_identity_verified: 2 unique values
neighbourhood: 2 unique values
neighbourhood_cleansed: 24 unique values
neighbourhood_group_cleansed: 9 unique values
property_type: 10 unique values
room_type: 3 unique values
bathr

### 5. Data types and column cleaning

In this step, a copy of the merged dataset is created and a few targeted corrections are applied to make the data easier to analyse and to prepare it for a future machine learning pipeline.

The corrections are:

- Rename merge-generated columns such as `date_x` and `date_y` to clearer names like `calendar_date` and `review_date`. This avoids confusion later when we build features based on dates or nights.
- Parse all date-like columns into a proper datetime type. This makes it straightforward to create new features such as year, month, weekday or length of host experience in the feature engineering section.
- Convert identifier columns (`listing_id`, `host_id`, `reviewer_id`, review `id`) to string. These are keys, not numeric predictors, so converting them prevents them from being mistakenly used as continuous features in Hackathon 2.
- Convert stable low-cardinality text columns such as `room_type`, `neighbourhood_cleansed` or `instant_bookable` to categorical type. This makes their role as categories explicit and prepares them for simple encodings like one-hot encoding in the next hackathon.

From this point on, the analysis and feature engineering will use `final_df_clean` as the main working dataset.


In [8]:
# --- 5. Data types and column cleaning ---

# Work on a copy
final_df_clean = final_df.copy()

# 5.1 Rename merge generated columns to clearer names
final_df_clean = final_df_clean.rename(
    columns={
        "date_x": "calendar_date",
        "date_y": "review_date",
        "price_x": "calendar_price",
        "price_y": "listing_price",
        "minimum_nights_x": "minimum_nights_calendar",
        "maximum_nights_x": "maximum_nights_calendar",
        "minimum_nights_y": "minimum_nights_listing",
        "maximum_nights_y": "maximum_nights_listing",
    }
)

# 5.2 Parse date columns to datetime
date_cols = [
    "calendar_date",
    "review_date",
    "last_scraped",
    "calendar_last_scraped",
    "host_since",
    "first_review",
    "last_review",
]

for col in date_cols:
    if col in final_df_clean.columns:
        final_df_clean[col] = pd.to_datetime(final_df_clean[col], errors="coerce")

# 5.3 Convert identifier columns to string so they are not used as numeric features
id_cols = ["listing_id", "host_id", "reviewer_id", "id"]

for col in id_cols:
    if col in final_df_clean.columns:
        final_df_clean[col] = (
            final_df_clean[col]
            .astype("Int64")   # keeps missing values as <NA>
            .astype("string")
        )

# 5.4 Convert stable low cardinality text columns to categorical type
object_cols = final_df_clean.select_dtypes(include="object").columns

text_like_cols = [
    "name",
    "description",
    "neighborhood_overview",
    "amenities",
    "comments",
    "listing_url",
    "picture_url",
    "host_url",
    "host_thumbnail_url",
    "host_picture_url",
]

cat_candidates = [
    col
    for col in object_cols
    if col not in text_like_cols and final_df_clean[col].nunique(dropna=True) <= 30
]

for col in cat_candidates:
    final_df_clean[col] = final_df_clean[col].astype("category")

# Quick check of updated dtypes
final_df_clean.dtypes.head(30)

listing_id                   string[python]
calendar_date                datetime64[ns]
available                          category
calendar_price                      float64
adjusted_price                      float64
minimum_nights_calendar               int64
maximum_nights_calendar               int64
listing_url                          object
scrape_id                           float64
last_scraped                 datetime64[ns]
source                             category
name                                 object
description                          object
neighborhood_overview                object
picture_url                          object
host_id                      string[python]
host_url                             object
host_name                            object
host_since                   datetime64[ns]
host_location                      category
host_about                         category
host_response_time                 category
host_response_rate              
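
The cardinality rule in step 5.4 also catches string-encoded numbers: the non-numeric summary earlier shows `price_y` values such as `$171.00`, which end up as a category once renamed to `listing_price`. A minimal sketch of parsing such columns to numeric before the categorical conversion follows. The rows are made up, and the thousands separator in `$1,250.00` and the `95%` format for `host_response_rate` are assumptions about the raw values.

```python
import pandas as pd

# Made-up rows mimicking string-encoded numeric columns
df = pd.DataFrame({
    "listing_price": ["$171.00", "$1,250.00", None],
    "host_response_rate": ["100%", "95%", None],
})

# Strip the currency symbol and thousands separators, then convert to float
df["listing_price"] = pd.to_numeric(
    df["listing_price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Strip the percent sign and rescale to the 0-1 range
df["host_response_rate"] = pd.to_numeric(
    df["host_response_rate"].str.rstrip("%"), errors="coerce"
) / 100

print(df.dtypes)
```

Running a step like this before 5.4 would keep these columns out of `cat_candidates`, because they would no longer have `object` dtype.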

### 6. Missing values exploration

In this step we quantify missing data for every column in `final_df_clean`. For each variable we compute:

- The absolute number of missing values
- The percentage of missing values relative to the total number of rows
- The data type, to see whether missingness affects numeric, categorical or text variables

The bar charts highlight the columns with the most missing values, by percentage and by count.  
From the earlier results we see three main patterns:

- Some columns are completely missing (for example calendar price related fields and `calendar_updated`). These are strong candidates to be dropped or replaced by future engineered features.
- Many listing-level and review-level attributes have very high missingness because only a subset of calendar rows have matching listing and review information. We will need to be selective when using these as features for machine learning.
- Core structural variables such as `calendar_date`, `available`, `minimum_nights_calendar` and `maximum_nights_calendar` have no missing values and are good candidates for later feature engineering.

Understanding this missingness structure now will guide the cleaning strategy in the next sections and ensure that the final feature matrix for Hackathon 2 does not rely on variables with unreliable coverage.


In [11]:
# --- 6. Missing values exploration ---

import pandas as pd
import plotly.express as px

# Force Jupyter to display full DataFrame without truncation
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

n_rows = len(final_df_clean)

# Full missing values table
missing_df = (
    final_df_clean.isna()
    .sum()
    .reset_index(name="missing_count")
    .rename(columns={"index": "column"})
)

missing_df["missing_pct"] = (missing_df["missing_count"] / n_rows) * 100
missing_df["dtype"] = missing_df["column"].map(final_df_clean.dtypes.astype(str))
missing_df = missing_df.sort_values("missing_pct", ascending=False)

print("Full missing values table:")
display(missing_df)

# ----------------------------------------------------------------------
# PLOT 1: Top 20 columns by percentage of missing values
# ----------------------------------------------------------------------
top_pct = missing_df[missing_df["missing_count"] > 0].head(20)

fig1 = px.bar(
    top_pct,
    x="column",
    y="missing_pct",
    title="Top 20 Columns by Percentage of Missing Values",
    labels={"missing_pct": "Missing percentage", "column": "Column"},
)
fig1.update_layout(xaxis_tickangle=-45, yaxis=dict(range=[0, 100]))
fig1.show()

# ----------------------------------------------------------------------
# PLOT 2: Top 20 columns by missing count
# ----------------------------------------------------------------------
top_count = (
    missing_df[missing_df["missing_count"] > 0]
    .sort_values("missing_count", ascending=False)
    .head(20)
)

fig2 = px.bar(
    top_count,
    x="column",
    y="missing_count",
    title="Top 20 Columns by Missing Count",
    labels={"missing_count": "Missing count", "column": "Column"},
)
fig2.update_layout(xaxis_tickangle=-45)
fig2.show()


Full missing values table:


| | column | missing_count | missing_pct | dtype |
|---|---|---|---|---|
| 55 | calendar_updated | 1004 | 100.0 | float64 |
| 3 | calendar_price | 1004 | 100.0 | float64 |
| 4 | adjusted_price | 1004 | 100.0 | float64 |
| 27 | host_neighbourhood | 988 | 98.406375 | category |
| 33 | neighbourhood | 985 | 98.10757 | category |
| 13 | neighborhood_overview | 985 | 98.10757 | object |
| 68 | estimated_revenue_l365d | 978 | 97.410359 | float64 |
| 46 | listing_price | 978 | 97.410359 | category |
| 44 | beds | 978 | 97.410359 | float64 |
| 41 | bathrooms | 978 | 97.410359 | float64 |


### 7. Missing values cleaning decisions

The merged dataset shows extremely high missingness for most listing and review attributes (95 to 100 percent). This is expected because the calendar, listings and reviews datasets were sampled independently, so only a small number of rows correspond to the same listing. As a result, most listing-level and review-level features do not align with the calendar rows and cannot be used reliably for analysis or machine learning.

To ensure we retain only stable and meaningful variables, we apply the following cleaning strategy:

- Drop all columns with 95 percent or more missing values. These variables do not provide consistent information and would add noise to the feature matrix.
- Keep structural calendar features (`calendar_date`, `available`, `minimum_nights_calendar`, `maximum_nights_calendar`) because they contain complete information.
- Avoid imputing listing-level or review-level fields. Imputing columns that are 95 percent missing would produce artificially filled data that introduces bias and breaks the real structure of the dataset.

The final cleaned dataset now contains only reliable variables that can support feature engineering and can be safely encoded in Hackathon 2.


In [12]:
# --- 7. Missing values cleaning decisions ---

# Work on a new copy
df_cleaned = final_df_clean.copy()

# 1. Identify columns to drop (≥95% missing)
cols_to_drop = missing_df[missing_df["missing_pct"] >= 95]["column"].tolist()

# Safeguard: never drop the structural columns, even if one crossed the threshold (none does here)
essential_cols = [
    "listing_id",
    "calendar_date",
    "available",
    "minimum_nights_calendar",
    "maximum_nights_calendar",
]

cols_to_drop = [col for col in cols_to_drop if col not in essential_cols]

# Drop them
df_cleaned = df_cleaned.drop(columns=cols_to_drop)

print(f"Dropped {len(cols_to_drop)} columns.")
print("Remaining columns:")
df_cleaned.info()


Dropped 85 columns.
Remaining columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   listing_id               1004 non-null   string        
 1   calendar_date            1004 non-null   datetime64[ns]
 2   available                1004 non-null   category      
 3   minimum_nights_calendar  1004 non-null   int64         
 4   maximum_nights_calendar  1004 non-null   int64         
dtypes: category(1), datetime64[ns](1), int64(2), string(1)
memory usage: 32.6 KB
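
As a possible next step on the five remaining columns, the `available` flag can be turned into a binary target and `calendar_date` into simple temporal features. This is a minimal sketch on made-up rows with the same column names and dtypes as `df_cleaned`; the new column names (`is_available`, `month`, `day_of_week`, `is_weekend`) are illustrative.

```python
import pandas as pd

# Made-up rows with the same columns and dtypes as the cleaned dataset
df = pd.DataFrame({
    "listing_id": ["18674", "18674", "23197"],
    "calendar_date": pd.to_datetime(["2025-11-13", "2025-11-15", "2026-07-13"]),
    "available": pd.Categorical(["t", "f", "t"]),
})

# Binary target: 1 if the listing is available on that date, 0 otherwise
df["is_available"] = (df["available"] == "t").astype("int8")

# Simple calendar features derived from the date
df["month"] = df["calendar_date"].dt.month
df["day_of_week"] = df["calendar_date"].dt.dayofweek  # Monday = 0, Sunday = 6
df["is_weekend"] = df["day_of_week"].isin([5, 6])

print(df[["calendar_date", "is_available", "month", "day_of_week", "is_weekend"]])
```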
