In [None]:
import pandas as pd

# Path to your dataset folder
path = r"C:/Users/Acer/Desktop/ppp/"

# Load the files
listings = pd.read_csv(path + "listings.csv.gz", compression='gzip')
calendar = pd.read_csv(path + "calendar.csv.gz", compression='gzip')
reviews = pd.read_csv(path + "reviews.csv.gz", compression='gzip')

# Check first rows
listings.head()


In [None]:
df_listings = listings
df_calendar = calendar
df_reviews = reviews


In [None]:
# Calendar is too big, so sample 100k rows
df_calendar_small = df_calendar.sample(100000, random_state=42)
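
Sampling individual rows keeps the sample spread evenly over dates but leaves each listing with only a few scattered days. If complete per-listing time series are wanted, one alternative is to sample listing ids and keep all of their rows. The sketch below uses a made-up `calendar_demo` frame; the names and sizes are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up calendar: 4 listings with 3 days each (illustrative only).
calendar_demo = pd.DataFrame({
    "listing_id": [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3,
    "date": list(pd.date_range("2024-01-01", periods=3)) * 4,
})

# Sample listing ids rather than rows, then keep every row of the chosen listings.
listing_ids = pd.Series(calendar_demo["listing_id"].unique())
sampled_ids = listing_ids.sample(n=2, random_state=42)
calendar_by_listing = calendar_demo[calendar_demo["listing_id"].isin(sampled_ids)]

print(calendar_by_listing.groupby("listing_id").size())
```

Row sampling remains the simpler choice when only aggregate daily counts are needed.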


In [None]:
df = df_calendar_small.merge(
    df_listings,
    how="left",
    left_on="listing_id",
    right_on="id"
)

df = df.drop(columns=["id"])
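
A left merge like the one above silently keeps calendar rows that have no matching listing and silently duplicates rows if a listing id repeats in the listings table. A minimal sketch of how both cases could be checked, using the `validate` and `indicator` options of `DataFrame.merge` on made-up data (`calendar_demo` and `listings_demo` are illustrative stand-ins):

```python
import pandas as pd

# Tiny stand-ins for the calendar sample and the listings table (made-up values).
calendar_demo = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "date": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"],
    "available": ["t", "f", "t", "f"],
})
listings_demo = pd.DataFrame({"id": [1, 2], "host_id": [10, 20]})

merged_demo = calendar_demo.merge(
    listings_demo,
    how="left",
    left_on="listing_id",
    right_on="id",
    validate="many_to_one",  # raises MergeError if listing ids repeat on the right
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Calendar rows whose listing_id has no match in the listings table.
unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(f"rows: {len(merged_demo)}, unmatched calendar rows: {unmatched}")
```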


In [None]:
df_reviews_agg = df_reviews.groupby("listing_id").agg(
    n_reviews=("id", "count"),
    first_review=("date", "min"),
    last_review=("date", "max")
).reset_index()


In [None]:
df = df.merge(
    df_reviews_agg,
    how="left",
    on="listing_id"
)


In [None]:
print(df.shape)
df.info()
df.head()  # last expression in the cell, so the notebook renders it


In [None]:
# Missing values per column
df.isna().sum().sort_values(ascending=False).head(20)


In [None]:
df.describe()


In [None]:
df.describe(include="object").T


In [None]:
df["date"] = pd.to_datetime(df["date"], errors="coerce")


In [None]:
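import plotly.express as px

# Count sampled calendar rows per date (one row = one listing on one day)
df_date = (
    df["date"].value_counts()
    .sort_index()
    .rename_axis("date")
    .reset_index(name="n_listings")
)

fig = px.line(
    df_date,
    x="date",
    y="n_listings",
    title="Listings per Date",
    labels={"n_listings": "Listings", "date": "Date"}
)

fig.show()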


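**Insight:** The number of sampled listing-days per date stays relatively stable. Because the calendar rows were sampled uniformly at random, this mainly confirms that the sample covers the period evenly; it is also consistent with a steady supply of Airbnb listings during the selected period.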


In [None]:
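# `neighborhood_overview` holds free-text descriptions, so count listings by
# `neighbourhood_cleansed` instead (assumed present, as in Inside Airbnb exports)
top_neigh = df_listings["neighbourhood_cleansed"].value_counts().head(10)

fig = px.bar(
    top_neigh,
    title="Top 10 Neighbourhoods",
    labels={"value": "Count", "index": "Neighbourhood",
            "neighbourhood_cleansed": "Neighbourhood"}
)
fig.show()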


📌 1. Top 10 Neighbourhoods

Insights:

- These are the neighbourhoods with the highest number of listings.
- A high listing count may reflect:
  - High tourist demand
  - Good location / accessibility
  - Popular attractions nearby
- These top 10 neighbourhoods account for a large share of the city's Airbnb market.
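
How strongly the top neighbourhoods dominate can be expressed as a share of all listings. The following is a small sketch with made-up counts; `counts` is a hypothetical stand-in for per-neighbourhood listing counts computed over all neighbourhoods, not only the first 10.

```python
import pandas as pd

# Made-up listing counts per neighbourhood (illustrative only).
counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4}, name="n_listings")

# Share of all listings held by each neighbourhood, largest first.
share = counts.sort_values(ascending=False) / counts.sum()

top_k = 2
top_share = share.head(top_k).sum()
print(f"Top {top_k} neighbourhoods hold {top_share:.0%} of listings")
```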

In [None]:
fig = px.density_heatmap(df, x="date", y="available",
                         title="Availability Heatmap")
fig.show()


2. Availability Heatmap

Insights:

- There is a strong seasonal pattern in availability.
- Certain months show higher availability, meaning:
  - Lower demand
  - Possibly off-season for tourism
- Other periods show very low availability, suggesting:
  - High tourist influx
  - Peak season / holidays / events
- The heatmap gives a quick overview of how Airbnb supply fluctuates over the year.
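
Seasonal and weekly effects are easier to read when availability is turned into a rate per month and weekday. This is a sketch under the assumption that `available` holds the strings `"t"` and `"f"`; the `demo` frame is made up for illustration.

```python
import pandas as pd

# Made-up calendar rows (illustrative only).
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-05", "2024-01-12", "2024-01-06", "2024-02-02", "2024-02-03"]
    ),
    "available": ["t", "f", "f", "t", "t"],
})

# 1.0 when the listing is available on that day, 0.0 otherwise.
demo["is_available"] = demo["available"].eq("t").astype(float)
demo["month"] = demo["date"].dt.month
demo["weekday"] = demo["date"].dt.day_name()

# Rows: month, columns: weekday, values: share of available listing-days.
rate = demo.pivot_table(
    index="month", columns="weekday", values="is_available", aggfunc="mean"
)
print(rate)
```

A matrix like `rate` can then be passed to a heatmap function to show availability shares instead of raw row counts.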

In [None]:
fig = px.histogram(
    df, 
    x="n_reviews", 
    nbins=50,
    title="Distribution of Number of Reviews"
)
fig.show()


📌 3. Distribution of Number of Reviews

Insights:

- Most listings have very few reviews, indicating:
  - Many new or rarely booked listings
- A small number of listings have a very large number of reviews, which suggests:
  - They are very popular
  - Strong guest satisfaction
  - High occupancy rate
- The distribution is heavily right-skewed, typical for marketplaces.
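
The skew can be quantified per listing rather than per calendar row. The sketch below assumes one `n_reviews` value per `listing_id` and treats a missing value as "no reviews" (a listing absent from the reviews table gets `NaN` after a left merge); the `demo` frame is made up.

```python
import pandas as pd

# Made-up merged rows: listings repeat once per sampled calendar day.
demo = pd.DataFrame({
    "listing_id": [1, 1, 2, 3, 3, 4, 5],
    "n_reviews": [2.0, 2.0, None, 150.0, 150.0, 1.0, 4.0],
})

# One value per listing; NaN means the listing had no reviews to aggregate.
per_listing = (
    demo.drop_duplicates("listing_id")
    .set_index("listing_id")["n_reviews"]
    .fillna(0)
)

median_reviews = per_listing.median()
skewness = per_listing.skew()  # positive values indicate a long right tail
print(f"median={median_reviews}, skew={skewness:.2f}")
```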

In [None]:
df_daily_availability = df.groupby("date")["available"].value_counts().rename("count").reset_index()

fig = px.line(df_daily_availability, x="date", y="count", color="available",
              title="Availability Over Time")
fig.show()


📌 4. Availability Over Time

Insights:

- The number of available (`t`) and not available (`f`) listings changes over time.
- The lines suggest:
  - Seasonal dips when listings get booked heavily
  - Gradual variations throughout the year
- Comparing the two helps understand:
  - When booking activity is highest
  - Possible tourism cycles
- This is useful for predicting demand periods.
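
Daily counts from a random sample are noisy, so a daily availability rate with a rolling mean can make trends easier to see. This is a sketch on made-up data; the 3-day window is an arbitrary choice for the example (a 7-day window would be a natural option on real data).

```python
import pandas as pd

# Made-up calendar: 6 days with 2 sampled listings per day.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
demo = pd.DataFrame({
    "date": dates.repeat(2),
    "available": ["t", "t", "t", "f", "f", "f", "t", "f", "t", "t", "f", "f"],
})

# Share of sampled listings that are available on each day.
daily_rate = demo["available"].eq("t").groupby(demo["date"]).mean()

# Rolling mean over 3 days; the first two days have too little history and stay NaN.
smoothed = daily_rate.rolling(window=3, min_periods=3).mean()
print(pd.DataFrame({"daily_rate": daily_rate, "smoothed": smoothed}))
```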

In [None]:
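# Count listings per neighbourhood from the listings table; `neighborhood_overview`
# is free text, so `neighbourhood_cleansed` is used (assumed present in the export)
df_neighborhood = df_listings["neighbourhood_cleansed"].value_counts().reset_index()
df_neighborhood.columns = ["neighborhood", "count"]

fig = px.bar(df_neighborhood.head(20),  # top 20 only
             x="neighborhood", y="count",
             title="Top 20 Neighborhoods with Most Listings")
fig.show()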


### 📍 Insights: Top 20 Neighborhoods with Most Listings
- Some neighborhoods have **way more Airbnb listings** than others.
- The first 3–5 neighborhoods completely dominate the supply.
- This means these areas are the **most popular or crowded** for hosts.
- If Airbnb predicts demand, these neighborhoods are “hot zones.”


In [None]:
fig = px.histogram(
    df, 
    x="minimum_nights_x",
    nbins=50,
    title="Distribution of Minimum Nights"
)
fig.show()


### 🛌 Insights: Minimum Nights Required
- Most listings have **very low minimum nights** (like 1–3 nights).
- Only a few listings require huge minimum stays (100+ nights).
- This tells us the market is mainly for **short-term stays**, not long monthly rentals.
- Long minimum-night values are rare and look like outliers.
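
The "mostly short stays, a few outliers" reading can be backed with simple shares, and clipping keeps the outliers from flattening the histogram. The sketch uses made-up values; the thresholds of 3 and 30 nights are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up minimum-night values (illustrative only).
min_nights = pd.Series([1, 1, 2, 2, 3, 3, 5, 30, 90, 365], name="minimum_nights")

short_stay_share = (min_nights <= 3).mean()   # listings bookable for 3 nights or fewer
long_stay_share = (min_nights >= 30).mean()   # listings requiring 30+ nights

# Cap extreme values so the bulk of the distribution stays visible when plotted.
clipped = min_nights.clip(upper=30)

print(f"short: {short_stay_share:.0%}, long: {long_stay_share:.0%}, max after clip: {clipped.max()}")
```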


In [None]:
df_daily_reviews = df.groupby("date")["n_reviews"].sum().reset_index()

fig = px.line(df_daily_reviews, x="date", y="n_reviews",
              title="Total Reviews per Day")
fig.show()


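### ⭐ Insights: Total Reviews per Day
- The daily totals bounce up and down but stay in a similar range.
- Note that `n_reviews` is each listing's total review count, summed over the calendar rows sampled for that date, so the line reflects which listings were sampled on each day rather than how many reviews guests wrote that day.
- Occasional spikes are most likely days on which heavily reviewed listings happened to be sampled.
- No sustained drop or rise is visible, which is what a uniform random sample would produce; on its own it does not show that platform demand is stable.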
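
To see how many reviews were actually written on each day, the reviews table can be counted by its own `date` column. This is a minimal sketch on made-up data; `reviews_demo` is a hypothetical stand-in that assumes the same `listing_id`, `id`, and `date` columns used in the aggregation earlier.

```python
import pandas as pd

# Made-up reviews: one row per review (illustrative only).
reviews_demo = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3],
    "id": [101, 102, 103, 104, 105],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-04", "2024-01-04"],
})
reviews_demo["date"] = pd.to_datetime(reviews_demo["date"])

# Count reviews per calendar day; days without reviews appear with a count of 0.
reviews_per_day = reviews_demo.set_index("date").resample("D")["id"].count()
print(reviews_per_day)
```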


In [None]:
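# Count listings per host from the listings table (one row per listing);
# counting rows of `df` would count sampled calendar days instead
df_host = df_listings["host_id"].value_counts().reset_index()
df_host.columns = ["host_id", "num_listings"]

fig = px.histogram(df_host, x="num_listings", nbins=50,
                   title="Distribution of Number of Listings per Host")
fig.show()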


### 👤 Insights: Number of Listings per Host
- Most hosts only have **1 or 2 listings**.
- A very small number of hosts own **hundreds or thousands** of listings.
- This means the platform is mostly **small individual hosts**, with a few **big business hosts** dominating the rest.
- Classic “long tail” distribution.
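
The long-tail claim can be summarised with two numbers: the share of hosts with a single listing and the share of listings held by multi-listing hosts. The sketch below uses a made-up `listings_demo` table with one row per listing.

```python
import pandas as pd

# Made-up listings table: hosts 1-6 own one listing each, host 7 owns four.
listings_demo = pd.DataFrame({
    "id": range(1, 11),
    "host_id": [1, 2, 3, 4, 5, 6, 7, 7, 7, 7],
})

# Number of distinct listings per host.
per_host = listings_demo.groupby("host_id")["id"].nunique()

single_host_share = (per_host == 1).mean()                           # share of hosts
multi_listing_share = per_host[per_host > 1].sum() / per_host.sum()  # share of listings

print(f"single-listing hosts: {single_host_share:.0%}")
print(f"listings held by multi-listing hosts: {multi_listing_share:.0%}")
```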


# 📊 Exploratory Data Analysis: Summary of Insights

### **1. Listings Activity Over Time**
The number of active Airbnb listings remains stable across the observed dates, indicating a consistent supply of listings in the selected city with no major spikes or drops.

### **2. Availability Behavior**
Availability fluctuates on a day-to-day basis. Certain days show more available listings while others show more booked listings, suggesting seasonal effects or weekly booking patterns.

### **3. Review Dynamics**
Total reviews per day show continuous but uneven guest engagement. Peaks in review activity may correspond to holidays, high-tourism periods, or events happening in the city.

### **4. Host Listing Distribution**
Most hosts own only a single listing, indicating that the platform is dominated by individual property owners. A small group of hosts owns multiple listings, representing professional hosts or rental companies.

### **5. Neighborhood Distribution**
Listings are unevenly distributed across neighborhoods. Some neighborhoods have a significantly higher concentration of listings, likely reflecting tourist-friendly or high-demand locations.

---

### **Overall Conclusion**
The dataset shows strong temporal patterns (availability + reviews), clear neighborhood clustering, and strong host asymmetry. These patterns guide the feature engineering phase and support the predictive modeling task that follows.

Overall, the analysis gives a clear picture of how Airbnb activity behaves across time, neighborhoods, host behavior, and guest demand:

- Listings are heavily concentrated in a small group of neighborhoods, meaning a few areas dominate the market and attract most of the hosts and guests.
- Availability patterns show clear seasonal cycles, with drops and spikes throughout the year, suggesting periods of high demand (holidays, events, summer) and quieter months.
- Review activity stays consistently high, suggesting that the platform is actively used across the entire year and that guest turnover remains strong.
- Most hosts manage very few listings, but a small number of hosts control a large chunk of the market, showing typical Airbnb “power hosts.”
- Minimum nights are usually low, meaning short stays are the main business model, but there are rare outliers with huge minimum-night requirements.

👉 In simple words:

The market is active, seasonal, and dominated by a few neighborhoods and power hosts, with constant guest activity throughout the year. This gives a stable base for forecasting demand, optimizing pricing, and creating features for any ML model.
