<a href="https://colab.research.google.com/github/cpython-projects/E1402/blob/main/session_06_part_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 🏡 Real Estate Data Visualization Tasks

In [None]:
import pandas as pd
import plotly.express as px
df = pd.read_csv('https://raw.githubusercontent.com/cpython-projects/E1402/refs/heads/main/real_estate_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,last_price,total_area,first_day_exposition,rooms,ceiling_height,floors_total,living_area,floor,is_apartment,studio,kitchen_area,balcony,locality_name,days_exposition
0,0,7312500.0,108.0,2024-05-15,3,2.7,16.0,51.0,8,,False,25.0,,Kyiv,
1,1,1884375.0,40.4,2024-08-14,1,,11.0,18.6,1,,False,11.0,2.0,Brovary,81.0
2,2,2922750.0,56.0,2023-11-06,2,,5.0,34.3,4,,False,8.3,0.0,Kyiv,558.0
3,3,36506250.0,159.0,2024-03-19,3,,14.0,,9,,False,,0.0,Kyiv,424.0
4,4,5625000.0,100.0,2024-06-12,2,3.03,14.0,32.0,13,,False,41.0,,Kyiv,121.0


In [None]:
df["price_per_sqm"] = df["last_price"] / df["total_area"]

## 💰 Prices

### 1. Build a **histogram of apartment prices (`last_price`)**. Identify the main price segments: mass market vs premium.

In [None]:
fig_hist = px.histogram(
    df,
    x="last_price",
    nbins=100,
    title="Histogram of property prices",
    labels={"last_price": "Price (UAH)"}
)

fig_hist.show()

*  It shows the distribution of prices
*  Mass market — the area where most ads are concentrated (peak)
*  Premium — the long “tail” on the right (expensive apartments)

---

What to pay attention to:

*  The main frequency peak (where there are most apartments)
*  The long tail — expensive apartments

---

Conclusion:
*  The distribution is strongly right-skewed: a dense mass-market peak plus a thin premium tail
*  Splitting the dataset into these two segments (mass market and premium) keeps the tail from dominating further analysis

**How to split into two segments?**

In the analysis of prices, salaries, incomes, and other *skewed* distributions, the 90th or 95th percentile is often used as a cutoff point (a rule of thumb, not a formal criterion).  

This means: 95% of the items belong to the “regular” market, while the top 5% represent the premium segment.  

The 95th percentile is not an absolute truth, but rather a practical assumption.  

The boundary can be chosen based on:  
* **Visual analysis** of the histogram (where the sharp decline occurs),
* **A fixed threshold** (e.g., 10 million UAH — a business decision),
* **Percentiles** (90th, 95th, 99th).

In [None]:
threshold = df["last_price"].quantile(0.95)
df_regular = df[df["last_price"] < threshold]
df_premium = df[df["last_price"] >= threshold]
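
Since the 95th percentile is only one reasonable choice, it can help to see how the segment sizes move with the cutoff. The sketch below uses a small made-up price series (not the real dataset) and a hypothetical helper `segment_sizes`; with the notebook's data you would pass `df["last_price"]` instead.

```python
import pandas as pd


def segment_sizes(prices: pd.Series, q: float) -> dict:
    """Split prices at the q-th quantile and report the size of each segment."""
    threshold = prices.quantile(q)
    return {
        "quantile": q,
        "threshold": float(threshold),
        "regular": int((prices < threshold).sum()),
        "premium": int((prices >= threshold).sum()),
    }


# Made-up prices in UAH: 1,000,000 to 20,000,000 in steps of 1,000,000 (illustration only).
toy_prices = pd.Series([i * 1_000_000 for i in range(1, 21)], dtype="float64")

summary = pd.DataFrame([segment_sizes(toy_prices, q) for q in (0.90, 0.95, 0.99)])
print(summary)
```

If the premium segment barely changes between two candidate cutoffs, the choice between them matters little for the downstream charts.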

In [None]:
px.histogram(
    df_regular,
    x="last_price",
    nbins=100,
    title="Histogram of property prices (regular segment)",
    labels={"last_price": "Price (UAH)"}
).show()

In [None]:
px.histogram(
    df_premium,
    x="last_price",
    nbins=100,
    title="Histogram of property prices (premium segment)",
    labels={"last_price": "Price (UAH)"}
).show()

### 2. Create a **barplot of the median price per square meter by city (`locality_name`)**. Which city is the most expensive?

In [None]:
df_tmp = df.groupby('locality_name')['price_per_sqm'].median().reset_index()

fig = px.bar(
    df_tmp,
    x="locality_name",
    y="price_per_sqm",
    title="Median price per square meter by city")
fig.show()

* Bars are a natural fit for comparing one numeric value across categories such as cities.
* The **median** is better than the mean, since it is robust to outliers (very expensive penthouses do not distort the result).
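
To make the robustness argument concrete, here is a tiny sketch with made-up prices per m² (not taken from the dataset), where a single extreme listing is enough to drag the mean far from the typical value while the median stays put.

```python
from statistics import mean, median

# Made-up prices per m² (UAH) for one city; the last value is an extreme outlier.
prices_per_sqm = [38_000, 39_000, 40_000, 41_000, 400_000]

mean_price = mean(prices_per_sqm)      # pulled far up by the single outlier
median_price = median(prices_per_sqm)  # stays at the middle listing

print("mean:  ", mean_price)
print("median:", median_price)
```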

---

**What to pay attention to:**

* The **height of the bars**.
* The **top city** = the tallest bar.

---

**Conclusion:**

* The most expensive city is **Kyiv** with a median of about **58,900 UAH/m²**.
* Other cities (Brovary, Boyarka, Hostomel, Bucha, Boryspil, Irpin, Vyshneve, Borshchahivka) have noticeably lower median prices, typically in the range of **37,000–40,000 UAH/m²**.
* This confirms that **Kyiv dominates the real estate market** in terms of price per square meter, while the satellite towns around it form a more affordable segment.

### 3. Build a **boxplot of price per square meter vs number of rooms**. Are one-room apartments more expensive per m² than three-room apartments?

In [None]:
fig = px.box(
    df_regular,
    x="rooms",
    y="price_per_sqm",
    title="Price per m² depending on the number of rooms",
    labels={"rooms": "Number of Rooms", "price_per_sqm": "Price per m²"}
)

fig.show()

1. **The median price per m² is almost the same** across all apartment sizes (from studios to 7-room flats). This means the market values a square meter similarly, regardless of the number of rooms.
2. **High variability (outliers)** appears in small apartments (0–2 rooms). Some listings show extremely high prices per m² (over 200,000–300,000), far above the market average.
3. Larger apartments (5–7 rooms) show **lower variability**, so the market is more “stable” for them.
4. **Studios and one-room flats** have the widest price range. This suggests different market segments: budget housing vs. luxury apartments.

---

**Business implications:**

* **Number of rooms is not the main driver of price per m².** Location, condition, building type, and floor are probably more important.
* **Check the outliers.** Prices above 200–300k per m² are likely either data entry errors or luxury properties that distort the analysis.
* **For marketing:** studios and one-room apartments represent very different buyer segments. They should be marketed separately.
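
One common way to "check the outliers" systematically is the 1.5 × IQR fence, the same idea box plots use for their whiskers. The sketch below is only an illustration: the helper `flag_iqr_outliers` and the toy numbers are assumptions, not part of the original analysis.

```python
import pandas as pd


def flag_iqr_outliers(data: pd.DataFrame, value_col: str, group_col: str,
                      k: float = 1.5) -> pd.Series:
    """True where value_col lies outside [Q1 - k*IQR, Q3 + k*IQR] of its group."""
    grouped = data.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (data[value_col] < q1 - k * iqr) | (data[value_col] > q3 + k * iqr)


# Made-up listings: six one-room flats (one suspiciously expensive) and five three-room flats.
toy = pd.DataFrame({
    "rooms": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "price_per_sqm": [50_000, 52_000, 54_000, 56_000, 58_000, 300_000,
                      48_000, 50_000, 52_000, 54_000, 56_000],
})

mask = flag_iqr_outliers(toy, "price_per_sqm", "rooms")
print(toy[mask])  # rows to inspect manually before deciding whether to drop them
```

Flagged rows are candidates for review rather than automatic deletion, since a genuine luxury listing and a data entry error look the same to this rule.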

In [None]:
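# Drop the single most extreme one-room listing (by price per m²) so the box plot is easier to read
idx_max = df_regular.loc[df_regular["rooms"] == 1, "price_per_sqm"].idxmax()
df_regular = df_regular.drop(idx_max)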
fig = px.box(
    df_regular,
    x="rooms",
    y="price_per_sqm",
    title="Price per m² depending on the number of rooms",
    labels={"rooms": "Number of Rooms", "price_per_sqm": "Price per m²"}
)

fig.show()

*  One-room flats have a slightly higher median price per m² than three-room flats
*  The difference is small, though

## 🏠 Area and Layout

### 4. Create a **scatterplot of living area vs total area**. Is the share of living area consistent?

In [None]:
fig = px.scatter(
    df_regular,
    x="total_area",
    y="living_area",
    title="Living Area vs Total Area",
    labels={
        'total_area': 'Total Area (m²)',
        'living_area': 'Living Area (m²)'
    },
    opacity=0.5
)
fig.show()

**How to read this chart**

The chart shows a scatter plot:
* **X-axis (horizontal)** → total area of the property (m²).
* **Y-axis (vertical)** → living area (m²).
* **Each point** = one property.

The points form an elongated “cloud” from the bottom left to the top right — the larger the total area, the larger the living area. But the living area is never equal to the total: part of the space is always taken by hallways, bathrooms, storage rooms, etc.

**The diagonal line (y = x)**

*  If we draw a line at a 45° angle (equation y = x), it represents the case where living area equals total area.
*  In practice, the points lie below this line, because living area cannot exceed total area.
*  The closer a point is to the line, the less “non-living” space the property has.
*  The farther away it is, the larger the share of the total area taken up by non-living rooms.
*  This diagonal **is not a trend line**, only a reference.

In [None]:
import plotly.graph_objects as go

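# Range of total areas that the reference line should span
min_area = df_regular["total_area"].min()
max_area = df_regular["total_area"].max()

# Reference diagonal y = x (living area equal to total area)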
fig.add_trace(
    go.Scatter(
        x=[min_area, max_area],
        y=[min_area, max_area],
        mode="lines",
        name="y = x",
        line=dict(color="red", dash="dash", width=2)
    )
)

fig.show()

**Trend line (regression)**  

To understand the **average relationship**, we usually plot a **trend line** (linear regression line):

* It shows the average dependence: how living area grows as total area increases.
* The equation of the line is approximately:

  $$
  LivingArea = a + b \cdot TotalArea
  $$

  where:

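  * **a** is the intercept: the fitted value at zero total area. It has no direct physical meaning and can come out negative.
  * **b** is the slope: the additional living area per extra m² of total area. It equals the living share only when the intercept is zero.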

In [None]:
import numpy as np

# Remove rows with missing or non-positive areas
df_clean = df_regular.dropna(subset=["total_area", "living_area"])
df_clean = df_clean[(df_clean["total_area"] > 0) & (df_clean["living_area"] > 0)]

x = df_clean["total_area"].values
y = df_clean["living_area"].values

# np.polyfit returns coefficients from the highest degree down: slope first, then intercept
b, a = np.polyfit(x, y, 1)
print(f"Equation of the line: y = {a:.2f} + {b:.2f}x")

Equation of the line: y = -3.44 + 0.64x


In [None]:
reg_x = np.linspace(min_area, max_area, 100)
reg_y = a + b * reg_x

fig.add_trace(
    go.Scatter(
        x=reg_x,
        y=reg_y,
        mode="lines",
        name=f"Trend: y = {a:.1f} + {b:.2f}x",
        line=dict(color="green", width=4)
    )
)

fig.show()

**What should the slope be?**

* The slope (**b**) should be less than **1**: an extra m² of total area cannot add more than 1 m² of living area.
* Typically, **b ≈ 0.6–0.8** (about 60–80% of each additional m² is living space).
* Higher b → more “compact” apartments with little space spent on hallways.
* Lower b → more auxiliary space.

**So:**

* The diagonal y = x is the theoretical upper bound.
* The trend line (regression) shows the average relationship in the data.
* The slope of the trend approximates the average share of living space, as long as the intercept is small compared with the areas involved.
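
A short worked example helps here. For a line y = a + b·x, the living share at a given total area is y/x = b + a/x, so it matches the slope exactly only when the intercept is zero. The sketch below plugs in the rounded coefficients printed by the fit above; the helper name `implied_living_share` is an assumption made for illustration.

```python
def implied_living_share(total_area: float, a: float, b: float) -> float:
    """Living share y/x implied by the line y = a + b*x at a given total area."""
    return (a + b * total_area) / total_area


# Rounded coefficients from the fit printed earlier in this notebook.
a_rounded, b_rounded = -3.44, 0.64

for area in (30, 60, 120, 200):
    share = implied_living_share(area, a_rounded, b_rounded)
    print(f"{area:>4} m² -> implied living share {share:.3f}")
```

With a negative intercept the implied share starts below the slope and approaches it as the total area grows.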


**Is the share of living area consistent?**

*Green trend line (y ≈ -3.4 + 0.64x)*  

* Each additional m² of total area adds about **0.64 m²** of living area.
* Because the intercept is negative, the share implied by the line is 0.64 − 3.44/x: about **53%** at 30 m² and about **62%** at 200 m². Read literally, the fit gives small apartments a slightly *lower* living share, not a higher one.
* For **large apartments (150–300 m²)** the trend line sits far below the diagonal, but that is a growing *absolute* gap (more m² of hallways, storage rooms, secondary spaces), not a falling share.
* Individual properties vary widely: some have a living share of 70–80%, while others drop below 50%. This reflects differences in layout and property type.

---

*Conclusion*

* The living share is **not constant** across listings.
* On the fitted line it is roughly **58–63%** for apartments between 60 and 300 m².
* The fit implies a somewhat lower share for the **smallest apartments**, approaching 64% as total area grows.
* A straight-line fit is a coarse summary, so the pattern is worth confirming on the per-listing ratio `living_area / total_area` before drawing firm conclusions.
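
Rather than reading the share off the trend line, one option is to compute the ratio `living_area / total_area` for each listing and compare its median across size bins. The sketch below is an assumption-laden illustration: the helper `living_share_by_size`, the bin edges, and the toy rows are all made up, so its output says nothing about the real dataset.

```python
import pandas as pd


def living_share_by_size(data: pd.DataFrame, bins: list) -> pd.Series:
    """Median living_area / total_area ratio within each total-area bin."""
    valid = data.dropna(subset=["total_area", "living_area"])
    valid = valid[(valid["total_area"] > 0) & (valid["living_area"] > 0)]
    share = valid["living_area"] / valid["total_area"]
    size_bin = pd.cut(valid["total_area"], bins=bins)
    return share.groupby(size_bin, observed=True).median()


# Made-up listings (illustration only); the last row has a missing total area and is skipped.
toy = pd.DataFrame({
    "total_area": [30.0, 40.0, 60.0, 80.0, 150.0, 200.0, float("nan")],
    "living_area": [15.0, 22.0, 36.0, 48.0, 96.0, 130.0, 50.0],
})

share_by_bin = living_share_by_size(toy, bins=[0, 50, 100, 300])
print(share_by_bin)
```

With the notebook's data, the same function could be called on `df_regular` to see whether the median share actually rises or falls with apartment size.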

### 5. Plot a **histogram of apartment total areas**. Do we see standard sizes (e.g., 30–35 m² for one-room flats)?

In [None]:
fig = px.histogram(
    df_regular,
    x="total_area",
    nbins=100,
    title="Distribution of total apartment area",
    labels={"total_area": "Total area (m²)"}
)

fig.show()

Looking at the histogram of **total apartment area**:

1. **Clear peaks exist** around:

   * **30–35 m²** → this is a very common size for **one-room (studio/1-bedroom) apartments**.
   * **40–50 m²** → typical for **two-room flats**.
   * **55–65 m²** → often corresponds to **three-room flats**.

2. After \~70 m², the distribution **smooths out** and the counts decrease gradually. Larger apartments (80–120+ m²) are less common but still present, with a long tail up to \~300 m².

3. The **biggest spike** is around **40–45 m²**, which likely reflects the **standardized Soviet-era panel housing sizes** (common in Eastern Europe).

---

**Conclusion:**

Yes — we clearly see **standard size clusters**:

* \~30–35 m² → studios / one-room
* \~40–50 m² → two-room
* \~55–65 m² → three-room
* Larger sizes appear but with much lower frequency.
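
The histogram alone does not say which room count produces each peak, so a quick cross-check is to summarise total area per room count. The sketch below uses made-up rows purely for illustration; with the notebook's data you would group `df_regular` instead of `toy`.

```python
import pandas as pd

# Made-up listings (illustration only): three flats for each room count.
toy = pd.DataFrame({
    "rooms": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "total_area": [31.0, 33.0, 36.0, 44.0, 47.0, 52.0, 58.0, 62.0, 70.0],
})

# Lower quartile, median, and upper quartile of total area for each room count.
area_by_rooms = (
    toy.groupby("rooms")["total_area"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
area_by_rooms.columns = ["q25", "median", "q75"]
print(area_by_rooms)
```

If the quartile ranges of neighbouring room counts barely overlap, the peaks in the histogram can reasonably be attributed to room count. Passing `color="rooms"` to `px.histogram` is another way to see the same split visually.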

### 6. Build a **scatterplot of total area vs number of rooms**. Are there anomalies, like a one-room flat with 100 m²?

In [None]:
fig = px.scatter(
    df_regular,
    x="rooms",
    y="total_area",
    title="Total area vs Number of rooms",
    labels={
        'total_area': 'Total Area (m²)',
        'rooms': 'Number of rooms'
    },
    opacity=0.5
)

fig.show()

This scatterplot does show anomalies.

For example:

* A **1-room flat with \~190 m²** is highly unusual — one-room apartments are typically much smaller.
* Similarly, there are **1-room flats above 100 m²**, which are rare and could be either luxury studios, data entry errors, or misclassified listings.
* On the other end, there are **5–6 room flats with only \~50–60 m²**, which also look unrealistic (normally, more rooms → larger area).

These points don’t fit the expected trend (more rooms should generally mean more total area), so they might be:

1. **Data entry mistakes** (wrong number of rooms or wrong area).
2. **Special cases** (e.g., lofts, studios with large open spaces, subdivided flats).
3. **Outliers worth checking** — depending on your analysis, you might remove them or investigate further.
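
A simple way to list such points for review is to look at the area per room and flag values outside a plausible range. The sketch below is a rough illustration: the helper `flag_area_room_anomalies` and both bounds are hypothetical and should be tuned to the market being analysed.

```python
import pandas as pd

# Hypothetical plausibility bounds for m² per room; not derived from the dataset.
MIN_SQM_PER_ROOM = 12.0
MAX_SQM_PER_ROOM = 80.0


def flag_area_room_anomalies(data: pd.DataFrame, lo: float, hi: float) -> pd.Series:
    """True where total_area / rooms falls outside [lo, hi]."""
    # Listings with 0 rooms (studios) become NaN here and are never flagged.
    rooms = data["rooms"].where(data["rooms"] > 0)
    sqm_per_room = data["total_area"] / rooms
    return (sqm_per_room < lo) | (sqm_per_room > hi)


# Made-up listings (illustration only).
toy = pd.DataFrame({
    "rooms": [1, 1, 2, 3, 5, 0],
    "total_area": [35.0, 190.0, 50.0, 65.0, 55.0, 28.0],
})

mask = flag_area_room_anomalies(toy, MIN_SQM_PER_ROOM, MAX_SQM_PER_ROOM)
print(toy[mask])  # the 190 m² one-room flat and the 55 m² five-room flat
```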

## 📈 Market Dynamics

7. Use a **lineplot of monthly publications (`first_day_exposition`)**. Is there seasonality in new listings?

8. Plot a **lineplot of average price per m² over time**. Do we observe growth or decline?

9. Create a **lineplot of average exposure time (`days_exposition`) over time**. Are apartments selling faster or slower?

## 🏢 Buildings and Floors

10. Build a **boxplot of prices by floor category**: first, last, and middle floors. Which category is more expensive?

11. Create a **scatterplot of price per m² vs total number of floors in the building (`floors_total`)**. Do high-rise apartments cost more?

12. Make a **barplot of the number of listings by total floors**. Which types of buildings dominate in each city?


## 🛋️ Apartment Features

13. Build a **barplot of average prices for apartments with and without balconies**. Does having a balcony increase the price?

14. Plot a **boxplot of price per m² for studios vs non-studios**. Are studios cheaper per square meter?

15. Create a **scatterplot of ceiling height vs price per m²**. Are apartments with ceilings above 3 meters more expensive?

## 🔥 Sales and Exposure

16. Build a **barplot of median exposure time (`days_exposition`) by city**. In which city do apartments sell fastest?

17. Create a **scatterplot of price vs exposure time**. Do more expensive apartments take longer to sell?

18. Plot a **barplot of average exposure time by number of rooms**. Which type of apartment sells the fastest?

## 🔗 Correlations

19. Build a **heatmap of correlations between numeric features** (price, area, rooms, floors, exposure time). Which factors influence price the most?

20. Create a **scatterplot matrix (pairplot)** of price, total area, living area, kitchen area, and number of rooms. What relationships can you see?