<a href="https://colab.research.google.com/github/cpython-projects/da_vn/blob/main/session_05_part_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Real estate dataset (real_estate_data.csv)

---
| Column Name             | Description |
|-------------------------|-------------|
| `last_price`            | Final listing price of the property (in local currency)                     |
| `total_area`            | Total area of the property in square meters                                 |
| `first_day_exposition`  | Date when the listing first appeared                                        |
| `rooms`                 | Number of rooms in the property                                              |
| `ceiling_height`        | Height of the ceilings (in meters)                                          |
| `floors_total`          | Total number of floors in the building                                      |
| `living_area`           | Size of the living space in square meters                                   |
| `floor`                 | Floor on which the apartment is located                                     |
| `is_apartment`          | Boolean indicating whether it’s officially considered an apartment          |
| `studio`                | Boolean indicating whether it’s a studio (one-room open layout)             |
| `open_plan`             | Boolean indicating whether the apartment has an open floor plan             |
| `kitchen_area`          | Size of the kitchen in square meters                                        |
| `balcony`               | Number of balconies (can be 0 or more)                                      |
| `locality_name`         | Name of the locality or neighborhood                                        |
| `parks_around3000`      | Number of parks within 3000 meters                                          |
| `parks_nearest`         | Distance to the nearest park (in meters)                                    |
| `ponds_around3000`      | Number of ponds within 3000 meters                                          |
| `ponds_nearest`         | Distance to the nearest pond (in meters)                                    |
| `days_exposition`       | Number of days the property remained on the market                          |
---

### Data Reading

In [1]:
from google.colab import files
uploaded = files.upload()

Saving real_estate_data.csv to real_estate_data.csv


In [2]:
import pandas as pd
df = pd.read_csv('real_estate_data.csv', encoding='utf-8')

In [3]:
df.head()

,Unnamed: 0,last_price,total_area,first_day_exposition,rooms,ceiling_height,floors_total,living_area,floor,is_apartment,studio,open_plan,kitchen_area,balcony,locality_name,parks_around3000,parks_nearest,ponds_around3000,ponds_nearest,days_exposition
0,0,13000000.0,108.0,2019-03-07T00:00:00,3,2.7,16.0,51.0,8,,False,False,25.0,,Kyiv,1.0,482.0,2.0,755.0,
1,1,3350000.0,40.4,2018-12-04T00:00:00,1,,11.0,18.6,1,,False,False,11.0,2.0,Brovary,0.0,,0.0,,81.0
2,2,5196000.0,56.0,2015-08-20T00:00:00,2,,5.0,34.3,4,,False,False,8.3,0.0,Kyiv,1.0,90.0,2.0,574.0,558.0
3,3,64900000.0,159.0,2015-07-24T00:00:00,3,,14.0,,9,,False,False,,0.0,Kyiv,2.0,84.0,3.0,234.0,424.0
4,4,10000000.0,100.0,2018-06-19T00:00:00,2,3.03,14.0,32.0,13,,False,False,41.0,,Kyiv,2.0,112.0,1.0,48.0,121.0


In [4]:
df = df.drop('Unnamed: 0', axis=1)
df.head()

,last_price,total_area,first_day_exposition,rooms,ceiling_height,floors_total,living_area,floor,is_apartment,studio,open_plan,kitchen_area,balcony,locality_name,parks_around3000,parks_nearest,ponds_around3000,ponds_nearest,days_exposition
0,13000000.0,108.0,2019-03-07T00:00:00,3,2.7,16.0,51.0,8,,False,False,25.0,,Kyiv,1.0,482.0,2.0,755.0,
1,3350000.0,40.4,2018-12-04T00:00:00,1,,11.0,18.6,1,,False,False,11.0,2.0,Brovary,0.0,,0.0,,81.0
2,5196000.0,56.0,2015-08-20T00:00:00,2,,5.0,34.3,4,,False,False,8.3,0.0,Kyiv,1.0,90.0,2.0,574.0,558.0
3,64900000.0,159.0,2015-07-24T00:00:00,3,,14.0,,9,,False,False,,0.0,Kyiv,2.0,84.0,3.0,234.0,424.0
4,10000000.0,100.0,2018-06-19T00:00:00,2,3.03,14.0,32.0,13,,False,False,41.0,,Kyiv,2.0,112.0,1.0,48.0,121.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23699 entries, 0 to 23698
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   last_price            23699 non-null  float64
 1   total_area            23699 non-null  float64
 2   first_day_exposition  23699 non-null  object 
 3   rooms                 23699 non-null  int64  
 4   ceiling_height        14504 non-null  float64
 5   floors_total          23613 non-null  float64
 6   living_area           21796 non-null  float64
 7   floor                 23699 non-null  int64  
 8   is_apartment          2775 non-null   object 
 9   studio                23699 non-null  bool   
 10  open_plan             23699 non-null  bool   
 11  kitchen_area          21421 non-null  float64
 12  balcony               12180 non-null  float64
 13  locality_name         23699 non-null  object 
 14  parks_around3000      18181 non-null  float64
 15  parks_nearest      

In [6]:
df.isnull().sum()

last_price,0
total_area,0
first_day_exposition,0
rooms,0
ceiling_height,9195
floors_total,86
living_area,1903
floor,0
is_apartment,20924
studio,0


### Data Cleaning and Preparation

#### **Incorrect Formats**

In [7]:
df.first_day_exposition = pd.to_datetime(df.first_day_exposition)
df.head()

,last_price,total_area,first_day_exposition,rooms,ceiling_height,floors_total,living_area,floor,is_apartment,studio,open_plan,kitchen_area,balcony,locality_name,parks_around3000,parks_nearest,ponds_around3000,ponds_nearest,days_exposition
0,13000000.0,108.0,2019-03-07,3,2.7,16.0,51.0,8,,False,False,25.0,,Kyiv,1.0,482.0,2.0,755.0,
1,3350000.0,40.4,2018-12-04,1,,11.0,18.6,1,,False,False,11.0,2.0,Brovary,0.0,,0.0,,81.0
2,5196000.0,56.0,2015-08-20,2,,5.0,34.3,4,,False,False,8.3,0.0,Kyiv,1.0,90.0,2.0,574.0,558.0
3,64900000.0,159.0,2015-07-24,3,,14.0,,9,,False,False,,0.0,Kyiv,2.0,84.0,3.0,234.0,424.0
4,10000000.0,100.0,2018-06-19,2,3.03,14.0,32.0,13,,False,False,41.0,,Kyiv,2.0,112.0,1.0,48.0,121.0


In [8]:
df.is_apartment.unique()

array([nan, False, True], dtype=object)

In [10]:
# Assumption: a missing flag means the listing was not marked as an apartment
df['is_apartment'] = df['is_apartment'].fillna(False).astype(bool)

In [11]:
df.locality_name.unique()

array(['Kyiv', 'Brovary', 'Boyarka', 'Hostomel', 'Bucha', 'Boryspil',
       'Irpin', 'Vyshneve', 'Borshchahivka'], dtype=object)

#### **Identifying Duplicates in the Dataset**

In [12]:
duplicate_rows = df.duplicated().sum()
print(duplicate_rows)

0


#### **Data Cleaning Recommendations for Missing Values**

In [14]:
df.isnull().sum()

last_price,0
total_area,0
first_day_exposition,0
rooms,0
ceiling_height,9195
floors_total,86
living_area,1903
floor,0
is_apartment,0
studio,0


In [15]:
df.balcony.unique()

array([nan,  2.,  0.,  1.,  5.,  4.,  3.])

In [16]:
df['balcony'] = df.balcony.fillna(0)

In [17]:
studios = df[df['studio']]
studios.shape

(149, 19)

In [19]:
studio_missing_kitchen = df[df['studio'] & df['kitchen_area'].isna()]
studio_missing_kitchen.shape

(149, 19)

In [21]:
# Studios have no separate kitchen, so record their missing kitchen area as 0
df.loc[df['studio'] & df['kitchen_area'].isna(), 'kitchen_area'] = 0

In [22]:
df.isnull().sum()

last_price,0
total_area,0
first_day_exposition,0
rooms,0
ceiling_height,9195
floors_total,86
living_area,1903
floor,0
is_apartment,0
studio,0
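
The two fills above can be replayed on a small frame to confirm what they do. This is a minimal sketch: the `toy` frame and its values are invented for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match real_estate_data.csv
toy = pd.DataFrame({
    'studio': [True, False, False, True],
    'kitchen_area': [np.nan, 9.0, np.nan, np.nan],
    'balcony': [np.nan, 1.0, 0.0, np.nan],
})

# A missing balcony count is read as "no balcony"
toy['balcony'] = toy['balcony'].fillna(0)

# Studios get kitchen_area = 0; other listings keep their gaps
toy.loc[toy['studio'] & toy['kitchen_area'].isna(), 'kitchen_area'] = 0

# Share of missing values per column after the fills
missing_share = toy.isnull().mean()
print(missing_share)
```

Reporting the share of missing values (`isnull().mean()`) instead of the raw count makes columns easier to compare with each other.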


### What about scales?

**Nominal Scale**  
(No natural order, categorical labels)  

| Column                | Description                              |
|-----------------------|-------------------------------------------|
| `locality_name`       | Name of the location                     |

`first_day_exposition` is not nominal: it was parsed to `datetime` above, and dates are ordered with equal intervals between them (an interval scale).


---



**Ordinal Scale**  
(Ordered values; the columns below are integer counts, so they can equally be treated as discrete quantitative data, which is how `rooms` is analyzed later)

| Column            | Description                                |
|-------------------|---------------------------------------------|
| `rooms`           | Number of rooms                            |
| `floor`           | Floor number in the building               |
| `floors_total`    | Total number of floors in the building     |
| `balcony`         | Number of balconies                        |
| `parks_around3000`| Number of parks within 3 km                |
| `ponds_around3000`| Number of ponds within 3 km                |

---


**Quantitative Scale**  
(Continuous or discrete numeric data with meaningful order and equal intervals)

| Column           | Description                                   |
|------------------|-----------------------------------------------|
| `last_price`     | Final listing price                           |
| `total_area`     | Total apartment area                          |
| `living_area`    | Living space area                             |
| `kitchen_area`   | Kitchen area                                  |
| `ceiling_height` | Ceiling height                                |
| `parks_nearest`  | Distance to the nearest park (in meters)      |
| `ponds_nearest`  | Distance to the nearest pond (in meters)      |
| `days_exposition`| Number of days the listing was online         |

---

**Boolean (Binary Nominal)**  
(Can be treated as nominal categorical variables)

| Column         | Description                                                   |
|----------------|---------------------------------------------------------------|
| `studio`       | Whether the flat is a studio apartment                        |
| `open_plan`    | Whether the flat has an open-plan design                      |
| `is_apartment` | Whether the property is officially classified as an apartment |

---


### Central Tendency: Mean, Median, Mode

#### last_price

In [28]:
print(f'Avg: {df.last_price.mean():.2f}')
print(f'Median: {df.last_price.median():.2f}')

Avg: 6541548.77
Median: 4650000.00


**Interpretation**

**Mean > Median**
- This is a clear sign of a **right-skewed distribution**.
- A few **very expensive properties** are pulling the average (mean) price upward.
- **Most listings are priced below the mean**, but a handful of high-priced properties inflate the average.
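
The pull that a single expensive listing exerts on the mean can be seen on a toy series. This is a minimal sketch with made-up prices, not values from the dataset.

```python
import pandas as pd

# Hypothetical prices: four ordinary listings and one luxury outlier
prices = pd.Series([3_000_000, 4_000_000, 4_500_000, 5_000_000, 60_000_000])

mean_price = prices.mean()
median_price = prices.median()

# Fraction of listings priced below the mean
share_below_mean = (prices < mean_price).mean()

print(f'Mean: {mean_price:.0f}')      # Mean: 15300000
print(f'Median: {median_price:.0f}')  # Median: 4500000
print(f'Share below mean: {share_below_mean:.0%}')  # Share below mean: 80%
```

Removing the last value brings the mean down to roughly the median, which is why the median is the more stable summary here.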

---

**What does it mean in practice?**
- **For buyers:** Median is more informative — it reflects the “typical” price.
- **For analysts:** The mean does not accurately represent the majority of listings. Be cautious when interpreting it.
- **For model developers:** Consider **log-transforming** the `last_price` to normalize the distribution and reduce the effect of outliers.

---

**Summary:**
> “The average real estate price is **6.54 million**, while the median is **4.65 million**, indicating a **right-skewed distribution**. Most properties are priced below the mean, and a few high-end listings pull the average upward. For assessing a typical price, the **median provides a more reliable estimate**.”

#### total_area

In [29]:
print(f'Avg: {df.total_area.mean():.2f}')
print(f'Median: {df.total_area.median():.2f}')

Avg: 60.35
Median: 52.00


**Interpretation**

**Mean > Median**
- This suggests a **right-skewed distribution**.
- There are likely some **very large apartments** that increase the average.
- Most properties are **smaller than the average**.

---

**What does it mean in practice?**
- **For market insights:** The **median** is a better reflection of typical apartment sizes.
- **For planning supply/demand:** The majority of listings are **under 60 m²**, despite a few large units inflating the mean.
- **For developers/investors:** If targeting the median buyer, focus on properties around **50–55 m²**.

**Summary:**

> “The average total area is **60.35 m²**, while the median is **52.00 m²** — indicating that **most apartments are below average size**, with a few very large listings pulling the mean upward. The **median is more representative of a typical home** size in the dataset.”

#### rooms

In [47]:
print(f'Avg rooms: {df.rooms.mean():.2f}')
print(f'Median rooms: {df.rooms.median():.2f}')
print(f'Mode rooms: {df.rooms.mode()[0]}')


Avg rooms: 2.07
Median rooms: 2.00
Mode rooms: 1


**Interpretation:**
- The **mean is slightly higher** than the median. This often suggests that there are some properties with **a larger number of rooms** pulling the average up.
- The **mode being 1** shows that **1-room apartments** are the single most common type, even though the median listing has 2 rooms.
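
To see how a mode of 1 can coexist with a median of 2, it helps to look at the share of each room count. A minimal sketch, assuming a made-up `rooms` series in place of `df.rooms`:

```python
import pandas as pd

# Hypothetical room counts, not taken from the dataset
rooms = pd.Series([1, 1, 1, 2, 2, 3, 3, 4])

# Share of listings per room count, ordered by room count
shares = rooms.value_counts(normalize=True).sort_index()
mode_value = rooms.mode()[0]
median_value = rooms.median()

print(shares)
print(f'Mode: {mode_value}, Median: {median_value}')
```

Here 1-room listings are the most frequent value but make up only 37.5% of the series, so the mode marks the most common category without describing most listings.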

### Measures of Spread. Range, STD, IQR

#### last_price

In [36]:
print(f'Range: {df.last_price.max() - df.last_price.min():.2f}')
print(f'STD: {df.last_price.std():.2f}')
print(f'IQR: {df.last_price.quantile(0.75) - df.last_price.quantile(0.25):.2f}')

Range: 762987810.00
STD: 10887013.27
IQR: 3400000.00


**last_price**
- **Range**: 762,987,810 — huge spread between the cheapest and most expensive property.
- **Standard Deviation (STD)**: 10,887,013.27 — very high, indicating extreme variability in prices.
- **IQR (Interquartile Range)**: 3,400,000, meaning the middle 50% of properties fall within a 3.4M range in local currency (between Q1 and Q3).

**Interpretation**:  
The wide **range** and large **STD** suggest **extreme outliers** (e.g., ultra-luxury properties). The smaller **IQR** compared to the total range shows that while most properties are priced within a tighter band, a few **very expensive outliers** skew the distribution.
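
The IQR can also be used to flag the outliers mentioned above. One common convention (Tukey's rule, an assumption here and not something the notebook applies) marks values more than 1.5 × IQR outside the quartiles. A minimal sketch on made-up numbers, with `iqr_bounds` as a hypothetical helper:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond Q1 and Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices in millions; 100.0 plays the ultra-luxury listing
prices = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper)       # -2.0 14.0
print(outliers.tolist())  # [100.0]
```

Because the quartiles barely move when an extreme value is added, these fences are less affected by the outliers they are meant to detect than a mean ± 3·STD rule.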

#### total_area

In [37]:
print(f'Range: {df.total_area.max() - df.total_area.min():.2f}')
print(f'STD: {df.total_area.std():.2f}')
print(f'IQR: {df.total_area.quantile(0.75) - df.total_area.quantile(0.25):.2f}')

Range: 888.00
STD: 35.65
IQR: 29.90


**total_area**
- **Range**: 888 m² — again, a big spread.
- **STD**: 35.65 m² — moderate variability.
- **IQR**: 29.90 m² — the middle 50% of homes differ in area by about 30 m².

**Interpretation**:  
The **range** shows that there are both very small and very large properties. However, the **IQR and STD** are relatively small compared to the range, meaning **most properties are similar in size**, with a few very large homes increasing the range.

#### rooms

In [38]:
print(f'Range: {df.rooms.max() - df.rooms.min():.2f}')
print(f'STD: {df.rooms.std():.2f}')
print(f'IQR: {df.rooms.quantile(0.75) - df.rooms.quantile(0.25):.2f}')

Range: 19.00
STD: 1.08
IQR: 2.00


**rooms**
- **Range**: 19 rooms — from studio apartments to mansions.
- **STD**: 1.08 — most properties have a similar number of rooms.
- **IQR**: 2, meaning the middle 50% of properties span a range of 2 rooms.

**Interpretation**:  
Despite the large range, the **low STD and IQR** tell us most properties are **clustered around a typical room count**, with very few having extremely high room numbers.
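
The same three measures are computed for each column above, so they can be collected in one table. A minimal sketch, where `spread_summary` is a hypothetical helper and the `toy` frame holds made-up values:

```python
import pandas as pd

def spread_summary(frame, columns):
    """Range, sample standard deviation and IQR for each listed column."""
    rows = {}
    for col in columns:
        s = frame[col]
        rows[col] = {
            'range': s.max() - s.min(),
            'std': s.std(),
            'iqr': s.quantile(0.75) - s.quantile(0.25),
        }
    # One row per column, one column per measure
    return pd.DataFrame(rows).T

# Hypothetical data; only the column names match the dataset
toy = pd.DataFrame({
    'total_area': [30.0, 40.0, 50.0, 60.0, 70.0],
    'rooms': [1, 1, 2, 2, 3],
})

summary = spread_summary(toy, ['total_area', 'rooms'])
print(summary.round(2))
```

On the real data the call would be `spread_summary(df, ['last_price', 'total_area', 'rooms'])`.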

### Measures of Shape. Skewness / Kurtosis

#### last_price

In [44]:
print(f'Skewness: {df.last_price.skew():.2f}')
print(f'Kurtosis: {df.last_price.kurtosis():.2f}')

Skewness: 25.80
Kurtosis: 1277.68


**Understanding the Metrics:**

- **Skewness**:
  - `0` → perfectly symmetrical distribution
  - `>0` → **right-skewed** (tail on the right)
  - `<0` → **left-skewed** (tail on the left)

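- **Kurtosis** (pandas `.kurtosis()` reports *excess* kurtosis, i.e. Fisher's definition, under which a normal distribution scores 0 instead of 3):
  - `≈0` → normal distribution (mesokurtic)
  - `>0` → **heavy tails**, often with a sharp peak (leptokurtic → more outliers)
  - `<0` → **light tails**, flatter distribution (platykurtic → fewer outliers)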
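
A quick way to calibrate these thresholds is to compute both metrics on samples whose shape is known. pandas' `Series.kurtosis()` uses Fisher's definition, so a normal sample should score close to 0. The sketch below uses synthetic samples (the seed and sample size are arbitrary choices), not the real estate data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Symmetric reference sample and a right-skewed, heavy-tailed sample
normal_sample = pd.Series(rng.normal(size=100_000))
skewed_sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=100_000))

print(f'Normal: skew={normal_sample.skew():.2f}, kurtosis={normal_sample.kurtosis():.2f}')
print(f'Lognormal: skew={skewed_sample.skew():.2f}, kurtosis={skewed_sample.kurtosis():.2f}')
```

The normal sample lands near 0 on both metrics, while the lognormal sample shows clearly positive skewness and kurtosis, the same pattern as `last_price`.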

---

- **Skewness: 25.80** → extremely **right-skewed**
- **Kurtosis: 1277.68** → **very heavy-tailed** and **sharp-peaked**

**Interpretation**: Most properties are in the lower price range, but there are **a few extremely expensive ones** that create a long right tail and lots of outliers.

#### total_area

In [45]:
print(f'Skewness: {df.total_area.skew():.2f}')
print(f'Kurtosis: {df.total_area.kurtosis():.2f}')

Skewness: 4.77
Kurtosis: 47.52


- **Skewness: 4.77** → heavily **right-skewed**
- **Kurtosis: 47.52** → sharp peak with many extreme values

**Interpretation**: Most homes are of modest size, but there are some **very large properties** (e.g., mansions or commercial buildings) skewing the data.

#### rooms

In [46]:
print(f'Skewness: {df.rooms.skew():.2f}')
print(f'Kurtosis: {df.rooms.kurtosis():.2f}')

Skewness: 1.52
Kurtosis: 8.69


- **Skewness: 1.52** → moderately **right-skewed**
- **Kurtosis: 8.69** → still peaked, with some outliers

**Interpretation**: Most properties have fewer rooms (e.g., 1–3), but a small number have many rooms (like 8, 10, etc.), which increases skewness and kurtosis.
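
The center and shape metrics for all three columns can be gathered in a single call with `DataFrame.agg`. A minimal sketch on a made-up `toy` frame (only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical listings with one large, expensive outlier
toy = pd.DataFrame({
    'last_price': [3.0e6, 4.0e6, 4.5e6, 5.0e6, 6.0e7],
    'total_area': [30.0, 45.0, 52.0, 60.0, 200.0],
    'rooms': [1, 1, 2, 2, 5],
})

# One row per statistic, one column per feature
shape_summary = toy.agg(['mean', 'median', 'skew', 'kurt'])
print(shape_summary.round(2))
```

Each cell matches the corresponding per-column call used above, for example `toy['rooms'].skew()`.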

## 🔍 **Basic Insights**

### 💰 `last_price` (Property Price):
- **Mean = 6.5M**, **Median = 4.65M** → the mean is much higher than the median → **right-skewed** distribution.
- **Huge range**: nearly **763 million** — extreme variability, very large standard deviation.
- **Skewness = 25.80**, **Kurtosis = 1277.68** → extremely heavy tail and many outliers.

👉 **Conclusion**: A small number of **super-expensive properties** are skewing the entire distribution. Most listings are priced much lower.

---

### 📏 `total_area` (Property Area):
- **Mean = 60.35 m²**, **Median = 52 m²** → again, **right-skewed**.
- Range from small apartments to **888 m²** → highly variable.
- **Skewness = 4.77**, **Kurtosis = 47.52** → many large outliers.

👉 **Conclusion**: Most properties are small to medium-sized, but a few **very large units** (villas, commercial?) are stretching the distribution.

---

### 🛏️ `rooms` (Number of Rooms):
- **Mean (2.07) ≈ Median (2.00)** → the center looks symmetric, but **Skewness = 1.52** confirms a right tail.
- **Mode = 1** → **1-room apartments** are the most common single type (a plurality, not a majority, since the median is 2).
- **Max = 19 rooms** → extreme values likely **outliers** or non-residential.

👉 **Conclusion**: Room counts cluster at the low end, with **some anomalies** (very large homes) in the right tail.

---

## ✅ **Recommendations**

### 1. 🧹 **Outlier Detection and Removal**
Use the **Z-score** or **IQR** method to filter out extreme values:

```python
# Z-score threshold method: keep rows within 3 standard deviations on every column
cols = ['last_price', 'total_area', 'rooms']
z_scores = (df[cols] - df[cols].mean()) / df[cols].std()
df = df[(z_scores.abs() < 3).all(axis=1)]
```

Or use **percentile-based filtering**, e.g., remove values above the **99th percentile** for `last_price`.

Consider capping `rooms` to a realistic upper bound (e.g., keep only 1–6 rooms).
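
The percentile filter and the room bound can be combined in one mask. A minimal sketch on made-up rows; the 99th-percentile cutoff and the 1–6 bound are the examples from the text, not tuned values:

```python
import pandas as pd

# Hypothetical listings; the last two rows are the suspicious ones
toy = pd.DataFrame({
    'last_price': [3.0e6, 4.0e6, 4.5e6, 5.0e6, 6.0e6, 7.0e6, 8.0e6, 9.0e6, 1.0e7, 7.0e8],
    'rooms': [1, 1, 2, 2, 3, 3, 4, 5, 0, 19],
})

# Price at the 99th percentile (linear interpolation between the top two values)
price_cap = toy['last_price'].quantile(0.99)

# Keep listings at or below the cap with 1 to 6 rooms (both ends inclusive)
filtered = toy[(toy['last_price'] <= price_cap) & toy['rooms'].between(1, 6)]
print(len(filtered))  # 8
```

Note that a lower bound of 1 also drops listings recorded with 0 rooms, so it is worth checking what those rows represent before applying it to the full dataset.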

---

### 2. 🔄 **Log Transformation**
Since `last_price` and `total_area` are **heavily skewed**, applying a **log transformation** compresses the right tail and makes the distributions more symmetric:

```python
import numpy as np  # provides log1p, i.e. log(1 + x)
df['log_price'] = np.log1p(df['last_price'])
df['log_area'] = np.log1p(df['total_area'])
```

This transformation is **recommended** before fitting models that are sensitive to skewed features and for histograms that are easier to read.
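
The effect of the transform can be checked by comparing skewness before and after. A minimal sketch with made-up prices; on the real data the same comparison would use `df['last_price']`:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a long right tail
prices = pd.Series([2e6, 3e6, 4e6, 4.5e6, 5e6, 6e6, 8e6, 1.2e7, 5e7, 7e8])

log_prices = np.log1p(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f'Skewness before: {skew_before:.2f}, after: {skew_after:.2f}')

# expm1 inverts log1p, so values can be mapped back to the original scale
restored = np.expm1(log_prices)
```

The skewness drops but does not reach 0, so a log transform reduces the influence of extreme listings without removing them.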

---

### 3. 📊 **Visualizations**
Use the following plots to explore and clean your data:
- **Histograms** before/after log transformation
- **Boxplots** to detect outliers
- **Scatter plots / pairplots** to visualize relationships between `last_price`, `total_area`, and `rooms`

---

### 4. ✅ **Additional Cleanup Ideas**
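- Check for invalid values like `total_area == 0`, or `rooms == 0` on listings that are not studios or open-plan, which are likely input errors.
- Scale numeric features (e.g., standardization) before fitting scale-sensitive models.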
- Create **price quantiles** to categorize listings (e.g., "low", "mid", "high" price tiers).
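
Price tiers can be built with `pd.qcut`, which splits a column into bins holding roughly equal numbers of rows. A minimal sketch on made-up prices; the three-tier split and the `price_tier` column name are illustrative choices:

```python
import pandas as pd

# Hypothetical prices, not taken from the dataset
toy = pd.DataFrame({'last_price': [2.0e6, 3.0e6, 4.0e6, 5.0e6, 6.0e6, 9.0e6]})

# Three quantile-based bins, labelled from cheapest to most expensive
toy['price_tier'] = pd.qcut(toy['last_price'], q=3, labels=['low', 'mid', 'high'])

print(toy)
```

Because the bin edges are quantiles, each tier holds the same number of listings here regardless of how skewed the prices are.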

---