# Real Estate Price Prediction Project - Dataset Exploration

## Overview

In this project, I'll be focusing on creating a forecasting model for real estate prices using the available datasets. To begin, I'll thoroughly explore each dataset to gain insights, identify patterns, and understand how the data can be used to build a robust model. The datasets I'm working with are:

1. **Transactions**

2. **Rent Contracts**

3. **Valuations**

## Step 1: Data Exploration

I'll start by exploring each dataset individually, checking for data quality, missing values, and important variables that could contribute to the forecasting model. The goal of this step is to gather initial insights and determine how to proceed with data cleaning and preparation.


## Exploring Transactions Dataset

**Objective**: Understand the variables available in the transactions dataset, check for missing values, and identify initial insights.

**Steps**:

- Load the dataset.

- View the first few rows to get an overview.

- Check for missing values and data types.

- Identify important columns that can be used for forecasting.

In [2]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [3]:
# Loading the dataset
transactions_df = pd.read_csv("../data/raw/transactions.csv")

In [4]:
# Setting pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [5]:
# View the first few rows of the dataset
print(transactions_df.head())

    transaction_id  procedure_id  trans_group_id trans_group_ar  \
0   1-11-2018-8205            11               1        مبايعات   
1  1-11-2016-12930            11               1        مبايعات   
2  1-11-2016-13524            11               1        مبايعات   
3   2-13-2014-4939            13               2           رهون   
4     1-11-2002-81            11               1        مبايعات   

  trans_group_en procedure_name_ar      procedure_name_en instance_date  \
0          Sales               بيع                   Sell    13-08-2018   
1          Sales               بيع                   Sell    02-11-2016   
2          Sales               بيع                   Sell    15-11-2016   
3      Mortgages         تسجيل رهن  Mortgage Registration    23-06-2014   
4          Sales               بيع                   Sell    14-01-2002   

   property_type_id property_type_ar property_type_en  property_sub_type_id  \
0                 4             فيلا            Villa              

In [6]:
# Checking information about the dataset
transactions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1314126 entries, 0 to 1314125
Data columns (total 46 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   transaction_id        1314126 non-null  object 
 1   procedure_id          1314126 non-null  int64  
 2   trans_group_id        1314126 non-null  int64  
 3   trans_group_ar        1314126 non-null  object 
 4   trans_group_en        1314126 non-null  object 
 5   procedure_name_ar     1314126 non-null  object 
 6   procedure_name_en     1314126 non-null  object 
 7   instance_date         1314126 non-null  object 
 8   property_type_id      1314126 non-null  int64  
 9   property_type_ar      1314126 non-null  object 
 10  property_type_en      1314126 non-null  object 
 11  property_sub_type_id  1029207 non-null  float64
 12  property_sub_type_ar  1029207 non-null  object 
 13  property_sub_type_en  1029207 non-null  object 
 14  property_usage_ar     1314126 non-

In [7]:
# Checking the missing values percentages of the dataset
transactions_df.isnull().sum() / transactions_df.shape[0] * 100

transaction_id           0.000000
procedure_id             0.000000
trans_group_id           0.000000
trans_group_ar           0.000000
trans_group_en           0.000000
procedure_name_ar        0.000000
procedure_name_en        0.000000
instance_date            0.000000
property_type_id         0.000000
property_type_ar         0.000000
property_type_en         0.000000
property_sub_type_id    21.681254
property_sub_type_ar    21.681254
property_sub_type_en    21.681254
property_usage_ar        0.000000
property_usage_en        0.000000
reg_type_id              0.000000
reg_type_ar              0.000000
reg_type_en              0.000000
area_id                  0.000000
area_name_ar             0.000000
area_name_en             0.000000
building_name_ar        30.517165
building_name_en        30.484900
project_number          31.546442
project_name_ar         31.546442
project_name_en         31.546442
master_project_en       18.035333
master_project_ar       18.039290
nearest_landma


**Initial Observations of the Transactions Dataset**

1. **Size and Structure**:

    - The dataset contains **1.31 million** records with **46 columns**.

    - The columns represent various aspects of property transactions, including transaction details, property features, project and building details, and more.

2. **Key Features**:

    - **Transaction Details**: `transaction_id`, `instance_date`, `procedure_id`, `trans_group_en`, `procedure_name_en`.

    - **Property Type and Subtype**: `property_type_en`, `property_sub_type_en`.

    - **Location Information**: `area_name_en`, `building_name_en`, `project_name_en`.

    - **Price-Related Columns**: `procedure_area` (area in sqm), `actual_worth`, `meter_sale_price`, and `rent_value`.

    - **Proximity Data**: Includes information about nearby landmarks, metro stations, and malls.

3. **Missing Values**:

    - Some columns have a significant number of missing values:

        - **Property Subtypes**: About **21.68%** of records are missing `property_sub_type_id`, `property_sub_type_en`, and `property_sub_type_ar`.

        - **Building Information**: Around **30.5%** of records lack `building_name_en` and `building_name_ar`.

        - **Proximity Information**: Around **15-26%** of records lack proximity data for nearby landmarks, malls, and metro stations.

        - **Room Information**: Approximately **23.05%** of records lack room details (`rooms_en`, `rooms_ar`).

        - **Rent Data**: A significant portion of the data is missing rental values, with **97.33%** of records lacking `rent_value` and `meter_rent_price`.

4. **Non-Null Columns**:

    - Columns like `transaction_id`, `procedure_area`, `actual_worth`, and `meter_sale_price` are complete, which is promising for property price analysis.
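The missing-value percentages above can be easier to read when only the affected columns are shown, sorted from worst to best. Here is a minimal sketch of that idea; `missing_summary` and the toy frame are hypothetical names I'm introducing for illustration, and the toy values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    missing_count = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_pct", ascending=False
    )


# Toy frame standing in for transactions_df (illustrative values only)
toy_df = pd.DataFrame({
    "transaction_id": ["a", "b", "c", "d"],
    "rooms_en": ["1 B/R", None, "2 B/R", None],
    "rent_value": [np.nan, np.nan, np.nan, 1000.0],
})

summary = missing_summary(toy_df)
print(summary)
```

On the real data, the same helper could be called as `missing_summary(transactions_df)`.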

Next, let's look more closely at the data structure and the values themselves to decide how best to handle the missing values.

In [9]:
# Displaying 10 random observations from the dataset
transactions_df.sample(10)

Unnamed: 0,transaction_id,procedure_id,trans_group_id,trans_group_ar,trans_group_en,procedure_name_ar,procedure_name_en,instance_date,property_type_id,property_type_ar,property_type_en,property_sub_type_id,property_sub_type_ar,property_sub_type_en,property_usage_ar,property_usage_en,reg_type_id,reg_type_ar,reg_type_en,area_id,area_name_ar,area_name_en,building_name_ar,building_name_en,project_number,project_name_ar,project_name_en,master_project_en,master_project_ar,nearest_landmark_ar,nearest_landmark_en,nearest_metro_ar,nearest_metro_en,nearest_mall_ar,nearest_mall_en,rooms_ar,rooms_en,has_parking,procedure_area,actual_worth,meter_sale_price,rent_value,meter_rent_price,no_of_parties_role_1,no_of_parties_role_2,no_of_parties_role_3
1171457,2-13-2015-3450,13,2,رهون,Mortgages,تسجيل رهن,Mortgage Registration,18-05-2015,3,وحدة,Unit,60.0,شقه سكنيه,Flat,سكني,Residential,1,العقارات القائمة,Existing Properties,410,نخلة جميرا,Palm Jumeirah,كمبينسكي ريزيدنسز,KEMPINSKI RESIDENCES,97.0,قصر الزمرد كمبينسكي,EMERALD PALACE KEMPINSKI,Palm Jumeirah,نخلة جميرا,برج العرب,Burj Al Arab,مينا السياحي,Mina Seyahi,مارينا مول,Marina Mall,غرفتين,2 B/R,1,163.6,2460000.0,15036.67,,,1.0,1.0,0.0
530204,2-13-2021-7285,13,2,رهون,Mortgages,تسجيل رهن,Mortgage Registration,17-06-2021,1,أرض,Land,,,,أخرى,Other,1,العقارات القائمة,Existing Properties,264,ند الحمر,Nad Al Hamar,,,,,,,,,,,,,,,,0,1393.55,750000.0,538.19,,,1.0,1.0,0.0
15102,1-11-2023-17343,11,1,مبايعات,Sales,بيع,Sell,05-06-2023,4,فيلا,Villa,4.0,فيلا,Villa,سكني,Residential,1,العقارات القائمة,Existing Properties,467,وادي الصفا 5,Wadi Al Safa 5,,,2159.0,المرابع العربية ااا - ربيع,Arabian Ranches III - Spring,,,مجمع حمدان الرياضي,Hamdan Sports Complex,,,,,أربع غرف,4 B/R,0,238.06,2950000.0,12391.83,,,1.0,2.0,0.0
371238,2-13-2015-7962,13,2,رهون,Mortgages,تسجيل رهن,Mortgage Registration,29-10-2015,3,وحدة,Unit,60.0,شقه سكنيه,Flat,سكني,Residential,1,العقارات القائمة,Existing Properties,390,برج خليفة,Burj Khalifa,برج فيوز تاور بي,Burj Views Tower B,1451.0,إطلالات برج,BURJ VIEWS,Burj Khalifa,برج خليفة,وسط مدينة دبي,Downtown Dubai,محطة مترو بوج خليفة دبي مول,Buj Khalifa Dubai Mall Metro Station,مول دبي,Dubai Mall,غرفة,1 B/R,1,67.54,800000.0,11844.83,,,1.0,1.0,0.0
431447,1-11-2024-16966,11,1,مبايعات,Sales,بيع,Sell,15-05-2024,4,فيلا,Villa,4.0,فيلا,Villa,سكني,Residential,1,العقارات القائمة,Existing Properties,232,مردف,Mirdif,,,,,,,,مطار دبي الدولي,Dubai International Airport,محطة مترو اتصالات,Etisalat Metro Station,سيتي سنتر مردف,City Centre Mirdif,غرفتين,2 B/R,0,185.73,2734200.0,14721.37,,,1.0,1.0,0.0
1051076,2-13-2021-4491,13,2,رهون,Mortgages,تسجيل رهن,Mortgage Registration,21-04-2021,3,وحدة,Unit,60.0,شقه سكنيه,Flat,سكني,Residential,1,العقارات القائمة,Existing Properties,410,نخلة جميرا,Palm Jumeirah,البصري,Al Basri,,,,Palm Jumeirah,نخلة جميرا,برج العرب,Burj Al Arab,نخلة جميرا,Palm Jumeirah,مارينا مول,Marina Mall,غرفتين,2 B/R,1,147.04,1584000.0,10772.58,,,1.0,1.0,0.0
934813,1-102-2013-13305,102,1,مبايعات,Sales,بيع - تسجيل مبدئى,Sell - Pre registration,17-12-2013,4,فيلا,Villa,4.0,فيلا,Villa,سكني,Residential,0,على الخارطة,Off-Plan Properties,506,اليلايس 1,Al Yelayiss 1,,,1793.0,مجمع ريم ميرا بي أتش21,REEM-MIRA COMMUNITY PH 2,,,دورة دبي للدراجات,Dubai Cycling Course,,,,,أربع غرف,4 B/R,0,275.35,1693888.0,6151.76,,,1.0,1.0,0.0
460160,1-11-2023-31737,11,1,مبايعات,Sales,بيع,Sell,04-10-2023,3,وحدة,Unit,60.0,شقه سكنيه,Flat,سكني,Residential,1,العقارات القائمة,Existing Properties,445,جبل علي الأولى,Jabal Ali First,1 غلامز ريزيدنس تاور,GLAMZ RESIDENCE TOWER 1,1749.0,جلامز ريزدينس,GLAMZ RESIDENCE,Al Furjan,الفرجان,أكاديمية المدينة الرياضية للسباحة,Sports City Swimming Academy,محطة مترو ابن بطوطة,Ibn Battuta Metro Station,ابن بطوطة مول,Ibn-e-Battuta Mall,غرفة,1 B/R,1,71.48,750000.0,10492.45,,,1.0,1.0,0.0
688020,1-102-2018-1132,102,1,مبايعات,Sales,بيع - تسجيل مبدئى,Sell - Pre registration,24-01-2018,3,وحدة,Unit,60.0,شقه سكنيه,Flat,سكني,Residential,0,على الخارطة,Off-Plan Properties,366,الكفاف,Al Kifaf,بارك غيت ريزيدنسيز-4,PARK GATE RESIDENCES 4,1957.0,بارك جيت رازيدنس,Park Gate Residences,Wasl 1,وصل 1,برج خليفة,Burj Khalifa,محطة مترو الجافلية,Al Jafiliya Metro Station,مول دبي,Dubai Mall,ثلاث غرف,3 B/R,1,201.05,2967000.0,14757.52,,,1.0,2.0,0.0
875220,1-102-2024-61569,102,1,مبايعات,Sales,بيع - تسجيل مبدئى,Sell - Pre registration,15-08-2024,4,فيلا,Villa,4.0,فيلا,Villa,سكني,Residential,0,على الخارطة,Off-Plan Properties,469,اليفره 1,Al Yufrah 1,,,3153.0,ذا فالي - فيلورا,The Valley - Velora,,,,,,,,,أربع غرف,4 B/R,0,275.87,3169888.0,11490.51,,,1.0,2.0,0.0


Let's rearrange the dataset into a more logical column order to make it easier to understand.

In [10]:
# Rearranging & selecting useful columns of the transactions dataset
transactions_df = transactions_df[['transaction_id', 'instance_date', 'trans_group_id', 'trans_group_en', 'trans_group_ar', 'procedure_id',
                                   'procedure_name_en', 'procedure_name_ar', 'property_type_id', 'property_type_en', 'property_type_ar',
                                   'property_sub_type_id', 'property_sub_type_en', 'property_sub_type_ar', 'property_usage_en', 
                                   'property_usage_ar', 'reg_type_id', 'reg_type_en', 'reg_type_ar', 'area_id', 'area_name_en', 'area_name_ar',
                                   'building_name_en', 'building_name_ar', 'project_number', 'project_name_en', 'project_name_ar', 
                                   'master_project_en', 'master_project_ar', 'nearest_landmark_en', 'nearest_landmark_ar', 'nearest_metro_en',
                                   'nearest_metro_ar', 'nearest_mall_en', 'nearest_mall_ar', 'rooms_en', 'rooms_ar', 'has_parking',
                                   'procedure_area', 'meter_sale_price', 'actual_worth']]

# Displaying the first few rows of the dataset
transactions_df.head()

Unnamed: 0,transaction_id,instance_date,trans_group_id,trans_group_en,trans_group_ar,procedure_id,procedure_name_en,procedure_name_ar,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
0,1-11-2018-8205,13-08-2018,1,Sales,مبايعات,11,Sell,بيع,4,Villa,فيلا,,,,Other,أخرى,1,Existing Properties,العقارات القائمة,278,Mankhool,منخول,,,,,,,,Burj Khalifa,برج خليفة,ADCB Metro Station,محطة مترو بنك أبوظبي التجاري,Dubai Mall,مول دبي,,,0,34.41,4795.12,165000.0
1,1-11-2016-12930,02-11-2016,1,Sales,مبايعات,11,Sell,بيع,4,Villa,فيلا,,,,Residential,سكني,1,Existing Properties,العقارات القائمة,276,Al Bada,البدع,,,,,,,,Burj Khalifa,برج خليفة,Emirates Towers Metro Station,محطة مترو أبراج الإمارات,Dubai Mall,مول دبي,,,0,390.0,5358.72,2089900.0
2,1-11-2016-13524,15-11-2016,1,Sales,مبايعات,11,Sell,بيع,4,Villa,فيلا,,,,Other,أخرى,1,Existing Properties,العقارات القائمة,276,Al Bada,البدع,,,,,,,,Burj Khalifa,برج خليفة,Emirates Towers Metro Station,محطة مترو أبراج الإمارات,Dubai Mall,مول دبي,,,0,278.71,10046.28,2800000.0
3,2-13-2014-4939,23-06-2014,2,Mortgages,رهون,13,Mortgage Registration,تسجيل رهن,4,Villa,فيلا,,,,Commercial,تجاري,1,Existing Properties,العقارات القائمة,276,Al Bada,البدع,,,,,,,,Burj Khalifa,برج خليفة,Trade Centre Metro Station,محطة مترو المركز التجاري,Dubai Mall,مول دبي,,,0,16952.94,707.84,12000000.0
4,1-11-2002-81,14-01-2002,1,Sales,مبايعات,11,Sell,بيع,2,Building,مبنى,,,,Commercial,تجاري,1,Existing Properties,العقارات القائمة,271,Al Karama,الكرامه,,,,,,,,Dubai International Airport,مطار دبي الدولي,ADCB Metro Station,محطة مترو بنك أبوظبي التجاري,Dubai Mall,مول دبي,,,0,232.26,6458.28,1500000.0


Let's check data quality by confirming that `transaction_id`, which serves as the **primary key**, contains no duplicates.

In [11]:
# Checking duplicates in transaction_id
transactions_df['transaction_id'].duplicated().sum()

0

In [12]:
# Checking duplicates across transaction dataset
transactions_df.duplicated().sum()

0

Great! `transaction_id`, our unique identifier, has no duplicates, and there are no fully duplicated rows either, so each transaction is unique.
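As a side note, pandas also exposes `Series.is_unique`, and `duplicated(keep=False)` returns every row involved in a duplicate, which is handy when a key check fails. A small sketch on toy IDs (the repeated ID is injected on purpose and is not from the dataset):

```python
import pandas as pd

# Three IDs borrowed from the head() output above
toy_ids = pd.Series(["1-11-2018-8205", "1-11-2016-12930", "2-13-2014-4939"])
print(toy_ids.is_unique)

# Deliberately append a repeat of the first ID to simulate a broken key
with_dup = pd.concat([toy_ids, toy_ids.iloc[[0]]], ignore_index=True)
print(with_dup.is_unique)

# keep=False flags every occurrence, not just the later ones
dup_rows = with_dup[with_dup.duplicated(keep=False)]
print(dup_rows)
```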

Now, let's convert each column to its proper data type to improve the exploratory data analysis.

In [13]:
# Checking data types of the columns in the transaction dataset
transactions_df.dtypes

transaction_id           object
instance_date            object
trans_group_id            int64
trans_group_en           object
trans_group_ar           object
procedure_id              int64
procedure_name_en        object
procedure_name_ar        object
property_type_id          int64
property_type_en         object
property_type_ar         object
property_sub_type_id    float64
property_sub_type_en     object
property_sub_type_ar     object
property_usage_en        object
property_usage_ar        object
reg_type_id               int64
reg_type_en              object
reg_type_ar              object
area_id                   int64
area_name_en             object
area_name_ar             object
building_name_en         object
building_name_ar         object
project_number          float64
project_name_en          object
project_name_ar          object
master_project_en        object
master_project_ar        object
nearest_landmark_en      object
nearest_landmark_ar      object
nearest_

From the output above, we can see that `instance_date` is stored as `object` and needs converting to `datetime`. You may also have noticed that `property_sub_type_id` is `float64` rather than an integer type. That is because the column holds `NaN` values, which NumPy-backed integer columns cannot represent, so pandas falls back to `float64`.
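To illustrate the `float64` point, here is a small sketch on a toy series. It also shows pandas' nullable `Int64` dtype as one possible alternative when integer IDs with gaps are preferred; this is an option I'm mentioning, not something the rest of this notebook relies on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for property_sub_type_id: integer-like IDs with a gap
sub_type = pd.Series([60.0, np.nan, 4.0], name="property_sub_type_id")
print(sub_type.dtype)

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
nullable = sub_type.astype("Int64")
print(nullable.dtype)
print(nullable)
```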

Now, let's convert the `instance_date`:

In [14]:
# Displaying a few samples of the instance_date column
transactions_df['instance_date'].sample(5)

16760      29-03-2021
767282     04-01-2021
910484     14-07-2022
725507     26-10-2009
1174618    17-07-2019
Name: instance_date, dtype: object

In [15]:
# Converting the instance_date column to datetime
transactions_df['instance_date'] = pd.to_datetime(transactions_df['instance_date'], dayfirst=True, errors='coerce')

# Checking the converted instance_date column
transactions_df['instance_date'].sample(5)

990728    2009-08-25
1247865   2019-09-23
1309692   2022-01-20
218619    2024-09-18
911731    2022-12-13
Name: instance_date, dtype: datetime64[ns]
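One thing to keep in mind: `errors='coerce'` silently turns any string that fails to parse into `NaT`. The `describe()` summary of this column reports 1,314,122 non-null dates against 1,314,126 rows, which suggests four values were coerced. A sketch of how such rows could be surfaced, using made-up strings and an explicit `format` (my assumption, based on the `DD-MM-YYYY` samples shown above):

```python
import pandas as pd

# Toy raw dates: two valid, one impossible date, one junk string
raw_dates = pd.Series(["13-08-2018", "02-11-2016", "31-02-2020", "not a date"])

# An explicit format avoids day/month guessing; bad values become NaT
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

# Values that were present in the raw data but did not survive parsing
failed = raw_dates[parsed.isna() & raw_dates.notna()]
print(f"{len(failed)} value(s) could not be parsed:")
print(failed.tolist())
```

On the real data, this check would need to run on the raw string column before it is overwritten by the conversion.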

After converting the `instance_date` column to a `datetime` format, we can now gather statistics to better understand the time range of the dataset and determine how many years should be included in the analysis.

First, let's get the earliest and the latest transaction dates:

In [16]:
# Checking the statistical summary of the instance_date
transactions_df['instance_date'].describe()

count                          1314122
mean     2017-10-28 20:19:29.798238464
min                1966-01-18 00:00:00
25%                2013-07-16 00:00:00
50%                2018-10-03 00:00:00
75%                2023-01-30 00:00:00
max                2024-10-11 00:00:00
Name: instance_date, dtype: object

From the summary statistics of the `instance_date` column, here’s what we can gather:

1. **Earliest transaction date**: January 18, 1966

2. **Most recent transaction date**: October 11, 2024

3. **25% of transactions occurred before**: July 16, 2013

4. **Median transaction date (50%)**: October 3, 2018

5. **75% of transactions occurred before**: January 30, 2023

**Insights**

- **Recent transactions dominate**: The majority of transactions are relatively recent, with 50% of them happening after October 2018 and 25% happening after January 2023.

- **Older data** (pre-2013) may not be as relevant for current market trends and could potentially skew the analysis if included in full.

- **Focus on recent years**: To ensure a more relevant analysis, I'll be focusing on the last 5 years as it could provide a more accurate reflection of current market trends.

Now, let's filter the data to include the last 5 years:

In [17]:
# Filtering the data to include the last 5 years
# (keeps calendar years >= current year - 5, so the result depends on when this cell is run)
from datetime import datetime
transactions_5y = transactions_df[transactions_df['instance_date'].dt.year >= datetime.now().year - 5]

# Comparing the shapes of the filtered and original datasets
print("Original dataset shape: ", transactions_df.shape)
print("Filtered dataset shape: ", transactions_5y.shape)

Original dataset shape:  (1314126, 41)
Filtered dataset shape:  (644745, 41)
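Because the filter above is based on the current calendar year, re-running the notebook in a later year would select a different window. An alternative sketch, anchored on the latest date in the data instead, is shown below on a toy frame. This is a different rule from the one used above, so on the real data it would not reproduce the 644,745 rows.

```python
import pandas as pd

# Toy stand-in for transactions_df with an already-parsed date column
toy = pd.DataFrame({
    "instance_date": pd.to_datetime([
        "2002-01-14", "2019-10-11", "2019-10-12", "2021-04-21", "2024-10-11",
    ])
})

# Anchor the window on the newest record rather than on today's date
latest = toy["instance_date"].max()
cutoff = latest - pd.DateOffset(years=5)

# .copy() makes the subset independent of the original frame
recent = toy[toy["instance_date"] >= cutoff].copy()
print("Cutoff:", cutoff.date())
print("Rows kept:", len(recent))
```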


In [18]:
# Viewing some random rows of the filtered dataset
transactions_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,trans_group_id,trans_group_en,trans_group_ar,procedure_id,procedure_name_en,procedure_name_ar,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
1265136,1-102-2022-34173,2022-10-31,1,Sales,مبايعات,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,412,Al Merkadh,المركاض,Sobha Hartland - The Crest Tower C,شوبا هارتلاند ? ذا كرست تاور سي,2447.0,Sobha Hartland - The Crest,شوبا هارتلاند - ذا كرست,SOBHA HARTLAND,شوبها هارتلاند,,,,,,,1 B/R,غرفة,1,48.09,20711.87,996034.0
917378,1-102-2024-1110,2024-01-22,1,Sales,مبايعات,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,370,Um Suqaim Third,ام سقيم الثالثه,Elara 1,ايلارا 1,2754.0,ELARA,إلارا,,,,,,,,,1 B/R,غرفة,1,79.16,34070.24,2697000.0
727145,1-11-2023-39889,2023-12-05,1,Sales,مبايعات,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,482,Hadaeq Sheikh Mohammed Bin Rashid,حدائق الشيخ محمد بن راشد,MULBERRY II at PARK HEIGHTS Building A2,ملبيري II آت بارك هايتس A2,1489.0,MULBERRY at PARK HEIGHTS,مولبيري في بارك هايتس,DUBAI HILLS - PARK,دبي هيليز - بارك,Burj Al Arab,برج العرب,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,2 B/R,غرفتين,1,126.49,22531.43,2850000.0
838036,3-9-2022-3551,2022-12-08,3,Gifts,هبات,9,Grant,هبه,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,330,Marsa Dubai,مرسى دبي,Marina Pinnacle,مارينا بينكال,817.0,MARINA PINNACLE,مارينا بيناكل,Dubai Marina,دبي مارينا,Burj Al Arab,برج العرب,Mina Seyahi,مينا السياحي,Marina Mall,مارينا مول,3 B/R,ثلاث غرف,1,155.8,10141.7,1580077.0
1281440,2-13-2020-5532,2020-08-05,2,Mortgages,رهون,13,Mortgage Registration,تسجيل رهن,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,444,Al Hebiah First,الحبيه الاولى,Oia Residence,أويا ريزدنس,1779.0,OIA Residence,اويا ريزيدنس,Motor City,موتور ستي,Motor City,موتور سيتي,Dubai Internet City,مدينة دبي للإنترنت,Mall of the Emirates,مول الإمارات,3 B/R,ثلاث غرف,1,194.01,4793.57,930000.0


Now that we’ve filtered the dataset to focus on the last 5 years, let’s begin inspecting each column in the dataset to gain a deeper understanding of the data. I’ll start by analyzing the transaction group columns.

In [19]:
# Checking the trans_group columns
print("The unique values count in 'trans_group_id' column is: ")
print(transactions_5y['trans_group_id'].value_counts())

print("\nThe unique values count in 'trans_group_en' column is: ")
print(transactions_5y['trans_group_en'].value_counts())

print("\nThe unique values count in 'trans_group_ar' column is: ")
print(transactions_5y['trans_group_ar'].value_counts())

The unique values count in 'trans_group_id' column is: 
trans_group_id
1    498223
2    117450
3     29072
Name: count, dtype: int64

The unique values count in 'trans_group_en' column is: 
trans_group_en
Sales        498223
Mortgages    117450
Gifts         29072
Name: count, dtype: int64

The unique values count in 'trans_group_ar' column is: 
trans_group_ar
مبايعات    498223
رهون       117450
هبات        29072
Name: count, dtype: int64


**Transactions Group Observations**

1. **Sale Dominance**:

    - About **77%** of the transactions in the filtered dataset (**498,223** of 644,745) fall under the “Sales” group. This indicates that most of the activities recorded are direct sales of properties.

    - This should be the primary focus for price prediction or forecasting models, given its dominance in the dataset.

2. **Mortgages as the Second Largest Group**:

    - The second largest group consists of **Mortgages (117,450 transactions)**. Mortgages are closely tied to property purchases, so they provide a good perspective on financing trends, which could also affect price prediction.

    - Mortgages might also give us insight into buyer behavior and loan-dependent transactions, which could be a useful feature in our model.

3. **Smaller Gifts Segment**:

    - The “Gifts” category (29,072 transactions) is the smallest group, likely representing property transfers without financial compensation. These transactions may not have as direct an impact on property price forecasting as the other groups but could still offer some useful information about property transfers within families or organizations.

**Filter for Sales**: Since sales make up the largest proportion and are most directly related to price changes, we may want to focus our initial analysis on this group.
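To put numbers on these proportions, the counts printed above can be turned into percentage shares. The sketch below hard-codes the three counts from the output; on the real data, `transactions_5y['trans_group_en'].value_counts(normalize=True)` would give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
group_counts = pd.Series(
    {"Sales": 498223, "Mortgages": 117450, "Gifts": 29072}, name="count"
)

# Share of each group as a percentage of all filtered transactions
group_share_pct = (group_counts / group_counts.sum() * 100).round(1)
print(group_share_pct)
```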



In [20]:
# Filtering "Sales" transactions in the dataset
transactions_sales_5y = transactions_5y[transactions_5y['trans_group_en'] == 'Sales']

# Displaying random rows of the filtered dataset
transactions_sales_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,trans_group_id,trans_group_en,trans_group_ar,procedure_id,procedure_name_en,procedure_name_ar,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
751895,1-11-2020-11268,2020-11-26,1,Sales,مبايعات,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,523,Al Hebiah Third,الحبية الثالثة,Golf Promenade 2 - A,غولف بروميناد 2 - أيه,1504.0,DAMAC HILLS - GOLF PROMENADE,دماك هيلز - جولف برومندي,DAMAC HILLS,داماك هيليز,Motor City,موتور سيتي,,,,,2 B/R,غرفتين,1,159.48,8239.28,1314000.0
1050598,1-11-2021-10205,2021-06-16,1,Sales,مبايعات,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,330,Marsa Dubai,مرسى دبي,ROYAL OCEANIC-1,رويال اوشانك 1,256.0,ROYAL OCEANIC TOWER,برج رويال المحيط,Dubai Marina,دبي مارينا,Burj Al Arab,برج العرب,Jumeirah Beach Resdency,مساكن شاطئ الجميرا,Marina Mall,مارينا مول,2 B/R,غرفتين,1,124.38,10451.84,1300000.0
912372,1-102-2024-17530,2024-03-20,1,Sales,مبايعات,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,485,Me'Aisem First,معيصم الأول,JANNAT,جنات,2725.0,JANNAT,جنات,International Media Production Zone,المنطقة العالمية للإنتاج الإعلامي,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Harbour Tower,برج هاربور,Ibn-e-Battuta Mall,ابن بطوطة مول,1 B/R,غرفة,1,70.68,13208.97,933609.0
301367,1-110-2022-255,2022-07-05,1,Sales,مبايعات,110,Lease to Own Registration,تسجيل إيجارة تنتهى بالتملك,4,Villa,فيلا,,,,Residential,سكني,1,Existing Properties,العقارات القائمة,531,Al Hebiah Sixth,الحبيه السادسة,,,1337.0,MUDON,مدن,Mudon,مدن,Dubai Cycling Course,دورة دبي للدراجات,,,,,,,0,382.84,8619.79,3300000.0
51059,1-11-2022-26303,2022-10-25,1,Sales,مبايعات,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,332,Zaabeel Second,زعبيل الثانيه,DOWNTOWN VIEWS,داون تاون فيوز,1642.0,DOWNTOWN VIEWS,مشاهد من الداخل,,,Burj Khalifa,برج خليفة,Buj Khalifa Dubai Mall Metro Station,محطة مترو بوج خليفة دبي مول,Dubai Mall,مول دبي,1 B/R,غرفة,1,81.99,20734.24,1700000.0


After filtering to Sales, the `trans_group_id`, `trans_group_en`, and `trans_group_ar` columns each contain only a single unique value, so they no longer provide meaningful information for the analysis. Therefore, I will remove these columns.
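One way to confirm this, rather than relying on inspection, is to look for columns with a single distinct value. A minimal sketch on a toy frame; `constant_columns` is a hypothetical helper name, not part of pandas:

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also flagged
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy stand-in for the Sales-only subset
toy_sales = pd.DataFrame({
    "trans_group_id": [1, 1, 1],
    "trans_group_en": ["Sales", "Sales", "Sales"],
    "procedure_id": [11, 102, 11],
})

to_drop = constant_columns(toy_sales)
print(to_drop)
```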

In [21]:
# Removing transactions group columns from the dataset
transactions_sales_5y = transactions_sales_5y.drop(columns=['trans_group_id', 'trans_group_en', 'trans_group_ar'])

# Displaying random rows of the filtered dataset
transactions_sales_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,procedure_id,procedure_name_en,procedure_name_ar,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
1202523,1-102-2022-10050,2022-04-14,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,526,Business Bay,الخليج التجارى,Peninsula Two,بينينسولا تو,2318.0,Peninsula Two,بينينسولا تو,Business Bay,الخليج التجاري,Downtown Dubai,وسط مدينة دبي,Business Bay Metro Station,محطة مترو الخليج التجاري,Dubai Mall,مول دبي,1 B/R,غرفة,1,58.72,18732.97,1100000.0
26070,1-102-2023-27360,2023-06-08,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,Mayas Geneva,مياس جينيف,2421.0,Mayas Geneva,مياس جنيف,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Nakheel Metro Station,محطة مترو النخيل,Marina Mall,مارينا مول,1 B/R,غرفة,1,59.56,12592.34,750000.0
1183029,1-11-2023-35725,2023-11-10,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,507,Al Yelayiss 2,اليلايس 2,Hayat Boulevard-1B,حياة بوليفارد - 1 ب,1755.0,HAYAT BOULEVARD,حياة بوليفارد,TOWN SQUARE,تاون سكوير,Dubai Cycling Course,دورة دبي للدراجات,,,,,3 B/R,ثلاث غرف,1,127.99,10547.7,1350000.0
425812,1-11-2024-31855,2024-08-28,11,Sell,بيع,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,307,Al Safouh First,الصفوح الاولى,J8,8ج,1693.0,J8,جي 8,TECOM Site B,تيكوم سايت بي,Burj Al Arab,برج العرب,Dubai Internet City,مدينة دبي للإنترنت,Mall of the Emirates,مول الإمارات,1 B/R,غرفة,1,75.78,14251.78,1080000.0
973574,1-102-2023-47349,2023-09-14,102,Sell - Pre registration,بيع - تسجيل مبدئى,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,412,Al Merkadh,المركاض,Sobha Hartland - Crest Grande,شوبا هارتلاند - كريست جراندي,2482.0,Sobha Hartland - Crest Grande,شوبا هارتلاند كريست جراندي,SOBHA HARTLAND,شوبها هارتلاند,Downtown Dubai,وسط مدينة دبي,Buj Khalifa Dubai Mall Metro Station,محطة مترو بوج خليفة دبي مول,Dubai Mall,مول دبي,2 B/R,غرفتين,1,138.32,19913.18,2754391.0


Let's inspect the procedure columns in order to gain a deeper understanding:

In [22]:
# Checking the unique values count in the procedure columns
print("The unique values count in 'procedure_id' column is: ")
print(transactions_sales_5y['procedure_id'].value_counts())

print("\nThe unique values count in 'procedure_name_en' column is: ")
print(transactions_sales_5y['procedure_name_en'].value_counts())

print("\nThe unique values count in 'procedure_name_ar' column is: ")
print(transactions_sales_5y['procedure_name_ar'].value_counts())

The unique values count in 'procedure_id' column is: 
procedure_id
102    252630
11     160417
41      71715
45       5287
110      3133
133      1744
460      1268
851      1259
95        202
715       193
93        126
4          57
852        51
361        43
814        38
107        36
371        22
150         2
Name: count, dtype: int64

The unique values count in 'procedure_name_en' column is: 
procedure_name_en
Sell - Pre registration                       252630
Sell                                          160417
Delayed Sell                                   71715
Sell Development                                5287
Lease to Own Registration                       3133
Development Registration                        1744
Sale On Payment Plan                            1268
Development Registration Pre-Registration       1259
Delayed Development                              202
Delayed Sell Lease to Own Registration           193
Delayed Sell Development                       

**Procedure Columns Insights**

1. **Dominance of Specific Procedures**:

    - The **“Sell - Pre registration”** procedure is by far the most common, with **252,630** entries, followed by **“Sell”** with **160,417** entries. Together, these two account for about 83% of the 498,223 records.

    - Other procedures like **“Delayed Sell”** (**71,715** entries) and **“Sell Development”** (**5,287** entries) are also important but appear much less frequently.

2. **Rare Procedures**:

    - Some procedures, such as **“Portfolio Development Registration”** and **“Lease Development Registration”**, are quite rare, with only a few occurrences. These might not have a substantial impact on the overall analysis but could be important to niche segments.

3. **Consistency Between English and Arabic**:

    - There’s a direct one-to-one mapping between `procedure_name_en` and `procedure_name_ar`, meaning we can use either column for analysis. Since both contain the same information, we can choose one and remove the other, especially if memory optimization is a priority.

4. **Implications for Analysis**:

    - Focusing on **“Sell”** and **“Sell - Pre registration”** transactions is likely to give the most representative insights, as they make up the bulk of the records.

    - Given that the dataset already includes a column that distinguishes between **“Off-Plan Properties”** and **“Existing Properties”**, it would be sufficient to rely on this information. As a result, we can confidently drop the additional procedure-related columns without losing any critical insights.
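
The one-to-one claim between the English and Arabic name columns can be checked programmatically rather than by comparing counts by eye. The following is a minimal sketch on a toy frame standing in for `transactions_sales_5y`; the helper name `is_one_to_one` is hypothetical, and the sample values are taken from the rows displayed earlier.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """Return True if every value in `left` pairs with exactly one value in `right` and vice versa."""
    # Keep each distinct (left, right) combination once
    pairs = df[[left, right]].drop_duplicates()
    # In a one-to-one mapping, no value repeats on either side of the distinct pairs
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Toy stand-in for the real dataset (assumption: illustrative rows only)
toy = pd.DataFrame({
    "procedure_name_en": ["Sell", "Sell", "Sell - Pre registration"],
    "procedure_name_ar": ["بيع", "بيع", "بيع - تسجيل مبدئى"],
})

# A deliberately inconsistent frame: one English name maps to two Arabic names
broken = pd.DataFrame({
    "procedure_name_en": ["Sell", "Sell"],
    "procedure_name_ar": ["بيع", "بيع - تسجيل مبدئى"],
})

print(is_one_to_one(toy, "procedure_name_en", "procedure_name_ar"))
print(is_one_to_one(broken, "procedure_name_en", "procedure_name_ar"))
```

On the real data, the same call would take `transactions_sales_5y` with the two procedure name columns. A `False` result would mean that one column carries information the other does not, in which case dropping either one would lose detail.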



In [23]:
# Removing the procedure columns from the dataset
transactions_sales_5y = transactions_sales_5y.drop(columns=['procedure_id', 'procedure_name_en', 'procedure_name_ar'])

# Displaying random rows of the filtered dataset
transactions_sales_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
671674,1-11-2024-6005,2024-02-20,4,Villa,فيلا,,,,Residential,سكني,1,Existing Properties,العقارات القائمة,434,Wadi Al Safa 6,وادي الصفا 6,,,,,,Arabian Ranches - Terranova,المرابع العربية - تيرانوفا,Global Village,قرية عالمية,,,,,,,0,704.2,13348.48,9400000.0
604670,1-11-2023-22856,2023-07-19,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,364,Al Wasl,الوصل,Citywalk Residential Building 7,ستي ووك ريزيدينشال بيلدنج 7,,,,City Walk,ستي ووك,Burj Khalifa,برج خليفة,Buj Khalifa Dubai Mall Metro Station,محطة مترو بوج خليفة دبي مول,Dubai Mall,مول دبي,2 B/R,غرفتين,1,158.84,21299.03,3383138.0
522959,1-102-2019-633,2019-01-21,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,482,Hadaeq Sheikh Mohammed Bin Rashid,حدائق الشيخ محمد بن راشد,Socio Tower 2,سوشيو تووير 2,2028.0,Socio,سوشيو,DUBAI HILLS,دبي هيليز,Burj Al Arab,برج العرب,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,2 B/R,غرفتين,1,67.34,14580.49,981850.0
455769,1-102-2023-25681,2023-05-31,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,Oxford Terraces 2,أكسفورد تيراسز 2,2600.0,OXFORD TERRACES 2,اوكسفورد تيرسز 2,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Dubai Internet City,مدينة دبي للإنترنت,Mall of the Emirates,مول الإمارات,Studio,استوديو,1,39.67,12856.06,510000.0
1298796,1-102-2024-49932,2024-07-15,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,414,Al Barshaa South Second,البرشاء جنوب الثانية,Binghatti Hills - 2,بن غاطي هيلز - 2,3053.0,Binghatti Hills,بن غاطي هيلز,Dubiotech,دبيوتك,Motor City,موتور سيتي,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,Studio,استوديو,1,41.45,20506.63,850000.0


Let's inspect the property type columns:

In [24]:
# Checking the unique values count in the property type columns
print("The unique values count in 'property_type_id' column is: ")
print(transactions_sales_5y['property_type_id'].value_counts())

print("\nThe unique values count in 'property_type_en' column is: ")
print(transactions_sales_5y['property_type_en'].value_counts())

print("\nThe unique values count in 'property_type_ar' column is: ")
print(transactions_sales_5y['property_type_ar'].value_counts())

The unique values count in 'property_type_id' column is: 
property_type_id
3    377293
4     71968
1     47002
2      1960
Name: count, dtype: int64

The unique values count in 'property_type_en' column is: 
property_type_en
Unit        377293
Villa        71968
Land         47002
Building      1960
Name: count, dtype: int64

The unique values count in 'property_type_ar' column is: 
property_type_ar
وحدة    377293
فيلا     71968
أرض      47002
مبنى      1960
Name: count, dtype: int64


**Property Type Columns Insights**

1. **Majority of Transactions are Units**:

    - **Units** make up the majority of the transactions with **377,293** records. This indicates that Units are the most common type of property in the dataset.

2. **Villas Follow as the Second Most Popular Property Type**:

    - **Villas** account for **71,968** transactions, making them the second most frequent property type. This suggests a significant market for larger residential properties.

3. **Land and Building Transactions are Less Frequent**:

    - **Land** transactions total **47,002**, while **Buildings** have only **1,960** transactions. This suggests that while land sales are still somewhat common, building transactions are relatively rare, likely due to the nature of the real estate market focusing more on individual units or developed properties.

4. **Consistency Between English and Arabic**:

    - There’s a direct one-to-one mapping between `property_type_en` and `property_type_ar`, meaning we can use either column for analysis. Since both contain the same information, we can choose one and remove the other, especially if memory optimization is a priority.

Now that we have a solid understanding of the property types in the dataset, we’ll continue by exploring the remaining columns to gain a deeper understanding of the data and how each feature might contribute to our analysis.

Let's inspect the property sub-type columns:

In [25]:
# Checking the unique values count in the property sub type columns
print("The unique values count in 'property_sub_type_id' column is: ")
print(transactions_sales_5y['property_sub_type_id'].value_counts(dropna=False))

print("\nThe unique values count in 'property_sub_type_en' column is: ")
print(transactions_sales_5y['property_sub_type_en'].value_counts(dropna=False))

print("\nThe unique values count in 'property_sub_type_ar' column is: ")
print(transactions_sales_5y['property_sub_type_ar'].value_counts(dropna=False))

The unique values count in 'property_sub_type_id' column is: 
property_sub_type_id
60.0     342833
NaN       68224
4.0       52686
42.0      11443
101.0     10484
112.0      8480
23.0       3837
75.0        121
69.0         31
2.0          20
43.0         18
70.0         15
71.0         11
67.0          8
44.0          7
66.0          5
Name: count, dtype: int64

The unique values count in 'property_sub_type_en' column is: 
property_sub_type_en
Flat                  342833
NaN                    68224
Villa                  52686
Office                 11443
Hotel Apartment        10484
Hotel Rooms             8480
Shop                    3837
Stacked Townhouses       121
Workshop                  31
Building                  20
Show Rooms                18
Warehouse                 15
Clinic                    11
Hotel                      8
Gymnasium                  7
Sized Partition            5
Name: count, dtype: int64

The unique values count in 'property_sub_type_ar' column is:

**Property Sub Type Columns Insights**

1. **Dominance of Flats**:

    - **342,833** entries are labeled as **Flats**, making it the most common property sub-type in the dataset. This heavy representation indicates that a significant portion of the market is focused on apartment-style living, which could be key for buyers and investors looking for high-density residential areas.

2. **Missing Data**:

    - There are **68,224 NaN** values in this column, which will need to be addressed. Strategies for handling these missing values, such as imputation or exclusion, will be important.

3. **Strong Representation of Villas**:

    - **52,686** entries are labeled as **Villas**, showing that Villas are also a significant part of the market. When we perform segmentation, we can focus on this category to highlight luxury or family-style properties with more space.

4. **Commercial and Mixed-Use Properties**:

    - There are several commercial property sub-types, including **Offices (11,443)**, **Shops (3,837)**, **Hotel Apartments (10,484)**, and **Hotel Rooms (8,480)**. These sub-types represent investment opportunities in the commercial and hospitality sectors, which could be highlighted for investors seeking business-related properties.

5. **Smaller Sub Type Categories**:

    - There are a number of smaller categories like **Stacked Townhouses (121)**, **Workshops (31)**, **Warehouses (15)**, and others. While these represent a very small portion of the dataset, they might cater to niche markets. The agent may want to highlight these for buyers seeking very specific property types.

6. **Handling Missing or Inconsistent Data**:

    - With the presence of missing and varied sub-types, we may need to clean and consolidate some categories or handle **NaNs** properly before moving forward with our model. For example, categories like **“Sized Partition”** or **“Gymnasium”** could potentially be merged with broader property types to simplify the dataset.
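
The consolidation idea in the last point can be sketched as follows. This is only an illustration under stated assumptions: the helper name `lump_rare`, the `"Other"` label, and the `min_count` threshold are all hypothetical choices, and the toy series stands in for `property_sub_type_en`.

```python
import pandas as pd

def lump_rare(series, min_count, other_label="Other"):
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()           # NaN is excluded from the counts by default
    rare = counts[counts < min_count].index  # labels below the threshold
    # Keep values that are not rare (including NaN); replace the rest
    return series.where(~series.isin(rare), other_label)

# Toy stand-in for the sub-type column (assumption: illustrative values only)
sub_types = pd.Series(
    ["Flat"] * 5 + ["Villa"] * 3 + ["Gymnasium", "Sized Partition", None]
)

lumped = lump_rare(sub_types, min_count=2)
print(lumped.value_counts(dropna=False))
```

Missing values are left untouched on purpose, so that deciding how to handle NaNs stays a separate step from merging rare labels.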

**Plan for Analysis**:

Since the **property type** column already provides clear and complete classifications of properties (**Units**, **Land**, **Villas**, and **Buildings**) with **no missing values**, I’ve decided to prioritize this column for our analysis. Using the **property subtype** column, which contains additional granular details like specific apartment or villa types, would likely add unnecessary complexity to the model without significantly enhancing its performance.

That said, rather than dropping the **property subtype** columns prematurely, I’ll continue exploring the dataset thoroughly before making any structural changes.

Now, we're going to explore the **property usage** columns:

In [26]:
# Displaying random observations of the dataset
transactions_sales_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,property_usage_en,property_usage_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
439064,1-11-2024-31578,2024-08-28,3,Unit,وحدة,42.0,Office,مكتب,Commercial,تجاري,1,Existing Properties,العقارات القائمة,526,Business Bay,الخليج التجارى,Tamani Arts Office,Tamani Arts Office,22.0,TAMANI ARTS OFFICES,مكاتب تماني للفنون,Business Bay,الخليج التجاري,Downtown Dubai,وسط مدينة دبي,Buj Khalifa Dubai Mall Metro Station,محطة مترو بوج خليفة دبي مول,Dubai Mall,مول دبي,Office,مكتب,1,41.21,14996.36,618000.0
851480,1-41-2019-4013,2019-10-23,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,330,Marsa Dubai,مرسى دبي,Bluewaters Residences 8,بلو واترز ريزيدنسز 8,1795.0,Bluewaters Residences,بلو واترز ريزيدنسز,,,Burj Al Arab,برج العرب,Jumeirah Beach Residency,مساكن شاطئ جميرا,Marina Mall,مارينا مول,3 B/R,ثلاث غرف,1,195.18,26191.21,5112000.0
656447,1-11-2022-17653,2022-07-26,4,Villa,فيلا,,,,Residential,سكني,1,Existing Properties,العقارات القائمة,374,Um Al Sheif,ام الشيف,,,,,,,,Burj Al Arab,برج العرب,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,,,0,1741.93,7576.65,13198000.0
476454,1-102-2024-14944,2024-03-11,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,0,Off-Plan Properties,على الخارطة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,BEVERLY RESIDENCE 2,بيفرلي ريزيدنس ?,2773.0,Beverly Residence 2,2 بيفيرلي رسيدانس,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Nakheel Metro Station,محطة مترو النخيل,Marina Mall,مارينا مول,Studio,استوديو,1,36.64,14465.07,530000.0
1278871,1-460-2023-236,2023-06-28,3,Unit,وحدة,60.0,Flat,شقه سكنيه,Residential,سكني,1,Existing Properties,العقارات القائمة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,National Bonds Residence,الصكوك الوطنية ريزيدنيس,,,,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Nakheel Metro Station,محطة مترو النخيل,Marina Mall,مارينا مول,1 B/R,غرفة,1,68.8,11302.15,777588.0


In [27]:
# Checking the unique values count in the property usage columns
print("The unique values count in 'property_usage_en' column is: ")
print(transactions_sales_5y['property_usage_en'].value_counts())

print("\nThe unique values count in 'property_usage_ar' column is: ")
print(transactions_sales_5y['property_usage_ar'].value_counts())

The unique values count in 'property_usage_en' column is: 
property_usage_en
Residential                 441230
Commercial                   30861
Hospitality                  18975
Other                         4837
Multi-Use                     1443
Industrial                     737
Agricultural                    93
Storage                         40
Residential / Commercial         7
Name: count, dtype: int64

The unique values count in 'property_usage_ar' column is: 
property_usage_ar
سكني                 441230
تجاري                 30861
ضيافة                 18975
أخرى                   4837
متعدد الاستخدامات      1443
صناعي                   737
زراعي                    93
تخزين                    40
سكني / تجاري              7
Name: count, dtype: int64


**Property Usage Columns Insights**

1. **Residential Properties**:

    - Residential properties make up the majority of the dataset with **441,230** entries, indicating that the dataset is predominantly focused on residential real estate.

2. **Commercial Properties**:

    - Commercial properties follow with **30,861** entries, which is still significant but far fewer than residential.

3. **Hospitality Properties**:

    - The hospitality properties have **18,975** entries, representing hotels and related businesses.

4. **Other Properties**:

    - There are also **Other**, **Multi-Use**, **Industrial**, **Agricultural**, **Storage**, and **Residential/Commercial** types, but they make up a very small portion of the dataset.

**Plans for Analysis**:

- **Splitting the dataset**: It would be useful to separate **residential** and **non-residential** properties for a more focused analysis.



In [28]:
# Splitting the dataset into residential and non-residential properties
# (.copy() makes each subset independent, so later in-place edits do not trigger chained-assignment warnings)
transactions_sales_residential_5y = transactions_sales_5y[transactions_sales_5y['property_usage_en'] == 'Residential'].copy()
transactions_sales_non_residential_5y = transactions_sales_5y[transactions_sales_5y['property_usage_en'] != 'Residential'].copy()

# Comparing the difference in shapes between the original and the split datasets
print("Original dataset shape: ", transactions_sales_5y.shape)
print("Residential dataset shape: ", transactions_sales_residential_5y.shape)
print("Non-residential dataset shape: ", transactions_sales_non_residential_5y.shape)

Original dataset shape:  (498223, 35)
Residential dataset shape:  (441230, 35)
Non-residential dataset shape:  (56993, 35)


In [29]:
# Removing the property usage columns from the residential dataset as it holds a single value
transactions_sales_residential_5y = transactions_sales_residential_5y.drop(columns=['property_usage_en', 'property_usage_ar'])

# Displaying few observations of the residential dataset
transactions_sales_residential_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
465893,1-41-2023-19956,2023-10-12,1,Land,أرض,,,,1,Existing Properties,العقارات القائمة,451,Al Hebiah Fifth,الحبيه الخامسة,,,2356.0,DAMAC LAGOONS - PORTOFINO,داماك لاجونر - بورتوفينو,DAMAC Lagoons,داماك لاغون,,,,,,,,,0,144.0,13888.89,2000000.0
1216986,1-11-2020-71,2020-01-06,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,442,Al Barsha South Fifth,البرشاء جنوب الخامسة,,,,,,Jumeirah Village Triangle,قرية جميرا المثلثة,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Damac Properties,عقارات داماك,Marina Mall,مارينا مول,,,0,690.0,3333.33,2300000.0
674542,1-102-2023-28552,2023-06-14,3,Unit,وحدة,60.0,Flat,شقه سكنيه,0,Off-Plan Properties,على الخارطة,482,Hadaeq Sheikh Mohammed Bin Rashid,حدائق الشيخ محمد بن راشد,399 Hills Park B,399 هيلز بارك ب,2510.0,399 Hills Park B,399 هيلز بارك ب,DUBAI HILLS,دبي هيليز,Global Village,قرية عالمية,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,1 B/R,غرفة,1,94.42,20451.6,1931040.0
542002,1-11-2022-31480,2022-12-13,3,Unit,وحدة,60.0,Flat,شقه سكنيه,1,Existing Properties,العقارات القائمة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,Dune Residency Dubai,ديونريزيدنسي دبي,2056.0,Dune Residency Dubai,ديون ريزيدنسي دبي,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Dubai Internet City,مدينة دبي للإنترنت,Mall of the Emirates,مول الإمارات,1 B/R,غرفة,1,78.66,10175.3,800389.0
525363,1-102-2023-65789,2023-12-12,3,Unit,وحدة,60.0,Flat,شقه سكنيه,0,Off-Plan Properties,على الخارطة,523,Al Hebiah Third,الحبية الثالثة,DAMAC HILLS - GOLF GREENS 1 - TOWER B,داماك هيلز - جولف جرينز 1 - تاور بي,2779.0,DAMAC HILLS - GOLF GREENS 1,داماك هيلز - جولف جرينز 1,DAMAC HILLS,داماك هيليز,Motor City,موتور سيتي,,,,,2 B/R,غرفتين,1,122.05,14787.38,1804800.0


After splitting the dataset into **residential properties** and **non-residential properties**, we will focus our attention on the residential properties dataset. Specifically, we will revisit the **property subtype** column within the residential dataset to inspect the unique values and identify any potential patterns or inconsistencies. Based on the results of this inspection, we will decide how to handle the property subtypes.

In [30]:
# Checking the unique values count in the property sub type columns
print("The unique values count in 'property_sub_type_id' column is: ")
print(transactions_sales_residential_5y['property_sub_type_id'].value_counts(dropna=False))

print("\nThe unique values count in 'property_sub_type_en' column is: ")
print(transactions_sales_residential_5y['property_sub_type_en'].value_counts(dropna=False))

print("\nThe unique values count in 'property_sub_type_ar' column is: ")
print(transactions_sales_residential_5y['property_sub_type_ar'].value_counts(dropna=False))

The unique values count in 'property_sub_type_id' column is: 
property_sub_type_id
60.0    342833
4.0      52686
NaN      45590
75.0       121
Name: count, dtype: int64

The unique values count in 'property_sub_type_en' column is: 
property_sub_type_en
Flat                  342833
Villa                  52686
NaN                    45590
Stacked Townhouses       121
Name: count, dtype: int64

The unique values count in 'property_sub_type_ar' column is: 
property_sub_type_ar
شقه سكنيه        342833
فيلا              52686
NaN               45590
منازل متلاصقة       121
Name: count, dtype: int64


**Property Sub Type Columns Insights**

1. **Dominance of Flats**:

    - The majority of the records are categorized as **Flats** with a count of **342,833**, which indicates that this is the most common residential property subtype.

    - **Villas** make up a significant portion with **52,686** entries, representing larger and possibly more luxurious residential units.

    - A small number of properties are labeled as **Stacked Townhouses (121)**, a subtype that is less common but may still be important depending on the analysis.

2. **Missing Values**:

    - There are **45,590** records where the property subtype is missing (NaN). This constitutes a considerable portion of the residential dataset, and these missing values need to be addressed before modeling.

**Plan for Handling Missing Values**:

To understand the missing values in the **property subtype** column, I will:

- **Compare missing subtypes with property types**: For example, if the property type is a Unit but the subtype is missing, we can infer that these are likely Flats.

In [31]:
# Displaying random observations where property sub type is missing
transactions_sales_residential_5y[transactions_sales_residential_5y['property_sub_type_id'].isnull()].sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
51121,1-11-2021-1203,2021-01-26,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,414,Al Barshaa South Second,البرشاء جنوب الثانية,,,1525.0,Villa Lantana 2,فيلا لانتانا 2,Dubiotech,دبيوتك,Motor City,موتور سيتي,Sharaf Dg Metro Station,محطة مترو شرف دي جي,Mall of the Emirates,مول الإمارات,,,0,360.08,9095.2,3275000.0
734805,1-41-2022-1394,2022-01-26,1,Land,أرض,,,,1,Existing Properties,العقارات القائمة,451,Al Hebiah Fifth,الحبيه الخامسة,,,2330.0,DAMAC LAGOONS - COSTA BRAVA (2),داماك لاجونز - كوستا برافا 2,DAMAC Lagoons,داماك لاغون,,,,,,,,,0,144.0,12423.61,1789000.0
834847,1-41-2021-12599,2021-11-22,1,Land,أرض,,,,1,Existing Properties,العقارات القائمة,449,Al Yufrah 2,اليفره 2,,,1861.0,DAMAC HILLS (2) - VICTORIA,داماك هيلز ( 2 ) - فيكتوريا,DAMAC HILLS 2,داماك هيليز 2,Dubai Cycling Course,دورة دبي للدراجات,,,,,,,0,155.49,6476.3,1007000.0
911853,1-11-2023-31930,2023-10-09,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,414,Al Barshaa South Second,البرشاء جنوب الثانية,,,1526.0,Villa Lantana 1,فيلا لانتانا 1,Dubiotech,دبيوتك,Motor City,موتور سيتي,Sharaf Dg Metro Station,محطة مترو شرف دي جي,Mall of the Emirates,مول الإمارات,,,0,354.0,13949.12,4937990.0
201511,1-11-2021-23493,2021-12-29,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,412,Al Merkadh,المركاض,,,1366.0,MOHAMMED BIN RASHID AL MAKTOUM CITY- DISTRICT ...,محمد بن راشد ال مكتوم سيتي- ديسترك ون، فيس ون,,,Downtown Dubai,وسط مدينة دبي,Business Bay Metro Station,محطة مترو الخليج التجاري,Dubai Mall,مول دبي,,,0,813.8,18800.69,15300000.0


In [32]:
# Inspecting the unique values count where property sub type is missing
transactions_sales_residential_5y[transactions_sales_residential_5y['property_sub_type_id'].isnull()]['property_type_en'].value_counts()

property_type_en
Land        31802
Villa       13270
Building      518
Name: count, dtype: int64

**Insights**

1. **Majority of Missing Sub Types are for Lands**:

    - The largest portion of missing subtypes comes from **Land properties** with **31,802** entries. This makes sense because Land typically does not have a subtype (like Flats or Villas) associated with it.

2. **Significant Missing Values for Villas**:

    - **Villas** account for **13,270** missing subtypes. Since Villas often have specific subtypes (such as Townhouses), the missing values in this case are more relevant and should be explored further.

3. **Minimal Impact for Buildings**:

    - Only **518** missing subtypes are associated with **Buildings**. Similar to Land, Buildings may not always have detailed subtypes.
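
As a complementary view to the raw counts above, the share of missing subtypes within each property type shows whether a type is entirely or only partly affected. Below is a minimal sketch on a toy frame; the frame and the name `missing_by_type` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the residential dataset (assumption: illustrative rows only)
toy = pd.DataFrame({
    "property_type_en": ["Unit", "Unit", "Villa", "Villa", "Land", "Building"],
    "property_sub_type_en": ["Flat", "Flat", "Villa", None, None, None],
})

# Flag missing subtypes, then summarise the flag per property type
missing_by_type = (
    toy.assign(sub_type_missing=toy["property_sub_type_en"].isna())
       .groupby("property_type_en")["sub_type_missing"]
       .agg(missing="sum", total="count", share="mean")
)
print(missing_by_type)
```

A share of 1.0 for a property type would suggest that the subtype is simply not recorded for that type, while a partial share points to inconsistent recording that deserves a closer look.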

**Plans for Analysis**

- For **Land** and **Building** properties, it might be acceptable to fill in the missing values with a simple dash (-) in the property subtype.

- For **Villas**, it’s important to inspect the affected records to determine whether the subtype can be inferred from other columns (e.g., area size, project, or location).


In [33]:
# Filling in the property sub type missing values with a "-" where property type is equal to "Land" or "Building"
transactions_sales_residential_5y.loc[(transactions_sales_residential_5y["property_sub_type_id"].isnull()) & 
                                      (transactions_sales_residential_5y['property_type_en'].isin(['Land', 'Building'])), 'property_sub_type_en'] = "-"

# Re-counting the rows with a missing sub type ID per property type
# Note: the fill above only sets 'property_sub_type_en'; 'property_sub_type_id' stays NaN, so these counts are unchanged
transactions_sales_residential_5y[transactions_sales_residential_5y['property_sub_type_id'].isnull()]['property_type_en'].value_counts()

property_type_en
Land        31802
Villa       13270
Building      518
Name: count, dtype: int64
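
The counts above are identical to those before the fill because the filter uses `property_sub_type_id`, which the fill does not modify. To confirm the imputation itself, the check needs to filter on `property_sub_type_en` instead. Here is a minimal sketch on a toy frame (an assumption standing in for `transactions_sales_residential_5y`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the residential dataset (assumption: illustrative rows only)
toy = pd.DataFrame({
    "property_type_en": ["Land", "Building", "Villa", "Villa", "Unit"],
    "property_sub_type_id": [np.nan, np.nan, np.nan, 4.0, 60.0],
    "property_sub_type_en": [None, None, None, "Villa", "Flat"],
})

# Same fill as above: write "-" into the English sub type for Land and Building rows
mask = toy["property_sub_type_id"].isna() & toy["property_type_en"].isin(["Land", "Building"])
toy.loc[mask, "property_sub_type_en"] = "-"

# Verify on the column that was actually filled
still_missing = toy.loc[toy["property_sub_type_en"].isna(), "property_type_en"].value_counts()
print(still_missing)
```

After the fill, only Villa rows should remain in this count, while the ID column still reports the Land and Building rows as missing.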

In [34]:
# Checking random observations where property sub type is missing and property type is "Villa"
transactions_sales_residential_5y[(transactions_sales_residential_5y['property_sub_type_id'].isnull()) &
                                  (transactions_sales_residential_5y['property_type_en'] == 'Villa')].sample(10)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
1045836,1-11-2019-1151,2019-02-07,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,315,Al Manara,المناره,,,,,,,,Burj Al Arab,برج العرب,First Abu Dhabi Bank Metro Station,محطة مترو بنك أبوظبي الأول,Mall of the Emirates,مول الإمارات,,,0,1374.96,6545.64,9000000.0
997367,1-11-2022-8964,2022-04-26,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,531,Al Hebiah Sixth,الحبيه السادسة,,,1521.0,Mudon-Phase 2,مدن-المرحله الثانيه (2),Mudon,مدن,Dubai Cycling Course,دورة دبي للدراجات,,,,,,,0,309.4,9049.77,2800000.0
1025746,1-11-2021-3993,2021-03-15,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,303,Um Suqaim First,ام سقيم الاولى,,,,,,,,Burj Al Arab,برج العرب,Noor Bank Metro Station,محطة مترو نور بنك,Mall of the Emirates,مول الإمارات,,,0,1372.83,7284.22,10000000.0
38689,1-11-2023-13439,2023-05-05,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,434,Wadi Al Safa 6,وادي الصفا 6,,,,,,Arabian Ranches - Almahra,المرابع العربية - المهره,Motor City,موتور سيتي,,,,,,,0,742.02,12297.51,9125000.0
771881,1-41-2019-1446,2019-05-14,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,463,Wadi Al Safa 7,وادي الصفا 7,,,1400.0,Arabian Ranches - Lila Community,المرابع العربيه - ليلا,Arabian Ranches II - LILA,المرابع العربية 2 - ليلا,Motor City,موتور سيتي,,,,,,,0,435.59,10975.66,4780888.0
242627,1-11-2021-8528,2021-05-25,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,452,Al Hebiah Second,الحبيه الثانية,,,1273.0,POLO HOMES,منازل بولو,Arabian Ranches - Polo Homes,المرابع العربية - بولو هومز,Motor City,موتور سيتي,,,,,,,0,2573.04,5549.86,14280000.0
91601,1-11-2019-229,2019-01-09,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,414,Al Barshaa South Second,البرشاء جنوب الثانية,,,1525.0,Villa Lantana 2,فيلا لانتانا 2,Dubiotech,دبيوتك,Motor City,موتور سيتي,Sharaf Dg Metro Station,محطة مترو شرف دي جي,Mall of the Emirates,مول الإمارات,,,0,113.78,7910.0,900000.0
1266772,1-11-2023-3952,2023-02-10,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,531,Al Hebiah Sixth,الحبيه السادسة,,,1337.0,MUDON,مدن,Mudon,مدن,Motor City,موتور سيتي,,,,,,,0,401.43,8469.72,3400000.0
775148,1-110-2022-582,2022-12-20,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,442,Al Barsha South Fifth,البرشاء جنوب الخامسة,,,,,,Jumeirah Village Triangle,قرية جميرا المثلثة,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Damac Properties,عقارات داماك,Marina Mall,مارينا مول,,,0,652.79,3982.9,2600000.0
624370,1-110-2022-372,2022-09-01,4,Villa,فيلا,,,,1,Existing Properties,العقارات القائمة,335,Nad Al Shiba First,ند الشبا الاولى,,,1341.0,MILLENNIUM ESTATES,عقارات الألفية,,,Downtown Dubai,وسط مدينة دبي,Business Bay Metro Station,محطة مترو الخليج التجاري,Dubai Mall,مول دبي,,,0,961.19,11964.34,11500000.0


In [35]:
# Checking if all the missing values of property_sub_type are of "Existing Properties"
transactions_sales_residential_5y[(transactions_sales_residential_5y['property_sub_type_id'].isnull()) &
                                  (transactions_sales_residential_5y['property_type_en'] == 'Villa')]['reg_type_en'].value_counts()

reg_type_en
Existing Properties    13270
Name: count, dtype: int64

It appears that all villas with missing `property_sub_type` values are part of the **Existing Properties** category. This could indicate that older or previously built properties (existing properties) may not have been registered with detailed subtype information during their original documentation.

It would be useful to check if **Off-Plan Properties** have a more consistent and complete `property_sub_type` for comparison. This might confirm that the issue is primarily with historical data in the existing properties.

In [36]:
# Checking how the residential records split between off-plan and existing properties
transactions_sales_residential_5y['reg_type_en'].value_counts()

reg_type_en
Off-Plan Properties    242906
Existing Properties    198324
Name: count, dtype: int64

In [37]:
# Checking the unique values count in the property sub type where property type is "Villa" and reg type is "Off-Plan Properties"
print("The unique values count in 'property_sub_type_en' column is: ")
transactions_sales_residential_5y[(transactions_sales_residential_5y['property_type_en'] == 'Villa') &
                                  (transactions_sales_residential_5y['reg_type_en'] == 'Off-Plan Properties')]['property_sub_type_en'].value_counts(dropna=False)

The unique values count in 'property_sub_type_en' column is: 


property_sub_type_en
Villa    36232
Name: count, dtype: int64

**Property Sub Types Insights**

1. **Land and Building Subtypes**:

    - For **lands** and **buildings**, the `property_sub_type` is often missing, and the values we do have are not especially informative for our model. This indicates that handling the `property_sub_type` for these categories is not necessary.

2. **Villas and Existing Properties**:

    - In the case of **villas**, we discovered that all missing values in the `property_sub_type` column pertain to existing properties. This implies that property subtype information was not consistently recorded for older, existing villas, reducing the usefulness of this field for predictive modeling.

3. **Off-Plan Villas Consistency**:

    - For **off-plan villas**, the `property_sub_type` is consistently recorded as “Villa,” which suggests that newer data is more complete but still redundant with the broader `property_type` classification.

4. **Stacked Townhouses**:

    - The `property_sub_type` field identifies **stacked townhouses**. Since stacked townhouses are effectively villas, we will reclassify them as **Villas** in the `property_type` columns.

**Plan for Analysis**

- **Convert Stacked Townhouses to Villas**: We will reclassify stacked townhouses as **Villas** in the `property_type` column to ensure consistency.

- **Remove the Property Subtype Columns**: After reclassification, the `property_sub_type` columns (`property_sub_type_id`, `property_sub_type_en`, `property_sub_type_ar`) will be removed entirely from the dataset. This decision is based on the lack of informative value and the redundancy of this column when compared to the more useful `property_type` column.

In [38]:
# Checking the unique values in "property_type_id" column
transactions_sales_residential_5y['property_type_id'].value_counts()

property_type_id
3    342954
4     65956
1     31802
2       518
Name: count, dtype: int64

In [39]:
# Checking the unique values in "property_type_en" column
transactions_sales_residential_5y['property_type_en'].value_counts()

property_type_en
Unit        342954
Villa        65956
Land         31802
Building       518
Name: count, dtype: int64

In [40]:
# Checking the unique values in "property_type_ar" column
transactions_sales_residential_5y['property_type_ar'].value_counts()

property_type_ar
وحدة    342954
فيلا     65956
أرض      31802
مبنى       518
Name: count, dtype: int64

In [41]:
# Checking the unique values in "property_sub_type_en" column
transactions_sales_residential_5y['property_sub_type_en'].value_counts(dropna=False)

property_sub_type_en
Flat                  342833
Villa                  52686
-                      32320
NaN                    13270
Stacked Townhouses       121
Name: count, dtype: int64
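The `-` entry in the output above looks like a placeholder rather than a real subtype. Its count (32,320) equals the Land and Building counts combined (31,802 + 518), which suggests, but does not prove, that it covers exactly those rows. A minimal sketch of treating the placeholder as missing, using a made-up `sub_type` series in place of the real column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column (values are illustrative only)
sub_type = pd.Series(["Flat", "-", np.nan, "Villa", "-"], name="property_sub_type_en")

# Treat the "-" placeholder as missing so it is counted together with NaN
sub_type_clean = sub_type.mask(sub_type == "-")
print(sub_type_clean.isna().sum())

# Arithmetic check using the counts reported in this notebook
land_count, building_count, dash_count = 31802, 518, 32320
print(land_count + building_count == dash_count)
```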

In [42]:
# Displaying random samples where the "property_sub_type_en" is "Stacked Townhouses"
transactions_sales_residential_5y[transactions_sales_residential_5y['property_sub_type_en'] == 'Stacked Townhouses'].sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
1081556,1-11-2021-10791,2021-06-25,3,Unit,وحدة,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 20,النبض المنازل مجموعه 20,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,170.62,6148.17,1049000.0
1060788,1-11-2022-3849,2022-03-07,3,Unit,وحدة,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 28,النبض المنازل مجموعه 28,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,3 B/R,ثلاث غرف,1,302.44,4463.7,1350000.0
793375,1-110-2024-384,2024-06-06,3,Unit,وحدة,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 19,النبض المنازل مجموعه 19,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,203.14,6399.53,1300000.0
1163726,1-11-2024-17383,2024-05-17,3,Unit,وحدة,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 3,النبض المنازل مجموعه 3,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,203.14,6153.39,1250000.0
691067,1-11-2021-15575,2021-09-07,3,Unit,وحدة,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 7,النبض المنازل مجموعه 7,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,78.3,4032.12,315715.0


In [43]:
# Rows whose "property_sub_type_en" is "Stacked Townhouses"
stacked_townhouses_mask = transactions_sales_residential_5y['property_sub_type_en'] == 'Stacked Townhouses'

# Changing "property_type_id" from 3 (Unit) to 4 (Villa) for stacked townhouses
transactions_sales_residential_5y.loc[stacked_townhouses_mask, 'property_type_id'] = 4

# Changing "property_type_en" from "Unit" to "Villa" for stacked townhouses
transactions_sales_residential_5y.loc[stacked_townhouses_mask, 'property_type_en'] = 'Villa'

# Changing "property_type_ar" from "وحدة" to "فيلا" for stacked townhouses
transactions_sales_residential_5y.loc[stacked_townhouses_mask, 'property_type_ar'] = 'فيلا'

# Displaying random samples where the "property_sub_type_en" is "Stacked Townhouses"
transactions_sales_residential_5y[transactions_sales_residential_5y['property_sub_type_en'] == 'Stacked Townhouses'].sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,property_sub_type_id,property_sub_type_en,property_sub_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
485015,1-41-2021-351,2021-01-07,4,Villa,فيلا,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 15,النبض المنازل مجموعه 15,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,3 B/R,ثلاث غرف,1,302.44,3802.41,1150000.0
1287748,1-41-2021-3677,2021-04-13,4,Villa,فيلا,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 22,النبض المنازل مجموعه 22,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,152.94,5845.43,894000.0
402691,1-11-2022-20497,2022-08-25,4,Villa,فيلا,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 20,النبض المنازل مجموعه 20,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,167.73,6677.4,1120000.0
198313,1-11-2023-5731,2023-02-28,4,Villa,فيلا,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 36,النبض المنازل مجموعه 36,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,169.12,6799.91,1150000.0
1225469,1-11-2023-6853,2023-03-09,4,Villa,فيلا,75.0,Stacked Townhouses,منازل متلاصقة,1,Existing Properties,العقارات القائمة,462,Madinat Al Mataar,مدينة المطار,The Pulse Townhouses Cluster 3,النبض المنازل مجموعه 3,1804.0,THE PULSE TOWNHOUSES,النبض المنازل,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,2 B/R,غرفتين,1,213.6,5969.1,1275000.0


In [44]:
# Dropping the property sub type columns from the residential dataset
transactions_sales_residential_5y = transactions_sales_residential_5y.drop(
    columns=['property_sub_type_id', 'property_sub_type_en', 'property_sub_type_ar'])

# Displaying random samples of the residential dataset
transactions_sales_residential_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
692605,1-102-2024-49646,2024-07-12,3,Unit,وحدة,0,Off-Plan Properties,على الخارطة,370,Um Suqaim Third,ام سقيم الثالثه,Al Jazi 4,الجازي 4,2407.0,Al Jazi - Madinat Jumeriah Living,الجازي - مدينة جميرا ليفينج,,,,,,,,,1 B/R,غرفة,1,68.58,26319.63,1805000.0
1203563,1-102-2023-8213,2023-02-21,4,Villa,فيلا,0,Off-Plan Properties,على الخارطة,467,Wadi Al Safa 5,وادي الصفا 5,,,2534.0,Arabian Ranches lll - Anya,المرابع العربية ااا - انيا,,,,,,,,,3 B/R,ثلاث غرف,0,144.84,14428.94,2089888.0
674858,1-102-2024-41034,2024-06-10,3,Unit,وحدة,0,Off-Plan Properties,على الخارطة,409,Al Barshaa South Third,البرشاء جنوب الثالثة,The Central Downtown-A,ذا سنترال داون تاون - أ,2861.0,The central downtown,وسط البلد,Arjan,أرجان,Motor City,موتور سيتي,Sharaf Dg Metro Station,محطة مترو شرف دي جي,Mall of the Emirates,مول الإمارات,1 B/R,غرفة,1,79.13,14518.99,1148888.0
692168,1-102-2021-13983,2021-08-23,4,Villa,فيلا,0,Off-Plan Properties,على الخارطة,462,Madinat Al Mataar,مدينة المطار,,,1944.0,The Pulse Villas,فلل النبض,Dubai South Residential District,المدينة السكنية بدبي الجنوب,Expo 2020 Site,موقع إكسبو 2020,,,,,3 B/R,ثلاث غرف,0,133.38,9311.74,1242000.0
1124838,1-41-2023-1860,2023-01-24,1,Land,أرض,1,Existing Properties,العقارات القائمة,451,Al Hebiah Fifth,الحبيه الخامسة,,,2489.0,DAMAC LAGOONS - MARBELLA,داماك لاجونز- ماربيلا,,,,,,,,,,,0,144.0,13194.44,1900000.0


In [45]:
# Checking the unique values count in the property type columns
print("The unique values count in 'property_type_id' column is: ")
print(transactions_sales_residential_5y['property_type_id'].value_counts())

print("\nThe unique values count in 'property_type_en' column is: ")
print(transactions_sales_residential_5y['property_type_en'].value_counts())

print("\nThe unique values count in 'property_type_ar' column is: ")
print(transactions_sales_residential_5y['property_type_ar'].value_counts())

The unique values count in 'property_type_id' column is: 
property_type_id
3    342833
4     66077
1     31802
2       518
Name: count, dtype: int64

The unique values count in 'property_type_en' column is: 
property_type_en
Unit        342833
Villa        66077
Land         31802
Building       518
Name: count, dtype: int64

The unique values count in 'property_type_ar' column is: 
property_type_ar
وحدة    342833
فيلا     66077
أرض      31802
مبنى       518
Name: count, dtype: int64
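As a quick sanity check on the reclassification, the counts reported before and after the change should differ by exactly the number of stacked townhouses. The sketch below only re-uses numbers printed in this notebook; the variable names are made up for illustration.

```python
# Counts reported before and after the reclassification in this notebook
units_before, villas_before = 342954, 65956
units_after, villas_after = 342833, 66077
stacked_townhouses = 121

# Every stacked townhouse should have moved from Unit to Villa
assert units_before - units_after == stacked_townhouses
assert villas_after - villas_before == stacked_townhouses

# The combined number of Unit and Villa transactions must be unchanged
assert units_before + villas_before == units_after + villas_after
print("Reclassification counts are consistent")
```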


**Property Type Columns Insights**

1. **Dominance of Units and Villas**:

    - The majority of the transactions are related to **Units (342,833 transactions)** and **Villas (66,077 transactions)**. These two categories make up the largest portion of the data, with **Units** significantly outweighing Villas.

    - This suggests that **Units** are the most common type of property in the dataset, likely indicating a higher demand or more frequent sales for apartments or flats in residential areas.

    - **Villas** are the next most prominent category, suggesting there is a substantial market for larger properties, which could be more attractive to families or high-net-worth individuals.

2. **Smaller Proportion of Land and Buildings**:

    - **Land (31,802 transactions)** and **Buildings (518 transactions)** make up a smaller portion of the dataset.

    - **Land transactions** are still relatively significant, likely due to development and investment opportunities, but **Buildings** represent a very small portion of the data.

    - This low count suggests that entire buildings are sold far less often than individual properties, possibly because they are more expensive, resulting in less overall activity in this segment.

3. **Consistency Across Languages**:

    - The Arabic (`property_type_ar`) and English (`property_type_en`) columns are fully aligned, meaning that the same number of transactions are categorized in both languages without any discrepancies. This is good for ensuring data integrity when performing any further analysis in either language.
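Matching counts show that the totals agree, but they do not by themselves prove that every row carries the same category in all three columns. A row-level check could look like the sketch below, which uses a made-up `toy` frame with the dataset's column names.

```python
import pandas as pd

# Toy stand-in for the property type columns (values are illustrative only)
toy = pd.DataFrame({
    "property_type_id": [3, 3, 4, 1, 2],
    "property_type_en": ["Unit", "Unit", "Villa", "Land", "Building"],
    "property_type_ar": ["وحدة", "وحدة", "فيلا", "أرض", "مبنى"],
})

# Each id should pair with exactly one English and one Arabic label
labels_per_id = (
    toy.groupby("property_type_id")[["property_type_en", "property_type_ar"]].nunique()
)
is_consistent = bool((labels_per_id == 1).all().all())
print(labels_per_id)
print(is_consistent)
```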

**Plans for Analysis**

1. **Excluding Buildings**:

    - **Exclusion makes sense**: With only 518 transactions for buildings, they represent a very small fraction of the dataset. Including them in the model may lead to difficulties in model training due to insufficient data, and they might not add significant value to the overall analysis.

2. **Splitting Data by Property Type (Units, Villas, Land, and Buildings)**:

    - **Tailored Models**: **Units** and **Villas** have distinct market dynamics. Units (typically apartments or flats) tend to be smaller and more affordable, and could have higher transaction volumes, while Villas are larger and more expensive. By splitting the data, we can build more tailored models that capture the unique characteristics of each property type.

    - **Separate Treatment for Land and Buildings**: The **land** data (**31,802** transactions) and **buildings** data (**518** transactions) represent a different type of investment altogether. Land transactions often focus more on development opportunities and long-term value appreciation, and they may not follow the same price trends as completed properties (units and villas).
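To put numbers on how small the buildings segment is, the shares can be computed directly from the counts reported above. This is a sketch only: the 1% cut-off (`MIN_SHARE_PCT`) is a hypothetical threshold chosen for illustration, not a rule used elsewhere in this project.

```python
import pandas as pd

# Counts reported by value_counts() in this notebook
counts = pd.Series({"Unit": 342833, "Villa": 66077, "Land": 31802, "Building": 518})

# Percentage share of each property type
shares = (counts / counts.sum() * 100).round(2)
print(shares)

# Hypothetical cut-off: flag types below 1% of transactions as too sparse to model separately
MIN_SHARE_PCT = 1.0
sparse_types = shares[shares < MIN_SHARE_PCT].index.tolist()
print(sparse_types)
```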



In [46]:
# Displaying random observations where property type is "Building"
transactions_sales_residential_5y[transactions_sales_residential_5y['property_type_en'] == 'Building'].sample(10)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
122174,1-11-2022-7051,2022-04-13,2,Building,مبنى,1,Existing Properties,العقارات القائمة,264,Nad Al Hamar,ند الحمر,,,,,,,,Dubai International Airport,مطار دبي الدولي,Rashidiya Metro Station,محطة مترو الراشدية,City Centre Mirdif,سيتي سنتر مردف,,,0,1249.36,3201.64,4000000.0
983825,1-11-2021-22340,2021-12-13,2,Building,مبنى,1,Existing Properties,العقارات القائمة,230,Abu Hail,ابو هيل,,,,,,,,Dubai International Airport,مطار دبي الدولي,Abu Baker Al Siddique Metro Station,محطة مترو أبو بكر الصديق,,,,,0,334.45,11212.44,3750000.0
209475,1-11-2024-30172,2024-08-15,2,Building,مبنى,1,Existing Properties,العقارات القائمة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,,,753.0,RELIANCE 5,ريلاينس 5,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Dubai Internet City,مدينة دبي للإنترنت,Marina Mall,مارينا مول,,,0,2322.31,8008.03,18597139.0
1231237,1-11-2021-6573,2021-04-22,2,Building,مبنى,1,Existing Properties,العقارات القائمة,239,Al Baraha,البراحه,,,,,,,,Dubai International Airport,مطار دبي الدولي,Salah Al Din Metro Station,محطة مترو صلاح الدين,Dubai Mall,مول دبي,,,0,148.37,33362.54,4950000.0
126014,1-11-2020-8621,2020-10-06,2,Building,مبنى,1,Existing Properties,العقارات القائمة,435,Al Hebiah Fourth,الحبيه الرابعة,,,2746.0,Golf Vista Heights,جولف فيستا هايتس,Dubai Sports City,مدينة دبي الرياضية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Nakheel Metro Station,محطة مترو النخيل,Marina Mall,مارينا مول,,,0,5165.5,2903.88,15000000.0
20688,1-11-2019-1231,2019-02-11,2,Building,مبنى,1,Existing Properties,العقارات القائمة,303,Um Suqaim First,ام سقيم الاولى,,,,,,,,Burj Al Arab,برج العرب,Noor Bank Metro Station,محطة مترو نور بنك,Mall of the Emirates,مول الإمارات,,,0,1101.37,14527.36,16000000.0
552578,1-11-2020-9935,2020-11-04,2,Building,مبنى,1,Existing Properties,العقارات القائمة,317,Jumeirah First,جميرا الاولى,,,,,,,,Burj Khalifa,برج خليفة,Emirates Towers Metro Station,محطة مترو أبراج الإمارات,Dubai Mall,مول دبي,,,0,2935.73,5597.24,16432000.0
506598,1-11-2023-25537,2023-08-15,2,Building,مبنى,1,Existing Properties,العقارات القائمة,466,Wadi Al Safa 4,وادي الصفا 4,,,,,,City Of Arabia,ستي اوف ارابيا,IMG World Adventures,آي إم جي وورلد أدفينتشرز,,,,,,,0,3228.57,4398.6,14201194.0
498679,1-11-2022-9714,2022-05-11,2,Building,مبنى,1,Existing Properties,العقارات القائمة,264,Nad Al Hamar,ند الحمر,,,,,,,,Dubai International Airport,مطار دبي الدولي,Rashidiya Metro Station,محطة مترو الراشدية,City Centre Mirdif,سيتي سنتر مردف,,,0,1393.55,5381.94,7500000.0
943253,1-11-2022-32081,2022-12-19,2,Building,مبنى,1,Existing Properties,العقارات القائمة,233,Hor Al Anz,هور العنز,,,,,,,,Dubai International Airport,مطار دبي الدولي,Abu Hail Metro Station,محطة مترو أبو هيل,City Centre Mirdif,سيتي سنتر مردف,,,0,211.35,10645.85,2250000.0


In [47]:
# Displaying the unique values count in the property type columns
transactions_sales_residential_5y['property_type_en'].value_counts()


property_type_en
Unit        342833
Villa        66077
Land         31802
Building       518
Name: count, dtype: int64

In [48]:
# Filtering the dataset for each of the property types
transactions_sales_residential_units_5y = transactions_sales_residential_5y[
    transactions_sales_residential_5y['property_type_en'] == 'Unit'
    ]
transactions_sales_residential_villas_5y = transactions_sales_residential_5y[
    transactions_sales_residential_5y['property_type_en'] == 'Villa'
    ]
transactions_sales_residential_buildings_5y = transactions_sales_residential_5y[
    transactions_sales_residential_5y['property_type_en'] == 'Building'
    ]
transactions_sales_residential_lands_5y = transactions_sales_residential_5y[
    transactions_sales_residential_5y['property_type_en'] == 'Land'
    ]

transactions_sales_residential_units_villas_5y = transactions_sales_residential_5y[
    transactions_sales_residential_5y['property_type_en'].isin(['Unit', 'Villa'])
    ]


# Displaying the different shapes for each of the filtered datasets
print("Units dataset shape: ", transactions_sales_residential_units_5y.shape)
print("Villas dataset shape: ", transactions_sales_residential_villas_5y.shape)
print("Buildings dataset shape: ", transactions_sales_residential_buildings_5y.shape)
print("Lands dataset shape: ", transactions_sales_residential_lands_5y.shape)
print("Units & Villas dataset shape: ", transactions_sales_residential_units_villas_5y.shape)
 

Units dataset shape:  (342833, 30)
Villas dataset shape:  (66077, 30)
Buildings dataset shape:  (518, 30)
Lands dataset shape:  (31802, 30)
Units & Villas dataset shape:  (408910, 30)


In [49]:
# Saving the split transactions datasets
transactions_sales_residential_units_5y.to_csv("../data/processed/transactions_sales_residential_units_5y.csv", index=False)
transactions_sales_residential_villas_5y.to_csv("../data/processed/transactions_sales_residential_villas_5y.csv", index=False)
transactions_sales_residential_buildings_5y.to_csv("../data/processed/transactions_sales_residential_buildings_5y.csv", index=False)
transactions_sales_residential_lands_5y.to_csv("../data/processed/transactions_sales_residential_lands_5y.csv", index=False)
transactions_sales_residential_units_villas_5y.to_csv("../data/processed/transactions_sales_residential_units_villas_5y.csv", index=False)
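One thing to keep in mind when these files are loaded again: CSV does not store dtypes, so a datetime column such as `instance_date` comes back as text unless it is parsed on read. The sketch below demonstrates this with a small made-up frame and an in-memory buffer instead of the real files.

```python
import io
import pandas as pd

# Toy stand-in for one of the saved datasets (values are illustrative only)
toy = pd.DataFrame({
    "transaction_id": ["1-11-2021-10791", "1-11-2022-3849"],
    "instance_date": pd.to_datetime(["2021-06-25", "2022-03-07"]),
    "actual_worth": [1049000.0, 1350000.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the date column comes back as plain strings
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored
buffer.seek(0)
reloaded_dates = pd.read_csv(buffer, parse_dates=["instance_date"])

print(reloaded_plain["instance_date"].dtype)
print(reloaded_dates["instance_date"].dtype)
```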

Now that I’ve split the dataset by property type, creating separate datasets for units, villas, buildings, and lands, plus a combined units & villas dataset, I’m ready to explore the data thoroughly. Given that most of the transactions involve units and villas, my primary focus will be on understanding these two categories in detail.

Let's continue by inspecting the rest of the columns in the combined Units & Villas dataset.

In [52]:
# Displaying information about the Units & Villas dataset
print("Units & Villas dataset information:")
print(transactions_sales_residential_units_villas_5y.info())

print("\nUnits & Villas missing data Percentages:")
print(transactions_sales_residential_units_villas_5y.isnull().sum() / transactions_sales_residential_units_villas_5y.shape[0] * 100)

Units & Villas dataset information:
<class 'pandas.core.frame.DataFrame'>
Index: 408910 entries, 25 to 1314123
Data columns (total 30 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   transaction_id       408910 non-null  object        
 1   instance_date        408910 non-null  datetime64[ns]
 2   property_type_id     408910 non-null  int64         
 3   property_type_en     408910 non-null  object        
 4   property_type_ar     408910 non-null  object        
 5   reg_type_id          408910 non-null  int64         
 6   reg_type_en          408910 non-null  object        
 7   reg_type_ar          408910 non-null  object        
 8   area_id              408910 non-null  int64         
 9   area_name_en         408910 non-null  object        
 10  area_name_ar         408910 non-null  object        
 11  building_name_en     342954 non-null  object        
 12  building_name_ar     342533 non-null  object        
...

**Units & Villas Information Insights**

1. **Building Names**:

    - **16.13%** of `building_name_en` and **16.23%** of `building_name_ar` values are missing.

    - The Arabic and English columns have slightly different missing rates, an inconsistency that we will investigate.

2. **Project Information**:

    - Around **10.38%** of values are missing for `project_number`, `project_name_en`, and `project_name_ar`.

    - Important for understanding properties linked to large developments, which could impact pricing trends.

3. **Master Project Information**:

    - **21.23%** of the `master_project_en` and `master_project_ar` columns have missing values.

    - The master project data can provide context about major developments or districts that might have specific trends or influence property prices.

4. **Proximity Features**:

    - Significant missing data in proximity columns:

        - **21.08%** for landmarks.

        - **32.89%** for metro stations.

        - **33.21%** for malls.

    - Proximity features are important for price prediction, and missing data may need to be handled through imputation or exclusion.

5. **Rooms Information**:

    - Only **3.33%** missing values in the `rooms_en` and `rooms_ar` columns.

    - Room count is crucial for property segmentation and valuation.

6. **Parking Information**:

    - **No missing data** in the `has_parking` column, ensuring reliable data for parking availability, a key factor in real estate value.

7. **Price-Related Information**:

    - No missing values in `meter_sale_price` and `actual_worth`, which are the most important columns for price prediction.
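Two of the points above lend themselves to a small sketch: locating the rows behind the Arabic/English building-name mismatch, and keeping track of missing proximity values instead of dropping rows. The frame `toy`, the `nearest_metro_missing` flag, and the `"Unknown"` fill value are all assumptions made for illustration; the actual handling is still to be decided.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Units & Villas dataset (values are illustrative only)
toy = pd.DataFrame({
    "building_name_en": ["Al Jazi 4", "LAKESIDE B", np.nan],
    "building_name_ar": ["الجازي 4", np.nan, np.nan],
    "nearest_metro_en": ["Creek Metro Station", np.nan, np.nan],
})

# Rows where exactly one of the two building-name columns is missing
name_mismatch = toy["building_name_en"].isna() ^ toy["building_name_ar"].isna()
print(toy[name_mismatch])

# Hypothetical handling for a proximity column: keep an explicit missing flag,
# then fill the gap with an "Unknown" category instead of dropping the row
toy["nearest_metro_missing"] = toy["nearest_metro_en"].isna()
toy["nearest_metro_en"] = toy["nearest_metro_en"].fillna("Unknown")
print(toy[["nearest_metro_en", "nearest_metro_missing"]])
```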


In [53]:
# Displaying random observations of the Units & Villas dataset
transactions_sales_residential_units_villas_5y.sample(5)

Unnamed: 0,transaction_id,instance_date,property_type_id,property_type_en,property_type_ar,reg_type_id,reg_type_en,reg_type_ar,area_id,area_name_en,area_name_ar,building_name_en,building_name_ar,project_number,project_name_en,project_name_ar,master_project_en,master_project_ar,nearest_landmark_en,nearest_landmark_ar,nearest_metro_en,nearest_metro_ar,nearest_mall_en,nearest_mall_ar,rooms_en,rooms_ar,has_parking,procedure_area,meter_sale_price,actual_worth
243079,1-102-2024-79625,2024-10-01,3,Unit,وحدة,0,Off-Plan Properties,على الخارطة,505,Madinat Hind 4,مدينة هند 4,DAMAC HILLS (2) - ELO 3,داماك هيلز (2) - أيلو 3,3122.0,DAMAC HILLS (2) - ELO 2 & ELO 3,داماك هيلز (2) - ايلو 2 و ايلو 3,DAMAC HILLS 2,داماك هيليز 2,,,,,,,1 B/R,غرفة,1,51.77,12729.38,659000.0
1248736,1-41-2024-8367,2024-05-03,3,Unit,وحدة,1,Existing Properties,العقارات القائمة,447,Al Khairan First,الخيران الأولى,Palace Residences,بالاس رزيدنسز,2128.0,Palace Residences - Dubai Creek Harbour,بالاس رزيدنسز - خور دبي,The Lagoons,الخيران,Dubai International Airport,مطار دبي الدولي,Creek Metro Station,محطة مترو الخور,City Centre Mirdif,سيتي سنتر مردف,3 B/R,ثلاث غرف,1,167.22,25415.62,4250000.0
458239,1-102-2019-22891,2019-12-16,3,Unit,وحدة,0,Off-Plan Properties,على الخارطة,390,Burj Khalifa,برج خليفة,DT1,DT1,1797.0,DT 1,1 دي تي,Business Bay,الخليج التجاري,Downtown Dubai,وسط مدينة دبي,Business Bay Metro Station,محطة مترو الخليج التجاري,Dubai Mall,مول دبي,3 B/R,ثلاث غرف,1,235.08,15099.36,3549557.0
1075187,1-11-2022-19874,2022-08-18,3,Unit,وحدة,1,Existing Properties,العقارات القائمة,485,Me'Aisem First,معيصم الأول,LAKESIDE B,ليك سايد بي,436.0,LAKESIDE,ليكسايد,International Media Production Zone,المنطقة العالمية للإنتاج الإعلامي,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Damac Properties,عقارات داماك,Marina Mall,مارينا مول,Studio,استوديو,1,34.31,6266.39,215000.0
1072219,1-11-2022-25821,2022-10-20,3,Unit,وحدة,1,Existing Properties,العقارات القائمة,441,Al Barsha South Fourth,البرشاء جنوب الرابعة,TUSCAN RESIDENCES1-AREZZO 1,تاسكون ريزيدنس 1 - أريزو 1,1437.0,TUSCAN RESIDENCES1 - AREZZO,مساكن توسكان 1 - أريزو,Jumeirah Village Circle,قرية جميرا الدائرية,Sports City Swimming Academy,أكاديمية المدينة الرياضية للسباحة,Dubai Internet City,مدينة دبي للإنترنت,Mall of the Emirates,مول الإمارات,Studio,استوديو,1,51.19,7325.65,375000.0


Let's inspect the Registration Type columns:

In [55]:
# Checking the unique values count in the registration type columns in the Units & Villas dataset
print("The unique values count in 'reg_type_id' column in the units & villas dataset is: ")
print(transactions_sales_residential_units_villas_5y['reg_type_id'].value_counts())

print("\nThe unique values count in 'reg_type_en' column in the units & villas dataset is: ")
print(transactions_sales_residential_units_villas_5y['reg_type_en'].value_counts())

print("\nThe unique values count in 'reg_type_ar' column in the units & villas dataset is: ")
print(transactions_sales_residential_units_villas_5y['reg_type_ar'].value_counts())

The unique values count in 'reg_type_id' column in the units & villas dataset is: 
reg_type_id
0    242906
1    166004
Name: count, dtype: int64

The unique values count in 'reg_type_en' column in the units & villas dataset is: 
reg_type_en
Off-Plan Properties    242906
Existing Properties    166004
Name: count, dtype: int64

The unique values count in 'reg_type_ar' column in the units & villas dataset is: 
reg_type_ar
على الخارطة         242906
العقارات القائمة    166004
Name: count, dtype: int64


**Registration Type Columns Observations**

1. **Off-Plan Properties Dominate the Dataset**:

    - The dataset contains **242,906** transactions for **Off-Plan Properties**, which make up the majority of the entries (about 59.4% of the 408,910 units & villas transactions).

    - This indicates a strong focus on properties still under development, reflecting the popularity of investing in off-plan units and villas in Dubai.

2. **Existing Properties Have a Smaller Share**:

    - There are **166,004** transactions related to **Existing Properties** (about 40.6%).

    - While smaller in comparison, this still represents a significant portion of the dataset, suggesting that existing properties are also a viable market but not as dominant as off-plan properties.

3. **Balance Between Off-Plan and Existing Properties**:

    - The relatively large number of **Off-Plan** transactions suggests that Dubai’s real estate market is heavily driven by future projects and developments.

    - Investors and buyers are likely drawn to the benefits of off-plan projects, such as lower initial prices and potential for capital appreciation once completed.

**Plans for Analysis**

- Given the dominance of off-plan transactions, the market for new developments is critical. Forecasting models for **off-plan properties** will need to account for project completion dates, developer reputation, and market anticipation.

- For **existing properties**, the focus will be on understanding how current market conditions and area-specific factors influence price trends.
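Before modelling the two registration types separately, it may help to quantify the split and see how it varies by property type. The first part of the sketch below uses the counts reported above; the second part uses a made-up `toy` frame, so its numbers say nothing about the real data.

```python
import pandas as pd

# Counts reported for the Units & Villas dataset in this notebook
reg_counts = pd.Series({"Off-Plan Properties": 242906, "Existing Properties": 166004})
reg_share_pct = (reg_counts / reg_counts.sum() * 100).round(1)
print(reg_share_pct)

# Toy frame (illustrative values) showing how the split could be broken down by property type
toy = pd.DataFrame({
    "property_type_en": ["Unit", "Unit", "Unit", "Villa", "Villa"],
    "reg_type_en": ["Off-Plan Properties", "Off-Plan Properties", "Existing Properties",
                    "Off-Plan Properties", "Existing Properties"],
})
breakdown = pd.crosstab(toy["property_type_en"], toy["reg_type_en"])
print(breakdown)
```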

