# <div style=" text-align: center; font-weight: bold">Phase 02: Preprocessing data</div>

This notebook preprocesses the data on real estate listed for sale.

## Import necessary Python modules

In [96]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## **Explore the data**

### Read the data from file:

In [97]:
real_estate_for_sale_df = pd.read_csv('../Data/real_estate_for_sale.csv')
real_estate_for_sale_df.head()

Unnamed: 0,Address,Area,Price,Bedroom,Toilet,Floor,Furniture,Direction,Legal,Posting date,Expiry date,Ad type,Ad code
0,"Dự án Zenity, Đường Võ Văn Kiệt, Phường Cầu Kh...","161,08 m²","17,98 tỷ",3 phòng,3 phòng,,Đầy đủ,Đông - Bắc,Sổ đỏ/ Sổ hồng,06/12/2023,13/12/2023,Tin VIP Kim Cương,38720367
1,"Dự án Zenity, Đường Võ Văn Kiệt, Phường Cầu Kh...",116 m²,"9,8 tỷ",3 phòng,2 phòng,,Đầy đủ,Đông - Bắc,Sổ đỏ/ Sổ hồng,03/12/2023,10/12/2023,Tin VIP Kim Cương,38693652
2,"Dự án Zenity, Đường Võ Văn Kiệt, Phường Cầu Kh...",108 m²,70 triệu/m²,2 phòng,2 phòng,,Đầy đủ.,Đông - Nam,Sổ đỏ/ Sổ hồng,01/12/2023,11/12/2023,Tin VIP Kim Cương,38481731
3,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",76 m²,"6,2 tỷ",2 phòng,2 phòng,,Đầy đủ,,Hợp đồng mua bán,05/12/2023,12/12/2023,Tin VIP Kim Cương,38393009
4,"Dự án Zenity, Quận 1, Hồ Chí Minh",77 m²,80 triệu/m²,2 phòng,2 phòng,,Đầy đủ,,Sổ đỏ/ Sổ hồng,07/12/2023,14/12/2023,Tin VIP Kim Cương,38734339


#### Num of rows and columns:

In [98]:
num_rows, num_cols = real_estate_for_sale_df.shape

print(f'Num of rows:  {num_rows}')
print (f'Num of columns:  {num_cols}')

Num of rows:  55358
Num of columns:  13


#### What does each row mean? Does it matter if rows have different meanings?

The data was collected by crawling the website https://batdongsan.com.vn/    
Each row records one real estate advertisement, so no row has a different meaning from the others.

#### Num of duplicated rows:

In [99]:
duplicate = real_estate_for_sale_df.duplicated().sum()

print (f' Nums of duplicated rows: {duplicate}')

 Nums of duplicated rows: 275


We can see that the dataset contains duplicated rows. The likely reason is that some advertisements were reposted on the website, which produces identical records.
So we will drop these duplicated rows.

In [100]:
real_estate_for_sale_df.drop_duplicates(inplace= True)
real_estate_for_sale_df = real_estate_for_sale_df.reset_index(drop=True)
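
As a small illustration of what the deduplication step does, the sketch below runs it on a toy frame. The rows and values are made up for the example and are not taken from the dataset. Passing `ignore_index=True` should renumber the index in the same call, which would make the separate `reset_index` step unnecessary.

```python
import pandas as pd

# Toy data (hypothetical): the second row is an exact repost of the first.
toy_df = pd.DataFrame({
    'Ad code': [101, 101, 102],
    'Price': ['6,2 tỷ', '6,2 tỷ', '9,8 tỷ'],
})

# duplicated() flags a row only when every column matches an earlier row.
num_duplicates = int(toy_df.duplicated().sum())

# ignore_index=True relabels the remaining rows 0..n-1.
deduped_df = toy_df.drop_duplicates(ignore_index=True)

print(num_duplicates)
print(deduped_df)
```

Note that this only removes rows that are identical in every column; a repost whose posting date changed would not be caught.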

#### Ratio of missing values for each column:

In [101]:
def missing_ratio(column):
    missing_values = column.isnull().sum()
    total_values = len(column)
    return (missing_values / total_values) * 100

missing_ratios_df = real_estate_for_sale_df.agg(missing_ratio).to_frame()
missing_ratios_df.columns = ['Missing ratio']

missing_ratios_df

Unnamed: 0,Missing ratio
Address,0.0
Area,0.036309
Price,0.0
Bedroom,38.367554
Toilet,42.279832
Floor,54.434217
Furniture,58.542563
Direction,78.232849
Legal,31.176588
Posting date,0.0


We can see that the fields `Address`, `Price`, `Posting date`, `Expiry date`, `Ad type` and `Ad code` have no missing values. The reason is easy to understand: `Address` and `Price` are basic information that every post must contain, and the other fields are generated automatically by the website.     
`Area` is missing in only about 0.04 % of the rows. All the remaining columns have a large ratio of missing values, from *31.176588 %* in `Legal` up to *78.232849 %* in `Direction`. So preprocessing is necessary before we analyze the data.
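
The same ratios can probably be obtained without a helper function, because the mean of a boolean mask is the share of `True` values. The sketch below shows this on a toy frame whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with some missing entries.
toy_df = pd.DataFrame({
    'Address': ['A', 'B', 'C', 'D'],
    'Direction': ['Nam', np.nan, np.nan, np.nan],
})

# isna() gives a boolean frame; averaging each column yields the missing share.
missing_ratios = toy_df.isna().mean() * 100

print(missing_ratios)
```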

#### **The meaning of each column**
Let's look at all the columns of the dataset.

In [102]:
real_estate_for_sale_df.columns.to_list()

['Address',
 'Area',
 'Price',
 'Bedroom',
 'Toilet',
 'Floor',
 'Furniture',
 'Direction',
 'Legal',
 'Posting date',
 'Expiry date',
 'Ad type',
 'Ad code']

- **Address:** the address of the real estate. For apartments, the address can also contain the project that the real estate belongs to.
- **Area:** the area of the real estate.
- **Price:** the price of the real estate. It can be the total price or the price per m².
- **Bedroom:** the number of bedrooms in the real estate.
- **Toilet:** the number of toilets in the real estate.
- **Floor:** the number of floors.
- **Furniture:** the furniture status of the real estate.
- **Direction:** the direction the real estate faces.
- **Legal:** the legal status of the real estate, such as its ownership documents.
- **Posting date:** the day the advertisement was posted.
- **Expiry date:** the day the advertisement expires.
- **Ad type:** the type of advertisement.
- **Ad code:** the code of the advertisement.

#### Data type of each column:

In [103]:
cols_type = real_estate_for_sale_df.dtypes
cols_type

Address         object
Area            object
Price           object
Bedroom         object
Toilet          object
Floor           object
Furniture       object
Direction       object
Legal           object
Posting date    object
Expiry date     object
Ad type         object
Ad code          int64
dtype: object

- Nearly all of the columns have the `object` type, which is not suitable for further analysis. So we need to do some preprocessing on the data types.

### Some preprocessing:

- **Some observations:**
    - `Area`, `Price`, `Bedroom`, `Toilet` and `Floor` should be numerical columns, so we will convert them into numeric data types.

### Preprocessing for numeric columns:
    

#### **Area:**
First, we check whether all the values use the same unit.

In [104]:
area_values = real_estate_for_sale_df[real_estate_for_sale_df['Area'].notna()]['Area'].to_list()

unit_list = []
value_list = []
for area in area_values:
    unit_list.append(area.split(' ')[1])
    value_list.append(area.split(' ')[0])

set(unit_list)


{'m²'}

So all the values in the `Area` column use the same unit. Now we replace the `Area` column with the numeric part of each value, change the data type, and rename the column to make its meaning clear.

In [105]:
# Raw formats handled below: '77,5' (decimal comma) and '5.629,2' (dot as thousands separator)

nonnan_indices = real_estate_for_sale_df['Area'].notna()
real_estate_for_sale_df.loc[nonnan_indices, 'Area'] = value_list
area_list = real_estate_for_sale_df['Area'].dropna().to_list()
cleaned_area_list = []

for area in area_list:
    if (',' in area) and ('.' in area):
        cleaned_area = area.replace('.', '').replace(',', '.')
    elif ',' in area:
        cleaned_area = area.replace(',', '.')
    else:
        cleaned_area = area

    if '.' in area:
        count = len(area.split('.')[-1])
        if count == 3:
            cleaned_area = area.replace('.','')
        

    cleaned_area_list.append(cleaned_area)

real_estate_for_sale_df.loc[nonnan_indices, 'Area'] = cleaned_area_list

real_estate_for_sale_df['Area'] = real_estate_for_sale_df['Area'].astype('float64')
real_estate_for_sale_df.rename(columns={'Area': 'Area(m2)'}, inplace=True)
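
The separator rules used in the loop above can also be written as one small helper, which makes them easier to check on individual values. This is a minimal sketch, not the notebook's own code: the name `parse_vn_number` is hypothetical, and it assumes, like the loop, that a dot followed by exactly three digits is a thousands separator.

```python
def parse_vn_number(text):
    """Convert a number written with Vietnamese separators to a float."""
    text = text.strip()
    if ',' in text:
        # The comma is the decimal mark, so any dot is a thousands separator.
        return float(text.replace('.', '').replace(',', '.'))
    if '.' in text and len(text.split('.')[-1]) == 3:
        # No comma and exactly three digits after the last dot:
        # treat the dots as thousands separators.
        return float(text.replace('.', ''))
    # Plain integer, or a dot that is already a decimal point.
    return float(text)


for raw in ['77,5', '5.629,2', '1.200', '116']:
    print(raw, '->', parse_vn_number(raw))
```

A value such as `1.200` is ambiguous by nature (1200 or 1.2), so the three-digit rule is a heuristic rather than a guarantee.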

#### **Price:**
First, we check whether all the values use the same unit.

In [106]:
price_values = real_estate_for_sale_df['Price'].to_list()

unit_list = []
value_list = []
for price in price_values:
    unit_list.append(price.split(' ')[1])
    value_list.append(price.split(' ')[0])

set(unit_list)

{'nghìn/m²', 'thuận', 'triệu', 'triệu/m²', 'tỷ', 'tỷ/m²'}

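The `Price` column contains various units, so we will convert them all to *tỷ* (billion VND). The `thuận` entry comes from `Thỏa thuận` (negotiable price), which has no numeric value.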

In [107]:
price_list = real_estate_for_sale_df['Price'].to_list()

new_prices = []
for index, value in enumerate(price_list):
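    if 'nghìn/m²' in value:
        area = real_estate_for_sale_df.loc[index, 'Area(m2)']
        # Use a decimal point, as in the other branches, before converting
        price = str(value).split(' ')[0].replace(',', '.')

        # nghìn (thousand) per m² times the area, expressed in tỷ (billion)
        price = (float(price) * area) / 1000000
        new_prices.append((index, price))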

    elif 'triệu' in value:
        price, unit = value.split(' ')
        if unit == 'triệu':
            price = price.replace(',', '.')
            price = float(price) / 1000
            new_prices.append((index, price))
        
        elif unit == 'triệu/m²':
            area = real_estate_for_sale_df.loc[index, 'Area(m2)']
            price = price.replace(',', '.')
            price = (float(price) * area) / 1000
            new_prices.append((index, price))
    elif 'tỷ' in value:
        price, unit = value.split(' ')
        if unit == 'tỷ/m²':
            area = real_estate_for_sale_df.loc[index, 'Area(m2)']
            price = price.replace(',', '.')
            price = (float(price) * area)
            new_prices.append((index, price))
        else:
            price = price.replace(',', '.')
            new_prices.append((index, float(price)))

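# dtype=object keeps the converted prices as floats; a string array would
# store them as text and could silently truncate them
price_column = np.array(price_list, dtype=object)
for index, price in new_prices:
    price_column[index] = price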

real_estate_for_sale_df['Price'] = price_column

real_estate_for_sale_df.rename(columns={'Price': 'Price(tỷ)'}, inplace=True)
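
The unit conversion above can be summarized in a single function, sketched here under a few assumptions: the names `UNIT_TO_TY` and `price_to_ty` are hypothetical, prices always have the form `<amount> <unit>`, and anything with an unknown unit (such as `Thỏa thuận`) is returned as `None` rather than kept as text.

```python
# Multipliers that turn each unit into tỷ (billion VND).
UNIT_TO_TY = {'nghìn': 1e-6, 'triệu': 1e-3, 'tỷ': 1.0}
PER_M2_SUFFIX = '/m²'


def price_to_ty(raw_price, area_m2=None):
    """Return the total price in tỷ, or None when it cannot be computed."""
    parts = raw_price.split(' ')
    if len(parts) != 2:
        return None
    amount_text, unit = parts

    per_m2 = unit.endswith(PER_M2_SUFFIX)
    base_unit = unit[:-len(PER_M2_SUFFIX)] if per_m2 else unit
    if base_unit not in UNIT_TO_TY:
        return None  # e.g. 'Thỏa thuận' (negotiable)

    total = float(amount_text.replace(',', '.')) * UNIT_TO_TY[base_unit]
    if per_m2:
        if area_m2 is None:
            return None  # a per-m² price needs the area
        total *= area_m2
    return total


print(price_to_ty('9,8 tỷ'))
print(price_to_ty('70 triệu/m²', area_m2=108))
print(price_to_ty('Thỏa thuận'))
```

Returning `None` for negotiable prices would leave a numeric column with missing values, which is usually easier to analyze than a column that mixes floats and strings.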

#### **Bedroom:**

In [113]:

bedroom_mask = real_estate_for_sale_df['Bedroom'].notna()
bedroom_values = real_estate_for_sale_df.loc[bedroom_mask, 'Bedroom']

unit_list = []
value_list = []
for bedroom in bedroom_values:
    unit_list.append(bedroom.split(' ')[1])
    value_list.append(bedroom.split(' ')[0])

set(unit_list)


{'phòng'}

In [115]:
real_estate_for_sale_df.loc[bedroom_mask, 'Bedroom'] = value_list
real_estate_for_sale_df['Bedroom'] = real_estate_for_sale_df['Bedroom'].astype('float64')

#### **Toilet:**

In [116]:
toilet_mask = real_estate_for_sale_df['Toilet'].notna()

toilet_values = real_estate_for_sale_df.loc[toilet_mask, 'Toilet']

unit_list = []
value_list = []
for toilet in toilet_values:
    unit_list.append(toilet.split(' ')[1])
    value_list.append(toilet.split(' ')[0])

set(unit_list)

{'phòng'}

In [117]:
real_estate_for_sale_df.loc[toilet_mask, 'Toilet'] = value_list
real_estate_for_sale_df['Toilet'] = real_estate_for_sale_df['Toilet'].astype('float64')

#### **Floor:**

In [118]:
floor_mask = real_estate_for_sale_df['Floor'].notna()

floor_values = real_estate_for_sale_df.loc[floor_mask, 'Floor']

unit_list = []
value_list = []
for floor in floor_values:
    unit_list.append(floor.split(' ')[1])
    value_list.append(floor.split(' ')[0])

set(unit_list)

{'tầng'}

In [119]:
real_estate_for_sale_df.loc[floor_mask, 'Floor'] = value_list
real_estate_for_sale_df['Floor'] = real_estate_for_sale_df['Floor'].astype('float64')
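
`Bedroom`, `Toilet` and `Floor` all follow the same pattern: a count, a space, then a single unit word. Under that assumption the three loops could be replaced by one vectorized helper. The sketch below uses a hypothetical function name, `leading_number`, and toy values.

```python
import numpy as np
import pandas as pd


def leading_number(column):
    """Keep the token before the first space and convert it to a number."""
    # Missing entries pass through the .str methods as NaN.
    return pd.to_numeric(column.str.split(' ').str[0])


# Toy column (hypothetical values) in the same format as 'Bedroom'.
toy_bedrooms = pd.Series(['3 phòng', np.nan, '2 phòng'])
bedroom_counts = leading_number(toy_bedrooms)

print(bedroom_counts)
```

Unlike the loops above, this version does not collect the units, so the unit check should still be done once before relying on it.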

### Preprocessing for categorical columns:

- With the **Address** column, there are some things we can discuss here:
    - The full address is not really meaningful for analysis. Since all the real estate is located in Ho Chi Minh City, we can extract the district of each real estate instead.
    - Also, there are many types of real estate, such as land, detached houses and apartments. Apartments can be recognized by the project they belong to, so we could also extract the project as a separate field.
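
One possible way to extract the project is sketched below. It assumes that a project is always written as the first comma-separated part of the address and starts with the prefix `Dự án ` (as in `Dự án Zenity, Quận 1, Hồ Chí Minh`). The function name `extract_project` is hypothetical.

```python
PROJECT_PREFIX = 'Dự án '


def extract_project(address):
    """Return the project name, or None when the address names no project."""
    first_part = address.split(',')[0].strip()
    if first_part.startswith(PROJECT_PREFIX):
        return first_part[len(PROJECT_PREFIX):]
    return None


print(extract_project('Dự án Zenity, Quận 1, Hồ Chí Minh'))
print(extract_project('Lumiere Riverside, 628, Đường Xa Lộ Hà Nội'))
```

Addresses that name a project without the prefix are not detected by this rule, so the result should be treated as a lower bound on the number of apartments.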

### Extract the district of real estates:

In [133]:
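address_df = real_estate_for_sale_df['Address'].values

district_list = []
for address in address_df:
    split = address.split(',')
    try:
        # The district is the second-to-last comma-separated part
        district = split[-2].strip()
    except IndexError:
        # Address without commas: report it and mark the district as missing
        # instead of reusing the district of the previous row
        print(address)
        split = address.split(' ')
        print(split)
        district = np.nan

    district_list.append(district)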

Đại Học Quốc Gia 245 Đường Gò Cát Phường Phú Hữu Quận 9 Hồ Chí Minh
['Đại', 'Học', 'Quốc', 'Gia', '245', 'Đường', 'Gò', 'Cát', 'Phường', 'Phú', 'Hữu', 'Quận', '9', 'Hồ', 'Chí', 'Minh']
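
The address printed above has no commas, so the second-to-last part does not exist. A fallback based on a regular expression could recover the district in such cases. This is a sketch under stated assumptions: the address ends with `Hồ Chí Minh`, the district starts with `Quận` or `Huyện`, and the names `DISTRICT_PATTERN` and `extract_district` are hypothetical.

```python
import re

# Assumption: the city name closes the address and the district
# starts with 'Quận' or 'Huyện'.
DISTRICT_PATTERN = re.compile(r'((?:Quận|Huyện)\s+.+?)\s+Hồ Chí Minh\s*$')


def extract_district(address):
    """Return the district, or None when it cannot be found."""
    parts = [part.strip() for part in address.split(',')]
    if len(parts) >= 2:
        # Usual case: '..., <district>, <city>'
        return parts[-2]
    # Fallback for addresses written without commas.
    match = DISTRICT_PATTERN.search(address)
    return match.group(1) if match else None


print(extract_district('Dự án Zenity, Quận 1, Hồ Chí Minh'))
print(extract_district(
    'Đại Học Quốc Gia 245 Đường Gò Cát Phường Phú Hữu Quận 9 Hồ Chí Minh'
))
```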


In [None]:
real_estate_for_sale_df['District'] = district_list

real_estate_for_sale_df.head()

Unnamed: 0,Address,Area,Price,Bedroom,Toilet,Floor,Furniture,Direction,Legal,Posting date,Expiry date,Ad type,Ad code,District
0,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",77 m²,"6,1 tỷ",2 phòng,2 phòng,,,,,27/11/2023,07/12/2023,Tin VIP Kim Cương,38645401,Quận 2
1,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",76 m²,"6,2 tỷ",2 phòng,2 phòng,,Đầy đủ,,Hợp đồng mua bán,23/11/2023,30/11/2023,Tin thường,38393009,Quận 2
2,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",72 m²,Thỏa thuận,2 phòng,2 phòng,,,,,28/11/2023,05/12/2023,Tin VIP Kim Cương,38657478,Quận 2
3,"Lumiere Riverside, 628, Đường Xa Lộ Hà Nội, Ph...","77,1 m²","6,6 tỷ",2 phòng,2 phòng,,Cơ bản,,Hợp đồng mua bán,29/11/2023,06/12/2023,Tin VIP Kim Cương,38403632,Quận 2
4,"Dự án The Privia, Đường An Dương Vương, Phường...","67,74 m²","3,5 tỷ",2 phòng,2 phòng,,"Bàn giao cơ bản các thiết bị Châu Âu cao cấp, ...",,HĐMB,29/11/2023,06/12/2023,Tin VIP Kim Cương,38654457,Quận Bình Tân


In [None]:
real_estate_for_sale_df['Direction'].unique()

array([nan, 'Nam', 'Tây - Bắc', 'Đông', 'Đông - Nam', 'Tây', 'Tây - Nam',
       'Đông - Bắc', 'Bắc'], dtype=object)