# Phase 02: Data preparation

After collecting our data, it's time to format and clean the data.

## 1. Import necessary Python modules

Let's import the libraries we will use for cleaning and formatting our data.

In [1]:
import pandas as pd
import numpy as np
import json
import re

# For NLP (furniture data)
#!pip install underthesea
from underthesea import pos_tag

# Define the display of long decimal values
def thousands_formatter(x):
    if isinstance(x, str):
        return x
    return "{:,}".format(int(x))
pd.set_option('display.float_format', thousands_formatter)
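
As a quick check of what this formatter does, here is a minimal sketch that re-declares `thousands_formatter` from the cell above and applies it to a few made-up values. Note that `int(x)` truncates the fractional part, so floats are displayed as whole numbers.

```python
def thousands_formatter(x):
    # Strings pass through unchanged; numbers are truncated to int and comma-grouped
    if isinstance(x, str):
        return x
    return "{:,}".format(int(x))

# Illustrative values only (not taken from the dataset)
print(thousands_formatter(17000000.0))
print(thousands_formatter(0.9))
print(thousands_formatter("n/a"))
```

This truncation is worth keeping in mind when reading small percentages later in the notebook: anything below 1 is displayed as 0.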

Let's import the data we have collected.

In [2]:
rent_data = pd.read_csv('../Data/real_estate_for_rent.csv', index_col=0)
rent_data.head()

Unnamed: 0,Address,Rent type,Meta data,Post date
0,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",Căn hộ chung cư,"{'Diện tích': '50 m²', 'Mức giá': '17 triệu/th...",07/12/2023
1,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",Căn hộ chung cư,"{'Diện tích': '50 m²', 'Mức giá': '17 triệu/th...",05/12/2023
2,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",Căn hộ chung cư,"{'Diện tích': '77 m²', 'Mức giá': '23 triệu/th...",02/12/2023
3,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",Căn hộ chung cư,"{'Diện tích': '50 m²', 'Mức giá': '17 triệu/th...",02/12/2023
4,"Dự án Lumiere Riverside, Đường Xa Lộ Hà Nội, P...",Căn hộ chung cư,"{'Diện tích': '52 m²', 'Mức giá': '16,9 triệu/...",05/12/2023


Now let's check the columns in our dataset again.

In [3]:
rent_data.columns

Index(['Address', 'Rent type', 'Meta data', 'Post date'], dtype='object')

Because we passed `index_col=0`, the CSV's unnamed index column was read as the DataFrame index, so only the four data columns remain.

## 2. Unpacking our dataset

Our dataset has only 4 columns (for now):
- `Address`: the address of the rental property.
- `Rent type`: the type of rental property.
- `Meta data`: the additional data that comes with the rent posting.
- `Post date`: the date the rent posting was published.

The `Meta data` column holds additional details about each rent posting, which we will need for our analysis. We are going to unpack the information in the `Meta data` column.

### 2.1 Unpacking `Meta data` column

Many kinds of metadata are stored in the `Meta data` column. A good way to unpack them is to convert each row's metadata into a Series indexed by field name, concatenate those Series into a dataframe, and then transpose the result so that each posting becomes a row.

For now, let's process `Meta data` separately from the rest of our dataset.

In [4]:
import ast

# Build one Series per row by parsing the Meta data string (a dict literal).
# ast.literal_eval only accepts Python literals, so it is safer than eval here.
df_list = []
for idx in range(len(rent_data['Meta data'])):
    new_df = pd.Series(ast.literal_eval(rent_data.loc[idx, 'Meta data']))
    df_list.append(new_df)

# Concatenate all the row Series as columns
metadata_col = pd.concat(df_list, axis=1)

# Transpose so that each posting is a row
metadata_col = metadata_col.T
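
The same unpacking can be sketched more compactly by parsing each string with `ast.literal_eval` (a standard-library function that only accepts Python literals) and passing the resulting dicts to `pd.DataFrame`. This is an alternative sketch rather than the notebook's own code, and the two sample strings are illustrative values modelled on the preview above.

```python
import ast

import pandas as pd

# Illustrative Meta data strings (assumed shape: a dict literal of field -> text)
sample = pd.Series([
    "{'Diện tích': '50 m²', 'Mức giá': '17 triệu/tháng'}",
    "{'Diện tích': '77 m²', 'Mức giá': '23 triệu/tháng', 'Số phòng ngủ': '2 phòng'}",
])

# Parse each string into a dict, then let pandas align the keys into columns
parsed = sample.apply(ast.literal_eval)
unpacked = pd.DataFrame(parsed.tolist(), index=sample.index)
print(unpacked)
```

Fields that a posting does not provide simply end up as missing values in the corresponding column, so no transpose step is needed.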

Let's see what new information we have gained from our `Meta data` column.

In [5]:
metadata_col.head()

Unnamed: 0,Diện tích,Mức giá,Số phòng ngủ,Số toilet,Nội thất,Hướng nhà,Hướng ban công,Pháp lý,Mặt tiền,Đường vào,...,Sį» phĆ²ng ngį»§,Sį» toilet,PhĆ”p lĆ½,Nį»i thįŗ„t,Mß║Ęt tiß╗ün,Sß╗æ tß║¦ng,HŲ░ß╗øng nh├Ā,HŲ░ß╗øng ban c├┤ng,Ph├Īp l├Į,Nß╗Öi thß║źt
0,50 m²,17 triệu/tháng,1 phòng,1 phòng,,,,,,,...,,,,,,,,,,
1,50 m²,17 triệu/tháng,1 phòng,1 phòng,,,,,,,...,,,,,,,,,,
2,77 m²,23 triệu/tháng,2 phòng,2 phòng,,,,,,,...,,,,,,,,,,
3,50 m²,17 triệu/tháng,1 phòng,1 phòng,,,,,,,...,,,,,,,,,,
4,52 m²,"16,9 triệu/tháng",1 phòng,1 phòng,Đầy đủ nội thất cơ bản,,,,,,...,,,,,,,,,,


At first glance, we can see the basic data about a rental: the area of the place, the rent amount, the number of bedrooms, the number of toilets, and the furniture provided. There is also some additional information, such as the direction of the house, the direction of the balcony, the entry length, and the number of floors.

Let's view all the columns of `metadata_col` to see all the data we are getting.

In [6]:
metadata_col.columns

Index(['Diện tích', 'Mức giá', 'Số phòng ngủ', 'Số toilet', 'Nội thất',
       'Hướng nhà', 'Hướng ban công', 'Pháp lý', 'Mặt tiền', 'Đường vào',
       'Số tầng', 'Diß╗ćn t├Łch', 'Mß╗®c gi├Ī', 'Sß╗æ ph├▓ng ngß╗¦',
       'Sß╗æ toilet', 'Diį»n tĆ­ch', 'Mį»©c giĆ”', 'Sį» phĆ²ng ngį»§',
       'Sį» toilet', 'PhĆ”p lĆ½', 'Nį»i thįŗ„t', 'Mß║Ęt tiß╗ün',
       'Sß╗æ tß║¦ng', 'HŲ░ß╗øng nh├Ā', 'HŲ░ß╗øng ban c├┤ng', 'Ph├Īp l├Į',
       'Nß╗Öi thß║źt'],
      dtype='object')

We can see that the columns after `Số tầng` are all gibberish: they look like mis-encoded copies of the Vietnamese column names. Let's drop those columns.

In [7]:
metadata_col = metadata_col.iloc[:, :11]

# renaming the column name to English
metadata_col.rename(columns={
    'Diện tích': 'Area', 
    'Mức giá' : 'Price', 
    'Số phòng ngủ' : 'Bedrooms', 
    'Số toilet': 'Toilets', 
    'Nội thất': 'Furniture', 
    'Hướng nhà' : 'House direction', 
    'Hướng ban công' : 'Balcony direction', 
    'Pháp lý' : 'Legality', 
    'Mặt tiền' : 'Frontage', 
    'Đường vào' : 'Entry length', 
    'Số tầng' : 'Floors'
}, inplace= True)

Now, our `metadata_col` has 11 columns:
- Area
- Price 
- Bedrooms 
- Toilets 
- Furniture 
- House direction 
- Balcony direction 
- Legality
- Frontage 
- Entry length 
- Floors

It's time to add the new data to our dataset.

In [8]:
# Add extracted meta data to our dataset
final_df = pd.concat([rent_data, metadata_col], axis=1)

# Don't forget to drop our Meta data column
final_df.drop(['Meta data'], axis = 1, inplace=True)

## 3. Data overview

### 3.1 Validating data columns

In [9]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40599 entries, 0 to 40598
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Address            40599 non-null  object
 1   Rent type          40599 non-null  object
 2   Post date          40599 non-null  object
 3   Area               40573 non-null  object
 4   Price              40589 non-null  object
 5   Bedrooms           19581 non-null  object
 6   Toilets            21085 non-null  object
 7   Furniture          15382 non-null  object
 8   House direction    2880 non-null   object
 9   Balcony direction  2228 non-null   object
 10  Legality           4893 non-null   object
 11  Frontage           3529 non-null   object
 12  Entry length       2598 non-null   object
 13  Floors             7488 non-null   object
dtypes: object(14)
memory usage: 5.7+ MB


Notice that there are a lot of missing values in `House direction`, `Balcony direction`, `Legality`, `Frontage`, and `Entry length`. The meaning of these columns is also ambiguous, especially for rental properties. Therefore, we are going to eliminate them from our analysis.

In [10]:
final_df.drop(['House direction', 'Balcony direction', 'Legality', 'Frontage', 'Entry length'], axis=1, inplace=True)

### 3.2 Shape of dataset

In [11]:
final_df.shape

(40599, 9)

We have 9 columns and 40599 rows in our dataset. That's quite a lot of data to work with.

### 3.3 Meaning of rows 

Each row is a record of a rent posting for a specific type of real estate property.

#### 3.3.1 Duplicated rows

Let's check for the number of duplicated rows in our dataset.

In [12]:
final_df.duplicated().sum()

1912

We have 1912 duplicated rows in our dataset. This could be because the same seller (rent poster) reposted a listing to promote it, or because of an error in our collection phase. Either way, let's drop the duplicated rows.

In [13]:
# Drop duplicated rows and reset the index
final_df = final_df.drop_duplicates().reset_index(drop=True)

# Recheck duplication in our dataset
print('Number of duplicated rows in our data:', final_df.duplicated().sum())
print('Number of rows we are left with:', final_df.shape[0])

Number of duplicated rows in our data: 0
Number of rows we are left with: 38687
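
To make the behaviour of `duplicated()` and `drop_duplicates()` concrete, here is a small sketch on a made-up three-row frame (the values are hypothetical, not rows from our dataset). A row counts as a duplicate only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical postings: rows 0 and 1 are identical in every column
toy = pd.DataFrame({
    'Rent type': ['Căn hộ chung cư', 'Căn hộ chung cư', 'Nhà riêng'],
    'Area': ['50 m²', '50 m²', '80 m²'],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence; reset_index renumbers the rows
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```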



### 3.4 Meaning of columns

Let's take a closer look at our dataset's columns.

In [14]:
final_df.columns

Index(['Address', 'Rent type', 'Post date', 'Area', 'Price', 'Bedrooms',
       'Toilets', 'Furniture', 'Floors'],
      dtype='object')

There are 9 columns in our dataset:
- `Address`: Address of the real estate property. 
- `Rent type`: The category of real estate property.
- `Post date`: The day that the advertisement is posted. 
- `Area`: The area of the real estate.
- `Price`: The price of the real estate. This could be the total price (accounted for area and time duration), price per duration or even price per area. 
- `Bedrooms`: The number of bedrooms in the property.
- `Toilets`: The number of toilets in the property. 
- `Furniture`: The furniture status of the real estate. 
- `Floors`: The number of floors in the real estate.

#### 3.4.1 Columns data types

In [15]:
final_df.dtypes

Address      object
Rent type    object
Post date    object
Area         object
Price        object
Bedrooms     object
Toilets      object
Furniture    object
Floors       object
dtype: object

We can see that every column has the `object` dtype, even though several columns should have different data types. We will come back and convert them to the correct data types later in our analysis.

#### 3.4.2 Ratio of missing values

In [16]:
def missing_ratio(column):
    missing_values = column.isnull().sum()
    total_values = len(column)
    return (missing_values / total_values) * 100

missing_ratios_df = final_df.agg(missing_ratio).to_frame()
missing_ratios_df.columns = ['Missing ratio']

missing_ratios_df

Unnamed: 0,Missing ratio
Address,0
Rent type,0
Post date,0
Area,0
Price,0
Bedrooms,50
Toilets,46
Furniture,60
Floors,80


We can see that the fields `Address`, `Rent type`, and `Post date` have no missing data, and `Area` and `Price` have almost none: the `info()` output above shows only a handful of missing values for them, which the whole-number display formatter shows as 0. This is easy to explain because these are the compulsory fields of a rent posting and are shared by all `Rent type` values.

In stark contrast, the fields `Bedrooms`, `Toilets`, `Furniture` and `Floors` have a lot of missing values. These fields are more specific than the columns mentioned above: they are not compulsory for all `Rent type` values, and they are more common for some `Rent type` values than for others.
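
As a side note, the same percentages can be computed without a helper function, because the mean of a boolean mask is the share of `True` values. The sketch below uses a made-up frame (the values are illustrative) and is an alternative to `missing_ratio`, not a replacement the notebook depends on.

```python
import pandas as pd

# Hypothetical frame: 1 of 4 Area values and 3 of 4 Floors values are missing
toy = pd.DataFrame({
    'Area': ['50 m²', '77 m²', None, '52 m²'],
    'Floors': [None, None, None, '2 tầng'],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```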

Now that we have an overview of the dataset, let's start preprocessing.

## 4. Preprocessing for categorical columns

### 4.1 Address

#### 4.1.1 Checking if address of rent posting is in Ho Chi Minh city

Our data is supposed to cover only Ho Chi Minh City. Let's go over the dataset to check that this holds and make adjustments where it doesn't.

In [17]:
# Take the last comma-separated element of each address as the province,
# then count the distinct values to see whether every posting is in Ho Chi Minh City
def get_province(x):
    return x[-1]

final_df['Province'] = final_df['Address'].str.split(',').apply(get_province)
final_df['Province'].value_counts()

Province
 Hồ Chí Minh                                                                    37646
 Hồ Chí Minh.                                                                     975
 Tp Hồ Chí Minh                                                                    36
 Hß╗ō Ch├Ł Minh                                                                     9
                                                                                    4
 Hồ Chí 100                                                                         1
 Huyện Nhà Bè.                                                                      1
 Quận 1                                                                             1
Đường N9                                                                            1
 Thành phố Hồ Chí Minh.                                                             1
2tr                                                                                 1
 Thủ Đức                                     

We see that almost all values are in Ho Chi Minh City, but some are written in a different style and some don't include the province at all. Because these outlier values are very few, and because we will need to extract the district from `Address` later, we can drop them all and keep only the rows whose value corresponds to one of these spellings of the city name:
1. "Hồ Chí Minh"
2. "Hồ Chí Minh."
3. "Tp Hồ Chí Minh"

Together, these make up almost all of our dataset.

In [18]:
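# .copy() makes the filtered frame independent, so adding columns later does not trigger SettingWithCopyWarning
final_df = final_df[final_df['Province'].isin([' Hồ Chí Minh', ' Hồ Chí Minh.', ' Tp Hồ Chí Minh'])].copy()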
final_df['Province'].value_counts()

Province
 Hồ Chí Minh       37646
 Hồ Chí Minh.        975
 Tp Hồ Chí Minh       36
Name: count, dtype: int64
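
An alternative to listing every accepted spelling is to normalize the province text first and then compare against a single value. The sketch below shows the idea on a few made-up addresses (the street names are placeholders); it is not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical addresses covering the three accepted spellings plus one bad row
addresses = pd.Series([
    'Đường A, Quận 1, Hồ Chí Minh',
    'Đường B, Quận 7, Hồ Chí Minh.',
    'Đường C, Bình Thạnh, Tp Hồ Chí Minh',
    'Đường D, Quận 1',
])

province = (
    addresses.str.split(',').str[-1]          # last comma-separated element
    .str.strip()                              # drop the leading space
    .str.rstrip('.')                          # drop a trailing full stop
    .str.replace(r'^Tp\s+', '', regex=True)   # drop the "Tp" (city) prefix
)
is_hcm = province == 'Hồ Chí Minh'
print(is_hcm.tolist())
```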

#### 4.1.2 Extracting district from posting address

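Next, let's extract the `District` for each posting. Ho Chi Minh City currently has [22 district-level units](https://en.wikipedia.org/wiki/Ho_Chi_Minh_City#Administration): 5 rural districts, 16 urban districts, and Thủ Đức city, which was formed by merging District 2, District 9, and the former Thủ Đức district. The lists below show the districts as they were before the merger:

**5 rural districts:**
- Củ Chi 
- Hóc Môn
- Bình Chánh
- Nhà Bè
- Cần Giờ

**19 urban districts (before the Thủ Đức merger):**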
- District 1
- District 2 (currently Thủ Đức city)
- District 3 
- District 4
- District 5 
- District 6 
- District 7 
- District 8 
- District 9 (currently Thủ Đức city) 
- District 10
- District 11
- District 12
- Gò Vấp
- Tân Bình
- Tân Phú
- Bình Thạnh
- Phú Nhuận
- Bình Tân
- Thủ Đức (currently Thủ Đức city)

In this analysis, we will treat District 2 and District 9 as they were before merging into Thủ Đức city, and we will also treat Thủ Đức as a district. The merger is still recent, and the geographical differences between the former districts should be more useful for analysis.

Next, let's extract the district from the address.

In [19]:
def get_district(x):
    return x[-2]

# Get the districts of the address
final_df['District'] = final_df['Address'].str.split(',').apply(get_district)

# Convert them to lowercase and strip surrounding white space
final_df['District'] = final_df['District'].str.strip().str.lower()

# Strip the title words (quận, huyện, thành phố, q.)
final_df['District'] = final_df['District'].str.replace('quận ', '').str.replace('huyện ', '').str.replace('thành phố ', '').str.replace('q. ', '')

# Now we have a new column District
final_df['District'].unique()

array(['2', '1', 'bình thạnh', '7', 'gò vấp', 'tân bình', '9',
       'bình chánh', 'tân phú', 'phú nhuận', '4', 'bình tân', 'thủ đức',
       '10', '6', '3', '5', '8', '12', 'nhà bè', 'củ chi', '11',
       'hóc môn'], dtype=object)
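
The chained `str.replace` calls above remove the title words wherever they occur in the string. As a hedged alternative, a single regular expression anchored at the start of the string only strips a leading title. The addresses below are made up for illustration.

```python
import pandas as pd

# Hypothetical addresses using the four title styles handled above
addresses = pd.Series([
    'Đường A, Quận 1, Hồ Chí Minh',
    'Đường B, Q. 7, Hồ Chí Minh',
    'Đường C, Huyện Nhà Bè, Hồ Chí Minh',
    'Đường D, Thành phố Thủ Đức, Hồ Chí Minh',
])

district = (
    addresses.str.split(',').str[-2]   # second-to-last element is the district
    .str.strip()
    .str.lower()
    # ^ anchors the match, so only a leading title word is removed
    .str.replace(r'^(quận|huyện|thành phố|q\.)\s*', '', regex=True)
)
print(district.tolist())
```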

Finally, let's drop the `Province` and the `Address` column.

In [20]:
final_df.drop(['Address', 'Province'], axis=1, inplace=True)

### 4.2 Post date

Let's change the data type for the `Post date` column to datetime.

In [21]:
final_df['Post date'].unique()

array(['07/12/2023', '05/12/2023', '02/12/2023', '01/12/2023',
       '06/12/2023', '04/12/2023', '08/12/2023', '27/11/2023',
       '24/11/2023', '30/11/2023', '03/12/2023', '23/11/2023',
       '29/11/2023', '28/11/2023', '25/11/2023', '29/11/20230',
       '26/11/2023', '21/11/2023', '16/11/2023', '13/11/2023',
       '19/11/2023', '12/11/2023', '10/11/2023', '15/11/2023',
       '18/11/2023', '17/11/2023', '14/11/2023', '08/11/2023',
       '22/11/2023', '09/11/2023', '20/11/2023', '11/11/2023',
       '09/12/2023'], dtype=object)

We see that there is 1 mistake in the data. The date should be `29/11/2023` instead of `29/11/20230`. Let's fix the value and then change the data type of the column.

In [22]:
final_df.loc[final_df['Post date'] == '29/11/20230', 'Post date'] = '29/11/2023'

# Format datetime data
final_df['Post date'] = pd.to_datetime(final_df['Post date'], format = '%d/%m/%Y')
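
Instead of scanning `unique()` by eye, malformed dates can also be located programmatically: with `errors='coerce'`, `pd.to_datetime` returns `NaT` for any value that does not match the format. A minimal sketch on three sample strings taken from the output above:

```python
import pandas as pd

dates = pd.Series(['07/12/2023', '29/11/20230', '09/12/2023'])

# Values that do not match %d/%m/%Y exactly become NaT instead of raising
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Select the original strings that failed to parse
malformed = dates[parsed.isna()]
print(malformed.tolist())
```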

In [23]:
# Check on the column data
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38657 entries, 0 to 38686
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Rent type  38657 non-null  object        
 1   Post date  38657 non-null  datetime64[ns]
 2   Area       38641 non-null  object        
 3   Price      38657 non-null  object        
 4   Bedrooms   19240 non-null  object        
 5   Toilets    20717 non-null  object        
 6   Furniture  15173 non-null  object        
 7   Floors     7419 non-null   object        
 8   District   38657 non-null  object        
dtypes: datetime64[ns](1), object(8)
memory usage: 2.9+ MB


### 4.3 Furniture 

Preparing the `Furniture` data is going to be more difficult, because the values are free text and not standardized. After some consideration, I've decided to use NLP to find keywords related to the categories I want to sort them into. The categories are:
- Basic furniture: The property has the basics of a room (lighting, fan), a toilet and a refrigerator, but no bed frame. All the furniture is wall-mounted.
- Full furniture: The property has all the furniture of a bedroom (bed frame, drapes...) and a fully equipped kitchen.
- No furniture: There is no furniture provided.
- Has furniture: Falls in the middle of Basic furniture and No furniture.

For NLP, I'm using the [Underthesea library](https://underthesea.readthedocs.io/en/latest/readme.html), an NLP toolkit for processing Vietnamese.

In [24]:
# Preparing keywords for each category
basic_furniture = {'items': set(['tủ lạnh', 'quạt', 'đèn', 'toilet']), 'tag': set(['cơ bản', 'mới', 'cần thiết', 'căn bản', 'dính'])}
full_furniture = {'items' : set(['máy hút mùi', 'máy giặt','hút mùi', 'tủ giày', 'kệ bếp', 'full', '100%']), 
                  'tag' : set(['đầy đủ', 'full', 'đủ', '%', 'ful', 'full nội thất', 'cao cấp', 'tối đa', 'cao', '100', 'hoàn thiện'])}
no_furniture = {'items': set(['fitout', 'nhà thô', 'thô chủ']), 'tag': set(['không', 'k', 'ko', 'trống', 'thô', 'hỗ trợ', 'nhà thô', 'ít'])}

In [25]:
def furniture_transform(x):
    if pd.isna(x): 
        return 'không'
    
    x = x.lower().replace('.', '').strip()
    
    tag = pos_tag(x)

    # Split the tagged words into nouns ('N') and all other parts of speech
    noun_tag = {word for word, pos in tag if pos == 'N'}
    adj_tag = {word for word, pos in tag if pos != 'N'}

    if set(noun_tag).intersection(no_furniture['items']) or set(adj_tag).intersection(no_furniture['tag']):
        return 'không'
    elif set(noun_tag).intersection(basic_furniture['items']) or set(adj_tag).intersection(basic_furniture['tag']):
        return 'cơ bản'
    elif set(noun_tag).intersection(full_furniture['items']) or set(adj_tag).intersection(full_furniture['tag']):
        return 'đầy đủ'
    else:
        return 'có'

# Categorizing the furniture
final_df['Furniture'] = final_df['Furniture'].apply(furniture_transform)
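
To illustrate the priority order of the rules (no furniture, then basic, then full, then the `'có'` fallback) without installing Underthesea, here is a simplified sketch that swaps the POS tagging for plain substring matching. It is a different, cruder technique than the cell above, the `RULES` table is a hypothetical subset of the keyword sets, and it will not reproduce the notebook's results exactly.

```python
import pandas as pd

# Hypothetical, reduced keyword table, checked in the same priority order as above
RULES = [
    ('không', ['không', 'nhà thô', 'trống']),
    ('cơ bản', ['cơ bản', 'căn bản']),
    ('đầy đủ', ['đầy đủ', 'full', 'cao cấp']),
]

def classify_furniture(text):
    # Missing descriptions are treated as "no furniture", as in the cell above
    if pd.isna(text):
        return 'không'
    text = text.lower().replace('.', '').strip()
    for label, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return label
    # Nothing matched: the posting mentions furniture without further detail
    return 'có'

print(classify_furniture('Đầy đủ nội thất cơ bản'))
```

Because the basic rule is checked before the full rule, a description that mentions both is classified as basic.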

## 5. Preprocessing for numerical data

### 5.1 Area

For the `Area` data, we are going to extract the numerical values. Because the only unit used is the square meter, there's no need to convert between units.

In [26]:
def check_metric(x):
    return x[1]

# Check which units are used in the Area column
final_df[final_df['Area'].notnull()]['Area'].str.split(' ').apply(check_metric).unique()

array(['m²'], dtype=object)

In [27]:
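# Extract numerical value: drop the unit and the '.' thousands separators,
# then turn the decimal comma into a decimal point
final_df['Area'] = final_df['Area'].str.replace(' m²', '', regex=False).str.replace('.', '', regex=False)
final_df['Area'] = final_df['Area'].str.replace(',', '.', regex=False)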

We see that there are some null values in the `Area` column. We will drop those rows because there are only a few missing values and `Area` is needed for analysis.


In [28]:
# Drop missing values
final_df.dropna(subset=['Area'], inplace=True)

# Convert to numerical type
final_df['Area'] = final_df['Area'].astype(np.float64)
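
The whole `Area` conversion can be seen end to end on a few made-up strings. The sketch assumes, as the cells above do, that `.` is a thousands separator and `,` is the decimal separator; the sample values are illustrative.

```python
import pandas as pd

# Hypothetical raw values in the same "<number> m²" format
raw_area = pd.Series(['50 m²', '1.200 m²', '52,5 m²'])

area = (
    raw_area.str.replace(' m²', '', regex=False)   # drop the unit
    .str.replace('.', '', regex=False)             # drop thousands separators
    .str.replace(',', '.', regex=False)            # decimal comma -> decimal point
    .astype(float)
)
print(area.tolist())
```

Passing `regex=False` makes it explicit that `.` is meant literally rather than as a regular-expression wildcard.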

### 5.2 Price

#### 5.2.1 Standardize price values

For the price data, the comma `,` is used as the decimal separator. We are going to convert those values into floats.

Another issue is the variety of units used to express the rent. Let's see all the units used in the dataset.

In [29]:
final_df['Price'].str.split(' ').apply(lambda x: x[1]).unique()

array(['triệu/tháng', 'thuận', 'nghìn/m²', 'tỷ/tháng', 'nghìn/tháng',
       'triệu/m²'], dtype=object)

As we can see, there are many units used. We can classify them into 3 types (the third is not a unit but an undetermined value: `thuận` is the second word of `Thỏa thuận`, rent by agreement):
1. Monthly based unit:
- triệu/tháng
- nghìn/tháng
- tỷ/tháng

2. Area based unit:
- nghìn/m²
- triệu/m²

3. Other:
- Thỏa thuận

For monthly based units, we will calculate the total money due in a month (in VND).
For area based units, we will assume that rent is due every month. With that, we will convert the per-m² price to VND and multiply it by `Area` to get the total monthly rent.
For payments settled by agreement, we will treat the price as unknown and fill it with a NaN value.
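
The three rules above can be sketched as one small function. This is a condensed illustration rather than the notebook's implementation: `monthly_rent_vnd` and `UNIT_MULTIPLIER` are hypothetical names, and the function assumes every price string has the form `<amount> <unit>/<basis>` or is `Thỏa thuận`.

```python
# Multipliers for the Vietnamese magnitude words (thousand, million, billion)
UNIT_MULTIPLIER = {'nghìn': 1_000, 'triệu': 1_000_000, 'tỷ': 1_000_000_000}

def monthly_rent_vnd(price, area):
    """Return the monthly rent in VND, or None for rent by agreement."""
    amount_text, _, unit_text = price.partition(' ')
    # 'Thỏa thuận' has no '<unit>/<basis>' part, so the price is unknown
    if '/' not in unit_text:
        return None
    unit, basis = unit_text.split('/')
    amount = float(amount_text.replace(',', '.')) * UNIT_MULTIPLIER[unit]
    # Area based prices are per m², so scale by the property's area
    return amount * area if basis == 'm²' else amount

print(monthly_rent_vnd('17 triệu/tháng', 50))
print(monthly_rent_vnd('300 nghìn/m²', 50))
print(monthly_rent_vnd('Thỏa thuận', 50))
```

For example, a price of 300 nghìn/m² on a 50 m² property works out to 300 × 1,000 × 50 = 15,000,000 VND per month.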

In [30]:
# Convert keyword dict
monthly_rent_trans_dict = {
    '/tháng': '',
    'nghìn': '000',
    'triệu': '000000',
    'tỷ': '000000000',
    ' ': '',
    ',':'.'
}

# Calculate rent for monthly based and area based units
def monthly_rent_transform(x):
    if '/tháng' in x:
        x = x.replace('/tháng','')
        price = float(x.replace(',','.').split(' ')[0])

        if 'nghìn' in x:
            price *= 1000
        elif 'triệu' in x:
            price *= 1000000
        elif 'tỷ' in x:
            price *= 1000000000
        return str(price)
    elif 'm' in x:
        price = float(x.replace(',','.').split(' ')[0])

        if 'nghìn' in x:
            price *= 1000
        elif 'triệu' in x:
            price *= 1000000
        elif 'tỷ' in x:
            price *= 1000000000
        
        x = str(price) + '/m²'
    return x
    
final_df['Price'] = final_df['Price'].apply(monthly_rent_transform)

Then, we will finish calculating the monthly rent for area based prices by multiplying the value calculated above by the area of the property.

In [31]:
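def area_monthly_rent_transform(row):
    # Area based prices look like '<amount>/m²': multiply the amount by the area
    if 'm' in row['Price']:
        return float(row['Price'].split('/')[0]) * row['Area']
    # Monthly prices and 'Thỏa thuận' are returned unchanged
    return row['Price']

final_df['Price'] = final_df.apply(area_monthly_rent_transform, axis=1)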

#### 5.2.2 Handling non-numerical values (`Thỏa thuận`)

Finally, we will turn rent by agreement into NaN. But before doing that, we will drop all the rows with a missing `Price`. There aren't many of them, and we want the missing values in this column to represent only `Thỏa thuận`. This will come in handy later, when we attempt to estimate the price of properties listed as by agreement.

In [32]:
# Dropping missing values
final_df.dropna(subset=['Price'], inplace=True)

In [33]:
def agreement_rent_transform(x):
    if x == 'Thỏa thuận':
        x = np.nan
    return x

# Make rent by agreement into nan
final_df['Price'] = final_df['Price'].apply(agreement_rent_transform)

In [34]:
# Normalize the datatype to numerical type
final_df['Price'] = final_df['Price'].astype(np.float64)
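
The last two cells can also be combined into a single step with `pd.to_numeric(..., errors='coerce')`, which converts anything non-numeric to NaN. The sketch below uses made-up values; note that this shortcut would silently turn any other unexpected string into NaN as well, which the explicit `Thỏa thuận` check above avoids.

```python
import pandas as pd

# Hypothetical Price column after the unit conversion: strings, floats and 'Thỏa thuận'
prices = pd.Series(['17000000.0', 'Thỏa thuận', 15000000.0], dtype=object)

# Non-numeric entries become NaN; everything else becomes float64
numeric = pd.to_numeric(prices, errors='coerce')
print(numeric)
```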

### 5.3 Bedrooms, Toilets and Floors data

For the `Bedrooms`, `Toilets` and `Floors` data, the process is straightforward. We will extract the numerical value from each string and convert the column to a numerical data type.

In [35]:
final_df['Bedrooms'] = final_df['Bedrooms'].str.replace('phòng', '').str.strip().astype(np.float64)
final_df['Toilets'] = final_df['Toilets'].str.replace('phòng', '').str.strip().astype(np.float64)
final_df['Floors'] = final_df['Floors'].str.replace(' tầng', '').str.strip().astype(np.float64)
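
A variation on the same idea is to pull out the digits with a regular expression instead of removing the unit word, which works regardless of the word that follows the number. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical raw values; None stands for a posting without this field
bedrooms = pd.Series(['1 phòng', '2 phòng', None])

# Capture the first run of digits; missing entries stay missing
bedrooms_num = bedrooms.str.extract(r'(\d+)', expand=False).astype(float)
print(bedrooms_num.tolist())
```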

## 6. Save processed data

Let's save our processed data in a new file.

In [36]:
export_file = '../Data/cleaned_real_estate_for_rent.csv'
final_df.to_csv(export_file)
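
One thing to keep in mind for the next phase: CSV files do not store data types, so `Post date` comes back as text unless it is parsed again on load. The sketch below demonstrates the round trip on a made-up two-row frame, using an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical cleaned rows with a datetime column
toy = pd.DataFrame({
    'Post date': pd.to_datetime(['07/12/2023', '05/12/2023'], format='%d/%m/%Y'),
    'Price': [17000000.0, 16900000.0],
})

# Write to an in-memory CSV, then read it back
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; parse_dates restores the datetime dtype
restored = pd.read_csv(buffer, index_col=0, parse_dates=['Post date'])
print(restored.dtypes)
```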

And we're done with phase 02. Next up, we are going to ask some questions and hopefully answer them the best we can.