# <div style=" text-align: center; font-size: 50px; font-weight: bold">Phase 01: Collecting data </div>

## **Import necessary Python modules**

In [1]:
import pandas as pd 
from bs4 import BeautifulSoup
# !pip install selenium
from selenium import webdriver
# !pip install cloudscraper
import cloudscraper
import re

## **Collect real estate for sale data:**
In this section, we collect data on real estate listed for sale from this website: https://batdongsan.com.vn/. This website contains most of the real estate advertisements (for rent or for sale) in Viet Nam. Within the scope of this project, we focus on real estate for sale located in Ho Chi Minh City.

Also, because of the high density of advertisements on this website, the collected data is concentrated in a short period (roughly the final months of 2023). As a result, it cannot give a full picture of how the real estate market in Ho Chi Minh City changes over time. With that caveat noted, let's start collecting the data.

### **1. Some initials:**

Looking through the website, I identified several attributes that are common to most listings, so we treat them as the features to collect.

Now, create an empty DataFrame to store the listing data. It contains the following features: `Address`, `Type`, `Area`, `Price`, `Bedroom`, `Toilet`, `Floor`, `Furniture`, `Direction`, `Legal`, `Posting date`, `Expiry date`, `Ad type`, `Ad code`.

In [2]:
fields = [ 'Address', 'Type', 'Area', 'Price', 'Bedroom', 'Toilet', 'Floor', 'Furniture', 'Direction','Legal', 'Posting date', 'Expiry date', 'Ad type', 'Ad code']

# Create an empty DataFrame with these fields
df = pd.DataFrame(columns = fields)

### **2. Collect the links to each real estate webpage:**

The links are collected from multiple pages, each containing numerous links to individual listings. These links are stored in a file named `real_estate_for_sale_links.txt`.

For this task, we can use `Selenium` to access the website and crawl data from it. The link to the listing page and the file name are provided below:

In [3]:
file_name = "../Data/real_estate_for_sale_links.txt"
url = 'https://batdongsan.com.vn/nha-dat-ban-tp-hcm'

We move from the first page to the last page to collect the links, so the first step is to find the number of the last page.

In [5]:
def GetTotalPage(url):

    # Create an instance of the driver and always close it, even on errors
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()

    # Locate the pagination bar
    unprocessed_num_page = soup.find(class_ = 're__pagination-group')
    if unprocessed_num_page is None:
        raise ValueError('Pagination group not found on ' + url)

    # Regex to find all page numbers currently shown; non-numeric tokens are skipped.
    regex = r'\d[\d.]*'
    matches = re.findall(regex, unprocessed_num_page.text)
    # Dots are removed before converting each number to int
    page_numbers = [int(match.replace('.', '')) for match in matches]

    # The largest number shown is the last page
    total_page = max(page_numbers)

    return total_page
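
The number-extraction step can be tried without opening a browser. The sketch below is a minimal illustration, assuming the pagination text resembles the sample string (which is made up, not copied from the site) and that dots act as thousands separators; `parse_last_page` is a hypothetical helper name.

```python
import re


def parse_last_page(pagination_text):
    """Return the largest page number found in the pagination text."""
    # Match runs of digits that may contain dots used as thousands separators
    tokens = re.findall(r'\d[\d.]*', pagination_text)
    # Remove the separators before converting to int
    numbers = [int(token.replace('.', '')) for token in tokens]
    if not numbers:
        raise ValueError('No page number found')
    return max(numbers)


# Made-up sample resembling a pagination bar
sample = '1 2 3 4 5 ... 3.268'
last_page = parse_last_page(sample)
print(last_page)
```

Raising an explicit error when no number is found makes a layout change on the site easier to spot than a bare `max()` failure.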

Process the result to get the last page.

In [7]:
# Get the total number of pages on the website.
total_page = GetTotalPage(url)

Function to collect the link to each listing:

In [6]:
def CollectRealEstateForSaleLink(url, page):
    with open(file_name, 'w') as file:
        # Access every page
        for i in range(1, page + 1):
            driver = webdriver.Chrome()
            try:
                driver.get(url + '/p' + str(i))
                soup = BeautifulSoup(driver.page_source, 'html.parser')
            finally:
                # Close the browser even if the page fails to load
                driver.quit()

            apartment_list = soup.find_all("a", class_ = "js__product-link-for-product-id")

            # Add all links to the file
            for apartment in apartment_list:
                # Strip any leading slash so the link has no double '//'
                link = 'https://batdongsan.com.vn/' + apartment['href'].lstrip('/')
                file.write(link)
                file.write('\n')
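
Joining the base URL and `href` by plain string concatenation can produce a double slash, and the same listing may appear on more than one page. A minimal sketch of one alternative, assuming the `href` values are site-relative paths; the example paths and the helper name `build_links` are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = 'https://batdongsan.com.vn/'


def build_links(hrefs):
    """Turn relative hrefs into absolute URLs, dropping duplicates in order."""
    absolute = (urljoin(BASE_URL, href) for href in hrefs)
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(absolute))


# Hypothetical hrefs, not real listings
hrefs = ['/ban-nha-rieng-pr1', 'ban-dat-pr2', '/ban-nha-rieng-pr1']
links = build_links(hrefs)
for link in links:
    print(link)
```

`urljoin` handles both `href` forms (with and without a leading slash), so the result does not depend on how the site writes its links.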

In [None]:
CollectRealEstateForSaleLink (url, total_page)

Exception ignored in: <function Service.__del__ at 0x000001B0CA03C680>
Traceback (most recent call last):
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 200, in __del__
    self.stop()
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 157, in stop
    self.send_remote_shutdown_command()
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 137, in send_remote_shutdown_command
    request.urlopen(f"{self.service_url}/shutdown")
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 536, in _open
    result = self._call_chain(self.handle_open

Up to now, we have collected the links to all listing pages and stored them in a text file.

Read the `real_estate_for_sale_links.txt` file and store all the links in a list:

In [4]:
urls = []
with open(file_name, 'r') as file:
    for line in file:
        # Drop the trailing newline so each entry is a clean URL
        urls.append(line.strip())

In [5]:
len(urls)

65358

### **3. Access each link and collect the data:**

The website is in Vietnamese, so for consistency I convert the field names to English using the dictionary below:

In [6]:
key_map = {
    'Địa chỉ:' : 'Address',
    'Diện tích': 'Area',
    'Mức giá': 'Price',
    'Hướng nhà': 'Direction',
    'Số tầng': 'Floor',
    'Số phòng ngủ': 'Bedroom',
    'Số toilet': 'Toilet',
    'Pháp lý': 'Legal',
    'Nội thất': 'Furniture',
    'Ngày đăng': 'Posting date',
    'Ngày hết hạn': 'Expiry date',
    'Loại tin': 'Ad type',
    'Mã tin': 'Ad code'
}
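
Applying such a map amounts to renaming dictionary keys while leaving unknown keys untouched. A small sketch with a trimmed copy of the map and a made-up record (the values are illustrative only, and `translate_keys` is a hypothetical helper):

```python
# Trimmed copy of the translation map, for illustration only
key_map = {
    'Diện tích': 'Area',
    'Mức giá': 'Price',
}


def translate_keys(record, mapping):
    """Return a new dict whose keys are renamed according to mapping."""
    # Keys missing from the mapping are kept unchanged
    return {mapping.get(key, key): value for key, value in record.items()}


# Made-up record as it might be scraped from a page
raw = {'Diện tích': '60 m²', 'Mức giá': '5 tỷ', 'Mặt tiền': '4 m'}
translated = translate_keys(raw, key_map)
print(translated)
```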

Also, looking through the website, I see that there are many different types of real estate. This will be an important feature for our future analysis. The issue is that the data of each listing does not state its type directly; the type is hidden behind a type code. Fortunately, I have found all of the codes and stored them in the dictionary below:

In [7]:
sale_type = ['Căn hộ chung cư', 'Nhà riêng', 'Nhà biệt thự, liền kề', 'Nhà mặt phố', 'Shophouse nhà phố thương mại', 'Đất nền dự án', 'Đất bán', 'Trang trại, khu nghỉ dưỡng', 'Kho, nhà xưởng','Bất động sản khác','Condotel']
sale_type_dict = {
    '324': sale_type[0], # Apartment
    '41': sale_type[1], # Private house
    '325': sale_type[2], # Villa, townhouse
    '163': sale_type[3], # Street-front house
    '575': sale_type[4], # Shophouse
    '40': sale_type[5], # Project land
    '283': sale_type[6], # Land for sale
    '44': sale_type[7], # Farm, resort
    '45': sale_type[8], # Warehouse, factory
    '48': sale_type[9], # Other real estate
    '562': sale_type[10], # Condotel
}
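
Indexing a dictionary directly raises `KeyError` when a code is not listed. A hedged sketch of a lookup with a fallback; `resolve_type` and the `'Unknown'` label are assumptions, and only two codes are copied from the dictionary above:

```python
# Two entries copied from sale_type_dict, for illustration
sale_type_dict = {
    '324': 'Căn hộ chung cư',
    '41': 'Nhà riêng',
}


def resolve_type(code, default='Unknown'):
    """Map a type code to its name, falling back to a default label."""
    # str() lets the caller pass the code as either text or an integer
    return sale_type_dict.get(str(code), default)


print(resolve_type('324'))
print(resolve_type(41))
print(resolve_type('999'))
```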


Now, let's define the functions needed for the scraping process:

- Function to extract the features from a link:

In [8]:
def ExtractDataFromLinkWithSelenium(url):
    # Create a dictionary to store the features
    features = {}
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Get the address
        address = soup.find(class_ = "re__pr-short-description js__pr-address")

        # If the address cannot be found, the page is invalid, so ignore it.
        if address is None:
            return features

        # Get the type: collect the numbers in the first non-empty script tag
        regex = r'[\d]+'
        found = []
        scripts = soup.find_all('script', type = 'text/javascript')
        for item in scripts:
            if item.text:
                found = re.findall(regex, item.text)
                break

        features['Address'] = address.text
        # The fourth number is the type code; the type stays empty if the code is missing or unknown
        if len(found) > 3:
            features['Type'] = sale_type_dict.get(found[3])

        # Get all the attributes of the listing
        feature_items = soup.find_all('div', class_ = 're__pr-specs-content-item')
        for item in feature_items:
            title = item.find(class_ = 're__pr-specs-content-item-title').text.strip()
            value = item.find(class_ = 're__pr-specs-content-item-value').text.strip()
            features[title] = value

        # Get the posting date, expiry date, ad type and ad code
        date_items = soup.find_all('div', class_ = "re__pr-short-info-item js__pr-config-item")
        for item in date_items:
            title = item.find('span', class_ = 'title').text.strip()
            value = item.find('span', class_ = 'value').text.strip()
            features[title] = value

        # Rename the Vietnamese keys to the English keys defined in key_map
        for key in list(features.keys()):
            if key in key_map:
                features[key_map[key]] = features.pop(key)
    finally:
        driver.quit()

    return features

Iterate over the `urls` list and extract the data, which is stored in the previously defined DataFrame:

In [15]:
for url in urls[9000:10000]:
    features = ExtractDataFromLinkWithSelenium(url)
    if features:
        # Align the scraped features with the predefined columns
        new_row = pd.Series(features, index = fields)
        new_row_df = pd.DataFrame([new_row])
        df = pd.concat([df, new_row_df], ignore_index = True)
        # Append the row to the CSV right away so progress survives an interruption
        new_row_df.to_csv('../Data/real_estate_for_sale.csv', mode='a', header=False, index=False, encoding='utf-8')
    else:
        print("Error")


Error
Error


Exception ignored in: <function Service.__del__ at 0x000002701A314680>
Traceback (most recent call last):
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 200, in __del__
    self.stop()
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 157, in stop
    self.send_remote_shutdown_command()
  File "c:\Users\bkphu\anaconda3\Lib\site-packages\selenium\webdriver\common\service.py", line 137, in send_remote_shutdown_command
    request.urlopen(f"{self.service_url}/shutdown")
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\bkphu\anaconda3\Lib\urllib\request.py", line 536, in _open
    result = self._call_chain(self.handle_open

KeyboardInterrupt: 

In [17]:
df

Unnamed: 0,Address,Type,Area,Price,Bedroom,Toilet,Floor,Furniture,Direction,Legal,Posting date,Expiry date,Ad type,Ad code
0,"Phường Trường Thọ, Thủ Đức, Hồ Chí Minh",Nhà riêng,"60,3 m²","8,5 tỷ",4 phòng,,4 tầng,,,Sổ đỏ/ Sổ hồng.,04/05/2025,14/05/2025,Tin thường,42743075
1,"Đường Phan Xích Long, Phường 3, Bình Thạnh, Hồ...",Nhà mặt phố,48 m²,17 tỷ,5 phòng,,4 tầng,,Đông,Sổ đỏ/ Sổ hồng,04/05/2025,14/05/2025,Tin thường,42626253
2,"Đường Quang Trung, Phường 10, Gò Vấp, Hồ Chí Minh",Nhà riêng,64 m²,"4,3 tỷ",3 phòng,,3 tầng,Cơ bản,,Sổ đỏ/ Sổ hồng,28/04/2025,08/05/2025,Tin thường,42526147
3,"Đường Vườn Lài, Phường An Phú Đông, Quận 12, H...",Đất bán,59 m²,4 tỷ,,,,,Nam,Sổ đỏ/ Sổ hồng,04/05/2025,14/05/2025,Tin thường,42877465
4,"Đường Trần Xuân Soạn, Phường Tân Hưng, Quận 7...","Nhà biệt thự, liền kề",233 m²,"23,5 tỷ",6 phòng,,4 tầng,Đầy đủ,,Sổ đỏ/ Sổ hồng,04/05/2025,14/05/2025,Tin thường,42877818
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3596,"Đường Phạm Hữu Lầu, Phường Phú Mỹ, Quận 7, Hồ ...",Nhà riêng,"110,2 m²","8,9 tỷ",4 phòng,,3 tầng,Đầy đủ,,Sổ đỏ/ Sổ hồng,02/05/2025,12/05/2025,Tin thường,41851869
3597,"Đường Trường Chinh, Phường 14, Tân Bình, Hồ Ch...",Nhà riêng,93 m²,"8,45 tỷ",2 phòng,,2 tầng,,,Sổ đỏ/ Sổ hồng,05/05/2025,15/05/2025,Tin thường,42718429
3598,"Đường Vũ Tùng, Phường 1, Bình Thạnh, Hồ Chí Minh",Nhà riêng,54 m²,Thỏa thuận,4 phòng,,3 tầng,,,Sổ đỏ/ Sổ hồng,04/05/2025,14/05/2025,Tin thường,41965782
3599,"Đường Số 6, Phường Tăng Nhơn Phú B, Quận 9, Hồ...",Nhà riêng,48 m²,"4,39 tỷ",3 phòng,,3 tầng,,,Sổ đỏ/ Sổ hồng,22/04/2025,07/05/2025,Tin thường,42792779


### **4. Store the collected data to csv file:**

Finally, store the data in a `csv` file to finish the process.

In [14]:
df.to_csv('../Data/real_estate_for_sale.csv', index = False)
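
When rows are written one at a time, the standard library `csv.DictWriter` is one way to keep the column order fixed and tolerate missing or extra keys. The sketch below is an illustration under stated assumptions, not the notebook's method: it writes to an in-memory buffer instead of a file, uses a shortened field list, and the row is made up.

```python
import csv
import io

# Shortened field list for illustration
fields = ['Address', 'Type', 'Area', 'Price']

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.DictWriter(
    buffer,
    fieldnames=fields,
    restval='',             # fill fields missing from a row with an empty string
    extrasaction='ignore',  # silently drop keys that are not in fields
)
writer.writeheader()

# Made-up row: 'Price' is missing and 'Mặt tiền' is not a known field
writer.writerow({'Address': 'Quận 7', 'Type': 'Nhà riêng', 'Area': '80 m²', 'Mặt tiền': '4 m'})

csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, opening it with `newline=''` is the usual recommendation for the `csv` module so that line endings are not doubled on Windows.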

In [14]:
real_estate_df = pd.read_csv('../Data/real_estate_for_sale.csv')
real_estate_df

Unnamed: 0,Address,Type,Area,Price,Bedroom,Toilet,Floor,Furniture,Direction,Legal,Posting date,Expiry date,Ad type,Ad code
0,"Dự án Vinhomes Grand Park, Phường Long Thạnh M...",Căn hộ chung cư,47 m²,"1,9 tỷ",2 phòng,,,,,Sổ đỏ/ Sổ hồng,02/05/2025,09/05/2025,Tin VIP Kim Cương,42438234
1,"Dự án Vinhomes Grand Park, Phường Long Thạnh M...",Căn hộ chung cư,47 m²,"1,9 tỷ",2 phòng,,,,,Sổ đỏ/ Sổ hồng,02/05/2025,09/05/2025,Tin VIP Kim Cương,42438234
2,"Dự án Vinhomes Grand Park, Phường Long Thạnh M...",Căn hộ chung cư,47 m²,"1,9 tỷ",2 phòng,,,,,Sổ đỏ/ Sổ hồng,02/05/2025,09/05/2025,Tin VIP Kim Cương,42438234
3,Dự án The Beverly Solari - Vinhomes Grand Par...,Căn hộ chung cư,90 m²,Thỏa thuận,3 phòng,,,,,,24/04/2025,09/05/2025,Tin VIP Kim Cương,42818062
4,"Dự án Lumiere Boulevard, Phường Long Bình, Quậ...",Căn hộ chung cư,75 m²,"4,5 tỷ",2 phòng,,,,,Hợp đồng mua bán,04/05/2025,04/05/2025,Tin thường,42581902
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7752,"Đường Huỳnh Tấn Phát, Phường Phú Thuận, Quận 7...",Nhà riêng,80 m²,"7,5 tỷ",4 phòng,,4 tầng,Đầy đủ,,Sổ đỏ/ Sổ hồng,03/05/2025,13/05/2025,Tin thường,40883158
7753,"Đường Huỳnh Tấn Phát, Phường Phú Mỹ, Quận 7, H...",Nhà riêng,107 m²,"10,5 tỷ",4 phòng,,4 tầng,Đầy đủ,,Sổ đỏ/ Sổ hồng,03/05/2025,13/05/2025,Tin thường,40900034
7754,"Đường Phạm Hữu Lầu, Phường Phú Mỹ, Quận 7, Hồ ...",Nhà riêng,"110,2 m²","8,9 tỷ",4 phòng,,3 tầng,Đầy đủ,,Sổ đỏ/ Sổ hồng,02/05/2025,12/05/2025,Tin thường,41851869
7755,"Đường Lê Văn Khương, Phường Thới An, Quận 12, ...",Nhà mặt phố,130 m²,"11,2 tỷ",2 phòng,,,,,Sổ đỏ/ Sổ hồng,06/05/2025,16/05/2025,Tin thường,42832569


In [None]:
real_estate_df.shape