# <div style=" text-align: center; font-size: 50px; font-weight: bold">Phase 01: Collecting data </div>

## **Import necessary Python modules**

In [1]:
import pandas as pd 
from bs4 import BeautifulSoup
from selenium import webdriver
import cloudscraper
import re

## **Collect real estate for sale data:**
In this section, we collect data on real estate for sale from this website: https://batdongsan.com.vn/. This website hosts most of the real estate advertisements (for rent or for sale) in Vietnam. Within the scope of this project, we focus on real estate for sale located in Ho Chi Minh City.

Because advertisements are posted on this website at a high rate, the collected data is concentrated in a short period (the final months of 2023). It therefore cannot give a full picture of how the real estate market in Ho Chi Minh City changes over time. With that limitation in mind, let's start collecting the data.

### **1. Initial setup:**

Looking through the website, I identified several attributes that most listings share, so we treat them as the features to collect.

Now, create an empty dataframe to store the data of the real estate. The dataframe will contain some features like: `Address`, `Type`, `Area`, `Price`, `Bedroom`, `Toilet`, `Floor`, `Furniture`, `Direction`, `Legal`, `Posting date`, `Expiry date`, `Ad type`, `Ad code`.

In [2]:
fields = [ 'Address', 'Type', 'Area', 'Price', 'Bedroom', 'Toilet', 'Floor', 'Furniture', 'Direction','Legal', 'Posting date', 'Expiry date', 'Ad type', 'Ad code']

# Create an empty data frame with these fields
df = pd.DataFrame(columns = fields)

### **2. Collect the links to each real estate webpage:**

The links are collected from multiple pages, with each page containing numerous links to listings. These links are stored in a file named `real_estate_for_sale_links.txt`.

For this task, we can use `Selenium` to access the website and crawl data from it. The link to the homepage and the file name are provided below:

In [3]:
file_name = "../Data/real_estate_for_sale_links.txt"
url = 'https://batdongsan.com.vn/nha-dat-ban-tp-hcm'

We move from the first page to the last page to collect the links, so the first step is to find the number of the last page.

In [1]:
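def GetTotalPage(url):

    # Create an instance of the driver and load the listing page
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()

    # Locate the pagination bar
    unprocessed_num_page = soup.find(class_ = 're__pagination-group')
    if unprocessed_num_page is None:
        raise ValueError('Pagination group not found on ' + url)

    # Keep only digit runs ('.' is the thousands separator) so that
    # non-numeric tokens cannot break the int() conversion
    regex = r'\d[\d.]*'
    matches = re.findall(regex, unprocessed_num_page.text)
    page_numbers = [int(match.replace('.', '')) for match in matches]

    # The largest number shown in the pagination bar is the last page
    total_page = max(page_numbers)

    return total_page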

Call the function to get the number of the last page.

In [None]:
total_page = GetTotalPage(url)
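
The parsing step inside `GetTotalPage` can be tried without launching a browser. The sketch below is only an illustration: `parse_total_page` is a hypothetical helper, and `sample_text` is made-up text shaped like a pagination bar, not text copied from the website. It assumes that `.` is used as a thousands separator.

```python
import re

def parse_total_page(pagination_text):
    """Return the largest page number found in the pagination text."""
    # Match digit runs that may contain '.' as a thousands separator
    tokens = re.findall(r'\d[\d.]*', pagination_text)
    numbers = [int(token.replace('.', '')) for token in tokens]
    return max(numbers)

# Hypothetical pagination text (an assumption, not real site output)
sample_text = "1 2 3 4 5 ... 2.514"
print(parse_total_page(sample_text))  # prints 2514
```

Matching digits explicitly means that a token such as `...` is skipped instead of being passed to `int()`.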

Function to collect the links to each real estate:

In [None]:
def CollectRealEstateForSaleLink (url, page):
    with open (file_name, 'w') as file:
        # Access every page
        for i in range(1, page + 1):
            driver = webdriver.Chrome()
            try:
                driver.get(url+'/p'+ str(i))
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                apartment_list = soup.find_all("a", class_= "js__product-link-for-product-id")

                # Add all links to the file
                for apartment in apartment_list:
                    link = 'https://batdongsan.com.vn/'+ apartment['href']
                    file.write(link)
                    file.write('\n')
            finally:
                # Close the browser even if the page fails to load
                driver.quit()

In [None]:
CollectRealEstateForSaleLink (url, total_page)
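
When building absolute links, plain string concatenation can produce a double slash if an `href` already starts with `/`. A possible alternative is `urllib.parse.urljoin` from the standard library. The `href` values below are hypothetical and only illustrate the two shapes a relative link can take.

```python
from urllib.parse import urljoin

base = 'https://batdongsan.com.vn/'

# Hypothetical href values (assumptions, not real listings)
hrefs = ['/ban-nha-rieng/pr123', 'ban-dat/pr456']

# urljoin resolves each href against the base URL
links = [urljoin(base, href) for href in hrefs]
for link in links:
    print(link)
# prints https://batdongsan.com.vn/ban-nha-rieng/pr123
# prints https://batdongsan.com.vn/ban-dat/pr456
```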

So far, we have collected the links to all real estate pages and stored them in a text file.

Read the `real_estate_for_sale_links.txt` file and store all the links in a list:

In [None]:
urls = []
with open(file_name, 'r') as file:
    for line in file.readlines():
        # Drop the trailing newline so each entry is a clean URL
        urls.append(line.strip())

### **3. Access each link and collect the data:**

The website is in Vietnamese, so for consistency I convert the field names to English using the dictionary below:

In [4]:
key_map = {
    'Địa chỉ:' : 'Address',
    'Diện tích': 'Area',
    'Mức giá': 'Price',
    'Hướng nhà': 'Direction',
    'Số tầng': 'Floor',
    'Số phòng ngủ': 'Bedroom',
    'Số toilet': 'Toilet',
    'Pháp lý': 'Legal',
    'Nội thất': 'Furniture',
    'Ngày đăng': 'Posting date',
    'Ngày hết hạn': 'Expiry date',
    'Loại tin': 'Ad type',
    'Mã tin': 'Ad code'
}
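
As a quick illustration of how such a mapping is applied, the sketch below renames the keys of one record and drops any label that is not in the mapping. It uses a shortened copy of the dictionary (`key_map_subset`) so that it runs on its own, and the `raw` record, including the unmapped label `Mặt tiền`, is a hypothetical example.

```python
# A shortened copy of the mapping, so the sketch is self-contained
key_map_subset = {'Diện tích': 'Area', 'Mức giá': 'Price', 'Mã tin': 'Ad code'}

# Hypothetical scraped record; 'Mặt tiền' is not in the mapping
raw = {'Diện tích': '68 m²', 'Mức giá': '5 tỷ', 'Mặt tiền': '4 m'}

# Keep only known labels and rename them to English
renamed = {key_map_subset[k]: v for k, v in raw.items() if k in key_map_subset}
print(renamed)  # prints {'Area': '68 m²', 'Price': '5 tỷ'}
```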

Looking through the website, I also see that there are many different types of real estate. The type will be an important feature for our later analysis. The problem is that a listing's data does not state its type directly; the type is hidden behind a type code. Fortunately, I found all of the codes and stored them in the dictionary below:

In [5]:
sale_type = ['Căn hộ chung cư', 'Nhà riêng', 'Nhà biệt thự, liền kề', 'Nhà mặt phố', 'Shophouse nhà phố thương mại', 'Đất nền dự án', 'Đất bán', 'Trang trại, khu nghỉ dưỡng', 'Kho, nhà xưởng','Bất động sản khác']
sale_type_dict = {
    '324': sale_type[0], # Chung cư
    '41': sale_type[1], # Nhà riêng
    '325': sale_type[2], # biệt thự
    '163': sale_type[3], # Nhà mặt phố
    '575': sale_type[4], # Shophouse
    '40': sale_type[5], # Đất nền
    '283': sale_type[6], # Bán đất
    '44': sale_type[7], # Trang trại
    '45': sale_type[8], # kho
    '48': sale_type[9], # Khác
}
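
A page could carry a code that is not in the dictionary, in which case a direct lookup raises `KeyError`. The sketch below shows one possible way to handle that with `dict.get`. The helper `lookup_type`, the shortened dictionary `sale_type_subset`, and the choice of `'Bất động sản khác'` as the fallback are all assumptions made for illustration.

```python
# A shortened copy of the code-to-type dictionary, so the sketch is self-contained
sale_type_subset = {'324': 'Căn hộ chung cư', '41': 'Nhà riêng'}

def lookup_type(code, default='Bất động sản khác'):
    """Return the type for a code, or a fallback for unknown codes."""
    return sale_type_subset.get(code, default)

print(lookup_type('41'))   # prints Nhà riêng
print(lookup_type('999'))  # prints Bất động sản khác
```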


Now, let's define the functions needed for the scraping process:

- Function to extract the features from a link:

In [None]:
def ExtractDataFromLinkWithSelenium(url):
    driver = webdriver.Chrome()
    # Create a dictionary to store the features
    features = {}
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Get the address
        address = soup.find(class_ = "re__pr-short-description js__pr-address")

        # Get the type: collect the numbers in the first non-empty script tag
        regex = r'[\d]+'
        found = []
        script_tags = soup.find_all('script', type= 'text/javascript')
        for item in script_tags:
            if (item.text):
                found = re.findall(regex, item.text)
                break

        # If the address or a known type code cannot be found, the data is invalid; ignore it.
        if address is None or len(found) < 4 or found[3] not in sale_type_dict:
            return features
        features['Address'] = address.text
        features['Type'] = sale_type_dict[found[3]]

        # Get all the features about the attributes of each real estate
        feature_items = soup.find_all('div', class_ = 're__pr-specs-content-item')
        for item in feature_items:
            title = item.find(class_ = 're__pr-specs-content-item-title').text.strip()
            value = item.find(class_ = 're__pr-specs-content-item-value').text.strip()
            features[title] = value

        # Get the features about the posting date, expiry date and the ad code
        date_items = soup.find_all( 'div', class_ = "re__pr-short-info-item js__pr-config-item")
        for item in date_items:
            title = item.find('span', class_ = 'title').text.strip()
            value = item.find('span',class_ = 'value').text.strip()
            features[title] = value

        # Convert all the keys to the defined keys based on the dictionary.
        keys_to_change = list(features.keys())
        for key in keys_to_change:
            if key in key_map:
                features[key_map[key]] = features[key]
    finally:
        # Always close the browser, even if parsing fails
        driver.quit()

    return features

Iterate over the `urls` list and extract the data, storing each result in the previously defined DataFrame:

In [None]:
for url in urls[ 1184 : 2000]:
    # Strip any trailing newline left over from reading the links file
    features = ExtractDataFromLinkWithSelenium(url.strip())
    if features:
        new_row = pd.Series(features, index = fields)
        new_row_df = pd.DataFrame([new_row])
        df = pd.concat([df, new_row_df], ignore_index = True)
        # new_row_df.to_csv('../Data/real_estate_for_sale.csv', mode='a', header=False, index=False, encoding='utf-8')
    else:
        print ("Error")
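
Calling `pd.concat` once per row copies the whole DataFrame every time, which gets slower as the data grows. A possible alternative, sketched below, is to collect the records in a plain list and build the DataFrame once at the end. The `scraped` records and the shortened `fields_sketch` list are made-up values used only for illustration.

```python
import pandas as pd

# Shortened field list (an assumption for this sketch)
fields_sketch = ['Address', 'Type', 'Area', 'Price']

# Hypothetical results; an empty dict stands for a page that failed
scraped = [
    {'Address': 'A', 'Type': 'Nhà riêng', 'Area': '68 m²', 'Price': '5 tỷ'},
    {'Address': 'B', 'Type': 'Nhà riêng', 'Area': '70 m²'},
    {},
]

# Keep only the non-empty records, then build the frame in one step
rows = [record for record in scraped if record]
df_sketch = pd.DataFrame(rows, columns=fields_sketch)
print(df_sketch.shape)
```

Fields missing from a record become `NaN`, which matches what `pd.Series(features, index = fields)` does in the loop above.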


### **Note:**
After many attempts to collect the data, I found that Selenium takes too long to run, which makes the collection process a challenge. So I switched to another library to collect the data: `CloudScraper`. In the next part, I handle the collection process using `CloudScraper`.

### **Collect data with CloudScraper:**

Now, repeat the process above using `CloudScraper`:

In [None]:
url = 'https://batdongsan.com.vn/nha-dat-ban-tp-hcm'
url_list = []

scraper = cloudscraper.create_scraper(delay=10, browser="chrome") 
content = scraper.get(url)

soup = BeautifulSoup(content.text, 'html.parser')
num_of_page = soup.find_all('a', class_='-numberre__pagination')

num_of_page = int(num_of_page[-1]['pid'])

In [None]:
# Get url list in first page
a_tags_list = soup.find_all('a', class_='js__product-link-for-product-id')
for a_tag in a_tags_list:
    url_list.append('https://batdongsan.com.vn/' + a_tag.get('href'))
url_list

In [None]:
page = '/p'
# Pages 2..num_of_page (the first page was handled above)
for page_number in range(2, int(num_of_page) + 1):
    content = scraper.get(url + page + str(page_number)).text
    soup = BeautifulSoup(content, 'html.parser')
    a_tags_list = soup.find_all('a', class_='js__product-link-for-product-id')
    for a_tag in a_tags_list:
        url_list.append('https://batdongsan.com.vn/' + a_tag.get('href'))

In [None]:
# Same location that the links are read from later on
file_name = '../Data/real_estate_for_sale_links.txt'

with open(file_name, 'w') as file:
    for line in url_list:
        file.write(line + "\n")

In [6]:
def ExtractDataFromLinkWithCloudscraper(url, scraper, headers):

    content = scraper.get(url, timeout=30, headers=headers)

    content.encoding = 'utf-8'
    soup = BeautifulSoup(content.text, 'html.parser')

    # Raw features keyed by the Vietnamese labels, and the final English-keyed result
    features = {}
    final_features = {}

    # The type code is the fourth number in the first script tag
    regex = r'[\d]+'
    script_tag = soup.find('script', type='text/javascript')
    found = re.findall(regex, script_tag.text) if script_tag else []

    # Pages use one of two layouts, each with its own class names
    address = soup.find(class_ = "re__pr-short-description js__pr-address")
    if address is not None:
        date_tag, date_class = 'div', "re__pr-short-info-item js__pr-config-item"
        title_class, value_class = 'title', 'value'
    else:
        address = soup.find(class_ = 're__address-title js__product-address')
        date_tag, date_class = 'ul', "re__product-info"
        title_class, value_class = 're__sp1', 're__sp3'

    # Without an address or a known type code, the page is treated as invalid
    if address is None or len(found) < 4 or found[3] not in sale_type_dict:
        return final_features

    features['Địa chỉ:'] = address.text.strip()
    final_features['Type'] = sale_type_dict[found[3]]

    # Get all the features about the attributes of the real estate
    feature_items = soup.find_all('div', class_ = 're__pr-specs-content-item')
    for item in feature_items:
        title = item.find(class_ = 're__pr-specs-content-item-title').text.strip()
        value = item.find(class_ = 're__pr-specs-content-item-value').text.strip()
        features[title] = value

    # Get the posting date, expiry date, ad type and ad code
    date_items = soup.find_all(date_tag, class_ = date_class)
    for item in date_items:
        title = item.find('span', class_ = title_class).text.strip()
        value = item.find('span', class_ = value_class).text.strip()
        features[title] = value

    # Keep only the known keys, renamed to English
    for key in features:
        if key in key_map:
            final_features[key_map[key]] = features[key]
    return final_features
   

In [7]:
urls = []
with open('../Data/real_estate_for_sale_links.txt', 'r') as file:
    for line in file.readlines():
        # Each entry keeps its trailing newline; it is removed when the link is used
        urls.append(line)

In [8]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
scraper = cloudscraper.create_scraper(delay=10, browser="chrome")
for url1 in urls[54723:]:
    # Strip the trailing newline before requesting the page
    features = ExtractDataFromLinkWithCloudscraper(url1.strip(), scraper, headers)
    print(features)
    if features:
        new_row = pd.Series(features, index = fields)
        new_row_df = pd.DataFrame([new_row])
        df = pd.concat([df, new_row_df], ignore_index = True)
        # new_row_df.to_csv('../Data/real_estate_for_sale.csv', mode='a', header=False, index=False, encoding='utf-8')
    else:
        print ("error")
     

{'Type': 'Nhà riêng', 'Address': 'Đường Nguyễn Sơn, Phường Phú Thạnh, Tân Phú, Hồ Chí Minh', 'Area': '68 m²', 'Price': '5 tỷ', 'Floor': '5 tầng', 'Bedroom': '11 phòng', 'Posting date': '16/11/2023', 'Expiry date': '16/12/2023', 'Ad type': 'Tin thường', 'Ad code': '38573211'}
{'Type': 'Nhà riêng', 'Address': 'Đường Âu Cơ, Phường Tân Sơn Nhì, Tân Phú, Hồ Chí Minh', 'Area': '70 m²', 'Price': 'Thỏa thuận', 'Posting date': '16/11/2023', 'Expiry date': '16/12/2023', 'Ad type': 'Tin thường', 'Ad code': '38572891'}
{'Type': 'Nhà mặt phố', 'Address': 'Dự án Camellia Garden, Đường Nguyễn Văn Linh, Xã Bình Hưng, Bình Chánh, Hồ Chí Minh', 'Area': '74,1 m²', 'Price': '6,6 tỷ', 'Direction': 'Đông', 'Floor': '2 tầng', 'Bedroom': '3 phòng', 'Toilet': '2 phòng', 'Legal': 'Hợp đồng mua bán', 'Furniture': 'Đầy đủ', 'Posting date': '16/11/2023', 'Expiry date': '16/12/2023', 'Ad type': 'Tin thường', 'Ad code': '38572046'}
{'Type': 'Nhà riêng', 'Address': 'Phường Bình Trưng Đông, Quận 2, Hồ Chí Minh', 'Area

So far, we have successfully collected the data: more than 50,000 samples, which took me nearly 12 hours in total. As you can see, I split the collection process into ranges of links to keep track of the results and to avoid putting too much pressure on my computer.
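
Splitting the links into ranges by hand can also be generated programmatically. The sketch below is a minimal illustration: `batch_ranges` is a hypothetical helper, and the total and batch size are arbitrary example values. Each `(start, stop)` pair can be used as a slice such as `urls[start:stop]`.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# Example values only: 10 links in batches of 4
print(list(batch_ranges(10, 4)))  # prints [(0, 4), (4, 8), (8, 10)]
```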

### **4. Store the collected data to csv file:**

Finally, just store the data into a `csv` file and finish the process.

In [None]:
df.to_csv('../Data/real_estate_for_sale.csv', index = False)
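
Since a full run takes many hours, another option is to append each row to the CSV file as soon as it is scraped, which is the idea behind the commented-out `to_csv` line in the loops above. The sketch below shows that pattern under stated assumptions: it writes to a temporary file rather than the project's data folder, and the `records` and shortened `fields_sketch` list are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Shortened field list and a temporary path (assumptions for this sketch)
fields_sketch = ['Address', 'Price']
path = os.path.join(tempfile.mkdtemp(), 'checkpoint.csv')

# Write the header once, before any row is appended
pd.DataFrame(columns=fields_sketch).to_csv(path, index=False, encoding='utf-8')

# Hypothetical scraped records
records = [{'Address': 'A', 'Price': '5 tỷ'}, {'Address': 'B'}]
for record in records:
    row = pd.DataFrame([record], columns=fields_sketch)
    # mode='a' appends; header=False avoids repeating the column names
    row.to_csv(path, mode='a', header=False, index=False, encoding='utf-8')

restored = pd.read_csv(path)
print(restored.shape)
```

With this pattern, an interrupted run loses at most the row that was being processed.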

In [None]:
real_estate_df = pd.read_csv('../Data/real_estate_for_sale.csv')

In [None]:
real_estate_df.shape