<!-- Name: Pham Le Tu Nhi -->
<!-- Student ID: 21120308 -->

# **Phase 01: Data collection** 

## Part 02: Collect real estate for rent data

We'll be collecting data from [this website](https://batdongsan.com.vn). It's a popular website for real estate exchange (buying and renting) in Vietnam.

A word of caution: the data pulled from this website won't give us a full view of Ho Chi Minh City's real estate market for many reasons, mostly because the market still depends heavily on word of mouth (introductions from acquaintances), and there are also other popular digital marketplaces such as Facebook groups.

Still, looking into the data is worthwhile and can give us a rough estimate of the market. Plus, it's a good practice exercise for our final project in this course.

So let's collect our data.

First, let's import all the necessary libraries.

In [166]:
# Scraped date: 7/12/2023
import numpy as np
import pandas as pd 
import random

# !pip install cloudscraper
import cloudscraper
from bs4 import BeautifulSoup as bs

import re
import time

Now let's prepare our scraper. But first, let's talk about the legality of our web scraping.

During our research of the website, we found no clear Terms of Service (or Terms of Use) that apply to people who access the website without an account, and no `Disallow` rules or blocked agents listed in `robots.txt`. Because we are also using the collected data for educational purposes, we concluded that it is legal to collect and analyze the data from this website.
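
As a sketch of how the `robots.txt` check could be automated, Python's standard `urllib.robotparser` module can evaluate a URL against a set of rules. The rules below are a made-up sample for illustration, not the actual contents of the website's `robots.txt`, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. To check the real file,
# one would call parser.set_url(...) and parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


listing_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/nha-dat-cho-thue-tp-hcm")
private_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page")
print(listing_allowed, private_allowed)
```

Running the same check against the live file before each scraping session would catch any rules added after this notebook was written.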

During our data collection, we identified that the website uses Cloudguard technology to detect and prevent non-human agents from accessing and collecting data. To bypass this, we're going to use the Cloudscraper library and some tricks, mainly changing the headers of the requests.

## 1. Collect the URLs for all rent postings

Let's collect the URLs of all currently available rent postings.
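
The URL handling in this step can be sketched on its own. The `/p<number>` page suffix follows the scheme used in this notebook; the helper names and the sample `href` are hypothetical. `urljoin` is used here so the result is the same whether or not a link starts with a slash, which is an assumption about the page markup worth verifying.

```python
from urllib.parse import urljoin

BASE_URL = "https://batdongsan.com.vn"
LISTING_URL = BASE_URL + "/nha-dat-cho-thue-tp-hcm"


def listing_page_url(page_number: int) -> str:
    """Page 1 is the bare listing URL; later pages append '/p<number>'."""
    if page_number <= 1:
        return LISTING_URL
    return f"{LISTING_URL}/p{page_number}"


def absolute_posting_url(href: str) -> str:
    """Resolve a posting link against the site root without doubling slashes."""
    return urljoin(BASE_URL + "/", href)


# The href below is a hypothetical example, not a real posting.
example_posting = absolute_posting_url("/cho-thue-can-ho-example-pr123")
print(listing_page_url(3))
print(example_posting)
```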

In [31]:
# First page of website
url = 'https://batdongsan.com.vn/nha-dat-cho-thue-tp-hcm'

# Scraped URL list
url_list = []

In [37]:
# Scrape data
scraper = cloudscraper.create_scraper(delay=10, browser="chrome") 

# Get content of the first page
content = scraper.get(url)

# Get the total number of pages
soup = bs(content.text, 'html.parser')
num_of_page = soup.find('span', class_='re__hide js__current-page').get('data-total-page')

# Get the URL list from the first page
a_tags_list = soup.find_all('a', class_='re__unreport')
for a_tag in a_tags_list:
    url_list.append('https://batdongsan.com.vn/' + a_tag.get('href'))

# Similarly, let's scrape the rest of the pages 
page = '/p'
for num_page in range(2, int(num_of_page) + 1):
    content = scraper.get(url + page + str(num_page)).text
    soup = bs(content, 'html.parser')
    a_tags_list = soup.find_all('a', class_='re__unreport')
    for a_tag in a_tags_list:
        url_list.append('https://batdongsan.com.vn/' + a_tag.get('href'))

Now that we have the URLs for all the rent postings, let's save them in a file.

In [40]:
# Save scraped urls in file 
file_name = 'Data/rental_properties_url.txt'

with open(file_name, 'w') as file:
    for line in url_list:
        file.write(line + "\n")

## 2. Scrape the contents of the rent postings

Now let's actually go out and scrape the data on the page of each rent posting. To do this, we have to load the URL list that we saved in the first step.
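
For reference, the save-and-load round trip can also be done with plain Python file handling. This is a minimal sketch, not the notebook's method: `save_urls` and `load_urls` are hypothetical helpers, and the demo writes to a temporary directory with placeholder URLs.

```python
import os
import tempfile


def save_urls(path, urls):
    """Write one URL per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")


def load_urls(path):
    """Read the file back, keeping every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Round-trip demo with placeholder URLs in a temporary directory.
urls = ["https://example.com/a", "https://example.com/b"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rental_properties_url.txt")
    save_urls(path, urls)
    loaded = load_urls(path)
print(loaded)
```

The cell below uses `np.genfromtxt` for the same purpose.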

In [146]:
# Same file that was written at the end of step 1
file_name = 'Data/rental_properties_url.txt'
url_list = np.genfromtxt(file_name, dtype=str, encoding=None, skip_footer=1)

There are a few things to keep in mind before we start.

First, we are collecting all rent data, which means there will be many types of property for rent. We have to be able to tell these data points apart.

To do that, we have to analyze the structure of a rent posting's web page. I've gone ahead and done that, and discovered the ID for each rent type. With this information, I just need to create a function that converts the property type code to the name of the property type.

We will also prepare some other necessary functions for our process.
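
To make the decoding step concrete, here is a small sketch of pulling the rent type ID out of a script string and mapping it to a name. The sample script text is invented for illustration (the real page's field names and layout may differ), the mapping is only a subset, and both helper names and the `"Unknown"` fallback are assumptions rather than part of the notebook.

```python
import re

# Subset of the ID-to-name mapping, for illustration.
RENTING_TYPES = {"326": "Căn hộ chung cư", "52": "Nhà riêng", "50": "Văn phòng"}


def extract_renting_type_id(script_text: str) -> str:
    """Take the text after the first '=' and return the 4th number found in it."""
    payload = script_text.split("=")[1]
    numbers = re.findall(r"\d+", payload)
    return numbers[3]


def decode_renting_type(code: str, default: str = "Unknown") -> str:
    """Map a rent type ID to its name, falling back to `default` for unseen IDs."""
    return RENTING_TYPES.get(code, default)


# Hypothetical script content, not copied from the website.
sample_script = (
    "var pageData = {productId: 38812345, projectId: 0, vipType: 5, "
    "rentingTypeId: 326, wardId: 9};"
)
code = extract_renting_type_id(sample_script)
print(code, decode_renting_type(code), decode_renting_type("999"))
```

Using `.get` with a fallback means an unfamiliar ID yields a placeholder label instead of raising `KeyError`, so such rows can be inspected later rather than lost.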

In [147]:
# Rent type
# The numbers extracted from the page script appear in this order (index: field):
# 0: product_ID, 1: project_ID, 2: vipType, 3: rentingTypeID, 4: wardId, 5: streetId, 6: pageId, 7: createdByUser, 8: productType
# We will be extracting the rentingTypeID (index 3)
rent_type = ['Căn hộ chung cư', 'Nhà riêng', 'Nhà biệt thự, liền kề', 'Nhà mặt phố', 'Nhà trọ, phòng trọ', 'Shophouse nhà phố thương mại', 'Văn phòng', 'Cửa hàng, ki ốt', 'Kho, nhà xưởng, đất', 'Bất động sản khác']
renting_type_dict = {
    '326': rent_type[0],
    '52': rent_type[1],
    '577': rent_type[2],
    '51': rent_type[3],
    '57': rent_type[4],
    '576': rent_type[5],
    '50': rent_type[6],
    '55': rent_type[7],
    '53': rent_type[8], 
    '59': rent_type[9]
}

# Decode property type ID
def get_renting_type(rent_code):
    return renting_type_dict[rent_code]

# Create list from scraped data
def soup_to_list(soup_data):
    res = []
    for data in soup_data:
        res.append(data.get_text(strip=True))
    return res

We will also have to prepare the columns of the collected dataframe. The data collected will be the `Address` of the property, the `Rent type` as defined above, the `Post date` of the rent posting, and the `Meta data`.

The `Meta data` field can be confusing at first, but to break it down, it's just all the pieces of information that accompany the rent posting. Not all rent types come with the same meta data, because they are different kinds of property or for other reasons. For example, `Căn hộ chung cư` (apartment) might have meta data for the number of bathrooms and bedrooms, while `Kho, nhà xưởng, đất` (warehouse, factory, land) might not. Because of this difference, it's unreasonable to ask for a fixed number of features. Instead, collecting all available features from the postings and sorting them out later is a better solution.
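
A minimal sketch of this idea: each posting's meta data is built by pairing titles with values, and the differing dictionaries can later be expanded into rows that share one set of columns. The field names and values below are hypothetical examples, and `build_metadata` and `to_uniform_rows` are illustrative helpers, not functions from the notebook.

```python
def build_metadata(titles, values):
    """Pair each spec title with its value, as done with dict(zip(...))."""
    return dict(zip(titles, values))


def to_uniform_rows(records):
    """Expand differing dicts into rows with the same columns; gaps become None."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    rows = [{col: record.get(col) for col in columns} for record in records]
    return columns, rows


# Hypothetical postings of two different rent types.
apartment = build_metadata(
    ["Diện tích", "Số phòng ngủ", "Số toilet"], ["70 m²", "2 phòng", "2 phòng"]
)
warehouse = build_metadata(["Diện tích", "Mặt tiền"], ["500 m²", "20 m"])

columns, rows = to_uniform_rows([apartment, warehouse])
print(columns)
print(rows[1])
```

Keeping the raw dictionary during collection postpones the decision about which features matter until the data is explored.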

In [148]:
# Structure of the data we will collect
# address | rent type | metadata | post date
fields = [ 'Address', 'Rent type', 'Meta data', 'Post date']

In [215]:
# Prepare our scraper
scraper = cloudscraper.create_scraper(delay=15, browser="firefox") 


number_pattern = r'(\d+)'
renting_data = pd.DataFrame(columns=fields)

# We are using a step of 2000 to take a break every 2000 web postings
for i in range(0, len(url_list), 2000):
    for url in url_list[i:i + 2000]:
        # Fetch the web page and set its text encoding
        content = scraper.get(url)
        content.encoding = content.apparent_encoding
        soup = bs(content.text, 'html.parser')
        
        collect_data_dict = {}
        # Keep the whole scraper from crashing because of a deleted web page
        try:
            collect_data_dict[fields[0]] = soup.find('span', class_='re__pr-short-description js__pr-address').text.strip()
            collect_data_dict[fields[1]] = get_renting_type(re.findall(number_pattern, soup.find('script', type='text/javascript').text.split('=')[1])[3])
            collect_data_dict[fields[2]] = dict(zip(
                soup_to_list(soup.find_all('span', class_='re__pr-specs-content-item-title')),
                soup_to_list(soup.find_all('span', class_='re__pr-specs-content-item-value'))
            ))
            collect_data_dict[fields[3]] = soup.find('div', class_='re__pr-short-info-item js__pr-config-item').find('span', class_='value').text
            renting_data.loc[len(renting_data)] = list(collect_data_dict.values())
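        # Missing elements raise AttributeError/IndexError; unknown type IDs raise KeyError
        except (AttributeError, IndexError, KeyError):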
            continue
    time.sleep(15)       

The whole process takes a lot of time to complete (about 10 hours), and we have collected more than 40,000 samples. Yay!
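
The batch-and-pause pattern used in the scraping loop can be isolated as a small helper, sketched here under a few assumptions: `iter_batches` and `process_in_batches` are hypothetical names, and the injectable `sleep` parameter is an addition so the demo can run without actually waiting.

```python
import time


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(items, handle, batch_size=2000, pause_seconds=15, sleep=time.sleep):
    """Call `handle` on every item, pausing after each batch. Returns the pause count."""
    pauses = 0
    for batch in iter_batches(items, batch_size):
        for item in batch:
            handle(item)
        sleep(pause_seconds)
        pauses += 1
    return pauses


# Demo with a no-op sleep so it finishes immediately.
seen = []
pauses = process_in_batches(list(range(5)), seen.append, batch_size=2, sleep=lambda s: None)
print(seen, pauses)
```

With the notebook's settings, roughly 40,000 postings in batches of 2,000 would mean about 20 pauses of 15 seconds, so the pauses account for only a few minutes of the total run time.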

Now let's export the data we've collected to a file, and we are done!
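
One thing to keep in mind for the next phase: the `Meta data` column holds dictionaries, and a dictionary written to a CSV cell is stored as text. The sketch below uses the standard `csv` module (not pandas) and a made-up row to show the round trip, with `ast.literal_eval` turning the text back into a dictionary.

```python
import ast
import csv
import io

# Hypothetical row, not taken from the collected data.
row = {
    "Address": "Quận 1, Hồ Chí Minh",
    "Rent type": "Văn phòng",
    "Meta data": {"Diện tích": "50 m²"},
    "Post date": "07/12/2023",
}

# Write the row to an in-memory CSV; the dict is written as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the meta data cell is now a plain string.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
meta = ast.literal_eval(restored["Meta data"])
print(type(restored["Meta data"]).__name__, meta)
```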

In [216]:
export_file = '../Data/rent_data.csv'
# Append rows to the file without writing a header line
renting_data.to_csv(export_file, mode='a', header=False)

That's the end of Phase 1! Let's continue to Phase 2!