<h2 style="color:tomato;">Task 3: Web Scraping - Real Estate Data from Batdongsan.com</h2>
Write a Python script to extract real estate data from the website Batdongsan.com: <br><br>

●	Scrape Real Estate Data: Scrape real estate listings from the Batdongsan.com website.<br> Extract details such as property prices, availability, descriptions, and location.<br><br>
●	Caching Mechanism Implementation: Implement a caching mechanism to optimize scraping speed and prevent excessive requests to the server.


<h1 style="color:tomato;">Solution:</h1>

In [29]:
from bs4 import BeautifulSoup
import requests
import requests_cache


In [30]:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import requests_cache
import pandas as pd
# Enable requests caching to prevent excessive requests to the server
requests_cache.install_cache('real_estate_cache', expire_after=3600)  # Cache expires after 1 hour
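
Since `requests_cache` only intercepts calls made through the `requests` library, pages loaded with Selenium are not covered by it. One possible way to cache those pages is a small on-disk cache keyed by URL. The sketch below is an assumption rather than part of the original solution: `PageCache` and the `fetch` callable are hypothetical names, and `fetch` stands for any function that takes a URL and returns the page HTML (for example a wrapper around `driver.get` and `driver.page_source`).

```python
import hashlib
import time
from pathlib import Path


class PageCache:
    """Tiny on-disk HTML cache keyed by URL, with a time-to-live in seconds."""

    def __init__(self, directory="page_cache", expire_after=3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.expire_after = expire_after

    def _path(self, url):
        # Hash the URL so it can be used safely as a file name
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get(self, url, fetch):
        """Return cached HTML for url; call fetch(url) only on a miss or after expiry."""
        path = self._path(url)
        if path.exists() and time.time() - path.stat().st_mtime < self.expire_after:
            return path.read_text(encoding="utf-8")
        html = fetch(url)
        path.write_text(html, encoding="utf-8")
        return html
```

With such a cache, the scraping loop could call `cache.get(temp_url, fetch)` instead of opening the browser directly, so a page already fetched within the last hour is read from disk.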


In [31]:
def scrape_real_estate_data(url = "https://batdongsan.com.vn/nha-dat-ban"):
    # Column names for the scraped listing fields
    columns = ["price", "area", "price_per_m2", "bedroom", "toilet", "location", "description"]
    data = pd.DataFrame(columns=columns)

    # Load the first page to read the highest page number from the pagination links
    driver = webdriver.Chrome()
    
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    contents = soup.find_all('div', class_ = 're__card-info-content')

    pagination_numbers = soup.find_all('a', class_='re__pagination-number')
    # Strip "." from the page labels before converting; fall back to 1 if no links are found
    max_pagination_number = max([int(i.text.strip().replace(".", "")) for i in pagination_numbers], default=1)

    # The site has too many pages to scrape in full here, so cap the run at 3 pages
    # to save time while still simulating the scraping process
    max_pagination_number = min(max_pagination_number, 3)

    # Loop over pages 1 to max_pagination_number and scrape every listing on each page
    for i in range(1,max_pagination_number+1):
        driver = webdriver.Chrome()
        temp_url = url + "/p"+ str(i)
        driver.get(temp_url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()
        contents = soup.find_all('div', class_ = 're__card-info-content')

        for row in contents:
            price=row.find('span', class_ = 're__card-config-price js__card-config-item')
            price = price.text.strip() if price else None

            area=row.find('span', class_ = 're__card-config-area js__card-config-item')
            area = area.text.strip() if area else None

            price_per_m2=row.find('span', class_ = 're__card-config-price_per_m2 js__card-config-item')
            price_per_m2 = price_per_m2.text.strip() if price_per_m2 else None

            bedroom=row.find('span', class_ = 're__card-config-bedroom js__card-config-item')
            bedroom= bedroom.text.strip() if bedroom else None

            toilet=row.find('span', class_ = 're__card-config-toilet js__card-config-item')
            toilet= toilet.text.strip() if toilet else None

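            # The location text is in the second <span> of the location block; guard against missing markup
            location_block = row.find('div', class_='re__card-location')
            location_spans = location_block.find_all('span') if location_block else []
            location = location_spans[1].text.strip() if len(location_spans) > 1 else None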

            description=row.find('div', class_ = 're__card-description js__card-description')
            description= description.text.strip() if description else None


            length = len(data)
            data.loc[length] = [price,area,price_per_m2,bedroom,toilet,location,description]
    
    # Drop duplicate rows, if any
    data = data.drop_duplicates()
    # Save the results to "data.csv"
    data.to_csv("data.csv", index=False)

scrape_real_estate_data()
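
The pagination step inside the function can also be looked at in isolation. The following is a minimal sketch, assuming the pagination labels have already been extracted as plain strings; `parse_page_numbers` and `page_urls` are hypothetical helper names, not part of the original function. It skips non-numeric labels, falls back to a single page when nothing is found, and applies the same 3-page cap.

```python
def parse_page_numbers(labels):
    """Convert pagination labels to ints, removing '.' and skipping non-numeric labels."""
    numbers = []
    for label in labels:
        cleaned = label.strip().replace(".", "")
        if cleaned.isdecimal():
            numbers.append(int(cleaned))
    return numbers


def page_urls(base_url, labels, limit=3):
    """Build the page URLs to visit, capped at `limit` pages."""
    numbers = parse_page_numbers(labels)
    # If no usable label was found, assume there is only one page
    last_page = min(max(numbers, default=1), limit)
    return [f"{base_url}/p{i}" for i in range(1, last_page + 1)]
```

For example, `page_urls("https://batdongsan.com.vn/nha-dat-ban", ["1", "2", "9.123"])` yields the URLs ending in `/p1`, `/p2` and `/p3`.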


In [32]:
df = pd.read_csv("data.csv")
df.head()

,price,area,price_per_m2,bedroom,toilet,location,description
0,"20,98 tỷ",96 m²,"218,52 tr/m²",8.0,8.0,"Văn Giang, Hưng Yên","* PT.TV-48, căn quảng trường duy nhất còn sót ..."
1,"6,5 tỷ",162 m²,"40,12 tr/m²",4.0,3.0,"Quận 7, Hồ Chí Minh","Khu North: Quản lý 100% căn hộ Sunrise City, c..."
2,Giá thỏa thuận,"239,7 m²",,,,"Tây Hồ, Hà Nội",Tôi cần bán gấp biệt thự song lập lô góc K5 St...
3,"1,35 tỷ",43 m²,"31,4 tr/m²",1.0,1.0,"Gia Lâm, Hà Nội",Xin gửi quý khách hàng bảng báo giá của các că...
4,"6,1 tỷ",76 m²,"80,26 tr/m²",2.0,2.0,"Quận 2, Hồ Chí Minh",CHCC Lumiere Riverside căn 2PN chỉ 6.1 tỷ phòn...
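
The price columns above are stored as text, with a comma as the decimal separator and the units "tỷ" (billion VND) and "tr" (triệu, million VND). If numeric values are needed later, a conversion along the following lines might work. This is only a sketch: `parse_price_vnd` is a hypothetical helper, and it returns None for entries without a number, such as "Giá thỏa thuận" (negotiable price).

```python
import re
from decimal import Decimal

# Multipliers for the units seen in the output above
UNIT_MULTIPLIERS = {"tỷ": 1_000_000_000, "triệu": 1_000_000, "tr": 1_000_000}

PRICE_PATTERN = re.compile(r"^\s*(\d[\d.]*(?:,\d+)?)\s*(tỷ|triệu|tr)\b")


def parse_price_vnd(text):
    """Convert strings like '6,5 tỷ' to an integer amount in VND, or None if not numeric."""
    if not text:
        return None
    match = PRICE_PATTERN.match(text)
    if match is None:
        return None
    # Treat "." as a thousands separator and "," as the decimal separator
    number = Decimal(match.group(1).replace(".", "").replace(",", "."))
    return int(number * UNIT_MULTIPLIERS[match.group(2)])
```

Applied to a value such as "218,52 tr/m²", the function returns the per-square-metre price in VND; the unit after the slash is ignored.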


<h2 style="color:tomato;">Note:</h2>
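- The requests_cache library caches HTTP responses for a set duration (expire_after=3600 sets a 1-hour expiry), so repeated requests within that window are served from the cache instead of hitting the server again, which improves scraping efficiency and reduces server load. Note that it only intercepts calls made through the requests library; the pages in this solution are loaded with Selenium's webdriver.Chrome(), so that traffic does not pass through this cache. <br><br>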
- To scrape all the real estate data from the website "batdongsan.com.vn", the same process must be repeated for "https://batdongsan.com.vn/nha-dat-cho-thue" and "https://batdongsan.com.vn/du-an-bat-dong-san". <br><br>
- The results are saved to the "data.csv" file.<br><br>

- Several approaches could increase the scraping speed: multiprocessing, multithreading, and asyncio. More research is needed before applying these techniques here.
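
As an illustration of the multithreading option, the sketch below fetches several pages concurrently with a small thread pool from the standard library. It is an assumption, not the method used above: `fetch_all` is a hypothetical helper, and `fetch` stands for any function that takes a URL and returns the page HTML. With Selenium, each call should create and quit its own driver rather than share one between threads, and the worker count should stay small to avoid excessive load on the server.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL with a thread pool and return {url: html} in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls` even if pages finish out of order
        pages = list(pool.map(fetch, urls))
    return dict(zip(urls, pages))
```

The returned dictionary can then be parsed page by page with BeautifulSoup, exactly as in the sequential loop.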