# Introduction

The overall goal of this study is to show, step by step, how to approach a data science or machine learning problem from scratch, that is, from obtaining the data through to model validation.

The data selected for this project come from the Colombian housing market, with the following features:

* Number of bedrooms and bathrooms.
* Area: area of the property, in m².
* Price: price of the property, in Colombian pesos.
* Location: city where the offer is available.
* Type of property: house or apartment.
* Stratum: number that classifies the zones that receive public services. It is used to assign subsidies to low-income families and to tax high-income ones: strata 1 to 3 receive help and strata 4 to 6 are taxed. This classification will be removed by 2026.

Because the stratum classification will be removed, the second goal of the study is to group the data into new categories that do not depend on that number; in other words, an unsupervised learning task.

# Data mining

### Obtaining links

Data was obtained from the website https://www.properati.com.co/ (the base URL), one of the places where information about property prices in Colombia can be found.

First, a script was built to gather the links to all the properties in the cities of interest. To do so, the links and the source code of the website were inspected, using the following listing page and apartment as examples:

* view-source:https://www.properati.com.co/s/barranquilla-atlantico/apartamento/venta
* https://www.properati.com.co/proyecto/14032-32-e557-f87edff6468c-7bb864f-88dd-4430

Analyzing the source code of the listing page alongside the link of one of its properties shows that each property has a unique identifier attribute called 'data-idanuncio', accompanied by another attribute called 'data-href'. Therefore, combining the base URL with each data-href value yields the link to every property.
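
As a minimal sketch of that combination, the snippet below joins the base URL with a data-href value using `urljoin` from the standard library. The `data_href` value is an assumption: it is the path portion of the example link above, not a value copied from the page source.

```python
from urllib.parse import urljoin

base_url = "https://www.properati.com.co"
# Assumed data-href value (path portion of the example link above)
data_href = "/proyecto/14032-32-e557-f87edff6468c-7bb864f-88dd-4430"

# urljoin avoids doubled or missing slashes between the two parts
full_url = urljoin(base_url, data_href)
print(full_url)
```

Plain string concatenation, as used in the notebook, gives the same result whenever data-href starts with a single '/'.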

This was done with the libraries requests, BeautifulSoup, time and pandas.

In [18]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

In [21]:
# Set Pandas display option to show the full URL
pd.set_option('display.max_colwidth', None)

In [24]:
# Define the URLs
base_url = "https://www.properati.com.co"
url = "https://www.properati.com.co/s/barranquilla-atlantico/apartamento/venta"

# Send an HTTP request to get the content of the page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all elements with the data-idanuncio 
    listings = soup.find_all(attrs={"data-idanuncio": True})
    
    # List to store the links built
    full_hrefs = []
    
    for listing in listings:
        # Get the  data-href attribute
        data_href = listing.get('data-href')
        
        if data_href:  # If data-href exists, build the full URL
            full_url = base_url + data_href
            full_hrefs.append(full_url)
    
    # Create a pandas DataFrame with the full URLs
    df_1 = pd.DataFrame(full_hrefs, columns=['URL'])

The previous script only covers the first results page for each city. Moving through the pages shows that the website appends a '/' followed by the page number, for example:

* https://www.properati.com.co/s/barranquilla-atlantico/apartamento/venta/2

Hence, the code was adapted to loop through all the page numbers available on the website for each city (the last page number had to be checked manually). The cell below uses the range for apartments in Barranquilla:

In [None]:
# Define the base URL
base_url = "https://www.properati.com.co"

# Define the target URL pattern (replace page number dynamically)
url_pattern = "https://www.properati.com.co/s/barranquilla-atlantico/apartamento/venta/{}"

# List to store the full href values
full_hrefs = []

# Loop through pages 
for page_num in range(2, 168):  # remember to adapt it according to the City
    
    # Construct the URL for the current page
    url = url_pattern.format(page_num)
    
    # Send an HTTP request to get the content of the page
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all elements with the data-idanuncio attribute
        listings = soup.find_all(attrs={"data-idanuncio": True})
        
        for listing in listings:
            # Extract the value of the data-href attribute
            data_href = listing.get('data-href')
            
            if data_href:  # If data-href exists, store the full URL
                full_url = base_url + data_href
                full_hrefs.append(full_url)
        
        print(f"Successfully scraped page {page_num}")
        
        # Add a short delay between requests to avoid overwhelming the server or triggering the captcha bot
        time.sleep(1)
    else:
        print(f"Failed to retrieve page {page_num}. HTTP Status Code: {response.status_code}")

# Create a pandas DataFrame with the full URLs
df = pd.DataFrame(full_hrefs, columns=['URL'])
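
Since the last page number had to be checked manually for each city, one possible alternative is to keep requesting pages until one comes back without listings. The sketch below shows only that stopping logic. It assumes that pages past the end contain no 'data-idanuncio' elements, which was not verified in this study. The names `collect_listing_urls`, `fetch_page_urls`, `FAKE_SITE` and `fake_fetch` are hypothetical, and the fake site stands in for the real requests and BeautifulSoup calls.

```python
from typing import Callable, List


def collect_listing_urls(
    fetch_page_urls: Callable[[int], List[str]],
    first_page: int = 1,
    max_pages: int = 500,
) -> List[str]:
    """Collect URLs page by page until a page yields no listings."""
    collected: List[str] = []
    for page_num in range(first_page, first_page + max_pages):
        page_urls = fetch_page_urls(page_num)
        if not page_urls:  # an empty page is taken as the end of the results
            break
        collected.extend(page_urls)
    return collected


# Offline stand-in for the scraper: three pages of made-up paths, then nothing
FAKE_SITE = {
    1: ["/proyecto/a", "/proyecto/b"],
    2: ["/proyecto/c"],
    3: ["/proyecto/d"],
}


def fake_fetch(page_num: int) -> List[str]:
    return FAKE_SITE.get(page_num, [])


urls = collect_listing_urls(fake_fetch)
print(urls)
```

In real use, `fetch_page_urls` would wrap the request, the parsing and the `time.sleep` delay from the cell above. The `max_pages` cap guards against looping forever if the site keeps returning listings.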

Finally, the links obtained were merged and saved in a CSV file. The URLs for each city can be found in the Data Links folder.

In [None]:
# Merging the links dataframes
links = pd.concat([df, df_1], axis = 0)

In [None]:
# Save the merged links; adapt the file name to the city and property type being scraped
links.to_csv('properati_listings_tunja_apt.csv', index = False)
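
Listing sites sometimes show the same property on more than one results page, so the merged table may contain repeated links. Whether that happens here is an assumption, not something checked in this study. The sketch below uses made-up URLs and hypothetical variable names to show how duplicates could be dropped before saving.

```python
import pandas as pd

# Hypothetical stand-ins for the first-page links and the links from the other pages
first_page_links = pd.DataFrame({"URL": ["https://example.com/a", "https://example.com/b"]})
other_pages_links = pd.DataFrame({"URL": ["https://example.com/b", "https://example.com/c"]})

# Merge, drop repeated links, and rebuild a clean 0..n-1 index
unique_links = (
    pd.concat([first_page_links, other_pages_links], ignore_index=True)
    .drop_duplicates(subset="URL")
    .reset_index(drop=True)
)
print(len(unique_links))
```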

### Obtaining data from each link

A similar process was followed to obtain the relevant information for each property. Inspecting an example link showed that the data of interest is found in the location, place details, place features and price elements. With that in mind, the next script was built to loop through each URL and extract the data of each apartment or house:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

In [None]:
# Initialize Selenium 
driver = webdriver.Chrome()  

# Open csv file of interest
df = pd.read_csv('properati_listings_armenia_apt.csv')

# Create a list to store the extracted data
data = []

# Loop through each URL 
for index, row in df.iterrows():
    property_url =  row['URL']
    driver.get(property_url)
    # Wait for the page to load (adjust to your connection); requesting pages too quickly can trigger the captcha
    time.sleep(5)
    
    # Extract location, bedrooms, bathrooms, area, property type, stratum and price from the page
    location = driver.find_element(By.CLASS_NAME, "location").text.strip() if driver.find_elements(By.CLASS_NAME, "location") else 'N/A'
    
    place_details = driver.find_elements(By.CLASS_NAME, "place-details--all-elements-showing")
    if place_details:
        bedrooms = place_details[0].find_element(By.XPATH, './/div[@data-test="bedrooms-value"]').text.strip() if place_details[0].find_elements(By.XPATH, './/div[@data-test="bedrooms-value"]') else 'N/A'
        bathrooms = place_details[0].find_element(By.XPATH, './/div[@data-test="full-bathrooms-value"]').text.strip() if place_details[0].find_elements(By.XPATH, './/div[@data-test="full-bathrooms-value"]') else 'N/A'
        area = place_details[0].find_element(By.XPATH, './/div[@data-test="area-value"]').text.strip() if place_details[0].find_elements(By.XPATH, './/div[@data-test="area-value"]') else 'N/A'
    else:
        bedrooms = bathrooms = area = 'N/A'
    
    place_features = driver.find_elements(By.CLASS_NAME, "place-features")
    if place_features:
        property_type = place_features[0].find_element(By.XPATH, './/span[@data-test="property-type-value"]').text.strip() if place_features[0].find_elements(By.XPATH, './/span[@data-test="property-type-value"]') else 'N/A'
        stratum = place_features[0].find_element(By.XPATH, './/span[@data-test="stratum-value"]').text.strip() if place_features[0].find_elements(By.XPATH, './/span[@data-test="stratum-value"]') else 'N/A'
    else:
        property_type = stratum = 'N/A'
    
    price = driver.find_element(By.XPATH, '//*[@data-test="listing-price"]').text.strip() if driver.find_elements(By.XPATH, '//*[@data-test="listing-price"]') else 'N/A'
    
    # Append the extracted data 
    data.append({
        'URL': property_url,
        'Location': location,
        'Bedrooms': bedrooms,
        'Bathrooms': bathrooms,
        'Area': area,
        'Property Type': property_type,
        'Stratum': stratum,
        'Price': price
    })
    
driver.quit()
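
The extraction lines above repeat the same pattern: look the element up once to check that it exists, then look it up again to read its text. A minimal sketch of a helper that does a single lookup is shown below. The name `text_or_na` is hypothetical, and `FakeElement` is only an offline stand-in for a Selenium driver or element so that the example runs without a browser.

```python
def text_or_na(parent, by, value, default="N/A"):
    """Return the stripped text of the first matching element, or `default`."""
    matches = parent.find_elements(by, value)
    return matches[0].text.strip() if matches else default


class FakeElement:
    """Offline stand-in exposing the two members the helper relies on."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        return self._children.get((by, value), [])


XPATH = "xpath"  # the same string that selenium's By.XPATH holds
BEDROOMS = './/div[@data-test="bedrooms-value"]'
BATHROOMS = './/div[@data-test="full-bathrooms-value"]'

# A fake details block that contains a bedrooms value but no bathrooms value
details = FakeElement(children={(XPATH, BEDROOMS): [FakeElement(" 3 ")]})

bedrooms = text_or_na(details, XPATH, BEDROOMS)
bathrooms = text_or_na(details, XPATH, BATHROOMS)
print(bedrooms, bathrooms)
```

With Selenium, the same helper could be called as `text_or_na(driver, By.CLASS_NAME, "location")` or with an element from `place_details` as the parent.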

In [None]:
# Store extracted data as a pandas DataFrame
properties_df = pd.DataFrame(data)

In [None]:
# Save the extracted data to a CSV file (adapt the file name to the city of the input CSV)
properties_df.to_csv('barranquilla_apt.csv', index=False)

This was possible thanks to the Selenium library; trying to use BeautifulSoup here triggered the website's captcha. Finally, all the extracted data can be found in the Data cities no cleaned folder.