# **"From Web Scraping to Machine Learning: Understanding Panama City's Real Estate with Python"**

## **Introduction**
Welcome to the dynamic real estate landscape of Panama City! In this data science journey, we study apartment listings sourced directly from the [ACOBIR website](https://www.acobir.com/proyectos/buscador/?category_search=&location_search=). ACOBIR, the Panamanian Chamber of Real Estate Brokers and Developers, publishes a large amount of project data that lets us gain insight into Panama City's property market.

Our goal isn't just to gather data but to understand the story behind each apartment listing. From scraping ACOBIR's pages to applying machine learning, this project sets out to identify the factors that influence property prices in this amazing city.

After collecting the data, we move on to data cleaning, data visualization, and predictive modeling. By the end of this journey, we aim not only to understand the current state of the real estate market but also to predict property prices and recommend projects that align with the city's unique dynamics.



## **The Story Unveiled in Code:**
### **1. Web Scraping: Unearthing the Hidden Gems**
The journey begins with web scraping, where we extract valuable information from ACOBIR's website. Through Python's powerful libraries such as BeautifulSoup and Requests, we navigate through the pages, collecting details on apartment features, locations, and prices. This process forms the foundation of our dataset, a treasure trove of insights waiting to be discovered.

**Step #1 - Web Scraping Page 1 (Get the URL of each Apartment Project) (The URL of this page differs from that of the other pages)**

In [None]:
from bs4 import BeautifulSoup
import requests

#Scrape the URL of each project on page 1
base_url = "https://www.acobir.com/proyectos/buscador/?category_search=2780&expresion_search=&price_range=0%2C1000000&page="
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')
more_info_links = []

# Use find_all to get all the <a> tags with class "actionbutton"
action_buttons = soup.find_all("a", class_="actionbutton")

# Iterate over the action_buttons to extract the href attribute from each <a> tag
for action_button in action_buttons:
    more_info_url = "https://www.acobir.com" + action_button['href']
    more_info_links.append(more_info_url)

#Display our work in Step #1
print(len(more_info_links), "Collected URL's")
print(more_info_links)



9 Collected URL's
['https://www.acobir.com/proyectos/list/amazonas-bella-vista/', 'https://www.acobir.com/proyectos/list/torres-del-norte/', 'https://www.acobir.com/proyectos/list/allure-at-punta-pacifica/', 'https://www.acobir.com/proyectos/list/the-edge-marbella-1/', 'https://www.acobir.com/proyectos/list/alexa-bella-vista-1/', 'https://www.acobir.com/proyectos/list/mi-condado/', 'https://www.acobir.com/proyectos/list/torres-de-castilla-via-espana/', 'https://www.acobir.com/proyectos/list/ph-sabana-tower/', 'https://www.acobir.com/proyectos/list/torres-de-espana/']
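
The absolute URLs above are built by concatenating the site root with each `href`. As a minimal sketch, and assuming some `href` values could already be absolute or could lack a leading slash, the standard library's `urllib.parse.urljoin` handles both cases. The `SITE_ROOT` constant and the `to_absolute` helper are illustrative names that do not appear in the original notebook.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.acobir.com"  # same root the notebook concatenates with


def to_absolute(href):
    """Resolve an href against the site root, whether it is relative or absolute."""
    return urljoin(SITE_ROOT, href)


# A relative href is resolved against the root; an absolute one is returned unchanged
print(to_absolute("/proyectos/list/mi-condado/"))
print(to_absolute("https://www.acobir.com/proyectos/list/mi-condado/"))
```

For the links collected in Step #1 the result should be identical to plain concatenation; the difference only matters if the site ever mixes relative and absolute links.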


**Step #2 - Web Scraping Pages 2 to 8**

In [None]:
base_url2 = "https://www.acobir.com/proyectos/buscador/page"
more_info_links2 = []

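#Scrape the project URLs from the remaining pages using a "for" loop.
#Note: the stop value of range() is exclusive, so range(2, 8) visits pages 2 through 7;
#use range(2, 9) if page 8 should be included.
for page in range(2, 8):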
    url = base_url2 + str(page) + "?category_search=2780&price_range=0%2C1000000"
    response2 = requests.get(url)
    soup = BeautifulSoup(response2.content, 'html.parser')

    # Use find_all to get all the <a> tags with class "actionbutton"
    action_buttons2 = soup.find_all("a", class_="actionbutton")

    # Iterate over the action_buttons to extract the href attribute from each <a> tag
    for action_button in action_buttons2:
        more_info_url2 = "https://www.acobir.com" + action_button['href']
        more_info_links2.append(more_info_url2)

#Display our work in Step #2
print(len(more_info_links2), "Collected URL's")


#Combine the results of Step #1 and Step #2
more_info_links.extend(more_info_links2)

#Display our work in Step #1 and Step #2
print(len(more_info_links), "Collected URL's in total")
print(more_info_links)


50 Collected URL's
59 Collected URL's in total
['https://www.acobir.com/proyectos/list/amazonas-bella-vista/', 'https://www.acobir.com/proyectos/list/torres-del-norte/', 'https://www.acobir.com/proyectos/list/allure-at-punta-pacifica/', 'https://www.acobir.com/proyectos/list/the-edge-marbella-1/', 'https://www.acobir.com/proyectos/list/alexa-bella-vista-1/', 'https://www.acobir.com/proyectos/list/mi-condado/', 'https://www.acobir.com/proyectos/list/torres-de-castilla-via-espana/', 'https://www.acobir.com/proyectos/list/ph-sabana-tower/', 'https://www.acobir.com/proyectos/list/torres-de-espana/', 'https://www.acobir.com/proyectos/list/ph-demetra/', 'https://www.acobir.com/proyectos/list/prestige-bella-vista/', 'https://www.acobir.com/proyectos/list/mansion-baluarte/', 'https://www.acobir.com/proyectos/list/santa-teresita-living/', 'https://www.acobir.com/proyectos/list/casa-variedades/', 'https://www.acobir.com/proyectos/list/forest-gate-en-albrook/', 'https://www.acobir.com/proyectos/l
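
Steps #1 and #2 use two URL patterns: one for page 1 and another for every later page. The sketch below gathers both patterns into a single helper so the page range is stated once and inclusively. The names `listing_page_url` and `listing_page_urls` are hypothetical; the URL strings are copied from the cells above.

```python
FIRST_PAGE = ("https://www.acobir.com/proyectos/buscador/"
              "?category_search=2780&expresion_search=&price_range=0%2C1000000&page=")
PAGE_PREFIX = "https://www.acobir.com/proyectos/buscador/page"
PAGE_QUERY = "?category_search=2780&price_range=0%2C1000000"


def listing_page_url(page):
    """Return the search-results URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return FIRST_PAGE  # page 1 uses a different URL pattern
    return f"{PAGE_PREFIX}{page}{PAGE_QUERY}"


def listing_page_urls(first, last):
    """Return the URLs for pages first..last, both ends included."""
    return [listing_page_url(page) for page in range(first, last + 1)]


urls = listing_page_urls(1, 8)
print(len(urls), "page URLs")
```

Because `range` excludes its stop value, the helper adds 1 internally, so `listing_page_urls(2, 8)` really does cover page 8.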

**Step #3 - Web Scraping Page 1 (Get the image URL of each Project)**
(The URL of this page differs from that of the other pages)

In [None]:
base_url = "https://www.acobir.com/proyectos/buscador/?category_search=2780&expresion_search=&price_range=0%2C1000000&page="
response_IMG = requests.get(base_url)

#Parse page 1 to extract the image URL of each project
soup = BeautifulSoup(response_IMG.content, 'html.parser')


# Use find_all to get all the <img> tags with class "card-img-top lazyload"
imageslinks1 = soup.find_all("img", class_="card-img-top lazyload")

images_urls = [] #New list for the image URLs

# Iterate over the imageslinks1 to extract the data-src attribute and build the complete URL
for img_tag in imageslinks1:
    src_attribute = img_tag.get("data-src")
    if src_attribute:
        full_url = "https://www.acobir.com" + src_attribute
        images_urls.append(full_url)

#Display our work in Step #3
print(len(images_urls), "Collected Image URL's")
print(images_urls)


9 Collected Image URL's
['https://www.acobir.com/site/assets/files/2024/08/05/7044/amazonas-01-1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/07/19/7020/8-tdn-render-edif5-exterior_1_11zon.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6915/allure_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6914/edge_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6913/alexa_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6911/diseno_sin_titulo_1-1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/02/15/6720/torres_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/01/25/6673/1__fachada_ph_sabana_tower.600x450.jpg', 'https://www.acobir.com/site/assets/files/2023/11/23/6591/te01-2.600x450.jpg']


**Step #4 - Web Scraping Pages 2 to 8 (Get the image URL of each Project)**

In [None]:
base_url3 = "https://www.acobir.com/proyectos/buscador/page"
images_urls2 = []

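#Scrape the image URLs from the remaining pages using a "for" loop.
#Note: the stop value of range() is exclusive, so range(2, 8) visits pages 2 through 7;
#use range(2, 9) if page 8 should be included.
for page in range(2, 8):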
    url = base_url3 + str(page) + "?category_search=2780&price_range=0%2C1000000"
    response_IMG2 = requests.get(url)
    soup = BeautifulSoup(response_IMG2.content, 'html.parser')
    images_on_page = soup.find_all("img", class_="card-img-top lazyload")

    # Iterate over images_on_page to extract the data-src attribute and build the complete URL
    for img_tag in images_on_page:
        src_attribute = img_tag.get("data-src")
        if src_attribute:
            full_url2 = "https://www.acobir.com" + src_attribute
            images_urls2.append(full_url2)

#Display our work in Step #4
print(len(images_urls2), "Collected Image URL's")

#Combine the results of Step #3 and Step #4
images_urls.extend(images_urls2)

#Display our work in Step #3 and Step #4
print(len(images_urls), "Collected Image URL's in total")
print(images_urls)


49 Collected Image URL's
58 Collected Image URL's in total
['https://www.acobir.com/site/assets/files/2024/08/05/7044/amazonas-01-1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/07/19/7020/8-tdn-render-edif5-exterior_1_11zon.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6915/allure_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6914/edge_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6913/alexa_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/06/12/6911/diseno_sin_titulo_1-1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/02/15/6720/torres_1.600x450.jpg', 'https://www.acobir.com/site/assets/files/2024/01/25/6673/1__fachada_ph_sabana_tower.600x450.jpg', 'https://www.acobir.com/site/assets/files/2023/11/23/6591/te01-2.600x450.jpg', 'https://www.acobir.com/site/assets/files/2023/07/24/6315/ph_demetra_abr_2023-21.600x450.jpg', 'https://www.acobir.com/site/assets/files/2022/10/20/5727/
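
The totals above differ: 59 project URLs but only 58 image URLs, so the two lists cannot be matched by position. One way to keep them aligned is to read the link and the image from the same listing card. The sketch below assumes each project sits in a container such as `<div class="card">`. That container class and the sample HTML are assumptions made for illustration, and the real ACOBIR markup may differ. The `pair_links_and_images` name is also hypothetical.

```python
from bs4 import BeautifulSoup

SITE_ROOT = "https://www.acobir.com"

# Hypothetical markup for two cards; the second one has no image
SAMPLE_HTML = """
<div class="card">
  <img class="card-img-top lazyload" data-src="/site/assets/files/a.jpg">
  <a class="actionbutton" href="/proyectos/list/project-a/">Ver</a>
</div>
<div class="card">
  <a class="actionbutton" href="/proyectos/list/project-b/">Ver</a>
</div>
"""


def pair_links_and_images(html):
    """Return (project_url, image_url or None) for each card in the page."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for card in soup.find_all("div", class_="card"):  # assumed container class
        link = card.find("a", class_="actionbutton")
        if link is None or not link.get("href"):
            continue  # not a project card
        img = card.find("img", class_="card-img-top")
        src = img.get("data-src") if img else None
        image_url = SITE_ROOT + src if src else None
        pairs.append((SITE_ROOT + link["href"], image_url))
    return pairs


pairs = pair_links_and_images(SAMPLE_HTML)
print(pairs)
```

A project without an image then yields `None` in its own row instead of shifting every later image up by one position.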

**Step #5 - Build a web scraping function to get the information of each project**

In [None]:
def scrape_property_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extracting name title (soup.find returns None when the block is missing,
    # so guard each lookup instead of calling .find on None)
    name = soup.find(class_="property-info head-info")
    heading = name.find('h1') if name else None
    name_title = heading.get_text(strip=True) if heading else ""
    result = [name_title] #New list for the values, starting with the name

    # Extracting coordinates
    coordinates = soup.find(class_="maparea")

    if coordinates:
        lat = coordinates.get("maplat")
        lng = coordinates.get("maplng")
    else:
        lat = lng = "0"

    result.extend([lat, lng]) #Add coordinates information

    # Extracting information from the 'character-general' class
    inform = soup.find(class_="character-general")
    categoria_spans = inform.find_all('span', class_='stext') if inform else []
    categoria_texts = [span.get_text(strip=True) for span in categoria_spans]

    result.extend(categoria_texts) #Add property information

    return result



**Step #6 - Collect the data for all the URLs using the function**

In [None]:
property_info = [] #New repository for the information

#Use a "for" loop to apply the function to the list of URLs obtained in Step #2.
for url in more_info_links:
    property_data = scrape_property_info(url)
    property_info.append(property_data)
    print(property_data)

#Display our work in Step #6
print(len(property_info), "Collected Projects")
print(property_info)


['Amazonas · Bella Vista', '', '', 'Residencial', 'Pre-Venta', 'Apartamentos', '43m2', '1', '1', '$119,000']
['Torres del Norte', '0', '0', 'Residencial', 'Pre-Venta', 'Apartamentos', '64.13m2', '3', '1.5', '$68,000']
['Allure at Punta Pacífica', '8.9759526', '-79.5064391', 'Residencial', 'Pre-Venta', 'Apartamentos', '100m2', '3', '1.5', '$482,000']
['The Edge Marbella', '8.9789744', '-79.5197546', 'Residencial', 'Pre-Venta', 'Apartamentos', '55m2', '1', '1', '$192,000']
['Alexa Bella Vista', '8.9720401', '-79.5322177', 'Residencial', 'En Construcción', 'Apartamentos', '40m2', '1', '1', '$97,100']
['Mi Condado', '9.027435', '-79.5233972', 'Residencial', 'En Construcción', 'Apartamentos', '74m2', '2', '1.5', '$159,000']
['Torres de Castilla - Vía España', '8.999227', '-79.5172926', 'Residencial', 'Proyecto a Estrenar', 'Apartamentos', '86m2', '3', '1.5', '$181,360']
['PH Sabana Tower', '', '', 'Residencial', 'Proyecto a Estrenar', 'Apartamentos', '99.7m2', '3', '2', '$176,960']
['Torres

AttributeError: 'NoneType' object has no attribute 'find'
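
A single page that cannot be parsed can abort the whole collection loop, as the `AttributeError` above illustrates. The sketch below shows one possible pattern: wrap each call, keep the rows that succeed together with their URL, and record the failures for later inspection. `collect_all` and `fake_scrape` are hypothetical names; `fake_scrape` is a stand-in used only so the example runs without network access, and in the notebook the real `scrape_property_info` would be passed instead.

```python
def collect_all(urls, scrape):
    """Apply `scrape` to every URL, keeping successes and failures separate."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append((url, scrape(url)))
        except AttributeError as exc:
            # Raised when an expected element is missing and find() returned None
            failed.append((url, str(exc)))
    return rows, failed


def fake_scrape(url):
    """Stand-in scraper for this sketch: fails for URLs containing 'broken'."""
    if "broken" in url:
        raise AttributeError("'NoneType' object has no attribute 'find'")
    return ["Some Project", "0", "0"]


urls = [
    "https://example.com/a/",
    "https://example.com/broken/",
    "https://example.com/b/",
]
rows, failed = collect_all(urls, fake_scrape)
print(len(rows), "scraped,", len(failed), "failed")
```

Storing the URL next to each row also means the URL column never has to be matched to the data by position afterwards. Network errors could be handled the same way by also catching `requests.RequestException`.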

**Step #7 - Build a Pandas Data Frame with the collected data**

In [None]:


import pandas as pd

# Define columns
columns = ['Name', 'Latitude', 'Longitude', 'Type', 'Status', 'Category', 'Size', 'Rooms', 'Bathrooms', 'Price']

# Use property_info to create DataFrame
property_df = pd.DataFrame(property_info, columns=columns)

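#Add the URL and image URL of each project.
#The lists can differ in length (59 URLs and 58 image URLs above, and fewer rows if
#scraping stopped early), so trim them to the rows actually collected.
#Caution: this aligns by position and assumes no entry is missing mid-list.
n_rows = len(property_df)
property_df['Info_url'] = more_info_links[:n_rows]
property_df['image_url'] = pd.Series(images_urls[:n_rows])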

property_df

ValueError: Length of values (59) does not match length of index (25)
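
The `ValueError` above arises because the rows, the project URLs, and the image URLs were collected in separate lists of different lengths. As a sketch of an alternative, the example below keeps each URL next to its scraped row and attaches images by URL rather than by position. The two rows are taken from the Step #6 output. The `scraped` and `image_by_url` structures are assumptions introduced here, not part of the original notebook.

```python
import pandas as pd

columns = ['Name', 'Latitude', 'Longitude', 'Type', 'Status', 'Category',
           'Size', 'Rooms', 'Bathrooms', 'Price']

# (url, row) pairs, as they would be if the URL were stored with each scraped row
scraped = [
    ("https://www.acobir.com/proyectos/list/the-edge-marbella-1/",
     ['The Edge Marbella', '8.9789744', '-79.5197546', 'Residencial',
      'Pre-Venta', 'Apartamentos', '55m2', '1', '1', '$192,000']),
    ("https://www.acobir.com/proyectos/list/alexa-bella-vista-1/",
     ['Alexa Bella Vista', '8.9720401', '-79.5322177', 'Residencial',
      'En Construcción', 'Apartamentos', '40m2', '1', '1', '$97,100']),
]

# Image URLs keyed by project URL; a project without an image has no key
image_by_url = {
    "https://www.acobir.com/proyectos/list/the-edge-marbella-1/":
        "https://www.acobir.com/site/assets/files/2024/06/12/6914/edge_1.600x450.jpg",
    "https://www.acobir.com/proyectos/list/alexa-bella-vista-1/":
        "https://www.acobir.com/site/assets/files/2024/06/12/6913/alexa_1.600x450.jpg",
}

df = pd.DataFrame([row for _, row in scraped], columns=columns)
df['Info_url'] = [url for url, _ in scraped]
# map() looks up each URL; a URL missing from the dict becomes NaN
df['image_url'] = df['Info_url'].map(image_by_url)

print(df[['Name', 'Info_url', 'image_url']])
```

With this layout the column lengths always match the number of rows, and a missing image shows up as `NaN` in its own row.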

**Step #8 - Convert the DataFrame to a CSV file**

In [None]:
property_df.to_csv('property_data_final.csv', index=False)

#Finally, this CSV file was uploaded to a GitHub repository for the next steps

#Test reading the file from the GitHub repository
property_data = pd.read_csv('https://raw.githubusercontent.com/eiig26/public_data/main/property_data_final.csv')
property_data.head()