### Project Ideas and Applications

#### 1. Property Price Prediction
- **Goal**: Predict the price of a property based on its features.
- **Data**: Property size, number of bedrooms/bathrooms, location, amenities.
- **Model**: Regression models (e.g., Linear Regression, Random Forest).
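
As a rough illustration of the regression idea, the sketch below fits a one-feature least-squares line (asking price against surface area) in plain Python. The three data points are the apartments that appear in the scraped sample output later in this notebook. The helper names `fit_line` and `predict` are made up for this example, and a real model would need far more rows and features (for instance through scikit-learn).

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(area, slope, intercept):
    """Estimated price for a given surface area."""
    return slope * area + intercept

# Surface area (m2) and asking price (JOD) of three sample listings
areas = [194, 77, 191]
prices = [92_000, 70_000, 86_000]

slope, intercept = fit_line(areas, prices)
print(f"price ~ {slope:.1f} * area + {intercept:.0f}")
print(f"estimated price for 120 m2: {predict(120, slope, intercept):.0f} JOD")
```

Three points are far too few to trust the coefficients; the block only shows the mechanics of fitting and predicting.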

#### 2. Real Estate Market Analysis
- **Goal**: Analyze trends in the real estate market over time.
- **Data**: Historical listing prices, sales data, economic indicators.
- **Model**: Time series analysis, clustering.

#### 3. Recommender System for Buyers
- **Goal**: Recommend properties to potential buyers based on their preferences.
- **Data**: User preferences, property features, past user behavior.
- **Model**: Collaborative filtering, content-based filtering.

#### 4. Property Valuation and Investment Analysis
- **Goal**: Assess the investment potential of properties.
- **Data**: Property prices, rental income, neighborhood amenities.
- **Model**: Predictive modeling, ROI calculation.

#### 5. Real Estate Heat Maps
- **Goal**: Visualize property prices and trends geographically.
- **Data**: Property locations, prices.
- **Model**: Geospatial analysis, heat maps.

# EXTRACTING PRODUCT LINKS

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import mysql.connector

import time

In [42]:
def scrape_links(url:str, pages:int) -> str:
    """Scrape product URLs.
    
    Scrape the URL of every product on each results page. A normal Open Sooq page contains 30 items. Specify how many pages you want to scrape.
    Provide a link sorted by most recent products for better results; otherwise many duplicates will occur.
    
    Parameters
    ----------
    url : str
        The category landing page URL whose product links you want to extract. It must already contain a query string
        (for example `?search=true&sort_code=recent`) and no `page` parameter, because `&page=<n>` is appended to it.
    pages : int
        The number of pages to scrape.

    Returns
    -------
    str
        A message indicating that scraping is complete. The link, id and price of every product are written to `data/links.csv`.
        
    Example
    -------
    url = 'https://jo.opensooq.com/en/real-estate-for-sale/all?search=true&sort_code=recent'
    scrape_links(url, 1122)
    """
    rows = []
    for i in range(1, pages+1):
        page_url = f'{url}&page={i}'
        r = requests.get(page_url)
        soup = BeautifulSoup(r.content, 'html.parser')
        table = soup.find('section', attrs = {'id':'serpMainContent'})
        
        # Each listing card is a div carrying this site-generated class string
        for row in table.find_all('div', {'class':re.compile('sc-21acf5d5-0 jhHVZS mb-32 relative radius-8 grayHoverBg whiteBg boxShadow2')}):
            data = {}
            # The href looks like '/en/search/<9-digit id>/<slug>', so characters 11-19 hold the id
            data['id'] = row.a['href'][11:20]
            data['link'] = 'https://opensooq.com'+row.a['href']
            data['price'] = row.find('div',{'class':'priceColor bold alignSelfCenter font-18 ms-auto'}).text
            rows.append(data)
        # Pause for a minute every 50 pages and report progress
        if i%50==0:
            time.sleep(60)
            print(f'Pages Scraped: {i}')
    pd.DataFrame(rows).to_csv('data/links.csv',index_label=False)
    return "Scraping Products Links is Done!"

In [43]:
url = 'https://jo.opensooq.com/en/real-estate-for-sale/all?search=true&sort_code=recent'
scrape_links(url,1122)

Pages Scraped: 50
Pages Scraped: 100
Pages Scraped: 150
Pages Scraped: 200
Pages Scraped: 250
Pages Scraped: 300
Pages Scraped: 350
Pages Scraped: 400
Pages Scraped: 450
Pages Scraped: 500
Pages Scraped: 550
Pages Scraped: 600
Pages Scraped: 650
Pages Scraped: 700
Pages Scraped: 750
Pages Scraped: 800
Pages Scraped: 850
Pages Scraped: 900
Pages Scraped: 950
Pages Scraped: 1000
Pages Scraped: 1050
Pages Scraped: 1100
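
The link scraper builds each page URL by appending `&page=<n>`, which only works when the base URL already has a query string. A minimal sketch of a more forgiving alternative, using only the standard library, is shown below. The helper name `with_page` is hypothetical and not part of the notebook, and whether the site accepts a URL without the `search` and `sort_code` parameters has not been checked here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_page(url: str, page: int) -> str:
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Drop any existing page parameter and keep the other parameters in order
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

base = "https://jo.opensooq.com/en/real-estate-for-sale/all?search=true&sort_code=recent"
print(with_page(base, 2))
print(with_page("https://jo.opensooq.com/en/real-estate-for-sale/all", 2))
```

Parsing and re-encoding the query avoids producing malformed URLs such as `...all&page=2` when the base URL has no `?`.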


In [44]:
links.head()

Unnamed: 0,id,link,price
0,244260887,https://opensooq.com/en/search/244260887/resid...,"60,000 JOD"
1,243608259,https://opensooq.com/en/search/243608259/194-m...,"92,000 JOD"
2,244651225,https://opensooq.com/en/search/244651225/77-m2...,"70,000 JOD"
3,244509387,https://opensooq.com/en/search/244509387/resid...,"130,000 JOD"
4,244715537,https://opensooq.com/en/search/244715537/191-m...,"86,000 JOD"
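
The `price` column is stored as text such as `60,000 JOD`, and the id is embedded in each link. The sketch below shows one way these could be turned into clean values with regular expressions. The function names `parse_price` and `parse_listing_id` are assumptions made for this example, and the patterns are based only on the five rows shown above, so other price formats may need extra handling.

```python
import re

def parse_price(text: str):
    """Split a string like '60,000 JOD' into (60000, 'JOD'); (None, None) if it does not match."""
    match = re.fullmatch(r"\s*(\d[\d,]*)\s*([A-Za-z]+)\s*", text)
    if match is None:
        return None, None
    return int(match.group(1).replace(",", "")), match.group(2)

def parse_listing_id(link: str):
    """Return the digits that follow '/search/' in a listing link, or None."""
    match = re.search(r"/search/(\d+)", link)
    return match.group(1) if match else None

print(parse_price("60,000 JOD"))
print(parse_listing_id("https://opensooq.com/en/search/244260887/resid..."))
```

Matching the id with a pattern rather than a fixed slice keeps working if the id length or the URL prefix ever changes.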


In [45]:
links.shape

(32850, 3)

In [48]:
links.to_csv('data/links.csv',index_label=False)

# SCRAPING PRODUCT DATA

In [12]:
links = pd.read_csv('data/links.csv')  # same path the links were saved to above
links = links.iloc[0:1000]

In [6]:
def safe_extract(find_function, default=None):
    """
    Safely extracts a value using the provided function, returning a default value if an exception occurs.

    This helper function attempts to execute the `find_function` callable. If `find_function`
    raises an AttributeError or TypeError, the function catches the exception and returns the
    specified default value instead.

    Args:
        find_function (callable): A function that performs the desired extraction or operation.
        default: The value to return if `find_function` raises an AttributeError or TypeError.
                 Defaults to None.

    Returns:
        The result of `find_function()` if no exception occurs, otherwise returns `default`.

    Examples:
        >>> safe_extract(lambda: soup.find('div').text, default='Not found')
        'Extracted text or "Not found" if an exception occurs'
    """
    try:
        return find_function()
    except (AttributeError, TypeError):
        return default
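
To make the behaviour of `safe_extract` concrete, here is a small self-contained check that uses plain Python objects instead of a parsed page. The variable `missing` stands in for a `soup.find(...)` call that found nothing. It also shows a limitation worth keeping in mind: only `AttributeError` and `TypeError` are caught, so an `IndexError` from something like `re.findall(...)[0]` on an empty result still propagates.

```python
def safe_extract(find_function, default=None):
    """Same helper as above, repeated so this block runs on its own."""
    try:
        return find_function()
    except (AttributeError, TypeError):
        return default

missing = None  # stands in for soup.find(...) returning None

# AttributeError (None has no .text) is swallowed and the default is returned
print(safe_extract(lambda: missing.text, default="Not found"))

# TypeError (None is not subscriptable) is swallowed too
print(safe_extract(lambda: missing["href"]))

# IndexError is not in the except clause, so it escapes
try:
    safe_extract(lambda: [][0])
except IndexError as exc:
    print("IndexError escaped:", exc)
```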

In [16]:
def scrape_real_estate_data(links:pd.DataFrame) -> pd.DataFrame:
    """Extract data from real-estate pages.
    
    Extract all relevant data from each real-estate listing page, including the id, the location, interior details of the property, and more.
    
    Parameters
    ----------
    links : pd.DataFrame
        A pandas DataFrame of real-estate page links, one listing per row, with `link` and `price` columns.

    Returns
    -------
    pd.DataFrame
        One row per scraped listing. Every 500 listings, the latest batch is also saved to `data/products/<batch number>.csv`.
    """
    records = []
    for i, (link, price) in enumerate(zip(links['link'], links['price']), start=1):
        r = requests.get(link)
        soup = BeautifulSoup(r.content,'html.parser')
        data = {}
        
        data['link'] = link
        data['id'] = safe_extract(lambda: re.findall(r'\d+', soup.find('div', attrs={'class': re.compile('flex flexSpaceBetween alignItems pb-16 font-17 borderBottom')}).text)[0])
        data['title'] = safe_extract(lambda: soup.find('h1', attrs={'class': re.compile('postViewTitle font-22 mt-16 mb-32')}).text)
        data['images'] = safe_extract(lambda: [img['src'] for img in soup.find('div', {'class': 'image-gallery-slides'}).find_all('img')])
        data['member_since'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).find('span', {'class': 'ltr inline'}).text)
        data['description'] = safe_extract(lambda: soup.find('section', {'id': 'postViewDescription'}).div.text)
        data['owner'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).a.h3.text)
        data['reviews'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).a.span.text)
        data['google_maps_locatoin_link'] = safe_extract(lambda: soup.find('a', attrs={'class': re.compile('sc-750f6c2-0 dqtnfq map_google relative block mt-16')})['href'])
        coordinates = safe_extract(lambda: re.findall(r'-?\d+\.\d+', data['google_maps_locatoin_link']), [])
        # NOTE: Google Maps links usually put the latitude first, so 'long' likely holds the latitude
        # and 'lat' the longitude; the keys are kept as-is to match the data already saved
        data['long'] = coordinates[0] if len(coordinates) >= 2 else None
        data['lat'] = coordinates[1] if len(coordinates) >= 2 else None
        data['owner_link'] = safe_extract(lambda: 'https://opensooq.com' + soup.find('section', {'id': 'PostViewOwnerCard'}).a.get("href"))
        data['price'] = price
        # Extract every field in the information section of the page (rooms, bathrooms, building age and more)
        ul = soup.find('ul', attrs={'class':re.compile('flex flexSpaceBetween flexWrap mt-8')})
        for li in (ul.find_all(recursive=False) if ul is not None else []):
            # The value is usually a link; otherwise fall back to the plain-text paragraph
            value = li.a if li.a is not None else li.find('p', attrs={'class':re.compile('width-75')})
            if li.p is not None and value is not None:
                data[li.p.text] = value.text
        records.append(data)
        if i%500==0:
            print(f'Products Scraped: {i}')
            # Checkpoint the latest 500 listings; `records` keeps everything for the return value
            pd.DataFrame(records[-500:]).to_csv(f'data/products/{i//500}.csv',index_label=False)
    return pd.DataFrame(records)

###### TODO: make the function above resume from where it stopped by checking the data/products directory for existing checkpoints. For example, if two CSV files (1,000 records) already exist, start from record 1,000 instead of 0.
###### Caveat: this may not work well, because listings sorted by recency or relevance shift all the time, so a saved position does not point to the same listing later.
###### Possible workaround: sort by newest, go to the last page, and scrape from the bottom up.
###### Not implemented yet; worth trying once everything else is finished.
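
One way to sidestep the ordering problem described in these notes is to resume by listing id rather than by position: read the ids already present in the checkpoint files and skip them. The sketch below uses only the standard library. The function names `scraped_ids` and `pending` are hypothetical, and it assumes checkpoint files whose header row lines up with the data rows (for example, written with `index=False`). Files written with `index_label=False` may omit the index field from the header, in which case `pd.read_csv` would be the safer reader.

```python
import csv
import tempfile
from pathlib import Path

def scraped_ids(checkpoint_dir):
    """Collect the `id` column of every CSV checkpoint in `checkpoint_dir`."""
    ids = set()
    for path in sorted(Path(checkpoint_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            ids.update(row["id"] for row in csv.DictReader(handle) if row.get("id"))
    return ids

def pending(listings, done_ids):
    """Keep only the listings whose id has not been scraped yet."""
    return [item for item in listings if str(item["id"]) not in done_ids]

# Demo with a throwaway directory standing in for data/products
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "1.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "link"])
        writer.writeheader()
        writer.writerow({"id": "244260887", "link": "https://opensooq.com/en/search/244260887/resid..."})
    listings = [{"id": 244260887}, {"id": 243608259}]
    todo = pending(listings, scraped_ids(tmp))

print(todo)
```

Because the check is by id, it does not matter if listings move between pages from one run to the next.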

In [17]:
df = scrape_real_estate_data(links)

Products Scraped: 500
Products Scraped: 1000


In [18]:
df.shape

(1000, 35)

In [19]:
df.head()

Unnamed: 0,link,id,title,images,member_since,description,owner,reviews,google_maps_locatoin_link,long,...,Bathrooms,Furnished?,Surface Area,Floor,Building Age,Main Amenities,Reference ID,Number of Floors,Additional Amenities,Property Status
0,https://opensooq.com/en/search/244260887/resid...,244260887,أرض للبيع ناعور - الروضه,[https://opensooq-images.os-cdn.com/previews/0...,17-05-2015,دونم ارض إستثماري في منطقة الروضة ناعور حوض 5 ...,مكتب العال و الروضه العقاري,( 12 ),https://www.google.com/maps/search/?api=1&quer...,31.831527,...,,,,,,,,,,
1,https://opensooq.com/en/search/243608259/194-m...,243608259,شقة طابق اخير مع روف مميزة,[https://opensooq-images.os-cdn.com/previews/0...,12-08-2018,شقة طابق الاخير مع روف الشقه 161 مترالروف 33 م...,الكسواني للاسكان,( 18 ),https://www.google.com/maps/search/?api=1&quer...,32.024303,...,3.0,Unfurnished,194 meter square,Third Floor,0 - 11 months,"Balcony, Laundry Room, Maid Room, Double Glaze...",,,,
2,https://opensooq.com/en/search/244651225/77-m2...,244651225,مشروع جبل عمان فندق حياه عمان شقة سياحية من ...,[https://opensooq-images.os-cdn.com/previews/0...,14-03-2016,مشروع جبل عمان فندق حياه عمان شقة سياحية من ...,شركه رائد العساف وشريكه,( 8 ),https://www.google.com/maps/search/?api=1&quer...,31.892895,...,3.0,Unfurnished,77 meter square,Third Floor,0 - 11 months,"Air Conditioning, Heating, Balcony, Maid Room,...",799118880.0,,,
3,https://opensooq.com/en/search/244509387/resid...,244509387,قطعة ارض سكنية مميزة جدا ومطلة للبيع على شارع...,[https://opensooq-images.os-cdn.com/previews/0...,18-09-2021,قطعة ارض مميزة جدا بأرقى احياء ابو نصير فلل م...,بلال نجم,( 0 ),https://www.google.com/maps/search/?api=1&quer...,32.066395,...,,,,,,,,,,
4,https://opensooq.com/en/search/244715537/191-m...,244715537,شقة طابق اخير مع روف,[https://opensooq-images.os-cdn.com/previews/0...,12-08-2018,شقة طابق اخير مع روف مميزة جدا تملك شقة أحلامك...,الكسواني للاسكان,( 18 ),https://www.google.com/maps/search/?api=1&quer...,32.044338,...,3.0,Unfurnished,191 meter square,Last floor with roof,0 - 11 months,"Electric Shutters, Balcony, Laundry Room, Doub...",,,,
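
Several columns in this sample are still raw strings: `Surface Area` (`194 meter square`), `reviews` (`( 12 )`) and `member_since` (`17-05-2015`, which reads as day-month-year given values such as `17-05` and `18-09`). The sketch below shows how they might be parsed with the standard library. The function names are made up for this example, and the patterns are inferred from the five rows above only.

```python
import re
from datetime import datetime

def parse_surface_area(text):
    """'194 meter square' -> 194.0; None if there is no leading number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text or "")
    return float(match.group(1)) if match else None

def parse_reviews(text):
    """'( 12 )' -> 12; None if there is no number."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None

def parse_member_since(text):
    """'17-05-2015' (day-month-year) -> datetime.date(2015, 5, 17)."""
    return datetime.strptime(text, "%d-%m-%Y").date()

print(parse_surface_area("194 meter square"))
print(parse_reviews("( 12 )"))
print(parse_member_since("17-05-2015"))
```

The first two helpers accept `None` for missing values; pandas `NaN` values would need to be filtered out before calling them.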


In [2]:
df.to_csv('data/products.csv',index_label=False)

In [3]:
# engine = create_engine("mysql+mysqlconnector://root:anon@localhost/opensooq")
# df.to_sql(name='estate', con=engine, if_exists='replace', index=False)

# SCRAPING USER DATA

In [7]:
df = pd.read_csv('data/products.csv')  # same path the products were saved to above
df.head()

Unnamed: 0,link,id,title,images,member_since,description,owner,reviews,google_maps_locatoin_link,long,...,Bathrooms,Furnished?,Surface Area,Floor,Building Age,Main Amenities,Reference ID,Number of Floors,Additional Amenities,Property Status
0,https://opensooq.com/en/search/244260887/resid...,244260887.0,أرض للبيع ناعور - الروضه,['https://opensooq-images.os-cdn.com/previews/...,17-05-2015,دونم ارض إستثماري في منطقة الروضة ناعور حوض 5 ...,مكتب العال و الروضه العقاري,( 12 ),https://www.google.com/maps/search/?api=1&quer...,31.831527,...,,,,,,,,,,
1,https://opensooq.com/en/search/243608259/194-m...,243608259.0,شقة طابق اخير مع روف مميزة,['https://opensooq-images.os-cdn.com/previews/...,12-08-2018,شقة طابق الاخير مع روف الشقه 161 مترالروف 33 م...,الكسواني للاسكان,( 18 ),https://www.google.com/maps/search/?api=1&quer...,32.024303,...,3.0,Unfurnished,194 meter square,Third Floor,0 - 11 months,"Balcony, Laundry Room, Maid Room, Double Glaze...",,,,
2,https://opensooq.com/en/search/244651225/77-m2...,244651225.0,مشروع جبل عمان فندق حياه عمان شقة سياحية من ...,['https://opensooq-images.os-cdn.com/previews/...,14-03-2016,مشروع جبل عمان فندق حياه عمان شقة سياحية من ...,شركه رائد العساف وشريكه,( 8 ),https://www.google.com/maps/search/?api=1&quer...,31.892895,...,3.0,Unfurnished,77 meter square,Third Floor,0 - 11 months,"Air Conditioning, Heating, Balcony, Maid Room,...",799118880.0,,,
3,https://opensooq.com/en/search/244509387/resid...,244509387.0,قطعة ارض سكنية مميزة جدا ومطلة للبيع على شارع...,['https://opensooq-images.os-cdn.com/previews/...,18-09-2021,قطعة ارض مميزة جدا بأرقى احياء ابو نصير فلل م...,بلال نجم,( 0 ),https://www.google.com/maps/search/?api=1&quer...,32.066395,...,,,,,,,,,,
4,https://opensooq.com/en/search/244715537/191-m...,244715537.0,شقة طابق اخير مع روف,['https://opensooq-images.os-cdn.com/previews/...,12-08-2018,شقة طابق اخير مع روف مميزة جدا تملك شقة أحلامك...,الكسواني للاسكان,( 18 ),https://www.google.com/maps/search/?api=1&quer...,32.044338,...,3.0,Unfurnished,191 meter square,Last floor with roof,0 - 11 months,"Electric Shutters, Balcony, Laundry Room, Doub...",,,,


In [8]:
def scrape_seller_data(links:list) -> pd.DataFrame:
    """Extract data from seller pages.
    
    Extract all relevant data from each seller's page, given the link to that page.
    
    Parameters
    ----------
    links : list
        A list of seller page links; each link represents one seller.

    Returns
    -------
    pandas.DataFrame
        A pandas DataFrame with one row of data per provided link.
    """
    df = []
    for link in links:
        r = requests.get(link+'?info=info')
        soup = BeautifulSoup(r.content,'html.parser')
        data = {}
        
        data['owner_link'] = link
        data['owner'] = safe_extract(lambda: soup.find('h1', {'class':'font-24'}).text)
        # The first two bold spans in the popularity block hold the view and follower counts
        popularity = safe_extract(lambda: soup.find('div',{'class':'flex mt-auto mb-32'}).find_all('span', {'class':'bold'}), [])
        data['views'] = popularity[0].text if len(popularity) >= 1 else None
        data['followers'] = popularity[1].text if len(popularity) >= 2 else None
        data['average_rating'] = safe_extract(lambda: soup.find('div',{'class':'sc-fb4b16ed-4 dkGjJX bold'}).text)
        data['google_maps_locatoin_link'] = safe_extract(lambda: soup.find('a', attrs={'class': re.compile('sc-750f6c2-0 dqtnfq map_google relative block mt-16')})['href'])
        coordinates = safe_extract(lambda: re.findall(r'-?\d+\.\d+', data['google_maps_locatoin_link']), [])
        # NOTE: the first number in the link is stored as 'long' and the second as 'lat';
        # Google Maps links usually put the latitude first, so the two keys are likely swapped
        data['long'] = coordinates[0] if len(coordinates) >= 2 else None
        data['lat'] = coordinates[1] if len(coordinates) >= 2 else None
        df.append(data)
    return pd.DataFrame(df)
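
The coordinate extraction above picks every decimal number out of the Google Maps link with a regular expression. A slightly stricter sketch is to read the `query` parameter explicitly, as below. It assumes the links follow the Google Maps URLs pattern `...?api=1&query=<latitude>,<longitude>`; the links in this notebook are truncated in the output, so that exact format is an assumption. The function name `parse_lat_long` is hypothetical, and the longitude in the example link is invented for illustration.

```python
from urllib.parse import parse_qs, urlsplit

def parse_lat_long(maps_link):
    """Return (latitude, longitude) from a '...?api=1&query=<lat>,<long>' link, or (None, None)."""
    query = parse_qs(urlsplit(maps_link or "").query).get("query", [""])[0]
    parts = query.split(",")
    if len(parts) != 2:
        return None, None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None, None

# Hypothetical link: the latitude matches the sample output, the longitude is invented
example = "https://www.google.com/maps/search/?api=1&query=31.831527,35.9"
print(parse_lat_long(example))
```

Returning the pair in a fixed, named order makes it harder to mix up which value is the latitude and which is the longitude.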

In [10]:
owners_links = list(df['owner_link'].dropna().unique())
len(owners_links)

460

In [11]:
owners_data = scrape_seller_data(owners_links)
owners_data.to_csv('sellers.csv',index_label=False)

# TESTING AREA

In [16]:
link = links.iloc[51]['link']
r = requests.get(link)
soup = BeautifulSoup(r.content,'html.parser')

In [17]:

data = {}
data['link'] = link
data['id'] = safe_extract(lambda: re.findall(r'\d+', soup.find('div', attrs={'class': re.compile('flex flexSpaceBetween alignItems pb-16 font-17 borderBottom')}).text)[0])
data['title'] = safe_extract(lambda: soup.find('h1', attrs={'class': re.compile('postViewTitle font-22 mt-16 mb-32')}).text)
data['images'] = safe_extract(lambda: [img['src'] for img in soup.find('div', {'class': 'image-gallery-slides'}).find_all('img')])
data['member_since'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).find('span', {'class': 'ltr inline'}).text)
data['description'] = safe_extract(lambda: soup.find('section', {'id': 'postViewDescription'}).div.text)
data['owner'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).a.h3.text)
data['reviews'] = safe_extract(lambda: soup.find('section', {'id': 'PostViewOwnerCard'}).a.span.text)
data['google_maps_locatoin_link'] = safe_extract(lambda: soup.find('a', attrs={'class': re.compile('sc-750f6c2-0 dqtnfq map_google relative block mt-16')})['href'])
coordinates = safe_extract(lambda: re.findall(r'-?\d+\.\d+', data['google_maps_locatoin_link']), [])
data['long'] = coordinates[0] if coordinates else None
data['lat'] = coordinates[1] if coordinates else None
data['owner_link'] = safe_extract(lambda: 'https://opensooq.com' + soup.find('section', {'id': 'PostViewOwnerCard'}).a.get("href"))

In [18]:
data

{'link': 'https://opensooq.com/en/search/241985001/mixed-use-land-for-sale-in-amman-al-muwaqqar',
 'id': '241985001',
 'title': 'مساحة الارض 10 دونم سهليه مخدومه',
 'images': ['https://opensooq-images.os-cdn.com/previews/0x720/71/5b/715bc6b10a4341cb4a6aae3e2d9cceb119b1281213cf98048645d676143019af.jpg.webp',
  'https://opensooq-images.os-cdn.com/previews/0x720/3e/18/3e18489f76575fe60f1889290a90c585c3ae7405258165eac986b1550a9ef59f.jpg.webp',
  'https://opensooq-images.os-cdn.com/previews/0x720/50/4f/504f4e3a10dc049f09b1ec8e5776edfc6ab81c43167ba34a21c6d5f773d2b9b8.jpg.webp',
  'https://opensooq-images.os-cdn.com/previews/0x720/83/50/8350674f69884473fe8a87297ac2ac7037531a32d14e43ffab27ba9af75801f8.jpg.webp',
  'https://opensooq-images.os-cdn.com/previews/0x720/96/b2/96b2fbf38d96cc9a55eb814b114fdc67d65d541aa4273f87aefab62b2eac40bf.jpg.webp',
  'https://opensooq-images.os-cdn.com/previews/0x720/4e/fc/4efcbf4ab216f09386392f94f0ee99a21854ecffdd1bb05da390963a43044c98.jpg.webp',
  'https://opens