Scraping inspirations for finn.no:

- [python](https://github.com/qiangwennorge/ScrapeFinnBolig)
- [node.js-scraper](https://github.com/Lekesoldat/finn-scraper)


An example of a business property is [this building](https://www.finn.no/realestate/businesssale/ad.html?finnkode=335466271).

It can be scraped using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc).

In [1]:
import requests
# for webscraping
from bs4 import BeautifulSoup
# for splitting strings
import re

# for working with spatial data
from osgeo import ogr, osr


For now we want to focus on Sarpsborg. Adding a `location` parameter to the URL narrows the search to the listings in Sarpsborg.

In [2]:
mainurl = 'https://www.finn.no/realestate/businesssale/search.html?location='
# Sarpsborg location
sarpsborg = '1.20002.20023'
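
The search URL can also be assembled with the standard library instead of string concatenation. This is a minimal sketch, assuming `location` is the only query parameter needed; `build_search_url` and `SEARCH_URL` are hypothetical names that do not appear elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Base search page without a query string (same page as `mainurl` above)
SEARCH_URL = 'https://www.finn.no/realestate/businesssale/search.html'

def build_search_url(location_code):
    # urlencode escapes the value and formats it as key=value
    return SEARCH_URL + '?' + urlencode({'location': location_code})

print(build_search_url('1.20002.20023'))
```

Further filters could be added as extra keys in the dictionary, provided the site supports them.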

In [3]:
def RequestAndScrape(url) :
    r = requests.get(url)
    if r.status_code == 200 :
        soup = BeautifulSoup(r.content, 'html.parser')
        return soup
    else :
        print('Bad Request:', r.status_code)

def FetchListingsURL(soup) :
    listing_data = []
    # find the listing links: 'a' elements with the following class
    listings = soup.find_all('a', {'class':'sf-search-ad-link'} )
    i = 0
    for listing in listings : 
        # get the value of the url
        listing_url = listing['href']
        listing_data.append(listing_url)
        i += 1
    # count listings
    print('%d listings available' % i)
    return listing_data
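
The functions above treat every `href` as an absolute URL. If the results page ever returns relative links (an assumption, not something observed here), they could be resolved against the search URL first. `absolute_urls` is a hypothetical helper for illustration.

```python
from urllib.parse import urljoin

def absolute_urls(base_url, hrefs):
    # urljoin keeps absolute URLs unchanged and resolves relative ones
    # against base_url
    return [urljoin(base_url, href) for href in hrefs]
```

Absolute links pass through untouched, so applying the helper to the output of `FetchListingsURL` would be harmless when the links are already complete.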

In [13]:
# fetch the key-info section. This information is located in divs that carry a 'data-testid' attribute.
def findKeyInfo(soup) :
    # defining local variables 
    usable_area, gross_area, ownership_type, area, construction_year, plot_area = [None] * 6
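    section = soup.find('section', {'aria-labelledby':'keyinfo-heading'})
    # a listing without a key-info section yields only None values
    keyinfodivs = section.find_all('div') if section is not None else []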
    for div in keyinfodivs :
        if div.has_attr('data-testid') :
            try :
                attr = div.find('dt').text
            except AttributeError :
                # div.find('dt') returned None
                attr = ''
            # if the div has no 'dt' element, the match-case is skipped
            if len(attr) > 0 :
                match attr :
                    case 'Bruksareal' :
                        usable_area = div.find('dd').text
                    case 'Bruttoareal' :
                        gross_area = div.find('dd').text
                    case 'Eieform' :
                        ownership_type = div.find('dd').text
                    case 'Areal' :
                        area = div.find('dd').text
                    case 'Byggeår' :
                        construction_year = div.find('dd').text
                    case 'Tomteareal' :
                        plot_area = div.find('dd').text  
    return usable_area, gross_area, ownership_type, area, construction_year, plot_area
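
`findKeyInfo` returns the values as raw text. A minimal sketch of turning such text into integers follows, assuming the values look like `'1 200 m²'` or `'1985'` (the exact formatting on the site is an assumption). `parse_int` is a hypothetical helper.

```python
import re

def parse_int(text):
    """Return the first integer found in text, or None if there is none."""
    if text is None:
        return None
    # a digit followed by digits or whitespace; whitespace may be a
    # thousands separator, and \s also matches non-breaking spaces
    match = re.search(r'\d[\d\s]*', text)
    if match is None:
        return None
    return int(re.sub(r'\s', '', match.group()))
```

For example, `parse_int(usable_area)` could be called on each size value before storing it, while text fields such as the ownership type stay as strings.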

Addresses are geocoded with the Nominatim API; see [the Search documentation](https://nominatim.org/release-docs/develop/api/Search/).

In [5]:
# geocode address
import json

def geocodeAddresses(address) :
    # assumes the form 'Street Number, Postalcode City' with a one-word street name
    address = re.split(', | ', address )
    street  = address[0] + ' ' + address[1]
    postalcode = address[2]
    city = address[3]

    geocode_url = 'https://nominatim.openstreetmap.org/search?'
    params = dict (
        limit = '1',
        polygon_geojson= '1',
        format = 'geojson',
        street = street,
        city = city,
        postalcode = postalcode
    )
    # Nominatim's usage policy asks for an identifying User-Agent; this value is a placeholder
    headers = {'User-Agent': 'finn-scraper-notebook'}
    r = requests.get(geocode_url, params=params, headers=headers)
    if r.status_code == 200 :
        features = r.json()['features']
    
        if len(features) > 0 :
            feature = features[0]
            # fetch geometry information; json.dumps yields valid JSON, unlike str() on a dict
            geometry = ogr.CreateGeometryFromJson(json.dumps(feature['geometry']))
            # fetch projection, could be necessary sometimes...
            # source = geometry.GetSpatialReference() 
            # epsg =   source.GetAttrValue('AUTHORITY', 1)
            return (geometry)
    else : 
        print('The following request failed:')
        print(r.url)
        print('With status code, ', r.status_code)
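
Splitting on every space breaks for street names of more than one word. A regular expression anchored on the comma and the four-digit Norwegian postal code is one possible alternative. This is a sketch that assumes addresses follow the pattern `street, postalcode city`; `ADDRESS_RE` and `split_address` are hypothetical names, and the address in the usage note is made up.

```python
import re

# street: everything before the comma; postalcode: four digits; city: the rest
ADDRESS_RE = re.compile(
    r'^\s*(?P<street>.+?),\s*(?P<postalcode>\d{4})\s+(?P<city>.+?)\s*$'
)

def split_address(address):
    """Return (street, postalcode, city), or None if the pattern does not match."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return None
    return match.group('street'), match.group('postalcode'), match.group('city')
```

With this helper, `split_address('Olav Haraldssons gate 2B, 1707 Sarpsborg')` keeps the whole street name together, and an address that does not fit the pattern returns `None` instead of raising an `IndexError`.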

In [6]:
def fetchRealEstateInfo(soup) :
    companyprofile = soup.find('company-profile-podlet').find('div')
    name = str(companyprofile.find('h2').text)
    img = companyprofile.find('img')['src']
    return name, img
    
def fetchMetadata(soup) :
    metatable = soup.find('h2', id='ad-info-heading').find_next('table').find('td', {'class':'pl-8'})
    finn_kode = metatable.text
    status_date = metatable.find_next('td').text
    return finn_kode, status_date

Per listing, I want the following information: 

    - [x] Title of listing
    - [x] Address of listing
    - [x] FINN-kode
    - [x] Updated date
    - [x] Size-values of property 
    - [x] Type of listing
    - [x] Real Estate Agent

Example of a listing: 
https://www.finn.no/realestate/businesssale/ad.html?finnkode=341094783


In [None]:
# fetch listings of sarpsborg
url = mainurl + sarpsborg
# get HTML of main page
soup = RequestAndScrape(url)
# get the URLs of the individual listing pages
listing_urls = FetchListingsURL(soup)
# iterate over each listing url
for listing_url in listing_urls :
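    soup = RequestAndScrape(listing_url)
    # RequestAndScrape returns None on a failed request, so skip that listing
    if soup is None :
        continue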

    # title of listing
    title = soup.find('h1').text
    # fetch key information about the listing
    usable_area, gross_area, ownership_type, area, construction_year, plot_area = findKeyInfo(soup)

    # realestate agent
    company_name, img = fetchRealEstateInfo(soup)

    # fetch type of listing (named listing_type to avoid shadowing the built-in type)
    listing_type = soup.find('section', {'aria-labelledby':'property-type-heading'}).find_next('div').text
    
    # fetch the metadata table with the FINN-kode and the last-updated date
    finn_kode, status_date = fetchMetadata(soup)

    # address has the form: street name and number, postcode city
    address = soup.find('span', {'data-testid':'object-address'}).text
    geometry = geocodeAddresses(address)
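
The loop above computes the fields for each listing but does not store them yet. One way to collect them is a small record type that is written out as CSV. This is a sketch with assumed field names; `Listing` and `listings_to_csv` are hypothetical and only cover part of the checklist above.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Optional

@dataclass
class Listing:
    title: str
    address: str
    finn_kode: str
    status_date: str
    listing_type: Optional[str] = None
    company_name: Optional[str] = None
    usable_area: Optional[str] = None

def listings_to_csv(listings):
    """Serialize Listing records to CSV text, one row per listing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Listing)])
    writer.writeheader()
    for listing in listings:
        # None values are written as empty cells
        writer.writerow(asdict(listing))
    return buffer.getvalue()
```

Inside the loop, each iteration could append a `Listing(...)` to a list, and `listings_to_csv` could be called once after the loop. The geometry is left out here because it would first need a text form such as WKT.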