# Web Scraping

## Definition

Web scraping is the process of extracting data from websites. It is also known as web harvesting or web data extraction. In practice, it converts unstructured data on the web (HTML) into structured data (a database or spreadsheet).

## Why Web Scraping?

Web scraping is useful when you need to collect a large amount of data from websites. You can then use the scraped data for various purposes:

*   Price monitoring
*   Market research
*   Financial data aggregation
*   Job listings
*   News and content aggregation

## What can be scraped?

Almost all websites can be scraped. However, some websites are built with technologies that make scraping difficult, for example sites that rely heavily on JavaScript to load content. In that case, you will need a headless browser to scrape the data.

## HTML basics

To scrape, we need to know how to read HTML. HTML stands for HyperText Markup Language, the standard markup language for creating web pages. It describes the structure of a web page as a series of elements, and these elements tell the browser how to display the content.

HTML is not a programming language. It is a markup language: you use elements to enclose, or wrap, different parts of the content to make it appear or act a certain way.

### Docs

Getting started with HTML: [https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started)

## Anatomy of an HTML element

![Element](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started/grumpy-cat-small.png)

## Markdown accepts HTML as well

Here we have an img tag:

<img
  src="https://raw.githubusercontent.com/mdn/beginner-html-site/gh-pages/images/firefox-icon.png"
  alt="Firefox icon" />

## Element attributes

All HTML elements can have attributes.

Attributes provide additional information about an element.

![Element](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started/grumpy-cat-attribute-small.png)

We will use attributes in elements to scrape data more precisely.
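
As a minimal sketch of how attributes can be read from code, the example below uses Python's standard-library `html.parser` module on a small hard-coded snippet. The class name `AttributeCollector` and the sample HTML are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.found.append((tag, dict(attrs)))


# hypothetical sample: one paragraph with a class, one image with src and alt
sample_html = (
    '<p class="editor-note">My cat is very grumpy</p>'
    '<img src="firefox-icon.png" alt="Firefox icon" />'
)

collector = AttributeCollector()
collector.feed(sample_html)
collector.close()

for tag, attributes in collector.found:
    print(tag, attributes)
```

Each element is reported together with a dictionary of its attributes, which is the information we will later use to pick out specific elements.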

## Rental and real estate data

We will use ss.com as a good example of scraping data. We will scrape rental and real estate data.

In [1]:
url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
# print (url)
url

'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/'

In [None]:
# We can check in developer tools what kind of data are there from Chrome, Safari, Firefox, etc

# we see that this page uses a lot of tables

# MDN docs on tables: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table

In [2]:
# if we have tabular data on web page we can use pandas to read it

# pandas docs on read_html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

# pandas is a huge data analysis library, we will use it for reading tables

import pandas as pd # standard alias
# version check
print(pd.__version__)

2.0.3


In [3]:
# pandas can read html tables and return a list of dataframes
dfs = pd.read_html(url) # this reads ALL table elements from the page
# how many
print(len(dfs))
# we need the 5th table, which is index 4 because indexes start at 0

6


In [4]:
df = dfs[4] # we get a dataframe from 5th table
df.head() # first 5 rows

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sludinājumi \tdatums | Sludinājumi \tdatums | Sludinājumi \tdatums | Iela | Ist. | m2 | Stāvs | Sērija | Cena, m2 | Cena |
| 1 |  |  | Plašs dzīvoklis ar lielisku plānojumu tuvu cen... | Ieroču 1 | 4 | 108 | 1/5 | Staļina | 1,065 € | 115,000 € |
| 2 |  |  | Pārdodam gaišu, siltu dzīvokli. 2 slēgtas guļa... | Ganību d. 25 | 3 | 73 | 3/3 | P. kara | 986 € | 72,000 € |
| 3 |  |  | Īpašnieks pārdod 4-istabu dzīvokli , kas sastā... | Sermuliņu 14 | 4 | 85 | 1/4 | Jaun. | 2,871 € | 244,000 € |
| 4 |  |  | Investīciju projekts 7. stāva izbūvei daudzdzī... | Lāčplēša 54 | Citi | 643 | 7/7 | P. kara | 138 € | 89,000 € |


In [6]:
# we can now save it to excel or csv
# df.to_excel("flats.xlsx") # needs pip install openpyxl first
df.to_csv("flats.csv")

In [None]:
# we would like to scrape all pages of flats

# we need to find the last page number

# we can use requests and beautifulsoup for that

In [7]:
import requests
# beautifulsoup is a library for parsing html
# install it with pip install beautifulsoup4
# docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup

In [8]:
# we have url already lets get the page
page = requests.get(url)
# check if we got the page
page.status_code

200

In [9]:
# we got text!
page.text[:100]


'<!DOCTYPE html>\r\n<HTML><HEAD>\r\n<title>SS.COM Dzīvokļi - Rīga - Centrs, Cenas, Pārdod - Sludinājumi</'

In [10]:
# if we did not have a parser we could use find or index etc
# it is a big string after all
# it is much better to parse it with beautifulsoup
# then we can access elements by tag name, class, id etc
soup = BeautifulSoup(page.text, 'lxml') # there are other parsers too, such as the built-in 'html.parser'
# lxml is a fast parser, it needs pip install lxml
# title
soup.title

<title>SS.COM Dzīvokļi - Rīga - Centrs, Cenas, Pārdod - Sludinājumi</title>

In [11]:
# we could find all anchor tags
all_anchors = soup.find_all("a") 
# mdn for anchor tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a
# how many
len(all_anchors)

99

In [12]:
# we just need a specific anchor: the one with attribute rel and value prev
# on the first page this "previous" link wraps around to the last page (page29 in the output below)
# we can use css selectors
# https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

# or we can use soup.find
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find

prev_anchor = soup.find("a", attrs={"rel": "prev"})
# find returns the first anchor with rel=prev
print(prev_anchor)

<a class="navi" href="/lv/real-estate/flats/riga/centre/sell/page29.html" name="nav_id" rel="prev"><img border="0" height="5" src="https://i.ss.com/img/s_left.png" style="padding-bottom:2px;" width="9"/> Iepriekšējie</a>


In [13]:
# let's get href attribute for this anchor
prev_anchor["href"]

'/lv/real-estate/flats/riga/centre/sell/page29.html'

In [14]:
# we need to extract the page number from this href
# we could use splits or regex
# split approach
last_num_str = prev_anchor["href"].split("page")[-1].split(".")[0]
last_num_str

'29'

In [15]:
# convert to integer
last_num = int(last_num_str)
last_num

29

In [17]:
# regex approach
import re
# regex101.com for practice
# regex docs: https://docs.python.org/3/library/re.html
# we are looking for one or more digits after "page"
# we need to escape the dot, because an unescaped dot matches any character

regex = r"page(\d+)\.html" # notice we use r for raw strings
# because regex uses a lot of backslashes that we do not want to escape again
my_group = re.search(regex, prev_anchor["href"]) # a Match object, or None if nothing matches
my_group[1] # index 0 is the whole match, index 1 is the first captured group


'29'

In [19]:
# convert to integer
last_num = int(my_group[1])
last_num

29
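
The regex step can be wrapped in a small helper so that a missing match does not raise an error. This is a sketch only: the function name `page_number_from_href` and the choice of `1` as the fallback value are our assumptions (a listing with a single page would presumably have no `pageN.html` link).

```python
import re

# same pattern as above, compiled once
PAGE_RE = re.compile(r"page(\d+)\.html")


def page_number_from_href(href, default=1):
    """Return the page number embedded in href, or default when there is none."""
    match = PAGE_RE.search(href)
    return int(match.group(1)) if match else default


print(page_number_from_href("/lv/real-estate/flats/riga/centre/sell/page29.html"))  # 29
print(page_number_from_href("/lv/real-estate/flats/riga/centre/sell/"))  # 1
```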

In [21]:
# let's make a function that, given a url and the last page number, returns a list of urls

def get_all_urls(url, last_page):
    """Returns a list of page urls from page 1 to last_page"""
    urls = [url] # the first page is the original url, without a page suffix
    for page_num in range(2, last_page + 1): # the remaining pages are page2.html, page3.html, ...
        urls.append(url + f"page{page_num}.html") # assumes url ends with a slash
    return urls

# test it with our url
urls = get_all_urls(url, last_num)
urls

['https://www.ss.com/lv/real-estate/flats/riga/centre/sell/',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page2.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page3.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page4.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page5.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page6.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page7.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page8.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page9.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page10.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page11.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page12.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page13.html',
 'https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page14.html
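
The function above relies on the url ending with a slash. A slightly more defensive variant could look like the sketch below. The name `build_page_urls` and the validation of `last_page` are our additions, not part of the original notebook.

```python
def build_page_urls(base_url, last_page):
    """List the listing urls for pages 1..last_page (hypothetical variant)."""
    if last_page < 1:
        raise ValueError("last_page must be at least 1")
    base = base_url.rstrip("/") + "/"  # tolerate a missing trailing slash
    # page 1 is the bare url, later pages get a pageN.html suffix
    return [base] + [f"{base}page{n}.html" for n in range(2, last_page + 1)]


page_urls = build_page_urls("https://www.ss.com/lv/real-estate/flats/riga/centre/sell", 3)
for page_url in page_urls:
    print(page_url)
```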

In [22]:
# we can now iterate over urls
# it is good practice to sleep between requests
# 0.3 or 0.5 seconds is enough
# it is bad practice to overload the server with requests, and it can get you banned

# we can use time module for that
import time

dfs = [] # we will store all dataframes here
for url in urls:
    print("Reading", url)
    # read the page and take the 5th table; the index is different for different types of ads
    df = pd.read_html(url)[4]
    dfs.append(df)
    time.sleep(0.3) # sleep for 0.3 seconds
# print how many we got
print("Got data from", len(dfs))

Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page2.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page3.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page4.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page5.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page6.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page7.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page8.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page9.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page10.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page11.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page12.html
Reading https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page13.html
Reading https://ww
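
In the loop above, a single failed request stops the whole run. One possible way to keep going is sketched below. The helper `fetch_all` and its parameters are hypothetical, and `fake_fetch` only simulates a download so that the example runs without network access.

```python
import time


def fetch_all(urls, fetch, delay=0.3, sleep=time.sleep):
    """Call fetch(url) for every url, pausing between requests.

    Returns (results, failures); a failed page is recorded instead of raising.
    """
    results, failures = [], []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        try:
            results.append(fetch(url))
        except Exception as exc:  # deliberately broad in this sketch
            failures.append((url, exc))
    return results, failures


def fake_fetch(url):
    """Stand-in for a real download; fails on page 2 to show the error path."""
    if url.endswith("page2.html"):
        raise ValueError("simulated failure")
    return f"table from {url}"


results, failures = fetch_all(["a/", "a/page2.html", "a/page3.html"], fake_fetch, delay=0)
print(len(results), "pages read,", len(failures), "failed")
```

With the names used in this notebook, a call could look like `fetch_all(urls, lambda u: pd.read_html(u)[4])`.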

In [23]:
# let's check last dataframe
dfs[-1].head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sludinājumi \tdatums | Sludinājumi \tdatums | Sludinājumi \tdatums | Iela | Ist. | m2 | Stāvs | Sērija | Cena, m2 | Cena |
| 1 |  |  | Pārdod gaišu un plašu 3 istabu istabu dzīvokli... | Skolas 20 | 3 | 96 | 5/6 | Renov. | 3,000 € | 288,000 € |
| 2 |  |  | No īpašnieka tiek pārdots plašs dzīvoklis Rīga... | Tomsona 2 | 4 | 122 | 3/6 | P. kara | 1,627 € | 198,500 € |


In [24]:
# we can combine all dataframes into one
df = pd.concat(dfs)
# print shape
print(df.shape)

(871, 10)


In [25]:
# save to csv
# add timestamp to filename
from datetime import datetime
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
file_name = f"flats_{timestamp}.csv"
df.to_csv(file_name)

## Things to do with BeautifulSoup

We could also have used BeautifulSoup to parse the HTML and extract the data ourselves.
When the data is not in a table, BeautifulSoup is the tool to use.

One thing missing from our table is the link to the full ad. We can get it from the anchor in each row.

In [27]:
# I will show you how to find the links to all ads on the page

# we will find all tr elements whose id starts with tr_ followed by digits

tr_ads = soup.find_all("tr", attrs={"id": re.compile(r"^tr_\d+")}) # raw string; \d+ means one or more digits
len(tr_ads)

30

In [28]:
# we could now extract all text from our tr_ads
# first ad text would be
tr_ads[0].text # note: the cell texts are joined with no separator

'Tiek pārdots iedvesmojošs 4 istabu dzīvoklis ar vienreizēju plānTērbatas 2041204/6P. kara2,125 €255,000  €'

In [29]:
first_ad = tr_ads[0]
# we can find all td elements and their text
tds = first_ad.find_all("td")
tds[-1].text # last td is the price

'255,000  €'

In [30]:
# so list of all texts from first ad
first_ad_texts = [td.text for td in tds]
first_ad_texts

['',
 '',
 'Tiek pārdots iedvesmojošs 4 istabu dzīvoklis ar vienreizēju plān',
 'Tērbatas 20',
 '4',
 '120',
 '4/6',
 'P. kara',
 '2,125 €',
 '255,000  €']
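
A list of cell texts like this can be turned into a record with named fields and numeric prices. The sketch below assumes the column order shown above and sale prices formatted like `255,000  €`. The English field names and the helpers `parse_euro` and `row_to_record` are our own choices for illustration.

```python
def parse_euro(text):
    """Turn a string such as '255,000  €' into the integer 255000."""
    return int(text.replace("€", "").replace(",", "").strip())


# hypothetical English names for the columns after the two empty cells
FIELDS = ["description", "street", "rooms", "area_m2", "floor", "series", "price_per_m2", "price"]


def row_to_record(cell_texts):
    """Map the cell texts of one ad row to a dict with named fields."""
    record = dict(zip(FIELDS, cell_texts[2:]))  # skip the first two cells
    record["price_per_m2"] = parse_euro(record["price_per_m2"])
    record["price"] = parse_euro(record["price"])
    return record


sample_texts = ["", "", "Tiek pārdots iedvesmojošs 4 istabu dzīvoklis ar vienreizēju plān",
                "Tērbatas 20", "4", "120", "4/6", "P. kara", "2,125 €", "255,000  €"]
record = row_to_record(sample_texts)
print(record["street"], record["price"])
```

The rooms column is left as text on purpose, because the table above shows that it can hold values such as Citi.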

In [31]:
# finally let's extract href from anchor which is in 2nd td
# we can use find
second_td = tds[1]
second_td.find("a")["href"]

'/msg/lv/real-estate/flats/riga/centre/dcehm.html'
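
For comparison, the same idea (rows whose id matches a pattern, then the first anchor's href) can be sketched with only the standard library, without BeautifulSoup. The class `AdLinkCollector` and the sample HTML below are hypothetical and much simpler than the real page.

```python
import re
from html.parser import HTMLParser

ROW_ID_RE = re.compile(r"^tr_\d+")


class AdLinkCollector(HTMLParser):
    """Collects the first href inside each tr whose id looks like tr_<digits>."""

    def __init__(self):
        super().__init__()
        self.in_ad_row = False
        self.row_has_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # a new row starts: check its id and reset the per-row flag
            self.in_ad_row = bool(ROW_ID_RE.match(attrs.get("id") or ""))
            self.row_has_link = False
        elif tag == "a" and self.in_ad_row and not self.row_has_link:
            href = attrs.get("href")
            if href:
                self.links.append(href)
                self.row_has_link = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_ad_row = False


# hypothetical sample: a header row, two ad rows and a banner row
sample_html = """
<table>
<tr id="head_line"><td><a href="/sort">Cena</a></td></tr>
<tr id="tr_101"><td></td><td><a href="/msg/lv/a.html"><img src="x.jpg"></a></td>
<td><a href="/msg/lv/a.html">text</a></td></tr>
<tr id="tr_102"><td></td><td><a href="/msg/lv/b.html">b</a></td></tr>
<tr id="tr_bnr_1"><td><a href="/banner">ad</a></td></tr>
</table>
"""

link_collector = AdLinkCollector()
link_collector.feed(sample_html)
link_collector.close()
print(link_collector.links)
```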

In [35]:
# let's create a full url for this ad
url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
base_url = "https://www.ss.com/"
suffix = second_td.find("a")["href"] # starts with a slash, so it is relative to the site root
print(url)
print(suffix)
first_ad_url = base_url.rstrip("/") + suffix # strip the slash so we do not get a double slash
first_ad_url

https://www.ss.com/lv/real-estate/flats/riga/centre/sell/
/msg/lv/real-estate/flats/riga/centre/dcehm.html


'https://www.ss.com/msg/lv/real-estate/flats/riga/centre/dcehm.html'
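
Instead of joining the strings by hand, the standard library's `urllib.parse.urljoin` can resolve an href against the url of the page it came from. This is shown here as an alternative sketch. The variable names `page_url`, `href` and `ad_url` are ours.

```python
from urllib.parse import urljoin

page_url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
href = "/msg/lv/real-estate/flats/riga/centre/dcehm.html"

# an href starting with a slash replaces the whole path of the page url
ad_url = urljoin(page_url, href)
print(ad_url)

# an href without a leading slash is resolved against the page's directory
print(urljoin(page_url, "page2.html"))
```

This handles both kinds of href without us having to strip or add slashes.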