# Web Scraping with Pandas

The pandas data analysis library has basic web scraping capabilities: it can extract tables from a web page and convert them into pandas DataFrames. In this notebook, I will show you how to scrape data from a website using pandas.

The one requirement for scraping with pandas is that the data is in tabular form, that is, inside HTML `<table>` elements. If it is not, you can use Beautiful Soup or Scrapy to scrape the data instead; Beautiful Soup will be covered in a separate notebook.

In [1]:
# good idea to start with date and python version
from datetime import datetime
print(f"Date: {datetime.now()}")
import sys
print(f"Python version: {sys.version}")

import pandas as pd # sometimes called excel on steroids
# Google Colab offers this built in
# otherwise you would install it with: pip install "pandas[html]"
print(f"pandas version: {pd.__version__}")

Date: 2025-12-04 14:50:11.160354
Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
pandas version: 2.2.2


In [2]:
# url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
# url = "https://www.ss.com/en/real-estate/flats/riga/centre/sell/"
# url = "https://www.ss.com/en/real-estate/flats/riga/centre/hand_over/" # hand_over meaning renting
url = "https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/"
# notice how we can specify some parameters in the url
print(f"URL: {url}")

URL: https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/
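
Since the language, district and deal type all appear as path segments in the URLs above, the address can be assembled from parts. A minimal sketch, assuming the pattern seen in those URLs holds for other districts; `listing_url` is a hypothetical helper, not part of pandas or ss.com.

```python
BASE = "https://www.ss.com"

def listing_url(district: str, deal: str = "sell", lang: str = "en") -> str:
    """Build a flats listing URL for a Riga district.

    The path pattern is inferred from the example URLs in this notebook.
    """
    return f"{BASE}/{lang}/real-estate/flats/riga/{district}/{deal}/"

print(listing_url("agenskalns"))
print(listing_url("centre", deal="hand_over"))  # hand_over meaning renting
```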


In [3]:
# pandas can do some web scraping as long as there are tables
dfs = pd.read_html(url, header=0) # I know that I want the first row to serve as column names
# the line above opened a connection to the URL (just like a browser), made an HTTP(S) request and parsed the response
print(f"We got {len(dfs)} dataframes contained in a {type(dfs)}")
# turns out we do not get just one dataframe but a list of dataframes
# there could be 0 or more dataframes in the list

We got 6 dataframes contained in a <class 'list'>


In [4]:
len(dfs) # why a list, and why this length?
# well ss.com was made in early 2000s when tables were used for everything

6

In [None]:
# mdn on table tag
# https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
# by default pandas reads ALL tables on the page
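
Because every table on the page is returned, it can help to ask only for tables that contain some known text. The sketch below uses the `match` argument of `read_html` on a small made-up HTML snippet (the snippet is an assumption for illustration, not the real ss.com markup), so it runs without a network connection.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page: one layout table and one listings table
html = """
<table><tr><td>Menu</td><td>Login</td></tr></table>
<table>
  <tr><th>Street</th><th>R.</th><th>Price</th></tr>
  <tr><td>Kristapa 2</td><td>3</td><td>240,000 €</td></tr>
  <tr><td>Baldones 28</td><td>2</td><td>68,888 €</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex
tables = pd.read_html(StringIO(html), match="Street", header=0)
listings = tables[0]
print(f"Tables kept: {len(tables)}")
print(listings)
```

On the real page the same idea would be `pd.read_html(url, match="Street", header=0)`, which may be less fragile than remembering a position in the list.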


In [5]:
df = dfs[4] # the 5th table on our page has our info
# df is the standard name for a dataframe in pandas, could use anything else
df.head()

Unnamed: 0,Advertisements \tdate,Advertisements \tdate.1,Advertisements \tdate.2,Street,R.,m²,Floor,Series,"Price, m2",Price
0,,,Iespēja iegādāties dzīvokli unikālā ēkā vēstur...,Kristapa 2,3,81,1/2,Recon.,"2,963 €","240,000 €"
1,,,"Pārdod 2-istabu dzīvokli Āgenskalnā 40 m², izc...",Baldones 28,2,40,4/5,Lit pr.,"1,722 €","68,888 €"
2,,,"Pārdod komfortablu, skaistu 2-istabu dzīvokli ...",Ranka d. 9,2,54,2/6,Recon.,"2,315 €","125,000 €"
3,,,"Pārdošanā izremontēts, gatavs dzīvošanai. 1 is...",Darba 11a,1,29,1/2,Stalin project,"1,103 €","32,000 €"
4,,,Pārdod vienistabas dzīvokli. Plaša un gaiša is...,M. Nometnu 11,1,40,1/2,Pre-war house,925 €,"37,000 €"


In [6]:
# we can see what shape it is
print(f"Shape: {df.shape}")

Shape: (30, 10)
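
The scraped table contains columns that are entirely empty, and the prices arrive as text such as `240,000 €`. A minimal cleaning sketch follows, using a small stand-in DataFrame with values copied from the output above; `price_to_number` is a hypothetical helper, and it assumes the comma is a thousands separator.

```python
import pandas as pd

# Small stand-in for the scraped table (values copied from the output above)
sample = pd.DataFrame({
    "Unnamed: 0": [None, None],
    "Street": ["Kristapa 2", "M. Nometnu 11"],
    "Price, m2": ["2,963 €", "925 €"],
    "Price": ["240,000 €", "37,000 €"],
})

def price_to_number(series: pd.Series) -> pd.Series:
    """Strip everything except digits and dots, then convert to numbers."""
    digits = series.astype(str).str.replace(r"[^\d.]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")

# drop columns where every value is missing, then convert the price columns
clean = sample.dropna(axis="columns", how="all").copy()
clean["Price"] = price_to_number(clean["Price"])
clean["Price, m2"] = price_to_number(clean["Price, m2"])
print(clean.dtypes)
print(clean)
```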


In [None]:
df

Unnamed: 0,Announcements \tdate,Announcements \tdate.1,Announcements \tdate.2,Street,R.,m2,Floor,Series,"Price, m2",Price
0,,,Izīrē dzīvokli ar mājīgu un komfortablu iekārt...,Stabu 49,2,50,4/5,Perewar,6 €,300 €/mon.
1,,,Tiek izīrēts lielisks 2 istabu dzīvoklis ar sk...,Merkela 17,2,49,4/5,Perewar,11.22 €,550 €/mon.
2,,,"Īpašnieks izīrē , tiko pēc remonta , ar sadzīv...",Deglava 9,1,26,1/6,Recon.,13.42 €,349 €/mon.
3,,,Владелец сдает новую 2-уровневую квартиру (гос...,Brivibas 101,3,73,1/5,Perewar,6.85 €,500 €/mon.
4,,,Доступна уже сегодня. Владелец сдает новую ква...,Lachplesha 121,2,43,2/6,New,8.95 €,385 €/mon.
5,,,Plašs trīsistabu dzīvoklis ar balkonu Vēstniec...,Vilandes 10,3,106,4/5,Perewar,8 €,848 €/mon.
6,,,"Izīrē 1 istabu dzīvokli jaunā mājā, centra tuv...",Sadovnikova 39,1,25,2/7,Recon.,10.80 €,270 €/mon.
7,,,Ilgtermiņa īrei pieejams mājīgs divu istabu dz...,Sporta 7,2,50,1/4,Perewar,7.90 €,395 €/mon.
8,,,Tiek izīrēts 2-istabu dzīvoklis J. Daliņa ielā...,J. Dalina 8,2,64,12/24,New,10.94 €,700 €/mon.
9,,,Īzīrē 1-istabu dzīvokli centra (34 kv. m. ) ar...,Tomsona 24,1,34,3/5,Spec. pr.,9.71 €,330 €/mon.


In [7]:
# you can export to json, csv, excel, sql, etc
# df.to_json("center.json")
df.to_json("agenskalns.json", orient="records") # orient="records" writes one object per row, without row numbers

In [8]:
# save to csv
# df.to_csv("center.csv")
df.to_csv("agenskalns.csv", index=False)
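
The ads contain Latvian letters, and Excel does not always detect plain UTF-8 when opening a CSV file. One common workaround is to write a byte order mark with `encoding="utf-8-sig"`. The sketch below uses a one-row sample copied from the table above and a temporary file name chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# One-row sample with Latvian diacritics (text taken from the table above)
sample = pd.DataFrame({
    "Advertisement": ["Pārdod vienistabas dzīvokli."],
    "Street": ["M. Nometnu 11"],
})

path = os.path.join(tempfile.gettempdir(), "agenskalns_excel_friendly.csv")
sample.to_csv(path, index=False, encoding="utf-8-sig")  # utf-8-sig adds a BOM

# the file now starts with the three BOM bytes that Excel looks for
with open(path, "rb") as fh:
    first_bytes = fh.read(3)
print(first_bytes)

# reading it back with the same encoding restores the original text
round_trip = pd.read_csv(path, encoding="utf-8-sig")
print(round_trip)
```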

In [None]:
# we can save to excel as well
# this assumes an Excel writer such as openpyxl is available, e.g. via pip install "pandas[excel]" or pip install "pandas[all]"
# df.to_excel("center.xlsx")
df.to_excel("agenskalns.xlsx", index=False)

In [None]:
# so pandas read_html is great for scraping tables and it is very easy to use
# the restrictions are that it only reads <table> elements and, by default, returns only the text inside each cell

# in our case we also want the link to each flat, which the default call does not give us
# for that kind of work we will use beautiful soup or some other library
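
For simple cases, pandas 1.5 and later also accept an `extract_links` argument in `read_html`, which returns each cell as a `(text, href)` tuple. The sketch below runs on a made-up HTML snippet; the `href` values and the column name `Ad` are assumptions for illustration and do not reflect the real ss.com markup. Beautiful Soup remains the more flexible tool when the page structure is messier.

```python
from io import StringIO
from urllib.parse import urljoin

import pandas as pd

# Hypothetical table with one link per row (not the real ss.com HTML)
html = """
<table>
  <tr><th>Ad</th><th>Street</th></tr>
  <tr><td><a href="/msg/en/example-1.html">Flat one</a></td><td>Kristapa 2</td></tr>
  <tr><td><a href="/msg/en/example-2.html">Flat two</a></td><td>Darba 11a</td></tr>
</table>
"""

# with extract_links="body" every body cell becomes a (text, href) tuple
table = pd.read_html(StringIO(html), extract_links="body")[0]

# turn the relative href into a full address, then keep only the text in the other columns
table["link"] = table["Ad"].map(lambda cell: urljoin("https://www.ss.com", cell[1]))
for column in ["Ad", "Street"]:
    table[column] = table[column].map(lambda cell: cell[0])

print(table)
```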

In [10]:
# url2 = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/page2.html"
url2 = "https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/page2.html"
print(f"URL: {url2}")

URL: https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/page2.html


In [11]:
dflist = pd.read_html(url2, header=0)
len(dflist)

6

In [12]:
dflist[4].head() # getting start of 5th table from our 2nd html page

Unnamed: 0,Advertisements \tdate,Advertisements \tdate.1,Advertisements \tdate.2,Street,R.,m²,Floor,Series,"Price, m2",Price
0,,,Tiek pārdots mūsdienīgs divīstabu dzīvoklis re...,Ranka d. 31,2,44,2/4,Recon.,"3,023 €","133,000 €"
1,,,Tiek pārdods gatavs bizness. Pavisam jauns dzī...,Ranka d. 31,1,37,2/4,Recon.,"2,700 €","99,900 €"
2,,,Attīstītājs pārdod 3 istabu dzīvokli Nr 41 eks...,Ranka d. 31,3,81,2/4,Recon.,"1,496 €","121,176 €"
3,,,Pārdod plašu 3 istabu dzīvokli A. Grīna bulvār...,Grina boul. 1,3,76,4/5,Stalin project,"1,711 €","130,000 €"
4,,,Pārdod gaumīgu dzīvokli labā stāvoklī ar 2 izo...,Valguma 20,2,54,3/4,Stalin project,"1,741 €","94,000 €"


In [13]:
df_from_page2 = dflist[4]
# check head
df_from_page2.head()

Unnamed: 0,Advertisements \tdate,Advertisements \tdate.1,Advertisements \tdate.2,Street,R.,m²,Floor,Series,"Price, m2",Price
0,,,Tiek pārdots mūsdienīgs divīstabu dzīvoklis re...,Ranka d. 31,2,44,2/4,Recon.,"3,023 €","133,000 €"
1,,,Tiek pārdods gatavs bizness. Pavisam jauns dzī...,Ranka d. 31,1,37,2/4,Recon.,"2,700 €","99,900 €"
2,,,Attīstītājs pārdod 3 istabu dzīvokli Nr 41 eks...,Ranka d. 31,3,81,2/4,Recon.,"1,496 €","121,176 €"
3,,,Pārdod plašu 3 istabu dzīvokli A. Grīna bulvār...,Grina boul. 1,3,76,4/5,Stalin project,"1,711 €","130,000 €"
4,,,Pārdod gaumīgu dzīvokli labā stāvoklī ar 2 izo...,Valguma 20,2,54,3/4,Stalin project,"1,741 €","94,000 €"


In [14]:
# bigdf = pd.concat([df, dflist[4]]) # concat creates a new dataframe from an iterable! of dataframes
bigdf = pd.concat([df, df_from_page2]) # concat creates a new dataframe from an iterable! of dataframes
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
# print shape
print(f"Shape: {bigdf.shape}")
bigdf.head()

Shape: (60, 10)


Unnamed: 0,Advertisements \tdate,Advertisements \tdate.1,Advertisements \tdate.2,Street,R.,m²,Floor,Series,"Price, m2",Price
0,,,Iespēja iegādāties dzīvokli unikālā ēkā vēstur...,Kristapa 2,3,81,1/2,Recon.,"2,963 €","240,000 €"
1,,,"Pārdod 2-istabu dzīvokli Āgenskalnā 40 m², izc...",Baldones 28,2,40,4/5,Lit pr.,"1,722 €","68,888 €"
2,,,"Pārdod komfortablu, skaistu 2-istabu dzīvokli ...",Ranka d. 9,2,54,2/6,Recon.,"2,315 €","125,000 €"
3,,,"Pārdošanā izremontēts, gatavs dzīvošanai. 1 is...",Darba 11a,1,29,1/2,Stalin project,"1,103 €","32,000 €"
4,,,Pārdod vienistabas dzīvokli. Plaša un gaiša is...,M. Nometnu 11,1,40,1/2,Pre-war house,925 €,"37,000 €"


In [15]:
# now let's save the bigger dataframe to excel
bigdf.to_excel("agenskalns_big.xlsx", index=False)

In [16]:
# if you are on google colab you can also download the file automatically
from google.colab import files
files.download("agenskalns_big.xlsx")


In [None]:
df.shape, dflist[4].shape, bigdf.shape
# so we see that we were successful in combining the two dataframes
# we can save the bigdf again to csv, json, sql, excel, etc

((30, 10), (30, 10), (60, 10))

In [None]:
# so if we know the last page of our search we can loop through all pages
# however that can change from day to day
# better would be to use some automation to find the last page
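
When the last page number is known, the loop could look like the sketch below. `scrape_pages` is a hypothetical helper; it assumes that page 1 is the base URL, that later pages follow the `pageN.html` pattern seen above, and that the listings are always the 5th table. The `read` parameter exists only so the function can be tried with a fake reader instead of the network.

```python
import time

import pandas as pd

def scrape_pages(base_url, last_page, table_index=4, read=pd.read_html, pause=1.0):
    """Collect one table from pages 1..last_page and stack them into one DataFrame."""
    frames = []
    for page in range(1, last_page + 1):
        # page 1 is the base URL itself, later pages are page2.html, page3.html, ...
        url = base_url if page == 1 else f"{base_url}page{page}.html"
        tables = read(url, header=0)
        frames.append(tables[table_index])
        time.sleep(pause)  # be polite to the server between requests
    return pd.concat(frames, ignore_index=True)

# Example call (needs network access, so it is left commented out):
# bigdf = scrape_pages("https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/", last_page=2)
```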


In [None]:
# Challenge: how to automatically get all pages, no matter how many ads there are?
# and how to scrape the web address of each ad (in case we want to look into the ad in more depth)?

# for this we need to use beautiful soup to parse the html and extract the links needed
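
Until the Beautiful Soup notebook, one rough pandas-only approach to the first challenge is to keep requesting pages until the site stops returning new data. This is only a sketch under an unverified assumption: that a page number past the end either fails to load, contains no tables, or shows the first page again. `scrape_until_repeat` is a hypothetical helper, and `max_pages` is a safety limit chosen arbitrarily.

```python
import pandas as pd

def scrape_until_repeat(base_url, table_index=4, read=pd.read_html, max_pages=50):
    """Fetch pages one by one and stop when a page fails or repeats the first page."""
    frames = []
    for page in range(1, max_pages + 1):
        url = base_url if page == 1 else f"{base_url}page{page}.html"
        try:
            table = read(url, header=0)[table_index]
        except (ValueError, IndexError, OSError):
            break  # no tables found, too few tables, or the request failed
        if frames and table.equals(frames[0]):
            break  # assumption: past the last page the site serves page 1 again
        frames.append(table)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Example call (needs network access, so it is left commented out):
# all_ads = scrape_until_repeat("https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/")
```

It is worth checking the resulting shape against the number of ads shown on the site before trusting the stop condition.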