# Web Scraping

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites. Scraping software can access the World Wide Web directly over HTTP or through a web browser.

![Robots](https://upload.wikimedia.org/wikipedia/commons/6/63/Web_Robots_Logo.png)

## How the scraping process works
* request the URL (uniform resource locator / web address)
* parse the response text
* find required HTML elements and their content
* save the results in the format you need
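
The four steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended production setup: the names `fetch`, `LinkCollector`, `links_to_csv`, and `SAMPLE_HTML` are made up for this example, and the demo parses an inline HTML string instead of calling `fetch` so it runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Step 1: request the URL and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Steps 2-3: parse the HTML and collect (text, href) for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_csv(links) -> str:
    """Step 4: serialize the results as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "href"])
    writer.writerows(links)
    return buffer.getvalue()


# Offline demo: SAMPLE_HTML stands in for the result of fetch(url).
SAMPLE_HTML = (
    '<ul><li><a href="/flat/1">Flat one</a></li>'
    '<li><a href="/flat/2">Flat two</a></li></ul>'
)
collector = LinkCollector()
collector.feed(SAMPLE_HTML)
csv_text = links_to_csv(collector.links)
print(csv_text)
```

In practice a dedicated library (pandas, BeautifulSoup, Scrapy) usually replaces the hand-written parser, but the sequence of steps stays the same.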

## Scraping responsibly
* Check for other ways of accessing data (JSON API, CSV download, etc.).
* Play nice: do not overload sites with requests.
* Access publicly available data only.
* Respect robots.txt.
* Remember the difference between scraping something and how you use it (publishing, monetizing, etc.).
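
Two of the points above, respecting robots.txt and not overloading a site, can be checked in code. The sketch below uses the standard library's `urllib.robotparser`; the bot name, the sample rules, and the `example.com` URLs are placeholders invented for the demo, and the rules are parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical bot name

# Sample rules for the offline demo. Against a live site you would call
# parser.set_url(<robots.txt URL>) followed by parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed_urls(urls):
    """Keep only the URLs that the robots.txt rules allow for USER_AGENT."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def polite_delay(default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, otherwise a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


candidates = [
    "https://example.com/listings/",
    "https://example.com/private/admin",
]
to_fetch = allowed_urls(candidates)
pause = polite_delay()

for url in to_fetch:
    # With real requests, call time.sleep(pause) between fetches.
    print(f"Would fetch {url}, then wait {pause} s")
```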

### Legality note
As of September 2019, US case law (hiQ v. LinkedIn, Ninth Circuit) points toward scraping publicly available data being legal, provided you do not disrupt the normal operation of the website. See https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data.


## HTML and HTTP fundamentals
* https://developer.mozilla.org/en-US/docs/Web/HTML
* https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started
* https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview
* Chrome DevTools: https://developers.google.com/web/tools/chrome-devtools/open

Understanding both is essential for scraping. HTML tells you which tags and attributes hold the data you want. HTTP defines how you request pages and interpret status codes. Conventions such as robots.txt, throttling, and pagination are not part of HTTP itself, but you apply them to the requests you send.
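
As a small illustration of the HTTP side, the sketch below builds a request object without sending it and maps a few status codes to a scraping decision. The URL and bot name are placeholders, and `next_action` is one possible policy made up for this example, not a rule defined by HTTP.

```python
from http import HTTPStatus
from urllib.request import Request

# Build (but do not send) a GET request with an explicit User-Agent header.
request = Request(
    "https://example.com/listings/",            # placeholder URL
    headers={"User-Agent": "example-scraper"},  # hypothetical bot name
)
print(request.get_method(), request.full_url)


def next_action(status_code: int) -> str:
    """Map an HTTP status code to a coarse scraping decision."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code == HTTPStatus.TOO_MANY_REQUESTS or status_code >= 500:
        return "wait and retry"
    if 300 <= status_code < 400:
        return "follow redirect"
    return "skip"


for code in (200, 301, 404, 429, 503):
    print(code, HTTPStatus(code).phrase, "->", next_action(code))
```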

## Scraping tables with pandas
pandas, the data analysis library, has basic web scraping capabilities via `pandas.read_html`. It works when a page exposes data in `<table>` elements and returns each table as a dataframe. It needs an HTML parser such as lxml to be installed. If the data is not in tabular form, use BeautifulSoup or Scrapy instead (see the separate BeautifulSoup notebook).


In [None]:
from datetime import datetime
import sys
import pandas as pd

print(f"Date: {datetime.now()}")
print(f"Python version: {sys.version}")
print(f"pandas version: {pd.__version__}")


## Choose a target URL
The example below scrapes flat listings. Notice how the path segments of the URL describe the city, the neighbourhood, and the deal type (`sell` or `hand_over` for renting). Adjust the URL to match the data you want.


In [None]:
# url = "https://www.ss.com/en/real-estate/flats/riga/centre/sell/"
# url = "https://www.ss.com/en/real-estate/flats/riga/centre/hand_over/"  # renting
url = "https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/"
print(f"URL: {url}")
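
Because the target is encoded in the URL path, the URLs can be generated instead of typed by hand. The helper below is a hypothetical sketch: `listing_url` and `BASE` are names invented here, and the `pageN.html` suffix is inferred from the second-page URL used later in this notebook, so other sections of the site may follow a different pattern.

```python
BASE = "https://www.ss.com/en/real-estate/flats/riga"


def listing_url(area: str, deal: str = "sell", page: int = 1) -> str:
    """Build a listing URL from path segments; pages after the first use pageN.html."""
    url = f"{BASE}/{area}/{deal}/"
    if page > 1:
        url += f"page{page}.html"
    return url


print(listing_url("agenskalns"))
print(listing_url("centre", deal="hand_over"))
print(listing_url("agenskalns", page=2))
```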


## Read HTML tables into dataframes
`pandas.read_html` downloads the page, parses every HTML table it finds, and returns a list of dataframes. Some sites (like ss.com) use tables for layout, so expect more than one table and pick the one that holds the data you need.


In [None]:
dfs = pd.read_html(url, header=0)
print(f"Found {len(dfs)} tables (type: {type(dfs).__name__})")

if len(dfs) <= 4:
    raise ValueError("Expected at least 5 tables on the page to access index 4.")

df = dfs[4]  # the 5th table on the page holds the listings
print(df.head())
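
A hard-coded index such as `dfs[4]` breaks as soon as the site adds or removes a layout table. One alternative, sketched below, is to pick the table by the columns it contains. `pick_table` is a hypothetical helper, and the column names and sample rows are invented for the offline demo, so check the real headers before relying on them. `pandas.read_html` also accepts a `match` argument that keeps only tables containing the given text.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise LookupError(f"No table has all of the columns: {sorted(required)}")


# Offline demo: two layout-style tables and one data table.
# The column names and values are made up for illustration.
tables = [
    pd.DataFrame({"0": ["menu"]}),
    pd.DataFrame({"0": ["banner"]}),
    pd.DataFrame({"Street": ["Example iela 1"], "Rooms": [2], "Price": [50000]}),
]
listings = pick_table(tables, ["Street", "Price"])
print(listings)
```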


## Inspect the scraped table
Always check the shape and a sample of the data before exporting.


In [None]:
print(f"Shape: {df.shape}")
df.head()


## Save the results
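Dataframes can be saved to many formats. Pass `index=False` to `to_csv` and `to_excel` when you do not want the dataframe index written out. `to_json` with `orient="records"` never writes the index, so it needs no `index` argument (some pandas versions raise a `ValueError` if you pass `index=False` with this orient).


In [None]:
df.to_json("agenskalns.json", orient="records")  # records orient omits the index
df.to_csv("agenskalns.csv", index=False)
df.to_excel("agenskalns.xlsx", index=False)  # needs an Excel engine such as openpyxl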
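
To see what these options produce without writing files, call the same methods with no path: they then return the output as a string. The small `sample` dataframe below is invented for the demonstration.

```python
import json

import pandas as pd

sample = pd.DataFrame({"street": ["A", "B"], "price": [100, 200]})

# orient="records" gives one JSON object per row, without the index.
records_json = sample.to_json(orient="records")
print(json.loads(records_json))

# index=False drops the leading index column from the CSV.
csv_without_index = sample.to_csv(index=False)
csv_with_index = sample.to_csv()
print(csv_without_index)
print(csv_with_index)
```

The CSV written with the default `index=True` starts with an unnamed extra column holding 0, 1, and so on, which is rarely wanted for scraped data.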


## Combine multiple pages
You can scrape additional pages (e.g., pagination) and concatenate them. Here we fetch a second page, grab its listings table, and stack the data.


In [None]:
url2 = "https://www.ss.com/en/real-estate/flats/riga/agenskalns/sell/page2.html"
print(f"URL: {url2}")

dfs_page2 = pd.read_html(url2, header=0)
if len(dfs_page2) <= 4:
    raise ValueError("Expected at least 5 tables on the second page to access index 4.")

df_page2 = dfs_page2[4]
combined = pd.concat([df, df_page2], ignore_index=True)
print(f"Combined shape: {combined.shape}")
combined.head()


## Save combined data
Export the merged dataframe for downstream analysis.


In [None]:
combined.to_excel("agenskalns_big.xlsx", index=False)
combined.to_csv("agenskalns_big.csv", index=False)


## Where to go next
* Loop through pages until you detect the last one automatically.
* Extract detail-page links from each row with BeautifulSoup to enrich the dataset.
* Respect the target site's terms of service and throttle requests.
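
The first and last suggestions can be combined into one loop. The sketch below is a starting point under stated assumptions: `scrape_all_pages` and its parameters are hypothetical, and it assumes the last page can be detected because the next request fails, returns an empty table, or repeats a page already seen. How a particular site actually behaves past its last page needs to be checked by hand. The demo uses a fake fetch function so it runs offline.

```python
import time

import pandas as pd


def scrape_all_pages(fetch_page, max_pages=50, delay=1.0):
    """Collect pages 1, 2, ... until a page fails, is empty, or repeats an earlier one."""
    frames = []
    for page in range(1, max_pages + 1):
        try:
            frame = fetch_page(page)
        except Exception:
            break  # e.g. an HTTP error or no tables found on the page
        if frame.empty or any(frame.equals(seen) for seen in frames):
            break
        frames.append(frame)
        time.sleep(delay)  # throttle between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Offline demo with made-up data.
FAKE_PAGES = {
    1: pd.DataFrame({"Street": ["A", "B"], "Price": [100, 200]}),
    2: pd.DataFrame({"Street": ["C"], "Price": [300]}),
}


def fake_fetch(page):
    # Pretend that requesting a page past the end returns page 1 again.
    return FAKE_PAGES.get(page, FAKE_PAGES[1])


all_listings = scrape_all_pages(fake_fetch, delay=0)
print(all_listings)
```

Against the live site, `fetch_page` could wrap `pd.read_html` for the page's URL and return the listings table, as in the cells above.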
