
# Scraping and exploring missing or obfuscated data from satellite views

Based on the [list of satellite map images with missing or unclear data](https://en.m.wikipedia.org/wiki/List_of_satellite_map_images_with_missing_or_unclear_data), this notebook walks step by step through scraping data from a website, analysing its content and, based on the results, making automated batch downloads. It uses Python together with the Beautiful Soup and requests libraries. For a general introduction to the practice of scrapism, have a look at Sam Lavigne's manifesto of [scrapism](https://scrapism.lav.io/).

## Requesting a website

To download a webpage and its content, we first send a request to the server, wait for the response and store it in a Python variable. This is done with the [requests](https://pypi.org/project/requests/) library. Before using it, we need to import it into our current runtime (this needs to be done only once!).

In [None]:
import requests

Following the [quickstart tutorial](https://requests.readthedocs.io/en/latest/user/quickstart/), we can fetch our Wikipedia entry right away:

In [None]:
r = requests.get('https://en.m.wikipedia.org/wiki/List_of_satellite_map_images_with_missing_or_unclear_data')
print(r)

<Response [200]>


The above code should output `<Response [200]>`, where 200 is the HTTP status code for a successful request. To get the HTML source code of the page we access the `text` property of the response. The result will be stored in a variable called `source`.

In [None]:
print(r.text)
source = r.text

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of satellite map images with missing or unclear data - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"07c0f613-551c-4a20-b1f7-ca3452a22ece","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_satellite_map_images_with_missing_or_unclear_data","wgTitle":"List of satellite map images with missing or unclear data","wgCurRevisionId":1135011453,"wgRevisionId":1135011453,"wgArticleId":10615708,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgPageContentLanguage":"en","wgPageC
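
The request above assumes that the server answers promptly and with status 200. As a minimal sketch (not part of the original notebook), the fetch can be wrapped in a small helper that sets a timeout and raises an exception on HTTP error codes. The function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

URL = (
    "https://en.m.wikipedia.org/wiki/"
    "List_of_satellite_map_images_with_missing_or_unclear_data"
)


def fetch_html(url, timeout=10):
    """Return the HTML source of `url` as a string.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `source = fetch_html(URL)` would then give the same string as `r.text`, or fail loudly instead of quietly handing back an error page.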

## BeautifulSoup 

The raw HTML is hard to read, and parsing it by hand is tedious and time-consuming. This is where [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) comes into play. The library is designed for extracting data from web pages. It provides elegant ways of navigating, searching, and modifying the parse tree of HTML and XML files, and it commonly saves programmers hours or days of work. So, let's import this library:

In [None]:
from bs4 import BeautifulSoup

We load our source code into Beautiful Soup, which creates a Python object that can be navigated, instead of a plain text string.

In [None]:
soup = BeautifulSoup(source, 'html.parser')
print(soup)

While the printed output of the new `soup` variable looks much the same as `source`, the major difference is that it is a Python object with methods for accessing the HTML structure. Let's say we are only interested in the hyperlinks that are present on the page:

In [None]:
soup.find_all("a")
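
To see what `find_all` returns without scrolling through the whole page, here is a small sketch on a hand-written HTML snippet. The snippet and the names `toy_html`, `toy_soup` and `toy_anchors` are made up for illustration; only `BeautifulSoup`, `find_all`, `get_text` and `get` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, not taken from the Wikipedia article.
toy_html = """
<html><body>
  <h1>Toy page</h1>
  <p>See <a href="/wiki/Map">maps</a> and
     <a class="external" href="https://example.org/">an external site</a>.</p>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")
toy_anchors = toy_soup.find_all("a")

# find_all returns a list-like collection of Tag objects, not plain strings.
for anchor in toy_anchors:
    print(anchor.name, anchor.get_text(), anchor.get("href"))
```

Each entry is a tag object, so its text and attributes can be queried individually.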

Let's narrow this search to hyperlinks that also have a CSS class called `external`. The result is a list of matching elements, which we store in a variable called `links`.


In [None]:
# Keep only <a> tags that carry the CSS class "external".
links = soup.find_all("a", {"class": "external"})
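
Beautiful Soup offers more than one way to express the same class filter. The sketch below runs three variants on a made-up snippet (the HTML and the `demo_` and `by_` names are illustrative assumptions) so that their results can be compared.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one plain and one "external" link.
demo_html = (
    '<p><a href="/wiki/Map">internal</a> '
    '<a class="external text" href="https://example.org/">external</a></p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# 1. Attribute dictionary, as used in the notebook.
by_dict = demo_soup.find_all("a", {"class": "external"})
# 2. The class_ keyword (class itself is a reserved word in Python).
by_keyword = demo_soup.find_all("a", class_="external")
# 3. A CSS selector.
by_selector = demo_soup.select("a.external")

print(len(by_dict), len(by_keyword), len(by_selector))
```

Which form to use is mostly a matter of taste; the CSS selector tends to be handy when the filter also involves nesting.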

In [None]:
print(len(links))

210


In [None]:
links

At the time of writing there are 210 hyperlinks with the class `external` on this page. Let's look at their target addresses. Since `links` behaves like a list, we can iterate over it with a for loop:

In [None]:
for link in links:
  print(link.get('href'))
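
Printing is useful for inspection, but for later processing it is usually more convenient to collect the addresses in a list. The following sketch does this on a made-up snippet; the base URL, the HTML and the variable names are assumptions for illustration. `urljoin` from the standard library turns relative or protocol-relative addresses into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base address and snippet, chosen only for this example.
base_url = "https://en.m.wikipedia.org/wiki/Example"
sample_html = (
    '<a class="external" href="//geohack.example/?p=1">A</a>'
    '<a class="external" href="https://example.org/b">B</a>'
    '<a class="external">no href</a>'
)
sample_links = BeautifulSoup(sample_html, "html.parser").find_all(
    "a", {"class": "external"}
)

# link.get("href") returns None when the attribute is missing, so skip those.
hrefs = [
    urljoin(base_url, link.get("href"))
    for link in sample_links
    if link.get("href") is not None
]
print(hrefs)
```

The same list comprehension could be applied to the `links` variable from the notebook, with the page address as the base URL.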

Digging further into the source code, it becomes clear that the geo-coordinates sit inside a `span` element with the class `geo-dms`. To filter our soup we can thus write: