<a href="https://colab.research.google.com/github/anshupandey/Python-Programming/blob/main/Data_retrieval_access_web_resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---

<CENTER>
<H1> <font color="red" size="+3">
    Accessing Web Resources with Python</font>
</H1>
</CENTER>

## <font color='red'>Reference Documents</font>

* <a href="http://zetcode.com/python/requests/">Python Requests Tutorial</a>
* <a href="https://realpython.com/python-requests/">Python’s Requests Library (Guide)</a>
* <a href="https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/">What is web scraping</a>
* <a href="https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184">Building a Web Scraper from start to finish</a>
* <a href="https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/">Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup</a>
* <a href="https://www.dataquest.io/blog/web-scraping-tutorial-python/">Tutorial: Web Scraping with Python Using Beautiful Soup</a>
* <a href="https://realpython.com/beautiful-soup-web-scraper-python/">Beautiful Soup: Build a Web Scraper With Python</a>
* <a href="https://stackabuse.com/download-files-with-python/">Download Files with Python</a>

## <font color='red'>What will be Covered?</font>
+ Accessing Web Pages with `requests`
+ Introduction to JSON
+ Web Scraping with JSON
+ Web Scraping with Beautiful Soup

![fig_scrap](https://miro.medium.com/max/1400/1*4BnBQE9Bu-EQ-gGz25x8pg.png)


## <font color='red'>Components of a Web Page</font>

A web page typically has the following components:

- [HTML (HyperText Markup Language)](https://en.wikipedia.org/wiki/HTML): The main content of the page.
   - Is the language that web pages are created in.
   - It’s a markup language that tells a browser how to display content.
- [CSS (Cascading Style Sheets)](https://skillcrush.com/blog/css/): Used to add styling to make the page look nicer and more modern.
- [JS (Javascript)](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/What_is_JavaScript): Adds interactivity to web pages.
- Images: image formats, such as JPG and PNG, allow web pages to show pictures.


<p>
<P>
    
Below is the content of a [sample](https://www.freecodecamp.org/news/html-css-and-javascript-explained-for-beginners/) web page:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <link rel="stylesheet" href="./styles.css">
  <title>Document</title>
</head>
<body>
  <h1>This is a first level heading in HTML. With CSS, I will turn this into red color</h1>
  <h2>This is a second level heading in HTML. With CSS, I will turn this into blue color</h2>
  <h3>This is a third level heading in HTML. With CSS, I will turn this into green color</h3>
  <p>This is a <em>paragraph</em>. As you can see, I placed an emphasis on the word "paragraph". Now, I will also change
    the background color of the word "paragraph" to black, and its text color to green, all with just CSS.</p>
  <p>The main essence of this tutorial is to:</p>
    <ul>
       <li>Show you how to format a web document with HTML</li>
       <li>Show you how to design a web page with CSS</li>
       <li>Show you how to program a web document with JavaScript</li>
    </ul>

  <p>Next, I am going to add the following two numbers and display the result, all with JavaScript</p>
    <p>First number:<span id= "firstNum">2</span> <br></p>
    <p>Second number: <span id= "secondNum">7</span> </p>
    <p>Therefore, the sum of the two of those numbers is: <span id= "answer">(placeholder for the answer)</span></p>
    <input type="button" id="sumButton" value="Click to add!">
</body>
</html>
```

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from skimage import io

In [None]:
import os
import pprint

# <font color='red'>Python `requests` Module</font>

* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library.
* It provides methods for accessing Web resources via HTTP.
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).
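* Requests is a third-party package, not part of the Python standard library; install it with `pip install requests` if it is not already available.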

In [None]:
import requests as reqs

print(reqs.__version__)
print(reqs.__copyright__)

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
resp = reqs.get("http://www.webcode.me")

We can get all information from the `resp` object:

In [None]:
print(resp.text)

We can use the module `re` to strip all the HTML markup from the content.

In [None]:
import re

resp = reqs.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)
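
The regular expression above is a quick approximation: it removes the tags but keeps the contents of `<script>` and `<style>` elements. As a hedged alternative, here is a minimal sketch that uses the standard-library `html.parser` module instead. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML are illustrative names and data, not part of the original notebook, and the string is hard-coded so that the example runs without network access.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


# Sample page (assumed data); in the notebook this would be resp.text.
sample_html = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Heading</h1><p>Some <em>emphasized</em> text.</p></body></html>
"""
print(html_to_text(sample_html))
```

Because the helper joins text pieces with spaces, it can insert a space where an inline tag splits a word; for real scraping work, the Beautiful Soup approach shown later in this notebook is usually more convenient.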

- When you issue a request, `Requests` makes educated guesses about the encoding of the response based on the HTTP headers.
- The text encoding guessed by `Requests` is used when you access `resp.text`.

In [None]:
resp.encoding

- You can find out what encoding `Requests` is using, and change it.
- If you change the encoding, `Requests` will use the new value of `resp.encoding` whenever you call `resp.text`.
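
To see why the encoding matters, the following sketch decodes the same bytes with two different encodings. It uses plain Python only; the byte string is an assumed sample that stands in for the raw body of a response (`resp.content`).

```python
# Bytes as a server might send them (assumed sample standing in for resp.content).
raw = "café".encode("utf-8")

right = raw.decode("utf-8")       # the encoding the bytes were written in
wrong = raw.decode("iso-8859-1")  # a wrong guess garbles non-ASCII characters

print(right)
print(wrong)
```

If `resp.text` shows garbled characters like the second line, setting `resp.encoding` to the correct value before reading `resp.text` is the usual remedy.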

In [None]:
resp.encoding = 'utf-8'

In [None]:
resp.encoding

**Getting the Status of a Web Page**
- We perform an HTTP request with the `get()` method and check for the returned status code.
- A status code informs you of the status of the request: whether the request was successful or not.
- `200` is a standard response for a successful HTTP request and `404` tells that the requested resource could not be found.
- By accessing `.status_code`, you can see the status code that the server returned.
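
As a small offline illustration of what the numbers mean, the sketch below looks up the standard reason phrase for a few status codes with the standard-library `http.HTTPStatus` enum. The `describe_status` helper and the `CATEGORIES` table are hypothetical names introduced for this example.

```python
from http import HTTPStatus

# The first digit of a status code identifies its general category.
CATEGORIES = {1: "informational", 2: "success", 3: "redirection",
              4: "client error", 5: "server error"}


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a registered status code
    category = CATEGORIES.get(code // 100, "non-standard")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

The same helper could be applied to `resp.status_code` after any of the requests in this notebook.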

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
resp = reqs.get("http://www.webcode.me/news")
print(resp.status_code)

#### Function for checking if a website is accessible

In [None]:
def access_website(url):
    resp = None  # stays None if the request fails before a response arrives
    try:
        resp = reqs.get(url, timeout=3)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except reqs.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except reqs.exceptions.RequestException as err:
        print(f"General Error: {err}")
    else:
        print("Successfully accessed the site!")

    return resp

In [None]:
def access_website(url):
    try:
        resp = reqs.get(url, timeout=3)
        resp.raise_for_status()
        return resp
    except reqs.exceptions.HTTPError as errh:
        return f"Http Error: {errh}"
    except reqs.exceptions.ConnectionError as errc:
        return f"Error Connecting: {errc}"
    except reqs.exceptions.Timeout as errt:
        return f"Timeout Error: {errt}"
    except reqs.exceptions.RequestException as err:
        return f"General Error: {err}"

In [None]:
url = "http://www.webcode.me/news"
resp = access_website(url)
print(f"Type of returned value: {type(resp)}")
print(resp)

In [None]:
url = "http://www.webcode.me"
resp = access_website(url)
print(f"Type of returned value: {type(resp)}")

In [None]:
if not isinstance(resp, str):
    print(f"Status Code: {resp.status_code}")

**Other Information**

In [None]:
resp = reqs.get("http://www.webcode.me")

print("\t URL:      {}".format(resp.url))
print("\t Encoding: {}".format(resp.encoding))
print("\t Time:     {}".format(resp.elapsed))

**`requests` `head()` Method**
- The `head()` method retrieves document headers.
- The headers consist of fields, including date, server, content type, or last modification time.

In [None]:
resp = reqs.head("http://www.webcode.me")

print("Server:         {}".format(resp.headers['server']))
print("CONNECTION:     {}".format(resp.headers['CONNECTION']))
print("Date:           {}".format(resp.headers['Date']))
print("Last modified:  {}".format(resp.headers['last-modified']))
print("Content type:   {}".format(resp.headers['content-type']))

**`requests` `get()` Method**
- The `get()` method issues a GET request to the server.
- The GET method requests a representation of the specified resource.

```python
requests.get(url, params={key: value}, args)
```

| Parameter | Description |
| --- | :--- |
|`url` | (required) The url of the request |
| `params` | (optional) A dictionary, list of tuples or bytes to send as a query string. |
| `allow_redirects` | (optional) A Boolean to enable/disable redirection. |
| `auth` | (optional) A tuple to enable a certain HTTP authentication. |
| `cert` | (optional) A String or Tuple specifying a cert file or key. |
| `cookies` | (optional) A dictionary of cookies to send to the specified url. |
| `headers` | (optional) A dictionary of HTTP headers to send to the specified url. |
| `proxies` | (optional) A dictionary of the protocol to the proxy url. |
| `stream` | (optional) A Boolean indicating whether the response should be downloaded immediately (False) or streamed (True). |
| `timeout` | (optional) A number, or a tuple, indicating how many seconds to wait to establish a connection and/or to receive a response from the server. |
| `verify` | (optional) A Boolean indicating whether to verify the server's TLS certificate, or a String with the path to a CA bundle to use. |

#### Sending Parameters in URL

- The `get()` method takes a `params` parameter where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

If:
```python
   {'key1': value1, 'key2': value2, 'key3': value3}
```
are the parameters, and `https://MyOwnWebsite.com/` is the url, then the final url to access will be:
```
    https://MyOwnWebsite.com/?key1=value1&key2=value2&key3=value3
```

The code to reach the webpage is:
```Python
payload = {'key1': value1, 'key2': value2, 'key3': value3}
resp = reqs.get("https://MyOwnWebsite.com", params=payload)
```
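
To make the encoding rules concrete, here is an offline sketch that builds the same kind of query string with the standard-library `urllib.parse` module. It does not call `requests`; it only illustrates how key-value pairs, list values, and special characters end up in a URL, and the exact string produced by `requests` may differ in small details. The string values are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://MyOwnWebsite.com/"
payload = {"key1": "value1", "key2": "value2", "key3": "value3"}

# Join the pairs with '=' and '&', then attach them after the '?'.
query = urlencode(payload, doseq=True)
full_url = f"{base_url}?{query}"
print(full_url)

# A list value is expanded into one key=value pair per item.
multi = urlencode({"name": ["Peter", "Johns"], "age": 23}, doseq=True)
print(multi)

# Characters with a special meaning in URLs are escaped.
escaped = urlencode({"q": "web scraping & python"})
print(escaped)

# Going the other way: recover the parameters from a URL.
recovered = parse_qs(urlsplit(full_url).query)
print(recovered)
```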

The following script sends a variable with a value to the `httpbin.org` server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

We send a `get()` request to the web site and pass the data, which is specified in the `params` parameter:

In [None]:
payload = {'name': 'Peter'}
resp = reqs.get("https://httpbin.org/get", params=payload)

`payload` is a dictionary of pairs of keys/values:

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)

You can also pass a list of items as a value:

In [None]:
payload = {'name': ['Peter', 'Johns'], 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

**Other Methods**

```python
requests.post('https://httpbin.org/post', data={'key':'value'})
requests.put('https://httpbin.org/put', data={'key':'value'})
requests.delete('https://httpbin.org/delete')
requests.patch('https://httpbin.org/patch', data={'key':'value'})
requests.options('https://httpbin.org/get')
```

### Summary of `requests` Methods

| Method	| Description |
| :--- | :--- |
| delete(url, args)	| Sends a DELETE request to the specified url |
| get(url, params, args)	| Sends a GET request to the specified url |
| head(url, args)	| Sends a HEAD request to the specified url |
| patch(url, data, args)	| Sends a PATCH request to the specified url |
| post(url, data, json, args)	| Sends a POST request to the specified url |
| put(url, data, args)	| Sends a PUT request to the specified url |
| request(method, url, args)	| Sends a request of the specified method to the specified url|

### Content-Type

- It is part of the HTTP header.
   - A string used to indicate the media type of the resource you want to access.  
   - It tells the browser the type of content it has to load on the machine.
- Here are some values of `content-type`:

```html
   text/html
   text/csv
   application/json
   application/javascript
   audio/ogg
   image/png
```

- While scraping a web page, it is used to determine which tool is needed to parse the content of a web page.
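
One possible way to act on this header is sketched below: split the `content-type` value into media type and charset, then pick a parsing strategy. The helpers `parse_content_type` and `choose_parser`, and the mapping they use, are assumptions made for this example; the header strings are hard-coded so that the sketch runs offline.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


def choose_parser(header_value):
    """Suggest which tool to use for a response body (illustrative mapping)."""
    media_type, _ = parse_content_type(header_value)
    if media_type == "application/json":
        return "json"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type == "text/csv":
        return "csv"
    return "unknown"


for value in ("application/json; charset=utf-8", "text/html; charset=UTF-8", "image/png"):
    print(value, "->", parse_content_type(value), choose_parser(value))
```

In the notebook, the argument would be `resp.headers['content-type']`: `"json"` points to the `json` module and `"html"` to Beautiful Soup.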

In [None]:
print(f"Content Type: \n\t {resp.headers['content-type']}")

---

# <font color='red'>Web Scraping</font>

![fig_json](https://daveberesford.co.uk/wp-content/uploads/2019/02/data-scraping-960x594.png)
Image Source: daveberesford.co.uk

> Web scraping is a mechanism for collecting large amounts of data from web pages and storing the data in any required format, which then helps us analyze the extracted data.


- Web scraping is used to extract or “scrape” data from any web page on the Internet.
- Web scraping is performed using a “**web scraper**” (or a “bot” or a “web spider” or “web crawler”).
- A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database.



Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage
   - The server responds to the request by returning the HTML content of the target webpage.
2. **Step 2**: Parse the HTML content
   - A parser is needed to create a nested structure of the HTML data.
3. **Step 3**: Pull data out of HTML
   - We use Python packages such as `json` and Beautiful Soup to pull out the data and store it.
   
![fig_scap](https://prowebscraper.com/blog/wp-content/uploads/2017/11/how_does_web_scraping_work.png)
Image Source: prowebscraper.com
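
The three steps can be sketched end to end with the standard library alone. In this hedged example, Step 1 is simulated with a hard-coded HTML string (a real scraper would obtain it from an HTTP response), and `LinkCollector` is a hypothetical helper written for the illustration; the station links are made-up sample data.

```python
import json
from html.parser import HTMLParser

# Step 1 (simulated): this string stands in for the HTML returned by the server.
html = """
<html><body>
  <h1>Stations</h1>
  <ul>
    <li><a href="/stations/bld">Boulder</a></li>
    <li><a href="/stations/bhd">Baring Head</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Step 2: walk the HTML and remember the text and target of every link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # set while we are inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current_href is not None and data.strip():
            self.links.append({"text": data.strip(), "href": self._current_href})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


collector = LinkCollector()
collector.feed(html)
collector.close()

# Step 3: convert the extracted data into a structured format that can be stored.
records_json = json.dumps(collector.links, indent=2)
print(records_json)
```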
     

Web scrapers crawl websites, extract data from them, transform the data into a usable structured format, and load it into a file or database for subsequent use.


### <font color="blue">Web Scraping Rules</font>

![fig_ethics](https://hackernoon.com/hn-images/0*MPt2rectMhwklT63.jpg)

As reference, check: <a href="https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices">The Web Scraping Best Practices Guide</a> or watch the video <a href="https://www.youtube.com/watch?v=i7DEy-ZB_Lk">Is Web Scraping Legal?</a>

- Check a website’s Terms and Conditions before you scrape it.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human).
  - One request for one webpage per second is good practice.
  - Never scrape more frequently than you need to.
  - Consider caching the content you scrape so that it’s only downloaded once.
  - Build pauses into your code using functions like `time.sleep()` to keep from overwhelming servers with too many requests too quickly.
- **<font color="red">The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.</font>**
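
The caching and pausing advice above could be combined as in the following sketch. `polite_fetch_all` is a hypothetical helper, and `fetch` is any callable that downloads one URL (for example a small wrapper around `reqs.get`); a fake fetch function is used here so that the example runs without network access.

```python
import time


def polite_fetch_all(urls, fetch, delay=1.0, cache=None):
    """Fetch each URL at most once, pausing `delay` seconds between downloads."""
    cache = {} if cache is None else cache
    made_request = False
    for url in urls:
        if url in cache:
            continue  # already downloaded: no request and no pause needed
        if made_request:
            time.sleep(delay)  # wait before contacting the server again
        cache[url] = fetch(url)
        made_request = True
    return cache


# Stand-in for a real download function; it records which URLs were requested.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html>content of {url}</html>"


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
pages = polite_fetch_all(urls, fake_fetch, delay=0.1)  # short delay for the demo
print(len(pages), "pages cached;", len(calls), "requests made")
```

Passing the returned dictionary back in as `cache` on a later call avoids downloading the same pages again.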

## <font color='blue'>Web Scraping with JSON</font>

### <font color="green"> What is JSON?</font>

* JSON (JavaScript Object Notation) is a popular data format used for representing structured data.
* It is a language-independent text format that can be used from Python, Perl, and many other languages.
* JSON format is used for data communications between servers and web applications.
* It is built on two structures:

     - A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
     - An ordered list of values. This is realized as an array, vector, list, or sequence.
     
     
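The main functions of the `json` module are:

* `dump()`: serializes a Python object to JSON and writes it to a file.
* `load()`: reads JSON from a file and decodes it into a Python object.
* `dumps()`: serializes a Python object to a JSON-formatted string.
* `loads()`: decodes a JSON string into a Python object.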
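
A minimal sketch of the four functions follows. The `station` dictionary is sample data, and an in-memory `io.StringIO` buffer stands in for a real file so that nothing is written to disk.

```python
import io
import json

station = {"acronym": "BLD", "name": "Boulder Colorado",
           "latitude": 40.0, "longitude": -105.25}

# dumps / loads work with strings.
text = json.dumps(station)
restored = json.loads(text)

# dump / load work with file objects; StringIO behaves like a text file.
buffer = io.StringIO()
json.dump(station, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(type(text).__name__)
print(restored == station and from_file == station)
```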

**Example of JSON Data**

```python
{
    "stations": [
        {
            "acronym": "BLD",
            "name": "Boulder Colorado",
            "latitude": 40.00,
            "longitude": -105.25
        },
        {
            "acronym": "BHD",
            "name": "Baring Head Wellington New Zealand",
            "latitude": -41.28,
            "longitude": 174.87
        }
    ]
}
```
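
Once decoded, such a document is navigated like any nested Python dictionary and list. The sketch below parses the station data from a string (the variable names are chosen for this illustration) and loops over the entries.

```python
import json

stations_json = """
{
    "stations": [
        {"acronym": "BLD", "name": "Boulder Colorado",
         "latitude": 40.00, "longitude": -105.25},
        {"acronym": "BHD", "name": "Baring Head Wellington New Zealand",
         "latitude": -41.28, "longitude": 174.87}
    ]
}
"""

data = json.loads(stations_json)  # JSON object -> dict, JSON array -> list

for station in data["stations"]:
    hemisphere = "N" if station["latitude"] >= 0 else "S"
    print(f'{station["acronym"]}: {station["name"]} '
          f'({abs(station["latitude"])}°{hemisphere})')

# Filter the list like any other Python list.
southern = [s["acronym"] for s in data["stations"] if s["latitude"] < 0]
print(southern)
```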

**Another Example of JSON Data**

We consider an online database, <a href="http://ip-api.com">IP-API.com</a>, that returns GeoIP data in JSON format. Simply opening <a href="http://ip-api.com/json/54.148.84.95">http://ip-api.com/json/54.148.84.95</a> will return the following JSON result:


```python
{
  "as": "AS16509 Amazon.com, Inc.",
  "city": "Boardman",
  "country": "United States",
  "countryCode": "US",
  "isp": "Amazon",
  "lat": 45.8696,
  "lon": -119.688,
  "org": "Amazon",
  "query": "54.148.84.95",
  "region": "OR",
  "regionName": "Oregon",
  "status": "success",
  "timezone": "America\/Los_Angeles",
  "zip": "97818"
}
```

To see your own Geolocation data in JSON format, just open <a href="http://ip-api.com/json/">http://ip-api.com/json/</a>.

In [None]:
import json

In [None]:
resp = reqs.request(method='GET', url="http://ip-api.com/json/")

In [None]:
print(f"Content type: \n\t {resp.headers['content-type']}")

In [None]:
print(f"Web page content: \n\t {resp.text}")

In [None]:
json_page = json.loads(resp.text)
pprint.pprint(json_page)

### <font color="green"> Serialization and Deserialization</font>

> … the process of translating data structures or object state into a format that can be stored … or transmitted … and reconstructed later (possibly in a different computer environment). (Wikipedia)

* **Serialization** is a process of transforming objects or data structures into byte streams or strings.
* These byte streams can then be stored or transferred easily.
* This allows the developers to save, for example, configuration data or user's progress, and then store it (on disk or in a database) or send it to another location.
* The reverse process of serialization is known as **deserialization**.

### Why do we need serialization?

We need Serialization for the following reasons:

- **Communication**: A serialized object can be transmitted over a network, which lets separate computer systems exchange and work with the same data.
- **Caching**: Deserializing an object is often faster than rebuilding it from scratch, so large objects can be cached in serialized form to save time.
- **Deep Copy**: Serializing an object and then deserializing it yields an independent replica of the original, which makes cloning simple.
- **Portability**: Serialized data can be read across different architectures and operating systems.
- **Persistence**: The state of an object can be serialized and stored on disk or in a database so that it can be retrieved later.
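
As a small illustration of the deep-copy and persistence points, the sketch below serializes a nested dictionary to JSON text and rebuilds an independent copy from it. The `config` data is made up for the example; note that this approach only works for data that JSON can represent (dictionaries, lists, strings, numbers, booleans, and `None`).

```python
import json

config = {"user": "peter", "progress": {"level": 3, "items": ["map", "key"]}}

serialized = json.dumps(config)   # Python object -> JSON text
clone = json.loads(serialized)    # JSON text -> new, independent object

# Changing the clone does not affect the original.
clone["progress"]["items"].append("lamp")

print(config["progress"]["items"])
print(clone["progress"]["items"])
print(len(serialized.encode("utf-8")), "bytes ready to store or send")
```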

![fig_sd](https://miro.medium.com/max/1150/1*9zJJ65xk8agiQXlqd7nYUw.jpeg)
Image Source: Phonlawat Khunphet

**Serialization**

We use the `dump()` function, which takes two arguments:
* The data object to be serialized.
* The file object to which the JSON text will be written (opened in text mode).

In [None]:
file_name = "Sample.json"
with open(file_name, "w") as fid:
     json.dump(json_page, fid)

In [None]:
!cat Sample.json

**Deserializing JSON**

* Deserialization is the opposite of serialization, i.e. the conversion of JSON data into the corresponding Python objects.
* We use the `load()` function to read JSON from a file object (`loads()` does the same for a string). The root of the result is typically a `dict` or a `list`.

In [None]:
with open(file_name, "r") as fid:
     json_obj = json.load(fid)

print(json_obj)

In [None]:
for key in json_obj:
    print("{:>12}: {}".format(key, json_obj[key]))

### <font color="green">Scraping the NASA Astronomy Picture Of the Day (APOD) Webpage </font>

- We want to be able to obtain from the webpage <a href="https://api.nasa.gov/planetary/apod"> https://api.nasa.gov/planetary/apod</a>,  the Astronomy picture of the day for a given day and plot the image.
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.

**Query Parameters**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
|`date` | YYYY-MM-DD | today | Date of the APOD image to retrieve |
|`start_date` | YYYY-MM-DD | none | The start of a date range, when requesting date for a range of dates. Cannot be used with `date`. |
|`end_date` | YYYY-MM-DD | today | The end of the date range, when used with `start_date`. |
| `count` |	int	| none	| If this is specified then count randomly chosen images will be returned. Cannot be used with `date` or `start_date` and `end_date`. |
| `hd` | bool | False | Retrieve the URL for the high resolution image |
| `api_key` | string | DEMO_KEY | <a href="https://api.nasa.gov/">https://api.nasa.gov/</a> key for expanded usage |
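
The exclusions listed in the table can be checked before any request is sent. The following is a sketch under stated assumptions: `build_apod_params` is a hypothetical helper written for this notebook, not part of the NASA API or of `requests`, and it enforces only the rules that the table states explicitly.

```python
def build_apod_params(api_key="DEMO_KEY", date=None, start_date=None,
                      end_date=None, count=None, hd=False):
    """Build the query-parameter dictionary, enforcing the table's exclusions."""
    if date is not None and start_date is not None:
        raise ValueError("'date' cannot be used with 'start_date'")
    if count is not None and any(v is not None for v in (date, start_date, end_date)):
        raise ValueError("'count' cannot be used with 'date' or a date range")

    params = {"api_key": api_key, "hd": hd}
    optional = {"date": date, "start_date": start_date,
                "end_date": end_date, "count": count}
    # Send only the parameters that were actually provided.
    params.update({key: value for key, value in optional.items() if value is not None})
    return params


single_day = build_apod_params(date="2022-07-04", hd=True)
date_range = build_apod_params(start_date="2022-07-01", end_date="2022-07-04")
print(single_day)
print(date_range)
```

The resulting dictionary can be passed as the `params` argument of `reqs.get()`.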


In [None]:
url = "https://api.nasa.gov/planetary/apod"
date = "2022-07-04"
payload = {'api_key': "DEMO_KEY",
          'date': date,
          'hd': True}

page_content = reqs.get(url, params=payload)

Get the url:

In [None]:
print(f"URL: \n\t {page_content.url}")

Determine the content type:

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

Process the data with JSON:

In [None]:
if page_content.status_code == 200:
   json_page = json.loads(page_content.text)

The `json_page` variable is a dictionary of various keys and values. Let’s take a look at its keys:

In [None]:
for x in json_page:
    print(x)

Print the keys and values:

In [None]:
for x in json_page:
    print(f"{x} --> {json_page[x]} \n")

In [None]:
pprint.pprint(json_page)

Plot images:

In [None]:
if json_page["media_type"] == "image":
    io.imshow(io.imread(json_page["url"]))
    io.show()

<font color="red">If you want to download the file on your local system:</font>

In [None]:
import urllib.request

url_name = json_page["url"]
loc_file_name = os.path.basename(url_name)

urllib.request.urlretrieve(url_name, loc_file_name)

If you want to view the image through a browser, use:

In [None]:
from IPython.display import Javascript, display
def window_open(url):
    display(Javascript('window.open("{url}");'.format(url=url)))

window_open(json_page['url'])

In [None]:
#import webbrowser
#webbrowser.open(json_page["url"])

### <font color="green">Obtaining Mars Rover Photos</font>

In [None]:
rover_url = 'https://api.nasa.gov/mars-photos/api/v1/rovers/curiosity/photos'

payload = {'api_key': "DEMO_KEY",
           'sol': 1000}

response = reqs.get(rover_url, params=payload)
print(f"URL: \n\t {response.url}")
print(f"Content type: \n\t {response.headers['content-type']}")

In [None]:
response_dictionary = response.json()

In [None]:
print(f"KEYS: \n\t {response_dictionary.keys()}")

In [None]:
photos = response_dictionary['photos']

In [None]:
print(type(photos))
print(len(photos))

In [None]:
print(photos[0])

Extract the URL of each photo:

In [None]:
url_photos = list()
for photo in photos:
    url_photos.append(photo['img_src'])

print(url_photos[0])

Randomly select 20 pictures:

In [None]:
import random
url_pictures = random.sample(url_photos, 20)

Display the 20 photos:

In [None]:
fig, axes = plt.subplots(4, 5, figsize=(20, 20))
ax = axes.ravel()

for i in range(20):
    ax[i].imshow(io.imread(url_pictures[i]))

fig.tight_layout()

## Exercise 1:

Use the following code to list all the images in the provided year range:

```python
url = "https://images-api.nasa.gov/search"

payload = {
        "q": "apollo",
        "page": "1",
        "media_type": "image",
        "year_start": "2020",
        "year_end": "2022"}

response = reqs.get(url, params=payload)
images = response.json()["collection"]["items"]
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>

```python
import requests as reqs

url = "https://images-api.nasa.gov/search"

params = {
    "q": "apollo",
    "page": "1",
    "media_type": "image",
    "year_start": "2020",
    "year_end": "2022"
}

response = reqs.get(url, params=params)
response.raise_for_status()

images = response.json()["collection"]["items"]
print(f"Number of images: {len(images)}")
for image in images:
    thumbnail_url = image["links"][0]["href"]
    image_url = thumbnail_url[:thumbnail_url.rfind("~")] + "~orig.jpg"
    print(image_url)
```
</p>
</details>

### <font color="green">Scraping the Earth Observatory Natural Event Tracker (EONET) Webpage </font>

- We want to be able to browse the webpage <a href="https://eonet.gsfc.nasa.gov/api/v2.1/events"> https://eonet.gsfc.nasa.gov/api/v2.1/events</a>,  to identify natural events on Earth.

**Query Parameters**

| Parameter | Value(s) |  Description |
| --- | --- | --- |
|`source` | Source ID | Filter the returned events by the <a href="https://eonet.sci.gsfc.nasa.gov/api/v2.1/sources">Source</a>. Multiple sources can be included in the parameter: comma separated, operates as a boolean OR. |
|`status` | open or closed | Events that have ended are assigned a closed date and the existence of that date will allow you to filter for only-open or only-closed events. Omitting the status parameter will return only the currently open events. |
| `limit` | int | Limits the number of events returned |
| `days` | int | Limit the number of prior days (including today) from which events will be returned. |



In [None]:
url = "https://eonet.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 30}

page_content = reqs.get(url, params=payload)

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

In [None]:
if page_content.status_code == 200:
    json_page = json.loads(page_content.text)

In [None]:
for x in json_page:
    print(x)

In [None]:
pprint.pprint(json_page['events'])

---

## <font color='blue'>Web Scraping with Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
source = reqs.get("http://www.webcode.me")

In [None]:
print(f"Content type: \n\t {source.headers['content-type']}")

In [None]:
print(source)

**Create a beautiful soup object**

In [None]:
mysoup = bso(source.text, 'html.parser')

**Print the HTML content of the page using the `prettify` method**

In [None]:
print(mysoup.prettify())

**Obtain the title section of the page**

In [None]:
print(mysoup.title)

**Get the tag name**

In [None]:
print(mysoup.title.name)

**Get the text inside the tag**

In [None]:
print(mysoup.title.string)

In [None]:
print(mysoup.title.text)

**Beginning navigation**

In [None]:
print(mysoup.title.parent.name)

**Getting specific tags**
- HTML is made up of <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">tags</a>. It stores all of its data in them, and in the midst of all that clutter lies the data we need. Some of the tags are:
     * `head` - contains machine-readable information (metadata) about the document, like its title, scripts, and style sheets.
     * `body` - represents the content of an HTML document. There can be only one `<body>` element in a document.
     * `title` - defines the document's title that is shown in a Browser's title bar or a page's tab.
     * `p` - for paragraph
     * `div` — indicates a division, or area, of the page.
     * `b` — bolds any text inside.
     * `i` — italicizes any text inside.
     * `table` — creates a table.
     * `form` — creates an input form.
- The `find` method searches for the first tag with the needed name.
- The `find_all` method searches for all tags with the needed tag name and returns them as a list.

Assume that we want to find paragraph tags `<p>`:

In [None]:
print(mysoup.p)

In [None]:
print(mysoup.p.text)

In [None]:
print(mysoup.find('p'))

In [None]:
print(type(mysoup.find('p')))

We can find all paragraphs:

In [None]:
print(mysoup.find_all('p'))

In [None]:
print(type(mysoup.find_all('p')))

To get the last paragraph only:

In [None]:
print(mysoup.find_all('p')[-1])

We can loop over the paragraphs:

In [None]:
for i, paragraph in enumerate(mysoup.find_all('p'), start=1):
    print(f"Paragraph Text {i}: {paragraph.text}")

In [None]:
body = mysoup.find_all('body')
print(body)

In [None]:
print("Type body:       ", type(body))
print("Type inner body: ", type(body[0]))

In [None]:
print(body[0].find_all('p'))

In [None]:
print(mysoup.find_all('body')[0].find_all('p'))

**Grab the text**

- Use the method `get_text`.

In [None]:
print(mysoup.get_text())

**Searching for tags by `class` and `id`**

- Classes and ids are used by CSS to determine which HTML elements to apply certain styles to.
- We can also use them when scraping to specify specific elements we want to scrape.

In [None]:
url = "http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html"
source = reqs.get(url)
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

- We can use the `find_all` method to search for items by `class` or by `id`.

```python
mysoup.find_all("html_tag", class_="your_class_name")
mysoup.find_all(class_="your_class_name")

mysoup.find_all("html_tag", id="your_id_name")
mysoup.find_all(id="your_id_name")
```

In the below example, we’ll search for any `p` tag that has the class `outer-text`:

In [None]:
mysoup.find_all('p', class_='outer-text')

We can also look for any tag that has the class `outer-text`:

In [None]:
mysoup.find_all(class_="outer-text")

In [None]:
mysoup.find_all(class_="outer-text")[-1].get_text()

In [None]:
mysoup.find_all(class_="outer-text")[-1].get_text().strip()

We can also search for elements by id:

In [None]:
mysoup.find_all(id="first")

In [None]:
mysoup.find_all(id="first")[0].get_text()

**Using CSS Selectors**

- <a href="https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors">CSS (Cascading Style Sheets)</a> is a declarative language that controls how webpages look in the browser.
- The browser applies CSS style declarations to selected elements to display them properly.
- A style declaration contains the properties and their values, which determine how a webpage looks.
- CSS selectors are how the CSS language allows developers to specify which HTML tags to style.

Here are some examples:

- `p a` — finds all `a` tags inside of a `p` tag.
- `body p a` — finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body` — finds all `body` tags inside of an `html` tag.
- `p.outer-text` — finds all `p` tags with a class of `outer-text`.
- `p#first` — finds all `p` tags with an id of `first`.
- `body p.outer-text` — finds any `p` tags with a class of `outer-text` inside of a `body` tag.
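
To try these selectors without depending on a live page, the sketch below applies them to a small inline document. It assumes the `beautifulsoup4` package is installed (as elsewhere in this notebook); the HTML snippet and its class and id names are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in document (assumed sample data).
html = """
<html><body>
  <p class="inner-text first-item" id="first">First paragraph. <a href="/one">one</a></p>
  <p class="inner-text">Second paragraph.</p>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

selectors = ["p a", "html body", "p.outer-text", "p#first", "body p.outer-text"]
for selector in selectors:
    matches = soup.select(selector)  # always returns a list of matching tags
    print(f"{selector!r}: {len(matches)} match(es)")

# Each match is a Tag, so attributes and text are available as usual.
first_link = soup.select("p a")[0]
print(first_link["href"], first_link.get_text())
```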

We can use the CSS selectors to search items inside webpages. `BeautifulSoup` objects support searching a page via CSS selectors using the `select` method.

Find all the `p` tags in our page that are inside of a `body`:

In [None]:
mysoup.select("body p")

Find all the `b` tags in our page that are inside of a `p`:

In [None]:
mysoup.select("p b")

Find all `b` tags inside of a `p` tag inside of a `body`:

In [None]:
mysoup.select("body p b")

Find all `p` tags with an id of `first`:

In [None]:
mysoup.select('p#first')

### <font color="blue"> Example: Extract the web link of the Astronomy Picture of the Day</font>

In [None]:
url = "https://apod.nasa.gov/apod/astropix.html"
source = reqs.get(url)
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

Print basic information of the Image of the Day:

In [None]:
print(mysoup.find('p').get_text())

In [None]:
href_comments = mysoup.find_all('a')
for a in href_comments:
    print(a.get_text())

#### <font color="red">Here we assume that the Picture of the Day is a video. If that is not the case, skip the next five cells.</font>

In [None]:
mysoup.iframe

In [None]:
from IPython.display import HTML

HTML(str(mysoup.iframe))

In [None]:
mysoup.iframe['src']

In [None]:
src_list = [a['src'] for a in mysoup.select('iframe[src]')]
src_list

In [None]:
src_tags = mysoup.find_all(src=True)
src_tags
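
When the page has no `iframe`, `mysoup.iframe` is `None`, and `mysoup.iframe['src']` then raises a `TypeError`. The sketch below shows one possible way to handle both cases in code. The helper name `find_media_src` and the two HTML strings are hypothetical and only loosely imitate a video page and an image page.

```python
from bs4 import BeautifulSoup

def find_media_src(html):
    """Return ("video", src) or ("image", src), or (None, None) if neither is found."""
    soup = BeautifulSoup(html, "html.parser")
    # Look for an embedded video first.
    iframe = soup.find("iframe", src=True)
    if iframe is not None:
        return "video", iframe["src"]
    # Otherwise fall back to the first image that has a src attribute.
    img = soup.find("img", src=True)
    if img is not None:
        return "image", img["src"]
    return None, None

# Made-up pages used only for this illustration.
video_page = '<center><iframe src="https://www.youtube.com/embed/abc123"></iframe></center>'
image_page = '<center><a href="image/big.jpg"><img src="image/small.jpg"></a></center>'

print(find_media_src(video_page))
print(find_media_src(image_page))
```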

Find all tags that have an `href` attribute:

In [None]:
href_tags = mysoup.find_all(href=True)
href_tags

In [None]:
links_with_text = [a['href'] for a in mysoup.find_all('a', href=True) if a.text]

In [None]:
link_list1 = [a['href'] for a in mysoup.find_all('a', href=True)]
link_list1

In [None]:
link_list2 = [l.get('href') for l in mysoup.find_all('a')]
link_list2

In [None]:
link_list3 = [a['href'] for a in mysoup.select('a[href]')]
link_list3

If the Picture of the Day is an image instead (not a video), the following can help us view the image:

In [None]:
url_image = "https://apod.nasa.gov/apod/"+link_list3[1]
fig, axes = plt.subplots(figsize=(10, 8))
axes.imshow(io.imread(url_image))
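
The cell above builds the image address by concatenating a base address and a relative link. As an alternative sketch, the standard library function `urllib.parse.urljoin` resolves a link against the address of the page it came from, and leaves links that are already absolute unchanged. The `hrefs` values below are hypothetical examples of the kinds of links a page may contain.

```python
from urllib.parse import urljoin

# Address of the page the links were extracted from.
base_url = "https://apod.nasa.gov/apod/astropix.html"

# Hypothetical href values: two relative links and one absolute link.
hrefs = ["image/2401/picture.jpg", "archivepix.html", "https://www.nasa.gov/"]

# Resolve every link against the page address.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```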

### Exercise 2

Go to the webpage:

[https://astg606.github.io/py_courses/summer_2022/](https://astg606.github.io/py_courses/summer_2022/)

and extract the `Course Evaluation` web link.

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import requests as reqs

from bs4 import BeautifulSoup as bso

URL = "https://astg606.github.io/py_courses/summer_2022/"

source = reqs.get(URL)
if source.status_code == 200:
    mysoup = bso(source.content, 'html.parser')
    href_tags = mysoup.find_all(href=True)
    for tag in href_tags:
        if tag.get_text() == "Course Evaluation":
            print(tag["href"])
else:
    print("URL not accessible.")
```

</p>
</details>

### <font color="blue"> Example: Weather Data for Greenbelt, Maryland</font>

In [None]:
url = "https://forecast.weather.gov/MapClick.php"
params = {'lat': 39.00079000000005,
          'lon': -76.88055999999995}

source = reqs.get(url, params=params)
print(f"URL: \n\t {source.url}")
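
The `params` dictionary ends up in the query string of the URL, after the `?`. The following sketch reproduces the idea with the standard library only, without sending a request. It illustrates the encoding and does not claim to match the internals of `requests`, whose output may differ in small details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

url = "https://forecast.weather.gov/MapClick.php"
params = {"lat": 39.00079000000005, "lon": -76.88055999999995}

# Encode the dictionary as key=value pairs joined by "&" and append it to the URL.
full_url = f"{url}?{urlencode(params)}"
print(full_url)

# Going the other way: split the URL and decode its query string.
query = parse_qs(urlsplit(full_url).query)
print(query)
```

Each decoded value is a list of strings, because a key may appear several times in a query string.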

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Extract Tonight's Forecast**

In [None]:
seven_day = mysoup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

In [None]:
period = tonight.find(class_="period-name").get_text()
print(period)

In [None]:
short_desc = tonight.find(class_="short-desc").get_text()
print(short_desc)

In [None]:
temp = tonight.find(class_="temp").get_text()
print(temp)

In [None]:
img = tonight.find("img")
desc = img['title']
print(desc)

**Extracting all Data**

We use CSS selectors to extract everything at once.

We select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)

We can apply the same technique to get the other fields:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)

In [None]:
#temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
#print(temps)

In [None]:
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

We can combine the data into a Pandas DataFrame:

In [None]:
import pandas as pd
df_weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    #"temp": temps,
    "desc":descs
})
df_weather
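
`pd.DataFrame` needs all the column lists to have the same length. If one forecast box lacks a field, the per-field lists no longer line up. A possible way around this, sketched below, is to build one record per box and store `None` for a missing field. The HTML is a made-up, simplified imitation of the forecast markup, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up, simplified forecast markup; the second box has no "temp" element.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 45 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Hazard</p>
    <p class="short-desc">Wind Advisory</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def text_or_none(parent, css_class):
    """Return the text of the first descendant with the given class, or None."""
    tag = parent.find(class_=css_class)
    return tag.get_text() if tag is not None else None

# One dictionary per forecast box, with the same keys in every record.
records = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in soup.select("#seven-day-forecast .tombstone-container")
]

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, and missing values then show up as `None` or `NaN`.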

**Detailed Forecast**

In [None]:
det_forecast = mysoup.find(id="detailed-forecast-body")

In [None]:
forecast_labels = det_forecast.find_all(class_="col-sm-2 forecast-label")

In [None]:
forecast_texts = det_forecast.find_all(class_="col-sm-10 forecast-text")

In [None]:
for a, b in zip(forecast_labels, forecast_texts):
    print("\033[1m {:>15}: \033[0m {:<}".format(a.get_text(), b.get_text()))
    print()

### Exercise 3

- Go to the website `https://eonet.gsfc.nasa.gov/api/v2.1/events`
- Select a date range and the number of events you want to retrieve.
- Create a Pandas DataFrame that contains as columns the event type, date, latitude and longitude.

```python
url = "https://eonet.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import json

import pandas as pd
import requests as reqs

url = "https://eonet.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}

page_content = reqs.get(url, params=payload)

# Stop with an HTTPError if the request failed, so that json_page is always defined below
page_content.raise_for_status()
json_page = json.loads(page_content.text)

# Print the top-level keys of the response
for x in json_page:
    print(x)

list_events = json_page['events']

print(f"Number of events: {len(list_events)}")
print(f"List of events: \n {list_events}")

# Use the first category and the first geometry of each event;
# the coordinates are assumed to be ordered as [longitude, latitude]
event_types = [evt['categories'][0]['title'] for evt in list_events]
event_dates = [evt['geometries'][0]['date'] for evt in list_events]
event_lons = [evt['geometries'][0]['coordinates'][0] for evt in list_events]
event_lats = [evt['geometries'][0]['coordinates'][1] for evt in list_events]

print()

df_events = pd.DataFrame({
    "Type": event_types,
    "Dates": event_dates,
    "Latitudes": event_lats,
    "Longitudes": event_lons
})
df_events
```
    
</p>
</details>
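
To practice the nested indexing used in the solution without calling the API, here is a small offline sketch. The JSON text is a hypothetical, heavily shortened payload that only imitates the shape the solution relies on (an `events` list whose items carry `categories` and `geometries`). Real responses contain more fields.

```python
import json

# Hypothetical payload imitating the structure assumed by the solution above.
sample_text = """
{"title": "Sample events",
 "events": [
   {"title": "Wildfire A",
    "categories": [{"id": 8, "title": "Wildfires"}],
    "geometries": [{"date": "2022-06-01T00:00:00Z", "type": "Point",
                    "coordinates": [-120.5, 38.25]}]},
   {"title": "Storm B",
    "categories": [{"id": 10, "title": "Severe Storms"}],
    "geometries": [{"date": "2022-06-02T12:00:00Z", "type": "Point",
                    "coordinates": [135.0, 22.5]},
                   {"date": "2022-06-03T12:00:00Z", "type": "Point",
                    "coordinates": [134.0, 23.5]}]}
 ]}
"""
payload = json.loads(sample_text)

rows = []
for evt in payload["events"]:
    # Keep only the first geometry of each event, as the solution does.
    first_geom = evt["geometries"][0]
    # Point coordinates are taken as [longitude, latitude].
    lon, lat = first_geom["coordinates"]
    rows.append({"type": evt["categories"][0]["title"],
                 "date": first_geom["date"],
                 "latitude": lat,
                 "longitude": lon})

for row in rows:
    print(row)
```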

### <font color="blue"> Example: MODIS Aerosol Optical Thickness</font>

- Scientists use measurements from the MODIS sensor aboard NASA's Terra and Aqua satellites to map the amount of aerosol that is in the air all over the world. Because aerosols reflect visible and near-infrared light back to space, scientists can use satellites to make maps of where there are high concentrations of these particles.
- Scientists call this measurement aerosol optical thickness (AOT).
- It is a measure of how much light the airborne particles prevent from traveling through the atmosphere.
- Aerosols absorb and scatter incoming sunlight, thus reducing visibility and increasing optical thickness. An optical thickness of less than 0.1 indicates a crystal clear sky with maximum visibility, whereas a value of 1 indicates the presence of aerosols so dense that people would have difficulty seeing the Sun, even at mid-day!
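
To get a feel for these numbers, the sketch below converts an optical thickness into the fraction of the direct beam that gets through. It assumes the Beer-Lambert law, `exp(-AOT)`, for a beam travelling straight down through the aerosol layer. It ignores the slant path of the Sun and the scattered light that still reaches the ground, so it is a rough illustration only. The function name `direct_beam_fraction` is hypothetical.

```python
import math

def direct_beam_fraction(aot):
    """Fraction of the direct beam transmitted through a layer of optical thickness aot."""
    return math.exp(-aot)

for aot in (0.1, 1.0):
    fraction = direct_beam_fraction(aot)
    print(f"AOT = {aot}: about {fraction:.0%} of the direct beam is transmitted")
```

Under these assumptions, an optical thickness of 0.1 lets roughly 90% of the direct beam through, while a value of 1 lets through only about 37%.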


In this example, we want to access the <a href="https://neo.gsfc.nasa.gov/">NASA Earth Observations (NEO)</a> website to obtain the AOT measurements for a given day or a range of days (from 2000 to present).

**Select the day range of interest:**

In [None]:
import pandas as pd

beg_date = '2019-12-30'
end_date = '2019-12-31'

pd_series = pd.date_range(start=beg_date, end=end_date, freq='D')
dates = [dt.strftime('%Y-%m-%d') for dt in pd_series]

url_base = "https://neo.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_AER_OD&year="

urls = [url_base+dt for dt in dates]

In [None]:
print(urls[0])

**Access the webpage for the first day:**

In [None]:
source = reqs.get(urls[0])
print(source)

**Parse the webpage and print its content:**

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Gather all the tags that have an `href` attribute:**

In [None]:
href_tags = mysoup.find_all(href=True)

**Find the `http` address that has the word `CSV`. That will give us the remote location of the file we want to read.**

In [None]:
for tag in href_tags:
    loc_url = tag["href"]
    if "CSV" in loc_url:
        csv_url = loc_url
        break

In [None]:
print(csv_url)
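
The same search can be written as a single CSS attribute selector: `a[href*="CSV"]` matches `a` tags whose `href` contains the text `CSV`. Here is a minimal sketch on made-up links (the `example.org` addresses are placeholders, not the real NEO links). It also covers the case where no link matches.

```python
from bs4 import BeautifulSoup

# Made-up links used only for this illustration.
page = """
<a href="https://example.org/about.html">About</a>
<a href="https://example.org/archive/data_2019-12-01.CSV">Download CSV</a>
<a href="https://example.org/archive/data_2019-12-01.PNG">Download PNG</a>
"""
soup = BeautifulSoup(page, "html.parser")

# select_one returns the first matching tag, or None when nothing matches.
csv_tag = soup.select_one('a[href*="CSV"]')
csv_url = csv_tag["href"] if csv_tag is not None else None
print(csv_url)
```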

**Use `Pandas` to read the remote file:**

In [None]:
df = pd.read_csv(csv_url, index_col=0)
df

**It seems that `99999.0` corresponds to a missing value. Let us replace it with `NaN`:**

In [None]:
df = pd.read_csv(csv_url, index_col=0, na_values=99999.0)
df
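
The effect of `na_values` can be checked offline on a tiny table. The CSV text below is a made-up miniature grid, with latitudes as the row index and longitudes as the column names; it does not come from the NEO file.

```python
import io

import pandas as pd

# Made-up miniature grid: the first column holds latitudes, the header holds longitudes.
csv_text = """lat/lon,-179.5,-178.5
89.5,99999.0,0.25
88.5,0.5,99999.0
"""

# Every cell equal to 99999.0 is read as NaN.
df_small = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=99999.0)
print(df_small)

# Number of missing values in the whole table.
n_missing = int(df_small.isna().sum().sum())
print(n_missing)
```

The column names stay strings after reading, which is why the `Xarray` cell below converts them with `float`.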

**We can use `Xarray` to quickly visualize the data:**

In [None]:
import xarray as xr

da = xr.DataArray(df.values,
                  coords=[[float(lat) for lat in df.index],
                          [float(lon) for lon in df.columns]],
                  dims=['latitude', 'longitude'])

da

In [None]:
da.plot();

## <font color="blue">Application</font>

- We want to get all the book names on the historic New York Times Best Sellers lists (Business section).
- The purpose is to:
     1. Help to compile my reading list in 2020
     2. Serve as reference to use Python for simple web analytics
- We use the Python packages: `Pandas`, `Requests` and `Beautiful Soup`
- We save data in `pickle` and `csv` formats.

The example was taken from: <a href="https://towardsdatascience.com/building-my-2020-reading-list-with-a-simple-python-script-b610c7f2c223">Building my 2020 reading list with a simple Python script</a> by Pan Wu.

In [None]:
import pandas as pd

# Collect one dictionary per book and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

beg_year = 2016
end_year = 2021
for the_year in range(beg_year, end_year):
    for the_month in range(1, 13):
        cur_month = str(the_month).zfill(2) # month in two digits
        # Build the URL from its pattern, then use the Requests package to get its content
        url = 'https://www.nytimes.com/books/best-sellers/{0}/{1}/01/business-books/'.format(the_year, cur_month)
        page = reqs.get(url)
        print(" --  try: {0}, {1} -- ".format(the_year, cur_month))

        # Ensure proper result is returned
        if page.status_code != 200:
            print("      Missing data for Year {} and Month {}".format(the_year, cur_month))
            continue

        # Use BeautifulSoup to parse the right elements out
        soup = bso(page.text, 'html.parser')

        # The class names below are specific to this site and do not change across the URLs
        top_list = soup.find_all("ol", {"class": "css-12yzwg4"})[0].find_all("div", {"class": "css-xe4cfy"})
        print("Year: {} - Month: {} - How many in the top list: {}".format(the_year, the_month, len(top_list)))

        # Loop through the Best Seller list of each Year-Month and record one row per book
        for i in range(len(top_list)):
            book   = top_list[i].contents[0]
            title  = book.find_all("h3", {"class": "css-5pe77f"})[0].text
            author = book.find_all("p",  {"class": "css-hjukut"})[0].text
            review = book.get("href")
            # print("{0}, {1}; review: {2}".format(title, author, review))
            rows.append({'year': the_year, 'month': the_month, 'title': title,
                         'author': author, 'rank': i+1, 'review': review})

nylist = pd.DataFrame(rows, columns=['year', 'month', 'title', 'author', 'rank', 'review'])

# Write out the result to pickle and csv files for easy analysis later.
nylist.to_pickle("nylist.pkl")
nylist.to_csv("nylist.csv", index=False)

In [None]:
nylist

### Exercise 4

- Write a Python script that reads the pickle file `nylist.pkl` or the csv file `nylist.csv` and prints its content.