<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1> <font color="red" size="+3">
    Web Scraping with Python</font>
</H1>
</CENTER>

## <font color='red'>Reference Documents</font>

- [Web Scraping: What It Is and How to Use It](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
- [What is web scraping](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
- [Python Requests Tutorial](http://zetcode.com/python/requests/)
- [Python’s Requests Library (Guide)](https://realpython.com/python-requests/)
- [Building a Web Scraper from start to finish](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)

## <font color='red'>Objectives</font>

In this course, we want to describe web scraping and show how it can be accomplished with Python. We present the basic steps of web scraping and run examples on accessing HTTP servers, grabbing the content of web pages (in JSON format), parsing the content to extract useful information and performing analyses.

The following topics will be covered:

+ What is web scraping?
+ Components of a web page
+ Accessing Web Pages with `requests`
+ Web Scraping with `json`

We expect that at the end of this presentation, participants will be able to write Python scripts that automatically perform web scraping to extract specific data from webpages.

## <font color='red'>Web Scraping</font>

> Web scraping is a mechanism for collecting large amounts of data from a webpage and storing the data in any required format, which makes it easier to analyze the extracted data.


- It is used to extract or “scrape” data from any web page on the Internet.
- It is performed using a “**web-scraper**” (or a “bot” or a “web spider” or “web crawler”). 
    - A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database.

![fig_json](https://daveberesford.co.uk/wp-content/uploads/2019/02/data-scraping-960x594.png)
Image Source: daveberesford.co.uk

Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage
   - The server responds to the request by returning the (JSON, HTML, etc.) content of the target webpage.
2. **Step 2**: Parse the webpage content
   - A parser is needed to create a nested structure of the data. 
3. **Step 3**: Pull out useful data
   - We use Python packages such as `json` and Beautiful Soup to pull out data and store them.
   
![fig_scap](https://scrape-it.cloud/assets/cache_image/assets/blog_img/web-scraping-process2_760x0_073.webp)
Image Source: [scrape-it.cloud](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
     

__Web scrapers crawl websites, extract data from them, transform the data into a usable structured format, and load it into a file or database for subsequent use.__

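The parse, extract, and store steps can be sketched offline as follows. This is a minimal illustration, not a full scraper: the `RAW_RESPONSE` string is a made-up stand-in for the body a server would return in Step 1, and the names `extract_readings` and `to_csv`, as well as the CSV layout, are assumptions chosen for the example.

```python
import csv
import io
import json

# Stand-in for the text a server would return in Step 1 (illustrative values).
RAW_RESPONSE = '{"readings": [{"site": "A", "value": 1.5}, {"site": "B", "value": 2.0}]}'


def extract_readings(raw_text):
    """Parse the JSON text (Step 2) and pull out (site, value) pairs (Step 3)."""
    parsed = json.loads(raw_text)
    return [(item["site"], item["value"]) for item in parsed["readings"]]


def to_csv(rows):
    """Store the extracted rows in CSV form (in memory here, rather than on disk)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["site", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_readings(RAW_RESPONSE)
print(to_csv(rows))
```
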

## <font color='red'>Required Packages</font>
We will need the following main Python packages:

- `requests`: for accessing servers and getting the contents of web pages.
- `json`: for manipulating JSON documents.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import os
import pprint
import re

In [None]:
import pandas as pd

In [None]:
import requests as reqs

In [None]:
import json

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
from IPython.display import HTML

In [None]:
print(f"Requests version:  {reqs.__version__}")
print(f"JSON version:      {json.__version__}")

# <font color='red'>Python `requests` Module</font>

* Requests is a third-party Python package (not part of the standard library); it can be installed with `pip install requests`.
* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
resp = reqs.get("http://www.webcode.me")

We can print the content of the page, which is stored in the `resp` object:

In [None]:
print(resp.text)

We can use the module `re` to strip all the HTML markup from the content.

In [None]:
resp = reqs.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)

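The effect of the pattern is easier to see on a small, known input. The sketch below applies the same regular expression to a made-up HTML string (`sample_html` is an assumption for the example) instead of a live page. Note that this approach only removes tags: it does not decode entities such as `&amp;` and it keeps the text inside `<script>` or `<style>` elements, so a real parser is usually more robust.

```python
import re

# Small made-up page used instead of a live request.
sample_html = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"

# Same pattern as above: remove anything that looks like <...>.
stripped = re.sub(r"<[^<]+?>", "", sample_html)
print(stripped)
```
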
#### Encoding

- When you issue a request, `Requests` makes educated guesses about the encoding of the response based on the HTTP headers. 
- The text encoding guessed by `Requests` is used when you access `resp.text`. 

In [None]:
resp.encoding

In [None]:
resp.text

- You can find out what encoding `Requests` is using, and change it.
- If you change the encoding, `Requests` will use the new value of `resp.encoding` whenever you call `resp.text`.

In [None]:
resp.encoding = 'utf-8'

In [None]:
resp.encoding

In [None]:
resp.text

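To see why the encoding matters, the sketch below decodes the same bytes with two different encodings. It works on a hand-made byte string rather than a live response; conceptually, `resp.text` is the raw bytes of the response decoded with `resp.encoding`. The variable names are assumptions made for the example.

```python
# Bytes as a server might send them for the word "café", encoded in UTF-8.
raw_bytes = "café".encode("utf-8")

# Decoding with the matching encoding recovers the original 4 characters.
good = raw_bytes.decode("utf-8")

# Decoding with a mismatched encoding yields garbled text (5 characters).
bad = raw_bytes.decode("latin-1")

print(raw_bytes)
print(len(good), len(bad))
print(good == bad)
```
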
#### Getting the Status of a Web Page
- We perform an HTTP request with the `get()` method and check for the returned status code.
- A status code informs you of the status of the request: whether the request was successful or not.
- `200` is a standard response for a successful HTTP request and `404` tells that the requested resource could not be found.
- By accessing `.status_code`, you can see the status code that the server returned.

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
resp = reqs.get("http://www.webcode.me/news")
print(resp.status_code)

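Status codes fall into families (2xx, 3xx, 4xx, 5xx), and the standard library knows the usual phrase attached to each code. The sketch below uses `http.HTTPStatus` to describe a code without sending any request; the helper name `describe_status` and the category labels are assumptions made for the example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        phrase = "Unknown"

    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```
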
#### Function for checking if a website is accessible

In [None]:
def access_website(url):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object, otherwise return an error message.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access

    Returns
    -------
    resp : requests.Response or str
       Response object with information on the web page of interest,
       or an error message if no response was received.
    """
    try:
        resp = reqs.get(url, timeout=3)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        # The server answered: resp exists and carries the error status code.
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        resp = f"Error Connecting: {errc}"
        print(resp)
    except reqs.exceptions.Timeout as errt:
        resp = f"Timeout Error: {errt}"
        print(resp)
    except reqs.exceptions.RequestException as err:
        resp = f"General Error: {err}"
        print(resp)
    else:
        print("Successfully accessed the site!")
    
    return resp

In [None]:
url = "http://www.webcode.me/news"
resp = access_website(url)
print(f"Type of returned value: {type(resp)}")
if not isinstance(resp, str):
    print(f"Status Code: {resp.status_code}")

In [None]:
url = "http://www.webcode.me"
resp = access_website(url)
print(f"Type of returned value: {type(resp)}")
if not isinstance(resp, str):
    print(f"Status Code: {resp.status_code}")

In [None]:
if not isinstance(resp, str):
    print(f"Status Code: {resp.status_code}")

**Other Information**

In [None]:
url = "http://www.webcode.me"
resp = access_website(url)
if resp.status_code == 200:
    print(f"\t URL:      {resp.url}")
    print(f"\t Encoding: {resp.encoding}")
    print(f"\t Time:     {resp.elapsed}")

#### Server Headers

We can access the headers the server sent back to us:

In [None]:
resp.headers

In [None]:
headers = resp.headers
for key in headers:
    print(f"{key:20} --> {headers[key]}")

#### More on the `get()` Method
- The `get()` method issues a GET request to the server. 
- The GET method requests a representation of the specified resource.

```python
requests.get(url, params=None, **kwargs)
```

| Parameter | Description |
| --- | :--- |
|`url` | (required) The url of the request |
| `params` | (optional) A dictionary, list of tuples or bytes to send as a query string. |
| `allow_redirects` | (optional) A Boolean to enable/disable redirection. |
| `auth` | (optional) A tuple to enable a certain HTTP authentication. |
| `cert` | (optional) A String or Tuple specifying a cert file or key. |
| `cookies` | (optional) A dictionary of cookies to send to the specified url. |
| `headers` | (optional) A dictionary of HTTP headers to send to the specified url. |
| `proxies` | (optional) A dictionary of the protocol to the proxy url. |
| `stream` | (optional) A Boolean indicating whether the response should be immediately downloaded (False) or streamed (True). |
| `timeout` | (optional) A number, or a tuple, indicating how many seconds to wait for the client to establish a connection and/or for the server to send a response. |
| `verify` | (optional) A Boolean or a String indicating whether to verify the server's TLS certificate. |

#### Sending Parameters in URL

- We often want to send some sort of data in the URL’s query string.
- The `get()` method takes a `params` keyword argument where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

If:
```python
   {'key1': value1, 'key2': value2, 'key3': value3}
```
is the dictionary of the parameters, and `https://MyOwnWebsite.com/` is the url, then the final url to access will be:
```
    https://MyOwnWebsite.com/?key1=value1&key2=value2&key3=value3
```

The code to reach the webpage is:
```python
payload = {'key1': value1, 'key2': value2, 'key3': value3}
resp = reqs.get("https://MyOwnWebsite.com", params=payload)
```

The following script sends a variable with a value to the `httpbin.org` server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

We send a `get()` request to the web site and pass the data, which is specified in the `params` parameter:

In [None]:
payload = {'name': 'Peter'}
resp = reqs.get("https://httpbin.org/get", params=payload)

`payload` is a dictionary of pairs of keys/values:

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)

You can also pass a list of items as a value:

In [None]:
payload = {'name': ['Peter', 'Johns'], 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

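How a dictionary becomes a query string can be checked without contacting a server. The sketch below uses the standard-library function `urllib.parse.urlencode` as a stand-in for the encoding step that `requests` performs internally; the helper name `build_url` is an assumption made for the example.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a query string (key=value pairs joined by &)."""
    # doseq=True expands a list value into one key=value pair per item.
    query = urlencode(params, doseq=True)
    return f"{base_url}?{query}" if query else base_url


print(build_url("https://httpbin.org/get", {"name": "Peter", "age": 23}))
print(build_url("https://httpbin.org/get", {"name": ["Peter", "Johns"], "age": 23}))
```
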
**Other Methods**

```python
requests.post('https://httpbin.org/post', data={'key':'value'})
requests.put('https://httpbin.org/put', data={'key':'value'})
requests.delete('https://httpbin.org/delete')
requests.patch('https://httpbin.org/patch', data={'key':'value'})
requests.options('https://httpbin.org/get')
```

### Summary of `requests` Methods 

| Method	| Description |
| :--- | :--- |
| delete(url, args)	| Sends a DELETE request to the specified url | 
| get(url, params, args)	| Sends a GET request to the specified url | 
| head(url, args)	| Sends a HEAD request to the specified url | 
| patch(url, data, args)	| Sends a PATCH request to the specified url | 
| post(url, data, json, args)	| Sends a POST request to the specified url | 
| put(url, data, args)	| Sends a PUT request to the specified url | 
| request(method, url, args)	| Sends a request of the specified method to the specified url| 

### Content-Type

- It is part of the HTTP header.
   - A string used to indicate the media type of the resource you want to access.  
   - It tells the browser the type of content it has to load on the machine. 
- Here are some values of `content-type`:

```html
   text/html
   text/csv
   application/json
   application/javascript
   audio/ogg
   image/png
```

- While scraping a web page, it is used to determine which tool is needed to parse the content of a web page.

In [None]:
print(f"Content Type: \n\t {resp.headers['content-type']}")

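One way to act on the content type is to map it to a parsing tool. The sketch below is only an illustration: the helper name `choose_parser` and the particular mapping are assumptions, and the function works on the header string alone, so it can be tried without a live response.

```python
def choose_parser(content_type):
    """Suggest a parsing tool from a Content-Type header value."""
    # Drop optional parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return "json"
    if media_type == "text/html":
        return "BeautifulSoup"
    if media_type == "text/csv":
        return "csv"
    return "unknown"


for value in ("application/json; charset=utf-8", "text/html", "image/png"):
    print(f"{value:35} --> {choose_parser(value)}")
```
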
#### Revisiting the function to access a web page
We want to pass the `payload` as an argument of the function.

In [None]:
def access_website(url, payload=None):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object, otherwise return an error message.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access
    payload : dict, optional
       Parameters needed to construct the target url.

    Returns
    -------
    resp : requests.Response or str
       Response object with information on the web page of interest,
       or an error message if no response was received.
    """
    try:
        # params=None is accepted by get(), so no branching on payload is needed.
        resp = reqs.get(url, params=payload, timeout=3)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        # The server answered: resp exists and carries the error status code.
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        resp = f"Error Connecting: {errc}"
        print(resp)
    except reqs.exceptions.Timeout as errt:
        resp = f"Timeout Error: {errt}"
        print(resp)
    except reqs.exceptions.RequestException as err:
        resp = f"General Error: {err}"
        print(resp)
    else:
        print("Successfully accessed the site!")
    
    return resp

---

## <font color='red'>Web Scraping with JSON</font>

### <font color="blue"> What is JSON?</font>

* JSON (JavaScript Object Notation) is a popular data format used for representing structured data. 
* It is a language-independent text format that can be used in Python and Perl, among other languages.
* JSON format is used for data communications between servers and web applications.
* It is built on two structures:

     - A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
     - An ordered list of values. This is realized as an array, vector, list, or sequence.

The main functions of the `json` module are:

* `dump()`: Encode a Python object as JSON and write it to a file.
* `load()`: Read a JSON file and decode it into a Python object.
* `dumps()`: Encode a Python object as a JSON string.
* `loads()`: Decode a JSON string into a Python object.

**Example of JSON Data**

```python
{
    "stations": [
        {
            "acronym": "BLD", 
            "name": "Boulder Colorado",
            "latitude": 40.00,
            "longitude": -105.25
        }, 
        {
            "acronym": "BHD", 
            "name": "Baring Head Wellington New Zealand",
            "latitude": -41.28,
            "longitude": 174.87
        }
    ]
}
```

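The station example can be read with `json.loads`. The sketch below assumes the data is available as a string, here called `stations_text` (a name introduced for the example), and shows how the two JSON structures map to Python types.

```python
import json

stations_text = """
{
    "stations": [
        {"acronym": "BLD", "name": "Boulder Colorado",
         "latitude": 40.00, "longitude": -105.25},
        {"acronym": "BHD", "name": "Baring Head Wellington New Zealand",
         "latitude": -41.28, "longitude": 174.87}
    ]
}
"""

data = json.loads(stations_text)

# A JSON object becomes a dict and a JSON array becomes a list.
print(type(data).__name__, type(data["stations"]).__name__)

# Nested values are reached by chaining keys and indices.
for station in data["stations"]:
    print(f'{station["acronym"]}: ({station["latitude"]}, {station["longitude"]})')
```
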
**Another Example of JSON Data**

We consider an online database, <a href="http://ip-api.com">IP-API.com</a>, that returns GeoIP data in JSON format. Simply opening <a href="http://ip-api.com/json/54.148.84.95">http://ip-api.com/json/54.148.84.95</a> will return the following JSON result:


```python
{
  "as": "AS16509 Amazon.com, Inc.",
  "city": "Boardman",
  "country": "United States",
  "countryCode": "US",
  "isp": "Amazon",
  "lat": 45.8696,
  "lon": -119.688,
  "org": "Amazon",
  "query": "54.148.84.95",
  "region": "OR",
  "regionName": "Oregon",
  "status": "success",
  "timezone": "America\/Los_Angeles",
  "zip": "97818"
}
```

To see your own Geolocation data in JSON format, just open <a href="http://ip-api.com/json/">http://ip-api.com/json/</a>.

In [None]:
url = "http://ip-api.com/json/"
resp = access_website(url)

In [None]:
if resp.status_code == 200:
    print(f"Content type: \t {resp.headers['content-type']}")
    print(f"Web page content: \n\t {resp.text}")

**<font color="red">Note that we have a JSON page.</font>**

In [None]:
json_page = json.loads(resp.text)
pprint.pprint(json_page)

### <font color="blue">`JSON` Main Functions</font>
`json` has two sets of functions:

- Set 1 - for serialization (the process of transforming objects or data structures into byte streams or strings)
   - `dumps()`: Returns a string representing a JSON object from a Python object.
   - `dump()`: Stores the JSON representation of a Python object in a file (`.json`).
- Set 2 - for deserialization (the conversion of JSON objects into their respective Python objects)
   - `loads()`: Returns a Python object from a string representing a JSON object.
   - `load()`: Reads a `.json` file (containing a JSON object) and returns the corresponding Python object.

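The string-based pair, `dumps()` and `loads()`, can be tried without any file. In the minimal sketch below the `record` dictionary is made up for the example; `indent` and `sort_keys` are optional arguments of `json.dumps` that only affect the layout of the output string.

```python
import json

record = {"name": "Boulder Colorado", "latitude": 40.0, "active": True, "tags": None}

# Serialization: Python object -> JSON string (True becomes true, None becomes null).
text = json.dumps(record, indent=2, sort_keys=True)
print(text)

# Deserialization: JSON string -> Python object.
restored = json.loads(text)
print(restored == record)
```
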

![fig_json](https://www.bogotobogo.com/python/images/json_load_dump/python-json-load-loads-dump-dumps.png)
Image Source: [www.bogotobogo.com](https://www.bogotobogo.com/python/python-json-dumps-loads-file-read-write.php)

![fig_sd](https://miro.medium.com/max/1150/1*9zJJ65xk8agiQXlqd7nYUw.jpeg)
Image Source: Phonlawat Khunphet

**Serialization**

We use the `dump()` function, which takes two arguments: 
* The data object to be serialized.
* The file object to which it will be written (opened in text mode, since `dump()` writes strings, not bytes).

In [None]:
file_name = "Sample.json"
with open(file_name, "w") as fid: 
     json.dump(json_page, fid)

In [None]:
!cat Sample.json

**Deserializing JSON**

* Deserialization is the opposite of serialization, i.e. the conversion of JSON objects into their respective Python objects. 
* We use the `load()` function to read from a file object (`loads()` is the counterpart for strings); the root object returned is usually a `dict` or a `list`.

In [None]:
with open(file_name, "r") as fid: 
     json_obj = json.load(fid)

print(json_obj)

In [None]:
for key in json_obj:
    print("{:>12}: {}".format(key, json_obj[key]))

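The file-based round trip can also be written so that it leaves no file behind and does not rely on a shell command to inspect the result. The sketch below is one possible variant: the temporary directory, the file name `sample.json`, and the small `data` dictionary (values taken from the IP-API sample above) are choices made for the example.

```python
import json
import os
import tempfile

data = {"city": "Boardman", "lat": 45.8696, "lon": -119.688}

# Work in a temporary directory that is removed when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # Serialize: the file is opened in text mode.
    with open(path, "w", encoding="utf-8") as fid:
        json.dump(data, fid, indent=2)

    # Read the raw text back (a portable alternative to `!cat`).
    with open(path, "r", encoding="utf-8") as fid:
        raw_text = fid.read()

    # Deserialize: rebuild the Python object from the file.
    with open(path, "r", encoding="utf-8") as fid:
        loaded = json.load(fid)

print(raw_text)
print(loaded == data)
```
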
## <font color="green">Accessing the DONKI webservice API</font>

The Space Weather Database Of Notifications, Knowledge, Information (DONKI), developed at the Community Coordinated Modeling Center (CCMC), is a comprehensive on-line tool for space weather forecasters, scientists, and the general space science community.

DONKI provides:
- A chronicle of the daily interpretations of space weather observations, analysis, model output, and notifications provided by the Moon to Mars Space Weather Analysis Office as a courtesy to the community.
- Comprehensive knowledge-base search functionality to support anomaly resolution and space science research.
- Intelligent linkages, relationships, and cause-and-effect connections between space weather activities.
- Comprehensive webservice API access to information stored in DONKI.


__Our goal__:
- We want to be able to obtain from the webpage [https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME](https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME) the Coronal Mass Ejection (CME) data
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.


__URL for Coronal Mass Ejection__

The base URL is: `https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME`

To select the period of interest, we can provide two parameters:

- `startDate` (YYYY-MM-DD): defaults to 30 days prior to the current UTC date
- `endDate` (YYYY-MM-DD): defaults to the current UTC date

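The two date strings can also be computed instead of typed by hand. The sketch below mirrors the defaults listed above (a 30-day window ending on the current UTC date); the helper name `donki_date_range` and its arguments are assumptions made for the example, and the returned dictionary has the same shape as the `payload` used in this section.

```python
from datetime import datetime, timedelta, timezone


def donki_date_range(today=None, days_back=30):
    """Return a dict with startDate/endDate strings in YYYY-MM-DD format."""
    if today is None:
        # Current date in UTC, as in the service's documented defaults.
        today = datetime.now(timezone.utc).date()
    start = today - timedelta(days=days_back)
    return {"startDate": start.isoformat(), "endDate": today.isoformat()}


print(donki_date_range())
```
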
In [None]:
url_cme = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME"

In [None]:
startDate="2017-01-03"
endDate="2017-01-03"

payload = {
    'startDate': startDate,
    'endDate': endDate,
}

page_content = access_website(url_cme, payload)

Get the url:

In [None]:
print(f"URL: \n\t {page_content.url}")

Determine the content type:

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

Process the data with JSON:

In [None]:
json_page = json.loads(page_content.text)

In [None]:
type(json_page)

__It seems that we have a list.__

In [None]:
len(json_page)

The list has only one entry.

In [None]:
page = json_page[0]

In [None]:
type(page)

List all the keys:

In [None]:
for x in page:
    print(x)

List the keys and the corresponding values:

In [None]:
for x in page:
    print(f"{x} --> {page[x]} \n")

In [None]:
pprint.pprint(page)

#### Task: Get the link associated with the `enlilList` key

In [None]:
page['cmeAnalyses']

In [None]:
pprint.pprint(page['cmeAnalyses'][0])

In [None]:
url_link = page['cmeAnalyses'][0]['enlilList'][0]['link']
url_link

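Chained lookups such as the one above raise an exception as soon as one level is missing, empty, or `None`. A defensive alternative is sketched below; the helper `dig` is hypothetical, and `sample_page` is a miniature, made-up record that only mimics the nesting of the keys `cmeAnalyses`, `enlilList`, and `link` (the link value is not real data).

```python
def dig(obj, *path, default=None):
    """Follow a sequence of keys/indices through nested dicts and lists."""
    current = obj
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a level that is None.
            return default
    return current


# Made-up record: the second analysis has no ENLIL entries.
sample_page = {
    "cmeAnalyses": [
        {"enlilList": [{"link": "https://example.org/enlil/1"}]},
        {"enlilList": None},
    ]
}

print(dig(sample_page, "cmeAnalyses", 0, "enlilList", 0, "link"))
print(dig(sample_page, "cmeAnalyses", 1, "enlilList", 0, "link", default="no link"))
```
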
#### Task: Get the gif images contained in the webpage

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
source = access_website(url_link)

In [None]:
print(f"Content type: \n\t {source.headers['content-type']}")

In [None]:
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

In [None]:
[a['href'] for a in mysoup.select('a[href]')]

In [None]:
[a['href'] for a in mysoup.select('a[href]') if "iswa" in a['href']]

In [None]:
list_images = [a['href'] for a in mysoup.select('a[href]') if "iswa" in a['href']]

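The filtering logic can be tested on a small inline document, without downloading anything. The sketch below deliberately swaps Beautiful Soup for the standard-library `html.parser.HTMLParser`, so that it runs with no extra package; the class name `LinkCollector` and the HTML fragment (including its URLs) are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up HTML fragment standing in for the downloaded page.
sample_html = """
<p><a href="https://iswa.example.org/movie1.gif">Movie 1</a>
<a href="https://other.example.org/about.html">About</a>
<a name="anchor-without-href">Top</a>
<a href="https://iswa.example.org/movie2.gif">Movie 2</a></p>
"""

collector = LinkCollector()
collector.feed(sample_html)

# Keep only the links whose address contains "iswa", as in the cells above.
iswa_links = [link for link in collector.links if "iswa" in link]
print(iswa_links)
```
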
#### Task: Display some of the remote images

In [None]:
HTML(f'<img src={list_images[0]}>')

In [None]:
HTML(f'<img src={list_images[1]}>')

In [None]:
HTML(f'<img src={list_images[2]}>')

In [None]:
HTML(f'<img src={list_images[3]}>')

In [None]:
HTML(f'<img src={list_images[7]}>')

__Base URL for Coronal Mass Ejection Analysis__

In [None]:
url_cme = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME/Analysis"

### <font color="green">Scraping the NASA Astronomy Picture Of the Day (APOD) Webpage </font>

- We want to be able to obtain from the webpage <a href="https://api.nasa.gov/planetary/apod"> https://api.nasa.gov/planetary/apod</a>,  the Astronomy picture of the day for a given day and plot the image.
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.

**Query Parameters**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
|`date` | YYYY-MM-DD | today | Date of the APOD image to retrieve |
|`start_date` | YYYY-MM-DD | none | The start of a date range, when requesting images for a range of dates. Cannot be used with `date`. |
|`end_date` | YYYY-MM-DD | today | The end of the date range, when used with `start_date`. |
| `count` |	int	| none	| If this is specified then count randomly chosen images will be returned. Cannot be used with `date` or `start_date` and `end_date`. |
| `hd` | bool | False | Retrieve the URL for the high resolution image |
| `api_key` | string | DEMO_KEY | <a href="https://api.nasa.gov/">https://api.nasa.gov/</a> key for expanded usage |

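The exclusions stated in the table can be checked before any request is sent. The sketch below is one possible way to assemble the parameters; the helper name `build_apod_params` is hypothetical, the date used in the call is arbitrary, and the URL is only printed, not accessed.

```python
from urllib.parse import urlencode

APOD_URL = "https://api.nasa.gov/planetary/apod"


def build_apod_params(date=None, start_date=None, end_date=None,
                      count=None, api_key="DEMO_KEY"):
    """Build the APOD query parameters, enforcing the exclusions in the table."""
    if date and start_date:
        raise ValueError("`date` cannot be used with `start_date`")
    if count is not None and (date or start_date or end_date):
        raise ValueError("`count` cannot be used with the date parameters")

    params = {"api_key": api_key}
    # Only include the parameters that were actually provided.
    for key, value in (("date", date), ("start_date", start_date),
                       ("end_date", end_date), ("count", count)):
        if value is not None:
            params[key] = value
    return params


params = build_apod_params(date="2020-07-04")
print(f"{APOD_URL}?{urlencode(params)}")
```
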

---