<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1> <font color="red" size="+3">
    Web Scraping with Python</font>
</H1>
</CENTER>

![fig_json](https://daveberesford.co.uk/wp-content/uploads/2019/02/data-scraping-960x594.png)
Image Source: daveberesford.co.uk

## <font color='red'>Reference Documents</font>

- [Web Scraping: What It Is and How to Use It](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
- [What is web scraping](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
- [Python Requests Tutorial](http://zetcode.com/python/requests/)
- [Python’s Requests Library (Guide)](https://realpython.com/python-requests/)
- [Download Files with Python](https://stackabuse.com/download-files-with-python/)
- [Building a Web Scraper from start to finish](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)
- [Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/)
- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
- [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)

## <font color='red'>Objectives</font>

In this course, we want to describe web scraping and show how it can be accomplished with Python. We present the basic steps of web scraping and run examples on accessing HTTP servers, grabbing the content of web pages (in JSON and HTML formats), parsing the content to extract useful information and performing analyses.

The following topics will be covered:

+ What is web scraping?
+ Components of a web page
+ Accessing Web Pages with `requests`
+ Web Scraping with `json`
+ Web Scraping with `Beautiful Soup`

We expect that at the end of this presentation, participants will be able to write Python scripts that automatically perform web scraping to extract specific data from webpages.

## <font color='red'>Web Scraping</font>

> Web scraping is a mechanism for collecting large amounts of data from a webpage and storing the data in any required format, which in turn helps us analyze the extracted data.
 
- It can be an invaluable process for acquiring volumes of data from multiple sources, manipulating them and arranging them to be stored.
- It is performed using a “**web-scraper**” (or a “bot” or a “web spider” or “web crawler”). 
    - A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database.

Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage
   - The server responds to the request by returning the (JSON, HTML, etc.) content of the target webpage.
2. **Step 2**: Parse the webpage content
   - A parser is needed to create a nested structure of the data. 
3. **Step 3**: Pull out useful data
   - We use Python packages such as `json` and Beautiful Soup to pull out data and store them.
   
![fig_scap](https://scrape-it.cloud/assets/cache_image/assets/blog_img/web-scraping-process2_760x0_073.webp)
Image Source: [scrape-it.cloud](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
     

__Web scrapers crawl websites, extract data from them, transform the data into a usable structured format, and load it into a file or database for subsequent use.__
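
The three steps can be sketched without touching the network. In this minimal illustration, `raw_body` is an invented stand-in for the text a server might return in Step 1, and the field names (`articles`, `title`, `year`) are assumptions made up for the example rather than a real API.

```python
import json

# Step 1 (simulated): the text a server might send back.
# This body is invented for illustration; a real one would come from an HTTP request.
raw_body = """
{"articles": [
    {"title": "Intro to scraping", "year": 2019},
    {"title": "Parsing HTML", "year": 2023}
]}
"""

# Step 2: parse the text into nested Python objects (dicts and lists).
parsed = json.loads(raw_body)

# Step 3: pull out only the data we care about.
recent_titles = [a["title"] for a in parsed["articles"] if a["year"] >= 2020]
print(recent_titles)
```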


### <font color="blue">Web Scraping Rules</font>

Web scraping in itself is completely legal, though a website can set specific rules regarding the practice on its domain.

![fig_ethics](https://hackernoon.com/hn-images/0*MPt2rectMhwklT63.jpg)


As reference, check: <a href="https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices">The Web Scraping Best Practices Guide</a> or watch the video <a href="https://www.youtube.com/watch?v=i7DEy-ZB_Lk">Is Web Scraping Legal?</a>

- Check a website’s Terms and Conditions before you scrape it.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). 
  - One request for one webpage per second is good practice.
  - Never scrape more frequently than you need to.
  - Consider caching the content you scrape so that it’s only downloaded once.
  - Build pauses into your code using functions like `time.sleep()` to keep from overwhelming servers with too many requests too quickly.
- **<font color="red">The layout of a website may change from time to time, so make sure to revisit the site and rewrite your web scraping code as needed.</font>**
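
The caching and pausing advice above can be combined in a small wrapper. This is only a sketch under stated assumptions: `make_polite_fetcher` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for a real download function so that the example runs offline.

```python
import time

def make_polite_fetcher(fetch, min_interval=1.0):
    """Wrap `fetch(url)` so that repeated URLs come from a cache and
    new URLs are requested at most once every `min_interval` seconds."""
    cache = {}
    last_request = None  # time of the last real request (None before the first)

    def polite_fetch(url):
        nonlocal last_request
        if url in cache:                  # already downloaded: no new request
            return cache[url]
        if last_request is not None:
            wait = min_interval - (time.monotonic() - last_request)
            if wait > 0:
                time.sleep(wait)          # pause so the server is not overwhelmed
        cache[url] = fetch(url)
        last_request = time.monotonic()
        return cache[url]

    return polite_fetch

# Stand-in for a real download function, so the sketch runs offline.
calls = []
def fake_fetch(url):
    calls.append(url)
    return f"content of {url}"

fetch = make_polite_fetcher(fake_fetch, min_interval=0.2)
fetch("https://example.com/a")
fetch("https://example.com/a")   # served from the cache, no new request
fetch("https://example.com/b")   # waits about 0.2 s before the request
print(calls)
```

In a real scraper, `fake_fetch` would be replaced by a function that performs the HTTP request and returns the page content.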

## <font color='red'>Required Packages</font>
We will need three main Python packages:

- `requests`: for accessing servers and getting the contents of web pages.
- `json`: for manipulating JSON documents.
- `BeautifulSoup`: for parsing the content of an HTML document.

In [None]:
import os
import pprint
import re
import random

In [None]:
import matplotlib.pyplot as plt
from skimage import io
from IPython.display import HTML

In [None]:
import plotly.express as px

In [None]:
import pandas as pd

In [None]:
import requests as reqs

In [None]:
import json

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
import urllib.request  # the `request` submodule must be imported explicitly for urlretrieve

In [None]:
print(f"Requests version:  {reqs.__version__}")
print(f"JSON version:      {json.__version__}")

# <font color='red'>Python `requests` Module</font>

* Requests is a third-party Python package, not part of the standard library; install it with `pip install requests`.
* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
url = "http://www.webcode.me"

In [None]:
resp = reqs.get(url)

We can get all information from the `resp` object:

In [None]:
print(resp.text)

We can use the module `re` to strip all the HTML markup from the content.

In [None]:
content = resp.text

stripped_content = re.sub('<[^<]+?>', '', content)
print(stripped_content)

#### Encoding

- When you issue a request, `Requests` makes educated guesses about the encoding of the response based on the HTTP headers. 
- The text encoding guessed by `Requests` is used when you access `resp.text`. 

In [None]:
resp.encoding

In [None]:
resp.text

#### Getting the Status of a Web Page
- We perform an HTTP request with the `get()` method and check for the returned status code.
- A status code informs you of the status of the request: whether the request was successful or not.
- `200` is a standard response for a successful HTTP request and `404` tells that the requested resource could not be found.
- By accessing `.status_code`, you can see the status code that the server returned.
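
Status codes fall into numeric ranges, which makes them easy to interpret in code. The following sketch uses the standard library's `http.HTTPStatus` to attach the official phrase to a code; `describe_status` is a hypothetical helper written for this illustration, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)        # raises ValueError for unknown codes
    if code < 200:
        outcome = "informational"
    elif code < 300:
        outcome = "success"
    elif code < 400:
        outcome = "redirection"
    elif code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 404, 500):
    print(describe_status(code))
```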

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
my_url = "http://www.webcode.me/news"
resp = reqs.get(my_url)

In [None]:
assert resp.status_code == 200, f"Failed to fetch {my_url}, got {resp.status_code}"
#print(resp.status_code)

#### Check first if a webpage is accessible

We write a function to check first that a website is accessible.

In [None]:
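def access_website(url):
    resp = None  # stays None if the request fails before any response arrives
    try:
        resp = reqs.get(url, timeout=10)
        # Raise an HTTPError for 4xx/5xx status codes.
        resp.raise_for_status()
        print(f"Valid url {url}")
    except reqs.exceptions.RequestException as errex:
        print(f"Invalid url: {url} ({errex})")
    return resp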

In [None]:
url = "http://www.webcode.me"
resp = access_website(url)
print(f"Status Code: {resp.status_code}")

In [None]:
url = "http://www.webcode.me/news"
resp = access_website(url)
print(f"Status Code: {resp.status_code}")

**Other Information**

In [None]:
url = "http://www.webcode.me"
resp = access_website(url)
if resp.status_code == 200:
    print(f"\t URL:      {resp.url}")
    print(f"\t Encoding: {resp.encoding}")
    print(f"\t Time:     {resp.elapsed}")

#### Server Headers

We can access the headers the server sent back to us:

In [None]:
resp.headers

In [None]:
headers = resp.headers
for key in headers:
    print(f"{key:20} --> {headers[key]}")

#### More on the `get()` Method
- The `get()` method issues a GET request to the server. 
- The GET method requests a representation of the specified resource.

```python
requests.get(url, params={key: value}, **kwargs)
```

| Parameter | Description |
| --- | :--- |
|`url` | (required) The url of the request |
| `params` | (optional) A dictionary, list of tuples or bytes to send as a query string. |
| `allow_redirects` | (optional) A Boolean to enable/disable redirection. |
| `auth` | (optional) A tuple to enable a certain HTTP authentication. |
| `cert` | (optional) A String or Tuple specifying a cert file or key. |
| `cookies` | (optional) A dictionary of cookies to send to the specified url. |
| `headers` | (optional) A dictionary of HTTP headers to send to the specified url. |
| `proxies` | (optional) A dictionary of the protocol to the proxy url. |
| `stream` | (optional) A Boolean indicating whether the response should be immediately downloaded (False) or streamed (True). |
| `timeout` | (optional) A number, or a tuple, indicating how many seconds to wait for the client to establish a connection and/or for the server to send a response. |
| `verify` | (optional) A Boolean or a String indicating whether to verify the server's TLS certificate (a string gives the path to a CA bundle). |

#### Sending Parameters in the URL

- We often want to send some sort of data in the URL’s query string.
- The `get()` method takes a `params` keyword argument where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

If:

```python
   {'key1': value1, 'key2': value2, 'key3': value3}
```
is the dictionary of the parameters, and `https://MyOwnWebsite.com/` is the base url, then the final url to access will be:

```
    https://MyOwnWebsite.com/?key1=value1&key2=value2&key3=value3
```

The code to reach the webpage is:

```Python
payload = {'key1': value1, 'key2': value2, 'key3': value3}
resp = reqs.get("https://MyOwnWebsite.com", params=payload)
```
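
To see how `params` is turned into a query string without contacting any server, one option is to prepare the request instead of sending it. The sketch below relies on `requests.Request(...).prepare()`; `preview_url` is a hypothetical helper name introduced for this example.

```python
import requests as reqs

def preview_url(url, payload):
    """Build the final URL that `reqs.get(url, params=payload)` would request,
    without sending anything over the network."""
    prepared = reqs.Request("GET", url, params=payload).prepare()
    return prepared.url

# Keys and values are joined with "=", and pairs are separated by "&".
print(preview_url("https://httpbin.org/get", {"name": "Peter", "age": 23}))

# A list value repeats the key once per item.
print(preview_url("https://httpbin.org/get", {"name": ["Peter", "Johns"]}))
```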

The following script sends a variable with a value to the `httpbin.org` server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

We send a `get()` request to the web site and pass the data, which is specified in the `params` parameter:

In [None]:
payload = {'name': 'Peter'}
resp = reqs.get("https://httpbin.org/get", params=payload)

`payload` is a dictionary of pairs of keys/values:

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)

You can also pass a list of items as a value:

In [None]:
payload = {'name': ['Peter', 'Johns'], 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

### Content-Type

- It is part of the HTTP header.
   - A string used to indicate the media type of the resource you want to access.  
   - It tells the browser the type of content it has to load on the machine. 
- Here are some values of `content-type`:

```html
   text/html
   text/csv
   application/json
   application/javascript
   audio/ogg
   image/png
```

- While scraping a web page, it is important to determine the content type first, before choosing the right tool to parse the content of the page.

In [None]:
print(f"Content Type: \n\t {resp.headers['content-type']}")
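
The content type can drive the choice of parsing tool. The following is a minimal sketch: `choose_parser` is a hypothetical helper, and the mapping from media types to tools is an assumption that reflects the tools used in this course rather than an exhaustive rule.

```python
def choose_parser(content_type):
    """Suggest a parsing tool from a Content-Type header value."""
    # Drop optional parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return "json"
    if media_type == "text/html":
        return "BeautifulSoup (html.parser)"
    if media_type in ("application/xml", "text/xml"):
        return "BeautifulSoup (xml)"
    return "unknown"

print(choose_parser("application/json"))
print(choose_parser("text/html; charset=UTF-8"))
print(choose_parser("image/png"))
```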

#### Revisiting the function to access a web page

We want to pass the `payload` and `timeout` as arguments of the function.

In [None]:
def access_website(url: str, payload: dict=None, timeout: int=10):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object; otherwise print an error message.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access
    payload : dict
       Parameters needed to construct the target url.
    timeout: int
       Maximum time (in seconds) to access a website and grab its content.

    Returns
    -------
    resp : object
       Object which has information on the web page of interest.
    """
    resp = None  # returned as-is if the request fails before any response arrives
    try:
        if payload:
            resp = reqs.get(url, params=payload, timeout=timeout)
        else:
            resp = reqs.get(url, timeout=timeout)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except reqs.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except reqs.exceptions.RequestException as err:
        print(f"General Error: {err}")
    else:
        print(f"Successfully accessed the site: \n\t {resp.url}")
    
    return resp
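
The order of the `except` clauses in the function matters: `HTTPError`, `ConnectionError` and `Timeout` are all subclasses of `RequestException`, so the general clause has to come last or it would catch everything. The sketch below mirrors that order with a hypothetical helper, `classify_error`, and runs without any network access.

```python
import requests as reqs

def classify_error(exc):
    """Mirror the except-clause order: most specific classes first."""
    if isinstance(exc, reqs.exceptions.HTTPError):
        return "Http Error"
    if isinstance(exc, reqs.exceptions.ConnectionError):
        return "Error Connecting"
    if isinstance(exc, reqs.exceptions.Timeout):
        return "Timeout Error"
    if isinstance(exc, reqs.exceptions.RequestException):
        return "General Error"
    return "Not a requests error"

# ReadTimeout is a subclass of Timeout.
print(classify_error(reqs.exceptions.ReadTimeout()))
# TooManyRedirects derives only from RequestException.
print(classify_error(reqs.exceptions.TooManyRedirects()))
```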

---

## <font color='red'>Web Scraping with JSON</font>


### <font color="blue">NYC - Citi Bike System Data </font>

- Citi Bike is the nation's largest bikeshare program, with 25,000 bikes and over 1,500 stations across Manhattan, Brooklyn, Queens, the Bronx, Jersey City, and Hoboken.
- Citi Bike bikes can be unlocked from one station and returned to any other station in the system.
- Citi Bike maintains a [database of bike trip histories](https://citibikenyc.com/system-data):
   - Where do Citi Bikers ride?
   - When do they ride? How far do they go?
   - Which stations are most popular?
   - What days of the week are most rides taken on? 
- Here we read data from the bike sharing system.

__Get JSON string data from CitiBike NYC__

In [None]:
# Other feeds, kept for reference (only the last assignment is used below):
# bike_url = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"
# bike_url = "https://gbfs.citibikenyc.com/gbfs/2.3/gbfs.json"
bike_url = "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_status.json"

In [None]:
json_response = access_website(bike_url)

__Verify the status code__

In [None]:
json_response.status_code

__Check the content type__

In [None]:
print(f"Content Type: \n\t {json_response.headers['content-type']}")

__Check type of json_response object__

In [None]:
print(type(json_response.text))

__Using the `loads()` function, convert the JSON string into a Python object__

In [None]:
bike_dict = json.loads(json_response.text)
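
To make the conversion concrete without a network connection, the sketch below applies `json.loads()` to a small invented string. The top-level keys follow the layout inspected in the next cells, but the station entries and their values are made up for illustration.

```python
import json

# Invented sample; the station values are not real data.
sample_text = """
{
  "last_updated": 1700000000,
  "ttl": 60,
  "version": "2.3",
  "data": {"stations": [
      {"station_id": "s1", "num_bikes_available": 4, "num_docks_available": 0},
      {"station_id": "s2", "num_bikes_available": 0, "num_docks_available": 7}
  ]}
}
"""

sample_dict = json.loads(sample_text)         # str -> dict
stations = sample_dict["data"]["stations"]    # list of dicts, one per station

# Nested JSON objects become nested dicts and lists that we index as usual.
open_docks = [s["station_id"] for s in stations if s["num_docks_available"] > 0]
print(type(sample_dict).__name__, len(stations), open_docks)
```

A response object also offers the shortcut `.json()`, which performs the same conversion on the body of the response.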

__Inspect the Python object__

In [None]:
print(type(bike_dict))

In [None]:
list(bike_dict.keys())

In [None]:
bike_dict['last_updated']

In [None]:
bike_dict['ttl']

In [None]:
bike_dict['version']

__Record data in a Pandas DataFrame__

In [None]:
type(bike_dict['data'])

In [None]:
list(bike_dict['data'].keys())

In [None]:
type(bike_dict['data']['stations'])

In [None]:
len(bike_dict['data']['stations'])

In [None]:
bike_dict['data']['stations'][0]

In [None]:
list_columns = list(bike_dict['data']['stations'][0].keys())
list_columns

In [None]:
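# Build the DataFrame in one step from the list of station dictionaries
# (faster than appending rows one at a time, and numeric columns keep numeric dtypes).
df = pd.DataFrame(bike_dict['data']['stations'], columns=list_columns)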

In [None]:
df.head(4)

In [None]:
df.shape

In [None]:
df.info()

__Number of stations with at least one available dock__

In [None]:
print(f"{len(df[df['num_docks_available'] > 0])} / {len(df)}")

__Number of stations with at least one available ebike__

In [None]:
print(f"{len(df[df['num_ebikes_available'] > 0])} / {len(df)}")

__Total number of disabled bikes at stations__

In [None]:
df['num_bikes_disabled'].sum()

__Further analysis__

To do a more comprehensive analysis, you may want to start with the JSON document:

[https://gbfs.citibikenyc.com/gbfs/2.3/gbfs.json](https://gbfs.citibikenyc.com/gbfs/2.3/gbfs.json)

It contains links to other JSON documents that can be combined to learn how the bike share system is used.

### <font color="blue">Scraping the NASA Astronomy Picture Of the Day (APOD) Webpage </font>

- We want to be able to obtain from the webpage <a href="https://api.nasa.gov/planetary/apod"> https://api.nasa.gov/planetary/apod</a>,  the Astronomy picture of the day for a given day and plot the image.
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.

**Query Parameters**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
|`date` | YYYY-MM-DD | today | Date of the APOD image to retrieve |
|`start_date` | YYYY-MM-DD | none | The start of a date range, when requesting data for a range of dates. Cannot be used with `date`. |
|`end_date` | YYYY-MM-DD | today | The end of the date range, when used with `start_date`. |
| `count` | int | none | If this is specified, then `count` randomly chosen images will be returned. Cannot be used with `date` or `start_date` and `end_date`. |
| `hd` | bool | False | Retrieve the URL for the high resolution image |
| `api_key` | string | DEMO_KEY | <a href="https://api.nasa.gov/">https://api.nasa.gov/</a> key for expanded usage |
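
The restrictions listed in the table can be checked before any request is sent. The helper below, `build_apod_params`, is a hypothetical function written for this illustration; it only encodes the rules stated in the table and is not part of the NASA API.

```python
def build_apod_params(api_key="DEMO_KEY", date=None, start_date=None,
                      end_date=None, count=None, hd=False):
    """Assemble a payload dictionary for the APOD service, rejecting
    parameter combinations that the table above rules out."""
    if date and (start_date or end_date):
        raise ValueError("`date` cannot be combined with `start_date`/`end_date`")
    if count is not None and (date or start_date or end_date):
        raise ValueError("`count` cannot be combined with date parameters")
    if end_date and not start_date:
        raise ValueError("`end_date` is only meaningful with `start_date`")

    params = {"api_key": api_key, "hd": hd}
    optional = {"date": date, "start_date": start_date,
                "end_date": end_date, "count": count}
    # Keep only the optional parameters that were actually provided.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params

print(build_apod_params(date="2024-01-01"))
print(build_apod_params(start_date="2024-01-01", end_date="2024-01-07"))
```

The resulting dictionary can be passed as the `params` argument of a `get()` request.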

__Get today's date__

In [None]:
import datetime
today = datetime.datetime.today()
date = today.strftime("%Y-%m-%d")
date

__Access the webpage__

In [None]:
url = "https://api.nasa.gov/planetary/apod"
payload = {'api_key': "DEMO_KEY",
          'date': date,
          'hd': True}

page_content = access_website(url, payload)

__Get the url__:

In [None]:
print(f"URL: \n\t {page_content.url}")

__Determine the content type__:

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

__Process the data with JSON__:

In [None]:
json_page = json.loads(page_content.text)

The `json_page` variable is a dictionary of various keys and values. Let’s take a look at the keys of this variable:

In [None]:
for x in json_page:
    print(x)

Print the keys and values:

In [None]:
for x in json_page:
    print(f"{x} --> {json_page[x]} \n")

In [None]:
pprint.pprint(json_page)

__Plot images__:

In [None]:
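if json_page["media_type"] == "image":
    # Plot with matplotlib directly: io.imshow/io.show are deprecated in recent scikit-image releases.
    plt.imshow(io.imread(json_page["url"]))
    plt.title(json_page["title"])
    plt.show()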

<font color="red">If you want to download the file on your local system:</font>

In [None]:
url_name = json_page["url"]
loc_file_name = os.path.basename(url_name)

urllib.request.urlretrieve(url_name, loc_file_name)
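
Taking `os.path.basename` of the whole URL works when the URL ends with the file name, but a query string would end up in the local name. A slightly more defensive sketch parses the URL first; `local_name_from_url` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def local_name_from_url(url, default="download.bin"):
    """Derive a local file name from the path component of a URL."""
    path = urlparse(url).path        # excludes "?query" and "#fragment"
    name = os.path.basename(path)
    return name or default           # fall back when the path ends with "/"

print(local_name_from_url("https://example.com/images/moon.jpg?size=large"))
print(local_name_from_url("https://example.com/"))
```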

If you want to view the image through a browser, use:

In [None]:
from IPython.display import Javascript
def window_open(url):
    display(Javascript('window.open("{url}");'.format(url=url)))
    
window_open(json_page['url'])

### <font color="blue">Obtaining Mars Rover Photos</font>

In [None]:
rover_url = 'https://api.nasa.gov/mars-photos/api/v1/rovers/curiosity/photos'

payload = {'api_key': "DEMO_KEY",
           'sol': 1000}

response = access_website(rover_url, payload)

In [None]:
print(f"URL: \n\t {response.url}")

In [None]:
print(f"Content type: \n\t {response.headers['content-type']}")

In [None]:
response_dict = response.json()

In [None]:
print(f"KEYS: \n\t {response_dict.keys()}")

In [None]:
photos = response_dict['photos']

In [None]:
print(type(photos))

In [None]:
print(f"There are {len(photos)} photos.")

In [None]:
pprint.pprint(photos[0])

__Extract the URL of each photo__

In [None]:
url_photos = list()
for photo in photos:
    url_photos.append(photo['img_src'])

In [None]:
print(url_photos[0])

In [None]:
assert len(url_photos) == len(photos)

__Randomly select 9 photos__

In [None]:
url_pictures = random.sample(url_photos, 9)

In [None]:
url_pictures

__Display the 9 photos__

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(20, 20))
ax = axes.ravel()

for i in range(9):
    ax[i].imshow(io.imread(url_pictures[i]))

fig.tight_layout()

## <font color="purple">Breakout 1</font>

Use the following code to list all the images in the provided range of years:

```python
url = "https://images-api.nasa.gov/search"

payload = {
        "q": "apollo",
        "page": "1",
        "media_type": "image",
        "year_start": "2020",
        "year_end": "2024"}

response = reqs.get(url, params=payload)
images = response.json()["collection"]["items"]
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>

```python
import requests as reqs

url = "https://images-api.nasa.gov/search"

params = {
    "q": "apollo",
    "page": "1",
    "media_type": "image",
    "year_start": "2020",
    "year_end": "2024"
}

response = reqs.get(url, params=params)
response.raise_for_status()

images = response.json()["collection"]["items"]
print(f"Number of images: {len(images)}")
for image in images:
    thumbnail_url = image["links"][0]["href"]
    image_url = thumbnail_url[:thumbnail_url.rfind("~")] + "~orig.jpg"
    print(image_url)
``` 
</p>
</details>

---

## <font color='red'>Web Scraping with Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.

### <font color="blue">Parsing an HTML document</font>

HTML is made up of <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">tags</a>. It stores all of its data in them, and in the midst of all that clutter lies the data we need. 

Some of the tags are:

* `head` - contains machine-readable information (metadata) about the document, like its title, scripts, and style sheets.
* `body` - represents the content of an HTML document. There can be only one `<body>` element in a document.
* `title` - defines the document's title that is shown in a Browser's title bar or a page's tab. 
* `p` - for paragraph
* `div` — indicates a division, or area, of the page.
* `b` — bolds any text inside.
* `i` — italicizes any text inside.
* `table` — creates a table.
* `form` — creates an input form.
  

In [None]:
url = "http://www.webcode.me"
source = access_website(url)

In [None]:
print(f"Content type: \n\t {source.headers['content-type']}")

In [None]:
print(source)

**Create a beautiful soup object by using the html parser**

In [None]:
mysoup = bso(source.text, 'html.parser')

**Print the HTML content of the page using the `prettify` method**

In [None]:
print(mysoup.prettify())

**Obtain the title section of the page**

In [None]:
print(mysoup.title)

**Get the tag name**

In [None]:
print(mysoup.title.name)

**Get the text inside the tag**

In [None]:
print(mysoup.title.string)

In [None]:
print(mysoup.title.text)

**Beginning navigation**

In [None]:
print(mysoup.title.parent.name)

**Getting specific tags**
- The `find` method searches for the first tag with the needed name.
- The `find_all` method searches for all tags with the needed tag name and returns them as a list.
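
The difference between the two methods is easiest to see on a tiny document. The HTML string below is invented for this sketch, so it runs offline; it assumes only that `bs4` is installed.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")          # a single Tag (or None when nothing matches)
every = soup.find_all("p")      # a list-like collection of Tags
missing = soup.find("table")    # there is no <table> here, so this is None

print(first.get_text())
print([p.get_text() for p in every])
print(missing)
```

Because `find` can return `None`, it is worth checking the result before accessing attributes such as `.text`.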

Assume that we want to find paragraph tags `<p>`:

In [None]:
print(mysoup.p)

In [None]:
print(mysoup.p.text)

In [None]:
print(mysoup.find('p'))

In [None]:
print(type(mysoup.find('p')))

We can find all paragraphs:

In [None]:
print(mysoup.find_all('p'))

In [None]:
print(type(mysoup.find_all('p')))

To get the last paragraph only:

In [None]:
print(mysoup.find_all('p')[-1])

We can loop over the paragraphs:

In [None]:
for i, paragraph in enumerate(mysoup.find_all('p'), start=1):
    print(f"Paragraph Text {i}: {paragraph.text}")

In [None]:
body = mysoup.find_all('body')
print(body)

In [None]:
print(f"Type body:       {type(body)}")
print(f"Type inner body: {type(body[0])}")

In [None]:
print(body[0].find_all('p'))

In [None]:
print(mysoup.find_all('body')[0].find_all('p'))

**Grab the text**

- Use the method `get_text`.

In [None]:
print(mysoup.get_text())

**Searching for tags by `class` and `id`**

<a href="https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors">CSS (Cascading Style Sheets)</a> is a declarative language that controls how webpages look in the browser. 
- The browser applies CSS style declarations to selected elements to display them properly. 
- A style declaration contains the properties and their values, which determine how a webpage looks.
- Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. 
- We can also use them when scraping to select specific elements we want to scrape.


We can use the `find_all` method to search for items by `class` or by `id`. 

```python
mysoup.find_all("html_tag", class_="your_class_name")
mysoup.find_all(class_="your_class_name")

mysoup.find_all("html_tag", id="your_id_name")
mysoup.find_all(id="your_id_name")
```

In [None]:
url = "http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html"
source = access_website(url)

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

In the example below, we’ll search for any `p` tag that has the class `outer-text`:

In [None]:
mysoup.find_all('p', class_='outer-text')

We can also look for any tag that has the class `outer-text`:

In [None]:
mysoup.find_all(class_="outer-text")

In [None]:
mysoup.find_all(class_="outer-text")[-1].get_text()

In [None]:
mysoup.find_all(class_="outer-text")[-1].get_text().strip()

We can also search for elements by id:

In [None]:
mysoup.find_all(id="first")

In [None]:
mysoup.find_all(id="first")[0].get_text()

**Using CSS Selectors**

CSS selectors are how the CSS language lets developers specify which HTML tags to style. 

Here are some examples:

- `p a` — finds all `a` tags inside of a `p` tag.
- `body p a` — finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body` — finds all `body` tags inside of an `html` tag.
- `p.outer-text` — finds all `p` tags with a class of `outer-text`.
- `p#first` — finds all `p` tags with an id of `first`.
- `body p.outer-text` — finds any `p` tags with a class of `outer-text` inside of a `body` tag.

We can use the CSS selectors to search items inside webpages. `BeautifulSoup` objects support searching a page via CSS selectors using the `select` method. 
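
As a self-contained illustration, the sketch below applies a few of these selectors to an invented HTML snippet (the class names, id and link are made up). `select` returns every match, while `select_one` returns only the first match or `None`.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="outer-text" id="first">Intro with a <b>bold</b> word.</p>
  <div><p class="inner-text">Nested <a href="/more">link</a></p></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

outer = soup.select("p.outer-text")    # <p> tags with the class "outer-text"
first = soup.select_one("p#first")     # the <p> tag with the id "first"
links = soup.select("body p a")        # <a> inside <p> inside <body>

print([p.get_text() for p in outer])
print(first["id"], [a["href"] for a in links])
```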

Find all the `p` tags in our page that are inside of a `body`:

In [None]:
mysoup.select("body p")

Find all the `b` tags in our page that are inside of a `p`:

In [None]:
mysoup.select("p b")

Find all `b` tags inside of a `p` tag inside of a `body`:

In [None]:
mysoup.select("body p b")

Find all `p` tags with an id of `first`:

In [None]:
mysoup.select('p#first')

### <font color="green"> Example: Extract the web link of the Astronomy Picture of the Day</font>

In [None]:
url = "https://apod.nasa.gov/apod/astropix.html"
source = access_website(url)

In [None]:
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

Print basic information of the Image of the Day:

In [None]:
print(mysoup.find('p').get_text())

In [None]:
href_comments = mysoup.find_all('a')
for a in href_comments:
    print(a.get_text())

<font color="red">__The `Picture of the Day` can either be a picture or a video.__</font>

In [None]:
picture_day = "picture"
if mysoup.iframe:
    print("We have a video.")
    picture_day = "video"
else:
    print("We have a picture.")

In [None]:
if picture_day == "video":
    # A bare expression inside an if-block is not rendered, so call display() explicitly.
    display(HTML(str(mysoup.iframe)))

In [None]:
if picture_day == "video":
    print(mysoup.iframe['src'])

In [None]:
if picture_day == "video":
    src_list = [a['src'] for a in mysoup.select('iframe[src]')]
    print(src_list)

Find all the tags that have a `src` attribute:

In [None]:
src_tags = mysoup.find_all(src=True)
src_tags

Find all the tags that have an `href` attribute:

In [None]:
href_tags = mysoup.find_all(href=True)
href_tags

List all valid urls in `a` tags:

In [None]:
links_with_text = [a['href'] for a in mysoup.find_all('a', href=True) if a.text]
links_with_text

In [None]:
link_list1 = [a['href'] for a in mysoup.find_all('a', href=True)]
link_list1

In [None]:
link_list2 = [l.get('href') for l in mysoup.find_all('a')]
link_list2

In [None]:
link_list3 = [a['href'] for a in mysoup.select('a[href]')]
link_list3
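
Many of the extracted `href` values are relative to the page they came from. A hedged way to turn them into full URLs is `urllib.parse.urljoin` from the standard library; the `hrefs` list below is a hypothetical sample, not the actual content of the page.

```python
from urllib.parse import urljoin

page_url = "https://apod.nasa.gov/apod/astropix.html"

# Hypothetical href values of the kind a page may contain.
hrefs = ["image/2401/example.jpg", "archivepix.html", "https://www.nasa.gov/"]

# Relative links are resolved against the page URL; absolute links are kept as-is.
absolute = [urljoin(page_url, h) for h in hrefs]
for link in absolute:
    print(link)
```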

If the Picture of the Day is an image instead (not a video), the following can help us view the image:

In [None]:
if picture_day == "picture":
    url_image = "".join(["https://apod.nasa.gov/apod/", link_list3[1]])
    fig, axes = plt.subplots(figsize=(10, 8))
    axes.imshow(io.imread(url_image))

## <font color="purple">Breakout 2</font>

Go to the webpage:

[https://astg606.github.io/py_courses/summer_2022/](https://astg606.github.io/py_courses/summer_2022/)

and extract the `Course Evaluation` web link.

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import requests as reqs

from bs4 import BeautifulSoup as bso

URL = "https://astg606.github.io/py_courses/summer_2022/"

source = reqs.get(URL)
if source.status_code == 200:
    mysoup = bso(source.content, 'html.parser')
    href_tags = mysoup.find_all(href=True)
    for tag in href_tags:
        if tag.get_text() == "Course Evaluation":
            print(tag["href"])
else:
    print("URL not accessible.")
```

</p>
</details> 

## <font color="purple">Breakout 3</font>

Go to the webpage:

[https://mars.nasa.gov/news/](https://mars.nasa.gov/news/)

and extract the latest `News Title` and associated `Paragraph Text`.

This exercise was inspired by: [Mars Information Scraper](https://pyligent.github.io/2019-01-26-Mars-Information-Scraper/).

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
mars_news_url = 'https://mars.nasa.gov/news/'
mars_news_source = access_website(mars_news_url)
mars_news_bs = bso(mars_news_source.text, 'html.parser')

#print(mars_news_bs.prettify())

mars_news_titles = [a['title'] for a in mars_news_bs.find_all('a', title=True) if not a.text]
mars_news_paras = [a.text for a in mars_news_bs.find_all(class_="margin-top-0 margin-bottom-1")]

assert len(mars_news_titles) == len(mars_news_paras)

for i in range(len(mars_news_titles)):
    pprint.pprint(mars_news_titles[i])
    pprint.pprint(mars_news_paras[i])
    print()
```

</p>
</details> 

### <font color="blue">Parsing an XML document</font>

- Extensible Markup Language (XML) is a markup language similar to HTML, but it is primarily used to store, transport, and exchange data rather than solely for creating web pages.
- XML provides the flexibility to define custom tags to describe data in a specific manner.
- XML uses markup symbols (tags) to provide more information about any data.
- Tags are crucial in defining data.

#### Parsing an XML document

__Sample XML Document__

- The elements in an XML document form a document tree. The tree starts at the root and branches to the lowest level of the tree.
- The first line describes the root element of the document: `<nasa_centers>`
- The next set of lines describes two child elements (`<center>` `</center>`) of the root, and each of the elements has 3 subelements (`name`, `state`, `location`).

```xml
<nasa_centers>
    <center>
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
    <center>
        <name>Stennis Space Center</name>
        <state>Mississippi</state>
        <location>John C. Stennis Space Center</location>
    </center>
</nasa_centers>
```

__Parsing with BeautifulSoup__

In [None]:
xmldoc = """ 
<nasa_centers>
    <center>
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
    <center>
        <name>Stennis Space Center</name>
        <state>Mississippi</state>
        <location>John C. Stennis Space Center</location>
    </center>
</nasa_centers>
"""

In [None]:
xmldoc_bs = bso(xmldoc, 'xml')

In [None]:
print(xmldoc_bs.prettify())

Get the root element:

In [None]:
xmldoc_root = xmldoc_bs.find_all('nasa_centers')
xmldoc_root

In [None]:
xmldoc_root[0]

Extract data from subelements:

In [None]:
xmldoc_centers = xmldoc_bs.find_all('center')

In [None]:
xmldoc_centers

In [None]:
for center in xmldoc_centers:
    for tag in center.find_all():
        print(f"{tag.name}: {tag.text}")
    print()
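
For comparison, the same document tree can also be walked with the standard library's `xml.etree.ElementTree`, which needs no extra XML parser to be installed. This is an alternative technique, not the approach used in the rest of this section; the sketch repeats a shortened copy of the sample document so that it is self-contained.

```python
import xml.etree.ElementTree as ET

xml_text = """<nasa_centers>
    <center>
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
    </center>
    <center>
        <name>Stennis Space Center</name>
        <state>Mississippi</state>
    </center>
</nasa_centers>"""

root = ET.fromstring(xml_text)       # root element of the tree
print(root.tag)

# Each <center> child becomes a dictionary {subelement name: text}.
centers = [{child.tag: child.text for child in center}
           for center in root.findall("center")]
for center in centers:
    print(center)
```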

### <font color="green"> NASA held and pending patents </font>

- We want to grab the content of a dataset that has information pertaining to NASA held and pending patents.
- We will save the content in a Pandas DataFrame for analysis.

In [None]:
patent_url = "https://data.nasa.gov/api/views/gquh-watm/rows.xml?accessType=DOWNLOAD"

In [None]:
patent_source = access_website(patent_url)

In [None]:
patent_source.headers['Content-Type']

In [None]:
patent_bs = bso(patent_source.text, 'xml')

In [None]:
print(patent_bs.prettify())

__Get all the rows in the XML document__

In [None]:
patent_rows = patent_bs.find_all('row')
num_rows = len(patent_rows)

__Get the column names__

In [None]:
columns = [
    'center',
    'status',
    'case_number',
    'patent_number',
    'application_sn',
    'title',
    'patent_expiration_date'
]

In [None]:
#columns = [tag.name for tag in patent_rows[2].find_all()]
#columns 

__Loop over the rows to extract elements and populate a Pandas DataFrame__

In [None]:
df = pd.DataFrame(columns=columns)

for row in patent_rows:
    item = dict()
    for key in columns:
        tag = row.find(key)
        # A row may lack a field; store None instead of silently skipping it.
        item[key] = tag.text if tag is not None else None

    df.loc[len(df)] = item

In [None]:
df.info()

In [None]:
df = pd.DataFrame(columns=columns)

for row in patent_rows:
    item = dict()
    for tag in row.find_all():
        item[tag.name] = tag.text

    df.loc[len(df)] = item

In [None]:
df

In [None]:
df.info()

__Group data by centers__

In [None]:
df_center = df.groupby('center')

In [None]:
df_center.ngroups

In [None]:
list_groups = list(df_center.groups.keys())
list_groups

In [None]:
df_center.size()

In [None]:
df_center.count()

In [None]:
fig = px.pie(df_center.count(), values="status", names=list_groups)
fig.show()

In [None]:
df_center.get_group('NASA Goddard Space Flight Center')