<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1> <font color="red" size="+3">
    Web Scraping with Python</font>
    <br>
    Earth Science Applications
</H1>
</CENTER>

## <font color='red'>Reference Documents</font>

- [Web Scraping: What It Is and How to Use It](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
- [What is web scraping](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
- [Python Requests Tutorial](http://zetcode.com/python/requests/)
- [Python’s Requests Library (Guide)](https://realpython.com/python-requests/)
- [Download Files with Python](https://stackabuse.com/download-files-with-python/)
- [Building a Web Scraper from start to finish](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)
- [Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/)
- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

## <font color='red'>Objectives</font>

In this course, we want to describe web scraping and show how it can be accomplished with Python. We present the basic steps of web scraping and run examples on accessing HTTP servers, grabbing the content of web pages (in JSON and HTML formats), parsing the content to extract useful information and performing analyses.

The following topics will be covered:

+ What is web scraping?
+ Components of a web page
+ Accessing Web Pages with `requests`
+ Web Scraping with `json`
+ Web Scraping with `Beautiful Soup`

We expect that, at the end of this presentation, participants will be able to write Python scripts that automatically perform web scraping to extract specific data from webpages.

## <font color='red'>Web Scraping</font>

> Web scraping is the process of collecting large amounts of data from webpages and storing them in a required format so that the extracted data can be analyzed.


Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage
   - The server responds to the request by returning the (JSON, HTML, etc.) content of the target webpage.
2. **Step 2**: Parse the webpage content
   - A parser is needed to create a nested structure of the data. 
3. **Step 3**: Pull out useful data
   - We use Python packages such as `json` and Beautiful Soup to pull out data and store them.
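
The three steps can be sketched without any network access by letting a hard-coded string stand in for the server response. The record names and masses below are made-up placeholders, not real data.

```python
import json

# Step 1 (simulated): the text a server might return for a JSON endpoint.
# These records are hypothetical placeholders.
raw_text = '[{"name": "Sample A", "mass": "21"}, {"name": "Sample B", "mass": "720"}]'

# Step 2: parse the text into nested Python objects (a list of dictionaries).
records = json.loads(raw_text)

# Step 3: pull out the fields of interest and store them.
masses = {rec["name"]: float(rec["mass"]) for rec in records}
print(masses)
```

In a real scraper, only Step 1 changes: the text comes from an HTTP response instead of a literal string.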

## <font color='red'>Required Packages</font>
We will need three main Python packages:

- `requests`: for accessing servers and getting the contents of web pages.
- `json`: for manipulating JSON documents.
- `BeautifulSoup`: for parsing the content of an HTML document.

In [None]:
import matplotlib.pyplot as plt
from skimage import io
from IPython.display import HTML

In [None]:
import os
import io
import pprint

In [None]:
import numpy as np

In [None]:
from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
import movingpandas as mpd

In [None]:
import xarray as xr

In [None]:
import requests as reqs

In [None]:
import json

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
print(f"Requests version:  {reqs.__version__}")
print(f"JSON version:      {json.__version__}")

# <font color='red'>Python `requests` Module</font>

* Requests is a third-party Python package (it is not part of the standard library) and can be installed with `pip install requests`.
* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).

#### Sending Parameters in URL

- We often want to send some sort of data in the URL’s query string.
- The `get()` method takes a `params` keyword argument where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

If:
```python
   {'key1': value1, 'key2': value2, 'key3': value3}
```
is the dictionary of the parameters, and `https://MyOwnWebsite.com/` is the url, then the final url to access will be:
```
    https://MyOwnWebsite.com/?key1=value1&key2=value2&key3=value3
```

The code to reach the webpage is:
```Python
payload = {'key1': value1, 'key2': value2, 'key3': value3}
resp = reqs.get("https://MyOwnWebsite.com", params=payload)
```
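
The query-string rules above can be checked without contacting any server. The sketch below uses the standard library's `urllib.parse.urlencode` (not `requests` itself) to assemble the same kind of URL; the keys and values are placeholders.

```python
from urllib.parse import urlencode

# Placeholder keys and values, for illustration only
payload = {"key1": "value1", "key2": "value2", "key3": "value3"}

base_url = "https://MyOwnWebsite.com/"
query_string = urlencode(payload)        # key=value pairs joined by '&'
full_url = f"{base_url}?{query_string}"  # '?' marks the start of the query
print(full_url)
```

When `params=payload` is passed to `get()`, `requests` performs an equivalent encoding, and the resulting address is available as `resp.url`.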

In [None]:
def access_website(url: str, payload: dict = None, timeout: int = 10):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object, otherwise print an error message.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access
    payload : dict
       Parameters needed to construct the target url.
    timeout : int
       Maximum number of seconds to wait for the server to respond.

    Returns
    -------
    resp : requests.Response or None
       Object which has information on the web page of interest.
       It is None when no response was received (e.g., connection
       error or timeout).
    """
    resp = None  # stays None if the request itself fails
    try:
        if payload:
            resp = reqs.get(url, params=payload, timeout=timeout)
        else:
            resp = reqs.get(url, timeout=timeout)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except reqs.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except reqs.exceptions.RequestException as err:
        print(f"General Error: {err}")
    else:
        print("Successfully accessed the site!")
    
    return resp

---

## <font color='red'>Application 1</font>



## <font color='red'>Meteorite Landings</font>

- Meteorite impacts can have effects on the climate:
   - The size and velocity of the impacting body determine the amount of energy released.
   - While most impacts are small and have minimal effects, larger impacts can have far-reaching consequences.
   - Impacts from asteroids and comets can generate more atmospheric dust than large volcanic explosions.
- The Meteoritical Society maintains a comprehensive [data set](https://datasets.ai/datasets/meteorite-landings-api) that contains information on all of the known meteorite landings.
- We want to access the data set and perform analyses.

More comprehensive analyses can be found at:

[Meteorite Landings Per Country Using Geopandas](https://github.com/msikorski93/Meteorite-Landings/blob/main/world_geopandas.ipynb)

[Meteorite Landings](https://github.com/msikorski93/Meteorite-Landings/blob/main/meteorites_landings.ipynb)

In [None]:
meteorite_url = "https://data.nasa.gov/resource/gh4g-9sfh.json"

In [None]:
meteorite_resp = access_website(meteorite_url)

In [None]:
print(f"Content type: \n\t {meteorite_resp.headers['content-type']}")

In [None]:
meteorite_page = json.loads(meteorite_resp.text)

In [None]:
type(meteorite_page)

In [None]:
len(meteorite_page)

In [None]:
meteorite_page[0]

In [None]:
list_keys = list(meteorite_page[0].keys())
list_keys = list_keys[:-1]
list_keys

In [None]:
meteorite_df = pd.DataFrame(columns=list_keys)
meteorite_df

In [None]:
missing_val = "-99999"

for item in meteorite_page:
    data = list()
    for key in list_keys:
        try:
            data.append(item[key])
        except KeyError:
            # The record lacks this field: store the sentinel instead
            data.append(missing_val)
    meteorite_df.loc[len(meteorite_df)] = data
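
As an alternative sketch, pandas can build the table directly from a list of dictionaries: keys missing from a record become `NaN`, so no sentinel value is needed. The two records below are hypothetical and only mimic the shape of the API response.

```python
import pandas as pd

# Hypothetical records; the second one lacks the 'mass' key.
sample_page = [
    {"name": "Sample A", "id": "1", "mass": "21"},
    {"name": "Sample B", "id": "2"},
]

# 'columns' selects and orders the fields to keep
sample_df = pd.DataFrame(sample_page, columns=["name", "id", "mass"])
print(sample_df)
```

This approach also avoids growing the DataFrame one row at a time, which becomes slow for large responses.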

In [None]:
meteorite_df

In [None]:
meteorite_df.replace(missing_val, np.nan, inplace=True)

In [None]:
meteorite_df.info()

__Type conversion__

In [None]:
meteorite_df['id'] = meteorite_df['id'].astype(int)

In [None]:
cols = ['mass', 'reclat', 'reclong']
meteorite_df[cols] = meteorite_df[cols].astype(float)

In [None]:
meteorite_df['year'] = pd.to_datetime(meteorite_df['year'], errors = 'coerce')

In [None]:
meteorite_df.info()

__Create a `geometry` column__

In [None]:
meteorite_df['geometry'] = [Point(xy) for xy in zip(meteorite_df.reclong, meteorite_df.reclat)] 
meteorite_df

__Create a GeoPandas GeoDataFrame__

In [None]:
meteorite_gdf = gpd.GeoDataFrame(meteorite_df, geometry="geometry", crs="EPSG:4326")  # reclat/reclong are assumed to be WGS84 degrees

### Perform data profiling

In [None]:
meteorite_df['fall'].unique()

In [None]:
meteorite_df['fall'].value_counts()

__Examine the classes__

In [None]:
meteorite_df['recclass'].value_counts()

In [None]:
n = 10
# create list of labels
class_labels = meteorite_df['recclass'].value_counts().head(n).tolist()

meteorite_df['recclass'].value_counts().head(n).plot(kind='barh', title=f'Top {n} Meteorite Classes', width=0.9, figsize=(10, 5))
plt.grid(axis='x', alpha=0.5)
plt.gca().invert_yaxis()

# add count values from list
for i, v in enumerate(class_labels):
    plt.text(x=v+3, y=i+0.25, s=str(v))

In [None]:
n = 10
# extract dataset by labels
meteo_class = meteorite_df['recclass'].value_counts().head(n)

# pie plot
meteo_class.plot(kind='pie', title=f'Top {n} Meteorite Classes', figsize=(7, 7))
plt.pie(meteo_class, wedgeprops = {'linewidth': 3, 'edgecolor': 'white'})

# insert circle
circle = plt.Circle((0, 0), 0.6, color='white')
plt.gcf().gca().add_artist(circle)

In [None]:
# create list of labels
fall_labels = meteorite_df['fall'].value_counts().tolist()

# bar plot - 'fall' labels
meteorite_df['fall'].value_counts().plot.bar()

# add count values from list
for i, v in enumerate(fall_labels):
    plt.text(x=i, y=v, s=str(v), ha='center', va='bottom')

__Mass Distribution__

In [None]:
# histogram by 'fall' labels
falls = meteorite_df.groupby('fall')

for name, group in falls:
    plt.hist(x=group['mass'], bins=10**np.linspace(-7, 7), alpha=0.5)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('log [mass]')
plt.legend(['Fell','Found'])

In [None]:
meteorite_df.plot(kind="scatter", x="year", y="mass", logy=True)

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(x=meteorite_df['year'], 
            y=meteorite_df['mass'], c=np.log10(meteorite_df['mass']), 
            cmap='gist_earth', alpha=0.7
           )
plt.yscale('log')
plt.xlabel('year')
plt.ylabel('mass')

__Year distribution__

In [None]:
# plot by 'fall' label
for i in ['Found', 'Fell']:
    pd.pivot_table(meteorite_df[meteorite_df['fall'] == i], 
                   index='year', values='name', 
                   aggfunc='count').plot(title=f'year vs {i}', ylabel=f'Num. of {i}')

In [None]:
meteorite_df.plot(kind="scatter", x="year", y="reclat", figsize=(10, 6))

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(x=meteorite_df['year'], y=meteorite_df['reclat'], alpha=0.4, marker='.')
plt.xlabel('year')
plt.ylabel('latitude')
plt.grid(visible=True, alpha=0.5)

### Visualization

In [None]:
world_url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
world_gdf = gpd.read_file(world_url)

In [None]:
world_gdf.head(2)

In [None]:
world_gdf['CONTINENT'].unique()

In [None]:
north_america = world_gdf[world_gdf['CONTINENT'] == 'North America']
asia = world_gdf[world_gdf['CONTINENT'] == 'Asia']

In [None]:
m = world_gdf.explore()
meteorite_gdf.explore(
    m=m, 
    column="mass", 
    cmap="viridis_r",
    #style_kwds={"style_function":lambda x: {"radius":x["properties"]["mass"]}}
) #color="red")

### <font color="green">Scraping the NASA Astronomy Picture Of the Day (APOD) Webpage </font>

- We want to retrieve the Astronomy Picture of the Day for a given date from the webpage <a href="https://api.nasa.gov/planetary/apod">https://api.nasa.gov/planetary/apod</a> and plot the image.
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.

**Query Parameters**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
|`date` | YYYY-MM-DD | today | Date of the APOD image to retrieve |
|`start_date` | YYYY-MM-DD | none | The start of a date range, when requesting data for a range of dates. Cannot be used with `date`. |
|`end_date` | YYYY-MM-DD | today | The end of the date range, when used with `start_date`. |
| `count` | int | none | If this is specified, then `count` randomly chosen images will be returned. Cannot be used with `date` or `start_date` and `end_date`. |
| `hd` | bool | False | Retrieve the URL for the high resolution image |
| `api_key` | string | DEMO_KEY | <a href="https://api.nasa.gov/">https://api.nasa.gov/</a> key for expanded usage |
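
The exclusivity rules in the table can be enforced before any request is sent. The helper below, `build_apod_payload`, is a hypothetical function written for this sketch (it is not part of any NASA library); it only assembles the dictionary that would be passed as `params`.

```python
from urllib.parse import urlencode


def build_apod_payload(api_key="DEMO_KEY", date=None, start_date=None,
                       end_date=None, count=None):
    """Hypothetical helper: assemble APOD query parameters and enforce
    the exclusivity rules listed in the table."""
    if date and start_date:
        raise ValueError("'start_date' cannot be used with 'date'")
    if count is not None and (date or start_date or end_date):
        raise ValueError("'count' cannot be used with dates")
    candidates = {"api_key": api_key, "date": date, "start_date": start_date,
                  "end_date": end_date, "count": count}
    # Drop parameters that were not supplied so the API applies its defaults.
    return {key: val for key, val in candidates.items() if val is not None}


payload = build_apod_payload(start_date="2022-07-01", end_date="2022-07-04")
print(urlencode(payload))
```

Validating the combination locally gives a clearer error message than waiting for the server to reject the request.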


In [None]:
url = "https://api.nasa.gov/planetary/apod"
date = "2022-07-04"
payload = {'api_key': "DEMO_KEY",
          'date': date,
          'hd': True}

page_content = access_website(url, payload)

Get the url:

In [None]:
print(f"URL: \n\t {page_content.url}")

Determine the content type:

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

Process the data with JSON:

In [None]:
json_page = json.loads(page_content.text)

The `json_page` variable is a dictionary of keys and values. Let’s take a look at its keys:

In [None]:
for x in json_page:
    print(x)

Print the keys and values:

In [None]:
for x in json_page:
    print(f"{x} --> {json_page[x]} \n")

In [None]:
pprint.pprint(json_page)

Plot images:

In [None]:
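from skimage import io as skio  # alias: the standard-library io imported earlier shadows skimage's io

if json_page["media_type"] == "image":
    plt.imshow(skio.imread(json_page["url"]))
    plt.title(json_page["title"])
    plt.show()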

<font color="red">If you want to download the file on your local system:</font>

In [None]:
import urllib.request

url_name = json_page["url"]
loc_file_name = os.path.basename(url_name)

urllib.request.urlretrieve(url_name, loc_file_name)

If you want to view the image through a browser, use:

In [None]:
from IPython.display import Javascript
def window_open(url):
    display(Javascript('window.open("{url}");'.format(url=url)))
    
window_open(json_page['url'])

### <font color="green">Obtaining Mars Rover Photos</font>

In [None]:
rover_url = 'https://api.nasa.gov/mars-photos/api/v1/rovers/curiosity/photos'

payload = {'api_key': "DEMO_KEY",
           'sol': 1000}

response = access_website(rover_url, payload)

In [None]:
print(f"URL: \n\t {response.url}")

In [None]:
print(f"Content type: \n\t {response.headers['content-type']}")

In [None]:
response_dict = response.json()

In [None]:
print(f"KEYS: \n\t {response_dict.keys()}")

In [None]:
photos = response_dict['photos']

In [None]:
print(type(photos))
print(len(photos))

In [None]:
print(photos[0])

Extract the URL of each photo:

In [None]:
url_photos = list()
for photo in photos:
    url_photos.append(photo['img_src'])

print(url_photos[0])

Randomly select 20 pictures:

In [None]:
import random
url_pictures = random.sample(url_photos, 20)
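
`random.sample` raises a `ValueError` when asked for more items than the list holds, which can happen on a sol with few photos. A minimal sketch of a guard, using made-up records that only mimic the `img_src` field:

```python
import random

# Hypothetical photo records with placeholder URLs
sample_photos = [{"img_src": f"https://example.com/photo_{i}.jpg"} for i in range(5)]

sample_urls = [photo["img_src"] for photo in sample_photos]

# Never request more pictures than are available
n_pictures = min(20, len(sample_urls))
chosen = random.sample(sample_urls, n_pictures)
print(len(chosen))
```

Any plotting grid would then need to be sized from `n_pictures` rather than from a fixed count.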

Display the 20 photos:

In [None]:
from skimage import io as skio  # alias avoids the clash with the standard-library io

fig, axes = plt.subplots(4, 5, figsize=(20, 20))
ax = axes.ravel()

for i in range(20):
    ax[i].imshow(skio.imread(url_pictures[i]))

fig.tight_layout()

## <font color="purple">Breakout 1</font>

Use the following code to list all the images in the provided range of years:

```python
url = "https://images-api.nasa.gov/search"

payload = {
        "q": "apollo",
        "page": "1",
        "media_type": "image",
        "year_start": "2020",
        "year_end": "2022"}

response = reqs.get(url, params=payload)
images = response.json()["collection"]["items"]
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>

```python
import requests as reqs

url = "https://images-api.nasa.gov/search"

params = {
    "q": "apollo",
    "page": "1",
    "media_type": "image",
    "year_start": "2020",
    "year_end": "2022"
}

response = reqs.get(url, params=params)
response.raise_for_status()

images = response.json()["collection"]["items"]
print(f"Number of images: {len(images)}")
for image in images:
    thumbnail_url = image["links"][0]["href"]
    image_url = thumbnail_url[:thumbnail_url.rfind("~")] + "~orig.jpg"
    print(image_url)
``` 
</p>
</details>

## <font color="red">Global Temperature</font>

https://www.columbia.edu/~mhs119/Temperature/

In [None]:
temp_url = "https://www.columbia.edu/~mhs119/Temperature/Table_Ts.1880-2024vs1951-1980.txt"

In [None]:
response = access_website(temp_url)

In [None]:
print(f"Content type: \n\t {response.headers['content-type']}")

In [None]:
pprint.pprint(response.text)

In [None]:
type(response.text)

In [None]:
file_object = io.StringIO(response.text)
df = pd.read_csv(file_object, 
                 sep=r"\s+", 
                 skiprows=3, 
                 skipfooter=7, 
                 index_col=0,
                 na_values="*****",
                 engine='python')
df
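
The role of each `read_csv` argument is easier to see on a small text. The sketch below uses a made-up string that imitates the layout of such a file (header lines, a whitespace-separated table, a footer); the numbers are placeholders, not actual temperature anomalies.

```python
import io

import pandas as pd

# Made-up text that mimics the layout: header lines, a table, a footer.
sample_text = "\n".join([
    "Hypothetical title line",
    "Hypothetical units line",
    "Year   Jan    Feb",
    "1880  -0.20  *****",
    "1881  -0.10  -0.15",
    "Hypothetical footer note",
])

sample_df = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",         # columns are separated by runs of whitespace
    skiprows=2,         # ignore the two header lines
    skipfooter=1,       # ignore the footer line (requires engine='python')
    index_col=0,        # use the year as the index
    na_values="*****",  # treat the asterisk marker as missing
    engine="python",
)
print(sample_df)
```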

In [None]:
df.info()

In [None]:
df[list(df.columns)[:-1]].T.plot()

In [None]:
years = [i for i in range(1880, 2024, 20)]
years.append(2024)
years

In [None]:
df.loc[years][list(df.columns)[:-1]].T.plot()

In [None]:
df[list(df.columns)[:-1]].plot(figsize=(10, 13), subplots=True);

In [None]:
df["Annual"].iloc[:-1].astype(float).plot()

## <font color="red">Scraping the Earth Observatory Natural Event Tracker (EONET) Webpage </font>

- EONET is a repository of metadata about natural events.
- We want to be able to browse the webpage [https://eonet.gsfc.nasa.gov/api/v3/events](https://eonet.gsfc.nasa.gov/api/v3/events) to gather information on recent natural events on Earth.

**Query Parameters**

| Parameter | Value(s) |  Description |
| --- | --- | --- |
|`source` | Source ID | Filter the returned events by the [Source](https://eonet.gsfc.nasa.gov/api/v3/sources). Multiple sources can be included in the parameter: comma separated, operates as a boolean `OR`. |
|`category` | Category ID | Filter the returned events by the category. |
|`status` | open or closed | Events that have ended are assigned a closed date and the existence of that date will allow you to filter for only-open or only-closed events. Omitting the status parameter will return only the currently open events. |
| `limit` | int | Limits the number of events returned |
| `days` | int | Limit the number of prior days (including today) from which events will be returned. |
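
A short sketch of how several sources can be combined into one `source` value. It uses the standard library's `urlencode` only to display the resulting query string; no request is sent.

```python
from urllib.parse import urlencode

sources = ["EO", "JTWC", "Earthdata", "FEMA"]

eonet_payload = {
    "source": ",".join(sources),  # comma-separated list acts as a boolean OR
    "status": "open",
    "limit": 100,
    "days": 180,
}

# urlencode percent-encodes each comma as %2C
print(urlencode(eonet_payload))
```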

__Sample sources__:

- `EO`: [Earth Observatory](https://earthobservatory.nasa.gov/)
- `Earthdata`: [NASA Earth Observation Data](https://earthdata.nasa.gov)
- `FEMA`: [Federal Emergency Management Agency (FEMA)](https://www.fema.gov/)
- `JTWC`: [Joint Typhoon Warning Center](http://www.metoc.navy.mil/jtwc/jtwc.html)
- `NASA_ESRS`: [NASA Earth Science and Remote Sensing Unit](https://eol.jsc.nasa.gov/ESRS/)
- `NASA_HURR`: [NASA Hurricane And Typhoon Updates](https://blogs.nasa.gov/hurricanes/)

In [None]:
eonet_url = "https://eonet.gsfc.nasa.gov/api/v3/events"
eonet_source = "EO,JTWC,Earthdata,FEMA"
eonet_payload = {
    'source': eonet_source,
    'status': "open",
    'limit': 100,
    'days': 180
}

page_content = access_website(eonet_url, eonet_payload)

In [None]:
print(f"Content type: \n\t {page_content.headers['content-type']}")

In [None]:
eonet_page = json.loads(page_content.text)

In [None]:
for x in eonet_page:
    print(x)

In [None]:
pprint.pprint(eonet_page['title'])

In [None]:
pprint.pprint(eonet_page['description'])

#### List the events

In [None]:
pprint.pprint(eonet_page['events'])

In [None]:
len(eonet_page['events'])

In [None]:
for event in eonet_page['events']:
    print(event['title'])

#### Analyze one event

We use `Pandas`, `GeoPandas` and `MovingPandas` to track the movement of an event.

In [None]:
event = eonet_page['events'][0]
print(event['title'])

Create a Pandas DataFrame:

In [None]:
columns = ['latitude', 'longitude', 't']
df = pd.DataFrame(columns=columns)

for geom in event['geometry']:
    lat = geom['coordinates'][1]
    lon = geom['coordinates'][0]
    date = geom['date']
    row = dict(latitude=lat, longitude=lon, t=date)
    df.loc[len(df)] = row

df

In [None]:
df['t'] = pd.to_datetime(df['t'], format = '%Y-%m-%dT%H:%M:%SZ')
df

In [None]:
df = df.set_index('t')
df
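
The preceding cells can be condensed by collecting the rows in a list first and creating the DataFrame in one call. The event below is hypothetical and only mimics the `geometry` structure of one event; the coordinates and dates are made up.

```python
import pandas as pd

# Hypothetical event that mimics the structure used above
sample_event = {
    "title": "Hypothetical Storm",
    "geometry": [
        {"date": "2024-01-01T00:00:00Z", "coordinates": [-60.0, 15.0]},
        {"date": "2024-01-01T06:00:00Z", "coordinates": [-61.2, 15.8]},
    ],
}

# Coordinates are stored as [longitude, latitude]
rows = [
    {"latitude": g["coordinates"][1], "longitude": g["coordinates"][0], "t": g["date"]}
    for g in sample_event["geometry"]
]

track_df = pd.DataFrame(rows)
track_df["t"] = pd.to_datetime(track_df["t"], format="%Y-%m-%dT%H:%M:%SZ")
track_df = track_df.set_index("t")
print(track_df)
```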

In [None]:
df['geometry'] = [Point(xy) for xy in zip(df.longitude, df.latitude)] 
df

Create a GeoPandas GeoDataFrame:

In [None]:
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")  # coordinates are longitude/latitude in degrees
gdf

Create a MovingPandas Trajectory:

In [None]:
mdf = mpd.Trajectory(gdf, 1)
mdf

Compute the distance and the speed:

In [None]:
mdf.add_distance(overwrite=True, name="distance", units="mi")
mdf.df

In [None]:
mdf.add_speed(overwrite=True, name="speed", units=("mi", "h"))

mdf.df

Plot the trajectory:

In [None]:
mdf.plot()

In [None]:
mdf.hvplot(tiles="ESRI",
           c="speed",
           title=event['title'])

---

## <font color='blue'>Web Scraping with Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.
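
A minimal sketch of the parse tree in action, using a hand-written HTML string instead of a downloaded page. The page content and the link are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration
sample_html = """
<html><body>
  <h1>Sample Page</h1>
  <p>First paragraph.</p>
  <a href="https://example.com/data.csv">Data file</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.h1.get_text())         # text of the first <h1> tag
print(sample_soup.find("p").get_text())  # text of the first <p> tag
print(sample_soup.find("a")["href"])     # value of the link's href attribute
```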

### <font color="blue"> Example: Extract the web link of the Astronomy Picture of the Day</font>

In [None]:
url = "https://apod.nasa.gov/apod/astropix.html"
source = access_website(url)

In [None]:
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

Print basic information of the Image of the Day:

In [None]:
print(mysoup.find('p').get_text())

In [None]:
href_comments = mysoup.find_all('a')
for a in href_comments:
    print(a.get_text())

<font color="red">__The `Picture of the Day` can either be a picture or a video.__</font>

In [None]:
picture_day = "picture"
if mysoup.iframe:
    print("We have a video.")
    picture_day = "video"
else:
    print("We have a picture.")

In [None]:
if picture_day == "video":
    display(HTML(str(mysoup.iframe)))

In [None]:
if picture_day == "video":
    print(mysoup.iframe['src'])

In [None]:
if picture_day == "video":
    src_list = [a['src'] for a in mysoup.select('iframe[src]')]
    print(src_list)

Find all tags that have a `src` attribute:

In [None]:
src_tags = mysoup.find_all(src=True)
src_tags

Find all `href` attributes:

In [None]:
href_tags = mysoup.find_all(href=True)
href_tags

List all valid urls in `a` tags:

In [None]:
links_with_text = [a['href'] for a in mysoup.find_all('a', href=True) if a.text]
links_with_text

In [None]:
link_list1 = [a['href'] for a in mysoup.find_all('a', href=True)]
link_list1

In [None]:
link_list2 = [l.get('href') for l in mysoup.find_all('a')]
link_list2

In [None]:
link_list3 = [a['href'] for a in mysoup.select('a[href]')]
link_list3

If the Picture of the Day is an image instead (not a video), the following can help us view the image:

In [None]:
from skimage import io as skio  # alias avoids the clash with the standard-library io

if picture_day == "picture":
    url_image = "".join(["https://apod.nasa.gov/apod/", link_list3[1]])
    fig, axes = plt.subplots(figsize=(10, 8))
    axes.imshow(skio.imread(url_image))

## <font color="purple">Breakout 2</font>

Go to the webpage:

[https://astg606.github.io/py_courses/summer_2022/](https://astg606.github.io/py_courses/summer_2022/)

and extract the `Course Evaluation` web link.

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import requests as reqs

from bs4 import BeautifulSoup as bso

URL = "https://astg606.github.io/py_courses/summer_2022/"

source = reqs.get(URL)
if source.status_code == 200:
    mysoup = bso(source.content, 'html.parser')
    href_tags = mysoup.find_all(href=True)
    for tag in href_tags:
        if tag.get_text() == "Course Evaluation":
            print(tag["href"])
else:
    print("URL not accessible.")
```

</p>
</details> 

### <font color="blue"> Example: Weather Data for Greenbelt, Maryland</font>

In [None]:
url = "https://forecast.weather.gov/MapClick.php"
params = {'lat': 39.00079000000005,
          'lon': -76.88055999999995}

source = access_website(url, params)

In [None]:
print(f"URL: \n\t {source.url}")

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Extract Tonight's Forecast**

In [None]:
seven_day = mysoup.find(id="seven-day-forecast")

In [None]:
forecast_items = seven_day.find_all(class_="tombstone-container")

In [None]:
for item in forecast_items:
    if item.find(class_="period-name").get_text() == "Tonight":
        tonight = item
        break

print(tonight.prettify())

In [None]:
period = tonight.find(class_="period-name").get_text()
print(period)

In [None]:
short_desc = tonight.find(class_="short-desc").get_text()
print(short_desc)

In [None]:
temp = tonight.find(class_="temp").get_text()
print(temp)

In [None]:
img = tonight.find("img")
desc = img['title']
print(desc)

**Extracting all Data**

We use CSS selectors to extract everything at once.

We select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
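
The selector can be tried on a small hand-written fragment first. The markup below only imitates the class names of the forecast page; the periods and temperatures are made up.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the forecast markup
sample_html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="temp">Low: 50 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="temp">High: 68 F</p>
  </div>
  <p class="period-name">Outside any container</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_seven_day = sample_soup.find(id="seven-day-forecast")

# The space in the selector means "descendant of"
sample_periods = [tag.get_text() for tag in
                  sample_seven_day.select(".tombstone-container .period-name")]
print(sample_periods)
```

The last `period-name` element is not selected because it does not sit inside a `tombstone-container`.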

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)

We can apply the same technique to get the other fields:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)

In [None]:
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)

In [None]:
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

In [None]:
if len(temps) < len(descs):
    temps = [" "] + temps

We can combine the data into a Pandas DataFrame:

In [None]:
import pandas as pd
forecast_dict = dict(Period=periods, 
                     Temperature=temps,
                     Short_Description=short_descs,  
                     Description=descs)
df_weather = pd.DataFrame(forecast_dict)
df_weather = df_weather.set_index("Period")
df_weather

**Detailed Forecast**

In [None]:
det_forecast = mysoup.find(id="detailed-forecast-body")

In [None]:
forecast_labels = det_forecast.find_all(class_="col-sm-2 forecast-label")

In [None]:
forecast_texts = det_forecast.find_all(class_="col-sm-10 forecast-text")

In [None]:
for a, b in zip(forecast_labels, forecast_texts):
    print(f"\033[1m {a.get_text():>15}: \033[0m {b.get_text():<}")
    print()

## <font color="purple">Breakout 3</font>

- Go to the website `https://eonet.gsfc.nasa.gov/api/v2.1/events`
- Select a date range and the number of events you want to retrieve.
- Create a Pandas DataFrame that contains the event type, date, latitude and longitude as columns.

```python
url = "https://eonet.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import json
import requests as reqs

url = "https://eonet.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}

page_content = reqs.get(url, params=payload)

if page_content.status_code == 200:
    json_page = json.loads(page_content.text)

for x in json_page:
    print(x)

list_events = json_page['events']

print(f"Number of events: {len(list_events)}")
print(f"List of events: \n {list_events}")

event_types = [evt['categories'][0]['title'] for evt in list_events]
event_dates = [evt['geometries'][0]['date'] for evt in list_events]
event_lons = [evt['geometries'][0]['coordinates'][0] for evt in list_events]
event_lats = [evt['geometries'][0]['coordinates'][1] for evt in list_events]

print()

import pandas as pd
df_events = pd.DataFrame({
    "Type": event_types,
    "Dates": event_dates,
    "Latitudes": event_lats,
    "Longitudes": event_lons
})
df_events
```
    
</p>
</details>

### <font color="blue"> Example: MODIS Aerosol Optical Thickness</font>

- Scientists use measurements from the MODIS sensor aboard NASA's Terra and Aqua satellites to map the amount of aerosol that is in the air all over the world. Because aerosols reflect visible and near-infrared light back to space, scientists can use satellites to make maps of where there are high concentrations of these particles.
- Scientists call this measurement aerosol optical thickness (AOT). 
- It is a measure of how much light the airborne particles prevent from traveling through the atmosphere. 
- Aerosols absorb and scatter incoming sunlight, thus reducing visibility and increasing optical thickness. An optical thickness of less than 0.1 indicates a crystal clear sky with maximum visibility, whereas a value of 1 indicates the presence of aerosols so dense that people would have difficulty seeing the Sun, even at mid-day!


In this example, we want to access the <a href="https://neo.gsfc.nasa.gov/">NASA Earth Observations (NEO)</a> website to obtain the AOT measurements for a given day or a range of days (from 2000 to present).

**Select the day range of interest:**

In [None]:
beg_date = '2019-12-30'
end_date = '2019-12-31'

pd_series = pd.date_range(start=beg_date, end=end_date, freq='D')
dates = [dt.strftime('%Y-%m-%d') for dt in pd_series]

url_base = "https://neo.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_AER_OD&year="

urls = [url_base+dt for dt in dates]

In [None]:
print(urls[0])

**Access the webpage for the first day:**

In [None]:
source = reqs.get(urls[0])
print(source)

**Parse the webpage and print its content:**

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Gather all the tags that have an `href` attribute:**

In [None]:
href_tags = mysoup.find_all(href=True)

href_tags

**Find the `http` address that has the word `CSV`. That will give us the remote location of the file we want to read.**

In [None]:
for tag in href_tags:
    loc_url = tag["href"]
    print(loc_url)
    if "csv" in loc_url.lower():
        csv_url = loc_url
        break

In [None]:
print(csv_url)
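
If no link contains `csv`, the loop above leaves `csv_url` undefined. A sketch of a variant that falls back to `None`, shown on hand-written HTML with placeholder links:

```python
from bs4 import BeautifulSoup

# Hand-written HTML with placeholder links
sample_html = """
<a href="/about.html">About</a>
<a href="https://example.com/files/data.CSV">Download CSV</a>
<a href="https://example.com/files/data.png">Download PNG</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# next() returns the first match, or None when no link mentions "csv"
sample_csv_url = next(
    (tag["href"] for tag in sample_soup.find_all(href=True)
     if "csv" in tag["href"].lower()),
    None,
)
print(sample_csv_url)
```

Checking for `None` before downloading makes the failure explicit when the page layout changes.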

**Use `Pandas` to read the remote file:**

In [None]:
resp = access_website(csv_url)

In [None]:
file_object = io.StringIO(resp.content.decode('utf-8'))
pd.read_csv(file_object, index_col=0)

**It seems that `99999.0` corresponds to a missing value. Let us replace it with `NaN`:**

In [None]:
file_object = io.StringIO(resp.content.decode('utf-8'))
df = pd.read_csv(file_object, index_col=0, na_values=99999.0)
df
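
The effect of `na_values` can be verified on a tiny made-up grid with the same layout (latitudes as rows, longitudes as columns). The values are placeholders, not real measurements.

```python
import io

import pandas as pd

# Made-up grid: rows are latitudes, columns are longitudes
sample_csv = "lat/lon,-179.5,-178.5\n89.5,99999.0,0.12\n88.5,0.25,99999.0\n"

sample_grid = pd.read_csv(io.StringIO(sample_csv), index_col=0, na_values=99999.0)
print(sample_grid)
print(sample_grid.isna().sum().sum())  # number of cells flagged as missing
```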

**We can use `Xarray` to quickly visualize the data:**

In [None]:
da = xr.DataArray(
    df.values,
    coords=[[float(lat) for lat in df.index], [float(lon) for lon in df.columns]],
    dims=['latitude', 'longitude']
)

da

In [None]:
da.plot();