<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1> <font color="red" size="+3">
    Web Scraping with Python</font>
    <br>
    Earth Science Applications
</H1>
</CENTER>

## <font color='red'>Reference Documents</font>

- [Web Scraping: What It Is and How to Use It](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
- [What is web scraping](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
- [Python Requests Tutorial](http://zetcode.com/python/requests/)
- [Python’s Requests Library (Guide)](https://realpython.com/python-requests/)
- [Download Files with Python](https://stackabuse.com/download-files-with-python/)
- [Building a Web Scraper from start to finish](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)
- [Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/)
- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

## <font color='red'>Objectives</font>

In this presentation, we use web scraping procedures to extract useful data from websites containing Earth Science related information. The content we retrieve comes in several formats: HTML, JSON, CSV, and plain text.

We cover the following applications:


1. MODIS Aerosol Optical Thickness
2. GISS global temperature
3. Earth Observatory Natural Event Tracker (EONET)
4. Meteorite landings
5. Weather forecast at a US location



## <font color='red'>Web Scraping</font>

> Web scraping is the process of programmatically and systematically collecting information on the web and processing it into more easily analyzable formats that can be serialized (json, xml, etc) and stored for later use.


![fig_scrape](https://hasdata.com/_astro/web-scraping-process2.CS0dB9VW_17OsYb.webp)
Image Source: [Sergey Ermakovich](https://hasdata.com/blog/web-scraping)

## <font color='red'>Required Packages</font>
We will need three main Python packages:

- `requests`: for accessing servers and getting the contents of web pages.
- `json`: for manipulating JSON documents.
- `BeautifulSoup`: for parsing the content of an HTML document.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib.pyplot as plt
from skimage import io as skio  # aliased so that it is not shadowed by the standard-library io module
from IPython.display import HTML

In [None]:
import os
import io
import pprint
import time

In [None]:
import numpy as np

In [None]:
from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
import movingpandas as mpd

In [None]:
import xarray as xr

In [None]:
import requests as reqs

In [None]:
import json

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
print(f"Requests version:  {reqs.__version__}")
print(f"JSON version:      {json.__version__}")

In [None]:
RED = '\033[91m'
GREEN = '\033[92m'
BLUE = '\033[94m'
RESET = '\033[0m'

# <font color='red'>Python `requests` Module</font>

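* Requests is a third-party Python package: it is not part of the standard library and must be installed (for example with `pip install requests`).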
* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).

#### Sending Parameters in the URL

- We often want to send some sort of data in the URL’s query string.
- The `get()` method takes a `params` keyword argument where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

If:
```python
   {'key1': value1, 'key2': value2, 'key3': value3}
```
is the dictionary of the parameters, and `https://MyOwnWebsite.com/` is the url, then the final url to access will be:
```
    https://MyOwnWebsite.com/?key1=value1&key2=value2&key3=value3
```

The code to reach the webpage is:
```Python
payload = {'key1': value1, 'key2': value2, 'key3': value3}
resp = reqs.get("https://MyOwnWebsite.com", params=payload)
```
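
To see how a parameter dictionary turns into a query string without contacting any server, the standard-library function `urllib.parse.urlencode` can be used. This is only a sketch: the values are placeholder strings, and `requests` performs a comparable encoding internally when `params` is passed.

```python
from urllib.parse import urlencode

# Placeholder values; in practice these come from your application.
payload = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}

base_url = "https://MyOwnWebsite.com/"

# urlencode joins each key and value with '=' and the pairs with '&'.
query_string = urlencode(payload)

# The '?' marks the beginning of the query parameters.
final_url = f"{base_url}?{query_string}"
print(final_url)
```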

In [None]:
def access_website(url: str, payload: dict = None, timeout: int = 10):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object; otherwise print an error message.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access
    payload : dict
       Parameters needed to construct the target url.
    timeout : int
       Maximum number of seconds to access the web page.

    Returns
    -------
    resp : requests.Response or None
       Object which has information on the web page of interest.
       It is None when no response was received (e.g. connection error or timeout).
    """
    resp = None  # stays None if the request fails before a response exists
    try:
        if payload:
            resp = reqs.get(url, params=payload, timeout=timeout)
        else:
            resp = reqs.get(url, timeout=timeout)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        print(f"Http Error: {errh}")
    except reqs.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except reqs.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except reqs.exceptions.RequestException as err:
        print(f"General Error: {err}")
    else:
        print(f"Successfully accessed the site: \n\t {resp.url}")
        print(f"Content type: \n\t {resp.headers['content-type']}")
    return resp

---

# Application 1: <font color="red"> MODIS Aerosol Optical Thickness</font>

- Scientists use measurements from the MODIS sensor aboard NASA's Terra and Aqua satellites to map the amount of aerosol that is in the air all over the world. Because aerosols reflect visible and near-infrared light back to space, scientists can use satellites to make maps of where there are high concentrations of these particles.
- Scientists call this measurement aerosol optical thickness (AOT). 
- It is a measure of how much light the airborne particles prevent from traveling through the atmosphere. 
- Aerosols absorb and scatter incoming sunlight, thus reducing visibility and increasing optical thickness. An optical thickness of less than 0.1 indicates a crystal clear sky with maximum visibility, whereas a value of 1 indicates the presence of aerosols so dense that people would have difficulty seeing the Sun, even at mid-day!
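
One way to read these numbers is through the Beer-Lambert law, under which the fraction of a direct vertical beam that gets through is `exp(-AOT)`. This is an assumption added here for illustration; the text above does not state it. The helper name `direct_transmittance` is hypothetical.

```python
import math

def direct_transmittance(aot: float) -> float:
    """Fraction of a direct vertical beam that is neither absorbed nor
    scattered, assuming the Beer-Lambert law: T = exp(-AOT)."""
    return math.exp(-aot)

for aot in (0.1, 1.0):
    fraction = direct_transmittance(aot)
    print(f"AOT = {aot:3.1f} -> transmitted fraction = {fraction:.3f}")
```

Under this assumption, an AOT of 0.1 lets roughly 90% of the direct beam through, while an AOT of 1 lets roughly 37% through.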


In this example, we want to access the <a href="https://neo.gsfc.nasa.gov/">NASA Earth Observations (NEO)</a> website to obtain the AOT measurements for a given day or a range of days (from 2000 to present).

## <font color="blue">Manipulating data for a day</font>

__Access the website__

In [None]:
aot_url = "https://neo.gsfc.nasa.gov/view.php"

In [None]:
payload = {
    'datasetId': 'MODAL2_M_AER_OD',
    'date': '2024-12-01'
}

In [None]:
aot_source = access_website(aot_url, payload=payload)

__Parse the webpage and print its content__

In [None]:
aot_soup = bso(aot_source.text, 'html.parser')
print(aot_soup.prettify())

__Gather all the tags that have an `href` attribute__

In [None]:
href_tags = aot_soup.find_all(href=True)

__Find the `http` address that has the word `CSV`. That will give us the remote location of the file we want to read.__

In [None]:
for tag in href_tags:
    loc_url = tag["href"]
    if "csv" in loc_url.lower():
        csv_url = loc_url
        break

In [None]:
csv_url

__Clean the url__

In [None]:
def remove_after_substring(text: str, substring: str) -> str:
    """
    Remove all characters in text after the substring.
    """
    index = text.find(substring)
    if index != -1:
        return "".join([text[:index], substring])
    return text
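
The behavior of `remove_after_substring` can be checked on a made-up link before applying it to the real one. The URL below is hypothetical and serves only as an illustration; note that `str.find` is case-sensitive.

```python
def remove_after_substring(text: str, substring: str) -> str:
    """Remove all characters in text after the first occurrence of substring."""
    index = text.find(substring)
    if index != -1:
        return text[:index] + substring
    return text

# Hypothetical link, made up for illustration only.
sample_url = "https://example.com/download?id=123&format=CSV&width=360&height=180"

# Everything after the first "CSV" is dropped.
cleaned = remove_after_substring(sample_url, "CSV")
print(cleaned)

# A lowercase "csv" does not occur in the link, so the text is returned as is.
unchanged = remove_after_substring(sample_url, "csv")
print(unchanged == sample_url)
```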

In [None]:
csv_url = remove_after_substring(csv_url, 'CSV')

In [None]:
csv_url

__Access the remote file and use `Pandas` to read its content__

In [None]:
resp = access_website(csv_url)

In [None]:
file_object = io.StringIO(resp.content.decode('utf-8'))
pd.read_csv(file_object, index_col=0)

- The labels for the columns are the longitude values.
- The labels for the rows are the latitude values.
- **It seems that `99999.0` corresponds to a missing value. We replace it with `NaN`**

In [None]:
file_object = io.StringIO(resp.content.decode('utf-8'))
df = pd.read_csv(file_object, index_col=0, na_values=99999.0)
df
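
The effect of `na_values` can be verified offline on a tiny table. The sketch below uses made-up numbers that mimic the layout described above (first column holds latitude labels, header row holds longitude labels, `99999.0` is the fill value); it is not actual NEO data.

```python
import io
import pandas as pd

# Made-up 2 x 2 grid; 99999.0 plays the role of the fill value.
csv_text = (
    "lat/lon,-179.5,-178.5\n"
    "89.5,0.12,99999.0\n"
    "88.5,99999.0,0.30\n"
)

# io.StringIO wraps the string so that read_csv can treat it like a file.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=99999.0)
print(sample_df)

# Count the cells that were converted to NaN.
n_missing = int(sample_df.isna().sum().sum())
print(f"Number of missing cells: {n_missing}")
```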

__We convert the Pandas `DataFrame` into a Xarray `DataArray`__

- The dimensions are `latitude` and `longitude`.
- The coordinates are the latitude values (row labels) and longitude values (column labels).
- The data values are the DataFrame values.

In [None]:
lat_values = [float(lat) for lat in df.index]
lon_values = [float(lon) for lon in df.columns]

da = xr.DataArray(
    df.values,
    coords = [lat_values, lon_values],
    dims = ['latitude', 'longitude']
)

da

__Basic plot__

In [None]:
da.plot();

## <font color="purple">Breakout 1</font>

- Take an arbitrary past date (`YYYY-MM-DD`).
- Access the database to extract the CSV file.
- Read the CSV file and create the Xarray object.
- Plot the data.

## <font color="blue">Manipulating time series data</font>

In [None]:
def get_csv_link(href_tags: list) -> str:
    """Return the first link containing 'csv', truncated right after 'CSV'."""
    for tag in href_tags:
        loc_url = tag["href"]
        if "csv" in loc_url.lower():
            return remove_after_substring(loc_url, 'CSV')
    raise ValueError("No CSV link was found on the page.")

In [None]:
def create_df(url: str, payload: dict):
    aot_source = access_website(url, payload=payload)
    aot_soup = bso(aot_source.text, 'html.parser')
    time.sleep(1)
    href_tags = aot_soup.find_all(href=True)
    csv_url = get_csv_link(href_tags)
    
    resp = access_website(csv_url, timeout=25)
    file_object = io.StringIO(resp.content.decode('utf-8'))
    df = pd.read_csv(file_object, index_col=0, na_values=99999.0)

    return df

In [None]:
def create_da(df):
    lat_values = [float(lat) for lat in df.index]
    lon_values = [float(lon) for lon in df.columns]

    da = xr.DataArray(
        df.values,
        coords = [lat_values, lon_values],
        dims = ['latitude', 'longitude']
    )
    return da

__Combine the data into a Xarray Dataset__

In [None]:
beg_date = '2024-12-01'
end_date = '2024-12-02'

dates = pd.date_range(start=beg_date, end=end_date, freq='D')

In [None]:
arrays = list()

for date in dates:
    print(f'\t Working on: {date}')
    payload = {'datasetId': 'MODAL2_M_AER_OD', 'date': date.strftime('%Y-%m-%d')}
    df = create_df(aot_url, payload)
    da = create_da(df)
    da = da.assign_coords(time = date)
    da = da.expand_dims(dim="time")
    arrays.append(da)
    time.sleep(3)

ds = xr.combine_by_coords(arrays)

In [None]:
ds

In [None]:
ds.plot(x="longitude", y="latitude",
                col="time", col_wrap=3)

__Time average__

In [None]:
ds.mean(dim='time').plot(figsize=(10, 6), cmap='RdBu_r');

__Zoom over the USA__

In [None]:
usa = ds.sel(latitude=slice(50.05, 20.05),
                 longitude=slice(-125.05, -66.50))
usa.mean(dim='time').plot(cmap='RdBu_r');

---

# Application 2: <font color="red">Global temperature</font>

- The [GISS Surface Temperature Analysis ](https://data.giss.nasa.gov/gistemp/) is an estimate of global surface temperature change.
- Graphs and tables are updated about the 10th of every month using current data files from a variety of sources.

__Objectives__

- Read the data available at the [Monthly and Annual Temperature Tables](https://www.columbia.edu/~mhs119/Temperature/Table_Ts.1880-2024vs1880-1920.txt)
- Perform visualizations.

__Access the webpage__

In [None]:
temp_url = "https://www.columbia.edu/~mhs119/Temperature/Table_Ts.1880-2024vs1951-1980.txt"

In [None]:
temp_response = access_website(temp_url)

- We observe that the content type is `text/plain`, i.e. we are dealing with simple text (string).

__View the content of the webpage__

In [None]:
pprint.pprint(temp_response.text)

In [None]:
type(temp_response.text)

__Extract the content to create a Pandas DataFrame__

- We use `io.StringIO`, which takes a string and returns a file-like object.
- The file object is passed to the `pd.read_csv` function to return a DataFrame.

In [None]:
file_object = io.StringIO(temp_response.text)
temp_df = pd.read_csv(
    file_object, 
    sep=r"\s+", 
    skiprows=3,           # skip the first 3 rows
    skipfooter=7,         # skip the last 7 rows
    index_col=0,          # make the first column as index (labels of the rows)
    na_values="*****",    # deal with missing values
    engine='python'
)
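
The role of each `read_csv` option can be seen on a small, made-up text table. The sketch below is not the GISS file: the header lines, values, and footer are invented, and only two header rows and one footer row are skipped.

```python
import io
import pandas as pd

# Invented text that imitates a whitespace-separated table with
# extra header lines, a footer line, and '*****' for missing values.
sample_text = "\n".join([
    "Header line 1",
    "Header line 2",
    "Year Jan Feb Annual",
    "1880 -0.10 -0.20 -0.15",
    "1881 ***** 0.05 0.01",
    "Footer note",
])

sample = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",          # columns are separated by one or more blanks
    skiprows=2,          # skip the two header lines
    skipfooter=1,        # skip the footer line (requires the python engine)
    index_col=0,         # use the year as row label
    na_values="*****",   # mark missing values as NaN
    engine="python",
)
print(sample)
```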

In [None]:
temp_df.head(3)

In [None]:
temp_df.tail(3)

In [None]:
temp_df.shape

In [None]:
temp_df.info()

__Plot the temperature data for each year__

In [None]:
temp_df[list(temp_df.columns)[:-1]].T.plot(xlabel="Month", ylabel="Temperature");

In [None]:
years = [str(i) for i in range(1880, 2024, 20)]
years.append('2024')

In [None]:
mydf2 = temp_df[list(temp_df.columns)[:-1]].T

In [None]:
mycolumns = ["".join(["_", str(y)]) for y in mydf2.columns]
columns = list()
for c in mycolumns:
    if c[1:] in years:
        columns.append(c[1:])
    else:
        columns.append(c)
mydf2.columns = columns

In [None]:
ax = mydf2.plot(xlabel="Month", ylabel="Temperature")
ax.legend(bbox_to_anchor=(1.25, 1.02), loc='upper right')
plt.tight_layout()

__Plot the temperature data for selected years__

In [None]:
years = [i for i in range(1880, 2024, 20)]
years.append(2024)

In [None]:
ax = temp_df.loc[years][list(temp_df.columns)[:-1]].T.plot(xlabel="Month", 
                                                           ylabel="Temperature")
ax.legend(bbox_to_anchor=(1.25, 1.02), loc='upper right')
plt.tight_layout()

__Temperature plots by month__

In [None]:
temp_df[list(temp_df.columns)[:-1]].plot(figsize=(10, 13), subplots=True);

__Plot time series of annual temperatures__

In [None]:
fig, axes = plt.subplots()

mydf = temp_df["Annual"].iloc[:-1].astype(float)
axes.scatter(mydf.index, mydf.values, marker='o', c='k', s=3)
axes.plot(mydf.index, mydf.values, label='12-month Running Mean')
axes.set_xlabel("Year")
axes.set_ylabel(r"Temperature anomaly ($^\circ$C)");
plt.grid()
plt.legend()

---

# Application 3: <font color="red">Earth Observatory Natural Event Tracker (EONET)</font>

- EONET is a repository of metadata about natural events.
- We want to be able to browse the webpage [https://eonet.gsfc.nasa.gov/api/v3/events](https://eonet.gsfc.nasa.gov/api/v3/events), to gather information on recent natural events on Earth.

**Query Parameters**

| Parameter | Value(s) |  Description |
| --- | --- | --- |
|`source` | Source ID | Filter the returned events by the [Source](https://eonet.gsfc.nasa.gov/api/v3/sources). Multiple sources can be included in the parameter: comma separated, operates as a boolean `OR`. |
|`category` | Category ID | Filter the returned events by the category. |
|`status` | open or closed | Events that have ended are assigned a closed date and the existence of that date will allow you to filter for only-open or only-closed events. Omitting the status parameter will return only the currently open events. |
| `limit` | int | Limits the number of events returned |
| `days ` | int | Limit the number of prior days (including today) from which events will be returned. |

__Sample sources__:

- `EO`: [Earth Observatory](https://earthobservatory.nasa.gov/)
- `Earthdata`: [NASA Earth Observation Data](https://earthdata.nasa.gov)
- `FEMA`: [Federal Emergency Management Agency (FEMA)](https://www.fema.gov/)
- `JTWC`: [Joint Typhoon Warning Center](http://www.metoc.navy.mil/jtwc/jtwc.html)
- `NASA_ESRS`: [NASA Earth Science and Remote Sensing Unit](https://eol.jsc.nasa.gov/ESRS/)
- `NASA_HURR`: [NASA Hurricane And Typhoon Updates](https://blogs.nasa.gov/hurricanes/)

## <font color="blue">Access the database</font>

In [None]:
eonet_url = "https://eonet.gsfc.nasa.gov/api/v3/events"
eonet_source = "EO,JTWC,Earthdata,FEMA"
eonet_payload = {
    'source': eonet_source,
    'status': "open",
    'limit': 100,
    'days': 180
}

page_content = access_website(eonet_url, eonet_payload)

In [None]:
eonet_page = json.loads(page_content.text)

In [None]:
for x in eonet_page:
    print(x)

In [None]:
pprint.pprint(eonet_page['title'])

In [None]:
pprint.pprint(eonet_page['description'])

## <font color="blue">List the events</font>

In [None]:
pprint.pprint(eonet_page['events'])

In [None]:
len(eonet_page['events'])

In [None]:
for event in eonet_page['events']:
    print(event['title'])

## <font color="blue">Analyze one event</font>

- We use `Pandas`, `GeoPandas` and `MovingPandas` to track the movement of an event.

In [None]:
event = eonet_page['events'][0]
print(event['title'])

__Create a Pandas DataFrame__

In [None]:
columns = ['latitude', 'longitude', 't']
df = pd.DataFrame(columns=columns)

for geom in event['geometry']:
    lat = geom['coordinates'][1]
    lon = geom['coordinates'][0]
    date = geom['date']
    row = dict(latitude=lat, longitude=lon, t=date)
    df.loc[len(df)] = row

df
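
The loop above relies on the way an event stores its successive positions. The sketch below reproduces that structure with a made-up JSON document (the event title, dates, and coordinates are invented) so that the extraction can be tried without network access.

```python
import json

# Made-up document imitating the parts of an EONET v3 response used here.
sample_text = """
{
  "title": "Sample events",
  "events": [
    {
      "title": "Hypothetical Storm",
      "geometry": [
        {"date": "2024-12-01T00:00:00Z", "type": "Point", "coordinates": [-60.5, 15.2]},
        {"date": "2024-12-01T06:00:00Z", "type": "Point", "coordinates": [-61.0, 15.9]}
      ]
    }
  ]
}
"""

sample_page = json.loads(sample_text)
sample_event = sample_page["events"][0]

# Each coordinate pair is read as [longitude, latitude].
track = [
    {"t": g["date"], "latitude": g["coordinates"][1], "longitude": g["coordinates"][0]}
    for g in sample_event["geometry"]
]
for point in track:
    print(point)
```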

In [None]:
df['t'] = pd.to_datetime(df['t'], format = '%Y-%m-%dT%H:%M:%SZ')
df

In [None]:
df = df.set_index('t')
df

In [None]:
df['geometry'] = [Point(xy) for xy in zip(df.longitude, df.latitude)] 
df

__Create a GeoPandas GeoDataFrame__

In [None]:
gdf = gpd.GeoDataFrame(df, crs="EPSG:4326")  # coordinates are longitude/latitude in degrees
gdf

__Create a MovingPandas Trajectory__

In [None]:
mdf = mpd.Trajectory(gdf, 1)
mdf

__Compute the distance and the speed__

In [None]:
mdf.add_distance(overwrite=True, name="distance", units="mi")
mdf.df

In [None]:
mdf.add_speed(overwrite=True, 
                      name="speed", units=("mi", "h"))

mdf.df

__Plot the trajectory__

In [None]:
mdf.plot()

In [None]:
mdf.hvplot(tiles="ESRI",
           c="speed",
           title=event['title'])

## <font color="purple">Breakout 2</font>

- Go to the website `https://eonet.gsfc.nasa.gov/api/v3/events`
- Select a date range and the number of events you want to retrieve.
- Create a Pandas DataFrame that contains as columns the event type, date, latitude and longitude.

```python
url = "https://eonet.gsfc.nasa.gov/api/v3/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>
    
```python
import json

url = "https://eonet.gsfc.nasa.gov/api/v3/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 100}

page_content = reqs.get(url, params=payload, timeout=10)
page_content.raise_for_status()   # stop here if the request failed
json_page = json.loads(page_content.text)

for x in json_page:
    print(x)

list_events = json_page['events']

print(f"Number of events: {len(list_events)}")
print(f"List of events: \n {list_events}")

# With the v3 endpoint, the positions of an event are stored under 'geometry'
event_types = [evt['categories'][0]['title'] for evt in list_events]
event_dates = [evt['geometry'][0]['date'] for evt in list_events]
event_lons = [evt['geometry'][0]['coordinates'][0] for evt in list_events]
event_lats = [evt['geometry'][0]['coordinates'][1] for evt in list_events]

print()

df_events = pd.DataFrame({
    "Type": event_types,
    "Dates": event_dates,
    "Latitudes": event_lats,
    "Longitudes": event_lons
})
df_events
```
    
</p>
</details>

---

# Application 4: <font color='red'>Meteorite Landings</font>

- Meteorite impacts can have effects on the climate:
   - The size and velocity of the impacting body determine the amount of energy released.
   - While most impacts are small and have minimal effects, larger impacts can have far-reaching consequences.
   - Impacts from asteroids and comets can generate more atmospheric dust than large volcanic explosions.
- The Meteoritical Society maintains a comprehensive [data set](https://datasets.ai/datasets/meteorite-landings-api) that contains information on all of the known meteorite landings.
- We want to access the data set and perform analyses.

More comprehensive analyses can be found at:

[Meteorite Landings Per Country Using Geopandas](https://github.com/msikorski93/Meteorite-Landings/blob/main/world_geopandas.ipynb)

[Meteorite Landings](https://github.com/msikorski93/Meteorite-Landings/blob/main/meteorites_landings.ipynb)

### <font color="blue">Read the meteorite dataset</font>

We read the first 1000 rows of the database.

To read additional rows you can consult the webpage:

[How to query more than 1000 rows of a dataset](https://support.socrata.com/hc/en-us/articles/202949268-How-to-query-more-than-1000-rows-of-a-dataset)
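
As a sketch of the paging approach described on that page, the query parameters `$limit` and `$offset` can be generated one page at a time. The helper name `paging_payloads` and the total of 2500 rows are hypothetical and chosen only for illustration; the code builds the parameter dictionaries and does not contact the server.

```python
def paging_payloads(total_rows: int, page_size: int = 1000):
    """Yield one query-parameter dictionary per page (hypothetical helper)."""
    for offset in range(0, total_rows, page_size):
        # $limit: maximum number of rows per request
        # $offset: number of rows to skip before the page starts
        yield {"$limit": page_size, "$offset": offset}

pages = list(paging_payloads(total_rows=2500))
for payload in pages:
    print(payload)
```

Each dictionary could then be passed as the `params` argument of a `get()` call, and the resulting lists of records concatenated.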

In [None]:
meteorite_url = "https://data.nasa.gov/resource/gh4g-9sfh.json"

In [None]:
meteorite_resp = access_website(meteorite_url)

__Convert the JSON object into a Python object__

In [None]:
meteorite_page = json.loads(meteorite_resp.text)

In [None]:
type(meteorite_page)

In [None]:
len(meteorite_page)

In [None]:
meteorite_page[0]

__Create a Pandas DataFrame__

In [None]:
list_keys = list(meteorite_page[0].keys())
list_keys = list_keys[:-1]
list_keys

In [None]:
meteorite_df = pd.DataFrame(columns=list_keys)
meteorite_df

In [None]:
missing_val = "-99999"

for item in meteorite_page:
    data = list()
    for key in list_keys:
        try:
            data.append(item[key])
        except KeyError:
            data.append(missing_val)
    meteorite_df.loc[len(meteorite_df)] = data
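
Not every record carries every key, which is why a placeholder is needed. A compact alternative to the `try`/`except` pattern is `dict.get` with a default value. The sketch below uses made-up records, not the actual dataset.

```python
missing_val = "-99999"
keys = ["name", "mass", "year"]

# Made-up records: the second one lacks the "mass" key.
records = [
    {"name": "A", "mass": "21", "year": "1880"},
    {"name": "B", "year": "1951"},
]

# dict.get returns the placeholder when the key is absent.
rows = [[record.get(key, missing_val) for key in keys] for record in records]
for row in rows:
    print(row)
```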

In [None]:
meteorite_df.head(3)

In [None]:
meteorite_df.tail(3)

In [None]:
meteorite_df.shape

In [None]:
meteorite_df.isnull().sum()

### <font color="blue">Perform data profiling</font>

__Dealing with missing values__

In [None]:
meteorite_df.replace(missing_val, np.nan, inplace=True)

In [None]:
meteorite_df.isnull().sum()

In [None]:
meteorite_df.info()

__Type conversion__

In [None]:
meteorite_df['id'] = meteorite_df['id'].astype(int)

In [None]:
cols = ['mass', 'reclat', 'reclong']
meteorite_df[cols] = meteorite_df[cols].astype(float)

In [None]:
meteorite_df['year'] = pd.to_datetime(meteorite_df['year'], errors = 'coerce')

In [None]:
meteorite_df.info()

__Create a `geometry` column__

In [None]:
meteorite_df['geometry'] = [Point(xy) for xy in zip(meteorite_df.reclong, meteorite_df.reclat)] 
meteorite_df.head(3)

__Create a GeoPandas GeoDataFrame__

In [None]:
meteorite_gdf = gpd.GeoDataFrame(meteorite_df, geometry="geometry", crs="EPSG:4326")

__Count values__

In [None]:
meteorite_df['fall'].unique()

In [None]:
meteorite_df['fall'].value_counts()

__Examine the classes__

In [None]:
meteorite_df['recclass'].value_counts()

__Barplot__

In [None]:
n = 10
# create list of labels
class_labels = meteorite_df['recclass'].value_counts().head(n).tolist()

meteorite_df['recclass'].value_counts().head(n).plot(kind='barh', title=f'Top {n} Meteorite Classes', width=0.9, figsize=(10, 5))
plt.grid(axis='x', alpha=0.5)
plt.gca().invert_yaxis()

# add count values from list
for i, v in enumerate(class_labels):
    plt.text(x=v+3, y=i+0.25, s=str(v))

__Pie chart__

In [None]:
n = 10
# extract dataset by labels
meteo_class = meteorite_df['recclass'].value_counts().head(n)

# pie plot
meteo_class.plot(kind='pie', title=f'Top {n} Meteorite Classes', figsize=(7, 7))
plt.pie(meteo_class, wedgeprops = {'linewidth': 3, 'edgecolor': 'white'})

# insert circle
circle = plt.Circle((0, 0), 0.6, color='white')
plt.gcf().gca().add_artist(circle);

__Another barplot__

In [None]:
# create list of labels
fall_labels = meteorite_df['fall'].value_counts().tolist()

# bar plot - 'fall' labels
meteorite_df['fall'].value_counts().plot.bar()

# add count values from list
for i, v in enumerate(fall_labels):
    plt.text(x=i, y=v, s=str(v), ha='center', va='bottom')

__Histogram of mass distribution__

In [None]:
# histogram by 'fall' labels
falls = meteorite_df.groupby('fall')

for name, group in falls:
    plt.hist(x=group['mass'], bins=10**np.linspace(-7, 7), alpha=0.5)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('log [mass]')
plt.legend(['Fell','Found'])

__Year distribution__

In [None]:
# plot by 'fall' label
for i in ['Found', 'Fell']:
    pd.pivot_table(meteorite_df[meteorite_df['fall'] == i], 
                   index='year', values='name', 
                   aggfunc='count').plot(title=f'year vs {i}', ylabel=f'Num. of {i}')

__Scatter plot: year over latitude__

In [None]:
meteorite_df.plot(kind="scatter", x="year", y="reclat", figsize=(10, 6))

__Scatter plot: year over mass__

In [None]:
meteorite_df.plot(kind="scatter", x="year", y="mass", logy=True);

__Scatter plot: year over mass (with colors)__

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(x=meteorite_df['year'], 
            y=meteorite_df['mass'], c=np.log10(meteorite_df['mass']), 
            cmap='gist_earth', alpha=0.7
           )
plt.yscale('log')
plt.xlabel('year')
plt.ylabel('mass')

### <font color="blue">Create maps</font>

__Include the world map__

In [None]:
world_url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
world_gdf = gpd.read_file(world_url)

In [None]:
world_gdf.head(2)

In [None]:
meteorite_gdf = gpd.GeoDataFrame(meteorite_df, crs="EPSG:4326")  # same CRS as the world map
meteorite_gdf.head(3)

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
world_gdf.boundary.plot(ax=axes)
meteorite_gdf.plot(ax=axes, marker='o', color='red', markersize=2)

In [None]:
world_gdf['CONTINENT'].unique()

In [None]:
north_america = world_gdf[world_gdf['CONTINENT'] == 'North America']
asia = world_gdf[world_gdf['CONTINENT'] == 'Asia']

In [None]:
na_meteorite = gpd.sjoin(meteorite_gdf, north_america,
                         how='inner', predicate='intersects')
na_meteorite

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
north_america.boundary.plot(ax=axes)
na_meteorite.plot(ax=axes, marker='o', color='red', markersize=2)

In [None]:
m = north_america.explore() # world_gdf.explore()
na_meteorite.explore(
    m=m, 
    column="mass", 
    cmap="viridis_r",
    #style_kwds={"style_function":lambda x: {"radius":x["properties"]["mass"]}}
) #color="red")

---

# Application 5: <font color="red"> Extract weather data</font>

- We access the web site [www.weather.gov/](https://www.weather.gov/) to extract the weather forecast information for the city of Greenbelt, Maryland.

__Access the web page__

In [None]:
weather_url = "https://forecast.weather.gov/MapClick.php"

greenbelt_lat = 39.00079000000005
greenbelt_lon = -76.88055999999995

weather_params = {
    'lat': greenbelt_lat,
    'lon': greenbelt_lon
}

weather_source = access_website(weather_url, weather_params)

In [None]:
weather_soup = bso(weather_source.text, 'html.parser')
#print(weather_soup.prettify())

**Extract Tonight's Forecast**

In [None]:
seven_day = weather_soup.find(id="seven-day-forecast")

In [None]:
forecast_items = seven_day.find_all(class_="tombstone-container")

In [None]:
for item in forecast_items:
    if item.find(class_="period-name").get_text() == "Tonight":
        tonight = item
        break

print(tonight.prettify())

In [None]:
period = tonight.find(class_="period-name").get_text()
print(f"\t {GREEN} {period} {RESET}")

In [None]:
short_desc = tonight.find(class_="short-desc").get_text()
print(f"\t {GREEN} {short_desc} {RESET}")

In [None]:
temp = tonight.find(class_="temp").get_text()
print(f"\t {GREEN} {temp} {RESET}")

In [None]:
img = tonight.find("img")
desc = img['title']
print(f"\t {GREEN} {desc} {RESET}")

**Extracting all data**

We use CSS selectors to extract everything at once.

We select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
for item in periods:
    print(f"\t {GREEN} {item.strip()} {RESET}")
#print(f"\t {GREEN} {periods} {RESET}")
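
To see what the descendant selector `.tombstone-container .period-name` matches, it can be tried on a small, made-up HTML fragment. The markup below only imitates the class names used on the forecast page; it is not the actual page content.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used on the forecast page.
sample_html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="temp">Low: 30 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="temp">High: 45 F</p>
  </div>
  <p class="period-name">Outside any container</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
container = sample_soup.find(id="seven-day-forecast")

# Only .period-name elements nested inside a .tombstone-container are selected.
names = [tag.get_text() for tag in container.select(".tombstone-container .period-name")]
print(names)
```

The third `period-name` element is skipped because it is not nested inside an element with the class `tombstone-container`.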

Get text from `short-desc` class:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
for item in short_descs:
    print(f"\t {GREEN} {item.strip()} {RESET}")
#print(f"\t {GREEN} {short_descs} {RESET}")

Get text from `temp` class:

In [None]:
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
for item in temps:
    print(f"\t {GREEN} {item.strip()} {RESET}")
#print(f"\t {GREEN} {temps} {RESET}")

In [None]:
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
for item in descs:
    print(f"\t {GREEN} {item.strip()} {RESET}")
#print(f"\t {GREEN} {descs} {RESET}")

In [None]:
if len(temps) < len(descs):
    temps = [" "] + temps

We can combine the data into a Pandas DataFrame:

In [None]:
forecast_dict = dict(
    Period=periods, 
    Temperature=temps,
    Short_Description=short_descs,  
    Description=descs
)
df_weather = pd.DataFrame(forecast_dict)
df_weather = df_weather.set_index("Period")
df_weather

**Detailed Forecast**

In [None]:
det_forecast = weather_soup.find(id="detailed-forecast-body")

In [None]:
forecast_labels = det_forecast.find_all(class_="col-sm-2 forecast-label")

In [None]:
forecast_texts = det_forecast.find_all(class_="col-sm-10 forecast-text")

In [None]:
for a, b in zip(forecast_labels, forecast_texts):
    print(f"{GREEN} {a.get_text():>15}: {RESET} {b.get_text():<}")
    print()

---