## A note before starting

I opted to scrape slightly different websites from the ones given in the instructions. So that my Flask app serves the most recent content, I went directly to the source pages instead of the re-hosted copies the instructions provide.

### Dependencies

In [1]:
# Import dependencies
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

### Splinter browser setup with Chrome webdriver

In [2]:
# Set up a windowless (headless) Splinter browser with Chrome webdriver to use
# later when interacting with the NASA News and USGS Astrogeology pages

executable_path = {"executable_path": ChromeDriverManager().install()}
browser = Browser("chrome", **executable_path, headless=True)



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
There is no [linux64] chromedriver for browser 99.0.4844 in cache
Trying to download new driver from https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_linux64.zip
Driver has been saved in cache [/home/charlie/.wdm/drivers/chromedriver/linux64/99.0.4844.51]


*Chrome must be installed for this to work. On Windows, WebDriver Manager may not be able to find Chrome's version number and will print a warning. The web driver is saved to, and run from, the cache.*

### Scraping NASA's Mars News Site

**Note: NASA uses JS to load certain elements, which causes issues if the code grabs the HTML before the page has loaded in its entirety. The solution for now is to pause the code for a second to let the elements load. This also means that we cannot use the `requests` module here: it returns only the HTML the server sends and does not run the page's JS, so the JS-loaded elements would never appear in the response.**
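A fixed pause can be too short on a slow connection and wasted time on a fast one. As a minimal sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. `wait_until` is a hypothetical name introduced here for illustration, not part of Splinter, and the timeout values are assumptions.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage with the Splinter browser from this notebook:
#   browser.visit(url)
#   wait_until(lambda: browser.is_element_present_by_css("li.slide"))
```

Splinter's `is_element_present_by_css` also accepts a `wait_time` argument, so passing that directly may be enough without a separate helper.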

In [3]:
# Splinter browser
url = ("https://mars.nasa.gov/news/?page=0&per_page=40"
    + "&order=publish_date+desc%2Ccreated_at+desc&search="
    + "&category=19%2C165%2C184%2C204&blank_scope=Latest")
browser.visit(url)

# NASA loads certain elements using JS, which can cause problems if we grab the
# HTML too quickly. Sleep for one second to let their JS do its thing.
time.sleep(1)

# Grab HTML from Splinter browser visit
html = browser.html
soup = BeautifulSoup(html, "lxml")

In [4]:
# Select the most recent article (select_one method returns first instance)
latest_article = soup.select_one("li.slide")

# Grab article's title, paragraph, and url
latest_title = latest_article.select_one("div.content_title").text
latest_para = latest_article.select_one("div.article_teaser_body").text
latest_url = f"https://mars.nasa.gov{latest_article.a['href']}"

# Save in a dictionary for quick access
latest_news = {
    "title": latest_title,
    "paragraph": latest_para,
    "url": latest_url
}

print(latest_news)

{'title': "NASA's Perseverance Rover Hightails It to Martian Delta", 'paragraph': 'The rover’s self-driving capabilities will be put to the test this month as it begins a record-breaking series of sprints to its next sampling location.', 'url': 'https://mars.nasa.gov/news/9149/nasas-perseverance-rover-hightails-it-to-martian-delta/'}


### Scraping NASA's Mars Image of the Week

In [5]:
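# Scrape with the requests module, which fetches the HTML as served
url = "https://mars.nasa.gov/"
# The 10-second timeout is an arbitrary choice so a stalled request cannot hang
response = requests.get(url, timeout=10)
# Stop here on an HTTP error instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")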

In [6]:
# Select the image of the week's image path and the name of the image

# NASA uses JS (imageScroll) to convert the img tag, so the image path is read
# from the data-image attribute
featured_image = soup.select_one("#featured_image")
featured_url = f"https://mars.nasa.gov{featured_image['data-image']}"

# Select the Image of the Week's Name
featured_name = (
    soup.select_one(".image_of_the_day")
    .select_one(".media_feature_title")
    .text.strip()
)

# Get link to the article page for the image of the week
article_url = soup.select_one(".header_link")["href"]

# Store name, image url, and article url of featured image
image_of_week = {
    "name": featured_name, "img_url": featured_url, "article_url": article_url
}
print(image_of_week)

{'name': 'Promethei Chasma', 'img_url': 'https://mars.nasa.gov/system/basic_html_elements/293_1600x900-horizontal-ODYSSEY-PROMETHEI-CHASMA-PIA24409.jpg', 'article_url': 'https://mars.nasa.gov/resources/25604/promethei-chasma/'}
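The two NASA cells build absolute links by prefixing `https://mars.nasa.gov` to a scraped path, while the article link above is used as scraped because it is already absolute. A minimal sketch, assuming either form can show up, is to let `urllib.parse.urljoin` handle both cases. `absolute_url` and `BASE_URL` are hypothetical names used only for this illustration.

```python
from urllib.parse import urljoin

# Assumed base for every link scraped from the NASA Mars site
BASE_URL = "https://mars.nasa.gov"


def absolute_url(href, base=BASE_URL):
    """Return an absolute URL whether `href` is site-relative or already absolute."""
    return urljoin(base, href)


# A site-relative path gets the base prepended
news_url = absolute_url(
    "/news/9149/nasas-perseverance-rover-hightails-it-to-martian-delta/"
)

# An already absolute link is returned unchanged
article_url = absolute_url(
    "https://mars.nasa.gov/resources/25604/promethei-chasma/"
)

print(news_url)
print(article_url)
```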


### Scraping NASA's Mars Facts Page with Pandas

**Note: I am using NASA's own HTML table instead of the re-hosted table provided from the instructions.**

In [7]:
# Read any available tables from the web page (should be just one)
url = "https://mars.nasa.gov/all-about-mars/facts/"
facts_table = pd.read_html(url)
df = facts_table[0]
df

| | Unnamed: 0 | EARTH | MARS |
|---|---|---|---|
| 0 | Average Distance from Sun | 93 million miles | 142 million miles |
| 1 | Average Speed in Orbiting Sun | 18.5 miles per second | 14.5 miles per second |
| 2 | Diameter | 7,926 miles | 4,220 miles |
| 3 | Tilt of Axis | 23.5 degrees | 25 degrees |
| 4 | Length of Year | 365.25 Days | 687 Earth Days |
| 5 | Length of Day | 23 hours 56 minutes | 24 hours 37 minutes |
| 6 | Gravity | 2.66 times that of Mars | 0.375 that of Earth |
| 7 | Temperature | Average 57 degrees F | Average -81 degrees F |
| 8 | Atmosphere | nitrogen, oxygen, argon, others | mostly carbon dioxide, some water vapor |
| 9 | Number of Moons | 1 | 2 |


In [8]:
# Table looks remarkably clean, just get rid of the name of the first column
df = df.rename(columns={"Unnamed: 0": ""})
df

| | | EARTH | MARS |
|---|---|---|---|
| 0 | Average Distance from Sun | 93 million miles | 142 million miles |
| 1 | Average Speed in Orbiting Sun | 18.5 miles per second | 14.5 miles per second |
| 2 | Diameter | 7,926 miles | 4,220 miles |
| 3 | Tilt of Axis | 23.5 degrees | 25 degrees |
| 4 | Length of Year | 365.25 Days | 687 Earth Days |
| 5 | Length of Day | 23 hours 56 minutes | 24 hours 37 minutes |
| 6 | Gravity | 2.66 times that of Mars | 0.375 that of Earth |
| 7 | Temperature | Average 57 degrees F | Average -81 degrees F |
| 8 | Atmosphere | nitrogen, oxygen, argon, others | mostly carbon dioxide, some water vapor |
| 9 | Number of Moons | 1 | 2 |


In [9]:
# Convert to HTML table
html_facts_table = df.to_html(index=False)
html_facts_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>EARTH</th>\n      <th>MARS</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Average Distance from Sun</td>\n      <td>93 million miles</td>\n      <td>142 million miles</td>\n    </tr>\n    <tr>\n      <td>Average Speed in Orbiting Sun</td>\n      <td>18.5 miles per second</td>\n      <td>14.5 miles per second</td>\n    </tr>\n    <tr>\n      <td>Diameter</td>\n      <td>7,926 miles</td>\n      <td>4,220 miles</td>\n    </tr>\n    <tr>\n      <td>Tilt of Axis</td>\n      <td>23.5 degrees</td>\n      <td>25 degrees</td>\n    </tr>\n    <tr>\n      <td>Length of Year</td>\n      <td>365.25 Days</td>\n      <td>687 Earth Days</td>\n    </tr>\n    <tr>\n      <td>Length of Day</td>\n      <td>23 hours 56 minutes</td>\n      <td>24 hours 37 minutes</td>\n    </tr>\n    <tr>\n      <td>Gravity</td>\n      <td>2.66 times that of Mars</td>\n      <td>0.375 that of Earth<

### Scraping Astrogeology USGS website for high resolution Mars hemisphere images

In [10]:
# Need to interact with website for images, use Splinter
url = ("https://astrogeology.usgs.gov/search/results?"
    + "q=hemisphere+enhanced&k1=target&v1=Mars")
browser.visit(url)
html = browser.html
soup = BeautifulSoup(html, "lxml")

# Get all hemispheres from the results page
hemispheres = soup.select("h3")

# List comprehension to remove "Enhanced" from the hemisphere names
hemispheres = [" ".join(name.text.split()[:-1]) for name in hemispheres]

# Initialize empty list to store dictionaries of hemisphere names 
# and image links into later
hemisphere_imgs = []

# For each hemisphere, create dictionary entry, click on their links, 
# get the url of the full image, and go back to the results page to 
# click on the next hemisphere
for name in hemispheres:
    # Open the hemisphere's detail page
    browser.links.find_by_partial_text(name).click()

    # Grab the new page's HTML and parse using beautiful soup
    soup = BeautifulSoup(browser.html, "lxml")

    # Get image url and store it together with the hemisphere name
    img_url = (
        "https://astrogeology.usgs.gov"
        f"{soup.select_one('img.wide-image')['src']}"
    )
    hemisphere_imgs.append({"name": name, "url": img_url})

    # Go back to the page with all hemispheres before the next click
    browser.back()
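The list comprehension in this cell drops the last word of every heading unconditionally, which assumes each `h3` on the results page ends in "Enhanced". A minimal sketch of a more defensive variant removes the word only when it is actually there. `strip_trailing_word` is a hypothetical helper, and the sample titles are assumed from the names printed by this notebook.

```python
def strip_trailing_word(title, word="Enhanced"):
    """Remove `word` from the end of `title` if present, normalizing whitespace."""
    parts = title.split()
    if parts and parts[-1] == word:
        parts = parts[:-1]
    return " ".join(parts)


# Titles assumed to look like the headings on the USGS results page
titles = [
    "Cerberus Hemisphere Enhanced",
    "Valles Marineris Hemisphere Enhanced",
    "Syrtis Major Hemisphere",  # no suffix: left unchanged
]
names = [strip_trailing_word(title) for title in titles]
print(names)
```

With this variant, a heading that lacks the suffix keeps its full name instead of losing its last word.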

In [11]:
# Confirming list of dictionaries was built successfully
print(hemisphere_imgs)

[{'name': 'Cerberus Hemisphere', 'url': 'https://astrogeology.usgs.gov/cache/images/f5e372a36edfa389625da6d0cc25d905_cerberus_enhanced.tif_full.jpg'}, {'name': 'Schiaparelli Hemisphere', 'url': 'https://astrogeology.usgs.gov/cache/images/3778f7b43bbbc89d6e3cfabb3613ba93_schiaparelli_enhanced.tif_full.jpg'}, {'name': 'Syrtis Major Hemisphere', 'url': 'https://astrogeology.usgs.gov/cache/images/555e6403a6ddd7ba16ddb0e471cadcf7_syrtis_major_enhanced.tif_full.jpg'}, {'name': 'Valles Marineris Hemisphere', 'url': 'https://astrogeology.usgs.gov/cache/images/b3c7c6c9138f57b4756be9b9c43e3a48_valles_marineris_enhanced.tif_full.jpg'}]


In [12]:
# End browser instance
browser.quit()
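Because the browser is opened in cell 2 and closed only here, an exception in any scraping cell may leave the headless Chrome process running. If these cells are later wrapped in a function for the Flask app, one option is a `try`/`finally` so the browser is always closed. This is a sketch under that assumption; `run_scrape` and `StubBrowser` are hypothetical names, and the stub stands in for a real Splinter browser so the example runs without Chrome.

```python
class StubBrowser:
    """Hypothetical stand-in for a Splinter browser, for illustration only."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run_scrape(browser, scrape):
    """Run `scrape(browser)` and always quit the browser afterwards."""
    try:
        return scrape(browser)
    finally:
        # Runs whether scrape() returned normally or raised
        browser.quit()


def failing_scrape(browser):
    # Simulates a selector that matched nothing on the page
    raise AttributeError("'NoneType' object has no attribute 'text'")


browser_ok = StubBrowser()
result = run_scrape(browser_ok, lambda b: {"title": "example"})

browser_bad = StubBrowser()
try:
    run_scrape(browser_bad, failing_scrape)
except AttributeError:
    pass  # the error still propagates, but the browser was closed first
```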