# Mission To Mars - Web Scraping Analysis

### This Jupyter Notebook uses Splinter and Beautiful Soup to scrape the NASA website and related sites for the latest news and information about Mars missions.

#### The notebook will accomplish the following:
* 1) Scrape the NASA Mars News Site, collect the latest News Title and Paragraph Text, and assign them to variables.
* 2) Visit the URL for the JPL Featured Space Image, use Splinter to navigate the site, find the image URL for the current Featured Mars Image, and assign the URL string to a variable called featured_image_url.
* 3) Visit the Mars Weather Twitter account, scrape the latest Mars weather tweet from the page, and save the tweet text for the weather report as a variable called mars_weather.
* 4) Visit the Mars Facts webpage and use Pandas to scrape the table containing facts about the planet, including Diameter, Mass, etc. Then use Pandas to convert the data to an HTML table string.
* 5) Visit the USGS Astrogeology site to obtain high-resolution images for each of Mars's hemispheres.

### Workbook Setup

In [1]:
# Import necessary dependencies
from splinter import Browser
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# determine the path to the Google Chrome driver and store it in a variable
chrome_driver_path = !which chromedriver

In [3]:
# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': chrome_driver_path[0]}
chrome_browser = Browser('chrome', **executable_path)
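
The `!which chromedriver` shell escape in cell 2 only works inside IPython or Jupyter. If the same setup had to run as a plain Python script, one possible substitute is the standard library's `shutil.which`. This is a minimal sketch under that assumption; the `driver_path` name is illustrative and not part of the original notebook.

```python
import shutil

# Look up chromedriver on PATH; shutil.which returns None when it is not found.
driver_path = shutil.which("chromedriver")

if driver_path is None:
    print("chromedriver was not found on PATH")
else:
    # Same dictionary shape that cell 3 passes to splinter's Browser.
    executable_path = {"executable_path": driver_path}
    print(executable_path)
```

Checking for `None` before building the dictionary gives a clearer failure than the index error that `chrome_driver_path[0]` would raise on an empty result.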

### 1) NASA Website

In [4]:
# Assign URL variables to visit the NASA website
nasa_url = 'https://mars.nasa.gov/news/'
# have Chrome navigate to that URL
chrome_browser.visit(nasa_url)

In [5]:
# Convert the browser html from the NASA website from the link above to a Beautiful Soup object
nasa_html = chrome_browser.html
nasa_beautiful_soup = BeautifulSoup(nasa_html, 'html.parser')

In [8]:
# After printing the full Beautiful Soup object with "print(nasa_beautiful_soup)" and
# searching for the first article title visible on the page ("The Mars InSight..."),
# I determined that the news titles are in a div with class "content_title".

# find() returns the first matching div, which holds the latest article by date,
# so save its text as `news_title`
news_title = nasa_beautiful_soup.find("div", class_='content_title').get_text()
news_title

'The Mars InSight Landing Site Is Just Plain Perfect'

In [9]:
# The first paragraph text can be found with similar logic
news_paragraph = nasa_beautiful_soup.find('div', class_="article_teaser_body").get_text()
news_paragraph

'If the InSight landing zone were ice cream, it would be vanilla.'
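
Both lookups above chain `.get_text()` directly onto `find()`. Because `find()` returns `None` when nothing matches, that chain raises an `AttributeError` if the page has not rendered the expected element. The sketch below shows one way to guard against that. The `sample_html` markup and the `first_text` helper are hypothetical, modeled only on the class names used above, and are not taken from the live NASA page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the class names used above.
sample_html = """
<ul>
  <li>
    <div class="content_title"><a href="#">Newest headline</a></div>
    <div class="article_teaser_body">Newest teaser.</div>
  </li>
  <li>
    <div class="content_title"><a href="#">Older headline</a></div>
    <div class="article_teaser_body">Older teaser.</div>
  </li>
</ul>
"""


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None if absent."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return None
    return element.get_text(strip=True)


sample_soup = BeautifulSoup(sample_html, "html.parser")
print(first_text(sample_soup, "div", "content_title"))        # Newest headline
print(first_text(sample_soup, "div", "article_teaser_body"))  # Newest teaser.
print(first_text(sample_soup, "div", "no_such_class"))        # None
```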

### 2) JPL Space Images Featured Image

In [11]:
# Visit the provided URL in Chrome to scrape for the images
JPL_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
chrome_browser.visit(JPL_url)

In [12]:
# After inspecting the page, it was determined that the full image button has
# id = "full_image", so we use Splinter to find that id
JPL_full_image_elem = chrome_browser.find_by_id('full_image')

# instruct Chrome to click the full image button
JPL_full_image_elem.click()

In [14]:
# After inspecting the subsequent page, it was determined
# that the "More Info" button needs to be pressed to get access to the
# full image. The <a> element's class is "button"; since that is generic,
# use the text within the button to click it.
JPL_more_info_elem = chrome_browser.find_link_by_partial_text('more info')
JPL_more_info_elem.click()

In [15]:
# Now that we are on the page with the full image we need to parse html with BeautifulSoup
# to get the image link and store it to a variable
JPL_html = chrome_browser.html
JPL_img_soup = BeautifulSoup(JPL_html, 'html.parser')

In [18]:
# find the relative image url. After page inspection it is determined the image is within an
# <img> element with class "main_image" nested within a figure with class "lede"
JPL_img_url_rel = JPL_img_soup.select_one('figure.lede a img').get("src")
JPL_img_url_rel

'/spaceimages/images/largesize/PIA18328_hires.jpg'

In [20]:
# Use the base url to create an absolute url
JPL_base_URL = "https://www.jpl.nasa.gov"
JPL_absolute_img_url = JPL_base_URL + JPL_img_url_rel
JPL_absolute_img_url

'https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA18328_hires.jpg'
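
String concatenation works here because the relative path starts with a slash. As an alternative sketch, the standard library's `urljoin` builds the same absolute URL and also copes with a `src` that is already absolute. The result is stored in `featured_image_url`, the variable name given in the goals list at the top; the `example.com` address is a placeholder used only for illustration.

```python
from urllib.parse import urljoin

base_url = "https://www.jpl.nasa.gov"
relative_url = "/spaceimages/images/largesize/PIA18328_hires.jpg"

# Resolve the relative path against the site root.
featured_image_url = urljoin(base_url, relative_url)
print(featured_image_url)

# An already-absolute URL is returned unchanged rather than being prefixed again.
already_absolute = urljoin(base_url, "https://example.com/image.jpg")
print(already_absolute)
```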

### 3) Mars Weather via Twitter

In [21]:
# have the Chrome browser navigate to the provided Twitter URL
mars_twitter_url = 'https://twitter.com/marswxreport?lang=en'
chrome_browser.visit(mars_twitter_url)

In [22]:
# parse the resultant HTML with Beautiful Soup
mars_twitter_html = chrome_browser.html
mars_weather_soup = BeautifulSoup(mars_twitter_html, 'html.parser')

In [23]:
# First, find a tweet with the data-name `Mars Weather`
mars_weather_tweet = mars_weather_soup.find('div', attrs={"class": "tweet", "data-name": "Mars Weather"})

In [24]:
# Next, search within the tweet for the p tag containing the tweet text
found_mars_weather = mars_weather_tweet.find('p', 'tweet-text').get_text()
found_mars_weather

'Sol 2219 (2018-11-03), high -7C/19F, low -71C/-95F, pressure at 8.68 hPa, daylight 06:15-18:33'
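
The tweet text packs several readings into one string. If the individual values were needed later, a regular expression could split them out. The sketch below assumes every weather tweet follows the exact layout of the single example shown above, which may not hold for other tweets; `parse_weather` and the field names are hypothetical.

```python
import re

# Pattern written against the one tweet shown above; other layouts will not match.
WEATHER_PATTERN = re.compile(
    r"Sol (?P<sol>\d+) \((?P<date>\d{4}-\d{2}-\d{2})\), "
    r"high (?P<high_c>-?\d+)C/(?P<high_f>-?\d+)F, "
    r"low (?P<low_c>-?\d+)C/(?P<low_f>-?\d+)F, "
    r"pressure at (?P<pressure_hpa>[\d.]+) hPa, "
    r"daylight (?P<sunrise>\d{2}:\d{2})-(?P<sunset>\d{2}:\d{2})"
)


def parse_weather(tweet_text):
    """Return a dict of weather fields, or None if the text does not match."""
    match = WEATHER_PATTERN.search(tweet_text)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "sol": int(fields["sol"]),
        "date": fields["date"],
        "high_c": int(fields["high_c"]),
        "low_c": int(fields["low_c"]),
        "pressure_hpa": float(fields["pressure_hpa"]),
        "sunrise": fields["sunrise"],
        "sunset": fields["sunset"],
    }


mars_weather = (
    "Sol 2219 (2018-11-03), high -7C/19F, low -71C/-95F, "
    "pressure at 8.68 hPa, daylight 06:15-18:33"
)
print(parse_weather(mars_weather))
```

Returning `None` on a mismatch keeps a differently worded tweet from raising an exception partway through the scrape.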

### 4) Mars Facts

In [25]:
# create a dataframe using the link provided from the Mars Facts webpage
mars_facts_df = pd.read_html('http://space-facts.com/mars/')[0]
# set the columns to more descriptive headers
mars_facts_df.columns=['description', 'value']
# change the index to "description"
mars_facts_df.set_index('description', inplace=True)
mars_facts_df

| description | value |
| --- | --- |
| Equatorial Diameter: | 6,792 km |
| Polar Diameter: | 6,752 km |
| Mass: | 6.42 x 10^23 kg (10.7% Earth) |
| Moons: | 2 (Phobos & Deimos) |
| Orbit Distance: | 227,943,824 km (1.52 AU) |
| Orbit Period: | 687 days (1.9 years) |
| Surface Temperature: | -153 to 20 °C |
| First Record: | 2nd millennium BC |
| Recorded By: | Egyptian astronomers |


In [26]:
# convert the pandas dataframe to an HTML table
mars_facts_df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>value</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Equatorial Diameter:</th>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <th>Polar Diameter:</th>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.42 x 10^23 kg (10.7% Earth)</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <th>Orbit Distance:</th>\n      <td>227,943,824 km (1.52 AU)</td>\n    </tr>\n    <tr>\n      <th>Orbit Period:</th>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <th>Surface Temperature:</th>\n      <td>-153 to 20 °C</td>\n    </tr>\n    <tr>\n      <th>First Record:</th>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <th>Recorded By:</th>\n      <td>Egyptian astronomers</td>\n    </tr>
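
Cell 26 displays the HTML string but does not keep it. The sketch below rebuilds a small frame from three of the rows shown above (rather than fetching the page again) and stores the result of `to_html` in a variable. The `facts_html` name and the CSS class names passed to `classes` are assumptions chosen for illustration.

```python
import pandas as pd

# Three rows copied from the table above, indexed the same way as mars_facts_df.
facts = pd.DataFrame(
    {
        "description": ["Equatorial Diameter:", "Polar Diameter:", "Moons:"],
        "value": ["6,792 km", "6,752 km", "2 (Phobos & Deimos)"],
    }
).set_index("description")

# Keep the HTML in a variable; `classes` adds CSS classes and `border=0`
# replaces the default border="1" attribute on the <table> tag.
facts_html = facts.to_html(classes="table table-striped", border=0)
print(facts_html)
```

As in the output above, `to_html` escapes the ampersand in "Phobos & Deimos" by default, so the string is safe to embed in a page as-is.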

### 5) USGS Astrogeology Images

In [27]:
# navigate to the USGS site with the link provided in the Chrome browser
USGS_url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'
chrome_browser.visit(USGS_url)

In [28]:
# instantiate an empty list to hold all of the scraped image urls
hemisphere_image_urls = []

# Then, get a list of all of the hemispheres
hemisphere_links = chrome_browser.find_by_css("a.product-item h3")

In [29]:
# Next, loop through those links, click the link, find the sample anchor, and return the href
for i in range(len(hemisphere_links)):
    # create an empty dictionary
    current_hemisphere = {}
    
    # We have to find the elements on each loop to avoid a stale element exception
    chrome_browser.find_by_css("a.product-item h3")[i].click()
    
    # Next, we find the Sample image anchor tag and extract the href
    sample_elem = chrome_browser.find_link_by_text('Sample').first
    # set the href into the dictionary
    current_hemisphere['img_url'] = sample_elem['href']
    
    # Get the Hemisphere title and place it in the dictionary
    current_hemisphere['title'] = chrome_browser.find_by_css("h2.title").text
    
    # Append hemisphere dictionary to the list
    hemisphere_image_urls.append(current_hemisphere)
    
    # Finally, we navigate backwards in the browser so that on the next iteration the images
    # can be clicked into
    chrome_browser.back()

In [30]:
# display all of the found urls
hemisphere_image_urls

[{'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg',
  'title': 'Cerberus Hemisphere Enhanced'},
 {'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg',
  'title': 'Schiaparelli Hemisphere Enhanced'},
 {'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg',
  'title': 'Syrtis Major Hemisphere Enhanced'},
 {'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg',
  'title': 'Valles Marineris Hemisphere Enhanced'}]
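
The loop in cell 29 can also be wrapped in a function so that it is reusable and always navigates back, even when a lookup fails on one page. This is a sketch rather than a drop-in replacement: `collect_hemispheres` is a hypothetical name, and it relies only on the browser calls already used above (`find_by_css`, `find_link_by_text`, `back`).

```python
def collect_hemispheres(browser, link_css="a.product-item h3"):
    """Visit each hemisphere page and return a list of title/img_url dicts."""
    results = []
    link_count = len(browser.find_by_css(link_css))

    for index in range(link_count):
        # Re-query on every pass: elements found before a navigation go stale.
        browser.find_by_css(link_css)[index].click()
        try:
            sample_link = browser.find_link_by_text("Sample").first
            results.append(
                {
                    "title": browser.find_by_css("h2.title").text,
                    "img_url": sample_link["href"],
                }
            )
        finally:
            # Return to the results page even if one of the lookups raised.
            browser.back()

    return results
```

With the browser from cell 3 still open on the USGS results page, it would be called as `hemisphere_image_urls = collect_hemispheres(chrome_browser)`.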

In [32]:
# finally quit out of the Chrome browser
chrome_browser.quit()
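
If an earlier cell raises an exception, this final `quit()` never runs and the Chrome window stays open. One way to guarantee cleanup when the steps are run as a script is a small context manager. The `managed_browser` helper below is hypothetical and works with any object that has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(browser_factory):
    """Yield a browser built by `browser_factory` and always call quit() afterwards."""
    browser = browser_factory()
    try:
        yield browser
    finally:
        # Runs on normal exit and when the body raises.
        browser.quit()
```

Under that assumption, the scraping steps would sit inside `with managed_browser(lambda: Browser('chrome', **executable_path)) as chrome_browser:`.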