## MISSION TO MARS
### Scrape Mars Data: The News

In [1]:
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

# Instead of scraping each table cell individually, read whole tables with Pandas' read_html() function.
import pandas as pd

Set up the executable path and initialize a browser

In [2]:
# Set the executable path for ChromeDriver and start a Splinter browser
# (the URL 'https://redplanetscience.com/' is assigned in the next cell)
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [C:\Users\Daniel Brock\.wdm\drivers\chromedriver\win32\92.0.4515.107\chromedriver.exe] found in cache


Assign the URL and instruct the browser to visit it

In [3]:
# Visit the Mars NASA news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page:
# search for elements matching the tag (div) and class (list_text) combination,
# waiting up to one second for them to appear
browser.is_element_present_by_css('div.list_text', wait_time=1)

True
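
The effect of `wait_time` can be pictured as a polling loop with a deadline. The sketch below is only an illustration of that idea, not Splinter's actual implementation; `wait_until`, `poll_interval`, and the stand-in condition are hypothetical names introduced here.

```python
import time


def wait_until(condition, wait_time=1.0, poll_interval=0.1):
    """Hypothetical helper: poll `condition` until it is true or time runs out."""
    deadline = time.monotonic() + wait_time
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for "is the element on the page yet?": becomes true after 0.3 seconds.
start = time.monotonic()
found = wait_until(lambda: time.monotonic() - start > 0.3, wait_time=1.0)

# A condition that never becomes true gives up once the deadline passes.
missing = wait_until(lambda: False, wait_time=0.2)

print(found, missing)
```

A check like this returns as soon as the condition holds, so a generous `wait_time` only costs time when the element never appears.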

Use BeautifulSoup to parse the HTML

In [4]:
# Set up the HTML parser using BeautifulSoup
html = browser.html
news_soup = soup(html, 'html.parser')
# Assign slide_elem to the first <div /> with class list_text; its descendants (the tags nested inside it) are searched next
slide_elem = news_soup.select_one('div.list_text')

After opening the page in a new browser, right-click to inspect and activate your DevTools. Then search for the HTML components you'll use to identify the title and paragraph you want. We'll want to assign the title and summary text to variables we'll reference later. 

In [5]:
# Chain .find() onto the previously assigned variable, slide_elem, and look for the content title
slide_elem.find('div', class_='content_title')

<div class="content_title">NASA InSight's 'Mole' Is Out of Sight</div>

The title is in that mix of HTML in our output, which is what we wanted. But we need just the text; the surrounding HTML isn't necessary.

In [6]:
# Find the content_title <div /> inside the parent element and save its text as `news_title`
# When .get_text() is chained onto .find(), only the text of the element is returned.
news_title = slide_elem.find('div', class_='content_title').get_text()
print("-------------1st ARTICLE---------")
print(news_title)

# Use the parent element to find the paragraph text
news_summary = slide_elem.find('div', class_="article_teaser_body").get_text()
print("-------------SUMMARY-------------")
print(news_summary)

-------------1st ARTICLE---------
NASA InSight's 'Mole' Is Out of Sight
-------------SUMMARY-------------
Now that the heat probe is just below the Martian surface, InSight's arm will scoop some additional soil on top to help it keep digging so it can take Mars' temperature.
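
If the page has not finished loading, or its layout changes, `select_one()` and `find()` return `None`, and chaining `.get_text()` onto that raises an `AttributeError`. The following is a minimal sketch of a guarded version, run against a small inline HTML string instead of the live site; `first_article` and `sample_html` are hypothetical names, and the sample text is placeholder content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like one news item; the text is placeholder content.
sample_html = """
<div class="list_text">
  <div class="content_title">Example title</div>
  <div class="article_teaser_body">Example summary.</div>
</div>
"""


def first_article(html):
    """Return (title, summary) for the first news item, or (None, None)."""
    page = BeautifulSoup(html, 'html.parser')
    slide_elem = page.select_one('div.list_text')
    if slide_elem is None:
        return None, None
    title_elem = slide_elem.find('div', class_='content_title')
    teaser_elem = slide_elem.find('div', class_='article_teaser_body')
    if title_elem is None or teaser_elem is None:
        return None, None
    return title_elem.get_text(strip=True), teaser_elem.get_text(strip=True)


news_title, news_summary = first_article(sample_html)
print(news_title)
print(news_summary)
```

Returning `(None, None)` lets the calling code decide how to handle a missing article instead of crashing partway through the scrape.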


### Scrape Mars Data: Featured Image

In [7]:
# Visit URL
url = 'https://spaceimages-mars.com/'
browser.visit(url)

Next, we want to click the "Full Image" button. This button will direct our browser to an image slideshow. Let's take a look at the button's HTML tags and attributes with the DevTools.

Since there are only three buttons, and we want to click the full-size image button, we can go ahead and use the HTML tag in our code.

In [8]:
# Find and click the full image button
# Use Splinter's find_by_tag() to find elements by their tag
# Index the result with [1] to select the second button on the page
# Assign the result to a new variable
full_image_elem = browser.find_by_tag('button')[1]
# Splinter "clicks" the button to view the image at full size
full_image_elem.click()

With the new page loaded onto our automated browser, it needs to be parsed so we can continue and scrape the full-size image URL. 

In [9]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

Now we need to find the relative image URL. In our browser (make sure you're on the same page as the automated one), activate your DevTools again. It's important to note that the value of the src will be different every time the page is updated, so we can't simply record the current value; if we did, we would pull that same image every time the code is executed.

We'll use the image tag and class (`<img />` and `fancybox-image`) to build the URL to the full-size image.

In [10]:
# Find the relative image URL: use BeautifulSoup to look for an <img /> tag with a class of fancybox-image
# .get('src') pulls the link to the image.
# In other words: "This is where the image we want lives; use the link that's inside this tag."
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars3.jpg'

Let's add the base URL to our code, because if we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included.

In [11]:
# Use the base URL to create an absolute URL
# img_url is the variable that holds our f-string
# An f-string is a Python string literal that evaluates embedded expressions when it runs.
# The curly brackets {} hold a variable whose value is inserted into the string.
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url

'https://spaceimages-mars.com/image/featured/mars3.jpg'
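
As an alternative to the f-string, the standard library's `urllib.parse.urljoin` can combine the base URL and the relative path. This is a minimal sketch, not the approach used in the notebook above; it also copes with a relative path that starts with a slash or with a `src` that is already absolute. The variable names `with_slash` and `already_absolute` are introduced here for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://spaceimages-mars.com/'
img_url_rel = 'image/featured/mars3.jpg'

# Join the base URL and the relative path into an absolute URL.
img_url = urljoin(base_url, img_url_rel)
print(img_url)

# A leading slash on the relative path does not produce a double slash.
with_slash = urljoin(base_url, '/image/featured/mars3.jpg')

# A src value that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, 'https://example.com/other.jpg')
```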

### Scrape Mars Data: Mars Facts
Get a table from Mars Facts and display it as a table on our own web app. Let's look at the webpage again, this time using our DevTools. All of the data we want is in a `<table />` tag.

In [12]:
# Create a new DataFrame from the HTML table.
# The Pandas function read_html() searches for and returns a list of the tables found in the HTML.
# Indexing with [0] pulls only the first table it encounters, i.e. the first item in the list.
df = pd.read_html('https://galaxyfacts-mars.com/')[0]
# Assign columns to the new DataFrame for additional clarity.
df.columns = ['Description', 'Mars', 'Earth']
# Use .set_index() to turn the Description column into the DataFrame's index.
# inplace=True modifies df directly, so the result does not need to be reassigned to a new variable.
df.set_index('Description', inplace=True)
df

| Description | Mars | Earth |
|---|---|---|
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


Robin's web app is going to be an actual webpage. Thankfully, Pandas also has a way to easily convert our DataFrame back into HTML-ready code using the .to_html() function. 

In [13]:
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>Description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>
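
When the table is destined for a styled web page, `.to_html()` accepts a `classes` argument that adds CSS classes to the `<table>` tag. The sketch below builds a two-row DataFrame by hand from values shown above, so it runs without a network connection; the class names `table` and `table-striped` are an assumption about the stylesheet the web app might use.

```python
import pandas as pd

# Two rows copied from the scraped table, with Description as the index.
df = pd.DataFrame(
    {'Mars': ['6,779 km', '2'], 'Earth': ['12,742 km', '1']},
    index=pd.Index(['Diameter:', 'Moons:'], name='Description'),
)

# Assumed CSS class names; they are added to the <table> tag's class attribute.
table_html = df.to_html(classes='table table-striped')
print(table_html)
```

The resulting string can be passed to a template as-is, and the extra classes let the page's stylesheet control how the table looks.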

Now that we've gathered everything on Robin's list, we can end the automated browsing session. 

In [14]:
browser.quit()