# Mission To Mars
## Web Scraping Notebook
This notebook is used to build the general strategy for scraping data needed to make the website. The website needs the following:
- The title and body of a news article from [redplanetscience.com](https://redplanetscience.com/),
- The featured image from [spaceimages-mars.com](https://spaceimages-mars.com/),
- A table containing cool facts about Mars (and Earth) from [galaxyfacts-mars.com](https://galaxyfacts-mars.com/), and
- Images of all four Martian hemispheres from [marshemispheres.com](https://marshemispheres.com/).

The scraping techniques include using [splinter](https://splinter.readthedocs.io/en/latest/) to visit pages, interact with them, and get their html; using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to find elements with specific tags and attributes; using [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to read tables with `read_html` and save them as html with `to_html`; and using [requests](https://requests.readthedocs.io/en/latest/) to get raw html text from a url. Getting the data of interest involves combining these techniques and tweaking their parameters.

In [1]:
# imports
from bs4 import BeautifulSoup as BS
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager
import requests
import pandas as pd
executable_path = {'executable_path': ChromeDriverManager().install()}

### News Article

In [2]:
# *** Strategy ***
# have splinter visit redplanetscience and call html
# have bs4 read html and find the 'section' element with class="image_and_description_container"
# then get the first 'div' under that - this is the target news story
# save the element with class="content_title" as title
# save the element with class="article_teaser_body" as body

url = 'https://redplanetscience.com/'
with Browser('chrome', **executable_path, headless=False) as browser:
    browser.visit(url)
    html = browser.html
    soup = BS(html, 'lxml')
    section = soup.find('section', class_="image_and_description_container")
    div = section.find('div', class_='col-md-12')
    title = div.find('div',class_='content_title').text
    body = div.find('div', class_='article_teaser_body').text
print(f'Title: {title}\nBody: {body}')

Title: NASA's Mars 2020 Comes Full Circle
Body: Aiming to pinpoint the Martian vehicle's center of gravity, engineers took NASA's 2,300-pound Mars 2020 rover for a spin in the clean room at JPL. 
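
The cell above assumes every element is present; if the page layout changes, `find` returns `None` and the next call raises an `AttributeError`. A minimal sketch of a more defensive version follows. The helper name `parse_latest_news` and the `sample_html` string are hypothetical, and the sample only imitates the class names used above rather than the real page. It uses `html.parser` so the sketch needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup


def parse_latest_news(html):
    """Return (title, body) for the first news item, or None if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("section", class_="image_and_description_container")
    if section is None:
        return None
    # find() returns the first match, which is the most recent story on the page
    title = section.find("div", class_="content_title")
    body = section.find("div", class_="article_teaser_body")
    if title is None or body is None:
        return None
    # get_text(strip=True) trims the surrounding whitespace that .text keeps
    return title.get_text(strip=True), body.get_text(strip=True)


# Hypothetical stand-in for browser.html, modelled on the class names used above
sample_html = """
<section class="image_and_description_container">
  <div class="col-md-12">
    <div class="content_title">Sample title</div>
    <div class="article_teaser_body"> Sample teaser. </div>
  </div>
</section>
"""
print(parse_latest_news(sample_html))
```

Returning `None` instead of raising lets the calling code decide whether to retry the visit or fall back to placeholder text.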


### Featured Image

In [3]:
# *** Strategy ***
# have splinter visit spaceimages-mars.com and call html
# have bs4 read html and find the 'div' element with class="header" - this is the main image
# get the 'src' attribute from the 'img' element with class="headerimage" and save as img
# concatenate the website url and the img path to get the img_url

url = 'https://spaceimages-mars.com/'
with Browser('chrome', **executable_path, headless=False) as browser:
    browser.visit(url)
    html = browser.html
    soup = BS(html,'html.parser')
    header = soup.find('div', class_='header')
    img = header.find('img', class_='headerimage')['src']
    img_url = f'{url}{img}'
print(f'URL: {img_url}')

URL: https://spaceimages-mars.com/image/featured/mars1.jpg
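
Joining the site url and the image path with an f-string works here because the url ends in `/` and the scraped path does not start with one. As a sketch of an alternative, the standard library's `urljoin` handles the slash cases; the variations of the path and base below are illustrative, not values scraped from the site.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"

# Relative path in the same form as the one scraped above
relative = urljoin(base_url, "image/featured/mars1.jpg")

# A leading slash on the path does not produce a double slash
rooted = urljoin(base_url, "/image/featured/mars1.jpg")

# A base url without the trailing slash still gets one separator
no_slash = urljoin("https://spaceimages-mars.com", "image/featured/mars1.jpg")

print(relative)
print(rooted)
print(no_slash)
```

All three calls produce the same url as the output above.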


### Mars/Earth Facts

In [4]:
# *** Strategy ***
# get html from galaxyfacts-mars.com using requests
# have pandas scrape table(s) from html with text 'Mars - Earth Comparison'
# clean up column names
# have pandas generate html table for the dataframe and include bootstrap classes - 'table table-responsive'

url = 'https://galaxyfacts-mars.com/'
html = requests.get(url).content
df = pd.read_html(html, match='Mars - Earth Comparison', header=0)[0]
df[''] = df['Mars - Earth Comparison']
table = df.loc[:,['','Mars','Earth']].to_html(index=False, border=1, classes="table table-responsive")
print(table)

<table border="False" class="dataframe table table-responsive">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Mars</th>
      <th>Earth</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Diameter:</td>
      <td>6,779 km</td>
      <td>12,742 km</td>
    </tr>
    <tr>
      <td>Mass:</td>
      <td>6.39 × 10^23 kg</td>
      <td>5.97 × 10^24 kg</td>
    </tr>
    <tr>
      <td>Moons:</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Distance from Sun:</td>
      <td>227,943,824 km</td>
      <td>149,598,262 km</td>
    </tr>
    <tr>
      <td>Length of Year:</td>
      <td>687 Earth days</td>
      <td>365.24 days</td>
    </tr>
    <tr>
      <td>Temperature:</td>
      <td>-87 to -5 °C</td>
      <td>-88 to 58°C</td>
    </tr>
  </tbody>
</table>
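
The same `read_html` / `to_html` round trip can be tried offline on a small html string, which is handy when tweaking parameters. This is a sketch under assumptions: `sample_html` is a hypothetical, shortened imitation of the facts table, the first column is renamed rather than copied, and the string is wrapped in `StringIO` because recent pandas versions deprecate passing literal html to `read_html`.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened imitation of the comparison table
sample_html = """
<table>
  <tr><th>Mars - Earth Comparison</th><th>Mars</th><th>Earth</th></tr>
  <tr><td>Diameter:</td><td>6,779 km</td><td>12,742 km</td></tr>
  <tr><td>Moons:</td><td>2</td><td>1</td></tr>
</table>
"""

# match keeps only tables containing this text; header=0 uses the first row as column names
df = pd.read_html(StringIO(sample_html), match="Mars - Earth Comparison", header=0)[0]

# Blank out the first column's name so the rendered table has an empty header cell
df = df.rename(columns={"Mars - Earth Comparison": ""})

table = df.to_html(index=False, classes="table table-responsive")
print(table)
```

Renaming the column in place avoids the extra `.loc` selection, since no duplicate column is created.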


### Martian Hemispheres

In [5]:
# *** Strategy ***
# have splinter visit marshemispheres.com, four times
# have splinter get all the links with partial text 'Hemisphere'
# and click on a different one each time
# on new page, click the link that says 'Open' and call the html
# have bs4 read html, find the 'h2' element with class="title" and save its text as title
# then get the 'src' attribute of the 'img' element with class='wide-image' and save it as img
# concatenate the root page url with the img path and save as the img_url

url = 'https://marshemispheres.com/'
with Browser('chrome', **executable_path, headless=False) as browser:
    for i in range(4):
        browser.visit(url)
        links = browser.links.find_by_partial_text('Hemisphere')
        links[i].click()
        browser.links.find_by_text('Open').click()
        html = browser.html
        soup = BS(html, 'html.parser')
        title = soup.find('h2', class_='title').text
        img = soup.find('img', class_='wide-image')['src']
        img_url = f'{url}{img}'
        print('--------------------------------------------------------------------------------------------------')
        print(f'Title: {title}\nURL: {img_url}')

--------------------------------------------------------------------------------------------------
Title: Cerberus Hemisphere Enhanced
URL: https://marshemispheres.com/images/f5e372a36edfa389625da6d0cc25d905_cerberus_enhanced.tif_full.jpg
--------------------------------------------------------------------------------------------------
Title: Schiaparelli Hemisphere Enhanced
URL: https://marshemispheres.com/images/3778f7b43bbbc89d6e3cfabb3613ba93_schiaparelli_enhanced.tif_full.jpg
--------------------------------------------------------------------------------------------------
Title: Syrtis Major Hemisphere Enhanced
URL: https://marshemispheres.com/images/555e6403a6ddd7ba16ddb0e471cadcf7_syrtis_major_enhanced.tif_full.jpg
--------------------------------------------------------------------------------------------------
Title: Valles Marineris Hemisphere Enhanced
URL: https://marshemispheres.com/images/b3c7c6c9138f57b4756be9b9c43e3a48_valles_marineris_enhanced.tif_full.jpg
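
The loop above only prints each result. For the website, it may be more convenient to collect the results into a list of dictionaries. The sketch below separates the parsing from the browser so it can run without splinter; `parse_hemisphere`, the `pages` list, and the dictionary keys `title` and `img_url` are all hypothetical names, and the html strings only imitate the tags and classes used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://marshemispheres.com/"


def parse_hemisphere(html, base_url=BASE_URL):
    """Extract the title and full image url from one hemisphere detail page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h2", class_="title").get_text(strip=True)
    img_path = soup.find("img", class_="wide-image")["src"]
    return {"title": title, "img_url": urljoin(base_url, img_path)}


# Hypothetical stand-ins for browser.html on two detail pages
pages = [
    '<h2 class="title">Cerberus Hemisphere Enhanced</h2>'
    '<img class="wide-image" src="images/cerberus.jpg">',
    '<h2 class="title">Schiaparelli Hemisphere Enhanced</h2>'
    '<img class="wide-image" src="images/schiaparelli.jpg">',
]

hemispheres = [parse_hemisphere(page) for page in pages]
print(hemispheres)
```

Inside the splinter loop, each `browser.html` would be passed to the helper in place of the sample strings.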
