## 11.6.3 Scraping Mars News

In [1]:
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Set the executable path and launch the Chrome browser
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

In [3]:
# assign the URL, and instruct the browser to visit it
# Visit the Mars NASA news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

True

In [4]:
# In the preceding code, the last line, browser.is_element_present_by_css('div.list_text', wait_time=1), accomplishes two things. First, the is_element_present_by_css method searches for elements that match a specific combination of tag (div) and class (list_text). For example, if we used ul.item_list, the method would find <ul class="item_list">.
# Recall that in a CSS selector, a class attribute is written with a period (.) and an id attribute with a hash sign (#). So, a div element that has a class attribute of list_text is written as div.list_text. Likewise, a button element that has an id attribute of more is written as button#more. Remember that an id attribute value must be unique on a webpage.

# Second, the wait_time=1 argument tells the browser to wait up to one second for a matching element to appear. This optional delay is useful because dynamic pages sometimes take a little while to load, especially if they have lots of images.

# The last line of code returns True, indicating that it found a div element with a class attribute of list_text.
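
The selector notation described above can be tried without a browser. The following is a minimal sketch that runs BeautifulSoup against a small HTML fragment; the fragment and the variable names (sample_html, sample_soup, and so on) are made up for illustration and are not taken from the Mars news site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration only.
sample_html = """
<ul class="item_list">
  <li><div class="list_text">First article</div></li>
  <li><div class="list_text">Second article</div></li>
</ul>
<button id="more">More</button>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# tag.class matches on the class attribute; select_one returns the first match.
first_div = sample_soup.select_one("div.list_text")

# tag#id matches on the id attribute.
more_button = sample_soup.select_one("button#more")

# select returns a list of every match instead of only the first.
all_divs = sample_soup.select("div.list_text")

print(first_div.get_text())    # First article
print(more_button.get_text())  # More
print(len(all_divs))           # 2
```

The same selector strings should work with is_element_present_by_css, because both tools accept standard CSS selectors.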



In [5]:
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')


In [6]:
# In the preceding code, we search for a <div /> tag that has a class attribute of list_text, and we assign the result to the slide_elem variable. Because select_one returns only the first match, slide_elem is the parent element of the most recent article: it holds that article's title and summary elements. Later, we'll reference it when we want to filter our search results further.


In [7]:
# We want to assign the title and the summary text to variables that we can reference later. To do so, begin scraping by entering and running the following code in the next cell:

In [8]:
title_elem = slide_elem.find('div', class_='content_title')
print(title_elem)

<div class="content_title">From JPL's Mailroom to Mars and Beyond</div>


In [None]:
# In the preceding code, we chained the find method to our previously assigned slide_elem variable. By doing so, we're saying, "This variable holds tons of information, so search that information to find this specific data." And, the specific data is the content title. We specify that inside the find method by using the 'div', class_='content_title' parameters, which mean, "The specific data exists in a <div /> element that has a class attribute of content_title."

# The output of running the cell is the HTML code that contains the content title along with anything else that’s nested inside that <div /> element.


In [9]:
# Get the title text. We don’t need the extra HTML code

title = title_elem.get_text()
print(title)

From JPL's Mailroom to Mars and Beyond


In [None]:
# In the preceding code, the get_text method of the title_elem object returns only the title of the news article. So, the output consists of the most recently published title from the website. When the website gets updated with a new article, rerunning the code will return the title of that new article.
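
To see the difference between printing an element and calling get_text on it, here is a small sketch on a hypothetical fragment. The markup, the nested <a> tag, and the names parent and title_tag are assumptions for illustration; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the <a> tag is nested to show what get_text leaves out.
sample_html = (
    '<div class="list_text">'
    '<div class="content_title"><a href="#">Sample headline</a></div>'
    '<div class="article_teaser_body">Sample teaser.</div>'
    '</div>'
)

parent = BeautifulSoup(sample_html, "html.parser").select_one("div.list_text")

# find searches only inside parent and returns the first matching element.
title_tag = parent.find("div", class_="content_title")

print(title_tag)             # the whole element, nested <a> tag included
print(title_tag.get_text())  # only the text: Sample headline
```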

# We have the title that we want, and that's a terrific start. Now, we need to get the summary text. To do so, we’ll use a block of code that’s similar to the last one.



In [15]:
# What do we need to change in the following line of code to scrape the article summary instead of the title?
# We need to change the class to 'article_teaser_body'.

news_p = slide_elem.find('div', class_='article_teaser_body').get_text()

In [None]:
# But before we add this code, we need to make sure that we scrape the right tag and class. To do so, in DevTools, click the “Select an element in the page to inspect it” button, and then on the webpage, select the article summary (also known as a teaser). Then in DevTools, check which tag is selected. It should be the following:

In [13]:
# <div class="article_teaser_body">

In [14]:
# For now, we want to extract only the first one—the most recent one. (We want only new news!) And because new articles get added to the beginning of the list, our search leads us to the first article.

# To get the summary text, in the next cell, enter and run the following code:

# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').text
news_p

# The output is just the article summary:



'Bill Allen has thrived as the mechanical systems design lead for three Mars rover missions, but he got his start as a teenager sorting letters for the NASA center.'
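
One caveat with the chained calls above: find and select_one return None when nothing matches, so calling get_text on the result raises an AttributeError if the page layout changes or has not finished loading. The sketch below shows one way to guard against that. The function name extract_latest_news and both sample strings are hypothetical; this is an illustration under those assumptions, not part of the original walkthrough.

```python
from bs4 import BeautifulSoup


def extract_latest_news(html):
    """Return (title, summary) for the first article, or (None, None)."""
    parsed = BeautifulSoup(html, "html.parser")
    try:
        parent = parsed.select_one("div.list_text")
        title = parent.find("div", class_="content_title").get_text()
        summary = parent.find("div", class_="article_teaser_body").get_text()
    except AttributeError:
        # A lookup returned None, so the expected elements are missing.
        return None, None
    return title, summary


# Hypothetical inputs: one with the expected structure, one without.
good_html = (
    '<div class="list_text">'
    '<div class="content_title">Sample headline</div>'
    '<div class="article_teaser_body">Sample teaser.</div>'
    '</div>'
)
bad_html = "<p>No articles here</p>"

print(extract_latest_news(good_html))  # ('Sample headline', 'Sample teaser.')
print(extract_latest_news(bad_html))   # (None, None)
```

In the notebook, the function could be called as extract_latest_news(browser.html) before the browser is closed.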

In [None]:
browser.quit()