# **Write a program to scrape news from the https://www.space.com/news web page.**

> You should submit a notebook that contains a function that can be run that retrieves the space.com news page, extracts the news stories, and prints out the headline, author, synopsis, and date and time for each story.

> Hint: Use Chrome Developer tools to examine the structure of the page to find the HTML for the articles, then use requests and BeautifulSoup to explore extracting the data that you need.

Note that some of the items in the news list are sponsored ads and even though they look like news stories they are structurally different in the code. You can ignore those and just scrape the news itself.

*Optionally*, **if you wish to challenge yourself**, have your spider follow the next link at the bottom of the page and scrape even more news stories.

---



## Fetch the HTML code

In [2]:
import requests
url = 'https://www.space.com/news'
response = requests.get(url)

# make sure we got a valid response
if response.ok:
  # get the full data from the response
  data = response.text
  print(data[:100])

<!DOCTYPE html>
<html lang="en" dir="ltr" data-locale="US" class="">
<head>
<script>
window.vanilla 
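The cell above silently does nothing when the request fails. A minimal sketch of a more defensive fetch follows; `fetch_html`, its `get` parameter, and the 10-second timeout are illustrative assumptions, not part of the assignment. The `get` parameter only exists so the function can be tried without touching the network.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, or None if the request fails.

    `get` defaults to requests.get; it is injectable (an assumption made
    for this sketch) so the function can be exercised offline.
    """
    try:
        response = get(url.strip(), timeout=timeout)
        response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    except requests.RequestException:
        return None
    return response.text

# Offline demonstration with a stand-in response object (hypothetical).
class FakeResponse:
    text = "<!DOCTYPE html><html></html>"
    def raise_for_status(self):
        pass

def fake_get(url, timeout):
    return FakeResponse()

def failing_get(url, timeout):
    raise requests.ConnectionError("simulated network failure")

print(fetch_html("https://www.space.com/news", get=fake_get)[:15])
print(fetch_html("https://www.space.com/news", get=failing_get))
```

In the notebook itself one would call `fetch_html(url)` with the default `get` and check for `None` before parsing.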


## Most common HTML tag names

|Index|Tag name|Description|
|:-:|---|---|
|1|html|Root element; marks the document as HTML|
|2|h1-h6|Heading|
|3|p|Paragraph|
|4|i or em|Italic / Emphasis|
|5|b or strong|Bold / Strong|
|6|a|Anchor, Link|
|7|ul & li|Unordered List & List Item|
|8|ol|Ordered List|
|9|blockquote|Blockquote|
|10|hr|Horizontal Rule|
|11|img|Image|
|12|div|Division|
|13|span|Inline container, used to style part of a text|
|14|body|Main content of the HTML document|
|15|title|Title of the HTML document|

## Parsing the HTML code then navigating the Document

In [3]:
from bs4 import BeautifulSoup

In [4]:
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify()[:100])

<!DOCTYPE html>
<html class="" data-locale="US" dir="ltr" lang="en">
 <head>
  <script>
   window.va
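Before working on the real page, it may help to see the basic navigation calls on something small. The sketch below uses a tiny made-up page (not space.com markup) so the result is repeatable; only `BeautifulSoup`, `find`, `find_all`, and `get_text` are real library calls, everything else is illustrative.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used instead of the live site so the example is repeatable.
SAMPLE_PAGE = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Latest news</h1>
    <div class="content"><p class="synopsis">First story.</p></div>
    <div class="sidebar"><p>Not a story.</p></div>
    <div class="content"><p class="synopsis">Second story.</p></div>
  </body>
</html>
"""

demo = BeautifulSoup(SAMPLE_PAGE, "html.parser")

page_title = demo.title.get_text()                 # text inside <title>
first_heading = demo.find("h1").get_text()         # find() returns the first match (or None)
stories = demo.find_all("div", class_="content")   # find_all() returns every match
synopses = [d.find("p", class_="synopsis").get_text() for d in stories]

print(page_title)
print(first_heading)
print(synopses)
```

Filtering on `class_="content"` is what keeps the sidebar `div` out of the result, which is the same idea used on the news page.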


In [5]:
# this gets a list of all divs with class "content"; each one holds one article
divs = soup.find_all('div',class_='content')
print('The number of articles is {}'.format(len(divs)),'\n')
divs[:2] # print first two articles

The number of articles is 20 



[<div class="content">
 <header>
 <h3 class="article-name">The 'ring of fire' solar eclipse of 2020 occurs Sunday. Here's how to watch online.</h3>
 <p class="byline">
 <span class="by-author">
 By
 <span style="white-space:nowrap">
 Elizabeth Howell </span>
 </span>
 <time class="published-date relative-date" data-published-date="2020-06-19T14:10:05Z" datetime="2020-06-19T14:10:05Z"></time>
 </p>
 </header>
 <p class="synopsis">
 A "ring of fire" solar eclipse will briefly appear in parts of Africa and Asia this weekend, and if you aren't out there in person, you can take in the spectacular show online.
 </p>
 </div>, <div class="content">
 <header>
 <h3 class="article-name">Pictures from space! Our image of the day</h3>
 <p class="byline">
 <span class="by-author">
 By
 <span style="white-space:nowrap">
 Space.com Staff </span>
 </span>
 <time class="published-date relative-date" data-published-date="2020-06-19T13:36:47Z" datetime="2020-06-19T13:36:47Z"></time>
 </p>
 </header>
 <p c

To print the headline, author, synopsis, and date and time of each story, we first need to identify the tags that contain them:
1. **Headline**: `h3`
2. **Author**: `span` with class `by-author`; its text also contains the word "By" and extra whitespace, which we need to remove
3. **Synopsis**: `p` with class `synopsis`
4. **Date and time**: `time`, read from its ***data-published-date*** attribute
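The mapping above can be sketched as a single helper that also tolerates items missing one of the expected tags, which is one plausible way to skip sponsored entries. `extract_fields` is a hypothetical name. The first sample `div` is taken from the notebook's own output, while the second is made-up markup standing in for a sponsored item, since the real structure of those is not shown here.

```python
from bs4 import BeautifulSoup

# First div: taken from the notebook's own output.
# Second div: a made-up stand-in for a sponsored item.
SAMPLE = """
<div class="content">
 <header>
 <h3 class="article-name">Pictures from space! Our image of the day</h3>
 <p class="byline">
 <span class="by-author">
 By
 <span style="white-space:nowrap">
 Space.com Staff </span>
 </span>
 <time class="published-date relative-date" data-published-date="2020-06-19T13:36:47Z" datetime="2020-06-19T13:36:47Z"></time>
 </p>
 </header>
 <p class="synopsis">
 The Butterfly Nebula, also known as NGC 6302, is depicted here in a brilliant image taken by the NASA/ESA Hubble Space Telescope.
 </p>
</div>
<div class="content"><h3>Sponsored item (hypothetical markup)</h3></div>
"""

def extract_fields(div):
    """Return the four fields as a dict, or None if an expected tag is missing."""
    headline = div.find("h3")
    author = div.find("span", class_="by-author")
    synopsis = div.find("p", class_="synopsis")
    time_tag = div.find("time")
    if any(tag is None for tag in (headline, author, synopsis, time_tag)):
        return None
    # get_text(" ", strip=True) collapses the byline to "By <name>"
    author_text = author.get_text(" ", strip=True)
    if author_text.startswith("By "):
        author_text = author_text[len("By "):]
    return {
        "headline": headline.get_text(strip=True),
        "author": author_text,
        "synopsis": synopsis.get_text(strip=True),
        "published": time_tag.get("data-published-date"),
    }

sample_divs = BeautifulSoup(SAMPLE, "html.parser").find_all("div", class_="content")
results = [extract_fields(d) for d in sample_divs]
print(results[0]["author"])
print(results[1])
```

Returning `None` rather than raising lets the caller filter out incomplete items with a simple `if` check.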

In [6]:
# get the details of the article at index 1 (the second one, since list indexes start at 0)
i = 1
print('Information of the article #{} is:\n'.format(i),
      divs[i].find('h3').get_text(),'\n',
      divs[i].find('span',class_='by-author').get_text().split('\n\n')[1].strip(), '\n',
      divs[i].find('p',class_='synopsis').get_text().strip(),'\n',
      divs[i].find('time')['data-published-date'] # the trailing Z means UTC
      )

Information of the article #1 is:
 Pictures from space! Our image of the day 
 Space.com Staff 
 The Butterfly Nebula, also known as NGC 6302, is depicted here in a brilliant image taken by the NASA/ESA Hubble Space Telescope. 
 2020-06-19T13:36:47Z
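The timestamp is an ISO 8601 string whose trailing `Z` denotes UTC. If a real date object is wanted instead of a string, a small sketch using only the standard library could look like this (`parse_published` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def parse_published(stamp):
    """Parse a timestamp such as '2020-06-19T13:36:47Z' into an aware UTC datetime."""
    naive = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)

published = parse_published("2020-06-19T13:36:47Z")
print(published.isoformat())                      # 2020-06-19T13:36:47+00:00
print(published.strftime("%Y-%m-%d %H:%M UTC"))   # 2020-06-19 13:36 UTC
```

`strptime` is used rather than `datetime.fromisoformat` because older Python versions do not accept the `Z` suffix in `fromisoformat`.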


### Create a class for article

In [7]:
class article(object):
    # store the four fields of a story as attributes
    def __init__(self, headline, author, synopsis, datetime):
        self.headline = headline
        self.author = author
        self.synopsis = synopsis
        self.datetime = datetime

    # print the fields, one per line
    def print(self):
        print('\n Headline:',self.headline,'\n Author:',
              self.author,'\n Synopsis:',
              self.synopsis,'\n Datetime:',
              self.datetime,'\n')
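As an alternative design, the same record could be written as a dataclass, which generates `__init__` and equality for free. This is only a sketch: `Story` and its `published` field are hypothetical names, and the rest of the notebook keeps using the `article` class above.

```python
from dataclasses import dataclass

@dataclass
class Story:
    headline: str
    author: str
    synopsis: str
    published: str   # kept as the raw timestamp string from the page

    def __str__(self):
        # One field per line, mirroring the labels used by article.print()
        return (f"Headline: {self.headline}\n"
                f"Author: {self.author}\n"
                f"Synopsis: {self.synopsis}\n"
                f"Datetime: {self.published}")

story = Story(
    headline="Pictures from space! Our image of the day",
    author="Space.com Staff",
    synopsis="The Butterfly Nebula, also known as NGC 6302, is depicted here "
             "in a brilliant image taken by the NASA/ESA Hubble Space Telescope.",
    published="2020-06-19T13:36:47Z",
)
print(story)
```

Defining `__str__` instead of a custom `print` method means the built-in `print(story)` works directly.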

### Print all stories

In [8]:
# a function that fetches all the desired information from a div and stores it in an article object
def extract_div(div):
  arti=article(div.find('h3').get_text(),
               div.find('span',class_='by-author').get_text().split('\n\n')[1].strip(),
               div.find('p',class_='synopsis').get_text().strip(),
               div.find('time')['data-published-date'])
  return arti

In [9]:
for i, div in enumerate(divs, start=1):
  print('Information of the article #{} is:'.format(i))
  extract_div(div).print()

Information of the article #1 is:

 Headline: The 'ring of fire' solar eclipse of 2020 occurs Sunday. Here's how to watch online. 
 Author: Elizabeth Howell 
 Synopsis: A "ring of fire" solar eclipse will briefly appear in parts of Africa and Asia this weekend, and if you aren't out there in person, you can take in the spectacular show online. 
 Datetime: 2020-06-19T14:10:05Z 

Information of the article #2 is:

 Headline: Pictures from space! Our image of the day 
 Author: Space.com Staff 
 Synopsis: The Butterfly Nebula, also known as NGC 6302, is depicted here in a brilliant image taken by the NASA/ESA Hubble Space Telescope. 
 Datetime: 2020-06-19T13:36:47Z 

Information of the article #3 is:

 Headline: On This Day in Space! June 19, 1963: Vostok 5 & Vostok 6 return to Earth 
 Author: Hanneke Weitering 
 Synopsis: On June 19, 1963, two Soviet spacecraft named Vostok 5 and Vostok 6 returned to Earth, ending a historic joint mission. See how it happened in our On This Day in Space v

## Follow the next link at the bottom of the page and scrape even more news stories.

Let's look at the HTML code for the next page button.
```
<li class="pagination-numerical-list-item">
<a href="https://www.space.com/news/2" class="pagination-numerical-list-item-link" data-component-tracked="2">
2
</a>
</li>
```

In [10]:
soup.find_all('a',class_='pagination-numerical-list-item-link')

[<a class="pagination-numerical-list-item-link" href="https://www.space.com/news/2">
 2
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/3">
 3
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/4">
 4
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/5">
 5
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/6">
 6
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/7">
 7
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/8">
 8
 </a>,
 <a class="pagination-numerical-list-item-link" href="https://www.space.com/news/9">
 9
 </a>]

The pagination links point to pages 2 through 9, so there are 9 pages in total. The page number is part of the URL path itself, so no query parameters are needed.
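Since the page number is just the last path segment, the list of page URLs can be derived from the pagination `href` values. The sketch below works on plain strings. `last_page_number` and `page_urls` are hypothetical helpers, and using the bare `/news` URL for page 1 is an assumption based on the first request in this notebook.

```python
import re

BASE_URL = "https://www.space.com/news"

def last_page_number(hrefs):
    """Return the highest page number among hrefs ending in '/news/<n>'."""
    numbers = []
    for href in hrefs:
        match = re.search(r"/news/(\d+)$", href)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=1)   # no pagination links means a single page

def page_urls(last_page):
    """Build the URL of every page from 1 to last_page."""
    return [BASE_URL if n == 1 else f"{BASE_URL}/{n}"
            for n in range(1, last_page + 1)]

# hrefs as they appear in the pagination links shown above
hrefs = [f"https://www.space.com/news/{n}" for n in range(2, 10)]
last_page = last_page_number(hrefs)
urls = page_urls(last_page)
print(last_page)   # 9
print(urls[0])     # https://www.space.com/news
print(urls[-1])    # https://www.space.com/news/9
```

In the notebook, `hrefs` would come from `[a['href'] for a in soup.find_all('a', class_='pagination-numerical-list-item-link')]`.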

## Print first N stories of each page

In [11]:
from time import sleep
from random import randint

In [13]:
MAX_PAGE=5
MAX_STORY=2
for i in range(1,MAX_PAGE+1):
  url = 'https://www.space.com/news/'+str(i)
  #print(url)
  response = requests.get(url)
  # only parse when the request succeeded, so a failed page never reuses old data
  if response.ok:
    # get the full data from the response
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    divs = soup.find_all('div',class_='content')
    # print at most MAX_STORY stories from this page
    for j, div in enumerate(divs[:MAX_STORY], start=1):
      print('Information of the article #{} in page {} is:'.format(j,i))
      extract_div(div).print()
  sleep(randint(1,3)) # take a break before next query

Information of the article #1 in page 1 is:

 Headline: The 'ring of fire' solar eclipse of 2020 occurs Sunday. Here's how to watch online. 
 Author: Elizabeth Howell 
 Synopsis: A "ring of fire" solar eclipse will briefly appear in parts of Africa and Asia this weekend, and if you aren't out there in person, you can take in the spectacular show online. 
 Datetime: 2020-06-19T14:10:05Z 

Information of the article #2 in page 1 is:

 Headline: Pictures from space! Our image of the day 
 Author: Space.com Staff 
 Synopsis: The Butterfly Nebula, also known as NGC 6302, is depicted here in a brilliant image taken by the NASA/ESA Hubble Space Telescope. 
 Datetime: 2020-06-19T13:36:47Z 

Information of the article #1 in page 2 is:

 Headline: Watch Venus play 'peekaboo' with the crescent moon Friday morning 
 Author: Joe Rao 
 Synopsis: If you live in the northeast U.S. or Canada, mark Friday, June 19, on your calendar. That morning the moon will rise with the brilliant planet Venus hidden 
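
The assignment asks for a single function that can be run. One way to package the paging loop is sketched below, with the fetching and parsing steps passed in as callables so the loop can be tried without network access. `scrape_pages` and its parameters are hypothetical names, and the demonstration uses fake pages rather than real space.com data.

```python
import time
from itertools import islice

def scrape_pages(urls, fetch, parse, max_stories=2, delay=0.0):
    """Yield (page_number, story) pairs for the first max_stories stories of each page.

    fetch(url) should return the page HTML, or None on failure.
    parse(html) should return an iterable of stories.
    """
    for page_number, url in enumerate(urls, start=1):
        html = fetch(url)
        if html is not None:          # skip failed pages instead of reusing old data
            for story in islice(parse(html), max_stories):
                yield page_number, story
        time.sleep(delay)             # pause between requests

# Offline demonstration: fake "pages" whose stories are comma-separated letters.
fake_pages = {"page-1": "a,b,c", "page-2": None, "page-3": "d"}
found = list(scrape_pages(fake_pages, fake_pages.get, lambda html: html.split(",")))
print(found)   # [(1, 'a'), (1, 'b'), (3, 'd')]
```

In the notebook, `fetch` could wrap `requests.get`, `parse` could return `extract_div(div)` for each `div` found by `soup.find_all('div', class_='content')`, and `delay` could be a few seconds.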