# Part 2 - Web Scraping

**This is my first tentative project to scrape information from the internet.**  
Here, I want to scrape the "TOP STORIES" in the left sidebar of the USNEWS main website. My targets include:  
- news title 
- news link 
- the first 3 sentences of one of the stories

**My takeaways from this project are:**  
- Learn to use the requests and BeautifulSoup packages to connect to a site, get its content, and select HTML elements to locate the exact information I want
- Learn to use the NLTK package for NLP to split a large paragraph into whole sentences.

In [2]:
# importing the necessary packages for webscraping
import requests
from bs4 import BeautifulSoup
import numpy as np

In [3]:
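# import the NLP packages needed to split text into sentences
# (nltk.sent_tokenize relies on the Punkt models; if they are missing, run
#  nltk.download('punkt') once; newer NLTK releases may ask for 'punkt_tab' instead)
import nltk 
import nltk.data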

In [9]:
###### Navigate to the USNEWS URL and parse the website 
# define the URL of the target website
url = 'http://www.usnews.com'
# send a browser-like User-Agent so the request looks like it comes from a real visitor
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
# open a new session
session = requests.Session()
# request the page through the session
response_usnews = session.get(url, headers=headers)
# parse the returned HTML
soup = BeautifulSoup(response_usnews.text, 'html.parser')
#coverpage = response_usnews.content

In [10]:
# Find all the "h3" elements; the current top stories are among them
coverpage_news = soup.find_all('h3'
                               #, class_='story-headline sc-gipzik ktkHzo'
                              )

In [11]:
# The first and second current top stories are the fifth and sixth elements of the list (indices 4 and 5)
print("Top Stories:"
      +"\n 1."+coverpage_news[4].get_text()
      +"\n 2."+coverpage_news[5].get_text()
     )

Top Stories:
 1.West Treads Lightly With Iran After Jet Crash
 2.U.K. Poised for Jan. 31 Exit from EU
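
The indices 4 and 5 depend on how the page was laid out on the day of the scrape. A sketch of a less position-dependent approach, assuming the headlines of interest are the `h3` elements that contain a link. The HTML snippet and the `example.com` URLs are made up for illustration and are not taken from the USNEWS page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (an assumption for this sketch)
html = """
<div>
  <h3>Section label without a link</h3>
  <h3><a href="https://example.com/story-1">First story</a></h3>
  <h3><a href="https://example.com/story-2">Second story</a></h3>
</div>
"""
soup_example = BeautifulSoup(html, "html.parser")

# Keep only the h3 elements that wrap a link, as (title, url) pairs
stories = []
for heading in soup_example.find_all("h3"):
    link = heading.find("a")
    if link is None or not link.get("href"):
        continue  # skip headings that are not clickable headlines
    stories.append((heading.get_text(strip=True), link["href"]))

for number, (title, href) in enumerate(stories, start=1):
    print(f"{number}. {title} -> {href}")
```

On the real page the list would likely still need narrowing to the "TOP STORIES" block, for example by first selecting the sidebar container.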


In [12]:
# Get the url element of the first top story
url1 = coverpage_news[4].find_all('a')[0]
# Get the url element of the second top story
url2 = coverpage_news[5].find_all('a')[0]
# Get the url of the first top story
first_story_url = url1.get('href')
# Get the url of the second top story
second_story_url = url2.get('href')
# print the url of the first top story
print(first_story_url)
# print the url of the second top story
print(second_story_url)

https://www.usnews.com/news/world-report/articles/2020-01-09/west-backs-off-of-blaming-iran-despite-evidence-in-ukraine-jet-crash
https://www.usnews.com/news/national-news/articles/2020-01-09/brexit-bill-clears-key-hurdle-uk-poised-for-jan-31-exit-from-european-union
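
The `href` values printed above are absolute URLs. If a page returned site-relative links instead (an assumption, since it did not happen here), `urllib.parse.urljoin` from the standard library could resolve them against the site address. A small sketch with a hypothetical relative path:

```python
from urllib.parse import urljoin

base_url = 'https://www.usnews.com'

# Hypothetical relative href, used only for illustration
relative_href = '/news/world-report/articles/example-story'
# An absolute href, like the ones printed above
absolute_href = 'https://www.usnews.com/news/national-news'

# A relative href is resolved against the base URL
resolved_relative = urljoin(base_url, relative_href)
# An absolute href is returned unchanged
resolved_absolute = urljoin(base_url, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```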


In [13]:
# Navigate to the second story URL
# define the URL of the target website
url2 = second_story_url
# send a browser-like User-Agent so the request looks like it comes from a real visitor
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
}

# open a new session
session = requests.Session()
# request the page through the session
response_second_story = session.get(url2, headers=headers)
# parse the returned HTML
soup2 = BeautifulSoup(response_second_story.text, 'html.parser')
#coverpage = response_second_story.content
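
The homepage request and the story request repeat the same session, header, and parsing steps. A minimal sketch of a reusable helper follows. The name `fetch_soup`, the 10-second `timeout`, and the `raise_for_status()` check are assumptions added for illustration and are not part of the original notebook:

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent, copied from the cell above
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
}


def fetch_soup(url, session=None, timeout=10):
    """Download `url` and return the parsed page (hypothetical helper)."""
    # Reuse the caller's session if one is given, otherwise open a new one
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Stop early on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Example usage (performs a real network request, so it is left commented out):
# soup2 = fetch_soup(second_story_url)
```

Passing one shared session to both calls would also let the second request reuse cookies and connections from the first.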

In [14]:
# Get the title of the article
print(soup2.find('h1').get_text())

Brexit Bill Clears Key Hurdle, U.K. Poised for Jan. 31 Exit from European Union


In [15]:
# Get the first three sentences of the article. To do this, we use the NLP package "nltk" to split the text into sentences, then keep the first three
## First, collect the blocks that hold the article body.
## Note: this class name looks auto-generated and may change when the site is rebuilt.
body = soup2.find_all('div', class_='Raw-s14xcvr1-0 AXWJq')

## Create the news_contents string that accumulates the paragraphs
news_contents = ""

## Loop over the blocks and append the text of each block's first paragraph
for block in body:
    paragraph = block.find('p')
    # skip blocks that contain no <p> element
    if paragraph is not None:
        news_contents = news_contents + paragraph.get_text() + " "

news_contents

'The United Kingdom is poised to leave the European Union on Jan. 31 after British lawmakers Thursday gave final approval to Prime Minister Boris Johnson\'s Brexit bill.  The measure passed the House of Commons easily and without fanfare in an anticlimactic end to a chaotic, yearslong saga that began after U.K. citizens narrowly voted to leave the EU in 2016.  The withdrawal bill\'s passage was expected: Lawmakers gave it initial approval in December after a general election that Johnson\'s Conservative Party won decisively, picking up 66 seats in Parliament. Though the measure now heads to the House of Lords, Thursday\'s vote essentially guarantees that Brexit will happen at the end of the month.  Parliament refused three times to pass former Prime Minister Theresa May\'s Brexit bill, ultimately leading to her resignation. Lawmakers also declined to pass Johnson\'s Brexit bill this fall, prompting him to call for an election.  Johnson\'s Brexit bill is similar to May\'s but replaces a
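
Splitting this text on periods alone would not give clean sentences, because abbreviations such as "Jan." and "U.K." also end in a period. The sketch below uses a deliberately naive regular-expression rule (not NLTK) on the article's first sentence to show the problem that the next step avoids by using NLTK's sentence tokenizer:

```python
import re

first_sentence = (
    "The United Kingdom is poised to leave the European Union on Jan. 31 "
    "after British lawmakers Thursday gave final approval to Prime Minister "
    "Boris Johnson's Brexit bill."
)

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace
naive_pieces = re.split(r'(?<=[.!?])\s+', first_sentence)

# The single sentence is wrongly cut in two after the abbreviation "Jan."
for piece in naive_pieces:
    print(repr(piece))
```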

In [16]:
## Then split the text into sentences with nltk.sent_tokenize and keep the first three.
nltk.sent_tokenize(news_contents)[0:3]

["The United Kingdom is poised to leave the European Union on Jan. 31 after British lawmakers Thursday gave final approval to Prime Minister Boris Johnson's Brexit bill.",
 'The measure passed the House of Commons easily and without fanfare in an anticlimactic end to a chaotic, yearslong saga that began after U.K. citizens narrowly voted to leave the EU in 2016.',
 "The withdrawal bill's passage was expected: Lawmakers gave it initial approval in December after a general election that Johnson's Conservative Party won decisively, picking up 66 seats in Parliament."]