# Scraping and Summarizing News Articles

This notebook briefly demonstrates how to scrape a news article and summarize it. It accompanies the blog post found here: {link}

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup
# Note: gensim.summarization was removed in gensim 4.0, so this import needs a gensim 3.x release.
from gensim.summarization import summarize
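
On gensim 4.0 and later, where the `gensim.summarization` module no longer exists, one possible stand-in is a small extractive summarizer written with the standard library. The sketch below is an assumption for illustration, not part of the original notebook: `frequency_summarize` is a hypothetical name, and it ranks sentences by average word frequency, which is a simpler heuristic than the TextRank approach gensim used. Its output will therefore differ from the summary shown later in this notebook.

```python
import re
from collections import Counter


def frequency_summarize(text, ratio=0.3):
    """Return the highest-scoring sentences of `text`, in their original order.

    Hypothetical helper: each sentence is scored by the average
    document-wide frequency of its words. This is a simple heuristic,
    not gensim's TextRank.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return ''

    # Count every lowercase word across the whole text.
    word_counts = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        if not words:
            return 0.0
        return sum(word_counts[w] for w in words) / len(words)

    # Keep about `ratio` of the sentences, and always at least one.
    n_keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:n_keep])  # restore original sentence order
    return '\n'.join(sentences[i] for i in keep)


demo_text = ('The fund pays tuition. The fund pays tuition for students. '
             'Dogs bark loudly. Fees are high.')
print(frequency_summarize(demo_text, ratio=0.5))
```

Like gensim's `summarize`, the function returns the chosen sentences separated by newlines, so it could be called as `frequency_summarize(article, ratio=0.3)` in place of `summarize(article, ratio=0.3)`.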

### Article Choice
https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020

In [2]:
# Retrieve page text
url = 'https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020'
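response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop on an HTTP error instead of parsing an error page
page = response.text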

In [3]:
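# Turn page into BeautifulSoup object to access HTML tags.
# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = BeautifulSoup(page, 'html.parser')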

In [4]:
# Get headline
headline = soup.find('h1').get_text()
print(headline)

University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020


In [5]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['Vanessa Romo',
 'Claire McInerny',
 'From',
 'The University of Texas-Austin announced Tuesday it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less a year.\n                \n                \n                    \n                    Jon Herskovitz/Reuters\n                    \n                \nhide caption',
 "Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the

In [6]:
# Keep only body paragraphs: drop entries that contain a newline (photo captions)
# or that have no period (bylines and stray labels such as 'From').
sentence_list = [sentence for sentence in p_tags_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
sentence_list

["Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltife said after the vote.',
 '"This will benefit students of our gr

In [7]:
# Combine list items into string.
article = ' '.join(sentence_list)
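
The filtering and joining steps above can be sketched as one reusable function. This is a minimal sketch under stated assumptions: `clean_paragraphs` and the `sample` list are hypothetical, and the two rules (no newline, at least one period) are heuristics that fit this NPR page and may need adjusting for other sites.

```python
def clean_paragraphs(paragraphs):
    """Keep paragraphs that look like body text and join them into one string.

    A paragraph is kept when it contains no newline (captions in this
    page's markup span several lines) and at least one period (bylines
    and stray labels such as 'From' have none).
    """
    kept = [
        text for text in paragraphs
        if '\n' not in text and '.' in text
    ]
    return ' '.join(kept)


# Hypothetical input shaped like the p_tags_text list above.
sample = [
    'Vanessa Romo',
    'From',
    'A caption line.\n   hide caption',
    'First body paragraph.',
    'Second body paragraph.',
]
print(clean_paragraphs(sample))
```

With the notebook's own data, `clean_paragraphs(p_tags_text)` would produce the same string as `article`.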

In [8]:
# len() counts characters, not words.
print(f'Length of original article: {len(article)}')

Length of original article: 4672


In [9]:
# ratio is the proportion of the article's sentences to keep in the summary.
summary = summarize(article, ratio=0.3)
print(f'Length of summary: {len(summary)}')

Length of summary: 1859
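
The two lengths printed above are character counts, while the gensim 3.x documentation describes `ratio` as a proportion of sentences. The summary's share of the article's characters therefore need not equal 0.3. A quick check using the two printed values (the variable names are illustrative):

```python
# Character counts printed by the cells above.
article_chars = 4672
summary_chars = 1859

# Share of the article's characters that survive in the summary.
char_ratio = summary_chars / article_chars
print(f'Summary is {char_ratio:.1%} of the article by character count')
```

This comes to roughly 40%, which suggests the selected sentences are longer than the article's average sentence.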


In [10]:
print(f'Headline: {headline} \n')
print(f'Article Summary:\n{summary}')

Headline: University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020 

Article Summary:
Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt.
To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.
The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.
"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman