## The Easy Way to Web Scrape Articles Online

In this tutorial we will scrape news articles from several news outlets with a short Python script, using the package "Newspaper3k".

### The Basics

In [1]:
import newspaper
from newspaper import Article
import nltk
import warnings
warnings.filterwarnings("ignore")

In [2]:
# The basics of downloading the article into memory
article = Article("https://www.washingtonpost.com/world/2022/08/09/russia-ukraine-war-latest-updates/?itid=sf_world_article_list")
#article = Article("https://www.gamespot.com//news/")
article.download()
article.parse()
nltk.download('punkt')
article.nlp()

[nltk_data] Downloading package punkt to /home/johnfrost/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
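
Downloads can fail (network errors, blocked requests, removed pages), and a failure in `download()` or `parse()` would otherwise stop the script. Below is a minimal sketch of a wrapper that skips such articles. The helper name `extract_article` and the `FakeArticle` stand-in are hypothetical and only there so the example runs without network access. Catching the broad `Exception` is an assumption, since the exact exception type is not shown in this tutorial.

```python
def extract_article(article):
    """Download and parse one article; return a dict, or None if anything fails.

    `article` is expected to behave like newspaper's Article object:
    it needs download(), parse(), and the title/text/authors attributes.
    """
    try:
        article.download()
        article.parse()
    except Exception as error:  # assumption: broad catch, narrow it if you know the type
        print(f"Skipping article: {error}")
        return None
    return {"title": article.title, "text": article.text, "authors": article.authors}


class FakeArticle:
    """Hypothetical stand-in for an Article, used only to exercise the helper."""

    def __init__(self, fail=False):
        self.fail = fail
        self.title = ""
        self.text = ""
        self.authors = []

    def download(self):
        if self.fail:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "Example title"
        self.text = "Example text"
        self.authors = ["Example Author"]


good = extract_article(FakeArticle())
bad = extract_article(FakeArticle(fail=True))
print(good)
print(bad)
```

With a real `Article`, the call would simply be `extract_article(Article(url))`.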


In [3]:
# To print out the full text
print(article.text)

War in Ukraine: What you need to know

The latest: The United Nations has expressed hope that the first grain shipments from blockaded Ukrainian ports could start Friday. However, the exact coordinates needed to ensure a safe passage for ships were still being negotiated on Thursday, U.N. aid chief Martin Griffiths said.

The fight: Russia’s recent operational pause, which analysts identified in recent weeks as an effort to regroup troops before doubling down on Ukraine’s south and east, appears to be ending. Russia appears set to resume ground offensives, with Defense Minister Sergei Shoigu telling troops on Saturday to intensify attacks “in all operational sectors” of Ukraine.

The weapons: Ukraine is making use of weapons such as Javelin antitank missiles and Switchblade “kamikaze” drones, provided by the United States and other allies. Russia has used an array of weapons against Ukraine, some of which have drawn the attention and concern of analysts.

Photos: Post photographers hav

In [4]:
# To print out a summary of the text
# This works because newspaper3k has built-in NLP tools (article.nlp() must be called first)
print(article.summary)

War in Ukraine: What you need to knowThe latest: The United Nations has expressed hope that the first grain shipments from blockaded Ukrainian ports could start Friday.
The fight: Russia’s recent operational pause, which analysts identified in recent weeks as an effort to regroup troops before doubling down on Ukraine’s south and east, appears to be ending.
Russia appears set to resume ground offensives, with Defense Minister Sergei Shoigu telling troops on Saturday to intensify attacks “in all operational sectors” of Ukraine.
The weapons: Ukraine is making use of weapons such as Javelin antitank missiles and Switchblade “kamikaze” drones, provided by the United States and other allies.
Russia has used an array of weapons against Ukraine, some of which have drawn the attention and concern of analysts.


In [5]:
# To print out the list of authors
print(article.authors)

['Jennifer Hassan', 'Sean Fanning']


In [6]:
# To print out the list of keywords
print(article.keywords)

['weapons', 'russians', 'killed', 'izyum', 'wounded', 'ukraine', 'troops', 'russia', 'briefing', 'operational', 'war', 'ukrainian', 'live', 'united', 'help', 'recent']


In [7]:
# Other attributes expose further useful metadata about the article
article.title # Gives the title

'Ukraine Live Briefing: Ukrainian troops advance on Izyum; 80,000 Russians may have been killed, wounded in war'

In [8]:
article.publish_date # gives the date the article was published

datetime.datetime(2022, 8, 9, 0, 0)
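
Since `publish_date` is a `datetime` object, it can be formatted before being stored. The sketch below turns it into an ISO date string. It assumes the value may be `None` when no date could be extracted, and the helper name `format_publish_date` is hypothetical.

```python
import datetime


def format_publish_date(publish_date):
    """Return the date as 'YYYY-MM-DD', or an empty string if there is no date."""
    if publish_date is None:  # assumption: a missing date is represented as None
        return ""
    return publish_date.date().isoformat()


# Same value as the output shown above
formatted = format_publish_date(datetime.datetime(2022, 8, 9, 0, 0))
print(formatted)  # 2022-08-09
print(repr(format_publish_date(None)))  # ''
```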

In [9]:
article.top_image # gives the link to the main image of the article 

'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/FNIJBJQXFEI63OMYWKVWR5MENA.jpg&w=1440'

In [10]:
article.images # provides a set of image links

{'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/FNIJBJQXFEI63OMYWKVWR5MENA.jpg&w=1440'}

### Advanced: Downloading multiple articles from one news site

If we want to scrape more than one news article, store the data in a pandas DataFrame, and then write it to a CSV file, **it's quite simple with this package**.

In [11]:
import newspaper
from newspaper import Article
from newspaper import Source
import pandas as pd

In [13]:
# Let's say we want to download articles from The Washington Post (a US newspaper)
washington = newspaper.build("https://www.washingtonpost.com/world/", memoize_articles = False)
# memoize_articles=False turns off newspaper's cache of already-seen articles,
# so every run returns all articles rather than only the new ones
frames = []  # one small dataframe per article
counter = 0

for each_article in washington.articles:
    each_article.download()
    each_article.parse()
    each_article.nlp()

    temp_df = pd.DataFrame(columns= ['Title', 'Authors', 'Text', 'Summary', 'Published_date', 'Source'])
    # Assigning the list of authors first creates one row per author;
    # the scalar values below are then repeated on every row
    temp_df['Authors'] = (each_article.authors)
    temp_df['Title'] = each_article.title
    temp_df['Text'] = each_article.text
    temp_df['Summary'] = each_article.summary
    temp_df['Published_date'] = each_article.publish_date
    temp_df['Source'] = each_article.source_url

    frames.append(temp_df)
    counter += 1
    if counter > 15:
        break

# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
final_df = pd.concat(frames, ignore_index=True)
final_df.head()

Unnamed: 0,Title,Authors,Text,Summary,Published_date,Source
0,"FBI searches Trump safe at Mar-a-Lago Club, fo...",Devlin Barrett,Gift Article Share\n\nFormer president Donald ...,In a lengthy statement in which he equated the...,2022-08-08,https://www.washingtonpost.com
1,"FBI searches Trump safe at Mar-a-Lago Club, fo...",Mariana Alfaro,Gift Article Share\n\nFormer president Donald ...,In a lengthy statement in which he equated the...,2022-08-08,https://www.washingtonpost.com
2,"FBI searches Trump safe at Mar-a-Lago Club, fo...",Josh Dawsey,Gift Article Share\n\nFormer president Donald ...,In a lengthy statement in which he equated the...,2022-08-08,https://www.washingtonpost.com
3,"FBI searches Trump safe at Mar-a-Lago Club, fo...",Jacqueline Alemany,Gift Article Share\n\nFormer president Donald ...,In a lengthy statement in which he equated the...,2022-08-08,https://www.washingtonpost.com
4,Trump investigation live updates Trump claims ...,John Wagner,The activity Monday at Mar-a-Lago appears rela...,The activity Monday at Mar-a-Lago appears rela...,2022-08-09,https://www.washingtonpost.com
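
The output above has one row per author, so an article with four authors appears four times. If one row per article is preferred, the authors can be joined into a single string. The sketch below shows the idea on a stand-in object. The helper name `article_to_row` and the sample values (apart from the author names taken from the output above) are hypothetical.

```python
from types import SimpleNamespace


def article_to_row(article):
    """Build one flat row (a dict) per article, with all authors in a single cell."""
    return {
        "Title": article.title,
        "Authors": ", ".join(article.authors),
        "Text": article.text,
        "Summary": article.summary,
        "Published_date": article.publish_date,
        "Source": article.source_url,
    }


# Stand-in with the attributes used in the loop above (hypothetical values)
sample = SimpleNamespace(
    title="Example title",
    authors=["Devlin Barrett", "Mariana Alfaro", "Josh Dawsey", "Jacqueline Alemany"],
    text="Example text",
    summary="Example summary",
    publish_date=None,
    source_url="https://www.washingtonpost.com",
)

rows = [article_to_row(sample)]
print(rows[0]["Authors"])
```

A list of such rows can be passed directly to `pd.DataFrame(rows)` to get the dataframe.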


### Multi Threaded Web Scraping

The above solution is slow because it downloads the articles one after another. With many news sources, this can be time-consuming.
We can speed the process up by downloading with multiple threads.
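
To see why threads help with downloads, here is a small illustration using only the standard library. It is not how newspaper implements its own pool. `fake_download` is a hypothetical stand-in that sleeps to simulate waiting on the network, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Simulate a slow network request (hypothetical stand-in for a real download)."""
    time.sleep(0.1)
    return f"content of {url}"


urls = [f"https://example.com/article-{i}" for i in range(8)]

# One after another: total time is roughly 8 x 0.1 seconds
start = time.perf_counter()
sequential = [fake_download(url) for url in urls]
sequential_seconds = time.perf_counter() - start

# Four worker threads: the waits overlap, so the total time drops
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(fake_download, urls))  # map keeps the input order
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

The speed-up comes from overlapping the time spent waiting on the network, which is why it suits downloading but does little for CPU-heavy steps such as parsing.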

In [14]:
import newspaper
from newspaper import news_pool

# The news sources we would like to scrape
washington = newspaper.build("https://www.washingtonpost.com/world/", memoize_articles = False)
bbc = newspaper.build("https://www.bbc.com/news", memoize_articles = False)

# Place the sources in a list
papers = [washington, bbc]

# We will download 2 articles in parallel per source.
# Since we have two sources, 4 articles will be downloaded at any one time, which speeds up the process considerably.
# Once downloaded, they are kept in memory and used in the for loop below to extract the bits of data we want.

In [15]:
news_pool.set(papers, threads_per_source = 2)

# join() blocks until all downloads have finished
news_pool.join()

# Collect one dataframe per source; they are combined at the end
frames = []

# Maximum number of articles to keep per source
limit = 50

for source in papers:
    # temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source = []
    count = 0
    for article_extract in source.articles:
        # Stop once the limit is reached (checked before parsing to avoid wasted work)
        if count >= limit:
            break
        article_extract.parse()
        # Appending the elements we want to extract
        list_text.append(article_extract.text)
        list_title.append(article_extract.title)
        list_source.append(article_extract.source_url)
        # update count
        count += 1
    temp_df = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
final_df = pd.concat(frames, ignore_index=True)

# Exporting to csv file
final_df.to_csv('my_scraped_articles.csv')

In [16]:
final_df.head()

Unnamed: 0,Title,Text,Source
0,"FBI searches Trump safe at Mar-a-Lago Club, fo...",Gift Article Share\n\nFormer president Donald ...,https://www.washingtonpost.com
1,Trump investigation live updates Trump claims ...,The activity Monday at Mar-a-Lago appears rela...,https://www.washingtonpost.com
2,Trump investigation live updates Trump claims ...,The activity Monday at Mar-a-Lago appears rela...,https://www.washingtonpost.com
3,Trump investigation live updates Trump claims ...,The activity Monday at Mar-a-Lago appears rela...,https://www.washingtonpost.com
4,Top Republicans echo Trump’s evidence-free cla...,Listen 9 min Comment on this story Comment Gif...,https://www.washingtonpost.com
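
The sample output above lists the same article more than once (rows 1 to 3), possibly because the page is linked from the source under several URLs. A minimal sketch of removing such repeats while keeping the first occurrence is shown below. The helper name `drop_duplicate_rows` and the sample rows are hypothetical.

```python
def drop_duplicate_rows(rows, keys=("Title", "Text")):
    """Keep the first row for each distinct combination of the given keys."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(row[key] for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical rows shaped like the dataframe above
rows = [
    {"Title": "Article A", "Text": "Text A", "Source": "https://www.washingtonpost.com"},
    {"Title": "Article B", "Text": "Text B", "Source": "https://www.washingtonpost.com"},
    {"Title": "Article B", "Text": "Text B", "Source": "https://www.washingtonpost.com"},
    {"Title": "Article C", "Text": "Text C", "Source": "https://www.washingtonpost.com"},
]

unique_rows = drop_duplicate_rows(rows)
print([row["Title"] for row in unique_rows])  # ['Article A', 'Article B', 'Article C']
```

On the dataframe itself, `final_df.drop_duplicates(subset=['Title', 'Text'])` does the same job.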
