# Real News Scraper

This notebook walks through a first use of Beautiful Soup (a Python scraping module).
Our working goal is a collection of real news texts to be used in Fake vs. Real News classification. See https://github.com/dcurry09/FakeNews_Classifier.

We will be scraping news articles from the left, right, and center categories.  An assumption, perhaps a biased one, is that all news articles from AllSides.com are considered "real".

The output is a CSV file of real news articles (each row holds an article's title and text, plus the source site it came from).

In [1]:
# Scraping Packages
import requests
from bs4 import BeautifulSoup
import newspaper
import textacy as tcy

# Standard Libraries
import csv
import operator
import re
import time
from collections import Counter, defaultdict

# Third-Party Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import progressbar
import pylab as py

We define a single news article URL as a test case for scraping (a single article first, then a loop over many).

In [2]:
# Collect first Article Text
page = requests.get('https://www.allsides.com/story/warren-says-dnc-rigged-nomination')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

Now we have a response object (page) and a BeautifulSoup object (soup). Let us first print the start of page.text to see what was returned:

In [3]:
print(page.text[:1000])

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product# content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#"><!--<![endif]-->

<head profile="http://www.w3.org/1999/xhtml/vocab">


  <meta charset="utf-8" 

This is essentially what we would see if we had right-clicked on the webpage and chosen "Inspect". This is our raw HTML. Our job now is to identify where the article text sits in the HTML and extract it. By inspecting the web page's HTML structure, I found a div element whose class (story-id-page-description) marks the body text.

In [4]:
body_text = soup.find_all('div', class_ = 'story-id-page-description')
print(type(body_text))
print(len(body_text))
print(body_text)

<class 'bs4.element.ResultSet'>
1
[<div class="story-id-page-description">
<p>Senator Elizabeth Warren (D-MA) answered that yes, she believed the Democratic primary was rigged to favor Hillary Clinton when asked Thursday about former Democratic National Committee Chairwoman Donna Brazile’s explosive admission that the Clinton campaign had control of the DNC before Clinton secured the nomination.</p>
</div>]
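
As a side note, the same element can be reached in a few equivalent ways. The sketch below runs on a small inline HTML string (a stand-in for page.text, so no network request is needed) and compares find_all, find, and a CSS selector via select_one. The sample_html string and the variable names are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.text, shaped like the div found above
sample_html = """
<html><body>
<div class="story-id-page-description">
<p>Example description paragraph.</p>
</div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet holding every match
matches = sample_soup.find_all("div", class_="story-id-page-description")

# find returns the first match directly (or None when nothing matches)
first_match = sample_soup.find("div", class_="story-id-page-description")

# select_one takes a CSS selector and also returns the first match or None
selected = sample_soup.select_one("div.story-id-page-description")

print(len(matches))
print(first_match is matches[0])
print(selected.p.text)
```

When only one element is expected, find or select_one saves the indexing step, at the cost of having to check for None.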


Now let's step through the result set and print each matching element:

In [5]:
for i in body_text: print(i)

<div class="story-id-page-description">
<p>Senator Elizabeth Warren (D-MA) answered that yes, she believed the Democratic primary was rigged to favor Hillary Clinton when asked Thursday about former Democratic National Committee Chairwoman Donna Brazile’s explosive admission that the Clinton campaign had control of the DNC before Clinton secured the nomination.</p>
</div>


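find_all returns a list-like ResultSet, so to work with the matching element we index into it and take the first item (despite its name, body_text_list is a single tag, not a list). From there, .p gives us the paragraph tag: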

In [6]:
body_text_list = body_text[0]
print(body_text_list.p)

<p>Senator Elizabeth Warren (D-MA) answered that yes, she believed the Democratic primary was rigged to favor Hillary Clinton when asked Thursday about former Democratic National Committee Chairwoman Donna Brazile’s explosive admission that the Clinton campaign had control of the DNC before Clinton secured the nomination.</p>


In [7]:
print(body_text_list.p.text)

Senator Elizabeth Warren (D-MA) answered that yes, she believed the Democratic primary was rigged to favor Hillary Clinton when asked Thursday about former Democratic National Committee Chairwoman Donna Brazile’s explosive admission that the Clinton campaign had control of the DNC before Clinton secured the nomination.


Bingo! OK, so we have the text of one article. We need to store this first article in a CSV file, then repeat the process many times across the website. Let's start by creating a pandas DataFrame.

In [8]:
final_text = body_text_list.p.text
test_df = pd.DataFrame({'text': [final_text]})

print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
text    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
None


                                                text
0  Senator Elizabeth Warren (D-MA) answered that ...
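
The single-article steps so far (find the description div, take its paragraph text, store it) can be wrapped up roughly as follows. This is a minimal sketch under stated assumptions: extract_description is a hypothetical helper name, the HTML and title are made-up sample values, and the standard-library csv module is used here in place of pandas so that the example writes to an in-memory buffer rather than to disk.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_description(html, css_class="story-id-page-description"):
    """Return the text of the first <p> inside the description div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None or div.p is None:
        return None
    return div.p.get_text(strip=True)


# Hypothetical sample input standing in for a downloaded page
sample_html = '<div class="story-id-page-description"><p> Example text. </p></div>'
row = {"title": "Example title", "text": extract_description(sample_html)}

# In-memory buffer; swap for open("real_news.csv", "w", newline="") to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "text"])
writer.writeheader()
writer.writerow(row)

print(buffer.getvalue())
```

Returning None when the div is missing lets a loop over many pages skip articles whose layout differs, instead of failing on an index error.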


## Moving Beyond Single Articles
We now expand the scope of our scraping code to include multiple websites and multiple articles. We will be using the Newspaper package to help streamline this process. Below are a few functions that wrap the entire workflow together (from the nice blog: https://dataflume.wordpress.com/2017/04/23/scraping-newspaper-text-and-computing-readability-statistics-in-python/).

In [9]:
def article_extractor(newspaper_url, title_topic=None):
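    '''
    Extracts news article text for a given site URL.
    Uses the newspaper package and returns a pandas DataFrame with
    "title" and "text" columns. If title_topic is given, only articles
    whose URL contains that substring are kept.
    '''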
        
    dd = defaultdict(list)

    source = newspaper.build(newspaper_url, memoize_articles=False)

    print("\nExtracting articles from", newspaper_url)    
    print("# of Articles:", source.size()) 
    #print(source.articles)    
        
    arts = [i.url for i in source.articles]
    
    print("# of Article URLs:", len(arts))  
        
        
    # title_topic is matched against the article URL, not the parsed title
    if title_topic is None:
        relevant_arts = list(arts)
    else:
        relevant_arts = [i for i in arts if title_topic in i]

    print("# of Relevant Articles:", len(relevant_arts))
        
    bar = progressbar.ProgressBar()
        
    for i in bar(relevant_arts):
        time.sleep(0.02)

        try:
            art = newspaper.build_article(i)
            art.download()
            art.parse()
            dd["title"].append(art.title)
            dd["text"].append(art.text)
        except Exception:
            # Skip articles that fail to download or parse
            print('ERROR: URL No Longer Available...')
            continue
        
    # Build the DataFrame once, preview it, and return it
    df = pd.DataFrame.from_dict(dd)
    print('DF:', df.head(5))

    return df

def get_articles(newspaper_url):
    '''Runs article_extractor over a list of site URLs and stacks the results.'''
    results = []
    for url in newspaper_url:
        articles = article_extractor(url)
        articles["paper"] = url
        results.append(articles)
    return pd.concat(results)

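def clean_text(string):
    '''Strips common boilerplate phrases and newlines from article text.'''

    # Remove site boilerplate phrases
    string = re.sub(r"SIGN UP FOR OUR NEWSLETTER", "", string)
    string = re.sub(r"Read more here", "", string)
    string = re.sub(r"REUTERS", "", string)
    # Replace every question mark with an apostrophe
    string = re.sub(r"\?", "'", string)
    # Delete newlines (no space is inserted in their place)
    string = re.sub(r"\n", "", string)
    return string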

def preprocess_articles(articles):

    clean_arts = []
    for art in articles:
        clean_art = tcy.preprocess.preprocess_text(art,
                                          fix_unicode=True,
                                          lowercase=True,
                                          no_currency_symbols=True,
                                          no_numbers=True,
                                          no_urls=True)
        clean_arts.append(clean_art)
    return clean_arts
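
One thing to keep in mind about clean_text above: because it deletes newlines outright, the last word of one paragraph ends up glued to the first word of the next. The following is a hedged sketch of a variant that leaves a single space where each newline was. The name clean_text_spaced and the sample string are assumptions made for illustration; the substitutions otherwise mirror clean_text.

```python
import re


def clean_text_spaced(string):
    """Variant of clean_text that keeps a space where a newline was."""
    # Same boilerplate phrases as clean_text
    for phrase in ("SIGN UP FOR OUR NEWSLETTER", "Read more here", "REUTERS"):
        string = string.replace(phrase, "")
    # Same question-mark substitution as clean_text
    string = string.replace("?", "'")
    # Collapse newlines and runs of whitespace into one space
    string = re.sub(r"\s+", " ", string)
    return string.strip()


# Made-up sample with a paragraph break and a boilerplate phrase
sample = "First paragraph.\n\nSecond paragraph.\nRead more here"
print(clean_text_spaced(sample))
```

Whether to keep the question-mark substitution at all is a judgment call, since it also rewrites genuine questions in the article text.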

In [11]:
# Define the news websites to scrape from
cnn = "http://cnn.com"
allSides = "https://www.allsides.com/"
guardian = "https://www.theguardian.com/us/"
NYT = "https://www.nytimes.com/"
wapo = "https://www.washingtonpost.com/"
nbc = "https://www.nbcnews.com/"
fox = "http://www.foxnews.com/"
lat = "http://www.latimes.com/hp-2/"

site_list = [cnn, allSides, guardian, NYT, wapo, nbc, fox, lat]
#site_list = [wapo]
articles = get_articles(site_list)
articles["text"] = articles.text.map(clean_text)
articles["text"] = preprocess_articles(articles.text)

# Save to CSV
articles.to_csv('real_news.csv')

N/A% (0 of 790) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from http://cnn.com
# of Articles: 790
# of Article URLs: 790
# of Relevant Articles: 790


  0% (2 of 790) |                         | Elapsed Time: 0:00:00 ETA:  0:03:48

Article `download()` failed with 404 Client Error: Not Found for url: http://www.cnn.com/hln-morning-express-tour-robin-meade on URL http://cnn.com/hln-morning-express-tour-robin-meade
ERROR: URL No Longer Available...


  " Skipping tag %s" % (size, len(data), tag))
100% (790 of 790) |#######################| Elapsed Time: 0:14:25 Time: 0:14:25


DF:                                                 text  \
0  IN YOUR HEADSET...\n\nHeadsets are hands-down ...   
1  (CNN) Three new reports from the US Centers fo...   
2  Story highlights Trump wants China to do somet...   
3  (CNN) The President's opioid commission on Wed...   
4  Atlanta (CNN) We hear about those who've kicke...   

                                               title  
0                                    How to Watch VR  
1  Heart disease deaths plummet, overdose deaths ...  
2         China downplays role in US opioid epidemic  
3  Opioid commission: We need drug courts, not pr...  
4  'This is skid row': What two current heroin ad...  


N/A% (0 of 208) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from https://www.allsides.com/
# of Articles: 208
# of Article URLs: 208
# of Relevant Articles: 208


 54% (114 of 208) |############           | Elapsed Time: 0:02:23 ETA:  0:00:57

Article `download()` failed with 404 Client Error: Not Found for url: https://www.allsides.com/blog/twitter%25E2%2580%2599s-new-safety-tools-what-do-you-think on URL https://www.allsides.com/blog/twitter%25E2%2580%2599s-new-safety-tools-what-do-you-think
ERROR: URL No Longer Available...


 60% (125 of 208) |#############          | Elapsed Time: 0:02:31 ETA:  0:01:02

Article `download()` failed with 404 Client Error: Not Found for url: https://www.allsides.com/blog/trump%25E2%2580%2599s-economic-plan-other-media-contrasts-week on URL https://www.allsides.com/blog/trump%25E2%2580%2599s-economic-plan-other-media-contrasts-week
ERROR: URL No Longer Available...


 62% (131 of 208) |##############         | Elapsed Time: 0:02:35 ETA:  0:00:46

Article `download()` failed with 404 Client Error: Not Found for url: https://www.allsides.com/blog/story-week-responding-last-week%25E2%2580%2599s-violence on URL https://www.allsides.com/blog/story-week-responding-last-week%25E2%2580%2599s-violence
ERROR: URL No Longer Available...


100% (208 of 208) |#######################| Elapsed Time: 0:03:24 Time: 0:03:24


DF:                                                 text  \
0  What region do you want to see?\n\nThe Communi...   
1  "Save" saves this article for you to read late...   
2  "Save" saves this article for you to read late...   
3  "Save" saves this article for you to read late...   
4  What region do you want to see?\n\nThe Communi...   

                                               title  
0           Congressmen Call for Mueller Resignation  
1  GOP Reps. Gaetz, Gohmert, Biggs push for Muell...  
2  GOP lawmaker calls for Mueller recusal over ur...  
3  It begins: Republican Congressmen introduce re...  
4                  Warren Says DNC Rigged Nomination  


N/A% (0 of 120) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from https://www.theguardian.com/us/
# of Articles: 120
# of Article URLs: 120
# of Relevant Articles: 120


100% (120 of 120) |#######################| Elapsed Time: 0:01:56 Time: 0:01:56


DF:                                                 text  \
0  The president seized on claims in former chair...   
1  Donald Trump’s claim that the US has been atta...   
2  The deactivation of @realDonaldTrump – apparen...   
3  The trial of ex-Trump campaign officials Paul ...   
4  Spanish judge’s move comes day after former me...   

                                               title  
0  Raging Trump demands FBI investigate Clinton, ...  
1  Trump's claim US hitting Isis 'much harder' af...  
2  Experts warn about security after Donald Trump...  
3  Paul Manafort and Rick Gates trial date set fo...  
4  European arrest warrant issued for ex-Catalan ...  


N/A% (0 of 196) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from https://www.nytimes.com/
# of Articles: 196
# of Article URLs: 196
# of Relevant Articles: 196


100% (196 of 196) |#######################| Elapsed Time: 0:02:14 Time: 0:02:14


DF:                                                 text  \
0  The climate science report is part of a congre...   
1  The climate science report is part of a congre...   
2  For instance, the 2014 assessment forecast tha...   
3  The White House is projecting robust economic ...   
4  House Republicans on Thursday unveiled a bill ...   

                                               title  
0  U.S. Report Says Humans Cause Climate Change, ...  
1  U.S. Report Says Humans Cause Climate Change, ...  
2  What the Climate Report Says About the Impact ...  
3  Republicans May Inject Health Care Mandate Deb...  
4  The Five Biggest Changes for Families in the R...  


N/A% (0 of 139) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from https://www.washingtonpost.com/
# of Articles: 139
# of Article URLs: 139
# of Relevant Articles: 139


100% (139 of 139) |#######################| Elapsed Time: 0:01:16 Time: 0:01:16


DF:                                                 text  \
0                                                      
1                                                      
2                                                      
3  The leader of the Catholic Church spoke out ag...   
4  \n\nA man prays at a memorial after the deadly...   

                                               title  
0                    Real or Fake: Past Games Events  
1                   Two Truths and a Lie: Tom Cruise  
2               Which Actors Starred in Indie Films?  
3  Pope Francis’s ominous, emotional message abou...  
4  ISIS claims suspected New York truck attacker ...  


N/A% (0 of 1056) |                       | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from https://www.nbcnews.com/
# of Articles: 1056
# of Article URLs: 1056
# of Relevant Articles: 1056


 32% (339 of 1056) |#######               | Elapsed Time: 0:06:55 ETA:  0:10:06

Article `download()` failed with 404 Client Error: Not Found for url: https://www.nbcnews.com/dateline/news on URL https://www.nbcnews.com/dateline/news
ERROR: URL No Longer Available...


100% (1056 of 1056) |#####################| Elapsed Time: 0:20:54 Time: 0:20:54


DF:                                                 text  \
0  ACLU Joins Lawsuit Against FBI by Scientist Fo...   
1  Video\n\nWorkplace Harassment Can Never Be Tol...   
2      NBC News works best with JavaScript turned on   
3  NBC News Weather takes you up close to some of...   
4  Taro Karibe / for NBC News\n\nN. Korea Abducte...   

                                               title  
0  U.S. News: Breaking News Photos, & Videos on t...  
1  World News: Latest Breaking Global News Storie...  
2                                     NBC Affiliates  
3  Weather: News, Photos & Videos on Natural Disa...  
4  Asian America: Community News, Information, Cu...  


N/A% (0 of 334) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from http://www.foxnews.com/
# of Articles: 334
# of Article URLs: 334
# of Relevant Articles: 334


100% (334 of 334) |#######################| Elapsed Time: 0:03:25 Time: 0:03:25


DF:                                                 text  \
0                                                      
1  President Trump tweeted Friday that Army Sgt. ...   
2  The New York City Police Department said Frida...   
3  Will the so-called "Antifa apocalypse" come wi...   
4                                                      

                                               title  
0  Watch Fox News Channel and Fox Business Networ...  
1  Bergdahl dishonorably discharged, no jail time...  
2  Harvey Weinstein recent rape allegations are '...  
3  Antifa apocalypse? Anarchist group's plan to o...  
4  Watch Fox News Channel and Fox Business Networ...  


N/A% (0 of 70) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--


Extracting articles from http://www.latimes.com/hp-2/
# of Articles: 70
# of Article URLs: 70
# of Relevant Articles: 70


 21% (15 of 70) |#####                    | Elapsed Time: 0:00:10 ETA:  0:00:46

You must `download()` an article first!
ERROR: URL No Longer Available...


 95% (67 of 70) |#######################  | Elapsed Time: 0:00:57 ETA:  0:00:04

You must `download()` an article first!
ERROR: URL No Longer Available...


 97% (68 of 70) |######################## | Elapsed Time: 0:01:04 ETA:  0:00:04

You must `download()` an article first!
ERROR: URL No Longer Available...


 98% (69 of 70) |######################## | Elapsed Time: 0:01:11 ETA:  0:00:02

You must `download()` an article first!
ERROR: URL No Longer Available...


100% (70 of 70) |#########################| Elapsed Time: 0:01:18 Time: 0:01:18


You must `download()` an article first!
ERROR: URL No Longer Available...
DF:                                                 text  \
0  New York City police investigators say a 2010 ...   
1  House Republicans produced an ambitious propos...   
2  In a highly unusual move, the Justice Departme...   
3  A massive U.S. report concludes that the evide...   
4  Sukhrob Sobirov was 20 when he left Uzbekistan...   

                                               title  
0  NYPD gathering evidence to arrest Harvey Weins...  
1  House Republicans produced an ambitious tax ov...  
2  Trump administration asks Supreme Court to pun...  
3  U.S. report contradicts Trump team: Warming is...  
4  Uzbek community in New York wary of being tied...  
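
The previews above show some rows with empty text (for example in the Washington Post and Fox News results) and some repeated titles (in the New York Times results). A possible cleanup step before saving is sketched below on a toy DataFrame; the column values are invented for illustration, and the choice to treat rows with matching title and text as duplicates is an assumption rather than something the notebook specifies.

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: one duplicate row, one empty text
articles = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "text": ["alpha", "beta", "beta", ""],
    "paper": ["site1", "site2", "site2", "site3"],
})

# Keep only rows whose text is non-empty after stripping whitespace
has_text = articles["text"].str.strip() != ""

cleaned = (
    articles[has_text]
    .drop_duplicates(subset=["title", "text"])  # drop repeated articles
    .reset_index(drop=True)                     # one continuous index across sites
)

# With no path argument, to_csv returns the CSV as a string;
# index=False keeps the row index out of the file
csv_text = cleaned.to_csv(index=False)
print(cleaned)
```

Resetting the index also matters after pd.concat, since each per-site frame otherwise keeps its own numbering starting from 0.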
