*Written by Gregory Palermo, 2018-07-01*

This notebook scrapes the text of a list of articles from the NYTimes website. The text on each webpage is contained in `<p>` elements within the `<article>` element. This text is stored in a dictionary keyed by each article's URL, alongside the headline (as it appears in the article itself), which can be used to join with the metadata df for each center created using the API notebook.
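
As a rough sketch of the structure described above, the scraped result can be thought of as a dictionary keyed by URL whose values pair a headline with a list of paragraph strings. The helper name `to_records`, the column names, and the sample entry below are illustrative assumptions rather than part of this notebook; the point is only that one row per URL is a convenient shape for joining on a URL column in the metadata df.

```python
# Hypothetical sample in the shape this notebook produces:
# {url: (headline, [paragraph, paragraph, ...])}
sample_articles = {
    "https://example.com/a.html": (
        "Example headline",
        ["First paragraph.", "Second paragraph."],
    ),
}

def to_records(article_dict):
    """Flatten {url: (headline, paragraphs)} into one dict per article."""
    records = []
    for url, (headline, paragraphs) in article_dict.items():
        records.append({
            "url": url,  # join key (assumed to match a URL column in the metadata)
            "headline": headline,
            "text": "\n\n".join(paragraphs),  # paragraphs separated by blank lines
        })
    return records

records = to_records(sample_articles)
print(records[0]["headline"])
```

A list of dictionaries like this can be passed to `pandas.DataFrame(records)` and merged with the metadata on the URL column, assuming the metadata stores the same URL strings.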

In [332]:
import requests  # to get HTML pages, given a URL
from bs4 import BeautifulSoup  # for parsing HTML

Testing for a single file:

In [333]:
filename = "./test.html"
with open(filename, encoding="utf-8") as f:  # close the file once it has been read
    myfile = f.read()
test = BeautifulSoup(myfile, "html.parser")  # using default Python 3 html parser

In [334]:
print(test.prettify())

<!DOCTYPE html>
<html class="story" itemid="https://www.nytimes.com/2018/06/22/opinion/children-detention-trump-executive-order.html" itemscope="true" itemtype="http://schema.org/NewsArticle" lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <!--––
         0000000                         000        0000000
       111111111      11111111100          000      111111111
       00000        111111111111111111      00000      000000
       000        1111111111111111111111111100000         000
       000        1111       1111111111111111100          000
       000         11       0     1111111100              000
       000          1      00             1               000
       000               00      00       1               000
       000             000    00000       1               000
    00000            0000  00000000       1                00000
  11111            000 00    000000      000                 11111
    00000          0000      000000     0000

In [337]:
test.find("h1").get_text()

'There’s a Better, Cheaper Way to Handle Immigration'

In [353]:
# for this particular article, this is the relevant class; is it the same across NYTimes??
paragraph_class = "css-1i0edl6 e2kc3sl0"

# returns a list in which each element is a string containing the text of a paragraph
def get_article_text(doc):
    # use the class defined above rather than repeating the string here
    p_tags = doc.find_all("p", attrs={"class": paragraph_class})
    return [paragraph.get_text() for paragraph in p_tags]

# returns the headline as it appears in the article, or None if there is no <h1>
def get_article_headline(doc):
    h1 = doc.find("h1")
    return h1.get_text() if h1 is not None else None
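
The class string above was read off a single article, and it is not confirmed here that it holds across the site. A minimal sketch of an alternative that relies only on the structure noted in the introduction (`<p>` elements inside `<article>`) follows. The function name `get_article_text_by_structure` and the inline HTML are made up for illustration, and this version may also pick up captions or other non-body paragraphs that the class filter excludes.

```python
from bs4 import BeautifulSoup

def get_article_text_by_structure(doc):
    """Return the text of every <p> nested inside the first <article>."""
    article = doc.find("article")
    if article is None:  # page has no <article> element
        return []
    return [p.get_text() for p in article.find_all("p")]

# Made-up HTML standing in for a downloaded page
sample_html = """
<html><body>
  <p>Site navigation</p>
  <article>
    <h1>Example headline</h1>
    <p class="css-abc">First paragraph.</p>
    <div><p>Second paragraph.</p></div>
  </article>
</body></html>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")
print(get_article_text_by_structure(sample_doc))
```

Comparing the output of both approaches on a handful of articles would be one way to check whether the class-based filter is dropping or keeping the intended paragraphs.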
    

In [245]:
get_article_text(test)

['And now, it has come to this.',
 '“We are trading kids in cages for families in cages,” said Cory Smith of Kids in Need of Defense, a legal advocacy group on whose board I serve.',
 'On Wednesday, President Trump said in an executive order that he planned to keep families together by jailing parents and children together during the course of their immigration hearings.',
 'On Thursday, the Department of Defense was tasked with finding space on military bases to house up to 20,000 children. Attorney General Jeff Sessions filed a request with a United States District Court in California to modify a consent degree from 1997 — known as the Flores agreement — that set standards for the detention of children and that a judge in 2015 interpreted as requiring that children be released within 20 days.',
 'The Trump administration can’t jail children and parents together without changing this agreement. But there’s no reason to expect its attempt to succeed, said Carlos Holguín, the general co

In [351]:
# given a list of article URLs, return a dictionary, with URLs as keys,
# that contains the parsed HTML (a BeautifulSoup object) of each article

article_urls = [
    'https://www.nytimes.com/2018/06/25/opinion/family-detention-immigration.html',
    'https://www.nytimes.com/2018/06/22/opinion/children-detention-trump-executive-order.html',
]

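def request_articles(url_list):
    article_soup = {}
    for url in url_list:
        r = requests.get(url, timeout=30)  # avoid hanging on an unresponsive server
        r.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        article_soup[url] = BeautifulSoup(r.text, "html.parser")
    return article_soup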

# given that dictionary, transforming it to just the text of the article
'''
This was a first attempt, before I had scaled up to write different functions for text and headline.

def article_text_dictionary_old(soup_dictionary):
    article_text = {}
    for url,doc in soup_dictionary.items():
        p_tags = doc.find_all('p',attrs = {"class": "css-1i0edl6 e2kc3sl0"})
        paragraph_text = []
        for paragraph in p_tags:
            paragraph_text.append(paragraph.get_text())
        article_text[url]=paragraph_text
    return(article_text)
'''

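# given the dictionary of parsed articles, return a dictionary, with URLs as keys,
# that contains a (headline, list of paragraphs) tuple for each article
def article_dictionary(soup_dictionary):
    article_text = {}
    for url, doc in soup_dictionary.items():
        article_text[url] = (get_article_headline(doc), get_article_text(doc))
    return article_text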

# putting it all together

def NYT_article_text(url_list):
    return article_dictionary(request_articles(url_list))
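
As written, a network error on any one URL interrupts the whole batch. One possible variation, sketched here under assumptions, collects failures separately so the remaining URLs are still processed. The names `fetch_all` and `fetch` are hypothetical; `fetch` stands in for any callable that returns a page's HTML, for example `lambda url: requests.get(url, timeout=30).text`, and `fake_fetch` only simulates a download.

```python
def fetch_all(url_list, fetch):
    """Call fetch(url) for each URL; return (pages, failures) dictionaries."""
    pages, failures = {}, {}
    for url in url_list:
        try:
            pages[url] = fetch(url)
        except Exception as err:  # broad on purpose: record the error and move on
            failures[url] = repr(err)
    return pages, failures

# Stand-in for a real download, so the sketch runs without network access
def fake_fetch(url):
    if "missing" in url:
        raise ValueError("simulated download failure")
    return "<html><h1>Example headline</h1></html>"

pages, failures = fetch_all(
    ["https://example.com/ok.html", "https://example.com/missing.html"],
    fake_fetch,
)
print(sorted(pages), sorted(failures))
```

The `pages` values could then be parsed with `BeautifulSoup(html, "html.parser")` and passed to `article_dictionary`, while `failures` gives a list of URLs to retry.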
    
                    

In [354]:
NYT_article_text(test_list)

{'https://www.nytimes.com/2018/06/22/opinion/children-detention-trump-executive-order.html': ('There’s a Better, Cheaper Way to Handle Immigration',
  ['And now, it has come to this.',
   '“We are trading kids in cages for families in cages,” said Cory Smith of Kids in Need of Defense, a legal advocacy group on whose board I serve.',
   'On Wednesday, President Trump said in an executive order that he planned to keep families together by jailing parents and children together during the course of their immigration hearings.',
   'On Thursday, the Department of Defense was tasked with finding space on military bases to house up to 20,000 children. Attorney General Jeff Sessions filed a request with a United States District Court in California to modify a consent degree from 1997 — known as the Flores agreement — that set standards for the detention of children and that a judge in 2015 interpreted as requiring that children be released within 20 days.',
   'The Trump administration can’t 