# Scrape NYT

This notebook contains code to scrape The New York Times.

Caveats:
- Some articles in The New York Times are still in their Time Machine and do not have a transcribed copy available. Only their headlines are available. If their headlines happen to contain a deaf-related phrase, my code will include only the headline, and not the body. But if their headlines do not contain a deaf-related phrase, and their body does, my code unfortunately will miss them entirely. 

Load dependencies and data.

In [82]:
import pandas as pd
from ast import literal_eval
import time
import json
import os
import math

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
from nltk.tokenize import RegexpTokenizer

import configparser
configs = configparser.ConfigParser()
configs.read('../../config.ini')

['../../config.ini']

In [15]:
data = pd.read_csv('data/all.csv')
data['keywords'] = data['keywords'].apply(literal_eval)
data['date'] = pd.to_datetime(data['date']) 
data

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,deaf_and_dumb,deaf_mute,fell_on_deaf_ears,hearing_impaired,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,THE DEAF AND DUMB WAITER.,1885-12-03,article,Archives,,Archives,[],https://www.nytimes.com/1885/12/03/archives/th...,nyt://article/0074c23c-1ff6-5bc7-85d9-e56a5af3...,,True,False,False,False,False,False,False,False
1,Chad Threatens to Expel Sudanese Refugees,2006-04-14,article,News,International,World,[],https://www.nytimes.com/2006/04/14/world/chad-...,nyt://article/00bb19d7-2ba6-5072-8e6b-3159730d...,By Marc Lacey,True,False,False,False,False,False,False,False
2,WELFARE HOTEL CHILDREN: TOMORROW'S POOR,1987-07-16,article,News,Metropolitan Desk,New York,"[Homeless Persons, HOTELS AND MOTELS, Children...",https://www.nytimes.com/1987/07/16/nyregion/we...,nyt://article/01670df3-ae07-5eb6-8862-7bd834bf...,By Lydia Chavez,True,False,False,False,False,False,False,False
3,Wal-Mart Says Oil Prices Held Down Profits for...,2005-08-16,article,News,Business,Business Day,[Company Reports],https://www.nytimes.com/2005/08/16/business/wa...,nyt://article/0175ac61-cc62-5cdc-923c-f5efb8ec...,By Roben Farzad,True,False,False,False,False,False,False,False
4,"A Space Force? The Idea May Have Merit, Some Say",2018-06-23,article,News,Washington,U.S.,"[Space and Astronomy, United States Defense an...",https://www.nytimes.com/2018/06/23/us/politics...,nyt://article/01b8b8a5-7d0c-592a-a283-a9ccd3d8...,By Helene Cooper,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17340,Your Money; Claiming a Pet As a Deduction,1981-03-28,article,News,Financial Desk,Business Day,"[ANIMALS, Taxation, Income Tax, Handicapped]",https://www.nytimes.com/1981/03/28/business/yo...,nyt://article/8aa2aceb-e543-5691-b4cc-572cfada...,By Elizabeth M. Fowler,False,False,False,True,False,False,False,True
17341,"Your Typical Crowded, Swinging, Silent Bar Scene",1994-10-30,article,News,The City Weekly Desk,New York,"[Deafness, Bars]",https://www.nytimes.com/1994/10/30/nyregion/ne...,nyt://article/1aab92b9-5b05-50b7-bf7a-2b8cef19...,By Jennifer Kingson Bloom,False,False,False,True,False,False,False,True
17342,"‘Fargo’ Recap: Dead Dogs, Spiders and Pestilence",2014-04-30,article,News,Culture,Arts,[],https://artsbeat.blogs.nytimes.com/2014/04/29/...,nyt://article/3a6161c6-023a-5968-a28e-0ea2ecb6...,By Kate Phillips,False,False,False,True,False,False,False,True
17343,‘Singing’ With Their Hands,2012-02-11,article,News,Styles,Fashion & Style,"[Video Recordings and Downloads, Music, Sign L...",https://www.nytimes.com/2012/02/12/fashion/sin...,nyt://article/0918d106-bd33-59fe-a100-cd3f9a23...,By Austin Considine,False,False,False,True,False,False,False,True


Load phrases.

In [16]:
with open('phrases.txt', 'r') as infile:
    phrases = json.load(infile)
    
phrases

{'deaf_and_dumb': ['deaf and dumb', 'deaf dumb'],
 'deaf_mute': ['deaf mute', 'deaf and mute', 'mute deaf', 'mute and deaf'],
 'fell_on_deaf_ears': ['fell on deaf ears',
  'fall on deaf ears',
  'falls on deaf ears',
  'fall on a deaf ear',
  'falling on deaf ears',
  'falling on a deaf ear',
  'turn a deaf ear',
  'turned deaf ears',
  'turned a deaf ear',
  'turning deaf ears',
  'turning a deaf ear'],
 'hearing_impaired': ['hearing impaired', 'hearing impairment'],
 'tone_deaf': ['tone deaf'],
 'deaf_as_a_post': ['deaf as a post'],
 'stone_deaf': ['stone deaf'],
 'deaf': ['deaf']}

Extract data subset for only the True/False columns. We use this when scraping to see which phrases each article has.

In [17]:
data_subset = data.loc[:, 'deaf_and_dumb':'deaf']
column_names = data_subset.columns.values.astype(str)
phrases_for_each_article = [column_names[i].tolist() for i in data_subset.values]

## Scrape sentences into sentences vector/Series.

Define functions to log into the New York Times site, browse to each article, and scrape matching sentences.

In [18]:
def connect_to_nyt():
    # Open browser and navigate to NYT
    browser = webdriver.Chrome(executable_path='../../chromedriver')
    browser.get('https://nyt.com')

    # Bring up login portal
    wait = WebDriverWait(browser, 15)
    login_button_XPATH = '//button[@data-testid="login-button"]'
    login_button_present = EC.presence_of_element_located((By.XPATH, login_button_XPATH))
    login_button = wait.until(login_button_present)
    login_button.click()

    # Log in
    fields_present = EC.presence_of_element_located((By.ID, 'username'))
    wait.until(fields_present).send_keys(configs['NYT']['EMAIL'])
    browser.find_element_by_id('password').send_keys(configs['NYT']['PASSWORD'])
    browser.find_element_by_xpath(login_button_XPATH).click()
    
    return browser

In [121]:
def get_paragraphs():
    '''Returns a list of the paragraphs on the page.'''
    paragraphs = browser.find_elements_by_tag_name('p')
    text = []
    try:
        if '/video/' in browser.current_url:
            print('Page contains a video.')
            text = [h2.text for h2 in browser.find_elements_by_tag_name('h2')]
        else:
            text = [p.text for p in paragraphs]
    except StaleElementReferenceException:
        print('StaleElementReferenceException')
        text = get_paragraphs()
        
    return text


def scrape_matches(article, only_do_headline=False):
    matches = []
    
    # Get all paragraphs
    if not only_do_headline:    
        paragraphs = get_paragraphs()     
        paragraphs.append(article['headline']) # add headline to paragraphs
    else:
        if not isinstance(article['headline'], pd.Series) and math.isnan(article['headline']):
            return matches
        else:
            paragraphs = [article['headline']]
    paragraphs_cleaned = [' '.join(RegexpTokenizer(r'\w+').tokenize(p)).lower() for p in paragraphs] # remove punctuation, lowercase

    # Get all varieties of the phrases for this article
    phrases_for_this_article = phrases_for_each_article[article.name]
    phrase_varieties = ['deaf'] # add by default, also because we made the 'deaf' column to exclude other 'deaf' terms
    for phrase in phrases_for_this_article:
        phrase_varieties += [x for x in phrases[phrase]]

    # Get sentences that contain a phrase variety
    for p in phrase_varieties:
        matches += ([paragraphs[i] for i, x in enumerate(paragraphs_cleaned) if (p in x and paragraphs[i] not in matches)])

    return matches

    
def scrape(article):
    time.sleep(5)
    article_id = article['id']
    url = article['url']
    status = None
    matches = None
    statuses = ['Full text is unavailable for this digitized archive article.',
              'Page Not Found',
              'Server Error',
              'Success',
              'Failed',
              'We’re sorry, we seem to be having some technical difficulties, but we don’t want to lose you.']
    
    # Navigate to page
    browser.get(article['url'])
    
    # If Error 503, refresh
    if "Error 503 first byte timeout" in browser.page_source:
        print('Error 503')
        browser.refresh()
    
    # If video in URL
    if '/video/' in article['url']:
        status = statuses[3] # success
        matches = scrape_matches(article) # scrape the article
        if len(matches) >= 1: 
            status = statuses[3]
            print(status + ': ' + str(len(matches)) + ' paragraph(s) found.') 
        else:
            status = statuses[4]
            matches = None
            print(status) # failed
    
    elif statuses[0] in browser.page_source: # time machine
        status = statuses[0]
        matches = scrape_matches(article) # will only attempt to scrape headline
        if len(matches) >= 1:
            print(status + ': ' + str(len(matches)) + ' paragraph(s) found.') 
        else:
            matches = None
            print(status)
            
    elif statuses[1] in browser.page_source: 
        status = statuses[1] # page not found
        matches = scrape_matches(article, True)
        if len(matches) >= 1:
            print(status + ': ' + str(len(matches)) + ' paragraph(s) found.')
        else:
            matches = None
            print(status)
    
    elif statuses[2] in browser.page_source or statuses[5] in browser.page_source: 
        status = statuses[2] # server error
        matches = scrape_matches(article, True)
        if len(matches) >= 1:
            print(status + ': ' + str(len(matches)) + ' paragraph(s) found.')
        else:
            matches = None
            print(status)
        
    else:
        status = statuses[3] # success
        matches = scrape_matches(article) # scrape the article
        if len(matches) >= 1: 
            status = statuses[3]
            print(status + ': ' + str(len(matches)) + ' paragraph(s) found.') 
        else:
            status = statuses[4]
            matches = None
            print(status) # failed
            
    return article_id, status, matches

Scrape sentences. We get back two lists, `statuses` and `sentences`.

`sentences` is a 2D list with all the sentences found for each article, including the headline if it matched. 

`statuses` is a 1D list with the scrape status for each article:
- `Page Not Found` — if the page was not found
- `Full text is unavailable for this digitized archive article.` — if there is no fulltext available, and the article needs to be viewed through Time Machine
- `Server Errror` — if there was a server error on NYT's side
- `Success` 
- `Failed` — if the page was fine, but my logic failed to find a matching sentence

In [122]:
browser = connect_to_nyt()

In [None]:
def chunker(df, size):
    return (df[pos:pos + size] for pos in range(0, len(df), size))

csv_path = 'data/paragraphs.csv'
if os.path.exists(csv_path):
    already_scraped = pd.read_csv('data/paragraphs.csv')
    not_yet_scraped = data[~data['id'].isin(already_scraped['id'])]
else:
    not_yet_scraped = data
    
article_iterator = chunker(not_yet_scraped, 10)

for chunk in article_iterator:
    ids, statuses, paragraphs = zip(*chunk.apply(scrape, axis=1))
    if not os.path.exists(csv_path):
        include_header = True
        mode = 'w'
    else: 
        include_header = False
        mode = 'a'
        with open(csv_path, 'a') as f:
            f.write('\n') # buggy if i don't do this, idk why, will concatenate 1st line on last line
    pd.DataFrame({'id': ids, 'status': statuses, 'sentences': paragraphs}).to_csv(csv_path, index=False, header=include_header, mode=mode)
    print('Updated ' + csv_path)

Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Updated data/paragraphs.csv
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Updated data/paragraphs.csv
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Success: 1 paragraph(s) found.
Updated data/p

**To do:**
- For the articles that are unavailable as fulltexts, I can collect their URLs based on sentences==None then maybe automatically direct browser to the TimeMachine, then I will manually read it then [type the sentence into Jupyter Notebook using user input](https://stackoverflow.com/questions/34968112/how-to-give-jupyter-cell-standard-input-in-python) and then the program will add this to sentences cell for that article row, move on to the next row that needs it, then redirect browser automatically, and so on. 

In [110]:
headline = data[data['url'] == 'https://www.nytimes.com/1998/06/22//IHT-william-at-16-prince-of-hearts-and-new-windsor-icon.html']['headline']

In [111]:
list(headline

4308    William at 16: Prince of Hearts and New Windso...
Name: headline, dtype: object

In [117]:
headline

4308    William at 16: Prince of Hearts and New Windso...
Name: headline, dtype: object

In [118]:
isinstance(headline, pd.Series)

True