# Scrape NYT

This notebook contains code to scrape The New York Times.

Load dependencies and data.

In [23]:
import pandas as pd
from ast import literal_eval
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

import configparser
configs = configparser.ConfigParser()
configs.read('../../config.ini')

['../../config.ini']

In [3]:
data = pd.read_csv('data/all.csv')
data['keywords'] = data['keywords'].apply(literal_eval)
data['date'] = pd.to_datetime(data['date']) 
data

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,deaf_and_dumb,deaf_mute,fall_on_deaf_ears,hearing_impaired,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,THE DEAF AND DUMB WAITER.,1885-12-03,article,Archives,,Archives,[],https://www.nytimes.com/1885/12/03/archives/th...,nyt://article/0074c23c-1ff6-5bc7-85d9-e56a5af3...,,True,False,False,False,False,False,False,False
1,Chad Threatens to Expel Sudanese Refugees,2006-04-14,article,News,International,World,[],https://www.nytimes.com/2006/04/14/world/chad-...,nyt://article/00bb19d7-2ba6-5072-8e6b-3159730d...,By Marc Lacey,True,False,False,False,False,False,False,False
2,WELFARE HOTEL CHILDREN: TOMORROW'S POOR,1987-07-16,article,News,Metropolitan Desk,New York,"[Homeless Persons, HOTELS AND MOTELS, Children...",https://www.nytimes.com/1987/07/16/nyregion/we...,nyt://article/01670df3-ae07-5eb6-8862-7bd834bf...,By Lydia Chavez,True,False,False,False,False,False,False,False
3,Wal-Mart Says Oil Prices Held Down Profits for...,2005-08-16,article,News,Business,Business Day,[Company Reports],https://www.nytimes.com/2005/08/16/business/wa...,nyt://article/0175ac61-cc62-5cdc-923c-f5efb8ec...,By Roben Farzad,True,False,False,False,False,False,False,False
4,"A Space Force? The Idea May Have Merit, Some Say",2018-06-23,article,News,Washington,U.S.,"[Space and Astronomy, United States Defense an...",https://www.nytimes.com/2018/06/23/us/politics...,nyt://article/01b8b8a5-7d0c-592a-a283-a9ccd3d8...,By Helene Cooper,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17442,Theater: ‘Look to the Lilies’ Begins Its Run a...,1970-03-30,article,Archives,,Archives,"[Theater, REVIEWS AND OTHER DATA ON SPECIFIC P...",https://www.nytimes.com/1970/03/30/archives/th...,nyt://article/be014601-0c8f-5182-9bf1-fe762c9f...,By Clive Barnes,False,False,False,False,True,False,True,False
17443,WESTCHESTER Q&A;: MARIE TRAFICANTE;\nBringing ...,1993-01-10,article,Interview,Westchester Weekly Desk,New York,"[Music, Teachers and School Employees]",https://www.nytimes.com/1993/01/10/nyregion/we...,nyt://article/12beaf0c-59ab-52e7-8449-4a20ab2c...,By Donna Greene,False,False,False,True,True,False,False,False
17444,WESTCHESTER Q&A;: MARIE TRAFICANTE;\nBringing ...,1993-01-10,article,Interview,Westchester Weekly Desk,New York,"[Music, Teachers and School Employees]",https://www.nytimes.com/1993/01/10/nyregion/we...,nyt://article/cc68f54b-25d0-5fb5-91aa-9c225d27...,By Donna Greene,False,False,False,True,True,False,False,False
17445,WESTCHESTER Q&A;: MARIE tRAFICANTE;\nBringing ...,1993-01-10,article,Interview,Westchester Weekly Desk,New York,"[Music, Teachers and School Employees]",https://www.nytimes.com/1993/01/10/nyregion/we...,nyt://article/13570b93-5f1a-5134-8b64-952ed3c6...,By Donna Greene,False,False,False,True,True,False,False,False


## Scrape sentences into a `sentences` Series

Define helpers to log in to the New York Times site and to scrape matching sentences from an article.

In [81]:
def connect_to_nyt():
    # Open browser and navigate to NYT
    browser = webdriver.Chrome(executable_path='../../chromedriver')
    browser.get('https://nyt.com')

    # Bring up login portal
    wait = WebDriverWait(browser, 15)
    login_button_XPATH = '//button[@data-testid="login-button"]'
    login_button_present = EC.presence_of_element_located((By.XPATH, login_button_XPATH))
    login_button = wait.until(login_button_present)
    login_button.click()

    # Log in
    fields_present = EC.presence_of_element_located((By.ID, 'username'))
    wait.until(fields_present).send_keys(configs['NYT']['EMAIL'])
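    # find_element(By..., ...) replaces the find_element_by_* helpers, which newer Selenium releases removed
    browser.find_element(By.ID, 'password').send_keys(configs['NYT']['PASSWORD'])
    browser.find_element(By.XPATH, login_button_XPATH).click()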
    
    return browser
    
    
def scrape_sentences(article):
    # Relies on the global `browser` returned by connect_to_nyt()
    time.sleep(4)  # pause between page loads
    url = article['url']
    
    browser.get(url)
    if 'Full text is unavailable for this digitized archive article.' in browser.page_source:
        status = 'Full text unavailable.'
        print(status) # will need to manually check these in Time Machine
        return 'time machine'
    elif 'Page Not Found' in browser.page_source:
        status = 'page not found'
        print(status)
        return status
    elif "Server Error" in browser.page_source:
        status = 'server error'
        print(status)
        return status
    else:
        paragraphs = browser.find_elements_by_tag_name('p')
        text = ' '.join([p.text for p in paragraphs])
        sentences = [x for x in tokenizer.tokenize(text) if 'deaf' in x.lower() and 'dumb' in x.lower()]
        if len(sentences) >= 1: 
            status = 'Matching sentences found: ' + str(len(sentences))
            print(status)
        else:
            status = 'Full text available, but no matching sentence found.'
            print(status)
        return sentences

Scrape sentences. Each row will contain one of four values:
- `page not found`: the page was not found
- `server error`: there was a server error
- `time machine`: no full text is available, and the article needs to be viewed through Time Machine
- `[]`: a list of matching sentence(s), empty if the full text is available but no sentence matches
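
A minimal sketch of how these outcomes could be told apart and tallied once results exist. The helper name `classify_result` and the sample values are hypothetical, made up for illustration; they are not real scrape results.

```python
from collections import Counter

# Status strings returned by scrape_sentences for pages without usable text
STATUS_STRINGS = {'page not found', 'server error', 'time machine'}


def classify_result(value):
    """Label one scrape result by outcome (hypothetical helper)."""
    if isinstance(value, str) and value in STATUS_STRINGS:
        return value
    if isinstance(value, list):
        return 'sentences found' if value else 'no matching sentence'
    return 'unknown'


# Made-up sample values standing in for real scrape results
sample_results = ['time machine', ['He was deaf and dumb.'], [], 'server error']
outcome_counts = Counter(classify_result(r) for r in sample_results)
print(outcome_counts)
```

On a populated Series, the same helper could be applied with `sentences.apply(classify_result).value_counts()`.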

In [80]:
browser = connect_to_nyt()

In [82]:
sentences = data.apply(scrape_sentences, axis=1)

Full text unavailable.
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Full text unavailable.
Full text unavailable.
Full text unavailable.
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Full text unavailable.
Matching sentences found: 1
Full text unavailable.
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Full text unavailable.
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 2
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
Matching sentences found: 1
server error
Matching sentences found: 1


KeyboardInterrupt: 

In [83]:
sentences

Series([], dtype: object)
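
The empty Series is consistent with the interrupted run: `data.apply` never returned, so nothing new was assigned to `sentences`. One possible way to keep partial results is to store each result inside an explicit loop. The sketch below assumes that approach; `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` only stands in for `scrape_sentences` so the example runs without a browser.

```python
def scrape_all(rows, scrape_fn, results=None):
    """Call scrape_fn on each (key, row) pair, storing results as it goes.

    `results` is filled in place, so whatever was collected before an
    interruption is still available afterwards. Keys already present are
    skipped, which lets a later call resume where the last one stopped.
    """
    if results is None:
        results = {}
    for key, row in rows:
        if key in results:
            continue
        try:
            results[key] = scrape_fn(row)
        except KeyboardInterrupt:
            print(f'Interrupted at {key!r}; keeping {len(results)} results.')
            break
    return results


# Stand-in for scrape_sentences: raises on the third row to simulate an interrupt
def fake_scrape(row):
    if row['id'] == 'c':
        raise KeyboardInterrupt
    return [row['id'].upper()]


rows = [(i, {'id': x}) for i, x in enumerate('abcd')]
partial = scrape_all(rows, fake_scrape)

# A second call with the same dict resumes from the first missing key
resumed = scrape_all(rows, lambda row: [row['id'].upper()], partial)
```

With the notebook's data this might be called as `results = scrape_all(data.iterrows(), scrape_sentences)` and converted afterwards with `pd.Series(results)`.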

**To do:**
- Figure out how to make `sentences` accumulate results as rows are processed. Right now it stays empty: the `KeyboardInterrupt` stopped `data.apply` before it returned, so nothing was assigned.
- Find a regex for each phrase. When an article contains the phrase, use the regex to target sentences instead of the current `if term in sentence` approach, which would get lengthy for phrases like "fall on deaf ears" that have multiple variants. The regexes could get complex as well, so maybe use something like the query format in query_nyt.ipynb.
    - If an article contains multiple phrases, e.g. has two True columns in `data`, use something like contains(regex) OR contains(regex).
- Include the headline in the sentences array for each article if the headline contains the phrase.
- For articles that are unavailable as full text, collect their URLs (rows where `scrape_sentences` returned `'time machine'`), then maybe automatically direct the browser to Time Machine. I would read each article manually and [type the sentence into Jupyter Notebook using user input](https://stackoverflow.com/questions/34968112/how-to-give-jupyter-cell-standard-input-in-python). The program would add it to the `sentences` cell for that article row, redirect the browser to the next row that needs it, and so on.
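
As a rough sketch of the regex idea in the list above: the patterns below are illustrative guesses at a few variants of each phrase, not a vetted list. `PHRASE_PATTERNS` and `matching_sentences` are hypothetical names, the dictionary keys mirror some of the boolean columns in `data`, and a simple regex splitter stands in for the punkt tokenizer so the example runs on its own.

```python
import re

# Illustrative patterns only; real coverage of each phrase's variants is an open question
PHRASE_PATTERNS = {
    'deaf_and_dumb': r'\bdeaf[\s-]+and[\s-]+dumb\b',
    'deaf_mute': r'\bdeaf[\s-]*mutes?\b',
    'fall_on_deaf_ears': r'\b(?:fall|falls|fell|fallen|falling)\s+(?:up)?on\s+deaf\s+ears\b',
    'tone_deaf': r'\btone[\s-]*deaf\b',
}


def matching_sentences(sentences, flags):
    """Return sentences matching any phrase whose flag is True (regex OR regex)."""
    active = [PHRASE_PATTERNS[name] for name, on in flags.items()
              if on and name in PHRASE_PATTERNS]
    if not active:
        return []
    combined = re.compile('|'.join(f'(?:{p})' for p in active), re.IGNORECASE)
    return [s for s in sentences if combined.search(s)]


# Made-up text; the flags play the role of one row's True/False phrase columns
text = ('The plea fell on deaf ears. The mayor was called tone-deaf. '
        'Nothing else happened.')
sample_sentences = re.split(r'(?<=[.!?])\s+', text)
flags = {'fall_on_deaf_ears': True, 'tone_deaf': True, 'deaf_and_dumb': False}
hits = matching_sentences(sample_sentences, flags)
print(hits)
```

Joining the active patterns with `|` is one way to express the contains(regex) OR contains(regex) idea in a single pass over each sentence.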

Thoughts:
- Potential logic error: I targeted "deaf" with a query formatted to exclude articles that contain other phrases like "deaf and dumb" and "hearing-impaired". Because "hearing-impaired" is not a term with "deaf" in it, I should send the query again without "hearing-impaired" as an exclusionary condition. When I submit it manually, I can see this increases the number of hits, which I'm assuming are articles that contain both "hearing-impaired" and a non-phrase "deaf".