# Scrape NYT

### Sample HTML tag

```html
<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>
```

Note that we need to extract both the headline and the summary.

### Code
(you may have to install BeautifulSoup first: `pip install beautifulsoup4`)

In [1]:
from bs4 import BeautifulSoup

In [2]:
html_element = """<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>"""

In [3]:
soup = BeautifulSoup(html_element, 'html.parser')

In [4]:
headline1 = soup.find('section', class_='story-wrapper')
headline1.find_all('p')[1], headline1.find_all('p')[2]

(<p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p>,
 <p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p>)

In [5]:
title_and_summary_tag = headline1.find_all('p')
title = title_and_summary_tag[1].text
summary = title_and_summary_tag[2].text

title_and_summary = title + ". " + summary
title_and_summary

'In Pardoning His Son, Biden Echoes Some of Trump’s Complaints. President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.'

In [6]:
def get_text(html_element):
    """Join the text of the first two <p> tags found in a story wrapper."""
    title_and_summary_tag = html_element.find_all('p')

    if len(title_and_summary_tag) == 0:
        return None

    if len(title_and_summary_tag) < 2:  # This function is not very robust :(
        return title_and_summary_tag[0].text

    # Note: index 0 may be a label such as "Analysis" rather than the headline.
    title   = title_and_summary_tag[0].text
    summary = title_and_summary_tag[1].text

    return title + ". " + summary

In [33]:
get_text(headline1)

'Analysis. In Pardoning His Son, Biden Echoes Some of Trump’s Complaints'
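
The output above shows `get_text` returning the "Analysis" label as if it were the headline, because it takes the first two `<p>` tags by position. A minimal sketch of a class-based alternative follows. It assumes the `indicate-hover` and `summary-class` class names seen in the sample HTML stay stable (the `css-...` suffixes look generated and are ignored). `get_text_by_class` is a hypothetical name, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def get_text_by_class(story):
    """Return 'headline. summary' for one story wrapper, or None if no headline."""
    # class_ matches a tag if any one of its classes equals the given string.
    title_tag = story.find('p', class_='indicate-hover')      # assumed headline class
    summary_tag = story.find('p', class_='summary-class')     # assumed summary class

    if title_tag is None:
        return None

    title = title_tag.get_text(strip=True)
    if summary_tag is None:
        return title
    return title + ". " + summary_tag.get_text(strip=True)


# Trimmed-down copy of the sample tag, kept short for readability.
sample = BeautifulSoup(
    '<section class="story-wrapper">'
    '<p class="css-tdd4a3"><span>Analysis</span></p>'
    '<p class="indicate-hover css-91bpc3">In Pardoning His Son, '
    'Biden Echoes Some of Trump’s Complaints</p>'
    '<p class="summary-class css-1l5zmz6">President Biden and President-elect '
    'Trump now agree on one thing: The Biden Justice Department has been '
    'politicized.</p>'
    '<p class="css-1a0ymrn">7 min read</p>'
    '</section>',
    'html.parser',
)
print(get_text_by_class(sample))
```

Selecting by class skips labels such as "Analysis" and trailing tags such as "7 min read", at the cost of depending on class names the site may change.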

### Find ALL headlines

First, we download the front page.

In [34]:
import requests

In [35]:
%%time
response = requests.get('https://www.nytimes.com/')

CPU times: user 33.2 ms, sys: 1.13 ms, total: 34.3 ms
Wall time: 243 ms
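
The call above has no timeout and does not check whether the request succeeded. A hedged sketch of a slightly more defensive download is shown here. The function name `fetch_front_page` and the 10-second timeout are assumptions chosen for illustration; `timeout=` and `raise_for_status()` are standard parts of `requests`.

```python
import requests


def fetch_front_page(url='https://www.nytimes.com/', timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing later.
    response.raise_for_status()
    return response.text
```

Calling `fetch_front_page()` would then return the same text as `response.text`, or raise an exception if the site answers with an error status.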


In [36]:
response

<Response [200]>

In [37]:
print(response.text[:500])

<!DOCTYPE html>
<html lang="en" class=" nytapp-vi-homepage "  data-nyt-compute-assignment="fallback" xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    
    
    <meta charset="utf-8" />
    <title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>
    <meta data-rh="true" name="description" content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscri


In [38]:
html = BeautifulSoup(response.text, 'html.parser')

In [39]:
html.find_all(class_="story-wrapper")[:5]

[<div class="story-wrapper css-1e505by" data-tpl="sli"><div class="css-114aoa5" style="flex-direction:column-reverse"><div style="flex-grow:1"><div class="css-ep7xq6" data-tpl="b"></div><div class="css-1wzkfo3" data-tpl="slic"><div class="css-12yuaq1" data-tpl="la"><div class="css-1vb0fst" data-tpl="f"><p class="css-ae0yjg"><span class="css-12tlih8">LIVE</span></p><span class="css-1ufpbe9"><time class="css-16lxk39" datetime="2025-07-07T19:54:02.987Z"><div class="css-ki347z"><span aria-hidden="true" class="css-1stvlmo" data-time="abs">July 7, 2025, 3:54 p.m. ET</span><span class="css-kpxlkr" data-time="rel"></span></div></time></span></div></div><div class="css-cfnhvx" data-tpl="h"><a class="tpl-lbl css-5mgoji" data-tpl="l" href="https://www.nytimes.com/live/2025/07/07/us/trump-news"><div class="css-xdandi" data-tpl="b"><p class="indicate-hover css-1ixq7yl">Trump Threatens Japan and S. Korea With 25% Tariffs as He Presses for Deals</p></div></a></div><div class="css-sarx3u" data-tpl="bo

### Extract headlines

In [40]:
html.find_all(class_="story-wrapper")[0]

<div class="story-wrapper css-1e505by" data-tpl="sli"><div class="css-114aoa5" style="flex-direction:column-reverse"><div style="flex-grow:1"><div class="css-ep7xq6" data-tpl="b"></div><div class="css-1wzkfo3" data-tpl="slic"><div class="css-12yuaq1" data-tpl="la"><div class="css-1vb0fst" data-tpl="f"><p class="css-ae0yjg"><span class="css-12tlih8">LIVE</span></p><span class="css-1ufpbe9"><time class="css-16lxk39" datetime="2025-07-07T19:54:02.987Z"><div class="css-ki347z"><span aria-hidden="true" class="css-1stvlmo" data-time="abs">July 7, 2025, 3:54 p.m. ET</span><span class="css-kpxlkr" data-time="rel"></span></div></time></span></div></div><div class="css-cfnhvx" data-tpl="h"><a class="tpl-lbl css-5mgoji" data-tpl="l" href="https://www.nytimes.com/live/2025/07/07/us/trump-news"><div class="css-xdandi" data-tpl="b"><p class="indicate-hover css-1ixq7yl">Trump Threatens Japan and S. Korea With 25% Tariffs as He Presses for Deals</p></div></a></div><div class="css-sarx3u" data-tpl="bo"

In [41]:
html.find_all(class_="story-wrapper")[0].find_all('p')

[<p class="css-ae0yjg"><span class="css-12tlih8">LIVE</span></p>,
 <p class="indicate-hover css-1ixq7yl">Trump Threatens Japan and S. Korea With 25% Tariffs as He Presses for Deals</p>,
 <p class="summary-class css-crclbt">President Trump said the tariffs would take effect on Aug. 1 if no trade deals are reached. Markets fell sharply on the news.</p>,
 <p class="css-16lw6zo"> </p>]

In [42]:
for e in html.find_all(class_="story-wrapper")[:15]:
    print(get_text(e))

LIVE. Trump Threatens Japan and S. Korea With 25% Tariffs as He Presses for Deals
Trump Says BRICS-Aligned Countries Could Face Extra Tariffs. 2 min read
LIVE. Death Toll Reaches 95 in Texas Floods, With 27 Killed at Summer Camp
See How Fast Floodwaters Rose Along Guadalupe River
None
None
None
None
None
None
North Carolina Faces Widespread Flooding After Chantal Dumps Heavy Rain. The storm flooded roads, downed trees and stranded residents across the central part of the state. It is heading northeast toward Washington, D.C.
Tracking Post-Tropical Cyclone Chantal
What’s at Stake as Benjamin Netanyahu and Trump Meet in Washington. President Trump is considering whether to pursue a new nuclear agreement with Tehran. He is also urging a new cease-fire deal to end the fighting in Gaza.
Where Do Israel-Hamas Truce Negotiations Stand?. 4 min read


In [43]:
headlines = [get_text(headline) for headline in html.find_all(class_="story-wrapper")]

In [44]:
headlines[:5]

['LIVE. Trump Threatens Japan and S. Korea With 25% Tariffs as He Presses for Deals',
 'Trump Says BRICS-Aligned Countries Could Face Extra Tariffs. 2 min read',
 'LIVE. Death Toll Reaches 95 in Texas Floods, With 27 Killed at Summer Camp',
 'See How Fast Floodwaters Rose Along Guadalupe River',
 None]

In [45]:
len(headlines)

126
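
The count above includes the `None` entries returned for wrappers without any `<p>` tag. The sketch below shows one way to clean the list before using it. `clean_headlines` is a hypothetical helper, and the duplicate removal assumes the same story can appear in more than one wrapper, which the notebook does not verify.

```python
def clean_headlines(headlines):
    """Drop None entries and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for headline in headlines:
        if headline is None or headline in seen:
            continue
        seen.add(headline)
        cleaned.append(headline)
    return cleaned


# Small illustrative list built from entries shown earlier.
example = [
    'See How Fast Floodwaters Rose Along Guadalupe River',
    None,
    'Tracking Post-Tropical Cyclone Chantal',
    'See How Fast Floodwaters Rose Along Guadalupe River',
]
print(len(clean_headlines(example)))  # 2
```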

### Write headlines to file

#### Create the filename

In [46]:
import datetime

In [47]:
datetime.datetime.today()

datetime.datetime(2025, 7, 7, 14, 55, 31, 766091)

In [48]:
datetime.datetime.today().strftime('%Y-%m-%d')

'2025-07-07'

In [49]:
TODAY = datetime.datetime.today().strftime('%Y-%m-%d')

In [50]:
TODAY

'2025-07-07'
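
For this particular `YYYY-MM-DD` layout, `date.isoformat()` gives the same string as the `strftime` call. A short check with a fixed date (chosen only to match the output above):

```python
import datetime

day = datetime.date(2025, 7, 7)

# '%Y-%m-%d' and ISO 8601 date format coincide.
same = day.strftime('%Y-%m-%d') == day.isoformat()
print(same)             # True
print(day.isoformat())  # 2025-07-07
```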

In [53]:
filename = f"../data/headlines_nyt_{TODAY}.txt"
filename

'../data/headlines_nyt_2025-07-07.txt'

In [54]:
with open(filename, 'w', encoding='utf-8') as output_file:
    for headline in headlines:
        if headline is None: continue
        output_file.write(headline + '\n')
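
The `open` call above fails with `FileNotFoundError` if the `../data` directory does not exist yet. Here is a sketch of the same write step that creates the directory first, using `pathlib`. `write_headlines` and its parameters are hypothetical names introduced for illustration.

```python
from pathlib import Path


def write_headlines(headlines, directory, day):
    """Write non-None headlines, one per line, to headlines_nyt_<day>.txt."""
    out_dir = Path(directory)
    # parents=True creates missing parent folders; exist_ok=True avoids an
    # error when the directory is already there.
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / f"headlines_nyt_{day}.txt"
    with path.open('w', encoding='utf-8') as output_file:
        for headline in headlines:
            if headline is None:
                continue
            output_file.write(headline + '\n')
    return path
```

With the variables from this notebook, the call would look like `write_headlines(headlines, '../data', TODAY)`.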