# 📰 New York Post

## 📌 Instructions

1. Enter a **search term** (e.g., `"economy"`, `"sports"`, `"technology"`).  
2. Define the **page range** (`start_page` and `end_page`) to scrape multiple pages of results.  
3. The script retrieves:
   - Title  
   - Date 
   - Link  
   - Full article content  
4. The results are stored in a **pandas DataFrame** and can be exported to CSV:

```python
nypost_df.to_csv("data_nypost_df.csv", index=False)
```

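The page range in step 2 maps to one search URL per page. The sketch below shows that mapping in isolation, using the URL pattern from the scraper in this notebook. The helper name `build_search_urls` is hypothetical, and percent-encoding the search term is an assumption about what the site accepts for multi-word queries.

```python
from urllib.parse import quote


def build_search_urls(search_term, start_page=1, end_page=1):
    """Return one search URL per page in the inclusive page range."""
    # Assumption: the site accepts a percent-encoded term in the URL path.
    encoded = quote(search_term, safe="")
    return [
        f"https://nypost.com/search/{encoded}/page/{page}/?orderby=relevance"
        for page in range(start_page, end_page + 1)
    ]


for url in build_search_urls("interest rates", start_page=1, end_page=3):
    print(url)
```

Because the range is inclusive, `start_page=1, end_page=3` yields three URLs, and an `end_page` smaller than `start_page` yields none.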
In [None]:
import re
import time
from urllib.parse import quote

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_nypost_articles(search_term, start_page=1, end_page=1):
    """Scrape NY Post search results and return one DataFrame row per article."""
    all_articles = []
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    # Set up headless Selenium driver
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)

    try:
        for page in range(start_page, end_page + 1):
            # URL-encode the term so multi-word searches produce a valid path
            encoded_term = quote(search_term, safe="")
            url = f"https://nypost.com/search/{encoded_term}/page/{page}/?orderby=relevance"
            response = requests.get(url, headers=headers, timeout=30)

            if response.status_code != 200:
                print(f"Skipping page {page}: HTTP {response.status_code}")
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            for h3 in soup.find_all("h3", class_="story__headline headline headline--archive"):
                a_tag = h3.find("a")
                # Skip headlines that carry no link
                if a_tag is None or not a_tag.get("href"):
                    continue
                title = a_tag.get_text(strip=True)
                link = a_tag["href"]

                content = fetch_article_content_with_selenium(driver, link)
                all_articles.append({
                    "Title": title,
                    "Link": link,
                    "Content": content
                })
    finally:
        # Always close the browser, even if a request fails
        driver.quit()

    # Fixed columns keep the later Date step working when no articles are found
    return pd.DataFrame(all_articles, columns=["Title", "Link", "Content"])

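def fetch_article_content_with_selenium(driver, url):
    """Load an article in the browser and return its paragraphs joined by newlines."""
    driver.get(url)
    time.sleep(3)  # Fixed pause so dynamically loaded content can render
    soup = BeautifulSoup(driver.page_source, "html.parser")
    content_element = soup.select_one(".single__content.entry-content.m-bottom")

    # Return None when the page has no article body container
    if content_element is None:
        return None

    paragraphs = content_element.find_all("p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)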

# Function to extract date in DD-MM-YYYY format
def extract_date_ddmmyyyy(url):
    match = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', url)
    if match:
        return f"{match.group(3)}-{match.group(2)}-{match.group(1)}"
    return None

search_term = "economy" # change to "sports", "technology", etc.
nypost_df = fetch_nypost_articles(search_term, start_page=1, end_page=1)

nypost_df['Date'] = nypost_df['Link'].apply(extract_date_ddmmyyyy)
nypost_df["Date"] = pd.to_datetime(nypost_df["Date"], dayfirst=True, errors="coerce")
cols = ['Title', 'Date', 'Link', 'Content']
nypost_df = nypost_df[cols]

print(f"\nNY Post Articles Data (Keyword: '{search_term}'):")
nypost_df.head(10)


NY Post Articles Data (Keyword: 'economy'):


|   | Title | Date | Link | Content |
|---|---|---|---|---|
| 0 | Meghan Markle called out for Archie’s expensiv... | 2025-05-12 | https://nypost.com/2025/05/12/entertainment/me... | In this economy?\nMeghan Markle got ripped ove... |
| 1 | Trump's economy is already proving the doomsay... | 2025-07-30 | https://nypost.com/2025/07/30/opinion/trumps-e... | So much for the prophecies of Trumponomics doo... |
| 2 | Delta ditches 'basic economy' in ticketing ove... | 2025-05-17 | https://nypost.com/2025/05/17/lifestyle/delta-... | Delta is grounding Basic Economy.\nThe airline... |
| 3 | Why the economy may never return to pre-tariff... | 2025-07-08 | https://nypost.com/2025/07/08/business/why-the... | Terminal tariffs: More than a quarter of Ameri... |
| 4 | Labubu craze could spell doom for the economy,... | 2025-08-09 | https://nypost.com/2025/08/09/us-news/labubu-c... | Labubu dolls have been spotted dangling from L... |
| 5 | Trump can apply real pressure on Russia's coll... | 2025-07-28 | https://nypost.com/2025/07/28/opinion/trump-ca... | Considering Russian President Vladimir Putin w... |
| 6 | Soak-the-rich tax hike a disaster for the econ... | 2025-04-16 | https://nypost.com/2025/04/16/opinion/soak-the... | As longtime advocates for pro-growth economic ... |
| 7 | Longshoremen's strike poses a fresh threat to ... | 2024-09-30 | https://nypost.com/2024/09/30/opinion/longshor... | East and Gulf Coast dockworkers areset to go o... |
| 8 | Passenger reveals little-known hack to score a... | 2025-03-24 | https://nypost.com/2025/03/24/lifestyle/etihad... | It was a poor man’s first class.\nA UK passeng... |
| 9 | Trump's bizarro-world tariffs will turn econom... | 2025-03-31 | https://nypost.com/2025/03/31/opinion/trumps-b... | In the comics, the supervillain Bizarro is the... |

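The `Date` column above comes from the `/YYYY/MM/DD/` segment of each article link. As an alternative to applying a Python function row by row, the same extraction can be sketched with vectorized pandas string methods. The sample links below are hypothetical, and this is one possible approach rather than the notebook's own method.

```python
import pandas as pd

# Hypothetical sample links; the second one has no date segment.
sample = pd.DataFrame({
    "Link": [
        "https://nypost.com/2025/05/12/entertainment/example-story/",
        "https://nypost.com/video/example-clip/",
    ]
})

# Capture "YYYY/MM/DD" from the URL path; rows without a match become NaN.
date_part = sample["Link"].str.extract(r"/(\d{4}/\d{2}/\d{2})/", expand=False)

# Parse with an explicit format; unmatched rows become NaT.
sample["Date"] = pd.to_datetime(date_part, format="%Y/%m/%d", errors="coerce")
print(sample["Date"])
```

Parsing with an explicit `format` avoids the intermediate DD-MM-YYYY string and the `dayfirst` flag, so day and month cannot be swapped by accident.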

In [None]:
nypost_df.to_csv("data_nypost_df.csv", index=False)
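
When the exported file is loaded again, the `Date` column comes back as plain text unless it is parsed explicitly. The sketch below shows a round trip through an in-memory buffer instead of a file on disk. `demo_df` is a hypothetical one-row stand-in for `nypost_df`.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for nypost_df.
demo_df = pd.DataFrame({
    "Title": ["Example headline"],
    "Date": pd.to_datetime(["2025-05-12"]),
    "Link": ["https://nypost.com/2025/05/12/business/example-story/"],
    "Content": ["First paragraph.\nSecond paragraph."],
})

# Write without the index, as in the export cell above.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the Date column as datetimes.
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```

Writing with `index=False` also keeps an extra `Unnamed: 0` column from appearing when the CSV is read back.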