### **Web Scraping Stock-Specific Financial News from Yahoo Finance**


### Section 1
**Choosing a Real Finance Data Source**
1. I chose **Yahoo Finance** as the data source.
2. I chose the stock-specific news page for RELIANCE.NS.
3. I verified that the page is public, structured, and finance-relevant.

### Section 2
**Web Scraping Implementation**

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 2.1
Here I used **requests.Session()** together with browser-like headers, because a plain request could not fetch the data: Yahoo Finance applies **"bot detection"**.
*The headers make my request look like it comes from a real Chrome browser*, and the session keeps cookies across requests.

In [2]:
url = "https://finance.yahoo.com/quote/RELIANCE.NS/news"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://finance.yahoo.com/",
    "Connection": "keep-alive"
}


session = requests.Session()
response = session.get(url, headers = headers, allow_redirects = True)


print(response.status_code)
print(response.text[:500])

200
<!doctype html>
<html lang="en-US" theme="auto" data-color-theme-enabled="true" data-color-scheme="auto" class="desktop neo-green dock-upscale">
    <head>
        <meta charset="utf-8" />
        <meta name="oath:guce:consent-host" content="guce.yahoo.com" />
        <link rel="preconnect" href="//s.yimg.com" crossorigin="anonymous"><link rel="preconnect" href="//geo.yahoo.com"/><link rel="preconnect" href="//query1.finance.yahoo.com"/><link rel="preconnect" href="//consent.cmp.oath.com"/><link
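
The fetch step can also be wrapped in two small helpers. This is only a sketch: the names `build_session` and `fetch_html` and the 10-second timeout are my own assumptions, not part of the run above. It sets the headers once on the session, adds a timeout so a stalled connection does not hang the notebook, and raises on HTTP error codes instead of relying on a printed status code.

```python
import requests


def build_session(headers):
    """Return a Session that sends the given headers on every request."""
    session = requests.Session()
    session.headers.update(headers)  # merged into each request made by this session
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and return its HTML text, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)  # timeout is in seconds (assumed value)
    response.raise_for_status()  # raises requests.HTTPError on an error status
    return response.text
```

With the `headers` dictionary from the cell above, usage would look like `html = fetch_html(build_session(headers), url)`.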


### 2.2
**"BeautifulSoup(response.text, "html.parser")"** takes the raw HTML string and builds a DOM-like tree,

and **"soup"** then behaves like a document that I can query.

In [3]:
soup = BeautifulSoup(response.text, "html.parser")

type(soup)

bs4.BeautifulSoup

### 2.3

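**find_all("a", class_="subtle-link")**

Scans the entire HTML tree

Returns only the <a> tags whose class list includes **subtle-link**, the class found on Yahoo's news links. Not every match is a news story, so the later steps keep only links that contain an <h3> headline

Stores them in **news_link** (a list-like ResultSet)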

In [4]:
news_link = soup.find_all("a", class_="subtle-link")

len(news_link)

76
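
To see how the **class_** filter behaves, here is a minimal sketch on a hand-written HTML snippet. The snippet and its URLs are invented for illustration and are not Yahoo's markup. It shows that `class_="subtle-link"` matches a tag even when the tag carries other classes as well.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two links carry "subtle-link", one does not.
html = """
<div>
  <a class="subtle-link titles" href="/news/a.html"><h3>First headline</h3></a>
  <a class="other-link" href="/about.html">About</a>
  <a class="subtle-link" href="/news/b.html"><h3>Second headline</h3></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any tag whose class list contains the given class.
links = soup.find_all("a", class_="subtle-link")

hrefs = [a.get("href") for a in links]
titles = [a.find("h3").get_text(strip=True) for a in links]
```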

In [5]:
news_link[1]

<a aria-label="India's Reliance to buy up to 150,000 bpd of Russian oil from February" class="subtle-link fin-size-small titles noUnderline yf-119g04z" data-ylk="elm:hdln;elmt:link;itc:0;ct:story;slk:India's%20Reliance%20to%20buy%20up%20to%20150%2C000%20bpd%20of%20Russian%20oil%20from%20February;sec:unspecified-block;subsec:all;cpos:1;g:01e6b459-7685-3b32-ab50-5e81ef527978" href="https://finance.yahoo.com/news/indias-reliance-buy-150-000-123912146.html" title="India's Reliance to buy up to 150,000 bpd of Russian oil from February"><h3 class="clamp yf-1u32w3i">India's Reliance to buy up to 150,000 bpd of Russian oil from February</h3> </a>

### 2.4
Next, I extract data from **news_link**:
the headline and the URL of a single news article.

In [6]:
sample = news_link[1]

headline = sample.find("h3").get_text(strip=True)
article_url = sample.get("href")

headline, article_url


("India's Reliance to buy up to 150,000 bpd of Russian oil from February",
 'https://finance.yahoo.com/news/indias-reliance-buy-150-000-123912146.html')

### 2.5

Here I extract the source and the ***time*** the article was published.


In [7]:
sample = news_link[3]

headline = sample.find("h3").get_text(strip=True)

article_url = sample.get("href")

meta = sample.find_next("div", class_ = "publishing")

meta_text = meta.get_text(" ", strip = True)

headline, article_url, meta_text


("India's Reliance to buy sanctions-compliant Russian oil in February and March, sources say",
 'https://finance.yahoo.com/news/indias-reliance-buy-sanctions-compliant-132556976.html',
 'Reuters • 8d ago')
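
One thing to keep in mind about **find_next**: it searches forward through the rest of the document, not only inside the link. The sketch below uses invented markup (the sources "Source A" and "Source B" are placeholders) where every link is followed by its own **publishing** block, so each link pairs with the right one. If a link had no block of its own, **find_next** would return the next link's block instead.

```python
from bs4 import BeautifulSoup

# Invented sample markup: each link is followed by its own "publishing" div.
html = """
<section>
  <a class="subtle-link" href="/news/a.html"><h3>Headline A</h3></a>
  <div class="publishing">Source A • 3h ago</div>
</section>
<section>
  <a class="subtle-link" href="/news/b.html"><h3>Headline B</h3></a>
  <div class="publishing">Source B • 1d ago</div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for a in soup.find_all("a", class_="subtle-link"):
    # First matching div that appears after this link in document order.
    meta = a.find_next("div", class_="publishing")
    pairs.append((a.find("h3").get_text(strip=True),
                  meta.get_text(" ", strip=True)))
```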

In [8]:
source, published_time = meta_text.split("•", 1)  # split only at the first bullet

source = source.strip()
published_time = published_time.strip()

source, published_time

('Reuters', '8d ago')
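
The split can also be written as a small defensive helper. This is a sketch, and the name `split_meta` is my own. It uses `str.partition`, which always returns three parts, so text without a bullet gives `None` instead of raising an unpacking error.

```python
def split_meta(meta_text, sep="•"):
    """Split 'Source • age' into (source, age); return None if sep is absent."""
    source, found, age = meta_text.partition(sep)  # splits at the first sep only
    if not found:
        return None
    return source.strip(), age.strip()


example = split_meta("Reuters • 8d ago")
```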

### 2.6
Using the code above, I ran a loop to store every record in the list **news_data**, skipping links that have no headline or no publishing info,
and used **datetime** and **timezone** from the datetime module to timestamp when I scraped the data.

In [9]:
from datetime import datetime, timezone

news_data = []

for link in news_link:
    h3 = link.find("h3")
    if not h3:
        continue

    headline = h3.get_text(strip=True)
    article_url = link.get("href")

    meta = link.find_next("div", class_="publishing")
    if not meta:
        continue

    meta_text = meta.get_text(" ", strip=True)

    if "•" not in meta_text:
        continue

    source, published_time = meta_text.split("•", 1)

    news_data.append({
        "stock": "RELIANCE.NS",
        "headline": headline,
        "article_url": article_url,
        "source": source.strip(),
        "published_time": published_time.strip(),
        "scraped_at": datetime.now(timezone.utc)
    })


In [10]:
news_data[3]

{'stock': 'RELIANCE.NS',
 'headline': 'How The Story Behind Reliance Industries (NSEI:RELIANCE) Fair Value Is Quietly Changing',
 'article_url': 'https://finance.yahoo.com/news/story-behind-reliance-industries-nsei-001226666.html',
 'source': 'Simply Wall St.',
 'published_time': '9d ago',
 'scraped_at': datetime.datetime(2026, 1, 29, 16, 37, 44, 520704, tzinfo=datetime.timezone.utc)}
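
Because **published_time** is relative ("9d ago"), it only means something together with **scraped_at**. A possible way to turn the two into an approximate publication timestamp is sketched below. The function name `approx_published_at` is my own, and the unit letters `m`, `h` and `d` are an assumption based on the values seen in this run; anything else returns `None`.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit letters; only "h" and "d" appear in the scraped sample.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
_PATTERN = re.compile(r"^(\d+)\s*([mhd])\s+ago$")


def approx_published_at(published_time, scraped_at):
    """Subtract a relative age such as '8d ago' from the scrape timestamp."""
    match = _PATTERN.match(published_time.strip())
    if match is None:
        return None  # unknown format, leave it for manual inspection
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_at - timedelta(**{_UNITS[unit]: amount})


scraped = datetime(2026, 1, 29, 16, 37, 44, tzinfo=timezone.utc)
published = approx_published_at("8d ago", scraped)
```

The result is only as precise as the label: "8d ago" is accurate to the day, not to the second.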

### Section 3
**Cleaning, Validation & Storage**

1. Convert the raw data to a structured **DataFrame** using pandas.
2. Check that the columns are in the correct order and have the expected types.
3. Check for **null** or **duplicate** values.
4. Save the DataFrame as a **CSV** file.

In [11]:
df = pd.DataFrame(news_data)

df.sample(5)
df.shape

(20, 6)

In [12]:
df.columns


Index(['stock', 'headline', 'article_url', 'source', 'published_time',
       'scraped_at'],
      dtype='object')

In [13]:
df.isna().sum()

stock             0
headline          0
article_url       0
source            0
published_time    0
scraped_at        0
dtype: int64
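
The null check is shown above; a duplicate check could look like the sketch below. It runs on a small invented frame (the `example.com` URLs are placeholders), and treating **article_url** as the key that identifies an article is my assumption.

```python
import pandas as pd

# Invented rows: the third row repeats the first article.
rows = [
    {"headline": "Headline A", "article_url": "https://example.com/a"},
    {"headline": "Headline B", "article_url": "https://example.com/b"},
    {"headline": "Headline A", "article_url": "https://example.com/a"},
]
demo = pd.DataFrame(rows)

# Count rows whose URL already appeared earlier in the frame.
n_duplicates = int(demo.duplicated(subset=["article_url"]).sum())

# Keep the first occurrence of each URL and renumber the index.
deduped = (
    demo.drop_duplicates(subset=["article_url"], keep="first")
        .reset_index(drop=True)
)
```

On the real data the same two calls would be applied to `df` instead of `demo`.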

In [14]:
df.dtypes

stock                          object
headline                       object
article_url                    object
source                         object
published_time                 object
scraped_at        datetime64[ns, UTC]
dtype: object

In [15]:
output_file = "reliance_yahoo_finance_news.csv"

df.to_csv(output_file, index = False)

In [16]:
import os

os.path.exists(output_file), os.path.getsize(output_file)


(True, 4469)

In [17]:
pd.read_csv(output_file).head(2)

Unnamed: 0,stock,headline,article_url,source,published_time,scraped_at
0,RELIANCE.NS,"India's Reliance to buy up to 150,000 bpd of R...",https://finance.yahoo.com/news/indias-reliance...,Reuters,3h ago,2026-01-29 16:37:44.520493+00:00
1,RELIANCE.NS,India's Reliance to buy sanctions-compliant Ru...,https://finance.yahoo.com/news/indias-reliance...,Reuters,8d ago,2026-01-29 16:37:44.520570+00:00
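
A CSV file stores **scraped_at** as text, so the column does not come back as a datetime on its own. A minimal sketch of restoring it is below. It reads from an in-memory string with one made-up row instead of the real file, and converts the column with `pd.to_datetime(..., utc=True)`.

```python
import io

import pandas as pd

# In-memory stand-in for the saved CSV; the row is illustrative.
csv_text = (
    "stock,headline,scraped_at\n"
    "RELIANCE.NS,Example headline,2026-01-29 16:37:44.520493+00:00\n"
)

restored = pd.read_csv(io.StringIO(csv_text))

# Parse the text column back into timezone-aware UTC timestamps.
restored["scraped_at"] = pd.to_datetime(restored["scraped_at"], utc=True)

first_ts = restored["scraped_at"].iloc[0]
```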
