## Step 1: Scraping the news headings

**Objectives**
- 1.1. Selecting the date range for news
- 1.2. Scrape the news headings from the source website

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!pip install requests_html

Collecting requests_html
  Downloading requests_html-0.10.0-py3-none-any.whl.metadata (15 kB)
Collecting pyquery (from requests_html)
  Downloading pyquery-2.0.1-py3-none-any.whl.metadata (9.0 kB)
Collecting fake-useragent (from requests_html)
  Downloading fake_useragent-2.0.3-py3-none-any.whl.metadata (17 kB)
Collecting parse (from requests_html)
  Downloading parse-1.20.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting bs4 (from requests_html)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting w3lib (from requests_html)
  Downloading w3lib-2.3.1-py3-none-any.whl.metadata (2.3 kB)
Collecting pyppeteer>=0.0.14 (from requests_html)
  Downloading pyppeteer-2.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting appdirs<2.0.0,>=1.4.3 (from pyppeteer>=0.0.14->requests_html)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting pyee<12.0.0,>=11.0.0 (from pyppeteer>=0.0.14->requests_html)
  Downloading pyee-11.1.1-py3-none-any.whl.metadata (2.8

In [6]:
!pip install lxml.html.clean

Collecting lxml.html.clean
  Downloading lxml_html_clean-0.4.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.1-py3-none-any.whl (14 kB)
Installing collected packages: lxml.html.clean
Successfully installed lxml.html.clean-0.4.1


In [7]:
from requests_html import HTMLSession, PyQuery as pq
session = HTMLSession()

from datetime import datetime, timedelta
import json
import numpy as np

### 1.1. Selecting the date range for news

In [9]:
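startDate = datetime(2025, 1, 1)
# endDate = datetime(2015, 1, 2)
endDate = datetime.now().date()  # today; np.arange excludes the end value
dt = timedelta(days=1)           # step of one day

# Sanity check: each element's date converts to a 'YYYY-MM-DD' string.
for i in np.arange(startDate, endDate, dt).astype(datetime):
    print(type(str(i.date())))
    break
#endfor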

<class 'str'>
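
The same kind of range can also be built with the standard library alone, without passing `datetime` objects through `np.arange`. A minimal sketch, assuming a hypothetical helper `iter_days` (not part of this notebook) that, like `np.arange`, treats the end date as exclusive:

```python
from datetime import date, timedelta


def iter_days(start, end):
    """Yield each date from start (inclusive) up to end (exclusive)."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


# Hypothetical short range for illustration; the notebook runs from 2025-01-01 up to today.
days = [d.isoformat() for d in iter_days(date(2025, 1, 1), date(2025, 1, 4))]
print(days)  # ['2025-01-01', '2025-01-02', '2025-01-03']
```

`date.isoformat()` returns the same `YYYY-MM-DD` text as `str(i.date())`, so the strings could be used directly in the archive URL.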


### 1.2. Scrape the news headings from the source website
- Source website: DAWN News, a major English-language news outlet in Pakistan
- The scraped headlines are saved in JSON format ([headlines.json](./data/news2023/headlines.json))
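
Judging from the scraping cell that follows, the JSON file maps each date string to a list of records with a `heading` and a `label`. A minimal sketch of that layout and of reading it back; the sample headline text and the temporary file path are made up for illustration and are not real scraped data:

```python
import json
import os
import tempfile

# Made-up sample in the same shape the scraper builds: {date: [{"heading", "label"}, ...]}
sample = {
    "2025-01-01": [
        {"heading": "Example business headline", "label": "business"},
        {"heading": "Example national headline", "label": "pakistan"},
    ],
    "2025-01-02": [],
}

# Write to a temporary location (an assumption; the notebook writes to Google Drive).
path = os.path.join(tempfile.mkdtemp(), "headlines.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Read the file back and collect the headings of one label.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

business = [
    item["heading"]
    for items in loaded.values()
    for item in items
    if item["label"] == "business"
]
print(business)
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in the headlines readable in the file instead of writing them as `\u` escapes.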

In [10]:
allData = {}
for i in np.arange(startDate, endDate, dt).astype(datetime):
    day = str(i.date())
    url = f'https://www.dawn.com/archive/latest-news/{day}'

    # Re-request the archive page until it contains at least one <article> element.
    while True:
        print(url)
        r = session.get(url)
        articles = r.html.find('article')

        if len(articles) > 0:
            break
        #endif

        print("no articles found, retrying")
    #endwhile

    print(f"{day} > {len(articles)} articles fetched")
    allData[day] = []

    for article in articles:
        t = pq(article.html)
        headingText = t('h2.story__title a.story__link').text()
        # The id of the first <span> is used as the section label.
        spanId = t('span').eq(0).attr('id')
        label = spanId.lower() if spanId is not None else None
        # Keep only non-empty headings from the two sections of interest.
        if len(headingText) > 0 and label in ["business", "pakistan"]:
            allData[day].append({
                "heading": headingText,
                "label": label,
            })
        #endif
    #endfor
#endfor

# Open the file only after scraping, so an interrupted run does not leave an empty file.
with open('/content/drive/MyDrive/headlines.json', 'w', encoding='utf-8') as f:
    json.dump(allData, f, ensure_ascii=False)
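
The retry loop in the cell above repeats immediately and never gives up, so a date whose page has no `article` elements would keep it spinning. A minimal sketch of a bounded retry with exponential backoff, assuming a hypothetical helper `fetch_with_retry` and a stand-in `fake_fetch` (neither is part of the notebook or of `requests_html`):

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-empty list or the attempts run out."""
    for attempt in range(max_attempts):
        articles = fetch(url)
        if articles:
            return articles
        # Wait longer after each failed attempt: base_delay, 2x, 4x, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return []  # give up; the caller decides how to handle an empty day


# Stand-in fetcher for demonstration: empty on the first two calls, then one item.
calls = {"n": 0}


def fake_fetch(url):
    calls["n"] += 1
    return ["<article>"] if calls["n"] >= 3 else []


delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(
    fake_fetch,
    "https://www.dawn.com/archive/latest-news/2025-01-01",
    sleep=delays.append,
)
print(result, delays)
```

In the notebook, `fetch` could be something like `lambda u: session.get(u).html.find('article')`, leaving the rest of the loop unchanged.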

https://www.dawn.com/archive/latest-news/2025-01-01
2025-01-01 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-02
2025-01-02 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-03
2025-01-03 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-04
2025-01-04 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-05
2025-01-05 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-06
2025-01-06 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-07
2025-01-07 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-08
2025-01-08 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-09
2025-01-09 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-10
2025-01-10 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-11
2025-01-11 > 5 articles fetched
https://www.dawn.com/archive/latest-news/2025-01-12
2025-01-12 > 5 articles 