<a href="https://colab.research.google.com/github/aakhterov/ML_projects/blob/master/news_sentiment_analysis/collect_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collecting a news dataset

The idea is to fine-tune one of the LLMs from the Hugging Face ecosystem for news sentiment analysis with respect to pro- and anti-Israel attitudes (the collected dataset can also be used for other NLP tasks). One of the main issues here is data labeling. To overcome it, we assume that almost all news from the Palestinian "news" agency "WAFA" and the Lebanon-based channel "Al Mayadeen" takes an anti-Israel position. Most Al Jazeera news also tends to be against Israel. In contrast, news from "The Times of Israel" is mostly pro-Israel.
For example, the following piece of news carries an anti-Israel pattern:

> *KHAN YUNIS, Sunday, December 10, 2023 (WAFA) - At least 10 civilians were killed, mostly children, and dozens more were wounded early this morning as Israeli warplanes bombed a residential house in Khan Yunis, south of the Gaza Strip, as the Israeli aggression on the enclave enters its 65th day in a row. (WAFA "news" agency)*

Conversely, the following item from "The Times of Israel" is pro-Israel:

>*Several thousand people demonstrate against antisemitism in Berlin as Germany grapples with a large increase in anti-Jewish incidents following Hamas’s assault on Israel two months ago. Police estimate that around 3,200 people gathered in the rain in the German capital, while organizers put the figure at 10,000, German news agency dpa reports. Participants in the protest, titled “Never again is now,” march to the Brandenburg Gate.*

We've accumulated news from the following sources:

1. BBC (live news) - from 2023-11-05 to 2023-11-18. Total: 805
2. The Times of Israel (live news) - from 2023-10-07 to 2023-11-18. Total: 6581
3. Al Jazeera (live news) - from 2023-11-04 to 2023-11-25. Total: 3297
4. Al Mayadeen (articles from the site) - from 2023-10-08 to 2023-11-24. Total: 74
5. WAFA "News" Agency (articles from the site section "Occupation") - from 2023-09-28 to 2023-11-26. Total: 1020
6. CNN (live news) - from 2023-10-26 to 2023-11-26. Total: 1428

Since we aim to use the Hugging Face ecosystem and its libraries for model fine-tuning, we store the data in a `Dataset` object from the Hugging Face `datasets` package.
The dataset has the following fields:
- "url" - link to the piece of news;
- "datetime" - news date and time (YYYY-mm-ddTHH:MM:SS);
- "author" - news author if provided, otherwise "-";
- "title" - news title;
- "text" - news text;
- "provider" - news provider;
- "source" - the part of the provider's site from which the news was collected.
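
As a rough illustration of the schema above, the sketch below builds one record as a plain Python dictionary and formats its "datetime" value as YYYY-mm-ddTHH:MM:SS. The helper name `make_record` and all sample values (including the placeholder URL) are hypothetical and are not part of the original notebook.

```python
from datetime import datetime

FIELDS = ["url", "datetime", "author", "title", "text", "provider", "source"]
DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def make_record(url, post_datetime, author, title, text, provider, source):
    """Hypothetical helper: build one row with the dataset's seven fields."""
    return {
        "url": url,
        # Store the timestamp as a string, or None when it could not be parsed
        "datetime": post_datetime.strftime(DATETIME_FORMAT) if post_datetime else None,
        # A missing author is stored as "-"
        "author": author if author else "-",
        "title": title,
        "text": text,
        "provider": provider,
        "source": source,
    }


record = make_record(
    url="https://example.com/news/1",  # placeholder URL, for illustration only
    post_datetime=datetime(2023, 11, 18, 19, 30, 14),
    author=None,
    title="Example title",
    text="Example text",
    provider="BBC",
    source="site-live-news",
)
print(record)
```

A dictionary of lists with the same keys (one list per field) is what the collection cells below accumulate before turning it into a `Dataset`.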

All news was collected using the Beautiful Soup library. For the BBC, WAFA and Al Mayadeen sites, simple GET requests were enough; in the other cases we used Selenium.

The final dataset can be found [here](https://huggingface.co/datasets/aav-ds/Israel-HAMAS_war_news)

In [142]:
!pip install datasets nltk huggingface_hub



In [88]:
import requests
import re
import unicodedata
import time
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from datasets import Dataset, load_dataset, load_from_disk, concatenate_datasets
from tqdm.notebook import tqdm

In [3]:
from google.colab import drive
drive.mount('/content/drive')
base_path = '/content/drive/MyDrive/Colab Notebooks/'

Mounted at /content/drive


## 1. Collecting news from the BBC site

In [None]:
PAUSE_BETWEEN_QUERIES = 1

In [None]:
current_live_url = 'https://www.bbc.com/news/live/world-middle-east-67446662'

In [None]:
# We will collect news to the following Python dictionary
data = {"url": [], "datetime": [], "author": [], "title": [], "text": [], "provider": [], "source": []}
current_date = datetime(2023, 11, 18)

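# Starting from "current_live_url", grab the news from this page and look for the link to the previous day at the bottom of the page.

# "total_pages" must be known before the first iteration, so read it from the first live page
# in the same way the loop does for every previous day.
result = requests.get(current_live_url)
body = BeautifulSoup(result.text, 'html.parser')
total_pages = int(body.find('span', class_=re.compile('qa-pagination-total-page-number')).string)

while current_live_url and current_date>=datetime(2023, 10, 7):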
  print(f"current_date: {current_date.strftime('%Y-%m-%d')}")
  print(f"current_live_url: {current_live_url}, total pages: {total_pages}")

  for page in tqdm(range(1, total_pages+1), 'Pages: '):
    url = f"{current_live_url}/page/{page}"
    result = requests.get(url)
    body = BeautifulSoup(result.text, 'html.parser')

    for article in body.find_all('article'):
      title_block = article.find('span', class_=re.compile('header-text'))
      title = title_block.string if title_block is not None else title_block

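      time_block = article.find('time')
      # Guard against articles without a <time> tag before looking for the nested span
      time_block = time_block.find('span', class_=re.compile("qa-post-auto-meta")) if time_block is not None else None
      if time_block is not None: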
        datetime_parts = time_block.string.split()
        today = datetime.today()
        if len(datetime_parts) == 1:
          datetime_str = f"{today.year}-{today.month}-{today.day}T{datetime_parts[0]}:00"
          post_datetime = datetime.strptime(datetime_str, '%Y-%m-%dT%H:%M:%S')
        else:
          month = datetime.strptime(datetime_parts[2], '%b').month
          day = datetime_parts[1]
          datetime_str = f"{today.year}-{month}-{day}T{datetime_parts[0]}:00"
          post_datetime = datetime.strptime(datetime_str, '%Y-%m-%dT%H:%M:%S')

        current_date = post_datetime
      else:
        post_datetime = None

      contributor_block = article.find('div', class_=re.compile('contributor-body'))
      if contributor_block is not None:
        name = contributor_block.find('p', class_=re.compile('contributor-name'))
        name = name.string if name else ''
        role = contributor_block.find('p', class_=re.compile('contributor-role'))
        role = role.string if role else ''
        author = f"{name};{role}"
      else:
        author = None

      text_block = article.find('div', class_=re.compile('post-body'))
      text = []
      if text_block is not None:
        text = [p_block.text if p_block is not None else '' for p_block in text_block.find_all('p')]

      text = '\n'.join(text)

      data['url'].append(url)
      data['title'].append(title)
      data['datetime'].append(post_datetime.strftime('%Y-%m-%dT%H:%M:%S') if post_datetime else None)
      data['author'].append(author if author else '-')
      data['text'].append(text)
      data['provider'].append('BBC')
      data['source'].append('site-live-news')

    time.sleep(PAUSE_BETWEEN_QUERIES)

  current_live_url = article.find('a', href=re.compile('https://www.bbc.com/news/live/world-middle-east'))
  if current_live_url is None:
    current_live_url = article.find('a', href=re.compile('https://www.bbc.co.uk/news/live/world-middle-east'))

  if current_live_url is not None:
    current_live_url = current_live_url.get('href')
    result = requests.get(current_live_url)
    body = BeautifulSoup(result.text, 'html.parser')
    total_pages = int(body.find('span', class_=re.compile('qa-pagination-total-page-number')).string)
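
The timestamp handling in the loop above can be isolated into a small function. The sketch below is a minimal version that assumes, as the loop does, that a post's time string is either "HH:MM" (posted on the current day) or "HH:MM D Mon" (e.g. "23:05 17 Nov"). The function name `parse_bbc_timestamp` and the explicit `reference` argument are assumptions; the loop itself takes the year and the missing date parts from `datetime.today()`, which only gives the right result when the notebook is run in the same year (and, for time-only posts, on the same day) as the live page.

```python
from datetime import datetime


def parse_bbc_timestamp(raw, reference):
    """Parse "HH:MM" or "HH:MM D Mon" using `reference` for the missing parts."""
    parts = raw.split()
    hour, minute = (int(x) for x in parts[0].split(":"))
    if len(parts) == 1:
        # Only the time is given: the post belongs to the reference day
        day, month = reference.day, reference.month
    else:
        day = int(parts[1])
        month = datetime.strptime(parts[2], "%b").month
    # The page never shows the year, so it always comes from the reference date
    return datetime(reference.year, month, day, hour, minute)


reference_day = datetime(2023, 11, 18)
print(parse_bbc_timestamp("09:15", reference_day))
print(parse_bbc_timestamp("23:05 17 Nov", reference_day))
```

Passing the date of the live page being scraped as `reference` keeps the result independent of when the notebook is executed.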

In [None]:
# Instantiate Dataset object from the dictionary
ds_bbc = Dataset.from_dict(data)
ds_bbc

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 805
})

In [44]:
ds_bbc.save_to_disk(base_path + '/Data/bbc_news_ds')

Saving the dataset (0/1 shards):   0%|          | 0/805 [00:00<?, ? examples/s]

In [53]:
ds_bbc = load_from_disk(base_path + '/Data/bbc_news_ds')

## 2. Collecting live news from the "THE TIMES OF ISRAEL" site

To collect news from the "THE TIMES OF ISRAEL" site, we used the Selenium library. Since Selenium is somewhat difficult to use inside Google Colab, we relied on a simple Python script that can be found on [GitHub](https://github.com/aakhterov/ML_projects/blob/master/news_sentiment_analysis/collect_the_times_of_israel_news.py.py). Here we just load the saved dataset.

In [122]:
ds_toi = load_from_disk(base_path + '/Data/toi_news_ds')
ds_toi

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 6581
})

In [None]:
ds_toi[:3]

{'url': ['https://www.timesofisrael.com/liveblog_entry/us-official-says-there-will-only-be-a-significant-pause-in-gaza-when-hostages-released/',
  'https://www.timesofisrael.com/liveblog_entry/survey-shows-substantial-support-for-renewal-of-jewish-settlement-in-gaza-after-war/',
  'https://www.timesofisrael.com/liveblog_entry/hamas-claims-that-gaza-death-toll-reaches-12300/'],
 'datetime': ['2023-11-18T19:30:14',
  '2023-11-18T19:25:21',
  '2023-11-18T19:18:28'],
 'author': ['AFP', '-', 'AFP'],
 'title': ['US official says there will only be a ‘significant pause’ in Gaza when hostages released',
  'Survey shows substantial support for renewal of Jewish settlement in Gaza after war',
  'Hamas claims that Gaza death toll reaches 12,300'],
 'text': ['US President Joe Biden’s main adviser on the Middle East says there will only be a “significant pause” in the Israel-Hamas war if hostages held by Hamas and other terror groups in Gaza are freed.\n“The surge in humanitarian relief, the surge 

In [123]:
def change_provider_and_source(example):
  example['provider'] = 'The Times of Israel'
  example['source'] = 'site-live-news'
  return example

In [124]:
ds_toi = ds_toi.map(change_provider_and_source)

## 3. Collecting live news from the Al Jazeera site

This is the same case as with "THE TIMES OF ISRAEL": we used Selenium to grab the news and saved the collected dataset (the code is [here](https://github.com/aakhterov/ML_projects/blob/master/news_sentiment_analysis/collect_al_jazeera_news.py)), and here we just load it.

In [55]:
ds_aj = load_from_disk(base_path + '/Data/aljazeera_news_ds')
ds_aj

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 3317
})

In [None]:
ds_aj = ds_aj.filter(lambda x: "This live page is now closed" not in x["text"])

In [None]:
def change_provider_and_source(example):
  example['provider'] = 'Al Jazeera'
  example['source'] = 'site-live-news'
  return example

In [None]:
ds_aj = ds_aj.map(change_provider_and_source)

## 4. Collecting articles from the Al Mayadeen site

In [None]:
BASE_URL = "https://english.almayadeen.net"
FIRST_URL = "https://english.almayadeen.net/coverage/operation-al-aqsa-flood"
PAGE_URL = "https://english.almayadeen.net/GetMore/coverage/operation-al-aqsa-flood?widgetid=00000000-0000-0000-0000-000000000000&widget=coverage-articles&postid=1753428&ismain=False&order=1"

In [None]:
data = {"url": [], "datetime": [], "author": [], "title": [], "text": [], "provider": [], "source": []}

res = requests.get(FIRST_URL)
first_page_body = BeautifulSoup(res.text, 'html.parser')
first_block = first_page_body.find('div', attrs={"data-widgetkey": "coverage-articles"})

# Collect all news links
urls = []
for div in first_block.find_all('div', class_='content-block'):
    urls.append(div.find('a').get('href'))

for page in range(1, 101):
    print(f"Page {page}. Length urls: {len(urls)}")
    res = requests.get(f"{PAGE_URL}&page={page}")
    page_body = BeautifulSoup(res.text, 'html.parser')
    divs = page_body.find_all('div', class_='content-block')
    if divs:
        for div in divs:
            urls.append(div.find('a').get('href'))
    else:
        break

print(f"Total amount of the news: {len(urls)}")

# Grab the news using the collected links
for idx, url in enumerate(urls):
    print(idx+1, url)
    result = requests.get(BASE_URL + url)
    body = BeautifulSoup(result.text, 'html.parser')
    article_block = body.find('article', class_ = 'post-details')
    if article_block:
        title = article_block.find('h1').text.strip()

        author_block = article_block.find('a', class_='post-author post-author-with-img')
        author = author_block.text.strip() if author_block else None

        datetime_block = article_block.find('li', class_='media-date')
        if datetime_block:
            datetime_ = datetime_block.text.strip()
            if datetime_.split()[0] == 'Today':
                post_datetime = datetime.strptime(datetime.today().strftime("%d %b %Y") + ' ' + datetime_.split()[1], "%d %b %Y %H:%M") # 15 Oct 2023 06:36
            else:
                post_datetime = datetime.strptime(datetime_, "%d %b %Y %H:%M") # 15 Oct 2023 06:36
        else:
            post_datetime = None

        text_block = body.find('div', class_='p-content')
        text_lst = []
        if text_block:
            for p in text_block.find_all('p'):
                text_lst.append(p.text)
        text = '\n'.join(text_lst)

        print('\t', post_datetime, title)
        data['url'].append(BASE_URL + url)
        data['title'].append(title)
        data['datetime'].append(post_datetime.strftime('%Y-%m-%dT%H:%M:%S') if post_datetime else None)
        data['author'].append(author if author else '-')
        data['text'].append(text)
        data['provider'].append('Al Mayadeen')
        data['source'].append('site-articles')
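
The link collection above follows a common pattern: request numbered pages until one comes back empty or a page limit is reached. The sketch below shows that pattern in isolation. The names `collect_links`, `fetch_page` and `fake_fetch` are hypothetical; `fake_fetch` stands in for the `requests.get` plus Beautiful Soup step and returns canned data instead of contacting the site.

```python
from typing import Callable, List


def collect_links(fetch_page: Callable[[int], List[str]], max_pages: int = 100) -> List[str]:
    """Call fetch_page(1), fetch_page(2), ... until a page yields no links."""
    links: List[str] = []
    for page in range(1, max_pages + 1):
        page_links = fetch_page(page)
        if not page_links:
            # An empty page means there is nothing left to collect
            break
        links.extend(page_links)
    return links


# Stand-in for the real HTTP request and HTML parsing (illustrative data only)
FAKE_PAGES = {1: ["/news/a", "/news/b"], 2: ["/news/c"]}


def fake_fetch(page: int) -> List[str]:
    return FAKE_PAGES.get(page, [])


links = collect_links(fake_fetch)
print(links)  # ['/news/a', '/news/b', '/news/c']
```

Keeping the fetching step behind a callable makes the stopping rule easy to check without network access.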

In [None]:
ds = Dataset.from_dict(data)
ds.save_to_disk('al_mayadeen_news_ds')

In [56]:
ds_am = load_from_disk(base_path + '/Data/al_mayadeen_news_ds')
ds_am

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 74
})

## 5. Collecting news from the WAFA "News" Agency site

In [None]:
BASE_URL = "https://english.wafa.ps"
NEWS_URL = BASE_URL + "/Regions/Details/2"

In [None]:
data = {"url": [], "datetime": [], "author": [], "title": [], "text": [], "provider": [], "source": []}

for page in range(51):
    print(f"Page: {page}")
    result = requests.get(NEWS_URL, params={'pageNumber': page})
    body = BeautifulSoup(result.text, "html.parser")
    posts_block = body.find('div', class_="post-blockcat-wrapper")
    for div in posts_block.find_all('div', class_="content"):
        datetime_block = div.find('span', class_="meta-item date")
        post_datetime = datetime.strptime(datetime_block.text, "%d/%B/%Y %I:%M %p") # <span class="meta-item date">07/October/2023 02:29 PM</span>

        url_block = div.find('h4', class_="title")
        post_url = BASE_URL + url_block.find('a').get("href")

        print('\t', post_datetime, post_url)
        res = requests.get(post_url)
        html = BeautifulSoup(res.text, "html.parser")
        post_block = html.find('div', class_="blog-wrap")

        title_block = post_block.find('h3', class_='title')
        title = title_block.text if title_block else None

        text_lst = []
        for p in post_block.find('div', class_='content').find_all('p'):
            if p is not None:
                text_lst.append(p.text)
        text = '\n'.join(text_lst)

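        # The last paragraph holds the author's initials; guard against posts without paragraphs
        author = text_lst[-1] if text_lst else None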

        data['url'].append(post_url)
        data['title'].append(title)
        data['datetime'].append(post_datetime.strftime('%Y-%m-%dT%H:%M:%S') if post_datetime else None)
        data['author'].append(author if author else '-')
        data['text'].append(text)
        data['provider'].append('WAFA News Agency')
        data['source'].append('site-occupation')
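
To make the date handling above concrete, the sketch below converts the WAFA timestamp format quoted in the code comment ("07/October/2023 02:29 PM") into the YYYY-mm-ddTHH:MM:SS string stored in the dataset. The function name `wafa_datetime_to_iso` is hypothetical, and the extra `strip()` is an assumption that the scraped text may carry surrounding whitespace. Note that `%B` and `%p` are locale-dependent; the example assumes an English (or default C) locale.

```python
from datetime import datetime

WAFA_FORMAT = "%d/%B/%Y %I:%M %p"   # e.g. 07/October/2023 02:29 PM
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def wafa_datetime_to_iso(raw):
    """Convert a WAFA listing timestamp into the dataset's datetime string."""
    parsed = datetime.strptime(raw.strip(), WAFA_FORMAT)
    return parsed.strftime(ISO_FORMAT)


iso_value = wafa_datetime_to_iso("07/October/2023 02:29 PM")
print(iso_value)  # 2023-10-07T14:29:00
```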

In [None]:
ds_wafa = Dataset.from_dict(data)
ds_wafa.save_to_disk('wafa_news_ds')
ds_wafa

In [57]:
ds_wafa = load_from_disk(base_path + '/Data/wafa_news_ds')
ds_wafa

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 1020
})

## 6. Collecting live news from the CNN site


This is the same case as with "THE TIMES OF ISRAEL": we used Selenium to grab the news and saved the collected dataset (the code is [here](https://github.com/aakhterov/ML_projects/blob/master/news_sentiment_analysis/collect_cnn_news.py.py)), and here we just load it.

In [58]:
ds_cnn = load_from_disk(base_path + '/Data/cnn_news_ds')
ds_cnn

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 1428
})

## 7. Merging the collected datasets

In [125]:
ds = concatenate_datasets([ds_bbc, ds_toi, ds_aj, ds_am, ds_wafa, ds_cnn])

In [60]:
ds

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 13225
})
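
As a quick sanity check, the row count of the merged dataset should equal the sum of the row counts printed for the six loaded datasets above. The dictionary below simply restates those printed numbers; the variable names are illustrative.

```python
# Row counts as printed when each dataset was loaded above
rows_per_provider = {
    "BBC": 805,
    "The Times of Israel": 6581,
    "Al Jazeera": 3317,
    "Al Mayadeen": 74,
    "WAFA News Agency": 1020,
    "CNN": 1428,
}

total_rows = sum(rows_per_provider.values())
print(total_rows)  # 13225
```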

In [126]:
ds.unique('provider')

['BBC',
 'The Times of Israel',
 'Al Jazeera',
 'Al Mayadeen',
 'WAFA News Agency',
 'CNN']

In [127]:
ds.save_to_disk(base_path + '/Data/news_ds')

Saving the dataset (0/1 shards):   0%|          | 0/13225 [00:00<?, ? examples/s]

## 8. Cleaning the dataset

In [128]:
# Check if there is empty news.
ds.filter(lambda x: x['text'] == '')

Filter:   0%|          | 0/13225 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'author', 'title', 'text', 'provider', 'source'],
    num_rows: 122
})

In [129]:
# Remove empty news from the dataset
ds = ds.filter(lambda x: x['text'] != '')

Filter:   0%|          | 0/13225 [00:00<?, ? examples/s]

In [87]:
# Let's take a look at 10 random news items from WAFA News Agency. We can see the following issues:
# 1) Every news item starts with a prefix that contains the location, the date and the name of the agency (WAFA)
# 2) The text contains special Unicode characters (e.g. \xa0)
# 3) There are initials at the end of the news item (e.g. M.N.)
ds.filter(lambda x: x['provider'] == 'WAFA News Agency').shuffle().select(range(10))['text']

['JENIN, Saturday, October 21, 2023 (WAFA) – Israeli occupation forces assaulted Palestinian farmers this evening in the village of Zububa, to the northwest of Jenin, firing shots at them and preventing them from reaching their agricultural lands adjacent to the Israeli segregation barrier, according to Palestinian security sources.\nIsraeli occupation forces opened fire on farmers while they were picking olives in their lands adjacent to the segregation barrier, which is in close proximity to the village. The barrier is itself built on Palestinian-owned lands.\nThroughout the ongoing olive harvest season, Israeli occupation forces have been trying to prevent farmers from accessing their lands by assaulting them, firing shots at them, and detaining them for extended periods.\nM.N',
 "NABLUS, Friday, October 6, 2023 (WAFA) - A Palestinian youth was shot and killed late last night during an Israeli colonists' attack against the town of Huwara, south of Nablus, bringing up the number of P

In [130]:
def make_unicode_normalization(example):
  '''
    Apply Unicode NFKD normalization, which replaces compatibility characters
    (e.g. the non-breaking space U+00A0) with their plain equivalents
  '''
  example['text'] = unicodedata.normalize('NFKD', example['text'])
  return example

In [131]:
ds = ds.map(make_unicode_normalization)

Map:   0%|          | 0/13103 [00:00<?, ? examples/s]
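
The effect of NFKD normalization can be seen on a short string. The sketch below uses made-up sample strings: it shows that the non-breaking space becomes a regular space, and also that NFKD decomposes accented letters into a base letter plus a combining mark, which changes the string length. NFKC is shown only for comparison; the notebook itself uses NFKD.

```python
import unicodedata

raw = "Gaza\xa0Strip"                      # contains a non-breaking space (U+00A0)
normalized = unicodedata.normalize("NFKD", raw)
print(repr(normalized))                    # 'Gaza Strip'

accented = "caf\u00e9"                     # precomposed "e with acute", 4 characters
decomposed = unicodedata.normalize("NFKD", accented)
print(len(accented), len(decomposed))      # 4 5

# For comparison: NFKC also maps U+00A0 to a space but keeps letters composed
recomposed = unicodedata.normalize("NFKC", decomposed)
print(len(recomposed))                     # 4
```

If the extra combining characters matter for a downstream tokenizer, NFKC may be the more convenient form; that choice is outside what the notebook does.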

In [132]:
def remove_start_of_WAFA_news(example):
  '''
  Remove prefix from the WAFA news
  '''
  if example["provider"] == 'WAFA News Agency':
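    # Normalize en dashes to hyphens so that one pattern matches both prefix styles
    example['text'] = example['text'].replace('–', '-')
    # Split only once, so the rest of the text is kept even if "(WAFA) -" occurs again
    split = re.split(r"\(WAFA\)\s*-\s*", example['text'], maxsplit=1)
    if len(split) > 1:
      example['text'] = split[1]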
  return example

In [133]:
ds = ds.map(remove_start_of_WAFA_news)

Map:   0%|          | 0/13103 [00:00<?, ? examples/s]
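
The prefix removal can be tried on a single string without the dataset. The sketch below is a standalone variant, not the notebook's function: it matches either a hyphen or an en dash directly in the pattern instead of replacing every en dash in the text first. The name `strip_wafa_prefix` and the sample body text are illustrative; the prefix format is the one visible in the WAFA samples above.

```python
import re

# "(WAFA)" followed by a hyphen or an en dash (U+2013), with optional whitespace
WAFA_PREFIX_RE = re.compile(r"\(WAFA\)\s*[-\u2013]\s*")


def strip_wafa_prefix(text):
    """Drop everything up to and including the first "(WAFA) -" marker."""
    parts = WAFA_PREFIX_RE.split(text, maxsplit=1)
    return parts[1] if len(parts) > 1 else text


sample = "JENIN, Saturday, October 21, 2023 (WAFA) \u2013 Body of the news item."
print(strip_wafa_prefix(sample))  # Body of the news item.
```

Text without the marker is returned unchanged, so the function is safe to apply to news from other providers as well.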

In [134]:
def remove_initials_at_the_end_WAFA(example):
  '''
  Remove initials from the WAFA news
  '''
  if example["provider"] == 'WAFA News Agency':
    example["text"] = "\n".join(example["text"].split("\n")[:-1])
  return example

In [117]:
ds = ds.map(remove_initials_at_the_end_WAFA)

Map:   0%|          | 0/13103 [00:00<?, ? examples/s]
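
The function above always drops the last line of a WAFA text, so a news item that does not end with initials would lose its final paragraph. A more conservative variant, sketched below, removes the last line only when it looks like initials. The pattern is an assumption based on the single example seen above ("M.N"): capital letters separated by dots, with an optional trailing dot. The names `INITIALS_RE` and `strip_trailing_initials` are hypothetical.

```python
import re

# Assumption: initials look like "M.N", "M.N." or "A.B.C"
INITIALS_RE = re.compile(r"^[A-Z](?:\.[A-Z])+\.?$")


def strip_trailing_initials(text):
    """Remove the last line only if it consists of author initials."""
    lines = text.split("\n")
    if len(lines) > 1 and INITIALS_RE.match(lines[-1].strip()):
        return "\n".join(lines[:-1])
    return text


print(repr(strip_trailing_initials("First paragraph.\nSecond paragraph.\nM.N")))
print(repr(strip_trailing_initials("First paragraph.\nSecond paragraph.")))
```

Initials written in other forms would not be matched by this pattern and would stay in the text.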

In [137]:
ds.save_to_disk(base_path + '/Data/news_ds')

Saving the dataset (0/1 shards):   0%|          | 0/13103 [00:00<?, ? examples/s]

## 9. Calculating the average news length

In [138]:
def get_length(example):
  example['len'] = len(example['text'].split())
  return example

In [139]:
ds_w_len = ds.map(get_length)

Map:   0%|          | 0/13103 [00:00<?, ? examples/s]

In [140]:
ds_w_len.set_format("pandas")
# Aggregate only the numeric "len" column; the string columns cannot be averaged
ds_w_len[:].groupby(["provider", "source"])[["len"]].agg(['mean', 'count'])



| provider | source | len (mean) | len (count) |
|---|---|---|---|
| Al Jazeera | site-live-news | 104.884615 | 3198 |
| Al Mayadeen | site-articles | 1196.256757 | 74 |
| BBC | site-live-news | 129.625935 | 802 |
| CNN | site-live-news | 197.663165 | 1428 |
| The Times of Israel | site-live-news | 127.901383 | 6581 |
| WAFA News Agency | site-occupation | 151.498039 | 1020 |
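
The per-provider statistics above can be combined into figures for the whole dataset. The sketch below restates the means and counts from the output and computes the total number of news items and the count-weighted average length in words; the variable names are illustrative.

```python
# (mean length in words, number of news items) per provider, copied from the output above
stats = {
    "Al Jazeera": (104.884615, 3198),
    "Al Mayadeen": (1196.256757, 74),
    "BBC": (129.625935, 802),
    "CNN": (197.663165, 1428),
    "The Times of Israel": (127.901383, 6581),
    "WAFA News Agency": (151.498039, 1020),
}

total_count = sum(count for _, count in stats.values())
# Weight each provider's mean by its number of items to get the overall mean
overall_mean = sum(mean * count for mean, count in stats.values()) / total_count

print(total_count)  # 13103
print(round(overall_mean, 2))
```

The total matches the 13103 rows left after removing empty news, and the weighting shows how strongly the overall average is driven by The Times of Israel, which contributes about half of the items.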


## 10. Pushing dataset to the Hugging Face Hub

In [143]:
from huggingface_hub import notebook_login
notebook_login()


In [145]:
ds.push_to_hub("Israel-HAMAS_war_news")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/14 [00:00<?, ?ba/s]