**About this project**

This project scrapes news articles from the multidisciplinary science journal *Nature* with the Beautiful Soup library, summarizes them with the T5 Large transformer model, collects the results into a dataset, and stores that dataset in an SQLite database.

In [53]:
#Installations

# !pip install selenium
# !apt update
# !apt install chromium-chromedriver
# !pip -q install transformers

In [54]:
#Imports

import requests
from bs4 import BeautifulSoup
import string
import os
from transformers import pipeline
import torch
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import sqlite3

In [55]:
#Let's use the T5 Large model for summarization.

summarizer = pipeline("summarization", model="t5-large", tokenizer="t5-large", framework="pt")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [56]:
#Let's find out the total number of pages to scrape.

start_link = 'https://www.nature.com/nature/articles?type=news&year=2023'
content = requests.get(start_link).text
soup = BeautifulSoup(content, "html.parser")
last_page_index = -2
pagination = soup.find_all('li', class_='c-pagination__item')[last_page_index].get_text().strip()
#Keep only the digits of the pagination label and join them into one number.
page_number = int(''.join(ch for ch in pagination if ch in string.digits))
page_number

10
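
The digit filtering above can also be written with a regular expression. This is a minimal sketch, assuming the pagination label looks something like `page 10`; the sample string and the helper name `parse_page_number` are illustrative and not taken from the live page.

```python
import re


def parse_page_number(label: str) -> int:
    """Return the last run of digits found in a pagination label."""
    matches = re.findall(r"\d+", label)
    if not matches:
        raise ValueError(f"no page number found in {label!r}")
    return int(matches[-1])


# Hypothetical label text standing in for the scraped pagination item.
print(parse_page_number("page\n 10"))
```

Unlike joining every digit in the string, this takes only the last run of digits, so a label such as `1 of 10` would give 10 rather than 110.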

In [57]:
#Iterating through all the pages dedicated to 2023 news.

list_scraped = []

for u in range(1, page_number+1):

    URL = f'https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&year=2023&page={u}'
    content = requests.get(URL).text
    soup = BeautifulSoup(content, "html.parser")

    publications = soup.find_all('li', class_='app-article-list-row__item')

    for publication in publications:
        title = publication.find('h3', class_='c-card__title').text.strip()
        prelink = publication.find('a', {'data-track-action':'view article'}).get('href')
        article_link = 'https://www.nature.com' + prelink
        response = requests.get(article_link).text
        #A separate name keeps the listing-page soup from being overwritten.
        article_soup = BeautifulSoup(response, "html.parser")
        try:
            #find() returns None when the article body is missing, so .text raises and the article is skipped.
            text_article = article_soup.find('div', class_='c-article-body main-content').text.strip()
            summary = summarizer(text_article)[0]['summary_text']
            article = {"title": title, "link": article_link, "summary": summary}
            list_scraped.append(article)
        except Exception:
            #Skip articles whose body cannot be found or summarized.
            continue

Token indices sequence length is longer than the specified maximum sequence length for this model (1435 > 512). Running this sequence through the model will result in indexing errors
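
The warning above means that some article texts exceed the model's 512-token limit. One possible workaround is to split a long text into smaller pieces before summarizing. The sketch below is only an approximation: it counts whitespace-separated words instead of real T5 tokens, and both the helper name `chunk_words` and the 300-word budget are assumptions chosen for illustration.

```python
def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Hypothetical stand-in for a scraped article body (650 words).
sample = "word " * 650
chunks = chunk_words(sample, max_words=300)
print([len(c.split()) for c in chunks])  # [300, 300, 50]
```

Each chunk could then be passed to `summarizer` and the partial summaries joined. A simpler option is to pass `truncation=True` in the pipeline call, which cuts the input at the model's limit and drops the rest of the article.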


In [58]:
#We can see that many of the news articles are behind a paywall.
#However, we still managed to scrape some up-to-date science news.
#Now let's create a dataset from the scraped news.

df = pd.DataFrame(list_scraped)
df

,title,link,summary
0,Editors quit top neuroscience journal to protest against open-access charges,https://www.nature.com/articles/d41586-023-01391-5,"more than 40 editors have resigned from two leading neuroscience journals in protest . they say that the fees, which publishers use to cover publishing services and in some cases make money, are unethical . the editors plan to start a new journal hosted by the non-profit publisher MIT Press ."
1,Genetic map of Tasmanian devil cancers hints at their future evolution,https://www.nature.com/articles/d41586-023-01349-7,Tasmanian devils are susceptible to two cancers that are spread by biting . genetic analysis of these cancers has tracked their evolution . lays the groundwork for modelling how they could affect populations in future .
2,White House to tap cancer leader Monica Bertagnolli as new NIH director,https://www.nature.com/articles/d41586-023-01378-2,"if confirmed by the US Senate, Bertagnolli will take over the NIH . the agency has a budget of more than US$47 billion and is composed of 27 separate institutes and centres . it has taken more than a year to find geneticist Francis Collins's replacement ."
3,Drugs give biology’s favourite worms the munchies too,https://www.nature.com/articles/d41586-023-01376-4,study suggests mechanism by which cannabis affects appetite evolved more than 500 million years ago . cannabinoid molecules derived from cannabis plant bind to same receptors as molecules naturally found in the body . researchers tested endocannabinoids on worms genetically engineered to have human cannabinoid receptors .
4,Australian researchers welcome plan to curb politicians’ power to veto grants,https://www.nature.com/articles/d41586-023-01379-1,researchers in australia have previously criticized political interference in the grant-awarding process . the changes were recommended on 20 April as part of an independent review into the legislation underpinning the Australian Research Council . acting education minister Stuart Robert vetoed six ARC projects in December 2021 .
5,Comb jellies’ unique fused neurons challenge evolution ideas,https://www.nature.com/articles/d41586-023-01381-7,"ctenophores, also known as comb jellies, have a fused network of neurons . scientists used an electron microscope to create a 3D reconstruction of the nervous system . the results suggest that the animal's nervous system evolved independently ."
6,SpaceX Starship: launch of biggest-ever rocket ends with explosion,https://www.nature.com/articles/d41586-023-01377-3,"Starship roared off a launch pad in southern texas today and then exploded before it reached space . the goal of today's flight had been to reach space and travel most of the way around the planet . if spaceX demonstrates that Starship can reach orbit, that will be ""significant for what it will bring afterwards"", says an expert ."
7,Racial inequalities deepened in US prisons during COVID,https://www.nature.com/articles/d41586-023-01311-7,"researchers compiled 20 years' worth of demographic records on prison populations . they found that the proportion of incarcerated Black people had been decreasing . but by the end of 2021, the proportion who were Black had returned to pre-pandemic levels . the researchers hope their findings will help to reshape the criminal justice system ."
8,Famous ‘homunculus’ brain map redrawn to include complex movements,https://www.nature.com/articles/d41586-023-01312-6,a new study redraws the motor homunculus or 'little man' diagram . it adds regions connected to brain areas responsible for coordinating complex movements . the findings could lead to changes in therapy for disorders of the primary motor cortex caused by stroke or injury .
9,Why Earth’s giant kelp forests are worth $500 billion a year,https://www.nature.com/articles/d41586-023-01307-3,"kelp forests provide services worth between $465 billion and $562 billion a year . they provide habitat for more than 1,000 species, draw carbon dioxide from the atmosphere . each hectare removes an average of 657 kilograms of excess nitrogen from seawater . pollack, giant seabass, south american morwongs and lingcod were most valuable fish ."


In [59]:
#Writing the dataset to the SQLite database on Google Drive.

from google.colab import drive
drive.mount('/content/gdrive/')
%cd '/content/gdrive/MyDrive/ML_projects'

conn = sqlite3.connect('scraped_news.sqlite')
df.to_sql('data_news', conn, if_exists='replace', index=False)
conn.close()

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).
/content/gdrive/MyDrive/ML_projects
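
The same write-and-read round trip can be tried without pandas or Google Drive. This is a minimal sketch using only the standard `sqlite3` module and an in-memory database; the two rows are hypothetical stand-ins for `list_scraped`, and the explicit `CREATE TABLE` is an assumption about the column types.

```python
import sqlite3
from contextlib import closing

# Hypothetical rows standing in for list_scraped.
rows = [
    {"title": "Example A", "link": "https://example.org/a", "summary": "first ."},
    {"title": "Example B", "link": "https://example.org/b", "summary": "second ."},
]

# ":memory:" keeps the sketch self-contained; a file path works the same way.
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE data_news (title TEXT, link TEXT, summary TEXT)"
        )
        conn.executemany(
            "INSERT INTO data_news VALUES (:title, :link, :summary)", rows
        )
    stored = conn.execute(
        "SELECT title, link FROM data_news ORDER BY rowid"
    ).fetchall()

print(stored)
```

Wrapping the connection in `closing` ensures it is closed even if a statement fails, which the plain `conn.close()` call does not guarantee.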


In [60]:
#Check: read the table back from the database.

conn = sqlite3.connect('scraped_news.sqlite')
df = pd.read_sql_query('SELECT * FROM data_news', conn)
conn.close()
df

,title,link,summary
0,Editors quit top neuroscience journal to protest against open-access charges,https://www.nature.com/articles/d41586-023-01391-5,"more than 40 editors have resigned from two leading neuroscience journals in protest . they say that the fees, which publishers use to cover publishing services and in some cases make money, are unethical . the editors plan to start a new journal hosted by the non-profit publisher MIT Press ."
1,Genetic map of Tasmanian devil cancers hints at their future evolution,https://www.nature.com/articles/d41586-023-01349-7,Tasmanian devils are susceptible to two cancers that are spread by biting . genetic analysis of these cancers has tracked their evolution . lays the groundwork for modelling how they could affect populations in future .
2,White House to tap cancer leader Monica Bertagnolli as new NIH director,https://www.nature.com/articles/d41586-023-01378-2,"if confirmed by the US Senate, Bertagnolli will take over the NIH . the agency has a budget of more than US$47 billion and is composed of 27 separate institutes and centres . it has taken more than a year to find geneticist Francis Collins's replacement ."
3,Drugs give biology’s favourite worms the munchies too,https://www.nature.com/articles/d41586-023-01376-4,study suggests mechanism by which cannabis affects appetite evolved more than 500 million years ago . cannabinoid molecules derived from cannabis plant bind to same receptors as molecules naturally found in the body . researchers tested endocannabinoids on worms genetically engineered to have human cannabinoid receptors .
4,Australian researchers welcome plan to curb politicians’ power to veto grants,https://www.nature.com/articles/d41586-023-01379-1,researchers in australia have previously criticized political interference in the grant-awarding process . the changes were recommended on 20 April as part of an independent review into the legislation underpinning the Australian Research Council . acting education minister Stuart Robert vetoed six ARC projects in December 2021 .
5,Comb jellies’ unique fused neurons challenge evolution ideas,https://www.nature.com/articles/d41586-023-01381-7,"ctenophores, also known as comb jellies, have a fused network of neurons . scientists used an electron microscope to create a 3D reconstruction of the nervous system . the results suggest that the animal's nervous system evolved independently ."
6,SpaceX Starship: launch of biggest-ever rocket ends with explosion,https://www.nature.com/articles/d41586-023-01377-3,"Starship roared off a launch pad in southern texas today and then exploded before it reached space . the goal of today's flight had been to reach space and travel most of the way around the planet . if spaceX demonstrates that Starship can reach orbit, that will be ""significant for what it will bring afterwards"", says an expert ."
7,Racial inequalities deepened in US prisons during COVID,https://www.nature.com/articles/d41586-023-01311-7,"researchers compiled 20 years' worth of demographic records on prison populations . they found that the proportion of incarcerated Black people had been decreasing . but by the end of 2021, the proportion who were Black had returned to pre-pandemic levels . the researchers hope their findings will help to reshape the criminal justice system ."
8,Famous ‘homunculus’ brain map redrawn to include complex movements,https://www.nature.com/articles/d41586-023-01312-6,a new study redraws the motor homunculus or 'little man' diagram . it adds regions connected to brain areas responsible for coordinating complex movements . the findings could lead to changes in therapy for disorders of the primary motor cortex caused by stroke or injury .
9,Why Earth’s giant kelp forests are worth $500 billion a year,https://www.nature.com/articles/d41586-023-01307-3,"kelp forests provide services worth between $465 billion and $562 billion a year . they provide habitat for more than 1,000 species, draw carbon dioxide from the atmosphere . each hectare removes an average of 657 kilograms of excess nitrogen from seawater . pollack, giant seabass, south american morwongs and lingcod were most valuable fish ."
