<a href="https://colab.research.google.com/github/bohuslavska/Study-projects/blob/main/Web_Scraper_and_Summarizer(BeautifulSoup%2CTransformer%2CSQLite)/Web_Scraper_and_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About this project**

This project scrapes news articles from the multidisciplinary science journal *Nature* using the Beautiful Soup library, summarizes them with the T5 Large transformer model, assembles the results into a dataset, and stores it in an SQLite database.

In [2]:
#Installations

!pip install selenium
!apt update
!apt install chromium-chromedriver
!pip -q install transformers

In [1]:
#Imports

import requests
from bs4 import BeautifulSoup
import string
import os
from transformers import pipeline
import torch
import pandas as pd
import sqlite3

In [3]:
# Use the T5 Large model for summarization.

summarizer = pipeline("summarization", model="t5-large", tokenizer="t5-large", framework="pt")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
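The warning above means that inputs longer than 512 tokens are not shortened automatically. One possible way to handle long articles, sketched here under assumptions, is to split the text into chunks and summarize each chunk separately. The helper names `chunk_words` and `summarize_long` and the 350-word limit are hypothetical and not part of the original notebook; a word count is only a rough proxy for the number of T5 tokens.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 350) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Assumption: 350 words is a rough stand-in for the 512-token limit.
    """
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 350,
) -> str:
    """Summarize each chunk on its own and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))
```

With the pipeline defined above, `summarize` could be something like `lambda t: summarizer(t, truncation=True)[0]["summary_text"]`, where `truncation=True` acts as a safety net for chunks that still exceed the limit.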


In [4]:
# Iterate through the listing pages dedicated to 2023 news and summarize each article.

list_scraped = []
for page in range(1, 3):  # listing pages 1 and 2

    URL = f'https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&year=2023&page={page}'
    content = requests.get(URL).text
    soup = BeautifulSoup(content, "html.parser")

    # Each list item on the listing page corresponds to one news article.
    publications = soup.find_all('li', class_='app-article-list-row__item')

    for publication in publications:
        title = publication.find('h3', class_='c-card__title').text.strip()
        prelink = publication.find('a', {'data-track-action': 'view article'}).get('href')
        article_link = 'https://www.nature.com' + prelink

        # Fetch the article page; a separate name keeps the listing soup intact.
        response = requests.get(article_link).text
        article_soup = BeautifulSoup(response, "html.parser")

        try:
            text_article = article_soup.find('div', class_='c-article-body main-content').text.strip()
            # Avoid shadowing the built-in sum().
            summary = summarizer(text_article)[0]['summary_text']
            article = {"title": title, "link": article_link, "summary": summary}
            list_scraped.append(article)
        except Exception:
            # Skip articles without a body or that the summarizer cannot process.
            continue

Token indices sequence length is longer than the specified maximum sequence length for this model (623 > 512). Running this sequence through the model will result in indexing errors
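The scraping loop above builds the listing URL with an f-string and prepends the site root to each relative link. A minimal sketch of the same two steps with the standard library's `urllib.parse` follows. The helper names `listing_url` and `absolute_link` are hypothetical, and `/articles/example-id` is a placeholder rather than a real article path.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.nature.com"


def listing_url(page: int, year: int = 2023) -> str:
    """Build the URL of one page of the news listing."""
    query = urlencode({
        "searchType": "journalSearch",
        "sort": "PubDate",
        "type": "news",
        "year": year,
        "page": page,
    })
    return f"{BASE_URL}/nature/articles?{query}"


def absolute_link(href: str) -> str:
    """Resolve a relative article link against the site root."""
    return urljoin(BASE_URL, href)


# Placeholder path, used only for illustration.
print(listing_url(1))
print(absolute_link("/articles/example-id"))
```

Compared with plain string concatenation, `urljoin` also leaves already absolute links unchanged, which may help if the listing markup ever mixes relative and absolute `href` values.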


In [5]:
#Creating the dataset with the scraped news.

df = pd.DataFrame(list_scraped)
df.head(10)

| | title | link | summary |
|---|---|---|---|
| 0 | Death threats, trolling and sexist abuse: clim... | https://www.nature.com/articles/d41586-023-010... | survey by non-governmental organization Global... |
| 1 | Crazy ants’ strange genomes are a biological f... | https://www.nature.com/articles/d41586-023-010... | yellow crazy ants are a notorious invasive spe... |
| 2 | Researchers back African Union to join G20 gro... | https://www.nature.com/articles/d41586-023-010... | researchers are backing the inclusion of the A... |
| 3 | Global scholars decry funding ban on influenti... | https://www.nature.com/articles/d41586-023-009... | Indian government suspends foreign funding to ... |
| 4 | Stem-cell derived ‘embryos’ implanted in monkeys | https://www.nature.com/articles/d41586-023-009... | stem-cell-derived blastoids could help researc... |
| 5 | COVID-origins data from Wuhan market published... | https://www.nature.com/articles/d41586-023-009... | researchers have published an analysis of swab... |
| 6 | Medieval accounts of eclipses shine light on m... | https://www.nature.com/articles/d41586-023-009... | palaeoclimatologists have used medieval accoun... |
| 7 | How air pollution causes lung cancer — without... | https://www.nature.com/articles/d41586-023-009... | air pollution could cause lung cancer by creat... |
| 8 | Habit-linked brain circuits light up in people... | https://www.nature.com/articles/d41586-023-009... | brain scans of people with binge-eating disord... |
| 9 | How virtual models of the brain could transfor... | https://www.nature.com/articles/d41586-023-009... | virtual models representing brains of people w... |


In [13]:
#Writing the dataset to the SQLite database on Google Drive.

from google.colab import drive
drive.mount('/content/gdrive/')
%cd '/content/gdrive/MyDrive/ML_projects'

conn = sqlite3.connect('scraped_news.sqlite')
df.to_sql('data_news', conn, if_exists='replace', index=False)
conn.close()

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).
/content/gdrive/MyDrive/ML_projects
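The cell above writes the table with `DataFrame.to_sql` and `if_exists='replace'`, which rewrites the whole table on every run. As an alternative sketch, not the notebook's method, the same kind of rows can be stored with the standard `sqlite3` module alone, using a `UNIQUE` constraint on `link` so that repeated runs do not duplicate articles. The function `save_articles`, the sample row, and the in-memory database are assumptions for illustration.

```python
import sqlite3
from contextlib import closing

# Illustrative row only; real rows would come from list_scraped.
rows = [
    {
        "title": "Example title",
        "link": "https://example.org/a",
        "summary": "Example summary.",
    },
]


def save_articles(conn: sqlite3.Connection, articles: list) -> None:
    """Insert articles, replacing any existing row with the same link."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_news "
        "(title TEXT, link TEXT UNIQUE, summary TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO data_news (title, link, summary) "
        "VALUES (:title, :link, :summary)",
        articles,
    )
    conn.commit()


# closing() guarantees the connection is closed even if an error occurs.
with closing(sqlite3.connect(":memory:")) as conn:
    save_articles(conn, rows)
    save_articles(conn, rows)  # second call replaces instead of duplicating
    stored = conn.execute(
        "SELECT title, link, summary FROM data_news"
    ).fetchall()

print(stored)
```

Replacing `":memory:"` with a file name such as `'scraped_news.sqlite'` would persist the data in the current working directory.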


In [14]:
# Check: read the table back from the database.

conn = sqlite3.connect('scraped_news.sqlite')
df = pd.read_sql_query('SELECT * FROM data_news', conn)
conn.close()
df

| | title | link | summary |
|---|---|---|---|
| 0 | Death threats, trolling and sexist abuse: clim... | https://www.nature.com/articles/d41586-023-010... | survey by non-governmental organization Global... |
| 1 | Crazy ants’ strange genomes are a biological f... | https://www.nature.com/articles/d41586-023-010... | yellow crazy ants are a notorious invasive spe... |
| 2 | Researchers back African Union to join G20 gro... | https://www.nature.com/articles/d41586-023-010... | researchers are backing the inclusion of the A... |
| 3 | Global scholars decry funding ban on influenti... | https://www.nature.com/articles/d41586-023-009... | Indian government suspends foreign funding to ... |
| 4 | Stem-cell derived ‘embryos’ implanted in monkeys | https://www.nature.com/articles/d41586-023-009... | stem-cell-derived blastoids could help researc... |
| 5 | COVID-origins data from Wuhan market published... | https://www.nature.com/articles/d41586-023-009... | researchers have published an analysis of swab... |
| 6 | Medieval accounts of eclipses shine light on m... | https://www.nature.com/articles/d41586-023-009... | palaeoclimatologists have used medieval accoun... |
| 7 | How air pollution causes lung cancer — without... | https://www.nature.com/articles/d41586-023-009... | air pollution could cause lung cancer by creat... |
| 8 | Habit-linked brain circuits light up in people... | https://www.nature.com/articles/d41586-023-009... | brain scans of people with binge-eating disord... |
| 9 | How virtual models of the brain could transfor... | https://www.nature.com/articles/d41586-023-009... | virtual models representing brains of people w... |
