# Enhancing the metadata: parsing information into the dataset
In this part, we use the Newspaper3k package to scrape web pages and parse the results into our dataset. We enhance each record with a title, keywords, a summary, and the full text (when available) to provide a basis for NLP-based analysis.

## Import libraries

In [None]:
import pandas as pd #This is for loading the data
from newspaper import Article #This is the library doing the scraping
from tqdm.notebook import tqdm #This is for tracking progress
import nltk #NLTK supports the Natural Language Processing (NLP) step
nltk.download('punkt') #Tokenizer data needed for keyword and summary extraction

## Load the dataset containing the active links

In [7]:
# Load your CSV file into a DataFrame
df = pd.read_csv('data/AoT4_good_urls.csv')



## Make sure the DataFrame has the required columns

In [8]:
# The DataFrame should already have the columns 'id', 'url', and 'description'.
# Add empty columns for the fields the scraper will fill in.
df['title'] = None
df['summary'] = None
df['keywords'] = None
df['full_text'] = None





## Define the function to scrape the data from the web
We are using the Newspaper3k package, which is designed to scrape news articles. Its documentation is clear and thorough if you want to know more.

In [None]:
# Define a function to process a URL and extract details
def process_url(url):
    try:
        # Create an Article object
        article = Article(url)
        # Download the article
        article.download()
        # Parse the article
        article.parse()
        # Perform NLP to extract keywords and summary
        article.nlp()

        return {
            "title": article.title,
            "summary": article.summary,
            "keywords": article.keywords,
            # full_text is optional: it takes a lot of time, not every analysis
            # needs it, and it often requires manual cleaning.
            "full_text": article.text
        }
    except Exception as e:
        print(f"Error processing {url}: {e}")
        return None
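
Since the full text is optional, one way to make it switchable is to build the result dictionary in a small helper with a flag. The sketch below is an assumption, not part of the original notebook: `article_to_record` and `include_full_text` are hypothetical names, and a `SimpleNamespace` stands in for a parsed `Article` so the example runs offline. The flag only controls whether the text is stored; the text is still extracted during parsing.

```python
from types import SimpleNamespace

def article_to_record(article, include_full_text=True):
    """Build the result dictionary from a parsed article-like object.

    Hypothetical helper: `article` only needs the attributes
    title, summary, keywords, and text.
    """
    record = {
        "title": article.title,
        "summary": article.summary,
        "keywords": article.keywords,
    }
    # Only keep the (often long and messy) body text when asked to.
    if include_full_text:
        record["full_text"] = article.text
    return record

# Stand-in for a parsed Article object, so no network access is needed.
fake_article = SimpleNamespace(
    title="Example title",
    summary="Short summary.",
    keywords=["a", "b"],
    text="Long body text",
)

full = article_to_record(fake_article)
light = article_to_record(fake_article, include_full_text=False)
```

If you use a helper like this inside `process_url`, the loop that fills the DataFrame would also need to skip the `full_text` column when the key is absent.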

## Run the scraper
This will take a long time, especially if you decide to parse the full text. Make sure your computer is plugged in, the internet connection is stable, and sleep mode is disabled. Optionally, you can add a function that prevents sleep. Expect many error messages, as web scraping is unpredictable. Don't get discouraged; re-run the scraper if needed.

In [9]:
# Iterate through DataFrame and update with parsed information
with tqdm(total=len(df), desc="Processing URLs", unit="url") as pbar:
    for index, row in df.iterrows():
        url = row['url']
        result = process_url(url)
        if result:
            df.at[index, 'title'] = result['title']
            df.at[index, 'summary'] = result['summary']
            df.at[index, 'keywords'] = ', '.join(result['keywords'])
            df.at[index, 'full_text'] = result['full_text']
        pbar.update(1)




Processing URLs:   0%|          | 0/1137 [00:00<?, ?url/s]

Error processing https://adhduk.co.uk/: Article `download()` failed with 500 Server Error: Internal Server Error for url: https://adhduk.co.uk/ on URL https://adhduk.co.uk/
Error processing https://www.bad.org.uk/: Article `download()` failed with HTTPSConnectionPool(host='www.bad.org.uk', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)'))) on URL https://www.bad.org.uk/
Error processing https://www.healthforall.org.uk/: Article `download()` failed with HTTPSConnectionPool(host='www.healthforall.org.uk', port=443): Read timed out. (read timeout=7) on URL https://www.healthforall.org.uk/
Error processing https://www.forbes.com/sites/carlieporterfield/2021/08/30/dozens-of-subreddits-go-private-to-protest-reddits-covid-disinformation-policy/: Article `download()` failed with 403 Client Error: Max restarts limit reached for url: https

## Let's see what we got

In [None]:
# Display updated DataFrame
print(df.head())
print(df.shape)

## Export the enhanced dataset
It's ready for analysis.

In [10]:
#Save the updated DataFrame to a new CSV file. Define your own destination folder.
df.to_csv('AoT_enhanced.csv', index=False)
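
Note that the keywords were joined into a single comma-separated string before saving, so an analysis that needs them as a list has to split them again. A small sketch of one way to do that follows; `split_keywords` is a hypothetical helper, and it assumes that individual keywords never contain commas.

```python
def split_keywords(cell):
    """Turn a 'kw1, kw2, kw3' string back into a list of keywords.

    Rows that were never scraped hold None or NaN, so anything that is
    not a non-empty string yields an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [kw.strip() for kw in cell.split(",") if kw.strip()]

example = split_keywords("tab, menu, items, navigate, scotland")
missing = split_keywords(float("nan"))
```

After reloading the CSV with pandas, this could be applied to the whole column with `df['keywords'].apply(split_keywords)`.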

## Admire your work

In [11]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df

Unnamed: 0,id,url,description,is_alive,title,summary,keywords,full_text
0,167157,https://www.adhdfoundation.org.uk/,A national neurodiversity charity. It supports...,True,Home - ADHD Foundation,Neurodiversity Umbrella ProjectFind out more a...,"morefind, umbrella, foundation, 2024find, part...",Neurodiversity Umbrella Project\n\nFind out mo...
1,166183,https://thoughtfultherapists.org/,Site of 'a group of psychotherapists and couns...,True,Thoughtful Therapists,Welcome to Thoughtful Therapists.\nWe are conc...,"therapeutic, therapists, integrity, conversion...",Welcome to Thoughtful Therapists. We have come...
2,158957,https://yourneighbour.org/,'YourNeighbour is a UK wide church response to...,True,Equipping Churches in the Covid-19 Crisis,YourNeighbour has been developed into the Chur...,"yourneighbour, current, website, commission, c...",YourNeighbour has been developed into the Chur...
3,161779,https://www.backoffscotland.com/,Site of group formed in 2020 to campaign for t...,True,Back Off Scotland,Use tab to navigate through the menu items.,"tab, menu, items, navigate, scotland",Use tab to navigate through the menu items.
4,84967,https://www.anhinternational.org/,The Alliance for Natural Health (ANH) Internat...,True,Alliance For Natural Health,Watch The Hyperactive Children’s Support Group...,"removing, prevent, alliance, rob, health, supp...",Watch The Hyperactive Children’s Support Group...
5,165404,https://www.beintheknow.org/,"Site offering trusted, evidence-based content ...",True,Get the facts about sexual health and HIV,Sex and relationshipsWhether you’ve had sex be...,"sex, makes, hiv, experience, know, health, enj...",Sex and relationships\n\nWhether you’ve had se...
6,162392,https://prepster.info/,PrEPster aims to educate and agitate for PrEP...,True,Prepster,New research shows on-going barriers to PrEP a...,"terrence, struggled, thirds, voice, prepster, ...",New research shows on-going barriers to PrEP a...
7,170906,https://www.collectivevoice.org.uk/,The national alliance of drug and alcohol trea...,True,Collective Voice,The Sentencing Bill is shortly due to enter Co...,"prison, treatment, way, system, voice, stage, ...",The Sentencing Bill is shortly due to enter Co...
8,170986,https://www.uklcc.org.uk/,UKLCC is a campaigning group formed from a coa...,True,Home page,A review of lung cancer services in 2023 to su...,"services, optimal, national, lung, cancer, sco...",A review of lung cancer services in 2023 to su...
9,171115,https://www.patientsafetylearning.org/,A charity for improving patient safety.,True,Patient Safety Learning,A platform for anyone with an interest in pati...,"safety, interest, learning, share, learn, plat...",A platform for anyone with an interest in pati...
