# Link checker
Web content is archived because the internet is transient: web pages are short-lived. However, we can do distant reading only on the metadata and not on the archived collection itself, and not all the URLs in the dataset are still live. This link checker selects the URLs that are currently active and saves them in a DataFrame, ready for scraping more information about the archived websites.

## Filter CSV by Alive URLs
Read a CSV file of URLs and keep only the rows whose URL is alive (i.e., returns a 200 OK status).

Steps:

1. Read the CSV file into a pandas DataFrame.
2. Define an asynchronous function to check if a URL is alive using the aiohttp library.
3. Apply the function to filter out rows where the URL is not alive.
4. Display or save the filtered DataFrame.

Let's get started!

## Importing libraries

In [None]:
import pandas as pd  # DataFrame library
import requests  # For opening URLs
import asyncio  # asyncio and aiohttp are used for making asynchronous requests
import aiohttp
from urllib.parse import urlparse  # Used to check the URL formatting
from tqdm.asyncio import tqdm_asyncio  # For tracking progress

## Import CSV file to DataFrame
Use your own file name. Your CSV must have a 'url' column.

In [None]:
df = pd.read_csv('AoT3_without_duplicates_and_xml.csv')
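
Before going further, it can help to confirm that the file really has a 'url' column and to see how many rows lack a URL. The following is a minimal sketch, not part of the original workflow: the in-memory CSV and the helper `has_url_column` are hypothetical stand-ins for your own data.

```python
import io
import pandas as pd

# Hypothetical stand-in for your own CSV file (two rows, one without a URL)
sample_csv = io.StringIO("title,url\nExample,http://example.com\nNo link,\n")
sample_df = pd.read_csv(sample_csv)

def has_url_column(frame, column="url"):
    """Return True if the DataFrame has the column the link checker needs."""
    return column in frame.columns

print("Has url column:", has_url_column(sample_df))
print("Rows without a URL:", sample_df["url"].isna().sum())
```

With your own data, you would call `has_url_column(df)` right after `pd.read_csv`.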

## Define functions to validate the format of your URLs and to clean them

In [None]:
# Function to validate and clean URLs
def clean_and_validate_url(url):
    if not isinstance(url, str):
        return None
    url = url.strip()  # Remove leading/trailing whitespace
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url  # Add http:// if the scheme is missing
    return url if is_valid_url(url) else None

# Function to check if URL is valid
def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception as e:
        print(f"Error validating URL: {url}, Error: {e}")
        return False
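
To see what this kind of check accepts and rejects, here is a small standalone sketch built on `urlparse`. The helper name `looks_like_url` and the sample strings are assumptions made for illustration. Note that the test is purely structural: it asks only whether a scheme and a host part are present, so a string can pass without pointing to a real website.

```python
from urllib.parse import urlparse

def looks_like_url(url):
    """Hypothetical helper: True if the string has both a scheme and a host part."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)

# Illustrative inputs only; replace them with values from your own data
samples = ["http://example.com/page", "example.com", "http://", "http://not a url"]

for sample in samples:
    parts = urlparse(sample)
    print(repr(sample), "| scheme:", repr(parts.scheme),
          "| host:", repr(parts.netloc), "| passes:", looks_like_url(sample))
```

A string without a scheme, such as `example.com`, fails the check on its own, which is why the cleaning function adds `http://` before validating.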



## Define functions to get the URLs and check if they are alive

In [None]:
# Asynchronous function to fetch status of URLs
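async def fetch_status(session, url):
    try:
        # HEAD requests do not follow redirects by default, so ask for it explicitly;
        # otherwise a page that redirects (e.g. from http to https) counts as dead
        async with session.head(url, allow_redirects=True) as response:
            return url, response.status == 200
    except Exception:  # Connection errors, timeouts, malformed URLs
        return url, False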

# Asynchronous function to check statuses of multiple URLs
async def check_urls(urls):
    async with aiohttp.ClientSession() as session:
        tasks = []
        with tqdm_asyncio(total=len(urls), desc="Checking URLs", unit="url") as pbar:
            for url in urls:
                # create_task starts the request right away; a bare coroutine would wait until gather()
                task = asyncio.create_task(fetch_status(session, url))
                task.add_done_callback(lambda _: pbar.update(1))  # Advance the bar when a check finishes
                tasks.append(task)
                await asyncio.sleep(0.1)  # Stagger the request starts to avoid overloading servers
            results = await asyncio.gather(*tasks)
    return results

# Function to run asynchronous URL checking
async def is_url_alive_async(urls):
    results = await check_urls(urls)
    return results
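
The pattern used here (start many checks at once, then collect the results in input order) can be tried without any network access. The sketch below is an illustration under assumptions: `fake_fetch_status`, `check_all`, and `run_check` are hypothetical names, and the "dead" rule is invented purely for the demo. It uses `asyncio.run`, which suits a plain Python script; inside a notebook, where an event loop is already running, you would write `await check_all(...)` instead.

```python
import asyncio

async def fake_fetch_status(url):
    """Hypothetical stand-in for a real request: waits briefly, no network access."""
    await asyncio.sleep(0.01)
    # Invented rule for the demo: URLs ending in /gone count as dead
    return url, not url.endswith("/gone")

async def check_all(urls):
    # create_task schedules each check immediately, so they run concurrently
    tasks = [asyncio.create_task(fake_fetch_status(url)) for url in urls]
    # gather returns the results in the same order as the input list
    return await asyncio.gather(*tasks)

def run_check(urls):
    """Entry point for a plain script (not for use inside a running event loop)."""
    return asyncio.run(check_all(urls))

results = run_check(["http://example.com", "http://example.com/gone"])
print(results)
```

Each result is a `(url, is_alive)` pair, the same shape that the real checker returns.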


## Clean and validate URLs in DataFrame

In [None]:
# Iterate over the rows: store each URL that cleans up successfully and blank out the ones that do not.
for index, row in df.iterrows():
    original_url = row['url']
    cleaned_url = clean_and_validate_url(original_url)
    if cleaned_url:
        df.at[index, 'url'] = cleaned_url  # Update DataFrame with cleaned URL
    else:
        print(f"Invalid URL: {original_url}")
        df.at[index, 'url'] = None  # Blank the invalid URL so dropna() skips it later
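
For larger files, the same cleaning step could also be written with pandas' `apply`, which avoids the explicit row loop. This is only a sketch of an alternative, not the notebook's method: `tidy_url` is a hypothetical, condensed cleaner and the toy DataFrame stands in for your data.

```python
import pandas as pd
from urllib.parse import urlparse

def tidy_url(url):
    """Hypothetical condensed cleaner: returns a normalised URL, or None if unusable."""
    if not isinstance(url, str):
        return None
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    parts = urlparse(url)
    return url if parts.scheme and parts.netloc else None

# Toy data standing in for the 'url' column of your own DataFrame
toy = pd.DataFrame({"url": [" example.com ", "https://example.org", None, ""]})
toy["url"] = toy["url"].apply(tidy_url)
print(toy)
```

Rows whose URL could not be cleaned end up empty, so a later `dropna()` leaves them out of the check.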



## Run the link checker
It will take some time, depending on the length of your URL list. The progress bar at the bottom helps you track how far along you are. Keep your computer charged and logged in, and your connection alive, to avoid disappointment.

In [None]:
# Get list of cleaned URLs from DataFrame (unique, so the merge below does not multiply rows)
urls_to_check = df['url'].dropna().unique().tolist()

# Run asynchronous URL checking in current event loop
results = await is_url_alive_async(urls_to_check)

# Create DataFrame from results
results_df = pd.DataFrame(results, columns=['url', 'is_alive'])

# Merge results with original DataFrame to retain other columns
df = df.merge(results_df, on='url')

# Filter DataFrame to keep only rows where the URL is alive
alive_urls_df = df[df['is_alive']]

# Display first few rows of filtered DataFrame
print(alive_urls_df.head())
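
The merge-and-filter step can be tried on toy data to see what it does to the table. In this sketch the titles, URLs, and statuses are all made up for illustration; only the `url` and `is_alive` column names come from the notebook.

```python
import pandas as pd

# Made-up records standing in for the original dataset
records = pd.DataFrame({
    "title": ["A", "B", "C"],
    "url": ["http://a.example", "http://b.example", "http://c.example"],
})

# Made-up checker output: one (url, is_alive) pair per URL
statuses = pd.DataFrame(
    [("http://a.example", True), ("http://b.example", False), ("http://c.example", True)],
    columns=["url", "is_alive"],
)

# Attach the status to each record, then keep only the live ones
merged = records.merge(statuses, on="url")
alive = merged[merged["is_alive"]]
print(alive)
```

Because `merge` defaults to an inner join, rows whose URL has no matching result are dropped, and a URL that appears several times on both sides produces one output row per pairing.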


## Export your list of working URLs to a new CSV file
This is the base of your final dataset, which we are going to enhance and analyse in the next steps.

In [None]:
# Save the filtered DataFrame to a new CSV file
alive_urls_df.to_csv('AoT_good_urls.csv', index=False)

## Admire your work
Display the final dataset containing only the active links.

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
alive_urls_df