# Asynchronous Web Scraping
In this notebook, we will explore **asynchronous web scraping** using Python. Synchronous scraping waits for each request to complete before starting the next one. Asynchronous scraping lets the program start further requests while earlier ones are still waiting on the network, so the total run time is usually much shorter when several pages are fetched.
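
To make the difference concrete, here is a minimal sketch that uses `asyncio.sleep` in place of real network requests. The names `fake_request`, `run_sequentially`, `run_concurrently`, and `timed` are hypothetical helpers introduced only for this illustration, and the 0.1 second delays are arbitrary.

```python
import asyncio
import time


async def fake_request(name, delay):
    # Hypothetical stand-in for an HTTP request: sleeps instead of using the network.
    await asyncio.sleep(delay)
    return name


async def run_sequentially(delays):
    # Each request finishes before the next one starts.
    return [await fake_request(f"page{i}", d) for i, d in enumerate(delays)]


async def run_concurrently(delays):
    # gather starts all requests and waits until every one has finished.
    return await asyncio.gather(
        *(fake_request(f"page{i}", d) for i, d in enumerate(delays))
    )


def timed(coro_factory, delays):
    # In a notebook cell without nest_asyncio, use `await coro_factory(delays)` instead.
    start = time.perf_counter()
    results = asyncio.run(coro_factory(delays))
    return results, time.perf_counter() - start


delays = [0.1, 0.1, 0.1]
seq_results, seq_time = timed(run_sequentially, delays)
con_results, con_time = timed(run_concurrently, delays)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run takes roughly the sum of the delays, while the concurrent run takes roughly the longest single delay.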

We will use the following libraries:
- `aiohttp` for making asynchronous HTTP requests
- `asyncio` for writing concurrent code
- `BeautifulSoup` for parsing HTML
- `csv` for saving extracted links
- `re` for filtering links with regular expressions
- `nest_asyncio` to handle nested event loops in Jupyter Notebook

## Step 1: Install Required Libraries
<details>
<summary>Hint</summary>
Run `pip install aiohttp nest-asyncio beautifulsoup4` to make sure all third-party libraries are installed. `asyncio`, `csv`, and `re` ship with Python and do not need to be installed.
</details>

In [None]:
!pip install aiohttp nest-asyncio beautifulsoup4

## Step 2: Import Libraries
<details>
<summary>Hint</summary>
Import `aiohttp`, `asyncio`, `BeautifulSoup`, `csv`, `re`, and `nest_asyncio`. Apply `nest_asyncio.apply()` to enable nested event loops in Jupyter Notebook.
</details>

In [None]:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import csv
import re
import nest_asyncio
nest_asyncio.apply()

## Step 3: Define Asynchronous Function to Parse and Save Links
We define an **asynchronous function** that:
1. Parses the HTML content using BeautifulSoup
2. Extracts all links that start with `http`
3. Saves them into a CSV file

<details>
<summary>Hint</summary>
Use `async def` to define asynchronous functions. Use `soup.find_all('a', attrs={'href': re.compile('^http')})` to extract the links whose `href` starts with `http`.
Use `csv.writer` to write the links to a CSV file, one per row.
</details>

In [None]:
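async def scrap_and_save_links(text):
    # Parse the page and collect every <a> tag whose href starts with "http".
    soup = BeautifulSoup(text, 'html.parser')
    # Append mode: links from all pages accumulate in the same file.
    with open('csv_file.csv', 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
            url = link.get('href')
            writer.writerow([url])  # one URL per row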
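
To see what the `'^http'` filter keeps and what ends up in the CSV output, here is a small sketch that uses only the standard library. The `hrefs` list is made-up sample data, and an in-memory `io.StringIO` buffer stands in for `csv_file.csv` so nothing is written to disk.

```python
import csv
import io
import re

# Same pattern as in the function above.
http_pattern = re.compile('^http')

# Hypothetical href values, for illustration only.
hrefs = [
    'https://www.python.org/about/',
    'http://example.com/',
    '/downloads/',
    '#content',
    'mailto:someone@example.com',
]

# Keep only the values that begin with "http" (relative links and anchors are dropped).
absolute_links = [h for h in hrefs if http_pattern.search(h)]

# Write one URL per row into an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',')
for url in absolute_links:
    writer.writerow([url])

print(buffer.getvalue())
```

Note that the pattern matches both `http://` and `https://` links, and that relative links such as `/downloads/` are not saved.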

## Step 4: Fetch HTML Content Asynchronously
This function fetches the HTML of a webpage asynchronously and passes it to `scrap_and_save_links`.

<details>
<summary>Hint</summary>
Use `async with session.get(url) as response` to make an asynchronous GET request.
Use `await response.text()` to get the HTML content.
Create a task with `asyncio.create_task(scrap_and_save_links(text))` and await it.
</details>

In [None]:
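async def fetch(session, url):
    try:
        # The connection is released when the async with block exits.
        async with session.get(url) as response:
            text = await response.text()
            # Wrap the parsing coroutine in a task and wait for it to finish.
            task = asyncio.create_task(scrap_and_save_links(text))
            await task
    except Exception as e:
        # Report the failure and continue, so one bad URL does not stop the others.
        print(f"Error fetching {url}: {e}")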
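
As a side note on the `create_task` followed by `await` pattern, the sketch below suggests that awaiting a task immediately after creating it behaves like awaiting the coroutine directly. A task mainly pays off when other work is started before it is awaited. The `parse` coroutine and the `events` list are hypothetical stand-ins, not part of the scraper.

```python
import asyncio

events = []


async def parse(label):
    # Hypothetical stand-in for scrap_and_save_links.
    events.append(f'start {label}')
    await asyncio.sleep(0)
    events.append(f'end {label}')
    return label


async def main():
    # Pattern from the cell above: wrap in a task, then await it right away.
    task = asyncio.create_task(parse('task'))
    via_task = await task
    # Awaiting the coroutine directly.
    direct = await parse('direct')
    return via_task, direct


outcome = asyncio.run(main())
print(outcome)
print(events)
```

Both calls run to completion one after the other, so in `fetch` the task could also be replaced by a plain `await scrap_and_save_links(text)`.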

## Step 5: Scrape Multiple URLs Concurrently
This function schedules multiple fetch tasks concurrently using `asyncio.gather`.

<details>
<summary>Hint</summary>
Create a list of tasks and pass them to `await asyncio.gather(*tasks)`.
Use a single `aiohttp.ClientSession()` and share it across all requests.
</details>

In [None]:
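async def scrap(urls):
    tasks = []
    # One session is shared by every request and closed when the block exits.
    async with aiohttp.ClientSession() as session:
        for url in urls:
            # fetch() is not awaited here; this only builds the coroutine object.
            tasks.append(fetch(session, url))
        # Run all fetches concurrently and wait until every one has finished.
        await asyncio.gather(*tasks)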
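
The sketch below illustrates how `asyncio.gather` behaves when one of the fetches fails but catches its own exception, as `fetch` does. It uses no network access: `fake_fetch` and `fake_scrap` are hypothetical stand-ins, the `example.com` URLs are placeholders, and the failure is simulated.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(): no network access, fails for one URL.
    try:
        await asyncio.sleep(0.01)
        if 'broken' in url:
            raise ConnectionError('simulated failure')
        return f'ok: {url}'
    except Exception as e:
        # Mirrors the notebook's pattern: report the error, do not re-raise.
        print(f'Error fetching {url}: {e}')
        return None


async def fake_scrap(urls):
    # One coroutine per URL; gather returns results in the order of its arguments.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


results = asyncio.run(fake_scrap([
    'https://example.com/a',
    'https://example.com/broken',
    'https://example.com/b',
]))
print(results)
```

Because the exception is handled inside each coroutine, the failed URL yields `None` and the other two still complete.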

## Step 6: Run the Asynchronous Scraper
We provide a list of URLs to scrape and run the asynchronous scraper.

<details>
<summary>Hint</summary>
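Use `asyncio.run(scrap(urls))` to start the event loop and run `scrap` to completion. Inside Jupyter Notebook this works because `nest_asyncio.apply()` was called in Step 2.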
</details>

In [None]:
urls = [
    'https://analytics.usa.gov/',
    'https://www.python.org/',
    'https://www.linkedin.com/'
]

asyncio.run(scrap(urls))

## ✅ Summary
- We learned how **asynchronous scraping** can speed up web data extraction.
- We used `aiohttp` and `asyncio` to make concurrent HTTP requests.
- We parsed HTML with BeautifulSoup and saved the extracted links to a CSV file.
- We handled Jupyter Notebook's already-running event loop with `nest_asyncio`.