# Interactive Exercise: Asynchronous Web Scraping
In this notebook, you will practice **asynchronous web scraping** using Python. You will write code to fetch multiple URLs concurrently, parse the HTML, and save links to a CSV file.
Each step includes a **hint**. After completing all exercises, check the **collapsed solution** section at the end.

## Step 1: Install Required Libraries
<details>
<summary>Hint</summary>
Run `pip install aiohttp nest-asyncio beautifulsoup4` to install the third-party libraries. `asyncio` is part of the Python standard library and does not need to be installed.
</details>
### Exercise:
Install the required packages.

In [None]:
# TODO: Install packages (asyncio ships with Python, so it is not listed)
!pip install aiohttp nest-asyncio beautifulsoup4

## Step 2: Import Libraries and Apply `nest_asyncio`
<details>
<summary>Hint</summary>
Import `aiohttp`, `asyncio`, `BeautifulSoup`, `csv`, `re`, and `nest_asyncio`. Then call `nest_asyncio.apply()`: Jupyter already runs an event loop, and this patch lets `asyncio.run()` be called inside it.
</details>
### Exercise:
Write the import statements and apply nest_asyncio.

In [None]:
# TODO: Import libraries and apply nest_asyncio
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import csv
import re
import nest_asyncio
nest_asyncio.apply()
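
To see why the patch is needed, the sketch below reproduces the situation in plain Python: it calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is roughly what a Jupyter cell does. The names `inner` and `outer` are illustrative and not part of the exercise. The sketch assumes `nest_asyncio.apply()` has not been called, so it is best run as a standalone script rather than in this notebook.

```python
import asyncio


async def inner():
    return 42


async def outer():
    coro = inner()
    try:
        # A loop is already running here, so an unpatched asyncio.run() refuses to start.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio` applied, the nested call is expected to succeed instead of raising.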

## Step 3: Define Asynchronous Function to Parse HTML and Save Links
<details>
<summary>Hint</summary>
Use `async def` to define the function.
Use BeautifulSoup to parse the HTML.
Use `soup.find_all('a', attrs={'href': re.compile('^http')})` to find all links whose `href` starts with `http` (`findAll` is an older alias for the same method).
Use `csv.writer` to write links to a CSV file.
</details>
### Exercise:
Define an asynchronous function `scrap_and_save_links(text)` that extracts every link starting with `http` from the HTML `text` and appends it to a CSV file.

In [None]:
# TODO: Define scrap_and_save_links
async def scrap_and_save_links(text):
    # parse HTML
    soup = BeautifulSoup(text, 'html.parser')
    # open CSV file
    with open('csv_file.csv', 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        # find all links starting with http
        for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
            url = link.get('href')
            writer.writerow([url])
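
The filtering and writing steps can be tried in isolation, without BeautifulSoup or a file on disk. The sketch below applies the same `^http` pattern to a hypothetical list of `href` values and writes the matches with `csv.writer` into an in-memory buffer. It assumes BeautifulSoup tests a compiled pattern against the attribute with `search`; the sample values and the names `hrefs`, `kept`, and `buffer` are made up for illustration.

```python
import csv
import io
import re

http_pattern = re.compile('^http')

# Hypothetical href values such as might appear on a page.
hrefs = [
    'https://www.python.org/',
    '/about/',
    'http://example.com/page',
    'mailto:someone@example.com',
    '#top',
]

# Keep only absolute http/https links; relative paths, anchors, and mailto links are dropped.
kept = [href for href in hrefs if http_pattern.search(href)]

# Write one URL per row, as the exercise does, but into memory instead of csv_file.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',')
for url in kept:
    writer.writerow([url])

print(kept)
print(buffer.getvalue())
```

Relative links such as `/about/` are skipped by this pattern, so the CSV only ever contains absolute URLs.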

## Step 4: Define Asynchronous Fetch Function
<details>
<summary>Hint</summary>
Use `async with session.get(url) as response` to fetch the webpage.
Use `await response.text()` to get the HTML content.
Create a task with `asyncio.create_task(scrap_and_save_links(text))` and await it.
</details>
### Exercise:
Write `async def fetch(session, url)` to fetch HTML content from a URL and call `scrap_and_save_links`.

In [None]:
# TODO: Define fetch function
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            text = await response.text()
            task = asyncio.create_task(scrap_and_save_links(text))
            await task
    except Exception as e:
        print(f"Error fetching {url}: {e}")
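
The hint wraps the parsing coroutine in a task and awaits it straight away. As a small sketch of what that means, the example below compares awaiting a task with awaiting the coroutine directly, using a stand-in coroutine `parse` (hypothetical, not part of the exercise) that just returns the length of the text.

```python
import asyncio


async def parse(text):
    # Stand-in for scrap_and_save_links: yield to the loop once, then return a value.
    await asyncio.sleep(0)
    return len(text)


async def main():
    # Schedule the coroutine as a task, then wait for it.
    task = asyncio.create_task(parse("<html></html>"))
    via_task = await task
    # Await the coroutine directly.
    direct = await parse("<html></html>")
    return via_task, direct


via_task, direct = asyncio.run(main())
print(via_task, direct)
```

Both forms give the same result when the task is awaited immediately. Creating a task mainly pays off when other work happens between scheduling it and awaiting it.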

## Step 5: Scrape Multiple URLs Concurrently
<details>
<summary>Hint</summary>
Build one `fetch(session, url)` coroutine per URL, collect them in a list, and run them concurrently with `await asyncio.gather(*tasks)`.
Use `aiohttp.ClientSession()` to maintain a session for all requests.
</details>
### Exercise:
Define `async def scrap(urls)` to concurrently scrape a list of URLs.

In [None]:
# TODO: Define scrap function
async def scrap(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(fetch(session, url))
        await asyncio.gather(*tasks)
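
To check that `asyncio.gather` really overlaps the waits, the sketch below swaps the network call for `asyncio.sleep`. The functions `fake_fetch` and `fake_scrap`, the `.example` URLs, and the 0.3 second delay are all assumptions made for illustration. One URL fails on purpose to show that handling the exception inside each coroutine keeps the others running.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    try:
        # Simulated network wait in place of session.get(url).
        await asyncio.sleep(delay)
        if 'bad' in url:
            raise ValueError('simulated failure')
        return f'fetched {url}'
    except Exception as e:
        # Same idea as fetch(): handle the error here so gather() is not interrupted.
        return f'error {url}: {e}'


async def fake_scrap(urls):
    coros = [fake_fetch(url, 0.3) for url in urls]
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*coros)


urls = ['https://a.example/', 'https://bad.example/', 'https://c.example/']
start = time.perf_counter()
results = asyncio.run(fake_scrap(urls))
elapsed = time.perf_counter() - start

print(results)
print(f'elapsed: {elapsed:.2f}s')
```

Three sequential waits would take about 0.9 seconds, while the concurrent run should finish in roughly the time of a single wait.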

## Step 6: Run the Asynchronous Scraper
<details>
<summary>Hint</summary>
Provide a list of URLs and call `asyncio.run(scrap(urls))` to run the scraper.
</details>
### Exercise:
Run the scraper on a list of URLs.

In [None]:
# TODO: Run scraper
urls = [
    'https://analytics.usa.gov/',
    'https://www.python.org/',
    'https://www.linkedin.com/'
]

asyncio.run(scrap(urls))

## ✅ Step 7: Self-Check Solutions
<details>
<summary>Click to view full solutions</summary>
```python
# Step 2: Imports
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import csv
import re
import nest_asyncio
nest_asyncio.apply()

# Step 3: scrap_and_save_links
async def scrap_and_save_links(text):
    soup = BeautifulSoup(text, 'html.parser')
    with open('csv_file.csv', 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
            writer.writerow([link.get('href')])

# Step 4: fetch
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            text = await response.text()
            task = asyncio.create_task(scrap_and_save_links(text))
            await task
    except Exception as e:
        print(f"Error fetching {url}: {e}")

# Step 5: scrap
async def scrap(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(fetch(session, url))
        await asyncio.gather(*tasks)

# Step 6: Run
urls = ['https://analytics.usa.gov/', 'https://www.python.org/', 'https://www.linkedin.com/']
asyncio.run(scrap(urls))
```
</details>