## Introduction to BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular Python library for pulling data out of HTML and XML files. It builds a parse tree from a page's source, which makes the data easy to navigate and extract. BeautifulSoup pairs well with `requests` and other HTTP libraries, which makes it a practical tool for scraping and crawling websites.

### Key Features of BeautifulSoup

- **Parsing HTML/XML:** BeautifulSoup can parse and navigate HTML or XML documents, even those with poorly formed markup.
- **Searching the Parse Tree:** It provides simple methods for searching and navigating the parse tree, such as `find()`, `find_all()`, and CSS selectors via `select()`.
- **Modifying the Parse Tree:** You can modify the HTML/XML tree, extract text, and manipulate tags.
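
The three features above can be sketched on a small inline document. The markup below is invented for illustration, and the example uses Python's built-in `html.parser` rather than `lxml` so that it runs without an extra parser install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = """
<html><body>
  <h1 id="title">Sample Bakery</h1>
  <ul class="menu">
    <li><a href="/cakes">Cakes</a></li>
    <li><a href="/biscuits">Biscuits</a></li>
  </ul>
</body></html>
"""

# 'html.parser' ships with Python, so no third-party parser is required.
soup = BeautifulSoup(html, "html.parser")

# Searching: first match, all matches, and a CSS selector.
heading = soup.find("h1")
links = soup.find_all("a")
menu_links = soup.select("ul.menu a")

# Extracting text and attribute values.
heading_text = heading.get_text(strip=True)
hrefs = [a["href"] for a in links]

# Modifying the tree: replace the heading's text in place.
heading.string = "Demo Bakery"
```

`find()` returns a single tag (or `None` when nothing matches), while `find_all()` and `select()` return lists, so the last two can be iterated directly.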

---

## How BeautifulSoup is Used in the Karachi Bakery Scraper

Let's connect the main BeautifulSoup concepts to their usage in the code below:

### Downloading the Webpage

The code uses the `requests` library to fetch the HTML content of a webpage:

```python
url = requests.get(site)
data = url.text
```

### Parsing HTML with BeautifulSoup

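The HTML content is parsed using BeautifulSoup, which creates a parse tree for easy navigation. The `'lxml'` argument selects the lxml parser, which has to be installed separately from BeautifulSoup: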

```python
soup = BeautifulSoup(data, 'lxml')
```

### Finding all anchor tags

To extract all the links from the page, the code uses the `find_all('a')` method, which returns all `<a>` tags (hyperlinks) in the HTML:

```python
anchors = soup.find_all('a')
```
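
Not every `<a>` tag carries an `href`, which matters when the results are turned into URLs. The sketch below uses invented markup and the built-in `html.parser` to show two ways of handling that: reading the attribute with `.get()`, and filtering with `href=True`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle anchor has no href attribute.
html = (
    '<a href="/shop">Shop</a>'
    '<a name="top">Top</a>'
    '<a href="https://example.com/">Home</a>'
)
soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")            # every <a> tag
with_href = soup.find_all("a", href=True)   # only tags that have an href

# .get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in all_anchors]
```

Here `hrefs` holds `None` for the anchor without a link, while `with_href` skips that tag entirely.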

### Navigating and Extracting Data

Each anchor tag is processed to extract the `href` attribute (the actual URL), and relative links are converted to absolute URLs using Python's `urljoin`.
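
As a quick illustration of how `urljoin` resolves different kinds of links, here is a minimal sketch. The base URL and link values are hypothetical and are not taken from the bakery site.

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for resolution.
base = "https://www.example.com/shop/index.html"

resolved = {
    # Root-relative link: replaces the whole path.
    "/contact": urljoin(base, "/contact"),
    # Document-relative link: resolved against the base's directory.
    "cakes.html": urljoin(base, "cakes.html"),
    # Parent-directory link.
    "../about": urljoin(base, "../about"),
    # Already absolute: returned unchanged.
    "https://other.example.org/x": urljoin(base, "https://other.example.org/x"),
}

for link, absolute in resolved.items():
    print(f"{link!r:35} -> {absolute}")
```

Because `urljoin` leaves absolute URLs untouched, it can be applied to every extracted `href` without first checking whether the link is relative.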

### Crawling Multiple Pages

The script keeps track of crawled URLs and uses a queue to visit new links found on each page, demonstrating how BeautifulSoup can be used in a simple web crawler.
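
The queue-plus-visited-set pattern can be tried without any network access. The sketch below is an assumption-laden stand-in, not the notebook's own code: `LINK_GRAPH` and `crawl` are hypothetical names, the dictionary plays the role of real pages, and `collections.deque` replaces the plain list so that removing from the front of the queue stays cheap.

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value the links found on it.
LINK_GRAPH = {
    "/": ["/shop", "/contact"],
    "/shop": ["/shop/cakes", "/", "/contact"],
    "/shop/cakes": ["/shop"],
    "/contact": [],
}

def crawl(start, max_pages=100):
    """Breadth-first crawl; returns pages in the order they were visited."""
    visited = []            # pages already processed, in visit order
    seen = {start}          # pages visited or already waiting in the queue
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in LINK_GRAPH.get(page, []):
            if link not in seen:    # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("/")
print(order)
```

Tracking `seen` at enqueue time, rather than only at visit time, keeps duplicate links out of the queue altogether.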

BeautifulSoup, combined with requests, provides a straightforward way to scrape and crawl websites, as demonstrated in the Karachi Bakery scraper below.

## 🎂 Scraping the entire bakery using BeautifulSoup

In [None]:
import requests
from bs4 import BeautifulSoup
import re

crawled_websites = set()   # Set to keep track of already crawled URLs
website_queue = []         # Queue for URLs to crawl

def is_url(link):
    # Returns True only if the whole string is an http(s) URL with no
    # whitespace, quotes, or angle brackets
    return re.fullmatch(r'https?://[^\s<>"]+', str(link)) is not None

def read_anchors(site):
    # Downloads the page and returns all of its anchor (<a>) tags
    # The timeout stops one unresponsive server from stalling the crawl
    url = requests.get(site, timeout=10)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    anchors = soup.find_all('a')
    return anchors

def link_from_anchor(anchor, base_url):
    # Extracts href from anchor and converts relative links to absolute
    from urllib.parse import urljoin
    href = anchor.get('href')  # None when the tag has no href attribute
    if not href:
        return None
    # urljoin leaves absolute URLs unchanged and resolves every kind of
    # relative link ('/shop', 'shop', '../shop') against base_url
    return urljoin(base_url, href.strip())

def get_links(site, filename):
    # Main crawling function
    # Adds the initial site to the queue if not already crawled
    if (site not in crawled_websites) and (site not in website_queue):
        website_queue.append(site)
    with open(filename, 'a') as f:
        while website_queue:
            if len(crawled_websites) >= 100:
                # Limit crawl to 100 pages for safety
                break
            current_site = website_queue.pop(0)
            if current_site in crawled_websites:
                continue
            try:
                anchors = read_anchors(current_site)  # Get all links from the page
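            except Exception:
                # Skip pages that fail to download or parse, and mark them
                # as crawled so they are not requested again
                crawled_websites.add(current_site)
                continue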
            for anchor in anchors:
                link = link_from_anchor(anchor, current_site)  # Extract and normalize link
                # Only add valid, in-domain links that are neither crawled nor already queued
                if (link and is_url(link) and ('karachibakery.com' in link)
                        and (link not in crawled_websites) and (link not in website_queue)):
                    website_queue.append(link)
                    f.write(link + "\n")
            crawled_websites.add(current_site)  # Mark as crawled

get_links("https://www.karachibakery.com/", "kaveri_ALLCAKES.txt")

In [2]:
crawled_websites  # Display the set of crawled URLs

{'http://www.karachibakery.com/virtualtour/karachi-virtualtour.html',
 'https://order.karachibakery.com',
 'https://order.karachibakery.com/',
 'https://order.karachibakery.com/pages/contact',
 'https://order.karachibakery.com/pages/terms',
 'https://order.karachibakery.com/shop',
 'https://order.karachibakery.com/shop/account',
 'https://order.karachibakery.com/shop/account/favourites',
 'https://order.karachibakery.com/shop/c',
 'https://order.karachibakery.com/shop/c/200g-packs_6055',
 'https://order.karachibakery.com/shop/c/assorted-biscuits-pack_5344',
 'https://order.karachibakery.com/shop/c/biscotti_5998',
 'https://order.karachibakery.com/shop/c/buy-1-get-1-free_4942',
 'https://order.karachibakery.com/shop/c/chocolate-biscuits_6000',
 'https://order.karachibakery.com/shop/c/christmas-special-cakes_5262',
 'https://order.karachibakery.com/shop/c/christmas-specials_8441',
 'https://order.karachibakery.com/shop/c/cocoatini-exc-chocolate-collection_4943',
 'https://order.karachiba