### **Structure of the Homework**

The homework is divided into **four tasks**, and each task requires you to implement specific classes and functions. The tasks build upon concepts of web scraping, HTML parsing, document processing, and crawling.

Here’s a breakdown of what is expected in each task:

#### **Task 1: Artifact Caching System**
- **Class**: `Artifact`
  - Implement a class that can **download**, **store**, and **retrieve** digital content from a URL.
  - This class should:
    - Fetch content from a URL and store it in memory.
    - Save the content to a local file to avoid redundant downloads.
    - Retrieve the content from the local cache if it already exists.
- **Methods**:
  - `fetch_artifact()`: Downloads the content from the URL.
  - `store_artifact()`: Stores the content locally in a unique file.
  - `retrieve_artifact()`: Retrieves the content from the local cache.
  
---

#### **Task 2: Smithsonian Snapshot Parser**
- **Class**: `SmithsonianParser`
  - Implement a class that parses HTML pages like [Smithsonian Snapshots](https://www.si.edu/newsdesk/snapshot/how-very-logical).
  - This class should:
    - Extract all links (`<a>` tags) and store them as a list of `(link_text, absolute_url)` tuples.
    - Extract all image URLs (`<img>` tags) and store them in a list.
    - Clean the text from the page, removing unnecessary elements (scripts, styles).
- **Methods**:
  - `fetch_page()`: Downloads the HTML content of a page.
  - `parse()`: Parses the content and extracts links, images, and cleaned text.
  - `get_anchors()`, `get_images()`, and `get_text()`: Return the extracted links, image URLs, and cleaned text, respectively.

---

#### **Task 3: Text Analysis of Smithsonian Snapshot**
- **Class**: `SmithsonianTextAnalyzer`
  - Implement a class that analyzes the text content of a Smithsonian Snapshot page.
  - This class should:
    - Perform **word frequency analysis**.
    - **Segment sentences** and split text properly.
    - Clean the text to remove special characters and whitespace.
- **Methods**:
  - `analyze()`: Fetches the page content, processes the text, and generates word frequency statistics.
  - `get_word_stats()`: Returns the word frequencies as a `Counter` object.
  - `split_into_sentences()`: Splits the text into sentences.

---

#### **Task 4: Smithsonian Snapshot Web Crawler**
- **Class**: `SmithsonianCrawler`
  - Implement a web crawler that starts at the [Smithsonian Snapshots](https://www.si.edu/snapshot) page and crawls through linked snapshot articles.
  - This class should:
    - Crawl pages to a specified depth.
    - Extract links, images, and cleaned text from each page.
    - Return the results as soon as the page is processed.
- **Methods**:
  - `crawl()`: Recursively visits pages starting from a given URL.
  - `crawl_generator()`: Generates content as the crawler processes each page.

---

### Task 1: Archiving Virtual Artifacts - Preserving a Digital Museum

#### Task Description
Imagine you are a data archivist working to preserve artifacts from the **Smithsonian Institution's digital collection**. Your job is to download, store, and manage different types of data (such as images, videos, and documents) to ensure they can be accessed later without repeated downloads.

Use the [Smithsonian Institution Collections](https://www.si.edu/snapshot) as your source of artifacts. You are tasked with building a caching system that can store downloaded files in a structured way and retrieve them as needed.

#### Tasks:
1. `fetch_artifact()`: Download content from the Smithsonian's collection page based on a provided URL. The method should return `True` if successful, or `False` if the download fails.
2. `store_artifact()`: Save the content of the artifact (text, image, etc.) in a local file system. Each artifact must be stored in its own unique file based on its URL.
3. `retrieve_artifact()`: Load an artifact from local storage using its URL, so that cached content is reused instead of being downloaded again.

#### Criteria for Success:
- Different URLs must map to different files, even if they belong to the same domain.
- Binary files (e.g., images) must be handled correctly without corruption.
- Artifacts that are already stored locally should not be downloaded again.
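
The first criterion can be illustrated in isolation. The sketch below is only one possible approach: `cache_filename` is a hypothetical helper (not part of the required `Artifact` interface), and the `example.org` URL is made up for illustration. It hashes the full URL and takes the file extension from the URL path only.

```python
import hashlib
import os
from urllib.parse import urlparse


def cache_filename(url: str, default_ext: str = ".artifact") -> str:
    """Map a URL to a stable, filesystem-safe filename (hypothetical helper)."""
    # Hash the full URL so two pages on the same domain never share a file.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Read the extension from the path only, ignoring the host and query string.
    ext = os.path.splitext(urlparse(url).path)[1]
    return digest + (ext or default_ext)


a = cache_filename("https://www.si.edu/newsdesk/snapshot/what-good-boy")
b = cache_filename("https://www.si.edu/newsdesk/snapshot/how-very-logical")
c = cache_filename("https://example.org/images/photo.jpg?size=large")  # made-up URL

print(a != b)               # same domain, different files
print(c.endswith(".jpg"))   # the extension survives a query string
```

Because the name depends only on the URL, the same URL always maps to the same file, which is what lets a later run find the cached copy.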

#### Link: [Smithsonian Institution Collections](https://www.si.edu/newsdesk/snapshot/what-good-boy)

In [3]:
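import os
import hashlib
import requests
from urllib.parse import urlparse

class Artifact:
    def __init__(self, url):
        self.url = url
        self.content = None
        self.filename = None

    def generate_filename(self):
        """
        Generates a unique and safe filename based on the URL.
        You will need to use a hash function (hint: hashlib).
        """
        # Hash the full URL so that different URLs on the same domain get different files.
        url_hash = hashlib.md5(self.url.encode()).hexdigest()
        # Take the extension from the URL path only, so that a query string or a
        # bare domain such as "https://www.si.edu" does not yield a bogus extension.
        extension = os.path.splitext(urlparse(self.url).path)[1]
        self.filename = url_hash + (extension or ".artifact")
        return self.filename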

    def fetch_artifact(self):
        """
        Download the artifact from the given URL and store its content in memory.
        If the download is successful, return True. Otherwise, return False.
        """
        try:
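            # response.content reads the whole body, so stream=True is not needed;
            # the timeout keeps a stalled server from hanging the download.
            response = requests.get(self.url, timeout=10)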
            response.raise_for_status()
            self.content = response.content
            return True
        except (requests.exceptions.RequestException, IOError) as e:
            print(f"Failed to fetch artifact: {e}")
            return False

    def store_artifact(self, directory="artifact_cache"):
        """
        Store the artifact content in a local file in a cache directory.
        Ensure the file is stored with a unique name to avoid overwriting.
        """
        if not self.content:
            print("No content to store.")
            return False

        if not self.filename:
            self.generate_filename()

        # Create the cache directory if needed; do nothing if it already exists.
        os.makedirs(directory, exist_ok=True)

        filepath = os.path.join(directory, self.filename)
        try:
            with open(filepath, 'wb') as f:
                f.write(self.content)
            return True
        except IOError as e:
            print(f"Failed to store artifact: {e}")
            return False

    def retrieve_artifact(self, directory="artifact_cache"):
        """
        Retrieve the artifact from the local cache if it has been stored before.
        Return True if successful, False otherwise.
        """
        if not self.filename:
            self.generate_filename()

        filepath = os.path.join(directory, self.filename)
        if os.path.exists(filepath):
            try:
                with open(filepath, 'rb') as f:
                    self.content = f.read()
                return True
            except IOError as e:
                print(f"Failed to retrieve artifact: {e}")
                return False
        return False

In [4]:
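artifact = Artifact("https://www.si.edu/newsdesk/snapshot/what-good-boy")

# Check the local cache first and download only on a cache miss.
if not artifact.retrieve_artifact():
    if artifact.fetch_artifact():
        artifact.store_artifact()

artifact.retrieve_artifact()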


True

### Task 2: Parsing Web Pages - Smithsonian Snapshot

#### Task Description
For this task, you will be working with pages from the [Smithsonian Newsdesk Snapshot](https://www.si.edu/newsdesk/snapshot/how-very-logical). Your goal is to extract meaningful content such as links, images, and clean text from the page.

You will need to:
1. Extract all hyperlinks (anchor tags) from the page and store them as a list of tuples `('link_text', 'absolute_url')`. Make sure to handle relative links by converting them to absolute URLs.
2. Collect all image URLs in a list. Ensure relative URLs are converted to absolute URLs.
3. Extract the plain text from the page, ignoring scripts, styles, and comments.

#### Criteria for Success:
- Extract all links as `('link_text', 'absolute_url')` and handle relative URLs.
- Extract all image URLs as absolute URLs.
- Clean and extract the main text from the document.
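
Relative-link handling can be checked without any network access. The sketch below uses `urljoin` from the standard library with the Snapshot URL as the base; the `hrefs` list is an illustrative assumption, chosen to cover the usual cases (root-relative, fragment-only, sibling page, and already absolute).

```python
from urllib.parse import urljoin

base = "https://www.si.edu/newsdesk/snapshot/how-very-logical"

# Illustrative href values of the kinds typically found in <a> tags.
hrefs = ["/visit", "#main-content", "what-good-boy", "http://go.si.edu/si-give"]

resolved = [urljoin(base, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href!r:30} -> {absolute}")
```

A root-relative path replaces the whole path of the base, a fragment is appended to the base, a bare name replaces only the last path segment, and an absolute URL is returned unchanged.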

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

In [5]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class SmithsonianParser:
    def __init__(self, url):
        self.url = url
        self.anchors = []
        self.images = []
        self.text = ""

    def fetch_page(self):
        """
        Fetch the HTML content of the given URL. If the request is successful,
        return the page content; otherwise, return None.
        """
        try:
            # A timeout keeps a stalled server from hanging the request indefinitely.
            response = requests.get(self.url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Failed to fetch the page: {e}")
            return None

    def parse(self, html_content):
        """
        Parse the HTML content using BeautifulSoup. You need to:
        1. Extract all anchor tags and store them as ('link_text', 'absolute_url').
        2. Extract all image URLs and store them in a list.
        3. Extract clean, readable text from the page.
        """
        soup = BeautifulSoup(html_content, 'html.parser')

        self.anchors = [
            (a.get_text(strip=True), urljoin(self.url, a['href']))
            for a in soup.find_all('a', href=True)
        ]

        self.images = [
            urljoin(self.url, img['src'])
            for img in soup.find_all('img', src=True)
        ]

        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        self.text = soup.get_text(separator=' ').strip()

        self.text = ' '.join(self.text.split())

    def get_anchors(self):
        """
        Return the list of anchors extracted from the page.
        """
        return self.anchors

    def get_images(self):
        """
        Return the list of image URLs extracted from the page.
        """
        return self.images

    def get_text(self):
        """
        Return the cleaned text content extracted from the page.
        """
        return self.text

In [6]:
url = "https://www.si.edu/newsdesk/snapshot/how-very-logical"
parser = SmithsonianParser(url)

page_content = parser.fetch_page()
if page_content:
    parser.parse(page_content)

    print("Anchors:", parser.get_anchors())
    print("Images:", parser.get_images())
    print("Text:", parser.get_text())


Anchors: [('Skip to main content', 'https://www.si.edu/newsdesk/snapshot/how-very-logical#main-content'), ('My Visit', 'https://www.si.edu/myvisit'), ('Donate', 'http://go.si.edu/si-give'), ('Smithsonian Institution', 'https://www.si.edu/'), ('Visit', 'https://www.si.edu/visit'), ('Hours and Locations', 'https://www.si.edu/visit/hours'), ('Entry and Guidelines', 'https://www.si.edu/visit/tips'), ('Maps and Brochures', 'https://www.si.edu/visit/maps'), ('Dining and Shopping', 'https://www.si.edu/dining'), ('Accessibility', 'https://www.si.edu/visit/accessibility'), ('Visiting with Kids', 'https://www.si.edu/visit/kids'), ('Group Visits', 'https://www.si.edu/visit/groups'), ('Group Sales', 'https://www.si.edu/groupsales'), ("What's On", 'https://www.si.edu/whats-on'), ('Exhibitions', 'https://www.si.edu/exhibitions'), ('Current', 'https://www.si.edu/exhibitions'), ('Upcoming', 'https://www.si.edu/exhibitions/upcoming'), ('Past', 'https://www.si.edu/exhibitions/past'), ("Today's Events", 

### Task 3: Summarizing Smithsonian Snapshots

#### Task Description
You will analyze the text content of a Smithsonian Snapshot page, such as [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical), focusing on the following:

1. **Extract key phrases**: Use basic natural language processing (NLP) techniques to identify the most frequently used words and key phrases from the main body of the text.
2. **Sentence segmentation**: Split the text into individual sentences, making sure to handle punctuation and proper sentence breaks appropriately.
3. **Clean the text**: Remove any extraneous characters, symbols, or whitespace.

#### Tasks:
1. **Word Frequency Analysis**: Implement a method to count the frequency of each word in the page content, converting all words to lowercase.
2. **Sentence Splitting**: Implement a method to split the content into individual sentences, being mindful of punctuation and line breaks.
3. **Cleaning and Normalization**: Clean the text to remove any special characters or unnecessary whitespace.

#### Criteria for Success:
- The `get_word_stats()` method should return a frequency distribution of words as a `Counter` object.
- Sentences should be extracted cleanly from the page’s main text.
- The text should be normalized: lowercased, with special characters removed.
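
As a small illustration of the first criterion, word counting with `Counter` might look like the sketch below. The `word_stats` helper, the regex-based tokenization, and the sample sentence are assumptions made for this example; the task itself does not prescribe a tokenizer.

```python
import re
from collections import Counter


def word_stats(text: str) -> Counter:
    """Count lowercase words in text (hypothetical helper)."""
    # Lowercase first, then treat runs of letters, digits and apostrophes as words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


sample = "The ear tips were logical. The tips, the ears!"
stats = word_stats(sample)
print(stats.most_common(2))  # [('the', 3), ('tips', 2)]
```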

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

In [7]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
import re

nltk.download('punkt')

class SmithsonianTextAnalyzer:
    def __init__(self, url):
        self.url = url
        self.text = ""
        self.sentences = []
        self.word_frequency = None

    def fetch_page(self):
        """
        Fetch the HTML content of the Smithsonian Snapshot page.
        Return the page content if successful, else return None.
        """
        try:
            # A timeout keeps a stalled server from hanging the request indefinitely.
            response = requests.get(self.url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching the page: {e}")
            return None

    def clean_text(self, html_content):
        """
        Use BeautifulSoup to extract clean text from the HTML content.
        Remove scripts, styles, and special characters.
        """
        soup = BeautifulSoup(html_content, 'html.parser')
        
        for script in soup(["script", "style"]):
            script.decompose()
        
        text = soup.get_text()
        
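        text = re.sub(r'\s+', ' ', text).strip()
        # Caution: this also strips '.', '!' and '?', so split_into_sentences()
        # has no sentence boundaries left to split on.
        text = re.sub(r'[^\w\s]', '', text)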
        
        self.text = text

    def split_into_sentences(self):
        """
        Use nltk's sentence tokenizer to split the cleaned text into sentences.
        If NLTK data is not available, use a simple regex-based approach.
        """
        try:
            nltk.data.find('tokenizers/punkt')
            self.sentences = sent_tokenize(self.text)
        except LookupError:
            print("NLTK Punkt tokenizer not found. Using simple regex-based sentence splitting.")
            self.sentences = re.split(r'(?<=[.!?])\s+', self.text)

    def get_word_stats(self):
        """
        Count the frequency of each word in the text. Return a Counter object.
        Ensure the text is lowercased for accurate counting.
        """
        try:
            nltk.data.find('tokenizers/punkt')
            words = word_tokenize(self.text.lower())
        except LookupError:
            print("NLTK word tokenizer not found. Using simple space-based word splitting.")
            words = self.text.lower().split()
        
        self.word_frequency = Counter(words)
        return self.word_frequency

    def analyze(self):
        """
        Orchestrate the fetching, cleaning, and analysis of the text from the page.
        - Fetch the HTML content.
        - Clean the text.
        - Split into sentences.
        - Get word frequency statistics.
        """
        html_content = self.fetch_page()
        if html_content:
            self.clean_text(html_content)
            self.split_into_sentences()
            self.get_word_stats()
            return True
        return False


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\eugen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
url = "https://www.si.edu/newsdesk/snapshot/how-very-logical"
analyzer = SmithsonianTextAnalyzer(url)
    
if analyzer.analyze():
    print("Text Analysis Results:")
    print(f"Number of sentences: {len(analyzer.sentences)}")
    print(f"Number of unique words: {len(analyzer.word_frequency)}")
    print("Top 10 most frequent words:")
    for word, count in analyzer.word_frequency.most_common(10):
        print(f"{word}: {count}")
else:
    print("Analysis failed.")

NLTK Punkt tokenizer not found. Using simple regex-based sentence splitting.
NLTK word tokenizer not found. Using simple space-based word splitting.
Text Analysis Results:
Number of sentences: 1
Number of unique words: 339
Top 10 most frequent words:
the: 20
and: 19
of: 11
a: 11
smithsonian: 10
in: 10
to: 7
ear: 5
tips: 5
nimoy: 5
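
The single sentence reported above is consistent with punctuation being removed before the text is split. One way to keep sentence boundaries while still normalizing the words is to split first and clean each sentence afterwards. The sketch below is a minimal illustration under that assumption: `split_then_normalize` is a hypothetical helper, it uses only the regex fallback rather than NLTK, and the sample text is made up.

```python
import re
from typing import List


def split_then_normalize(text: str) -> List[str]:
    """Split into sentences first, then normalize each one (hypothetical helper)."""
    # 1. Split after '.', '!' or '?' while the punctuation is still present.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    cleaned = []
    for sentence in raw_sentences:
        # 2. Only now remove special characters and lowercase the sentence.
        sentence = re.sub(r"[^\w\s]", "", sentence).lower()
        # 3. Collapse any leftover runs of whitespace.
        sentence = re.sub(r"\s+", " ", sentence).strip()
        if sentence:
            cleaned.append(sentence)
    return cleaned


sample = "Live long and prosper! Spock's ears are iconic.  Are they logical?"
sentences = split_then_normalize(sample)
print(len(sentences))  # 3
print(sentences)
```

A plain regex split like this will still break on abbreviations such as "Dr." or "U.S.", which is the kind of case a trained tokenizer is meant to handle.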


### Task 4: Building a Smithsonian Snapshots Crawler

#### Task Description
In this task, you will create a **web crawler** that will start at the Smithsonian Snapshots page and follow links to gather and analyze the content from multiple snapshot pages. The Smithsonian Snapshot section contains multiple articles, and your crawler will explore these articles, download their content, and process the information.

You will implement a web crawler that:
1. Starts at the [Smithsonian Snapshots Page](https://www.si.edu/snapshot).
2. Crawls through snapshot pages, extracting key information (links, images, and text) from each page.
3. Follows links from the initial page to other snapshot articles up to a specified depth.
4. Processes and stores the content from each crawled page.

#### Tasks:
1. **Implement a Crawler**: Start crawling from the [Smithsonian Snapshots Page](https://www.si.edu/snapshot), gather links to snapshot articles, and visit each article.
2. **Content Extraction**: For each visited page, extract:
   - Anchor tags (`'link_text', 'absolute_url'`).
   - Image URLs (absolute URLs).
   - Cleaned text content from the body of the article.
3. **Depth Control**: Implement a parameter to control the depth of the crawl (i.e., how many levels of links the crawler should follow).
4. **Yield Results**: Your crawler should return a **generator** that yields the results (text, links, images) as soon as a page is processed, rather than collecting everything before returning.

#### Criteria for Success:
- The crawler should respect the specified depth and only crawl the specified number of levels.
- Each snapshot page should have its content (links, images, text) extracted and returned.
- The crawler should handle relative links and convert them to absolute URLs.
- The content should be cleaned and stored properly.
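
Depth control and yielding results one page at a time can be tried out without touching the network. The sketch below crawls a made-up in-memory link graph (`SITE`) instead of real pages; the function name, its parameters, and the paths are all assumptions for illustration, and a real crawler would fetch and parse each page where this one looks up a dictionary.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# Made-up site: each page maps to the links found on it.
SITE: Dict[str, List[str]] = {
    "/snapshot": ["/snapshot/a", "/snapshot/b"],
    "/snapshot/a": ["/snapshot/c", "/snapshot"],
    "/snapshot/b": [],
    "/snapshot/c": ["/snapshot/d"],
    "/snapshot/d": [],
}


def crawl(
    url: str,
    max_depth: int,
    max_pages: int = 10,
    depth: int = 0,
    visited: Optional[Set[str]] = None,
) -> Iterator[Tuple[str, int]]:
    """Depth-first crawl that yields (url, depth) as soon as a page is visited."""
    if visited is None:
        visited = set()
    # Stop on depth limit, repeated URL, or once the page budget is used up.
    if depth > max_depth or url in visited or len(visited) >= max_pages:
        return
    visited.add(url)
    yield url, depth  # the caller can process this page before the crawl continues
    for link in SITE.get(url, []):
        yield from crawl(link, max_depth, max_pages, depth + 1, visited)


pages = list(crawl("/snapshot", max_depth=1))
print(pages)  # [('/snapshot', 0), ('/snapshot/a', 1), ('/snapshot/b', 1)]
```

Sharing one `visited` set across the recursive calls is what prevents the back-link from `/snapshot/a` to `/snapshot` from causing a loop, and the `max_pages` check caps the total number of visited pages regardless of depth.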

#### Link: [Smithsonian Snapshots Page](https://www.si.edu/snapshot)

**Make sure the total number of visited links does not exceed 10, or you might get 0 for the whole assignment because of the long runtime when checking the links!**

In [16]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from typing import Optional, Dict, Generator
import time

class SmithsonianCrawler:
    def __init__(self, start_url: str, max_depth: int = 2) -> None:
        self.start_url: str = start_url
        self.max_depth: int = max_depth
        self.visited: set = set()

    def fetch_page(self, url: str) -> Optional[bytes]:
        try:
            # A timeout keeps a stalled server from hanging the crawl indefinitely.
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.content
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def extract_content(self, html_content: bytes, base_url: str) -> Dict[str, object]:
        # The result holds two lists ('anchors', 'images') and one string ('text').
        soup = BeautifulSoup(html_content, 'html.parser')
        
        anchors = []
        for a in soup.find_all('a', href=True):
            link_text = a.text.strip()
            absolute_url = urljoin(base_url, a['href'])
            anchors.append((link_text, absolute_url))
        
        images = [urljoin(base_url, img['src']) for img in soup.find_all('img', src=True)]
        
        text = ' '.join([p.text for p in soup.find_all('p')])
        
        return {
            'anchors': anchors,
            'images': images,
            'text': text
        }

    def crawl(self, url: str, depth: int = 0) -> Generator[Dict[str, object], None, None]:
        # Drop the fragment so "page" and "page#main-content" count as one URL.
        url = url.split('#', 1)[0]
        if depth > self.max_depth or url in self.visited or len(self.visited) >= 10:
            return
        
        self.visited.add(url)
        print("\n")
        print(f"Crawling {url} at depth {depth}...")
        print("\n")
        
        html_content = self.fetch_page(url)
        # Pause after every request so consecutive fetches are at least 1 s apart.
        time.sleep(1)
        if html_content:
            content = self.extract_content(html_content, url)
            # Yield this page before following its links.
            yield content
            
            for _, link in content['anchors']:
                if link.startswith(self.start_url):
                    yield from self.crawl(link, depth + 1)

    def crawl_generator(self) -> Generator[Dict[str, object], None, None]:
        yield from self.crawl(self.start_url)



In [17]:
start_url = "https://www.si.edu/snapshot"
crawler = SmithsonianCrawler(start_url, max_depth=2)

for page_content in crawler.crawl_generator():
    print("Crawled Page Content:")
    print(f"Anchors: {page_content['anchors']}")
    print(f"Images: {page_content['images']}")
    print(f"Text (excerpt): {page_content['text'][:200]}")



Crawling https://www.si.edu/snapshot at depth 0...


Crawled Page Content:
Anchors: [('Skip to main content', 'https://www.si.edu/snapshot#main-content'), ('My Visit', 'https://www.si.edu/myvisit'), ('Donate', 'http://go.si.edu/si-give'), ('Smithsonian Institution', 'https://www.si.edu/'), ('Visit', 'https://www.si.edu/visit'), ('Hours and Locations', 'https://www.si.edu/visit/hours'), ('Entry and Guidelines', 'https://www.si.edu/visit/tips'), ('Maps and Brochures', 'https://www.si.edu/visit/maps'), ('Dining and Shopping', 'https://www.si.edu/dining'), ('Accessibility', 'https://www.si.edu/visit/accessibility'), ('Visiting with Kids', 'https://www.si.edu/visit/kids'), ('Group Visits', 'https://www.si.edu/visit/groups'), ('Group Sales', 'https://www.si.edu/groupsales'), ("What's On", 'https://www.si.edu/whats-on'), ('Exhibitions', 'https://www.si.edu/exhibitions'), ('Current', 'https://www.si.edu/exhibitions'), ('Upcoming', 'https://www.si.edu/exhibitions/upcoming'), ('Past', 'https:/