## Testing crawl4ai for web-scraping  
I'll be using crawl4ai here. It's a flexible, easy-to-use crawler, and its performance is pretty good too. We could use more traditional tools like requests/BeautifulSoup/Selenium/Playwright, but I personally find them a bit cumbersome (still worth testing, though!). crawl4ai can also filter out the relevant content and output Markdown, which would save us an extra data-cleaning step. You can find the crawl4ai documentation [here](https://docs.crawl4ai.com/).

In [1]:
import crawl4ai

## Extracting article URLs from the MassLive Archive  
I'll be testing this on page 1 of the January archive of the News section.

In [None]:
'''
Extracting the links to the articles, page 1 of January Articles.
'''
import asyncio
import re
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Restrict the crawl to the headline links inside the archive list
    config = CrawlerRunConfig(
        css_selector="ul.archive a.archive-item__headline-link"
    )
    
    archive_url = "https://www.masslive.com/archives/news/2025/january/1/"  # Page 1 of the January 2025 news archive
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=archive_url, config=config)
        if result.success:
            # Extract every href value from the cleaned HTML using regex
            url_pattern = r'href="([^"]+)"'
            article_urls = re.findall(url_pattern, result.cleaned_html)
            
            print(f"Found {len(article_urls)} article URLs:")
            for i, article_url in enumerate(article_urls, 1):
                print(f"{i}. {article_url}")
            
            # Return the list for further use
            return article_urls
        else:
            print(f"Failed to crawl: {result.error_message}")
            return []

# Call the function and get the list (top-level await works in a notebook cell)
article_urls = await main()
print(f"\nTotal URLs collected: {len(article_urls)}")

Found 100 article URLs:
1. https://www.masslive.com/news/2025/01/ibram-x-kendi-leaves-boston-university-for-director-role-at-howard-university.html
2. https://www.masslive.com/news/2025/01/boston-registered-sex-offender-grabbed-woman-before-filming-him-da-says.html
3. https://www.masslive.com/news/2025/01/body-found-stabbed-in-boston-harbor-in-1991-identified-by-genealogy.html
4. https://www.masslive.com/news/2025/01/recall-alert-chocolate-covered-snacks-other-items-may-cause-serious-allergies.html
5. https://www.masslive.com/news/2025/01/through-tears-nancy-kerrigan-mourns-skaters-killed-in-dc-crash-not-sure-how-to-process-it.html
6. https://www.masslive.com/news/2025/01/video-appears-to-show-moment-american-airlines-flight-collided-with-military-helicopter.html
7. https://www.masslive.com/news/2025/01/dc-plane-crash-deadliest-aviation-accident-in-us-in-decades.html
8. https://www.masslive.com/news/2025/01/blood-on-his-hands-man-stabbed-after-police-say-he-tried-to-rob-a-woman.html
9.

I verified the last URL, and it is indeed the last article on the page. We just need to loop through every page of every month.
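
Looping over the archive pages could look roughly like this. It's only a sketch: it assumes every month follows the same URL pattern as the January page I tested (`/archives/news/<year>/<month>/<page>/`), which I have only checked for page 1 of January. `archive_page_urls` and `max_pages` are names I made up for illustration, and the real number of pages per month is unknown, so in practice we would probably keep requesting pages until one comes back with no article links.

```python
# Month names as they appear in the archive URL (assumed lowercase English, like "january").
MONTHS = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)

# Pattern inferred from the single archive URL tested above (an assumption for other months).
ARCHIVE_TEMPLATE = "https://www.masslive.com/archives/news/{year}/{month}/{page}/"


def archive_page_urls(year, months=range(1, 13), max_pages=3):
    """Build archive page URLs for every requested month and page number.

    max_pages is a hypothetical upper bound, not the real page count.
    """
    urls = []
    for month_number in months:
        month_name = MONTHS[month_number - 1]  # 1 -> "january"
        for page in range(1, max_pages + 1):
            urls.append(
                ARCHIVE_TEMPLATE.format(year=year, month=month_name, page=page)
            )
    return urls


# Example: the first two pages of January and February 2025
for page_url in archive_page_urls(2025, months=[1, 2], max_pages=2):
    print(page_url)
```

Each generated URL can then be passed to `crawler.arun` in place of the hard-coded archive URL.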

## Testing the extraction of the article  
I'm reading the Markdown output to filter out all the unnecessary HTML elements. The CSS selector needs to be changed, because the one I've chosen contains an element ID (`#arc-TDWMELVMTBFPTFAQNT5XEX22CI`) that I think is specific to this article.

In [None]:
'''
Testing the extraction of an article. Output is markdown.
'''
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        css_selector="#arc-TDWMELVMTBFPTFAQNT5XEX22CI > div:nth-child(3)",
        exclude_external_links=False,
    )
    
    url = "https://www.masslive.com/news/2025/01/ibram-x-kendi-leaves-boston-university-for-director-role-at-howard-university.html"
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if result.success:
            # Drop everything from the "## More News" heading onward
            markdown_content = result.markdown
            if "## More News" in markdown_content:
                markdown_content = markdown_content.split("## More News")[0]
            
            print("Success!")
            print("Filtered markdown length:", len(markdown_content))
            print("\nExtracted markdown content:")
            print(markdown_content)
        else:
            print(f"Failed to crawl: {result.error_message}")

await main()

Success!
Filtered markdown length: 5062

Extracted markdown content:
Ibram X. Kendi, the founding director of the [Boston University Center for Antiracist Research](https://www.bu.edu/antiracism-center/the-center/), is leaving Boston for a new position at [Howard University](https://thedig.howard.edu/all-stories/howard-university-announces-hiring-dr-ibram-x-kendi-director-howard-university-institute-advanced "https://thedig.howard.edu/all-stories/howard-university-announces-hiring-dr-ibram-x-kendi-director-howard-university-institute-advanced"). 
The center will close when its charter with Boston University expires on June 30, [according to Boston University Today](https://www.bu.edu/articles/2025/ibram-x-kendi-departing-boston-university/ "https://www.bu.edu/articles/2025/ibram-x-kendi-departing-boston-university/"), the school’s daily website. The remaining staff members will still be employed through June 30. 
Kendi will serve as the director of the new Howard University Institute f
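
The extracted Markdown above still carries inline links, some with a quoted title that just repeats the URL. A minimal sketch of cleaning that up in the scraping step, assuming we want to keep the link text and drop the URL: `clean_article_markdown` and `LINK_PATTERN` are hypothetical names, and the regex is a simple approximation that will miss URLs containing spaces or parentheses.

```python
import re

# Matches [text](url) and [text](url "title").
# The (?<!!) lookbehind leaves image syntax ![alt](src) untouched.
LINK_PATTERN = re.compile(r'(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+"[^"]*")?\)')


def clean_article_markdown(markdown, cutoff="## More News"):
    """Cut the text at the cutoff heading and replace links with their text."""
    body = markdown.split(cutoff, 1)[0]   # keep only what precedes the heading
    body = LINK_PATTERN.sub(r"\1", body)  # [text](url) -> text
    return body.strip()


sample = (
    'Kendi is leaving for [Howard University]'
    '(https://thedig.howard.edu/x "https://thedig.howard.edu/x"). \n'
    "## More News\n"
    "* [Other story](https://example.com/other)"
)
print(clean_article_markdown(sample))
```

Whether to strip links here or later in preprocessing is a judgment call; doing it here keeps the saved text smaller, but it throws away the URLs for good.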

## Putting it all together

I'll try extracting the first 5 articles from the list of urls I created earlier. 

In [None]:
import asyncio
import re
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_article_markdown(url):
    """Extract markdown content from a single article URL"""
    config = CrawlerRunConfig(
        css_selector=".main.article__story .entry-content",  # A more general CSS selector (not tied to one article's ID)
        exclude_external_links=False,
    )
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if result.success:
            # Drop everything from the "## More News" heading onward
            markdown_content = result.markdown
            if "## More News" in markdown_content:
                markdown_content = markdown_content.split("## More News")[0]
            return markdown_content
        else:
            print(f"Failed to crawl {url}: {result.error_message}")
            return None


async def main():
    # Reuse the article_urls list collected by the archive cell above
    print("Getting article URLs...")
    
    if not article_urls:
        print("No URLs found!")
        return []
    
    print(f"Found {len(article_urls)} article URLs")
    
    # Take only the first 5 URLs
    first_5_urls = article_urls[:5]
    print("Processing first 5 URLs...")
    
    # Extract markdown content for each URL
    markdown_contents = []
    
    for i, url in enumerate(first_5_urls, 1):
        print(f"Processing article {i}/5: {url}")
        markdown_content = await extract_article_markdown(url)
        
        if markdown_content:
            markdown_contents.append({
                'url': url,
                'markdown': markdown_content,
                'length': len(markdown_content)
            })
            print(f"  ✓ Successfully extracted {len(markdown_content)} characters")
        else:
            print(f"  ✗ Failed to extract content")
    
    print(f"\nSuccessfully extracted {len(markdown_contents)} articles")
    
    # Display results
    for i, article in enumerate(markdown_contents, 1):
        print(f"\n--- Article {i} ---")
        print(f"URL: {article['url']}")
        print(f"Length: {article['length']} characters")
        print(f"Preview: {article['markdown'][:200]}...")
    
    return markdown_contents

# Run the main function
articles_data = await main()

Getting article URLs...
Found 100 article URLs
Processing first 5 URLs...
Processing article 1/5: https://www.masslive.com/news/2025/01/ibram-x-kendi-leaves-boston-university-for-director-role-at-howard-university.html


  ✓ Successfully extracted 5062 characters
Processing article 2/5: https://www.masslive.com/news/2025/01/boston-registered-sex-offender-grabbed-woman-before-filming-him-da-says.html


  ✓ Successfully extracted 2585 characters
Processing article 3/5: https://www.masslive.com/news/2025/01/body-found-stabbed-in-boston-harbor-in-1991-identified-by-genealogy.html


  ✓ Successfully extracted 1907 characters
Processing article 4/5: https://www.masslive.com/news/2025/01/recall-alert-chocolate-covered-snacks-other-items-may-cause-serious-allergies.html


  ✓ Successfully extracted 5031 characters
Processing article 5/5: https://www.masslive.com/news/2025/01/through-tears-nancy-kerrigan-mourns-skaters-killed-in-dc-crash-not-sure-how-to-process-it.html


  ✓ Successfully extracted 7632 characters

Successfully extracted 5 articles

--- Article 1 ---
URL: https://www.masslive.com/news/2025/01/ibram-x-kendi-leaves-boston-university-for-director-role-at-howard-university.html
Length: 5062 characters
Preview: Ibram X. Kendi, the founding director of the [Boston University Center for Antiracist Research](https://www.bu.edu/antiracism-center/the-center/), is leaving Boston for a new position at [Howard Unive...

--- Article 2 ---
URL: https://www.masslive.com/news/2025/01/boston-registered-sex-offender-grabbed-woman-before-filming-him-da-says.html
Length: 2585 characters
Preview: A woman on Boston’s Tremont Street followed and filmed a man – a registered sex offender – after he grabbed her buttocks, Suffolk County District Attorney Kevin Hayden’s office said Thursday. 
Kenneth...

--- Article 3 ---
URL: https://www.masslive.com/news/2025/01/body-found-stabbed-in-boston-harbor-in-1991-identified-by-genealogy.html
Length: 1907 characters
Previ

**Some notes**:  
1. This should work; we just need to extract all the URLs from each month. We can split the load with the other team: it will be faster if each person is assigned a set of months.
2. The extracted content needs to be cleaned. We need to remove the ## More News section and any external links. This can be done in the actual preprocessing phase, but it would make sense to do some cleaning in the scraping phase. 
3. crawl4ai should handle request limits intelligently, but I haven't stress-tested the package. The final production pipeline should be able to handle disruptions, so that when the loop is run again, it starts where it left off. 
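
For the "starts where it left off" part of note 3, one possible approach (a sketch, not a tested pipeline) is to append each article to a JSON Lines file as soon as it is extracted, and to skip any URL already in that file on the next run. `load_done`, `scrape_with_resume`, and the `articles.jsonl` filename are all made up for illustration; `fetch` stands for any async function that returns the Markdown for a URL, or `None` on failure.

```python
import asyncio
import json
from pathlib import Path


def load_done(checkpoint_path):
    """Return the set of URLs already saved in the JSON Lines checkpoint file."""
    path = Path(checkpoint_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        return {json.loads(line)["url"] for line in handle if line.strip()}


async def scrape_with_resume(urls, fetch, checkpoint_path="articles.jsonl"):
    """Fetch every URL not yet in the checkpoint, saving each result as it arrives.

    Returns the number of articles saved during this run.
    """
    done = load_done(checkpoint_path)
    saved = 0
    with open(checkpoint_path, "a", encoding="utf-8") as handle:
        for url in urls:
            if url in done:
                continue  # already saved by an earlier run
            markdown = await fetch(url)
            if markdown is None:
                continue  # failed; it will be retried on the next run
            handle.write(json.dumps({"url": url, "markdown": markdown}) + "\n")
            handle.flush()  # push progress to disk before the next request
            done.add(url)
            saved += 1
    return saved
```

In the notebook this could be called as `await scrape_with_resume(article_urls, extract_article_markdown)`, since `extract_article_markdown` already returns `None` on failure. One caveat: if the process is killed in the middle of a write, the last line of the file may be incomplete and would need to be trimmed before rerunning.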