# Data Preparation for LLM Pre-training
## Notebook 1: Data Collection

In this notebook, we collect the example data that the pre-processing pipeline in the next notebook will operate on.

Pre-training data can come from a wide variety of sources, including books, articles, and websites. In this notebook, we source our data from [AWS Blogs](https://aws.amazon.com/blogs/) in four steps:
- Choose a blog category to scrape such as Machine Learning, Security, Big Data, etc.
- Crawl the blog category to get the URLs of all the blog posts.
- Scrape the content of each blog post.
- Save the data into a [Web Archive (WARC)](https://en.wikipedia.org/wiki/WARC_(file_format)) file.

In [None]:
# install libraries for web scraping
%pip install -Uqq beautifulsoup4 requests warcio

## Web scraping AWS Blogs
AWS Blogs are a good source of technical content and are straightforward to scrape. We will use two libraries: `requests` to fetch the HTML of each page, and `beautifulsoup4` to parse that HTML.

Each blog category has paginated navigation pages with URLs of the form `https://aws.amazon.com/blogs/<category>/page/<page_number>/`. Additional path segments can be appended to the category URL to filter the content further. For example, `https://aws.amazon.com/blogs/big-data/category/industries/financial-services/page/20/` is the 20th navigation page of Big Data blog posts related to the financial services industry.
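To make the URL pattern concrete, here is a minimal sketch that only builds navigation-page URLs without fetching them. The helper name `build_nav_url` is hypothetical (it is not defined elsewhere in this notebook), and the sketch assumes that the first page is served from the base URL without a `page/1/` suffix.

```python
def build_nav_url(blog_category, page_num=1, industry=None):
    """Build the URL of a blog navigation page (hypothetical helper)."""
    url = f"https://aws.amazon.com/blogs/{blog_category}/"
    # optional industry filter appended to the category URL
    if industry:
        url += f"category/industries/{industry}/"
    # assumption: page 1 is the base URL, later pages add page/<n>/
    if page_num != 1:
        url += f"page/{page_num}/"
    return url


print(build_nav_url("machine-learning"))
print(build_nav_url("big-data", page_num=20, industry="financial-services"))
```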

This predictable URL structure makes the crawl simple. Below we define functions to:
1. Traverse the navigation pages of a given blog category
2. Extract the URLs of the blog posts from each navigation page
3. Scrape the full HTML content of each blog post (We'll later extract the main text from this HTML content)
4. Save the scraped data into a WARC file

In [7]:
import requests
from bs4 import BeautifulSoup
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders
import sagemaker
from io import BytesIO
from pathlib import Path
import json

SAGEMAKER_SESSION = sagemaker.Session()
S3_BUCKET = SAGEMAKER_SESSION.default_bucket()
S3_PREFIX = 'aws-blogs-pretrain'

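def fetch_blog_nav_page(blog_category, page_num, industry=None):
    """Return the HTML of a blog navigation page.

    Raises requests.HTTPError if the server responds with an error status.
    """
    url = f"https://aws.amazon.com/blogs/{blog_category}/"

    # optionally narrow the category to a single industry
    if industry:
        url = f"{url}category/industries/{industry}/"

    # the first page lives at the base URL; later pages add page/<n>/
    if page_num != 1:
        url = f"{url}page/{page_num}/"

    # a timeout keeps the crawl from hanging on an unresponsive server
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    return response.text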


def parse_blog_links(html_content):
    
    "Get links to blog posts from the blog navigation page"
    
    soup = BeautifulSoup(html_content, 'html.parser')
    links = []
    for h2_tag in soup.find_all('h2', class_='blog-post-title'):
        a_tag = h2_tag.find('a')
        if a_tag and 'href' in a_tag.attrs:
            links.append(a_tag['href'])
    return links


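def fetch_blog_content(url):
    """Fetch a blog post; return the response, or None if the request fails."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        # covers connection errors, timeouts, and HTTP error statuses
        print(f"Failed to fetch {url}: {e}")
        return None

    return response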


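def save_to_warc(response, writer):
    """Write a fetched HTTP response to the WARC file as a 'response' record."""
    # requests has already decoded the body (e.g. gunzipped it), so drop the
    # headers that describe the original wire encoding and length
    skip = {'content-encoding', 'transfer-encoding', 'content-length'}
    headers_list = [(k, v) for k, v in response.headers.items() if k.lower() not in skip]
    headers_list.append(('Content-Length', str(len(response.content))))
    # the protocol is passed separately, so the status line holds only code and reason
    status_line = f"{response.status_code} {response.reason}"
    http_headers = StatusAndHeaders(status_line, headers_list, protocol='HTTP/1.1')
    payload = BytesIO(response.content)
    record = writer.create_warc_record(response.url, 'response', payload=payload, http_headers=http_headers)
    writer.write_record(record)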


def crawl_aws_blogs(start_page, blog_category, industry=None, warc_file_path=None):
    
    "main function to crawl AWS blogs"
    
    if warc_file_path is None:
        warc_file_path = f"aws_{blog_category}_blogs.warc.gz"
    with open(warc_file_path, 'wb') as warc_file:
        writer = WARCWriter(warc_file, gzip=True)
        
        page_num = start_page
        
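        while True:

            print(f"Crawling page {page_num}")
            try:
                html_content = fetch_blog_nav_page(blog_category, page_num, industry)
            except requests.HTTPError as e:
                if e.response.status_code == 404:
                    # a 404 means we have run past the last navigation page
                    print(f"Page {page_num} not found. Exiting")
                    break
                # any other HTTP error is unexpected; stop instead of retrying forever
                raise

            page_num += 1
            blog_links = parse_blog_links(html_content)
            if not blog_links:
                # defensive stop: a page without post links is treated as the end
                print("No blog links found. Exiting")
                break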
            
            for blog_link in blog_links:
                print(f"Fetching blog {blog_link}")
                response = fetch_blog_content(blog_link)
                if response:
                    save_to_warc(response, writer)
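The link extraction can be checked offline before starting a full crawl. The sketch below runs the same selector logic as `parse_blog_links` on a hand-written HTML fragment. The fragment and its URLs are made up for illustration; they only mimic the `<h2 class="blog-post-title">` markup that the function looks for, and the real pages may differ. The function name `extract_post_links` is likewise hypothetical.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for a navigation page (not real AWS Blogs markup)
SAMPLE_NAV_HTML = """
<main>
  <article><h2 class="blog-post-title"><a href="https://example.com/blogs/post-a/">Post A</a></h2></article>
  <article><h2 class="blog-post-title"><a href="https://example.com/blogs/post-b/">Post B</a></h2></article>
  <article><h2 class="sidebar-title"><a href="https://example.com/about/">About</a></h2></article>
  <article><h2 class="blog-post-title">Title without a link</h2></article>
</main>
"""


def extract_post_links(html_content):
    """Same logic as parse_blog_links, repeated so this example runs on its own."""
    soup = BeautifulSoup(html_content, "html.parser")
    links = []
    for h2_tag in soup.find_all("h2", class_="blog-post-title"):
        a_tag = h2_tag.find("a")
        # skip titles that have no anchor or an anchor without an href
        if a_tag and "href" in a_tag.attrs:
            links.append(a_tag["href"])
    return links


sample_links = extract_post_links(SAMPLE_NAV_HTML)
print(sample_links)
```

Only the two linked headings with the `blog-post-title` class are returned; the heading with a different class and the heading without a link are skipped.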

Let's scrape the Machine Learning and Big Data blog posts from AWS Blogs. To deliberately introduce some duplicate data, we'll also scrape the Machine Learning posts filtered to the financial services industry, which overlap with the full Machine Learning crawl.

In [None]:
scraped_data_path = Path("scraped_data")

# create the output directory if it does not already exist
scraped_data_path.mkdir(exist_ok=True)

# crawl ML Blogs
crawl_aws_blogs(1, "machine-learning", warc_file_path=f"{scraped_data_path}/ml-blogs.warc.gz")

# crawl Big Data Blogs
crawl_aws_blogs(1, "big-data", warc_file_path=f"{scraped_data_path}/big-data-blogs.warc.gz")

# crawl ML Financial Services Blogs
crawl_aws_blogs(1, "machine-learning", industry="financial-services", warc_file_path=f"{scraped_data_path}/ml-fsi-blogs.warc.gz")

With the data scraped, we can upload it to S3 and proceed to the next notebook where we will pre-process the data for LLM pre-training.

In [None]:
s3_path = SAGEMAKER_SESSION.upload_data(path=str(scraped_data_path), bucket=S3_BUCKET, key_prefix=S3_PREFIX)

# save the S3 path to a file for use in the next notebook
Path("s3_path.json").write_text(json.dumps({"s3_path": s3_path}))
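
A later notebook can pick the location up again by reading the same file. The sketch below shows the round trip. The bucket name is a placeholder rather than a real upload result, and the file is written to a temporary directory so that the example does not overwrite the real `s3_path.json`.

```python
import json
import tempfile
from pathlib import Path

# placeholder value standing in for the S3 URI returned by upload_data
placeholder_s3_path = "s3://example-bucket/aws-blogs-pretrain"

# write the handoff file in the same JSON shape used above
handoff_file = Path(tempfile.mkdtemp()) / "s3_path.json"
handoff_file.write_text(json.dumps({"s3_path": placeholder_s3_path}))

# in the next notebook: load the file and recover the S3 location
loaded_s3_path = json.loads(handoff_file.read_text())["s3_path"]
print(loaded_s3_path)
```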