---
title: RAG on Microsoft Learning Paths - Part 1. Web Scraping and Supabase   
author: "Francisco Mussari"  
date: 2023-11-15  
image: "ERD.svg"  
categories: [RAG, LLM, Embeddings, Power BI, SQLAlchemy, PostgreSQL, Supabase, BeautifulSoup]  
format:
  html:
    toc: true
    toc-depth: 3
    
---

## Overview

**Retrieval Augmented Generation** (RAG) gives **Large Language Models** (LLMs) access to context retrieved from a knowledge base, which leads to more precise answers and fewer hallucinations.  
  
In this blog series we'll go on a journey with RAG and Microsoft Learning Paths (Power BI) to explore what can be achieved.

First up, let's tackle the data preparation phase.  

In **Part 1** we'll skip the usual shortcut of starting from a set of pre-processed documents and build our own data pipeline from scratch: creating a database model, scraping Microsoft Learn modules, storing them in Supabase (PostgreSQL), and managing the database with the SQLAlchemy Object Relational Mapper (ORM).  
  
Don't worry if some of these terms are new; you're not alone!

## References

- [SQLAlchemy Unified Tutorial](https://docs.sqlalchemy.org/en/20/tutorial/index.html#sqlalchemy-unified-tutorial)
- [ORM Quick Start](https://docs.sqlalchemy.org/en/20/orm/quickstart.html#orm-quick-start)

## Import libraries

In [None]:
import os
import time
from dotenv import load_dotenv

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

In [None]:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from typing import List, Tuple, Dict
from typing import Optional
from sqlalchemy import ForeignKey
from sqlalchemy import String
from sqlalchemy.orm import DeclarativeBase
from sqlalchemy.orm import Mapped
from sqlalchemy.orm import mapped_column
from sqlalchemy.orm import relationship

In [None]:
from sqlalchemy import Text, Boolean

## Credentials

In [None]:
load_dotenv()

supabase_host = os.environ.get('SUPABASE_HOST')
supabase_pass = os.environ.get('SUPABASE_PASS')

## Data Model's Entity Relationship Diagram (ERD)

Imagine we're building a roadmap to guide someone's journey towards becoming a **Microsoft Certified: Power BI Data Analyst Associate**.  
  
The recommended path involves completing a series of **Learning Paths**, each filled with **Modules** containing **Chapters**. Some Chapters might have **Knowledge Checks** to assess your understanding, consisting of single-selection **Questions** with multiple **Answers**, only one of which is correct.  
  
To represent this structure, let's create a model that looks like this:

<img src="ERD.svg" align="center"/>  

Fig 1. Model's ERD. (Created with https://dbdiagram.io/)

## Creating the Object Relational Mapping

Let's now move from the diagram to the database itself using SQLAlchemy. Supabase's free tier is enough here.

In [None]:
DB_CONNECTION = f'postgresql://postgres:{supabase_pass}@{supabase_host}:5432/postgres'
engine = create_engine(DB_CONNECTION, echo=False)
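If the database password contains characters such as `@`, `/` or `:`, interpolating it directly into the URL can yield a connection string that doesn't parse as intended. Here is a minimal sketch, assuming a hypothetical helper `build_db_url` (not part of the original notebook) and made-up example values, that percent-encodes the password with the standard library before building the URL:

```python
from urllib.parse import quote_plus


def build_db_url(host: str, password: str, user: str = "postgres",
                 port: int = 5432, database: str = "postgres") -> str:
    """Build a PostgreSQL URL, percent-encoding the password (hypothetical helper)."""
    if not host or not password:
        # os.environ.get returns None when a variable is missing from .env
        raise ValueError("SUPABASE_HOST and SUPABASE_PASS must both be set")
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{database}"


# Host and password below are invented for illustration only.
example_url = build_db_url("db.example.supabase.co", "p@ss/word")
print(example_url)
```

The result can be passed to `create_engine` in place of the f-string above; failing early on a missing variable also avoids a confusing connection error later.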

In [None]:
class Base(DeclarativeBase):
    pass

class Exam(Base):
    __tablename__ = "exam"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    code: Mapped[str] = mapped_column(Text())
    description: Mapped[str] = mapped_column(Text())
    url: Mapped[str] = mapped_column(Text())
        
    learning_paths: Mapped[List["LearningPath"]] = relationship(
        back_populates="exam", cascade="all, delete-orphan"
    )
        
    def __repr__(self) -> str:
        return f"Exam(code={self.code!r}, description={self.description!r}, url={self.url!r})"

class LearningPath(Base):
    __tablename__ = "learning_path"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(Text())
    summary: Mapped[str] = mapped_column(Text())
    url: Mapped[str] = mapped_column(Text())
    
    exam_id: Mapped[int] = mapped_column(ForeignKey("exam.id"))
    
    exam: Mapped["Exam"] = relationship(back_populates="learning_paths")
    modules: Mapped[List["Module"]] = relationship(
        back_populates="learning_path", cascade="all, delete-orphan"
    )
    
    def __repr__(self) -> str:
        return f"LearningPath(title={self.title!r}, url={self.url!r})"

class Module(Base):
    __tablename__ = "module"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(Text())
    summary: Mapped[str] = mapped_column(Text())
    url: Mapped[str] = mapped_column(Text())
    
    path_id: Mapped[int] = mapped_column(ForeignKey("learning_path.id"))
        
    learning_path: Mapped["LearningPath"] = relationship(back_populates="modules")
    chapters: Mapped[List["Chapter"]] = relationship(
        back_populates="module", cascade="all, delete-orphan"
    )
    
    def __repr__(self) -> str:
        return f"Module(title={self.title!r}, url={self.url!r})"

class Chapter(Base):
    __tablename__ = "chapter"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(Text())
    content: Mapped[str] = mapped_column(Text())
    url: Mapped[str] = mapped_column(Text())
    is_check: Mapped[bool] = mapped_column(Boolean())
    
    module_id: Mapped[int] = mapped_column(ForeignKey("module.id"))
        
    module: Mapped["Module"] = relationship(back_populates="chapters")
    questions: Mapped[List["Question"]] = relationship(
        back_populates="chapter", cascade="all, delete-orphan"
    )
    
    def __repr__(self) -> str:
        return f"Chapter(title={self.title!r}, url={self.url!r})"

class Question(Base):
    __tablename__ = "question"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    question: Mapped[str] = mapped_column(Text())
    
    chapter_id: Mapped[int] = mapped_column(ForeignKey("chapter.id"))
    
    chapter: Mapped["Chapter"] = relationship(back_populates="questions")
    answers: Mapped[List["Answer"]] = relationship(
        back_populates="question", cascade="all, delete-orphan"
    )
    
    def __repr__(self) -> str:
        return f"Question(question={self.question!r})"

class Answer(Base):
    __tablename__ = "answer"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    answer: Mapped[str] = mapped_column(Text())
    is_correct: Mapped[bool] = mapped_column(Boolean())
    
    question_id: Mapped[int] = mapped_column(ForeignKey("question.id"))
    
    question: Mapped["Question"] = relationship(back_populates="answers")
    
    def __repr__(self) -> str:
        return f"Answer(answer={self.answer!r}, is_correct={self.is_correct!r})"

In [None]:
# Drop all tables if they exist
Base.metadata.drop_all(engine)

In [None]:
# Create all tables
Base.metadata.create_all(engine)

After executing the previous command, the tables are created in the database:                           

<img src="ERD-pgAdmin.PNG" align="center"/>   

Fig 2. Model's ERD. (Created with pgAdmin)

## Start Populating the Database

We'll use Python to extract the contents of each web page, but we'll manually grab the **exam** and **Learning Paths** links. Our case is **Exam PL-300: Microsoft Power BI Data Analyst**.

In [None]:
power_bi_exam = Exam(
    code='PL-300', description='Microsoft Power BI Data Analyst', 
    url='https://learn.microsoft.com/en-us/credentials/certifications/power-bi-data-analyst-associate/')

In [None]:
power_bi_exam

Exam(code='PL-300', description='Microsoft Power BI Data Analyst', url='https://learn.microsoft.com/en-us/credentials/certifications/power-bi-data-analyst-associate/')

In [None]:
with Session(engine) as session:
    session.add_all([power_bi_exam])
    session.commit()

## Web Scraping

Since I couldn't find a direct way to scrape the main page listing all Learning Paths, let's start with a list of Learning Paths and work our way from there.

In [None]:
from llama_index import SummaryIndex, SimpleWebPageReader
import re

In [None]:
learning_path_path = [
    "data-analytics-microsoft/", "prepare-data-power-bi/", 
    "model-data-power-bi/", "build-power-bi-visuals-reports/", 
    "manage-workspaces-datasets-power-bi/"
]

learning_path_base_url = "https://learn.microsoft.com/en-us/training/paths/"
module_base_url = "https://learn.microsoft.com/en-us/training/modules/"

learning_path_url = [learning_path_base_url + path for path in learning_path_path]

In [None]:
learning_path_url

['https://learn.microsoft.com/en-us/training/paths/data-analytics-microsoft/',
 'https://learn.microsoft.com/en-us/training/paths/prepare-data-power-bi/',
 'https://learn.microsoft.com/en-us/training/paths/model-data-power-bi/',
 'https://learn.microsoft.com/en-us/training/paths/build-power-bi-visuals-reports/',
 'https://learn.microsoft.com/en-us/training/paths/manage-workspaces-datasets-power-bi/']

### Functions

In [None]:
def get_text_between_regex(text:str, regex:str=r"\n\n\[ (.*?)\/\)\n\n") -> Optional[List[str]]:
    """Matches and extracts text using regular expressions; returns None if nothing matches."""
    matches = re.findall(regex, text)
    return matches if matches else None
    
def get_text_from_urls(urls:List[str], delay:int=5) -> List[str]:
    """Scrapes a list of URLs, introducing pauses, using `SimpleWebPageReader`."""
    output = []
    
    for url in urls:
        print(f'Scraping {url}')
        txt = SimpleWebPageReader(html_to_text=True).load_data([url])[0].text
        output.append(txt)
        time.sleep(delay) 
        
    return output

def extract_modules_from_lp(lp_raw_content:List[str]) -> Dict:
    """Extracts modules' names and paths from Learning Path's scraped content."""
    modules = {'title': [], 'path': []}
    
    for raw_text in lp_raw_content:
        text = raw_text.replace("\n", "").replace("[ !", "[!")
        
        paths = get_text_between_regex(text, MOD_URL_REG)
        names = get_text_between_regex(text, MOD_NAME_REG)
        
        names = [
            n.strip() for n in names if 'Download' not in n and 
            'More info' not in n and n != '']
        paths = list(dict.fromkeys(paths))
        
        modules['title'].append(names)
        modules['path'].append(paths)
        
    return modules

def extract_chapters_from_mods(mod_content:List[List[str]]) -> Dict:
    """Extracts chapters' names and paths from Modules' scraped content."""
    chapters = {'title': [], 'path': []}

    for lp_mods in mod_content:

        for content in lp_mods:

            chapters_txt = get_text_between_regex(content.replace("\n", ""), CHAPTERS_REG)[0]

            titles = get_text_between_regex(chapters_txt, CHAPTER_TITLE_REG)
            titles = [t.strip() for t in titles]
            
            paths = get_text_between_regex(chapters_txt, CHAPTER_PATH_REG)
            paths = [p.strip() for p in paths]
            
            chapters['title'].append(titles)
            chapters['path'].append(paths)

    return chapters

def get_soup(url:str) -> 'bs4.BeautifulSoup':
    """Fetches the HTML content from a URL and parses it using BeautifulSoup."""
    req = Request(url)
    response = urlopen(req, timeout=10)
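    # Name the parser explicitly so the result doesn't depend on which parsers are installed
    soup = bs(response.read(), "html.parser")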
    return soup

def get_questions(soup:'bs4.BeautifulSoup') -> List[str]:
    """Extracts questions from a BeautifulSoup object representing the HTML content."""
    questions = []
    for q in (
        soup
        .find_all(class_="field")[0]
        .find_all(class_="margin-top-sm margin-bottom-xs field-label")
    ):
        questions += q.select('p')[0].contents
    return questions

def get_answers(soup:'bs4.BeautifulSoup') -> List[List]:
    """Extracts answers from a BeautifulSoup object representing the HTML content."""
    all_answers = []
    for answrs in (
        soup
        .find_all(class_="field")[0]
        .find_all(class_="field-body")
    ):
        answers = []
        for answr in (
            answrs.find_all(class_="margin-inline-sm radio-label-text")
        ):
            answers += answr.select('p')[0].contents
        
        all_answers.append(answers)
        
    return all_answers
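`get_text_from_urls` pauses between requests but stops at the first network error. Below is a hedged sketch of the same loop with a simple retry. The function name `fetch_all`, the `retries` parameter, and the injectable `fetch` callable are all assumptions introduced for illustration, not part of the notebook:

```python
import time
from typing import Callable, List


def fetch_all(urls: List[str], fetch: Callable[[str], str],
              delay: float = 5.0, retries: int = 2) -> List[str]:
    """Fetch each URL in order, pausing between requests and retrying failures."""
    output = []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                output.append(fetch(url))
                break
            except OSError:  # urllib's URLError is a subclass of OSError
                if attempt == retries:
                    raise
                time.sleep(delay * (attempt + 1))  # wait longer after each failure
        time.sleep(delay)  # polite pause before the next URL
    return output
```

In the notebook's setting, `fetch` could wrap the existing call, for example `lambda url: SimpleWebPageReader(html_to_text=True).load_data([url])[0].text`. Passing the fetcher in as an argument also makes the loop easy to try out with a fake function and `delay=0`.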

### Getting Data from each Learning Path

In [None]:
lp_raw_content = get_text_from_urls(learning_path_url, 5)

Scraping https://learn.microsoft.com/en-us/training/paths/data-analytics-microsoft/
Scraping https://learn.microsoft.com/en-us/training/paths/prepare-data-power-bi/
Scraping https://learn.microsoft.com/en-us/training/paths/model-data-power-bi/
Scraping https://learn.microsoft.com/en-us/training/paths/build-power-bi-visuals-reports/
Scraping https://learn.microsoft.com/en-us/training/paths/manage-workspaces-datasets-power-bi/


#### Title of each Learning Path

In [None]:
LP_NAME_REG = r"\n# (.*?)\n\n"

lp_title = [get_text_between_regex(txt, LP_NAME_REG)[0] for txt in lp_raw_content]
lp_title

['Get started with Microsoft data analytics',
 'Prepare data for analysis with Power BI',
 'Model data with Power BI',
 'Build Power BI visuals and reports',
 'Manage workspaces and datasets in Power BI']
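Because `get_text_between_regex` returns `None` when nothing matches, indexing the result with `[0]` raises a `TypeError` that says little about which page or pattern failed. A small sketch of a stricter variant follows; `first_match` is a hypothetical helper and the sample text is made up:

```python
import re

LP_NAME_REG = r"\n# (.*?)\n\n"  # pattern copied from the notebook


def first_match(text: str, regex: str) -> str:
    """Return the first capture of `regex` in `text`, or raise a descriptive error."""
    match = re.search(regex, text)
    if match is None:
        raise ValueError(f"No match for pattern {regex!r}")
    return match.group(1)


# Made-up stand-in for the text of a scraped Learning Path page.
sample = "Skip to main content\n\n# Model data with Power BI\n\n  * 5 hr\n"
print(first_match(sample, LP_NAME_REG))
```

If Microsoft changes the page layout, the error then names the pattern that stopped matching.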

#### Description of each Learning Path

In [None]:
LP_DESC_REG = r"\n\nPower BI\n\n([\s\S]*?)\n\n"

lp_desc = [get_text_between_regex(txt, LP_DESC_REG)[0] for txt in lp_raw_content]

In [None]:
print(f"- {lp_desc[0][:50]}...")
print(f"- {lp_desc[1][:50]}...")
print(f"- {lp_desc[2][:50]}...")
print(f"- {lp_desc[3][:50]}...")
print(f"- {lp_desc[4][:50]}...")

- Businesses need data analysis more than ever. In t...
- You'll learn how to use Power Query to extract dat...
- Learn what a Power BI semantic model is, which dat...
- Turn data into interactive, actionable insights wi...
- In this Learning Path, you'll learn how to publish...


In [None]:
[len(des) for des in lp_desc]

[446, 219, 147, 90, 325]

### Insert Learning Path Data into Supabase

In [None]:
lp_to_postgres = [
    LearningPath(
        exam_id=1, title=title,
        summary=summary, url=url
    ) for title, summary, url in zip(lp_title, lp_desc, learning_path_url)
]
lp_to_postgres[0]

LearningPath(title='Get started with Microsoft data analytics', url='https://learn.microsoft.com/en-us/training/paths/data-analytics-microsoft/')

In [None]:
print(f"id: {lp_to_postgres[0].id}")
print(f"exam_id: {lp_to_postgres[0].exam_id}")
print(f"title: {lp_to_postgres[0].title}")
print(f"summary: {lp_to_postgres[0].summary[:50]}...")

id: None
exam_id: 1
title: Get started with Microsoft data analytics
summary: Businesses need data analysis more than ever. In t...


In [None]:
with Session(engine) as session:
    session.add_all(lp_to_postgres)
    session.commit()
    lp_desc_id = [lp.id for lp in lp_to_postgres]

In [None]:
lp_desc_id

[1, 2, 3, 4, 5]
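The `id` of each object is `None` until the database assigns a primary key on insert, and the `exam_id=1` used above assumes the exam row received id 1. The sketch below illustrates the same idea with the standard library's `sqlite3` in memory, rather than SQLAlchemy and Supabase, so it is an analogy and not the notebook's method. The trimmed-down table definitions are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE exam (id INTEGER PRIMARY KEY, code TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE learning_path ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
    "exam_id INTEGER NOT NULL REFERENCES exam(id))"
)

# Read the generated key back instead of hard-coding it.
exam_id = conn.execute("INSERT INTO exam (code) VALUES (?)", ("PL-300",)).lastrowid

path_ids = []
for title in ["Get started with Microsoft data analytics", "Model data with Power BI"]:
    cur = conn.execute(
        "INSERT INTO learning_path (title, exam_id) VALUES (?, ?)", (title, exam_id)
    )
    path_ids.append(cur.lastrowid)
conn.commit()

# A row pointing at a non-existent exam is rejected by the foreign key.
try:
    conn.execute("INSERT INTO learning_path (title, exam_id) VALUES (?, ?)", ("Orphan", 99))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

With the ORM, the equivalent of reading the key back would be to use `power_bi_exam.id` after the commit, or to append the `LearningPath` objects to `power_bi_exam.learning_paths` and let the relationship fill in `exam_id`.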

#### Getting Modules' Paths and Titles

In [None]:
MOD_URL_REG = r"\(.*?/modules/(.*?)\)"
MOD_NAME_REG = r"\[ (.*?)\]"

modules = extract_modules_from_lp(lp_raw_content)

In [None]:
modules['title'][0]

['Discover data analysis', 'Get started building with Power BI']

In [None]:
modules['path'][0]

['data-analytics-microsoft/', 'get-started-with-power-bi/']
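To make the two patterns easier to follow, here is a small sketch that applies them to a made-up string shaped like the Markdown links `SimpleWebPageReader` returns. The sample text is an assumption; the patterns and the filtering mirror `extract_modules_from_lp`:

```python
import re

# Patterns copied from the notebook.
MOD_URL_REG = r"\(.*?/modules/(.*?)\)"
MOD_NAME_REG = r"\[ (.*?)\]"

BASE = "https://learn.microsoft.com/en-us/training/modules/"

# Made-up stand-in for one scraped Learning Path page (newlines already removed).
sample = (
    f"[ Discover data analysis]({BASE}data-analytics-microsoft/)"
    f"[ More info]({BASE}data-analytics-microsoft/)"
    f"[ Get started building with Power BI]({BASE}get-started-with-power-bi/)"
)

raw_names = re.findall(MOD_NAME_REG, sample)
raw_paths = re.findall(MOD_URL_REG, sample)

# Drop helper links, as extract_modules_from_lp does.
names = [n.strip() for n in raw_names
         if "Download" not in n and "More info" not in n and n != ""]
# dict.fromkeys removes duplicate paths while keeping first-seen order.
paths = list(dict.fromkeys(raw_paths))

print(names)
print(paths)
```

Titles and paths are later paired by position, so both lists need to end up the same length; the name filter and the path de-duplication are what keep them aligned.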

In [None]:
module_base_url
modules['url'] = [[module_base_url + p for p in path] for path in modules['path']]
modules['url'][0]

['https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/',
 'https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/']

### Scraping Data from each Module

In [None]:
modules['content'] = []
for modules_url in modules['url']:
    modules['content'].append(get_text_from_urls(modules_url, 5))

Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/
Scraping https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/
Scraping https://learn.microsoft.com/en-us/training/modules/get-data/
Scraping https://learn.microsoft.com/en-us/training/modules/clean-data-power-bi/
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-models/
Scraping https://learn.microsoft.com/en-us/training/modules/choose-power-bi-model-framework/
Scraping https://learn.microsoft.com/en-us/training/modules/design-model-power-bi/
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-write-formulas/
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-measures/
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-calculated-tables/
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-time-intelligence/
Scraping https://learn.microsoft.com/en-us/training/mod

In [None]:
modules['content'][3][0][600:1000]

'sign requirements\n\n  *   * Module \n  * 8 Units \n\nFeedback\n\nIntermediate\n\nBusiness User\n\nBusiness Analyst\n\nMicrosoft Power Platform\n\nPower BI\n\nGathering appropriate inputs to scope your report design requirements involves\nidentifying your audience, determining the suitable report types, and defining\ntheir interface and experience requirements. This module provides you with a\nstrong foundation on wh'

In [None]:
m_d_r_1 = r"\n\nPower BI\n\nMicrosoft Power Platform\n\n"
m_d_r_2 = r"\n\nMicrosoft Power Platform\n\nPower BI\n\n"
m_d_r_3 = r"\n\nPower BI\n\n"
m_d_r_4 = r"\n\nMicrosoft Power Platform\n\n"

MOD_DESC_REG = fr"({m_d_r_1}|{m_d_r_2}|{m_d_r_3}|{m_d_r_4})([\s\S]*?)\n\nSave"

In [None]:
modules['summary'] = []
for mod_content in modules['content']:
    summary = []
    for txt in mod_content:
        summary.append(get_text_between_regex(txt, MOD_DESC_REG)[0][1])
    modules['summary'].append(summary)

In [None]:
print(modules['summary'][3][0])

Gathering appropriate inputs to scope your report design requirements involves
identifying your audience, determining the suitable report types, and defining
their interface and experience requirements. This module provides you with a
strong foundation on which to learn how to plan your report design
requirements.

##  Learning objectives

In this module, you will:

  * Determine business goals.
  * Identify your audience.
  * Determine report types.
  * Define user interface requirements.
  * Define user experience requirements.
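The `[0][1]` index in the loop above is needed because `MOD_DESC_REG` has two capture groups, so `re.findall` returns tuples of `(prefix, summary)`. The sketch below shows this on a made-up snippet, next to a variant with a non-capturing prefix group. `MOD_DESC_REG_NC` is an alternative introduced here, not the notebook's pattern:

```python
import re

# Prefix alternatives copied from the notebook.
m_d_r_1 = r"\n\nPower BI\n\nMicrosoft Power Platform\n\n"
m_d_r_2 = r"\n\nMicrosoft Power Platform\n\nPower BI\n\n"
m_d_r_3 = r"\n\nPower BI\n\n"
m_d_r_4 = r"\n\nMicrosoft Power Platform\n\n"

MOD_DESC_REG = fr"({m_d_r_1}|{m_d_r_2}|{m_d_r_3}|{m_d_r_4})([\s\S]*?)\n\nSave"
# Same pattern, but the prefix group is non-capturing (assumed alternative).
MOD_DESC_REG_NC = fr"(?:{m_d_r_1}|{m_d_r_2}|{m_d_r_3}|{m_d_r_4})([\s\S]*?)\n\nSave"

# Made-up stand-in for a scraped module page.
sample = (
    "Intermediate\n\nMicrosoft Power Platform\n\nPower BI\n\n"
    "A short summary.\n\nSave\n\nPrerequisites"
)

with_groups = re.findall(MOD_DESC_REG, sample)  # list of (prefix, summary) tuples
summary_only = re.findall(MOD_DESC_REG_NC, sample)  # list of summary strings

print(with_groups[0][1])
print(summary_only[0])
```

The longer alternatives are listed first so that a page tagged with both products is matched as one prefix rather than leaving the second tag inside the summary.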


### Insert Module Data into Supabase

In [None]:
mod_to_postgres = []

for idx, path_id in enumerate(lp_desc_id):
    for title, summary, url in zip(
        modules['title'][idx], modules['summary'][idx], modules['url'][idx]
    ):
        mod_to_postgres.append(
            Module(
                path_id=path_id, title=title,
                summary=summary, url=url
            )
        )

In [None]:
with Session(engine) as session:
    session.add_all(mod_to_postgres)
    session.commit()
    mod_id = [mod.id for mod in mod_to_postgres]

In [None]:
print(mod_id)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]


### Getting Chapters' Data from each Module

In [None]:
CHAPTERS_REG = r"\* (\[.*) minTheme"
CHAPTER_TITLE_REG = r"\[(.*?)\]"
CHAPTER_PATH_REG = r"]\((.*?)\)"

chapters = extract_chapters_from_mods(modules['content'])

In [None]:
len(chapters['title'])

23

In [None]:
base_url = [base for lp_url in modules['url'] for base in lp_url]

chapters['url'] = []
for base, paths in zip(base_url, chapters['path']):
    urls = [base + p for p in paths]
    chapters['url'].append(urls)

In [None]:
chapters['url'][20]

['https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/1-introduction',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/4-power-bi-gateway',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/5-dataset-refresh',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/6-incremental-refresh',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/7-manage-datasets',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/8-troubleshoot-connectivity',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/9-query-caching',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/10-check',
 'https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/11-summary']
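The URL construction above relies on position: `modules['url']` is nested per Learning Path, while `chapters['path']` has one entry per module, so the module URLs are flattened before being zipped with the chapter paths. Here is a sketch of that step with an explicit length check; `build_chapter_urls` is a hypothetical helper and the example URLs are made up:

```python
from typing import List


def build_chapter_urls(module_urls: List[List[str]],
                       chapter_paths: List[List[str]]) -> List[List[str]]:
    """Join each module's base URL with its chapter paths, one list per module."""
    flat_bases = [base for lp_urls in module_urls for base in lp_urls]
    if len(flat_bases) != len(chapter_paths):
        # zip would silently drop the extras, so fail loudly instead
        raise ValueError(
            f"{len(flat_bases)} modules but {len(chapter_paths)} chapter lists"
        )
    return [[base + p for p in paths]
            for base, paths in zip(flat_bases, chapter_paths)]


# Made-up data: two Learning Paths with two modules and one module respectively.
demo_modules = [["https://example.com/modules/a/", "https://example.com/modules/b/"],
                ["https://example.com/modules/c/"]]
demo_paths = [["1-introduction", "2-check"], ["1-introduction"], ["1-intro", "2-summary"]]
demo_urls = build_chapter_urls(demo_modules, demo_paths)
print(demo_urls[0])
```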

In [None]:
chapters['content'] = []
CHAPTERS_CONTENT_REG = r"minute[s]?\n\n([\s\S]*)\n\n\[ Continue"

for chapters_url in chapters['url']:
    raw_content = get_text_from_urls(chapters_url, 8)
    chapter_content = []
    
    for txt in raw_content:
        chapter_content.append(
            get_text_between_regex(txt, CHAPTERS_CONTENT_REG)[0])
        
    chapters['content'].append(chapter_content)

Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/1-introduction
Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/2-data-analysis
Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/3-roles
Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/4-tasks
Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/5-check
Scraping https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/6-summary
Scraping https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/1-introduction
Scraping https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/2-using-power-bi
Scraping https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/3-building-blocks-of-power-bi
Scraping https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/4-exercise-touring-

Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-time-intelligence/3-calculations
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-time-intelligence/3b-lab
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-time-intelligence/4-check
Scraping https://learn.microsoft.com/en-us/training/modules/dax-power-bi-time-intelligence/5-summary
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/1-introduction
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/2-performance
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/3-variables
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/4-reduce-cardinality
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/5-directquery-models
Scraping https://learn.microsoft.com/en-us/training/modules/optimize-model-power-bi/6-aggrega

Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/1-introduction
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/2-distribute-report-dashboard
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/3-monitor-usage-performance
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/4-development-lifecycle-strategy
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/5-troubleshoot-data
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/6-data-protection
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/8-check
Scraping https://learn.microsoft.com/en-us/training/modules/create-manage-workspaces-power-bi/9-summary
Scraping https://learn.microsoft.com/en-us/training/modules/manage-datasets-power-bi/1

### Insert Chapter Data into Supabase

In [None]:
ch_to_postgres = []

for idx, m_id in enumerate(mod_id):
    for title, content, url in zip(
        chapters['title'][idx], chapters['content'][idx], chapters['url'][idx]
    ):
        is_check = False
        if 'check' in url or 'quiz' in url:
            is_check = True
            
        ch_to_postgres.append(
            Chapter(
                module_id=m_id, title=title,
                content=content, url=url, is_check=is_check
            )
        )

In [None]:
print(len(ch_to_postgres))
ch_to_postgres[0]

196


Chapter(title='Introduction', url='https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/1-introduction')
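The `is_check` flag comes from a substring test on the whole URL, which would also fire if a module slug happened to contain "check" or "quiz". A hedged sketch that looks only at the last path segment follows; `is_knowledge_check` is a hypothetical helper, and the last URL in the usage note is invented to show the difference:

```python
from urllib.parse import urlparse


def is_knowledge_check(url: str) -> bool:
    """Flag a chapter as a knowledge check based on the last URL path segment."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return "check" in last_segment or "quiz" in last_segment


BASE = "https://learn.microsoft.com/en-us/training/modules/"
print(is_knowledge_check(BASE + "data-analytics-microsoft/5-check"))
print(is_knowledge_check(BASE + "data-analytics-microsoft/1-introduction"))
```

For the URLs scraped in this post both approaches should agree; the difference would only show for a hypothetical slug such as `check-data-quality/1-introduction`, where the whole-URL test returns `True` and this one returns `False`.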

In [None]:
with Session(engine) as session:
    session.add_all(ch_to_postgres)
    session.commit()
    chapter_id = [ch.id for ch in ch_to_postgres]

In [None]:
len(chapter_id)

196

### Getting Questions from 'Knowledge Check' Chapters

In [None]:
check_chapter = []
for chapter in ch_to_postgres:
    if chapter.is_check:
        check_chapter.append(chapter)

In [None]:
check_chapter[0].url

'https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/5-check'

In [None]:
questions = []
for ch in check_chapter:
    print(f"Questions from {ch.url}")
    soup = get_soup(ch.url)
    questions.append(get_questions(soup))
    time.sleep(5)

Questions from https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/5-check
Questions from https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/6-get-started-with-power-bi-quiz
Questions from https://learn.microsoft.com/en-us/training/modules/get-data/10-check
Questions from https://learn.microsoft.com/en-us/training/modules/clean-data-power-bi/9-check
Questions from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-models/5-check
Questions from https://learn.microsoft.com/en-us/training/modules/choose-power-bi-model-framework/7-knowledge-check
Questions from https://learn.microsoft.com/en-us/training/modules/design-model-power-bi/10-check
Questions from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-write-formulas/7-check
Questions from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-measures/6-check
Questions from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-calcul

In [None]:
questions[0]

['Which data role enables advanced analytics capabilities specifically through reports and visualizations?',
 'Which data analyst task has a critical performance impact on reporting and data analysis?',
 'Which one of the following options is the most important key benefit of data analysis?']

### Insert Questions into Supabase

In [None]:
q_to_postgres = []

for qs, chapter in zip(questions, check_chapter):
    for q in qs:
        q_to_postgres.append(
            Question(chapter_id=chapter.id, question=q)
        )

In [None]:
q_to_postgres[0]

Question(question='Which data role enables advanced analytics capabilities specifically through reports and visualizations?')

In [None]:
with Session(engine) as session:
    session.add_all(q_to_postgres)
    session.commit()
    question_id = [q.id for q in q_to_postgres]

In [None]:
len(question_id)

69

### Getting Answers from 'Knowledge Check' Chapters

In [None]:
answers = []
for ch in check_chapter:
    print(f"Answers from {ch.url}")
    soup = get_soup(ch.url)
    answers += get_answers(soup)
    time.sleep(5)

Answers from https://learn.microsoft.com/en-us/training/modules/data-analytics-microsoft/5-check
Answers from https://learn.microsoft.com/en-us/training/modules/get-started-with-power-bi/6-get-started-with-power-bi-quiz
Answers from https://learn.microsoft.com/en-us/training/modules/get-data/10-check
Answers from https://learn.microsoft.com/en-us/training/modules/clean-data-power-bi/9-check
Answers from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-models/5-check
Answers from https://learn.microsoft.com/en-us/training/modules/choose-power-bi-model-framework/7-knowledge-check
Answers from https://learn.microsoft.com/en-us/training/modules/design-model-power-bi/10-check
Answers from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-write-formulas/7-check
Answers from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-measures/6-check
Answers from https://learn.microsoft.com/en-us/training/modules/dax-power-bi-add-calculated-tables/5-check


In [None]:
answers[0], len(answers)

(['Data scientist', 'Data engineer', 'Data analyst'], 69)
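Note that `questions` is nested per chapter while `answers` is a flat list with one entry per question, so answers are attached to questions purely by position. A minimal sketch of a sanity check before inserting follows; `pair_questions_with_answers` is a hypothetical helper and the sample data is made up:

```python
from typing import List, Tuple


def pair_questions_with_answers(
    questions_by_chapter: List[List[str]], answers: List[List[str]]
) -> List[Tuple[str, List[str]]]:
    """Flatten the per-chapter questions and pair each with its answer options."""
    flat_questions = [q for chapter_qs in questions_by_chapter for q in chapter_qs]
    if len(flat_questions) != len(answers):
        raise ValueError(
            f"{len(flat_questions)} questions but {len(answers)} answer lists"
        )
    return list(zip(flat_questions, answers))


# Made-up data: two chapters with two questions and one question respectively.
demo_questions = [["Question 1?", "Question 2?"], ["Question 3?"]]
demo_answers = [["A", "B", "C"], ["A", "B"], ["A", "B", "C"]]
pairs = pair_questions_with_answers(demo_questions, demo_answers)
print(pairs[0])
```

In this run both lengths are 69, so the check passes; it mainly guards against a page whose questions and answers are scraped in differing numbers.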

### Insert Answers into Supabase

In [None]:
a_to_postgres = []

for q_id, aswrs in zip(question_id, answers):
    for a in aswrs:
        a_to_postgres.append(
            Answer(question_id=q_id, answer=a, is_correct=False)
        )

In [None]:
a_to_postgres[0].question_id, a_to_postgres[0]

(1, Answer(answer='Data scientist', is_correct=False))

In [None]:
with Session(engine) as session:
    session.add_all(a_to_postgres)
    session.commit()
    answer_id = [a.id for a in a_to_postgres]

Since the correct answers aren't readily available, every answer is stored with `is_correct=False` for now. Later we'll see whether a Large Language Model can help identify them, and then verify its picks manually against the web.
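Whatever method ends up selecting the correct answers, the model described earlier expects exactly one correct answer per question. Below is a small sketch of how a chosen option could be recorded and checked; `mark_correct` and `has_single_correct` are hypothetical helpers and the options are placeholders:

```python
from typing import List, Tuple


def mark_correct(options: List[str], correct_index: int) -> List[Tuple[str, bool]]:
    """Return (answer_text, is_correct) pairs with exactly one option flagged."""
    if not 0 <= correct_index < len(options):
        raise IndexError(f"correct_index {correct_index} is out of range")
    return [(text, i == correct_index) for i, text in enumerate(options)]


def has_single_correct(flags: List[bool]) -> bool:
    """Check the single-selection rule: exactly one answer is marked correct."""
    return sum(flags) == 1


# Placeholder options; the index would come from an LLM or a manual review.
marked = mark_correct(["Option A", "Option B", "Option C"], correct_index=2)
print(marked)
print(has_single_correct([flag for _, flag in marked]))
```

Running `has_single_correct` over each question's `is_correct` values after an update is a cheap way to catch questions left with zero or several flagged answers.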

## Conclusion

- We now have a database model and a free Supabase instance populated with the five PL-300 Learning Paths and their modules, chapters, questions, and answers.
- We started using Beautiful Soup at the end of the scraping process, and it might be worth using it throughout to make the process more generalizable to other Learning Paths.
- With our data in order, we're ready to dive into **Part 2** and start exploring RAG with Large Language Models.