# Scraping DPI-Related Resources and Creating a Dataset

In this notebook, we will use MOSIP documentation to create a dataset that can be used in an LLM instruction fine-tuning task.

In [2]:
import os
import markdown
from bs4 import BeautifulSoup 
from datasets import Dataset


In [None]:
# Clone the MOSIP documentation repository at the 1.2.0 ref
# (git clone needs the repository URL, not a /tree/ page URL)
!git clone --branch 1.2.0 https://github.com/mosip/documentation ../data/mosip-docs

In [4]:
# Open an example documentation file to explore its structure
with open('../data/mosip-docs/docs/abis-api.md', 'r', encoding='utf-8') as f:
    md = f.read()
htmlmarkdown = markdown.markdown(md)
htmlmarkdown

'<h1>ABIS API</h1>\n<p>This document defines the APIs specifications for various operations that ABIS can perform to integrate with MOSIP.</p>\n<p>API specification version: <strong>0.9</strong></p>\n<p>Published Date: February 05, 2021</p>\n<h2>Revision Note</h2>\n<p>| Publish Date      | Revision                                                                                                                                                                                                                                     |\n| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| May 07, 2020      | This is the first formal publication of the interface as a version-ed specification. Earlier draft are superseded by this document. The interface is revamped to make it friendlier to pr

In [6]:
# Parse the HTML and list its headings
soup = BeautifulSoup(htmlmarkdown, "html.parser")
soup.find_all('h1')

[<h1>ABIS API</h1>]

In [7]:
soup.find_all('h2')

[<h2>Revision Note</h2>,
 <h2>Introduction</h2>,
 <h2>Parameters</h2>,
 <h2>ABIS Operations</h2>,
 <h2>References</h2>]

As you can see, the documentation is already structured, and most of the information sits under H2 headings. Let's create one text chunk for the content under each H2 heading.

In [11]:
def extract_headings_and_content(markdown_text):
    """Split a Markdown document into its H1 title and a dict of H2 heading -> content."""
    headings_content = {}
    current_heading = None
    current_content = []

    lines = markdown_text.split('\n')
    h1 = ''
    for line in lines:
        # Remember the document title (a later '# ' line overwrites it)
        if line.startswith('# '):
            h1 = line.strip('#').strip()

        if line.startswith('## '):
            # A new H2 starts: store the section collected so far
            if current_heading:
                headings_content[current_heading] = '\n'.join(current_content)
            current_heading = line.strip('#').strip()
            current_content = []
        else:
            # Lines before the first H2 are collected here and then discarded
            current_content.append(line)

    # Add the content of the last heading
    if current_heading:
        headings_content[current_heading] = '\n'.join(current_content)

    return h1, headings_content
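
One caveat with the line-based splitter above: a line that starts with `## ` inside a fenced code block (for example, a shell comment) is treated as a heading, and any later `# ` line overwrites the title. The following is a minimal sketch of a fence-aware variant, not part of the original notebook. It assumes code fences are delimited by triple backticks, keeps only the first H1, and uses the hypothetical name `split_by_h2`.

```python
FENCE = "`" * 3  # triple backtick, built this way so it can appear inside this example


def split_by_h2(markdown_text):
    """Return (h1, {h2_heading: content}), ignoring '#' lines inside code fences."""
    h1 = ""
    sections = {}
    current_heading = None
    current_content = []
    in_fence = False

    for line in markdown_text.split("\n"):
        # Toggle the fence state on every line that opens or closes a code fence
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence

        if not in_fence and line.startswith("# ") and not h1:
            # Keep only the first H1 as the document title
            h1 = line.lstrip("#").strip()
        elif not in_fence and line.startswith("## "):
            if current_heading is not None:
                sections[current_heading] = "\n".join(current_content)
            current_heading = line.lstrip("#").strip()
            current_content = []
        else:
            current_content.append(line)

    # Store the content of the last heading
    if current_heading is not None:
        sections[current_heading] = "\n".join(current_content)
    return h1, sections


# Hypothetical sample document
sample = "\n".join([
    "# Demo Title",
    "## Setup",
    "Install the tool.",
    FENCE + "bash",
    "## this is a shell comment, not a heading",
    FENCE,
    "## Usage",
    "Run it.",
])
h1, sections = split_by_h2(sample)
print(h1)
print(list(sections))
```

In this sample, the shell comment stays inside the `Setup` section instead of opening a new one.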

In [12]:
h1, headings_content = extract_headings_and_content(md)

In [13]:
headings_content

{'Revision Note': '\n| Publish Date      | Revision                                                                                                                                                                                                                                     |\n| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| May 07, 2020      | This is the first formal publication of the interface as a version-ed specification. Earlier draft are superseded by this document. The interface is revamped to make it friendlier to programmers and also has a new method for conversion. |\n| June 09, 2020     | A note related to targetFPIR was added.                                                                                                                                    

Now we will extract the headings and content from all the Markdown files in the repository.

In [14]:
def find_markdown_files(folder_path):
    markdown_files = []
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.md'):
                markdown_files.append(os.path.join(root, file))
    return markdown_files
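
`os.walk` yields entries in whatever order the file system returns them, so the index of a given chunk in the final dataset can differ between machines. As a hedged alternative, here is a sketch using `pathlib` that returns the same kind of list in sorted order. The function name `find_markdown_files_pathlib` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_markdown_files_pathlib(folder_path):
    """Return the sorted paths (as strings) of all .md files under folder_path."""
    return sorted(str(p) for p in Path(folder_path).rglob("*.md") if p.is_file())


# Demo on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / "index.md").write_text("# Index\n", encoding="utf-8")
    (root / "guides" / "setup.md").write_text("# Setup\n", encoding="utf-8")
    (root / "notes.txt").write_text("not markdown\n", encoding="utf-8")

    found = find_markdown_files_pathlib(root)
    # Show paths relative to the demo root so the output is stable
    relative = [Path(p).relative_to(root).as_posix() for p in found]
    print(relative)
```

Sorting makes lookups such as `dataset[123]` reproducible across runs, at the cost of differing from the traversal order shown in this notebook.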

In [15]:
# Find all markdown files in the folder
folder_path = '../data/mosip-docs/docs'
markdown_files = find_markdown_files(folder_path)
print("Markdown files found:")

# Extract the headings and content from each markdown file
input_list = []

for file in markdown_files:
    print(file)
    with open(file, 'r', encoding='utf-8') as f:
        md = f.read()
    h1, headings_content = extract_headings_and_content(md)
    for heading, context in headings_content.items():
        # One record per H2 section, prefixed with the document title
        text = f"{h1} - {heading} : {context}"
        input_list.append({"input": text})

# Create a dataset from the list of text chunks
dataset = Dataset.from_list(input_list)

Markdown files found:
../data/mosip-docs/docs/registration-client.md
../data/mosip-docs/docs/data-protection.md
../data/mosip-docs/docs/partner-policies.md
../data/mosip-docs/docs/automation-testing.md
../data/mosip-docs/docs/mosip-emanas-integration.md
../data/mosip-docs/docs/release-notes-resident-portal-dp1.md
../data/mosip-docs/docs/country-implementation.md
../data/mosip-docs/docs/ctk-setup-steps-B2.md
../data/mosip-docs/docs/resident-portal-configuration-guide.md
../data/mosip-docs/docs/keycloak.md
../data/mosip-docs/docs/manual-adjudication-and-verification.md
../data/mosip-docs/docs/resident-services-developer-guide.md
../data/mosip-docs/docs/registration-processor-developers-guide.md
../data/mosip-docs/docs/commons-developer-guide.md
../data/mosip-docs/docs/license.md
../data/mosip-docs/docs/wireguard-client-installation-guide.md
../data/mosip-docs/docs/engineering-roadmap.md
../data/mosip-docs/docs/helm-charts.md
../data/mosip-docs/docs/contributions.md
../data/mosip-docs/doc
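
Because every H2 section becomes a record, sections with no body text end up as records that contain only the `H1 - H2 :` prefix. A minimal sketch of a sanity check that could be run on `input_list` before building the dataset is shown below. The sample records, the helper name `has_body`, and the `min_chars` threshold are all hypothetical, and the check assumes the first ` : ` in a record separates the prefix from the body.

```python
# Hypothetical records in the same {"input": ...} shape as input_list
records = [
    {"input": "Doc A - Introduction : \nSome introductory text.\n"},
    {"input": "Doc A - References : \n\n"},
    {"input": "Doc B - Overview : \nA short overview paragraph.\n"},
]


def has_body(record, min_chars=1):
    """True if the text after the first ' : ' has at least min_chars non-space characters."""
    _, _, body = record["input"].partition(" : ")
    return len(body.strip()) >= min_chars


# Keep only records whose section actually contains text
kept = [r for r in records if has_body(r)]
lengths = [len(r["input"]) for r in kept]
print(f"kept {len(kept)} of {len(records)} chunks")
print(f"min/max length: {min(lengths)}/{max(lengths)} characters")
```

Raising `min_chars` would also drop very short sections, which is a judgment call that depends on the fine-tuning task.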

In [16]:
dataset[123]

{'input': 'Pre-registration - Pre-registration module : \nThe relationship of the pre-registration module with other services is explained here. _NOTE: The numbers do not signify sequence of operations or control flow_\n\n![](\\_images/pre-reg-entity.png)\n\n1. Fetch [ID Schema](id-schema/) details with the help of Syncdata service.\n2. Fetch a new OTP for the user on the login page.\n3. Log all events.\n4. Pre-Registration interacts with Keycloak via [`kernel-auth-adapater`](https://github.com/mosip/mosip-openid-bridge/tree/release-1.2.0). The Pre-Reg module communicates with endpoints of other MOSIP modules. However, to access these endpoints, a token is required. This token is obtained from Keycloak.\n5. Database used by pre-reg.\n6. Generate a new AID for the application.\n7. Send OTP in the email/SMS to the user.\n8. Registration Processor uses reverse sync to mark the pre-reg application as consumed.\n9. Registration clients use [Datasync service](https://github.com/mosip/pre-reg

In [21]:
# Save the dataset to disk
dataset.save_to_disk("../data/mosip_dataset.hf")

Saving the dataset (1/1 shards): 100%|██████████| 822/822 [00:00<00:00, 191072.82 examples/s]


In [22]:
# Load the dataset from disk
from datasets import load_from_disk
loaded_dataset = load_from_disk("../data/mosip_dataset.hf")

In [23]:
loaded_dataset[123]

{'input': 'Pre-registration - Pre-registration module : \nThe relationship of the pre-registration module with other services is explained here. _NOTE: The numbers do not signify sequence of operations or control flow_\n\n![](\\_images/pre-reg-entity.png)\n\n1. Fetch [ID Schema](id-schema/) details with the help of Syncdata service.\n2. Fetch a new OTP for the user on the login page.\n3. Log all events.\n4. Pre-Registration interacts with Keycloak via [`kernel-auth-adapater`](https://github.com/mosip/mosip-openid-bridge/tree/release-1.2.0). The Pre-Reg module communicates with endpoints of other MOSIP modules. However, to access these endpoints, a token is required. This token is obtained from Keycloak.\n5. Database used by pre-reg.\n6. Generate a new AID for the application.\n7. Send OTP in the email/SMS to the user.\n8. Registration Processor uses reverse sync to mark the pre-reg application as consumed.\n9. Registration clients use [Datasync service](https://github.com/mosip/pre-reg