# NSMQ - Kwame AI Project

## Script Title: HTML Parsing for Textbook Content Structuring
### Author: Ernest Samuel
#### Team: Data Preprocessing Team
##### Date: June 24, 2023
##### Updated: August 25, 2023

---

## Data Processing Functions

This Jupyter Notebook contains five functions for extracting and structuring content from HTML sources, plus a companion notebook (item 6) for converting the output:

1. **unique(array):** Removes duplicate items from a list while preserving the order of first occurrence.

2. **extract_rawTable_of_content(link, homePage):** Given a base URL (`link`) and the name of a landing page (`homePage`), fetches that page and returns the unique `href` values of the links found in its table of contents.

3. **extract_url(link, pageList, maxNmber, alphabet_char):** Builds page URLs by concatenating `link` with each table of contents entry whose name starts with one of the requested indices. The parameters are:
   - `link`: The base URL up to and including the last forward slash
   - `pageList`: A list of table of contents items
   - `maxNmber`: The highest numerical index in the table of contents
   - `alphabet_char` (optional): A list of non-numerical index characters
   
4. **extract_url_content(url):** Fetches the page at `url` and returns its content as a two-item list: page-level content (the title and paragraphs outside any section) and the structured sections (titles, paragraphs, lists, figures, and tables). It does not write any file itself.

5. **extract_textbook(url_list, textbook_name):** Iterates over the URLs generated by function 3, calls function 4 on each one, and saves the combined result as a JSON file named after the textbook.

6. **json_to_txt_script_notebook_.ipynb:** A companion notebook in the same directory. It provides `convert_to_txt_or_csv(input_filename, output_format)`, which converts the JSON output to either .txt or .csv.

---

This documentation outlines the purpose and functionality of each function, ensuring clarity and readability for future users and collaborators.


In [12]:
# ------------- Install, if missing, and import libraries ----------------------- #

try:
    import requests
except ImportError:
    print("Installing requests...")
    %pip install requests
    import requests

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("Installing beautifulsoup4...")
    %pip install beautifulsoup4
    from bs4 import BeautifulSoup

try:
    import pandas as pd
except ImportError:
    print("Installing pandas...")
    %pip install pandas
    import pandas as pd

import os
from urllib.parse import urljoin
import json

# ------------- Load the .csv/.txt conversion script --------------------------- #
%run json_to_txt_script_notebook_.ipynb 
# --------------------------------------------------------------------------------#


### Data Cleaning Functions


In [2]:
def unique(array):
    """
    Remove duplicates from a list while preserving the order of elements.
    
    Args:
        array (list): The input list containing elements.
        
    Returns:
        list: A new list with duplicates removed.
    """
    return list(dict.fromkeys(array))

def remove_items(superset_of_item, set_of_item):
    """
    Remove items in list 'set_of_item' from list 'superset_of_item' and return the filtered list.
    
    This function filters out a subset 'set_of_item' from a superset 'superset_of_item' list.
    It is used to exclude specific items from a list based on another list.
    
    Args:
        superset_of_item (list): The superset list.
        set_of_item (list): The subset list to be removed from 'superset_of_item'.
        
    Returns:
        list: A new list with items from 'set_of_item' removed from 'superset_of_item'.
    """
    return [item for item in superset_of_item if item not in set_of_item]
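
As a quick check of the two helpers above, the sketch below re-declares them so that it runs on its own and applies them to made-up page names. The sample values are illustrative only and are not taken from any textbook.

```python
def unique(array):
    # dict keys keep insertion order (Python 3.7+), so the first occurrence wins
    return list(dict.fromkeys(array))


def remove_items(superset_of_item, set_of_item):
    # Keep only the items that do not appear in set_of_item
    return [item for item in superset_of_item if item not in set_of_item]


# Hypothetical sample data
pages = ["preface", "1-introduction", "preface", "2-atoms", "1-introduction"]

deduplicated = unique(pages)
print(deduplicated)  # ['preface', '1-introduction', '2-atoms']

filtered = remove_items(deduplicated, ["preface"])
print(filtered)      # ['1-introduction', '2-atoms']
```

Because `remove_items` tests membership against a list, each lookup scans the whole list. For large inputs of hashable items, converting `set_of_item` to a `set` first would likely be faster.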


#### Extracting Raw Table of Contents as URLs from Online Textbooks

In [3]:
def extract_rawTable_of_content(link, homePage):
    """
    Extracts URLs of chapters from the table of contents of an online textbook.

    Args:
        link (str): Base URL of the textbook pages.
        homePage (str): First landing page of the online view of the textbook.

    Returns:
        list: A list of unique URLs representing chapters.
    """
    
    # Construct the complete URL for the table of contents page
    website_link = link + homePage
    url_list = []

    # Send a GET request to the website
    response = requests.get(website_link)
    
    # Check if the response is successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Take the first <div> on the page; the table of contents is assumed to sit inside it
        table_of_contents_div = soup.find('div')

        if table_of_contents_div:
            # Find all the <a> tags within the table of contents div
            a_tags = table_of_contents_div.find_all('a')

            # Extract the href attribute from each <a> tag, skipping anchors without one
            for a_tag in a_tags:
                href = a_tag.get('href')
                if href:
                    url_list.append(href)
        else:
            print("Table of contents div not found on the website.")
    else:
        print(f"Request failed with status code {response.status_code}: {website_link}")

    # Ensure uniqueness of URLs in the list
    return unique(url_list)
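
`extract_rawTable_of_content` builds its request URL with plain string concatenation (`link + homePage`), so `link` has to end with a forward slash. The sketch below, using hypothetical `example.org` URLs, shows what happens when the slash is missing and how `urljoin` (already imported in this notebook) behaves in the same two cases. `ensure_trailing_slash` is an illustrative helper, not part of the original script.

```python
from urllib.parse import urljoin

# Hypothetical base URLs, with and without a trailing slash
base_with_slash = "https://example.org/books/biology/pages/"
base_without_slash = "https://example.org/books/biology/pages"
home_page = "preface"

concatenated_ok = base_with_slash + home_page
concatenated_bad = base_without_slash + home_page
joined_ok = urljoin(base_with_slash, home_page)
joined_bad = urljoin(base_without_slash, home_page)

print(concatenated_ok)   # https://example.org/books/biology/pages/preface
print(concatenated_bad)  # https://example.org/books/biology/pagespreface
print(joined_ok)         # https://example.org/books/biology/pages/preface
print(joined_bad)        # https://example.org/books/biology/preface


def ensure_trailing_slash(base):
    # Hypothetical helper: normalise the base URL before concatenating or joining
    return base if base.endswith("/") else base + "/"


normalised = ensure_trailing_slash(base_without_slash) + home_page
print(normalised)        # https://example.org/books/biology/pages/preface
```

Neither approach recovers from a missing slash on its own: concatenation glues the two names together, while `urljoin` replaces the last path segment. Normalising the base URL first avoids both outcomes.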

#### Extract the URL of Each Required Page of the Textbook

In [4]:
#---------------- From Table of Contents as URL, Extract pages URL for each required page of the textbook ----------------#

def extract_url(link, pageList, maxNmber, alphabet_char=None):
    """
    Extracts URLs for specific pages of an online textbook.

    Args:
        link (str): Base URL of the textbook pages.
        pageList (list): List of landing pages for specific chapters.
        maxNmber (int): Highest numerical index in the table of contents.
        alphabet_char (list, optional): List of alphabet characters or string indices. Defaults to None (no extra indices).

    Returns:
        list: A list of unique URLs representing specified pages.
    """
    # Generate a combined list of numeric and alphabet characters
    list_pages = list(range(1, maxNmber + 1))
    list_pages.extend(alphabet_char or [])

    url = []
    
    # Iterate through the provided pageList
    for item in pageList:
        for value in list_pages:
            # Check if the item starts with the current value (number or alphabet)
            if item.startswith(str(value)):
                url.append(link + item)

    # Ensure uniqueness of URLs in the list
    return unique(url)
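
The prefix test in `extract_url` keeps any entry that merely starts with one of the indices, which can select more pages than intended. The sketch below mirrors that filter in a hypothetical stand-in, `select_pages`, and compares it with a stricter variant, `select_pages_strict`, which assumes page names follow an `<index>-<slug>` pattern. Both function names and the sample table of contents are made up for illustration.

```python
def select_pages(page_list, max_number, alphabet_char=None):
    """Hypothetical stand-in that mirrors the prefix filter in extract_url."""
    prefixes = [str(n) for n in range(1, max_number + 1)]
    prefixes.extend(alphabet_char or [])
    # str.startswith accepts a tuple of candidate prefixes
    selected = [item for item in page_list if item.startswith(tuple(prefixes))]
    return list(dict.fromkeys(selected))


def select_pages_strict(page_list, max_number, alphabet_char=None):
    """Stricter variant: the text before the first '-' must equal an index."""
    prefixes = {str(n) for n in range(1, max_number + 1)}
    prefixes.update(alphabet_char or [])
    selected = [item for item in page_list if item.split("-", 1)[0] in prefixes]
    return list(dict.fromkeys(selected))


# Hypothetical table of contents entries
toc = [
    "preface",
    "1-introduction",
    "1-1-the-science-of-biology",
    "12-mendel",
    "a-the-periodic-table",
    "answer-key",
    "index",
]

loose = select_pages(toc, 2, ["a"])
strict = select_pages_strict(toc, 2, ["a"])

print(loose)
# ['1-introduction', '1-1-the-science-of-biology', '12-mendel',
#  'a-the-periodic-table', 'answer-key']
print(strict)
# ['1-introduction', '1-1-the-science-of-biology', 'a-the-periodic-table']
```

With the loose filter, `12-mendel` passes even though the maximum index is 2, and `answer-key` passes because it starts with `a`. Whether that matters depends on the naming scheme of the site being scraped.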


### Extract contents from a URL


In [5]:
def extract_url_content(url):
    """
    Extracts content from a URL's HTML structure, organized into sections, irrelevant_content, paragraphs, lists, figures, and tables.

    Args:
        url (str): The URL of the webpage to extract content from.

    Returns:
        list: A list of two items: irrelevant_content (a dictionary with the page title and non-section paragraphs)
              and content_list (a list of section dictionaries with titles, paragraphs, lists, figures, and tables).
              Returns None if the main content section cannot be found.
    """
    
    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the main content section
    main_content = soup.find("div")

    if main_content is None:
        print("Unable to find the main content section")
        return

    # List to store the content
    content_list = []
    
    # Extract chapter title and non-section paragraphs
    irrelevant_content = {}
    head = soup.find('head')
    titles = head.find_all('title') if head else []
    for title in titles:
        irrelevant_content["Title"] = title.text.strip()

    # Collect every paragraph on the page; those inside sections are filtered out later
    body = soup.find('body')
    paras = [p.text.strip() for p in body.find_all('p')] if body else []

    # Find all the sections in the main content
    sections = main_content.find_all("section")

    # Set to store unique section identifiers
    section_identifiers = set()

    # Iterate over each section
    sec = []  # To generate a subset of paragraph data
    for section in sections:
        section_data = {}

        # Extract section identifier
        section_id = section.get("id")
        section_class = section.get("class")
        section_uuid_key = section.get("data-uuid-key")
        section_data_type = section.get("data-type")
        section_class_tuple = tuple(section_class) if section_class is not None else ()
        section_identifier = (section_id, section_class_tuple, section_uuid_key, section_data_type)

        # Skip if section identifier is already encountered
        if section_identifier in section_identifiers:
            continue

        # Add section identifier to the set
        section_identifiers.add(section_identifier)

        # Extract section title, falling back to the first heading on the page
        subtitle = soup.find(['h3', 'h4', 'h2', 'h1'])
        title = section.find(["h1", "h2", "h3", "h4", "h5"])
        if title:
            section_data["title"] = title.text.strip()
        elif subtitle:
            section_data["title"] = subtitle.text.strip()
        else:
            section_data["title"] = ""

        # Extract section paragraphs
        paragraphs = section.find_all(["p", "span"])
        section_data["Section"] = []
        
        for paragraph in paragraphs:
            paragraph_text = paragraph.text.strip()
            if paragraph_text:
                section_data["Section"].append(paragraph_text)
                sec.append(paragraph_text)

        # Extract list items
        lists = section.find_all("ul")
        section_data["lists"] = []
        for ul in lists:
            list_items = ul.find_all("li")
            section_data["lists"].append([li.text.strip() for li in list_items])

        # Extract figures and image links
        figures = section.find_all("div", {"class": "os-figure"})
        section_data["figures"] = []
        for figure in figures:
            figure_data = {}
            img = figure.find("img")
           
            if img and "src" in img.attrs:
                image_url = urljoin(url, img["src"])
                figure_data["image"] = image_url

            caption = figure.find("figcaption")
            if caption:
                figure_data["caption"] = caption.text
            
            section_data["figures"].append(figure_data)

        # Extract tables
        tables = section.find_all("table")
        section_data["tables"] = []
        for table in tables:
            table_data = []
            rows = table.find_all("tr")
            for row in rows:
                cells = row.find_all("td")
                table_data.append([cell.text.strip() for cell in cells])
            section_data["tables"].append(table_data)

        content_list.append(section_data)

    # Extract only paragraphs that are not in the sections structure
    irrelevant_content["Paragraphs_Not_in_Sections"] = remove_items(paras, sec)

    return [irrelevant_content, content_list]  
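
The duplicate check in `extract_url_content` relies on a tuple of section attributes stored in a set. BeautifulSoup returns the `class` attribute as a list, and lists are not hashable, which is why the function converts it to a tuple first. The sketch below reproduces that idea with plain dictionaries standing in for parsed `<section>` tags; the attribute values are made up.

```python
# Hypothetical attribute dictionaries standing in for parsed <section> tags
sections = [
    {"id": "sec-1", "class": ["summary", "key"], "data-type": None},
    {"id": "sec-1", "class": ["summary", "key"], "data-type": None},  # duplicate
    {"id": "sec-2", "class": None, "data-type": "chapter"},
]

seen = set()
unique_sections = []
for attrs in sections:
    classes = attrs.get("class")
    # A list cannot be a set member, so the class list becomes a tuple
    key = (
        attrs.get("id"),
        tuple(classes) if classes is not None else (),
        attrs.get("data-type"),
    )
    if key in seen:
        continue  # Same identifier already processed
    seen.add(key)
    unique_sections.append(attrs)

print(len(unique_sections))                   # 2
print([s["id"] for s in unique_sections])     # ['sec-1', 'sec-2']
```

Two sections that carry no identifying attributes at all would produce the same key, so under this scheme only the first of them would be kept.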


### Iterate and extract different pages, by URL, in the textbook and save as one file


In [6]:

def extract_textbook(url_list, textbook_name):
    """
    Extracts and structures content from a list of URLs into a JSON file.

    Args:
        url_list (list): List of URLs containing the textbook content.
        textbook_name (str): Name of the textbook for JSON file naming.

    Returns:
        list: A one-item list holding a dictionary that maps page labels ('Page 0', 'Page 1', ...) to their structured sections.
    """
    content_list = []  # List to store the structured content
    page_data = {}  # Dictionary to store content per page
    pages = 0  # Page counter
    
    # Iterate through the list of URLs
    for url in url_list:
        # Extract content from the URL using a helper function (extract_url_content)
        page_content = extract_url_content(url)

        # extract_url_content returns None when no main content is found; skip such pages
        if page_content is None:
            continue

        # page_content is a list with two items:
        # 1. irrelevant_content (not used here)
        # 2. content_list (the structured sections)
        # Only item [1] is kept, because paragraphs outside sections are not needed for the training dataset.
        page_data['Page ' + str(pages)] = page_content[1]
        pages += 1
    
    content_list.append(page_data)  # Append structured content to the list
    
    # Get the current working directory and create the file path for JSON
    script_dir = os.getcwd()
    json_path = os.path.join(script_dir, f"{textbook_name}.json")
    
    # Save the structured content as JSON
    with open(json_path, "w") as file:
        json.dump(content_list, file, indent=4)
    
    return content_list

### Run script


In [14]:
# Read CSV file containing links to the textbooks
# CSV should have columns 'BOOKS' for textbook names and 'URL' for URLs

openStax = pd.read_csv('openstax_textbooks_sheet.csv')

# Select specific textbooks to extract
selected_textbooks = openStax.iloc[[25,22,3]]

# Iterate through selected textbooks and extract content
for BookName, urls in zip(selected_textbooks['BOOKS'], selected_textbooks['URL']):
   
    print("Extracting content from:", BookName)
    
    # Initialize the starting page for online view
    landing_page = 'preface'
    site = str(urls)

    # Extract the table of contents
    pageList = extract_rawTable_of_content(site, landing_page)

    # Indices of the required content in the table of contents
    max_number = 10  # Highest numerical index in the table of contents
    char = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'i']

    # Extract the list of URLs to be parsed for website scraping (used in htmlProcessing)
    URLs = extract_url(site, pageList, max_number, char)

    # Name of the textbook
    textbook = str(BookName)
    
    # Extract the textbook content
    extracted_content = extract_textbook(URLs, textbook)
    file_name = f"{BookName}.json"
    output_format = "csv"
    convert_to_txt_or_csv(file_name, output_format)  # See json_to_txt_script_notebook_.ipynb for how to use this function.


Extracting content from: Biology for AP courses
Biology for AP courses Data extracted and saved as .csv successfully in formatted_files.
Extracting content from: Chemistry 2e
Chemistry 2e Data extracted and saved as .csv successfully in formatted_files.
Extracting content from: High School Physics
High School Physics Data extracted and saved as .csv successfully in formatted_files.


#### Convert a .json file to a .txt or .csv file
- Make sure `json_to_txt_script_notebook_.ipynb` is in the same folder as this script.

In [8]:
input_filename = 'College Physics for AP Courses 2e.json'  # Replace with the actual input file name
output_format = "csv"               # Choose output format ("csv" or "txt")

# Call the function with the input filename and output format, default is .txt
convert_to_txt_or_csv(input_filename, output_format)


College Physics for AP Courses 2e Data extracted and saved as .csv successfully in formatted_files.


## Data Storage Structure

The stored data follows a structured organization aligned with the website's table of contents:

- The JSON file holds a one-item list containing a dictionary with one key per page (`Page 0`, `Page 1`, ...), in the order of the extracted URLs.
- Each page maps to a list of sections. Every section is a dictionary with the keys `title`, `Section`, `lists`, `figures`, and `tables`.
- `Section` holds the section's paragraph text (taken from `<p>` and `<span>` tags).
- `figures` holds the image link and caption of each figure found in the section.
- `tables` holds the rows of each table found in the section.
- `lists` holds the items of each unordered list (`<ul>`) found in the section; ordered lists are not captured by the current code.
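
To make the layout concrete, the sketch below builds a small hand-written sample in the shape that `extract_textbook` writes, round-trips it through JSON, and walks the pages and sections. All values in `sample` are invented for illustration.

```python
import json

# Hypothetical sample in the shape written by extract_textbook
sample = [
    {
        "Page 0": [
            {
                "title": "Introduction",
                "Section": ["First paragraph.", "Second paragraph."],
                "lists": [["item one", "item two"]],
                "figures": [
                    {"image": "https://example.org/fig1.png", "caption": "Figure 1"}
                ],
                "tables": [[["a", "b"], ["c", "d"]]],
            }
        ]
    }
]

# Round-trip through JSON text to mimic saving and reloading the file
loaded = json.loads(json.dumps(sample))

rows = []
for page_name, page_sections in loaded[0].items():
    for section in page_sections:
        # Record the page label, section title, and paragraph count
        rows.append((page_name, section["title"], len(section["Section"])))

print(rows)  # [('Page 0', 'Introduction', 2)]
```

Reading a real output file works the same way, with `json.load` on the opened file in place of the round trip.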

## Notable Insights

- `json.dump` escapes non-ASCII characters by default, so special characters such as mathematical symbols appear in the file as `\uXXXX` sequences; loading the file with `json.load` restores them.
- Printing the loaded content directly can reveal more about its format and structure.
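
The encoding behaviour can be seen with a short sketch using only the standard `json` module. The sample formula is made up. Passing `ensure_ascii=False` is one possible way to keep the characters readable in the file; the original script does not use it.

```python
import json

# Hypothetical text containing non-ASCII mathematical symbols
text = {"formula": "ΔE = mc²"}

escaped = json.dumps(text)                       # default: ensure_ascii=True
readable = json.dumps(text, ensure_ascii=False)  # keep characters as-is

print(escaped)   # {"formula": "\u0394E = mc\u00b2"}
print(readable)  # {"formula": "ΔE = mc²"}

# Both forms decode to the same data, so nothing is lost either way
restored_from_escaped = json.loads(escaped)
restored_from_readable = json.loads(readable)
print(restored_from_escaped == restored_from_readable == text)  # True
```

If `ensure_ascii=False` is used when writing a file, opening the file with `encoding="utf-8"` keeps the output independent of the platform's default encoding.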
