<a href="https://colab.research.google.com/github/cchummer/sec-api/blob/main/s1_prospectus_parsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This module is being developed mainly to extract the prospectus summary and risk factors sections from S-1 filings for LDA and NMF topic analysis.

First, we will use the quarterly full-index files to retrieve filings en masse.

In [1]:
from datetime import date
import requests

In [2]:
"""
Support method used below. Given a date object, determines the quarter number.
Quarters are literally just divided into four 3-month groups: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec (month 1-3, 4-6, 7-9, 10-12)
"""
def get_quarter_from_date(date_obj):
  if (date_obj.month >= 1 and date_obj.month <= 3):
    return 1
  elif (date_obj.month >= 4 and date_obj.month <= 6):
    return 2
  elif (date_obj.month >= 7 and date_obj.month <= 9):
    return 3
  elif (date_obj.month >= 10 and date_obj.month <= 12):
    return 4
  else:
    print("Failed to determine quarter from given date.")

  return 0
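
Because the quarters are fixed three-month blocks, the same mapping can also be written as integer arithmetic on the month. A minimal sketch of that alternative (the name `quarter_from_month` is hypothetical and not used elsewhere in this notebook):

```python
from datetime import date

def quarter_from_month(date_obj):
  # Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
  return (date_obj.month - 1) // 3 + 1

print(quarter_from_month(date(2024, 4, 30)))  # April falls in the second quarter
```

A valid `date` object always has a month between 1 and 12, so this version never needs a failure branch.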

In [3]:
"""
filter_type = "cik", "company", "type", or "date"
unique_vals = [] # List of unique values (CIKs, company names, types, or dates). NOTE: Dates should be passed as a list of date objects

Returns a list of dictionaries of the following structure:
{
  "cik" : "CIK_NUM",
  "company" : "COMPANY_NAME",
  "type" : "FILING_TYPE",
  "date" : "YYYY-MM-DD",
  "fulltext_path" : ".../edgar/data/CIK/ETC"
}
"""
def filter_quarter_by_param(target_year, target_quarter, filter_type, unique_vals = ()):

  filings_list = []

  # Figure out what kind of filter and format the unique values if needed
  clean_unique_vals = []
  filter_column = 0

  if filter_type == "cik":
    # Strip leading 0's for consistency. The IDX files won't include them from what I've seen.
    for i in unique_vals:
      clean_unique_vals.append(str(i).lstrip("0"))

  elif filter_type == "company":
    filter_column = 1
    for i in unique_vals:
      clean_unique_vals.append(i.lower())

  elif filter_type == "type":
    filter_column = 2
    for i in unique_vals:
      clean_unique_vals.append(i.lower())

  elif filter_type == "date":
    filter_column = 3
    # Convert date structures into "YYYY-MM-DD" strings
    try:
      for d in unique_vals:
        clean_unique_vals.append("{}-{}-{}".format(d.year, str(d.month).zfill(2), str(d.day).zfill(2)))
    except AttributeError:
      # Raised when an item lacks .year/.month/.day, i.e. it is not a date object
      print("Invalid filter date objects passed.")
      return filings_list

  else:
    print("Invalid quarterly filter type. Must be cik, company, type, or date: {}".format(filter_type))
    return filings_list

  # URLs and UA for full-index
  qtr_index_url = r"https://www.sec.gov/Archives/edgar/full-index/{}/QTR{}/master.idx".format(target_year, target_quarter)
  base_archives = "https://www.sec.gov/Archives/"
  req_headers = { "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36" }

  # Get the .idx file
  resp = requests.get(url = qtr_index_url, headers = req_headers)
  resp.raise_for_status()

  # First separate the header from the lines of content. Look for twenty dashes in a row followed by a newline
  idx_full_text = resp.text
  split_idx = idx_full_text.split("--------------------\n")

  # Loop through lines of data
  try:
    for line in split_idx[1].splitlines():

      # CIK|Company Name|Form Type|Date Filed|Filename
      # We expect 5 columns
      columns = line.split("|")
      if len(columns) == 5:

        if columns[filter_column].lower() in clean_unique_vals:
          # Build a dictionary for the filing if we find match
          found_filing = {}
          found_filing["cik"] = columns[0].zfill(10)
          found_filing["company"] = columns[1]
          found_filing["type"] = columns[2]
          found_filing["date"] = columns[3]
          found_filing["fulltext_path"] = base_archives + columns[4]

          # Append it
          filings_list.append(found_filing)

  except IndexError:
    # split_idx[1] is missing when the dashed header separator was not found
    print("IDX file was in unexpected format, error parsing")

  return filings_list
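
The parsing step can be tried offline, without a request to sec.gov. The sketch below runs the same header split and pipe-column logic over a made-up sample; the sample rows and the helper name `parse_idx_rows` are assumptions for illustration, not real index data.

```python
# Hypothetical sample mimicking the layout of a master.idx file (not real data)
sample_idx = (
  "Description: Master Index of EDGAR Dissemination Feed\n"
  "\n"
  "CIK|Company Name|Form Type|Date Filed|Filename\n"
  + "-" * 80 + "\n"
  "1000001|Example Corp|S-1|2024-04-08|edgar/data/1000001/0000000000-24-000001.txt\n"
  "1000002|Sample Holdings, Inc.|10-Q|2024-04-09|edgar/data/1000002/0000000000-24-000002.txt\n"
)

def parse_idx_rows(idx_text):
  # Everything after the dashed separator line is data
  _, _, body = idx_text.partition("-" * 20 + "\n")
  rows = []
  for line in body.splitlines():
    columns = line.split("|")
    if len(columns) != 5:
      continue  # Skip anything that is not CIK|Company Name|Form Type|Date Filed|Filename
    cik, company, form_type, date_filed, filename = columns
    rows.append({
      "cik": cik.zfill(10),
      "company": company,
      "type": form_type,
      "date": date_filed,
      "fulltext_path": "https://www.sec.gov/Archives/" + filename,
    })
  return rows

s1_rows = [row for row in parse_idx_rows(sample_idx) if row["type"].lower() == "s-1"]
print(s1_rows)
```

Only the first sample row survives the filter, since the second is a 10-Q.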

In [4]:
# Test above method
todays_date = date.today()
qtrnum = get_quarter_from_date(todays_date)

if (not qtrnum):
  print('There was an issue determining the correct quarter(s) to grab')
  exit()

filtered_list = filter_quarter_by_param(todays_date.year, qtrnum, "type", ['S-1'])
print(filtered_list)

[{'cik': '0001014763', 'company': 'Ainos, Inc.', 'type': 'S-1', 'date': '2024-04-08', 'fulltext_path': 'https://www.sec.gov/Archives/edgar/data/1014763/0001493152-24-013774.txt'}, {'cik': '0001029125', 'company': 'Panbela Therapeutics, Inc.', 'type': 'S-1', 'date': '2024-04-18', 'fulltext_path': 'https://www.sec.gov/Archives/edgar/data/1029125/0001437749-24-012445.txt'}, {'cik': '0001063537', 'company': 'RiceBran Technologies', 'type': 'S-1', 'date': '2024-04-19', 'fulltext_path': 'https://www.sec.gov/Archives/edgar/data/1063537/0001437749-24-012560.txt'}, {'cik': '0001096275', 'company': 'Worksport Ltd', 'type': 'S-1', 'date': '2024-04-02', 'fulltext_path': 'https://www.sec.gov/Archives/edgar/data/1096275/0001493152-24-012713.txt'}, {'cik': '0001121702', 'company': 'YIELD10 BIOSCIENCE, INC.', 'type': 'S-1', 'date': '2024-04-25', 'fulltext_path': 'https://www.sec.gov/Archives/edgar/data/1121702/0001121702-24-000029.txt'}, {'cik': '0001133818', 'company': 'BIO-PATH HOLDINGS, INC.', 'typ

Now we want to break down the document and find both the prospectus summary and risk factors, as mentioned. This code is borrowed from my 10Q and 10K [text section handling framework](https://github.com/cchummer/sec-api/blob/main/10q_k_text_parsing.ipynb).

In [1]:
from bs4 import BeautifulSoup, Tag, NavigableString
import unicodedata

In [6]:
"""
unicodedata.normalize leaves behind a couple of not technically whitespace control-characters. See https://www.geeksforgeeks.org/python-program-to-remove-all-control-characters/ and http://www.unicode.org/reports/tr44/#GC_Values_Table
"""
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

"""
Cleans the given text (specifically: unicode normalize and turn newlines/whitespace into a single space)
"""
def clean_column_text(text_to_clean):

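  clean_text = unicodedata.normalize('NFKD', text_to_clean)
  # Collapse whitespace first: newlines and tabs are themselves category "Cc", so removing control characters before this step would glue adjacent words together
  clean_text = " ".join(clean_text.split()) # Split along any run of whitespace (spaces, tabs, newlines), then rejoin the parts with single spaces
  clean_text = remove_control_characters(clean_text)
  clean_text = " ".join(clean_text.split()) # Collapse any double spaces left where a control character was removed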

  return clean_text
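
A quick check of how Unicode classifies the characters involved may help here. The sketch below uses only the standard library; `strip_c_categories` is a hypothetical stand-in that applies the same category test as `remove_control_characters`.

```python
import unicodedata

# Newlines and tabs are whitespace, but Unicode also classes them as control characters ("Cc")
print(unicodedata.category("\n"), unicodedata.category("\t"))

# A zero-width space is a format character ("Cf"), which is also in the "C" group
print(unicodedata.category("\u200b"))

# NFKD maps a non-breaking space (U+00A0) to a plain space
print(unicodedata.normalize("NFKD", "Risk\u00a0Factors") == "Risk Factors")

def strip_c_categories(s):
  # Drop every character whose Unicode category starts with "C"
  return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

# Applied to raw text, the removal joins words that were separated only by a newline
print(strip_c_categories("Prospectus\nSummary"))
```

This is why it matters whether whitespace is collapsed into spaces before or after the control-character removal.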

In [3]:
"""
Attempts to locate a table of contents by looking for a <table> element containing one or more <a> elements with an href attribute
Returns the bs4.element.Tag object of the first such table if it exists, or None
"""
def linked_toc_exists(document_soup):

  # Find all <table> tags
  all_tables = document_soup.find_all('table')
  for cur_table in all_tables:

    # Look for an <a href=...>
    links = cur_table.find_all('a', attrs = { 'href' : True })
    if len(links):
      return cur_table

  return None

In [8]:
"""
Helper method to find_section_with_linked_toc, extracts the text found in between 2 bs4 Tags/elements
"""
def text_between_tags(start, end):

  cur = start
  found_text = ""

  # Loop through all elements in between the two
  while cur and cur != end:
    if isinstance(cur, NavigableString):

      text = cur.strip()
      if len(text):
        found_text += "{} ".format(text)

    cur = cur.next_element

  return clean_column_text(found_text.strip()) # Strip trailing space that the above pattern will result in

"""
Helper method to find_section_with_linked_toc, extracts the text found starting at a given tag through the end of the soup
"""
def text_starting_at_tag(start):

  cur = start
  found_text = ""

  # Loop through all elements
  while cur:
    if isinstance(cur, NavigableString):

      text = cur.strip()
      if len(text):
        found_text += "{} ".format(text)

    cur = cur.next_element

  return clean_column_text(found_text.strip())

"""
Helper method for find_section_with_linked_toc, attempts to determine if the given text is simply a page number (a duplicate link, in my observations)
"""
def is_text_page_number(question_text):

  # Check argument
  if type(question_text) != str:
    print("Non-string passed to is_text_page_number. Returning True (will result in href being skipped)")
    return True

  # Strip just to be sure
  stripped_question_text = question_text.strip()

  # Check if text is only digits
  if stripped_question_text.isnumeric():
    return True

  # Check if only roman numerals
  valid_romans = ["M", "D", "C", "L", "X", "V", "I", "(", ")"]
  is_roman = True
  for letter in stripped_question_text.upper():
    if letter not in valid_romans:
      is_roman = False
      break

  return is_roman
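
One limitation of the character-set test is that any word spelled only with roman-numeral letters (for example "CIVIL") is also treated as a page number. A stricter sketch, assuming page numbers are either digits or well-formed roman numerals, could use a regular expression instead. The name `looks_like_page_number` is hypothetical and nothing else in this notebook calls it.

```python
import re

# Well-formed roman numerals from 1 to 3999 (the pattern also matches "", which is handled separately)
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def looks_like_page_number(text):
  # Drop surrounding whitespace and parentheses, e.g. "(iv)" -> "iv"
  candidate = text.strip().strip("()").strip()
  if not candidate:
    return True  # Mirrors is_text_page_number: empty link text is treated as skippable
  if candidate.isdigit():
    return True
  return ROMAN_RE.fullmatch(candidate.upper()) is not None

for sample in ["12", "iv", "(ii)", "CIVIL", "Risk Factors"]:
  print(sample, looks_like_page_number(sample))
```

Page labels such as "F-1" are not recognized by either version and would need their own rule.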

In [9]:
"""
Use the hyperlinked TOC to find the given text section(s). Provide a bs4 Tag object for the located TOC. Returns a dictionary with the same structure as that of its calling function, find_sections_in_html_fulltext:
{
  "MATCHING_SECTION_NAME_FOUND" : "SECTION_TEXT",
  ...
}
"""
def find_section_with_linked_toc(document_soup, toc_soup, target_sections = ()):

  # Returned dictionary
  text_dict = {}

  # First, loop through the <a> tags of the TOC and build a dictionary of href anchor values and text (sections) values
  link_dict = {}
  link_tags = toc_soup.find_all('a', attrs = { 'href' : True })
  for link_tag in link_tags:

    # In some TOCs I have examined, there may be a second <a href...> for each section, labeled instead by the page number. This page number may be a digit or a roman numeral
    # If I come across a filing with a different TOC structure, I will find a more nuanced way to handle it. For now simply check if the text is only digits or roman numerals
    # Some TOCs also appear to have a third link to each section, on the far left of the table and with the text "Item 1, Item 2, ...". I will update this if such links appear after the properly labeled links,
    # since they would then overwrite that spot in the href dict defined below. As of now we rely on the properly/fully labeled link being the last non-page-number reference to each href, as the last one is what gets recorded.
    if is_text_page_number(link_tag.text.strip()):
      continue

    link_dict[link_tag.get('href').replace('#', '')] = clean_column_text(link_tag.text.strip())

  # Grab a list of destination anchors (<a> or <div> tags with "id" or "name" attribute) in a single pass so they stay in document order,
  # which the "next anchor marks the end of this section" logic below relies on. Concatenating separate find_all() results would group them by tag type instead
  link_dests = document_soup.find_all(lambda tag: tag.name in ('a', 'div') and (tag.has_attr('id') or tag.has_attr('name')))

  # Filter out those which are never linked to, they will obstruct our logic in text_between_tags as we rely on the next anchor to be the beginning of the next section
  # I have run into filings with such "phantom" anchors that are never linked to and can prematurely signal the end of a section
  # (i.e: https://www.sec.gov/Archives/edgar/data/1331451/000133145118000076/0001331451-18-000076.txt)
  link_dests = [anchor for anchor in link_dests if (anchor.get('id') in link_dict.keys() or anchor.get('name') in link_dict.keys())]

  # Loop through the dictionary of hrefs we built and look for our target sections, storing any found in a new dict
  target_section_links = {}
  for href_val, section_name in link_dict.items():

    for indiv_target in target_sections:
      if indiv_target.lower() in section_name.lower():

        # Add the target section and its href value to target_section_links
        target_section_links[href_val] = indiv_target

  # Now loop through the target sections that we just found links to. We will try to locate the destination of each
  for target_href, target_name in target_section_links.items():

    # The href values are used at their destination in <a> tags with an id/name attribute of the same href value (minus the leading #, why we got rid of it)
    # Loop through the link_dests list of all destination tags, and find the one with id/name=target_href
    num_destinations = len(link_dests)
    for dest_index, link_dest in enumerate(link_dests):

      if (link_dest.get('id') == target_href or link_dest.get('name') == target_href): # Can be either id or name according to HTML spec (see https://stackoverflow.com/questions/484719/should-i-make-html-anchors-with-name-or-id)

        # Grab the text in between the current destination tag and the next occurring destination in link_dests
        # If we are on the last destination, grab all the text left
        section_text = ""

        if dest_index + 1 < num_destinations:
          section_text = text_between_tags(link_dest, link_dests[dest_index + 1])
        else:
          section_text = text_starting_at_tag(link_dest)

        if len(section_text):

          # Add to master dict. TODO: Explore whether there may be 2 sections matching the same target section on the same document
          # With the current code, the last matching section on the document will be recorded for each target. Just a thought
          text_dict[target_name] = section_text

  return text_dict
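
The target matching is a case-insensitive substring test, so one target can match several TOC entries. A small sketch of just that step, using made-up TOC entries (the anchor names and section titles below are hypothetical, not taken from a real filing):

```python
# Hypothetical TOC entries: anchor name -> link text
link_dict = {
  "toc_1": "Prospectus Summary",
  "toc_2": "Summary Financial Data",
  "toc_3": "Risk Factors",
  "toc_4": "Use of Proceeds",
}
target_sections = ["summary", "risk factors"]

# Same matching loop as in find_section_with_linked_toc
target_section_links = {}
for href_val, section_name in link_dict.items():
  for indiv_target in target_sections:
    if indiv_target.lower() in section_name.lower():
      target_section_links[href_val] = indiv_target

print(target_section_links)
# {'toc_1': 'summary', 'toc_2': 'summary', 'toc_3': 'risk factors'}
```

Here two anchors map to "summary", so only the later section's text would remain under that key in the returned dictionary. A more specific target such as "prospectus summary" avoids the collision.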

In [None]:
"""
Some HTML style filings have tables of contents but they are not hyperlinked and thus are not caught by the above algorithm. Methods here attempt to deal with such filings.

Ideas:

Look for section names in text of tags which are:
  1. Not in a table
  2. Have a text-center attribute, or whose parents have one

Continue reading until...? (pagebreaks etc seem inconsistent)


"""

# TODO


In [4]:
"""
Locate and extract custom text section(s). Takes the path to a filing's full text submission and a list of target sections. Returned structure:
{
  "MATCHING_SECTION_NAME_FOUND" : "SECTION_TEXT",
  ...
}
"""
def find_sections_in_html_fulltext(fulltext_sub, target_sections = ()):

  # Returned dict
  master_text_dict = {}

  # Check that sections were specified
  if len(target_sections) == 0:
    print("No target sections were entered. Provide in a list")
    return master_text_dict

  # Get the file contents
  request_headers = { "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36" }
  response = requests.get(url = fulltext_sub, headers = request_headers)
  response.raise_for_status()

  # First task is to break the full text submission into documents by <DOCUMENT> tag
  docs_list = response.text.split("<DOCUMENT>")[1:]
  for doc_string in docs_list:

    document = BeautifulSoup(doc_string, "lxml")

    # Will hold results from current document
    doc_results = {}

    # Only parse HTM/HTML files
    doc_name = document.filename.find(text = True, recursive = False).strip()
    if ".htm" not in doc_name.lower():
      continue

    # Jump into HTML document's contents inside <TEXT>
    doc_html = document.find('text')

    # Parse using TOC if it exists
    # TODO: BETTER PARSING / STORING OF TABLES FOUND IN TEXT SECTIONS
    toc_tag = linked_toc_exists(doc_html)
    if toc_tag:
      doc_results = find_section_with_linked_toc(doc_html, toc_tag, target_sections)
    else:
      pass

    # Loop through results, add to the master dict
    for result_section_name, result_section_text in doc_results.items():

      # If we already have an entry for that target_section, create another ("xxx", "xxx_1", "xxx_2", ...)
      master_key_name = result_section_name
      i = 1
      while master_key_name in master_text_dict.keys():
        master_key_name = "{}_{}".format(result_section_name, i)
        i += 1

      # Add to dict after finding unused key
      master_text_dict[master_key_name] = result_section_text

  return master_text_dict
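
To see how a full-text submission breaks into documents, the splitting step can be run on a tiny stand-in string. The sample below is made up and heavily shortened, and this sketch reads the `<FILENAME>` line with a regular expression rather than BeautifulSoup, which is a simplification of what the function above does.

```python
import re

# Hypothetical, heavily shortened submission text (not a real filing)
sample_submission = """<SEC-DOCUMENT>0000000000-24-000001.txt
<DOCUMENT>
<TYPE>S-1
<SEQUENCE>1
<FILENAME>forms-1.htm
<TEXT>
<html><body><p>Prospectus Summary</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>GRAPHIC
<SEQUENCE>2
<FILENAME>logo.jpg
<TEXT>
begin 644 logo.jpg
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
"""

html_docs = []
# Everything before the first <DOCUMENT> is the submission header, hence the [1:]
for doc_string in sample_submission.split("<DOCUMENT>")[1:]:
  match = re.search(r"<FILENAME>(.+)", doc_string)
  if match is None:
    continue  # No filename line, nothing to classify
  doc_name = match.group(1).strip()
  if ".htm" in doc_name.lower():
    html_docs.append(doc_name)

print(html_docs)
```

Only the `.htm` document is kept; the graphic attachment is skipped, just as non-HTML documents are skipped above.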

In [None]:
print(filtered_list)

In [None]:
master_summary_list = []

for s1 in filtered_list[:10]: # Will limit for now for testing.

  found_summary = find_sections_in_html_fulltext(s1['fulltext_path'], ['summary'])
  if len(found_summary) != 0:

    summary_entry = { s1['fulltext_path'] : found_summary }
    master_summary_list.append(summary_entry)

print(master_summary_list)

At the time of this testing (4/30/24), we successfully grabbed the clean text of ~50% of prospectus summaries by trying the 100 most recent S-1s via our quarterly index. It seems that some modern filings do not include hyperlinked TOCs. See [here](https://www.sec.gov/Archives/edgar/data/1121702/000112170224000029/yten-2024x04x25xs1forwarra.htm#ie2eb5995f0dd459a99f25e48e556c22b_10) and [here](https://www.sec.gov/ix?doc=/Archives/edgar/data/1029125/000143774924012445/pbla20240410_s1.htm) for two examples. I will investigate these cases and improve coverage. At this point each filing's summaries are stored in a dictionary keyed by its full-text filing URL.
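
For the later LDA/NMF step, a flat list of document strings is likely to be easier to work with than nested dictionaries. One possible way to flatten the structure is sketched below; the URLs and text are placeholders, and joining all matched sections of a filing into one string is an assumption about how the documents should be grouped.

```python
# Hypothetical stand-in for master_summary_list (placeholder URLs and text)
master_summary_list = [
  {"https://example.com/filing_a.txt": {"summary": "alpha", "summary_1": "beta"}},
  {"https://example.com/filing_b.txt": {"summary": "gamma"}},
]

urls, texts = [], []
for entry in master_summary_list:
  for url, sections in entry.items():
    urls.append(url)
    # Join every matched section for a filing into one document string
    texts.append(" ".join(sections.values()))

print(list(zip(urls, texts)))
```

The two lists stay aligned by index, so each document can be traced back to its filing.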