In [None]:
# HTTP and parsing libs
import requests
from bs4 import BeautifulSoup

# Used to make a deep copy of the dictionary argument
import copy

# Company Search / "Legacy" search endpoint
As mentioned in fulltext_search_endpoint, the company search endpoint is the only method I have found for searching by SEC file number (or by SIC code). Documentation on file numbers is scattered, and I will try to compile it into a more useful resource at some point. A file number is similar to a CIK but is unique to a "filer type": the prefix (the digits before the first hyphen) indicates the form type. File numbers should be kept intact, prefix and hyphens included, when used in a query.

The only browser-facing pages I found that interact with this endpoint are [/edgar/searchedgar/legacy/companysearch.html](https://www.sec.gov/edgar/searchedgar/legacy/companysearch.html) and [/edgar/searchedgar/companysearch](https://www.sec.gov/edgar/searchedgar/companysearch) (when searching by SIC). Both end up sending a GET request to https://www.sec.gov/cgi-bin/browse-edgar with certain parameters, documented here as of 9/19/2022:


```
- action : "getcompany" by default, we will leave it (seems to affect some queries). "getcurrent" turns this endpoint into a whole different type of search, which can be filtered by company, CIK, or type. a "getcurrent" search will return only filings from the current business day.
- company : Filing entity name. If this is the only identifier entered and there are multiple matching companies, a different response format is given (see note below)
- match : Left blank by default, seems to act the same as "starts-with". Can be set to "contains" to return companies with names simply containing the value
- CIK : Filer CIK or ticker. Must be set to use "type", and must be an actual CIK (not a ticker) to use "dateb"
- type : Filing type. See https://www.sec.gov/forms 
- dateb : If CIK is set, will only return results before date in format YYYYMMDD
- owner : "exclude" by default, possible values also "include" and "only"
- myowner : Left out of most requests, and in some SIC/company list searches seems to actually interfere
- start : Starting index of results to display
- State : State code of filer's primary location. If this or Country are the only identifying parameters entered, a response like that of "company" mentioned above is given
- Country : Country of filer's primary location
- filenum : SEC file number. Hyphen(s) included
- SIC : Filer industry classification code. Similarly can return "company" list if is the only ID entered
- output : "atom" to return XML rather than HTML
- count : Number of results to display. Default is 40, max is 100
- Find : "Find+Companies" by default, only in some requests. Response seems the same without including it
- hidefilings : 0 by default it seems, only included in some requests so leave it out. Unsure of purpose
- search_text : sometimes included but left empty. Unsure of purpose
```
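
As a rough illustration of how these parameters end up in the GET request, here is a minimal sketch that builds the query URL with the standard library. The helper name `build_search_url` is hypothetical, and the parameter values are only examples (the CIK is the one used elsewhere in these notes).

```python
from datetime import date
from urllib.parse import urlencode

ENDPOINT_URL = "https://www.sec.gov/cgi-bin/browse-edgar"

def build_search_url(params):
    """Return the browse-edgar GET URL for a dictionary of query parameters."""
    # urlencode keeps the dictionary order and leaves hyphens untouched,
    # so file numbers such as "028-10418" are sent intact.
    return ENDPOINT_URL + "?" + urlencode(params)

example_params = {
    "action": "getcompany",
    "CIK": "0001703057",                           # an actual CIK, needed for "dateb"
    "type": "4",
    "dateb": date(2022, 9, 19).strftime("%Y%m%d"),  # YYYYMMDD
    "owner": "include",
    "count": 100,
}
print(build_search_url(example_params))
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent request; building the URL by hand is mostly useful for checking a query in the browser.
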
There is also a "nearby" endpoint of https://www.sec.gov/cgi-bin/own-disp which only seems to take the parameters "action=getissuer" and a CIK number and returns a page on insider transactions, example [here](https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001703057). May be worth writing another notebook for. 


In response to a properly formatted request to browse-edgar that includes either CIK or filenum, results are returned in an HTML table (unless atom XML is requested), which can be parsed. The table has class "tableFile2" and summary "Results", in the following format:


```
<table class="tableFile2" summary="Results">
         <tr>
            <th width="7%" scope="col">Filings</th>
            <th width="10%" scope="col">Format</th>
            <th scope="col">Description</th>
            <th width="10%" scope="col">Filing Date</th>
            <th width="15%" scope="col">File/Film Number</th>
         </tr>
<tr>
<td nowrap="nowrap">13F-NT</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/900203/000090266422003932/0000902664-22-003932-index.htm" id="documentsbutton">&nbsp;Documents</a></td>
<td class="small" >Quarterly report filed by institutional managers, Notice<br />Acc-no: 0000902664-22-003932&nbsp;(34 Act)&nbsp; Size: 3 KB            </td>
            <td>2022-08-12</td>
            <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=028-10418&amp;owner=include&amp;count=40">028-10418</a><br>221160271         </td>
         </tr>
<tr class="blueRow">
<td nowrap="nowrap">13F-NT</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/900203/000090266422002936/0000902664-22-002936-index.htm" id="documentsbutton">&nbsp;Documents</a></td>
<td class="small" >Quarterly report filed by institutional managers, Notice<br />Acc-no: 0000902664-22-002936&nbsp;(34 Act)&nbsp; Size: 3 KB            </td>
            <td>2022-05-13</td>
            <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=028-10418&amp;owner=include&amp;count=40">028-10418</a><br>22923545         </td>
         </tr>
```

Note that the second column for each result row, containing a link to the filing documents, may also contain a link to XBRL interactive data if it is available. An example of such a column:


```
<td nowrap="nowrap"><a href="/Archives/edgar/data/1703057/000156459022028784/0001564590-22-028784-index.htm" id="documentsbutton">&nbsp;Documents</a>&nbsp; <a href="/cgi-bin/viewer?action=view&amp;cik=1703057&amp;accession_number=0001564590-22-028784&amp;xbrl_type=v" id="interactiveDataBtn">&nbsp;Interactive Data</a></td>
```
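
The Documents link in these examples appears to encode the CIK and the accession number in its path. Below is a small sketch that pulls them apart; the helper name `split_documents_href` is hypothetical, and the path layout is an assumption based only on the sample rows shown here.

```python
from urllib.parse import urlsplit

def split_documents_href(href):
    """Split a "Documents" link into CIK, accession number and folder path."""
    parts = urlsplit(href).path.strip("/").split("/")
    # Assumed layout: Archives/edgar/data/<CIK>/<accession digits>/<accession>-index.htm
    if len(parts) != 6 or parts[:3] != ["Archives", "edgar", "data"]:
        raise ValueError("Unexpected Documents link layout: " + href)
    cik, filename = parts[3], parts[5]
    return {
        "cik": cik,
        "accession": filename.split("-index", 1)[0],
        # Link with the filename stripped, i.e. the filing's folder
        "folder": "/" + "/".join(parts[:5]) + "/",
    }

info = split_documents_href(
    "/Archives/edgar/data/900203/000090266422003932/0000902664-22-003932-index.htm"
)
print(info["cik"], info["accession"], info["folder"])
```

The function accepts either the relative href from the table or a full URL, since only the path component is inspected.
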

In addition, some results may not include a file number link (the 5th column in the example above shows one when it is available).

**Note on simple "company" search:**
As mentioned above in the outlining of possible query parameters, if the company, State/Country, and/or SIC fields are the only identifying parameters entered (basically no CIK or filenum), a list of companies is returned rather than a list of filings if there are multiple matches. This list of companies is similar in structure to the list of filings we receive if we enter a CIK or filenum. Example response structure [(url)](https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&filenum=&State=&Country=&SIC=2086&myowner=exclude&action=getcompany):


```
<table class="tableFile2" summary="Results">
         <tr>
            <th width="6%" scope="col"><acronym title="Central Index Key">CIK</acronym></th>
            <th width="79%" scope="col">Company</th>
            <th width="15%" scope="col">State/Country</th>
         </tr>
         <tr>
            <td valign="top" scope="row"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0001780692&amp;owner=include&amp;count=40&amp;hidefilings=0">0001780692</a></td>
            <td scope="row">Test &amp; Treat, Inc.</td>
            <td valign="top" scope="row"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;State=SC&amp;owner=include&amp;count=40&amp;hidefilings=0">SC</a></td>
         </tr>
         <tr class="blueRow">
            <td valign="top" scope="row"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0001684627&amp;owner=include&amp;count=40&amp;hidefilings=0">0001684627</a></td>
            <td scope="row">Test Anywhere Technology, Inc.</td>
            <td valign="top" scope="row"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;State=SC&amp;owner=include&amp;count=40&amp;hidefilings=0">SC</a></td>
         </tr>
```
For each result row, there are three columns: CIK, Company, and State/Country (code). The CIK cell links to a query for that CIK, which in turn returns a list of filings as described above. That link has count=40, so you may want to edit it before following it. Similarly, the State/Country cell links to a query by that state/country code with count=40, which returns another company list.
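
One way to edit such a link before following it is to rewrite its query string. This is a minimal sketch with a hypothetical helper, `with_count`; it assumes the href has already been entity-decoded (as BeautifulSoup returns it), so the separators are plain `&` rather than `&amp;`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_count(href, count=100, base_url="https://www.sec.gov"):
    """Return an absolute URL for a relative result link with "count" replaced."""
    parts = urlsplit(href)
    # keep_blank_values preserves parameters that are present but empty
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["count"] = str(count)
    return base_url + urlunsplit(("", "", parts.path, urlencode(query), ""))

link = "/cgi-bin/browse-edgar?action=getcompany&CIK=0001780692&owner=include&count=40&hidefilings=0"
print(with_count(link))
```

Since the link only carries a CIK or state code plus defaults, building a fresh query from the extracted value works just as well; rewriting is mainly handy when you want to keep whatever other parameters the page included.
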

One way to determine which type of result list has been returned is to look for a span element of class "companyMatch" or div element of class "noCompanyMatch" (in the case of no results). It will be right after the content divider (div id "contentDiv") if a company list has been returned:


```
<!-- BEGIN CONTENT -->
<div id="contentDiv">
   <span class="companyMatch">Companies with names matching "TEST"</span> <br />   <em>Click on <acronym title="Central Index Key">CIK</acronym> to view company filings</em> <br />
<span class="items">Items 1 - 40</span>   <div id="seriesDiv">
      <table class="tableFile2" summary="Results">
        ...
```
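
The check described above can be sketched as follows. To keep the snippet dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup; the names `ResultTypeParser` and `classify_response` are hypothetical, and treating every page without either marker as a filing list is an assumption.

```python
from html.parser import HTMLParser

class ResultTypeParser(HTMLParser):
    """Records whether the companyMatch / noCompanyMatch markers appear."""

    def __init__(self):
        super().__init__()
        self.company_match = False
        self.no_company_match = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "companyMatch" in classes:
            self.company_match = True
        elif tag == "div" and "noCompanyMatch" in classes:
            self.no_company_match = True

def classify_response(html_text):
    """Return "no_match", "company_list" or "filing_list" for a response body."""
    parser = ResultTypeParser()
    parser.feed(html_text)
    if parser.no_company_match:
        return "no_match"
    if parser.company_match:
        return "company_list"
    return "filing_list"  # assumption: no marker means a filing list

sample = '<div id="contentDiv"><span class="companyMatch">Companies with names matching "TEST"</span></div>'
print(classify_response(sample))
```

With BeautifulSoup the same two checks are `soup.find("span", class_="companyMatch")` and `soup.find("div", class_="noCompanyMatch")`.
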




# Company search **with** CIK/ticker, filenum, SIC code, or State/Country
Method for searching for filings by the CIK or filenum fields. Recall that tickers can be entered into the CIK field. SIC and State/Country can still be entered in such a query, but they seem to be simply overridden if a CIK or filenum is entered. Entering a company name can also interfere with results if it does not match that of the given CIK / filenum.
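
Because these interactions are easy to forget, a small pre-flight check on the parameter dictionary can help. The sketch below is hypothetical (`check_identifiers` is not part of the notebook), encodes only the behaviour observed in these notes, and guesses that a non-numeric CIK value is a ticker.

```python
def check_identifiers(params):
    """Return warnings about parameter combinations that seem to misbehave."""
    warnings = []
    cik = str(params.get("CIK") or "").strip()
    filenum = str(params.get("filenum") or "").strip()

    if cik or filenum:
        ignored = [key for key in ("SIC", "State", "Country") if params.get(key)]
        if ignored:
            warnings.append("likely ignored because CIK/filenum is set: " + ", ".join(ignored))
        if params.get("company"):
            warnings.append("company name may interfere if it does not match the CIK/filenum")
    if params.get("type") and not cik:
        warnings.append("type requires CIK to be set")
    # Assumption: a CIK value that is not all digits is a ticker
    if params.get("dateb") and not cik.isdigit():
        warnings.append("dateb requires a numeric CIK, not a ticker")
    return warnings

print(check_identifiers({"CIK": "0001703057", "SIC": "1000", "type": "4"}))
```

An empty list means none of the known pitfalls were detected; it does not guarantee the query will return what you expect.
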

In [None]:
""" 
Input:
Format of full_query_params: {
  "action" : "getcompany",
  "company" : "", # Leave blank under most circumstances if searching by CIK / filenum
  "match" : "", # Again leave alone most times, can be set to "contains" if desired
  "CIK" : "", # Required to be set for use of "type" and set to an actual CIK (not ticker) for "dateb" parameter
  "type" : "", # ^
  "dateb" : "", # ^, format YYYYMMDD
  "owner" : "include", # We want them all boii
  "start" : 0, # Will update as we loop through results and send more requests if needed
  "State" : "",
  "Country" : "",
  "filenum" : "", # Hyphen(s) included
  "SIC" : "",
  "count" : 100,
}

Output:
Returns a dictionary with one of two structures. The key "list_type", present in both, tells you which one you got.
A "list_type" value of 0 indicates a list of entities is being returned, and a "list_type" of 1 indicates a list of filings is being returned.
A value of 2 indicates both lists somehow got populated and neither is being returned, and 3 indicates no results for a list_of_filings style search.

Structure of list_of_entities: {
  "list_type" : 0,
  "entries" : [ {
    "cik" : "",
    "name" : "",
    "state" : ""
  } ]
}

Structure of list_of_filings: {
  "list_type" : 1,
  "entries" : [ {
    "type" : "",
    "file_date" : "", # Will be in format YYYY-MM-DD
    "filenum" : "",
    "filing_home" : "", # Will be in format ".../Archives/edgar/data/1703057/000114036122026730/0001140361-22-026730-index.htm". May want to strip filename before using it
    "xbrl_doc" : "" # Format ".../cgi-bin/viewer?action=view&cik=1703057&accession_number=0001564590-22-025922&xbrl_type=v" (BeautifulSoup decodes the &amp; entities in the href)
  } ]
}
"""
def list_results_company_search(full_query_params):

  # Grab local copy of query_params we can modify (namely "start")
  local_query_params = copy.deepcopy(full_query_params)

  # UA and base URLs
  request_headers = { "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36" }
  endpoint_url = "https://www.sec.gov/cgi-bin/browse-edgar"
  base_sec_url = "https://www.sec.gov" # Used to build paths to filing documents

  # Define these outside the loop. We will determine which one to populate and return
  # In case we return a list of entities
  list_of_entities = {
        "list_type" : 0,
        "entries" : []
  }
  # In case we return a list of filings 
  list_of_filings = {
        "list_type" : 1,
        "entries" : []
  }
  
  # Keep track of how many results we have read. As long as we keep reading 100, request the next 100 until none are left
  results_read_total = 0
  while True:

    # Send the GET
    response = requests.get(url = endpoint_url, headers = request_headers, params = local_query_params)
    response.raise_for_status()

    # Parse HTML. First check if class "noCompanyMatch" div element exists. Empty company list
    soup = BeautifulSoup(response.content, "html.parser")
    results_read_lastpass = 0 # Reset for request

    if soup.find("div", class_= "noCompanyMatch"):
      if (results_read_total and len(list_of_entities["entries"])): # If we overstep the number of results in count, we could land here. Shouldn't happen unless there is a multiple of 100 number of results. 
        return list_of_entities # But if it does, return the entity list we had built up until then

      # Otherwise it was just an invalid search
      return { "list_type" : 0, "entries": [] }
    
    # Check for span element of class "companyMatch". We have a list of companies on our hands, not filings. Parse accordingly
    if soup.find("span", class_= "companyMatch"):

      # There should only be one table of summary type "Results", regardless of what type of list we have received back
      results_table = soup.find("table", attrs = { "summary" : "Results" })

      # Loop through rows, ignore those without columns
      for current_row in results_table.find_all("tr"):
        columns = current_row.find_all("td")
        if len(columns):

          # We are not going to store the links given with the CIK and State/Country, just because we know how to build those queries with just the values if we need to
          entity_info = {}
          entity_info["cik"] = columns[0].text.strip()
          entity_info["name"] = columns[1].text.strip()
          entity_info["state"] = columns[2].text.strip()

          # Append the entry to our return dictionary and update our tally
          list_of_entities["entries"].append(entity_info)
          results_read_lastpass += 1
    
    # Otherwise we have a list of filings
    else:

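      results_table = soup.find("table", attrs =  { "summary" : "Results" })
      if results_table is None: # No results table on the page (e.g. nothing matched the CIK / filenum), so there is nothing to parse
        break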

      # Loop through rows. Ignore those without columns
      for current_row in results_table.find_all("tr"):
        columns = current_row.find_all("td")
        if len(columns):

          filing_info = {}
          filing_info["type"] = columns[0].text.strip()
          filing_info["file_date"] = columns[3].text.strip()

          # A little extra work to extract the filenum, as film number is also given in the same column. The file number should be the only text belonging to the <a href=...> tag 
          # TODO: Grab the film number just incase we decide we have a use for it.
          filenum_link = columns[4].find("a", { "href" : True })
          if (filenum_link): # There is not always a file number, so check before stripping the text
            filing_info["filenum"] = filenum_link.text.strip()
          else:
            filing_info["filenum"] = ""

          # Need to build full paths using base_sec_url
          current_filing_home = columns[1].find("a", { "href" : True, "id" : "documentsbutton"})
          current_xbrl_doc = columns[1].find("a", { "href" : True, "id" : "interactiveDataBtn"})

          if current_filing_home:
            filing_info["filing_home"] = base_sec_url + current_filing_home["href"]
          else:
            filing_info["filing_home"] = ""

          if current_xbrl_doc:
            filing_info["xbrl_doc"] = base_sec_url + current_xbrl_doc["href"]
          else:
            filing_info["xbrl_doc"] = ""

          # Append to return dictionary and update tally
          list_of_filings["entries"].append(filing_info)
          results_read_lastpass += 1
    
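    # Check how many results we read
    results_read_total += results_read_lastpass
    page_size = min(int(local_query_params.get("count") or 40), 100) # Endpoint default is 40 and max is 100
    if (results_read_lastpass < page_size): # We don't need to request another page if we didn't read a full one this time.
      break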
    
    # Update the "start" parameter accordingly
    local_query_params["start"] = results_read_total

  # Figure out what to return
  # First possibility is we have somehow populated both dictionaries. Tell the user and return nothing. Can modify to print both out if so choose
  if (len(list_of_entities["entries"]) and len(list_of_filings["entries"])):
    print("Lists of both entities and filings have been populated. Check query parameters and script logic.")
    return { "list_type" : 2, "entries": [] } # Use type 2 to indicate both got populated and neither being returned

  elif len(list_of_entities["entries"]):
    return list_of_entities
  
  elif len(list_of_filings["entries"]):
    return list_of_filings

  # Neither list was populated. Use type 3 for no results (presumably a list_of_filings style search, since it didn't trigger noCompanyMatch)
  return { "list_type" : 3, "entries": [] }

In [None]:
# Calling above method

full_query_params = {
  "action" : "getcompany",
  "company" : "", 
  "match" : "", 
  "CIK" : "", # Required to be set for use of "type" and set to CIK (not ticker) for "dateb" parameter
  "type" : "4", # ^
  "dateb" : "", # ^, format YYYYMMDD
  "owner" : "include", 
  "start" : 0, 
  "State" : "",
  "Country" : "",
  "filenum" : "", 
  "SIC" : "1000",
  "count" : 100,
}

result_dict = list_results_company_search(full_query_params)
print("Number of results returned: {}".format(len(result_dict["entries"])))
input()

if len(result_dict["entries"]):
  if result_dict["list_type"] == 1:
    print("List of filings:")
    for current_result in result_dict["entries"]:
      print(current_result["filing_home"])
  elif result_dict["list_type"] == 0:
    print("List of entities:")
    for current_result in result_dict["entries"]:
      print(current_result["name"])

Number of results returned: 579

List of entities:
37 CAPITAL INC
ABACUS MINERALS CORP
ACCORD VENTURES INC
ACREX VENTURES LTD
ADAMANT DRI PROCESSING & MINERALS GROUP
ADASTRA MINERALS INC
Advanced Mineral Technologies, Inc
ALASKA GOLD CORP.
Alaska Pacific Resources Inc
ALBERTA STAR DEVELOPMENT CORP
Alderon Iron Ore Corp.
ALMADEN MINERALS LTD
AMCA RES0URCES, INC.
AMERICAN BATTERY TECHNOLOGY Co
American Bonanza Gold Corp.
AMERICAN BULLION MINERALS LTD
AMERICAN CONSOLIDATED MANAGEMENT GROUP INC
AMERICAN EAGLE ENERGY Corp
AMERICAN GEM CORP
AMERICAN LITHIUM MINERALS, INC.
AMERICAS ENERGY Co - AECO
Americas Gold & Silver Corp
Ameritrust Corp
AMERTHAI MINERALS INC.
Amogear Inc.
ANGLO-CANADIAN URANIUM CORP
Angstrom Microsystems Corp.
APEX MINERALS CORP
Apotheca Biosciences, Inc.
AQUEST MINERALS CORP
ARCHANGEL DIAMOND CORP
Arctos Petroleum Corp.
ARGOSY MINERALS INC
ARRAS MINERALS CORP.
Artisan Consumer Goods, Inc.
Asia Atlantic Resources
ASIAN DRAGON GROUP INC.
Atlantic Gold Corp
ATLAS CONSOLIDA