In [None]:
# HTTP lib
import requests

# Date objects
from datetime import date

# JSON
import json

# "EDGAR Full Text Search"
"New versatile tool lets you search for keywords and phrases in over 20 years of EDGAR filings, and filter by date, company, person, filing category or location." (https://www.sec.gov/edgar/searchedgar/companysearch -> https://www.sec.gov/edgar/search/). FAQ [here](https://www.sec.gov/edgar/search/efts-faq.html).

You can query for text/strings in forms filtered by form type, date range, location of principal office of the filing entity and name/CIK (both use the same parameter) of the filing entity/person. 

**Note:** Two useful search parameters are not offered by [edgar/search/](https://): file number and SIC number. Both are displayed in the results table of a given query, but searching by file number requires a different endpoint, such as the "company search" endpoint https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&filenum=XXX-XXXX. **See the corresponding notebook company_search_endpoint**. (It is also possible to search by file number via the company database search endpoint.)
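As a minimal sketch of the file number lookup mentioned above, the company search URL can be assembled with the standard library. The helper name `company_search_url_for_file_number` is hypothetical, and the file number used is the one from the sample response later in this notebook; no request is sent here.

```python
from urllib.parse import urlencode

# Base of the "company search" endpoint quoted in the note above
COMPANY_SEARCH_URL = "https://www.sec.gov/cgi-bin/browse-edgar"

def company_search_url_for_file_number(file_num):
    """Build a company search URL filtered by file number (hypothetical helper)."""
    query = urlencode({"action": "getcompany", "filenum": file_num})
    return f"{COMPANY_SEARCH_URL}?{query}"

url = company_search_url_for_file_number("001-36426")
print(url)
```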

# Parameters of the Query
The base url of the endpoint is https://efts.sec.gov/LATEST/search-index.
We send a POST request with the following parameters encoded as a JSON string; see https://linuxpip.org/post-python-requests/, section "Send POST request with JSON data as string", for an example.

A list of parameters available (as discovered so far):
```
- q=TARGET_WORDS : list of words, separated by spaces, to search for in filing documents. Use quotes to match an exact multi-word string. See https://www.sec.gov/edgar/searchedgar/search_help.htm 
- entityName=NAME_OR_CIK_OF_FILING_ENTITY : can either be the CIK of the filing entity, or the beginning (or full) name of the filing entity. 
- forms=LIST_OF_TARGET_FORM_TYPES : list of form types to query for. See fulltext_forms_list.txt for full list (as of 9/14/22).
- dateRange=SIMPLE_FORMAT_DATE_RANGE : Note: If not set, default is 5 years. There are a few possible values here:
  1. all : query filings from as far back as possible (site states since 2001)
  2. 10y,5y,1y,30d : pretty self explanatory. Strange that you cannot enter custom versions of these such as 2y etc but oh well
  3. custom : custom time-frame. The endpoint will expect the following 2 parameters in this case:
    - startdt=YYYY-MM-DD
    - enddt=YYYY-MM-DD
- locationCode=STATE_OR_COUNTRY_CODE : Seems overruled by locationCodes. Makes no difference when including in manual query
- locationCodes=LIST_OF_STATE_OR_COUNTRY_CODES : List of either 2 letter state abbreviation or country code, see edgar_state_codes.txt
- from=RESULT_INDEX : return the next 100 results starting from RESULT_INDEX
- page=PAGE_NUM : does not seem to matter since we are not viewing in a browser, "from" seems to be the parameter controlling the results returned
- category=custom : not sure what it does, page seems to function the same without including it
```
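To make the parameter list concrete, here is a small sketch that assembles a request body for a custom date range. The helper `build_query_payload` and the example search terms are hypothetical, and the assumption that the endpoint accepts `forms` and `locationCodes` as JSON lists is based only on the notes above. Nothing is sent over the network.

```python
import json
from datetime import date

def build_query_payload(q="", entity_name="", forms=(), date_range="all",
                        start=None, end=None, location_codes=(), offset=0):
    """Assemble the JSON body for a full text search request (hypothetical helper)."""
    payload = {
        "q": q,
        "entityName": entity_name,
        "forms": list(forms),
        "dateRange": date_range,
        "locationCodes": list(location_codes),
        "from": offset,
    }
    # startdt/enddt are only expected when dateRange is "custom"
    if date_range == "custom":
        if not (isinstance(start, date) and isinstance(end, date)):
            raise ValueError("custom date range needs start and end dates")
        payload["startdt"] = start.strftime("%Y-%m-%d")
        payload["enddt"] = end.strftime("%Y-%m-%d")
    return payload

payload = build_query_payload(
    q='"gene editing" salmon',  # quoted phrase plus a single keyword
    forms=["10-K", "8-K"],
    date_range="custom",
    start=date(2016, 1, 1),
    end=date(2016, 12, 31),
)
body = json.dumps(payload)  # the JSON string that would be POSTed
print(body)
```

Keeping the payload builder separate from the HTTP call makes it easy to inspect exactly what would be sent before querying the server.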









# Response
If the query is successful, the server sends back a JSON structure of results in the following format:


```
{
   "took":422,
   "timed_out":false,
   "_shards":{
      "total":50,
      "successful":50,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":{
         "value":6308,
         "relation":"eq"
      },
      "max_score":18.2674,
      "hits":[
         {
            "_index":"edgar_file",
            "_type":"_doc",
            "_id":"0001193125-16-760799:d275549dex1012.htm",
            "_score":18.2674,
            "_source":{
               "ciks":[
                  "0001603978"
               ],
               "period_ending":null,
               "root_form":"10-12B",
               "file_num":[
                  "001-36426"
               ],
               "display_names":[
                  "AquaBounty Technologies, Inc.  (AQB)  (CIK 0001603978)"
               ],
               "xsl":null,
               "sequence":"16",
               "file_date":"2016-11-07",
               "biz_states":[
                  "MA"
               ],
               "sics":[
                  "0900"
               ],
               "form":"10-12B",
               "adsh":"0001193125-16-760799",
               "film_num":[
                  "161976497"
               ],
               "biz_locations":[
                  "Maynard, MA"
               ],
               "file_type":"EX-10.12",
               "file_description":"EX-10.12",
               "inc_states":[
                  "DE"
               ],
               "items":[
                  
               ]
            }
         },
         {
           ... one per hit...
         },
         ...
      ]
   },
   "aggregations":{
     ...
   },
   "query":{
     ...
   }
}
```

The "hits" dictionary contains two important pieces of information. Its "total" dictionary gives the total number of results, and its "hits" sublist holds one dictionary per result, for at most 100 results per response. We use these two facts to loop through the results when there are more than 100. Each result dictionary contains the specific document filename in "_id", along with another dictionary, "_source". The "_source" dictionary contains the accession number and the filing CIKs (there may be several, in the case of a Form 4 for example). The filing should be accessible via any of these CIKs under the same accession number.
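The two ideas in this paragraph can be sketched without contacting the server. The helpers `page_offsets` and `split_hit` are hypothetical names; the page size of 100 is the value observed above, and the sample hit is trimmed from the example response. The endpoint may cap how far `from` can reach, which this sketch does not account for.

```python
def page_offsets(total_hits, page_size=100):
    """Return the "from" values needed to walk through total_hits results."""
    return list(range(0, total_hits, page_size))

def split_hit(hit):
    """Pull accession number, filename, and CIKs out of one entry of hits->hits."""
    # "_id" has the form "ACCESSION:FILENAME"
    accession, _, filename = hit["_id"].partition(":")
    source = hit.get("_source", {})
    return {
        "accession": source.get("adsh", accession),
        "filename": filename,
        "ciks": source.get("ciks", []),
    }

sample_hit = {
    "_id": "0001193125-16-760799:d275549dex1012.htm",
    "_source": {"ciks": ["0001603978"], "adsh": "0001193125-16-760799"},
}

offsets = page_offsets(6308)  # total taken from the example response
parsed = split_hit(sample_hit)
print(len(offsets), offsets[:3], offsets[-1])
print(parsed)
```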

The aggregations dictionary may also be of interest, as it contains statistics about the different entities, form types, etc present in the list of results. 

"query" contains information about the query that was made.

In [None]:
"""
Returns a list of result_dict structures (format declared below), one per search result.
# At least one of a query string, entity name / CIK, or form type(s) must be entered for a successful query. The date range defaults to "all" (if left blank, the endpoint would default to 5 years)
# Note that forms are expected in proper list format

Format of result_dict: {
  "filename" : DOCUMENT_FILENAME, # Not full path!
  "cik" : [] LIST_OF_CIK_NUM(S),
  "accession" : ACCESSION_NUM,
  "file_num" : [] LIST_OF_FILING_NUMBER(S),
  "film_num" : [] LIST_OF_FILM_NUMBER(S),
  "display_name" : [] LIST_OF_ENTITY_NAME(S),
  "sic" : [] LIST_OF_SICS_OF_FILER(S),
  "form" : ROOT_FORM_TYPE,
  "period_ending" : PERIOD_ENDED, # Period and file date are both given in the same YYYY-MM-DD format
  "file_date" : FILING_DATE,
  "file_desc" : FILE_DESCRIPTION,
  "biz_city" : [] LIST_OF_CITY/CITIES_OF_FILER(S),
  "biz_state" : [] LIST_OF_STATE(S)_OR_COUNTRY_CODE(S)_OF_FILERS, # See edgar_state_codes.txt
  "inc_in" : [] LIST_OF_STATE(S)_OR_COUNTRY(S)_OF_INCORP # See edgar_state_codes.txt
}

Note: the cik, display_name, biz_city, and inc_in lists seem to be returned in sync/order, and if a value is missing for one filer, an empty list entry is present.
 The other elements will not create empty entries if there is no value for a filer, so don't rely on them being the same "depth" as the lists just mentioned.
"""

def list_results_full_text_query(query_string = "", entity_name = "", target_forms_list = (), date_range = "all", location_code = "", target_category = "", start_dt = "", end_dt = ""):

  # List we will return
  results_list = []

  # Headers
  request_headers = { 
      "host" : "efts.sec.gov",
      "accept-language" : "en-US,en;q=0.9",
      "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
      "origin" : "https://www.sec.gov", # Not sure how necessary these are, kept just to be safe
      "referer" : "https://www.sec.gov/",
      "sec-fetch-dest" : "empty",
      "sec-fetch-mode" : "cors",
      "sec-fetch-site" : "same-site"
      }

  # Copy parameters into dictionary
  query_data = { 
      "q" : query_string,
      "entityName" : entity_name,
      "forms" : target_forms_list,
      "dateRange" : date_range,
      "startdt" : "",
      "enddt" : "",
      "locationCode" : location_code,
      "from" : 0,
      "category" : target_category
  }
  # Copy start and end date if applicable
  if date_range == "custom":
    if not (isinstance(start_dt, date) and isinstance(end_dt, date)):
      print("Invalid start/end date structure entered. Must be a date object")
      return results_list

    # Format both dates as YYYY-MM-DD
    query_data["startdt"] = start_dt.strftime("%Y-%m-%d")
    query_data["enddt"] = end_dt.strftime("%Y-%m-%d")

  # Send request 
  response = requests.post(url = "https://efts.sec.gov/LATEST/search-index/", headers = request_headers, json = query_data) # Notice difference when sending JSON string instead of data
  response.raise_for_status()

  # Server also responds with JSON as a string, read it into dictionary
  response_dict = json.loads(response.content)

  # Only 100 results are given back at a time. There is a count of total results in hits->total->value. Record it so we can send multiple requests if needed
  hits_total = 0
  hits_parsed = 0 
  if "value" in response_dict["hits"]["total"]:
    hits_total = response_dict["hits"]["total"]["value"]
  else:
    print("Failed to read number of hits. Results returned may be limited")

  # Loop will break when we have read all results
  while True:

    # Loop through hits->hits list
    try:
      for hit_index, current_hit in enumerate(response_dict["hits"]["hits"]):

        # Create dictionary for the result
        result_dict = {
            "filename" : "",
            "cik" : [],
            "accession" : "",
            "file_num" : [],
            "film_num" : [],
            "display_name" : [],
            "sic" : [],
            "form" : "",
            "period_ending" : "",
            "file_date" : "",
            "file_desc" : "",
            "biz_city" : [],
            "biz_state" : [], 
            "inc_in" : []
        }
        
        # Grab the one thing outside the _source dictionary, the filename. Accession number is prepended, i.e.: "0001127602-12-017271:scheduleto-goldreserveinc711.htm". Split by colon and disregard first word
        id_entry_split = current_hit["_id"].split(":")
        if len(id_entry_split) != 2: 
          print("Result filename is in unexpected format (inspect endpoint for any changes), skipping.")
          continue
        
        result_dict["filename"] = id_entry_split[1]

        # Now grab info from _source dictionary
        source_dict = current_hit["_source"]

        # Check the critical keys exist in order for us to at least build a link
        if (len(source_dict) and "ciks" in source_dict and "adsh" in source_dict):
          
          # In case we attempt to access a missing key
          try:

            # Go through result_dict in the order we declared it

            # cik
            for current_cik in source_dict["ciks"]:
              result_dict["cik"].append(current_cik.zfill(10))
            
            # accession
            result_dict["accession"] = source_dict["adsh"]

            # file_num
            for current_filenum in source_dict["file_num"]:
              result_dict["file_num"].append(current_filenum)
            
            # film_num
            for current_filmnum in source_dict["film_num"]:
              result_dict["film_num"].append(current_filmnum)

            # display_name. Comes in this format: "AbCellera Biologics Inc.  (ABCL)  (CIK 0001703057)". Split it by double spaces ("  ") and disregard the [1] and [2] elements. 
            # Can always change if we decide we want to scrape tickers from here
            for current_dn in source_dict["display_names"]:
              result_dict["display_name"].append(current_dn.split("  ")[0])
            
            # sic
            for current_sic in source_dict["sics"]:
              result_dict["sic"].append(current_sic)

            # form
            result_dict["form"] = source_dict["root_form"]
            if not result_dict["form"]: # Also covers a null root_form
              result_dict["form"] = source_dict["form"] # Might as well try the other option as backup. Haven't come across either not being set in a result, but hey.

            # period_ending
            result_dict["period_ending"] = source_dict["period_ending"]

            # file_date
            result_dict["file_date"] = source_dict["file_date"]

            # file_desc
            result_dict["file_desc"] = source_dict["file_description"]
            
            # biz_city
            for current_city in source_dict["biz_locations"]:
              result_dict["biz_city"].append(current_city)
            
            # biz_state
            for current_state in source_dict["biz_states"]:
              result_dict["biz_state"].append(current_state)
            
            # inc_in
            for current_inc_state in source_dict["inc_states"]:
              result_dict["inc_in"].append(current_inc_state)

          except Exception: # Avoid a bare except, which would also swallow KeyboardInterrupt
            print("Failed to read a non-critical result attribute, information may be missing.")

          # Append the result_dict to the list we will return
          results_list.append(result_dict)

        # Critical key (or _source dictionary) was missing 
        else:
          print("Failed to read result _source dictionary or other critical JSON key, skipping.")
          continue
      
    except Exception: # Response is missing hits->hits or is otherwise malformed, stop here
      break

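    # Tally total number parsed so far. Count the batch directly so an empty batch (e.g. zero results) cannot leave the counter undefined
    batch_size = len(response_dict["hits"]["hits"])
    hits_parsed += batch_size

    # If we have hit the end, or the server returned nothing, stop
    if batch_size == 0 or hits_parsed >= hits_total:
      break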
    
    # Otherwise, update "from" query parameter for next batch
    query_data["from"] = hits_parsed

    # Send another request
    response = requests.post(url = "https://efts.sec.gov/LATEST/search-index/", headers = request_headers, json = query_data) 
    response.raise_for_status()

    # Read into the response_dict dictionary and restart loop
    response_dict = json.loads(response.content)

  return results_list

In [None]:
# Support method to build filing document URL from dictionary returned by full text query
def url_from_full_text_dict(full_text_result_dict):

  # Remove hyphens from accession number
  clean_accession = full_text_result_dict["accession"].replace("-","")

  # There may be multiple CIKs. The file should be available under each one using the same accession number and filename. Loop through and use the first non-empty value. Not worth sending another request here.
  for current_cik in full_text_result_dict["cik"]:

    # Sanity check
    if len(current_cik) > 0:
      # Build and return the URL
      return "https://www.sec.gov/Archives/edgar/data/" + current_cik + "/" + clean_accession + "/" + full_text_result_dict["filename"]

  return ""

In [None]:

results = list_results_full_text_query(query_string = "ABCELLERA")

for hit in results:
  print(url_from_full_text_dict(hit))

print("Num of results: {}".format(len(results)))