**Table of contents**<a id='toc0_'></a>    
- [Libraries](#toc1_)    
- [Loading text filings](#toc2_)    
- [Extracting the Management's Discussion and Analysis Section](#toc3_)    


# <a id='toc1_'></a>[Libraries](#toc0_)

In [15]:
import os
import time
os.chdir(os.environ['PROJECT_PATH'])  # raises a clear KeyError if PROJECT_PATH is unset
from secnlp.ml_logic import data as d
from secnlp.ml_logic import parsing as p
import secnlp.ml_logic.parsing
from secnlp import utils as u
from secnlp.params import *
import pandas as pd
import importlib

# <a id='toc2_'></a>[Loading text filings](#toc0_)

In [2]:
df = u.read_data_from_bq(credentials = SERVICE_ACCOUNT, gcp_project = PROJECT, bq_dataset = DATASET_ID, table = FILINGS_10KQ_TABLE_ID)

In [3]:
df['date_filed'] = pd.to_datetime(df['date_filed'])

In [4]:
filing_sample_10k = df[(df['date_filed'].dt.year == 2023) & (df['form_type'] == '10-K')].sample(1)
filing_sample_10k['raw_filing'] = filing_sample_10k['file_name'].apply(lambda url: d.fetch_text_from_url(url, agent = AGENT))


In [5]:
filing_sample_10q = df[(df['date_filed'].dt.year == 2023) & (df['form_type'] == '10-Q')].sample(1)
filing_sample_10q['raw_filing'] = filing_sample_10q['file_name'].apply(lambda url: d.fetch_text_from_url(url, agent = AGENT))


# <a id='toc3_'></a>[Extracting the Management's Discussion and Analysis Section](#toc0_)

In [87]:
text = filing_sample_10k['raw_filing'].iloc[0]

In [99]:
importlib.reload(secnlp.ml_logic.parsing)
print(p.parse_10k_10q_filing_item7(text, item = '1a'))

Item 1A. Risk Factors&#x201d; of this Annual Report on Form 10-K and under the heading &#x201c;Summary of Risk Factors&#x201d; below. As a result of these and other factors, we may not actually achieve the plans, intentions, expectations or results disclosed in our forward-looking statements, and you should not place undue reliance on our forward-looking statements. Our forward-looking statements do not reflect the potential impact of any future acquisitions, mergers, dispositions, joint ventures or investments we may make. We do not assume any obligation to update any forward-looking statements, whether as a result of new information, future events or otherwise, except as required by law.</span><span style="color:rgba(0,0,0,1);white-space:pre-wrap;font-weight:normal;font-size:10.0pt;font-family:&quot;Times New Roman&quot;, serif;font-style:italic;min-width:fit-content;"> </span></p>
  <p style="text-indent:4.533%;font-size:10.0pt;margin-top:6.0pt;font-family:Times New Roman;margin-bot

Write a function that deletes all of the original markup (HTML, XBRL, XML):
- [ ] Remove ASCII-encoded segments – All document segments whose `<TYPE>` tag is GRAPHIC, ZIP, EXCEL, JSON, or PDF are deleted from the file. ASCII encoding converts binary files into standard ASCII characters to facilitate transfer across hardware platforms. Even a relatively small graphic can create a substantial ASCII segment, so filings containing multiple graphics can be orders of magnitude larger than those containing only text.
- [ ] Remove `<DIV>`, `<TR>`, `<TD>`, and `<FONT>` tags – Although some HTML information is required for subsequent parsing, the files are so large (and processed as a single string) that, for processing efficiency, some of the formatting HTML is stripped out first.
- [ ] Remove all XML – All embedded XML documents are removed.
- [ ] Remove all XBRL – All characters between `<XBRL …>` and `</XBRL>` are deleted.
- [ ] Remove SEC header/footer – All characters from the beginning of the original file through `</SEC-HEADER>` (or `</IMS-HEADER>` in some older documents) are deleted from the file. Note, however, that the header information is retained and included in the tagged items discussed in section 4.1. In addition, the footer “-----END PRIVACY-ENHANCED MESSAGE-----” appearing at the end of each document is deleted.
- [ ] Replace `&NBSP` and `&#160` with a blank space.
- [ ] Replace `&AMP` and `&#38` with “&”.
- [ ] Remove all remaining extended character references (ISO-8859-1, see https://www.sec.gov/info/edgar/specifications/edgarfm-vol2-v59.pdf, section "5.2.2.6 Extended Character Sets within HTML Documents").
- [ ] Tag exhibits – At this point in the parsing process all exhibits are tagged as discussed in section 3.2.
- [ ] Remove markup tags – Remove all remaining markup tags (i.e., `<…>`).
- [ ] Remove excess linefeeds.
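
The checklist above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the project's implementation: `strip_markup`, `BINARY_TYPES`, and `_replace_entity` are hypothetical names, the exhibit-tagging step is skipped because it depends on a section not shown here, and tags are replaced with a space (rather than nothing) so that words in adjacent cells or paragraphs do not fuse together.

```python
import re

# Document types whose segments hold ASCII-encoded binary content (from the checklist).
BINARY_TYPES = ("GRAPHIC", "ZIP", "EXCEL", "JSON", "PDF")
FLAGS = re.DOTALL | re.IGNORECASE


def _replace_entity(match):
    """Map a character reference to its replacement, per the checklist."""
    name = match.group(1).lower()
    if name in ("nbsp", "#160"):
        return " "
    if name in ("amp", "#38"):
        return "&"
    return ""  # all remaining character references are removed


def strip_markup(raw):
    """Return the filing text with header, binary segments, and markup removed."""
    # SEC header (or IMS header in older filings) and the trailing footer.
    text = re.sub(r"\A.*?</(?:SEC|IMS)-HEADER>", "", raw, flags=FLAGS)
    text = text.replace("-----END PRIVACY-ENHANCED MESSAGE-----", "")

    # ASCII-encoded segments: whole <DOCUMENT> blocks of a binary type.
    binary = "|".join(BINARY_TYPES)
    text = re.sub(rf"<DOCUMENT>\s*<TYPE>(?:{binary})\b.*?</DOCUMENT>", "", text, flags=FLAGS)

    # Embedded XBRL and XML blocks, including everything between the tags.
    for tag in ("XBRL", "XML"):
        text = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}>", "", text, flags=FLAGS)

    # Bulk formatting tags first, as the checklist suggests for efficiency.
    text = re.sub(r"</?(?:DIV|TR|TD|FONT)\b[^>]*>", " ", text, flags=re.IGNORECASE)

    # Character references: &nbsp;/&#160; -> space, &amp;/&#38; -> "&", others dropped.
    text = re.sub(r"&(#?[A-Za-z0-9]+);", _replace_entity, text)

    # Any remaining markup tag.
    text = re.sub(r"<[^>]+>", " ", text)

    # Collapse runs of spaces/tabs, then any whitespace run containing a linefeed.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()
```

Calling `strip_markup(text)` on the raw filing loaded earlier would return plain text with one linefeed between lines. Note that the entity pattern only matches references terminated by a semicolon, so an unterminated ampersand such as the one in "R&D" is left alone.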

In [94]:

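import re

# Raw strings keep the regex escapes intact and avoid invalid-escape warnings.
# "Item 1" must end at a word boundary, so "Item 1A" or "Item 10" is not a start.
item_start = re.compile(r"item\s*1[.;:\-_]*\s*\b", re.IGNORECASE)
# The section ends at Item 1A/1B/1C (Risk, Unresolved, Cyber) or at Item 2 (Properties).
item_end = re.compile(r"item\s*1[abc][.,;:\-_]\s*(?:Risk|Unresolved|Cyber)|item\s*2[.,;:\-_]\s*Properties", re.IGNORECASE)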

# Find all start and end positions using finditer
starts = [i.start() for i in item_start.finditer(text)]
ends = [i.start() for i in item_end.finditer(text)]

In [93]:
starts

[222333, 293493, 862760, 2087427]
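
The start pattern matches four times, so at least some of those hits are presumably something other than the section heading itself, such as a table-of-contents entry or an in-text cross-reference. One possible heuristic, sketched here as an assumption rather than as what `parse_10k_10q_filing_item7` does, is to pair each start with the nearest end that follows it and keep the longest resulting span. `longest_span` is a hypothetical helper.

```python
from bisect import bisect_right


def longest_span(starts, ends):
    """Pair each start offset with the first end offset after it and
    return the (start, end) pair covering the most characters, or None."""
    ends = sorted(ends)
    best = None
    for start in starts:
        i = bisect_right(ends, start)  # index of the first end strictly after start
        if i == len(ends):
            continue  # no end marker follows this start
        end = ends[i]
        if best is None or end - start > best[1] - best[0]:
            best = (start, end)
    return best
```

With the lists built above, `span = longest_span(starts, ends)` would give the candidate offsets, and `text[span[0]:span[1]]` the section body when `span` is not `None`. The longest-span rule assumes the real section is longer than any gap between a cross-reference and the next end marker, which may not hold for every filing.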