# Section 4.1 Risk Factors Subject Corpus

This section builds a corpus of risk factor subjects from the Item 1A (Risk Factors) sections of the downloaded filings.

In [1]:
import pandas as pd
import numpy as np
import requests
import os
import re
import html
import warnings
warnings.filterwarnings("ignore")

## 4.1.1 Download the .txt files from the SEC website

In Section 1 we generated the URL of each financial report in HTML format. We now use those URLs to download a larger set of 3,000 files and build the corpus.

In [2]:
_22FY10KFile_df = pd.read_csv("../Section 1 - Datasets Downloads and Preparation/22FY10KFile.csv")

In [3]:
# Download the .txt file at `url` and save it to `downloadPath`.
def download_file(url, downloadPath, headerEmail):
    # identify the requester through the User-Agent header
    headers = {"User-Agent": headerEmail}
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        fileContent = response.text
        with open(downloadPath, 'w', encoding='utf-8') as file:
            file.write(fileContent)
        print("Downloaded successfully.")
    else:
        print("Failed to download:", url, "status code", response.status_code)

In [None]:
headerEmail = "yuchaoba@usc.edu"
for i in range(0,3000):
    url = _22FY10KFile_df["filing_url"].iloc[i]
    folderPath = "Downloaded Corpus/"
    fileName = str(_22FY10KFile_df["adsh"].iloc[i]) + ".txt"
    downloadPath = folderPath + fileName
    download_file(url, downloadPath, headerEmail)
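
A 3,000-file download can be interrupted partway through. The sketch below shows one way to resume without fetching files that are already on disk. The helper name `pending_downloads` and its `(adsh, url)` input shape are assumptions made for illustration, not part of the original notebook.

```python
import os


def pending_downloads(records, folder):
    """Yield (url, path) for every filing whose .txt file is not yet in `folder`.

    `records` is an iterable of (adsh, url) pairs (hypothetical input shape).
    """
    for adsh, url in records:
        path = os.path.join(folder, f"{adsh}.txt")
        if not os.path.exists(path):
            yield url, path
```

With the DataFrame above, the pairs could come from `zip(_22FY10KFile_df["adsh"], _22FY10KFile_df["filing_url"])`, and each yielded pair could be passed to `download_file`, with a short `time.sleep` between requests to keep the request rate modest.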

## 4.1.2 Extract the risk factors
### 1. Extract the risk factors raw text

The full set of files is too large to process in one pass, so we divide it into three parts of 1,000 files each and process each part separately. The steps are similar to those in Section 2.
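
The split into parts can be sketched with a small generator. `chunked` is a hypothetical helper and the file names are placeholders; the notebook itself uses explicit slices such as `[:1000]`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# placeholder file names, sorted so that every run produces the same parts
fileName_lst = sorted(f"{i:04d}.txt" for i in range(2500))
parts = list(chunked(fileName_lst, 1000))
print([len(part) for part in parts])
```

Sorting before slicing matters because `os.listdir` returns names in arbitrary order, so unsorted slices are not guaranteed to cover each file exactly once across separate calls.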

1. Part 1:

In [2]:
folderPath = "Downloaded Corpus/"
fileName_lst = sorted(os.listdir(folderPath))  # os.listdir order is arbitrary; sort so the slices are stable
filePair_dict = {}  # maps each file name to the text of that file
for fileName in fileName_lst[:1000]:
    # read each document
    filePath = os.path.join(folderPath, fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText


In [7]:
def remove_tags(text):
    # replace every tag enclosed in <> with a | delimiter
    pattern1 = r'<[^>]*>'
    cleanText = re.sub(pattern1, "|", text)

    # collapse runs of | (and the whitespace around them) into a single |
    cleanText = re.sub(r'\s*\|+\s*', '|', cleanText)
    cleanText = re.sub(r'\|+', '|', cleanText)

    # decode HTML entities such as &amp; and &nbsp;
    cleanText = html.unescape(cleanText)

    # remove line breaks
    cleanText = cleanText.replace('\n', '')
    # remove segments between | that contain only digits
    cleanText = re.sub(r'\|(\d+)\|', '|', cleanText)
    # remove non-breaking spaces; one that is followed by | is removed together with that |
    cleanText = re.sub(r'\xa0\|', '', cleanText)
    cleanText = cleanText.replace('\xa0', '')
    return cleanText
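
To show what the tag-to-pipe replacement produces, here is a minimal sketch on a made-up fragment. `tags_to_pipes` and `sample` are hypothetical and mirror only the first steps of `remove_tags`.

```python
import html
import re


def tags_to_pipes(fragment):
    """Replace each HTML tag with '|' and collapse runs of pipes (illustrative helper)."""
    text = re.sub(r'<[^>]*>', '|', fragment)
    text = re.sub(r'\s*\|+\s*', '|', text)  # also drops whitespace next to a pipe
    text = re.sub(r'\|+', '|', text)        # second pass merges pipes left adjacent by the first
    return html.unescape(text)


# a heading whose two halves sit in separate <b> tags
sample = '<p><b>Item 1A.</b> <b>Risk Factors</b></p>'
print(tags_to_pipes(sample))  # |Item 1A.|Risk Factors|
```

In this fragment the heading is split across two tags, so a pattern that expects `item 1a.` and `risk factors` inside a single `>...<` pair would not match the raw HTML. The pipe-delimited text keeps the split visible as a `|`.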

In [8]:
def separate_main_body(text):
    # Headings that usually appear as the last entries of the table of contents.
    # The main body is taken to start at the first occurrence of one of them.
    lastContentInTC_lst = [">form 10-k summary<", ">item 16. form 10-k summary<", ">exhibits, financial statement schedules<",
                           ">exhibits and financial statement schedules<", "exhibits and financial statement schedules <",
                           ">exhibits<"]
    lowerText = text.lower()  # lowercase once instead of once per candidate
    start_idx = -1
    for content in lastContentInTC_lst:
        start_idx = lowerText.find(content)
        if start_idx != -1:
            break

    # none of the headings was found
    if start_idx == -1:
        return False

    mainBody = text[start_idx:]
    mainBody = html.unescape(mainBody)
    return mainBody

In [9]:
def extract_risk_factors(mainBody, fileName):
    lowerBody = mainBody.lower()  # lowercase once and reuse it for every search

    # match the heading of the risk factors section
    keywordMatch = re.findall(r">item 1a.\s+risk factors\.?\<", lowerBody)

    # if the match fails, the heading may be split by HTML tags, so re-match on the tag-stripped text
    if not keywordMatch:
        keyword = seperated_keyword_rematch(mainBody)

    # otherwise keep the first matched heading as the keyword
    else:
        keyword = keywordMatch[0]

    # check whether a keyword was found
    if not keyword:
        return False
    # a very short keyword is ambiguous, so flag the file for manual review
    keywordCheck_lst = ["item", "item 1a", "item 1a.", "risk", "factors", "risk factors"]
    if keyword in keywordCheck_lst:
        print("Warning: It is better to check whether the extraction is correct manually for file", {fileName})

    # match the heading of the section that follows the risk factors
    nextSectionPattern_lst = [r">item 1b.\s+unresolved staff comments\.?\<", r">item 2.\s+description of property\.?\<"]
    for pattern in nextSectionPattern_lst:
        nextSectionMatch = re.findall(pattern, lowerBody)

        # if the match fails, the heading may be split by HTML tags, so re-match on the tag-stripped text
        if not nextSectionMatch:
            nextSection = seperated_next_section_rematch(mainBody, pattern)
            # if the re-match succeeds, leave the loop
            if nextSection:
                break

        # otherwise keep the first matched heading and leave the loop
        else:
            nextSection = nextSectionMatch[0]
            break

    # check whether the next section was found
    if not nextSection:
        return False

    # slice the risk factors out of the main body
    start_idx = lowerBody.find(keyword)  # position where the keyword first occurs in the main body
    end_idx = lowerBody.find(nextSection)  # position where the next section heading first occurs in the main body
    riskFactors = mainBody[start_idx: end_idx]
    return riskFactors
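
The function above searches for both headings from the beginning of the main body. If the next-section heading also appears before the risk factors heading, the slice comes out empty. The sketch below shows one possible variant that searches for the end marker only after the start marker. `slice_between` is a hypothetical helper, not the notebook's method, and it assumes that lowercasing does not change the length of the text.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to end_marker, or None if either is missing.

    Both markers are expected in lowercase (hypothetical helper).
    """
    lowered = text.lower()
    start_idx = lowered.find(start_marker)
    if start_idx == -1:
        return None
    # begin the second search after the start marker so earlier mentions are skipped
    end_idx = lowered.find(end_marker, start_idx + len(start_marker))
    if end_idx == -1:
        return None
    return text[start_idx:end_idx]


doc = ("Item 1B. (see page 20) Item 1A. Risk Factors. "
       "We face risks. Item 1B. Unresolved Staff Comments")
print(repr(slice_between(doc, "item 1a.", "item 1b.")))
```

Whether this variant would recover any of the filings reported as failed was not checked.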

In [10]:
def seperated_keyword_rematch(mainBody):
    cleanedMainBody = remove_tags(mainBody)
    # allow an optional | (left by a removed tag) between the characters of the heading
    keywordMatch = re.findall(r"i\|?t\|?e\|?m\|?\s+\|?1\|?a\|?.\|?\s+\|?r\|?i\|?s\|?k\|?\s+\|?f\|?a\|?c\|?t\|?o\|?r\|?s\.?", cleanedMainBody.lower())

    # check whether the risk factors heading was re-matched
    if not keywordMatch:
        return False

    # use the longest |-separated segment of the match as the keyword
    reMatchedKeyword = max(keywordMatch[0].split('|'), key=len)
    return reMatchedKeyword

def seperated_next_section_rematch(mainBody, nextSectionPattern):
    # strip the tags from the main body
    cleanedMainBody = remove_tags(mainBody)

    # reduce the pattern to its plain heading text, then allow an optional |
    # between any two characters
    edittedPattern = nextSectionPattern.replace(r">", "").replace(r"\s+", " ").replace(r"\.?\<", "")
    edittedPattern = r"\|?".join(edittedPattern)
    edittedPattern = r"{}\.?".format(edittedPattern)
    # optional whitespace between words; plain \s? is also accepted by Python versions before 3.11
    pattern = edittedPattern.replace(" ", r"\s?")
    nextSectionMatch = re.findall(pattern, cleanedMainBody.lower())

    # check whether the next section heading was re-matched
    if not nextSectionMatch:
        return False

    # use the longest |-separated segment of the match as the next section keyword
    reMatchedNextSection = max(nextSectionMatch[0].split('|'), key=len)
    return reMatchedNextSection
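
The re-match idea, a pattern that tolerates a `|` between any two characters of a heading, can be sketched in isolation. `pipe_tolerant_pattern` is a hypothetical helper. Unlike the functions above, it escapes each character with `re.escape` and requires at least one whitespace character between words, so it is a variant of the technique rather than a copy of it.

```python
import re


def pipe_tolerant_pattern(phrase):
    """Build a regex for `phrase` that allows an optional '|' between characters."""
    parts = [r"\s+" if ch == " " else re.escape(ch) for ch in phrase]
    return r"\|?".join(parts)


pattern = pipe_tolerant_pattern("item 1b.")
print(pattern)

# the heading is still found when a removed tag left a pipe inside a word
match = re.search(pattern, "|it|em 1b. unresolved staff comments|")
print(match.group(0))
```

Escaping each character keeps the `.` in `1b.` literal, which avoids matching an arbitrary character in that position.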

In [8]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # separate the original text as the table of content and the main body.
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Corpus Extraction/Part 1/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0000002488-23-000047.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000008858-22-000031.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000008947-22-000034.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000012927-23-000007.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000014707-22-000019.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000015615-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000018255-22-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000018926-23-000013.txt'} because of the error in seperating main body.
The extraction is failed for {'0000023194-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000002741

The extraction is failed for {'0000849146-22-000058.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000850209-22-000003.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000851310-23-000021.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000852772-23-000032.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000856982-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000862668-22-000023.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000866729-22-000017.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000876167-23-000036.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000883569-23-000010.txt'} because of the error in seperating main body.
The extraction is failed for {'000088471

The extraction is failed for {'0000950170-23-004939.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005085.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005118.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005127.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005132.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005163.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005336.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-005451.txt'} because of the error in seperating main body.
The extraction is failed for {'0000950170-23-005490.txt'} because of the error in seperating main body.
The extraction is failed for {'0000950170-2

The extraction is failed for {'0001046995-23-000007.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001047127-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001050797-23-000047.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001060736-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001062231-23-000016.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001062822-23-000007.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001062993-22-013882.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001062993-22-019526.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001062993-23-007681.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000106

The extraction is failed for {'0001104659-23-040151.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040158.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040176.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040195.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040275.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040285.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001104659-23-040291.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001108205-23-000023.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001108426-23-000006.txt'} because of the error in seperating main body.
The extraction is failed for {'000111133

The extraction is failed for {'0001193125-22-092380.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-092414.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-093685.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-093842.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-093970.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-095122.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-095322.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-097864.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-099020.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-22-100377.txt'} becaus

2. Part 2:

In [6]:
folderPath = "Downloaded Corpus/"
fileName_lst = sorted(os.listdir(folderPath))  # os.listdir order is arbitrary; sort so the slices are stable
filePair_dict = {}  # maps each file name to the text of that file
for fileName in fileName_lst[1000:2000]:
    # read each document
    filePath = os.path.join(folderPath, fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText

In [12]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # separate the original text as the table of content and the main body.
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Corpus Extraction/Part 2/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0001193125-23-045575.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001193125-23-051721.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001193125-23-054105.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-23-054302.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001193125-23-055176.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001193125-23-056551.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-23-058309.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-23-058333.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-23-060184.txt'} because of the error in seperating main body.
The extraction is failed for {'0001193125-23-060212.

The extraction is failed for {'0001213900-22-017432.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-017486.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-017545.txt'} because of the error in seperating main body.
The extraction is failed for {'0001213900-22-017945.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-018458.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-019276.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-019381.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-019462.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-22-019464.txt'} because of the error in seperating main body.
The extraction is failed for {'0001213900-2

The extraction is failed for {'0001213900-23-022824.txt'} because of the error in seperating main body.
The extraction is failed for {'0001213900-23-023363.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-023698.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-023704.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-023710.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-023825.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-023973.txt'} because of the error in seperating main body.
The extraction is failed for {'0001213900-23-024080.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001213900-23-024098.txt'} because of the error in seperating main body.
The extraction is failed for {'0001213900-23-0

The extraction is failed for {'0001288847-23-000029.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001297184-23-000019.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001297989-23-000001.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001305168-23-000018.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001305773-23-000013.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001308411-22-000009.txt'} because of the error in seperating main body.
The extraction is failed for {'0001309108-23-000022.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001309402-23-000012.txt'} because of the error in seperating main body.
The extraction is failed for {'0001321655-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001321732-2

The extraction is failed for {'0001410578-22-000957.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-000966.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-000983.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001015.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001051.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001567.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001865.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001868.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001410578-22-001889.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000141

The extraction is failed for {'0001437749-22-008763.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-008982.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-008996.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-009043.txt'} because of the error in seperating main body.
The extraction is failed for {'0001437749-22-010855.txt'} because of the error in seperating main body.
The extraction is failed for {'0001437749-22-014727.txt'} because of the error in seperating main body.
The extraction is failed for {'0001437749-22-015612.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-015886.txt'} because of the error in seperating main body.
The extraction is failed for {'0001437749-22-016173.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-0162

The extraction is failed for {'0001477932-22-003925.txt'} because of the error in seperating main body.
The extraction is failed for {'0001477932-22-003940.txt'} because of the error in seperating main body.
The extraction is failed for {'0001477932-22-004396.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001477932-22-004557.txt'} because of the error in seperating main body.
The extraction is failed for {'0001477932-22-004724.txt'} because of the error in seperating main body.
The extraction is failed for {'0001477932-22-005494.txt'} because of the error in seperating main body.
The extraction is failed for {'0001477932-22-007294.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001477932-22-007306.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001477932-23-000012.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001477932-23-000586.

The extraction is failed for {'0001493152-22-017738.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-017850.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-017917.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-017952.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-018006.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-018093.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-018095.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-22-018102.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-22-018107.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000149315

The extraction is failed for {'0001493152-23-008241.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-008326.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-008518.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-008716.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-008751.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-008765.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-008878.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-008898.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-008914.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-0

3. Part 3:

In [13]:
folderPath = "Downloaded Corpus/"
fileName_lst = sorted(os.listdir(folderPath))  # os.listdir order is arbitrary; sort so the slices are stable
filePair_dict = {}  # maps each file name to the text of that file
for fileName in fileName_lst[2000:3000]:
    # read each document
    filePath = os.path.join(folderPath, fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText

In [14]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # separate the original text as the table of content and the main body.
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Corpus Extraction/Part 3/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0001493152-23-009605.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009609.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-009730.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009799.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009812.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009824.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009840.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-009844.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001493152-23-009866.txt'} because of the error in seperating main body.
The extraction is failed for {'0001493152-23-0

The extraction is failed for {'0001548123-23-000051.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001548123-23-000062.txt'} because of the error in seperating main body.
The extraction is failed for {'0001549346-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001552781-22-000463.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001553350-22-000745.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001553350-22-000785.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001553350-22-000798.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001554795-22-000255.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001554795-23-000091.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000155479

The extraction is failed for {'0001558370-23-003331.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003450.txt'} because of the error in seperating main body.
The extraction is failed for {'0001558370-23-003469.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003575.txt'} because of the error in seperating main body.
The extraction is failed for {'0001558370-23-003589.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003621.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003650.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003716.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-003738.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-2

The extraction is failed for {'0001564590-23-002309.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002350.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002352.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002354.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002356.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002357.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002358.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002362.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001564590-23-002367.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000156

The extraction is failed for {'0001628280-22-019309.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-22-023828.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-22-024988.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-002350.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-002580.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-002857.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-003850.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-004087.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001628280-23-004299.txt'} because of the error in seperating main body.
The extraction is failed for {'000162828

The extraction is failed for {'0001679363-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001682852-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-22-002280.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-22-002406.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-22-002484.txt'} because of the error in seperating main body.
The extraction is failed for {'0001683168-22-002671.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-22-002677.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-22-002694.txt'} because of the error in seperating main body.
The extraction is failed for {'0001683168-22-002703.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001683168-2

The extraction is failed for {'0001818382-23-000016.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001818644-23-000003.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001819574-22-000049.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001819881-23-000007.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001819974-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001821825-23-000003.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001828108-23-000020.txt'} because of the error in seperating main body.
The extraction is failed for {'0001829126-22-007895.txt'} because of the error in seperating main body.
The extraction is failed for {'0001829126-22-008021.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001829126-2

## 4.1.3 Extract the risk factors subject

Next, we extract the risk factor subjects, i.e. the bold headings inside each risk factors section, for the same 3 parts.

1. Part 1:

In [17]:
from bs4 import BeautifulSoup

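def extract_subjects(riskFactors, subjectFormat):
    """Return the text of every <span> whose inline style contains subjectFormat."""
    soup = BeautifulSoup(riskFactors, 'html.parser')
    subject_lst = []
    for span in soup.find_all('span'):
        style = span.get('style')
        # keep only spans whose style attribute mentions the bold format
        if style and subjectFormat in style:
            subject_lst.append(span.get_text())
    return subject_lst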
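
The substring test `subjectFormat in style` is sensitive to how a filing writes its inline CSS: `font-weight: 700` (with a space) or `FONT-WEIGHT:bold` would not match the patterns used in this section. A minimal sketch of a more tolerant check follows; `style_matches` is a hypothetical helper and is not part of the original notebook.

```python
def style_matches(style, subject_format):
    """Return True if the inline style contains subject_format,
    ignoring whitespace and letter case (hypothetical helper)."""
    if not style:
        return False
    # Remove all whitespace and lowercase the style,
    # so "font-weight: 700" is treated like "font-weight:700".
    normalized = "".join(style.split()).lower()
    return subject_format.lower() in normalized


print(style_matches("color:#000; font-weight: 700", "font-weight:700"))  # True
print(style_matches("FONT-WEIGHT:bold", "font-weight:bold"))             # True
print(style_matches("font-style:italic", "font-weight:700"))             # False
```

Inside `extract_subjects`, the condition `style and subjectFormat in style` could be swapped for `style_matches(style, subjectFormat)`. Whether this recovers any of the failed files depends on how those filings are formatted.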

In [10]:
rawTextFolderPath = "Corpus Extraction/Part 1/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # maps each file name to its raw text
for fileName in rawTextFileName_lst:
    # read each document
    filePath = str(rawTextFolderPath) + str(fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [11]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Corpus Extraction/Part 1/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0000107140-22-000022.txt'}
The subject fail to extract for {'0000355019-22-000042.txt'}
The subject fail to extract for {'0000874015-23-000105.txt'}
The subject fail to extract for {'0000898432-23-000193.txt'}
The subject fail to extract for {'0000930413-23-001072.txt'}
The subject fail to extract for {'0001017386-22-000150.txt'}
The subject fail to extract for {'0001017386-23-000093.txt'}
The subject fail to extract for {'0001058090-23-000010.txt'}
The subject fail to extract for {'0001062993-22-015574.txt'}
The subject fail to extract for {'0001062993-23-007669.txt'}
The subject fail to extract for {'0001062993-23-008012.txt'}
The subject fail to extract for {'0001065949-22-000047.txt'}
The subject fail to extract for {'0001065949-23-000017.txt'}
The subject fail to extract for {'0001079973-22-000422.txt'}
The subject fail to extract for {'0001079973-22-000424.txt'}
The subject fail to extract for {'0001079973-23-000353.txt'}
The subject fail to extr
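
The `while` loop in the cell above tries each entry of `subjectFormat_lst` in turn until one yields a non-empty list. The same fallback can be sketched as a small function. Both `extract_with_fallback` and its `extractor` parameter are illustrative names, and a stand-in extractor is used here so that the example runs without any filings.

```python
def extract_with_fallback(raw_text, formats, extractor):
    """Try each format in order and return the first non-empty result,
    or an empty list if no format matches (illustrative helper)."""
    for fmt in formats:
        subjects = extractor(raw_text, fmt)
        if subjects:
            return subjects
    return []


def fake_extractor(raw_text, fmt):
    # Stand-in for extract_subjects: reports a heading only if fmt occurs in the text.
    return ["Heading"] if fmt in raw_text else []


formats = ["font-weight:700", "font-weight:bold"]
print(extract_with_fallback('<span style="font-weight:bold">Heading</span>', formats, fake_extractor))
print(extract_with_fallback("<p>plain text</p>", formats, fake_extractor))
```

In the notebook, `extract_subjects` would be passed as `extractor`. An empty return value corresponds to the "fail to extract" message.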

2. Part 2:

In [28]:
rawTextFolderPath = "Corpus Extraction/Part 2/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # maps each file name to its raw text
for fileName in rawTextFileName_lst:
    # read each document
    filePath = str(rawTextFolderPath) + str(fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [29]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Corpus Extraction/Part 2/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0001193311-23-000005.txt'}
The subject fail to extract for {'0001213900-22-016996.txt'}
The subject fail to extract for {'0001213900-22-017017.txt'}
The subject fail to extract for {'0001213900-22-017095.txt'}
The subject fail to extract for {'0001213900-22-017118.txt'}
The subject fail to extract for {'0001213900-22-017159.txt'}
The subject fail to extract for {'0001213900-22-017248.txt'}
The subject fail to extract for {'0001213900-22-018211.txt'}
The subject fail to extract for {'0001213900-22-018523.txt'}
The subject fail to extract for {'0001213900-22-018778.txt'}
The subject fail to extract for {'0001213900-22-019078.txt'}
The subject fail to extract for {'0001213900-22-019210.txt'}
The subject fail to extract for {'0001213900-22-019367.txt'}
The subject fail to extract for {'0001213900-22-019421.txt'}
The subject fail to extract for {'0001213900-22-019444.txt'}
The subject fail to extract for {'0001213900-22-019472.txt'}
The subject fail to extr

The subject fail to extract for {'0001437749-22-007905.txt'}
The subject fail to extract for {'0001437749-22-008152.txt'}
The subject fail to extract for {'0001437749-22-008182.txt'}
The subject fail to extract for {'0001437749-22-009005.txt'}
The subject fail to extract for {'0001437749-22-009050.txt'}
The subject fail to extract for {'0001437749-22-015762.txt'}
The subject fail to extract for {'0001437749-22-022143.txt'}
The subject fail to extract for {'0001437749-22-022334.txt'}
The subject fail to extract for {'0001437749-22-023256.txt'}
The subject fail to extract for {'0001437749-23-003778.txt'}
The subject fail to extract for {'0001437749-23-003782.txt'}
The subject fail to extract for {'0001437749-23-003931.txt'}
The subject fail to extract for {'0001437749-23-004329.txt'}
The subject fail to extract for {'0001437749-23-004331.txt'}
The subject fail to extract for {'0001437749-23-004332.txt'}
The subject fail to extract for {'0001437749-23-004334.txt'}
The subject fail to extr

3. Part 3:

In [23]:
rawTextFolderPath = "Corpus Extraction/Part 3/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # maps each file name to its raw text
for fileName in rawTextFileName_lst:
    # read each document
    filePath = str(rawTextFolderPath) + str(fileName)
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [25]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Corpus Extraction/Part 3/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0001493152-23-009742.txt'}
The subject fail to extract for {'0001493152-23-009765.txt'}
The subject fail to extract for {'0001493152-23-009907.txt'}
The subject fail to extract for {'0001493152-23-010013.txt'}
The subject fail to extract for {'0001493152-23-010120.txt'}
The subject fail to extract for {'0001493152-23-010129.txt'}
The subject fail to extract for {'0001493152-23-010160.txt'}
The subject fail to extract for {'0001493152-23-010166.txt'}
The subject fail to extract for {'0001493152-23-010210.txt'}
The subject fail to extract for {'0001493152-23-010345.txt'}
The subject fail to extract for {'0001510964-22-000033.txt'}
The subject fail to extract for {'0001510964-23-000016.txt'}
The subject fail to extract for {'0001513162-23-000040.txt'}
The subject fail to extract for {'0001515971-22-000125.txt'}
The subject fail to extract for {'0001520138-22-000237.txt'}
The subject fail to extract for {'0001552800-23-000003.txt'}
The subject fail to extr
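
Parts 1 to 3 run identical code with different folder names. The sketch below shows one way the three cells could be folded into a single function, assuming the folder layout used in this section. `process_part` is a hypothetical helper, and `extractor` stands for a function such as `extract_subjects`.

```python
import os


def process_part(raw_dir, subject_dir, extractor, formats):
    """Extract subjects for every file in raw_dir and write them to subject_dir.
    Returns the names of the files for which no subject was found."""
    failed = []
    os.makedirs(subject_dir, exist_ok=True)
    for file_name in sorted(os.listdir(raw_dir)):
        with open(os.path.join(raw_dir, file_name), "r", encoding="utf-8") as f:
            raw_text = f.read()
        subjects = []
        for fmt in formats:  # fall back to the next format when nothing is found
            subjects = extractor(raw_text, fmt)
            if subjects:
                break
        if not subjects:
            failed.append(file_name)
            continue
        # one subject per line, as in the original cells
        with open(os.path.join(subject_dir, file_name), "w", encoding="utf-8") as f:
            for item in subjects:
                f.write(item + "\n")
    return failed


# Usage in the notebook (not run here):
# for part in (1, 2, 3):
#     base = f"Corpus Extraction/Part {part}/"
#     failed = process_part(base + "Raw Text/", base + "Subject/",
#                           extract_subjects, subjectFormat_lst)
```

Returning the failed file names instead of printing them makes it easier to count or re-inspect them later.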

## 4.1.4 Organizing and Cleansing
We now reorganize and cleanse the 3 parts of the corpus and merge them into a single file.

### 1. Organizing the .txt Files into .csv Files

1. Part 1:

In [30]:
def generate_subject_form(folderPath, outputPath):
    adsh_lst = []
    
    fileName_lst = os.listdir(folderPath)
    riskFactorSubject_lst = []
    for fileName in fileName_lst:
        # drop the ".txt" extension (rstrip would strip a set of characters, not a suffix)
        adsh = os.path.splitext(fileName)[0]
        filePath = str(folderPath) + str(fileName)    
        with open(filePath, 'r', encoding='utf-8') as file:
            fileText = file.read()
    
        subject = fileText.split('\n')
        riskFactorSubject_lst.extend(subject)
        adsh_lst.extend([adsh] * len(subject))
    df = pd.DataFrame(list(zip(adsh_lst, riskFactorSubject_lst)), columns=['adsh', 'Risk Factors'])
    df.to_csv(outputPath, index=False)
    return df
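
A note on deriving `adsh` from the file name: `str.rstrip(".txt")` removes any trailing run of the characters `.`, `t`, and `x` rather than the literal suffix, whereas `os.path.splitext` removes only the extension. The accession numbers used here end in a digit, so both give the same result, but the difference matters for other file names. A short illustration (the name `report.txt` is made up for the example):

```python
import os

# str.rstrip treats its argument as a set of characters, not as a suffix.
print("report.txt".rstrip(".txt"))        # repor
print(os.path.splitext("report.txt")[0])  # report

# Accession-number file names end in a digit, so both approaches agree here.
name = "0000004904-23-000011.txt"
print(name.rstrip(".txt") == os.path.splitext(name)[0])  # True
```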

In [32]:
CorpusFolderPath1 = "Corpus Extraction/Part 1/Subject/"
CorpusOutputPath1 = "Corpus Extraction/Part 1/Corpus1.csv"
Corpus_df1 = generate_subject_form(CorpusFolderPath1, CorpusOutputPath1)
Corpus_df1.head()

Unnamed: 0,adsh,Risk Factors
0,0000004904-23-000011,GENERAL RISKS OF REGULATED OPERATIONS
1,0000004904-23-000011,AEP may not be able to recover the costs of su...
2,0000004904-23-000011,Regulated electric revenues and earnings are d...
3,0000004904-23-000011,AEP’s transmission investment strategy and exe...
4,0000004904-23-000011,Certain elements of AEP’s transmission formula...


2. Part 2:

In [33]:
CorpusFolderPath2 = "Corpus Extraction/Part 2/Subject/"
CorpusOutputPath2 = "Corpus Extraction/Part 2/Corpus2.csv"
Corpus_df2 = generate_subject_form(CorpusFolderPath2, CorpusOutputPath2)
Corpus_df2.head()

Unnamed: 0,adsh,Risk Factors
0,0001193125-23-044850,Age
1,0001193125-23-044850,"Young-Joon (YJ) Kim, Board of Directors, Membe..."
2,0001193125-23-044850,"Shin Young Park, Chief Financial Officer"
3,0001193125-23-044850,.
4,0001193125-23-044850,"Theodore Kim, Chief Compliance Officer, Genera..."


3. Part 3:

In [34]:
CorpusFolderPath3 = "Corpus Extraction/Part 3/Subject/"
CorpusOutputPath3 = "Corpus Extraction/Part 3/Corpus3.csv"
Corpus_df3 = generate_subject_form(CorpusFolderPath3, CorpusOutputPath3)
Corpus_df3.head()

Unnamed: 0,adsh,Risk Factors
0,0001495320-22-000007,Risks Related to the COVID-19 Pandemic
1,0001495320-22-000007,The outbreak of the novel coronavirus (COVID-1...
2,0001495320-22-000007,Risks Related to Our Business Operations and I...
3,0001495320-22-000007,If we are unable to successfully implement our...
4,0001495320-22-000007,Table of Contents


### 2. Combination

In [36]:
Corpus_df = pd.concat([Corpus_df1, Corpus_df2, Corpus_df3], axis=0)
print("The first 5 records are shown below: \n", Corpus_df.head(), "\n")
print("There are ", len(Corpus_df), "records in the corpus.")

The first 5 records are shown below: 
                    adsh                                       Risk Factors
0  0000004904-23-000011              GENERAL RISKS OF REGULATED OPERATIONS
1  0000004904-23-000011  AEP may not be able to recover the costs of su...
2  0000004904-23-000011  Regulated electric revenues and earnings are d...
3  0000004904-23-000011  AEP’s transmission investment strategy and exe...
4  0000004904-23-000011  Certain elements of AEP’s transmission formula... 

There are  135436 records in the corpus.


### 3. Cleansing
Some lines in the documents are empty or incomplete, so we need to remove them.

In [37]:
import string
# drop rows with missing values
Corpus_df = Corpus_df.dropna()
# keep only lines made up entirely of letters, whitespace and basic punctuation
Corpus_df = Corpus_df[Corpus_df["Risk Factors"].str.match(r'^[a-zA-Z,.\s\'"?!]+$')]
# drop lines that start with a space or a punctuation mark
Corpus_df = Corpus_df[~Corpus_df["Risk Factors"].str.startswith(" ")]
punctuations = string.punctuation
for p in punctuations:
    Corpus_df = Corpus_df[~Corpus_df["Risk Factors"].str.startswith(p)]
# keep only lines with more than two words
Corpus_df = Corpus_df[Corpus_df['Risk Factors'].str.split().str.len() > 2]
print("There are ", len(Corpus_df), "records in the corpus.")

There are  55898 records in the corpus.
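
To see what these filters keep and drop, the same rules can be applied to single strings. The sketch below mirrors the cleansing cell with the standard library only; `keep_subject` is an illustrative helper, and the sample lines are taken from the outputs shown earlier in this section. Note that the character class also rejects lines containing digits, hyphens, or typographic apostrophes, while boilerplate such as `Table of Contents` passes.

```python
import re
import string

# Same pattern as in the cleansing cell: letters, whitespace and , . ' " ? ! only.
ALLOWED = re.compile(r'^[a-zA-Z,.\s\'"?!]+$')


def keep_subject(line):
    """Mirror the DataFrame filters on a single string (illustrative helper)."""
    if not ALLOWED.match(line):
        return False
    # A matched line has at least one character, so line[0] is safe.
    if line[0] == " " or line[0] in string.punctuation:
        return False
    return len(line.split()) > 2  # more than two words


samples = [
    "Our business depends on access to natural gas.",
    "Risks Related to the COVID-19 Pandemic",
    "AEP’s transmission investment strategy",
    "Table of Contents",
    "Age",
    ".",
]
for s in samples:
    print(keep_subject(s), repr(s))
```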


The subjects we extracted in Section 2 can also be used to build up the corpus, so we append them to the cleansed corpus.

In [39]:
folderPath = "../Section 2 - Risk Factor Extraction/Risk Factor Text/Risk Factor Subject Text Summary/"
fileName_lst = os.listdir(folderPath)
df_lst = []
for fileName in fileName_lst:
    filePath = str(folderPath) + str(fileName)
    df_lst.append(pd.read_csv(filePath))
riskCorpus_df = pd.concat(df_lst, axis=0)
print("The first 5 records in the corpus are shown below: \n", riskCorpus_df.head(), "\n")
print("There are", len(riskCorpus_df), "records in the corpus.")

The first 5 records in the corpus are shown below: 
                    adsh                                       Risk Factors
0  0000007332-23-000005  Natural gas, oil and NGL prices and basis diff...
1  0000007332-23-000005  Significant capital investment is required to ...
2  0000007332-23-000005  If we are not able to develop and replace rese...
3  0000007332-23-000005  Our business depends on access to natural gas,...
4  0000007332-23-000005  Strategic determinations, including the alloca... 

There are 33103 records in the corpus.


In [40]:
Corpus_df = pd.concat([Corpus_df, riskCorpus_df], axis=0)

### 4. Output the file

In [41]:
Corpus_df.to_csv("Corpus.csv", index=False)