# Section 2 Risk Factor Extraction

In Section 1, we downloaded the financial reports from 2022Q1 to 2023Q3. In this section, we extract the risk factor text from each of these files. According to the SEC's requirements for Form 10-K (https://www.sec.gov/files/form10-k.pdf), the form includes a section named "Item 1A. Risk Factors", which describes the potential risks faced by the company. We therefore filter and cleanse the text of this section so that it can be used for further analysis.

In [1]:
import pandas as pd
import numpy as np
import requests
import os
import re
import html
import warnings
warnings.filterwarnings("ignore")

# 2.1 Energy & Transportation Industry (1311, 1389, 4911)

In this section, we focus only on companies in the Energy & Transportation industry.

## 2.1.1 Industry 1311

### 1. Read in the files and store in a dictionary

In [2]:
folderPath = "../Section 1 - Datasets Downloads and Preparation/Downloaded Financial Reports/Energy Industry/Industry_1311/"
fileName_lst = os.listdir(folderPath)
filePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in fileName_lst:  
    # read each document as UTF-8 text
    filePath = str(folderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText
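The same read loop is repeated for every industry in this section. As a minimal sketch, it could be wrapped in a helper; the name `read_text_files` and the `.txt` filter are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile

def read_text_files(folder_path):
    """Return {file name: file text} for every .txt file in folder_path (hypothetical helper)."""
    file_pairs = {}
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith(".txt"):
            continue  # skip anything that is not a downloaded filing
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, "r", encoding="utf-8") as f:
            file_pairs[file_name] = f.read()
    return file_pairs

# Usage on a throwaway folder so that the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo-filing.txt"), "w", encoding="utf-8") as f:
        f.write("<html>demo</html>")
    demo_pairs = read_text_files(tmp)

print(demo_pairs)
```

Using `os.path.join` avoids depending on a trailing slash in the folder path.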

In [3]:
filePair_dict.keys()

dict_keys(['0000007332-23-000005.txt', '0000010254-23-000058.txt', '0000023194-23-000009.txt', '0000033213-23-000008.txt', '0000061398-23-000011.txt', '0000077159-23-000003.txt', '0000077877-23-000012.txt', '0000101778-23-000030.txt', '0000351817-23-000032.txt', '0000717423-23-000015.txt', '0000797468-23-000011.txt', '0000798949-23-000015.txt', '0000821189-23-000015.txt', '0000821483-23-000008.txt', '0000844965-23-000009.txt', '0000858470-23-000011.txt', '0000893538-23-000014.txt', '0000895126-23-000022.txt', '0000928022-23-000017.txt', '0000945764-23-000028.txt', '0000950170-23-002852.txt', '0000950170-23-003794.txt', '0000950170-23-004459.txt', '0000950170-23-004687.txt', '0000950170-23-005278.txt', '0000950170-23-006714.txt', '0001001614-23-000011.txt', '0001017386-22-000147.txt', '0001038357-23-000039.txt', '0001062993-23-007669.txt', '0001070412-23-000015.txt', '0001072613-23-000285.txt', '0001096906-22-000878.txt', '0001096906-23-000666.txt', '0001104485-23-000041.txt', '00011851

### 2. Extract the raw risk factor text

In [4]:
def remove_tags(text):
    # replace every tag enclosed in <> with a | delimiter
    pattern1 = r'<[^>]*>'
    cleanText = re.sub(pattern1, "|", text)
    
    # collapse runs of | (and the whitespace around them) into a single |
    cleanText = re.sub(r'\s*\|+\s*', '|', cleanText)
    cleanText = re.sub(r'\|+', '|', cleanText)
    
    # decode HTML entities such as &amp; and &#160;
    cleanText = html.unescape(cleanText)
    
    # remove the line breaks
    cleanText = cleanText.replace('\n', '')
    # remove segments between | that contain only digits
    cleanText = re.sub(r'\|(\d+)\|', '|', cleanText)
    # remove non-breaking spaces; the | is escaped so that it is matched
    # literally instead of acting as the regex alternation operator
    cleanText = re.sub(r'\|\xa0\|', '', cleanText)
    cleanText = re.sub(r'\xa0', '', cleanText)
    return cleanText
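To make the effect of the tag replacement concrete, here is a small sketch that mirrors only the first steps of `remove_tags` on a toy snippet. The helper name `tags_to_pipes` and the sample string are assumptions for illustration.

```python
import html
import re

def tags_to_pipes(text):
    # Hypothetical helper: mirrors only the first steps of remove_tags.
    text = re.sub(r'<[^>]*>', '|', text)      # every tag becomes a | delimiter
    text = re.sub(r'\s*\|+\s*', '|', text)    # drop whitespace around runs of |
    text = re.sub(r'\|+', '|', text)          # collapse any remaining runs of |
    return html.unescape(text)                # decode HTML entities

sample = '<div><span>Item 1A.</span> <span>Risk Factors</span></div>'
cleaned = tags_to_pipes(sample)
print(cleaned)  # |Item 1A.|Risk Factors|
```

The second collapse is needed because the first pass can leave two adjacent `|` characters when whitespace sat between two tags.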

In [5]:
def separate_main_body(text):
    lastContentInTC_lst = [">form 10-k summary<", ">item 16. form 10-k summary<", r">exhibits, financial statement schedules<", 
                           r">exhibits and financial statement schedules<", r"exhibits and financial statement schedules <",
                           ">exhibits<"]
    start_idx = -1
    for content in lastContentInTC_lst:
        if content in text.lower():
            start_idx = text.lower().find(content)
            break
            
    if start_idx == -1:
        return False
    
    mainBody = text[start_idx:]
    mainBody = html.unescape(mainBody)
    return mainBody
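The idea behind `separate_main_body` is that the last entry of the table of contents marks where the main body begins, so everything before it can be discarded. The sketch below shows this on a toy filing; `split_after_marker` and the toy text are hypothetical, and the helper returns `None` rather than `False` on failure.

```python
def split_after_marker(text, markers):
    # Hypothetical simplified version of separate_main_body:
    # cut the text at the first marker that occurs in it.
    lowered = text.lower()
    for marker in markers:
        idx = lowered.find(marker)
        if idx != -1:
            return text[idx:]
    return None

toy_filing = (
    "<a>Item 1A. Risk Factors</a>"    # table-of-contents entry
    "<a>Exhibits</a>"                 # last table-of-contents entry
    "<p>Item 1A. Risk Factors</p>"    # heading in the main body
)
body = split_after_marker(toy_filing, [">form 10-k summary<", ">exhibits<"])
print(body)
```

After the cut, the only remaining "Item 1A" heading is the one in the main body, which is what the later extraction step relies on.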

In [6]:
def extract_risk_factors(mainBody, fileName):
    # match the risk factor
    keywordMatch = re.findall(r">item 1a.\s+risk factors\.?\<", mainBody.lower())
    
    # if the pattern fails to match, the heading might be split up by HTML tags, so re-match
    if not keywordMatch:
        keyword = seperated_keyword_rematch(mainBody)
        
    # if match the pattern successfully, just keep the keyword of risk factor
    else:
        keyword = keywordMatch[0]
    
    # check if match successfully
    if not keyword:
        return False
    # plain strings are compared here, so both spellings of "item 1a" are listed explicitly
    keywordCheck_lst = ["item", "item 1a", "item 1a.", "risk", "factors", "risk factors"]
    if keyword in keywordCheck_lst:
        print("Warning: It is better to check whether the extraction is correct manually for file", {fileName})
    
    # match the next section of the risk factor
    nextSectionPattern_lst = [r">item 1b.\s+unresolved staff comments\.?\<", r">item 2.\s+description of property\.?\<"]
    for pattern in nextSectionPattern_lst:
        nextSectionMatch = re.findall(pattern, mainBody.lower())
        
        # if the pattern fails to match, the heading might be split up by HTML tags, so re-match
        if not nextSectionMatch:
            nextSection = seperated_next_section_rematch(mainBody, pattern)
            # if re-match successfully, jump out of the loop
            if nextSection:
                break
        
        # if match the pattern successfully, just keep the next section keyword and jump out of the loop
        else:
            nextSection = nextSectionMatch[0]
            break
    
    # check if match successfully
    if not nextSection:
        return False
    
    # if match successfully, extract the risk factor
    start_idx = mainBody.lower().find(keyword) # start index: position where the keyword occurs in the main body
    end_idx = mainBody.lower().find(nextSection) # end index: position where the next section begins in the main body
    riskFactors = mainBody[start_idx: end_idx]
    return riskFactors
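The core of the extraction is a slice between two headings. The following sketch illustrates it on a toy main body. It is not the notebook's function: `slice_between` is a hypothetical name, the dot after `1a` is escaped, and the end heading is searched only after the start heading.

```python
import re

START = r">item 1a\.\s+risk factors\.?<"
END = r">item 1b\.\s+unresolved staff comments\.?<"

def slice_between(body, start_pattern=START, end_pattern=END):
    # Hypothetical helper: return the text from the start heading up to the end heading.
    lowered = body.lower()
    start = re.search(start_pattern, lowered)
    if start is None:
        return None
    # Search for the end heading only in the text that follows the start heading.
    end = re.search(end_pattern, lowered[start.end():])
    if end is None:
        return None
    return body[start.start(): start.end() + end.start()]

toy_body = (
    "<p>Item 1A. Risk Factors</p><p>Oil prices are volatile.</p>"
    "<p>Item 1B. Unresolved Staff Comments</p><p>None.</p>"
)
risk = slice_between(toy_body)
print(risk)
```

Matching is done on the lowercased text while the slice is taken from the original, so the extracted section keeps its casing.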

In [7]:
def seperated_keyword_rematch(mainBody):
    cleanedMainBody = remove_tags(mainBody)
    keywordMatch = re.findall(r"i\|?t\|?e\|?m\|?\s+\|?1\|?a\|?.\|?\s+\|?r\|?i\|?s\|?k\|?\s+\|?f\|?a\|?c\|?t\|?o\|?r\|?s\.?", cleanedMainBody.lower())
    
    # check if re-match the keyword of risk factor successfully
    if not keywordMatch:
        return False
    
    # if the re-match succeeds, use the longest segment as the keyword of the risk factor
    reMatchedKeyword = max(keywordMatch[0].split('|'), key=len)
    return reMatchedKeyword
    
def seperated_next_section_rematch(mainBody, nextSectionPattern):
    # cleanse the main body text
    cleanedMainBody = remove_tags(mainBody)
    
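    # edit the pattern: strip the surrounding > and <, then allow an optional | after every character
    edittedPattern = nextSectionPattern.replace(r">", "").replace(r"\s+", " ").replace(r"\.?\<", "")
    edittedPattern = r"\|?".join(edittedPattern)
    edittedPattern = r"{}\.?".format(edittedPattern)
    # the original r"\s?+" is only valid on Python 3.11+ (possessive quantifier); r"\s?" works on all versions
    pattern = edittedPattern.replace(" ", r"\s?")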
    nextSectionMatch = re.findall(pattern, cleanedMainBody.lower())
    
    # check if re-match the keyword of next section successfully
    if not nextSectionMatch:
        return False
    
    # if the re-match succeeds, use the longest segment as the keyword of the next section
    reMatchedNextSection = max(nextSectionMatch[0].split('|'), key=len)
    return reMatchedNextSection
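The re-match works by allowing an optional `|` (the tag placeholder) after every character of the heading. A minimal sketch of building such a pattern is shown below; `pipe_tolerant_pattern` and the fragmented sample are assumptions for illustration, and unlike the notebook's pattern it escapes the dot and accepts any amount of whitespace.

```python
import re

def pipe_tolerant_pattern(phrase):
    # Hypothetical helper: allow an optional | after every character of the phrase.
    parts = []
    for ch in phrase:
        # a space in the phrase may be any amount of whitespace (or none)
        parts.append(r"\s*" if ch == " " else re.escape(ch))
    return r"\|?".join(parts)

pattern = pipe_tolerant_pattern("item 1a. risk factors")
fragmented = "|it|em 1a.|risk fac|tors|"   # heading split up by former tags
match = re.search(pattern, fragmented)
longest = max(match.group(0).split("|"), key=len)

print(match.group(0))  # it|em 1a.|risk fac|tors
print(longest)         # risk fac
```

The longest fragment is then used as the search keyword in the original text, which is why a warning is printed when that fragment is too generic to be trusted.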

In [8]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # drop the table of contents and keep the main body
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Risk Factor Text/Energy Industry/Industry_1311/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0000023194-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000033213-23-000008.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000798949-23-000015.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000821483-23-000008.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000895126-23-000022.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-002852.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-004687.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001017386-22-000147.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001070412-23-000015.txt'} because of the error in extracting risk factors.
The extraction is failed for {'000107
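The write step assumes that the output folder already exists, and failures are only printed. As a sketch under those observations, the results could be saved through a helper that creates the folder and returns the failed file names for manual review; `save_results` and the toy inputs are hypothetical.

```python
import os
import tempfile

def save_results(results, out_folder):
    """results maps file name -> extracted text, or a falsy value on failure (hypothetical helper)."""
    os.makedirs(out_folder, exist_ok=True)   # create the folder tree if it is missing
    failed = []
    for file_name, text in results.items():
        if not text:
            failed.append(file_name)         # keep track of files that need manual review
            continue
        with open(os.path.join(out_folder, file_name), "w", encoding="utf-8") as f:
            f.write(text)
    return failed

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "Raw Text")
    failed = save_results({"a.txt": ">Item 1A. Risk Factors<", "b.txt": False}, out)
    written = sorted(os.listdir(out))

print(failed)   # ['b.txt']
print(written)  # ['a.txt']
```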

### 3. Extract the subjects from raw risk factors text

In [9]:
from bs4 import BeautifulSoup

def extract_subjects(riskFactors, subjectFormat):
    soup = BeautifulSoup(riskFactors, 'html.parser')
    subject_lst = []
    for span in soup.find_all('span'):
        style = span.get('style')
        if style and subjectFormat in style:
            subject_lst.append(span.get_text())
    return subject_lst
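To illustrate what "subjects" means here, the sketch below collects the text of bold `<span>` elements from a toy snippet. It deliberately swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra packages; the class name `BoldSpanCollector` and the sample HTML are assumptions, and nested spans are handled differently from `find_all`.

```python
from html.parser import HTMLParser

class BoldSpanCollector(HTMLParser):
    """Collect the text of <span> elements whose style contains style_marker (illustrative only)."""

    def __init__(self, style_marker):
        super().__init__()
        self.style_marker = style_marker
        self.depth = 0          # greater than 0 while inside a matching span
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        if self.depth or self.style_marker in style:
            self.depth += 1
            if self.depth == 1:
                self.subjects.append("")   # start a new subject

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.subjects[-1] += data

toy = (
    '<div><span style="font-weight:700">Commodity prices are volatile.</span></div>'
    '<div><span style="font-weight:400">Prices depend on many factors.</span></div>'
)
collector = BoldSpanCollector("font-weight:700")
collector.feed(toy)
collector.close()
print(collector.subjects)  # ['Commodity prices are volatile.']
```

Only the bold heading is kept; the regular-weight body text is ignored, which is the behavior the subject extraction relies on.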

In [10]:
rawTextFolderPath = "Risk Factor Text/Energy Industry/Industry_1311/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in rawTextFileName_lst:  
    # read each document as UTF-8 text
    filePath = str(rawTextFolderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [11]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Risk Factor Text/Energy Industry/Industry_1311/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0001062993-23-007669.txt'}
The subject fail to extract for {'0001185185-22-000456.txt'}
The subject fail to extract for {'0001185185-23-000244.txt'}
The subject fail to extract for {'0001185185-23-000305.txt'}
The subject fail to extract for {'0001213900-22-019928.txt'}
The subject fail to extract for {'0001437749-22-008152.txt'}
The subject fail to extract for {'0001437749-23-004361.txt'}
The subject fail to extract for {'0001437749-23-008793.txt'}
The subject fail to extract for {'0001515971-22-000125.txt'}
The subject fail to extract for {'0001640334-22-001335.txt'}
The subject fail to extract for {'0001654954-22-004427.txt'}
The subject fail to extract for {'0001654954-23-003759.txt'}
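The `while` loop in the subject extraction cell tries each style marker in turn until one yields a result. A sketch of the same fallback written as a `for` loop is given below; `first_nonempty` and the stand-in extractor `toy_extract` are hypothetical and only serve to show the control flow.

```python
def first_nonempty(extract, text, formats):
    # Hypothetical helper: return the first non-empty result over the candidate formats.
    for fmt in formats:
        result = extract(text, fmt)
        if result:
            return result
    return []

def toy_extract(text, fmt):
    # Stand-in for extract_subjects: keeps the lines tagged with the given format.
    prefix = fmt + "::"
    return [line[len(prefix):] for line in text.splitlines() if line.startswith(prefix)]

toy_text = "font-weight:bold::Risks related to our business\nfont-weight:400::Body text"
subjects = first_nonempty(toy_extract, toy_text, ["font-weight:700", "font-weight:bold"])
print(subjects)  # ['Risks related to our business']
```

Files for which every format returns an empty list, such as those listed above, likely mark their headings with something other than a bold `<span>` and need a different marker or manual handling.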


---

## 2.1.2 Industry 1389

### 1. Read in the files and store in a dictionary

In [12]:
folderPath = "../Section 1 - Datasets Downloads and Preparation/Downloaded Financial Reports/Energy Industry/Industry_1389/"
fileName_lst = os.listdir(folderPath)
filePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in fileName_lst:  
    # read each document as UTF-8 text
    filePath = str(folderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText

### 2. Extract the raw risk factor text

In [13]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # drop the table of contents and keep the main body
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Risk Factor Text/Energy Industry/Industry_1389/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0000045012-23-000011.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000073756-23-000009.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000866829-23-000010.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-010960.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-22-016757.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-23-004408.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001437749-23-008841.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001532286-23-000005.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001558370-23-001765.txt'} because of the error in seperating main body.
The extraction is failed for {'000155837

### 3. Extract the subjects from raw risk factors text

In [14]:
rawTextFolderPath = "Risk Factor Text/Energy Industry/Industry_1389/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in rawTextFileName_lst:  
    # read each document as UTF-8 text
    filePath = str(rawTextFolderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [15]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Risk Factor Text/Energy Industry/Industry_1389/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0001692427-23-000008.txt'}


---

## 2.1.3 Industry 4911

### 1. Read in the files and store in a dictionary

In [16]:
folderPath = "../Section 1 - Datasets Downloads and Preparation/Downloaded Financial Reports/Energy Industry/Industry_4911/"
fileName_lst = os.listdir(folderPath)
filePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in fileName_lst:  
    # read each document as UTF-8 text
    filePath = str(folderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        fileText = file.read()
    filePair_dict[fileName] = fileText

### 2. Extract the raw risk factor text

In [17]:
for key, value in filePair_dict.items():
    fileName = key
    fileText = value
    mainBody = separate_main_body(fileText) # drop the table of contents and keep the main body
    
    if not mainBody:
        print("The extraction is failed for", {fileName}, "because of the error in seperating main body.")
        continue
    
    riskText = extract_risk_factors(mainBody, fileName) # extract the risk related text
    if not riskText:
        print("The extraction is failed for", {fileName}, "because of the error in extracting risk factors.")
    else:
        riskFilePath = "Risk Factor Text/Energy Industry/Industry_4911/Raw Text/" + fileName
        with open(riskFilePath, 'w', encoding='utf-8') as file:
            file.write(riskText)

The extraction is failed for {'0000072741-23-000004.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000784977-23-000005.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000827052-23-000010.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000936340-23-000073.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0000950170-23-003912.txt'} because of the error in seperating main body.
The extraction is failed for {'0000950170-23-007785.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001013871-23-000004.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001089819-23-000003.txt'} because of the error in extracting risk factors.
The extraction is failed for {'0001108426-23-000006.txt'} because of the error in seperating main body.
The extraction is failed for {'0001130310-2

### 3. Extract the subjects from raw risk factors text

In [18]:
rawTextFolderPath = "Risk Factor Text/Energy Industry/Industry_4911/Raw Text/"
rawTextFileName_lst = os.listdir(rawTextFolderPath)
rawTextFilePair_dict = {} # filePair stores the file name and the corresponding file text.
for fileName in rawTextFileName_lst:  
    # read each document as UTF-8 text
    filePath = str(rawTextFolderPath) + str(fileName)    
    with open(filePath, 'r', encoding='utf-8') as file:
        rawText = file.read()
    rawTextFilePair_dict[fileName] = rawText

In [19]:
subjectFormat_lst = ["font-weight:700", "font-weight:bold"]
for key, value in rawTextFilePair_dict.items():
    fileName = key
    rawText = value
    
    i = 0
    subject_lst = []
    
    while not subject_lst:
        subjectFormat = subjectFormat_lst[i]
        subject_lst = extract_subjects(rawText, subjectFormat)
        i += 1
        if i >= len(subjectFormat_lst):
            break
        
    if not subject_lst:
        print("The subject fail to extract for", {fileName})
    else:
        subjectFolder = "Risk Factor Text/Energy Industry/Industry_4911/Subject/"
        subjectFilePath = subjectFolder + str(fileName)
        with open(subjectFilePath, 'w', encoding='utf-8') as file:
            for item in subject_lst:
                file.write(item + '\n')

The subject fail to extract for {'0001193311-23-000005.txt'}
The subject fail to extract for {'0001437749-23-004477.txt'}
