<font color='#E27271'>

# *Unveiling Complex Interconnections Among Companies through Learned Embeddings*</font>

-----------------------
<font color='#E27271'>

Ethan Moody, Eugene Oon, and Sam Shinde</font>

<font color='#E27271'>

August 2023</font>

-----------------------
<font color='#00AED3'>

# **Download 10K Data** </font>
-----------------------

We initially explored publicly available stock screeners for the GICS sector labels and the WRDS SEC Analytics Suite for business descriptions sourced from Form 10-Ks for NYSE/NASDAQ companies. However, we encountered two issues with this approach:
1. some of the GICS classifications were outdated, and
2. accurately extracting relevant sections from Form 10-Ks using regex proved to be time-consuming.

We modified our approach to instead pull up-to-date GICS classification labels from TD Ameritrade's trading platform and 2022 Form 10-K data using the [SEC API](https://sec-api.io/). This adjustment gave us more accurate data that required less cleaning.

This notebook captures all the steps to download the data using the subscription-based SEC API.

# [1] Installs

In [None]:
!pip install google-colab-shell --quiet
!pip install wikipedia --quiet
!pip install sec-api --quiet

  Preparing metadata (setup.py) ... done
  Building wheel for google-colab-shell (setup.py) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for wikipedia (setup.py) ... done


# [2] Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# [3] Imports

In [None]:
# Standard library
import os, sys
import json
import re
import calendar
from datetime import date, datetime, timedelta
from urllib.parse import unquote

# Third-party packages
import requests
import pandas as pd
import yfinance as yf
import psycopg2
from psycopg2.extras import json as psycop_json
import html5lib
import wikipedia
from bs4 import BeautifulSoup

# [4] Get SEC API KEY

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Load the SEC API key from the config file                     #
# config.txt is read as a CSV with a SECAPIKEY column           #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
cfg = pd.read_csv('/content/gdrive/My Drive/config.txt')
secapikey=str(cfg['SECAPIKEY'][0])
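
The cell above only requires that `config.txt` can be parsed by `pd.read_csv` and exposes a `SECAPIKEY` column. The file itself is not shown in this notebook, so the layout below (a header row followed by one value row) is an assumption, and the key is a placeholder. A minimal standard-library sketch of reading such a file:

```python
import csv
import io

# Hypothetical contents of config.txt (assumed layout, placeholder key)
SAMPLE_CONFIG = "SECAPIKEY\nYOUR_SEC_API_KEY\n"

def read_api_key(text, column="SECAPIKEY"):
    """Return the first value of `column` from CSV-formatted text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or column not in rows[0]:
        raise KeyError(f"{column} not found in config")
    return str(rows[0][column])

api_key = read_api_key(SAMPLE_CONFIG)
print(api_key)
```

Failing loudly when the column is absent may be preferable to discovering a bad key only on the first API call.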

# [5] Extract SEC 10K data

## [5.1] Extract Filing URLs

The `stocks.csv` file, downloaded from TD Ameritrade, maps each ticker to its GICS classification and is our starting point. We use its tickers to query the SEC API.

In this section, we get the 10K URLs only for the year 2022.
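
As a rough illustration of the per-ticker search described here, the sketch below builds the query string for one ticker and one reporting year. The helper name `build_10k_query` is hypothetical, and computing the last day of the end month with `calendar.monthrange` is an addition that the notebook does not make (it hard-codes day 31, which is valid for December).

```python
import calendar

def build_10k_query(ticker, year, start_month=1, end_month=12):
    """Build a query string matching 10-K filings for one ticker and period window."""
    # Last valid day of the end month, so the range stays a real date
    last_day = calendar.monthrange(year, end_month)[1]
    return (
        'formType:("10-K") AND '
        f"ticker:{ticker} AND "
        f"periodOfReport:[{year}-{start_month:02d}-01 "
        f"TO {year}-{end_month:02d}-{last_day:02d}]"
    )

print(build_10k_query("TM", 2022))
```

The resulting string is what gets assigned to the `query` field of the request body before each call.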

In [None]:
# Query API call
from sec_api import QueryApi

queryApi = QueryApi(api_key=secapikey)

In [None]:
stocks_df = pd.read_csv('/content/gdrive/My Drive/data/stocks/stocks.csv')
stocks_df = stocks_df[stocks_df['ticker'].notna()]
#stocks_df = stocks_df[stocks_df['ticker']=='TSLA']
stocks_df.reset_index(drop=True, inplace=True)
stocks_df.head()

Unnamed: 0,ind,ticker,name,mrkt_cap,sector,industry,industry_group,sp
0,NYSE,TM,Toyota Motor Corp (ADR),$219.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
1,NYSE,STLA,STELLANTIS NV,$56.2B,Consumer Discretionary,Automobiles,Automobiles & Components,N
2,NYSE,HMC,Honda Motor Co Ltd (ADR),$49.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
3,NASDAQ,LI,Li Auto Inc (ADR),$34.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
4,NASDAQ,MBLY,Mobileye Global Inc,$32.8B,Consumer Discretionary,Automobile Components,Automobiles & Components,N


In [None]:
# Base Query
base_query = {
  "query": {
      "query_string": {
          "query": "PLACEHOLDER", # this will be set during runtime
          "time_zone": "America/New_York"
      }
  },
  "from": "0",
  "size": "200", # dont change this
  # sort returned filings by periodOfReport, newest first
  "sort": [{ "periodOfReport": { "order": "desc" } }]
}

# open the file we use to store the filing URLs
path = "/content/gdrive/My Drive/data/10K/Reference Files/filings_urls_final.json"
results = []

# search window: filings whose periodOfReport falls in 2022
# widen startyear/endyear to cover other reporting years
startyear = 2022
endyear = 2022
startmonth = 1
endmonth = 12
ttlcnt = stocks_df.shape[0]
cnt = 0
for index, row in stocks_df.iterrows():
  cnt+=1
  print('\b'*100, end = '')
  tkr = row['ticker']
  print(f'{cnt}/{ttlcnt} [ticker: {tkr}]', end='')
  # print(tkr)
  # resulting query example:
  # formType:("10-K") AND ticker:TM AND periodOfReport:[2022-01-01 TO 2022-12-31]
  universe_query = (
      'formType:("10-K") AND '
      f'ticker:{tkr} AND '
      f'periodOfReport:[{startyear}-{startmonth:02d}-01 TO {endyear}-{endmonth:02d}-31]'
  )

  # print(universe_query)
  # set the query string for this ticker
  base_query["query"]["query_string"]["query"] = universe_query

  # paginate through results by increasing the "from" parameter
  # until no more matches are returned
  for from_batch in range(0, 9800, 200):
    # set new "from" starting position of search
    base_query["from"] = from_batch

    response = queryApi.get_filings(base_query)

    # no more filings in search universe
    if len(response["filings"]) == 0:
      break

    results.extend(response["filings"])

with open(path, "w", encoding="utf-8") as file:
  json.dump(results, file, indent=4, separators=(',',': '))

1/6592 [ticker: TM]2/6592 [ticker: STLA]3/6592 [ticker: HMC]4/6592 [ticker: LI]5/6592 [ticker: MBLY]6/6592 [ticker: RIVN]7/6592 [ticker: NIO]8/6592 [ticker: LCID]
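
The pagination pattern used in the loop above (advance `from` in steps of the page size and stop at the first empty page) can be sketched independently of the API. Here `fetch_all` and `fake_get_page` are hypothetical names, and the fake page source stands in for `queryApi.get_filings`, so the example runs without an API key.

```python
def fetch_all(get_page, page_size=200, max_from=9800):
    """Collect results page by page until an empty page is returned."""
    results = []
    for start in range(0, max_from, page_size):
        page = get_page(start, page_size)
        if not page:
            break  # no more results in the search universe
        results.extend(page)
    return results

# Stand-in for the API: 450 fake filings served in slices
FAKE_FILINGS = [{"id": i} for i in range(450)]

def fake_get_page(start, size):
    return FAKE_FILINGS[start:start + size]

all_filings = fetch_all(fake_get_page)
print(len(all_filings))
```

With 450 fake filings the loop requests offsets 0, 200, 400, and 600, and stops when the last request comes back empty.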

In [None]:
# Load filings URL data
path = "/content/gdrive/My Drive/data/10K/Reference Files/filings_urls_final.json"
filings = pd.read_json(path)
# if filings['periodOfReport'] == '':
#   report_date = filings['filedAt']
#   report_date = datetime.strptime(report_date, "%Y-%m-%d")
#   # report_date = str_to_date(report_date)
#   if report_date.month <= 3:
#     mnth = 12
#     year = report_date.year - 1
#   else:
#     mnth = int((report_date.month - 1) / 3) * 3
#     year = report_date.year
#   dy = calendar.monthrange(year, mnth)[1]
#   report_date = date(year, mnth, dy)

# filings['periodOfReportFinal']=datetime.strftime(report_date, "%Y-%m-%d")
filings['year']=pd.DatetimeIndex(filings['periodOfReport']).year

filings_clean = filings[['ticker','cik','formType','filedAt','linkToTxt','linkToHtml','periodOfReport','year']]
filings_clean = filings_clean.sort_values(by=['ticker','year','formType'], ignore_index=True)
filings_clean = filings_clean[(filings_clean['year']==2022) & (filings_clean['formType']=='10-K')].drop_duplicates(subset=['ticker','year'], keep='first').reset_index(drop = True)
filings_clean.to_csv('/content/gdrive/My Drive/data/10K/Reference Files/filings_clean_final.csv')
filings_clean.shape

(4562, 8)
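
The commented-out lines in the cell above outline a fallback for filings with an empty `periodOfReport`: estimate the period end as the last day of the quarter that closed before the filing month. That fallback is not active in the notebook; the sketch below is one possible reading of it, with a hypothetical function name, and it parses the ISO 8601 timestamps that appear in the `filedAt` column.

```python
import calendar
from datetime import date, datetime

def infer_period_of_report(filed_at):
    """Estimate the reporting period end from an ISO 8601 filing timestamp."""
    filed = datetime.fromisoformat(filed_at)
    if filed.month <= 3:
        # Filed in Q1: assume the period ended last December
        year, month = filed.year - 1, 12
    else:
        # Otherwise use the last month of the previous calendar quarter
        year, month = filed.year, ((filed.month - 1) // 3) * 3
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)

print(infer_period_of_report("2023-02-24T21:45:33-05:00"))
print(infer_period_of_report("2022-08-11T16:17:06-04:00"))
```

This is only a heuristic: companies whose fiscal year does not end on a calendar quarter boundary would get an approximate date.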

In [None]:
cleanpath = "/content/gdrive/My Drive/data/10K/Reference Files/filings_clean_final.csv"
filings_clean = pd.read_csv(cleanpath)
filings_clean = filings_clean.sort_values(by=['ticker','year','formType'], ignore_index=True)
print(filings_clean.shape)
filings_clean.head()

(4562, 9)


Unnamed: 0.1,Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year
0,0,A,1090872,10-K,2022-12-20T18:42:30-05:00,https://www.sec.gov/Archives/edgar/data/109087...,https://www.sec.gov/Archives/edgar/data/109087...,2022-10-31,2022
1,1,AA,1675149,10-K,2023-02-23T16:34:17-05:00,https://www.sec.gov/Archives/edgar/data/167514...,https://www.sec.gov/Archives/edgar/data/167514...,2022-12-31,2022
2,2,AAC,1829432,10-K,2023-02-28T16:47:01-05:00,https://www.sec.gov/Archives/edgar/data/182943...,https://www.sec.gov/Archives/edgar/data/182943...,2022-12-31,2022
3,3,AACI,1844817,10-K,2022-12-21T21:16:25-05:00,https://www.sec.gov/Archives/edgar/data/184481...,https://www.sec.gov/Archives/edgar/data/184481...,2022-09-30,2022
4,4,AADI,1422142,10-K,2023-03-28T17:34:21-04:00,https://www.sec.gov/Archives/edgar/data/142214...,https://www.sec.gov/Archives/edgar/data/142214...,2022-12-31,2022


## [5.2] Extract Section 1

In this section, we use the URLs captured in the earlier step (5.1) to extract `Item 1` using the `ExtractorApi`.

In [None]:
from sec_api import ExtractorApi

extractorApi = ExtractorApi(secapikey)

In [None]:
stocks_df = pd.read_csv('/content/gdrive/My Drive/data/stocks/stocks.csv')
stocks_df = stocks_df[stocks_df['ticker'].notna()]
# stocks_df = stocks_df[stocks_df['ticker']=='LCID']
stocks_df.reset_index(drop=True, inplace=True)
stocks_df.head()

Unnamed: 0,ind,ticker,name,mrkt_cap,sector,industry,industry_group,sp
0,NYSE,TM,Toyota Motor Corp (ADR),$219.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
1,NYSE,STLA,STELLANTIS NV,$56.2B,Consumer Discretionary,Automobiles,Automobiles & Components,N
2,NYSE,HMC,Honda Motor Co Ltd (ADR),$49.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
3,NASDAQ,LI,Li Auto Inc (ADR),$34.9B,Consumer Discretionary,Automobiles,Automobiles & Components,N
4,NASDAQ,MBLY,Mobileye Global Inc,$32.8B,Consumer Discretionary,Automobile Components,Automobiles & Components,N


In [None]:
# S&P files are saved in the Reference files folder as we will be
# doing some cleanup on these files before we finalize
sp_path = "/content/gdrive/My Drive/data/10K/Reference Files/sp500.json"
sp_csv_path = "/content/gdrive/My Drive/data/10K/Reference Files/sp500_summary.csv"
# Non S&P files are treated final
nsp_path = "/content/gdrive/My Drive/data/10K/nsp500_final_final.json"
nsp_csv_path = "/content/gdrive/My Drive/data/10K/nsp500_summary_final.csv"

stime = datetime.now()

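# Remove files if they exist (each path is handled separately,
# so one missing file does not skip the others)
for old_path in (sp_path, sp_csv_path, nsp_path, nsp_csv_path):
  try:
    os.remove(old_path)
  except FileNotFoundError:
    pass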

# Empty Lists
sp_output_data = []
sp_output_summary_data = []
nsp_output_data = []
nsp_output_summary_data = []


# Loop for each symbol
ttlcnt = stocks_df.shape[0]
spcnt = 0
nspcnt = 0
for index, row in stocks_df.iterrows():
  ticker = row['ticker']
  if row['sp']=='Y':
    spcnt+=1
  else:
    nspcnt+=1
  print('\b'*100, end = '')
  print(f'{spcnt+nspcnt}/{ttlcnt} [sp:{spcnt}] [nsp:{nspcnt}] [ticker: {ticker} - {row["sp"]}]', end='')
  prev_year = 0
  for ind, rec in filings_clean[filings_clean['ticker']==ticker].iterrows():
    filing_url = rec['linkToHtml']
    try:
      section1 = extractorApi.get_section(filing_url=filing_url,
                                              section='1',
                                              return_type="text")
    except Exception as e:
      section1 = "Error"
    # try:
    #   section1a = extractorApi.get_section(filing_url=filing_url,
    #                                           section='1A',
    #                                           return_type="text")
    # except Exception as e:
    #   section1a = "No Data"
    # try:
    #   section7 = extractorApi.get_section(filing_url=filing_url,
    #                                           section='7',
    #                                           return_type="text")
    # except Exception as e:
    #   section7 = "No Data"
    # try:
    #   section7a = extractorApi.get_section(filing_url=filing_url,
    #                                           section='7A',
    #                                           return_type="text")
    # except Exception as e:
    #   section7a = "No Data"
    result_dict = {
        'ticker': ticker,
        'cik': rec['cik'],
        'formType': rec['formType'],
        'filedAt': rec['filedAt'],
        'linkToTxt': rec['linkToTxt'],
        'linkToHtml': rec['linkToHtml'],
        'periodOfReport': rec['periodOfReport'],
        'year': rec['year'],
        'ind': row['ind'],
        'name': row['name'],
        'sector': row['sector'],
        'industry': row['industry'],
        'industry_group': row['industry_group'],
        'business_cnt': len(section1),
        # 'risk_factors_cnt': len(section1a),
        # 'mgmt_d_and_a_cnt': len(section7),
        # 'quant_qual_mkt_risk_cnt': len(section7a),
        'business': section1
        # 'risk_factors': section1a,
        # 'mgmt_d_and_a': section7,
        # 'quant_qual_mkt_risk': section7a
    }
    result_summary_dict = {
        'ticker': ticker,
        'cik': rec['cik'],
        'formType': rec['formType'],
        'filedAt': rec['filedAt'],
        'linkToTxt': rec['linkToTxt'],
        'linkToHtml': rec['linkToHtml'],
        'periodOfReport': rec['periodOfReport'],
        'year': rec['year'],
        'ind': row['ind'],
        'name': row['name'],
        'sector': row['sector'],
        'industry': row['industry'],
        'industry_group': row['industry_group'],
        'business_cnt': len(section1)
        # 'risk_factors_cnt': len(section1a),
        # 'mgmt_d_and_a_cnt': len(section7),
        # 'quant_qual_mkt_risk_cnt': len(section7a)
    }
    if row['sp']=='Y':
      sp_output_data.append(result_dict)
      sp_output_summary_data.append(result_summary_dict)
    else:
      nsp_output_data.append(result_dict)
      nsp_output_summary_data.append(result_summary_dict)

# Convert date objects to strings
for item in sp_output_data:
  for key, value in item.items():
    if isinstance(value, date):
      item[key] = value.isoformat()

for item in nsp_output_data:
  for key, value in item.items():
    if isinstance(value, date):
      item[key] = value.isoformat()

# Write the results to a JSON file
#path = "/content/gdrive/My Drive/data/10K/" + ticker + "_new.json"

# Write to detail file - SP500 data
with open(sp_path, "w", encoding="utf-8") as file:
  json.dump(sp_output_data, file, indent=4, separators=(',',': '))

# Write to summary file - SP500 data
sp_output_sum_df = pd.DataFrame.from_records(sp_output_summary_data)
sp_output_sum_df.to_csv(sp_csv_path, encoding='utf-8')

# Write to detail file - Non SP500 data
with open(nsp_path, "w", encoding="utf-8") as file:
  json.dump(nsp_output_data, file, indent=4, separators=(',',': '))

# Write to summary file - Non SP500 data
nsp_output_sum_df = pd.DataFrame.from_records(nsp_output_summary_data)
nsp_output_sum_df.to_csv(nsp_csv_path, encoding='utf-8')

etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')


1/6592 [sp:0] [nsp:1] [ticker: TM - N]2/6592 [sp:0] [nsp:2] [ticker: STLA - N]3/6592 [sp:0] [nsp:3] [ticker: HMC - N]4/6592 [sp:0] [nsp:4] [ticker: LI - N]5/6592 [sp:0] [nsp:5] [ticker: MBLY - N]6/6592 [sp:0] [nsp:6] [ticker: RIVN - N]7/6592 [sp:0] [nsp:7] [ticker: NIO - N]
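
The extraction loop above stores the string `"Error"` when a call fails and does not retry. As an alternative that the notebook does not use, a failed call could be retried a few times and reported as `None`, which keeps failures distinguishable from genuine text. The sketch below takes the extractor as a plain callable so that it runs without an API key; `extract_with_retry` and `flaky_get_section` are hypothetical names.

```python
import time

def extract_with_retry(get_section, filing_url, retries=3, delay=0.0):
    """Call get_section(filing_url) up to `retries` times.

    Returns the extracted text, or None if every attempt raised.
    """
    for attempt in range(retries):
        try:
            return get_section(filing_url)
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)  # wait before the next attempt
    return None

# Stand-in extractor that fails twice, then succeeds (hypothetical behavior)
calls = {"n": 0}

def flaky_get_section(filing_url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "Item 1. Business ..."

text = extract_with_retry(flaky_get_section, "https://example.com/filing.htm")
print(text is not None, calls["n"])
```

In the notebook's setting, the callable could be a small wrapper such as `lambda url: extractorApi.get_section(filing_url=url, section='1', return_type="text")`.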

# [6] Troubleshooting and Verification

## [6.1] Function to remove unwanted text

In [None]:
def clean(rawtext):
  r"""Remove text that might hurt model performance:
      - literal '\xa0' and '\n' escape sequences
      - consecutive whitespace and new line characters
      - HTML table content
      - every character except letters (a-z, A-Z), spaces, and dots (.)
      - leftover 'I tem', 'TABLESTART', and 'TABLEEND' tokens
  """

  # Remove the literal escape sequence for a non-breaking space
  rawtext = rawtext.replace('\\xa0','')

  # Remove the literal escape sequence for a new line
  rawtext = rawtext.replace('\\n','')

  # Collapse runs of two or more whitespace characters into one space
  rawtext = re.sub(r'\s\s+',' ',rawtext)

  # Replace remaining new lines with a space
  rawtext = re.sub(r'\n',' ',rawtext)

  # Remove HTML table content
  rawtext = re.sub(r"(?is)<table[^>]*>(.*?)</table>", "", rawtext)

  # Remove every character that is not a letter, a space, or a dot;
  # this also turns '##TABLE_START' / '##TABLE_END' into 'TABLESTART' / 'TABLEEND'
  rawtext = re.sub(r'[^A-Za-z .]+','',rawtext)

  # Remove leftover tokens
  rawtext = re.sub('I tem','',rawtext)
  rawtext = re.sub('TABLEEND','',rawtext)
  rawtext = re.sub('TABLESTART','',rawtext)

  # Collapse consecutive spaces into one
  rawtext = re.sub(' +', ' ', rawtext)

  return rawtext
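
To see why `clean` removes the tokens `TABLESTART` and `TABLEEND` rather than `##TABLE_START` and `##TABLE_END`, it helps to trace the character filter on a small input. The sketch below reproduces only three of the steps (whitespace collapsing, the character filter, and token removal) under a hypothetical name, `strip_markers`, and runs them on a made-up sample.

```python
import re

def strip_markers(text):
    """Apply a reduced version of the cleaning steps to show their interaction."""
    text = re.sub(r"\s+", " ", text)            # normalize whitespace first
    text = re.sub(r"[^A-Za-z .]+", "", text)    # drops '#', '_', and digits
    text = re.sub(r"TABLESTART|TABLEEND", "", text)
    return re.sub(r" +", " ", text)             # collapse leftover spaces

sample = "Revenue grew.\n\n##TABLE_START 2022 2021 ##TABLE_END\n\nWe operate globally."
print(strip_markers(sample))
```

Because the filter strips `#` and `_` before the token removal runs, the markers have already been reduced to bare letters by the time they are matched.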

In [None]:
# Load S&P files and analyse the data
sp_path = "/content/gdrive/My Drive/data/10K/Reference Files/sp500.json"
sp_csv_path = "/content/gdrive/My Drive/data/10K/Reference Files/sp500_summary.csv"

sp_df = pd.read_json(sp_path)

## [6.2] Investigate Small Descriptions in S&P 500 dataset

In [None]:
sp_investigate_df = sp_df[sp_df['business_cnt']<8000].sort_values(
    by=['business_cnt'],ascending=True, ignore_index=True)
sp_investigate_df

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business
0,C,831001,10-K,2023-02-24T21:45:33-05:00,https://www.sec.gov/Archives/edgar/data/831001...,https://www.sec.gov/Archives/edgar/data/831001...,2022-12-31,2022,NYSE,Citigroup Inc,Financials,Banks,Banks,0,
1,MA,1141391,10-K,2023-02-14T16:48:22-05:00,https://www.sec.gov/Archives/edgar/data/114139...,https://www.sec.gov/Archives/edgar/data/114139...,2022-12-31,2022,NYSE,Mastercard Inc,Financials,Financial Services,Financial Services,0,
2,EIX,827052,10-K,2023-02-23T16:10:08-05:00,https://www.sec.gov/Archives/edgar/data/827052...,https://www.sec.gov/Archives/edgar/data/827052...,2022-12-31,2022,NYSE,Edison International,Utilities,Electric Utilities,Utilities,11,BUSINESS\n\n
3,ILMN,1110803,10-K,2022-02-18T15:11:17-05:00,https://www.sec.gov/Archives/edgar/data/111080...,https://www.sec.gov/Archives/edgar/data/111080...,2022-01-02,2022,NASDAQ,"Illumina, Inc.",Health Care,Life Sciences Tools & Services,"Pharmaceuticals, Biotechnology & Life Sciences",21,Item 1 \n\nBusiness \n\n
4,MCD,63908,10-K,2023-02-24T12:34:36-05:00,https://www.sec.gov/Archives/edgar/data/63908/...,https://www.sec.gov/Archives/edgar/data/63908/...,2022-12-31,2022,NYSE,McDonald's Corp,Consumer Discretionary,"Hotels, Restaurants & Leisure",Consumer Services,33,"Item 1 Business Pages 3-7, 9-10"
5,GE,40545,10-K,2023-02-10T08:32:52-05:00,https://www.sec.gov/Archives/edgar/data/40545/...,https://www.sec.gov/Archives/edgar/data/40545/...,2022-12-31,2022,NYSE,General Electric Co,Industrials,Industrial Conglomerates,Capital Goods,35,"Item 1. Business 4-6, 8-14, 80-83"
6,SYF,1601712,10-K,2023-02-09T16:55:13-05:00,https://www.sec.gov/Archives/edgar/data/160171...,https://www.sec.gov/Archives/edgar/data/160171...,2022-12-31,2022,NYSE,Synchrony Financial,Financials,Consumer Finance,Financial Services,41,"Item 1. \n\nBusiness \n\n7 - 26 , 80 - 95 \n\n"
7,CAH,721371,10-K,2022-08-11T16:17:06-04:00,https://www.sec.gov/Archives/edgar/data/721371...,https://www.sec.gov/Archives/edgar/data/721371...,2022-06-30,2022,NYSE,Cardinal Health Inc,Health Care,Health Care Providers & Services,Health Care Equipment & Services,147,Part 1 1 \n\nBusiness \n\n1A \n\nRisk Factors...
8,INTC,50863,10-K,2023-01-26T18:31:20-05:00,https://www.sec.gov/Archives/edgar/data/50863/...,https://www.sec.gov/Archives/edgar/data/50863/...,2022-12-31,2022,NASDAQ,Intel Corporation,Information Technology,Semiconductors & Semiconductor Equipment,Semiconductors & Semiconductor Equipment,164,Item 1. Business: General development of busi...
9,HON,773840,10-K,2023-02-10T14:09:31-05:00,https://www.sec.gov/Archives/edgar/data/773840...,https://www.sec.gov/Archives/edgar/data/773840...,2022-12-31,2022,NASDAQ,Honeywell International Inc,Industrials,Industrial Conglomerates,Capital Goods,7792,\n\n##TABLE_START \n\n##TABLE_END\n\nREVIEW O...
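
The filter above treats any description shorter than 8,000 characters as suspect. A minimal sketch of the same check on plain dictionaries follows; the sample records are made up, and `flag_short` is a hypothetical helper rather than part of the notebook.

```python
MIN_CHARS = 8000  # same threshold as the filter above

def flag_short(records, min_chars=MIN_CHARS):
    """Return (ticker, length) pairs for short descriptions, shortest first."""
    short = [
        (r["ticker"], len(r["business"]))
        for r in records
        if len(r["business"]) < min_chars
    ]
    return sorted(short, key=lambda pair: pair[1])

records = [
    {"ticker": "AAA", "business": "x" * 12000},         # hypothetical full description
    {"ticker": "BBB", "business": "Item 1. Business"},  # hypothetical contents stub
    {"ticker": "CCC", "business": ""},                  # hypothetical empty extraction
]
print(flag_short(records))
```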


In [None]:
# Filter on only S&P 500
stocks_sp_df = stocks_df[stocks_df['sp']=='Y']
stocks_sp_df.reset_index(drop=True, inplace=True)
stocks_sp_df.head()

Unnamed: 0,ind,ticker,name,mrkt_cap,sector,industry,industry_group,sp
0,NASDAQ,TSLA,Tesla Inc,$880.2B,Consumer Discretionary,Automobiles,Automobiles & Components,Y
1,NYSE,F,Ford Motor Co,$60.6B,Consumer Discretionary,Automobiles,Automobiles & Components,Y
2,NYSE,GM,General Motors Co,$55.3B,Consumer Discretionary,Automobiles,Automobiles & Components,Y
3,NYSE,APTV,Aptiv PLC,$30.4B,Consumer Discretionary,Automobile Components,Automobiles & Components,Y
4,NYSE,BWA,BorgWarner Inc.,$10.6B,Consumer Discretionary,Automobile Components,Automobiles & Components,Y


In [None]:
# tkr = input('Enter Ticker:')
for ind, rec in sp_investigate_df.iterrows():
  tkr = rec['ticker']
  txt_link = rec['linkToTxt']
  html_link = rec['linkToHtml']
  print(tkr)
  print(html_link)
  print(txt_link)
  print(rec['business'])
  print('-'*100)

MA
https://www.sec.gov/Archives/edgar/data/1141391/000114139123000020/0001141391-23-000020-index.htm
https://www.sec.gov/Archives/edgar/data/1141391/000114139123000020/0001141391-23-000020.txt

----------------------------------------------------------------------------------------------------
C
https://www.sec.gov/Archives/edgar/data/831001/000083100123000037/0000831001-23-000037-index.htm
https://www.sec.gov/Archives/edgar/data/831001/000083100123000037/0000831001-23-000037.txt

----------------------------------------------------------------------------------------------------
EIX
https://www.sec.gov/Archives/edgar/data/827052/000082705223000010/0000827052-23-000010-index.htm
https://www.sec.gov/Archives/edgar/data/827052/000082705223000010/0000827052-23-000010.txt
 BUSINESS


----------------------------------------------------------------------------------------------------
ILMN
https://www.sec.gov/Archives/edgar/data/1110803/000111080322000013/0001110803-22-000013-index.htm
https

## [6.3] Investigate Missing stocks in S&P500 dataset

In [None]:
# Missing tickers from S&P500 dataset
stocks_sp_df[~stocks_sp_df['ticker'].isin(list(sp_df['ticker']))].head()

Unnamed: 0,ind,ticker,name,mrkt_cap,sector,industry,industry_group,sp
296,NYSE,PRU,Prudential Financial Inc,$32.9B,Financials,Insurance,Insurance,Y
305,NYSE,RE,Everest Re Group Ltd,$15.0B,Financials,Insurance,Insurance,Y
499,NYSE,CBOE,Cboe Global Markets Inc,$15B,Financials,Capital Markets,Financial Services,Y


## [6.4] Manually capture description for 13 S&P 500 Stocks

From the earlier steps, we learned that the extracted `business` description is empty or truncated for 10 companies (section 6.2) and that 3 companies are missing from the dataset entirely (section 6.3).

For these 13 companies, we manually captured the `business` description from their 2022 Form 10-K reports.

In [None]:
# Manual Input
mi_path = "/content/gdrive/My Drive/data/10K/manual_input.json"
mi_df = pd.read_json(mi_path)
mi_df


Unnamed: 0,ticker,business
0,MA,Overview\r\nMastercard is a technology company...
1,C,OVERVIEW\r\n\r\n Citigroup’s history dates ...
2,EIX,"BUSINESS\r\n\r\n CORPORATE STRUCTURE, I..."
3,ILMN,BUSINESS OVERVIEW\r\n\r\n We ar...
4,MCD,ABOUT McDONALD'S\r\n \r\n McDon...
5,GE,ABOUT GENERAL ELECTRIC. General Electric Compa...
6,SYF,OUR BUSINESS\r\n Our Company\r\n ...
7,CAH,Business\r\n General\r\n ...
8,INTC,"2022 revenue was $63.1 billion, down $16.0 bil..."
9,HON,ABOUT HONEYWELL\r\n Honeywell I...


## [6.5] Generate Final S&P500 data files

In [None]:
sp_path_new = "/content/gdrive/My Drive/data/10K/sp500_final.json"
sp_csv_path_new = "/content/gdrive/My Drive/data/10K/sp500_summary_final.csv"
sp_stocks_df = stocks_df[stocks_df['sp']=='Y']
stime = datetime.now()

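# Remove files if they exist (each path is handled separately,
# so one missing file does not skip the other)
for old_path in (sp_path_new, sp_csv_path_new):
  try:
    os.remove(old_path)
  except FileNotFoundError:
    pass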

sp_output_data = []
sp_output_summary_data = []
mi_data = []

#Stocks not in sp_df but in manual input
missing = [x for x in mi_df['ticker'].values if x not in sp_df['ticker'].values]
missing_df = mi_df[mi_df['ticker'].isin(missing)]
missing_df

for ind, rec in sp_df.iterrows():
  if rec['ticker'] in mi_df['ticker'].values:
    section1 = mi_df[mi_df['ticker']==rec['ticker']]['business'].values[0]
    # print('\b'*100, end = '')
    print(f"| {rec['ticker']} ", end='')
    mi_dict = {'ticker':rec['ticker'],
               'business_len_before':rec['business_cnt'],
               'business_len_after':len(section1)}
    mi_data.append(mi_dict)
  else:
    section1 = rec['business']
  result_dict = {
      'ticker': rec['ticker'],
      'cik': rec['cik'],
      'formType': rec['formType'],
      'filedAt': rec['filedAt'],
      'linkToTxt': rec['linkToTxt'],
      'linkToHtml': rec['linkToHtml'],
      'periodOfReport': rec['periodOfReport'],
      'year': rec['year'],
      'ind': rec['ind'],
      'name': rec['name'],
      'sector': rec['sector'],
      'industry': rec['industry'],
      'industry_group': rec['industry_group'],
      'business_cnt': len(section1),
      'business': section1
  }
  result_summary_dict = {
      'ticker': rec['ticker'],
      'cik': rec['cik'],
      'formType': rec['formType'],
      'filedAt': rec['filedAt'],
      'linkToTxt': rec['linkToTxt'],
      'linkToHtml': rec['linkToHtml'],
      'periodOfReport': rec['periodOfReport'],
      'year': rec['year'],
      'ind': rec['ind'],
      'name': rec['name'],
      'sector': rec['sector'],
      'industry': rec['industry'],
      'industry_group': rec['industry_group'],
      'business_cnt': len(section1)
  }

  sp_output_data.append(result_dict)
  sp_output_summary_data.append(result_summary_dict)

print('|')

for ind, row in missing_df.iterrows():
  if row['ticker'] in stocks_df['ticker'].values:
    print(f"| {row['ticker']} ", end='')
    # Look up the exchange, name, and GICS labels once per ticker
    stock = sp_stocks_df[sp_stocks_df['ticker']==row['ticker']].iloc[0]
    # No filing record exists for these tickers, so the filing fields hold placeholders
    result_summary_dict = {
        'ticker': row['ticker'],
        'cik': '-',
        'formType': '10-K',
        'filedAt': '',
        'linkToTxt': '-',
        'linkToHtml': '-',
        'periodOfReport': '2022-12-31',  # placeholder date within the 2022 reporting year
        'year': '2022',
        'ind': stock['ind'],
        'name': stock['name'],
        'sector': stock['sector'],
        'industry': stock['industry'],
        'industry_group': stock['industry_group'],
        'business_cnt': len(row['business'])
    }
    # The detail record is the summary record plus the full description
    result_dict = {**result_summary_dict, 'business': row['business']}
    sp_output_data.append(result_dict)
    sp_output_summary_data.append(result_summary_dict)

print('|')

# Convert date objects to strings
for item in sp_output_data:
  for key, value in item.items():
    if isinstance(value, date):
      item[key] = value.isoformat()

# Write to detail file - SP500 data
with open(sp_path_new, "w", encoding="utf-8") as file:
  json.dump(sp_output_data, file, indent=4, separators=(',',': '))

# Write to summary file - SP500 data
sp_output_sum_df = pd.DataFrame.from_records(sp_output_summary_data)
sp_output_sum_df.to_csv(sp_csv_path_new, encoding='utf-8')

etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')

| C | HON | GE | MCD | MA | SYF | CAH | ILMN | INTC | EIX |
| RE | CBOE | PRU |
Total time taken is: 0.0068351
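
The merge performed in the cell above has two parts: replace the description for tickers that have a manual entry, and find manual entries whose ticker has no extracted record at all. A compact sketch of that logic on plain dictionaries, with made-up sample data and a hypothetical function name:

```python
def apply_manual_overrides(records, overrides):
    """Replace `business` for tickers in `overrides`; list tickers with no record."""
    merged = []
    seen = set()
    for rec in records:
        # Prefer the manual text when one exists for this ticker
        text = overrides.get(rec["ticker"], rec["business"])
        merged.append({**rec, "business": text, "business_cnt": len(text)})
        seen.add(rec["ticker"])
    unmatched = [t for t in overrides if t not in seen]
    return merged, unmatched

records = [
    {"ticker": "GE", "business": "stub", "business_cnt": 4},
    {"ticker": "F", "business": "Ford text", "business_cnt": 9},
]
overrides = {"GE": "ABOUT GENERAL ELECTRIC.", "PRU": "Prudential text"}

merged, unmatched = apply_manual_overrides(records, overrides)
print([m["ticker"] for m in merged], unmatched)
```

The `unmatched` list corresponds to the tickers that need a fresh record built from the stock list, as is done for the missing companies in the notebook.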


In [None]:
# Verify the before and after descriptions
pd.DataFrame.from_records(mi_data)

Unnamed: 0,ticker,business_len_before,business_len_after
0,C,0,51050
1,HON,7792,26767
2,GE,35,50108
3,MCD,33,41448
4,MA,0,51475
5,SYF,41,78681
6,CAH,147,37319
7,ILMN,21,39405
8,INTC,164,85600
9,EIX,11,82241
