<font color='#E27271'>

# *Unveiling Complex Interconnections Among Companies through Learned Embeddings*</font>

-----------------------
<font color='#E27271'>

Ethan Moody, Eugene Oon, and Sam Shinde</font>

<font color='#E27271'>

August 2023</font>

-----------------------
<font color='#00AED3'>

# **Model: Base Model - Bag of Words** </font>
-----------------------

We use **Bag-of-Words (BoW)** as our base model. The Bag-of-Words model is a fundamental text representation in Natural Language Processing (NLP). It treats a piece of text as an unordered collection of words and builds a "bag" by counting how often each word occurs. Word order and sentence structure are disregarded; only word occurrence matters.

We then use the **Multinomial Naive Bayes** probabilistic classification algorithm to classify each business description into its sector.
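
As a minimal illustration of the bag-of-words idea, the sketch below uses only the standard library rather than the vectorizer applied later in this notebook. The two sentences and the `bag_of_words` helper are made up for illustration; they contain the same words in a different order and therefore map to the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the company sells cars and the company services cars")
bag_b = bag_of_words("cars and the company services the company sells cars")

print(bag_a["company"])  # 2
print(bag_a == bag_b)    # True: word order does not affect the bag
```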

## [1] Installs, Imports and Setup Steps

### [1.1] Import Packages

In [None]:
import numpy as np
import pandas as pd
import re

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import losses
from tensorflow.keras.utils import plot_model
from tensorflow.keras import Sequential
print("Tensorflow version:", tf.__version__)

from sklearn.feature_extraction import _stop_words
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support

# Setup
path = '/content/gdrive/My Drive/project'

Tensorflow version: 2.12.0


### [1.2] Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## [2] Modeling Data Preparation

### [2.1] Clean Function
The `clean` function removes unwanted text that might hurt model performance.

In [None]:
def clean(rawtext):
  """Remove unwanted text that might hurt model performance:
      - escaped non-breaking-space and newline sequences
      - consecutive whitespace and newline characters
      - HTML table content
      - every character except ASCII letters (a-z, A-Z), spaces, and dots (.)
  """

  # Remove the literal escape sequence for a non-breaking space (backslash + "xa0")
  rawtext = rawtext.replace('\\xa0','')

  # Remove the literal escape sequence for a newline (backslash + "n")
  rawtext = rawtext.replace('\\n','')

  # Collapse runs of two or more whitespace characters into a single space
  rawtext = re.sub(r'\s\s+',' ',rawtext)

  # Replace any remaining newline character with a space
  rawtext = re.sub(r'\n',' ',rawtext)

  # Remove HTML table content (case-insensitive, spanning multiple lines)
  rawtext = re.sub(r"(?is)<table[^>]*>(.*?)</table>", "", rawtext)

  # Remove every character that is not an ASCII letter, a space, or a dot
  rawtext = re.sub(r'[^A-Za-z .]+', '', rawtext)
  # rawtext = re.sub(r'[^A-Za-z0-9 .]+', '', rawtext)
  # rawtext = re.sub(r'[^a-zA-Z\s]','',rawtext)

  # Alternative: remove runs of consecutive digits
  # rawtext = re.sub(r'\d+', '', rawtext)

  # Remove leftover markers (e.g. "##TABLE_END" has become "TABLEEND" by this point)
  rawtext = re.sub('I tem','',rawtext)
  rawtext = re.sub('TABLEEND','',rawtext)
  rawtext = re.sub('TABLESTART','',rawtext)

  # Collapse runs of spaces into a single space
  rawtext = re.sub(' +', ' ', rawtext)

  return rawtext
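
To show what the main patterns in `clean` do, here is a small self-contained sketch that applies three of them to a made-up sentence (the `sample` string is an assumption for illustration, not taken from the filings). Because digits are stripped while dots are kept, stray dots such as `item .` can remain in the cleaned text.

```python
import re

# Made-up input: prose with numbers and an inline HTML table.
sample = "Revenue grew 12% in 2022.<table><tr><td>1,234</td></tr></table> See Item 7."

# Drop everything between <table ...> and </table>, case-insensitively and across lines.
no_table = re.sub(r"(?is)<table[^>]*>(.*?)</table>", "", sample)

# Keep only ASCII letters, spaces, and dots.
letters_only = re.sub(r"[^A-Za-z .]+", "", no_table)

# Collapse the runs of spaces left behind by the removals.
result = re.sub(r" +", " ", letters_only)

print(result)  # Revenue grew in . See Item .
```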

### [2.2] Stop Words Function
The `stopwords` function removes English stop words, which carry little meaning and may add noise during modeling.

In [None]:
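stop_words = _stop_words.ENGLISH_STOP_WORDS

# Remove English stop words, then lowercase the result.
# Note: the stop-word list is lowercase and filtering runs before lowercasing,
# so capitalized stop words (e.g. "We", "The") pass through this step.
def stopwords(doc):
    doc = " ".join([token for token in doc.split() if token not in stop_words])
    return doc.lower()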
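
The order of filtering and lowercasing matters. The sketch below contrasts the order used in `stopwords` with a lowercase-first variant. It assumes a tiny stand-in stop-word set (`STOP_WORDS`) and two hypothetical helper names instead of scikit-learn's full list. In this notebook the difference is largely cosmetic, because `TfidfVectorizer` is later configured with `stop_words='english'` and lowercases its input by default.

```python
# Tiny stand-in for scikit-learn's English stop-word list (assumption for this sketch).
STOP_WORDS = frozenset({"we", "the", "are", "a", "in"})

def stopwords_filter_first(doc):
    """Filter first, lowercase afterwards (the order used in `stopwords`)."""
    return " ".join(t for t in doc.split() if t not in STOP_WORDS).lower()

def stopwords_lowercase_first(doc):
    """Lowercase first so that capitalized stop words are matched as well."""
    return " ".join(t for t in doc.lower().split() if t not in STOP_WORDS)

text = "Overview We are a biopharmaceutical company"
print(stopwords_filter_first(text))     # overview we biopharmaceutical company
print(stopwords_lowercase_first(text))  # overview biopharmaceutical company
```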

### [2.3] Load Training and Test Data

In [None]:
# Define the path to the JSON file containing training data
file_nsp500 = path + '/data/10K/nsp500_final.json'

# Load the dataset
nsp_df = pd.read_json(file_nsp500)
nsp_df.head()

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business
0,MBLY,1910139,10-K,2023-03-09T16:15:44-05:00,https://www.sec.gov/Archives/edgar/data/191013...,https://www.sec.gov/Archives/edgar/data/191013...,2022-12-31,2022,NASDAQ,Mobileye Global Inc,Consumer Discretionary,Automobile Components,Automobiles & Components,133257,Item 1. Business \n\nIn this Annual Report on...
1,RIVN,1874178,10-K,2023-02-28T17:15:26-05:00,https://www.sec.gov/Archives/edgar/data/187417...,https://www.sec.gov/Archives/edgar/data/187417...,2022-12-31,2022,NASDAQ,Rivian Automotive Inc,Consumer Discretionary,Automobiles,Automobiles & Components,42199,Item 1. Business \n\nOverview \n\nRivian exis...
2,LCID,1811210,10-K,2023-02-28T16:09:35-05:00,https://www.sec.gov/Archives/edgar/data/181121...,https://www.sec.gov/Archives/edgar/data/181121...,2022-12-31,2022,NASDAQ,Lucid Group Inc,Consumer Discretionary,Automobiles,Automobiles & Components,82184,Item 1. Business. \n\nOVERVIEW \n\nMission \n...
3,LEA,842162,10-K,2023-02-09T16:59:45-05:00,https://www.sec.gov/Archives/edgar/data/842162...,https://www.sec.gov/Archives/edgar/data/842162...,2022-12-31,2022,NYSE,Lear Corp,Consumer Discretionary,Automobile Components,Automobiles & Components,88376,ITEM 1 &#8211; BUSINESS \n\nIn this Annual Re...
4,ALV,1034670,10-K,2023-02-16T09:41:48-05:00,https://www.sec.gov/Archives/edgar/data/103467...,https://www.sec.gov/Archives/edgar/data/103467...,2022-12-31,2022,NYSE,Autoliv Inc,Consumer Discretionary,Automobile Components,Automobiles & Components,38394,Item 1. Business \n\n&#160; \n\nGeneral \n\nA...


In [None]:
# Define the path to the JSON file containing test data
file_sp500 = path + '/data/10K/sp500_final.json'

# Load the dataset
sp_df = pd.read_json(file_sp500)
sp_df.head()

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business
0,TSLA,1318605,10-K,2023-01-30T21:29:15-05:00,https://www.sec.gov/Archives/edgar/data/131860...,https://www.sec.gov/Archives/edgar/data/131860...,2022-12-31,2022,NASDAQ,Tesla Inc,Consumer Discretionary,Automobiles,Automobiles & Components,42832,"ITEM 1. BUSINESS \n\nOverview \n\nWe design, ..."
1,F,37996,10-K,2023-02-02T19:39:34-05:00,https://www.sec.gov/Archives/edgar/data/37996/...,https://www.sec.gov/Archives/edgar/data/37996/...,2022-12-31,2022,NYSE,Ford Motor Co,Consumer Discretionary,Automobiles,Automobiles & Components,65208,ITEM 1. Business. \n\nFord Motor Company was ...
2,GM,1467858,10-K,2023-01-31T15:50:27-05:00,https://www.sec.gov/Archives/edgar/data/146785...,https://www.sec.gov/Archives/edgar/data/146785...,2022-12-31,2022,NYSE,General Motors Co,Consumer Discretionary,Automobiles,Automobiles & Components,59113,Item 1. Business \n\nGeneral Motors Company (...
3,APTV,1521332,10-K,2023-02-08T08:33:34-05:00,https://www.sec.gov/Archives/edgar/data/152133...,https://www.sec.gov/Archives/edgar/data/152133...,2022-12-31,2022,NYSE,Aptiv PLC,Consumer Discretionary,Automobile Components,Automobiles & Components,50554,"ITEM 1. BUSINESS \n\n&#8220;Aptiv,&#8221; the..."
4,BWA,908255,10-K,2023-02-09T14:58:14-05:00,https://www.sec.gov/Archives/edgar/data/908255...,https://www.sec.gov/Archives/edgar/data/908255...,2022-12-31,2022,NYSE,BorgWarner Inc.,Consumer Discretionary,Automobile Components,Automobiles & Components,41231,Item 1. Business \n\nBorgWarner Inc. (togethe...


### [2.4] Prepare Training Dataset
In the preparation process, we drop duplicates, rows without a sector, and rows whose `business_cnt` is below 5,000 to have a clean starting point for our training dataset.

In [None]:
# Non S&P Cleanup
nsp_df = nsp_df.sort_values(by=['ticker','year','formType'], ignore_index=True)
nsp_df.head()
print(f'Starting Data                       : {nsp_df.shape[0]}')

nsp_df = nsp_df.drop_duplicates(subset = ['ticker', 'year'],keep = 'first').reset_index(drop = True)
print(f'After Dropping Duplicates           : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['sector'].notnull()]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Sector = None        : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['business_cnt']!=0]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Business Count = 0   : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['business_cnt']>=5000]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Business Count < 5000: {nsp_df.shape[0]}')

Starting Data                       : 4063
After Dropping Duplicates           : 4063
After Dropping Sector = None        : 3695
After Dropping Business Count = 0   : 3689
After Dropping Business Count < 5000: 3682


### [2.5] Clean Business Description from Training and Test Set
Remove unwanted text and stop words from the **Business** description.

In [None]:
nsp_df['business_clean']=nsp_df['business'].apply(lambda x: stopwords(clean(x)))
nsp_df.head()


Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business,business_clean
0,AA,1675149,10-K,2023-02-23T16:34:17-05:00,https://www.sec.gov/Archives/edgar/data/167514...,https://www.sec.gov/Archives/edgar/data/167514...,2022-12-31,2022,NYSE,Alcoa Corp,Materials,Metals & Mining,Materials,68410,"Item 1. Business. \n\n(dollars in millions, e...",item . business. dollars millions pershare amo...
1,AADI,1422142,10-K,2023-03-28T17:34:21-04:00,https://www.sec.gov/Archives/edgar/data/142214...,https://www.sec.gov/Archives/edgar/data/142214...,2022-12-31,2022,NASDAQ,Aadi Bioscience Inc,Health Care,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",143972,Item 1. Business. \n\nOverview \n\nWe are a b...,item . business. overview we biopharmaceutical...
2,AAIC,1209028,10-K,2023-03-31T12:27:37-04:00,https://www.sec.gov/Archives/edgar/data/120902...,https://www.sec.gov/Archives/edgar/data/120902...,2022-12-31,2022,NYSE,Arlington Asset Investment Corp,Financials,Mortgage Real Estate Investment Trusts (REITs),Financial Services,45996,ITEM 1. BUSINESS \n\nUnless the context other...,item . business unless context requires indica...
3,AAME,8177,10-K,2023-06-30T14:33:09-04:00,https://www.sec.gov/Archives/edgar/data/8177/0...,https://www.sec.gov/Archives/edgar/data/8177/0...,2022-12-31,2022,NASDAQ,Atlantic American Corporation,Financials,Insurance,Insurance,51845,Item 1. Business\n\n##TABLE_END \n\nThe Compa...,item . business the company atlantic american ...
4,AAN,1821393,10-K,2023-03-01T16:36:14-05:00,https://www.sec.gov/Archives/edgar/data/182139...,https://www.sec.gov/Archives/edgar/data/182139...,2022-12-31,2022,NYSE,Aaron's Company Inc,Consumer Discretionary,Specialty Retail,Consumer Discretionary Distribution & Retail,81509,ITEM 1. BUSINESS \n\nUnless otherwise indicat...,item . business unless indicated unless contex...


In [None]:
sp_df['business_clean']=sp_df['business'].apply(lambda x: stopwords(clean(x)))
sp_df.head()


Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business,business_clean
0,TSLA,1318605,10-K,2023-01-30T21:29:15-05:00,https://www.sec.gov/Archives/edgar/data/131860...,https://www.sec.gov/Archives/edgar/data/131860...,2022-12-31,2022,NASDAQ,Tesla Inc,Consumer Discretionary,Automobiles,Automobiles & Components,42832,"ITEM 1. BUSINESS \n\nOverview \n\nWe design, ...",item . business overview we design develop man...
1,F,37996,10-K,2023-02-02T19:39:34-05:00,https://www.sec.gov/Archives/edgar/data/37996/...,https://www.sec.gov/Archives/edgar/data/37996/...,2022-12-31,2022,NYSE,Ford Motor Co,Consumer Discretionary,Automobiles,Automobiles & Components,65208,ITEM 1. Business. \n\nFord Motor Company was ...,item . business. ford motor company incorporat...
2,GM,1467858,10-K,2023-01-31T15:50:27-05:00,https://www.sec.gov/Archives/edgar/data/146785...,https://www.sec.gov/Archives/edgar/data/146785...,2022-12-31,2022,NYSE,General Motors Co,Consumer Discretionary,Automobiles,Automobiles & Components,59113,Item 1. Business \n\nGeneral Motors Company (...,item . business general motors company referre...
3,APTV,1521332,10-K,2023-02-08T08:33:34-05:00,https://www.sec.gov/Archives/edgar/data/152133...,https://www.sec.gov/Archives/edgar/data/152133...,2022-12-31,2022,NYSE,Aptiv PLC,Consumer Discretionary,Automobile Components,Automobiles & Components,50554,"ITEM 1. BUSINESS \n\n&#8220;Aptiv,&#8221; the...",item . business aptiv company refer aptiv plc ...
4,BWA,908255,10-K,2023-02-09T14:58:14-05:00,https://www.sec.gov/Archives/edgar/data/908255...,https://www.sec.gov/Archives/edgar/data/908255...,2022-12-31,2022,NYSE,BorgWarner Inc.,Consumer Discretionary,Automobile Components,Automobiles & Components,41231,Item 1. Business \n\nBorgWarner Inc. (togethe...,item . business borgwarner inc. consolidated s...


### [2.6] Encoding Labels
Label encoding converts categorical labels (classes) into numerical representations.

In [None]:
# Encoding Labels
labels = sorted(nsp_df.sector.dropna().unique())

label_dict = {}
for index, label in enumerate(labels):
    label_dict[label] = index
label_dict

{'Communication Services': 0,
 'Consumer Discretionary': 1,
 'Consumer Staples': 2,
 'Energy': 3,
 'Financials': 4,
 'Health Care': 5,
 'Industrials': 6,
 'Information Technology': 7,
 'Materials': 8,
 'Real Estate': 9,
 'Utilities': 10}
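
For reading model output later, it can help to have the inverse mapping as well. The following is a small sketch, with the sector names copied from the dictionary above and `id_to_label` as a hypothetical helper name that does not appear in the notebook.

```python
sectors = [
    "Communication Services", "Consumer Discretionary", "Consumer Staples",
    "Energy", "Financials", "Health Care", "Industrials",
    "Information Technology", "Materials", "Real Estate", "Utilities",
]

# Same encoding as above: alphabetical order determines the integer id.
label_dict = {label: index for index, label in enumerate(sorted(sectors))}

# Inverse mapping, to turn predicted ids back into sector names.
id_to_label = {index: label for label, index in label_dict.items()}

print(label_dict["Materials"])  # 8
print(id_to_label[5])           # Health Care
```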

In [None]:
nsp_df['label'] = nsp_df.sector.replace(label_dict)
nsp_df.head()

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business,business_clean,label
0,AA,1675149,10-K,2023-02-23T16:34:17-05:00,https://www.sec.gov/Archives/edgar/data/167514...,https://www.sec.gov/Archives/edgar/data/167514...,2022-12-31,2022,NYSE,Alcoa Corp,Materials,Metals & Mining,Materials,68410,"Item 1. Business. \n\n(dollars in millions, e...",item . business. dollars millions pershare amo...,8
1,AADI,1422142,10-K,2023-03-28T17:34:21-04:00,https://www.sec.gov/Archives/edgar/data/142214...,https://www.sec.gov/Archives/edgar/data/142214...,2022-12-31,2022,NASDAQ,Aadi Bioscience Inc,Health Care,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",143972,Item 1. Business. \n\nOverview \n\nWe are a b...,item . business. overview we biopharmaceutical...,5
2,AAIC,1209028,10-K,2023-03-31T12:27:37-04:00,https://www.sec.gov/Archives/edgar/data/120902...,https://www.sec.gov/Archives/edgar/data/120902...,2022-12-31,2022,NYSE,Arlington Asset Investment Corp,Financials,Mortgage Real Estate Investment Trusts (REITs),Financial Services,45996,ITEM 1. BUSINESS \n\nUnless the context other...,item . business unless context requires indica...,4
3,AAME,8177,10-K,2023-06-30T14:33:09-04:00,https://www.sec.gov/Archives/edgar/data/8177/0...,https://www.sec.gov/Archives/edgar/data/8177/0...,2022-12-31,2022,NASDAQ,Atlantic American Corporation,Financials,Insurance,Insurance,51845,Item 1. Business\n\n##TABLE_END \n\nThe Compa...,item . business the company atlantic american ...,4
4,AAN,1821393,10-K,2023-03-01T16:36:14-05:00,https://www.sec.gov/Archives/edgar/data/182139...,https://www.sec.gov/Archives/edgar/data/182139...,2022-12-31,2022,NYSE,Aaron's Company Inc,Consumer Discretionary,Specialty Retail,Consumer Discretionary Distribution & Retail,81509,ITEM 1. BUSINESS \n\nUnless otherwise indicat...,item . business unless indicated unless contex...,1


In [None]:
sp_df['label'] = sp_df.sector.replace(label_dict)
sp_df.head()

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business,business_clean,label
0,TSLA,1318605,10-K,2023-01-30T21:29:15-05:00,https://www.sec.gov/Archives/edgar/data/131860...,https://www.sec.gov/Archives/edgar/data/131860...,2022-12-31,2022,NASDAQ,Tesla Inc,Consumer Discretionary,Automobiles,Automobiles & Components,42832,"ITEM 1. BUSINESS \n\nOverview \n\nWe design, ...",item . business overview we design develop man...,1
1,F,37996,10-K,2023-02-02T19:39:34-05:00,https://www.sec.gov/Archives/edgar/data/37996/...,https://www.sec.gov/Archives/edgar/data/37996/...,2022-12-31,2022,NYSE,Ford Motor Co,Consumer Discretionary,Automobiles,Automobiles & Components,65208,ITEM 1. Business. \n\nFord Motor Company was ...,item . business. ford motor company incorporat...,1
2,GM,1467858,10-K,2023-01-31T15:50:27-05:00,https://www.sec.gov/Archives/edgar/data/146785...,https://www.sec.gov/Archives/edgar/data/146785...,2022-12-31,2022,NYSE,General Motors Co,Consumer Discretionary,Automobiles,Automobiles & Components,59113,Item 1. Business \n\nGeneral Motors Company (...,item . business general motors company referre...,1
3,APTV,1521332,10-K,2023-02-08T08:33:34-05:00,https://www.sec.gov/Archives/edgar/data/152133...,https://www.sec.gov/Archives/edgar/data/152133...,2022-12-31,2022,NYSE,Aptiv PLC,Consumer Discretionary,Automobile Components,Automobiles & Components,50554,"ITEM 1. BUSINESS \n\n&#8220;Aptiv,&#8221; the...",item . business aptiv company refer aptiv plc ...,1
4,BWA,908255,10-K,2023-02-09T14:58:14-05:00,https://www.sec.gov/Archives/edgar/data/908255...,https://www.sec.gov/Archives/edgar/data/908255...,2022-12-31,2022,NYSE,BorgWarner Inc.,Consumer Discretionary,Automobile Components,Automobiles & Components,41231,Item 1. Business \n\nBorgWarner Inc. (togethe...,item . business borgwarner inc. consolidated s...,1


### [2.7] TF-IDF - Feature Extraction Technique
"Term Frequency-Inverse Document Frequency" converts a collection of text documents into numerical vectors, representing the importance of each word in a document relative to the entire corpus. TfidfVectorizer assigns higher weights to words that are more frequent in a specific document but less common across the entire dataset.

We limit the vocabulary to the 10,000 most frequent unigrams and bigrams (`max_features=10000`).
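
To make the weighting concrete, here is a minimal standard-library sketch of the TF-IDF weight of each term in a single document. It assumes sublinear term frequency and L2 normalization (the settings used in this notebook) together with scikit-learn's default smoothed IDF. The term counts and document frequencies are invented for illustration, and `tfidf_weights` is a hypothetical helper, not part of scikit-learn.

```python
import math

def tfidf_weights(term_counts, doc_freq, n_docs):
    """TF-IDF weights for one document.

    term_counts: term -> raw count in this document
    doc_freq:    term -> number of documents that contain the term
    n_docs:      total number of documents in the corpus
    """
    weights = {}
    for term, count in term_counts.items():
        tf = 1.0 + math.log(count)                                 # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smooth_idf=True (default)
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))         # norm='l2'
    return {term: w / norm for term, w in weights.items()}

# Both terms occur 10 times in the document, but "company" appears in almost every filing.
weights = tfidf_weights(
    term_counts={"vehicle": 10, "company": 10},
    doc_freq={"vehicle": 50, "company": 4000},
    n_docs=4182,
)
print(weights["vehicle"] > weights["company"])  # True: the rarer term gets more weight
```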

In [None]:
nsp_df['sp'] = 'N'
sp_df['sp'] = 'Y'
df = nsp_df[[ 'business_clean', 'label', 'sp']]
df = pd.concat([df, sp_df[[ 'business_clean', 'label', 'sp']]],axis=0)
df.reset_index(drop=True, inplace=True)
df.shape

(4182, 3)

In [None]:
# Model: Bag-of-words

tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        norm='l2',
                        encoding='latin-1',
                        ngram_range=(1, 2),
                        stop_words='english',
                        max_features=10000)

features = tfidf.fit_transform(df['business_clean']).toarray()

labels = df['label']

print(f'Features shape: {features.shape}')
print(f'Labels shape  : {labels.shape}')

Features shape: (4182, 10000)
Labels shape  : (4182,)


### [2.8] Train | Validation Split

In [None]:
# The S&P 500 rows were appended last, so they occupy the final n_sp rows
n_sp = sp_df.shape[0]

X_nsp = features[:-n_sp]
X_sp = features[-n_sp:]

y_nsp = labels[:-n_sp]
y_sp = labels[-n_sp:]
print(f'Training feature shape: {X_nsp.shape}')
print(f'Training label shape: {y_nsp.shape}')
print(f'Testing feature shape: {X_sp.shape}')
print(f'Testing label shape: {y_sp.shape}')

Training feature shape: (3682, 10000)
Training label shape: (3682,)
Testing feature shape: (500, 10000)
Testing label shape: (500,)


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_nsp,
                                                    y_nsp,
                                                    test_size=0.15,
                                                    random_state=777,
                                                    stratify=y_nsp)
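
As a quick sanity check on the split above, the expected set sizes can be worked out by hand. This sketch assumes that `train_test_split` rounds the held-out share up for a fractional `test_size` and assigns the remainder to training.

```python
import math

n_samples = 3682   # rows in the non-S&P 500 pool (training feature shape above)
test_size = 0.15

n_val = math.ceil(test_size * n_samples)  # held-out share, rounded up
n_train = n_samples - n_val               # everything else is used for training

print(n_train, n_val)  # 3129 553
```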

### [2.9] Multinomial Naive Bayes
Multinomial Naive Bayes is a probabilistic classification algorithm commonly used for text classification.
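
The idea behind the classifier can be sketched in a few lines of plain Python. This is an illustrative toy version operating on raw token counts with Laplace smoothing, not scikit-learn's implementation, and the documents, labels, and the `train_mnb` / `predict_mnb` helpers are all made up. Each class is scored as its log prior plus the sum of the log likelihoods of the observed tokens, and the highest score wins.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik

def predict_mnb(doc, log_prior, log_lik):
    """Return the class with the highest log score; tokens outside the vocabulary are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [["oil", "gas", "drilling"], ["gas", "pipeline"],
        ["bank", "loans"], ["bank", "deposits", "loans"]]
labels = ["Energy", "Energy", "Financials", "Financials"]

log_prior, log_lik = train_mnb(docs, labels)
pred = predict_mnb(["gas", "drilling", "loans"], log_prior, log_lik)
print(pred)  # Energy
```

In the notebook itself, `MultinomialNB` is fitted on TF-IDF weights rather than raw counts, so the sketch only conveys the scoring rule.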

In [None]:
model_mnb = MultinomialNB()
model_mnb.fit(X_train, y_train)
# model_mnb.fit(features, labels)
y_pred_mnb = model_mnb.predict(X_val)
mnb_accuracy = accuracy_score(y_val, y_pred_mnb)
mnb_precision, mnb_recall, mnb_f1, _ = precision_recall_fscore_support(y_val, y_pred_mnb, average="weighted")
print(f'MultinomialNB Accuracy : {mnb_accuracy}')
print(f'MultinomialNB Precision: {mnb_precision}')
print(f'MultinomialNB Recall   : {mnb_recall}')
print(f'MultinomialNB F1 Score : {mnb_f1}')


MultinomialNB Accuracy : 0.7522603978300181
MultinomialNB Precision: 0.6979602438205132
MultinomialNB Recall   : 0.7522603978300181
MultinomialNB F1 Score : 0.7128742492147385
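
The reported recall equals the reported accuracy, which is expected: with `average="weighted"`, each class's recall is weighted by its share of the true labels, and the sum reduces to the fraction of correct predictions. The toy sketch below (made-up labels and hypothetical helper functions) checks this identity.

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Toy labels: class 2 is never predicted, so its recall is 0.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

print(math.isclose(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred)))  # True
```

A class that is never predicted, like class 2 here, has an undefined precision; scikit-learn reports it as 0 and emits a warning in that case. The reported accuracy is also numerically consistent with 416 correct predictions, assuming a validation set of 553 filings (15% of 3,682, rounded up): 416 / 553 ≈ 0.75226.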


