<font color='#E27271'>

# *Unveiling Complex Interconnections Among Companies through Learned Embeddings*</font>

-----------------------
<font color='#E27271'>

Ethan Moody, Eugene Oon, and Sam Shinde</font>

<font color='#E27271'>

August 2023</font>

-----------------------
<font color='#00AED3'>

# **XLNet Summarization** </font>
-----------------------

XLNet Summarization refers to applying the XLNet model to automatic text summarization. XLNet, developed by researchers at Carnegie Mellon University and Google Brain and built on the Transformer-XL architecture, differs from other language models in its permutation-based pre-training objective: it learns to predict tokens under many different factorization orders of the sequence, so each token can be conditioned on context from both its left and its right. This bidirectional context modeling, together with the long-range memory inherited from Transformer-XL, makes it well suited to handling complex relationships within long texts. In this notebook XLNet is used through the `bert-extractive-summarizer` package, so the summaries are extractive: they are built from sentences selected out of the original document rather than from newly generated text.

We use XLNet to condense the `business` section of 10-K filings, which varies widely in length across companies, to a standard 25 sentences (approximately 512 tokens). These summaries are the input to our downstream classification task.
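
For intuition about what "selecting sentences" means, the sketch below is a toy, standard-library stand-in and not the method used in this notebook. As we understand it, `bert-extractive-summarizer` embeds each sentence with the transformer, clusters the embeddings, and keeps the sentences nearest the cluster centroids. The toy version swaps in bag-of-words vectors and a single centroid; every function name here (`split_sentences`, `bow`, `cosine`, `toy_extractive_summary`) is hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (illustrative only)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def bow(sentence):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r'[a-z]+', sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_extractive_summary(text, num_sentences):
    """Keep the sentences closest to the document centroid, in original order."""
    sentences = split_sentences(text)
    vectors = [bow(s) for s in sentences]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # summed counts; cosine ignores the scale
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    keep = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in keep)


text = ("The company sells software. "
        "The company sells software services to banks. "
        "Weather was nice.")
summary = toy_extractive_summary(text, 2)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.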

# [1] Installs

In [None]:
!pip install transformers --quiet
!pip install bert-extractive-summarizer --quiet
!pip install spacy --quiet
!python -m spacy download en_core_web_lg --quiet
!pip install sentencepiece --quiet

# [2] Imports

In [None]:
# Import General Packages
import os
import sys
import json
import re
from datetime import date, datetime

import pandas as pd

# Import JAX
import jax
from jax import numpy as jnp

# Import Transformers
from transformers import BertTokenizer, TFBertModel, BertConfig
import tensorflow as tf

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')

# Setup
path = '/content/gdrive/My Drive/project'
working_path = '/content/gdrive/My Drive/Working'

# Summarizer
from summarizer import Summarizer,TransformerSummarizer

import warnings
warnings.filterwarnings("ignore")

# [3] Mount GDrive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# [4] Read Input Files

In [None]:
sp_df = pd.read_json(path + '/data/10K/sp500_final.json')
nsp_df = pd.read_json(path + '/data/10K/nsp500_final.json')

In [None]:
# Non S&P Cleanup
nsp_df = pd.read_json(path + '/data/10K/nsp500_final.json')

nsp_df = nsp_df.sort_values(by=['ticker','year','formType'], ignore_index=True)
print(f'Starting Data                       : {nsp_df.shape[0]}')

nsp_df = nsp_df.drop_duplicates(subset = ['ticker', 'year'],keep = 'first').reset_index(drop = True)
print(f'After Dropping Duplicates           : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['sector'].notnull()]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Sector = None        : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['business_cnt']!=0]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Business Count = 0   : {nsp_df.shape[0]}')

nsp_df = nsp_df[nsp_df['business_cnt']>=5000]
nsp_df.reset_index(drop=True, inplace=True)
print(f'After Dropping Business Count < 5000: {nsp_df.shape[0]}')

Starting Data                       : 4063
After Dropping Duplicates           : 4063
After Dropping Sector = None        : 3695
After Dropping Business Count = 0   : 3689
After Dropping Business Count < 5000: 3682


# [5] Helper Function (Clean Raw Texts)

In [None]:
def clean(rawtext):
  """Remove text that might hurt model performance:
      - escaped non-breaking-space and escaped newline sequences
      - consecutive whitespace and newline characters
      - table content
      - every character except letters (a-z, A-Z), spaces, and dots (.)
  """

  # Remove the literal (escaped) non-breaking-space sequence "\xa0"
  rawtext = rawtext.replace('\\xa0','')

  # Remove the literal (escaped) newline sequence "\n"
  rawtext = rawtext.replace('\\n','')

  # Collapse runs of two or more whitespace characters into a single space
  rawtext = re.sub(r'\s\s+',' ',rawtext)

  # Replace any remaining newline with a space
  rawtext = re.sub(r'\n',' ',rawtext)

  # Remove table content
  rawtext = re.sub(r"(?is)<table[^>]*>(.*?)</table>", "", rawtext)

  # Remove every character that is not a letter (a-z, A-Z), a space, or a dot (.)
  rawtext = re.sub(r'[^A-Za-z .]+', '', rawtext)
  # rawtext = re.sub(r'[^A-Za-z0-9 .]+', '', rawtext)
  # rawtext = re.sub('[^a-zA-Z\s]','',rawtext)

  # pattern that matches one or more consecutive digits
  # rawtext = re.sub(r'\d+', '', rawtext)

  rawtext = re.sub('I tem','',rawtext)
  rawtext = re.sub('TABLEEND','',rawtext)
  rawtext = re.sub('TABLESTART','',rawtext)

  # matches one or more consecutive spaces
  rawtext = re.sub(' +', ' ', rawtext)

  return rawtext
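
To see what the character filter does in practice, here is a minimal sketch that isolates just two of the steps above (the letters/space/dot filter and the final space collapse). The helper name `strip_to_letters` and the sample sentence are made up for illustration.

```python
import re


def strip_to_letters(text):
    """Keep only letters, spaces and dots, then collapse repeated spaces."""
    text = re.sub(r'[^A-Za-z .]+', '', text)
    return re.sub(r' +', ' ', text)


sample = "Revenue grew 12% to $4.5 billion in 2022 (see Note 3)."
cleaned = strip_to_letters(sample)
print(cleaned)
# Revenue grew to . billion in see Note .
```

Digits, currency symbols and brackets disappear as intended, but the decimal point in `4.5` survives as a stray dot. A sentence splitter may read that dot as a sentence boundary, so it is worth keeping in mind when interpreting the 25-sentence summaries.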

# [6] Summarization

## [6.1] XLNet Summarizer

In [None]:
XLNet_model = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")

## [6.2] Run Summarization on Training Dataset

In [None]:
xlnet_nsp_summary = []
cnt = 1
ttlcnt = nsp_df[nsp_df['year']==2022].shape[0]
for index, row in nsp_df[nsp_df['year']==2022].iterrows():
  if row['business'][:5] != 'Error':
    business_summary = ''.join(XLNet_model(clean(row['business'][0:100000]), min_length=20, num_sentences=25))
  else:
    business_summary = None

  result = {'ticker': row['ticker'],
            'cik': row['cik'],
            'formType': row['formType'],
            'filedAt': row['filedAt'],
            'linkToTxt': row['linkToTxt'],
            'linkToHtml': row['linkToHtml'],
            'periodOfReport': row['periodOfReport'],
            'year': row['year'],
            'ind': row['ind'],
            'name': row['name'],
            'sector': row['sector'],
            'industry': row['industry'],
            'industry_group': row['industry_group'],
            'business': business_summary
  }
  xlnet_nsp_summary.append(result)
  print('\b'*100, end = '')
  print(f'{cnt}/{ttlcnt}', end='')
  cnt+=1

1/40632/40633/40634/40635/40636/40637/40638/40639/4063

## [6.3] Save Summarized Output (Training dataset)

In [None]:
# Write to detail file
file_path = path + '/data/10K/xlnet_nsp_summary_final.json'
with open(file_path, "w", encoding="utf-8") as file:
  json.dump(xlnet_nsp_summary, file, indent=4, separators=(',',': '))

## [6.4] Run Summarization on Test Dataset

In [None]:
xlnet_sp_summary = []
cnt = 1
ttlcnt = sp_df[sp_df['year']==2022].shape[0]
for index, row in sp_df[sp_df['year']==2022].iterrows():
  if row['business'][:5] != 'Error':
    business_summary = ''.join(XLNet_model(clean(row['business'][0:100000]), min_length=20, num_sentences=25))
  else:
    business_summary = None

  result = {'ticker': row['ticker'],
            'cik': row['cik'],
            'formType': row['formType'],
            'filedAt': row['filedAt'],
            'linkToTxt': row['linkToTxt'],
            'linkToHtml': row['linkToHtml'],
            'periodOfReport': row['periodOfReport'],
            'year': row['year'],
            'ind': row['ind'],
            'name': row['name'],
            'sector': row['sector'],
            'industry': row['industry'],
            'industry_group': row['industry_group'],
            'business': business_summary
  }
  xlnet_sp_summary.append(result)
  print('\b'*100, end = '')
  print(f'{cnt}/{ttlcnt}', end='')
  cnt+=1

1/5002/5003/5004/5005/5006/5007/5008/5009/500

## [6.5] Save Summarized Output (Test dataset)

In [None]:
# Write to detail file
file_path = path + '/data/10K/xlnet_sp_summary_final.json'
with open(file_path, "w", encoding="utf-8") as file:
  json.dump(xlnet_sp_summary, file, indent=4, separators=(',',': '))
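
Since rows whose `business` text starts with `Error` are saved with a `None` summary, a quick read-back check can confirm how many records were written and how many lack a summary. This is a minimal sketch: `count_missing_summaries` is a hypothetical helper, and the demo writes to a temporary file rather than to the project paths used above.

```python
import json
import os
import tempfile


def count_missing_summaries(file_path):
    """Return (total records, records whose 'business' summary is None)."""
    with open(file_path, encoding='utf-8') as f:
        records = json.load(f)
    missing = sum(1 for r in records if r.get('business') is None)
    return len(records), missing


# Demo on made-up records written the same way as the cells above.
records = [
    {'ticker': 'AAA', 'business': 'Summary text.'},
    {'ticker': 'BBB', 'business': None},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_summary.json')
    with open(demo_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=4, separators=(',', ': '))
    total, missing = count_missing_summaries(demo_path)

print(total, missing)
# 2 1
```

`json.dump` writes Python `None` as JSON `null`, and `json.load` maps it back to `None`, so the check works on the saved files without any extra handling.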

## [6.6] Delete Model

In [None]:
del XLNet_model