# Graph Neural Nets for Credit Ratings Demo

In this demo notebook, we demonstrate the following:
* How to send inference requests to a pre-deployed endpoint of a trained [SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) model and get the model response/prediction.
* How to evaluate the responses (predictions) against the ground truth labels.

**For more details on end-to-end model training with hyperparameter optimization and deployment using SageMaker, please open the [graph_neural_net_models_financial_classification.ipynb](graph_neural_net_models_financial_classification.ipynb) notebook.** This training notebook also discusses the construction of the corporate graph used in the graph neural network and the tabular data that accompanies the graph.

**Kernel: PyTorch 1.8 Python 3.6 GPU Optimized**

## Abstract 

Credit ratings are traditionally generated using models that use financial statement data and market data, which is typically in tabular form (numeric and categorical). This solution constructs a network of firms using text from SEC filings and shows that using the network of firm relationships with tabular data can generate more accurate rating predictions. This solution demonstrates a methodology to use big data to extend tabular data credit scoring models, which have been used by the ratings industry for decades, to the class of machine learning models on networks.

## Data Summary

1. The training dataset has tabular data such as various accounting ratios (numerical) and industry codes (categorical). The dataset has $N = 3286$ rows. Rating labels are also added. These are the node features to be used with graph machine learning. 

2. The dataset also contains a corporate graph, which is undirected and unweighted. This solution also allows the user to tweak the structure of the graph by varying the way in which links are included. 

3. Classification using GNNs can be multi-category for all ratings or binary, divided between investment grade (AAA, AA, A, BBB) and non-investment grade (BB, B, CCC, CC, C, D). D=defaulted.  
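
As a minimal sketch of the binary grouping described in item 3, the snippet below maps rating labels to a 0/1 class. The function name `to_binary_rating` and the sample ratings are hypothetical and only serve as an illustration.

```python
# Rating buckets as listed in the text above.
INVESTMENT_GRADE = {"AAA", "AA", "A", "BBB"}
NON_INVESTMENT_GRADE = {"BB", "B", "CCC", "CC", "C", "D"}  # D = defaulted


def to_binary_rating(rating: str) -> int:
    """Return 1 for investment grade and 0 for non-investment grade."""
    if rating in INVESTMENT_GRADE:
        return 1
    if rating in NON_INVESTMENT_GRADE:
        return 0
    raise ValueError(f"Unknown rating: {rating!r}")


sample_ratings = ["AAA", "BBB", "BB", "D"]  # made-up sample
binary_labels = [to_binary_rating(r) for r in sample_ratings]
print(binary_labels)
```

Raising on an unknown label is a design choice of this sketch; it surfaces unexpected rating strings instead of silently assigning them to the non-investment-grade class.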

Next, we show how to use the graph and tabular data with GNNs.

>**<span style="color:RED">Important</span>**: 
>This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data and are not intended for production use. While text from SEC filings is used, the financial data is synthetically and randomly generated and has no relation to any company's true financials. Hence, the synthetically generated ratings also have no relation to a company's true rating.

This solution relies on a config file to run the provisioned AWS resources. Run the cell below to generate that file.

In [None]:
import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i= cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

# Keep only the outputs that carry both a key and a value
keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
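
The filtering step that turns record outputs into the config dictionary can be tried in isolation. The sketch below uses a hypothetical `record_outputs` list with made-up keys and values, shaped like the `RecordOutputs` entries read above; it does not call any AWS service.

```python
import json

# Hypothetical record outputs, shaped like the 'RecordOutputs' list used above.
record_outputs = [
    {"OutputKey": "AWSRegion", "OutputValue": "us-east-1"},
    {"OutputKey": "S3Bucket", "OutputValue": "example-bucket"},
    {"OutputKey": "NoValueHere"},  # skipped: it has no 'OutputValue'
]

# Keep an entry only if it has both fields; each membership test is explicit.
stack_output = {
    item["OutputKey"]: item["OutputValue"]
    for item in record_outputs
    if "OutputKey" in item and "OutputValue" in item
}

print(json.dumps(stack_output, indent=2))
```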

For a new use case, you must first train a new GNN model using your own data, i.e., replace the data here with your own, using the training notebook cited above. Assess your newly trained model using the metrics (e.g., accuracy, precision, recall, etc.) that are relevant for your application. Results will depend on the quality and appropriateness of your data for the use case (which may indeed be different from credit scoring). After you are satisfied that your trained model is performing well, then use this notebook to call the trained model for predictions. 

## Step 1: Read in the solution config

Install the dependencies that will be used in this notebook.

In [None]:
!pip install -I gensim bokeh networkx --no-index --find-links file://$PWD/wheelhouse

In [None]:
%pylab inline
import warnings
import json
import sagemaker
import pickle
import numpy as np
import pandas as pd
from utils import construct_network_data
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report, matthews_corrcoef, balanced_accuracy_score, roc_curve

warnings.filterwarnings('ignore')
session = sagemaker.Session()

# Read the solution config and close the file handle afterwards
with open("stack_outputs.json") as f:
    SOLUTION_CONFIG = json.load(f)
REGION = SOLUTION_CONFIG["AWSRegion"]
SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
BUCKET = SOLUTION_CONFIG["S3Bucket"]

## Step 2: Download and read in the tabular data

We read in a file that contains all the data. It has various financial quantities and the binary rating class, as well as the industry categories. We also specify the target column below. There is also a column `Rating`, which is to be used for multi-category classification. 


Download the data folder from S3

In [None]:
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
!aws s3 sync $input_data_bucket datasets

In [None]:
target_column = 'binary_rating'

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

rating_df = pd.read_csv('datasets/tabular_data.csv')

rating_df[target_column] = rating_df['Rating'].apply(lambda x: 1 if x in ['AAA', 'AA', 'A', 'BBB'] else 0)
rating_df['Rating'] = le.fit_transform(rating_df['Rating'])

rating_df.reset_index(inplace=True, drop=True)
rating_df.head()

Finally, we prune the tabular feature set by removing the label column `Rating` (used for multi-category classification). The dataset still includes the label column `binary_rating` for binary classification, which is the use case we develop here. If multi-category classification is required, drop `binary_rating` instead and keep the `Rating` column.

In [None]:
rating_df.drop(['Rating'], axis=1, inplace=True)

## Step 3: Add graph information to the tabular dataset

In this section, we use our custom approach to construct `CorpNet` using the text from SEC filings. We then use `CorpNet` to generate graph information and add it to the tabular data `rating_df`.

Read the text data to construct network data. This text comprises the Management Discussion and Analysis (MD&A) section of the 10-K/Q filings. The MD&A section typically contains forward-looking information about a company's prospects. Companies that have commonalities in their SEC text are also likely to be impacted by the same macroeconomic and financial market conditions and will be linked in the corporate graph, and the graph neural network will exploit this information for the classification of companies by credit quality. 

In [None]:
text_df = pd.read_csv("datasets/text_data.csv", header=0)

In [None]:
text_df.columns

### A. Construct the network
1. For each source node (row) in the dataset, we find its neighbour nodes sorted from nearest to furthest based on a distance measurement. The distance measurement is computed based on the embeddings of the MD&A text. 

2. For each source node (row), add a link if the document vector for a node has high cosine similarity with another node. Note that it is possible to generate zero links for some nodes. 

3. Add links, i.e., destination nodes for all nodes that are close in embedding space. 

4. The length of the `src` and `dst` lists will be the number of links in the network. They should both have the same length. 

5. The number of source nodes with links will differ from the number of destination nodes with links. Some nodes will be isolated, i.e., not linked at all. Both these numbers will be less than $N$. We nevertheless include the isolated nodes in the graph, as this is required. 

6. Use the corporate graph to generate node-level statistics and add them as three additional feature columns to the tabular dataset `rating_df`: (i) [degree centrality](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html), i.e., the number of links each node has, normalized to the fraction of nodes a given node is connected to; (ii) [eigenvector centrality](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.eigenvector_centrality.html#networkx.algorithms.centrality.eigenvector_centrality), a measure of the connectedness of each node to other nodes, no matter how deeply connected, which computes the centrality of a node based on the centrality of its neighbors; (iii) [clustering coefficient](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.clustering.html#networkx.algorithms.cluster.clustering), which for unweighted graphs is the fraction of possible triangles through a node that exist. That is, if $t_i$ is the number of closed triangles attached to a node $i$, and $k_i$ is the number of nodes it is connected to, then the clustering coefficient is $\frac{2 t_i}{k_i(k_i-1)}$. 
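
To make the three node statistics in item 6 concrete, here is a small pure-Python sketch on a made-up five-node graph. It is an illustration only, not the NetworkX implementation used later in this notebook; in particular, the power iteration on the adjacency matrix plus the identity is an assumption chosen here for stable convergence.

```python
from itertools import combinations
from math import sqrt

# Toy undirected, unweighted graph: a triangle (0, 1, 2), a pendant node 3
# attached to node 2, and an isolated node 4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n_nodes = 5

neighbors = {i: set() for i in range(n_nodes)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# (i) Degree centrality: fraction of the other n - 1 nodes a node is linked to.
degree_centrality = {i: len(neighbors[i]) / (n_nodes - 1) for i in neighbors}


# (iii) Clustering coefficient: 2 * t_i / (k_i * (k_i - 1)).
def clustering(i):
    k = len(neighbors[i])
    if k < 2:
        return 0.0  # fewer than two neighbors, so no triangle is possible
    t = sum(1 for a, b in combinations(neighbors[i], 2) if b in neighbors[a])
    return 2 * t / (k * (k - 1))


clustering_coef = {i: clustering(i) for i in neighbors}


# (ii) Eigenvector centrality via power iteration on (A + I), L2-normalized.
def eigenvector_centrality(n_iter=200):
    x = [1.0 / n_nodes] * n_nodes
    for _ in range(n_iter):
        x_new = [x[i] + sum(x[j] for j in neighbors[i]) for i in range(n_nodes)]
        norm = sqrt(sum(v * v for v in x_new))
        x = [v / norm for v in x_new]
    return x


ev_centrality = eigenvector_centrality()

for i in range(n_nodes):
    print(i, degree_centrality[i], round(ev_centrality[i], 3), round(clustering_coef[i], 3))
```

Node 2 has $k_2 = 3$ neighbors and $t_2 = 1$ closed triangle, so its clustering coefficient is $\frac{2 \cdot 1}{3 \cdot 2} = \frac{1}{3}$, while the isolated node 4 scores zero on all three measures.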

The code below enables you to create a new graph using the function `construct_network_data` with a single column of SEC text. The graph is stored as a pickle file and read in. For our demo, we already created the graph and hence the graph construction code has been commented out below. 
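
The internals of `construct_network_data` are not shown in this notebook. As a rough illustration of the idea, linking nodes whose text embeddings have cosine similarity above a cutoff, the sketch below uses hypothetical helpers `cosine_similarity` and `build_links` and made-up 2-dimensional vectors in place of real document embeddings.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def build_links(embeddings, cutoff):
    """Return src/dst lists with one entry per undirected link."""
    src, dst = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) > cutoff:
                src.append(i)
                dst.append(j)
    return {"src": src, "dst": dst}


toy_embeddings = [
    [1.0, 0.0],   # node 0
    [0.9, 0.1],   # node 1: nearly parallel to node 0
    [0.0, 1.0],   # node 2: orthogonal to node 0
    [-1.0, 0.0],  # node 3: opposite to node 0
]
links = build_links(toy_embeddings, cutoff=0.5)
print(links)
```

In this toy example only nodes 0 and 1 are linked, so nodes 2 and 3 end up isolated; the `src` and `dst` lists have the same length, one entry per link.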

In [None]:
%%time
#src_dst_dict = construct_network_data(text_df, text_column_name="MDNA", embedding_size=300, cutoff=0.5)
src_dst_dict = pickle.load( open( "datasets/src_dst.p", "rb" ) )

### B. Drop isolated nodes and renumber all nodes

We will use the Deep Graph Library (DGL, https://www.dgl.ai/) for graph machine learning. DGL requires consecutively numbered nodes, so the code below drops the nodes that have no links and renumbers the remaining nodes so that there are no gaps in the sequence. 
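
The renumbering idea can be sketched on a tiny example. The edge lists and row indices below are made up, and `node_map` mirrors the old-to-new ID mapping used in the cells that follow.

```python
# Hypothetical edge lists that reference the original row indices.
src_dst = {"src": [0, 2, 7], "dst": [2, 7, 9]}
all_rows = list(range(10))  # original row indices 0..9

# Nodes that appear in at least one link, and the ones that never do.
linked = sorted(set(src_dst["src"]) | set(src_dst["dst"]))
isolated = sorted(set(all_rows) - set(linked))

# Map each surviving original index to a consecutive new ID.
node_map = {old: new for new, old in enumerate(linked)}

renumbered = {
    "src": [node_map[n] for n in src_dst["src"]],
    "dst": [node_map[n] for n in src_dst["dst"]],
}

print("dropped:", isolated)
print("node_map:", node_map)
print("renumbered:", renumbered)
```

After the mapping, the node IDs run from 0 to `len(linked) - 1` with no gaps, while every link still connects the same pair of firms.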

In [None]:
del_nodes = set(rating_df.index) - (set(src_dst_dict['src'] + src_dst_dict['dst']))

rating_df.drop(list(del_nodes), inplace=True)
rating_df['node'] = rating_df.index
rating_df.reset_index(inplace=True, drop=True)

In [None]:
node_map = {v:k for k,v in rating_df['node'].to_dict().items()}

src_dst_dict['src'] = [node_map[n] for n in src_dst_dict['src']]
src_dst_dict['dst'] = [node_map[n] for n in src_dst_dict['dst']]

In [None]:
# Check
print('Highest index =', max(rating_df.index))
print('# nodes =', len(rating_df))
print('# source nodes with links =', len(set(src_dst_dict['src'])))
print('# destination nodes with links =',len(set(src_dst_dict['dst'])))
print('# Linked nodes =', len(set(src_dst_dict['src'] + src_dst_dict['dst'])))
print('# isolated nodes =', len(rating_df) - len(set(src_dst_dict['src'] + src_dst_dict['dst'])))

### C. Include isolated nodes

In [None]:
## Make network
import networkx as nx
G = nx.Graph()

# Get edges 
src = src_dst_dict['src']
dst = src_dst_dict['dst']
e_list = [(src[j], dst[j], {'weight':1}) for j in range(len(src))]

# Find singleton nodes: nodes that appear in neither the source nor the destination list
s_list1 = set(range(len(rating_df)))-set(src)
s_list2 = set(range(len(rating_df)))-set(dst)
s_list = list(s_list1.intersection(s_list2))
print('Singleton nodes :', s_list)
s_list = [(j,j,{'weight':0}) for j in s_list] # a blank entry for each singleton

# Add all nodes and edges
G.add_edges_from(e_list)
G.add_edges_from(s_list)

# Check stats
print("#nodes =",G.number_of_nodes())
print("#edges =",G.number_of_edges())

### D. Use the graph to add three new tabular features

Construct graph columns that can be added to the tabular dataset `rating_df`

In [None]:
%%time
# CREATE NETWORK FEATURES
# Degree for each node
tmp = nx.degree_centrality(G)
G_degree = [tmp[j] for j in range(G.number_of_nodes())]

# Eigen Centrality of each node
tmp = nx.eigenvector_centrality(G)
G_EVcent = [tmp[j] for j in range(G.number_of_nodes())]

# Get clustering coefficient of each node
tmp = nx.clustering(G)
G_ClustCoef = [tmp[j] for j in range(G.number_of_nodes())]

In [None]:
# Add new features to tabular dataset
rating_df["G_degree"] = G_degree
rating_df["G_EVcent"] = G_EVcent
rating_df["G_ClustCoef"] = G_ClustCoef

This completes the preparation of the corporate graph, denoted as `CorpNet`. We save the source and destination nodes to store `CorpNet`. 

Move the target column `binary_rating` to the first column, as required by the SageMaker XGBoost input format.

In [None]:
columns_names = rating_df.columns.tolist()
columns_names.remove(target_column)
rating_df = rating_df[[target_column] + columns_names]

Convert the categorical feature `industry_code` into one-hot encoded values, as XGBoost cannot take the raw categorical feature as input. 
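
As a plain-Python illustration of what one-hot encoding does to a categorical column, the sketch below expands a made-up `industry_code` field into one indicator column per category. The rows and category values are hypothetical; the notebook itself relies on `pd.get_dummies` for this step.

```python
# Made-up rows with one numeric feature and one categorical feature.
rows = [
    {"ratio": 0.4, "industry_code": "A"},
    {"ratio": 1.2, "industry_code": "C"},
    {"ratio": 0.7, "industry_code": "A"},
]

# One indicator column per distinct category, in sorted order.
categories = sorted({row["industry_code"] for row in rows})

encoded = []
for row in rows:
    new_row = {k: v for k, v in row.items() if k != "industry_code"}
    for category in categories:
        new_row[f"industry_code_{category}"] = int(row["industry_code"] == category)
    encoded.append(new_row)

for row in encoded:
    print(row)
```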

In [None]:
rating_df = pd.get_dummies(data=rating_df, columns=['industry_code'])

## Step 4: Split the data into training and test set

Split the data into training and test set. The test set is used to query a pre-deployed endpoint of a trained [SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) tabular model to get the model response.
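
A stratified split keeps the class proportions roughly equal in the training and test sets. The sketch below is a simplified pure-Python illustration of that idea with a hypothetical helper `stratified_split` and made-up labels; it is not the algorithm that scikit-learn's `train_test_split` uses internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split row indices so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 70 + [0] * 30  # made-up labels: 70% positive, 30% negative
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=46)

test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```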

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(rating_df, test_size=0.2, random_state=46, stratify=rating_df[target_column].values)

Check the shape of training and test data.

In [None]:
train_df.shape, test_df.shape

Prepare the test data to send into the deployed endpoint.

In [None]:
test_df = test_df.drop(["node"], axis=1)

## Step 5: Query the endpoint for prediction on test data

Use the previously trained model for prediction. 
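
Before calling the endpoint, it may help to see how a CSV response can be turned into class labels. The sketch below assumes the endpoint returns one comma-separated row of probabilities; the payload `raw_response` is made up, and the 0.5 threshold matches the one used in the cells below.

```python
import csv

# Hypothetical raw CSV payload: one row of predicted probabilities.
raw_response = "0.91,0.12,0.55,0.50"

# Parse the payload into a list of rows, each a list of strings.
rows = list(csv.reader(raw_response.splitlines()))

# Convert the first row to floats and apply a strict 0.5 threshold.
pred_prob = [float(value) for value in rows[0]]
threshold = 0.5
pred = [1 if p > threshold else 0 for p in pred_prob]

print(pred_prob)
print(pred)
```

Because the comparison is strict, a probability of exactly 0.5 is assigned to class 0.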

In [None]:
from sagemaker import Predictor

endpoint_name = SOLUTION_CONFIG["SolutionPrefix"] + "-demo-endpoint" 

predictor = Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sagemaker.Session(),
    deserializer =  sagemaker.deserializers.CSVDeserializer(),
    serializer = sagemaker.serializers.CSVSerializer(),
)

prediction = predictor.predict(test_df.values[:, 1:]) 

In [None]:
pred_prob = [float(pred) for pred in prediction[0]]
pred = np.where(np.array(pred_prob) > 0.5, 1, 0)

## Step 6: Evaluate model predictions

Use [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), [Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), [ROC_AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), [MCC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html), and [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) to compare the predicted labels with the ground truth labels. For all of these metrics, a larger value indicates better predictive performance.
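
To show how the label-based metrics relate to each other, the sketch below computes them by hand from the four confusion-matrix counts. The labels are made up for illustration; the notebook itself uses the scikit-learn functions linked above.

```python
from math import sqrt

# Made-up ground truth and predicted labels (1 = investment grade).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} balanced_accuracy={balanced_accuracy:.3f} mcc={mcc:.3f}")
```

This sketch does not guard against zero denominators, which can occur when a class is never predicted or never present.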

In [None]:
y_true = test_df[target_column].values
metrics_results = classification_report(y_true, pred, zero_division=1, output_dict=True)
results = pd.DataFrame(
    {
        "F1 Score": metrics_results["1"]["f1-score"],
        # ROC AUC is computed from the predicted probabilities, not the thresholded labels
        "ROC AUC": roc_auc_score(y_true, pred_prob),
        "Accuracy": metrics_results["accuracy"],
        "MCC": matthews_corrcoef(y_true, pred),
        "Balanced Accuracy": balanced_accuracy_score(y_true, pred),
        "Precision": metrics_results["1"]["precision"],
        "Recall": metrics_results["1"]["recall"],        
    },
    index=["XGB"]
)

In [None]:
results

Next, we visualize the evaluation by plotting the ROC curve. The closer the curve comes to the 45-degree diagonal of the ROC space (navy line), the less accurate the predictions are. On the other hand, the closer the curve comes to the upper left corner, the more accurate the predictions are.
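
The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes the AUC this way in plain Python; the helper name `pairwise_auc` and the scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute the AUC.")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
auc_perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])   # positives always ranked higher
auc_mixed = pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8])    # one pair ranked the wrong way
auc_constant = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores

print(auc_perfect, auc_mixed, auc_constant)
```

A curve that hugs the upper left corner corresponds to an AUC near 1, whereas uninformative scores give an AUC of 0.5, which matches the diagonal.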

In [None]:
def plot_roc_curve(fpr, tpr, roc_auc):
    f = plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (AUC = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Model ROC curve')
    plt.legend(loc="lower right")
    

fpr, tpr, _ = roc_curve(y_true, pred_prob)
plot_roc_curve(fpr, tpr, roc_auc_score(y_true, pred_prob))