# Corporate Credit Rating with Synthetic Data Based on the Altman Z-score model

You can use this notebook as a template for a _*text-enhanced*_ credit rating model. It shows how to take a model based on numeric features (in this case, Altman's famous 5 financial ratios) and combine it with text from SEC filings to improve the prediction of credit ratings. You are not restricted to the 5 Altman ratios; you can add more variables as needed or replace the variables entirely.

The main objective of this solution notebook is to show how SageMaker JumpStart Industry can help you score SEC filing text with NLP, compute the 5 financial ratios used in [Altman's Z-score](https://www.creditguru.com/index.php/bankruptcy-and-insolvency/altman-z-score-insolvency-predictor), train a model on the enhanced features, deploy the model to a SageMaker endpoint, and receive predictions in real time.

>**<span style="color:RED">Important</span>**: 
>This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data, and are not intended for production.


## What is Altman's Z-score?

[Altman's Z-score](https://www.creditguru.com/index.php/bankruptcy-and-insolvency/altman-z-score-insolvency-predictor) combines 5 financial ratios into a single score that indicates a company's risk of financial distress. The synthetic dataset in this notebook contains Management Discussion and Analysis (**MDNA**) text from SEC 10-K/Q filings, the [SIC code](https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list) (**industry_code**), and **8 financial variables** from a company's financial statements, such as the balance sheet and the income statement. The 8 financial variables are as follows:

1. Current assets
2. Current liabilities
3. Total liabilities
4. EBIT (earnings before interest and tax)
5. Total assets
6. Net sales
7. Retained earnings
8. Market value of equity

The true label is the **Rating** column. 

The following snapshot shows an example of the input data:

![Input data](https://sagemaker-solutions-prod-us-east-2.s3.us-east-2.amazonaws.com/sagemaker-corporate-credit-rating/0.0.1/docs/input_data.png)

These 8 input features translate into the 5 financial ratios that are used for the Altman Z-score:  

* A: EBIT / total assets 
* B: Net sales / total assets
* C: Market value of equity / total liabilities
* D: Working capital (current assets minus current liabilities) / total assets
* E: Retained earnings / total assets
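
As a rough illustration of how these ratios feed a Z-score, the sketch below computes the 5 ratios from the 8 raw variables and combines them with the commonly quoted weights of Altman's original 1968 model for publicly traded manufacturers. The weights are an assumption taken from that standard formulation (some sources quote 0.999 instead of 1.0 for the sales ratio); this notebook does not compute the Z-score itself and instead passes the ratios to the model as features. The function names and the company figures are hypothetical.

```python
# Weights of Altman's original (1968) Z-score, keyed by this notebook's
# ratio labels. Assumption: the notebook itself does not define these weights.
Z_WEIGHTS = {"A": 3.3, "B": 1.0, "C": 0.6, "D": 1.2, "E": 1.4}


def altman_ratios(current_assets, current_liabs, total_liabs, ebit,
                  total_assets, net_sales, retained_earnings, mkt_value_equity):
    """Turn the 8 raw financial variables into the 5 ratios A to E."""
    return {
        "A": ebit / total_assets,
        "B": net_sales / total_assets,
        "C": mkt_value_equity / total_liabs,
        "D": (current_assets - current_liabs) / total_assets,  # working capital
        "E": retained_earnings / total_assets,
    }


def altman_z(ratios):
    """Weighted sum of the five ratios."""
    return sum(Z_WEIGHTS[name] * value for name, value in ratios.items())


# Hypothetical company; all figures are in the same currency unit.
ratios = altman_ratios(
    current_assets=500, current_liabs=300, total_liabs=600, ebit=100,
    total_assets=1000, net_sales=1200, retained_earnings=200,
    mkt_value_equity=900,
)
z = altman_z(ratios)
print(ratios)
print(f"Z = {z:.2f}")
```

For this hypothetical company the score is about 2.95. In the original model, scores above roughly 2.99 are usually read as low distress risk and scores below roughly 1.81 as high distress risk, with a grey zone in between.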

>**Reference**:
>1. Balance sheet data: https://fred.stlouisfed.org/release/tables?rid=434&eid=196197
>2. Income statement data: https://fred.stlouisfed.org/release/tables?rid=434&eid=195208
>3. Price to book : http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/pbvdata.html

### Step 1: Read in the SageMaker JumpStart Solution configuration


In [None]:
import json

# Use a context manager so the file handle is closed after reading
with open("stack_outputs.json") as f:
    SOLUTION_CONFIG = json.load(f)
ROLE = SOLUTION_CONFIG["IamRole"]
SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
REGION = SOLUTION_CONFIG["AWSRegion"]
SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
BUCKET = SOLUTION_CONFIG["S3Bucket"]
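
If you adapt this notebook to your own stack, it can help to fail early when the configuration file lacks a key that later cells rely on. The following is a minimal sketch of such a check. The helper name `load_solution_config` is hypothetical, the key list is simply the set of keys this notebook reads, and the demo writes a throwaway file with placeholder values rather than touching the real `stack_outputs.json`.

```python
import json
import os
import tempfile

# Keys that this notebook reads from stack_outputs.json.
REQUIRED_KEYS = ("IamRole", "SolutionS3Bucket", "AWSRegion",
                 "SolutionName", "S3Bucket", "SolutionPrefix")


def load_solution_config(path):
    """Load the config and raise early if an expected key is missing."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing keys: {missing}")
    return config


# Demo with a temporary file and placeholder values (not real resources).
demo = {key: f"example-{key}" for key in REQUIRED_KEYS}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stack_outputs.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    config = load_solution_config(demo_path)
print(config["AWSRegion"])
```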

### Step 2: Download and read in the synthetic multimodal dataset

In [None]:
from sagemaker.s3 import S3Downloader

input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
print("solution data: ")
S3Downloader.list(input_data_bucket)

#### Download the input data from S3

In [None]:
input_data = f"{input_data_bucket}/CCR_data.csv"
!aws s3 cp $input_data .

The dataset has feature columns for the MD&A section of the 10-K/Q SEC filings (**MDNA**), the **industry_code**, and the **8 corporate financial variables**. The label column is **Rating**.

In [None]:
import pandas as pd

df = pd.read_csv('CCR_data.csv')
print(df.shape)
df.head()

Next, we convert the 8 corporate financial variables into the 5 financial ratios. We add the 5 ratios to the dataset, drop the 8 financial variables, and use the processed dataset for machine learning. The true label (**Rating**) is multicategorical, but it can be simplified to a binary label by grouping the ratings into two groups: investment grade (AAA, AA, A, BBB) and below investment grade (all other ratings).

After the data preprocessing has completed, there are 5 numerical columns (**A, B, C, D, E**), one categorical column (**industry_code**) and a long-text column (**MDNA**). 

In [None]:
df["A"] = df["EBIT"]/df["TotalAssets"]
df["B"] = df["NetSales"]/df["TotalAssets"]
df["C"] = df["MktValueEquity"]/df["TotalLiabs"]
df["D"] = (df["CurrentAssets"]-df["CurrentLiabs"])/df["TotalAssets"]
df["E"] = df["RetainedEarnings"]/df["TotalAssets"]
df = df.drop(["TotalAssets","CurrentLiabs","TotalLiabs", "RetainedEarnings", "CurrentAssets", 
              "NetSales", "EBIT", "MktValueEquity"], axis=1)
df.head()

In [None]:
df.to_csv("CCR_data_input.csv", index=False)
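
This notebook keeps the multicategorical **Rating** label, but the binary simplification described earlier in this step could be sketched as follows. The helper name `to_binary_label` is hypothetical; the investment-grade set is the one listed in the text.

```python
# Ratings treated as investment grade in this notebook's description.
INVESTMENT_GRADE = {"AAA", "AA", "A", "BBB"}


def to_binary_label(rating):
    """Return 1 for investment grade (BBB and above), 0 otherwise."""
    return int(rating in INVESTMENT_GRADE)


ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC"]
binary = {rating: to_binary_label(rating) for rating in ratings}
print(binary)
```

With the DataFrame from this step, something like `df["Rating"].map(to_binary_label)` would produce the binary column; what you name that column is up to you.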

### Step 3: Sample the input for demo purposes

#### Check the distribution of each rating class

In [None]:
df.groupby('Rating').count()

#### Calculate the ratio for each rating class, used for the stratified sampling

In [None]:
rating_classes = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC"]
# Fraction of rows that belong to each rating class
rating_ratio = {rating: float((df["Rating"] == rating).mean()) for rating in rating_classes}
rating_ratio

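We draw a stratified random sample of about 500 rows from the synthetic data for demonstration purposes. Because each class size is rounded down to an integer, the total can be slightly lower than 500.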

In [None]:
sample = 500
df_sample = pd.concat([df[df['Rating'] == k].sample(int(v * sample), replace=False, random_state=42) for k, v in rating_ratio.items()])
df_sample.shape

In [None]:
df_sample.to_csv("CCR_data_input_sample.csv", index=False)
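
The per-class sample sizes above are rounded down with `int(...)`, so they may not add up to the requested total. The sketch below contrasts that truncation with a largest-remainder allocation, which is one possible way to hit the total exactly. It is an alternative shown for illustration, not what this notebook does; the class counts are hypothetical, and the function assumes the requested sample is no larger than the dataset.

```python
def truncated_allocation(counts, sample):
    """Per-class sizes as computed in the notebook: int(ratio * sample)."""
    total = sum(counts.values())
    return {k: int(n / total * sample) for k, n in counts.items()}


def largest_remainder_allocation(counts, sample):
    """Per-class sizes that add up to exactly `sample` (integer arithmetic)."""
    total = sum(counts.values())
    sizes = {k: n * sample // total for k, n in counts.items()}
    leftover = sample - sum(sizes.values())
    # Give the leftover slots to the classes with the largest remainders.
    by_remainder = sorted(counts, key=lambda k: counts[k] * sample % total,
                          reverse=True)
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes


# Hypothetical class counts, not taken from CCR_data.csv.
counts = {"AAA": 100, "AA": 300, "A": 700, "BBB": 900,
          "BB": 600, "B": 300, "CCC": 100}
truncated = truncated_allocation(counts, 500)
exact = largest_remainder_allocation(counts, 500)
print(sum(truncated.values()), truncated)
print(sum(exact.values()), exact)
```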

### Step 4: Add NLP scores to the multimodal dataset

We add 11 NLP scores to the multimodal dataset using the <span style="color:lightgreen">SageMaker JumpStart Industry Python SDK</span>; the client library helps trigger a SageMaker processing job for NLP scoring.

The processing job takes around an hour with the sample dataset (`CCR_data_input_sample.csv`) and around 5 hours with the full dataset (`CCR_data_input.csv`) on an `ml.c5.18xlarge` processing instance. NLP scoring scales with the size of the documents, and SEC filings have tens of thousands of words. After the NLP-scoring processing job has completed, you'll see that 11 new columns for the NLP scores are added to the training dataset.
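
To give a feel for why scoring cost grows with document length, here is a toy word-list score: the fraction of tokens in a text that appear in a given word list. This is only an illustration of the general idea and is an assumption on our part; it is not the scoring formula used by the SageMaker JumpStart Industry Python SDK, and the word list, tokenizer, and sample sentence are hypothetical.

```python
import re


def word_list_score(text, word_list):
    """Fraction of tokens in `text` that appear in `word_list`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in word_list for token in tokens)
    return hits / len(tokens)


# Hypothetical word list and sentence.
risk_words = {"risk", "loss", "uncertain"}
text = "Risk of loss remains; demand is uncertain."
score = word_list_score(text, risk_words)
print(f"{score:.3f}")
```

Every token is visited once, so the work is proportional to the number of words, which is why long filings dominate the running time.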

#### Download dependencies and install SageMaker JumpStart Industry Python SDK

In [None]:
dependency_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/python-dependencies"

!mkdir -p python-dependencies
!aws s3 sync $dependency_bucket python-dependencies/

!pip install smjsindustry --no-index --find-links file://$PWD/python-dependencies/wheelhouse


Here, we use `ml.c5.18xlarge` for the NLPScorer processing job to reduce the running time. If `ml.c5.18xlarge` is not available in your Region or account, choose one of the other processing instance types. If you encounter an error message saying that you've exceeded your quota, use AWS Support to request a service limit increase for the [SageMaker resources](https://console.aws.amazon.com/support/home#/) you want to scale up.

In [None]:
import sagemaker
from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST
from smjsindustry import NLPScorer, NLPScorerConfig

score_type_list = list(
    NLPScoreType(score_type, [])
    for score_type in NLPScoreType.DEFAULT_SCORE_TYPES
    if score_type not in NLPSCORE_NO_WORD_LIST
)
score_type_list.extend([NLPScoreType(score_type, None) for score_type in NLPSCORE_NO_WORD_LIST])
nlp_scorer_config = NLPScorerConfig(score_type_list)

nlp_score_processor = NLPScorer(      
        ROLE,
        1,                                    
        'ml.c5.18xlarge',                       
        volume_size_in_gb=30,                  
        volume_kms_key=None,                    
        output_kms_key=None,                    
        max_runtime_in_seconds=None,            
        sagemaker_session=sagemaker.Session(),  
        tags=None)                             

nlp_score_processor.calculate(
    nlp_scorer_config, 
    "MDNA", 
    "CCR_data_input_sample.csv",               # replace this with CCR_data_input.csv if you want to use the full dataset
    's3://{}/{}'.format(BUCKET, "nlp_score"), 
    'ccr_nlp_score_sample.csv'
)

#### Download the NLP scoring result from S3 bucket

In [None]:
import boto3
client = boto3.client('s3')
client.download_file(BUCKET, '{}/{}'.format("nlp_score", 'ccr_nlp_score_sample.csv'), 'ccr_nlp_score_sample.csv')
df_tabtext_score = pd.read_csv('ccr_nlp_score_sample.csv')
df_tabtext_score.head()

### Step 5: Prepare a Docker environment for model training and inference 

We use [GluonNLP](https://nlp.gluon.ai/) for machine learning. 

The following script creates a `lib` folder and a `requirements.txt` file to store AutoGluon-related dependencies for SageMaker training and inference tasks. These dependencies are installed in the training and inference containers. To learn more, see [Use third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#use-third-party-libraries) in the *SageMaker Python SDK documentation*.

In [None]:
! bash prepare_model_code.sh

### Step 6: Train the AutoGluon model with the SageMaker MXNet Estimator

AutoGluon is built on the MXNet framework. We use [SageMaker MXNet Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html) to train the AutoGluon model in the AWS deep learning container for MXNet. For more details about AutoGluon, see [AutoGluon: AutoML for Text, Image, and Tabular Data](https://auto.gluon.ai/stable/index.html) and [AutoGluon GitHub](https://github.com/awslabs/autogluon).

We split the dataset into a training dataset and a test dataset.

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(
    df_tabtext_score, test_size=0.2, random_state=42, stratify=df_tabtext_score['Rating']
)

In [None]:
import sagemaker
session = sagemaker.Session()

train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)

train_s3_path = session.upload_data('train_data.csv', bucket=BUCKET, key_prefix='data')
test_s3_path = session.upload_data('test_data.csv', bucket=BUCKET, key_prefix='data')

The following training job on an `ml.c5.2xlarge` instance takes about 17 minutes with the sample dataset. If you want to train a model with your own dataset, you may need to update the training script `train.py` in the `model-training` folder. If you want to use a GPU instance to achieve better accuracy, replace `instance_type` with the desired GPU instance type and uncomment `fit_args` and the second `hyperparameters` line to pass the number of GPUs to the training script.

In [None]:
from sagemaker.mxnet import MXNet

# Define required label and additional parameters for Autogluon TabularPredictor
init_args = {
  'label': 'Rating'
}

# Define parameters for Autogluon TabularPredictor fit method
# fit_args = {
#   'ag_args_fit': {'num_gpus': 1}
# }

hyperparameters = {'init_args': str(init_args)}
# hyperparameters = {'init_args': str(init_args), 'fit_args': str(fit_args)}

tags = [{'Key' : 'AlgorithmName', 'Value' : 'AutoGluon-Tabular'}, 
        {'Key' : 'ProjectName', 'Value' : 'Jumpstart-Industry-Finance'},]

estimator = MXNet(
    entry_point="train.py",
    role=ROLE,
    instance_count=1,  # SageMaker Python SDK v2 name (formerly train_instance_count)
    instance_type="ml.c5.2xlarge", # Specify the desired instance type (formerly train_instance_type)
    framework_version="1.8.0",
    py_version="py37",
    source_dir="model-training",
    base_job_name='sagemaker-soln-ccr-js-training',
    hyperparameters=hyperparameters,
    tags=tags,
    disable_profiler=True,
    debugger_hook_config=False,
    enable_network_isolation=True,  # Run the training container without outbound network access for better security
)

inputs = {'training': train_s3_path, 'testing': test_s3_path}

estimator.fit(inputs)
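
The `init_args` and `fit_args` dictionaries are passed to the training job as strings via `str(...)`. The training script `train.py` is not shown in this notebook, so the following is only a sketch of how such string hyperparameters could be turned back into dictionaries, assuming the script parses them with `ast.literal_eval`.

```python
import ast

init_args = {"label": "Rating"}
fit_args = {"ag_args_fit": {"num_gpus": 1}}

# What the notebook sends: each dictionary converted to its string form.
hyperparameters = {"init_args": str(init_args), "fit_args": str(fit_args)}

# What a training script could do on the other side (an assumption):
# literal_eval safely parses Python literals without executing code.
parsed = {name: ast.literal_eval(value)
          for name, value in hyperparameters.items()}
print(parsed["init_args"]["label"])
print(parsed["fit_args"]["ag_args_fit"]["num_gpus"])
```

`ast.literal_eval` only accepts literals such as dictionaries, lists, strings, and numbers, which makes it a safer choice than `eval` for values that arrive as text.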

After the training job has completed, the following files are saved in the SageMaker session's default S3 bucket:
* `leaderboard.csv`
* `predictions.csv`
* `feature_importance.csv`
* `evaluation.json`
* `classification_report.csv`
* `confusion_matrix.png`

#### Download model output

In [None]:
import boto3 

s3_client = boto3.client("s3")
job_name = estimator.latest_training_job.name  # public attribute instead of the private _current_job_name
bucket = session.default_bucket()
s3_client.download_file(bucket, f"{job_name}/output/output.tar.gz", "output.tar.gz")
!tar -xvzf output.tar.gz

#### Get test accuracy

In [None]:
import json

with open('evaluation.json') as f:
    data = json.load(f)
print(data)
print("The test accuracy is {}.".format(data['accuracy']))

#### Display confusion matrix 

The following confusion matrix shows the performance of the multicategorical classification. 

In [None]:
from IPython.display import display, Image
display(Image(filename='confusion_matrix.png'))
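
For readers who want to see how the accuracy and the confusion matrix relate, the sketch below derives both from a pair of label lists. The labels are hypothetical and very small; the training job's own `evaluation.json` and `confusion_matrix.png` are produced by the training script, which is not reproduced here.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five companies.
y_true = ["AAA", "AA", "A", "A", "BBB"]
y_pred = ["AAA", "A", "A", "A", "BB"]

matrix = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
for (true_label, pred_label), count in sorted(matrix.items()):
    print(f"true={true_label:<4} predicted={pred_label:<4} count={count}")
print(f"accuracy = {acc}")
```

The diagonal pairs, where the true and predicted labels agree, are the correct predictions; their share of all rows is the accuracy.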

### Step 7: Deploy an endpoint

In this step, we deploy the model artifact from **Step 6** and use it for inference. We use the [SageMaker MXNet model](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html#mxnet-model) and [SageMaker model deployment](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#deploy-mxnet-models) APIs to deploy an endpoint. If you bring your own data for inference, you may also need to update the inference script `inference.py` in the `model-inference` folder.

In [None]:
training_job_name = estimator.latest_training_job.name
print("Training job name: ", training_job_name)

In [None]:
from sagemaker.mxnet import MXNet
attached_estimator = MXNet.attach(training_job_name)
attached_estimator.model_data

In [None]:
from sagemaker.mxnet import MXNetModel

endpoint_name = SOLUTION_CONFIG["SolutionPrefix"] + "-endpoint"

deployed_model = MXNetModel(
    framework_version="1.8.0", 
    py_version="py37", 
    model_data=attached_estimator.model_data, 
    role=ROLE,
    entry_point="inference.py", 
    source_dir="model-inference",
    name=SOLUTION_CONFIG["SolutionPrefix"] + "-model",
    enable_network_isolation=True)     # Run the inference container without outbound network access for better security

ccr_endpoint = deployed_model.deploy(
    instance_type='ml.m5.xlarge',  
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    wait=True)

### Step 8: Test the endpoint

We randomly select some data from the test dataset and test the endpoint.

In [None]:
test_endpoint_data = test_data.sample(n=5).drop(["Rating"], axis=1)

In [None]:
import sagemaker
from sagemaker import Predictor

endpoint_name = SOLUTION_CONFIG["SolutionPrefix"] + "-endpoint"  


predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    serializer=sagemaker.serializers.CSVSerializer(),
)

predictor.predict(test_endpoint_data.values)
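
The request is sent as CSV, and the **MDNA** text usually contains commas, so fields need to be quoted for the rows to survive the round trip. The sketch below uses only Python's standard `csv` module to show the difference between a quoted payload and a naive comma join. It illustrates the general point and does not reproduce the internals of the SageMaker `CSVSerializer`; the sample row and its column order are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def csv_to_rows(payload):
    """Parse CSV text back into a list of rows (all fields as strings)."""
    return list(csv.reader(io.StringIO(payload)))


# Hypothetical row: a text field with a comma, then numeric features.
row = ["Revenue grew, while costs fell.", 3711, 0.1, 1.2]
payload = rows_to_csv([row])
parsed = csv_to_rows(payload)

# A naive join splits the text field in two when read back.
naive_fields = ",".join(map(str, row)).split(",")

print(payload)
print(len(parsed[0]), len(naive_fields))
```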

### Step 9: Clean up the resources

After you are done using this notebook, delete the model and the endpoint to avoid incurring further charges.

In [None]:
ccr_endpoint.delete_model()
ccr_endpoint.delete_endpoint()