# Modeling using Gradient Boosted Trees

This notebook uses Python 3.7 with Spark 3.0

## Begin by importing the bankruptcy data from Cloud Object Storage

In [1]:
# Import bankruptcy data from COS
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_1c720e1b65aa412b89762bf230a6b5f6 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='7GN1Hk1VrQ0_RGdupGi9ZaphffKXQb6iMJPgYn1DNqgh',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_1c720e1b65aa412b89762bf230a6b5f6.get_object(Bucket='ibmcapstoneproject-donotdelete-pr-iiaol8yd9vtgm6',Key='data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data = pd.read_csv(body)



df_data.head()


Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20210212185649-0000
KERNEL_ID = be088286-4c32-402a-860c-dc6018c228a1


Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,operating gross margin,realized sales gross margin,operating profit rate,tax Pre-net interest rate,after-tax net interest rate,non-industry income and expenditure/revenue,...,net income to total assets,total assets to GNP price,No-credit interval,Gross profit to Sales,Net income to stockholder's Equity,liability to equity,Degree of financial leverage (DFL),Interest coverage ratio( Interest expense to EBIT ),one if net income was negative for the last two year zero otherwise,equity to liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


## Next, import the features dataset and Choose a value for N

In [7]:

body = client_1c720e1b65aa412b89762bf230a6b5f6.get_object(Bucket='ibmcapstoneproject-donotdelete-pr-iiaol8yd9vtgm6',Key='features_final.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

features_final = pd.read_csv(body)

# Due to the nature of IBM Watson Studio and difficulties loading data, this next step is necessary
df_data.to_csv("data.csv")
features_final.to_csv("features_final.csv")

# Choose value for N
N = 5

# Select top N features
features_final = features_final.nlargest(N, 'avg_score', keep='first')

# Define column names for model inputs
cols = features_final['Feature Name'].values.tolist()
cols

[' Persistent EPS in the Last Four Seasons',
 ' Per Share Net profit before tax (yuan)',
 ' net profit before tax/paid-in capital',
 'net income to total assets',
 ' ROA(A) before interest and % after tax']

## Now, onto prepping the data and building the model

In [8]:
# Define SparkSession

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [9]:
# Load data into Spark DataFrame
df = spark.read.option("header", True) \
               .option("inferSchema", True).csv('data.csv')

In [10]:
# Define column names for the actual model
cols = features_final['Feature Name'].values.tolist()

# Create training and testing data
splits = df.randomSplit([0.75, 0.25])
df_train = splits[0]
df_test = splits[1]

In [11]:
# Gradient Boosted Trees model

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer

indexer = StringIndexer(inputCol='Bankrupt?', outputCol='label')
encoder = OneHotEncoder(inputCol='label', outputCol = 'labelVec')
vectorAssembler = VectorAssembler(inputCols=cols, outputCol='features')
normalizer = Normalizer(inputCol='features', outputCol='features_norm', p=1.0)

from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol='label', featuresCol='features', maxIter=10)

from pyspark.ml import Pipeline

##left the encoder out of the pipeline for some reason
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer,gbt])

#Evaluate training data
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

model = pipeline.fit(df_train)
prediction = model.transform(df_train)

#binEval = BinaryClassificationEvaluator().setMetricName('areaUnderROC').setRawPredictionCol('prediction').setLabelCol('label')
binEval = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
print("Train Data Accuracy: ", binEval.evaluate(prediction))

model = pipeline.fit(df_test)
prediction = model.transform(df_test)

binEval = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
print("Test Data Accuracy: ", binEval.evaluate(prediction))

Train Data Accuracy:  0.9464955175224126
Test Data Accuracy:  0.9640288156550724


In [12]:
# Hyperparameter tuning

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer

# Create training and testing data
splits = df.randomSplit([0.75, 0.25])
df_train = splits[0]
df_test = splits[1]

# Choose values for N
N = [3, 5, 7, 10]

for i in range(len(N)):
    
    features_final = pd.read_csv('features_final.csv')

    # Select top N features
    features_final_1 = features_final.nlargest(N[i], 'avg_score', keep='first')
    
    # Define column names for model inputs
    cols = features_final_1['Feature Name'].values.tolist()
    
    indexer = StringIndexer(inputCol='Bankrupt?', outputCol='label')
    encoder = OneHotEncoder(inputCol='label', outputCol = 'labelVec')
    vectorAssembler = VectorAssembler(inputCols=cols, outputCol='features')
    normalizer = Normalizer(inputCol='features', outputCol='features_norm', p=1.0)

    from pyspark.ml.classification import GBTClassifier
    gbt = GBTClassifier(labelCol='label', featuresCol='features', maxIter=10)

    from pyspark.ml import Pipeline

    ##left the encoder out of the pipeline for some reason
    pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer,gbt])

    #Evaluate training data
    from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

    model = pipeline.fit(df_train)
    prediction = model.transform(df_train)

    #binEval = BinaryClassificationEvaluator().setMetricName('areaUnderROC').setRawPredictionCol('prediction').setLabelCol('label')
    binEval = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
    print("Train Data Accuracy for N= " + str(N[i]) + ":", binEval.evaluate(prediction))

    model = pipeline.fit(df_test)
    prediction = model.transform(df_test)

    binEval = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
    print("Test Data Accuracy for N= " + str(N[i]) + ":", binEval.evaluate(prediction))

Train Data Accuracy for N= 3: 0.9212279718537892
Test Data Accuracy for N= 3: 0.9447044212617984
Train Data Accuracy for N= 5: 0.9448900800746228
Test Data Accuracy for N= 5: 0.9712816691505217
Train Data Accuracy for N= 7: 0.9600440213447606
Test Data Accuracy for N= 7: 0.9707650273224047
Train Data Accuracy for N= 10: 0.9698274602200299
Test Data Accuracy for N= 10: 0.9938052657724791


# Summary

The hyperparameter tuning result are as follows:

Train Data Accuracy for N= 3: 0.9212279718537892
Test Data Accuracy for N= 3: 0.9447044212617984

Train Data Accuracy for N= 5: 0.9448900800746228
Test Data Accuracy for N= 5: 0.9712816691505217

Train Data Accuracy for N= 7: 0.9600440213447606
Test Data Accuracy for N= 7: 0.9707650273224047

Train Data Accuracy for N= 10: 0.9698274602200299
Test Data Accuracy for N= 10: 0.9938052657724791

As we can see, the ***optimal value is N=10.***


### The only financial datapoints we need for the model are:

1) Persistent EPS in the Last Four Seasons

2) Per Share Net profit before tax (yuan)

3) net profit before tax/paid-in capital

4) net income to total assets

5) ROA(A)before interest and % after tax

6) ROA(B) before interest and depreciation after tax

7) per Net Share Value (B)

8) Net income to stockholder's Equity

9) Net Value Per Share (A)

10) net worth/assets

## Success! 

Using Gradient Boosted Trees, we have improved the model's accuracy to 99% on the test data. The Train data and Test data accuracies do not suggest overfitting or any other issues with the model. Using N=10, we can efficiently predict whether or not a company is at risk of going bankrupt. 