# Initial exploration using a Support Vector Machine Model

This notebook uses Python 3.7 with Spark 3.0.

## Begin by importing the bankruptcy data from Cloud Object Storage

In [3]:
# Import bankruptcy data from COS
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_1c720e1b65aa412b89762bf230a6b5f6 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='7GN1Hk1VrQ0_RGdupGi9ZaphffKXQb6iMJPgYn1DNqgh',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_1c720e1b65aa412b89762bf230a6b5f6.get_object(Bucket='ibmcapstoneproject-donotdelete-pr-iiaol8yd9vtgm6',Key='data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data = pd.read_csv(body)



df_data.head()


Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,operating gross margin,realized sales gross margin,operating profit rate,tax Pre-net interest rate,after-tax net interest rate,non-industry income and expenditure/revenue,...,net income to total assets,total assets to GNP price,No-credit interval,Gross profit to Sales,Net income to stockholder's Equity,liability to equity,Degree of financial leverage (DFL),Interest coverage ratio( Interest expense to EBIT ),one if net income was negative for the last two year zero otherwise,equity to liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


## Next, import the features dataset and choose a value for N

In [32]:

body = client_1c720e1b65aa412b89762bf230a6b5f6.get_object(Bucket='ibmcapstoneproject-donotdelete-pr-iiaol8yd9vtgm6',Key='features_final.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

features_final = pd.read_csv(body)

# Write local CSV copies so that Spark and later cells can load the data by file path
# (a workaround for data-loading difficulties in IBM Watson Studio)
df_data.to_csv("data.csv")
features_final.to_csv("features_final.csv")

# Choose value for N
N = 15

# Select top N features
features_final = features_final.nlargest(N, 'avg_score', keep='first')

# Define column names for model inputs
cols = features_final['Feature Name'].values.tolist()
cols

[' Persistent EPS in the Last Four Seasons',
 ' Per Share Net profit before tax (yuan)',
 ' net profit before tax/paid-in capital',
 'net income to total assets',
 ' ROA(A) before interest and % after tax',
 ' ROA(B) before interest and depreciation after tax',
 ' per Net Share Value (B)',
 "Net income to stockholder's Equity",
 ' Net Value Per Share (A)',
 ' net worth/assets',
 'Retained Earnings/Total assets',
 ' ROA(C) before interest and depreciation before interest',
 ' debt ratio %',
 ' Net Value Per Share (C)',
 ' borrowing dependency']
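
The top-N selection above can be pictured without pandas. The following is a minimal sketch, assuming a hypothetical list of (feature name, score) pairs with made-up scores. It ranks the pairs by score and keeps the first N, and when two scores tie it keeps the earlier entry, which is the same rule `keep='first'` applies.

```python
# Hypothetical (feature name, avg_score) pairs; the scores are made up.
feature_scores = [
    ("feature_a", 0.91),
    ("feature_b", 0.75),
    ("feature_c", 0.91),
    ("feature_d", 0.60),
]

def top_n_features(scores, n):
    """Return the names of the n highest-scoring features, best first."""
    # sorted() is stable even with reverse=True, so tied scores keep
    # their original order (the same tie rule as keep='first').
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]

top_two = top_n_features(feature_scores, 2)
print(top_two)  # ['feature_a', 'feature_c']
```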

In [5]:
# Define SparkSession

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [8]:
# Load data into Spark DataFrame
df = spark.read.option("header", True) \
               .option("inferSchema", True).csv('data.csv')

In [9]:
# Define column names for the actual model
cols = features_final['Feature Name'].values.tolist()

In [10]:
# Create training and testing data
splits = df.randomSplit([0.75, 0.25])
df_train = splits[0]
df_test = splits[1]
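
`randomSplit` is called without a seed here, so each run produces a different train/test partition and the scores will vary slightly between runs. As a rough illustration of the idea (not Spark's implementation), the sketch below assigns each row to train or test with a seeded generator from Python's standard library. The function name and the row values are hypothetical. In PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random

def seeded_split(rows, train_fraction=0.75, seed=42):
    """Assign each row to train or test using a reproducible random draw."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(100))  # stand-in for DataFrame rows
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

# The same seed reproduces the same partition, and no row is lost.
print(train_a == train_b, len(train_a) + len(test_a))
```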

In [20]:
# Support Vector Machine model

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LinearSVC
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol='Bankrupt?', outputCol='label')
encoder = OneHotEncoder(inputCol='label', outputCol = 'labelVec')
vectorAssembler = VectorAssembler(inputCols=cols, outputCol='features')
normalizer = Normalizer(inputCol='features', outputCol='features_norm', p=1.0)

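# Note: LinearSVC reads the 'features' column by default, so as written the model
# trains on the raw assembled vectors; 'features_norm' is computed but not used.
lsvc = LinearSVC(maxIter=40, regParam=0.1)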

pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer, lsvc])

model = pipeline.fit(df_train)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# BinaryClassificationEvaluator defaults to metricName="areaUnderROC",
# so the values printed below are AUC scores, not classification accuracy.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

# Evaluate train data
prediction = model.transform(df_train)
print("Train Data AUC: ", evaluator.evaluate(prediction))

# Evaluate test data
prediction = model.transform(df_test)
print("Test Data AUC: ", evaluator.evaluate(prediction))

Train Data AUC:  0.9200302393661126
Test Data AUC:  0.9356821589205414
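
For reference, `Normalizer` with `p=1.0` rescales each row vector so that the absolute values of its entries sum to 1. The arithmetic can be sketched in plain Python as follows. The function name, the sample vector, and the handling of an all-zero vector are choices made for this illustration only.

```python
def l1_normalize(vector):
    """Divide each entry by the sum of absolute values (the L1 norm)."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        # Assumption for this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]

sample = [2.0, -1.0, 1.0]          # L1 norm is 4.0
normalized = l1_normalize(sample)
print(normalized)                  # [0.5, -0.25, 0.25]
```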


## Summary

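The evaluator reports about 0.92 for our training data and 0.94 for our testing data. Because `BinaryClassificationEvaluator` defaults to area under the ROC curve (AUC), these values are AUC scores rather than classification accuracy.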
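
`BinaryClassificationEvaluator` reports area under the ROC curve by default. That metric measures how well the raw scores rank positive rows above negative ones, whereas accuracy counts correct thresholded predictions. The two can differ, as the following sketch on a small made-up dataset suggests. All names and values here are hypothetical, and the pairwise AUC formula is a simple reference implementation rather than Spark's.

```python
def accuracy(labels, scores, threshold=0.0):
    """Fraction of rows whose thresholded score matches the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and raw scores: the positives rank highest, but every score
# falls below the threshold, so both positives are predicted as 0.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [-3.0, -2.8, -2.5, -2.2, -2.0, -1.8, -1.5, -1.2, -0.9, -0.5]

print(accuracy(labels, scores))  # 0.8
print(auc(labels, scores))       # 1.0
```

Here the ranking is perfect (AUC of 1.0) even though the thresholded predictions miss every positive row, so the two numbers should not be read interchangeably.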

### The good news:
1) Low risk of overfitting, since the training and testing scores are similar.

2) We are on the right track with our feature selection, since we are getting good scores with 15 features.

### The bad news:
1) This score is still too low. With so many features and a simple binary classification, it should be possible to get above 0.95.

### Improvement ideas:
1) Tune hyperparameters, especially N

In [38]:
# Support Vector Machine model hyperparameter tuning over values for N

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LinearSVC
from pyspark.ml import Pipeline

# Create training and testing data
splits = df.randomSplit([0.75, 0.25])
df_train = splits[0]
df_test = splits[1]

# Choose values for N
N_values = [5, 10, 15, 30, 60, 90]

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Read the ranked feature list once, outside the loop
features_final = pd.read_csv('features_final.csv')

# BinaryClassificationEvaluator defaults to metricName="areaUnderROC",
# so the values printed below are AUC scores, not classification accuracy.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

for n in N_values:

    # Select top n features
    features_final_1 = features_final.nlargest(n, 'avg_score', keep='first')

    # Define column names for model inputs
    cols = features_final_1['Feature Name'].values.tolist()

    indexer = StringIndexer(inputCol='Bankrupt?', outputCol='label')
    encoder = OneHotEncoder(inputCol='label', outputCol='labelVec')
    vectorAssembler = VectorAssembler(inputCols=cols, outputCol='features')
    normalizer = Normalizer(inputCol='features', outputCol='features_norm', p=1.0)

    # Note: LinearSVC reads the 'features' column by default, so as written
    # 'features_norm' is computed but not used by the model.
    lsvc = LinearSVC(maxIter=10, regParam=0.1)

    pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer, lsvc])

    model = pipeline.fit(df_train)

    # Evaluate train data
    prediction = model.transform(df_train)
    print("Train Data AUC for N= " + str(n) + ":", evaluator.evaluate(prediction))

    # Evaluate test data
    prediction = model.transform(df_test)
    print("Test Data AUC for N= " + str(n) + ":", evaluator.evaluate(prediction))

Train Data AUC for N= 5: 0.8809944524347648
Test Data AUC for N= 5: 0.8883885039774162
Train Data AUC for N= 10: 0.8923079632510468
Test Data AUC for N= 10: 0.9129586861688461
Train Data AUC for N= 15: 0.9210014969620463
Test Data AUC for N= 15: 0.9182448036951484
Train Data AUC for N= 30: 0.9028183979570881
Test Data AUC for N= 30: 0.9142288940210387
Train Data AUC for N= 60: 0.9035357656520637
Test Data AUC for N= 60: 0.8971003335899402
Train Data AUC for N= 90: 0.9175191523085495
Test Data AUC for N= 90: 0.9126507569925568


## Summary of Hyperparameter Tuning

The results of hyperparameter tuning are as follows. The scores are area under the ROC curve, the default metric of `BinaryClassificationEvaluator`, rather than classification accuracy.

| N | Train score | Test score |
|---|---|---|
| 5 | 0.8809944524347648 | 0.8883885039774162 |
| 10 | 0.8923079632510468 | 0.9129586861688461 |
| 15 | 0.9210014969620463 | 0.9182448036951484 |
| 30 | 0.9028183979570881 | 0.9142288940210387 |
| 60 | 0.9035357656520637 | 0.8971003335899402 |
| 90 | 0.9175191523085495 | 0.9126507569925568 |

1) As we can see, improvements in the test score level off around N=10.

2) While we do see slight improvements for N>10, one of our primary goals is to build a model that requires only readily available financial data to predict bankruptcy.

3) At N=10, these features are:

[' Persistent EPS in the Last Four Seasons',
 ' Per Share Net profit before tax (yuan)',
 ' net profit before tax/paid-in capital',
 'net income to total assets',
 ' ROA(A) before interest and % after tax',
 ' ROA(B) before interest and depreciation after tax',
 ' per Net Share Value (B)',
 "Net income to stockholder's Equity",
 ' Net Value Per Share (A)',
 ' net worth/assets']
 
4) At N=15, these features are: 

[' Persistent EPS in the Last Four Seasons',
 ' Per Share Net profit before tax (yuan)',
 ' net profit before tax/paid-in capital',
 'net income to total assets',
 ' ROA(A) before interest and % after tax',
 ' ROA(B) before interest and depreciation after tax',
 ' per Net Share Value (B)',
 "Net income to stockholder's Equity",
 ' Net Value Per Share (A)',
 ' net worth/assets',
 'Retained Earnings/Total assets',
 ' ROA(C) before interest and depreciation before interest',
 ' debt ratio %',
 ' Net Value Per Share (C)',
 ' borrowing dependency']
 
In other words, as soon as we get to N>10, we start to pick up features such as "borrowing dependency," which is not typically a readily available financial data point.

As a result, we select N=10 for our SVM model, accepting a slightly lower score in exchange for features that are easier to obtain.

As stated before, I think we can boost the N=10 score with other algorithms, so we'll try that next.
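
The levelling-off described above can be checked by looking at how much the test score changes between consecutive values of N. The sketch below uses the test scores reported in this notebook, rounded to four decimals. The variable names are illustrative.

```python
# Test scores from the tuning run above, rounded to four decimals.
test_scores = {5: 0.8884, 10: 0.9130, 15: 0.9182, 30: 0.9142, 60: 0.8971, 90: 0.9127}

n_values = sorted(test_scores)
gains = {}
for previous, current in zip(n_values, n_values[1:]):
    # Change in test score when moving from one N to the next larger N
    gains[(previous, current)] = test_scores[current] - test_scores[previous]
    print(f"N={previous} -> N={current}: change = {gains[(previous, current)]:+.4f}")
```

The largest gain comes from moving from N=5 to N=10 (about 0.025). Later steps change the score by smaller amounts, and some of those changes are negative. Because the split is random and unseeded, differences of this size may partly reflect run-to-run variation.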

In [None]:
features_final = pd.read_csv('features_final.csv')

N=10
holder = features_final.nlargest(N, 'avg_score', keep='first')
holder['Feature Name'].values.tolist()