# Polish companies bankruptcy prediction
### https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

### Part 1. Description

This dataset concerns bankruptcy prediction for Polish companies. The data was collected from the Emerging Markets Information Service (EMIS), a database containing information on emerging markets around the world. The bankrupt companies were analyzed over the period 2000-2012, while the still-operating companies were evaluated from 2007 to 2013.<br><br>
The dataset is split into 5 files, one per forecasting period for bankruptcy prediction. Each file contains 64 attributes. Some values are missing and need to be handled.<br><br>
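How the missing values are handled can influence the models. As a minimal sketch, here are two common options on a made-up three-row frame. The column names `Attr1` and `Attr2` and all values are placeholders, not taken from the dataset files:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; names and values are illustrative only.
toy = pd.DataFrame({
    "Attr1": [0.2, np.nan, 0.4],
    "Attr2": [1.0, 3.0, np.nan],
})

# Count missing entries per column before deciding how to fill them.
missing_per_column = toy.isna().sum()

# Option 1: replace every gap with 0.
filled_zero = toy.fillna(0)

# Option 2: replace each gap with the median of its column.
filled_median = toy.fillna(toy.median())

print(missing_per_column)
print(filled_zero)
print(filled_median)
```

Zero-filling is the simplest choice; median filling keeps the replacement inside the range of values actually observed in the column.<br><br>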
I find this case interesting and worth a closer look. I will try to determine which classification model suits this dataset best and pick its best parameters. I will also check whether the same model can fit all five files.<br><br>
For prediction I will evaluate the following classification models:
* Logistic Regression
* K-Nearest Neighbors (KNN)
* Support Vector Machines (Linear & Kernel)
* Naive Bayes
* Decision Tree
* Random Forest
* Neural Networks

Each algorithm will also be checked using cross-validation accuracy.
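As a rough illustration of what a cross-validation accuracy check looks like, here is a sketch on synthetic data. The data comes from `make_classification` with about 5% positives as a stand-in for one of the files, and the tree depth is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for one of the files (about 5% positives).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0)

# 10-fold cross-validation; for classifiers the default score is accuracy.
scores = cross_val_score(model, X, y, cv=10)
print('Mean accuracy: {:.3f}%'.format(scores.mean() * 100))
```

Each of the 10 scores is the accuracy on one held-out fold, so the mean is less dependent on a single lucky or unlucky split.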

### Part 2. Data Processing

Collecting data from external files:

In [1]:
import numpy as np
import pandas as pd

import datetime
import warnings

warnings.filterwarnings("ignore")

headers = pd.read_csv('data/headers.txt', nrows=0, sep=',').columns.tolist()

raw_files = [
    'data/1year.arff',
    'data/2year.arff',
    'data/3year.arff',
    'data/4year.arff',
    'data/5year.arff'
    ]
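The source files are in ARFF format. One possible alternative to reading them as CSV is `scipy.io.arff.loadarff`, which parses the header itself. This is only a sketch on an in-memory toy file; the relation name, attribute names and values are made up:

```python
import io

import pandas as pd
from scipy.io import arff

# Tiny in-memory ARFF file; '?' marks a missing value.
sample = """@relation toy
@attribute Attr1 numeric
@attribute Attr2 numeric
@attribute class {0,1}
@data
0.2,1.5,0
?,0.7,1
0.1,?,0
"""

data, meta = arff.loadarff(io.StringIO(sample))
df = pd.DataFrame(data)

# Nominal attributes come back as bytes, so decode them before casting.
df['class'] = df['class'].str.decode('utf-8').astype(int)

print(meta.names())
print(df)
```

With this approach missing numeric entries arrive as NaN and the column names come from the file header.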

Creating some useful functions:

In [2]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

def CreateDatasetFromSourceFile(inputFile):
    return pd.read_csv(inputFile, delimiter=',', header=None, names=headers, skiprows=69, low_memory=False)

def SettingNameToDataset(inputDataset, nameDataset):
    inputDataset.name = nameDataset

def IsAnyNanInDataset(dataset):
    return dataset.isna().values.any()

def EnsureValuesFloatType(dataset):
    # Coerce every column to a numeric dtype; unparseable entries become NaN.
    for col in dataset.columns:
        dataset[col] = pd.to_numeric(dataset[col], errors='coerce')
    return dataset

def FillMissingValuesInDataset(dataset):
    if (IsAnyNanInDataset(dataset)):
        dataset = dataset.fillna(0)
    return dataset

def RetrieveXYfromRawFile(file):
    data = CreateDatasetFromSourceFile(file)
    SettingNameToDataset(data, file)
    data = EnsureValuesFloatType(data)
    data = FillMissingValuesInDataset(data)
    # Features are all columns but the last; the label is returned as a 1-D array.
    X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values
    return X, y

def PrintModelResults(y_pred, y_test, model, model_name):
    print('Classification: {}'.format(model_name))
    print('Finished calculation: {:%H:%M:%S}'.format(datetime.datetime.now()))
    print('Best parameters: \n{}\n'.format(model))
    print('Accuracy score: {:.3f}%\n'.format(accuracy_score(y_test, y_pred)*100))
    print('Confusion matrix: \n{}\n'.format(confusion_matrix(y_test, y_pred)))
    print('Classification report: \n{}'.format(classification_report(y_test, y_pred)))

def CreateTestAndTrainSetsFromRawFile(file):
    X, y = RetrieveXYfromRawFile(file)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test
    
def ExecuteClassification(file, model_names, models, parameter_grids):
    X_train, X_test, y_train, y_test = CreateTestAndTrainSetsFromRawFile(file)
    print('\tProcessing file: {}'.format(file))
    print('\t===== ==== === == = == === ==== =====\n')
    for model_name, model, parameter_grid in zip(model_names, models, parameter_grids):
        # Tune the model with 5-fold grid search on the training set.
        grid_search = GridSearchCV(model, parameter_grid, cv=5)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        # Evaluate the tuned model on the held-out test set.
        y_pred = best_model.predict(X_test)
        PrintModelResults(y_pred, y_test, best_model, model_name)
        print('Cross validation: {:.3f}%\n'.format(cross_val_score(best_model, X_train, y_train, cv=10).mean()*100))
        print('\t----- ---- --- -- - -- --- ---- -----\n')
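The split above uses a plain random 80/20 split. Because bankrupt companies are a small minority, one optional variation is to pass `stratify=y` to `train_test_split`, so both parts keep the same class ratio. A minimal sketch on toy labels (the 4% positive rate and the single placeholder feature are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 40 positives out of 1000 samples.
y = np.array([1] * 40 + [0] * 960)
X = np.arange(1000).reshape(-1, 1)  # placeholder feature

# stratify=y keeps the share of positives equal in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print('Positives in train: {} of {}'.format(y_train.sum(), len(y_train)))
print('Positives in test: {} of {}'.format(y_test.sum(), len(y_test)))
```

Without stratification the number of bankrupt companies in the test set varies from split to split, which makes the per-class metrics noisier.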

Importing classification models, Pipeline and StandardScaler:

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

Model names to list of strings:

In [4]:
model_names1 = [
    'Logistic Regression',
    'K-Nearest Neighbours',
    'Naive Bayes',
    'Decission Tree',
    'Random Forest'
]

Declaring models for Grid Search:

In [5]:
models1 = [
    Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression())
    ]),
    Pipeline([
        ("scale", StandardScaler()),
        ("model", KNeighborsClassifier())
    ]),
    Pipeline([
        ("scale", StandardScaler()),
        ("model", GaussianNB()),
    ]),
    Pipeline([
        ("model", DecisionTreeClassifier())
    ]), 
    Pipeline([
        ("model", RandomForestClassifier())
    ])
    ]

Declaring parameter grids for Grid Search:

In [6]:
parameter_grids1 = [
    { # Logistic regression
        "scale__with_mean": [True, False], 
        "model__penalty": ["l1", "l2"], 
        "model__C": [0.01, 0.1, 1], 
        "model__solver": ['liblinear', 'saga']
    },
    { # KNN
        "model__n_neighbors": [3, 5, 7, 9, 11, 13], 
        "model__metric": ['minkowski'], 
        "model__p": [2]
    },
    { # Naive Bayes, no parameters
    },
    { # Decision Tree
        "model__min_samples_leaf": [5, 10, 15, 20, 25, 30, 50, 70, 100], 
        "model__max_depth": [5, 10, 15, 20]
    },
    { # Random Forests
        "model__n_estimators": [10, 50, 100, 300, 500], 
        "model__min_samples_leaf": [5, 10, 15, 20, 25]
    }
    ]
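In these grids a key such as `model__C` addresses parameter `C` of the pipeline step named `model`. To get a feeling for the cost of each search, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below copies the grids into a dict keyed by model name (the dict `grids` and the name `candidate_counts` are mine, added for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Grids mirror parameter_grids1; an empty dict means "default parameters only".
grids = {
    'Logistic Regression': {
        "scale__with_mean": [True, False],
        "model__penalty": ["l1", "l2"],
        "model__C": [0.01, 0.1, 1],
        "model__solver": ['liblinear', 'saga'],
    },
    'K-Nearest Neighbours': {
        "model__n_neighbors": [3, 5, 7, 9, 11, 13],
        "model__metric": ['minkowski'],
        "model__p": [2],
    },
    'Naive Bayes': {},
    'Decision Tree': {
        "model__min_samples_leaf": [5, 10, 15, 20, 25, 30, 50, 70, 100],
        "model__max_depth": [5, 10, 15, 20],
    },
    'Random Forest': {
        "model__n_estimators": [10, 50, 100, 300, 500],
        "model__min_samples_leaf": [5, 10, 15, 20, 25],
    },
}

cv_folds = 5  # same value as cv=5 passed to GridSearchCV

# Number of parameter combinations each grid expands to.
candidate_counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

for name, count in candidate_counts.items():
    print('{}: {} candidates, {} fits'.format(name, count, count * cv_folds))
```

Every candidate is fitted once per fold, so the Decision Tree grid alone means 36 x 5 = 180 fits per file.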

Execute classification models and report results:

In [7]:
for raw_file in raw_files:
    ExecuteClassification(raw_file, model_names1, models1, parameter_grids1)

	Processing file: data/1year.arff
	===== ==== === == = == === ==== =====

Classification: Logistic Regression
Finished calculation: 13:07:03
Best parameters: 
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=False, with_std=True)), ('model', LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False))])

Accuracy score: 96.088%

Confusion matrix: 
[[1351    0]
 [  55    0]]

Classification report: 
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1351
           1       0.00      0.00      0.00        55

   micro avg       0.96      0.96      0.96      1406
   macro avg       0.48      0.50      0.49      1406
weighted avg       0.92      0.96      0.94      1406

Cross validation: 96.157%

	----- -

Classification: Random Forest
Finished calculation: 13:37:39
Best parameters: 
Pipeline(memory=None,
     steps=[('model', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Accuracy score: 96.265%

Confusion matrix: 
[[1956    0]
 [  76    3]]

Classification report: 
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1956
           1       1.00      0.04      0.07        79

   micro avg       0.96      0.96      0.96      2035
   macro avg       0.98      0.52      0.53      2035
weighted avg       0.96      0.96      0.95      2035

Cross validation: 96.117%


Cross validation: 11.133%

	----- ---- --- -- - -- --- ---- -----

Classification: Decission Tree
Finished calculation: 14:07:43
Best parameters: 
Pipeline(memory=None,
     steps=[('model', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=30, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

Accuracy score: 95.814%

Confusion matrix: 
[[1839   12]
 [  70   38]]

Classification report: 
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      1851
           1       0.76      0.35      0.48       108

   micro avg       0.96      0.96      0.96      1959
   macro avg       0.86      0.67      0.73      1959
weighted avg       0.95      0.96      0.95      1959

Cross validation: 95.774%

	-----



Accuracy score: 90.525%

Confusion matrix: 
[[1055   46]
 [  66   15]]

Classification report: 
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1101
           1       0.25      0.19      0.21        81

   micro avg       0.91      0.91      0.91      1182
   macro avg       0.59      0.57      0.58      1182
weighted avg       0.89      0.91      0.90      1182

Cross validation: 77.750%

	----- ---- --- -- - -- --- ---- -----

Classification: Decission Tree
Finished calculation: 14:26:29
Best parameters: 
Pipeline(memory=None,
     steps=[('model', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=15, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

Accuracy score: 95.770%

Confusio

Support Vector Machine (linear and kernel). This stage is time-consuming, so it is separated from the main grid search.

In [9]:
model_names2 = [
    'SVM (Linear or Kernel)'
]

models2 = [
    Pipeline([
        ("scale", StandardScaler()),
        ("model", SVC())
    ])
    ]

parameter_grids2 = [
    [ # SVM
        {
            "model__C":[0.001, 0.01, 0.1, 1], 
            "model__kernel": ["linear"]
        },
        {
            "model__C":[0.001, 0.01, 0.1], 
            "model__kernel": ["rbf"], 
            "model__gamma": ["auto", 0.001, 0.005, 0.01, 0.02],
            "model__max_iter": [100],
            "model__tol": [0.00002]
        }
    ]
    ]

In [10]:
for raw_file in raw_files:
    ExecuteClassification(raw_file, model_names2, models2, parameter_grids2)

	Processing file: data/1year.arff
	===== ==== === == = == === ==== =====

Classification: SVM (Linear or Kernel)
Finished calculation: 08:13:39
Best parameters: 
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('model', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

Accuracy score: 96.088%

Confusion matrix: 
[[1351    0]
 [  55    0]]

Classification report: 
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1351
           1       0.00      0.00      0.00        55

   micro avg       0.96      0.96      0.96      1406
   macro avg       0.48      0.50      0.49      1406
weighted avg       0.92      0.96      0.94      1406

Cross validation: 96.140%

	----- ---- --- -- - -- --- --

Testing various Neural Networks:

In [8]:
networks = [(10,),(20,),(30,),(40,),(50,),(70,),(100,),
            (10,10,),(20,20,),(30,30,),(40,40,),(50,50,),(70,70,),(100,100,)]

for network in networks:
    print('Testing network architecture: {}'.format(network))
    for raw_file in raw_files:
        X_train, X_test, y_train, y_test = CreateTestAndTrainSetsFromRawFile(raw_file)
        model = Pipeline([
            ("standarization",StandardScaler()),
            ("NeuralNetwork",MLPClassifier(network,alpha=0,max_iter=1000))
        ])
        model.fit(X_train, y_train)
        print('Accuracy score in file {}: {:.3f}%'.format(raw_file, accuracy_score(model.predict(X_test), y_test)*100))
    print()

Testing network architecture: (10,)
Accuracy score in file data/1year.arff: 95.875%
Accuracy score in file data/2year.arff: 96.020%
Accuracy score in file data/3year.arff: 95.431%
Accuracy score in file data/4year.arff: 94.793%
Accuracy score in file data/5year.arff: 94.078%

Testing network architecture: (20,)
Accuracy score in file data/1year.arff: 96.017%
Accuracy score in file data/2year.arff: 95.921%
Accuracy score in file data/3year.arff: 95.383%
Accuracy score in file data/4year.arff: 94.589%
Accuracy score in file data/5year.arff: 94.332%

Testing network architecture: (30,)
Accuracy score in file data/1year.arff: 95.661%
Accuracy score in file data/2year.arff: 95.872%
Accuracy score in file data/3year.arff: 95.098%
Accuracy score in file data/4year.arff: 94.538%
Accuracy score in file data/5year.arff: 94.247%

Testing network architecture: (40,)
Accuracy score in file data/1year.arff: 95.804%
Accuracy score in file data/2year.arff: 95.971%
Accuracy score in file data/3year.arf

We can see that the best neural network has a single hidden layer with 20 neurons, but it cannot beat the Decision Tree model.
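The architecture comparison could also be expressed as a grid search over `hidden_layer_sizes`, which scores every architecture by cross-validation instead of a single test split. This is only a sketch on small synthetic data; the three architectures, `max_iter=500` and `cv=3` are arbitrary choices to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for one of the files.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", MLPClassifier(max_iter=500, random_state=0)),
])

# Each tuple lists the number of neurons per hidden layer.
architectures = [(10,), (20,), (10, 10)]

search = GridSearchCV(pipeline, {"model__hidden_layer_sizes": architectures}, cv=3)
search.fit(X, y)

best_architecture = search.best_params_["model__hidden_layer_sizes"]
print('Best architecture: {}'.format(best_architecture))
print('Mean CV accuracy: {:.3f}%'.format(search.best_score_ * 100))
```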

### Part 3. Conclusions

Most classification models give decent results, with accuracy in the range of 94-97%. The models work well even with default parameters, and optimization improves the results by only fractions of a percent.<br><br>
The factor that sets one model apart from the others is the number of bankrupt companies misclassified as still operating (the false negatives, bottom-left cell of the confusion matrix). The lowest value can be observed for the Decision Tree model, which is why it is favoured over the others: it made the fewest significant mistakes. Looking at accuracy across all models, the Decision Tree is also better than the other models by about 1% on average.
We can also see that Naive Bayes is the worst model for these datasets.<br><br>
The Decision Tree is the model of choice for bankruptcy prediction of Polish companies across the different forecast periods. However, its parameters still need to be optimized for each period to get satisfactory results; there is no common model for all files.
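To make the reading of the confusion matrices concrete, the sketch below recomputes accuracy and class-1 recall from two matrices printed earlier. The helper `summarize` is mine, and the two matrices have different test set sizes (1406 and 1959 rows), so they come from different files and are not a like-for-like comparison:

```python
import numpy as np

def summarize(matrix):
    """Return (accuracy, class-1 recall) for a 2x2 matrix laid out as
    [[TN, FP], [FN, TP]], the layout used by sklearn's confusion_matrix."""
    (tn, fp), (fn, tp) = np.asarray(matrix)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    return accuracy, recall

# Matrices copied from the printed outputs.
logistic = [[1351, 0], [55, 0]]
decision_tree = [[1839, 12], [70, 38]]

logistic_accuracy, logistic_recall = summarize(logistic)
tree_accuracy, tree_recall = summarize(decision_tree)

print('Logistic Regression: accuracy {:.3f}%, recall {:.2f}'.format(
    logistic_accuracy * 100, logistic_recall))
print('Decision Tree: accuracy {:.3f}%, recall {:.2f}'.format(
    tree_accuracy * 100, tree_recall))
```

The logistic regression matrix shows a model that predicts "not bankrupt" for every company: its accuracy equals the share of the majority class, while its recall for bankrupt companies is zero. The Decision Tree matrix has similar accuracy but identifies 38 of the 108 bankrupt companies.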