## Instructions

Goal: Your task is to build the best possible model to predict which firms in the ‘Manufacture of
computer, electronic and optical products’ industry default in 2015.

#### Hold-out sample

To construct the hold-out sample, follow this specification:
- Your definition of default should be the following:
 Existed in 2014 (sales > 0), but did not exist in 2015 (sales is 0 or missing)
- We are only interested in predicting default for firms in the selected industry (`ind2 == 26`) that
are small or medium enterprises (SMEs), i.e. firms whose yearly sales in 2014 were between 1000 EUR
and 10 million EUR.
- If you do the sample design properly, you will have 1037 firms in total: 56 firms defaulted and 981
stayed alive. The average sales of the firms is 0.4902 million EUR, with a minimum of 0.00107 million
EUR and a maximum of 9.57648 million EUR.
- You should not use this sample for modeling, only for evaluating your final prediction. If you use
these data in any (visible) way to estimate a model, you will be penalized with -10 points. You should
report the following measures for your final model of choice on this hold-out sample:
    - Brier-score
    - ROC curve
    - AUC
    - Accuracy, sensitivity, specificity (for optimal threshold)
    - Expected loss and optimal threshold
        - Expected loss has the following parameters: loss(FN) = 15, loss(FP) = 3
    - In addition, report the same descriptive statistics as above: the number of firms, how many
defaulted and how many stayed alive, and the mean, minimum and maximum of sales. This helps us
evaluate and compare your results.
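
The sample design above can be sketched in pandas as follows. This is a minimal illustration, not a reference solution: it assumes one row per `comp_id` and `year`, that `sales` is recorded in EUR, and that the sales bounds are inclusive. The function name `build_holdout` and the toy panel are hypothetical.

```python
import numpy as np
import pandas as pd


def build_holdout(panel: pd.DataFrame) -> pd.DataFrame:
    """Return one row per eligible 2014 firm with a 0/1 `default` flag."""
    firms_2014 = panel.loc[panel["year"] == 2014].set_index("comp_id")
    sales_2015 = panel.loc[panel["year"] == 2015].set_index("comp_id")["sales"]

    # Selected industry and SME sales range (inclusive bounds are an assumption).
    holdout = firms_2014.loc[
        (firms_2014["ind2"] == 26)
        & (firms_2014["sales"] >= 1_000)
        & (firms_2014["sales"] <= 10_000_000)
    ].copy()

    # Firms without a 2015 row become NaN after reindexing, i.e. "missing".
    next_sales = sales_2015.reindex(holdout.index)
    holdout["default"] = (next_sales.isna() | (next_sales == 0)).astype(int)
    holdout["sales_mil"] = holdout["sales"] / 1e6
    return holdout.reset_index()


# Toy panel (hypothetical values) covering each case in the specification.
toy = pd.DataFrame({
    "comp_id": [1, 1, 2, 2, 3, 4, 4, 5, 5],
    "year": [2014, 2015, 2014, 2015, 2014, 2014, 2015, 2014, 2015],
    "sales": [50_000.0, 60_000.0, 20_000.0, 0.0, 5_000.0,
              500.0, 700.0, 80_000.0, np.nan],
    "ind2": [26, 26, 26, 26, 26, 26, 26, 56, 56],
})

holdout = build_holdout(toy)

# Descriptive statistics requested in the instructions.
summary = {
    "n_firms": len(holdout),
    "n_default": int(holdout["default"].sum()),
    "n_alive": int((holdout["default"] == 0).sum()),
    "mean_sales_mil": holdout["sales_mil"].mean(),
    "min_sales_mil": holdout["sales_mil"].min(),
    "max_sales_mil": holdout["sales_mil"].max(),
}
print(summary)
```

Run on the full panel, the resulting counts and sales statistics can be checked against the figures quoted in the specification (1037 firms, 56 defaulted, 981 stayed alive).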

#### Task
- Build the best prediction to classify the defaults.
- You may do different feature engineering.
- You may make any sample design decisions!
- In each case, document your steps!
- Have at least 3 different models and compare performance
- Argue for your choice of models
    - One model must be a theoretically grounded logistic regression.
    - You can use any model you wish, even models that we have not covered in this course.
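
One way to compare at least three models is cross-validation on the training data. The sketch below is illustrative only: the data are synthetic (`make_classification`), and the random forest and gradient boosting models are arbitrary choices alongside the required logistic regression, not ones prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training sample, with roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

models = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the share of defaults similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=["roc_auc", "neg_brier_score"]
    )
    results[name] = {
        "auc": scores["test_roc_auc"].mean(),
        # The scorer returns the negated Brier score, so flip the sign back.
        "brier": -scores["test_neg_brier_score"].mean(),
    }

for name, res in results.items():
    print(f"{name:18s} AUC={res['auc']:.3f}  Brier={res['brier']:.3f}")
```

Because all comparisons happen inside cross-validation folds, the hold-out sample stays untouched until the final evaluation.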

### Submission

- A summary report (pdf), max 3 pages including tables and graphs discussing your work. It is targeted at data science team leaders
    - Can use technical language
    - But needs to be to the point
    - Focus on key decision points, results, interpretation, decision
- Technical report: a Markdown / Quarto document rendered to PDF/HTML, with more technical discussion.
    - May include code snippets
    - May include additional tables and graphs
    - Detail all decisions you made
- Reports should link to code on GitHub

### Scoring weights
Overall, you can get 30p from this task.
- It is a prediction race. 15 points will be allocated according to model performance compared to your
peers.
    - You should aim to get the lowest expected loss value.
    - The best gets 15 points; the remaining scores are scaled as a distance from the closest.
- The remaining 15 points can be earned for the following:
    - Data prep, label, and feature engineering (5p)
    - Model building, prediction, and model selection (5p)
    - Discussion of steps, decisions, and results (3p)
    - Quality of the write-up (2p)

Submission deadline: 3rd of March, 2024, 23:59 CET
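
The expected-loss criterion used for the ranking can be sketched as below. This is an illustration under stated assumptions: expected loss is taken as the average loss per firm, the threshold is searched over a simple grid, and all function names and toy arrays are hypothetical. The threshold search should be run on training or cross-validation predictions, since the hold-out sample may only be used for the final evaluation.

```python
import numpy as np

LOSS_FN = 15  # loss for a missed default, from the assignment
LOSS_FP = 3   # loss for a false alarm, from the assignment


def confusion_counts(y_true, p_hat, threshold):
    """Return (tp, fp, tn, fn) when firms with p_hat >= threshold are flagged."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(p_hat) >= threshold
    tp = int(np.sum(flagged & y_true))
    fp = int(np.sum(flagged & ~y_true))
    tn = int(np.sum(~flagged & ~y_true))
    fn = int(np.sum(~flagged & y_true))
    return tp, fp, tn, fn


def expected_loss(y_true, p_hat, threshold):
    """Average loss per firm (assumed definition of expected loss)."""
    tp, fp, tn, fn = confusion_counts(y_true, p_hat, threshold)
    return (LOSS_FN * fn + LOSS_FP * fp) / (tp + fp + tn + fn)


def best_threshold(y_true, p_hat, grid=None):
    """Grid search for the threshold with the lowest expected loss."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else np.asarray(grid)
    losses = np.array([expected_loss(y_true, p_hat, t) for t in grid])
    i = int(np.argmin(losses))
    return float(grid[i]), float(losses[i])


def report(y_true, p_hat, threshold):
    """Brier score plus threshold-dependent measures."""
    tp, fp, tn, fn = confusion_counts(y_true, p_hat, threshold)
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_hat, dtype=float)
    return {
        "brier": float(np.mean((p - y) ** 2)),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "expected_loss": expected_loss(y_true, p_hat, threshold),
    }


# Toy labels and predicted probabilities (hypothetical values).
y_toy = [0, 0, 0, 1, 1]
p_toy = [0.05, 0.10, 0.30, 0.20, 0.80]

t_star, loss_star = best_threshold(y_toy, p_toy)
metrics = report(y_toy, p_toy, t_star)
print(t_star, loss_star, metrics)
```

For well-calibrated probabilities, the loss-minimizing threshold is loss(FP) / (loss(FP) + loss(FN)) = 3 / 18 ≈ 0.167, which is a useful benchmark for the grid-search result. The ROC curve and AUC can be obtained with `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score`.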

## Introduction
....

In [1]:
# importing libraries

import os
import sys
import warnings
from patsy import dmatrices
from py_helper_functions import *
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from mizani.formatters import percent_format
from statsmodels.api import OLS
from plotnine import *
from stargazer import stargazer
from statsmodels.tools.eval_measures import mse, rmse
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance

# ignore warnings
warnings.filterwarnings("ignore")
# turn off scientific notation
# pd.set_option("display.float_format", lambda x: "%.2f" % x)
# show all columns
pd.set_option('display.max_columns',None)
# show all rows
pd.set_option('display.max_rows',None)

### Initial Setup

In [2]:
# read the data
df = pd.read_csv("https://osf.io/download/3qyut/")
# show the first 2 rows
df.head(2)

Output of `df.head(2)`, with columns shown as rows:

| column | row 0 | row 1 |
|---|---|---|
| Unnamed: 0 | 0 | 1 |
| comp_id | 1001034.0 | 1001034.0 |
| begin | 2005-01-01 | 2006-01-01 |
| end | 2005-12-31 | 2006-12-31 |
| COGS | NaN | NaN |
| amort | 692.59259 | 603.703674 |
| curr_assets | 7266.666504 | 13122.222656 |
| curr_liab | 7574.074219 | 12211.111328 |
| extra_exp | 0.0 | 0.0 |
| extra_inc | 0.0 | 0.0 |
| extra_profit_loss | 0.0 | 0.0 |
| finished_prod | NaN | NaN |
| fixed_assets | 1229.629639 | 725.925903 |
| inc_bef_tax | 218.518524 | 996.296326 |
| intang_assets | 0.0 | 0.0 |
| inventories | 4355.555664 | 7225.925781 |
| liq_assets | 2911.111084 | 5896.296387 |
| material_exp | 38222.222656 | 38140.742188 |
| net_dom_sales | NaN | NaN |
| net_exp_sales | NaN | NaN |
| personnel_exp | 22222.222656 | 23844.445312 |
| profit_loss_year | 62.962963 | 755.555542 |
| sales | 62751.851562 | 64625.925781 |
| share_eq | 881.481506 | 1637.036987 |
| subscribed_cap | 1388.888916 | 1388.888916 |
| tang_assets | 1229.629639 | 725.925903 |
| wages | NaN | NaN |
| D | NaN | NaN |
| balsheet_flag | 0 | 0 |
| balsheet_length | 364 | 364 |
| balsheet_notfullyear | 0 | 0 |
| year | 2005 | 2006 |
| founded_year | 1990.0 | 1990.0 |
| exit_year | NaN | NaN |
| ceo_count | 2.0 | 2.0 |
| foreign | 0.0 | 0.0 |
| female | 0.5 | 0.5 |
| birth_year | 1968.0 | 1968.0 |
| inoffice_days | 5686.5 | 5686.5 |
| gender | mix | mix |
| origin | Domestic | Domestic |
| nace_main | 5630.0 | 5630.0 |
| ind2 | 56.0 | 56.0 |
| ind | 3.0 | 3.0 |
| urban_m | 1 | 1 |
| region_m | Central | Central |
| founded_date | 1990-11-19 | 1990-11-19 |
| exit_date | NaN | NaN |
| labor_avg | NaN | NaN |
