In [1]:
import pandas as pd

# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)

## Default of credit card clients

### Data Set Information

- This research examines customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods.

### Attribute Information

- This research employed a binary variable **default payment next month**, default payment (Yes = 1, No = 0), as the response variable.

- X1: Amount of the given credit (NT dollar) **LIMIT_BAL**: it includes both the individual consumer credit and his/her family (supplementary) credit.

- X2: Gender **SEX** (1 = male; 2 = female).

- X3: **Education** (1 = graduate school; 2 = university; 3 = high school; 4 = others).

- X4: Marital status **MARRIAGE** (1 = married; 2 = single; 3 = others).

- X5: **Age** (year).

- X6-X11: History of past payment. The past monthly payment records (April to September 2005) are tracked as follows:

  X6 = the repayment status in September 2005; X7 = the repayment status in August 2005; ...; X11 = the repayment status in April 2005.

  The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.


- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September 2005; X13 = amount of bill statement in August 2005; ...; X17 = amount of bill statement in April 2005.

- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September 2005; X19 = amount paid in August 2005; ...; X23 = amount paid in April 2005.
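The repayment-status scale documents only the codes -1 and 1 through 9, so it may be worth checking which codes actually occur. The sketch below is a minimal illustration, not part of the original analysis: it rebuilds the six `PAY_*` columns of the first five records of the dataset (as displayed by `dccc.head()`) and lists any value outside the documented scale. The names `sample`, `DOCUMENTED_CODES` and `undocumented_codes` are hypothetical helpers introduced here.

```python
import pandas as pd

# PAY_* values of the first five records (IDs 1 to 5) of the dataset.
sample = pd.DataFrame(
    {
        "PAY_0": [2, -1, 0, 0, -1],
        "PAY_2": [2, 2, 0, 0, 0],
        "PAY_3": [-1, 0, 0, 0, -1],
        "PAY_4": [-1, 0, 0, 0, 0],
        "PAY_5": [-2, 0, 0, 0, 0],
        "PAY_6": [-2, 2, 0, 0, 0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

# Codes defined in the attribute description: -1 and 1..9.
DOCUMENTED_CODES = {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9}


def undocumented_codes(frame):
    """Return the sorted codes in `frame` that the documented scale does not cover."""
    observed = set(frame.to_numpy().ravel().tolist())
    return sorted(observed - DOCUMENTED_CODES)


print(undocumented_codes(sample))  # [-2, 0]
```

Both -2 and 0 appear in these five rows although the description does not define them, so their meaning should be confirmed before the columns are used for modelling.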

### Link

https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

In [2]:
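# header=1: column names come from the second row of the sheet;
# index_col=0: the first column (ID) becomes the index
dccc = pd.read_excel('datos/default of credit card clients.xls', header=1, index_col=0)
dccc.head()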

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [3]:
dccc.shape  # number of rows and columns

(30000, 24)

In [4]:
dccc.info()  # information about the variables

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LIMIT_BAL                   30000 non-null  int64
 1   SEX                         30000 non-null  int64
 2   EDUCATION                   30000 non-null  int64
 3   MARRIAGE                    30000 non-null  int64
 4   AGE                         30000 non-null  int64
 5   PAY_0                       30000 non-null  int64
 6   PAY_2                       30000 non-null  int64
 7   PAY_3                       30000 non-null  int64
 8   PAY_4                       30000 non-null  int64
 9   PAY_5                       30000 non-null  int64
 10  PAY_6                       30000 non-null  int64
 11  BILL_AMT1                   30000 non-null  int64
 12  BILL_AMT2                   30000 non-null  int64
 13  BILL_AMT3                   30000 non-null  int64
 14  BILL_A

## South German Credit 

### Data Set Information
- 700 good and 300 bad credits with 20 predictor variables. Data from 1973 to 1975. 

### Attribute Information

- **laufkont** = status: status of the debtor's checking account with the bank (categorical)

 1 : no checking account                       
 2 : ... < 0 DM                                
 3 : 0<= ... < 200 DM                          
 4 : ... >= 200 DM / salary for at least 1 year

-  **laufzeit** = duration: credit duration in months (quantitative)

-  **moral** = credit_history : history of compliance with previous or concurrent credit contracts (categorical)

 0 : delay in paying off in the past            
 1 : critical account/other credits elsewhere   
 2 : no credits taken/all credits paid back duly
 3 : existing credits paid back duly till now   
 4 : all credits at this bank paid back duly    

- **verw** = purpose: purpose for which the credit is needed (categorical)

 0 : others             
 1 : car (new)          
 2 : car (used)         
 3 : furniture/equipment
 4 : radio/television   
 5 : domestic appliances
 6 : repairs            
 7 : education          
 8 : vacation           
 9 : retraining         
 10 : business   



- **hoehe** = amount: credit amount in DM (quantitative; result of monotonic transformation; actual data and type of transformation unknown)

- **sparkont** = savings : debtor's savings (categorical)

 1 : unknown/no savings account
 2 : ... <  100 DM             
 3 : 100 <= ... <  500 DM      
 4 : 500 <= ... < 1000 DM      
 5 : ... >= 1000 DM    

- **beszeit** = employment_duration : duration of debtor's employment with current employer (ordinal; discretized quantitative)

 1 : unemployed      
 2 : < 1 yr          
 3 : 1 <= ... < 4 yrs
 4 : 4 <= ... < 7 yrs
 5 : >= 7 yrs    

- **rate** = installment_rate: credit installments as a percentage of debtor's disposable income (ordinal; discretized quantitative)

 1 : >= 35         
 2 : 25 <= ... < 35
 3 : 20 <= ... < 25
 4 : < 20       

- **famges** = personal_status_sex : combined information on sex and marital status; categorical.

 1 : male : divorced/separated           
 2 : female : non-single or male : single
 3 : male : married/widowed              
 4 : female : single  

- **buerge** = other_debtors: Is there another debtor or a guarantor for the credit? (categorical)

 1 : none        
 2 : co-applicant
 3 : guarantor   

- **wohnzeit** = present_residence: length of time (in years) the debtor lives in the present residence (ordinal; discretized quantitative)

 1 : < 1 yr          
 2 : 1 <= ... < 4 yrs
 3 : 4 <= ... < 7 yrs
 4 : >= 7 yrs    

- **verm** = property: the debtor's most valuable property, i.e. the highest possible code is used.

 1 : unknown / no property                    
 2 : car or other                             
 3 : building soc. savings agr./life insurance
 4 : real estate     

- **alter** = age: age in years (quantitative)

- **weitkred** = other_installment_plans : installment plans from providers other than the credit-giving bank (categorical)

 1 : bank  
 2 : stores
 3 : none  

- **wohn** = housing : type of housing the debtor lives in (categorical)

 1 : for free
 2 : rent    
 3 : own   

- **bishkred** = number_credits : number of credits including the current one the debtor has (or had) at this bank (ordinal, discretized quantitative)

 1 : 1   
 2 : 2-3 
 3 : 4-5 
 4 : >= 6

- **beruf** = job : quality of debtor's job (ordinal)

 1 : unemployed/unskilled - non-resident       
 2 : unskilled - resident                      
 3 : skilled employee/official                 
 4 : manager/self-empl./highly qualif. employee

- **pers** = people_liable : number of persons who financially depend on the debtor (i.e., are entitled to maintenance) (binary, discretized quantitative)

 1 : 3 or more
 2 : 0 to 2   

- **telef** = telephone : Is there a telephone landline registered on the debtor's name? (binary; remember that the data are from the 1970s)

 1 : no                       
 2 : yes (under customer name)

- **gastarb** = foreign_worker: Is the debtor a foreign worker? (binary)

 1 : yes
 2 : no 


- **kredit** = credit_risk : Has the credit contract been complied with (good) or not (bad) ? (binary)

 0 : bad 
 1 : good
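The attribute list pairs every German column name with an English one and gives the meaning of each categorical code. One possible way to apply that documentation is sketched below, assuming pandas; it is an illustration rather than the notebook's own preprocessing. The names `GERMAN_TO_ENGLISH`, `STATUS_LABELS`, `CREDIT_RISK_LABELS`, `sample` and `renamed` are hypothetical, and `sample` holds three columns of the first three records of the dataset.

```python
import pandas as pd

# German -> English column names, taken from the attribute list.
GERMAN_TO_ENGLISH = {
    "laufkont": "status",
    "laufzeit": "duration",
    "moral": "credit_history",
    "verw": "purpose",
    "hoehe": "amount",
    "sparkont": "savings",
    "beszeit": "employment_duration",
    "rate": "installment_rate",
    "famges": "personal_status_sex",
    "buerge": "other_debtors",
    "wohnzeit": "present_residence",
    "verm": "property",
    "alter": "age",
    "weitkred": "other_installment_plans",
    "wohn": "housing",
    "bishkred": "number_credits",
    "beruf": "job",
    "pers": "people_liable",
    "telef": "telephone",
    "gastarb": "foreign_worker",
    "kredit": "credit_risk",
}

# Code -> label for two of the categorical variables, as documented.
STATUS_LABELS = {
    1: "no checking account",
    2: "... < 0 DM",
    3: "0<= ... < 200 DM",
    4: "... >= 200 DM / salary for at least 1 year",
}
CREDIT_RISK_LABELS = {0: "bad", 1: "good"}

# Three columns of the first three records of the dataset.
sample = pd.DataFrame(
    {"laufkont": [1, 1, 2], "laufzeit": [18, 9, 12], "kredit": [1, 1, 1]}
)

# rename() ignores mapping keys that are absent from the frame.
renamed = sample.rename(columns=GERMAN_TO_ENGLISH)

# Add readable labels next to the integer codes.
renamed["status_label"] = renamed["status"].map(STATUS_LABELS)
renamed["credit_risk_label"] = renamed["credit_risk"].map(CREDIT_RISK_LABELS)

print(renamed)
```

Keeping the integer codes and adding label columns beside them leaves the ordinal variables usable as numbers while making tables and plots easier to read.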

### Link

https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29

In [5]:
# sep=' ': the .asc file is read as space-delimited text
sgc = pd.read_csv('datos/SouthGermanCredit/SouthGermanCredit.asc', sep=' ')
sgc.head()

Unnamed: 0,laufkont,laufzeit,moral,verw,hoehe,sparkont,beszeit,rate,famges,buerge,wohnzeit,verm,alter,weitkred,wohn,bishkred,beruf,pers,telef,gastarb,kredit
0,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,2,1,2,1
1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,1,1,2,1
2,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,2,1,2,1
3,1,12,4,0,2122,1,3,3,3,1,2,1,39,3,1,2,2,1,1,1,1
4,1,12,4,0,2171,1,3,4,3,1,4,2,38,1,2,2,2,2,1,1,1


In [6]:
sgc.shape  # number of rows and columns

(1000, 21)

In [7]:
sgc.info()  # information about the variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   laufkont  1000 non-null   int64
 1   laufzeit  1000 non-null   int64
 2   moral     1000 non-null   int64
 3   verw      1000 non-null   int64
 4   hoehe     1000 non-null   int64
 5   sparkont  1000 non-null   int64
 6   beszeit   1000 non-null   int64
 7   rate      1000 non-null   int64
 8   famges    1000 non-null   int64
 9   buerge    1000 non-null   int64
 10  wohnzeit  1000 non-null   int64
 11  verm      1000 non-null   int64
 12  alter     1000 non-null   int64
 13  weitkred  1000 non-null   int64
 14  wohn      1000 non-null   int64
 15  bishkred  1000 non-null   int64
 16  beruf     1000 non-null   int64
 17  pers      1000 non-null   int64
 18  telef     1000 non-null   int64
 19  gastarb   1000 non-null   int64
 20  kredit    1000 non-null   int64
dtypes: int64(21)
memory usage: 164.2 KB


## Give Me Some Credit

### Data Set Information

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.

### Attribute Information

- **SeriousDlqin2yrs** : Person experienced 90 days past due delinquency or worse : 1 = Yes, 0 = No
- **RevolvingUtilizationOfUnsecuredLines** : Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits : percentage
- **age** : Age of borrower in years : integer
- **NumberOfTime30-59DaysPastDueNotWorse** : Number of times borrower has been 30-59 days past due but no worse in the last 2 years : integer
- **DebtRatio** : Monthly debt payments, alimony, and living costs divided by monthly gross income : percentage
- **MonthlyIncome** : Monthly income : float
- **NumberOfOpenCreditLinesAndLoans** : Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) : integer
- **NumberOfTimes90DaysLate** : Number of times borrower has been 90 days or more past due : integer
- **NumberRealEstateLoansOrLines** : Number of mortgage and real estate loans, including home equity lines of credit : integer
- **NumberOfTime60-89DaysPastDueNotWorse** : Number of times borrower has been 60-89 days past due but no worse in the last 2 years : integer
- **NumberOfDependents** : Number of dependents in family, excluding themselves (spouse, children, etc.) : integer

### Link

https://www.kaggle.com/competitions/GiveMeSomeCredit/data

In [8]:
train = pd.read_csv('datos/GiveMeSomeCredit/cs-training.csv', index_col=0)
train.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [9]:
train.shape  # number of rows and columns

(150000, 11)

In [10]:
train.info()  # information about the variables

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      150000 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 2   age                                   150000 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 4   DebtRatio                             150000 non-null  float64
 5   MonthlyIncome                         120269 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 7   NumberOfTimes90DaysLate               150000 non-null  int64  
 8   NumberRealEstateLoansOrLines          150000 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 10  NumberOfDependents                    146076 non-null  float64
dtype
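The non-null counts reported by `train.info()` show that only `MonthlyIncome` and `NumberOfDependents` have gaps. As a small worked calculation (not part of the original notebook), the snippet below turns those counts into missing-value shares; `TOTAL_ROWS`, `non_null_counts` and `missing_share` are illustrative names, and the numbers are copied from the output above.

```python
# Row count and non-null counts copied from the train.info() output.
TOTAL_ROWS = 150_000
non_null_counts = {"MonthlyIncome": 120_269, "NumberOfDependents": 146_076}

# Share of missing values per column = (rows - non-null) / rows.
missing_share = {
    column: (TOTAL_ROWS - count) / TOTAL_ROWS
    for column, count in non_null_counts.items()
}

for column, share in missing_share.items():
    missing = TOTAL_ROWS - non_null_counts[column]
    print(f"{column}: {missing} missing ({share:.1%})")
# MonthlyIncome: 29731 missing (19.8%)
# NumberOfDependents: 3924 missing (2.6%)
```

On the loaded frame itself, `train.isna().mean()` should give the same shares for every column at once.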

## HMEQ


### Data Set Information

The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. 

### Attribute Information

- BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
- LOAN: Amount of the loan request
- MORTDUE: Amount due on existing mortgage
- VALUE: Value of current property
- REASON: DebtCon = debt consolidation; HomeImp = home improvement
- JOB: Occupational categories
- YOJ: Years at present job
- DEROG: Number of major derogatory reports
- DELINQ: Number of delinquent credit lines
- CLAGE: Age of oldest credit line in months
- NINQ: Number of recent credit inquiries
- CLNO: Number of credit lines
- DEBTINC: Debt-to-income ratio

### Link

http://www.creditriskanalytics.net/datasets-private2.html

In [11]:
hmeq = pd.read_csv('datos/hmeq.csv')
hmeq.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,1,1500,,,,,,,,,,,
4,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,


In [12]:
hmeq.shape  # number of rows and columns

(5960, 13)

In [13]:
hmeq.info()  # information about the variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
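According to the `hmeq.info()` output, every column except `BAD` and `LOAN` contains missing values, with `DEBTINC` the sparsest (4693 of 5960 non-null). A minimal sketch of how the missing share per column could be computed with pandas follows. It is an illustration only: `hmeq_head` is a hypothetical frame that rebuilds five columns of the five rows shown by `hmeq.head()`, so its shares describe those five rows and not the full dataset.

```python
import pandas as pd

nan = float("nan")

# Five columns of the five rows displayed by hmeq.head().
hmeq_head = pd.DataFrame(
    {
        "BAD": [1, 1, 1, 1, 0],
        "LOAN": [1100, 1300, 1500, 1500, 1700],
        "MORTDUE": [25860.0, 70053.0, 13500.0, nan, 97800.0],
        "REASON": ["HomeImp", "HomeImp", "HomeImp", None, "HomeImp"],
        "DEBTINC": [nan, nan, nan, nan, nan],
    }
)

# isna() marks missing cells; the column mean of that boolean mask
# is the fraction of missing values in each column.
missing_share = hmeq_head.isna().mean().sort_values(ascending=False)

print(missing_share)
```

In this small sample `DEBTINC` is missing in all five rows, while `MORTDUE` and `REASON` are each missing in one. Applying the same expression to the full `hmeq` frame would give shares consistent with the non-null counts listed above.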


## Lending Club

### Link

https://www.kaggle.com/datasets/devanshi23/loan-data-2007-2014

In [14]:
lc = pd.read_csv('datos/loan_data_2007_2014/loan_data_2007_2014_sample.csv', index_col=0)
lc.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m
17325,709490,902113,16000,16000,15975.0,60 months,17.88,405.26,E,E5,XL Insurance,7 years,OWN,75500.0,Verified,Apr-11,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 04/02/11 > This loan is fo...,other,Settlement,198xx,DE,19.35,1.0,Dec-98,3.0,5.0,57.0,9.0,1.0,8979,85.5,27.0,f,0.0,0.0,23610.96951,23574.11,16000.0,7610.97,0.0,0.0,0.0,Dec-14,6214.37,,Jan-15,0.0,,1,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,
217965,1273339,1517729,11000,11000,11000.0,36 months,6.03,334.8,A,A1,Federal Government,3 years,RENT,57400.0,Verified,May-12,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 05/01/12 > I'd like to use...,debt_consolidation,Debt Consolidation,402xx,KY,24.49,0.0,Dec-80,0.0,,,11.0,0.0,5342,24.0,28.0,f,0.0,0.0,11488.51,11488.51,11000.0,488.51,0.0,0.0,0.0,Mar-13,8478.22,,Mar-13,0.0,,1,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,
155719,3702268,4675542,3000,3000,3000.0,36 months,10.16,97.03,B,B1,Farmer's Insurance,6 years,MORTGAGE,85000.0,Source Verified,Mar-13,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Debt Snowball,730xx,OK,9.9,1.0,Dec-00,0.0,8.0,,12.0,0.0,10082,49.4,20.0,w,0.0,0.0,3025.73,3025.73,3000.0,25.73,0.0,0.0,0.0,Apr-13,3025.88,,Apr-13,0.0,,1,INDIVIDUAL,,,,0.0,0.0,142614.0,,,,,,,,,,,,20400.0,,,
391129,16201614,18304095,15000,15000,14900.0,36 months,9.17,478.19,B,B1,ACCOUNTANT,2 years,OWN,50000.0,Verified,May-14,Current,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,481xx,MI,29.45,0.0,Aug-95,0.0,33.0,54.0,11.0,1.0,786,8.0,36.0,f,7175.83,7127.99,9563.8,9500.04,7824.17,1739.63,0.0,0.0,0.0,Jan-16,478.19,Feb-16,Jan-16,0.0,33.0,1,INDIVIDUAL,,,,0.0,0.0,20758.0,,,,,,,,,,,,9800.0,,,
9753,839065,1049201,10000,10000,10000.0,36 months,9.99,322.63,B,B1,Millenium Group,1 year,RENT,65000.0,Source Verified,Aug-11,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 08/08/11 > I am a junior t...,small_business,Seeking Liquidity Before Bonus Season,105xx,NY,1.61,0.0,Nov-97,1.0,,,4.0,0.0,5122,18.2,8.0,f,0.0,0.0,11606.89139,11606.89,10000.0,1606.89,0.0,0.0,0.0,Aug-14,351.86,,Aug-14,0.0,,1,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,
