## Loading and preparing the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load the data set from a URL; two encoding names are tried out as a test
# (in Python both "ISO-8859-1" and "latin" are aliases of the same Latin-1 codec)
link_do_danych = "https://raw.githubusercontent.com/saimadhu-polamuri/DataHakthon3X/master/dataSet/Train.csv"
system_kodowania1 = "ISO-8859-1"
system_kodowania2 = "latin"

df1_ = pd.read_csv(link_do_danych, header=0, index_col = 'ID', encoding = system_kodowania1,
                    converters={'Employer_Name': str, 'Salary_Account': str, 'City': str, 'DOB': str}, 
                    parse_dates=['Lead_Creation_Date'])
df2_ = pd.read_csv(link_do_danych, header=0, index_col = 'ID', encoding = system_kodowania2,
                    converters={'Employer_Name': str, 'Salary_Account': str, 'City': str, 'DOB': str}, 
                    parse_dates=['Lead_Creation_Date'])

# note: the date of birth 'DOB' is not parsed automatically, because read_csv reads two-digit years up to 68 as 20XX
# instead of (in this case) 19XX
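
The pivot mentioned in the comment above can be reproduced with the standard library alone. This is a minimal sketch, assuming the `%y` directive follows the POSIX convention (00-68 map to 2000-2068, 69-99 map to 1969-1999); the sample date is made up for illustration.

```python
from datetime import datetime

# Two-digit year parsed with %y: 68 falls on the 20XX side of the pivot.
parsed_auto = datetime.strptime("23-May-68", "%d-%b-%y")

# The same date with the century written out explicitly (1900 + rr).
rr = 68
parsed_fixed = datetime.strptime("23-May-" + str(1900 + rr), "%d-%b-%Y")

print(parsed_auto.year, parsed_fixed.year)
```

Writing the four-digit year explicitly sidesteps the pivot, which is the reason for reading 'DOB' as a plain string first.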

Further analysis uses the data in: 'df1_'

In [3]:
# drop the variable (column) LoggedIn, as required by the project
df1_ = df1_.drop(columns=["LoggedIn"])

In [4]:
# overview of the data in df1_
df1_.info(20)

<class 'pandas.core.frame.DataFrame'>
Index: 87020 entries, ID000002C20 to ID124821V10
Data columns (total 24 columns):
Gender                   87020 non-null object
City                     87020 non-null object
Monthly_Income           87020 non-null int64
DOB                      87020 non-null object
Lead_Creation_Date       87020 non-null datetime64[ns]
Loan_Amount_Applied      86949 non-null float64
Loan_Tenure_Applied      86949 non-null float64
Existing_EMI             86949 non-null float64
Employer_Name            87020 non-null object
Salary_Account           87020 non-null object
Mobile_Verified          87020 non-null object
Var5                     87020 non-null int64
Var1                     87020 non-null object
Loan_Amount_Submitted    52407 non-null float64
Loan_Tenure_Submitted    52407 non-null float64
Interest_Rate            27726 non-null float64
Processing_Fee           27420 non-null float64
EMI_Loan_Submitted       27726 non-null float64
Filled_Form         

1. Below I convert all dates using the rule: YEAR = 1900 + rr, where rr is the two-digit year (as an int) extracted from the DOB field, which holds the date of birth as a string. In this data set the problem concerns the years 50-68, which I convert to 1950-1968, whereas read_csv would automatically read them as 2050-2068.
2. From the date obtained this way I then compute the approximate age in years of each loan applicant by subtracting the date of birth, DOBd (already in date format), from Lead_Creation_Date.

In [5]:
from datetime import datetime
from dateutil import relativedelta

DOBdlst = []  # correctly computed dates of birth, already as datetime objects
Agelst = []   # applicant's age in years (Lead_Creation_Date minus the date of birth)

for i in range(len(df1_)):
    DOBstr = df1_['DOB'].iloc[i]  # positional access: the index holds ID strings, not integers
    rr = int(DOBstr[-2:])
    rok = 1900 + rr  # year computed according to the rule stated above
    DOBstr = DOBstr[0:7] + str(rok)  # join day and month with the 4-digit year
    DOBdate = datetime.strptime(DOBstr, '%d-%b-%Y')  # convert the date of birth to a datetime
    DOBdlst.append(DOBdate)  # append the date to the list of dates of birth
    date_L = df1_['Lead_Creation_Date'].iloc[i]
    difference = relativedelta.relativedelta(date_L, DOBdate)  # Lead_Creation_Date minus the date of birth
    years = difference.years
    if years != 100:  # in a dozen or so records the date of birth equals the application date,
                      # which is impossible (a data-entry error); with the 1900 + rr rule
                      # the difference then comes out as exactly 100 years
        Agelst.append(years)
    else:
        #print(years)
        Agelst.append(np.nan)  # a difference of 100 years is replaced with NaN

# create new columns in df1_ from the lists above
df1_['DOBd'] = DOBdlst
df1_['estimated_age_in_years'] = Agelst
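
The same rule can also be written without an explicit loop. The following is a minimal sketch, not the method used in this notebook: the helper name `approx_age_years` is hypothetical, the first two sample rows are taken from the data shown in this notebook and the third is made up to trigger the 100-year case. Ages are counted in completed years, which may differ from `relativedelta` for 29 February birthdays.

```python
import pandas as pd

def approx_age_years(dob: pd.Series, ref: pd.Series) -> pd.Series:
    """Age in completed years at `ref`, for `dob` strings such as '23-May-78'."""
    # Rebuild the date with a four-digit year: '23-May-78' -> '23-May-1978'.
    born = pd.to_datetime(dob.str[:-2] + "19" + dob.str[-2:], format="%d-%b-%Y")
    # True where the birthday has already occurred in the reference year.
    had_birthday = (ref.dt.month > born.dt.month) | (
        (ref.dt.month == born.dt.month) & (ref.dt.day >= born.dt.day)
    )
    age = ref.dt.year - born.dt.year - (~had_birthday).astype(int)
    # An age of exactly 100 marks the records treated as data-entry errors.
    return age.where(age != 100)

sample = pd.DataFrame({
    "DOB": ["23-May-78", "07-Oct-85", "15-May-15"],
    "Lead_Creation_Date": pd.to_datetime(["2015-05-15", "2015-05-04", "2015-05-15"]),
})
sample["estimated_age_in_years"] = approx_age_years(
    sample["DOB"], sample["Lead_Creation_Date"]
)
print(sample)
```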

In [6]:
# check the result of the algorithm above
df1_.estimated_age_in_years.value_counts()

26.0    6981
27.0    6959
25.0    6627
28.0    6485
29.0    6220
30.0    5588
24.0    5290
31.0    4779
32.0    4170
23.0    4070
33.0    3603
34.0    3270
35.0    2664
22.0    2658
36.0    1929
37.0    1617
21.0    1402
38.0    1383
39.0    1179
40.0    1047
45.0     987
41.0     803
42.0     769
20.0     705
43.0     616
44.0     556
47.0     476
46.0     474
48.0     385
49.0     367
50.0     360
51.0     325
19.0     324
52.0     297
53.0     281
54.0     240
55.0     234
18.0     189
57.0     148
56.0     130
58.0     102
59.0      81
60.0      49
64.0      42
63.0      28
61.0      27
17.0      25
62.0      22
65.0      12
67.0       6
68.0       5
72.0       4
73.0       4
69.0       3
86.0       1
78.0       1
77.0       1
82.0       1
66.0       1
71.0       1
Name: estimated_age_in_years, dtype: int64

In [7]:
df1_.head()

ID,Gender,City,Monthly_Income,DOB,Lead_Creation_Date,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,Salary_Account,...,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,Disbursed,DOBd,estimated_age_in_years
ID000002C20,Female,Delhi,20000,23-May-78,2015-05-15,300000.0,5.0,0.0,CYBOSOL,HDFC Bank,...,,,N,Web-browser,G,S122,1,0,1978-05-23,36.0
ID000004E40,Male,Mumbai,35000,07-Oct-85,2015-05-04,200000.0,2.0,0.0,TATA CONSULTANCY SERVICES LTD (TCS),ICICI Bank,...,,6762.9,N,Web-browser,G,S122,3,0,1985-10-07,29.0
ID000007H20,Male,Panchkula,22500,10-Oct-81,2015-05-19,600000.0,4.0,0.0,ALCHEMIST HOSPITALS LTD,State Bank of India,...,,,N,Web-browser,B,S143,1,0,1981-10-10,33.0
ID000008I30,Male,Saharsa,35000,30-Nov-87,2015-05-09,1000000.0,5.0,0.0,BIHAR GOVERNMENT,State Bank of India,...,,,N,Web-browser,B,S143,3,0,1987-11-30,27.0
ID000009J40,Male,Bengaluru,100000,17-Feb-84,2015-05-20,500000.0,2.0,25000.0,GLOBAL EDGE SOFTWARE,HDFC Bank,...,,,N,Web-browser,B,S134,3,0,1984-02-17,31.0


In [8]:
# save the data locally to disk (used in the early phase of the analysis)
#df1_.to_csv(path_or_buf = 'dane/df1_d1.csv', encoding='utf-8')

Further analysis uses the data in: 'df1'

In [9]:
# drop the date columns that are no longer needed; only the age is kept
df1 = df1_.drop(columns=["Lead_Creation_Date", "DOB", "DOBd"])
df1.head(3)

ID,Gender,City,Monthly_Income,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,Salary_Account,Mobile_Verified,Var5,...,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,Disbursed,estimated_age_in_years
ID000002C20,Female,Delhi,20000,300000.0,5.0,0.0,CYBOSOL,HDFC Bank,N,0,...,,,,N,Web-browser,G,S122,1,0,36.0
ID000004E40,Male,Mumbai,35000,200000.0,2.0,0.0,TATA CONSULTANCY SERVICES LTD (TCS),ICICI Bank,Y,13,...,13.25,,6762.9,N,Web-browser,G,S122,3,0,29.0
ID000007H20,Male,Panchkula,22500,600000.0,4.0,0.0,ALCHEMIST HOSPITALS LTD,State Bank of India,Y,0,...,,,,N,Web-browser,B,S143,1,0,33.0


In [10]:
# display information about the variable types in the frame: df1
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87020 entries, ID000002C20 to ID124821V10
Data columns (total 23 columns):
Gender                    87020 non-null object
City                      87020 non-null object
Monthly_Income            87020 non-null int64
Loan_Amount_Applied       86949 non-null float64
Loan_Tenure_Applied       86949 non-null float64
Existing_EMI              86949 non-null float64
Employer_Name             87020 non-null object
Salary_Account            87020 non-null object
Mobile_Verified           87020 non-null object
Var5                      87020 non-null int64
Var1                      87020 non-null object
Loan_Amount_Submitted     52407 non-null float64
Loan_Tenure_Submitted     52407 non-null float64
Interest_Rate             27726 non-null float64
Processing_Fee            27420 non-null float64
EMI_Loan_Submitted        27726 non-null float64
Filled_Form               87020 non-null object
Device_Type               87020 non-null object
Var2      

In [11]:
# overview of the values of selected variables
df1.Gender.value_counts()

Male      49848
Female    37172
Name: Gender, dtype: int64

In [12]:
df1.Filled_Form.value_counts()

N    67530
Y    19490
Name: Filled_Form, dtype: int64

In [13]:
df1.Var1.value_counts()

HBXX    59294
HBXC     9010
HBXB     4479
HAXA     2909
HBXA     2123
HAXB     2011
HBXD     1964
HAXC     1536
HBXH      970
HCXF      722
HAYT      508
HAVC      384
HAXM      268
HCXD      237
HCYS      217
HVYS      186
HAZD      109
HCXG       78
HAXF       15
Name: Var1, dtype: int64

In [14]:
df1.Var2.value_counts()

B    37280
G    33032
C    14210
E     1315
D      634
F      544
A        5
Name: Var2, dtype: int64

In [15]:
df1.City.value_counts()

Delhi                  12527
Bengaluru              10824
Mumbai                 10795
Hyderabad               7272
Chennai                 6916
Pune                    5207
Kolkata                 2888
Ahmedabad               1788
Jaipur                  1331
Gurgaon                 1212
Coimbatore              1147
                        1003
Thane                    905
Chandigarh               870
Surat                    802
Visakhapatnam            764
Indore                   734
Vadodara                 624
Nagpur                   594
Lucknow                  580
Ghaziabad                560
Bhopal                   513
Kochi                    492
Patna                    461
Faridabad                447
Madurai                  375
Noida                    373
Gautam Buddha Nagar      338
Dehradun                 314
Raipur                   289
                       ...  
Gadwal                     1
Boudh                      1
North Cachar Hills         1
Gopal Ganj    

In [16]:
df1.Employer_Name.value_counts()

0                                               4914
TATA CONSULTANCY SERVICES LTD (TCS)              550
COGNIZANT TECHNOLOGY SOLUTIONS INDIA PVT LTD     404
ACCENTURE SERVICES PVT LTD                       324
GOOGLE                                           301
HCL TECHNOLOGIES LTD                             250
ICICI BANK LTD                                   239
INDIAN AIR FORCE                                 191
INFOSYS TECHNOLOGIES                             181
GENPACT                                          179
IBM CORPORATION                                  173
INDIAN ARMY                                      171
TYPE SLOWLY FOR AUTO FILL                        162
WIPRO TECHNOLOGIES                               155
HDFC BANK LTD                                    148
IKYA HUMAN CAPITAL SOLUTIONS LTD                 142
STATE GOVERNMENT                                 134
INDIAN RAILWAY                                   130
INDIAN NAVY                                   

In [17]:
df1.Device_Type.value_counts()

Web-browser    64316
Mobile         22704
Name: Device_Type, dtype: int64

In [18]:
df1.Source.value_counts()

S122    38567
S133    29885
S159     5599
S143     4332
S127     1931
S137     1724
S134     1301
S161      769
S151      720
S157      650
S153      494
S156      308
S144      299
S158      208
S123       73
S141       57
S162       36
S124       24
S160       11
S150       10
S155        4
S139        3
S136        3
S129        3
S138        3
S135        2
S130        1
S125        1
S140        1
S154        1
Name: Source, dtype: int64

In [19]:
df1.Salary_Account.value_counts()

HDFC Bank                                          17695
ICICI Bank                                         13636
State Bank of India                                11843
                                                   11764
Axis Bank                                           8783
Citibank                                            2376
Kotak Bank                                          2067
IDBI Bank                                           1550
Punjab National Bank                                1201
Bank of India                                       1170
Bank of Baroda                                      1126
Standard Chartered Bank                              995
Canara Bank                                          990
Union Bank of India                                  951
Yes Bank                                             779
ING Vysya                                            678
Corporation bank                                     649
Indian Overseas Bank           

In [20]:
df1.Mobile_Verified.value_counts()

Y    56481
N    30539
Name: Mobile_Verified, dtype: int64

Based on the field names and on information available online, I assume that when the fields 'Loan_Amount_Submitted', 'Loan_Tenure_Submitted' and 'EMI_Loan_Submitted' are empty for a given applicant, the final assessment of the loan application used the values from 'Loan_Amount_Applied', 'Loan_Tenure_Applied' and 'Existing_EMI'.
Therefore, below I copy the corresponding "applied" values into the "submitted" fields (where these are empty); from then on I use only the "submitted" fields and drop the "applied" ones.

In [21]:
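# assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions may not propagate to df1
df1['Loan_Amount_Submitted'] = df1['Loan_Amount_Submitted'].fillna(df1['Loan_Amount_Applied'])
df1['Loan_Tenure_Submitted'] = df1['Loan_Tenure_Submitted'].fillna(df1['Loan_Tenure_Applied'])
df1['EMI_Loan_Submitted'] = df1['EMI_Loan_Submitted'].fillna(df1['Existing_EMI'])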

In [22]:
# drop the 'Applied' fields and 'Existing_EMI'
df1 = df1.drop(columns=["Loan_Tenure_Applied", "Loan_Amount_Applied", "Existing_EMI"])

In [23]:
# drop rows with NaN in: Loan_Amount_Submitted or EMI_Loan_Submitted
df1.dropna(axis=0, subset=["Loan_Amount_Submitted", "EMI_Loan_Submitted"], inplace=True)
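
The fill-then-drop rule used above can be seen on a tiny frame. This is a minimal sketch with made-up values (the frame `toy` is hypothetical): a missing "submitted" value is taken from the "applied" column, and a row is dropped only when both are missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Loan_Amount_Applied":   [300000.0, 600000.0, np.nan],
    "Loan_Amount_Submitted": [np.nan,   450000.0, np.nan],
})

# Fill gaps in the "submitted" column with the aligned "applied" values.
toy["Loan_Amount_Submitted"] = toy["Loan_Amount_Submitted"].fillna(
    toy["Loan_Amount_Applied"]
)

# Rows that are still empty had no value in either column.
remaining = toy.dropna(subset=["Loan_Amount_Submitted"])
print(remaining)
```

An existing "submitted" value (450000.0 in the second row) is never overwritten by the "applied" one.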

In [24]:
# drop rows with NaN in: Existing_EMI
#df1.dropna(axis=0, subset=['Existing_EMI'], inplace=True)

In [25]:
# check how many values are missing in each column
df1.isnull().sum()

Gender                        0
City                          0
Monthly_Income                0
Employer_Name                 0
Salary_Account                0
Mobile_Verified               0
Var5                          0
Var1                          0
Loan_Amount_Submitted         0
Loan_Tenure_Submitted         0
Interest_Rate             59252
Processing_Fee            59558
EMI_Loan_Submitted            0
Filled_Form                   0
Device_Type                   0
Var2                          0
Source                        0
Var4                          0
Disbursed                     0
estimated_age_in_years       17
dtype: int64

In [26]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 86978 entries, ID000002C20 to ID124821V10
Data columns (total 20 columns):
Gender                    86978 non-null object
City                      86978 non-null object
Monthly_Income            86978 non-null int64
Employer_Name             86978 non-null object
Salary_Account            86978 non-null object
Mobile_Verified           86978 non-null object
Var5                      86978 non-null int64
Var1                      86978 non-null object
Loan_Amount_Submitted     86978 non-null float64
Loan_Tenure_Submitted     86978 non-null float64
Interest_Rate             27726 non-null float64
Processing_Fee            27420 non-null float64
EMI_Loan_Submitted        86978 non-null float64
Filled_Form               86978 non-null object
Device_Type               86978 non-null object
Var2                      86978 non-null object
Source                    86978 non-null object
Var4                      86978 non-null int64
Disbursed     

For Filled_Form, Gender, Mobile_Verified and Device_Type I use pd.get_dummies (below).

In [27]:
dummies = pd.get_dummies(df1[['Filled_Form', 'Gender', 'Mobile_Verified', 'Device_Type']])

In [28]:
dummies.head(2)

ID,Filled_Form_N,Filled_Form_Y,Gender_Female,Gender_Male,Mobile_Verified_N,Mobile_Verified_Y,Device_Type_Mobile,Device_Type_Web-browser
ID000002C20,1,0,1,0,1,0,0,1
ID000004E40,1,0,0,1,0,1,0,1


Defining the target variable y

In [29]:
# define the target variable y
y = df1.Disbursed
type(y)

pandas.core.series.Series

In [30]:
# drop some of the redundant columns
tmp1 = df1.drop(columns=["Filled_Form", "Gender", "Mobile_Verified", "Device_Type"])
#X = pd.concat([tmp1, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)

Further processing uses Xf1

In [31]:
Xf1 = pd.concat([tmp1, dummies[['Gender_Male', 'Filled_Form_Y', 'Mobile_Verified_Y', 'Device_Type_Mobile']]], axis = 1)
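
Keeping one indicator per two-level variable, as done above, can also be requested from `pd.get_dummies` directly. This is a minimal sketch on a made-up frame, not the approach used in this notebook. Note that `drop_first=True` drops the alphabetically first level, so it keeps `Device_Type_Web-browser` rather than the `Device_Type_Mobile` column selected above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Device_Type": ["Web-browser", "Web-browser", "Mobile"],
})

# drop_first=True removes the first level of each encoded column,
# leaving a single 0/1 indicator per two-level variable.
encoded = pd.get_dummies(toy, drop_first=True).astype(int)
print(encoded)
```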

In [32]:
Xf1.head()

ID,City,Monthly_Income,Employer_Name,Salary_Account,Var5,Var1,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Var2,Source,Var4,Disbursed,estimated_age_in_years,Gender_Male,Filled_Form_Y,Mobile_Verified_Y,Device_Type_Mobile
ID000002C20,Delhi,20000,CYBOSOL,HDFC Bank,0,HBXX,300000.0,5.0,,,0.0,G,S122,1,0,36.0,0,0,0,0
ID000004E40,Mumbai,35000,TATA CONSULTANCY SERVICES LTD (TCS),ICICI Bank,13,HBXA,200000.0,2.0,13.25,,6762.9,G,S122,3,0,29.0,1,0,1,0
ID000007H20,Panchkula,22500,ALCHEMIST HOSPITALS LTD,State Bank of India,0,HBXX,450000.0,4.0,,,0.0,B,S143,1,0,33.0,1,0,1,0
ID000008I30,Saharsa,35000,BIHAR GOVERNMENT,State Bank of India,10,HBXX,920000.0,5.0,,,0.0,B,S143,3,0,27.0,1,0,1,0
ID000009J40,Bengaluru,100000,GLOBAL EDGE SOFTWARE,HDFC Bank,17,HBXX,500000.0,2.0,,,25000.0,B,S134,3,0,31.0,1,0,1,0


In [33]:
# label encoding of the alphanumeric categorical variables: City, Employer_Name, Salary_Account, Source, Var1, Var2
# using LabelEncoder

In [34]:
# list of the variables to label-encode
Vars_To_Labelize = ["City", "Employer_Name", "Salary_Account", "Source", "Var1", "Var2"]

In [35]:
from sklearn import preprocessing

def labelize(df, VarsToLabelize):
    
    """
    Parameters:
        df (pandas DataFrame): data set containing, among others,
            the columns to be label-encoded
        VarsToLabelize (list): names of the columns to which numeric labels
            are assigned using LabelEncoder()
    Returns:
        the same df (modified in place), with a new '<column>_labels' column
        holding the LabelEncoder() codes for each column in VarsToLabelize
    """
    # initialise the LabelEncoder
    lbencdr = preprocessing.LabelEncoder()
    labels_str = '_labels'
    for labelVar in VarsToLabelize:
        labelVar_label = labelVar + labels_str
        # fit and transform in one step; the encoder is refitted for every column
        df[labelVar_label] = lbencdr.fit_transform(df[labelVar])
    return df
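
What label encoding does to a column can be shown with pandas alone. This is a minimal sketch on made-up city names, using category codes instead of `LabelEncoder`; for plain string columns the codes are expected to match, since both approaches number the sorted distinct values.

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Delhi", "Bengaluru"], name="City")

as_category = cities.astype("category")
codes = as_category.cat.codes                           # integer code per row
mapping = dict(enumerate(as_category.cat.categories))   # code -> original value

print(codes.tolist())
print(mapping)
```

The codes follow alphabetical order only, so their numeric ordering carries no meaning about the cities themselves.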
        

In [36]:
labelize(Xf1, Vars_To_Labelize)
Xf1.head()

ID,City,Monthly_Income,Employer_Name,Salary_Account,Var5,Var1,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,...,Gender_Male,Filled_Form_Y,Mobile_Verified_Y,Device_Type_Mobile,City_labels,Employer_Name_labels,Salary_Account_labels,Source_labels,Var1_labels,Var2_labels
ID000002C20,Delhi,20000,CYBOSOL,HDFC Bank,0,HBXX,300000.0,5.0,,,...,0,0,0,0,173,8692,21,0,13,6
ID000004E40,Mumbai,35000,TATA CONSULTANCY SERVICES LTD (TCS),ICICI Bank,13,HBXA,200000.0,2.0,13.25,,...,1,0,1,0,447,38691,23,0,8,6
ID000007H20,Panchkula,22500,ALCHEMIST HOSPITALS LTD,State Bank of India,0,HBXX,450000.0,4.0,,,...,1,0,1,0,498,1834,45,16,13,1
ID000008I30,Saharsa,35000,BIHAR GOVERNMENT,State Bank of India,10,HBXX,920000.0,5.0,,,...,1,0,1,0,569,5699,45,16,13,1
ID000009J40,Bengaluru,100000,GLOBAL EDGE SOFTWARE,HDFC Bank,17,HBXX,500000.0,2.0,,,...,1,0,1,0,88,13514,21,8,13,1


In [37]:
X = Xf1.drop(columns=Vars_To_Labelize)
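
The frame X built above still carries the Disbursed column, which is also used as y. This is a minimal sketch, on a made-up frame with hypothetical values, of separating the target from the features so that a model cannot read the label directly from its inputs.

```python
import pandas as pd

frame = pd.DataFrame({
    "Monthly_Income": [20000, 35000, 22500],
    "Var5": [0, 13, 0],
    "Disbursed": [0, 0, 1],  # target column (values made up)
})

target = frame["Disbursed"]
features = frame.drop(columns=["Disbursed"])  # features without the target

print(list(features.columns))
```

When the target stays among the features, a perfect cross-validation score such as mean: 1.0 is the expected symptom.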

In [38]:
X.head()

ID,Monthly_Income,Var5,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Var4,Disbursed,estimated_age_in_years,Gender_Male,Filled_Form_Y,Mobile_Verified_Y,Device_Type_Mobile,City_labels,Employer_Name_labels,Salary_Account_labels,Source_labels,Var1_labels,Var2_labels
ID000002C20,20000,0,300000.0,5.0,,,0.0,1,0,36.0,0,0,0,0,173,8692,21,0,13,6
ID000004E40,35000,13,200000.0,2.0,13.25,,6762.9,3,0,29.0,1,0,1,0,447,38691,23,0,8,6
ID000007H20,22500,0,450000.0,4.0,,,0.0,1,0,33.0,1,0,1,0,498,1834,45,16,13,1
ID000008I30,35000,10,920000.0,5.0,,,0.0,3,0,27.0,1,0,1,0,569,5699,45,16,13,1
ID000009J40,100000,17,500000.0,2.0,,,25000.0,3,0,31.0,1,0,1,0,88,13514,21,8,13,1


In [39]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 86978 entries, ID000002C20 to ID124821V10
Data columns (total 20 columns):
Monthly_Income            86978 non-null int64
Var5                      86978 non-null int64
Loan_Amount_Submitted     86978 non-null float64
Loan_Tenure_Submitted     86978 non-null float64
Interest_Rate             27726 non-null float64
Processing_Fee            27420 non-null float64
EMI_Loan_Submitted        86978 non-null float64
Var4                      86978 non-null int64
Disbursed                 86978 non-null int64
estimated_age_in_years    86961 non-null float64
Gender_Male               86978 non-null uint8
Filled_Form_Y             86978 non-null uint8
Mobile_Verified_Y         86978 non-null uint8
Device_Type_Mobile        86978 non-null uint8
City_labels               86978 non-null int64
Employer_Name_labels      86978 non-null int64
Salary_Account_labels     86978 non-null int64
Source_labels             86978 non-null int64
Var1_labels           

In [40]:
# check how many values are missing in each column
X.isnull().sum()

Monthly_Income                0
Var5                          0
Loan_Amount_Submitted         0
Loan_Tenure_Submitted         0
Interest_Rate             59252
Processing_Fee            59558
EMI_Loan_Submitted            0
Var4                          0
Disbursed                     0
estimated_age_in_years       17
Gender_Male                   0
Filled_Form_Y                 0
Mobile_Verified_Y             0
Device_Type_Mobile            0
City_labels                   0
Employer_Name_labels          0
Salary_Account_labels         0
Source_labels                 0
Var1_labels                   0
Var2_labels                   0
dtype: int64

In [41]:
# fill the NaN values in the columns below with the column means
# (assigned back instead of fillna(inplace=True) on a selected column)
X['Interest_Rate'] = X['Interest_Rate'].fillna(X['Interest_Rate'].mean())
X['Processing_Fee'] = X['Processing_Fee'].fillna(X['Processing_Fee'].mean())
X['estimated_age_in_years'] = X['estimated_age_in_years'].fillna(X['estimated_age_in_years'].mean())
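
The mean imputation above uses all rows, including those that later end up in the test set. A common variant, sketched here on made-up values and not used in this notebook, takes the mean from the training rows only and applies it to both parts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Interest_Rate": [13.25, np.nan, 20.0, np.nan, 15.0]})

# Hypothetical split: first three rows for training, the rest for testing.
train, test = toy.iloc[:3].copy(), toy.iloc[3:].copy()

train_mean = train["Interest_Rate"].mean()  # NaN values are skipped by mean()
train["Interest_Rate"] = train["Interest_Rate"].fillna(train_mean)
test["Interest_Rate"] = test["Interest_Rate"].fillna(train_mean)

print(train_mean)
print(test)
```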

In [42]:
X.head()

ID,Monthly_Income,Var5,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Var4,Disbursed,estimated_age_in_years,Gender_Male,Filled_Form_Y,Mobile_Verified_Y,Device_Type_Mobile,City_labels,Employer_Name_labels,Salary_Account_labels,Source_labels,Var1_labels,Var2_labels
ID000002C20,20000,0,300000.0,5.0,19.197474,5131.150839,0.0,1,0,36.0,0,0,0,0,173,8692,21,0,13,6
ID000004E40,35000,13,200000.0,2.0,13.25,5131.150839,6762.9,3,0,29.0,1,0,1,0,447,38691,23,0,8,6
ID000007H20,22500,0,450000.0,4.0,19.197474,5131.150839,0.0,1,0,33.0,1,0,1,0,498,1834,45,16,13,1
ID000008I30,35000,10,920000.0,5.0,19.197474,5131.150839,0.0,3,0,27.0,1,0,1,0,569,5699,45,16,13,1
ID000009J40,100000,17,500000.0,2.0,19.197474,5131.150839,25000.0,3,0,31.0,1,0,1,0,88,13514,21,8,13,1


In [43]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MaxAbsScaler
from sklearn.feature_extraction.text import CountVectorizer
import pprint
import scipy
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

In [44]:
# split into training and test data (the test part checks how a given model performs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [45]:
# add the ability to print bold text in Python/Jupyter
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
printmd("**bold**")

**bold**

In [46]:
X_train.shape

(69582, 20)

In [47]:
y_train.shape

(69582,)

In [48]:
X_test.shape

(17396, 20)

In [49]:
y_train.shape

(69582,)

In [50]:
y_test.shape

(17396,)

In [51]:
# a look at the distribution of 0 and 1 in y_train
y_train.value_counts()

0    68561
1     1021
Name: Disbursed, dtype: int64

In [52]:
# a look at the distribution of 0 and 1 in y_test
y_test.value_counts()

0    17144
1      252
Name: Disbursed, dtype: int64
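
The class counts above imply a high accuracy baseline: a model that always predicts 0 is already right for the majority share of the rows. This is a minimal sketch of that arithmetic; the helper name `majority_share` is hypothetical and the counts are copied from the two outputs above.

```python
def majority_share(counts):
    """Share of the most frequent class, i.e. the accuracy of always predicting it."""
    return max(counts.values()) / sum(counts.values())

train_counts = {0: 68561, 1: 1021}
test_counts = {0: 17144, 1: 252}

train_baseline = majority_share(train_counts)
test_baseline = majority_share(test_counts)
print(round(train_baseline, 4), round(test_baseline, 4))
```

With a baseline this high, accuracy alone says little about how well the disbursed cases are recognised.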

In [53]:
# suppress warning messages while testing the models
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

In [54]:
# definitions of the variables, the models and the parameter ranges to be tuned

names = np.array(["Naiwny Bayes", "Drzewo decyzyjne", "Regresja logistyczna_MaxAbs",
                  "Regresja logistyczna_StandardScaler",  "SVM_MaxAbs", "SVM_StandardScaler",
                 "BaggingClassifier_tree", "BaggingClassifier_Logistic", "RandomForest",
                  "sparamtzrRandomForest"])

models = [[("model", MultinomialNB())],
         [("model", DecisionTreeClassifier())],
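         # solver="liblinear" supports both the "l1" and "l2" penalties used in the grids below
         [("scaler", MaxAbsScaler()), ("model", LogisticRegression(solver="liblinear"))],
          [("scaler", StandardScaler()), ("model", LogisticRegression(solver="liblinear"))],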
         [("scaler", MaxAbsScaler()),("model", SVC())],
          [("scaler", StandardScaler()),("model", SVC())],
         [("model", BaggingClassifier(base_estimator=DecisionTreeClassifier()))],
         [("model", BaggingClassifier(base_estimator=LogisticRegression()))],
         [("model", RandomForestClassifier())],
          [("model", RandomForestClassifier(n_estimators=180, min_samples_leaf=3,
                                            max_features=0.5, n_jobs=-1))]
         ]

param_grids = [{"model__alpha": [0.1, 1, 10], "model__fit_prior": [False, True]},
               {"model__criterion": ["gini", "entropy"],
                "model__min_samples_split": [2, 10, 100], "model__max_depth": [None, 2, 10, 100]}, 
               {"model__penalty": ["l1", "l2"], "model__C": [0.001, 0.1, 1, 10, 100]},
               {"model__penalty": ["l1", "l2"], "model__C": [0.001, 0.1, 1, 10, 100]},
              [{"model__kernel": ["rbf"],},
              {"model__kernel": ["poly"], "model__degree": [1, 2, 3, 4]},
              {"model__kernel": ["linear","sigmoid"]}],
               [{"model__kernel": ["rbf"],},
              {"model__kernel": ["poly"], "model__degree": [1, 2, 3, 4]},
              {"model__kernel": ["linear","sigmoid"]}],
               {"model__n_estimators" : [2, 5, 100], "model__max_features": [0.5, 0.8, 1.0]},
               {"model__n_estimators" : [2, 5, 100], "model__max_features": [0.5, 0.8, 1.0]},
               {"model__n_estimators" : [2, 5, 100]},
               {}
              ]

# uses = np.array([True, True, True, True, True, True, True, True, True, True])
uses = np.array([True, True, True, True, False, False, False, False, False, False])
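
The size of each search follows directly from the grids above. This is a minimal sketch in pure Python that counts the candidate parameter sets per grid (the helper `n_candidates` is hypothetical; the grids are copied from the definitions above). The actual search then fits every candidate once per cross-validation fold.

```python
from math import prod

def n_candidates(grid):
    """Number of parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    # An empty dict yields one candidate: the model with its preset parameters.
    return sum(prod(len(values) for values in g.values()) for g in grids)

nb_grid = {"model__alpha": [0.1, 1, 10], "model__fit_prior": [False, True]}
tree_grid = {"model__criterion": ["gini", "entropy"],
             "model__min_samples_split": [2, 10, 100],
             "model__max_depth": [None, 2, 10, 100]}
svc_grid = [{"model__kernel": ["rbf"]},
            {"model__kernel": ["poly"], "model__degree": [1, 2, 3, 4]},
            {"model__kernel": ["linear", "sigmoid"]}]

print(n_candidates(nb_grid), n_candidates(tree_grid), n_candidates(svc_grid), n_candidates({}))
```

The six Naive Bayes candidates correspond to the six parameter sets printed in the tuning output.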

In [55]:
# selection of the best models, presentation of the results

if len(names) != len(models) or len(models) != len(param_grids) or len(param_grids) != len(uses):
    print(f"len(names): {len(names)}")
    print(f"len(models): {len(models)}")
    print(f"len(param_grids): {len(param_grids)}")
    print(f"len(uses): {len(uses)}")
    raise ValueError("The lists do not have the same length!")

best_models = []

for use, name, pipe, params in zip(uses, names, models, param_grids):
    if not use:
        continue
    printmd(f"**Tunuje model: {name}**") 
    pipeline = Pipeline(pipe)
    gs = GridSearchCV(estimator=pipeline, param_grid=params, n_jobs=3)
    gs.fit(X_train, y_train)
    for mean, std, param, fit_time, score_time in zip(gs.cv_results_["mean_test_score"],
                                gs.cv_results_["std_test_score"],
                                gs.cv_results_["params"],
                                gs.cv_results_["mean_fit_time"],
                                gs.cv_results_["mean_score_time"]):
        print("Parametry/wyniki modelu:")
        #for i in param.items():
        #    print(f" {i}")
        #print(f"\t {param}:\n\t mean: {mean}, std: {std},\n\t fit_time: {fit_time}, score_time: {score_time}\n")
        print(f"{param}:\n mean: {round(mean, 4)}, std: {round(std, 4)}, fit_time: {round(fit_time, 4)}, score_time: {round(score_time, 4)}\n")
    best_models.append(gs.best_estimator_)
        
best_models = np.array(best_models)

print(f"____________________\n\n")

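printmd(f"**Testuję (accuracy):**")
# names[uses] keeps only the names of the models that were actually tuned,
# so each name is paired with its own best estimator
for name, best_model in zip(names[uses], best_models):
    # sklearn metrics expect the arguments in the order (y_true, y_pred)
    print(f"\t{name}: {round(accuracy_score(y_test, best_model.predict(X_test)), 4)}")
    
print(f"\n")
printmd(f"**Testuję (ROC_AUC):**")
for name, best_model in zip(names[uses], best_models):
    print(f"\t {name}: {round(roc_auc_score(y_test, best_model.predict(X_test)), 4)}")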

**Tuning model: Naive Bayes**

Model parameters/results:
{'model__alpha': 0.1, 'model__fit_prior': False}:
 mean: 0.4767, std: 0.075, fit_time: 0.1099, score_time: 0.0441

Model parameters/results:
{'model__alpha': 0.1, 'model__fit_prior': True}:
 mean: 0.4768, std: 0.075, fit_time: 0.1093, score_time: 0.0252

Model parameters/results:
{'model__alpha': 1, 'model__fit_prior': False}:
 mean: 0.4767, std: 0.075, fit_time: 0.0763, score_time: 0.0317

Model parameters/results:
{'model__alpha': 1, 'model__fit_prior': True}:
 mean: 0.4768, std: 0.075, fit_time: 0.0812, score_time: 0.0235

Model parameters/results:
{'model__alpha': 10, 'model__fit_prior': False}:
 mean: 0.4767, std: 0.075, fit_time: 0.1054, score_time: 0.0279

Model parameters/results:
{'model__alpha': 10, 'model__fit_prior': True}:
 mean: 0.4768, std: 0.075, fit_time: 0.0844, score_time: 0.0291



**Tuning model: Decision tree**

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': None, 'model__min_samples_split': 2}:
 mean: 1.0, std: 0.0, fit_time: 0.1644, score_time: 0.0368

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': None, 'model__min_samples_split': 10}:
 mean: 1.0, std: 0.0, fit_time: 0.1891, score_time: 0.0373

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': None, 'model__min_samples_split': 100}:
 mean: 1.0, std: 0.0, fit_time: 0.1671, score_time: 0.034

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': 2, 'model__min_samples_split': 2}:
 mean: 1.0, std: 0.0, fit_time: 0.219, score_time: 0.0393

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': 2, 'model__min_samples_split': 10}:
 mean: 1.0, std: 0.0, fit_time: 0.1793, score_time: 0.0274

Model parameters/results:
{'model__criterion': 'gini', 'model__max_depth': 2, 'model__min_samples_split': 100}:
 mean: 1.0, std: 0.0, fit_time: 0

**Tuning model: Logistic regression_MaxAbs**

Model parameters/results:
{'model__C': 0.001, 'model__penalty': 'l1'}:
 mean: 0.9853, std: 0.0, fit_time: 0.6819, score_time: 0.0407

Model parameters/results:
{'model__C': 0.001, 'model__penalty': 'l2'}:
 mean: 0.9853, std: 0.0, fit_time: 0.7435, score_time: 0.0561

Model parameters/results:
{'model__C': 0.1, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.9513, score_time: 0.0333

Model parameters/results:
{'model__C': 0.1, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 0.703, score_time: 0.036

Model parameters/results:
{'model__C': 1, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.7816, score_time: 0.0337

Model parameters/results:
{'model__C': 1, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 0.937, score_time: 0.0342

Model parameters/results:
{'model__C': 10, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.9197, score_time: 0.0391

Model parameters/results:
{'model__C': 10, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 1.0

**Tuning model: Logistic regression_StandardScaler**

Model parameters/results:
{'model__C': 0.001, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.7934, score_time: 0.0573

Model parameters/results:
{'model__C': 0.001, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 0.7265, score_time: 0.0427

Model parameters/results:
{'model__C': 0.1, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.5285, score_time: 0.0367

Model parameters/results:
{'model__C': 0.1, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 0.9336, score_time: 0.0531

Model parameters/results:
{'model__C': 1, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.6567, score_time: 0.0499

Model parameters/results:
{'model__C': 1, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 0.9149, score_time: 0.0391

Model parameters/results:
{'model__C': 10, 'model__penalty': 'l1'}:
 mean: 1.0, std: 0.0, fit_time: 0.8185, score_time: 0.0393

Model parameters/results:
{'model__C': 10, 'model__penalty': 'l2'}:
 mean: 1.0, std: 0.0, fit_time: 1.1848

**Testing (accuracy):**

	Naive Bayes: 0.4698
	Decision tree: 1.0
	Logistic regression_MaxAbs: 1.0
	Logistic regression_StandardScaler: 1.0

**Testing (ROC_AUC):**

	 Naive Bayes: 0.5023
	 Decision tree: 1.0
	 Logistic regression_MaxAbs: 1.0
	 Logistic regression_StandardScaler: 1.0
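
ROC AUC is usually computed from continuous scores (for example class probabilities) rather than from hard class labels, and scikit-learn metrics take the true labels as the first argument. The sketch below illustrates the difference on synthetic data. The dataset, the `LogisticRegression` settings and all variable names are assumptions made for illustration and are not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, not the loan data set).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = clf.predict(X_te)               # hard 0/1 labels
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Signature order is (y_true, y_pred) or (y_true, y_score).
acc = accuracy_score(y_te, y_pred)
auc_from_labels = roc_auc_score(y_te, y_pred)
auc_from_scores = roc_auc_score(y_te, y_score)

print(f"accuracy: {acc:.4f}")
print(f"ROC AUC from hard labels: {auc_from_labels:.4f}")
print(f"ROC AUC from probabilities: {auc_from_scores:.4f}")
```

With hard labels the ROC curve has a single operating point, so the probability-based value is generally the more informative of the two.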


Below is the same processing, only with a different way of presenting the model parameters and results.

In [56]:
# Select the best models and present the results

if len(names) != len(models) or len(models) != len(param_grids) or len(param_grids) != len(uses):
    print(f"len(names): {len(names)}")
    print(f"len(models): {len(models)}")
    print(f"len(param_grids): {len(param_grids)}")
    print(f"len(uses): {len(uses)}")
    raise ValueError("The lists do not have the same length!")

best_models = []

for use, name, pipe, params in zip(uses, names, models, param_grids):
    if not use:
        continue
    printmd(f"**Tuning model: {name}**")
    pipeline = Pipeline(pipe)
    gs = GridSearchCV(estimator=pipeline, param_grid=params, n_jobs=3)
    gs.fit(X_train, y_train)
    for mean, std, param, fit_time, score_time in zip(gs.cv_results_["mean_test_score"],
                                gs.cv_results_["std_test_score"],
                                gs.cv_results_["params"],
                                gs.cv_results_["mean_fit_time"],
                                gs.cv_results_["mean_score_time"]):
        print("Parameters used:")
        for i in param.items():
            print(f"\t {i}")
        #print(f"\t {param}:\n\t mean_test_score: {mean}, std_test_score: {std},\n\t mean_fit_time: {fit_time}, mean_score_time: {score_time}\n")
        print(f"Results: \n\t mean_test_score: {round(mean, 4)}, \
              std_test_score: {round(std, 4)}, \
              \n\t mean_fit_time: {round(fit_time, 4)}, \
              mean_score_time: {round(score_time, 4)}\n")
        
    best_models.append(gs.best_estimator_)

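# Names of the models that were actually tuned, aligned with best_models
used_names = [name for name, use in zip(names, uses) if use]

print("____________________\n\n")

# scikit-learn metrics take the true labels first, then the predictions
printmd("**Testing (accuracy):**")
for name, best_model in zip(used_names, best_models):
    print(f"\t{name}: {round(accuracy_score(y_test, best_model.predict(X_test)), 4)}")

print("\n")
printmd("**Testing (ROC_AUC):**")
for name, best_model in zip(used_names, best_models):
    print(f"\t {name}: {round(roc_auc_score(y_test, best_model.predict(X_test)), 4)}")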

**Tuning model: Naive Bayes**

Parameters used:
	 ('model__alpha', 0.1)
	 ('model__fit_prior', False)
Results: 
	 mean_test_score: 0.4767,               std_test_score: 0.075,               
	 mean_fit_time: 0.0759,               mean_score_time: 0.0306

Parameters used:
	 ('model__alpha', 0.1)
	 ('model__fit_prior', True)
Results: 
	 mean_test_score: 0.4768,               std_test_score: 0.075,               
	 mean_fit_time: 0.1035,               mean_score_time: 0.0575

Parameters used:
	 ('model__alpha', 1)
	 ('model__fit_prior', False)
Results: 
	 mean_test_score: 0.4767,               std_test_score: 0.075,               
	 mean_fit_time: 0.1021,               mean_score_time: 0.0387

Parameters used:
	 ('model__alpha', 1)
	 ('model__fit_prior', True)
Results: 
	 mean_test_score: 0.4768,               std_test_score: 0.075,               
	 mean_fit_time: 0.074,               mean_score_time: 0.0254

Parameters used:
	 ('model__alpha', 10)
	 ('model__fit_prior', False)
Results: 
	 mean

**Tuning model: Decision tree**

Parameters used:
	 ('model__criterion', 'gini')
	 ('model__max_depth', None)
	 ('model__min_samples_split', 2)
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.1298,               mean_score_time: 0.0345

Parameters used:
	 ('model__criterion', 'gini')
	 ('model__max_depth', None)
	 ('model__min_samples_split', 10)
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.199,               mean_score_time: 0.0389

Parameters used:
	 ('model__criterion', 'gini')
	 ('model__max_depth', None)
	 ('model__min_samples_split', 100)
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.1382,               mean_score_time: 0.0318

Parameters used:
	 ('model__criterion', 'gini')
	 ('model__max_depth', 2)
	 ('model__min_samples_split', 2)
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fi

**Tuning model: Logistic regression_MaxAbs**

Parameters used:
	 ('model__C', 0.001)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 0.9853,               std_test_score: 0.0,               
	 mean_fit_time: 0.4224,               mean_score_time: 0.0456

Parameters used:
	 ('model__C', 0.001)
	 ('model__penalty', 'l2')
Results: 
	 mean_test_score: 0.9853,               std_test_score: 0.0,               
	 mean_fit_time: 0.7125,               mean_score_time: 0.0418

Parameters used:
	 ('model__C', 0.1)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 1.0124,               mean_score_time: 0.0386

Parameters used:
	 ('model__C', 0.1)
	 ('model__penalty', 'l2')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 1.2549,               mean_score_time: 0.0366

Parameters used:
	 ('model__C', 1)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 1.0,               std_tes

**Tuning model: Logistic regression_StandardScaler**

Parameters used:
	 ('model__C', 0.001)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.7347,               mean_score_time: 0.0585

Parameters used:
	 ('model__C', 0.001)
	 ('model__penalty', 'l2')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 1.0724,               mean_score_time: 0.0731

Parameters used:
	 ('model__C', 0.1)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.6221,               mean_score_time: 0.0372

Parameters used:
	 ('model__C', 0.1)
	 ('model__penalty', 'l2')
Results: 
	 mean_test_score: 1.0,               std_test_score: 0.0,               
	 mean_fit_time: 0.7681,               mean_score_time: 0.0408

Parameters used:
	 ('model__C', 1)
	 ('model__penalty', 'l1')
Results: 
	 mean_test_score: 1.0,               std_test_scor

**Testing (accuracy):**

	Naive Bayes: 0.4698
	Decision tree: 1.0
	Logistic regression_MaxAbs: 1.0
	Logistic regression_StandardScaler: 1.0

**Testing (ROC_AUC):**

	 Naive Bayes: 0.5023
	 Decision tree: 1.0
	 Logistic regression_MaxAbs: 1.0
	 Logistic regression_StandardScaler: 1.0
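
A third way to present grid-search results is to load `cv_results_` into a pandas DataFrame and show one row per parameter combination. The sketch below assumes a synthetic dataset and a small hypothetical grid. Only the column names come from scikit-learn's `cv_results_`; everything else is illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small grid (assumptions, not the notebook's setup).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
param_grid = {
    "model__max_depth": [2, None],
    "model__min_samples_split": [2, 10],
}

gs = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=3)
gs.fit(X, y)

# Keep the same five fields that the loops above print, one row per setting.
columns = ["params", "mean_test_score", "std_test_score",
           "mean_fit_time", "mean_score_time"]
numeric = {name: 4 for name in columns[1:]}  # round numeric columns to 4 places
results = (
    pd.DataFrame(gs.cv_results_)[columns]
    .sort_values("mean_test_score", ascending=False)
    .round(numeric)
)

print(results.to_string(index=False))
```

A table like this sorts and filters easily, which helps when the grid grows beyond a handful of combinations.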
