Project Workflow Steps
1. [Data Collecting](#data-collecting)
1. [Data Cleaning](#data-cleaning)
1. [Data Preprocessing](#data-preprocessing) 
1. [Modelling](#modelling)
1. [Model Evaluating](#model-evaluating)

### Data Collecting
The steps are:
1. Read the file
2. Inspect the data with descriptive statistics and check the data types

In [30]:
# Import library 
import numpy as np 
import pandas as pd 

In [31]:
df=pd.read_csv('datasets/cc_approvals.data',header=None)
df

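|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | f | g | 00202 | 0 | + |
| 1 | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | f | g | 00043 | 560 | + |
| 2 | a | 24.50 | 0.500 | u | g | q | h | 1.50 | t | f | 0 | f | g | 00280 | 824 | + |
| 3 | b | 27.83 | 1.540 | u | g | w | v | 3.75 | t | t | 5 | t | g | 00100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 00120 | 0 | + |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | b | 21.08 | 10.085 | y | p | e | h | 1.25 | f | f | 0 | f | g | 00260 | 0 | - |
| 686 | a | 22.67 | 0.750 | u | g | c | v | 2.00 | f | t | 2 | t | g | 00200 | 394 | - |
| 687 | a | 25.25 | 13.500 | y | p | ff | ff | 2.00 | f | t | 1 | t | g | 00200 | 1 | - |
| 688 | b | 17.92 | 0.205 | u | g | aa | v | 0.04 | f | f | 0 | f | g | 00280 | 750 | - |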


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


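Data Collecting summary:
The inspection shows 690 rows and 16 columns, and `info()` reports no null values. Four columns (2, 7, 10, 14) are numeric and the other twelve are read as `object`. Note that columns 1 and 13 display numeric-looking values but are still read as `object`, most likely because missing values are encoded as '?' (see Data Cleaning).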
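
One thing worth checking at this point is whether an `object` column actually holds numbers stored as text; column 1 in the preview, for example, shows values such as 30.83 yet is listed as `object`. The following is a minimal sketch on a toy frame, not part of the original notebook. The helper name `numeric_like_columns` and the 0.5 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "58.67", "?", "27.83"],  # numbers stored as text
    2: [0.0, 4.46, 0.5, 1.54],
})

def numeric_like_columns(frame, threshold=0.5):
    """Hypothetical helper: non-numeric columns whose values mostly parse as numbers."""
    found = []
    for col in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            # Values that cannot be parsed (such as '?') become NaN
            parsed = pd.to_numeric(frame[col], errors="coerce")
            if parsed.notna().mean() >= threshold:
                found.append(col)
    return found

candidates = numeric_like_columns(toy)
print(candidates)
```

Columns flagged this way could be converted with `pd.to_numeric(..., errors="coerce")` before imputation, which would change how they are encoded later on.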

In [33]:
df.describe()

|  | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


### Data Cleaning
Data cleaning handles the missing values. According to the dataset description, missing values are marked with '?'. They are handled in two ways depending on the data type: missing values in numeric columns (float, int) are filled with the column mean, while missing values in text columns are filled with the most frequent value.

The steps are:
1. [Check every feature for missing values](#step-1-checking-for-missing-values)
2. [Replace the missing-value marker with np.nan](#step-2-replacing-the--marker-with-npnan)
3. [Handle missing values by data type](#step-3-handling-missing-values-by-data-type)
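
The three steps above can be sketched end to end on a small toy frame. This is only an illustration under assumed data: the column names `kind` and `score` and their values are made up and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: '?' marks a missing text value, NaN a missing number
toy = pd.DataFrame({
    "kind": ["a", "b", "?", "b"],
    "score": [1.0, np.nan, 3.0, 5.0],
})

# Step 1: count the '?' markers per column
marker_counts = toy.isin(["?"]).sum()

# Step 2: turn the marker into a real missing value
toy = toy.replace("?", np.nan)

# Step 3: impute according to the column type
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mean())          # mean for numbers
    else:
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])  # most frequent for text

print(marker_counts)
print(toy)
```

In the toy frame the missing `kind` becomes `b` (the most frequent value) and the missing `score` becomes 3.0 (the mean of 1, 3 and 5).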

#### Step 1: Checking for missing values

In [34]:

df.isin(['?']).sum() # Missing values are marked with '?'

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

#### Step 2: Replacing the '?' marker with np.nan

In [35]:

df=df.replace('?',np.nan) # replace the marker with np.nan
df.isin(['?']).sum() # Check again whether any '?' markers remain

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

In [36]:
df.isnull().sum() # count the np.nan values per feature

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [37]:
df.tail(20)

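|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | b | 47.17 | 5.835 | u | g | w | v | 5.5 | f | f | 0 | f | g | 465 | 150 | - |
| 671 | b | 25.83 | 12.835 | u | g | cc | v | 0.5 | f | f | 0 | f | g | 0 | 2 | - |
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | NaN | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |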


The output above shows that every '?' marker has been replaced with np.nan.
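
As an alternative to replacing the marker after loading, `pd.read_csv` accepts an `na_values` argument that treats a given string as missing at read time. The sketch below uses an in-memory toy CSV (made-up rows, not the real file) to show the effect. Be aware that this would also let pandas parse numeric-looking columns as numbers, so the dtypes would differ from the ones shown in this notebook.

```python
import io
import pandas as pd

# Three toy rows in the same headerless layout; '?' marks missing values
raw = io.StringIO("b,30.83,u\n?,?,y\na,24.50,?\n")

# na_values="?" converts every '?' to NaN while reading
toy = pd.read_csv(raw, header=None, na_values="?")

missing = toy.isnull().sum()
print(toy.dtypes)
print(missing)
```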

#### Step 3: Handling missing values by data type

In [38]:


for col in df:
    if df[col].dtype == 'object':
        # Fill np.nan in 'object' columns with the most frequent value
        df[col]=df[col].fillna(df[col].value_counts().index[0])
    else:
        # Fill np.nan in all other (numeric) columns with the column mean
        df[col]=df[col].fillna(df[col].mean())

df.isnull().sum() # Check the result of the imputation

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

Based on the inspection above, all missing values have been handled according to the data type of each feature.


### Data Preprocessing
Data preprocessing transforms the dataset values into a simpler, more efficient form for modelling, with the aim of improving accuracy without changing the underlying meaning of the data.

The steps are:
1. [Convert object columns to the category type](#converting-object-columns-to-the-category-type)
1. [Separate the features from the target](#separating-the-features-from-the-target)
1. [Separate categorical features from non-categorical features](#separating-categorical-features-from-non-categorical-features)
1. [Create dummy values for the categorical features and target](#creating-dummy-values-for-the-categorical-features-and-target)
1. [Scale the non-categorical features](#scaling-the-non-categorical-features)
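
For comparison, the encoding and scaling steps listed above can also be expressed with scikit-learn's `ColumnTransformer`. This is a sketch of an alternative, not the approach used in this notebook, and the toy column names (`colour`, `amount`, `label`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with one categorical feature, one numeric feature and a target
toy = pd.DataFrame({
    "colour": ["u", "y", "u", "y"],
    "amount": [0.0, 5.0, 10.0, 2.5],
    "label": ["+", "-", "+", "-"],
})

# Separate the features from the target ('-' is encoded as 1)
features = toy.drop(columns="label")
target = (toy["label"] == "-").astype(int)

# One-hot encode the categorical column, min-max scale the numeric one
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(drop="first"), ["colour"]),
        ("num", MinMaxScaler(), ["amount"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
encoded = preprocess.fit_transform(features)
print(encoded)
```

Keeping the steps in one transformer makes it possible to fit them on the training split only and reuse them on the test split.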

#### Converting object columns to the category type


In [39]:
label_cat=[i for i in df if df[i].dtype == 'object'] # names of the columns with object dtype
label_num=[i for i in df if df[i].dtype != 'object'] # names of the columns with non-object dtype

print(f'label_cat= {label_cat}')
print(f'label_num= {label_num}')

label_cat= [0, 1, 3, 4, 5, 6, 8, 9, 11, 12, 13, 15]
label_num= [2, 7, 10, 14]


In [40]:
df[label_cat]=df[label_cat].astype('category') # convert object to category
df.info() # check that the object columns are now category

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   0       690 non-null    category
 1   1       690 non-null    category
 2   2       690 non-null    float64 
 3   3       690 non-null    category
 4   4       690 non-null    category
 5   5       690 non-null    category
 6   6       690 non-null    category
 7   7       690 non-null    float64 
 8   8       690 non-null    category
 9   9       690 non-null    category
 10  10      690 non-null    int64   
 11  11      690 non-null    category
 12  12      690 non-null    category
 13  13      690 non-null    category
 14  14      690 non-null    int64   
 15  15      690 non-null    category
dtypes: category(12), float64(2), int64(2)
memory usage: 49.4 KB


#### Separating the features from the target

In [41]:
X_cat=df[label_cat[:-1]] # Features
y=df[label_cat[-1]] # Target
print(f'X=\n {X_cat.head()}')
print(f'y=\n {y.head()}')

X=
   0      1  3  4  5  6  8  9  11 12     13
0  b  30.83  u  g  w  v  t  t  f  g  00202
1  a  58.67  u  g  q  h  t  t  f  g  00043
2  a  24.50  u  g  q  h  t  f  f  g  00280
3  b  27.83  u  g  w  v  t  t  t  g  00100
4  b  20.17  u  g  w  v  t  f  f  s  00120
y=
 0    +
1    +
2    +
3    +
4    +
Name: 15, dtype: category
Categories (2, object): ['+', '-']


#### Separating categorical features from non-categorical features


In [42]:
X_num=df[label_num]
X_num


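|  | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| 0 | 0.000 | 1.25 | 1 | 0 |
| 1 | 4.460 | 3.04 | 6 | 560 |
| 2 | 0.500 | 1.50 | 0 | 824 |
| 3 | 1.540 | 3.75 | 5 | 3 |
| 4 | 5.625 | 1.71 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 685 | 10.085 | 1.25 | 0 | 0 |
| 686 | 0.750 | 2.00 | 2 | 394 |
| 687 | 13.500 | 2.00 | 1 | 1 |
| 688 | 0.205 | 0.04 | 0 | 750 |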


#### Creating dummy values for the categorical features and target

In [43]:
X_cat=pd.get_dummies(X_cat,drop_first=True) # create dummy variables for the features
X_cat.shape

(690, 548)

In [44]:
y=pd.get_dummies(y,drop_first=True) # create the dummy variable for the target
y.head()

|  | - |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
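
Because `drop_first=True` drops the first category (`+`), the single remaining column is named `-`, so a value of 1 stands for the `-` class and 0 for `+`. The sketch below reproduces this on three toy labels and shows an equivalent explicit encoding. The variable names are illustrative only.

```python
import pandas as pd

# Three toy labels with the same two categories as the target column
labels = pd.Series(["+", "+", "-"], dtype="category")

# drop_first=True removes the '+' column and keeps only '-'
dummies = pd.get_dummies(labels, drop_first=True)

# Equivalent explicit encoding: 1 where the label is '-', else 0
explicit = (labels == "-").astype(int)

print(dummies.columns.tolist())
print(explicit.tolist())
```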


#### Scaling the non-categorical features

In [45]:
from sklearn.preprocessing import MinMaxScaler # import the scaler

In [46]:
scaler=MinMaxScaler(feature_range=(0,1))
X=pd.concat([X_cat,X_num],axis=1)
X.columns=X.columns.astype(str)
scaled_X=scaler.fit_transform(X)

In [47]:
X

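|  | 0_b | 1_15.17 | 1_15.75 | 1_15.83 | 1_15.92 | 1_16.00 | 1_16.08 | 1_16.17 | 1_16.25 | 1_16.33 | ... | 13_00760 | 13_00840 | 13_00928 | 13_00980 | 13_01160 | 13_02000 | 2 | 7 | 10 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 1.25 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 4.460 | 3.04 | 6 | 560 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.500 | 1.50 | 0 | 824 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1.540 | 3.75 | 5 | 3 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 5.625 | 1.71 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 10.085 | 1.25 | 0 | 0 |
| 686 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.750 | 2.00 | 2 | 394 |
| 687 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 13.500 | 2.00 | 1 | 1 |
| 688 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.205 | 0.04 | 0 | 750 |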


In [48]:
scaled_X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        4.38596491e-02, 1.49253731e-02, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.06666667e-01, 8.95522388e-02, 5.60000000e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        5.26315789e-02, 0.00000000e+00, 8.24000000e-03],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        7.01754386e-02, 1.49253731e-02, 1.00000000e-05],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.40350877e-03, 0.00000000e+00, 7.50000000e-03],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.90877193e-01, 0.00000000e+00, 0.00000000e+00]])
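
The scaled values can be checked by hand: with `feature_range=(0,1)`, min-max scaling maps each value to `(x - min) / (max - min)` per column. The sketch below applies that formula to rows 0 and 1 of columns 7, 10 and 14, using the minima and maxima reported by `df.describe()`. The two extra rows holding those minima and maxima are constructed for the example and are not actual rows of the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sample = np.array([
    [1.25, 1.0, 0.0],         # row 0 of columns 7, 10, 14
    [3.04, 6.0, 560.0],       # row 1 of columns 7, 10, 14
    [28.5, 67.0, 100000.0],   # column maxima from df.describe() (constructed row)
    [0.0, 0.0, 0.0],          # column minima from df.describe() (constructed row)
])

# Manual min-max scaling, column by column
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual = (sample - col_min) / (col_max - col_min)

# The same result through scikit-learn
via_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(sample)

print(manual[:2])
```

The first two rows agree with the last three entries of the first two rows of `scaled_X` shown above, for example 560 / 100000 = 0.0056.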

### Modelling
This stage builds the Machine Learning (ML) model.

The steps are:
1. Split the data into training and test sets
2. [Train the model](#training-the-model)
3. Predict on the test data
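
The three steps above can be sketched on synthetic data as follows. This is an illustrative variant rather than the notebook's own code: it assumes a scikit-learn `Pipeline` so that the scaler is fitted on the training split only, and the data comes from `make_classification` instead of the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (200 samples, 5 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Step 1: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1
)

# Step 2: train; the pipeline fits the scaler on the training data only
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Step 3: predict on the held-out test data
y_hat = model.predict(X_te)
print(model.score(X_te, y_te))
```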

In [49]:
from sklearn.model_selection import train_test_split

In [50]:
X_train,X_test,y_train,y_test= train_test_split( # split the data into training and test sets
    X,y, # note: the unscaled X is passed here; scaled_X from the scaling step is not used
    test_size=.33,
    random_state=1
)
y_train=y_train.values.reshape(-1)
y_test=y_test.values.reshape(-1)
print(f'X_train\n {X_train.shape}') # Check that the split has the expected shape
print(f'y_train\n {y_train}')

X_train
 (462, 552)
y_train
 [0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1
 1 1 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1
 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 0 1 1
 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1
 1 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1
 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1
 1 1 1 1 0 0 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0
 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 1
 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1
 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0
 1 1 0 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 1
 1 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0]


In [51]:
print(f'X_test\n {X_test.shape}')
print(f'y_test\n {y_test.shape}')

X_test
 (228, 552)
y_test
 (228,)


#### Training the model

In [52]:
from sklearn.linear_model import LogisticRegression # import the ML model


In [53]:
logreg=LogisticRegression(max_iter=1000,solver='lbfgs')
logreg.fit(X_train,y_train)

LogisticRegression(max_iter=1000)

In [54]:
y_pred=logreg.predict(X_test)

### Model Evaluating

In [55]:
from sklearn.metrics import confusion_matrix

In [56]:
print(f'Accuracy : {logreg.score(X_test,y_test)}')
print(f'Confusion matrix :\n {confusion_matrix(y_test,y_pred)}') # argument order is (y_true, y_pred): rows are true labels, columns are predictions

Accuracy : 0.8728070175438597
Confusion matrix :
 [[ 84   9]
 [ 20 115]]
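
Two points help when reading this output. First, `confusion_matrix` expects its arguments in the order `(y_true, y_pred)`, with rows for true labels and columns for predictions; swapping the arguments transposes the matrix. Second, the accuracy equals the diagonal sum divided by the total, which does not depend on the orientation. The sketch below illustrates both, using made-up toy labels for the first point and the four counts reported above for the second.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only)
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)       # rows: true, columns: predicted
swapped = confusion_matrix(y_pred_demo, y_true_demo)  # arguments swapped: transposed

# Counts reported above: 84 and 115 on the diagonal, 9 and 20 off it.
# Where the off-diagonal counts sit does not affect accuracy.
reported = np.array([[84, 9], [20, 115]])
accuracy = np.trace(reported) / reported.sum()

print(cm)
print(swapped)
print(accuracy)
```

The computed accuracy is 199 / 228, which matches the reported score of about 0.8728.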
