# Module 3 - Demo

### Dataset : https://drive.google.com/file/d/1x2Xpa5T-ifi1xn-MtKA8HOV7E9Rk0NlJ/view?pli=1

Use the dataset above to carry out the following steps:

1. Handling Categorical Values:
- Identify the categorical columns in the dataset.
- Apply a suitable categorical-encoding method to 3 columns in the dataset (the three must not all use the same method; a 2-and-1 split is allowed).
- Bin the age column into 4 groups: "Muda" (young), "Dewasa" (adult), "Paruh Baya" (middle-aged), and "Lanjut Usia" (elderly).

2. Data Normalization:
- Apply Min-Max Scaling to the balance column.
- Apply Z-Score Scaling to the duration column.
- Apply Decimal Scaling to the campaign column.

3. Dimensionality Reduction:
- Perform Feature Selection by keeping only the features that are highly correlated with the target variable (y). Use Pearson correlation or another suitable feature selection method.
- Perform Feature Extraction using PCA (Principal Component Analysis) to reduce the dataset to only 5 principal components.

4. Data Splitting:
- Split the dataset into Train (70%), Validation (15%), and Test (15%).
- Make sure the class distribution of the target variable (y) stays balanced across the splits.

### Initial Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('tugas3_genap.csv')  

In [3]:
data = data.rename(columns={'deposit': 'y'})

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        11162 non-null  int64 
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64 
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64 
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64 
 12  campaign   11162 non-null  int64 
 13  pdays      11162 non-null  int64 
 14  previous   11162 non-null  int64 
 15  poutcome   11162 non-null  object
 16  y          11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB


In [5]:
data.head(500)

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | admin. | married | secondary | no | 2343 | yes | no | unknown | 5 | may | 1042 | 1 | -1 | 0 | unknown | yes |
| 1 | 56 | admin. | married | secondary | no | 45 | no | no | unknown | 5 | may | 1467 | 1 | -1 | 0 | unknown | yes |
| 2 | 41 | technician | married | secondary | no | 1270 | yes | no | unknown | 5 | may | 1389 | 1 | -1 | 0 | unknown | yes |
| 3 | 55 | services | married | secondary | no | 2476 | yes | no | unknown | 5 | may | 579 | 1 | -1 | 0 | unknown | yes |
| 4 | 54 | admin. | married | tertiary | no | 184 | no | no | unknown | 5 | may | 673 | 2 | -1 | 0 | unknown | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 31 | management | divorced | tertiary | no | 294 | yes | no | cellular | 17 | jul | 536 | 5 | -1 | 0 | unknown | yes |
| 496 | 27 | services | married | secondary | no | 0 | yes | no | cellular | 17 | jul | 991 | 5 | -1 | 0 | unknown | yes |
| 497 | 44 | blue-collar | single | secondary | no | 292 | no | yes | cellular | 17 | jul | 1153 | 4 | -1 | 0 | unknown | yes |
| 498 | 40 | entrepreneur | divorced | secondary | no | 2998 | yes | no | cellular | 18 | jul | 623 | 3 | -1 | 0 | unknown | yes |


In [6]:
unique_months = data['month'].unique()
print(unique_months)

['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']


In [7]:
unique_educations = data['education'].unique()
print(unique_educations)

['secondary' 'tertiary' 'primary' 'unknown']


In [33]:
unique_jobs = data['job'].unique()
print(unique_jobs)

['admin.' 'technician' 'services' 'management' 'retired' 'blue-collar'
 'unemployed' 'entrepreneur' 'housemaid' 'unknown' 'self-employed'
 'student']


### 1. Handling Categorical Values

In [8]:
categorical_cols = data.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols)

Categorical columns: Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y'],
      dtype='object')
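Before choosing an encoding for each column, it can help to look at how many distinct values each categorical column has. The sketch below runs on a small made-up frame (the `toy` data is hypothetical; only the column names mirror the notebook) and uses `select_dtypes(exclude=["number"])`, a variant of the `include=['object']` call above that is assumed here so that string and category columns are both picked up.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "job": ["admin.", "technician", "services", "admin."],
    "housing": ["yes", "no", "yes", "yes"],
    "month": ["may", "jun", "may", "jul"],
    "age": [59, 56, 41, 55],
})

# Count the distinct values in every non-numeric column.
cardinality = toy.select_dtypes(exclude=["number"]).nunique().sort_values()
print(cardinality)
```

Low-cardinality columns such as yes/no flags are natural candidates for a binary mapping, while columns with many unordered values are usually one-hot encoded.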


In [9]:
data_encoded = pd.get_dummies(data, columns=['job', 'education'], prefix=['job', 'education'], dtype=int)

In [10]:
month_mapping = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
data_encoded['month'] = data_encoded['month'].map(month_mapping)
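`Series.map` with a dictionary leaves any value that is missing from the dictionary as `NaN`, so a quick completeness check after the mapping can catch typos or unexpected spellings. The following is a minimal sketch on a toy series; the value `"Sept"` is a deliberately invalid, hypothetical entry and does not occur in the dataset.

```python
import pandas as pd

month_mapping = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Toy column; "Sept" is an intentionally unmapped spelling.
months = pd.Series(["may", "jun", "Sept", "dec"])
mapped = months.map(month_mapping)

# Collect the original values that ended up as NaN after mapping.
unmapped = months[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every value was covered by the mapping.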

In [11]:
bins = [0, 25, 40, 60, np.inf] 
labels = ['Muda', 'Dewasa', 'Paruh Baya', 'Lanjut Usia']
data_encoded['age_group'] = pd.cut(data_encoded['age'], bins=bins, labels=labels, include_lowest=True)
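By default `pd.cut` builds right-closed intervals, so with the bins above an age of exactly 25 falls into "Muda" and an age of 26 into "Dewasa". The sketch below checks this boundary behavior on a handful of hand-picked ages (the `ages` values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

bins = [0, 25, 40, 60, np.inf]
labels = ["Muda", "Dewasa", "Paruh Baya", "Lanjut Usia"]

# Ages chosen to sit on and just above each bin edge.
ages = pd.Series([25, 26, 40, 41, 60, 61])
groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(groups.tolist())
```

Passing `right=False` to `pd.cut` would instead make the intervals left-closed, which shifts each edge value into the next group.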

In [12]:
print("Result after handling categorical values:")
print(data_encoded.head())

Result after handling categorical values:
   age  marital default  balance housing loan  contact  day  month  duration  \
0   59  married      no     2343     yes   no  unknown    5      5      1042   
1   56  married      no       45      no   no  unknown    5      5      1467   
2   41  married      no     1270     yes   no  unknown    5      5      1389   
3   55  married      no     2476     yes   no  unknown    5      5       579   
4   54  married      no      184      no   no  unknown    5      5       673   

   ...  job_services  job_student  job_technician job_unemployed job_unknown  \
0  ...             0            0               0              0           0   
1  ...             0            0               0              0           0   
2  ...             0            0               1              0           0   
3  ...             1            0               0              0           0   
4  ...             0            0               0              0           0
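The output above still contains string columns such as `default`, `housing`, and `loan`, which the later steps skip because they only look at numeric columns. One possible way to keep them, shown here as a sketch on toy data and not as part of the original notebook, is a simple yes/no to 1/0 mapping. It assumes these columns contain only the values "yes" and "no".

```python
import pandas as pd

# Toy rows (hypothetical values) with the three yes/no columns.
toy = pd.DataFrame({
    "default": ["no", "no", "yes"],
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "yes", "no"],
})

binary_cols = ["default", "housing", "loan"]
yes_no = {"yes": 1, "no": 0}

# Map each binary column to integers; unmapped values would become NaN.
for col in binary_cols:
    toy[col] = toy[col].map(yes_no)
print(toy)
```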

### 2. Data Normalization

In [13]:
minmax_scaler = MinMaxScaler()
data_encoded['balance_scaled'] = minmax_scaler.fit_transform(data_encoded[['balance']])

In [14]:
zscore_scaler = StandardScaler()
data_encoded['duration_scaled'] = zscore_scaler.fit_transform(data_encoded[['duration']])

In [15]:
# Decimal scaling: divide by 10**p, where p is the digit count of the largest absolute value
max_campaign = data_encoded['campaign'].abs().max()
p = len(str(int(max_campaign)))
data_encoded['campaign_scaled'] = data_encoded['campaign'] / (10 ** p)

In [16]:
print("Result after normalization:")
print(data_encoded[['balance', 'balance_scaled', 'duration', 'duration_scaled', 'campaign', 'campaign_scaled']].head())

Result after normalization:
   balance  balance_scaled  duration  duration_scaled  campaign  \
0     2343        0.104371      1042         1.930226         1   
1       45        0.078273      1467         3.154612         1   
2     1270        0.092185      1389         2.929901         1   
3     2476        0.105882       579         0.596366         1   
4      184        0.079851       673         0.867171         2   

   campaign_scaled  
0             0.01  
1             0.01  
2             0.01  
3             0.01  
4             0.02  
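The three scalings used above can also be written out by hand, which makes the formulas explicit. The sketch below applies them to a small made-up array (`x` is hypothetical) and compares the first two against `MinMaxScaler` and `StandardScaler`. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which matches NumPy's default `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([1.0, 2.0, 5.0, 12.0, 63.0])  # toy values

# Min-max: (x - min) / (max - min), giving values in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std, using the population std (ddof=0).
z_score = (x - x.mean()) / x.std()

# Decimal scaling: x / 10**j with the smallest j such that max(|x|) / 10**j < 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10 ** j

# Reference results from scikit-learn for the first two methods.
sk_min_max = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()
sk_z_score = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(np.allclose(min_max, sk_min_max), np.allclose(z_score, sk_z_score), j)
```

For a largest value of 63 this gives `j = 2`, which is consistent with the `campaign_scaled` column above, where a campaign value of 1 becomes 0.01.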


In [17]:
data_encoded.head()

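| | age | marital | default | balance | housing | loan | contact | day | month | duration | ... | job_unemployed | job_unknown | education_primary | education_secondary | education_tertiary | education_unknown | age_group | balance_scaled | duration_scaled | campaign_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | married | no | 2343 | yes | no | unknown | 5 | 5 | 1042 | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.104371 | 1.930226 | 0.01 |
| 1 | 56 | married | no | 45 | no | no | unknown | 5 | 5 | 1467 | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.078273 | 3.154612 | 0.01 |
| 2 | 41 | married | no | 1270 | yes | no | unknown | 5 | 5 | 1389 | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.092185 | 2.929901 | 0.01 |
| 3 | 55 | married | no | 2476 | yes | no | unknown | 5 | 5 | 579 | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.105882 | 0.596366 | 0.01 |
| 4 | 54 | married | no | 184 | no | no | unknown | 5 | 5 | 673 | ... | 0 | 0 | 0 | 0 | 1 | 0 | Paruh Baya | 0.079851 | 0.867171 | 0.02 |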


### 3. Dimensionality Reduction

In [18]:
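# Drop the target and the raw columns that now have scaled counterparts (age is represented by age_group)
X = data_encoded.drop(columns=['y', 'age', 'balance', 'duration', 'campaign'])
# Encode the target as 1 (yes) / 0 (no)
y = data_encoded['y'].map({'yes': 1, 'no': 0})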

In [19]:
X.head()

| | marital | default | housing | loan | contact | day | month | pdays | previous | poutcome | ... | job_unemployed | job_unknown | education_primary | education_secondary | education_tertiary | education_unknown | age_group | balance_scaled | duration_scaled | campaign_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | married | no | yes | no | unknown | 5 | 5 | -1 | 0 | unknown | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.104371 | 1.930226 | 0.01 |
| 1 | married | no | no | no | unknown | 5 | 5 | -1 | 0 | unknown | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.078273 | 3.154612 | 0.01 |
| 2 | married | no | yes | no | unknown | 5 | 5 | -1 | 0 | unknown | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.092185 | 2.929901 | 0.01 |
| 3 | married | no | yes | no | unknown | 5 | 5 | -1 | 0 | unknown | ... | 0 | 0 | 0 | 1 | 0 | 0 | Paruh Baya | 0.105882 | 0.596366 | 0.01 |
| 4 | married | no | no | no | unknown | 5 | 5 | -1 | 0 | unknown | ... | 0 | 0 | 0 | 0 | 1 | 0 | Paruh Baya | 0.079851 | 0.867171 | 0.02 |


#### Feature Selection

In [20]:
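# Keep the 10 features with the highest ANOVA F-score against y.
# Only numeric columns are scored; the remaining string/category columns
# (marital, default, housing, loan, contact, poutcome, age_group) are left out.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X.select_dtypes(include=[np.number]), y)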

In [21]:
selected_features = X.select_dtypes(include=[np.number]).columns[selector.get_support()].tolist()
print("Selected features:", selected_features)

Selected features: ['pdays', 'previous', 'job_blue-collar', 'job_retired', 'job_student', 'education_primary', 'education_tertiary', 'balance_scaled', 'duration_scaled', 'campaign_scaled']
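The task statement also mentions Pearson correlation as a selection criterion. As an alternative to the F-test used above, the sketch below ranks features by their absolute Pearson correlation with a binary target and keeps those above a cut-off. The data is synthetic, and both the column names and the `threshold` of 0.3 are hypothetical choices made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
y = pd.Series(rng.integers(0, 2, size=n), name="y")

# Synthetic features with strong, weak, and no relation to y.
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.5, size=n),
    "weak": 0.2 * y + rng.normal(0, 1.0, size=n),
    "noise": rng.normal(0, 1.0, size=n),
})

# Absolute Pearson correlation of each column with the target.
corr = X.corrwith(y).abs().sort_values(ascending=False)

threshold = 0.3  # hypothetical cut-off
selected = corr[corr > threshold].index.tolist()
print(corr)
print(selected)
```

With a 0/1 target, Pearson correlation equals the point-biserial correlation, so it measures only linear association between a feature and the class label.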


#### Feature Extraction

In [22]:
pca = PCA(n_components=5)  
X_pca = pca.fit_transform(X.select_dtypes(include=[np.number]))

In [23]:
print("\nPCA results:")
print("Explained variance ratio for each component:", pca.explained_variance_ratio_)
print("Total explained variance:", sum(pca.explained_variance_ratio_))


PCA results:
Explained variance ratio for each component: [9.92992450e-01 5.91848422e-03 5.53355803e-04 3.27033022e-04
 8.38528412e-05]
Total explained variance: 0.9998751762408862
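In the output above the first component alone explains about 99.3% of the variance. PCA is sensitive to feature scale, so a single column with a much wider numeric range than the others can dominate the first component. A common remedy is to standardize all inputs before PCA. The sketch below illustrates the effect on synthetic data (three independent features, one with a much larger spread); it is not a rerun of the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 100, size=500),  # wide-range feature
    rng.normal(0, 1, size=500),
    rng.normal(0, 1, size=500),
])

# PCA on raw values: the wide-range feature dominates PC1.
raw_pca = PCA(n_components=2).fit(X)

# PCA after standardization: variance is spread across components.
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
scaled_pca = scaled.named_steps["pca"]

print("raw:   ", raw_pca.explained_variance_ratio_)
print("scaled:", scaled_pca.explained_variance_ratio_)
```

Whether standardization is appropriate depends on the goal: it treats every feature as equally important, which is usually what is wanted when the units are not comparable.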


In [24]:
print("\nPCA components (loading factors):")
pca_components_df = pd.DataFrame(pca.components_, columns=X.select_dtypes(include=[np.number]).columns, index=[f'PC{i+1}' for i in range(5)])
print(pca_components_df)


PCA components (loading factors):
          day     month     pdays  previous  job_admin.  job_blue-collar  \
PC1 -0.006016 -0.000838  0.999924  0.010693    0.000131        -0.000107   
PC2  0.999769  0.019547  0.006091 -0.005639   -0.000412        -0.001171   
PC3 -0.019353  0.998953  0.000331  0.036402   -0.003122        -0.009641   
PC4  0.006371 -0.036457 -0.010681  0.999134    0.000146        -0.005375   
PC5  0.002635 -0.002387  0.000154  0.010806   -0.008528         0.014060   

     job_entrepreneur  job_housemaid  job_management  job_retired  ...  \
PC1         -0.000046      -0.000037        0.000061     0.000010  ...   
PC2         -0.000133       0.000211        0.000519    -0.000125  ...   
PC3          0.002024       0.001098        0.010151     0.001788  ...   
PC4         -0.000874      -0.000926        0.003473     0.002478  ...   
PC5         -0.000390      -0.001758       -0.010296     0.004198  ...   

     job_technician  job_unemployed  job_unknown  education_prima

In [25]:
X_pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(5)])
print("\nFirst 5 rows of the dataset after PCA:")
print(X_pca_df.head())


First 5 rows of the dataset after PCA:
         PC1        PC2       PC3       PC4       PC5
0 -52.270550 -10.998621 -1.036166 -0.322271  1.883909
1 -52.270859 -11.001677 -1.033251 -0.335386  3.107826
2 -52.270990 -10.999560 -1.028167 -0.330823  2.887526
3 -52.270344 -10.994805 -1.040629 -0.311201  0.563020
4 -52.270315 -10.994761 -1.022543 -0.300118  0.809089


### 4. Data Splitting

In [26]:
data_final = pd.concat([X_pca_df, y.reset_index(drop=True)], axis=1)

In [27]:
train_data, temp_data = train_test_split(data_final, test_size=0.3, stratify=data_final['y'], random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, stratify=temp_data['y'], random_state=42)


In [28]:
print("Class distribution in Train:")
print(train_data['y'].value_counts(normalize=True))
print("Class distribution in Validation:")
print(val_data['y'].value_counts(normalize=True))
print("Class distribution in Test:")
print(test_data['y'].value_counts(normalize=True))

Class distribution in Train:
y
0    0.526174
1    0.473826
Name: proportion, dtype: float64
Class distribution in Validation:
y
0    0.526284
1    0.473716
Name: proportion, dtype: float64
Class distribution in Test:
y
0    0.52597
1    0.47403
Name: proportion, dtype: float64


In [29]:
print(f"Train size: {train_data.shape}")
print(f"Validation size: {val_data.shape}")
print(f"Test size: {test_data.shape}")

Train size: (7813, 6)
Validation size: (1674, 6)
Test size: (1675, 6)
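In this notebook the scalers and PCA are fitted on the full dataset before splitting, which lets statistics from the validation and test rows influence the transformation. A frequently recommended alternative is to split first and fit the preprocessing on the training split only. The following is a minimal sketch of that ordering on synthetic data; the array shapes and the `prep` pipeline are assumptions for illustration, not the notebook's actual workflow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))     # synthetic features
y = rng.integers(0, 2, size=1000)  # synthetic binary target

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Fit scaling and PCA on the training split only, then reuse them.
prep = make_pipeline(StandardScaler(), PCA(n_components=5))
X_train_p = prep.fit_transform(X_train)
X_val_p = prep.transform(X_val)
X_test_p = prep.transform(X_test)

print(X_train_p.shape, X_val_p.shape, X_test_p.shape)
```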


In [30]:
train_data.head()

| | PC1 | PC2 | PC3 | PC4 | PC5 | y |
|---|---|---|---|---|---|---|
| 7524 | -52.416915 | 13.063081 | 1.495024 | -0.246387 | -0.87798 | 0 |
| 4777 | -52.419758 | 13.984397 | -2.502536 | -0.08969 | -0.290413 | 1 |
| 1041 | -52.37162 | 5.118532 | 4.661844 | -0.418175 | 0.955058 | 1 |
| 3137 | 249.774914 | -11.239134 | -3.711595 | 1.567778 | 0.012098 | 1 |
| 2829 | 23.619548 | 11.544601 | 3.65547 | 1.838705 | -0.148872 | 1 |


In [34]:
val_data.head()

| | PC1 | PC2 | PC3 | PC4 | PC5 | y |
|---|---|---|---|---|---|---|
| 121 | -52.366887 | 5.00176 | -1.325247 | -0.201778 | 1.394146 | 1 |
| 7924 | -52.353351 | 2.121708 | 4.699373 | -0.437628 | -0.304086 | 0 |
| 8567 | -52.297605 | -6.914634 | 2.907283 | -0.408871 | 0.036508 | 0 |
| 6272 | -52.318236 | -2.992696 | -1.191306 | -0.24249 | -0.578856 | 0 |
| 1986 | 122.736847 | -0.97322 | -0.899506 | 4.859226 | 1.783591 | 1 |


In [35]:
test_data.head()

| | PC1 | PC2 | PC3 | PC4 | PC5 | y |
|---|---|---|---|---|---|---|
| 7219 | -52.31465 | -3.933256 | 1.848711 | -0.346365 | -0.60117 | 0 |
| 4740 | -52.303717 | -5.051403 | -4.124619 | -0.138651 | -0.154608 | 1 |
| 9303 | -52.282293 | -8.994725 | -1.073893 | -0.289755 | 0.271877 | 0 |
| 7886 | -52.2708 | -10.971472 | -0.04535 | -0.331423 | -0.957078 | 0 |
| 4292 | -52.419595 | 13.981076 | -2.523795 | -0.10503 | -0.057976 | 1 |
