# Analyzing Borrower Default Risk

Your project is to prepare a report for the bank's credit division. You need to find out how a customer's marital status and number of children affect the probability of repaying a loan on time. The bank already has some data on customers' creditworthiness.

Your report will be taken into account when building a **credit score** for prospective customers. A **credit score** is used to evaluate a prospective borrower's ability to repay their loan.

[In this notebook you are given brief hints and instructions, as well as prompts for reflection. Do not ignore them: they are designed to give you the project structure and will help you analyze what you are doing in depth. Before submitting your project, make sure you remove all the hints and descriptions provided to you. Write this report as if you were sending it to your teammates to show what you found; they should not know that you received external help from us! To help you, we have marked the hints you should remove with square brackets.]

[Before you dive into the data analysis, explain to yourself the goal of the project and the steps you will take.]

## Open the data file and review the general information

[Start by importing libraries and loading the data. You may realize that you need additional libraries as you work on the project, which is normal; just make sure to update this section when you do.]

In [None]:
# Load all libraries
import pandas as pd

# Load the data
sp2 = pd.read_csv('/datasets/credit_scoring_eng.csv')

In [None]:
import numpy as np

## Task 1. Data Exploration

**Data Description**
- *`children`* - number of children in the family
- *`days_employed`* - work experience in days
- *`dob_years`* - client's age in years
- *`education`* - client's education
- *`education_id`* - education identifier
- *`family_status`* - marital status
- *`family_status_id`* - marital status identifier
- *`gender`* - client's gender
- *`income_type`* - type of employment
- *`debt`* - whether the client has any debt on loan repayment
- *`total_income`* - monthly income
- *`purpose`* - purpose of the loan

[Now it's time to explore our data. You will want to see how many columns and rows it has, and look at a few rows to check for potential problems with the data.]

In [None]:
# Check how many rows and columns the dataset has

sp2.shape

(21525, 12)

In [None]:
# Display the first 15 rows

sp2.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


[Describe what you see and notice in the data sample you displayed. Are there any issues that may need further investigation and changes?]

**There is no separate ID column to identify individual clients. In the `days_employed` column, values should not be fractional, let alone negative. In the `education` column, some values use uppercase letters, so they would be treated as different from their lowercase equivalents. In `purpose`, several values express essentially the same purpose.**
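
The two issues that are easiest to quantify, negative `days_employed` values and mixed-case `education` values, can be checked with a few lines. This is only a sketch: `sample` is a hypothetical miniature built from values visible in the rows above, not the real `sp2` table, and nothing here modifies the dataset.

```python
import pandas as pd

# Hypothetical sample mirroring the issues seen in the first rows (not the real sp2)
sample = pd.DataFrame({
    'days_employed': [-8437.673028, -4024.803754, 340266.072047],
    'education': ["bachelor's degree", 'Secondary Education', 'SECONDARY EDUCATION'],
})

# Count how many employment values are negative
negative_days = int((sample['days_employed'] < 0).sum())

# Lowercasing collapses case variants of the same category into one value
education_clean = sample['education'].str.lower()

print(negative_days)
print(sample['education'].nunique(), education_clean.nunique())
```

On the real data the same expressions could be applied to `sp2['days_employed']` and `sp2['education']`.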

In [None]:
# Get general information about the data
sp2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


[Are there missing values in all columns or only some? Briefly describe what you observe in 1-2 sentences.]

**Student's explanation: 21525 - 19351 = 2174. `2174` is the number of missing values in each of the `days_employed` and `total_income` columns; both columns have the same number of missing values.**

In [None]:
# Look at the table filtered by missing values in the first column that has missing data

# Check the rows where 'days_employed' is NaN (missing)

nan_de = sp2.loc[sp2['days_employed'].isna()]
nan_de

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Check the rows where 'total_income' is NaN (missing)

nan_ti = sp2.loc[sp2['total_income'].isna()]
nan_ti

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


[Do the missing values appear symmetric? Are we sure about this assumption? Briefly explain your opinion here. You may want to investigate further and count the missing values across all rows with missing values to confirm that the missing samples are the same size.]

**After filtering the table by missing values, the missing data appears symmetric: whenever `days_employed` is missing (`NaN`), `total_income` is also missing (`NaN`). Both filtered tables have the same number of rows, `2174`, which matches the earlier observation that `days_employed` and `total_income` have the same number of missing values.**
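
Equal counts alone do not prove that the same rows are affected, so the symmetry claim can also be tested row by row. The sketch below uses a hypothetical four-row frame named `demo` in place of `sp2`; the column names are the real ones, the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of sp2 (illustrative values only)
demo = pd.DataFrame({
    'days_employed': [-8437.67, np.nan, -5623.42, np.nan],
    'total_income': [40620.10, np.nan, 23341.75, np.nan],
})

missing_de = demo['days_employed'].isna()
missing_ti = demo['total_income'].isna()

# True only if the two columns are missing in exactly the same rows
same_rows = bool((missing_de == missing_ti).all())

# Number of rows where both columns are missing
both_missing = int((missing_de & missing_ti).sum())

print(same_rows, both_missing)
```

If `same_rows` is `True` on the real table and `both_missing` equals `2174`, the missing values are symmetric in the strict sense.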

In [None]:
# Apply some conditions to filter the data and look at the number of rows in the filtered table

# Rows remaining after dropping missing values in 'days_employed'

# data_nan = sp2.loc[(sp2['days_employed'].isna()) & (sp2['total_income'].isna())]
#dropna_de = sp2['days_employed'].dropna()
dropna_de = sp2.dropna(subset=['days_employed'])
dropna_de

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [None]:
# Rows remaining after dropping missing values in 'total_income'
#dropna_ti = sp2['total_income'].dropna()
dropna_ti = sp2.dropna(subset=['total_income'])
dropna_ti

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [None]:
# Count the missing values per column in sp2 to check whether 'days_employed' and 'total_income' have the same count

sp2.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

**Intermediate conclusion**

[Does the number of rows in the filtered table match the number of missing values? What conclusion can we draw from this?]

[Calculate the percentage of missing values relative to the whole dataset. Is this a very large part of the data? If so, you may want to fill in the missing values. To do that, we first need to consider whether the missing data could be caused by certain client characteristics, such as employment type or something else. You need to decide which characteristics *you* think might be the reason. Second, we need to check whether the missing values depend on the values of other indicators in the columns that contain the specific client characteristics identified.]

[Explain your next steps and how they relate to the conclusions you have drawn so far.]

**The number of missing values is the same for the `days_employed` and `total_income` columns: `2174`.** **|** **As a share of the dataset, the missing values amount to `0.100999`, or about `10%`. This is still tolerable according to the theory and the tutor's explanation (a threshold of `50%`), so the analysis can continue by filling in the missing values using the methods covered in the course material.**

In [None]:
# Compute the share of missing values in each column
# (multiplied by 100 to convert to percent)
total_miss_value = sp2.isna().sum() / len(sp2) * 100
total_miss_value


children             0.000000
days_employed       10.099884
dob_years            0.000000
education            0.000000
education_id         0.000000
family_status        0.000000
family_status_id     0.000000
gender               0.000000
income_type          0.000000
debt                 0.000000
total_income        10.099884
purpose              0.000000
dtype: float64

In [None]:
# Rows where both 'days_employed' and 'total_income' are present
data_dropna = sp2.dropna(subset=['days_employed', 'total_income'])

In [None]:
# Rows where both 'days_employed' and 'total_income' are missing
data_isna = sp2.loc[(sp2['days_employed'].isna()) & (sp2['total_income'].isna())]

In [None]:
# Check the unique values in the dropna subset
# Column 'children'
data_dropna['children'].sort_values().unique()

array([-1,  0,  1,  2,  3,  4,  5, 20])

In [None]:
# Check the unique values in the isna subset
# Column 'children'
data_isna['children'].sort_values().unique()

array([-1,  0,  1,  2,  3,  4,  5, 20])

[Describe what you found here.] **The `children` column takes the values -1, 0, 1, 2, 3, 4, 5, and 20 in both subsets. A value of -1 is not a valid count, and 20 looks suspicious.**

**Possible reasons for the missing values**

[Share your ideas about why you think these values might be missing. Do you think they are missing at random, or is there a pattern?] **In my view, there may be a pattern that distinguishes the dropna subset from the isna subset. Some unique values may appear in only one of them: for example, the isna subset may have fewer unique values than the dropna subset and the full dataset, or the other way around.**

[Let's start checking whether the values are missing at random.]

In [None]:
# Check the unique values in the full dataset
# Column 'children'

sp2['children'].sort_values().unique()

array([-1,  0,  1,  2,  3,  4,  5, 20])

In [None]:
# Rows with missing values where children == 1
sp2.loc[(sp2['children']==1) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
94,1,,34,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
189,1,,30,secondary education,1,unmarried,4,F,employee,0,,to own a car
205,1,,31,bachelor's degree,0,married,0,F,employee,0,,buying a second-hand car
220,1,,23,some college,2,civil partnership,1,F,business,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21407,1,,36,secondary education,1,married,0,F,business,0,,building a real estate
21432,1,,38,some college,2,unmarried,4,F,employee,0,,housing transactions
21463,1,,35,bachelor's degree,0,civil partnership,1,M,employee,0,,having a wedding
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony


In [None]:
# Rows with missing values where children == 2
sp2.loc[(sp2['children']==2) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
90,2,,35,bachelor's degree,0,married,0,F,employee,0,,housing transactions
264,2,,40,secondary education,1,divorced,3,F,employee,0,,property
853,2,,24,bachelor's degree,0,married,0,M,business,0,,housing transactions
923,2,,30,secondary education,1,married,0,M,business,0,,building a property
...,...,...,...,...,...,...,...,...,...,...,...,...
21271,2,,42,secondary education,1,civil partnership,1,M,employee,1,,transactions with my real estate
21300,2,,45,Secondary Education,1,civil partnership,1,M,business,0,,to get a supplementary education
21369,2,,42,secondary education,1,divorced,3,M,business,0,,buy residential real estate
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car


In [None]:
# Rows with missing values where children == 3
sp2.loc[(sp2['children']==3) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
978,3,,28,secondary education,1,civil partnership,1,F,employee,0,,housing renovation
2528,3,,34,secondary education,1,married,0,M,employee,0,,housing renovation
2685,3,,40,some college,2,married,0,F,employee,0,,to own a car
3561,3,,34,secondary education,1,married,0,F,employee,0,,construction of own property
4036,3,,40,secondary education,1,civil partnership,1,M,business,0,,construction of own property
4504,3,,35,secondary education,1,married,0,M,employee,0,,buying property for renting out
4959,3,,38,Secondary Education,1,married,0,M,employee,0,,buy commercial real estate
4974,3,,34,bachelor's degree,0,married,0,M,employee,0,,purchase of a car
7050,3,,36,secondary education,1,civil partnership,1,M,employee,1,,wedding ceremony
7228,3,,40,secondary education,1,married,0,M,employee,1,,housing renovation


In [None]:
# Rows with missing values where children == 4
sp2.loc[(sp2['children']==4) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2890,4,,33,secondary education,1,married,0,F,civil servant,0,,purchase of the house for my family
3190,4,,45,secondary education,1,married,0,F,employee,0,,to get a supplementary education
5583,4,,49,bachelor's degree,0,married,0,M,employee,0,,construction of own property
8125,4,,51,secondary education,1,married,0,M,employee,1,,buy residential real estate
8681,4,,33,secondary education,1,married,0,F,civil servant,0,,construction of own property
14974,4,,49,secondary education,1,unmarried,4,F,civil servant,0,,going to university
19095,4,,31,BACHELOR'S DEGREE,0,civil partnership,1,M,employee,0,,wedding ceremony


In [None]:
# Rows with missing values where children == 5
sp2.loc[(sp2['children']==5) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
3979,5,,42,secondary education,1,civil partnership,1,M,employee,0,,buying my own car


In [None]:
# Rows with missing values where children == -1
sp2.loc[(sp2['children']==-1) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
941,-1,,57,Secondary Education,1,married,0,F,retiree,0,,buying my own car
7615,-1,,35,secondary education,1,married,0,M,employee,0,,education
13786,-1,,42,secondary education,1,unmarried,4,M,business,0,,car


In [None]:
# Rows with missing values where children == 20
sp2.loc[(sp2['children']==20) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
3302,20,,35,secondary education,1,unmarried,4,F,civil servant,0,,profile education
3396,20,,56,bachelor's degree,0,married,0,F,business,0,,university education
6198,20,,35,bachelor's degree,0,unmarried,4,M,business,0,,housing
8430,20,,60,secondary education,1,widow / widower,2,F,retiree,0,,purchase of the house
12909,20,,25,secondary education,1,married,0,M,employee,0,,housing transactions
15976,20,,39,secondary education,1,unmarried,4,F,employee,0,,buy real estate
17286,20,,50,bachelor's degree,0,divorced,3,F,employee,0,,buy commercial real estate
19774,20,,59,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
21390,20,,53,secondary education,1,married,0,M,business,0,,buy residential real estate


In [None]:
# Rows with missing values where children == 0
sp2.loc[(sp2['children']==0) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21414,0,,65,secondary education,1,married,0,F,retiree,0,,purchase of my own house
21415,0,,54,secondary education,1,married,0,F,retiree,0,,housing transactions
21423,0,,63,secondary education,1,married,0,M,retiree,0,,purchase of a car
21426,0,,49,secondary education,1,married,0,F,employee,1,,property


**Intermediate conclusion**

[Is the distribution in the original dataset similar to the distribution in the filtered table? What does that mean for us?]  **The `children` column has the same set of unique values in the dropna subset, the isna subset, and the full dataset; no value is absent from any of them.**
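
Matching sets of unique values say nothing about how often each value occurs, so a closer check compares relative frequencies. The sketch below is one way to do that with `value_counts(normalize=True)`; `demo` is a hypothetical eight-row stand-in for `sp2`, and the numbers in it are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sp2 (invented values)
demo = pd.DataFrame({
    'children': [0, 0, 1, 2, 0, 1, 0, 1],
    'total_income': [100.0, np.nan, 200.0, 150.0, np.nan, np.nan, 120.0, 90.0],
})

is_missing = demo['total_income'].isna()

# Share of each 'children' value in the full data and among rows with missing income
comparison = pd.DataFrame({
    'all_rows': demo['children'].value_counts(normalize=True),
    'missing_rows': demo.loc[is_missing, 'children'].value_counts(normalize=True),
}).fillna(0).sort_index()

print(comparison)
```

If the two columns of `comparison` are close for every value, that is consistent with the values being missing at random with respect to `children`; large gaps would point to a pattern.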

[If you think we cannot draw any conclusion yet, let's investigate the dataset further. Think about other reasons that could cause the data to be missing, and check whether we can find a pattern suggesting that the values are not missing at random. Since this is your own work, this part is optional.]

In [None]:
# Check other causes and patterns that could lead to missing values
# Column 'dob_years'
# Unique values in the dropna subset
data_dropna['dob_years'].sort_values().unique()

array([ 0, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75])

In [None]:
# Unique values in the isna subset
data_isna['dob_years'].sort_values().unique()

array([ 0, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73])

In [None]:
# Unique values in the full dataset
sp2['dob_years'].sort_values().unique()

array([ 0, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75])

**The isna subset differs: ages 74 and 75 appear in the dropna subset and in the full dataset, but not among the rows with missing values. I will only display the rows for ages 74 and 75, because showing every age would be too long.**

In [None]:
# Rows with missing values where dob_years == 74
sp2.loc[(sp2['dob_years']==74) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [None]:
# Rows with missing values where dob_years == 75
sp2.loc[(sp2['dob_years']==75) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


**Intermediate conclusion**

[Can we finally confirm that the missing values are accidental? Check anything else you think is important here.] **In the `children` column, no unique value is absent from the dropna subset, the isna subset, or the full dataset. In the `dob_years` column, however, some unique values (74 and 75) are absent from the isna subset, while the dropna subset and the full dataset contain all of them. More columns should probably be checked for similar cases, or for something more interesting, such as a column where the dropna subset lacks a unique value that the isna subset has.**

**Checking the `education` column**

In [None]:
# Check other patterns and explain them
# Column 'education'
# Unique values in the dropna subset
data_dropna['education'].sort_values().unique()

array(["BACHELOR'S DEGREE", "Bachelor's Degree", 'GRADUATE DEGREE',
       'Graduate Degree', 'PRIMARY EDUCATION', 'Primary Education',
       'SECONDARY EDUCATION', 'SOME COLLEGE', 'Secondary Education',
       'Some College', "bachelor's degree", 'graduate degree',
       'primary education', 'secondary education', 'some college'],
      dtype=object)

In [None]:
# Unique values in the isna subset
data_isna['education'].sort_values().unique()

array(["BACHELOR'S DEGREE", "Bachelor's Degree", 'PRIMARY EDUCATION',
       'Primary Education', 'SECONDARY EDUCATION', 'SOME COLLEGE',
       'Secondary Education', 'Some College', "bachelor's degree",
       'primary education', 'secondary education', 'some college'],
      dtype=object)

In [None]:
# Unique values in the full dataset
sp2['education'].sort_values().unique()

array(["BACHELOR'S DEGREE", "Bachelor's Degree", 'GRADUATE DEGREE',
       'Graduate Degree', 'PRIMARY EDUCATION', 'Primary Education',
       'SECONDARY EDUCATION', 'SOME COLLEGE', 'Secondary Education',
       'Some College', "bachelor's degree", 'graduate degree',
       'primary education', 'secondary education', 'some college'],
      dtype=object)

**The isna subset differs: `graduate degree` (in any capitalization) appears in the dropna subset and in the full dataset, but not among the rows with missing values. The remaining values are case-variant duplicates, so I will only display the lowercase variants to inspect the distribution.**
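
Filtering only on the lowercase spelling leaves out rows such as `Secondary Education` or `BACHELOR'S DEGREE`. A possible alternative, sketched here on a hypothetical six-row frame `demo` rather than on `sp2`, is to lowercase the column first and then count missing values per category in a single pass.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sp2 with mixed-case education values (invented incomes)
demo = pd.DataFrame({
    'education': ["bachelor's degree", "BACHELOR'S DEGREE", 'secondary education',
                  'Secondary Education', 'graduate degree', 'Graduate Degree'],
    'total_income': [40620.1, np.nan, 17932.8, np.nan, 30000.0, 31000.0],
})

# Normalize the case so that variants fall into the same group
education_clean = demo['education'].str.lower()

# Per category: number of rows with missing income and total number of rows
missing_by_education = (
    demo['total_income'].isna().astype(int)
    .groupby(education_clean)
    .agg(missing='sum', total='count')
)

print(missing_by_education)
```

A category with `missing` equal to 0, like `graduate degree` in this toy frame, corresponds to a value that never appears among the rows with missing data.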

In [None]:
# Rows with missing values where education == "bachelor's degree"
sp2.loc[(sp2['education']=="bachelor's degree") & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
90,2,,35,bachelor's degree,0,married,0,F,employee,0,,housing transactions
94,1,,34,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21182,0,,28,bachelor's degree,0,civil partnership,1,F,employee,0,,construction of own property
21242,1,,33,bachelor's degree,0,unmarried,4,F,civil servant,0,,construction of own property
21268,1,,44,bachelor's degree,0,civil partnership,1,F,civil servant,0,,having a wedding
21281,1,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


In [None]:
# Rows with missing values where education == 'graduate degree'
sp2.loc[(sp2['education']=='graduate degree') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [None]:
# Rows with missing values where education == 'primary education'
sp2.loc[(sp2['education']=='primary education') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
1303,1,,70,primary education,3,civil partnership,1,F,employee,0,,transactions with commercial real estate
1510,0,,52,primary education,3,civil partnership,1,M,employee,1,,construction of own property
1520,0,,53,primary education,3,married,0,F,employee,0,,transactions with my real estate
4388,0,,39,primary education,3,married,0,F,employee,0,,building a property
4585,1,,58,primary education,3,civil partnership,1,F,employee,0,,housing transactions
7314,1,,45,primary education,3,married,0,M,employee,0,,profile education
8142,0,,64,primary education,3,civil partnership,1,F,civil servant,0,,to have a wedding
10998,0,,34,primary education,3,married,0,F,employee,0,,buy real estate
11758,0,,58,primary education,3,married,0,M,retiree,0,,transactions with my real estate
13031,0,,37,primary education,3,civil partnership,1,F,business,1,,wedding ceremony


In [None]:
# Rows with missing values where education == 'secondary education'
sp2.loc[(sp2['education']=='secondary education') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21423,0,,63,secondary education,1,married,0,M,retiree,0,,purchase of a car
21426,0,,49,secondary education,1,married,0,F,employee,1,,property
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with missing values where education == 'some college'
sp2.loc[(sp2['education']=='some college') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
220,1,,23,some college,2,civil partnership,1,F,business,0,,to have a wedding
950,0,,51,some college,2,married,0,F,business,0,,cars
2171,0,,31,some college,2,married,0,M,civil servant,0,,housing transactions
2685,3,,40,some college,2,married,0,F,employee,0,,to own a car
2948,1,,29,some college,2,unmarried,4,F,business,0,,purchase of the house for my family
3913,1,,27,some college,2,unmarried,4,F,civil servant,0,,building a real estate
4309,2,,31,some college,2,married,0,F,civil servant,0,,car purchase
4372,0,,28,some college,2,married,0,M,employee,1,,purchase of the house for my family
4527,0,,52,some college,2,unmarried,4,F,employee,0,,building a property
5042,0,,42,some college,2,civil partnership,1,M,employee,0,,wedding ceremony


**Checking the `family_status` column**

In [None]:
# Unique values in the subset without missing values
data_dropna['family_status'].sort_values().unique()

array(['civil partnership', 'divorced', 'married', 'unmarried',
       'widow / widower'], dtype=object)

In [None]:
# Unique values in the subset with missing values
data_isna['family_status'].sort_values().unique()

array(['civil partnership', 'divorced', 'married', 'unmarried',
       'widow / widower'], dtype=object)

In [None]:
# Unique values in the whole dataset
sp2['family_status'].sort_values().unique()

array(['civil partnership', 'divorced', 'married', 'unmarried',
       'widow / widower'], dtype=object)

**In the `family_status` column, the dropna subset, the isna subset, and the whole dataset all contain every unique value; none is absent from the isna subset. Next I will inspect the rows for each unique value, restricted to the rows with missing values.**
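
The per-value checks in the following cells can also be condensed into a single comparison table. This is only a minimal sketch on a hypothetical six-row sample (the `sample` frame and its values are made up for illustration); it compares the share of each `family_status` value in all rows against its share among rows where both `days_employed` and `total_income` are missing.

```python
import pandas as pd

nan = float("nan")

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "family_status": ["married", "married", "unmarried", "divorced", "married", "unmarried"],
    "days_employed": [nan, 1200.0, nan, 340.0, nan, 880.0],
    "total_income": [nan, 25000.0, nan, 18000.0, nan, 21000.0],
})

# Rows where both numeric columns are missing
both_missing = sample["days_employed"].isna() & sample["total_income"].isna()

# Share of each category overall and within the missing-value rows
summary = pd.DataFrame({
    "share_all": sample["family_status"].value_counts(normalize=True),
    "share_missing": sample.loc[both_missing, "family_status"].value_counts(normalize=True),
}).fillna(0.0)  # a category absent from the missing-value rows gets share 0

print(summary)
```

If the two columns of `summary` are close for every category, the missing values are probably not tied to that column.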

In [None]:
# Rows with family_status 'civil partnership' where both total_income and days_employed are missing
sp2.loc[(sp2['family_status']=='civil partnership') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
94,1,,34,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
141,0,,39,secondary education,1,civil partnership,1,M,employee,0,,wedding ceremony
181,0,,26,secondary education,1,civil partnership,1,F,business,1,,purchase of the house for my family
...,...,...,...,...,...,...,...,...,...,...,...,...
21268,1,,44,bachelor's degree,0,civil partnership,1,F,civil servant,0,,having a wedding
21271,2,,42,secondary education,1,civil partnership,1,M,employee,1,,transactions with my real estate
21300,2,,45,Secondary Education,1,civil partnership,1,M,business,0,,to get a supplementary education
21463,1,,35,bachelor's degree,0,civil partnership,1,M,employee,0,,having a wedding


In [None]:
# Rows with family_status 'divorced' where both total_income and days_employed are missing
sp2.loc[(sp2['family_status']=='divorced') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
264,2,,40,secondary education,1,divorced,3,F,employee,0,,property
389,1,,31,SECONDARY EDUCATION,1,divorced,3,M,civil servant,0,,supplementary education
736,0,,52,secondary education,1,divorced,3,F,business,0,,buying property for renting out
888,0,,45,secondary education,1,divorced,3,F,employee,0,,housing renovation
1202,1,,43,secondary education,1,divorced,3,F,employee,0,,profile education
...,...,...,...,...,...,...,...,...,...,...,...,...
20272,0,,34,bachelor's degree,0,divorced,3,F,employee,0,,car
20615,0,,56,secondary education,1,divorced,3,F,retiree,0,,housing transactions
20691,0,,47,secondary education,1,divorced,3,F,business,0,,to get a supplementary education
20775,1,,46,bachelor's degree,0,divorced,3,F,employee,0,,education


In [None]:
# Rows with family_status 'married' where both total_income and days_employed are missing
sp2.loc[(sp2['family_status']=='married') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
...,...,...,...,...,...,...,...,...,...,...,...,...
21426,0,,49,secondary education,1,married,0,F,employee,1,,property
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with family_status 'unmarried' where both total_income and days_employed are missing
sp2.loc[(sp2['family_status']=='unmarried') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
189,1,,30,secondary education,1,unmarried,4,F,employee,0,,to own a car
317,0,,21,bachelor's degree,0,unmarried,4,M,employee,0,,purchase of a car
376,0,,23,Some College,2,unmarried,4,M,employee,0,,getting higher education
...,...,...,...,...,...,...,...,...,...,...,...,...
21242,1,,33,bachelor's degree,0,unmarried,4,F,civil servant,0,,construction of own property
21258,1,,40,secondary education,1,unmarried,4,M,employee,0,,housing transactions
21305,0,,59,secondary education,1,unmarried,4,F,retiree,0,,construction of own property
21350,0,,21,secondary education,1,unmarried,4,M,business,0,,to buy a car


In [None]:
# Rows with family_status 'widow / widower' where both total_income and days_employed are missing
sp2.loc[(sp2['family_status']=='widow / widower') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
174,0,,55,bachelor's degree,0,widow / widower,2,F,business,0,,to own a car
320,0,,63,secondary education,1,widow / widower,2,F,employee,0,,buying a second-hand car
328,0,,69,secondary education,1,widow / widower,2,F,employee,0,,car purchase
361,0,,59,secondary education,1,widow / widower,2,F,retiree,0,,housing
415,0,,57,bachelor's degree,0,widow / widower,2,F,retiree,0,,housing
...,...,...,...,...,...,...,...,...,...,...,...,...
20283,0,,61,primary education,3,widow / widower,2,F,retiree,0,,property
20347,0,,62,secondary education,1,widow / widower,2,F,retiree,0,,property
20469,0,,57,secondary education,1,widow / widower,2,M,civil servant,1,,to own a car
20625,0,,56,secondary education,1,widow / widower,2,F,retiree,0,,housing renovation


**Checking the `gender` column**

In [None]:
# Unique values in the subset without missing values
data_dropna['gender'].sort_values().unique()

array(['F', 'M', 'XNA'], dtype=object)

In [None]:
# Unique values in the subset with missing values
data_isna['gender'].sort_values().unique()

array(['F', 'M'], dtype=object)

In [None]:
# Unique values in the whole dataset
sp2['gender'].sort_values().unique()

array(['F', 'M', 'XNA'], dtype=object)

**In the `gender` column, the dropna subset and the whole dataset contain every unique value, while the isna subset lacks one of them, `XNA`. Next I will inspect the rows for each unique value, restricted to the rows with missing values.**
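
Finding which categories are absent from the missing-value rows does not require reading the arrays by eye. The sketch below runs on a hypothetical five-row sample (names and values are illustrative assumptions); it takes the set difference of the unique values and counts how often each category occurs.

```python
import pandas as pd

nan = float("nan")

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "gender": ["F", "M", "XNA", "F", "M"],
    "total_income": [nan, 21000.0, 30000.0, 18000.0, nan],
})

# Subset of rows with a missing total_income
missing_rows = sample[sample["total_income"].isna()]

# Categories present in the whole sample but absent from the missing-value rows
absent = sorted(set(sample["gender"].unique()) - set(missing_rows["gender"].unique()))

# How often each category occurs overall; a single occurrence hints at an input error
counts = sample["gender"].value_counts()

print(absent)
print(counts)
```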

In [None]:
# Rows with gender 'F' where both total_income and days_employed are missing
sp2.loc[(sp2['gender']=='F') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
...,...,...,...,...,...,...,...,...,...,...,...,...
21432,1,,38,some college,2,unmarried,4,F,employee,0,,housing transactions
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with gender 'M' where both total_income and days_employed are missing
sp2.loc[(sp2['gender']=='M') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
83,0,,52,secondary education,1,married,0,M,employee,0,,housing
...,...,...,...,...,...,...,...,...,...,...,...,...
21369,2,,42,secondary education,1,divorced,3,M,business,0,,buy residential real estate
21390,20,,53,secondary education,1,married,0,M,business,0,,buy residential real estate
21423,0,,63,secondary education,1,married,0,M,retiree,0,,purchase of a car
21463,1,,35,bachelor's degree,0,civil partnership,1,M,employee,0,,having a wedding


In [None]:
# Rows with gender 'XNA' where both total_income and days_employed are missing
sp2.loc[(sp2['gender']=='XNA') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


**Checking the `income_type` column**

In [None]:
# Unique values in the subset without missing values
data_dropna['income_type'].sort_values().unique()

array(['business', 'civil servant', 'employee', 'entrepreneur',
       'paternity / maternity leave', 'retiree', 'student', 'unemployed'],
      dtype=object)

In [None]:
# Unique values in the subset with missing values
data_isna['income_type'].sort_values().unique()

array(['business', 'civil servant', 'employee', 'entrepreneur', 'retiree'],
      dtype=object)

In [None]:
# Unique values in the whole dataset
sp2['income_type'].sort_values().unique()

array(['business', 'civil servant', 'employee', 'entrepreneur',
       'paternity / maternity leave', 'retiree', 'student', 'unemployed'],
      dtype=object)

In [None]:
# Rows with income_type 'business' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='business') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
94,1,,34,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
121,0,,29,bachelor's degree,0,married,0,F,business,0,,car
135,0,,27,secondary education,1,married,0,M,business,0,,housing
174,0,,55,bachelor's degree,0,widow / widower,2,F,business,0,,to own a car
...,...,...,...,...,...,...,...,...,...,...,...,...
21390,20,,53,secondary education,1,married,0,M,business,0,,buy residential real estate
21391,0,,52,secondary education,1,married,0,F,business,0,,purchase of the house for my family
21407,1,,36,secondary education,1,married,0,F,business,0,,building a real estate
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car


In [None]:
# Rows with income_type 'civil servant' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='civil servant') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
242,0,,58,secondary education,1,married,0,F,civil servant,0,,purchase of my own house
389,1,,31,SECONDARY EDUCATION,1,divorced,3,M,civil servant,0,,supplementary education
...,...,...,...,...,...,...,...,...,...,...,...,...
20469,0,,57,secondary education,1,widow / widower,2,M,civil servant,1,,to own a car
20479,1,,26,bachelor's degree,0,married,0,F,civil servant,0,,education
20914,0,,32,bachelor's degree,0,married,0,F,civil servant,0,,car purchase
21242,1,,33,bachelor's degree,0,unmarried,4,F,civil servant,0,,construction of own property


In [None]:
# Rows with income_type 'employee' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='employee') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing
90,2,,35,bachelor's degree,0,married,0,F,employee,0,,housing transactions
96,0,,44,SECONDARY EDUCATION,1,married,0,F,employee,0,,buy residential real estate
97,0,,47,bachelor's degree,0,married,0,F,employee,0,,profile education
...,...,...,...,...,...,...,...,...,...,...,...,...
21432,1,,38,some college,2,unmarried,4,F,employee,0,,housing transactions
21463,1,,35,bachelor's degree,0,civil partnership,1,M,employee,0,,having a wedding
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with income_type 'entrepreneur' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='entrepreneur') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
5936,0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,,buy residential real estate


In [None]:
# Rows with income_type 'paternity / maternity leave' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='paternity / maternity leave') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [None]:
# Rows with income_type 'retiree' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='retiree') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
145,0,,62,secondary education,1,married,0,M,retiree,0,,building a property
...,...,...,...,...,...,...,...,...,...,...,...,...
21311,0,,49,secondary education,1,married,0,F,retiree,0,,buying property for renting out
21321,0,,56,Secondary Education,1,married,0,F,retiree,0,,real estate transactions
21414,0,,65,secondary education,1,married,0,F,retiree,0,,purchase of my own house
21415,0,,54,secondary education,1,married,0,F,retiree,0,,housing transactions


In [None]:
# Rows with income_type 'student' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='student') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [None]:
# Rows with income_type 'unemployed' where both total_income and days_employed are missing
sp2.loc[(sp2['income_type']=='unemployed') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


**Checking the `debt` column**

In [None]:
# Unique values in the subset without missing values
data_dropna['debt'].sort_values().unique()

array([0, 1])

In [None]:
# Unique values in the subset with missing values
data_isna['debt'].sort_values().unique()

array([0, 1])

In [None]:
# Unique values in the whole dataset
sp2['debt'].sort_values().unique()

array([0, 1])

In [None]:
# Rows with debt 0 where both total_income and days_employed are missing
sp2.loc[(sp2['debt']==0) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with debt 1 where both total_income and days_employed are missing
sp2.loc[(sp2['debt']==1) & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
181,0,,26,secondary education,1,civil partnership,1,F,business,1,,purchase of the house for my family
247,1,,60,bachelor's degree,0,married,0,F,retiree,1,,going to university
278,1,,23,Secondary Education,1,civil partnership,1,F,employee,1,,car
312,1,,33,secondary education,1,civil partnership,1,M,employee,1,,buying property for renting out
...,...,...,...,...,...,...,...,...,...,...,...,...
20592,3,,35,secondary education,1,married,0,F,employee,1,,to get a supplementary education
20646,1,,50,secondary education,1,civil partnership,1,F,business,1,,construction of own property
20917,0,,50,bachelor's degree,0,married,0,F,employee,1,,construction of own property
21271,2,,42,secondary education,1,civil partnership,1,M,employee,1,,transactions with my real estate


**Checking the `purpose` column**

In [None]:
# Unique values in the subset without missing values
data_dropna['purpose'].sort_values().unique()

array(['building a property', 'building a real estate',
       'buy commercial real estate', 'buy real estate',
       'buy residential real estate', 'buying a second-hand car',
       'buying my own car', 'buying property for renting out', 'car',
       'car purchase', 'cars', 'construction of own property',
       'education', 'getting an education', 'getting higher education',
       'going to university', 'having a wedding', 'housing',
       'housing renovation', 'housing transactions', 'profile education',
       'property', 'purchase of a car', 'purchase of my own house',
       'purchase of the house', 'purchase of the house for my family',
       'real estate transactions', 'second-hand car purchase',
       'supplementary education', 'to become educated', 'to buy a car',
       'to get a supplementary education', 'to have a wedding',
       'to own a car', 'transactions with commercial real estate',
       'transactions with my real estate', 'university education',
       'we

In [None]:
# Unique values in the subset with missing values
data_isna['purpose'].sort_values().unique()

array(['building a property', 'building a real estate',
       'buy commercial real estate', 'buy real estate',
       'buy residential real estate', 'buying a second-hand car',
       'buying my own car', 'buying property for renting out', 'car',
       'car purchase', 'cars', 'construction of own property',
       'education', 'getting an education', 'getting higher education',
       'going to university', 'having a wedding', 'housing',
       'housing renovation', 'housing transactions', 'profile education',
       'property', 'purchase of a car', 'purchase of my own house',
       'purchase of the house', 'purchase of the house for my family',
       'real estate transactions', 'second-hand car purchase',
       'supplementary education', 'to become educated', 'to buy a car',
       'to get a supplementary education', 'to have a wedding',
       'to own a car', 'transactions with commercial real estate',
       'transactions with my real estate', 'university education',
       'we

In [None]:
# Unique values in the whole dataset
sp2['purpose'].sort_values().unique()

array(['building a property', 'building a real estate',
       'buy commercial real estate', 'buy real estate',
       'buy residential real estate', 'buying a second-hand car',
       'buying my own car', 'buying property for renting out', 'car',
       'car purchase', 'cars', 'construction of own property',
       'education', 'getting an education', 'getting higher education',
       'going to university', 'having a wedding', 'housing',
       'housing renovation', 'housing transactions', 'profile education',
       'property', 'purchase of a car', 'purchase of my own house',
       'purchase of the house', 'purchase of the house for my family',
       'real estate transactions', 'second-hand car purchase',
       'supplementary education', 'to become educated', 'to buy a car',
       'to get a supplementary education', 'to have a wedding',
       'to own a car', 'transactions with commercial real estate',
       'transactions with my real estate', 'university education',
       'we

**Explanation**

In [None]:
# Rows with purpose 'building a property' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='building a property') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
145,0,,62,secondary education,1,married,0,M,retiree,0,,building a property
788,0,,48,bachelor's degree,0,unmarried,4,F,business,0,,building a property
923,2,,30,secondary education,1,married,0,M,business,0,,building a property
1594,0,,50,secondary education,1,married,0,M,business,0,,building a property
2232,0,,66,secondary education,1,widow / widower,2,F,retiree,0,,building a property
2581,0,,56,bachelor's degree,0,divorced,3,F,civil servant,0,,building a property
3891,1,,68,SOME COLLEGE,2,divorced,3,F,employee,0,,building a property
4277,1,,34,secondary education,1,married,0,M,business,0,,building a property
4388,0,,39,primary education,3,married,0,F,employee,0,,building a property
4527,0,,52,some college,2,unmarried,4,F,employee,0,,building a property


In [None]:
# Rows with purpose 'building a real estate' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='building a real estate') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
1963,0,,27,bachelor's degree,0,civil partnership,1,F,employee,0,,building a real estate
3913,1,,27,some college,2,unmarried,4,F,civil servant,0,,building a real estate
3992,0,,48,secondary education,1,widow / widower,2,F,employee,0,,building a real estate
4081,1,,40,secondary education,1,civil partnership,1,F,business,0,,building a real estate
4216,0,,30,SECONDARY EDUCATION,1,married,0,M,employee,0,,building a real estate
4480,1,,28,secondary education,1,married,0,M,employee,0,,building a real estate
4596,0,,38,secondary education,1,civil partnership,1,F,employee,0,,building a real estate
5843,0,,47,bachelor's degree,0,unmarried,4,M,employee,0,,building a real estate
6048,0,,48,secondary education,1,married,0,F,employee,0,,building a real estate


In [None]:
# Rows with purpose 'to buy a car' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='to buy a car') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
490,1,,29,secondary education,1,married,0,F,business,0,,to buy a car
1340,0,,59,secondary education,1,widow / widower,2,F,retiree,0,,to buy a car
2900,0,,35,secondary education,1,married,0,M,employee,0,,to buy a car
3090,2,,35,bachelor's degree,0,married,0,F,employee,0,,to buy a car
3575,0,,54,secondary education,1,married,0,F,retiree,0,,to buy a car
3660,0,,28,secondary education,1,unmarried,4,M,business,0,,to buy a car
3903,0,,36,secondary education,1,married,0,M,employee,0,,to buy a car
4096,1,,37,secondary education,1,married,0,F,business,0,,to buy a car
4257,2,,31,secondary education,1,married,0,M,business,0,,to buy a car
4784,0,,34,secondary education,1,civil partnership,1,F,employee,0,,to buy a car


In [None]:
# Rows with purpose 'education' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='education') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
930,0,,35,Bachelor's Degree,0,married,0,F,employee,0,,education
959,0,,25,secondary education,1,civil partnership,1,M,employee,0,,education
1779,0,,38,secondary education,1,married,0,F,employee,0,,education
2304,0,,43,secondary education,1,married,0,F,employee,1,,education
3757,2,,28,secondary education,1,married,0,M,employee,0,,education
6365,1,,53,secondary education,1,married,0,F,retiree,0,,education
7376,1,,34,bachelor's degree,0,unmarried,4,F,civil servant,0,,education
7539,2,,36,Bachelor's Degree,0,married,0,F,business,0,,education
7615,-1,,35,secondary education,1,married,0,M,employee,0,,education


In [None]:
# Rows with purpose 'having a wedding' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='having a wedding') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
94,1,,34,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
321,1,,34,bachelor's degree,0,civil partnership,1,M,business,0,,having a wedding
573,0,,52,secondary education,1,civil partnership,1,F,civil servant,0,,having a wedding
1075,0,,41,secondary education,1,civil partnership,1,F,employee,0,,having a wedding
1439,0,,23,SECONDARY EDUCATION,1,civil partnership,1,M,employee,0,,having a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20473,1,,31,secondary education,1,civil partnership,1,F,employee,1,,having a wedding
20662,0,,58,Secondary Education,1,civil partnership,1,M,employee,0,,having a wedding
20797,0,,46,secondary education,1,civil partnership,1,F,employee,0,,having a wedding
21268,1,,44,bachelor's degree,0,civil partnership,1,F,civil servant,0,,having a wedding


In [None]:

# Rows with purpose 'housing' where both total_income and days_employed are missing
sp2.loc[(sp2['purpose']=='housing') & (sp2['total_income'].isna()) & (sp2['days_employed'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing
135,0,,27,secondary education,1,married,0,M,business,0,,housing
361,0,,59,secondary education,1,widow / widower,2,F,retiree,0,,housing
415,0,,57,bachelor's degree,0,widow / widower,2,F,retiree,0,,housing
1228,0,,29,bachelor's degree,0,married,0,F,employee,0,,housing
1658,0,,48,Secondary Education,1,civil partnership,1,M,business,0,,housing
1890,0,,0,bachelor's degree,0,unmarried,4,F,employee,0,,housing
2210,1,,49,bachelor's degree,0,married,0,F,employee,0,,housing
2656,0,,38,secondary education,1,married,0,M,business,0,,housing


**Conclusion**

[Did you find a pattern? How did you reach this conclusion?] **Yes. A pattern appears when the missing values in `days_employed` and `total_income` are checked together: among the rows where both are missing, some unique values of `dob_years`, `education`, and `income_type` do not occur. I do not treat `gender` as part of the pattern, because the only value absent from the missing-value rows is `XNA`, which occurs in just one record and is most likely an input mistake or a system error.**
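
Whether the two columns are missing in exactly the same rows can be tested directly instead of inferred from the filtered tables. The sketch below uses a hypothetical four-row sample; on the real data the same two expressions would be applied to `sp2`, and the result would still need to be verified there.

```python
import pandas as pd

nan = float("nan")

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "days_employed": [nan, 1200.0, nan, 340.0],
    "total_income": [nan, 25000.0, nan, 18000.0],
})

# Cross-tabulate the two missingness flags; off-diagonal cells count rows
# where only one of the two columns is missing
pattern = pd.crosstab(sample["days_employed"].isna(), sample["total_income"].isna())

# True only if every row is missing in both columns or in neither
same_rows = bool((sample["days_employed"].isna() == sample["total_income"].isna()).all())

print(pattern)
print(same_rows)
```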

[Explain how you will handle the missing values. Consider the categories in which values are missing.] **Because the missing values are numerical (both `days_employed` and `total_income` are numeric columns), I will fill them with the median or the mean, which is the most reasonable approach here.**
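
One way such a fill could look is sketched below, on a hypothetical six-row sample. Grouping by `income_type` and choosing the median over the mean are assumptions made for this illustration only; the median is used because it is less sensitive to extreme incomes.

```python
import pandas as pd

nan = float("nan")

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree", "retiree"],
    "total_income": [20000.0, 30000.0, nan, 10000.0, nan, 14000.0],
})

# Median income of each income_type, aligned row by row with the sample
group_median = sample.groupby("income_type")["total_income"].transform("median")

# Fill only the missing entries; existing values stay untouched
sample["total_income_filled"] = sample["total_income"].fillna(group_median)

print(sample)
```

Writing the result to a new column keeps the original values available for comparison before the fill is committed.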

[Briefly plan your next steps for transforming the data. You may need to deal with several kinds of problems: duplicates, inconsistent letter case, incorrect legacy data, and missing values.]

## Data transformation

[Let's look at each column to see what problems it may contain.]

[Start by removing duplicates and fixing the education information if needed.]

In [None]:
# List every value in the education column to see which spellings need fixing
sp2.sort_values('education')['education'].unique()

array(["BACHELOR'S DEGREE", "Bachelor's Degree", 'GRADUATE DEGREE',
       'Graduate Degree', 'PRIMARY EDUCATION', 'Primary Education',
       'SECONDARY EDUCATION', 'SOME COLLEGE', 'Secondary Education',
       'Some College', "bachelor's degree", 'graduate degree',
       'primary education', 'secondary education', 'some college'],
      dtype=object)

In [None]:
# Normalize the letter case
sp2['education'] = sp2['education'].str.lower()

In [None]:
# Check the unique values of 'education' again to confirm the fix worked
sp2.sort_values('education')['education'].unique()

array(["bachelor's degree", 'graduate degree', 'primary education',
       'secondary education', 'some college'], dtype=object)

In [None]:
# Count rows per ('education', 'education_id') pair to catch any unexpected mismatch between the two columns
sp2.groupby('education')['education_id'].value_counts()

education            education_id
bachelor's degree    0                5260
graduate degree      4                   6
primary education    3                 282
secondary education  1               15233
some college         2                 744
Name: education_id, dtype: int64
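
The output above shows exactly one `education_id` per education label. The same consistency check can be expressed as an assertion-style test; the sketch below uses a hypothetical five-row sample and made-up variable names.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "education": ["Secondary Education", "secondary education",
                  "BACHELOR'S DEGREE", "bachelor's degree", "some college"],
    "education_id": [1, 1, 0, 0, 2],
})

# Same normalization as in the notebook: lowercase every label
sample["education"] = sample["education"].str.lower()

# Number of distinct ids per label and of distinct labels per id
ids_per_label = sample.groupby("education")["education_id"].nunique()
labels_per_id = sample.groupby("education_id")["education"].nunique()

# The mapping is one-to-one only if both counts are 1 everywhere
is_one_to_one = bool((ids_per_label == 1).all() and (labels_per_id == 1).all())

print(is_one_to_one)
```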

[Check the data in the `children` column]

In [None]:
# Look at the unique values in the `children` column
sp2.sort_values('children')['children'].unique()

array([-1,  0,  1,  2,  3,  4,  5, 20])

In [None]:
# Count how often each value occurs
sp2.sort_values('children')['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [None]:
# Non-null counts per column for the rows where `children` is -1
sp2.loc[(sp2['children']==-1)].count()

children            47
days_employed       44
dob_years           47
education           47
education_id        47
family_status       47
family_status_id    47
gender              47
income_type         47
debt                47
total_income        44
purpose             47
dtype: int64

In [None]:
# The same counts as a percentage of all rows
sp2.loc[(sp2['children']==-1)].count()/len(sp2) * 100

children            0.218351
days_employed       0.204413
dob_years           0.218351
education           0.218351
education_id        0.218351
family_status       0.218351
family_status_id    0.218351
gender              0.218351
income_type         0.218351
debt                0.218351
total_income        0.204413
purpose             0.218351
dtype: float64

In [None]:
# Non-null counts per column for the rows where `children` is 20
sp2.loc[(sp2['children']==20)].count()

children            76
days_employed       67
dob_years           76
education           76
education_id        76
family_status       76
family_status_id    76
gender              76
income_type         76
debt                76
total_income        67
purpose             76
dtype: int64

In [None]:
# The same counts as a percentage of all rows
sp2.loc[(sp2['children']==20)].count()/len(sp2) * 100

children            0.353078
days_employed       0.311266
dob_years           0.353078
education           0.353078
education_id        0.353078
family_status       0.353078
family_status_id    0.353078
gender              0.353078
income_type         0.353078
debt                0.353078
total_income        0.311266
purpose             0.353078
dtype: float64

[Is there anything odd in the column? If so, what percentage of the data is affected? How could it have happened? Decide what to do with this data and explain why.] **The `children` column contains the value -1, which is impossible. It appears in only about `0.22%` of rows (47 of 21,525). These clients may have declined to say whether they have children, hoping that a smaller apparent financial burden relative to their monthly salary and position would improve their chances of approval. I decided to treat them as having no children.**

**The value 20 is most likely a data-entry error, such as accidentally pressing an extra 0 after the 2. It affects only about `0.35%` of rows (76 of 21,525), so I replace it with 2.**
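
As a quick check of the percentages, the sketch below redoes the arithmetic from the counts printed above (47 and 76 rows out of 21,525). The second half shows one way to get the same kind of share directly from a column; the `children` series there is toy data made up for illustration.

```python
import pandas as pd

# Counts and total row count taken from the outputs above
total_rows = 21525
share_minus_one = 47 / total_rows * 100   # rows with children == -1
share_twenty = 76 / total_rows * 100      # rows with children == 20
print(round(share_minus_one, 2), round(share_twenty, 2))

# The same kind of share computed from a column (toy values, illustrative only)
children = pd.Series([0, 1, -1, 2, 20, 0, 3, 0, 1, 0])
toy_share = children.isin([-1, 20]).mean() * 100
print(toy_share)
```

Taking the mean of a boolean mask gives the fraction of `True` values, which avoids dividing counts by hand.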

In [None]:
# [fix the data according to your decision]
sp2['children'] = sp2['children'].replace(-1, 0)
sp2['children'] = sp2['children'].replace(20, 2)

In [None]:
# Check the `children` column again to make sure everything was fixed

sp2.sort_values('children')['children'].unique()

array([0, 1, 2, 3, 4, 5])

In [None]:
sp2['children'].value_counts()

0    14137
1     4808
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

[Check the data in the `days_employed` column. First think about what problems there might be, what you want to check, and how you will do it.]

In [None]:
# Find the problematic data in `days_employed`, if any, and compute its percentage
sp2['days_employed']

0         -8437.673028
1         -4024.803754
2         -5623.422610
3         -4124.747207
4        340266.072047
             ...      
21520     -4529.316663
21521    343937.404131
21522     -2113.346888
21523     -3112.481705
21524     -1984.507589
Name: days_employed, Length: 21525, dtype: float64

In [None]:
# Convert the number of days into years
xs1 = sp2['days_employed'] / 365
xs1

0        -23.116912
1        -11.026860
2        -15.406637
3        -11.300677
4        932.235814
            ...    
21520    -12.409087
21521    942.294258
21522     -5.789991
21523     -8.527347
21524     -5.437007
Name: days_employed, Length: 21525, dtype: float64

In [None]:
# Check the mean
xs1.mean()

172.73013057937914

In [None]:
# Check the median
xs1.median()

-3.296902818549285

In [None]:
# Count the rows whose `days_employed` is negative
sp2[sp2['days_employed'] < 0].count()

children            15906
days_employed       15906
dob_years           15906
education           15906
education_id        15906
family_status       15906
family_status_id    15906
gender              15906
income_type         15906
debt                15906
total_income        15906
purpose             15906
dtype: int64

In [None]:
# The same count as a percentage of all rows
sp2[sp2['days_employed'] < 0].count() / len(sp2) * 100

children            73.89547
days_employed       73.89547
dob_years           73.89547
education           73.89547
education_id        73.89547
family_status       73.89547
family_status_id    73.89547
gender              73.89547
income_type         73.89547
debt                73.89547
total_income        73.89547
purpose             73.89547
dtype: float64

In [None]:
# Count the rows where the years employed exceed the client's age
sp2.loc[(sp2['days_employed'] / 365 > sp2['dob_years'])].count()

children            3445
days_employed       3445
dob_years           3445
education           3445
education_id        3445
family_status       3445
family_status_id    3445
gender              3445
income_type         3445
debt                3445
total_income        3445
purpose             3445
dtype: int64

In [None]:
# The same count as a percentage of all rows
sp2.loc[(sp2['days_employed'] / 365 > sp2['dob_years'])].count()/len(sp2) * 100

children            16.004646
days_employed       16.004646
dob_years           16.004646
education           16.004646
education_id        16.004646
family_status       16.004646
family_status_id    16.004646
gender              16.004646
income_type         16.004646
debt                16.004646
total_income        16.004646
purpose             16.004646
dtype: float64

**From the checks above I draw the following conclusions:**

**1. The column has a problem with negative values, and their share is very high: about `73.9%` of all rows.**

**2. The mean and median are not meaningful yet because of the negative values. They need to be checked again once the data has been fixed, in order to decide whether to use the median or the mean.**

**3. Some `days_employed` values are extremely high, so I checked them against the assumption that a person cannot have worked for longer than their age in `dob_years`. I counted the rows where the years employed exceed `dob_years`. Their share is about `16%`, which I consider low enough to replace these values.**
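
The two checks behind these conclusions can be sketched on a few rows. The values below are a made-up toy sample in the same shape as the real columns, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy rows shaped like the real columns (illustrative values only)
toy = pd.DataFrame({
    'days_employed': [-8437.67, -4024.80, 340266.07, -1984.51],
    'dob_years': [42, 36, 53, 40],
})

# Share of rows with a negative duration
negative_share = (toy['days_employed'] < 0).mean() * 100

# Rows where the employment length in years exceeds the client's age
years_employed = toy['days_employed'].abs() / 365
impossible = years_employed > toy['dob_years']

print(negative_share)
print(impossible.tolist())
```

Taking the absolute value before comparing with age keeps the two problems separate: a sign error alone does not make a row "impossible".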

[If the share of problematic data is high, it may be due to a technical issue. We may want to propose the most plausible reason why it happened and what the correct data might be, since we cannot simply delete these rows.]

In [None]:
# Fix the problematic values, if any

# Negative durations are treated as a sign error, so take the absolute value
sp2['days_employed'] = sp2['days_employed'].abs()

mean_day = sp2['days_employed'].mean()
mean_year = sp2['days_employed'].mean() / 365
median_day = sp2['days_employed'].median()
median_year = sp2['days_employed'].median() / 365

# Replace durations longer than the client's age with the median (in days)
sp2.loc[((sp2['days_employed']/365) > sp2['dob_years']), 'days_employed'] = median_day

median_day

2194.220566878695

**[To fix the problematic values I use `median_day`, because the `days_employed` column is measured in days. I replace the problematic values with the `median` rather than the `mean` because the mean is not suitable here: the problematic values themselves pull the `mean` far away from where most of the data lies.]**
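
The reasoning for preferring the median can be seen on a tiny example. The numbers below are invented for illustration: four ordinary durations and one extreme value of the kind found in this column.

```python
import pandas as pd

# Four plausible durations and one extreme value (illustrative only)
days = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 340000.0])

mean_days = days.mean()      # pulled far upward by the single extreme value
median_days = days.median()  # stays at the middle observation

print(mean_days, median_days)
```

A single outlier moves the mean well outside the range of the ordinary values, while the median does not move at all.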

In [None]:
# Check the result to make sure it was fixed
sp2

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,5623.422610,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,2194.220567,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [None]:
# Count the rows whose `days_employed` is negative
sp2[sp2['days_employed'] < 0].count()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

[Now look at client age and whether there are any problems there. Again, think about what would be implausible in this column, that is, for a person's age.]

In [None]:
# Check `dob_years` for suspicious values and compute their percentage
sp2['dob_years'].sort_values().unique()

array([ 0, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75])

In [None]:
# Non-null counts per column for the rows where `dob_years` is 0
sp2.loc[(sp2['dob_years']==0)].count()

children            101
days_employed        91
dob_years           101
education           101
education_id        101
family_status       101
family_status_id    101
gender              101
income_type         101
debt                101
total_income         91
purpose             101
dtype: int64

In [None]:
# The same counts as a percentage of all rows
sp2.loc[(sp2['dob_years']==0)].count()/len(sp2['dob_years']) * 100

children            0.469222
days_employed       0.422764
dob_years           0.469222
education           0.469222
education_id        0.469222
family_status       0.469222
family_status_id    0.469222
gender              0.469222
income_type         0.469222
debt                0.469222
total_income        0.422764
purpose             0.469222
dtype: float64

[Decide what to do with the problematic values and explain why.] **[There is a suspicious age of 0. These clients may simply not have wanted to state their age. The rows with `0` can be changed because their share is very small, about `0.47%`. I replace 0 with the mean age (`mean()`).]**

In [None]:
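# Fix the problem in the `dob_years` column, if any

# Compute the mean from valid ages only, so the placeholder zeros do not pull it down
mean_dob_years = round(sp2.loc[sp2['dob_years'] > 0, 'dob_years'].mean())
sp2['dob_years'] = sp2['dob_years'].replace(0, mean_dob_years)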

In [None]:
# Check the result to make sure it was fixed
sp2['dob_years'].sort_values().unique()

array([19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
       70, 71, 72, 73, 74, 75])

[Now check the `family_status` column. See what kinds of values it contains and what problems you may need to address.]

In [None]:
# Look at the values in the column

print(sp2['family_status'].unique())
print(sp2['family_status'].value_counts())
print(sp2.groupby('family_status_id')['family_status'].value_counts())

['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']
married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64
family_status_id  family_status    
0                 married              12380
1                 civil partnership     4177
2                 widow / widower        960
3                 divorced              1195
4                 unmarried             2813
Name: family_status, dtype: int64


**[I changed `widow / widower` to `divorced` because I treat the two statuses as essentially the same, and I updated `family_status_id` at the same time, since it is the status code that will be used for categorization at the end.]**

In [None]:
# Fix the problematic values in `family_status`, if any

sp2['family_status'] = sp2['family_status'].replace('widow / widower', 'divorced')
# The order matters: first merge ID 2 into 3, then move the merged group to 2,
# and finally shift 'unmarried' from 4 down to 3 so the IDs stay contiguous
sp2['family_status_id'] = sp2['family_status_id'].replace(2, 3)
sp2['family_status_id'] = sp2['family_status_id'].replace(3, 2)
sp2['family_status_id'] = sp2['family_status_id'].replace(4, 3)

In [None]:
# Check the result to make sure the values were fixed
sp2.groupby('family_status_id')['family_status'].value_counts()

family_status_id  family_status    
0                 married              12380
1                 civil partnership     4177
2                 divorced              2155
3                 unmarried             2813
Name: family_status, dtype: int64
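
The three chained `replace` calls above only give the intended result when run in that exact order. As an alternative sketch (not the approach used in this notebook), a single explicit mapping makes the renumbering independent of order. The `id_map` name and the toy series are hypothetical.

```python
import pandas as pd

# One entry per original ID (illustrative stand-in for `family_status_id`)
status_id = pd.Series([0, 1, 2, 3, 4])

# Old ID -> new ID: widow / widower (2) and divorced (3) share ID 2,
# and unmarried moves from 4 to 3
id_map = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3}
remapped = status_id.map(id_map)

print(remapped.tolist())
```

Because `map` looks up every original value exactly once, an earlier substitution can never be picked up and changed again by a later one.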

[Now check the `gender` column. See what kinds of values it contains and what problems you may need to address.]

In [None]:
# Look at the values in the column
sp2['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [None]:
# Look at the row whose gender is XNA
sp2.loc[(sp2['gender']=='XNA')]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,2358.600502,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


**[No change is needed: this project only asks about marital status and number of children, so as long as the marital status is recorded, nothing needs to be changed.]**

In [None]:
# Fix the problematic values, if any

In [None]:
# Check the result to make sure it was fixed

[Now check the `income_type` column. See what kinds of values it contains and what problems you may need to address.]

In [None]:
# Look at the values in the column
sp2['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

**[Nothing needs to be changed because there is no problem here; many situations could explain these statuses. A `student` may run a business while studying, or the status on their identity card may simply not have been updated, and someone on `paternity / maternity leave` is only using their leave entitlement.]**

In [None]:
# Fix the problematic values, if any

In [None]:
# Check the result to make sure it was fixed

[Now check whether the data contains duplicates. If it does, decide what to do with them and explain why.]

In [None]:
# Check for duplicates

sp2.duplicated().sum()

72

In [None]:
# Remove the duplicates, if any
sp2 = sp2.drop_duplicates().reset_index(drop=True)

In [None]:
# Finally, check whether any duplicates remain
sp2.duplicated().sum()

0

In [None]:
# Check the size of the dataset after this first round of manipulation
sp2.shape

(21453, 12)
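
For reference, the duplicate handling above follows the pattern sketched below on a three-row toy frame (made-up values). Fully identical rows are counted, dropped, and the index is rebuilt so it has no gaps.

```python
import pandas as pd

# Two identical rows and one distinct row (illustrative only)
toy = pd.DataFrame({'children': [0, 0, 1], 'debt': [0, 0, 1]})

n_dupes = int(toy.duplicated().sum())               # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

Note that `duplicated()` only flags the second and later copies, so the count is the number of rows that will be removed, not the number of rows involved in duplication.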

## Working with missing values

[To speed up work with some of the data, you may want to use dictionaries for values that have IDs. Explain why and which dictionaries you will use.]

In [None]:
# Find the dictionary
{'dob_years': sp2['dob_years'].unique()}

{'dob_years': array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
        21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
        64, 44, 52, 46, 23, 38, 39, 51, 59, 29, 60, 55, 58, 71, 22, 73, 66,
        69, 19, 72, 70, 74, 75])}
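
The prompt refers to columns that come with an ID, such as `education` / `education_id` and `family_status` / `family_status_id`. A minimal sketch of building such an ID-to-label dictionary is shown below; the `toy` frame and the `education_dict` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows with a label column and its ID column (illustrative only)
toy = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education', "bachelor's degree"],
    'education_id': [0, 1, 0],
})

# Keep one row per (ID, label) pair and turn it into an ID -> label lookup
education_dict = (
    toy[['education_id', 'education']]
    .drop_duplicates()
    .set_index('education_id')['education']
    .to_dict()
)

print(education_dict)
```

With a lookup like this, later steps can group and filter on the compact integer ID and translate back to a readable label only when presenting results.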

### Fixing missing values in `total_income`

[Briefly describe which columns have missing values that you need to handle. Explain how you will fix them.]


[Start by addressing the missing total income values. Create age categories for clients and a new column holding them. This can help when estimating total income.]


In [None]:
# Write a function that assigns an age category

def category_age(age):
    if age <= 24:
        return '0-24'
    elif age <= 64:
        return '25-64'
    else:
        return '65+'

In [None]:
# Test whether the function works
print(category_age(23))
print(category_age(46))
print(category_age(67))

0-24
25-64
65+


In [None]:
# Create a new column using the function

sp2['category_age'] = sp2['dob_years'].apply(category_age)

In [None]:
# Check the values in the new column

sp2['category_age'].value_counts()

25-64    19683
65+        895
0-24       875
Name: category_age, dtype: int64
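
The same three age bands can also be produced without a row-wise function. The sketch below uses `pd.cut` as an alternative and is not what the notebook does; the `ages` series is toy data chosen to sit on the band edges.

```python
import pandas as pd

# Ages on and around the band edges (illustrative only)
ages = pd.Series([19, 24, 25, 64, 65, 75])

# Right-closed bins: (0, 24], (24, 64], (64, inf)
age_group = pd.cut(
    ages,
    bins=[0, 24, 64, float('inf')],
    labels=['0-24', '25-64', '65+'],
)

print(age_group.tolist())
```

With the default `right=True`, an age of 24 falls in `'0-24'` and 64 in `'25-64'`, which matches the `<=` comparisons in `category_age`.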

[Think about the factors that income usually depends on. Eventually you will need to decide whether to use the mean or the median to replace missing values. To make this decision, you may want to look at the distribution of the factors that affect a person's income.]

[Create a table containing only rows without missing values. This data will be used to fix the missing values.]

In [None]:
# Build a table without missing values and show a few rows to make sure everything works

data_clean = sp2.loc[~(sp2['days_employed'].isna()) & ~(sp2['total_income'].isna())]
data_clean.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,25-64
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,25-64
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,25-64
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,25-64
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,25-64
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,25-64
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,25-64
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,25-64
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,25-64


In [None]:
# Look at the mean income by the factor you identified

mean_ti = data_clean.groupby('family_status')['total_income'].mean()
mean_ti

family_status
civil partnership    26694.428597
divorced             25322.079763
married              27041.784689
unmarried            26934.069805
Name: total_income, dtype: float64

In [None]:
# Look at the median income by the factor you identified

med_ti = data_clean.groupby('family_status')['total_income'].median()
med_ti

family_status
civil partnership    23186.5340
divorced             22089.2815
married              23389.5400
unmarried            23149.0280
Name: total_income, dtype: float64

[Repeat the comparison for several factors. Make sure you consider different aspects and explain your thought process.]



[Decide which characteristic most determines income and whether you will use the median or the mean. Explain why you made this decision.]


In [None]:
# Create a temporary column to test the function
sp2['total_income_imputation'] = sp2['total_income']

In [None]:
# Function used to fill in the missing values

def data_imputation(data, column_grouping, column_selected):
    """Fill NaN in `column_selected` with the median of its `column_grouping` group.

    data            -- DataFrame to process (modified in place and returned)
    column_grouping -- column whose values define the groups
    column_selected -- column whose NaN values are filled
    """
    for rule in data[column_grouping].unique():
        in_group = data[column_grouping] == rule
        median = data.loc[in_group, column_selected].median()  # median() skips NaN
        data.loc[in_group & data[column_selected].isna(), column_selected] = median
    return data
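
The loop in `data_imputation` fills each group's gaps with that group's median. For comparison, the same idea can be sketched with `groupby(...).transform('median')`; this is an alternative formulation shown on made-up toy data, not the code the notebook runs.

```python
import numpy as np
import pandas as pd

# Toy incomes with one gap per group (illustrative only)
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'unmarried', 'unmarried'],
    'total_income': [10.0, 30.0, np.nan, 5.0, np.nan],
})

# Median of each row's own group, aligned to the original index (NaN is skipped)
group_median = toy.groupby('family_status')['total_income'].transform('median')

# Fill only the missing entries with their group's median
toy['total_income'] = toy['total_income'].fillna(group_median)

print(toy['total_income'].tolist())
```

Both versions leave existing values untouched; the `transform` form simply avoids the explicit loop over groups.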

In [None]:
# Run the function
sp2 = data_imputation(sp2, column_grouping='family_status', column_selected='total_income_imputation')

In [None]:
# Check the values in the new column

sp2.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age,total_income_imputation
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,25-64,40620.102
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,25-64,17932.802
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,25-64,23341.752
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,25-64,42820.568
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,25-64,25378.572
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,25-64,40922.17
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,25-64,38484.156
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64,21731.829
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,25-64,15337.093
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,25-64,23108.15


In [None]:
# Apply the result: copy the imputed values back into `total_income`

sp2['total_income'] = sp2['total_income_imputation']

In [None]:
# Check whether any missing values remain

sp2['total_income'].isna().sum()

0

[If you run into errors while preparing values for the missing data, there may be something special about the data for that category. Think it over: you may want to fix some things manually if there is enough data to find the median/mean.]


In [None]:
# Replace missing values if there were errors

# Nothing left to do: the isna() check already returns 0

[When you think you are done with `total_income`, check that the total number of values in this column matches the number of values in the other columns.]

In [None]:
# Check the number of entries in each column

sp2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21453 entries, 0 to 21452
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   children                 21453 non-null  int64  
 1   days_employed            19351 non-null  float64
 2   dob_years                21453 non-null  int64  
 3   education                21453 non-null  object 
 4   education_id             21453 non-null  int64  
 5   family_status            21453 non-null  object 
 6   family_status_id         21453 non-null  int64  
 7   gender                   21453 non-null  object 
 8   income_type              21453 non-null  object 
 9   debt                     21453 non-null  int64  
 10  total_income             21453 non-null  float64
 11  purpose                  21453 non-null  object 
 12  category_age             21453 non-null  object 
 13  total_income_imputation  21453 non-null  float64
dtypes: float64(3), int64(5

### Fixing values in `days_employed`

[Think about the parameters that could help you fix the missing values in this column. Eventually you will need to decide whether to use the mean or the median to replace missing values. You will probably do the same kind of investigation as for the previous column.]

In [None]:
# Median of `days_employed` by the parameter you identified

med_de = data_clean.groupby('family_status')['days_employed'].median()
med_de

family_status
civil partnership    1960.262853
divorced             2194.220567
married              2194.220567
unmarried            1476.537588
Name: days_employed, dtype: float64

In [None]:
# Mean of `days_employed` by the parameter you identified

avg_de = data_clean.groupby('family_status')['days_employed'].mean()
avg_de

family_status
civil partnership    2245.461300
divorced             2470.990324
married              2423.737746
unmarried            1892.161968
Name: days_employed, dtype: float64

[Decide what you will use: the mean or the median. Explain why.] **[With the extreme values already handled, no value sits far from the rest of the distribution anymore, and the group means and medians are reasonably close. The imputation function below fills the gaps with the group median, the same approach used for `total_income`.]**

In [None]:
# Create a temporary column to test the function
sp2['days_employed_imputation'] = sp2['days_employed']

In [None]:
# Function that fills missing values with the group median for the parameter you identified

def data_imputation(sp2, column_grouping, column_selected):
    group = sp2[column_grouping].unique()
    for rule in group:
        median = sp2.loc[(sp2[column_grouping]==rule) & ~(sp2[column_selected].isna()), column_selected].median()
        sp2.loc[(sp2[column_grouping]==rule) & (sp2[column_selected].isna()), column_selected] = median
    return sp2

In [None]:
# Run the function
sp2 = data_imputation(sp2, column_grouping='family_status', column_selected='days_employed_imputation')

In [None]:
# Check that the function works
sp2.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age,total_income_imputation,days_employed_imputation
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,25-64,40620.102,8437.673028
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,25-64,17932.802,4024.803754
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,25-64,23341.752,5623.42261
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,25-64,42820.568,4124.747207
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,25-64,25378.572,2194.220567
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,25-64,40922.17,926.185831
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,25-64,38484.156,2879.202052
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64,21731.829,152.779569
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,25-64,15337.093,6929.865299
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,25-64,23108.15,2188.756445


In [None]:
# Apply the result: copy the imputed values back into `days_employed`

sp2['days_employed'] = sp2['days_employed_imputation']

In [None]:
# Check that the function works
sp2.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age,total_income_imputation,days_employed_imputation
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,25-64,40620.102,8437.673028
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,25-64,17932.802,4024.803754
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,25-64,23341.752,5623.42261
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,25-64,42820.568,4124.747207
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,25-64,25378.572,2194.220567
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,25-64,40922.17,926.185831
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,25-64,38484.156,2879.202052
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64,21731.829,152.779569
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,25-64,15337.093,6929.865299
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,25-64,23108.15,2188.756445


In [None]:
# Replace missing values

# Note: nothing left to do, this was already handled in the previous cells.

[When you think you are done with `days_employed`, check that the total number of values in this column matches the number of values in the other columns.] **[The missing values were already handled in the previous cells.]**

In [None]:
# Check the entries in all columns to make sure every missing value was fixed
sp2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21453 entries, 0 to 21452
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   children                  21453 non-null  int64  
 1   days_employed             21453 non-null  float64
 2   dob_years                 21453 non-null  int64  
 3   education                 21453 non-null  object 
 4   education_id              21453 non-null  int64  
 5   family_status             21453 non-null  object 
 6   family_status_id          21453 non-null  int64  
 7   gender                    21453 non-null  object 
 8   income_type               21453 non-null  object 
 9   debt                      21453 non-null  int64  
 10  total_income              21453 non-null  float64
 11  purpose                   21453 non-null  object 
 12  category_age              21453 non-null  object 
 13  total_income_imputation   21453 non-null  float64
 14  days_e

## Data Categorization

[To answer the questions and test the hypotheses, you will work with categorized data. Look at the questions you have been asked and need to answer. Think about which data needs to be categorized to answer them. Below you will find a template that you can work through in your own way while categorizing the data. The first process covers text data; the second addresses numerical data that needs to be categorized. You may use both of the suggested hints or neither of them; it is up to you.]

[Whatever you decide about categorization, make sure you clearly explain why you made that decision. Remember: this is your work, and everything in it is your decision.]


In [None]:
# Show the values of the data selected for categorization

# print(sp2['family_status'].value_counts().head(15))
# print(sp2['children'].value_counts().head(15))
# print(sp2['debt'].value_counts().head(15))
sp2[['family_status', 'debt', 'children', 'purpose']]

Unnamed: 0,family_status,debt,children,purpose
0,married,0,1,purchase of the house
1,married,0,1,car purchase
2,married,0,0,purchase of the house
3,married,0,3,supplementary education
4,civil partnership,0,0,to have a wedding
...,...,...,...,...
21448,civil partnership,0,1,housing transactions
21449,married,0,0,purchase of a car
21450,civil partnership,1,1,property
21451,married,1,3,buying my own car


[Let's check unique values]

In [None]:
# Check the unique values
# Column 'family_status'
sp2['family_status'].unique()

array(['married', 'civil partnership', 'divorced', 'unmarried'],
      dtype=object)

In [None]:
# Column 'children'
sp2['children'].unique()

array([1, 0, 3, 2, 4, 5])

In [None]:
# Column 'debt'
sp2['debt'].unique()

array([0, 1])

[What main groups can you identify from the unique values?] **I identified the main groups from the project description. Since the project asks about marital status and number of children, I chose those two columns, together with `debt`, which they are meant to help explain.**

[Based on these topics, we want to categorize our data.]


**[I added several extra code cells to build the functions, because the instructions in the description called for them, the number of cells provided in the template was unclear, and the instructions did not line up with the cells provided.]**

In [None]:
# Create a readable label for the debt status

print(sp2['debt'].value_counts())

def debt_category(value):
    if value == 1:
        return 'Gagal Bayar'
    else:
        return 'Tidak Gagal Bayar'

sp2['debt_category'] = sp2['debt'].apply(debt_category)
sp2.head(15)

0    19712
1     1741
Name: debt, dtype: int64


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age,total_income_imputation,days_employed_imputation,debt_category,income_status
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,house,25-64,40620.102,8437.673028,Tidak Gagal Bayar,high income
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car,25-64,17932.802,4024.803754,Tidak Gagal Bayar,average income
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,house,25-64,23341.752,5623.42261,Tidak Gagal Bayar,average income
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,education,25-64,42820.568,4124.747207,Tidak Gagal Bayar,high income
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,wedding,25-64,25378.572,2194.220567,Tidak Gagal Bayar,high income
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,house,25-64,40922.17,926.185831,Tidak Gagal Bayar,high income
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,house,25-64,38484.156,2879.202052,Tidak Gagal Bayar,high income
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64,21731.829,152.779569,Tidak Gagal Bayar,average income
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,wedding,25-64,15337.093,6929.865299,Tidak Gagal Bayar,low income
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,house,25-64,23108.15,2188.756445,Tidak Gagal Bayar,average income


In [None]:
# Count the values in the 'debt_category' column
sp2['debt_category'].value_counts()

Tidak Gagal Bayar    19712
Gagal Bayar           1741
Name: debt_category, dtype: int64
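
Because `debt` only takes the values 0 and 1, the same labels can also be produced without a row-by-row function. The sketch below uses `Series.map` on a small invented series (the toy values are an assumption, not rows from the dataset). Unlike `debt_category`, a value outside the dictionary would become `NaN` rather than 'Tidak Gagal Bayar'.

```python
import pandas as pd

# Toy stand-in for the `debt` column (values invented for illustration).
debt = pd.Series([0, 1, 0, 0, 1], name="debt")

# Same labels as debt_category(): 1 -> 'Gagal Bayar' (default),
# 0 -> 'Tidak Gagal Bayar' (no default).
# Note: any value missing from this dict would map to NaN.
labels = {0: "Tidak Gagal Bayar", 1: "Gagal Bayar"}
debt_label = debt.map(labels)

print(debt_label.value_counts())
```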

[If you decide to categorize numerical data, you also need to create categories for it.]

In [None]:
# Look through the numerical data in the column selected for categorization

sp2[['total_income']]

Unnamed: 0,total_income
0,40620.102
1,17932.802
2,23341.752
3,42820.568
4,25378.572
...,...
21448,35966.698
21449,24959.969
21450,14347.610
21451,39054.888


In [None]:
# Get summary statistics for the column

sp2['total_income'].describe()

count     21453.000000
mean      26435.067305
std       15684.413776
min        3306.762000
25%       17219.352000
50%       23387.843000
75%       31331.009000
max      362496.645000
Name: total_income, dtype: float64

[Decide which ranges you will use for grouping and explain why.] **[The `total_income` column is split into ranges because it holds only raw numbers, with no status label attached.]**

In [None]:
# Create a function that assigns each value to a group based on its range

def income_status(total_income):
    if total_income <= 17500:
        return 'low income'
    elif total_income <= 25000:
        return 'average income'
    else:
        return 'high income'

print(income_status(9000))
print(income_status(21600))
print(income_status(30000))

low income
average income
high income
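
As an alternative sketch, the same thresholds can be expressed with `pd.cut`. The bin edges 17500 and 25000 come from `income_status`; the sample incomes are invented for illustration. With `right=True` each bin includes its upper edge, which matches the `<=` comparisons in the function. The result is a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Invented sample incomes, including the two boundary values.
incomes = pd.Series([9000, 17500, 21600, 25000, 30000])

# Bins: (-inf, 17500], (17500, 25000], (25000, inf)
bins = [-np.inf, 17500, 25000, np.inf]
names = ["low income", "average income", "high income"]

# right=True makes each interval closed on the right, like `<=`.
income_band = pd.cut(incomes, bins=bins, labels=names, right=True)

print(income_band.tolist())
```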


In [None]:
# Create a column with the categories

sp2['income_status'] = sp2['total_income'].apply(income_status)
sp2.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,category_age,total_income_imputation,days_employed_imputation,debt_category,income_status
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,25-64,40620.102,8437.673028,Tidak Gagal Bayar,high income
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,25-64,17932.802,4024.803754,Tidak Gagal Bayar,average income
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,25-64,23341.752,5623.42261,Tidak Gagal Bayar,average income
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,25-64,42820.568,4124.747207,Tidak Gagal Bayar,high income
4,0,2194.220567,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,25-64,25378.572,2194.220567,Tidak Gagal Bayar,high income
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,25-64,40922.17,926.185831,Tidak Gagal Bayar,high income
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,25-64,38484.156,2879.202052,Tidak Gagal Bayar,high income
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,25-64,21731.829,152.779569,Tidak Gagal Bayar,average income
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,25-64,15337.093,6929.865299,Tidak Gagal Bayar,low income
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,25-64,23108.15,2188.756445,Tidak Gagal Bayar,average income


In [None]:
# Count each category value to see the distribution
sp2['income_status'].value_counts()

high income       8603
average income    7281
low income        5569
Name: income_status, dtype: int64

In [None]:
# Categorize 'purpose' first

sp2.loc[sp2['purpose'].isin(['having a wedding' , 'to have a wedding' , 'wedding ceremony']), 'purpose'] = 'wedding'
sp2.loc[sp2['purpose'].isin(['buying a second-hand car' , 'buying my own car' , 'car' , 'car purchase' , 'cars' , 'purchase of a car' , 'second-hand car purchase' , 'to buy a car' , 'to own a car']), 'purpose'] = 'car'
sp2.loc[sp2['purpose'].isin(['education' , 'getting an education' , 'getting higher education' , 'going to university' , 'profile education' , 'supplementary education' , 'to become educated' , 'to get a supplementary education' , 'university education']), 'purpose'] = 'education'
sp2.loc[sp2['purpose'].isin(['housing' , 'housing renovation' , 'housing transactions' , 'purchase of my own house' , 'purchase of the house' , 'purchase of the house for my family']), 'purpose'] = 'house'
sp2.loc[sp2['purpose'].isin(['building a property' , 'building a real estate' , 'buy commercial real estate' , 'buy real estate' , 'buy residential real estate' , 'buying property for renting out' , 'construction of own property' , 'property' , 'real estate transactions' , 'transactions with commercial real estate' , 'transactions with my real estate']), 'purpose'] = 'property and real estate'

print(sp2['purpose'].unique())
print(sp2['purpose'].value_counts())

['house' 'car' 'education' 'wedding' 'property and real estate']
property and real estate    7002
car                         4305
education                   4013
house                       3809
wedding                     2324
Name: purpose, dtype: int64
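
The explicit lists above could also be written as keyword rules, which would pick up new spellings of the same purpose. The following is only a sketch: the function name `purpose_group` and the `'other'` fallback are assumptions, and the keywords were chosen to reproduce the five groups used above.

```python
def purpose_group(purpose):
    """Map a raw purpose string to one of the five groups by keyword."""
    words = purpose.split()
    if "wedding" in words:
        return "wedding"
    # Match whole words so that 'car' is not found inside other words.
    if "car" in words or "cars" in words:
        return "car"
    if "educat" in purpose or "university" in purpose:
        return "education"
    # 'hous' covers both 'house' and 'housing'.
    if "hous" in purpose:
        return "house"
    if "property" in purpose or "real estate" in purpose:
        return "property and real estate"
    # Assumed fallback for strings that match no rule.
    return "other"


samples = [
    "to have a wedding",
    "buying a second-hand car",
    "going to university",
    "purchase of the house for my family",
    "transactions with commercial real estate",
]
for text in samples:
    print(text, "->", purpose_group(text))
```

On the raw column this would be applied as `sp2['purpose'].apply(purpose_group)`, before any of the `.loc` replacements are run.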


## Checking the Hypotheses


**Is there a correlation between having children and repaying on time?** Yes. The expectation is that the more children a borrower has, the higher the default rate, because household expenses rise while income may not be enough to repay an earlier loan.

In [None]:
# Inspect the children data alongside on-time repayment
sp2[['children', 'debt', 'debt_category']]

Unnamed: 0,children,debt,debt_category
0,1,0,Tidak Gagal Bayar
1,1,0,Tidak Gagal Bayar
2,0,0,Tidak Gagal Bayar
3,3,0,Tidak Gagal Bayar
4,0,0,Tidak Gagal Bayar
...,...,...,...
21448,1,0,Tidak Gagal Bayar
21449,0,0,Tidak Gagal Bayar
21450,1,1,Gagal Bayar
21451,3,1,Gagal Bayar


In [None]:
# Compute the default rate by number of children
children_and_debt = sp2.pivot_table(index='children', columns= 'debt', values= 'debt_category', aggfunc= 'count')
children_and_debt['persen'] = children_and_debt[1]/ (children_and_debt[0] + children_and_debt[1]) * 100
children_and_debt

debt,0,1,persen
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,13073.0,1064.0,7.526349
1,4364.0,444.0,9.234609
2,1926.0,202.0,9.492481
3,303.0,27.0,8.181818
4,37.0,4.0,9.756098
5,9.0,,
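
The same rate can be computed with a small reusable helper, sketched below on invented data. The name `default_rate_by` and the toy rows are assumptions for illustration. Because it counts clients and sums the 0/1 `debt` flag per group, a group with no defaults gets a rate of 0 instead of `NaN`.

```python
import pandas as pd


def default_rate_by(df, column):
    """Return clients, defaults, and default rate (percent) per group."""
    out = df.groupby(column)["debt"].agg(clients="count", defaults="sum")
    out["persen"] = out["defaults"] / out["clients"] * 100
    return out


# Invented toy rows; the group with 5 children has no defaults.
toy = pd.DataFrame({
    "children": [0, 0, 0, 1, 1, 5, 5],
    "debt":     [0, 0, 1, 1, 0, 0, 0],
})

rates = default_rate_by(toy, "children")
print(rates)
```

The same helper could be called with `'family_status'`, `'income_status'`, or `'purpose'` in place of `'children'`.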


**Conclusion**

**In the `children` column, clients with no children have the lowest default rate (7.53%), while clients with 1 or 2 children default more often (9.23% and 9.49%). One interesting case is clients with 3 children, whose default rate (8.18%) is lower than that of clients with 1 or 2 children. One speculation is that these clients fall in the `high income` group and that their children may already be adults with their own jobs. The link between having 3 children and occupation, income, and income status needs further checking. Another interesting case is clients with 5 children: the `debt` = 1 cell is empty (`NaN`) because none of the 9 clients in this group defaulted, so the rate is not computed, and the group is too small to support a conclusion. It may be that these clients are in the `high income` group and have never defaulted, but this is only an assumption.**
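
To judge how much weight the small groups can carry, one option is a confidence interval for each rate. The sketch below uses the Wilson score interval, which is a technique swapped in here for illustration and not part of the original analysis. The counts are taken from the table above (4 defaults among 41 clients with 4 children, 0 among 9 clients with 5 children); the function name and the 95% level (`z=1.96`) are assumptions.

```python
import math


def wilson_interval(defaults, clients, z=1.96):
    """Wilson score interval for a proportion, clamped to [0, 1]."""
    p = defaults / clients
    denom = 1 + z**2 / clients
    center = (p + z**2 / (2 * clients)) / denom
    half = z * math.sqrt(p * (1 - p) / clients + z**2 / (4 * clients**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts read from the pivot table above.
low4, high4 = wilson_interval(4, 41)   # clients with 4 children
low5, high5 = wilson_interval(0, 9)    # clients with 5 children

print(f"4 children: {low4:.1%} to {high4:.1%}")
print(f"5 children: {low5:.1%} to {high5:.1%}")
```

Both intervals are wide (roughly 4% to 23%, and 0% to roughly 30%), so neither group can be reliably ranked against the larger ones.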

**Is there a correlation between family status and repaying on time?** Yes. A borrower who lives with other people has greater expenses than a borrower who lives alone.

In [None]:
# Inspect the family status data alongside on-time repayment
sp2[['family_status', 'family_status_id', 'debt']]

Unnamed: 0,family_status,family_status_id,debt
0,married,0,0
1,married,0,0
2,married,0,0
3,married,0,0
4,civil partnership,1,0
...,...,...,...
21448,civil partnership,1,0
21449,married,0,0
21450,civil partnership,1,1
21451,married,0,1


In [None]:
# Compute the default rate by family status
family_and_debt = sp2.pivot_table(index='family_status', columns= 'debt', values= 'debt_category', aggfunc= 'count')
family_and_debt['persen'] = family_and_debt[1]/ (family_and_debt[0] + family_and_debt[1]) * 100
family_and_debt

debt,0,1,persen
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
civil partnership,3763,388,9.347145
divorced,2005,148,6.874129
married,11408,931,7.545182
unmarried,2536,274,9.75089
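
To rank the groups directly, `pd.crosstab` with `normalize='index'` gives row percentages in one call, and a missing combination shows up as 0 rather than `NaN`. This is a sketch on invented rows, not the dataset itself.

```python
import pandas as pd

# Invented toy rows for illustration.
toy = pd.DataFrame({
    "family_status": ["married", "married", "unmarried", "unmarried", "divorced"],
    "debt":          [0, 1, 1, 1, 0],
})

# Raw counts per status and debt flag (absent pairs are filled with 0).
counts = pd.crosstab(toy["family_status"], toy["debt"])

# Row percentages: each row sums to 100.
share = pd.crosstab(toy["family_status"], toy["debt"], normalize="index") * 100

# Default rate per status, highest first.
ranked = share[1].sort_values(ascending=False)
print(ranked)
```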


**Conclusion**

**In the `family_status` column, divorced clients have the lowest default rate (6.87%), followed by married clients (7.55%), while clients in a civil partnership default more often (9.35%). The notable case is `unmarried`, which has the highest default rate (9.75%). One possible explanation, offered only as speculation, is that these clients manage their finances and debt poorly.**


**Is there a correlation between income level and repaying on time?** Yes. The lower a borrower's income status relative to others, the more defaults are expected.

In [None]:
# Inspect the income level data alongside on-time repayment
sp2[['income_status', 'debt']]

Unnamed: 0,income_status,debt
0,high income,0
1,average income,0
2,average income,0
3,high income,0
4,high income,0
...,...,...
21448,high income,0
21449,average income,0
21450,low income,1
21451,high income,1


In [None]:
# Compute the default rate by income level
income_and_debt = sp2.pivot_table(index='income_status', columns= 'debt', values= 'debt_category', aggfunc= 'count')
income_and_debt['persen'] = income_and_debt[1]/ (income_and_debt[0] + income_and_debt[1]) * 100
income_and_debt

debt,0,1,persen
income_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
average income,6661,620,8.515314
high income,7928,675,7.8461
low income,5123,446,8.008619


**Conclusion**

**In the `income_status` column, the `high income` group has the lowest default rate (7.85%), which fits the expectation that lower income goes with more defaults. The exception is that `average income` (8.52%) defaults more often than `low income` (8.01%). One speculation is that low-income clients manage their finances better than average-income clients, or that they apply for smaller loans they can afford to repay, or that they have a carefully prepared financial plan with a very clear purpose.**

**Alternatively, average-income clients may have faced unexpected events that led to a default on an earlier loan, or may have had poor financial management and planning. As income grows, so does the number of things a person wants and needs.**
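
Since the ordering of `low income` and `average income` may depend on where the thresholds were drawn, one way to check it is to regroup by quantiles and compare. The sketch below uses `pd.qcut` to build three equal-sized groups. The toy incomes, the column name `income_tercile`, and the choice of three groups are all assumptions for illustration.

```python
import pandas as pd

# Invented toy rows for illustration.
toy = pd.DataFrame({
    "total_income": [9000, 15000, 21000, 24000, 32000, 50000],
    "debt":         [1, 1, 0, 1, 0, 0],
})

# Split incomes into three groups with (roughly) equal numbers of clients.
toy["income_tercile"] = pd.qcut(
    toy["total_income"], q=3, labels=["low", "middle", "high"]
)

# Mean of a 0/1 flag is the share of defaults; scale to percent.
rate = toy.groupby("income_tercile", observed=True)["debt"].mean() * 100
print(rate)
```

If the ordering of the groups changes under quantile bins, the pattern is probably sensitive to the chosen cut-offs.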

**How does the loan purpose affect the default rate?** Loans for primary needs and for business are expected to default less often than loans for secondary needs.

In [None]:
# Compute the default rate for each loan purpose and analyze it
purpose_and_debt = sp2.pivot_table(index='purpose', columns= 'debt', values= 'debt_category', aggfunc= 'count')
purpose_and_debt['persen'] = purpose_and_debt[1]/ (purpose_and_debt[0] + purpose_and_debt[1]) * 100
purpose_and_debt

debt,0,1,persen
purpose,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
car,3902,403,9.361208
education,3643,370,9.220035
house,3553,256,6.720924
property and real estate,6476,526,7.512139
wedding,2138,186,8.003442


**Conclusion**

**In the `purpose` column, `house` is a primary need, while `property and real estate` is treated as a business need. These two purposes have the lowest default rates (6.72% and 7.51%), which suggests that these clients have mature plans and goals and manage their finances and loans well enough to avoid default from either internal or external causes.**

**For the `car` purpose, which has the highest default rate (9.36%), the clients may manage their finances and loans poorly.**

**The `education` purpose also has a high default rate (9.22%). One speculation is that these clients ran into something unexpected on an earlier loan, so they study to move up a level, hoping that the education leads to a better-paid job that lets them repay the defaulted loan.**

# General Conclusion

[Write your conclusions in this final section. Make sure you include every important conclusion you reached about how you processed and analyzed the data: how you handled missing values and duplicates, and the possible causes of and solutions for the problematic legacy data you had to deal with.]

[Write your conclusions about the questions you set out to answer here as well.]


Conclusions:
1. Clients with secondary education are the largest group of borrowers.
2. Clients with no children are the largest group of borrowers.
3. Clients with the status married are the largest group of borrowers.
4. Female clients are the largest group of borrowers.
5. Clients whose income type is employee are the largest group of borrowers.
6. Clients in the 25-64 age category are the largest group of borrowers.
7. Clients who have not defaulted are the largest group of borrowers.
8. Clients in the `high income` group are the largest group of borrowers.
9. Clients borrowing for property and real estate are the largest group of borrowers.
10. Clients with 4 children have the highest default rate (9.76%, based on only 41 clients).
11. Unmarried clients have the highest default rate among the family statuses.
12. Clients in the `average income` group have the highest default rate among the income groups.
13. Clients borrowing to buy a car have the highest default rate among the loan purposes.