# Feature Extraction
Feature extraction is the process of deriving new features from existing ones to enrich the information available to a machine learning model. The goal is to capture hidden patterns or relationships between variables that are not directly visible in the raw data. Common techniques include:
* Computing ratios between numeric features
* Applying logical or mathematical transformations to a feature
* Adding or subtracting columns
* Modifying and re-encoding categorical variables

The resulting features are more meaningful and better suited to the problem context, which can help improve both the performance and the interpretability of a predictive model.
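
As a quick illustration of the four techniques listed above, here is a minimal sketch on a hypothetical toy DataFrame. The column names and values are assumptions made up for this example and are not part of the datasets used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "rooms": [6.0, 8.0, 4.0],
    "bedrooms": [2.0, 3.0, 1.0],
    "income": [3.0, 5.0, 2.0],
    "region": ["north", "south", "north"],
})

# 1. Ratio between two numeric features.
toy["bedroom_ratio"] = toy["bedrooms"] / toy["rooms"]
# 2. Mathematical transformation of a single feature.
toy["log_income"] = np.log(toy["income"])
# 3. Difference between two columns.
toy["other_rooms"] = toy["rooms"] - toy["bedrooms"]
# 4. Re-encoding a categorical variable as a boolean flag.
toy["is_north"] = toy["region"].eq("north")
```

Each new column is derived only from columns that already exist, which is the pattern the rest of this notebook follows.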

## Loading the California Housing Dataset
Load the California Housing training set from a local CSV file into a DataFrame for exploration and feature extraction.

In [49]:
import pandas as pd

In [50]:
dataset = pd.read_csv("C:/Users/LENOVO/Python/california_housing_train.csv")

In [51]:
dataset.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


## Counting Rooms Other Than Bedrooms
Subtract total_bedrooms from total_rooms to get the number of rooms that are not bedrooms in each row. The result is a count rather than a proportion.

In [52]:
other_rooms = dataset["total_rooms"] - dataset["total_bedrooms"]
other_rooms

0        4329.0
1        5749.0
2         546.0
3        1164.0
4        1128.0
          ...  
16995    1823.0
16996    1821.0
16997    2146.0
16998    2120.0
16999    1520.0
Length: 17000, dtype: float64

## Computing Population Density
Compute population density as the ratio of population to households, that is, the average number of people per household.

In [53]:
# Series.div() and the / operator compute the same element-wise ratio.
population_density = dataset["population"].div(dataset["households"])
density_alt = dataset["population"]/dataset["households"]
density_alt

0        2.150424
1        2.438445
2        2.846154
3        2.278761
4        2.381679
           ...   
16995    2.457995
16996    2.567742
16997    2.728070
16998    2.715481
16999    2.985185
Length: 17000, dtype: float64
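
The cell above computes the same ratio twice, once with div() and once with the / operator. The sketch below checks that the two agree on the first three rows shown earlier, and shows one possible way to handle a zero denominator. The zero-household row is a hypothetical case added for illustration; nothing in the output above suggests it occurs in this dataset.

```python
import numpy as np
import pandas as pd

# First three rows of population and households from the output above.
population = pd.Series([1015.0, 1129.0, 333.0])
households = pd.Series([472.0, 463.0, 117.0])

by_method = population.div(households)
by_operator = population / households
same_result = by_method.equals(by_operator)

# Hypothetical edge case: a row with zero households yields inf.
ratio_with_zero = pd.Series([10.0, 20.0]) / pd.Series([0.0, 4.0])
# Replace infinite values with NaN so they can be treated as missing.
safe_ratio = ratio_with_zero.replace([np.inf, -np.inf], np.nan)
```

Which form to use is mostly a matter of style; div() is convenient when chaining methods.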

## Adding the Density Feature to the Dataset
Add the density computed above to the dataset as a new column named population_density.

In [54]:
dataset["population_density"] = population_density
dataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,population_density
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,2.150424
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,2.381679
...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,2.457995
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,2.567742
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,2.728070
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,2.715481


## Alternative Way to Add the Density Feature
Alternatively, add the density column with the assign() method. It returns a new DataFrame, which is assigned back to dataset here.

In [55]:
dataset = dataset.assign(population_density = population_density)
dataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,population_density
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,2.150424
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,2.381679
...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,2.457995
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,2.567742
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,2.728070
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,2.715481
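
Because assign() returns a new DataFrame instead of modifying the existing one, it also accepts a callable that computes the column from the frame it is called on. The following is a minimal sketch on two rows taken from the output above; the variable names housing and with_density are assumptions for this example.

```python
import pandas as pd

# Two rows of population and households from the output above.
housing = pd.DataFrame({
    "population": [1015.0, 1129.0],
    "households": [472.0, 463.0],
})

# The lambda receives the DataFrame and returns the new column.
with_density = housing.assign(
    population_density=lambda df: df["population"] / df["households"]
)

# The original frame is left untouched.
original_columns = list(housing.columns)
```

This form is handy in method chains, where the intermediate DataFrame has no name of its own.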


## Creating the Price-to-Income Ratio Feature
Insert a new feature, price_income_ratio, as the third column (position index 2). It is the ratio of the median house value to the median income.

In [56]:
dataset.insert(2, "price_income_ratio", dataset["median_house_value"] / dataset["median_income"])
dataset

Unnamed: 0,longitude,latitude,price_income_ratio,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,population_density
0,-114.31,34.19,44791.108731,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,2.150424
1,-114.47,34.40,44010.989011,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,2.438445
2,-114.56,33.69,51911.078806,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,2.846154
3,-114.57,33.64,22997.148855,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,2.278761
4,-114.57,33.57,34025.974026,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,2.381679
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,47261.465360,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,2.457995
16996,-124.27,40.69,31375.352476,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,2.567742
16997,-124.30,41.84,34176.755847,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,2.728070
16998,-124.30,41.80,43339.899985,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,2.715481
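
Unlike assign(), insert() modifies the DataFrame in place and, by default, refuses to add a column whose name already exists. Running the cell above a second time would therefore raise an error. A minimal sketch using the first row of the output above (the variable names are assumptions for this example):

```python
import pandas as pd

# First row of the relevant columns from the output above.
housing = pd.DataFrame({
    "longitude": [-114.31],
    "latitude": [34.19],
    "median_income": [1.4936],
    "median_house_value": [66900.0],
})

# Position 2 places the new column third, after longitude and latitude.
housing.insert(
    2,
    "price_income_ratio",
    housing["median_house_value"] / housing["median_income"],
)

# Inserting the same name again raises ValueError by default.
try:
    housing.insert(2, "price_income_ratio", 0.0)
    second_insert_failed = False
except ValueError:
    second_insert_failed = True
```

When a cell may be re-run, one option is to check whether the column name is already in the frame's columns before inserting.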


## Computing Room Density
Measure the average number of rooms per household to see how many rooms each dwelling has.

In [57]:
room_density = dataset["total_rooms"] / dataset["households"]
room_density

0        11.889831
1        16.522678
2         6.153846
3         6.641593
4         5.549618
           ...    
16995     6.008130
16996     5.051613
16997     5.870614
16998     5.589958
16999     6.740741
Length: 17000, dtype: float64

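## Creating a Squared Room Density Feature
Create a feature equal to the square of room density so that a model can capture a possible non-linear relationship. Because the result of assign() is not assigned back to dataset, this cell only displays the new column; dataset itself is unchanged.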

In [58]:
dataset.assign(room_density_squared = room_density.pow(2))

Unnamed: 0,longitude,latitude,price_income_ratio,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,population_density,room_density_squared
0,-114.31,34.19,44791.108731,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,2.150424,141.368070
1,-114.47,34.40,44010.989011,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,2.438445,272.998894
2,-114.56,33.69,51911.078806,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,2.846154,37.869822
3,-114.57,33.64,22997.148855,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,2.278761,44.110757
4,-114.57,33.57,34025.974026,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,2.381679,30.798264
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,47261.465360,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,2.457995,36.097627
16996,-124.27,40.69,31375.352476,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,2.567742,25.518793
16997,-124.30,41.84,34176.755847,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,2.728070,34.464109
16998,-124.30,41.80,43339.899985,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,2.715481,31.247632
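
The cell above calls assign() without storing the result, so room_density_squared appears in the displayed output only. To keep the feature, the result has to be assigned back. A minimal sketch on two rows from the earlier output (the name housing is an assumption for this example):

```python
import pandas as pd

# Rows 2 and 3 of total_rooms and households from the earlier output.
housing = pd.DataFrame({
    "total_rooms": [720.0, 1501.0],
    "households": [117.0, 226.0],
})
room_density = housing["total_rooms"] / housing["households"]

# Without reassignment the new column exists only in the returned copy.
preview = housing.assign(room_density_squared=room_density.pow(2))
kept_without_reassignment = "room_density_squared" in housing.columns

# Reassigning keeps the feature; ** 2 is equivalent to .pow(2).
housing = housing.assign(room_density_squared=room_density ** 2)
```

Note that pow(2) and ** 2 give the same values, so the choice between them is stylistic.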


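## Loading the MPG Dataset from Seaborn
Load the MPG (miles per gallon) dataset from Seaborn to apply feature extraction to vehicle data. load_dataset() downloads the data from Seaborn's online example repository, so it needs an internet connection the first time it runs.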

In [59]:
import seaborn as sns

In [60]:
mpg_dataset = sns.load_dataset("mpg")

In [61]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [62]:
mpg_dataset["origin"].unique()

array(['usa', 'japan', 'europe'], dtype=object)

## Standardizing the Origin Label to America
Replace the origin label "usa" with "america" to standardize the category name.

In [63]:
mpg_dataset.loc[mpg_dataset["origin"] == "usa", "origin"] = "america"
mpg_dataset

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,america,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,america,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,america,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,america,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,america,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,america,ford mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup
395,32.0,4,135.0,84.0,2295,11.6,82,america,dodge rampage
396,28.0,4,120.0,79.0,2625,18.6,82,america,ford ranger


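## Adding a New Column for the Country Label
Create a new column, new_origin, meant to hold the renamed origin without replacing the original column. Because the previous cell already changed "usa" to "america" in origin, the condition origin == "usa" no longer matches any row, so new_origin is created but left empty, as the output shows.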

In [64]:
mpg_dataset.loc[mpg_dataset["origin"] == "usa", "new_origin"] = "america"
mpg_dataset

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,new_origin
0,18.0,8,307.0,130.0,3504,12.0,70,america,chevrolet chevelle malibu,
1,15.0,8,350.0,165.0,3693,11.5,70,america,buick skylark 320,
2,18.0,8,318.0,150.0,3436,11.0,70,america,plymouth satellite,
3,16.0,8,304.0,150.0,3433,12.0,70,america,amc rebel sst,
4,17.0,8,302.0,140.0,3449,10.5,70,america,ford torino,
...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,america,ford mustang gl,
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,
395,32.0,4,135.0,84.0,2295,11.6,82,america,dodge rampage,
396,28.0,4,120.0,79.0,2625,18.6,82,america,ford ranger,
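
One way to get a populated new_origin while leaving origin untouched is to derive it from the original labels with replace(), instead of filtering on a label that may already have been overwritten. This is a sketch on hypothetical toy rows, not the notebook's own cell:

```python
import pandas as pd

# Hypothetical toy rows with the original, unmodified labels.
cars = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

# replace() returns a new Series; labels not listed pass through unchanged.
cars["new_origin"] = cars["origin"].replace({"usa": "america"})
```

Here origin keeps "usa" and new_origin holds "america" for the same rows, so both versions of the label stay available.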


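## Transforming the Origin Label into a New Origin Label
Continue the category transformation by replacing "japan" with "asia" to simplify region-level analysis. The replaced values are stored in new_origin, while origin still holds "japan". The query below filters on origin, so it returns no rows.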

In [65]:
mpg_dataset["new_origin"] = mpg_dataset["origin"].replace("japan", "asia")
mpg_dataset.query("origin == 'asia'")

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,new_origin
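
The empty result above comes from querying origin, which was never changed. Filtering on new_origin returns the replaced rows instead. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows mirroring the labels at this point in the notebook.
cars = pd.DataFrame({"origin": ["america", "japan", "europe"]})
cars["new_origin"] = cars["origin"].replace("japan", "asia")

# origin still contains "japan", so this selects nothing.
on_origin = cars.query("origin == 'asia'")
# new_origin contains the replaced label.
on_new_origin = cars.query("new_origin == 'asia'")
```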


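## Remapping Region Labels Back to Country of Origin
Map the origin categories back to their initial form with a dictionary. map() returns NaN for any value that is not a key of the dictionary. Since origin still contains "japan" rather than "asia", those rows get NaN in new_origin.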

In [66]:
mpg_dataset["new_origin"] = mpg_dataset["origin"].map({"america": "usa",
                                                      "asia" : "japan",
                                                      "europe" : "europe"
})
mpg_dataset

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,new_origin
0,18.0,8,307.0,130.0,3504,12.0,70,america,chevrolet chevelle malibu,usa
1,15.0,8,350.0,165.0,3693,11.5,70,america,buick skylark 320,usa
2,18.0,8,318.0,150.0,3436,11.0,70,america,plymouth satellite,usa
3,16.0,8,304.0,150.0,3433,12.0,70,america,amc rebel sst,usa
4,17.0,8,302.0,140.0,3449,10.5,70,america,ford torino,usa
...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,america,ford mustang gl,usa
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,europe
395,32.0,4,135.0,84.0,2295,11.6,82,america,dodge rampage,usa
396,28.0,4,120.0,79.0,2625,18.6,82,america,ford ranger,usa
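
Since map() silently produces NaN for values missing from the dictionary, it can help to check which labels were not covered. The sketch below uses the same dictionary as the cell above on hypothetical toy labels, and shows one possible fallback that keeps the original label when no mapping exists:

```python
import pandas as pd

# Toy labels matching what origin contains at this point.
origin = pd.Series(["america", "japan", "europe"])
mapping = {"america": "usa", "asia": "japan", "europe": "europe"}

mapped = origin.map(mapping)

# Labels that had no key in the dictionary.
unmapped_labels = origin[mapped.isna()].unique().tolist()

# Possible fallback: keep the original label where the mapping is missing.
mapped_with_fallback = mapped.fillna(origin)
```

Whether a fallback is appropriate depends on the analysis; leaving NaN in place makes the gap visible instead.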


## Renaming Columns to Be More Descriptive
Rename columns so they are more descriptive and easier to understand, for example changing "origin" to "country". The mpg column is renamed in the same call.

In [67]:
mpg_dataset = mpg_dataset.rename(columns = {"origin": "country",'mpg':'miles per galon'})
mpg_dataset

Unnamed: 0,miles per galon,cylinders,displacement,horsepower,weight,acceleration,model_year,country,name,new_origin
0,18.0,8,307.0,130.0,3504,12.0,70,america,chevrolet chevelle malibu,usa
1,15.0,8,350.0,165.0,3693,11.5,70,america,buick skylark 320,usa
2,18.0,8,318.0,150.0,3436,11.0,70,america,plymouth satellite,usa
3,16.0,8,304.0,150.0,3433,12.0,70,america,amc rebel sst,usa
4,17.0,8,302.0,140.0,3449,10.5,70,america,ford torino,usa
...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,america,ford mustang gl,usa
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,europe
395,32.0,4,135.0,84.0,2295,11.6,82,america,dodge rampage,usa
396,28.0,4,120.0,79.0,2625,18.6,82,america,ford ranger,usa
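
By default rename() ignores dictionary keys that do not match any column, so a misspelled column name goes unnoticed. Passing errors="raise" surfaces the problem. A minimal sketch on a hypothetical one-row frame; the misspelled key "orign" is deliberate:

```python
import pandas as pd

cars = pd.DataFrame({"mpg": [18.0], "origin": ["usa"]})

# A misspelled key is silently ignored by default.
unchanged = cars.rename(columns={"orign": "country"})

# With errors="raise" the same call fails with KeyError.
try:
    cars.rename(columns={"orign": "country"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

renamed = cars.rename(columns={"origin": "country"}, errors="raise")
```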


## Dropping Irrelevant Columns
Drop columns that are considered irrelevant or unnecessary for the analysis that follows.

In [68]:
mpg_dataset = mpg_dataset.drop(columns = ['miles per galon','model_year'])
mpg_dataset

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,country,name,new_origin
0,8,307.0,130.0,3504,12.0,america,chevrolet chevelle malibu,usa
1,8,350.0,165.0,3693,11.5,america,buick skylark 320,usa
2,8,318.0,150.0,3436,11.0,america,plymouth satellite,usa
3,8,304.0,150.0,3433,12.0,america,amc rebel sst,usa
4,8,302.0,140.0,3449,10.5,america,ford torino,usa
...,...,...,...,...,...,...,...,...
393,4,140.0,86.0,2790,15.6,america,ford mustang gl,usa
394,4,97.0,52.0,2130,24.6,europe,vw pickup,europe
395,4,135.0,84.0,2295,11.6,america,dodge rampage,usa
396,4,120.0,79.0,2625,18.6,america,ford ranger,usa


## Dropping the Temporary Column
Drop the new_origin column, which was only used for the temporary label manipulation above. With inplace = True, drop() modifies mpg_dataset directly and returns None.

In [69]:
mpg_dataset.drop(columns = "new_origin", inplace = True)

In [70]:
mpg_dataset

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,country,name
0,8,307.0,130.0,3504,12.0,america,chevrolet chevelle malibu
1,8,350.0,165.0,3693,11.5,america,buick skylark 320
2,8,318.0,150.0,3436,11.0,america,plymouth satellite
3,8,304.0,150.0,3433,12.0,america,amc rebel sst
4,8,302.0,140.0,3449,10.5,america,ford torino
...,...,...,...,...,...,...,...
393,4,140.0,86.0,2790,15.6,america,ford mustang gl
394,4,97.0,52.0,2130,24.6,europe,vw pickup
395,4,135.0,84.0,2295,11.6,america,dodge rampage
396,4,120.0,79.0,2625,18.6,america,ford ranger
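
The two drop() cells above use different styles: one reassigns the returned DataFrame, the other uses inplace = True. The sketch below contrasts them on hypothetical toy rows and shows the errors="ignore" option, which avoids a KeyError when the column is already gone (for example when a cell is re-run):

```python
import pandas as pd

cars = pd.DataFrame({
    "cylinders": [8, 4],
    "country": ["america", "europe"],
    "new_origin": ["usa", "europe"],
})

# Style 1: drop() returns a new DataFrame; cars is unchanged.
trimmed = cars.drop(columns=["new_origin"])

# Style 2: inplace=True changes cars itself and returns None.
result = cars.drop(columns="new_origin", inplace=True)

# The column is gone now; errors="ignore" makes a repeat drop harmless.
dropped_again = cars.drop(columns="new_origin", errors="ignore")
```

Mixing both styles works, but writing df = df.drop(..., inplace=True) would set df to None, so the two should not be combined in one statement.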


## Filtering Vehicles by Country of Origin
Filter the vehicles that originate from Japan to focus the analysis on a subset of the data by country of origin.

In [71]:
mpg_dataset[mpg_dataset["country"] == "japan"]

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,country,name
14,4,113.0,95.0,2372,15.0,japan,toyota corona mark ii
18,4,97.0,88.0,2130,14.5,japan,datsun pl510
29,4,97.0,88.0,2130,14.5,japan,datsun pl510
31,4,113.0,95.0,2228,14.0,japan,toyota corona
53,4,71.0,65.0,1773,19.0,japan,toyota corolla 1200
...,...,...,...,...,...,...,...
382,4,108.0,70.0,2245,16.9,japan,toyota corolla
383,4,91.0,67.0,1965,15.0,japan,honda civic
384,4,91.0,67.0,1965,15.7,japan,honda civic (auto)
385,4,91.0,67.0,1995,16.2,japan,datsun 310 gx
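
The boolean-mask filter above can also be written with query(), and isin() extends the idea to several countries at once. A minimal sketch on a few hypothetical rows (names borrowed from the outputs above):

```python
import pandas as pd

cars = pd.DataFrame({
    "country": ["america", "japan", "europe", "japan"],
    "name": ["ford torino", "toyota corona", "vw pickup", "honda civic"],
})

# Boolean mask, as in the cell above.
by_mask = cars[cars["country"] == "japan"]
# Equivalent filter written as a query string.
by_query = cars.query("country == 'japan'")
# Selecting more than one country with isin().
japan_or_europe = cars[cars["country"].isin(["japan", "europe"])]
```

All three return a new DataFrame that keeps the original row index, which is why the index values in the output above are not consecutive.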
