# Principal Component Analysis (PCA)

Pattern recognition usually needs a feature selection or feature extraction step. Raw datasets often contain many features, which costs memory and computation, so we want an algorithm that reduces the dimensionality of the dataset. Two commonly used *feature extraction* techniques are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). PCA is an *unsupervised learning* technique: it works from the feature values alone, and the user supplies no labels or *training* targets. The PCA algorithm is as follows:

1.   Define the dataset
2.   Standardize the dataset
3.   Compute the covariance matrix $\mathbf{C}$
4.   Compute the eigenvalues and eigenvectors of $\mathbf{C}$
5.   Choose a *threshold* that fixes the number $k$ of principal components
6.   Project the dataset onto the eigenvectors of the $k$ principal components.
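
The six steps above can be sketched as one small NumPy function. This is a minimal illustration, not the notebook's exact code: the name `pca_sketch` is hypothetical, and it assumes per-feature standardization with the sample standard deviation (`ddof=1`) and uses `np.linalg.eigh` because the covariance matrix is symmetric.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (rows = samples, columns = features) onto its first k principal components."""
    X = np.asarray(X, dtype=float)
    # Step 2: standardize each feature (column) to zero mean and unit sample variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 3: covariance matrix of the standardized data
    C = Z.T @ Z / (Z.shape[0] - 1)
    # Step 4: eigen-decomposition of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort from the largest to the smallest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5 and 6: keep the first k eigenvectors and project the data onto them
    W = eigvecs[:, :k]
    return Z @ W, eigvals

# Step 1: the four samples from the table (three features each)
X_demo = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]])
scores, eigvals = pca_sketch(X_demo, k=2)
print(scores.shape)  # (4, 2)
```

With this scaling the eigenvalues sum to the number of features, which makes them easy to read as shares of the total variance.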

Suppose we have the following data:
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-c3ow">No.</th>
    <th class="tg-c3ow">Feature 1</th>
    <th class="tg-c3ow">Feature 2</th>
    <th class="tg-c3ow">Feature 3</th>
    <th class="tg-c3ow">Class</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">10</td>
    <td class="tg-c3ow">7</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-0pky">Rice</td>
  </tr>
  <tr>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">6</td>
    <td class="tg-c3ow">5</td>
    <td class="tg-c3ow">4</td>
    <td class="tg-0pky">Corn</td>
  </tr>
  <tr>
    <td class="tg-c3ow">3</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">3</td>
    <td class="tg-c3ow">6</td>
    <td class="tg-0pky">Wheat</td>
  </tr>
  <tr>
    <td class="tg-baqh">4</td>
    <td class="tg-baqh">8</td>
    <td class="tg-baqh">0</td>
    <td class="tg-baqh">10</td>
    <td class="tg-0lax">Soybean</td>
  </tr>
</tbody>
</table>

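Step 1: Define the dataset

Here $X$ holds the three feature columns of the table; the class column is not used. The column labels `Fitur 1` to `Fitur 3` in the code correspond to Feature 1 to Feature 3.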

In [None]:
import numpy as np
import pandas as pd
X = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]])
df=pd.DataFrame(X, columns=['Fitur 1', 'Fitur 2', 'Fitur 3'])
display(df)

Unnamed: 0,Fitur 1,Fitur 2,Fitur 3
0,10,7,2
1,6,5,4
2,2,3,6
3,8,0,10


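Step 2: Standardize the dataset

The dataset is standardized with $$\frac{X-\bar{X}}{S}$$ where $\bar{X}$ is the mean and $S$ is the standard deviation. Note that the cell below calls `np.mean` and `np.std` without an `axis` argument, so $\bar{X}$ and $S$ are single numbers computed over all entries of $X$. Standardizing each feature separately would pass `axis=0` to both calls.

In [None]:
# Without an axis argument, both reductions run over every entry of X
X_mean = np.mean(X)
X_std = np.std(X)
X_norm = (X - X_mean) / X_std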
df_norm = pd.DataFrame(X_norm, columns=['Fitur 1', 'Fitur 2', 'Fitur 3'])
display(df_norm)

Unnamed: 0,Fitur 1,Fitur 2,Fitur 3
0,1.55307,0.572184,-1.062627
1,0.245222,-0.081741,-0.408703
2,-1.062627,-0.735665,0.245222
3,0.899146,-1.716551,1.55307
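
For comparison, here is a minimal sketch of per-feature standardization, where each column gets its own mean and standard deviation. The names `col_mean`, `col_std`, and `X_col` are illustrative, and the population standard deviation (`ddof=0`, NumPy's default) is an assumption. The numbers it produces differ from the table above, which uses one global mean and one global standard deviation.

```python
import numpy as np

X = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]], dtype=float)

# axis=0 reduces over rows, giving one mean and one standard deviation per feature
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)          # population standard deviation (ddof=0)
X_col = (X - col_mean) / col_std

print(col_mean)                  # per-feature means
print(X_col.mean(axis=0))        # approximately 0 for every column
print(X_col.std(axis=0))         # 1 for every column
```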


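Step 3: Compute the covariance matrix

The covariance matrix is computed from the standardized data $X_{\text{norm}}$ with $$\mathbf{C} = \frac{1}{N-1}\left(X_{\text{norm}}^T X_{\text{norm}}\right)$$ where $N$ is the number of rows (samples). This expression is a covariance matrix in the strict sense only when every column of $X_{\text{norm}}$ has zero mean. Because Step 2 subtracts one global mean, the columns here are not exactly centered.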

In [None]:
N = X.shape[0]
C = np.dot(X_norm.T, X_norm) / (N - 1)
df_C = pd.DataFrame(C)
display(df_C)

Unnamed: 0,0,1,2
0,1.469933,0.035635,-0.2049
1,0.035635,1.273942,-1.140312
2,-0.2049,-1.140312,1.256125
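
As a sanity check on the formula, the sketch below centers each column and compares $X^T X/(N-1)$ with `np.cov`. It works on the raw table values rather than on `X_norm`, and the names `Xc`, `C_manual`, and `C_numpy` are illustrative.

```python
import numpy as np

X = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]], dtype=float)
N = X.shape[0]

# Center each column so that every feature has zero mean
Xc = X - X.mean(axis=0)

# Covariance from the formula; valid because the columns are centered
C_manual = Xc.T @ Xc / (N - 1)

# np.cov centers the data itself; rowvar=False treats columns as variables
C_numpy = np.cov(X, rowvar=False)

print(np.allclose(C_manual, C_numpy))  # True
```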


Step 4: Compute the eigenvalues and eigenvectors of $\mathbf{C}$


In [None]:
# Columns of `vektor` are the eigenvectors; `nilai` holds the matching eigenvalues
nilai, vektor = np.linalg.eig(C)
print("Eigenvalues, λ = ")
print(nilai)
print("\nEigenvectors, v = ")
print(vektor)

# Sort the eigenvalues in descending order and reorder the eigenvector columns to match
indeks_nilai = np.argsort(nilai)[::-1]
nilai_urut = nilai[indeks_nilai]
vektor_urut = vektor[:, indeks_nilai]

print("\nSorted eigenvalues, λ = ")
print(nilai_urut)
print("\nSorted eigenvectors, v = ")
print(vektor_urut)

Eigenvalues, λ = 
[2.43537376 1.45072088 0.11390536]

Eigenvectors, v = 
[[ 0.17412575  0.98067293  0.08922343]
 [ 0.69260165 -0.18637513  0.69682657]
 [-0.69998799  0.05953915  0.7116684 ]]

Sorted eigenvalues, λ = 
[2.43537376 1.45072088 0.11390536]

Sorted eigenvectors, v = 
[[ 0.17412575  0.98067293  0.08922343]
 [ 0.69260165 -0.18637513  0.69682657]
 [-0.69998799  0.05953915  0.7116684 ]]
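
Since $\mathbf{C}$ is symmetric, `np.linalg.eigh` is a natural alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order. The sketch below rebuilds $\mathbf{C}$ with the same scaling as Steps 2 and 3 and checks the defining relation $\mathbf{C}v=\lambda v$. The names `w` and `V` are illustrative, and the signs of the eigenvectors may differ from the printout above.

```python
import numpy as np

X = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]], dtype=float)
X_norm = (X - np.mean(X)) / np.std(X)            # same global scaling as Step 2
C = X_norm.T @ X_norm / (X.shape[0] - 1)         # same matrix as Step 3

# eigh is intended for symmetric matrices and returns ascending eigenvalues
w, V = np.linalg.eigh(C)
w, V = w[::-1], V[:, ::-1]                       # reorder to descending

# Each column v of V should satisfy C v = lambda v
print(np.allclose(C @ V, V * w))                 # True
# The eigenvectors form an orthonormal basis
print(np.allclose(V.T @ V, np.eye(3)))           # True
```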


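Step 5: Choose the *threshold* that fixes the number of principal components

The number of components $k$ is the largest integer with $$\frac{k}{N}\le t,$$ that is, $k=\lfloor tN\rfloor$. In this cell $N$ is the number of rows (4), so $t=0.6$ gives $k=2$. The later sections apply the same rule to the number of features instead.

In [None]:
t = 0.6
k = int(np.floor(t*N))
# Keep the first k columns of the sorted eigenvector matrix
v = vektor_urut[:,0:k]
print("Principal component matrix, v = ")
df_v = pd.DataFrame(v, columns=['KU_'+str(i) for i in range(1,k+1)])
display(df_v)

Principal component matrix, v = 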


Unnamed: 0,KU_1,KU_2
0,0.174126,0.980673
1,0.692602,-0.186375
2,-0.699988,0.059539
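
Another common way to pick $k$, shown here only as a hedged alternative to the rule above, is to keep enough components to reach a target share of the total variance. The sketch uses the sorted eigenvalues printed in Step 4; the target of 0.95 and the names `ratio`, `cumulative`, and `k_var` are arbitrary choices for illustration.

```python
import numpy as np

# Sorted eigenvalues as printed in Step 4
eigvals = np.array([2.43537376, 1.45072088, 0.11390536])

# Share of the total variance carried by each component, and the running total
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches the target
target = 0.95
k_var = int(np.searchsorted(cumulative, target) + 1)

print(np.round(ratio, 3))
print(np.round(cumulative, 3))
print(k_var)  # 2
```

For this small example the first two components already carry about 97% of the total, so this criterion also gives two components.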


Step 6: Project the dataset onto the eigenvectors of the principal components

In [None]:
# Each row of X_pca holds the coordinates of one sample in the k-dimensional component space
X_pca = np.dot(X_norm, v)
print("The dataset reduced to {} principal components is = ".format(k))
df_X_pca = pd.DataFrame(X_pca, columns=['KU_'+str(i) for i in range(1,k+1)])
display(df_X_pca)

The dataset reduced to 2 principal components is = 


Unnamed: 0,KU_1,KU_2
0,1.410551,1.353145
1,0.272173,0.231383
2,-0.866206,-0.89038
3,-2.119452,1.294159
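
To see what the projection discards, the sketch below maps the reduced data back to three dimensions and measures the reconstruction error. It repeats Steps 2 to 6 with `np.linalg.eigh`, so eigenvector signs may differ from the table above. The names `X_back` and `err` are illustrative.

```python
import numpy as np

X = np.array([[10, 7, 2], [6, 5, 4], [2, 3, 6], [8, 0, 10]], dtype=float)
N = X.shape[0]
X_norm = (X - np.mean(X)) / np.std(X)        # same global scaling as Step 2
C = X_norm.T @ X_norm / (N - 1)

w, V = np.linalg.eigh(C)                     # ascending eigenvalues
v = V[:, ::-1][:, :2]                        # two leading eigenvectors

X_pca = X_norm @ v                           # projection (Step 6)
X_back = X_pca @ v.T                         # map back to the original 3 dimensions

# The squared error equals (N - 1) times the discarded eigenvalue
err = np.sum((X_norm - X_back) ** 2)
print(np.isclose(err, (N - 1) * w.min()))    # True
```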


# PCA for large datasets

For large datasets, for example those from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/) or Kaggle (https://www.kaggle.com/datasets), the PCA procedure shrinks to four steps because Python packages do most of the work: pandas for data representation and scikit-learn for PCA. The steps are:


1.   Load the dataset
2.   Standardize the dataset
3.   Choose the *threshold*
4.   Run PCA with sklearn



# UC Irvine Machine Learning Repository

Step 1: Load the dataset

Suppose we want the *Breast Cancer Wisconsin (Diagnostic)* dataset from UC Irvine, https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

In [None]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
print(breast_cancer_wisconsin_diagnostic.metadata)

# variable information
print(breast_cancer_wisconsin_diagnostic.variables)

data = pd.DataFrame(X)
display(data)

{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'ID': 230, 'type': 'NATIVE', 'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'venue': 'Electronic imaging', 'year': 1993, 'journal': None, 'DOI': '1

Unnamed: 0,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


Step 2: Standardize the data

For large data we use `StandardScaler`, imported from `sklearn.preprocessing`, which standardizes each feature (column) separately.

In [None]:
from sklearn.preprocessing import StandardScaler

feature_names = X.columns
X_stand = pd.DataFrame(StandardScaler().fit_transform(X), columns=feature_names)
display(X_stand)

Unnamed: 0,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
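
As a rough check of what `StandardScaler` computes, the sketch below compares it with the manual per-column formula on synthetic data. The array `X_demo` is made up and stands in for the real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(X_demo)

# Manual version: per-column mean and population standard deviation (ddof=0)
Z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(Z_sklearn, Z_manual))                 # True
print(np.allclose(scaler.mean_, X_demo.mean(axis=0)))   # True
```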


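Step 3: Choose the *threshold*

Here the threshold $t$ is the fraction of features to keep, so $k=\lfloor t\cdot n_{\text{features}}\rfloor$. With 30 features and $t=0.5$ this gives $k=15$.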

In [None]:
t = 0.5
n_features = X_stand.shape[1]
k = int(np.floor(t*n_features))
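
scikit-learn can also choose the number of components from a variance target: a float `n_components` between 0 and 1, combined with `svd_solver="full"`, keeps enough components to explain at least that share of the variance. The sketch below illustrates this on synthetic data. The 0.95 target and the data are assumptions, not part of the workflow above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

Z = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance
pca_demo = PCA(n_components=0.95, svd_solver="full")
pca_demo.fit(Z)

print(pca_demo.n_components_)                    # number of components actually kept
print(pca_demo.explained_variance_ratio_.sum())  # at least 0.95
```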

Step 4: Run PCA with sklearn

PCA is available in scikit-learn as the class `sklearn.decomposition.PCA`.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=k)
pca.fit(X_stand)
kolom = ['KU_'+str(i) for i in range(1,k+1)]
X_pca = pd.DataFrame(pca.transform(X_stand),columns=kolom)
display(X_pca)

Unnamed: 0,KU_1,KU_2,KU_3,KU_4,KU_5,KU_6,KU_7,KU_8,KU_9,KU_10,KU_11,KU_12,KU_13,KU_14,KU_15
0,9.192837,1.948583,-1.123166,-3.633731,1.195110,1.411424,2.159370,-0.398407,-0.157118,-0.877402,0.262955,0.859014,-0.103388,-0.690804,0.601793
1,2.387802,-3.768172,-0.529293,-1.118264,-0.621775,0.028656,0.013358,0.240988,-0.711905,1.106995,0.813120,-0.157923,0.943529,-0.653475,-0.008975
2,5.733896,-1.075174,-0.551748,-0.912083,0.177086,0.541452,-0.668166,0.097374,0.024066,0.454275,-0.605604,-0.124387,0.410627,0.016680,-0.483420
3,7.122953,10.275589,-3.232790,-0.152547,2.960878,3.053422,1.429911,1.059565,-1.405440,-1.116975,-1.151514,-1.011316,0.933271,-0.487417,0.168848
4,3.935302,-1.948072,1.389767,-2.940639,-0.546747,-1.226495,-0.936213,0.636376,-0.263805,0.377704,0.651360,0.110515,-0.387948,-0.539181,-0.310319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,6.439315,-3.576817,2.459487,-1.177314,0.074824,-2.375193,-0.596130,-0.035471,0.987929,0.256989,-0.062651,-0.123342,0.051723,-0.404290,-0.652750
565,3.793382,-3.584048,2.088476,2.506028,0.510723,-0.246710,-0.716326,-1.113360,-0.105207,-0.108632,0.244804,-0.222753,0.192637,0.015555,0.069975
566,1.256179,-1.902297,0.562731,2.089227,-1.809991,-0.534447,-0.192758,0.341887,0.393917,0.520877,-0.840512,-0.096473,-0.157418,0.285691,-0.090998
567,10.374794,1.672010,-1.877029,2.356031,0.033742,0.567936,0.223082,-0.280239,-0.542035,-0.089296,-0.178628,0.697461,-1.225195,0.218698,-0.229591
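
Standardization and PCA can also be chained into one scikit-learn pipeline, which keeps the two steps together when the same transformation has to be applied to new data. This is a sketch on synthetic data; `pipe` and `X_demo` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 150 samples, 6 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 6))

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipe.fit_transform(X_demo)

pca_step = pipe.named_steps["pca"]
print(scores.shape)                 # (150, 2)
print(pca_step.components_.shape)   # (2, 6)
```

The rows of `components_` are the principal directions, so this array is the transpose of the eigenvector matrix used in the manual walkthrough.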


# Kaggle

Step 1: Load the dataset

Suppose we want the *Plant Health Data* dataset: https://www.kaggle.com/datasets/ziya07/plant-health-data.

Remember to set up Kaggle API credentials first by following the steps at https://www.kaggle.com/docs/api#authentication

In [None]:
import kagglehub
import pandas as pd
import os

# dataset_download returns the local directory that holds the downloaded dataset files
path = kagglehub.dataset_download("ziya07/plant-health-data")
print("Path to dataset files:", path)

# Load the first CSV file found in that directory
files = os.listdir(path)
csv_files = [file for file in files if file.endswith(".csv")]
if csv_files:
    df = pd.read_csv(os.path.join(path, csv_files[0]))
    display(df)
else:
    print("No CSV file found in the dataset")

Dataset URL: https://www.kaggle.com/datasets/ziya07/plant-health-data
License(s): CC0-1.0
plant-health-data.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  ./data/plant-health-data.zip
replace ./data/plant_health_data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


Unnamed: 0,Timestamp,Plant_ID,Soil_Moisture,Ambient_Temperature,Soil_Temperature,Humidity,Light_Intensity,Soil_pH,Nitrogen_Level,Phosphorus_Level,Potassium_Level,Chlorophyll_Content,Electrochemical_Signal,Plant_Health_Status
0,2024-10-03 10:54:53.407995,1,27.521109,22.240245,21.900435,55.291904,556.172805,5.581955,10.003650,45.806852,39.076199,35.703006,0.941402,High Stress
1,2024-10-03 16:54:53.407995,1,14.835566,21.706763,18.680892,63.949181,596.136721,7.135705,30.712562,25.394393,17.944826,27.993296,0.164899,High Stress
2,2024-10-03 22:54:53.407995,1,17.086362,21.180946,15.392939,67.837956,591.124627,5.656852,29.337002,27.573892,35.706530,43.646308,1.081728,High Stress
3,2024-10-04 04:54:53.407995,1,15.336156,22.593302,22.778394,58.190811,241.412476,5.584523,16.966621,26.180705,26.257746,37.838095,1.186088,High Stress
4,2024-10-04 10:54:53.407995,1,39.822216,28.929001,18.100937,63.772036,444.493830,5.919707,10.944961,37.898907,37.654483,48.265812,1.609805,High Stress
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,2024-11-01 04:54:53.493508,10,29.665780,27.605285,15.381699,54.668196,650.536854,5.715289,29.993107,14.914470,21.560747,24.273224,0.714553,Moderate Stress
1196,2024-11-01 10:54:53.493508,10,15.490782,22.108112,15.221033,61.243143,768.760787,5.958957,45.258678,25.216248,31.940717,30.930676,1.448029,High Stress
1197,2024-11-01 16:54:53.493508,10,23.495723,21.680240,15.499764,40.693671,293.643366,7.419157,38.351189,48.656078,28.473233,38.324484,0.880019,Moderate Stress
1198,2024-11-01 22:54:53.493508,10,30.971675,23.019488,21.934918,41.387107,492.952014,5.855767,49.402550,23.843971,19.750042,46.027529,0.344597,Healthy


In [None]:
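# Select the ten numeric feature columns at positions 2 to 11;
# this slice leaves out Electrochemical_Signal (position 12)
X1 = df.iloc[:,2:12]
display(X1)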

Unnamed: 0,Soil_Moisture,Ambient_Temperature,Soil_Temperature,Humidity,Light_Intensity,Soil_pH,Nitrogen_Level,Phosphorus_Level,Potassium_Level,Chlorophyll_Content
0,27.521109,22.240245,21.900435,55.291904,556.172805,5.581955,10.003650,45.806852,39.076199,35.703006
1,14.835566,21.706763,18.680892,63.949181,596.136721,7.135705,30.712562,25.394393,17.944826,27.993296
2,17.086362,21.180946,15.392939,67.837956,591.124627,5.656852,29.337002,27.573892,35.706530,43.646308
3,15.336156,22.593302,22.778394,58.190811,241.412476,5.584523,16.966621,26.180705,26.257746,37.838095
4,39.822216,28.929001,18.100937,63.772036,444.493830,5.919707,10.944961,37.898907,37.654483,48.265812
...,...,...,...,...,...,...,...,...,...,...
1195,29.665780,27.605285,15.381699,54.668196,650.536854,5.715289,29.993107,14.914470,21.560747,24.273224
1196,15.490782,22.108112,15.221033,61.243143,768.760787,5.958957,45.258678,25.216248,31.940717,30.930676
1197,23.495723,21.680240,15.499764,40.693671,293.643366,7.419157,38.351189,48.656078,28.473233,38.324484
1198,30.971675,23.019488,21.934918,41.387107,492.952014,5.855767,49.402550,23.843971,19.750042,46.027529


Step 2: Standardize the data

For large data we use `StandardScaler`, imported from `sklearn.preprocessing`, which standardizes each feature (column) separately.

In [None]:
from sklearn.preprocessing import StandardScaler

feature_names_1 = X1.columns
X1_stand = pd.DataFrame(StandardScaler().fit_transform(X1), columns=feature_names_1)
display(X1_stand)

Unnamed: 0,Soil_Moisture,Ambient_Temperature,Soil_Temperature,Humidity,Light_Intensity,Soil_pH,Nitrogen_Level,Phosphorus_Level,Potassium_Level,Chlorophyll_Content
0,0.278321,-0.511285,0.662825,0.049963,-0.247408,-1.620166,-1.746638,1.355983,0.768579,0.108796
1,-1.184139,-0.666361,-0.435676,1.035845,-0.072300,1.051744,0.052635,-0.424888,-1.043216,-0.770972
2,-0.924655,-0.819209,-1.557519,1.478694,-0.094261,-1.491368,-0.066879,-0.234739,0.479665,1.015220
3,-1.126428,-0.408656,0.962382,0.380088,-1.626583,-1.615750,-1.141667,-0.356286,-0.330470,0.352435
4,1.696462,1.433049,-0.633556,1.015672,-0.736748,-1.039349,-1.664853,0.666060,0.646682,1.542359
...,...,...,...,...,...,...,...,...,...,...
1195,0.525571,1.048262,-1.561354,-0.021064,0.166063,-1.390877,-0.009874,-1.339201,-0.733189,-1.195476
1196,-1.108602,-0.549694,-1.616173,0.727684,0.684081,-0.971854,1.316460,-0.440430,0.156786,-0.435783
1197,-0.185748,-0.674071,-1.521070,-1.612468,-1.397725,1.539182,0.716310,1.604562,-0.140515,0.407937
1198,0.676122,-0.284769,0.674590,-1.533501,-0.524421,-1.149304,1.676496,-0.560153,-0.888438,1.286945


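Step 3: Choose the *threshold*

As before, $k=\lfloor t\cdot n_{\text{features}}\rfloor$. With 10 features and $t=0.5$ this gives $k=5$.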

In [None]:
import numpy as np

t1 = 0.5
n1_features = X1_stand.shape[1]
k1 = int(np.floor(t1*n1_features))

Step 4: Run PCA with sklearn

PCA is available in scikit-learn as the class `sklearn.decomposition.PCA`.

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

pca1 = PCA(n_components=k1)
pca1.fit(X1_stand)
kolom = ['KU_'+str(i) for i in range(1,k1+1)]
X1_pca = pd.DataFrame(pca1.transform(X1_stand),columns=kolom)
X1_pca = pd.concat([df[[df.columns[0]]], df[[df.columns[1]]], X1_pca, df[[df.columns[-1]]]], axis=1)
display(X1_pca)

Unnamed: 0,Timestamp,Plant_ID,KU_1,KU_2,KU_3,KU_4,KU_5,Plant_Health_Status
0,2024-10-03 10:54:53.407995,1,1.076241,0.819401,0.033575,0.002712,-0.913038,High Stress
1,2024-10-03 16:54:53.407995,1,-0.884671,-0.643419,-0.729676,0.597258,0.858854,High Stress
2,2024-10-03 22:54:53.407995,1,0.093858,0.748375,-0.332440,-0.789005,-0.182238,High Stress
3,2024-10-04 04:54:53.407995,1,1.160858,-0.318868,1.156322,0.230522,-0.341130,High Stress
4,2024-10-04 10:54:53.407995,1,2.372224,2.313893,0.148083,0.236215,-0.589752,High Stress
...,...,...,...,...,...,...,...,...
1195,2024-11-01 04:54:53.493508,10,-0.352420,0.014483,0.789709,-1.533681,-0.922466,Moderate Stress
1196,2024-11-01 10:54:53.493508,10,-1.480995,-0.147465,0.043616,-1.155050,-0.192959,High Stress
1197,2024-11-01 16:54:53.493508,10,-2.247855,0.854787,-0.418215,1.818697,-0.251008,Moderate Stress
1198,2024-11-01 22:54:53.493508,10,0.219983,-0.666245,0.882892,-0.484053,-1.848163,Healthy
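
One detail worth keeping in mind when joining the component scores back to the identifier and label columns: `pd.concat(..., axis=1)` aligns rows by index. The sketch below uses a small made-up frame (`df_demo`, with invented score values) whose index does not start at 0, and reuses that index for the scores so that every row stays aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not 0..n-1, for example after filtering rows
df_demo = pd.DataFrame(
    {"Plant_ID": [1, 1, 2], "Plant_Health_Status": ["Healthy", "High Stress", "Healthy"]},
    index=[10, 11, 12],
)
scores = np.array([[0.5, -1.0], [1.5, 0.2], [-2.0, 0.8]])  # made-up component values

# Reusing df_demo.index keeps every score on the row it was computed from
scores_df = pd.DataFrame(scores, columns=["KU_1", "KU_2"], index=df_demo.index)
combined = pd.concat(
    [df_demo[["Plant_ID"]], scores_df, df_demo[["Plant_Health_Status"]]], axis=1
)

print(combined.shape)               # (3, 4)
print(combined.isna().sum().sum())  # 0
```

Without the explicit `index=df_demo.index`, the scores would get a fresh 0, 1, 2 index and the concatenation would produce rows filled with missing values.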
