## Step 10: Process nodewise PCA

We would like to use linear regression to estimate a linear relationship between the feature spaces of nodes. However, the features at each node may be collinear. We therefore transform the data by performing (classical) Principal Component Analysis separately at each node, keeping the leading principal components that together account for 95% of the explained variance. This gives us a new, reduced dataset, along with a reduced node structure.
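
The idea can be sketched without `quiver_utils`. The following is a minimal illustration, not the library's implementation: it assumes `node_structure` maps each node name to a list of column names, computes PCA through NumPy's SVD, only centers the data (no rescaling), and names the outputs `<node>_<k>` to mirror the reduced structure printed further down. The helper names `fit_node_pca` and `nodewise_pca_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_node_pca(values, threshold=0.95):
    """Return (mean, components) with the fewest PCs whose cumulative
    explained variance ratio reaches `threshold`."""
    mean = values.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, singular_values, vt = np.linalg.svd(values - mean, full_matrices=False)
    explained = singular_values ** 2
    cumulative_ratio = np.cumsum(explained) / explained.sum()
    # First position where the cumulative ratio is >= threshold, as a count.
    n_keep = int(np.searchsorted(cumulative_ratio, threshold)) + 1
    return mean, vt[:n_keep]


def nodewise_pca_sketch(X_train, X_test, node_structure, threshold=0.95):
    """Fit one PCA per node on the training data and project both splits."""
    train_blocks, test_blocks, reduced_structure = [], [], {}
    for node, columns in node_structure.items():
        train_values = X_train[columns].to_numpy(dtype=float)
        mean, components = fit_node_pca(train_values, threshold)
        names = [f"{node}_{k}" for k in range(len(components))]
        reduced_structure[node] = names
        for X, blocks in ((X_train, train_blocks), (X_test, test_blocks)):
            scores = (X[columns].to_numpy(dtype=float) - mean) @ components.T
            blocks.append(pd.DataFrame(scores, columns=names, index=X.index))
    return (
        pd.concat(train_blocks, axis=1),
        pd.concat(test_blocks, axis=1),
        reduced_structure,
    )


# Toy data: node "a" has two nearly collinear features, node "b" has one.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a1": base,
    "a2": 2.0 * base + 0.01 * rng.normal(size=200),
    "b1": rng.normal(size=200),
})
toy_structure = {"a": ["a1", "a2"], "b": ["b1"]}

toy_train_reduced, toy_test_reduced, toy_reduced_structure = nodewise_pca_sketch(
    toy.iloc[:150], toy.iloc[150:], toy_structure
)
print(toy_reduced_structure)  # {'a': ['a_0'], 'b': ['b_0']}
```

Because the two features of node `a` are almost perfectly collinear, a single component already exceeds the 95% threshold, so the node shrinks from two columns to one. The projection for the test split reuses the mean and components fitted on the training split.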

In [1]:
import sys

sys.path.append("../src/")

import json

import pandas as pd
import numpy as np

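# Import only the two classes used in this step instead of a wildcard import
from quiver_utils.elementwise_models import NodewisePCA, TopPrincipalComponents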

In [2]:
X_train = pd.read_csv('../data/crop_mapping/selection_features_train.csv')
X_test = pd.read_csv('../data/crop_mapping/selection_features_test.csv')

with open('../data/crop_mapping/node_structure.json', 'r') as file:
    node_structure = json.load(file)
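
Before fitting, it can help to confirm that the node structure and the feature table agree. The sketch below assumes that `node_structure` is a dictionary mapping each node name to a list of column names (the same shape as the reduced structure printed later in this step); `check_node_structure` is a hypothetical helper, not part of `quiver_utils`.

```python
import pandas as pd


def check_node_structure(X, node_structure):
    """Return (missing, unassigned, duplicated) column names.

    missing:    listed in the node structure but absent from X
    unassigned: present in X but not attached to any node
    duplicated: attached to more than one node (or listed twice)
    """
    assigned = [col for cols in node_structure.values() for col in cols]
    missing = sorted(set(assigned) - set(X.columns))
    unassigned = sorted(set(X.columns) - set(assigned))
    duplicated = sorted({col for col in assigned if assigned.count(col) > 1})
    return missing, unassigned, duplicated


# Toy example: node "b" refers to a column "b2" that the table does not have.
toy = pd.DataFrame({"a1": [0.1, 0.2], "a2": [0.3, 0.4], "b1": [0.5, 0.6]})
toy_structure = {"a": ["a1", "a2"], "b": ["b1", "b2"]}
print(check_node_structure(toy, toy_structure))  # (['b2'], [], [])
```

On the real data the call would be `check_node_structure(X_train, node_structure)`. An unassigned column such as a saved index would show up in the second list.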

In [3]:
nodewise_pca = NodewisePCA(
    node_structure,
    use_threshold=True
)
nodewise_pca.fit(X_train)
X_nodewise_PCA_train = nodewise_pca.transform(X_train)
X_nodewise_PCA_test = nodewise_pca.transform(X_test)
reduced_node_structure = nodewise_pca.get_reduced_node_structure()

In [4]:
reduced_node_structure

{'sig': ['sig_0', 'sig_1'],
 'R': ['R_0', 'R_1'],
 'Ro': ['Ro_0', 'Ro_1', 'Ro_2'],
 'L': ['L_0', 'L_1', 'L_2'],
 'HA': ['HA_0', 'HA_1', 'HA_2'],
 'PH': ['PH_0'],
 'rvi': ['rvi_0'],
 'paul': ['paul_0', 'paul_1', 'paul_2'],
 'krog': ['krog_0', 'krog_1'],
 'free': ['free_0'],
 'yam': ['yam_0']}
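
Since the motivation for this step is collinearity, one possible sanity check is to confirm that the components kept within each node are (nearly) uncorrelated on the training data. The helper below is a hypothetical sketch that assumes the reduced structure maps node names to column names, as in the output above; it is demonstrated on toy columns rather than on the real data.

```python
import numpy as np
import pandas as pd


def max_within_node_correlation(X_reduced, reduced_structure):
    """Largest absolute off-diagonal correlation among each node's columns."""
    result = {}
    for node, columns in reduced_structure.items():
        if len(columns) < 2:
            # A single component has nothing to be correlated with.
            result[node] = 0.0
            continue
        corr = X_reduced[columns].corr().to_numpy()
        off_diagonal = corr[~np.eye(len(columns), dtype=bool)]
        result[node] = float(np.abs(off_diagonal).max())
    return result


# Toy columns: one perfectly collinear pair, one orthogonal pair, one singleton.
toy = pd.DataFrame({
    "c_0": [1.0, 2.0, 3.0, 4.0],
    "c_1": [2.0, 4.0, 6.0, 8.0],
    "o_0": [1.0, -1.0, 1.0, -1.0],
    "o_1": [1.0, 1.0, -1.0, -1.0],
    "s_0": [0.3, 0.1, 0.4, 0.1],
})
toy_structure = {
    "collinear": ["c_0", "c_1"],
    "orthogonal": ["o_0", "o_1"],
    "single": ["s_0"],
}
report = max_within_node_correlation(toy, toy_structure)
print({node: round(value, 6) for node, value in report.items()})
# {'collinear': 1.0, 'orthogonal': 0.0, 'single': 0.0}
```

Applied as `max_within_node_correlation(X_nodewise_PCA_train, reduced_node_structure)`, values close to zero for every node would indicate that the within-node collinearity has been removed. Correlations between components of different nodes are not checked here.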

In [5]:
X_nodewise_PCA_train.head()

| | sig_0 | sig_1 | R_0 | R_1 | Ro_0 | Ro_1 | Ro_2 | L_0 | L_1 | L_2 | ... | HA_2 | PH_0 | rvi_0 | paul_0 | paul_1 | paul_2 | krog_0 | krog_1 | free_0 | yam_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.288567 | -0.689374 | 2.760387 | 0.275942 | 0.653107 | -0.835729 | 0.781425 | 2.777417 | 0.705206 | 0.332844 | ... | 0.440926 | 1.145807 | 1.12367 | 2.336927 | 0.237783 | 0.432852 | 1.897495 | -0.529375 | 0.732852 | 0.612956 |
| 1 | -3.306494 | -0.864289 | 1.409796 | -0.548366 | -0.912279 | -0.297044 | -0.203791 | -2.350584 | 0.4616 | 0.167349 | ... | -0.054067 | 0.895623 | 0.590499 | -3.223003 | 0.239405 | 0.641808 | -2.235821 | -0.442761 | -1.50701 | -1.649416 |
| 2 | 1.248398 | 0.756708 | -1.967527 | 1.318417 | 0.438031 | -0.363325 | -0.459687 | 1.315198 | -0.309263 | 0.255535 | ... | 0.108821 | -0.133692 | -0.130598 | 1.371221 | 0.331885 | -0.239157 | 1.410635 | -0.393612 | 0.836439 | 0.7352 |
| 3 | -3.359995 | -0.852669 | 1.362974 | -0.611149 | -0.805197 | -0.433749 | 0.038791 | -2.37189 | 0.454267 | 0.16946 | ... | -0.017779 | 0.867381 | 0.557822 | -3.275394 | 0.286021 | 0.634583 | -2.214108 | -0.452444 | -1.500996 | -1.641167 |
| 4 | -2.925189 | -1.046335 | 2.048667 | -0.298645 | -0.798156 | -0.412565 | -0.089968 | -2.167477 | 0.539754 | 0.178275 | ... | 0.06169 | 1.244349 | 0.91534 | -2.832549 | 0.22819 | 0.789583 | -1.865113 | -0.44251 | -1.375444 | -1.513685 |

In [6]:
with open('../data/crop_mapping/reduced_nodes.json', 'w') as file:
    json.dump(reduced_node_structure, file)

X_nodewise_PCA_train.to_csv('../data/crop_mapping/nodewise_PCA_train.csv', mode='w', index=False)
X_nodewise_PCA_test.to_csv('../data/crop_mapping/nodewise_PCA_test.csv', mode='w', index=False)

In [7]:
# Baseline: apply PCA to all features at once, without the node structure,
# so it can be compared with nodewise PCA during the evaluation phase
standard_PCA_model = TopPrincipalComponents(use_threshold=True)
standard_PCA_model.fit(X_train)

X_standard_PCA_train = standard_PCA_model.transform(X_train)
X_standard_PCA_test = standard_PCA_model.transform(X_test)

In [8]:
X_standard_PCA_train.to_csv('../data/crop_mapping/standard_PCA_train.csv', mode='w', index=False)
X_standard_PCA_test.to_csv('../data/crop_mapping/standard_PCA_test.csv', mode='w', index=False)