Extracting training data

In [2]:
import numpy as np
# Pass the path directly so NumPy opens and closes the file itself.
data = np.loadtxt("train_2008.csv", delimiter=",", skiprows=1)
print(data)

In [23]:
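# NOTE: 3:383 also includes column 382 (the label); use 3:382 to exclude it.
X_train = data[:, 3:383]
# Label column for every row (data[382] would pick a single row instead).
Y_train = data[:, 382]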

In [25]:
from sklearn.decomposition import PCA

Standardizing the features to zero mean and unit variance (important because PCA is sensitive to feature scale, see https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py)

In [32]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(X_train)
# Apply the transform to the training set; reuse this fitted scaler on the test set later.
X_train = scaler.transform(X_train)
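
A minimal sketch of fitting the scaler on the training set only and then reusing it on a held-out set. The arrays `X_tr` and `X_te` are synthetic stand-ins (an assumption for illustration; the notebook has not loaded a test set at this point), and `demo_scaler` is a hypothetical name chosen so the notebook's own `scaler` is not touched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real feature matrices (assumption: not the notebook's data).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_te = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Learn the per-column mean and standard deviation from the training set only.
demo_scaler = StandardScaler().fit(X_tr)

# Reuse the same statistics for both sets so they share one coordinate system.
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately 0 for every column
print(X_tr_scaled.std(axis=0))   # approximately 1 for every column
```

The test set will generally not come out with exactly zero mean and unit variance, which is expected: it is scaled with the training statistics, not its own.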

Fitting PCA

In [33]:
pca = PCA()
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

This prints the fraction of the total variance explained by each principal component, sorted in decreasing order (the values sum to 1, hence a "ratio")

In [None]:
print(pca.explained_variance_ratio_)
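
As a small illustration of what `explained_variance_ratio_` contains, the sketch below fits a PCA on synthetic data (an assumption; `X_demo`, `demo_pca`, and `evr` are hypothetical names, not part of this notebook) and checks that the ratios are in decreasing order and sum to 1 when every component is kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have very different variances (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

demo_pca = PCA().fit(X_demo)
evr = demo_pca.explained_variance_ratio_

print(evr)                         # one entry per principal component
print(np.all(np.diff(evr) <= 0))   # True: sorted in decreasing order
print(np.isclose(evr.sum(), 1.0))  # True when all components are kept
```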

To see how many principal components we can drop, we can build a cumulative sum array, so that keeping the first $i+1$ components preserves ratios[i] of the variance

In [44]:
# Cumulative explained variance; np.cumsum returns a new array, so
# pca.explained_variance_ratio_ itself is left unmodified.
ratios = np.cumsum(pca.explained_variance_ratio_)
print(ratios)

[0.16177281 0.20682623 0.24295566 0.27541845 0.30125789 0.32430912
 0.34605532 0.36428452 0.38224014 0.39861743 0.41295514 0.42657111
 0.43919913 0.45146501 0.46213952 0.47202735 0.48144214 0.4902248
 0.49865001 0.50661268 0.51425258 0.52144689 0.52856563 0.53540485
 0.54205312 0.54857045 0.55492899 0.56103369 0.56678657 0.57250957
 0.57810077 0.58361398 0.58907452 0.59445895 0.59979438 0.6051234
 0.6102807  0.61537782 0.62042789 0.62534576 0.63011841 0.63476748
 0.63939281 0.64398176 0.64852224 0.65283984 0.65710577 0.66128003
 0.66540704 0.66942053 0.6733754  0.67729866 0.68118179 0.68504254
 0.68886525 0.69260262 0.6962695  0.69987877 0.70341474 0.70692404
 0.71040157 0.7138587  0.71725351 0.72062207 0.72395017 0.72727044
 0.73054313 0.73377338 0.73696983 0.74015576 0.74331868 0.74640527
 0.74944958 0.7524832  0.75550351 0.75848492 0.76136929 0.76422367
 0.76705697 0.76986248 0.77264703 0.77543011 0.77819325 0.78094061
 0.7836766  0.78640276 0.78911893 0.7918033  0.79446282 0.797116

Note that the first 251 principal components, roughly $\frac{2}{3}$ of them, already explain about 99.5% of the variance.

In [56]:
ratios[250]

0.9948885777770978
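
Instead of probing indices by hand, the smallest number of components that reaches a target can be looked up directly. The sketch below uses made-up variance ratios (an assumption; `evr`, `target`, and `n_keep` are hypothetical names) to show the idea with `np.cumsum` and `np.searchsorted`.

```python
import numpy as np

# Hypothetical per-component variance ratios (made-up numbers that sum to 1).
evr = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])
cumulative = np.cumsum(evr)  # cumulative[i] covers the first i + 1 components

target = 0.95
# Index of the first entry that reaches the target, plus one to turn it into a count.
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(n_keep)  # 4
```

The same lookup could be applied to the notebook's cumulative array with a target such as 0.995.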

We could maybe train on the first 250-300 principal components (though it is unclear how much training time this would save us)
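
One possible way to apply such a reduction, sketched on synthetic data: scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and then keeps just enough components to reach that fraction of the variance. The data, the 0.995 threshold, and the names `X_demo`, `reducer`, and `X_reduced` are assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated data standing in for the scaled training matrix (assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 40))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 40))
X_demo = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep just enough components to reach that
# fraction of the variance; svd_solver="full" is set explicitly for this mode.
reducer = PCA(n_components=0.995, svd_solver="full")
X_reduced = reducer.fit_transform(X_demo)

print(X_demo.shape, "->", X_reduced.shape)
print(reducer.explained_variance_ratio_.sum())  # at least 0.995
```

Whether this saves meaningful training time depends on the downstream model, so it would need to be timed on the real data.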