## CH4 Cross Validation

### Train Features
1. Daytime land surface temperature (wp_LST.day)
2. Sensible heat flux (wp_h)
3. Latent heat flux (wp_le)
4. Net radiation (net_rad)
5. Average air temperature (avg_air_temp)

### Performance
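Compared with the regressions on the other target variables, the CH4 (methane) regression performs poorly. The feature correlation plots show no variable that is strongly correlated with wp_ch4_gf, so the poor performance is not surprising. In the runs below, random forests reach a mean 10-fold CV score of about 0.72, gradient boosted trees about 0.66, and SVR about -0.03.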
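The correlation check mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code that produced the correlation plots: `pearson` is a hypothetical helper, and the inputs are only the five rows of avg_air_temp and wp_ch4_gf printed by df.head() in this notebook, which is far too few to draw conclusions from. On the full data one would run the same calculation for every training column.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# First five rows as printed by df.head(); illustrative only
avg_air_temp = [19.2, 19.3, 20.3, 16.8, 17.0]
wp_ch4_gf = [4332.368657, 5305.896768, 6215.371936, 7129.353337, 7070.768573]

r_value = pearson(avg_air_temp, wp_ch4_gf)
print("avg_air_temp vs wp_ch4_gf (5 rows): %.3f" % r_value)
```

A coefficient near 0 for every feature would match the observation that no single variable tracks wp_ch4_gf closely, although Pearson correlation only measures linear association.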

In [1]:
import sys
sys.path.append('../')
import exp
import regression as r

In [2]:
df = exp.get_exp1_data()
df.head()

Unnamed: 0,avg_air_temp,avg_soil_temp,doy,net_rad,year,wp_ch4_gf,wp_co2_gf,wp_er,wp_gpp,wp_h,...,mb_bnd2,mb_bnd3,mb_bnd7,mb_evi,mb_lswi,mb_ndvi,wp_LST.day,wp_LST.night,mb_LST.day,mb_LST.night
0,19.2,22.3,195,190.0,2012,4332.368657,-304.542172,145.072376,-449.614548,1447.549899,...,0.187575,0.025212,0.053137,0.298162,0.56237,0.6491,29.61,17.285,26.335,18.645
1,19.3,21.8,196,189.0,2012,5305.896768,-335.648791,150.278671,-485.927462,1921.833137,...,0.186562,0.024569,0.051306,0.296544,0.574074,0.6504,29.63,17.2325,26.4075,18.5925
2,20.3,21.9,197,187.0,2012,6215.371936,-313.150966,158.307017,-471.457982,1176.374322,...,0.18555,0.023925,0.049475,0.294925,0.585779,0.6517,29.65,17.18,26.48,18.54
3,16.8,22.4,198,186.0,2012,7129.353337,-339.900067,153.561669,-493.461736,2575.636175,...,0.184537,0.023281,0.047644,0.293306,0.597483,0.653,29.67,17.1275,26.5525,18.4875
4,17.0,21.5,199,151.0,2012,7070.768573,-319.771564,144.05348,-463.825044,1916.08126,...,0.183525,0.022638,0.045812,0.291687,0.609188,0.6543,29.69,17.075,26.625,18.435


In [3]:
train_cols = ["wp_LST.day", "wp_h", "wp_le", "net_rad", "avg_air_temp"]
X, Y = exp.featurize(df, train_cols, ["wp_ch4_gf"])
X, Y, scaler = r.preprocess(X, Y)
X.shape

(1028, 5)

In [4]:
r.random_forests_cross_val(X, Y, feature_names=train_cols)

Running Random Forests Cross Validation...
10-fold CV Acc Mean:  0.718659261145
CV Scores:  0.746011980089, 0.745834837801, 0.70516081643, 0.755340606135, 0.634935942478, 0.721901725789, 0.762214413658, 0.782581985518, 0.607552901038, 0.725057402518
OOB score: 0.719886785599
Feature Importances:
('avg_air_temp', 0.25002115606718361)
('wp_LST.day', 0.24702363784263143)
('wp_le', 0.22285950120325942)
('net_rad', 0.16492651212110004)
('wp_h', 0.11516919276582557)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='sqrt', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=True, random_state=None,
           verbose=0, warm_start=False)
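To make the 10-fold numbers above concrete, the sketch below works out how many rows each held-out fold would contain and recomputes the reported mean from the listed fold scores. How r.random_forests_cross_val actually splits the data is not shown in this notebook, so the even split is an assumption, and `kfold_sizes` is a hypothetical helper.

```python
def kfold_sizes(n_samples, n_folds):
    """Sizes of the held-out folds when rows are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds each take one additional row
    return [base + 1 if i < extra else base for i in range(n_folds)]

# 1028 rows, per the X.shape output above (ASSUMPTION: plain even k-fold split)
sizes = kfold_sizes(1028, 10)
print("held-out fold sizes:", sizes)

# Fold scores copied from the random forests output above
rf_scores = [
    0.746011980089, 0.745834837801, 0.70516081643, 0.755340606135,
    0.634935942478, 0.721901725789, 0.762214413658, 0.782581985518,
    0.607552901038, 0.725057402518,
]
mean_score = sum(rf_scores) / len(rf_scores)
print("mean of fold scores:", mean_score)
```

With roughly 100 rows per held-out fold, the spread between the best and worst fold (about 0.78 versus 0.61) is worth keeping in mind when comparing models by their mean score alone.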

In [5]:
r.xgb_trees_cross_val(X, Y, feature_names=train_cols)

Running Gradient Boosted Trees Cross Validation...
10-fold CV Acc Mean:  0.661368273893
CV Scores:  0.655785240407, 0.646001549389, 0.654724372546, 0.697719035605, 0.607970483693, 0.654514010272, 0.714471921371, 0.775850827011, 0.525678278661, 0.680967019979
Feature Importances:
('wp_LST.day', 0.26324217602663841)
('net_rad', 0.22893620135594145)
('wp_le', 0.19333201424651439)
('avg_air_temp', 0.16933707559041655)
('wp_h', 0.14515253278048934)


GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=3, max_features='sqrt', max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

In [6]:
r.svc_cross_val(X, Y)

Running SVC Cross Validation...
10-fold CV Acc Mean:  -0.0310181251948
CV Scores:  -0.0568398460287, 0.0431227012345, -0.0470083566456, 0.0254967439609, -0.00683056549759, -0.0361613297849, -0.0687493349358, -0.174653902067, 0.01559685628, -0.00415421846338


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
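The negative values in the output above cannot be accuracies. They are consistent with the coefficient of determination R², which is the default score of scikit-learn regressors; that r.svc_cross_val reports R² is an assumption, since its source is not shown here. The sketch below, with a hypothetical `r_squared` helper and toy numbers, shows why R² drops to 0 for a model that always predicts the mean and below 0 for anything worse.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]                            # toy targets, not notebook data
score_mean = r_squared(y, [2.0, 2.0, 2.0])     # always predict the mean
score_worse = r_squared(y, [3.0, 2.0, 1.0])    # predictions in reversed order
print(score_mean, score_worse)
```

Read this way, a mean score of about -0.03 says the SVR with these settings does roughly as well as predicting the average wp_ch4_gf for every row.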

In [7]:
r.dnn_cross_val(X, Y)

Running Neural Network Cross Validation...
Step #1, avg. loss: 26719660.00000
Step #501, epoch #50, avg. loss: 26093274.00000
Step #1001, epoch #100, avg. loss: 17364868.00000
Step #1501, epoch #150, avg. loss: 14357041.00000
Step #2001, epoch #200, avg. loss: 12998029.00000
Step #2501, epoch #250, avg. loss: 12328649.00000
Step #3001, epoch #300, avg. loss: 11824319.00000
Step #3501, epoch #350, avg. loss: 11304137.00000
Step #4001, epoch #400, avg. loss: 10848895.00000
Step #4501, epoch #450, avg. loss: 10366060.00000
Step #1, avg. loss: 42104104.00000
Step #501, epoch #50, avg. loss: 25576854.00000
Step #1001, epoch #100, avg. loss: 17128592.00000
Step #1501, epoch #150, avg. loss: 14122751.00000
Step #2001, epoch #200, avg. loss: 12969589.00000
Step #2501, epoch #250, avg. loss: 12175369.00000
Step #3001, epoch #300, avg. loss: 11621165.00000
Step #3501, epoch #350, avg. loss: 11152093.00000
Step #4001, epoch #400, avg. loss: 10814477.00000
Step #4501, epoch #450, avg. loss: 103441

TensorFlowEstimator(batch_size=100, class_weight=None,
          continue_training=False, early_stopping_rounds=None,
          keep_checkpoint_every_n_hours=10000, learning_rate=0.1,
          max_to_keep=5, model_fn=<function tanh_dnn at 0x11144bb90>,
          n_classes=0, num_cores=4, optimizer='SGD', steps=5000,
          tf_master='', tf_random_seed=42, verbose=1)
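The loss values in the log above are hard to read on their own. If the reported loss is a mean squared error in the units of wp_ch4_gf (an assumption; the log does not say which loss is used or whether the target was rescaled), its square root gives a typical error size that can be compared with the wp_ch4_gf values of roughly 4,300 to 7,100 shown in the df.head() output.

```python
import math

# Last complete "avg. loss" value of the first training run in the log above
final_loss = 10366060.0

# ASSUMPTION: the loss is a mean squared error in the target's own units
implied_rmse = math.sqrt(final_loss)
print("implied RMSE: %.1f" % implied_rmse)
```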