## CH4 Cross Validation

### Train Features
1. land surface temp (wp_LST.day)
2. sensible heat flux (wp_h)
3. latent heat flux (wp_le)
4. net radiation (wp_RNET)
5. avg air temp (air_temp)

### Performance
Compared with the regressions on the other target variables, the CH4 (methane) regression performs poorly. The feature correlation plots show no variable that is strongly correlated with ch4_gf, so this poor performance is not surprising.
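
The correlation check described above can also be done numerically rather than from plots. The snippet below is a minimal, dependency-free sketch that computes the Pearson correlation between each feature and a target. The helper name `pearson` and the toy values are made up for illustration and are not taken from the experiment data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical values, NOT from the experiment)
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "strong": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact linear function of the target
    "weak": [3.0, 1.0, 4.0, 1.0, 5.0],     # only loosely related to the target
}

# Correlation of every feature with the target, strongest first
correlations = {name: pearson(vals, target) for name, vals in features.items()}
for name, r_value in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print("%s: %.3f" % (name, r_value))
```

On the real data, the pandas equivalent would be along the lines of `df[train_cols + ["wp_ch4_gf"]].corr()["wp_ch4_gf"]`, since `DataFrame.corr()` computes Pearson correlation by default.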

In [1]:
import sys
sys.path.append('../')
import exp
import regression as r

In [2]:
df = exp.get_exp1_data()
df.head()

| Unnamed: 0 | PET | VPD | air_temp | doy | precip | soil_temp | sw_in | wind_speed | year | wp_RNET | ... | wp_evi | wp_lswi2 | wp_ndvi | mb_evi | mb_lswi2 | mb_ndvi | wp_LST.day | wp_LST.night | mb_LST.day | mb_LST.night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.33 | 0.808731 | 19.179167 | 195 | 0.0 | 22.320833 | 30.3156 | 4.958333 | 2012 | 20.798342 | ... | 0.355407 | 0.286584 | 0.611743 | 0.278104 | 0.523764 | 0.652612 | 31.567899 | 17.20453 | 26.696193 | 18.481563 |
| 1 | 6.52 | 0.755945 | 19.325 | 196 | 0.0 | 21.770833 | 29.6316 | 3.791667 | 2012 | 20.573593 | ... | 0.362843 | 0.31711 | 0.624457 | 0.281016 | 0.525663 | 0.651848 | 29.57 | 17.39 | 26.19 | 18.75 |
| 2 | 6.92 | 0.858993 | 20.2625 | 197 | 0.0 | 21.908333 | 29.3472 | 4.1375 | 2012 | 20.475931 | ... | 0.370279 | 0.347637 | 0.637171 | 0.283928 | 0.527563 | 0.651084 | 31.097908 | 17.235624 | 26.745817 | 18.494425 |
| 3 | 6.35 | 0.477617 | 16.791667 | 198 | 0.0 | 22.420833 | 28.818 | 6.033333 | 2012 | 20.571045 | ... | 0.377714 | 0.378163 | 0.649886 | 0.28684 | 0.529463 | 0.65032 | 30.868718 | 17.248525 | 26.76917 | 18.499213 |
| 4 | 5.13 | 0.55682 | 17.016667 | 199 | 0.0 | 21.529167 | 23.1732 | 4.35 | 2012 | 16.757401 | ... | 0.38515 | 0.408689 | 0.6626 | 0.289752 | 0.531362 | 0.649556 | 30.657792 | 17.259663 | 26.791436 | 18.502905 |


In [3]:
df.columns

Index([u'PET', u'VPD', u'air_temp', u'doy', u'precip', u'soil_temp', u'sw_in',
       u'wind_speed', u'year', u'wp_RNET', u'wp_ch4_gf', u'wp_co2_gf',
       u'wp_er', u'wp_gpp', u'wp_h', u'wp_le', u'mb_RNET', u'mb_ch4_gf',
       u'mb_co2_gf', u'mb_er', u'mb_gpp', u'mb_h', u'mb_le', u'wp_evi',
       u'wp_lswi2', u'wp_ndvi', u'mb_evi', u'mb_lswi2', u'mb_ndvi',
       u'wp_LST.day', u'wp_LST.night', u'mb_LST.day', u'mb_LST.night'],
      dtype='object')

In [4]:
train_cols = ["wp_LST.day", "wp_h", "wp_le", "wp_RNET", "air_temp"]
X, Y = exp.featurize(df, train_cols, ["wp_ch4_gf"])
X, Y, scaler = r.preprocess(X, Y)
X.shape

(1028, 5)

In [5]:
r.random_forests_cross_val(X, Y, feature_names=train_cols)

Running Random Forests Cross Validation...
10-fold CV Acc Mean:  0.691183840697
CV Scores:  0.655134405198, 0.709847839905, 0.720787410836, 0.818368552438, 0.689200411957, 0.70668138638, 0.66424639531, 0.626592679601, 0.542842188584, 0.778137136764
OOB score: 0.699727734529
Feature Importances:
('wp_LST.day', 0.25284529815795304)
('air_temp', 0.24555207801640619)
('wp_le', 0.21920507114116972)
('wp_RNET', 0.17084502200593576)
('wp_h', 0.11155253067853538)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='sqrt', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=True, random_state=None,
           verbose=0, warm_start=False)
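
The mean alone hides how much the individual folds disagree. As a small sketch, the snippet below recomputes the mean of the fold scores printed above and adds their spread. It uses only the printed numbers and the standard library; the variable names are illustrative.

```python
import statistics

# Per-fold scores copied from the random forest output above
rf_scores = [
    0.655134405198, 0.709847839905, 0.720787410836, 0.818368552438,
    0.689200411957, 0.70668138638, 0.66424639531, 0.626592679601,
    0.542842188584, 0.778137136764,
]

mean_score = statistics.mean(rf_scores)   # should match the reported CV mean
std_score = statistics.stdev(rf_scores)   # sample standard deviation across folds

print("mean=%.4f std=%.4f min=%.4f max=%.4f"
      % (mean_score, std_score, min(rf_scores), max(rf_scores)))
```

The folds range from about 0.54 to 0.82, so a difference of a few hundredths between two models' means should be read with that spread in mind.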

In [6]:
r.xgb_trees_cross_val(X, Y, feature_names=train_cols)

Running Gradient Boosted Trees Cross Validation...
10-fold CV Acc Mean:  0.666788194531
CV Scores:  0.670432584266, 0.693631169775, 0.664412257737, 0.770229921657, 0.651586839893, 0.701751195374, 0.621980191269, 0.630878579378, 0.501944322098, 0.761034883866
Feature Importances:
('wp_le', 0.23435178134481738)
('wp_LST.day', 0.21417358330992708)
('wp_RNET', 0.19292231108792815)
('air_temp', 0.18776145568736546)
('wp_h', 0.17079086856996187)


GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=3, max_features='sqrt', max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
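
To compare how the two tree ensembles rank the features, the importances printed by both runs can be put side by side. The sketch below only re-sorts the printed numbers (rounded to four decimals); the dictionary and function names are illustrative.

```python
# Importances copied from the two outputs above, rounded to four decimals
rf_importance = {
    "wp_LST.day": 0.2528, "air_temp": 0.2456, "wp_le": 0.2192,
    "wp_RNET": 0.1708, "wp_h": 0.1116,
}
gbt_importance = {
    "wp_le": 0.2344, "wp_LST.day": 0.2142, "wp_RNET": 0.1929,
    "air_temp": 0.1878, "wp_h": 0.1708,
}

def ranking(importance):
    """Feature names ordered from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)

rf_rank = ranking(rf_importance)
gbt_rank = ranking(gbt_importance)

# Print each feature with its 1-based rank in both models
for name in rf_rank:
    print("%-12s RF #%d  GBT #%d"
          % (name, rf_rank.index(name) + 1, gbt_rank.index(name) + 1))
```

Both models place wp_h last but disagree on the top feature (wp_LST.day for the random forest, wp_le for gradient boosting). All importances fall between roughly 0.11 and 0.25, so no single feature dominates in either model.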

In [7]:
r.svc_cross_val(X, Y)

Running SVC Cross Validation...
10-fold CV Acc Mean:  0.079506659563
CV Scores:  0.100564145999, -0.0518856226057, 0.126556489416, 0.131980847778, 0.012326960445, 0.144847570161, 0.0777971709179, 0.072742383218, 0.106896975573, 0.0732396747264


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
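
The negative score in the second fold shows that these values cannot be classification accuracies, despite the "Acc" label. Assuming the helper reports scikit-learn's default regressor score, the coefficient of determination R^2 (an assumption, since the `regression` module is not shown here), a score near 0 means the model does barely better than always predicting the mean of the target. A minimal pure-Python sketch of that metric on made-up numbers, with `r_squared` as a local illustrative helper:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy target (hypothetical values, NOT from the experiment)
y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r_squared(y_true, y_true)                # exact predictions
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
r2_bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than the mean

print(r2_perfect, r2_mean, r2_bad)
```

Under that reading, the SVR mean of about 0.08 says the default-parameter model explains very little of the variance in wp_ch4_gf, compared with roughly 0.67 to 0.69 for the tree ensembles.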

In [8]:
r.dnn_cross_val(X, Y)

Running Neural Network Cross Validation...
Step #100, epoch #10, avg. train loss: 681970.81250
Step #200, epoch #20, avg. train loss: 655331.37500
Step #300, epoch #30, avg. train loss: 633762.06250
Step #400, epoch #40, avg. train loss: 606714.18750
Step #500, epoch #50, avg. train loss: 605515.37500
Step #600, epoch #60, avg. train loss: 595052.50000
Step #700, epoch #70, avg. train loss: 567497.56250
Step #800, epoch #80, avg. train loss: 576539.18750
Step #900, epoch #90, avg. train loss: 570689.50000
Step #1000, epoch #100, avg. train loss: 547691.75000
Step #1100, epoch #110, avg. train loss: 546829.62500
Step #1200, epoch #120, avg. train loss: 537669.62500
Step #1300, epoch #130, avg. train loss: 541272.25000
Step #1400, epoch #140, avg. train loss: 531041.81250
Step #1500, epoch #150, avg. train loss: 525890.00000
Step #1600, epoch #160, avg. train loss: 515687.84375
Step #1700, epoch #170, avg. train loss: 504061.62500
Step #1800, epoch #180, avg. train loss: 507499.12500
[output truncated]

TensorFlowEstimator(batch_size=100, class_weight=None, clip_gradients=5.0,
          config=None, continue_training=False, learning_rate=0.1,
          model_fn=<function tanh_dnn at 0x7febc38ea9b0>, n_classes=0,
          optimizer='Adagrad', steps=5000, verbose=1)