# Blog feedback (David)
- https://archive.ics.uci.edu/ml/datasets/BlogFeedback
- very large samples (60021), large dimension (281)
- attribute characteristics: numeric

Using only 10 rows of the training set (due to performance)
### without preprocessing
- runtime: 27.005s
- best parameter settings: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 500}
- train scores: TBD
- test scores: [mae] 12.814, [mse] 1662.049, [r2] -0.040

### with preprocessing 1
- preprocessing: standardized scaling
- runtime: 27.328
- best parameter settings: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 500}
- train scores: TBD
- test scores: [mae] 11.985, [mse] 1647.030, [r2] -0.031

### with preprocessing 2
- preprocessing: standardized scaling, PCA down to 100 dimensions
- runtime: 31.536s
- best parameter settings: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 500}
- train scores: TBD
- test scores: [mae] 15.412, [mse] 1605.100, [r2] -0.005

### Notes (by Moritz)
- from UCI: 
```0...50: Average, standard deviation, min, max and median of the Attributes 51...60 for the source of the current blog post. With source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10```
    - could drop cols 0 (1) to 49 (50) as they contain summary vals of cols 50 (51) to 59 (60).
    - could drop cols 50 to 59 and aggregate per source

In [6]:
### Select data and train initial models

%run base.ipynb
%matplotlib inline

train_path = 'data/blog_feedback/blogData_train.csv'
test_path = 'data/blog_feedback/blogData_test-2012.02.01.00_00.csv'

train = pd.read_csv(train_path, header=None, sep = ",")
test = pd.read_csv(test_path, header=None, sep=",")

train_data = train.iloc[:,:279]
train_target = train.iloc[:,280]
test_data = test.iloc[:,:279]
test_target = test.iloc[:,280]

#display(train_data[:10])
#corr_data = train_data.iloc[:, 50:]
#correlation_matrix(corr_data, corr_data.columns)

# drop cols 0 to 50
train_data = train_data.iloc[:, 50:]
test_data = test_data.iloc[:, 50:]

train_data, test_data = scale_data(train_data, test_data)
#display(train_data[:10])
train_data, test_data = my_pca(train_data, test_data, 15)
#display(train_data[:10])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,-0.445005,3.86751,-0.437419,0.086788,0.217783,-0.627407,1.558813,0.500486,-0.691524,-0.11699,-1.895232,-0.396394,0.037078,0.7117,0.768915
1,-0.433191,3.858208,-0.324007,-0.152133,0.207255,0.131669,1.919447,0.320021,-0.825965,0.171863,0.516468,-0.237017,-0.488357,0.516712,1.105459
2,-0.433191,3.858208,-0.324007,-0.152133,0.207255,0.131669,1.919447,0.320021,-0.825965,0.171863,0.516468,-0.237017,-0.488357,0.516712,1.105459
3,-0.445005,3.86751,-0.437419,0.086788,0.217783,-0.627407,1.558813,0.500486,-0.691524,-0.11699,-1.895232,-0.396394,0.037078,0.7117,0.768915
4,-0.443482,3.861178,-0.40683,-0.200368,0.170299,-1.504723,1.176314,0.590764,-0.505052,0.385417,-0.626191,-0.767802,0.35155,0.557267,0.082296
5,-0.431887,3.856005,-0.325692,-0.402073,0.157742,-0.749745,1.538116,0.410536,-0.639289,0.678472,1.786645,-0.606659,-0.175563,0.376051,0.417866
6,-0.431887,3.856005,-0.325692,-0.402073,0.157742,-0.749745,1.538116,0.410536,-0.639289,0.678472,1.786645,-0.606659,-0.175563,0.376051,0.417866
7,-0.443482,3.861178,-0.40683,-0.200368,0.170299,-1.504723,1.176314,0.590764,-0.505052,0.385417,-0.626191,-0.767802,0.35155,0.557267,0.082296
8,-0.44661,3.819739,-0.186781,-0.195972,0.248342,-1.815459,0.371861,0.840715,-0.319719,-0.097504,-1.99332,-0.651737,0.290111,0.161706,-1.013962
9,-0.44661,3.819739,-0.186781,-0.195972,0.248342,-1.815459,0.371861,0.840715,-0.319719,-0.097504,-1.99332,-0.651737,0.290111,0.161706,-1.013962


In [15]:
# Linear Regression
reg = linear_reg(train_data, train_target)
result = pd.DataFrame(reg.predict(test_data))

# TODO score

display(result)

  linalg.lstsq(X, y)


R^2 value for model: 0.22200099890915115


Unnamed: 0,0
0,14.221363
1,-4.002151
2,14.177533
3,7.724669
4,7.673790
5,-1.788599
6,-4.441984
7,2.650506
8,-3.084189
9,1.085007


In [None]:
# SVR
# params
param_grid = {
    'C': np.linspace(.2,1,5),
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly'], # poly very slow
    'epsilon': np.linspace(0,.5,6),
    'gamma': ['auto', 'scale']
}

# run grid search
gs = run_svr(train_data, train_target, cv=5, param_grid=param_grid)

# predict
result = pd.DataFrame(gs.best_estimator_.predict(test_data))

# join id col
#result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')
display(result)

GridSearch initializing...
SVR model in training...


In [7]:
# Gradient Boosted Decision Tree
param_fix = {'learning_rate': 0.01, 'loss': 'ls'}
cv = 10
#param_grid = {'n_estimators': [100, 500, 5]}
param_grid = {'n_estimators': (50, 100, 150, 200, 300, 400, 500), 'max_depth': (1, 2, 3, 4, 5), 'min_samples_split': (2,3,5)}

gbt = run_boosted_tree(train_data.iloc[:100,:], train_target[:100], test_data, test_target, param_fix, cv, param_grid)
#plot_scores(gbt.cv_results_)
#plot_training_deviance(gbt, test_data, test_target)

GridSearch initializing...
GradientBoostedRegressor model in training...
GradientBoostedRegressor model selected and fitted in 53.804 s

Best parameters selected by GridSearch: {'max_depth': 4, 'min_samples_split': 5, 'n_estimators': 50}


In [3]:
### Test model

pred = gbt.predict(test_data)

# Metrics
mae = metrics.mean_absolute_error(test_target, pred)
mse = metrics.mean_squared_error(test_target, pred)
r2 = metrics.r2_score(test_target, pred)

print("\nMetrics: [mae] %.3f, [mse] %.3f, [r2] %.3f" % (mae, mse, r2))


Metrics: [mae] 14.948, [mse] 1592.432, [r2] 0.003
