---
title: "Regression Model for Online News Popularity Using R Take 2"
author: "David Lowe"
date: "December 3, 2018"
output:
html_document:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. https://machinelearningmastery.com/
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a regression situation where we are trying to predict the value of a continuous variable.
INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article's popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
Many thanks to K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal, for making the dataset and benchmarking information available.
In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.
For this iteration, we will examine the feasibility of a dimensionality reduction technique: ranking the attributes' importance with the Lasso algorithm. Afterward, we will eliminate the features that do not contribute to a cumulative importance of 0.99 (or 99%).
ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved the best RMSE of 10299. Using the optimized tuning parameter available, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was slightly worse than the accuracy of the training data and possibly due to over-fitting.
In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 10409. Two algorithms (ElasticNet and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved the best RMSE of 10312. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an RMSE of 13007, which was worse than the accuracy of the training data and possibly due to over-fitting.
From the model-building activities, the number of attributes went from 58 down to 35 after eliminating 23 attributes. The processing time went from 21 hours 7 minutes in iteration Take1 down to 11 hours 41 minutes in iteration Take2, which was a reduction of 44% from Take1.
CONCLUSION: The feature selection technique helped by cutting down the number of attributes, which in turn reduced the training time. The modeling took a much shorter time to process yet still retained a comparable level of accuracy. For this iteration, the Stochastic Gradient Boosting algorithm achieved the top training and validation results compared to the other machine learning algorithms. For this dataset, Stochastic Gradient Boosting should be considered for further modeling or production use.
Dataset Used: Online News Popularity Dataset
Dataset ML Model: Regression with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The project aims to touch on the following areas:
1. Document a predictive modeling problem end-to-end.
2. Explore data cleaning and transformation options.
3. Explore non-ensemble and ensemble algorithms for baseline model performance.
4. Explore algorithm tuning techniques for improving model performance.
Any predictive modeling machine learning project can generally be broken down into six major tasks:
1. Prepare Problem
2. Summarize Data
3. Prepare Data
4. Model and Evaluate Algorithms
5. Improve Accuracy or Results
6. Finalize Model and Present Results
## 1. Prepare Problem
### 1.a) Load libraries
```{r}
startTimeScript <- proc.time()
library(caret)
library(corrplot)
library(parallel)
library(mailR)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
```
### 1.b) Load dataset
```{r}
originalDataset <- read.csv("OnlineNewsPopularity.csv", header= TRUE)
# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL
# Different ways of reading and processing the input dataset. Saving these for future references.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
```
```{r}
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
```
```{r}
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# if (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
colnames(originalDataset)[targetCol] <- "targetVar"
```
```{r}
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]
if (targetCol==1) {
x_train <- xy_train[,(targetCol+1):totCol]
y_train <- xy_train[,targetCol]
y_test <- xy_test[,targetCol]
} else {
x_train <- xy_train[,1:(totAttr)]
y_train <- xy_train[,totCol]
y_test <- xy_test[,totCol]
}
```
### 1.c) Set up the key parameters to be used in the script
```{r}
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
```
### 1.d) Set test options and evaluation metric
```{r}
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "RMSE"
```
### 1.e) Set up the email notification function
```{r}
email_notify <- function(msg=""){
sender <- "luozhi2488@gmail.com"
receiver <- "dave@contactdavidlowe.com"
sbj_line <- "Notification from R Script"
password <- readLines("email_credential.txt")
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
```
```{r}
email_notify(paste("Library and Data Loading Completed!",date()))
```
## 2. Summarize Data
To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
### 2.a) Descriptive statistics
#### 2.a.i) Peek at the data itself.
```{r}
head(xy_train)
```
#### 2.a.ii) Dimensions of the dataset.
```{r}
dim(xy_train)
dim(xy_test)
```
#### 2.a.iii) Types of the attributes.
```{r}
sapply(xy_train, class)
```
#### 2.a.iv) Statistical summary of all attributes.
```{r}
summary(xy_train)
```
#### 2.a.v) Count missing values.
```{r}
sapply(xy_train, function(x) sum(is.na(x)))
```
### 2.b) Data visualizations
```{r}
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(x_train[,i], main=names(x_train)[i])
}
```
```{r}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(x_train[,i], main=names(x_train)[i])
}
```
```{r}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(x_train[,i]), main=names(x_train)[i])
}
```
```{r}
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
```
```{r}
email_notify(paste("Data Summary and Visualization Completed!",date()))
```
## 3. Prepare Data
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:
* Cleaning data by removing duplicates, marking missing values and even imputing missing values.
* Feature selection where redundant features may be removed.
* Data transforms where attributes are scaled or redistributed in order to best expose the structure of the problem later to learning algorithms.
### 3.a) Data Cleaning
```{r}
# Not applicable for this iteration of the project.
# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA
# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))
```
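The mark-and-impute pattern sketched in the comments above can be made concrete with a tiny synthetic example. The column name `some_col` and the invalid code `0` are hypothetical, and base R's `mean()` stands in for `Hmisc::impute()`:

```{r}
# Synthetic example: 0 encodes an invalid measurement in this toy column
toy <- data.frame(some_col = c(3, 0, 5, 0, 7))
invalid <- 0
toy$some_col[toy$some_col == invalid] <- NA        # mark invalid values as NA
toy$some_col[is.na(toy$some_col)] <- mean(toy$some_col, na.rm = TRUE)  # mean-impute
toy$some_col  # 3 5 5 5 7
```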
### 3.b) Feature Selection
```{r}
# Using the Lasso algorithm, we try to rank the attributes' importance.
startTimeModule <- proc.time()
set.seed(seedNum)
model_fs <- train(targetVar~., data=xy_train, method="lasso", preProcess="scale", trControl=control)
rankedImportance <- varImp(model_fs, scale=FALSE)
print(rankedImportance)
plot(rankedImportance)
```
```{r}
# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
maxThreshold <- 0.99
rankedAttributes <- rankedImportance$importance
rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
totalWeight <- sum(rankedAttributes)
i <- 1
accumWeight <- 0
exit_now <- FALSE
while ((i <= totAttr) & !exit_now) {
accumWeight = accumWeight + rankedAttributes[i,]
if ((accumWeight/totalWeight) >= maxThreshold) {
exit_now <- TRUE
} else {
i <- i + 1
}
}
lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
lowAttributes <- rownames(lowImportance)
cat('Number of attributes contributed to the importance threshold:',i,"\n")
cat('Number of attributes found to be of low importance:',length(lowAttributes))
```
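The while-loop cutoff above can be cross-checked with a vectorized `cumsum()` version. This sketch uses made-up importance scores (already sorted descending), not the real ranked attributes:

```{r}
# Hypothetical importance scores, sorted descending as in the chunk above
weights <- c(40, 25, 15, 10, 5, 3, 1, 1)
cumShare <- cumsum(weights) / sum(weights)   # cumulative share of total importance
keepCount <- which(cumShare >= 0.99)[1]      # first index that reaches the threshold
keepCount  # 7 -> only the last attribute would be dropped
```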
```{r}
# Removing the unselected attributes from the training and validation dataframes
xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]
```
### 3.c) Data Transforms
```{r}
# Not applicable for this iteration of the project.
```
```{r}
proc.time()-startTimeScript
```
```{r}
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
```
## 4. Model and Evaluate Algorithms
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the dataset. The typical evaluation tasks include:
* Defining test options such as cross validation and the evaluation metric to use.
* Spot checking a suite of linear and nonlinear machine learning algorithms.
* Comparing the estimated accuracy of algorithms.
For this project, we will evaluate four linear, three non-linear, and three ensemble algorithms:
Linear Algorithms: Linear Regression, Ridge, LASSO, and ElasticNet
Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine
Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting
The random number seed is reset before each run to ensure that the evaluation of each algorithm is performed using the same data splits. It ensures the results are directly comparable.
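A minimal base-R sketch of why this works: with the same seed, the pseudo-random fold assignment is identical on every call, so each algorithm trains and tests on exactly the same partitions.

```{r}
set.seed(888)
foldsA <- sample(rep(1:10, length.out = 100))  # assign 100 rows to 10 CV folds
set.seed(888)
foldsB <- sample(rep(1:10, length.out = 100))  # same seed -> same assignment
identical(foldsA, foldsB)  # TRUE
```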
### 4.a) Generate models using linear algorithms
```{r LR}
# Linear Regression (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lm <- train(targetVar~., data=xy_train, method="lm", metric=metricTarget, trControl=control)
print(fit.lm)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Linear Regression Modeling Completed!",date()))
```
```{r RIDGE}
# Ridge (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.ridge <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, trControl=control)
print(fit.ridge)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Ridge Modeling Completed!",date()))
```
```{r LASSO}
# Lasso (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lasso <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, trControl=control)
print(fit.lasso)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Lasso Modeling Completed!",date()))
```
```{r EN}
# ElasticNet (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.en <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, trControl=control)
print(fit.en)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("ElasticNet Modeling Completed!",date()))
```
### 4.b) Generate models using nonlinear algorithms
```{r CART}
# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Decision Tree Modeling Completed!",date()))
```
```{r KNN}
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
```
```{r SVM}
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
```
### 4.c) Generate models using ensemble algorithms
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
```{r BAGCART}
# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Bagged CART Modeling Completed!",date()))
```
```{r RF}
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Random Forest Modeling Completed!",date()))
```
```{r GBM}
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
```
### 4.d) Compare baseline algorithms
```{r SPOT_CHECK}
results <- resamples(list(LR=fit.lm, RIDGE=fit.ridge, LASSO=fit.lasso, EN=fit.en, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
dotplot(results)
cat('The average RMSE from all models is:',
mean(c(results$values$`LR~RMSE`, results$values$`RIDGE~RMSE`, results$values$`LASSO~RMSE`, results$values$`EN~RMSE`, results$values$`CART~RMSE`, results$values$`kNN~RMSE`, results$values$`SVM~RMSE`, results$values$`BagCART~RMSE`, results$values$`RF~RMSE`, results$values$`GBM~RMSE`)))
```
```{r}
email_notify(paste("Baseline Modeling Completed!",date()))
```
## 5. Improve Accuracy or Results
After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve their accuracy further.
Using the two best-performing algorithms from the previous section, we will search for the combination of parameters for each algorithm that yields the best results.
### 5.a) Algorithm Tuning
Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.
```{r FINAL1}
# Tuning algorithm #1 - ElasticNet
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.1,0.01,0.001), fraction=c(0.25,0.5,1.0))
fit.final1 <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
```
```{r FINAL2}
# Tuning algorithm #2 - Stochastic Gradient Boosting
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(.n.trees=c(50,100,150,200), .shrinkage=0.1, .interaction.depth=1, .n.minobsinnode=10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)
print(fit.final2)
proc.time()-startTimeModule
```
```{r}
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
```
### 5.b) Compare Algorithms After Tuning
```{r POST_TUNING}
results <- resamples(list(EN=fit.final1, GBM=fit.final2))
summary(results)
dotplot(results)
```
## 6. Finalize Model and Present Results
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:
* Using an optimal tuned model to make predictions on unseen data.
* Creating a standalone model using the tuned parameters.
* Saving an optimal model to file for later use.
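A small self-contained sketch of the save-and-reload step, using a toy `lm()` fit on the built-in `mtcars` data (the real project would save `finalModel` instead):

```{r}
toyModel <- lm(mpg ~ wt, data = mtcars)      # stand-in for the finalized model
modelPath <- tempfile(fileext = ".rds")      # temp file; use a real path in practice
saveRDS(toyModel, modelPath)                 # persist the model object
restored <- readRDS(modelPath)               # reload it in a later session
identical(coef(toyModel), coef(restored))    # TRUE
```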
### 6.a) Predictions on validation dataset
```{r PREDICT}
predictions <- predict(fit.final2, newdata=xy_test)
print(RMSE(predictions, y_test))
print(R2(predictions, y_test))
```
### 6.b) Create standalone model on entire training dataset
```{r FINALMODEL}
startTimeModule <- proc.time()
library(gbm)
set.seed(seedNum)
finalModel <- gbm(targetVar~., data=xy_train, n.trees=100, shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, verbose=F)
summary(finalModel)
proc.time()-startTimeModule
```
### 6.c) Save model for later use
```{r}
#saveRDS(finalModel, "./finalModel_Regression.rds")
```
```{r}
proc.time()-startTimeScript
```
```{r}
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
```