#  Blog Comments Predictions

> Burcin Sarac<br/>
> MSc. Business Analytics FT18<br/>
> Department of Management Science and Technology<br/>
> Athens University of Economics and Business

In this project I try to predict the number of comments that a blog post receives, based on features of the post, using PySpark. The data comes from the UCI Machine Learning Repository: <https://archive.ics.uci.edu/ml/datasets/BlogFeedback>.

I also took some ideas from the paper in which the data were originally used: [Krisztian Buza](http://www.cs.bme.hu/~buza/) (2014), Feedback Prediction for Blogs, in Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152), available at <http://www.cs.bme.hu/~buza/pdfs/gfkl2012_blogs.pdf>.

In [2]:
import os
os.chdir("E:/dersler/big data systems/Assignment_2")

In [3]:
import findspark
findspark.init("E:/Python/spark/spark2.4.2")
from pyspark.sql import SparkSession

spark =  SparkSession.builder.appName("Feedback").getOrCreate()

sc = spark.sparkContext

According to the dataset information on the web page, the aim is to predict feedback (the number of comments) from transformed data about earlier feedback. The feedback data come from blog posts on [torokgaborelemez.blog.hu](torokgaborelemez.blog.hu). The raw HTML documents of the blog posts were crawled and processed; the transformation is described in detail in the original paper mentioned above.

The training dataset covers the years 2010 and 2011, while the base times of the test data fall in February and March 2012. In other words, I train on past data and try to predict the future events contained in the test data.

To read the datasets, I put all test CSV files into a folder called "feedback" and read them with the single command below. I read the training data separately with another command.
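Pointing a CSV reader at a folder concatenates the rows of every file in it. As a rough illustration of that behaviour in plain Python (not Spark's implementation), the sketch below reads every `.csv` file in a folder into one list of rows. The helper `read_csv_folder`, the temporary folder and the tiny files are hypothetical stand-ins for the real "feedback" folder.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_folder(folder):
    """Read every *.csv file in `folder` and return all rows as lists of floats."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for record in csv.reader(handle):
                rows.append([float(value) for value in record])
    return rows


# Throwaway folder with two tiny files (made-up numbers, not the real dataset).
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "part_a.csv").write_text("1.0,2.0,0.0\n3.0,4.0,5.0\n")
Path(demo_dir, "part_b.csv").write_text("6.0,7.0,1.0\n")

all_rows = read_csv_folder(demo_dir)
print(len(all_rows))  # 3 rows: two from part_a.csv, one from part_b.csv
```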

In [4]:
testdf = spark.read.\
    option("inferSchema", "true").\
    csv("feedback")

In [5]:
testdf.count()

7624

In [6]:
testdf.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)
 |-- _c14: double (nullable = true)
 |-- _c15: double (nullable = true)
 |-- _c16: double (nullable = true)
 |-- _c17: double (nullable = true)
 |-- _c18: double (nullable = true)
 |-- _c19: double (nullable = true)
 |-- _c20: double (nullable = true)
 |-- _c21: double (nullable = true)
 |-- _c22: double (nullable = true)
 |-- _c23: double (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: double (nullable = true)

In [7]:
traindf = spark\
  .read\
  .option("inferSchema", "true")\
  .csv("blogData_train.csv")

In [8]:
traindf.count()

52397

In [9]:
traindf.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)
 |-- _c14: double (nullable = true)
 |-- _c15: double (nullable = true)
 |-- _c16: double (nullable = true)
 |-- _c17: double (nullable = true)
 |-- _c18: double (nullable = true)
 |-- _c19: double (nullable = true)
 |-- _c20: double (nullable = true)
 |-- _c21: double (nullable = true)
 |-- _c22: double (nullable = true)
 |-- _c23: double (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: double (nullable = true)

First, to understand the data better, I counted the distinct values of the target variable in the training dataset with the command below. There are 438 distinct target values in total.

In [10]:
traindf.select("_c280").distinct().orderBy("_c280").count()

438

For prediction I used Spark's MLlib library through the `pyspark.ml` package. The algorithms in this library require numerical data. As the output of `printSchema()` shows, all columns are already numeric, so the data is ready for modelling.

First, however, I needed to collect all feature columns into a single vector column. Only the last column is the target variable; all others are features. I used a `VectorAssembler` to create a new column containing the feature vector.
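Conceptually, this step turns each wide row into a pair of a feature vector and a label. The plain-Python sketch below illustrates the idea only; `assemble` is a hypothetical helper, not part of `pyspark.ml`, and the rows are made-up numbers.

```python
def assemble(rows):
    """Split each row into (features, label), taking the last value as the label."""
    return [(row[:-1], row[-1]) for row in rows]


# Two made-up rows with three features and one target value each.
rows = [
    [0.5, 1.0, 3.0, 2.0],
    [0.0, 0.0, 1.0, 0.0],
]

assembled = assemble(rows)
for features, label in assembled:
    print(features, label)
```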

In [11]:
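import random
# Note: this seeds Python's built-in random module only; Spark ML estimators
# that use randomness take their own `seed` parameter.
random.seed(2019)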

In [12]:
from pyspark.ml.feature import VectorAssembler

label_col = traindf.columns[-1]

assembler = VectorAssembler(
    inputCols=[ x for x in traindf.columns[:-1] ],
    outputCol='features')

traindata = assembler.transform(traindf)

In [13]:
label_coltest = testdf.columns[-1]

assembler = VectorAssembler(
    inputCols=[ x for x in testdf.columns[:-1] ],
    outputCol='features')

testdata = assembler.transform(testdf)

# Decision Tree Regression

Since the target variable is continuous, I planned to use Decision Tree Regression, Random Forest Regression, Gradient-Boosted Tree Regression, Linear Regression and Generalized Linear Regression models. I began with Decision Tree Regression.

The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features.
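For regression, a tree chooses each split so that the target values in the two resulting groups are as homogeneous as possible; Spark's regression trees use variance as the impurity measure. The following plain-Python sketch shows that idea on a single feature with made-up numbers. The functions `variance` and `best_split` are hypothetical helpers, and the exhaustive threshold search is a simplification rather than Spark's actual algorithm.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted child variance) for the best split x <= threshold."""
    best = None
    # The largest x is skipped: splitting there would leave the right child empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up data: posts with a small feature value get few comments, others many.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0.0, 1.0, 0.0, 20.0, 22.0, 21.0]

threshold, score = best_split(xs, ys)
print(threshold, score, variance(ys))
```

The chosen threshold separates the two groups of posts, and the weighted variance after the split is far below the variance of the whole target list.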

From now on I used the vectors created from train and test datasets to train models and make predictions. 

In [14]:
import random
random.seed(2019)

In [15]:
from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(featuresCol ='features', labelCol = label_col)
dt_model = dt.fit(traindata)

In [16]:
from pyspark.ml.evaluation import RegressionEvaluator

dt_predictions = dt_model.transform(testdata)
dt_evaluator = RegressionEvaluator(
    labelCol=label_coltest, predictionCol="prediction", metricName="rmse")
rmse = dt_evaluator.evaluate(dt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 24.344


In [17]:
dt_predictions.select('prediction', label_coltest, 'features').show(5)

+------------------+-----+--------------------+
|        prediction|_c280|            features|
+------------------+-----+--------------------+
|0.6974973390464234|  0.0|(280,[0,1,3,5,6,8...|
|0.6974973390464234|  2.0|(280,[0,1,2,3,4,5...|
|0.6974973390464234|  0.0|(280,[25,26,28,30...|
|0.6974973390464234|  0.0|(280,[0,1,3,4,5,6...|
|0.6974973390464234|  0.0|(280,[25,27,28,29...|
+------------------+-----+--------------------+
only showing top 5 rows



Using the trained Decision Tree Regression model, I made predictions on the test data. To check model accuracy, I printed the Root Mean Squared Error (RMSE) on the test data, which equals 24.344.
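The `features` column in the table above is printed in Spark's sparse vector notation, `(size,[indices],[values])`, which stores only the non-zero entries. A minimal plain-Python sketch of that representation follows; `to_sparse` and `to_dense` are hypothetical helpers, not Spark functions, and the example row is made up.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from its sparse representation."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense_row = [4.0, 0.0, 0.0, 1.5, 0.0]
sparse_row = to_sparse(dense_row)
print(sparse_row)  # (5, [0, 3], [4.0, 1.5])
```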

# Random forest regression

Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.
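The ensemble idea can be sketched in plain Python: each tree is trained on a bootstrap sample of the rows, and the forest's prediction is the average of the trees' predictions. This is only an illustration under simplifying assumptions: the "trees" are one-split stumps on a single feature, the data is made up, and the per-split feature subsampling that a random forest also performs is omitted. All function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    overall_mean = sum(ys) / len(ys)
    # Fallback when all xs are equal: predict the overall mean everywhere.
    best_error, best_stump = None, (float("inf"), overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best_error is None or error < best_error:
            best_error, best_stump = error, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees, seed):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0.0, 1.0, 0.0, 20.0, 22.0, 21.0]
forest = fit_forest(xs, ys, n_trees=25, seed=2019)
low, high = predict_forest(forest, 1.0), predict_forest(forest, 12.0)
print(low, high)
```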

In [18]:
import random
random.seed(2019)

In [19]:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(featuresCol ='features', labelCol = label_col)
rf_model = rf.fit(traindata)

In [20]:
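rf_predictions = rf_model.transform(testdata)
rf_evaluator = RegressionEvaluator(
    labelCol=label_coltest, predictionCol="prediction", metricName="rmse")
# Evaluate the random forest's own predictions. The recorded run passed
# dt_predictions here, so the 24.344 printed below is the decision tree's RMSE.
rmse = rf_evaluator.evaluate(rf_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)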

Root Mean Squared Error (RMSE) on test data = 24.344


In [21]:
rf_predictions.select('prediction', label_coltest, 'features').show(5)

+------------------+-----+--------------------+
|        prediction|_c280|            features|
+------------------+-----+--------------------+
|0.5454426268790066|  0.0|(280,[0,1,3,5,6,8...|
|1.1934745676508776|  2.0|(280,[0,1,2,3,4,5...|
|0.5454426268790066|  0.0|(280,[25,26,28,30...|
|2.1812232465179284|  0.0|(280,[0,1,3,4,5,6...|
|0.6417145455456952|  0.0|(280,[25,27,28,29...|
+------------------+-----+--------------------+
only showing top 5 rows



Using the trained Random Forest Regression model, I made predictions on the test data and printed the RMSE to check model accuracy. The printed value, 24.344, is identical to the decision tree's because the recorded run evaluated `dt_predictions` rather than `rf_predictions`; it therefore does not measure the random forest, whose own test RMSE is not reported here.

# Gradient-boosted tree regression

Gradient-Boosted Trees (GBTs) are again ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. The spark.ml implementation supports GBTs for binary classification and for regression, using both continuous and categorical features.
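With squared-error loss, "iteratively minimizing a loss function" amounts to fitting each new tree to the residuals of the current ensemble and adding a scaled version of it. The plain-Python sketch below illustrates this with one-split stumps on a single made-up feature. The function names, the learning rate of 0.5 and the data are assumptions for illustration and do not reproduce Spark's `GBTRegressor`.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best_error, best_stump = None, None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best_error is None or error < best_error:
            best_error, best_stump = error, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0.0, 1.0, 0.0, 20.0, 22.0, 21.0]
base, stumps, fitted = boost(xs, ys, n_rounds=10, learning_rate=0.5)
start_error = mean_squared_error(ys, [base] * len(ys))
final_error = mean_squared_error(ys, fitted)
print(start_error, final_error)
```

The training error after ten rounds is lower than the error of the constant starting prediction, which is the behaviour the iterative procedure is designed to produce.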

In [22]:
import random
random.seed(2019)

In [23]:
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol = 'features', labelCol = label_col, maxIter=10)
gbt_model = gbt.fit(traindata)

In [24]:
gbt_predictions = gbt_model.transform(testdata)
gbt_predictions.select('prediction', label_coltest, 'features').show(5)

+------------------+-----+--------------------+
|        prediction|_c280|            features|
+------------------+-----+--------------------+
|0.1192245348327167|  0.0|(280,[0,1,3,5,6,8...|
| 1.341459192150167|  2.0|(280,[0,1,2,3,4,5...|
|0.1192245348327167|  0.0|(280,[25,26,28,30...|
|0.3730967618086877|  0.0|(280,[0,1,3,4,5,6...|
| 0.504965564412016|  0.0|(280,[25,27,28,29...|
+------------------+-----+--------------------+
only showing top 5 rows



In [26]:
gbt_evaluator = RegressionEvaluator(
    labelCol=label_coltest, predictionCol="prediction", metricName="rmse")
rmse = gbt_evaluator.evaluate(gbt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 23.8821


The Root Mean Squared Error (RMSE) of Gradient-Boosted Tree Regression on the test data was 23.8821.

The root mean squared error is a frequently used measure of the difference between the values predicted by a model and the values observed, so a lower value indicates better predictions. The GBT model's RMSE of 23.8821 is lower than the 24.344 printed in the DT and RF sections, so on the test data this model performs better than the decision tree.
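As a small worked example of the metric, the plain-Python function below computes the RMSE from labels and predictions. The numbers are made up and the function `rmse` is a hypothetical helper that only mirrors the definition, not Spark's `RegressionEvaluator`.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [0.0, 2.0, 0.0, 10.0]
predictions = [1.0, 2.0, 0.0, 6.0]

# Squared errors are 1, 0, 0 and 16, so the mean is 4.25.
error = rmse(labels, predictions)
print(error)
```

Because the errors are squared before averaging, the single miss of 4 comments contributes 16 of the 17 units of squared error, which is why a few posts with very many comments can dominate the RMSE.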

# Linear Regression

In [27]:
import random
random.seed(2019)

In [28]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = 'features', labelCol=label_col, maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(traindata)

In [29]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 30.304233
r2: 0.354077


R^2 indicates that about 35% of the variability in the target variable (the number of comments) is explained by this model on the training data. However, performance on the training data may not be a good approximation of performance on the test data (data unseen by the model).
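To make the interpretation concrete, R^2 compares the model's squared error with that of always predicting the mean label. The sketch below computes it in plain Python on made-up numbers; `r_squared` is a hypothetical helper that follows the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r_squared(labels, predictions):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [0.0, 2.0, 0.0, 10.0]
predictions = [1.0, 2.0, 0.0, 6.0]

# Residual sum of squares is 17, total sum of squares is 68, so R^2 = 0.75.
score = r_squared(labels, predictions)
print(score)

# Always predicting the mean label (3.0) explains none of the variability.
baseline = r_squared(labels, [3.0] * len(labels))
print(baseline)
```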

In [30]:
lr_predictions = lr_model.transform(testdata)
lr_predictions.select("prediction",label_coltest,"features").show(5)

+-------------------+-----+--------------------+
|         prediction|_c280|            features|
+-------------------+-----+--------------------+
| 0.3951240540144729|  0.0|(280,[0,1,3,5,6,8...|
|  9.671163253091592|  2.0|(280,[0,1,2,3,4,5...|
| -4.689294185569542|  0.0|(280,[25,26,28,30...|
|-1.7490783315701872|  0.0|(280,[0,1,3,4,5,6...|
| 1.7935638672743517|  0.0|(280,[25,27,28,29...|
+-------------------+-----+--------------------+
only showing top 5 rows



In [31]:
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol=label_coltest,metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

R Squared (R2) on test data = 0.337531


In [32]:
test_result = lr_model.evaluate(testdata)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 24.8245


R^2 indicates that 33.75% of the variability in the target variable (the number of comments) is explained by this model on the test data. This is close to the 35% obtained on the training data, so the trained model fits unseen data about as well as it fits the training data.

However, its test RMSE (24.8245) is slightly higher than that of the decision-tree-based models (24.344 for DT and 23.8821 for GBT), which means its predictions were slightly less accurate.

# Generalized Linear Regression

In contrast to linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models in which the response variable Y_i follows some distribution from the exponential family.

Below, I train a GLM with a Poisson response and a log link function.
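The practical effect of the log link is that the predicted mean is the exponential of the linear predictor, so it can never be negative, which suits a count such as the number of comments. The plain-Python sketch below contrasts this with the identity link of ordinary linear regression. The weights, intercept and feature values are made up, and the function names are hypothetical.

```python
import math


def linear_predictor(weights, intercept, features):
    """eta = intercept + sum of weight * feature."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def predict_identity(weights, intercept, features):
    """Identity link (ordinary linear regression): the prediction is eta itself."""
    return linear_predictor(weights, intercept, features)


def predict_poisson_log(weights, intercept, features):
    """Log link: log(mean) = eta, so the predicted mean is exp(eta)."""
    return math.exp(linear_predictor(weights, intercept, features))


weights = [0.8, -1.5]
intercept = -0.2
features = [1.0, 2.0]

linear_value = predict_identity(weights, intercept, features)
count_mean = predict_poisson_log(weights, intercept, features)
print(linear_value)  # negative, which is not a meaningful comment count
print(count_mean)    # strictly positive
```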

In [33]:
import random
random.seed(2019)

In [35]:
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(featuresCol = 'features', labelCol=label_col, 
                                  family="poisson", link="log", maxIter=10, regParam=0.3)
glr_model = glr.fit(traindata)

In [36]:
summary = glr_model.summary
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))

Residual Degree Of Freedom: 52116


In [37]:
summary.residuals().show()

+--------------------+
|   devianceResiduals|
+--------------------+
|   -4.36659741082731|
|  -4.010295272297183|
|  -4.010295272297183|
|   -4.36659741082731|
|   5.197808467799306|
|  -3.165968698158762|
|  -3.165968698158762|
|   5.197808467799306|
|  1.3926179028167147|
|  1.3926179028167147|
|  -5.148275567795249|
|  -4.075641947734064|
|  -4.206389861598059|
|  -3.148659070107025|
|-0.29914644600813817|
|  0.7314546985000732|
| -2.1839420160485075|
|  0.7314546985000732|
|-0.29914644600813817|
|  -3.145627381697337|
+--------------------+
only showing top 20 rows



In [38]:
glr_predictions = glr_model.transform(testdata)
glr_predictions.select("prediction",label_coltest,"features").show(5)

+------------------+-----+--------------------+
|        prediction|_c280|            features|
+------------------+-----+--------------------+
| 5.202500944367718|  0.0|(280,[0,1,3,5,6,8...|
|  9.54598517051311|  2.0|(280,[0,1,2,3,4,5...|
|3.0484806811911596|  0.0|(280,[25,26,28,30...|
| 5.283403397463622|  0.0|(280,[0,1,3,4,5,6...|
| 6.998762099408029|  0.0|(280,[25,27,28,29...|
+------------------+-----+--------------------+
only showing top 5 rows



In [39]:
glr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol=label_coltest,metricName="r2")
print("R Squared (R2) on test data = %g" % glr_evaluator.evaluate(glr_predictions))

R Squared (R2) on test data = 0.352158


The GLM summary did not provide an RMSE value, and I was not able to compute it manually from the residuals, so I cannot compare this model with the decision-tree-based models on that metric in this notebook. (The same `RegressionEvaluator` with `metricName="rmse"` used for the other models could be applied to `glr_predictions` to obtain it.) In terms of R^2 on the test data, the GLM performed slightly better than Linear Regression (0.352 versus 0.338).