# Data Analysis in Ruby

We are investigating the question:

If a blog post has received a comment in the first 24 hours after publication, how many more comments will it receive before 24 hours after its publication have passed?

After cleaning and preprocessing the data with `daru`, we use `mixed_models` to fit a linear mixed model.
The number of comments on a given blog post is modeled as a function of the average number of comments and trackbacks per page on the website hosting the blog, the number of parent blog posts and their comments, and the length of the blog post text. Additionally, we model random fluctuations in the number of comments due to the day of the week on which the blog post was published.

We evaluate which predictors have the strongest effect on the number of comments that a blog post receives.

Finally, we use the resulting model to make predictions on a test data set.

## Data Preprocessing with `daru`

Since `daru` requires CSV files to have a header line, we add a header to the data file and save the result as a new file.

In [1]:
without_header = '../examples/data/blogData_train.csv'
with_header = '../examples/data/blogData_train_with_header.csv'
# generic column names v1, ..., v281, one per column of the raw file
colnames = (1..281).map { |x| "v#{x}" }
header = colnames.join(',')
File.open(with_header, 'w') do |fo|
  fo.puts header
  File.foreach(without_header) do |li|
    fo.puts li
  end
end

We load the data with `daru`, select the data columns which we want to keep, and assign them meaningful names.

In [2]:
# load the data with daru
require 'daru'
df = Daru::DataFrame.from_csv '../examples/data/blogData_train_with_header.csv'

# select a subset of columns of the data frame
keep = [:v16, :v41, :v54, :v62, :v270, :v271, :v272, 
        :v273, :v274, :v275, :v276, :v277, :v280]
blog_data = df[*keep]
df = nil

# assign meaningful names for the selected columns
meaningful_names = [:host_comments_avg, :host_trackbacks_avg, 
                    :comments, :length, :mo, :tu, :we, :th, 
                    :fr, :sa, :su, :parents, :parents_comments]
blog_data.vectors = Daru::Index.new(meaningful_names)

# the resulting data set
blog_data.head

Daru::DataFrame:69851026143700 rows: 10 cols: 13

| | host_comments_avg | host_trackbacks_avg | comments | length | mo | tu | we | th | fr | sa | su | parents | parents_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


For a clearer representation of the data, and in order to use the day of the week as a grouping variable for the observations, we replace the seven 0-1-valued day columns with a single column of categorical data (with values 'mo', 'tu', 'we', 'th', 'fr', 'sa', 'su').

In [3]:
days = Array.new(blog_data.nrows) { :unknown }
[:mo, :tu, :we, :th, :fr, :sa, :su].each do |d|
  ind = blog_data[d].each_index.select { |i| blog_data[d][i]==1 }
  ind.each { |i| days[i] = d.to_s }
  blog_data.delete_vector(d)
end
blog_data[:day] = days
blog_data.head 3

Daru::DataFrame:69851020652680 rows: 3 cols: 7

| | host_comments_avg | host_trackbacks_avg | comments | length | parents | parents_comments | day |
|---|---|---|---|---|---|---|---|
| 0 | 34.567566 | 0.972973 | 2.0 | 0.0 | 0.0 | 0.0 | th |
| 1 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | we |
| 2 | 34.567566 | 0.972973 | 5.0 | 0.0 | 0.0 | 0.0 | we |
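
The conversion from indicator columns to a single categorical column can also be sketched in plain Ruby, independent of `daru`. The helper name `collapse_indicators` and the toy rows below are hypothetical and serve only as illustration. Unlike the loop above, where a later day column would overwrite an earlier one, this sketch takes the first indicator that is set.

```ruby
# Collapse one-hot indicator columns into one categorical value per row.
# `rows` is an array of hashes, `levels` lists the indicator keys in order.
# Rows in which no indicator equals 1 receive `default`.
def collapse_indicators(rows, levels, default: 'unknown')
  rows.map do |row|
    hit = levels.find { |level| row[level] == 1 }
    hit ? hit.to_s : default
  end
end

day_levels = [:mo, :tu, :we, :th, :fr, :sa, :su]

# Hypothetical toy rows, not taken from the blog data set
toy_rows = [
  { mo: 0, tu: 0, we: 0, th: 1.0, fr: 0, sa: 0, su: 0 },
  { mo: 0, tu: 0, we: 1.0, th: 0, fr: 0, sa: 0, su: 0 },
  { mo: 0, tu: 0, we: 0, th: 0, fr: 0, sa: 0, su: 0 }
]

p collapse_indicators(toy_rows, day_levels)
```

Using a string default keeps the resulting column of a single type, which may be convenient when the column is later used as a grouping variable.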


As can be seen in the above output, the length of the text in a blog post is often given as zero. These are probably missing values, so we drop those observations.

We also delete observations that have zero comments in the first 24 hours after publication, to comply with the research objective stated at the beginning.

In [4]:
nonzero_ind = blog_data[:length].each_index.select do |i| 
  blog_data[:length][i] > 0 && blog_data[:comments][i] > 0
end
blog_data = blog_data.row[*nonzero_ind]
blog_data.nrows

22435
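
To see how many observations each of the two conditions removes, the filter can be sketched in plain Ruby on an array of hashes. The helper name `filter_posts` and the keys of the returned counts are hypothetical. A row that fails both conditions is counted only under `:zero_length`.

```ruby
# Keep rows with positive text length and a positive comment count, and
# count the reason for each dropped row.
def filter_posts(rows)
  dropped = Hash.new(0)
  kept = rows.select do |row|
    if row[:length] <= 0
      dropped[:zero_length] += 1
      false
    elsif row[:comments] <= 0
      dropped[:zero_comments] += 1
      false
    else
      true
    end
  end
  [kept, dropped]
end

# Hypothetical toy rows, not taken from the blog data set
toy_rows = [
  { length: 0.0,    comments: 2.0 },
  { length: 3501.0, comments: 0.0 },
  { length: 3501.0, comments: 74.0 }
]

kept, dropped = filter_posts(toy_rows)
puts "kept #{kept.size} rows, dropped #{dropped.inspect}"
```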

Clearly, the variable `parents`, which denotes the number of parent blog posts, is highly correlated with the variable `parents_comments`, which denotes the average number of comments that the parents of a blog post received. Therefore, we shouldn't include both of these variables in the linear mixed model.

Thus, we combine the variables `parents` and `parents_comments` into one variable called `has_parent_with_comments`, which indicates whether a blog post has at least one parent post with at least one comment.

In [5]:
# create a binary indicator vector specifying if a blog post has at least 
# one parent post which has comments
hpwc = (blog_data[:parents] * blog_data[:parents_comments]).to_a
blog_data[:has_parent_with_comments] = hpwc.map { |t| t == 0 ? 'n' : 'y'} 
blog_data.delete_vector(:parents)
blog_data.delete_vector(:parents_comments)
blog_data.head 3

Daru::DataFrame:69851075521420 rows: 3 cols: 6

| | host_comments_avg | host_trackbacks_avg | comments | length | day | has_parent_with_comments |
|---|---|---|---|---|---|---|
| 1221 | 110.30087 | 0.0 | 74.0 | 3501.0 | we | n |
| 1222 | 110.30087 | 0.0 | 74.0 | 3501.0 | we | n |
| 1223 | 110.30087 | 0.0 | 218.0 | 4324.0 | th | n |
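
The correlation between `parents` and `parents_comments` mentioned above is not computed in this notebook. A minimal sketch of how it could be checked with a plain-Ruby Pearson correlation is given below. The helper name `pearson` is hypothetical, and the check would have to run before the two columns are deleted, for example on `blog_data[:parents].to_a` and `blog_data[:parents_comments].to_a`.

```ruby
# Pearson correlation coefficient of two equally long numeric arrays.
# Returns NaN if either array has zero variance.
def pearson(xs, ys)
  if xs.empty? || xs.size != ys.size
    raise ArgumentError, 'arrays must be non-empty and of equal length'
  end
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Hypothetical toy vectors, not taken from the blog data set
puts pearson([0, 1, 2, 3], [0.0, 2.0, 4.5, 6.0])
```

A value close to 1 or -1 would support dropping one of the two variables or combining them, as is done here.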


## Fit a linear mixed model

In [6]:
require 'mixed_models'
model_fit = LMM.from_formula(formula: "comments ~ host_comments_avg + length + has_parent_with_comments + (1 | day)", data: blog_data)
model_fit.fix_ef

starting iteration 0
starting iteration 1
starting iteration 2
starting iteration 3
starting iteration 4
starting iteration 5
starting iteration 6
starting iteration 7
starting iteration 8
starting iteration 9
starting iteration 10
starting iteration 11
starting iteration 12
starting iteration 13
starting iteration 14
starting iteration 15
starting iteration 16
starting iteration 17
starting iteration 18
starting iteration 19


{:intercept=>40.827178187509304, :host_comments_avg=>0.28664521910753954, :length=>0.0007307403890011964, :has_parent_with_comments_lvl_y=>-20.27113218068513}
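
To make the meaning of these coefficients concrete, the following sketch computes a population-level prediction by hand, that is, with the random intercept for `day` set to zero. The helper name `predict_comments` is hypothetical; the coefficients are copied from the output above, and the example post uses the values of row 1221 shown earlier.

```ruby
# Fixed effects estimates copied from the model output above
fixed_effects = {
  intercept: 40.827178187509304,
  host_comments_avg: 0.28664521910753954,
  length: 0.0007307403890011964,
  has_parent_with_comments_lvl_y: -20.27113218068513
}

# Population-level prediction: linear combination of the fixed effects only.
# The day-specific random intercept is assumed to be zero here.
def predict_comments(coef, host_comments_avg:, length:, has_parent_with_comments:)
  parent_effect = has_parent_with_comments == 'y' ? coef[:has_parent_with_comments_lvl_y] : 0.0
  coef[:intercept] +
    coef[:host_comments_avg] * host_comments_avg +
    coef[:length] * length +
    parent_effect
end

puts predict_comments(fixed_effects,
                      host_comments_avg: 110.30087,
                      length: 3501.0,
                      has_parent_with_comments: 'n')
```

Read this way, a post with a commented parent is predicted to receive about 20 fewer comments than an otherwise identical post without one, all else being equal.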

### Assess the quality of the fit

In [10]:
require 'gnuplotrb'
include GnuplotRB

x, y = model_fit.fitted, model_fit.residuals
fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],
                               xlabel: 'Fitted', ylabel: 'Residuals')

LoadError: cannot load such file -- gnuplotrb
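
The `LoadError` above means that the `gnuplotrb` gem could not be found in this environment. A minimal sketch of one way to cope with that is given below: the `require` is guarded, and a text-only summary of residuals by fitted value serves as a rough stand-in for the plot. Both helper names, `plotting_available?` and `residual_summary`, are hypothetical and are not part of `mixed_models` or `gnuplotrb`.

```ruby
# Returns true if the optional plotting gem can be loaded, false otherwise.
def plotting_available?
  require 'gnuplotrb'
  true
rescue LoadError
  false
end

# Text-only stand-in for a fitted-vs-residuals plot: sort the observations by
# fitted value, split them into at most `bins` groups, and report the range of
# fitted values and the mean residual of each group.
def residual_summary(fitted, residuals, bins: 4)
  pairs = fitted.zip(residuals).sort_by(&:first)
  return [] if pairs.empty?
  group_size = (pairs.size / bins.to_f).ceil
  pairs.each_slice(group_size).map do |group|
    { fitted_range: [group.first[0], group.last[0]],
      mean_residual: group.sum { |_, r| r } / group.size.to_f }
  end
end

unless plotting_available?
  # Hypothetical toy values; with the fitted model one would pass
  # model_fit.fitted and model_fit.residuals instead.
  residual_summary([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.0, -1.0], bins: 2).each do |row|
    puts row.inspect
  end
end
```

Mean residuals that drift away from zero as the fitted values grow would hint at a poor fit, which is the same pattern one would look for in the plot.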