<br><br><br>
<span style="color:red;font-size:60px">Collaborative filtering example</span>
<br><br>

In <span style="color:blue">collaborative filtering</span>, an algorithm uses the choices a user has already made, together with the choices made by a large set of other users, to make recommendations for that user.

<br><br><br>
<span style="color:green;font-size:xx-large">Author recommendation system</span>

<img src="people authors graph.png">

<li><span style="color:red">Initial data</span>: Ratings given by readers to authors. Not every reader rates every author so this graph is likely to be <span style="color:red">sparsely connected</span></li>
<li>Vertices in this graph are either authors or readers, and edges are "author reader connections". Edge attributes are the ratings (1 to 5) given to the author</li>
<li>The objective of the <span style="color:red">recommender system</span> is to predict the rating that a reader will give to an unrated (by them) author</li>

<br><br><br>
<span style="color:green;font-size:xx-large">SVD++ and collaborative filtering</span>
<br><br>



<li>Collaborative filtering starts with a sparse matrix with people on one axis and authors on the other axis (since most people will have read only a few of the millions of authors, this matrix is sparse)</li>
<li>The general idea is to start by assuming that the authors (or movies, books, brands of cereal, etc.) can be grouped into k classes that represent some latent attribute of the authors (genres, for example) and then to decompose the large sparse matrix into an n x k matrix and a k x m matrix (n = number of people, m = number of authors, k = number of latent attributes)</li>
<li>SVD++ combines singular value decomposition with graph neighborhood models to compute factor weightings in the decomposed matrix</li> 
<li>If interested, see <a href="https://people.engr.tamu.edu/huangrh/Spring16/papers_course/matrix_factorization.pdf">https://people.engr.tamu.edu/huangrh/Spring16/papers_course/matrix_factorization.pdf</a></li>
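
To make the decomposition above concrete, here is a minimal sketch in plain Scala (no Spark needed). The factor values, the sizes (3 readers, 4 authors, k = 2) and the helper name `multiply` are all made up for illustration. It only shows how multiplying an n x k matrix by a k x m matrix yields a dense n x m matrix of approximate ratings; it does not show how the factors are learned.

```scala
// Hypothetical toy factors: 3 readers, 4 authors, k = 2 latent attributes.
// n x k: one row of latent-attribute weights per reader
val readerFactors: Array[Array[Double]] = Array(
  Array(1.0, 0.0),
  Array(0.0, 1.0),
  Array(0.5, 0.5))

// k x m: one column of latent-attribute weights per author
val authorFactors: Array[Array[Double]] = Array(
  Array(4.0, 1.0, 5.0, 2.0),
  Array(1.0, 5.0, 2.0, 4.0))

// Multiply (n x k) by (k x m): each cell is the dot product of a reader row
// with an author column.
def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
  a.map(row => b.transpose.map(col => row.zip(col).map { case (x, y) => x * y }.sum))

val approx = multiply(readerFactors, authorFactors)
approx.foreach(row => println(row.mkString(" ")))
```

Every cell of `approx` is filled in, including reader/author pairs that had no rating in the sparse matrix. Those filled-in cells are what a recommender uses as predictions.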




<span style="color:blue;font-size:large">Example of the factorized graph</span><br><br>
<img src="factors.png">

<br><br><br>
<span style="color:green;font-size:xx-large">Graph setup</span>
<br><br>




In [None]:
%%init_spark
launcher.packages= ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]

In [None]:
//GraphFrame imports
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._


//GraphX imports
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD





<span style="color:blue;font-size:large">GraphX or GraphFrames</span>
<li>Both GraphX and GraphFrames have an implementation of the SVD++ algorithm</li>
<li>But, I just can't figure out what exactly the GraphFrames version returns (no documentation!)</li>
<li>So, we'll use GraphX for now!</li>

In [None]:
val people = Array((1L,"John"),
                       (2L,"Isabella"),
                       (3L,"Qing"),
                       (4L,"Bathsheba"),
                       (5L,"Akaash"),
                       (6L,"Pablo"),
                       (7L,"Ludovica"))

val authors = Array((100L,"Murakami"),
                   (101L,"Adams"),
                   (103L,"Liu"),
                   (104L,"Pachinko"),
                   (105L,"Kawabata"),
                   (106L,"Hardy"))

val vertexArray = people++authors

val edgeArray = Array(Edge(1L,100L,4.0),
                     Edge(1L,103L,5.0),
                     Edge(2L,104L,2.0),
                     Edge(2L,106L,3.0),
                     Edge(3L,101L,1.0),
                     Edge(4L,105L,5.0),
                     Edge(4L,104L,3.0),
                     Edge(5L,100L,2.0),
                     Edge(5L,105L,4.0),
                     Edge(6L,101L,1.0),
                     Edge(7L,103L,3.0),
                     Edge(7L,105L,4.0))

val vertexRDD: RDD[(Long, String)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Double]] = sc.parallelize(edgeArray)



<span style="color:green;font-size:xx-large">Run SVD++</span>



<span style="color:blue;font-size:large">Set the hyperparameters</span>

In [None]:
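val config = new lib.SVDPlusPlus.Conf(rank=2, //number of latent factors
                                    maxIters=10, //number of training iterations
                                    minVal=0, //lowest rating the model may predict
                                    maxVal=5, //highest rating the model may predict
                                    gamma1=0.007, //learning rate for the bias terms
                                    gamma2=0.007, //learning rate for the latent factors
                                    gamma6=0.005, //regularization on the bias terms (limits overfitting)
                                    gamma7=0.015) //regularization on the latent factors (see paper!)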
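
As a rough guide to what the four gamma values do, the update rules below are my reading of how SVD++ training uses them. Treat the exact form as an assumption rather than a specification of the GraphX implementation. Here $r_{ui}$ is an observed rating, $\hat{r}_{ui}$ the current prediction, $b_u$ and $b_i$ the user and item biases, $p_u$ and $q_i$ the user and item factor vectors, and $\tilde{p}_u$ the composite user vector that includes the implicit-feedback terms.

$$
\begin{aligned}
e_{ui} &= r_{ui} - \hat{r}_{ui} \\
b_u &\leftarrow b_u + \gamma_1\,(e_{ui} - \gamma_6\, b_u) \\
b_i &\leftarrow b_i + \gamma_1\,(e_{ui} - \gamma_6\, b_i) \\
p_u &\leftarrow p_u + \gamma_2\,(e_{ui}\, q_i - \gamma_7\, p_u) \\
q_i &\leftarrow q_i + \gamma_2\,(e_{ui}\, \tilde{p}_u - \gamma_7\, q_i)
\end{aligned}
$$

Under this reading, gamma1 and gamma2 set the step size, while gamma6 and gamma7 shrink the parameters toward zero so that they do not overfit the few observed ratings.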

<span style="color:blue;font-size:large">Run the model</span>
<li>The model returns a graph and the mean rating for the dataset</li>

In [None]:

//val conf = new lib.SVDPlusPlus.Conf(2,10,0,5,0.007,0.007,0.005,0.015)
val (g,mean) = lib.SVDPlusPlus.run(edgeRDD,config)

<span style="color:blue;font-size:large">Analyze results</span>
<li>SVD++ returns a GraphX graph that corresponds to the original graph (13 vertices, 12 edges)</li>
<li>The edges are the original edges</li>
<li>Each vertex carries its vertex id along with a tuple of 4 pieces of data</li>
<li>The technical meaning of these things is best left to the linked paper but, roughly:</li>
<ul>
    <li>the first is an array of factor loadings (user to latent factor or item to latent factor)</li>
    <li>the second is also an array. For users, it is a composite of the user's factor loadings and their rating behavior (which authors the user chose to rate); this is the array used when predicting. For items, it captures the effect of being rated at all (the degree to which an author is rated)</li>
    <li>the third is a bias adjustment value that captures the user or item bias (i.e., if the user rates all authors harshly then the bias pulls their predicted ratings down, and if an author is generally disliked then the bias brings down their predicted rating for a new user)</li>
    <li>the fourth is an internal bookkeeping value from training; the predictions below do not use it</li>
</ul>
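
Written as a formula, the pieces above combine into the usual SVD++ prediction rule. This is a sketch that assumes the GraphX output follows the standard form of the model:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\Big(p_u + |N(u)|^{-1/2}\sum_{j \in N(u)} y_j\Big)
$$

Here $\mu$ is the mean rating returned by the model, $b_u$ and $b_i$ are the third tuple elements of the user and item vertices, $q_i$ is the first element of the item vertex, and the whole parenthesized term (the user's own factors $p_u$ plus the scaled sum of implicit factors $y_j$ over the set $N(u)$ of authors the user rated) corresponds to the second element of the user vertex.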

In [None]:
//results for user Ludovica
g.vertices.filter(_._1==7L).collect()(0)


In [None]:
//Results for author Adams
g.vertices.filter(_._1==101L).collect()(0)


In [None]:
val u = 7L
val i = 104L
val user = g.vertices.filter(_._1 == u).collect()(0)._2 //This gives the user attributes from the graph
val item = g.vertices.filter(_._1 == i).collect()(0)._2 //This gives the item attributes from the graph


In [None]:
g.vertices.filter(_._1 == u).collect()(0)._2

In [None]:
user._3

In [None]:
item._3

In [None]:
item._1

In [None]:
user._2

In [None]:
item._1.zip(user._2)

In [None]:
item._1.zip(user._2).map(x => x._1 * x._2).reduce(_ + _) 

<span style="color:blue;font-size:large">Using the graph, calculate the rating a user would give to an author</span>

In [None]:
def pred(g:Graph[(Array[Double], Array[Double], Double, Double),Double],
         mean:Double, u:Long, i:Long) = {
  val user = g.vertices.filter(_._1 == u).collect()(0)._2 //This gives the user attributes from the graph
  val item = g.vertices.filter(_._1 == i).collect()(0)._2 //This gives the item attributes from the graph
  mean + user._3 + item._3 +  //mean rating + user bias + item bias
    item._1.zip(user._2).map(x => x._1 * x._2).reduce(_ + _)
    //item._1 is the item to factor loadings (the same array used in the cells above)
    //user._2 is the composite user to factor loadings
    //We zip these together, multiply pairwise, and add up the products (a dot product)
    //The result is added to the mean rating and the two biases
}

pred(g, mean, 7L, 101L)

<span style="color:blue;font-size:large">All ratings for a particular user</span>

In [None]:
val user = "Ludovica"
val user_id = vertexRDD.filter(v => v._2==user).collect()(0)._1
val all_preds = authors.map(l=>pred(g,mean,user_id,l._1)).zip(authors.map(l=>l._2))
all_preds.sortBy(-_._1).foreach {l =>
    println(user + " rates " + l._2.toString + " " + l._1.toString)
}
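
Each call to pred above launches two Spark jobs (one filter and collect per vertex), which adds up when ranking many authors. For a graph this small, one alternative is to collect the vertex attributes once into a local Map and predict from that. The sketch below is plain Scala with made-up numbers standing in for `g.vertices.collect().toMap`; the names `Attr`, `attrs`, `predictLocal` and `toyMean` are hypothetical. Clamping the result to the minVal/maxVal range from the configuration is my addition, on the assumption that predictions should stay inside the rating scale.

```scala
// Vertex attribute layout: (factors, composite factors, bias, bookkeeping value)
type Attr = (Array[Double], Array[Double], Double, Double)

// Hypothetical stand-in for g.vertices.collect().toMap; all numbers are made up.
val attrs: Map[Long, Attr] = Map(
  7L   -> (Array(0.1, 0.2), Array(0.5, 0.4), 0.3, 0.7),
  101L -> (Array(0.6, 0.2), Array(0.1, 0.1), -1.5, 0.0),
  105L -> (Array(0.8, 0.5), Array(0.2, 0.3), 1.5, 0.0))

def predictLocal(attrs: Map[Long, Attr], mean: Double, u: Long, i: Long,
                 minVal: Double = 0.0, maxVal: Double = 5.0): Double = {
  val user = attrs(u)
  val item = attrs(i)
  // Dot product of the item factors with the composite user factors
  val dot = item._1.zip(user._2).map { case (q, p) => q * p }.sum
  val raw = mean + user._3 + item._3 + dot
  // Keep the prediction inside the rating scale (an assumption, see the text above)
  math.min(maxVal, math.max(minVal, raw))
}

val toyMean = 3.0
// Rank two hypothetical authors for reader 7, highest predicted rating first
val ranked = Seq(101L, 105L)
  .map(i => (i, predictLocal(attrs, toyMean, 7L, i)))
  .sortBy(-_._2)
ranked.foreach { case (i, r) => println(s"author $i: $r") }
```

This only makes sense when all vertex attributes fit comfortably in driver memory, as they do in this toy example.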