# Implementation of PLSI algorithm using Spark 

In [1]:
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt
import findspark
findspark.init("spark-2.2.3-bin-hadoop2.6")
import pyspark
from numpy import random

## 1. Prepare the environment

### Import the data

We will implement the PLSI algorithm on the MovieLens dataset. To keep the task manageable, we start with a reduced version of the dataset ("ratings_short"), which contains 100,836 observations.

In [2]:
ratings_data = pd.read_csv("ratings.csv")
ratings_data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**Description of the dataset:**
- userId: identifies the user
- movieId: identifies the movie
- rating: the rating the user gave to the corresponding movie. Ratings range from 0.5 to 5, but here we only use the seen / not seen information to provide movie recommendations.
- timestamp: we will not use this column

More information about the movies is available in the "movies.csv" dataset: the movieId gives access to the corresponding movie information, such as the title and the genres.

In [3]:
movies = pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
####MORE DESCRIPTIONS ON THE DATA

### Create a Spark environment

In [5]:
sc = pyspark.SparkContext()

To implement the PLSI algorithm in Spark, we need to load the dataset as an RDD and then apply PySpark operations to it.

In [17]:
rdd = sc.textFile("ratings.csv")

#Remove header line 
header = rdd.first()
rdd = rdd.filter(lambda x: x != header)

rdd.take(10)  # fetch only the first 10 records instead of collecting the whole RDD

['1,1,4.0,964982703',
 '1,3,4.0,964981247',
 '1,6,4.0,964982224',
 '1,47,5.0,964983815',
 '1,50,5.0,964982931',
 '1,70,3.0,964982400',
 '1,101,5.0,964980868',
 '1,110,4.0,964982176',
 '1,151,5.0,964984041',
 '1,157,5.0,964984100']

## 2. Probabilistic Latent Semantic Indexing algorithm (PLSI)

The PLSI algorithm that we implement here is based on the description by Das et al. of the Google News recommendation system. The algorithm relies on the following model:
- u (users) and s (movies) are random variables
- The relationship between users and movies is learned by modeling the joint distribution of users and items as a mixture distribution
- To capture this relationship, we introduce a hidden (latent) variable z that represents both user communities (users with similar preferences) and movie communities (movies of similar genres).

In the end, we want to compute, for each (user, movie) pair, the probability that a given user sees a given movie: p(s|u) = ∑_z p(s|z) p(z|u). It is obtained by summing, over all communities z, the probability that movie s is seen within community z times the probability that user u belongs to community z.
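To make the mixture formula concrete, here is a minimal sketch in plain Python on toy values. The users, movies, and probabilities below are hypothetical and chosen only for illustration; they are not learned from the MovieLens data.

```python
# Toy parameters (hypothetical values, NOT learned from MovieLens).
# p_z_given_u[u][z]: probability that user u belongs to community z.
p_z_given_u = {
    "u1": [0.8, 0.2],
    "u2": [0.1, 0.9],
}
# p_s_given_z[z][s]: probability that movie s is seen within community z.
p_s_given_z = [
    {"m1": 0.7, "m2": 0.2, "m3": 0.1},  # community 0
    {"m1": 0.1, "m2": 0.3, "m3": 0.6},  # community 1
]

def p_s_given_u(user, movie):
    """Mixture: sum over z of p(s|z) * p(z|u)."""
    return sum(
        p_s_given_z[z][movie] * p_zu
        for z, p_zu in enumerate(p_z_given_u[user])
    )

# Score every movie for user u1; the scores form a distribution over movies.
scores = {m: p_s_given_u("u1", m) for m in ("m1", "m2", "m3")}
```

For user u1, movie m1 gets 0.7 × 0.8 + 0.1 × 0.2 = 0.58, and the three scores sum to 1 because each p(s|z) and p(z|u) is itself a probability distribution.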

### Split into training and testing set

In [7]:
nb_cluster = sc.broadcast(3) #Number of latent variables z (number of communities)
nb_iter = 30 #Number of iterations of the EM algorithm 

In [8]:
def parseLine(line):
    line = line.split(',')
    # Keep only the first two fields (userId, movieId): every logged rating counts as a click, i.e. a 1-rating
    line = line[0]+','+line[1]
    return line

In [18]:
rdd = rdd.map(parseLine)
rdd.take(10)

['1,1',
 '1,3',
 '1,6',
 '1,47',
 '1,50',
 '1,70',
 '1,101',
 '1,110',
 '1,151',
 '1,157']

We will discuss later how to choose the number of clusters and the number of iterations. First, we keep only the relevant information, i.e. the userIds and movieIds (we don't need the ratings, since we only want to know whether a user has seen a movie or not).
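As a possible variant of this parsing step, the pair can be kept as a tuple rather than a comma-joined string, so that later steps do not have to split the key again. This is only a sketch: `parse_pair` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def parse_pair(line):
    """Hypothetical helper: return (userId, movieId) from one CSV line."""
    user_id, movie_id = line.split(",")[:2]
    return (user_id, movie_id)

# Two sample lines in the same format as the RDD records shown earlier.
sample = ["1,1,4.0,964982703", "1,47,5.0,964983815"]
pairs = [parse_pair(line) for line in sample]
print(pairs)  # [('1', '1'), ('1', '47')]
```

With an RDD, the same function could presumably be passed to `rdd.map(parse_pair)`; the rest of the notebook would then need tuple keys instead of string keys.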

In [20]:
#Split data into train/test
alpha = 0.1
train_test = rdd.randomSplit([1-alpha, alpha], seed=42)
train, test = train_test[0], train_test[1]

#### Initialisation of q*:

We now want to associate each (user, movie) pair with a community z. More precisely, we give each pair a probability of belonging to each community. We first initialize these probabilities randomly, then let the EM algorithm make them converge.

In [23]:
def cartesianProd(us):
    # Build one "user,movie,z" key per community z
    return [us+','+str(z) for z in range(nb_cluster.value)]

In [24]:
q = train.flatMap(cartesianProd).map(lambda x: (x, random.rand()))
q.take(10)

[('1,1,0', 0.1175353045440819),
 ('1,1,1', 0.3517448359632236),
 ('1,1,2', 0.7037494495663035),
 ('1,3,0', 0.2750808379494788),
 ('1,3,1', 0.7809704952503798),
 ('1,3,2', 0.705114597064498),
 ('1,6,0', 0.34927008770956347),
 ('1,6,1', 0.357667747791937),
 ('1,6,2', 0.4550823262042012),
 ('1,47,0', 0.41240055310184887)]
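The values drawn above do not sum to 1 over z for a given (user, movie) pair. The first M-step still produces properly normalized p(s|z) and p(z|u), but if a normalized start is preferred, the initialization could look like the following plain-Python sketch. `init_q` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

def init_q(pairs, nb_cluster, seed=0):
    """Hypothetical helper: random q*(z; u, s) that sums to 1 over z for each pair."""
    rng = random.Random(seed)  # seeded generator, so the initialization is reproducible
    q = {}
    for u, s in pairs:
        weights = [rng.random() for _ in range(nb_cluster)]
        total = sum(weights)
        for z, w in enumerate(weights):
            q[(u, s, z)] = w / total
    return q

pairs = [("1", "1"), ("1", "3"), ("2", "1")]
q0 = init_q(pairs, nb_cluster=3)
```

Seeding the generator also makes runs comparable, which helps when tuning the number of clusters.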

#### EM-algorithm :

In [27]:
loglik = []

for k in range(nb_iter):

    ##M-step (computation of the N(s,z)/N(z) = p(s|z) function)

    # return (s,z, N(s,z)) with (s,z) as keys and by summing all the users
    Nsz = q.map(lambda q : (q[0].split(',')[1]+','+q[0].split(',')[2],q[1])).reduceByKey(lambda x,y : x+y)

    # return (z, N(z)=∑N(s,z)) with z as key and by summing N(s,z) on all the movies
    Nz = Nsz.map(lambda N : (N[0].split(',')[1], N[1])).reduceByKey(lambda x,y : x+y)

    # return (s,z, N(s,z)/N(z)) with (s,z) as keys
    Nsz = Nsz.map(lambda x : (x[0].split(',')[1], (x[0].split(',')[0],x[1])))
    tmpN = Nsz.join(Nz) 
    Nsz_normalized = tmpN.map(lambda x : (x[1][0][0]+','+x[0], x[1][0][1]/x[1][1]))
    
    
    ##M-step (computation of the p(z|u) function)
    
    # return (u,z, p(z|u)) with (u,z) as keys
    Puz = q.map(lambda q : (q[0].split(',')[0]+','+q[0].split(',')[2],q[1])).reduceByKey(lambda x,y : x+y)
    
    # return (u, ∑p(z|u)) with u as key
    Pu = Puz.map(lambda p : (p[0].split(',')[0], p[1])).reduceByKey(lambda x,y : x+y)
    
    # return (u, z, p(z|u)=p(z|u)/∑p(z|u)) with (u,z) as keys 
    Puz = Puz.map(lambda x : (x[0].split(',')[0], (x[0].split(',')[1],x[1])))
    tmpP = Puz.join(Pu)
    Puz = tmpP.map(lambda x : (x[0]+','+x[1][0][0], x[1][0][1]/x[1][1]))
    
    ##E-step (computation of the p(s|z)*p(z|u)/{∑p(s|z)*p(z|u)} function)
    
    #join q(u,s;z) & p(z|u) on u & z - and forget old value of q*
    tmpQ = q.map(lambda x : (x[0].split(',')[0]+','+x[0].split(',')[2] , x[0].split(',')[1]))
    tmpQ = tmpQ.join(Puz)
    
    #join q(u,s;z) & N(s,z)/N(z) on s & z
    tmpQ = tmpQ.map(lambda x : (x[1][0]+','+x[0].split(',')[1] , (x[0].split(',')[0],x[1][1])))
    tmpQ = tmpQ.join(Nsz_normalized)
    
    #return ((u,s;z), p(s|z)*p(z|u)) =  ((u,s;z), q*~) with : q*~ = p(s|z)*p(z|u)
    tmpQ = tmpQ.map(lambda x : (x[1][0][0]+','+x[0], x[1][0][1]*x[1][1]))
    
    #return ((u,s), ∑p(s|z)*p(z|u)), summing over z
    sumTmpQ = tmpQ.map(lambda x : (x[0].split(',')[0]+','+x[0].split(',')[1],x[1])).reduceByKey(lambda x,y : x+y)
    
    # Average log-likelihood per (u,s) pair: (1/N) * ∑ log p(s|u)
    log = sumTmpQ.map(lambda x : np.log(x[1]))
    N = sumTmpQ.count()
    logLik = log.reduce(lambda x,y:x+y)
    print((1/N)*logLik)
    loglik.append(1/N*logLik)
    
    #return ((u,s,z), p(s|z)*p(z|u)/{∑p(s|z)*p(z|u)})
    tmpQ = tmpQ.map(lambda x : (x[0].split(',')[0]+','+x[0].split(',')[1], (x[0].split(',')[2],x[1])))
    tmpQ = tmpQ.join(sumTmpQ)
    q = tmpQ.map(lambda x : (x[0]+','+x[1][0][0], x[1][0][1]/x[1][1]))
    
    q.persist()

-8.113960532439867
-8.107403569613536
-8.107135302649311
-8.106702621830959


KeyboardInterrupt: 
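Since the Spark loop is slow to run, it can help to check the update rules on a tiny example first. The sketch below mirrors the M-step and E-step above in plain Python, assuming a handful of hypothetical (user, movie) pairs and two communities; `em_step` is an illustrative helper, not part of the Spark code.

```python
import math
import random
from collections import defaultdict

def em_step(q):
    """One EM iteration; q maps (u, s, z) -> q*(z; u, s)."""
    n_sz, n_z = defaultdict(float), defaultdict(float)  # N(s,z) and N(z)
    p_uz, p_u = defaultdict(float), defaultdict(float)  # unnormalized p(z|u) and its sum over z
    for (u, s, z), w in q.items():
        n_sz[(s, z)] += w
        n_z[z] += w
        p_uz[(u, z)] += w
        p_u[u] += w
    # E-step numerator: p(s|z) * p(z|u), with p(s|z) = N(s,z)/N(z)
    unnorm = {
        (u, s, z): (n_sz[(s, z)] / n_z[z]) * (p_uz[(u, z)] / p_u[u])
        for (u, s, z) in q
    }
    # p(s|u) = sum over z of p(s|z) * p(z|u), for each observed pair
    totals = defaultdict(float)
    for (u, s, z), w in unnorm.items():
        totals[(u, s)] += w
    avg_loglik = sum(math.log(t) for t in totals.values()) / len(totals)
    new_q = {k: w / totals[k[:2]] for k, w in unnorm.items()}
    return new_q, avg_loglik

# Hypothetical toy data: 5 observed pairs, 2 communities, random start.
pairs = [("u1", "m1"), ("u1", "m2"), ("u2", "m2"), ("u2", "m3"), ("u3", "m1")]
rng = random.Random(0)
q = {(u, s, z): rng.random() for u, s in pairs for z in range(2)}
history = []
for _ in range(10):
    q, ll = em_step(q)
    history.append(ll)
```

On such a toy input the average log-likelihood in `history` should never decrease from one iteration to the next, which is the same behaviour as the values printed by the Spark loop.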

Let us plot the log-likelihood to see how the algorithm converges:

In [None]:
fig = plt.figure()
fig.suptitle('log-likelihood')  # Figure has no title() method; suptitle() sets the figure title
ax = fig.add_subplot(111)
ax.set_xlabel('number of iterations')
ax.set_ylabel('log-likelihood')
ax.plot(loglik)
plt.show()