This reference code shows how collaborative filtering could be used to predict fund managers' interest in funds. In this model, `mgrId` identifies the fund manager and `acctId` identifies the fund. We repurpose the MovieLens data for this example to demonstrate the training syntax. In a production model, each fund manager's rating of a fund would have to be derived with a heuristic that represents interest based on existing holdings and recent activity.
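
As a rough illustration of what such a heuristic might look like, the sketch below blends position size with trading recency into a 0 to 5 score. Everything here is an assumption made for illustration: the function name `interest_score`, its inputs, the equal weighting, and the 90-day half-life are hypothetical and are not part of this notebook or of any production model.

```python
def interest_score(holding_weight, days_since_last_trade, half_life_days=90.0):
    """Hypothetical heuristic mapping a manager's position in a fund to a 0-5 score.

    holding_weight: fraction of the manager's portfolio held in the fund (0.0 to 1.0).
    days_since_last_trade: days since the manager last traded the fund.
    half_life_days: assumed decay rate; the recency weight halves every this many days.
    """
    # Recent activity counts more: 1.0 for a trade today, 0.5 after one half-life.
    recency = 0.5 ** (days_since_last_trade / half_life_days)
    # Clamp the holding weight to [0, 1], then blend it equally with recency.
    clamped_weight = min(max(holding_weight, 0.0), 1.0)
    raw = 0.5 * clamped_weight + 0.5 * recency
    # Scale to a 0-5 range so it can stand in for the "rating" column.
    return round(5.0 * raw, 2)


# A manager with no holding who last traded the fund 90 days ago.
score = interest_score(0.0, 90)
```

The resulting scores would fill the `rating` column in place of the movie ratings used in this demo.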

---


First, we'll create our ratings data:

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

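# Read the raw ratings file; each line has four fields separated by "::"
lines = spark.read.text("/FileStore/shared_uploads/brad.barker@databricks.com/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
# Relabel the first two fields (MovieLens user and movie IDs) as mgrId and acctId
ratingsRDD = parts.map(lambda p: Row(mgrId=int(p[0]), acctId=int(p[1]),
                                     rating=float(p[2]), timestamp=str(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
# Hold out roughly 20% of the ratings for evaluation
(training, test) = ratings.randomSplit([0.8, 0.2])
display(ratings.limit(5))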

mgrId,acctId,rating,timestamp
0,2,3.0,1424380312
0,3,1.0,1424380312
0,5,2.0,1424380312
0,9,4.0,1424380312
0,11,1.0,1424380312


In [0]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="mgrId", itemCol="acctId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)

In [0]:
# The model is fit. Under the covers, ALS learns two latent factor matrices (one for mgrId, one for acctId);
# multiplying them yields a predicted rating for every combination of mgrId and acctId.
predictions = model.transform(test)
display(predictions.select('mgrId','acctId','prediction').limit(5))

mgrId,acctId,prediction
28,2,1.1121767
28,7,-2.2989483
28,14,-0.57560885
28,15,1.2969652
28,19,1.7296705
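
The latent-factor multiplication mentioned in the cell above can be illustrated without Spark. The sketch below uses made-up rank-2 factors (not values learned by this model) to show that each prediction is the dot product of one manager vector and one fund vector. In a fitted model the real vectors are available as `model.userFactors` and `model.itemFactors`.

```python
# Made-up latent factors for illustration only; rank 2, two managers, three funds.
mgr_factors = {0: [1.0, 0.5], 1: [0.2, 1.5]}                    # keyed by mgrId
acct_factors = {0: [2.0, 0.0], 1: [1.0, 1.0], 2: [0.0, 2.0]}    # keyed by acctId


def predict(mgr_id, acct_id):
    """Predicted rating = dot product of the manager and fund factor vectors."""
    return sum(u * v for u, v in zip(mgr_factors[mgr_id], acct_factors[acct_id]))


# Score every (mgrId, acctId) combination, as the full matrix product would.
all_predictions = {(m, a): predict(m, a) for m in mgr_factors for a in acct_factors}
```

Nothing constrains these dot products to the range of the training ratings, which is why some predictions in the output above are negative.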


In [0]:
# We will also want to evaluate the performance of the predictions on the test data:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.813362543634652
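
For readers unfamiliar with the metric, here is a minimal sketch of what the `rmse` setting computes, written in plain Python with made-up numbers rather than this notebook's data: the square root of the mean squared difference between actual and predicted ratings.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration: errors are 1, 0, -2, 0.
example_ratings = [3.0, 1.0, 2.0, 4.0]
example_predictions = [2.0, 1.0, 4.0, 4.0]
example_rmse = rmse(example_ratings, example_predictions)  # sqrt(5 / 4), about 1.118
```

The result is in the same units as the ratings themselves, so it can be read as a typical prediction error in rating points.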


In [0]:
# To find the fund managers most interested in specific funds, we first create a DataFrame with the subset of funds we are selling, say acctId 5 and 6:
import pandas as pd

our_funds_id = spark.createDataFrame(pd.DataFrame(data={'acctId': [5,6]}))
display(our_funds_id)

acctId
5
6


In [0]:
# We can now pick out the 10 fund managers with the highest predicted rating for each of our funds:
display(model.recommendForItemSubset(our_funds_id, 10))

acctId,recommendations
5,"List(List(22, 4.030378), List(3, 3.4047587), List(16, 2.8394647), List(7, 2.1704905), List(0, 2.1675959), List(18, 2.0180101), List(26, 1.9976994), List(24, 1.8300114), List(20, 1.8080615), List(15, 1.6994258))"
6,"List(List(26, 3.012682), List(24, 2.8385303), List(3, 2.7363763), List(23, 2.083785), List(16, 1.9774854), List(1, 1.90662), List(11, 1.8214415), List(22, 1.7989099), List(0, 1.6934865), List(29, 1.6916502))"


**NOTE**: Depending on the data used to represent current interest, it's possible we'll want to train with implicit rather than explicit feedback (the `implicitPrefs` parameter of `ALS`). The demo uses explicit feedback because the movie ratings in the demo data were given explicitly by users.
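
A minimal sketch of how the implicit variant might be configured, assuming the same column names as above. The names `implicit_als_params` and `fit_implicit` are hypothetical, and `alpha=1.0` is simply the library default, not a tuned value. With `implicitPrefs=True`, ALS treats the `rating` column as the strength of an observed interaction rather than as an explicit score, and `alpha` sets the baseline confidence in those observations.

```python
# Hypothetical parameter set for the implicit-feedback variant of the model above.
implicit_als_params = dict(
    maxIter=5,
    regParam=0.01,
    userCol="mgrId",
    itemCol="acctId",
    ratingCol="rating",        # interpreted as interaction strength when implicitPrefs=True
    coldStartStrategy="drop",
    implicitPrefs=True,        # switch from explicit ratings to implicit feedback
    alpha=1.0,                 # baseline confidence; library default, untuned
)


def fit_implicit(training_df):
    """Fit ALS with implicit feedback on a DataFrame shaped like `training` above."""
    # Imported here so the parameter set can be inspected without Spark installed.
    from pyspark.ml.recommendation import ALS
    return ALS(**implicit_als_params).fit(training_df)
```

In the notebook this would be called as `implicit_model = fit_implicit(training)`.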