# Yelp Recommender

## Intro

The purpose of this exercise is to use Spark in a real dataset, instead of just a toy example.

You will use the data from the [Yelp Dataset Challenge](https://www.yelp.de/dataset_challenge), which contains information about businesses, users, reviews and more.

For this exercise, you will need to focus only on the following files:
- yelp_academic_dataset_business.json
- yelp_academic_dataset_review.json

The goal is to build a recommender using Spark's ALS (Alternating Least Squares) and then generate recommendations for a given user.

Since the dataset is quite big, you should pick a business category (e.g. Restaurants) and a city (e.g. Edinburgh) and work on the recommender using only this subset of the data.

Please take some time to:
- find out what information you will need to feed as input to Spark's ALS
- check how this information is available in the dataset
- plan how you will tackle this problem

In [1]:
import pyspark
from pyspark import SparkContext, SQLContext


sc = pyspark.SparkContext('local[*]')
sqlc = SQLContext(sc)

In [3]:
people = sqlc.read.json("/media/diego/QData/datasets/yelp_dsr/yelp_academic_dataset_business.json")

In [4]:
type(people)

pyspark.sql.dataframe.DataFrame

## Business Data

- Load the file ***yelp_academic_dataset_business.json*** and select the following columns:
    - business_id
    - name
    - city
    - stars
    - categories



In [5]:
people.createOrReplaceTempView("people")
toronto  = sqlc.sql("select * from people where city='Toronto'")

In [6]:
toronto.toPandas()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,type
0,979 Bloor Street W,"[Alcohol: none, Ambience: {'romantic': False, ...",EDqCEAGXVGCH4FJXgqtjqg,"[Restaurants, Pizza, Chicken Wings, Italian]",Toronto,"[Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...",1,43.661054,-79.429089,Pizza Pizza,Dufferin Grove,M6H 1L5,7,2.5,ON,business
1,321 Jarvis Street,"[BusinessAcceptsCreditCards: True, Restaurants...",cdk-qqJ71q6P7TJTww_DSA,"[Hotels & Travel, Event Planning & Services, H...",Toronto,,1,43.659829,-79.375401,Comfort Inn,Downtown Core,M5B 2C2,8,3.0,ON,business
2,1000 Queen Street W,"[BusinessAcceptsCreditCards: True, BusinessPar...",YCsLfBVdLFeN2Necw1HPSA,"[Shoe Stores, Fashion, Men's Clothing, Shopping]",Toronto,,0,43.644258,-79.418829,Stussy,Ossington Strip,M6J 1H1,5,2.0,ON,business
3,"123 Front St, Unit 103 and 103-A","[Alcohol: none, Ambience: {'romantic': False, ...",826djy6K_9Fp0ptqJ2_Yag,"[Fast Food, Mexican, Restaurants]",Toronto,"[Monday 10:45-22:0, Tuesday 10:45-22:0, Wednes...",1,43.644920,-79.383333,Chipotle Mexican Grill,Downtown Core,M5J 2M2,68,3.5,ON,business
4,1560 Queen St E,"[BusinessAcceptsCreditCards: False, BusinessPa...",XxjrqA5jrzLH0EdCKUG9Mw,"[Desserts, Coffee & Tea, Food]",Toronto,,0,43.665879,-79.318965,I Deal Coffee,Leslieville,M4L 1E9,4,4.5,ON,business
5,2014 Queen Street E,"[BusinessAcceptsCreditCards: True, GoodForKids...",OLG7Gou8kmTLxogtJyP6ZQ,"[Tex-Mex, Restaurants]",Toronto,"[Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...",0,43.670327,-79.299190,Z-Teca,The Beach,M4L 1J3,4,1.5,ON,business
6,2816 Markham Road,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",L_thK7r3K_h5M4tV7amEKQ,"[Diners, Italian, Sandwiches, Breakfast & Brun...",Toronto,"[Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...",1,43.822982,-79.247915,Honey B Hives Restaurant,Scarborough,M1V 4C3,118,3.5,ON,business
7,250 Queens Quay W,"[Alcohol: beer_and_wine, BusinessAcceptsCredit...",6pSUvtk5-OOaJfX0hbkb8Q,"[Restaurants, Japanese, Sushi Bars]",Toronto,,0,43.639601,-79.382890,Ichiban Sushi,Harbourfront,M5J,9,3.5,ON,business
8,1560 Yonge St,"[BikeParking: True, BusinessAcceptsCreditCards...",hSep-C-1JSC8c_8tR96etQ,"[Coffee & Tea, Food]",Toronto,,1,43.689561,-79.395044,The Second Cup,Yonge and St. Clair,M4T 2S9,5,3.0,ON,business
9,1034 Queen Street W,"[AgesAllowed: 19plus, Ambience: {'romantic': F...",sZ05KpeRuZEzFP5TD_PQkg,"[Lounges, Bars, Nightlife, Dance Clubs]",Toronto,"[Monday 17:0-2:30, Tuesday 17:0-2:30, Wednesda...",1,43.644149,-79.420280,Apt. 200,West Queen West,M6J 1H7,15,2.0,ON,business


In [7]:
toronto.describe().toPandas()

Unnamed: 0,summary,address,business_id,city,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,type
0,count,14540,14540,14540,14540.0,14540.0,14540.0,14540,14540,14540,14540.0,14540.0,14540,14540
1,mean,,,,0.8013755158184319,43.67621023983338,-79.39420013944437,2316.0,,,24.05055020632737,3.515749656121045,,
2,stddev,,,,0.3989783784282531,0.0431378672815804,0.0619279949426971,1009.7484835343898,,,47.23432101358981,0.871000467081536,,
3,min,,--DaPTJW3-tB1vP-PfdTEg,Toronto,0.0,43.116234868,-79.70339,00 Gelato,,,3.0,1.0,ON,business
4,max,canada wide,zzf3RkMI1Y2E1QaZqeU8yA,Toronto,1.0,45.028731,-76.364179,z-teca,Yorkville,W8M 3T5,1145.0,5.0,ON,business


In [8]:
df_business.take(1)

NameError: name 'df_business' is not defined

### Choosing a business category

- Define a regular Python function that takes a list of categories and returns 1 if a category of your choice (for instance, 'Restaurants') is contained in the list of categories or 0 otherwise
- Using the Python function, define a Spark's User Defined Function (UDF) with an IntegerType return
- Using the UDF, filter the businesses that belong to the category you chose

In [9]:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType, BooleanType
from pyspark.sql.functions import UserDefinedFunction
from pyspark.ml.linalg import VectorUDT, Vectors

                             
def is_italian(categories):
    return "Italian" in categories if categories else False 

is_it_udf = UserDefinedFunction(is_italian, BooleanType())

df_italians_toronto = toronto.filter(is_it_udf("categories"))

In [10]:
df_italians_toronto 

DataFrame[address: string, attributes: array<string>, business_id: string, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, stars: double, state: string, type: string]

### Choosing a city
- Having filtered by the business category, now it is time to filter by the city (for instance, Edinburgh)

### Generating numeric IDs
- If you haven't done it yet, take one sample from your already filtered DataFrame and notice that the ***business_id*** contains an alphanumeric value - this is not good for Spark's ALS implementation, which requires IDs for items (in our case, businesses) and users to be numeric
- Use a ***StringIndexer*** to create a new column ***business_idn*** from the conversion of business_id into a numeric value

In [11]:
from pyspark.ml.feature import StringIndexer

business_indexer = StringIndexer().setInputCol("business_id").setOutputCol("business_idn").setHandleInvalid('skip')
        

df_italians_toronto = business_indexer.fit(df_italians_toronto).transform(df_italians_toronto)        

In [12]:
df_italians_toronto.toPandas()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,type,business_idn
0,979 Bloor Street W,"[Alcohol: none, Ambience: {'romantic': False, ...",EDqCEAGXVGCH4FJXgqtjqg,"[Restaurants, Pizza, Chicken Wings, Italian]",Toronto,"[Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...",1,43.661054,-79.429089,Pizza Pizza,Dufferin Grove,M6H 1L5,7,2.5,ON,business,461.0
1,2816 Markham Road,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",L_thK7r3K_h5M4tV7amEKQ,"[Diners, Italian, Sandwiches, Breakfast & Brun...",Toronto,"[Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...",1,43.822982,-79.247915,Honey B Hives Restaurant,Scarborough,M1V 4C3,118,3.5,ON,business,111.0
2,"98 Island Road, Unit C","[BusinessAcceptsCreditCards: True, BusinessPar...",VutzmcU-hZt_WfgMkEuiFA,"[Italian, Restaurants]",Toronto,"[Monday 11:30-21:0, Tuesday 11:30-21:0, Wednes...",1,43.798446,-79.139529,Pasta Tutti Giorni,Scarborough,M1C 3P2,5,4.0,ON,business,323.0
3,552 Parliament Street,"[Alcohol: none, Ambience: {'romantic': False, ...",nPBZaN5ttrArS6tZwdjN7Q,"[Italian, Canadian (New), Restaurants, Cafes]",Toronto,"[Monday 9:0-21:0, Tuesday 9:0-21:0, Wednesday ...",1,43.666844,-79.369178,Cabbagetown Brew,Cabbagetown,M4X 1P6,10,4.5,ON,business,194.0
4,2100 Bloor Street W,"[Alcohol: beer_and_wine, BusinessParking: {'ga...",6tFbsJuIarw6vawDOZ9GkA,"[Pizza, Restaurants, Italian, Chicken Wings]",Toronto,"[Monday 11:0-23:0, Tuesday 11:0-23:0, Wednesda...",1,43.652618,-79.470894,Pizza Hut,Bloor-West Village,M6S 4Y7,9,2.5,ON,business,228.0
5,1228 Saint Clair Avenue W,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",6a7zecn8W8r9dDWwQVb6aA,"[Italian, Restaurants]",Toronto,,0,43.677560,-79.445139,Novecento,Corso Italia,M6E 1B7,7,4.5,ON,business,90.0
6,106 Victoria St,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",3QS2Wgedv3jjKC5vJXJTcA,"[Italian, Restaurants]",Toronto,,1,43.652200,-79.378155,Osteria Ciceri e Tria,Downtown Core,M5C,32,3.5,ON,business,85.0
7,1986 Weston Road,"[BusinessAcceptsCreditCards: True, GoodForKids...",wPxQQbwXfqnJEMirpwcPnw,"[Italian, Sandwiches, Coffee & Tea, Pizza, Res...",Toronto,"[Monday 11:0-23:0, Tuesday 11:0-23:0, Wednesda...",1,43.700878,-79.519349,Weston Favourites,,M9N 1W8,3,3.5,ON,business,249.0
8,594 College St,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",JK_DiDwbl7HX5OhCAh3h4g,"[Caterers, Italian, Breakfast & Brunch, Nightl...",Toronto,"[Monday 8:0-1:0, Tuesday 8:0-1:0, Wednesday 8:...",1,43.655255,-79.413732,Cafe Diplomatico,Little Italy,M6G 1B3,137,3.0,ON,business,136.0
9,579 Mount Pleasant Road,"[Alcohol: full_bar, Ambience: {'romantic': Tru...",x1IPOlH_O5D3qZdZo6YyEA,"[Restaurants, Italian]",Toronto,"[Monday 11:30-22:0, Tuesday 11:30-22:0, Wednes...",1,43.703560,-79.387976,Florentia,Mount Pleasant and Davisville,M4S 2M5,14,3.5,ON,business,420.0


## Review Data

- Load the file ***yelp_academic_dataset_review.json*** and select the following columns:
    - user_id
    - business-id
    - stars
    - date

In [13]:
df_reviews = sqlc.read.json("/media/diego/QData/datasets/yelp_dsr/yelp_academic_dataset_review.json")

In [28]:
df_reviews = df_reviews.withColumnRenamed('stars','stars_r').withColumnRenamed('type','type_r')

### Keeping reviews for the chosen city only

- You are only interested in reviews of businesses you kept after filtering for category and city - how to filter out everything else? (hint: take a look at the ***join*** operation of DataFrames)

In [29]:
df_city_reviews = df_italians_toronto.join(df_reviews, on='business_id')

In [30]:
df_city_reviews

DataFrame[business_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, stars: double, state: string, type: string, business_idn: double, cool: bigint, date: string, funny: bigint, review_id: string, stars_r: bigint, text: string, type_r: string, useful: bigint, user_id: string]

In [34]:
df_city_reviews.write.json(path="data/city_reviews")

### Generating numeric IDs

- As it happened with the ***business_id***, you also need to convert ***user_id*** into a numeric value - once again, use a ***StringIndexer*** to create a new column named ***user_idn*** containing the result of the conversion

In [36]:
from pyspark.ml.feature import StringIndexer

business_indexer = StringIndexer().setInputCol("user_id").setOutputCol("user_idn").setHandleInvalid('skip')
        

df_city_reviews = business_indexer.fit(df_city_reviews).transform(df_city_reviews)        

In [37]:
df_city_reviews.cache()

DataFrame[business_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, stars: double, state: string, type: string, business_idn: double, cool: bigint, date: string, funny: bigint, review_id: string, stars_r: bigint, text: string, type_r: string, useful: bigint, user_id: string, user_idn: double]

### Adding a sequential number to the user's reviews

- Now add a ***sequential number*** to the user's reviews, that is, for each user, order his/her reviews by date (multiple reviews on the same date can be randomly ordered) and number them (hint: check ***window functions***)
- This sequential number will be useful later to perform a time-wise split of the dataset

In [43]:
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import count

windowSpec = Window.partitionBy(df_city_reviews['user_idn']).orderBy(df_city_reviews['date']) 
    
df_review_cnt = df_city_reviews.select("user_idn", count("date").over(windowSpec).alias("count"))

In [49]:
df_review_cnt.show(10)

+--------+-----+
|user_idn|count|
+--------+-----+
|   299.0|    1|
|   299.0|    2|
|   299.0|    3|
|   299.0|    4|
|   299.0|    5|
|   299.0|    6|
|   299.0|    7|
|   305.0|    1|
|   305.0|    2|
|   305.0|    3|
+--------+-----+
only showing top 10 rows



In [44]:
df_city_reviews = df_city_reviews.join(df_review_cnt, on='user_idn')

Unnamed: 0,user_idn,business_id,address,attributes,categories,city,hours,is_open,latitude,longitude,...,cool,date,funny,review_id,stars_r,text,type_r,useful,user_id,count
0,2197.0,Dnq4w6ajMifZFFTIR5cGxw,1151 Davenport Road,"[Alcohol: none, Ambience: {'romantic': False, ...","[Italian, Restaurants, Pizza, Sandwiches]",Toronto,"[Tuesday 11:0-21:0, Wednesday 11:0-21:0, Thurs...",1,43.674561,-79.431663,...,0,2017-01-07,0,Ioko8qRban3R-o3NB3AOeQ,5,hands down the best sausage ive ever had. it c...,review,0,w7QSVBgCYHu2_MG6w6Jp7Q,2
1,2197.0,Dnq4w6ajMifZFFTIR5cGxw,1151 Davenport Road,"[Alcohol: none, Ambience: {'romantic': False, ...","[Italian, Restaurants, Pizza, Sandwiches]",Toronto,"[Tuesday 11:0-21:0, Wednesday 11:0-21:0, Thurs...",1,43.674561,-79.431663,...,0,2017-01-07,0,Ioko8qRban3R-o3NB3AOeQ,5,hands down the best sausage ive ever had. it c...,review,0,w7QSVBgCYHu2_MG6w6Jp7Q,1
2,10243.0,Dnq4w6ajMifZFFTIR5cGxw,1151 Davenport Road,"[Alcohol: none, Ambience: {'romantic': False, ...","[Italian, Restaurants, Pizza, Sandwiches]",Toronto,"[Tuesday 11:0-21:0, Wednesday 11:0-21:0, Thurs...",1,43.674561,-79.431663,...,0,2016-12-23,0,KAM3yMErPwHX5BKywUYyOA,5,Amazing one of the best pork belly sandwiches ...,review,0,195i7Ix1PruGaz9SsOi6Iw,1
3,5768.0,Dnq4w6ajMifZFFTIR5cGxw,1151 Davenport Road,"[Alcohol: none, Ambience: {'romantic': False, ...","[Italian, Restaurants, Pizza, Sandwiches]",Toronto,"[Tuesday 11:0-21:0, Wednesday 11:0-21:0, Thurs...",1,43.674561,-79.431663,...,0,2016-10-11,0,1AvYCkFn5gz2qk5PA6GJWw,5,Dante's Inferno is without question (in my hum...,review,1,-onI--GpQ__P1ibxx25MRA,1
4,11731.0,Dnq4w6ajMifZFFTIR5cGxw,1151 Davenport Road,"[Alcohol: none, Ambience: {'romantic': False, ...","[Italian, Restaurants, Pizza, Sandwiches]",Toronto,"[Tuesday 11:0-21:0, Wednesday 11:0-21:0, Thurs...",1,43.674561,-79.431663,...,0,2016-01-15,0,AowSoAsbT-VE8y27ejtaTg,5,I'm so pleased to say that Dante's Inferno Pan...,review,0,sQeSycJd1EI4QafGOsloqQ,1


### Subsetting reviews to keep only users with more than 4 reviews

- Some users had rated only 1 or a few businesses - this would pose as a problem to make recommendations - so you would want to keep only users who had rated more than 4 reviews, for instance
- Find the ***total number of reviews*** for each user and then filter them using this information (hint: again, you can use a ***window function***)

In [52]:
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import count, sys

windowSpec = Window.partitionBy(df_city_reviews['user_idn']).orderBy(df_city_reviews['date']).rowsBetween(-sys.maxsize, sys.maxsize)
    
df_review_sum = df_city_reviews.select("user_idn", count('date').over(windowSpec).alias("count_tot"))

df_review_sum.show(10)

+--------+-----+
|user_idn|sum_r|
+--------+-----+
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
|   299.0|   49|
+--------+-----+
only showing top 10 rows



In [53]:
df_city_reviews = df_city_reviews.join(df_review_sum, on='user_idn')

In [56]:
df_city_reviews.toPandas().head()

Py4JJavaError: An error occurred while calling o536.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:261)
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:257)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:257)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:279)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:278)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)


In [None]:
df_city_reviews.cache()

### Calculating mean rating by user

- Now you can calculate the mean rating by user and make it into a dictionary where the key is the ***user_id*** (hint: look at ***rdd*** method of DataFrames and ***collectAsMap*** method of RDDs)

In [None]:
dict_user_means = ...

### Centering rating by user

- The dictionary containing mean ratings by user can be seen as a ***lookup table*** - what is the appropriate way of dealing with those in Spark?
- Once you have figured this out, define a regular Python function that takes two arguments - ***user_id*** (String) and ***rating*** (String, which you will need to convert to float inside the function) - and returns the result of subtracting the mean rating of the user from the rating parameter
- Using the Python function, define a Spark's User Defined Function (UDF) with a DoubleType return
- Using the UDF, create a column in your DataFrame with the centered ratings
    - Notice that your UDF needs two columns as input: ***user_id*** and ***stars*** - so far, all UDFs received only one column (hint: check the ***array*** function from pyspark.sql)

In [None]:
from pyspark.sql.functions import array

lookup_user_means = ...

def zero_mean(user_id, rating):
    pass

df_centered = ...

## Dataset

### Splitting into training and test sets by time

- In recommender systems, it is common practice to do the training/test split timewise, that is, the test set is composed of the latest reviews
- First, filter only those reviews which have a sequential number smaller than the ***total number of reviews***, by user: this is your training set
- Then, filter only those reviews which have a sequential number identical to the ***total number of reviews***, by user: this is your test set
- Now you can see why you had to add the random time to the review's date - since some users had done all his/her reviews on the same day, your training set for that particular use would have been empty. By doing this, you guarantee your test set will have only 1 review for each user.

In [None]:
df_training = ...
df_test = ...

## Alternate Least Squares (ALS) Model

- This is the recommender itself - the ALS uses a iterative approach to find the underlying factors that yield the user/item rating matrix
- It takes as input a DataFrame with three columns, representing:
    - userCol: user IDs (numeric - remember the conversion you did)
    - itemCol: item IDs (numeric - remember the conversion you did)
    - ratingCol: rating (numeric, obviously)
    - coldStartStrategy: "drop" (if there is unseen data on the test set, meaning a new user, drop it)
- Its parameters are:
    - rank: the number of factors to consider
    - maxIter: the maximum number of iterations to perform
    - regParam: the regularization parameter
- Use Spark's ALS to fit a model based on your DataFrame

In [None]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

model = ...

### Predictions for the training set

- Once the model is trained, make predictions for the training set and use a ***RegressionEvaluator*** to find out the RMSE of the predictions

In [None]:
predictions = ...

evaluator = ...

train_rmse = ...

print(train_rmse)

### Predictions for the test set

- Now, make predictions for the test set and use a ***RegressionEvaluator*** to find out the RMSE of the predictions

In [None]:
...

print(test_rmse)

## Recommendations

Now, your model is trained, but how can you use it to make recommendations for a given user?

### Organizing business data

- It would not make sense to recommend a place the user has already rated, right? So, generate a dictionary where ***user_idn*** is the key and a list of the already rated ***business_idn*** is the value (hint: when aggregating DataFrames, ***collect_list*** is a VERY useful function to turn multiple records into a list)

In [None]:
from pyspark.sql.functions import collect_list

dict_visited_by_user = ...

- Besides, recommending a given business_id also does not help much, right? So you need to organize the business data in a way it can be shown to the user.
    - Define a regular Python function that takes one argument ***row*** (Row type) and returns a dictionary where ***business_idn*** is the key and the value is yet another dictionary with relevant fields (for instance: name, address, stars, categories)
    - Transform your business DataFrame into an RDD and apply the function you defined - upon collecting, you will end up with a list of dictionaries
    - Transform this list of dictionaries into a single dictionary

In [None]:
def rest_to_json(row):
    pass

rest = ...

dict_rest = {k: v for d in rest for k, v in d.items()}

### Making recommendations for a user

- To actually make the recommendations, we need to build an input DataFrame to feed the model
    - A DataFrame can be created using the SQL Context and a list of Rows, each containg two columns: user_idn and business_idn - the rating will be computed by the model
    - But you only need to have rows for the businesses which were not yet rated by the user - from all businesses, exclude the ones already rated by him/her

In [None]:
from pyspark.sql import Row
from pyspark.sql.functions import desc

user_idn = 317
n_business = len(dict_rest)

visited = ...
not_visited = ...

df_test_user = ...

- Now, you can use the generated DataFrame to make predictions
    - If there are any NA predictions, make sure to turn them into a really bad value (for instance, -5.0) (hint: remember ***na*** method of DataFrames)
- Order the predictions and take the ***business_idn*** of the top 5
- Finally, use this information to fetch the business data from the dictionary you assembled a couple of steps ago

In [None]:
predictions = ...

top_predictions = ...

response = list(map(lambda idn: dict_rest[idn], top_predictions))

In [None]:
response

## Congratulations, you finished the exercise!

In [None]:
sc.stop()