# Final Project: Yelp Location Based Recommender
## Charley Ferrari
## Data 643

In [1]:
from pyspark import SparkContext
import numpy as np
import json
import hashlib
from pyspark.sql import Row
from pyspark.sql.functions import lit, col
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql import SQLContext

CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
sc = SparkContext( CLUSTER_URL, 'pyspark')


sqlContext = SQLContext(sc)

This project uses the <a href="https://www.yelp.com/dataset_challenge/">Yelp Academic Dataset</a>, which contains a large set of reviews, user interactions, and business information. The data is distributed as JSON, with separate files for users, businesses, and reviews. Each review record includes a star rating along with the user and business it refers to, so the first part of the recommender is built on the review file.

In [5]:
reviews = sqlContext.read.json("/yelp_academic_dataset_review.json")

I will be using the ALS collaborative filtering algorithm from Spark MLlib to predict restaurants for users. Some initial cleaning is needed to get things working. First, the MLlib Rating objects require users and items to be ints, so I will use the zipWithUniqueId function to achieve this.

After initial attempts to create unique ints with hashes, I found <a href="https://github.com/LukeTillman/killrvideo-csharp/blob/master/data/spark/recommendations_pipeline.py">this</a> example and used its approach of creating separate ID tables to join with the original dataset.

In [9]:
# Give each distinct user_id a unique integer (unique, but not necessarily consecutive)
user_ids = reviews.select("user_id").distinct().rdd.zipWithUniqueId()
user_map = user_ids.map(lambda p: Row(user_id=p[0].user_id, userid_int=p[1])).toDF().cache()

# Same for each distinct business_id
business_ids = reviews.select("business_id").distinct().rdd.zipWithUniqueId()
business_map = business_ids.map(lambda p: Row(business_id=p[0].business_id, businessid_int=p[1])).toDF().cache()

# Attach both integer IDs to every review
review_id = reviews.join(user_map, 'user_id').\
                    join(business_map, 'business_id')

# ALS.train expects an RDD of Rating(user, product, rating)
yelp_ratings = review_id.rdd.map(lambda l: Rating(l.userid_int, l.businessid_int, l.stars))
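
To make the ID-mapping step above concrete, here is a minimal plain-Python sketch of how zipWithUniqueId assigns IDs: item i of partition k (out of n partitions) gets the ID i * n + k, so the IDs are unique but can have gaps. The helper `zip_with_unique_id`, the partition contents, and the sample reviews are all hypothetical and only simulate the behavior; this is not Spark code.

```python
def zip_with_unique_id(partitions):
    """Simulate RDD.zipWithUniqueId: item i of partition k gets id i * n + k."""
    n = len(partitions)
    pairs = []
    for k, part in enumerate(partitions):
        for i, item in enumerate(part):
            pairs.append((item, i * n + k))
    return pairs

# Hypothetical user IDs split across two partitions of unequal size.
partitions = [["u_a", "u_b", "u_c"], ["u_d"]]
user_map = dict(zip_with_unique_id(partitions))

# Hypothetical (user_id, stars) reviews keyed by the original string IDs.
reviews = [("u_a", 4), ("u_d", 2), ("u_c", 5)]

# The "join": swap each string ID for its integer ID.
ratings = [(user_map[user], stars) for user, stars in reviews]
```

Because the IDs only need to be unique, the gaps do not matter for ALS; the lookup table is what lets the predictions be joined back to the original Yelp IDs later.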

With the ratings in this form, I can train the ALS model:

In [48]:
rank = 10
numIterations = 10
model = ALS.train(yelp_ratings, rank, numIterations)
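
For intuition about what the training call above is fitting, here is a small NumPy sketch of alternating least squares on a toy ratings matrix. It is an illustration under assumptions, not MLlib's distributed implementation: the function `als_explicit`, the regularization value, and the toy data are all made up for this example.

```python
import numpy as np

def als_explicit(R, mask, rank=2, iterations=10, reg=0.1, seed=0):
    """Alternating least squares on the observed entries of R (where mask is True)."""
    rng = np.random.RandomState(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Hold item factors fixed; solve a small ridge regression per user.
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T.dot(Vs) + ridge, Vs.T.dot(R[u, seen]))
        # Hold user factors fixed; solve a small ridge regression per item.
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T.dot(Us) + ridge, Us.T.dot(R[seen, i]))
    return U, V

# Toy ratings: rows are users, columns are businesses, 0 means "not reviewed".
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 5, 0],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0

U, V = als_explicit(R, mask)
predicted = U.dot(V.T)  # a predicted rating for every (user, business) pair
rmse = np.sqrt(np.mean((predicted[mask] - R[mask]) ** 2))
```

Each predicted rating is just a dot product of a user vector and a business vector, which is why the model can score businesses a user has never reviewed, and also why nothing forces a prediction to stay between 1 and 5 stars.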

The first step of my recommendation will just predict ratings for a given user, but the second step will take location into account. So, I will create a set of restaurants to score for a user. The steps below select the restaurants this user has not reviewed, so that their scores can be predicted and ranked. For this test I am using the user with userid_int 387400.

In [154]:
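# Reviews by everyone else, and the businesses this user has already reviewed
reviewsL = review_id.filter(review_id['userid_int'] != 387400)
reviewsR = review_id.filter(review_id['userid_int'] == 387400).\
    select('businessid_int').distinct().withColumn('seen', lit(1))

# A left outer join alone keeps every row of reviewsL, so drop the rows that found a match
filt2 = reviewsL.join(reviewsR, 'businessid_int', 'left_outer').filter(col('seen').isNull())

filt3 = filt2.select('businessid_int').distinct().withColumn('userid_int', lit(387400))

# predictAll expects an RDD of (user, product) pairs
testratings = filt3.rdd.map(lambda l: (l.userid_int, l.businessid_int))

predictions = model.predictAll(testratings).map(lambda r: ((r[0], r[1]), r[2]))

predictionsDF = predictions.map(lambda l: Row(userid_int = l[0][0], businessid_int = l[0][1], pred_stars = l[1])).toDF()

# Recover the original Yelp business_id for each prediction
predictionsDF = predictionsDF.join(business_map, 'businessid_int')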
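
The candidate-selection step above is meant to keep only the businesses the target user has not reviewed. In plain Python that intent is a set difference, sketched here with hypothetical (user, business) pairs; none of these IDs come from the Yelp data.

```python
# Hypothetical (userid_int, businessid_int) review pairs.
review_pairs = [(1, 10), (1, 11), (2, 10), (2, 12), (3, 13)]
target_user = 1

all_businesses = {business for _, business in review_pairs}
seen_by_target = {business for user, business in review_pairs if user == target_user}

# Keep only businesses the target user has no review for.
candidates = sorted(all_businesses - seen_by_target)

# These (user, business) pairs are what would be handed to the model for scoring.
test_pairs = [(target_user, business) for business in candidates]
```

Checking a small case like this by hand is a cheap way to confirm that a join-based version really excludes the already-reviewed businesses.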

With this DataFrame created, I can take location into account. A more careful version would measure actual distance rather than raw degrees of latitude and longitude, but for the time being I'll simply define a range of latitudes and longitudes to look at. For this portion, I'll bring in the business JSON and select the latitude and longitude of each business.

The Yelp dataset includes a large amount of data, some of it sparse and specific to certain types of businesses, as seen in the schema below.

In [151]:
businesses = sqlContext.read.json("/yelp_academic_dataset_business.json")
businesses.printSchema()

root
 |-- attributes: struct (nullable = true)
 |    |-- Accepts Credit Cards: boolean (nullable = true)
 |    |-- Accepts Insurance: boolean (nullable = true)
 |    |-- Ages Allowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: struct (nullable = true)
 |    |    |-- casual: boolean (nullable = true)
 |    |    |-- classy: boolean (nullable = true)
 |    |    |-- divey: boolean (nullable = true)
 |    |    |-- hipster: boolean (nullable = true)
 |    |    |-- intimate: boolean (nullable = true)
 |    |    |-- romantic: boolean (nullable = true)
 |    |    |-- touristy: boolean (nullable = true)
 |    |    |-- trendy: boolean (nullable = true)
 |    |    |-- upscale: boolean (nullable = true)
 |    |-- Attire: string (nullable = true)
 |    |-- BYOB: boolean (nullable = true)
 |    |-- BYOB/Corkage: string (nullable = true)
 |    |-- By Appointment Only: boolean (nullable = true)
 |    |-- Caters: boolean (nullable = true)
 |    |-- Coat Chec

In [163]:
businesses = sqlContext.read.json("/yelp_academic_dataset_business.json").\
    select('latitude','longitude', 'business_id','name')

Once I've brought in the business information, my next step is to keep only the businesses inside the zone I have chosen:

In [164]:
closeby = businesses.filter(businesses['latitude'] < 40.37).filter(businesses['latitude'] > 40.35).\
    filter(businesses['longitude'] < -79.89).filter(businesses['longitude'] > -79.93)
    
closeby.count()

10
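
The box filter above works in raw degrees, and a degree of longitude covers less ground than a degree of latitude at this latitude. One common way to measure real distance instead is the haversine formula. The sketch below is a possible refinement rather than part of the original pipeline; the function name `haversine_km` and the Earth radius constant are assumptions, and the two coordinate pairs are those of Skyvue and Mr Hoagie as printed in the results table further down.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Coordinates of two businesses from the results table in this notebook.
skyvue = (40.350463, -79.92311)
mr_hoagie = (40.3543266, -79.9007057)

distance = haversine_km(skyvue[0], skyvue[1], mr_hoagie[0], mr_hoagie[1])
```

A function like this could be wrapped in a Spark UDF to filter businesses by distance from the user, at the cost of computing it for every row rather than using simple range comparisons.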

Lastly, I'll join this back with my original predictions and see how this recommender would rank the businesses that are close to this user's current location:

In [165]:
closebyrankings = predictionsDF.join(closeby, 'business_id')

In [166]:
closebyrankings.show()

+--------------------+--------------+--------------------+----------+----------+-----------+--------------------+
|         business_id|businessid_int|          pred_stars|userid_int|  latitude|  longitude|                name|
+--------------------+--------------+--------------------+----------+----------+-----------+--------------------+
|-G9mzl-6Tj_3P7Hho...|         14423|  3.2559025336199205|    387400| 40.350463|  -79.92311|              Skyvue|
|5UmKMjUEUNdYWqANh...|         25230| -2.0717306247886764|    387400|40.3543266|-79.9007057|           Mr Hoagie|
|Oet_iNNFm5XW9NCRr...|         40430|    -1.2880634329106|    387400|40.3565449| -79.896124| Pasquale's Pizzeria|
|LhUCZvUbrLH-Y_2qE...|         72053|-0.46897867486092637|    387400|40.3509675|-79.9136731| Secrets Bar & Grill|
|e4BoQqsyrguxP2Ibq...|         69862| -3.2174362816994657|    387400|40.3565449| -79.896124|   Guns Priced Right|
|9ZqbQaYJEyZYRk-o9...|         10284| 0.11458410839293687|    387400|  40.36176|   -79.8
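
The output above is not sorted, so turning it into a ranking takes one more step. The sketch below does that in plain Python using only the five complete rows visible above, with pred_stars rounded to four decimal places. On the DataFrame itself, something like `closebyrankings.orderBy(col('pred_stars').desc())` should give the same ordering, though that call was not part of the original run.

```python
# The five complete rows shown above: (name, pred_stars rounded to 4 places).
rows = [
    ("Skyvue", 3.2559),
    ("Mr Hoagie", -2.0717),
    ("Pasquale's Pizzeria", -1.2881),
    ("Secrets Bar & Grill", -0.4690),
    ("Guns Priced Right", -3.2174),
]

# Rank from highest to lowest predicted rating.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
top_names = [name for name, _ in ranked]
```

Several predictions are negative, which is possible because ALS predictions are unconstrained dot products of factor vectors rather than values limited to the 1 to 5 star scale. For ranking purposes only the order matters, so the raw scores can be used as they are.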

## Discussion

It turns out the random latitude and longitude I chose (based on what came back from take(1)) is in an industrial town a couple of miles outside of Pittsburgh. The location the recommender chose for this user was the Skyvue diner:

<img src="https://lh3.googleusercontent.com/gJlQIM_5CDcZ8Klz8rO354_CapHyWGUTB1_GxJDi__55CG2i2bKavwx6oGlD1XyGSjc25qQVkywZ19D58fj3bsJZk-htZjEceylcnuzsPAJ9-UiILVmkvRU7auQiN35wWxxOHPiP9lXBOhjz6OPMcj6l5WS4DGan0Wo9Fc9RpD5JuBPxt5A5IKJE1QH6PaiQuFLe5ItcAWqhEx5DLSHGv4KXjesIljnd--tGbh6w2ay_ejCFK0KJOHK4fBD1ULqirc0L5Wvmsl8Ol3E0KZrF3I9B8Z0P3kiDlMBMc1Nt5cO1hdx4PxwZPMtRkTf2MOgL0aTsbYQsuz643nxN6-FdNt80imZB-pSU0ck3c1nVOPXlMih7FiVh6ejvxbsGaqFRUj3tLigZ3YoUvo8IIBJy1eC94pHq2JaIFLUZF2pAmsK5sOYEV22XIGl9ND34dU9X99lHDA8V-uioNDIpJ1zS8nllYZqu_kPU5eU29jkQ2f7nICE_fJ2yKU6pzHlgFnm_10Qw1s2BaK4Ss5iy5xrt-vlHgwuu5q_K2XU68Ke5Ui--3qUrNnjzTV62uNQgN9I8teZeW9VRvTfwVMAocewjV7teKhjZ5RM=w2294-h976-no">

In previous projects I've used time as a context, and it proved to be more complicated than location. For a location-based service, the goal is to rank what is around you, and it's fairly easy to set a geographic bound based on how far you can walk or drive.
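
As a rough illustration of setting a geographic bound from a travel distance, the sketch below converts a radius in kilometres into a latitude/longitude box around a point. The function `bounding_box`, the kilometres-per-degree constant, and the 1.5 km radius are assumptions chosen for the example; the centre point is the midpoint of the box used in this notebook. The approximation degrades near the poles and is not meant for large radii.

```python
from math import cos, radians

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude

def bounding_box(lat, lon, radius_km):
    """Return (lat_min, lat_max, lon_min, lon_max) for a box of half-width radius_km."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    dlon = radius_km / (KM_PER_DEGREE_LAT * cos(radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

# Midpoint of the notebook's box, with a hypothetical 1.5 km walking radius.
lat_min, lat_max, lon_min, lon_max = bounding_box(40.36, -79.91, 1.5)
```

The four values can be plugged straight into range filters on latitude and longitude, so the bound follows from a stated distance rather than from hand-picked degree values.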

The Yelp dataset is incredibly rich, and there are many more attributes I could use in more content-based filters. Once again, this may end up becoming a simpler problem. Many attributes will be chosen by the user when asking for recommendations: it would be rare for someone to ask Yelp for a completely general recommendation. On the other hand, this restriction changes the information a recommender can go on. If a user is looking for restaurant recommendations, will their behavior in department stores affect what they're looking for? Will you end up with valuable signals (such as price sensitivity) or simply noise that doesn't translate across business types?