## ALS

The final goal was to build a basic recommendation system from past reservations. It uses [Alternating Least Squares](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html) (ALS), a recommender built into PySpark's ML library. That documentation, along with [this walkthrough](https://github.com/shashwatwork/Building-Recommeder-System-in-PySpark/blob/master/Crafting%20Recommedation%20System%20with%20PySpark.ipynb), guided the implementation.

ALS is fairly straightforward. It builds the model from three input columns, all stored as integers here:
- userCol - the person in the transaction; customerzip was used in this model
- itemCol - the product identifier; facilityid was used
- ratingCol - the score the user assigns to the item; a "days stayed" calculation was used to simulate this value
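
The CSV loaded below already holds these three columns, so the preparation step is not shown in this notebook. As a rough sketch of what a "days stayed" rating could look like, the helper below turns one reservation into an `(item, user, rating)` tuple of integers. The function names, the date arguments, and the handling of ZIP+4 codes are assumptions for illustration, not the preprocessing actually used.

```python
from datetime import date


def days_stayed(start: date, end: date) -> int:
    """Number of nights between check-in and check-out (hypothetical rating)."""
    return (end - start).days


def to_rating_row(facility_id: str, customer_zip: str, start: date, end: date):
    """Build an (item, user, rating) tuple of integers.

    Returns None when the zip code cannot be read as a 5-digit number,
    since ALS needs integer user and item ids.
    """
    zip5 = customer_zip.strip()[:5]  # assumption: keep only the 5-digit part of a ZIP+4
    if len(zip5) != 5 or not zip5.isdigit():
        return None
    return int(facility_id), int(zip5), days_stayed(start, end)


row = to_rating_row("232507", "20901-1234", date(2021, 7, 2), date(2021, 7, 5))
print(row)  # (232507, 20901, 3)
```

Using the zip code as the user means that everyone in the same zip code is treated as a single "user", which is why the same user/item pair appears many times in the data.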

In [1]:
from pyspark.sql import SparkSession

MAX_MEMORY = "8g"

spark = SparkSession.builder.appName('recreation.gov reservations') \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

In [2]:
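from pyspark.sql.types import StructType, StructField, IntegerType

# the schema is applied by position, so the order must match the CSV columns:
# facility id (item), customer zip (user), days stayed (rating)
schemaRating = StructType([
    StructField("item", IntegerType(), True),
    StructField("user", IntegerType(), True),
    StructField("rating", IntegerType(), True),
])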

In [3]:
dfReservations2021 = spark.read.schema(schemaRating).csv(
    'REC_Collaborative_Facility.csv',
    header=True,
    ignoreTrailingWhiteSpace=True,
)

In [4]:
dfReservations2021.filter(dfReservations2021.rating.isNull()).show()

+----+----+------+
|item|user|rating|
+----+----+------+
+----+----+------+



In [5]:
dfReservations2021.show(truncate=False)

+------+-----+------+
|item  |user |rating|
+------+-----+------+
|252494|99709|1     |
|252494|99706|1     |
|252494|99709|1     |
|252494|99709|1     |
|252494|84401|2     |
|252494|99709|2     |
|252494|99709|1     |
|252494|99743|1     |
|252494|99743|2     |
|252494|99708|3     |
|252494|84401|2     |
|252494|99710|1     |
|252494|99705|1     |
|252494|99709|1     |
|252494|99709|1     |
|252494|99712|2     |
|252494|99556|1     |
|252494|99709|1     |
|252494|99712|2     |
|252494|99705|2     |
+------+-----+------+
only showing top 20 rows



In [6]:
dfReservations2021 = dfReservations2021.dropna()

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

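# split into train and test sets (no seed is set, so the split differs between runs)
trainDF, testDF = dfReservations2021.randomSplit([0.8, 0.2])
trainDF.cache()

# build model; ALS defaults to the column names "user", "item" and "rating"
# coldStartStrategy="drop" removes rows whose prediction is NaN, which happens
# for users or items in the test set that were not seen during training
# implicitPrefs=True treats "rating" as the strength of an implicit signal
# rather than an explicit score
als = ALS(coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(trainDF)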

In [8]:
# generate predictions
predictions = model.transform(testDF)

# evaluate model using root mean squared error (RMSE)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
evaluator.evaluate(predictions)

1.636292417364111
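
To make the metric concrete, the sketch below computes RMSE by hand on three made-up label/prediction pairs (illustrative values, not taken from the reservation data). It follows the definition behind `metricName="rmse"`, the square root of the mean squared difference between label and prediction. It is not Spark's implementation.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# made-up example: stays of 1, 2 and 3 days, all predicted as 1.5
example_labels = [1, 2, 3]
example_predictions = [1.5, 1.5, 1.5]
example_rmse = rmse(example_labels, example_predictions)
print(round(example_rmse, 4))  # 0.9574
```

Because the model above is fitted with `implicitPrefs=True`, its predictions are preference scores rather than estimates of days stayed, so the RMSE is best read as a relative number for comparing models.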

In [9]:
# select the test-set rows for one user: zip code 20901 (Silver Spring, MD)
test_user = testDF.filter('user == 20901').select('user', 'item', 'rating')
test_user.show()

+-----+------+------+
| user|  item|rating|
+-----+------+------+
|20901|232432|     2|
|20901|232432|     2|
|20901|232432|     3|
|20901|232432|     3|
|20901|232433|     2|
|20901|232459|     1|
|20901|232459|     2|
|20901|232459|     3|
|20901|232459|     3|
|20901|232507|     1|
|20901|232507|     1|
|20901|232507|     1|
|20901|232507|     2|
|20901|232507|     2|
|20901|232507|     2|
|20901|232507|     2|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
+-----+------+------+
only showing top 20 rows



In [26]:
# get recommendations for test user
recommendations = model.recommendForUserSubset(test_user, 5)
recommendations.sort('recommendations', ascending=False).show()

+-----+--------------------+
| user|     recommendations|
+-----+--------------------+
|20901|[{232507, 1.48820...|
+-----+--------------------+



In [27]:
from pyspark.sql.functions import explode

(recommendations
    .select(explode('recommendations').alias('recs'))
    .select('recs.item', 'recs.rating')
    .sort('recs.rating', ascending=False)
    .show())

+------+---------+
|  item|   rating|
+------+---------+
|232507|1.4882011|
|233626|   1.4415|
|232508|1.3851465|
|232459|1.3799562|
|251431| 1.359406|
+------+---------+
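
For readers less familiar with `explode`, the following plain-Python sketch mirrors what the cell above does: each user row carries a list of (item, rating) pairs, and exploding turns that list into one row per pair. The dictionary layout is an assumption made for illustration; the item ids and scores are copied from the output above.

```python
# one row per user, with a nested list of (item, rating) pairs,
# similar in shape to the recommendForUserSubset output
nested_rows = [
    {
        "user": 20901,
        "recommendations": [
            (232507, 1.4882011),
            (233626, 1.4415),
            (232508, 1.3851465),
            (232459, 1.3799562),
            (251431, 1.359406),
        ],
    }
]


def explode_recommendations(rows):
    """Flatten nested recommendations into (user, item, rating) tuples."""
    flat = []
    for row in rows:
        for item, rating in row["recommendations"]:
            flat.append((row["user"], item, rating))
    # highest score first, like sort('recs.rating', ascending=False)
    return sorted(flat, key=lambda r: r[2], reverse=True)


flat_rows = explode_recommendations(nested_rows)
for user, item, rating in flat_rows:
    print(user, item, rating)
```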



In [15]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [20]:
import requests
r = requests.get('https://ridb.recreation.gov/api/v1/facilities/232507', headers={'apikey': 'XXXXX'})
r.json()

{'FacilityID': '232507',
 'LegacyFacilityID': '70989',
 'OrgFacilityID': 'AN370989',
 'ParentOrgID': '128',
 'ParentRecAreaID': '2576',
 'FacilityName': 'ASSATEAGUE ISLAND NATIONAL SEASHORE CAMPGROUND',
 'FacilityDescription': '<h2>Overview</h2>\n<p>Assateague Island, famed for its wild horses, lies off the Delmarva Peninsula on the Atlantic Coast. This barrier island is a constantly shifting ribbon of sand, altered daily by powerful wind and waves. <br> <br>The Assateague Island National Seashore, Assateague State Park, and the Chincoteague National Wildlife Refuge each manage and protect this unique, diverse strip of land. <br> <br>For more information go to https://www.nps.gov/asis</p>\n<h2>Recreation</h2>\nActivities are abundant on the island, with crabbing and clamming, and a long stretch of beach for swimming, kayaking and fishing.<h2>Facilities</h2>\n<p>The campground is open year-round. Advance reservations are available up to 6 months in advance during the following dates:\xa
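
Since only the facility name is needed to interpret a recommendation, the lookup can be wrapped in a small helper. The sketch below is one possible shape for it. The environment variable `RIDB_API_KEY` and both function names are hypothetical; the URL pattern, the `apikey` header, and the `FacilityName` field are the ones used in the cell above.

```python
import os

RIDB_BASE_URL = "https://ridb.recreation.gov/api/v1"


def facility_request(facility_id, api_key=None):
    """Return the (url, headers) pair for a facility lookup.

    The key is read from the hypothetical RIDB_API_KEY environment
    variable when it is not passed in, so it stays out of the notebook.
    """
    key = api_key or os.environ.get("RIDB_API_KEY", "")
    if not key:
        raise RuntimeError("no API key given and RIDB_API_KEY is not set")
    url = f"{RIDB_BASE_URL}/facilities/{int(facility_id)}"
    return url, {"apikey": key}


def fetch_facility_name(facility_id, api_key=None):
    """Call the API and return only the FacilityName field (None if absent)."""
    import requests  # imported here so building the request works without it

    url, headers = facility_request(facility_id, api_key)
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on a bad key or unknown id
    return response.json().get("FacilityName")


url, headers = facility_request(232507, api_key="XXXXX")
print(url)
```

With a valid key, `fetch_facility_name(232507)` should return the same facility name that appears in the JSON above.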

In [18]:
from pyspark.ml.tuning import ParamGridBuilder

# set parameters for tuning
paramGrid = ParamGridBuilder()\
    .addGrid(als.maxIter, [5, 10, 15, 25])\
    .addGrid(als.rank, [10, 20, 50, 100])\
    .addGrid(als.regParam, [0.001, 0.01, 0.1, 0.2])\
    .build()

In [19]:
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator)

# cross-validate over the parameter grid and keep the best model
cvModel = crossval.fit(trainDF)
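
This search is expensive, and a quick count shows why. The sketch below enumerates the same grid in plain Python and estimates the number of ALS fits, assuming `CrossValidator` keeps its default of 3 folds (no `numFolds` is passed above) and refits the best parameters once on the full training set.

```python
from itertools import product

# same values as the ParamGridBuilder above
max_iter_values = [5, 10, 15, 25]
rank_values = [10, 20, 50, 100]
reg_param_values = [0.001, 0.01, 0.1, 0.2]

NUM_FOLDS = 3  # assumption: CrossValidator's default numFolds

# every combination of the three parameters
combinations = list(product(max_iter_values, rank_values, reg_param_values))

# one fit per combination per fold, plus one final refit of the best combination
total_fits = len(combinations) * NUM_FOLDS + 1

print(len(combinations))  # 64
print(total_fits)         # 193
```

Trimming the grid, for example by dropping the largest `rank` and `maxIter` values first, is one way to cut the run time if the full search is too slow.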

In [21]:
# evaluate the best cross-validated model on the test set
cvPred = cvModel.bestModel.transform(testDF)
evaluator.evaluate(cvPred)

1.5081936325379801

In [22]:
cvModel.bestModel

ALSModel: uid=ALS_cab102eee562, rank=50

In [23]:
cvRecommendations = cvModel.bestModel.recommendForUserSubset(test_user, 5)
cvRecommendations.sort('recommendations', ascending=False).show()



+-----+--------------------+
| user|     recommendations|
+-----+--------------------+
|20901|[{233379, 1.46403...|
+-----+--------------------+



In [24]:
from pyspark.sql.functions import explode

(cvRecommendations
    .select(explode('recommendations').alias('recs'))
    .select('recs.item', 'recs.rating')
    .sort('recs.rating', ascending=False)
    .show())

+------+---------+
|  item|   rating|
+------+---------+
|233379|1.4640328|
|232459| 1.436412|
|232507| 1.423916|
|232095|1.4138017|
|247762|1.3669013|
+------+---------+



In [25]:
r = requests.get('https://ridb.recreation.gov/api/v1/facilities/233379', headers={'apikey': 'XXXXX'})
r.json()

{'FacilityID': '233379',
 'LegacyFacilityID': '72421',
 'OrgFacilityID': 'AN372421',
 'ParentOrgID': '128',
 'ParentRecAreaID': '2896',
 'FacilityName': 'OAK RIDGE CAMPGROUND',
 'FacilityDescription': '<h2>Overview</h2>\n<p>Oak Ridge Campground is a 100-site, wooded campground located in Prince William Forest Park, 35 miles southwest of Washington, DC. The park\'s land was set aside during the Great Depression, and in 1935 the Civilian Conservation Corps (CCC) began restoring the previously over-farmed acreage, converting it to recreational lands for public use. The CCC built trails, dams and cabins, making the park a wonderful place for recreation and relaxation. <br><br>Large group camping is not permitted at Oak Ridge Campground. Please read the need to know section for additional information.</p>\n<h2>Recreation</h2>\n<p>The park offers many recreational activities, including hiking, biking, orienteering and fishing. Hikers enjoy exploring the park\'s 37 miles of foot trails. The S