#**A Sample ALS Filtering Interface**

This is a demonstration of how the ALS UI would interact with the user.
Disclaimer: Due to GitHub's file size limitations, we were only able to host a smaller dataset. This notebook therefore does not fully demonstrate our model with a train/test split, because splitting a small dataset into training and test sets can cause overfitting issues.

Please refer to Collaborative_Filtering.ipynb for the full Big Data implementation. It covers the train/test split and the coldStartStrategy setting, which drops the null predictions that arise for users or movies present in the test dataset but not in the train dataset.

Please also note that in a complete pipeline, user id information would most likely be captured at the log-in step.

#**Setup SparkNLP and PySpark**

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash


In [None]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import numpy as np
from pyspark.ml.linalg import *
from pyspark.sql.types import * 
from pyspark.sql.functions import *
from pyspark.ml.feature import *
import sparknlp

spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.4
Apache Spark version:  3.0.3


#**Retrieving Dataset Sample**


In [None]:
!wget https://raw.githubusercontent.com/azraf-a/BERT_SparkNLP_Filter/main/ratings_small.csv -O ratings.csv
!wget https://raw.githubusercontent.com/azraf-a/BERT_SparkNLP_Filter/main/movies_small.csv -O movies.csv


In [None]:
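# Schema for a links-style table (movieId, imdbId, tmdbId).
# Note: it is defined here but not applied to either read below,
# which both infer their schemas from the CSV files.
schema = StructType() \
      .add("movieId", IntegerType(), True) \
      .add("imdbId", StringType(), True) \
      .add("tmdbId", IntegerType(), True)

# Ratings sample (expected to hold userId, movieId, rating, timestamp)
df = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("ratings.csv")

# The timestamp is not needed for ALS.
# drop() is a no-op for a column that is absent, so dropping 'genres'
# has no effect if ratings.csv has no such column.
df = df.drop('timestamp')
df = df.drop('genres')

# Movie metadata, used later to look up titles by movieId
df2 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("movies.csv")

# Ratings joined with movie metadata on movieId
df4 = df.join(df2, ['movieId'])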

##**Generating ALS Model**

In [None]:
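# Train the ALS model on the full ratings sample (no train/test split here)
from pyspark.ml.recommendation import ALS

als_model = ALS(userCol='userId',
                itemCol='movieId',
                ratingCol='rating',   # the default column name, made explicit
                nonnegative=True,     # constrain latent factors to be non-negative
                regParam=0.1,         # regularization strength
                rank=10)              # number of latent factors we are choosing

recommender = als_model.fit(df)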
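As a rough sketch of what `rank` and `regParam` control (this is the textbook explicit-feedback formulation, and Spark's internal scaling of the regularization term may differ): each user $u$ and movie $i$ is given a vector of $k$ latent factors, with $k = 10$ here, and a rating is predicted as their dot product. The factors are chosen to minimize the regularized squared error over the observed ratings, with $\lambda = 0.1$:

$$
\hat{r}_{ui} = \mathbf{x}_u^\top \mathbf{y}_i, \qquad
\min_{X, Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \mathbf{x}_u^\top \mathbf{y}_i \right)^2
+ \lambda \left( \sum_u \lVert \mathbf{x}_u \rVert^2 + \sum_i \lVert \mathbf{y}_i \rVert^2 \right)
$$

Here $\mathcal{K}$ is the set of observed (user, movie) pairs, $\mathbf{x}_u, \mathbf{y}_i \in \mathbb{R}^k$, and `nonnegative=True` additionally constrains every entry of $\mathbf{x}_u$ and $\mathbf{y}_i$ to be at least zero. "Alternating" refers to solving for $X$ with $Y$ held fixed, then for $Y$ with $X$ held fixed, and repeating.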

In [None]:
predictions = recommender.transform(df)

##**User Queries**

In [None]:
#Generate top 10 movie recommendations for each user
userRecs = recommender.recommendForAllUsers(10)


In [None]:
# input() returns a string, so convert it to an int before comparing with userId
user_id = int(input("Please enter your assigned user id (a number between 1 and 610): "))

print("Your user id is:", user_id)
print()
print("Based on your past reviewing history and similar users, we believe you will like: ")
# Each 'recommendations' value is a list of (movieId, rating) entries
rec = [row[0] for row in userRecs.filter(col('userId') == user_id).select('recommendations').collect()]
movies = [row[0] for row in rec[0]]
for movie in movies:
  # Look up the title for each recommended movieId
  print(df4.filter(col('movieId') == movie).select('title').collect()[0][0])

Please enter your assigned user id (a number between 1 and 610): 32
Your user id is: 32

Based on your past reviewing history and similar users, we believe you will like: 
On the Beach (1959)
Saving Face (2004)
Victory (a.k.a. Escape to Victory) (1981)
Seve (2014)
The Big Bus (1976)
Grand Day Out with Wallace and Gromit, A (1989)
Moby Dick (1956)
Holy Mountain, The (Montaña sagrada, La) (1973)
Black Mirror: White Christmas (2014)
Trial, The (Procès, Le) (1962)
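
Because the prompt accepts free text, an id outside the 1 to 610 range or a non-numeric entry would fail only later, when the lookup returns no rows. A minimal sketch of validating the entry up front is shown below; `parse_user_id` is a hypothetical helper, and the default bounds are taken from the prompt text above.

```python
def parse_user_id(text, lowest=1, highest=610):
    """Convert raw prompt text to an int user id, or raise ValueError.

    Hypothetical helper; the default bounds mirror the prompt's stated range.
    """
    try:
        user_id = int(text.strip())
    except ValueError:
        raise ValueError(f"{text!r} is not a whole number") from None
    if not lowest <= user_id <= highest:
        raise ValueError(
            f"user id must be between {lowest} and {highest}, got {user_id}"
        )
    return user_id
```

It could be used as `user_id = parse_user_id(input("Please enter your assigned user id: "))`, with the `ValueError` caught to ask the user again.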
