# Task I: Build a recommendation system

* Suppose users are eager to answer questions, but there are many questions and it is not easy to find the relevant ones.
* We have a list of questions that already have some answers, but none of them was accepted, so they are still waiting for a more precise or accurate answer.
* Try to build a system that recommends, for each of these questions, a set of 10 users who are likely to answer it.
* Use the ALS algorithm with implicit ratings and assume that the rating can be modeled by the score information.
    * Some questions have been answered by multiple users, and these answers gained some score (even though they may not have been accepted).
    * We will assume that users with similar knowledge or interests will gain similar scores for their answers. So for a particular question, we will recommend users based on the other users who answered it.
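
With implicit ratings, the score is usually not treated as a value to reproduce but as evidence of interest. A minimal sketch of that idea, assuming the common formulation in which each score is split into a binary preference and a confidence weight; the function name `implicit_feedback` and the value `alpha=1.0` are illustrative and not part of this notebook, and the way Spark handles non-positive scores internally may differ in detail.

```python
def implicit_feedback(score, alpha=1.0):
    """Map an answer score to a (preference, confidence) pair.

    Assumed formulation (illustrative only):
      preference = 1 if score > 0 else 0
      confidence = 1 + alpha * |score|
    """
    preference = 1.0 if score > 0 else 0.0
    confidence = 1.0 + alpha * abs(score)
    return preference, confidence


# A few example scores: a higher score raises the confidence,
# while the preference stays binary.
for score in (0, 1, 4, 18):
    print(score, implicit_feedback(score))
```

A score of 0 gives no preference and only the baseline confidence, while a score of 18 gives the same preference as a score of 1 but with a much larger weight.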

In [10]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.recommendation import ALS
import os

In [2]:
spark = (
    SparkSession
    .builder
    .appName('RS')
    .getOrCreate()
)

In [5]:
base_path = os.getcwd()

# The project root is three directory levels above the working directory
project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')
questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [6]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

#### Prepare the input data for ALS

Hint:
* the algorithm expects a DataFrame with three columns: user, item, rating
    * in our case, the item is question_id
    * in our case, the rating is score
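
The shape of that input can be illustrated without Spark. The sketch below builds (user, item, rating) triples from plain Python records; the function name `build_ratings` and all ids and scores are made up for illustration, and only the field names follow the columns used in this notebook.

```python
# Toy records; every id and score here is made up for illustration.
questions = [
    {"question_id": 1, "accepted_answer_id": None},
    {"question_id": 2, "accepted_answer_id": 20},
]
answers = [
    {"question_id": 1, "user_id": 100, "score": 4},
    {"question_id": 1, "user_id": None, "score": 1},  # answer without a user id
    {"question_id": 2, "user_id": 200, "score": 7},
]


def build_ratings(questions, answers):
    """Return (user, item, rating) triples for questions without an accepted answer."""
    open_ids = {
        q["question_id"] for q in questions if q["accepted_answer_id"] is None
    }
    return [
        (a["user_id"], a["question_id"], a["score"])
        for a in answers
        if a["question_id"] in open_ids and a["user_id"] is not None
    ]


print(build_ratings(questions, answers))  # [(100, 1, 4)]
```

Only the answer to the open question with a known user survives, which is the same filtering and joining that the DataFrame version has to perform.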

In [17]:
# Answers to questions that already have an accepted answer
ratings = (
    questionsDF
    .filter(col('accepted_answer_id').isNotNull())
    .alias('q')
    .join(answersDF.alias('a'), 'question_id')
    .select(
        col('a.user_id').alias('user'),
        col('question_id').alias('item'),
        col('a.score').alias('rating')
    )
    # Drop answers that have no user id
    .filter(col('user').isNotNull())
)

In [18]:
ratings.show(n=5)

+------+------+------+
|  user|  item|rating|
+------+------+------+
|137842|370385|     1|
|   717|  6419|     3|
| 47360|396818|     7|
|  1492| 41748|     1|
| 59406|150238|    18|
+------+------+------+
only showing top 5 rows



In [9]:
# Answers to questions that still have no accepted answer
data = (
    questionsDF
    .filter(col('accepted_answer_id').isNull())
    .alias('q')
    .join(answersDF.alias('a'), 'question_id')
    .select(
        col('a.user_id').alias('user'),
        col('question_id').alias('item'),
        col('a.score').alias('rating')
    )
    # Drop answers that have no user id
    .filter(col('user').isNotNull())
)

In [11]:
data.show(n=5)

+------+------+------+
|  user|  item|rating|
+------+------+------+
|211169|437989|     1|
|204101|427703|     0|
| 26143|109750|     4|
| 43351|397644|     4|
| 52112|399917|     4|
+------+------+------+
only showing top 5 rows



#### Train the ALS model

In [19]:
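# userCol, itemCol and ratingCol default to 'user', 'item' and 'rating',
# which matches the column aliases used above.
# implicitPrefs defaults to False, so this fits the explicit-feedback variant.
als = ALS(rank=10, maxIter=5, seed=0)

model = als.fit(ratings)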

In [14]:
questions = spark.createDataFrame([(7880, )], ['item'])

In [20]:
model.recommendForItemSubset(questions, 5).show(truncate=80)

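+----+---------------+
|item|recommendations|
+----+---------------+
+----+---------------+

The result is empty most likely because this model was fitted on `ratings`, which contains only questions with an accepted answer. Question 7880 presumably has none, so the model holds no factors for it. Judging by the cell numbers, the non-empty output in the next cell comes from an earlier run, probably with a model fitted on `data`.
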
In [15]:
model.recommendForItemSubset(questions, 5).show(truncate=80)

+----+--------------------------------------------------------------------------------+
|item|                                                                 recommendations|
+----+--------------------------------------------------------------------------------+
|7880|[[12240, 1.8585068], [29, 1.2806133], [1945, 1.1543533], [2311, 1.0827202], [...|
+----+--------------------------------------------------------------------------------+
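
The scores in such a result can be read as dot products of learned user and item factor vectors. The sketch below shows that idea in plain Python; it is not Spark's implementation, `recommend_users` is a hypothetical helper, and the rank-2 factors and ids are made up. It also shows why an item that was absent from the training data yields no recommendations.

```python
def recommend_users(item, user_factors, item_factors, k):
    """Return the k users with the highest predicted rating for `item`."""
    if item not in item_factors:
        # The model never saw this item during training, so it has no factors.
        return []
    q = item_factors[item]
    scored = [
        (user, sum(u_f * q_f for u_f, q_f in zip(p, q)))
        for user, p in user_factors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up rank-2 factors; a fitted model would learn these from the ratings.
user_factors = {101: [1.0, 0.5], 102: [0.5, 0.5], 103: [0.0, 0.5]}
item_factors = {7880: [1.0, 1.0]}

print(recommend_users(7880, user_factors, item_factors, k=2))  # [(101, 1.5), (102, 1.0)]
print(recommend_users(999, user_factors, item_factors, k=2))   # []
```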



In [16]:
model.recommendForItemSubset(questions, 5).printSchema()

root
 |-- item: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- user: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

