<a href="https://colab.research.google.com/github/gabgovar/Apache-Spark/blob/main/Collaborative_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative filtering

## Installing the Hadoop Spark dependencies on Google Colab

In [1]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

## Configuring the Hadoop Spark dependencies on Google Colab

In [2]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')
findspark.find()

'spark-2.4.4-bin-hadoop2.7/python/pyspark'

## Reading the DataFrames

In [21]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col, explode
spark = SparkSession.builder.appName("Collaborative filtering").getOrCreate()

In [25]:
moviesDF = spark.read.options(header=True, inferSchema=True).csv("/content/drive/MyDrive/PySpark/data/movies.csv")
ratingsDF = spark.read.options(header=True, inferSchema=True).csv("/content/drive/MyDrive/PySpark/data/ratings.csv")

moviesDF.show()
ratingsDF.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

## Joining DFs

In [28]:
ratings = ratingsDF.join(moviesDF, 'movieId', 'left')

## Training and test data

In [29]:
# split into roughly 80% training and 20% test
(train, test) = ratings.randomSplit([0.8, 0.2])
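
The split sizes reported in the following cells are close to 80/20 but not exact, and they change between runs because no seed is passed. The sketch below is a plain-Python illustration of that behavior, not Spark's implementation: it assumes each row is assigned independently with probability proportional to its weight. The helper `random_split` and the seed value are hypothetical names chosen for this example.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            if x < bound:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point round-off in the last boundary.
            splits[-1].append(row)
    return splits

n_rows = 100836  # number of ratings reported by ratings.count()
train_ids, test_ids = random_split(range(n_rows), [0.8, 0.2], seed=42)
print(len(train_ids), len(test_ids))
```

In the notebook itself, `randomSplit` accepts an optional `seed` argument, which makes the split reproducible as long as the input DataFrame is the same.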

In [30]:
ratings.count()

100836

In [31]:
print(train.count())
train.show()

80543
+-------+------+------+----------+----------------+--------------------+
|movieId|userId|rating| timestamp|           title|              genres|
+-------+------+------+----------+----------------+--------------------+
|      1|     5|   4.0| 847434962|Toy Story (1995)|Adventure|Animati...|
|      1|    17|   4.5|1305696483|Toy Story (1995)|Adventure|Animati...|
|      1|    18|   3.5|1455209816|Toy Story (1995)|Adventure|Animati...|
|      1|    19|   4.0| 965705637|Toy Story (1995)|Adventure|Animati...|
|      1|    21|   3.5|1407618878|Toy Story (1995)|Adventure|Animati...|
|      1|    27|   3.0| 962685262|Toy Story (1995)|Adventure|Animati...|
|      1|    31|   5.0| 850466616|Toy Story (1995)|Adventure|Animati...|
|      1|    32|   3.0| 856736119|Toy Story (1995)|Adventure|Animati...|
|      1|    33|   3.0| 939647444|Toy Story (1995)|Adventure|Animati...|
|      1|    43|   5.0| 848993983|Toy Story (1995)|Adventure|Animati...|
|      1|    44|   3.0| 869251860|Toy Story (

In [32]:
print(test.count())
test.show()

20293
+-------+------+------+----------+----------------+--------------------+
|movieId|userId|rating| timestamp|           title|              genres|
+-------+------+------+----------+----------------+--------------------+
|      1|     1|   4.0| 964982703|Toy Story (1995)|Adventure|Animati...|
|      1|     7|   4.5|1106635946|Toy Story (1995)|Adventure|Animati...|
|      1|    15|   2.5|1510577970|Toy Story (1995)|Adventure|Animati...|
|      1|    40|   5.0| 832058959|Toy Story (1995)|Adventure|Animati...|
|      1|    50|   3.0|1514238116|Toy Story (1995)|Adventure|Animati...|
|      1|    57|   5.0| 965796031|Toy Story (1995)|Adventure|Animati...|
|      1|    71|   5.0| 864737933|Toy Story (1995)|Adventure|Animati...|
|      1|    78|   4.0|1252575124|Toy Story (1995)|Adventure|Animati...|
|      1|    89|   3.0|1520408314|Toy Story (1995)|Adventure|Animati...|
|      1|    98|   4.5|1532457849|Toy Story (1995)|Adventure|Animati...|
|      1|   107|   4.0| 829322340|Toy Story (

## ALS model

In [33]:
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", nonnegative=True, implicitPrefs=False, coldStartStrategy="drop")
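
As a rough illustration of what "alternating least squares" means, the sketch below factorizes a tiny explicit-ratings table with a single latent factor per user and per item. It is a simplification under stated assumptions, not Spark's implementation: the ratings are made up, the rank is fixed at 1 so that each update has a scalar closed form, `reg` plays the role of `regParam`, and the helper names (`objective`, `solve_side`) are hypothetical.

```python
# Toy explicit ratings: (user, item) -> rating. Values are made up for illustration.
ratings = {
    ("u1", "a"): 5.0, ("u1", "b"): 3.0,
    ("u2", "a"): 4.0, ("u2", "c"): 1.0,
    ("u3", "b"): 2.0, ("u3", "c"): 1.0,
}
reg = 0.1  # regularization strength, analogous to regParam

def objective(user_f, item_f):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f.values())
                     + sum(x * x for x in item_f.values()))
    return err + penalty

def solve_side(fixed, by_user):
    """Closed-form least-squares update for one side while the other is held fixed."""
    num, den = {}, {}
    for (u, i), r in ratings.items():
        key, other = (u, i) if by_user else (i, u)
        num[key] = num.get(key, 0.0) + r * fixed[other]
        den[key] = den.get(key, 0.0) + fixed[other] ** 2
    return {k: num[k] / (den[k] + reg) for k in num}

# Start from constant factors, then alternate: fix items and solve users, and vice versa.
user_f = {u: 1.0 for u, _ in ratings}
item_f = {i: 1.0 for _, i in ratings}
history = [objective(user_f, item_f)]
for _ in range(10):
    user_f = solve_side(item_f, by_user=True)
    item_f = solve_side(user_f, by_user=False)
    history.append(objective(user_f, item_f))

# Predicted rating for a pair that was never observed.
print(user_f["u1"] * item_f["c"])
print(history[0], history[-1])
```

Each half-step exactly minimizes the objective over one set of factors, so the recorded objective never increases. With a higher rank, the scalar division becomes a small linear solve per user or item.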

## Hyperparameter tuning and cross-validation

In [34]:
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 100, 150]) \
            .addGrid(als.regParam, [.01, .05, .1, .15]) \
            .build()

In [35]:
evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="rating", 
           predictionCol="prediction")

In [36]:
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
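
The cost of the search is easy to underestimate, so the short sketch below counts the model fits implied by the grid and fold settings above. It only does arithmetic on the values from the notebook; the variable names are illustrative.

```python
from itertools import product

ranks = [10, 50, 100, 150]            # values passed to addGrid(als.rank, ...)
reg_params = [0.01, 0.05, 0.1, 0.15]  # values passed to addGrid(als.regParam, ...)
num_folds = 5                         # numFolds passed to CrossValidator

grid = list(product(ranks, reg_params))
cv_fits = len(grid) * num_folds       # one fit per combination per fold
total_fits = cv_fits + 1              # plus the final refit of the best combination

print(len(grid), cv_fits, total_fits)  # 16 80 81
```

Since ALS with rank 100 or 150 is comparatively expensive, trimming the grid or the number of folds is the most direct way to shorten the run.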

In [37]:
model = cv.fit(train)

In [38]:
best_model = model.bestModel
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

0.8594451578310159
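
For reference, RMSE is the square root of the mean squared difference between the actual and predicted ratings, so the value above means predictions are off by a bit under one rating point in this root-mean-square sense. The sketch below computes the same metric in plain Python on hypothetical numbers that are not taken from the notebook's data; the `rmse` helper is illustrative and is not the `RegressionEvaluator` implementation.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings and predictions, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

Because the model was created with `coldStartStrategy="drop"`, test rows whose user or movie was unseen during training are dropped from the predictions rather than scored as NaN, so the metric is computed only over rows that received a prediction.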


In [39]:
recommendations = best_model.recommendForAllUsers(5)

In [40]:
df = recommendations

In [41]:
df.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[68945, 4.852309...|
|   463|[[68945, 5.041718...|
|   496|[[68945, 4.286804...|
|   148|[[170355, 4.45189...|
|   540|[[170355, 5.65680...|
|   392|[[68945, 4.664522...|
|   243|[[67618, 5.619018...|
|    31|[[170355, 5.03692...|
|   516|[[170355, 4.84768...|
|   580|[[68945, 4.790042...|
|   251|[[170355, 5.86506...|
|   451|[[68945, 5.302093...|
|    85|[[1140, 4.84152],...|
|   137|[[170355, 4.93007...|
|    65|[[68945, 4.944324...|
|   458|[[67618, 5.489680...|
|   481|[[5867, 3.9843266...|
|    53|[[68945, 6.651965...|
|   255|[[7842, 3.5909836...|
|   588|[[170355, 4.70744...|
+------+--------------------+
only showing top 20 rows



In [42]:
df2 = df.withColumn("movieid_rating", explode("recommendations"))

In [43]:
df2.show()

+------+--------------------+-------------------+
|userId|     recommendations|     movieid_rating|
+------+--------------------+-------------------+
|   471|[[68945, 4.852309...|  [68945, 4.852309]|
|   471|[[68945, 4.852309...| [170355, 4.852309]|
|   471|[[68945, 4.852309...|   [3379, 4.852309]|
|   471|[[68945, 4.852309...|   [6818, 4.625158]|
|   471|[[68945, 4.852309...|  [8477, 4.6076508]|
|   463|[[68945, 5.041718...| [68945, 5.0417185]|
|   463|[[68945, 5.041718...|[170355, 5.0417185]|
|   463|[[68945, 5.041718...|  [3379, 5.0417185]|
|   463|[[68945, 5.041718...|  [59018, 4.712633]|
|   463|[[68945, 5.041718...|  [60943, 4.712633]|
|   496|[[68945, 4.286804...|  [68945, 4.286804]|
|   496|[[68945, 4.286804...| [170355, 4.286804]|
|   496|[[68945, 4.286804...|   [3379, 4.286804]|
|   496|[[68945, 4.286804...| [84847, 4.2657585]|
|   496|[[68945, 4.286804...| [158966, 4.196594]|
|   148|[[170355, 4.45189...|[170355, 4.4518933]|
|   148|[[170355, 4.45189...| [68945, 4.4518933]|


In [44]:
df2.select("userId", col("movieid_rating.movieId"), col("movieid_rating.rating")).show()

+------+-------+---------+
|userId|movieId|   rating|
+------+-------+---------+
|   471|  68945| 4.852309|
|   471| 170355| 4.852309|
|   471|   3379| 4.852309|
|   471|   6818| 4.625158|
|   471|   8477|4.6076508|
|   463|  68945|5.0417185|
|   463| 170355|5.0417185|
|   463|   3379|5.0417185|
|   463|  59018| 4.712633|
|   463|  60943| 4.712633|
|   496|  68945| 4.286804|
|   496| 170355| 4.286804|
|   496|   3379| 4.286804|
|   496|  84847|4.2657585|
|   496| 158966| 4.196594|
|   148| 170355|4.4518933|
|   148|  68945|4.4518933|
|   148|   3379|4.4518933|
|   148|  67618|4.3692966|
|   148|  25906|4.3385787|
+------+-------+---------+
only showing top 20 rows
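
The two steps above (`explode` followed by selecting the struct fields) amount to flattening a nested list. The plain-Python sketch below mirrors that transformation for user 471, using the values shown in the output above; modeling a row as a tuple with a list of pairs is an assumption made for illustration.

```python
# One row of the recommendations DataFrame, modeled as
# (userId, [(movieId, rating), ...]). Values copied from the user 471 rows above.
rows = [
    (471, [(68945, 4.852309), (170355, 4.852309), (3379, 4.852309),
           (6818, 4.625158), (8477, 4.6076508)]),
]

# "explode" emits one output row per array element; selecting the struct
# fields then turns each element into separate movieId and rating columns.
flat = [
    (user_id, movie_id, rating)
    for user_id, recs in rows
    for movie_id, rating in recs
]

for record in flat:
    print(record)
```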

