## Building a Movie Recommender with MLlib
+ Dataset: MovieLens (25 million movie ratings); this example uses only a subset of about 70,000 ratings, ```ratings_short.csv```
+ Recommendation algorithm: Alternating Least Squares (ALS)

---

### 1. Loading the Movie Rating Data & Creating a DataFrame

In [1]:
# [+] Set up SparkSession
import findspark
findspark.init()

from pyspark.sql import SparkSession

# [+] Create and configure the SparkSession object
spark = SparkSession.builder.master('local').appName('mllib-movie').getOrCreate()

#### Loading the MovieLens data
+ File name: ```ratings_short.csv```
+ Schema inference: ```inferSchema=True```
+ Use the first row as the header: ```header=True```

In [5]:
# [+] Load the MovieLens data
path = './data/'
file = 'ratings_short.csv'

ratings_df = spark.read.csv(path + file, inferSchema=True, header=True)

In [6]:
# [+] Print the DataFrame
ratings_df.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
|     1|   1088|   4.0|1147868495|
|     1|   1175|   3.5|1147868826|
|     1|   1217|   3.5|1147878326|
|     1|   1237|   5.0|1147868839|
|     1|   1250|   4.0|1147868414|
|     1|   1260|   3.5|1147877857|
|     1|   1653|   4.0|1147868097|
|     1|   2011|   2.5|1147868079|
|     1|   2012|   2.5|1147868068|
|     1|   2068|   2.5|1147869044|
|     1|   2161|   3.5|1147868609|
|     1|   2351|   4.5|1147877957|
|     1|   2573|   4.0|1147878923|
|     1|   2632|   5.0|1147878248|
|     1|   2692|   5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows



In [7]:
# [+] Print the DataFrame schema
ratings_df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [8]:
# Select every column except timestamp
ratings_df = ratings_df.select(['userId', 'movieId', 'rating'])

In [10]:
# describe(): print descriptive statistics
ratings_df.select('rating').describe().show()

+-------+------------------+
|summary|            rating|
+-------+------------------+
|  count|             71921|
|   mean|3.5821387355570695|
| stddev| 1.042406032579843|
|    min|               0.5|
|    max|               5.0|
+-------+------------------+



### 2. Preparing the Training Data and Training the Recommendation Model

In [11]:
# randomSplit(): split the data into training and test sets (roughly 80% / 20%)
train_df, test_df = ratings_df.randomSplit([0.8, 0.2])
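
The split above is random, so the two sets are only approximately 80% and 20% of the rows, and they change from run to run unless a seed is given (PySpark's `randomSplit` accepts an optional `seed` argument). The pure-Python sketch below illustrates the idea of assigning each row to a split by drawing one random number per row. The helper `random_split` is hypothetical and is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for idx, bound in enumerate(bounds):
            if x < bound:
                splits[idx].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # close to 800 / 200, but not exactly
```

Because the row counts are not exact, the test-set size reported later in the notebook can differ between runs.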

In [12]:
# Import the recommendation algorithm (Alternating Least Squares)
from pyspark.ml.recommendation import ALS

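#### Setting the ALS hyperparameters
+ ```maxIter```: maximum number of training iterations
+ ```regParam```: regularization parameter (must be non-negative; values between 0 and 1 are common in practice)
+ ```coldStartStrategy```: how to handle predictions for users or items that did not appear in the training data (the cold-start problem). With ```drop```, rows whose prediction would be ```NaN``` are removed from the output of ```transform()```, so evaluation metrics stay well-defined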
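
To make the roles of `maxIter` and `regParam` concrete, here is a minimal pure-Python sketch of alternating least squares with a single latent factor per user and item. It is an illustration under simplifying assumptions, not Spark's implementation: Spark's `ALS` uses rank 10 by default, runs distributed, and scales the regularization term by the number of ratings each user or item has. The toy ratings and the helper names `als_rank1` and `regularized_loss` are hypothetical.

```python
def als_rank1(ratings, max_iter=5, reg_param=0.1):
    """Rank-1 ALS on a dict {(user, item): rating}; returns (p, q) factor dicts."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    p = {u: 1.0 for u in users}  # one latent factor per user
    q = {i: 1.0 for i in items}  # one latent factor per item
    for _ in range(max_iter):
        # Hold item factors fixed; each user's factor has a closed-form solution
        for u in users:
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            p[u] = (sum(r * q[i] for i, r in rated)
                    / (reg_param + sum(q[i] ** 2 for i, _ in rated)))
        # Hold user factors fixed; solve for each item's factor the same way
        for i in items:
            rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            q[i] = (sum(r * p[u] for u, r in rated)
                    / (reg_param + sum(p[u] ** 2 for u, _ in rated)))
    return p, q


def regularized_loss(ratings, p, q, reg_param=0.1):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    err = sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg_param * (sum(v ** 2 for v in p.values())
                           + sum(v ** 2 for v in q.values()))
    return err + penalty


# Hypothetical toy data: three users, three items, six observed ratings
ratings = {
    (1, 'A'): 5.0, (1, 'B'): 3.0,
    (2, 'A'): 4.0, (2, 'C'): 1.0,
    (3, 'B'): 2.0, (3, 'C'): 5.0,
}

for n_iter in (0, 1, 5):
    p, q = als_rank1(ratings, max_iter=n_iter)
    print(n_iter, regularized_loss(ratings, p, q))

# Predicted rating for a pair that was never observed
print(p[1] * q['C'])
```

Each half-step minimizes the regularized loss exactly with the other side held fixed, so the printed loss should not increase as the iteration count grows. A larger `reg_param` shrinks the factors toward zero, trading training fit for less overfitting.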

In [14]:
# Set the ALS hyperparameters

als = ALS(
    maxIter=5,
    regParam=0.1,
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
    coldStartStrategy='drop'
)

In [15]:
# [+] Train the model
model = als.fit(train_df)

In [None]:
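# # If training fails with an out-of-memory error, run the code below instead.
# # Restart the kernel first: getOrCreate() returns any session that already exists,
# # and the driver memory setting likely only takes effect before the JVM starts.
# from pyspark.sql import SparkSession

# MAX_MEMORY = '5g'
# spark = SparkSession.builder.appName('movie-recommendation')\
#     .config('spark.executor.memory', MAX_MEMORY)\
#     .config('spark.driver.memory', MAX_MEMORY)\
#     .getOrCreate()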

In [17]:
# [+] Predict ratings for the test set
predictions = model.transform(test_df)

In [18]:
# [+] Print the predictions
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|     19|   3.0| 2.8600545|
|   148|     32|   4.0| 3.9757197|
|   148|     50|   4.5|  4.486109|
|   148|    527|   5.0|   4.36725|
|   148|    912|   4.0| 4.2792554|
|   148|   1172|   4.5|  3.907507|
|   148|   1178|   5.0| 4.1054544|
|   148|   1196|   3.5|  4.102692|
|   148|   1199|   3.5| 3.9058423|
|   148|   1206|   4.0|   4.00924|
|   148|   1207|   4.0|  4.291973|
|   148|   1217|   3.5|  4.377297|
|   148|   1234|   4.5| 4.1506653|
|   148|   1250|   4.0| 4.0075674|
|   148|   1284|   4.0|  3.935092|
|   148|   1292|   3.5| 3.9180844|
|   148|   1617|   4.0| 4.2387466|
|   148|   2329|   4.0| 4.2994995|
|   148|   2858|   4.0|  4.239336|
|   148|   3435|   4.5| 4.0150285|
+------+-------+------+----------+
only showing top 20 rows



In [19]:
# [+] Print statistics for the actual and predicted ratings
predictions.select('rating', 'prediction').describe().show()

+-------+------------------+------------------+
|summary|            rating|        prediction|
+-------+------------------+------------------+
|  count|             13487|             13487|
|   mean|  3.60295098984207| 3.420315276778079|
| stddev|1.0371921961158144|0.7559795884305968|
|    min|               0.5|        0.11918053|
|    max|               5.0|         5.5110316|
+-------+------------------+------------------+



In [20]:
# Model evaluation: RMSE (Root Mean Squared Error)
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    metricName='rmse',
    labelCol='rating',
    predictionCol='prediction'
)

In [21]:
# Compute the RMSE
rmse = evaluator.evaluate(predictions)

In [22]:
rmse

0.9122050030430022
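
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand on the first five rows of the `predictions.show()` output above, and also shows why dropping `NaN` predictions matters: a single `NaN` makes the whole metric `NaN`. The helper `rmse` and the extra cold-start row are illustrative assumptions, not part of the notebook.

```python
import math


def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


# First five (rating, prediction) rows from the predictions output above
pairs = [
    (3.0, 2.8600545),
    (4.0, 3.9757197),
    (4.5, 4.486109),
    (5.0, 4.36725),
    (4.0, 4.2792554),
]
print(rmse(pairs))

# Hypothetical cold-start row: no learned factors, so the prediction is NaN
with_cold = pairs + [(4.0, float('nan'))]
print(rmse(with_cold))  # nan

# Filtering the NaN rows first restores a usable metric
kept = [(r, p) for r, p in with_cold if not math.isnan(p)]
print(rmse(kept))
```

An RMSE of about 0.91 on the full test set means predictions are off by a little under one star on average, on a scale from 0.5 to 5.0.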

### 3. Recommending Movies with the Trained Model
+ ```recommendForAllUsers()```: recommends items for every user
+ ```recommendForAllItems()```: recommends users for every item
+ ```recommendForUserSubset()```: recommends items for a given subset of users
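
Conceptually, each of these methods scores user-item pairs by the dot product of the learned user and item factor vectors and keeps the highest-scoring entries. The sketch below shows that idea in plain Python. The factor values are made up for illustration, and the `exclude` parameter is an assumption added here, not an option of the Spark methods.

```python
import heapq


def recommend_top_n(user_vec, item_factors, n, exclude=()):
    """Return the n highest-scoring (item, score) pairs for one user."""
    scored = (
        (sum(a * b for a, b in zip(user_vec, vec)), item)
        for item, vec in item_factors.items()
        if item not in exclude
    )
    return [(item, score) for score, item in heapq.nlargest(n, scored)]


# Hypothetical 2-dimensional factors keyed by movieId
item_factors = {
    296: [2.0, 1.0],
    306: [0.5, 0.5],
    307: [1.0, 2.0],
    665: [0.0, 1.0],
}
user_vec = [1.0, 2.0]

print(recommend_top_n(user_vec, item_factors, 3))
# [(307, 5.0), (296, 4.0), (665, 2.0)]

# Skip movies the user has already rated
print(recommend_top_n(user_vec, item_factors, 3, exclude={307}))
# [(296, 4.0), (665, 2.0), (306, 1.5)]
```

Filtering seen items can be worthwhile: in this notebook, movie 2632 shows up in user 1's recommendations even though user 1 already rated it 5.0.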

In [24]:
# Recommend 3 items for each user
model.recommendForAllUsers(3).show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{69699, 5.187405...|
|     2|[{1226, 5.645141}...|
|     3|[{38095, 4.869428...|
|     4|[{8235, 5.2967744...|
|     5|[{55814, 5.009750...|
|     6|[{38095, 5.192246...|
|     7|[{4546, 5.019621}...|
|     8|[{6188, 4.9620185...|
|     9|[{60103, 5.485226...|
|    10|[{160718, 4.73979...|
|    11|[{59725, 5.012532...|
|    12|[{3470, 4.891706}...|
|    13|[{136449, 5.28848...|
|    14|[{6286, 5.4066854...|
|    15|[{160718, 5.80492...|
|    16|[{7099, 5.2134285...|
|    17|[{8958, 4.6497145...|
|    18|[{3179, 5.010684}...|
|    19|[{141432, 4.90274...|
|    20|[{38095, 5.351368...|
+------+--------------------+
only showing top 20 rows



In [25]:
# Recommend 3 users for each item
model.recommendForAllItems(3).show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|      1|[{327, 5.3840566}...|
|      2|[{327, 4.7976213}...|
|      3|[{484, 4.7779408}...|
|      4|[{173, 4.541672},...|
|      5|[{484, 4.513037},...|
|      6|[{539, 4.898551},...|
|      7|[{484, 4.9674277}...|
|      8|[{448, 4.6179695}...|
|      9|[{198, 4.599905},...|
|     10|[{327, 4.755638},...|
|     11|[{484, 5.0589147}...|
|     12|[{198, 4.7528453}...|
|     13|[{327, 5.4825606}...|
|     14|[{448, 5.1150827}...|
|     15|[{474, 4.3613715}...|
|     16|[{28, 5.339954}, ...|
|     17|[{87, 5.3824453},...|
|     18|[{474, 4.8556476}...|
|     19|[{428, 4.763089},...|
|     20|[{484, 4.5239925}...|
+-------+--------------------+
only showing top 20 rows



In [26]:
# Select a specific user
user_lst = [1]

In [27]:
from pyspark.sql.types import IntegerType

In [28]:
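# Create a DataFrame of user IDs
# (Spark resolves column names case-insensitively by default, so 'userID' matches the model's 'userId')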
users_df = spark.createDataFrame(user_lst, IntegerType()).toDF('userID')

In [29]:
users_df.show()

+------+
|userID|
+------+
|     1|
+------+



In [30]:
# recommendForUserSubset(): recommend items for a given subset of users
user_recs = model.recommendForUserSubset(users_df, 5)

In [31]:
user_recs.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{69699, 5.187405...|
+------+--------------------+



In [32]:
user_recs.collect()

[Row(userId=1, recommendations=[Row(movieId=69699, rating=5.187405109405518), Row(movieId=101070, rating=5.082510948181152), Row(movieId=8327, rating=5.021006107330322), Row(movieId=2632, rating=4.963029861450195), Row(movieId=5767, rating=4.963029861450195)])]

In [33]:
# Pull the recommendations into Python objects
movies_lst = user_recs.collect()[0].recommendations

In [34]:
movies_lst

[Row(movieId=69699, rating=5.187405109405518),
 Row(movieId=101070, rating=5.082510948181152),
 Row(movieId=8327, rating=5.021006107330322),
 Row(movieId=2632, rating=4.963029861450195),
 Row(movieId=5767, rating=4.963029861450195)]

In [35]:
# Create a DataFrame from movies_lst
recs_df = spark.createDataFrame(movies_lst)
recs_df.show()

+-------+-----------------+
|movieId|           rating|
+-------+-----------------+
|  69699|5.187405109405518|
| 101070|5.082510948181152|
|   8327|5.021006107330322|
|   2632|4.963029861450195|
|   5767|4.963029861450195|
+-------+-----------------+



In [36]:
# [+] Create a DataFrame for the movie data

file = 'movies.csv'

movies_df = spark.read.csv(path + file, inferSchema=True, header=True)


In [37]:
movies_df.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

In [39]:
# [+] Create temporary views for recs_df and movies_df
recs_df.createOrReplaceTempView('recommendations')
movies_df.createOrReplaceTempView('movies')

In [40]:
# Fetch the recommended movie titles with a SQL JOIN
spark.sql(
    "SELECT * \
    FROM movies JOIN recommendations \
    ON movies.movieID = recommendations.movieID \
    ORDER BY rating DESC"
).show()

+-------+--------------------+--------------------+-------+-----------------+
|movieId|               title|              genres|movieId|           rating|
+-------+--------------------+--------------------+-------+-----------------+
|  69699| Love Streams (1984)|        Comedy|Drama|  69699|5.187405109405518|
| 101070|       Wadjda (2012)|               Drama| 101070|5.082510948181152|
|   8327|        Dolls (2002)|       Drama|Romance|   8327|5.021006107330322|
|   2632|Saragossa Manuscr...|Adventure|Drama|M...|   2632|4.963029861450195|
|   5767|Teddy Bear (Mis) ...|        Comedy|Crime|   5767|4.963029861450195|
+-------+--------------------+--------------------+-------+-----------------+
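
Because the query uses `SELECT *`, the result carries `movieId` twice, once from each side of the join. Naming the columns explicitly avoids the duplicate. The sketch below demonstrates such a query using Python's built-in `sqlite3` as a stand-in for Spark SQL, so it runs without a Spark session. The rows are the three recommendations above whose titles are not truncated, and the table aliases `m` and `r` are choices made here. The same `SELECT` statement would presumably work with `spark.sql()` against the temporary views.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (movieId INTEGER, title TEXT, genres TEXT)')
conn.execute('CREATE TABLE recommendations (movieId INTEGER, rating REAL)')

# Rows copied from the join output above
conn.executemany('INSERT INTO movies VALUES (?, ?, ?)', [
    (69699, 'Love Streams (1984)', 'Comedy|Drama'),
    (101070, 'Wadjda (2012)', 'Drama'),
    (8327, 'Dolls (2002)', 'Drama|Romance'),
])
conn.executemany('INSERT INTO recommendations VALUES (?, ?)', [
    (8327, 5.021006107330322),
    (69699, 5.187405109405518),
    (101070, 5.082510948181152),
])

# Select each column once instead of SELECT *
explicit_query = """
SELECT m.movieId, m.title, m.genres, r.rating
FROM movies AS m
JOIN recommendations AS r ON m.movieId = r.movieId
ORDER BY r.rating DESC
"""

rows = conn.execute(explicit_query).fetchall()
for row in rows:
    print(row)
conn.close()
```

With a single `movieId` column, converting the result to pandas later also avoids two columns sharing one name.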



### 4. A Simple Per-User Movie Recommendation Service
1. Write the SQL query
2. Write the recommendation function
3. Test the recommendations

In [41]:
# Fetch the recommended movie titles with a SQL JOIN
query = """
SELECT * 
FROM movies JOIN recommendations 
ON movies.movieID = recommendations.movieID
ORDER BY rating DESC
"""

In [42]:
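# Recommend movies for the given user
def get_recommendations(user_id, num_recs):
    users_df = spark.createDataFrame([user_id], IntegerType()).toDF('userId')
    users_recs_df = model.recommendForUserSubset(users_df, num_recs)

    # A user the model has no factors for may yield no rows, so guard before indexing
    rows = users_recs_df.collect()
    if not rows:
        raise ValueError(f'No recommendations available for user {user_id}')

    recs_lst = rows[0].recommendations
    recs_df = spark.createDataFrame(recs_lst)
    recs_df.createOrReplaceTempView('recommendations')

    # Join with the 'movies' view registered earlier to attach titles
    recommended_movies = spark.sql(query)

    return recommended_movies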

In [43]:
# Recommend 5 movies for user 395
recs = get_recommendations(395, 5)

In [44]:
# Print the recommendations
recs.show()

+-------+--------------------+--------------------+-------+-----------------+
|movieId|               title|              genres|movieId|           rating|
+-------+--------------------+--------------------+-------+-----------------+
|   8253|Lupin III: The Ca...|Action|Adventure|...|   8253|5.778921127319336|
|   2660|Thing from Anothe...|       Horror|Sci-Fi|   2660|5.524172782897949|
|   7072|   Stagecoach (1939)|Action|Drama|Roma...|   7072|5.524172782897949|
| 142871|O Pátio das Canti...|              Comedy| 142871|5.448753356933594|
|  37495|Survive Style 5+ ...|Fantasy|Mystery|R...|  37495|5.448753356933594|
+-------+--------------------+--------------------+-------+-----------------+



In [45]:
# toPandas(): convert to a pandas DataFrame for display
recs.toPandas()

| | movieId | title | genres | movieId | rating |
|---|---|---|---|---|---|
| 0 | 8253 | Lupin III: The Castle Of Cagliostro (Rupan san... | Action\|Adventure\|Animation\|Comedy\|Crime\|Mystery | 8253 | 5.778921 |
| 1 | 7072 | Stagecoach (1939) | Action\|Drama\|Romance\|Western | 7072 | 5.524173 |
| 2 | 2660 | Thing from Another World, The (1951) | Horror\|Sci-Fi | 2660 | 5.524173 |
| 3 | 142871 | O Pátio das Cantigas (1942) | Comedy | 142871 | 5.448753 |
| 4 | 37495 | Survive Style 5+ (2004) | Fantasy\|Mystery\|Romance\|Thriller | 37495 | 5.448753 |
