## 4. 사용자의 평가경향을 고려한 사용자기반 협업필터링 추천시스템
* * *
#### 과제: <code>surprise</code>를 통해 구현하기

### Setup(1): 경로 설정 및 데이터 불러오기

In [86]:
import os
import pandas as pd
import numpy as np

In [87]:
data_path = "../../data/kmrd-small"
%cd $data_path
print(os.getcwd())

C:\Users\dobyl\Desktop\Doby\2025\AI추천시스템\data\kmrd-small
C:\Users\dobyl\Desktop\Doby\2025\AI추천시스템\data\kmrd-small


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


In [88]:
# 데이터 불러오기
castings_df = pd.read_csv('castings.csv', encoding='utf-8')
genres_df = pd.read_csv('genres.csv', encoding='utf-8')
countries_df = pd.read_csv('countries.csv', encoding='utf-8')
movies_df = pd.read_csv('movies.txt', sep='\t', encoding='utf-8')
peoples_df = pd.read_csv('peoples.txt', sep='\t', encoding='utf-8')
rates_df = pd.read_csv('rates.csv', encoding='utf-8')

### Setup(2): 4번 문제 재구현을 위해 필요한 데이터 정제 및 구현
* <code>user_movie_rates_df</code>

In [89]:
user_reviews_df = pd.DataFrame({
    'review_count': rates_df.groupby('user')['movie'].count(),
})

user_reviews_over9_df = user_reviews_df[user_reviews_df['review_count']>9]

movie_reviewed_df = pd.DataFrame({
    'num_users_watch': rates_df.groupby('movie')['user'].count(),
})

movie_reviewed_over9_df = movie_reviewed_df[movie_reviewed_df['num_users_watch']>9]

user_movie_rates_df = rates_df[rates_df.user.isin(user_reviews_over9_df.index)]
user_movie_rates_df = user_movie_rates_df[user_movie_rates_df.movie.isin(movie_reviewed_over9_df.index)]
user_movie_rates_df.head()

Unnamed: 0,user,movie,rate,time
0,0,10003,7,1494128040
1,0,10004,7,1467529800
2,0,10018,9,1513344120
3,0,10021,9,1424497980
4,0,10022,7,1427627340


### (1) user_movie_rates_df을 sklearn의 train_test_split 분리
* surprise는 pandas의 Dataframe을 기반으로 돌아가지 않고, 내부적으로 정의한 데이터 타입이 있다.

In [90]:
## train_test_split 사용하여 rating_df를 train용과 test용으로 80:20으로 분리
##     train데이터 user_movie_rating pivot 구축, unique user_id, unique movie_id를 개수
from surprise.model_selection import train_test_split
from surprise import Dataset, Reader
x = user_movie_rates_df.copy()

reader = Reader(rating_scale=(min(x['rate']), max(x['rate']))) # {1, 10}의 값

x = Dataset.load_from_df(x[['user', 'movie', 'rate']], reader)

x_train, x_test = train_test_split(x, test_size=0.2, random_state=1234)

### (2) similarity matrix
* 직접 구현을 할 때는 similarity matrix를 사전에 구현한 이후에, 모델을 구현하는 코드 구조였다.
* 하지만, surprise는 모델을 정의할 때, 내부적으로 similarity matrix를 갖게 된다. 즉, 순서가 바뀐다.
* 이에 따라 (2)번은 (3)번에서 사용할 모델을 (2)번에서 먼저 정의하여 모델 내 similarity matrix를 출력한다.
* 이는 NumPy type이라 출력하였을 때, Dataframe처럼 깔끔한 결과는 아니지만, 그 아래 shell에서 shape을 보면 2141x2141로 User-based similarity matrix임을 알 수 있다.

In [91]:
from surprise.prediction_algorithms.knns import KNNWithMeans

# 알고리즘 옵션 변경, 정확도 계산
sim_options = {'name': 'pearson_baseline',
               'user_based': True}

algo = KNNWithMeans(k=10, sim_options=sim_options)
algo.fit(x_train)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x19819b035b0>

In [92]:
algo.sim

array([[ 1.        ,  0.        , -0.00949931, ...,  0.        ,
         0.        ,  0.00628111],
       [ 0.        ,  1.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.00949931,  0.        ,  1.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  1.        ,
        -0.00937563,  0.        ],
       [ 0.        ,  0.        ,  0.        , ..., -0.00937563,
         1.        ,  0.        ],
       [ 0.00628111,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.        ]])

In [93]:
algo.sim.shape

(2141, 2141)

### (3) (1)의 결과에 대해 test data에 대해 RMSE을 구하라.
* 공유된 노트북 파일의 실험 결과와 오차가 있다.
* 직접 구현된 KNN Bias 모델은 7.2673, Surprise의 KNN Bias 모델은 6.9507이다.
* 이에 대해 추론할 수 있는 근거는 몇 가지가 있다.
    1.  우선, train_test_split 함수가 다르다. (Scikit-learn vs Surprise)
    2.  또한, stratify 기능이 Surprise에는 없어서 분명 train, test의 구성이 다를 것으로 예상된다.
    3.  본 실험에서도 (3)번 문제에 제시된 제약 조건을 따라 구현하였으나, 직접 구현된 KNN Bias에서는 '해당 영화를 평가한 사용자가 최소 2명이 되는 경우에만 계산'과 같은 제약 조건을 고려하지 않았기 때문으로 예상된다.
    4.  다만, Neighbor size는 10으로 동일하게 설정되었다.

In [94]:
from surprise import accuracy

predictions = []
for uid, iid, true_r in x_test:
    if uid not in x_train.all_users(): # test data의 user가 train data에 없을 경우의 추천 값: 0
        est = 0
        details = {'was_impossible': True}
    elif iid not in x_train.all_items():
        est = x_train.global_mean  # test data의 (user, movie)에서 movie가 없거나 user와 유사한 사용자가 없을 경우의 추천값: user의 평균 rate
        details = {'was_impossible': True}
    else:
        pred = algo.predict(uid, iid)
        est = pred.est
        details = {'was_impossible': False}
    predictions.append((uid, iid, true_r, est, details))
    
accuracy.rmse(predictions)

RMSE: 6.9507


6.950699220848747

### (4) 문제 1의 (1)에서 영화를 200개 이상 평가한 user 리스트를 구하라.
* 기존 코드와 동일

In [95]:
user_200_review = list(user_reviews_df[user_reviews_df["review_count"] >= 200].index)
len(user_200_review)

12

In [96]:
user_200_review

[44, 92, 95, 110, 146, 170, 224, 465, 1051, 1662, 1820, 2769]

### (5) test data에서 user_200_review의 true_user2items
* 기존 코드와 동일

In [97]:
x_test = pd.DataFrame(x_test, columns=['user', 'movie', 'rate'])

test_user_movie_rates = x_test.pivot_table(index='user', columns='movie', values='rate').fillna(0)
test_user_movie_rates.head()

movie,10001,10002,10003,10004,10005,10006,10007,10008,10009,10011,...,10970,10971,10975,10979,10980,10981,10983,10988,10994,10998
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [98]:
true_user2items = dict()
top_n = 30

for user in user_200_review:
    # 각 사용자의 평점 시리즈를 가져온 후, 점수 높은 순으로 정렬
    test_user_ratings = test_user_movie_rates.loc[user].sort_values(ascending=False)[:top_n]

    # pdb.set_trace()
    # 점수가 높은 순서로 상위 10개의 영화 선택
    top_10_movies = list(test_user_ratings.index)

    # 사용자 ID를 키로 하고, 상위 10개의 영화 리스트를 값으로 저장
    true_user2items[user] = top_10_movies

for key, value in true_user2items.items():
    print(f"{key}: {value}")

44: [10001, 10103, 10048, 10054, 10877, 10330, 10065, 10489, 10071, 10086, 10501, 10502, 10819, 10101, 10102, 10532, 10454, 10114, 10561, 10132, 10249, 10146, 10245, 10718, 10213, 10688, 10173, 10670, 10200, 10185]
92: [10105, 10740, 10405, 10041, 10272, 10767, 10251, 10718, 10055, 10871, 10909, 10677, 10132, 10323, 10593, 10680, 10045, 10624, 10462, 10142, 10151, 10563, 10411, 10236, 10305, 10971, 10429, 10241, 10092, 10700]
95: [10998, 10249, 10700, 10034, 10148, 10046, 10058, 10561, 10071, 10200, 10072, 10116, 10086, 10107, 10101, 10102, 10022, 10103, 10980, 10004, 10670, 10636, 10016, 10114, 10048, 10068, 10519, 10019, 10740, 10841]
110: [10514, 10038, 10040, 10041, 10448, 10029, 10450, 10050, 10058, 10065, 10066, 10489, 10310, 10389, 10290, 10362, 10636, 10102, 10014, 10630, 10217, 10242, 10016, 10128, 10744, 10101, 10472, 10762, 10532, 10104]
146: [10688, 10071, 10638, 10741, 10381, 10021, 10106, 10299, 10272, 10321, 10020, 10410, 10566, 10697, 10262, 10113, 10101, 10275, 10114, 

### (6) neighbor 30의 평가를 기준으로 평가 점수 상위 30개 pred_user2items
* KNN을 Neighbor size 30으로 두고 재학습하여 이를 진행했으며,
* 기존 Shell에서 predict하는 코드를 Suprise 기반 모델의 predict 코드로 수정하였다.

In [99]:
data = []
for uid in x_train.all_users():       # 내부 integer id
    raw_uid = x_train.to_raw_uid(uid) # 원래 user id
    for iid, rating in x_train.ur[uid]:  # (item inner id, rating)
        raw_iid = x_train.to_raw_iid(iid)
        data.append((raw_uid, raw_iid, rating))

x_train_ = pd.DataFrame(data, columns=['user', 'movie', 'rate'])

In [100]:
train_user_movie_rates = x_train_.pivot_table(index='user', columns='movie', values='rate').fillna(0)

In [101]:
# 각 사용자에 대해 neighbor 30명을 기준으로 예측 평점 구하고, 상위 10개의 영화 추천
pred_user2items = {}
neighbor_size = 30
top_n = 30  # 추천할 영화 개수

# 알고리즘 옵션 변경, 정확도 계산
sim_options = {'name': 'pearson_baseline',
               'user_based': True}

# 30으로두고 재학습
algo = KNNWithMeans(k=neighbor_size, sim_options=sim_options)
algo.fit(x_train)

for user in user_200_review:  # user_ids는 평가할 사용자 목록
    # 각 사용자별 평가하지 않은 movie_id
    non_rating_movi_ids = list(train_user_movie_rates.loc[user][train_user_movie_rates.loc[user]==0].index)
    user_predictions = {}
    for movie in non_rating_movi_ids:
        # CF_knn_bias 함수 호출하여 예측 평점 계산
        pred = algo.predict(user, movie)
        est = pred.est
        user_predictions[movie] = est

    # 예측 평점을 기준으로 상위 10개의 영화 추천
    top_movies = sorted(user_predictions, key=user_predictions.get, reverse=True)[:top_n]
    pred_user2items[user] = top_movies

for key, value in pred_user2items.items():
    print(f"{key}: {value}")

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
44: [10044, 10064, 10070, 10138, 10150, 10152, 10176, 10213, 10228, 10284, 10286, 10336, 10351, 10382, 10400, 10409, 10430, 10435, 10512, 10548, 10590, 10629, 10841, 10843, 10847, 10865, 10866, 10983, 10232, 10747]
92: [10762, 10458, 10469, 10983, 10183, 10740, 10020, 10152, 10341, 10452, 10321, 10249, 10294, 10019, 10462, 10865, 10199, 10244, 10424, 10296, 10213, 10405, 10910, 10109, 10001, 10200, 10046, 10005, 10179, 10150]
95: [10017, 10080, 10109, 10152, 10176, 10284, 10294, 10296, 10300, 10301, 10321, 10329, 10341, 10349, 10391, 10430, 10458, 10462, 10469, 10526, 10584, 10590, 10629, 10696, 10744, 10834, 10841, 10850, 10851, 10983]
110: [10866, 10590, 10284, 10070, 10127, 10901, 10321, 10390, 10329, 10067, 10526, 10430, 10629, 10294, 10058, 10983, 10103, 10152, 10767, 10400, 10102, 10580, 10029, 10448, 10841, 10922, 10688, 10071, 10290, 10822]
146: [10866, 10590, 10

### (7) precision@10, recall@10
* 기존 코드와 동일
* 하지만, 실험 결과가 굉장히 낮은 수치를 기록한다. 기존 코드(직접 구현된)의 성능이 어느 정도 나오는지는 기록되지는 않았음 -> 비교 불가

In [102]:
def precision_recall_at_k(true_user2items, pred_user2items, k=10):
    precisions = []
    recalls = []

    for user, true_items in true_user2items.items():
        pred_items = pred_user2items.get(user, [])[:k]  # 상위 k개 예측 항목

        # 교집합의 크기 (정확히 맞춘 예측)
        hit_count = len(set(true_items) & set(pred_items))

        # Precision@k: 예측한 상위 k개 중 실제로 본 항목의 비율
        precision = hit_count / min(len(pred_items), k) if pred_items else 0
        precisions.append(precision)

        # Recall@k: 실제로 본 항목 중 예측한 상위 k개에 포함된 항목의 비율
        recall = hit_count / len(true_items) if any(true_items) else 0
        recalls.append(recall)

    # 전체 사용자에 대한 평균 precision@k와 recall@k
    avg_precision = sum(precisions) / len(precisions) if precisions else 0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0

    return avg_precision, avg_recall

precision_at_10, recall_at_10 = precision_recall_at_k(true_user2items, pred_user2items, k=10)
print("Precision@10:", precision_at_10)
print("Recall@10:", recall_at_10)

Precision@10: 0.05833333333333334
Recall@10: 0.019444444444444445
