## KNN Collaborative Filtering Practice

This notebook uses user-movie rating data to recommend movies that a user has not yet rated.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(2021)

## 1. Data

### 1.1 Data Load

- userId: unique ID of the user.
- movieId: unique ID of the movie.
- rating: the score the user gave the movie.

In [2]:
ratings = pd.read_csv('ratings_small.csv')
ratings = ratings[['userId', 'movieId', 'rating']]

In [3]:
ratings.head()

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


Use two other datasets to get the movie title for each movieId in the ratings data.

In [4]:
movies = pd.read_csv('movies_metadata.csv')
links = pd.read_csv('links_small.csv')


### 1.2 Data Preprocessing

- Take the numeric part of imdb_id in the movies data, which has the form 'tt' followed by digits,
- and join it to imdbId in the links data, which is a plain number.

In [5]:
movies = movies.fillna('')
movies = movies[movies['imdb_id'].str.startswith('tt')]
movies['imdbId'] = movies['imdb_id'].apply(lambda x: int(x[2:]))
movies = movies.merge(links, on='imdbId')

In [6]:
movies = movies[['title', 'movieId']]
movies = movies.set_index('movieId')

In [7]:
movies.head()

# Mapping from movieId to title

movieId,title
1,Toy Story
2,Jumanji
3,Grumpier Old Men
4,Waiting to Exhale
5,Father of the Bride Part II


Use the pivot function to build user_movie_matrix, with user IDs as the index, movie IDs as the columns, and ratings as the values.

In [8]:
user_movie_matrix = ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating',
)
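To make the reshaping concrete, here is a minimal sketch on a hypothetical four-row ratings table (the values are invented for illustration and are not taken from ratings_small.csv). It shows that user-movie pairs without a rating come out as NaN.

```python
import pandas as pd

# Hypothetical toy ratings in long format (not from ratings_small.csv).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

Note that pivot raises a ValueError when the same (userId, movieId) pair appears more than once, so it assumes each user rated each movie at most once.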

In [9]:
user_movie_matrix.iloc[-5:, -5:]

userId,161944,162376,162542,162672,163949
667,,,,,
668,,,,,
669,,,,,
670,,,,,
671,,,,,


Replace missing values, i.e. movies the user has not rated, with 0.

In [10]:
user_movie_matrix = user_movie_matrix.fillna(0)

In [11]:
user_movie_matrix.shape

(671, 9066)

## 2. KNN Basic

Using KNN Basic with k = 5, predict the rating that user 124 would give to movie 648, which the user has not rated yet.

In [12]:
k = 5
user_i = 124
movie_id = 648

### 2.1 Compute user-user similarity

Use the cosine_similarity function to compute the cosine similarity between every pair of users.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarity = cosine_similarity(user_movie_matrix)
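As a rough illustration of what this call computes, the sketch below evaluates the cosine formula by hand with NumPy on a hypothetical 3-user by 4-movie matrix. The matrix `R` and the name `toy_similarity` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical 3-user x 4-movie rating matrix; 0 means "not rated".
R = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 5.0],
])

# cos(u, v) = (u . v) / (||u|| * ||v||), computed for all pairs of rows at once.
norms = np.linalg.norm(R, axis=1)
toy_similarity = (R @ R.T) / np.outer(norms, norms)
print(toy_similarity.round(3))
```

For rows with a non-zero norm this should agree with cosine_similarity. Two users who share no rated movie get a similarity of exactly 0, and every user has similarity 1 with themselves.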

In [17]:
user_similarity.shape # user * user

(671, 671)

In [16]:
user_similarity[:10, :10]

array([[1.        , 0.        , 0.        , 0.07448245, 0.01681799,
        0.        , 0.08388416, 0.        , 0.01284289, 0.        ],
       [0.        , 1.        , 0.12429498, 0.11882103, 0.10364614,
        0.        , 0.21298521, 0.11319045, 0.11333307, 0.04321284],
       [0.        , 0.12429498, 1.        , 0.08163991, 0.15153112,
        0.06069128, 0.15471414, 0.24978072, 0.13447489, 0.1146725 ],
       [0.07448245, 0.11882103, 0.08163991, 1.        , 0.13064868,
        0.07964833, 0.31974534, 0.19101336, 0.03041726, 0.13718558],
       [0.01681799, 0.10364614, 0.15153112, 0.13064868, 1.        ,
        0.06379575, 0.0958878 , 0.16571211, 0.08661604, 0.03237017],
       [0.        , 0.        , 0.06069128, 0.07964833, 0.06379575,
        1.        , 0.        , 0.12850206, 0.02174493, 0.04526415],
       [0.08388416, 0.21298521, 0.15471414, 0.31974534, 0.0958878 ,
        0.        , 1.        , 0.14957182, 0.05972764, 0.18649318],
       [0.        , 0.11319045, 0.2497807

In [19]:
user_similarity = pd.DataFrame(
    data=user_similarity,
    index=user_movie_matrix.index,
    columns=user_movie_matrix.index,
)  # Wrap in a DataFrame for easier inspection.

In [21]:
user_similarity.head(5)

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,1.0,0.0,0.0,0.074482,0.016818,0.0,0.083884,0.0,0.012843,0.0,...,0.0,0.0,0.014474,0.043719,0.0,0.0,0.0,0.062917,0.0,0.017466
2,0.0,1.0,0.124295,0.118821,0.103646,0.0,0.212985,0.11319,0.113333,0.043213,...,0.477306,0.063202,0.077745,0.164162,0.466281,0.425462,0.084646,0.02414,0.170595,0.113175
3,0.0,0.124295,1.0,0.08164,0.151531,0.060691,0.154714,0.249781,0.134475,0.114672,...,0.161205,0.064198,0.176134,0.158357,0.177098,0.124562,0.124911,0.080984,0.136606,0.170193
4,0.074482,0.118821,0.08164,1.0,0.130649,0.079648,0.319745,0.191013,0.030417,0.137186,...,0.114319,0.047228,0.136579,0.25403,0.121905,0.088735,0.068483,0.104309,0.054512,0.211609
5,0.016818,0.103646,0.151531,0.130649,1.0,0.063796,0.095888,0.165712,0.086616,0.03237,...,0.191029,0.021142,0.146173,0.224245,0.139721,0.058252,0.042926,0.038358,0.062642,0.225086


### 2.2 Find the k users most similar to user u among those who rated item i

Find the k other users most similar to user 124.

In [25]:
user_i_similarity = user_similarity.loc[user_i]

In [26]:
user_i_similarity

userId
1      0.000000
2      0.129669
3      0.224600
4      0.147568
5      0.159521
         ...   
667    0.065720
668    0.074023
669    0.049342
670    0.201474
671    0.330381
Name: 124, Length: 671, dtype: float64

In [27]:
user_i_similarity = user_i_similarity.sort_values(ascending=False)

In [28]:
user_i_similarity

userId
124    1.000000
458    0.455216
379    0.433607
355    0.432242
282    0.423280
         ...   
640    0.000000
642    0.000000
341    0.000000
76     0.000000
1      0.000000
Name: 124, Length: 671, dtype: float64

Extract the similarities and IDs of the top k users.

The most similar ID is user_i itself, so it is excluded.

In [29]:
top_k_similarity = user_i_similarity[1: k + 1]
top_k_similar_user_ids = top_k_similarity.index

In [30]:
top_k_similar_user_ids

Int64Index([458, 379, 355, 282, 271], dtype='int64', name='userId')

In [31]:
top_k_similarity

userId
458    0.455216
379    0.433607
355    0.432242
282    0.423280
271    0.409402
Name: 124, dtype: float64
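The heading of this step restricts the neighbors to users who rated the item, while the cells above take the top k over all users, so some of the selected neighbors may not have rated the movie. The sketch below shows one possible way to filter first and rank second. All names and values (`toy_similarity`, `toy_movie_ratings`, `toy_k`) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical similarities of other users to the target user (self already removed).
toy_similarity = pd.Series({11: 0.9, 12: 0.8, 13: 0.7, 14: 0.6, 15: 0.5})
# Hypothetical ratings of one movie by those users; 0 means "not rated".
toy_movie_ratings = pd.Series({11: 0.0, 12: 4.0, 13: 0.0, 14: 3.0, 15: 5.0})

toy_k = 2
# Keep only users who rated the movie, then take the k most similar of them.
raters = toy_movie_ratings[toy_movie_ratings > 0].index
top_k_among_raters = toy_similarity.loc[raters].nlargest(toy_k)
print(top_k_among_raters)
```

With this variant the neighbor set depends on the movie being predicted, so it would have to be recomputed for each movie rather than once per user.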

### 2.3 Take the similarity-weighted average of the k similar users' ratings of item i

Extract the ratings that the similar users gave to movie 648.

In [52]:
top_k_similar_ratings = user_movie_matrix.loc[top_k_similar_user_ids, movie_id]
top_k_weighted_ratings = top_k_similar_ratings * top_k_similarity

In [35]:
movie_id

648

In [36]:
top_k_similar_ratings

userId
458    4.5
379    0.0
355    3.5
282    4.0
271    0.0
Name: 648, dtype: float64

Extract the weights of the users who actually rated the movie.

In [33]:
top_k_weight = (top_k_similar_ratings > 0) * top_k_similarity

In [37]:
top_k_weight

# Users 379 and 271 did not rate this movie.

userId
458    0.455216
379    0.000000
355    0.432242
282    0.423280
271    0.000000
dtype: float64

Divide the sum of the similarity-weighted ratings by the sum of the similarities.

In [53]:
weighted_rating = top_k_weighted_ratings.sum()
weight = top_k_weight.sum()

In [54]:
weight

1.3107381789824097

A weight of 0 means none of the similar users rated the movie, so the prediction falls back to 0.

In [55]:
if weight > 0:
    prediction_rating = weighted_rating / weight
    
else:
    prediction_rating = 0

In [56]:
prediction_rating

4.008763709331574
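As a sanity check, the same weighted average can be reproduced in plain Python from the rounded similarities and ratings printed above. The variable names are introduced here for illustration. Because the inputs are rounded to six decimals, the result only matches the notebook value approximately.

```python
# Similarities and ratings copied from the outputs above (users 458, 379, 355, 282, 271).
similarities = [0.455216, 0.433607, 0.432242, 0.423280, 0.409402]
neighbor_ratings = [4.5, 0.0, 3.5, 4.0, 0.0]

# Unrated movies (rating 0) add nothing to the numerator and are skipped in the denominator.
numerator = sum(s * r for s, r in zip(similarities, neighbor_ratings))
denominator = sum(s for s, r in zip(similarities, neighbor_ratings) if r > 0)

worked_prediction = numerator / denominator
print(round(worked_prediction, 4))
```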

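### 2.4 Recommend the items with the highest predicted ratings

Predict a rating for every movie and recommend the movies with the highest predicted ratings to the user.

### 2.4.1 Compute predicted ratings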

In [57]:
prediction_dict = {}

# Predict a rating for every movie ID
for movie_id in user_movie_matrix.columns:
    
    # Skip movies the user has already rated
    if user_movie_matrix.loc[user_i, movie_id] > 0:
        continue
        
    top_k_similar_ratings = user_movie_matrix.loc[top_k_similar_user_ids, movie_id]
    
    top_k_weighted_ratings = top_k_similar_ratings * top_k_similarity
    top_k_weight = (top_k_similar_ratings > 0) * top_k_similarity
    
    weighted_rating = top_k_weighted_ratings.sum()
    weight = top_k_weight.sum()
    
    if weight > 0:
        prediction_rating = weighted_rating / weight
        
    else:
        prediction_rating = 0
        
    # Store the predicted rating under each movie ID.
    prediction_dict[movie_id] = prediction_rating
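The loop above handles one movie at a time. As a possible alternative, the same weighted average can be computed for all movies at once with pandas. This is a minimal sketch on hypothetical toy data (`neighbor_ratings`, `neighbor_similarity`), and it does not filter out movies the target user has already rated.

```python
import pandas as pd

# Hypothetical ratings of 3 neighbors (rows) for 4 movies (columns); 0 means "not rated".
neighbor_ratings = pd.DataFrame(
    [[4.0, 0.0, 5.0, 0.0],
     [3.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 4.0, 2.0]],
    index=[11, 12, 13],
    columns=[101, 102, 103, 104],
)
neighbor_similarity = pd.Series([0.5, 0.3, 0.2], index=[11, 12, 13])

# Numerator: similarity-weighted sum of ratings per movie.
weighted_sum = neighbor_ratings.mul(neighbor_similarity, axis=0).sum(axis=0)
# Denominator: sum of similarities over the neighbors who rated each movie.
weight_sum = (neighbor_ratings > 0).mul(neighbor_similarity, axis=0).sum(axis=0)

# Movies with no rating from any neighbor get 0, as in the loop above.
toy_prediction = (weighted_sum / weight_sum.where(weight_sum > 0)).fillna(0.0)
print(toy_prediction)
```

On the full matrix, `neighbor_ratings` would correspond to `user_movie_matrix.loc[top_k_similar_user_ids]`, which avoids looping over roughly nine thousand columns in Python.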

Sort the per-movie predicted ratings in descending order.

In [59]:
prediction = pd.Series(prediction_dict).sort_values(ascending=False)

In [61]:
prediction.head(5)

1258    5.0
924     5.0
3861    5.0
524     5.0
3916    5.0
dtype: float64

### 2.4.2 Extract the top items

Extract the IDs of the 10 movies with the highest predicted ratings.

In [62]:
recommend = prediction[:10].index

In [63]:
movies.loc[recommend]

movieId,title
1258,The Shining
924,2001: A Space Odyssey
3861,The Replacements
524,Rudy
3916,Remember the Titans
260,Star Wars
968,Night of the Living Dead
1653,Gattaca
2115,Indiana Jones and the Temple of Doom
2692,Run Lola Run


## 3. KNN with Means

In [64]:
user_id = 124
k = 5
movie_i = 648

Use the pivot function to build movie_user_matrix, with movie IDs as the index, user IDs as the columns, and ratings as the values.

Missing values are replaced with 0.

In [65]:
movie_user_matrix = ratings.pivot(
    index='movieId',
    columns='userId',
    values='rating',
)
movie_user_matrix = movie_user_matrix.fillna(0)

### 3.1 Compute item-item similarity

Compute the Pearson correlation between movies as the similarity.

In [66]:
movie_similarity = np.corrcoef(movie_user_matrix)
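A small sketch of how np.corrcoef reads its input may help here: by default each row is treated as one variable, so passing a movie-by-user matrix yields movie-by-movie correlations. The matrix `M` below is hypothetical.

```python
import numpy as np

# Hypothetical 3-movie x 4-user matrix: each row is one movie's ratings.
M = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with row 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with row 0
])

# With the default rowvar=True, each row is a variable and each column an observation.
toy_corr = np.corrcoef(M)
print(toy_corr.round(3))
```

In the notebook, unrated entries were filled with 0 before this step, so those zeros enter the correlation as if they were actual ratings.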

In [67]:
movie_similarity = pd.DataFrame(
    data=movie_similarity,
    index=movie_user_matrix.index,
    columns=movie_user_matrix.index,
)

In [69]:
movie_similarity.shape # item x item

(9066, 9066)

### 3.2 Find the k items most similar to item i

Find the k other movies most similar to movie 648.

First, extract the similarities between movie_i and every other movie.

In [70]:
movie_i_similarity = movie_similarity.loc[movie_i]

Sort the similarities to the other movies in descending order.

In [71]:
movie_i_similarity = movie_i_similarity.sort_values(ascending=False)

Extract the similarities and IDs of the top k movies.

The most similar ID is movie_i itself, so it is excluded.

In [72]:
top_k_similarity = movie_i_similarity[1: k + 1]
top_k_similar_movie_ids = top_k_similarity.index

In [74]:
top_k_similarity

movieId
780    0.534337
733    0.522740
736    0.430270
786    0.401280
376    0.370700
Name: 648, dtype: float64

In [73]:
top_k_similar_movie_ids

Int64Index([780, 733, 736, 786, 376], dtype='int64', name='movieId')

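### 3.3 Compute the mean rating of item i

- Compute each movie's mean rating, which serves as its baseline.
- A rating of 0 means "not rated", so replace it with a missing value before averaging.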

In [75]:
movie_user_matrix = movie_user_matrix.replace(0, np.nan)

In [76]:
movie_bias = movie_user_matrix.mean(axis=1)

In [77]:
movie_bias

movieId
1         3.872470
2         3.401869
3         3.161017
4         2.384615
5         3.267857
            ...   
161944    5.000000
162376    4.500000
162542    5.000000
162672    3.000000
163949    5.000000
Length: 9066, dtype: float64

### 3.4 Similarity-weighted average of the user's rating deviations on the k similar items

### 3.4.1 Compute rating deviations from each movie's mean

In [78]:
movie_user_matrix_wo_bias = movie_user_matrix.sub(movie_bias, axis=0)
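To illustrate the row-wise subtraction, here is a minimal sketch on a hypothetical 2-movie by 3-user matrix (the `toy_` names are assumptions for this example). It shows that the mean skips NaN and that `axis=0` aligns the means with the movie index.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-movie x 3-user matrix with NaN for "not rated".
toy_movie_user = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [5.0, 3.0, np.nan]],
    index=[101, 102],
    columns=[1, 2, 3],
)

# Row means skip NaN by default, so only actual ratings are averaged.
toy_bias = toy_movie_user.mean(axis=1)
# axis=0 aligns toy_bias with the row index, subtracting each movie's own mean.
toy_centered = toy_movie_user.sub(toy_bias, axis=0)
print(toy_centered)
```

Unrated entries stay NaN after centering, which is what later lets the weights be restricted to movies the user actually rated.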

In [82]:
movie_user_matrix_wo_bias

movieId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,,,,,,,-0.87247,,0.12753,,...,,0.12753,-0.37247,,,,,,0.12753,1.12753
2,,,,,,,,,,,...,1.598131,,,-0.401869,,,,,,
3,,,,,0.838983,,,,,,...,,,,-0.161017,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,-0.267857,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161944,,,,,,,,,,,...,,,,,,,,,,
162376,,,,,,,,,,,...,,,,,,,,,,
162542,,,,,,,,,,,...,,,,,,,,,,
162672,,,,,,,,,,,...,,,,,,,,,,


### 3.4.2 Extract the ratings of the top k items

In [79]:
top_k_similar_ratings = movie_user_matrix_wo_bias.loc[top_k_similar_movie_ids, user_id]
top_k_weighted_ratings = top_k_similar_ratings * top_k_similarity

In [80]:
top_k_similar_ratings

movieId
780         NaN
733         NaN
736         NaN
786   -0.108696
376         NaN
Name: 124, dtype: float64

In [81]:
top_k_weighted_ratings

movieId
780         NaN
733         NaN
736         NaN
786   -0.043617
376         NaN
dtype: float64

Keep the weights only for the extracted movies that the user has actually rated.

In [83]:
top_k_weight = (pd.notna(top_k_similar_ratings)) * top_k_similarity
top_k_weight

movieId
780    0.00000
733    0.00000
736    0.00000
786    0.40128
376    0.00000
dtype: float64

### 3.4.3 Weighted average

Divide the sum of the similarity-weighted rating deviations by the sum of the similarities.

In [84]:
weighted_rating = top_k_weighted_ratings.sum()
weight = top_k_weight.sum()

In [87]:
weight

0.40128029563155027

Extract the movie's mean rating.

In [85]:
bias = movie_bias.loc[movie_i]

In [86]:
bias

3.5327380952380953

In [88]:
if weight != 0:
    # Add the weighted deviation to the mean rating
    prediction_rating = bias + weighted_rating / weight
    
# weight is 0 when the user rated none of the similar movies
else:
    prediction_rating = 0

In [89]:
prediction_rating

3.4240424430641823
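The steps in 3.2 to 3.4 can be summarized in one formula. The notation is introduced here for illustration and is not part of the original notebook: $\hat{r}_{ui}$ is the predicted rating of user $u$ for movie $i$, $\mu_i$ is the mean rating of movie $i$ (`movie_bias`), $N_k(i)$ is the set of the $k$ movies most similar to $i$, $R(u)$ is the set of movies rated by $u$, and $\mathrm{sim}(i, j)$ is the Pearson similarity.

$$
\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in N_k(i) \cap R(u)} \mathrm{sim}(i, j)\,(r_{uj} - \mu_j)}{\sum_{j \in N_k(i) \cap R(u)} \mathrm{sim}(i, j)}
$$

With the values printed above, only movie 786 contributes, so

$$
\hat{r}_{124,\,648} \approx 3.532738 + \frac{-0.043617}{0.401280} \approx 3.4240
$$

which agrees with the computed prediction_rating up to rounding.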

### 3.5 Recommend the items with the highest predicted ratings

Predict a rating for every movie and recommend the movies with the highest predicted ratings to the user.

In [90]:
prediction_dict = {}

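# Predict a rating for every movie ID
for movie_id in movie_user_matrix.index:

    # Skip movies the user has already rated.
    # Note: this looks up movie_i (648), not the loop variable movie_id.
    if movie_user_matrix.loc[movie_i, user_id] > 0:
        continue

    # The neighbors and similarities are those of movie_i, computed in 3.2.
    top_k_similar_ratings = movie_user_matrix_wo_bias.loc[top_k_similar_movie_ids, user_id]

    top_k_weighted_ratings = top_k_similar_ratings * top_k_similarity
    # NaN != 0 evaluates to True, so unrated neighbors keep their weight here.
    top_k_weight = (top_k_similar_ratings != 0) * top_k_similarity

    weighted_rating = top_k_weighted_ratings.sum()
    weight = top_k_weight.sum()

    # Mean rating of movie_i, not of movie_id.
    bias = movie_bias.loc[movie_i]

    if weight > 0:
        prediction_rating = bias + weighted_rating / weight

    else:
        prediction_rating = 0

    # Store the predicted rating under each movie ID
    prediction_dict[movie_id] = prediction_rating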

Sort the per-movie predicted ratings in descending order.

In [92]:
prediction = pd.Series(prediction_dict).sort_values(ascending=False)

In [94]:
prediction.head(5)

1        3.513433
31963    3.513433
31903    3.513433
31921    3.513433
31923    3.513433
dtype: float64

Extract the IDs of the 10 movies with the highest predicted ratings.

In [95]:
recommend = prediction[:10].index

In [96]:
movies.loc[recommend]

movieId,title
1,Toy Story
31963,Bed and Board
31903,Želary
31921,The Seven-Per-Cent Solution
31923,The Three Musketeers
31930,Masculin Féminin
31952,Control
31956,Five Times Two
31973,Germany Year Zero
31804,Night Watch
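Every movie in the result above has the same score of 3.513433. In the loop, `top_k_similar_movie_ids`, `top_k_similarity`, and `movie_bias.loc[movie_i]` all refer to movie 648, so the same value is stored for each movie ID. The sketch below shows one way the prediction could instead use each movie's own neighbors and mean. The helper `predict_with_means` and all toy data are hypothetical, and the sketch assumes non-negative similarities among the selected neighbors.

```python
import numpy as np
import pandas as pd

def predict_with_means(ratings, similarity, movie_id, user_id, k):
    """Hypothetical helper: item-based KNN-with-means prediction for one (movie, user) pair.

    ratings: movie x user DataFrame with NaN for "not rated".
    similarity: movie x movie DataFrame of similarities.
    """
    movie_means = ratings.mean(axis=1)
    # Neighbors are chosen per target movie, excluding the movie itself.
    neighbors = similarity.loc[movie_id].drop(movie_id).nlargest(k)
    deviations = ratings.loc[neighbors.index, user_id] - movie_means.loc[neighbors.index]
    rated = deviations.notna()
    weight = neighbors[rated].sum()
    if weight > 0:
        return movie_means.loc[movie_id] + (deviations[rated] * neighbors[rated]).sum() / weight
    return 0.0  # same fallback as the notebook when no neighbor was rated


# Hypothetical toy data: 3 movies x 3 users.
toy_ratings = pd.DataFrame(
    [[4.0, 2.0, np.nan],
     [5.0, 3.0, 5.0],
     [np.nan, 1.0, 2.0]],
    index=[101, 102, 103], columns=[1, 2, 3],
)
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=[101, 102, 103], columns=[101, 102, 103],
)

# Predict every movie that user 3 has not rated yet, using that movie's own neighbors.
target_user = 3
toy_predictions = {
    m: predict_with_means(toy_ratings, toy_similarity, m, target_user, k=2)
    for m in toy_ratings.index
    if pd.isna(toy_ratings.loc[m, target_user])
}
print(toy_predictions)
```

Applied to the full data, this would mean one neighbor lookup per candidate movie, which is slower than the loop above but gives each movie a distinct prediction.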
