<a href="https://colab.research.google.com/github/clustering-jun/KMU-Recommender_Systems/blob/main/L02_Finding_Similar_Items.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Finding Similar Items Practice**

## **MovieLens Dataset**
- 영화 추천 서비스 movielens.org 에서 수집한 영화 별점 데이터셋
- https://grouplens.org/datasets/moivelens/
- MovieLens 25M Dataset 파일 목록

| 파일명             | 설명                        | 형태                                     |
|--------------------|-----------------------------|------------------------------------------|
| ratings.csv        | 유저가 영화에 매긴 별점     | userId, movieId, rating, timestamp          |
| tags.csv           | 유저가 영화에 매긴 태그     | userId, movieId, tag, timestamp             |
| movies.csv         | 영화 목록                   | movieId, title, genres                     |
| links.csv          | imdb, tmdb 정보             | movieId, imdbId, tmdbId                     |
| genome-tags.csv    | 태그 목록                   | tagId, tag                                |
| genome-score.csv   | 영화와 태그 사이의 관련도   | movieId, tagId, relevance                  |



### **데이터셋 다운로드**
- wget: url로부터 파일을 다운로드 받는 쉘 명령어
- unzip: zip 압축 파일을 해제하는 쉘 명령어

In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m.zip

--2025-08-11 05:52:09--  https://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2025-08-11 05:52:11 (121 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]

Archive:  ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [1]:
!head ml-25m/ratings.csv

userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
1,307,5.0,1147868828
1,665,5.0,1147878820
1,899,3.5,1147868510
1,1088,4.0,1147868495
1,1175,3.5,1147868826
1,1217,3.5,1147878326
1,1237,5.0,1147868839


### **movies.csv**
- 영화 제목(moives)과 장르(genresets) 불러오기
 - csv(comma separated values) 파일: 값들을 쉼표로 구분한 텍스트 파일

In [40]:
import csv

titles = {}
genresets = {}

with open('ml-25m/movies.csv', 'r') as f:
    print(f.readline())
    # next(csvreader) # skip column names

    csvreader = csv.reader(f)

    for mid, title, genres in csvreader:

        titles[int(mid)] = title
        genresets[int(mid)] = set(genres.split('|'))

movieId,title,genres



### **장르가 유사한 영화 찾기**
- 자카드 유사도가 가장 높은 k개의 영화 찾기

In [43]:
def jaccard_similarity(a,b):
    if len(a | b) == 0: return 0
    return len(a & b) / len(a | b)

def get_topk_jaccard_genres(target_mid, k=20):
    target_genreset = genresets[target_mid]

    res = []

    for mid, title in titles.items():
        genreset = genresets[mid]
        score = jaccard_similarity(target_genreset, genreset)
        res.append( (score, title) )

    res.sort(reverse=True)
    return res[:k]

In [44]:
mid = 164909 # la la land
res = get_topk_jaccard_genres(mid, 20)
res

[(1.0, 'Собака Павлова (2005)'),
 (1.0, 'Şabaniye (1984)'),
 (1.0, 'Zus & Zo (2001)'),
 (1.0, 'Zero (2018)'),
 (1.0, 'Zack and Miri Make a Porno (2008)'),
 (1.0, 'Youth in Revolt (2009)'),
 (1.0, 'You and the Night (2013)'),
 (1.0, 'You Stupid Man (2002)'),
 (1.0, 'You Are the Apple of My Eye (2011)'),
 (1.0, 'Yo Yo (Yoyo) (1965)'),
 (1.0, 'Yes, But... (Oui, mais...) (2001)'),
 (1.0, 'Yes Or No (2010)'),
 (1.0, 'Yellow Cab Man, The (1950)'),
 (1.0, 'Year of the Dog (2007)'),
 (1.0, 'Year by the Sea (2016)'),
 (1.0, "X's & O's (2007)"),
 (1.0, 'Wyjazd Integracyjny (2011)'),
 (1.0, 'World According to Garp, The (1982)'),
 (1.0, 'Working Girl (1988)'),
 (1.0, 'Words and Pictures (2013)')]

### **다른 사용자가 함께 본 영화 찾기**
- 시청한 사용자 집합이 유사한 영화 찾기
 - 각 영화를 사용자 집합으로 표현
 - 사용자 집합간 자카드 유사도 계산

In [46]:
ratings = []

with open('ml-25m/ratings.csv', 'r') as f:
    print(f.readline()) # skip column names

    for line in f:
        uid, mid, rating, timestamp = line.split(',')
        ratings.append( (int(uid), int(mid), float(rating)) )

userId,movieId,rating,timestamp



In [48]:
from collections import defaultdict

usets = defaultdict(set)

for uid, mid, rating in ratings:
    usets[mid].add(uid)

In [52]:
from tqdm import tqdm

def get_topk_jaccard_ratings(target_mid, k=20):
    target_uset = usets[target_mid]

    res = []

    for mid, title in tqdm(titles.items()):
        uset = usets[mid]
        score = jaccard_similarity(target_uset, uset)
        res.append( (score, title) )

    res.sort(reverse=True)
    return res[:k]

In [54]:
mid = 112552 # whiplash
res = get_topk_jaccard_ratings(mid, k=20)
res

100%|██████████| 62423/62423 [00:53<00:00, 1171.06it/s]


[(1.0, 'Whiplash (2014)'),
 (0.34346657709103123, 'Gone Girl (2014)'),
 (0.33069178628389156, 'The Imitation Game (2014)'),
 (0.3202530162820663, 'Interstellar (2014)'),
 (0.31890598662389213, 'Grand Budapest Hotel, The (2014)'),
 (0.3178914625340221, 'Ex Machina (2015)'),
 (0.31770758896670004,
  'Birdman: Or (The Unexpected Virtue of Ignorance) (2014)'),
 (0.3133357646486749, 'Her (2013)'),
 (0.312987900078064, 'Wolf of Wall Street, The (2013)'),
 (0.3059296340689558, 'Nightcrawler (2014)'),
 (0.2975845919065411, 'Mad Max: Fury Road (2015)'),
 (0.2957345586142398, 'Django Unchained (2012)'),
 (0.2737903411445145, 'The Martian (2015)'),
 (0.2668467981626587, 'Gravity (2013)'),
 (0.2627047459050819, 'Shutter Island (2010)'),
 (0.2575789399985997, 'Dallas Buyers Club (2013)'),
 (0.25711628659226915, 'Dark Knight Rises, The (2012)'),
 (0.25615628009354796, 'The Revenant (2015)'),
 (0.2561167699254437, 'Intouchables (2011)'),
 (0.25411334552102377, 'Arrival (2016)')]

### **비슷한 평가를 받는 영화 찾기**
- 사용자가 매긴 별점을 영화를 표현하는 벡터로 보고 피어슨 상관계수를 계산
 - **각 영화별 별점을 평균값으로 뺄셈하여 전처리**
 - 전처리한 데이터에서 코사인 유사도 계산

In [55]:
from collections import defaultdict

ursets = defaultdict(dict)

for uid, mid, rating in ratings:
    ursets[mid][uid] = rating

In [57]:
for mid, urset in ursets.items():
    avg = sum(urset.values()) / len(ursets)
    for k in urset:
        urset[k] -= avg

In [58]:
def cosine_similarity(a,b):
    nu = sum(a[k] * b[k] for k in a.keys() & b.keys())
    de = (sum(x*x for x in a.values()) * sum(x*x for x in b.values())) ** 0.5
    if de == 0: return 0
    return nu/de

In [59]:
from tqdm import tqdm

def get_topk_pearson_ratings(target_mid, k=20):
    target_urset = ursets[target_mid]

    res = []

    for mid, title in tqdm(titles.items()):
        urset = ursets[mid]
        score = cosine_similarity(target_urset, urset)
        res.append( (score, title) )

    res.sort(reverse=True)
    return res[:k]

In [61]:
mid = 59315
res = get_topk_pearson_ratings(mid)
res

100%|██████████| 62423/62423 [01:22<00:00, 754.55it/s] 


[(1.0000000000000027, 'Iron Man (2008)'),
 (0.6145974601926216, 'Avengers, The (2012)'),
 (0.6005251725727653, 'Iron Man 2 (2010)'),
 (0.5742348976724545, 'Batman Begins (2005)'),
 (0.5732738873828683, 'Star Trek (2009)'),
 (0.571732874015334, 'Dark Knight, The (2008)'),
 (0.5373537786453846, 'Avatar (2009)'),
 (0.5357945803058293, 'Guardians of the Galaxy (2014)'),
 (0.5333849122281809, 'WALL·E (2008)'),
 (0.5286920058942817, 'Sherlock Holmes (2009)'),
 (0.5145056012122434, 'Bourne Ultimatum, The (2007)'),
 (0.5138882448400881, 'Iron Man 3 (2013)'),
 (0.5107599364470127, 'Dark Knight Rises, The (2012)'),
 (0.5081743572914865, 'Casino Royale (2006)'),
 (0.5073080570705581, 'X-Men: First Class (2011)'),
 (0.49374319715280607, 'Captain America: The First Avenger (2011)'),
 (0.492098340724142, 'Up (2009)'),
 (0.49117422130775257, 'Captain America: The Winter Soldier (2014)'),
 (0.4911455205724086, 'Inception (2010)'),
 (0.4862128987887418, 'V for Vendetta (2006)')]