# Item - item collaborative filtering

References: 
 - https://aman.ai/recsys/index.html
 - https://stackabuse.com/creating-a-simple-recommender-system-in-python-using-pandas/
 - https://grouplens.org/datasets/movielens/latest/

Catatan:
* Adalah teknik untuk membentuk rekomendasi dari matrix item terhadap item. 
* Contohnya: item yang dijual di toko online dengan item lain yang dijual di toko tersebut.
* Pada bentuk yang paling dasarnya, menggunakan 2 dimensi data, misalnya jumlah klik dan item id
* Sehingga bisa membandingkan "jika orang mengklik item ini, biasanya juga mengklik item x, y, z"
* Kekurangan: 
  * komputasi yang dibutuhkan cukup besar/berat (memory dan cpu time)
* Kelebihan:
  * dapat bekerja dengan cukup baik dengan data yang sedikit


## Contoh soal: Movies recommendation
* Dengan dataset movies recommendation, kita akan membuat mesin rekomendasi untuk movie. 
* Statement masalah: jika user melihat suatu movie, temukan rekomendasi movie lainnya.


### Mempersiapkan Data

In [107]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [9]:
! mkdir -p /tmp/ml-latest-small && wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip -O /tmp/ml-latest-small/ml-latest-small.zip
! unzip /tmp/ml-latest-small/ml-latest-small.zip -d /tmp/ml-latest-small

--2024-09-29 15:24:19--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘/tmp/ml-latest-small/ml-latest-small.zip’


2024-09-29 15:24:19 (5.56 MB/s) - ‘/tmp/ml-latest-small/ml-latest-small.zip’ saved [978202/978202]

Archive:  /tmp/ml-latest-small/ml-latest-small.zip
   creating: /tmp/ml-latest-small/ml-latest-small/
  inflating: /tmp/ml-latest-small/ml-latest-small/links.csv  
  inflating: /tmp/ml-latest-small/ml-latest-small/tags.csv  
  inflating: /tmp/ml-latest-small/ml-latest-small/ratings.csv  
  inflating: /tmp/ml-latest-small/ml-latest-small/README.txt  
  inflating: /tmp/ml-latest-small/ml-latest-small/movies.csv  


In [10]:
base_dir = "/tmp/ml-latest-small/ml-latest-small"
links = pd.read_csv(f"{base_dir}/links.csv")
tags = pd.read_csv(f"{base_dir}/tags.csv")
ratings = pd.read_csv(f"{base_dir}/ratings.csv")
movies = pd.read_csv(f"{base_dir}/movies.csv")

In [14]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [15]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [12]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


check apakah kita mempunyai data yang tidak unik

In [19]:
ratings_group_by_count = ratings.groupby(["userId", "movieId"]).count()
# tidak ada user yang me-rating movie lebih dari 1x
ratings_group_by_count[ratings_group_by_count.rating > 1]


Unnamed: 0_level_0,Unnamed: 1_level_0,rating,timestamp
userId,movieId,Unnamed: 2_level_1,Unnamed: 3_level_1


In [13]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [22]:
ratings_group_by_avg = ratings.groupby(["userId", "movieId"]).mean().reset_index()

In [101]:
movie_to_user_pivot = ratings.pivot_table(
    values="rating",
    aggfunc="mean",
    columns="userId",
    index="movieId",
    fill_value=0
    
)
movie_to_user_pivot.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


### Implementasi item-item collaborative filtering

* jarak antara 2 vector dapat diukur dengan berbagai cara, namun yang paling umum digunakan adalah `cosine similarity`
* Rumus `cosine similarity` antara 2 vector `A` and `B` adalah
$$
 \text{Cosine Similarity} = \frac{A . B}{||A|| \space ||B||}
$$
* Karena matrix yang kita punya adalah `movies` x `users`, maka kita akan membandingkan movies menurut $\text{user}_m$ dengan $\text{user}_n$, misalnya
  * movies menurut $\text{user}_1$ denga $\text{user}_5$
* untuk kasus ini akan digunakan data `rating`, namun vector yang dibentuk bisa berupa apapun, contohnya jumlah klik atau jumlah view.
* untuk membandingkan movies ke movies, maka akan digunakan "pairwise" `cosine similarity`

In [92]:
def cos_sim_vectors(a, b):
    dot_product = a.dot(b)
    len_a = np.linalg.norm(a)
    len_b = np.linalg.norm(b)

    if len_a <= 0 or len_b <= 0:
        return 0
    return dot_product / len_a / len_b

def cos_sim_pairwise(X):
    """
        X mempunyai dimensi (n, m), dimana dalam dataset movies, n bisa berupa dimensi movies, dan 
        m bisa berupa dimensi users (movies x user), sehingga yang akan diukur adalah
        vector movie_1 dibandingkan dengan vector movie_2.
    """
    data = X.values
    result_shape = data.shape[0]
    similarities = np.zeros((result_shape, result_shape))
    for i in range(result_shape):
        for j in range(result_shape):
            similarities[i, j] = cos_sim_vectors(data[i], data[j])

    return similarities

In [103]:
# contoh hasil
cos_sim_pairwise(movie_to_user_pivot.loc[:10, :10])

array([[1.   , 0.   , 0.346, 0.   , 0.   , 0.391, 0.   , 0.   , 0.   ,
        0.   ],
       [0.   , 1.   , 0.552, 0.707, 0.707, 0.5  , 0.707, 0.707, 0.   ,
        0.981],
       [0.346, 0.552, 1.   , 0.781, 0.781, 0.994, 0.781, 0.781, 0.   ,
        0.65 ],
       [0.   , 0.707, 0.781, 1.   , 1.   , 0.707, 1.   , 1.   , 0.   ,
        0.832],
       [0.   , 0.707, 0.781, 1.   , 1.   , 0.707, 1.   , 1.   , 0.   ,
        0.832],
       [0.391, 0.5  , 0.994, 0.707, 0.707, 1.   , 0.707, 0.707, 0.   ,
        0.588],
       [0.   , 0.707, 0.781, 1.   , 1.   , 0.707, 1.   , 1.   , 0.   ,
        0.832],
       [0.   , 0.707, 0.781, 1.   , 1.   , 0.707, 1.   , 1.   , 0.   ,
        0.832],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ],
       [0.   , 0.981, 0.65 , 0.832, 0.832, 0.588, 0.832, 0.832, 0.   ,
        1.   ]])

In [105]:
# ini akan memakan waktu cukup lama
# similarities = cos_sim_pairwise(movie_to_user_pivot)

karena implementasi yg digunakan kurang optimal, maka lebih baik menggunakan fungsi yang sudah ada di sciki-learn

In [113]:
# untuk sekarang kita memakai implementasi scikit learn untuk mempersingkat waktu 
similarities = cosine_similarity(movie_to_user_pivot)

In [112]:
similarities[:10, :10]

array([[1.   , 0.411, 0.297, 0.036, 0.309, 0.376, 0.277, 0.132, 0.233,
        0.396],
       [0.411, 1.   , 0.282, 0.106, 0.288, 0.297, 0.229, 0.172, 0.045,
        0.418],
       [0.297, 0.282, 1.   , 0.092, 0.418, 0.284, 0.403, 0.313, 0.305,
        0.243],
       [0.036, 0.106, 0.092, 1.   , 0.188, 0.09 , 0.275, 0.158, 0.   ,
        0.096],
       [0.309, 0.288, 0.418, 0.188, 1.   , 0.299, 0.474, 0.284, 0.335,
        0.218],
       [0.376, 0.297, 0.284, 0.09 , 0.299, 1.   , 0.244, 0.148, 0.214,
        0.386],
       [0.277, 0.229, 0.403, 0.275, 0.474, 0.244, 1.   , 0.274, 0.162,
        0.239],
       [0.132, 0.172, 0.313, 0.158, 0.284, 0.148, 0.274, 1.   , 0.   ,
        0.19 ],
       [0.233, 0.045, 0.305, 0.   , 0.335, 0.214, 0.162, 0.   , 1.   ,
        0.049],
       [0.396, 0.418, 0.243, 0.096, 0.218, 0.386, 0.239, 0.19 , 0.049,
        1.   ]])

* dengan menggunakan similarities ini, maka kita dapat menghitung dan mengeluarkan rekomendasi movies, seperti berikut

In [178]:
class ItemToItemCollborativeFiltering:
    def __init__(self, similarities, movies_to_titles):
        self._similarities = similarities
        self._movies_to_titles = movies_to_titles
    
    def find_similar(self, title, n=5):
        movieId = self._movies_to_titles.loc[self._movies_to_titles["title"] == title, "movieId"]
        similarity_scores = self._similarities[movieId][0]
        sorted_similarities = sorted(
            [
                (movie_id, similarity_score) for movie_id, similarity_score  in enumerate(similarity_scores)
            ], 
            key=lambda x: -x[-1]
        )
        # similarity yang pertama pasti judul movie yang sedang di query, sehingga harus dikeluarkan dari pertimbangan
        sorted_similarities = sorted_similarities[1:]
        result_similarities = sorted_similarities[:n]
        
        result_similarities_indexes = [movie_id for movie_id, _ in result_similarities]
        return [
            title
            for _, title in 
            self._movies_to_titles.loc[
                self._movies_to_titles["movieId"].isin(result_similarities_indexes),
                "title"
            ].items()
        ]

In [179]:
filter = ItemToItemCollborativeFiltering(similarities=similarities, movies_to_titles=movies)

In [201]:
movie_id = 312
movie_title = movies.loc[movies["movieId"] == movie_id, "title"].item()

similar_movies = filter.find_similar(
    movies.loc[movies["movieId"] == movie_id, "title"].item(),
    n=20)

print(f"Users that rated '{movie_title}' also likes:")
for i in range(len(similar_movies)):
    print(f"  - {similar_movies[i]}")

Users that rated 'Stuart Saves His Family (1995)' also likes:
  - Usual Suspects, The (1995)
  - City Hall (1996)
  - Congo (1995)
  - Free Willy 2: The Adventure Home (1995)
  - Party Girl (1995)
  - White Man's Burden (1995)
  - Boys on the Side (1995)
  - Houseguest (1994)
  - My Crazy Life (Mi vida loca) (1993)
  - Nell (1994)
  - Perez Family, The (1995)
  - Santa Clause, The (1994)
  - Corrina, Corrina (1994)
  - Fatal Instinct (1993)
  - Home Alone (1990)
  - Terminator 2: Judgment Day (1991)
  - Pallbearer, The (1996)
  - Stupids, The (1996)
