# OMF: Occurrence Matrix Factorization

This method builds on the MEMF method, but the clusters are no longer weighted uniformly.

A genre that is assigned to few movies is more precise and selective than a common one, so rare genres should be given more importance.

Let us assume that a movie belongs to three genres:

- Adventure : 4067 movies in that category
- IMAX : 197 movies in that category
- Action : 7130 movies in that category

IMAX is by far the rarest of the three, so the IMAX component of the prediction is likely the most relevant.

The weights must be positive, sum to one, and be larger for rarer genres:

$$w_i = \frac{\exp(-n_i)}{\sum_j \exp(-n_j)}$$
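
As a minimal sketch of the formula above, the weights can be computed as a softmax over the negated counts. With raw counts such as 4067 or 7130, `exp(-n_i)` underflows to zero in double precision, so the sketch first divides the counts by a scale. Using the largest cluster size (24144) as that scale is an assumption: it reproduces the memberships printed later in this notebook, but the notebook never states it. The name `occurrence_weights` is hypothetical and not part of `lib`.

```python
import numpy as np

def occurrence_weights(counts, scale):
    """Softmax over negated, rescaled genre counts (hypothetical helper)."""
    n = np.asarray(counts, dtype=float) / scale
    # Subtracting the minimum leaves the weights unchanged and avoids underflow.
    z = np.exp(-(n - n.min()))
    return z / z.sum()

# Genre sizes from the example above: Adventure, IMAX, Action.
counts = [4067, 197, 7130]
# Assumed scale: the largest cluster size reported by this notebook.
weights = occurrence_weights(counts, scale=24144)
for genre, w in zip(["Adventure", "IMAX", "Action"], weights):
    print(f"{genre}: {w:.3f}")
```

Under this assumption IMAX receives the largest of the three weights, and the weights are positive and sum to one.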

In [1]:
from lib import MEMF
from lib import MF
from lib import utils
from lib import preprocessing as prepro

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
ratings = pd.read_csv('../ml-latest/ratings_small.csv')

In [4]:
ratings.head()

|   | user | item | rating |
|---|------|------|--------|
| 0 | 18 | 6 | 3.0 |
| 1 | 18 | 21 | 3.0 |
| 2 | 18 | 161 | 4.0 |
| 3 | 18 | 216 | 3.0 |
| 4 | 18 | 230 | 4.0 |


In [5]:
hashed_ratings = utils.hash_data(ratings)

In [6]:
t, reduced_ratings = prepro.separate_items_with_few_ratings(1, hashed_ratings)
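
The stratified split in the next cell needs at least two ratings per item, which is presumably why items with a single rating are set aside first. The sketch below shows one way such a separation could work, assuming the helper returns the rare-item rows first and the remaining rows second, as the unpacking into `t` and `reduced_ratings` suggests. `separate_rare_items` is a hypothetical stand-in, not the actual `prepro.separate_items_with_few_ratings`.

```python
import pandas as pd

def separate_rare_items(max_count, ratings):
    """Split ratings into (rows of items with <= max_count ratings, the rest)."""
    # Number of ratings received by the item of each row.
    per_item = ratings["item"].map(ratings["item"].value_counts())
    rare = per_item <= max_count
    return ratings[rare], ratings[~rare]

# Toy data: item 10 has three ratings, items 11 and 12 have one each.
toy = pd.DataFrame({
    "user": [0, 0, 1, 1, 2],
    "item": [10, 11, 10, 12, 10],
    "rating": [3.0, 4.0, 5.0, 2.0, 4.5],
})
rare_rows, frequent_rows = separate_rare_items(1, toy)
print(rare_rows["item"].tolist())
print(frequent_rows["item"].tolist())
```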

In [7]:
# Stratify the split by item; only the feature frames are needed here.
X_train, X_test = train_test_split(reduced_ratings,
                                   test_size = 0.15,
                                   stratify = reduced_ratings['item'],
                                   random_state = 596)

In [8]:
X_train = pd.concat([t, X_train])

In [9]:
# Test rows whose user never appears in the training set.
X_test[~X_test['user'].isin(X_train['user'].unique())]

|   | old_user | old_item | user | item | rating |
|---|----------|----------|------|------|--------|
| 2212 | 7026 | 590 | 254 | 24 | 4.0 |
| 2213 | 7026 | 592 | 254 | 25 | 3.0 |
| 2211 | 7026 | 161 | 254 | 6 | 5.0 |


In [10]:
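# Move one rating of user 254 into the training set so that this user
# is no longer unseen at test time, then drop that row from the test set.
X_train = pd.concat([pd.DataFrame({'old_user' : [7026], 'old_item' : [590],
                                   'user': [254], 'item' : [24], 
                                   'rating' : [4.0]}), X_train], 
                    ignore_index = True)
X_test = X_test[~((X_test['user'] == 254) & (X_test['item'] == 24))]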
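
Cell 10 handles the single unseen user by hand. The same step can be sketched generically: for every user that appears only in the test set, move one of their ratings to the training set. The function name and the toy data below are illustrative assumptions, not part of `lib`.

```python
import pandas as pd

def ensure_users_in_train(train, test):
    """Move one test rating per unseen user into train (hypothetical helper)."""
    unseen = ~test["user"].isin(train["user"].unique())
    # First test row of each user that has no training rating.
    to_move = test[unseen].drop_duplicates(subset="user", keep="first")
    new_train = pd.concat([to_move, train], ignore_index=True)
    new_test = test.drop(index=to_move.index)
    return new_train, new_test

# Toy data: user 3 appears only in the test set.
train = pd.DataFrame({"user": [1, 2], "item": [10, 11], "rating": [3.0, 4.0]})
test = pd.DataFrame({"user": [1, 3, 3], "item": [12, 10, 11],
                     "rating": [5.0, 2.0, 4.0]})
train, test = ensure_users_in_train(train, test)
print(train)
print(test)
```

After the call, every user in the test set has at least one rating in the training set, and the total number of ratings is unchanged.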

In [11]:
y_train = X_train['rating']
y_test = X_test['rating']
X_train = X_train.drop('rating', axis = 1)
X_test = X_test.drop('rating', axis = 1)

In [12]:
X_train.head()

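|   | old_user | old_item | user | item |
|---|----------|----------|------|------|
| 0 | 7026 | 590 | 254 | 24 |
| 1 | 1507 | 138940 | 67 | 681 |
| 2 | 1832 | 172803 | 75 | 818 |
| 3 | 1832 | 183565 | 75 | 855 |
| 4 | 6884 | 138544 | 248 | 679 |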


In [13]:
movie_file = "../ml-latest/clean_movies.csv"

In [14]:
OMF = MEMF.OMF(movie_file)

In [15]:
OMF.define_clusters()

Number of Clusters: 20


In [16]:
for cluster in OMF.clusters.keys():
    print(OMF.clusters[cluster].size)

4067
2663
2749
15956
2637
7412
24144
7130
5105
8216
5555
2773
3444
197
5118
1820
1113
1378
364
4266


In [17]:
OMF.compute_membership()

100%|██████████| 58098/58098 [01:07<00:00, 854.65it/s] 


In [18]:
from itertools import islice

# Show the membership weights of the first six movies.
for movie, weights in islice(OMF.membership.items(), 6):
    print(movie)
    print(weights)

1
{'Adventure': 0.2088491840113339, 'Animation': 0.22135405475757444, 'Children': 0.22056700267927745, 'Comedy': 0.12763720539519993, 'Fantasy': 0.22159255315661416}
2
{'Adventure': 0.32080857172569643, 'Children': 0.33880805153405263, 'Fantasy': 0.34038337674025093}
3
{'Comedy': 0.412442638587691, 'Romance': 0.5875573614123091}
4
{'Comedy': 0.3187791444554087, 'Drama': 0.22709459539151916, 'Romance': 0.45412626015307217}
5
{'Comedy': 1.0}
6
{'Action': 0.3285684421783355, 'Crime': 0.35731470968367096, 'Thriller': 0.31411684813799357}


In [19]:
OMF.fit(X_train, y_train.values)

Adventure matrix factorization is over
Animation matrix factorization is over
Children matrix factorization is over
Comedy matrix factorization is over
Fantasy matrix factorization is over
Romance matrix factorization is over
Drama matrix factorization is over
Action matrix factorization is over
Crime matrix factorization is over
Thriller matrix factorization is over
Horror matrix factorization is over
Mystery matrix factorization is over
Sci-Fi matrix factorization is over
IMAX matrix factorization is over
Documentary matrix factorization is over
War matrix factorization is over
Musical matrix factorization is over
Western matrix factorization is over
Film-Noir matrix factorization is over
(no genres listed) matrix factorization is over


In [20]:
OMF.predict(X_train)

100%|██████████| 75774/75774 [05:40<00:00, 222.69it/s]
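
How the per-genre models are combined is not shown in this notebook. One plausible reading of the description at the top is that the prediction for a movie is the membership-weighted sum of the predictions of its genres. The sketch below illustrates that reading only: `combine_predictions` and the per-genre ratings are made-up illustrations, not the `lib.MEMF` implementation.

```python
def combine_predictions(membership, genre_predictions):
    """Weighted sum of per-genre predictions for one (user, movie) pair."""
    return sum(weight * genre_predictions[genre]
               for genre, weight in membership.items())

# Membership of movie 3 as printed earlier in this notebook.
membership = {"Comedy": 0.412442638587691, "Romance": 0.5875573614123091}
# Hypothetical ratings predicted by each genre's matrix factorization.
genre_predictions = {"Comedy": 3.0, "Romance": 4.0}
combined = combine_predictions(membership, genre_predictions)
print(combined)
```

Because the weights sum to one, the combined value always lies between the smallest and largest per-genre prediction, and it is pulled towards the rarer genre (Romance here).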


In [21]:
OMF.score(y_train)

0.7029604308888561

In [23]:
OMF.RMSE

0.7029604308888561
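
`OMF.score` returns the same value as the `RMSE` attribute. Assuming the metric is the usual root mean squared error between predicted and true ratings, it can be computed as in the sketch below. The arrays are toy values, not data from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two rating arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: two of the three predictions are off by half a star.
toy_rmse = rmse([3.0, 4.0, 5.0], [2.5, 4.0, 5.5])
print(toy_rmse)
```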

In [24]:
prediction = OMF.prediction

In [25]:
np.mean(prediction) - np.mean(y_train)

0.011156225012824095

In [26]:
np.min(prediction)

0.17401185317170542

In [27]:
np.max(prediction)

5.993233780331224