### Collaborative Filtering
-----------------------------------------------------------------------
----------------------------------------------------------------------

    Collaborative filtering techniques make recommendations for a user based on the ratings and preferences of many users. The main underlying idea is that if two users have rated some items similarly, then items that one user has liked and the other has not yet rated are good candidates to recommend to the other.

##### In this notebook we will apply the user-user collaborative filtering technique.

#### User-user collaborative filtering
------------------------------------------

    This approach relies on the idea that users who have shown similar rating behaviour so far share the same tastes and will likely rate similarly going forward. The algorithm first computes the similarity between users by considering the ratings both users have in common.

From notebook 2, we take the similarity function. We will use it to compute the similarity between the active user and each other user on the normalized data.

In [3]:
def weight_factor(x, y):
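    '''
    Weight factor: the strength of the relationship between user x and user y,
    also known as the similarity between user x and user y.
    Computed as the cosine of the two rating vectors; on mean-normalized
    ratings this plays the role of the Pearson correlation coefficient.
    '''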
    t1, t2, t3 = 0, 0, 0 
    for i, j in zip(x, y):
        t1+=i*j
        t2+=i*i
        t3+=j*j
    return t1/(np.sqrt(t2) * np.sqrt(t3))
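
The same quantity can be written with NumPy vector operations. The sketch below is an illustration, not the notebook's code: `cosine_similarity` is a hypothetical name, and returning 0.0 for an all-zero vector is an added assumption. It also checks, on made-up ratings, that for two fully rated vectors the cosine of the mean-centred values agrees with `np.corrcoef`.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    # Assumption: return 0.0 when either vector is all zeros instead of dividing by zero.
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

# Toy ratings for two users who rated every item (made-up numbers).
a = np.array([5.0, 3.0, 4.0, 1.0])
b = np.array([4.0, 2.0, 5.0, 2.0])

# Mean-centring each vector makes the cosine equal to the Pearson correlation.
centred = cosine_similarity(a - a.mean(), b - b.mean())
pearson = np.corrcoef(a, b)[0, 1]
```

When unrated jokes are stored as 0 in the normalized data, as they appear to be here, the two measures are no longer identical, so the result is best read as an approximation of Pearson correlation.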

#### Selecting normalized data from database!

In [4]:
import numpy as np
import pandas as pd
import sqlite3 as db

# Connect to a database (or create one if it doesn't exist)
sql_db = 'jester_jokes'

# database location and creating sql connection!
db_loc = 'data/{}.db'.format(sql_db)
conn = db.connect(db_loc)
# Create a 'cursor' for executing commands
c = conn.cursor()

# Selecting normalized ratings 
query_normalized = 'SELECT * FROM normalized_ratings'
normalized_ratings_df = pd.read_sql(query_normalized, conn)

# Selecting ratings
query_ratings = 'SELECT * FROM ratings'
ratings_df = pd.read_sql(query_ratings, conn)

In [5]:
normalized_ratings_df.head(2)

Unnamed: 0,user_id,number_of_jokes_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,...,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
0,1,74,-4.388108,12.221892,-6.228108,-4.728108,-4.088108,-5.068108,-6.418108,7.601892,...,6.251892,0.0,0.0,0.0,0.0,0.0,-2.198108,0.0,0.0,0.0
1,2,100,1.3337,-3.0363,3.6137,1.6237,-5.1263,-12.4063,-3.4763,-8.0863,...,0.0737,-7.6963,-3.0363,5.1137,-2.9363,-4.8863,0.3137,-2.4063,-7.0663,-1.6763


#### Separating normalized ratings dataframe!

We will separate the normalized ratings dataframe into 2 parts:

    1> Complete ratings: Those users who have rated all 100 jokes
    2> Sparse ratings: Those users who haven't rated all 100 jokes
    
For ease of computation, we will select the active user only from the sparse ratings, and the other users only from the complete ratings group.

In [6]:
# We will be using users who have rated all the 100 jokes as other users.
complete_ratings = normalized_ratings_df[normalized_ratings_df['number_of_jokes_rated'] == 100]
print('total user count who have rated all the jokes: ', len(complete_ratings))
# We will be randomly using one out of these users as active user and use it to find 
# similarity with complete_ratings dataset. 
sparse_ratings = normalized_ratings_df[normalized_ratings_df['number_of_jokes_rated'] != 100]
print('total user count who have not rated all the jokes: ', len(sparse_ratings))

total user count who have rated all the jokes:  14116
total user count who have not rated all the jokes:  59305


    We now select one user from the sparse_ratings dataframe for whom the recommendation will be generated. We will call this user the active user.

In [7]:
# picking an arbitrary user: the row at position 1000 in sparse_ratings
n = 1000
active_user_id = sparse_ratings.iloc[n, 0]
print("Let's select a random user with user id {} as active user for\
 which we will recommend the joke".format(str(active_user_id)))

Let's select a random user with user id 1352 as active user for which we will recommend the joke


In [8]:
print('ratings given by active user {} for 100 jokes'.format(str(active_user_id)))
active_user = sparse_ratings[sparse_ratings['user_id'] == active_user_id]
active_user_rating = active_user.iloc[:, 2:]
active_user_rating

ratings given by active user 1352 for 100 jokes


Unnamed: 0,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,joke_9,joke_10,...,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
1351,7.871096,-1.888904,-2.568904,-1.118904,2.821096,-4.078904,-0.188904,1.891096,-2.568904,3.931096,...,0.0,-3.448904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### Finding similarity between the active user and the users with complete ratings

In [9]:
# flattening the active user's ratings into a 1-D array
active_user_rating_list = active_user_rating.values.ravel()
# computing (user_id, similarity) for every user in complete_ratings
similarity = np.array([(complete_ratings.iloc[i, 0],
             weight_factor(active_user_rating_list, complete_ratings.iloc[i, 2:]))
             for i in range(complete_ratings.shape[0])])
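
The row-by-row loop above can be slow on thousands of users. A possible vectorized alternative is sketched below on toy data; `similarities_to_all` is a hypothetical helper, and treating zero-norm rows as similarity 0 is an assumption rather than the notebook's behaviour.

```python
import numpy as np

def similarities_to_all(active, others):
    """Cosine similarity between one vector and every row of a 2-D array."""
    active = np.asarray(active, dtype=float)
    others = np.asarray(others, dtype=float)
    dots = others @ active                                   # one dot product per row
    norms = np.linalg.norm(others, axis=1) * np.linalg.norm(active)
    # Assumption: rows with zero norm get similarity 0 rather than NaN.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Made-up normalized ratings: one active user and three other users.
active = np.array([1.0, -1.0, 0.0])
others = np.array([[2.0, -2.0, 0.0],
                   [-1.0, 1.0, 0.0],
                   [0.0, 0.0, 3.0]])
sims = similarities_to_all(active, others)
```

With the notebook's data, one would presumably pass `active_user_rating_list` and `complete_ratings.iloc[:, 2:].to_numpy()`, then pair the result with the `user_id` column.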

#### Sorting the neighbours using similarity 

In [10]:
ind = np.argsort( similarity[:,1] )
similarity = similarity[ind]
similarity

array([[ 2.76640000e+04, -3.52644866e-01],
       [ 3.26670000e+04, -3.34265224e-01],
       [ 2.53620000e+04, -3.30902199e-01],
       ...,
       [ 1.92170000e+04,  3.99469178e-01],
       [ 1.06150000e+04,  4.06355890e-01],
       [ 7.49200000e+03,  4.59458050e-01]])

### Now that we have the similarity array, our next task is to select a neighbourhood.
##### There are a few methods to select neighbours:
    1. Use all neighbours (feasible for a small dataset)
    2. Threshold on similarity or distance:
        e.g. consider all users with a similarity of 0.1 or above to the user for whom we are generating recommendations
    3. Random neighbours:
        Can be useful for a very large dataset in conjunction with another technique, e.g. randomly select 10,000 neighbours and then threshold them or pick the top n
    4. Top n neighbours by similarity or distance
        
### How many neighbours?
##### Between 25 and 100 is often used.
        
#### Our method
    We will use the threshold method followed by a random sample of n neighbours. n could be any number between 25 and 100; for simplicity we take n = 30.
    For the threshold, we select other users with similarity > 0.1.
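
The selection strategies listed above can be compared on a small example. This is only a sketch: the `toy_similarity` array and the seed are made up, and `np.random.default_rng` is used here for reproducibility rather than because the notebook uses it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed is an assumption, for reproducibility

# Toy (user_id, similarity) pairs, sorted by similarity.
toy_similarity = np.array([[11, -0.30], [12, 0.05], [13, 0.12],
                           [14, 0.20], [15, 0.35], [16, 0.40]])

# Threshold: keep users whose similarity exceeds 0.1.
candidates = toy_similarity[toy_similarity[:, 1] > 0.1]

# Random n: sample without replacement from the thresholded candidates.
n = 2
random_pick = candidates[rng.choice(len(candidates), size=n, replace=False)]

# Top n: alternatively, take the n most similar users.
top_n = candidates[np.argsort(candidates[:, 1])[-n:]]
```

Random sampling after thresholding trades some accuracy for variety among the neighbours, whereas top n always favours the closest users.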
    

In [12]:
neighbours = similarity[similarity[:,1] > 0.1]
print('We have {} potential neighbours! Now we will be randomly selecting 30 samples out of them'.format(len(neighbours)))

We have 4601 potential neighbours! Now we will be randomly selecting 30 samples out of them


#### Randomly selecting 30 neighbours

In [13]:
# replace=False ensures that no neighbour is selected twice
index_30_neighbour = np.random.choice(range(len(neighbours)), 30, replace=False)
selected_neighbours = neighbours[index_30_neighbour]
selected_neighbours

array([[3.23130000e+04, 2.01937550e-01],
       [2.49720000e+04, 1.27577933e-01],
       [3.46390000e+04, 2.01279295e-01],
       [3.68480000e+04, 1.50230728e-01],
       [2.26270000e+04, 1.11033644e-01],
       [8.50200000e+03, 2.70440082e-01],
       [1.44360000e+04, 1.07649024e-01],
       [2.70370000e+04, 1.28080255e-01],
       [1.76080000e+04, 1.05304573e-01],
       [4.10600000e+04, 1.93403604e-01],
       [3.66270000e+04, 1.61955903e-01],
       [8.97900000e+03, 2.68981037e-01],
       [2.21140000e+04, 1.67170822e-01],
       [6.57400000e+03, 1.46263622e-01],
       [2.54530000e+04, 1.41099161e-01],
       [4.60340000e+04, 1.30307916e-01],
       [1.80000000e+04, 2.16982017e-01],
       [3.19300000e+03, 2.09792215e-01],
       [3.64050000e+04, 1.23917764e-01],
       [3.98770000e+04, 1.09974729e-01],
       [1.97950000e+04, 2.59416876e-01],
       [5.14200000e+03, 2.15887916e-01],
       [2.42750000e+04, 1.01509702e-01],
       [1.56420000e+04, 2.42351093e-01],
       [3.284200

    Once we have selected 30 neighbours, our task is to score the jokes. We will compute scores only for those jokes which the active user has not rated yet.

In [14]:
recommendation_columns = [column for column in active_user_rating.columns if active_user_rating[column].values[0] == 0]

##### Mean value of ratings of active user!!

    The active user's mean rating is used to compensate for the offset introduced when the data was normalized.
    Adding the mean to the predicted normalized score gives the rating the user is likely to give to a joke.
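
For reference, a common textbook form of this mean-offset prediction is shown below. The symbols are introduced here for illustration and do not appear in the notebook: $\bar{r}_a$ is the active user's mean rating, $N$ the set of selected neighbours, $w_{a,u}$ the similarity between the active user $a$ and neighbour $u$, and $r_{u,i} - \bar{r}_u$ the neighbour's normalized rating of joke $i$.

```latex
\hat{r}_{a,i} = \bar{r}_a + \frac{\sum_{u \in N} w_{a,u}\,\bigl(r_{u,i} - \bar{r}_u\bigr)}{\sum_{u \in N} \lvert w_{a,u} \rvert}
```

Since every selected neighbour has similarity above 0.1, the absolute value in the denominator makes no difference in this notebook.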

In [16]:
active_user_raw_ratings = ratings_df[ratings_df['user_id'] == active_user_id].iloc[:, 2:]
active_user_mean_rating = np.mean(active_user_raw_ratings.drop(recommendation_columns, axis = 1).values)
active_user_mean_rating

0.6289041095890411

In [17]:
# Selecting neighbours user_id
neighbour_user_id = selected_neighbours[:, 0]


# selecting neighbour user similarity
neighbour_user_similarity = selected_neighbours[:, 1]

# viewing all neighbours user id and similarity
print('neighbours user id: ', neighbour_user_id, '\n\n')
print('neighbours user similarity: ', neighbour_user_similarity)

neighbours user id:  [32313. 24972. 34639. 36848. 22627.  8502. 14436. 27037. 17608. 41060.
 36627.  8979. 22114.  6574. 25453. 46034. 18000.  3193. 36405. 39877.
 19795.  5142. 24275. 15642. 32842.  8491. 35522.  8952. 18659. 13025.] 


neighbours user similarity:  [0.20193755 0.12757793 0.2012793  0.15023073 0.11103364 0.27044008
 0.10764902 0.12808025 0.10530457 0.1934036  0.1619559  0.26898104
 0.16717082 0.14626362 0.14109916 0.13030792 0.21698202 0.20979222
 0.12391776 0.10997473 0.25941688 0.21588792 0.1015097  0.24235109
 0.15129789 0.25339581 0.10817388 0.13459399 0.15426064 0.12046957]


#### Selecting the ratings of the selected neighbours

In [18]:
neighbours_df = complete_ratings[complete_ratings['user_id'].isin(neighbour_user_id)]
len(neighbours_df)

30
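
One point to watch: filtering with `isin` keeps rows in the order of `complete_ratings`, which need not match the order of `neighbour_user_id`. If similarities are later paired with rows by position, the two should be aligned first. A minimal sketch on toy data follows; the column names mirror the notebook, while the values and the `toy_` names are made up.

```python
import numpy as np
import pandas as pd

toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'joke_1': [0.5, -1.0, 2.0, 0.0],
})
# Neighbour ids in the order they were sampled (floats, as in the similarity array).
toy_neighbour_ids = np.array([4.0, 2.0])

# isin() keeps the frame's own row order: user 2 comes before user 4.
filtered = toy_ratings[toy_ratings['user_id'].isin(toy_neighbour_ids)]

# Re-indexing by user_id and selecting with .loc follows the sampled order instead.
aligned = toy_ratings.set_index('user_id').loc[toy_neighbour_ids.astype(int)]
```

In the notebook, the same idea would apply to `complete_ratings` and `neighbour_user_id`, so that row k of the neighbour frame corresponds to entry k of `neighbour_user_similarity`.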

In [22]:
# selecting only recommendation columns
print('We will be suggesting one out of {} jokes to the active user \n\n'.format(len(recommendation_columns)))

neighbours_df = neighbours_df[recommendation_columns]
neighbours_df.head()

We will be suggesting one out of 27 jokes to the active user 




Unnamed: 0,joke_71,joke_72,joke_73,joke_74,joke_75,joke_76,joke_77,joke_78,joke_80,joke_82,...,joke_90,joke_91,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
3192,-6.1697,-5.0997,3.5903,3.8803,-4.5697,-6.6597,3.1003,7.4203,-7.9697,7.7603,...,2.8103,-6.3697,5.3803,4.6503,-6.2697,-4.4197,6.7403,-8.9897,5.8703,-6.3197
5141,-2.9059,4.1841,4.5241,-5.2759,6.1741,3.9941,-4.5559,2.8741,0.3041,-3.9659,...,2.1441,3.0741,0.4041,-0.8659,1.8041,3.8941,-5.8159,5.2041,3.4141,1.8041
6573,-4.6741,3.9159,-1.2741,1.5859,1.5859,3.2359,0.2759,3.7259,-9.0941,0.8559,...,-1.2341,3.6759,4.0659,0.0359,3.1859,3.1359,1.0559,5.1759,5.8059,1.1059
8490,-8.3116,-0.7916,0.7684,-6.7616,1.1984,-6.0816,-6.6116,-6.8116,1.2484,-7.8716,...,-5.9816,-5.7416,4.8884,0.4284,7.2684,-6.8116,0.2784,-0.0116,-1.5116,1.0084
8501,0.2559,0.1659,0.7959,-3.5741,-0.1341,3.1759,2.2959,0.7959,0.4559,-0.2241,...,-0.3241,-0.4241,-2.0741,-4.0641,-2.6541,-0.9541,2.1559,-0.5141,0.0659,-0.0841


#### checking score for item 0 of the recommendation_columns

In [23]:
item_id = recommendation_columns[0]
print( 'item 0 for which we calculate the score is', item_id)

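def score_user_item(item_id, neighbours_df, neighbour_user_similarity, active_user_mean_rating):
    '''
    Similarity-weighted score of one joke for the active user.
    Note: the active user's mean is added to the weighted sum before dividing,
    so the mean is also scaled by 1 / (sum of similarities).
    '''
    item_rating = neighbours_df[item_id]
    t1, t2 = 0, 0
    # pairs each similarity with a neighbour's rating by position
    for similarity, norm_rating in zip(neighbour_user_similarity, item_rating):
        t1 += norm_rating * similarity
        t2 += similarity
    score = (t1 + active_user_mean_rating) / t2
    return score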

# checking score for a particular joke!
score_user_item(item_id, neighbours_df,neighbour_user_similarity, active_user_mean_rating )
        

item 0 for which we calculate the score is joke_71


-1.8580168676384594
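
In `score_user_item` the mean is added before dividing by the similarity sum. The more common textbook form adds the mean after the weighted average. Both increase with the weighted sum, so the ranking of jokes is the same, but the score values differ. The sketch below shows the textbook variant on made-up numbers; `predict_rating` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def predict_rating(neighbour_ratings, similarities, active_mean):
    """Mean-offset weighted average (hypothetical helper)."""
    neighbour_ratings = np.asarray(neighbour_ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)
    weight_sum = np.abs(similarities).sum()
    # Assumption: fall back to the user's mean when there is no usable neighbour.
    if weight_sum == 0:
        return active_mean
    return active_mean + np.dot(similarities, neighbour_ratings) / weight_sum

# Two neighbours with normalized ratings 2.0 and -1.0 and similarities 0.5 and 0.25:
# 1.0 + (0.5 * 2.0 + 0.25 * (-1.0)) / 0.75 = 2.0
pred = predict_rating([2.0, -1.0], [0.5, 0.25], active_mean=1.0)
```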

#### Suggesting to the user the joke with the highest score

In [25]:
# Computing user item score !
top_score = -np.inf
joke_to_suggest = ''

for column in neighbours_df.columns:
    score = score_user_item(column, neighbours_df, neighbour_user_similarity, active_user_mean_rating)
    if score > top_score:
        top_score = score
        joke_to_suggest = column
print('highest score is', top_score)
print('The highest score obtained by the joke among all the unseen jokes is', joke_to_suggest)

highest score is 3.089570835775759
The highest score obtained by the joke among all the unseen jokes is joke_89
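
The loop over columns can also be expressed with pandas operations. The following sketch uses made-up data and `toy_` names that are not part of the notebook; it applies the same formula as `score_user_item` to every column at once and assumes the weights are in the same order as the rows.

```python
import numpy as np
import pandas as pd

# Toy neighbour ratings: rows are neighbours, columns are unrated jokes (made-up values).
toy_neighbours = pd.DataFrame({'joke_a': [1.0, -2.0, 0.5],
                               'joke_b': [3.0, 1.0, -1.0]})
toy_weights = np.array([0.2, 0.3, 0.5])   # one similarity per row, same order as the rows
toy_mean = 0.5

# Weighted sum per column, then the same formula as score_user_item.
weighted = toy_neighbours.mul(toy_weights, axis=0).sum(axis=0)
scores = (weighted + toy_mean) / toy_weights.sum()
best_joke = scores.idxmax()
```

Sorting `scores` in descending order would give a ranked list of jokes rather than a single suggestion.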
