## Computing recommendations: Overall strategy

In what follows, we implement three collaborative filtering strategies that fall under the "user-based" paradigm.
In the "user-based" paradigm, the focus is on the past behaviour of users ("what did they watch?", "how did they rate it?").

### Steps involved:

* Evaluate similarity between two users.
* Find users similar to the user you want to recommend movies to.
* Rank movies among the ones seen by similar users.
* Recommend the best movies which the user has not seen.

The three collaborative filtering strategies are:

1. Movies that have been seen by most of the similar users.
2. Movies that are watched AND liked by most of the similar users.
3. Movies that are liked by users giving similar ratings.

---

### Building graph database from DSV files

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from py2neo import Graph

In [2]:
%matplotlib inline

In [3]:
neo4j_uname = 'neo4j'
neo4j_pswd = 'Impelsys!#%&('

# Connect to neo4j
graph = Graph(host='localhost', http_port=7474, user=neo4j_uname, password=neo4j_pswd)

In [4]:
# Loading user data
users_col = ['id', 'age', 'gender', 'occupation', 'zipcode']
users = pd.read_csv('movie-dataset/u.user', sep='|', header=None, names=users_col)
num_users = users.shape[0]

# Loading genre data
genres_col = ['name', 'id']
genres = pd.read_csv('movie-dataset/u.genre', sep='|', header=None, names=genres_col)
num_genres = genres.shape[0]

# Loading movie data
movie_col = ['id', 'title', 'release date', 'useless', 'IMDb url']
movie_col = movie_col + genres['id'].tolist()
movies = pd.read_csv('movie-dataset/u.item', sep='|', header=None, names=movie_col)
movies = movies.fillna('unknown')
num_movies = movies.shape[0]

# Loading ratings data
ratings_col = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = pd.read_csv('movie-dataset/u.data', sep='\t', header=None, names=ratings_col)
num_ratings = ratings.shape[0]

In [5]:
# Create the User nodes, each one identified by its user_id
# Begin db transaction
tx = graph.begin()

statement = "MERGE (a:User {user_id:{A}}) RETURN a"

for u in users['id']:
    # Replace 'A' with user_id
    tx.run(statement, {'A': u})

# Commit db transaction
tx.commit()

In [6]:
# Create the Genre nodes,
# each one identified by its genre_id and carrying a name property
tx = graph.begin()
statement = "MERGE (a:Genre {genre_id:{A}, name:{B}}) RETURN a"

for g, row in genres.iterrows():
    # Replace 'A' and 'B' with genre_id and name respectively
    tx.run(statement, {'A': row.iloc[1], 'B': row.iloc[0]})
    
tx.commit()

In [7]:
# Create the Movie nodes with properties movie_id, title and url; then create the Is_genre edges
tx = graph.begin()
movie_stmt = 'MERGE (a:Movie {movie_id:{A}, title:{B}, url:{C}}) RETURN a'
genre_stmt = '''MATCH (g:Genre {genre_id:{D}})
                MATCH (m:Movie {movie_id:{A}})
                MERGE (m)-[r:Is_genre]->(g) RETURN r'''

# Looping over movies
for m, row in movies.iterrows():
    movie_id = row.loc['id']
    movie_title = row.loc['title'].decode('latin-1')
    movie_url = row.loc['IMDb url']
    
    # Create Movie nodes
    tx.run(movie_stmt, {'A': movie_id, 'B': movie_title, 'C': movie_url})
    
    # Create an array of booleans for genre (sliced from each movie data)
    is_genre = row.iloc[-19:] == 1
    # Form an array of genre_ids
    related_genres = genres[is_genre].axes[0].values
    
    # Looping over related genres
    for genre in related_genres:
        # Create Movie-Genre relationships
        tx.run(genre_stmt, {'A': movie_id, 'D': genre})
    
    # For every 100 movies, push queued statements to the server for execution to avoid one massive "commit"
    if m % 100 == 0:
        tx.process()
        
tx.commit()
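
The genre lookup in the loop above turns the last 19 columns of a movie row (one 0/1 flag per genre) into a list of genre ids. A minimal sketch of that decoding step in plain Python, assuming genre ids run from 0 to 18 in column order; the helper name `related_genre_ids` and the sample flags are made up for illustration and do not come from `u.item`:

```python
# Assumption: 19 genres with ids 0..18, in the same order as the flag columns.
genre_ids = list(range(19))

def related_genre_ids(flags, ids=genre_ids):
    """Return the ids of the genres whose flag equals 1."""
    if len(flags) != len(ids):
        raise ValueError('expected one flag per genre')
    return [gid for gid, flag in zip(ids, flags) if flag == 1]

# Hypothetical movie tagged with genres 1 and 4
flags = [0] * 19
flags[1] = 1
flags[4] = 1

print(related_genre_ids(flags))
```

Each id returned this way corresponds to one `Is_genre` edge created for the movie.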

In [8]:
# Create the Has_rated edges, with rating as property
tx = graph.begin()
statement = '''MATCH (u:User {user_id:{A}})
               MATCH (m:Movie {movie_id:{B}})
               MERGE (u)-[r:Has_rated {rating:{C}}]->(m) RETURN r'''

# Looping over ratings
for r, row in ratings.iterrows():
    user_id = row.loc['user_id']
    movie_id = row.loc['item_id']
    rating = row.loc['rating']
    
    # Create User-Movie relationship (i.e. Ratings)
    tx.run(statement, {'A': user_id, 'B': movie_id, 'C': rating})
    
    if r % 100 == 0:
        tx.process()
        
tx.commit()
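
The loops above flush queued statements every 100 rows by testing the DataFrame index (`r % 100 == 0`), which implicitly assumes the index is a clean 0..n-1 range. As a hedged alternative sketch, the same cadence can be made explicit with a small batching helper; `chunked` and the sample tuples are hypothetical names and data, not part of the notebook:

```python
def chunked(rows, size):
    """Yield successive lists holding at most `size` rows each."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch

# Hypothetical rating tuples: (user_id, item_id, rating)
sample = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (3, 12, 1), (3, 13, 2)]
batches = list(chunked(sample, 2))

print([len(b) for b in batches])
```

With such a helper, the statements of one batch would be queued with `tx.run(...)` and then pushed with a single `tx.process()` before moving on to the next batch.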

In [None]:
# Create indexes on the properties used to look up nodes in the MATCH clauses
graph.run('CREATE INDEX ON :User(user_id)')
graph.run('CREATE INDEX ON :Movie(movie_id)')
graph.run('CREATE INDEX ON :Genre(genre_id)')

In [None]:
# Add a new user and that user's ratings
num_users += 1
new_user_id = num_users

# Create a node for new user
tx = graph.begin()
statement = 'MERGE (u:User {user_id:{A}}) RETURN u'
tx.run(statement, {'A': new_user_id})
tx.commit()

# Load ratings from new user
new_user_ratings = pd.read_csv('movie-dataset/new_user.data', sep='|', header=None, names=['item_id', 'rating'])
num_ratings += new_user_ratings.shape[0]

# Create Has_rated relations between new user and movies
tx = graph.begin()
statement = '''MATCH (u:User {user_id:{A}})
               MATCH (m:Movie {movie_id:{B}})
               MERGE (u)-[r:Has_rated {rating:{C}}]->(m) RETURN r'''

# Looping over new user ratings
for r, row in new_user_ratings.iterrows():
    rating = row.loc['rating']
    movie_id = row.loc['item_id']
    tx.run(statement, {'A': new_user_id, 'B': movie_id, 'C': rating})
    
    if r % 100 == 0:
        tx.process()
        
tx.commit()

### Strategy 1:

* Compute the similarity between two users u<sub>1</sub> and u<sub>i</sub> as the fraction of the movies seen by u<sub>1</sub> that u<sub>i</sub> has also seen.

   ***similarity = number of movies seen by both u<sub>1</sub> and u<sub>i</sub> / number of movies seen by u<sub>1</sub>***


* Find the set of users similar to u<sub>1</sub>. A similarity threshold keeps only the users who overlap most with u<sub>1</sub>, which reduces the number of users and makes the subset more selective.


* Find the set of movies rated by similar users that u<sub>1</sub> has not seen.


* Rank each movie by computing the proportion of similar users who have seen it.

   ***rank = number of similar users who have seen that particular movie / total number of similar users***
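
The steps above can be sketched in plain Python on toy data before expressing them in Cypher. This is only an illustration under assumptions: the `seen` dictionary, the user and movie labels, and the function name `recommend_by_overlap` are hypothetical, and ties are broken alphabetically, which the Cypher query does not guarantee.

```python
# Hypothetical toy data: user id -> set of movie ids the user has rated
seen = {
    'u1': {'A', 'B', 'C', 'D'},
    'u2': {'A', 'B', 'C', 'E'},
    'u3': {'A', 'B', 'D', 'E', 'F'},
    'u4': {'A', 'G'},
}

def recommend_by_overlap(seen, target, threshold=0.5, limit=10):
    """Rank unseen movies by the share of similar users who have seen them."""
    target_movies = seen[target]
    if not target_movies:
        return []

    # similarity = movies seen by both / movies seen by the target user
    similar = []
    for user, movies in seen.items():
        if user == target:
            continue
        sim = len(movies & target_movies) / float(len(target_movies))
        if sim > threshold:
            similar.append(user)
    if not similar:
        return []

    # Count, for each movie the target has not seen, how many similar users saw it
    counts = {}
    for user in similar:
        for movie in seen[user] - target_movies:
            counts[movie] = counts.get(movie, 0) + 1

    # rank = similar users who saw the movie / total number of similar users
    ranked = [(movie, count / float(len(similar))) for movie, count in counts.items()]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:limit]

print(recommend_by_overlap(seen, 'u1'))
```

In this toy data, `u2` and `u3` each share three of the four movies seen by `u1` (similarity 0.75), while `u4` shares only one (0.25) and is dropped by the 0.5 threshold. Movie `E` was seen by both similar users and `F` by one of them.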

In [None]:
user_id = 944
threshold = 0.5

query = (
    # Count movies rated by user1 as countm
    'MATCH (u1:User {user_id: {user_id}})-[:Has_rated]->(m1:Movie) '
    'WITH count(m1) as countm '
    # Find users who share at least 1 movie with u1
    'MATCH (u1:User {user_id: {user_id}})-[:Has_rated]->(m1:Movie) '
    'MATCH (m1)<-[r:Has_rated]-(u2:User) WHERE NOT u2=u1 '
    # Compute similarity between users
    'WITH u2, countm, tofloat(count(r))/countm as sim WHERE sim>{threshold} '
    # Count number of similar users as countu
    'WITH count(u2) as countu, countm '
    'MATCH (u1:User {user_id: {user_id}})-[:Has_rated]->(m1:Movie) '
    'MATCH (m1)<-[r:Has_rated]-(u2:User) WHERE NOT u2=u1 '
    # Recompute similarity to recover the similar users themselves (the count above discarded them)
    'WITH u1, u2, countu, tofloat(count(r))/countm as sim WHERE sim>{threshold} '
    # Find movies that were rated by at least one similar user, but not by u1
    'MATCH (m:Movie)<-[r:Has_rated]-(u2) '
    'WHERE NOT (m)<-[:Has_rated]-(u1) '
    'RETURN DISTINCT m as movie, tofloat(count(r))/countu as score ORDER BY score DESC '
    'LIMIT 10')

tx = graph.begin()
recommended_movies = tx.run(query, {'user_id': user_id, 'threshold': threshold})

tx.commit()
# Print rank, title padded to 70 characters, and score (works on Python 2 and 3)
for num, movie in enumerate(recommended_movies.data()):
    print('%02d. %s %s' % (num + 1, movie['movie']['title'].ljust(70, '-'), movie['score']))