# Courses recommendation system
# III. Make recommendations

This is the third part of the Udacity Data Science Nanodegree capstone project, whose goal is to build a course recommendation system.

After the exploratory data analysis, it is time to use the structures created in the first part to make recommendations.

I will use four types of recommendations:

* Knowledge based recommendations
* Content based filtering
* Neighborhood based collaborative filtering
* Model based collaborative filtering

## 1 Import libraries

In [None]:
import os
import numpy as np
import pandas as pd
import pickle

from scipy.sparse import csr_matrix
from db_utils import connection
from common import requested_courses

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

## 2 Retrieve data

In this step I will retrieve the clean data from the database.

**Retrieve courses**

In [2]:
courses_query = '''SELECT c.*, cat.name AS category_name FROM courses c 
                    JOIN categories cat ON c.category_id = cat.id'''

courses_df = pd.read_sql_query(courses_query, con=connection())

**Retrieve leads**

In [3]:
leads_query = 'SELECT * FROM clean_leads ORDER BY created_on DESC'

leads_df = pd.read_sql_query(leads_query, con=connection())

**Retrieve reviews**

In [4]:
reviews_query = 'SELECT * FROM clean_reviews ORDER BY created_on DESC'

reviews_df = pd.read_sql_query(reviews_query, con=connection())

## 3 Knowledge based recommendations
In this type of recommendation, I will use two measures: the most requested courses and the most valued courses by users' ratings. Also, I will add a category filter.

### 3.1 Most requested courses

In [7]:
def get_top_courses_by_leads(df, n=10, category=None):
    """ 
    Creates an array of n courses ordered by number of leads generated
    
    :param df DataFrame: Courses dataframe with title, category_name and number_of_leads columns
    :param n int: Number of courses in the array
    :param category str: If category is supplied, the array will be of courses belonging to that category
    
    :return numpy.ndarray: Array of top n courses by leads generated
    """ 
    if category:
        df = df[df['category_name'] == category]
        
    top_courses = df.sort_values('number_of_leads', ascending=False)['title'].head(n)
    
    return top_courses.values

In [18]:
print('Most requested courses:\n')
print('\n'.join(get_top_courses_by_leads(courses_df)))

print('\nMost requested Engineering courses:\n')
print('\n'.join(get_top_courses_by_leads(courses_df, category='Engineering')))

Most requested courses:

Aviation Engineering - BEng (Hons)
Bachelor in Aviation Management
MSc Public Health
Master of computer science
MBA - Engineering Management
BA in Hospitality Management
MBA - Big Data Management
Master of Leadership in Development Finance - Online
MBA - Marketing
NVQ Tiling courses - Free - Funded by Government

Most requested Engineering courses:

Aviation Engineering - BEng (Hons)
Master of computer science
MBA - Engineering Management
Master Artificial intelligence
Master in Engineering Management
BTEC HND Civil Engineering
HNC Civil Engineering
B.TECH CIVIL ENGINEERING
Level 5 Diploma in Civil Engineering
Master of Bioscience Engineering: Human Health Engineering (Leuven)


### 3.2 Most valued courses

In [19]:
def get_top_courses_by_rating(df, n=10, category=None):
    """ 
    Creates an array of n courses ordered by rating
    
    :param df DataFrame: Courses dataframe with title, category_name and weighted_rating columns
    :param n int: Number of courses in the array
    :param category str: If category is supplied, the array will be of courses belonging to that category
    :return numpy.ndarray: Array of top n courses by rating
    """
    if category:
        df = df[df['category_name'] == category]
    
    top_courses = df.sort_values(by='weighted_rating', ascending=False)
    
    return top_courses['title'].values[:n]

In [20]:
print('Most valued courses:\n')
print('\n'.join(get_top_courses_by_rating(courses_df)))

print('\nMost valued courses on Beauty Therapy:\n')
print('\n'.join(get_top_courses_by_rating(courses_df, category='Beauty Therapy')))

Most valued courses:

Aerospace Engineering MEng (Hons) 4 years
Excel Intermediate Course
Elite personal training diploma Level 2 & 3
MS Project: An Introduction, One-to-one, Classroom based
Workplace Wellbeing Champion -Initial Training
IPAF Scissor and Boom Training
Corporate English Training
Safety Harness Awareness and Inspection
PASMA Mobile Tower Scaffold 
NCUK International Foundation Year in Business

Most valued courses on Beauty Therapy:

Cryotherapy Induced Lipolysis Short Course
Ultrasound for Skin Rejuvenation Short Course
High Intensity Focused Ultrasound (HIFU) for Face and Neck Short Course
Ultrasonic Lipo-Cavitation Short Course
Russian Volume Eyelash Extensions
LED Light Therapy Short Course
Level 4 Radio Frequency for Face and Body Course
Radio Frequency for Face and Body Short Course
Hair Extensions Business Diploma Course
Beauty Therapist Diploma Course
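
The ranking above relies on a `weighted_rating` column computed earlier in the project. Its exact formula is not shown in this notebook. As a rough sketch only, and assuming a shrinkage-style formula that may differ from the one actually used, a weighted rating can pull courses with few reviews toward the global mean so that a single 5-star review does not outrank a well-reviewed course. The function name, parameters and numbers below are hypothetical.

```python
def weighted_rating(avg_rating, n_reviews, global_mean, min_reviews):
    """Shrink a course's average rating toward the global mean (illustrative formula)."""
    # The more reviews a course has, the more its own average counts.
    weight = n_reviews / (n_reviews + min_reviews)
    return weight * avg_rating + (1 - weight) * global_mean

# Hypothetical numbers: global mean rating 4.0, prior strength of 10 reviews.
one_review = weighted_rating(5.0, 1, global_mean=4.0, min_reviews=10)
many_reviews = weighted_rating(4.8, 200, global_mean=4.0, min_reviews=10)

# The course with 200 reviews ranks above the course with a single perfect review.
print(many_reviews > one_review)
```

With this kind of weighting, `min_reviews` controls how many reviews a course needs before its own average dominates the score.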


## 4 Content based recommendations

Using the course similarities built in part one, I will make recommendations based on course titles and descriptions.
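
To give an idea of what a text-based similarity looks like, here is a minimal sketch that compares course titles with cosine similarity over word counts. It is an assumption for illustration: the similarities stored in the database were built in part one and may use a different representation (for example TF-IDF over title and description). The course ids and titles are made up.

```python
import numpy as np

# Toy course titles (made up for illustration).
titles = {
    'c1': 'child care diploma level 3',
    'c2': 'level 3 diploma in child care',
    'c3': 'excel intermediate course',
}

# Build a shared vocabulary and one term-count vector per course.
vocab = sorted({word for text in titles.values() for word in text.split()})
vectors = {
    cid: np.array([text.split().count(word) for word in vocab], dtype=float)
    for cid, text in titles.items()
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the other courses by similarity to 'c1', most similar first.
ranked = sorted(
    (cid for cid in titles if cid != 'c1'),
    key=lambda cid: cosine(vectors['c1'], vectors[cid]),
    reverse=True,
)
print(ranked)
```

The two child care titles share almost all their words and rank first, while the Excel course shares none and gets a similarity of zero.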

### 4.1 Retrieve similarities

In [34]:
courses_sim_query = '''SELECT * FROM courses_similarities'''

courses_sims = pd.read_sql_query(courses_sim_query, con=connection())

In [40]:
def similar_courses(course_id, df):
    """Returns the ids of the courses in df most similar to course_id, most similar first"""
    matches = df[df['a_course_id'] == course_id].sort_values('similarity', ascending=False)
    
    return matches['another_course_id'].values

In [48]:
course_id = '170624724'

course_name = course_names([course_id])[0]

print('\nSimilar courses to "{}":\n'.format(course_name))
print('\n'.join(course_names(similar_courses(course_id, courses_sims))))


Similar courses to "Child Care Diploma Level 3":

Working in Child Care
Part Time Level 2 Diploma in Child Care and Education
CACHE Level 3 Award/Certificate/Diploma in Childcare and Education
Child Care Course - Level 3 - Accredited
Child Playwork Course - Level 3 - Accredited
Level 3 Diploma in Child Care - CPD Certified & IAO Approved
Level 3 Diploma in Child Care - Best Seller
Professional Diploma in Child Psychology - CPD Certified
Child Psychology and Child Care Diploma


The content of all these courses seems quite similar.

## 5 Neighborhood based collaborative filtering

### 5.1 Using leads data

For these recommendations, I will use the leads data. I will find similar users, that is, users who have generated a lead on the same course as the target user. Then I will look at the other courses on which those users have generated leads and recommend them to the target user, much like a cross-selling section.
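
The similarity used here is simply the number of courses two users have in common, which is the dot product of their binary lead vectors. The sketch below illustrates the idea with dense NumPy arrays standing in for the sparse rows used in the notebook. The user ids, vectors and the helper name `shared_leads` are invented for the example.

```python
import numpy as np

# Toy binary user-course vectors: 1 means the user generated a lead on that course.
user_vectors = {
    'ana':   np.array([1, 1, 0, 0]),
    'ben':   np.array([1, 1, 1, 0]),
    'carla': np.array([0, 1, 0, 1]),
    'dan':   np.array([0, 0, 0, 1]),
}

def shared_leads(user_id, vectors, min_similarity=1):
    """Return (other_user, shared_courses) pairs sorted by decreasing overlap."""
    target = vectors[user_id]
    scores = {
        other: int(np.dot(target, vec))
        for other, vec in vectors.items()
        if other != user_id
    }
    # Drop users that share fewer than min_similarity courses.
    kept = [(other, score) for other, score in scores.items() if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

print(shared_leads('ana', user_vectors))
```

In this toy data, `ben` shares two courses with `ana`, `carla` shares one, and `dan` shares none, so he is filtered out.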

In [28]:
def find_similar_users(user_id, sparse_user_item_dict, min_similarity=1):
    """ 
    Creates an array of similar users based on leads generated on the same courses
    
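    :param user_id str: User id for which we want to find similar users
    :param sparse_user_item_dict dict: Maps each user id to a sparse 1 x n_courses row of leads
    :param min_similarity int: Minimum number of shared courses for a user to count as similar
    
    :return numpy.array: Array of similar user ids sorted by decreasing similarity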
    """
    
    user_courses = np.array(sparse_user_item_dict[user_id].todense())[0]
    
    similarities = dict()
    
    for another_user_id, another_user_courses in sparse_user_item_dict.items():
        if user_id == another_user_id:
            continue
        
        similarity = np.dot(user_courses, np.array(another_user_courses.todense())[0])
        if similarity < min_similarity:
            continue        
                            
        similarities[another_user_id] = similarity
        

    sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    
    # Avoid shadowing the built-in id() in the comprehension
    return np.array([other_id for other_id, _ in sorted_similarities])

def leads_based_recommendations_for_user(user_id, max_recs=10):
    """ 
    Returns an array of recommended courses for a user based on generated leads
    
    :param user_id str: User id for which we want to make the recommendations
    :param max_recs int: Maximum number of recommendations
                          
    :return numpy.array: Array of courses recommended based on generated leads
    """
    
    user_courses = requested_courses(user_id, leads_df)
    similar_users = find_similar_users(user_id, user_leads_courses_map)

    recs = np.array([])

    for user in similar_users:
        neighbs_leads = requested_courses(user, leads_df)

        new_recs = np.setdiff1d(neighbs_leads, user_courses, assume_unique=True)
        recs = np.unique(np.concatenate([new_recs, recs], axis=0))

        if len(recs) > max_recs:
            break

    return recs[:max_recs]

def course_names(course_ids):
    return courses_df[courses_df['id'].isin(course_ids)]['title'].values

In [30]:
with open('../data/user_courses_map.pickle', 'rb') as filename:
    user_leads_courses_map = pickle.load(filename)

In [33]:
user_id = '1460318498c1f53bb880ce2e6d9ef64b'

print('User courses:')
print(course_names(requested_courses(user_id, leads_df)))

print('\nRecommended courses:')
print(course_names(leads_based_recommendations_for_user(user_id)))

User courses:
['Adobe Photoshop, Illustrator and Graphic Design Bundle Course']

Recommended courses:
['Enterprise Transformation Maturity Canvas'
 'ACCA-Accountancy Traineeship Program' 'HNC Graphic Design'
 'Graphics Design and Desktop Publishing'
 'Quality Management Systems (QMS) - Lead Auditor'
 'Internal Audit - OHSAS 18001 Occupational Health & Safety'
 'Advanced Strategic Management'
 'Adobe Graphic Design & Web Design Online Training Bundle'
 'Professional Diploma in Graphic Design - CPD Certified'
 'Auditing and Internal Control Skills']


The recommender seems to work reasonably well: it makes sense that a user who has generated a lead on an Adobe Photoshop course is recommended courses on graphic design and other Adobe products. The list also contains some less related courses, such as accountancy and auditing.

### 5.2 Using rating data

In this section, I will use the distances DataFrame built in part one to find the nearest neighbors.
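
As a reminder of what such a table contains, the sketch below builds a small long-format distance table from toy ratings and reads the nearest neighbors from it. The column names match the ones used in this notebook; the ratings are invented, and the real table from part one may treat missing ratings differently.

```python
import numpy as np
import pandas as pd

# Toy ratings (users x courses); values are invented for illustration.
ratings = pd.DataFrame(
    {'course_a': [5.0, 4.0, 1.0], 'course_b': [3.0, 3.0, 5.0]},
    index=['u1', 'u2', 'u3'],
)

# Build a long-format table with one row per ordered pair of distinct users.
rows = []
for a in ratings.index:
    for b in ratings.index:
        if a == b:
            continue
        diff = ratings.loc[a].to_numpy() - ratings.loc[b].to_numpy()
        rows.append({'a_user_id': a, 'another_user_id': b,
                     'eucl_distance': float(np.linalg.norm(diff))})
dist_df = pd.DataFrame(rows)

# Nearest neighbors of u1: smallest distance first.
neighbors = (dist_df[dist_df['a_user_id'] == 'u1']
             .sort_values(by='eucl_distance')['another_user_id']
             .tolist())
print(neighbors)
```

Here `u2` rated both courses almost like `u1`, so it comes first, while `u3` has nearly opposite tastes and ends up last.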

In [None]:
def find_closest_neighbors(user_id, dist_df):
    """Finds the users most similar to user_id
    
    :param user_id str: User id for which we want to find the closest users
    :param dist_df DataFrame: Distances DataFrame with a_user_id, another_user_id and eucl_distance columns
    :return numpy.array: Array of user ids sorted from closest to farthest
    """
    closest_users = dist_df[dist_df['a_user_id'] == user_id].sort_values(by='eucl_distance')['another_user_id']
    closest_neighbors = np.array(closest_users)
    
    return closest_neighbors


In [None]:
def make_recommendations_by_rating(user_id, dist_df, max_recs=10):
    """Returns an array of recommended courses for a user based on users ratings
    
    :param user_id str: User id for which we want to make the recommendations
    :param dist_df: Distances DataFrame
    :param max_recs int: Maximum number of recommendations
    :return numpy.array: Array of courses recommended based on user ratings
    """
    recs = np.array([])
    
    user_rated_courses = rated_courses(user_id)
    closest_neighbors = find_closest_neighbors(user_id, dist_df)
    
    for neighbor in closest_neighbors:
        neighbs_likes = courses_liked(neighbor, 0)
        
        #Obtain recommendations for each neighbor
        new_recs = np.setdiff1d(neighbs_likes, user_rated_courses, assume_unique=True)
        
        # Update recs with new recs
        recs = np.unique(np.concatenate([new_recs, recs], axis=0))
        
        # If we have enough recommendations exit the loop
        if len(recs) >= max_recs:
            break
            
    # The last merge may add more than max_recs items, so truncate
    return recs[:max_recs]
    

In [None]:
user_id = 'ff4c26ace3459427b7c06c493071d31a'

print('User courses:')
print(course_names(courses_liked(user_id)))

print('\nRecommendations:')
print(course_names(make_recommendations_by_rating(user_id, eucl_distances_df)))

## 6 Model based collaborative filtering
In this part I will use the leads user-item matrix. Since that matrix contains no NaN values, I can apply NumPy's Singular Value Decomposition (`np.linalg.svd`) to it directly.
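
As a quick illustration of what the decomposition returns, the sketch below factorizes a tiny binary user-item matrix and multiplies the factors back together. The matrix is invented, and `full_matrices=False` is used here only to keep the factor shapes compact; the cells below call `np.linalg.svd` with its defaults.

```python
import numpy as np

# Toy binary user-item matrix (3 users x 4 courses), invented for illustration.
matrix = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

# Compact factors: u is 3x3, s holds 3 singular values, vt is 3x4.
u, s, vt = np.linalg.svd(matrix, full_matrices=False)

# Multiplying the three factors back together recovers the original matrix.
reconstructed = u @ np.diag(s) @ vt
print(u.shape, s.shape, vt.shape)
print(np.allclose(reconstructed, matrix))
```

Each row of `u` describes a user and each column of `vt` describes a course in terms of latent features, with `s` weighting how much each feature contributes.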

### 6.1 Perform SVD

**Get the leads user-item matrix**

In [52]:
leads_user_item_matrix = pd.read_csv('../data/leads_matrix.csv').set_index('user_id')

In [None]:
u, s, vt = np.linalg.svd(leads_user_item_matrix)
s.shape, u.shape, vt.shape

In [None]:
def reduce_svd(k):
    # Keep only the first k latent features of each factor
    # (plain ndarrays are used because np.mat was removed in NumPy 2.0)
    U_reduced = u[:, :k]
    Vt_reduced = vt[:k, :]
    Sigma_reduced = np.diag(s[:k])

    return Sigma_reduced, U_reduced, Vt_reduced

def reduce_svd_2(k):
    # Split sqrt(Sigma) between both factors so that their product is the rank-k estimate
    Sigma_reduced, U_reduced, Vt_reduced = reduce_svd(k)
    Sigma_sqrt = np.sqrt(Sigma_reduced)

    return U_reduced @ Sigma_sqrt, Sigma_sqrt @ Vt_reduced

The following graph shows how the accuracy improves as we increase the number of latent features.
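
The effect of the number of latent features can be reproduced on a small scale. The sketch below keeps only the first `k` singular values of a random binary matrix and measures the fraction of cells recovered after rounding. This definition of accuracy and the helper `accuracy_with_k` are assumptions made for the example; the cells below normalize the error differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random binary user-item matrix (6 users x 5 courses), for illustration only.
matrix = (rng.random((6, 5)) < 0.4).astype(float)

u, s, vt = np.linalg.svd(matrix, full_matrices=False)

def accuracy_with_k(k):
    """Fraction of cells recovered exactly after rounding the rank-k estimate."""
    estimate = np.around(u[:, :k] @ np.diag(s[:k]) @ vt[:k, :])
    return float(np.mean(estimate == matrix))

# One accuracy value per number of latent features kept.
accuracies = [accuracy_with_k(k) for k in range(1, len(s) + 1)]
print(accuracies)
```

With all singular values kept the matrix is recovered exactly, so the last accuracy is 1.0. This is also why accuracy on the data the factors were fitted on says little about how well the model generalizes.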

In [None]:
num_latent_feats = np.arange(10,1500,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(leads_user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    

In [None]:
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/leads_df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

In [None]:
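def train_test_split(df, order_by, total_size=None, train_size=.8, test_size=.2):
    # Splits df by position: the first rows go to train and the last rows to test.
    # order_by is currently unused, so df is taken in the order it was loaded.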
    df_size = df.shape[0] if total_size is None else total_size
    train_size = int(np.floor(df_size * train_size))
    test_size = int(np.floor(df_size * test_size))
    
    #df = df.sort_values(order_by)
    train_set = df.head(train_size)
    #test_set = df.iloc[train_size:train_size + test_size]
    test_set = df.tail(test_size)
    
    return train_set, test_set

In [None]:
train_df, test_df = train_test_split(leads_df, 'created_on')

In [None]:
train_user_item_matrix = create_user_item_matrix(train_df, select_column='course_title', allow_nulls=False)
test_user_item_matrix = create_user_item_matrix(test_df, select_column='course_title', allow_nulls=False)

In [None]:
train_idx = train_user_item_matrix.index.values
train_courses = train_user_item_matrix.columns.values
test_idx = test_user_item_matrix.index.values
test_courses = test_user_item_matrix.columns.values

n_users_preds = len(np.intersect1d(test_idx, train_idx))
cold_start_users = len(test_idx) - n_users_preds

n_courses_preds = len(np.intersect1d(test_courses, train_courses))
cold_start_courses = len(test_courses) - n_courses_preds

print('Users we can make predictions for: {}\n' \
      'Users we cannot make predictions for: {}\n' \
      'Courses we can make predictions for: {}\n' \
      'Courses we cannot make predictions for: {}\n'.format(n_users_preds, cold_start_users, n_courses_preds, cold_start_courses))

In [None]:
def predictions(u, s, vt, k):
    '''
    INPUT:
    u - user feature matrix
    s - singular values (1-D array, as returned by np.linalg.svd)
    vt - item feature matrix
    k - number of latent features to keep
    
    OUTPUT:
    user_item_matrix - a predictions user-item matrix
    
    '''
    
    s_new = np.diag(s[:k])
    u_new = u[:, :k]
    vt_new = vt[:k, :]
    
    user_item_matrix = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    return user_item_matrix

In [None]:
u_train, s_train, vt_train = np.linalg.svd(train_user_item_matrix)
u_train.shape, s_train.shape, vt_train.shape

In [None]:
common_user_ids = train_user_item_matrix.index.isin(test_idx)
common_courses_ids = train_user_item_matrix.columns.isin(test_courses)

u_test = u_train[common_user_ids, :]
vt_test = vt_train[:, common_courses_ids]

u_test.shape, vt_test.shape

In [None]:
test_user_item_matrix = test_user_item_matrix.loc[np.intersect1d(test_idx, train_idx), np.intersect1d(test_courses, train_courses)]

In [None]:
num_latent_feats = np.arange(5,800,10)
sum_errs_train = []
sum_errs_test = []

for k in num_latent_feats:
    # restructure with k latent features
    user_train_preds = predictions(u_train, s_train, vt_train, k)
    user_test_preds = predictions(u_test, s_train, vt_test, k)
    
    # compute error for each prediction to actual value
    diffs_train = np.subtract(train_user_item_matrix, user_train_preds)
    diffs_test = np.subtract(test_user_item_matrix, user_test_preds)
    
    # total errors and keep track of them
    err_train = np.sum(np.sum(np.abs(diffs_train)))
    err_test = np.sum(np.sum(np.abs(diffs_test)))
    
    sum_errs_train.append(err_train)
    sum_errs_test.append(err_test)

In [None]:
plt.plot(num_latent_feats, 1 - (np.array(sum_errs_train)/(train_user_item_matrix.shape[0]*train_user_item_matrix.shape[1])), label='Train set');
plt.plot(num_latent_feats, 1 - (np.array(sum_errs_test)/(test_user_item_matrix.shape[0]*test_user_item_matrix.shape[1])), label='Test set');

plt.legend(loc='best')
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
plt.show();