# Course recommendation system
# IV. Make recommendations

This is the third part of the Udacity Data Science Nanodegree capstone project, which consists of building a course recommendation system.

After the exploratory data analysis, it is time to use the structures created in the first part to make recommendations.

I will consider four types of recommendations, although the last one is struck through because it is not implemented in this part:

* Knowledge based recommendations
* Content based filtering
* Neighborhood based collaborative filtering
* ~~Model based collaborative filtering~~

## 1 Import libraries

In [76]:
import numpy as np
import pandas as pd
import pickle

from db_utils import connection
from common import requested_courses

## 2 Retrieve data

In this step I will retrieve the clean data from the database.

**Retrieve courses**

In [2]:
courses_query = '''SELECT c.*, cat.name AS category_name FROM courses c 
                    JOIN categories cat ON c.category_id = cat.id'''

courses_df = pd.read_sql_query(courses_query, con=connection())

**Retrieve leads**

In [3]:
leads_query = 'SELECT * FROM clean_leads ORDER BY created_on DESC'

leads_df = pd.read_sql_query(leads_query, con=connection())

**Retrieve reviews**

In [4]:
reviews_query = 'SELECT * FROM clean_reviews ORDER BY created_on DESC'

reviews_df = pd.read_sql_query(reviews_query, con=connection())
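
The `connection()` helper comes from `db_utils` and is not shown here, so the following is only a minimal sketch of how the courses query behaves. It assumes an in-memory SQLite database with two tiny hypothetical tables (`courses` and `categories`) and invented rows; the real schema has more columns.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for db_utils.connection(): an in-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses (id TEXT PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Engineering'), (2, 'Beauty Therapy');
    INSERT INTO courses VALUES
        ('c1', 'HNC Civil Engineering', 1),
        ('c2', 'LED Light Therapy Short Course', 2);
''')

# Same JOIN as the courses query above, with an ORDER BY added for a stable result
toy_query = '''SELECT c.*, cat.name AS category_name FROM courses c
               JOIN categories cat ON c.category_id = cat.id
               ORDER BY c.id'''

toy_courses_df = pd.read_sql_query(toy_query, con=conn)
conn.close()

print(toy_courses_df)
```

The join adds a `category_name` column to every course row, which is what the category filters in the next section rely on.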

## 3 Knowledge based recommendations
In this type of recommendation, I will use two measures: the number of leads a course has generated (most requested courses) and the users' ratings (most valued courses). Both rankings can optionally be filtered by category.
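
The "most valued" ranking in section 3.2 sorts on a `weighted_rating` column that was prepared in an earlier part, and its exact formula is not shown here. As an illustration only, the sketch below assumes a Bayesian (IMDb-style) weighted average, a common way to keep courses with very few reviews from dominating the ranking. The column names `avg_rating` and `n_reviews`, the `min_reviews` parameter and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy data; avg_rating and n_reviews are hypothetical column names
toy = pd.DataFrame({
    'title': ['Course A', 'Course B', 'Course C'],
    'avg_rating': [5.0, 4.5, 4.0],
    'n_reviews': [1, 40, 200],
})


def bayesian_weighted_rating(df, min_reviews=10):
    """Shrinks each course's average rating towards the global mean.

    Assumed formula (not necessarily the one used in this project):
    WR = (v * R + m * C) / (v + m), with v = number of reviews,
    R = course average, m = min_reviews, C = global mean rating.
    """
    v = df['n_reviews']
    global_mean = (df['avg_rating'] * v).sum() / v.sum()
    return (v * df['avg_rating'] + min_reviews * global_mean) / (v + min_reviews)


toy['weighted_rating'] = bayesian_weighted_rating(toy)
ranked = toy.sort_values('weighted_rating', ascending=False)['title'].tolist()
print(ranked)
```

With these numbers, "Course A" has a perfect average from a single review, yet it ranks below "Course B", whose slightly lower average is backed by 40 reviews.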

### 3.1 Most requested courses

In [7]:
def get_top_courses_by_leads(df, n=10, category=None):
    """ 
    Creates an array of n courses ordered by number of leads generated
    
    :param df DataFrame: Courses dataframe with title, number_of_leads and category_name columns
    :param n int: Number of courses in the array
    :param category str: If category is supplied, the array will be of courses belonging to that category
    
    :return numpy.ndarray: Array of top n courses by leads generated
    """ 
    if category:
        df = df[df['category_name'] == category]
        
    top_courses = df.sort_values('number_of_leads', ascending=False)['title'].head(n)
    
    return top_courses.values

In [18]:
print('Most requested courses:\n')
print('\n'.join(get_top_courses_by_leads(courses_df)))

print('\nMost requested Engineering courses:\n')
print('\n'.join(get_top_courses_by_leads(courses_df, category='Engineering')))

Most requested courses:

Aviation Engineering - BEng (Hons)
Bachelor in Aviation Management
MSc Public Health
Master of computer science
MBA - Engineering Management
BA in Hospitality Management
MBA - Big Data Management
Master of Leadership in Development Finance - Online
MBA - Marketing
NVQ Tiling courses - Free - Funded by Government

Most requested Engineering courses:

Aviation Engineering - BEng (Hons)
Master of computer science
MBA - Engineering Management
Master Artificial intelligence
Master in Engineering Management
BTEC HND Civil Engineering
HNC Civil Engineering
B.TECH CIVIL ENGINEERING
Level 5 Diploma in Civil Engineering
Master of Bioscience Engineering: Human Health Engineering (Leuven)


### 3.2 Most valued courses

In [19]:
def get_top_courses_by_rating(df, n=10, category=None):
    """ 
    Creates an array of n courses ordered by rating
    
    :param df DataFrame: Courses dataframe with title, weighted_rating and category_name columns
    :param n int: Number of courses in the array
    :param category str: If category is supplied, the array will be of courses belonging to that category
    :return numpy.ndarray: Array of top n courses by rating
    """
    if category:
        df = df[df['category_name'] == category]
    
    top_courses = df.sort_values(by='weighted_rating', ascending=False)
    
    return top_courses['title'].values[:n]

In [20]:
print('Most valued courses:\n')
print('\n'.join(get_top_courses_by_rating(courses_df)))

print('\nMost valued courses on Beauty Therapy:\n')
print('\n'.join(get_top_courses_by_rating(courses_df, category='Beauty Therapy')))

Most valued courses:

Aerospace Engineering MEng (Hons) 4 years
Excel Intermediate Course
Elite personal training diploma Level 2 & 3
MS Project: An Introduction, One-to-one, Classroom based
Workplace Wellbeing Champion -Initial Training
IPAF Scissor and Boom Training
Corporate English Training
Safety Harness Awareness and Inspection
PASMA Mobile Tower Scaffold 
NCUK International Foundation Year in Business

Most valued courses on Beauty Therapy:

Cryotherapy Induced Lipolysis Short Course
Ultrasound for Skin Rejuvenation Short Course
High Intensity Focused Ultrasound (HIFU) for Face and Neck Short Course
Ultrasonic Lipo-Cavitation Short Course
Russian Volume Eyelash Extensions
LED Light Therapy Short Course
Level 4 Radio Frequency for Face and Body Course
Radio Frequency for Face and Body Short Course
Hair Extensions Business Diploma Course
Beauty Therapist Diploma Course


## 4 Content based recommendations

Using the course similarities built in part one, I will make recommendations based on course titles and descriptions.

### 4.1 Retrieve similarities

In [34]:
courses_sim_query = '''SELECT * FROM courses_similarities'''

courses_sims = pd.read_sql_query(courses_sim_query, con=connection())

In [40]:
def similar_courses(course_id, df):
    """ 
    Returns the ids of the courses similar to course_id, most similar first
    """
    courses = df[df['a_course_id'] == course_id].sort_values('similarity', ascending=False)
    
    return courses['another_course_id'].values

In [48]:
course_id = '170624724'

# course_names is defined in section 5 (the cells were not run in document order)
course_name = course_names([course_id])[0]

print('\nSimilar courses to "{}":\n'.format(course_name))
print('\n'.join(course_names(similar_courses(course_id, courses_sims))))


Similar courses to "Child Care Diploma Level 3":

Working in Child Care
Part Time Level 2 Diploma in Child Care and Education
CACHE Level 3 Award/Certificate/Diploma in Childcare and Education
Child Care Course - Level 3 - Accredited
Child Playwork Course - Level 3 - Accredited
Level 3 Diploma in Child Care - CPD Certified & IAO Approved
Level 3 Diploma in Child Care - Best Seller
Professional Diploma in Child Psychology - CPD Certified
Child Psychology and Child Care Diploma


All of these courses deal with child care, so the content-based similarities look sensible.
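
To make the lookup concrete, here is a minimal sketch on toy data. It assumes the similarity table has one row per ordered pair of courses, with the `a_course_id`, `another_course_id` and `similarity` columns used above. The ids, titles and scores are invented, and `toy_similar_courses` and `titles_in_order` are hypothetical helpers, not part of the project.

```python
import pandas as pd

# Toy long-format similarity table (invented values)
toy_sims = pd.DataFrame({
    'a_course_id': ['c1', 'c1', 'c1', 'c2'],
    'another_course_id': ['c2', 'c3', 'c4', 'c1'],
    'similarity': [0.2, 0.9, 0.5, 0.2],
})

toy_courses = pd.DataFrame({
    'id': ['c1', 'c2', 'c3', 'c4'],
    'title': ['Child Care Diploma', 'Excel Basics',
              'Working in Child Care', 'Child Psychology'],
})


def toy_similar_courses(course_id, sims, n=10):
    """Ids of the n courses most similar to course_id, most similar first."""
    rows = sims[sims['a_course_id'] == course_id]
    ordered = rows.sort_values('similarity', ascending=False)
    return ordered['another_course_id'].head(n).tolist()


def titles_in_order(course_ids, courses):
    """Titles for course_ids, keeping the order of course_ids."""
    titles = courses.set_index('id')['title']
    # reindex follows the given id order; unknown ids become NaN and are dropped
    return titles.reindex(course_ids).dropna().tolist()


similar_ids = toy_similar_courses('c1', toy_sims)
similar_titles = titles_in_order(similar_ids, toy_courses)
print(similar_titles)
```

A filter based on `isin`, like the one in `course_names`, returns rows in the order of the courses dataframe, so the similarity ranking is lost when titles are printed. Looking titles up with `reindex`, as sketched here, is one way to keep the most similar course first.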

## 5 Neighborhood based collaborative filtering

### 5.1 Using leads data

For these recommendations, I will use the leads data. Given a user who has just generated a lead on a course, I will find similar users, that is, users who have generated leads on the same courses. Then I will collect the other courses on which those users have generated leads and recommend them to the original user, much like a cross-selling section.
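
The idea can be sketched on toy data using only the standard library. The sketch assumes that each user is represented by the set of courses they requested; for binary user-item vectors, the dot product of two users equals the number of courses they share, so set intersection gives the same similarity. The user ids, course ids and the `toy_recommendations` helper are hypothetical.

```python
# Hypothetical toy data: user id -> set of requested course ids
toy_leads = {
    'u1': {'photoshop', 'illustrator'},
    'u2': {'photoshop', 'illustrator', 'graphic_design'},
    'u3': {'photoshop', 'web_design'},
    'u4': {'accounting'},
}


def shared_course_count(courses_a, courses_b):
    """Dot product of two binary course vectors, computed on sets."""
    return len(courses_a & courses_b)


def toy_recommendations(user_id, leads, min_similarity=1, max_recs=10):
    """Courses requested by similar users that user_id has not requested."""
    own = leads[user_id]
    sims = {other: shared_course_count(own, courses)
            for other, courses in leads.items() if other != user_id}

    # Most similar neighbours first; ties are broken by user id for determinism
    neighbours = sorted((u for u, s in sims.items() if s >= min_similarity),
                        key=lambda u: (-sims[u], u))

    recs = []
    for neighbour in neighbours:
        for course in sorted(leads[neighbour] - own):
            if course not in recs:
                recs.append(course)

    return recs[:max_recs]


print(toy_recommendations('u1', toy_leads))
```

Here `u2` shares two courses with `u1` and `u3` shares one, so the course from `u2` is recommended first, while `u4` shares none and is ignored. Appending to a list keeps the neighbours' similarity order, whereas `np.unique` returns the ids sorted.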

In [28]:
def find_similar_users(user_id, sparse_user_item_dict, min_similarity=1):
    """ 
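    Creates an array of similar users based on leads generated on the same courses
    
    :param user_id str: User id for which we want to find similar users
    :param sparse_user_item_dict dict: Maps each user id to its sparse row of the leads user-item matrix
    :param min_similarity int: Minimum similarity (dot product of the two users' course vectors) to keep a user
                          
    :return numpy.array: Array of similar users sorted by similarity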
    """
    
    user_courses = np.array(sparse_user_item_dict[user_id].todense())[0]
    
    similarities = dict()
    
    for another_user_id, another_user_courses in sparse_user_item_dict.items():
        if user_id == another_user_id:
            continue
        
        similarity = np.dot(user_courses, np.array(another_user_courses.todense())[0])
        if similarity < min_similarity:
            continue        
                            
        similarities[another_user_id] = similarity
        

    sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    
    return np.array([other_id for (other_id, _) in sorted_similarities])

def leads_based_recommendations_for_user(user_id, max_recs=10):
    """ 
    Returns an array of recommended courses for a user based on generated leads
    
    :param user_id str: User id for which we want to make the recommendations
    :param max_recs int: Maximum number of recommendations
                          
    :return numpy.array: Array of courses recommended based on generated leads
    """
    
    user_courses = requested_courses(user_id, leads_df)
    similar_users = find_similar_users(user_id, user_leads_courses_map)

    recs = np.array([])

    for user in similar_users:
        neighbs_leads = requested_courses(user, leads_df)

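        # Keep only the neighbour's courses that the user has not requested yet
        new_recs = np.setdiff1d(neighbs_leads, user_courses, assume_unique=True)
        # np.unique removes duplicates but also sorts the ids, so similarity order is lost
        recs = np.unique(np.concatenate([new_recs, recs], axis=0))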

        if len(recs) > max_recs:
            break

    return recs[:max_recs]

def course_names(course_ids):
    # Titles come back in courses_df order, not in the order of course_ids
    return courses_df[courses_df['id'].isin(course_ids)]['title'].values

In [30]:
with open('../data/user_courses_map.pickle', 'rb') as pickle_file:
    user_leads_courses_map = pickle.load(pickle_file)

In [33]:
user_id = '1460318498c1f53bb880ce2e6d9ef64b'

print('User courses:')
print(course_names(requested_courses(user_id, leads_df)))

print('\nRecommended courses:')
print(course_names(leads_based_recommendations_for_user(user_id)))

User courses:
['Adobe Photoshop, Illustrator and Graphic Design Bundle Course']

Recommended courses:
['Enterprise Transformation Maturity Canvas'
 'ACCA-Accountancy Traineeship Program' 'HNC Graphic Design'
 'Graphics Design and Desktop Publishing'
 'Quality Management Systems (QMS) - Lead Auditor'
 'Internal Audit - OHSAS 18001 Occupational Health & Safety'
 'Advanced Strategic Management'
 'Adobe Graphic Design & Web Design Online Training Bundle'
 'Professional Diploma in Graphic Design - CPD Certified'
 'Auditing and Internal Control Skills']


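The recommender seems to work reasonably well: it makes sense that a user who has generated a lead on an Adobe Photoshop course is recommended courses on graphic design and other Adobe products. Some of the recommendations, such as the auditing and accountancy courses, are less obviously related.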