### Data

Included at this URL (112 MB) are three CSV files containing data that may be of use. You don’t have to use all of it. Depending on how you choose to solve this problem, some of it may not be relevant.




In [1]:
import pandas as pd

tutorials_df = pd.read_csv('./data/tutorials.csv')
tags_df = pd.read_csv('./data/tags.csv')
sessions_df = pd.read_csv('./data/sessions.csv')

#### tutorials.csv - Tutorial page data, including tags. A single tutorial can have multiple tags. This file contains one row per unique (id, tag_id) pair. Columns:

- id: unique identifier for tutorial pages (this column appears as tutorial_id in the loaded file)
- title: the title of the page
- slug: the end of the tutorial’s URL. i.e. every tutorial can be found at https://www.digitalocean.com/community/tutorials/
- description: some text that describes the article
- created_at: UTC timestamp of when the tutorial was posted
- tag_id: the ID of a content tag that applies to the tutorial.


In [2]:
tutorials_df.head()

,tutorial_id,title,slug,description,created_at,tag_id,tutorial_length
0,134,How To Add Swap on CentOS 6,how-to-add-swap-on-centos-6,Linux swaps allow a system to harness more mem...,2012-08-17 20:44:53.000,11,4739
1,747,How to Setup FastCGI Caching with Nginx on you...,how-to-setup-fastcgi-caching-with-nginx-on-you...,Here's how to setup FastCGI caching with Nginx...,2013-10-29 16:45:52.000,14,9171
2,747,How to Setup FastCGI Caching with Nginx on you...,how-to-setup-fastcgi-caching-with-nginx-on-you...,Here's how to setup FastCGI caching with Nginx...,2013-10-29 16:45:52.000,26,9171
3,809,How To Use the Dokku One-Click DigitalOcean Im...,how-to-use-the-dokku-one-click-digitalocean-im...,Dokku is an exciting new way to deploy applica...,2013-11-14 22:03:06.000,26,11828
4,961,An Introduction to Linux I/O Redirection,an-introduction-to-linux-i-o-redirection,The redirection capabilities built into Linux ...,2014-01-23 19:30:14.000,5,12277
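
Because the file has one row per (tutorial, tag) pair, a tutorial with several tags spans several rows. As a small illustration (not part of the original analysis), the sketch below collapses three rows copied from the output above into one tag list per tutorial. The `rows` frame and the `tags_per_tutorial` name are made up for the example.

```python
import pandas as pd

# Three (tutorial_id, tag_id) pairs copied from the head() output above.
rows = pd.DataFrame({
    "tutorial_id": [134, 747, 747],
    "tag_id": [11, 14, 26],
})

# Collect every tag_id that belongs to the same tutorial into one list.
tags_per_tutorial = rows.groupby("tutorial_id")["tag_id"].apply(list).to_dict()
```

Here tutorial 747 ends up with two tags, which is why it occupies two rows in the raw file.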


In [3]:
len(tutorials_df['tag_id'].unique())

152

#### tags.csv - Content tag data. A tag is a short descriptor of a tutorial’s content. Examples: “Java”, “Databases”, “Machine Learning”. Columns:

- id: unique identifier for content tags.
- name: the name of the tag
- description: a description of the tag
- tag_type: one of “glossary” (An entry in the DigitalOcean community glossary), “tag” (a label identifying the subject of a tutorial’s content), or “distro” (a type of tag identifying a particular operating system distribution, e.g. CentOS 7)



In [4]:
tags_df.head()

,id,name,description,tag_type
0,108,Elasticsearch,Elasticsearch is an open-source full text sear...,tag
1,26,Email,"Tutorials, projects, and Q&A related to runnin...",tag
2,180,Databases,Databases are organized data structures that s...,tag
3,139,Solutions,Content highlighting the DO suite of products.,tag
4,131,Object Storage,Object Storage is a data storage architecture ...,tag


#### sessions.csv - A subset of session data. A session is a record of someone’s visiting a tutorial page. All sessions in this CSV are ‘significant’ in that none of them are extremely short, none are from bots, etc. Columns:

- user_id: unique identifier for visitors.
- tutorial_id: the ID of the tutorial the visitor saw
- session_start_at: UTC timestamp of when the session started
- session_end_at: UTC timestamp of when the session ended


In [5]:
sessions_df.head()

,user_id,tutorial_id,session_start_at,session_end_at
0,2ffcc4ad-1ccd-43f0-93b6-6a4fac5ec2a8,2636,2019-01-30 05:24:41.097000+00:00,2019-01-30 05:38:33.460636380+00:00
1,c88743c5-885f-42fe-a192-bd7b63d4c9e9,2332,2019-03-08 10:23:36.165000+00:00,2019-03-08 10:34:49.910454540+00:00
2,e9b96517-d8cd-4edf-8f53-996882506fca,2109,2019-02-13 21:45:14.020000+00:00,2019-02-13 21:55:39.329090920+00:00
3,6de1dcfa-f2ae-49bc-86dc-1f3a6980f70b,114,2019-02-22 06:55:41.806000+00:00,2019-02-22 07:05:37.115090920+00:00
4,d04996f4-9e4f-4536-bf8e-eef5bd9366cb,2641,2019-03-06 15:53:24.830000+00:00,2019-03-06 16:07:41.193636380+00:00


### Solution:

To solve this problem, we can take one of two approaches:

1. Item-item collaborative filtering
2. User-user collaborative filtering

I use an **item-to-item** approach to build the recommendation system. Note that item similarity here is computed from the tags two tutorials share rather than from user sessions, so it is closer to content-based filtering than to classic collaborative filtering; the session data is used only for evaluation. My solution involves the following steps:

1. Build what I call the **matrix factorization**: a function that takes tutorials_df and tags_df and returns a matrix in which each row is a tutorial, each column is a tag, and each value is 1 if the tag applies to the tutorial and 0 otherwise. (Strictly speaking this is a tutorial-by-tag indicator matrix rather than a factorization; the name is kept to match the code.)

2. Write a function that takes a tutorial_id and returns the 5 closest tutorials, using pairwise Hamming distances computed on the rows of that matrix. **Note that some tutorials have no tags; for these I return a random list of 5 recommendations.**

3. **Model Evaluation**: evaluate the recommender both qualitatively and quantitatively. For the qualitative evaluation, I pick a tutorial and check whether its recommended tutorials make sense. For the quantitative evaluation, I compute the percentage of users who viewed a tutorial and also viewed one of its recommended tutorials.

4. Finally, return a JSON object with the recommendations for each tutorial id.


- **Note:**  I took roughly 3-4 hours to work on this assignment, and used Python, pandas, numpy and sklearn to do my analysis.
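
To make step 2 concrete, here is a minimal sketch of the Hamming distance between binary tag vectors, taken as the fraction of positions in which two vectors differ. The helper `hamming_distance` and the three toy vectors are made up for illustration and are not part of the original notebook.

```python
def hamming_distance(u, v):
    """Fraction of positions where two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / float(len(u))

# Hypothetical tag vectors over four tags: [nginx, centos, ruby, vpn]
tutorial_a = [1, 1, 1, 0]
tutorial_b = [1, 1, 0, 0]
tutorial_c = [0, 0, 0, 1]

dist_ab = hamming_distance(tutorial_a, tutorial_b)  # differ in 1 of 4 positions
dist_ac = hamming_distance(tutorial_a, tutorial_c)  # differ in all 4 positions
```

Tutorials that share most of their tags get a small distance, so `tutorial_b` would rank ahead of `tutorial_c` as a recommendation for `tutorial_a`.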

### 1. Matrix Factorization

In [6]:
# Build the tutorial-by-tag indicator matrix (called "matrix factorization" throughout).

def create_matrix_factorization(tutorials_df, tags_df):
    '''
    Take tutorials_df and tags_df and return a DataFrame in which each row
    is a tutorial, each column is a tag, and each value is 1 if the tag
    applies to the tutorial and 0 otherwise.
    '''
    # Keep only (tutorial, tag) pairs whose tag_id also appears in tags_df.
    tutorials_tags_df = pd.merge(tutorials_df, tags_df, how='inner', left_on='tag_id', right_on='id')
    tutorials_tags_df = tutorials_tags_df[['tutorial_id', 'tag_id']]
    tutorials_tags_df = tutorials_tags_df.sort_values('tutorial_id')
    # One-hot encode tag_id, then sum per tutorial to get one row per tutorial.
    combined_df = pd.get_dummies(tutorials_tags_df['tag_id'])
    combined_df['tutorial_id'] = tutorials_tags_df['tutorial_id']
    matrix_factorization_df = combined_df.groupby(['tutorial_id']).sum()

    return matrix_factorization_df

In [7]:
matrix_factorization_df = create_matrix_factorization(tutorials_df=tutorials_df, tags_df=tags_df)
matrix_factorization_df.head()

tutorial_id,1,2,3,4,5,6,7,8,9,10,...,155,156,161,165,166,167,168,173,178,180
4,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,1,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
matrix_factorization_df.shape

(2457, 106)

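#### Here, each row is a tutorial, each column is a tag, and each value is 1 if the tag applies to the tutorial and 0 otherwise. There are 2457 distinct tutorials and 106 distinct tags. The 106 is lower than the 152 distinct tag_id values counted in tutorials.csv above because the inner join keeps only tags that also appear in tags.csv.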
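
As an aside, the same kind of 0/1 matrix can be sketched with `pd.crosstab`. This is an alternative construction shown on made-up toy pairs, not the method used in this notebook; `pairs` and `indicator` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the merged tutorial/tag pairs (values are made up).
pairs = pd.DataFrame({
    "tutorial_id": [4, 4, 5, 6, 6],
    "tag_id": [3, 6, 6, 2, 5],
})

# crosstab counts each (tutorial_id, tag_id) pair; clipping at 1 keeps the
# result a 0/1 indicator even if a pair were to appear more than once.
indicator = pd.crosstab(pairs["tutorial_id"], pairs["tag_id"]).clip(upper=1)
```

The rows of `indicator` come out ordered by tutorial_id and the columns by tag_id, which matches the layout of the matrix shown above.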

### 2.  Write a recommender system based on matrix factorization

In [9]:
try:
    from sklearn.metrics import DistanceMetric  # location in newer scikit-learn releases
except ImportError:
    from sklearn.neighbors import DistanceMetric  # location in older scikit-learn releases
import random

# Find all pairwise distances between rows of the tutorial-by-tag matrix
dist = DistanceMetric.get_metric('hamming')
matrix_factorization_df = create_matrix_factorization(tutorials_df, tags_df)
pairwise_distances = dist.pairwise(matrix_factorization_df)

def recommend_tutorials(tutorial_id, tutorials_df, tags_df, pairwise_distances):
    '''
    Take in a tutorial_id, and return the 5 closest tutorials from the pairwise distances
    generated from the matrix factorization based on the hamming distance.
    '''
    tutorials_tags_df = pd.merge(tutorials_df, tags_df, how='inner', left_on='tag_id', right_on='id').sort_values('tutorial_id')
    # Sort the ids so that position i matches row i of pairwise_distances
    # (the groupby in create_matrix_factorization orders rows by tutorial_id).
    tutorial_ids = sorted(set(tutorials_tags_df['tutorial_id']))
    try:
        tutorial_idx = tutorial_ids.index(tutorial_id)
        row = pairwise_distances[tutorial_idx]
        # Find the 5 nearest neighbors by hamming distance. Position 0 of the sorted
        # order is assumed to be the tutorial itself (distance 0) and is skipped; the
        # trailing [::-1] lists the 5 from farthest to closest.
        five_nearest_neighbors_indices = list(row.argsort()[1:6][::-1])
        five_nearest_neighbors = [tutorial_ids[i] for i in five_nearest_neighbors_indices]
    except ValueError:
        # list.index raises ValueError when the tutorial has no tags (it is absent
        # from the merged frame). I am handling this by returning 5 random tutorials.
        five_nearest_neighbors = random.sample(tutorial_ids, 5)
    
    return five_nearest_neighbors

In [10]:
recommend_tutorials(tutorial_id=6, tutorials_df=tutorials_df, tags_df=tags_df, pairwise_distances=pairwise_distances)

[994, 1281, 2150, 1378, 1784]
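
The `argsort()[1:6]` slice in `recommend_tutorials` relies on the queried tutorial landing at position 0 of the sorted order. When another tutorial has an identical tag set (distance 0), that is not guaranteed. Below is a minimal sketch of one alternative, which drops the query index explicitly and breaks ties by position. The helper `nearest_neighbors` and the toy distance row are hypothetical and not part of the original notebook.

```python
def nearest_neighbors(distance_row, query_idx, k=5):
    """Return the indices of the k smallest distances, excluding query_idx."""
    candidates = [i for i in range(len(distance_row)) if i != query_idx]
    # sorted() is stable, so equal distances keep their original index order.
    return sorted(candidates, key=lambda i: distance_row[i])[:k]

# Toy distance row for item 1; item 0 has identical tags, so its distance is also 0.
toy_row = [0.0, 0.0, 0.5, 0.25]
neighbors = nearest_neighbors(toy_row, query_idx=1, k=2)
```

This variant returns neighbors from closest to farthest, whereas the function above lists them from farthest to closest.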

### 3. Model Validation/ Evaluation
#### 1. Qualitative (Based on inspection)

In [11]:
# Let's examine how well the recommender works. Choosing a target tutorial:
target_tutorial_id = 4
tutorials_tags_df = pd.merge(tutorials_df, tags_df, how='inner', left_on='tag_id', right_on='id')
tutorials_tags_df[tutorials_tags_df['tutorial_id'].isin([target_tutorial_id])]

,tutorial_id,title,slug,description_x,created_at,tag_id,tutorial_length,id,name,description_y,tag_type
3298,4,How To Install nginx on CentOS 6 with yum,how-to-install-nginx-on-centos-6-with-yum,"This articles covers how to install nginx, a w...",2012-05-22 00:16:33.000,6,2042,6,CentOS,,distro
3975,4,How To Install nginx on CentOS 6 with yum,how-to-install-nginx-on-centos-6-with-yum,"This articles covers how to install nginx, a w...",2012-05-22 00:16:33.000,3,2042,3,Nginx,Nginx is one of the most popular web servers i...,tag
6143,4,How To Install nginx on CentOS 6 with yum,how-to-install-nginx-on-centos-6-with-yum,"This articles covers how to install nginx, a w...",2012-05-22 00:16:33.000,50,2042,50,Ruby,"Ruby is a dynamic, reflective, object-oriented...",tag


In [12]:
recommended_tutorial_ids = recommend_tutorials(tutorial_id=target_tutorial_id, tutorials_df=tutorials_df, 
                                               tags_df=tags_df, pairwise_distances=pairwise_distances)
print('Recommended tutorial ids are:  {}'.format(recommended_tutorial_ids))
tutorials_tags_df[tutorials_tags_df['tutorial_id'].isin(recommended_tutorial_ids)]

Recommended tutorial ids are:  [1661, 1611, 1264, 1365, 1590]


,tutorial_id,title,slug,description_x,created_at,tag_id,tutorial_length,id,name,description_y,tag_type
3303,1264,How To Install Nginx on CentOS 7,how-to-install-nginx-on-centos-7,"This articles covers how to install Nginx, a w...",2014-07-21 20:08:11.291,6,4389,6,CentOS,,distro
3334,1611,Como Configurar sua Própria VPN com PPTP,como-configurar-sua-propria-vpn-com-pptp-pt,Uma das questões mais comuns entre nossos usuá...,2015-05-26 20:33:39.094,6,6108,6,CentOS,,distro
3381,1365,How To Set Up Nginx Server Blocks on CentOS 7,how-to-set-up-nginx-server-blocks-on-centos-7,Nginx uses to manage configurations for an ind...,2014-11-04 15:08:10.126,6,14936,6,CentOS,,distro
3577,1590,How To Redirect www to Non-www with Nginx on C...,how-to-redirect-www-to-non-www-with-nginx-on-c...,This tutorial will show you how to redirect a ...,2015-05-04 20:08:40.714,6,7233,6,CentOS,,distro
3980,1264,How To Install Nginx on CentOS 7,how-to-install-nginx-on-centos-7,"This articles covers how to install Nginx, a w...",2014-07-21 20:08:11.291,3,4389,3,Nginx,Nginx is one of the most popular web servers i...,tag
4026,1611,Como Configurar sua Própria VPN com PPTP,como-configurar-sua-propria-vpn-com-pptp-pt,Uma das questões mais comuns entre nossos usuá...,2015-05-26 20:33:39.094,3,6108,3,Nginx,Nginx is one of the most popular web servers i...,tag
4038,1661,How To Upgrade Nginx In-Place Without Dropping...,how-to-upgrade-nginx-in-place-without-dropping...,Nginx is a powerful web server and reverse pro...,2015-06-15 19:16:23.511,3,11524,3,Nginx,Nginx is one of the most popular web servers i...,tag
4073,1365,How To Set Up Nginx Server Blocks on CentOS 7,how-to-set-up-nginx-server-blocks-on-centos-7,Nginx uses to manage configurations for an ind...,2014-11-04 15:08:10.126,3,14936,3,Nginx,Nginx is one of the most popular web servers i...,tag
4286,1590,How To Redirect www to Non-www with Nginx on C...,how-to-redirect-www-to-non-www-with-nginx-on-c...,This tutorial will show you how to redirect a ...,2015-05-04 20:08:40.714,3,7233,3,Nginx,Nginx is one of the most popular web servers i...,tag
5174,1611,Como Configurar sua Própria VPN com PPTP,como-configurar-sua-propria-vpn-com-pptp-pt,Uma das questões mais comuns entre nossos usuá...,2015-05-26 20:33:39.094,62,6108,62,VPN,"A VPN, or virtual private network, is a way to...",tag


#### 2. Quantitative
For each tutorial, let's look at the percentage of its visitors who also visited at least one of its 5 recommended tutorials, and then average that percentage over all tutorials with at least one session.

In [13]:
tutorials_tags_df = pd.merge(tutorials_df, tags_df, how='inner', left_on='tag_id', right_on='id').sort_values('tutorial_id')
tutorial_ids = list(set(tutorials_tags_df['tutorial_id']))

In [14]:
user_overlap_percentages = []
for tutorial_id in tutorial_ids:
    
    target_tutorial_id = tutorial_id
    recommended_tutorial_ids = recommend_tutorials(tutorial_id=target_tutorial_id, tutorials_df=tutorials_df, 
                                               tags_df=tags_df, pairwise_distances=pairwise_distances)
    
    target_tutorial_user_ids = list(sessions_df[sessions_df['tutorial_id']==target_tutorial_id]['user_id'].unique())
    if len(target_tutorial_user_ids) >0:
        recommended_tutorial_user_ids = list(sessions_df[sessions_df['tutorial_id'].isin(recommended_tutorial_ids)]['user_id'].unique())

        user_overlap_percentages.append(len(set(target_tutorial_user_ids).intersection(recommended_tutorial_user_ids)) / \
        float(len(target_tutorial_user_ids)) * 100)
   

In [15]:
import numpy as np
print('Percentage of users who looked at a tutorial that also looked at one of the recommended tutorials:  {} %'.format(
      np.mean(user_overlap_percentages)))

Percentage of users who looked at a tutorial that also looked at one of the recommended tutorials:  14.229746608517441 %


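#### So, averaged over tutorials, 14.2% of a tutorial's visitors (roughly 1 in 7) also looked at one of its recommended tutorials. This is encouraging, although it would need to be compared against a baseline such as random recommendations to judge how much the tag-based ranking contributes.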
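
One way to put a number like this in context is to compute the same metric for random recommendations. The sketch below does that on a tiny made-up session log. The function `mean_overlap_percentage`, the toy `sessions` list and both recommendation dicts are hypothetical, and nothing here was run on the real data.

```python
import random

def mean_overlap_percentage(sessions, recommendations):
    """Average, over tutorials, of the % of visitors who also visited a recommended tutorial.

    sessions: iterable of (user_id, tutorial_id) pairs.
    recommendations: dict mapping tutorial_id -> list of recommended tutorial ids.
    """
    users_by_tutorial = {}
    for user_id, tutorial_id in sessions:
        users_by_tutorial.setdefault(tutorial_id, set()).add(user_id)

    percentages = []
    for tutorial_id, recommended in recommendations.items():
        viewers = users_by_tutorial.get(tutorial_id, set())
        if not viewers:
            continue  # skip tutorials nobody visited, as in the loop above
        recommended_viewers = set()
        for rec_id in recommended:
            recommended_viewers |= users_by_tutorial.get(rec_id, set())
        percentages.append(100.0 * len(viewers & recommended_viewers) / len(viewers))
    return sum(percentages) / len(percentages)

# Toy session log: (user_id, tutorial_id)
sessions = [("u1", 1), ("u1", 2), ("u2", 1), ("u2", 3), ("u3", 2), ("u4", 3)]
tag_based = {1: [2], 2: [1], 3: [1]}  # stand-in for the recommender's output

# Random baseline: one random other tutorial per tutorial, seeded for repeatability.
rng = random.Random(0)
all_ids = sorted(tag_based)
random_baseline = {t: rng.sample([i for i in all_ids if i != t], 1) for t in all_ids}

model_score = mean_overlap_percentage(sessions, tag_based)
baseline_score = mean_overlap_percentage(sessions, random_baseline)
```

On the real data, a `model_score` clearly above `baseline_score` would be evidence that the tag-based ranking adds something beyond chance.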

### 4. Return json object with recommendations for each tutorial_id

In [16]:
import json

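def return_recommendations(tutorials_df, tags_df):
    '''
    Return a JSON string mapping every tutorial_id in tutorials_df to its
    5 recommended tutorial ids. Relies on the module-level pairwise_distances.
    '''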
    result = {}
    tutorial_ids = list(set(tutorials_df['tutorial_id']))
    for target_tutorial_id in tutorial_ids:
        recommended_tutorial_ids = recommend_tutorials(tutorial_id=target_tutorial_id, tutorials_df=tutorials_df, 
                                               tags_df=tags_df, pairwise_distances=pairwise_distances)
        result[target_tutorial_id] = recommended_tutorial_ids
    
    result_json = json.dumps(result)
    return result_json

In [18]:
recommendations = return_recommendations(tutorials_df, tags_df)
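
One detail to keep in mind when consuming this output: JSON object keys are always strings, so the integer tutorial ids come back as strings after decoding. A minimal sketch of the round trip, using the ids from the qualitative example above; `example_recs`, `decoded` and `restored` are illustrative names only.

```python
import json

# Recommendations for tutorial 4, copied from the qualitative example above.
example_recs = {4: [1661, 1611, 1264, 1365, 1590]}

encoded = json.dumps(example_recs)  # integer keys are written as strings
decoded = json.loads(encoded)       # so they come back as strings

# Convert the keys back to integers before looking up by tutorial_id.
restored = {int(key): value for key, value in decoded.items()}
```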
