### Article Recommendation

#### Problem Statement

Serendipite is an article aggregation platform where users share articles from domains such as technology, politics, and news.
The platform then recommends articles to each user on the basis of their reading habits.

In Assignment 1, you used a simple popularity-based system with no personalization. The platform now wishes to explore the possibility
of bringing personalized article recommendations to its customer base.

Can you help them figure out what they can achieve with collaborative filtering by accurately predicting the rating for each user-article combination?

#### Data Description

- train.zip
- article_info.csv
- test.csv
- sample_submission.csv


#### Evaluation Metric

The evaluation metric for this problem is the root mean squared error (RMSE).
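
For reference, RMSE over $N$ held-out user-article pairs is conventionally defined as below. The symbols are notation chosen here for illustration, not taken from the assignment: $r_i$ is the observed rating and $\hat{r}_i$ the predicted rating for pair $i$.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2}
```

Lower values indicate predictions that are closer to the observed ratings.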

#### Read Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [2]:
#Reading training ratings file
train = pd.read_csv('Recommender systems with Python/Collaborative filtering assignment/Dataset/train.csv')

#Reading test file (user-article pairs to predict)
test = pd.read_csv('Recommender systems with Python/Collaborative filtering assignment/Dataset/test.csv')

#Reading article info file
article_info = pd.read_csv('Recommender systems with Python/Collaborative filtering assignment/Dataset/article_info.csv')

#Reading sample submission file
sample_submission = pd.read_csv('Recommender systems with Python/Collaborative filtering assignment/Dataset/sample_submission.csv')

In [3]:
train.head()

Unnamed: 0,user_id,article_id,rating
0,1,456,1
1,1,2934,1
2,1,82,1
3,1,1365,1
4,1,221,1


In [4]:
train.tail()

Unnamed: 0,user_id,article_id,rating
16726,1087,2242,1
16727,1087,419,1
16728,1087,784,1
16729,1087,1249,1
16730,1087,1692,1


In [5]:
test.head()

Unnamed: 0,user_id,article_id
0,1,2607
1,1,1445
2,1,911
3,1,857
4,1,2062


In [6]:
test.tail()

Unnamed: 0,user_id,article_id
7238,1087,2089
7239,1087,504
7240,1087,1801
7241,1087,967
7242,1087,857


In [7]:
article_info.head()

Unnamed: 0,article_id,website,title,content
0,1025,uxmovement,Comment concevoir une procédure pas à pas que ...,par anthony le 18/07/16 à 8h02 Si une nouvelle...
1,2328,endeavor,Ressources humaines? Seulement si vous optez p...,"«Ambassadeurs», «avocats», «porte-parole» d'un..."
2,2469,linkedin,Deux motions de vente différentes. . . .,J'ai passé pas mal de temps récemment avec des...
3,2590,googleblog,Apprentissage large et profond: mieux avec Ten...,"""Apprenez les règles comme un pro, afin de pou..."
4,697,infoq,Agile: manque de compétences en tests,"Fran O'Hara, directeur et consultant principal..."


In [8]:
article_info.tail()

Unnamed: 0,article_id,website,title,content
2524,224,techcrunch,Kite veut être le compagnon de programmation e...,La plupart des environnements de développement...
2525,856,issuu,GRI Magazine 4e édition,"GRI, Club, Magazine, résidentiel, industriel, ..."
2526,2817,linkedin,4 tendances macro de la blockchain: où placer ...,Publié le Simon Taylor Suivre Abonné Ne plus s...
2527,839,googleblog,Spotify choisit Google Cloud Platform pour ali...,Ce n'est pas tous les jours que vous déplacez ...
2528,722,cnet,Watson d'IBM vise à rendre les séjours à l'hôp...,Ben Hider / Getty Images Ce n'est pas la même ...


In [9]:
sample_submission.head()

Unnamed: 0,user_id,article_id,rating
0,1,2607,1
1,1,1445,1
2,1,911,1
3,1,857,1
4,1,2062,1


In [10]:
sample_submission.tail()

Unnamed: 0,user_id,article_id,rating
7238,1087,2089,1
7239,1087,504,1
7240,1087,1801,1
7241,1087,967,1
7242,1087,857,1


In [11]:
sample_submission.shape

(7243, 3)

#### Merging article and ratings information

In [13]:
#Attach each article's title to its ratings (a left join keeps every rating row)
train = train.merge(article_info[['article_id', 'title']], how='left', on='article_id')

In [14]:
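#Build a readable article label of the form "<article_id>: <title>"
train['article'] = train['article_id'].map(str) + ': ' + train['title'].map(str)
#Drop the columns that are now folded into the label
train = train.drop(['article_id', 'title'], axis=1)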

In [15]:
train.head()

Unnamed: 0,user_id,rating,article
0,1,1,"456: Obtenez 6 mois d'accès à Pluralsight, la ..."
1,1,1,2934: La plateforme cloud de Google est désorm...
2,1,1,82: La technologie derrière les photos d'aperçu
3,1,1,1365: Les VM préemptives de Google Cloud Platf...
4,1,1,221: Ray Kurzweil: Le monde ne se détériore pa...


In [16]:
# Assign X as original ratings dataframe
X = train.copy()

# Split into train and test datasets
X_train, X_test = train_test_split(X, test_size=0.20, random_state=23)

In [17]:
X_train.head()

Unnamed: 0,user_id,rating,article
3608,244,5,2388: Le patron est gay. Et?
5549,389,1,"215: Elopar lance ""Digio"", carte pour se battr..."
1445,89,1,1425: Le nouveau bourreau de travail travaille...
14673,943,1,531: 5 fonctionnalités impressionnantes de Goo...
9311,590,1,2536: Firebase et Google Cloud: mieux ensemble


In [18]:
X_test.head()

Unnamed: 0,user_id,rating,article
13598,855,1,1114: 20 bibliothèques PHP impressionnantes po...
5711,400,2,47: L'IoT au service de la relation médecin-pa...
8736,559,2,1148: Ce que vous ne saviez probablement pas p...
14351,913,1,1600: Quatre Gotchas Node.js que les équipes o...
8642,550,1,694: Jeff Insurance: ASSURANCES ET JEUX


#### Define a function to calculate RMSE

In [19]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

#### Calculate Baseline with average of all ratings

In [20]:
def baseline(user_id, article_id):
    return X_train['rating'].mean()

#### Function to calculate the RMSE score obtained on the test set by a model

In [21]:

def rmse_score(model):
    
    #Construct a list of user-article tuples from the test dataset
    id_pairs = zip(X_test['user_id'], X_test['article'])
    
    #Predict the rating for every user-article tuple
    y_pred = np.array([model(user_id, article) for (user_id, article) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    #Return the final RMSE score
    return rmse(y_true, y_pred)

#### RMSE score baseline

In [22]:
rmse_score(baseline)

0.9617629831002544

#### User based collaborative filtering

#### Using Simple user mean

In [23]:
p_matrix = X_train.pivot_table(values='rating', index='user_id', columns='article')

p_matrix.head()

article,1000: Vous demandez trop de chatbots. Laissez-les grandir,1003: DeepMind passe à TensorFlow,1004: La croissance de la consommation mondiale d'électricité n'est pas seulement la faute de Bitcoin,1005: La fureur des fans de Warcraft contre Blizzard suite à la fermeture du serveur - BBC News Afrique,1006: La grande bibliothèque,1007: Ciesp-Campinas et le Lean Institute Brasil organisent un événement gratuit sur le thème «Lean pour surmonter la crise» à Campinas,1008: Principales tendances des langages de programmation: l'essor du Big Data,1009: DARPA passe au «méta» avec l'apprentissage automatique pour l'apprentissage automatique,100: Les physiciens ont découvert ce qui rend les réseaux de neurones si extraordinairement puissants,"1010: Enfin, CSS en JavaScript! Rencontrez CSSX - Smashing Magazine",...,"990: Trois cultures, trois continents et trois leçons sur le leadership",991: Visa lance un défi pour les startups technologiques - Startupi,992: Conformité PCI et Drupal Commerce: quelle passerelle de paiement dois-je choisir?,994: Personnalisez votre expérience G Suite avec App Maker et les applications recommandées,"995: L'histoire intérieure de la façon dont Amazon a créé Echo, la prochaine entreprise d'un milliard de dollars que personne n'a vu venir",996: Android recommandera des applications en fonction de l'emplacement | Google Discovery,997: Comment finir des objets PLA imprimés en 3D,998: Les nouveaux fronts de travail d'Alelo vont des actions de fidélité aux nouveaux moyens de paiement,99: Le créateur d'Ubuntu dit que le système résout le problème de sécurité de l'Internet des objets,9: HHVM vs PHP 7 - La concurrence se rapproche - Kinsta
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,1.0,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,


In [24]:
#User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, article):
    
    #Check if article exists in p_matrix
    if article in p_matrix:
        
        #Compute the mean of all the ratings given to the article
        mean_rating = p_matrix[article].mean()
    
    else:
        #Default to average rating from the train set
        mean_rating = X_train['rating'].mean()
    
    return mean_rating

#### Calculate RMSE with Simple user mean by article

In [25]:
#Compute RMSE for the Mean model
rmse_score(cf_user_mean)

1.050302348263033

The RMSE rises from 0.9618 for the global-mean baseline to 1.0503, so averaging the other users' ratings for each article does not improve the predictions; it makes them worse.

#### Using Similarity weighted mean

Next, we use the Pearson correlation between users as the weight when predicting the unknown ratings, and check how this affects performance.

In [27]:
#Compute the Pearson Correlation using the ratings matrix with corr function from Pandas
pearson_corr = p_matrix.T.corr()

In [30]:

pearson_corr = pd.DataFrame(pearson_corr, index=p_matrix.index, columns=p_matrix.index)

pearson_corr.head(10)

user_id,1,2,3,5,7,8,9,10,11,12,...,1078,1079,1080,1081,1082,1083,1084,1085,1086,1087
user_id,,,,,,,,,,,,,,,,,,,,,
1,1.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,1.0,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,1.0,,,,...,,,,,,,,,,
10,,,,,,,,1.0,,,...,,,,,,,,,,
11,,,,,,,,,,,...,,,,,,,,,,
12,,,,,,,,,,1.0,...,,,,,,,,,,


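The view above contains many missing values. pandas returns NaN when a correlation cannot be computed, for example when two users have too few co-rated articles or when a user's ratings have zero variance. We replace all these missing values with 0, which treats such pairs of users as uncorrelated given the data provided to us.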
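
A toy example may make the source of these missing values clearer. The sketch below uses a hypothetical four-user ratings matrix (`toy`, with made-up user and article labels) and relies on pandas' default Pearson behaviour: a pair of users gets NaN when they share fewer than two co-rated articles, or when one user's ratings are constant.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are articles (NaN = not rated).
toy = pd.DataFrame(
    {
        "a1": [1.0, 2.0, 1.0, np.nan],
        "a2": [3.0, 4.0, 1.0, np.nan],
        "a3": [5.0, 5.0, 1.0, 2.0],
        "a4": [np.nan, np.nan, np.nan, 4.0],
    },
    index=pd.Index(["u1", "u2", "u3", "u4"], name="user_id"),
)

# Same call pattern as the notebook: correlate users over co-rated articles.
toy_corr = toy.T.corr()

# u1 and u2 share three articles with varying ratings: a usable similarity.
print(toy_corr.loc["u1", "u2"])
# u3 rated everything 1, so its variance is zero and its whole row is NaN.
print(toy_corr.loc["u3"].isna().all())
# u1 and u4 share only one article, which is too few for a correlation.
print(np.isnan(toy_corr.loc["u1", "u4"]))

# Filling with 0 treats all such pairs as uncorrelated.
toy_corr_filled = toy_corr.fillna(0)
```

In this toy case even the diagonal entry for `u3` is NaN, which would also explain why some diagonal entries in the output above are missing.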


In [31]:
#Fill all the missing correlations with 0 (assign back to pearson_corr so later cells use the filled matrix)
pearson_corr = pearson_corr.fillna(0)

Define a function that predicts the unknown ratings in the test set using user-based collaborative filtering, with the Pearson correlation as the similarity measure and all users with a positive correlation as neighbours.
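
Written out, the prediction this function is meant to compute can be sketched as follows. The notation is assumed here for illustration: $a$ is the active user, $p$ the target article, $\bar{r}_a$ the mean rating of user $a$, $w_{a,b}$ the Pearson correlation between users $a$ and $b$, and $N_p^{+}(a)$ the set of users with $w_{a,b} > 0$ who have rated $p$.

```latex
\hat{r}_{a,p} = \bar{r}_a + \frac{\sum_{b \in N_p^{+}(a)} w_{a,b}\,\left(r_{b,p} - \bar{r}_b\right)}{\sum_{b \in N_p^{+}(a)} w_{a,b}}
```

When $N_p^{+}(a)$ is empty, the function falls back to the article's mean rating, and when the article does not appear in the training set at all it falls back to the global training mean.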

In [32]:
#User Based Collaborative Filter using Weighted Mean Ratings
def cf_user_wmean(user_id, article):
    
    #Check if article exists in p_matrix
    if article in p_matrix:
        
        #Mean rating for active user
        ra = p_matrix.loc[user_id].mean()

        #Get the similarity scores for the user in question with every other user
        sim_scores = pearson_corr[user_id].sort_values(ascending = False)
        
        # Keep similarity scores for users with positive correlation with active user
        sim_scores_pos = sim_scores[sim_scores > 0]
        
        #Get the user ratings for the article in question
        m_ratings = p_matrix[article][sim_scores_pos.index]
        
        #Extract the indices containing NaN in the m_ratings series (Users who have not rated the target article)
        idx = m_ratings[m_ratings.isnull()].index
        
        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()
        
        # If there are no ratings from similar users we cannot use this method so we predict just 
        # the average rating of the article else we use the prediction formula
        if len(m_ratings) == 0:
            #Default to average rating in the absence of ratings by similar users
            wmean_rating = p_matrix[article].mean()
        else:   
            #Drop the corresponding correlation scores from the sim_scores series
            sim_scores_pos = sim_scores_pos.drop(idx)
            
            #Subtract average rating of each user from the rating (rbp - mean(rb))
            m_ratings = m_ratings - p_matrix.loc[m_ratings.index].mean(axis = 1)
            
            #Predict the active user's mean plus the similarity-weighted mean deviation:
            #np.dot gives the weighted sum of deviations, which is divided by the sum of the weights
            wmean_rating = ra + (np.dot(sim_scores_pos, m_ratings)/ sim_scores_pos.sum())
   
    else:
        #Default to average rating in the absence of any information on the article in train set
        wmean_rating = X_train['rating'].mean()
    
    return wmean_rating

In [33]:
rmse_score(cf_user_wmean)

1.1067526973854382
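
To see the arithmetic of the weighted mean on a small scale, the following sketch applies the same steps to a hypothetical toy matrix. The users, articles, ratings, and the helper name `predict_wmean` are all made up for illustration; it is a simplified restatement under those assumptions, not a replacement for `cf_user_wmean`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are articles (NaN = not rated).
toy = pd.DataFrame(
    {
        "a1": [1.0, 2.0, 3.0, 3.0],
        "a2": [2.0, 3.0, 4.0, 2.0],
        "a3": [3.0, 4.0, 5.0, 1.0],
        "a4": [np.nan, 5.0, 4.0, 1.0],
    },
    index=pd.Index(["u1", "u2", "u3", "u4"], name="user_id"),
)

# User-user Pearson correlations, with undefined entries treated as 0.
toy_corr = toy.T.corr().fillna(0)


def predict_wmean(user_id, article):
    """Illustrative helper: mean-centred, similarity-weighted prediction."""
    # Mean rating of the active user.
    ra = toy.loc[user_id].mean()
    # Keep only users that are positively correlated with the active user.
    sims = toy_corr[user_id]
    sims = sims[sims > 0]
    # Ratings those users gave to the target article (unrated entries dropped).
    ratings = toy.loc[sims.index, article].dropna()
    if ratings.empty:
        # No usable neighbour: fall back to the article's mean rating.
        return toy[article].mean()
    sims = sims[ratings.index]
    # Deviation of each neighbour's rating from that neighbour's own mean.
    deviations = ratings - toy.loc[ratings.index].mean(axis=1)
    return ra + np.dot(sims, deviations) / sims.sum()


prediction = predict_wmean("u1", "a4")
print(prediction)
```

Here `u2` and `u3` are both perfectly correlated with `u1` on the three articles they share, so their deviations (+1.5 and 0) are averaged with equal weight and added to `u1`'s mean of 2, giving roughly 2.75. `u4` is negatively correlated with `u1` and is ignored.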