# Recommender System with ClickStream Data

Deskdrop is an internal communications platform that allows company employees to share relevant articles with their peers and collaborate around them.

We will use Implicit, a fast Python library for collaborative filtering on implicit-feedback datasets, for our matrix factorization.



## Setting up Environment

In [4]:
!pip install implicit

Collecting implicit
  Using cached https://files.pythonhosted.org/packages/5a/d8/6b4f1374ffa2647b72ac76960c71b984c6f3238090359fb419d03827d87a/implicit-0.4.2.tar.gz
Building wheels for collected packages: implicit
  Building wheel for implicit (setup.py) ... done
  Created wheel for implicit: filename=implicit-0.4.2-cp36-cp36m-linux_x86_64.whl size=3471683 sha256=a8307a37242f570dd50acb619865ae2c6dfc6ec66cbea937c0e2432a09440c39
  Stored in directory: /root/.cache/pip/wheels/1b/48/b1/1aebe3acc3afb5589e72d3e7c3ffc3f637dc4721c1a974dff7
Successfully built implicit
Installing collected packages: implicit
Successfully installed implicit-0.4.2


In [0]:
import pandas as pd
import scipy.sparse as sparse
import numpy as np
import random
import implicit
from sklearn.preprocessing import MinMaxScaler

## Loading Data

In [7]:
# Install Kaggle library
#!pip install -q kaggle
!pip install kaggle --upgrade

from google.colab import files
#to import kaggle.json
uploaded = files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

Requirement already up-to-date: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.6)


Saving kaggle.json to kaggle.json


In [8]:
!kaggle datasets list -s DeskDrop

ref                                                    title                                             size  lastUpdated          downloadCount  
-----------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  
gspmoreira/articles-sharing-reading-from-cit-deskdrop  Articles sharing and reading from CI&T DeskDrop    8MB  2017-08-27 21:33:01           6021  
gspmoreira/news-portal-user-interactions-by-globocom   News Portal User Interactions by Globo.com       360MB  2019-04-16 00:25:04            639  


In [9]:
!kaggle datasets download -d gspmoreira/articles-sharing-reading-from-cit-deskdrop

Downloading articles-sharing-reading-from-cit-deskdrop.zip to /content
 61% 5.00M/8.20M [00:00<00:00, 28.8MB/s]
100% 8.20M/8.20M [00:00<00:00, 40.1MB/s]


In [10]:
!unzip articles-sharing-reading-from-cit-deskdrop.zip
articles_df = pd.read_csv('shared_articles.csv')
interactions_df = pd.read_csv('users_interactions.csv')


Archive:  articles-sharing-reading-from-cit-deskdrop.zip
  inflating: shared_articles.csv     
  inflating: users_interactions.csv  


## About Data

Article data

In [11]:
articles_df.shape

(3122, 13)

In [12]:
articles_df.head(3)

| | timestamp | eventType | contentId | authorPersonId | authorSessionId | authorUserAgent | authorRegion | authorCountry | contentType | url | title | text | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1459192779 | CONTENT REMOVED | -6451309518266745024 | 4340306774493623681 | 8940341205206233829 | | | | HTML | http://www.nytimes.com/2016/03/28/business/dea... | Ethereum, a Virtual Currency, Enables Transact... | All of this work is still very early. The firs... | en |
| 1 | 1459193988 | CONTENT SHARED | -4110354420726924665 | 4340306774493623681 | 8940341205206233829 | | | | HTML | http://www.nytimes.com/2016/03/28/business/dea... | Ethereum, a Virtual Currency, Enables Transact... | All of this work is still very early. The firs... | en |
| 2 | 1459194146 | CONTENT SHARED | -7292285110016212249 | 4340306774493623681 | 8940341205206233829 | | | | HTML | http://cointelegraph.com/news/bitcoin-future-w... | Bitcoin Future: When GBPcoin of Branson Wins O... | The alarm clock wakes me at 8:00 with stream o... | en |


Interactions Data

In [13]:
interactions_df.shape

(72312, 8)

In [14]:
interactions_df.head()

| | timestamp | eventType | contentId | personId | sessionId | userAgent | userRegion | userCountry |
|---|---|---|---|---|---|---|---|---|
| 0 | 1465413032 | VIEW | -3499919498720038879 | -8845298781299428018 | 1264196770339959068 | | | |
| 1 | 1465412560 | VIEW | 8890720798209849691 | -1032019229384696495 | 3621737643587579081 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... | NY | US |
| 2 | 1465416190 | VIEW | 310515487419366995 | -1130272294246983140 | 2631864456530402479 | | | |
| 3 | 1465413895 | FOLLOW | 310515487419366995 | 344280948527967603 | -3167637573980064150 | | | |
| 4 | 1465412290 | VIEW | -7820640624231356730 | -445337111692715325 | 5611481178424124714 | | | |


The data contains about 72k user interactions on more than 3k public articles shared on the platform. More importantly, it contains rich implicit feedback: several interaction types were logged, which makes it possible to infer a user’s level of interest in each article.

## Data Transformation

**1. Missing %**

Articles Data

In [15]:

percent_missing = articles_df.isnull().sum() * 100 / len(articles_df)
missing_values = pd.DataFrame({'column_name': articles_df.columns,
                               'percent_missing': percent_missing})
missing_values

| | column_name | percent_missing |
|---|---|---|
| timestamp | timestamp | 0.0 |
| eventType | eventType | 0.0 |
| contentId | contentId | 0.0 |
| authorPersonId | authorPersonId | 0.0 |
| authorSessionId | authorSessionId | 0.0 |
| authorUserAgent | authorUserAgent | 78.21909 |
| authorRegion | authorRegion | 78.21909 |
| authorCountry | authorCountry | 78.21909 |
| contentType | contentType | 0.0 |
| url | url | 0.0 |


Interactions Data

In [0]:

percent_missing = interactions_df.isnull().sum() * 100 / len(interactions_df)
missing_values = pd.DataFrame({'column_name': interactions_df.columns,
                               'percent_missing': percent_missing})
missing_values

| | column_name | percent_missing |
|---|---|---|
| timestamp | timestamp | 0.0 |
| eventType | eventType | 0.0 |
| contentId | contentId | 0.0 |
| personId | personId | 0.0 |
| sessionId | sessionId | 0.0 |
| userAgent | userAgent | 21.288306 |
| userRegion | userRegion | 21.303518 |
| userCountry | userCountry | 21.288306 |


**2. Remove columns that we do not need.**



In [0]:
articles_df.drop(['authorUserAgent', 'authorRegion', 'authorCountry'], axis=1, inplace=True)
interactions_df.drop(['userAgent', 'userRegion', 'userCountry'], axis=1, inplace=True)

**3. Remove eventType == 'CONTENT REMOVED' from articles_df.**

In [0]:
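#Keeping only CONTENT SHARED rows, which removes the CONTENT REMOVED events
#(.copy() avoids pandas' SettingWithCopyWarning on the in-place drop below)
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED'].copy()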

#Removing event_type column
articles_df.drop('eventType', axis=1, inplace=True)

#Merging data
df = pd.merge(interactions_df[['contentId','personId', 'eventType']], 
              articles_df[['contentId', 'title']], how = 'inner', on = 'contentId')

df.head()

| | contentId | personId | eventType | title |
|---|---|---|---|---|
| 0 | -3499919498720038879 | -8845298781299428018 | VIEW | Hiri wants to fix the workplace email problem |
| 1 | -3499919498720038879 | -8845298781299428018 | VIEW | Hiri wants to fix the workplace email problem |
| 2 | -3499919498720038879 | -108842214936804958 | VIEW | Hiri wants to fix the workplace email problem |
| 3 | -3499919498720038879 | -1443636648652872475 | VIEW | Hiri wants to fix the workplace email problem |
| 4 | -3499919498720038879 | -1443636648652872475 | VIEW | Hiri wants to fix the workplace email problem |


In [0]:
df.shape

(72269, 4)
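
The merged frame has slightly fewer rows than `interactions_df` (72,269 versus 72,312). One plausible reason, assuming the usual inner-join behaviour, is that interactions whose article is no longer present in the filtered `articles_df` are dropped. The sketch below illustrates this on invented toy frames (`toy_interactions` and `toy_articles` are hypothetical names, not part of the notebook) and uses the `indicator` option of `pd.merge` to list the orphaned rows.

```python
import pandas as pd

# Toy stand-ins for interactions_df and articles_df (all values invented).
toy_interactions = pd.DataFrame({
    "contentId": [10, 10, 20, 30],
    "personId": [1, 2, 1, 3],
    "eventType": ["VIEW", "LIKE", "VIEW", "VIEW"],
})
toy_articles = pd.DataFrame({
    "contentId": [10, 20],              # article 30 is missing on purpose
    "title": ["Article A", "Article B"],
})

# A left join with indicator=True adds a _merge column showing the row origin.
checked = pd.merge(toy_interactions, toy_articles, how="left",
                   on="contentId", indicator=True)
orphans = checked[checked["_merge"] == "left_only"]

# The inner join used in the notebook silently drops those orphaned rows.
inner = pd.merge(toy_interactions, toy_articles, how="inner", on="contentId")
print(len(toy_interactions), len(inner), len(orphans))
```

Running the same left join on the real frames would show whether the 43 missing rows are indeed orphans; that check is not part of the original notebook.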

**4. Creating Event Strength column**

In [0]:
df['eventType'].value_counts()

VIEW               61043
LIKE                5745
BOOKMARK            2463
COMMENT CREATED     1611
FOLLOW              1407
Name: eventType, dtype: int64

In [0]:
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 3.0, 
   'FOLLOW': 4.0,
   'COMMENT CREATED': 5.0,  
}
df['eventStrength'] = df['eventType'].apply(lambda x: event_type_strength[x])

In [0]:
df.head()

| | contentId | personId | eventType | title | eventStrength |
|---|---|---|---|---|---|
| 0 | -3499919498720038879 | -8845298781299428018 | VIEW | Hiri wants to fix the workplace email problem | 1.0 |
| 1 | -3499919498720038879 | -8845298781299428018 | VIEW | Hiri wants to fix the workplace email problem | 1.0 |
| 2 | -3499919498720038879 | -108842214936804958 | VIEW | Hiri wants to fix the workplace email problem | 1.0 |
| 3 | -3499919498720038879 | -1443636648652872475 | VIEW | Hiri wants to fix the workplace email problem | 1.0 |
| 4 | -3499919498720038879 | -1443636648652872475 | VIEW | Hiri wants to fix the workplace email problem | 1.0 |


**5. Drop duplicates**

In [0]:
df = df.drop_duplicates()

**6. Grouping eventStrength together with person and content.**

In [0]:
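#Summing only eventStrength; the text column eventType is left out of the aggregation
grouped_df = df.groupby(['personId', 'contentId', 'title'])['eventStrength'].sum().reset_index()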
grouped_df.sample(10)

| | personId | contentId | title | eventStrength |
|---|---|---|---|---|
| 6458 | -6120788038252469111 | 1738052593226421681 | Como resolver conflitos no ambiente corporativ... | 1.0 |
| 15378 | -1616903969205976623 | 8224860111193157980 | Psicóloga de Harvard diz que as pessoas julgam... | 4.0 |
| 34219 | 5640522320021444231 | 1021946628105173582 | Introducing the Google Analytics Demo Account ... | 1.0 |
| 33373 | 5127372011815639401 | -3581194288660477595 | The End Of Apple Man | 1.0 |
| 20552 | 22763587941636338 | -9055044275358686874 | Accept questions from your audience when prese... | 1.0 |
| 5312 | -6874532444478397764 | 897043351348716074 | Mark Zuckerberg compartilha como será o Facebo... | 1.0 |
| 36101 | 6852303461450629547 | -4336877432539963613 | Comparing IoT Platforms: Compare 4 IoT platfor... | 1.0 |
| 38162 | 7948079555216525045 | -2549933363319068481 | The real reasons you procrastinate - and how t... | 1.0 |
| 33005 | 4870191573331352123 | -1038011342017850 | Para entender o Dia Internacional contra a Hom... | 1.0 |
| 30779 | 3891637997717104548 | -5017021199558167721 | Dries Buytaert | 6.0 |



*Instead of representing an explicit rating, eventStrength represents a “confidence” in how strong the interaction was. Articles for which a person has accumulated a larger eventStrength carry more weight in our ratings matrix.*
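
To make the aggregation concrete, here is a minimal sketch on invented events (the `toy` frame is hypothetical). It follows the same steps as above: map event types to strengths, drop exact duplicates, then sum per person and article. Because duplicates are dropped first, three views of the same article count once, so under this pipeline a pair's score would be at most 1 + 2 + 3 + 4 + 5 = 15.

```python
import pandas as pd

event_type_strength = {"VIEW": 1.0, "LIKE": 2.0, "BOOKMARK": 3.0,
                       "FOLLOW": 4.0, "COMMENT CREATED": 5.0}

# Invented events: person 1 views article 10 three times and likes it once.
toy = pd.DataFrame({
    "personId": [1, 1, 1, 1, 2],
    "contentId": [10, 10, 10, 10, 10],
    "eventType": ["VIEW", "VIEW", "VIEW", "LIKE", "VIEW"],
})
toy["eventStrength"] = toy["eventType"].map(event_type_strength)

# Repeated identical events collapse to a single row before summing.
toy = toy.drop_duplicates()
toy_grouped = (toy.groupby(["personId", "contentId"])["eventStrength"]
                  .sum()
                  .reset_index())
print(toy_grouped)
```

Person 1 ends up with one VIEW plus one LIKE, and person 2 with a single VIEW.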

**7. Create numeric person_id and content_id columns, since the raw IDs are negative or too large to serve as sparse-matrix indices (this gets around the “negative integer” warning).**

A. Converting column types to category

In [0]:
grouped_df['title'] = grouped_df['title'].astype("category")
grouped_df['personId'] = grouped_df['personId'].astype("category")
grouped_df['contentId'] = grouped_df['contentId'].astype("category")

B. Creating 2 columns and mapping the codes in every row

In [0]:
grouped_df['person_id'] = grouped_df['personId'].cat.codes
grouped_df['content_id'] = grouped_df['contentId'].cat.codes

In [0]:
grouped_df.sample(10)

| | personId | contentId | title | eventStrength | person_id | content_id |
|---|---|---|---|---|---|---|
| 15190 | -1690554517720703181 | -4655195825208522542 | Java 9 na prática: Jigsaw | 1.0 | 772 | 750 |
| 30642 | 3829784524040647339 | 7276478113479207148 | Leonardo Dicaprio, Barry Sternlicht back Qloo,... | 1.0 | 1344 | 2663 |
| 37134 | 7410250256723888301 | 5270696484536580646 | 8 Insanely Simple Productivity Hacks | 1.0 | 1699 | 2330 |
| 5957 | -6502100706127527925 | -8208801367848627943 | Ray Kurzweil: The world isn't getting worse - ... | 1.0 | 270 | 187 |
| 30533 | 3829784524040647339 | 39554158227241538 | Campus São Paulo: conheça a primeira turma do ... | 1.0 | 1344 | 1492 |
| 31463 | 4142810830429822977 | -1622037268576555626 | You won't recognize the new world of digital p... | 1.0 | 1375 | 1209 |
| 20006 | -108842214936804958 | -5798690764728257756 | Airbnb bets on local with user-generated guide... | 1.0 | 926 | 561 |
| 21106 | 374329072173397727 | -2948321821574578861 | Quando usar paginação e quando usar scroll inf... | 1.0 | 979 | 1016 |
| 14286 | -2511855597392146401 | 3075564241645350154 | Decentralizing IoT networks through blockchain | 1.0 | 693 | 1967 |
| 19907 | -229539536136014922 | 5338677278233757627 | How to Get a Job at Google | 1.0 | 916 | 2350 |
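
Once the model works with the compact codes, it is often useful to translate a code back to the original identifier. A minimal sketch, assuming the column was converted with `astype("category")` as above; the `toy` frame is hypothetical and reuses two `personId` values from the sample rows shown earlier.

```python
import pandas as pd

# Two personId values from the interactions sample, one of them repeated.
toy = pd.DataFrame({"personId": [-8845298781299428018,
                                 344280948527967603,
                                 -8845298781299428018]})
toy["personId"] = toy["personId"].astype("category")
toy["person_id"] = toy["personId"].cat.codes

# categories[k] is the original value that received code k.
categories = toy["personId"].cat.categories
original = categories[toy["person_id"].to_numpy()]
print(toy["person_id"].tolist())
print(list(original))
```

Codes are assigned in sorted order of the original values, so the negative ID receives code 0 here.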


## Alternating Least Squares Recommender Model Fitting

There are different ways to factor a matrix, such as Singular Value Decomposition (SVD) or Probabilistic Latent Semantic Analysis (PLSA), if we’re dealing with explicit data.

With implicit data, the difference lies in how we deal with all the missing data in our very sparse matrix. For explicit data, we treat missing entries as unknown fields to which we should assign a predicted rating. For implicit data we can’t assume the same, since the missing values carry information too: we don’t know whether a missing value means the user disliked something, or would love it but just doesn’t know about it. We need some way to learn from the missing data, which calls for a different approach.


ALS is an iterative optimization process in which every iteration tries to bring us closer to a factorized representation of our original data.

We have our original matrix `R` of size `u x i` with our users, items and some type of feedback data. We want to turn it into two matrices: `U`, with users and hidden features, of size `u x f`, and `V`, with items and hidden features, of size `f x i`.

U and V hold weights for how each user/item relates to each feature. We calculate U and V so that their product approximates R as closely as possible: R ≈ U x V.

![alttext](https://miro.medium.com/max/2400/1*ygHEXIhg5FtkSD3UQaldgw.png)

By randomly initializing the values in U and V and applying least squares iteratively, we can arrive at the weights that yield the best approximation of R. In its basic form, the least squares approach means fitting a line to the data, measuring the sum of squared distances from all points to the line, and seeking an optimal fit by minimising this value.


With the alternating least squares approach we use the same idea but alternate between two steps: fix V and optimize U, then fix U and optimize V. Each iteration brings us closer to R ≈ U x V.
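
The alternation can be sketched in a few lines of NumPy. This is a simplified illustration on a small, fully observed matrix with a ridge term, not the confidence-weighted variant that the Implicit library fits; the matrix values, the number of factors `f` and the regularization `lam` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense feedback matrix R (4 users x 5 items), built to have rank 2.
R = rng.random((4, 2)) @ rng.random((2, 5))

f, lam = 2, 1e-3                     # latent factors and ridge term (assumed)
U = rng.random((R.shape[0], f))      # users x factors
V = rng.random((f, R.shape[1]))      # factors x items

def loss(R, U, V):
    """Sum of squared differences between R and its approximation U x V."""
    return float(((R - U @ V) ** 2).sum())

initial = loss(R, U, V)
for _ in range(20):
    # Fix V and solve the regularized least-squares problem for U.
    U = np.linalg.solve(V @ V.T + lam * np.eye(f), V @ R.T).T
    # Fix U and solve the regularized least-squares problem for V.
    V = np.linalg.solve(U.T @ U + lam * np.eye(f), U.T @ R)
final = loss(R, U, V)
print(initial, final)
```

Each half-step has a closed-form solution because, with one factor matrix held fixed, the problem is an ordinary regularized least-squares fit in the other.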


For implicit feedback, the solution is to merge the preference (p) for an item with the confidence (c) we have in that preference. Missing values start out as a negative preference with a low confidence value, and existing values as a positive preference with a high confidence value.

![alttext](https://miro.medium.com/max/1178/1*fzBWtw4JtADMA_PjI8fUjA.png)

Basically our preference is a binary representation of our feedback data r. If the feedback is greater than zero we set it to 1. 

![alttext](https://miro.medium.com/max/1185/1*1rnDwWv6upHMQAeoFI_izQ.png)

Here the confidence is calculated from the magnitude of r (the feedback data), giving us a larger confidence the more times a user has played, viewed or clicked an item. The rate at which our confidence increases is set through a linear scaling factor α. We also add 1 so we have a minimal confidence even if α x r equals zero.


This also means that even a single interaction between a user and an item gets a higher confidence than the unknown data, given the α value. The original paper (Hu, Koren and Volinsky, “Collaborative Filtering for Implicit Feedback Datasets”) found `α = 40` to work well, and somewhere between 15 and 40 is a good range to try.
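
As a small worked example of the two definitions, the sketch below computes the preference and the confidence for three invented feedback values. The array `r` is hypothetical; `alpha = 40` is the value quoted above.

```python
import numpy as np

alpha = 40                       # scaling factor quoted from the paper
r = np.array([0.0, 1.0, 3.0])    # toy feedback: none, one view, view + like

p = (r > 0).astype(float)        # preference: 1 if any interaction, else 0
c = 1.0 + alpha * r              # confidence grows linearly with the feedback
print(p, c)
```

The missing entry keeps the minimal confidence of 1, while a single view already reaches 41.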

Create two matrices, one for fitting the model (content-person) and one for recommendations (person-content).
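
The construction can be tried on a handful of invented triples first. In this sketch (all values hypothetical), `csr_matrix` receives the data together with a `(row, column)` index pair, and swapping the two index arrays yields the transposed layout.

```python
import numpy as np
import scipy.sparse as sparse

# Toy (person_id, content_id, eventStrength) triples, values invented.
person_id = np.array([0, 0, 1, 2])
content_id = np.array([0, 2, 1, 2])
strength = np.array([1.0, 3.0, 1.0, 6.0])

# Rows are persons, columns are contents.
person_content = sparse.csr_matrix((strength, (person_id, content_id)))
# Rows are contents, columns are persons.
content_person = sparse.csr_matrix((strength, (content_id, person_id)))

# One matrix is the transpose of the other.
same = np.array_equal(person_content.T.toarray(), content_person.toarray())
print(person_content.toarray())
print(same)
```

When no `shape` argument is given, the dimensions are inferred from the largest index in each array, which is why contiguous codes starting at zero are convenient.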

In [0]:
#Creating sparse matrix
sparse_content_person = sparse.csr_matrix((grouped_df['eventStrength'].astype(float), 
                                           (grouped_df['content_id'], grouped_df['person_id'])))

sparse_person_content = sparse.csr_matrix((grouped_df['eventStrength'].astype(float), 
                                           (grouped_df['person_id'], grouped_df['content_id'])))

#Model Initialization
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)

#Fitting model using the sparse content-person matrix.
#(implicit 0.4.x expects an item-user matrix in fit(); later releases expect user-item)
alpha = 15
data = (sparse_content_person * alpha).astype('double')
model.fit(data)

GPU training requires factor size to be a multiple of 32. Increasing factors from 20 to 32.



**Finding similar articles**

To calculate the similarity between items, we compute the dot product between our item vectors and the vector of the item we are interested in. So if we want articles similar to content_id = 1732, we take the dot product between all item vectors and the transpose of the content_id = 1732 item vector. Dividing by the vector norms, as the code below does, turns this into a cosine similarity score:
![alttext](https://miro.medium.com/max/1183/1*aeQO6uL29wzQy8C1pz6Wwg.png)
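
Before running this on the fitted factors, the idea can be checked on a tiny hand-made example. The sketch below uses invented item vectors and a hypothetical helper `most_similar`; it normalizes by both vector norms so the scores are cosine similarities.

```python
import numpy as np

# Invented item factors: 4 items x 3 latent factors.
item_vecs = np.array([[1.0, 0.0, 0.0],
                      [2.0, 0.0, 0.0],    # same direction as item 0
                      [0.0, 1.0, 0.0],    # orthogonal to item 0
                      [1.0, 1.0, 0.0]])

def most_similar(item_vecs, item_id, n=3):
    """Return the n (index, cosine similarity) pairs closest to item_id."""
    norms = np.linalg.norm(item_vecs, axis=1)
    # Dot product divided by both norms gives the cosine similarity.
    scores = item_vecs @ item_vecs[item_id] / (norms * norms[item_id])
    # A stable sort keeps ties in index order, so the output is deterministic.
    order = np.argsort(-scores, kind="stable")[:n]
    return [(int(i), float(scores[i])) for i in order]

similar = most_similar(item_vecs, item_id=0)
print(similar)
```

Item 1 points in the same direction as item 0 and therefore ties with it at a similarity of 1, regardless of its larger magnitude.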

Let's see similar articles to content_id = 1732

In [0]:
grouped_df[grouped_df['content_id'] == 1732][:1]

| | personId | contentId | title | eventStrength | person_id | content_id |
|---|---|---|---|---|---|---|
| 4998 | -6944500707172804068 | 1582315529508020223 | Analytics startup Amplitude raises $15M | 3.0 | 226 | 1732 |


In [0]:
content_id = 1732
n_similar = 10

person_vecs = model.user_factors
#Len - 1895; every value is an array of 32 factors
content_vecs = model.item_factors
#Len - 2979 every value is an array of 32 factors

#L2 norm of every content vector
content_norms = np.sqrt((content_vecs * content_vecs).sum(axis=1))

#Dot product with the query vector, divided by each candidate's norm
scores = content_vecs.dot(content_vecs[content_id]) / content_norms

#Indices of the n_similar highest scores (not yet ordered)
top_idx = np.argpartition(scores, -n_similar)[-n_similar:]

#Divide by the query vector's norm to get cosine similarity, then sort descending
similar = sorted(zip(top_idx, scores[top_idx] / content_norms[content_id]), key=lambda x: -x[1])

for content in similar:
    idx, score = content
    print(grouped_df.title.loc[grouped_df.content_id == idx].iloc[0])

Analytics startup Amplitude raises $15M
Yahoo discloses hack of 1 billion accounts
Front-End Performance Checklist 2017 (PDF, Apple Pages) - Smashing Magazine
Cego é destaque entre os fotógrafos que registram a Paralimpíada do Rio
Getting Ready For HTTP/2: A Guide For Web Designers And Developers - Smashing Magazine
The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles
Swarm A.I. Correctly Predicts the Kentucky Derby, Accurately Picking all Four Horses of the Superfecta at 540 to 1 Odds
Falling for Web Components
Facebook now flags and down-ranks fake news with help from outside fact checkers
Amazon founder: A.I.'s impact is


The first result is the query article itself, since every vector has a cosine similarity of 1 with itself.


**Recommend Articles to Persons**

To make recommendations for a given user we take a similar approach. Here we calculate the dot product between our user vector and the transpose of our item vectors. This gives us a recommendation score for our user and each item:

![](https://miro.medium.com/max/1163/1*UGxdsKnvkekwW0IQNrj1ig.png)
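
The scoring step can be illustrated on invented factors before wiring it to the model. In this sketch, `user_vecs`, `item_vecs`, `seen` and the helper `top_n` are hypothetical, and already-seen items are hidden by setting their score to negative infinity. That is a simpler alternative to the min-max scaling and multiply-by-zero masking used in the function below.

```python
import numpy as np

# Invented factors: 2 users x 2 factors and 4 items x 2 factors.
user_vecs = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
item_vecs = np.array([[0.9, 0.1],
                      [0.8, 0.3],
                      [0.1, 0.9],
                      [0.5, 0.5]])
# True where the user has already interacted with the item.
seen = np.array([[True, False, False, False],
                 [False, False, True, False]])

def top_n(user_id, n=2):
    """Return the indices of the n best unseen items for user_id."""
    scores = user_vecs[user_id] @ item_vecs.T            # one score per item
    scores = np.where(seen[user_id], -np.inf, scores)    # hide seen items
    return [int(i) for i in np.argsort(-scores)[:n]]

print(top_n(0), top_n(1))
```

User 0 would rank item 0 highest, but since it was already seen, the next best items are returned instead.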

In [0]:
def recommend(person_id, sparse_person_content, person_vecs, content_vecs, num_contents=10):
    # Get the interactions scores from the sparse person content matrix
    person_interactions = sparse_person_content[person_id,:].toarray()
    # Add 1 to everything, so that articles with no interaction yet become equal to 1
    person_interactions = person_interactions.reshape(-1) + 1
    # Make articles already interacted zero
    person_interactions[person_interactions > 1] = 0
    # Get dot product of person vector and all content vectors
    rec_vector = person_vecs[person_id,:].dot(content_vecs.T).toarray()
    
    # Scale this recommendation vector between 0 and 1
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    # Content already interacted have their recommendation multiplied by zero
    recommend_vector = person_interactions * rec_vector_scaled
    # Sort the indices of the content into order of best recommendations
    content_idx = np.argsort(recommend_vector)[::-1][:num_contents]
    
    # Start empty list to store titles and scores
    titles = []
    scores = []

    for idx in content_idx:
        # Append titles and scores to the list
        titles.append(grouped_df.title.loc[grouped_df.content_id == idx].iloc[0])
        scores.append(recommend_vector[idx])

    recommendations = pd.DataFrame({'title': titles, 'score': scores})

    return recommendations

In [0]:
person_vecs = sparse.csr_matrix(model.user_factors)
content_vecs = sparse.csr_matrix(model.item_factors)

# Create recommendations for person with id 1518
person_id = 1518

recommendations = recommend(person_id, sparse_person_content, person_vecs, content_vecs)

print(recommendations)

                                               title     score
0  New MacBook Pro is not a Laptop for Developers...  1.000000
1  Conheça Cris Grether, a diretora Global de Des...  0.797777
2  7 Tips to Create a Company Learning Culture li...  0.789233
3                    Branding é problema seu. E meu.  0.775028
4                     Top 10 Intranet Trends of 2016  0.760208
5  'The Simpsons' celebrates 600 episodes with a ...  0.758206
6  Eddy Cue and Craig Federighi Open Up About Lea...  0.753278
7  Para neurociência, motivação não é fator princ...  0.741027
8  10 Multipurpose Material Design Themes to Make...  0.734829
9  Lava-louças Brastemp Ative! 8 Serviços Blf08ab...  0.731685



**Do they make sense? Let’s get the articles this person has interacted with.**

In [0]:
grouped_df[grouped_df['person_id'] == 1518]

| | personId | contentId | title | eventStrength | person_id | content_id |
|---|---|---|---|---|---|---|
| 34403 | 5664882385083540533 | -7565625515044903967 | Agronegócio entra na era da agricultura digita... | 1.0 | 1518 | 284 |
| 34404 | 5664882385083540533 | -4333957157636611418 | Why Programmers Want Private Offices | 1.0 | 1518 | 794 |
| 34405 | 5664882385083540533 | -2555801390963402198 | Curta do Google pode ser o primeiro filme em r... | 1.0 | 1518 | 1060 |
| 34406 | 5664882385083540533 | 358192030955601318 | A nova sede da LEGO na Dinamarca é toda sobre ... | 1.0 | 1518 | 1539 |
| 34407 | 5664882385083540533 | 1582315529508020223 | Analytics startup Amplitude raises $15M | 1.0 | 1518 | 1732 |
| 34408 | 5664882385083540533 | 3380406830501456504 | GE integra inteligência artificial Alexa em um... | 1.0 | 1518 | 2019 |
| 34409 | 5664882385083540533 | 8428597553954921991 | Time to Re-Think Design Thinking | 1.0 | 1518 | 2847 |


In [0]:
#References:
#https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe
#https://towardsdatascience.com/building-a-collaborative-filtering-recommender-system-with-clickstream-data-dffc86c8c65