## Content-Based Recommendation Example

#### In this notebook, we will explore some approaches and techniques for content-based recommender systems. We will look at how to do text features extraction and represent items (in this case documents) as feature vectors. Then, we will see how we can obtain a feature-based representation of a user profile from user interactions with items, and generate a list of recommended items based on similarity (in terms of content features) to the user profile.

#### The data set we will use is from Deskdrop which is an internal communications platform developed by <a href="https://ciandt.com/us/en-us" target=_blank>CI&T</a>. Among other features, this platform allows companies employees to share relevant articles with their peers, and collaborate around them. The contains two files <a href="http://facweb.cs.depaul.edu/mobasher/classes/csc577/data/shared_articles.csv" target=_blank>"shared_articles.csv"</a> and <a href="http://facweb.cs.depaul.edu/mobasher/classes/csc577/data/users_interactions.csv" target=_blank>"users_interactions.csv"</a> containing, respectively, the shared documents and the log of user interactions with the items (such as viewing, liking, bookmarking, etc.). The original data set was provided by <a href="https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop" target=_blank>Gabriel Moreira on Kaggle</a>. 

In [1]:
import pandas as pd
import scipy as sp
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

#### We will use the following libraries for preprocessing tasks such as tokenization and vectorization of documents

In [2]:
import re
import nltk      ### Need to first install NLTK, e.g., "pip install nltk" or "conda install nltk" from conda
                 ### command prompt. Also see: https://www.nltk.org/install.html
nltk.download()  ### This will let you download various NLTK packages; it's only necessary to do this once

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


#### The following function will read in a list of stop words to be used as part of tokenization of documents. You can obtain the file containing the stop words here: <a href="http://facweb.cs.depaul.edu/mobasher/classes/csc577/data/stopwords_en.txt" target=_blank>stopwords_en.txt</a>.

In [3]:
def get_stop_words():
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

In [4]:
stop_words = get_stop_words()

#### Load the articles and filter the document to keep only English documents. These will serve as our "items" for the purpose of recommendation.

In [5]:
articles_df = pd.read_csv('http://facweb.cs.depaul.edu/mobasher/classes/csc577/data/shared_articles.csv')
articles_df.head(3)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [6]:
articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3122 entries, 0 to 3121
Data columns (total 13 columns):
timestamp          3122 non-null int64
eventType          3122 non-null object
contentId          3122 non-null int64
authorPersonId     3122 non-null int64
authorSessionId    3122 non-null int64
authorUserAgent    680 non-null object
authorRegion       680 non-null object
authorCountry      680 non-null object
contentType        3122 non-null object
url                3122 non-null object
title              3122 non-null object
text               3122 non-null object
lang               3122 non-null object
dtypes: int64(4), object(9)
memory usage: 317.2+ KB


In [7]:
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df = articles_df[articles_df['lang'] == 'en']
articles_df.head(3)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en


#### For the purpose of this example, we will only need the attributes "contentId", 'title', and 'text' (the latter attribute contains the text of the document).

In [8]:
articles = articles_df[['contentId', 'title', 'text']]
articles.head()

Unnamed: 0,contentId,title,text
1,-4110354420726924665,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...
2,-7292285110016212249,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...
3,-6151852268067518688,Google Data Center 360° Tour,We're excited to share the Google Data Center ...
4,2448026894306402386,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...
5,-2826566343807132236,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...


In [9]:
articles.shape

(2211, 3)

In [10]:
### Reindex the articles dataframe to start from 0 (this is to avoid incosistencies when 
### we later use sparse matrices which use zero-based indexing)

articles.index = range(articles.shape[0])
articles.head()

Unnamed: 0,contentId,title,text
0,-4110354420726924665,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...
1,-7292285110016212249,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...
2,-6151852268067518688,Google Data Center 360° Tour,We're excited to share the Google Data Center ...
3,2448026894306402386,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...
4,-2826566343807132236,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...


#### Next, we'll use scikit-learn's TfidfVectorizer to tokenize the documents and create as document-feature matrix. This fuction also performs the necessary transformation to convert the term weights into TFxIDF weights. The alternative is to use CountVectorizer which has similar capabilities, but results in simple term frequencies as weights. Please familiarize yourself with the documentation for these function to see various avaialble parameters. In this case, we'll see how the vectorizer function can tokenize (using the default tokenizer), remove stop words, and filter out terms that appear in more than 80% of the documents or in less than 3 documents. This will help in reducing the feature space. Note that this does not perform stemming. We will see a bit later how we can incorporate our own preprocessor that can perform additional tasks during tokenization, including stemming.

In [11]:
vectorizer = TfidfVectorizer(stop_words=stop_words, norm=None, max_df=0.8, min_df=3)

In [12]:
art_mat = vectorizer.fit_transform(articles['text'])

In [13]:
art_mat

<2211x17464 sparse matrix of type '<class 'numpy.float64'>'
	with 608391 stored elements in Compressed Sparse Row format>

#### This will produce a doucment-feature matrix in sparse matrix format. This is a sparse representation of a 2211x17464 document-feature matrix (documents are represented as 17464 feature vectors). Each entry here is in the form "(doc_id, term_id)  term_weight" and in this case, the term weights are TFxIDF values. Note that for the most part we will not need to convert this to a "dense" matrix with all the zeros present. Most Numpy, Scipy, and Scikit-learn functions that work with matrices, will work natively with sparse matrices. For example, below is the an example of accessing document vectors (rows in the document-term matrix) 0 through 99. Please review Scipy's documentation on sparse matrices: https://docs.scipy.org/doc/scipy/reference/sparse.html.

In [14]:
print(art_mat[0:100])

  (0, 1202)	2.6928391771996303
  (0, 622)	3.3689335693768565
  (0, 4599)	3.156474918162663
  (0, 4474)	4.5119976206158
  (0, 17405)	2.100422243913349
  (0, 15)	2.413793802480441
  (0, 9284)	2.5990937680286565
  (0, 6470)	4.294933115377972
  (0, 2745)	3.6264785474083983
  (0, 13641)	7.092214450208125
  (0, 12164)	2.796290514587655
  (0, 2680)	6.6222108209623896
  (0, 15740)	2.461376517471456
  (0, 15277)	2.4711709150637438
  (0, 2670)	4.624114918736506
  (0, 12710)	2.2464537993021034
  (0, 4087)	5.5235985322942796
  (0, 4031)	6.062595033026967
  (0, 15063)	4.20184269231196
  (0, 14832)	4.731360449090104
  (0, 13179)	4.574517977597134
  (0, 13771)	3.2420668484980664
  (0, 4410)	3.1761994235104414
  (0, 16866)	6.757284767007635
  (0, 15037)	6.504427785306006
  :	:
  (99, 9120)	5.949609230110056
  (99, 5322)	16.361533668438923
  (99, 6144)	2.875652255261775
  (99, 14219)	3.4337942035788975
  (99, 11444)	7.143715656043646
  (99, 8119)	3.049163182373575
  (99, 3435)	3.607902161835463
  (99, 

#### Let's take a look at a subset of the terms/features extracted from the documents.

In [15]:
features = vectorizer.get_feature_names()
print(features[1000:1100])

['amidst', 'amir', 'amis', 'amit', 'aml', 'amortized', 'amounts', 'amp', 'amphitheater', 'amphitheatre', 'ample', 'amplification', 'amplified', 'amplify', 'amplifying', 'amplitude', 'amps', 'amsterdam', 'amusement', 'amusing', 'amy', 'amzn', 'ana', 'anaconda', 'analog', 'analogies', 'analogous', 'analogue', 'analogy', 'analyse', 'analysed', 'analyser', 'analyses', 'analysing', 'analysis', 'analyst', 'analysts', 'analytic', 'analytical', 'analytics', 'analyze', 'analyzed', 'analyzer', 'analyzes', 'analyzing', 'ananthram', 'anathema', 'anatomy', 'ancestors', 'anchor', 'anchored', 'anchors', 'ancient', 'ancillary', 'anderson', 'andrea', 'andreas', 'andreessen', 'andrei', 'andrej', 'andrew', 'android', 'androidmanifest', 'andy', 'anecdotally', 'angel', 'angela', 'angeles', 'angellist', 'angels', 'anger', 'angered', 'angie', 'angle', 'angles', 'angry', 'angular', 'angular2', 'angularjs', 'ani', 'animal', 'animals', 'animate', 'animated', 'animates', 'animating', 'animation', 'animations', '

#### we can view the full vocabulary (a.k.a. dictionary) of features that are the dimensions for the vector represenation of items. This will also give us the total occurrence counts for each term across the whole collection of documents.

In [16]:
vectorizer.vocabulary_

{'work': 17291,
 'early': 5228,
 'public': 12386,
 'version': 16791,
 'ethereum': 5749,
 'software': 14505,
 'recently': 12754,
 'released': 12997,
 'face': 6043,
 'technical': 15564,
 'legal': 9176,
 'problems': 12169,
 'bitcoin': 1971,
 'advocates': 752,
 'say': 13719,
 'security': 13878,
 'greater': 7134,
 'complexity': 3386,
 'far': 6125,
 'faced': 6046,
 'testing': 15684,
 'fewer': 6232,
 'attacks': 1470,
 'novel': 10686,
 'design': 4533,
 'invite': 8561,
 'intense': 8395,
 'scrutiny': 13829,
 'authorities': 1545,
 'given': 6962,
 'potentially': 11926,
 'fraudulent': 6626,
 'contracts': 3699,
 'like': 9280,
 'ponzi': 11840,
 'schemes': 13760,
 'written': 17357,
 'directly': 4744,
 'sophisticated': 14548,
 'capabilities': 2549,
 'fascinating': 6138,
 'executives': 5874,
 'corporate': 3812,
 'america': 992,
 'ibm': 7809,
 'said': 13645,
 'year': 17403,
 'experimenting': 5934,
 'way': 17048,
 'control': 3718,
 'real': 12693,
 'world': 17317,
 'objects': 10755,
 'called': 2502,
 'inte

#### You might note, however that without stemming, we have many redundant variants of terms as features. For example, 'animate', 'animated', 'animates', 'animating', 'animation', 'animations' are all considered distinct features. This not only increases the size of the feature space, but will potentially prevent matches among pairs of documents that don't include the same variants of terms. So, we need to include stemming as part of our preprocessing. To do this with the vectroizer functions in Scikit-learn, we will need to create our own preprocessing function that performs tokenization, stemming, and other normalization tasks on a given document and returns the transformed document (to be used by TfidfVectorizer). In this case, we will employ NLTK's facilities for tokenization, stemming and stop word handling as part of our "normalize_document" function.

In [17]:
def normalize_document(doc):
    wpt = nltk.WordPunctTokenizer()
    stop_words = nltk.corpus.stopwords.words('english')
    ps = nltk.stem.PorterStemmer()
    # convert to lower case, and remove special characters and white space
    doc = re.sub(r'[^a-zA-Z0-9_\s]', '', doc, re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize the document
    tokens = wpt.tokenize(doc)
    # remove stopwords
    filtered_tokens = [token for token in tokens if (token not in stop_words and token not in ["."])]
    # put the filtered document back together
    doc = ' '.join([ps.stem(token) for token in filtered_tokens])
    return doc

#### Lets' call the vectorizer function again, but use our "normalize_document" function as the preprocessor. (Note, however, that this could be significantly slower than the previous version.)

In [18]:
vectorizer2 = TfidfVectorizer(preprocessor=normalize_document, norm=None, max_df=0.8, min_df=3)

In [19]:
vectorizer2.fit(articles['text'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.8, max_features=None,
                min_df=3, ngram_range=(1, 1), norm=None,
                preprocessor=<function normalize_document at 0x000001D841A76DC8>,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [20]:
art_mat2 = vectorizer2.transform(articles['text'])

In [21]:
art_mat2

<2211x11194 sparse matrix of type '<class 'numpy.float64'>'
	with 579559 stored elements in Compressed Sparse Row format>

#### Note that the feature space is not significantly smaller and the features are not stemmed versions of the original terms.

In [22]:
features2 = vectorizer2.get_feature_names()
print(features2[1000:1100])

['atap', 'atari', 'ate', 'athlet', 'atla', 'atlant', 'atlanta', 'atlassian', 'atm', 'atmospher', 'atom', 'atop', 'attach', 'attack', 'attain', 'attempt', 'attend', 'attende', 'attent', 'attest', 'attia', 'attitud', 'attract', 'attribut', 'attrit', 'auction', 'audac', 'audibl', 'audienc', 'audio', 'audiobook', 'audiophil', 'audit', 'auditor', 'auditori', 'augment', 'augur', 'august', 'aunt', 'aura', 'aurora', 'auster', 'austin', 'australia', 'australian', 'austria', 'auth', 'auth0', 'authent', 'author', 'authoris', 'authorit', 'authorship', 'autist', 'auto', 'autocomplet', 'autocorrect', 'autodesk', 'autofocu', 'autom', 'automag', 'automak', 'automat', 'automl', 'automobil', 'automot', 'autonom', 'autonomi', 'autopilot', 'autosc', 'autoscal', 'autovalu', 'autowir', 'auxiliari', 'av', 'avail', 'avatar', 'avenu', 'averag', 'avers', 'avg', 'aviat', 'avid', 'aviv', 'avoid', 'avro', 'aw', 'await', 'awak', 'awaken', 'awar', 'award', 'awash', 'away', 'awe', 'awesom', 'awhil', 'awk', 'awkward',

In [23]:
vectorizer2.vocabulary_

{'work': 11047,
 'still': 9443,
 'earli': 3281,
 'first': 3915,
 'full': 4165,
 'public': 7853,
 'version': 10681,
 'ethereum': 3587,
 'softwar': 9167,
 'recent': 8103,
 'releas': 8231,
 'system': 9747,
 'could': 2453,
 'face': 3737,
 'technic': 9835,
 'legal': 5790,
 'problem': 7751,
 'tarnish': 9803,
 'bitcoin': 1345,
 'mani': 6097,
 'advoc': 589,
 'say': 8652,
 'secur': 8746,
 'greater': 4463,
 'complex': 2250,
 'thu': 10003,
 'far': 3784,
 'much': 6552,
 'less': 5814,
 'test': 9909,
 'fewer': 3854,
 'attack': 1013,
 'novel': 6836,
 'design': 2872,
 'may': 6197,
 'also': 717,
 'invit': 5329,
 'intens': 5255,
 'scrutini': 8718,
 'author': 1049,
 'given': 4333,
 'potenti': 7619,
 'fraudul': 4104,
 'contract': 2378,
 'like': 5860,
 'ponzi': 7578,
 'scheme': 8676,
 'written': 11091,
 'directli': 2977,
 'sophist': 9204,
 'capabl': 1745,
 'made': 6044,
 'fascin': 3790,
 'execut': 3660,
 'corpor': 2432,
 'america': 747,
 'ibm': 4954,
 'said': 8595,
 'last': 5713,
 'year': 11138,
 'experi':

#### The next step is to compute pair-wise similarities among all items/documents. We can use this similarity matrix to return the most similar items to a given item. Note, that the cosine_similarity function in sklearn.metrics.pairwise can natively work with sparse matrices

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

In [25]:
sim_mat = cosine_similarity(art_mat2)
sim_mat.shape

(2211, 2211)

In [26]:
np.set_printoptions(linewidth=120, precision=2, edgeitems=10)

In [27]:
print(sim_mat)

[[1.   0.06 0.03 0.36 0.18 0.21 0.22 0.21 0.89 0.02 ... 0.06 0.06 0.05 0.01 0.04 0.02 0.03 0.07 0.14 0.02]
 [0.06 1.   0.04 0.07 0.05 0.06 0.07 0.08 0.07 0.02 ... 0.06 0.05 0.02 0.   0.03 0.01 0.03 0.06 0.08 0.04]
 [0.03 0.04 1.   0.03 0.04 0.02 0.03 0.03 0.02 0.06 ... 0.06 0.06 0.04 0.01 0.05 0.04 0.1  0.03 0.06 0.  ]
 [0.36 0.07 0.03 1.   0.3  0.22 0.35 0.2  0.3  0.03 ... 0.1  0.06 0.07 0.01 0.08 0.04 0.03 0.05 0.17 0.02]
 [0.18 0.05 0.04 0.3  1.   0.06 0.25 0.11 0.15 0.03 ... 0.07 0.04 0.02 0.03 0.03 0.02 0.04 0.12 0.12 0.02]
 [0.21 0.06 0.02 0.22 0.06 1.   0.22 0.21 0.16 0.01 ... 0.11 0.04 0.05 0.01 0.02 0.02 0.02 0.04 0.21 0.02]
 [0.22 0.07 0.03 0.35 0.25 0.22 1.   0.14 0.18 0.02 ... 0.11 0.06 0.06 0.   0.06 0.02 0.02 0.05 0.16 0.  ]
 [0.21 0.08 0.03 0.2  0.11 0.21 0.14 1.   0.18 0.03 ... 0.06 0.08 0.05 0.02 0.04 0.02 0.04 0.03 0.07 0.02]
 [0.89 0.07 0.02 0.3  0.15 0.16 0.18 0.18 1.   0.02 ... 0.06 0.05 0.04 0.01 0.03 0.02 0.03 0.06 0.12 0.01]
 [0.02 0.02 0.06 0.03 0.03 0.01 0.02 

#### The following function takes a similarity matrix and a target item, and returns to k most similar items in the item database to the target item (it returns the indices of the k-nearest-neighbors along with their similarity values to the target item).

In [28]:
def content_based_sim(dataMat, simMatrix, item, k):
    sims = simMatrix[item,:]
    idx = np.argsort(sims)
    idx = idx[::-1]
    ### Need to remove the item itself since it has the highest similarity to itself
    idx = np.array([i for i in idx if i != item])
    neigh_idx = idx[:k]
    neigh_sims = sims[neigh_idx]
    return neigh_idx, neigh_sims

#### Let's look at a specific item (the article with index 249) and find the top 10 most similar items.

In [29]:
item = 249
articles.loc[item]['title']

'Digital banking: Mondo hopes to become the Google or Facebook of the sector'

In [30]:
nidx, nsims = content_based_sim(art_mat2, sim_mat, item, 10)

In [31]:
print(nidx)

[ 754  239  247 2143 1482  261 1850  683  702 1758]


In [32]:
print(nsims)

[0.49 0.44 0.41 0.4  0.39 0.39 0.38 0.38 0.38 0.37]


In [33]:
pd.set_option('max_colwidth', 100)

#### And here are the top 10 recommendations based on the above target (query) item. It's clear from the titles that the recommeded items are very similar (in terms of content) to the target item.

In [34]:
articles.iloc[nidx]['title']

754       The #digital upstarts offering app-only #banking for smartphone users #benchmark
239                                                    Building a digital-banking business
247                          Why Barclays Sees Banking's Future as an Information Business
2143                                   Mapping the Global NeoBank Landscape - Techfoliance
1482    Mobile-Only Challenger Banks Are Shaping the Future of Financial Services Industry
261                           App-only bank Atom just launched - here's what it looks like
1850                          Digital Tool as Strategic Enabler for Banking Transformation
683                                            A digital crack in banking's business model
702       Welcome to GoogleBank, Facebook Bank, Amazon Bank, and Apple Bank - Enrique Dans
1758                                   Blockchain Will Be Used By 15% of Big Banks By 2017
Name: title, dtype: object

#### Observations about the content_based_sim function:

<ul>
    <li>
        First thing to note is that this is a non-personalized form of recommendations. A user preference 
        profile is not being using to generated recommendation; only a query/target item is used. Below, 
        we will explore a more personalized form of content-based recommendation.
    </li>
    <li>
        It should also be noted that the above function could, in fact, work with any item-item similarity 
        matrix. For example, the sim matrix could be based on a ratings matrix as in the case of item-based 
        collaborative filtering and does not need to be based on similarity of content-features. This 
        observation also points to the possibility that we can combine content-based and collaborative 
        filtering by using two independent similarity functions (a collaborative and a content-based one) to 
        create a hybrid recommender system (a good potential project idea). 
    </li>
    <li>
        Finally, it's important to point out that creating a separate similarity matrix offline, while 
        advantageous in terms of scalability, also has drawbacks. The sim matrix is generated based on 
        existing items in our item database. So, the content_based_sim function can only accept a target 
        item that is already represented in the database. This negates one of the key advantages of 
        content-based filtering which theoretically should allow a completely new item, represented using 
        the current feature space, should still be accepted as a target item, and the system should be 
        able to generate recommendations for that item. To allow for this, we would have to forego the 
        use of an offline similarity matrix and measure similarities between the target item and all items in 
        the databases during the recommendation generation time (i.e., within the content_based_sim function).
    </li>
</ul>


#### To be able to generate personalized recommendations, we need to first generate a user profile based on past preferences of the user. This could be done by collecting items the user has previous rated positively, or even just items they viewed, selected, purchased, etc. If there are both positive and negative ratings or ratings from a range of values, these ratings can be used as weights when creating an aggregate representation for the user's profile. In this case, the data provides information on users' various interactions with the items. For the purpose of this example, we are going to only consider articles that were "liked" by users. We'll combine all such articles for each user to create their profile as a feature vector over the set of features we already extracted earlier from documents. 

In [35]:
interactions_df = pd.read_csv('http://facweb.cs.depaul.edu/mobasher/classes/csc577/data/users_interactions.csv')
interactions_df.head(3)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52...",NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,


In [36]:
liked_df = interactions_df[interactions_df["eventType"]=="LIKE"]

In [37]:
liked_df.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
33,1465415756,LIKE,-8142426490949346803,1908339160857512799,9121879357144259163,,,
36,1465413867,LIKE,310515487419366995,3609194402293569455,1143207167886864524,,,
40,1465413845,LIKE,310515487419366995,344280948527967603,-3167637573980064150,,,
43,1465413946,LIKE,310515487419366995,-8763398617720485024,1395789369402380392,,,
56,1465413763,LIKE,-1492913151930215984,3609194402293569455,1143207167886864524,,,


#### To match these records with the shared articles, we only need the personId and contentId fields

In [38]:
liked_df = liked_df[["personId", "contentId"]]
liked_df.head()

Unnamed: 0,personId,contentId
33,1908339160857512799,-8142426490949346803
36,3609194402293569455,310515487419366995
40,344280948527967603,310515487419366995
43,-8763398617720485024,310515487419366995
56,3609194402293569455,-1492913151930215984


In [39]:
liked_df.shape

(5745, 2)

In [40]:
### No. of unique users in the data
len(liked_df["personId"].unique())

788

In [41]:
### No. of unique items in the data
len(liked_df["contentId"].unique())

1742

In [42]:
### Let's remind ourselves about what the articles dataframe looked like
articles.head()

Unnamed: 0,contentId,title,text
0,-4110354420726924665,"Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's",All of this work is still very early. The first full public version of the Ethereum software was...
1,-7292285110016212249,Bitcoin Future: When GBPcoin of Branson Wins Over USDcoin of Trump,"The alarm clock wakes me at 8:00 with stream of advert-free broadcasting, charged at one satoshi..."
2,-6151852268067518688,Google Data Center 360° Tour,We're excited to share the Google Data Center 360° Tour - a YouTube 360° video that gives you an...
3,2448026894306402386,"IBM Wants to ""Evolve the Internet"" With Blockchain Technology",The Aite Group projects the blockchain market could be valued at $400 million by 2019. For that ...
4,-2826566343807132236,IEEE to Talk Blockchain at Cloud Computing Oxford-Con - CoinDesk,One of the largest and oldest organizations for computing professionals will kick off its annual...


#### We can now do a join of these two tables (articles and liked_df dataframes)

In [43]:
ui_merge = pd.merge(liked_df, articles)
ui_merge.head()

Unnamed: 0,personId,contentId,title,text
0,3609194402293569455,-1492913151930215984,Chrome DevTools - Console API Reference,The DevTools docs have moved! Read the latest version of this article and head over to the new h...
1,7774613525190730745,-1492913151930215984,Chrome DevTools - Console API Reference,The DevTools docs have moved! Read the latest version of this article and head over to the new h...
2,7527226129639571966,-1492913151930215984,Chrome DevTools - Console API Reference,The DevTools docs have moved! Read the latest version of this article and head over to the new h...
3,-1602833675167376798,3727587882617538492,Bitcoin In The Time Of Negative Interest Rates,The Central Banks of Japan and Europe have imposed negative interest rates on deposits. Would th...
4,-1602833675167376798,-692972306229904743,Blockchain won't kill banks: Bitcoin pioneer,Blockchain - the technology that underpins the cryptocurrency bitcoin - is unlikely to kill bank...


#### Let's look at the articles liked by a specific user in the database

In [44]:
pid = -1602833675167376798
p = ui_merge[ui_merge["personId"]==pid]
p

Unnamed: 0,personId,contentId,title,text
3,-1602833675167376798,3727587882617538492,Bitcoin In The Time Of Negative Interest Rates,The Central Banks of Japan and Europe have imposed negative interest rates on deposits. Would th...
4,-1602833675167376798,-692972306229904743,Blockchain won't kill banks: Bitcoin pioneer,Blockchain - the technology that underpins the cryptocurrency bitcoin - is unlikely to kill bank...
27,-1602833675167376798,-21036008762564671,The sales secrets of high-growth companies,The authors of Sales Growth reveal five actions that distinguish sales organizations at fast-gro...
90,-1602833675167376798,1929674614667189969,Diane Greene wants to put the enterprise front and center of Google Cloud strategy,"When Google bought bebop Technologies last fall for $348 million , it got more than a stealthy s..."
113,-1602833675167376798,-1654063646246197191,Big IT Rising,"Big IT rising An overview of Lean, Agile and DevOps New Ways of Working The lunch of big corpora..."
...,...,...,...,...
3352,-1602833675167376798,2760335717049716507,"Razorfish, US digital revenues, drag down Publicis",It looked like a stunning strategy. After Publicis - one of the world's largest media and commun...
3381,-1602833675167376798,-746073086109727488,How to get the most from your agency relationships in 2017,Executives who know how to set up and manage agency relationships are best positioned to improve...
3405,-1602833675167376798,6831941111848480366,"Amazon looking to buy Capital One? "" Banking Technology",Amazon is rumoured to be pondering the acquisition of Capital One. Banking Technology contacted ...
3409,-1602833675167376798,6244532954645766056,3 Big Blockchain Ideas MIT is Working on Right Now - CoinDesk,When one of the world's most prestigious universities announces it will explore a controversial ...


#### We can combine all the items in a given user's record to create the profile. This could be done by creating separate vectors for each item/document and then take the sum or the mean of the vectors. If the user has rating values associated with these items, they could be used to compute a weighted sum or mean. In this case, there are only unary ratings ("liked") for items and it is more convenient for first concatenate all documents belonging to the user, and then generate a feature vector for the combined document.

In [45]:
user_texts = pd.DataFrame([], columns=['personId', 'text'])
user_content = {} # a dictionary to keep track of article ids for each user
for pid in set(ui_merge["personId"]):
    p = ui_merge[ui_merge["personId"]==pid]
    all_text = ' '.join(p["text"])
    user_content[pid] = [a for a in p["contentId"]]
    row = pd.DataFrame([[pid, all_text]], columns=['personId', 'text'])
    user_texts = user_texts.append(row, ignore_index=True)

In [46]:
user_texts.head()

Unnamed: 0,personId,text
0,-6174073684310263806,Here are the SDKs Top Mobile Apps Use We are working on a free community resource for mobile app...
1,3813842765808990208,Editor's note: this is the seventh post in a series of in-depth posts on what's new in Kubernete...
2,4142810830429822977,"Bottom line: Machine learning is providing the needed algorithms, applications, and frameworks t..."
3,-8694104221113176052,"Big IT rising An overview of Lean, Agile and DevOps New Ways of Working The lunch of big corpora..."
4,-1678759546322702318,"2015 has been heralded as the year the Internet of Things finally takes hold, but while we wait ..."


In [47]:
user_content[-1602833675167376798]

[3727587882617538492,
 -692972306229904743,
 -21036008762564671,
 1929674614667189969,
 -1654063646246197191,
 -5170198873410718233,
 8265190708606665292,
 6437568358552101410,
 -862645085190435621,
 -2871288807409592,
 -3363563881552061188,
 2857117417189640073,
 -7681408188643141872,
 3149164017776669829,
 2916072977192006313,
 4814419120794996930,
 -662806181534790446,
 -3750879736572068916,
 2072448887839540892,
 5238119115012015307,
 -5410531116380081703,
 643905832258297060,
 -1088742830039453732,
 6754313450809254838,
 7229629480273331039,
 -205193648629294862,
 7124632953201847659,
 1544550983918141657,
 5152069678055228801,
 8118799783881928573,
 4774970687540378081,
 2885511262558254418,
 4139191914110236238,
 9032993320407723266,
 1459131257418324496,
 4419562057180692966,
 5468598741732935699,
 -2539915991213675511,
 -9055044275358686874,
 6152652267138213180,
 -667193404227875686,
 3075564241645350154,
 -5912792039759735631,
 -8312968399134741370,
 -4320251915347436672,
 -

#### Now we can use our vectorizer that was trained on the original document set to transform the new profile documents.

In [48]:
user_profiles = vectorizer2.transform(user_texts['text'])

In [49]:
user_profiles

<564x11194 sparse matrix of type '<class 'numpy.float64'>'
	with 475164 stored elements in Compressed Sparse Row format>

#### Each element/row of "user_profiles" represents the feature vector for one user (essentially feature representation of the concatenated documents liked by the user. Below are the first couple of uses (in sparse matrix format) with user index, feature index, and TFxIDF weights.

In [50]:
print(user_profiles[0:2])

  (0, 11162)	4.2828117548456275
  (0, 11148)	4.865112157871928
  (0, 11138)	5.164729266241387
  (0, 11099)	40.67896423322242
  (0, 11090)	4.558517636250693
  (0, 11081)	4.259001106151909
  (0, 11074)	10.621083342256632
  (0, 11063)	4.259001106151909
  (0, 11047)	23.103114967714923
  (0, 11021)	4.066848228449644
  (0, 11020)	7.213629128124869
  (0, 10952)	3.8733386253399242
  (0, 10950)	5.993602161540015
  (0, 10941)	4.9700925231147215
  (0, 10916)	1.7160105450030172
  (0, 10891)	2.86684162557962
  (0, 10876)	4.838771231492437
  (0, 10861)	3.0159330884942275
  (0, 10858)	7.092214450208125
  (0, 10847)	35.96161296924009
  (0, 10844)	12.502813037983637
  (0, 10819)	1.7031427203916243
  (0, 10804)	3.760009940032921
  (0, 10751)	13.81978578682834
  (0, 10736)	9.588962480129588
  :	:
  (1, 282)	4.7698267299179
  (1, 276)	9.222093746854766
  (1, 266)	4.712668316077951
  (1, 263)	4.574517977597134
  (1, 261)	4.809832064531599
  (1, 259)	4.712668316077951
  (1, 252)	7.19141377748329
  (1, 247)	

#### Now we are ready to generate some presonalized recommendations. The following function takes a user (feature vector for a target user's profile) and returns the indices of the top k most similar items to the user's profile. The function also has an input argument "rated_items_indicies" which is an array of indices of items in the user profile (in this case, these are indices in the "articles" dataframe corresponding to articles that the user has previously liked).

In [51]:
def content_based_recommend(dataMat, user, rated_items_indicies, k):
    from sklearn.metrics.pairwise import cosine_similarity
    sims = cosine_similarity(dataMat, user)
    sims = sims.flatten() # cosine_similarity returns a nx1 array; flatten it into a 1d numpy array
    idx = np.argsort(sims)
    # Make sure we don't add items that are already rated by the user
    idx = np.array([i for i in idx if not (i in rated_items_indicies)])
    idx = idx[::-1]
    neigh_idx = idx[:k]
    neigh_sims = sims[neigh_idx]
    return neigh_idx, neigh_sims

#### Let's test this in a specific user (user with index 20)

In [52]:
u = 20
user = user_profiles[u]
pid = user_texts.iloc[u]["personId"]
rated_items = articles[articles["contentId"].isin(user_content[pid])]

In [53]:
rated_items_idx = np.array(rated_items.index)
rated_items_idx

array([1368, 1378, 1431, 1467, 1814], dtype=int64)

In [54]:
n_index, n_sims = content_based_recommend(art_mat2, user, rated_items_idx, 5)

In [55]:
n_index

array([ 631, 1043, 2157,  278, 1125], dtype=int64)

#### Here are the articles previosuly "liked" by this user:

In [56]:
rated_items["title"]

1368                                                   Promoting gender equality through emoji �� ��
1378                  Neural networks are inadvertently learning our language's hidden gender biases
1431    Glamour Exclusive: President Obama On Feminism and The World He Wants to Leave His Daughters
1467                                         The Genderbread Person v3 | It's Pronounced Metrosexual
1814                                                               Introducing Ask a Female Engineer
Name: title, dtype: object

#### And here are the top 5 recommendations based on the user's profile:

In [57]:
recs = articles.iloc[n_index]
recs

Unnamed: 0,contentId,title,text
631,764116021156146784,"If you think women in tech is just a pipeline problem, you haven't been paying attention - Tech ...","If you think women in tech is just a pipeline problem, you haven't been paying attention Accordi..."
1043,-96719740408441346,"In war for talent, 'brogrammers' will be losers","Start-ups are fighting a war for talent in Silicon Valley, and the companies that actively welco..."
2157,6463551093741815202,"Little girls doubt that women can be brilliant, study shows",WASHINGTON - Can women be brilliant? Little girls are not so sure. A study published Thursday in...
278,2927476152522512441,Equal Pay Day in the spotlight this year,"As if you hadn't heard - and surely, you have heard - women earn less money than men. The oft-ci..."
1125,-8412113620940365599,The best-and worst-places to be a working woman,The best-and worst-places to be a working woman by The Data Team TO MARK the United Nations ' In...


#### It's clear that the recommender is finding items that are quite similar to user's past interests.

#### Note that content_based_recommender function does not require that the target user should be an existing user in the database. Indeed, a new user who has rated some of the articles can be represented as a feature vector over the existing feature space and passed to this function. So, it has the "liveness" property in the sense that it can generate out-of-sample recommendations.

#### A note on evaluation:
<p>While it's possible to do rating prediction with content-based recommenders (for example in a manner similar to item-based collaborative filtering method), it is more common to view these systems as "top-n recommender systems". In other words, the main goal of the system is to return a ranked list of n recommendations for a given target user. There are a number of evaluation metrics such as precision, recall, hit ratio, and normalized discounted cumulative gain (NDCG) that can be used to evaluate the quality of the ranked list.</p>
<p>Typically, a portion of the items in each user profile in the test set is set aside for evaluation. The remaining part of the user profile is used to generate a recommendation list of size n. For example, precison@k for each test user can be computed by determining the ratio of top k recommended items that are present in the set aside portion of the profile. Average precision@k is then computed based on all test users. Recall, on the other hand can be computed based on the ratio of the set-aside items that are covered by top-k recommendation.</p>
<p>You can find a detailed discussion of evaluation metrics for recommendation, including the evaluation of ranked lists in Chapter 7 (Section 7.5) of the Aggarwal book: "Recommender Systems: A Texbook".</p>