# Recommendations with IBM (Submit Version)

In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform. 


You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page.  Either way, ensure that your code passes the project [RUBRIC](https://review.udacity.com/#!/rubrics/2322/view).  **Please save regularly.**

By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations. 


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

At the end of the notebook, you will find directions for how to submit your work.  Let's get started by importing the necessary libraries and reading in the data.

In [364]:
import random
from collections import Counter
from pprint import pp
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle

%matplotlib inline

df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

# Show df to get an idea of the data
df.head()

Unnamed: 0,article_id,title,email
0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Exploratory Data Analysis</a>

Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.

`1.` What is the distribution of how many articles a user interacts with in the dataset?  Provide a visual and descriptive statistics that summarize the number of times each user interacts with an article.  

In [365]:
# Number of interactions per user (groupby drops rows with a null email)
interactions_per_user = df.groupby('email').size()

median_val = interactions_per_user.median()
max_views_by_user = interactions_per_user.max()
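
The question above also asks for descriptive statistics and a visual, which the cell does not produce. A minimal sketch of how that could look, on a made-up toy interaction log (the `toy` frame and its values are assumptions for illustration; the notebook's real data lives in `df`):

```python
import pandas as pd

# Hypothetical toy interaction log, one row per user-article interaction.
toy = pd.DataFrame({
    'email': ['a', 'a', 'a', 'b', 'b', 'c'],
    'article_id': [1.0, 2.0, 2.0, 1.0, 3.0, 1.0],
})

# The group size is the number of interactions for each user.
interactions_per_user = toy.groupby('email').size()

toy_median = interactions_per_user.median()   # middle of the distribution
toy_max = interactions_per_user.max()         # the most active user
summary = interactions_per_user.describe()    # count, mean, std, quartiles
```

With matplotlib available, `interactions_per_user.plot.hist(bins=50)` would probably be the simplest visual, since a histogram shows how skewed the interaction counts are.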

`2.` Explore and remove duplicate articles from the **df_content** dataframe.  

In [366]:
# Find and explore duplicate articles

dups = df_content.groupby('article_id').size().sort_values(ascending=False) # duplicates of article id
dups_idx = dups[dups > 1].index # duplicates index

df_content[df_content['article_id'].isin(dups_idx)].sort_values(by='article_id') # retrieve duplicated content

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
50,Follow Sign in / Sign up Home About Insight Da...,Community Detection at Scale,Graph-based machine learning,Live,50
365,Follow Sign in / Sign up Home About Insight Da...,During the seven-week Insight Data Engineering...,Graph-based machine learning,Live,50
221,* United States\r\n\r\nIBM® * Site map\r\n\r\n...,When used to make sense of huge amounts of con...,How smart catalogs can turn the big data flood...,Live,221
692,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,Live,221
232,Homepage Follow Sign in Get started Homepage *...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
971,Homepage Follow Sign in Get started * Home\r\n...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
399,Homepage Follow Sign in Get started * Home\r\n...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
761,Homepage Follow Sign in Get started Homepage *...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
578,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577
970,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577


In [367]:
# Remove any rows that have the same article_id - only keep the first

df_content = df_content.drop_duplicates('article_id', keep='first')

`3.` Use the cells below to find:

**a.** The number of unique articles that have an interaction with a user.  
**b.** The number of unique articles in the dataset (whether they have any interactions or not).<br>
**c.** The number of unique users in the dataset. (excluding null values) <br>
**d.** The number of user-article interactions in the dataset.

In [368]:
unique_articles = df.article_id.nunique() # The number of unique articles that have at least one interaction
total_articles = df_content.article_id.nunique() # The number of unique articles on the IBM platform
unique_users = df.email.nunique() # The number of unique users
user_article_interactions = df.shape[0] # The number of user-article interactions

`4.` Use the cells below to find the most viewed **article_id**, as well as how often it was viewed.  After talking to the company leaders, the `email_mapper` function was deemed a reasonable way to map users to ids.  There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).

In [369]:
most_viewed_ds = df.groupby('article_id').size().sort_values(ascending=False).head(1)

# The most viewed article in the dataset as a string with one value following the decimal 
most_viewed_article_id = str(most_viewed_ds.index[0])

# The most viewed article in the dataset was viewed how many times?
max_views = most_viewed_ds.values[0]

In [370]:
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [371]:
## If you stored all your results in the variable names above, 
## you shouldn't need to change anything in this cell

sol_1_dict = {
    '`50% of individuals have _____ or fewer interactions.`': median_val,
    '`The total number of user-article interactions in the dataset is ______.`': user_article_interactions,
    '`The maximum number of user-article interactions by any 1 user is ______.`': max_views_by_user,
    '`The most viewed article in the dataset was viewed _____ times.`': max_views,
    '`The article_id of the most viewed article is ______.`': most_viewed_article_id,
    '`The number of unique articles that have at least 1 rating ______.`': unique_articles,
    '`The number of unique users in the dataset is ______`': unique_users,
    '`The number of unique articles on the IBM platform`': total_articles
}

# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)

It looks like you have everything right here! Nice job!


---

### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not.  We only know that a user has interacted with an article.  In these cases, the popularity of an article can really only be based on how often an article was interacted with.

`1.` Fill in the function below to return the **n** top articles, ordered from most to fewest interactions. Test your function using the tests below.

In [372]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    # Your code here
    article_ids_ds = df.groupby('article_id').size().sort_values(ascending=False).head(n)
    ids = article_ids_ds.index
    
    top_articles = []
    for i in ids:
        top_articles.append(df[df['article_id'] == i].head(1).title.values[0])
    
    
    
    return top_articles # Return the top article titles from df (not df_content)


def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles_ids - (list) A list of the top 'n' article ids 
    
    '''
    # Your code here
    article_ids_ds = df.groupby('article_id').size().sort_values(ascending=False).head(n)
    ids = article_ids_ds.index
    top_articles_ids = list(ids)
 
    return top_articles_ids # Return the top article ids

In [373]:
print(get_top_articles(10))
print(get_top_article_ids(10))

['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
[1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]


In [374]:
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)

# Test each of your three lists from above
t.sol_2_test(get_top_articles)

Your top_5 looks like the solution list! Nice job.
Your top_10 looks like the solution list! Nice job.
Your top_20 looks like the solution list! Nice job.
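
The two functions above filter `df` once per article to look up each title. One possible alternative, sketched on a made-up toy frame, builds a single id-to-title lookup instead. The names `toy`, `top_article_ids_sketch` and `top_article_titles_sketch` are hypothetical and not part of the project template:

```python
import pandas as pd

# Hypothetical toy interactions: article 10 is the most popular, then 20, then 30.
toy = pd.DataFrame({
    'article_id': [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    'title': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'gamma'],
})

def top_article_ids_sketch(n, interactions):
    """Return the n most-interacted article ids, most popular first."""
    # value_counts sorts by count in descending order by default.
    return interactions['article_id'].value_counts().head(n).index.tolist()

def top_article_titles_sketch(n, interactions):
    """Return the titles of the n most-interacted articles, most popular first."""
    # Build one id -> title lookup instead of filtering the frame per article.
    id_to_title = (interactions.drop_duplicates('article_id')
                               .set_index('article_id')['title'])
    return id_to_title.loc[top_article_ids_sketch(n, interactions)].tolist()
```

Articles with equal counts may come back in either order, so this sketch makes no promise about ties.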


---

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Use the function below to reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, then place a 1 where that user's row meets that article's column**.  It does not matter how many times a user has interacted with the article; every entry where a user has interacted with an article should be a 1.  


* **If a user has not interacted with an article, then place a 0 where that user's row meets that article's column**. 

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.

In [375]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    
    df_user_item = df.copy()
    
    # Fill in the function here
    df_user_item['title'] = 1
    df_user_item = df_user_item[['user_id', 'article_id', 'title']]
    
    user_item = pd.pivot_table(df_user_item, index='user_id', columns='article_id', values='title', fill_value=0)
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)

In [376]:
user_item

user_id \ article_id,0.0,2.0,4.0,8.0,9.0,12.0,14.0,15.0,16.0,18.0,...,1434.0,1435.0,1436.0,1437.0,1439.0,1440.0,1441.0,1442.0,1443.0,1444.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5145,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5146,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5147,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5148,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [377]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")

You have passed our quick tests!  Please proceed!
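
The pivot in `create_user_item_matrix` relies on the default mean aggregation of a column of 1s. A sketch of another way to get the same binary shape, which makes the "repeat interactions still count as 1" rule explicit, is shown below on made-up toy data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy interactions; user 1 read article 10 twice.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3],
    'article_id': [10.0, 10.0, 20.0, 20.0, 30.0],
})

# Drop repeat interactions so every (user, article) pair counts once,
# then count the pairs and spread the article ids across the columns.
toy_matrix = (toy.drop_duplicates(['user_id', 'article_id'])
                 .groupby(['user_id', 'article_id'])
                 .size()
                 .unstack(fill_value=0))
```

Here `toy_matrix` has one row per user and one column per article, with 0 filled in wherever a pair never occurred.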


`2.` Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar).  The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users. 

Use the tests to test your function.

In [378]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of each user to the provided user based on the dot product
    Returns an ordered list of user ids, from most to least similar
    
    '''
    
    user_ids_all = user_item.index
    
    # a dictionary holds [user id : similarity] value pairs
    dot_product_dict = {}

    # compute similarity of each user to the provided user
    for i in user_ids_all:
        dot_product = np.dot(user_item.loc[user_id:user_id,:], user_item.loc[i:i,:].transpose())
        dot_product_dict[i] = dot_product[0][0]
        
    # sort by highest (most similar), also drop against self    
    ds = pd.Series(dot_product_dict.values(), index=dot_product_dict.keys()).drop(index=user_id).sort_values(ascending=False)
    
    # extract only index as list of ids. 
    most_similar_users = list(ds.index)
       
    return most_similar_users # return a list of the users in order from most to least similar

In [379]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))

The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 46, 4201, 395]
The 5 most similar users to user 3933 are: [1, 23, 3782, 4459, 203]
The 3 most similar users to user 46 are: [4201, 23, 3782]
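
`find_similar_users` computes one dot product per user inside a Python loop. Because the matrix is binary, the same similarities can be obtained with a single matrix-vector product. A minimal sketch on a made-up 4-user matrix follows; `toy_matrix` and `find_similar_users_sketch` are hypothetical names:

```python
import pandas as pd

# Hypothetical binary user-item matrix: rows are users, columns are article ids.
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1],
     [1, 0, 0, 0]],
    index=[1, 2, 3, 4],
    columns=[10.0, 20.0, 30.0, 40.0],
)

def find_similar_users_sketch(user_id, matrix):
    """Return the other user ids ordered from most to least similar."""
    # One product gives the number of shared articles with every user at once.
    similarity = matrix.dot(matrix.loc[user_id])
    return (similarity.drop(index=user_id)            # a user is trivially similar to itself
                      .sort_values(ascending=False, kind='stable')
                      .index.tolist())
```

User 1 shares two articles with user 2, one with user 4 and none with user 3, so `find_similar_users_sketch(1, toy_matrix)` orders the neighbors as 2, 4, 3.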


`3.` Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend.  Complete the functions below to return the articles you would recommend to each user. 

In [380]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # Your code here
    
    article_names = []
    
    for i in article_ids:
        
        # print(i)

        name = df[df.article_id == i].title.head(1).values[0]

        article_names.append(name)
    
    return article_names # Return the article names associated with list of article ids



def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    # Your code here
    
    
    article_ids = list(user_item.loc[user_id][user_item.loc[user_id] > 0].index)
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names



def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # Your code here
    
    
    # get all similar user ids for the targeted user
    similar_uids = find_similar_users(user_id)


    # get all article ids of the targeted user
    article_ids_target_user = get_user_articles(user_id)[0]
    # print(f"[article_ids_target_user]:\n {article_ids_target_user} \n")

    
    # a list contain unseen articles to recommend
    recs = []

    for uid in similar_uids:

        # print(f"\n\n\n[number of recs]: {len(recs)}\n")
        if len(recs) == m:
            # print(f"Number of recs reaches threshold. Enough. Stop")
            break


        #print(f"[similar user id]: {uid}")

        # get this similar user's article ids
        article_ids_similar_user = get_user_articles(uid)[0]

        # compute the articles this similar user has seen that the targeted user has not
        # the order of the set subtraction matters
        set_diff = list(set(article_ids_similar_user) - set(article_ids_target_user))
        # print(f"[set_diff]:\n {set_diff} \n")
        
        # shuffle so that the choice among the unseen articles is arbitrary
        random.shuffle(set_diff)

        # add the differences of article ids to recs [], append only unique (no duplicate)
        for i in set_diff:
            if i not in recs and len(recs) < m:
                recs.append(i)
                # print(f"[id] {i} appended")

    return recs # return your recommendations for this user_id 

In [381]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

['twelve\xa0ways to color a map of africa using brunel',
 'optimizing a marketing campaign: moving from predictions to actions',
 'airbnb data for analytics: mallorca reviews',
 'intents & examples for ibm watson conversation',
 'web picks (week of 4 september 2017)',
 'awesome deep learning papers',
 'deep learning from scratch i: computational graphs',
 'fertility rate by country in total births per woman',
 'using machine learning to predict parking difficulty',
 'visualize data with the matplotlib library']

In [382]:
# In both datasets, df and df_content, the article ids are numeric.

print(df.article_id.dtype)
print(df_content.article_id.dtype)

float64
int64


In [383]:
# Test your functions here. 
# I use numeric ids instead of string ids for articles, since they are all numeric in the data.
# Keeping them numeric makes the ids consistent across this notebook.

assert set(get_article_names([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_article_names([1320.0, 232.0, 844.0])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set([1320.0, 232.0, 844.0])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")

If this is all you see, you passed all of our tests!  Nice job!


`4.` Now we are going to improve the consistency of the **user_user_recs** function from above.  

* When several users are equally close to a given user, do not choose among them arbitrarily: choose the users with the most total article interactions before those with fewer article interactions.


* When the number of recommended articles starts below m and would end above m for some user, do not choose that user's articles arbitrarily: choose the articles with the most total interactions before those with fewer total interactions. This ranking should match what would be obtained from the **get_top_articles** function you wrote earlier.

In [384]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # Your code here
    
    # get all neighbors ids
    nbh_ids = find_similar_users(user_id)
    
    
    # assemble a data matrix
    data_matrix = np.array([
    [
        x, # neighbor id
        np.dot(user_item.loc[user_id:user_id,:], user_item.loc[x:x,:].transpose())[0][0], # similarity score
        df[df.user_id == x].shape[0] # number of content interaction
    ] for x in nbh_ids])
    
    # make a dataframe
    neighbors_df = pd.DataFrame(data=data_matrix, 
                                columns=['neighbor_id', 'similarity', 'num_interactions'], 
                                index=data_matrix[:,0]).sort_values(by=['similarity', 'num_interactions'],
                                                                    ascending=[False, False])
    
    
    
    return neighbors_df # Return the dataframe specified in the doc_string




def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # Your code here
    
    
    # get all similar user ids for the targeted user
    # fetch with the 'neighbors_df'
    similar_uids = list(get_top_sorted_users(user_id).index)
    #print(f"[similar_uids]: \n{similar_uids}")


    # get all article ids of the targeted user
    article_ids_target_user = get_user_articles(user_id)[0]
    # print(f"[article_ids_target_user]:\n {article_ids_target_user} \n")

    
    # a list contain unseen articles to recommend
    recs = []

    for uid in similar_uids:

        # print(f"\n\n\n[number of recs]: {len(recs)}\n")
        if len(recs) == m:
            #print(f"Number of recs reaches threshold {m}. Enough. Stop")
            break


        #print(f"[similar user id]: {uid}")

        # get this similar user's article ids
        article_ids_similar_user = get_user_articles(uid)[0]

        # compute the articles this similar user has seen that the targeted user has not
        # the order of the set subtraction matters
        set_diff = list(set(article_ids_similar_user) - set(article_ids_target_user))
        #print(f"[set_diff before sort]:\n {set_diff} \n")
        
        # Sort the set. Determine with highest total interactions metric 
        set_diff = list(df[df.article_id.isin(set_diff)]['article_id'].value_counts().index)
        #print(f"[set_diff after sort]:\n {set_diff} \n")

        # add the differences of article ids to recs [], append only unique (no duplicate)
        for i in set_diff:
            if i not in recs and len(recs) < m:
                recs.append(i)
                #print(f"[id] {i} appended")
    
    
    
    
    rec_names = get_article_names(recs)
    
    return recs, rec_names

In [385]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

The top 10 recommendations for user 20 are the following article ids:
[1330.0, 1427.0, 1364.0, 1170.0, 1162.0, 1304.0, 1351.0, 1160.0, 1354.0, 1368.0]

The top 10 recommendations for user 20 are the following article names:
['insights from new york car accident reports', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model', 'model bike sharing data with spss', 'analyze accident reports on amazon emr spark', 'movie recommender system with spark machine learning', 'putting a human face on machine learning']
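
The tie-breaking rule for neighbors (similarity first, then total interactions) can also be expressed without recomputing a dot product per neighbor. The sketch below is one possible formulation on made-up toy data; `toy` and `top_sorted_users_sketch` are hypothetical names, and the column names simply mirror the `neighbors_df` docstring above:

```python
import pandas as pd

# Hypothetical toy interactions; users 2 and 3 are equally similar to user 1,
# but user 2 has more total interactions.
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 2, 3, 3, 4],
    'article_id': [10.0, 20.0, 10.0, 30.0, 30.0, 10.0, 40.0, 50.0],
})

def top_sorted_users_sketch(user_id, interactions):
    """Return neighbors sorted by similarity, then by number of interactions."""
    # Binary user-item matrix built from the unique (user, article) pairs.
    matrix = (interactions.drop_duplicates(['user_id', 'article_id'])
                          .groupby(['user_id', 'article_id']).size()
                          .unstack(fill_value=0))
    neighbors = pd.DataFrame({
        'similarity': matrix.dot(matrix.loc[user_id]),                   # shared articles
        'num_interactions': interactions['user_id'].value_counts(),     # total interactions
    }).drop(index=user_id)
    neighbors.index.name = 'neighbor_id'
    return neighbors.sort_values(['similarity', 'num_interactions'],
                                 ascending=[False, False])
```

In the toy data, users 2 and 3 each share one article with user 1, and user 2 is ranked first because of its three interactions against two.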


`5.` Use your functions from above to correctly fill in the solutions to the dictionary below.  Then test your dictionary against the solution.  Provide the code you need to answer each following the comments below.

In [386]:
### Tests with a dictionary of results

user1_most_sim = 3933 # Find the user that is most similar to user 1 
user131_10th_sim = 242 # Find the 10th most similar user to user 131

In [387]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)

This all looks good!  Nice job!


`6.` If we were given a new user, which of the above functions would you be able to use to make recommendations?  Explain.  Can you think of a better way we might make recommendations?  Use the cell below to explain a better method for new users.

**Answer** 

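A new user has no interaction history, so the collaborative filtering functions above (`find_similar_users`, `user_user_recs`, `user_user_recs_part2`) have nothing to compare against. This is the cold start problem. For such users we can fall back on a knowledge-based approach and recommend the most-interacted (most viewed) or trending content, using the rank-based functions `get_top_articles` and `get_top_article_ids`. 
Since the dataset has no timestamp attribute, trending content cannot be identified, so we recommend the most-interacted content.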

In [388]:
new_user = 0.0

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to 


new_user_recs = get_top_article_ids(10, df) # Your recommendations here

In [389]:
assert set(new_user_recs) == set([1314.0, 1429.0, 1293.0, 1427.0, 1162.0, 1364.0, 1304.0, 1170.0, 1431.0,
                                  1330.0]), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")

That's right!  Nice job!
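
The popularity fallback can be written so that it serves new and existing users alike: rank articles by popularity and skip anything the user has already seen. This is only a sketch on made-up toy data, and `popular_unseen_articles` is a hypothetical helper that is not part of the project template:

```python
import pandas as pd

# Hypothetical toy interactions; article 10 is the most popular.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'article_id': [10.0, 20.0, 10.0, 30.0, 10.0],
})

def popular_unseen_articles(user_id, interactions, m=2):
    """Return up to m popular article ids that the user has not interacted with."""
    # Most-interacted articles first.
    popular = interactions['article_id'].value_counts().index.tolist()
    # An unknown user simply has an empty set of seen articles.
    seen = set(interactions.loc[interactions['user_id'] == user_id, 'article_id'])
    return [article for article in popular if article not in seen][:m]
```

For an id that never appears in the data, the result is just the overall most popular articles, which matches the recommendation made for `new_user` above.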


---

### <a class="anchor" id="Content-Recs">Part IV: Content Based Recommendations (EXTRA - NOT REQUIRED)</a>

Another method we might use to make recommendations is to rank the articles associated with some term.  You might consider content to be the **doc_body**, **doc_description**, or **doc_full_name**.  There isn't one way to create a content based recommendation, especially considering that each of these columns holds content related information.  

`1.` Use the function body below to create a content based recommender.  Since there isn't one right answer for this recommendation tactic, no test functions are provided.  Feel free to change the function inputs if you decide you want to try a method that requires more input values.  The input values are currently set with one idea in mind that you may use to make content based recommendations.  One additional idea is that you might want to choose the most popular recommendations that meet your 'content criteria', but again, there is a lot of flexibility in how you might make these recommendations.

### This part is NOT REQUIRED to pass this project.  However, you may choose to take this on as an extra way to show off your skills.

In [390]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


from nltk.util import bigrams
from nltk.util import ngrams
from nltk.lm.preprocessing import flatten

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Constants and Reusable objects for tokenization
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

# Copy the article content dataframe for NLP processing, leaving df_content untouched.
df_nlp = df_content.copy()

[nltk_data] Downloading package punkt to /Users/apple/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/apple/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/apple/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [391]:
# Tokenization helper

def tokenize(text):
    '''
    Private tokenizer to transform each text.
    An NLP helper function that performs the following tasks:
    - Replace URLs
    - Normalize text
    - Remove punctuation
    - Tokenize words
    - Remove stop words
    - Lemmatize words
    :param text: A message text.
    :return: cleaned tokens extracted from the original message text.
    '''

    # print(f"original text: \n {text}")

    # replace urls
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # tokenize text
    tokens = word_tokenize(text)

    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens if word not in stop_words]

    # If no words remain after normalizing/lemmatizing, add a dummy element; otherwise the following
    # transformations may break
    if len(tokens) < 1:
        tokens = ['none']

    # print(f"tokens: \n {tokens} \n\n")
    return tokens

In [392]:
# Do some cleaning to the df_nlp dataset.


# The index and article id were found to be mismatched.
# Use the article id as the index to make it consistent and easier to process.
df_nlp.index = df_nlp.article_id


# Clean empty / missing body and desc content: replace missing values (NaN) with an 'empty' placeholder
df_nlp.loc[df_nlp[df_nlp.doc_body.isnull()].index, 'doc_body'] = 'empty'
df_nlp.loc[df_nlp[df_nlp.doc_description.isnull()].index, 'doc_description'] = 'empty'



In [393]:
# Internal private similarity functions for body, title, and desc respectively.
# These internal functions are called by the overall similarity function (a weighted sum of all similarities).



def _compute_article_body_similarity(article_id_1, article_id_2, data=df_nlp):
    # Compute the cosine similarity based on the TF-IDF of the body content (doc_body).
    # This is a private helper used by the overall similarity function.

    doc_a = ' '.join(tokenize(data.loc[article_id_1].doc_body))
    doc_b = ' '.join(tokenize(data.loc[article_id_2].doc_body))

    # combine into a list of documents
    documents = [doc_a, doc_b]

    # instantiate a scikit-learn TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # fit and transform to get a sparse matrix
    matrix = vectorizer.fit_transform(documents)

    # # Uncomment the lines below to examine details (term features, TF-IDF values)
    # ====================================
    # term_features =  vectorizer.get_feature_names_out()

    # # convert to a readable array
    # matrix_array = matrix.toarray()

    # # assemble into a dataframe for exploration
    # tfidf_dataframe = pd.DataFrame(data=matrix_array, columns=term_features)
    # ====================================

    return cosine_similarity(matrix[:1], matrix[1:])[0][0]



def _compute_article_title_similarity(article_id_1, article_id_2, data=df_nlp):
    # Compute the cosine similarity based on the TF-IDF of the title (doc_full_name).
    # Think of the title tag in SEO.
    # This is a private helper used by the overall similarity function.

    doc_a = ' '.join(tokenize(data.loc[article_id_1].doc_full_name))
    doc_b = ' '.join(tokenize(data.loc[article_id_2].doc_full_name))

    # combine into a list of documents
    documents = [doc_a, doc_b]

    # instantiate a scikit-learn TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # fit and transform to get a sparse matrix
    matrix = vectorizer.fit_transform(documents)

    # # Uncomment the lines below to examine details (term features, TF-IDF values)
    # ====================================
    # term_features =  vectorizer.get_feature_names_out()

    # # convert to a readable array
    # matrix_array = matrix.toarray()

    # # assemble into a dataframe for exploration
    # tfidf_dataframe = pd.DataFrame(data=matrix_array, columns=term_features)
    # ====================================

    return cosine_similarity(matrix[:1], matrix[1:])[0][0]



def _compute_article_desc_similarity(article_id_1, article_id_2, data=df_nlp):
    # Compute the cosine similarity based on the TF-IDF of the description (doc_description).
    # Think of the description tag in SEO.
    # This is a private helper used by the overall similarity function.

    doc_a = ' '.join(tokenize(data.loc[article_id_1].doc_description))
    doc_b = ' '.join(tokenize(data.loc[article_id_2].doc_description))

    # combine into a list of documents
    documents = [doc_a, doc_b]

    # instantiate a scikit-learn TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # fit and transform to get a sparse matrix
    matrix = vectorizer.fit_transform(documents)

    # # Uncomment the lines below to examine details (term features, TF-IDF values)
    # ====================================
    # term_features =  vectorizer.get_feature_names_out()

    # # convert to a readable array
    # matrix_array = matrix.toarray()

    # # assemble into a dataframe for exploration
    # tfidf_dataframe = pd.DataFrame(data=matrix_array, columns=term_features)
    # ====================================

    return cosine_similarity(matrix[:1], matrix[1:])[0][0]
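
The three helpers above differ only in which column they read, so the shared step can be sketched as one function over two plain strings. This is a minimal sketch, not the notebook's code: `pairwise_tfidf_cosine` is a hypothetical name, and it skips the `tokenize()` step so that it runs without the NLTK downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_tfidf_cosine(text_a, text_b):
    """Cosine similarity of two texts under a TF-IDF model fitted on just this pair."""
    # Row 0 holds text_a, row 1 holds text_b
    matrix = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(matrix[:1], matrix[1:])[0][0])


# Identical texts should score close to 1, texts with no shared terms score 0
same = pairwise_tfidf_cosine("apache spark sql", "apache spark sql")
disjoint = pairwise_tfidf_cosine("apache spark sql", "deep learning images")
print(same, disjoint)
```

Note that fitting the vectorizer on only two documents means the IDF part is computed from that pair alone, so a score depends on the pair rather than on the whole corpus.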

In [394]:
def compute_article_similarity(article_id_1, article_id_2):
    """
    Cosine similarity of the overall content details,
    taking into account: similarity_title, similarity_body, similarity_desc
    """
    
    # Calculate the similarity for body, title, desc, then combine them into a single value.
    # Think about how Google weights title, desc, body: SEO-wise, the title tag is very heavy.
    # So we take the 3 cosine similarity values and compute one weighted score.
    # What's the formula? Think of a course where assignments have weight x and the final exam has weight y,
    # and we want the total grade.
    # https://www.indeed.com/career-advice/career-development/how-to-calculate-weighted-average
    
    similarity_title = _compute_article_title_similarity(article_id_1,article_id_2)
    similarity_body = _compute_article_body_similarity(article_id_1,article_id_2)
    similarity_desc = _compute_article_desc_similarity(article_id_1,article_id_2)
    
    # a weighted sum calculation for the final score
    # I give title 0.5 weight, body 0.4 weight, desc 0.1 weight. (adjust if necessary)
    
    overall = similarity_title * 0.5 + similarity_body * 0.4 + similarity_desc * 0.1
    
    return overall

In [395]:
# Spot Check:

compute_article_similarity(55, 50)

0.07122043497708203
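
To make the weighted-average arithmetic concrete, here is a small sketch with made-up per-field scores. The function name `weighted_similarity` and the example scores are assumptions for illustration; only the weights 0.5 / 0.4 / 0.1 come from the cell above.

```python
def weighted_similarity(scores, weights):
    """Weighted average of per-field similarity scores (hypothetical helper)."""
    # Dividing by the weight total keeps the result in [0, 1]
    # even if the weights are later changed and no longer sum to 1
    total_weight = sum(weights.values())
    return sum(scores[field] * weights[field] for field in weights) / total_weight


weights = {"title": 0.5, "body": 0.4, "desc": 0.1}

# Made-up scores: 0.2 * 0.5 + 0.1 * 0.4 + 0.0 * 0.1, which is about 0.14
example = weighted_similarity({"title": 0.2, "body": 0.1, "desc": 0.0}, weights)
print(example)
```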

Find similar articles for a given article.

This only works on the df_content/df_nlp dataset, because it uses the full details of an article (title, desc, body), and this info is only available there.

The similarity is calculated in real time, so it might take some seconds: it loops over all articles and computes the cosine similarity (not the dot product) of each against the targeted article.

For articles not in the df_content/df_nlp dataset, another method is more suitable: loopup_similar_title_articles(), which checks only the title. 

In [396]:
# Find similar articles for a given article based on content similarity (overall of title + desc + body).
# This only works on the df_content/df_nlp dataset, because it uses the full details
# of an article (title, desc, body), which are only available there.
# Calculated in real time, so it might take some seconds
# (it loops over all articles against the targeted article, using cosine similarity, not dot product).
# For articles not in the df_content/df_nlp dataset, another method is more suitable:
# loopup_similar_title_articles(), which checks only the title.


def find_similar_content_articles(article_id, data=df_nlp):
    
    article_ids_all = data.index
    similarity_dict = {}
    
    for i in article_ids_all:
        # print(f"\n[i]: {i}")
        
        if i == article_id:
            continue

        similarity_score = compute_article_similarity(article_id,i)
        similarity_dict[i] = similarity_score
        # print(f"[similarity_score]: {similarity_score}")
    
    
    similarity_ds = pd.Series(data=similarity_dict.values(), 
                           index=similarity_dict.keys()).sort_values(ascending=False).index
    
    return list(similarity_ds)

In [444]:
# Spot Check: 
# Articles that are relevant to article id 420, based on content similarity (title, desc, body)
# Calculated in real time, so it might take some seconds
# (it loops over all articles against the targeted article, using cosine similarity, not dot product.)
start = time.time()
relevant_content_articles_for_420th = find_similar_content_articles(420)
end = time.time()

print(f"{(end - start) / 60} minutes")
print(relevant_content_articles_for_420th[:20])

0.32945080200831095 minutes
[389, 993, 949, 592, 714, 117, 678, 942, 231, 925, 15, 463, 353, 835, 907, 284, 977, 600, 595, 997]


In [445]:
# Spot Check:
# Checking relevancy. They are ordered by relevancy.
# df_nlp is indexed by article id, so look the ids up by label with .loc

df_nlp.loc[relevant_content_articles_for_420th].head(20)[['doc_full_name','doc_description','doc_body']]

# They all look related to 'Apache Spark'

article_id,doc_full_name,doc_description,doc_body
389,Apache Spark™ 2.0: Impressive Improvements to ...,What a difference a version number makes! With...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
993,Configuring the Apache Spark SQL Context,The Apache Spark website documents the propert...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
949,Apache Spark SQL Analyzer Resolves Order-by Co...,The Apache Spark SQL component has several sub...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
592,Apache Spark Analytics,Combine Apache® Spark™ with other cloud servic...,APACHE SPARK ANALYTICSCombine Apache® Spark™ w...
714,A Survey of Books about Apache Spark™,From the big crop of books about Apache Spark™...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
117,Apache Spark™ 2.0: Migrating Applications,This post provides a brief summary of sample c...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
678,Spark SQL - Rapid Performance Evolution,Spark SQL Version 1.6 runs queries faster! Tha...,{ spark .tc } * Community\r\n * Projects\r\n *...
942,"Interview with Sean Li, New Apache Spark™ Comm...",Sean looks back on his first encounter with Sp...,* Home\r\n * Community\r\n * Projects\r\n * Bl...
231,Speed your SQL Queries with Spark SQL,Get faster queries and write less code too. Le...,Skip to main content IBM developerWorks / Deve...
925,Build SQL queries with Apache Spark in DSX,This video shows you how to use the Spark SQL ...,Skip navigation Sign in SearchLoading...\r\n\r...


### Thoughts

find_similar_content_articles() above works well. However, it calculates on the fly, so the results take time to load. 

Such real-time calculation is too expensive in terms of user experience. (Waiting xx seconds for similar articles)

To improve the experience, we can pre-calculate and cache the cosine similarity scores.

So we make an article_article similarity dataframe for lookup (the content-content recs). 

(People who view article X might also be interested in article Y, based on the article-article NLP cosine similarity.)

#### Approach

Make a dataframe that stores the article-article similarity. (Based on the df_nlp dataframe, the cleaned df_content)

In [399]:
# Make an article_article placeholder dataframe (all zeros) based on the shape of df_nlp

article_article = pd.DataFrame(
    data=np.zeros((len(df_nlp.index), len(df_nlp.index))),
    index=df_nlp.index,
    columns=df_nlp.index
)

article_article.head()

article_id,0,1,2,3,4,5,6,7,8,9,...,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [400]:
# Compute the article_article cosine similarity for each pair of articles, and store it in the article_article dataframe

# ATTENTION: the iteration takes hours to finish (roughly 1050 x 1050 article pairs).

def compute_and_update_similarity():
    
    start = time.time()


    # Number of articles
    len_articles = article_article.shape[0]

    # List of index of articles
    idx_articles = list(article_article.index)


    # Loop through each article: for each row, loop over every column.
    for i in idx_articles:
        print(f"Processing row {i} of {len_articles-1}.")

        for j in idx_articles:
            article_article.loc[i, j] = compute_article_similarity(i, j)


    end = time.time()
    print(f"{(end - start) / 60} minutes")
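
One possible way to shorten the hours-long loop, sketched under assumptions: fit one TF-IDF model per text field over all articles at once and let `cosine_similarity` return the full matrix in a single call. This is not the method used to build the pickled matrix below. `build_similarity_frame`, the toy frame, and the two-field weights are hypothetical, and because IDF is fitted on the whole corpus rather than on each pair, the scores would differ from the pairwise ones.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_similarity_frame(docs, weights):
    """docs: DataFrame indexed by article id, with one text column per weighted field."""
    total = pd.DataFrame(0.0, index=docs.index, columns=docs.index)
    for column, weight in weights.items():
        # One TF-IDF matrix for the whole column, one row per article
        tfidf = TfidfVectorizer().fit_transform(docs[column])
        # cosine_similarity on a single matrix returns all pairwise scores
        total += weight * cosine_similarity(tfidf)
    return total


# Toy stand-in for df_nlp (already cleaned text, hypothetical ids)
toy_docs = pd.DataFrame(
    {
        "title": ["spark sql tuning", "spark sql basics", "deep learning intro"],
        "body": ["tune spark sql queries", "learn spark sql", "train neural networks"],
    },
    index=[10, 11, 12],
)
toy_sim = build_similarity_frame(toy_docs, {"title": 0.6, "body": 0.4})
print(toy_sim.round(3))
```

The matrix is symmetric with ones on the diagonal, so even the loop version only needs to compute the pairs above the diagonal.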
    


#### Uncomment the cell below ONLY if you want to run the compute and update again. (Warning: it took hours)

You should not need to do so, as I have already done it and exported the result to a pkl file.
(**It took 8 hours**)

It is ready to use. You can download it from this link:

Just load it and use it.

In [401]:
# Check info above. Uncomment only if necessary

# compute_and_update_similarity()
# article_article.to_pickle('article_article_similarity_df.pkl')

In [402]:
# Load from pre-calculated pkl. 

article_article = pd.read_pickle('article_article_similarity_df.pkl')


# All calculated and cached cosine similarity score for every article. Ready to use.
article_article.head()

article_id,0,1,2,3,4,5,6,7,8,9,...,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050
0,1.0,0.032471,0.07246,0.020439,0.259333,0.03341,0.016156,0.051355,0.046419,0.047824,...,0.063201,0.051671,0.143709,0.020163,0.010646,0.003492,0.004095,0.028436,0.0,0.078722
1,0.032471,1.0,0.186967,0.028671,0.093232,0.098205,0.035012,0.155317,0.206362,0.103742,...,0.104562,0.152361,0.093032,0.030101,0.08249,0.014128,0.01935,0.056505,0.010927,0.090637
2,0.07246,0.186967,1.0,0.053018,0.172458,0.106455,0.041646,0.204603,0.339824,0.153497,...,0.149979,0.208527,0.139453,0.036274,0.009098,0.009416,0.01393,0.104403,0.018914,0.160396
3,0.020439,0.028671,0.053018,1.0,0.027609,0.037732,0.084109,0.048761,0.031276,0.049991,...,0.027299,0.026927,0.033499,0.010938,0.036107,0.012851,0.011922,0.010652,0.009425,0.036531
4,0.259333,0.093232,0.172458,0.027609,1.0,0.09599,0.02699,0.154261,0.135846,0.152328,...,0.122135,0.141956,0.137278,0.02467,0.007071,0.006955,0.003037,0.038724,0.072108,0.138237


In [403]:
# Instead of computing similarities in real time with find_similar_content_articles(),
# look them up in the pre-calculated article_article dataframe.
# It saves time for the user.

def loopup_similar_content_articles(article_id, data=article_article, n=20):
    """
    Look up the n articles most similar to article_id.

    NOTE: the cosine similarity score is based on content (title + desc + body),
    so only ids that are in df_content work with this method.
    Otherwise, use loopup_similar_title_articles().
    
    input: n - number of top similar articles to return
    
    """

    # Check against the dataframe passed in, not the global article_article
    if article_id in data.index.values:

        ids = list(
            data.loc[article_id][data.loc[article_id].index != article_id].sort_values(ascending=False).head(n).index)
        names = list(df_nlp.loc[ids].doc_full_name.values)

    else:
        print(f"Article {article_id} is not in df_content.")
        print(f"We are unable to compute overall content similarity for it.")
        print(f"Alternatively, you might try loopup_similar_title_articles() which is based on title relevancy.")
        ids = []
        names = []

    return ids, names
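
The lookup itself boils down to "take one row, drop the article itself, keep the n largest". A minimal sketch on a toy similarity matrix, where `top_n_similar` and the toy values are assumptions for illustration:

```python
import pandas as pd


def top_n_similar(sim, article_id, n=3):
    """Return the ids of the n most similar articles, excluding article_id itself."""
    if article_id not in sim.index:
        return []
    # Drop the self-similarity entry (always 1.0), then keep the n largest scores
    row = sim.loc[article_id].drop(labels=article_id)
    return row.nlargest(n).index.tolist()


# Toy symmetric similarity matrix for three articles
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7], [0.2, 1.0, 0.1], [0.7, 0.1, 1.0]],
    index=[0, 1, 2],
    columns=[0, 1, 2],
)
neighbours = top_n_similar(toy_sim, 0, n=2)
print(neighbours)
```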


In [404]:
# Spot Check:
# Lookup most relevant articles for 455th article

relevant_for_455th_article = loopup_similar_content_articles(455, n=30)

print(relevant_for_455th_article[0])
relevant_for_455th_article[1]

# The titles reveal that they are about 'machine learning'.

[800, 1035, 313, 805, 444, 721, 260, 967, 122, 96, 384, 809, 124, 723, 234, 892, 54, 253, 812, 861, 412, 732, 871, 74, 221, 89, 479, 616, 500, 567]


['Machine Learning for the Enterprise',
 'Machine Learning for the Enterprise.',
 'What is machine learning?',
 'Machine Learning for everyone',
 'Declarative Machine Learning',
 'The power of machine learning in Spark',
 'The Machine Learning Database',
 'ML Algorithm != Learning Machine',
 'Watson Machine Learning for Developers',
 'Improving quality of life with Spark-empowered machine learning',
 'Continuous Learning on Watson',
 'Use the Machine Learning Library',
 'Python Machine Learning: Scikit-Learn Tutorial',
 '10 Essential Algorithms For Machine Learning Engineers',
 '3 Scenarios for Machine Learning on Multicloud',
 'Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity',
 '8 ways to turn data into value with Apache Spark machine learning',
 'Lifelong (machine) learning: how automation can help your models get smarter over time',
 'Machine Learning Exercises In Python, Part 1',
 'Cleaning the swamp: Turn your data lake into a source of crystal-c

### Side Note on Dataset Inconsistency: 

- df_content (page content of articles)
- df (interactions between users and articles)

The original df_content only contains articles with ids in [0 - 1050].

However, the df (interaction) dataset also contains article ids beyond 1050, e.g. 11xx, 12xx, 13xx, 14xx.

So the two datasets are out of sync.

The given df_content lags behind, which means that:

**articles in the df (interaction) dataset might not be found in the df_content (content info) dataset.**

Therefore, we will not see content details for some articles mentioned in df (the interaction set).

This inconsistency is a limitation of the given datasets. 
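
The mismatch can be checked with a set difference over the two id columns. The sketch below uses toy frames with made-up ids; only the column name `article_id` and the float-vs-int id types are taken from the datasets above.

```python
import pandas as pd

# Toy stand-ins: the interaction set stores ids as floats, the content set as ints
toy_interactions = pd.DataFrame({"article_id": [1.0, 2.0, 1430.0, 1430.0]})
toy_content = pd.DataFrame({"article_id": [0, 1, 2]})

# Bring both id columns to plain ints before comparing
interaction_ids = {int(i) for i in toy_interactions["article_id"]}
content_ids = {int(i) for i in toy_content["article_id"]}

# Ids that users interacted with but that have no content details
missing_from_content = sorted(interaction_ids - content_ids)
print(missing_from_content)
```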

---

Because of the inconsistency described above,

let's create a unified article data frame that **stores every unique article's \[ID\] and \[TITLE\]**.

This dataset serves to look up an article's title. (Think of an index of all articles).

The bottom line of the inconsistency: **not every article has body/desc info, but all articles DO have titles**.

The idea is that **for those articles where we cannot apply NLP to \[TITLE + DESC + BODY\], we can at least apply it to \[TITLE\]**.

In [405]:
# Make a dataframe of all titles. Store every unique article.
# Schema: article id, article title

def make_titles_df(df=df, df_content=df_content):
    """
    Generate an article index dataframe from the original df and df_content.
    Only store the article id and title for every unique article.
    """
    
    
    unique_ids_in_df = sorted(list(df.article_id.astype('int64').value_counts().index))
    unique_ids_in_df_content = sorted(list(df_content.article_id.astype('int64').value_counts().index))
    
    # ids that are not in df_content, but appear in df
    ids_not_in_df_content = list(set(unique_ids_in_df) - set(unique_ids_in_df_content))
    # print(f"How many?: {len(ids_not_in_df_content)}")
    
    # Subset the diff articles dataframe
    ids_not_in_df_content_float = [float(x) for x in ids_not_in_df_content]
    df_diff = df[df.article_id.isin(ids_not_in_df_content_float)][['article_id', 'title']].drop_duplicates()
    df_diff.article_id = df_diff.article_id.astype('int64')
    df_diff = df_diff.sort_values(by='article_id', ascending=True)
    
    # Subset id and title from df_content, renaming the title column to match df_diff
    df_content_titles_subset = df_content[['article_id', 'doc_full_name']].rename(
        columns={'doc_full_name': 'title'})
    
    # Stack both subsets into one dataframe
    titles_df = pd.concat([df_diff, df_content_titles_subset], ignore_index=True)
    
    # Index by article id for lookups, then sort
    titles_df.index = titles_df.article_id
    titles_df = titles_df.sort_index()
    
    return titles_df
    
    
    
# Get titles df
titles_df = make_titles_df()

# Spot check titles df
titles_df

article_id,article_id,title
0,0,Detect Malfunctioning IoT Sensors with Streami...
1,1,Communicating data science: A guide to present...
2,2,"This Week in Data Science (April 18, 2017)"
3,3,DataLayer Conference: Boost the performance of...
4,4,Analyze NY Restaurant data using Spark in DSX
...,...,...
1440,1440,world marriage data
1441,1441,world tourism data by the world tourism organi...
1442,1442,worldwide county and region - national account...
1443,1443,worldwide electricity demand and production 19...


---

### Also Find Similarity Based on Title Only 

As we can see, some articles have no content details available.

For these items, we use the title to find similarity. 

We can compute in real time across all articles or, as above, precalculate and cache a title-title dataframe for lookup.

In [451]:
# Find similar articles for a given article based on title-only similarity.
# This works on all articles in df or df_content,
# because it uses only the TITLE of an article, which is available in both.
# Calculated in real time, so it might take some seconds
# (it loops over all articles against the targeted article, using cosine similarity, not dot product).
# For articles that are also in the df_content/df_nlp dataset, another method is suitable:
# loopup_similar_content_articles(), which checks more content information.


def find_similar_title_articles(article_id, data=titles_df):
    
    """
    Note that it is based on the page title only.
    According to the original datasets, every article has a title, but not all articles have body/desc info.
    
    If the article is in df_content, we consider it 'lucky': we can use NLP based on its content.
    So at the end we also check whether it is 'lucky'; if so, we print a hint that
    find_similar_content_articles() can also be run for the article.
    """
    
    if article_id not in data.index:
        print(f"Article {article_id} is not found.")
        print(f"Please double check the id or try another.")
        result = []
        
    else:
        
        # similarities holder
        similarity_dict = {}

        # loop through every article and compare it against article_id

        for i in data.article_id.to_list():
            # print(f"\n[i]: {i}")

            if i == article_id:
                # if against self, skip to next loop
                continue

            # Tokenize self and another. 
            # Rejoin as documents.
            doc_a = ' '.join(tokenize(data.loc[article_id].title))
            doc_b = ' '.join(tokenize(data.loc[i].title))


            # combine to a list of documents
            documents = [doc_a, doc_b]


            # instantiate a scikit-learn TF-IDF vectorizer
            vectorizer = TfidfVectorizer()

            # fit transform to get a sparse matrix
            matrix = vectorizer.fit_transform(documents)


            # # Uncomment the lines below to explore details (term features, TF-IDF values)
            # ====================================
            # term_features =  vectorizer.get_feature_names_out()

            # # convert to readable array 
            # matrix_array = matrix.toarray()

            # # assemble into a dataframe for exploration
            # tfidf_dataframe = pd.DataFrame(data=matrix_array, columns=term_features)
            # ====================================

            similarity_score = cosine_similarity(matrix[:1], matrix[1:])[0][0]
            similarity_dict[i] = similarity_score
            # print(f"[similarity_score]: {similarity_score}")


        similarity_ds = pd.Series(data=similarity_dict.values(), 
                               index=similarity_dict.keys()).sort_values(ascending=False).index

        result = list(similarity_ds)
        
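        # Reminder check: does df_content also hold this article id?
        # (compare against the id values, not the positional index)
        if article_id in df_content.article_id.values: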
            print(f"We also have content details (desc, body) information for this article: {article_id}.")
            print(f"Alternatively, you might want to try find_similar_content_articles() for it.")
            print(f"It might give you better result, because it checks on title, desc, body")
    
    return result


In [463]:
# Spot Check:

# Relevant articles for 1420, based on title NLP cosine calculation.

relevant_title_articles_for_1420th = find_similar_title_articles(1420)

titles_df.loc[[1420] + relevant_title_articles_for_1420th].head(20)


ValueError: cannot insert article_id, already exists

---

In [406]:
# helper functions: get n-grams (bigrams here)

# Why bigrams, not trigrams or everygrams? Through experiment, I found that
# bigrams capture the meaning well,
# trigrams often go off track, and everygrams are too verbose and lose the meaning.
# It varies case by case; for these articles, however, bigrams work well.


def get_content_bigrams_by_article_id(article_id, data=df_nlp):
    """
    Input an article id, return the bigrams of its content,
    ordered from most to least frequent.
    
    Note that the content is the combined string of title, desc, body.
    This assumes the article id is IN df_content; if the article is not, we will
    not be able to get its desc and body content.
    
    For articles not in df_content, only in df (interaction set), use the alternative method
    get_title_bigrams_by_article_id(), which gets n-grams only from the title.
    
    """
    
    if article_id not in data.index:
        print(f"{article_id} is not found in article content details dataset.")
        print(f"You can try ngrams from title. \nAlternative method: get_title_bigrams_by_article_id()")
        return []
    
    else:
        title = data.loc[article_id].doc_full_name
        desc = data.loc[article_id].doc_description
        body = data.loc[article_id].doc_body

        # a list of tokens
        tokened_text = tokenize(title) + tokenize(desc) + tokenize(body)

        # ngrams 2
        grams = [f"{x[0]} {x[1]}" for x in list(bigrams(tokened_text))]

        # sort with frequency descending
        sorted_grams = [x[0] for x in Counter(grams).most_common()]

        return sorted_grams
    

    

    
def get_title_bigrams_by_article_id(article_id, data=titles_df):
    """
    Input an article id, return the bigrams of its title,
    ordered from most to least frequent.
    
    Note that it is based on the page title only.
    According to the original datasets, every article has a title, but not all articles have body/desc info.
    
    If the article is in df_content, we consider it 'lucky': we can use NLP on its content.
    So we do another check here to see whether it is 'lucky'; if so, we print a hint that
    get_content_bigrams_by_article_id() can also be run for the article.
    
    """
    if article_id not in data.index:
        print(f"{article_id} is not found in article titles dataset.")
        print(f"We don't have any information for this article. \nPlease check article id again.")
        return []
    
    else:
    
        title = data.loc[article_id].title

        # a list of tokens
        tokened_text = tokenize(title)

        # ngrams 2
        grams = [f"{x[0]} {x[1]}" for x in list(bigrams(tokened_text))]

        # sort with frequency descending
        sorted_grams = [x[0] for x in Counter(grams).most_common()]
        
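        # Reminder check: does df_content also hold this article id?
        # (compare against the id values, not the positional index)
        if article_id in df_content.article_id.values:
            print(f"We also have content details, e.g. desc, body information for this article: {article_id}.")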
            print(f"Alternatively, you might want to try get_content_bigrams_by_article_id() for it.")
        
        return sorted_grams
    


In [407]:
# Spot Check

get_content_bigrams_by_article_id(949)

['logical plan',
 'apache spark',
 'spark sql',
 'c 3',
 'sort operator',
 'parsed logical',
 'make sure',
 'community project',
 'project blog',
 'spark spark',
 'aggregate operator',
 'a1 14',
 'a2 15',
 'sql analyzer',
 'analyzer resolve',
 'resolve order',
 'order column',
 'column apache',
 'sql component',
 'component several',
 'several sub',
 'sub component',
 'component including',
 'including analyzer',
 'analyzer play',
 'play important',
 'important role',
 'role making',
 'making sure',
 'sure logical',
 'plan fully',
 'fully resolved',
 'resolved end',
 'end analysis',
 'analysis phase',
 'phase analyzer',
 'analyzer take',
 'take parsed',
 'plan input',
 'input make',
 'sure table',
 'table reference',
 'reference attribute',
 'attribute column',
 'column reference',
 'reference function',
 'function reference',
 'reference resolved',
 'resolved looking',
 'looking metadata',
 'metadata catalog',
 'catalog work',
 'work applying',
 'applying set',
 'set rule',
 'rule log

In [408]:
# Spot Check

get_title_bigrams_by_article_id(1400)

['uci ml',
 'ml repository',
 'repository chronic',
 'chronic kidney',
 'kidney disease',
 'disease data',
 'data set']
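
To show what the two bigram helpers do with a token list, here is a dependency-free sketch. It pairs adjacent tokens with `zip` as a stand-in for `nltk.util.bigrams`, and `ranked_bigrams` and the toy tokens are names assumed for this example.

```python
from collections import Counter


def ranked_bigrams(tokens):
    """Return the distinct bigrams of a token list, most frequent first."""
    # zip(tokens, tokens[1:]) pairs each token with its right-hand neighbour
    counts = Counter(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    # most_common() sorts by count, keeping first-seen order among ties
    return [gram for gram, _ in counts.most_common()]


toy_tokens = ["spark", "sql", "logical", "plan", "spark", "sql"]
ranked = ranked_bigrams(toy_tokens)
print(ranked)
```

Because the tokens of title, desc, and body are concatenated before pairing, a bigram can span a field boundary, which explains pairs such as 'column apache' in the output above.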