# Recommendations with IBM

In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform. 


You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page.  Either way, make sure that your code passes the project [RUBRIC](https://review.udacity.com/#!/rubrics/2322/view).  **Please save regularly.**

By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations. 


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

At the end of the notebook, you will find directions for how to submit your work.  Let's get started by importing the necessary libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle

%matplotlib inline

df = pd.read_csv('../data/user-item-interactions.csv')
df_content = pd.read_csv('../data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

# Show df to get an idea of the data
df.head()

| | article_id | title | email |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7 |
| 1 | 1314.0 | healthcare python streaming application demo | 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b |
| 2 | 1429.0 | use deep learning for image classification | b96a4f2e92d8572034b1e9b28f9ac673765cd074 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7 |
| 4 | 1276.0 | deploy your python model as a restful api | f01220c46fc92c6e6b161b1849de11faacd7ccb2 |


In [2]:
# Show df_content to get an idea of the data
df_content.head()

| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 0 | Skip navigation Sign in SearchLoading...\r\n\r... | Detect bad readings in real time using Python ... | Detect Malfunctioning IoT Sensors with Streami... | Live | 0 |
| 1 | No Free Hunch Navigation * kaggle.com\r\n\r\n ... | See the forest, see the trees. Here lies the c... | Communicating data science: A guide to present... | Live | 1 |
| 2 | ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... | Here’s this week’s news in Data Science and Bi... | This Week in Data Science (April 18, 2017) | Live | 2 |
| 3 | DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... | Learn how distributed DBs solve the problem of... | DataLayer Conference: Boost the performance of... | Live | 3 |
| 4 | Skip navigation Sign in SearchLoading...\r\n\r... | This video demonstrates the power of IBM DataS... | Analyze NY Restaurant data using Spark in DSX | Live | 4 |


### <a class="anchor" id="Exploratory-Data-Analysis">Part I: Exploratory Data Analysis</a>

This part of the exercise was completed in the first Jupyter notebook, "1-Exploratory_Data_Analysis". Only the mapping from email to user_id is needed here.

In [3]:
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()

| | article_id | title | user_id |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | 1 |
| 1 | 1314.0 | healthcare python streaming application demo | 2 |
| 2 | 1429.0 | use deep learning for image classification | 3 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 4 |
| 4 | 1276.0 | deploy your python model as a restful api | 5 |
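
The `email_mapper` function above assigns ids in order of first appearance. As a minimal sketch of the same idea, `pd.factorize` can produce such codes in one call; the toy `emails` series below is hypothetical and only stands in for `df['email']`.

```python
import pandas as pd

# Hypothetical stand-in for the `email` column
emails = pd.Series(["a@x.org", "b@x.org", "a@x.org", "c@x.org", "b@x.org"])

# factorize assigns 0-based codes in order of first appearance;
# adding 1 gives 1-based ids like the ones email_mapper produces
codes, uniques = pd.factorize(emails)
user_ids = (codes + 1).tolist()

print(user_ids)
```

Repeated emails receive the same id, so the number of distinct ids equals the number of distinct users.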


### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

This part of the exercise was completed in the second Jupyter notebook, "2-Rank-Based_Recommendations". None of its results are needed in this notebook.

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Use the function below to reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, place a 1 where that user's row meets that article's column**.  It does not matter how many times the user has interacted with the article; every such entry should be a 1.  


* **If a user has not interacted with an article, place a 0 where that user's row meets that article's column**. 

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.

In [4]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    # Build an all-zero matrix, keeping the ids in order of first appearance
    user_ids = df.user_id.unique()
    article_ids = df.article_id.unique()
    user_item = pd.DataFrame(data = np.zeros((len(user_ids), len(article_ids))), index = user_ids, columns = article_ids)
    
    # Group by a scalar key (not a one-element list) so that each group key is a plain user id
    for user_id, user_articles in df.groupby(by = 'user_id')['article_id']:
        # It does not matter how often a user has interacted with an article
        user_item.loc[user_id, user_articles.unique()] = 1
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)

In [5]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")

You have passed our quick tests!  Please proceed!
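
The four rules for the user-item matrix can also be met without an explicit loop. The following is a minimal sketch on hypothetical toy data, not the notebook's `df`: `pd.crosstab` counts interactions per (user, article) pair, and clipping at 1 turns repeat views into a single flag. Unlike `create_user_item_matrix`, it sorts rows and columns by id instead of keeping the order of first appearance.

```python
import pandas as pd

# Hypothetical toy interactions; user 1 viewed article 10.0 twice
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3, 3],
    "article_id": [10.0, 10.0, 20.0, 20.0, 30.0, 10.0],
})

# Count interactions per (user, article) pair, then cap each count at 1
toy_user_item = pd.crosstab(toy["user_id"], toy["article_id"]).clip(upper=1)

print(toy_user_item)
```

Each user appears in exactly one row and each article in exactly one column, which is the structure the tests above check on the real data.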


`2.` Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar).  The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users. 

Use the tests to test your function.

In [6]:
def calc_dot_product(user_item=user_item):
    return user_item.dot(user_item.transpose())
dot_product = calc_dot_product()


In [7]:
def find_similar_users(user_id, user_item=user_item, dot_product = dot_product):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    dot_product - (pandas dataframe) user-by-user matrix of dot products, as returned by calc_dot_product
    
    OUTPUT:
    similar_users - (pandas Index) user ids ordered so that the closest users (largest dot product users)
                    are listed first
    
    Description:
    Looks up the precomputed dot-product similarity between user_id and every other user
    Returns the other users' ids ordered from most to least similar
    
    '''
    # the similarity of each user to the provided user is already stored in dot_product

    # sort by similarity and keep just the ids
    most_similar_users = dot_product.loc[user_id].sort_values(ascending = False).index
    
    # remove the own user's id
    most_similar_users = most_similar_users.drop(labels=[user_id])
    
    return most_similar_users # return the user ids in order from most to least similar

In [8]:
find_similar_users(131, dot_product = dot_product)[:10]

Int64Index([3870, 3782, 23, 4459, 203, 98, 3764, 3697, 49, 242], dtype='int64')

In [9]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1,dot_product = dot_product)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933, dot_product = dot_product)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46, dot_product = dot_product)[:3]))

The 10 most similar users to user 1 are: Int64Index([3933, 23, 3782, 203, 4459, 131, 3870, 46, 4201, 5041], dtype='int64')
The 5 most similar users to user 3933 are: Int64Index([1, 23, 3782, 4459, 203], dtype='int64')
The 3 most similar users to user 46 are: Int64Index([4201, 23, 3782], dtype='int64')
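
To see why the dot product is a reasonable similarity for binary rows, note that it counts the articles two users have both interacted with. A minimal sketch on a hypothetical 3-user matrix:

```python
import pandas as pd

# Hypothetical binary user-item matrix (rows: users, columns: articles)
toy_user_item = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 1, 0]],
    index=[1, 2, 3],
    columns=[10.0, 20.0, 30.0, 40.0],
)

# Entry (u, v) is the number of articles that users u and v share
similarity = toy_user_item.dot(toy_user_item.T)

# Neighbors of user 1, most similar first, with user 1 itself removed
neighbors = similarity.loc[1].drop(1).sort_values(ascending=False)

print(similarity)
print(neighbors.index.tolist())
```

Users 1 and 2 share articles 10.0 and 40.0, so their similarity is 2. The diagonal holds each user's own article count, which is why the user's own id has to be dropped before ranking.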


`3.` Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend.  Complete the functions below to return the articles you would recommend to each user. 

In [10]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # Your code here
    article_names = [df[df.article_id == tmp_id].iloc[0]['title'] for tmp_id in article_ids]
    
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    # Your code here
    # Extract the row of the user: user_item.loc[user_id]
    # Which articles have been read by the user? - (user_item.loc[user_id] == 1)
    #print(user_item.loc[user_id].index[(user_item.loc[user_id] == 1)].tolist())
    article_ids = user_item.loc[user_id].index[(user_item.loc[user_id] == 1)].tolist()
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names


def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (set) a set of recommended article ids for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    similar_users = find_similar_users(user_id, dot_product = dot_product)
    user_article_ids, __ = get_user_articles(user_id, user_item=user_item)
    # Start with an empty set so that recs is defined even if fewer than m articles are found
    recs = set()
    
    for sim_user_id in similar_users:
        sim_user_article_ids, __ = get_user_articles(sim_user_id, user_item=user_item)
        recs.update([item for item in sim_user_article_ids if item not in user_article_ids])
        
        if len(recs) >= m:
            # The last analysed user may provide more articles than needed:
            # remove arbitrary surplus items until exactly m remain
            while len(recs) > m:
                recs.pop()
            break
    
    return recs # return your recommendations for this user_id    

In [11]:
user_user_recs(10, m=2)

{485.0, 996.0}
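
The loop in `user_user_recs` can be traced on a small example. The sketch below is a simplified, self-contained variant, not the notebook's function: `toy_recs` and the toy matrix are hypothetical, and it returns an ordered list and stops exactly at `m` instead of trimming a set.

```python
import pandas as pd

# Hypothetical binary user-item matrix
toy_user_item = pd.DataFrame(
    [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]],
    index=[1, 2, 3, 4],
    columns=[10.0, 20.0, 30.0, 40.0, 50.0],
)

def toy_recs(user_id, user_item, m):
    """Collect up to m unseen articles, visiting neighbors from most to least similar."""
    sim = user_item.dot(user_item.T).loc[user_id].drop(user_id)
    neighbors = sim.sort_values(ascending=False).index
    seen = set(user_item.columns[user_item.loc[user_id] == 1])
    recs = []
    for neighbor in neighbors:
        row = user_item.loc[neighbor]
        for article in row.index[row == 1]:
            if article not in seen and article not in recs:
                recs.append(article)
                if len(recs) == m:
                    return recs
    return recs

print(toy_recs(1, toy_user_item, 2))
```

User 2 shares two articles with user 1 and is visited first, contributing article 30.0. User 3 shares one article and contributes 40.0.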

In [12]:
get_user_articles(10, user_item=user_item)
# Solution:
# 1314, 1429, 1185, 1170, 173, 1427, 1172, 1305, 1330, 1360, 310, 898, 1174, 1422, 1171, 939, 1336, 585, 932

([1314.0,
  1429.0,
  1185.0,
  1170.0,
  173.0,
  1427.0,
  1172.0,
  1305.0,
  1330.0,
  1360.0,
  310.0,
  898.0,
  1174.0,
  1422.0,
  1171.0,
  939.0,
  1336.0,
  585.0,
  932.0],
 ['healthcare python streaming application demo',
  'use deep learning for image classification',
  'classify tumors with machine learning',
  'apache spark lab, part 1: basic concepts',
  '10 must attend data science, ml and ai conferences in 2018',
  'use xgboost, scikit-learn & ibm watson machine learning apis',
  'apache spark lab, part 3: machine learning',
  'gosales transactions for naive bayes model',
  'insights from new york car accident reports',
  'pixieapp for outlier detection',
  'time series prediction using recurrent neural networks (lstms)',
  'neural language modeling from scratch (part 1)',
  'breast cancer wisconsin (diagnostic) data set',
  'use r dataframes & ibm watson natural language understanding',
  'apache spark lab, part 2: querying data',
  'deep learning from scratch i: co

In [13]:
get_article_names([1430.0, 1314.0, 1429.0, 1338.0, 1276.0], df=df)
# Solution: 
# article_id --> title (in df)
# 1430.0 ------> using pixiedust for fast, flexible, and easier...
# 1314.0 ------> healthcare python streaming application demo 
# 1429.0 ------> use deep learning for image classification 
# 1338.0 ------> ml optimization using cognitive assistant
# 1276.0 ------> deploy your python model as a restful api

['using pixiedust for fast, flexible, and easier data analysis and experimentation',
 'healthcare python streaming application demo',
 'use deep learning for image classification',
 'ml optimization using cognitive assistant',
 'deploy your python model as a restful api']
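
`get_article_names` filters `df` once per requested id. When many ids are looked up, a precomputed id-to-title mapping may be faster. This is a minimal sketch on hypothetical toy data, assuming that every `article_id` always carries the same title:

```python
import pandas as pd

# Hypothetical interactions; article 10.0 appears twice with the same title
toy = pd.DataFrame({
    "article_id": [10.0, 20.0, 10.0, 30.0],
    "title": ["intro to pandas", "spark basics", "intro to pandas", "deep learning 101"],
})

# One row per article, indexed by id, so that .loc returns titles in request order
title_by_id = toy.drop_duplicates("article_id").set_index("article_id")["title"]
names = title_by_id.loc[[30.0, 10.0]].tolist()

print(names)
```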

In [14]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

['brunel interactive visualizations in jupyter notebooks',
 'ml algorithm != learning machine',
 'flightpredict ii: the sequel  – ibm watson data lab',
 'recent trends in recommender systems',
 'markdown for jupyter notebooks cheatsheet',
 'using deep learning with keras to predict customer churn',
 '5 practical use cases of social network analytics: going beyond facebook and twitter',
 'recommender systems: approaches & algorithms',
 '1448    i ranked every intro to data science course on...\nName: title, dtype: object',
 'this week in data science (may 30, 2017)']

In [15]:
# Test your functions here - No need to change this code - just run this cell
# Small modification done, as the article_ids are stored as floats
assert set(get_article_names([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_article_names([1320.0, 232.0, 844.0])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set([1320.0, 232.0, 844.0])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")

If this is all you see, you passed all of our tests!  Nice job!


`4.` Now we are going to improve the consistency of the **user_user_recs** function from above.  

* Instead of choosing arbitrarily among users who are equally close to a given user, choose the users with the most total article interactions before those with fewer article interactions.


* Instead of choosing articles arbitrarily from the user where the number of recommended articles starts below m and ends exceeding m, choose the articles with the most total interactions before those with fewer total interactions. This ranking should match what the **top_articles** function you wrote earlier would return.
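
Both tie-breaking rules amount to sorting by a second key. A minimal sketch with hypothetical values: neighbors are ordered by similarity and then by interaction count, and candidate articles are ordered by their total interaction count.

```python
import pandas as pd

# Hypothetical neighbors: ids 8 and 10 tie on similarity, as do 7 and 9
toy_neighbors = pd.DataFrame({
    "neighbor_id":      [7, 8, 9, 10],
    "similarity":       [3, 5, 3, 5],
    "num_interactions": [40, 12, 55, 30],
})

# Highest similarity first; ties are broken by the larger interaction count
ranked = toy_neighbors.sort_values(
    by=["similarity", "num_interactions"], ascending=False
)
ranked_ids = ranked["neighbor_id"].tolist()

# Hypothetical total interactions per article, used to order candidate articles
article_counts = pd.Series({10.0: 5, 20.0: 9, 30.0: 2})
candidates = [10.0, 30.0, 20.0]
ordered = sorted(candidates, key=lambda a: article_counts.loc[a], reverse=True)

print(ranked_ids)
print(ordered)
```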

In [16]:
def get_top_sorted_users(user_id, df=df, user_item=user_item, dot_product = dot_product):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # neighbor ids ordered from most to least similar (the user_id itself is excluded)
    neighbor_ids = find_similar_users(user_id, user_item = user_item, dot_product = dot_product)
    # similarity of every user to the provided user
    similarity = dot_product.loc[user_id]
    # num_interactions
    # Not sure if total interaction, or only the number of different articles
    # ASSUMED: total number of interactions on the platform
    num_interactions = df.groupby(by = 'user_id')['article_id'].count()
    
    neighbors_df = pd.DataFrame({
        'neighbor_id': np.asarray(neighbor_ids),
        'similarity': similarity.loc[neighbor_ids].values,
        'num_interactions': num_interactions.loc[neighbor_ids].values,
    })
    
    neighbors_df = neighbors_df.sort_values(by = ['similarity', 'num_interactions'], ascending = False)
    return neighbors_df # Return the dataframe specified in the doc_string


def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    similar_users = get_top_sorted_users(user_id, df=df, user_item=user_item, dot_product = dot_product).neighbor_id
    
    user_article_ids, __ = get_user_articles(user_id, user_item=user_item)
    # total number of interactions per article, used to rank the candidate articles
    article_interactions = df.groupby(by = 'article_id')['user_id'].count()
    recs = []
    
    for sim_user_id in similar_users:
        # extract the articles the user <sim_user_id> has read
        sim_user_article_ids, __ = get_user_articles(sim_user_id, user_item=user_item)
        # drop all ids the user has already read or that are already recommended
        new_ids = [item for item in sim_user_article_ids if item not in user_article_ids and item not in recs]
        # sort the remaining articles by their total number of interactions, most popular first
        new_ids.sort(key = lambda tmp_id: article_interactions.loc[tmp_id], reverse = True)
        # take only as many articles as are still needed to reach m
        recs.extend(new_ids[:m - len(recs)])
        
        if len(recs) >= m:
            break
    
    rec_names = get_article_names(recs)
    return recs, rec_names

In [17]:
user_user_recs_part2(5, m=2)

In [None]:
get_top_sorted_users(5, df=df, user_item=user_item)

In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

`5.` Use your functions from above to fill in the solutions to the dictionary below.  Then test your dictionary against the solution.  Provide the code needed to answer each question, following the comments below.

In [None]:
### Tests with a dictionary of results

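# Both answers reuse find_similar_users, whose result is ordered from most to least similar
user1_most_sim = find_similar_users(1)[0] # Find the user that is most similar to user 1 
user131_10th_sim = find_similar_users(131)[9] # Find the 10th most similar user to user 131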

In [None]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)

`6.` If we were given a new user, which of the above functions would you be able to use to make recommendations?  Explain.  Can you think of a better way we might make recommendations?  Use the cell below to explain a better method for new users.

**Provide your response here.**

`7.` Using your existing functions, provide the top 10 recommended articles you would provide for the new user below.  You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.

In [None]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to 
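# The rank-based functions live in the second notebook, so the 10 most popular article ids are computed directly here (as strings)
new_user_recs = df['article_id'].value_counts().head(10).index.astype(str).tolist()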



In [None]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")
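
A new user has no row in the user-item matrix, so user-user similarity cannot be computed for them. One common fallback is to rank articles by overall popularity. The helper below is a minimal sketch on hypothetical toy data; `top_article_ids` is an illustrative name, not the function from the rank-based notebook.

```python
import pandas as pd

# Hypothetical interactions: article 3.0 is the most popular
toy = pd.DataFrame({
    "article_id": [1.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    "user_id":    [1, 1, 2, 1, 2, 3],
})

def top_article_ids(df, n):
    """Return the n most frequently interacted-with article ids as strings."""
    counts = df["article_id"].value_counts()
    return [str(article_id) for article_id in counts.head(n).index]

print(top_article_ids(toy, 2))
```

The ids are returned as strings because the test above compares against values such as `'1314.0'`.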