# Recommendations with MovieTweetings: Collaborative Filtering

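One of the most popular recommendation methods is **collaborative filtering**, which uses the interactions between users and items to make new recommendations. There are two main approaches:

1. **Neighborhood-Based Collaborative Filtering**, where the basic idea is to make recommendations by relating similar items or similar users.

2. **Model Based Collaborative Filtering**, where the basic idea is to use machine learning or another mathematical model to learn the relationship between items and users, and to predict ratings in order to make recommendations.

Here we use the **neighborhood-based collaborative filtering** approach, which can be further divided into:

1. **User-based collaborative filtering:** recommendations are made using other users who are similar to the user we want to recommend to.

2. **Item-based collaborative filtering:** this method first finds the items related to each item (based on similar ratings), then uses the user's ratings of those similar items to judge whether the user will like a new item.

We mainly use **user-based collaborative filtering**, recommending movies with the following three steps as an example:
1. Remove the movies the user has already seen
2. Find the movies that the user's neighbors rated highly
3. Recommend the movies that meet both criteria to the user

_Note: to save time, this notebook only demonstrates the method on a pair of users rather than iterating over all users._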
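As a minimal sketch of the three steps, the filtering can be expressed as a set difference. The arrays `seen_by_user` and `liked_by_neighbor` are hypothetical toy data, not taken from the MovieTweetings files.

```python
import numpy as np

# Hypothetical toy data: movie ids the user has rated,
# and movie ids a close neighbor rated highly.
seen_by_user = np.array([1, 2, 3])
liked_by_neighbor = np.array([2, 3, 4, 5])

# Keep the neighbor's liked movies that the user has not seen yet.
toy_recs = np.setdiff1d(liked_by_neighbor, seen_by_user)
print(toy_recs)
```

The functions later in this notebook follow the same idea, looping over neighbors from closest to farthest.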

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from IPython.display import HTML
%matplotlib inline

# Read in the datasets
movies = pd.read_csv('./data/movies_clean.csv')
reviews = pd.read_csv('./data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

print(reviews.head())

   user_id  movie_id  rating   timestamp                 date
0        1    111161      10  1373234211  2013-07-07 21:56:51
1        1    117060       7  1373415231  2013-07-10 00:13:51
2        1    120755       6  1373424360  2013-07-10 02:46:00
3        1    317919       6  1373495763  2013-07-10 22:36:03
4        1    454876      10  1373621125  2013-07-12 09:25:25


In [2]:
# Remove the duplicate movie entry (row label 7897)
movies.drop(7897, inplace=True)

## Measures of Similarity

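The most important part of **neighborhood** based collaborative filtering is knowing how to compute the similarity or distance between items or between users. We mainly use two metrics here.

- **Pearson's correlation coefficient**

It is computed as
$$CORR(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\text{ }\text{STDEV}(y)}$$
where
$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here n is the length of the vectors, which must be the same for x and y, and $\bar{x}$ is the mean of all the elements of x.

We can use the correlation coefficient to indicate how similar two vectors are: the closer it is to 1, the more similar they are. As we will see below, this measure does not always work well.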
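The formulas above can be written out directly in NumPy. This is a minimal sketch: `pearson_corr` is a hypothetical helper name, and the vectors are made-up values. Note that `y` is simply twice `x`, yet the correlation is still (approximately) 1, because the measure ignores differences in scale.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation using the sample (n - 1) covariance and standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]          # y = 2 * x
r = pearson_corr(x, y)
print(r)                  # approximately 1
```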

- **Euclidean distance**

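This measures the straight-line distance between two vectors, so a larger value means the two vectors are further apart. It is computed as
$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$
Unlike the correlation coefficient, there is no denominator to rescale the result, so when using this metric make sure the data are on the same scale.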
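A short sketch of the distance formula, with made-up vectors, shows the sensitivity to scale: multiplying both vectors by 10 multiplies the distance by 10 as well. `euclidean_dist` is a hypothetical helper name; `np.linalg.norm` of the difference gives the same value.

```python
import numpy as np

def euclidean_dist(x, y):
    """Straight-line distance between two equal-length vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(np.sum(diff ** 2))

a = np.array([1, 2, 3])
b = np.array([4, 6, 3])

d = euclidean_dist(a, b)                      # sqrt(3**2 + 4**2 + 0**2)
d_scaled = euclidean_dist(10 * a, 10 * b)     # same vectors on a larger scale
print(d, d_scaled)
```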

------------

### User-Item Matrix

Putting the values into a matrix makes the computations more convenient: each row is a user and each column is an item.
![alt text](https://view3f484599.udacity-student-workspaces.com/files/images/userxitem.png "User Item Matrix")

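In the image above, **User 1** and **User 2** both have **Item 1**, while **User 2**, **User 3** and **User 4** all have **Item 2**. However, there are still a large number of missing values, which mark the items a user does not have.

Below, we first build this matrix from the **reviews** dataset. The entries are ratings, so we only need the first three columns of the **reviews** dataframe.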

In [3]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

   user_id  movie_id  rating
0        1    111161      10
1        1    117060       7
2        1    120755       6
3        1    317919       6
4        1    454876      10


### Creating the User-Item Matrix

To create the user-items matrix, we first try a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

However, this approach is quite likely to hit a memory error; [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) may help.
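If memory is the bottleneck, one possible alternative (not the approach used in this notebook) is a sparse matrix built with `scipy.sparse.csr_matrix`. The sketch below uses a hypothetical `toy_reviews` dataframe. Two caveats apply: unrated entries are stored as implicit zeros rather than NaN, and duplicate (user, movie) pairs would be summed.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same long format as `reviews`.
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 20, 30],
    'rating':   [8, 6, 9, 7],
})

# Map ids to consecutive row and column positions.
user_cat = toy_reviews['user_id'].astype('category')
movie_cat = toy_reviews['movie_id'].astype('category')
row_pos = user_cat.cat.codes.to_numpy()
col_pos = movie_cat.cat.codes.to_numpy()

sparse_ratings = csr_matrix(
    (toy_reviews['rating'].to_numpy(), (row_pos, col_pos)),
    shape=(len(user_cat.cat.categories), len(movie_cat.cat.categories)),
)
print(sparse_ratings.toarray())
```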
_____

`1.` Create a matrix with users as rows, movies as columns, and ratings as values. If a user has not rated a movie, the value should be NaN.

In [4]:
# Create user-by-item matrix
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

In [5]:
assert movies.shape[0] == user_by_movie.shape[1], "Oh no! Your matrix should have {} columns, and yours has {}!".format(movies.shape[0], user_by_movie.shape[1])
assert reviews.user_id.nunique() == user_by_movie.shape[0], "Oh no! Your matrix should have {} rows, and yours has {}!".format(reviews.user_id.nunique(), user_by_movie.shape[0])
print("Looks like you are all set! Proceed!")
HTML('<img src="https://view3f484599.udacity-student-workspaces.com/files/images/greatjob.webp">')

Looks like you are all set! Proceed!


`2.` Now that we have the users by movies matrix, use it to create a dictionary where each key is a user and each value is the movies that user has rated.

In [6]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    movies = user_by_movie.loc[user_id][user_by_movie.loc[user_id].notnull()].index.values
    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    movies_seen = dict()
    
    for user_id in user_by_movie.index.values:
        movies_seen[user_id] = movies_watched(user_id)
        
    return movies_seen


# Use your function to return dictionary
movies_seen = create_user_movie_dict()

`3.` If a user has not rated more than two movies, we consider them too new. Create a new dictionary that contains only the users who qualify.

In [7]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed users who have seen lower_bound or fewer movies
    
    '''
    movies_to_analyze = dict()
    # Do things to create updated dictionary
    for key, value in movies_seen.items():
        if len(value) > lower_bound:
            movies_to_analyze[key] = value
    return movies_to_analyze


# Use your function to return your updated dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)

### Calculating User Similarities

Now that we have the **movies_to_analyze** dictionary, we can start looking at the similarity between users. Below is pseudocode for how similarity is determined:
```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

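However, this takes a long time to run, and the alternatives exceed the available memory. So instead of computing every possible pair, we only look at a few specific examples.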
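For intuition, here is a runnable sketch of the pairwise loop on a tiny dataset. The `toy_ratings` dictionary is hypothetical. The sketch visits each unordered pair once, since the distance is symmetric, whereas the pseudocode visits every ordered pair.

```python
from itertools import combinations
import numpy as np

# Hypothetical toy ratings: {user_id: {movie_id: rating}}
toy_ratings = {
    1: {10: 8, 20: 6, 30: 9},
    2: {10: 7, 20: 6, 30: 5, 40: 10},
    3: {40: 9, 50: 4, 60: 7},
}

toy_dists = []
for user1, user2 in combinations(toy_ratings, 2):
    # Movies both users have rated
    shared = sorted(set(toy_ratings[user1]) & set(toy_ratings[user2]))
    if len(shared) > 2:
        r1 = np.array([toy_ratings[user1][m] for m in shared], dtype=float)
        r2 = np.array([toy_ratings[user2][m] for m in shared], dtype=float)
        # Store the pair together with its euclidean distance
        toy_dists.append((user1, user2, np.linalg.norm(r1 - r2)))

print(toy_dists)
```

Only users 1 and 2 share more than two movies here, so only that pair gets a distance.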

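We want to compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between users.

`4.` Using the **movies_to_analyze** dictionary and the **user_by_movie** dataframe, create a function that computes the correlation between two users' ratings on the movies they have both rated.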

In [8]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    # Pull movies for each user
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    # Find Similar Movies
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    
    # Calculate correlation between the users on the shared movies
    # (a list of labels selects both user rows)
    df = user_by_movie.loc[[user1, user2], sim_movs]
    corr = df.transpose().corr().iloc[0, 1]
    return corr  # return the correlation

### Why the NaN's?
`5.` Why do we get **NaN** values? For example, users 2 and 104 have a **NaN** correlation. Why do these NaNs occur? They make the correlation coefficient less suitable as a measure of similarity.

In [9]:
# Which movies did both user 2 and user 104 see?
movies1 = movies_to_analyze[2]
movies2 = movies_to_analyze[104]
sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
df = user_by_movie.loc[[2, 104], sim_movs]
df.transpose().corr()

user_id    2  104
user_id          
2        NaN  NaN
104      NaN  NaN


> This happens because there is no variance (a standard deviation of zero) in the second column. When the correlation coefficient calculation divides by the standard deviation or variance (however it is implemented), it ends up dividing zero by zero, which yields NaN.
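The effect is easy to reproduce on made-up data. In this sketch the hypothetical `user_b` gives every movie the same rating, so the column has zero variance and the correlation with `user_a` comes out as NaN.

```python
import pandas as pd

# Hypothetical ratings on three shared movies; user_b rates everything 10.
toy = pd.DataFrame({'user_a': [7, 9, 8], 'user_b': [10, 10, 10]})

toy_corr = toy.corr()
print(toy_corr)
print(pd.isna(toy_corr.loc['user_a', 'user_b']))
```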

`6.` Because the correlation coefficient is less suitable, we switch to the euclidean distance between the ratings. [This post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) is particularly helpful for this question.

In [10]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    # Pull movies for each user
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    # Find Similar Movies
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    
    # Calculate euclidean distance between the users on the shared movies
    # (a list of labels selects both user rows)
    df = user_by_movie.loc[[user1, user2], sim_movs]
    dist = np.linalg.norm(df.loc[user1] - df.loc[user2])
    
    return dist  # return the euclidean distance

### Using the Nearest Neighbors to Make Recommendations

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations

Here, df_dists is a dataframe containing all of the possible distances.
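The notebook does not show how df_dists is built. As a rough sketch, a dataframe with the columns the functions expect (`user1`, `user2`, `eucl_dist`) could be assembled as follows. The `toy_matrix` data is hypothetical and has no missing ratings; with real data the distance would need to be restricted to the movies both users rated.

```python
import numpy as np
import pandas as pd

# Hypothetical dense user-by-movie ratings (no missing values).
toy_matrix = pd.DataFrame(
    [[8, 6, 9], [7, 6, 5], [8, 5, 9]],
    index=pd.Index([1, 2, 3], name='user_id'),
    columns=[10, 20, 30],
)

rows = []
for u1 in toy_matrix.index:
    for u2 in toy_matrix.index:
        dist = np.linalg.norm(toy_matrix.loc[u1] - toy_matrix.loc[u2])
        rows.append({'user1': u1, 'user2': u2, 'eucl_dist': dist})
toy_df_dists = pd.DataFrame(rows)

# Neighbors of user 1, closest first, excluding user 1 itself.
mask = (toy_df_dists['user1'] == 1) & (toy_df_dists['user2'] != 1)
closest = toy_df_dists[mask].sort_values('eucl_dist')['user2'].tolist()
print(closest)
```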

In [11]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # Drop the user's own row by id rather than by position, so that ties at
    # distance 0 cannot remove a real neighbor. Remaining ties are kept in
    # whatever order the sort returns them.
    neighbors = df_dists[(df_dists['user1'] == user) & (df_dists['user2'] != user)]
    closest_users = neighbors.sort_values(by='eucl_dist')['user2']
    closest_neighbors = np.array(closest_users)
    
    return closest_neighbors
    
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the lowest rating that still counts as a "like" rather than a "dislike"
    OUTPUT:
    liked - an array of movies the user has watched and liked
    '''
    # Use a local name that does not shadow the function itself
    liked = np.array(user_items.query('user_id == @user_id and rating >= @min_rating')['movie_id'])
    return liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies.movie_id.isin(movie_ids)].movie)

    return movie_lst
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # Recommend movies the user has not already seen,
    # going through the neighbors in order from closest to farthest.
    # Only movies a neighbor rated at least min_rating (7 by default in movies_liked) are considered.
    
    # movies_seen by user (we don't want to recommend these)
    movies_seen = movies_watched(user)
    closest_neighbors = find_closest_neighbors(user)
    
    # Keep the recommended movie ids here, in order of neighbor closeness
    recs = np.array([], dtype=int)
    
    # Go through the neighbors and identify movies they like the user hasn't seen
    for neighbor in closest_neighbors:
        neighbs_likes = movies_liked(neighbor)
        
        # Obtain recommendations for each neighbor
        new_recs = np.setdiff1d(neighbs_likes, movies_seen, assume_unique=True)
        
        # Append only ids that are not already recommended
        recs = np.concatenate([recs, np.setdiff1d(new_recs, recs)])
        
        # If we have enough recommendations exit the loop
        if len(recs) >= num_recs:
            break
    
    # Pull movie titles using movie ids, returning at most num_recs of them
    recommendations = movie_names(recs[:num_recs])
    
    return recommendations

def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is an array of recommended movie titles
    '''
    # All the users we need to make recommendations for
    users = np.unique(df_dists['user1'])
    n_users = len(users)
    
    #Store all recommendations in this dictionary
    all_recs = dict()
    
    # Make the recommendations for each user
    for user in users:
        all_recs[user] = make_recommendations(user, num_recs)
    
    return all_recs