# Re-Ranking for Topic Diversification

Recommender systems (RecSys) are a class of machine learning models built to recommend the most relevant items to a user: an ad that a customer is most likely to click on, a product that a customer is most likely to buy, and so on. To keep its suggestions relevant, a RecSys often exploits the known preferences of its users. This can create an echo chamber of sorts. For example, if you purchase a summer dress from a retail website, you might be shown other summer dresses every time you visit the website, because the RecSys knows that you were interested in summer dresses in the past.

Understandably, this leads to some obvious issues:
- a bad user experience, because users keep seeing products that might no longer be relevant to them
- lost revenue for the company, because other products from the catalog that the user might have bought are never shown

There are also a number of more nuanced issues.

One can think of a few ways to resolve these issues:
- Change the features or the type of model so that its output covers a diverse list of topics
- Re-rank the items recommended by the RecSys so that the final list is diverse

For the purpose of this exercise, we will look at the latter: the problem of re-ranking. We use a dataset of movie ratings ([MovieLens Dataset](https://grouplens.org/datasets/movielens/1m/ "MovieLens 1M dataset")), a stable benchmark dataset for recommender systems. It has 1 million ratings from 6000 users on 4000 movies, so roughly 4.2% of all possible user-movie pairs have a rating.
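
The coverage figure follows from simple arithmetic on the rounded counts quoted above. A minimal sketch (the variable names are illustrative, and the exact counts in the dataset differ slightly from these rounded values):

```python
# Rounded figures quoted in the text above.
n_ratings = 1_000_000
n_users = 6_000
n_movies = 4_000

# Density = observed ratings / all possible (user, movie) pairs.
density = n_ratings / (n_users * n_movies)
print(f"{density:.2%}")  # 4.17%
```

In other words, about 96% of the user-movie matrix is empty, which is typical for rating data.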

## Sections
1. [Methodology](#Methodology) 
2. [Building a baseline RecSys](#Building-a-baseline-RecSys)


## Methodology
We will build two recommender systems:  
1. Our **baseline** RecSys will recommend 10 top-rated movies from this dataset.  
2. Our **improved** RecSys will try to bring diversity to this recommended list.
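
To make the idea of re-ranking concrete, here is a minimal sketch of one common approach: a greedy pass that trades relevance against genre overlap with the items already picked (in the spirit of maximal marginal relevance). It is an illustration under assumptions, not necessarily the method used in the tutorial referenced below. The helpers `jaccard` and `rerank_diverse`, the trade-off parameter `lam`, and the toy candidates are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of genres."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_diverse(candidates, k=10, lam=0.7):
    """Greedily pick k items, balancing relevance against genre redundancy.

    candidates: list of (title, relevance, genre_set); higher relevance is better.
    lam: 1.0 ranks purely by relevance; lower values penalise genre overlap more.
    """
    if not candidates:
        return []
    # Scale relevance to [0, 1] so it is comparable to the Jaccard penalty.
    max_rel = max(rel for _, rel, _ in candidates) or 1.0
    remaining = list(candidates)
    selected = []

    def marginal_score(item):
        _, rel, genres = item
        # Redundancy = similarity to the closest item already selected.
        redundancy = max((jaccard(genres, g) for _, _, g in selected), default=0.0)
        return lam * rel / max_rel - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy candidates (made up for illustration), already sorted by relevance.
candidates = [
    ("A", 1.00, {"Action", "Sci-Fi"}),
    ("B", 0.95, {"Action", "Sci-Fi"}),
    ("C", 0.90, {"Action", "Adventure"}),
    ("D", 0.60, {"Drama"}),
    ("E", 0.55, {"Comedy"}),
]

print([t for t, _, _ in rerank_diverse(candidates, k=3, lam=1.0)])  # ['A', 'B', 'C']
print([t for t, _, _ in rerank_diverse(candidates, k=3, lam=0.5)])  # ['A', 'D', 'C']
```

With `lam=1.0` the list is the plain relevance ranking. With `lam=0.5` the near-duplicate "B" is skipped in favour of a drama, at the cost of some relevance.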

Reference:
For the most part, we will follow this tutorial - https://nbviewer.jupyter.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/topic_diversification.ipynb

## Building a baseline RecSys

In [19]:
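# Importing data
import pandas as pd
import warnings

# Silence warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

# The multi-character separator '::' requires the Python parsing engine.
# These calls assume the local .dat files have a header row (MovieID, Title, ...);
# the raw MovieLens 1M files do not.
movies = pd.read_csv('data/ml-1m/movies.dat', sep='::', engine='python')
ratings = pd.read_csv('data/ml-1m/ratings.dat', sep='::', engine='python')
users = pd.read_csv('data/ml-1m/users.dat', sep='::', engine='python')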

In [26]:
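# Getting the top-rated movies
import numpy as np

# Score each movie by average rating x number of raters (its total rating points),
# so that movies that are both popular and well liked rank first.
top_rated = (
    ratings
    .groupby('MovieID')
    .agg({'Rating': 'mean', 'UserID': 'count'})
    .rename(columns={'UserID': 'UserCount', 'Rating': 'AvgRating'})
    .assign(score=lambda x: x.AvgRating * x.UserCount)
    .sort_values('score', ascending=False)
    .head(50)  # keep the 50 best-scoring movies
)

# Attach titles and genres, then show the 10 best-scoring movies
top_rated.join(movies.set_index('MovieID')).head(10)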

| MovieID | AvgRating | UserCount | score | Title | Genres |
|---|---|---|---|---|---|
| 2858 | 4.317386 | 3428 | 14800.0 | American Beauty (1999) | Comedy\|Drama |
| 260 | 4.453694 | 2991 | 13321.0 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi |
| 1196 | 4.292977 | 2990 | 12836.0 | Star Wars: Episode V - The Empire Strikes Back... | Action\|Adventure\|Drama\|Sci-Fi\|War |
| 1210 | 4.022893 | 2883 | 11598.0 | Star Wars: Episode VI - Return of the Jedi (1983) | Action\|Adventure\|Romance\|Sci-Fi\|War |
| 2028 | 4.337354 | 2653 | 11507.0 | Saving Private Ryan (1998) | Action\|Drama\|War |
| 1198 | 4.477725 | 2514 | 11257.0 | Raiders of the Lost Ark (1981) | Action\|Adventure |
| 593 | 4.351823 | 2578 | 11219.0 | Silence of the Lambs, The (1991) | Drama\|Thriller |
| 2571 | 4.31583 | 2590 | 11178.0 | Matrix, The (1999) | Action\|Sci-Fi\|Thriller |
| 2762 | 4.406263 | 2459 | 10835.0 | Sixth Sense, The (1999) | Thriller |
| 589 | 4.058513 | 2649 | 10751.0 | Terminator 2: Judgment Day (1991) | Action\|Sci-Fi\|Thriller |
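
A quick tally of the genres in the ten rows above shows how concentrated the baseline list is. This sketch hard-codes the genre strings from the table instead of reading them from the DataFrame, and the variable names are illustrative:

```python
from collections import Counter

# Genre strings copied from the top-10 table above, in rank order.
top10_genres = [
    "Comedy|Drama",
    "Action|Adventure|Fantasy|Sci-Fi",
    "Action|Adventure|Drama|Sci-Fi|War",
    "Action|Adventure|Romance|Sci-Fi|War",
    "Action|Drama|War",
    "Action|Adventure",
    "Drama|Thriller",
    "Action|Sci-Fi|Thriller",
    "Thriller",
    "Action|Sci-Fi|Thriller",
]

# Count how many of the ten movies carry each genre tag.
genre_counts = Counter(g for genres in top10_genres for g in genres.split("|"))
print(genre_counts.most_common(2))  # [('Action', 7), ('Sci-Fi', 5)]
```

Seven of the ten recommendations are tagged Action and five are tagged Sci-Fi, which is the kind of concentration that re-ranking for diversity is meant to reduce.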



|  | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 33 | 34 | Babe (1995) | Children's\|Comedy\|Drama |
| 49 | 50 | Usual Suspects, The (1995) | Crime\|Thriller |
| 108 | 110 | Braveheart (1995) | Action\|Drama\|War |
| 257 | 260 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi |
| 293 | 296 | Pulp Fiction (1994) | Crime\|Drama |
| 315 | 318 | Shawshank Redemption, The (1994) | Drama |
| 352 | 356 | Forrest Gump (1994) | Comedy\|Romance\|War |
| 453 | 457 | Fugitive, The (1993) | Action\|Thriller |
| 476 | 480 | Jurassic Park (1993) | Action\|Adventure\|Sci-Fi |


|  | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 3878 | 3948 | Meet the Parents (2000) | Comedy |
| 3879 | 3949 | Requiem for a Dream (2000) | Drama |
| 3880 | 3950 | Tigerland (2000) | Drama |
| 3881 | 3951 | Two Family House (2000) | Drama |


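|  | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| ... | ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 | 956716541 |
| 1000205 | 6040 | 1094 | 5 | 956704887 |
| 1000206 | 6040 | 562 | 5 | 956704746 |
| 1000207 | 6040 | 1096 | 4 | 956715648 |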
