# Welcome to my ML Project for Course 5510 in the CU Boulder MS-DS Program

### This file can be found in my [personal Github](https://github.com/blake-tagget/nuance-and-nonsense/tree/main/BecasueOfSchool/5010_ML_Unsupervised_Learning).

## Hybrid Recommender System - Project Intro

As we completed the course, I found myself not truly understanding recommender systems. Sure, I passed the assignments and all, but I wanted a second stab at it so I can explain each step along the way, in my own words, and at my own pace. So let's get it going! 

## Recommending a Movie - and make it complex!

When we did the Week 3 assignment, we used the [MovieLens 1M dataset](https://www.kaggle.com/odedgolden/movielens-1m-dataset) posted on kaggle to implement content and collaborative filtering algorithms. The data can be downloaded directly from [MovieLens](https://grouplens.org/datasets/movielens/) and it all comes from [GroupLens Research at the University of Minnesota](https://grouplens.org/https://grouplens.org/). I also found that there is a neat library they made called [LensKit](https://lkpy.readthedocs.io/en/stable/index.html) which gives us access to additional functionality. Check out the actual [MovieLens](https://movielens.org/home) product if you want.Check out the actual [MovieLens](https://movielens.org/home) product if you want.

We will also use publicly available IMDb datasets [available here](https://www.imdb.com/interfaces/https://www.imdb.com/interfaces/). Thankfully, the good people at MovieLens include a mapping between their data and IMDb.


Disclaimer 1 : The terms "item" and "movie" are used interchangably to mean the same thing. 

Disclaimer 2 : The order of the steps ARE NOT the order i took to complete the project. They are the order the functionally makes sense when trying to save time and space (reduce complexity).

### The goal of this project is to expand on what we did manually for the week3 and week4 labs by utilizing more movie attributes to make a hybrid recommender system. 

Let's dive in and see where we go!

## Step 0 : Identify an Unsupervised Machine Learning Problem

### In short: Recommend a Movie! 

In class we learned Content-based and Collaborative-filtering methods for recommending movies. this project puts them together.

Collaborative Filtering : We will use user ratings to calculate the similarity of movies <br>
Content Based : We will use a wide variety of movie attributes to calculate the similarity as well

There is no right or wrong answer in terms of a recommendation becasue we don;t have a target to shoot for. What we can do is use the similarity scores to calculate a weighted average of a predicted rating. We will then judge the correctness of our model by how close we can get to the actual rating.

In truth, this score is just a way to see how good our model is at explaining reality. In practice, we would just present the most similar movies to a users highest rated movies. There would be no practical purpose to recommend a low rated movie to a user of a streaming service. 

We will do the following steps:
1. Gather and clean some data - the focus here is getting our data in the most efficient structure (hello arrays) and to gather movie feature sets.
2. Explore some similarity scores - this is analogous to a business logic check.
3. Make the Model
4. Find the best similarity scores for each feature set
5. Find the best combined model set
6. Make a Recommendation based on our model set
7. Discuss / Conclusion


## Step 1 : Gather and Clean some data

In [1]:
# Step 1 - Pull some data using pandas:
from datetime import date, datetime, timedelta
import time
import numpy as np
import pandas as pd


import lenskit
from lenskit.datasets import MovieLens
mlens = MovieLens(path='ml-latest-small')
m25 = MovieLens(path='ml-25m')

mlens.links['lk_tconst'] = 'tt'+mlens.links['imdbId'].astype(str).str.zfill(7)
links = mlens.links.reset_index().set_index('lk_tconst')['item']

In [2]:
cols={'tconst':str,'startYear':float,'titleType':str,'runtimeMinutes':float,'primaryTitle':str,'originalTitle':str}
imdb_ratings = pd.read_table('title.ratings.tsv',na_values="\\N")[['tconst','averageRating','numVotes']].set_index('tconst').join(links,on='tconst',how='inner')

# I got tired of re-loading over 2gb's of data so i just saved my own csv

try: 
    imdb_titles = pd.read_csv('imdb_titles.csv',index_col=0)
    
except:
    imdb_titles = pd.read_table('title.basics.tsv',na_values="\\N")[list(cols.keys())].set_index('tconst').join(links,on='tconst',how='inner')
    imdb_titles.to_csv('imdb_titles.csv')
    
try: 
    imdb_people = pd.read_csv('imdb_people.csv',index_col=0)
    
except:    
    imdb_people = pd.concat([pd.read_table('title.principals.tsv',na_values="\\N")[['tconst','nconst']].set_index('tconst')['nconst'],\
                             pd.concat([pd.read_table('title.crew.tsv',na_values="\\N")[['tconst','directors']].set_index('tconst')['directors'].str.split(',').explode().rename('nconst'),\
                                        pd.read_table('title.crew.tsv',na_values="\\N")[['tconst','writers']].set_index('tconst')['writers'].str.split(',').explode().rename('nconst')])\
                            ]).reset_index().drop_duplicates()\
                    .join(links,on='tconst',how='right').dropna()\
                    .merge(pd.read_table('name.basics.tsv',na_values="\\N")[['nconst','primaryName']],on='nconst',how='left')
    
    imdb_people.to_csv('imdb_people.csv')



In [3]:
imdb_titles.head()

Unnamed: 0_level_0,startYear,titleType,runtimeMinutes,primaryTitle,originalTitle,item
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
tt0000417,1902.0,short,13.0,A Trip to the Moon,Le voyage dans la lune,32898
tt0000439,1903.0,short,11.0,The Great Train Robbery,The Great Train Robbery,49389
tt0000516,1908.0,short,8.0,The Electric Hotel,El hotel eléctrico,140541
tt0004972,1915.0,movie,195.0,The Birth of a Nation,The Birth of a Nation,7065
tt0006333,1916.0,movie,85.0,"20,000 Leagues Under the Sea","20,000 Leagues Under the Sea",62383


In [4]:
imdb_people.head()

Unnamed: 0,tconst,nconst,item,primaryName
0,tt0114709,nm0169505,1,Joel Cohen
1,tt0114709,nm0000158,1,Tom Hanks
2,tt0114709,nm0000741,1,Tim Allen
3,tt0114709,nm0725543,1,Don Rickles
4,tt0114709,nm0001815,1,Jim Varney


### Discussion:

The people related (names and principles) IMDb datasets are HUGE, and titles is pretty big. I had to trim them and save a csv so that my iteration could be faster. You'll also note that I have to do some string manipulation to get the IMDb links to be correct. If the link didn't match to an IMDb record, I drop it from the data. 

Data is fun!


In [5]:
# filter all our datasets down to the intersection
attributes = imdb_titles.join(imdb_ratings.drop(columns=['item']),how='left').dropna().set_index('item')
common_movies = set(imdb_people['item']).intersection(set(attributes.index)).intersection(set(m25.tags['item'])).intersection(set(mlens.ratings['item'])).intersection(mlens.links.index).intersection(mlens.movies.index)

print(len(common_movies))

attributes = attributes[attributes.index.isin(common_movies)]
links = mlens.links[mlens.links.index.isin(common_movies)]
movies = mlens.movies[mlens.movies.index.isin(common_movies)].join(links['lk_tconst'],how='left')
ratings = mlens.ratings[mlens.ratings['item'].isin(common_movies)]
tags = m25.tags[m25.tags.item.isin(common_movies)]
imdb_people = imdb_people[imdb_people['item'].isin(common_movies)]
users = ratings[['user']].drop_duplicates()

item_id = dict(zip(movies.index,list(range(len(movies)))))
user_id = dict(zip(users.user,list(range(len(users)))))

item_index = pd.DataFrame([],index=movies.index)
user_index = pd.DataFrame([],index=users['user'])

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA

from scipy.spatial.distance import squareform, pdist, cdist
from scipy.sparse import coo_matrix, csr_matrix

cutoff = .7
train_r, test_r = train_test_split(ratings,train_size=cutoff, random_state=20)

train_t = tags[~((tags['item']+tags['user']).isin(set(test_r['item']+test_r['user'])))]
test_t = tags[((tags['item']+tags['user']).isin(set(test_r['item']+test_r['user'])))]

train_ratings_mr = np.array(coo_matrix((list(train_r.rating), ([user_id[x] for x in train_r['user']], [item_id[x] for x in train_r['item']] )), shape=(len(users),len(movies))).toarray())


9254


In [6]:
genres_df = item_index.join(movies['genres'].str.get_dummies()).drop(columns=['(no genres listed)'])

types = attributes[['titleType']].drop_duplicates().reset_index(drop=True)
types['is_tv'] = types['titleType'].str.contains('tv')
types['format'] = types['titleType'].str.replace('tv','').str.lower()
types['tv'] = LabelEncoder().fit_transform(types['is_tv'])

attributes = attributes.join(types.set_index(['titleType'])[['format','tv']],on='titleType').drop(columns=['titleType','originalTitle','primaryTitle'])
attributes = pd.get_dummies(attributes,columns=['format'])

attributes = item_index.join(attributes,how='left')

attributes['runtimeMinutes'] = attributes['runtimeMinutes'].astype(float)
attributes['total_ratings'] = attributes['averageRating']*attributes['numVotes']

attr_nums = attributes[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes','total_ratings']]
attr_bool = attributes.drop(columns=['startYear', 'runtimeMinutes', 'averageRating', 'numVotes','total_ratings'])

In [7]:
attributes.head()

Unnamed: 0_level_0,startYear,runtimeMinutes,averageRating,numVotes,tv,format_episode,format_miniseries,format_movie,format_series,format_short,format_special,format_video,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,1995.0,81.0,8.3,1002586,0,0,0,1,0,0,0,0,8321463.8
2,1995.0,104.0,7.0,352486,0,0,0,1,0,0,0,0,2467402.0
3,1995.0,101.0,6.6,28410,0,0,0,1,0,0,0,0,187506.0
4,1995.0,124.0,5.9,11341,0,0,0,1,0,0,0,0,66911.9
5,1995.0,106.0,6.0,39465,0,0,0,1,0,0,0,0,236790.0


In [8]:
attributes.describe()

Unnamed: 0,startYear,runtimeMinutes,averageRating,numVotes,tv,format_episode,format_miniseries,format_movie,format_series,format_short,format_special,format_video,total_ratings
count,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0,9254.0
mean,1994.35509,105.813054,6.634936,83713.82,0.028744,0.002161,0.005187,0.965853,0.000648,0.010482,0.00724,0.008429,604276.6
std,18.697433,28.703214,1.009365,169881.1,0.167096,0.046441,0.071837,0.181617,0.025456,0.101849,0.084785,0.091426,1375543.0
min,1902.0,3.0,1.3,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,207.9
25%,1987.0,93.0,6.1,8320.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,52733.88
50%,1999.0,103.0,6.7,24892.5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,163345.2
75%,2007.75,116.0,7.3,82972.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,549060.0
max,2018.0,780.0,9.5,2701871.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,25127400.0


In [9]:

imdb_has_people = item_index.join(imdb_people.pivot_table(index='item',columns='primaryName',values='nconst',aggfunc='size'),how='outer').fillna(0).astype(bool)

pca = PCA(n_components=25)
imdb_has_people_pca = item_index.join(pd.DataFrame(pca.fit_transform(imdb_has_people),index=imdb_has_people.index),how='left')
imdb_has_people_pca.shape

(9254, 25)

In [10]:

train_tags = item_index.join(train_t.pivot_table(index='item',columns='tag',values='user',aggfunc='size'),how='outer').fillna(0).astype(bool)

pca = PCA(n_components=25)

train_tags_pca = item_index.join(pd.DataFrame(pca.fit_transform(train_tags.astype(bool)),index=train_tags.index),how='left')
train_tags_pca.shape

(9254, 25)

### Discussion:

Now we have our 5 feature sets:
- Training ratings matrix (train_ratings_mr)
- Genre attributes (genres_df)
- IMDb numeric attributes (attr_bool)
- IMDb boolean attributes (attr_nums)
- IMDb crew relations (imdb_has_people_pca)
- MovieLens provided Tags (train_tags_pca)

Since we are here, I'll add that how we split the attributes and perform dimentionality reduction are all parts we could tweak to accurately represent the real world of what makes a movie good. 


## Step 3 : Explore some similarity scores

In [11]:

def calc_distance(df,distance):
   
    dist_m = pd.DataFrame(squareform(pdist(df, distance)),index=df.index, columns= df.index)
    
    return dist_m

In [12]:
# calc a bunch of item-item similarity matrixes
attr_nums_dist = calc_distance(attr_nums, 'euclidean')
crew_euclid = calc_distance(imdb_has_people_pca, 'euclidean')
crew_cosine = calc_distance(imdb_has_people_pca, 'cosine')
tag_euclid = calc_distance(train_tags_pca, 'euclidean')
tag_cosine = calc_distance(train_tags_pca, 'cosine')


In [13]:
# find a different movie!
pd.DataFrame(attr_nums_dist.loc[187595].sort_values()).join(movies,how='left').join(imdb_people.groupby(['item'])['primaryName'].agg(list)).join(attr_nums).head()


Unnamed: 0_level_0,187595,title,genres,lk_tconst,primaryName,startYear,runtimeMinutes,averageRating,numVotes,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
187595,0.0,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,tt3778644,"[Kathleen Kennedy, Alden Ehrenreich, Woody Har...",2018.0,135.0,6.9,353480,2439012.0
53322,9753.964866,Ocean's Thirteen (2007),Crime|Thriller,tt0496806,"[Jerry Weintraub, George Clooney, Brad Pitt, M...",2007.0,122.0,6.9,352081,2429358.9
99112,16442.717446,Jack Reacher (2012),Action|Crime|Thriller,tt0790724,"[Joe Kraemer, Tom Cruise, Rosamund Pike, Richa...",2012.0,130.0,7.0,346316,2424212.0
54272,19477.740471,"Simpsons Movie, The (2007)",Animation|Comedy,tt0462538,"[George Meyer, Dan Castellaneta, Julie Kavner,...",2007.0,87.0,7.3,334921,2444923.3
7293,19626.846549,50 First Dates (2004),Comedy|Romance,tt0343660,"[Teddy Castellucci, Adam Sandler, Drew Barrymo...",2004.0,99.0,6.8,361324,2457003.2


In [14]:
# find a different movie!
pd.DataFrame(crew_euclid.loc[187595].sort_values()).join(movies,how='left').join(imdb_people.groupby(['item'])['primaryName'].agg(list)).join(attr_nums).head()

Unnamed: 0_level_0,187595,title,genres,lk_tconst,primaryName,startYear,runtimeMinutes,averageRating,numVotes,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
187595,0.0,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,tt3778644,"[Kathleen Kennedy, Alden Ehrenreich, Woody Har...",2018.0,135.0,6.9,353480,2439012.0
6821,0.31944,More American Graffiti (1979),Comedy,tt0079576,"[Caleb Deschanel, Candy Clark, Bo Hopkins, Ron...",1979.0,110.0,5.3,4707,24947.1
3363,0.348462,American Graffiti (1973),Comedy|Drama,tt0069704,"[Ron Eveslage, Richard Dreyfuss, Ron Howard, P...",1973.0,110.0,7.4,92140,681836.0
122886,0.468337,Star Wars: Episode VII - The Force Awakens (2015),Action|Adventure|Fantasy|Sci-Fi|IMAX,tt2488496,"[Kathleen Kennedy, Daisy Ridley, John Boyega, ...",2015.0,138.0,7.8,937743,7314395.4
69524,0.511407,Raiders of the Lost Ark: The Adaptation (1989),Action|Adventure|Thriller,tt0772251,"[Chris Strompolos, Angela Rodriguez, Michael B...",1989.0,100.0,8.0,777,6216.0


In [15]:
# find a different movie!
pd.DataFrame(crew_cosine.loc[187595].sort_values()).join(movies,how='left').join(imdb_people.groupby(['item'])['primaryName'].agg(list)).join(attr_nums).head()

Unnamed: 0_level_0,187595,title,genres,lk_tconst,primaryName,startYear,runtimeMinutes,averageRating,numVotes,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
187595,0.0,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,tt3778644,"[Kathleen Kennedy, Alden Ehrenreich, Woody Har...",2018.0,135.0,6.9,353480,2439012.0
6821,0.081614,More American Graffiti (1979),Comedy,tt0079576,"[Caleb Deschanel, Candy Clark, Bo Hopkins, Ron...",1979.0,110.0,5.3,4707,24947.1
3363,0.105099,American Graffiti (1973),Comedy|Drama,tt0069704,"[Ron Eveslage, Richard Dreyfuss, Ron Howard, P...",1973.0,110.0,7.4,92140,681836.0
3259,0.147823,Far and Away (1992),Adventure|Drama|Romance,tt0104231,"[Daniel P. Hanley, Tom Cruise, Nicole Kidman, ...",1992.0,140.0,6.6,65216,430425.6
166528,0.177873,Rogue One: A Star Wars Story (2016),Action|Adventure|Fantasy|Sci-Fi,tt3748528,"[George Lucas, Felicity Jones, Diego Luna, Ala...",2016.0,133.0,7.8,647294,5048893.2


In [16]:
# find a different movie!
pd.DataFrame(tag_euclid.loc[187595].sort_values()).join(movies,how='left').join(imdb_people.groupby(['item'])['primaryName'].agg(list)).join(attr_nums).head()


Unnamed: 0_level_0,187595,title,genres,lk_tconst,primaryName,startYear,runtimeMinutes,averageRating,numVotes,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
187595,0.0,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,tt3778644,"[Kathleen Kennedy, Alden Ehrenreich, Woody Har...",2018.0,135.0,6.9,353480,2439012.0
61160,1.189067,Star Wars: The Clone Wars (2008),Action|Adventure|Animation|Sci-Fi,tt1185834,"[Catherine Winder, Matt Lanter, Nika Futterman...",2008.0,98.0,5.9,67710,399489.0
173291,1.313699,Valerian and the City of a Thousand Planets (2...,Action|Adventure|Sci-Fi,tt2239822,"[Thierry Arbogast, Dane DeHaan, Cara Delevingn...",2017.0,136.0,6.4,184917,1183468.8
65982,1.406706,Outlander (2008),Action|Adventure|Sci-Fi,tt0462465,"[Pierre Gill, Jim Caviezel, Sophia Myles, Ron ...",2008.0,115.0,6.2,76559,474665.8
70336,1.481016,G.I. Joe: The Rise of Cobra (2009),Action|Adventure|Sci-Fi|Thriller,tt1046173,"[Lorenzo di Bonaventura, Dennis Quaid, Channin...",2009.0,118.0,5.7,210531,1200026.7


In [17]:
# find a different movie!
pd.DataFrame(tag_cosine.loc[187595].sort_values()).join(movies,how='left').join(imdb_people.groupby(['item'])['primaryName'].agg(list)).join(attr_nums).head()


Unnamed: 0_level_0,187595,title,genres,lk_tconst,primaryName,startYear,runtimeMinutes,averageRating,numVotes,total_ratings
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
187595,0.0,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,tt3778644,"[Kathleen Kennedy, Alden Ehrenreich, Woody Har...",2018.0,135.0,6.9,353480,2439012.0
61160,0.141215,Star Wars: The Clone Wars (2008),Action|Adventure|Animation|Sci-Fi,tt1185834,"[Catherine Winder, Matt Lanter, Nika Futterman...",2008.0,98.0,5.9,67710,399489.0
173291,0.143262,Valerian and the City of a Thousand Planets (2...,Action|Adventure|Sci-Fi,tt2239822,"[Thierry Arbogast, Dane DeHaan, Cara Delevingn...",2017.0,136.0,6.4,184917,1183468.8
65982,0.150813,Outlander (2008),Action|Adventure|Sci-Fi,tt0462465,"[Pierre Gill, Jim Caviezel, Sophia Myles, Ron ...",2008.0,115.0,6.2,76559,474665.8
101864,0.151778,Oblivion (2013),Action|Adventure|Sci-Fi|IMAX,tt1483013,"[Duncan Henderson, Tom Cruise, Morgan Freeman,...",2013.0,124.0,7.0,532284,3725988.0


In [20]:
# clear up some memory
del attr_nums_dist

del crew_euclid
del crew_cosine
del tag_euclid
del tag_cosine

### Discussion:

Look at that, for a sample Star Wars film, our feature attributes and PCA dimensionality reduction retain the correlation (business logic) we would need. We see that Solo is n-dimensionally close to other George Lucas films.

Sweet!


## Step 4 : Create the Model

In [21]:

# function city!!

def calc_sim(name,features,distance_calc):
    t0=time.perf_counter()
    sim = 1-squareform(pdist(features,distance_calc))
    scaler = MinMaxScaler()
    scaler.fit_transform(sim)
    
    t1=time.perf_counter()
    print(f'''{name} {distance_calc} similarity calculation time : ''',t1-t0)

    return sim

def predict(item_sim):
    t0=time.perf_counter()
    preds = np.ndarray(shape=len(test_r['user']))

    for row,user,item in zip(range(len(test_r['user'])),test_r['user'],test_r['item']):

        list_of_rating = train_ratings_mr[user_id.get(user)]
        list_of_movies = item_sim[item_id.get(item)]

        preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)

    preds[np.isnan(preds)]=3 #In case there is nan values in prediction, it will impute to 3.
    t1=time.perf_counter()
    print(f'''predict calculation time : ''',t1-t0)
    return preds


def error_rate(preds):    
    yt=np.array(test_r['rating'])
    error = np.sqrt(((yt-preds)**2).mean())
    print(error)
    return error

def convert_df_for_sim(df):
     return np.array(df.astype(pd.SparseDtype("int", 0)).sparse.to_coo().toarray())

def run_model(item_sim_map):
    sim_scores = np.ones(shape=(len(movies),len(movies)))
    for key,val in item_sim_map.items():
        sim_scores *= calc_sim(key,val[0], val[1])
    
    scaler = MinMaxScaler()
    scaler.fit_transform(sim_scores)
    
    pred = predict(sim_scores)
    
    return error_rate(pred)
    
    

### Discussion:

Not much to say here other than, I spent a lot of time to get here! Making sure the algorithm makes efficient use of python data structures and objects is essential to having the notebook run in a "short" amout of time. I'm talking 5 days! I'm not ashamed to say it. But now I know the things to address earlier in a project.

## 5. Find the best similarity scores for each feature set

In [30]:


cat_features = {
    'r':train_ratings_mr.T,
    'g':convert_df_for_sim(genres_df),
    'ab':convert_df_for_sim(attr_bool),
}
bool_distances = ['dice','hamming','yule','jaccard']

num_features = {
    'r':train_ratings_mr.T,
    'an':convert_df_for_sim(attr_nums),
    'p':convert_df_for_sim(imdb_has_people_pca),
    't':convert_df_for_sim(train_tags_pca)
}
num_distances = ['cosine','euclidean','braycurtis','canberra','chebyshev','correlation','seuclidean','sqeuclidean']


In [31]:
# let's determin the best singular model
single_item_sim_models = {}

for name,sim in cat_features.items():
    print(name)
    single_item_sim_models[name]=single_item_sim_models.get(name,{})
    for dist in list(bool_distances):
        print(dist)
        try:
            single_item_sim_models[name][dist] = run_model({0:[sim,str(dist)]})
        except Exception as e:
            single_item_sim_models[name][dist]=f'''{str(e.__class__.__name__)}'''

for name,sim in num_features.items():
    print(name)
    single_item_sim_models[name]=single_item_sim_models.get(name,{})
    for dist in list(num_distances):
        print(dist)
        try:
            single_item_sim_models[name][dist] = run_model({0:[sim,str(dist)]})
        except Exception as e:
            single_item_sim_models[name][dist]=f'''{str(e.__class__.__name__)}'''


r
dice
0 dice similarity calculation time :  25.130537933000028


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.133552872999985
0.9194304452306723
hamming
0 hamming similarity calculation time :  38.73223514800003
predict calculation time :  0.9994151250000414
0.9433548766099663
yule
0 yule similarity calculation time :  36.003919634


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0106689730000085
inf
jaccard
0 jaccard similarity calculation time :  21.374042872000018


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0508306620000099
0.8895234326936754
g
dice
0 dice similarity calculation time :  3.2674686990000055


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0381338050000295
0.9201070698305442
hamming
0 hamming similarity calculation time :  2.833653435999963
predict calculation time :  1.0478322729999832
0.9408999708906615
yule
0 yule similarity calculation time :  3.6098368059999757


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0340957680000429
inf
jaccard
0 jaccard similarity calculation time :  2.8211593880000123


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  0.9855493020000381
0.918269500206824
ab
dice
0 dice similarity calculation time :  2.4932289600000104


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  0.9869477599999641
0.9436012097293524
hamming
0 hamming similarity calculation time :  2.5064724300000307
predict calculation time :  0.9741026359999978
0.9438495800100112
yule
0 yule similarity calculation time :  2.7192401969999764
predict calculation time :  1.0051287430000002
0.9443240774156871
jaccard
0 jaccard similarity calculation time :  2.4929153119999796


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  0.9986635049999677
0.9435939487966806
r
cosine
0 cosine similarity calculation time :  36.138533378000034
predict calculation time :  0.9913556619999895
1.1583514403101687
euclidean
0 euclidean similarity calculation time :  59.224902524000015
predict calculation time :  1.059977441000001
0.970070785875223
braycurtis
0 braycurtis similarity calculation time :  77.26571085900002


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0000624369999969
0.9098680804610966
canberra
0 canberra similarity calculation time :  102.648454066
predict calculation time :  1.0315390389999948
0.9808031727398021
chebyshev
0 chebyshev similarity calculation time :  50.10638550099998
predict calculation time :  1.037209737000012
0.9500530171553846
correlation
0 correlation similarity calculation time :  36.30296569500001
predict calculation time :  1.0024592450000682
1.1583514403101687
seuclidean
0 seuclidean similarity calculation time :  38.644440145999965
predict calculation time :  1.0491470389999904
0.9724839510062155
sqeuclidean
0 sqeuclidean similarity calculation time :  59.62418608799999
predict calculation time :  1.0196278829999983
1.0010753007847943
an
cosine
0 cosine similarity calculation time :  2.4539868909999996
predict calculation time :  1.0280239950000123
0.9437333418521536
euclidean
0 euclidean similarity calculation time :  2.44711727899994
predict calculation time :  1.0244971789

  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)
  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0193066450000288
inf
braycurtis
0 braycurtis similarity calculation time :  3.383697194000092


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.034221319999915
1.1520363061723542
canberra
0 canberra similarity calculation time :  3.3425519730000133


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)
  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0219347549999611
inf
chebyshev
0 chebyshev similarity calculation time :  2.8932355160000043


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)
  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0512923589999446
inf
correlation
0 correlation similarity calculation time :  3.2622032120000313
predict calculation time :  1.0632918999999674
1.1583514403101687
seuclidean
0 seuclidean similarity calculation time :  3.605740245999982
predict calculation time :  1.0406938820000278
1.1583514403101687
sqeuclidean
0 sqeuclidean similarity calculation time :  2.939895747000037


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)
  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0280504410000049
inf
t
cosine
0 cosine similarity calculation time :  3.058185011999967
predict calculation time :  1.068926174000012
1.1583514403101687
euclidean
0 euclidean similarity calculation time :  2.6892058720000023
predict calculation time :  1.0307689349999691
5.09293866188646
braycurtis
canberra
0 canberra similarity calculation time :  3.367411433999905


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0106502149999415
inf
chebyshev
0 chebyshev similarity calculation time :  2.813097475000177


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)
  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  0.9849365180000405
inf
correlation
0 correlation similarity calculation time :  3.04227243299988
predict calculation time :  1.0010733670001173
1.1583514403101687
seuclidean
0 seuclidean similarity calculation time :  3.424032577000162
predict calculation time :  1.0204737069998373
1.1519615951150444
sqeuclidean
0 sqeuclidean similarity calculation time :  2.734133045999897
predict calculation time :  1.0320878180000363
1.5109961383454067


In [32]:
# let's take the best of each sitance results and try out the different combinations
single_results = pd.DataFrame.from_dict(single_item_sim_models)
single_results

Unnamed: 0,r,g,ab,an,p,t
dice,0.91943,0.920107,0.943601,,,
hamming,0.943355,0.9409,0.94385,,,
yule,inf,inf,0.944324,,,
jaccard,0.889523,0.91827,0.943594,,,
cosine,1.158351,,,0.943733,1.158351,1.158351
euclidean,0.970071,,,1.037114,inf,5.092939
braycurtis,0.909868,,,0.910679,1.152036,ValueError
canberra,0.980803,,,417.955247,inf,inf
chebyshev,0.950053,,,1.037366,inf,inf
correlation,1.158351,,,0.943725,1.158351,1.158351


### Discussion:

The above table shows us that not all distance metrics are applicable and that the jaccard similarity for the ratings matrix is our best single model. Let's see if any combinations can make the score better!

Since the tags don't seem to help, we will exclude them. We could however change our PCA, or add all tags, in order to improve this. But we will move on.


## Step 5 : Find the best combined model set

In [33]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'an':[convert_df_for_sim(attr_nums),single_results['an'].idxmin(axis=0)],
}

run_model(best_features)

r jaccard similarity calculation time :  21.654609289999826
an braycurtis similarity calculation time :  2.538960463999956


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  0.9983116800001426
0.8804253779067628


0.8804253779067628

In [34]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'ab':[convert_df_for_sim(attr_nums),single_results['ab'].idxmin(axis=0)],
}

run_model(best_features)

r jaccard similarity calculation time :  21.635986611000135
ab jaccard similarity calculation time :  2.440508402999967


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0714598389999992
0.8739275697445853


0.8739275697445853

In [35]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'g':[convert_df_for_sim(genres_df),single_results['g'].idxmin(axis=0)],
}

run_model(best_features)

r jaccard similarity calculation time :  21.453190594000034
g jaccard similarity calculation time :  3.13599322999994


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0321405960000902
0.8794473421170818


0.8794473421170818

In [36]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'g':[convert_df_for_sim(genres_df),single_results['g'].idxmin(axis=0)],
    'an':[convert_df_for_sim(attr_nums),single_results['an'].idxmin(axis=0)],
}

run_model(best_features)

r jaccard similarity calculation time :  21.497871136999947
g jaccard similarity calculation time :  2.881446653000012
an braycurtis similarity calculation time :  2.539622012000109


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0525783890000184
0.8741658261562708


0.8741658261562708

In [37]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'g':[convert_df_for_sim(genres_df),single_results['g'].idxmin(axis=0)],
    'ab':[convert_df_for_sim(attr_bool),single_results['ab'].idxmin(axis=0)],
    'an':[convert_df_for_sim(attr_nums),single_results['an'].idxmin(axis=0)],
}

run_model(best_features)

r jaccard similarity calculation time :  21.727502865000133
g jaccard similarity calculation time :  2.92224142699979
ab jaccard similarity calculation time :  2.51478099499991
an braycurtis similarity calculation time :  2.5145446319997973


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0352048029999423
0.874579311205187


0.874579311205187

In [38]:
best_features = {
    'r':[train_ratings_mr.T,single_results['r'].idxmin(axis=0)],
    'g':[convert_df_for_sim(genres_df),single_results['g'].idxmin(axis=0)],
    'ab':[convert_df_for_sim(attr_bool),single_results['ab'].idxmin(axis=0)],
    'an':[convert_df_for_sim(attr_nums),single_results['an'].idxmin(axis=0)],
    'p':[convert_df_for_sim(imdb_has_people_pca),single_results['p'].idxmin(axis=0)]
}

run_model(best_features)

r jaccard similarity calculation time :  21.60796051500006
g jaccard similarity calculation time :  2.9286914700001034
ab jaccard similarity calculation time :  2.525378572999898
an braycurtis similarity calculation time :  2.5160674659998676
p braycurtis similarity calculation time :  3.312950168000043


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.0558434950000901
1.1519287698990364


1.1519287698990364

### Discussion:

That's a lot of printouts! I could have added automation here, but i'm so burnt out. I came, I learned, I concured what I set out to do.

Here we can see that the best model involves 
- Training ratings matrix (train_ratings_mr)
- IMDb boolean attributes (attr_nums)

This was an interesting finding. It seems that format matters ...  

I was really hoping the IMDb Crew feature set would make the list becasue, to me, it's one of the more important recommendation relevant pieces. 


## Step 6 : Make a Recommendation

In [40]:
sample_user = 30

item_sim = (calc_sim('sim1',train_ratings_mr.T,'jaccard')*calc_sim('sim1',convert_df_for_sim(attr_nums),'jaccard'))
predicted_ratings = predict(item_sim)

sim1 jaccard similarity calculation time :  21.869848981999894
sim1 jaccard similarity calculation time :  2.568365415000244


  preds[row] = np.dot(list_of_rating,list_of_movies)/np.dot(list_of_movies,list_of_rating>0)


predict calculation time :  1.019787684000221


In [47]:
pd.Series(predicted_ratings)

0        2.668213
1        3.973643
2        3.844304
3        2.948013
4        4.106015
           ...   
29994    3.862827
29995    4.708908
29996    4.367247
29997    3.860898
29998    4.692296
Length: 29999, dtype: float64

In [75]:
movies_rated = train_r.copy()
movies_rated = movies_rated[movies_rated['user']==sample_user].sort_values(by='rating',ascending=False).join(movies,on='item')
movies_rated.head()

Unnamed: 0,user,item,rating,timestamp,title,genres,lk_tconst
4863,30,60069,5.0,1500370453,WALL·E (2008),Adventure|Animation|Children|Romance|Sci-Fi,tt0910970
4869,30,95510,5.0,1500370369,"Amazing Spider-Man, The (2012)",Action|Adventure|Sci-Fi|IMAX,tt0948470
4878,30,122904,5.0,1500370440,Deadpool (2016),Action|Adventure|Comedy|Sci-Fi,tt1431045
4858,30,5952,5.0,1500370399,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,tt0167261
4862,30,59315,5.0,1500370444,Iron Man (2008),Action|Adventure|Sci-Fi,tt0371746


In [77]:
movies_recommended = test_r.copy()
movies_recommended = movies_recommended.reset_index(drop=True).join(pd.Series(predicted_ratings,name='pred_rating'))
movies_recommended = movies_recommended[movies_recommended['user']==sample_user].sort_values(by='pred_rating',ascending=False).join(movies,on='item')
movies_recommended.head()


Unnamed: 0,user,item,rating,timestamp,pred_rating,title,genres,lk_tconst
18905,30,7153,5.0,1500370349,5.0,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,tt0167260
21787,30,318,5.0,1500370344,5.0,"Shawshank Redemption, The (1994)",Crime|Drama,tt0111161
20955,30,93510,5.0,1500370378,4.900993,21 Jump Street (2012),Action|Comedy|Crime,tt1232829
17321,30,97913,4.0,1500370371,4.828357,Wreck-It Ralph (2012),Animation|Comedy,tt1772341
17909,30,68954,5.0,1500370450,4.793894,Up (2009),Adventure|Animation|Children|Drama,tt1049413


### Discussion:

Looks at that, A recommendation based on the users ratings and prioritized by the highes predicted rating (calculated using our best model combo)!

## Step 7 : Discussion / Overall Conclusion

### Learning and takeaways :

Data is hard when it's hard! I am used to working in pandas and spreadsheets. Changing my mental perception to COO format was a learning curve that I discovered late in the game.


### What didn’t work :

Using other users ratings to create a user-user similarity matrix to incorporate here. The idea was realted to the probablitity that a movie is a relevant recommendation becuase others like me liked them. I'm still confident this can work, I just ran out of steam.


### Ways to improve : 

I could add more automation to the utlimate recommendation part. More importantly, I could expore and utilize skLearns pipeline libraries to see if that automation I did could be outsourced. 

