## EECS 731 Project 3
### Adam Podgorny

In [468]:
import pandas as pd
from sklearn.cluster import KMeans

In [469]:
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
tags = pd.read_csv("tags.csv")

In [470]:
tag_counts = tags['tag'].value_counts()
tags['counts'] = tags['tag'].apply(lambda x: tag_counts[x])
tags = tags[tags['counts'] > 2 ]
##Tags are only useful if other items share them, so tags used fewer than three times get dropped

In [471]:
len(tags['movieId'].unique()) ##That's a lot of movies

1169

In [472]:
moviemap = movies[['movieId', 'title']].copy()  # movieId-to-title lookup used later to label clusters
moviemap

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017)
9738,193583,No Game No Life: Zero (2017)
9739,193585,Flint (2017)
9740,193587,Bungo Stray Dogs: Dead Apple (2018)


So, as we talked about in class, different users have different rating habits. As such, we are going to range-normalize the ratings a bit down the line. The userId and movieId will also be a good way to link tags to ratings.

In [473]:
def genre_split(s, index):
    s = s.split("|")
    if (index < len(s)):
        return s[index]
    else:
        return ""
    
def is_in(s, x):
    s = s.split("|")
    if (x in s):
        return 1
    else:
        return 0

In [474]:
gs = movies['genres'].apply(lambda x: genre_split(x, 0)).to_list()
gs = gs + movies['genres'].apply(lambda x: genre_split(x, 1)).to_list()
gs = gs + movies['genres'].apply(lambda x: genre_split(x, 2)).to_list()
gs = gs + movies['genres'].apply(lambda x: genre_split(x, 3)).to_list()
gs = gs + movies['genres'].apply(lambda x: genre_split(x, 4)).to_list()
gs = gs + movies['genres'].apply(lambda x: genre_split(x, 5)).to_list()
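gs = set(gs)
gs.discard("")  # drop the empty filler from movies with fewer genres; set order is arbitrary, so slicing off the first item is not safe
gs = list(gs)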

In [475]:
for x in gs:
    movies[x] = movies['genres'].apply(lambda z: is_in(z, x))

In [476]:
movies = movies.drop('genres', axis=1)
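
As an aside, the genre indicator columns built above could probably be produced in one step. The sketch below uses pandas' `Series.str.get_dummies` on a small made-up frame (`toy_movies` and its rows are stand-ins, not the real `movies.csv`), so it does not depend on how many genres a single movie lists.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv with the same three columns.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Flint (2017)"],
    "genres": ["Adventure|Animation|Children", "Adventure|Children|Fantasy", "Drama"],
})

# One indicator column per genre, however many genres a row contains.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Same shape as the notebook's result: id, title, then one column per genre.
toy_encoded = pd.concat([toy_movies.drop(columns="genres"), genre_flags], axis=1)
print(toy_encoded)
```

Because the split happens per row, there is no need to guess a maximum genre count or to strip an empty placeholder afterwards.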

We may also be able to use genres as a validation category outside of the tags, though I am not sure how I want to work with this. Tags are going to be highly varied, and the best solution would really be some kind of embedding or NLP technique to cluster tags first, then assign one of multiple tags to movies from that cluster, THEN cluster on ratings and genre. But let's see what we can do with what we have that doesn't involve deep learning.

In [477]:
movies ##Much better

Unnamed: 0,movieId,title,Fantasy,Children,(no genres listed),Action,Film-Noir,War,Mystery,Western,...,Musical,Animation,Comedy,Sci-Fi,Thriller,Romance,Adventure,Horror,Drama,IMAX
0,1,Toy Story (1995),1,1,0,0,0,0,0,0,...,0,1,1,0,0,0,1,0,0,0
1,2,Jumanji (1995),1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),1,0,0,1,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero (2017),1,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
9739,193585,Flint (2017),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [478]:
userids = ratings['userId'].tolist()
userids = set(userids)
userids = list(userids)
highs = []
lows = []
for u in userids:
    t = ratings[ratings['userId'] == u]
    m = t['rating']
    h = m.max()
    l = m.min()
    highs.append(h)
    lows.append(l)
    
t = [userids, lows, highs]
users = pd.DataFrame(t)

In [479]:
users = users.T
users = users.rename(columns={0:"id", 1:"lo", 2:"hi"})
users = users.set_index("id")

In [480]:
def range_normalize(rating, uid):
    h = users.loc[uid].hi
    l = users.loc[uid].lo
    d = h - l
    if (d==0):
        return rating #by squeeze theorem I guess
    r = (rating - l)/d
    return (r * 5)

In [481]:
ratings['normalized'] = ratings.apply(lambda x: range_normalize(x['rating'], x['userId']), axis=1)
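
The row-wise `apply` above computes `5 * (rating - lo) / (hi - lo)` per user. A vectorized sketch of the same formula is below; it assumes the same fallback as `range_normalize` (users whose ratings are all identical keep their raw rating), and `toy_ratings` is a made-up frame rather than the real data.

```python
import pandas as pd

# Hypothetical ratings: user 3 has a single rating, so hi == lo for them.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "rating": [1.0, 3.0, 5.0, 4.0, 4.5, 2.0],
})

per_user = toy_ratings.groupby("userId")["rating"]
lo = per_user.transform("min")   # each row gets its user's lowest rating
hi = per_user.transform("max")   # each row gets its user's highest rating
spread = hi - lo

# Scale onto [0, 5]; a zero spread becomes NaN and falls back to the raw rating.
scaled = (toy_ratings["rating"] - lo) / spread.where(spread != 0) * 5
toy_ratings["normalized"] = scaled.fillna(toy_ratings["rating"])
print(toy_ratings)
```

`transform` returns a result aligned to the original rows, so no per-user lookup table or Python loop is needed.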

In [482]:
merged = pd.merge(ratings, tags, on=['movieId', 'userId'])
merged = merged.drop(['timestamp_x', 'timestamp_y', 'rating'], axis=1)
merged = merged.rename(columns={'normalized': "rating"})

Range normalization, then linking ratings to tags through userId and movieId. After a bit of column cleanup, this gets us a more cohesive DataFrame to work with.

In [483]:
filtered = merged[merged['rating'] >= 4.0].copy() ##.copy() so the new column is not written to a slice of merged (avoids SettingWithCopyWarning)
tag_counts = filtered['tag'].value_counts()
filtered['counts'] = filtered['tag'].map(tag_counts)
filtered = filtered[filtered['counts'] >= 3] ##This gets us a manageable number of tags


So, going back to an earlier point, too many one-hot-encoded tags will be taxing for KMeans. We also want to make sure the ratings are quality ratings, so I think we can get away with excluding angry reviews; I am assuming that people who give good reviews also give accurate tags for what they liked. Since this is for recommendations, we can simply exclude the negative reviews from the cluster training set to pare things down a bit.

We also want a reasonable number of clusters that are broadly encompassing. Ideally, we would cluster the tags on their own first, but that requires some meaningful way of clustering the movies beforehand, and clustering the movies is exactly what I need the tags for. I could technically group tags by which words describe which genres, but that just bleeds genre information into the tags, and there are probably more sophisticated ways to do it. So instead, we are going to sift the pool of tags down to the ones with better explanatory power. These will likely correlate with genre anyway, which may make genre a good validation check, though I am not sure yet.

I will point out, though, that most tags I see on recommendations tend to be fairly specific keywords, leading me to believe that the tags on, say, Netflix, Hulu, or Disney+ are all mapped to some meta-tag. RNA, however, is not so convenient.

I did play around with the rating and tag-count thresholds a bit. I want a large dataset to train on, but few enough tags to be doable. I may bump these thresholds up a bit if KMeans runs quickly, but I don't want to stretch the limits of the model, and these thresholds feel pretty good. I think a 4 or higher indicates some investment in the film, and at least 3 repetitions means the tag has some generality to it.
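
To make the threshold trade-off concrete, here is a rough sketch of how one might count surviving tags for a few cutoff pairs. The helper `surviving_tags` and the `toy_merged` frame are hypothetical; on the real data the same function would be pointed at `merged`.

```python
import pandas as pd

# Hypothetical stand-in for `merged`: one row per (rating, tag) record.
toy_merged = pd.DataFrame({
    "rating": [5.0, 4.5, 3.5, 3.0, 5.0, 4.0, 2.0, 4.5],
    "tag": ["funny", "funny", "funny", "funny", "dark", "dark", "dark", "weird"],
})

def surviving_tags(df, min_rating, min_count):
    """Count distinct tags left after the rating and repetition cutoffs."""
    liked = df[df["rating"] >= min_rating]
    counts = liked["tag"].value_counts()
    return int((counts >= min_count).sum())

# Sweep a small grid of (rating cutoff, repetition cutoff) pairs.
sweep = {
    (r, c): surviving_tags(toy_merged, r, c)
    for r in (3.0, 4.0)
    for c in (1, 2, 3)
}
for (r, c), n in sweep.items():
    print(f"rating >= {r}, count >= {c}: {n} tags")
```

Each surviving tag becomes one one-hot column, so this count is also the width of the matrix handed to KMeans.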

In [484]:
unique_tags = filtered['tag'].unique()
len(unique_tags)

104

In [485]:
filtered = pd.get_dummies(filtered, columns=['tag'])
filtered

Unnamed: 0,userId,movieId,rating,counts,tag_Al Pacino,tag_Atmospheric,tag_Brad Pitt,tag_Coen Brothers,tag_Disney,tag_High School,...,tag_thought-provoking,tag_thriller,tag_time travel,tag_twist ending,tag_unique,tag_violence,tag_violent,tag_visually appealing,tag_visually stunning,tag_weird
0,2,60756,5.0,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,60756,5.0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,106782,5.0,6,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,106782,5.0,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2,106782,5.0,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020,599,2959,5.0,11,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2021,599,2959,5.0,12,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2022,599,2959,5.0,14,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2023,599,2959,5.0,4,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [486]:
km = KMeans(n_clusters = 30, random_state=400) #Since I did a cutoff at 3 counts minimum, let's use roughly the tag count divided by 3
train = filtered.drop(['userId','counts'],axis=1)
clustered_1 = km.fit_predict(train.drop(['movieId'], axis=1))
train['cluster'] = pd.DataFrame(clustered_1)
simple_rec_model = train
simple_rec_model = pd.merge(train, moviemap, on=['movieId'])
simple_recs = simple_rec_model[['cluster', 'title']].copy()
simple_recs.head(20)

Unnamed: 0,cluster,title
0,27.0,Step Brothers (2008)
1,0.0,Step Brothers (2008)
2,11.0,"Wolf of Wall Street, The (2013)"
3,0.0,"Wolf of Wall Street, The (2013)"
4,0.0,"Wolf of Wall Street, The (2013)"
5,3.0,"Godfather: Part II, The (1974)"
6,17.0,"Godfather: Part II, The (1974)"
7,,"Godfather: Part II, The (1974)"
8,5.0,Lucky Number Slevin (2006)
9,7.0,Fracture (2007)


So here we have some repeats, since each row is one (user, movie, tag) record and a movie can show up once per tag. At first I was going to filter these, but there is actually a use here. KMeans, to my knowledge, assigns each point to exactly one cluster and doesn't handle overlapping boundaries well; however, I want to account for tag overlap. So letting a movie appear in several clusters may actually be useful, because the real currency of information here is the tags and the ratings.

Let's also use the genre tags to see what happens.

In [487]:
withgenre = pd.merge(filtered, movies, on=['movieId'])

In [488]:
g_km = KMeans(n_clusters = 30, random_state=400) #Keeping 30 so we can test the clusters against each other
train = filtered.drop(['userId','counts'],axis=1) #Note: this is still `filtered`, so the genre columns in `withgenre` are not used here
clustered_2 = g_km.fit_predict(train.drop(['movieId'], axis=1))
train['cluster'] = pd.DataFrame(clustered_2)
genre_rec_model = train
genre_rec_model = pd.merge(train, moviemap, on=['movieId'])
genre = simple_rec_model[['cluster', 'title']].copy() #Note: this reads simple_rec_model, not genre_rec_model, so the table repeats the tag-only result
genre.head(20)

Unnamed: 0,cluster,title
0,27.0,Step Brothers (2008)
1,0.0,Step Brothers (2008)
2,11.0,"Wolf of Wall Street, The (2013)"
3,0.0,"Wolf of Wall Street, The (2013)"
4,0.0,"Wolf of Wall Street, The (2013)"
5,3.0,"Godfather: Part II, The (1974)"
6,17.0,"Godfather: Part II, The (1974)"
7,,"Godfather: Part II, The (1974)"
8,5.0,Lucky Number Slevin (2006)
9,7.0,Fracture (2007)
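
The genre cell fits on `filtered` and selects from `simple_rec_model`, which is why its table matches the tag-only one. A minimal sketch of what a genre-inclusive run might look like is below, on made-up frames (`toy_filtered`, `toy_genres`, and the 2-cluster setting are all assumptions for illustration, not the notebook's data or tuning).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-ins for `filtered` (tag one-hots) and `movies` (genre flags).
toy_filtered = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3, 4],
    "rating": [5.0, 4.5, 4.0, 5.0, 4.0, 4.5],
    "tag_funny": [1, 0, 1, 0, 0, 0],
    "tag_dark": [0, 0, 0, 1, 1, 1],
    "tag_pixar": [0, 1, 0, 0, 0, 0],
})
toy_genres = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "Comedy": [1, 1, 0, 0],
    "Horror": [0, 0, 1, 1],
})

# Join genre flags onto every tag record, then train on tags + genres + rating.
toy_withgenre = pd.merge(toy_filtered, toy_genres, on="movieId")
features = toy_withgenre.drop(columns=["movieId", "title"])

toy_g_km = KMeans(n_clusters=2, n_init=10, random_state=400)
# Assigning the label array directly keeps labels matched to rows by position.
toy_withgenre["cluster"] = toy_g_km.fit_predict(features)

toy_genre_recs = toy_withgenre[["cluster", "title"]]
print(toy_genre_recs)
```

Keeping the title in the merged frame also avoids a second merge against `moviemap` after clustering.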


So, the easy way to use this is to push a favorite movie title through the transformation on either genre or tags, see what cluster comes up, and select from there. Since these were top rated, their reviews will be based on good sentiment, rather than negative reviews that got associated with the cluster.
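
A minimal sketch of that lookup, assuming a frame shaped like `simple_recs` (a `cluster` column and a `title` column). The function `recommend` is hypothetical, and the cluster numbers in `toy_recs` are made up for illustration rather than taken from the real model.

```python
import pandas as pd

# Titles appear in the notebook output; the cluster labels here are invented.
toy_recs = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "title": [
        "Step Brothers (2008)",
        "Wolf of Wall Street, The (2013)",
        "Wolf of Wall Street, The (2013)",
        "Godfather: Part II, The (1974)",
        "Fracture (2007)",
        "Lucky Number Slevin (2006)",
    ],
})

def recommend(recs, favorite, limit=5):
    """Return other titles that share at least one cluster with `favorite`."""
    clusters = recs.loc[recs["title"] == favorite, "cluster"].unique()
    same = recs[recs["cluster"].isin(clusters) & (recs["title"] != favorite)]
    return same["title"].drop_duplicates().head(limit).tolist()

print(recommend(toy_recs, "Wolf of Wall Street, The (2013)"))
```

Because a movie can sit in several clusters (one per tag record), the lookup collects every cluster the favorite belongs to before selecting neighbours.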

But, uh, some of these clusters seem a bit off. Including larger sets of tags or one-hot encoding the tags on a per-movie basis may have helped with this; I may retroactively do that. Comedy and horror keep getting conflated. Granted, that may be somewhat subjective depending on the reviewer. A more sophisticated approach would analyze the user submitting the review to account for their own preferences.

I am not really sure how to test or validate this, since clustering does a bit of inferential logic that may have fuzzy boundaries rather than discrete categories.
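
One common label-free check is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster. The sketch below runs it over a few cluster counts on synthetic blobs (`toy_X` is made up; nothing here was run on the notebook's data), just to show the shape of such a check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(400)

# Synthetic stand-in: three well-separated blobs of 30 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Score each candidate cluster count; values closer to 1 mean tighter clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=400).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On sparse one-hot tag data the scores would likely be much lower and flatter than on these blobs, so this is better read as a relative comparison between settings than as an absolute quality measure.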

Let's try grouping by the movie ID really quickly.

In [489]:
filtered = merged[merged['rating'] >= 3.0].copy() ##Let's lower the threshold a little; .copy() avoids SettingWithCopyWarning
tag_counts = filtered['tag'].value_counts()
filtered['counts'] = filtered['tag'].map(tag_counts)
filtered = filtered[filtered['counts'] >= 5] ##This gets us a manageable number of tags

filtered = pd.get_dummies(filtered, columns=['tag'])

filtered = filtered.groupby("movieId").sum() ##Summing per movie turns the one-hot tag columns into per-movie tag counts (userId, rating and counts get summed too)
filtered


movieId,userId,rating,counts,tag_Adam Sandler,tag_Al Pacino,tag_Animal movie,tag_Astaire and Rogers,tag_Atmospheric,tag_Australia,tag_Coen Brothers,...,tag_tense,tag_terrorism,tag_thought-provoking,tag_thriller,tag_time travel,tag_twist ending,tag_violence,tag_visually appealing,tag_witty,tag_zombies
1,567,3.333333,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,474,3.888889,7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26,474,3.333333,10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28,474,4.444444,52,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32,2806,24.791667,77,0,0,0,0,0,0,0,...,0,0,0,0,3,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180031,1134,7.777778,47,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
183611,62,3.750000,20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
184471,62,3.125000,7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
187593,62,3.750000,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [492]:
km = KMeans(n_clusters = 20, random_state=400)  #20 genres, maybe 20 clusters
train = filtered.drop(['userId','counts'],axis=1)
clustered_1 = km.fit_predict(train.reset_index())
train['cluster'] = pd.DataFrame(clustered_1)
simple_rec_model = train
simple_rec_model = pd.merge(train, moviemap, on=['movieId'])
simple_recs = simple_rec_model[['cluster', 'title']].copy()
simple_recs.head(50)

Unnamed: 0,cluster,title
0,0.0,Toy Story (1995)
1,0.0,Casino (1995)
2,0.0,Othello (1995)
3,0.0,Persuasion (1995)
4,0.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
5,0.0,Babe (1995)
6,0.0,Clueless (1995)
7,0.0,Richard III (1995)
8,0.0,Restoration (1995)
9,0.0,Seven (a.k.a. Se7en) (1995)


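This may be a pandas issue: I am seeing some NaNs where there shouldn't be any. Maybe the columns got thrown off? One likely culprit is `train['cluster'] = pd.DataFrame(clustered_1)`, since pandas aligns that assignment on index labels rather than row position. Either way, I am running out of time.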
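
A small sketch of how NaNs like these can arise when cluster labels are wrapped in a DataFrame before assignment. The frame `toy_train` and the label array are made up; the point is only the difference between label-based and positional assignment.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after filtering rows.
toy_train = pd.DataFrame({"x": [0.1, 0.2, 0.9, 1.0]}, index=[0, 2, 5, 7])
toy_labels = np.array([0, 0, 1, 1])  # pretend output of fit_predict, in row order

# Wrapped in a DataFrame, the labels get a fresh 0..3 index, and the
# assignment lines them up by index label: 5 and 7 find no match.
misaligned = toy_train.copy()
misaligned["cluster"] = pd.DataFrame(toy_labels)

# Assigning the raw array goes by position, so every row gets its own label.
aligned = toy_train.copy()
aligned["cluster"] = toy_labels

print(misaligned)
print(aligned)
```

Note that the row with index 2 does get a value in the misaligned frame, but it is the label of the third row, so non-NaN entries can be wrong too. Separately, `train.reset_index()` in the grouped run passes movieId to KMeans as a feature, which may also skew that result.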


I was going to try to post the clusters as a visualization, but I'm not sure how that would work here, especially in a way that isn't hyper dense to look at over so many axes. Even then, there still seem to be some bugs here, so I'm not sure what that would accomplish.
Saving my filtered DataFrame.
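
One possible way around the many-axes problem is to project the feature matrix down to two components before plotting. The sketch below uses PCA on a random binary matrix (`toy_X` is a made-up stand-in for the one-hot tag columns, and 4 clusters is an arbitrary choice); it stops short of drawing anything.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(400)

# Synthetic stand-in for the one-hot tag matrix: 60 rows, 12 binary columns.
toy_X = rng.integers(0, 2, size=(60, 12)).astype(float)

toy_labels = KMeans(n_clusters=4, n_init=10, random_state=400).fit_predict(toy_X)

# Reduce 12 dimensions to 2 so each row becomes a point on a plane.
coords = PCA(n_components=2, random_state=400).fit_transform(toy_X)

# With matplotlib installed, the projection could be drawn with
# plt.scatter(coords[:, 0], coords[:, 1], c=toy_labels).
print(coords[:5])
```

Two components rarely capture much of the variance in sparse one-hot data, so overlapping blobs in such a plot would not necessarily mean the clusters overlap in the full space.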