# Instructions

When a user looks at a specific genre of movies, the chart should show only movies in that category (or possibly in similar categories). Implement a filter that shows only the movies relevant to the category.

For example, a business rule might be: for animated movies, show a chart of movies from the Animation and Family genres.
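
A business rule of this kind can be written as a lookup table from the selected genre to the genres allowed in its chart. The following is a minimal sketch, assuming the pipe-separated `genres` layout of `movies.dat`; the names `GENRE_RULES`, `allowed_genres`, and `is_relevant`, and the toy rows, are hypothetical and only illustrate the idea.

```python
# Hypothetical rule table: selected genre -> genres allowed in its chart.
GENRE_RULES = {"Animation": {"Animation", "Family"}}


def allowed_genres(selected):
    """Genres whose movies may appear in the chart for `selected`."""
    # Genres without an explicit rule fall back to themselves only.
    return GENRE_RULES.get(selected, {selected})


def is_relevant(genres_field, selected):
    """True if a pipe-separated genres string shares a genre with the rule."""
    movie_genres = set(genres_field.split("|")) if genres_field else set()
    return bool(movie_genres & allowed_genres(selected))


# Made-up rows in the (title, genres) layout of movies.dat.
toy_movies = [
    ("Movie A", "Animation|Comedy"),
    ("Movie B", "Family"),
    ("Movie C", "Horror"),
    ("Movie D", ""),
]
animation_chart = [t for t, g in toy_movies if is_relevant(g, "Animation")]
print(animation_chart)
```

Splitting on `|` and comparing whole genre names avoids accidental substring matches such as `Music` inside `Musical`.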


# Report

a. The non-personalized feature you will implement (or have implemented).

I implemented a chart based on raw popularity, measured as the number of ratings each movie received. Movies with more ratings tended to have better reviews.

b. Your chart calculation method.

I will present the movies with the most ratings first for the selected category.
Because a movie that fits into more than one category has all of those categories listed, I can expand the selected category with related ones: the categories that most frequently appear alongside it.
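
The co-occurrence idea can be illustrated on a handful of made-up genre strings. This is only a sketch under the assumption that genres are pipe-separated as in `movies.dat`; the helper name `related_genres` and the toy data are hypothetical, and the notebook itself does the equivalent with pandas.

```python
from collections import Counter


def related_genres(genre_fields, selected, top_n=2):
    """Return the `top_n` genres that most often co-occur with `selected`."""
    counts = Counter()
    for field in genre_fields:
        movie_genres = set(field.split("|")) if field else set()
        if selected in movie_genres:
            # Count every other genre listed on the same movie.
            counts.update(movie_genres - {selected})
    return [genre for genre, _ in counts.most_common(top_n)]


# Made-up genre strings, one per movie.
toy_genres = [
    "Animation|Family|Comedy",
    "Animation|Family",
    "Animation|Comedy|Family",
    "Animation|Adventure",
    "Drama",
]
print(related_genres(toy_genres, "Animation"))
```

Here `Family` co-occurs with `Animation` three times and `Comedy` twice, so those two are returned as the related genres.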

c. Your implementation.

(see below)

d. How do the genre charts look?

Most of the movie suggestions look reasonable, although most are from before the 1940s. Perhaps these are simply the most-rated movies, or the default row order is by age, so that I'm grabbing the oldest ones first.

e. What would you improve if given more time?

I can think of a few things to improve:

1. Provide a wider variety of 'popular' recommendations for each category, for example by spreading them across the other genres the movies belong to.
2. Combine different genres into a single chart.
3. Extract the release year and spread the recommendations across years.
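
The third idea could look roughly like the sketch below. It assumes that titles end with the release year in parentheses, e.g. `Some Title (1994)`, which has not been verified against `movies.dat`; the helper names `extract_year` and `spread_by_decade` and the sample titles are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the title ends with a four-digit year in parentheses.
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing year as an int, or None if the title has none."""
    match = YEAR_RE.search(title)
    return int(match.group(1)) if match else None


def spread_by_decade(titles, per_decade=1):
    """Take up to `per_decade` titles from each decade, keeping input order."""
    taken = defaultdict(int)
    picked = []
    for title in titles:
        year = extract_year(title)
        if year is None:
            continue  # skip titles without a recognizable year
        decade = year // 10 * 10
        if taken[decade] < per_decade:
            taken[decade] += 1
            picked.append(title)
    return picked


# Made-up titles, already ordered by popularity.
toy_titles = [
    "Old One (1931)",
    "Old Two (1935)",
    "Newer (1994)",
    "No Year",
    "Newest (1999)",
    "Modern (2010)",
]
print(spread_by_decade(toy_titles))
```

Because the input order is preserved, feeding in titles sorted by number of ratings keeps the most-rated title of each decade.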


In [1]:
import csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pathlib

In [2]:
# user_id::movie_id::rating::rating_timestamp
# a multi-character separator needs the python parsing engine
df = pd.read_csv(r'../data/ratings.dat', sep='::', engine='python',
                 names=['user_id', 'movie_id', 'rating', 'rating_timestamp'])
df.shape

(888452, 4)

In [44]:
# movie_id::title::genre1|genre2
# a multi-character separator needs the python parsing engine
movie_df = pd.read_csv(r'../data/movies.dat', sep='::', engine='python',
                       names=['movie_id', 'title', 'genres'])
movie_df.shape

(36383, 3)

In [45]:
movie_df['genres'] = movie_df['genres'].fillna('')

In [39]:
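# collect the distinct genre names; movies without genres contribute ''
genres = set(movie_df['genres'].str.split('|').explode().dropna())
genres.discard('')  # discard() does not fail if '' is absent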

In [49]:
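# prepare one boolean column per genre
# (exact match on the split list; str.contains would also match 'Music' inside 'Musical')
genre_lists = movie_df['genres'].str.split('|')
for genre in genres:
    movie_df[genre] = genre_lists.apply(lambda gs, g=genre: g in gs)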

movie_df.columns

Index(['movie_id', 'title', 'genres', '', 'Mystery', 'Adventure', 'Fantasy',
       'Romance', 'Animation', 'Film-Noir', 'Game-Show', 'Comedy', 'Drama',
       'Family', 'Sci-Fi', 'News', 'Reality-TV', 'Short', 'Biography',
       'Musical', 'Talk-Show', 'Sport', 'Documentary', 'Thriller', 'History',
       'War', 'Horror', 'Action', 'Crime', 'Music', 'Western', 'Adult'],
      dtype='object')

In [54]:
# number of ratings per movie
n_rating_df = df.groupby('movie_id').size().reset_index(name='n_ratings')

pop_movies = n_rating_df[n_rating_df.n_ratings > 10].copy()  # 20% most popular movies
# attach titles and genre columns, most-rated movies first
pop_df = pd.merge(pop_movies, movie_df, on='movie_id').sort_values(by='n_ratings', ascending=False)


In [56]:
# for each genre, get the 10 most popular films
# (select from pop_df, which is sorted by n_ratings; movie_df is still in file order)
recs = {}
for genre in genres:
    recs[genre] = list(pop_df[pop_df[genre]]['title'].head(10))


In [62]:
# for each genre, get the 2 related genres that co-occur with it most often
similar_genres = {}
for genre in genres:
    curr = list(genres - {genre})
    # among movies of this genre, count how many also carry each other genre
    related = movie_df.loc[movie_df[genre] == True, curr].sum().reset_index()
    related.columns = ['genre', 'count']
    related_genres = list(related.sort_values(by='count', ascending=False)['genre'].head(2))
    similar_genres[genre] = related_genres

In [68]:
# output data as csv
for genre in genres:
    with open(f'{genre}.csv', 'w', newline='', encoding='utf8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['title'])
        for item in recs[genre]:
            writer.writerow([item])
        for sg in similar_genres[genre]:
            for item in recs[sg][:3]:
                writer.writerow([item])
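
A movie that carries both a genre and one of its related genres can be written twice by the loop above, once from each list. A possible way to avoid this is sketched below; `build_chart` is a hypothetical helper, and taking up to three unseen titles per related genre (rather than exactly the first three) is a design choice of this sketch, not what the export cell currently does.

```python
def build_chart(own, related_lists, per_related=3):
    """Own-genre titles first, then up to `per_related` new titles per related genre."""
    chart = list(dict.fromkeys(own))  # drop repeats, keep order
    seen = set(chart)
    for titles in related_lists:
        added = 0
        for title in titles:
            if added == per_related:
                break
            if title not in seen:
                chart.append(title)
                seen.add(title)
                added += 1
    return chart


# Made-up titles: "B" and "C" appear in more than one list.
example_chart = build_chart(["A", "B"], [["B", "C", "D", "E", "F"], ["A", "C", "G"]])
print(example_chart)
```

With the notebook's variables, the call would be along the lines of `build_chart(recs[genre], [recs[sg] for sg in similar_genres[genre]])`, with each resulting title written as one CSV row.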