### Citation
To acknowledge use of the dataset in publications, please cite the following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

In [3]:
data_fp = os.path.join('~', 'git', 'netflix_challenge', 'data', 'ml-latest-small')

# The movie/rating datasets

In [4]:
df_tags = pd.read_csv(os.path.join(data_fp, 'tags.csv'))
df_links = pd.read_csv(os.path.join(data_fp, 'links.csv'))
df_movies = pd.read_csv(os.path.join(data_fp, 'movies.csv'))
df_ratings = pd.read_csv(os.path.join(data_fp, 'ratings.csv'))

In [5]:
df_tags.head()

|  | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 15 | 339 | sandra 'boring' bullock | 1138537770 |
| 1 | 15 | 1955 | dentist | 1193435061 |
| 2 | 15 | 7478 | Cambodia | 1170560997 |
| 3 | 15 | 32892 | Russian | 1170626366 |
| 4 | 15 | 34162 | forgettable | 1141391765 |


In [6]:
df_links.head()

|  | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [7]:
df_movies.head()

|  | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [8]:
df_ratings.head()

|  | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


# Descriptive Questions
- What is the most popular movie?
- What is the most popular genre per year?
- Are there clusters of users?
- After a movie of genre X, which other genre is watched?
- How many ratings are given per year?
- Is the number of movies a user has watched correlated with the average score they give?
- How are years and genres related?
- How are ratings and genres related?
- How are ratings and years related?
- Are there clusters of movies?
- Which genres are correlated with each other?

# Task 1 Part b) Data preprocessing
1. Make a data matrix G that expresses the association between movies and genres
    - Rows are `movieId` values and columns are genres. Surprise: some `movieId` values don't correspond to a movie.
2. Plot the number of movies per genre
    - Some movies may not carry a genre. How does this show up in the data?
3. Visualize the tendency of genres to co-occur in the same movies
    - Hint: a useful tool is the matplotlib function `imshow`
    - Genres vs. genres, with a count in each cell
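
The three steps above can be sketched on a toy frame. This is a minimal sketch, not the required solution: `toy_movies` is a hypothetical stand-in for `df_movies`, the names `G`, `movies_per_genre` and `cooc` are made up for illustration, and it assumes that genres are `|`-separated (as in the `df_movies.head()` output) and that movies without a genre carry the placeholder `(no genres listed)`.

```python
import pandas as pd

# Toy stand-in for df_movies (hypothetical rows; the real data comes from movies.csv).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": [
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy",
        "(no genres listed)",  # assumed placeholder for a movie without genres
    ],
})

# 1. Binary movie-by-genre matrix G, indexed by movieId because ids are not contiguous.
G = toy_movies.set_index("movieId")["genres"].str.get_dummies(sep="|")
# Drop the placeholder column so a movie without genres becomes an all-zero row.
G = G.drop(columns="(no genres listed)", errors="ignore")

# 2. Number of movies per genre (column sums).
movies_per_genre = G.sum(axis=0).sort_values(ascending=False)

# 3. Genre-by-genre co-occurrence counts: entry (a, b) is the number of movies
#    tagged with both a and b; the diagonal repeats the per-genre counts.
cooc = G.T.dot(G)

print(movies_per_genre)
print(cooc)
```

On the real data, `movies_per_genre.plot(kind="bar")` and `plt.imshow(cooc.values)` (with tick labels taken from `cooc.columns`) would be one way to produce the two plots.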

# Task 1 Part c) Statistics
1. Compute and visualize the following:
    - histogram of number of genres per movie
    - histogram of number of movies per user
    - histogram of number of users per movie
    - histogram of average score per movie
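
The quantities behind these histograms are all group-wise counts or means. A minimal sketch, assuming the column names shown in `df_ratings.head()`; `toy_ratings` and `toy_genres` are hypothetical stand-ins for the real frames:

```python
import pandas as pd

# Toy stand-in for df_ratings (hypothetical values).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.0],
})

# Toy stand-in for the genres column of df_movies, indexed by movieId.
toy_genres = pd.Series(
    ["Comedy|Romance", "Comedy", "Adventure|Children|Fantasy"],
    index=[10, 20, 30],
)

# Number of genres per movie. Note: a "no genre" placeholder string would be
# counted as one genre unless it is removed first.
genres_per_movie = toy_genres.str.split("|").str.len()

# Number of distinct movies rated by each user.
movies_per_user = toy_ratings.groupby("userId")["movieId"].nunique()

# Number of distinct users who rated each movie.
users_per_movie = toy_ratings.groupby("movieId")["userId"].nunique()

# Average score per movie.
mean_score_per_movie = toy_ratings.groupby("movieId")["rating"].mean()
```

Each result is a `Series`, so `plt.hist(series.values)` or `series.plot(kind="hist")` would draw the corresponding histogram.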

# Task 2 Part a) Changing basis
1. Determine a more convenient basis to represent the movie-user score matrix
    - 100004 ratings and 1296 tag applications across 9125 movies with 671 users
    - Find a better basis for this data. Rather than describing every movie by one coordinate per user (671 of them), we could create a weight/feature for each type of user. E.g. there will be users who like action, and users who like a mixture such as action and romance.
2. Plot the reconstruction quality with respect to the number of singular vectors
    - How good is our new basis? Think of the fish example from the lectures. 
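
One way to measure reconstruction quality is the relative Frobenius error of the rank-k approximation. The sketch below uses a small synthetic matrix rather than the real scores; the sizes, the noise level and the name `relative_error` are illustrative assumptions. On the real data the matrix could be built with something like `df_ratings.pivot(index="movieId", columns="userId", values="rating")`, with the missing entries filled in some way first (how to fill them is itself a modelling choice).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy score matrix (movies x users): rank-3 structure plus small noise.
n_movies, n_users, true_rank = 40, 30, 3
M = rng.normal(size=(n_movies, true_rank)) @ rng.normal(size=(true_rank, n_users))
M += 0.01 * rng.normal(size=M.shape)

# Thin SVD: M = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)


def relative_error(k):
    """Relative Frobenius error of the rank-k reconstruction of M."""
    M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return np.linalg.norm(M - M_k) / np.linalg.norm(M)


# Error for k = 1 .. number of singular values.
errors = [relative_error(k) for k in range(1, len(s) + 1)]
print(errors[:5])
```

Plotting `errors` against k shows how quickly the new basis captures the data: here the curve drops sharply once k reaches the true rank of the toy matrix.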

# Task 2 Part b) Explain principal directions
1. Encode users according to the genres they like
    - What is the relationship between users and genres? What associations can we make?
2. Encode movies according to the SVD of the movie-user score matrix
3. How can we interpret the information captured by the SVD basis?
4. Explain SVD basis according to genres: for each basis vector report the k most positively associated genres and the k most negatively associated genres
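
A possible way to relate the SVD basis to genres is to project the genre matrix onto the movie-side singular vectors. This is only one of several reasonable definitions of "association", and everything in the sketch (the toy matrices, centering by the global mean, the helper `top_genres`) is an assumption made for illustration.

```python
import numpy as np

genres = ["Action", "Comedy", "Drama", "Romance"]

# Toy binary movie-by-genre matrix G (6 movies x 4 genres).
G = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])

# Toy movie-by-user score matrix: users 0-1 prefer the first three movies,
# users 2-3 prefer the last three.
M = np.array([
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
], dtype=float)

# Centre by the global mean, then take the thin SVD.
U, s, Vt = np.linalg.svd(M - M.mean(), full_matrices=False)

# Genre loadings: entry (g, j) sums component j over the movies of genre g.
loadings = G.T @ U


def top_genres(loadings, names, component, k):
    """Return the k most positive and k most negative genres for one component."""
    order = np.argsort(loadings[:, component])  # ascending
    most_negative = [names[i] for i in order[:k]]
    most_positive = [names[i] for i in order[::-1][:k]]
    return most_positive, most_negative


positive, negative = top_genres(loadings, genres, component=0, k=1)
print(positive, negative)
```

The sign of a singular vector is arbitrary, so "positive" and "negative" can swap between runs or libraries; what matters is which genres end up on opposite sides of a component.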

# Task 2 Part c) Understand the axes
1. Plot the loading of each genre for the first 10 basis vectors

# Task 3 Part a) Movies visualization
1. Make a 2D plotting function using two arbitrary SVD basis vectors to project all movies.
2. For efficiency reasons, exclude movies that are too close to the origin.
3. Find movies that, in the chosen 2D representation, are closest to a regularly spaced grid and display their titles on the plot.
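
The filtering and grid-selection steps can be sketched independently of the plotting. In the sketch below, `coords` stands for the 2D coordinates of all movies (for example two columns of the movie-side SVD factor); the function name `grid_representatives` and the parameters `min_radius` and `n_grid` are hypothetical.

```python
import numpy as np


def grid_representatives(coords, min_radius, n_grid):
    """Pick the points closest to the nodes of a regular n_grid x n_grid grid.

    coords: (n_points, 2) array of 2D coordinates.
    Returns (kept, labelled): indices of points at least min_radius away from
    the origin, and the subset of those indices nearest to a grid node.
    """
    radius = np.linalg.norm(coords, axis=1)
    kept = np.flatnonzero(radius >= min_radius)
    pts = coords[kept]

    # Regular grid spanning the bounding box of the kept points.
    xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), n_grid)
    ys = np.linspace(pts[:, 1].min(), pts[:, 1].max(), n_grid)
    nodes = np.array([(x, y) for x in xs for y in ys])

    # Distance from every grid node to every kept point: shape (n_nodes, n_kept).
    dists = np.linalg.norm(nodes[:, None, :] - pts[None, :, :], axis=2)
    labelled = np.unique(kept[np.argmin(dists, axis=1)])
    return kept, labelled


# Toy coordinates: point 0 sits almost at the origin and is dropped.
coords = np.array([
    [0.01, 0.0],
    [1.0, 1.0],
    [-1.0, -1.0],
    [1.0, -1.0],
    [-1.0, 1.0],
    [0.9, 0.9],
])
kept, labelled = grid_representatives(coords, min_radius=0.1, n_grid=2)
print(kept, labelled)
```

For the plot itself, `plt.scatter` on `coords[kept]` and `plt.annotate` with the titles of the `labelled` movies would be one option.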

# Task 3 Part b) Movie info
1. Plot the coordinates for a movie given its id.

# Task 3 Part c) Movie in context
1. Find a movie's id from words in its title
2. Select two axes and report:
    - movie info
    - info for both axes
    - the movie marked in the 2D plot of all movies
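
The title lookup in step 1 can be sketched as a case-insensitive substring match. The helper `find_movie_ids` is hypothetical, and the toy frame reuses three rows from the `df_movies.head()` output above.

```python
import pandas as pd

# Three rows taken from the df_movies.head() output.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})


def find_movie_ids(movies, words):
    """Return the movieIds whose title contains every word (case-insensitive)."""
    mask = pd.Series(True, index=movies.index)
    for word in words.split():
        # regex=False treats the word literally, so "(" or "." need no escaping.
        mask &= movies["title"].str.contains(word, case=False, regex=False, na=False)
    return movies.loc[mask, "movieId"].tolist()


print(find_movie_ids(toy_movies, "toy story"))
```

A query can match several movies (remakes, sequels), so returning a list and letting the caller choose seems safer than returning a single id.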

# Task 4 Part a) Predict user score
1. Select 4000 scores at random, from randomly chosen users.
    - The prediction will be based on the principal components.
2. Replace these scores with the average score for the corresponding movie to simulate missing values.
3. Using the truncated SVD with 400 singular vectors, approximate the missing scores.
4. Evaluate the accuracy of the reconstruction using a scatter plot.
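
The four steps above can be tried on a scaled-down synthetic example before running them on the real matrix. Everything here is an assumption for illustration: the toy matrix is dense and exactly low-rank (the real one is mostly empty), 150 hidden scores and 3 singular vectors replace the 4000 and 400 of the task, and the movie mean is computed from the scores that remain visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-rank score matrix (movies x users).
n_movies, n_users, rank = 60, 50, 3
M = rng.normal(size=(n_movies, rank)) @ rng.normal(size=(rank, n_users))

# 1. Choose entries to hide, uniformly at random without replacement.
n_hidden = 150
flat = rng.choice(M.size, size=n_hidden, replace=False)
rows, cols = np.unravel_index(flat, M.shape)
mask = np.zeros(M.shape, dtype=bool)
mask[rows, cols] = True

# 2. Replace hidden entries with the mean of the visible scores of the same movie.
visible = np.where(mask, np.nan, M)
movie_means = np.nanmean(visible, axis=1)
M_filled = np.where(mask, movie_means[:, None], M)

# 3. Truncated SVD reconstruction with k singular vectors.
k = 3
U, s, Vt = np.linalg.svd(M_filled, full_matrices=False)
M_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# 4. Compare predictions with the hidden true scores.
truth = M[mask]
baseline_rmse = np.sqrt(np.mean((M_filled[mask] - truth) ** 2))
svd_rmse = np.sqrt(np.mean((M_hat[mask] - truth) ** 2))
print(baseline_rmse, svd_rmse)
```

A scatter plot of `truth` against `M_hat[mask]` (for example with `plt.scatter`) gives the visual check asked for in step 4; points close to the diagonal indicate a good reconstruction.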

# Task 4 Part b) Classification for an individual movie
1. Select a movie at random that has been seen by many users.
2. Consider only the users that have scored the movie.
3. Treat a score above a threshold, say 4.5, as positive and a score below it as negative. This label is the target to predict.
    - Note: remove the movie score information from the data matrix (e.g. by setting all score entries to 0).
4. Consider a 50% split between positive and negative users.
5. Build a classification algorithm for the problem (no external library such as `scikit-learn`).
6. Compute the accuracy and the baseline accuracy (i.e. the accuracy of a random classifier).
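
As one example of a classifier that needs nothing beyond NumPy, the sketch below uses a nearest-centroid rule. It is not the required method, just a simple instance. The synthetic features stand in for per-user rows of the score matrix (with the target movie removed), the 50% split is interpreted as balanced classes, and holding out half of the users for evaluation is an extra assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features: one row per user; label 1 = positive, 0 = negative.
n_per_class, n_features = 100, 5
X_pos = rng.normal(loc=1.0, size=(n_per_class, n_features))
X_neg = rng.normal(loc=-1.0, size=(n_per_class, n_features))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_per_class, dtype=int), np.zeros(n_per_class, dtype=int)])

# Shuffle, then hold out half of the users for evaluation.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
half = len(y) // 2
X_train, y_train = X[:half], y[:half]
X_test, y_test = X[half:], y[half:]


def fit_centroids(X, y):
    """Return the mean feature vector of the positive and of the negative class."""
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)


def predict(X, centroids):
    """Label each row 1 if it is closer to the positive centroid, else 0."""
    pos, neg = centroids
    d_pos = np.linalg.norm(X - pos, axis=1)
    d_neg = np.linalg.norm(X - neg, axis=1)
    return (d_pos < d_neg).astype(int)


centroids = fit_centroids(X_train, y_train)
accuracy = np.mean(predict(X_test, centroids) == y_test)

# Baseline: a classifier that guesses uniformly at random.
random_guess = rng.integers(0, 2, size=len(y_test))
baseline_accuracy = np.mean(random_guess == y_test)
print(accuracy, baseline_accuracy)
```

With balanced classes the random baseline should sit near 0.5, so any accuracy clearly above that indicates the features carry information about the target movie.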