# Final Project - Movie Recommender
# DSC478 
# Eric Vistnes, Xiaojing Shen, Robert Kaszubski

Our goal is to explore ways in which we can recommend movies to users.
We have used different methods including Clustering, Classification, and SVD to do so.

Our dataset consists of 17,770 different movies rated by 480,189 customers. A full customer-by-movie ratings matrix of that size has 17,770 × 480,189 = 8,532,958,530 cells, which proved difficult to process because of memory usage. We took several steps to make this dataset more manageable.
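
As a rough check on the scale involved, the sketch below multiplies the movie and customer counts quoted above and estimates the memory a dense matrix of that shape would need. The bytes-per-cell values are assumptions about possible storage types, not measurements from our notebooks, and dense_size_gib is a hypothetical helper name.

```python
# Counts taken from the dataset description above
n_movies = 17_770
n_customers = 480_189

# Number of cells in a dense customer-by-movie matrix
cells = n_movies * n_customers


def dense_size_gib(n_cells, bytes_per_cell):
    """Size in GiB of a dense array with the given cell count (hypothetical helper)."""
    return n_cells * bytes_per_cell / 2**30


print(f"{cells:,} cells")
# Assumed storage types: 8-byte floats, 4-byte floats, 1-byte integers
for name, size in (("float64", 8), ("float32", 4), ("uint8", 1)):
    print(f"{name}: {dense_size_gib(cells, size):.1f} GiB")
```

Even the smallest of these types is too large for a typical laptop, which is why the later steps work on a reduced dataset.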


Please refer to the other attached Jupyter notebooks for a breakdown of how we:

Combined the four txt files the dataset came in: Combine_Files.ipynb

Initially preprocessed and pivoted the entire dataset: CollaborativeFiltering.ipynb

Finally preprocessed and transformed the dataset into our final reduced version: Preprocessing and transformation.ipynb

### Begin Application:

We ask the user to rate movies they have seen, chosen from the most frequently rated movies in our dataset, on the assumption that they have seen at least 5 of them.

In [1]:
import numpy as np
import pandas as pd

Read in our reduced and cleaned pickle file containing our dataset:

In [2]:
# Avoid naming the variable `object`, which would shadow the Python builtin
movies = pd.DataFrame(pd.read_pickle('cleanedMovie.pkl'))

DataFrame of rating counts per MovieID, ordered from most to least frequently rated:

In [3]:
popular_movies = pd.DataFrame(movies["MovieID"].value_counts())

terms contains all of our movie titles; we remove unnecessary columns such as the release year:

In [4]:
terms = pd.read_csv('movie_titles.txt', sep='\t', encoding="ISO-8859-1", header=None, index_col=0)
# Keep only the title column (label 2), dropping the release year
terms = terms.iloc[:, 1:2]

Function to query movie title data and return the title for the given movieID:

In [5]:
def toTitle(MovieID):
    # Look up the title (column label 2) for the given MovieID
    return terms.loc[MovieID, 2]

Function that asks the user to rate the n most frequently rated movies, collecting ratings to use for recommendations:

Returns a list of the movies the user chose to rate in the format [[MovieID, rating]]

In [6]:
def initial_rating(n):
    print("These are the top",n, "most watched movies on our system:")
    print("Please rate at least 5 movies on a scale of 1 to 5 (1,2,3,4,5), you haven't seen it just hit Enter.")
    print("We can then start providing you recommendations based on your initial input!")
    print("")
    ratings = []
    for mov in range(0,n):
        ID = popular_movies.index[mov]
        title = toTitle(ID)
        print("You are rating", title)
        try:
            while True:
                b = int(input("Rate from 1 to 5: "))
                if b < 1 or b > 5:
                    print("Sorry, your response must be on a scale of 1 to 5.")
                    continue
                else:
                    break
            print("You rated", title, "as a", b)
            ratings.append([ID,b])
        except ValueError:  # blank or non-numeric input counts as a skip
            print("You skipped", title)
            print("")
            continue
        print("")
        
        
    return ratings
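
The validation rules used in initial_rating can also be written as a small pure function, which makes them testable without calling input(). This is only a sketch: parse_rating is a hypothetical helper that is not part of our notebook, and it assumes that blank or non-numeric input means "skip" while an out-of-range number should trigger another prompt.

```python
def parse_rating(text, low=1, high=5):
    """Classify one raw response (hypothetical helper, not in the notebook).

    Returns ("ok", rating), ("skip", None) or ("retry", None).
    """
    try:
        value = int(text)
    except ValueError:
        # Blank or non-numeric input: the user has not seen the movie
        return ("skip", None)
    if value < low or value > high:
        # A number outside the scale: ask again
        return ("retry", None)
    return ("ok", value)


for raw in ["5", "", "9", "abc"]:
    print(repr(raw), "->", parse_rating(raw))
```

Keeping the parsing separate from the prompt loop means the loop only has to react to the three outcomes.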

#### Please follow the instructions below and try to rate at least 5 movies

In [7]:
user_ratings = initial_rating(20)

These are the top 20 most watched movies on our system:
Please rate at least 5 movies on a scale of 1 to 5 (1,2,3,4,5), you haven't seen it just hit Enter.
We can then start providing you recommendations based on your initial input!

You are rating Forrest Gump
Rate from 1 to 5: 5
You rated Forrest Gump as a 5

You are rating The Sixth Sense
Rate from 1 to 5: 
You skipped The Sixth Sense

You are rating Pirates of the Caribbean: The Curse of the Black Pearl
Rate from 1 to 5: 3
You rated Pirates of the Caribbean: The Curse of the Black Pearl as a 3

You are rating The Matrix
Rate from 1 to 5: 
You skipped The Matrix

You are rating Spider-Man
Rate from 1 to 5: 4
You rated Spider-Man as a 4

You are rating Men in Black
Rate from 1 to 5: 
You skipped Men in Black

You are rating The Silence of the Lambs
Rate from 1 to 5: 5
You rated The Silence of the Lambs as a 5

You are rating Independence Day
Rate from 1 to 5: 
You skipped Independence Day

You are rating Jurassic Park
Rate from 1 to 

Here is the data we just captured from your responses.
The returned list of ratings is in the format [MovieID, Rating]:

In [8]:
user_ratings

[[11283, 5],
 [1905, 3],
 [14410, 4],
 [2862, 5],
 [14312, 4],
 [6971, 4],
 [15107, 4],
 [10042, 5],
 [2452, 4],
 [11064, 4]]
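
Before these ratings can be compared with other customers, the [MovieID, Rating] pairs need to be laid out as one row with a fixed column order of MovieIDs. A minimal sketch, assuming unrated movies are filled with 0 and using a short made-up list of MovieIDs in place of the real columns; to_user_vector is a hypothetical helper name.

```python
import pandas as pd


def to_user_vector(user_ratings, movie_ids, fill=0.0):
    """Turn [[MovieID, rating], ...] into a row aligned with movie_ids.

    Hypothetical helper: movies the user did not rate get `fill`,
    and rated movies missing from movie_ids are dropped.
    """
    ratings = pd.Series(dict(user_ratings), dtype=float)
    return ratings.reindex(movie_ids, fill_value=fill).to_numpy()


# Made-up column order standing in for the real matrix columns
movie_ids = [1905, 2862, 11283, 14410]
sample_ratings = [[11283, 5], [1905, 3], [14410, 4]]

user_vector = to_user_vector(sample_ratings, movie_ids)
print(user_vector)
```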

### Now we are going to cluster you based on your ratings

In [9]:
# Reload the dataset; avoid the name `object`, which would shadow the Python builtin
movies = pd.DataFrame(pd.read_pickle('cleanedMovie.pkl'))

Pivot the data into a customer-by-movie matrix of ratings, analogous to a document-term matrix, to use for clustering:

In [10]:
movieMatrix = movies.pivot_table(values='Rating', index='CustomerID', columns='MovieID')

In [11]:
movieMatrix = movieMatrix.fillna(0)
movie_arr = np.array(movieMatrix)
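
To make the pivot and fill steps concrete, here is the same sequence on a three-row toy table. The toy data is invented for illustration. Note that after fillna(0) a 0 means "not rated", which distance-based clustering cannot tell apart from a very low rating.

```python
import numpy as np
import pandas as pd

# Invented long-format ratings: one row per (customer, movie) pair
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "MovieID": [10, 20, 10],
    "Rating": [5, 3, 4],
})

# One row per customer, one column per movie; missing pairs become NaN
toy_matrix = toy.pivot_table(values="Rating", index="CustomerID", columns="MovieID")

# Replace NaN with 0 and convert to a plain NumPy array for clustering
filled = toy_matrix.fillna(0)
toy_arr = np.array(filled)

print(filled)
```

Customer 2 never rated movie 20, so that cell is NaN after the pivot and 0 after the fill.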

In [12]:
import kMeans
from sklearn.cluster import KMeans

In [13]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(movie_arr)

KMeans(n_clusters=10)
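
Once a model is fitted, a new user's rating vector can be assigned to the nearest centroid with predict. The sketch below uses a tiny made-up matrix rather than movie_arr, and sets n_init and random_state only so the toy run is repeatable; none of these values come from our notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ratings: two customers like movies 0-1, two like movies 2-3
toy_arr = np.array([
    [5, 5, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
], dtype=float)

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_arr)

# A new user whose tastes resemble the first two customers
new_user = np.array([[5, 4, 0, 0]], dtype=float)
assigned = toy_kmeans.predict(new_user)[0]

print("cluster labels of existing customers:", toy_kmeans.labels_)
print("new user assigned to cluster:", assigned)
```

Cluster numbers themselves are arbitrary; what matters is that the new user shares a label with the customers who rated similarly.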

In [14]:
def top_movies(df, n):
    # Print the title and centroid value for the first n rows of df
    for mov in range(n):
        print(toTitle(df.index[mov]), df.iloc[mov, 0])

In [15]:
np.set_printoptions(precision=2,suppress=True)

In [16]:
def print_clust(kmeans, k, n):
    # For each of the k clusters, print the n movies with the highest centroid values
    for cluster in range(k):
        clust = pd.DataFrame(kmeans.cluster_centers_[cluster])
        clust.index = movieMatrix.columns  # label each centroid entry with its MovieID
        sortDF = clust.sort_values(by=[0], ascending=False)
        print("Top movies in Cluster", cluster+1)
        top_movies(sortDF, n)
        print("")
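
The same ranking can also be computed for all clusters at once with NumPy's argsort. This is an alternative sketch, not what print_clust does internally; top_n_per_cluster is a hypothetical helper and the centroid values are made up.

```python
import numpy as np


def top_n_per_cluster(centers, movie_ids, n):
    """Return, for each centroid row, the n MovieIDs with the highest values.

    Hypothetical helper: `centers` has one row per cluster and one
    column per movie, in the same order as `movie_ids`.
    """
    # Negate so that ascending argsort yields descending centroid values
    order = np.argsort(-centers, axis=1)[:, :n]
    return [[movie_ids[j] for j in row] for row in order]


# Made-up centroids for 2 clusters over 3 movies
toy_centers = np.array([
    [4.5, 1.0, 3.0],
    [0.5, 4.0, 2.0],
])
toy_ids = [10, 20, 30]

top = top_n_per_cluster(toy_centers, toy_ids, 2)
print(top)
```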

So these are the top movies found in each cluster:

In [17]:
print_clust(kmeans,10,10)

Top movies in Cluster 1
Raiders of the Lost Ark 4.4316271963330784
Forrest Gump 4.394194041252866
The Green Mile 4.271963330786861
Braveheart 4.252864782276546
Indiana Jones and the Last Crusade 4.231474407944996
The Shawshank Redemption: Special Edition 4.220015278838808
The Sixth Sense 4.18716577540107
Gladiator 4.184873949579831
The Fugitive 4.128342245989305
Saving Private Ryan 4.119938884644767

Top movies in Cluster 2
The Shawshank Redemption: Special Edition 4.250293083235638
The Sixth Sense 4.227432590855804
The Silence of the Lambs 4.173505275498242
Forrest Gump 4.156506447831184
American Beauty 4.070339976553341
Good Will Hunting 3.9701055099648306
Finding Nemo (Widescreen) 3.943141852286049
Pirates of the Caribbean: The Curse of the Black Pearl 3.9279015240328254
Lord of the Rings: The Fellowship of the Ring 3.9273153575615476
Raiders of the Lost Ark 3.900937866354044

Top movies in Cluster 3
Raiders of the Lost Ark 4.581761006289308
The Godfather 4.462264150943396
The Silen