<a href="https://colab.research.google.com/github/abiflynn/movie_recommender_system/blob/main/WBSFLIX_recommender_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WBSFLIX Recommender Systems 

# The Data

### Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


### Usage License

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
* The user may redistribute the data set, including transformations, so long as it is distributed under these same license conditions.
* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email <grouplens-info@umn.edu>

### Citation

To acknowledge use of the dataset in publications, please cite the following paper:

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>

### Further Information About GroupLens

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

* recommender systems
* online communities
* mobile and ubiquitous technologies
* digital libraries
* local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <grouplens-info@cs.umn.edu> - we are always interested in working with external collaborators.

### Content and Use of Files


**Formatting and Encoding**

The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
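
As a small illustration of the quoting and encoding rules above, the sketch below parses an in-memory sample in the `movies.csv` layout with Python's standard `csv` module. The sample bytes and variable names are made up for the example; the Shawshank row mirrors a title that appears in this notebook's outputs.

```python
import csv
import io

# Two sample lines in the movies.csv layout: a header and one row whose
# title contains a comma and is therefore wrapped in double quotes.
raw_bytes = (
    'movieId,title,genres\n'
    '318,"Shawshank Redemption, The (1994)",Crime|Drama\n'
).encode("utf-8")

# Decode explicitly as UTF-8, then let the csv module handle the quoting.
reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
rows = list(reader)
print(rows[0]["title"])

# Accented titles survive a UTF-8 round trip unchanged.
accented = "Misérables, Les (1995)"
round_trip = accented.encode("utf-8").decode("utf-8")
print(round_trip)
```

The quoted comma stays inside the `title` field instead of splitting the row into four columns.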

**User Ids**

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).

**Movie Ids**

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).
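
One way to sanity-check this consistency is to compare the sets of ids found in each file. The sketch below uses small toy sets (ids borrowed from this notebook's outputs, not the full files) and hypothetical variable names.

```python
# Toy stand-ins for the movieId columns of the four files.
movie_ids = {
    "movies.csv": {1, 2, 3, 318, 60756, 89774},
    "links.csv": {1, 2, 3, 318, 60756, 89774},
    "ratings.csv": {1, 3, 318},
    "tags.csv": {60756, 89774},
}

# Every id used in another file should also appear in movies.csv.
catalogue = movie_ids["movies.csv"]
unknown_ids = {
    name: ids - catalogue
    for name, ids in movie_ids.items()
    if name != "movies.csv"
}
print(unknown_ids)
```

An empty set for every file means no id refers to a movie missing from `movies.csv`.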

**Ratings Data File Structure (ratings.csv)**

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
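
For illustration, a timestamp in this format can be converted to a readable UTC date with the standard library. The value below is the timestamp in the first row of the `ratings_df.head()` output in this notebook; the variable names are only for the example.

```python
from datetime import datetime, timezone

# Timestamp of user 1's rating of movie 1 (first row of ratings.csv).
timestamp = 964982703

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
rated_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```

Passing `tz=timezone.utc` avoids silently converting to the local time zone of the machine running the code.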


**Tags Data File Structure (tags.csv)**

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


**Movies Data File Structure (movies.csv)**

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
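
A pipe-separated `genres` value can be turned into a list with a plain string split. The helper below is a hypothetical sketch; treating `(no genres listed)` as an empty list is an assumption made for the example, not something the dataset prescribes.

```python
def split_genres(genres):
    """Return the list of genres encoded in a pipe-separated string."""
    # "(no genres listed)" is a placeholder rather than a genre (assumed handling)
    if genres == "(no genres listed)":
        return []
    return genres.split("|")

# genres value of Toy Story (1995), as shown in movies.csv
toy_story = split_genres("Adventure|Animation|Children|Comedy|Fantasy")
unknown = split_genres("(no genres listed)")
print(toy_story, unknown)
```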


**Links Data File Structure (links.csv)**

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.
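
The URL patterns above can be sketched as a small helper. `build_links` is a hypothetical function, and zero-padding the IMDb id to seven digits is an assumption inferred from the Toy Story example (`114709` becomes `tt0114709`). The `tmdbId` column is read as a float and can be missing, so the sketch skips it when it is NaN.

```python
import math

def build_links(movie_id, imdb_id, tmdb_id):
    """Build the external URLs for one row of links.csv."""
    links = {
        "movielens": f"https://movielens.org/movies/{movie_id}",
        # zero-pad to seven digits: 114709 -> tt0114709 (assumed from the example)
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
    }
    # tmdbId may be missing (NaN) and is stored as a float such as 862.0
    if tmdb_id is not None and not math.isnan(tmdb_id):
        links["tmdb"] = f"https://www.themoviedb.org/movie/{int(tmdb_id)}"
    return links

toy_story_links = build_links(1, 114709, 862.0)
print(toy_story_links)
```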


**Cross-Validation**

Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.
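
To make the idea of cross-folds concrete, here is a minimal standard-library sketch that splits row indices into k disjoint folds. It is not the LensKit API, and `k_fold_indices` is a hypothetical helper; for real evaluations a toolkit's built-in splitter is the better choice.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index, starting at position i
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_rows=10, k=5)
for i, test_idx in enumerate(folds):
    # all rows outside fold i form the training set for that fold
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print(f"fold {i}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

Each row lands in exactly one test fold, so every rating is used for testing once and for training k-1 times.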


# Import Packages 

In [None]:
import pandas as pd
import numpy as np

# Read the Data 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
links_df = pd.read_csv("/content/drive/MyDrive/WBS CODING/Projects/Project 8: (Recommender Systems)/The Data/links.csv")
movies_df = pd.read_csv("/content/drive/MyDrive/WBS CODING/Projects/Project 8: (Recommender Systems)/The Data/movies.csv")
ratings_df = pd.read_csv("/content/drive/MyDrive/WBS CODING/Projects/Project 8: (Recommender Systems)/The Data/ratings.csv")
tags_df = pd.read_csv("/content/drive/MyDrive/WBS CODING/Projects/Project 8: (Recommender Systems)/The Data/tags.csv")

pd.set_option('display.max_columns', None)


In [None]:
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
links_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [None]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [None]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [None]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [None]:
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [None]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


# Function One: Popularity Rankings 

A function that returns the top n films by mean rating.

A popularity-based, non-personalised recommender system that takes the ratings and movies datasets as input and outputs the “best” movies. How you define “best” is up to you; here it means the highest mean rating among movies with at least 50 ratings. Those movies will appear as the top row of the WBSFLIX site.

In [None]:
def top_movies(n):

    # group the ratings by movie and compute each movie's mean rating
    df = pd.DataFrame(ratings_df.groupby('movieId')['rating'].mean())

    # add the number of ratings each movie received
    df['rating_count'] = ratings_df.groupby('movieId')['rating'].count()

    # merge with movies_df to get the movie title
    df_1 = pd.merge(df, movies_df, on="movieId", how="inner")[["movieId", "title", "rating", "rating_count"]]

    # keep only movies with at least 50 ratings, sort by mean rating (highest first) and take the first n rows
    top_n = df_1[df_1['rating_count'] >= 50].sort_values(by='rating', ascending=False)[:n]

    # return the top n movies as a dataframe
    return top_n

In [None]:
top_movies(5)

Unnamed: 0,movieId,title,rating,rating_count
277,318,"Shawshank Redemption, The (1994)",4.429022,317
659,858,"Godfather, The (1972)",4.289062,192
2224,2959,Fight Club (1999),4.272936,218
974,1276,Cool Hand Luke (1967),4.27193,57
602,750,Dr. Strangelove or: How I Learned to Stop Worr...,4.268041,97
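
Since the definition of “best” is left open, one alternative to a hard rating-count cut-off is a damped mean, which shrinks each movie's mean rating towards a global mean in proportion to how few ratings it has. This is a different technique from the one `top_movies` uses. In the sketch below, `damped_mean`, the global mean of 3.5 and the damping constant `k=50` are all illustrative assumptions; the means and counts are copied from the `top_movies(5)` output above.

```python
def damped_mean(mean_rating, rating_count, global_mean=3.5, k=50):
    """Shrink a movie's mean rating towards global_mean; k sets the strength."""
    return (rating_count * mean_rating + k * global_mean) / (rating_count + k)

# (title, mean rating, rating count) from the top_movies(5) output
movies = [
    ("Shawshank Redemption, The (1994)", 4.429022, 317),
    ("Godfather, The (1972)", 4.289062, 192),
    ("Fight Club (1999)", 4.272936, 218),
    ("Cool Hand Luke (1967)", 4.27193, 57),
]

# rank by damped mean instead of raw mean
ranked = sorted(movies, key=lambda m: damped_mean(m[1], m[2]), reverse=True)
for title, mean_rating, count in ranked:
    print(f"{damped_mean(mean_rating, count):.3f}  {title}")
```

Under these assumed parameters, movies with many ratings keep a score close to their raw mean, while movies with few ratings are pulled further towards the global mean.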


# Function Two: Item-based Collaborative Filtering

A function that takes a movie's `movieId` and a number (n), and outputs the n movies most similar to the selected title.

This is a similarity-based, semi-personalised recommender system. In production, the input will be a movie that the user has watched recently or rated highly; for now, it is chosen manually. The function outputs a list of movies that are “similar” to the input movie, based on rating correlations in the user-item matrix. Those movies will appear as the second row of the WBSFLIX site.

In [None]:
def similar_movies(movie_id, n):

    # user-item matrix: one row per user, one column per movie, NaN where a user has not rated a movie
    ratings_crosstab = pd.pivot_table(data=ratings_df, values='rating', index='userId', columns='movieId')

    # user ratings of the chosen movie (NaNs are ignored pairwise by corrwith)
    movies_ratings = ratings_crosstab[movie_id]

    # Pearson correlation of every movie's ratings with the chosen movie's ratings
    correlations = ratings_crosstab.corrwith(movies_ratings)

    # store the correlation scores and drop NaNs
    corr_score = pd.DataFrame(correlations, columns=['PearsonR'])
    corr_score.dropna(inplace=True)

    # create a ratings dataframe with the rating count per movie
    rating = pd.DataFrame(ratings_df.groupby('movieId')['rating'].mean())
    rating['rating_count'] = ratings_df.groupby('movieId')['rating'].count()

    # join correlation scores and rating counts
    movies_corr_summary = corr_score.join(rating['rating_count'])
    # drop the chosen movie itself (ignored if it was already removed as NaN)
    movies_corr_summary.drop(movie_id, inplace=True, errors='ignore')

    # keep only movies with at least 100 ratings, sort by correlation (highest first) and take the first n rows
    movies_score = movies_corr_summary[movies_corr_summary['rating_count']>=100].sort_values('PearsonR', ascending=False)[:n]

    # merge with movies_df to get the title
    movie_recommendations = pd.merge(movies_score, movies_df, on="movieId", how="inner")[["movieId", "title", "PearsonR", "rating_count"]]

    # return the top n movies as a dataframe
    return movie_recommendations

In [None]:
similar_movies(1, 5)


Unnamed: 0,movieId,title,PearsonR,rating_count
0,8961,"Incredibles, The (2004)",0.643301,125
1,6377,Finding Nemo (2003),0.618701,141
2,588,Aladdin (1992),0.611892,183
3,4886,"Monsters, Inc. (2001)",0.490231,132
4,500,Mrs. Doubtfire (1993),0.446261,144
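
To show what the correlation score measures, the sketch below computes a Pearson correlation between two movies using only the users who rated both, which mirrors how NaNs are skipped pairwise. It is a plain-Python illustration on made-up ratings, not the pandas implementation; `pearson_co_rated` is a hypothetical helper.

```python
import math

def pearson_co_rated(ratings_a, ratings_b):
    """Pearson correlation over users who rated both movies (None = not rated)."""
    pairs = [(a, b) for a, b in zip(ratings_a, ratings_b)
             if a is not None and b is not None]
    if len(pairs) < 2:
        return None  # undefined with fewer than two co-raters
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None  # undefined when one movie's co-ratings are constant
    return cov / math.sqrt(var_a * var_b)

# one entry per user; users 3 and 4 each rated only one of the two movies
movie_x = [5.0, 4.0, None, 2.0, 1.0]
movie_y = [4.5, 4.0, 3.0, None, 1.5]
r = pearson_co_rated(movie_x, movie_y)
print(r)
```

Here only three users contribute, and the score is already close to 1. Correlations built on so few co-raters are unstable, which is one motivation for filtering on a minimum rating count as `similar_movies` does.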


# Function Three: User-based Collaborative Filtering

A function that takes a user's `userId` and a number (n), and outputs the n movies with the highest predicted ratings, weighting other users' ratings by their cosine similarity to that user.

This is a fully personalised recommender system, which will generate the third row on the WBSFLIX site.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def top_movies_user(user_id, n):

  # reshape the data so that users are rows and movies are columns
  users_items = pd.pivot_table(data=ratings_df, 
                                 values='rating', 
                                 index='userId', 
                                 columns='movieId')
  
  # replace NaNs (unrated movies) with zeros
  users_items.fillna(0, inplace=True)

  # compute cosine similarities between all pairs of users
  user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
  
  # compute the weights for the given user: similarities to all other users, normalised to sum to 1
  other_similarities = user_similarities.query("userId!=@user_id")[user_id]
  weights = other_similarities / other_similarities.sum()

  # select the movies the given user has not rated, with the other users' ratings of them
  not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]

  # dot product between the other users' ratings of the unrated movies and the weights
  weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])

  # merge with movies_df to get the title
  recommendations = weighted_averages.merge(movies_df, left_index=True, right_on="movieId")[["movieId", "title", "predicted_rating"]]

  # sort by predicted rating (highest first) and take the first n rows
  recommendations_return = recommendations.sort_values("predicted_rating", ascending=False)[:n]

  return recommendations_return

In [None]:
top_movies_user(1,5)

Unnamed: 0,movieId,title,predicted_rating
277,318,"Shawshank Redemption, The (1994)",2.654727
507,589,Terminator 2: Judgment Day (1991),2.087327
659,858,"Godfather, The (1972)",1.859548
2078,2762,"Sixth Sense, The (1999)",1.663564
3638,4993,"Lord of the Rings: The Fellowship of the Ring,...",1.62482
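
The prediction step can be followed by hand on a tiny example. The sketch below uses made-up ratings for three users and three movies, with 0 standing for "not rated" as after `fillna(0)`; the names `target`, `others` and `cosine` are only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# one list per user, one position per movie; 0 means "not rated"
target = [5, 5, 0]                      # has not rated the third movie
others = {"b": [5, 5, 4], "c": [5, 0, 2]}

# similarity of every other user to the target, normalised to sum to 1
similarities = {name: cosine(target, row) for name, row in others.items()}
total = sum(similarities.values())
weights = {name: s / total for name, s in similarities.items()}

# predicted rating for the third movie: similarity-weighted average
predicted = sum(weights[name] * others[name][2] for name in others)
print(round(predicted, 2))
```

Because unrated movies are stored as 0, every similar user who has not rated a movie pulls its prediction towards 0. This probably explains why the predicted ratings above (for example 2.65 for a movie whose mean rating is 4.43) sit well below the ratings people actually gave; they are best read as ranking scores rather than as star ratings.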


# Implement a Chatbot

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

users_items = pd.pivot_table(data=ratings_df, 
                                 values='rating', 
                                 index='userId', 
                                 columns='movieId')

users_items.fillna(0, inplace=True)

user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)

In [None]:
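def weighted_user_rec(user_id, n):
  # similarities of all other users to the given user, normalised to sum to 1
  weights = (user_similarities.query("userId!=@user_id")[user_id] / sum(user_similarities.query("userId!=@user_id")[user_id]))
  # other users' ratings of the movies the given user has not rated
  not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
  # similarity-weighted average rating for each of those movies
  weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
  # add title and genres from movies_df
  recommendations = weighted_averages.merge(movies_df, left_index=True, right_on="movieId")
  # keep the n movies with the highest predicted rating
  top_recommendations = recommendations.sort_values("predicted_rating", ascending=False).head(n)
  return top_recommendations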

In [None]:
weighted_user_rec(1, 1)

Unnamed: 0,predicted_rating,movieId,title,genres
277,2.654727,318,"Shawshank Redemption, The (1994)",Crime|Drama


In [None]:
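def chat_bot():
    print("Hi! I'm your personal recommender. Tell me your userID.")
    user_id = input()
    # the id must be a whole number that appears in the ratings data
    try:
        user_id = int(user_id)
    except ValueError:
        print("Sorry, a userID must be a whole number.")
        return
    if user_id not in users_items.index:
        print("Sorry, I don't know that userID.")
        return
    recom = weighted_user_rec(user_id, 1)
    print(f"You will probably like the movie: {list(recom['title'])[0]}")

chat_bot()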

Hi! I'm your personal recommender. Tell me your userID.
1
You will probably like the movie: Shawshank Redemption, The (1994)
