<a href="https://colab.research.google.com/github/Vikas-KM/machine-learning/blob/master/ML_Projects/Movie_Recommendation_using_Naive_Bayes_Algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System using Naive Bayes 

Articles Referred
- https://iopscience.iop.org/article/10.1088/1742-6596/1345/4/042042/pdf
  -  Liu Shuxian and Fang Sen 2019 J. Phys.: Conf. Ser. 1345 042042

Comparative Analysis of different ML Algorithms for recommendation systems
- https://ijcat.com/archives/volume6/issue2/ijcatr06021005.pdf

Movielens Dataset Referred
- https://grouplens.org/datasets/movielens/

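README link (note: this URL is the README of the ml-latest-small variant, while this notebook loads the 25M dataset) - http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html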

### About the dataset
- This **dataset (ml-25m)** describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. **It contains 25000095 ratings and 1093360 tag applications across 62423 movies.** These data were created by 162541 users between January 09, 1995 and November 21, 2019. **This dataset was generated on November 21, 2019.**
- Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).
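Because `movieId` means the same thing in every file, the files can be joined on it. The following is a minimal sketch on a tiny in-memory sample, not the full dataset: the three movie rows are copied from `movies.csv`, while the rating rows are made-up values used only for illustration.

```python
import pandas as pd

# Three rows copied from movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Hypothetical ratings (illustrative values, not taken from ratings.csv).
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Attach title and genres to every rating. validate="many_to_one" makes
# pandas raise if movieId is not unique on the movies side.
merged = ratings.merge(movies, on="movieId", how="left", validate="many_to_one")
print(merged[["userId", "title", "rating"]])
```

A left join keeps every rating even if a movie were missing from `movies`; in that case `title` and `genres` would be NaN for that row.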

### Ratings Data File Structure (ratings.csv)
- All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

- userId,movieId,rating,timestamp
- The lines within this file are ordered first by userId, then, within user, by movieId.

- Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
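Since the timestamps are plain Unix epoch seconds, pandas can convert them to timezone-aware datetimes directly. A small sketch, using the first two rows of `ratings.csv`; the column name `rated_at` is an assumption introduced here for illustration.

```python
import pandas as pd

# First two rows of ratings.csv.
ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [296, 306],
    "rating": [5.0, 3.5],
    "timestamp": [1147880044, 1147868817],
})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00 UTC;
# utc=True returns timezone-aware values instead of naive ones.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
print(ratings[["timestamp", "rated_at"]])
```

The same call works for the `timestamp` column of `tags.csv`, which uses the same encoding.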


### Tags Data File Structure (tags.csv)
- All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

- userId,movieId,tag,timestamp
- The lines within this file are ordered first by userId, then, within user, by movieId.

- Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

### Movies Data File Structure (movies.csv)
- Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

- movieId,title,genres
- Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

- Genres are a pipe-separated list, and are selected from the following:

  - Action
  - Adventure
  - Animation
  - Children (written as "Children's" in the README; the data files use `Children`)
  - Comedy
  - Crime
  - Documentary
  - Drama
  - Fantasy
  - Film-Noir
  - Horror
  - Musical
  - Mystery
  - Romance
  - Sci-Fi
  - Thriller
  - War
  - Western
  - (no genres listed)
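A pipe-separated genre string can be turned into one indicator column per genre, which is a convenient feature layout for a classifier. This is only a sketch on three rows copied from `movies.csv`; the name `genre_flags` is an assumption, and the notebook itself does not perform this step.

```python
import pandas as pd

# Three rows copied from movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre; sep="|" splits each string on the pipe character.
genre_flags = movies["genres"].str.get_dummies(sep="|")
print(genre_flags)
```

Note that `(no genres listed)` would become its own column with this approach, so it may need to be dropped or treated as missing before modelling.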

In [6]:
# importing some standard libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
%matplotlib inline

In [8]:
# Mount Google Drive to access the dataset stored there
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
# Let's list the MovieLens dataset files
# using movielens 25M dataset

! ls /content/drive/MyDrive/ML_Datasets/movielens_dataset

genome-scores.csv  links.csv   ratings.csv  tags.csv
genome-tags.csv    movies.csv  README.txt


### movies.csv, which contains movie information:

In [11]:
df_movies = pd.read_csv('/content/drive/MyDrive/ML_Datasets/movielens_dataset/movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [80]:
# shape of the movies dataset
df_movies.shape

(62423, 3)

In [13]:
# columns in the movies dataset
df_movies.columns

Index(['movieId', 'title', 'genres'], dtype='object')

In [20]:
# let's check whether there are any null values
df_movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [81]:
# genres need to be separated before we can tell which genre dominates
df_movies['genres'].value_counts()

Drama                                                      9056
Comedy                                                     5674
(no genres listed)                                         5062
Documentary                                                4731
Comedy|Drama                                               2386
                                                           ... 
Adventure|Animation|Children|Comedy|Fantasy|Sci-Fi|IMAX       1
Action|Comedy|Crime|Mystery|Thriller                          1
Adventure|Animation|Children|Fantasy|Mystery                  1
Animation|Documentary|War                                     1
Comedy|Horror|Romance|Sci-Fi                                  1
Name: genres, Length: 1639, dtype: int64
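The counts above are per genre combination, not per genre. One way to count individual genres is to split each string and explode the resulting lists into one row per genre. The sketch below uses only the five rows shown by `df_movies.head()`, so its counts are not those of the full dataset; `genre_counts` is a name introduced here.

```python
import pandas as pd

# The five rows shown by df_movies.head().
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})

# Split on "|" into lists, emit one row per (movie, genre), then count.
genre_counts = movies["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

Running the same expression on `df_movies['genres']` should give the per-genre totals for all movies.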

In [82]:
print(len(df_movies['movieId'].unique()))
print(len(df_movies['title'].unique()))

62423
62325


In [90]:
# list of movies which are duplicate
df_movies_dup = df_movies[df_movies.duplicated(subset=['title'], keep=False)]
df_movies_dup.head()

Unnamed: 0,movieId,title,genres
580,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical
1710,1788,Men with Guns (1997),Action|Drama
2553,2644,Dracula (1931),Horror
2759,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
3454,3553,Gossip (2000),Drama|Thriller


In [91]:
df_movies_dup.shape

(196, 3)

In [94]:
df_movies_dup[df_movies_dup['title'] == 'Gossip (2000)']

Unnamed: 0,movieId,title,genres
3454,3553,Gossip (2000),Drama|Thriller
44375,168088,Gossip (2000),Comedy|Drama


In [95]:
df_movies_dup[df_movies_dup['title'] == 'Inside (2012)']

Unnamed: 0,movieId,title,genres
28373,131556,Inside (2012),(no genres listed)
34147,144748,Inside (2012),Horror


In [96]:
df_movies_dup[df_movies_dup['title'] == 'Saturn 3 (1980)']

Unnamed: 0,movieId,title,genres
2759,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
44502,168358,Saturn 3 (1980),Sci-Fi|Thriller


In [92]:
df_movies_dup[df_movies_dup['genres'] == '(no genres listed)']

Unnamed: 0,movieId,title,genres
28373,131556,Inside (2012),(no genres listed)
31457,138656,Black Field (2009),(no genres listed)
34186,144830,The Tunnel (1933),(no genres listed)
39216,156686,Another World (2014),(no genres listed)
41061,160868,Escalation (1968),(no genres listed)
48048,175857,Family Life (1971),(no genres listed)
49071,177993,Escape Room (2017),(no genres listed)
49221,178403,The Forest (2016),(no genres listed)
49867,179783,Let There Be Light (2017),(no genres listed)
50735,181675,Apparition (2014),(no genres listed)


### Observations from the movies.csv dataset
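- The title contains the release year in parentheses; it needs to be separated out.
- A single movie can have several genres, so the pipe-separated `genres` column needs to be split.
- There are no null values, but 5062 movies carry the placeholder `(no genres listed)`.
- 98 titles are duplicated (196 rows share a title with another row). First drop the duplicated entries whose genre is `(no genres listed)`,
- then remove the rest of the duplicates.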
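The cleaning steps listed above could be sketched as follows. This is one possible approach under stated assumptions, not the notebook's own implementation: the column names `year` and `clean_title` are introduced here, the sample holds only four duplicated rows copied from the outputs above, and treating two rows with the same title as the same film is itself an assumption (the two "Gossip (2000)" rows have different genres).

```python
import pandas as pd

# Four duplicated rows copied from the outputs above.
movies = pd.DataFrame({
    "movieId": [3553, 168088, 131556, 144748],
    "title": ["Gossip (2000)", "Gossip (2000)", "Inside (2012)", "Inside (2012)"],
    "genres": ["Drama|Thriller", "Comedy|Drama", "(no genres listed)", "Horror"],
})

# Pull the trailing "(YYYY)" into its own nullable-integer column.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
movies["clean_title"] = movies["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

# Step 1: among rows whose title is duplicated, drop those without genres.
is_dup = movies.duplicated(subset=["title"], keep=False)
no_genre = movies["genres"] == "(no genres listed)"
movies = movies[~(is_dup & no_genre)]

# Step 2: for the remaining duplicates, keep the first occurrence.
movies = movies.drop_duplicates(subset=["title"], keep="first").reset_index(drop=True)
print(movies)
```

One caveat of step 1: if every copy of a duplicated title has `(no genres listed)`, all of its rows are dropped, so it is worth checking for that case before applying it to the full file.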

### ratings.csv, which contains ratings of movies by users:

In [34]:
# importing the ratings dataset
df_ratings = pd.read_csv('/content/drive/MyDrive/ML_Datasets/movielens_dataset/ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [35]:
# shape of the ratings dataset
df_ratings.shape

(25000095, 4)

In [36]:
# columns in the ratings dataset
df_ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [37]:
df_ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [38]:
df_ratings['rating'].value_counts()

4.0    6639798
3.0    4896928
5.0    3612474
3.5    3177318
4.5    2200539
2.0    1640868
2.5    1262797
1.0     776815
1.5     399490
0.5     393068
Name: rating, dtype: int64

### Observations
- No null values
- Ratings range from 0.5 to 5.0 in half-star steps; 4.0 is the most common rating
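The rating distribution can be summarised further without reloading the 25M rows. The sketch below rebuilds the `value_counts()` output shown above as a Series and derives each rating's share and the overall mean; `rating_counts`, `share` and `mean_rating` are names introduced here.

```python
import pandas as pd

# Counts copied from the df_ratings['rating'].value_counts() output above.
rating_counts = pd.Series({
    4.0: 6639798, 3.0: 4896928, 5.0: 3612474, 3.5: 3177318, 4.5: 2200539,
    2.0: 1640868, 2.5: 1262797, 1.0: 776815, 1.5: 399490, 0.5: 393068,
})

# Fraction of all ratings that fall on each star value, lowest rating first.
share = (rating_counts / rating_counts.sum()).sort_index()

# Weighted mean: sum(rating * count) / total number of ratings.
weighted_sum = (rating_counts.index.to_numpy() * rating_counts.to_numpy()).sum()
mean_rating = weighted_sum / rating_counts.sum()

print(share.round(3))
print(round(mean_rating, 2))
```

On the full DataFrame, `df_ratings['rating'].value_counts(normalize=True)` and `df_ratings['rating'].mean()` should give the same figures directly.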

## References Used:
- https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/
