# Movie Recommendation System

This project implements a content-based movie recommendation system using cosine similarity. The system analyzes movie features (title, genres, tags) to recommend similar movies to users.


Import necessary libraries for data manipulation, visualization, and machine learning.


In [145]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load Data Sets

In [146]:
movie = pd.read_csv("movies.csv")
link = pd.read_csv("links.csv")
rating = pd.read_csv("ratings.csv")
tag = pd.read_csv("tags.csv")


In [147]:
print(f"movie shape: {movie.shape}")
print(f"link shape: {link.shape}")
print(f"rating shape: {rating.shape}")
print(f"tag shape: {tag.shape}")

movie shape: (9742, 3)
link shape: (9742, 3)
rating shape: (100836, 4)
tag shape: (3683, 4)


The shapes above show the size and structure of each dataset.

In [148]:
movie.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [149]:
movie.movieId.value_counts

<bound method IndexOpsMixin.value_counts of 0            1
1            2
2            3
3            4
4            5
         ...  
9737    193581
9738    193583
9739    193585
9740    193587
9741    193609
Name: movieId, Length: 9742, dtype: int64>

In [150]:
movie.movieId.info()

<class 'pandas.core.series.Series'>
RangeIndex: 9742 entries, 0 to 9741
Series name: movieId
Non-Null Count  Dtype
--------------  -----
9742 non-null   int64
dtypes: int64(1)
memory usage: 76.2 KB


In [151]:
movie.title.info()

<class 'pandas.core.series.Series'>
RangeIndex: 9742 entries, 0 to 9741
Series name: title
Non-Null Count  Dtype 
--------------  ----- 
9742 non-null   object
dtypes: object(1)
memory usage: 76.2+ KB


In [152]:
movie.title.head(5)

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
Name: title, dtype: object

In [153]:
movie.title.duplicated().value_counts()

title
False    9737
True        5
Name: count, dtype: int64

Five movie titles appear more than once.

In [154]:
movie[movie.title.duplicated(keep=False)]

Unnamed: 0,movieId,title,genres
650,838,Emma (1996),Comedy|Drama|Romance
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller
5601,26958,Emma (1996),Romance
5854,32600,Eros (2004),Drama
5931,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller
6932,64997,War of the Worlds (2005),Action|Sci-Fi
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
9135,147002,Eros (2004),Drama|Romance
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller


Movies that share a title have different genre lists. The plan is to keep one row per title and give it the union of all genres listed for that title.

In [155]:
movie.genres.info()

<class 'pandas.core.series.Series'>
RangeIndex: 9742 entries, 0 to 9741
Series name: genres
Non-Null Count  Dtype 
--------------  ----- 
9742 non-null   object
dtypes: object(1)
memory usage: 76.2+ KB


# Consolidate Duplicate Movie Titles

In [156]:
# Build the sorted union of genres across all rows that share a title
title_genres = movie.groupby('title')['genres'].apply(lambda x: '|'.join(sorted(set('|'.join(x).split('|'))))).reset_index()
title_genres.columns = ['title', 'combined_genres']

# Swap the original genres column for the combined one
movie_cleaned = movie.drop('genres', axis=1)
movie_cleaned = movie_cleaned.merge(title_genres, on='title', how='left')
movie_cleaned = movie_cleaned.rename(columns={'combined_genres': 'genres'})

# Keep one row per title (the first movieId encountered for that title)
movie_cleaned = movie_cleaned.drop_duplicates(subset=['title'], keep='first')

print(movie_cleaned.shape)

(9737, 3)


Merge genres for movies with identical titles. 
Remove duplicate rows. 
Result: 9737 unique movies (reduced from 9742)
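
The consolidation step can be tried on a small table first. The following is a minimal sketch, assuming a toy DataFrame that mimics the `Emma (1996)` duplicate shown earlier; the helper `union_genres` and the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for movies.csv; the two "Emma (1996)" rows mirror the duplicate above.
toy = pd.DataFrame({
    "movieId": [838, 26958, 1],
    "title": ["Emma (1996)", "Emma (1996)", "Toy Story (1995)"],
    "genres": ["Comedy|Drama|Romance", "Romance", "Adventure|Animation"],
})

def union_genres(series):
    """Return the sorted union of pipe-separated genres in one title group."""
    genres = set()
    for value in series:
        genres.update(value.split("|"))
    return "|".join(sorted(genres))

# One combined genre string per title, indexed by title.
combined = toy.groupby("title")["genres"].agg(union_genres)

# Keep the first row per title and attach the combined genres.
deduped = (
    toy.drop_duplicates(subset="title", keep="first")
       .assign(genres=lambda df: df["title"].map(combined))
       .reset_index(drop=True)
)
print(deduped)
```

Keeping the first row means the surviving `movieId` is whichever one appears first in the file, so ratings and tags attached to the dropped `movieId` are not carried over by a later merge on `movieId`.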

In [157]:
movie.shape

(9742, 3)

The original dataset has 9742 rows.

In [158]:
movie_cleaned[movie_cleaned["title"] == "Confessions of a Dangerous Mind (2002)"]


Unnamed: 0,movieId,title,genres
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller


In [159]:
link.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


The imdbId and tmdbId columns are not needed for content-based recommendations, so the link dataset is not used.

In [160]:
rating.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247


In [161]:
rating.shape

(100836, 4)

The rating dataset has far more rows (100,836) than the movie dataset (9,742) because each movie can be rated by many users.

In [162]:
rating.movieId.info()

<class 'pandas.core.series.Series'>
RangeIndex: 100836 entries, 0 to 100835
Series name: movieId
Non-Null Count   Dtype
--------------   -----
100836 non-null  int64
dtypes: int64(1)
memory usage: 787.9 KB


In [163]:
rating.movieId.duplicated().value_counts()

movieId
True     91112
False     9724
Name: count, dtype: int64

In [164]:
rating.movieId.duplicated

<bound method Series.duplicated of 0              1
1              3
2              6
3             47
4             50
           ...  
100831    166534
100832    168248
100833    168250
100834    168252
100835    170875
Name: movieId, Length: 100836, dtype: int64>

In [165]:
rating[rating["movieId"] == 168248]


Unnamed: 0,userId,movieId,rating,timestamp
3659,21,168248,3.5,1500701525
9136,62,168248,4.5,1523048794
17864,111,168248,4.0,1516141936
41240,279,168248,4.0,1506394781
58071,380,168248,5.0,1501786943
95065,599,168248,3.0,1498529689
100832,610,168248,5.0,1493850091


The same movieId appears with different ratings from different users, so ratings need to be aggregated per movie. The userId and timestamp columns are not needed.

In [166]:
rating.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# Aggregate User Ratings

In [167]:
# Average all user ratings for each movie
rating_cleaned = rating.groupby('movieId')['rating'].mean().reset_index()
rating_cleaned.shape

(9724, 2)

Calculate average rating for each movie. 
Reduces 100,836 ratings to 9,724 movies with average ratings. 
Eliminates userId and timestamp (not needed for recommendation)
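
A per-movie mean hides how many ratings it is based on. As a hedged sketch (the `n_ratings` column and the toy data are assumptions, not part of the notebook), named aggregation can keep the count next to the mean:

```python
import pandas as pd

# Toy stand-in for ratings.csv: movie 10 has three ratings, movie 20 has one.
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [10, 10, 10, 20],
    "rating": [3.0, 4.0, 5.0, 2.5],
})

# Mean rating plus the number of ratings behind it (n_ratings is a hypothetical extra column).
summary = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(rating="mean", n_ratings="count")
    .reset_index()
)
print(summary)
```

The count makes it possible to tell a 5.0 average from a single user apart from one backed by many users.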

In [168]:
rating_cleaned.head()

Unnamed: 0,movieId,rating
0,1,3.92093
1,2,3.431818
2,3,3.259615
3,4,2.357143
4,5,3.071429


In [169]:
tag.head(2)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996


In [170]:
tag.shape

(3683, 4)

In [171]:
tag.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [172]:
tag.movieId.duplicated

<bound method Series.duplicated of 0        60756
1        60756
2        60756
3        89774
4        89774
         ...  
3678      7382
3679      7936
3680      3265
3681      3265
3682    168248
Name: movieId, Length: 3683, dtype: int64>

In [173]:
tag.movieId.duplicated().value_counts()

movieId
True     2111
False    1572
Name: count, dtype: int64

In [174]:
tag.tag.value_counts()

tag
In Netflix queue      131
atmospheric            36
thought-provoking      24
superhero              24
surreal                23
                     ... 
70mm                    1
Romans                  1
British                 1
TERRORISM               1
societal criticism      1
Name: count, Length: 1589, dtype: int64

In [175]:
tag[tag.movieId.duplicated(keep=False)]

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3675,606,3578,Romans,1173212944
3678,606,7382,for katie,1171234019
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


# Consolidate Movie Tags

Some movieId values appear more than once, each time with a different tag. The userId and timestamp columns are not needed.

In [176]:
# Join the unique tags of each movie into one pipe-separated, sorted string
tag_cleaned = tag.groupby('movieId')['tag'].apply(lambda x: '|'.join(sorted(set('|'.join(x).split('|'))))).reset_index()
tag_cleaned.shape

(1572, 2)

Combine all tags for each movie. 
Reduces 3,683 tag entries to 1,572 movies with consolidated tags. 
Removes userId and timestamp
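
The same grouping idea can be checked on a few rows. This is a small sketch on made-up tag rows (the data and variable names are illustrative); it assumes tags do not themselves contain the `|` separator.

```python
import pandas as pd

# Toy stand-in for tags.csv: two users applied "funny" to the same movie.
toy_tags = pd.DataFrame({
    "userId": [2, 2, 5, 7],
    "movieId": [60756, 60756, 60756, 1],
    "tag": ["funny", "will ferrell", "funny", "pixar"],
})

# set() removes repeated tags, sorted() gives a deterministic order.
tags_per_movie = (
    toy_tags.groupby("movieId")["tag"]
    .agg(lambda tags: "|".join(sorted(set(tags))))
    .reset_index()
)
print(tags_per_movie)
```

Python's `sorted` is case-sensitive, which is why `Robin Williams` is listed before `fantasy` in the notebook output for movie 2.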

In [177]:
tag_cleaned.head(2)

Unnamed: 0,movieId,tag
0,1,fun|pixar
1,2,Robin Williams|fantasy|game|magic board game


In [178]:
tag_cleaned.movieId.value_counts

<bound method IndexOpsMixin.value_counts of 0            1
1            2
2            3
3            5
4            7
         ...  
1567    183611
1568    184471
1569    187593
1570    187595
1571    193565
Name: movieId, Length: 1572, dtype: int64>

Now all of the datasets need to be merged together.

# Feature Engineering

In [179]:
merged_data = movie_cleaned.copy()
merged_data = merged_data.merge(rating_cleaned, on='movieId', how='left')
if not tag.empty:
    merged_data = merged_data.merge(tag_cleaned, on='movieId', how='left')
    
print(f"Final merged shape: {merged_data.shape}")
merged_data.head(2)

Final merged shape: (9737, 5)


Unnamed: 0,movieId,title,genres,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,fun|pixar
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,Robin Williams|fantasy|game|magic board game


Create a single unified dataset with all movie information.
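
A left merge keeps every movie and leaves missing values where no rating or tag exists. The sketch below illustrates this on toy frames; the `validate="one_to_one"` argument is an optional safeguard added here as an assumption, not something the notebook uses.

```python
import pandas as pd

movies = pd.DataFrame({"movieId": [1, 2, 3],
                       "title": ["A (2000)", "B (2001)", "C (2002)"]})
ratings = pd.DataFrame({"movieId": [1, 2], "rating": [4.0, 3.5]})
tags = pd.DataFrame({"movieId": [1], "tag": ["fun"]})

# how="left" keeps all rows of `movies`; validate raises if movieId is duplicated on either side.
merged = (
    movies.merge(ratings, on="movieId", how="left", validate="one_to_one")
          .merge(tags, on="movieId", how="left", validate="one_to_one")
)
print(merged)
print(merged.isna().sum())
```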


In [180]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9737 entries, 0 to 9736
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9737 non-null   int64  
 1   title    9737 non-null   object 
 2   genres   9737 non-null   object 
 3   rating   9719 non-null   float64
 4   tag      1572 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 380.5+ KB


The rating and tag columns still contain null values.

In [181]:
merged_data["rating"] = merged_data["rating"].fillna(merged_data["rating"].mean())


Fill missing ratings with the overall mean rating so that the rating column has no null values.
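
A tiny illustration of the imputation, using made-up values: `Series.mean()` skips missing entries by default, so the fill value is the mean of the known ratings only. Filling text columns with an empty string is shown as well.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "rating": [4.0, np.nan, 2.0],
    "tag": ["fun", np.nan, np.nan],
})

# mean() ignores NaN, so the fill value here is (4.0 + 2.0) / 2.
toy["rating"] = toy["rating"].fillna(toy["rating"].mean())
# Text columns get an empty string so later string concatenation does not produce NaN.
toy["tag"] = toy["tag"].fillna("")
print(toy)
```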

In [182]:
merged_data['new_tags'] = merged_data['title'] + "|" + merged_data['genres']
merged_data.head(5)

Unnamed: 0,movieId,title,genres,rating,tag,new_tags
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,fun|pixar,Toy Story (1995)|Adventure|Animation|Children|...
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,Robin Williams|fantasy|game|magic board game,Jumanji (1995)|Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,moldy|old,Grumpier Old Men (1995)|Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,,Waiting to Exhale (1995)|Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy,3.071429,pregnancy|remake,Father of the Bride Part II (1995)|Comedy


In [183]:
merged_data.new_tags.head(5)

0    Toy Story (1995)|Adventure|Animation|Children|...
1            Jumanji (1995)|Adventure|Children|Fantasy
2               Grumpier Old Men (1995)|Comedy|Romance
3        Waiting to Exhale (1995)|Comedy|Drama|Romance
4            Father of the Bride Part II (1995)|Comedy
Name: new_tags, dtype: object

In [184]:
merged_data['tag'] = merged_data['tag'].fillna('')

merged_data['new_tags'] = merged_data.apply(lambda row: row['new_tags'] + "|" + row['tag'] if row['tag'] != "" else row['new_tags'],axis=1)


Combine title, genres, and tags into a single feature column. This column is used for the similarity calculation.
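
The row-wise `apply` can also be written with column operations. A sketch on toy rows follows, assuming (as in the notebook) that missing tags have already been replaced by empty strings; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Waiting to Exhale (1995)"],
    "genres": ["Adventure|Animation", "Comedy|Drama"],
    "tag": ["fun|pixar", ""],
})

# Start from title + genres, then append tags only where a tag string exists.
features = toy["title"] + "|" + toy["genres"]
has_tag = toy["tag"] != ""
# where() keeps values where the condition is True and uses `other` elsewhere.
features = features.where(~has_tag, features + "|" + toy["tag"])
print(features.tolist())
```

Working on whole columns avoids calling a Python function once per row, which can matter on larger tables.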

In [185]:
merged_data.head()

Unnamed: 0,movieId,title,genres,rating,tag,new_tags
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,fun|pixar,Toy Story (1995)|Adventure|Animation|Children|...
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,Robin Williams|fantasy|game|magic board game,Jumanji (1995)|Adventure|Children|Fantasy|Robi...
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,moldy|old,Grumpier Old Men (1995)|Comedy|Romance|moldy|old
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,,Waiting to Exhale (1995)|Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy,3.071429,pregnancy|remake,Father of the Bride Part II (1995)|Comedy|preg...


In [186]:
merged_data.new_tags.head(5)

0    Toy Story (1995)|Adventure|Animation|Children|...
1    Jumanji (1995)|Adventure|Children|Fantasy|Robi...
2     Grumpier Old Men (1995)|Comedy|Romance|moldy|old
3        Waiting to Exhale (1995)|Comedy|Drama|Romance
4    Father of the Bride Part II (1995)|Comedy|preg...
Name: new_tags, dtype: object

In [187]:
merged_data.columns

Index(['movieId', 'title', 'genres', 'rating', 'tag', 'new_tags'], dtype='object')

In [188]:
new = merged_data.drop(["tag"],axis=1)
new.head(2)

Unnamed: 0,movieId,title,genres,rating,new_tags
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,Toy Story (1995)|Adventure|Animation|Children|...
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,Jumanji (1995)|Adventure|Children|Fantasy|Robi...


# Format for Model

In [189]:
new = new.rename(columns={"new_tags":"tag"})
new["tag"] = new["tag"].str.replace('|', ',', regex=False)
new["genres"] = new["genres"].str.replace('|', ',', regex=False)
new.head(2)

Unnamed: 0,movieId,title,genres,rating,tag
0,1,Toy Story (1995),"Adventure,Animation,Children,Comedy,Fantasy",3.92093,"Toy Story (1995),Adventure,Animation,Children,..."
1,2,Jumanji (1995),"Adventure,Children,Fantasy",3.431818,"Jumanji (1995),Adventure,Children,Fantasy,Robi..."


Clean up dataframe structure. 
Replace pipe separators with commas for better readability. 
Final columns: movieId, title, genres, rating, tag

In [190]:
new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9737 entries, 0 to 9736
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9737 non-null   int64  
 1   title    9737 non-null   object 
 2   genres   9737 non-null   object 
 3   rating   9737 non-null   float64
 4   tag      9737 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 380.5+ KB


# Model Building

In [191]:
cv = CountVectorizer(max_features=10000,stop_words='english')
vector = cv.fit_transform(new.tag).toarray()

Convert text features to numerical vectors. 
`max_features=10000`: Limit vocabulary to top 10,000 words. 
`stop_words='english'`: Remove common English words (the, is, at, etc.). 
Result: (9737, 9949) matrix - 9737 movies, 9949 unique features
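
To see what the vectorizer does to a single feature string, here is a small sketch on two made-up documents. It relies on `CountVectorizer` defaults (lowercasing, tokens of two or more word characters), so punctuation such as commas and parentheses acts as a separator.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy feature strings in the same comma-separated layout as the `tag` column.
docs = [
    "Father of the Bride (1991),Comedy",
    "Toy Story (1995),Adventure,Comedy,pixar",
]

cv = CountVectorizer(max_features=10000, stop_words="english")
matrix = cv.fit_transform(docs)

# vocabulary_ maps each kept token to its column index.
print(sorted(cv.vocabulary_))
print(matrix.toarray())
```

Because the title is part of the text, title words and release years become features too, and stop words such as "of" and "the" are dropped.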


In [192]:
vector.shape

(9737, 9949)

# Calculate Similarity Matrix

In [193]:
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.39528471, 0.2       , ..., 0.        , 0.1118034 ,
        0.10540926],
       [0.39528471, 1.        , 0.07905694, ..., 0.        , 0.        ,
        0.        ],
       [0.2       , 0.07905694, 1.        , ..., 0.        , 0.        ,
        0.10540926],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.1118034 , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.10540926, 0.        , 0.10540926, ..., 0.        , 0.        ,
        1.        ]], shape=(9737, 9737))

Calculate cosine similarity between all movie pairs. 
Result: (9737, 9737) matrix where `similarity[i][j]` is the similarity between movie i and movie j. 
Values range from 0 (no features in common) to 1 (feature vectors pointing in the same direction)
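
For two count vectors $a$ and $b$, cosine similarity is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The following plain-Python sketch computes it by hand for tiny vectors; the function name `cosine` and the zero-vector convention are choices made for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

a = [1, 1, 0]  # e.g. counts for the tokens "comedy", "romance", "drama"
b = [1, 0, 1]

print(cosine(a, b))  # approximately 0.5: one shared token out of two each
print(cosine(a, a))  # approximately 1.0
print(cosine(a, [0, 0, 1]))  # 0.0: no shared tokens
```

Floating-point rounding explains why a movie's similarity with itself can print as 0.9999999999999999 rather than exactly 1.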


In [194]:
sorted(similarity[0], reverse=True)

[np.float64(0.9999999999999999),
 np.float64(0.7378647873726218),
 np.float64(0.6708203932499369),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.5976143046671968),
 np.float64(0.565685424949238),
 np.float64(0.565685424949238),
 np.float64(0.5590169943749475),
 np.float64(0.5590169943749475),
 np.float64(0.5590169943749475),
 np.float64(0.5590169943749475),
 np.float64(0.5590169943749475),
 np.float64(0.5270462766947299),
 np.float64(0.5270462766947299),
 np.float64(0.5270462766947299),
 np.float64(0.5270462766947299),
 np.float64(0.5270462766947299),
 np.float64(0.5270462766947299),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(0.5163977794943223),
 np.float64(

In [195]:
new.shape

(9737, 5)

In [197]:
enumerate(sorted(similarity[0], reverse=True))

<enumerate at 0x269c70e2e80>

In [198]:
list(enumerate(sorted(similarity[0], reverse=True)))

[(0, np.float64(0.9999999999999999)),
 (1, np.float64(0.7378647873726218)),
 (2, np.float64(0.6708203932499369)),
 (3, np.float64(0.5976143046671968)),
 (4, np.float64(0.5976143046671968)),
 (5, np.float64(0.5976143046671968)),
 (6, np.float64(0.5976143046671968)),
 (7, np.float64(0.5976143046671968)),
 (8, np.float64(0.5976143046671968)),
 (9, np.float64(0.5976143046671968)),
 (10, np.float64(0.565685424949238)),
 (11, np.float64(0.565685424949238)),
 (12, np.float64(0.5590169943749475)),
 (13, np.float64(0.5590169943749475)),
 (14, np.float64(0.5590169943749475)),
 (15, np.float64(0.5590169943749475)),
 (16, np.float64(0.5590169943749475)),
 (17, np.float64(0.5270462766947299)),
 (18, np.float64(0.5270462766947299)),
 (19, np.float64(0.5270462766947299)),
 (20, np.float64(0.5270462766947299)),
 (21, np.float64(0.5270462766947299)),
 (22, np.float64(0.5270462766947299)),
 (23, np.float64(0.5163977794943223)),
 (24, np.float64(0.5163977794943223)),
 (25, np.float64(0.5163977794943223))

In [199]:
sorted(list(enumerate(similarity[0])), reverse=True, key = lambda x:x[1])

[(0, np.float64(0.9999999999999999)),
 (7353, np.float64(0.7378647873726218)),
 (2355, np.float64(0.6708203932499369)),
 (1706, np.float64(0.5976143046671968)),
 (2539, np.float64(0.5976143046671968)),
 (3568, np.float64(0.5976143046671968)),
 (6193, np.float64(0.5976143046671968)),
 (6485, np.float64(0.5976143046671968)),
 (8217, np.float64(0.5976143046671968)),
 (9426, np.float64(0.5976143046671968)),
 (12, np.float64(0.565685424949238)),
 (209, np.float64(0.565685424949238)),
 (1757, np.float64(0.5590169943749475)),
 (5976, np.float64(0.5590169943749475)),
 (6946, np.float64(0.5590169943749475)),
 (8898, np.float64(0.5590169943749475)),
 (8925, np.float64(0.5590169943749475)),
 (2809, np.float64(0.5270462766947299)),
 (3000, np.float64(0.5270462766947299)),
 (6259, np.float64(0.5270462766947299)),
 (6625, np.float64(0.5270462766947299)),
 (7528, np.float64(0.5270462766947299)),
 (8804, np.float64(0.5270462766947299)),
 (53, np.float64(0.5163977794943223)),
 (1357, np.float64(0.51639

In [200]:
new.title.iloc[8798:8801]

8798                   Pan (2015)
8799     While We're Young (2014)
8800    Too Late for Tears (1949)
Name: title, dtype: object

In [201]:
new[new['title'] == 'Too Late for Tears (1949)'].index[0]

np.int64(8800)

# Recommendation Function

In [202]:
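def recomand(movie):
    # Row position of the requested title (raises IndexError if the title is absent)
    index = new[new["title"] == movie].index[0]
    # Pair each row position with its similarity score, highest first
    dist = sorted(list(enumerate(similarity[index])), reverse=True, key = lambda x: x[1])
    # Skip the first entry (the movie itself) and print the next 10 titles
    for i in dist[1:11]:
        print(new.iloc[i[0]].title)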

In [203]:
new.iloc[8798].title

'Pan (2015)'

In [204]:
recomand('Pan (2015)')

Peter Pan (2003)
Inside Out (2015)
The Good Dinosaur (2015)
Home (2015)
Indian in the Cupboard, The (1995)
Borrowers, The (1997)
Return to Oz (1985)
NeverEnding Story, The (1984)
MirrorMask (2005)
Zathura (2005)


How It Works:
1. Find the movie's position in the dataset
2. Retrieve similarity scores with all other movies
3. Sort by similarity score (highest first)
4. Return top 10 (excluding the input movie itself at position 0)
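
The steps above can be sketched as a function that returns a list instead of printing. This is a hedged variant, not the notebook's function: the name `recommend_titles`, the toy titles, and the toy similarity matrix are assumptions. It excludes the query movie by index rather than by sort position and returns an empty list for unknown titles.

```python
import numpy as np
import pandas as pd

# Toy data: four titles and a symmetric similarity matrix.
titles = pd.Series(["A (2000)", "B (2001)", "C (2002)", "D (2003)"], name="title")
sim = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.0],
    [0.5, 0.2, 0.0, 1.0],
])

def recommend_titles(title, titles, sim, top_n=10):
    """Return up to top_n titles most similar to `title`, or [] if it is unknown."""
    matches = np.flatnonzero((titles == title).to_numpy())
    if matches.size == 0:
        return []
    idx = matches[0]
    # Negate the scores so argsort yields highest similarity first.
    order = np.argsort(-sim[idx], kind="stable")
    # Drop the query movie itself, wherever it landed in the ordering.
    order = order[order != idx][:top_n]
    return titles.iloc[order].tolist()

print(recommend_titles("A (2000)", titles, sim, top_n=2))
print(recommend_titles("Unknown (1900)", titles, sim))
```

Filtering by index is slightly more robust than skipping position 0, since another movie with an identical feature vector could tie with the query at the top of the list.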


In [205]:
new.head(1)

Unnamed: 0,movieId,title,genres,rating,tag
0,1,Toy Story (1995),"Adventure,Animation,Children,Comedy,Fantasy",3.92093,"Toy Story (1995),Adventure,Animation,Children,..."


In [206]:
new.rating.value_counts

<bound method IndexOpsMixin.value_counts of 0       3.920930
1       3.431818
2       3.259615
3       2.357143
4       3.071429
          ...   
9732    4.000000
9733    3.500000
9734    3.500000
9735    3.500000
9736    4.000000
Name: rating, Length: 9737, dtype: float64>

In [207]:
sorted(new.rating, reverse=True)

[5.0,
 5.0,
 5.0,
 ...

In [208]:
sorted(list(enumerate(new.rating)), reverse=True, key = lambda x:x[1])

[(48, 5.0),
 (87, 5.0),
 (121, 5.0),
 (405, 5.0),
 (432, 5.0),
 (433, 5.0),
 (531, 5.0),
 (536, 5.0),
 (666, 5.0),
 (865, 5.0),
 (870, 5.0),
 (1006, 5.0),
 (1037, 5.0),
 (1228, 5.0),
 (1311, 5.0),
 (1540, 5.0),
 (1647, 5.0),
 (1889, 5.0),
 (2125, 5.0),
 (2234, 5.0),
 (2237, 5.0),
 (2319, 5.0),
 (2329, 5.0),
 (2338, 5.0),
 (2480, 5.0),
 (2597, 5.0),
 (2611, 5.0),
 (2639, 5.0),
 (2665, 5.0),
 (2711, 5.0),
 (2740, 5.0),
 (2749, 5.0),
 (2835, 5.0),
 (2838, 5.0),
 (2880, 5.0),
 (2936, 5.0),
 (2937, 5.0),
 (2938, 5.0),
 (2939, 5.0),
 (2947, 5.0),
 (3067, 5.0),
 (3081, 5.0),
 (3110, 5.0),
 (3256, 5.0),
 (3294, 5.0),
 (3320, 5.0),
 (3504, 5.0),
 (3522, 5.0),
 (3672, 5.0),
 (3691, 5.0),
 (3758, 5.0),
 (3759, 5.0),
 (3807, 5.0),
 (3852, 5.0),
 (3893, 5.0),
 (3908, 5.0),
 (3923, 5.0),
 (3936, 5.0),
 (3974, 5.0),
 (4038, 5.0),
 (4044, 5.0),
 (4045, 5.0),
 (4108, 5.0),
 (4109, 5.0),
 (4178, 5.0),
 (4206, 5.0),
 (4246, 5.0),
 (4251, 5.0),
 (4372, 5.0),
 (4375, 5.0),
 (4390, 5.0),
 (4474, 5.0),
 (459

# Top Rated Movies Function

In [209]:
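# Pair each row position with its rating and sort from highest to lowest
dist = sorted(list(enumerate(new.rating)), reverse=True, key = lambda x: x[1])
# Note: the slice starts at 1, so the first sorted entry (row 48) is not printed
for i in dist[1:11]:
    print(new.iloc[i[0]][["title", "rating", "genres"]])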


title     Heidi Fleiss: Hollywood Madam (1995)
rating                                     5.0
genres                             Documentary
Name: 87, dtype: object
title     Awfully Big Adventure, An (1995)
rating                                 5.0
genres                               Drama
Name: 121, dtype: object
title     Live Nude Girls (1995)
rating                       5.0
genres                    Comedy
Name: 405, dtype: object
title     In the Realm of the Senses (Ai no corrida) (1976)
rating                                                  5.0
genres                                                Drama
Name: 432, dtype: object
title       What Happened Was... (1994)
rating                              5.0
genres    Comedy,Drama,Romance,Thriller
Name: 433, dtype: object
title     Thin Line Between Love and Hate, A (1996)
rating                                          5.0
genres                                       Comedy
Name: 531, dtype: object
title     Denise Calls Up 
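
The loop above slices `dist[1:11]`, which leaves out the first sorted entry. As a minimal sketch on toy data (the function name `top_rated` and the sample rows are assumptions), `DataFrame.nlargest` returns the top rows directly without that offset:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A (2000)", "B (2001)", "C (2002)", "D (2003)"],
    "rating": [5.0, 3.0, 4.8, 4.5],
    "genres": ["Drama", "Comedy", "Documentary", "Comedy,Romance"],
})

def top_rated(df, n=10):
    """Return the n highest-rated rows with the columns shown in the notebook."""
    return df.nlargest(n, "rating")[["title", "rating", "genres"]]

print(top_rated(toy, n=2))
```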

In [210]:
new.columns

Index(['movieId', 'title', 'genres', 'rating', 'tag'], dtype='object')

# Export Model

In [211]:
with open("movie_data.pkl", "wb") as f:
    pickle.dump(new, f)
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)
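
Reading the artifacts back follows the same pattern with `pickle.load`. The sketch below round-trips a small stand-in object through a temporary file; the file name `toy_artifact.pkl` and the dictionary contents are placeholders, not the notebook's actual data.

```python
import os
import pickle
import tempfile

# Small stand-in for the DataFrame and similarity matrix saved by the notebook.
artifact = {
    "titles": ["A (2000)", "B (2001)"],
    "similarity": [[1.0, 0.5], [0.5, 1.0]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_artifact.pkl")
    # The context manager closes the file handle even if dump/load raises.
    with open(path, "wb") as fh:
        pickle.dump(artifact, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == artifact)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.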