# Movie Recommendation Project
---

## BUSINESS PROBLEM
---
A newly established fictional online movie streaming platform (superperfectmoviesblabla.com) wants to recommend films to its users. Because few users log in, the platform cannot collect user preferences effectively, so it cannot build recommendations with collaborative filtering methods. It does, however, know from browser history which movies users have watched. The task is to provide film recommendations based on this information.

## DATASET STORY
---

The movies_metadata.csv dataset contains basic information about roughly 45,000 films.
For 'Content Based Recommendation', the application will use the 'overview' variable, which contains the film descriptions.

## FEATURES
---

- **overview:** A summary description outlining the plot of the film.
- **release_date:** Release date of the movie.
- **title:** Title of the movie.

## STEPS
---
## Recommendation Generation Based on Film Overviews (Content Based Recommendation)
### Step 1: Creating the TF-IDF Matrix
### Step 2: Creating the Cosine Similarity Matrix
### Step 3: Generating Recommendations Based on Similarities
---

### *Libraries and Settings*
---

In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings

pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.width", 500)
warnings.filterwarnings("ignore")

In [2]:
df_meta = pd.read_csv("/kaggle/input/the-movies-dataset/movies_metadata.csv", usecols = ["overview","release_date","title"])
df_meta.head()

| | overview | release_date | title |
|---|---|---|---|
| 0 | Led by Woody, Andy's toys live happily in his ... | 1995-10-30 | Toy Story |
| 1 | When siblings Judy and Peter discover an encha... | 1995-12-15 | Jumanji |
| 2 | A family wedding reignites the ancient feud be... | 1995-12-22 | Grumpier Old Men |
| 3 | Cheated on, mistreated and stepped on, the wom... | 1995-12-22 | Waiting to Exhale |
| 4 | Just when George Banks has recovered from his ... | 1995-02-10 | Father of the Bride Part II |


In [3]:
df_meta.shape

(45466, 3)

In [4]:
df_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   overview      44512 non-null  object
 1   release_date  45379 non-null  object
 2   title         45460 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


In this section, we will use the overview column for Content Based Recommendation.

In [5]:
df_meta["overview"].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [6]:
df_meta["overview"].isnull().sum()

954

Since 954 overviews are missing, we simply replace them with empty strings so that they do not cause problems in the TF-IDF calculations.

In [7]:
df_meta["overview"] = df_meta["overview"].fillna('') # assign back instead of inplace=True on a column selection
df_meta["overview"].isnull().sum()

0

---
# Creating the TF-IDF Matrix
---

As we are working with English text and do not want the bias caused by stop words (words such as 'in', 'and', 'or' etc.), we will use the stop_words argument of TfidfVectorizer.

In [8]:
tfidf = TfidfVectorizer(stop_words="english")

Now we can create the TF-IDF matrix for the overviews.

In [9]:
tfidf_matrix = tfidf.fit_transform(df_meta["overview"])
tfidf_matrix.shape # (number of movies, number of unique words across all overviews)

(45466, 75827)

In [10]:
tfidf.get_feature_names_out()[41600:41650] # some words used in description of movies which we have as features now.

array(['mathematician', 'mathematicians', 'mathematics', 'mathers',
       'matheson', 'mathew', 'mathews', 'mathewson', 'mathias', 'mathieu',
       'mathilassi', 'mathilda', 'mathilde', 'mathis', 'mathison',
       'mathmatics', 'maths', 'mathur', 'mathuyoshi', 'mati', 'matias',
       'matiaz', 'matilda', 'matilde', 'matilyn', 'matinee', 'mating',
       'matinée', 'mation', 'matiss', 'matisse', 'matisses', 'matko',
       'matkę', 'matlin', 'matmos', 'mato', 'matondkar', 'matoran',
       'matrac', 'matriarch', 'matriarchal', 'matricide', 'matriculated',
       'matriculates', 'matrimoni', 'matrimonial', 'matrimonios',
       'matrimony', 'matrix'], dtype=object)

Let's also look at the TF-IDF weights at the intersection of movies (rows) and unique words (columns).

In [11]:
tfidf_matrix[:10].toarray() # slice first, so only 10 rows are converted to a dense array

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

---
# Creating the Cosine Similarity Matrix
---

In [12]:
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim.shape

(45466, 45466)

In [13]:
cos_sim[:5]

array([[1.        , 0.01504121, 0.        , ..., 0.        , 0.00595453,
        0.        ],
       [0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
        0.00929411],
       [0.        , 0.04681953, 1.        , ..., 0.        , 0.01402548,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.00952214,
        0.01641271],
       [0.        , 0.        , 0.02509444, ..., 0.        , 0.00700553,
        0.        ]])
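
A dense 45466 x 45466 matrix of float64 values takes about 45466 * 45466 * 8 bytes, roughly 16.5 GB, so the full similarity matrix is expensive to hold in memory. As a hedged alternative, the sketch below scores a single movie against all others, which yields just one row of the matrix. It also checks that, because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product (`linear_kernel`) gives the same values as `cosine_similarity`. The toy corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews for illustration only.
toy_overviews = [
    "A toy cowboy and a space ranger are rivals",
    "A space ranger explores a distant planet",
    "Two old neighbours argue about fishing",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# Rows are already unit length, so the dot product equals the cosine similarity.
full_cos = cosine_similarity(toy_matrix)
full_dot = linear_kernel(toy_matrix)
same = np.allclose(full_cos, full_dot)

# Score only the first document against all documents (one row, not the full matrix).
row_scores = cosine_similarity(toy_matrix[0], toy_matrix).ravel()

print(same)
print(row_scores)
```

The first score is the document compared with itself, the second is positive because both overviews mention a space ranger, and the third is zero because the overviews share no words.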

---
# Generating Recommendations Based on Similarities
---

The score matrix above does not show which movie is similar to which, so let's look the movies up by name.

First, however, we check whether there are any duplicate movie records.

In [14]:
df_meta["title"].value_counts()

title
Cinderella              11
Hamlet                   9
Alice in Wonderland      9
Beauty and the Beast     8
Les Misérables           8
                        ..
Cluny Brown              1
Babies                   1
The Green Room           1
Captain Conan            1
Queerama                 1
Name: count, Length: 42277, dtype: int64

Since some titles appear more than once and we want to keep the most recent production, we will drop the duplicates and keep only the latest movie for each title.

In [15]:
# Clean the values in the release_date column and convert them to datetime
# so that movies can be sorted from the most recent to the oldest.
df_meta["release_date"] = df_meta["release_date"].apply(lambda x: "" if len(str(x)) != 10 else x) # replace values that are not dates (including NaN) with an empty string
df_meta["release_date"] = pd.to_datetime(df_meta["release_date"].fillna(""), format="%Y-%m-%d") # convert to datetime; empty strings become NaT
df_meta = df_meta[["title","overview","release_date"]].sort_values("release_date", ascending=False) # sort by release_date, newest first

# getting rid of duplicated titles, keeping the most recent release of each
df_meta = df_meta[~df_meta.duplicated(subset="title", keep="first")]

# keeping indices as feature
df_meta["indices"] = df_meta.index

# assigning titles as index
df_meta.index = df_meta["title"]

df_meta.head()

| title (index) | title | overview | release_date | indices |
|---|---|---|---|---|
| Avatar 2 | Avatar 2 | A sequel to Avatar (2009). | 2020-12-16 | 26559 |
| The Other Side of the Wind | The Other Side of the Wind | Orson Welles' unfinished masterpiece, restored... | 2018-12-31 | 38885 |
| Bad Boys for Life | Bad Boys for Life | The continuing adventures of Miami detectives ... | 2018-11-07 | 30402 |
| Mary Shelley | Mary Shelley | The love affair between poet Percy Shelley and... | 2018-04-25 | 38130 |
| Mobile Homes | Mobile Homes | In forgotten towns along the American border, ... | 2018-04-04 | 44535 |
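
The "sort by date, then keep the first row per title" idea can be checked on a small made-up table. This is only a sketch: the rows, the dates and the names `toy` and `latest` are illustrative assumptions, and it uses `errors="coerce"` to turn unparseable dates into NaT, which is a slightly different route than the length check used in the cell above.

```python
import pandas as pd

# Made-up rows for illustration; the dates are not taken from the dataset.
toy = pd.DataFrame({
    "title": ["Cinderella", "Hamlet", "Cinderella", "Hamlet"],
    "release_date": ["1950-03-04", "1990-12-19", "2015-03-12", "not a date"],
})

# Values that do not match the format become NaT instead of raising an error.
toy["release_date"] = pd.to_datetime(
    toy["release_date"], format="%Y-%m-%d", errors="coerce"
)

# Newest first (NaT is placed last), then keep the first row of every title.
latest = (
    toy.sort_values("release_date", ascending=False)
       .drop_duplicates(subset="title", keep="first")
)
print(latest)
```

Each title now appears once, with its most recent valid release date.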


Now we can search for a movie by keyword to find its index.

In [16]:
searched_movie = "matrix"
df_meta[df_meta["title"].apply(lambda x: str(searched_movie).lower() in str(x).lower())]["indices"]

title
Return to Source: The Philosophy of The Matrix    29468
The Matrix Revolutions                             6783
The Matrix Reloaded                                6221
The Animatrix                                      9372
Armitage: Dual Matrix                             38306
The Matrix Revisited                              43870
The Matrix                                         2458
Name: indices, dtype: int64
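
The same case-insensitive keyword search can also be written with the vectorized pandas string methods instead of `apply`. The following is a minimal sketch on a made-up Series (`toy_titles`, `mask` and `matches` are illustrative names); `na=False` treats missing titles as non-matches.

```python
import pandas as pd

# Made-up titles for illustration, including one missing value.
toy_titles = pd.Series(["The Matrix", "The Animatrix", "Toy Story", None], name="title")

# Case-insensitive plain substring match; missing titles count as "no match".
mask = toy_titles.str.contains("matrix", case=False, na=False, regex=False)
matches = toy_titles[mask]
print(matches)
```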

Finally, we can get recommendations for a selected movie.

In [17]:
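# 30700 is the original row position ("indices") of Minions, used here as the selected movie
movie_rec = pd.DataFrame(cos_sim[30700], columns=["Score"])
# the 6 highest scores are the movie itself plus its 5 closest matches
indices_first_5_rec = movie_rec.sort_values(["Score"], ascending=False)[0:6].index.to_list()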
rec_movie_names = df_meta[df_meta["indices"].apply(lambda x: int(x) in indices_first_5_rec)]
rec_movie_names

| title (index) | title | overview | release_date | indices |
|---|---|---|---|---|
| Minions | Minions | Minions Stuart, Kevin and Bob are recruited by... | 2015-06-17 | 30700 |
| Despicable Me 2 | Despicable Me 2 | Gru is recruited by the Anti-Villain League to... | 2013-06-25 | 21161 |
| The Mother Of Invention | The Mother Of Invention | A mockumentary about an aspiring inventor who ... | 2010-06-25 | 31251 |
| What's Up, Scarlet? | What's Up, Scarlet? | Scarlet Zabrinski, a successful, hard edged, L... | 2005-05-18 | 18440 |
| A Story of Floating Weeds | A Story of Floating Weeds | A kabuki actor's mistress hatches a jealous pl... | 1934-11-23 | 10558 |
| Madam Satan | Madam Satan | Angela and Bob Brooks are an upper class coupl... | 1930-09-20 | 16176 |
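
The table above lists the matches in DataFrame order (newest release first) and includes the selected movie itself. If the recommendations should instead be ranked by similarity and exclude the selected movie, one possible approach is sketched below. The score values are invented for illustration and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up titles and one made-up similarity row for the movie at position 0.
toy_titles = pd.Series(["Minions", "Despicable Me 2", "Madam Satan", "Jumanji"])
toy_scores = np.array([1.0, 0.42, 0.07, 0.0])
query_position = 0

# Positions sorted from highest to lowest score.
order = np.argsort(toy_scores)[::-1]

# Drop the selected movie itself, then keep the top 2.
order = order[order != query_position][:2]

recommendations = pd.DataFrame({
    "title": toy_titles.iloc[order].to_numpy(),
    "score": toy_scores[order],
})
print(recommendations)
```

Keeping the score next to the title also makes it easy to spot weak matches whose similarity is close to zero.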


---
# Preparation of Execution Script
---

In [20]:
def calculate_cosine_sim(dataframe):
    # Build the TF-IDF matrix from the overviews and return the pairwise cosine similarities
    tfidf = TfidfVectorizer(stop_words='english')
    dataframe['overview'] = dataframe['overview'].fillna('')
    tfidf_matrix = tfidf.fit_transform(dataframe['overview'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return cosine_sim

def content_based_recommender(title, cosine_sim, dataframe):
    # Assumes dataframe still has its default positional index (as loaded by pd.read_csv)
    # Build a title -> row position lookup
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    # If a title occurs more than once, keep its last occurrence
    indices = indices[~indices.index.duplicated(keep='last')]
    # Look up the row position of the requested title
    movie_index = indices[title]
    # Collect the similarity scores of that movie against all others
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # Take the 10 most similar movies, excluding the movie itself
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe['title'].iloc[movie_indices]
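
A possible way to call the two functions is sketched below on a made-up four-row DataFrame. The titles, overviews and the names `toy_df` and `recs` are illustrative assumptions; the functions are repeated in condensed form only so that the block runs on its own. On the real data, `toy_df` would be the DataFrame returned by `pd.read_csv` with its default index.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_sim(dataframe):
    # Same logic as the function above, repeated so this block is self-contained.
    tfidf = TfidfVectorizer(stop_words="english")
    dataframe["overview"] = dataframe["overview"].fillna("")
    tfidf_matrix = tfidf.fit_transform(dataframe["overview"])
    return cosine_similarity(tfidf_matrix, tfidf_matrix)

def content_based_recommender(title, cosine_sim, dataframe):
    # Same logic as the function above, repeated so this block is self-contained.
    indices = pd.Series(dataframe.index, index=dataframe["title"])
    indices = indices[~indices.index.duplicated(keep="last")]
    movie_index = indices[title]
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe["title"].iloc[movie_indices]

# Made-up movies for illustration; the last one has no overview.
toy_df = pd.DataFrame({
    "title": ["Space Cadets", "Lone Ranger of Mars", "Fishing Feud", "Untitled"],
    "overview": [
        "A space ranger defends a distant planet from invaders",
        "A lonely ranger explores a distant planet in space",
        "Two old neighbours argue about fishing on a frozen lake",
        None,
    ],
})

cosine_sim = calculate_cosine_sim(toy_df)
recs = content_based_recommender("Space Cadets", cosine_sim, toy_df)
print(recs)
```

Because the toy table has fewer than eleven rows, the result simply contains every other movie, with the closest match first. Movies that share no words with the selected one have a score of zero, and their relative order is not meaningful.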

---
## I hope you found the analysis insightful and informative!

## Your feedback is greatly appreciated, and I welcome any suggestions for improvement. Feel free to reach out with any questions or comments.
---