In [1]:
import numpy as np
import pandas as pd

This cell imports two essential libraries, `numpy` and `pandas`, which are commonly used for data manipulation and analysis in Python. `numpy` is used for handling numerical operations, while `pandas` simplifies data manipulation, such as reading and organizing the movie dataset.


In [2]:
movies=pd.read_csv('tmdb_5000_movies.csv')
credits=pd.read_csv('tmdb_5000_credits.csv')

This cell reads in two datasets, `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`, which contain information on movies and their associated credits (e.g., cast and crew). These datasets provide the foundation for building the recommendation system.

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


Head of the movies dataframe.

In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


Head of the `credits` dataframe.

In [5]:
movies=movies.merge(credits,on="title")

- **Purpose**: This cell merges the `movies` and `credits` dataframes into a single dataframe, combining the movie data with cast and crew information based on the shared "title" column. Merging helps centralize data for easier processing in the recommendation system.

- **Approach**: The code uses the `merge()` function from `pandas` with the parameter `on="title"`, which specifies that the merge should align rows where the "title" column matches in both dataframes. This creates a single comprehensive dataframe containing both movie details and credits data.

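- **Outcome**: After running this cell, `movies` becomes a combined dataframe with additional columns from `credits`, including cast and crew details, so that all movie information is available in one place. Because titles are not guaranteed to be unique, films that share a title are each paired with every matching `credits` row, which can add extra rows; merging on the identifier (`id` in `movies`, `movie_id` in `credits`) would avoid this.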
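
As a hedged illustration of how `merge()` behaves when the key is not unique, the sketch below uses two tiny made-up dataframes (`movies_toy` and `credits_toy` are hypothetical stand-ins, not the real CSV files) and compares a merge on `title` with a merge on the identifier columns.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "Beta".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "budget": [10, 20, 30],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "cast": ["a", "b", "c"],
})

# A non-unique key pairs every matching left row with every matching right row.
by_title = movies_toy.merge(credits_toy, on="title")
print(len(by_title))  # 5 rows: 1 for Alpha + 2 * 2 for Beta

# Merging on the identifier keeps exactly one row per film.
by_id = movies_toy.merge(credits_toy, left_on="id", right_on="movie_id")
print(len(by_id))  # 3 rows
```

Whether the identifier merge is preferable for the real dataset depends on how the two files line up, so treat this as a sketch rather than a drop-in replacement.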

In [6]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Data Preprocessing

In [7]:
# GENRE
# ID
# KEYWORDS
# TITLE
# OVERVIEW
# CAST
# CREW

movies=movies[['movie_id','title','overview','genres','keywords','cast','crew']]

- **Purpose**: This cell selects only the essential columns from the `movies` dataframe that will be used in the recommendation system. By focusing on specific columns, it reduces memory usage and simplifies the data, keeping only relevant information for generating recommendations.

- **Approach**: The code reassigns `movies` to a new subset of itself, using double square brackets `[['movie_id','title','overview','genres','keywords','cast','crew']]` to select multiple columns. These chosen columns include unique identifiers, descriptive details, and data categories that support recommendation criteria (like genre, keywords, and cast).

- **Outcome**: Running this cell updates `movies` to contain only the columns specified (`movie_id`, `title`, `overview`, `genres`, `keywords`, `cast`, and `crew`). This streamlined dataframe is easier to process and analyze in later steps of the recommendation system.

In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 263.1+ KB


In [9]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [10]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

- **Purpose**: This cell checks for any missing values in each column of the `movies` dataframe. Identifying null values early on helps prevent potential errors during data processing and recommendation generation.

- **Outcome**: Running this cell outputs the number of null values in each column of `movies`. This information highlights any data gaps that might need handling (such as filling, removing, or ignoring nulls) before proceeding with further analysis in the recommendation system.

In [11]:
movies.dropna(inplace=True)

- **Purpose**: This cell removes any rows in the `movies` dataframe that contain missing values. Handling missing data ensures a cleaner dataset, minimizing potential issues in the recommendation process.

- **Approach**: The code uses the `dropna()` function from `pandas`, which removes rows with null values. Setting `inplace=True` modifies the existing `movies` dataframe instead of returning a new one.

- **Outcome**: Running this cell drops the three rows whose `overview` is missing (the only nulls reported above), leaving 4806 complete rows. The index is not renumbered, so the dropped rows leave gaps in the index labels.
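
The null check and the drop can be tried on a small made-up dataframe. This is a minimal sketch: `toy` and its rows are hypothetical, and the `reset_index(drop=True)` call is an optional extra step that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing overview.
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "overview": ["A story.", np.nan, "Another story."],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column["overview"])  # 1

# Drop incomplete rows, then renumber the index so it has no gaps.
cleaned = toy.dropna().reset_index(drop=True)
print(len(cleaned))         # 2
print(list(cleaned.index))  # [0, 1]
```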

In [12]:
movies.duplicated().sum()

0

This cell counts fully duplicated rows in the `movies` dataframe; the result `0` means there are none.

In [13]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

- **Purpose**: This cell accesses the `genres` data of the first movie in the `movies` dataframe. Viewing this specific entry helps you understand how genre information is stored for each movie, which is useful for preprocessing steps.

- **Outcome**: Running this cell displays the raw format of the `genres` data for the first movie. This preview helps you assess any required formatting or extraction to make genre information usable in the recommendation system.

In [14]:
# Import Python's built-in ast (Abstract Syntax Trees) module.
# Its literal_eval() function safely parses strings that hold Python literals
# (here, lists of dictionaries stored as text in the dataframe) into real Python objects.
import ast

In [15]:
#helper function
def convert(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

- **Purpose**: This cell defines a helper function named `convert`, which extracts and returns a list of genre names from a string representation of a list of dictionaries. This is crucial for converting the `genres` column (which may contain string representations of lists) into a more manageable format for analysis in the recommendation system.

- **Approach**: 
  - The function takes one parameter, `obj`, which is expected to be a string formatted like a list of dictionaries.
  - It initializes an empty list `L`.
  - It uses `ast.literal_eval()` to safely parse `obj` into a Python list of dictionaries.
  - The function iterates over each dictionary in this list, appending the value associated with the 'name' key to the list `L`.
  - Finally, it returns the populated list `L`.

- **Outcome**: After defining this function, it can be called to convert the `genres` strings in the dataframe into lists of genre names. This transformation simplifies the handling of genre data, making it more accessible for analysis and subsequent recommendation processes.
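
To see the parsing step in isolation, here is a small sketch that runs an equivalent of `convert` on a shortened version of the `genres` string shown earlier. The list-comprehension form is a stylistic variation, not the notebook's exact code.

```python
import ast

def convert(obj):
    """Return the 'name' value of every dictionary in a stringified list."""
    return [item["name"] for item in ast.literal_eval(obj)]

# Shortened copy of the raw genres value displayed for the first movie.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

names = convert(sample)
print(names)  # ['Action', 'Adventure']
```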

In [16]:
movies['genres']=movies['genres'].apply(convert)

- **Purpose**: This cell applies the `convert` function to the `genres` column of the `movies` dataframe. The goal is to transform the string representations of genre lists into actual lists of genre names, facilitating easier manipulation and analysis.

- **Approach**: The code uses the `apply()` method from `pandas`, which allows you to apply a function along a specified axis of the dataframe. Here, `convert` is applied to each element in the `genres` column. The resulting lists replace the original string representations in the `genres` column.

- **Outcome**: After running this cell, the `genres` column in the `movies` dataframe will contain lists of genre names instead of strings. This format makes it easier to work with genres when developing the recommendation system, such as filtering, matching, or analyzing movie genres based on user preferences.

In [17]:
movies['keywords']=movies['keywords'].apply(convert)

- **Purpose**: This cell applies the `convert` function to the `keywords` column of the `movies` dataframe. The aim is to transform the string representations of keyword lists into actual lists of keywords, similar to the previous transformation for genres.

- **Approach**: The code uses the `apply()` method from `pandas`, which enables the execution of the `convert` function on each entry in the `keywords` column. The transformed lists replace the original string representations within that column.

- **Outcome**: After running this cell, the `keywords` column in the `movies` dataframe will contain lists of keywords instead of strings. This modification enhances data usability, making it easier to filter, analyze, and utilize keywords when generating movie recommendations.

In [18]:
#helper function
def convert3(obj):
    L=[]
    counter=0
    for i in ast.literal_eval(obj):
        if counter!=3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L

- **Purpose**: This cell defines a helper function named `convert3`, which extracts at most the first three names from a string representation of a list of dictionaries. It is used to limit how many cast members are kept for each movie.

- **Approach**: 
  - The function takes one parameter, `obj`, which is expected to be a string formatted like a list of dictionaries.
  - It initializes an empty list `L` and a `counter` variable set to `0`.
  - The function uses `ast.literal_eval()` to safely convert the string `obj` into a Python list of dictionaries.
  - It iterates over this list, checking the `counter` variable. For each dictionary:
    - If `counter` is not equal to `3`, it appends the value associated with the 'name' key to the list `L` and increments the `counter`.
    - If `counter` reaches `3`, the loop breaks, stopping further additions to the list.
  - Finally, it returns the populated list `L`.

- **Outcome**: After defining this function, it can be used to keep only the first three names from a column such as `cast`. This gives a more concise representation of each movie, limited to the entries the notebook treats as the most prominent cast members.
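
The counter-and-break logic can also be expressed with list slicing. The sketch below is an alternative formulation under that assumption; the `limit` parameter and the sample string are hypothetical and do not appear in the notebook.

```python
import ast

def convert3(obj, limit=3):
    """Return the 'name' of at most `limit` leading entries (limit is an assumed parameter)."""
    return [item["name"] for item in ast.literal_eval(obj)[:limit]]

# Hypothetical cast string with four entries.
sample_cast = (
    '[{"cast_id": 1, "name": "A"}, {"cast_id": 2, "name": "B"}, '
    '{"cast_id": 3, "name": "C"}, {"cast_id": 4, "name": "D"}]'
)

top_three = convert3(sample_cast)
print(top_three)  # ['A', 'B', 'C']
```

Slicing also handles lists with fewer than three entries without extra checks, since a slice never raises an error for a short list.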

In [19]:
movies['cast']=movies['cast'].apply(convert3)

- **Purpose**: This cell applies the `convert3` function to the `cast` column of the `movies` dataframe. The goal is to extract up to three cast member names from the string representations of lists of dictionaries, providing a more manageable view of the cast information for each movie.

- **Outcome**: After running this cell, the `cast` column in the `movies` dataframe will contain lists of up to three cast member names instead of strings. This streamlined format makes it easier to analyze and utilize cast information when generating recommendations, focusing on the most prominent actors for each movie.

In [20]:
#helper function
def fetch_director(obj):
    L=[]
    counter=0
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
            break
    return L

- **Purpose**: This cell defines a helper function named `fetch_director`, which extracts the name of the director from a string representation of a list of dictionaries. This is important for identifying the key creative figure behind each movie, which can be a significant factor in recommendations.

- **Approach**: 
  - The function takes one parameter, `obj`, which is expected to be a string formatted like a list of dictionaries.
  - It initializes an empty list `L` and a `counter` variable (though the counter isn't actually utilized in this function).
  - The function uses `ast.literal_eval()` to safely convert the string `obj` into a Python list of dictionaries.
  - It iterates over this list, checking each dictionary:
    - If the value associated with the 'job' key is 'Director', it appends the corresponding 'name' value to the list `L` and breaks the loop to stop further iterations (since only one director is needed).
  - Finally, it returns the populated list `L`, which will either contain the director's name or be empty if no director is found.

- **Outcome**: After defining this function, it can be used to extract the director's name from the `crew` column of the `movies` dataframe. This lets the recommendation system use the director as one of the attributes it compares between movies.
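
The behaviour described above, including the empty-list case, can be checked on made-up crew strings. This sketch restates the function without the unused counter; the names `Person A`, `Person B`, and `Person C` are placeholders.

```python
import ast

def fetch_director(obj):
    """Return a one-element list with the first director's name, or an empty list."""
    for member in ast.literal_eval(obj):
        if member["job"] == "Director":
            return [member["name"]]
    return []

# Hypothetical crew strings.
sample_crew = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)
no_director_crew = '[{"job": "Editor", "name": "Person A"}]'

first_director = fetch_director(sample_crew)
missing_director = fetch_director(no_director_crew)
print(first_director)    # ['Person B']
print(missing_director)  # []
```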

In [21]:
movies['crew']=movies['crew'].apply(fetch_director)


- **Purpose**: This cell processes the `crew` column of the `movies` dataframe to extract the director of each movie, replacing the full crew listing with the one field the recommendation system uses.

- **Approach**:
  - The code applies the `fetch_director` helper function to each element in the `movies['crew']` column using the `apply()` method.
  - The `fetch_director` function iterates over the crew members (a string that represents a list of dictionaries) and stops at the first entry whose job is "Director."
  - For each movie, it returns that director's name wrapped in a list, or an empty list if no director is listed.

- **Outcome**: After executing this cell, each entry in `movies['crew']` is a list holding at most one director name instead of the original nested structure. Any co-directors after the first are dropped. Keeping the value as a list allows it to be concatenated with the other list columns later.

In [22]:
movies['overview']=movies['overview'].apply(lambda x:x.split())


- **Purpose**: This cell processes the `overview` column of the `movies` DataFrame to transform the movie overviews into a list of words. The goal is to prepare the overview text for further analysis, such as text vectorization or feature extraction.

- **Approach**:
  - The code uses the `apply()` method with a lambda function on the `movies['overview']` column.
  - The lambda function takes each overview string `x` and applies the `split()` method, which divides the string into a list of words based on whitespace. This effectively converts each movie's overview into a list of individual words.

- **Outcome**: After executing this cell, each entry in `movies['overview']` is a list of words rather than a single string. Punctuation stays attached to the neighbouring word (for example `century,`), because `split()` only breaks on whitespace. Storing the overview as a list allows it to be concatenated with the other list columns when the combined text field is built.

In [23]:
movies['genres']=movies['genres'].apply(lambda x:[i.replace(" ","")for i in x])
movies['keywords']=movies['keywords'].apply(lambda x:[i.replace(" ","")for i in x])
movies['cast']=movies['cast'].apply(lambda x:[i.replace(" ","")for i in x])
movies['crew']=movies['crew'].apply(lambda x:[i.replace(" ","")for i in x])

- **Purpose**: This cell removes the spaces inside genre names, keywords, cast member names, and director names in the respective columns of the `movies` dataframe. Each multi-word name then becomes a single token (for example, `Sam Worthington` becomes `SamWorthington`), so that later word-level processing does not treat the `Sam` in one name as the same word as the `Sam` in another.

- **Approach**: The code applies a `lambda` function to each of the specified columns (`genres`, `keywords`, `cast`, and `crew`) using the `apply()` method from `pandas`. The `lambda` function iterates through each list in the column, applying the `replace(" ", "")` method to each element `i` to remove spaces.

- **Outcome**: After running this cell, every entry in the `genres`, `keywords`, `cast`, and `crew` lists is a single space-free token such as `ScienceFiction` or `JamesCameron`. This prevents unrelated names that share a first or last name from counting as a match when similarity is computed.
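
A short plain-Python sketch shows why the spaces matter once text is split into words. The helper `remove_spaces` is a hypothetical name introduced only for this illustration; the two cast lists reuse names that appear in the notebook output.

```python
def remove_spaces(names):
    """Strip spaces inside each name so that it becomes a single token."""
    return [name.replace(" ", "") for name in names]

cast_a = ["Sam Worthington", "Zoe Saldana"]
crew_b = ["Sam Mendes"]

# With spaces kept, word-level tokens overlap on the shared first name.
tokens_a = set(" ".join(cast_a).lower().split())
tokens_b = set(" ".join(crew_b).lower().split())
shared_before = tokens_a & tokens_b
print(shared_before)  # {'sam'}

# With spaces removed, each person is one token and nothing overlaps.
merged_a = set(" ".join(remove_spaces(cast_a)).lower().split())
merged_b = set(" ".join(remove_spaces(crew_b)).lower().split())
shared_after = merged_a & merged_b
print(shared_after)  # set()
```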

In [24]:
movies['tag']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

- **Purpose**: This cell creates a new column named `tag` in the `movies` dataframe, which combines the `overview`, `genres`, `keywords`, `cast`, and `crew` columns. The intention is to create a comprehensive representation of each movie, consolidating all relevant information into a single column for easier access and analysis.

- **Outcome**: After running this cell, each entry in the `tag` column is one list formed by concatenating that movie's `overview`, `genres`, `keywords`, `cast`, and `crew` lists (the `+` operator joins the lists row by row). This single field holds everything the recommendation system compares between movies.
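
The row-by-row concatenation can be seen on a one-row dataframe. This is a minimal sketch with made-up values (`toy`, `PersonA`, and `PersonB` are placeholders), using only three of the five columns to keep it short.

```python
import pandas as pd

# Hypothetical single movie whose columns already hold lists.
toy = pd.DataFrame({
    "overview": [["A", "short", "plot"]],
    "genres": [["Action"]],
    "cast": [["PersonA", "PersonB"]],
})

# Adding list-valued columns concatenates the lists within each row.
toy["tag"] = toy["overview"] + toy["genres"] + toy["cast"]
print(toy["tag"][0])
# ['A', 'short', 'plot', 'Action', 'PersonA', 'PersonB']
```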

In [25]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tag
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [26]:
new_df=movies[['movie_id','title','tag']]

- **Purpose**: This cell creates a new dataframe named `new_df` that includes only the `movie_id`, `title`, and `tag` columns from the `movies` dataframe. This step is essential for focusing on the most relevant features needed for the recommendation process.

- **Outcome**: After running this cell, `new_df` contains three columns: `movie_id`, `title`, and `tag`. Because `new_df` is a column selection taken from `movies`, pandas may warn (`SettingWithCopyWarning`) when its columns are assigned to in later cells; creating it with `.copy()` makes it an independent dataframe and avoids the warning.
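
The later cells that assign to `new_df['tag']` print a pandas warning about setting a value on a copy of a slice. One common way to avoid it is an explicit `.copy()`, sketched here on made-up data (`toy` and `subset` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Alpha", "Beta"],
    "tag": [["a", "B"], ["c"]],
    "extra": [0, 0],
})

# .copy() makes the selection an independent dataframe.
subset = toy[["movie_id", "title", "tag"]].copy()

# Assigning to a column of the copy no longer refers back to `toy`.
subset["tag"] = subset["tag"].apply(lambda words: " ".join(words))

print(subset["tag"][0])  # a B
print(toy["tag"][0])     # ['a', 'B']  (the original is unchanged)
```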

In [27]:
new_df['tag']=new_df['tag'].apply(lambda x:" ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tag']=new_df['tag'].apply(lambda x:" ".join(x))


- **Purpose**: This cell transforms the `tag` column in the `new_df` dataframe by joining the lists of words (created in previous steps) into single strings. This conversion is essential for preparing the data for text-based analysis or similarity comparisons, which typically require string inputs.

- **Approach**: The code uses the `apply()` method from `pandas`, along with a `lambda` function. The `lambda` function takes each list in the `tag` column (`x`) and joins the elements into a single string using the `join()` method, with a space (`" "`) as the separator. This effectively concatenates the words in each list into a single, space-separated string.

- **Outcome**: After running this cell, the `tag` column in the `new_df` dataframe will contain strings instead of lists. Each string will consist of the combined `overview`, `genres`, `keywords`, `cast`, and `crew` information for each movie. This format is much more suitable for text processing tasks, such as applying natural language processing techniques or calculating similarity scores for recommendation algorithms.

In [28]:
new_df['tag']=new_df['tag'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tag']=new_df['tag'].apply(lambda x:x.lower())


- **Purpose**: This cell transforms the `tag` column in the `new_df` dataframe by converting all the strings in that column to lowercase. This normalization step is important for ensuring consistency in text data, which helps improve the accuracy of text-based analyses and comparisons.

- **Approach**: The code uses the `apply()` method from `pandas`, combined with a `lambda` function. The `lambda` function takes each string in the `tag` column (`x`) and converts it to lowercase using the `lower()` method.

- **Outcome**: After running this cell, the `tag` column in the `new_df` dataframe will contain all lowercase strings. This uniformity makes the text data more manageable for subsequent operations, such as searching or calculating similarity scores, as it eliminates discrepancies caused by case sensitivity. Consequently, this prepares the data for more effective analysis in the recommendation system.
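
The join and lowercase steps can be reproduced on a small made-up series. This sketch uses `apply(" ".join)` for the join and the `.str.lower()` accessor for the lowercase step, which is an assumed alternative to the `lambda` shown in the cell.

```python
import pandas as pd

# Hypothetical tag lists for two movies.
tags = pd.Series([["In", "the", "Future"], ["Captain", "Barbossa"]])

# Join each list into one space-separated string.
joined = tags.apply(" ".join)

# Lowercase every string through the vectorized string accessor.
lowered = joined.str.lower()

print(lowered.tolist())  # ['in the future', 'captain barbossa']
```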

In [29]:
new_df.head()

Unnamed: 0,movie_id,title,tag
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [30]:
#stemming
import nltk

- **Purpose**: This cell imports the Natural Language Toolkit (NLTK), a Python library for working with human language data. It is imported here in preparation for stemming the `tag` text.

In [31]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

- **Purpose**: This cell imports the `PorterStemmer` class from the NLTK library and creates an instance of it. The purpose of stemming is to reduce words to their root form, which helps in standardizing text data by treating different forms of a word as the same base word.

- **Approach**: 
  - The code uses the `from ... import ...` statement to specifically import the `PorterStemmer` class from the `nltk.stem.porter` module.
  - It then creates an instance of `PorterStemmer` and assigns it to the variable `ps`. This instance will be used to perform stemming operations on the text data.

- **Outcome**: After running this cell, the `ps` variable holds an instance of the `PorterStemmer`, which can be used in subsequent code cells to stem words in the `tag` column or any other text data. This will help in normalizing the text data by reducing words to their root forms, thus enhancing the effectiveness of text-based analyses, such as calculating similarities or building the recommendation system.

In [32]:
def stem(text):
    y=[]
    for i in text.split():
         y.append(ps.stem(i))

    return " ".join(y)

- **Purpose**: This cell defines a function named `stem`, which takes a string of text as input and returns a new string where each word has been reduced to its root form using stemming. This process is crucial for normalizing text data, making it more suitable for analysis in the recommendation system.

- **Approach**:
  - The function takes one parameter, `text`, which is expected to be a string.
  - It initializes an empty list `y` to store the stemmed words.
  - The function splits the input `text` into individual words using the `split()` method.
  - It iterates over each word `i` in the split text:
    - For each word, it applies the `stem()` method from the `PorterStemmer` instance (`ps`) to reduce the word to its root form and appends the stemmed word to the list `y`.
  - Finally, the function joins the stemmed words in `y` back into a single string with spaces between them using `" ".join(y)` and returns this string.

- **Outcome**: After defining this function, it can be used to process the `tag` column or any other text data, transforming each entry into a normalized format where words are reduced to their stems. This normalization helps in improving the accuracy of text analysis and recommendation algorithms by focusing on the core meanings of words rather than their variations.
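
To make the effect of stemming concrete, the sketch below applies an equivalent `stem` function to a phrase taken from the first movie's overview. It assumes NLTK is installed; the generator-expression form is a stylistic variation on the loop above.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    """Stem each whitespace-separated word and rejoin with single spaces."""
    return " ".join(ps.stem(word) for word in text.split())

stemmed = stem("following orders and protecting")
print(stemmed)  # follow order and protect
```

Stems such as `uniqu` or `becom` are not dictionary words. That is expected: the goal is only that different forms of a word map to the same string.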

In [33]:
new_df['tag']=new_df['tag'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tag']=new_df['tag'].apply(stem)


- **Purpose**: This cell applies the previously defined `stem` function to the `tag` column of the `new_df` dataframe. The goal is to normalize the text data by reducing each word in the tags to its root form through stemming, which enhances the data's usability in text analysis and recommendation processes.

- **Outcome**: After running this cell, the `tag` column in the `new_df` dataframe will contain strings where each word has been reduced to its root form. This transformation leads to a more standardized representation of the text data, making it easier to perform text analysis and improve the effectiveness of recommendation algorithms. By focusing on the stemmed words, the recommendation system can more accurately match similar movies based on the consolidated attributes in the `tag` column.

In [34]:
new_df['tag'][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

- **Purpose**: This cell retrieves and displays the first entry in the `tag` column of the `new_df` dataframe. It serves as a quick check to verify the content and format of the `tag` data after the transformations and stemming have been applied.

- **Outcome**: After running this cell, the output will be the first entry in the `tag` column, which should now contain a string of stemmed words representing the movie's overview, genres, keywords, cast, and crew. This output helps to confirm that the stemming process has been executed correctly and that the data is in the expected format, providing a glimpse into how the data looks before proceeding with further analysis or recommendation processes.

In [35]:
# Vectorization: Bag of words
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

- **Purpose**: This cell sets up the `CountVectorizer` from the `sklearn.feature_extraction.text` module, which will be used to convert the text data in the `tag` column into a numerical format. Specifically, it transforms the text into a "Bag of Words" representation, where each unique word is represented as a feature, and the count of each word in the text is recorded. This transformation is crucial for enabling machine learning algorithms to process the textual data.

- **Approach**:
  - The code imports the `CountVectorizer` class.
  - An instance of `CountVectorizer` is created and assigned to the variable `cv`. The parameters specified include:
    - `max_features=5000`: This limits the number of unique words (features) to the top 5000 most frequent words across the dataset. This parameter helps manage dimensionality, ensuring that the model focuses on the most significant features.
    - `stop_words='english'`: This parameter instructs the vectorizer to ignore common English words (stop words) that typically do not carry significant meaning, such as "the," "is," and "in." This helps improve the model's performance by reducing noise in the data.

- **Outcome**: After running this cell, the `cv` variable holds an instance of `CountVectorizer` configured to process text data with the specified parameters. This setup is now ready to transform the text data from the `tag` column into a numerical format, enabling further analysis or feeding into machine learning algorithms for the movie recommendation system.
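
The effect of the two parameters can be seen on a small scale. The sketch below uses a made-up three-sentence corpus in place of the `tag` column and a cap of 2 instead of 5000; `docs`, `toy_cv`, and `vocab` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the `tag` column (hypothetical data).
docs = [
    "the alien planet is in space",
    "the space marine battles an alien",
    "a romance in the city",
]

# Same kind of configuration as the notebook, with a tiny vocabulary cap.
toy_cv = CountVectorizer(max_features=2, stop_words="english")
toy_cv.fit(docs)

# Stop words are dropped first; then only the most frequent words survive the cap.
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
```

Only "alien" and "space" appear twice in this corpus, so they are the two words kept.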

In [36]:
vectors=cv.fit_transform(new_df['tag']).toarray()

- **Purpose**: This cell uses the previously defined `CountVectorizer` instance (`cv`) to convert the text data in the `tag` column of the `new_df` dataframe into a numerical format. The resulting output is a matrix where each row corresponds to a movie and each column corresponds to a unique word from the top 5000 features.

- **Approach**:
  - The `fit_transform()` method is called on the `cv` instance, passing the `tag` column from `new_df` as the argument. This method performs two operations:
    - **Fit**: It learns the vocabulary from the text data, identifying the top 5000 most frequent words and building a mapping of words to feature indices.
    - **Transform**: It converts the text data into a numerical format (Bag of Words representation), where each entry in the matrix represents the count of words corresponding to each feature for each movie.
  - The output of `fit_transform()` is converted to a dense array using the `toarray()` method, which creates a 2D NumPy array.


- **Outcome**: After running this cell, the variable `vectors` will contain a 2D array where each row represents a movie and each column corresponds to a specific word in the vocabulary. The values in the array indicate the count of each word in the respective movie's `tag`. This numerical representation is essential for further analysis, such as calculating similarities between movies or training machine learning models for the recommendation system.

In [37]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

- **Purpose**: This cell retrieves and displays the first row of the `vectors` array. Each row corresponds to a movie in the `new_df` dataframe, and the values in the row represent the counts of the words (features) identified by the `CountVectorizer` in that movie's `tag`. This step allows for a quick examination of the numerical representation of the text data for the first movie.

- **Outcome**: The output is a 1D NumPy array holding the word counts for the first movie's `tag`, one element per vocabulary word. The printed array is truncated (`...`) and shows only zeros because the vector is sparse: a single `tag` contains only a small fraction of the vocabulary, so most counts are 0, and any nonzero counts fall in the elided part of the display.

In [38]:
len(cv.get_feature_names_out())

5000

- **Purpose**: This cell counts the features (vocabulary words) that `CountVectorizer` learned from the `tag` column. The count equals the number of columns in `vectors`.

- **Outcome**: The output is `5000`, which matches the `max_features=5000` cap set when `cv` was created. In other words, the `tag` column contains at least 5000 distinct words after stop-word removal, and only the 5000 most frequent were kept. Each movie is therefore represented by a 5000-dimensional count vector.

### Every movie converted into a vector

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

- **Purpose**: This cell imports the `cosine_similarity` function from the `sklearn.metrics.pairwise` module. This function is essential for calculating the cosine similarity between two or more vectors, which will be used to determine how similar different movies are based on their `tag` representations.

- **Outcome**: After running this cell, the `cosine_similarity` function is available for use. It will be used to compare the count vectors of different movies. Cosine similarity measures the cosine of the angle between two vectors. In general it ranges from -1 (opposite directions) to 1 (same direction), but word-count vectors have no negative entries, so here every score falls between 0 (no shared vocabulary words) and 1 (identical word proportions). Because it depends on direction rather than magnitude, a long `tag` and a short `tag` with the same mix of words still score as similar.
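
To make the definition concrete, the sketch below computes the cosine of the angle between two made-up count vectors by hand and compares it with the library call. The vectors `a` and `b` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1, 0, 2])   # hypothetical word counts for movie A
b = np.array([1, 1, 0])   # hypothetical word counts for movie B

# cos(theta) = (a . b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2D inputs: one row per vector.
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Both values agree, and because all counts are non-negative the result lies between 0 and 1.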

In [40]:
similarity=cosine_similarity(vectors)

- **Purpose**: This cell calculates the cosine similarity matrix for all the movies based on their vectorized `tag` representations. The similarity matrix is crucial for identifying how similar each movie is to every other movie in the dataset, which is a key component of the recommendation system.

- **Outcome**: After running this cell, `similarity` is a square 2D array with one row and one column per movie, where entry `[i][j]` is the similarity score between movie `i` and movie `j`. The matrix is symmetric, and its diagonal entries are 1 because each movie's vector points in the same direction as itself (the only exception would be a movie whose vector is all zeros). Row `i` therefore holds everything needed to rank the other movies by similarity to movie `i`.
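
The structure of the similarity matrix can be checked on a toy input. This is a sketch with three made-up count vectors; `toy_vectors` and `toy_similarity` are illustrative names.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical movies described by three vocabulary words.
toy_vectors = np.array([
    [1, 0, 2],
    [1, 1, 0],
    [0, 3, 0],
])

toy_similarity = cosine_similarity(toy_vectors)

print(toy_similarity.shape)                            # one row and column per movie
print(np.allclose(toy_similarity, toy_similarity.T))   # symmetric
print(np.allclose(np.diag(toy_similarity), 1.0))       # each movie matches itself
```

The first and third toy movies share no words, so their score is 0.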

In [41]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

[(1216, 0.2867696673382022),
 (2409, 0.26901379342448517),
 (3730, 0.2605130246476754),
 (507, 0.255608593705383),
 (539, 0.25038669783359574)]

- **Purpose**: This cell retrieves and ranks the top five most similar movies to the first movie in the dataset based on the cosine similarity scores calculated in the previous step. This is a critical step in generating recommendations for a specific movie.

- **Approach**:
  - The code uses `enumerate()` to pair each index (representing a movie) with its corresponding similarity score from the first row of the `similarity` matrix (`similarity[0]`), which contains the similarity scores between the first movie and all other movies.
  - The `list()` function converts the enumerated object into a list of tuples, where each tuple consists of a movie index and its similarity score.
  - The `sorted()` function is then applied to this list, with the following parameters:
    - `reverse=True`: This sorts the list in descending order, so that higher similarity scores appear first.
    - `key=lambda x: x[1]`: This specifies that the sorting should be based on the second element of each tuple (the similarity score).
  - The slice `[1:6]` is used to exclude the first element (the first movie itself, which will have a similarity score of 1) and select the next five most similar movies.

- **Outcome**: After running this cell, the output will be a list of tuples, where each tuple contains an index and the corresponding similarity score for the top five movies that are most similar to the first movie. This information can be used to recommend these similar movies to users, forming a key aspect of the recommendation functionality within the system.
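
An alternative to sorting a list of tuples is `np.argsort`, which returns the indices in ranked order directly. The sketch below is not the notebook's method; it uses a made-up row of scores and removes the query movie by index rather than relying on it sorting first.

```python
import numpy as np

scores = np.array([1.0, 0.10, 0.45, 0.30, 0.05])  # hypothetical row of `similarity`
query = 0                                          # index of the movie itself

order = np.argsort(-scores)        # indices from highest to lowest score
order = order[order != query]      # drop the query movie explicitly
top = [(int(i), float(scores[i])) for i in order[:3]]
print(top)
```

Dropping the query by index avoids returning it by accident when another movie ties with it at a score of 1.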

In [42]:
def recommend(movie):
    movie_index=new_df[new_df['title']==movie].index[0]
    distances=similarity[movie_index]
    movies_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]
    for i in movies_list:
        print(new_df.iloc[i[0]].title)

- **Purpose**: This cell defines a function named `recommend` that takes a movie title as input and recommends the top five movies that are most similar to the given movie based on the cosine similarity scores. This function encapsulates the recommendation logic for easy reuse.

- **Approach**:
  - The function takes one parameter, `movie`, which is the title of the movie for which recommendations are to be generated.
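  - It looks up the input movie in `new_df` with the condition `new_df['title'] == movie` and takes `index[0]`, the index label of the first matching row (relevant if a title appears more than once). If no row matches, `index[0]` raises an `IndexError`, so the title must match the dataset exactly, including capitalization. The label is then used as a row position in `similarity`, which assumes `new_df` has a default 0-based index with no gaps.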
  - The similarity scores for the specified movie are extracted from the `similarity` matrix using the retrieved index (`similarity[movie_index]`).
  - The similarity scores are then enumerated and sorted in descending order to rank the movies based on similarity. The same sorting method used previously is applied, excluding the first element (the movie itself) and selecting the next five most similar movies.
  - Finally, the function iterates over the sorted list of similar movies (`movies_list`) and prints the titles of the recommended movies. The `iloc` method is used to access the movie titles based on their indices.

- **Outcome**: After defining this function, calling `recommend(movie_name)` with a title from the dataset prints the titles of the five movies most similar to it. The suggestions are content-based: they depend only on the overlap between the movies' `tag` text, not on any user's ratings or viewing history. This function is the core of the movie recommendation system.
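
One possible variant, shown here only as a sketch, returns the titles instead of printing them and returns an empty list for an unknown title. The function name `recommend_titles`, its parameters, and the toy data are assumptions made for illustration; it works with row positions, so it does not depend on the DataFrame's index labels.

```python
import numpy as np
import pandas as pd

def recommend_titles(movie, df, sim, k=5):
    """Return up to k titles most similar to `movie`, or [] if it is unknown."""
    matches = np.flatnonzero((df["title"] == movie).to_numpy())  # row positions
    if matches.size == 0:
        return []
    pos = matches[0]
    order = np.argsort(-sim[pos])       # highest similarity first
    order = order[order != pos][:k]     # exclude the movie itself
    return df["title"].iloc[order].tolist()

# Toy data standing in for `new_df` and `similarity`.
toy_df = pd.DataFrame({"title": ["A", "B", "C"]})
toy_sim = np.array([[1.0, 0.2, 0.7],
                    [0.2, 1.0, 0.1],
                    [0.7, 0.1, 1.0]])

print(recommend_titles("A", toy_df, toy_sim, k=2))
print(recommend_titles("Missing", toy_df, toy_sim))
```

Returning a list makes the function easier to reuse, for example from a user interface that needs to display the results itself.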

In [43]:
recommend('The Shawshank Redemption')

Prison
Penitentiary
Buffalo '66
The Truman Show
Flying By


- **Purpose**: This cell calls the previously defined `recommend` function, passing in the title "The Shawshank Redemption" as the input. The goal is to generate and display the top five movie recommendations that are most similar to "The Shawshank Redemption" based on the cosine similarity scores.

- **Approach**: 
  - The function call `recommend('The Shawshank Redemption')` triggers the execution of the `recommend` function with the specified movie title.
  - Inside the `recommend` function, the following steps occur:
    - The index of "The Shawshank Redemption" is retrieved from the `new_df` dataframe.
    - The similarity scores for this movie are extracted from the `similarity` matrix.
    - These scores are then sorted to identify the top five most similar movies, excluding "The Shawshank Redemption" itself.
    - The titles of these recommended movies are printed to the console.

- **Outcome**: The output lists the five movies whose `tag` vectors are most similar to that of "The Shawshank Redemption," printed in descending order of similarity. Titles such as "Prison" and "Penitentiary" suggest that the match is driven largely by shared prison-related words in the tags. The suggestions are based on content overlap with the input movie rather than on any individual user's preferences.

In [44]:
recommend('The Dark Knight')

The Dark Knight Rises
Batman Begins
Batman Returns
Batman Forever
Batman


- **Purpose**: This cell calls `recommend` again, this time with "The Dark Knight". The function follows the same steps as in the previous cell: look up the movie's index in `new_df`, read its row of the `similarity` matrix, sort the scores, and print the five most similar titles, excluding "The Dark Knight" itself.

- **Outcome**: All five results are other Batman films, printed in descending order of similarity. Movies in the same franchise tend to share keywords and cast or crew names, so their `tag` vectors point in similar directions.

In [45]:
recommend('The Avengers')

Iron Man 3
Avengers: Age of Ultron
Captain America: Civil War
Captain America: The First Avenger
Iron Man


In [46]:
import pickle 

- **Purpose**: This cell imports the `pickle` module, which is a standard Python library used for serializing and deserializing Python object structures. The goal of importing `pickle` is to enable the saving and loading of complex data types (like models or datasets) to and from files, which can be useful for persisting state or sharing data between sessions.

- **Outcome**: After executing this cell, you will have access to the functionality provided by the `pickle` module, allowing you to serialize Python objects (e.g., trained machine learning models, DataFrames, or lists) for later use. This is particularly useful for saving the state of your application or for sharing data with others without needing to reprocess it each time you run your code. For example, after training a model, you can save it using `pickle` and later load it back without needing to retrain.

In [47]:
pickle.dump(new_df,open('movies_dict.pkl','wb'))


- **Purpose**: This cell saves the `new_df` DataFrame to a file named `movies_dict.pkl`. Despite the `_dict` in the file name, the object written is the DataFrame itself, not a dictionary. Serializing it allows the processed data to be stored and later retrieved without rerunning the preprocessing steps.

- **Approach**:
  - The code uses the `pickle.dump()` function, which is part of the `pickle` module imported earlier.
  - The first argument to `pickle.dump()` is the object to be saved, in this case, `new_df`, which is the DataFrame containing processed movie data.
  - The second argument is an open file object created with `open('movies_dict.pkl', 'wb')`, where:
    - `'movies_dict.pkl'` is the name of the file where the DataFrame will be saved.
    - `'wb'` indicates that the file should be opened in write-binary mode, which is necessary for writing serialized data.
  
- **Outcome**: After executing this cell, the `new_df` DataFrame will be saved as a binary file named `movies_dict.pkl` in the current working directory. This serialized file can later be loaded back into a Python program using `pickle.load()`, restoring the original DataFrame for further analysis or use without needing to reload or process the data from scratch. This is particularly useful for saving intermediate results in a data processing pipeline or for deploying models in a production environment.

In [48]:
new_df['title'].values

array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
       ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
       'My Date with Drew'], dtype=object)


- **Purpose**: This cell retrieves the titles of the movies from the `new_df` DataFrame as a NumPy array. The goal is to obtain a simple, iterable representation of the movie titles for use in further processing or analysis.

- **Approach**:
  - The code accesses the `title` column of the `new_df` DataFrame using `new_df['title']`.
  - The `values` attribute of this Series returns the underlying data as a NumPy array. Because the titles are strings, the array has `dtype=object`.

- **Outcome**: The cell displays a NumPy array containing all the movie titles from `new_df`; the result is not assigned to a variable. Such an array can be used to populate a user interface or to match user input against the available titles before calling `recommend`.
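
Matching user input against the title array could look like the sketch below. The helper `find_title` is hypothetical and not part of the notebook; it tries a case-insensitive exact match first and then falls back to the standard library's `difflib.get_close_matches`.

```python
import difflib
import numpy as np

# Sample titles standing in for `new_df['title'].values`.
titles = np.array(["Avatar", "Spectre", "The Dark Knight"], dtype=object)

def find_title(user_input, titles):
    """Return the stored title matching user_input (ignoring case), the closest one, or None."""
    lowered = {t.lower(): t for t in titles}
    key = user_input.strip().lower()
    if key in lowered:
        return lowered[key]
    close = difflib.get_close_matches(key, list(lowered), n=1)
    return lowered[close[0]] if close else None

print(find_title("the dark knight", titles))
print(find_title("Avatr", titles))
```

The returned title has the exact spelling stored in the dataset, so it can be passed to `recommend` without triggering the lookup failure that an inexact title would cause.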

In [49]:
pickle.dump(new_df,open('movie_list.pkl','wb'))
pickle.dump(similarity,open('similarity.pkl','wb'))


- **Purpose**: This cell saves two important components of the movie recommendation system: the `new_df` DataFrame and the `similarity` matrix. The goal is to serialize both objects, allowing for efficient storage and easy retrieval in future sessions without needing to recompute or reload the data.

- **Approach**:
  - The first line uses `pickle.dump()` to save the `new_df` DataFrame:
    - `new_df` is the DataFrame that contains processed movie data, including titles, tags, and other features.
    - The file is opened in write-binary mode (`'wb'`) as `movie_list.pkl`, where this file will store the serialized DataFrame.
  
  - The second line uses `pickle.dump()` to save the `similarity` matrix:
    - `similarity` is a matrix that contains cosine similarity scores between movies, which is crucial for the recommendation process.
    - This file is opened in write-binary mode (`'wb'`) as `similarity.pkl`, where this file will store the serialized similarity matrix.

- **Outcome**: After executing this cell, two binary files will be created in the current working directory:
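  - `movie_list.pkl`: Contains the serialized `new_df` DataFrame, allowing it to be loaded back into memory later. This is the same DataFrame that cell 47 wrote to `movies_dict.pkl`, so the two files hold identical content.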
  - `similarity.pkl`: Contains the serialized `similarity` matrix, enabling quick access to the similarity data for recommendations.

By saving these objects, you ensure that the current state of the movie recommendation system can be restored quickly in the future, facilitating smoother operation, quicker testing, and enhanced usability of the system without the need for reprocessing the initial data or recalculating similarity scores.
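
A round trip through `pickle` can be sketched as follows. The toy objects and the temporary directory are assumptions made so the example is self-contained; the `with` blocks are a stylistic choice that closes each file automatically, which the notebook's one-line `open(...)` calls leave to the interpreter.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy objects standing in for `new_df` and `similarity`.
toy_df = pd.DataFrame({"title": ["A", "B"], "tag": ["x y", "y z"]})
toy_sim = np.array([[1.0, 0.5], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    df_path = os.path.join(tmp, "movie_list.pkl")
    sim_path = os.path.join(tmp, "similarity.pkl")

    # Write both objects; each file is closed when its block ends.
    with open(df_path, "wb") as f:
        pickle.dump(toy_df, f)
    with open(sim_path, "wb") as f:
        pickle.dump(toy_sim, f)

    # Read them back, as a later session or separate script would.
    with open(df_path, "rb") as f:
        loaded_df = pickle.load(f)
    with open(sim_path, "rb") as f:
        loaded_sim = pickle.load(f)

print(loaded_df.equals(toy_df), np.array_equal(loaded_sim, toy_sim))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.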

# The END!!