# 🎬 Movie Recommendation System (Content-Based)

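This notebook builds a content-based movie recommender by combining **recommender-system** and **text mining / NLP** techniques: each movie's title, genres, and user tags are turned into a TF-IDF vector, and movies are ranked by cosine similarity to the one you pick.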


---

### 📚 What You’ll Learn
- Text preprocessing and feature extraction using **TF-IDF**
- Building a **content-based recommender system**
- Computing **cosine similarity** between movie features
- Interpreting and extending recommender systems for practical use

---


In [None]:
import os
import pandas as pd

def load_data():
    # Both CSV files are expected in the current working directory.
    movies_path = "movies.csv"
    tags_path = "tags.csv"

    if not os.path.exists(movies_path):
        raise FileNotFoundError("movies.csv not found! Download and unzip the dataset first.")

    movies = pd.read_csv(movies_path)
    # tags.csv is optional: fall back to an empty frame with the expected columns.
    if os.path.exists(tags_path):
        tags = pd.read_csv(tags_path)
    else:
        tags = pd.DataFrame(columns=["userId", "movieId", "tag", "timestamp"])
    return movies, tags

movies, tags = load_data()
print("Movies loaded:", movies.shape)
print("Tags loaded:", tags.shape)
movies.head()


Movies loaded: (9742, 3)
Tags loaded: (3683, 4)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
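def build_content_field(movies, tags):
    movies = movies.copy()  # avoid modifying the caller's DataFrame in place
    # Genres are pipe-separated; turn them into space-separated tokens.
    movies['genres_clean'] = movies['genres'].fillna('').apply(lambda g: g.replace('|', ' '))
    if not tags.empty:
        # Concatenate all user tags for each movie into a single string.
        tags_grouped = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(map(str, x))).reset_index()
        movies = movies.merge(tags_grouped, on='movieId', how='left')
        movies['tag'] = movies['tag'].fillna('')
    else:
        movies['tag'] = ''
    movies['content'] = movies['title'].fillna('') + ' ' + movies['genres_clean'] + ' ' + movies['tag']
    # Keep letters, digits, and spaces; collapse whitespace; lowercase.
    # Raw strings (r'...') avoid the invalid-escape warning caused by '\s+'.
    movies['content'] = (
        movies['content']
        .str.replace(r'[^0-9a-zA-Z ]', ' ', regex=True)
        .str.replace(r'\s+', ' ', regex=True)
        .str.strip()
        .str.lower()
    )
    return movies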

movies = build_content_field(movies, tags)
movies[['title','genres','content']].head()




Unnamed: 0,title,genres,content
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,toy story 1995 adventure animation children co...
1,Jumanji (1995),Adventure|Children|Fantasy,jumanji 1995 adventure children fantasy fantas...
2,Grumpier Old Men (1995),Comedy|Romance,grumpier old men 1995 comedy romance moldy old
3,Waiting to Exhale (1995),Comedy|Drama|Romance,waiting to exhale 1995 comedy drama romance
4,Father of the Bride Part II (1995),Comedy,father of the bride part ii 1995 comedy pregna...


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_tfidf(movies, max_features=5000, ngram_range=(1,2)):
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(movies['content'])
    return vectorizer, tfidf_matrix

vectorizer, tfidf_matrix = fit_tfidf(movies)
print("TF-IDF matrix shape:", tfidf_matrix.shape)


TF-IDF matrix shape: (9742, 5000)


In [None]:
from sklearn.metrics.pairwise import linear_kernel

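def get_recommendations(title, movies, tfidf_matrix, top_k=10):
    # Try an exact (case-insensitive) title match first, then fall back to a substring match.
    titles = movies['title'].str.lower()
    matches = movies[titles == title.lower()]
    if matches.empty:
        # regex=False treats the query literally, so characters such as "(" or "+" cannot raise regex errors.
        matches = movies[titles.str.contains(title.lower(), regex=False, na=False)]
        if matches.empty:
            raise ValueError(f"No movie found for '{title}'")
    # Convert the index label to a row position so it lines up with the TF-IDF matrix rows
    # (assumes the index labels are unique).
    idx = movies.index.get_loc(matches.index[0])
    # TfidfVectorizer L2-normalizes rows by default, so this dot product is the cosine similarity.
    cosine_sim = linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()
    sim_indices = cosine_sim.argsort()[::-1]
    sim_indices = sim_indices[sim_indices != idx]  # drop the query movie itself
    top_indices = sim_indices[:top_k]
    return movies.iloc[top_indices][['title','genres']].reset_index(drop=True)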

example_title = "Toy Story"
recs = get_recommendations(example_title, movies, tfidf_matrix, top_k=5)
print(f"🎬 Recommendations for '{example_title}':")
recs


🎬 Recommendations for 'Toy Story':


Unnamed: 0,title,genres
0,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
1,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
2,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy
3,Balto (1995),Adventure|Animation|Children
4,Antz (1998),Adventure|Animation|Children|Comedy|Fantasy


### Explanation
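1. **TF-IDF Vectorization:**  
   Converts each movie’s content (title + genres + tags) into a sparse numerical vector in which terms that are rare across the catalogue receive higher weights than common ones.
2. **Cosine Similarity:**  
   Measures how similar two movies’ text vectors are by the cosine of the angle between them. `TfidfVectorizer` L2-normalizes each row by default, so the dot product computed by `linear_kernel` equals the cosine similarity.
3. **Recommendation:**  
   When a user enters a movie title, the system looks it up (exact match first, then substring match) and returns the `top_k` other movies whose vectors are most similar.
4. **Customization:**  
   - Raise `max_features` (5000 here) to keep more vocabulary terms, at the cost of a wider matrix.  
   - `ngram_range=(1,2)`, as used here, adds bigrams so that multi-word patterns like "science fiction" become features of their own.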
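
The steps above can be tried on a tiny corpus before running them on the full dataset. The following is a minimal sketch, not part of the original notebook: the three `toy_content` strings and the names `toy_vectorizer`, `toy_matrix`, `ranking`, and `has_bigram` are made up for illustration. It assumes scikit-learn's default `norm='l2'` in `TfidfVectorizer`, which is what makes the dot product and the cosine similarity coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical "content" strings standing in for movies['content'].
toy_content = [
    "toy story 1995 adventure animation children comedy fantasy pixar",
    "toy story 2 1999 adventure animation children comedy fantasy pixar",
    "grumpier old men 1995 comedy romance",
]

# Step 1: turn each string into an L2-normalized TF-IDF row vector.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_content)

# Step 2: with unit-length rows, the dot product equals the cosine similarity.
dot_scores = linear_kernel(toy_matrix[0], toy_matrix).flatten()
cos_scores = cosine_similarity(toy_matrix[0], toy_matrix).flatten()
same_scores = np.allclose(dot_scores, cos_scores)

# Step 3: rank the other items by similarity to item 0, most similar first.
ranking = [int(i) for i in dot_scores.argsort()[::-1] if i != 0]

# Customization: ngram_range=(1, 2) adds two-word features such as "toy story".
has_bigram = "toy story" in toy_vectorizer.vocabulary_

print("dot product matches cosine similarity:", same_scores)
print("ranking of the other items:", ranking)
print("'toy story' is a feature:", has_bigram)
```

Setting `ngram_range=(1, 1)` in this sketch removes the two-word features, which is a quick way to see how the bigram setting changes the vocabulary.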


In [None]:
# Try it with any movie title you like
your_movie = input("Enter a movie title: ")
try:
    recommendations = get_recommendations(your_movie, movies, tfidf_matrix, top_k=5)
    print(f"Top recommendations for '{your_movie}':")
    display(recommendations)
except ValueError as e:
    # get_recommendations raises ValueError when no title matches.
    print(e)

Enter a movie title: Jumanji
Top recommendations for 'Jumanji':


Unnamed: 0,title,genres
0,"Indian in the Cupboard, The (1995)",Adventure|Children|Fantasy
1,Casper (1995),Adventure|Children
2,Robin Williams: Live on Broadway (2002),Comedy
3,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
4,Tall Tale (1995),Adventure|Children|Fantasy|Western
