<h1 align="center">Movies Recommendation System</h1>

[![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/anmmashud/movies_recommendation_system)

[![Web Scraper](https://img.shields.io/badge/Web%20Scraper-Kaggle-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/code/anmmashud/tmdb-web-scrapping-using-api)

[![Live Preview](https://img.shields.io/badge/Live%20Preview-Streamlit-brightgreen?style=for-the-badge&logo=streamlit)](https://movies-recommendation-system-by-anmmashud.streamlit.app)

[![Documentation](https://img.shields.io/badge/Documentation-anmmashud-orange?style=for-the-badge&logo=ghost)](https://anmmashud.xyz/blog/2025/rs_tmdb_5000_movies/)


**To-Do**
- add production companies
- add taglines
- more languages
- apply algorithms for feature engineering

In [None]:
import numpy as np
import pandas as pd
import copy
import ast
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [2]:
df = pd.read_csv('/kaggle/input/all-english-movies-1980-to-2025-from-tmdb/tmdb_all_movies_(1980-2025)_v1.csv')
df.shape

(439230, 11)

In [7]:
df2 = df.copy()
df2.nunique()

id                   439226
title                382581
overview             420499
release_date          16568
original_language         1
genre_ids              8929
adult                     2
popularity            45873
top_cast             305370
directors            170565
keywords              90664
dtype: int64

In [27]:
# making another copy where we will do data preprocessing
df3 = df2.copy()
df3['movie_id'] = df3['id']

In [28]:
# dropping unnecessary columns. you can also do this by creating a new dataframe
df3 = df3.drop(columns=['id', 'original_language','popularity'])

In [32]:
# removing rows with null title
df3 = df3.dropna(subset=['title'])
# replacing null overview with ' ' because we will concatenate overview with title, so we still get a synthetic overview
df3 = df3.fillna({'overview':' '})

title               0
overview        11104
release_date        0
genre_ids           0
adult               0
top_cast            0
directors           0
keywords            0
movie_id            0
dtype: int64

In [37]:
# see the difference
print(df2.shape)
print(df3.shape)

(439230, 11)
(439227, 9)


In [40]:
print("==== DUPLICATE ====\n",df3.duplicated().sum())
print("==== UNIQUE ====\n",df3.nunique())
print("==== NULL ====\n",df3.isnull().sum())

==== DUPLICATE ====
 0
==== UNIQUE ====
 title           382581
overview        420497
release_date     16568
genre_ids         8929
adult                2
top_cast        305368
directors       170564
keywords         90664
movie_id        439223
dtype: int64
==== NULL ====
 title           0
overview        0
release_date    0
genre_ids       0
adult           0
top_cast        0
directors       0
keywords        0
movie_id        0
dtype: int64


### Genre Code Function

According to our scraped data, genres are stored as ids. We don't need to convert them to their actual names. Some columns, such as `keywords`, also contain empty lists `[]`; we will keep them as they are, because the model only matches on tags that movies have in common. If we convert each id into a general form like `prefix` + `id` (for example `gen35`), every genre gets its own unique tag. If you want, you can convert the ids into the original genre names instead, but that needs the TMDB API and costs extra API calls.

**To-Do**
- apply feature engineering: Scaling

In [87]:
df4 = df3.copy()

In [41]:
# Function to turn a stringified list of genre ids into space-separated tags
def genre_code(value):     # value is one cell of the genre_ids column, e.g. "[99, 27]"
    try:
        genre_list = ast.literal_eval(value)
        codes = [f'gen{i}' for i in genre_list]
        return " ".join(codes)
    except (ValueError, SyntaxError, TypeError):
        # cells that cannot be parsed as a list get no genre tags
        return ""
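
To see what the function produces, here is a small self-contained sketch. The sample strings are made up for illustration, and the function follows the same logic as the cell above.

```python
import ast

def genre_code(value):
    """Turn a stringified id list such as "[35, 18]" into "gen35 gen18"."""
    try:
        return " ".join(f"gen{i}" for i in ast.literal_eval(value))
    except (ValueError, SyntaxError, TypeError):
        # unparsable or non-list cells become an empty tag string
        return ""

print(genre_code("[35, 18]"))    # two ids become two tags
print(genre_code("[]"))          # an empty list gives an empty string
print(genre_code("not a list"))  # unparsable input gives an empty string
```

An empty string simply adds no genre tag to the movie, which matches how empty `keywords` lists are treated later.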

In [88]:
# applying the function
df4['genres'] = df4['genre_ids'].apply(genre_code)
# dropping the existing genre_ids column
df4 = df4.drop(columns=['genre_ids'])

In [66]:
# view all genres
all_genres = df4['genres'].dropna().str.split(' ').explode()
unique_genres = all_genres.nunique()
print(unique_genres)
print(all_genres.unique())

20
['gen99' 'gen27' '' 'gen18' 'gen10770' 'gen35' 'gen28' 'gen53' 'gen10749'
 'gen10402' 'gen16' 'gen10751' 'gen14' 'gen12' 'gen80' 'gen9648' 'gen878'
 'gen36' 'gen10752' 'gen37']


### Working with `release_date`

```python
df4['release_date'] = pd.to_datetime(df4['release_date'], errors='coerce')
```

* **Purpose:** Converts your `release_date` column into **datetime objects**.
* `errors='coerce'` → If any row has an invalid date, it becomes `NaT` (Not a Time) instead of crashing.
* **Output:** The `release_date` column now stores proper date objects, e.g., `'1980-06-30'` becomes `Timestamp('1980-06-30 00:00:00')`.

---

```python
df4['year'] = df4['release_date'].dt.year
```

* **Purpose:** Extracts just the **year** from the datetime column.
* `.dt.year` is pandas syntax to access the year part of a datetime.
* **Output:** New column `year`, e.g., `1980`, `1992`, `2005`.

---

```python
df4['decade'] = (df4['year'] // 10 * 10).astype('Int64')
```

* **Purpose:** Groups each year into a **decade**.

  * `// 10` → integer division: `1985 // 10 = 198`
  * `* 10` → multiply back to get the decade: `198 * 10 = 1980`
* `.astype('Int64')` → ensures the column is integer type (handles missing values safely).
* **Output:** New column `decade`, e.g., `1980`, `1990`, `2000`.

---

```python
df4['decade_str'] = df4['decade'].astype(str) + 's'
```

* **Purpose:** Makes a **human-readable string** of the decade.
* `.astype(str)` → converts numbers to strings
* `+ 's'` → adds the letter “s” at the end for style, e.g., `1980s`.
* **Output:** Column `decade_str`, e.g., `'1980s'`, `'1990s'`, `'2000s'`.

---

```python
print(df4[['release_date', 'year', 'decade', 'decade_str']].head())
```

* **Purpose:** Lets you quickly **peek at the first few rows** with all the transformations applied.
* **Output example:**

```
  release_date  year  decade decade_str
0   1980-06-30  1980    1980      1980s
1   1992-03-15  1992    1990      1990s
2   2005-08-22  2005    2000      2000s
3   2010-12-10  2010    2010      2010s
4   1975-05-07  1975    1970      1970s
```
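
The walkthrough above relies on `errors='coerce'` and the nullable `Int64` dtype to survive bad dates. The sketch below runs that path on three made-up values and shows one possible way to leave the decade tag empty for rows without a date. The final `map` step is an addition for illustration, not part of the notebook.

```python
import pandas as pd

# Made-up sample: one valid date, one malformed string, one missing value.
dates = pd.Series(["1985-07-03", "not a date", None])

parsed = pd.to_datetime(dates, errors="coerce")       # invalid entries become NaT
decade = (parsed.dt.year // 10 * 10).astype("Int64")  # NaT rows become <NA>

# Build the label only where a decade exists; missing rows get an empty tag.
decade_str = decade.map(lambda d: f"{int(d)}s" if pd.notna(d) else "")

print(decade.tolist())
print(decade_str.tolist())
```

In this dataset `release_date` has no nulls, so the extra guard only matters if the scraper ever returns a malformed date.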


In [89]:
# Make sure the column is datetime
df4['release_date'] = pd.to_datetime(df4['release_date'], errors='coerce')

# Extract the year
df4['year'] = df4['release_date'].dt.year

# Convert to decade
df4['decade'] = (df4['year'] // 10 * 10).astype('Int64')

# make it a string like '1980s'
df4['decade_str'] = df4['decade'].astype(str) + 's'  # note: the Porter stemmer later reduces '1980s' to '1980'

print(df4[['release_date', 'year', 'decade', 'decade_str']].head())


  release_date  year  decade decade_str
0   1980-06-30  1980    1980      1980s
1   1980-06-28  1980    1980      1980s
2   1980-06-28  1980    1980      1980s
3   1980-06-28  1980    1980      1980s
4   1980-06-27  1980    1980      1980s


### Working on `adult`

In [77]:
df4.isnull().sum()

title           0
overview        0
release_date    0
adult           0
top_cast        0
directors       0
keywords        0
movie_id        0
genres          0
year            0
decade          0
decade_str      0
dtype: int64

In [90]:
# Replace True with the tag 'nsfw' and False (or missing) with an empty string, so safe movies get no extra tag
df4['adult'] = df4['adult'].map({True: 'nsfw', False: ''}).fillna('')

# Check
print(df4['adult'].value_counts())

adult
        364512
nsfw     74715
Name: count, dtype: int64


In [91]:
df4.head(1)

Unnamed: 0,title,overview,release_date,adult,top_cast,directors,keywords,movie_id,genres,year,decade,decade_str
0,Yorky Billy,YORKY BILLY is set in Ngurgdu (Spring Peak) in...,1980-06-30,,['Yorky Billy'],['Kim McKenzie'],[],1553504,gen99,1980,1980,1980s


We can see that `top_cast`, `directors`, and `keywords` are lists with multiple values, so we can make a function to convert each of them into one string. Because we only need the tags, we remove all spaces inside each value, make everything lowercase, and separate the values with spaces instead of commas.
```
df4['top_cast'].head(10)
df4['directors'].head(10)
df4['keywords'].head(10)
```

In [146]:
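def list_tags(obj):
    # parse the stringified list, e.g. "['Kim McKenzie']"
    items = list(ast.literal_eval(obj))
    tags = ', '.join(items)
    # lowercase, glue each value into one token, then turn the separating commas into spaces
    tags = tags.lower().replace(" ", "").replace(",", " ").replace("-", "").replace("(", "").replace(")", "")
    return tags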
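
A quick sketch of what `list_tags` does to a cell. The first two inputs are written by hand in the same shape as the scraped columns; the function body follows the cell above.

```python
import ast

def list_tags(obj):
    """Convert "['Robert Hays', 'Julie Hagerty']" into "roberthays juliehagerty"."""
    tags = ', '.join(ast.literal_eval(obj))
    # remove spaces inside values, then turn the separating commas into spaces
    tags = tags.lower().replace(" ", "").replace(",", " ")
    # drop hyphens and parentheses so each value stays a single token
    return tags.replace("-", "").replace("(", "").replace(")", "")

print(list_tags("['Robert Hays', 'Julie Hagerty']"))
print(list_tags("['post-traumatic stress disorder (PTSD)']"))
print(list_tags("[]"))
```

Gluing each name into one token keeps `roberthays` from matching every other "Robert" in the dataset. One caveat of this approach: a value that itself contains a comma would be split into two tokens.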

In [149]:
df4.head(1)

Unnamed: 0,title,overview,release_date,adult,top_cast,directors,keywords,movie_id,genres,year,decade,decade_str
0,Yorky Billy,YORKY BILLY is set in Ngurgdu (Spring Peak) in...,1980-06-30,,['Yorky Billy'],['Kim McKenzie'],[],1553504,gen99,1980,1980,1980s


In [160]:
df5 = df4.copy()
df5['keywords'] = df5['keywords'].apply(list_tags)
df5['top_cast'] = df5['top_cast'].apply(list_tags)
df5['directors'] = df5['directors'].apply(list_tags)

In [152]:
df5.head(1)

Unnamed: 0,title,overview,release_date,adult,top_cast,directors,keywords,movie_id,genres,year,decade,decade_str
0,Yorky Billy,YORKY BILLY is set in Ngurgdu (Spring Peak) in...,1980-06-30,,yorkybilly,kimmckenzie,,1553504,gen99,1980,1980,1980s


Now that we have converted these columns into single strings, we can concatenate them by first converting each one into a list of words. Before we start, we copy the title so that it can be included in the tags.

In [161]:
df5['title_x'] = df5['title']

In [162]:
# collecting necessary columns (.copy() so the assignments below modify a real DataFrame, not a slice of df5)
df6 = df5[['movie_id', 'title','title_x', 'overview', 'adult', 'top_cast', 'directors', 'keywords', 'genres', 'decade_str']].copy()

In [164]:
columns = ['title_x', 'overview', 'adult', 'top_cast', 'directors', 'keywords', 'genres', 'decade_str']
for i in columns:
    df6[i] = df6[i].apply(lambda x:x.lower().split())


In [165]:
df6.head(1)

Unnamed: 0,movie_id,title,title_x,overview,adult,top_cast,directors,keywords,genres,decade_str
0,1553504,Yorky Billy,"[yorky, billy]","[yorky, billy, is, set, in, ngurgdu, (spring, ...",[],[yorkybilly],[kimmckenzie],[],[gen99],[1980s]


In [169]:
df6.iloc[7].keywords

['chicago',
 'illinois',
 'posttraumaticstressdisorderptsd',
 'airplane',
 'cataclysm',
 'guitar',
 'alcohol',
 'stewardess',
 'taxidriver',
 'passenger',
 'fearofflying',
 'pilot',
 'medicine',
 'aircontroller',
 'landing',
 'saxophone',
 'autopilot',
 'parody',
 'spoof',
 'foodpoisoning',
 'losangeles',
 'california',
 'alcoholabuse',
 'aftercreditsstinger',
 'inflatableliferaft',
 'anarchiccomedy']

Because we used `.split()`, each cell now holds a list of words; the commas are just how Python displays a list. That is no problem: we can now concatenate the columns.

In [170]:
df7 = df6.copy()
df7['tags'] = df7['title_x']+df7['overview']+df7['adult']+df7['top_cast']+df7['directors']+df7['keywords']+df7['genres']+df7['decade_str']
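
Adding two pandas columns whose cells are Python lists concatenates the lists row by row, which is what the `tags` line relies on. A minimal sketch with made-up rows (only three of the eight columns, for brevity):

```python
import pandas as pd

# Toy rows in the same shape as df6: every cell is a list of tokens.
toy = pd.DataFrame({
    "title_x": [["yorky", "billy"], ["airplane!"]],
    "keywords": [[], ["parody", "spoof"]],
    "genres": [["gen99"], ["gen35"]],
})

# `+` on object columns applies list concatenation to each row.
toy["tags"] = toy["title_x"] + toy["keywords"] + toy["genres"]

print(toy["tags"].iloc[0])
print(toy["tags"].iloc[1])
```

Empty lists contribute nothing, so a movie without keywords still gets a valid tag list.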


### Training the model

In [172]:
# finally, let's create our movies dataframe
movies = df7[['movie_id', 'title', 'tags']]

In [173]:
movies.head(3)

Unnamed: 0,movie_id,title,tags
0,1553504,Yorky Billy,"[yorky, billy, yorky, billy, is, set, in, ngur..."
1,1499587,Dark Vengeance,"[dark, vengeance, dark, vengeance, is, a, 1980..."
2,495099,The Master and Ms. Johnson,"[the, master, and, ms., johnson, sexually, mal..."


In [177]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the df7 slice
movies = movies.assign(tags=movies['tags'].apply(lambda x: ' '.join(x)))


In [178]:
movies.iloc[7].tags

'airplane! an ex-fighter pilot forced to take over the controls of an airliner when the flight crew succumbs to food poisoning. roberthays juliehagerty leslienielsen kareemabduljabbar lloydbridges jimabrahams jerryzucker davidzucker chicago illinois posttraumaticstressdisorderptsd airplane cataclysm guitar alcohol stewardess taxidriver passenger fearofflying pilot medicine aircontroller landing saxophone autopilot parody spoof foodpoisoning losangeles california alcoholabuse aftercreditsstinger inflatableliferaft anarchiccomedy gen35 1980s'

Here we can see some extra symbols like `.`, `!`, `'` etc. We could remove them, but not now: the `nltk` steps and the vectorizer below will take care of most of them.

We also want to reduce similar words like `loved` and `loving` to a common form (`love`).
This requires the `nltk` natural language processing library. If you are running this notebook on your local computer, you can install it with `!pip install nltk`.

In [180]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [181]:
# we will stem each word in tags.
def stem(text):
    a = []
    for i in text.split(): # split the text into a list of words
        a.append(ps.stem(i)) # stem each word
    return " ".join(a) # join the stemmed words back into one string
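
A short sketch of what the stemmer does to individual tokens, assuming `nltk` is installed. The sample words are taken from the tags shown in this notebook, plus the `loved`/`loving` pair mentioned above.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    # stem every whitespace-separated token and join them back together
    return " ".join(ps.stem(word) for word in text.split())

print(stem("loved loving"))
print(stem("forced controls 1980s"))
print(stem("food poisoning."))  # the trailing '.' keeps 'poisoning.' from being stemmed
```

Because the text is split on whitespace only, punctuation stays attached to a word and can block stemming, which is why `poisoning.` survives unchanged in the output further down.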

In [182]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the df7 slice
movies = movies.assign(tags=movies['tags'].apply(stem))


In [183]:
movies.iloc[7].tags

'airplane! an ex-fight pilot forc to take over the control of an airlin when the flight crew succumb to food poisoning. roberthay juliehagerti leslienielsen kareemabduljabbar lloydbridg jimabraham jerryzuck davidzuck chicago illinoi posttraumaticstressdisorderptsd airplan cataclysm guitar alcohol stewardess taxidriv passeng fearoffli pilot medicin aircontrol land saxophon autopilot parodi spoof foodpoison losangel california alcoholabus aftercreditssting inflatableliferaft anarchiccomedi gen35 1980'

Our current dataset has a large number of rows, so stemming takes a lot of time. We will store the result as a CSV so that we can simply load it the next time we work on it.

In [11]:
movies.to_csv('movies_stemmed.csv', index=False)



# Let's vectorize the movies dataframe

In [5]:
movies_stem = pd.read_csv('/kaggle/input/stemmed-movies/movies_stemmed.csv')

Before we vectorize, we have to reduce the dimensionality of the data, so we will remove some unwanted words.

## Filtering stop words using a library

In [6]:
# download once
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()  # assuming text is already stemmed and lowercase
    filtered = [w for w in words if w not in stop_words]
    return " ".join(filtered)

movies_stem['tags'] = movies_stem['tags'].apply(remove_stopwords)


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Showing top words

#### Repeat this step to fine-tune the custom stopwords

In [9]:
from collections import Counter

# split all tags into words
all_words = []
for tags in movies_stem['tags']:
    all_words.extend(tags.split())

# count frequency
word_counts = Counter(all_words)
# show the top 10 words
top = word_counts.most_common(10)   # use 50 or 100 for a wider view
for word, count in top:
    print(word, " ->",count)

gen99  -> 83157
nsfw  -> 74732
gen18  -> 72536
gen35  -> 52316
live  -> 45426
young  -> 40960
stori  -> 40238
love  -> 39685
life  -> 38952
1990  -> 38101


## Using custom stop words

In [7]:
custom_stopwords = ['hi','thi', 'ha', 'film', 'new','-','wa','take', 'two', 'find','get','year','want','2','&','use','it','see','2000','2010','2020',"it'",'go','watch','one','make','must','give']

def custom_filter(tags):
    words = tags.split()
    filtered = [w for w in words if w not in custom_stopwords]
    return " ".join(filtered)

movies_stem['tags'] = movies_stem['tags'].apply(custom_filter)
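
A likely reason the custom list contains fragments such as `thi`, `hi`, `wa`, and `ha`: the tags were stemmed before the NLTK stop words were removed, and the Porter stemmer turns `this`, `his`, `was`, and `has` into those forms, which no longer match the stock list. The sketch below illustrates the effect with a small hand-written stop-word set (a stand-in for the NLTK list, not the real one).

```python
# Small hand-picked subset of English stop words, for illustration only.
stop_words = {"this", "his", "was", "has", "the", "a"}

# Tokens as they look after Porter stemming.
stemmed_tokens = ["thi", "hi", "wa", "ha", "the", "pilot"]

# The stemmed fragments slip through the unstemmed stop list.
survivors = [w for w in stemmed_tokens if w not in stop_words]
print(survivors)

# One way to close the gap: filter on the stemmed forms as well.
stemmed_stop_words = stop_words | {"thi", "hi", "wa", "ha"}
cleaned = [w for w in stemmed_tokens if w not in stemmed_stop_words]
print(cleaned)
```

Using a `set` for the stop words, as here, also keeps each membership test constant-time, which helps when the filter runs over several hundred thousand rows.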

In [5]:
# check the number of tags per movie (this output is from before the stop-word filtering)
movies_stem['tags'].apply(lambda x: len(x.split())).describe()


count    439227.000000
mean         56.112860
std          36.066442
min           2.000000
25%          30.000000
50%          47.000000
75%          74.000000
max         237.000000
Name: tags, dtype: float64

In [39]:
# check the number of tags per movie (this output is from after the stop-word filtering)
movies_stem['tags'].apply(lambda x: len(x.split())).describe()

count    439227.000000
mean         35.026651
std          21.295661
min           0.000000
25%          20.000000
50%          30.000000
75%          46.000000
max         188.000000
Name: tags, dtype: float64

In [8]:
vector = movies_stem.copy()

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100000,  # cap on vocabulary size (original note: "100000 failed")
                        min_df=5,             # ignore words that appear in fewer than 5 movies (to-do: tune this value)
                        max_df=0.7            # ignore words that appear in more than 70% of movies
                        # stop_words='english'  # used in an earlier version
                       )
tfidf_matrix = tfidf.fit_transform(vector['tags'])  # returns a sparse matrix


In [15]:
tfidf_sums = np.array(tfidf_matrix.sum(axis=0)).flatten()
words = tfidf.get_feature_names_out()

word_tfidf = pd.DataFrame({'word': words, 'tfidf_sum': tfidf_sums})
word_tfidf = word_tfidf.sort_values(by='tfidf_sum', ascending=False)
print(word_tfidf.head(10))


       word    tfidf_sum
1932  gen18  8194.509184
1942  gen99  8029.447525
3131   nsfw  7577.729375
1935  gen35  6427.680930
2643   life  5928.241754
2717   love  5347.971172
54     1990  5341.576831
2773    man  5229.773033
4986  young  5117.987131
2674   live  5108.471278


In [26]:
from sklearn.neighbors import NearestNeighbors

# exact (not approximate) nearest-neighbor search with cosine distance;
# brute force works directly on the sparse TF-IDF matrix
nn = NearestNeighbors(metric='cosine', algorithm='brute')  
nn.fit(tfidf_matrix)

In [11]:
def recommend(movie):
    # earlier version: expects `similarity`, a precomputed movie-by-movie similarity matrix
    # that is not defined in the cells shown here
    movie_index = movies_stem[movies_stem['title'] == movie].index[0]
    distances = similarity[movie_index]
    # sort by similarity (highest first), skip position 0 (the movie itself) and keep the next 20
    movies_list = sorted(list(enumerate(distances)),reverse=True, key = lambda x:x[1])[1:21]

    for i in movies_list:
        print(movies_stem.iloc[i[0]].title)
recommend("Pirates of the Caribbean: At World's End")

In [20]:
def recommend(movie):
    # find the index of the movie (the first match if several movies share the title)
    movie_index = movies_stem[movies_stem['title'] == movie].index[0]

    # query the 11 nearest neighbors: the movie itself plus 10 recommendations
    distances, indices = nn.kneighbors(tfidf_matrix[movie_index], n_neighbors=11)

    # skip the first one (it's the movie itself)
    for i in range(1, len(indices[0])):
        print(movies_stem.iloc[indices[0][i]].title)
print("\n==============================\n\n✅ All Training Complete....!")



✅ All Training Complete....!


In [32]:
recommend("Iron Man")

Iron Man 2
Iron Man 3
Spider-Man: Homecoming
I Am Iron Man
The Invincible Iron Man
Iron Man XXX: An Extreme Comixxx Parody
Spider-Man: Far From Home
Captain America: Civil War
The Avengers
Guardians of the Galaxy Vol. 2
