This script will recommend movies when given a single movie as an input.  
The source dataset contains information about each movie - Overview, cast, crew, title, etc.  
This information is then vectorized for AI use.  

In [1]:
import numpy as numpy
import pandas as pd

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


Read in the two files

In [217]:
#base_data_path = "sample-data-20mill/"
base_data_path = "sample-data-small/"
#base_data_path = "sample-data-large/"  # folder containing larger set of user reviews

credits_df = pd.read_csv(base_data_path + 'credits.csv')
rawmovies_df = pd.read_csv(base_data_path + 'movies-deeplearning.csv')

#rename the movieID file "id" column to match what is in the movies file
credits_df = credits_df.rename(columns={'movie_id': 'id'})

In [218]:
# Optional.  Change the display to show all data in the IDE
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

#credits_df.head()
#movies_df

Merge the movies and credits datasets together

In [219]:
movies_df = rawmovies_df.merge(credits_df, on='id')
movies_df = movies_df.rename(columns={'title_x': 'title'})  # the merge resulted in 2 title columns
#movies_df = rawmovies_df.merge(credits_df, on='title')
movies_df.shape  #returns the rows and columns of the DF

(4803, 23)

We don't need all of the columnns that came with the dataset.  Lets shrink it down

In [220]:
#movies_df.head()
#movies_df.info()
movies_df = movies_df[[ 'id', 'overview', 'genres', 'keywords', 'cast', 'crew', 'title']] # reduce the amount of columns
#movies_df.info()
#movies_df.isnull().sum()  #check for null values
#movies_df.duplicated().sum()  #tells you if there are any dupes
movies_df.iloc[0].genres #get the values of the "genres" column for the first entry

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Drop any rows that have a null value for overview

In [221]:
#check to see if there are nulls
movies_df.isnull().sum()

movies_df.dropna(subset=['overview'], inplace=True)


Write a function to extract data from tree

 Before:  [{"id": 28, "name": "Action"}, {"id": 12, "nam...	

 After: [Action, Adventure, ....

In [222]:
import ast  # Abstract Syntax Trees - helps Python applications to process trees of the Python abstract syntax grammar.

def convert(obj):  # convert some literals to objects
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

In [223]:
movies_df['genres'] = movies_df['genres'].apply(convert)
movies_df['keywords'] = movies_df['keywords'].apply(convert)
movies_df.head()

Unnamed: 0,id,overview,genres,keywords,cast,crew,title
0,19995,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",Avatar
1,285,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",Pirates of the Caribbean: At World's End
2,206647,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",Spectre
3,49026,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",The Dark Knight Rises
4,49529,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",John Carter


Take care of the 'cast' field

In [224]:
def convert3(obj):  # convert some literals to objects
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i['name'])
            counter +=1
        else:
            break
        return L

In [225]:
movies_df['cast'] = movies_df['cast'].apply(convert3)

Get the Director from the "Crew" cell, put it in its own new column

In [226]:
def fetch_director(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job'] == "Director":
            L.append(i['name'])
    return L

In [227]:
movies_df['director'] = movies_df['crew'].apply(fetch_director)

The "Overview" is a long string.  We need to separate the words.

Before:  "It was the best of times...."

After:  [It, was, the, best, of, times]

In [228]:
movies_df['overview'] = movies_df['overview'].apply(lambda x:x.split())
movies_df.head()

Unnamed: 0,id,overview,genres,keywords,cast,crew,title,director
0,19995,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",[Sam Worthington],"[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",Avatar,[James Cameron]
1,285,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",[Johnny Depp],"[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",Pirates of the Caribbean: At World's End,[Gore Verbinski]
2,206647,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",[Daniel Craig],"[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",Spectre,[Sam Mendes]
3,49026,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",[Christian Bale],"[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",The Dark Knight Rises,[Christopher Nolan]
4,49529,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",[Taylor Kitsch],"[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",John Carter,[Andrew Stanton]


Remove Spaces from the strings

In [229]:
movies_df['genres'] = movies_df['genres'].apply(lambda x:[i.replace(" ", "") for i in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x:[i.replace(" ", "") for i in x])
movies_df['director'] = movies_df['director'].apply(lambda x:[i.replace(" ", "") for i in x])
#movies_df['cast'] = movies_df['cast'].apply(lambda x:[i.replace(" ", "") for i in x])

In [230]:
movies_df.head()

Unnamed: 0,id,overview,genres,keywords,cast,crew,title,director
0,19995,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",[Sam Worthington],"[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",Avatar,[JamesCameron]
1,285,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",[Johnny Depp],"[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",Pirates of the Caribbean: At World's End,[GoreVerbinski]
2,206647,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",[Daniel Craig],"[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",Spectre,[SamMendes]
3,49026,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",[Christian Bale],"[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",The Dark Knight Rises,[ChristopherNolan]
4,49529,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",[Taylor Kitsch],"[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",John Carter,[AndrewStanton]


Create a tag.  This will allow us to store all of the relevant text in a single cell.

In [231]:
movies_df['tags'] = movies_df['overview'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['director']

Now that we have the "tags" column that has our data, we no longer need the other columns.  
Lets create a new DF that contains only what we need

In [232]:
new_df = movies_df[['id', 'title', 'tags']]
new_df.head()

Unnamed: 0,id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


We need to remove the brackets around the data

Before: [Ever, since, the, second]

After:  Ever since the second

In [233]:
new_df.isnull().sum()

id        0
title     0
tags     42
dtype: int64

In [234]:
new_df.dropna(subset=['tags'], inplace=True)  #remove any bad rows

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df.dropna(subset=['tags'], inplace=True)  #remove any bad rows


In [235]:
new_df['tags'] = new_df['tags'].apply(lambda x:' '.join(x))
new_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:' '.join(x))


Unnamed: 0,id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


Put the tags in lower case for better predictive properties

In [236]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())
new_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


Unnamed: 0,id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


We will implement feature extraction using count vectorizer.  The vectorizer will transform text into a vector on the basis of the frequency count for each word of the entire text.  

In [237]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')

Fit will create a dictionary of tokens.  By default, the tokens are words separated by spaces. 
Need a better explanation here.   

In [238]:
# Sanity check
cv.fit_transform(new_df['tags']).toarray().shape

(4758, 5000)

Convert vector to array

In [239]:
vectors = cv.fit_transform(new_df['tags']).toarray()

In [240]:
# this tells us how many feature names we have.  
len(cv.get_feature_names_out())

5000

Import the natural language toolkit.  NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
https://www.nltk.org/

In [241]:
#%pip install --user -U nltk   # run this once
import nltk

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()


Write a function to re-assemble the tags into readable language.

Result:  in the 22nd century, a parapleg marin is dispa...

In [242]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [243]:
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)


In [244]:
new_df.head()

Unnamed: 0,id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."


Cosine similarity:  find the similarity between two vectors

In [245]:
from sklearn.metrics.pairwise import cosine_similarity

In [246]:
cosine_similarity(vectors)

array([[1.        , 0.08980265, 0.05892557, ..., 0.02431083, 0.07808688,
        0.        ],
       [0.08980265, 1.        , 0.06350006, ..., 0.02619813, 0.        ,
        0.        ],
       [0.05892557, 0.06350006, 1.        , ..., 0.02578553, 0.02760788,
        0.        ],
       ...,
       [0.02431083, 0.02619813, 0.02578553, ..., 1.        , 0.06834085,
        0.04045567],
       [0.07808688, 0.        , 0.02760788, ..., 0.06834085, 1.        ,
        0.04331481],
       [0.        , 0.        , 0.        , ..., 0.04045567, 0.04331481,
        1.        ]])

In [254]:
cosine_similarity(vectors).shape

(4758, 4758)

In [265]:
similarity = cosine_similarity(vectors)

In [266]:
similarity[0]

array([1.        , 0.08980265, 0.05892557, ..., 0.02431083, 0.07808688,
       0.        ])

In [267]:
sorted(list(enumerate(similarity[0])), reverse=True, key=lambda x:x[1])[1:6]

[(539, 0.2611164839335468),
 (507, 0.24948506639458295),
 (1191, 0.24873416908154547),
 (1213, 0.24459979523511427),
 (260, 0.2434322477800738)]

In [268]:
new_df.head()

Unnamed: 0,id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."


In [294]:
def recommend(movie):
    movie_index = new_df[new_df['title']==movie].index[0]
    # print(movie_index)
    distances = similarity[movie_index]
   # print(similarity[movie_index])
    movies_list = sorted(list(enumerate(distances)), reverse = True, key = lambda x:x[1])[1:6]
                         
    for i in movies_list:
        print(new_df.iloc[i[0]].title)

In [298]:
recommend('Avatar')

Titan A.E.
Independence Day
Small Soldiers
Aliens vs Predator: Requiem
Ender's Game
