Let's load the data now.

In [1]:
import pandas as pd 
import numpy as np 
# Use the first 4,000 movies for the recommender
df1=pd.read_csv('IMDb movies.csv')[:4000]



In [2]:
df1.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


## **Plot description based Recommender**

We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on those scores. The plot description is given in the **description** feature of our dataset.
Let's take a look at the data.

In [3]:
df1['description'].head(5)

0    The adventures of a female reporter in the 1890s.
1    True story of notorious Australian outlaw Ned ...
2    Two men of high rank are both wooing the beaut...
3    The fabled queen of Egypt's affair with Roman ...
4    Loosely adapted from Dante's Divine Comedy and...
Name: description, dtype: object

In [4]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df1['description'] = df1['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df1['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4000, 12268)
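
To make the shape above concrete, here is a minimal sketch on a three-document toy corpus. The sentences are made up for illustration and are not taken from the dataset. Each row of the matrix is one document and each column is one vocabulary term that survives stop-word removal; an empty description (like a NaN replaced by `''`) simply becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the IMDb data)
toy_docs = [
    "The queen of Egypt and the Roman general",
    "A reporter follows the queen",
    "",  # an empty description, like a NaN replaced by ''
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Rows = documents, columns = distinct terms left after stop-word removal
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# The empty document has no non-zero entries
print(toy_matrix[2].nnz)
```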

Because TfidfVectorizer L2-normalizes every row by default (`norm='l2'`), the dot product of two rows is already their cosine similarity. We will therefore use sklearn's **linear_kernel()** instead of **cosine_similarity()**: it skips the redundant normalization step and is faster.

In [5]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
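
The equivalence between the dot product and the cosine similarity can be checked on a small example. The sketch below uses a made-up three-sentence corpus (an assumption for illustration only) and relies on the vectorizer's default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical, not taken from the IMDb data)
toy_docs = [
    "outlaw gang robs a bank",
    "the outlaw escapes the sheriff",
    "a queen rules ancient Egypt",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Every non-empty row has unit L2 norm because norm='l2' is the default
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)

# With unit-length rows, the plain dot product equals the cosine similarity
dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity()` would be the right choice.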

Next, we define a function that takes a movie title as input and returns the 10 most similar movies. To do that, we first need a reverse mapping from movie titles to DataFrame indices, that is, a way to look up the row index of a movie in our metadata DataFrame given its title.

In [6]:
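#Construct a reverse map from movie titles to DataFrame indices
indices = pd.Series(df1.index, index=df1['title'])
#Keep only the first row for each title so that indices[title] is always a single integer
indices = indices[~indices.index.duplicated(keep='first')]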


In [7]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df1['title'].iloc[movie_indices]
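
The ranking step inside the function can also be written with NumPy. The following is only a sketch of an alternative, not a replacement for the function above: `toy_sim` and `top_k_similar` are hypothetical names introduced for illustration. It drops the query item by index instead of assuming it sorts first, which may matter when another movie has an identical description.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix; entry [i, j] is the similarity of items i and j
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

def top_k_similar(idx, sim, k=2):
    """Return the indices of the k items most similar to item idx (hypothetical helper)."""
    order = np.argsort(-sim[idx], kind='stable')  # indices sorted by descending similarity
    order = order[order != idx]                   # drop the query item explicitly
    return order[:k].tolist()

print(top_k_similar(0, toy_sim))  # [2, 1]
```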

In [8]:
get_recommendations('Madame DuBarry')

1718                         Voltaire
3411               La sposa di Boston
2270                     Le due città
1884                  Madame Du Barry
3044                   La marsigliese
260                  Le due orfanelle
3113            La fanciulla del ring
551     La ragazza con la cappelliera
1003                   Se io fossi re
3389              La regola del gioco
Name: title, dtype: object

In [9]:
get_recommendations('The Bargain')

792                     Rio Rita
2997          Nel cuore del Nord
1217                  Cortigiana
861              Captain Thunder
1972         Straight Is the Way
1106     La sposa nella tempesta
475             L'aquila azzurra
276               L'età di amare
2985            Give Me a Sailor
3665    One Night in the Tropics
Name: title, dtype: object

## **Credits, Genres and Keywords Based Recommender**


In [10]:
# Features stored as comma-separated strings that we will split into lists
features = ['actors', 'director', 'writer', 'genre']

Next, we'll write functions that will help us to extract the required information from each feature.

In [11]:
# Return the first three comma-separated entries as a list (or all of them if there are fewer).
# Missing values (anything that is not a string) become an empty list.
def get_list(x):
    if isinstance(x, str):
        return [name.strip() for name in x.split(',')][:3]
    return []

In [12]:
features = ['actors', 'director', 'writer', 'genre']
for feature in features:
    df1[feature] = df1[feature].apply(get_list)

In [13]:
# Print the new features of the first 3 films
df1[['title', 'actors', 'director', 'writer', 'genre']].head(3)

Unnamed: 0,title,actors,director,writer,genre
0,Miss Jerry,"[Blanche Bayliss, William Courtenay, Chaunce...",[Alexander Black],[Alexander Black],[Romance]
1,The Story of the Kelly Gang,"[Elizabeth Tait, John Tait, Norman Campbell]",[Charles Tait],[Charles Tait],"[Biography, Crime, Drama]"
2,Den sorte drøm,"[Asta Nielsen, Valdemar Psilander, Gunnar He...",[Urban Gad],"[Urban Gad, Gebhard Schätzler-Perasini]",[Drama]


In [17]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    # Handle a plain string; anything else (e.g. NaN) becomes an empty string
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''
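
Why remove the spaces inside names? A plausible reason is that the vectorizer used later splits text on whitespace, so two different people who share a first name would otherwise share a token. The sketch below illustrates this with two example names chosen for illustration; it is not part of the original pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = ["John Tait", "John Ford"]      # two different people
cleaned = ["johntait", "johnford"]    # after lowercasing and removing spaces

raw_vocab = CountVectorizer().fit(raw).vocabulary_
clean_vocab = CountVectorizer().fit(cleaned).vocabulary_

print(sorted(raw_vocab))    # ['ford', 'john', 'tait'] -> the shared token 'john' links the names
print(sorted(clean_vocab))  # ['johnford', 'johntait'] -> each name is one distinct token
```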

In [25]:
# Apply clean_data function to your features.
features = ['actors', 'director', 'writer', 'genre']

for feature in features:
    df1[feature] = df1[feature].apply(clean_data)

In [26]:
df1[['title', 'actors', 'director', 'writer', 'genre']].head(3)

Unnamed: 0,title,actors,director,writer,genre
0,Miss Jerry,"[blanchebayliss, williamcourtenay, chaunceydepew]",[alexanderblack],[alexanderblack],[romance]
1,The Story of the Kelly Gang,"[elizabethtait, johntait, normancampbell]",[charlestait],[charlestait],"[biography, crime, drama]"
2,Den sorte drøm,"[astanielsen, valdemarpsilander, gunnarhelseng...",[urbangad],"[urbangad, gebhardschätzler-perasini]",[drama]


In [27]:
# Join writers, actors, directors and genres into one space-separated string ("soup") per movie
def create_soup(x):
    return ' '.join(x['writer']) + ' ' + ' '.join(x['actors']) + ' ' + ' '.join(x['director']) + ' ' + ' '.join(x['genre'])

df1['soup'] = df1.apply(create_soup, axis=1)

In [28]:
df1['soup'].head(3)

0    alexanderblack blanchebayliss williamcourtenay...
1    charlestait elizabethtait johntait normancampb...
2    urbangad gebhardschätzler-perasini astanielsen...
Name: soup, dtype: object

In [29]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df1['soup'])

In [30]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
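
Unlike the TF-IDF matrix, the rows of a count matrix are not normalized, so the plain dot product is no longer a cosine similarity. The sketch below shows the difference on two made-up "soup" strings (assumed for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Two hypothetical soups that share three tokens
toy_soups = [
    "urbangad astanielsen drama",
    "urbangad astanielsen drama romance",
]
toy_counts = CountVectorizer().fit_transform(toy_soups)

toy_dot = linear_kernel(toy_counts, toy_counts)      # raw dot products: number of shared tokens
toy_cos = cosine_similarity(toy_counts, toy_counts)  # dot products divided by the row norms

print(toy_dot)
print(toy_cos)
```

Here the dot product of the two soups is 3, while their cosine similarity is 3 / (sqrt(3) * sqrt(4)), roughly 0.87, which is why `cosine_similarity()` is used for the count matrix.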

In [31]:
# Reset index of our main DataFrame and construct reverse mapping as before
df1 = df1.reset_index()
indices = pd.Series(df1.index, index=df1['title'])

We can now reuse our **get_recommendations()** function by passing in the new **cosine_sim2** matrix as the second argument.

In [33]:
get_recommendations('Madame DuBarry', cosine_sim2)

186                      Anna Bolena
116           Gli occhi della mummia
120                    Sangue gitano
223                          Sumurum
235                    Lo scoiattolo
316              Das Weib des Pharao
205                      Due sorelle
122               L'allegra prigione
727                       La valanga
142    La principessa delle ostriche
Name: title, dtype: object

In [34]:
get_recommendations('The Bargain', cosine_sim2)

54                   The Italian
44         The Wrath of the Gods
73                 Hell's Hinges
50                    The Coward
454                  Tumbleweeds
65                  Civilization
81       The Return of Draw Egan
181                 Wagon Tracks
225                The Toll Gate
739    Il bandito e la signorina
Name: title, dtype: object