In [1]:
import pandas as pd
import numpy as np
import ast
import warnings
import pickle
import torch 
warnings.filterwarnings('ignore')
# pd.set_option('display.max_colwidth', None)

In [2]:
# Load the datasets
links_small = pd.read_csv('dataset/links_small.csv')
all_movies = pd.read_csv('dataset/movies_metadata.csv')
keywords = pd.read_csv('dataset/keywords.csv')
credits = pd.read_csv('dataset/credits.csv')

In [3]:
print(all_movies[all_movies['title'] == 'Commando']['overview'].values[0])

John Matrix, the former leader of a special commando strike force that always got the toughest jobs done, is forced back into action when his young daughter is kidnapped. To find her, Matrix has to fight his way through an array of punks, killers, one of his former commandos, and a fully equipped private army. With the help of a feisty stewardess and an old friend, Matrix has only a few hours to overcome his greatest challenge: finding his daughter before she's killed.


In [4]:
#Get the ID mappings
tmdbId = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
movieId = links_small[links_small['tmdbId'].notnull()]['movieId'].astype('int')

In [5]:
all_movies['id'] = pd.to_numeric(all_movies['id'], errors='coerce')

In [6]:
print(all_movies.shape)

(45466, 24)


In [7]:
all_movies = all_movies[all_movies.overview.notna()]

In [8]:
all_movies = all_movies.query("overview != ['No overview found.','No movie overview available.','No Overview','No overview yet.', ' ']")

In [9]:
all_movies.dropna(subset='id', inplace=True)

In [10]:
all_movies.drop_duplicates(subset=['overview'], inplace=True)
all_movies.drop_duplicates(inplace=True)
credits.drop_duplicates(inplace=True)
keywords.drop_duplicates(inplace=True)

In [11]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
all_movies['id'] = all_movies['id'].astype('int')

In [12]:
all_movies = all_movies.merge(credits, on='id')
all_movies = all_movies.merge(keywords, on='id')

In [13]:
movies_df = all_movies[all_movies['id'].isin(tmdbId)]
movies_df = movies_df[movies_df.overview.notna()]
print(movies_df.shape)

(9066, 27)


In [14]:
movies_df.drop_duplicates(subset=['overview'], inplace=True)

In [15]:
movies_df = movies_df[['id', 'title', 'keywords', 'genres', 'overview', 'cast', 'crew', 'poster_path']]

In [16]:
# Parse the stringified list of dicts and return just the 'name' values (used for genres and keywords)
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L

In [17]:
movies_df['genres'] = movies_df['genres'].apply(convert)
movies_df['keywords'] = movies_df['keywords'].apply(convert)

In [18]:
#To get the top 3 Cast Members
def getCast(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L 

In [19]:
movies_df['cast'] = movies_df['cast'].apply(getCast)

In [20]:
#To get the Director's name
def getDirector(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L

In [21]:
movies_df['director'] = movies_df['crew'].apply(getDirector)

In [22]:
movies_df.drop(['crew'], axis=1, inplace=True)

In [23]:
movies_df['keywords'] = movies_df['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies_df['genres'] = movies_df['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
movies_df['cast'] = movies_df['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies_df['director'] = movies_df['director'].apply(lambda x: [i.replace(" ", "") for i in x])

In [24]:
movies_df.head()

Unnamed: 0,id,title,keywords,genres,overview,cast,poster_path,director
0,862,Toy Story,"[jealousy, toy, boy, friendship, friends, riva...","[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...","[TomHanks, TimAllen, DonRickles]",/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,[JohnLasseter]
1,8844,Jumanji,"[boardgame, disappearance, basedonchildren'sbo...","[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,"[RobinWilliams, JonathanHyde, KirstenDunst]",/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,[JoeJohnston]
2,15602,Grumpier Old Men,"[fishing, bestfriend, duringcreditsstinger, ol...","[Romance, Comedy]",A family wedding reignites the ancient feud be...,"[WalterMatthau, JackLemmon, Ann-Margret]",/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,[HowardDeutch]
3,31357,Waiting to Exhale,"[basedonnovel, interracialrelationship, single...","[Comedy, Drama, Romance]","Cheated on, mistreated and stepped on, the wom...","[WhitneyHouston, AngelaBassett, LorettaDevine]",/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[ForestWhitaker]
4,11862,Father of the Bride Part II,"[baby, midlifecrisis, confidence, aging, daugh...",[Comedy],Just when George Banks has recovered from his ...,"[SteveMartin, DianeKeaton, MartinShort]",/e64sOI48hQXyru7naBFyssKFxVd.jpg,[CharlesShyer]


## Popularity Based Recommender

This recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea is that movies that are more popular and more critically acclaimed have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is straightforward. All we have to do is sort our movies by rating and popularity and display the top movies of the list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** of vote counts as our cutoff. In other words, for a movie to feature in the charts, it must have at least as many votes as 95% of the movies in the list.
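To get a feel for the formula before applying it to the dataset, here is a minimal sketch with made-up numbers. The values of `m` and `C` below are toy assumptions, not the ones computed from `movies_metadata.csv`, and the four-argument `weighted_rating` is only an illustration of the formula.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blends a movie's own mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy values (assumptions, not taken from the dataset)
m, C = 500, 5.0

# Many votes: the score stays close to the movie's own average.
popular = weighted_rating(v=1000, R=8.0, m=m, C=C)

# Few votes: the score is pulled strongly toward the global mean.
obscure = weighted_rating(v=10, R=10.0, m=m, C=C)

print(round(popular, 3), round(obscure, 3))
```

A movie with a perfect average but only 10 votes ends up well below a movie rated 8.0 by 1000 voters, which is the behaviour we want from the chart.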

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [27]:
mean_vote = all_movies[all_movies.vote_average.notnull()]['vote_average'].mean()

minimum_votes = all_movies['vote_count'].quantile(0.95)

In [28]:
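# pd.notna is required here: `x != np.nan` is always True, so missing dates would become the string 'NaT'
all_movies['year'] = pd.to_datetime(all_movies['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)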

In [29]:
qualified = all_movies[(all_movies['vote_count'] >= minimum_votes) & (all_movies['vote_count'].notnull()) & (all_movies['vote_average'].notnull())][['id','title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
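# Note: casting to int truncates the average rating (e.g. 8.3 becomes 8), which coarsens the weighted rating
qualified['vote_average'] = qualified['vote_average'].astype('int')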
qualified['id'] = qualified['id'].astype('int')
print(qualified.shape)

(2217, 7)


In [30]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+minimum_votes) * R) + (minimum_votes/(minimum_votes+v) * mean_vote)

qualified['wr'] = qualified.apply(weighted_rating, axis=1)

qualified = qualified.sort_values('wr', ascending=False).head(250)

In [31]:
top_movies=qualified.head(200)

In [32]:
top_movies.head()

Unnamed: 0,id,title,year,vote_count,vote_average,popularity,genres,wr
15386,27205,Inception,2010,14075,8,29.108149,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",7.927116
12428,155,The Dark Knight,2008,12269,8,123.167259,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",7.916766
22634,157336,Interstellar,2014,11187,8,32.213481,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",7.909026
2820,550,Fight Club,1999,9678,8,63.869599,"[{'id': 18, 'name': 'Drama'}]",7.89547
4833,120,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",7.886674


In [33]:
#Saving the Top movies data
pickle.dump(top_movies,open('saved_models/top_movies.pkl','wb'))
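The genre-specific chart mentioned at the start of this section is not built in this notebook. Here is a minimal sketch of how it could look. It uses plain Python records instead of the DataFrame so that it runs on its own. The helper `build_genre_chart`, its parameters and the sample records are all hypothetical, and the percentile is a simple nearest-rank approximation rather than pandas' interpolated `quantile`.

```python
def build_genre_chart(movies, genre, percentile=0.95, top_n=250):
    """Rank the movies of one genre by weighted rating (hypothetical helper)."""
    in_genre = [m for m in movies if genre in m["genres"]]
    if not in_genre:
        return []
    # C: mean rating inside the genre.
    C = sum(m["vote_average"] for m in in_genre) / len(in_genre)
    # m: vote-count cutoff, using a simple nearest-rank percentile.
    counts = sorted(m["vote_count"] for m in in_genre)
    min_votes = counts[min(int(percentile * len(counts)), len(counts) - 1)]
    chart = []
    for movie in in_genre:
        v, R = movie["vote_count"], movie["vote_average"]
        if v >= min_votes and v + min_votes > 0:
            wr = v / (v + min_votes) * R + min_votes / (v + min_votes) * C
            chart.append({**movie, "wr": wr})
    return sorted(chart, key=lambda row: row["wr"], reverse=True)[:top_n]

# Made-up records, only to exercise the function
sample = [
    {"title": "A", "genres": ["Romance"], "vote_count": 1000, "vote_average": 8.0},
    {"title": "B", "genres": ["Romance"], "vote_count": 100, "vote_average": 9.0},
    {"title": "C", "genres": ["Romance", "Drama"], "vote_count": 500, "vote_average": 6.0},
    {"title": "D", "genres": ["Action"], "vote_count": 2000, "vote_average": 7.0},
]
romance_chart = build_genre_chart(sample, "Romance", percentile=0.5)
print([row["title"] for row in romance_chart])
```

Movie B has the highest raw average but too few votes to pass the cutoff, so it does not appear in the chart.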

## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top Movies Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us due to the limited computing power available to me.
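To make "similarity between movies" concrete, here is a minimal sketch of bag-of-words cosine similarity in plain Python. The notebook itself uses scikit-learn's `CountVectorizer` and `cosine_similarity` on the real tags. The titles and tag strings below are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical tag strings (not from the dataset)
tags = {
    "Spy Film A": "spy secretagent action london",
    "Spy Film B": "spy secretagent action moscow",
    "Rom-Com C": "romance comedy wedding london",
}
bags = {title: Counter(text.split()) for title, text in tags.items()}

def most_similar(title):
    """Return the other titles sorted by decreasing similarity to `title`."""
    scores = [(other, cosine(bags[title], bags[other])) for other in bags if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("Spy Film A")
print(ranking)
```

The two spy films share three of their four tags, so they score much higher with each other than with the romantic comedy, which only shares the word `london`.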

In [35]:
import nltk

In [36]:
# nltk.download()

In [37]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [38]:
stop_words = set(stopwords.words('english'))
stop_words.update(["''", "``", "\'s", ','])

In [39]:
#Tokenization and removing stopwords
movies_df['overview_tags'] = movies_df['overview'].apply(lambda tokens: [w for w in word_tokenize(str(tokens).lower()) if w not in stop_words])

In [40]:
movies_df['tags'] = movies_df['overview_tags'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['director']
movies_df = movies_df.drop(columns=['overview_tags','keywords','cast','director'])

In [41]:
movies_df['tags'] = movies_df['tags'].apply(lambda x: ' '.join(x))

In [42]:
movies_df.reset_index(inplace=True)

In [43]:
movies_df.drop(['index'], axis=1, inplace=True)

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(max_features=3000)

vector = cv.fit_transform(movies_df['tags']).toarray()

similarity = cosine_similarity(vector)

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

In [46]:
#Recommend top 10 movies on the basis of Similar tags (Overview, Director, Cast, Genre)

def recommend(movie):
    index = movies_df[movies_df['title'] == movie].index[0]
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    for i in distances[1:11]:
        print(movies_df.loc[i[0]].title)

In [47]:
recommend('GoldenEye')

Die Another Day
Licence to Kill
A View to a Kill
Octopussy
Live and Let Die
Never Say Never Again
Dr. No
From Russia with Love
Thunderball
Diamonds Are Forever


In [48]:
movies_df['id'] = movies_df['id'].astype(int)

In [95]:
#Saving the model for later use
pickle.dump(movies_df,open('saved_models/movie_info.pkl','wb'))
pickle.dump(similarity,open('saved_models/similarity_matrix.pkl','wb'))

## Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are *close* to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the **Surprise** library, which uses powerful algorithms like **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.
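For intuition only, here is a heavily simplified sketch of the matrix factorization idea behind SVD-style recommenders: each user and each movie gets a small vector of latent factors, and the vectors are adjusted by stochastic gradient descent so that their dot product approximates the known ratings. This is not Surprise's implementation (which, among other things, also learns per-user and per-item bias terms). The function names, hyperparameters and ratings below are all assumptions made for illustration.

```python
import random

def train_mf(ratings, n_factors=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit rating ~ mu + p_u . q_i by stochastic gradient descent (toy version)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(n_factors):
                pf, qf = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factor vectors
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return mu, P, Q

def predict(model, u, i):
    mu, P, Q = model
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(model, ratings):
    return (sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Made-up ratings: u1 and u2 have similar taste, u3 has the opposite taste
toy_ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 1.0), ("u1", "m3", 4.5),
    ("u2", "m1", 4.5), ("u2", "m2", 1.5),
    ("u3", "m1", 1.0), ("u3", "m2", 5.0), ("u3", "m3", 1.5),
]
untrained = train_mf(toy_ratings, epochs=0)
trained = train_mf(toy_ratings)
print(rmse(untrained, toy_ratings), rmse(trained, toy_ratings))
print(predict(trained, "u2", "m3"))  # u2 has not rated m3
```

The last line asks for a rating that is missing from the data, which is exactly the kind of question the fitted model is meant to answer for real users and movies.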

In [125]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

In [127]:
reader = Reader()
ratings = pd.read_csv('dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [129]:
pickle.dump(ratings,open('saved_models/ratings_info.pkl','wb'))

In [131]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8960  0.8979  0.8968  0.8940  0.8949  0.8959  0.0014  
MAE (testset)     0.6917  0.6899  0.6880  0.6880  0.6893  0.6894  0.0014  
Fit time          0.90    0.83    0.83    0.82    0.84    0.85    0.03    
Test time         0.11    0.09    0.09    0.08    0.09    0.09    0.01    


{'test_rmse': array([0.89599626, 0.89787923, 0.89675044, 0.89395451, 0.89494578]),
 'test_mae': array([0.69173476, 0.68994647, 0.68802097, 0.68804064, 0.6892505 ]),
 'fit_time': (0.9011940956115723,
  0.8336648941040039,
  0.8345592021942139,
  0.8239870071411133,
  0.8396477699279785),
 'test_time': (0.1134941577911377,
  0.0939939022064209,
  0.0919489860534668,
  0.08353066444396973,
  0.08748483657836914)}

In [133]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x18d8b82c0>

In [135]:
pickle.dump(svd,open('saved_models/svd_model.pkl','wb'))

In [99]:
sentence_embeddings = pickle.load(open('saved_models/sentence_embeddings.pkl', 'rb'))

In [161]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentence_embeddings = model.encode(movies_df['overview'].tolist())

# pickle.dump(sentence_embeddings,open('saved_models/sentence_embeddings.pkl','wb'))

In [173]:
pickle.dump(sentence_embeddings,open('saved_models/sentence_embeddings.pkl','wb'))

In [169]:
def get_similar_movies(overview):
    print(overview)
    user_input_embedding = model.encode([overview])
    similarities = cosine_similarity(user_input_embedding, sentence_embeddings)
    movies_list = sorted(list(enumerate(similarities[0])),reverse=True,key = lambda x: x[1])
    for i, el in movies_list[:11]:
        print(movies_df.iloc[i].title)

In [155]:
movies_df[movies_df.id == 6906]

Unnamed: 0,id,title,genres,overview,poster_path,tags


In [165]:
movies_df.iloc[6906]

id                                                         13007
title                                                 Religulous
genres                                     [Comedy, Documentary]
overview       Commentator-comic Bill Maher plays devil's adv...
poster_path                     /go7QqWheQdcJVzfFIEkxcecW0BQ.jpg
tags           commentator-comic bill maher plays devil advoc...
Name: 6906, dtype: object

In [177]:
movies_df.head()

Unnamed: 0,id,title,genres,overview,poster_path,tags
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,led woody andy toys live happily room andy bir...
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,siblings judy peter discover enchanted board g...
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,family wedding reignites ancient feud next-doo...
3,31357,Waiting to Exhale,"[Comedy, Drama, Romance]","Cheated on, mistreated and stepped on, the wom...",/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,cheated mistreated stepped women holding breat...
4,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,/e64sOI48hQXyru7naBFyssKFxVd.jpg,george banks recovered daughter wedding receiv...


In [179]:
get_similar_movies(movies_df[movies_df.id == 862].overview.values[0])

Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
Toy Story
Toy Story 3
Toy Story 2
Child's Play 3
Firestarter
Radio Days
The Kings of Summer
Pinocchio
Luxo Jr.
A Very Harold & Kumar Christmas
Gremlins 2: The New Batch


In [145]:
genres_list = movies_df.genres.to_list()
genre_count = {}
for genres in genres_list:
  genre_list = genres
  for genre in genre_list:
    if genre in genre_count.keys():
      genre_count[genre] += 1
    else:
      genre_count[genre] = 1
print(f"Number of Genres: {len(genre_count)}")
print(genre_count)

Number of Genres: 20
{'Animation': 440, 'Comedy': 3367, 'Family': 857, 'Adventure': 1195, 'Fantasy': 749, 'Romance': 1846, 'Drama': 4635, 'Action': 1749, 'Crime': 1254, 'Thriller': 2016, 'Horror': 916, 'History': 344, 'ScienceFiction': 858, 'Mystery': 651, 'War': 271, 'Foreign': 100, 'Music': 451, 'Documentary': 460, 'Western': 172, 'TVMovie': 36}


In [377]:
genre_master = list(genre_count.keys())

In [379]:
threshold = int(len(movies_df) * 0.01)
rare_genres = [key for key, value in genre_count.items() if value < threshold]
len(rare_genres), rare_genres[:5]

(1, ['TVMovie'])

In [363]:
movies_df.drop('TVMovie', axis=1, inplace=True)

In [385]:
encode_genre_types = { key: idx for idx, (key, value) in enumerate(genre_count.items())}

In [395]:
categorical_genre_list = []
genres_list = movies_df.genres.to_list()

for all_genres in genres_list:
  categorical_list = [0] * len(encode_genre_types)
  # genres = ast.literal_eval(all_genres)
  for genre in all_genres:
    genre_type_index = encode_genre_types[genre] 
    categorical_list[genre_type_index] = 1
  categorical_genre_list.append(categorical_list)

categorical_genre_list[3]

[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [229]:
movies_df['genre_cat_list'] = categorical_genre_list
movies_df.head()

Unnamed: 0,id,title,genres,overview,poster_path,tags,genre_cat_list
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,led woody andy toys live happily room andy bir...,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,siblings judy peter discover enchanted board g...,"[0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,family wedding reignites ancient feud next-doo...,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,31357,Waiting to Exhale,"[Comedy, Drama, Romance]","Cheated on, mistreated and stepped on, the wom...",/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,cheated mistreated stepped women holding breat...,"[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,/e64sOI48hQXyru7naBFyssKFxVd.jpg,george banks recovered daughter wedding receiv...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [231]:
labels = list(encode_genre_types.keys())

In [233]:
movies_df_fine_tune = movies_df[['overview', 'genre_cat_list']]

In [429]:
movies_df[movies_df.title == 'Commando']

Unnamed: 0,id,title,genres,overview,poster_path,tags,genre_cat_list,Animation,Comedy,Family,...,Thriller,Horror,History,ScienceFiction,Mystery,War,Foreign,Music,Documentary,Western
4732,10999,Commando,"[Action, Adventure, Thriller]","John Matrix, the former leader of a special co...",/ggVVcXvlLqFOK6lEkD8G2aDarDb.jpg,john matrix former leader special commando str...,"[0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...",0.021334,0.045592,0.014806,...,0.873716,0.018824,0.022438,0.135359,0.025632,0.048034,0.011011,0.003906,0.022,0.026642


In [241]:
movies_df_fine_tune.shape

(9062, 2)

In [247]:
# Fine-tune BERT for multi-label genre classification (each overview can belong to several genres)

# import torch
# from torch.utils.data.dataset import Dataset
# from sklearn.model_selection import train_test_split

# from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# device = torch.device('cpu')

# def to_float(row):
#     new_list = []
#     l = ast.literal_eval(row)
#     for el in l:
#         new_list.append(float(el))
#     return new_list

# texts = list(movies_df_fine_tune['overview'])
# data['genre_cat_list'] = movies_df_fine_tune['genre_cat_list'].apply(to_float)
# labels = list(movies_df_fine_tune['genre_cat_list'])

# train_texts, eval_texts, train_labels, eval_labels = train_test_split(list(movies_df_fine_tune['overview']), list(movies_df_fine_tune['genre_cat_list']), test_size = 0.20, random_state = 0)

# train_texts = texts[:8000]
# train_labels = labels[:8000]

# eval_texts = texts[2:4]
# eval_labels = labels[2:4]

# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# train_encodings = tokenizer(train_texts, padding="max_length", truncation=True, max_length=512)
# eval_encodings = tokenizer(eval_texts, padding="max_length", truncation=True, max_length=512)


# class TextClassifierDataset(Dataset):
#     def __init__(self, encodings, labels):
#         self.encodings = encodings
#         self.labels = labels

#     def __len__(self):
#         return len(self.labels)

#     def __getitem__(self, idx):
#         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
#         item["labels"] = torch.tensor(self.labels[idx])
#         return item

# train_dataset = TextClassifierDataset(train_encodings, train_labels)
# eval_dataset = TextClassifierDataset(eval_encodings, eval_labels)

# model = AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-uncased", 
#     problem_type="multi_label_classification",
#     num_labels=len(encode_genre_types)  # one output per genre (20 here), matching the length of genre_cat_list
# ).to(device)

# training_arguments = TrainingArguments(
#     output_dir=".",
#     evaluation_strategy="epoch",
#     per_device_train_batch_size=16,
#     per_device_eval_batch_size=16,
#     num_train_epochs=5,
# )

# trainer = Trainer(
#     model=model,
#     args=training_arguments,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
# )

# trainer.train()
# print(trainer.evaluate())

In [281]:
for el in genre_master:
    movies_df[el] = 0

In [261]:
from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "multilabel_classifier"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device='cpu')

In [343]:
#Sample Prediction
sentence = movies_df.overview[0]
prediction = pipe(sentence, return_all_scores=True)
print(prediction)

[[{'label': 'LABEL_0', 'score': 0.621728777885437}, {'label': 'LABEL_1', 'score': 0.8790881633758545}, {'label': 'LABEL_2', 'score': 0.9685618877410889}, {'label': 'LABEL_3', 'score': 0.19067753851413727}, {'label': 'LABEL_4', 'score': 0.22023531794548035}, {'label': 'LABEL_5', 'score': 0.013924052938818932}, {'label': 'LABEL_6', 'score': 0.08540508151054382}, {'label': 'LABEL_7', 'score': 0.02237039990723133}, {'label': 'LABEL_8', 'score': 0.01619631238281727}, {'label': 'LABEL_9', 'score': 0.0060385833494365215}, {'label': 'LABEL_10', 'score': 0.010727817192673683}, {'label': 'LABEL_11', 'score': 0.01163817010819912}, {'label': 'LABEL_12', 'score': 0.05880765989422798}, {'label': 'LABEL_13', 'score': 0.009957588277757168}, {'label': 'LABEL_14', 'score': 0.006109141279011965}, {'label': 'LABEL_15', 'score': 0.017491415143013}, {'label': 'LABEL_16', 'score': 0.056412529200315475}, {'label': 'LABEL_17', 'score': 0.02455463632941246}, {'label': 'LABEL_18', 'score': 0.01060547586530447}, 
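Note that the scores in the sample prediction add up to more than 1. This is consistent with a multi-label setup, where each label is scored independently (a sigmoid per label) instead of sharing one softmax across labels. Here is a minimal sketch of how such per-label scores can be turned into genre names. The logits, the 0.5 threshold and the helper `scores_to_genres` are assumptions for illustration, and the label order is assumed to follow `genre_master`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scores_to_genres(logits, label_names, threshold=0.5):
    """Keep every genre whose independent sigmoid score clears the threshold."""
    scores = [sigmoid(z) for z in logits]
    return [name for name, s in zip(label_names, scores) if s >= threshold]

# First four genres in the order of genre_master; the logits are made up
label_names = ["Animation", "Comedy", "Family", "Adventure"]
logits = [0.5, 2.0, 3.4, -1.4]

scores = [sigmoid(z) for z in logits]
picked = scores_to_genres(logits, label_names)
print([round(s, 3) for s in scores], picked)
```

Because each label is thresholded on its own, a movie can end up with several genres, or with none at all if no score clears the threshold.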

In [293]:
for i, row in movies_df.iterrows():
    prediction = pipe(row['overview'], return_all_scores=True)
    for j, el in enumerate(prediction[0]):
        column, score = (genre_master[j], el['score'])
        movies_df.at[i, column] = score

pickle.dump(movies_df,open('saved_models/movie_genre_info.pkl','wb'))

In [353]:
movies_df.head()

Unnamed: 0,id,title,genres,overview,poster_path,tags,genre_cat_list,Animation,Comedy,Family,...,Thriller,Horror,History,ScienceFiction,Mystery,War,Foreign,Music,Documentary,Western
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,led woody andy toys live happily room andy bir...,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.621729,0.879088,0.968562,...,0.006039,0.010728,0.011638,0.058808,0.009958,0.006109,0.017491,0.056413,0.024555,0.010605
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,siblings judy peter discover enchanted board g...,"[0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.621093,0.051024,0.929004,...,0.029687,0.01909,0.010904,0.097209,0.025498,0.008035,0.012667,0.020942,0.010476,0.011537
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,family wedding reignites ancient feud next-doo...,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.006236,0.996229,0.026126,...,0.010073,0.00759,0.004417,0.015397,0.016571,0.006225,0.017058,0.053122,0.006553,0.01032
3,31357,Waiting to Exhale,"[Comedy, Drama, Romance]","Cheated on, mistreated and stepped on, the wom...",/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,cheated mistreated stepped women holding breat...,"[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.001802,0.884559,0.01172,...,0.007475,0.00178,0.005685,0.004287,0.00951,0.006064,0.01188,0.013356,0.001498,0.002335
4,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,/e64sOI48hQXyru7naBFyssKFxVd.jpg,george banks recovered daughter wedding receiv...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.00938,0.995485,0.144746,...,0.009028,0.001771,0.001669,0.005646,0.004504,0.001488,0.003681,0.00727,0.002216,0.002723


In [423]:
def get_similar_genre_movies(genres_list):
    # Rank movies by the sum of their predicted scores over all liked genres
    # (works for any number of genres and leaves movies_df unmodified)
    scores = movies_df[genres_list].sum(axis=1)
    ranked = movies_df.assign(calculated=scores)
    return ranked.sort_values(by='calculated', ascending=False).head(10)

In [425]:
liked_genres = ['Horror', 'Thriller']
get_similar_genre_movies(liked_genres)

Unnamed: 0,id,title,genres,overview,poster_path,tags,genre_cat_list,Animation,Comedy,Family,...,Horror,History,ScienceFiction,Mystery,War,Foreign,Music,Documentary,Western,calculated
7462,28355,Case 39,"[Horror, Mystery, Thriller]","In her many years as a social worker, Emily Je...",/xSyj4F1UVtAXhqSGpNCZRaBsKAJ.jpg,many years social worker emily jenkins believe...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.013091,0.011754,0.006618,...,0.951735,0.007912,0.091432,0.605933,0.007473,0.011623,0.008065,0.019574,0.011663,1.909792
7125,1977,The Grudge 3,"[Mystery, Horror, Thriller]","Jake, the sole survivor of The Grudge 2 massac...",/swFCNfNo6hgXygRJ7l1Y2aYKKky.jpg,jake sole survivor grudge 2 massacre tortured ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.01241,0.006254,0.005388,...,0.928228,0.013559,0.079112,0.746017,0.014477,0.017278,0.010276,0.026302,0.01586,1.895187
4239,12590,Below,"[Thriller, Horror, Mystery]",In the dark silence of the sea during World Wa...,/5N0yv9QVvpp25JAlC5x1DA7raxr.jpg,dark silence sea world war ii submarine u.s.s ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.015817,0.008428,0.006501,...,0.923976,0.012537,0.148574,0.744188,0.011963,0.01483,0.011911,0.025302,0.016396,1.889515
3376,12484,The Forsaken,"[Action, Adventure, Horror, Thriller]",A young man is in a race against time as he se...,/hKwscHAZKrjeOSF7Bvksejpt26m.jpg,young man race time searches cure becoming inf...,"[0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, ...",0.036349,0.018162,0.00718,...,0.952607,0.011662,0.594517,0.17115,0.021292,0.015408,0.006496,0.019671,0.018718,1.885292
2272,23761,Alice Sweet Alice,"[Horror, Mystery, Thriller]",Alice is a withdrawn 12-year-old who lives wit...,/scjfMHHjZRKkmhoqPtNJmMfpcs4.jpg,alice withdrawn 12-year-old lives mother young...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.012242,0.008606,0.007153,...,0.941647,0.011812,0.072551,0.699838,0.008336,0.014232,0.011013,0.025731,0.012762,1.884899
4064,26725,Nomads,"[Thriller, Horror, Mystery]",French anthropologist Jean-Charles Pommier and...,/mRWUTfecqnLMAQhAe7b1ikMPCER.jpg,french anthropologist jean-charles pommier wif...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.013582,0.021943,0.007278,...,0.956649,0.00735,0.175299,0.600617,0.00688,0.011716,0.01307,0.025445,0.012776,1.884715
1531,10225,Friday the 13th Part VI: Jason Lives,"[Horror, Mystery, Thriller]","As a child, Tommy killed mass-murderer Jason. ...",/le636yjPeWkC74nPDIZiQBH1CVt.jpg,child tommy killed mass-murderer jason . years...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.008602,0.011068,0.004178,...,0.917811,0.008632,0.080117,0.6635,0.008117,0.011606,0.006545,0.016485,0.007893,1.88243
1086,794,The Omen,"[Horror, Thriller]","Immediately after their miscarriage, the US di...",/uY14zS4Sm2DdvFXeczFKgLgkQUP.jpg,immediately miscarriage us diplomat robert tho...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...",0.00868,0.012017,0.003465,...,0.954271,0.009507,0.046312,0.360862,0.008632,0.010273,0.006128,0.016503,0.00929,1.880728
7617,41215,Black Death,"[Drama, Horror, Action, Thriller, Mystery]","As the plague decimates medieval Europe, rumor...",/gXRERDpyT9s3m2yk6wNmrTWbZfG.jpg,plague decimates medieval europe rumors circul...,"[0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, ...",0.024421,0.003346,0.006978,...,0.923888,0.043109,0.097536,0.706329,0.043224,0.029732,0.012124,0.017815,0.046312,1.879824
6483,9708,The Wicker Man,"[Mystery, Thriller, Horror]",A sheriff investigating the disappearance of a...,/aeQ65vYvOCRmeO0uaVnjHOiDrXZ.jpg,sheriff investigating disappearance young girl...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",0.013911,0.012557,0.006678,...,0.926044,0.013499,0.09754,0.845527,0.010503,0.020881,0.016054,0.025858,0.030354,1.879554
