## **Exploratory analysis and prediction on the "TMDB 5000 Movie Dataset" dataset**

***Authors: Bava Flavio 4836427 , Ciarlo Francesco 4640121, Oldrini Edoardo 4055097***

The following data analysis studies how to approach the production of a movie.<br><br>
This notebook is organized as follows:
* Dataset checking and preparation
* Initial exploration of the dataset
* Proposed predictive models based on the previous observations

**Importing libraries and dataset**

In [None]:
#libraries
import matplotlib.pyplot as plt  
import numpy as np 
import pandas as pd 
import seaborn as sns 
import plotly.graph_objs as go
import plotly.offline as py
from ast import literal_eval
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings 

#initial settings
warnings.filterwarnings('ignore') 
pd.set_option('display.max_columns',10000)

In [None]:
Movies = pd.read_csv('input/tmdb_5000_movies.csv')

### **Exploratory analysis of the dataset**

#### First, we get an overall picture of the available dataset, in particular its dimensions and the structure of its entries:

In [None]:
print("Dataset has {} rows and {} columns".format(Movies.shape[0],Movies.shape[1]))

In [None]:
Movies.head(2)

Columns such as homepage, spoken_languages and title are useless or redundant, so we drop them.

In [None]:
Movies.drop(['homepage','spoken_languages','title'],inplace=True,axis='columns')

#### We check whether the data types are coherent

In [None]:
Movies.info()

Data types are coherent with the information they represent

#### We check if any null values are in the dataset

In [None]:
Movies.isnull().sum()

The tagline column has a large number of null values; we will handle them when we work on this feature.

#### Some columns are in JSON format, so we convert them to lists

Definition of auxiliary functions

In [None]:
def get_name(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        return names
    return []

def get_ISO(x):
    if isinstance(x, list):
        isos = [i['iso_3166_1'] for i in x]
        return isos
    return []

Conversion

In [None]:
feat_to_manage = ['genres','keywords','production_countries','production_companies']
for f in feat_to_manage:
    Movies[f] = Movies[f].apply(literal_eval)

In [None]:
#Turn genres into list
Movies['genres'] = Movies['genres'].apply(get_name)
#Turn prod_countries into list
Movies['production_countries'] = Movies['production_countries'].apply(get_ISO)
#Turn prod_companies into list
Movies['production_companies'] = Movies['production_companies'].apply(get_name)
#Turn keywords into list
Movies['keywords'] = Movies['keywords'].apply(get_name)
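
To make the conversion concrete, here is a minimal sketch on a single hand-written cell. The sample string is an assumption that imitates the layout of the `genres` column; it is not copied from the dataset.

```python
from ast import literal_eval

# Hypothetical raw cell, written to imitate the JSON-like strings stored in the CSV
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely parses the string into a list of dictionaries
parsed = literal_eval(raw_genres)

# Same logic as get_name: keep only the 'name' field of every entry
names = [entry['name'] for entry in parsed] if isinstance(parsed, list) else []

print(names)
```

Anything that is not a list (for example a missing value) falls back to an empty list, which keeps later list operations safe.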

#### Some movies are in post-production or are only rumored, but we want to work only on released movies

In [None]:
Movies = Movies.query('status == "Released"')

## **Dataset Analysis**

### Numerical features

#### Budget

##### Let's take a first look at the budget feature

In [None]:
Movies['budget'].describe()

The minimum value of the budget feature is 0. We treat any budget below 10,000 as invalid and set it to NaN, so that only budgets of at least 10,000 are kept.

In [None]:
# Budgets below 10,000 are treated as invalid and replaced with NaN
Movies['budget'] = Movies['budget'].mask(Movies['budget'] < 10000)
        
Movies['budget'].describe()

##### Now that the budgets are acceptable, we divide the movies into three classes by budget:
- Low: 1.000000e+04 <= x <= 8.975000e+06 (class 1) 
- Medium: 8.975000e+06 < x <= 5.000000e+07 (class 2)
- High: 5.000000e+07 < x <= 3.800000e+08 (class 3)

In [None]:
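bins = [1.000000e+04, 8.975000e+06, 5.000000e+07, 3.800000e+08]
labels=[1,2,3]
# include_lowest=True keeps budgets of exactly 10,000 in class 1, as stated in the class definition
Movies['budget_class'] = pd.cut(Movies['budget'],bins=bins,labels=labels,include_lowest=True)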
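
How `pd.cut` treats the bin edges matters for budgets that fall exactly on a boundary. The sketch below uses a few made-up budgets (not dataset values) to show that intervals are closed on the right by default, and that the lowest edge is only included when `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up budgets, chosen to sit on or next to the bin edges
budgets = pd.Series([1.0e4, 8.975e6, 8.975e6 + 1, 5.0e7, 3.8e8])
bins = [1.0e4, 8.975e6, 5.0e7, 3.8e8]

# Default behaviour: intervals are (left, right], so the lowest edge is excluded
default_classes = pd.cut(budgets, bins=bins, labels=[1, 2, 3])

# include_lowest=True makes the first interval closed on the left as well
inclusive_classes = pd.cut(budgets, bins=bins, labels=[1, 2, 3], include_lowest=True)

print(default_classes.tolist())
print(inclusive_classes.tolist())
```

With the default settings the first budget receives no class at all, which is easy to miss because `pd.cut` silently returns NaN for values outside every interval.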


Let's plot the number of movies in each budget class

In [None]:
ax = Movies['budget_class'].value_counts().sort_values(ascending=True).plot(kind='bar')
ax.set_xlabel("Budget Class")
ax.set_ylabel("Quantity")
plt.xticks(rotation="horizontal")
plt.show()

The graph is coherent with the divisions we made

##### Let's check the distribution of budgets:

In [None]:

sns.distplot(Movies['budget'])
sns.set(rc={'figure.figsize':(12,6)})
plt.suptitle('Budget distribution')
plt.show()

print("Budget skewness: ",Movies['budget'].skew())


The distribution is skewed, which could be a problem for the machine learning algorithm, so we try to reduce the skewness with a Box-Cox power transform

In [None]:
from sklearn.preprocessing import PowerTransformer

In [None]:
pt = PowerTransformer(method='box-cox',standardize=False)
Movies['transf_budget'] = pt.fit_transform(Movies[['budget']])

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
sns.distplot(Movies['budget'],ax=ax1)
sns.distplot(Movies['transf_budget'],ax=ax2)
fig.suptitle("Comparison between budget and transf_budget")
plt.show()




In [None]:
print("Skewness before Box-Cox : {} and Skewness after Box-Cox : {}".format(Movies['budget'].skew(),Movies['transf_budget'].skew()))
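
As an illustration of why a power transform helps, the sketch below applies the Box-Cox formula by hand to synthetic, strongly right-skewed data. The data and the fixed `lam = 0` (the logarithmic special case) are assumptions made for the example; `PowerTransformer` instead estimates the exponent from the data.

```python
import numpy as np
import pandas as pd

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x: log(x) if lam == 0, else (x**lam - 1) / lam."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Synthetic right-skewed sample (not taken from the dataset)
values = pd.Series(np.exp(np.linspace(0.0, 10.0, 101)))
transformed = pd.Series(box_cox(values, lam=0))

skew_before = values.skew()
skew_after = transformed.skew()
print("Skewness before:", skew_before, "after:", skew_after)
```

The transform is only defined for strictly positive inputs, which is one reason to replace zero budgets and revenues with NaN first.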

## Revenues ##

In [None]:
print('Movies with 0$ revenues: ',Movies[Movies['revenue'] == 0].shape[0])

We also set these to NaN

In [None]:
# Revenues equal to 0 are treated as missing and replaced with NaN
Movies['revenue'] = Movies['revenue'].mask(Movies['revenue'] == 0)

In [None]:
Movies['revenue'].isna().sum()

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox',standardize=False)

In [None]:
Movies['transf_revenue'] = pt.fit_transform(Movies[['revenue']])
fig , (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

sns.distplot(Movies['revenue'],ax=ax1)
sns.distplot(Movies['transf_revenue'],ax=ax2)
fig.suptitle("Comparison between revenue and transf_revenue")
plt.show()

In [None]:
print("Skewness before Box-Cox : {} and Skewness after Box-Cox : {}".format(Movies['revenue'].skew(),Movies['transf_revenue'].skew()))

## Score ##

We want to exclude films with a low vote_count (number of votes received), because they would create imbalances:
a film rated 8 by only 5 people is not reliable

In [None]:
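#TODO: we could also do this on the original dataset by adding a column; half of the rows would be NaN
C = Movies['vote_average'].mean()        # mean vote across all movies
m = Movies['vote_count'].quantile(0.5)   # median number of votes, used as the threshold

# Movies with at most m votes get a NaN vote_count, so their score will be NaN as well
Movies['vote_count'] = Movies['vote_count'].mask(Movies['vote_count'] <= m)

# Work on a copy so that adding the score column does not write into a view of Movies
q_movies = Movies[['id','vote_count','vote_average']].copy()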

We consider it important that the average vote be weighted by the number of votes behind it. To do this we use the weighted-rating formula recommended by IMDb and combine the vote_average and vote_count columns into a single score column

In [None]:
#Weighted Rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
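
A quick worked example shows how the formula pulls ratings backed by few votes toward the global mean. The numbers below (`m = 100`, `C = 6.0` and the two films) are invented for illustration and are not the values computed from the dataset.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating: v votes with mean R, vote threshold m, global mean C."""
    return (v / (v + m)) * R + (m / (m + v)) * C

m_example = 100   # assumed vote threshold
C_example = 6.0   # assumed mean vote over all films

# Both films have an average vote of 8.0, but very different vote counts
few_votes = weighted_rating_value(v=5, R=8.0, m=m_example, C=C_example)
many_votes = weighted_rating_value(v=5000, R=8.0, m=m_example, C=C_example)

print(round(few_votes, 2))   # stays close to the global mean C
print(round(many_votes, 2))  # stays close to the film's own average R
```

The two weights always sum to 1, so the result is a convex combination of the film's own average and the global mean.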

In [None]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

q_movies.drop(['vote_average','vote_count'],inplace=True,axis='columns')

#Dataframe merge, now Movies also has the score column
Movies = pd.merge(Movies,q_movies,on='id',how='inner')

Movies.shape

Films with too few votes to be taken into consideration receive a NaN score in q_movies, and are dropped later when rows with missing values are removed. After merging the two dataframes, Movies also has the score column

## Profits ##

We measure the revenues against the budget through a function, and add a new column called 'profit_perc' containing the results

In [None]:
def calculate_profit_perc(x):
    if (x.revenue>0) and (x.budget>0):
        return ((x.revenue-x.budget)/x.budget)*100
    

In [None]:
# Rows without a valid revenue or budget get NaN as profit percentage
Movies['profit_perc'] = Movies.apply(calculate_profit_perc, axis=1)

We define a function which, given a genre, calculates the average of the profits of the films that contain it

In [None]:
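# NOTE: temp is an alias of Movies (not a copy), so the in-place dropna below also removes rows from Movies
temp = Movies
temp.dropna(axis=0,inplace=True)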

In [None]:
#How many movies are there by genre with a profit other than NaN
def films_per_genres(genre):
    count = 0
    for row in temp.index:
        # pd.notna is required here: `x != np.nan` is always True
        if genre in temp.loc[row, 'genres'] and pd.notna(temp.loc[row,'profit_perc']):
            count+=1
    return count


#Average profit by genre
def genre_average_profits(genre):
    total = 0   # named total to avoid shadowing the built-in sum()
    count = 0
    for row in temp.index:
        if genre in temp.loc[row, 'genres'] and pd.notna(temp.loc[row,'profit_perc']):
            total += temp.loc[row, 'profit_perc']
            count+=1
    return total/count



In [None]:
genres=[]
for row in temp.index:
    _gen = temp.loc[row,'genres']
    for g in _gen:
        if g not in genres:
            genres.append(g)

profits=[]
for g in genres:
    profits.append(genre_average_profits(g))
            
print("Profits for each genre:")
for i in range(0,len(genres)):
    print('\t',genres[i], " has a mean profit of ", profits[i])
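
The same per-genre averages can also be obtained without explicit loops. The sketch below is one possible vectorized alternative on a tiny invented dataframe (the titles and profit values are assumptions): `explode` produces one row per (film, genre) pair, and `groupby` then computes the mean and the count for each genre.

```python
import pandas as pd

# Tiny invented sample with the same column layout as the analysis dataframe
sample = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'genres': [['Horror', 'Thriller'], ['Horror'], ['Drama', 'Thriller']],
    'profit_perc': [300.0, 100.0, 50.0],
})

# One row per (film, genre) pair
exploded = sample.explode('genres')

# Mean profit and number of films for each genre (NaN profits are skipped by mean/count)
per_genre = exploded.groupby('genres')['profit_perc'].agg(['mean', 'count'])
print(per_genre)
```

Reporting the count next to the mean makes it easier to spot genres whose average rests on only a handful of films.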

We can see that some genres have a very high average profit. This happens when, as for Horror films for example, the films earned a lot but there are few of them in the dataset, so let's look at the number of films in each genre

In [None]:
films = []
for el in genres:
    films.append(films_per_genres(el))
    
print("Genres in the dataframe are:")
for genre,film in zip(genres,films):
    print(genre,film)

Documentaries, for example, are very few compared to the total number of films and have a percentage profit of around 6119; for the reason mentioned above, they heavily skew the comparison. We can afford to exclude them, since there are only 9

In [None]:
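# Remove the last genre (Documentary) together with its average profit.
# pop() removes by position, while remove(value) would delete the first matching value.
genres.pop()
profits.pop()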

In [None]:
fig, ax = plt.subplots()
ax.bar(genres,profits)
fig.set_figwidth(27)
fig.set_figheight(13)
plt.xticks(rotation=45,fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Genres", fontsize=20)
plt.ylabel("Profits", fontsize=20)
plt.show()

In [None]:
log_profit = np.log1p(profits)
fig, ax = plt.subplots()
ax.bar(genres,log_profit)
fig.set_figwidth(27)
fig.set_figheight(13)
plt.xticks(rotation=45,fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Genres", fontsize=20)
plt.ylabel("Log Profits", fontsize=20)
plt.show()

## Keywords ##

Let's see which words are most frequent among the keywords

In [None]:
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")



plt.subplots(figsize=(12,12))
stop_words=set(stopwords.words('english'))
stop_words.update(',',';','!','?','.','(',')','$','#','+',':','...',' ','')

# dropna() returns a new Series; calling it in place on a selected column would not affect Movies
keywords_converted = Movies['keywords'].dropna().astype(str)
words=keywords_converted.apply(nltk.word_tokenize)
word=[]
for i in words:
    word.extend(i)
word=pd.Series(word)
word=([i for i in word.str.lower() if i not in stop_words])
wc = WordCloud(background_color="black", max_words=2000,stopwords=STOPWORDS, max_font_size= 60,width=1000,height=600)
wc.generate(" ".join(word))
plt.imshow(wc)
plt.axis('off')
fig=plt.gcf()
fig.set_size_inches(10,10)
plt.show()


## Release Date ##

In [None]:
#Convert the release date to datetime format
temp['release_date'] = pd.to_datetime(temp['release_date'])
temp['release_year'] = temp['release_date'].dt.year
temp['release_month'] = temp['release_date'].dt.month
#release_day is the day of the week (Monday=0, Sunday=6)
temp['release_day'] = temp['release_date'].dt.dayofweek

In [None]:
temp['decades'] = temp['release_date'].apply(lambda x : (x.year // 10)*10)
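
Note that `dt.dayofweek` returns the day of the week (Monday is 0, Sunday is 6), not the day of the month. A minimal sketch on a single hand-picked date illustrates the derived features:

```python
import pandas as pd

# A single hand-picked date used only for illustration
dates = pd.to_datetime(pd.Series(['2009-12-10']))

year = dates.dt.year.iloc[0]
month = dates.dt.month.iloc[0]
weekday = dates.dt.dayofweek.iloc[0]   # Monday=0 ... Sunday=6
decade = (year // 10) * 10             # same integer arithmetic as the decades column

print(year, month, weekday, decade)
```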

Number of films released in each month of the year

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x='release_month',data=temp)
plt.suptitle("Films released every month")
plt.show()

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x='release_day',data=temp)
plt.suptitle("Films released by day of the week")
plt.show()

In [None]:
ax = sns.scatterplot(x="release_date",y="revenue",data=temp,hue="budget_class")
sns.set(rc = {'figure.figsize':(6,6)})
ax.set_title("Revenue during year")
ax.set_xlabel("Release Year")
ax.set_ylabel("Revenue")
plt.show()

The graph above shows how revenues have increased over time, as have budgets invested

In [None]:
d2 = temp.groupby(['release_month'])['revenue'].mean()
data = [go.Scatter(x=d2.index, y=d2.values, name='mean revenue', yaxis='y')]
layout = go.Layout(dict(title = "Average revenue per month",
                  xaxis = dict(title = 'Month'),
                  # the trace is drawn on the primary y axis, so the title goes on yaxis
                  yaxis = dict(title = 'Average revenue')
                  ),legend=dict(
                orientation="v"))
py.iplot(dict(data=data, layout=layout))

In [None]:
d2 = temp.groupby(['release_day'])['revenue'].mean()
data = [go.Scatter(x=d2.index, y=d2.values, name='mean revenue', yaxis='y')]
layout = go.Layout(dict(title = "Average revenue per day of the week",
                  xaxis = dict(title = 'Day of the week'),
                  # the trace is drawn on the primary y axis, so the title goes on yaxis
                  yaxis = dict(title = 'Average revenue')
                  ),legend=dict(
                orientation="v"))
py.iplot(dict(data=data, layout=layout))

## Popularity ##

In [None]:
temp['log_popularity'] = np.log(temp['popularity'])

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

sns.distplot(temp['popularity'],ax=ax1)
sns.distplot(temp['log_popularity'],ax=ax2)

fig.suptitle("Comparison between popularity and log_popularity skewness")

plt.show()


In [None]:
print("Skewness before log : {} and Skewness after log : {}".format(temp['popularity'].skew(),temp['log_popularity'].skew()))

In [None]:
sns.set(rc = {'figure.figsize':(12,6)})
ax = sns.scatterplot(x='popularity',y='revenue',data=temp,hue="budget_class")
ax.set_title("Popularity and revenues divided by budget_class")
plt.show()

# Cast & Directors #

In [None]:
temp.columns

In [None]:
cast = pd.read_csv("input/tmdb_5000_credits.csv")

In [None]:
feat_to_manage = ['cast','crew']
for f in feat_to_manage:
    cast[f] = cast[f].apply(literal_eval)

#Two functions that convert directors and actors from json to list-str
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

def get_actors(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        
        return names
    return []

In [None]:
#Create two new columns, correctly formatted
cast['director'] = cast['crew'].apply(get_director)
cast['actors'] = cast['cast'].apply(get_actors)

In [None]:
#Drop old columns
cast.drop('cast',inplace=True,axis=1)
cast.drop('crew',inplace=True,axis=1)
cast.drop('title',inplace=True,axis=1)

In [None]:
#Rename movie_id to id, preparing for the merge
cast = cast.rename(columns={'movie_id': 'id'})

#Merge the two dataframes temp and cast
full_df = pd.merge(temp,cast,on="id",how="inner")
recommend_df = full_df.copy()

## Actors ##

In [None]:
actors=[]


for i in full_df['actors']:
    actors.extend(i)

actors = list(filter(None, actors))


plt.subplots(figsize=(12,10))
ax=pd.Series(actors).value_counts()[:15].sort_values(ascending=True).plot.barh(width=0.9,color=sns.color_palette('inferno_r',40))
for i, v in enumerate(pd.Series(actors).value_counts()[:15].sort_values(ascending=True).values): 
    ax.text(.8, i, v,fontsize=10,color='black',weight='bold')

plt.title('Actors with the most appearances')
ax.patches[14].set_facecolor('r')
plt.show()

## Directors ##

In [None]:
directors=[]


for i in full_df['director']:
    directors.append(i)

directors = list(filter(None, directors))


plt.subplots(figsize=(12,10))
ax=pd.Series(directors).value_counts()[:14].sort_values(ascending=True).plot.barh(width=0.9,color=sns.color_palette('inferno_r',40))
for i, v in enumerate(pd.Series(directors).value_counts()[:14].sort_values(ascending=True).values): 
    ax.text(.8, i, v,fontsize=10,color='black',weight='bold')

plt.title('Directors with the most films')
ax.patches[13].set_facecolor('r')
plt.show()

We calculate the average score for the most prolific directors (those with at least 10 films in the dataset)

In [None]:
#Keep the directors with at least 10 films, then calculate their mean scores
director_group = full_df.groupby('director').filter(lambda x : len(x) >= 10)
mean_scores = director_group.groupby('director')['score'].mean().sort_values(ascending=False).reset_index(name="score")

In [None]:
fig, ax = plt.subplots()
ax.bar(mean_scores['director'],mean_scores['score'],color='navy')
fig.set_figwidth(20)
fig.set_figheight(13)
plt.xticks(rotation=45,fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Directors", fontsize=20)
plt.ylabel("Scores", fontsize=20)
plt.show()

# Predictions #

## Revenues ##

In [None]:
new_temp=pd.merge(temp,cast,on="id",how="inner")

In [None]:
import re

new_temp['production_companies'] = new_temp['production_companies'].apply(lambda x : x[0] if len(x) > 0 else None)
new_temp['production_companies'] = new_temp['production_companies'].apply(lambda x:re.sub('[^A-Za-z0-9_]+', '', str(x)))

new_temp['actors'] = new_temp['actors'].apply(lambda x : x[0:3] if len(x) > 0 else None)

In [None]:
#genres
#groupby(level=0).sum() is used instead of sum(level=0), which was removed in pandas 2.0
df=pd.DataFrame( {'genres': new_temp['genres']})
df= pd.get_dummies(df.genres.apply(pd.Series).stack()).groupby(level=0).sum()
new_temp = pd.concat([new_temp,df],axis = 1)

#production companies
df=pd.DataFrame( {'production_companies': new_temp['production_companies']})
df= pd.get_dummies(df.production_companies.apply(pd.Series).stack()).groupby(level=0).sum()
new_temp = pd.concat([new_temp,df],axis = 1)

#production countries
df=pd.DataFrame( {'production_countries': new_temp['production_countries']})
df= pd.get_dummies(df.production_countries.apply(pd.Series).stack()).groupby(level=0).sum()
new_temp = pd.concat([new_temp,df],axis = 1)


#actors
df=pd.DataFrame( {'actors': new_temp['actors']})
df= pd.get_dummies(df.actors.apply(pd.Series).stack()).groupby(level=0).sum()
new_temp = pd.concat([new_temp,df],axis = 1)

#director
df=pd.DataFrame( {'director': new_temp['director']})
df= pd.get_dummies(df.director.apply(pd.Series).stack()).groupby(level=0).sum()
new_temp = pd.concat([new_temp,df],axis = 1)
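
The encoding pattern used above can be seen on a small example. The sketch below is an equivalent formulation under stated assumptions: it uses `explode` instead of `apply(pd.Series).stack()`, and the three genre lists are invented.

```python
import pandas as pd

# Invented list-valued column: one list of genres per film
genres_col = pd.Series([['Action', 'Drama'], ['Drama'], []])

# explode gives one row per (film, genre); get_dummies one-hot encodes each row;
# grouping by the original index and summing rebuilds one multi-hot row per film
encoded = pd.get_dummies(genres_col.explode()).groupby(level=0).sum()

print(encoded)
```

A film with an empty list keeps its row and simply gets zeros in every column, so the encoded frame stays aligned with the original index.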


In [None]:
drop_columns=['budget','status','release_date','tagline', 'overview','vote_count','vote_average','original_title','original_language','id','revenue','profit_perc','genres', 'keywords','popularity','production_companies','production_countries','actors','director']
new_temp= new_temp.drop(drop_columns, axis=1)
new_temp = new_temp.loc[:,~new_temp.columns.duplicated()]

In [None]:
#Feature scaling 
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
new_temp['transf_revenue'] = scaler.fit_transform(new_temp[['transf_revenue']])
new_temp['transf_budget'] = scaler.fit_transform(new_temp[['transf_budget']])
new_temp['score'] = scaler.fit_transform(new_temp[['score']])
new_temp['runtime'] = scaler.fit_transform(new_temp[['runtime']])
new_temp['log_popularity'] = scaler.fit_transform(new_temp[['log_popularity']])
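
Min-max scaling maps each feature linearly onto the interval [0, 1]. A hand computation on a few invented values shows what `MinMaxScaler` does to a single column with its default settings:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

runtimes = [90, 120, 150, 210]   # invented runtimes in minutes
scaled = min_max_scale(runtimes)
print(scaled)
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses all the others toward one end of the range.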


In [None]:
#Rename columns to filter special characters
new_temp = new_temp.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', str(x)))

#Drop Walt Disney because it's duplicate
new_temp.drop('WaltDisney',inplace=True,axis=1)

In [None]:
from sklearn.model_selection import train_test_split
import optuna
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error

In [None]:
# Formatting for modeling
new_temp=new_temp.dropna()
y = new_temp['transf_revenue']
X = new_temp.drop(['transf_revenue'], axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

**Hyperparameter tuning**

## Optuna ##

In [None]:
def objective(trial):
   
    dtrain = lgb.Dataset(X_train, label=y_train)
    
    param = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "max_depth" : trial.suggest_int("max_depth",5, 9),
        "learning_rate" : trial.suggest_float("learning_rate",0.01,0.03),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "max_bin" : trial.suggest_int("max_bin",5,50),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "min_data_in_leaf" : trial.suggest_int("min_data_in_leaf",5,50),
        "n_estimators" : trial.suggest_int("n_estimators",500, 2000),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        
    }

    gbm = lgb.train(param, dtrain)
    preds = gbm.predict(X_valid)
    # The value returned to Optuna is the R² score on the validation set (to be maximized)
    r2 = r2_score(y_valid,preds)
    return r2

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

In [None]:
study.best_trial.value

In [None]:
opt_params = {}
for key, value in study.best_trial.params.items():
    opt_params[key] = [value]

opt_params

## GridSearchCV ##

In [None]:
from sklearn.model_selection import GridSearchCV

grid_params = {
         'n_estimators' : [1000],
         'num_leaves': [5,10],
         'objective': ['regression'],
         'max_depth': [9,7,10],
         'max_bins' : [10,20],
         'learning_rate': [0.01],
         "boosting": ["gbdt"],
         "feature_fraction": [0.9],
         "bagging_fraction": [0.9],
         "metric": ['r2'],
         "lambda_l1": [0.2],
         "verbosity" : [-1]

        }

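lgb_model = lgb.LGBMRegressor()

# NOTE: the search runs over opt_params (the best configuration found by Optuna), not over grid_params
gs = GridSearchCV(lgb_model,param_grid=opt_params,verbose=1,scoring="r2",refit="r2",cv=7,n_jobs=-1)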

gs.fit(X_train,y_train)

In [None]:
gs.best_score_


## LGBMRegressor ##

In [None]:
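# nthread is an alias of n_jobs in LightGBM, so only n_jobs is passed
lgb_model = lgb.LGBMRegressor(**gs.best_params_,n_jobs = -1)
lgb_model.fit(X_train, y_train, 
        eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
        # callbacks replace the verbose / early_stopping_rounds arguments, which LightGBM 4.0 removed from fit()
        callbacks=[lgb.log_evaluation(period=1), lgb.early_stopping(stopping_rounds=100)])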

y_pred = lgb_model.predict(X_valid)

In [None]:
from sklearn.metrics import r2_score 


def print_errors(real_value, predicted_value):
    print('\tMean absolute error:', mean_absolute_error(real_value, predicted_value))
    print('\tMean squared error', mean_squared_error(real_value, predicted_value))
    print('\tRMSE:', np.sqrt(mean_squared_error(real_value, predicted_value)))
    print('\tScore: ',r2_score(real_value,predicted_value))

print_errors(y_valid, y_pred)
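
For reference, the four quantities printed above can be computed by hand. The sketch below uses invented true and predicted values and plain Python, so that the definitions are explicit:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]   # invented targets
y_hat = [2.0, 5.0, 8.0, 9.0]    # invented predictions

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / n    # mean absolute error
mse = sum(e * e for e in errors) / n     # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because the target here is the transformed and rescaled revenue, the error values are expressed in that scale rather than in dollars.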

In [None]:
df = pd.DataFrame({'Real Values': y_valid, 'Predicted Values': y_pred})
df

In [None]:
lgb.plot_importance(lgb_model,max_num_features=20)
plt.show()

In [None]:
plt.scatter(y_valid, y_pred)
tmp = [min(np.concatenate((y_train,y_valid))),
       max(np.concatenate((y_train,y_valid)))]
plt.plot(tmp,tmp,'r')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()

## Score ##

Convert categorical features to lists of binary values

In [None]:
def binary(genre_list):
    binaryList = []
    
    for genre in genres:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList
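
As a small illustration, the same encoding applied to an invented vocabulary of four genres gives one 0/1 flag per known genre, in vocabulary order:

```python
# Invented vocabulary; in the notebook this role is played by the `genres` list
known_genres = ['Action', 'Comedy', 'Drama', 'Horror']

def to_binary(genre_list, vocabulary):
    """Return a 0/1 list with one entry per vocabulary item."""
    return [1 if item in genre_list else 0 for item in vocabulary]

encoded = to_binary(['Drama', 'Action'], known_genres)
print(encoded)
```

The order of the input list does not matter; only the order of the vocabulary determines the position of each flag.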

In [None]:
full_df['genres_bin'] = full_df['genres'].apply(lambda x: binary(x))
full_df['genres_bin'].head()

In [None]:
for i,j in zip(full_df['actors'],full_df.index):
    list2=[]
    list2=i[:4]
    full_df.loc[j,'actors']=str(list2)
full_df['actors']=full_df['actors'].str.strip('[]').str.replace(' ','').str.replace("'",'')
full_df['actors']=full_df['actors'].str.split(',')
for i,j in zip(full_df['actors'],full_df.index):
    list2=[]
    list2=i
    list2.sort()
    full_df.loc[j,'actors']=str(list2)
full_df['actors']=full_df['actors'].str.strip('[]').str.replace(' ','').str.replace("'",'')
full_df['actors']=full_df['actors'].str.split(',')

In [None]:
full_df['actors'].head(3)

In [None]:
actor_list=[]
for row in full_df.index:
    _actors = full_df.loc[row,'actors']
    for g in _actors:
        if g not in actor_list:
            actor_list.append(g)


def binary(cast_list):
    binaryList = []
    
    for genre in actor_list:
        if genre in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [None]:
full_df['actors_bin'] = full_df['actors'].apply(lambda x: binary(x))
full_df['actors_bin'].head(3)


In [None]:
# Each director is a single name (a string): wrap it in a one-element list,
# removing spaces and apostrophes as is done for the actors.
# Slicing the string with [:4] would keep only the first four characters of the name.
full_df['director'] = full_df['director'].apply(
    lambda d: [d.replace(' ', '').replace("'", '')] if isinstance(d, str) else [''])

In [None]:
director_list=[]
for row in full_df.index:
    _actors = full_df.loc[row,'director']
    for g in _actors:
        if g not in director_list:
            director_list.append(g)

def binary(directors):
    binaryList = []
    
    for direct in director_list:
        if direct in directors:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [None]:
full_df['director_bin'] = full_df['director'].apply(lambda x: binary(x))
full_df['director_bin'].head(3)

In [None]:
full_df['keywords'] = full_df['keywords'].astype(str)

full_df['keywords']=full_df['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
full_df['keywords']=full_df['keywords'].str.split(',')
for i,j in zip(full_df['keywords'],full_df.index):
    list2=[]
    list2=i[:4]
    full_df.loc[j,'keywords']=str(list2)
full_df['keywords']=full_df['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
full_df['keywords']=full_df['keywords'].str.split(',')
for i,j in zip(full_df['keywords'],full_df.index):
    list2=[]
    list2=i
    list2.sort()
    full_df.loc[j,'keywords']=str(list2)
full_df['keywords']=full_df['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
full_df['keywords']=full_df['keywords'].str.split(',')

In [None]:
words_list = []
for index, row in full_df.iterrows():
    # dedicated names are used so the global `genres` list is not overwritten
    keywords = row["keywords"]
    
    for keyword in keywords:
        if keyword not in words_list:
            words_list.append(keyword)

In [None]:
def binary(words):
    binaryList = []
    
    for genre in words_list:
        if genre in words:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [None]:
full_df['words_bin'] = full_df['keywords'].apply(lambda x: binary(x))
full_df['words_bin'].head(3)

In [None]:
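# Remove the movies with a vote_average of 0 and those without a director name
full_df = full_df[(full_df['vote_average']!=0)]
# director holds a list at this point, so the empty case is [''] rather than ''
full_df = full_df[full_df['director'].apply(lambda d: d != [''])]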

In [None]:
new_id = list(range(0,full_df.shape[0]))
full_df['new_id']=new_id
full_df=full_df[['original_title','genres','vote_average','score','genres_bin','actors_bin','new_id','director','director_bin','words_bin']]
full_df.head()

In [None]:
from scipy import spatial

def Similarity(movieId1, movieId2):
    a = full_df.iloc[movieId1]
    b = full_df.iloc[movieId2]
    
    genresA = a['genres_bin']
    genresB = b['genres_bin']
    
    genreDistance = spatial.distance.cosine(genresA, genresB)
    
    actorsA = a['actors_bin']
    actorsB = b['actors_bin']
    actorsDistance = spatial.distance.cosine(actorsA, actorsB)
    
    directA = a['director_bin']
    directB = b['director_bin']
    directDistance = spatial.distance.cosine(directA, directB)
    
    # Total distance: sum of the genre, director and actor cosine distances
    return genreDistance + directDistance + actorsDistance
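
Each term in `Similarity` is a cosine distance between two binary vectors: 0 when the vectors point in the same direction and 1 when they share no active entry. Here is a plain-Python sketch of the quantity computed by `spatial.distance.cosine`, on invented vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

same = cosine_distance([1, 0, 1, 0], [1, 0, 1, 0])      # identical genre sets
partial = cosine_distance([1, 0, 1, 0], [1, 1, 0, 0])   # one genre in common
disjoint = cosine_distance([1, 0, 0, 0], [0, 1, 0, 0])  # nothing in common

print(same, partial, disjoint)
```

Because three such distances are summed, the total ranges from 0 (same genres, actors and director) to 3 (nothing in common). The distance is undefined for an all-zero vector, for instance a film none of whose genres appear in the vocabulary.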

In [None]:
import operator

def score_prediction(name):
    new_movie=full_df[full_df['original_title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ',new_movie.original_title.values[0])
    def getNeighbors(baseMovie, K):
        distances = []
    
        for index, movie in full_df.iterrows():
            if movie['new_id'] != baseMovie['new_id'].values[0]:
                dist = Similarity(baseMovie['new_id'].values[0], movie['new_id'])
                distances.append((movie['new_id'], dist))
    
        distances.sort(key=operator.itemgetter(1))
        neighbors = []
    
        for x in range(K):
            neighbors.append(distances[x])
        return neighbors

    K = 10
    avgScore = 0
    neighbors = getNeighbors(new_movie, K)

    
    for neighbor in neighbors:
        # use the 'score' column by name, consistently with the actual rating printed below
        avgScore = avgScore+full_df.iloc[neighbor[0]]['score']
    
    print('\n')
    avgScore = avgScore/K
    print('The predicted rating for %s is: %f' %(new_movie['original_title'].values[0],avgScore))
    print('The actual rating for %s is %f' %(new_movie['original_title'].values[0],new_movie['score'].values[0]))

In [None]:
score_prediction('Green')
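
Stripped of the dataframe handling, the prediction is a k-nearest-neighbours average: sort the other films by distance and average the scores of the closest K. A minimal sketch on invented (distance, score) pairs:

```python
# Invented (distance to the query film, score) pairs for five candidate films
candidates = [(0.9, 5.0), (0.1, 7.0), (0.4, 6.0), (1.5, 3.0), (0.2, 8.0)]

K = 3  # number of neighbours to average (the notebook uses K = 10)

# Sort by distance and keep the K closest films
neighbors = sorted(candidates, key=lambda pair: pair[0])[:K]

# The prediction is the plain average of the neighbours' scores
predicted_score = sum(score for _, score in neighbors) / K
print(predicted_score)
```

All neighbours carry the same weight here; weighting each score by the inverse of its distance would be a possible variation, not something the notebook does.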

## Recommendation System (Overview, Title) ##

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(stop_words="english")
recommend_df['overview'] = recommend_df['overview'].fillna('')
tfidf_matr = tfidf.fit_transform(recommend_df['overview'])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matr,tfidf_matr)

In [None]:
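# Map each title to its row position; for duplicated titles keep the first occurrence,
# so that indices[title] always returns a single integer
indices = pd.Series(recommend_df.index, index=recommend_df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]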

In [None]:
#Recommendation using cosine similarity
def get_recommendations(title, cosine_sim=cosine_sim):
    # The parameter is named cosine_sim so it does not shadow sklearn's cosine_similarity function
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return recommend_df['original_title'].iloc[movie_indices]
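
The ranking step can be seen in isolation on a small similarity matrix. The 4x4 matrix and the titles below are invented; the sketch uses `np.argsort` as an alternative to sorting the enumerated list, and explicitly excludes the query film, which is always most similar to itself.

```python
import numpy as np

titles = ['A', 'B', 'C', 'D']          # invented titles
sim = np.array([                        # invented symmetric similarity matrix
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

def top_similar(idx, sim_matrix, n=2):
    """Indices of the n films most similar to film idx, excluding idx itself."""
    order = np.argsort(-sim_matrix[idx])      # descending similarity
    order = order[order != idx]               # drop the film itself
    return order[:n].tolist()

recommended = [titles[i] for i in top_similar(0, sim)]
print(recommended)
```

Dropping the query film by index rather than by position also covers the case where another film has an identical overview and ties for first place.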


In [None]:
get_recommendations('Tangled')

## Recommendation System (Keywords, Actors, Directors, Genres)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['actors', 'keywords', 'director', 'genres']


for feature in features:
    recommend_df[feature] = recommend_df[feature].apply(clean_data)

In [None]:
#We create a soup of metadata that will be fed to our CountVectorizer
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['actors']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
recommend_df['soup'] = recommend_df.apply(create_soup, axis=1)
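
To see what the metadata soup looks like, here is the same cleaning and joining applied to one invented record. Removing spaces turns each multi-word name into a single token, so that the vectorizer does not match two different people who merely share a first name.

```python
def clean(value):
    """Lower-case and strip spaces from a string or from every string in a list."""
    if isinstance(value, list):
        return [item.replace(" ", "").lower() for item in value]
    if isinstance(value, str):
        return value.replace(" ", "").lower()
    return ''

# Invented record with the four features used for the soup
record = {
    'keywords': ['space travel', 'alien'],
    'actors': ['Jane Doe', 'John Smith'],
    'director': 'Ann Lee',
    'genres': ['Science Fiction'],
}
cleaned = {key: clean(value) for key, value in record.items()}

soup = (' '.join(cleaned['keywords']) + ' ' + ' '.join(cleaned['actors'])
        + ' ' + cleaned['director'] + ' ' + ' '.join(cleaned['genres']))
print(soup)
```

Every token carries the same weight in a count-based representation, which is why the notebook uses `CountVectorizer` here instead of down-weighting frequent names with TF-IDF.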

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(recommend_df['soup'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
# Reset index of our main DataFrame and construct reverse mapping as before
recommend_df = recommend_df.reset_index()
indices = pd.Series(recommend_df.index, index=recommend_df['original_title'])

In [None]:
get_recommendations("Tangled", cosine_sim2)