# Cleaned Dataframe

## Input:
- In this notebook we work with two datasets, 'movies_metadata.csv' and 'credits.csv', which we found on Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset).
- 'movies_metadata.csv' contains the information about 45,000 movies that is important for our project, such as the title, release date and genres.
- 'credits.csv' contains information about the cast of these 45,000 movies.
- We also work with the Internet Movie Script Database (https://imsdb.com/), which contains the scripts of 1,093 movies.

## Output:
- Two newly created CSVs that contain only the movies for which we have both the movie script and the information from our original CSVs. This cleaned data will be easier to work with.

In [1]:
import pandas as pd 
import regex as re
import ssl
import os

In [2]:
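# loading both our original datasets (each CSV sits in a folder of the same name)
# os.path.join builds the path with the separator of the current operating system
df = pd.read_csv(os.path.join('credits.csv', 'credits.csv'))
df_2 = pd.read_csv(os.path.join('movies_metadata.csv', 'movies_metadata.csv'))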

FileNotFoundError: [Errno 2] No such file or directory: 'credits.csv\\credits.csv'
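
The `FileNotFoundError` above means that the relative path did not resolve from the notebook's working directory. One possible cause is the backslash in the path, which is a path separator only on Windows. A minimal sketch of a portable lookup could look like this; the helper name `find_csv` and the two candidate locations are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_csv(name, base="."):
    """Return the first existing candidate path for `name`, or None.

    Hypothetical helper: it assumes the CSV sits either in a folder of the
    same name (as in the paths used above) or directly in `base`.
    """
    base = Path(base)
    candidates = [base / name / name, base / name]
    for candidate in candidates:
        # is_file() is False for missing paths and for directories
        if candidate.is_file():
            return candidate
    return None
```

The result can be passed to `pd.read_csv` once it is not `None`, for example `find_csv("credits.csv")`.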

In [4]:
import ast

def file_name(movie):
    """
    Translates the name of the movie into the file name of the saved script and looks the script up.
  
    Parameters:
    movie (str): name of the movie for which we want to check whether we have the corresponding script
  
    Returns:
    str: the path where the script is saved, so that we can later easily access each movie (only if we have the script saved)
    None: if we don't have the corresponding script for the movie
    """
    
    # getting the actual file name of the movie
    
    # If the movie name starts with 'The', the script file has the 'the' after the rest of the title
    # (only the leading 'The ' is moved, other occurrences of 'The' stay untouched)
    new_movie = movie
    if new_movie.startswith("The "):
        new_movie = new_movie[len("The "):] + "the"
        
    # Script file names are lowercase and without spaces
    new_movie = new_movie.replace(" ", "").lower()

    genres = []
    # getting the genre entry of our current movie (meta_data has to be indexed by 'original_title')
    g = meta_data.at[movie, "genres"]
    
    # note that g is a string, so we parse it into a list of dictionaries first
    # (ast.literal_eval only accepts Python literals, unlike eval)
    for e in ast.literal_eval(g):
        # Because sometimes there is no genre name given, we use the try and except blocks to avoid error messages
        try:
            genres.append(e["name"])
        except (KeyError, TypeError):
            genres.append("")

    # because we have the movie scripts saved under a corresponding genre folder, we search for the movie script
    # in all of the genre folders that are listed for the movie
    for genre in genres:

        # creating a string with the path to the corresponding movie script (all file names have the same ending)
        path = "imsdb_scenes_dialogs_nov_2015/imsdb_scenes_dialogs_nov_2015/dialogs/{gen}/{name}_dialog.txt".format(gen=genre, name=new_movie)
        
        # if we have the script of the movie
        if os.path.exists(path):
            return path
            
    return None
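
The two transformations inside `file_name` can be tried out without the dataframes or the script folder. The following is a minimal sketch under assumptions: the helper names `script_stem` and `genre_names` are hypothetical, and the genre string is a made-up sample in the format of the 'genres' column.

```python
import ast


def script_stem(title):
    """Turn a movie title into the (assumed) stem of its script file name."""
    # A leading "The " moves to the end of the title: "The Matrix" -> "Matrixthe"
    if title.startswith("The "):
        title = title[len("The "):] + "the"
    # File names contain no spaces and are lowercase
    return title.replace(" ", "").lower()


def genre_names(raw):
    """Parse the string stored in the 'genres' column into a list of names."""
    # literal_eval only accepts Python literals, unlike eval
    entries = ast.literal_eval(raw)
    # Entries without a 'name' key are skipped
    return [entry["name"] for entry in entries if "name" in entry]


print(script_stem("The House on Carroll Street"))
# prints: houseoncarrollstreetthe
print(genre_names("[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]"))
# prints: ['Drama', 'Thriller']
```

Checking `startswith("The ")` with the trailing space keeps titles such as "Them" unchanged.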
    

In [9]:
# first we need a list of all the movies that we have in our dataframe
# we only want to have each movie in our list once, so we convert the titles to a set and back to a list
list_of_movies = list(set(df_2['original_title'].tolist()))
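
Converting to a set removes duplicates but does not keep the titles in their original order. If the order matters, one alternative is `dict.fromkeys`. A small sketch with a hand-written sample list:

```python
# Sample titles, written by hand for illustration
titles = ["Grace of Monaco", "Pocket Money", "Grace of Monaco", "Just for Kicks"]

# dict keys are unique and keep their insertion order (Python 3.7+)
unique_titles = list(dict.fromkeys(titles))

print(unique_titles)
# prints: ['Grace of Monaco', 'Pocket Money', 'Just for Kicks']
```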

In [11]:
# We want to insert an extra column in our cast dataset
# Because we don't want to change the original datasets, we work on copies here
cast = df.copy()
meta_data = df_2.copy()

# We add an extra column in our cast dataframe with the title of the movie, so we can differentiate the movies by name and not just by index like before
# insert() changes the dataframe in place and returns None, so its result must not be assigned back to cast
cast.insert(0, 'original_title', meta_data['original_title'])

In [12]:
# If we have duplicates in our datasets, they are dropped here
meta_data = meta_data[~meta_data['original_title'].duplicated()]
cast = cast[~cast['original_title'].duplicated()]

# We set the "original title" columns as index columns, so we can access single entries by movie names later on
meta_data = meta_data.set_index('original_title')
cast = cast.set_index('original_title')
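
How dropping duplicates and indexing by title work together can be seen on a toy dataframe. This is only a sketch; the rows below are made up and merely follow the column format of the metadata.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "original_title": ["Pocket Money", "Just for Kicks", "Pocket Money"],
    "genres": ["[]", "[{'id': 35, 'name': 'Comedy'}]", "[{'id': 18, 'name': 'Drama'}]"],
})

# duplicated() marks every repeat after the first occurrence, ~ inverts the mask
toy = toy[~toy["original_title"].duplicated()]

# With the title as index, a single cell can be read with .at
toy = toy.set_index("original_title")
print(toy.at["Pocket Money", "genres"])
# prints: []
```

Because the first occurrence is kept, the second "Pocket Money" row is the one that gets dropped.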

In [10]:
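# in this for loop, we check for every movie in our dataset whether we have the corresponding script for it
# with the file_name function, and if we don't have it, we mark that entry for removal
movies_without_script = []
for movie in list_of_movies:
    print(movie)
    movie_file_name = file_name(movie)

    if not movie_file_name:
        movies_without_script.append(movie)
        print("movie to be removed:", movie)

# dropping all marked movies in one step; errors="ignore" skips titles that are missing from one of the dataframes
cast = cast.drop(movies_without_script, errors="ignore")
meta_data = meta_data.drop(movies_without_script, errors="ignore")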
        

# saving the now cleaned dataframes (only with the movies for which we have movie scripts) in new CSVs with which we will keep on working      
meta_data.to_csv("movies_metadata_cleaned.csv")
cast.to_csv("credits_cleaned.csv")
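
Because the title is the index, `to_csv` writes it as the first column of the cleaned files. A minimal sketch of filtering and reading such a file back, using toy data and an in-memory buffer instead of the real CSVs (the rows and the `has_script` list are made up):

```python
import io

import pandas as pd

# Toy dataframe indexed by title, made up for illustration
toy = pd.DataFrame(
    {"genres": ["[]", "[{'id': 35, 'name': 'Comedy'}]"]},
    index=pd.Index(["Pocket Money", "Just for Kicks"], name="original_title"),
)

# Keep only the titles for which a script was found (hard-coded here)
has_script = ["Just for Kicks"]
cleaned = toy[toy.index.isin(has_script)]

# to_csv() without a path returns the CSV text; the index becomes the first column
csv_text = cleaned.to_csv()

# index_col restores the title index when the data is read back
restored = pd.read_csv(io.StringIO(csv_text), index_col="original_title")
print(restored.index.tolist())
# prints: ['Just for Kicks']
```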

Grace of Monaco
Sorry, this movie doesn't seem to be in this database
movie to be removed: Grace of Monaco
Pocket Money
Sorry, this movie doesn't seem to be in this database
movie to be removed: Pocket Money
Утомлённые солнцем
Sorry, this movie doesn't seem to be in this database
movie to be removed: Утомлённые солнцем
The House on Carroll Street
Sorry, this movie doesn't seem to be in this database
movie to be removed: The House on Carroll Street
Just for Kicks
Sorry, this movie doesn't seem to be in this database
movie to be removed: Just for Kicks
V Boy Idut Odni Stariki
Sorry, this movie doesn't seem to be in this database
movie to be removed: V Boy Idut Odni Stariki
Sex is Comedy
Sorry, this movie doesn't seem to be in this database
movie to be removed: Sex is Comedy
Ernest et Célestine
Sorry, this movie doesn't seem to be in this database
movie to be removed: Ernest et Célestine
Burai hijô
Sorry, this movie doesn't seem to be in this database
movie to be removed: Burai hijô
Refug