# Datasets

Box Office Mojo:
- bom.movie_gross.csv.gz

IMDB (more info [here](https://www.imdb.com/interfaces/)):
- imdb.name.basics.csv.gz
- imdb.title.akas.csv.gz
- imdb.title.basics.csv.gz
- imdb.title.crew.csv.gz
- imdb.title.principals.csv.gz
- imdb.title.ratings.csv.gz

Rotten Tomatoes:
- rt.movie_info.tsv.gz
- rt.reviews.tsv.gz

TheMovieDB:
- tmdb.movies.csv.gz

The Numbers:
- tn.movie_budgets.csv.gz

# Genres Data Exploration

## Importing Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.set_option('display.max_rows', 500) # Allows Jupyter Notebook to expand how much data is shown.

## Reading in IMDB's title.basics dataset

In [2]:
imdb_title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz')
g_df = imdb_title_basics_df #g stands for genre(s)
g_df

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


## Identifying Duplicates in IMDB's title.basics

Now I want to see how many duplicates are in the `'primary_title'` column.

I probably want to see duplicates more often later, so I might as well define a function.

In [3]:
def determine_dupes(series):
    # Count occurrences of each value, most frequent first
    series_vcs = series.value_counts()
    # Keep only the values that occur more than once
    series_dupes = series_vcs[series_vcs > 1]
    print("Amount of unique duplicate movie title: " + str(len(series_dupes)))
    print("Total amount of duplicate movie titles: " + str(series_dupes.sum()))

    return series_vcs

determine_dupes(g_df['primary_title'])

Amount of unique duplicate movie title: 6182
Total amount of duplicate movie titles: 16255


Home                      24
Broken                    20
The Return                20
Homecoming                16
Alone                     16
                          ..
Agnus Dei                  1
Maria by Callas            1
The Sound and the Fury     1
Peek a Boo                 1
The Avatar Project         1
Name: primary_title, Length: 136071, dtype: int64
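
The same two counts can also be read off a boolean mask. This is only a sketch on a made-up list of titles (the `titles` Series below is a stand-in, not the real `g_df['primary_title']`), using `Series.duplicated(keep=False)` to flag every row whose title appears more than once.

```python
import pandas as pd

# Toy stand-in for g_df['primary_title']; the values are made up for illustration.
titles = pd.Series(['Home', 'Home', 'Alone', 'Broken', 'Alone', 'Home', 'Split'])

# keep=False flags every row whose value occurs more than once,
# not just the second and later occurrences.
is_dupe = titles.duplicated(keep=False)

n_duplicated_titles = titles[is_dupe].nunique()  # distinct titles that repeat
n_duplicated_rows = int(is_dupe.sum())           # rows carrying a repeated title

print("Distinct titles that repeat:", n_duplicated_titles)
print("Rows with a repeated title:", n_duplicated_rows)
```

The mask itself is handy later on, because it can be used directly to select the duplicate rows.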

## Creating `'title_dupes_df'` DataFrame to include new movie titles with movie name and movie year for duplicates

To do this, I need to create a new DataFrame called `title_dupes_df`.

But I first need to create a unique ID column within `g_df` so that I can easily merge `title_dupes_df` later.

In [4]:
g_df.reset_index(inplace=True)
g_df.rename(columns={'index' : 'unique_id'}, inplace=True)
g_df

Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...,...
146139,146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


I now need to create a dataframe of all the duplicate movie titles, and then create a new column called `'updated_title'` that includes the name of the movie and the year of the movie.

In [5]:
title_vcs = pd.Series(g_df['primary_title'].value_counts())
title_dupes = [title_vcs.index[index] for index in list(range(len(title_vcs))) if title_vcs.values[index] > 1]
title_dupes_df = g_df[g_df['primary_title'].isin(title_dupes)]
title_dupes_df

Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres
6,6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
19,19,tt0247643,Los pájaros se van con la muerte,Los pájaros se van con la muerte,2011,110.0,"Drama,Mystery"
21,21,tt0250404,Godfather,Godfather,2012,,"Crime,Drama"
30,30,tt0285252,Life's a Beach,Life's a Beach,2012,100.0,Comedy
34,34,tt0297400,Snowblind,Snowblind,2015,,"Crime,Drama"
...,...,...,...,...,...,...,...
146119,146119,tt9913594,Bacchanalia,Bacchanalia,2017,72.0,"Drama,Mystery,Thriller"
146120,146120,tt9913936,Paradise,Paradise,2019,,"Crime,Drama"
146125,146125,tt9914642,Albatross,Albatross,2017,,Documentary
146134,146134,tt9916160,Drømmeland,Drømmeland,2019,72.0,Documentary


In [6]:
title_dupes_df['updated_title'] = title_dupes_df['primary_title'] + ' ' + title_dupes_df['start_year'].astype(str)
title_dupes_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  title_dupes_df['updated_title'] = title_dupes_df['primary_title'] + ' ' + title_dupes_df['start_year'].astype(str)


Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres,updated_title
6,6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller",Bigfoot 2017
19,19,tt0247643,Los pájaros se van con la muerte,Los pájaros se van con la muerte,2011,110.0,"Drama,Mystery",Los pájaros se van con la muerte 2011
21,21,tt0250404,Godfather,Godfather,2012,,"Crime,Drama",Godfather 2012
30,30,tt0285252,Life's a Beach,Life's a Beach,2012,100.0,Comedy,Life's a Beach 2012
34,34,tt0297400,Snowblind,Snowblind,2015,,"Crime,Drama",Snowblind 2015
...,...,...,...,...,...,...,...,...
146119,146119,tt9913594,Bacchanalia,Bacchanalia,2017,72.0,"Drama,Mystery,Thriller",Bacchanalia 2017
146120,146120,tt9913936,Paradise,Paradise,2019,,"Crime,Drama",Paradise 2019
146125,146125,tt9914642,Albatross,Albatross,2017,,Documentary,Albatross 2017
146134,146134,tt9916160,Drømmeland,Drømmeland,2019,72.0,Documentary,Drømmeland 2019
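
The `SettingWithCopyWarning` above appears because `title_dupes_df` was created by filtering `g_df`, so pandas cannot tell whether the new column is meant for the filtered rows or for the original frame. One common way to avoid it, sketched here on a small made-up frame (`toy_df` and `dupes_df` are illustrative names), is to take an explicit `.copy()` of the filtered rows before adding columns.

```python
import pandas as pd

# Made-up rows standing in for g_df
toy_df = pd.DataFrame({
    'primary_title': ['Home', 'Home', 'Split'],
    'start_year': [2015, 2009, 2016],
})

# .copy() makes the filtered frame independent of toy_df, so adding a
# column to it is unambiguous and pandas has nothing to warn about.
dupes_df = toy_df[toy_df['primary_title'].duplicated(keep=False)].copy()
dupes_df['updated_title'] = (
    dupes_df['primary_title'] + ' ' + dupes_df['start_year'].astype(str)
)

print(dupes_df)
```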


I want to see how many unique duplicates there are in `title_dupes_df` and compare it to how many unique duplicates there were in `g_df`.

Of the 16,255 records that shared a title with another record, 12,313 (about 76%) are unique once the year is appended!

In [7]:
determine_dupes(title_dupes_df['updated_title'])

Amount of unique duplicate movie title: 1870
Total amount of duplicate movie titles: 3942


Ângelo de Sousa - Tudo o Que Sou Capaz 2010    7
Split 2016                                     6
Inside Out 2011                                5
Escape from Candyland 2017                     4
Fragments 2018                                 4
                                              ..
Fountain of Youth 2017                         1
A Picture of You 2013                          1
Happyland 2010                                 1
Hostile 2017                                   1
Evidence 2012                                  1
Name: updated_title, Length: 14183, dtype: int64

Now let's remove the columns that we don't want to duplicate during the merge.

In [8]:
remove_columns = ['tconst', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres']
title_dupes_df.drop(columns=remove_columns, axis=1, inplace=True)
title_dupes_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,unique_id,updated_title
6,6,Bigfoot 2017
19,19,Los pájaros se van con la muerte 2011
21,21,Godfather 2012
30,30,Life's a Beach 2012
34,34,Snowblind 2015
...,...,...
146119,146119,Bacchanalia 2017
146120,146120,Paradise 2019
146125,146125,Albatross 2017
146134,146134,Drømmeland 2019


## Merging `'title_dupes_df'` with `'g_df'`

I now need to merge `title_dupes_df` into `g_df`.

In [9]:
g_and_dupes_df = pd.merge(left=g_df, right=title_dupes_df, on=['unique_id'], how ='left')
g_and_dupes_df.head(25)

Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres,updated_title
0,0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",
1,1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",
2,2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,
3,3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",
4,4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",
5,5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy,
6,6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller",Bigfoot 2017
7,7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",
8,8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History",
9,9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography,


Now I need to copy each non-null value in the `'updated_title'` column over its respective `'primary_title'`. I found it easiest to create a new column that tells me whether the `'updated_title'` value is null.

In [10]:
# This code block may take up to 20 seconds to run.

g_and_dupes_df['is_title_NaN'] = g_and_dupes_df['updated_title'].isnull()
g_and_dupes_df.head(25)
for movie_index in list(range(len(g_df.index))):
    if ~g_and_dupes_df['is_title_NaN'][movie_index]:
        g_and_dupes_df['primary_title'][movie_index] = g_and_dupes_df['updated_title'][movie_index]
g_and_dupes_df.head(25)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  g_and_dupes_df['primary_title'][movie_index] = g_and_dupes_df['updated_title'][movie_index]


Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres,updated_title,is_title_NaN
0,0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",,True
1,1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",,True
2,2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,,True
3,3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",,True
4,4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",,True
5,5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy,,True
6,6,tt0112502,Bigfoot 2017,Bigfoot,2017,,"Horror,Thriller",Bigfoot 2017,False
7,7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",,True
8,8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History",,True
9,9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography,,True
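
The row-by-row loop above is slow and uses chained indexing, which is what triggers the warning. A vectorized alternative, sketched here on a three-row made-up frame (`toy_df` is an illustrative name), is to take `updated_title` where it exists and fall back to `primary_title` otherwise. I have not timed this against the full dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the merged frame
toy_df = pd.DataFrame({
    'primary_title': ['Sunghursh', 'Bigfoot', 'A Thin Life'],
    'updated_title': [np.nan, 'Bigfoot 2017', np.nan],
})

# fillna() fills the gaps in updated_title with the matching primary_title,
# so the result is "updated title if present, otherwise the original title".
toy_df['primary_title'] = toy_df['updated_title'].fillna(toy_df['primary_title'])

print(toy_df['primary_title'].tolist())
```

With this approach the helper column `'is_title_NaN'` is not needed.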


Now let's remove the columns that we don't need for further data exploration.

In [11]:
remove_columns = ['updated_title', 'is_title_NaN']
g_and_dupes_df.drop(columns=remove_columns, axis=1, inplace=True)

In [12]:
g_df = g_and_dupes_df
display(g_df.head(25))
g_df.info()

Unnamed: 0,unique_id,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,6,tt0112502,Bigfoot 2017,Bigfoot,2017,,"Horror,Thriller"
7,7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


<class 'pandas.core.frame.DataFrame'>
Int64Index: 146144 entries, 0 to 146143
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   unique_id        146144 non-null  int64  
 1   tconst           146144 non-null  object 
 2   primary_title    146144 non-null  object 
 3   original_title   146123 non-null  object 
 4   start_year       146144 non-null  int64  
 5   runtime_minutes  114405 non-null  float64
 6   genres           140736 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 13.9+ MB


# Merging Julian's Exploratory Data

Attempting to merge Julian's data with my own...

## Reading in The Numbers' movie_budgets dataset

In [None]:
# Credit to Julian
tn_movie_budgets_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
mm_df = tn_movie_budgets_df #mm stands for money_metrics
display(mm_df)
mm_df.info()

## Converting columns with \$'s (`str`) to `int` 

In [None]:
# Credit to Julian
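def clean_a_money_column(df, series_title):
    # regex=False treats '$' as a literal character rather than the regex end-of-string anchor
    df[series_title] = df[series_title].str.replace('$', '', regex=False)
    df[series_title] = df[series_title].str.replace(',', '', regex=False)
    df[series_title] = pd.to_numeric(df[series_title])

    return df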

mm_df = clean_a_money_column(mm_df,'worldwide_gross')
mm_df = clean_a_money_column(mm_df,'domestic_gross')
mm_df = clean_a_money_column(mm_df,'production_budget')
display(mm_df)
mm_df.info() # Notice that the dtype for columns 3 thru 5 are now int64 instead of object

## Profit Margin & Return on Investment

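Profit Margin (PM), expressed as a percentage: $(Revenue-Cost)/Revenue \times 100$

Return on Investment (ROI), defined here as the ratio of revenue to cost: $Revenue/Cost$ (ROI is more commonly defined as $(Revenue-Cost)/Cost$, which is this ratio minus 1)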

In [None]:
# Credit to Julian
numerator = mm_df['worldwide_gross'] - mm_df['production_budget']
mm_df['Profit Margin'] = numerator/mm_df['worldwide_gross']*100
mm_df['ROI'] = mm_df['worldwide_gross']/mm_df['production_budget']
# mm_df.sort_values(['Profit Margin'], ascending=False, inplace=True) # Sort by largest Profit Margin
mm_df
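
As a quick sanity check of the two metrics, here is a sketch on made-up numbers (not taken from the dataset). The first film grosses 2.5 times its budget, giving a 60% profit margin. The second grosses half its budget, giving an ROI of 0.5 and a profit margin of -100%.

```python
import pandas as pd

# Made-up budgets and grosses, purely for illustration
toy_df = pd.DataFrame({
    'production_budget': [100, 200],
    'worldwide_gross': [250, 100],
})

# Same formulas as the cell above: PM as a percentage, ROI as revenue / cost
profit = toy_df['worldwide_gross'] - toy_df['production_budget']
toy_df['Profit Margin'] = profit / toy_df['worldwide_gross'] * 100
toy_df['ROI'] = toy_df['worldwide_gross'] / toy_df['production_budget']

print(toy_df)
```

Note that Profit Margin is undefined (division by zero) for any film with a worldwide gross of 0, so those rows may need special handling.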

In [None]:
# Credit to Julian
# Scatter plot of worldwide gross revenue vs. production budget cost:
mm_df.plot('production_budget', 'worldwide_gross', kind='scatter');

## Adding `'Year'` column

In [None]:
# Credit to Julian
mm_df['Year'] = pd.to_numeric(mm_df['release_date'].map(lambda x: x[-4:]))
mm_df
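
Slicing the last four characters works as long as every date ends in a four-digit year. An alternative sketch, assuming the `'release_date'` strings all look like `Mar 27, 2015`, is to parse them as dates so that a malformed value raises an error instead of silently producing a wrong year.

```python
import pandas as pd

# Sample strings in the assumed 'Mon D, YYYY' format
dates = pd.Series(['Mar 27, 2015', 'Jun 5, 2009', 'Apr 23, 2009'])

# Parse with an explicit format, then pull out the year component
years = pd.to_datetime(dates, format='%b %d, %Y').dt.year

print(years.tolist())
```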

In [None]:
# Credit to Julian
# Scatter plot of ROI vs. Year of Movie Release:
mm_df.plot('Year', 'ROI', kind='scatter');

## Identifying Duplicates in TheNumbers' movie_budgets

In [None]:
movie_title_dupes = pd.Series(mm_df['movie'].value_counts())
title_dupes = [movie_title_dupes.index[index] for index in list(range(len(movie_title_dupes))) if movie_title_dupes.values[index] > 1]
print("Amount of unique duplicate movie title: ", len(title_dupes))
print("Total amount of duplicate movie titles: ", movie_title_dupes[0:len(title_dupes)].sum())
display(movie_title_dupes)

## Merging The Numbers' movie_budgets with IMDB's title.basics

In order to perform a merge later on, I need to rename the movie title column, `'primary_title'`, within IMDB's title.basics.

In [None]:
g_df.rename(columns={'primary_title' : 'movie'}, inplace=True)
# display(g_df)

Left join g_df into mm_df:

In [None]:
mm_and_g_df = pd.merge(left=mm_df, right=g_df, on='movie', how ='left')
# display(mm_and_g_df)
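
Because the join key is a movie title, one row in `mm_df` can match several rows in `g_df`, and titles with no match come through with `NaN` in the IMDB columns. The sketch below shows both effects on made-up frames (`budgets`, `titles`, the budget figures and the `tconst` values are all hypothetical); `indicator=True` adds a `'_merge'` column that records where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for mm_df and g_df
budgets = pd.DataFrame({
    'movie': ['Home', 'Avatar'],
    'production_budget': [130, 425],
})
titles = pd.DataFrame({
    'movie': ['Home', 'Home', 'Home', 'Sunghursh'],
    'tconst': ['tt0000001', 'tt0000002', 'tt0000003', 'tt0000004'],
})

# Left join: every budgets row is kept, repeated once per matching title row
merged = pd.merge(left=budgets, right=titles, on='movie', how='left', indicator=True)

print(merged)
print(merged['_merge'].value_counts())
```

Here the single 'Home' budget row fans out into three rows, which is why the merged frame needs the clean-up that follows.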

I need to create a unique ID column in the newly merged DataFrame.

In [None]:
mm_and_g_df.reset_index(inplace=True)
mm_and_g_df.rename(columns={'index' : 'unique_id'}, inplace=True)
mm_and_g_df

## Data Cleaning...

In [None]:
movie_title_dupes = pd.Series(mm_and_g_df['movie'].value_counts())
title_dupes = [movie_title_dupes.index[index] for index in list(range(len(movie_title_dupes))) if movie_title_dupes.values[index] > 1]
print("Amount of unique duplicate movie title: ", len(title_dupes))
print("Total amount of duplicate movie titles: ", movie_title_dupes[0:len(title_dupes)].sum())
display(movie_title_dupes.head(500))

In [None]:
def generate_uid_removal_list(test_df, keep_uid_list):
    # Every distinct unique_id in test_df except the ones we want to keep
    uids = test_df['unique_id'].drop_duplicates()
    return uids[~uids.isin(keep_uid_list)].tolist()

def remove_uid_from_df(df, uid_removal_list):
    # Drop all rows whose unique_id is in the removal list
    return df[~df['unique_id'].isin(uid_removal_list)]

In [None]:
# Home
test_df = mm_and_g_df[mm_and_g_df['movie'] == 'Home']
# display(test_df)
keep_uid_list = [277]
uid_removal_list = generate_uid_removal_list(test_df, keep_uid_list)
uid_removal_list
mm_and_g_df = remove_uid_from_df(mm_and_g_df, uid_removal_list)
display(mm_and_g_df)
test_df = mm_and_g_df[mm_and_g_df['movie'] == 'Home']
# display(test_df)

`test_df = mm_and_g_df_left[mm_and_g_df_left['movie'] == 'Home']`

Legend: mm_df --> g_df

KEEP Mar 27, 2015 Home --> KEEP tt2224026

REMOVE Jun 5, 2009 Home --> N/A (tt1014762 not in title.basics)

REMOVE Apr 23, 2009 Home --> N/A (tt0822388 not in title.basics)

REMOVE ALL OTHER 'Home' MOVIES FROM g_df
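
Picking the right row by hand works for a single title but would not scale to every duplicate. One possible shortcut, which is not what this notebook does, is to join on the title and the release year together. The sketch below uses made-up frames and hypothetical `tconst` values, and assumes the budget data's `'Year'` lines up with IMDB's `'start_year'`.

```python
import pandas as pd

# Hypothetical stand-ins for mm_df and g_df
budgets = pd.DataFrame({'movie': ['Home', 'Home'], 'Year': [2015, 2009]})
titles = pd.DataFrame({
    'movie': ['Home', 'Home', 'Home'],
    'start_year': [2015, 2013, 2016],
    'tconst': ['tt_a', 'tt_b', 'tt_c'],  # made-up ids
})

# Match on title AND year so same-named films from other years are ignored
matched = pd.merge(
    budgets, titles,
    left_on=['movie', 'Year'],
    right_on=['movie', 'start_year'],
    how='left',
)

print(matched)
```

The 2015 row finds its single match, while the 2009 row has no counterpart and keeps `NaN` in the IMDB columns. Release years can differ between sources, so a check by eye would still be needed for near misses.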

## What to do with NaN Values in the `'genres'` column?

How many rows in the `'genres'` column have `NaN` values?

In [None]:
g_null_values = g_df['genres'].isna().sum()
print(f'Rows with Null Values in Genres Column: {g_null_values}')
NaN_in_g = g_df[g_df['genres'].isnull()]
NaN_in_g
# NaN_in_genres.to_csv(r'C:\Users\avbro\Documents\Flatiron\1-phase1\projects\New-Film-Studio-at-Microsoft\null_genres.csv')

These `NaN` values could in theory be given genres if time permitted.
But for the sake of time, all rows with `NaN` values in the `'genres'` column will be removed.

In [None]:
g_df.dropna(axis=0, subset=['genres'], inplace=True)
g_df

Notice index 16 was removed.

In [None]:
g_df.head(20)

## Converting dtype of `'genres'` from `'string'` to `'list'`

The dtype for the `'genres'` column is currently a `string`, and I need to split up the different genres per title into a `list` of genres.

In [None]:
g_df['genres'] = g_df['genres'].map(lambda x: x.split(','))

I would like to see how many different genres can be associated with a single movie,
so I can make useful functions. I first want to know how many genres are associated
with each movie, so I want to make a new column called `'genres_len'`.

In [None]:
g_df['genres_len'] = g_df['genres'].map(lambda x: len(x))

In [None]:
max_genres_for_single_title = g_df['genres_len'].max()
print(f'Maximum Amount of Genres for a Single Title: {max_genres_for_single_title}')

In [None]:
g_df
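
The split-and-count steps above can be sketched with pandas string methods, using the first three rows of the dataset as sample input (`toy_df` and `genre_counts` are illustrative names). Once the genres are lists, `explode` gives one row per (title, genre) pair, which makes per-genre counts straightforward.

```python
import pandas as pd

# First three rows of title.basics, genres still as comma-separated strings
toy_df = pd.DataFrame({
    'tconst': ['tt0063540', 'tt0066787', 'tt0069049'],
    'genres': ['Action,Crime,Drama', 'Biography,Drama', 'Drama'],
})

toy_df['genres'] = toy_df['genres'].str.split(',')   # string -> list of genres
toy_df['genres_len'] = toy_df['genres'].str.len()    # number of genres per title

# One row per (title, genre) pair, then count titles per genre
genre_counts = toy_df.explode('genres')['genres'].value_counts()

print(toy_df)
print(genre_counts)
```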

## Functions for genre data

Useful functions that will come in handy for data manipulation pertaining to genres...

In [None]:
def unique_g(g_series):
    # Collect every distinct genre across all titles
    unique_g_set = {g for g_list in g_series for g in g_list}
    # Sort alphabetically and key each genre by its position
    unique_g_dict = dict(enumerate(sorted(unique_g_set)))

    return unique_g_dict


unique_g_dict = unique_g(g_df['genres'])
unique_g_dict

In [None]:
def select_g_types(unique_g_dict, int_list):
    # Look up the genre name for each integer key
    return [unique_g_dict[index] for index in int_list]


g_types = select_g_types(unique_g_dict, [0, 13, 16, 19])
print(g_types)

The entire code block below was for testing purposes only.

In [None]:
# # test_genres_df = genres_df
# # I HIGHLY RECOMMEND NOT PERFORMING THE FOR LOOP ON ALL +140,000 RECORDS
# # This code block will take ~2 minutes to compute.
# # And the following code block will not compute under 10 minutes...

# # I recommend taking a sample of genres_df to parse through instead
# test_genres_df = genres_df[0:5000]
# movies_for_removal = []

# for movie in list(range(len(test_genres_df.index))):
#     test_df = pd.DataFrame(test_genres_df.iloc[movie]).T
#     for g_type in genre_types:
#         if g_type in test_df['genres'][test_df.index[0]]:
#             break
#         elif g_type == genre_types[len(genre_types)-1]:
#             movies_for_removal.append(test_df['tconst'][test_df.index[0]])

# print("Amount of rows to remove: ", len(movies_for_removal))
# print("Total amount of rows in test_genres_df DataFrame: ", len(test_genres_df.index))
# print("Expected amount of rows after removal: ", len(test_genres_df.index)-len(movies_for_removal))

# for movie_for_removal in movies_for_removal:
#     test_genres_df = test_genres_df[test_genres_df['tconst'] != movie_for_removal]

# test_genres_df

In [None]:
# I HIGHLY RECOMMEND NOT PERFORMING THIS FUNCTION ON DATAFRAMES WITH OVER 5,000 RECORDS
# AS THIS FUNCTION MAY TAKE VERY LONG TO COMPUTE

def movies_w_specified_genres(g_df, g_types_list):
    movies_for_removal = []

    for movie in list(range(len(g_df.index))):
        test_df = pd.DataFrame(g_df.iloc[movie]).T
        for g_type in g_types_list:
            if g_type in test_df['genres'][test_df.index[0]]:
                break
            elif g_type == g_types_list[len(g_types_list)-1]:
                movies_for_removal.append(test_df['tconst'][test_df.index[0]])
    for movie_for_removal in movies_for_removal:
        g_df = g_df[g_df['tconst'] != movie_for_removal]
    
    return(g_df)
            

test_df = movies_w_specified_genres(g_df[0:50], g_types)
test_df
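
The function above builds a one-row DataFrame for every movie, which is why it gets slow on large inputs. A possible vectorized version of the same filter is sketched below on made-up rows (`toy_df`, `wanted` and the `tconst` values are hypothetical): it keeps a movie when its genre list shares at least one genre with the wanted set. I have not benchmarked it on the full dataset, and unlike the loop it returns no rows when the wanted set is empty.

```python
import pandas as pd

# Made-up rows; 'genres' already holds lists, as in g_df after the split
toy_df = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'genres': [['Action', 'Drama'], ['Documentary'], ['Comedy', 'Horror']],
})

wanted = {'Action', 'Horror'}

# isdisjoint() is True when the two collections share nothing,
# so "not isdisjoint" means at least one wanted genre is present.
keep_mask = toy_df['genres'].map(lambda genres: not wanted.isdisjoint(genres))
filtered_df = toy_df[keep_mask]

print(filtered_df)
```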