# Enriching data
#### 1. Adding data from www.themoviedb.org (TMDB)

In this section we use the API of www.themoviedb.org (TMDB) to populate the existing dataframe with more information about each movie. The script (TMDB_v1) queries TMDB with the movie name and, among the search results, picks the one whose release date is closest to the release date in our dataset. If no match is found, the script skips the movie and moves on to the next one. Movies without a release date in our dataset are skipped as well.
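The closest-release-date rule described above can be sketched in isolation as follows. This is a minimal illustration, not the script itself: `parse_partial_date` and `closest_by_release_date` are hypothetical helper names, and the search results are hand-written stand-ins shaped like the dictionaries the script reads.

```python
from datetime import date, datetime

def parse_partial_date(text):
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'; missing month/day default to 1."""
    formats = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}
    return datetime.strptime(text, formats[len(text)]).date()

def closest_by_release_date(target, results):
    """Return the result whose release date is nearest to target, or None."""
    # Drop results without a release date before comparing
    dated = [r for r in results if r.get("release_date")]
    if not dated:
        return None
    return min(dated, key=lambda r: abs(parse_partial_date(r["release_date"]) - target))

# Hypothetical search results (titles and dates are made up for illustration)
results = [
    {"title": "Film A", "release_date": "1988-11-17"},
    {"title": "Film B", "release_date": "1990-03-02"},
    {"title": "Film C", "release_date": ""},
]

best = closest_by_release_date(parse_partial_date("1988"), results)
print(best["title"])
```

Filtering first and then taking the minimum over the filtered list means the chosen entry always comes from the same list the dates were computed from.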

This first step adds several fields from TMDB. The most important are:
- TMDB ID           (unique identifier for the movie, which can be used to further enrich the data)
- TMDB vote average (average movie rating)
- TMDB vote count   (how many votes were cast)

Furthermore, TMDB_v1 also adds:
- original language, original title, release date, title
- overview (summary)
- popularity (metric composed of many different variables indicating a movie's lifetime popularity)


This script was run separately, from /utils/add_TMDB_movie_metadata_v1.py, because it had to be executed in several installments over several sessions: the API was limited to 40 requests/second. (This version uses slightly different paths.)
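One way to stay under a request limit like the one mentioned above is to throttle calls on the client side. The sketch below is an assumption about how that could be done, not part of the original script: `Throttle` is a hypothetical helper that keeps a sliding window of call timestamps and sleeps when the window is full.

```python
import time

class Throttle:
    """Allow at most max_calls calls per period (seconds), sleeping when needed."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.calls = []       # timestamps of calls inside the current window

    def wait(self):
        now = self.clock()
        # Forget calls that have left the sliding window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)

# Usage: call wait() before every API request
throttle = Throttle(max_calls=40, period=1.0)
for movie_name in ["Film A", "Film B"]:  # hypothetical titles
    throttle.wait()
    # the API request for movie_name would go here
```

The limit of 40 calls per second is taken from the text above; adjust both parameters to whatever the API currently documents.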

In [2]:
import tmdbsimple as tmdb # Wrapper library for the API of themoviedb.org  (TMDB)
from tqdm import tqdm # Progress bar for the script
import pandas as pd
from datetime import datetime # Used for comparing movie release dates
from dotenv import load_dotenv # Makes keeping the API-key as local environment file simpler
import os # Used for loading the .env file

load_dotenv() # Loads the .env file, which holds the TMDB_API_KEY

headers_movie_metadata = ["Wikipedia Movie ID", "Freebase Movie ID", "Movie name", "Movie release date",
                          "Movie box office revenue", "Movie runtime", "Movie languages", "Movie countries",
                          "Movie genres"]
movie_metadata = pd.read_csv('data/movie.metadata.tsv', sep="\t", names=headers_movie_metadata)


# Load API key
TMDB_API_KEY = os.environ.get("TMDB_API_KEY")
tmdb.API_KEY = TMDB_API_KEY
tmdb.REQUESTS_TIMEOUT = 5  # Seconds, for both connect and read

# Create a list to save progress
saved_progress = []

# Determine where to resume
start_index = 28000

# Progress file that can be used to resume
#saved_progress = pd.read_json('progress.json')['index'].tolist()
#start_index = saved_progress[-1] + 1  # Start from the next index

# Create a DataFrame to store the data
movie_metadata_TMDB = movie_metadata.copy()

for index, row in tqdm(movie_metadata_TMDB.iterrows(), total=len(movie_metadata_TMDB), desc="Processing"): # Wraps for loop in progress bar.
    
    # Skip previously processed indices - Commented out in favor of manual start_index
    #if index in saved_progress:
    #    continue
    if index < start_index:
        continue
    try:
        if not pd.isna(row["Movie release date"]):
            search = tmdb.Search()
            response = search.movie(query=row["Movie name"])

            # Convert dataframe release date to datetime
            movie_release_date_str = row["Movie release date"]
            if len(movie_release_date_str) == 4:  # Handle "YYYY" format
                movie_release_date = datetime.strptime(movie_release_date_str, "%Y").date()
            elif len(movie_release_date_str) == 7:  # Handle "YYYY-MM" format
                movie_release_date = datetime.strptime(movie_release_date_str, "%Y-%m").date()
            else:  # Assume it's in the format "YYYY-MM-DD"
                movie_release_date = datetime.strptime(movie_release_date_str, "%Y-%m-%d").date()

            # Keep only results with a release date, so indices into this list and into the date list stay aligned
            candidates = [result for result in search.results if result.get('release_date')]
            date_list_converted = [datetime.strptime(result['release_date'], "%Y-%m-%d").date() for result in candidates]

            # Create list of differences in time
            differences = [abs(movie_release_date - each_date) for each_date in date_list_converted]

            # If differences are empty = skip
            if not differences:
                continue
            minimum_index = differences.index(min(differences))  # Index of the closest match
            match = candidates[minimum_index]
            # print(f"Closest match: {match['title']} (Release Date: {match['release_date']})")

            # Add info in dataframe about the movie
            movie_metadata_TMDB.loc[index, 'TMDB_id'] = match['id']
            movie_metadata_TMDB.loc[index, 'TMDB_original_language'] = match['original_language']
            movie_metadata_TMDB.loc[index, 'TMDB_original_title'] = match['original_title']
            movie_metadata_TMDB.loc[index, 'TMDB_overview'] = match['overview']
            movie_metadata_TMDB.loc[index, 'TMDB_popularity'] = match['popularity']
            movie_metadata_TMDB.loc[index, 'TMDB_release_date'] = match['release_date']
            movie_metadata_TMDB.loc[index, 'TMDB_title'] = match['title']
            movie_metadata_TMDB.loc[index, 'TMDB_vote_average'] = match['vote_average']
            movie_metadata_TMDB.loc[index, 'TMDB_vote_count'] = match['vote_count']
            
            # Save the index as progress
            saved_progress.append(index)

            # Save progress periodically (in case of interruption)
            if index % 50 == 0:
                progress_df = pd.DataFrame({'index': saved_progress})
                progress_df.to_json('progress.json')
                movie_metadata_TMDB.to_csv('movie_metadata_TMDB.csv', index=False)

    except Exception as e:
        print(f"Error at index {index}: {str(e)}")

# Save final progress
progress_df = pd.DataFrame({'index': saved_progress})
progress_df.to_json('progress.json')

# Save your final DataFrame
#movie_metadata_TMDB.to_csv('modified_data/movie_metadata_TMDB.csv') # Commented out to not overwrite existing file with blank data

Processing: 100%|██████████| 81741/81741 [00:03<00:00, 20657.45it/s]


We now check how many rows were populated by the enrichment.

In [3]:
import pandas as pd
df = pd.read_csv('modified_data/movie_metadata_TMDB.csv')
print(f"Added TMDB ID to {df['TMDB_id'].count()} movies. Total movies: {df['Wikipedia Movie ID'].count()}. \n Percentage populated: {round(df['TMDB_id'].count()/df['Wikipedia Movie ID'].count()*100,2)}%")


Added TMDB ID to 68944 movies. Total movies: 81741. 
 Percentage populated: 84.34%
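A populated row is not necessarily a correct match: because the closest release date decides, a row can end up paired with a different film (further down in this notebook, "Knuckle" is paired with "Knucklehead"). The sketch below shows one possible sanity check, assuming illustrative thresholds; `is_suspicious` is a hypothetical helper and the cut-offs are not tuned on the data.

```python
from datetime import date
from difflib import SequenceMatcher

def is_suspicious(name, tmdb_title, tmdb_original_title, days_apart,
                  min_similarity=0.9, max_days=60):
    """Flag a match when neither TMDB title resembles the dataset name
    and the release dates are far apart. Thresholds are assumptions."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(similarity(name, tmdb_title), similarity(name, tmdb_original_title))
    return best < min_similarity and days_apart > max_days

# Rows taken from the tables displayed later in this notebook
gap = abs(date(2011, 1, 21) - date(2010, 10, 22)).days
print(is_suspicious("Knuckle", "Knucklehead", "Knucklehead", gap))
print(is_suspicious("White Of The Eye", "White of the Eye", "White of the Eye", 0))
```

Comparing against both the title and the original title avoids flagging translated titles such as "Brun bitter", whose TMDB title is "Hair of the Dog".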


#### 2. Adding movie revenue
The script above (TMDB_v1) used TMDB's search ('query') method. We now use the 'Movies' method, which looks a movie up by its TMDB ID, to collect further data about the movies (TMDB_v2).
Many of the movies in the original dataset are missing their revenue, and TMDB might have that information for some of them.

TMDB_v2 follows the same approach as TMDB_v1. The data collected is:
- revenue, budget, runtime, IMDb ID and genres.
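As a sketch of how these fields can be pulled out of a movie-details dictionary, the hypothetical helper `flatten_movie_info` below maps the keys used by the script to the column names it writes. The sample dictionary is a hand-written stand-in, with values copied from the first row of the output shown later in this notebook.

```python
def flatten_movie_info(info):
    """Pick the fields of interest; missing keys become None (or '' for genres)."""
    return {
        "TMDB_runtime": info.get("runtime"),
        "TMDB_revenue": info.get("revenue"),
        "TMDB_budget": info.get("budget"),
        "TMDB_IMDB_id": info.get("imdb_id"),
        # Genres arrive as a list of dictionaries; join their names into one string
        "TMDB_genres": ", ".join(g["name"] for g in info.get("genres", [])),
    }

# Stand-in for a movie-details response (not fetched from the API)
sample_info = {
    "runtime": 98,
    "revenue": 14010832,
    "budget": 28000000,
    "imdb_id": "tt0228333",
    "genres": [{"name": "Action"}, {"name": "Horror"}, {"name": "Science Fiction"}],
}

print(flatten_movie_info(sample_info))
```

Using `.get` means a response that lacks one field still yields a row instead of raising a `KeyError`.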

While our original dataset already includes movie genres, a second source is useful: the two may differ, which could make for an interesting analysis later on.

Like TMDB_v1, TMDB_v2 was run separately in another file, /utils/add_TMDB_movie_metadata_V2.py, as it needed to be run over several sessions because the API is rate-limited.
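Running over several sessions relies on being able to resume where the previous session stopped. The sketch below shows one way to checkpoint and resume, under stated assumptions: it uses a plain JSON file rather than the pandas-written `progress_v2.json`, and `load_progress`, `save_progress` and the demo file name are hypothetical.

```python
import json
import os
import tempfile

def load_progress(path):
    """Return the list of processed indices, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)["index"]

def save_progress(path, indices):
    """Write to a temporary file, then rename it, so an interruption
    cannot leave a half-written progress file behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"index": indices}, f)
    os.replace(tmp_path, path)

progress_path = os.path.join(tempfile.mkdtemp(), "progress_demo.json")
done = set(load_progress(progress_path))  # set gives fast membership checks

for index in range(5):
    if index in done:
        continue  # already processed in an earlier session
    # ... the API call and DataFrame update would go here ...
    done.add(index)
    save_progress(progress_path, sorted(done))
```

Skipping by membership rather than by a manual `start_index` also covers indices that failed in the middle of an earlier run.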

In [None]:
import tmdbsimple as tmdb
from tqdm import tqdm
import pandas as pd
from datetime import datetime
import re
from dotenv import load_dotenv
import os

load_dotenv()

# Load v1, which was created in part 1 above
movie_metadata_TMDB = pd.read_csv('modified_data/movie_metadata_TMDB.csv')
movie_metadata_TMDB = movie_metadata_TMDB.dropna(subset=['TMDB_id'])

# Load API key
TMDB_API_KEY = os.environ.get("TMDB_API_KEY")
tmdb.API_KEY = TMDB_API_KEY
tmdb.REQUESTS_TIMEOUT = 5  # seconds, for both connect and read

# Create a list to save progress
saved_progress = []
# Determine where to resume
start_index = 35000

# Progress file that can be used to resume
#saved_progress = pd.read_json('progress_v2.json')['index'].tolist()
#start_index = saved_progress[-1] + 1  # Start from the next index

for index, row in tqdm(movie_metadata_TMDB.iterrows(), total=len(movie_metadata_TMDB), desc="Processing"): # Wraps for loop in progress bar
    if index < start_index:
        continue
    try:
        movie = tmdb.Movies(row['TMDB_id']) # The "Movies" method is used with the TMDB_id of the movie
        movie_info = movie.info()
        
        # The dataframe is populated with data about: runtime, revenue, budget and IMDB_id.
        movie_metadata_TMDB.loc[index,'TMDB_runtime'] = movie_info['runtime']
        movie_metadata_TMDB.loc[index,'TMDB_revenue'] = movie_info['revenue']
        movie_metadata_TMDB.loc[index,'TMDB_budget'] = movie_info['budget']
        movie_metadata_TMDB.loc[index,'TMDB_IMDB_id'] = movie_info['imdb_id']

        # Convert list of genres to a string representation for adding the values easily to the df
        genres_str = ', '.join([genre['name'] for genre in movie_info['genres']])
        # Assign values to DataFrame
        movie_metadata_TMDB.loc[index, 'TMDB_genres'] = genres_str
        
        # Save the index as progress
        saved_progress.append(index)

        # Save progress periodically (in case of interruption)
        if index % 50 == 0:
            progress_df = pd.DataFrame({'index': saved_progress})
            progress_df.to_json('progress_v2.json')
            movie_metadata_TMDB.to_csv('movie_metadata_TMDB_v2.csv', index=False)

    except Exception as e:
        print(f"Error at index {index}: {str(e)}")

# Save final progress
progress_df = pd.DataFrame({'index': saved_progress})
progress_df.to_json('progress_v2.json')

# Save your final DataFrame
#movie_metadata_TMDB.to_csv('modified_data/movie_metadata_TMDB_v2_DONE.csv', index=False) # commented out to not overwrite existing data


After data collection we merge TMDB_v2 with TMDB_v1, producing a dataset called TMDB_v3. TMDB_v2 only contains movies where a 'TMDB_id' is present, so the merge ensures that the resulting dataset contains all movies.
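A left join of this kind can be checked as it is performed. The sketch below uses toy data and assumes that 'Wikipedia Movie ID' uniquely identifies a row, which is an assumption rather than something verified here. `validate` raises if the key is duplicated on either side, and `indicator` records which rows found a partner.

```python
import pandas as pd

# Toy stand-ins for TMDB_v1 (all movies) and TMDB_v2 (only movies with a TMDB_id)
base = pd.DataFrame({
    "Wikipedia Movie ID": [1, 2, 3],
    "TMDB_id": [10.0, 20.0, None],
})
details = pd.DataFrame({
    "Wikipedia Movie ID": [1, 2],
    "TMDB_revenue": [100.0, None],
})

merged = pd.merge(
    base, details,
    on="Wikipedia Movie ID",   # explicit key instead of all shared columns
    how="left",                # keep every row of the base dataset
    validate="one_to_one",     # raise if the key is not unique on both sides
    indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
)

assert len(merged) == len(base)  # a left join on a unique key must not add rows
print(merged["_merge"].value_counts())
```

Rows marked `left_only` correspond to movies that have no TMDB_v2 entry and therefore keep empty values in the new columns.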


In [35]:
# The two datasets are loaded
movie_metadata_v2 = pd.read_csv('modified_data/movie_metadata_TMDB_v2.csv')
movie_metadata = pd.read_csv('modified_data/movie_metadata_TMDB.csv',index_col=0)

# Datasets are inspected beforehand
display(movie_metadata)
display(movie_metadata_v2)

# Datasets are merged with a left join on movie_metadata (the dataset we want to populate)
merged_df = pd.merge(movie_metadata,movie_metadata_v2,how='left')

display(merged_df) # The merged dataset is displayed and a successful merge is confirmed

#merged_df.to_csv("modified_data/movie_metadata_TMDB_v3.csv",index=False) # The merged dataset is saved as *_V3, it is commented out to prevent data being overwritten 




Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres,TMDB_id,TMDB_original_language,TMDB_original_title,TMDB_overview,TMDB_popularity,TMDB_release_date,TMDB_title,TMDB_vote_average,TMDB_vote_count
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",10016.0,en,Ghosts of Mars,"In 2176, a Martian police unit is sent to pick...",17.280,2001-08-24,Ghosts of Mars,5.123,980.0
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",784579.0,en,Getting Away with Murder: The JonBenét Ramsey ...,Dramatization of the story behind the murder o...,0.750,2000-02-16,Getting Away with Murder: The JonBenét Ramsey ...,8.000,1.0
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",396302.0,no,Brun bitter,A stolen bicycle case ends with drunken detect...,0.600,1988-11-17,Hair of the Dog,0.000,0.0
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",33592.0,en,White of the Eye,"In a wealthy and isolated desert community, a ...",7.336,1987-06-19,White of the Eye,5.742,64.0
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",11192.0,de,Die flambierte Frau,"Eva, an upper-class housewife, frustratedly le...",2.397,1983-05-11,A Woman in Flames,5.300,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",117124.0,en,Mermaids: The Body Found,A story that imagines how these real-world phe...,5.098,2011-03-19,Mermaids: The Body Found,4.500,20.0
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",44946.0,en,Knucklehead,A fight promoter deeply in debt to his crooked...,9.789,2010-10-22,Knucklehead,5.500,50.0
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",285337.0,en,Another Nice Mess,Nixon and Agnew played as Laurel and Hardy.,1.960,1972-08-23,Another Nice Mess,0.000,0.0
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",,,,,,,,,


Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres,TMDB_id,...,TMDB_popularity,TMDB_release_date,TMDB_title,TMDB_vote_average,TMDB_vote_count,TMDB_runtime,TMDB_revenue,TMDB_budget,TMDB_IMDB_id,TMDB_genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",10016.0,...,17.280,2001-08-24,Ghosts of Mars,5.123,980.0,98.0,14010832.0,28000000.0,tt0228333,"Action, Horror, Science Fiction"
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",784579.0,...,0.750,2000-02-16,Getting Away with Murder: The JonBenét Ramsey ...,8.000,1.0,60.0,,0.0,tt0245916,"Drama, Crime"
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",396302.0,...,0.600,1988-11-17,Hair of the Dog,0.000,0.0,83.0,,0.0,tt0094806,"Mystery, Crime, Drama"
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",33592.0,...,7.336,1987-06-19,White of the Eye,5.742,64.0,111.0,,0.0,tt0094320,"Horror, Thriller"
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",11192.0,...,2.397,1983-05-11,A Woman in Flames,5.300,13.0,106.0,,0.0,tt0083949,Drama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68939,32468537,/m/0crwd9y,Shadow Boxing 2,2007-10-18,,132.0,"{""/m/06b_j"": ""Russian Language"", ""/m/02h40lc"":...","{""/m/06bnz"": ""Russia""}","{""/m/01z02hx"": ""Sports"", ""/m/0lsxr"": ""Crime Fi...",56525.0,...,4.421,2007-10-18,Revenge,5.607,28.0,,,,,
68940,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",117124.0,...,5.098,2011-03-19,Mermaids: The Body Found,4.500,20.0,,,,,
68941,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",44946.0,...,9.789,2010-10-22,Knucklehead,5.500,50.0,,,,,
68942,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",285337.0,...,1.960,1972-08-23,Another Nice Mess,0.000,0.0,,,,,


Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres,TMDB_id,...,TMDB_popularity,TMDB_release_date,TMDB_title,TMDB_vote_average,TMDB_vote_count,TMDB_runtime,TMDB_revenue,TMDB_budget,TMDB_IMDB_id,TMDB_genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",10016.0,...,17.280,2001-08-24,Ghosts of Mars,5.123,980.0,98.0,14010832.0,28000000.0,tt0228333,"Action, Horror, Science Fiction"
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",784579.0,...,0.750,2000-02-16,Getting Away with Murder: The JonBenét Ramsey ...,8.000,1.0,60.0,,0.0,tt0245916,"Drama, Crime"
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",396302.0,...,0.600,1988-11-17,Hair of the Dog,0.000,0.0,83.0,,0.0,tt0094806,"Mystery, Crime, Drama"
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",33592.0,...,7.336,1987-06-19,White of the Eye,5.742,64.0,111.0,,0.0,tt0094320,"Horror, Thriller"
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",11192.0,...,2.397,1983-05-11,A Woman in Flames,5.300,13.0,106.0,,0.0,tt0083949,Drama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",117124.0,...,5.098,2011-03-19,Mermaids: The Body Found,4.500,20.0,,,,,
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",44946.0,...,9.789,2010-10-22,Knucklehead,5.500,50.0,,,,,
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",285337.0,...,1.960,1972-08-23,Another Nice Mess,0.000,0.0,,,,,
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",,...,,,,,,,,,,


##### 2.1 Combining the revenue columns into a single one
In this section we analyse the collected data, combine the 'Movie box office revenue' and 'TMDB_revenue' columns, and save another version of the dataset, which contains the final data for the analysis.
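The combination step can be illustrated on a toy frame. In this minimal sketch the two non-missing values are copied from the output shown further down, while the third row and the column name `enriched` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Movie box office revenue": [932000.0, np.nan, np.nan],
    "TMDB_revenue": [np.nan, 6718.0, np.nan],
})

# Rows where exactly one of the two columns is missing (exclusive or)
only_one = toy["Movie box office revenue"].isna() ^ toy["TMDB_revenue"].isna()

# Keep the original value where present, otherwise fall back to TMDB
toy["enriched"] = toy["Movie box office revenue"].combine_first(toy["TMDB_revenue"])

print(only_one.tolist())
print(toy["enriched"].tolist())
```

`combine_first` gives priority to the series it is called on, so the original revenue is never overwritten by the TMDB value.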

In [39]:
import numpy as np
import pandas as pd

movie_metadata_v3 = pd.read_csv('modified_data/movie_metadata_TMDB_v3.csv')

# We will now compare the two columns containing the 'original revenue' and 'TMDB_revenue', where either column is NaN but not both at the same time. This will show how they can complement each other.
filtered_df = movie_metadata_v3[(movie_metadata_v3['Movie box office revenue'].isna() | movie_metadata_v3['TMDB_revenue'].isna()) & ~(movie_metadata_v3['Movie box office revenue'].isna() & movie_metadata_v3['TMDB_revenue'].isna())]
display(filtered_df[['Movie name','Movie box office revenue','TMDB_revenue']])

# Combining the two revenue columns into one using combine_first which combines two df objects by filling null values in one df with non-null from the other df 
movie_metadata_v3['Movie box office revenue enriched'] = movie_metadata_v3['Movie box office revenue'].combine_first(movie_metadata_v3['TMDB_revenue'])

# Comparing old dataset to new
tmdb_v3_revenue = movie_metadata_v3['Movie box office revenue enriched'].isna().sum()
tmdb_v1 = pd.read_csv('modified_data/movie_metadata_TMDB.csv')
tmdb_v1_revenue = tmdb_v1['Movie box office revenue'].isna().sum()

print("Original revenue NaN:", tmdb_v1_revenue)
print("New revenue NaN:",tmdb_v3_revenue)
print("Difference", tmdb_v1_revenue-tmdb_v3_revenue)

# Removing the two columns that were combined into one column
movie_metadata_v4 = movie_metadata_v3.drop(columns=['Movie box office revenue','TMDB_revenue'])

display(movie_metadata_v4)

movie_metadata_v4.to_csv('modified_data/movie_metadata_TMDB_FINAL.csv',index=False) # Saving the final version of the modified dataset



Unnamed: 0,Movie name,Movie box office revenue,TMDB_revenue
36,They Knew What They Wanted,932000.0,
47,Daddy and Them,,6718.0
53,Rudo y Cursi,11091868.0,
57,Innocence,,37598.0
60,The Great New Wonderful,172055.0,
...,...,...,...
81695,Coming to America,288752301.0,
81720,Spaced Invaders,15369573.0,
81725,State and Main,6944471.0,
81726,Guilty as Sin,22886222.0,


Original revenue NaN: 73340
New revenue NaN: 71723
Difference 1617


Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie name,Movie release date,Movie runtime,Movie languages,Movie countries,Movie genres,TMDB_id,TMDB_original_language,...,TMDB_popularity,TMDB_release_date,TMDB_title,TMDB_vote_average,TMDB_vote_count,TMDB_runtime,TMDB_budget,TMDB_IMDB_id,TMDB_genres,Movie box office revenue enriched
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",10016.0,en,...,17.280,2001-08-24,Ghosts of Mars,5.123,980.0,98.0,28000000.0,tt0228333,"Action, Horror, Science Fiction",14010832.0
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",784579.0,en,...,0.750,2000-02-16,Getting Away with Murder: The JonBenét Ramsey ...,8.000,1.0,60.0,0.0,tt0245916,"Drama, Crime",
2,28463795,/m/0crgdbh,Brun bitter,1988,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",396302.0,no,...,0.600,1988-11-17,Hair of the Dog,0.000,0.0,83.0,0.0,tt0094806,"Mystery, Crime, Drama",
3,9363483,/m/0285_cd,White Of The Eye,1987,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",33592.0,en,...,7.336,1987-06-19,White of the Eye,5.742,64.0,111.0,0.0,tt0094320,"Horror, Thriller",
4,261236,/m/01mrr1,A Woman in Flames,1983,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",11192.0,de,...,2.397,1983-05-11,A Woman in Flames,5.300,13.0,106.0,0.0,tt0083949,Drama,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",117124.0,en,...,5.098,2011-03-19,Mermaids: The Body Found,4.500,20.0,,,,,
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",44946.0,en,...,9.789,2010-10-22,Knucklehead,5.500,50.0,,,,,
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",285337.0,en,...,1.960,1972-08-23,Another Nice Mess,0.000,0.0,,,,,
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",,,...,,,,,,,,,,


It seems that the original dataset already contained most of the available revenue information.
However, revenue was added for 1617 movies, which is a meaningful gain in the context of the ~40000 movies with plot summaries. This gives the later analysis more data to work with.

