# Cleaning data

Load the base data for the analysis into a DataFrame:

In [1]:
import pandas as pd

df = pd.read_csv('top_movies.csv') # read csv file

Standardize the column names:

In [3]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.rename(columns={'moive_name': 'title'})
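
As a rough illustration of what this normalization does, the sketch below applies the same two steps to an empty DataFrame. The raw header names are hypothetical, guessed from the column names used later in this notebook, and may not match the real file exactly.

```python
import pandas as pd

# Hypothetical raw headers, used only to illustrate the normalization.
demo = pd.DataFrame(columns=['Unnamed: 0', 'Moive Name', 'Meta Score', 'PG Rating'])

# Lowercase every header and replace spaces with underscores.
demo.columns = demo.columns.str.lower().str.replace(' ', '_')

# Rename the misspelled source column to a clearer name.
demo = demo.rename(columns={'moive_name': 'title'})

normalized = list(demo.columns)
print(normalized)
```

Note that only spaces are replaced, so a header such as 'Unnamed: 0' keeps its colon and becomes 'unnamed:_0'.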

Remove what is not useful for the analysis: rows with a missing cast or director, and columns that are not needed.

In [4]:
df = df.dropna(subset=['cast', 'director'])

In [6]:
df = df.drop(columns = ['unnamed:_0', 'votes', 'pg_rating', 'duration'])
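
It can help to check how many rows such a filter removes before moving on. The following is a minimal sketch on made-up data (the rows and the `rows_dropped` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows: one is missing its cast, another its director.
demo = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'cast': ['X, Y', None, 'Z'],
    'director': ['D1', 'D2', None],
    'votes': [10, 20, 30],
})

rows_before = len(demo)

# Keep only rows where both 'cast' and 'director' are present.
demo = demo.dropna(subset=['cast', 'director'])

# Drop a column that is not needed downstream.
demo = demo.drop(columns=['votes'])

rows_dropped = rows_before - len(demo)
print(rows_dropped, list(demo.columns))
```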

Convert the 'genre' and 'cast' columns into lists so that individual genres and cast members are easier to work with:

In [8]:
df['genre'] = df['genre'].apply(lambda x: x.split(', '))
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
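
One reason list columns are convenient is that they can be expanded to one row per element. The sketch below uses made-up rows and the vectorized `.str.split`, a possible alternative to the `apply` call above that leaves missing values as NaN instead of raising an error.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'title': ['movie a', 'movie b'],
    'genre': ['Drama, Fantasy', 'Drama'],
})

# Split the comma-separated string into a list per row.
demo['genre'] = demo['genre'].str.split(', ')

# Expand to one row per (title, genre) pair and count each genre.
genre_counts = demo.explode('genre')['genre'].value_counts()
print(genre_counts)
```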

Final check of the DataFrame:

In [9]:
df.dtypes

title          object
rating        float64
meta_score    float64
genre          object
year            int64
cast           object
director       object
dtype: object

In [None]:
df.sample(10)

## Revenue DataFrame

Load the DataFrame with the movie titles and their revenue:

In [10]:
revenue_df = pd.read_csv('revenue_movies_dataset.csv') # read csv file

Standardize the movie titles so that the DataFrames can be merged on the 'title' column:

In [11]:
def clean_title(title):
    # Convert to lowercase
    title = title.lower()
    # Strip leading and trailing whitespace
    title = title.strip()
    # Remove special characters (optional, depends on your data)
    title = title.replace("'", "")  # Remove apostrophes
    title = title.replace(":", "")  # Remove colons
    return title

# Apply the cleaning function to both 'title' columns
df['title'] = df['title'].apply(clean_title)
revenue_df['title'] = revenue_df['title'].apply(clean_title)
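
To see the effect of the cleaning on individual strings, here is a small standalone sketch. It repeats the same steps in a compact form, and the sample titles are chosen only for illustration.

```python
def clean_title(title):
    # Same steps as above: lowercase, strip, then drop apostrophes and colons.
    title = title.lower().strip()
    return title.replace("'", "").replace(":", "")

examples = ["  Avatar: The Way of Water ", "Schindler's List"]
cleaned = [clean_title(t) for t in examples]
print(cleaned)
```

Other punctuation, such as '!' or '&', is left untouched, so titles that differ only in those characters would still fail to match.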

Remove duplicate titles from the revenue DataFrame:

In [12]:
revenue_df.drop_duplicates(subset='title', inplace=True)
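
By default `drop_duplicates` keeps the first occurrence of each title and discards the rest, so it may be worth inspecting the duplicated titles first. This is a sketch on made-up data, with invented revenue values:

```python
import pandas as pd

# Made-up data: 'gremlins' appears twice with different revenue values.
demo = pd.DataFrame({
    'title': ['gremlins', 'gremlins', 'shooter'],
    'revenue': [100, 200, 300],
})

# List every title that occurs more than once.
mask = demo.duplicated(subset='title', keep=False)
duplicated_titles = demo.loc[mask, 'title'].unique().tolist()

# Keep only the first row for each title (the default, keep='first').
deduped = demo.drop_duplicates(subset='title')
print(duplicated_titles, deduped['revenue'].tolist())
```

If two different movies share a title after cleaning, for example a remake, this step would keep only one of them.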

Merge both DataFrames to obtain a complete one with all the necessary information about the movies, including their revenue. The inner join keeps only the titles present in both.

In [13]:
# Merge the original df with revenue_df
df_merged = pd.merge(df, revenue_df, on='title', how='inner')
df_merged.sample(10)

index,title,rating,meta_score,genre,year,cast,director,revenue
541,edward scissorhands,7.9,74.0,"[Drama, Fantasy, Romance]",1990,"[Johnny Depp, Winona Ryder, Dianne Wiest, Anth...",Tim Burton,53000000
1392,shooter,7.1,53.0,"[Action, Drama, Thriller]",2007,"[Mark Wahlberg, Michael Peña, Rhona Mitra, Dan...",Antoine Fuqua,95696996
64,jingle all the way,5.7,34.0,"[Adventure, Comedy, Family]",1996,"[Arnold Schwarzenegger, Sinbad, Phil Hartman, ...",Brian Levant,129832389
1524,the purge,5.7,41.0,"[Horror, Sci-Fi, Thriller]",2013,"[Ethan Hawke, Lena Headey, Max Burkholder, Ade...",James DeMonaco,89328627
80,gremlins,7.3,70.0,"[Comedy, Fantasy, Horror]",1984,"[Zach Galligan, Phoebe Cates, Hoyt Axton, John...",Joe Dante,153083102
139,shazam! fury of the gods,6.0,47.0,"[Action, Adventure, Comedy]",2023,"[Zachary Levi, Asher Angel, Jack Dylan Grazer,...",David F. Sandberg,133437105
128,avatar the way of water,7.6,67.0,"[Action, Adventure, Fantasy]",2022,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",James Cameron,2320250281
1592,fast & furious presents hobbs & shaw,6.5,60.0,"[Action, Adventure, Thriller]",2019,"[Dwayne Johnson, Jason Statham, Idris Elba, Va...",David Leitch,760098996
1707,star trek beyond,7.0,68.0,"[Action, Adventure, Sci-Fi]",2016,"[Chris Pine, Zachary Quinto, Karl Urban, Zoe S...",Justin Lin,343471816
1352,the name of the rose,7.7,54.0,"[Drama, Mystery, Thriller]",1986,"[Sean Connery, Christian Slater, Helmut Qualti...",Jean-Jacques Annaud,77200000


In [14]:
df_merged.shape

(1805, 8)
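
An inner join silently drops titles that have no match on the other side. One way to see what was lost is an outer merge with `indicator=True`, sketched here on made-up frames (the names `left`, `right`, and `check` are illustrative):

```python
import pandas as pd

# Made-up frames standing in for df and revenue_df.
left = pd.DataFrame({'title': ['gremlins', 'shooter', 'the purge']})
right = pd.DataFrame({'title': ['gremlins', 'shooter', 'jaws'],
                      'revenue': [1, 2, 3]})

# The outer merge keeps every title; '_merge' records where each row came from.
check = pd.merge(left, right, on='title', how='outer', indicator=True)

# Titles that exist only in the left frame would be lost by an inner join.
unmatched_left = check.loc[check['_merge'] == 'left_only', 'title'].tolist()

# Rows marked 'both' are exactly the rows an inner join would return.
inner_rows = int((check['_merge'] == 'both').sum())
print(unmatched_left, inner_rows)
```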

Save the result to a CSV file:

In [15]:
df_merged.to_csv('top_movies_cleaned.csv', index=False)
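
One thing to keep in mind when reloading this file: CSV has no list type, so list columns such as 'genre' and 'cast' come back as plain strings. The sketch below shows one possible way to restore them with `ast.literal_eval`, using an in-memory buffer and a made-up row instead of the real file.

```python
import ast
import io

import pandas as pd

# Made-up row with a list-valued column.
demo = pd.DataFrame({'title': ['gremlins'],
                     'genre': [['Comedy', 'Fantasy', 'Horror']]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# After reloading, the list is a string such as "['Comedy', 'Fantasy', 'Horror']".
type_after_reload = type(reloaded.loc[0, 'genre'])

# Parse the string representation back into a Python list.
reloaded['genre'] = reloaded['genre'].apply(ast.literal_eval)
print(type_after_reload, reloaded.loc[0, 'genre'])
```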