# Cleaning data

Load the base data for the analysis into a DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv('data/raw/top_movies.csv') # read csv file

Standardize the column names.

In [5]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.rename(columns={'moive_name': 'title'})
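To make the effect of the renaming step concrete, here is a minimal sketch applied to a few column labels. The raw labels below are assumptions inferred from the names used later in the notebook (`unnamed:_0`, `moive_name`, `pg_rating`); the actual headers in `top_movies.csv` may differ.

```python
import pandas as pd

# Hypothetical raw headers, inferred from the cleaned names used in this notebook
raw_columns = pd.Index(["Unnamed: 0", "Moive Name", "PG Rating", "Meta Score"])

# Same transformation as above: lowercase, then replace spaces with underscores
standardized = raw_columns.str.lower().str.replace(" ", "_")

print(list(standardized))
```

Only spaces are replaced, so other punctuation such as the colon in `Unnamed: 0` is kept.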

Drop the rows and columns that are not useful for the analysis:

In [6]:
df = df.dropna(subset=['cast', 'director'])

In [7]:
df = df.drop(columns=['unnamed:_0', 'votes', 'pg_rating', 'duration'])
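Before dropping rows, it can help to see how many values are actually missing in each column. The following is a small sketch on a made-up three-row sample (the titles and names are placeholders, not rows from the dataset):

```python
import pandas as pd

# Hypothetical sample standing in for the real DataFrame
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "cast": ["X, Y", None, "Z"],
    "director": ["D1", "D2", None],
})

# Count missing values per column
missing = sample.isna().sum()

# Keep only rows where both 'cast' and 'director' are present
cleaned = sample.dropna(subset=["cast", "director"])

print(missing)
print(len(sample), "->", len(cleaned))
```

Comparing the row count before and after shows how much data the filter removes.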

Convert the 'genre' and 'cast' columns from comma-separated strings into lists so that individual values are easier to work with:

In [8]:
df['genre'] = df['genre'].apply(lambda x: x.split(', '))
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
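The lambda above raises an `AttributeError` if a value is missing, and only 'cast' and 'director' were filtered with `dropna`. As a hedged alternative, assuming 'genre' could contain missing values, the vectorized `.str.split` leaves them as missing instead of failing. The sample data is hypothetical:

```python
import pandas as pd

# Hypothetical sample with one missing genre
sample = pd.DataFrame({"genre": ["Drama, Romance", None, "Comedy"]})

# .str.split skips missing values instead of raising
sample["genre"] = sample["genre"].str.split(", ")

# explode gives one row per genre, which makes counting straightforward
genre_counts = sample.explode("genre")["genre"].value_counts()

print(genre_counts)
```

Once the column holds lists, `explode` is one way to analyze genres or cast members individually.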

Final check of the DataFrame:

In [9]:
df.dtypes

title          object
rating        float64
meta_score    float64
genre          object
year            int64
cast           object
director       object
dtype: object

In [10]:
df.sample(10)

index,title,rating,meta_score,genre,year,cast,director
793,Paddington,7.3,77.0,"[Adventure, Comedy, Family]",2014,"[Hugh Bonneville, Sally Hawkins, Julie Walters...",Paul King
1122,Fantastic Beasts and Where to Find Them,7.2,66.0,"[Adventure, Family, Fantasy]",2016,"[Eddie Redmayne, Katherine Waterston, Alison S...",David Yates
767,Romeo + Juliet,6.7,60.0,"[Drama, Romance]",1996,"[Leonardo DiCaprio, Claire Danes, John Leguiza...",Baz Luhrmann
813,Frequency,7.4,67.0,"[Crime, Drama, Mystery]",2000,"[Dennis Quaid, Jim Caviezel, Shawn Doyle, Eliz...",Gregory Hoblit
871,WALL·E,8.4,95.0,"[Animation, Adventure, Family]",2008,"[Ben Burtt, Elissa Knight, Jeff Garlin, Fred W...",Andrew Stanton
1853,Moonrise Kingdom,7.8,84.0,"[Adventure, Comedy, Drama]",2012,"[Jared Gilman, Kara Hayward, Bruce Willis, Bil...",Wes Anderson
347,Knives Out,7.9,82.0,"[Comedy, Crime, Drama]",2019,"[Daniel Craig, Chris Evans, Ana de Armas, Jami...",Rian Johnson
1706,Forgetting Sarah Marshall,7.1,67.0,"[Comedy, Drama, Romance]",2008,"[Kristen Bell, Jason Segel, Paul Rudd, Mila Ku...",Nicholas Stoller
1304,Legends of the Fall,7.5,45.0,"[Drama, Romance, War]",1994,"[Brad Pitt, Anthony Hopkins, Aidan Quinn, Juli...",Edward Zwick
1820,Pokémon: Detective Pikachu,6.5,53.0,"[Action, Adventure, Comedy]",2019,"[Ryan Reynolds, Justice Smith, Kathryn Newton,...",Rob Letterman


## Revenue DataFrame

Load the DataFrame with the movie titles and the revenue each one earned:

In [12]:
revenue_df = pd.read_csv('data/raw/revenue_movies_dataset.csv') # read csv file

Standardize the movie titles so that the two DataFrames can be merged on the 'title' column.

In [13]:
def clean_title(title):
    # Convert to lowercase
    title = title.lower()
    # Strip leading and trailing whitespace
    title = title.strip()
    # Remove special characters (optional, depends on your data)
    title = title.replace("'", "")  # Remove apostrophes
    title = title.replace(":", "")  # Remove colons
    return title

# Apply the cleaning function to the 'title' column of both DataFrames
df['title'] = df['title'].apply(clean_title)
revenue_df['title'] = revenue_df['title'].apply(clean_title)
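As a quick illustration, the sketch below runs a condensed copy of the cleaning function on a few example strings. The last example is an assumption added to show a limitation: only the straight apostrophe `'` is removed, so a typographic apostrophe `’` would survive and could prevent a match.

```python
def clean_title(title):
    # Same steps as above: lowercase, strip, drop apostrophes and colons
    title = title.lower().strip()
    return title.replace("'", "").replace(":", "")

examples = [
    "She's All That",
    "  Avatar ",
    "Pokémon: Detective Pikachu",
    "Ocean’s Eleven",  # curly apostrophe, not handled by the function
]

cleaned = [clean_title(t) for t in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

If the two sources use different punctuation styles, the list of replaced characters may need to be extended.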

Remove duplicate titles from the revenue DataFrame:

In [14]:
revenue_df.drop_duplicates(subset='title', inplace=True)
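`drop_duplicates` keeps the first row of each duplicated title, whichever revenue it carries. A minimal sketch of how the duplicates could be inspected first, and of one possible alternative that keeps the highest revenue per title, is shown below. The duplicated row is made up for illustration, and whether the largest value is the right one to keep is an assumption about the data:

```python
import pandas as pd

# Hypothetical sample; the second 'avatar' row is invented for illustration
revenue_sample = pd.DataFrame({
    "title": ["avatar", "avatar", "liar liar"],
    "revenue": [2923706026, 0, 302710615],
})

# keep=False flags every row that belongs to a duplicated title
dupes = revenue_sample[revenue_sample.duplicated(subset="title", keep=False)]

# Alternative: sort so the largest revenue comes first, then deduplicate
deduped = (
    revenue_sample.sort_values("revenue", ascending=False)
    .drop_duplicates(subset="title")
)

print(dupes)
print(deduped)
```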

Merge both to obtain a complete DataFrame with all the necessary information about the movies, including their revenue.

In [15]:
# Merge the original df with revenue_df
df_merged = pd.merge(df, revenue_df, on='title', how='inner')
df_merged.sample(10)

index,title,rating,meta_score,genre,year,cast,director,revenue
1730,young guns,6.8,50.0,"[Action, Drama, Western]",1988,"[Emilio Estevez, Kiefer Sutherland, Lou Diamon...",Christopher Cain,44726644
365,avatar,7.9,83.0,"[Action, Adventure, Fantasy]",2009,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",James Cameron,2923706026
344,ghosted,5.8,34.0,"[Action, Adventure, Comedy]",2023,"[Chris Evans, Ana de Armas, Adrien Brody, Mike...",Dexter Fletcher,0
1129,kill bill vol. 2,8.0,83.0,"[Action, Crime, Thriller]",2004,"[Uma Thurman, David Carradine, Michael Madsen,...",Quentin Tarantino,152159461
1152,the village,6.6,44.0,"[Drama, Mystery, Thriller]",2004,"[Sigourney Weaver, William Hurt, Joaquin Phoen...",M. Night Shyamalan,256697520
1481,liar liar,6.9,70.0,"[Comedy, Fantasy]",1997,"[Jim Carrey, Maura Tierney, Amanda Donohoe, Je...",Tom Shadyac,302710615
885,the kill room,5.4,58.0,"[Comedy, Thriller]",2023,"[Alexis Linkletter, Joe Manganiello, Danny Pla...",Nicol Paone,476375
1235,fifty shades freed,4.5,31.0,"[Drama, Romance, Thriller]",2018,"[Dakota Johnson, Jamie Dornan, Eric Johnson, E...",James Foley,368307760
160,shes all that,5.9,51.0,"[Comedy, Romance]",1999,"[Freddie Prinze Jr., Rachael Leigh Cook, Matth...",Robert Iscove,103166989
243,holidate,6.1,44.0,"[Comedy, Romance]",2020,"[Emma Roberts, Luke Bracey, Kristin Chenoweth,...",John Whitesell,0


In [16]:
df_merged.shape

(1805, 8)
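An inner merge silently drops every movie whose title has no exact match in the revenue data. One way to see what was lost, sketched here on hypothetical sample frames, is a left merge with `indicator=True`, which adds a `_merge` column describing where each row came from:

```python
import pandas as pd

# Hypothetical samples standing in for df and revenue_df
movies = pd.DataFrame({"title": ["avatar", "holidate", "wall·e"]})
revenue = pd.DataFrame({
    "title": ["avatar", "holidate"],
    "revenue": [2923706026, 0],
})

# A left merge keeps all movies; _merge says whether a revenue row was found
check = pd.merge(movies, revenue, on="title", how="left", indicator=True)

# Titles present in the movie data but missing from the revenue data
unmatched = check.loc[check["_merge"] == "left_only", "title"].tolist()

print(unmatched)
```

Reviewing the unmatched titles can reveal spelling or punctuation differences that the title cleaning did not cover.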

Save the result in a .csv file.

In [15]:
df_merged.to_csv('top_movies_cleaned.csv', index=False)
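Note that CSV has no list type, so the 'genre' and 'cast' lists are written as their string representation and come back as plain strings when the file is read again. The sketch below shows one way to restore them with `ast.literal_eval`; it uses an in-memory buffer and a one-row sample instead of the real file:

```python
import ast
import io

import pandas as pd

# One-row sample with a list-valued column
sample = pd.DataFrame({
    "title": ["avatar"],
    "genre": [["Action", "Adventure", "Fantasy"]],
})

# Write to and read from an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, 'genre' holds strings; parse them back into lists
reloaded["genre"] = reloaded["genre"].apply(ast.literal_eval)

print(type(reloaded["genre"].iloc[0]))
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for this purpose.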