# Cleaning data

Load the raw movie data into a DataFrame:

In [2]:
import pandas as pd

df = pd.read_csv('../data/raw/top_movies.csv') # read csv file

Standardize the column names.

In [3]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.rename(columns={'moive_name': 'title'})
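
To make the effect of the normalization concrete, here is a minimal sketch on an empty frame. The raw header spellings are assumptions, reconstructed from the normalized names used later in this notebook; the real file may differ.

```python
import pandas as pd

# Hypothetical raw headers, inferred from the normalized names used in this notebook.
raw = pd.DataFrame(columns=['Unnamed: 0', 'Moive Name', 'PG Rating', 'Meta Score'])

# Lowercase every header and replace spaces with underscores.
normalized = raw.columns.str.lower().str.replace(' ', '_')
print(list(normalized))
```

Punctuation such as the colon in `Unnamed: 0` is left untouched, which is why that column is later referred to as `unnamed:_0`.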

Drop the rows with a missing cast or director, then drop the columns the analysis does not need:

In [4]:
df = df.dropna(subset=['cast', 'director'])

In [5]:
df = df.drop(columns = ['unnamed:_0', 'votes', 'pg_rating', 'duration'])
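
Before dropping anything, it can help to measure how much data is affected. The sketch below uses a toy frame with made-up values (not the real dataset) to count missing entries per column and the number of rows removed.

```python
import pandas as pd

# Toy data with hypothetical values, used only to illustrate the checks.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'cast': ['X, Y', None, 'Z'],
    'director': ['D1', 'D2', None],
    'votes': [10, 20, 30],
})

# Missing values per column, before any row is removed.
missing_per_column = toy.isna().sum()

rows_before = len(toy)
toy = toy.dropna(subset=['cast', 'director'])
rows_dropped = rows_before - len(toy)

# Drop a column that is not needed downstream.
toy = toy.drop(columns=['votes'])

print(missing_per_column)
print(f'rows dropped: {rows_dropped}')
```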

Convert the 'genre' and 'cast' columns from comma-separated strings into lists, so individual values are easier to work with:

In [6]:
df['genre'] = df['genre'].apply(lambda x: x.split(', '))
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
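
As a sketch of what the list columns make possible, the toy example below (titles and genres taken from the sample shown later) splits with the vectorized `.str.split` and then counts genres with `explode`. Unlike the `lambda` version, `.str.split` passes missing values through as NaN instead of raising, which matters only if the column still contains NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Titanic', 'Skyfall'],
    'genre': ['Drama, Romance', 'Action, Adventure, Thriller'],
})

# Vectorized equivalent of apply(lambda x: x.split(', ')).
toy['genre'] = toy['genre'].str.split(', ')

# One row per (movie, genre) pair, then count how often each genre appears.
genre_counts = toy.explode('genre')['genre'].value_counts()
print(genre_counts)
```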

Final check of the DataFrame:

In [7]:
df.dtypes

title          object
rating        float64
meta_score    float64
genre          object
year            int64
cast           object
director       object
dtype: object

In [8]:
df.sample(10)

Unnamed: 0,title,rating,meta_score,genre,year,cast,director
910,Children of Men,7.9,84.0,"[Action, Drama, Sci-Fi]",2006,"[Julianne Moore, Clive Owen, Chiwetel Ejiofor,...",Alfonso Cuarón
497,Blockers,6.2,69.0,"[Comedy, Drama]",2018,"[Leslie Mann, John Cena, Ike Barinholtz, Kathr...",Kay Cannon
155,Titanic,7.9,75.0,"[Drama, Romance]",1997,"[Leonardo DiCaprio, Kate Winslet, Billy Zane, ...",James Cameron
73,Harry Potter and the Sorcerer's Stone,7.6,65.0,"[Adventure, Family, Fantasy]",2001,"[Daniel Radcliffe, Rupert Grint, Richard Harri...",Chris Columbus
1123,Sea of Love,6.8,66.0,"[Crime, Drama, Mystery]",1989,"[Al Pacino, Ellen Barkin, John Goodman, Michae...",Harold Becker
418,Skyfall,7.8,81.0,"[Action, Adventure, Thriller]",2012,"[Daniel Craig, Javier Bardem, Naomie Harris, J...",Sam Mendes
1841,Maybe I Do,4.9,42.0,"[Comedy, Romance]",2023,"[Diane Keaton, William H. Macy, Richard Gere, ...",Michael Jacobs
1047,Minority Report,7.6,80.0,"[Action, Crime, Mystery]",2002,"[Tom Cruise, Colin Farrell, Samantha Morton, M...",Steven Spielberg
112,Gladiator,8.5,67.0,"[Action, Adventure, Drama]",2000,"[Russell Crowe, Joaquin Phoenix, Connie Nielse...",Ridley Scott
1559,The Divergent Series: Insurgent,6.2,42.0,"[Action, Adventure, Sci-Fi]",2015,"[Shailene Woodley, Ansel Elgort, Theo James, K...",Robert Schwentke


## Revenue DataFrame

Load the DataFrame with the movie titles and the revenue each one earned:

In [11]:
revenue_df = pd.read_csv('../data/raw/revenue_movies_dataset.csv') # read csv file

Standardize the movie titles so the two DataFrames can be merged on the 'title' column.

In [12]:
def clean_title(title):
    # Convert to lowercase
    title = title.lower()
    # Strip leading and trailing whitespace
    title = title.strip()
    # Remove special characters (optional, depends on your data)
    title = title.replace("'", "")  # Remove apostrophes
    title = title.replace(":", "")  # Remove colons
    return title

# Apply the cleaning function to both 'title' columns
df['title'] = df['title'].apply(clean_title)
revenue_df['title'] = revenue_df['title'].apply(clean_title)
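
The following sketch applies a condensed copy of `clean_title` to two titles from the earlier sample, plus one padded string added here for illustration, to show what the merge key ends up looking like.

```python
def clean_title(title):
    # Same steps as above: lowercase, strip, drop apostrophes and colons.
    title = title.lower().strip()
    return title.replace("'", "").replace(":", "")

examples = [
    "Harry Potter and the Sorcerer's Stone",
    "The Divergent Series: Insurgent",
    "  Fight Club ",  # padded on purpose to show strip()
]
cleaned = [clean_title(t) for t in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), '->', repr(after))
```

Only the straight apostrophe `'` is removed. A typographic apostrophe (`’`) or other punctuation would survive and could still prevent two spellings of the same title from matching.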

Remove duplicate titles from the revenue DataFrame:

In [13]:
revenue_df.drop_duplicates(subset='title', inplace=True)
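
`drop_duplicates` keeps the first row for each title by default, so it is worth looking at the duplicates before discarding them. A minimal sketch with placeholder titles and revenues:

```python
import pandas as pd

# Placeholder rows; the titles and revenue figures are made up.
toy_revenue = pd.DataFrame({
    'title': ['movie a', 'movie a', 'movie b'],
    'revenue': [0, 500, 300],
})

# keep=False flags every member of a duplicate group, not just the repeats.
duplicated_rows = toy_revenue[toy_revenue.duplicated(subset='title', keep=False)]
print(duplicated_rows)

# Default keep='first': the row with revenue 0 is the one that survives here.
deduplicated = toy_revenue.drop_duplicates(subset='title')
print(deduplicated)
```

If two different films share a title (a remake, for example), deduplicating on the title alone keeps whichever row happens to come first.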

Merge both DataFrames with an inner join to obtain a single DataFrame with all the necessary information about the movies, including their revenue. Movies without a matching title in the revenue data are dropped.

In [14]:
# Merge the original df with revenue_df
df_merged = pd.merge(df, revenue_df, on='title', how='inner')
df_merged.sample(10)

Unnamed: 0,title,rating,meta_score,genre,year,cast,director,revenue
1478,a thousand and one,7.0,81.0,"[Crime, Drama]",2023,"[Teyana Taylor, Aaron Kingsley Adetola, Aven C...",A.V. Rockwell,0
315,catch me if you can,8.1,75.0,"[Biography, Crime, Drama]",2002,"[Leonardo DiCaprio, Tom Hanks, Christopher Wal...",Steven Spielberg,352114312
1025,fast five,7.3,66.0,"[Action, Crime, Thriller]",2011,"[Vin Diesel, Paul Walker, Dwayne Johnson, Jord...",Justin Lin,626137675
930,thirteen,6.8,70.0,[Drama],2003,"[Evan Rachel Wood, Holly Hunter, Nikki Reed, V...",Catherine Hardwicke,0
444,percy jackson & the olympians the lightning thief,5.9,47.0,"[Adventure, Family, Fantasy]",2010,"[Logan Lerman, Kevin McKidd, Steve Coogan, Bra...",Chris Columbus,226497209
1275,a quiet place part ii,7.2,71.0,"[Drama, Horror, Sci-Fi]",2020,"[Emily Blunt, Millicent Simmonds, Cillian Murp...",John Krasinski,297400000
1202,silverado,7.2,64.0,"[Action, Crime, Drama]",1985,"[Kevin Kline, Scott Glenn, Kevin Costner, Rosa...",Lawrence Kasdan,32192570
1706,maybe i do,4.9,42.0,"[Comedy, Romance]",2023,"[Diane Keaton, William H. Macy, Richard Gere, ...",Michael Jacobs,4393504
563,guillermo del toros pinocchio,7.6,79.0,"[Animation, Drama, Family]",2022,"[Ewan McGregor, David Bradley, Gregory Mann, B...",Guillermo del ToroMark Gustafson,0
117,fight club,8.8,67.0,[Drama],1999,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",David Fincher,100853753
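
Several rows in the sample above have a revenue of 0. Whether a 0 means "unknown" in the source data is an assumption, not something this notebook verifies. If it does, one possible treatment is sketched below on toy data: measure the share of zeros and mask them as missing.

```python
import pandas as pd

# Toy frame; revenue values are illustrative.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'revenue': [0, 352114312, 0],
})

# Share of rows whose revenue is exactly zero.
zero_share = (toy['revenue'] == 0).mean()

# Assumption: a zero means "not reported", so mask it as NaN.
toy['revenue_clean'] = toy['revenue'].mask(toy['revenue'] == 0)
print(f'{zero_share:.0%} of rows have zero revenue')
print(toy)
```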


In [15]:
df_merged.shape

(1805, 8)
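
An inner join drops unmatched titles without reporting them. To see how many movies fail to match and which ones, a hedged sketch using an outer join with `indicator=True` on toy frames (the rows are illustrative):

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['fight club', 'silverado', 'blockers'],
    'rating': [8.8, 7.2, 6.2],
})
revenue = pd.DataFrame({
    'title': ['fight club', 'silverado', 'some other film'],
    'revenue': [100, 200, 300],  # placeholder figures
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(movies, revenue, on='title', how='outer', indicator=True)
match_counts = audit['_merge'].value_counts()
only_in_movies = audit.loc[audit['_merge'] == 'left_only', 'title'].tolist()

print(match_counts)
print('no revenue match:', only_in_movies)
```

The titles listed under `left_only` are the ones to check for spelling or punctuation differences that `clean_title` does not handle.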

Save the result in a .csv file.

In [17]:
df_merged.to_csv('../data/top_movies_cleaned.csv', index=False)
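
Note that CSV has no list type: the 'genre' and 'cast' lists are written as their string representation and come back as plain strings when the file is read. The sketch below shows one way to restore them on load, using an in-memory buffer instead of the real file.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({'title': ['titanic'], 'genre': [['Drama', 'Romance']]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns the list as a string such as "['Drama', 'Romance']".
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, 'genre']))

# Parsing the column with ast.literal_eval restores real Python lists.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'genre': ast.literal_eval})
print(restored.loc[0, 'genre'])
```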