# Cleaning data

Load the base dataset for the analysis into a DataFrame:

In [1]:
import pandas as pd

df = pd.read_csv('data/raw/top_movies.csv') # read csv file

Standardize the column names:

In [2]:
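# Lowercase the headers and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 'moive_name' is the header as it is spelled in the raw file
df = df.rename(columns={'moive_name': 'title'})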
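
As a minimal sketch of what this normalization does, the snippet below applies the same two steps to an empty toy DataFrame. The raw header names (`'Moive Name'`, `'Meta Score'`, `'PG Rating'`) are assumptions about how the source file might look, not values read from it.

```python
import pandas as pd

# Toy frame with hypothetical raw headers, standing in for the real CSV.
toy = pd.DataFrame(columns=['Moive Name', 'Meta Score', 'PG Rating'])

# Lowercase every header and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(' ', '_')

# Rename the misspelled header to a clearer name.
toy = toy.rename(columns={'moive_name': 'title'})

print(list(toy.columns))
```

Note that `rename` silently ignores keys that do not match any column, so a typo in the mapping would leave the header unchanged without raising an error.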

Drop the rows with a missing cast or director, then drop the columns that are not needed for the analysis:

In [3]:
df = df.dropna(subset=['cast', 'director'])

In [4]:
# 'unnamed:_0' is the leftover index column from the raw CSV
df = df.drop(columns=['unnamed:_0', 'votes', 'pg_rating', 'duration'])
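
The effect of the two removal steps can be illustrated on a small made-up frame. The values below are invented for the example; only the column names mirror the ones used above.

```python
import pandas as pd

# Made-up rows: B lacks a cast, C lacks a director.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'cast': ['X, Y', None, 'Z'],
    'director': ['D1', 'D2', None],
    'votes': [10, 20, 30],
})

# A row is dropped if ANY of the listed columns is missing.
toy = toy.dropna(subset=['cast', 'director'])

# Drop a column that is not needed downstream.
toy = toy.drop(columns=['votes'])

print(toy)
```

Only row `A` survives, because `dropna(subset=...)` removes a row as soon as one of the listed columns is missing.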

Convert the 'genre' and 'cast' columns into lists so that individual genres and cast members are easier to work with:

In [5]:
df['genre'] = df['genre'].apply(lambda x: x.split(', '))
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
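
One possible alternative, sketched here on invented data, is the vectorized `.str.split` accessor. Unlike the `lambda` version, it leaves missing values as they are instead of raising an error, which may matter for 'genre' since only 'cast' and 'director' were filtered for missing values. The sketch also shows how list columns can be counted with `explode`.

```python
import pandas as pd

# Invented sample; the real column holds comma-separated genre strings.
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'genre': ['Comedy, Drama', 'Drama'],
})

# .str.split propagates missing values instead of failing on them.
toy['genre'] = toy['genre'].str.split(', ')

# explode() yields one row per (title, genre) pair, which makes counting easy.
genre_counts = toy.explode('genre')['genre'].value_counts()

print(genre_counts)
```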

Final check of the DataFrame:

In [6]:
df.dtypes

title          object
rating        float64
meta_score    float64
genre          object
year            int64
cast           object
director       object
dtype: object

In [7]:
df.sample(10)

index,title,rating,meta_score,genre,year,cast,director
1360,Fresh,6.7,67.0,"[Horror, Thriller]",2022,"[Daisy Edgar-Jones, Sebastian Stan, Jojo T. Gi...",Mimi Cave
911,In Bruges,7.9,67.0,"[Comedy, Crime, Drama]",2008,"[Colin Farrell, Brendan Gleeson, Ciarán Hinds,...",Martin McDonagh
611,Guillermo del Toro's Pinocchio,7.6,79.0,"[Animation, Drama, Family]",2022,"[Ewan McGregor, David Bradley, Gregory Mann, B...",Guillermo del ToroMark Gustafson
1758,Red Dragon,7.2,60.0,"[Crime, Drama, Thriller]",2002,"[Anthony Hopkins, Edward Norton, Ralph Fiennes...",Brett Ratner
1819,Big George Foreman,6.6,45.0,"[Biography, Drama, Sport]",2023,"[Khris Davis, Jasmine Mathews, Sullivan Jones,...",George Tillman Jr.
347,Knives Out,7.9,82.0,"[Comedy, Crime, Drama]",2019,"[Daniel Craig, Chris Evans, Ana de Armas, Jami...",Rian Johnson
340,Fast Times at Ridgemont High,7.1,61.0,"[Comedy, Drama]",1982,"[Sean Penn, Jennifer Jason Leigh, Judge Reinho...",Amy Heckerling
198,Old Dads,6.2,42.0,[Comedy],2023,"[Bill Burr, Bobby Cannavale, Bokeem Woodbine, ...",Bill Burr
149,Godzilla,6.4,62.0,"[Action, Adventure, Sci-Fi]",2014,"[Aaron Taylor-Johnson, Elizabeth Olsen, Bryan ...",Gareth Edwards
1060,A Merry Friggin' Christmas,5.1,28.0,"[Comedy, Drama]",2014,"[Joel McHale, Lauren Graham, Clark Duke, Olive...",Tristram Shapeero


## Revenue DataFrame

Load the dataset with the movie titles and their revenue:

In [8]:
revenue_df = pd.read_csv('data/raw/revenue_movies_dataset.csv') # read csv file

Standardize the movie titles so that the two DataFrames can be merged on the 'title' column:

In [9]:
def clean_title(title):
    # Convert to lowercase
    title = title.lower()
    # Strip leading and trailing whitespace
    title = title.strip()
    # Remove special characters (optional, depends on your data)
    title = title.replace("'", "")  # Remove apostrophes
    title = title.replace(":", "")  # Remove colons
    return title

# Apply the cleaning function to the 'title' column of both DataFrames
df['title'] = df['title'].apply(clean_title)
revenue_df['title'] = revenue_df['title'].apply(clean_title)
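
To see what the cleaning does, the sketch below runs a condensed copy of `clean_title` on a few sample titles and compares it with a vectorized version built from pandas string methods. The sample titles are illustrative guesses at the raw spelling, and the vectorized variant is an alternative shown for comparison, not the approach used in this notebook.

```python
import pandas as pd

def clean_title(title):
    # Same steps as above: lowercase, strip, drop apostrophes and colons.
    title = title.lower().strip()
    return title.replace("'", "").replace(":", "")

# Illustrative raw titles (assumed spellings).
titles = pd.Series([
    "It's a Wonderful Life",
    "  Guillermo del Toro's Pinocchio ",
    "The Christmas Chronicles: Part Two",
])

applied = titles.apply(clean_title)

# Vectorized equivalent using the .str accessor and a character class.
vectorized = titles.str.lower().str.strip().str.replace(r"[':]", "", regex=True)

print(applied.tolist())
```

Both versions give the same result on this sample. Other punctuation differences between the two sources (hyphens, ampersands, accents) are not handled by either and would still prevent a match.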

Remove duplicate titles from the revenue DataFrame, keeping the first occurrence of each:

In [10]:
revenue_df.drop_duplicates(subset='title', inplace=True)
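
Before dropping duplicates it can help to look at them, since different films that share a title (a remake and its original, for example) collapse into a single row. The following is a rough sketch on made-up values; `toy_revenue` is a hypothetical stand-in for `revenue_df`.

```python
import pandas as pd

# Made-up revenue rows with one repeated title.
toy_revenue = pd.DataFrame({
    'title': ['godzilla', 'godzilla', 'zodiac'],
    'revenue': [100, 200, 300],
})

# keep=False flags every member of a duplicated group, for inspection.
dupes = toy_revenue[toy_revenue.duplicated(subset='title', keep=False)]

# The default keep='first' retains the first row for each title.
deduped = toy_revenue.drop_duplicates(subset='title')

print(dupes)
print(deduped)
```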

Merge both DataFrames to obtain a single DataFrame with all the necessary information about each movie, including its revenue. The inner join keeps only the titles present in both:

In [11]:
# Merge the original df with revenue_df
df_merged = pd.merge(df, revenue_df, on='title', how='inner')
df_merged.sample(10)

index,title,rating,meta_score,genre,year,cast,director,revenue
922,white chicks,5.8,41.0,"[Comedy, Crime]",2004,"[Marlon Wayans, Shawn Wayans, Busy Philipps, M...",Keenen Ivory Wayans,113086475
1748,san andreas,6.1,43.0,"[Action, Adventure, Thriller]",2015,"[Dwayne Johnson, Carla Gugino, Alexandra Dadda...",Brad Peyton,473990832
954,"the cook, the thief, his wife & her lover",7.5,62.0,"[Crime, Drama]",1989,"[Richard Bohringer, Michael Gambon, Helen Mirr...",Peter Greenaway,7724701
37,its a wonderful life,8.6,89.0,"[Drama, Family, Fantasy]",1946,"[James Stewart, Donna Reed, Lionel Barrymore, ...",Frank Capra,9644124
731,little children,7.5,75.0,"[Drama, Romance]",2006,"[Kate Winslet, Jennifer Connelly, Patrick Wils...",Todd Field,14821658
970,traffic,7.6,86.0,"[Crime, Drama, Thriller]",2000,"[Michael Douglas, Benicio Del Toro, Catherine ...",Steven Soderbergh,207515725
355,the christmas chronicles part two,6.0,51.0,"[Adventure, Comedy, Family]",2020,"[Kurt Russell, Goldie Hawn, Darby Camp, Julian...",Chris Columbus,0
628,the perks of being a wallflower,7.9,67.0,[Drama],2012,"[Logan Lerman, Emma Watson, Ezra Miller, Paul ...",Stephen Chbosky,33400000
776,love again,5.9,32.0,"[Comedy, Drama, Romance]",2023,"[Priyanka Chopra Jonas, Sam Heughan, Céline Di...",Jim Strouse,10000000
530,zodiac,7.7,79.0,"[Crime, Drama, Mystery]",2007,"[Jake Gyllenhaal, Robert Downey Jr., Mark Ruff...",David Fincher,84785914


In [12]:
df_merged.shape

(1805, 8)
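
An inner join silently discards every title that appears in only one of the two DataFrames. One way to check how many rows are lost, sketched here on invented data, is an outer merge with `indicator=True`, which adds a `_merge` column marking where each row came from. The frames `left` and `right` are hypothetical stand-ins for `df` and `revenue_df`.

```python
import pandas as pd

# Invented frames: 'fresh' has no revenue row, 'heat' has no rating row.
left = pd.DataFrame({'title': ['zodiac', 'traffic', 'fresh'],
                     'rating': [7.7, 7.6, 6.7]})
right = pd.DataFrame({'title': ['zodiac', 'traffic', 'heat'],
                      'revenue': [1, 2, 3]})

# The inner join keeps only titles found in both frames.
inner = pd.merge(left, right, on='title', how='inner')

# The outer join with an indicator shows which rows failed to match.
outer = pd.merge(left, right, on='title', how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']

print(inner.shape)
print(unmatched[['title', '_merge']])
```

Rows marked `left_only` or `right_only` are the ones the inner join drops, and they are a good place to look for titles that the cleaning step did not fully normalize.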

Save the result to a CSV file:

In [16]:
df_merged.to_csv('./data/top_movies_cleaned.csv', index=False)
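
One thing to keep in mind when reloading this file: CSV has no list type, so the 'genre' and 'cast' lists are written as their string representation. The sketch below, which uses an in-memory buffer instead of a real file and an invented one-row frame, shows one way to turn them back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Invented one-row frame with a list column.
toy = pd.DataFrame({'title': ['fresh'], 'genre': [['Horror', 'Thriller']]})

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the list comes back as a plain string.
raw_value = reloaded['genre'][0]

# literal_eval safely parses the string back into a Python list.
reloaded['genre'] = reloaded['genre'].apply(ast.literal_eval)

print(type(raw_value), reloaded['genre'][0])
```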