# Loading and cleaning the chosen dataset
For this project we decided to analyse a movie dataset, available at https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies  
It is updated daily and has over 1 million rows, so it was important to clean and filter it properly before proceeding with the rest of the project.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../data/raw/TMDB_movie_dataset_v11.csv')
df

In [None]:
df.isna().sum()
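
With over a million rows, raw counts of missing values can be hard to compare, so a share per column may be easier to read. The sketch below is only an illustration on a small made-up frame (the `toy` dataframe and its values are hypothetical, not taken from the dataset); the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'tagline': [None, None, 'x', None],
    'revenue': [1.0, np.nan, 3.0, 4.0],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns at the top of this ranking are the first candidates for dropping.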

#### After checking the number of missing values per column, we decided to filter the dataframe and drop the columns that are not useful for our analysis.

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# Keep only movies released on or after 2004-01-01 (roughly the past 20 years);
# rows whose date could not be parsed (NaT) are excluded as well
filtered_df = df[df['release_date'] >= '2004-01-01']
filtered_df.shape
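
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up frame (the `toy` rows are hypothetical, not from the dataset). Unparseable or missing dates become `NaT`, and the date comparison then leaves those rows out, because any comparison with `NaT` evaluates to `False`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_date': ['2010-05-01', 'not a date', None, '1999-12-31'],
})

# Invalid or missing dates are converted to NaT instead of raising an error
toy['release_date'] = pd.to_datetime(toy['release_date'], errors='coerce')
n_unparsed = toy['release_date'].isna().sum()

# NaT rows fail the comparison, so they are filtered out together with old movies
recent = toy[toy['release_date'] >= '2004-01-01']
print(n_unparsed, recent['title'].tolist())
```

Checking `n_unparsed` on the real dataframe before filtering would show how many rows are lost to missing dates rather than to the cutoff itself.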

In [None]:
filtered_df = filtered_df.drop(columns=['homepage', 'backdrop_path', 'imdb_id',
                                       'poster_path', 'id', 'overview', 'tagline',
                                       'original_title', 'vote_count',
                                       'spoken_languages', 'popularity',
                                       'production_companies', 'original_language',
                                       'production_countries', 'keywords'])

In [None]:
# Drop the movies whose revenue is recorded as 0
filtered_df = filtered_df[filtered_df['revenue'] != 0]

In [None]:
filtered_df.shape

In [None]:
filtered_df.isna().sum()

In [None]:
clean_movies = filtered_df.dropna()

In [None]:
clean_movies.shape

#### Sorting the dataframe by revenue, so that the highest-grossing movies are on top

In [None]:
clean_movies = clean_movies.sort_values(by = 'revenue', ascending=False)
clean_movies.head()

#### Further cleaning and taking care of the most obvious outliers

In [None]:
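# Keep only movies that have actually been released
clean_movies = clean_movies[clean_movies['status'] == 'Released']
# Remove implausibly small values for revenue, runtime and budget
clean_movies = clean_movies[clean_movies['revenue'] >= 1000]
clean_movies = clean_movies[clean_movies['runtime'] >= 30]
clean_movies = clean_movies[clean_movies['budget'] >= 1]
# 'status' now holds a single value, so the column is no longer needed
clean_movies = clean_movies.drop(columns=['status'])
clean_movies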
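
The filters above could also be written as one combined boolean mask, which makes it easy to see how many rows each condition rejects. This is only a sketch of that alternative on a made-up frame (the `toy` data and the `conditions` dictionary are hypothetical names introduced here); the thresholds are the ones used in the cell above.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'title':   ['A', 'B', 'C', 'D', 'E'],
    'status':  ['Released', 'Released', 'Rumored', 'Released', 'Released'],
    'revenue': [5_000_000, 500, 2_000_000, 3_000_000, 1_000_000],
    'runtime': [120, 95, 100, 12, 90],
    'budget':  [1_000_000, 10_000, 500_000, 200_000, 0],
})

# One named boolean Series per cleaning rule
conditions = {
    'released': toy['status'] == 'Released',
    'revenue >= 1000': toy['revenue'] >= 1000,
    'runtime >= 30': toy['runtime'] >= 30,
    'budget >= 1': toy['budget'] >= 1,
}

# Number of rows failing each rule, counted independently of the others
failed_counts = {name: int((~cond).sum()) for name, cond in conditions.items()}

# A row is kept only if it satisfies every rule
keep = pd.concat(conditions, axis=1).all(axis=1)
toy_clean = toy[keep].drop(columns=['status'])
print(failed_counts)
print(toy_clean)
```

The result is the same as applying the filters one after another; the per-rule counts are simply a convenient way to report which threshold removes the most rows.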

In [None]:
# Drop the row titled "TikTok Rizz Party", because its data seems to be wrong
clean_movies = clean_movies[clean_movies['title'] != "TikTok Rizz Party"]
clean_movies

## Export the clean dataset

In [None]:
clean_movies.to_csv('clean_movies.csv')
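
Two details are worth keeping in mind when the exported file is read back: by default `to_csv` also writes the row index as an extra unnamed column, and dates are stored as text. The sketch below shows one possible round trip on a made-up frame, using an in-memory buffer instead of a file; passing `index=False` and `parse_dates` is our choice for this illustration, not something the export cell above does.

```python
import io

import pandas as pd

# Toy data, invented for illustration only; the index mimics leftover row labels
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'release_date': pd.to_datetime(['2010-05-01', '2015-07-20']),
    'revenue': [5_000_000, 1_000_000],
}, index=[42, 7])

# Write without the index so no extra unnamed column appears in the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the date column again on load, since CSV stores it as plain text
reloaded = pd.read_csv(buffer, parse_dates=['release_date'])
print(reloaded.dtypes)
```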