# Exploratory Data Analysis of Movies with Ratings

Our other movie dataset does not contain the average rating from Letterboxd, so I've brought in this second movie dataset to supply it. However, its id column does not match the movie_id in the first dataset, so we will have to match movies by title and date, and possibly runtime if that's not enough.
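
As a rough sketch of what that matching could look like, the snippet below joins two toy frames on title and year. The `movies` frame and its column names (`movie_id`, `title`, `release_year`) are assumptions standing in for the first dataset, and the Dracula ratings are made-up illustrative values, not real data.

```python
import pandas as pd

# Hypothetical stand-in for the first movie dataset (column names are assumptions).
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Parasite", "Fight Club", "Dracula"],
    "release_year": [2019, 1999, 1931],
})

# Toy sample shaped like the ratings dataset; the Dracula ratings are invented.
ratings = pd.DataFrame({
    "name": ["Parasite", "Fight Club", "Dracula", "Dracula"],
    "date": [2019, 1999, 1931, 1958],
    "rating": [4.56, 4.27, 3.5, 3.4],
})

# Left join keeps every movie from the first table; matching on title AND year
# stops the two films called "Dracula" from both attaching to movie_id 3.
merged = movies.merge(
    ratings,
    how="left",
    left_on=["title", "release_year"],
    right_on=["name", "date"],
)

print(merged[["movie_id", "title", "release_year", "rating"]])
```

Matching on title alone would duplicate the Dracula row here, which is why the year is part of the key. Runtime could be added to both key lists in the same way if title and year still collide.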

In [2]:
import pandas as pd

In [4]:
movies_with_ratings = pd.read_csv('../data/raw/unclean_movies_with_ratings.csv')

# Display some information about the movies_with_ratings DataFrame
movies_with_ratings.info()
movies_with_ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164616 entries, 0 to 164615
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      164616 non-null  int64  
 1   name    164616 non-null  object 
 2   date    157062 non-null  float64
 3   minute  155740 non-null  float64
 4   rating  67519 non-null   float64
dtypes: float64(3), int64(1), object(1)
memory usage: 6.3+ MB


Unnamed: 0,id,name,date,minute,rating
0,1000002,Parasite,2019.0,133.0,4.56
1,1000003,Everything Everywhere All at Once,2022.0,140.0,4.3
2,1000004,Fight Club,1999.0,139.0,4.27
3,1000005,La La Land,2016.0,129.0,4.09
4,1000007,Interstellar,2014.0,169.0,4.35


The name column has data type object, which doesn't tell us what the values actually are. I'll first check whether it contains any mixed data types.

In [5]:
print(f"Name types: {movies_with_ratings['name'].apply(type).unique()}")

Name types: [<class 'str'>]


It contains only strings, as expected, so we can move on. Checking how many rows have at least one missing value:

In [6]:
movies_with_ratings.isnull().any(axis=1).sum()

np.int64(97198)

There are many rows with missing data. That may not be a concern, though, since we only need this table to extract the rating for the movies in the original movies table. Let's check which bits of data are missing for a better overview:

In [7]:
name_missing = (movies_with_ratings["name"].isnull())

print(name_missing.sum())

0


There are no rows with a missing name (movie title). Let's check for missing ratings now:

In [8]:
rating_missing = (movies_with_ratings["rating"].isnull())

print(rating_missing.sum())

97097


Nearly all of the rows with missing data (97,097 of 97,198) are missing a rating. We'll leave these and see if it's a concern once merged with the other movies table.
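
The column-by-column checks above can also be done in one call. Here is a minimal sketch on a three-row sample copied from the outputs in this notebook; `sample` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Three rows taken from the outputs above, with NaN where values are missing.
sample = pd.DataFrame({
    "name": ["Parasite", "Frankenstein", "Trap"],
    "date": [2019.0, np.nan, np.nan],
    "minute": [133.0, np.nan, 85.0],
    "rating": [4.56, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Number of rows with at least one missing value (same check as In [6]).
rows_with_any_missing = sample.isnull().any(axis=1).sum()

print(missing_per_column)
print(rows_with_any_missing)
```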

Let's check for missing date as that will be a concern when merging.

In [15]:
date_missing = (movies_with_ratings["date"].isnull())

print(date_missing.sum())
movies_with_ratings[date_missing]

7554


Unnamed: 0,id,name,date,minute,rating
7105,1007915,Frankenstein,,,
8233,1009206,Untitled Peaky Blinders Film,,,
12934,1014647,Dracula,,,
14450,1016448,Scarface,,,
15544,1017716,Havoc,,,
...,...,...,...,...,...
164611,1941489,Trap,,85.0,
164612,1941490,Traumnovelle,,109.0,
164613,1941513,Untouchable,,,
164614,1941515,Vagabond,,6.0,


There are 7,554 rows with a missing date. We'll leave these as is and find out if it's a concern when merging. Most likely these are lesser-known films that won't be present in our other table, which has no dates missing.
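
One way to gauge that risk before merging is to see which titles in the other table also appear among the rows with no date. This is only a sketch: the `movies` frame and its `title` and `release_year` columns are assumed names for the first dataset, and the frames are toy samples.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first movie dataset (column names are assumptions).
movies = pd.DataFrame({
    "title": ["Parasite", "Scarface"],
    "release_year": [2019, 1983],
})

# Toy sample shaped like movies_with_ratings; Scarface has no date, as above.
ratings = pd.DataFrame({
    "name": ["Parasite", "Scarface"],
    "date": [2019.0, np.nan],
})

# Titles of rating rows that cannot be matched on date.
missing_date_titles = ratings.loc[ratings["date"].isnull(), "name"]

# Movies in the main table that share a title with one of those rows.
at_risk = movies[movies["title"].isin(missing_date_titles)]

print(at_risk)
```

A title showing up in `at_risk` does not prove a failed match, since a dated row with the same title may also exist. It only flags where to look.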

# Data Cleaning

## Convert float to int

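The date and minute columns are stored as floats with a trailing .0 that we don't need. Both columns contain missing values, so let's convert them to pandas' nullable Int64 type, which can hold missing values, rather than plain int.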

In [21]:
movies_with_ratings['date'] = movies_with_ratings['date'].astype('Int64')
movies_with_ratings['minute'] = movies_with_ratings['minute'].astype('Int64')

movies_with_ratings

Unnamed: 0,id,name,date,minute,rating
0,1000002,Parasite,2019,133,4.56
1,1000003,Everything Everywhere All at Once,2022,140,4.30
2,1000004,Fight Club,1999,139,4.27
3,1000005,La La Land,2016,129,4.09
4,1000007,Interstellar,2014,169,4.35
...,...,...,...,...,...
164611,1941489,Trap,,85,
164612,1941490,Traumnovelle,,109,
164613,1941513,Untouchable,,,
164614,1941515,Vagabond,,6,
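
For context on why the capitalised `'Int64'` is used above, here is a small sketch on a toy Series (the variable names are illustrative). A plain NumPy `int64` cannot represent a missing value, so the cast fails, while the nullable `Int64` extension type keeps it as `<NA>`.

```python
import numpy as np
import pandas as pd

# Float column with a missing value, like the date column before conversion.
years = pd.Series([2019.0, 1999.0, np.nan])

# Casting to NumPy's int64 raises because NaN has no integer representation.
try:
    years.astype("int64")
    plain_int_worked = True
except (ValueError, TypeError):
    plain_int_worked = False

# The nullable extension dtype converts the numbers and keeps the gap as <NA>.
nullable_years = years.astype("Int64")

print(plain_int_worked)
print(nullable_years)
```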


## Standardise ratings

Ratings in user_ratings are out of 10, while the ratings here are out of 5. Let's multiply these ratings by 2 to put both on the same scale.

In [22]:
movies_with_ratings['rating'] = movies_with_ratings['rating'] * 2

movies_with_ratings.head()


Unnamed: 0,id,name,date,minute,rating
0,1000002,Parasite,2019,133,9.12
1,1000003,Everything Everywhere All at Once,2022,140,8.6
2,1000004,Fight Club,1999,139,8.54
3,1000005,La La Land,2016,129,8.18
4,1000007,Interstellar,2014,169,8.7
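
A quick sanity check after rescaling is to confirm that every non-missing rating falls between 0 and 10. The sketch below uses a few values from the table above plus one missing entry; the variable names are illustrative.

```python
import pandas as pd

# A few ratings on the 5-point scale, plus one missing value.
ratings_out_of_5 = pd.Series([4.56, 4.3, 4.27, None])

# Rescale to the 10-point scale; missing values stay missing.
ratings_out_of_10 = ratings_out_of_5 * 2

# Ignore missing values, then check that the rest lie within [0, 10].
non_missing = ratings_out_of_10.dropna()
all_in_range = non_missing.between(0, 10).all()

print(all_in_range)
```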


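## Drop id column

The id column does not match the movie_id in the first dataset, so it is of no use for merging and can be dropped.
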
In [23]:
movies_with_ratings = movies_with_ratings.drop(columns=['id'])

movies_with_ratings.shape

(164616, 4)

In [26]:
movies_with_ratings.reset_index(drop=True, inplace=True)

# Save the Cleaned Data

For testing purposes in the pipeline, it makes sense for us to export the cleaned DataFrame to a CSV file. This will allow us to use the cleaned data in the pipeline without having to run the cleaning steps again.

In [27]:
movies_with_ratings.to_csv(
    "../tests/test_data/expected_movies_with_ratings_clean_results.csv", index=False
)
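
One thing to keep in mind when reading this file back in a test: CSV does not store dtypes, so a column of integers with gaps is read as float64 by default. A minimal round-trip sketch, using a temporary file and a toy frame rather than the real export path, could look like this:

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns and dtypes as the cleaned data.
cleaned = pd.DataFrame({
    "name": ["Parasite", "Trap"],
    "date": pd.array([2019, pd.NA], dtype="Int64"),
    "minute": pd.array([133, 85], dtype="Int64"),
    "rating": [9.12, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp_dir, "cleaned_sample.csv")
    cleaned.to_csv(path, index=False)

    # Restore the nullable integer dtypes explicitly when reading back.
    reloaded = pd.read_csv(path, dtype={"date": "Int64", "minute": "Int64"})

# Raises an AssertionError if values or dtypes differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
```

Passing `dtype` to `read_csv` is what keeps date and minute as Int64 here. Whether the pipeline tests need that depends on how they compare the frames.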