# Exploratory Data Analysis of Movies

In [1]:
import pandas as pd

In [2]:
movies = pd.read_csv('../data/raw/unclean_movies.csv')

# Display some information about the movies DataFrame
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865 entries, 0 to 864
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie_id           865 non-null    object 
 1   movie_title        865 non-null    object 
 2   genres             865 non-null    object 
 3   original_language  865 non-null    object 
 4   image_url          865 non-null    object 
 5   runtime            865 non-null    float64
 6   spoken_languages   865 non-null    object 
 7   year_released      865 non-null    float64
dtypes: float64(2), object(6)
memory usage: 54.2+ KB


runtime and year_released are of type float, which we may want to convert to int. Many columns have dtype object, which doesn't say what they actually contain, so I'll first check whether any of them mix data types.

In [3]:
print(f"movie_id types: {movies['movie_id'].apply(type).unique()}")
print(f"movie_title types: {movies['movie_title'].apply(type).unique()}")
print(f"genres types: {movies['genres'].apply(type).unique()}")
print(f"original_language types: {movies['original_language'].apply(type).unique()}")
print(f"image_url types: {movies['image_url'].apply(type).unique()}")
print(f"spoken_languages types: {movies['spoken_languages'].apply(type).unique()}")


movie_id types: [<class 'str'>]
movie_title types: [<class 'str'>]
genres types: [<class 'str'>]
original_language types: [<class 'str'>]
image_url types: [<class 'str'>]
spoken_languages types: [<class 'str'>]


They all contain only strings. However, some of the columns look like arrays, so we will probably want to convert them to arrays. An empty array stored as a string is not currently flagged as missing data, so we will need to check those columns again after the transformation.
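The six near-identical print calls above could be collapsed into a loop. The following is a minimal sketch rather than the notebook's own code: the helper name `column_types` and the small `sample` frame are hypothetical stand-ins, and the `None` entry is added only to show what a mixed column would look like.

```python
import pandas as pd


def column_types(df, columns):
    """Map each named column to the set of Python type names found in it."""
    return {col: {type(value).__name__ for value in df[col]} for col in columns}


# Hypothetical stand-in for the movies DataFrame; dtype=object mirrors the
# object columns reported by movies.info()
sample = pd.DataFrame(
    {
        "movie_id": pd.Series(["a-bugs-life", "insomnia-2002"], dtype=object),
        # None simulates a non-string entry hiding in an object column
        "genres": pd.Series(['["Family"]', None], dtype=object),
    }
)

types_found = column_types(sample, ["movie_id", "genres"])
for column, type_names in types_found.items():
    print(f"{column} types: {sorted(type_names)}")
```

Any column reporting more than one type name would be worth inspecting before further cleaning.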

In [4]:
movies.isnull().any(axis=1).sum()

np.int64(0)

There are no rows with missing data in any of the columns.

In [5]:
duplicates = movies.duplicated().sum()
print(f'There are {duplicates} duplicate rows in the movies DataFrame.')

There are 0 duplicate rows in the movies DataFrame.


There are no duplicated records.

In [6]:
movies.shape

(865, 8)

Checking for any records with the same title but differing in other columns:

In [7]:
same_title = movies[movies.duplicated(subset=["movie_title"], keep=False)]
same_title

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released
12,the-lion-king,The Lion King,"[""Family"",""Animation"",""Drama""]",en,sm/upload/lz/96/yu/mf/ztuEReeV6ofpU1HxUV9AsR6a...,89.0,"[""English""]",1994.0
83,ghostbusters-2016,Ghostbusters,"[""Action"",""Fantasy"",""Comedy""]",en,film-poster/2/2/0/4/8/22048-ghostbusters-0-230...,117.0,"[""English""]",2016.0
85,beauty-and-the-beast-2017,Beauty and the Beast,"[""Family"",""Fantasy"",""Romance""]",en,film-poster/2/4/8/7/0/8/248708-beauty-and-the-...,129.0,"[""Français"",""English""]",2017.0
191,black-panther-2018,Black Panther,"[""Action"",""Adventure"",""Science Fiction""]",en,film-poster/2/0/9/3/8/3/209383-black-panther-0...,134.0,"[""English"",""한국어/조선말"",""Kiswahili"",""""]",2018.0
232,halloween-2018,Halloween,"[""Horror"",""Thriller""]",en,film-poster/3/5/7/6/8/2/357682-halloween-0-230...,106.0,"[""English""]",2018.0
333,suspiria,Suspiria,"[""Horror""]",it,film-poster/4/4/9/0/0/44900-suspiria-0-230-0-3...,99.0,"[""Deutsch"",""English"",""Italiano"",""Latin""]",1977.0
400,the-lion-king-2019,The Lion King,"[""Adventure"",""Family"",""Animation""]",en,film-poster/3/5/4/5/3/9/354539-the-lion-king-0...,118.0,"[""English""]",2019.0
413,beauty-and-the-beast-1991,Beauty and the Beast,"[""Romance"",""Family"",""Animation"",""Fantasy""]",en,sm/upload/qw/zu/zk/ky/iLe8yMOF5we6lvYbKLHzho7A...,84.0,"[""Français"",""English""]",1991.0
446,aladdin,Aladdin,"[""Animation"",""Family"",""Adventure"",""Fantasy"",""R...",en,sm/upload/n3/26/lb/e2/trm94WPjW5ApKulhFvuRm5cT...,90.0,"[""English""]",1992.0
508,suspiria-2018,Suspiria,"[""Horror"",""Thriller""]",en,film-poster/2/9/3/0/6/6/293066-suspiria-0-230-...,152.0,"[""English"",""Français"",""Deutsch""]",2018.0


Some movies share a title, e.g. remakes. Now checking by movie_id to confirm that no movie is repeated:

In [8]:
same_id = movies[movies.duplicated(subset=["movie_id"], keep=False)]
same_id

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released


There are no movies with the same id. However, I noticed two "Ex Machina" films in the first check whose only difference was the date (2014 vs 2015). Now checking on movie_title and runtime together to see whether any films were inserted twice into the original database:

In [9]:
same_movie = movies[movies.duplicated(subset=["movie_title", "runtime"], keep=False)]
same_movie

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released
191,black-panther-2018,Black Panther,"[""Action"",""Adventure"",""Science Fiction""]",en,film-poster/2/0/9/3/8/3/209383-black-panther-0...,134.0,"[""English"",""한국어/조선말"",""Kiswahili"",""""]",2018.0
815,ex-machina-2014,Ex Machina,"[""Drama"",""Science Fiction""]",en,film-poster/1/8/7/9/8/6/187986-ex-machina-0-23...,108.0,"[""English""]",2014.0
846,black-panther,Black Panther,"[""Action"",""Adventure"",""Science Fiction""]",en,film-poster/2/0/9/3/8/3/209383-black-panther-0...,134.0,"[""English"",""한국어/조선말"",""Kiswahili"",""""]",2018.0
864,ex-machina-2015,Ex Machina,"[""Drama"",""Science Fiction""]",en,film-poster/1/8/7/9/8/6/187986-ex-machina-0-23...,108.0,"[""English""]",2015.0


It appears two films have been inserted twice into Letterboxd's database with differing movie_id and, in the case of "Ex Machina", a differing release year. I will decide which one to keep after exploring the user_ratings dataset.
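To make the later decision easier, the candidate duplicates can be grouped so that each title lists its competing movie_id values. This is only a sketch under the assumption that title plus runtime identifies a repeat; the `sample` frame reuses a few rows from the outputs above, and `candidate_ids` is a hypothetical name.

```python
import pandas as pd

# A few rows copied from the outputs above; the Suspiria pair is a genuine
# remake and should not be flagged
sample = pd.DataFrame(
    {
        "movie_id": [
            "black-panther-2018",
            "ex-machina-2014",
            "black-panther",
            "ex-machina-2015",
            "suspiria",
            "suspiria-2018",
        ],
        "movie_title": [
            "Black Panther",
            "Ex Machina",
            "Black Panther",
            "Ex Machina",
            "Suspiria",
            "Suspiria",
        ],
        "runtime": [134.0, 108.0, 134.0, 108.0, 99.0, 152.0],
    }
)

# Flag every row whose (title, runtime) pair occurs more than once
is_repeat = sample.duplicated(subset=["movie_title", "runtime"], keep=False)

# Collect the competing ids for each flagged (title, runtime) pair
candidate_ids = (
    sample[is_repeat]
    .groupby(["movie_title", "runtime"])["movie_id"]
    .apply(list)
    .to_dict()
)

for (title, runtime), ids in candidate_ids.items():
    print(f"{title} ({runtime} min): {ids}")
```

The resulting ids could then be looked up in the user_ratings dataset to see which of each pair is actually referenced.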

Now I want to check the quality of the data. First, checking that year_released looks correct by confirming it falls within an acceptable range.

In [10]:
invalid_years = movies[(movies['year_released'] < 1900) | (movies['year_released'] > 2025)]
invalid_years

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released


There are no movies outside the acceptable range, so I am happy with that.
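The hard-coded upper bound in the check above will go stale over time. A possible variation, sketched here with a hypothetical `sample` frame and made-up placeholder titles, derives the upper bound from today's date and uses `Series.between`, which also flags missing years because comparisons against NaN are False.

```python
from datetime import date

import pandas as pd

EARLIEST_YEAR = 1900  # same lower bound as the check above
latest_year = date.today().year  # assumption: nothing released after the current year

# Hypothetical rows: one valid year, two out of range, one missing
sample = pd.DataFrame(
    {
        "movie_title": ["In range", "Too early", "Too late", "Missing"],
        "year_released": [1994.0, 1850.0, 3000.0, float("nan")],
    }
)

# between() is inclusive on both ends by default
in_range = sample["year_released"].between(EARLIEST_YEAR, latest_year)
invalid_years = sample[~in_range]
print(invalid_years)
```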

Now checking the unique values of original_language to see if any look wrong.

In [11]:
movies['original_language'].unique()

array(['en', 'cn', 'ja', 'ko', 'sv', 'fr', 'it', 'da', 'de', 'ru', 'pt',
       'pl', 'es', 'el', 'id'], dtype=object)

These all look fine.
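The visual check could be backed by a simple format test. The sketch below only confirms that each code is two lowercase letters; it does not confirm that a code is a registered language code, which would need a reference list. The helper name `find_malformed_codes` is hypothetical.

```python
import re

# Codes copied from the output above
observed_codes = [
    "en", "cn", "ja", "ko", "sv", "fr", "it", "da", "de", "ru", "pt",
    "pl", "es", "el", "id",
]

CODE_PATTERN = re.compile(r"[a-z]{2}")


def find_malformed_codes(codes):
    """Return the codes that are not exactly two lowercase ASCII letters."""
    return [code for code in codes if not CODE_PATTERN.fullmatch(code)]


print(find_malformed_codes(observed_codes))
```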

I want to transform the genres and spoken_languages columns to arrays before checking their data quality.

# Cleaning the data

The genres and spoken_languages data is currently in string format but contains comma-separated values. I want to transform these strings into arrays.

In [12]:
import ast

In [13]:
movies['genres'] = movies['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x else [])

print(movies['genres'].apply(type).unique())
print(movies.head())

[<class 'list'>]
            movie_id        movie_title  \
0  napoleon-dynamite  Napoleon Dynamite   
1      insomnia-2002           Insomnia   
2        a-bugs-life       A Bug's Life   
3    before-midnight    Before Midnight   
4     searching-2018          Searching   

                                   genres original_language  \
0                                [Comedy]                en   
1              [Crime, Mystery, Thriller]                en   
2  [Adventure, Animation, Comedy, Family]                en   
3                        [Romance, Drama]                en   
4              [Thriller, Mystery, Drama]                en   

                                           image_url  runtime  \
0  sm/upload/wu/r9/ma/tt/2VMXuUAvU8T0oQl0w77CqVAR...     95.0   
1  film-poster/5/1/7/3/9/51739-insomnia-0-230-0-3...    118.0   
2  film-poster/4/7/1/1/1/47111-a-bug-s-life-0-230...     95.0   
3  film-poster/1/0/2/4/4/4/102444-before-midnight...    109.0   
4  film-poster/4/2/0

Genres are now list objects. Now the same for spoken_languages:

In [14]:
movies['spoken_languages'] = movies['spoken_languages'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x else [])

print(movies['spoken_languages'].apply(type).unique())
print(movies.head())

[<class 'list'>]
            movie_id        movie_title  \
0  napoleon-dynamite  Napoleon Dynamite   
1      insomnia-2002           Insomnia   
2        a-bugs-life       A Bug's Life   
3    before-midnight    Before Midnight   
4     searching-2018          Searching   

                                   genres original_language  \
0                                [Comedy]                en   
1              [Crime, Mystery, Thriller]                en   
2  [Adventure, Animation, Comedy, Family]                en   
3                        [Romance, Drama]                en   
4              [Thriller, Mystery, Drama]                en   

                                           image_url  runtime  \
0  sm/upload/wu/r9/ma/tt/2VMXuUAvU8T0oQl0w77CqVAR...     95.0   
1  film-poster/5/1/7/3/9/51739-insomnia-0-230-0-3...    118.0   
2  film-poster/4/7/1/1/1/47111-a-bug-s-life-0-230...     95.0   
3  film-poster/1/0/2/4/4/4/102444-before-midnight...    109.0   
4  film-poster/4/2/0
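Since the same lambda is applied to both columns, it could be factored into a named helper. The sketch below is one possible shape, not the notebook's code: `parse_list_string` is a hypothetical name, and returning an empty list for malformed or non-list values is an assumption about the desired behaviour (the lambda above would raise instead).

```python
import ast


def parse_list_string(raw):
    """Parse a string such as '["Drama","Horror"]' into a list.

    Returns [] for non-strings, blank strings, unparseable strings,
    and strings that parse to something other than a list.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


print(parse_list_string('["Drama","Horror"]'))
print(parse_list_string("[Drama"))
```

With a helper like this, each column conversion would become `movies[column].apply(parse_list_string)`, and any rows that came back empty would surface in the missing-value checks.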

Now I can check for missing or anomalous values in genres:

In [15]:
missing_genres = movies[movies['genres'].apply(lambda x: len(x) == 0)]
print(f"Number of movies with missing genres: {len(missing_genres)}")

unique_genres = set(genre for sublist in movies['genres'] for genre in sublist)
print(unique_genres)

Number of movies with missing genres: 0
{'Comedy', 'Animation', 'Family', 'Drama', 'Action', 'Science Fiction', 'Crime', 'Adventure', 'Fantasy', 'TV Movie', 'Horror', 'War', 'Thriller', 'History', 'Western', 'Mystery', 'Music', 'Romance'}


There are no movies with missing genres and each genre seems suitable. Now the same for spoken languages:

In [16]:
missing_languages = movies[movies['spoken_languages'].apply(lambda x: len(x) == 0)]
print(f"Number of movies with missing spoken_language: {len(missing_languages)}")

unique_languages = set(language for sublist in movies['spoken_languages'] for language in sublist)
print(unique_languages)

Number of movies with missing spoken_language: 0
{'', 'Český', '广州话 / 廣州話', 'Somali', 'Esperanto', 'svenska', 'Tiếng Việt', 'اردو', 'বাংলা', 'Hrvatski', 'Afrikaans', 'Bahasa indonesia', 'Nederlands', 'ਪੰਜਾਬੀ', 'suomi', 'Polski', 'български език', 'Español', 'தமிழ்', 'Português', 'Kiswahili', 'Pусский', 'Íslenska', 'සිංහල', 'Norsk', 'Gaeilge', 'Deutsch', '한국어/조선말', 'Bahasa melayu', 'فارسی', 'Italiano', 'العربية', 'Türkçe', 'shqip', '普通话', 'ελληνικά', 'Latin', 'isiZulu', 'English', '日本語', 'Magyar', 'ภาษาไทย', 'Română', 'עִבְרִית', 'Dansk', 'беларуская мова', 'हिन्दी', 'Français'}


One of the language values is an empty string. Let's explore the records with this value: 

In [17]:
missing_language = movies[movies['spoken_languages'].apply(lambda x: '' in x)]
print(missing_language)


                                              movie_id  \
14                             raiders-of-the-lost-ark   
191                                 black-panther-2018   
284                              avengers-infinity-war   
348  borat-cultural-learnings-of-america-for-make-b...   
393                 the-life-aquatic-with-steve-zissou   
475                                          tangerine   
509                                         prometheus   
532                                           sinister   
590                                   avengers-endgame   
627                                         district-9   
637                             the-darjeeling-limited   
753                                      a-serious-man   
774                                               cars   
811                                     apocalypse-now   
846                                      black-panther   

                                           movie_title  \
14           

There were some empty values in the dataset. Let's remove these:

In [18]:
movies['spoken_languages'] = movies['spoken_languages'].apply(lambda x: [language for language in x if language != ''])
print(movies['spoken_languages'].apply(lambda x: '' in x).any())

False


Double checking that this "language" has been removed from the dataset:

In [19]:
unique_languages = set(language for sublist in movies['spoken_languages'] for language in sublist)
print(unique_languages)

{'Český', '广州话 / 廣州話', 'Somali', 'Esperanto', 'svenska', 'Tiếng Việt', 'اردو', 'বাংলা', 'Hrvatski', 'Afrikaans', 'Bahasa indonesia', 'Nederlands', 'ਪੰਜਾਬੀ', 'suomi', 'Polski', 'български език', 'Español', 'தமிழ்', 'Português', 'Kiswahili', 'Pусский', 'Íslenska', 'සිංහල', 'Norsk', 'Gaeilge', 'Deutsch', '한국어/조선말', 'Bahasa melayu', 'فارسی', 'Italiano', 'العربية', 'Türkçe', 'shqip', '普通话', 'ελληνικά', 'Latin', 'isiZulu', 'English', '日本語', 'Magyar', 'ภาษาไทย', 'Română', 'עִבְרִית', 'Dansk', 'беларуская мова', 'हिन्दी', 'Français'}


And checking one of the previously affected records to ensure it looks correct now:

In [20]:
movies[movies['movie_id'] == 'district-9']

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released
627,district-9,District 9,[Science Fiction],en,film-poster/4/0/8/6/0/40860-district-9-0-230-0...,112.0,"[Afrikaans, isiZulu, English]",2009.0


Checking one final time for missing languages:

In [21]:
missing_languages = movies[movies['spoken_languages'].apply(lambda x: len(x) == 0)]
print(f"Number of movies with missing spoken_language: {len(missing_languages)}")

Number of movies with missing spoken_language: 0
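The final re-check matters because a list holding only an empty string would pass the first missing-value check and become empty only after filtering. A small plain-Python sketch of that interaction follows; `drop_blank_entries` is a hypothetical helper, and treating whitespace-only strings as blank is an added assumption beyond what the notebook does.

```python
def drop_blank_entries(values):
    """Remove empty and whitespace-only strings from a list of strings."""
    return [value for value in values if value.strip()]


# First list copied from the Black Panther row above; the second is a
# hypothetical record whose only "language" is an empty string
black_panther = ["English", "한국어/조선말", "Kiswahili", ""]
only_blank = [""]

print(drop_blank_entries(black_panther))
print(len(drop_blank_entries(only_blank)) == 0)  # would now count as missing
```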


### Standardise float types to int

Currently runtime and year_released carry a trailing .0 that we don't need. I will convert these columns to int.

In [22]:
movies['year_released'] = movies['year_released'].astype(int)
movies['runtime'] = movies['runtime'].astype(int)

movies.head()

Unnamed: 0,movie_id,movie_title,genres,original_language,image_url,runtime,spoken_languages,year_released
0,napoleon-dynamite,Napoleon Dynamite,[Comedy],en,sm/upload/wu/r9/ma/tt/2VMXuUAvU8T0oQl0w77CqVAR...,95,[English],2004
1,insomnia-2002,Insomnia,"[Crime, Mystery, Thriller]",en,film-poster/5/1/7/3/9/51739-insomnia-0-230-0-3...,118,[English],2002
2,a-bugs-life,A Bug's Life,"[Adventure, Animation, Comedy, Family]",en,film-poster/4/7/1/1/1/47111-a-bug-s-life-0-230...,95,[English],1998
3,before-midnight,Before Midnight,"[Romance, Drama]",en,film-poster/1/0/2/4/4/4/102444-before-midnight...,109,"[English, ελληνικά, Français]",2013
4,searching-2018,Searching,"[Thriller, Mystery, Drama]",en,film-poster/4/2/0/4/6/7/420467-searching-0-230...,102,[English],2018


I'm satisfied that the data is now adequately cleaned.
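One caveat with `astype(int)`: it raises on missing values and silently truncates fractional ones (95.7 becomes 95). Neither case occurs in this dataset as far as the checks above show, but a guarded conversion could look like the sketch below, where `to_int_if_whole` is a hypothetical helper.

```python
import pandas as pd


def to_int_if_whole(series):
    """Convert a float Series to int, refusing missing or fractional values."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing values")
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name}: contains fractional values")
    return series.astype(int)


# Runtimes copied from the first two rows of the output above
runtime = pd.Series([95.0, 118.0], name="runtime")
converted = to_int_if_whole(runtime)
print(converted.tolist())
```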

In [23]:
movies.reset_index(drop=True, inplace=True)

# Save the cleaned data

For testing purposes in the pipeline, it makes sense to export the cleaned DataFrame to a CSV file. This allows the cleaned data to be used in the pipeline without running the cleaning steps again.

In [24]:
movies.to_csv(
    "../tests/test_data/expected_movies_clean_results.csv", index=False
)
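Note that CSV has no list type, so the list columns are written as their string representation and come back as strings when the file is read. A test that loads this file would need to parse them again. The sketch below shows one way to do that with the `converters` argument of `pd.read_csv`; it uses an in-memory buffer and a single row copied from the District 9 output above rather than the real file.

```python
import ast
import io

import pandas as pd

# One cleaned row, copied from the District 9 output above
cleaned = pd.DataFrame(
    {
        "movie_id": ["district-9"],
        "genres": [["Science Fiction"]],
        "spoken_languages": [["Afrikaans", "isiZulu", "English"]],
        "runtime": [112],
        "year_released": [2009],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the list columns back from their string form while reading
reloaded = pd.read_csv(
    buffer,
    converters={
        "genres": ast.literal_eval,
        "spoken_languages": ast.literal_eval,
    },
)
print(reloaded.loc[0, "spoken_languages"])
```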