MovieMate

Ciara Fasullo

evz5pv@virginia.edu

DS 4003

February 29, 2024

DATASET SELECTION:

Criteria for Choosing the Dataset:

-Relevance to App: This dataset contains information that will build a perfect framework for my interactive app. My goal is to create an app that gives users movie recommendations based on their preferences (genre, actors, etc.). This dataset contains all of that information as well as ratings, which will help to curate these recommendations. Further, I would like to compare movies of the same genre, films that were in theatres around the same time, etc. on my website in an interactive way, so this dataset provides me with all of the information I need in order to achieve this goal.

-Data Completeness: This dataset is relatively complete. Of the 41,399 films, only 95 are missing a genre and 194 are missing a public vote; the critics vote is the main gap, missing for roughly 11% of films. Because we have information available on genre, actors, year released, etc. for the vast majority of films, we will be able to provide the user with a great deal of information for any movie recommended to them.

-Data Quality: This dataset was fairly clean from the start; I removed a few columns that were not relevant to my app, but the data is consistent and accurate.

-Diversity in Categories: This data encompasses a wide variety of films, whether that be within specific genres, time periods, or in the country of origin. This was ideal because it ensures that the user's preferences will be met with matches regardless of how niche of a category they may be looking for.

-Reliability: This data is sourced from FilmTV, which is a credible, reliable source that utilizes the same standardized rating systems as other sites such as Rotten Tomatoes or IMDb. Because these rating metrics are widely accepted, users should find the information easy to interpret, which contributes to a positive user experience.

Description of the Selected Dataset:

This dataset pulls movie data from FilmTV, which is a credible source (similar to IMDb) that gathers data on movies, including their genre, actors, and ratings. FilmTV also features movies from an array of countries, providing us with data that is much more expansive than IMDb or other American websites. Each row represents a movie available on FilmTV (there are over 40,000 movies). The columns include the movie title, year of release, genre, duration, country, directors, actors, ratings, and a description.

DATA PROVENANCE AND SOURCE:


Original Use for Dataset:

This dataset was created to provide further insight into what makes a movie successful, from either an audience or a profit perspective. It was also designed to be combined with other publicly available movie datasets from Rotten Tomatoes, IMDb, etc.

Source Information:

FilmTV: https://www.filmtv.it/

Data Transformation History:

The data has been scraped from the publicly available website https://www.filmtv.it/ and was last updated on 2023-10-21.

Authorship and Ownership (License):

CC0: Public Domain

Dependencies:

-collection methodology: Python script (requests library)

Versioning:

-dataset is updated annually (last updated 2023-10-21)

In [1]:
#import dependencies
import pandas as pd
import seaborn as sns
import plotly.express as px

In [2]:
#load csv file into a dataframe
df = pd.read_csv('data.csv')
df

Unnamed: 0,filmtv_id,title,year,genre,duration,country,directors,actors,avg_vote,critics_vote,public_vote,total_votes,description,notes,humor,rhythm,effort,tension,erotism
0,2,Bugs Bunny's Third Movie: 1001 Rabbit Tales,1982,Animation,76,United States,"David Detiege, Art Davis, Bill Perez",,7.7,8.00,7.0,22,"With two protruding front teeth, a slightly sl...","These are many small independent stories, whic...",3,3,0,0,0
1,3,18 anni tra una settimana,1991,Drama,98,Italy,Luigi Perelli,"Kim Rossi Stuart, Simona Cavallari, Ennio Fant...",6.5,6.00,7.0,4,"Samantha, not yet eighteen, leaves the comfort...","Luigi Perelli, the director of the ""Piovra"", o...",0,2,0,2,0
2,17,Ride a Wild Pony,1976,Romantic,91,United States,Don Chaffey,"Michael Craig, John Meillon, Eva Griffith, Gra...",5.7,6.00,5.0,10,"In the Australia of the pioneers, a boy and a ...","""Ecological"" story with a happy ending, not wi...",1,2,1,0,0
3,18,Diner,1982,Comedy,95,United States,Barry Levinson,"Mickey Rourke, Steve Guttenberg, Ellen Barkin,...",7.0,8.00,6.0,18,Five boys from Baltimore have a habit of meeti...,A cast of will be famous for Levinson's direct...,2,2,0,1,2
4,20,A che servono questi quattrini?,1942,Comedy,85,Italy,Esodo Pratelli,"Eduardo De Filippo, Peppino De Filippo, Clelia...",5.9,5.33,7.0,15,"With a stratagem, the penniless and somewhat p...",Taken from the play by Armando Curcio that the...,3,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41394,232817,Gold Digger Killer,2021,Thriller,87,"Canada, United States",Robin Hays,"Julie Benz, Roan Curtis, Georgia Bradner, Eli ...",4.0,,4.0,3,Celeste is an attractive waitress in the forti...,Freely taken from a true story.,0,0,0,0,0
41395,232893,Addio al nubilato 2,2023,Comedy,90,Italy,Francesco Apolloni,"Laura Chiatti, Chiara Francini, Antonia Liskov...",2.7,,3.0,3,"When Eleonora is downloaded to the altar, her ...",Bachelorette fare sequel (2021).,0,0,0,0,0
41396,232915,Konferensen,2023,Horror,100,Sweden,Patrik Eklund,"Katia Winter, Eva Melander, Lola Zackow, Adam ...",6.0,,6.0,6,A team building conference organized for a gro...,,0,0,0,0,0
41397,232919,Ballelina,2023,Thriller,92,South Korea,Chung-Hyun Lee,"Jeon Jong-seo, Park Yu-rim, Ji-hun Kim",5.8,,6.0,5,Once Ok-Ju worked as a bodyguard and was one o...,,0,0,0,0,0


In [3]:
#data cleaning
columns_to_drop = ['filmtv_id', 'total_votes', 'notes', 'humor', 'rhythm', 'effort', 'tension', 'erotism']
df = df.drop(columns = columns_to_drop)
df
#note - this data was already in tidy format (each variable has its own column, each observation has its own row)
#note - the data cleaning was to remove columns I wouldn't be using in my app, not to remove outliers or any relevant information

Unnamed: 0,title,year,genre,duration,country,directors,actors,avg_vote,critics_vote,public_vote,description
0,Bugs Bunny's Third Movie: 1001 Rabbit Tales,1982,Animation,76,United States,"David Detiege, Art Davis, Bill Perez",,7.7,8.00,7.0,"With two protruding front teeth, a slightly sl..."
1,18 anni tra una settimana,1991,Drama,98,Italy,Luigi Perelli,"Kim Rossi Stuart, Simona Cavallari, Ennio Fant...",6.5,6.00,7.0,"Samantha, not yet eighteen, leaves the comfort..."
2,Ride a Wild Pony,1976,Romantic,91,United States,Don Chaffey,"Michael Craig, John Meillon, Eva Griffith, Gra...",5.7,6.00,5.0,"In the Australia of the pioneers, a boy and a ..."
3,Diner,1982,Comedy,95,United States,Barry Levinson,"Mickey Rourke, Steve Guttenberg, Ellen Barkin,...",7.0,8.00,6.0,Five boys from Baltimore have a habit of meeti...
4,A che servono questi quattrini?,1942,Comedy,85,Italy,Esodo Pratelli,"Eduardo De Filippo, Peppino De Filippo, Clelia...",5.9,5.33,7.0,"With a stratagem, the penniless and somewhat p..."
...,...,...,...,...,...,...,...,...,...,...,...
41394,Gold Digger Killer,2021,Thriller,87,"Canada, United States",Robin Hays,"Julie Benz, Roan Curtis, Georgia Bradner, Eli ...",4.0,,4.0,Celeste is an attractive waitress in the forti...
41395,Addio al nubilato 2,2023,Comedy,90,Italy,Francesco Apolloni,"Laura Chiatti, Chiara Francini, Antonia Liskov...",2.7,,3.0,"When Eleonora is downloaded to the altar, her ..."
41396,Konferensen,2023,Horror,100,Sweden,Patrik Eklund,"Katia Winter, Eva Melander, Lola Zackow, Adam ...",6.0,,6.0,A team building conference organized for a gro...
41397,Ballelina,2023,Thriller,92,South Korea,Chung-Hyun Lee,"Jeon Jong-seo, Park Yu-rim, Ji-hun Kim",5.8,,6.0,Once Ok-Ju worked as a bodyguard and was one o...


In [4]:
#save cleaned dataframe
df.to_csv('cleaned_dataset.csv', index=False)

EXPLORATORY DATA ANALYSIS:

Number of Observations in the Dataset: 41399

Unique Categories for Categorical Variables:

-'title' : title of the movie

-'year' : year the movie was released

-'genre' : genre of the movie

-'country' : country in which the movie was produced

-'directors' : director(s) of the movie

-'actors' : actors in the movie

-'description' : summary of the movie

EVALUATION OF MISSING DATA:

In [5]:
# Generate a list of missing data from each column

# Map each column to the label used in its printed heading
columns_to_check = {
    'year': 'release year',
    'genre': 'genre',
    'duration': 'duration',
    'country': 'production country',
    'directors': 'directors',
    'actors': 'actors',
    'avg_vote': 'average vote',
    'critics_vote': 'critics vote',
    'public_vote': 'public vote',
    'description': 'description',
}

for column, label in columns_to_check.items():
    # Select the rows where this column is missing, keeping the title for reference
    missing_data = df[df[column].isnull()][['title', column]]

    # Print the titles and the associated missing values
    print(f"Movies with missing {label}:")
    print(missing_data)


Movies with missing release year:
Empty DataFrame
Columns: [title, year]
Index: []
Movies with missing genre:
                                    title genre
15153                  Monsieur Batignole   NaN
15614                 Il posto dell'anima   NaN
15660                      Yossi & Jagger   NaN
15713                    Cradle Will Rock   NaN
15721                         Bord de mer   NaN
...                                   ...   ...
37435                            The Farm   NaN
37552            All the Colors of Giallo   NaN
37723                        Night School   NaN
38069                   House of Whipcord   NaN
38447  Tensión sexual, Volumen 1: Volátil   NaN

[95 rows x 2 columns]
Movies with missing duration:
Empty DataFrame
Columns: [title, duration]
Index: []
Movies with missing production country:
                                                title country
18902                                Two Smart People     NaN
27081                                   Flor
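
The per-column listings above can also be condensed into one summary table. The following is a minimal sketch and is not part of the original notebook: the helper name `missing_summary` is hypothetical, and `sample` is a small illustrative frame that loosely mirrors a few rows shown earlier, not the full dataset.

```python
import pandas as pd

# Small illustrative frame (an assumption for this sketch, not the real data file)
sample = pd.DataFrame({
    "title": ["Bugs Bunny's Third Movie: 1001 Rabbit Tales", "Diner",
              "Gold Digger Killer", "Ballelina"],
    "actors": [None, "Mickey Rourke, Steve Guttenberg",
               "Julie Benz, Roan Curtis", "Jeon Jong-seo, Park Yu-rim, Ji-hun Kim"],
    "critics_vote": [8.0, 8.0, None, None],
    "public_vote": [7.0, 6.0, 4.0, 6.0],
})


def missing_summary(frame):
    """Return the count and percentage of missing values for every column."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)


print(missing_summary(sample))
```

Calling `missing_summary(df)` on the cleaned dataframe would give one row per column, which may be easier to scan than ten separate printouts.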

In [6]:
#Distribution Analysis of Continuous Variables:
duration_descriptive_stats = df['duration'].describe() #descriptive stats of the duration column
print("Duration Statistics:")
print(duration_descriptive_stats)

avg_vote_descriptive_stats = df['avg_vote'].describe() #descriptive stats of the avg_vote column
print("Average Vote Statistics:")
print(avg_vote_descriptive_stats)

critics_vote_descriptive_stats = df['critics_vote'].describe() #descriptive stats of the critics_vote column
print("Critics Vote Statistics:")
print(critics_vote_descriptive_stats)

public_vote_descriptive_stats = df['public_vote'].describe() #descriptive stats of the public_vote column
print("Public Vote Statistics:")
print(public_vote_descriptive_stats)


Duration Statistics:
count    41399.000000
mean       100.537163
std         27.260962
min         41.000000
25%         90.000000
50%         96.000000
75%        107.000000
max       1525.000000
Name: duration, dtype: float64
Average Vote Statistics:
count    41399.000000
mean         5.801522
std          1.403861
min          1.000000
25%          4.800000
50%          5.900000
75%          6.900000
max         10.000000
Name: avg_vote, dtype: float64
Critics Vote Statistics:
count    36703.000000
mean         5.796077
std          1.593062
min          1.000000
25%          4.670000
50%          6.000000
75%          7.000000
max         10.000000
Name: critics_vote, dtype: float64
Public Vote Statistics:
count    41205.000000
mean         5.924135
std          1.480112
min          1.000000
25%          5.000000
50%          6.000000
75%          7.000000
max         10.000000
Name: public_vote, dtype: float64


In [7]:
#Identification and Handling of Outliers:

#For the Duration Column:

#Calculate IQR
q1_duration = df['duration'].quantile(0.25)
q3_duration = df['duration'].quantile(0.75)
iqr_duration = q3_duration - q1_duration

# Define lower and upper bounds for outliers
lower_bound_duration = q1_duration - 1.5 * iqr_duration
upper_bound_duration = q3_duration + 1.5 * iqr_duration

# Identify outliers in the 'duration' column
outliers_duration = df[(df['duration'] < lower_bound_duration) | (df['duration'] > upper_bound_duration)]

# Display outliers with 'title' and 'duration' columns
duration_outliers_info = outliers_duration[['title', 'duration']]
print("Duration Outliers:")
print(duration_outliers_info)

# Note: I do not feel a need to remove outliers in movie duration, as many of these are categorized through various film genres, such as shorts.

# For the Average Vote Column:

# Calculate IQR
q1_avg_vote = df['avg_vote'].quantile(0.25)
q3_avg_vote = df['avg_vote'].quantile(0.75)
iqr_avg_vote = q3_avg_vote - q1_avg_vote

# Define lower and upper bounds for outliers
lower_bound_avg_vote = q1_avg_vote - 1.5 * iqr_avg_vote
upper_bound_avg_vote = q3_avg_vote + 1.5 * iqr_avg_vote

# Identify outliers in the 'avg_vote' column
outliers_avg_vote = df[(df['avg_vote'] < lower_bound_avg_vote) | (df['avg_vote'] > upper_bound_avg_vote)]

# Display outliers with 'title' and 'avg_vote' columns
avg_vote_outliers_info = outliers_avg_vote[['title', 'avg_vote']]
print("Average Vote Outliers:")
print(avg_vote_outliers_info)

# For the Critics Vote Column:

# Calculate IQR
q1_critics_vote = df['critics_vote'].quantile(0.25)
q3_critics_vote = df['critics_vote'].quantile(0.75)
iqr_critics_vote = q3_critics_vote - q1_critics_vote

# Define lower and upper bounds for outliers
lower_bound_critics_vote = q1_critics_vote - 1.5 * iqr_critics_vote
upper_bound_critics_vote = q3_critics_vote + 1.5 * iqr_critics_vote

# Identify outliers in the 'critics_vote' column
outliers_critics_vote = df[(df['critics_vote'] < lower_bound_critics_vote) | (df['critics_vote'] > upper_bound_critics_vote)]

# Display outliers with 'title' and 'critics_vote' columns
critics_vote_outliers_info = outliers_critics_vote[['title', 'critics_vote']]
print("Critics Vote Outliers:")
print(critics_vote_outliers_info)

# For the Public Vote Column:

# Calculate IQR
q1_public_vote = df['public_vote'].quantile(0.25)
q3_public_vote = df['public_vote'].quantile(0.75)
iqr_public_vote = q3_public_vote - q1_public_vote

# Define lower and upper bounds for outliers
lower_bound_public_vote = q1_public_vote - 1.5 * iqr_public_vote
upper_bound_public_vote = q3_public_vote + 1.5 * iqr_public_vote

# Identify outliers in the 'public_vote' column
outliers_public_vote = df[(df['public_vote'] < lower_bound_public_vote) | (df['public_vote'] > upper_bound_public_vote)]

# Display outliers with 'title' and 'public_vote' columns
public_vote_outliers_info = outliers_public_vote[['title', 'public_vote']]
print("Public Vote Outliers:")
print(public_vote_outliers_info)


Duration Outliers:
                                  title  duration
13                   Bowery at Midnight        62
36                            The Abyss       138
81                         Africa addio       140
84                           L'âge d'or        60
86     On Her Majesty's Secret Services       140
...                                 ...       ...
41345                           La Bête       145
41350                   Zielona Granica       147
41366               Magyarázat mindenre       151
41386                     American Moon       214
41387                           Znachor       140

[2954 rows x 2 columns]
Average Vote Outliers:
                                                   title  avg_vote
6170               Amico mio... frega tu... che frego io       1.5
20195                                   Operazione poker       1.5
20620                                          Tabù n° 2       1.0
20943                                  Chek law dak gung       1.
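
The same 1.5 × IQR rule is repeated for four columns in the cell above, so it could be wrapped in one function. This is a sketch under assumptions: `iqr_outliers` is a hypothetical helper name, and the titles and values in `sample` are made up for illustration.

```python
import pandas as pd


def iqr_outliers(frame, column, k=1.5):
    """Return rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so rows with missing values are never flagged
    mask = (frame[column] < lower) | (frame[column] > upper)
    return frame.loc[mask, ["title", column]]


# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "duration": [90, 96, 100, 107, 1525],
    "avg_vote": [5.9, 6.5, 7.0, 5.7, 6.0],
})

for column in ["duration", "avg_vote"]:
    print(f"{column} outliers:")
    print(iqr_outliers(sample, column))
```

On the real data, the loop could run over `['duration', 'avg_vote', 'critics_vote', 'public_vote']` with `df` in place of `sample`.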

DATA DICTIONARY:

Explanation of Each Variable in the Dataset:

-Movie Titles: the name of the movie; a mostly unique identifier for each film in the dataset, although some movies may share the same title (in this case, they will be differentiated by year of release)

-Year: the year in which the movie was released

-Genre: the category or genre to which the movie belongs

-Duration: the running time of the movie

-Country: the country (or countries) in which the movie was produced

-Directors: the name of the director(s) who directed the movie

-Actors: a list of actors involved in the movie

-Average Vote: the numerical rating assigned to the movie, indicating its perceived quality by the standards of the audience and critics

-Critics Vote: the numerical rating assigned to the movie, indicating its perceived quality by the standards of the critics

-Public Vote: the numerical rating assigned to the movie, indicating its perceived quality by the standards of the audience/general public

-Description: a brief summary/synopsis of the movie
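
Since titles are not guaranteed to be unique, it may be worth checking whether title plus year is enough to tell films apart. The sketch below uses hypothetical titles and years; only the column names `title` and `year` come from the dataset.

```python
import pandas as pd

# Hypothetical rows: "Film A" is reused across years, "Film C" repeats within one year
sample = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film C", "Film C"],
    "year": [1950, 1999, 1982, 2018, 2018],
})

# keep=False marks every member of a duplicate group, not only the later ones
title_repeats = sample[sample.duplicated(subset="title", keep=False)]
title_year_repeats = sample[sample.duplicated(subset=["title", "year"], keep=False)]

print("Rows sharing a title:", len(title_repeats))
print("Rows sharing a title and a year:", len(title_year_repeats))
```

If the second count is not zero on the full dataset, title and year together would not be a reliable key, and another field (such as the director) might be needed to separate the films.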

Data Types and Units:

-Movie Titles: represented as strings

-Year: represented as integers

-Genre: represented as strings (categorical); each genre is a distinct category

-Duration: represented as integers, unit of measure is minutes

-Country: represented as strings; co-productions list several countries in one comma-separated string (e.g. "Canada, United States")

-Directors: represented as strings; multiple directors are listed in one comma-separated string

-Actors: represented as strings; the cast is listed in one comma-separated string

-Average Vote: represented as floats, ranging from 1 to 10

-Critics Vote: represented as floats, ranging from 1 to 10

-Public Vote: represented as floats, ranging from 1 to 10

-Description: represented as a string (free text); each movie summary is unique
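
Because directors and actors are stored as comma-separated strings, filtering or counting by an individual name generally requires splitting them first. The following is one possible sketch, assuming a hypothetical helper named `explode_names`; the two sample rows reuse values shown in the notebook output.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Ballelina", "Bugs Bunny's Third Movie: 1001 Rabbit Tales"],
    "actors": ["Jeon Jong-seo, Park Yu-rim, Ji-hun Kim", None],
})


def explode_names(frame, column):
    """Return one row per (movie, name) pair for a comma-separated column."""
    out = frame.copy()
    out[column] = out[column].str.split(",")   # string -> list of names
    out = out.explode(column)                   # one row per list element
    out[column] = out[column].str.strip()       # remove spaces left by the split
    # Movies with no names listed are dropped from the long table
    return out.dropna(subset=[column]).reset_index(drop=True)


actors_long = explode_names(sample, "actors")
print(actors_long)
```

The long table makes per-actor aggregates straightforward, for example `actors_long.groupby("actors").size()` for the number of films per actor.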

DASHBOARD UI COMPONENTS BRAINSTORMING:

List of Potential UI Components for the Dashboard:

-Movie title search bar: Allows users to find movies by entering the title. Allows for easy navigation within the dataset when the user has a specific movie they are searching for.

-Filters: If the user is unsure of a title, they can utilize dropdown filters for categories such as genre, year released, director, actors, etc. to display data based on these preferences. This allows them to explore without the need for a title.

-Carousel: Display a dynamic carousel featuring the top-rated movies in the dataset based on user preferences such as genre, year, etc. This allows for quick recommendations by looking at featured top-rated movies.

-Interactive Buttons: Utilize interactive buttons that dynamically load more details such as the movie description when clicked. This provides users with additional information when desired without cluttering the main dashboard.
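
The filter and carousel components both reduce to the same operation: narrow the table by the user's choices, then rank by rating. A minimal sketch of that logic follows. The function name `recommend` and its parameters are assumptions for illustration, and the sample rows reuse values shown in the notebook output (with shortened cast lists).

```python
import pandas as pd


def recommend(frame, genre=None, year_range=None, actor=None, top_n=5):
    """Filter movies by optional preferences and return the top-rated matches."""
    result = frame
    if genre is not None:
        result = result[result["genre"] == genre]
    if year_range is not None:
        start, end = year_range
        result = result[result["year"].between(start, end)]  # inclusive bounds
    if actor is not None:
        # Case-insensitive substring match; movies without a cast list never match
        result = result[result["actors"].str.contains(
            actor, case=False, na=False, regex=False)]
    return result.sort_values("avg_vote", ascending=False).head(top_n)


sample = pd.DataFrame({
    "title": ["Bugs Bunny's Third Movie: 1001 Rabbit Tales", "Diner",
              "A che servono questi quattrini?", "Gold Digger Killer",
              "Addio al nubilato 2"],
    "year": [1982, 1982, 1942, 2021, 2023],
    "genre": ["Animation", "Comedy", "Comedy", "Thriller", "Comedy"],
    "actors": [None, "Mickey Rourke, Steve Guttenberg, Ellen Barkin",
               "Eduardo De Filippo, Peppino De Filippo",
               "Julie Benz, Roan Curtis", "Laura Chiatti, Chiara Francini"],
    "avg_vote": [7.7, 7.0, 5.9, 4.0, 2.7],
})

print(recommend(sample, genre="Comedy", year_range=(1980, 2023)))
```

In a dashboard, the dropdown and search-bar values would be passed in as the keyword arguments, with `None` meaning that the user left a filter empty.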

Consideration of User Needs and Interests:

Taking into account that users will inevitably have varying preferences and goals when exploring this app, my goal is to provide a seamless and engaging experience for any user. By making the various elements easy to navigate, providing lots of options for filtering, and providing recommendations that cater to the individual interests of the users, I hope to make an app that is enjoyable to navigate. I will make the search features (search bar, filters, etc.) intuitive and require few steps to avoid frustration. Further, I will use visually-appealing interactive features to captivate users who may be more interested in the statistical aspect of the app than the recommendation feature. In other words, by providing a mix of interactive components, I hope to cater to the varied interests and needs of the user to create a more personal feel to the experience.

POSSIBLE DATA VISUALIZATIONS (3-6):

-Bubble Chart of Actors and Ratings: Create a bubble chart with actors on one axis, ratings on the other, and bubble size representing the number of movies in which they have acted. This gives an interesting visual for the relationship between actors, ratings, and their overall contribution to the film industry. Note: this same concept could be applied to directors. 

-Histogram of Ratings: Display a histogram to visualize the distribution of movie ratings. This could be filtered by genre, year, etc. For example, a histogram displaying the distribution of movie ratings of comedy movies released from 1990-1995.

-Boxplot of Ratings by Genre: Create boxplots comparing the distribution of ratings across different movie genres in the same period of time. For example, boxplots displaying the distribution of ratings of different movie genres from 1990-1995. This could also be done the other way, where we create boxplots comparing the distribution of ratings of a specific genre across different periods of time. For example, boxplots displaying the distribution of ratings of horror movies released in different time periods.

-Scatter Plot of Ratings vs Duration: Visualize the relationship between movie ratings and their durations using a scatter plot to identify a potential association between the two variables. Filters could also be applied to this to see if these trends differ by genre. For example, we could look at a scatter plot of ratings vs duration of thriller movies released in the 1980s.

-Stacked Bar Chart of Genres Over Time: Show how the distribution (popularity) of movie genres has changed over time using a stacked bar chart. It may be interesting to see how preferences in the film industry have changed over the years, as well as to help predict the way in which the film industry may be moving in upcoming years.

-Heatmap of Actor Collaborations: Visualize collaborations between actors. Each cell represents the number of movies featuring a given pair of actors, showing popular actor pairs. The same concept could be applied to actor-director collaborations. Another way to visualize this information could be in a network graph.
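
As an illustration of the data preparation the collaboration heatmap would need, the sketch below counts how often each pair of actors appears in the same film and arranges the counts in a symmetric matrix. The helper names `pair_counts` and `to_matrix` are hypothetical, as are the sample films and actors.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical films and casts for illustration
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "actors": ["Actor X, Actor Y, Actor Z", "Actor X, Actor Y", None],
})


def pair_counts(frame, column="actors"):
    """Count the number of films in which each pair of actors appears together."""
    counts = Counter()
    for names in frame[column].dropna():
        # Sorting the de-duplicated cast gives each pair one canonical ordering
        cast = sorted({name.strip() for name in names.split(",") if name.strip()})
        counts.update(combinations(cast, 2))
    return counts


def to_matrix(counts):
    """Arrange pair counts in a symmetric actor-by-actor table."""
    names = sorted({name for pair in counts for name in pair})
    matrix = pd.DataFrame(0, index=names, columns=names)
    for (first, second), n in counts.items():
        matrix.loc[first, second] = n
        matrix.loc[second, first] = n
    return matrix


counts = pair_counts(sample)
matrix = to_matrix(counts)
print(matrix)
```

The resulting table could be passed to a heatmap function such as `sns.heatmap` or `px.imshow`. With over 40,000 films, the matrix would likely need to be restricted to the most frequent actors to stay readable.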