<div style="text-align: center;">
<h1>Reel Realities: How Gender and Age Shape Success Across Box Office and Streaming Platforms</h1>
</div>

### <u>Imports</u>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

### 1. <u>Data cleaning and pre-processing</u>

#### 1.1 CMU Dataset

In [2]:
CMU_movies = pd.read_csv("./data/CMU/filtered_movie_metadata.csv", delimiter=",", skipinitialspace=True)

# If there's an extra unnamed column, drop it
if "Unnamed: 0" in CMU_movies.columns:
    CMU_movies = CMU_movies.drop(columns=["Unnamed: 0"])
CMU_movies.columns = [
    "Wikipedia_movie_ID",
    "Freebase_movie_ID",
    "Movie_name",
    "Movie_release_date",
    "Movie_box_office_revenue",
    "runtimeMinutes",
    "Movie_languages",
    "Movie_countries",
    "genres",
    "Cast",
    "Female_actors",
    "Male_actors",
    "Female_actor_percentage",
    "Average_female_actor_age",
    "Average_male_actor_age"
]
CMU_movies.head()

Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,runtimeMinutes,Movie_languages,Movie_countries,genres,Cast,Female_actors,Male_actors,Female_actor_percentage,Average_female_actor_age,Average_male_actor_age
0,975900,/m/03vyhn,Ghosts of Mars,2001,14010832.0,98.0,['English Language'],['United States of America'],"['Thriller', 'Science Fiction', 'Horror', 'Adv...","Wanda De Jesus, Natasha Henstridge, Ice Cube, ...",6,7,46.15,43.0,43.857143
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000,,95.0,['English Language'],['United States of America'],"['Mystery', 'Biographical film', 'Drama', 'Cri...","Alice Barrett, Robert Catrini, Cliff DeYoung, ...",2,5,28.57,39.0,49.4
2,13696889,/m/03cfc81,The Gangsters,1913,,35.0,"['Silent film', 'English Language']",['United States of America'],"['Short Film', 'Silent film', 'Indie', 'Black-...",Roscoe Arbuckle,0,1,0.0,,26.0
3,10408933,/m/02qc0j7,Alexander's Ragtime Band,1938,3600000.0,106.0,['English Language'],['United States of America'],"['Musical', 'Comedy', 'Black-and-white']","Ethel Merman, Tyrone Power, Alice Faye, Don Am...",2,2,50.0,26.5,27.0
4,6631279,/m/0gffwj,Little city,1997,,93.0,['English Language'],['United States of America'],"['Romantic comedy', 'Ensemble Film', 'Comedy-d...","Josh Charles, Penelope Ann Miller, Annabella S...",4,2,66.67,37.75,30.0


In [3]:
plot_summaries_df = pd.read_csv("data/CMU/plot_summaries.txt", delimiter="\t", names = ["Wikipedia_movie_ID", "Plot Summaries"])

print(f"The plot summaries dataframe has {len(plot_summaries_df):,} rows.")
plot_summaries_df.head()

The plot summaries dataframe has 42,303 rows.


Unnamed: 0,Wikipedia_movie_ID,Plot Summaries
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


#### 1.2 IMDB Dataset

We will use three IMDB datasets to describe movies:
1. "title.ratings.tsv" gives the ratings of the movies as voted by viewers.
2. "title.basics.tsv" is keyed by the same alphanumeric unique title identifier (tconst) as "title.ratings.tsv". It gives general information about the movie such as runtime, release year and adult rating.
3. "title.crew.tsv" is keyed by the same identifier as the previous two. It gives information on the directors and writers of the movie.

Reference:
Internet Movie Database. (2024). IMDb non-commercial datasets. Retrieved from https://developer.imdb.com/non-commercial-datasets/

In [4]:
# Loading the datasets. Null values are represented using "\N".
IMDB_ratings_df = pd.read_csv("data/IMDB/title.ratings.tsv", delimiter="\t", na_values="\\N")
IMDB_basics_df = pd.read_csv("data/IMDB/title.basics.tsv", delimiter="\t", na_values="\\N", low_memory=False)
IMDB_crew_df = pd.read_csv("data/IMDB/title.crew.tsv", delimiter="\t", na_values="\\N", low_memory=False)

In [5]:
IMDB_ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2100
1,tt0000002,5.6,282
2,tt0000003,6.5,2119
3,tt0000004,5.4,182
4,tt0000005,6.2,2850


In [6]:
IMDB_basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0.0,1894.0,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0.0,1892.0,,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0.0,1892.0,,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0.0,1892.0,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0.0,1893.0,,1,"Comedy,Short"


In [7]:
IMDB_crew_df.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,
1,tt0000002,nm0721526,
2,tt0000003,nm0721526,
3,tt0000004,nm0721526,
4,tt0000005,nm0005690,


In [8]:
# Checking the lengths of the datasets
print(f"Length of IMDB_ratings_df: {len(IMDB_ratings_df):,}\n"
      f"Length of IMDB_basics_df: {len(IMDB_basics_df):,}\n"
      f"Length of IMDB_crew_df: {len(IMDB_crew_df):,}")

Length of IMDB_ratings_df: 1,498,615
Length of IMDB_basics_df: 11,235,767
Length of IMDB_crew_df: 10,571,536


Before dealing with the null values we will merge the dataframes together using the alphanumeric unique identifier.

In [9]:
# Merging all three datasets.
IMDB_merged_df = pd.merge(IMDB_ratings_df, IMDB_basics_df, how="inner", on="tconst")
IMDB_merged_df = pd.merge(IMDB_merged_df, IMDB_crew_df, how="inner", on="tconst")

print(f"The resulting merged dataframe has length: {len(IMDB_merged_df):,}.")
print(f"{len(IMDB_ratings_df)-len(IMDB_merged_df):,} rows were lost in the merging process.")
IMDB_merged_df.head()

The resulting merged dataframe has length: 1,484,729.
13,886 rows were lost in the merging process.


Unnamed: 0,tconst,averageRating,numVotes,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers
0,tt0000001,5.7,2100,short,Carmencita,Carmencita,0.0,1894.0,,1,"Documentary,Short",nm0005690,
1,tt0000002,5.6,282,short,Le clown et ses chiens,Le clown et ses chiens,0.0,1892.0,,5,"Animation,Short",nm0721526,
2,tt0000003,6.5,2119,short,Poor Pierrot,Pauvre Pierrot,0.0,1892.0,,5,"Animation,Comedy,Romance",nm0721526,
3,tt0000004,5.4,182,short,Un bon bock,Un bon bock,0.0,1892.0,,12,"Animation,Short",nm0721526,
4,tt0000005,6.2,2850,short,Blacksmith Scene,Blacksmith Scene,0.0,1893.0,,1,"Comedy,Short",nm0005690,


We can see we do not lose a lot of rows with respect to the IMDB_ratings_df dataframe.
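
To see which side the unmatched rows come from, one option is the `indicator` argument of `pd.merge`. The sketch below runs on two small made-up frames (`ratings` and `basics` are hypothetical stand-ins, not the IMDB files) and is only meant to illustrate the idea:

```python
import pandas as pd

# Toy stand-ins for the ratings and basics tables (all values are made up).
ratings = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"], "averageRating": [5.7, 6.1, 7.2]})
basics = pd.DataFrame({"tconst": ["tt1", "tt3", "tt4"], "primaryTitle": ["A", "C", "D"]})

# An outer merge with indicator=True adds a "_merge" column that says whether
# each key was found in the left frame only, the right frame only, or both.
audit = pd.merge(ratings, basics, how="outer", on="tconst", indicator=True)
merge_counts = audit["_merge"].value_counts()
print(merge_counts)

# Rated titles with no match in basics are the rows an inner merge would drop.
lost = audit.loc[audit["_merge"] == "left_only", "tconst"].tolist()
print(lost)
```

Applied to the real dataframes, the `left_only` rows would correspond to the rated titles that disappear in the inner merge.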

Next, we look at titleType. The IMDB data contains not only movies but also other title types such as shorts, TV shows and episodes. The next step is thus to keep only the rows whose titleType is "movie".

In [10]:
# Filtering movies from the list of titles.
IMDB_merged_df = IMDB_merged_df[IMDB_merged_df["titleType"] == "movie"]

print(f"There are {len(IMDB_merged_df):,} movies in the resulting dataframe.")
IMDB_merged_df.head()

There are 319,293 movies in the resulting dataframe.


Unnamed: 0,tconst,averageRating,numVotes,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers
8,tt0000009,5.4,216,movie,Miss Jerry,Miss Jerry,0.0,1894.0,,45,Romance,nm0085156,nm0085156
143,tt0000147,5.2,540,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,,100,"Documentary,News,Sport",nm0714557,
337,tt0000502,4.1,19,movie,Bohemios,Bohemios,0.0,1905.0,,100,,nm0063413,"nm0063413,nm0657268,nm0675388"
372,tt0000574,6.0,938,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,,70,"Action,Adventure,Biography",nm0846879,nm0846879
382,tt0000591,5.7,28,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,,90,Drama,nm0141150,nm0141150


We can now look at null values in the merged IMDB dataframe.

In [11]:
# Checking the percentage of null values in the dataset.
n_null_IMDB = ((IMDB_merged_df.isnull().sum() / len(IMDB_merged_df)) * 100).apply(lambda x: f"{x:,.2f}%")

print(
    f"Percentage of null values per column:\n"
    f"IMDB_merged_df:\n{n_null_IMDB}"
)

Percentage of null values per column:
IMDB_merged_df:
tconst              0.00%
averageRating       0.00%
numVotes            0.00%
titleType           0.00%
primaryTitle        0.00%
originalTitle       0.00%
isAdult             0.00%
startYear           0.01%
endYear           100.00%
runtimeMinutes      9.98%
genres              3.26%
directors           0.98%
writers            12.30%
dtype: object


The end year is always missing. Other than that, the proportion of missing values is modest: the highest are writers (12.30%) and runtimeMinutes (9.98%), and every other column is below 4%. End year carries no useful information for our intended analysis and can thus be dropped. We can also drop the titleType column, since all remaining rows are movies after the filtering done above.

In [12]:
# Dropping unnecessary columns.
IMDB_merged_df = IMDB_merged_df.drop(columns=["endYear", "titleType"])

print(f"The resulting dataframe has {len(IMDB_merged_df):,} rows.")
IMDB_merged_df.head()

The resulting dataframe has 319,293 rows.


Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres,directors,writers
8,tt0000009,5.4,216,Miss Jerry,Miss Jerry,0.0,1894.0,45,Romance,nm0085156,nm0085156
143,tt0000147,5.2,540,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100,"Documentary,News,Sport",nm0714557,
337,tt0000502,4.1,19,Bohemios,Bohemios,0.0,1905.0,100,,nm0063413,"nm0063413,nm0657268,nm0675388"
372,tt0000574,6.0,938,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,70,"Action,Adventure,Biography",nm0846879,nm0846879
382,tt0000591,5.7,28,The Prodigal Son,L'enfant prodigue,0.0,1907.0,90,Drama,nm0141150,nm0141150


We will use ratings from the IMDB dataset during our study. These ratings are based on viewer votes, so a rating backed by very few votes is less reliable. We initially considered discarding rows with too few votes. However, the number of votes may be linked to how many people watched a movie (although certainly not in a direct way), so filtering on it could bias the sample towards widely seen movies. We thus decided to keep all rows for the analysis.
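
As a rough illustration of the trade-off, the sketch below shows how one could measure the share of rows that a vote threshold would keep. The frame `toy`, the helper `share_kept` and the thresholds are all hypothetical; no such filter is applied in this notebook.

```python
import pandas as pd

# Toy data; titles and vote counts are made up for illustration.
toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "averageRating": [8.9, 6.2, 7.1, 5.0],
    "numVotes": [12, 540, 27, 10_000],
})

def share_kept(df: pd.DataFrame, min_votes: int) -> float:
    """Fraction of rows that would survive a numVotes >= min_votes filter."""
    return float((df["numVotes"] >= min_votes).mean())

# Compare a few candidate thresholds before committing to any of them.
for threshold in (0, 25, 100, 1000):
    print(f"min_votes={threshold:>5}: {share_kept(toy, threshold):.0%} of rows kept")
```

Running the same loop on `IMDB_merged_df` would show how quickly the dataset shrinks as the threshold grows.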

#### 1.3 Merging the datasets

##### 1.3.1 Merging IMDB and CMU Movies

In [13]:
# Merging on the original title.
merge1 = pd.merge(IMDB_merged_df, CMU_movies, how="inner", left_on="originalTitle", right_on="Movie_name")
# Merging on the primary title.
merge2 = pd.merge(IMDB_merged_df, CMU_movies, how="inner", left_on="primaryTitle", right_on="Movie_name")

# Concatenating and dropping duplicates that appear from movies with the same originalTitle and primaryTitle.
movie_df = pd.concat([merge1, merge2]).drop_duplicates().reset_index(drop=True)

print(f"The resulting dataframe has {len(movie_df):,} rows.")
movie_df.head()

The resulting dataframe has 46,685 rows.


Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes_x,genres_x,directors,...,runtimeMinutes_y,Movie_languages,Movie_countries,genres_y,Cast,Female_actors,Male_actors,Female_actor_percentage,Average_female_actor_age,Average_male_actor_age
0,tt0000147,5.2,540,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100.0,"Documentary,News,Sport",nm0714557,...,,[],[],['Sports'],James J. Corbett,0,1,0.0,,30.0
1,tt0000615,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,,Drama,nm0533958,...,141.0,['English Language'],['Australia'],"['History', 'Western', 'Action', 'Drama', 'Adv...","Tom E. Lewis, Ed Devereaux, Andy Anderson, Rob...",0,5,0.0,,38.8
2,tt0000615,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,,Drama,nm0533958,...,,['Silent film'],['Australia'],"['Silent film', 'Drama']",Jim Gerald,0,1,0.0,,16.0
3,tt0000679,5.2,77,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0.0,1908.0,120.0,"Adventure,Fantasy","nm0091767,nm0877783",...,120.0,['English Language'],['United States of America'],"['Silent film', 'Black-and-white']","Romola Remus, L. Frank Baum",1,1,50.0,8.0,52.0
4,tt0000886,4.7,40,"Hamlet, Prince of Denmark",Hamlet,0.0,1910.0,,Drama,nm0099901,...,111.0,['German Language'],['Germany'],['Drama'],"Asta Nielsen, Eduard von Winterstein, Heinz St...",2,5,28.57,33.0,43.4
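
Matching on titles alone can pair different films that share a title, as the two "Robbery Under Arms" rows in the preview above (same tconst, different casts) suggest. One possible sanity check, not applied in this notebook, is to compare release years after the merge. The sketch below uses small toy frames, and the tolerance `max_gap` is a hypothetical choice:

```python
import pandas as pd

# Toy frames mimicking the two sources (only the columns needed here).
imdb = pd.DataFrame({
    "originalTitle": ["Robbery Under Arms", "Hamlet"],
    "startYear": [1907.0, 1910.0],
})
cmu = pd.DataFrame({
    "Movie_name": ["Robbery Under Arms", "Robbery Under Arms", "Hamlet"],
    "Movie_release_date": [1985, 1907, 1921],
})

# Title-only merge: every same-title pair is kept, whatever the year.
matched = pd.merge(imdb, cmu, how="inner", left_on="originalTitle", right_on="Movie_name")

# Absolute difference between the two sources' release years.
matched["year_gap"] = (matched["startYear"] - matched["Movie_release_date"]).abs()

max_gap = 1  # hypothetical tolerance, in years
same_film = matched[matched["year_gap"] <= max_gap]
print(same_film)
```

Whether to drop or merely flag rows with a large gap is a design choice; flagging keeps the data while making the ambiguity visible.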


Some columns appear twice. Let's take a look at the proportion of null values in each duplicate column.

In [14]:
n_null_movie = (movie_df.isnull().sum()/len(movie_df)).apply(lambda x:f"{x:.2%}")

print(f"Percentage of null values per column:\n{n_null_movie}")

Percentage of null values per column:
tconst                       0.00%
averageRating                0.00%
numVotes                     0.00%
primaryTitle                 0.00%
originalTitle                0.00%
isAdult                      0.00%
startYear                    0.00%
runtimeMinutes_x             6.02%
genres_x                     1.54%
directors                    0.47%
writers                      5.57%
Wikipedia_movie_ID           0.00%
Freebase_movie_ID            0.00%
Movie_name                   0.00%
Movie_release_date           0.00%
Movie_box_office_revenue    74.46%
runtimeMinutes_y            11.58%
Movie_languages              0.00%
Movie_countries              0.00%
genres_y                     0.00%
Cast                         0.00%
Female_actors                0.00%
Male_actors                  0.00%
Female_actor_percentage      0.00%
Average_female_actor_age    14.35%
Average_male_actor_age       5.24%
dtype: object


We can see:
- runtimeMinutes_x and runtimeMinutes_y have 6.02% and 11.58% missing values respectively. We will combine the non-null values from both columns into a new column called runtimeMinutes, preferring runtimeMinutes_x (from IMDB) when both are present, and then drop the two original columns.
- genres_x has 1.54% missing values against 0.00% for genres_y. Furthermore, genres_y comes from the CMU dataset and lists more genres per movie. We will thus drop the genres_x column.
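
For readers unfamiliar with `combine_first`, here is a minimal sketch on two made-up Series (`x` and `y` are hypothetical stand-ins for the two runtime columns). The calling Series takes priority, and the other one only fills its gaps:

```python
import numpy as np
import pandas as pd

# Toy runtimes: x plays the role of runtimeMinutes_x, y of runtimeMinutes_y.
x = pd.Series([100.0, np.nan, np.nan, 120.0])
y = pd.Series([95.0, 141.0, np.nan, 120.0])

# Values from x are kept where present; y is used only where x is null.
combined = x.combine_first(y)
print(combined)
```

The first entry stays at 100.0 even though `y` disagrees, the second is filled from `y`, and the third remains null because both sources are missing.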

In [15]:
# Combining all non-null values from runtimeMinutes_x and runtimeMinutes_y into runtimeMinutes.
movie_df["runtimeMinutes"] = movie_df["runtimeMinutes_x"].combine_first(movie_df["runtimeMinutes_y"])

# Dropping the unnecessary columns.
movie_df.drop(columns=["runtimeMinutes_x", "runtimeMinutes_y", "genres_x"], inplace=True)

# Renaming the column to Genres.
movie_df.rename(columns={"genres_y":"Genres"}, inplace=True)
movie_df.head()

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,originalTitle,isAdult,startYear,directors,writers,Wikipedia_movie_ID,...,Movie_languages,Movie_countries,Genres,Cast,Female_actors,Male_actors,Female_actor_percentage,Average_female_actor_age,Average_male_actor_age,runtimeMinutes
0,tt0000147,5.2,540,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,nm0714557,,28703057,...,[],[],['Sports'],James J. Corbett,0,1,0.0,,30.0,100.0
1,tt0000615,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,nm0533958,"nm0092809,nm0533958",27543288,...,['English Language'],['Australia'],"['History', 'Western', 'Action', 'Drama', 'Adv...","Tom E. Lewis, Ed Devereaux, Andy Anderson, Rob...",0,5,0.0,,38.8,141.0
2,tt0000615,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,nm0533958,"nm0092809,nm0533958",32986669,...,['Silent film'],['Australia'],"['Silent film', 'Drama']",Jim Gerald,0,1,0.0,,16.0,
3,tt0000679,5.2,77,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0.0,1908.0,"nm0091767,nm0877783","nm0000875,nm0877783",5954041,...,['English Language'],['United States of America'],"['Silent film', 'Black-and-white']","Romola Remus, L. Frank Baum",1,1,50.0,8.0,52.0,120.0
4,tt0000886,4.7,40,"Hamlet, Prince of Denmark",Hamlet,0.0,1910.0,nm0099901,nm0000636,5586863,...,['German Language'],['Germany'],['Drama'],"Asta Nielsen, Eduard von Winterstein, Heinz St...",2,5,28.57,33.0,43.4,111.0


Let's now see if Movie_name, originalTitle and primaryTitle are all necessary or if there are any redundancies.

In [16]:
test1 = movie_df["Movie_name"] == movie_df["primaryTitle"]  
test2 = movie_df["Movie_name"] == movie_df["originalTitle"]

# Checking if there are any movies for which Movie_name is not either in primaryTitle or originalTitle
print(f"There are {(~(test1 | test2)).sum().item()} movies for which Movie_name is in neither primaryTitle or originalTitle.")

There are 0 movies for which Movie_name is in neither primaryTitle or originalTitle.


We can see the Movie_name column is redundant as its information is either in primaryTitle or in originalTitle. We can thus drop this column.

In [17]:
movie_df.drop(columns="Movie_name", inplace=True)

##### 1.3.2 Adding Plot Summaries when possible

In [18]:
movie_df = pd.merge(movie_df, plot_summaries_df, how="left", on="Wikipedia_movie_ID")

# Counting how many movies have a non-null plot summary
n_movie_plots = movie_df["Plot Summaries"].notna().sum()
print(f"{n_movie_plots:,} movies from our final dataset have plot summaries.")

30,793 movies from our final dataset have plot summaries.


We can now also drop the movie identifier columns tconst, Freebase_movie_ID and Wikipedia_movie_ID, since all the merges that relied on them are done.

In [19]:
movie_df.drop(columns=["Wikipedia_movie_ID", "tconst", "Freebase_movie_ID"], inplace=True)

This gives us our final cleaned dataset for our study:

In [20]:
movie_df.head()

Unnamed: 0,averageRating,numVotes,primaryTitle,originalTitle,isAdult,startYear,directors,writers,Movie_release_date,Movie_box_office_revenue,...,Movie_countries,Genres,Cast,Female_actors,Male_actors,Female_actor_percentage,Average_female_actor_age,Average_male_actor_age,runtimeMinutes,Plot Summaries
0,5.2,540,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,nm0714557,,1897,100000.0,...,[],['Sports'],James J. Corbett,0,1,0.0,,30.0,100.0,The film no longer exists in its entirety; how...
1,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,nm0533958,"nm0092809,nm0533958",1985,,...,['Australia'],"['History', 'Western', 'Action', 'Drama', 'Adv...","Tom E. Lewis, Ed Devereaux, Andy Anderson, Rob...",0,5,0.0,,38.8,141.0,
2,4.3,27,Robbery Under Arms,Robbery Under Arms,0.0,1907.0,nm0533958,"nm0092809,nm0533958",1907,,...,['Australia'],"['Silent film', 'Drama']",Jim Gerald,0,1,0.0,,16.0,,Key scenes of the film included the branding o...
3,5.2,77,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0.0,1908.0,"nm0091767,nm0877783","nm0000875,nm0877783",1908,,...,['United States of America'],"['Silent film', 'Black-and-white']","Romola Remus, L. Frank Baum",1,1,50.0,8.0,52.0,120.0,
4,4.7,40,"Hamlet, Prince of Denmark",Hamlet,0.0,1910.0,nm0099901,nm0000636,1921,,...,['Germany'],['Drama'],"Asta Nielsen, Eduard von Winterstein, Heinz St...",2,5,28.57,33.0,43.4,111.0,


### 2. <u>Our success metric</u>

### 3. <u>Gender and age vs success</u>

Dependent variables:
- Ratings
- Success metric
- Profit ratio

Independent variables:
- Gender
- Age
- Genre
- isAdult?
- Movie country
- Movie language
- Release date

Look at adding starpower

### 4. <u>How does it compare to streaming platforms? Are movies made for these platforms different? Have box office movies adapted since the rise of streaming?</u>

### 5. <u>What are the social reasons behind the presence of female characters in movies? Is it due to sexualization or genuine equality of representation?</u>