# Research question 6

### Do budget and production scale affect the type of ending chosen?

Exploring whether high-budget films tend to favor certain endings (e.g., happy endings for wider audience appeal) could reveal whether financial considerations influence storytelling choices.

This notebook presents initial observations, not final conclusions.


##### Imports

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

In [9]:
# path
DATA_FOLDER = '../../src/data/'
MOVIE_DATASET = DATA_FOLDER + 'movies_dataset_final_2.tsv'

# Dataset loading
movies = pd.read_csv(MOVIE_DATASET, sep='\t')

In [10]:
movies.columns

Index(['Movie_ID', 'Other_Column', 'Title', 'Release_Date', 'Runtime',
       'Languages', 'Country', 'Genres', 'Summary', 'Score', 'director',
       'vote_average', 'revenue', 'collection', 'budget', 'productions'],
      dtype='object')

Remove movies with a missing or zero budget

In [25]:
# Count rows where 'budget' is NaN or 0
missing_or_zero_count = movies[(movies['budget'].isnull()) | (movies['budget'] == 0)].shape[0]
print(f"Number of movies with missing or zero budget: {missing_or_zero_count}")

# Calculate the percentage of these rows
percentage_missing_or_zero = (missing_or_zero_count / len(movies)) * 100
print(f"Percentage of movies with missing or zero budget: {percentage_missing_or_zero:.2f}%")

# Remove rows where 'budget' is NaN or 0
movies = movies[(movies['budget'].notnull()) & (movies['budget'] > 0)]

# Verify removal
remaining_rows = len(movies)
print(f"Number of rows remaining after removal: {remaining_rows}")



Number of movies with missing or zero budget: 0
Percentage of movies with missing or zero budget: 0.00%
Number of rows remaining after removal: 6942


In [27]:
missing_productions = movies['productions'].isna().sum()
print(f"Number of missing values in 'productions': {missing_productions}")



Number of missing values in 'productions': 0


In [31]:
import ast

# Function to extract production names
def extract_production_names(production_list):
    # The column may hold the string representation of a Python list of dictionaries
    if isinstance(production_list, str):
        # Parse the Python-literal string into a list (this is not JSON: it uses None, not null)
        try:
            production_list = ast.literal_eval(production_list)
        except (ValueError, SyntaxError):
            return []  # Return an empty list if the string cannot be parsed

    # If it's a valid list of dictionaries, extract the 'name'
    if isinstance(production_list, list):
        return [item['name'] for item in production_list if isinstance(item, dict) and 'name' in item]
    return []

# Apply the function to the 'productions' column to extract production names
movies['production_names'] = movies['productions'].apply(extract_production_names)

# Display the first few rows to check the result
print(movies[['productions', 'production_names']].head())


                                          productions  \
0   [{'id': 51312, 'logo_path': None, 'name': 'Ani...   
4   [{'id': 3166, 'logo_path': '/vyyv4Gy9nPqAZKElP...   
6   [{'id': 1947, 'logo_path': None, 'name': 'New ...   
12  [{'id': 22284, 'logo_path': None, 'name': 'CAT...   
14  [{'id': 13549, 'logo_path': None, 'name': 'Gol...   

                                     production_names  
0   [Animationwerks, Screen Gems, Storm King Produ...  
4                           [Walt Disney Productions]  
6           [New Deal Productions, Columbia Pictures]  
12                          [CAT Films, Mimosa Films]  
14                         [Golan-Globus Productions]  


In [32]:
# Count NaN entries in 'production_names' (empty lists returned on a parsing failure are not counted)
missing_production_names = movies['production_names'].isna().sum()
print(f"Number of missing values in 'production_names': {missing_production_names}")


Number of missing values in 'production_names': 0
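
Because `extract_production_names` returns an empty list rather than NaN when nothing can be extracted, `isna()` cannot flag films without a usable production name. The following is a minimal sketch of a length-based check; the frame `demo` and its values are hypothetical stand-ins for `movies`, not data from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `movies` frame; values are illustrative only
demo = pd.DataFrame({
    "production_names": [["Studio A", "Studio B"], [], ["Studio C"]],
})

# isna() treats an empty list as a present value, so it does not count it
n_nan = int(demo["production_names"].isna().sum())

# Counting zero-length lists catches rows where no name was extracted
n_empty = int((demo["production_names"].str.len() == 0).sum())

print(f"NaN entries: {n_nan}, empty lists: {n_empty}")
```

Applied to `movies['production_names']`, the same length test would show whether any film ends up with `first_production_name` set to `None` in the next step.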


In [40]:
# Extract the first production name for each film
movies['first_production_name'] = movies['production_names'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None)

# Display the first few rows to verify the result
print(movies[['Title', 'first_production_name']].head())


                          Title     first_production_name
0                Ghosts of Mars            Animationwerks
4                  Mary Poppins   Walt Disney Productions
6                      Baby Boy      New Deal Productions
12       The Gods Must Be Crazy                 CAT Films
14  Kinjite: Forbidden Subjects  Golan-Globus Productions


### Statistics

In [41]:
# Calculate the Pearson correlation (pandas default) between budget and score
correlation = movies['budget'].corr(movies['Score'])
print(f"Correlation between budget and score: {correlation:.2f}")

Correlation between budget and score: 0.03
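
Pearson's coefficient only captures linear association and can be sensitive to skewed variables, which film budgets may well be (this is an assumption, not something checked in this notebook). As a hedged sketch, a rank-based (Spearman) coefficient and a Pearson coefficient on log-transformed budgets could be compared with the value above. The frame `demo` and its synthetic values are hypothetical stand-ins for `movies`.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: right-skewed budgets and unrelated scores
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "budget": rng.lognormal(mean=16, sigma=1.5, size=500),
    "Score": rng.normal(size=500),
})

# Pearson correlation on the raw values (what Series.corr computes by default)
pearson = demo["budget"].corr(demo["Score"])

# Spearman correlation, computed as the Pearson correlation of the ranks
spearman = demo["budget"].rank().corr(demo["Score"].rank())

# Pearson correlation after compressing the budget scale with a log transform
pearson_log = np.log10(demo["budget"]).corr(demo["Score"])

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}, Pearson (log budget): {pearson_log:.2f}")
```

If the three coefficients on the real data all stay close to zero, the weak association is unlikely to be an artifact of the budget scale.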


We use a one-way ANOVA (analysis of variance) to test whether mean scores differ significantly across groups of films defined by their first-listed production company. The test is appropriate because we compare more than two groups (one per production company) on a continuous variable (the film score). ANOVA assesses whether the variation in scores between production groups is larger than would be expected from random variation within the groups.
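
To make the test statistic concrete, here is a minimal sketch that computes the F statistic by hand as the ratio of the between-group to the within-group mean square, and compares it with `scipy.stats.f_oneway`. The three groups and their scores are hypothetical values chosen for illustration, not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three production groups (illustrative values only)
groups = [
    np.array([6.1, 5.8, 6.4, 6.0]),
    np.array([5.2, 5.5, 4.9, 5.1]),
    np.array([6.9, 7.2, 6.7]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k = len(groups)        # number of groups
n = all_scores.size    # total number of observations

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F statistic: between-group mean square over within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# The same test through SciPy, which also returns the p-value
f_scipy, p_value = stats.f_oneway(*groups)

print(f"F (manual): {f_manual:.3f}, F (scipy): {f_scipy:.3f}, p-value: {p_value:.4g}")
```

A large F value means the group means are far apart relative to the scatter inside each group, which is what a small p-value reflects.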

In [43]:
import scipy.stats as stats

# Collect the scores of each production group (one Series per first production name)
score_groups = [scores for _, scores in movies.groupby('first_production_name')['Score']]

# Perform a one-way ANOVA to test whether mean scores differ between production groups
anova_result = stats.f_oneway(*score_groups)

# Print the result in a cleaner format
if anova_result.pvalue < 0.05:
    print("ANOVA result: There are significant differences in scores between production groups.")
else:
    print("ANOVA result: There are no significant differences in scores between production groups.")

print(f"ANOVA p-value: {anova_result.pvalue:.4f}")


ANOVA result: There are significant differences in scores between production groups.
ANOVA p-value: 0.0003
