In [4]:
import pandas as pd
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
#from data.dataset_enhancer import get_movies

## 0. Data Enhancement

We first ran the data_preprocessing.ipynb notebook, which loads the movie metadata and the additional datasets, preprocesses them, and saves the cleaned data to CSV files: the movie metadata for older and newer movies, plus the datasets for sequels, books, comics, remakes, and collections. There is no need to run it again, because the generated files are already on GitHub. Running it also requires an API key, which we do not publish on GitHub for security reasons.
The get_movies function stores its output in the data directory.

In [4]:
"""keywords_name = ["sequels", "book", "comics", "remake"]
keywords_id = [9663, 818, 9717, 9714]
start_date = "1880-01-01"
end_date = "2010-01-01"

get_movies(keywords_name, keywords_id, start_date, end_date)

start_date = "2010-01-01"
end_date = "2024-01-01"
get_movies(keywords_name, keywords_id, start_date, end_date)"""


## 1. Files loading and preprocessing

The following cell organizes and preprocesses movie datasets from two time periods (1880–2010 and 2010–2024) using the MovieFrames class. It imports the class, loads the movie metadata for the older and newer movies, and builds the file paths of the additional datasets (sequels, books, comics, remakes, and collections) for both periods. It then creates two MovieFrames objects, movie_frames_old and movie_frames_new, which structure and preprocess the data; movie_frames_old also standardizes the column names of the older dataset. These objects group the movie data by category and prepare it for further analysis.

### 1.1 Data Collection

In [5]:
from src.models.movies_frame import MovieFrames

movie_df = pd.read_csv('data/MovieSummaries_filtered/movie_df.csv')
        
new_movie_df = pd.read_csv('data/all_sample/all_sample_2010_2024_metadata.csv')

keywords = ["sequels", "book", "comics", "remake"]
path_old = []
path_new = []

for keyword in keywords:        
    path_old.append(f"data/{keyword}/{keyword}_1880_2010_with_wiki_id.csv")
    path_new.append(f"data/{keyword}/{keyword}_2010_2024_metadata.csv")

path_old.append("data/collections/sequels_and_original_1880_2010_with_wiki_id.csv")
path_new.append("data/collections/sequels_and_original_2010_2024_metadata.csv")

movie_frames_old = MovieFrames(movie_df, path_old, 1880, 2010)
movie_frames_new = MovieFrames(new_movie_df, path_new, 2010, 2024)


### 1.2 Data Preprocessing

The following cell visualizes how the dataset sizes change during preprocessing, using the display_data_cleaning_graph function. The function takes a MovieFrames object and counts the movies at three stages:

- The original data loaded from TMDb.
- After matching the TMDb data with the Wikipedia data.
- After filtering out movies with mismatched release years.

These sizes are passed to the create_graph function, which generates a bar graph showing the changes in dataset sizes for five categories: sequel collections, sequels, books, comics, and remakes. The graph highlights how the preprocessing steps affect the number of movies in each category. 
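
The three counts can be illustrated with a small dependency-free sketch. The record fields (`wiki_id`, `tmdb_year`) and the helper `cleaning_stage_sizes` are hypothetical and are not taken from `movie_data_cleaner`; the actual matching and filtering rules may differ.

```python
def cleaning_stage_sizes(tmdb_rows, wiki_years):
    """Return the number of movies left after each cleaning stage.

    tmdb_rows: list of dicts with the hypothetical keys 'wiki_id' and 'tmdb_year'.
    wiki_years: dict mapping a Wikipedia ID to the release year found on Wikipedia.
    """
    # Stage 1: everything loaded from TMDb.
    original = len(tmdb_rows)
    # Stage 2: keep the rows that could be matched with the Wikipedia data.
    matched = [row for row in tmdb_rows if row["wiki_id"] in wiki_years]
    # Stage 3: keep the rows whose release year agrees in both sources.
    same_year = [row for row in matched
                 if row["tmdb_year"] == wiki_years[row["wiki_id"]]]
    return {"original": original, "matched": len(matched), "same_year": len(same_year)}


rows = [
    {"wiki_id": 1, "tmdb_year": 1999},
    {"wiki_id": 2, "tmdb_year": 2005},     # year differs from Wikipedia
    {"wiki_id": None, "tmdb_year": 2001},  # no Wikipedia match
]
sizes = cleaning_stage_sizes(rows, {1: 1999, 2: 2004})
print(sizes)
```

Each category (sequel collections, sequels, books, comics, remakes) would yield one such triple, which is what the bar graph displays.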

In [None]:
# Size differences between preprocessing stages
from src.models.movie_data_cleaner import display_data_cleaning_graph
fig = display_data_cleaning_graph(movie_frames_old)

fig

### 2.1 How many sequels are there compared to movies 

The movies are grouped into 5-year intervals and counted per interval; the interval labels are returned as strings. The first cell shows the result for 1880–2010. The second cell is meant to show the result for 2010–2024 once it works.

The left figure plots the number of movies per 5-year interval, and the right figure plots the number of movies with sequels per 5-year interval.
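
As a rough illustration of the grouping step, here is a minimal sketch in plain Python. The helper `count_per_interval` is hypothetical, and it assumes intervals aligned on multiples of five; `get_movie_counter_figure` may pick different bin edges.

```python
from collections import Counter


def count_per_interval(years, width=5):
    """Count movies per `width`-year interval, keyed by a label such as '1995-1999'."""
    counts = Counter()
    for year in years:
        start = (year // width) * width  # first year of the interval
        counts[f"{start}-{start + width - 1}"] += 1
    # Four-digit labels sort chronologically when sorted as strings.
    return dict(sorted(counts.items()))


release_years = [1994, 1995, 1997, 1999, 2000, 2003]
print(count_per_interval(release_years))
```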

In [None]:
from src.models.movie_counter import get_movie_counter_figure

#Plot figure 1 (left):  number of movies per 5 years
fig = get_movie_counter_figure(movie_frames_old)
fig


In [None]:
movie_frames_new.drop_different_years()
movie_frames_new.drop_impossible_years()
movie_frames_concat = movie_frames_old.concat_movie_frame(movie_frames_new)
fig = get_movie_counter_figure(movie_frames_concat)
fig

### 2.2 Ratio of sequels to original movies



The next cells compute and plot the ratio of movies with sequels to all movies, both counted per 5-year interval.

$$
\text{Ratio} = \frac{\text{number of movies with sequels per 5-year interval}}{\text{number of movies per 5-year interval}}
$$
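
The ratio can be sketched as follows. The helpers `interval_start` and `sequel_ratio_per_interval` are hypothetical names, and the alignment of intervals on multiples of five is an assumption, not necessarily what `get_ratio_movie_figure` does.

```python
from collections import Counter


def interval_start(year, width=5):
    """Map a year to the first year of its interval (assumed aligned on multiples of `width`)."""
    return (year // width) * width


def sequel_ratio_per_interval(all_years, sequel_years, width=5):
    """Return {interval start: movies with sequels / all movies} for each interval."""
    total = Counter(interval_start(y, width) for y in all_years)
    sequels = Counter(interval_start(y, width) for y in sequel_years)
    # Only intervals that contain at least one movie appear in `total`,
    # so the division never hits zero.
    return {start: sequels[start] / count for start, count in sorted(total.items())}


all_years = [2000, 2001, 2002, 2004, 2005, 2006]
sequel_years = [2001, 2006]
ratios = sequel_ratio_per_interval(all_years, sequel_years)
print(ratios)
```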

In [None]:

from src.models.movie_counter import get_ratio_movie_figure

fig = get_ratio_movie_figure(movie_frames_old)
fig

In [None]:

from src.models.movie_counter import get_ratio_movie_figure

fig = get_ratio_movie_figure(movie_frames_concat)
fig

## 3.1 Box office revenue


#### 3.1.1 Box office revenue for movies with sequels compared to all movies

In [None]:
from src.utils.evaluation_utils import inflate
import numpy as np
import swifter

for df in movie_frames_old.get_all_df():
    df["Movie box office revenue inflation adj"] = df.swifter.apply(lambda x: inflate(x["Movie box office revenue"], x["release year"]), axis=1)
                
for df in movie_frames_new.get_all_df():
    df["Movie box office revenue inflation adj"] = df.swifter.apply(lambda x: inflate(x["Movie box office revenue"], x["release year"]), axis=1)
    
for df in movie_frames_concat.get_all_df():
    df.reset_index(drop=True, inplace=True)
    df["Movie box office revenue inflation adj"] = df.swifter.apply(lambda x: inflate(x["Movie box office revenue"], x["release year"]), axis=1)




In [None]:
from src.models.box_office_revenue import get_box_office_absolute

# Plot figure 4: box office revenue per year
fig = get_box_office_absolute(movie_frames_concat)
fig


The next cell computes and plots the percentage of each year's box office revenue contributed by movies with sequels, relative to the total box office revenue of all movies that year.

$$
\text{Box Office \%} = \frac{\text{Box office of movies with sequels per year}}{\text{Box office of all movies per year}} \times 100
$$
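
A minimal sketch of this percentage, assuming each movie is reduced to a `(year, revenue, has_sequel)` tuple; that layout and the helper `box_office_share` are illustrative and are not the interface of `get_box_office_ratio`.

```python
from collections import defaultdict


def box_office_share(movies):
    """Return {year: percentage of that year's revenue earned by movies with sequels}.

    movies: iterable of (year, revenue, has_sequel) tuples.
    """
    total = defaultdict(float)
    sequel = defaultdict(float)
    for year, revenue, has_sequel in movies:
        total[year] += revenue
        if has_sequel:
            sequel[year] += revenue
    # Years with no recorded revenue are skipped to avoid dividing by zero.
    return {year: 100 * sequel[year] / total[year]
            for year in sorted(total) if total[year] > 0}


movies = [(2001, 300.0, True), (2001, 100.0, False), (2002, 50.0, False)]
share = box_office_share(movies)
print(share)
```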

In [None]:
from src.models.box_office_revenue import get_box_office_ratio

fig = get_box_office_ratio(movie_frames_concat)
fig

The next cell computes and plots the average inflation-adjusted box office revenue per year, both for all movies and for movies with sequels.
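
To make "inflation-adjusted average" concrete, here is a sketch under stated assumptions. The price index below is a toy table, not real CPI data, and `adjust_for_inflation` is a hypothetical stand-in; the `inflate` helper used in this notebook may rely on a different index and base year.

```python
from statistics import mean

# Toy price index, NOT real CPI data: prices double between 1980 and 2020.
TOY_PRICE_INDEX = {1980: 100.0, 2000: 150.0, 2020: 200.0}


def adjust_for_inflation(amount, year, index=TOY_PRICE_INDEX, base_year=2020):
    """Express `amount` earned in `year` in `base_year` money."""
    if amount is None or year not in index:
        return None  # missing revenue or unknown year: no adjusted value
    return amount * index[base_year] / index[year]


def average_adjusted_revenue(movies):
    """Return {year: mean adjusted revenue} for (year, revenue) pairs."""
    by_year = {}
    for year, revenue in movies:
        adjusted = adjust_for_inflation(revenue, year)
        if adjusted is not None:
            by_year.setdefault(year, []).append(adjusted)
    return {year: mean(values) for year, values in sorted(by_year.items())}


averages = average_adjusted_revenue(
    [(1980, 100.0), (1980, 300.0), (2020, 50.0), (2020, None)]
)
print(averages)
```

Running the same function once on all movies and once on the movies with sequels gives the two curves being compared.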

In [None]:
from src.models.box_office_revenue import get_average_box_office_revenue

fig = get_average_box_office_revenue(movie_frames_concat)
fig


#### 3.1.2 Box office revenue for sequels compared to the first movie of the collection

These plots compare the box office revenue of the first movie in a collection with that of its sequels. The x-axis represents the collection, and the y-axis is the box office revenue. The left plot shows the box office revenue of every movie in the series after the first one; the right plot compares the average box office revenue of the sequels with that of the first movie. Sequels that improve on the first movie are linked in green, and sequels that decline are linked in red.

compare_first_sequel splits the movies into first movies and sequels, calculates the box office revenue for each movie, and then plots the data.
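
The split into first movie and sequels can be sketched as below. The helper `compare_first_to_sequels` is hypothetical and assumes the earliest release is the first movie; the revenues in the example are the three Lord of the Rings rows displayed later in this notebook.

```python
from statistics import mean


def compare_first_to_sequels(collection):
    """Compare the first movie of a collection with the average of its sequels.

    collection: list of (release_year, revenue) tuples for one collection.
    Returns None when the collection has no sequel.
    """
    ordered = sorted(collection)  # earliest release first
    if len(ordered) < 2:
        return None
    first_revenue = ordered[0][1]
    sequel_average = mean(revenue for _, revenue in ordered[1:])
    return {
        "first": first_revenue,
        "sequel_average": sequel_average,
        "improved": sequel_average > first_revenue,  # green link if True, red otherwise
    }


lotr = [(2002, 926047100.0), (2001, 871530300.0), (2003, 1119930000.0)]
result = compare_first_to_sequels(lotr)
print(result)
```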


In [None]:
from src.models.box_office_revenue import compare_first_sequel

fig_plt, fig2 = compare_first_sequel(movie_frames_concat)
fig_plt.show()
fig2.show()



These plots highlight where the first movie outperforms the sequel (red lines) and vice versa (green lines). The second plot also includes a yellow horizontal line showing the average box office revenue of all movies in the dataset.

The log scale on the y-axis is used to better visualize large differences in revenue, especially when there are very high values.

## 4. Number of movies in a collection

This plot compares the budget and the box office revenue of each collection. The x-axis is the budget and the y-axis is the revenue. The size of each circle is proportional to the number of movies in the collection.
get_budget_vs_revenue first computes the box office revenue and budget of each movie in a collection, then sums them into totals for the collection, and finally plots the data. The budget is our first use of the extended data, which was not in the original database but was obtained from the TMDb API.
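
The aggregation per collection can be sketched as follows. The helper `collection_totals` is hypothetical, and skipping movies with a missing or zero budget or revenue is an assumption about the cleaning, not a documented behaviour of `get_budget_vs_revenue`. The two Underworld rows use the values shown in the table further down.

```python
from collections import defaultdict


def collection_totals(movies):
    """Sum budget and revenue per collection and count the movies used.

    movies: iterable of (collection, budget, revenue) tuples.
    """
    totals = defaultdict(lambda: {"budget": 0, "revenue": 0, "n_movies": 0})
    for collection, budget, revenue in movies:
        if not budget or not revenue:
            continue  # assumed rule: ignore movies without usable financial data
        entry = totals[collection]
        entry["budget"] += budget
        entry["revenue"] += revenue
        entry["n_movies"] += 1  # drives the circle size in the plot
    return dict(totals)


movies = [
    ("Underworld Collection", 22_000_000, 95_708_457),
    ("Underworld Collection", 50_000_000, 111_476_513),
    ("Hypothetical Collection", 0, 1_000_000),  # made-up row with a missing budget
]
totals = collection_totals(movies)
print(totals)
```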



In [None]:
from src.models.collection_analysis import get_budget_vs_revenue
import seaborn as sns

fig = get_budget_vs_revenue(movie_frames_concat, ["data/sequels/sequels_1880_2010_extended.csv", "data/sequels/sequels_2010_2024_extended.csv"])
fig.show()


## 5. Time between sequels

This graph displays the time elapsed between sequels in a collection. The x-axis represents the number of years between sequels, and the y-axis is the collection. The size of each circle is proportional to the box office revenue of the movie. The graph helps identify patterns in the time between sequels and the revenue generated by each sequel.

get_time_between_sequels first separates the first movie from the following sequels, then builds a dataframe with the movies of each collection, their release dates, and their box office revenue. It then calculates the time between sequels and draws the graph.
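
The gap computation can be sketched as below. The helper `years_between_releases` is hypothetical, and dividing by 365.25 days is one possible convention; `get_time_between_sequels` may measure the gap differently. The dates are the TMDb `release_date` values of the Lord of the Rings rows shown in the next cell.

```python
from datetime import date


def years_between_releases(release_dates):
    """Return the gap, in years, between each pair of consecutive releases."""
    ordered = sorted(release_dates)
    return [round((later - earlier).days / 365.25, 2)
            for earlier, later in zip(ordered, ordered[1:])]


lotr_dates = [date(2002, 12, 18), date(2001, 12, 18), date(2003, 12, 17)]
gaps = years_between_releases(lotr_dates)
print(gaps)
```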

In [None]:
movie_frames_old.movie_df_sequel_original[movie_frames_old.movie_df_sequel_original["collection"] == "The Lord of the Rings Collection"]

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres,Unnamed: 0.2,id,release_date,original_title,title,collection,collection_id,release year
127,173944,/m/017gm7,The Lord of the Rings: The Two Towers,2002-12-05,926047100.0,179.0,"{""/m/05p2d"": ""Old English language"", ""/m/02h40...","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0hj3n2k"": ""Fantasy Adventure"", ""/m/03k9fj...",58,121,2002-12-18,The Lord of the Rings: The Two Towers,The Lord of the Rings: The Two Towers,The Lord of the Rings Collection,119,2002.0
1190,173941,/m/017gl1,The Lord of the Rings: The Fellowship of the Ring,2001-12-10,871530300.0,178.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0hj3n2k"": ""Fantasy Adventure"", ""/m/03k9fj...",57,120,2001-12-18,The Lord of the Rings: The Fellowship of the Ring,The Lord of the Rings: The Fellowship of the Ring,The Lord of the Rings Collection,119,2001.0
1209,174251,/m/017jd9,The Lord of the Rings: The Return of the King,2003-12-17,1119930000.0,250.0,"{""/m/05p2d"": ""Old English language"", ""/m/02h40...","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0hj3n2k"": ""Fantasy Adventure"", ""/m/03k9fj...",59,122,2003-12-17,The Lord of the Rings: The Return of the King,The Lord of the Rings: The Return of the King,The Lord of the Rings Collection,119,2003.0


In [None]:
from src.models.collection_analysis import get_time_between_sequels

fig = get_time_between_sequels(movie_frames_concat)
fig.show()

## Other lines of enquiry:

- Highest grossing series
- By genre
- Find studios that do a lot of sequels
- Is there a correlation between the box office revenue of the first movie and the sequels
- Add more box office revenue data and get movie budget data
- ...

In [1]:
import plotly.express as px
from ipywidgets import interact, widgets
import pandas as pd

# Load the main data
#df = movie_frames_new.movie_df_sequel_original
#df['Movie release date'] = pd.to_datetime(df['Movie release date'])

# Load the extended data (with budgets) and concatenate both periods
df1_part1 = pd.read_csv("data/collections/sequels_and_original_1880_2010_extended.csv")
df1_part2 = pd.read_csv("data/collections/sequels_and_original_2010_2024_extended.csv")
df1 = pd.concat([df1_part1, df1_part2], ignore_index=True)
df1['release_date'] = pd.to_datetime(df1['release_date'])
# Sort by collection and release date
df1 = df1.sort_values(by=['collection', 'release_date'])
df1 = df1.drop_duplicates(subset=["title"], keep="first")
# Number each film within its collection
df1['Numéro'] = df1.groupby('collection').cumcount() + 1

# Add the budgets (merge with df1)
#df = pd.merge(df, df1[['id', 'budget','revenue','vote_average']], on='id', how='left')
#print(df[df['collection'].str.contains("Harry Potter", na=False)])
# Keep only the films with a valid budget, revenue, and rating

df1 = df1[(df1['budget'].notna()) & (df1['budget'] != 0) &
        (df1['revenue'].notna()) & (df1['revenue'] != 0) &
        (df1['vote_average'].notna()) & (df1['vote_average'] != 0)]
df = df1[df1['Numéro'] <= 5]



# Keep only the collections that have at least 5 films
df = df.groupby('collection').filter(lambda group: len(group) >= 5)

# Add the columns used for the comparisons
df["revenue_previous"] = df.groupby("collection")["revenue"].shift(1)
df["vote_previous"] = df.groupby("collection")["vote_average"].shift(1)

# Set the marker colors according to the chosen criterion
def set_colors(comparison):
    if comparison == "revenue":
        df["Couleur"] = df.apply(lambda row: "Bleu" if row["revenue"] < row["revenue_previous"]
                                 else ("Rouge" if pd.notna(row["revenue_previous"]) else "Gris"), axis=1)
    elif comparison == "rating":
        df["Couleur"] = df.apply(lambda row: "Bleu" if row["vote_average"] < row["vote_previous"]
                                 else ("Rouge" if pd.notna(row["vote_previous"]) else "Gris"), axis=1)

# Build the figure
def build_figure(num_film, comparison="revenue"):
    set_colors(comparison)  # Apply the colors dynamically
    filtered_data = df[df["Numéro"] == num_film]

    if filtered_data.empty:
        print(f"No film found for number {num_film}")
        return None

    fig = px.scatter(
        filtered_data,
        x="budget",
        y="vote_average",
        size="revenue",
        color="Couleur",
        hover_name="title",
        title=f"Analysis of film number {num_film} ({comparison})",
        labels={"budget": "Budget ($)", "id": "Film ID"},
        range_y=[df['vote_average'].min(), df['vote_average'].max()],
        color_discrete_map={"Rouge": "red", "Bleu": "blue", "Gris": "grey"},
        log_x=True
    )
    return fig

# Entry point for the interactive widgets
def plot_interactive(num_film, comparison):
    fig = build_figure(num_film, comparison)
    if fig is not None:  # build_figure returns None when no film matches
        fig.show()

# Add the interactive slider and dropdown
max_num_film = df["Numéro"].max()

slider_num_film = widgets.IntSlider(min=1, max=max_num_film, step=1, value=1, description="Film number:")
comparison_dropdown = widgets.Dropdown(
    options=["revenue", "rating"],
    value="revenue",
    description="Compare by:"
)

# Display the widgets
interact(plot_interactive, num_film=slider_num_film, comparison=comparison_dropdown)


In [3]:
import plotly.express as px
from ipywidgets import interact, widgets
import ast

# Add the columns used for the comparisons
df["revenue_previous"] = df.groupby("collection")["revenue"].shift(1)
df["vote_previous"] = df.groupby("collection")["vote_average"].shift(1)

def extract_genres(genres, first_only=False):
    try:
        # Convert the string into a list of dictionaries
        genres_list = ast.literal_eval(genres)
        if not genres_list:
            return None
        if first_only:
            return genres_list[0]['name']
        clean_genres = [genre['name'] for genre in genres_list if 'name' in genre]
        return ", ".join(clean_genres)
    except (ValueError, SyntaxError, TypeError):
        return None


# The first film of a collection determines the genre of the whole collection,
# so that each collection has a single genre
df['main_genre'] = df['genres'].apply(lambda x: extract_genres(x, first_only=True))

# Build a dictionary mapping each collection to the genre of its film number 1
genre_dict = df[df['Numéro'] == 1].set_index('collection')['main_genre'].to_dict()
# Apply it, so that every film inherits the genre of the first film of its collection
df['main_genre'] = df['collection'].map(genre_dict)

# Return on investment weighted by the average rating
df["Ratio_Revenu_Note"] = ((df["revenue"] - df["budget"]) / df["budget"]) * df["vote_average"]


# Sort by main_genre and film number
df = df.sort_values(by=["main_genre", "Numéro"])
# Aggregate: mean of Ratio_Revenu_Note per film number and genre
df_grouped = df.groupby(["Numéro", "main_genre"], as_index=False).agg({
    "Ratio_Revenu_Note": "mean"
})

# Draw one line per genre
fig = px.line(
    df_grouped,
    x="Numéro",
    y="Ratio_Revenu_Note",
    color="main_genre",
    markers=True,
    title="Evolution of the Mean Revenue/Rating Ratio by Genre",
    labels={"Numéro": "Film number", "Ratio_Revenu_Note": "Revenue/rating ratio"}
)

# Customize the chart
fig.update_traces(line=dict(width=2))
fig.update_layout(
    xaxis=dict(tickmode="linear", dtick=1),
    yaxis=dict(title="Mean revenue/rating ratio"),
    legend_title="main_genre",
    template="plotly_white"
)

fig.show()


In [4]:
import plotly.express as px

# Step 1: compute the growth of the chosen metric
def calculate_growth(df, metric):
    """
    Compute the percentage change of a metric within each collection.

    Adds the growth column to df and returns the rows where it is defined,
    together with the name of that column.
    """
    growth_column = f"{metric}_growth"
    df[growth_column] = df.groupby("collection")[metric].pct_change() * 100
    return df.dropna(subset=[growth_column]), growth_column

# Step 2: compute the change for the chosen metric
metric = "Ratio_Revenu_Note"  # Switch to "vote_average" for the ratings
df_growth, growth_column = calculate_growth(df, metric)

# Step 3: average the change per film number and genre
df_growth_grouped = df_growth.groupby(["Numéro", "main_genre"], as_index=False).agg({
    growth_column: "mean"
})

# Step 4: plot the change
fig = px.line(
    df_growth_grouped,
    x="Numéro",
    y=growth_column,
    color="main_genre",
    markers=True,
    title=f"Mean Change of {metric} by Film Number and Genre",
    labels={"Numéro": "Film number", growth_column: "Mean change (%)"}
)

# Step 5: improve the display
fig.update_traces(
    line=dict(width=2),
    marker=dict(size=8)
)
fig.update_layout(
    xaxis=dict(title="Film number", tickmode="linear", dtick=1),
    yaxis=dict(title="Mean change (%)", tickformat=".2f"),
    template="plotly_white",
    legend_title="Genre",
    title_x=0.5  # Center the title
)

# Add a horizontal line marking 0%
fig.add_hline(
    y=0,
    line_dash="dot",
    line_color="red",
    annotation_text="Zero growth",
    annotation_position="top left"
)

# Display the chart
fig.show()


In [5]:
import plotly.express as px

# Step 1: extract the budget of the first film of each collection
first_movie_budget = df[df['Numéro'] == 1].set_index('collection')['budget'].to_dict()

# Step 2: group by collection to get the total revenue
roi_by_collection = df.groupby("collection", as_index=False).agg(
    total_revenue=("revenue", "sum")
)

# Attach the budget of the first film to each collection
roi_by_collection['first_movie_budget'] = roi_by_collection['collection'].map(first_movie_budget)

# Step 3: compute the ROI relative to the budget of the first film
roi_by_collection['ROI'] = (roi_by_collection['total_revenue'] - roi_by_collection['first_movie_budget']) / roi_by_collection['first_movie_budget']

# Add the main genre, taken from the first film
genre_dict = df[df['Numéro'] == 1].set_index('collection')['main_genre'].to_dict()
roi_by_collection['main_genre'] = roi_by_collection['collection'].map(genre_dict)

# Step 4: visualize with Plotly
fig = px.bar(
    roi_by_collection,
    x="ROI",
    y="collection",
    color="main_genre",
    orientation='h',
    title="Profitability of Collections by Genre (Based on the First Film)",
    labels={"ROI": "Return on investment (ROI)", "collection": "Collection", "main_genre": "Genre"},
    hover_data=["total_revenue", "first_movie_budget"]
)

# Add a vertical line marking the break-even point
fig.add_vline(x=0, line_dash="dot", line_color="red", annotation_text="Break-even point")
fig.update_traces(
    textposition='inside',  # Show the values inside the bars
    textangle=0,  # Keep the text horizontal (0° angle)
    insidetextanchor="middle",  # Center the text inside the bars
    marker=dict(line=dict(width=0.5)),  # Thin bar outlines
    selector=dict(type='bar')
)
# Customize the chart
fig.update_layout(
    height=20 * len(roi_by_collection),  # Scale the height with the number of collections
    xaxis=dict(title="Return on investment (ROI)", tickformat=".0%"),
    yaxis=dict(title="Collection"),
    template="plotly_white"
)

# Display the chart
fig.show()

# Print the collections sorted by ROI
print(roi_by_collection.sort_values(by='ROI', ascending=False))


                                  collection  total_revenue  \
15            Paranormal Activity Collection      811605295   
12                        Mad Max Collection      713465182   
25        Texas Chainsaw Massacre Collection      142186000   
20                            Saw Collection      669753364   
10                     James Bond Collection      515600000   
9                       Insidious Collection      744379972   
5                       Halloween Collection      139605426   
4                 Friday the 13th Collection      173079579   
23                      Star Wars Collection     3460213893   
31                 The Terminator Collection     1845327738   
30                      The Purge Collection      533895379   
8                   Indiana Jones Collection     2367696867   
0       A Nightmare on Elm Street Collection      203335206   
11                  Jurassic Park Collection     4889523548   
18                 Police Academy Collection      34070

In [6]:
import pandas as pd
import plotly.express as px

# Step 1: compute the cumulative ROI of each collection at each step (film number)
df['first_movie_budget'] = df.groupby('collection')['budget'].transform('first')  # Budget of the first film
df['cumulative_revenue'] = df.groupby('collection')['revenue'].cumsum()          # Cumulative revenue

# Progressive ROI, rounded to the nearest unit
df['ROI_progressif'] = ((df['cumulative_revenue'] - df['first_movie_budget']) / df['first_movie_budget']).round(0)

# Step 2: build a grouped DataFrame for the animation
df_race = df.groupby(['Numéro', 'collection', 'main_genre'], as_index=False).agg({
    'ROI_progressif': 'last'  # Last cumulative ROI value for each step
})
df_race['collection_clean'] = df_race['collection'].str.replace(r'\s*Collection$', '', regex=True)

# Step 3: build a race chart with Plotly
fig = px.bar(
    df_race,
    x="ROI_progressif",
    y="collection_clean",
    color="main_genre",
    animation_frame="Numéro",  # One animation frame per film number
    orientation='h',
    title="Progressive ROI by Collection and Genre",
    labels={"ROI_progressif": "Return on investment (ROI)", "collection_clean": "Collection", "main_genre": "Genre"},
    text="ROI_progressif"  # Show the rounded value on each bar
)

# Customize the chart for a smoother animation
fig.update_layout(
    xaxis=dict(title="Return on investment (ROI)", type="log"),  # Logarithmic scale
    yaxis=dict(title="Collection", categoryorder='total ascending'),
    transition={"duration": 1200, "easing": "cubic-in-out"},  # Slower, smoother animation
    margin=dict(l=250, r=50, t=50, b=50),  # Fixed margins to avoid layout shifts
    height=800,
    updatemenus=[dict(type="buttons", showactive=False,
                      buttons=[dict(label="▶ Play",
                                    method="animate",
                                    args=[None, {"frame": {"duration": 1500, "redraw": True},
                                                 "fromcurrent": True,
                                                 "mode": "immediate"}]),
                               dict(label="⏸ Pause",
                                    method="animate",
                                    args=[[None], {"frame": {"duration": 0, "redraw": False},
                                                   "mode": "immediate"}])])]
)

# Style the bars
fig.update_traces(
    textposition='inside',  # Show the values inside the bars
    textangle=0,  # Keep the text horizontal (0° angle)
    insidetextanchor="middle",  # Center the text inside the bars
    marker=dict(line=dict(width=0.5)),  # Thin bar outlines
    selector=dict(type='bar')
)

# Display the chart
fig.show()


In [7]:
import pandas as pd
import plotly.graph_objects as go
import numpy as np
import plotly.io as pio

pio.templates.default = "simple_white"

# Step 1: attach the budget of the first film to each collection
df['first_movie_budget'] = df.groupby('collection')['budget'].transform('first')

# Step 2: recompute the cumulative revenue per collection
df['cumulative_revenue'] = df.groupby('collection')['revenue'].cumsum()

# Step 3: compute the progressive ROI and round it
df['ROI_progressif'] = ((df['cumulative_revenue'] - df['first_movie_budget']) / df['first_movie_budget']).round(1)

# Step 4: prepare the data for the race chart
# main_genre is a grouping key, so it is kept in the result without being aggregated
df_race = df.groupby(['Numéro', 'collection', 'main_genre'], as_index=False).agg({
    'ROI_progressif': 'last'
})
df_race['collection_clean'] = df_race['collection'].str.replace(r'\s*Collection$', '', regex=True)

# Sort by film number, then by decreasing ROI
df_race = df_race.sort_values(['Numéro', 'ROI_progressif'], ascending=[True, False])

# Step 5: define a fixed color for each genre
fixed_colors = {
    'Horror': 'darkblue',
    'Comedy': 'yellow',
    'Action': 'green',
    'Romance': 'pink',
    'Drama': 'blue',
    'Fantasy': 'purple',
    'Science Fiction': 'lightblue',
    'Adventure': 'blue',
    'Thriller': 'brown',
    'Animation': 'cyan',
    'Other': 'gray'  # Default color for genres that are not listed
}

# Genres without a fixed color are relabeled "Other"
df_race['main_genre'] = df_race['main_genre'].apply(lambda x: x if x in fixed_colors else 'Other')

# Step 6: helper that keeps the 15 collections with the highest ROI for a given film number
def get_top_10(data, num):
    filtered = data[data['Numéro'] == num]
    return filtered.nlargest(15, 'ROI_progressif')

# Animation frames
# ROI_progressif is a multiple of the first film's budget, not a percentage, hence the "x" suffix
frames = []
for num in sorted(df_race['Numéro'].unique()):
    top10 = get_top_10(df_race, num)
    frames.append(go.Frame(
        data=[go.Bar(x=top10['collection_clean'], y=top10['ROI_progressif'],
                     marker_color=[fixed_colors[genre] for genre in top10['main_genre']],
                     text=top10['ROI_progressif'].astype(str) + "x",
                     textposition='outside', cliponaxis=False)],
        layout=go.Layout(title_text=f"Progressive ROI - Film {num}")
    ))

# Initial data
initial_data = get_top_10(df_race, df_race['Numéro'].min())

# Build the chart
fig = go.Figure(
    data=[go.Bar(x=initial_data['collection_clean'], y=initial_data['ROI_progressif'],
                 marker_color=[fixed_colors[genre] for genre in initial_data['main_genre']],
                 text=initial_data['ROI_progressif'].astype(str) + "x",
                 textposition='outside', cliponaxis=False)],
    layout=go.Layout(
        title="Progressive ROI by Collection and Genre",
        font=dict(size=20),
        height=800,
        xaxis=dict(title="Collection", showline=False, tickangle=-90),
        yaxis=dict(title="Return on investment (ROI)", type="log", range=[0,4], showline=False),
        updatemenus=[dict(
            type="buttons",
            x=0.85, y=1,  # Top-right position
            showactive=False,
            buttons=[
                dict(label="Play",
                     method="animate",
                     args=[None, {"frame": {"duration": 1000, "redraw": True},
                                  "fromcurrent": True}]),
                dict(label="Pause",
                     method="animate",
                     args=[[None], {"frame": {"duration": 0, "redraw": False},
                                    "mode": "immediate"}])
            ]
        )],
        showlegend=False  # Hide the legend entirely
    ),
    frames=frames
)

# Display the chart
fig.show()



In [None]:

# Check for duplicates in the Underworld collection
df_underworld = df[df['collection'] == 'Underworld Collection']
df_underworld.head(5)



Unnamed: 0,id,release_date,original_title,title,collection,collection_id,budget,revenue,runtime,popularity,...,Numéro,revenue_previous,vote_previous,Couleur,main_genre,Ratio_Revenu_Note,Ratio_Revenu_Note_growth,first_movie_budget,cumulative_revenue,ROI_progressif
2052,277,2003-09-19,Underworld,Underworld,Underworld Collection,2326,22000000,95708457,122,45.591,...,1,,,Gris,Fantasy,22.782614,,22000000,95708457,3.4
2054,834,2006-01-12,Underworld: Evolution,Underworld: Evolution,Underworld Collection,2326,50000000,111476513,106,28.182,...,2,95708457.0,6.8,Rouge,Fantasy,8.1149,-64.381174,22000000,207184970,8.4
2055,12437,2009-01-22,Underworld: Rise of the Lycans,Underworld: Rise of the Lycans,Underworld Collection,2326,35000000,92158961,92,34.467,...,3,111476513.0,6.6,Bleu,Fantasy,10.615236,30.811667,22000000,299343931,12.6
2053,52520,2012-01-19,Underworld: Awakening,Underworld: Awakening,Underworld Collection,2326,70000000,160112671,88,33.039,...,4,92158961.0,6.5,Rouge,Fantasy,8.11014,-23.599054,22000000,459456602,19.9
2056,346672,2016-11-24,Underworld: Blood Wars,Underworld: Blood Wars,Underworld Collection,2326,35000000,81093313,91,32.246,...,5,160112671.0,6.3,Bleu,Fantasy,7.770016,-4.193821,22000000,540549915,23.6


In [None]:
import pandas as pd
import plotly.express as px
from sklearn.preprocessing import StandardScaler

# Load the data
df = pd.read_csv("data/collections/sequels_and_original_2010_2024_extended.csv")
df['release_date'] = pd.to_datetime(df['release_date'])

# Clean the data: keep only films with a positive budget, revenue and average rating
df = df[(df['budget'].notna()) & (df['budget'] > 0) &
        (df['revenue'].notna()) & (df['revenue'] > 0) &
        (df['vote_average'].notna()) & (df['vote_average'] > 0)]

# Number each film within its collection, in release order
df = df.sort_values(by=['collection', 'release_date'])
df['Numéro'] = df.groupby('collection').cumcount() + 1

# Compute ROI
df['ROI'] = (df['revenue'] - df['budget']) / df['budget']

# Compute an overall score per film (equal weighting of ROI and average rating).
# Note: the two terms are not on a common scale (ROI is unbounded above),
# so films with a very high ROI dominate this score.
df['global_score'] = df['ROI'] * 0.5 + df['vote_average'] * 0.5

# Identify the best film in each collection
best_films = df.loc[df.groupby('collection')['global_score'].idxmax()]

# Plot the best film of each collection
fig = px.bar(
    best_films,
    x="global_score",
    y="collection",
    orientation='h',
    color="title",
    text="title",
    title="Top Film per Collection (Best Results)",
    labels={"global_score": "Overall Score", "collection": "Collection", "main_genre": "Genre"},
    hover_data=["ROI", "vote_average", "budget", "revenue"]
)

# Customize the layout
fig.update_layout(
    xaxis=dict(title="Overall Score"),
    yaxis=dict(title="Collection", categoryorder='total ascending'),
    height=800
)

fig.show()

# Print the best films
print("Best films per collection:")
print(best_films[['collection', 'title', 'Numéro', 'ROI', 'vote_average', 'global_score']])


  from pandas.core.computation.check import NUMEXPR_INSTALLED


Best films per collection:
                                collection                           title  \
303   101 Dalmatians (Animated) Collection  One Hundred and One Dalmatians   
390                   12 Rounds Collection                       12 Rounds   
902              47 Meters Down Collection                  47 Meters Down   
1050            A Dog's Purpose Collection                 A Dog's Purpose   
345   A Nightmare on Elm Street Collection       A Nightmare on Elm Street   
...                                    ...                             ...   
108                    Yu-Gi-Oh Collection             Yu-Gi-Oh! The Movie   
1227                 Zombieland Collection                      Zombieland   
471                   Zoolander Collection                       Zoolander   
697                       [REC] Collection                           [REC]   
1340                        xXx Collection      xXx: Return of Xander Cage   

      Numéro        ROI  vote_
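
The score in the cell above averages ROI and the average rating directly, although the two are measured on different scales. A minimal sketch of one alternative is to standardize both columns (z-scores) before weighting them. The small frame below uses made-up values for illustration only, and the names `toy`, `zscore`, `global_score_raw` and `global_score_std` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "ROI": [0.5, 3.0, 40.0, 1.5],
    "vote_average": [7.8, 6.5, 5.9, 7.1],
})


def zscore(series: pd.Series) -> pd.Series:
    """Center a column on 0 and scale it to unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


# Score as computed in the notebook: raw ROI and raw rating, equal weights.
toy["global_score_raw"] = 0.5 * toy["ROI"] + 0.5 * toy["vote_average"]

# Alternative: equal weights applied to standardized columns.
toy["global_score_std"] = 0.5 * zscore(toy["ROI"]) + 0.5 * zscore(toy["vote_average"])

print(toy.sort_values("global_score_std", ascending=False))
```

With standardized inputs, a single extreme ROI no longer decides the ranking on its own, so the film ranked first can differ between the two scores. Whether equal weights are appropriate remains a modelling choice.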

In [1]:
import pandas as pd
import plotly.express as px
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: Load and prepare the data
df = pd.read_csv("data/collections/sequels_and_original_2010_2024_extended.csv")
df['release_date'] = pd.to_datetime(df['release_date'])

# Clean the data: keep only films with a positive budget, revenue and average rating
df = df[(df['budget'].notna()) & (df['budget'] > 0) &
        (df['revenue'].notna()) & (df['revenue'] > 0) &
        (df['vote_average'].notna()) & (df['vote_average'] > 0)]

# Number each film within its collection, in release order
df = df.sort_values(by=['collection', 'release_date'])
df['Numéro'] = df.groupby('collection').cumcount() + 1

# Compute the metrics used in the analysis
df['ROI'] = (df['revenue'] - df['budget']) / df['budget']
df['note_growth'] = df.groupby('collection')['vote_average'].diff()
df['roi_growth'] = df.groupby('collection')['ROI'].diff()

# Identify the last film of each collection (relies on the sort order above)
df['is_last_film'] = df['collection'] != df['collection'].shift(-1)
df_last_films = df[df['is_last_film']]

# Step 2: Prepare the data for the model
# Note: 'ROI' (and 'budget'/'revenue', which determine it) is also used to define
# the target below, so the model can largely recover the label from its inputs.
features = ['ROI', 'note_growth', 'roi_growth', 'budget', 'revenue']
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
df_last_films = df_last_films.dropna(subset=features).copy()

# Standardize the features (the scaler is fitted on all rows, before the split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_last_films[features])

# Add the target column: a film is a success if ROI > threshold and rating > 6
success_threshold = 0.5
df_last_films['success'] = (df_last_films['ROI'] > success_threshold) & (df_last_films['vote_average'] > 6)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, df_last_films['success'], test_size=0.3, random_state=42)

# Step 3: Train an XGBoost model
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    objective='binary:logistic',  # Outputs class probabilities
    random_state=42
)
model.fit(X_train, y_train)

# Predicted success probability for every last film (training rows included)
df_last_films['success_probability'] = model.predict_proba(X_scaled)[:, 1]

# Safety clip; predict_proba already returns values in [0, 1]
df_last_films['success_probability'] = df_last_films['success_probability'].clip(lower=0, upper=1)

# Step 4: Rank the collections by success probability
top_collections = df_last_films.sort_values(by=['success_probability'], ascending=False).head(10)

# Step 5: Visualize with a scatter plot
fig = px.scatter(
    df_last_films,  # Use only the last film of each collection
    x="ROI",
    y="vote_average",
    size="budget",  # Marker size based on budget
    color="success_probability",  # Color based on the predicted probability
    hover_name="collection",
    title="Success probabilities for the last film of each collection",
    labels={"ROI": "Return on Investment", "vote_average": "Average Rating", "success_probability": "Success Probability"},
    color_continuous_scale="Viridis",  # Continuous color scale
    log_x=True  # Films with ROI <= 0 cannot be shown on a log axis
)

# Add a line marking the success threshold
fig.add_vline(x=success_threshold, line_dash="dot", line_color="red", annotation_text="Success threshold")

# Set the displayed ROI range (x axis)
fig.update_layout(
    height=600,
    width=800,
    xaxis=dict(range=[-1, 2])  # With log_x=True the range is in log10 units: ROI from 0.1 to 100
)

fig.show()

# Step 6: Print the collections recommended for a new film
print("Collections recommended for a new film:")
print(top_collections[['collection', 'title', 'success_probability', 'ROI', 'vote_average']])
top_collections.head(10)


  from pandas.core.computation.check import NUMEXPR_INSTALLED


Collections recommended for a new film:
                    collection                                   title  \
1617        Jumanji Collection                 Jumanji: The Next Level   
1548     Mamma Mia! Collection             Mamma Mia! Here We Go Again   
276   Jurassic Park Collection                 Jurassic World Dominion   
691    Fifty Shades Collection                      Fifty Shades Freed   
1416       Deadpool Collection                    Deadpool & Wolverine   
929          Avatar Collection                Avatar: The Way of Water   
1002      John Wick Collection                    John Wick: Chapter 4   
1511  The Conjuring Collection  The Conjuring: The Devil Made Me Do It   
368      Madagascar Collection      Madagascar 3: Europe's Most Wanted   
666   Bridget Jones Collection                    Bridget Jones's Baby   

      success_probability       ROI  vote_average  
1617             0.998333  5.413551         6.905  
1548             0.998159  4.2747

Unnamed: 0,id,release_date,original_title,title,collection,collection_id,budget,revenue,runtime,popularity,...,imdb_id,production_countries,original_language,Numéro,ROI,note_growth,roi_growth,is_last_film,success,success_probability
1617,512200,2019-12-04,Jumanji: The Next Level,Jumanji: The Next Level,Jumanji Collection,495527,125000000,801693929,123,65.679,...,tt7975244,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,3,5.413551,0.105,-4.645772,True,True,0.998333
1548,458423,2018-07-09,Mamma Mia! Here We Go Again,Mamma Mia! Here We Go Again,Mamma Mia! Collection,458558,75000000,395607854,114,22.804,...,tt6911608,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",en,2,4.274771,0.12,-6.452952,True,True,0.998159
276,507086,2022-06-01,Jurassic World Dominion,Jurassic World Dominion,Jurassic Park Collection,328,165000000,1001978080,147,89.847,...,tt8041270,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,6,5.072594,0.2,-1.636031,True,True,0.998021
691,337167,2018-01-17,Fifty Shades Freed,Fifty Shades Freed,Fifty Shades Collection,344830,55000000,371985018,105,107.491,...,tt4477536,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,3,5.763364,0.217,-0.173833,True,True,0.998015
1416,533535,2024-07-24,Deadpool & Wolverine,Deadpool & Wolverine,Deadpool Collection,448150,200000000,1338073382,128,834.328,...,tt6263850,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,3,5.690367,0.208,-0.454148,True,True,0.997978
929,76600,2022-12-14,Avatar: The Way of Water,Avatar: The Way of Water,Avatar Collection,87096,460000000,2320250281,192,135.823,...,tt1630029,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,2,4.044022,0.037,-7.29229,True,True,0.997668
1002,603692,2023-03-22,John Wick: Chapter 4,John Wick: Chapter 4,John Wick Collection,404609,90000000,440157245,170,185.797,...,tt10366206,"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",en,4,3.890636,0.253,-1.049541,True,True,0.997308
1511,423108,2021-05-25,The Conjuring: The Devil Made Me Do It,The Conjuring: The Devil Made Me Do It,The Conjuring Collection,313086,39000000,206431050,111,72.355,...,tt7069210,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,3,4.293104,0.132,-2.777189,True,True,0.997195
368,80321,2012-06-06,Madagascar 3: Europe's Most Wanted,Madagascar 3: Europe's Most Wanted,Madagascar Collection,14740,145000000,746900000,93,61.181,...,tt1277953,"[{'iso_3166_1': 'US', 'name': 'United States o...",en,3,4.151034,0.098,1.125034,True,True,0.997176
666,95610,2016-09-14,Bridget Jones's Baby,Bridget Jones's Baby,Bridget Jones Collection,8936,35000000,211952420,123,13.915,...,tt1473832,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",en,3,5.055783,0.2,-0.57239,True,True,0.997133
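
The scatter plot in the cell above combines `log_x=True` with `xaxis=dict(range=[-1, 2])`. On a log axis Plotly reads the range as base-10 exponents, so that setting displays ROI values from 0.1 to 100 rather than from -1 to 2. A minimal sketch of a helper that converts a range given in ROI units into the form a log axis expects; `log_axis_range` is a hypothetical name, not a Plotly function.

```python
import math


def log_axis_range(low: float, high: float) -> list[float]:
    """Convert a [low, high] interval in data units to log10 units.

    Both bounds must be strictly positive, because a log axis cannot
    represent zero or negative values.
    """
    if low <= 0 or high <= 0:
        raise ValueError("log axes only support strictly positive bounds")
    if low >= high:
        raise ValueError("low must be smaller than high")
    return [math.log10(low), math.log10(high)]


# Example: show ROI values between 0.1 and 100 on a log axis.
print(log_axis_range(0.1, 100))
```

The result can be passed as `xaxis=dict(range=log_axis_range(0.1, 100))`, which keeps the intended bounds readable in data units.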


In [1]:
from src.models.models_ROI import plot_interactive
import pandas as pd
df1_part1 = pd.read_csv("data/collections/sequels_and_original_1880_2010_extended.csv")
df1_part2 = pd.read_csv("data/collections/sequels_and_original_2010_2024_extended.csv")
# Concatenate the two parts
df1 = pd.concat([df1_part1, df1_part2], ignore_index=True)
fig = plot_interactive(df1)




  from pandas.core.computation.check import NUMEXPR_INSTALLED


TypeError: 'NoneType' object is not iterable

In [None]:
df.head(5)


In [10]:
from src.models.models_ROI import probability_of_success

fig = probability_of_success(df1)

fig.show()
fig.write_html("results/probabsucces.html")


In [5]:
from src.models.models_ROI import generate_race_chart
fig = generate_race_chart(df1)
fig.show()
fig.write_html("results/ROIprog.html")


In [4]:
from src.models.models_ROI import export_to_html

export_to_html(df1, output_path="interactive_plot.html")

Graphique exporté avec succès : interactive_plot.html


In [13]:
from src.models.models_ROI import build_figure

fig = build_figure(df1, 1, comparison="notes")
fig.show()
fig.write_html("results/notes1.html")
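
The export cells write figures into a `results/` folder, which has to exist beforehand, and a file only opens in a browser by default if its name ends in `.html`. A minimal sketch of a helper that handles both; `html_output_path` is a hypothetical name, and the demo uses a temporary directory so that nothing is written into the project.

```python
import tempfile
from pathlib import Path


def html_output_path(name: str, directory: str = "results") -> Path:
    """Return directory/name with an .html suffix, creating the directory if needed."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    if path.suffix.lower() != ".html":
        # Append rather than replace, so a name such as "v1.2" keeps its dot.
        path = path.with_name(path.name + ".html")
    return path


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = html_output_path("notes1", directory=str(Path(tmp) / "results"))
    print(demo.name, demo.parent.is_dir())

# Possible use with a Plotly figure (assumes `fig` exists):
# fig.write_html(str(html_output_path("notes1")))
```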

