<h1 style="color: #001965; font-weight: bold; text-align: center">Marvel vs DC - Data Analysis</h1>

<p><b>Dataset description: </b>This dataset compares Marvel and DC movies and TV shows. It is described as covering titles, release years, genres, runtimes, age ratings, directors, and leading actors; the CSV file loaded in this notebook contains the columns ID, Movie, Year, Genre, RunTime, Description, and IMDB_Score. More information: <a href="https://www.kaggle.com/datasets/willianoliveiragibin/marvel-vs-dc/data">Marvel vs DC - Kaggle</a></p>

In [1]:
# Import libraries
import pandas as pd # Data processing
import numpy as np # Linear algebra
import matplotlib.pyplot as plt # Data Visualization
import seaborn as sns # Data Visualization
import plotly.express as px # Interactive charts
import plotly.io as pio # To ensure correct graph rendering
pio.renderers.default = 'iframe'
print("Setup is Complete")

Setup is Complete


In [2]:
import os
print(os.getcwd()) # Check the current working directory
print(os.listdir()) # List all files in the directory

C:\Users\anne\Documents\GitHub\Marvel vs DC
['.ipynb_checkpoints', 'DataAnalysis_Marvel_vs_DC.ipynb', 'dataset_marvel_vs_dc.csv', 'dataset_marvel_vs_dc.xlsx', 'iframe_figures', 'Marvel Vs DC NEW.csv']


<h3 style="color: #001965; font-weight: bold;">Section <span style="color: #1E90FF">1</span>: Data Collection</h3>

In [3]:
# df is DataFrame variable
df = pd.read_csv("../Marvel vs DC/Marvel Vs DC NEW.csv") # Remember to flip the direction of the slashes from '\' to '/'
pd.set_option("display.max_columns", None)
df.head(3)

Unnamed: 0,ID,Movie,Year,Genre,RunTime,Description,IMDB_Score
0,0,Eternals,-2021,"Action,Adventure,Drama",0,"The saga of the Eternals, a race of immortal b...",0.0
1,1,Loki,(2021– ),"Action,Adventure,Fantasy",0,A new Marvel chapter with Loki at its center.,0.0
2,2,The Falcon and the Winter Soldier,-2021,"Action,Adventure,Drama",50 min,"Following the events of 'Avengers: Endgame,' S...",7.5


In [4]:
# DataFrame Summary
# Option 1
#def dataFrame_summary(df):
#    summary = {
#        "N° of rows: ": df.shape[0], # df.size would return rows * columns
#        "N° of columns: ": df.shape[1],
#        "Column Labels: ": df.columns.tolist(),
#        "Data Types: ": df.dtypes
#    }
#    return summary
#print(f"{dataFrame_summary(df)} \n")

# Option 2
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690 entries, 0 to 1689
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           1690 non-null   int64  
 1   Movie        1690 non-null   object 
 2   Year         1657 non-null   object 
 3   Genre        1683 non-null   object 
 4   RunTime      1690 non-null   object 
 5   Description  1690 non-null   object 
 6   IMDB_Score   1690 non-null   float64
dtypes: float64(1), int64(1), object(5)
memory usage: 92.6+ KB


<h3 style="color: #001965; font-weight: bold;">Section <span style="color: #1E90FF">2</span>: Data CleanUp </h3>

In [5]:
df.isnull().sum() # Identifying the number of null values in each column

ID              0
Movie           0
Year           33
Genre           7
RunTime         0
Description     0
IMDB_Score      0
dtype: int64

In [6]:
df.dropna() # Preview the rows without null values; the result is not assigned, so df itself is unchanged

Unnamed: 0,ID,Movie,Year,Genre,RunTime,Description,IMDB_Score
0,0,Eternals,-2021,"Action,Adventure,Drama",0,"The saga of the Eternals, a race of immortal b...",0.0
1,1,Loki,(2021– ),"Action,Adventure,Fantasy",0,A new Marvel chapter with Loki at its center.,0.0
2,2,The Falcon and the Winter Soldier,-2021,"Action,Adventure,Drama",50 min,"Following the events of 'Avengers: Endgame,' S...",7.5
3,3,WandaVision,-2021,"Action,Comedy,Drama",350 min,Blends the style of classic sitcoms with the M...,8.1
4,4,Spider-Man: No Way Home,-2021,"Action,Adventure,Sci-Fi",0,A continuation of Spider-Man: Far From Home.,0.0
...,...,...,...,...,...,...,...
1685,1685,DC's Legends of Tomorrow,(2016– ),"Action,Adventure,Drama",42 min,"Worlds lived, worlds died. Nothing will ever b...",8.5
1686,1686,Supergirl,(2015–2021),"Action,Adventure,Drama",42 min,"In the wake of Lex Luthor's return, the show f...",8.3
1687,1687,Supergirl,(2015–2021),"Action,Adventure,Drama",42 min,Kara comes face to face with Red Daughter and ...,8.1
1688,1688,Supergirl,(2015–2021),"Action,Adventure,Drama",42 min,Kara and Lena head to Kaznia to hunt down Lex....,7.4
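
<p>The call above only displays a filtered view. As a minimal sketch on a hypothetical toy frame (the column values below are made up for illustration), this shows that <code>dropna()</code> returns a new DataFrame and that the result has to be kept in a variable for the removal to persist:</p>

```python
import pandas as pd

# Hypothetical toy frame that mimics the missing Year/Genre values above
toy = pd.DataFrame({
    "Movie": ["A", "B", "C"],
    "Year": ["(2001–2011)", None, "-2021"],
    "Genre": ["Action", "Drama", None],
})

preview = toy.dropna()          # returns a new DataFrame, toy is untouched
print(len(toy), len(preview))   # the original still has all of its rows

# Keep the result to make the removal stick; subset limits the check to one column
cleaned = toy.dropna(subset=["Year"])
print(cleaned["Movie"].tolist())
```

<p>Later in this notebook the rows without a start year are removed with <code>dropna(subset=['StartYear'])</code>, which follows the second pattern.</p>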


In [7]:
# CleanUp Year
# Goal: Extract the start and end year of each title using a regex
## Regex breakdown
# (\d{4}): capture exactly four digits [0-9] (the start year)
# (?: ... ): non-capturing group, matched but not returned as a column
# \D*: zero or more non-digit characters, like '–' or ' '
# (\d{4})?: optionally capture four more digits (the end year)
##
df[['StartYear', 'EndYear']] = df['Year'].str.extract(r'(\d{4})(?:\D*(\d{4})?)?')
# The second group is either four digits or NaN, so only the NaN case needs a placeholder
df['EndYear'] = df['EndYear'].fillna('No End Year') # Titles without an end year (single releases and ongoing series)
df[['Year', 'StartYear', 'EndYear']].sample(3)

Unnamed: 0,Year,StartYear,EndYear
593,(2001–2011),2001,2011
499,(1979– ),1979,No End Year
58,(2021– ),2021,No End Year
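
<p>To see how the pattern behaves on each shape of <code>Year</code> value, here is a small standalone sketch. The sample strings are copied from the outputs in this notebook; the <code>None</code> entry stands in for the missing values.</p>

```python
import pandas as pd

# Sample strings taken from the Year column shown above, plus one missing value
years = pd.Series(["-2021", "(2021– )", "(2001–2011)", "(1996 Video Game)", None])

pattern = r"(\d{4})(?:\D*(\d{4})?)?"
parts = years.str.extract(pattern)          # one column per capturing group
parts.columns = ["StartYear", "EndYear"]
parts["EndYear"] = parts["EndYear"].fillna("No End Year")
print(parts)
```

<p>Only the range <code>(2001–2011)</code> yields an end year. A missing <code>Year</code> leaves <code>StartYear</code> as NaN, which is why those rows are dropped later.</p>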


In [8]:
### CleanUp RunTime
# Goal: Separate the numbers from the minute indicator
df['RunTime(Min)'] = df['RunTime'].str.split(expand=True)[0].astype('float64') # split the RunTime column's values and store the first part in a new column
df[['RunTime', 'RunTime(Min)']].sample(3)

Unnamed: 0,RunTime,RunTime(Min)
1266,22 min,22.0
646,21 min,21.0
1007,22 min,22.0
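
<p>The <code>astype('float64')</code> call above works when every value starts with a number. As a hedged alternative, the sketch below extracts the leading digits and uses <code>pd.to_numeric</code> with <code>errors="coerce"</code>, so an unexpected value becomes NaN instead of raising. The <code>"N/A"</code> entry is a hypothetical value that does not appear in the outputs above.</p>

```python
import pandas as pd

# "N/A" is a made-up value used only to show the fallback behaviour
runtimes = pd.Series(["50 min", "350 min", "0", "22 min", "N/A"])

# Pull out the first run of digits, then convert; anything unparsable becomes NaN
minutes = pd.to_numeric(runtimes.str.extract(r"(\d+)")[0], errors="coerce")
print(minutes.tolist())
```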


In [9]:
# CleanUp Movie
# Goal: Identify the unique names and categorize them into Marvel or DC using the keywords
df['Movie'].unique()
# Defining Keywords
marvel_keywords = [
    "Avengers", "Black Panther", "Captain America", "Doctor Strange", "Eternals",
    "Falcon", "Guardians of the Galaxy", "Hawkeye", "Hulk", "Iron Man", "Loki",
    "Scarlet Witch", "Shang-Chi", "Spider-Man", "Thor", "WandaVision", "Ant-Man",
    "Black Widow", "Captain Marvel", "Deadpool", "X-Men", "Wolverine", "Fantastic Four",
    "Ms. Marvel", "Moon Knight", "She-Hulk", "Daredevil", "Punisher", "Jessica Jones",
    "Luke Cage", "Iron Fist", "Inhumans", "What If...?", "Mutant X", "Secret Invasion", "Blade", "Agents of S.H.I.E.L.D.",
    "Fantastic 4", "Tony Stark", "Steve Rogers", "Bruce Banner", "Natasha Romanoff",
    "Stephen Strange", "T'Challa", "Scott Lang", "Carol Danvers", "Clint Barton",
    "Wanda Maximoff", "Vision", "Thanos", "Nick Fury", "Star-Lord", "Peter Quill",
    "Gamora", "Rocket Raccoon", "Groot", "Logan", "Silver Surfer", "Matt Murdock",
    "Winter Soldier", "Bucky Barnes", "Hero", "Villain", "Superpower", "Infinity",
    "Asgard", "Shield", "Superhero", "Power", "Universe", "Battle", "Origin", "War",
    "Quantum", "Multiverse", "Technology", "Transformation", "Alliance", "Legacy",
    "Time", "Space", "Justice", "Wakanda", "Stark", "Mutant", "S.H.I.E.L.D.",
    "Gem", "Crossover", "Team", "God", "Cosmic", "Destiny", "Drax", "Venom",
    "Ghost Rider", "The Punisher", "The Eternals", "Hank Pym", "Magneto"
]
dc_keywords = [
    "Batman", "Superman", "Justice", "League", "Dark", "Knight", "Wonder Woman",
    "Suicide Squad", "Man of Steel", "Gotham", "Green Lantern", "Flash", "Arrow",
    "Titans", "Shazam", "Aquaman", "Joker", "Birds of Prey", "Vengeance", "Crisis",
    "Return", "Rise", "Legends", "Hero", "Dawn", "Tomorrow", "Gods", "Power",
    "World", "Origin", "Reign", "Son", "War", "Throne", "Watchmen", "Phantom",
    "Enemy", "Universe", "Blood", "Redemption", "Doomsday", "Apocalypse", "End",
    "Constantine", "Catwoman", "Jonah Hex", "Nightwing", "Robin", "Supergirl",
    "Zatanna", "Swamp Thing", "Blue Beetle", "Black Adam", "Vixen", "Harley Quinn",
    "Static Shock", "Hawkman", "Green Arrow", "Batgirl", "Superboy", "Darkseid",
    "Deathstroke", "Lobo", "Spectre", "The Question", "Red Hood", "Lucifer",
    "Steel", "Raven", "Huntress", "Booster Gold", "Mr. Terrific", "Black Canary",
    "The Atom", "Martian Manhunter", "Dr. Fate", "Black Lightning", "Firestorm",
    "Etrigan", "The Demon", "Orion", "Metamorpho", "Azrael", "Deadshot",
    "Solomon Grundy", "Doctor Manhattan", "Batwoman", "Doom Patrol", "Peacemaker",
    "Cyborg", "Teen Titans", 
]
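# Categorizing the movies
# Note: matching is a case-sensitive substring test and Marvel keywords are checked first,
# so generic words present in both lists (e.g. "Justice", "Hero", "War") label DC titles
# such as "Justice League Unlimited" as Marvel
def categorize_movie(name):
    for keyword in marvel_keywords:
        if keyword in name:
            return 'Marvel'
    for keyword in dc_keywords:
        if keyword in name:
            return 'DC'
    return 'Other'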

df['Category'] = df['Movie'].apply(categorize_movie) # Using the function
df.sample(3)

Unnamed: 0,ID,Movie,Year,Genre,RunTime,Description,IMDB_Score,StartYear,EndYear,RunTime(Min),Category
697,697,Batman: The Animated Series,(1992–1995),"Animation,Action,Adventure",22 min,After learning the name of an extortion ringle...,8.8,1992,1995,22.0,DC
589,589,Superman: The Mysterious Mr. Mist,(1996 Video Game),,0,a Plot,4.6,1996,No End Year,0.0,DC
858,858,Justice League Unlimited,(2004–2006),"Animation,Action,Adventure",23 min,The Justice League battle Mordru in the backgr...,8.1,2004,2006,23.0,Marvel
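
<p>The sample above labels "Justice League Unlimited" as Marvel, because "Justice" is in both keyword lists and the Marvel list is checked first. One possible way to reduce such mislabels, sketched here under assumptions, is to keep only franchise-specific names and to match them as whole words. The short lists, the <code>Ambiguous</code> label, and the helper names are hypothetical and are not part of the original notebook; "Thorn Birds" is a made-up title used to show the whole-word check.</p>

```python
import re

# Hypothetical, deliberately short lists: franchise-specific names only
MARVEL_TERMS = ["Avengers", "Spider-Man", "Thor", "Loki", "Eternals", "X-Men"]
DC_TERMS = ["Batman", "Superman", "Justice League", "Young Justice", "Supergirl"]

def build_pattern(terms):
    """Compile one regex that matches any term as a whole word, ignoring case."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)

MARVEL_RE = build_pattern(MARVEL_TERMS)
DC_RE = build_pattern(DC_TERMS)

def categorize_title(title):
    """Return 'Marvel', 'DC', 'Ambiguous' (both match) or 'Other' (none match)."""
    is_marvel = bool(MARVEL_RE.search(title))
    is_dc = bool(DC_RE.search(title))
    if is_marvel and is_dc:
        return "Ambiguous"
    if is_marvel:
        return "Marvel"
    if is_dc:
        return "DC"
    return "Other"

for title in ["Justice League Unlimited", "Young Justice", "Thor: Tales of Asgard", "Thorn Birds"]:
    print(title, "->", categorize_title(title))
```

<p>Flagging titles that match both lists, rather than letting the first list win, makes the doubtful cases visible so they can be reviewed by hand.</p>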


In [10]:
# CleanUp Genre
# Goal: Defining the main genre. In this case, the first genre was chosen as the main one
df['MainGenre'] = df['Genre'].str.split(',').str[0]
df.sample(3)

Unnamed: 0,ID,Movie,Year,Genre,RunTime,Description,IMDB_Score,StartYear,EndYear,RunTime(Min),Category,MainGenre
1258,1258,Young Justice,(2010– ),"Animation,Action,Adventure",22 min,Black Manta gives Miss Martian 24 hours to liv...,9.2,2010,No End Year,22.0,Marvel,Animation
682,682,Batman: The Animated Series,(1992–1995),"Animation,Action,Adventure",22 min,When a jealous scientist with a passion for Al...,8.2,1992,1995,22.0,DC,Animation
703,703,Batman: The Animated Series,(1992–1995),"Animation,Action,Adventure",22 min,While the Dynamic Duo race to stop Ra's Al Ghu...,7.7,1992,1995,22.0,DC,Animation


In [11]:
# Goal: Removing data that will not be needed for analysis
df = df.drop('Description', axis=1) # Removing the Description column
df = df[df['Category'] != 'Other'] # Removing the rows where the value is 'Other' in Category
df = df[df['RunTime'] != '0'] # Removing the rows where the value is 0 in RunTime
df = df.dropna(subset=['StartYear']) # Removing the rows where the value is NaN in StartYear
df.sample(3)

Unnamed: 0,ID,Movie,Year,Genre,RunTime,IMDB_Score,StartYear,EndYear,RunTime(Min),Category,MainGenre
79,79,Thor: Tales of Asgard,(2011 Video),"Animation,Action,Adventure",77 min,6.3,2011,No End Year,77.0,Marvel,Animation
683,683,Batman: The Animated Series,(1992–1995),"Animation,Action,Adventure",22 min,7.7,1992,1995,22.0,DC,Animation
721,721,Batman: The Animated Series,(1992–1995),"Animation,Action,Adventure",22 min,7.0,1992,1995,22.0,DC,Animation


<h3 style="color: #001965; font-weight: bold;">Section <span style="color: #1E90FF">3</span>: Data Exploration </h3>

In [12]:
# Count number of movies by category
movies_category = df['Category'].value_counts().reset_index()
movies_category.columns = ['Category', 'Number of Movies']

# Define colors
colors = {'DC': '#345776', 'Marvel': '#7d373b'}

# Create a bar chart
fig = px.bar(
    movies_category,
    x='Category',
    y='Number of Movies',
    title='Number of Marvel and DC Movies',
    color='Category',
    color_discrete_map=colors,
    text='Number of Movies'
)

fig.update_traces(textposition='inside', insidetextanchor='middle')
fig.update_layout(xaxis_title='Category', yaxis_title='Number of Movies')
fig.show()

In [13]:
# Count number of Movies by Main Genre
movies_main_genre = df['MainGenre'].value_counts().reset_index()
movies_main_genre.columns = ['MainGenre', 'Number of Movies']

# Create a gradient of blue colors
blues = ['#0000FF', '#1F51FF', '#4682B4', '#5F9EA0', '#6495ED', '#87CEEB', '#ADD8E6']
colors = blues[:len(movies_main_genre)]

# Create a bar chart with Plotly Express
fig = px.bar(
    movies_main_genre,
    x='MainGenre',
    y='Number of Movies',
    title='Number of Movies by Main Genre',
    text='Number of Movies',
    color='MainGenre',
    color_discrete_sequence=colors  # Gradient of blue colors
)

fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title='Main Genre',
    yaxis_title='Number of Movies',
    xaxis_tickangle=-45
)
fig.show()


In [14]:
# Count the number of movies by Start Year and sort the index
movies_year = df['StartYear'].value_counts().sort_index().reset_index()
movies_year.columns = ['StartYear', 'Number of Movies']

fig = px.bar(
    movies_year,
    x='StartYear',
    y='Number of Movies',
    title='Number of Movies by Start Year',
    text='Number of Movies',
    color='Number of Movies',
    color_continuous_scale='Viridis'  # Use a continuous color scale for gradient effect
)

fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title='Start Year',
    yaxis_title='Number of Movies',
    xaxis_tickangle=-90
)
fig.show()


In [15]:
# Create a histogram for the distribution of IMDB scores using Plotly Express
fig = px.histogram(
    df,
    x='IMDB_Score',
    nbins=20,
    title='Distribution of IMDB Scores',
    color_discrete_sequence=['blue'],  # Set bar color to blue
    labels={'IMDB_Score': 'IMDB Score'}  # Label for the x-axis
)

fig.update_layout(
    xaxis_title='IMDB Score',
    yaxis_title='Number of Movies',
    bargap=0.2,  # Gap between bars
)
fig.show()

In [16]:
df_unique = df.copy() # Work on a copy so the helper column is not added to df
df_unique['Genres_List'] = df_unique['Genre'].str.split(',')
df_exploded = df_unique.explode('Genres_List')

df_exploded_clean = df_exploded.dropna(subset=['Genres_List'])

df_exploded_clean = df_exploded_clean[df_exploded_clean['Genres_List'].str.strip() != '']

fig = px.treemap(
    df_exploded_clean, 
    path=['Category', 'Genres_List', 'Movie'],
    title='Marvel and DC film genres',
    hover_data=['IMDB_Score'], 
    color='IMDB_Score', 
    color_continuous_scale='Viridis'
)

fig.show()
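
<p>The treemap relies on turning each comma-separated <code>Genre</code> string into one row per genre. A minimal sketch of that step on a hypothetical toy frame (titles and genres below are made up) shows what <code>explode</code> produces and how the result can be counted per category:</p>

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Movie": ["A", "B", "C"],
    "Category": ["Marvel", "DC", "DC"],
    "Genre": ["Action,Adventure,Drama", "Animation, Action", None],
})

exploded = (
    toy.assign(Genres_List=toy["Genre"].str.split(","))  # list of genres per row
       .explode("Genres_List")                           # one row per genre
       .dropna(subset=["Genres_List"])                   # drop titles without a genre
)
exploded["Genres_List"] = exploded["Genres_List"].str.strip()  # remove stray spaces

counts = exploded.groupby(["Category", "Genres_List"]).size()
print(counts)
```

<p>Because <code>assign</code> returns a new DataFrame, the toy frame itself never gains the helper column.</p>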

In [17]:
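# Work on a copy and convert StartYear to a number so the x-axis is ordered chronologically
df_plot = df.copy()
df_plot['StartYear'] = df_plot['StartYear'].astype(int)

fig = px.scatter(
    df_plot,
    x='StartYear',
    y='IMDB_Score',
    size='RunTime(Min)', 
    color='Category',
    hover_name='Movie',
    title='Duration and IMDb rating for Marvel and DC films',
    labels={'StartYear': 'Start Year', 'IMDB_Score': 'IMDB Score'}
)

fig.show()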

In [18]:
# Top Movies by IMDB Score
top_movies = df.nlargest(10, 'IMDB_Score')
fig_top_movies = px.bar(top_movies, 
                        x = 'Movie', 
                        y = 'IMDB_Score', 
                        title = 'Top 10 Movies by IMDB Score',
                        color = 'Movie',
                        labels = {'IMDB_Score': 'IMDB Score', 'Movie': 'Movie Title'})
fig_top_movies.update_layout(xaxis_title = 'Movie Title', 
                             yaxis_title = 'IMDB Score')
fig_top_movies.show()
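
<p>The same title can appear on several rows of this dataset (the samples above show multiple "Supergirl" and "Batman: The Animated Series" rows with different descriptions), so <code>nlargest</code> on raw rows may return one title more than once. A possible alternative, sketched here on hypothetical toy scores, is to average the score per title before ranking:</p>

```python
import pandas as pd

# Hypothetical toy scores: "Show X" has three rows, the films have one each
toy = pd.DataFrame({
    "Movie": ["Show X", "Show X", "Show X", "Film Y", "Film Z"],
    "IMDB_Score": [9.2, 9.0, 7.0, 8.8, 8.5],
})

per_title = (
    toy.groupby("Movie")["IMDB_Score"]
       .agg(mean_score="mean", n_rows="count")   # average score and row count per title
       .sort_values("mean_score", ascending=False)
)
print(per_title.head(10))
```

<p>Keeping the row count next to the mean helps to judge how much weight an average deserves, since a title with a single row and a title with dozens of rows are not directly comparable.</p>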

In [19]:
# Exporting the DataFrame to an Excel file
df.to_excel('dataset_marvel_vs_dc.xlsx', index=False)

<h5 style="color: #001965;"> References and Inspirations: </h5>
<ul>
    <li>Marvel vs DC | EDA by Rudra prasad bhuyan</li>
    <li>Marvel vs DC: Networked by Timur Khabirovich</li>
    <li>Marvel vs DC ⍟ | EDA 🚀📈| Analysis by Pranav Mehta</li>
</ul>