In [27]:
import pandas as pd

from collections import Counter

# Load The Marvel Universe Social Network datasets
I selected this [dataset](https://www.kaggle.com/csanhueza/the-marvel-universe-social-network) for the data analysis and data visualizations in this course.

In [19]:
edges_df = pd.read_csv('../the-marvel-universe-social-network/edges.csv')
hero_network_df = pd.read_csv('../the-marvel-universe-social-network/hero-network.csv')
nodes_df = pd.read_csv('../the-marvel-universe-social-network/nodes.csv')

# Exploring the `edges.csv` dataset
The `edges.csv` dataset records which hero appears in which comic, one row per appearance.

In [20]:
# Basic description of the dataset
edges_df.describe()

Unnamed: 0,hero,comic
count,96104,96104
unique,6439,12651
top,SPIDER-MAN/PETER PARKER,COC 1
freq,1577,111


The dataset has 96104 entries but only 6439 unique heroes and 12651 unique comics, so many heroes and comics appear more than once and a frequency analysis is sensible.
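One way to run such a frequency analysis directly in pandas is `Series.value_counts()`. The sketch below uses a small made-up frame (`sample_df` and its rows are hypothetical, not taken from the dataset) so that it runs on its own; on the real data the same calls would be applied to `edges_df`.

```python
import pandas as pd

# Toy stand-in for edges_df (hypothetical rows, not taken from the dataset)
sample_df = pd.DataFrame({
    "hero": ["HERO A", "HERO A", "HERO B", "HERO A", "HERO C"],
    "comic": ["C 1", "C 2", "C 1", "C 3", "C 1"],
})

# value_counts() returns each distinct value with its number of rows,
# sorted from most to least frequent
hero_counts = sample_df["hero"].value_counts()
comic_counts = sample_df["comic"].value_counts()

print(hero_counts.head(3))
print(comic_counts.head(3))
```

The first entry of each result corresponds to the `top` and `freq` rows reported by `describe()`.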

In [21]:
# Basic information of the dataset
edges_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96104 entries, 0 to 96103
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   hero    96104 non-null  object
 1   comic   96104 non-null  object
dtypes: object(2)
memory usage: 1.5+ MB
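The non-null counts reported by `info()` can also be checked explicitly with `isna()`. This is a minimal sketch on a made-up frame (`sample_df` is hypothetical and deliberately contains missing values so the check has something to find); on the real data the same calls would be applied to `edges_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical data)
sample_df = pd.DataFrame({
    "hero": ["HERO A", None, "HERO B"],
    "comic": ["C 1", "C 2", np.nan],
})

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# True if any cell in the frame is missing
has_missing = bool(sample_df.isna().any().any())
print(has_missing)
```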


Fortunately, this dataset does not contain any `NaN` values. However, both columns hold strings rather than numeric values:

In [22]:
# First ten values
edges_df.head(10)

Unnamed: 0,hero,comic
0,24-HOUR MAN/EMMANUEL,AA2 35
1,3-D MAN/CHARLES CHAN,AVF 4
2,3-D MAN/CHARLES CHAN,AVF 5
3,3-D MAN/CHARLES CHAN,COC 1
4,3-D MAN/CHARLES CHAN,H2 251
5,3-D MAN/CHARLES CHAN,H2 252
6,3-D MAN/CHARLES CHAN,M/PRM 35
7,3-D MAN/CHARLES CHAN,M/PRM 36
8,3-D MAN/CHARLES CHAN,M/PRM 37
9,3-D MAN/CHARLES CHAN,WI? 9


# Preparing the data
Since all the values in the dataset are strings, it helps to map each distinct string to a numeric key so that the data analysis and data visualization are easier to handle.

In [37]:
# Select the 'hero' column
edges_heroes = edges_df['hero']
# Distinct heroes, in order of first appearance
edges_heroes_unique = edges_heroes.unique()
# Map each numeric key to its hero name
heroes = dict(enumerate(edges_heroes_unique))
# Count how often each hero appears in the dataset
hero_frequency = Counter(edges_heroes)

In [40]:
# Select the 'comic' column
edges_comics = edges_df['comic']
# Distinct comics, in order of first appearance
edges_comics_unique = edges_comics.unique()
# Map each numeric key to its comic name
comics = dict(enumerate(edges_comics_unique))
# Count how often each comic appears in the dataset
comic_frequency = Counter(edges_comics)
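The dictionaries above map a key to a name. To actually encode a column, the reverse direction (name to key) is also needed. A possible alternative, shown here only as a sketch, is `pd.factorize`, which returns the integer codes and the unique values in one call. The series `sample_heroes` and the names `key_to_name` and `name_to_key` are hypothetical and used for illustration only.

```python
import pandas as pd

# Toy stand-in for edges_df['hero'] (hypothetical values)
sample_heroes = pd.Series(["HERO A", "HERO B", "HERO A", "HERO C", "HERO B"])

# codes[i] is the integer key of sample_heroes[i];
# uniques[k] is the string that key k stands for (order of first appearance)
codes, uniques = pd.factorize(sample_heroes)

# Same direction as the dictionaries built above: key -> name
key_to_name = dict(enumerate(uniques))
# Reverse lookup, needed to encode a name as its key
name_to_key = {name: key for key, name in key_to_name.items()}

print(list(codes))
print(key_to_name)
```

Both `unique()` and `factorize()` order values by first appearance, so the keys produced here line up with those of the dictionary approach.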

# Simple probabilistic analysis
Now it is possible to continue with a simple probabilistic analysis of the dataset.

## Mean
Since the dataset does not contain numeric values, the *mean* is not defined for it and cannot provide meaningful results in this case.

## Mode
The median, like the mean, is not defined for unordered string values. The *mode*, however, is meaningful here, as it indicates the most common value in each column.

In [42]:
# Most common hero
most_common_hero = hero_frequency.most_common(1)
print(most_common_hero)

[('SPIDER-MAN/PETER PARKER', 1577)]


In [43]:
# Most common comic
most_common_comic = comic_frequency.most_common(1)
print(most_common_comic)

[('COC 1', 111)]
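Note that `most_common(1)` returns a single entry even if several values share the highest count. The sketch below, on a made-up series (`sample_heroes` is hypothetical), shows two ways to look beyond that single entry: pandas' `Series.mode()`, which returns every tied value, and `most_common(n)` for the top `n` values.

```python
from collections import Counter

import pandas as pd

# Toy series in which two heroes tie for the highest count (hypothetical values)
sample_heroes = pd.Series(["HERO A", "HERO B", "HERO A", "HERO B", "HERO C"])

# most_common(1) returns only one entry, even when there is a tie
counts = Counter(sample_heroes)
top_one = counts.most_common(1)

# Series.mode() returns every value that reaches the highest count
modes = sample_heroes.mode().tolist()

# most_common(n) lists the n most frequent values with their counts
top_two = counts.most_common(2)

print(top_one)
print(modes)
print(top_two)
```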


## Standard deviation
