<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Datasets</a></span></li><li><span><a href="#Imports-and-load-data" data-toc-modified-id="Imports-and-load-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports and load data</a></span></li><li><span><a href="#Data-cleaning-and-transformation" data-toc-modified-id="Data-cleaning-and-transformation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data cleaning and transformation</a></span></li><li><span><a href="#Hypothesis-Test" data-toc-modified-id="Hypothesis-Test-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Hypothesis Test</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

### Introduction

<I> Description: </I> <br>
In this project we analyze and compare centrality measures across a social network of more than 6,000 Marvel characters.  The centrality measures used are degree centrality and eigenvector centrality.  Our primary objectives are to answer two questions: <br>
1. Are male characters more popular than female characters (popularity will be based on number of connections)? <br>
2. Is there a greater prevalence of male characters in Marvel comics?
<br>

<I> Approach: </I>
1. For each node in the dataset, calculate degree centrality and eigenvector centrality. <br>
2. Compute the average degree and eigenvector centrality for each gender. <br>
3. Run t-tests comparing the genders to assess whether there is a statistically significant difference for the above objectives.


<I> Degree Centrality </I> is one of the easiest measures to calculate. The degree centrality of a node is based on its degree, the number of edges it has; NetworkX's `degree_centrality`, used below, divides that count by n - 1, the largest degree a node could have in a network of n nodes. The higher the degree, the more central the node is. This can be an effective measure, since many nodes with high degrees also have high centrality by other measures. <br>
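
As a minimal sketch of this definition, the snippet below computes degree centrality on a small made-up graph. The node names, edges, and the helper `degree_centrality` are illustrative and not part of the analysis; the division by n - 1 mirrors the normalization NetworkX applies.

```python
from collections import defaultdict

# Toy undirected edge list (hypothetical; not taken from the Marvel data)
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

toy_degree = degree_centrality(toy_edges)
print(toy_degree)
```

Node A touches every other node, so it gets the maximum score; node D has a single edge and gets the lowest.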

<I> Eigenvector Centrality </I> is a measure of the influence a node has on a network. A node has high eigenvector centrality when it is connected to many nodes that themselves have high eigenvector centrality.
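
One common way to compute this measure is power iteration: start every node with the same score, repeatedly replace each score with the sum of its neighbors' scores, and rescale. The sketch below does this on a small made-up graph. The graph and the helper are illustrative, and adding each node's own score at every step (iterating on A + I) is a choice made here to help convergence; it is meant to show the idea, not to reproduce the library's exact procedure.

```python
import math

# Toy undirected graph as an adjacency dict (hypothetical; not from the Marvel data)
toy_graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A"],
}

def eigenvector_centrality(graph, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I), rescaled to unit Euclidean length each step."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        # New score = own score + sum of neighbor scores
        new = {
            node: scores[node] + sum(scores[nb] for nb in graph[node])
            for node in graph
        }
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        # Stop once the scores no longer change
        if sum(abs(new[node] - scores[node]) for node in graph) < tol:
            return new
        scores = new
    raise RuntimeError("power iteration did not converge")

toy_eigen = eigenvector_centrality(toy_graph)
print(toy_eigen)
```

Nodes B and C end up with equal scores because they are symmetric in the graph, and D scores lowest because its only neighbor is A.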

### Datasets

The datasets used for this analysis are: <br>
1. <I>Kaggle’s Marvel Universe Social Network </I> [link to site](https://www.kaggle.com/csanhueza/the-marvel-universe-social-network). This is our primary dataset, providing files of nodes and edges. <br><br>
2. <I> Five-Thirty-Eight’s Comic Characters </I> [link to site](https://github.com/fivethirtyeight/data/tree/master/comic-characters).  This dataset was originally used for a story called “Comic Books are Still Made by Men…” and appears to have been derived from the [Marvel Fandom site](https://marvel.fandom.com/wiki/Category:Characters_by_Gender). We use this dataset to identify the gender of each superhero.

### Imports and load data

In [1]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict
from scipy import stats
import math

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
%cd C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project1_DS620

C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project1_DS620


In [3]:
#data load (forward slashes avoid backslash-escape problems and work on any OS)
hnetwork_in = pd.read_csv('data/hero-network.csv')
hgender_in = pd.read_csv('data/marvel_heroe_with_gender_list.csv')

In [4]:
hnetwork_in.head(5)

Unnamed: 0,hero1,hero2
0,"LITTLE, ABNER",PRINCESS ZANDA
1,"LITTLE, ABNER",BLACK PANTHER/T'CHAL
2,BLACK PANTHER/T'CHAL,PRINCESS ZANDA
3,"LITTLE, ABNER",PRINCESS ZANDA
4,"LITTLE, ABNER",BLACK PANTHER/T'CHAL


In [5]:
hnetwork_in.tail(5)

Unnamed: 0,hero1,hero2
574462,COLOSSUS II/PETER RA,CALLISTO
574463,CALLISTO,ROGUE /
574464,CALLISTO,CALIBAN/
574465,CALIBAN/,ROGUE /
574466,HULK/DR. ROBERT BRUC,"MARKS, DR. SHIELA"


In [6]:
hgender_in.head(5)

Unnamed: 0,Hero,Name,Gender
0,MAGUS II,"('Shingen Harada II (Earth-616)', 86)",Male
1,DORREK [SKRULL],"('Dorrek VIII (Earth-616)', 86)",Male
2,DESADIA,"('Sir Percy of Scandia (Earth-616)', 64)",Male
3,MISS ITCH/BLISS,"('Bliss (Morlock) (Earth-616)', 86)",Female
4,MAN-OF-WAR,"('Spider-Man (Peter Parker)', 86)",Male


In [7]:
hgender_in.tail(5)

Unnamed: 0,Hero,Name,Gender
6421,FLYNN,"('Alexander Flynn (Earth-616)', 90)",Male
6422,VOYAGER,"('Voyager (Body Magician) (Earth-616)', 90)",Male
6423,"SAINT, JOHNNY","('Josef Saint (Earth-616)', 86)",Male
6424,TUNDRA,"('Tundra (Earth-616)', 90)",
6425,"PINKERTON, PERCIVAL","('Percival Pinkerton (Earth-616)', 86)",Male


### Data cleaning and transformation

In [8]:
#Create a dictionary keyed by hero name with gender as the value.  This dictionary is used in the next step to
#help create a dataframe of each hero's name and gender, with a unique hero id.
gender_dict = dict(zip(hgender_in['Hero'], hgender_in['Gender']))

In [9]:
#This step creates a dataframe that lists each superhero and assigns each a unique id.  The hero's gender is also
#a column in the dataframe.  The ids serve as node ids for the graph.

#combine the hero1 and hero2 columns into one list of hero names
hero1_list = hnetwork_in['hero1'].tolist()
hero2_list = hnetwork_in['hero2'].tolist()

all_heroes_list = hero1_list + hero2_list

#use set to remove duplicates (set order is arbitrary, so ids can differ between runs)
hnetwork_set = set(all_heroes_list)
hnetwork_list = list(hnetwork_set)

#use the dataframe index to generate a unique id for each name
hnetwork_df = pd.DataFrame()
hnetwork_df['hero_name'] = hnetwork_list
hnetwork_df.index.name = 'hero_id'
hnetwork_df = hnetwork_df.reset_index()

#look up each hero's gender in the gender dictionary (None when the hero is not listed)
hnetwork_df['hero_gender'] = [gender_dict.get(name) for name in hnetwork_df['hero_name']]

hnetwork_df.head(5)

Unnamed: 0,hero_id,hero_name,hero_gender
0,0,ASTROLOGER/,Male
1,1,SOUL MAN/FATHER JASO,Male
2,2,TORGO,Male
3,3,"GREY, ELAINE",Female
4,4,DIBDEB,Male


In [10]:
hnetwork_df.tail(5)

Unnamed: 0,hero_id,hero_name,hero_gender
6421,6421,EVERY-MAN,Male
6422,6422,ARMAGEDDON MAN,Male
6423,6423,CLEARCUT/,Male
6424,6424,VENUS/APHRODITE/VICT,Female
6425,6425,"CONTONI, PAUL",Male


In [11]:
#Create a hero ids dictionary keyed by name with id as the value.  This dictionary is used in the next step to add
#the hero1_nodeid and hero2_nodeid columns to the hero network dataframe.
hero_ids_dict = dict(zip(hnetwork_df['hero_name'], hnetwork_df['hero_id']))

In [12]:
#Add two columns (hero1_nodeid, hero2_nodeid) to a copy of the hnetwork_in dataframe.  These ids
#will be used as the node ids for the graph and centrality calculations.
hero_network_df = hnetwork_in.copy()

#look up each hero name in the hero ids dictionary
hero_network_df['hero1_nodeid'] = hero_network_df['hero1'].map(hero_ids_dict)
hero_network_df['hero2_nodeid'] = hero_network_df['hero2'].map(hero_ids_dict)

print('This is a dataframe of the hero network by hero name and by hero node id...')
hero_network_df.head(5)

This is a dataframe of the hero network by hero name and by hero node id...


Unnamed: 0,hero1,hero2,hero1_nodeid,hero2_nodeid
0,"LITTLE, ABNER",PRINCESS ZANDA,5407,6224
1,"LITTLE, ABNER",BLACK PANTHER/T'CHAL,5407,6107
2,BLACK PANTHER/T'CHAL,PRINCESS ZANDA,6107,6224
3,"LITTLE, ABNER",PRINCESS ZANDA,5407,6224
4,"LITTLE, ABNER",BLACK PANTHER/T'CHAL,5407,6107


In [13]:
hero_network_df.tail(5)

Unnamed: 0,hero1,hero2,hero1_nodeid,hero2_nodeid
574462,COLOSSUS II/PETER RA,CALLISTO,2336,2627
574463,CALLISTO,ROGUE /,2627,986
574464,CALLISTO,CALIBAN/,2627,6117
574465,CALIBAN/,ROGUE /,6117,986
574466,HULK/DR. ROBERT BRUC,"MARKS, DR. SHIELA",3119,3060


In [14]:
# Load the data from the hero_network_df to a graph
g1 = nx.Graph()

# Add nodes
node_ids_list = hnetwork_df.hero_id.tolist()
g1.add_nodes_from(node_ids_list)

# Add edges
edges_list = list(zip(hero_network_df.hero1_nodeid.tolist(), hero_network_df.hero2_nodeid.tolist()))
g1.add_edges_from(edges_list)
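
The edge list contains repeated pairs (rows 0 and 3 of `hero_network_df` above are the same pair), and `nx.Graph` stores each pair only once, so a node's degree here counts distinct neighbors rather than co-appearances. If co-appearance counts mattered, one option would be to tally them before building the graph. The sketch below does that for the five sample rows shown earlier; the variable names are illustrative and this step is not part of the analysis.

```python
from collections import Counter

# The first five (hero1, hero2) pairs from the edge list shown above
sample_pairs = [
    ("LITTLE, ABNER", "PRINCESS ZANDA"),
    ("LITTLE, ABNER", "BLACK PANTHER/T'CHAL"),
    ("BLACK PANTHER/T'CHAL", "PRINCESS ZANDA"),
    ("LITTLE, ABNER", "PRINCESS ZANDA"),
    ("LITTLE, ABNER", "BLACK PANTHER/T'CHAL"),
]

# frozenset makes (a, b) and (b, a) the same key, as in an undirected graph
pair_counts = Counter(frozenset(pair) for pair in sample_pairs)

for pair, count in pair_counts.items():
    print(sorted(pair), count)
```

Counts like these could be passed to NetworkX as edge weights, for example with `add_weighted_edges_from`.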

In [15]:
#Calculate degree centrality for each node and then add it to the hero network dataframe,
#matching on hero_id rather than relying on dictionary order
heroes_degree_centrality = nx.degree_centrality(g1)

hnetwork_df['hero_degree_centrality'] = hnetwork_df['hero_id'].map(heroes_degree_centrality)

In [16]:
#Calculate eigenvector centrality for each node and then add it to the hero network dataframe,
#matching on hero_id rather than relying on dictionary order
heroes_eigenvector_centrality = nx.eigenvector_centrality(g1)

hnetwork_df['hero_eigenvector_centrality'] = hnetwork_df['hero_id'].map(heroes_eigenvector_centrality)

hnetwork_df.head(5)

Unnamed: 0,hero_id,hero_name,hero_gender,hero_degree_centrality,hero_eigenvector_centrality
0,0,ASTROLOGER/,Male,0.000467,0.000739
1,1,SOUL MAN/FATHER JASO,Male,0.00249,0.001695
2,2,TORGO,Male,0.004825,0.005206
3,3,"GREY, ELAINE",Female,0.022101,0.019707
4,4,DIBDEB,Male,0.003113,0.001593


In [17]:
hnetwork_df.tail(5)

Unnamed: 0,hero_id,hero_name,hero_gender,hero_degree_centrality,hero_eigenvector_centrality
6421,6421,EVERY-MAN,Male,0.006381,0.005276
6422,6422,ARMAGEDDON MAN,Male,0.002646,0.001388
6423,6423,CLEARCUT/,Male,0.001556,0.000851
6424,6424,VENUS/APHRODITE/VICT,Female,0.00965,0.009187
6425,6425,"CONTONI, PAUL",Male,0.001712,0.001435


In [18]:
#Write dataframe to a csv file
hnetwork_df.to_csv('heroes_network_with_centrality.csv')

In [19]:
#Create a dataframe of male heroes
hnetwork_males_df = hnetwork_df[hnetwork_df.hero_gender == 'Male']
hnetwork_males_df = hnetwork_males_df.sort_values(['hero_eigenvector_centrality'], ascending = False) 

print('Males with highest eigenvector scores')
hnetwork_males_df.head(20)

Males with highest eigenvector scores


Unnamed: 0,hero_id,hero_name,hero_gender,hero_degree_centrality,hero_eigenvector_centrality
1369,1369,CAPTAIN AMERICA,Male,0.296965,0.116775
4770,4770,IRON MAN/TONY STARK,Male,0.236887,0.102541
4809,4809,SCARLET WITCH/WANDA,Male,0.206226,0.100821
867,867,THING/BENJAMIN J. GR,Male,0.220389,0.100782
4812,4812,SPIDER-MAN/PETER PAR,Male,0.27035,0.100232
4506,4506,MR. FANTASTIC/REED R,Male,0.21463,0.099745
2322,2322,VISION,Male,0.193152,0.098534
934,934,HUMAN TORCH/JOHNNY S,Male,0.211829,0.098518
5712,5712,WOLVERINE/LOGAN,Male,0.213385,0.098364
1820,1820,BEAST/HENRY &HANK& P,Male,0.197198,0.095499


In [20]:
#Create a dataframe of female heroes
hnetwork_females_df = hnetwork_df[hnetwork_df.hero_gender == 'Female']
hnetwork_females_df = hnetwork_females_df.sort_values(['hero_eigenvector_centrality'], ascending = False)

print('Females with highest eigenvector scores')
hnetwork_females_df.head(20)

Females with highest eigenvector scores


Unnamed: 0,hero_id,hero_name,hero_gender,hero_degree_centrality,hero_eigenvector_centrality
1749,1749,INVISIBLE WOMAN/SUE,Female,0.192374,0.095076
1763,1763,SHE-HULK/JENNIFER WA,Female,0.166693,0.091875
3735,3735,WASP/JANET VAN DYNE,Female,0.169805,0.09143
2151,2151,BLACK WIDOW/NATASHA,Female,0.143502,0.080438
3704,3704,MARVEL GIRL/JEAN GRE,Female,0.15642,0.077741
2634,2634,BLACK KNIGHT V/DANE,Female,0.115331,0.07702
986,986,ROGUE /,Female,0.129339,0.074234
6107,6107,BLACK PANTHER/T'CHAL,Female,0.110661,0.069046
2843,2843,SHADOWCAT/KATHERINE,Female,0.111751,0.067692
4675,4675,SERSI/SYLVIA,Female,0.090428,0.06155


### Hypothesis Test

In [21]:
#The male and female counts reveal (not surprisingly) that the super hero count is skewed toward males
#by roughly 2.7:1.
male_N = len(hnetwork_males_df.index)
female_N = len(hnetwork_females_df.index)

print('Gender counts are...')
print(f'male heroes: {male_N}')
print(f'female heroes: {female_N}')

missing_gender = len(hnetwork_df) - female_N - male_N
print(f'missing gender: {missing_gender}')

Gender counts are...
male heroes: 4505
female heroes: 1661
missing gender: 260
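
The counts above answer the second question descriptively. As a rough check that the imbalance is larger than chance would produce, the sketch below runs a one-sample z-test of the male share against 50%, using only the two counts printed above. Treating characters as independent draws and leaving out the 260 characters with missing gender are assumptions of this sketch, not steps in the analysis.

```python
import math
from statistics import NormalDist

male_count = 4505    # from the output above
female_count = 1661  # from the output above

n_known = male_count + female_count
male_share = male_count / n_known

# Standard error of a proportion under H0: share = 0.5
se_h0 = math.sqrt(0.5 * 0.5 / n_known)
z_score = (male_share - 0.5) / se_h0

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))

print(f"male share: {male_share:.3f}")
print(f"male:female ratio: {male_count / female_count:.2f}:1")
print(f"z = {z_score:.1f}, p = {p_value:.3g}")
```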


In [22]:
#We'll respond to question #1 via a t-test.

#Hypothesis
#H0: male average eigenvector centrality = female average eigenvector centrality
#H1: male average eigenvector centrality != female average eigenvector centrality

#Data
male_mean = hnetwork_males_df.hero_eigenvector_centrality.mean()
male_std = hnetwork_males_df.hero_eigenvector_centrality.std()

female_mean = hnetwork_females_df.hero_eigenvector_centrality.mean()
female_std = hnetwork_females_df.hero_eigenvector_centrality.std()

print(f'male sample size is: {male_N}')
print(f'male mean is: {male_mean}')
print(f'male std is: {male_std}')
print()
print(f'female sample size is: {female_N}')
print(f'female mean is: {female_mean}')
print(f'female std is: {female_std}')

male sample size is: 4505
male mean is: 0.00619167994995531
male std is: 0.011335253898937556

female sample size is: 1661
female mean is: 0.0057193152336958005
female std is: 0.010106180477445235


In [23]:
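#Run the t-test
def tcdf(value, df):
    #cumulative distribution function of Student's t with df degrees of freedom
    return stats.t.cdf(value, df)

def t_independent_samples_H(N1, N2, u1, u2, o1, o2):
    #SE is rounded to 3 decimals below.  On the raw scale it rounds to 0.0 and the t-value calc errors on
    #"divide by zero", so all means and stds are multiplied by 1000 first (the t-value does not depend on scale)
    u1 = u1 * 1000
    u2 = u2 * 1000
    o1 = o1 * 1000
    o2 = o2 * 1000

    #degrees of freedom are taken from the larger sample
    df = max(N1, N2) - 1

    #standard error of the difference in means (unequal variances), on the x1000 scale
    SE = round(math.sqrt(o1**2/N1 + o2**2/N2), 3)

    tvalue = (u1 - u2)/SE

    #two-sided p-value; abs() keeps it valid when the first mean is the smaller one
    pvalue = (1 - tcdf(abs(tvalue), df)) * 2
    pvalue = round(pvalue, 4)

    return (SE, tvalue, pvalue)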

answer = t_independent_samples_H(male_N, female_N, male_mean, female_mean, male_std, female_std) #(N1, N2, u1, u2, o1, o2)
print(f'The SE, t-value, and pvalues are {answer}.')
print('Since the pvalue > 0.05, we cannot reject the null hypothesis.  Male eigenvector centrality does not appear to be significantly different than that of females')

The SE, t-value, and pvalues are (0.3, 1.5745490541983642, 0.1154).
Since the pvalue > 0.05, we cannot reject the null hypothesis.  Male eigenvector centrality does not appear to be significantly different than that of females
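
As a cross-check on the hand-rolled function, the sketch below recomputes the statistic from the summary values printed above without rounding the standard error, and reports the Welch-Satterthwaite degrees of freedom. Because both samples are large, it approximates the t distribution with the standard normal so that it needs only the standard library; that approximation and the variable names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the output above
n_m, mean_m, std_m = 4505, 0.00619167994995531, 0.011335253898937556
n_f, mean_f, std_f = 1661, 0.0057193152336958005, 0.010106180477445235

# Squared standard error contributed by each group
var_m = std_m**2 / n_m
var_f = std_f**2 / n_f

se = math.sqrt(var_m + var_f)
t_stat = (mean_m - mean_f) / se

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df = (var_m + var_f) ** 2 / (var_m**2 / (n_m - 1) + var_f**2 / (n_f - 1))

# Two-sided p-value using the normal approximation to the t distribution
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"SE = {se:.6f}, t = {t_stat:.4f}, df = {welch_df:.0f}, p = {p_approx:.4f}")
```

With SciPy available, `stats.ttest_ind_from_stats(mean_m, std_m, n_m, mean_f, std_f, n_f, equal_var=False)` performs the same Welch test using the exact t distribution.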


### Conclusion

The introduction posed two questions: <br>
1. <I> Are male characters more popular than female characters (popularity will be based on number of connections)? </I> <br>
The hypothesis test above gives a p-value of 0.1154, which is greater than 0.05, so we cannot reject the null hypothesis. Mean male eigenvector centrality is not significantly different from mean female eigenvector centrality, so by this measure male characters are not more popular. <br> <br>

2. <I> Is there a greater prevalence of male characters in Marvel comics? </I> <br>
Among characters with a known gender there are 4,505 males and 1,661 females, a male-to-female ratio of about 2.7:1.  Based on this we conclude that there is a greater prevalence of male characters in Marvel comics.