# Superheroes
- Andrea Cohen
- 03.02.2023

## The Data:
- Source: Kaggle
    - superhero_info.csv:  
    Contains name, publisher, demographic info, and body measurements.
    - superhero_powers.csv:  
    Contains the hero name and a list of powers.

## Tasks
- Clean the files and combine them into one final DataFrame.
- Use the combined DataFrame to answer questions.

## Preliminary steps

### Imports

In [1]:
import pandas as pd
import os, json

### Load the data

In [2]:
info_df = pd.read_csv('Data/superhero_info - superhero_info.csv')
display(info_df.head())
display(info_df.info())

Unnamed: 0,Hero|Publisher,Gender,Race,Alignment,Hair color,Eye color,Skin color,Measurements
0,A-Bomb|Marvel Comics,Male,Human,good,No Hair,yellow,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"
1,Abe Sapien|Dark Horse Comics,Male,Icthyo Sapien,good,No Hair,blue,blue,"{'Height': '191.0 cm', 'Weight': '65.0 kg'}"
2,Abin Sur|DC Comics,Male,Ungaran,good,No Hair,blue,red,"{'Height': '185.0 cm', 'Weight': '90.0 kg'}"
3,Abomination|Marvel Comics,Male,Human / Radiation,bad,No Hair,green,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"
4,Absorbing Man|Marvel Comics,Male,Human,bad,No Hair,blue,Unknown,"{'Height': '193.0 cm', 'Weight': '122.0 kg'}"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463 entries, 0 to 462
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Hero|Publisher  463 non-null    object
 1   Gender          463 non-null    object
 2   Race            463 non-null    object
 3   Alignment       463 non-null    object
 4   Hair color      463 non-null    object
 5   Eye color       463 non-null    object
 6   Skin color      463 non-null    object
 7   Measurements    463 non-null    object
dtypes: object(8)
memory usage: 29.1+ KB


None

In [4]:
powers_df = pd.read_csv('Data/superhero_powers - superhero_powers.csv')
display(powers_df.head())
display(powers_df.info())

Unnamed: 0,hero_names,Powers
0,3-D Man,"Agility,Super Strength,Stamina,Super Speed"
1,A-Bomb,"Accelerated Healing,Durability,Longevity,Super..."
2,Abe Sapien,"Agility,Accelerated Healing,Cold Resistance,Du..."
3,Abin Sur,Lantern Power Ring
4,Abomination,"Accelerated Healing,Intelligence,Super Strengt..."


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 667 entries, 0 to 666
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   hero_names  667 non-null    object
 1   Powers      667 non-null    object
dtypes: object(2)
memory usage: 10.5+ KB


None

## Transformations

### Separate Hero|Publisher into 2 columns

In [5]:
# explore the existing format
info_df['Hero|Publisher'].head()

0            A-Bomb|Marvel Comics
1    Abe Sapien|Dark Horse Comics
2              Abin Sur|DC Comics
3       Abomination|Marvel Comics
4     Absorbing Man|Marvel Comics
Name: Hero|Publisher, dtype: object

In [6]:
# split each string on the '|' delimiter and store the two parts as new 'Hero' and 'Publisher' columns
info_df[['Hero', 'Publisher']] = info_df['Hero|Publisher'].str.split('|', expand=True)
# drop the original column
info_df = info_df.drop(columns=['Hero|Publisher'])
display(info_df.head())

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Measurements,Hero,Publisher
0,Male,Human,good,No Hair,yellow,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",A-Bomb,Marvel Comics
1,Male,Icthyo Sapien,good,No Hair,blue,blue,"{'Height': '191.0 cm', 'Weight': '65.0 kg'}",Abe Sapien,Dark Horse Comics
2,Male,Ungaran,good,No Hair,blue,red,"{'Height': '185.0 cm', 'Weight': '90.0 kg'}",Abin Sur,DC Comics
3,Male,Human / Radiation,bad,No Hair,green,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",Abomination,Marvel Comics
4,Male,Human,bad,No Hair,blue,Unknown,"{'Height': '193.0 cm', 'Weight': '122.0 kg'}",Absorbing Man,Marvel Comics


### Convert Measurements into dictionaries, then separate into 2 columns

In [7]:
# examine a single value from the measurements column
measurements = info_df.loc[0,"Measurements"]
print(type(measurements))
measurements

<class 'str'>


"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"

In [10]:
# replace single quotes with double quotes so each string is valid JSON
# (regex=False makes the literal replacement explicit across pandas versions)
info_df['Measurements'] = info_df['Measurements'].str.replace("'", '"', regex=False)
# use json.loads to parse each JSON-compatible string into a dictionary
info_df['Measurements'] = info_df['Measurements'].apply(json.loads)
info_df['Measurements'].head()

0    {'Height': '203.0 cm', 'Weight': '441.0 kg'}
1     {'Height': '191.0 cm', 'Weight': '65.0 kg'}
2     {'Height': '185.0 cm', 'Weight': '90.0 kg'}
3    {'Height': '203.0 cm', 'Weight': '441.0 kg'}
4    {'Height': '193.0 cm', 'Weight': '122.0 kg'}
Name: Measurements, dtype: object
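
The quote replacement above works because none of the measurement strings contain an apostrophe inside a value. As a hedged alternative (not what this notebook uses), the standard library's `ast.literal_eval` can parse the original single-quoted strings directly. The sketch below runs on a small toy Series, `raw`, built from two of the values shown above rather than on `info_df` itself.

```python
import ast

import pandas as pd

# Toy Series (assumption: same string format as the Measurements column above)
raw = pd.Series([
    "{'Height': '203.0 cm', 'Weight': '441.0 kg'}",
    "{'Height': '191.0 cm', 'Weight': '65.0 kg'}",
])

# ast.literal_eval evaluates a Python literal safely, so no quote swapping is needed
parsed = raw.apply(ast.literal_eval)

print(type(parsed[0]))
print(parsed[0]['Height'])
```

Either approach yields a column of dictionaries; `literal_eval` simply avoids editing the strings first.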

In [12]:
# apply pd.Series to the column of dictionaries so that each key ('Height', 'Weight') becomes its own column
height_weight = info_df['Measurements'].apply(pd.Series)
# concatenate the 2 new columns onto the original dataframe, side by side
info_df = pd.concat([info_df, height_weight], axis=1)
# drop the original column
info_df = info_df.drop(columns=['Measurements'])
display(info_df.head())

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Hero,Publisher,Height,Weight
0,Male,Human,good,No Hair,yellow,Unknown,A-Bomb,Marvel Comics,203.0 cm,441.0 kg
1,Male,Icthyo Sapien,good,No Hair,blue,blue,Abe Sapien,Dark Horse Comics,191.0 cm,65.0 kg
2,Male,Ungaran,good,No Hair,blue,red,Abin Sur,DC Comics,185.0 cm,90.0 kg
3,Male,Human / Radiation,bad,No Hair,green,Unknown,Abomination,Marvel Comics,203.0 cm,441.0 kg
4,Male,Human,bad,No Hair,blue,Unknown,Absorbing Man,Marvel Comics,193.0 cm,122.0 kg
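
At this point `Height` and `Weight` are still strings such as `'203.0 cm'`, so they cannot be averaged yet. One possible way to make them numeric is sketched below on a toy DataFrame, `demo`, that copies two rows of the output above. It assumes every value ends in exactly `' cm'` or `' kg'`, which should be checked against the full dataset before relying on it.

```python
import pandas as pd

# Toy frame (assumption: same 'value unit' string format as the output above)
demo = pd.DataFrame({
    'Height': ['203.0 cm', '191.0 cm'],
    'Weight': ['441.0 kg', '65.0 kg'],
})

# Strip the unit suffix as a literal string, then cast the remainder to float
demo['Height'] = demo['Height'].str.replace(' cm', '', regex=False).astype(float)
demo['Weight'] = demo['Weight'].str.replace(' kg', '', regex=False).astype(float)

print(demo.dtypes)
```

If some rows used a different unit or format, `astype(float)` would raise an error, which is a useful signal that more cleaning is needed.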


## Compare the average weight of heroes who have Super Speed to those who do not.
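
The notebook does not include an answer to this question, so the following is only a minimal sketch of one way to approach it. It uses made-up toy rows (`Hero A` through `Hero D` and their weights are hypothetical, not values from the dataset) and assumes `Weight` has already been converted to a numeric column. The column names `Hero`, `hero_names`, and `Powers` match the two DataFrames loaded above.

```python
import pandas as pd

# Hypothetical toy rows standing in for info_df (Weight assumed to be numeric already)
info = pd.DataFrame({
    'Hero': ['Hero A', 'Hero B', 'Hero C', 'Hero D'],
    'Weight': [100.0, 80.0, 60.0, 120.0],
})

# Hypothetical toy rows standing in for powers_df
powers = pd.DataFrame({
    'hero_names': ['Hero A', 'Hero B', 'Hero C', 'Hero D'],
    'Powers': ['Agility,Super Speed', 'Flight', 'Super Speed', 'Stamina,Flight'],
})

# Combine the two tables, keeping only heroes present in both
merged = info.merge(powers, left_on='Hero', right_on='hero_names', how='inner')

# Split the comma-separated powers and test for an exact 'Super Speed' entry
merged['has_super_speed'] = merged['Powers'].str.split(',').apply(
    lambda power_list: 'Super Speed' in power_list
)

# Average weight for each group (False = no Super Speed, True = has Super Speed)
avg_weight = merged.groupby('has_super_speed')['Weight'].mean()
print(avg_weight)
```

Splitting the string before testing membership avoids accidentally matching a longer power name that merely contains the text "Super Speed".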

## What is the average height of heroes for each publisher?
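
This question is also left unanswered in the notebook. A minimal sketch of a possible approach is a `groupby` on `Publisher` followed by a mean of `Height`. The rows below are hypothetical (`Publisher X` and `Publisher Y` are placeholder names), and `Height` is assumed to be numeric already.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned info_df
heroes = pd.DataFrame({
    'Publisher': ['Publisher X', 'Publisher X', 'Publisher Y'],
    'Height': [180.0, 200.0, 170.0],
})

# Mean height per publisher, sorted from tallest to shortest average
avg_height = (
    heroes.groupby('Publisher')['Height']
    .mean()
    .sort_values(ascending=False)
)
print(avg_height)
```

On the real data, the same two lines would apply to `info_df` once its `Height` column holds floats rather than strings.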