# Cleaning Data

The Avengers are a well-known and widely loved team of superheroes in the Marvel universe who were originally introduced in the 1960s comic book series. The recent Disney movies, part of the new [Marvel Cinematic Universe](https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe), made them popular again.



<img src="https://wallup.net/wp-content/uploads/2019/07/24/603374-avengers-age-ultron-marvel-superhero-action-adventure-comics-heroes-ageultron-hero.jpg" alt="drawing" width="1100" height="200"/>


Because the comic book writers killed off and revived many of the superheroes, the team at FiveThirtyEight was curious to explore data from the [Marvel Wikia site](https://marvel.fandom.com/wiki/Marvel_Database) further. To learn how they collected their data, which is available in their GitHub repository, read the write-up they published on the [FiveThirtyEight website](https://fivethirtyeight.com/features/avengers-death-comics-age-of-ultron/).

## Exploring the Data

The FiveThirtyEight team did a wonderful job acquiring the data, but it still has some inconsistencies.

The goal is to clean up their dataset so it is more useful for analysis in pandas.

Let's read it into pandas as a dataframe and preview the first five rows to get a better sense of it.

In [1]:
! ls -l

total 36
-rw-rw-r-- 1 ion ion 27638 may 31 16:49  avengers.csv
-rw-rw-r-- 1 ion ion  7254 jun  1 10:36 'Cleaning Data.ipynb'


In [2]:
import chardet 

In [3]:
! file -k avengers.csv

avengers.csv: CSV text\012- , ISO-8859 text, with very long lines


In [4]:
with open("avengers.csv", "rb") as file:
    print(chardet.detect(file.read()))

{'encoding': 'ISO-8859-1', 'confidence': 0.7292846143077137, 'language': ''}


<br>

**ISO/IEC 8859-1** encodes what it refers to as **"Latin alphabet no. 1"**, consisting of 191 characters from the Latin script. 

This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa.

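Remember:

`ISO-8859-1 == Latin-1`

Windows-1252 is closely related but not identical: it assigns printable characters (such as curly quotes and the euro sign) to bytes 0x80-0x9F, which ISO-8859-1 does not use for printable characters.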
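One caveat worth checking before picking an `encoding` argument: Windows-1252 agrees with ISO-8859-1 on most bytes, but the two disagree in the range 0x80-0x9F. The following is a minimal sketch using only Python's built-in codecs; the byte strings are made-up examples, not taken from `avengers.csv`.

```python
# Made-up sample bytes: 0x93, 0x94 and 0x80 fall in the range where
# the two encodings disagree.
raw = b"\x93Hank\x94 Pym \x80"

as_latin1 = raw.decode("latin-1")  # yields C1 control characters
as_cp1252 = raw.decode("cp1252")   # yields curly quotes and the euro sign

print(repr(as_latin1))
print(repr(as_cp1252))

# Bytes in the range 0xA0-0xFF decode identically in both encodings.
accented = b"caf\xe9"
same_result = accented.decode("latin-1") == accented.decode("cp1252")
print(same_result)
```

If a file decoded as ISO-8859-1 shows invisible or odd characters where quotation marks should be, trying `encoding='cp1252'` is a reasonable next step.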

In [6]:
import pandas as pd

avengers = pd.read_csv("avengers.csv", encoding='ISO-8859-1')
avengers.head(5)

| Unnamed: 0 | URL | Name/Alias | Appearances | Current? | Gender | Probationary Introl | Full/Reserve Avengers Intro | Year | Years since joining | Honorary | ... | Return1 | Death2 | Return2 | Death3 | Return3 | Death4 | Return4 | Death5 | Return5 | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | http://marvel.wikia.com/Henry_Pym_(Earth-616) | Henry Jonathan "Hank" Pym | 1269 | YES | MALE | | Sep-63 | 1963 | 52 | Full | ... | NO | | | | | | | | | Merged with Ultron in Rage of Ultron Vol. 1. A... |
| 1 | http://marvel.wikia.com/Janet_van_Dyne_(Earth-... | Janet van Dyne | 1165 | YES | FEMALE | | Sep-63 | 1963 | 52 | Full | ... | YES | | | | | | | | | Dies in Secret Invasion V1:I8. Actually was se... |
| 2 | http://marvel.wikia.com/Anthony_Stark_(Earth-616) | Anthony Edward "Tony" Stark | 3068 | YES | MALE | | Sep-63 | 1963 | 52 | Full | ... | YES | | | | | | | | | Death: "Later while under the influence of Imm... |
| 3 | http://marvel.wikia.com/Robert_Bruce_Banner_(E... | Robert Bruce Banner | 2089 | YES | MALE | | Sep-63 | 1963 | 52 | Full | ... | YES | | | | | | | | | Dies in Ghosts of the Future arc. However "he ... |
| 4 | http://marvel.wikia.com/Thor_Odinson_(Earth-616) | Thor Odinson | 2402 | YES | MALE | | Sep-63 | 1963 | 52 | Full | ... | YES | YES | NO | | | | | | | Dies in Fear Itself brought back because that'... |


## Filtering Out Bad Data

Because the data came from a crowdsourced community site, it could contain errors. 

If you plot a histogram of the values in the Year column, which describes the year Marvel introduced each Avenger, you'll immediately notice some oddities. 

For example, quite a few Avengers appear to have been introduced in 1900, which is suspicious: the Avengers weren't introduced in the comic series until the 1960s!

This is clearly a mistake in the data. We only want to keep the Avengers who were introduced in 1960 or later.

In [None]:
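import matplotlib.pyplot as plt

# Uncomment to plot the distribution of introduction years and spot outliers
#avengers['Year'].hist()

# Keep only the rows whose introduction year is 1960 or later
true_avengers = avengers[avengers['Year'] >= 1960]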
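Before dropping rows, it can help to see exactly which years fall below the cutoff and how many rows each accounts for. The sketch below does this on a small hypothetical DataFrame (the names and years are invented for illustration, not rows from `avengers.csv`); the same two expressions should work on the real `avengers` frame.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    "Name/Alias": ["Hero A", "Hero B", "Hero C", "Hero D"],
    "Year": [1900, 1963, 1900, 2012],
})

# How many rows does each pre-1960 year account for?
suspicious = toy.loc[toy["Year"] < 1960, "Year"].value_counts()
print(suspicious)

# Apply the same filter as above and compare row counts
kept = toy[toy["Year"] >= 1960]
print(len(toy), "rows before,", len(kept), "rows after")
```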

## Consolidating Deaths

We're interested in the total number of deaths each character experienced, so we'd like to have a single field containing that information. 

Right now, there are five fields (Death1 to Death5), each of which contains a binary value representing whether a superhero experienced that death or not. For example, a superhero could experience Death1, then Death2, and so on until the writers decided not to bring the character back to life.

We'd like to **combine that information in a single field** so we can perform numerical analysis on it more easily.

Create a new column, `Deaths`, that contains the number of times each superhero died. 

- The possible values for each death field are YES, NO, and NaN for missing data.

- Keep all of the original columns (including Death1 to Death5) and update `true_avengers` with the new `Deaths` column.

In [None]:
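def clean_deaths(row):
    """Count how many of the Death1..Death5 fields are 'YES' for one row."""
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

    for c in columns:
        # 'NO' and NaN (missing data) both mean this death did not happen
        if row[c] == 'YES':
            num_deaths += 1
    return num_deaths

# Copy first so the new column is not written into a slice of avengers,
# which would trigger a SettingWithCopyWarning
true_avengers = true_avengers.copy()
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)

# Later cells refer to the result as new_true_avengers
new_true_avengers = true_avengers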

In [None]:
new_true_avengers['Deaths'].head(5)
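The row-by-row count can also be expressed as a single vectorized comparison. This is a minimal sketch on hypothetical data (the YES/NO/NaN values are invented for illustration); it relies on the fact that comparing NaN to `'YES'` yields `False`, so missing values count as zero deaths.

```python
import numpy as np
import pandas as pd

# Hypothetical death records; dtype=object keeps the all-NaN column comparable
toy = pd.DataFrame({
    "Death1": ["YES", "YES", "NO", np.nan],
    "Death2": ["YES", "NO", np.nan, np.nan],
    "Death3": ["YES", np.nan, np.nan, np.nan],
    "Death4": ["NO", np.nan, np.nan, np.nan],
    "Death5": [np.nan, np.nan, np.nan, np.nan],
}, dtype=object)

death_columns = [f"Death{i}" for i in range(1, 6)]

# True where a field is 'YES'; summing across each row counts the deaths
toy["Deaths"] = (toy[death_columns] == "YES").sum(axis=1)
print(toy["Deaths"].tolist())
```

On a dataset of this size the speed difference is negligible, so the choice between this form and `apply` is mostly a matter of readability.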

## Verifying Years Since Joining

We want to verify that the Years since joining field accurately reflects the Year column. For example, if an Avenger was introduced in the Year 1960, is the Years since joining value for that Avenger 55?

Calculate the number of rows where `Years since joining` is accurate, meaning it equals 2015 minus the `Year` value. This challenge was created in 2015, so use that as the reference year.

Assign the integer value describing the number of rows with a correct value for Years since joining to `joined_accuracy_count`.

In [None]:
# Keep the rows where the stored value equals 2015 minus the introduction year
correct_joined_years = new_true_avengers[new_true_avengers['Years since joining'] == (2015 - new_true_avengers['Year'])]
joined_accuracy_count = len(correct_joined_years)
joined_accuracy_count

### 159 rows have an accurate Years since joining value
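Knowing the count is useful, but seeing the rows that fail the check is often more informative. The sketch below builds the comparison as a boolean mask on hypothetical data (the three rows are invented for illustration), then uses the mask both to count matches and to list mismatches. `REFERENCE_YEAR` is a name introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the last one has a deliberately wrong value
toy = pd.DataFrame({
    "Year": [1963, 1988, 2014],
    "Years since joining": [52, 27, 5],
})

REFERENCE_YEAR = 2015  # the year the challenge was created

# True where the stored value matches the recomputed one
is_accurate = toy["Years since joining"] == (REFERENCE_YEAR - toy["Year"])

# True counts as 1 when summed, so this gives the number of accurate rows
accurate_count = int(is_accurate.sum())

# Inverting the mask lists the rows that need a closer look
mismatches = toy[~is_accurate]
print(accurate_count)
print(mismatches)
```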