![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# Pandas DataFrame exercises


In [1]:
# Import the numpy package under the name np
import numpy as np

# Import the pandas package under the name pd
import pandas as pd

# Import the matplotlib package under the name plt
import matplotlib.pyplot as plt
%matplotlib inline

# Print the pandas version
print(pd.__version__)

1.4.3


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame creation

### Create an empty pandas DataFrame


In [7]:
# This is a very basic way, without including any data, index or columns

empty_df = pd.DataFrame()
print(empty_df)

Empty DataFrame
Columns: []
Index: []


In [9]:
# Another way is to pass None for data, index and columns; note that this yields one row and one column holding None, not a truly empty DataFrame

pd.DataFrame(data=[None],
             index=[None],
             columns=[None])

Unnamed: 0,None
,
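
As a quick check of the two approaches above, the following sketch compares their shapes. The variable names `truly_empty` and `one_none_cell` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# No data, index or columns: zero rows and zero columns
truly_empty = pd.DataFrame()

# A single None value under a None row label and a None column label
one_none_cell = pd.DataFrame(data=[None], index=[None], columns=[None])

# .shape is (rows, columns); .empty is True only when there are no elements
print(truly_empty.shape, truly_empty.empty)
print(one_none_cell.shape, one_none_cell.empty)
```

Only the first call gives a DataFrame whose `.empty` attribute is `True`.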


<img width=400 src="https://cdn.dribbble.com/users/4678/screenshots/1986600/avengers.png"></img>

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Create a `marvel_df` pandas DataFrame with the given marvel data


In [2]:
marvel_data = [
    ['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
    ['Wolverine', 'male', 1974],
    ['Iron Man', 'male', 1963],
    ['Thor', 'male', 1963],
    ['Thing', 'male', 1961],
    ['Mister Fantastic', 'male', 1961],
    ['Hulk', 'male', 1962],
    ['Beast', 'male', 1963],
    ['Invisible Woman', 'female', 1961],
    ['Storm', 'female', 1975],
    ['Namor', 'male', 1939],
    ['Hawkeye', 'male', 1964],
    ['Daredevil', 'male', 1964],
    ['Doctor Strange', 'male', 1963],
    ['Hank Pym', 'male', 1962],
    ['Scarlet Witch', 'female', 1964],
    ['Wasp', 'female', 1963],
    ['Black Widow', 'female', 1964],
    ['Vision', 'male', 1968]
]

In [3]:
# One way: set the data and column names first, then assign the index

marvel_heroes = pd.DataFrame(data=[['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
    ['Wolverine', 'male', 1974],
    ['Iron Man', 'male', 1963],
    ['Thor', 'male', 1963],
    ['Thing', 'male', 1961],
    ['Mister Fantastic', 'male', 1961],
    ['Hulk', 'male', 1962],
    ['Beast', 'male', 1963],
    ['Invisible Woman', 'female', 1961],
    ['Storm', 'female', 1975],
    ['Namor', 'male', 1939],
    ['Hawkeye', 'male', 1964],
    ['Daredevil', 'male', 1964],
    ['Doctor Strange', 'male', 1963],
    ['Hank Pym', 'male', 1962],
    ['Scarlet Witch', 'female', 1964],
    ['Wasp', 'female', 1963],
    ['Black Widow', 'female', 1964],
    ['Vision', 'male', 1968]], columns=['Name','Gender','Year of Birth'])

marvel_heroes.index = marvel_heroes['Name']

marvel_heroes

Unnamed: 0_level_0,Name,Gender,Year of Birth
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spider-Man,Spider-Man,male,1962
Captain America,Captain America,male,1941
Wolverine,Wolverine,male,1974
Iron Man,Iron Man,male,1963
Thor,Thor,male,1963
Thing,Thing,male,1961
Mister Fantastic,Mister Fantastic,male,1961
Hulk,Hulk,male,1962
Beast,Beast,male,1963
Invisible Woman,Invisible Woman,female,1961


In [4]:
# A simpler way: pass the marvel_data variable directly (column labels default to 0, 1, 2)

marvel_df = pd.DataFrame(data=marvel_data)

marvel_df

Unnamed: 0,0,1,2
0,Spider-Man,male,1962
1,Captain America,male,1941
2,Wolverine,male,1974
3,Iron Man,male,1963
4,Thor,male,1963
5,Thing,male,1961
6,Mister Fantastic,male,1961
7,Hulk,male,1962
8,Beast,male,1963
9,Invisible Woman,female,1961


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Add column names to the `marvel_df`
 

In [5]:
# One way: pass the column names when creating the DataFrame

marvel_df =  pd.DataFrame(data=marvel_data, columns=['Name','Gender','Year of Birth'])
marvel_df


Unnamed: 0,Name,Gender,Year of Birth
0,Spider-Man,male,1962
1,Captain America,male,1941
2,Wolverine,male,1974
3,Iron Man,male,1963
4,Thor,male,1963
5,Thing,male,1961
6,Mister Fantastic,male,1961
7,Hulk,male,1962
8,Beast,male,1963
9,Invisible Woman,female,1961


In [6]:
# Another way: assign the names to the columns attribute of an existing DataFrame
col_names = ['name', 'sex', 'first_appearance']

marvel_df.columns = col_names
marvel_df

Unnamed: 0,name,sex,first_appearance
0,Spider-Man,male,1962
1,Captain America,male,1941
2,Wolverine,male,1974
3,Iron Man,male,1963
4,Thor,male,1963
5,Thing,male,1961
6,Mister Fantastic,male,1961
7,Hulk,male,1962
8,Beast,male,1963
9,Invisible Woman,female,1961


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Add index names to the `marvel_df` (use the character name as index)


In [7]:
# set_index returns a new DataFrame; since the result is not assigned here, marvel_df keeps its default integer index

marvel_df.set_index(['name'])
marvel_df

Unnamed: 0,name,sex,first_appearance
0,Spider-Man,male,1962
1,Captain America,male,1941
2,Wolverine,male,1974
3,Iron Man,male,1963
4,Thor,male,1963
5,Thing,male,1961
6,Mister Fantastic,male,1961
7,Hulk,male,1962
8,Beast,male,1963
9,Invisible Woman,female,1961


In [8]:
marvel_df.index = marvel_df['name']
marvel_df

Unnamed: 0_level_0,name,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spider-Man,Spider-Man,male,1962
Captain America,Captain America,male,1941
Wolverine,Wolverine,male,1974
Iron Man,Iron Man,male,1963
Thor,Thor,male,1963
Thing,Thing,male,1961
Mister Fantastic,Mister Fantastic,male,1961
Hulk,Hulk,male,1962
Beast,Beast,male,1963
Invisible Woman,Invisible Woman,female,1961
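
The sketch below shows, on a two-row sample, that `set_index` returns a new DataFrame and (with its default `drop=True`) moves the column into the index in one step. The names `heroes` and `indexed` are illustrative, not from the notebook.

```python
import pandas as pd

# Small illustrative frame (a subset of the notebook's data)
heroes = pd.DataFrame(
    [['Spider-Man', 'male', 1962], ['Storm', 'female', 1975]],
    columns=['name', 'sex', 'first_appearance'],
)

# set_index returns a new DataFrame; `heroes` itself is unchanged
indexed = heroes.set_index('name')

print(list(heroes.columns))   # still includes 'name'
print(list(indexed.columns))  # 'name' has moved into the index
print(indexed.index.name)
```

Assigning the result back (`heroes = heroes.set_index('name')`) would make the change stick, and no separate column drop would be needed afterwards.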


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Drop the name column as it's now the index

In [9]:
# One way: use the columns keyword; drop returns a new DataFrame, so marvel_df is unchanged until we reassign it

marvel_df.drop(columns=['name'])


Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
Thing,male,1961
Mister Fantastic,male,1961
Hulk,male,1962
Beast,male,1963
Invisible Woman,female,1961


In [10]:
#marvel_df = marvel_df.drop(columns=['name'])
marvel_df = marvel_df.drop(['name'], axis=1)
marvel_df

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
Thing,male,1961
Mister Fantastic,male,1961
Hulk,male,1962
Beast,male,1963
Invisible Woman,female,1961


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Drop 'Namor' and 'Hank Pym' rows


In [11]:
# Here we drop the rows without affecting the existing DataFrame
marvel_df.drop(['Namor', 'Hank Pym'], axis=0)

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
Thing,male,1961
Mister Fantastic,male,1961
Hulk,male,1962
Beast,male,1963
Invisible Woman,female,1961


In [12]:
# Here we reassign the result so the rows are actually removed from marvel_df
marvel_df = marvel_df.drop(['Namor', 'Hank Pym'], axis=0)
marvel_df

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
Thing,male,1961
Mister Fantastic,male,1961
Hulk,male,1962
Beast,male,1963
Invisible Woman,female,1961
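
A minimal sketch of how `drop` behaves with row labels, using a three-row sample frame. The names `heroes`, `trimmed` and `trimmed_again` are illustrative assumptions.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'sex': ['male', 'male', 'female'], 'first_appearance': [1939, 1962, 1975]},
    index=['Namor', 'Hank Pym', 'Storm'],
)

# drop returns a copy without the given labels; `heroes` keeps all three rows
trimmed = heroes.drop(['Namor', 'Hank Pym'], axis=0)

# Dropping a label that is already gone raises KeyError unless errors='ignore'
trimmed_again = trimmed.drop(['Namor'], errors='ignore')

print(len(heroes), len(trimmed), len(trimmed_again))
```

This matters when re-running a notebook cell: a second `drop` of the same labels fails with `KeyError` unless `errors='ignore'` is passed.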


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame selection, slicing and indexation

### Show the first 5 elements on `marvel_df`
 

In [13]:
# There are multiple ways, one very simple is by using head

marvel_df.head()

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963


In [14]:
# Other options: use loc to select by character name (label), or iloc to select by integer position

#marvel_df.loc[['Spider-Man', 'Captain America', 'Wolverine', 'Iron Man', 'Thor'], :] # bad!
#marvel_df.loc['Spider-Man': 'Thor', :]
#marvel_df.iloc[0:5, :]
#marvel_df.iloc[0:5,]
marvel_df.iloc[:5,]
#marvel_df.head()

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
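
One difference between the commented alternatives above is worth a short sketch: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position. The sample frame and variable names are illustrative.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'first_appearance': [1962, 1941, 1974, 1963]},
    index=['Spider-Man', 'Captain America', 'Wolverine', 'Iron Man'],
)

by_label = heroes.loc['Spider-Man':'Wolverine']  # end label is included
by_position = heroes.iloc[0:2]                   # end position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```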


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Show the last 5 elements on `marvel_df`


In [15]:
# We can use iloc with a negative start and no end to slice the last 5 rows

#marvel_df.tail()
marvel_df.iloc[-5:]

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Doctor Strange,male,1963
Scarlet Witch,female,1964
Wasp,female,1963
Black Widow,female,1964
Vision,male,1968


In [16]:
# Also we can use tail to show the last 5 elements

#marvel_df.loc[['Doctor Strange', 'Scarlet Witch', 'Wasp', 'Black Widow', 'Vision'], :] # bad!
#marvel_df.loc['Doctor Strange':'Vision', :]
#marvel_df.iloc[-5:,]
marvel_df.tail()

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Doctor Strange,male,1963
Scarlet Witch,female,1964
Wasp,female,1963
Black Widow,female,1964
Vision,male,1968


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Show just the sex of the first 5 elements on `marvel_df`

In [17]:
# We can use iloc for spatial position of rows and columns

marvel_df.iloc[:5,0]

name
Spider-Man         male
Captain America    male
Wolverine          male
Iron Man           male
Thor               male
Name: sex, dtype: object

In [18]:
# If we use to_frame we can show the information in the DataFrame form

#marvel_df.iloc[:5,]['sex'].to_frame()
marvel_df.iloc[:5,].sex.to_frame()
#marvel_df.head().sex.to_frame()

Unnamed: 0_level_0,sex
name,Unnamed: 1_level_1
Spider-Man,male
Captain America,male
Wolverine,male
Iron Man,male
Thor,male
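
As a possible alternative to `to_frame`, selecting with a list of column names also returns a DataFrame. The sketch below uses an illustrative two-row frame; the variable names are assumptions.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'sex': ['male', 'female'], 'first_appearance': [1962, 1975]},
    index=['Spider-Man', 'Storm'],
)

as_series = heroes['sex']                # single label -> Series
as_frame = heroes[['sex']]               # list of labels -> DataFrame
via_to_frame = heroes['sex'].to_frame()  # Series converted to a DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```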


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Show the first_appearance of all middle elements on `marvel_df` 

In [19]:
# We use iloc to slice by spatial position 

marvel_df.iloc[1:-1,].first_appearance.to_frame()

Unnamed: 0_level_0,first_appearance
name,Unnamed: 1_level_1
Captain America,1941
Wolverine,1974
Iron Man,1963
Thor,1963
Thing,1961
Mister Fantastic,1961
Hulk,1962
Beast,1963
Invisible Woman,1961
Storm,1975


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Show the first and last elements on `marvel_df`


In [20]:
# This is one way

marvel_df.iloc[[0, -1]]


Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Vision,male,1968


In [21]:
# This is another way, indicating the column names

marvel_df.iloc[[0, -1],][['sex', 'first_appearance']]
#marvel_df.iloc[[0, -1],]

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Vision,male,1968


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame manipulation and operations

### Modify the `first_appearance` of 'Vision' to year 1964

In [22]:
marvel_df.loc['Vision', 'first_appearance'] = 1964
marvel_df

Unnamed: 0_level_0,sex,first_appearance
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man,male,1962
Captain America,male,1941
Wolverine,male,1974
Iron Man,male,1963
Thor,male,1963
Thing,male,1961
Mister Fantastic,male,1961
Hulk,male,1962
Beast,male,1963
Invisible Woman,female,1961


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Add a new column to `marvel_df` called 'years_since' with the years since `first_appearance`


In [23]:
# We can calculate the difference in years by subtracting the year in the first_appearance column from the current year

marvel_df['years_since'] = 2023 - marvel_df['first_appearance']
marvel_df

Unnamed: 0_level_0,sex,first_appearance,years_since
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spider-Man,male,1962,61
Captain America,male,1941,82
Wolverine,male,1974,49
Iron Man,male,1963,60
Thor,male,1963,60
Thing,male,1961,62
Mister Fantastic,male,1961,62
Hulk,male,1962,61
Beast,male,1963,60
Invisible Woman,female,1961,62


In [24]:
# Same calculation with 2018 as the reference year (this overwrites the column)
marvel_df['years_since'] = 2018 - marvel_df['first_appearance']
marvel_df

Unnamed: 0_level_0,sex,first_appearance,years_since
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spider-Man,male,1962,56
Captain America,male,1941,77
Wolverine,male,1974,44
Iron Man,male,1963,55
Thor,male,1963,55
Thing,male,1961,57
Mister Fantastic,male,1961,57
Hulk,male,1962,56
Beast,male,1963,55
Invisible Woman,female,1961,57
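
Both cells above hard-code the reference year. A possible variation, shown here as a sketch on an illustrative sample frame, reads the current year from the system clock with the standard `datetime` module, so the values depend on when the code runs.

```python
import datetime

import pandas as pd

heroes = pd.DataFrame(
    {'first_appearance': [1962, 1941, 1975]},
    index=['Spider-Man', 'Captain America', 'Storm'],
)

# Take the year from today's date instead of typing it in
current_year = datetime.date.today().year

# The subtraction is applied element-wise to the whole column
heroes['years_since'] = current_year - heroes['first_appearance']
print(heroes)
```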


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame boolean arrays (also called masks)

### Given the `marvel_df` pandas DataFrame, make a mask showing the female characters


In [35]:
# Here the boolean mask is built inline and passed to loc, which returns the matching rows

female_heroes = marvel_df.loc[marvel_df['sex'] == 'female']
female_heroes

Unnamed: 0_level_0,sex,first_appearance,years_since
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Invisible Woman,female,1961,57
Storm,female,1975,43
Scarlet Witch,female,1964,54
Wasp,female,1963,55
Black Widow,female,1964,54


In [None]:
# We can also create the mask directly without using loc

mask = marvel_df['sex'] == 'female'
mask
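
A mask is just a boolean Series aligned with the DataFrame's index, so it can be counted or inverted before it is used for selection. The sketch below illustrates this on a small sample; `heroes` and `non_female` are illustrative names.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'sex': ['male', 'female', 'female'], 'first_appearance': [1962, 1975, 1964]},
    index=['Spider-Man', 'Storm', 'Black Widow'],
)

mask = heroes['sex'] == 'female'

print(mask.dtype)       # boolean Series, one value per row
print(int(mask.sum()))  # True counts as 1, so sum() counts the matches

# ~ inverts the mask, selecting every row that did not match
non_female = heroes[~mask]
print(list(non_female.index))
```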

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, get the male characters


In [None]:
# your code goes here


In [None]:
mask = marvel_df['sex'] == 'male'

marvel_df[mask]

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, get the characters with `first_appearance` after 1970


In [None]:
# your code goes here


In [None]:
mask = marvel_df['first_appearance'] > 1970

marvel_df[mask]

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, get the female characters with `first_appearance` after 1970

In [None]:
# your code goes here


In [None]:
mask = (marvel_df['sex'] == 'female') & (marvel_df['first_appearance'] > 1970)

marvel_df[mask]
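
The parentheses around each comparison are required because `&` binds more tightly than `==` and `>` in Python. As a sketch of an alternative, `DataFrame.query` expresses the same filter as a string; the sample frame and variable names below are illustrative.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'sex': ['male', 'female', 'female'], 'first_appearance': [1974, 1975, 1964]},
    index=['Wolverine', 'Storm', 'Black Widow'],
)

# Element-wise AND of two boolean Series; each comparison is parenthesized
combined = (heroes['sex'] == 'female') & (heroes['first_appearance'] > 1970)
via_mask = heroes[combined]

# The same condition written as a query string
via_query = heroes.query("sex == 'female' and first_appearance > 1970")

print(list(via_mask.index))
```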

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame summary statistics

### Show basic statistics of `marvel_df`

In [None]:
# your code goes here


In [None]:
marvel_df.describe()
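
By default `describe` summarizes only the numeric columns of a mixed DataFrame. The sketch below, on an illustrative sample, shows the `include='all'` option, which also reports count, unique, top and freq for text columns such as `sex`.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'sex': ['male', 'female', 'female'], 'first_appearance': [1962, 1975, 1964]},
    index=['Spider-Man', 'Storm', 'Black Widow'],
)

numeric_only = heroes.describe()             # numeric columns only
everything = heroes.describe(include='all')  # text columns as well

print(list(numeric_only.columns))
print(list(everything.columns))
```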

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, show the mean value of `first_appearance`

In [None]:
# your code goes here


In [None]:

#np.mean(marvel_df.first_appearance)
marvel_df.first_appearance.mean()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, show the min value of `first_appearance`


In [None]:
# your code goes here


In [None]:
#np.min(marvel_df.first_appearance)
marvel_df.first_appearance.min()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Given the `marvel_df` pandas DataFrame, get the characters with the min value of `first_appearance`

In [None]:
# your code goes here


In [None]:
mask = marvel_df['first_appearance'] == marvel_df.first_appearance.min()
marvel_df[mask]
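
A related shortcut is `idxmin`, which returns the index label of the minimum value. The sketch below, using an illustrative sample with a tie, shows why the mask approach is still needed when several rows share the minimum: `idxmin` reports only the first one.

```python
import pandas as pd

heroes = pd.DataFrame(
    {'first_appearance': [1961, 1961, 1975]},
    index=['Thing', 'Mister Fantastic', 'Storm'],
)

# Label of the first row holding the minimum value
earliest_label = heroes['first_appearance'].idxmin()

# The mask keeps every row tied for the minimum
all_earliest = heroes[heroes['first_appearance'] == heroes['first_appearance'].min()]

print(earliest_label)
print(list(all_earliest.index))
```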

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## DataFrame basic plotting

### Reset index names of `marvel_df`


In [None]:
# your code goes here


In [None]:
marvel_df = marvel_df.reset_index()

marvel_df

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Plot the values of `first_appearance`


In [None]:
# your code goes here


In [None]:
#plt.plot(marvel_df.index, marvel_df.first_appearance)
marvel_df.first_appearance.plot()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Plot a histogram (plot.hist) with values of `first_appearance`


In [None]:
# your code goes here


In [None]:

plt.hist(marvel_df.first_appearance)
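
The heading mentions `plot.hist`, which is the pandas accessor rather than the matplotlib function used above. A minimal sketch of that variant follows, on an illustrative Series. The `Agg` backend line is an assumption that lets the code run without a display; inside a notebook with `%matplotlib inline` it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here for running outside a notebook

import pandas as pd

years = pd.Series([1962, 1941, 1974, 1963, 1963, 1961], name='first_appearance')

# Series.plot.hist draws the histogram and returns the matplotlib Axes
ax = years.plot.hist(bins=5, title='first_appearance')
ax.set_xlabel('year')
```

Because the call returns the Axes object, labels and titles can be adjusted afterwards with the usual matplotlib methods.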

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
