# Chapter 2: Data Ingestion & Variables

In [2]:
low_memory=False
import numpy as np
import pandas as pd

## 2.1 Introduction and Motivation

Before we can analyze data, we must learn how to bring it into our environment and explore it efficiently. `pandas` is a Python library for working with tabular data, which it stores in objects called DataFrames. In this notebook we will load datasets into `pandas`, index rows and columns, and perform basic operations to prepare data for further analysis. These are essential skills for cleaning and understanding real-world data.

## 2.2 Problem Setting

To make this concrete, we will explore a dataset of Pokemon. We will load a table of Pokemon attributes and practice selecting, filtering, and combining data to answer simple questions.

## 2.3 Data Ingestion

`pandas` makes loading data simple. A relative file name is resolved against the current working directory, which for a notebook is usually the folder the notebook runs in. If your file lives elsewhere, pass the path to the file instead. When reading a CSV file, you can specify options such as the separator and which column to use as the index. Here we read a CSV named `Pokemon.csv` from the notebook folder.

In the code we call `pd.read_csv('Pokemon.csv', sep=';', index_col=0)`. 
- The `sep` argument tells `pandas` which character separates values in the file — here a semicolon is used instead of the more common comma. 
- The `index_col=0` argument tells `pandas` to use the first column of the file as the DataFrame index, that is the row labels. This is useful when the CSV already includes an ID or name column that you want to use as the row index rather than the default numeric row numbers.
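
The two arguments described above can be tried without the real file. The sketch below is only a stand-in: the three sample rows and the reduced column set are assumptions copied loosely from this chapter's preview, and `io.StringIO` takes the place of `Pokemon.csv`.

```python
import io

import pandas as pd

# A tiny semicolon-separated sample standing in for Pokemon.csv (assumed layout).
raw = "#;Name;Type 1;HP\n1;Bulbasaur;Grass;45\n2;Ivysaur;Grass;60\n4;Charmander;Fire;39\n"

# sep=";" splits on semicolons; index_col=0 turns the "#" column into the row labels.
sample = pd.read_csv(io.StringIO(raw), sep=";", index_col=0)

print(sample.index.name)      # name of the column that became the index
print(list(sample.columns))   # the remaining columns
```

Without `index_col=0`, the `#` column would stay an ordinary column and `pandas` would number the rows 0, 1, 2 itself.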

In [3]:
pokemon = pd.read_csv("Pokemon.csv", sep= ";", index_col = 0)

The data is now loaded into memory but we cannot see it yet. Use the `head()` method to preview the first few rows of the table and get a quick sense of the columns and values.

In [4]:
pokemon.head()

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [41]:
pokemon.head(10)

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False


In [43]:
pokemon.tail(3)

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


Beautiful! We can even check exactly what type pandas has stored our data in:

In [6]:
type(pokemon)

pandas.core.frame.DataFrame

`pandas` has two core data structures. A **DataFrame** stores the whole table. A **Series** stores a single column, and it is what you get back when you select one column from a DataFrame.

In [7]:
type(pokemon["Name"])

pandas.core.series.Series

Once everything is loaded, we can gain some initial insights into the data. For example, we can see how many records a dataset contains by checking its length.

In [8]:
len(pokemon)

800

You can also get a list of all the column names your dataset contains.

In [9]:
pokemon.columns

Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

### Question 1: How can we see exactly how many records and columns/dimensions a dataset has in one single line of code?

In [10]:
pokemon.shape

(800, 12)

The pokemon dataset has 800 records and 12 columns.

## 2.4 Basic data selection, subsetting and slicing

Now that the data is loaded, we can start selecting parts of it. `pandas` uses familiar indexing ideas similar to lists and dictionaries. To get a single column from a DataFrame, put the column name in square brackets. To get multiple columns, pass a list of names inside the square brackets. Use `loc` to select by label and `iloc` to select by integer position.

In [11]:
pokemon["Attack"]

#
1       49
2       62
3       82
3      100
4       52
      ... 
719    100
719    160
720    110
720    160
721    110
Name: Attack, Length: 800, dtype: int64

In [44]:
pokemon.Attack

#
1       49
2       62
3       82
3      100
4       52
      ... 
719    100
719    160
720    110
720    160
721    110
Name: Attack, Length: 800, dtype: int64
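
Both spellings above return the same Series, but attribute access only works when the column name is a valid Python identifier. A minimal sketch on a hypothetical two-row table (the values are illustrative, the column names mirror the Pokemon dataset):

```python
import pandas as pd

# Hypothetical mini table; the column names mirror the Pokemon dataset.
df = pd.DataFrame({"Attack": [49, 62], "Type 1": ["Grass", "Grass"]})

same = df["Attack"].equals(df.Attack)   # both spellings return the same Series
type_1 = df["Type 1"]                   # a name containing a space needs brackets

print(same)
print(type_1.tolist())
```

Because `Type 1` contains a space, `df.Type 1` is a syntax error, so bracket notation is the safer habit for column names you do not control.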

If you wish to select multiple columns, pass a list of column names inside the square brackets. This is why you see double brackets: the outer pair indexes the DataFrame and the inner pair builds the list.

In [12]:
pokemon[["Attack", "HP"]]

#,Attack,HP
1,49,45
2,62,60
3,82,80
3,100,80
4,52,39
...,...,...
719,100,50
719,160,50
720,110,80
720,160,80


Returning the first 6 rows:

In [50]:
pokemon[0:6]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


You can select a single value from a specific row and column using `loc` with row and column labels. This is useful when you know the row index and the column name.
- `4`: row label (here the Pokemon ID)
- `"Name"`: column name

In [13]:
pokemon.loc[4,"Name"]

'Charmander'

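If you don't know the column name, or prefer to work with integer positions, use the `iloc` indexer instead. Positions start at 0, so `iloc[4, 1]` is the fifth row and the second column (`Type 1`; the index itself is not counted as a column).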

In [14]:
pokemon.iloc[4,1]

'Fire'
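
The difference between labels and positions is easiest to see when the index does not start at 0. A minimal sketch, assuming a hypothetical three-row table indexed by Pokemon ID:

```python
import pandas as pd

# Hypothetical mini table indexed by Pokemon ID (labels 1, 2 and 4).
df = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur", "Charmander"],
     "Type 1": ["Grass", "Grass", "Fire"]},
    index=[1, 2, 4],
)

by_label = df.loc[4, "Name"]    # row labelled 4, column "Name"
by_position = df.iloc[2, 0]     # third row, first column (positions start at 0)

print(by_label, by_position)
```

Both expressions reach the same cell: `loc` looks up the label `4`, while `iloc` counts rows and columns from zero.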

### Question 2: Show 'Type 1' of only the Pokemon with an ID up to 100.

In [15]:
pokemon.loc[:100,"Type 1"]

#
1         Grass
2         Grass
3         Grass
3         Grass
4          Fire
         ...   
96      Psychic
97      Psychic
98        Water
99        Water
100    Electric
Name: Type 1, Length: 109, dtype: object
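
Note that the result above contains 109 rows, not 100. `loc` slices by label and includes the end label, and several Pokemon (the Mega forms) share an ID. To take a fixed number of rows, slice by position instead. A minimal sketch on a hypothetical Series with a repeated label:

```python
import pandas as pd

# Hypothetical index with a repeated label, as with Mega forms sharing an ID.
s = pd.Series(["Grass", "Grass", "Grass", "Grass", "Fire"],
              index=[1, 2, 3, 3, 4], name="Type 1")

by_label = s.loc[:3]       # every row whose label is at most 3, end label included
by_position = s.iloc[:3]   # the first three rows, end position excluded

print(len(by_label), len(by_position))
```

Applied to the full dataset, `pokemon["Type 1"].iloc[:100]` or `pokemon["Type 1"].head(100)` would give exactly the first 100 rows.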

You can also slice by column labels, for example to select a range of columns between two labels. Unlike ordinary Python slices, label slices with `loc` include both endpoints.
- `:` -> select all rows
- `'Name':'Type 2'` -> all columns from `Name` through `Type 2`

In [16]:
pokemon.loc[:,'Name':'Type 2']

#,Name,Type 1,Type 2
1,Bulbasaur,Grass,Poison
2,Ivysaur,Grass,Poison
3,Venusaur,Grass,Poison
3,VenusaurMega Venusaur,Grass,Poison
4,Charmander,Fire,
...,...,...,...
719,Diancie,Rock,Fairy
719,DiancieMega Diancie,Rock,Fairy
720,HoopaHoopa Confined,Psychic,Ghost
720,HoopaHoopa Unbound,Psychic,Dark


In [52]:
pokemon.loc[2:5,:]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


If you wish, you can easily create new columns to add to your existing dataframe.

In [17]:
pokemon["My favorite"] = False
pokemon

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,My favorite
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True,False
719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True,False
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True,False
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True,False


If you no longer need a column, you can remove it with the `del` statement or use `drop()` to remove one or more columns from the DataFrame.

In [18]:
del pokemon["My favorite"]
pokemon

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
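
The cell above uses `del`, which removes one column in place. The `drop()` method mentioned earlier can remove several columns at once and returns a new DataFrame. A minimal sketch on a hypothetical table (the helper columns are illustrative):

```python
import pandas as pd

# Hypothetical mini table with two helper columns to remove.
df = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"],
                   "My favorite": [False, False],
                   "Caught": [True, False]})

# drop() returns a new DataFrame; df itself stays unchanged unless reassigned.
trimmed = df.drop(columns=["My favorite", "Caught"])

print(list(trimmed.columns))
print(list(df.columns))
```

Reassigning the result (`df = df.drop(columns=[...])`) makes the removal stick while keeping the original available until then.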


You can also use mathematical comparisons and logical operators to narrow down rows. For example you can ask for rows where a column value is less than a threshold or where two conditions both hold true.

In [19]:
pokemon["Attack"] < 50

#
1       True
2      False
3      False
3      False
4      False
       ...  
719    False
719    False
720    False
720    False
721    False
Name: Attack, Length: 800, dtype: bool

The code `pokemon["Attack"] < 50` returns a Series of boolean values. Each position answers the question "is this Attack value under 50?" You can use this boolean Series as a mask to select only the rows where the condition is true. For example `pokemon[pokemon["Attack"] < 50]` returns all Pokemon with an Attack below 50.

In [20]:
pokemon[pokemon["Attack"] < 50]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
10,Caterpie,Bug,,195,45,30,35,20,20,45,1,False
11,Metapod,Bug,,205,50,20,55,25,25,30,1,False
12,Butterfree,Bug,Flying,395,60,45,50,90,80,70,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
678,MeowsticMale,Psychic,,466,74,48,76,83,81,104,6,False
678,MeowsticFemale,Psychic,,466,74,48,76,83,81,104,6,False
684,Swirlix,Fairy,,341,62,48,66,59,57,49,6,False
694,Helioptile,Electric,Normal,289,44,38,33,61,43,70,6,False


Alternatively, we can refer to the column as an attribute of the DataFrame.

In [21]:
pokemon[pokemon.Attack < 50]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
10,Caterpie,Bug,,195,45,30,35,20,20,45,1,False
11,Metapod,Bug,,205,50,20,55,25,25,30,1,False
12,Butterfree,Bug,Flying,395,60,45,50,90,80,70,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
678,MeowsticMale,Psychic,,466,74,48,76,83,81,104,6,False
678,MeowsticFemale,Psychic,,466,74,48,76,83,81,104,6,False
684,Swirlix,Fairy,,341,62,48,66,59,57,49,6,False
694,Helioptile,Electric,Normal,289,44,38,33,61,43,70,6,False


To combine several conditions, we can use the numpy functions `logical_and` and `logical_or`, which compare two boolean Series element by element.

In [22]:
pokemon.loc[np.logical_and(pokemon["HP"] > 70, pokemon["Attack"] < 200)]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
717,Yveltal,Dark,Flying,680,126,131,95,131,98,99,6,True
718,Zygarde50% Forme,Dragon,Ground,600,108,100,121,81,95,95,6,True
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
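
The same combination can also be written with the `&` (and) and `|` (or) operators, which many readers find shorter. A minimal sketch on a hypothetical four-row table; the thresholds are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical mini table.
df = pd.DataFrame({"HP": [45, 80, 78, 39], "Attack": [49, 82, 130, 52]})

with_numpy = df[np.logical_and(df["HP"] > 70, df["Attack"] < 100)]

# Parentheses are required: & and | bind more tightly than > and <.
with_operator = df[(df["HP"] > 70) & (df["Attack"] < 100)]
either = df[(df["HP"] > 70) | (df["Attack"] < 50)]

print(with_numpy.equals(with_operator))
print(len(with_operator), len(either))
```

Python's plain `and` and `or` do not work here, because they expect a single truth value rather than one value per row.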


By now you can see how this can quickly get rather complicated when combining everything!

### Question 3: How many Pokemon are Legendary?

In [23]:
len(pokemon[pokemon["Legendary"]])

65
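
Another way to count is to sum the boolean column directly, since `True` counts as 1 and `False` as 0. A minimal sketch on a hypothetical column:

```python
import pandas as pd

# Hypothetical boolean column.
legendary = pd.Series([False, True, False, True, True], name="Legendary")

count = int(legendary.sum())      # True counts as 1, False as 0
share = float(legendary.mean())   # fraction of rows that are True

print(count, share)
```

On the full dataset, `pokemon["Legendary"].sum()` should give the same count as the `len(...)` expression above.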

### Question 4: What is the name of the first flying (Type 2) Pokemon which has a Speed of 60? Return only the Name of this pokemon.

In [24]:
pokemon.loc[np.logical_and(pokemon["Type 2"] == "Flying", pokemon["Speed"] == 60)].iloc[0, 0]

"Farfetch'd"

## 2.5 Merging observations and merging variables

Appending, merging and concatenating are ways to combine DataFrames in pandas (closely related to unions and joins in SQL). Covering every option of the underlying pandas functions would take far too long, as there are many ways to combine DataFrames. Instead we will explore only two frequently encountered situations: adding new rows and adding new columns.

To do this, let's make two subsets of our pokemon dataframe.

In [25]:
poke1 = pokemon.loc[:7, :'Attack']
poke1

#,Name,Type 1,Type 2,Total,HP,Attack
1,Bulbasaur,Grass,Poison,318,45,49
2,Ivysaur,Grass,Poison,405,60,62
3,Venusaur,Grass,Poison,525,80,82
3,VenusaurMega Venusaur,Grass,Poison,625,80,100
4,Charmander,Fire,,309,39,52
5,Charmeleon,Fire,,405,58,64
6,Charizard,Fire,Flying,534,78,84
6,CharizardMega Charizard X,Fire,Dragon,634,78,130
6,CharizardMega Charizard Y,Fire,Flying,634,78,104
7,Squirtle,Water,,314,44,48


In [26]:
poke2 = pokemon.loc[:7, 'Defense':'Speed']
poke2

#,Defense,Sp. Atk,Sp. Def,Speed
1,49,65,65,45
2,63,80,80,60
3,83,100,100,80
3,123,122,120,80
4,43,60,50,65
5,58,80,65,80
6,78,109,85,100
6,111,130,85,100
6,78,159,115,100
7,65,50,64,43


In [27]:
pd.concat([poke1, poke2])

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
1,Bulbasaur,Grass,Poison,318.0,45.0,49.0,,,,
2,Ivysaur,Grass,Poison,405.0,60.0,62.0,,,,
3,Venusaur,Grass,Poison,525.0,80.0,82.0,,,,
3,VenusaurMega Venusaur,Grass,Poison,625.0,80.0,100.0,,,,
4,Charmander,Fire,,309.0,39.0,52.0,,,,
5,Charmeleon,Fire,,405.0,58.0,64.0,,,,
6,Charizard,Fire,Flying,534.0,78.0,84.0,,,,
6,CharizardMega Charizard X,Fire,Dragon,634.0,78.0,130.0,,,,
6,CharizardMega Charizard Y,Fire,Flying,634.0,78.0,104.0,,,,
7,Squirtle,Water,,314.0,44.0,48.0,,,,


This is not exactly what we were looking for. By default `pd.concat` stacks the DataFrames on top of each other, so the columns that one DataFrame lacks are filled with missing values. To place them side by side instead, we need to specify the axis along which to concatenate:
- axis=0 (rows, the default)
- axis=1 (columns)

In [28]:
poke3 = pd.concat([poke1, poke2], axis=1)
poke3

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,Charmander,Fire,,309,39,52,43,60,50,65
5,Charmeleon,Fire,,405,58,64,58,80,65,80
6,Charizard,Fire,Flying,534,78,84,78,109,85,100
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100
7,Squirtle,Water,,314,44,48,65,50,64,43
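
The effect of `axis` is easy to check from the shape of the result. A minimal sketch, assuming two hypothetical frames that share row labels but have different columns:

```python
import pandas as pd

# Two hypothetical frames with the same row labels but different columns.
left = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"]}, index=[1, 2])
right = pd.DataFrame({"Speed": [45, 60]}, index=[1, 2])

stacked = pd.concat([left, right], axis=0)        # rows of right go below left
side_by_side = pd.concat([left, right], axis=1)   # columns of right go beside left

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

Stacking doubles the number of rows and leaves gaps where a frame has no value for a column, while the side-by-side result lines the rows up by index label and has no gaps.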


### Question 5: Create a new selection of the Pokemon dataset containing the same columns as the DataFrame above, but for the Pokemon with IDs 10 to 15. Append it to `poke3`.

In [54]:
poke4 = pokemon.loc[10:15, :'Speed']
poke4

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
10,Caterpie,Bug,,195,45,30,35,20,20,45
11,Metapod,Bug,,205,50,20,55,25,25,30
12,Butterfree,Bug,Flying,395,60,45,50,90,80,70
13,Weedle,Bug,Poison,195,40,35,30,20,20,50
14,Kakuna,Bug,Poison,205,45,25,50,25,25,35
15,Beedrill,Bug,Poison,395,65,90,40,45,80,75
15,BeedrillMega Beedrill,Bug,Poison,495,65,150,40,15,80,145


In [30]:
pd.concat([poke3, poke4], axis=0)

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,Charmander,Fire,,309,39,52,43,60,50,65
5,Charmeleon,Fire,,405,58,64,58,80,65,80
6,Charizard,Fire,Flying,534,78,84,78,109,85,100
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100
7,Squirtle,Water,,314,44,48,65,50,64,43
10,Caterpie,Bug,,195,45,30,35,20,20,45
11,Metapod,Bug,,205,50,20,55,25,25,30
12,Butterfree,Bug,Flying,395,60,45,50,90,80,70
13,Weedle,Bug,Poison,195,40,35,30,20,20,50
14,Kakuna,Bug,Poison,205,45,25,50,25,25,35
15,Beedrill,Bug,Poison,395,65,90,40,45,80,75
15,BeedrillMega Beedrill,Bug,Poison,495,65,150,40,15,80,145


Adding a new column is super easy! Simply define the name of the new column and add your values.

In [31]:
caught = [True, False, False, False, True, False, False, False, False, True]
poke3["Caught"] = caught
poke3

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Caught
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,True
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,False
4,Charmander,Fire,,309,39,52,43,60,50,65,True
5,Charmeleon,Fire,,405,58,64,58,80,65,80,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,False
7,Squirtle,Water,,314,44,48,65,50,64,43,True
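
One detail worth knowing: a list assigned as a new column must contain exactly one value per row. A minimal sketch on a hypothetical three-row table (the column names `Caught` and `Seen` are illustrative):

```python
import pandas as pd

# Hypothetical mini table with three rows.
df = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur", "Venusaur"]})

df["Caught"] = [True, False, False]   # three values for three rows: accepted

try:
    df["Seen"] = [True, False]        # two values for three rows
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected)
print(list(df.columns))
```

A single scalar, as in `pokemon["My favorite"] = False` earlier, is the exception: it is repeated for every row.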


Dropping a column is even easier.

In [32]:
del poke3["Caught"]
poke3

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,Charmander,Fire,,309,39,52,43,60,50,65
5,Charmeleon,Fire,,405,58,64,58,80,65,80
6,Charizard,Fire,Flying,534,78,84,78,109,85,100
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100
7,Squirtle,Water,,314,44,48,65,50,64,43


## 2.6 Variable types

Keeping a good overview of your data types is important when working with a dataset. Luckily, pandas helps once again: the `dtypes` attribute lists the type of every column.

In [33]:
pokemon.dtypes

Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

Each column has its own data type. pandas infers the type from the values, so a new column computed from existing data can end up with a different type. Here, adding the float `0.00` to the integer column `Attack` produces a `float64` column.

In [34]:
pokemon["Attack2"] = pokemon["Attack"]+0.00
pokemon.dtypes

Name           object
Type 1         object
Type 2         object
Total           int64
HP              int64
Attack          int64
Defense         int64
Sp. Atk         int64
Sp. Def         int64
Speed           int64
Generation      int64
Legendary        bool
Attack2       float64
dtype: object
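
Adding `0.00` changes the type as a side effect. When the goal is a type change, the `astype` method states that intent directly. A minimal sketch on a hypothetical integer column (the new column names are illustrative):

```python
import pandas as pd

# Hypothetical integer column, created with an explicit int64 type.
df = pd.DataFrame({"Attack": pd.Series([49, 62, 82], dtype="int64")})

df["Attack_float"] = df["Attack"].astype("float64")   # explicit conversion
df["Attack_half"] = df["Attack"] / 2                  # true division also yields floats

print(df.dtypes)
```

The original `Attack` column keeps its integer type; only the derived columns are floats.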

## 2.7 Date and time

Dates and times are a common source of problems. There are many ways to write them down, and time zones add another layer of complexity, so comparing two times can be surprisingly hard.

To bring some order to the chaos, pandas provides a few tools. The `pd.to_datetime` function converts date and time values, such as strings, into `Timestamp` objects.

In [35]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S', utc = True)

Timestamp('2025-02-17 09:10:12+0000', tz='UTC')

If we want to attach a time zone to a timestamp that has none, we can do so with the `tz_localize` method.

In [36]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S').tz_localize(tz = "Europe/Brussels")

Timestamp('2025-02-17 09:10:12+0100', tz='Europe/Brussels')

As said before, there are a lot of timezones. You can get a handy list of the ones available here: https://gist.github.com/heyalexej/8bf688fd67d7199be4a1682b3eec7568

If your time is in one time zone and you wish to convert it to another, the `astimezone` method provides the solution.

In [37]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S').tz_localize(tz = "Europe/Brussels").astimezone("UTC")

Timestamp('2025-02-17 08:10:12+0000', tz='UTC')
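
Localizing and converting do different things, which the sketch below tries to make explicit. It uses `tz_convert`, which to my knowledge behaves the same as `astimezone` on a pandas `Timestamp`; the variable names are illustrative.

```python
import pandas as pd

naive = pd.Timestamp("2025-02-17 09:10:12")       # no time zone attached

brussels = naive.tz_localize("Europe/Brussels")   # same clock time, now zone-aware
utc = brussels.tz_convert("UTC")                  # same instant, shown in UTC

print(brussels)
print(utc)
print(brussels == utc)
```

`tz_localize` declares which zone the clock reading belongs to, whereas `tz_convert` keeps the instant fixed and changes the clock reading, so the two zone-aware values compare as equal.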

Now you may be wondering what all those percent signs are. They are format codes that describe how the input string is laid out, for example `%Y` for a four-digit year and `%H` for the hour on a 24-hour clock. Once again, there are many different options. A good cheat sheet can be found here: https://strftime.org/

In [38]:
pd.to_datetime('9:10 17/2/2025', format='%H:%M %d/%m/%Y', utc = True)

Timestamp('2025-02-17 09:10:00+0000', tz='UTC')

You can even use this notation to get the date as an actual string!

In [39]:
pd.to_datetime('9:10 17/2/2025', format='%H:%M %d/%m/%Y', utc = True).strftime("%B %d, %Y")

'February 17, 2025'
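
In practice dates usually arrive as a whole column rather than a single string. The sketch below applies the same format codes to a hypothetical Series; the sample values, including the deliberately broken entry, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column of date strings; the last entry is deliberately malformed.
raw = pd.Series(["17/02/2025", "03/11/2024", "not a date"])

# errors="coerce" turns entries that do not match the format into NaT
# (pandas' missing value for timestamps) instead of raising an error.
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(dates.isna().tolist())
print(dates.dt.year.tolist())
```

The `.dt` accessor exposes the date parts (`year`, `month`, `day`, and so on) for every row at once.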

When you combine it all you start seeing exactly why dates can be such a mess, especially knowing that each datasource you use will most likely have a different format.

In [40]:
pd.to_datetime('22:10 17/2/2025', format='%H:%M %d/%m/%Y').tz_localize( tz = "US/Central").astimezone("UTC").strftime("%B %d, %Y")

'February 18, 2025'