This notebook presents Pandas code alongside the analogous SQL queries.

### Selection
We can treat a Pandas dataframe as a database table and select one, several, or all of its columns. Loading the CSV and displaying the dataframe is the equivalent of selecting every column:

In [1]:
import pandas as pd

# SELECT * FROM pokemon
pokemon = pd.read_csv('./Data/pokemon.csv')
pokemon

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


Selecting specific columns:

In [2]:
# SELECT name, hp FROM pokemon
pokemon[['Name', 'HP']]

,Name,HP
0,Bulbasaur,45
1,Ivysaur,60
2,Venusaur,80
3,VenusaurMega Venusaur,80
4,Charmander,39
...,...,...
795,Diancie,50
796,DiancieMega Diancie,50
797,HoopaHoopa Confined,80
798,HoopaHoopa Unbound,80


Selecting only the first few records:

In [3]:
# SELECT name, hp FROM pokemon LIMIT 5
pokemon[['Name', 'HP']][:5]

,Name,HP
0,Bulbasaur,45
1,Ivysaur,60
2,Venusaur,80
3,VenusaurMega Venusaur,80
4,Charmander,39


Selecting records after an offset:

In [5]:
# SELECT name, hp FROM pokemon LIMIT 5 OFFSET 5
pokemon[['Name', 'HP']][5:10]

,Name,HP
5,Charmeleon,58
6,Charizard,78
7,CharizardMega Charizard X,78
8,CharizardMega Charizard Y,78
9,Squirtle,44
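
The slices above can also be written with `head` and `iloc`, which make the LIMIT and OFFSET roles explicit. The following is a minimal sketch on a small hypothetical `sample` dataframe (its name and the `limit` and `offset` variables are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Hypothetical six-row sample; values loosely follow the rows shown above.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur',
             'Charmander', 'Charmeleon', 'Charizard'],
    'HP': [45, 60, 80, 39, 58, 78],
})

limit, offset = 2, 3

# SELECT name, hp FROM sample LIMIT 2
first_rows = sample[['Name', 'HP']].head(limit)

# SELECT name, hp FROM sample LIMIT 2 OFFSET 3
# iloc slices by position, so it works regardless of the index labels.
page = sample[['Name', 'HP']].iloc[offset:offset + limit]

print(first_rows)
print(page)
```

Because `iloc` is positional, this form keeps working after the dataframe has been filtered or re-sorted and the index labels are no longer consecutive.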


To select only unique values:

In [24]:
# SELECT DISTINCT `Type 1` FROM pokemon
pokemon['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [27]:
# SELECT DISTINCT `Type 1`, `Type 2` FROM pokemon
pokemon[['Type 1', 'Type 2']].drop_duplicates()

### Conditions

In [6]:
# SELECT name, hp FROM pokemon WHERE hp > 100 LIMIT 5
pokemon[pokemon['HP'] > 100][['Name', 'HP']][:5]

,Name,HP
44,Jigglypuff,115
45,Wigglytuff,140
96,Muk,105
120,Rhydon,105
121,Chansey,250


In [9]:
# SELECT name, hp FROM pokemon WHERE hp BETWEEN 50 AND 100 LIMIT 5
pokemon[(pokemon['HP'] >= 50) & (pokemon['HP'] <= 100)][['Name', 'HP']][:5]

,Name,HP
1,Ivysaur,60
2,Venusaur,80
3,VenusaurMega Venusaur,80
5,Charmeleon,58
6,Charizard,78
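
The range condition above can also be expressed with `Series.between`, which is inclusive on both ends by default and so matches SQL `BETWEEN`, or with `DataFrame.query`. A minimal sketch on a hypothetical `sample` dataframe (the name and rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample; values loosely follow the rows shown above.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', 'Chansey'],
    'HP': [45, 60, 80, 39, 250],
})

# SELECT name, hp FROM sample WHERE hp BETWEEN 50 AND 100
# between() includes both bounds by default.
in_range = sample[sample['HP'].between(50, 100)][['Name', 'HP']]

# The same filter written as a query string.
in_range_query = sample.query('HP >= 50 and HP <= 100')[['Name', 'HP']]

print(in_range)
```

Both forms return the same rows; which one reads better is largely a matter of taste.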


In [15]:
# SELECT name, hp, `Type 1` FROM pokemon WHERE `Type 1` IN ('Grass', 'Bug') LIMIT 10
pokemon[pokemon['Type 1'].isin(['Grass', 'Bug'])][['Name', 'HP', 'Type 1']][:10]

,Name,HP,Type 1
0,Bulbasaur,45,Grass
1,Ivysaur,60,Grass
2,Venusaur,80,Grass
3,VenusaurMega Venusaur,80,Grass
13,Caterpie,45,Bug
14,Metapod,50,Bug
15,Butterfree,60,Bug
16,Weedle,40,Bug
17,Kakuna,45,Bug
18,Beedrill,65,Bug
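
The filters above chain two indexing steps: first the rows, then the columns. As an alternative sketch, `.loc` accepts the row mask and the column list in a single call, and `~` inverts the mask to give `NOT IN`. The `sample` dataframe and its rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample used only for this illustration.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Caterpie', 'Squirtle', 'Weedle'],
    'HP': [45, 39, 45, 44, 40],
    'Type 1': ['Grass', 'Fire', 'Bug', 'Water', 'Bug'],
})

# SELECT name, hp, `Type 1` FROM sample
# WHERE `Type 1` IN ('Grass', 'Bug') LIMIT 2
mask = sample['Type 1'].isin(['Grass', 'Bug'])
result = sample.loc[mask, ['Name', 'HP', 'Type 1']].head(2)

# SELECT name FROM sample WHERE `Type 1` NOT IN ('Grass', 'Bug')
others = sample.loc[~mask, 'Name']

print(result)
print(others)
```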


### Aggregates
Pandas has dedicated methods for common aggregates such as count, min, max, and mean:

In [17]:
# SELECT COUNT(name) FROM pokemon
pokemon['Name'].count()

800

In [18]:
# SELECT COUNT(*) FROM pokemon
len(pokemon)

800

In [20]:
# SELECT MAX(hp) FROM pokemon
pokemon['HP'].max()

255

In [21]:
# SELECT AVG(hp) FROM pokemon
pokemon['HP'].mean()

69.25875
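
Several aggregates can be computed in one call by passing a list of names to `agg`. The sketch below also shows why `count()` and `len()` correspond to `COUNT(column)` and `COUNT(*)` respectively: `count()` skips missing values, as SQL skips NULLs. The `sample` dataframe is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical sample; None marks a missing Type 2.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Chansey', 'Shedinja'],
    'HP': [45, 39, 250, 1],
    'Type 2': ['Poison', None, None, 'Ghost'],
})

# SELECT COUNT(hp), MIN(hp), MAX(hp), AVG(hp) FROM sample
summary = sample['HP'].agg(['count', 'min', 'max', 'mean'])

# SELECT COUNT(`Type 2`) FROM sample  (missing values are not counted)
non_null = sample['Type 2'].count()

# SELECT COUNT(*) FROM sample  (every row is counted)
total = len(sample)

print(summary)
print(non_null, total)
```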

### Grouping

Calling `groupby` on a dataframe returns a `DataFrameGroupBy` object, on which aggregates are then computed per group.
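
To make the grouped object more concrete, here is a minimal sketch on a hypothetical `sample` dataframe (an assumption for illustration). It shows that grouping alone computes no aggregate; the result appears only when a reduction such as `mean` is requested:

```python
import pandas as pd

# Hypothetical sample used only for this illustration.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Charmander', 'Squirtle'],
    'Type 1': ['Grass', 'Grass', 'Fire', 'Water'],
    'HP': [45, 60, 39, 44],
})

grouped = sample.groupby('Type 1')   # no aggregate is computed yet
print(type(grouped).__name__)        # DataFrameGroupBy
print(grouped.ngroups)               # 3 distinct Type 1 values

# Each group is an ordinary dataframe holding the rows for one key.
grass = grouped.get_group('Grass')

# The aggregate is evaluated only when a reduction is requested.
mean_hp = grouped['HP'].mean()
print(mean_hp)
```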

In [28]:
# SELECT `Type 1`, COUNT(`Type 1`) FROM pokemon GROUP BY `Type 1`
pokemon.groupby('Type 1')['Type 1'].count()

Type 1
Bug          69
Dark         31
Dragon       32
Electric     44
Fairy        17
Fighting     27
Fire         52
Flying        4
Ghost        32
Grass        70
Ground       32
Ice          24
Normal       98
Poison       28
Psychic      57
Rock         44
Steel        27
Water       112
Name: Type 1, dtype: int64

In [30]:
# SELECT `Type 1`, AVG(hp) FROM pokemon GROUP BY `Type 1`
pokemon.groupby('Type 1')['HP'].mean()

Type 1
Bug         56.884058
Dark        66.806452
Dragon      83.312500
Electric    59.795455
Fairy       74.117647
Fighting    69.851852
Fire        69.903846
Flying      70.750000
Ghost       64.437500
Grass       67.271429
Ground      73.781250
Ice         72.000000
Normal      77.275510
Poison      67.250000
Psychic     70.631579
Rock        65.363636
Steel       65.222222
Water       72.062500
Name: HP, dtype: float64

In [35]:
# SELECT `Type 1`, `Type 2`, AVG(hp) FROM pokemon GROUP BY `Type 1`, `Type 2`
pokemon.groupby(['Type 1', 'Type 2'])['HP'].mean()

Type 1  Type 2  
Bug     Electric    60.000000
        Fighting    80.000000
        Fire        70.000000
        Flying      63.000000
        Ghost        1.000000
                      ...    
Water   Ice         90.000000
        Poison      61.666667
        Psychic     87.000000
        Rock        70.750000
        Steel       84.000000
Name: HP, Length: 136, dtype: float64
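
One difference from SQL is worth noting here. By default `groupby` leaves out rows whose group key is missing, whereas SQL `GROUP BY` puts NULL keys into a group of their own. Pokemon without a `Type 2`, such as Charmander, are therefore absent from the result above. The sketch below uses a hypothetical `sample` dataframe and the `dropna=False` argument (available in pandas 1.1 and later) to get the SQL-like behaviour:

```python
import pandas as pd

# Hypothetical sample; None marks a missing Type 2.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Charmander', 'Charmeleon'],
    'Type 1': ['Grass', 'Grass', 'Fire', 'Fire'],
    'Type 2': ['Poison', 'Poison', None, None],
    'HP': [45, 60, 39, 58],
})

# Default: rows with a missing group key are left out of the result.
default_means = sample.groupby(['Type 1', 'Type 2'])['HP'].mean()

# dropna=False keeps the missing key as its own group,
# which is closer to what SQL GROUP BY returns.
sql_like_means = sample.groupby(['Type 1', 'Type 2'], dropna=False)['HP'].mean()

print(default_means)
print(sql_like_means)
```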

To apply different aggregates to different columns, we use the `agg` function:

In [36]:
# SELECT `Type 1`, `Type 2`, AVG(hp), COUNT(name) FROM pokemon GROUP BY `Type 1`, `Type 2`
pokemon.groupby(['Type 1', 'Type 2']).agg({
    'HP': 'mean',
    'Name': 'count'
})

Type 1,Type 2,HP,Name
Bug,Electric,60.000000,2
Bug,Fighting,80.000000,2
Bug,Fire,70.000000,2
Bug,Flying,63.000000,14
Bug,Ghost,1.000000,1
...,...,...,...
Water,Ice,90.000000,3
Water,Poison,61.666667,3
Water,Psychic,87.000000,5
Water,Rock,70.750000,4
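
The result above keeps the group keys in the index and names the output columns after the source columns. As a possible refinement, named aggregation lets each output column be given an alias, much like `AS` in SQL, and `reset_index` turns the keys back into ordinary columns. The `sample` dataframe and the aliases `avg_hp` and `n` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample used only for this illustration.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Charizard', 'CharizardMega Charizard Y'],
    'Type 1': ['Grass', 'Grass', 'Fire', 'Fire'],
    'Type 2': ['Poison', 'Poison', 'Flying', 'Flying'],
    'HP': [45, 60, 78, 78],
})

# SELECT `Type 1`, `Type 2`, AVG(hp) AS avg_hp, COUNT(name) AS n
# FROM sample GROUP BY `Type 1`, `Type 2`
summary = (
    sample.groupby(['Type 1', 'Type 2'])
          .agg(avg_hp=('HP', 'mean'), n=('Name', 'count'))
          .reset_index()   # move the group keys from the index into columns
)

print(summary)
```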
