<a href="https://colab.research.google.com/github/bwsi-hadr/01-Intro-to-python/blob/master/01b_Intro_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas

Pandas is a package built on top of numpy that adds many features to make data analysis more convenient. One main addition is row and column labels, which make pandas function much like spreadsheet software. It also includes tools for much more complex computations than plain numpy allows.

In [0]:
import pandas as pd
import numpy as np
# by default, pandas only shows the beginning and end of a large dataframe
# here we set it to 999 so that we see more of the dataframe
pd.options.display.max_rows = 999 

## Importing data
There are a lot of built-in functions for importing data to pandas, for example from files, databases, and web sources. For a full list of the supported methods, see [the documentation on pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

In this example, we're going to load data formatted as a csv from a website URL. In particular, it is a direct link to a file hosted in our course github repository.

The data we're loading is a simplified pokedex which includes the number, name, base stats, types, abilities, and colors of each pokemon. We'll be using this data to explore the functionality of pandas. 

The primary class of pandas is the `DataFrame`, which functions like a spreadsheet (as in Excel). It allows for named rows and columns. Data types can vary between columns, but entries within the same column should have the same type.
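
As a small illustration of mixed column types, here is a sketch that builds a tiny dataframe by hand. The three rows copy values from the pokedex used in this notebook; the name `toy_df` is made up for this example and is not part of the course data.

```python
import pandas as pd

# A hand-built dataframe; `toy_df` is a hypothetical name used only for this sketch
toy_df = pd.DataFrame(
    {
        'Pokemon': ['Bulbasaur', 'Charmander', 'Squirtle'],  # text column
        'HP': [45, 39, 44],                                   # integer column
        'Type I': ['Grass', 'Fire', 'Water'],                 # text column
    },
    index=[1, 4, 7],  # row labels
)

print(toy_df.dtypes)  # each column reports its own data type
```

Each column keeps a single type, while different columns are free to differ.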

In [3]:
# URL for data
pokedex_url = 'https://raw.githubusercontent.com/bwsi-hadr/01-Intro-to-python/master/pokedex.csv?token=AK6STNOEBTXIUO4XFKRDLWK5CODHG' 
# instead of a url, you could instead put a path to a csv file on the local system where python is running. However, since colab is on the cloud, it's easier to use a url
pokedex_df = pd.read_csv(pokedex_url, index_col='Index') # create a dataframe from a csv, set the column titled "Index" as the row labels

pokedex_df = pokedex_df.fillna('') #fill in the NaN values with empty string 
pokedex_df # prints out the dataframe

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color
1.0,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,Overgrow,,Chlorophyll,Green
2.0,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,Overgrow,,Chlorophyll,Green
3.0,Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green
4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red
5.0,Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red
6.0,Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red
7.0,Squirtle,44,48,65,60,54,43,314,Water,,Torrent,,Rain Dish,Blue
8.0,Wartortle,59,63,80,65,80,58,405,Water,,Torrent,,Rain Dish,Blue
9.0,Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue
10.0,Caterpie,45,30,35,20,20,45,195,Bug,,Shield Dust,,Run Away,Green
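
`pd.read_csv` also accepts a local file path or any file-like object in place of a URL. The sketch below reads a few rows from an in-memory string so that it runs without network access. The three rows are copied from the table above, and `csv_text` and `mini_df` are names invented for this example.

```python
import io
import pandas as pd

# Three rows copied from the pokedex table, stored as CSV text
csv_text = """Index,Pokemon,HP,Type I,Type II
1,Bulbasaur,45,Grass,Poison
4,Charmander,39,Fire,
7,Squirtle,44,Water,
"""

# io.StringIO makes the string behave like an open file
mini_df = pd.read_csv(io.StringIO(csv_text), index_col='Index')

print(mini_df['Type II'].isna().sum())  # missing secondary types are read as NaN
mini_df = mini_df.fillna('')            # replace NaN with empty strings, as above
print(mini_df)
```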


## Indexing
Like numpy, you can index a dataframe to select elements or subsets of entries. There are a number of ways to index pandas dataframes. The two following methods are recommended:
`.loc[]` and `.iloc[]`

### .loc[ ] name-based indexing
The first way, using `.loc[a,b]` indexes based on the __values__ of the row index and column names. This allows us to get entries where the row index value equals `a` and the column name equals `b`.

In [0]:
# Get the entry for index value 745.1, and the value in the 'Pokemon' column (which gives the name)
pokedex_df.loc[745.1, ['Pokemon']]

Pokemon    Lycanroc (midnight)
Name: 745.1, dtype: object
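
Note that passing a list of column names, as in `['Pokemon']`, returns a `Series` even when the list has only one element, while passing a bare column name returns the single value. A minimal sketch on a hand-made dataframe (the name `toy_df` is illustrative; its two rows are copied from the pokedex):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'Pokemon': ['Bulbasaur', 'Charmander'], 'HP': [45, 39]},
    index=[1, 4],
)

as_series = toy_df.loc[4, ['Pokemon']]  # list of columns -> Series with one entry
as_value = toy_df.loc[4, 'Pokemon']     # bare column name -> the value itself

print(type(as_series).__name__)  # Series
print(as_value)                  # Charmander
```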

In [0]:
# You can get multiple column values for a single entry
pokedex_df.loc[250,['Pokemon','Type I','Type II']]

Pokemon    Swinub
Type I        Ice
Type II    Ground
Name: 250.0, dtype: object

In [0]:
# you can use the : notation to slice multiple rows and columns; unlike numpy, .loc slices include both endpoints
pokedex_df.loc[690:700,'Pokemon':'Total']

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total
690.0,Skrelp,50,60,60,60,60,30,320
691.0,Dragalge,65,75,90,97,123,44,494
692.0,Clauncher,50,53,62,58,63,44,330
693.0,Clawitzer,71,73,88,120,89,59,500
694.0,Helioptile,44,38,33,61,43,70,289
695.0,Heliolisk,62,55,52,109,94,109,481
696.0,Tyrunt,58,89,77,45,45,48,362
697.0,Tyrantrum,82,121,119,69,59,71,521
698.0,Amaura,77,59,50,67,63,46,362
699.0,Aurorus,123,77,72,99,92,58,521


In [0]:
# Similarly to numpy, you can use the : notation without indices on a side to mean "from beginning" or "to end"
pokedex_df.loc[:10,"Ability I":]

Index,Ability I,Ability II,Hidden Ability,Color
1.0,Overgrow,,Chlorophyll,Green
2.0,Overgrow,,Chlorophyll,Green
3.0,Overgrow,,Chlorophyll,Green
4.0,Blaze,,Solar Power,Red
5.0,Blaze,,Solar Power,Red
6.0,Blaze,,Solar Power,Red
7.0,Torrent,,Rain Dish,Blue
8.0,Torrent,,Rain Dish,Blue
9.0,Torrent,,Rain Dish,Blue
10.0,Shield Dust,,Run Away,Green


In [8]:
# and : by itself gives all of the entries in that dimension
pokedex_df.loc[317,:]

Pokemon               Shedinja
HP                           1
Atk                         90
Def                         45
SpAtk                       30
SpDef                       30
Speed                       40
Total                      236
Type I                     Bug
Type II                  Ghost
Ability I         Wonder Guard
Ability II                    
Hidden Ability                
Color                    Brown
Nickname                   NaN
Name: 317.0, dtype: object

### .iloc[ ] integer indexing

The second way, using `.iloc[i,j]`, uses numpy-style integer positions to get a value. In this case, `.iloc` takes integer values from 0 up to one less than the number of entries in that dimension.

Like numpy, `.iloc[i,j]` gives the entry at the $i^{th}$ row and $j^{th}$ column, counting from 0.

In [126]:
# The : style indexing also works for iloc
pokedex_df.iloc[25:30, 0:8]

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total
27.0,Raichu,60,90,55,90,80,100,475
28.0,Sandshrew,50,75,85,20,30,40,300
29.0,Sandslash,75,100,110,45,55,65,450
30.0,Nidoran♀,55,47,52,40,40,41,275
31.0,Nidorina,70,62,67,55,55,56,365
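
One difference between the two indexers is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as numpy does. The sketch below shows this on made-up data (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: labels 10..50 sit at integer positions 0..4
toy_df = pd.DataFrame({'value': ['a', 'b', 'c', 'd', 'e']},
                      index=[10, 20, 30, 40, 50])

by_label = toy_df.loc[20:40]    # label slice: includes both 20 and 40
by_position = toy_df.iloc[1:3]  # position slice: includes 1 and 2, excludes 3

print(list(by_label.index))     # [20, 30, 40]
print(list(by_position.index))  # [20, 30]
```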


## Adding columns and changing values
You can add another column like:
```
dataframe['NewColumnName'] = value
```
`value` can be a single value, an array of the same size as the index of `dataframe`, a pandas `Series`, or a dictionary mapping `index: value` for entries in the dataframe index.
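
Here is a short sketch of three of these options on a made-up dataframe. The names `toy_df`, `Generation`, and `DoubleHP` are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'HP': [45, 39, 44]}, index=[1, 4, 7])

toy_df['Generation'] = 1                      # single value, repeated for every row
toy_df['DoubleHP'] = np.array([90, 78, 88])   # array with one element per row
toy_df['Nickname'] = pd.Series({4: 'Abby'})   # Series, matched on index labels

print(toy_df)
```

Rows whose index label is missing from the `Series` (here 1 and 7) receive a missing value in the new column.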

In [4]:
pokedex_df['Nickname'] = None
pokedex_df

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
1.0,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,Overgrow,,Chlorophyll,Green,
2.0,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,Overgrow,,Chlorophyll,Green,
3.0,Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green,
4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,
5.0,Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red,
6.0,Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red,
7.0,Squirtle,44,48,65,60,54,43,314,Water,,Torrent,,Rain Dish,Blue,
8.0,Wartortle,59,63,80,65,80,58,405,Water,,Torrent,,Rain Dish,Blue,
9.0,Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue,
10.0,Caterpie,45,30,35,20,20,45,195,Bug,,Shield Dust,,Run Away,Green,


In [131]:
# you can change the value of an entry using indexing and assignment
pokedex_df.loc[4,'Nickname'] = 'Abby' # indexing using .loc[]: pokedex entry 4, 'Nickname' column
pokedex_df.iloc[113,-1] = 'Terry' # indexing using .iloc[]: integer position 113 (the 114th row, counting from 0), last column
pokedex_df.iloc[[3,113],:] # show values

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,Abby
129.0,Tangela,65,55,115,100,40,60,435,Grass,,Chlorophyll,Leaf Guard,Regenerator,Blue,Terry



## Changing index column
You can change the index column of the dataframe with `.set_index()` or reset it to default numerical indexing with `.reset_index()`.

Note, however, that these operations return a __copy__ of the original dataframe and do not change it "in place" (i.e. the original object is left unchanged).

In [15]:
# reset index to be default numerical index
reindexed_pokedex = pokedex_df.reset_index()
reindexed_pokedex

Unnamed: 0,Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
0,1.0,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,Overgrow,,Chlorophyll,Green,
1,2.0,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,Overgrow,,Chlorophyll,Green,
2,3.0,Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green,
3,4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,Abby
4,5.0,Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red,
5,6.0,Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red,
6,7.0,Squirtle,44,48,65,60,54,43,314,Water,,Torrent,,Rain Dish,Blue,
7,8.0,Wartortle,59,63,80,65,80,58,405,Water,,Torrent,,Rain Dish,Blue,
8,9.0,Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue,
9,10.0,Caterpie,45,30,35,20,20,45,195,Bug,,Shield Dust,,Run Away,Green,


In [18]:
# Change the index to the Pokemon name column
pokedex_indexed_by_name = pokedex_df.set_index('Pokemon')
pokedex_indexed_by_name

Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,Overgrow,,Chlorophyll,Green,
Ivysaur,60,62,63,80,80,60,405,Grass,Poison,Overgrow,,Chlorophyll,Green,
Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green,
Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,Abby
Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red,
Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red,
Squirtle,44,48,65,60,54,43,314,Water,,Torrent,,Rain Dish,Blue,
Wartortle,59,63,80,65,80,58,405,Water,,Torrent,,Rain Dish,Blue,
Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue,
Caterpie,45,30,35,20,20,45,195,Bug,,Shield Dust,,Run Away,Green,


In [19]:
# note that the original dataframe was not affected
pokedex_df

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
1.0,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,Overgrow,,Chlorophyll,Green,
2.0,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,Overgrow,,Chlorophyll,Green,
3.0,Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green,
4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,Abby
5.0,Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red,
6.0,Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red,
7.0,Squirtle,44,48,65,60,54,43,314,Water,,Torrent,,Rain Dish,Blue,
8.0,Wartortle,59,63,80,65,80,58,405,Water,,Torrent,,Rain Dish,Blue,
9.0,Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue,
10.0,Caterpie,45,30,35,20,20,45,195,Bug,,Shield Dust,,Run Away,Green,
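
If you do want the change to stick, a common pattern is to assign the result back to the same name. A minimal sketch with hand-made data (`toy_df` is a made-up name, not the pokedex itself):

```python
import pandas as pd

toy_df = pd.DataFrame({'Pokemon': ['Bulbasaur', 'Charmander'], 'HP': [45, 39]})

toy_df.set_index('Pokemon')            # result is discarded; toy_df is unchanged
print(list(toy_df.columns))            # ['Pokemon', 'HP']

toy_df = toy_df.set_index('Pokemon')   # keep the new dataframe under the old name
print(list(toy_df.columns))            # ['HP']
print(toy_df.loc['Charmander', 'HP'])  # 39
```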


## Logical indexing
You can test conditionals on a column, for example, testing if the value of the column is greater than a value. What's returned is a boolean column of the same size, with a value of `True` wherever it is true, and `False` otherwise. (This can similarly be done in numpy).



In [14]:
pokedex_df['Total'] > 520

Index
1.0      False
2.0      False
3.0       True
4.0      False
5.0      False
6.0       True
7.0      False
8.0      False
9.0       True
10.0     False
11.0     False
12.0     False
13.0     False
14.0     False
15.0     False
16.0     False
17.0     False
18.0     False
19.0     False
20.0     False
21.0     False
22.0     False
23.0     False
24.0     False
26.0     False
27.0     False
28.0     False
29.0     False
30.0     False
31.0     False
32.0     False
33.0     False
34.0     False
35.0     False
37.0     False
38.0     False
39.0     False
40.0     False
42.0     False
43.0     False
44.0     False
45.0     False
47.0     False
48.0     False
49.0     False
51.0     False
52.0     False
53.0     False
54.0     False
55.0     False
56.0     False
57.0     False
58.0     False
59.0     False
60.0     False
61.0     False
62.0     False
63.0     False
64.0      True
65.0     False
66.0     False
67.0     False
69.0     False
70.0     False
71.0     False
72.0     False
73.0

If you take that boolean column and use it to index the dataframe, you can get all of the entries that satisfy that condition.

In [13]:
pokedex_df.loc[pokedex_df['Total']>=520] # all of the pokemon with total base stats of at least 520

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
3.0,Venusaur,80,82,83,100,100,80,525,Grass,Poison,Overgrow,,Chlorophyll,Green,
6.0,Charizard,78,84,78,109,85,100,534,Fire,Flying,Blaze,,Solar Power,Red,
9.0,Blastoise,79,83,100,85,105,78,530,Water,,Torrent,,Rain Dish,Blue,
64.0,Arcanine,90,110,80,100,80,95,555,Fire,,Intimidate,Flash Fire,Justified,Brown,
99.0,Cloyster,50,90,180,85,45,70,520,Water,Ice,Shell Armor,Skill Link,Overcoat,Purple,
112.0,Exeggutor,95,95,85,125,65,55,520,Grass,Psychic,Chlorophyll,,Harvest,Yellow,
138.0,Starmie,60,75,85,100,85,115,520,Water,Psychic,Illuminate,Natural Cure,Analytic,Purple,
154.0,Gyarados,95,125,79,60,100,81,540,Water,Flying,Intimidate,,Moxie,Blue,
155.0,Lapras,130,85,80,85,95,60,535,Water,Ice,Water Absorb,Shell Armor,Hydration,Blue,
158.0,Vaporeon,130,65,60,110,95,65,525,Water,,Water Absorb,,Hydration,Blue,
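
It can help to store the boolean column in a variable first, since the same mask can then be counted and reused. The sketch below uses three rows copied from the pokedex; `toy_df`, `mask`, and `strong_names` are names chosen for this example.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'Pokemon': ['Venusaur', 'Charmander', 'Blastoise'], 'Total': [525, 309, 530]},
    index=[3, 4, 9],
)

mask = toy_df['Total'] >= 520  # boolean Series aligned with the row index
print(mask.sum())              # True counts as 1, so this is the number of matches: 2

# Select the matching rows and only the 'Pokemon' column in one step
strong_names = toy_df.loc[mask, 'Pokemon']
print(list(strong_names))      # ['Venusaur', 'Blastoise']
```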


In [25]:
# Get all of the pokemon of a given color
pokedex_df.loc[pokedex_df['Color']=='Black']

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
174.0,Snorlax,160,110,65,65,110,30,540,Normal,,Immunity,Thick Fat,Gluttony,Black,
162.0,Umbreon,95,65,110,60,130,65,525,Dark,,Synchronize,,Inner Focus,Black,
226.0,Murkrow,60,85,42,85,42,91,405,Dark,Flying,Insomnia,Super Luck,Prankster,Black,
230.0,Unown,48,72,48,72,48,48,336,Psychic,,Levitate,,,Black,
244.0,Sneasel,55,95,55,35,75,115,430,Dark,Ice,Inner Focus,Keen Eye,Pickpocket,Black,
260.0,Houndour,45,60,30,80,50,65,330,Dark,Fire,Early Bird,Flash Fire,Unnerve,Black,
261.0,Houndoom,75,90,50,110,80,95,500,Dark,Fire,Early Bird,Flash Fire,Unnerve,Black,
328.0,Mawile,50,85,85,55,55,50,380,Steel,Fairy,Hyper Cutter,Intimidate,Sheer Force,Black,
352.0,Spoink,60,25,35,70,80,60,330,Psychic,,Thick Fat,Own Tempo,Gluttony,Black,
363.0,Seviper,73,100,60,100,60,65,458,Poison,,Shed Skin,,Infiltrator,Black,


### Compound boolean queries
You can use boolean operations (OR, AND, NOT) to combine conditions

We will look for pokemon with specific types to demonstrate compound queries

### OR queries
In the following example, we query for all pokemon who have at least part "Dragon" typing.

The vertical line | represents the element-wise "OR" operation. "OR" operations require either or both statements on either side of the operation to be true.

Thus, the below query 

```query = (pokedex_df['Type I']=='Dragon') | (pokedex_df['Type II']=='Dragon')```

asks, for each row, if 'Type I' is equal to "Dragon" OR 'Type II' is equal to "Dragon"

In [40]:
query = (pokedex_df['Type I']=='Dragon') | (pokedex_df['Type II']=='Dragon')
pokedex_df.loc[query] # Get all of the entries corresponding to the above query

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
178.0,Dratini,41,64,45,50,50,50,300,Dragon,,Shed Skin,,Marvel Scale,Blue,
179.0,Dragonair,61,84,65,70,70,70,420,Dragon,,Shed Skin,,Marvel Scale,Blue,
180.0,Dragonite,91,134,95,100,100,80,600,Dragon,Flying,Inner Focus,,Multiscale,Brown,
134.0,Kingdra,75,95,95,95,95,85,540,Water,Dragon,Swift Swim,Sniper,Damp,Blue,
356.0,Vibrava,50,70,50,50,50,70,340,Ground,Dragon,Levitate,,,Green,
357.0,Flygon,80,100,80,80,80,100,520,Ground,Dragon,Levitate,,,Green,
361.0,Altaria,75,70,90,70,105,80,490,Dragon,Flying,Natural Cure,,Cloud Nine,Blue,
400.0,Bagon,45,75,60,40,30,50,300,Dragon,,Rock Head,,Sheer Force,Blue,
401.0,Shelgon,65,95,100,60,50,50,420,Dragon,,Rock Head,,Overcoat,White,
402.0,Salamence,95,135,80,110,80,100,600,Dragon,Flying,Intimidate,,Moxie,Blue,
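
When the same value is tested against several columns, one possible alternative to chaining `|` is to compare the columns together and then ask whether any of them matched in each row. This is only a sketch on made-up rows (`toy_df`, `explicit`, and `compact` are illustrative names); it uses the pandas methods `.eq()` and `.any(axis=1)`.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        'Pokemon': ['Dratini', 'Kingdra', 'Charmander'],
        'Type I': ['Dragon', 'Water', 'Fire'],
        'Type II': ['', 'Dragon', ''],
    }
)

# Explicit OR of two column tests, as in the query above
explicit = (toy_df['Type I'] == 'Dragon') | (toy_df['Type II'] == 'Dragon')

# Alternative: compare both columns at once, then check each row for any match
compact = toy_df[['Type I', 'Type II']].eq('Dragon').any(axis=1)

print((explicit == compact).all())           # True
print(list(toy_df.loc[compact, 'Pokemon']))  # ['Dratini', 'Kingdra']
```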


### AND queries
The ampersand `&` character represents the element-wise boolean operation "AND". In order for an "AND" statement to be true, both conditions on either side of the operation must be true.

We can combine boolean operations to construct complex queries: for example, getting all pokemon who are both Dragon and Flying types. The query below looks for pokemon who are EITHER (Primary Type "Flying" AND Secondary Type "Dragon") OR (Primary Type "Dragon" AND Secondary Type "Flying")

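Note that each comparison must be wrapped in parentheses: in Python, `&` and `|` bind more tightly than comparison operators such as `==`, so without parentheses the expression is grouped differently than intended. Parentheses also make the order of the combined operations clear.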
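
To see why the parentheses matter, here is a small sketch on made-up numeric data (`toy_df` is a hypothetical name). The version without parentheses fails because of Python's operator precedence.

```python
import pandas as pd

toy_df = pd.DataFrame({'Atk': [49, 52, 48], 'Def': [49, 43, 65]})

# With parentheses: each comparison runs first, then & combines the results
both = (toy_df['Atk'] > 48) & (toy_df['Def'] > 48)
print(list(both))  # [True, False, False]

# Without parentheses, & binds more tightly than >, so Python reads this as
# toy_df['Atk'] > (48 & toy_df['Def']) > 48, a chained comparison that
# needs a single True/False and therefore fails on a Series
try:
    toy_df['Atk'] > 48 & toy_df['Def'] > 48
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)
```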

In [42]:
# you can use the backslash character \ to continue a statement on the next line
query = ((pokedex_df['Type I']=='Flying') & (pokedex_df['Type II']=='Dragon'))\
        | ((pokedex_df['Type I']=='Dragon') & (pokedex_df['Type II']=='Flying')) 
pokedex_df.loc[query]

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
180.0,Dragonite,91,134,95,100,100,80,600,Dragon,Flying,Inner Focus,,Multiscale,Brown,
361.0,Altaria,75,70,90,70,105,80,490,Dragon,Flying,Natural Cure,,Cloud Nine,Blue,
402.0,Salamence,95,135,80,110,80,100,600,Dragon,Flying,Intimidate,,Moxie,Blue,
413.0,Rayquaza,105,150,90,150,90,95,680,Dragon,Flying,Air Lock,,,Green,
714.0,Noibat,40,30,35,45,40,55,245,Flying,Dragon,Frisk,Infiltrator,Telepathy,Purple,
715.0,Noivern,85,70,80,97,80,123,535,Flying,Dragon,Frisk,Infiltrator,Telepathy,Purple,


### NOT queries
There are two ways to do negation. Consider the following query for selecting all pokemon who are Fire type, but not Flying

```
query = ((pokedex_df['Type I']=='Fire') & ~(pokedex_df['Type II']=='Flying'))\
        | ((pokedex_df['Type I']!='Flying') & (pokedex_df['Type II']=='Fire'))
```

The first way to do negation is with the tilde `~` symbol, which negates the boolean statement that follows it. We use this in `~(pokedex_df['Type II']=='Flying')`. The parentheses determine the order of operations: first we find all entries whose secondary type is "Flying", then the tilde in front negates that result, flipping every True value to False and vice versa.

The second way to do negation is the `!=` operator, which means "not equal to". We use this in `(pokedex_df['Type I']!='Flying')`, which finds all entries whose primary type is not equal to "Flying".
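
For a simple equality test the two forms give the same answer. A quick check on a made-up column of secondary types (`type_two` is an illustrative name):

```python
import pandas as pd

type_two = pd.Series(['Flying', '', 'Poison', 'Flying'])

with_tilde = ~(type_two == 'Flying')   # compare, then flip every True/False
with_not_equal = type_two != 'Flying'  # test "not equal" directly

print(list(with_tilde))                      # [False, True, True, False]
print((with_tilde == with_not_equal).all())  # True
```

The tilde is the more general tool, since it can negate any boolean expression, including a whole compound query.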

In [43]:
query = ((pokedex_df['Type I']=='Fire') & ~(pokedex_df['Type II']=='Flying'))\
        | ((pokedex_df['Type I']!='Flying') & (pokedex_df['Type II']=='Fire'))

pokedex_df.loc[query]

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
4.0,Charmander,39,52,43,60,50,65,309,Fire,,Blaze,,Solar Power,Red,Abby
5.0,Charmeleon,58,64,58,80,65,80,405,Fire,,Blaze,,Solar Power,Red,
39.0,Vulpix,38,41,40,50,65,65,299,Fire,,Flash Fire,,Drought,Brown,
40.0,Ninetales,73,76,75,81,100,100,505,Fire,,Flash Fire,,Drought,Yellow,
63.0,Growlithe,55,70,45,70,50,60,350,Fire,,Intimidate,Flash Fire,Justified,Brown,
64.0,Arcanine,90,110,80,100,80,95,555,Fire,,Intimidate,Flash Fire,Justified,Brown,
83.0,Ponyta,50,85,55,65,65,90,410,Fire,,Run Away,Flash Fire,Flame Body,Yellow,
84.0,Rapidash,65,100,70,80,80,105,500,Fire,,Run Away,Flash Fire,Flame Body,Yellow,
149.0,Magmar,65,95,57,100,85,93,495,Fire,,Flame Body,,Vital Spirit,Red,
160.0,Flareon,65,130,60,95,110,65,525,Fire,,Flash Fire,,Guts,Red,


## Sorting
Dataframes can be sorted by value

In [22]:
pokedex_df.sort_values('Pokemon', ascending=False) # Sort by name descending

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
718.0,Zygarde,108,100,121,81,95,95,600,Dragon,Ground,Aura Break,,,Green,
634.0,Zweilous,72,85,70,65,70,58,420,Dark,Dragon,Hustle,,,Blue,
44.0,Zubat,40,45,35,30,40,55,245,Poison,Flying,Inner Focus,,Infiltrator,Purple,
570.0,Zorua,40,65,40,80,40,65,330,Dark,,Illusion,,,Gray,
571.0,Zoroark,60,105,60,120,60,105,510,Dark,,Illusion,,,Gray,
287.0,Zigzagoon,38,30,41,30,41,60,240,Normal,,Pickup,Gluttony,Quick Feet,Brown,
644.0,Zekrom,100,150,120,120,100,90,680,Dragon,Electric,Teravolt,,,Black,
523.0,Zebstrika,75,100,63,80,63,116,497,Electric,,Lightningrod,Motor Drive,Sap Sipper,Black,
176.0,Zapdos,90,90,85,125,90,100,580,Electric,Flying,Pressure,,Lightningrod,Yellow,
362.0,Zangoose,73,115,60,60,60,90,458,Normal,,Immunity,,Poison Rampage,White,


In [23]:
# You can sort by multiple columns
pokedex_df.sort_values(['Type I','Type II', 'Total'], ascending=False) # Sort by Type I, then Type II, then Total, all in descending order

Index,Pokemon,HP,Atk,Def,SpAtk,SpDef,Speed,Total,Type I,Type II,Ability I,Ability II,Hidden Ability,Color,Nickname
424.0,Empoleon,84,86,88,111,101,60,530,Water,Steel,Torrent,,Defiant,Blue,
565.0,Carracosta,74,108,133,83,65,32,495,Water,Rock,Solid Rock,Sturdy,Swift Swim,Blue,
398.0,Relicanth,100,90,130,45,65,55,485,Water,Rock,Swift Swim,Rock Head,Sturdy,Gray,
253.0,Corsola,55,55,85,65,85,35,380,Water,Rock,Hustle,Natural Cure,Regenerator,Pink,
564.0,Tirtouga,54,78,103,53,45,22,355,Water,Rock,Solid Rock,Sturdy,Swift Swim,Blue,
138.0,Starmie,60,75,85,100,85,115,520,Water,Psychic,Illuminate,Natural Cure,Analytic,Purple,
86.0,Slowbro,95,75,110,100,80,30,490,Water,Psychic,Oblivious,Own Tempo,Regenerator,Pink,
87.0,Slowking,95,75,80,100,110,30,490,Water,Psychic,Oblivious,Own Tempo,Regenerator,Pink,
779.0,Bruxish,68,105,70,70,70,92,475,Water,Psychic,Dazzling,Strong Jaw,Wonder Skin,Pink,
85.0,Slowpoke,90,65,65,40,40,15,315,Water,Psychic,Oblivious,Own Tempo,Regenerator,Pink,
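
The `ascending` argument also accepts a list with one entry per sort column, which lets you mix directions. A sketch with four rows copied from the pokedex (`toy_df` and `ordered` are made-up names):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        'Pokemon': ['Squirtle', 'Blastoise', 'Charmander', 'Charizard'],
        'Type I': ['Water', 'Water', 'Fire', 'Fire'],
        'Total': [314, 530, 309, 534],
    }
)

# 'Type I' ascending (A to Z), and within each type 'Total' descending
ordered = toy_df.sort_values(['Type I', 'Total'], ascending=[True, False])
print(list(ordered['Pokemon']))  # ['Charizard', 'Charmander', 'Blastoise', 'Squirtle']
```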


## Calculations and operations
There are various ways to do math on dataframes.

### Column operations
You can perform basic math element-wise between columns, such as addition, subtraction, multiplication, and division

In [4]:
pokedex_df['Total']/ pokedex_df['HP'] # computes the ratio between total stats and HP

Index
1.0        7.066667
2.0        6.750000
3.0        6.562500
4.0        7.923077
5.0        6.982759
6.0        6.846154
7.0        7.136364
8.0        6.864407
9.0        6.708861
10.0       4.333333
11.0       4.100000
12.0       6.416667
13.0       4.875000
14.0       4.555556
15.0       5.846154
16.0       6.275000
17.0       5.539683
18.0       5.650602
19.0       8.433333
20.0       7.509091
21.0       6.550000
22.0       6.800000
23.0       9.433333
24.0       7.300000
26.0       8.571429
27.0       7.916667
28.0       6.000000
29.0       6.000000
30.0       5.000000
31.0       5.214286
32.0       5.500000
33.0       5.934783
34.0       5.983607
35.0       6.111111
37.0       4.614286
38.0       4.978947
39.0       7.868421
40.0       6.917808
42.0       2.347826
43.0       3.035714
44.0       6.125000
45.0       6.066667
47.0       7.111111
48.0       6.583333
49.0       6.400000
51.0       8.142857
52.0       6.750000
53.0       5.083333
54.0       6.428571
55.0      26.5
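
The result of a column operation is itself a column, so it can be stored back in the dataframe. A minimal sketch on three rows copied from the pokedex (`toy_df`, `AtkMinusDef`, and `PhysicalSum` are names invented here):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'HP': [45, 39, 44], 'Atk': [49, 52, 48], 'Def': [49, 43, 65]},
    index=[1, 4, 7],
)

# Element-wise arithmetic lines rows up by index label
toy_df['AtkMinusDef'] = toy_df['Atk'] - toy_df['Def']
toy_df['PhysicalSum'] = toy_df['Atk'] + toy_df['Def']

print(toy_df)
```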

### Universal functions
Universal functions can operate on entire dataframes or slices of dataframes. They return a dataframe copy with the same index as the original dataframe. Numpy universal functions can be applied to any numeric dataframe or slice.

In [9]:
np.log(pokedex_df.loc[:,'HP':'Total']) # take natural log of all stats
# again, these functions operate on each element independently, and return a dataframe with the same index and size of input

Index,HP,Atk,Def,SpAtk,SpDef,Speed,Total
1.0,3.806662,3.89182,3.89182,4.174387,4.174387,3.806662,5.762051
2.0,4.094345,4.127134,4.143135,4.382027,4.382027,4.094345,6.003887
3.0,4.382027,4.406719,4.418841,4.60517,4.60517,4.382027,6.263398
4.0,3.663562,3.951244,3.7612,4.094345,3.912023,4.174387,5.733341
5.0,4.060443,4.158883,4.060443,4.382027,4.174387,4.382027,6.003887
6.0,4.356709,4.430817,4.356709,4.691348,4.442651,4.60517,6.280396
7.0,3.78419,3.871201,4.174387,4.094345,3.988984,3.7612,5.749393
8.0,4.077537,4.143135,4.382027,4.174387,4.382027,4.060443,6.003887
9.0,4.369448,4.418841,4.60517,4.442651,4.65396,4.356709,6.272877
10.0,3.806662,3.401197,3.555348,2.995732,2.995732,3.806662,5.273




- applying functions
- aggregations
- transformations
- groupby

### .apply()
You can apply an arbitrary function to each row or column using the `.apply()` method.

Pass the function itself, without calling it (e.g. `np.mean`, or the name of any function you've written). By default `.apply()` calls it once per column; pass `axis=1` to call it once per row instead.

In [71]:
pokedex_df.loc[:,'Atk':'Speed'].apply(np.mean) # applies the function np.mean() to each column

Atk      76.115854
Def      71.919512
SpAtk    69.957317
SpDef    70.370732
Speed    66.140244
dtype: float64

In [72]:
pokedex_df.loc[:,'Atk':'Speed'].apply(np.mean, axis=1) # if you wanted to apply it to each row instead, use the argument axis=1

Index
1.0       54.6
2.0       69.0
3.0       89.0
4.0       54.0
5.0       69.4
6.0       91.2
7.0       54.0
8.0       69.2
9.0       90.2
10.0      30.0
11.0      31.0
12.0      65.0
13.0      31.0
14.0      32.0
15.0      63.0
16.0      42.2
17.0      57.2
18.0      77.2
19.0      44.6
20.0      71.6
21.0      44.4
22.0      75.4
23.0      50.6
24.0      75.6
26.0      53.0
27.0      83.0
28.0      50.0
29.0      75.0
30.0      44.0
31.0      59.0
32.0      81.0
33.0      45.4
34.0      60.8
35.0      82.8
37.0      50.6
38.0      75.6
39.0      52.2
40.0      86.4
42.0      31.0
43.0      57.0
44.0      41.0
45.0      76.0
47.0      55.0
48.0      67.0
49.0      81.0
51.0      50.0
52.0      69.0
53.0      49.0
54.0      76.0
55.0      51.0
56.0      74.0
57.0      50.0
58.0      75.0
59.0      54.0
60.0      84.0
61.0      53.0
62.0      78.0
63.0      59.0
64.0      93.0
65.0      52.0
66.0      64.0
67.0      82.0
69.0      57.0
70.0      72.0
71.0      87.0
72.0      47.0
73.0

The above example is actually a bit unnecessary. There are several built-in functions in dataframes, including a `.mean()` which has the same functionality.

In [73]:
pokedex_df.loc[:,'Atk':'Speed'].mean()

Atk      76.115854
Def      71.919512
SpAtk    69.957317
SpDef    70.370732
Speed    66.140244
dtype: float64

In [74]:
pokedex_df.loc[:,'Atk':'Speed'].mean(axis=1)

Index
1.0       54.6
2.0       69.0
3.0       89.0
4.0       54.0
5.0       69.4
6.0       91.2
7.0       54.0
8.0       69.2
9.0       90.2
10.0      30.0
11.0      31.0
12.0      65.0
13.0      31.0
14.0      32.0
15.0      63.0
16.0      42.2
17.0      57.2
18.0      77.2
19.0      44.6
20.0      71.6
21.0      44.4
22.0      75.4
23.0      50.6
24.0      75.6
26.0      53.0
27.0      83.0
28.0      50.0
29.0      75.0
30.0      44.0
31.0      59.0
32.0      81.0
33.0      45.4
34.0      60.8
35.0      82.8
37.0      50.6
38.0      75.6
39.0      52.2
40.0      86.4
42.0      31.0
43.0      57.0
44.0      41.0
45.0      76.0
47.0      55.0
48.0      67.0
49.0      81.0
51.0      50.0
52.0      69.0
53.0      49.0
54.0      76.0
55.0      51.0
56.0      74.0
57.0      50.0
58.0      75.0
59.0      54.0
60.0      84.0
61.0      53.0
62.0      78.0
63.0      59.0
64.0      93.0
65.0      52.0
66.0      64.0
67.0      82.0
69.0      57.0
70.0      72.0
71.0      87.0
72.0      47.0
73.0

Most [descriptive statistics](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats) (median, mode, standard deviation, count, quantile, etc) have built-in functions. Where `.apply()` really shines, however, is for applying custom functions. 

To show that functionality, let's take a sidebar and talk about lambda functions.

#### Lambda functions
Sometimes you want to define a function, but you don't want to write a whole 
```
def functionname(inputs):
  ...
  return outputs
```

In those situations, you can write lambda (anonymous) functions. The syntax for lambda functions is:

```
functionname = lambda x: operations_on_x
```

For example, if you wanted to write a function that squares the input, you could write:

In [59]:
square_it = lambda x: x**2 # recall that ** is exponentiation in python, this function returns the square of the input
for num in range(10):
  print(square_it(num))

0
1
4
9
16
25
36
49
64
81
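
A lambda and a regular `def` can describe the same computation, as this short sketch shows. The names `square_named`, `square_lambda`, and `names` are illustrative.

```python
# A named function and a lambda that compute the same thing
def square_named(x):
    return x ** 2

square_lambda = lambda x: x ** 2

print(square_named(7), square_lambda(7))  # 49 49

# Lambdas are handy as throwaway arguments, e.g. a sort key
names = ['Charizard', 'Mew', 'Squirtle']
print(sorted(names, key=lambda name: len(name)))  # ['Mew', 'Squirtle', 'Charizard']
```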


This is especially useful if some function takes another function as an argument, like pandas' `.apply()`.

The typical usage of `.apply()` is:

`dataframe_obj.apply(functionname)`, which calls `functionname` on each column of `dataframe_obj`

However, you can replace `functionname` with a lambda function. In this example, we use the following lambda function:

```lambda col: [np.min(col), np.max(col)]```

The word `col` represents the argument, and then to the right of the colon, we return a list with two elements: the minimum and maximum values of the column


In [77]:
pokedex_df.loc[:,'HP':'Total'].apply(lambda col: [np.min(col), np.max(col)])

HP         [1, 255]
Atk        [5, 181]
Def        [5, 230]
SpAtk     [10, 180]
SpDef     [20, 230]
Speed      [5, 180]
Total    [175, 720]
dtype: object

The above can be expanded into a dataframe:

In [86]:
pokedex_df.loc[:,'HP':'Total'].apply(lambda col: [np.min(col), np.max(col)], result_type='expand')

Unnamed: 0,HP,Atk,Def,SpAtk,SpDef,Speed,Total
0,1,5,5,10,20,5,175
1,255,181,230,180,230,180,720
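
A regular named function works with `.apply()` in the same way as a lambda, which is useful once the logic grows past one line. In this sketch, `stat_range` and `toy_df` are hypothetical names and the data is a hand-made sample:

```python
import pandas as pd

def stat_range(values):
    """Return the spread (max minus min) of one column or one row."""
    return values.max() - values.min()

toy_df = pd.DataFrame({'HP': [45, 39, 44], 'Speed': [45, 65, 43]})

per_column = toy_df.apply(stat_range)       # called once per column
per_row = toy_df.apply(stat_range, axis=1)  # called once per row

print(per_column['HP'], per_column['Speed'])  # 6 22
print(list(per_row))                          # [0, 26, 1]
```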


### Groupby
Dataframes can be grouped by column values, and operations can be applied to each group

In [87]:
pokedex_df.groupby(['Type I','Type II'])['Total'].count()

Type I    Type II 
Bug                   18
          Electric     4
          Fairy        2
          Fighting     3
          Fire         2
          Flying      13
          Ghost        1
          Grass        6
          Ground       2
          Poison      11
          Rock         3
          Steel        6
          Water        3
Dark                   9
          Dragon       4
          Fighting     2
          Fire         2
          Flying       5
          Ghost        1
          Ice          2
          Psychic      2
          Steel        2
Dragon                12
          Electric     1
          Fighting     2
          Fire         1
          Flying       4
          Ground       4
          Ice          1
          Psychic      2
Electric              27
          Fairy        2
          Fire         1
          Flying       4
          Ghost        1
          Grass        1
          Ice          1
          Normal       2
          Steel        4
      

In [103]:
pokedex_df.groupby(['Type I','Type II'])['Total'].apply(lambda col: [np.min(col), np.max(col)])

Type I    Type II 
Bug                   [194, 500]
          Electric    [319, 500]
          Fairy       [304, 464]
          Fighting    [500, 570]
          Fire        [360, 550]
          Flying      [244, 515]
          Ghost       [236, 236]
          Grass       [285, 490]
          Ground      [266, 424]
          Poison      [195, 475]
          Rock        [325, 505]
          Steel       [424, 600]
          Water       [230, 530]
Dark                  [220, 600]
          Dragon      [300, 600]
          Fighting    [348, 488]
          Fire        [330, 500]
          Flying      [370, 680]
          Ghost       [380, 380]
          Ice         [430, 510]
          Psychic     [288, 482]
          Steel       [340, 490]
Dragon                [300, 600]
          Electric    [680, 680]
          Fighting    [420, 600]
          Fire        [680, 680]
          Flying      [490, 680]
          Ground      [300, 600]
          Ice         [660, 660]
          Psychic     [6
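
As an alternative to returning a list from a lambda, several built-in aggregations can be requested at once with `.agg()`, which gives one column per statistic. This is a sketch on four rows copied from the pokedex (`toy_df` and `summary` are made-up names):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        'Pokemon': ['Bulbasaur', 'Venusaur', 'Charmander', 'Charizard'],
        'Type I': ['Grass', 'Grass', 'Fire', 'Fire'],
        'Total': [318, 525, 309, 534],
    }
)

# One row per group, one column per aggregation
summary = toy_df.groupby('Type I')['Total'].agg(['min', 'max', 'mean'])
print(summary)
```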

# Exercises
