# Getting started with pandas

**Extra resources:**

* Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
* Pandas cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf 

<img src = "https://www.hola.com/imagenes//actualidad/2015102781754/trece-pandas-seis-gemelos-china/0-339-649/pandas1-a.jpg?filter=w400&filter=ds75">

In [1]:
import numpy
import pandas

import seaborn
import matplotlib.pyplot as plt

import time
import sys
import os

import requests
from bs4 import BeautifulSoup



In [2]:
pandas.__version__

'1.0.3'

In [3]:
numpy.__version__

'1.17.2'

# DataFrames, Series and indexes

### DataFrames

Think of them as Excel tables that you can explore with code.

In [4]:
pokemon = ["Bulbasaur", "Squirtle", "Charizard", "Pikachu", "Onix", "Geodude", "Vulpix", "Starmie"]
types = ["Grass / Poison", "Water", "Fire / Flying", "Electric", "Rock / Ground",  "Rock / Ground", "Fire", "Water"]
master = ["Ash", "Ash", "Ash", "Ash", "Brock", "Brock", "Misty", "Misty"]
pokedex_number = [1, 4, 7, 25, 95, 74, 37, 121]
evolved = ["No", "No", "Yes", "No", "No", "No", "No", "Yes"]

In [5]:
df_pokemon = pandas.DataFrame({"name": pokemon, 
                               "types": types, 
                               "master": master, 
                               "number": pokedex_number, 
                               "is_evolved": evolved})

df_pokemon

Unnamed: 0,name,types,master,number,is_evolved
0,Bulbasaur,Grass / Poison,Ash,1,No
1,Squirtle,Water,Ash,4,No
2,Charizard,Fire / Flying,Ash,7,Yes
3,Pikachu,Electric,Ash,25,No
4,Onix,Rock / Ground,Brock,95,No
5,Geodude,Rock / Ground,Brock,74,No
6,Vulpix,Fire,Misty,37,No
7,Starmie,Water,Misty,121,Yes


In [6]:
df_pokemon.values

array([['Bulbasaur', 'Grass / Poison', 'Ash', 1, 'No'],
       ['Squirtle', 'Water', 'Ash', 4, 'No'],
       ['Charizard', 'Fire / Flying', 'Ash', 7, 'Yes'],
       ['Pikachu', 'Electric', 'Ash', 25, 'No'],
       ['Onix', 'Rock / Ground', 'Brock', 95, 'No'],
       ['Geodude', 'Rock / Ground', 'Brock', 74, 'No'],
       ['Vulpix', 'Fire', 'Misty', 37, 'No'],
       ['Starmie', 'Water', 'Misty', 121, 'Yes']], dtype=object)

In [7]:
type(df_pokemon)

pandas.core.frame.DataFrame

In [8]:
type(df_pokemon.values)

numpy.ndarray

In [9]:
# pokemons = ["Bulbasaur", "Squirtle", "Charizard", "Pikachu", "Onix", "Geodude", "Vulpix", "Starmie"]
# types = ["Grass / Poison", "Water", "Fire / Flying", "Electric", "Rock / Ground",  "Rock / Ground", "Fire", "Water"]
# master = ["Ash", "Ash", "Ash", "Ash", "Brock", "Brock", "Misty"]
# pokedex_number = [1, 4, 7, 25, 95, 74, 37, 121]
# evolved = ["No", "No", "Yes", "No", "No", "No", "No", "Yes"]

# df_pokemon = pandas.DataFrame({"name": pokemons, 
#                                "types": types, 
#                                "master": master, 
#                                "number": pokedex_number, 
#                                "is_evolved": evolved})
# df_pokemon
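
The commented-out cell above differs from the working one in a single detail: `master` has seven entries while the other lists have eight. A minimal sketch of what that mismatch leads to, using shorter made-up lists (the names `names`, `masters` and `error_message` are only for illustration):

```python
import pandas

names = ["Bulbasaur", "Squirtle", "Charizard"]
masters = ["Ash", "Ash"]  # one entry short on purpose

try:
    pandas.DataFrame({"name": names, "master": masters})
    error_message = None
except ValueError as error:
    # pandas refuses to build a frame from columns of different lengths
    error_message = str(error)

print(error_message)
```

Every list passed in the dictionary becomes one column, so all of them need the same number of entries.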

### Series

Each column of a DataFrame is a pandas Series.

In [10]:
df_pokemon["name"]

0    Bulbasaur
1     Squirtle
2    Charizard
3      Pikachu
4         Onix
5      Geodude
6       Vulpix
7      Starmie
Name: name, dtype: object

In [11]:
type(df_pokemon["name"])

pandas.core.series.Series

In [12]:
type(df_pokemon["name"].values)

numpy.ndarray

A Series (a pandas column) is essentially a numpy array with a name and an index attached. This means that numpy-style operations, such as `max`, work directly on pandas columns.

In [13]:
df_pokemon["name"].max()

'Vulpix'
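
As a small sketch of the same idea on a numeric column (the Series below is built by hand from a few of the pokedex numbers, and the variable names are illustrative):

```python
import numpy
import pandas

numbers = pandas.Series([1, 4, 7, 25], name="number")

doubled = numbers * 2        # element-wise arithmetic, as with a numpy array
roots = numpy.sqrt(numbers)  # numpy functions accept a Series and return a Series
largest = numbers.max()      # reductions are available as methods

# On strings, max() compares alphabetically, which is why "Vulpix" came out on top
last_alphabetically = pandas.Series(["Vulpix", "Onix"]).max()
```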

### Indexes

In [14]:
df_pokemon

Unnamed: 0,name,types,master,number,is_evolved
0,Bulbasaur,Grass / Poison,Ash,1,No
1,Squirtle,Water,Ash,4,No
2,Charizard,Fire / Flying,Ash,7,Yes
3,Pikachu,Electric,Ash,25,No
4,Onix,Rock / Ground,Brock,95,No
5,Geodude,Rock / Ground,Brock,74,No
6,Vulpix,Fire,Misty,37,No
7,Starmie,Water,Misty,121,Yes


In [15]:
df_pokemon.columns

Index(['name', 'types', 'master', 'number', 'is_evolved'], dtype='object')

In [16]:
df_pokemon.index

RangeIndex(start=0, stop=8, step=1)

In [17]:
df_pokemon.loc[4, "number"]

95

Indexes let us access individual values and slices of the dataframe.
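
A short sketch of label-based versus position-based access, assuming a small hand-made frame whose row labels deliberately do not start at 0:

```python
import pandas

df = pandas.DataFrame(
    {"name": ["Onix", "Geodude", "Vulpix"], "number": [95, 74, 37]},
    index=[4, 5, 6],  # row labels, not positions
)

by_label = df.loc[4, "number"]          # row labelled 4
by_position = df.iloc[0, 1]             # first row, second column
label_slice = df.loc[4:5, "name"]       # label slices include both endpoints
position_slice = df.iloc[0:2]["name"]   # position slices exclude the stop, like Python lists
```

With the default `RangeIndex` the labels and positions coincide, so the difference only shows up after filtering, sorting or dropping rows.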

### What can you do with a dataframe?

*Sort pokemon by pokedex number*

In [18]:
df_pokemon.sort_values(by = "number")

Unnamed: 0,name,types,master,number,is_evolved
0,Bulbasaur,Grass / Poison,Ash,1,No
1,Squirtle,Water,Ash,4,No
2,Charizard,Fire / Flying,Ash,7,Yes
3,Pikachu,Electric,Ash,25,No
6,Vulpix,Fire,Misty,37,No
5,Geodude,Rock / Ground,Brock,74,No
4,Onix,Rock / Ground,Brock,95,No
7,Starmie,Water,Misty,121,Yes


*Which pokemon are evolved?*

In [19]:
df_pokemon.loc[df_pokemon["is_evolved"] == "Yes"]

Unnamed: 0,name,types,master,number,is_evolved
2,Charizard,Fire / Flying,Ash,7,Yes
7,Starmie,Water,Misty,121,Yes


*Which pokemon are dual type?*

In [20]:
df_pokemon["is_dual_type"] = df_pokemon["types"].str.contains("/")
df_pokemon

Unnamed: 0,name,types,master,number,is_evolved,is_dual_type
0,Bulbasaur,Grass / Poison,Ash,1,No,True
1,Squirtle,Water,Ash,4,No,False
2,Charizard,Fire / Flying,Ash,7,Yes,True
3,Pikachu,Electric,Ash,25,No,False
4,Onix,Rock / Ground,Brock,95,No,True
5,Geodude,Rock / Ground,Brock,74,No,True
6,Vulpix,Fire,Misty,37,No,False
7,Starmie,Water,Misty,121,Yes,False


In [21]:
df_pokemon.loc[df_pokemon["is_dual_type"]]

Unnamed: 0,name,types,master,number,is_evolved,is_dual_type
0,Bulbasaur,Grass / Poison,Ash,1,No,True
2,Charizard,Fire / Flying,Ash,7,Yes,True
4,Onix,Rock / Ground,Brock,95,No,True
5,Geodude,Rock / Ground,Brock,74,No,True


*Add a region column*

In [22]:
df_pokemon["region"] = "Kanto"
df_pokemon

Unnamed: 0,name,types,master,number,is_evolved,is_dual_type,region
0,Bulbasaur,Grass / Poison,Ash,1,No,True,Kanto
1,Squirtle,Water,Ash,4,No,False,Kanto
2,Charizard,Fire / Flying,Ash,7,Yes,True,Kanto
3,Pikachu,Electric,Ash,25,No,False,Kanto
4,Onix,Rock / Ground,Brock,95,No,True,Kanto
5,Geodude,Rock / Ground,Brock,74,No,True,Kanto
6,Vulpix,Fire,Misty,37,No,False,Kanto
7,Starmie,Water,Misty,121,Yes,False,Kanto


*Add a new row*

In [23]:
df_pokemon.loc[len(df_pokemon)] = ["Togepi", "Normal", "Misty", None, "No", "No", "Johto"]
df_pokemon

Unnamed: 0,name,types,master,number,is_evolved,is_dual_type,region
0,Bulbasaur,Grass / Poison,Ash,1.0,No,True,Kanto
1,Squirtle,Water,Ash,4.0,No,False,Kanto
2,Charizard,Fire / Flying,Ash,7.0,Yes,True,Kanto
3,Pikachu,Electric,Ash,25.0,No,False,Kanto
4,Onix,Rock / Ground,Brock,95.0,No,True,Kanto
5,Geodude,Rock / Ground,Brock,74.0,No,True,Kanto
6,Vulpix,Fire,Misty,37.0,No,False,Kanto
7,Starmie,Water,Misty,121.0,Yes,False,Kanto
8,Togepi,Normal,Misty,,No,No,Johto


*Fill in Togepi's pokedex number (175)*

In [24]:
df_pokemon.loc[df_pokemon["name"] == "Togepi", "number"] = 175
df_pokemon

Unnamed: 0,name,types,master,number,is_evolved,is_dual_type,region
0,Bulbasaur,Grass / Poison,Ash,1,No,True,Kanto
1,Squirtle,Water,Ash,4,No,False,Kanto
2,Charizard,Fire / Flying,Ash,7,Yes,True,Kanto
3,Pikachu,Electric,Ash,25,No,False,Kanto
4,Onix,Rock / Ground,Brock,95,No,True,Kanto
5,Geodude,Rock / Ground,Brock,74,No,True,Kanto
6,Vulpix,Fire,Misty,37,No,False,Kanto
7,Starmie,Water,Misty,121,Yes,False,Kanto
8,Togepi,Normal,Misty,175,No,No,Johto
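
The cells above insert `None` into `number`, which turns the column into floats (the `1.0`, `4.0`, ... in the first output), and put the string `"No"` into `is_dual_type`, which until then held booleans. When all the values are known in advance, one option is to build a one-row DataFrame and concatenate it. This is a minimal sketch on a made-up three-column frame, not the notebook's own approach:

```python
import pandas

df = pandas.DataFrame({
    "name": ["Vulpix", "Starmie"],
    "number": [37, 121],
    "is_dual_type": [False, False],
})

# The new row uses the same value types as the existing columns
new_row = pandas.DataFrame([{"name": "Togepi", "number": 175, "is_dual_type": False}])

# ignore_index=True renumbers the rows 0..n-1 after concatenating
df_extended = pandas.concat([df, new_row], ignore_index=True)
print(df_extended.dtypes)
```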


# Pandas foundations

In [25]:
os.listdir("data")

['lego_sets.csv', 'cards.collectible.json', '.DS_Store']

## Lego dataset (reading flat files)

Dataset taken from Kaggle: https://www.kaggle.com/mterzolo/lego-sets

<img src = "https://www.lego.com/cdn/cs/set/assets/blt95c35d4ed5665a49/75192.jpg?fit=bounds&format=jpg&quality=80&auto=webp&width=528&height=528&dpr=1">

In [26]:
df_legos = pandas.read_csv("data/lego_sets.csv")
df_legos.head()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country
0,6-12,29.99,2.0,277.0,4.0,Catapult into action and take back the eggs fr...,75823.0,Use the staircase catapult to launch Red into ...,Average,Bird Island Egg Heist,4.5,Angry Birds™,4.0,US
1,6-12,19.99,2.0,168.0,4.0,Launch a flying attack and rescue the eggs fro...,75822.0,Pilot Pig has taken off from Bird Island with ...,Easy,Piggy Plane Attack,5.0,Angry Birds™,4.0,US
2,6-12,12.99,11.0,74.0,4.3,Chase the piggy with lightning-fast Chuck and ...,75821.0,Pitch speedy bird Chuck against the Piggy Car....,Easy,Piggy Car Escape,4.3,Angry Birds™,4.1,US
3,12+,99.99,23.0,1032.0,3.6,Explore the architecture of the United States ...,21030.0,Discover the architectural secrets of the icon...,Average,United States Capitol Building,4.6,Architecture,4.3,US
4,12+,79.99,14.0,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,21035.0,Discover the architectural secrets of Frank Ll...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,4.1,US


In [27]:
len(df_legos)

12261

In [28]:
df_legos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ages               12261 non-null  object 
 1   list_price         12261 non-null  float64
 2   num_reviews        10641 non-null  float64
 3   piece_count        12261 non-null  float64
 4   play_star_rating   10486 non-null  float64
 5   prod_desc          11884 non-null  object 
 6   prod_id            12261 non-null  float64
 7   prod_long_desc     12261 non-null  object 
 8   review_difficulty  10206 non-null  object 
 9   set_name           12261 non-null  object 
 10  star_rating        10641 non-null  float64
 11  theme_name         12258 non-null  object 
 12  val_star_rating    10466 non-null  float64
 13  country            12261 non-null  object 
dtypes: float64(7), object(7)
memory usage: 1.3+ MB


In [29]:
df_legos.describe()

Unnamed: 0,list_price,num_reviews,piece_count,play_star_rating,prod_id,star_rating,val_star_rating
count,12261.0,10641.0,12261.0,10486.0,12261.0,10641.0,10466.0
mean,65.141998,16.826238,493.405921,4.337641,59836.75,4.514134,4.22896
std,91.980429,36.368984,825.36458,0.652051,163811.5,0.518865,0.660282
min,2.2724,1.0,1.0,1.0,630.0,1.8,1.0
25%,19.99,2.0,97.0,4.0,21034.0,4.3,4.0
50%,36.5878,6.0,216.0,4.5,42069.0,4.7,4.3
75%,70.1922,13.0,544.0,4.8,70922.0,5.0,4.7
max,1104.87,367.0,7541.0,5.0,2000431.0,5.0,5.0


In [30]:
df_legos = df_legos.dropna()
df_legos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9910 entries, 0 to 12260
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ages               9910 non-null   object 
 1   list_price         9910 non-null   float64
 2   num_reviews        9910 non-null   float64
 3   piece_count        9910 non-null   float64
 4   play_star_rating   9910 non-null   float64
 5   prod_desc          9910 non-null   object 
 6   prod_id            9910 non-null   float64
 7   prod_long_desc     9910 non-null   object 
 8   review_difficulty  9910 non-null   object 
 9   set_name           9910 non-null   object 
 10  star_rating        9910 non-null   float64
 11  theme_name         9910 non-null   object 
 12  val_star_rating    9910 non-null   float64
 13  country            9910 non-null   object 
dtypes: float64(7), object(7)
memory usage: 1.1+ MB
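
`dropna()` with no arguments removes every row that has at least one missing value and keeps the original row labels, which is why the index above still runs from 0 to 12260 with only 9910 entries. A small sketch on made-up data, also showing the `subset` parameter for cases where only some columns matter:

```python
import numpy
import pandas

df = pandas.DataFrame({
    "set_name": ["A", "B", "C"],
    "star_rating": [4.5, numpy.nan, 4.0],
    "prod_desc": ["text", "text", None],
})

missing_per_column = df.isna().sum()            # count of missing values per column
all_complete = df.dropna()                      # drops rows B and C
rated_only = df.dropna(subset=["star_rating"])  # drops only row B
```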


In [31]:
df_legos = df_legos.drop(["prod_id", "prod_long_desc", "num_reviews", "val_star_rating"], axis = 1)
df_legos.head()

Unnamed: 0,ages,list_price,piece_count,play_star_rating,prod_desc,review_difficulty,set_name,star_rating,theme_name,country
0,6-12,29.99,277.0,4.0,Catapult into action and take back the eggs fr...,Average,Bird Island Egg Heist,4.5,Angry Birds™,US
1,6-12,19.99,168.0,4.0,Launch a flying attack and rescue the eggs fro...,Easy,Piggy Plane Attack,5.0,Angry Birds™,US
2,6-12,12.99,74.0,4.3,Chase the piggy with lightning-fast Chuck and ...,Easy,Piggy Car Escape,4.3,Angry Birds™,US
3,12+,99.99,1032.0,3.6,Explore the architecture of the United States ...,Average,United States Capitol Building,4.6,Architecture,US
4,12+,79.99,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,US


## Are there any Marvel Lego sets with easy difficulty that cost less than 150 dollars? (row and column selection)

*How many difficulty levels are there?*

In [32]:
df_legos["review_difficulty"].unique()

array(['Average', 'Easy', 'Challenging', 'Very Easy', 'Very Challenging'],
      dtype=object)

In [33]:
df_legos["review_difficulty"].nunique()

5

In [34]:
df_legos["theme_name"].unique()

array(['Angry Birds™', 'Architecture', 'BOOST', 'BrickHeadz', 'City',
       'Classic', 'Creator 3-in-1', 'Creator Expert',
       'THE LEGO® BATMAN MOVIE', 'DC Comics™ Super Heroes', 'DIMENSIONS™',
       'Juniors', 'DC Super Hero Girls', 'Disney™', 'DUPLO®', 'Elves',
       'Friends', 'Ghostbusters™', 'Ideas',
       'Indoraptor Rampage at Lockwood Estate',
       'Jurassic Park Velociraptor Chase', 'Dilophosaurus Outpost Attack',
       'Stygimoloch Breakout', 'Pteranodon Chase', 'Marvel Super Heroes',
       'MINDSTORMS®', 'Minecraft™', 'Minifigures', 'NEXO KNIGHTS™',
       'THE LEGO® NINJAGO® MOVIE™', 'NINJAGO®', 'Speed Champions',
       'Star Wars™', 'Technic', 'Power Functions',
       'Carnotaurus Gyrosphere Escape', 'LEGO® Creator 3-in-1'],
      dtype=object)

In [35]:
df_legos["theme_name"].nunique()

37

### Column selection

In [36]:
df_legos_mini = df_legos[["set_name", "theme_name", "ages", "piece_count", "review_difficulty", "country"]] 
df_legos_mini

Unnamed: 0,set_name,theme_name,ages,piece_count,review_difficulty,country
0,Bird Island Egg Heist,Angry Birds™,6-12,277.0,Average,US
1,Piggy Plane Attack,Angry Birds™,6-12,168.0,Easy,US
2,Piggy Car Escape,Angry Birds™,6-12,74.0,Easy,US
3,United States Capitol Building,Architecture,12+,1032.0,Average,US
4,Solomon R. Guggenheim Museum®,Architecture,12+,744.0,Challenging,US
...,...,...,...,...,...,...
12256,Manta Ray Bomber,THE LEGO® NINJAGO® MOVIE™,7-14,341.0,Easy,PT
12257,Piranha Attack,THE LEGO® NINJAGO® MOVIE™,7-14,217.0,Easy,PT
12258,NINJAGO® City Chase,THE LEGO® NINJAGO® MOVIE™,7-14,233.0,Easy,PT
12259,Lloyd - Spinjitzu Master,THE LEGO® NINJAGO® MOVIE™,6-14,48.0,Very Easy,PT


### Row selection 

In [37]:
df_legos_mini["piece_count"] > 500

0        False
1        False
2        False
3         True
4         True
         ...  
12256    False
12257    False
12258    False
12259    False
12260    False
Name: piece_count, Length: 9910, dtype: bool

In [38]:
df_legos_mini.loc[df_legos_mini["piece_count"] > 500]

Unnamed: 0,set_name,theme_name,ages,piece_count,review_difficulty,country
3,United States Capitol Building,Architecture,12+,1032.0,Average,US
4,Solomon R. Guggenheim Museum®,Architecture,12+,744.0,Challenging,US
5,Shanghai,Architecture,12+,597.0,Average,US
6,New York City,Architecture,12+,598.0,Average,US
7,Buckingham Palace,Architecture,12+,780.0,Average,US
...,...,...,...,...,...,...
12248,Fire Mech,THE LEGO® NINJAGO® MOVIE™,9-14,944.0,Average,PT
12249,Lightning Jet,THE LEGO® NINJAGO® MOVIE™,9-14,876.0,Average,PT
12250,Garma Mecha Man,THE LEGO® NINJAGO® MOVIE™,8-14,747.0,Average,PT
12251,Garmadon's Volcano Lair,THE LEGO® NINJAGO® MOVIE™,8-14,521.0,Average,PT


### Combined selection

In [39]:
df_legos_mini_double = df_legos_mini.loc[df_legos_mini["theme_name"].str.contains("Star Wars")][["set_name", "theme_name"]]
df_legos_mini_double

Unnamed: 0,set_name,theme_name
628,Kessel Run Millennium Falcon™,Star Wars™
629,Imperial TIE Fighter™,Star Wars™
630,Moloch's Landspeeder™,Star Wars™
631,Yoda's Hut,Star Wars™
632,Han Solo's Landspeeder™,Star Wars™
...,...,...
12192,Yoda's Hut,Star Wars™
12193,Jedi™ and Clone Troopers™ Battle Pack,Star Wars™
12194,Imperial Patrol Battle Pack,Star Wars™
12195,A-Wing Starfighter™,Star Wars™
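
The cell above filters the rows with `.loc[...]` and then picks the columns with a second pair of brackets. `.loc` also accepts the row condition and the column list in a single call. A sketch on a toy frame (the `country` values are made up):

```python
import pandas

df = pandas.DataFrame({
    "set_name": ["Yoda's Hut", "Piggy Car Escape", "A-Wing Starfighter™"],
    "theme_name": ["Star Wars™", "Angry Birds™", "Star Wars™"],
    "country": ["US", "US", "PT"],
})

is_star_wars = df["theme_name"].str.contains("Star Wars")

chained = df.loc[is_star_wars][["set_name", "theme_name"]]      # two steps
single_call = df.loc[is_star_wars, ["set_name", "theme_name"]]  # rows and columns at once
```

Both forms return the same rows and columns here. The single call is usually the safer choice when the result will be assigned to.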


*Answering the original question*

Are there any Marvel Lego sets with easy difficulty that cost less than 150 dollars?

In [40]:
df_legos[["set_name", "theme_name", "list_price"]]\
        .loc[df_legos["theme_name"] == "Marvel Super Heroes"]\
        .loc[df_legos["review_difficulty"] == "Easy"]\
        .loc[df_legos["list_price"] <= 150]

Unnamed: 0,set_name,theme_name,list_price
467,The Hulkbuster Smash-Up,Marvel Super Heroes,29.9900
468,Thor's Weapon Quest,Marvel Super Heroes,19.9900
469,Outrider Dropship Attack,Marvel Super Heroes,14.9900
475,Hulk vs. Red Hulk,Marvel Super Heroes,59.9900
476,The Ultimate Battle for Asgard,Marvel Super Heroes,49.9900
...,...,...,...
12038,Iron Man: Detroit Steel Strikes,Marvel Super Heroes,47.5678
12039,Rhino Face-Off by the Mine,Marvel Super Heroes,36.5878
12041,ATM Heist Battle,Marvel Super Heroes,35.3678
12042,Captain America Jet Pursuit,Marvel Super Heroes,35.3678


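*Combining the same conditions with `|` (OR) keeps every set that meets at least one of them, which is a much larger selection*

In [41]: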
df_legos[["set_name", "theme_name", "list_price"]]\
        .loc[(df_legos["theme_name"] == "Marvel Super Heroes") |
              (df_legos["review_difficulty"] == "Easy") |
              (df_legos["list_price"] <= 150)]

Unnamed: 0,set_name,theme_name,list_price
0,Bird Island Egg Heist,Angry Birds™,29.9900
1,Piggy Plane Attack,Angry Birds™,19.9900
2,Piggy Car Escape,Angry Birds™,12.9900
3,United States Capitol Building,Architecture,99.9900
4,Solomon R. Guggenheim Museum®,Architecture,79.9900
...,...,...,...
12256,Manta Ray Bomber,THE LEGO® NINJAGO® MOVIE™,36.5878
12257,Piranha Attack,THE LEGO® NINJAGO® MOVIE™,24.3878
12258,NINJAGO® City Chase,THE LEGO® NINJAGO® MOVIE™,24.3878
12259,Lloyd - Spinjitzu Master,THE LEGO® NINJAGO® MOVIE™,12.1878
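
The difference between requiring all conditions and requiring any of them is easier to see on a handful of rows. A minimal sketch with made-up data, where each condition is first stored as a named boolean Series (the names `is_marvel`, `is_easy` and `is_cheap` are illustrative):

```python
import pandas

df = pandas.DataFrame({
    "theme_name": ["Marvel Super Heroes", "Marvel Super Heroes", "City",
                   "Marvel Super Heroes", "City"],
    "review_difficulty": ["Easy", "Average", "Easy", "Easy", "Average"],
    "list_price": [29.99, 59.99, 19.99, 199.99, 199.99],
})

is_marvel = df["theme_name"] == "Marvel Super Heroes"
is_easy = df["review_difficulty"] == "Easy"
is_cheap = df["list_price"] < 150

all_conditions = df.loc[is_marvel & is_easy & is_cheap]  # AND: every condition must hold
any_condition = df.loc[is_marvel | is_easy | is_cheap]   # OR: one condition is enough
```

When the comparisons are written inline instead of being stored in variables, each one needs its own parentheses because `&` and `|` bind more tightly than `==` and `<`.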


## What's the brand of the toys? (broadcasting)

In [42]:
df_legos_aux = df_legos[["set_name", "theme_name"]].copy()
df_legos_aux.head(10)

Unnamed: 0,set_name,theme_name
0,Bird Island Egg Heist,Angry Birds™
1,Piggy Plane Attack,Angry Birds™
2,Piggy Car Escape,Angry Birds™
3,United States Capitol Building,Architecture
4,Solomon R. Guggenheim Museum®,Architecture
5,Shanghai,Architecture
6,New York City,Architecture
7,Buckingham Palace,Architecture
8,London,Architecture
9,Chicago,Architecture


In [43]:
df_legos_aux["brand"] = "Lego"
df_legos_aux

Unnamed: 0,set_name,theme_name,brand
0,Bird Island Egg Heist,Angry Birds™,Lego
1,Piggy Plane Attack,Angry Birds™,Lego
2,Piggy Car Escape,Angry Birds™,Lego
3,United States Capitol Building,Architecture,Lego
4,Solomon R. Guggenheim Museum®,Architecture,Lego
...,...,...,...
12256,Manta Ray Bomber,THE LEGO® NINJAGO® MOVIE™,Lego
12257,Piranha Attack,THE LEGO® NINJAGO® MOVIE™,Lego
12258,NINJAGO® City Chase,THE LEGO® NINJAGO® MOVIE™,Lego
12259,Lloyd - Spinjitzu Master,THE LEGO® NINJAGO® MOVIE™,Lego
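
Broadcasting means that a single value is stretched to match the length of the column. The same rule applies to arithmetic between a column and a scalar. A small sketch (the `piece_count_hundreds` column is a hypothetical example, not part of the dataset):

```python
import pandas

df = pandas.DataFrame({"set_name": ["Shanghai", "London"],
                       "piece_count": [597.0, 468.0]})

df["brand"] = "Lego"                                  # one scalar fills every row
df["piece_count_hundreds"] = df["piece_count"] / 100  # 100 is applied to each element
```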


*Reordering columns*

In [44]:
df_legos_aux[["brand", "theme_name", "set_name"]]

Unnamed: 0,brand,theme_name,set_name
0,Lego,Angry Birds™,Bird Island Egg Heist
1,Lego,Angry Birds™,Piggy Plane Attack
2,Lego,Angry Birds™,Piggy Car Escape
3,Lego,Architecture,United States Capitol Building
4,Lego,Architecture,Solomon R. Guggenheim Museum®
...,...,...,...
12256,Lego,THE LEGO® NINJAGO® MOVIE™,Manta Ray Bomber
12257,Lego,THE LEGO® NINJAGO® MOVIE™,Piranha Attack
12258,Lego,THE LEGO® NINJAGO® MOVIE™,NINJAGO® City Chase
12259,Lego,THE LEGO® NINJAGO® MOVIE™,Lloyd - Spinjitzu Master


## Which Architecture Lego sets are the most difficult and have the most pieces? (sorting)

In [45]:
df_legos_architecture = df_legos.loc[df_legos["theme_name"] == "Architecture"]
df_legos_architecture.head()

Unnamed: 0,ages,list_price,piece_count,play_star_rating,prod_desc,review_difficulty,set_name,star_rating,theme_name,country
3,12+,99.99,1032.0,3.6,Explore the architecture of the United States ...,Average,United States Capitol Building,4.6,Architecture,US
4,12+,79.99,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,US
5,12+,59.99,597.0,3.7,Celebrate Shanghai with this LEGO® Architectur...,Average,Shanghai,4.9,Architecture,US
6,12+,59.99,598.0,3.7,Celebrate New York City with this LEGO® Archit...,Average,New York City,4.2,Architecture,US
7,12+,49.99,780.0,4.4,Recreate Buckingham Palace with LEGO® Architec...,Average,Buckingham Palace,4.7,Architecture,US


In [46]:
df_legos_architecture.sort_values(by = "review_difficulty")

Unnamed: 0,ages,list_price,piece_count,play_star_rating,prod_desc,review_difficulty,set_name,star_rating,theme_name,country
3,12+,99.9900,1032.0,3.6,Explore the architecture of the United States ...,Average,United States Capitol Building,4.6,Architecture,US
7256,12+,62.9860,598.0,3.7,Celebrate New York City with this LEGO® Archit...,Average,New York City,4.2,Architecture,GB
7257,12+,41.9860,321.0,3.2,Build your own LEGO® interpretation of the ico...,Average,The Eiffel Tower,4.6,Architecture,GB
7258,12+,41.9860,386.0,4.1,Experience the grandeur of the Arc de Triomphe!,Average,Arc de Triomphe,4.4,Architecture,GB
7826,12+,134.1878,1032.0,3.6,Explore the architecture of the United States ...,Average,United States Capitol Building,4.6,Architecture,IE
...,...,...,...,...,...,...,...,...,...,...
10628,12+,42.5929,361.0,4.2,Celebrate Sydney with this LEGO® Architecture ...,Easy,Sydney,4.6,Architecture,NZ
7835,12+,36.5878,361.0,4.2,Celebrate Sydney with this LEGO® Architecture ...,Easy,Sydney,4.6,Architecture,IE
8404,12+,36.5878,361.0,4.2,Celebrate Sydney with this LEGO® Architecture ...,Easy,Sydney,4.6,Architecture,IT
826,12+,37.9924,361.0,4.2,Celebrate Sydney with this LEGO® Architecture ...,Easy,Sydney,4.6,Architecture,AU


In [47]:
df_legos_architecture = df_legos.get(["set_name", "theme_name", "piece_count", "review_difficulty"])\
                                .loc[df_legos["theme_name"] == "Architecture"]

len(df_legos_architecture)

210

In [48]:
df_legos_architecture.sort_values(by = "set_name")

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
2537,Arc de Triomphe,Architecture,386.0,Average
6141,Arc de Triomphe,Architecture,386.0,Average
11192,Arc de Triomphe,Architecture,386.0,Average
8952,Arc de Triomphe,Architecture,386.0,Average
5591,Arc de Triomphe,Architecture,386.0,Average
...,...,...,...,...
10043,United States Capitol Building,Architecture,1032.0,Average
7250,United States Capitol Building,Architecture,1032.0,Average
6682,United States Capitol Building,Architecture,1032.0,Average
2531,United States Capitol Building,Architecture,1032.0,Average


In [49]:
df_legos_architecture = df_legos_architecture.drop_duplicates()
len(df_legos_architecture)

10

In [50]:
df_legos_architecture

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
3,United States Capitol Building,Architecture,1032.0,Average
4,Solomon R. Guggenheim Museum®,Architecture,744.0,Challenging
5,Shanghai,Architecture,597.0,Average
6,New York City,Architecture,598.0,Average
7,Buckingham Palace,Architecture,780.0,Average
8,London,Architecture,468.0,Average
9,Chicago,Architecture,444.0,Average
10,Arc de Triomphe,Architecture,386.0,Average
11,The Eiffel Tower,Architecture,321.0,Average
12,Sydney,Architecture,361.0,Easy


In [51]:
df_legos_architecture.sort_values(by = "piece_count")

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
11,The Eiffel Tower,Architecture,321.0,Average
12,Sydney,Architecture,361.0,Easy
10,Arc de Triomphe,Architecture,386.0,Average
9,Chicago,Architecture,444.0,Average
8,London,Architecture,468.0,Average
5,Shanghai,Architecture,597.0,Average
6,New York City,Architecture,598.0,Average
4,Solomon R. Guggenheim Museum®,Architecture,744.0,Challenging
7,Buckingham Palace,Architecture,780.0,Average
3,United States Capitol Building,Architecture,1032.0,Average


In [52]:
df_legos_architecture.sort_values(by = "piece_count", ascending= False)

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
3,United States Capitol Building,Architecture,1032.0,Average
7,Buckingham Palace,Architecture,780.0,Average
4,Solomon R. Guggenheim Museum®,Architecture,744.0,Challenging
6,New York City,Architecture,598.0,Average
5,Shanghai,Architecture,597.0,Average
8,London,Architecture,468.0,Average
9,Chicago,Architecture,444.0,Average
10,Arc de Triomphe,Architecture,386.0,Average
12,Sydney,Architecture,361.0,Easy
11,The Eiffel Tower,Architecture,321.0,Average


In [53]:
df_legos_architecture.sort_values(by = ["review_difficulty", "piece_count"], ascending= [False, True])

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
12,Sydney,Architecture,361.0,Easy
4,Solomon R. Guggenheim Museum®,Architecture,744.0,Challenging
11,The Eiffel Tower,Architecture,321.0,Average
10,Arc de Triomphe,Architecture,386.0,Average
9,Chicago,Architecture,444.0,Average
8,London,Architecture,468.0,Average
5,Shanghai,Architecture,597.0,Average
6,New York City,Architecture,598.0,Average
7,Buckingham Palace,Architecture,780.0,Average
3,United States Capitol Building,Architecture,1032.0,Average


In [54]:
df_legos_architecture["review_difficulty"] = pandas.Categorical(df_legos_architecture["review_difficulty"],
                                categories = ['Very Easy', 'Easy', 'Average', 'Challenging', 'Very Challenging'],
                                ordered = True)

In [55]:
df_legos_architecture.sort_values(by = ["review_difficulty", "piece_count"], ascending= [False, False])

Unnamed: 0,set_name,theme_name,piece_count,review_difficulty
4,Solomon R. Guggenheim Museum®,Architecture,744.0,Challenging
3,United States Capitol Building,Architecture,1032.0,Average
7,Buckingham Palace,Architecture,780.0,Average
6,New York City,Architecture,598.0,Average
5,Shanghai,Architecture,597.0,Average
8,London,Architecture,468.0,Average
9,Chicago,Architecture,444.0,Average
10,Arc de Triomphe,Architecture,386.0,Average
11,The Eiffel Tower,Architecture,321.0,Average
12,Sydney,Architecture,361.0,Easy
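
Sorting the plain text column ordered the difficulties alphabetically, which put "Easy" ahead of "Challenging". Converting the column to an ordered categorical is what makes pandas sort by level instead. A compact sketch of the difference on three values:

```python
import pandas

levels = ["Very Easy", "Easy", "Average", "Challenging", "Very Challenging"]
difficulty = pandas.Series(["Easy", "Challenging", "Average"])

# Plain strings: descending alphabetical order
alphabetical = difficulty.sort_values(ascending=False).tolist()

# Ordered categorical: descending order of the listed levels
ordered = pandas.Series(pandas.Categorical(difficulty, categories=levels, ordered=True))
by_level = ordered.sort_values(ascending=False).tolist()
hardest = ordered.max()  # comparisons and max/min follow the category order
```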


# Making a pokedex (reading html code)

What if I don't have a CSV file? A lot of information is available online, and pandas helps with scraping tables in HTML format.

* `pandas.read_html` https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html
* Pokedex database: https://pokemondb.net/pokedex/all

In [56]:
url = "https://pokemondb.net/pokedex/all"
html_text = requests.get(url).content

In [57]:
pandas.read_html(html_text)

[        #                         Name            Type  Total   HP  Attack  \
 0       1                    Bulbasaur    Grass Poison    318   45      49   
 1       2                      Ivysaur    Grass Poison    405   60      62   
 2       3                     Venusaur    Grass Poison    525   80      82   
 3       3       Venusaur Mega Venusaur    Grass Poison    625   80     100   
 4       4                   Charmander            Fire    309   39      52   
 ...   ...                          ...             ...    ...  ...     ...   
 1029  890          Eternatus Eternamax   Poison Dragon   1125  255     115   
 1030  891                        Kubfu        Fighting    385   60      90   
 1031  892  Urshifu Single Strike Style   Fighting Dark    550  100     130   
 1032  892   Urshifu Rapid Strike Style  Fighting Water    550  100     130   
 1033  893                       Zarude      Dark Grass    600  105     120   
 
       Defense  Sp. Atk  Sp. Def  Speed  
 0      

In [58]:
df_pokedex = pandas.read_html(html_text)[0]
df_pokedex

Unnamed: 0,#,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass Poison,525,80,82,83,100,100,80
3,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80
4,4,Charmander,Fire,309,39,52,43,60,50,65
...,...,...,...,...,...,...,...,...,...,...
1029,890,Eternatus Eternamax,Poison Dragon,1125,255,115,250,125,250,130
1030,891,Kubfu,Fighting,385,60,90,60,53,50,72
1031,892,Urshifu Single Strike Style,Fighting Dark,550,100,130,100,63,60,97
1032,892,Urshifu Rapid Strike Style,Fighting Water,550,100,130,100,63,60,97


## Create separate Type 1 and Type 2 columns for each pokemon

In [59]:
split_columns = df_pokedex["Type"].str.split(" ", expand = True)

df_pokedex["Type 1"] = split_columns[0]
df_pokedex["Type 2"] = split_columns[1]

df_pokedex

Unnamed: 0,#,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
0,1,Bulbasaur,Grass Poison,318,45,49,49,65,65,45,Grass,Poison
1,2,Ivysaur,Grass Poison,405,60,62,63,80,80,60,Grass,Poison
2,3,Venusaur,Grass Poison,525,80,82,83,100,100,80,Grass,Poison
3,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80,Grass,Poison
4,4,Charmander,Fire,309,39,52,43,60,50,65,Fire,
...,...,...,...,...,...,...,...,...,...,...,...,...
1029,890,Eternatus Eternamax,Poison Dragon,1125,255,115,250,125,250,130,Poison,Dragon
1030,891,Kubfu,Fighting,385,60,90,60,53,50,72,Fighting,
1031,892,Urshifu Single Strike Style,Fighting Dark,550,100,130,100,63,60,97,Fighting,Dark
1032,892,Urshifu Rapid Strike Style,Fighting Water,550,100,130,100,63,60,97,Fighting,Water
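
To see what `str.split(..., expand = True)` returns on its own, here is a sketch on three type strings taken from the table:

```python
import pandas

types = pandas.Series(["Grass Poison", "Fire", "Fighting Dark"])

parts = types.str.split(" ", expand=True)  # DataFrame with integer column labels 0 and 1
type_1 = parts[0]
type_2 = parts[1]  # missing wherever there is no second type
```

Pokemon with a single type get a missing value in the second column, which shows up as the empty `Type 2` cells above.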


## What's the strongest physical attacker?

In [60]:
df_pokedex.sort_values(by = "Attack", ascending = False).head(10)

Unnamed: 0,#,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
191,150,Mewtwo Mega Mewtwo X,Psychic Fighting,780,106,190,100,154,100,130,Psychic,Fighting
260,214,Heracross Mega Heracross,Bug Fighting,600,80,185,115,40,105,75,Bug,Fighting
927,798,Kartana,Grass Steel,570,59,181,131,59,31,109,Grass,Steel
458,383,Groudon Primal Groudon,Ground Fire,770,100,180,160,150,90,90,Ground,Fire
460,384,Rayquaza Mega Rayquaza,Dragon Flying,780,105,180,100,180,100,115,Dragon,Flying
463,386,Deoxys Attack Forme,Psychic,600,50,180,20,180,20,150,Psychic,
1024,888,Zacian Crowned Sword,Fairy Steel,720,92,170,115,80,115,148,Fairy,Steel
751,646,Kyurem Black Kyurem,Dragon Ice,700,125,170,100,120,90,95,Dragon,Ice
528,445,Garchomp Mega Garchomp,Dragon Ground,700,108,170,115,120,95,92,Dragon,Ground
932,800,Necrozma Ultra Necrozma,Psychic Dragon,754,97,167,97,167,97,129,Psychic,Dragon
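
Sorting and then taking `head` is one way to get a top-N list. `DataFrame.nlargest` is a shorter alternative that should give the same rows. A sketch on four rows copied from the table:

```python
import pandas

df = pandas.DataFrame({
    "Name": ["Kartana", "Kubfu", "Zarude", "Bulbasaur"],
    "Attack": [181, 90, 120, 49],
})

top_sorted = df.sort_values(by="Attack", ascending=False).head(2)
top_nlargest = df.nlargest(2, "Attack")  # same two rows, highest Attack first
```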


## What's the strongest special attacker?

In [61]:
df_pokedex.sort_values(by = "Sp. Atk", ascending = False).head(10)

Unnamed: 0,#,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
192,150,Mewtwo Mega Mewtwo Y,Psychic,780,106,150,70,194,120,140,Psychic,
463,386,Deoxys Attack Forme,Psychic,600,50,180,20,180,20,150,Psychic,
460,384,Rayquaza Mega Rayquaza,Dragon Flying,780,105,180,100,180,100,115,Dragon,Flying
456,382,Kyogre Primal Kyogre,Water,770,100,150,90,180,160,90,Water,
84,65,Alakazam Mega Alakazam,Psychic,600,55,50,65,175,105,150,Psychic,
925,796,Xurkitree,Electric,570,83,89,71,173,71,83,Electric,
125,94,Gengar Mega Gengar,Ghost Poison,600,60,65,80,170,95,130,Ghost,Poison
752,646,Kyurem White Kyurem,Dragon Ice,700,125,120,90,170,100,95,Dragon,Ice
841,720,Hoopa Hoopa Unbound,Psychic Dark,680,80,160,60,170,130,80,Psychic,Dark
932,800,Necrozma Ultra Necrozma,Psychic Dragon,754,97,167,97,167,97,129,Psychic,Dragon


## Rename # column

In [62]:
df_pokedex = df_pokedex.rename(columns = {"#": "Pokedex"})
df_pokedex

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
0,1,Bulbasaur,Grass Poison,318,45,49,49,65,65,45,Grass,Poison
1,2,Ivysaur,Grass Poison,405,60,62,63,80,80,60,Grass,Poison
2,3,Venusaur,Grass Poison,525,80,82,83,100,100,80,Grass,Poison
3,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80,Grass,Poison
4,4,Charmander,Fire,309,39,52,43,60,50,65,Fire,
...,...,...,...,...,...,...,...,...,...,...,...,...
1029,890,Eternatus Eternamax,Poison Dragon,1125,255,115,250,125,250,130,Poison,Dragon
1030,891,Kubfu,Fighting,385,60,90,60,53,50,72,Fighting,
1031,892,Urshifu Single Strike Style,Fighting Dark,550,100,130,100,63,60,97,Fighting,Dark
1032,892,Urshifu Rapid Strike Style,Fighting Water,550,100,130,100,63,60,97,Fighting,Water


## What's the fastest pokemon of the first generation?

In [63]:
df_pokedex.loc[df_pokedex["Pokedex"] <= 151].sort_values(by = "Speed", ascending = False)

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
84,65,Alakazam Mega Alakazam,Psychic,600,55,50,65,175,105,150,Psychic,
182,142,Aerodactyl Mega Aerodactyl,Rock Flying,615,80,135,85,70,95,150,Rock,Flying
132,101,Electrode,Electric,490,60,50,70,80,80,150,Electric,
19,15,Beedrill Mega Beedrill,Bug Poison,495,65,150,40,15,80,145,Bug,Poison
192,150,Mewtwo Mega Mewtwo Y,Psychic,780,106,150,70,194,120,140,Psychic,
...,...,...,...,...,...,...,...,...,...,...,...,...
52,39,Jigglypuff,Normal Fairy,270,115,45,20,45,25,20,Normal,Fairy
94,74,Geodude Alolan Geodude,Rock Electric,300,40,80,100,30,30,20,Rock,Electric
93,74,Geodude,Rock Ground,300,40,80,100,30,30,20,Rock,Ground
104,79,Slowpoke Galarian Slowpoke,Psychic,315,90,65,65,40,40,15,Psychic,
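
The output above shows three pokemon tied at speed 150, so "the fastest" is not a single row. As a minimal sketch on a toy table (a handful of rows copied from the outputs in this notebook; `df_toy` and the other variable names are illustrative and not part of the original notebook), `nlargest` returns the top rows directly, and comparing against `max()` keeps every pokemon tied for first place:

```python
import pandas

# Toy subset of the Pokedex table; values are copied from the outputs shown in this notebook
df_toy = pandas.DataFrame({
    "Pokedex": [65, 101, 135, 79, 384],
    "Name": ["Alakazam Mega Alakazam", "Electrode", "Jolteon",
             "Slowpoke", "Rayquaza Mega Rayquaza"],
    "Speed": [150, 150, 130, 15, 115],
})

# Keep only the first generation (Pokedex numbers 1 to 151)
first_gen = df_toy.loc[df_toy["Pokedex"] <= 151]

# Top 3 by speed, without sorting the whole table
top3 = first_gen.nlargest(3, "Speed")

# Every row tied for the highest speed
fastest = first_gen.loc[first_gen["Speed"] == first_gen["Speed"].max()]

print(top3)
print(fastest)
```

On the full table, the same two expressions could be applied to `df_pokedex` instead of `df_toy`.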


## What's the fastest pokemon of the first generation? (no megas)

In [64]:
df_pokedex.loc[(df_pokedex["Pokedex"] <= 151) & (~df_pokedex["Name"].str.contains("Mega"))]\
            .sort_values(by = "Speed", ascending = False)

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
132,101,Electrode,Electric,490,60,50,70,80,80,150,Electric,
181,142,Aerodactyl,Rock Flying,515,80,105,65,60,75,130,Rock,Flying
174,135,Jolteon,Electric,525,65,65,60,110,95,130,Electric,
190,150,Mewtwo,Psychic,680,106,110,90,154,90,130,Psychic,
33,25,Pikachu Partner Pikachu,Electric,430,45,80,50,75,60,120,Electric,
...,...,...,...,...,...,...,...,...,...,...,...,...
52,39,Jigglypuff,Normal Fairy,270,115,45,20,45,25,20,Normal,Fairy
94,74,Geodude Alolan Geodude,Rock Electric,300,40,80,100,30,30,20,Rock,Electric
93,74,Geodude,Rock Ground,300,40,80,100,30,30,20,Rock,Ground
103,79,Slowpoke,Water Psychic,315,90,65,65,40,40,15,Water,Psychic
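
The filter above matches the substring `"Mega"`, while later cells match `" Mega "` with surrounding spaces. Within the first generation both give the same rows, but on the full table the looser pattern would presumably also match a name such as Meganium. The sketch below illustrates the difference on a small hand-written Series (the names `names`, `loose` and `strict` are illustrative); `regex=False` asks for a plain substring match:

```python
import pandas

# Hand-written sample of names; "Meganium" is an ordinary pokemon, not a mega evolution
names = pandas.Series(["Venusaur", "Venusaur Mega Venusaur", "Meganium", "Electrode"])

# Plain substring match: also flags "Meganium"
loose = names.str.contains("Mega", regex=False)

# Requiring the surrounding spaces only flags names of the form "<Name> Mega <Name>"
strict = names.str.contains(" Mega ", regex=False)

print(pandas.DataFrame({"name": names, "loose": loose, "strict": strict}))
```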


## Which pokemon have a mega evolution?

In [65]:
df_pokedex.loc[df_pokedex["Name"].str.contains(" Mega ")]

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2
3,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80,Grass,Poison
7,6,Charizard Mega Charizard X,Fire Dragon,634,78,130,111,130,85,100,Fire,Dragon
8,6,Charizard Mega Charizard Y,Fire Flying,634,78,104,78,159,115,100,Fire,Flying
12,9,Blastoise Mega Blastoise,Water,630,79,103,120,135,115,78,Water,
19,15,Beedrill Mega Beedrill,Bug Poison,495,65,150,40,15,80,145,Bug,Poison
23,18,Pidgeot Mega Pidgeot,Normal Flying,579,83,80,80,135,80,121,Normal,Flying
84,65,Alakazam Mega Alakazam,Psychic,600,55,50,65,175,105,150,Psychic,
106,80,Slowbro Mega Slowbro,Water Psychic,590,95,75,180,130,80,30,Water,Psychic
125,94,Gengar Mega Gengar,Ghost Poison,600,60,65,80,170,95,130,Ghost,Poison
150,115,Kangaskhan Mega Kangaskhan,Normal,590,105,125,100,60,100,100,Normal,


In [66]:
len(df_pokedex.loc[df_pokedex["Name"].str.contains(" Mega ")])

48
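
Note that the count above is the number of mega forms, not the number of species: Charizard, for example, has two mega forms. A small sketch of both counts on toy rows copied from the outputs above (variable names are illustrative):

```python
import pandas

# Toy rows copied from the outputs above
df_toy = pandas.DataFrame({
    "Pokedex": [3, 3, 6, 6, 6],
    "Name": ["Venusaur", "Venusaur Mega Venusaur", "Charizard",
             "Charizard Mega Charizard X", "Charizard Mega Charizard Y"],
})

# Boolean mask: True for rows that are mega forms
is_mega = df_toy["Name"].str.contains(" Mega ", regex=False)

# Summing a boolean mask counts the True values (same result as len() on the filtered frame)
n_mega_forms = int(is_mega.sum())

# Distinct Pokedex numbers among the mega rows = species with at least one mega form
n_species_with_mega = df_toy.loc[is_mega, "Pokedex"].nunique()

print(n_mega_forms, n_species_with_mega)
```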

## Which pokemon change type when they mega evolve?

In [67]:
df_megas = df_pokedex.loc[df_pokedex["Name"].str.contains(" Mega "), ["Pokedex", "Name", "Type"]]
df_megas.head()

Unnamed: 0,Pokedex,Name,Type
3,3,Venusaur Mega Venusaur,Grass Poison
7,6,Charizard Mega Charizard X,Fire Dragon
8,6,Charizard Mega Charizard Y,Fire Flying
12,9,Blastoise Mega Blastoise,Water
19,15,Beedrill Mega Beedrill,Bug Poison


In [68]:
df_megas = df_megas.rename(columns = {"Name": "Mega Name", "Type": "Mega Type"})
df_megas.head()

Unnamed: 0,Pokedex,Mega Name,Mega Type
3,3,Venusaur Mega Venusaur,Grass Poison
7,6,Charizard Mega Charizard X,Fire Dragon
8,6,Charizard Mega Charizard Y,Fire Flying
12,9,Blastoise Mega Blastoise,Water
19,15,Beedrill Mega Beedrill,Bug Poison


In [69]:
# inner join (the default for pandas.merge): keeps only Pokedex numbers present in both dataframes
df_pokedex_megas = pandas.merge(df_pokedex, df_megas, on = "Pokedex")
df_pokedex_megas.head()

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2,Mega Name,Mega Type
0,3,Venusaur,Grass Poison,525,80,82,83,100,100,80,Grass,Poison,Venusaur Mega Venusaur,Grass Poison
1,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80,Grass,Poison,Venusaur Mega Venusaur,Grass Poison
2,6,Charizard,Fire Flying,534,78,84,78,109,85,100,Fire,Flying,Charizard Mega Charizard X,Fire Dragon
3,6,Charizard,Fire Flying,534,78,84,78,109,85,100,Fire,Flying,Charizard Mega Charizard Y,Fire Flying
4,6,Charizard Mega Charizard X,Fire Dragon,634,78,130,111,130,85,100,Fire,Dragon,Charizard Mega Charizard X,Fire Dragon
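
The merged output above repeats rows because an inner join on `Pokedex` pairs every row of a species with every mega form of that species, and drops species that have no mega form. A minimal sketch of that behaviour on toy rows copied from this notebook (`df_all`, `df_mega` and `merged` are illustrative names):

```python
import pandas

# Toy rows copied from the outputs in this notebook
df_all = pandas.DataFrame({
    "Pokedex": [3, 3, 4, 6, 6, 6],
    "Name": ["Venusaur", "Venusaur Mega Venusaur", "Charmander", "Charizard",
             "Charizard Mega Charizard X", "Charizard Mega Charizard Y"],
    "Type": ["Grass Poison", "Grass Poison", "Fire", "Fire Flying",
             "Fire Dragon", "Fire Flying"],
})

# Mega forms only, with renamed columns so they do not clash after the merge
is_mega = df_all["Name"].str.contains(" Mega ", regex=False)
df_mega = (df_all.loc[is_mega, ["Pokedex", "Name", "Type"]]
                 .rename(columns={"Name": "Mega Name", "Type": "Mega Type"}))

# Inner join: Venusaur has 2 rows x 1 mega, Charizard has 3 rows x 2 megas,
# and Charmander (no mega form) is dropped
merged = pandas.merge(df_all, df_mega, on="Pokedex", how="inner")

print(len(merged))
print(merged[["Pokedex", "Name", "Mega Name"]])
```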


In [70]:
df_pokedex_megas.loc[(df_pokedex_megas["Type"] != df_pokedex_megas["Mega Type"])
                     & (~df_pokedex_megas["Name"].str.contains(" Mega "))]

Unnamed: 0,Pokedex,Name,Type,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1,Type 2,Mega Name,Mega Type
2,6,Charizard,Fire Flying,534,78,84,78,109,85,100,Fire,Flying,Charizard Mega Charizard X,Fire Dragon
18,80,Slowbro Galarian Slowbro,Poison Psychic,490,95,100,95,100,70,30,Poison,Psychic,Slowbro Mega Slowbro,Water Psychic
23,127,Pinsir,Bug,500,65,125,100,55,70,85,Bug,,Pinsir Mega Pinsir,Bug Flying
25,130,Gyarados,Water Flying,540,95,125,79,60,100,81,Water,Flying,Gyarados Mega Gyarados,Water Dark
29,150,Mewtwo,Psychic,680,106,110,90,154,90,130,Psychic,,Mewtwo Mega Mewtwo X,Psychic Fighting
35,181,Ampharos,Electric,510,90,75,85,115,90,55,Electric,,Ampharos Mega Ampharos,Electric Dragon
47,254,Sceptile,Grass,530,70,85,65,105,85,120,Grass,,Sceptile Mega Sceptile,Grass Dragon
59,306,Aggron,Steel Rock,530,70,110,180,60,60,50,Steel,Rock,Aggron Mega Aggron,Steel
69,334,Altaria,Dragon Flying,490,75,70,90,70,105,80,Dragon,Flying,Altaria Mega Altaria,Dragon Fairy
87,428,Lopunny,Normal,480,65,76,84,54,96,105,Normal,,Lopunny Mega Lopunny,Normal Fighting
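
One possible variation, sketched here on toy rows copied from this notebook, is to drop the mega rows from the left side before merging, so that each result row already pairs a non-mega form with one of its mega forms. All variable names are illustrative. Regional forms such as "Slowbro Galarian Slowbro" are not mega forms, so they still count as base rows, as in the output above:

```python
import pandas

# Toy rows copied from the outputs in this notebook
df_all = pandas.DataFrame({
    "Pokedex": [3, 3, 6, 6, 6, 80, 80],
    "Name": ["Venusaur", "Venusaur Mega Venusaur", "Charizard",
             "Charizard Mega Charizard X", "Charizard Mega Charizard Y",
             "Slowbro Galarian Slowbro", "Slowbro Mega Slowbro"],
    "Type": ["Grass Poison", "Grass Poison", "Fire Flying",
             "Fire Dragon", "Fire Flying",
             "Poison Psychic", "Water Psychic"],
})

is_mega = df_all["Name"].str.contains(" Mega ", regex=False)

# Left side: non-mega rows only; right side: mega rows with renamed columns
df_base = df_all.loc[~is_mega]
df_mega = (df_all.loc[is_mega, ["Pokedex", "Name", "Type"]]
                 .rename(columns={"Name": "Mega Name", "Type": "Mega Type"}))

# Each row pairs a non-mega form with one mega form of the same Pokedex number
pairs = pandas.merge(df_base, df_mega, on="Pokedex")

# Keep the pairs whose type string differs
changed = pairs.loc[pairs["Type"] != pairs["Mega Type"],
                    ["Name", "Type", "Mega Name", "Mega Type"]]

print(changed)
```

Filtering before the merge keeps the intermediate table smaller and avoids the second `.loc` step, at the cost of one extra named dataframe.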


# What's next?

* Grouping and aggregating functions
* Merging dataframes (joins, appends)
* Apply vectorized functions
* Working with Big Data
* Pandas pro tips and tricks

# Contact

Manuel Montoya 
* Mail: manuel.montoya@pucp.edu.pe
* Linkedin: https://www.linkedin.com/in/manuel-montoya-gamio/

<img src="https://miro.medium.com/max/3006/1*KdxlBR9P3mDp9JZ_URMdYQ.jpeg">