<img src="https://snipboard.io/Kx6OAi.jpg">

# Session 1. Advanced Pandas: Configuration and Review
<div style="margin-top: -20px;">Author:  David Yerrington</div>

## Learning Objectives

- Install your Python Environment
- Run Jupyter Lab
- Review of Pandas

### Prerequisite Knowledge
- Basic Pandas 
  - Difference between Series vs Dataframe
  - Bitmasks, query function, selecting data
  - Aggregations

## Environment Setup

We will first review some basic points for setting up Python and the environment, starting with [the setup guide](../environment.md).


# 1. Introduction

The most common problems we face working with data include:
- Access
- Exploration
- Transformation

Most data problems involve acquiring data in some form, understanding it well enough to scope a project, and then changing part of it to clean it, model it, or create new datasets!  Pandas is our multi-tool for exploring what can be built with data.  The faster you can explore and transform data in Pandas, the easier most projects become.

## 1.1 What is "Advanced" Pandas?

TBD:  Explanation of my experience about what is common working on data science teams and engineering roles.
2ndary explanation:  Use cases of advanced pandas



## 1.2 Use Cases

TBD: explanation of these use cases

- Working with semi-structured mobile game data
- Log files
- Text analytics 
- Writing methods using apply and map rather than manual iteration
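
As a rough illustration of the last use case, the sketch below labels rows first with a manual loop and then with `.map()`. The `attack_label` helper and the 50-point threshold are hypothetical, chosen only for this example, and the tiny frame is a stand-in rather than the full dataset.

```python
import pandas as pd

# Tiny stand-in frame; the names and values are illustrative only.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Wartortle"],
    "Attack": [49, 52, 63],
})

def attack_label(attack):
    """Hypothetical helper: label an attack value with an arbitrary threshold."""
    return "strong" if attack >= 50 else "weak"

# Manual iteration: works, but is verbose and slow on large frames.
labels_loop = []
for _, row in sample.iterrows():
    labels_loop.append(attack_label(row["Attack"]))

# Same result with .map(), which applies the function to each value in the Series.
labels_map = sample["Attack"].map(attack_label)

print(labels_loop)          # ['weak', 'strong', 'strong']
print(labels_map.tolist())  # ['weak', 'strong', 'strong']
```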


# 2. Pandas Review

These review topics come up throughout the sessions ahead, and they are foundational points to have in place before we go deeper.

## Load a dataset

This is a Pokemon dataset and it's from [Kaggle](https://www.kaggle.com/terminus7/pokemon-challenge).  

> Pokemon are creatures that fight each other, in a turn-based RPG game.  Nothing is more practical than Pokemon data.

In [5]:
import pandas as pd

df = pd.read_csv("../data/pokemon.csv", encoding = "utf8")

## 2.1 Series vs. DataFrame

![](https://snipboard.io/8i3yIz.jpg)

### Selecting Series:  Row

In [7]:
df.loc[0]

#                     1
Name          Bulbasaur
Type 1            Grass
Type 2           Poison
HP                   45
Attack               49
Defense              49
Sp. Atk              65
Sp. Def              65
Speed                45
Generation            1
Legendary         False
Name: 0, dtype: object

### Selecting Series:  Column

In [11]:
df.loc[:, 'Attack']

0       49
1       62
2       82
3      100
4       52
      ... 
795    100
796    160
797    110
798    160
799    110
Name: Attack, Length: 800, dtype: int64
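
A quick way to see the difference between the two objects is to check their types. This is a minimal sketch on a small stand-in frame (the `sample` name and its three rows are assumptions for illustration, using values shown in this notebook), not the full dataset.

```python
import pandas as pd

# Small stand-in for the Pokemon data.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Wartortle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Attack": [49, 52, 63],
})

row = sample.loc[0]                # one row label -> Series
column = sample.loc[:, "Attack"]   # one column label -> Series
frame = sample.loc[:, ["Attack"]]  # list of labels -> DataFrame, even with one column

print(type(row).__name__)     # Series
print(type(column).__name__)  # Series
print(type(frame).__name__)   # DataFrame
```

Passing a list of labels instead of a single label is what keeps the result two-dimensional.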

### Question: Is the object `df.columns` a Series or a DataFrame?
Follow up:  How would you find out?

In [20]:
df.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
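
One way to answer the follow-up is to ask Python directly with `type()` and `isinstance()`. The sketch below uses a one-row stand-in frame (an assumption for illustration); the same checks should apply to `df.columns`.

```python
import pandas as pd

# One-row stand-in frame, just to have some columns to inspect.
sample = pd.DataFrame({"Name": ["Bulbasaur"], "Attack": [49]})

cols = sample.columns
print(type(cols))                      # the concrete class of the object
print(isinstance(cols, pd.Series))     # False
print(isinstance(cols, pd.DataFrame))  # False
print(isinstance(cols, pd.Index))      # True

# An Index converts easily when a plain list is more convenient.
print(cols.tolist())                   # ['Name', 'Attack']
```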

## 2.2 `.loc` and `.iloc`
### Selecting all rows, a subset of columns

> `.iloc` is extremely useful when you want to programmatically access columns whose names you might not know explicitly.  This is of particular use when writing scripts to automate the process of cleaning data.

In [25]:
df.loc[:, ['Type 1', 'Type 2', 'Attack']]

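|     | Type 1  | Type 2 | Attack |
|-----|---------|--------|--------|
| 0   | Grass   | Poison | 49     |
| 1   | Grass   | Poison | 62     |
| 2   | Grass   | Poison | 82     |
| 3   | Grass   | Poison | 100    |
| 4   | Fire    | NaN    | 52     |
| ... | ...     | ...    | ...    |
| 795 | Rock    | Fairy  | 100    |
| 796 | Rock    | Fairy  | 160    |
| 797 | Psychic | Ghost  | 110    |
| 798 | Psychic | Dark   | 160    |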


### Selecting specific rows, specific columns

In [29]:
df.loc[[0, 4], ['Name', 'Type 1', 'Type 2', 'Attack']]

|   | Name       | Type 1 | Type 2 | Attack |
|---|------------|--------|--------|--------|
| 0 | Bulbasaur  | Grass  | Poison | 49     |
| 4 | Charmander | Fire   | NaN    | 52     |


In [32]:
df.iloc[[0, 10], [1,2,3,5]]

|    | Name      | Type 1 | Type 2 | Attack |
|----|-----------|--------|--------|--------|
| 0  | Bulbasaur | Grass  | Poison | 49     |
| 10 | Wartortle | Water  | NaN    | 63     |
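
To make the "programmatic access" point concrete, here is a minimal sketch of position-based selection that never hard-codes a column name, plus `Index.get_loc` for translating a name into a position. The `sample` frame is a small stand-in assumed for illustration.

```python
import pandas as pd

# Small stand-in for the Pokemon data.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Wartortle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", None, None],
    "Attack": [49, 52, 63],
})

# Position-based: first and last column, whatever they happen to be called.
first_and_last = sample.iloc[:, [0, -1]]
print(first_and_last.columns.tolist())  # ['Name', 'Attack']

# Translate a column name into its position when a script needs both styles.
attack_pos = sample.columns.get_loc("Attack")
print(attack_pos)                  # 3
print(sample.iloc[0, attack_pos])  # 49
```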


## 3. Selecting data 

### With Masks

Selecting data in Pandas works by passing a sequence of True/False values, one per row: `True` means "show" and `False` means "hide" for the row with the matching index.

Let's see this on a basic dataset that we create first.

In [33]:
# A plain list of strings; parentheses around a single string do not make a tuple.
basic_data = [
    "Cat",
    "Bat",
    "Rat",
    "Wombat",  # this is the best row
]
basic_df = pd.DataFrame(basic_data, columns = ["animal"])
basic_df

|   | animal |
|---|--------|
| 0 | Cat    |
| 1 | Bat    |
| 2 | Rat    |
| 3 | Wombat |


In [35]:
## Selecting only the first row "cat"
mask = [True, False, False, False]
basic_df[mask]

|   | animal |
|---|--------|
| 0 | Cat    |


In [36]:
## Selecting only the second and fourth rows for "bat" and "wombat"
mask = [False, True, False, True]
basic_df[mask]

|   | animal |
|---|--------|
| 1 | Bat    |
| 3 | Wombat |


In [39]:
mask = basic_df['animal'] == "Wombat"
basic_df[mask]

|   | animal |
|---|--------|
| 3 | Wombat |


## This also works with `.loc` and `.iloc`

In [43]:
basic_df.loc[mask, ['animal']]

|   | animal |
|---|--------|
| 3 | Wombat |


So when we use basic conditional statements with Pandas to select data, they actually create a boolean Series of `True` and `False` values, one for each row they correspond to.
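
To see this directly, the sketch below prints the mask itself and then combines masks with `&`, `|`, and `~`. The `ends_in_at` and `short_name` variables are illustrative names, not part of the original notebook.

```python
import pandas as pd

basic_df = pd.DataFrame({"animal": ["Cat", "Bat", "Rat", "Wombat"]})

# The comparison alone produces the boolean Series.
mask = basic_df["animal"] == "Wombat"
print(mask.tolist())  # [False, False, False, True]

# Combine masks with & (and), | (or), and ~ (not).
ends_in_at = basic_df["animal"].str.endswith("at")
short_name = basic_df["animal"].str.len() == 3
long_at_names = basic_df[ends_in_at & ~short_name]
print(long_at_names["animal"].tolist())  # ['Wombat']

# Wrap each comparison in parentheses when combining them inline.
cat_or_rat = basic_df[(basic_df["animal"] == "Cat") | (basic_df["animal"] == "Rat")]
print(cat_or_rat["animal"].tolist())  # ['Cat', 'Rat']
```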

In [46]:
basic_df.iloc[[0,1,3]]

|   | animal |
|---|--------|
| 0 | Cat    |
| 1 | Bat    |
| 3 | Wombat |
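
One caveat worth checking against your own pandas version: `.loc` accepts a boolean Series directly, while `.iloc` is purely position-based and generally expects a plain boolean list or NumPy array rather than a Series. A minimal sketch, assuming the mask is converted with `.to_numpy()`:

```python
import pandas as pd

basic_df = pd.DataFrame({"animal": ["Cat", "Bat", "Rat", "Wombat"]})
mask = basic_df["animal"] == "Wombat"

# .loc: a boolean Series works as-is, with columns selected by name.
by_label = basic_df.loc[mask, ["animal"]]

# .iloc: pass a plain boolean array, with columns selected by position.
by_position = basic_df.iloc[mask.to_numpy(), [0]]

print(by_label.equals(by_position))  # True
```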


## Using `.query()`

`.query()` is a useful method in Pandas for quickly selecting data in a SQL-like manner.

In [48]:
basic_df.query("animal == 'Wombat'")

|   | animal |
|---|--------|
| 3 | Wombat |


The `.query` method is great for exploring data, but it does have some tradeoffs:


- Column names that contain spaces or operators must be surrounded by backticks, as must column names that begin with a digit.
- Columns named after reserved words (for example, `class`) also need backticks, and in my experience this sometimes still breaks `.query`.
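
The backtick rule can be sketched as follows, together with `@` for referencing a local variable and an equivalent mask as a fallback. The `sample` frame and the `min_attack` threshold are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the Pokemon data.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Wartortle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Attack": [49, 52, 63],
})

# Backticks let query() reference a column name containing a space.
fire = sample.query("`Type 1` == 'Fire'")
print(fire["Name"].tolist())  # ['Charmander']

# @ pulls a local Python variable into the query string.
min_attack = 50
strong = sample.query("Attack >= @min_attack")
print(strong["Name"].tolist())  # ['Charmander', 'Wartortle']

# The equivalent mask is a fallback when a column name trips up query().
same = sample[sample["Attack"] >= min_attack]
print(strong.equals(same))  # True
```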


## 4. `.info()` and Data Types

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        799 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   HP          800 non-null    int64 
 5   Attack      800 non-null    int64 
 6   Defense     800 non-null    int64 
 7   Sp. Atk     800 non-null    int64 
 8   Sp. Def     800 non-null    int64 
 9   Speed       800 non-null    int64 
 10  Generation  800 non-null    int64 
 11  Legendary   800 non-null    bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB
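
The output above reports 414 non-null `Type 2` values out of 800 rows, so 386 are missing, and `Name` is missing one. The sketch below shows a few related calls for inspecting types and missing values programmatically; it runs on a small stand-in frame assumed for illustration.

```python
import pandas as pd

# Small stand-in for the Pokemon data, including missing values.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Wartortle"],
    "Type 2": ["Poison", None, None],
    "Attack": [49, 52, 63],
    "Legendary": [False, False, False],
})

print(sample.dtypes)  # dtype of each column

# Count missing values per column.
missing = sample.isna().sum()
print(missing["Type 2"])  # 2

# Pick columns by dtype instead of by name; bool columns do not count as "number".
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Attack']
```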


# Summary

By now you should have your Jupyter environment set up and running, and be familiar with some core Pandas concepts:

* `.loc` and `.iloc` select data by row and column.  `.loc` allows you to select columns by name.  `.iloc` provides selection of columns by position, which is useful for programmatic access.
* Data is selected at the row level by boolean series called "masks".  Conditional operators create these boolean sequences and control the selection and display of DataFrames.
* `.describe()` is a handy function that automatically provides the most common aggregation methods across all numerical values in the dataset.
* `.groupby("[key column]")` aggregates data matching the key column parameter into groups of rows matching the same key. 
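
As a brief refresher on the last two points, here is a minimal sketch of `.describe()` and `.groupby()`. It uses only the `Type 1` and `Attack` values from the first five rows shown earlier, so the numbers illustrate the calls rather than summarize the full dataset.

```python
import pandas as pd

# Type 1 and Attack for the first five rows shown earlier in this notebook.
sample = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "Attack": [49, 62, 82, 100, 52],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = sample.describe()
print(summary)

# Group rows sharing the same "Type 1" value, then aggregate each group.
mean_attack = sample.groupby("Type 1")["Attack"].mean()
print(mean_attack["Grass"])  # 73.25
print(mean_attack["Fire"])   # 52.0
```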