# Introduction to Pandas

### Outline
* [What Is Pandas?](#what-is-pandas)
* [Why Pandas?](#why-pandas)
* [Installing Pandas](#installing-pandas)
    * with a stack
    * with pip
* [Getting started](#getting-started)
* [Data structures](#data-structures)
    * Series
    * DataFrame
* [Series overview](#series-overview)
    * Creating 
    * index
    * values
* [DataFrame overview](#dataframe-overview)
    * Creating an empty DataFrame
    * Creating a DataFrame from a Python dict
    * Adding columns
    * Appending rows to the DataFrame
* [DataFrame characteristics](#dataframe-characteristics)
    * shape
    * describe 
    * head
    * tail
* [DataFrame indexing](#dataframe-indexing)
    * Using iloc to select single row
    * Using iloc to select multiple rows (slicing)
    * Using iloc to select multiple rows with steps (slicing with steps)
    * Setting an index
    * Using loc to select a single row
* [Conditionals](#conditionals)
* [Resources](#resources)

<a id="what-is-pandas"></a>
### What Is Pandas?

Pandas is a data analysis and manipulation tool.

[More information](https://pandas.pydata.org/about/index.html)

<a id="why-pandas"></a>
### Why Pandas?

* Fast
* Powerful
* Flexible
* Easy to use
* Open source
* Built on top of the Python programming language

<a id="installing-pandas"></a>
### Installing Pandas

Pandas is already included in the following stacks:
* [Anaconda](https://www.anaconda.com/)
* [SciPy](https://www.scipy.org/about.html)

Pandas can also be installed with pip:

```
python3 -m pip install pandas
```
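
To confirm that the installation worked, a quick check like the one below can be run in a Python session. It is only a sketch: it imports the package and prints its version string, and the value shown depends on your environment. The variable name `installed_version` is chosen here for illustration.

```python
import pandas as pd

# Read the version string of the installed package.
# The exact value depends on the environment, so no output is shown here.
installed_version = pd.__version__
print(installed_version)
```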

<a id="getting-started"></a>
### Getting Started

By convention, pandas is imported under the alias `pd`.

In [None]:
import pandas as pd

<a id="data-structures"></a>
### Data Structures

**Series**
Series are designed to accommodate a sequence of one-dimensional data.  

**DataFrame**
DataFrames are designed to hold two-dimensional, tabular data: rows of cases, each described by several columns.
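
As a minimal sketch of how the two structures relate, the example below builds a small DataFrame and pulls out one column, which comes back as a Series. The name `demo` is chosen here for illustration, and the three rows are a subset of the race data used later in this notebook.

```python
import pandas as pd

# A small table: two columns, three rows.
demo = pd.DataFrame({
    "grand_prix": ["Monaco", "Italy", "Japan"],
    "laps": [78, 53, 52],
})

# Selecting a single column returns a one-dimensional Series.
laps = demo["laps"]

print(type(demo).__name__)   # DataFrame
print(type(laps).__name__)   # Series
print(demo.ndim, laps.ndim)  # 2 1
```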

<a id="series-overview"></a>
### Series Overview

[Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [None]:
s = pd.Series([1, 2, 3, 4])

In [None]:
s

In [None]:
s.index

In [None]:
type(s.index)

In [None]:
s.values

In [None]:
# the values of this Series are stored in a NumPy ndarray
type(s.values)
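
The index does not have to be the default integer range. As a small sketch, the example below passes explicit labels through the `index` parameter of `pd.Series` and looks a value up by its label. The variable name `laps` and the choice of labels are for illustration only.

```python
import pandas as pd

# Build a Series with string labels instead of the default integer index.
laps = pd.Series([58, 57, 56], index=["Australia", "Bahrain", "China"])

print(laps.index.tolist())   # ['Australia', 'Bahrain', 'China']
print(laps["Bahrain"])       # 57
print(laps.values.tolist())  # [58, 57, 56]
```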

<a id="dataframe-overview"></a>
### DataFrame Overview

[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) is a 2-dimensional labeled data structure with columns of potentially different types.

**Creating an empty DataFrame**

In [None]:
f1_2019_races = pd.DataFrame()

In [None]:
f1_2019_races

**Creating a DataFrame from a Python dict**

[data source](https://www.formula1.com/en/results.html/2019/races.html)

In [None]:
races_dict = {
    "grand_prix": [ 
        'Australia', 'Bahrain', 'China', 
        'Azerbaijan', 'Spain', 'Monaco', 
        'Canada', 'France', 'Austria', 
        'Great Britain', 'Germany', 'Hungary', 
        'Belgium', 'Italy', 'Singapore', 
        'Russia', 'Japan', 'Mexico', 
        'United States', 'Brazil',
    ],
    "laps": [
        58, 57, 56,
        51, 66, 78,
        70, 53, 71, 
        52, 64, 70,
        44, 53, 61,
        53, 52, 71, 
        56, 71, 
    ]
}

In [None]:
f1_2019_races = pd.DataFrame(races_dict)

In [None]:
f1_2019_races
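
To make the mapping from dict to DataFrame explicit, here is a shortened sketch with only three races. Each dict key becomes a column name and each list becomes that column's values, so all lists must have the same length. The name `races` is used here for illustration.

```python
import pandas as pd

races = pd.DataFrame({
    "grand_prix": ["Australia", "Bahrain", "China"],
    "laps": [58, 57, 56],
})

# Keys become columns; rows get a default integer index.
print(races.columns.tolist())  # ['grand_prix', 'laps']
print(races.shape)             # (3, 2)

# Each column carries its own dtype; the lap counts are stored as integers.
print(pd.api.types.is_integer_dtype(races["laps"]))  # True
```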

**Adding columns to a DataFrame**

In [None]:
f1_2019_races['winner'] = [
    'Valtteri Bottas', 'Lewis Hamilton', 'Lewis Hamilton',
    'Valtteri Bottas', 'Lewis Hamilton', 'Lewis Hamilton',
    'Lewis Hamilton', 'Lewis Hamilton', 'Max Verstappen',
    'Lewis Hamilton', 'Max Verstappen', 'Lewis Hamilton',
    'Charles Leclerc', 'Charles Leclerc', 'Sebastian Vettel',
    'Lewis Hamilton', 'Valtteri Bottas', 'Lewis Hamilton',
    'Valtteri Bottas', 'Max Verstappen',
]

In [None]:
f1_2019_races['car'] = [
    'MERCEDES', 'MERCEDES', 'MERCEDES', 
    'MERCEDES', 'MERCEDES', 'MERCEDES', 
    'MERCEDES', 'MERCEDES', 'RED BULL RACING HONDA',
    'MERCEDES', 'RED BULL RACING HONDA', 'MERCEDES',
    'FERRARI', 'FERRARI', 'FERRARI', 
    'MERCEDES', 'MERCEDES', 'MERCEDES', 
    'MERCEDES', 'RED BULL RACING HONDA'
]

In [None]:
f1_2019_races

In [None]:
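# keep a backup copy; plain assignment would only create a second name for the same object
tempdf = f1_2019_races.copy()

In [None]:
# restore the DataFrame from the backup if needed
f1_2019_races = tempdf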

**Appending rows to the DataFrame**

In [None]:
abu_dhabi_race = [ {'grand_prix':'Abu Dhabi', 
                  'laps': 55, 
                  'winner':'Lewis Hamilton', 
                  'car':'MERCEDES' } ]
abu = pd.DataFrame.from_records(abu_dhabi_race)
# alternatively: abu = pd.DataFrame(abu_dhabi_race)

In [None]:
newdf = pd.concat([f1_2019_races, abu], ignore_index=True)

In [None]:
newdf
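
The `ignore_index=True` argument matters here. Without it, the appended row keeps its own label `0`, which duplicates a label already present in the first frame. The following sketch shows the difference on a two-row frame; the names `first`, `extra`, `kept`, and `renumbered` are chosen for illustration.

```python
import pandas as pd

first = pd.DataFrame({"grand_prix": ["Mexico", "Brazil"], "laps": [71, 71]})
extra = pd.DataFrame({"grand_prix": ["Abu Dhabi"], "laps": [55]})

# Without ignore_index the original row labels are kept, so 0 appears twice.
kept = pd.concat([first, extra])
print(kept.index.tolist())        # [0, 1, 0]

# With ignore_index=True the rows are relabeled 0, 1, 2.
renumbered = pd.concat([first, extra], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Note that `pd.concat` returns a new DataFrame and leaves its inputs unchanged.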

<a id="dataframe-characteristics"></a>
### DataFrame Characteristics


In [None]:
f1_2019_races.shape

In [None]:
f1_2019_races.describe()

In [None]:
# describe is also available for Series (a column in the DataFrame)
f1_2019_races['car'].describe()

In [None]:
# view the first few rows (defaults to first 5)
f1_2019_races.head()

In [None]:
f1_2019_races.head(3) # we can specify how many we want instead of the default value.

In [None]:
# view the last few rows (defaults to last 5)
f1_2019_races.tail()

In [None]:
f1_2019_races.tail(7) # we can specify how many we want instead of the default value.

<a id="dataframe-indexing"></a>
### DataFrame Indexing


**Using iloc to select a single row**

In [None]:
# select the first row by index position
f1_2019_races.iloc[0]

In [None]:
# select the last row by index position
f1_2019_races.iloc[-1]

**Using iloc to select multiple rows (slicing)**

In [None]:
# select the rows at positions 7 through 13 (the stop position, 14, is excluded)
f1_2019_races.iloc[7:14]

**Using iloc to select multiple rows with steps (slicing with steps)**

In [None]:
# select every third race in the DataFrame, starting with the 3rd race
# remember that positions are zero-based, so the 3rd race is at position 2
f1_2019_races.iloc[2::3]
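
Because `iloc` follows Python's slicing rules, the stop position is excluded and an optional third value sets the step. The sketch below illustrates this on a six-row frame that is small enough to check by hand; the name `df` is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"laps": [58, 57, 56, 51, 66, 78]})

# start:stop selects positions 1, 2 and 3 (stop is excluded).
print(df.iloc[1:4]["laps"].tolist())   # [57, 56, 51]

# start::step selects positions 2 and 5.
print(df.iloc[2::3]["laps"].tolist())  # [56, 78]

# Negative positions count from the end.
print(df.iloc[-1]["laps"])             # 78
```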

**Setting an index**

In [None]:
f1_2019_races = f1_2019_races.set_index('grand_prix', drop=False)

**Using loc to select a single row**

In [None]:
# select the row corresponding to the Brazilian Grand Prix
f1_2019_races.loc['Brazil']
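
The difference between the two accessors is that `loc` looks rows up by index label while `iloc` looks them up by integer position. The sketch below compares both on a three-row frame. It is an illustration only: the name `races` is chosen here, and `set_index` is called with the column name, which moves that column into the index.

```python
import pandas as pd

races = pd.DataFrame({
    "grand_prix": ["Mexico", "United States", "Brazil"],
    "laps": [71, 56, 71],
}).set_index("grand_prix")

# Same row, reached by label and by position.
by_label = races.loc["Brazil"]
by_position = races.iloc[2]

print(by_label["laps"])              # 71
print(by_label.equals(by_position))  # True
```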

<a id="conditionals"></a>
### Conditionals


Single condition

In [None]:
# Find the races that Mercedes did not win.
f1_2019_races[f1_2019_races['car'] != 'MERCEDES']

When using multiple conditions be sure to:  
* separate each condition with the & sign  
* wrap each condition in parentheses

In [None]:
# Find the races that had fewer than 60 laps and were won by someone other than Lewis Hamilton
f1_2019_races[(f1_2019_races['laps'] < 60) & (f1_2019_races['winner'] != 'Lewis Hamilton')]
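
The parentheses are needed because, in Python, `&` binds more tightly than comparison operators such as `<` and `!=`. One way to make this easier to read is to store each condition in its own variable first, as in the sketch below. The names `races`, `short_race`, `not_hamilton`, and `selected` are chosen for illustration, and the four rows are a subset of the data above.

```python
import pandas as pd

races = pd.DataFrame({
    "laps": [58, 78, 53, 71],
    "winner": [
        "Valtteri Bottas", "Lewis Hamilton",
        "Charles Leclerc", "Max Verstappen",
    ],
})

# Each comparison yields a boolean Series with one value per row.
short_race = races["laps"] < 60
not_hamilton = races["winner"] != "Lewis Hamilton"

# & combines the two masks element-wise; only rows where both are True remain.
selected = races[short_race & not_hamilton]
print(selected["winner"].tolist())  # ['Valtteri Bottas', 'Charles Leclerc']
```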

<a id="resources"></a>
### Resources

* [User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
* [Pandas Category Data Type](https://pbpython.com/pandas_dtypes_cat.html)

![Capivara](../imgs/capivara.jpg)