<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data with Pandas 

[pandas](https://pandas.pydata.org/) is one of the most widely used Python packages among data engineers and data scientists.

> Pandas's name derives from **panel data**, a common term for multidimensional datasets encountered in statistics and econometrics. ([Wes McKinney](https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf))

Key features of pandas:
* it is built on top of the [NumPy](https://numpy.org/) package
* its data structures are commonly used as input for
  * statistical analysis in [SciPy](https://www.scipy.org/)
  * plotting functions from [Matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/)
  * machine learning algorithms in [Scikit-learn](https://scikit-learn.org/)
  
[Pandas' documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) is available online.

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Getting Started

Let's start by importing pandas.

In [None]:
%load_ext autotime
import pandas as pd

### Series and DataFrames

The two primary components of pandas are

| `Series` | `DataFrame` |
|----------|-------------|
| a column | a two-dimensional table made up of a collection of Series |
| ![](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_series.svg) | ![](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe1.svg) |

**NOTE**: *light gray* cells are data, while *dark gray* cells are indexes.
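
To make the distinction concrete, here is a minimal sketch that builds a `Series` on its own. The variable name `ages` and the values are illustrative, not part of the original material.

```python
import pandas as pd

# A Series is a single labelled column: data values plus an index.
ages = pd.Series([22, 35, 58], name="Age")

print(ages.index.tolist())   # the row labels, generated automatically
print(ages.values.tolist())  # the data itself
```

Because no index was passed, pandas generates integer row labels starting at 0.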

Let's create a dataframe

In [None]:
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

Let's have a look at the result

In [None]:
df

**0**, **1**, **2** are *row indexes*

**Name**, **Age** and **Sex** are *column indexes*
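
Both kinds of index can be inspected through the `index` and `columns` attributes. The sketch below rebuilds the same DataFrame so that it runs on its own; the names `row_labels` and `column_labels` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

row_labels = list(df.index)       # the row indexes
column_labels = list(df.columns)  # the column indexes

print(row_labels)
print(column_labels)
```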

[pandas.DataFrame.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html?highlight=info#pandas.DataFrame.info) prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [None]:
df.info()
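
Some of what `info()` prints is also exposed as attributes, which is handy when the values are needed in code rather than on screen. A minimal sketch, rebuilding the same DataFrame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

n_rows, n_cols = df.shape               # (number of rows, number of columns)
non_null_ages = df["Age"].notna().sum() # non-null count for one column

print(n_rows, n_cols)
print(df.dtypes)                        # one dtype per column
print(non_null_ages)
```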

### ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png)  Using column indexes

We can use *column indexes* to access the columns.

![](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns.svg)

If we access just one column, the result is a pandas `Series`

In [None]:
df["Age"]

If we access multiple columns, the result is a pandas `DataFrame`

In [None]:
df[["Age","Sex"]]

*NOTE*: selecting a single column is very similar to looking up a value by its key in a Python dictionary
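
The dictionary analogy can be checked side by side. In this sketch the plain dictionary `data` and the other variable names are illustrative; it assumes the same three passengers used above.

```python
import pandas as pd

data = {
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
}
df = pd.DataFrame(data)

ages_from_dict = data["Age"]   # dictionary lookup: a plain Python list
ages_from_df = df["Age"]       # same syntax on a DataFrame: a Series
subset = df[["Age", "Sex"]]    # a list of keys: a DataFrame

print(type(ages_from_dict).__name__)
print(type(ages_from_df).__name__)
print(type(subset).__name__)
```

Unlike a dictionary, a DataFrame also accepts a list of keys, which is what the double brackets express.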

### ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Using row indexes

We can use *row indexes* to access the rows.

Again, if we select a single row, the result is a pandas `Series`.

![](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_rows.svg)

In [None]:
df.loc[1]

If we select a range of rows, the result is a `DataFrame`

In [None]:
df.loc[0:1]

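*NOTE*: the selection of rows looks very similar to slicing Python lists, with one difference: `.loc` slices by label and includes both endpoints, so `df.loc[0:1]` returns the rows labelled 0 and 1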
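
The sketch below compares a list slice with the label-based `.loc` and the position-based `.iloc` on the same data. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

names = df["Name"].tolist()

list_slice = names[0:1]     # Python list: end position excluded
loc_slice = df.loc[0:1]     # label-based: end label included
iloc_slice = df.iloc[0:1]   # position-based: end position excluded

print(len(list_slice), len(loc_slice), len(iloc_slice))
```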

### ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png)  Using row and column indexes together

We can use *row* and *column indexes* together to select a subset of both rows and columns. The examples below use `.iloc`, which selects by integer position rather than by label.

![](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns_rows1.svg)

Notice that the following code returns a `DataFrame` with all the data

In [None]:
df.iloc[0:3, 0:3]

This one returns only the first two columns of the second and third rows.

In [None]:
df.iloc[1:3, 0:2]

This one returns the first and the third columns of the first and third rows.

In [None]:
df.iloc[[0,2], [0,2]]
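
The same kind of selection can also be written with labels instead of positions. As a hedged sketch, the following reproduces the last selection with `.loc` and column names; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

by_position = df.iloc[[0, 2], [0, 2]]        # integer positions
by_label = df.loc[[0, 2], ["Name", "Sex"]]   # row labels and column names

print(by_position.equals(by_label))
```

Here the row labels happen to coincide with the row positions because the DataFrame uses the default integer index; with a custom index the two accessors would take different row arguments.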

*Question*: how do you select the first and the third columns of the first and second rows?

In [None]:
df.<fill-in>

*Question*: how do you select the second column of the second row?

In [None]:
df.<fill-in>

### ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png)  Input/Output APIs

Look up the documentation for [Input/Output APIs](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)

|**Format**|**Data Description**|**Reader**    |
|----------|--------------------|--------------|
|text      |CSV                 |read_csv      |
|text      |JSON                |read_json     |
|text      |HTML                |read_html     |
|text      |Local clipboard     |read_clipboard|
|binary    |MS Excel            |read_excel    |
|binary    |HDF5 Format         |read_hdf      |
|binary    |Feather Format      |read_feather  |
|binary    |Parquet Format      |read_parquet  |
|binary    |Msgpack (removed in pandas 1.0)|read_msgpack  |
|binary    |Stata               |read_stata    |
|binary    |SAS                 |read_sas      |
|binary    |Python Pickle Format|read_pickle   |
|SQL       |SQL                 |read_sql      |
|SQL       |Google Big Query    |read_gbq      |

The pandas I/O API is a set of top-level `reader` functions, accessed like `pd.read_csv()`, that generally return a pandas object (typically a `DataFrame`).
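
As a minimal sketch of a reader in action, the example below parses CSV text held in memory with `read_csv`. The CSV content and the names `csv_text` and `passengers` are made up for illustration; in practice the first argument would usually be a file path or URL.

```python
import io

import pandas as pd

# Names contain commas, so they are quoted in the CSV text.
csv_text = '''Name,Age,Sex
"Braund, Mr. Owen Harris",22,male
"Allen, Mr. William Henry",35,male
"Bonnell, Miss. Elizabeth",58,female
'''

# read_csv accepts a path, a URL, or any file-like object.
passengers = pd.read_csv(io.StringIO(csv_text))

print(passengers.shape)
print(passengers["Age"].tolist())
```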

**Note**: We will give an overview of some of these functions

### Acknowledgements

This material was derived from [Pandas's Getting Started Tutorial on "How do I select a subset of a DataFrame?"](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html)

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.