<a href="https://colab.research.google.com/github/albertomanfreda/intensive_school_ml/blob/master/lessonPandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

Pandas is an open source library for data analysis and manipulation, built on top of NumPy. Its name derives from "panel data" and is also a play on "Python data analysis".

The main features of Pandas are two powerful data structures, **Series** and **DataFrame**, which combine the functionality of NumPy arrays with that of Python dictionaries: data can be accessed by named labels in addition to numerical indices.

A Series corresponds to a one-dimensional array, while a DataFrame essentially extends a two-dimensional array.

## Series

A Pandas Series is similar to a one-dimensional NumPy array, but it adds an **Index** that can be used to access data.

An Index is an immutable array of labels. These labels act like keys in a dictionary and can be strings, numbers, or other types (we will not go into the details of which types can be used and why).
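
As a small illustration of the two properties just mentioned (immutability and dictionary-like labels), the following sketch builds a standalone Index. The variable names `labels` and `modified` are chosen here for illustration only.

```python
import pandas as pd

# A standalone Index holding three string labels
labels = pd.Index(['Rome', 'Tokyo', 'London'])

# Trying to change a label in place raises a TypeError
try:
    labels[0] = 'Paris'
    modified = True
except TypeError as error:
    modified = False
    print('Index is immutable:', error)

# Labels support dictionary-like membership tests
print('Tokyo' in labels)
```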

In [None]:
import numpy as np
# Standard alias for Pandas
import pandas as pd

# A Series can be created from a NumPy 1d array
# The Index can be made of numbers or strings
# If no explicit Index is given, a default integer index (0, 1, 2, ...) is created
capitals_pop = pd.Series(1e6 * np.array([2.873, 9.273, 8.982, 2.148]),
                        index=['Rome', 'Tokyo', 'London', 'Paris'])
print(capitals_pop)
print(capitals_pop.values)
print(capitals_pop.index)
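
For comparison, here is a minimal sketch of the same data without an explicit Index, so that the default integer index is visible. The name `default_indexed` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# No index argument: Pandas creates integer labels starting from 0
default_indexed = pd.Series(1e6 * np.array([2.873, 9.273, 8.982, 2.148]))
print(default_indexed.index)
print(default_indexed)
```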

You can access items in a Series with the usual NumPy array syntax or by index label. Accessing by label is similar to accessing the values of a dictionary by key.


In [None]:
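# Random access by position
# (plain [] with an integer position is deprecated on a Series with
# non-integer labels in recent Pandas versions, so iloc is used here)
print(capitals_pop.iloc[0])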
# Random access with Index
print(capitals_pop['Rome'])

# Slice with positional integer indices
print(capitals_pop[1:3])

# Slice with named Index (unlike positional slices, the end label is included)
print(capitals_pop['Rome':'London'])

Random access and slicing can also be done with the **loc** and **iloc** indexers. **loc** always uses the Index labels, while **iloc** always uses the position. This is useful to avoid confusion when the Index is made of integer numbers.

In [None]:
# Random access with position
print(capitals_pop.iloc[0])
# Random access with Index
print(capitals_pop.loc['Rome'])

# Slice with positional indices
print(capitals_pop.iloc[1:3])

# Slice with named Index
print(capitals_pop.loc['Rome':'London'])
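
The integer-index case mentioned above can be sketched as follows. The Series `scores` and its labels are made up for this example; the labels are integers that deliberately do not match the positions.

```python
import pandas as pd

# Hypothetical Series whose integer labels run opposite to the positions
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

# loc looks up the label 0, which belongs to the last element
by_label = scores.loc[0]
# iloc looks up position 0, which is the first element
by_position = scores.iloc[0]

print(by_label, by_position)
```

Using **loc** or **iloc** explicitly makes it clear to the reader which of the two lookups is intended.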


You can also build a Series out of a dictionary. The keys are used to create the index:

In [None]:
capitals_dict = {'Rome'   : 2.873e6,
                 'Tokyo'  : 9.273e6,
                 'London' : 8.982e6,
                 'Paris'  : 2.148e6}
capitals_pop = pd.Series(capitals_dict)

# Like a dictionary, a Series has a keys() method returning the index
print(capitals_pop.keys())

# Unlike a dictionary, a for loop over a Series iterates over values, not keys.
# Dictionary loop:
for item in capitals_dict:
    print(item)
# Series loop:
for item in capitals_pop:
    print(item)

In [None]:
# However, looping with items() works just as it does for a dictionary
for key, value in capitals_pop.items():
    print('{} -> {}'.format(key, value))

## DataFrames

A DataFrame is essentially a sequence of Series objects sharing the same index.
You can think of it as a table: each row and each column has its own name.

In [None]:
# We start by defining another Series with the same Index
capital_states_dict = {'Rome'   : 'Italy',
                       'Tokyo'  : 'Japan',
                       'London' : 'United Kingdom',
                       'Paris'  : 'France'}
capital_states = pd.Series(capital_states_dict)

""" Now put them together to build a DataFrame. To do this we first create a
dictionary of Series, which we pass to the constructor of DataFrame. Each Series
becomes a column of the DataFrame, with a name given by the corresponding key
of the dictionary. """
capitals = pd.DataFrame({'population': capitals_pop,
                         'state': capital_states})
# To get a quick look at the DataFrame we can use the info() method
# (info() prints its summary itself and returns None, so it is not wrapped in print)
capitals.info()
print()
# Or just simply print it
print(capitals)
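
To see why the shared index matters, the sketch below builds a DataFrame from two Series whose indices only partly overlap. The names `population`, `state` and `frame` are illustrative. Pandas aligns the rows by label and fills the entries that have no counterpart with NaN.

```python
import pandas as pd

# Two Series that share only the 'Tokyo' label
population = pd.Series({'Rome': 2.873e6, 'Tokyo': 9.273e6})
state = pd.Series({'Tokyo': 'Japan', 'Paris': 'France'})

# The resulting index is the union of the two sets of labels
frame = pd.DataFrame({'population': population, 'state': state})
print(frame)

# Each column of the DataFrame is itself a Series
print(type(frame['population']))
```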

In [None]:
""" You can also build a DataFrame from a NumPy 2d array. Column names and
row names are specified with the 'columns' and 'index' arguments, respectively.
"""
pd.DataFrame(np.array([[21, 55],
                       [26, 82],
                       [19, 77]]),
             columns=['age', 'weight'],
             index=['Alice', 'Bob', 'Charles'])

In [None]:
""" Finally, you can also build a DataFrame from a list of dictionaries, where
each dictionary is a row. The names of the rows can be specified with the
index argument. Missing values are filled automatically with the NumPy
special value NaN (Not a Number). """
pd.DataFrame([{'age': 21, 'weight': 55},
              {'age': 26}, 
              {'age': 19, 'weight': 77}],
             index=['Alice', 'Bob', 'Charles'])
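
As a brief follow-up, this sketch rebuilds the same table and inspects the missing entry. The variable name `people` is an illustrative choice; `isna()` and `dtypes` are standard Pandas features.

```python
import pandas as pd

people = pd.DataFrame([{'age': 21, 'weight': 55},
                       {'age': 26},
                       {'age': 19, 'weight': 77}],
                      index=['Alice', 'Bob', 'Charles'])

# isna() marks the missing entries with True
print(people.isna())

# NaN is a floating point value, so the 'weight' column is stored as float
print(people.dtypes)
```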

The syntax for selecting items in a DataFrame can be confusing. The easiest way to remember it is to think of a DataFrame as a dictionary of columns (that is, of Series). So, if *data* is a DataFrame, *data[a][b]* returns the item in *column a* and *row b*. **This is different from NumPy 2d arrays, where the first index selects the row.**

Unlike the elements of a Series, the columns of a DataFrame can only be accessed by name with the square-bracket syntax; implicit integer positions cannot be used in their place.

In [None]:
# Select a column
print(capitals['population'], '\n')

# Select a column and a row
print(capitals['state']['Rome'])
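
The difference in index order with respect to NumPy can be checked side by side. This is a minimal sketch that reuses the age and weight numbers from the earlier example; the names `values`, `people`, `from_array` and `from_frame` are chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[21, 55],
                   [26, 82],
                   [19, 77]])
people = pd.DataFrame(values,
                      columns=['age', 'weight'],
                      index=['Alice', 'Bob', 'Charles'])

# NumPy: the first index selects the row, the second the column
from_array = values[1][0]
# DataFrame: the first key selects the column, the second the row
from_frame = people['age']['Bob']

print(from_array, from_frame)
```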

In [None]:
# This does not work: there is no column labelled 1, so a KeyError is raised
print(capitals[1])

You can also use **loc** and **iloc** on DataFrames. As with Series, **loc** is used for random access or slicing with labels (row and column names), while **iloc** uses numerical positions.

In [None]:
# WARNING: loc uses NumPy ordering (row index first)
print(capitals.loc[:'Tokyo', 'population'], '\n')
print(capitals.loc['Rome', 'population':], '\n')

In [None]:
# WARNING: iloc uses NumPy order (row index first)
print(capitals.iloc[0:3, :])
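
One detail worth checking is how the two indexers treat the end of a slice. The sketch below rebuilds the capitals table directly from lists, so that it runs on its own, and compares a label slice with a positional one.

```python
import pandas as pd

capitals = pd.DataFrame(
    {'population': [2.873e6, 9.273e6, 8.982e6, 2.148e6],
     'state': ['Italy', 'Japan', 'United Kingdom', 'France']},
    index=['Rome', 'Tokyo', 'London', 'Paris'])

# Label slice: the end label 'London' is included
by_label = capitals.loc['Rome':'London', 'population']
# Positional slice: the end position 2 ('London') is excluded
by_position = capitals.iloc[0:2, 0]

print(by_label, '\n')
print(by_position)
```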

Since Pandas is built on NumPy arrays, boolean masking works too.

In [None]:
print(capitals[capitals['population'] > 5e6])
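
Masks can also be stored and combined. The following sketch rebuilds the capitals table so that it is self-contained; the names `mask` and `large_not_japan` are illustrative. Conditions are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses.

```python
import pandas as pd

capitals = pd.DataFrame(
    {'population': [2.873e6, 9.273e6, 8.982e6, 2.148e6],
     'state': ['Italy', 'Japan', 'United Kingdom', 'France']},
    index=['Rome', 'Tokyo', 'London', 'Paris'])

# A mask is a boolean Series that shares the index of the DataFrame
mask = capitals['population'] > 5e6
print(mask, '\n')

# Combine two conditions element by element
large_not_japan = capitals[(capitals['population'] > 5e6)
                           & (capitals['state'] != 'Japan')]
print(large_not_japan)
```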

This only scratches the surface of what you can do with Pandas. You can find a much longer tutorial here: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb