# Pandas 101 (a crash course for students)

Let's get familiar with ***Pandas***. 

*DISCLAIMER: This material is intended as a pragmatic shortcut to get you started, and should NOT stop you from attending a complete Pandas course/tutorial.*

## Introduction

Pandas is very useful because it provides data structures and functions to ***quickly manage, manipulate and analyze your data set***. 

Key concepts: "*Series*" and "*DataFrame*" data structures.

In [None]:
import numpy
import pandas

## Pandas Series

A series is a 1D array of _indexed_ data. 

First of all, a Pandas series can be created from a list or array - exactly as we can create a Numpy array from a Python list - as follows:

In [None]:
L = [0.25, 0.5, 0.75, 1.0]
data = pandas.Series(L)   # creation from a Python list
data

Of course, you can just as easily create a Pandas series from a NumPy array:

In [None]:
A = numpy.array([0.25, 0.5, 0.75, 1.0])   # creation from a Numpy array
data = pandas.Series(A)
data

You can already see that its printout in Jupyter looks different from that of a Python list or a NumPy array:

In [None]:
L

In [None]:
A

In [None]:
data

The left-hand column in the printout is the index: this is what we meant by "indexed data". The index can also be set explicitly:

In [None]:
A = numpy.array([ 0.25,  0.5 ,  0.75,  1.  ])
rownames = ['index_1','index_2','index_3','index_4']
myseries = pandas.Series(A, index=rownames)
myseries

***Practice***: Try to pass a shorter `rownames` list above.

Note the difference between the `values` and `index` attributes.

The `values` are simply a familiar NumPy array...

In [None]:
myseries.values

.. while the `index` is an array-like object of type `pandas.Index`:

In [None]:
myseries.index

How do I access the data in my pandas series?

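Firstly, I can access the data in my pandas series by position, like in a NumPy array. Use `.iloc` for this: plain square brackets with an integer position on a string-labeled series are deprecated in recent Pandas versions.

In [None]:
myseries.iloc[0]

In [None]:
myseries.iloc[1]

Secondly, I can access the data in my pandas series by label, like in a dictionary (more on this later):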

In [None]:
myseries['index_1']

### Importance of indexes

Pandas series can be seen as "***generalized NumPy arrays***".

The essential difference with a 1D NumPy array is the presence of the indexes:

* the Numpy array has an **implicitly** defined integer index used to access the values
* the Pandas series has an **explicitly** defined index associated with the values.

This explicit index definition gives the series object plenty of additional capabilities. 

E.g. we can use ANY strings as indexes:

In [None]:
data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data['b']

We can even use arbitrary, non-contiguous, non-sequential integers as indices:

In [None]:
data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
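
With integer labels it is easy to confuse "label" and "position". The following is a minimal sketch of the difference, assuming the same values and labels as in the cell above; the variable names `plain`, `by_label` and `by_position` are only illustrative.

```python
import pandas

data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

plain = data[2]             # with an integer index, plain brackets look up the LABEL 2
by_label = data.loc[5]      # .loc always works with labels
by_position = data.iloc[2]  # .iloc always works with positions (third value here)

print(plain, by_label, by_position)
```

Using `.loc` and `.iloc` explicitly makes the intention clear to whoever reads the code.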

### Use of series as specialized dictionaries

Because an explicit index is attached to the values, you can think of a Pandas Series a bit like **a specialization of a Python dictionary**. Let's see the difference.

The series-as-dict analogy can be made even clearer by constructing a Series object directly from a Python dictionary:

In [None]:
phones_dict = {'Daniele': 1111111,     # create a Python dict
                   'Amanda': 2222222,
                   'Bice': 3333333,
                   'Licia': 4444444}
phones_series = pandas.Series(phones_dict)        # use the dict to create a series

In [None]:
phones_dict

In [None]:
phones_series

In this way, a series is created whose index is drawn from the dictionary keys. Recent Pandas versions keep the keys in their insertion order; older versions sorted them.

From here, typical dictionary-style item access can be performed for both:

In [None]:
phones_dict['Amanda']

In [None]:
phones_series['Amanda']
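
Other dictionary habits carry over as well. The sketch below is an illustration on a shortened phone book (the variable names are hypothetical); it uses membership tests, `keys()`, `items()` and assignment to a new label.

```python
import pandas

phones = pandas.Series({'Daniele': 1111111, 'Amanda': 2222222})

has_amanda = 'Amanda' in phones   # membership is tested on the index labels
labels = list(phones.keys())      # the index labels, like dict keys
pairs = list(phones.items())      # (label, value) pairs, like dict items

phones['Bice'] = 3333333          # assigning to a new label extends the series
print(phones)
```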

Unlike a dictionary, though, the Series also supports array-style operations such as **slicing**:

In [None]:
phones_dict['Bice':'Licia']   # raises an error: a dict cannot be sliced

In [None]:
phones_series['Bice':'Licia']   
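
One detail is worth checking when slicing: a slice by label includes both endpoints, while a slice by position excludes the stop, as in plain Python. A small sketch, assuming the same phone book as above and using the explicit `.loc` and `.iloc` accessors:

```python
import pandas

phones = pandas.Series({'Daniele': 1111111, 'Amanda': 2222222,
                        'Bice': 3333333, 'Licia': 4444444})

by_label = phones.loc['Amanda':'Bice']   # label slice: both endpoints included
by_position = phones.iloc[1:2]           # positional slice: stop position excluded

print(by_label)
print(by_position)
```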

> More: https://pandas.pydata.org/pandas-docs/stable/indexing.html

### More on constructing Series objects

There are many ways to construct Pandas Series from scratch. All of them are some version of the following:

    pandas.Series(data, index=index)

where `data` can be one of many entities, while `index` is an optional argument.

What can `data` be?

1. `data` can be a list or NumPy array, in which case `index` defaults to an integer sequence (starting from 0):

In [None]:
series1 = pandas.Series([2, 4, 6])
series1

2. `data` can be a scalar, in which case it is broadcast to fill the specified index:

In [None]:
series2 = pandas.Series(5, index=[100, 200, 300])
series2

3. `data` can be a dictionary, in which case `index` defaults to the dictionary keys (in insertion order in recent Pandas versions; older versions sorted them):

In [None]:
series3 = pandas.Series({2:'c', 1:'a', 3:'b'})   # NOTE: the index is taken from the dict keys
series3

In each of the cases above, the `index` can be set explicitly if a different result is preferred. E.g. you can list the keys of the dictionary that you want to include (these and only these):

In [None]:
series4 = pandas.Series({2:'c', 1:'a', 3:'b'}, index=[2, 3])
series4
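
What happens if the explicit index asks for a key that is not in the dictionary? The sketch below (with the hypothetical names `letters` and `selected`) suggests that the result follows the order of the requested index and that missing keys are filled with `NaN`.

```python
import pandas

letters = {2: 'c', 1: 'a', 3: 'b'}

# 4 is not a key of the dict, so its value is missing in the result
selected = pandas.Series(letters, index=[3, 1, 4])

print(selected)
```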

_NOTE: much more can be said about Pandas Series, but we stop here. Read more in the Pandas documentation._


## Pandas DataFrame

The next fundamental structure in Pandas is the **DataFrame** (or "data frame", or "dataframe").

In general, it is a 2D array where the rows and the columns can be (and usually are) labeled.

Like the Pandas Series, the Pandas DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. 

### 1. Pandas DataFrame as a generalised Numpy array

The best way to see it is as follows:

**A Pandas DataFrame is an analog of a 2D array with both flexible row indices and flexible column names**.


Let's see some examples.

Let's start from the same phones dictionary example as before:
   1. construct a Pandas Series from a dictionary (same as above)
   1. construct another Pandas Series with additional info
   1. build a dictionary of these Series objects (aligned, as they share the same index) and turn it into a DataFrame

This is part 1.

In [None]:
phones_dict = {'Daniele': 1111111,                # create a dict
                   'Amanda': 2222222,
                   'Bice': 3333333,
                   'Licia': 4444444}
phones_series = pandas.Series(phones_dict)        # use the dict to create a series
phones_series

This is part 2.

In [None]:
address_dict = {'Daniele': "Via Dromedario 1",                # create a dict
                   'Amanda': "Via Dromedario 2",
                   'Bice': "Via Dromedario 3",
                   'Licia': "Via Dromedario 4"}
address_series = pandas.Series(address_dict)        # use the dict to create a series
address_series

This is part 3. Now you can use a dictionary structure to construct a single 2D object (Pandas DataFrame) containing all this information:

In [None]:
my_address_book = pandas.DataFrame({'PHONE': phones_series,
                                    'ADDRESS': address_series})

And print it out. Be prepared for a nice surprise..

In [None]:
my_address_book

NOTE: the printout is extremely easy to read, and organised as you meant it to be. Thanks to Jupyter and Pandas!

Like the Series object, the DataFrame object has an `index` attribute which gives access to the index labels:

In [None]:
my_address_book.index

And a `values` attribute too:

In [None]:
my_address_book.values

And many more. E.g. you can use the `columns` attribute, which is an `Index` object holding the column labels:

In [None]:
my_address_book.columns

E.g. you might want to do something only on the first column:

In [None]:
my_address_book.columns[0]

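Beware of `all`: it is a method, not an attribute. Evaluating it without parentheses only displays the bound method, whose printout happens to include the DataFrame labels and full content:

In [None]:
my_address_book.all   # no parentheses: shows the bound method, does not call it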

In this example, we saw that the Pandas DataFrame can be thought of as a generalization of a 2D NumPy array, where both the rows and columns have a generalized index for accessing the data.
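
To make the analogy concrete, here is a small sketch comparing element access in a NumPy array and in a DataFrame built from it. The row and column labels (`r1`, `c1`, ...) are made up for the illustration.

```python
import numpy
import pandas

array = numpy.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
frame = pandas.DataFrame(array,
                         index=['r1', 'r2', 'r3'],
                         columns=['c1', 'c2'])

from_array = array[1, 0]             # implicit integer positions: second row, first column
from_frame = frame.loc['r2', 'c1']   # explicit labels for the same element

print(from_array, from_frame)
```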

### 2. Pandas DataFrame as a specialised dictionary

Similarly, we can think of a Pandas DataFrame as a specialisation of a dictionary.

Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. 

E.g. see the example below. Asking for the 'ADDRESS' column returns the Pandas Series object containing all the addresses in the address book DataFrame, together with the index information (so we know WHO lives WHERE).

In [None]:
my_address_book['ADDRESS']

Same for the 'PHONE' column, of course, as for any other column that might exist:

In [None]:
my_address_book['PHONE']

## <font color='red'>Exercise 1</font>

Are you able to build a single, one-shot query to get Amanda's phone number?


_Hint_: build the query piece by piece, then merge and simplify to get the most compact code.

## <font color='green'>Solution of Exercise 1</font>

In [None]:
# write your solution here..

In [None]:
my_address_book['PHONE']

In [None]:
tmp_series = my_address_book['PHONE']
tmp_series['Amanda']

In [None]:
my_address_book['PHONE']['Amanda']
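
As a possible alternative to the chained brackets, the same value can be requested in a single step with `.loc`, or with `.at`, which is meant for single values. This is only a sketch on a shortened address book; the name `book` is hypothetical.

```python
import pandas

book = pandas.DataFrame(
    {'PHONE': [1111111, 2222222],
     'ADDRESS': ['Via Dromedario 1', 'Via Dromedario 2']},
    index=['Daniele', 'Amanda'])

chained = book['PHONE']['Amanda']      # column first, then row (two lookups)
direct = book.loc['Amanda', 'PHONE']   # row label and column label in one step
single = book.at['Amanda', 'PHONE']    # accessor for a single value

print(chained, direct, single)
```

Note the order: with `.loc` and `.at` the row label comes first and the column label second.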

## Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways. 

   1. From a single Series object
   1. From a list of dicts
   1. From a dictionary of Series objects
   1. From a two-dimensional NumPy array

.. and more.. but we stop here. Let's see some examples for each.

### 1. From a single Series object

In [None]:
phones_series

In [None]:
DF1 = pandas.DataFrame(phones_series, columns=['phone numbers'])
DF1

### 2. From a list of dicts

In [None]:
list_of_dicts = [{'a': i, 'b': 2 * i} for i in range(4)]
list_of_dicts

In [None]:
DF2 = pandas.DataFrame(list_of_dicts)
DF2
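
The dictionaries in the list do not need to share the same keys. In the sketch below (the names `records` and `df` are illustrative), the columns are the union of all keys and the missing entries are filled with `NaN`.

```python
import pandas

records = [{'a': 1, 'b': 2},
           {'b': 3, 'c': 4}]   # 'a' is missing here, 'c' is missing above

df = pandas.DataFrame(records)
missing = int(df.isna().sum().sum())   # total number of missing entries

print(df)
print(missing)
```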

### 3. From a dictionary of Series objects

In [None]:
my_address_book = pandas.DataFrame({'PHONE': phones_series,
                                    'ADDRESS': address_series})
my_address_book

### 4. From a 2D NumPy array

In [None]:
DF3 = pandas.DataFrame(numpy.random.rand(3, 2),
                       columns=['first column', 'second column'],
                       index=['first row', 'second row', 'third row'])
DF3

## Pandas Index

We said that the Pandas Series and Pandas DataFrame contain an explicit index which lets you reference and modify data. 

This index is an `Index` object, and it is an interesting structure in itself. It can be thought of either:
   * as an "immutable array"
   * as an "ordered set"
   
Those views have some interesting consequences in the operations available on Index objects.

As a simple example, let's construct an index from a list of integers:

In [None]:
my_index = pandas.Index([1, 2, 3, 4, 5])
my_index

Int64Index([1, 2, 3, 4, 5], dtype='int64')

### Index as Immutable Array

The index in many ways operates like an array. E.g., we can use standard Python indexing notation to retrieve values or slices:

In [None]:
my_index[1]

2

In [None]:
my_index[2:]

Int64Index([3, 4, 5], dtype='int64')

In [None]:
my_index[:2]

Int64Index([1, 2], dtype='int64')

In [None]:
my_index[::2]

Int64Index([1, 3, 5], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(my_index.size)
print(my_index.shape)
print(my_index.ndim)
print(my_index.dtype)

5
(5,)
1
int64


One difference between Index objects and NumPy arrays is that indices are immutable: that is, they cannot be modified via the normal means:

In [None]:
my_index[1]

2

In [None]:
my_index[1] = 0    

TypeError: Index does not support mutable operations

Which is actually a good thing. Immutability is not there to limit your freedom; it is there to save you from terrible mistakes. 
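
If a different index is really needed, the usual approach is to build a new one rather than to modify the existing one. A minimal sketch, assuming the same `my_index` as above (`new_index` is a hypothetical name):

```python
import pandas

my_index = pandas.Index([1, 2, 3, 4, 5])

try:
    my_index[1] = 0                  # in-place modification is refused
except TypeError as err:
    print('TypeError:', err)

new_index = my_index.insert(1, 0)    # returns a NEW Index, the original is untouched

print(my_index)
print(new_index)
```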

### Index as Ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which rely on set arithmetic. The Pandas Index object follows many of the conventions of Python's built-in `set` object, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

In [None]:
indA = pandas.Index([1, 3, 5, 7, 9, 100, 200])
indB = pandas.Index([2, 4, 6, 8, 10, 100, 200])

In [None]:
indA & indB  # intersection (set semantics only in pandas < 2.0; prefer indA.intersection(indB))

Int64Index([100, 200], dtype='int64')

In [None]:
indA | indB  # union (set semantics only in pandas < 2.0; prefer indA.union(indB))

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 200], dtype='int64')

In [None]:
indA.difference(indB)  # set difference (indA - indB would subtract element by element)

Int64Index([1, 3, 5, 7, 9], dtype='int64')

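These operations are also available as object methods, which is the recommended form: using `&` and `|` for set operations was deprecated in pandas 1.x and, from pandas 2.0, these operators act element-wise instead. E.g.: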

In [None]:
indA.intersection(indB)

Int64Index([100, 200], dtype='int64')
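
For reference, here is a sketch of the four set-style methods side by side, assuming the same `indA` and `indB` as above (the result names are illustrative):

```python
import pandas

indA = pandas.Index([1, 3, 5, 7, 9, 100, 200])
indB = pandas.Index([2, 4, 6, 8, 10, 100, 200])

common = indA.intersection(indB)             # labels present in both
either = indA.union(indB)                    # labels present in at least one
only_a = indA.difference(indB)               # labels present in indA but not in indB
one_side = indA.symmetric_difference(indB)   # labels present in exactly one of the two

print(common, either, only_a, one_side, sep='\n')
```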

## Done. That's all for the Pandas 101 crash course.

The appetizer is over. Time for you to move to the main course, i.e. a good Pandas course and/or tutorial.

## What we have learnt

Basics of Series, DataFrame, Index objects in Pandas - which form the foundation of data-oriented computing with Pandas. 

Very useful for data manipulation: just as understanding the effective use of NumPy arrays is fundamental to effective numerical computing in Python, understanding the effective use of Pandas structures is fundamental to the **data munging** required for data science in Python.

* "data munging" = the process of manual data cleansing prior to analysis

## Reading material

* Pandas documentation page (user guide), http://pandas.pydata.org/pandas-docs/stable/
* Pandas cookbook (many examples), http://pandas.pydata.org/pandas-docs/stable/cookbook.html
* Pandas API Reference, http://pandas.pydata.org/pandas-docs/stable/api.html