# Pandas 101 (a crash course for students)

A quick and simple introduction to ***Pandas***, another key Python library for ML that you need to know something about.

*DISCLAIMER: This material is intended as a pragmatic shortcut to get you going, and should NOT stop you from attending a complete Pandas course/tutorial*

## Introduction

Pandas is very useful as it provides data structures as well as functionalities to ***quickly manage, manipulate and analyze your data set***. It is a very powerful tool for slicing and dicing your data.

With the goal of ML in mind, the key in Pandas is to understand the "*Series*" and "*DataFrame*" data structures.

In [2]:
import numpy
import pandas

## Pandas Series

A series is a 1D array of indexed data. In "indexing" lies the major difference w.r.t other data structures you might know. The meaning of "indexed data" will be understandable in a while, in one of the next examples.

First of all, a Pandas series can be created from a list or array - exactly as we can create a Numpy array from a Python list - as follows:

In [3]:
L = [0.25, 0.5, 0.75, 1.0]   # creation of a Python list
data = pandas.Series(L)      # creation of a pandas Series from a Python list
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Of course, you can just as easily create a Pandas series from a NumPy array:

In [4]:
A = numpy.array([0.25, 0.5, 0.75, 1.0])  # creation of a Numpy array from a Python list
data = pandas.Series(A)                  # creation of a pandas Series from a Numpy array
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

You already see that its print-out in Jupyter looks different from how a Python list or a Numpy array looked:

In [5]:
L

[0.25, 0.5, 0.75, 1.0]

In [6]:
A

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [7]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Here is what we meant by "indexed data".

In [8]:
A = numpy.array([ 0.25,  0.5 ,  0.75,  1.  ])
rownames = ['index_1','index_2','index_3','index_4']
myseries = pandas.Series(A, index=rownames)
myseries

index_1    0.25
index_2    0.50
index_3    0.75
index_4    1.00
dtype: float64

***EXERCISE***: try to give a smaller rownames vector above..

So, a pandas series has both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

The `values` are simply a familiar NumPy array...

In [9]:
myseries.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

.. while the `index` is an array-like object of type `pandas.Index`:

In [10]:
myseries.index

Index([u'index_1', u'index_2', u'index_3', u'index_4'], dtype='object')

How do I access the data in my pandas series?

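Firstly, I can access the data in my pandas series by position, like in a NumPy array. Since this series has string labels as its index, the explicit positional accessor `.iloc` is the safe way to do so (recent pandas versions deprecate plain `myseries[0]` for positional access when the index is not made of integers):

In [11]:
myseries.iloc[0]

0.25

In [12]:
myseries.iloc[1]

0.5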

Secondly, I can access the data in my pandas series like in a dictionary, i.e. by label (more on this later):

In [13]:
myseries['index_1']

0.25

### Importance of indexes

Pandas series can be seen as ***"generalized NumPy arrays"***.

From what we have seen so far from its behaviour, it may look like the series object is basically interchangeable with a 1D NumPy array. The essential difference is the presence of the indexes:

* the Numpy array has an **implicitly** defined integer index used to access the values
* the Pandas series has an **explicitly** defined index associated with the values.

This explicit index definition gives the series object plenty of additional capabilities. 

E.g. the index need not be an integer: it can consist of values of any desired type, with any content you want. For example, we can use strings as indexes:

In [15]:
data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data['b']

0.5

Of course, we can even use random, non-contiguous, non-sequential numerical indices:

In [16]:
data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
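With integer labels like these, square brackets look up the label, not the position. The following sketch (variable names are illustrative only) contrasts that behaviour with the explicit accessors `.loc` (by label) and `.iloc` (by position):

```python
import pandas

data = pandas.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data[2]            # label 2 -> the first value
by_label_loc = data.loc[2]    # same lookup, spelled explicitly
by_position = data.iloc[2]    # position 2 -> the third value
print(by_label, by_label_loc, by_position)
```

Spelling out `.loc` or `.iloc` removes any doubt about which of the two lookups is meant.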

### Use of series as specialized dictionaries

Having an explicit index attached to the values lets you think of a Pandas Series a bit like **a specialization of a Python dictionary**. Let's see the difference.

The series-as-dict analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [17]:
phones_dict = {'Daniele': 1111111,              # create a Python dict
               'Amanda': 2222222,
               'Bice': 3333333,
               'Licia': 4444444}
phones_series = pandas.Series(phones_dict)      # use the dict to create a series

In [18]:
phones_dict

{'Amanda': 2222222, 'Bice': 3333333, 'Daniele': 1111111, 'Licia': 4444444}

In [19]:
phones_series

Amanda     2222222
Bice       3333333
Daniele    1111111
Licia      4444444
dtype: int64

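In this way, by default a series is created whose index is drawn from the dictionary keys. In the pandas version used to produce the outputs shown here, the keys come out sorted (note the sorted order in the resulting Pandas Series); recent versions (pandas 0.23 or later on Python 3.6+) keep the dictionary's insertion order instead.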
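Note that the ordering is version-dependent. A minimal sketch, assuming a recent pandas (0.23 or later on Python 3.6+), where the index follows the dictionary's insertion order and `sort_index()` gives the alphabetical order explicitly (`sorted_series` is an illustrative name):

```python
import pandas

phones_dict = {'Daniele': 1111111, 'Amanda': 2222222,
               'Bice': 3333333, 'Licia': 4444444}

phones_series = pandas.Series(phones_dict)     # index follows insertion order
sorted_series = phones_series.sort_index()     # index sorted alphabetically
print(list(phones_series.index))
print(list(sorted_series.index))
```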

From here, typical dictionary-style item access can be performed for both:

In [20]:
phones_dict['Amanda']

2222222

In [21]:
phones_series['Amanda']

2222222

Unlike a dictionary, though, the Series also supports array-style operations such as **slicing**:

In [22]:
phones_dict['Bice':'Licia']   # this does not work for dictionaries

TypeError: unhashable type

In [23]:
phones_series['Bice':'Licia']   # it works for Pandas Series

Bice       3333333
Daniele    1111111
Licia      4444444
dtype: int64

This opens up **plenty** of possible indexing and slicing options with very small portions of code.

    More: https://pandas.pydata.org/pandas-docs/stable/indexing.html
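One detail worth knowing: a slice by label includes its end point, whereas a slice by position (via `.iloc`) excludes it, as in ordinary Python. A short sketch on an alphabetically ordered copy of the phone series (`by_label`, `by_position` and `big_numbers` are illustrative names):

```python
import pandas

phones_series = pandas.Series({'Amanda': 2222222, 'Bice': 3333333,
                               'Daniele': 1111111, 'Licia': 4444444})

by_label = phones_series['Bice':'Licia']      # label slice: 'Licia' is included
by_position = phones_series.iloc[1:3]         # positional slice: position 3 is excluded
big_numbers = phones_series[phones_series > 3000000]   # boolean mask on the values
print(by_label, by_position, big_numbers, sep='\n\n')
```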

### More on constructing Series objects

There are many ways to construct Pandas Series from scratch. All of them are some version of the following

    pandas.Series(data, index=index)

where `data` can be one of many entities, while `index` is an optional argument.

What can `data` be?

`data` can be a list or NumPy array, in which case `index` defaults to an integer sequence (starting from 0):

In [None]:
series1 = pandas.Series([2, 4, 6])
series1

`data` can be a scalar, in which case it is broadcast to fill the specified index:

In [None]:
series2 = pandas.Series(5, index=[100, 200, 300])
series2

`data` can be a dictionary, in which case `index` defaults to the dictionary keys

(NOTE: older pandas versions sort the keys, while recent ones keep the dictionary's insertion order; either way, it is the dict keys that become the index.)

In [None]:
series3 = pandas.Series({2:'c', 1:'a', 3:'b'})
series3

In each case mentioned above, the `index` can be explicitly set as you want, if a different result is preferred. E.g. you can explicitly identify the particular indices you want to be included (these and only these) from the dictionary.

In [None]:
series4 = pandas.Series({2:'c', 1:'a', 3:'b'}, index=[2, 3])
series4
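As a sketch of how an explicit `index` interacts with a dictionary: only the requested keys are kept, and a requested key that is missing from the dictionary yields a missing value (`NaN`). The names `source` and `series5` are introduced here only for illustration:

```python
import pandas

source = {2: 'c', 1: 'a', 3: 'b'}

series4 = pandas.Series(source, index=[2, 3])   # keep only keys 2 and 3
series5 = pandas.Series(source, index=[3, 4])   # key 4 does not exist -> NaN
print(series4)
print(series5)
```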

_NOTE: much more can be said on Pandas Series, but we stop here. More in the Pandas documentation._

## Pandas DataFrame

The next fundamental structure in Pandas is the **DataFrame** (or "data frame", or "dataframe").

In general, it is a 2D array where the rows and the columns can be (and usually are) labeled.

Like the Pandas Series, the Pandas DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. 

### 1. Pandas DataFrame as a generalised Numpy array

Best way to see it is as follows:

If a Pandas Series is an analog of a 1D array but with flexible indices, **a Pandas DataFrame is an analog of a 2D array with both flexible row indices and flexible column names**.

In other words, just as you might think of a 2D array as an ordered sequence of aligned 1D columns, you can think of **a DataFrame as a sequence of aligned Series objects in Pandas**. Here, by “aligned” we mean that they share the same index.

Let's see some examples.

Let's start from the same phones dictionary example as before:
   1. construct a Pandas Series from a dictionary (same as above)
   1. construct another Pandas Series with additional info
   1. build a sequence of these (aligned, they share the same index) Series objects

This is part 1.

In [24]:
phones_dict = {'Daniele': 1111111,                # create a dict
                   'Amanda': 2222222,
                   'Bice': 3333333,
                   'Licia': 4444444}
phones_series = pandas.Series(phones_dict)        # use the dict to create a series
phones_series

Amanda     2222222
Bice       3333333
Daniele    1111111
Licia      4444444
dtype: int64

This is part 2.

In [25]:
address_dict = {'Daniele': "Via Dromedario 1",                # create a dict
                   'Amanda': "Via Dromedario 2",
                   'Bice': "Via Dromedario 3",
                   'Licia': "Via Dromedario 4"}
address_series = pandas.Series(address_dict)        # use the dict to create a series
address_series

Amanda     Via Dromedario 2
Bice       Via Dromedario 3
Daniele    Via Dromedario 1
Licia      Via Dromedario 4
dtype: object

This is part 3. Now you can use a dictionary structure to construct a single 2D object (Pandas DataFrame) containing all this information:

In [26]:
my_address_book = pandas.DataFrame({'PHONE': phones_series,
                                    'ADDRESS': address_series})

And print it out. Be prepared for a nice surprise..

In [27]:
my_address_book

                  ADDRESS    PHONE
Amanda   Via Dromedario 2  2222222
Bice     Via Dromedario 3  3333333
Daniele  Via Dromedario 1  1111111
Licia    Via Dromedario 4  4444444


NOTE: the printout is extremely easy to read, and organised as you meant it to be. Thanks to Jupyter and Pandas!
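The "aligned" part matters: when Series are combined into a DataFrame, pandas matches the values by index label rather than by position, and fills in `NaN` where a label is missing from one of them. A minimal sketch with two made-up Series (`phones`, `addresses` and `book` are illustrative names, not the notebook's variables):

```python
import pandas

phones = pandas.Series({'Amanda': 2222222, 'Bice': 3333333})
addresses = pandas.Series({'Bice': 'Via Dromedario 3',      # different order
                           'Amanda': 'Via Dromedario 2',
                           'Licia': 'Via Dromedario 4'})    # no phone for Licia

book = pandas.DataFrame({'PHONE': phones, 'ADDRESS': addresses})
print(book)
```

Each address ends up next to the right phone number even though the two Series list the people in a different order.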

Like the Series object, the DataFrame object has an `index` attribute which gives access to the index labels:

In [28]:
my_address_book.index

Index([u'Amanda', u'Bice', u'Daniele', u'Licia'], dtype='object')

And a `values` attribute too:

In [29]:
my_address_book.values

array([['Via Dromedario 2', 2222222],
       ['Via Dromedario 3', 3333333],
       ['Via Dromedario 1', 1111111],
       ['Via Dromedario 4', 4444444]], dtype=object)

And many more. E.g. you can use the `columns` attribute, which is an `Index` object holding the column labels:

In [30]:
my_address_book.columns

Index([u'ADDRESS', u'PHONE'], dtype='object')

E.g. you might want to do something only on the first column:

In [31]:
my_address_book.columns[0]

'ADDRESS'

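Beware of `all`: it is a method (`all()` tests whether all elements are truthy), not an attribute for displaying the data. Written without parentheses, it merely prints the bound method's representation, which happens to include the DataFrame labels and full content. To display the DataFrame, simply evaluate `my_address_book` as done above.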

In [32]:
my_address_book.all

<bound method DataFrame.all of                   ADDRESS    PHONE
Amanda   Via Dromedario 2  2222222
Bice     Via Dromedario 3  3333333
Daniele  Via Dromedario 1  1111111
Licia    Via Dromedario 4  4444444>

In this example, we saw that the Pandas DataFrame can be thought of as a generalization of a 2D NumPy array, where both the rows and columns have a generalized index for accessing the data.
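A few other attributes and methods are handy for a quick look at a DataFrame. The sketch below rebuilds the address book from plain lists and shows `shape`, `dtypes` and `head()`; the choice of these three is an illustration, not part of the original notebook:

```python
import pandas

my_address_book = pandas.DataFrame(
    {'ADDRESS': ['Via Dromedario 2', 'Via Dromedario 3',
                 'Via Dromedario 1', 'Via Dromedario 4'],
     'PHONE': [2222222, 3333333, 1111111, 4444444]},
    index=['Amanda', 'Bice', 'Daniele', 'Licia'])

print(my_address_book.shape)     # (number of rows, number of columns)
print(my_address_book.dtypes)    # data type of each column
print(my_address_book.head(2))   # first two rows only
```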

### 2. Pandas DataFrame as a specialised dictionary

Similarly, we can think of a Pandas DataFrame as a specialisation of a dictionary.

Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. 

E.g. see the example below. Asking for the 'ADDRESS' column returns the Pandas Series object containing all the addresses in the addressbook DataFrame, together with the index information (so we know WHO lives WHERE).

In [None]:
my_address_book['ADDRESS']

Same for the 'PHONE' column, of course, as for any other column that might exist:

In [None]:
my_address_book['PHONE']

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Are you able to build a single, one-shot query to get Amanda's phone number?
</div>

NOTE: if you think of a real addressbook, this is basically the logical query you would do, e.g. "let me open my addressbook and look for Amanda's phone number" or "Daniele's address" or similar. Usually you indeed do NOT want ALL the information you have about one person when looking into the addressbook, do you?

_Hint_: build a query piece after piece, then merge and simplify, and get to the best piece of code.

## <font color='red'>Solution and discussion</font>

I want a phone number. The hint suggests proceeding step by step. OK. Let me get all phone numbers, first.

In [None]:
my_address_book['PHONE']

OK. What I got above is a Pandas Series. Let me call it with something like `tmp` in the name, as this will not be the final step. I know how to get one element of my `tmp` Pandas Series (check out examples before), so let me try:

In [None]:
tmp_series = my_address_book['PHONE']
tmp_series['Amanda']

Cool! I got it! Now, look at the structure, digest it, and... get rid of useless intermediate steps, and simplify (= make it compact):

In [None]:
my_address_book['PHONE']['Amanda']

Well done!
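The chained form works for reading a value, but pandas also offers a single-step lookup by row and column label: `.loc[row, column]`, or `.at[row, column]` when a single scalar is wanted. A sketch on a small rebuilt address book (the names `chained`, `with_loc` and `with_at` are illustrative):

```python
import pandas

my_address_book = pandas.DataFrame(
    {'PHONE': [2222222, 3333333],
     'ADDRESS': ['Via Dromedario 2', 'Via Dromedario 3']},
    index=['Amanda', 'Bice'])

chained = my_address_book['PHONE']['Amanda']        # column first, then row
with_loc = my_address_book.loc['Amanda', 'PHONE']   # row label, column label
with_at = my_address_book.at['Amanda', 'PHONE']     # scalar-only variant
print(chained, with_loc, with_at)
```

Note the order of the labels: the chained form names the column first, while `.loc` and `.at` name the row first.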

NOTE: there is large potential for confusion here, and indeed everyone makes a lot of mistakes when trying out DataFrames while thinking of arrays, and vice versa.

   * in a 2D NumPy array, e.g. `numpy_array[0]` below, you get back the first row of the array

In [None]:
numpy_array = numpy.array([[1, 2, 3], [4, 5, 6]])   # a 2D array with 2 rows
numpy_array[0]

   * in a DataFrame, e.g. `my_address_book['ADDRESS']` below, you get back a column (here, the first one)

In [None]:
#my_address_book[0]   # this gives an error, of course
my_address_book['ADDRESS']

Because of this possible confusion, it is probably better to think of Pandas DataFrames as specialised dictionaries rather than generalized arrays, though both ways of thinking can be useful.
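To make the contrast concrete: square brackets select a row in a 2D NumPy array but a column in a DataFrame, while the rows of a DataFrame are reached with `.loc` (by label) or `.iloc` (by position). A small sketch with made-up data (`frame` and the row/column labels are illustrative):

```python
import numpy
import pandas

numpy_array = numpy.array([[1, 2, 3],
                           [4, 5, 6]])
frame = pandas.DataFrame(numpy_array, columns=['x', 'y', 'z'], index=['r1', 'r2'])

first_row_array = numpy_array[0]      # a row of the array: 1, 2, 3
column_x = frame['x']                 # a column of the DataFrame: 1, 4
first_row_frame = frame.iloc[0]       # a row, selected by position: 1, 2, 3
row_r2 = frame.loc['r2']              # a row, selected by label: 4, 5, 6
```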

## Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways. 

   1. From a single Series object
   1. From a list of dicts
   1. From a dictionary of Series objects
   1. From a two-dimensional NumPy array

.. and more.. but we stop here. Let's see some examples for each.

### 1. From a single Series object

A Pandas DataFrame is a collection of Pandas series, and a single-column Pandas DataFrame can obviously be constructed from a single Pandas Series:

In [None]:
phones_series

In [None]:
DF1 = pandas.DataFrame(phones_series, columns=['phone numbers'])
DF1
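An equivalent route, if preferred, is the Series method `to_frame()`, which takes the column name directly. A brief sketch on a shortened phone series (`DF1_alt` is an illustrative name):

```python
import pandas

phones_series = pandas.Series({'Amanda': 2222222, 'Bice': 3333333})

DF1_alt = phones_series.to_frame(name='phone numbers')   # one-column DataFrame
print(DF1_alt)
```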

### 2. From a list of dicts

Any list of dictionaries can be made into a Pandas DataFrame. We will use a simple list comprehension to create some data in this example:

In [None]:
list_of_dicts = [{'a': i, 'b': 2 * i}
        for i in range(4)]
list_of_dicts

In [None]:
DF2 = pandas.DataFrame(list_of_dicts)
DF2
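The dictionaries need not share the same keys: where a key is missing, pandas fills the gap with `NaN`. A minimal sketch (`DF2_gaps` is an illustrative name):

```python
import pandas

DF2_gaps = pandas.DataFrame([{'a': 1, 'b': 2},
                             {'b': 3, 'c': 4}])   # 'a' and 'c' are each missing once
print(DF2_gaps)
```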

### 3. From a dictionary of Series objects

We saw this in the addressbook example above: a Pandas DataFrame can be constructed from a dictionary of Series objects.

In [None]:
my_address_book = pandas.DataFrame({'PHONE': phones_series,
                                    'ADDRESS': address_series})
my_address_book

### 4. From a 2D NumPy array

Given a 2D array of data, we can create a Pandas DataFrame with any specified column and index names. If left out, an integer index will be used for each.

In [None]:
DF3 = pandas.DataFrame(numpy.random.rand(3, 2),
                       columns=['first column', 'second column'],
                       index=['first row', 'second row', 'third row'])
DF3
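For comparison, a sketch of what happens when the labels are left out: both the rows and the columns get a default integer index starting at 0. A fixed array is used here instead of random numbers so that the result is reproducible, and `DF3_default` is an illustrative name:

```python
import numpy
import pandas

DF3_default = pandas.DataFrame(numpy.array([[0.1, 0.2],
                                            [0.3, 0.4],
                                            [0.5, 0.6]]))
print(DF3_default.index)     # default row labels 0, 1, 2
print(DF3_default.columns)   # default column labels 0, 1
```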

## Pandas Index

We said that the Pandas Series and Pandas DataFrame contain an explicit index which lets you reference and modify data.

This index is an `Index` object, and it is an interesting structure in itself. It can be thought of either:
   * as an "immutable array"
   * as an "ordered set"
   
Those views have some interesting consequences in the operations available on Index objects.

As a simple example, let's construct an index from a list of integers:

In [None]:
my_index = pandas.Index([1, 2, 3, 4, 5])
my_index

### Index as Immutable Array

The index in many ways operates like an array. E.g., we can use standard Python indexing notation to retrieve values or slices:

In [None]:
my_index[1]

In [None]:
my_index[2:]

In [None]:
my_index[:2]

In [None]:
my_index[::2]

Index objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(my_index.size)
print(my_index.shape)
print(my_index.ndim)
print(my_index.dtype)

One difference between Index objects and NumPy arrays is that indices are immutable: that is, they cannot be modified via the normal means:

In [None]:
my_index[1]

In [None]:
my_index[1] = 0     # error

Which is actually a good thing. Immutability is not there to limit your freedom, it is there to save you from terrible mistakes. It makes it safer to share indices between multiple DataFrames and arrays, without the potential for nasty side effects from an accidental index modification.
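A sketch of what the refused assignment looks like in practice, and of how a modified index can still be obtained by building a new one. The names `error_type`, `values` and `new_index` are illustrative, and the round trip through a list is just one possible approach:

```python
import pandas

my_index = pandas.Index([1, 2, 3, 4, 5])

try:
    my_index[1] = 0                  # in-place modification is refused
except TypeError as err:
    error_type = type(err).__name__
    print(error_type, err)

values = list(my_index)              # copy the labels into a plain list
values[1] = 0
new_index = pandas.Index(values)     # a brand-new Index; my_index is unchanged
print(new_index)
```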

### Index as Ordered set

Pandas objects are designed to facilitate operations such as joins across datasets. To this end, the Pandas Index object follows many of the conventions of Python's built-in `set` object, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

In [None]:
indA = pandas.Index([1, 3, 5, 7, 9, 100, 200])
indB = pandas.Index([2, 4, 6, 8, 10, 100, 200])

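In [None]:
indA.intersection(indB)  # intersection

In [None]:
indA.union(indB)  # union

In [None]:
indA.difference(indB)  # difference

Older pandas versions also exposed these set operations through the operators `&`, `|` and `-`. In recent versions those operators act element-wise instead, so the methods above are the reliable choice. Other set-like methods exist too, e.g.:

In [None]:
indA.symmetric_difference(indB)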
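For reference, a self-contained sketch of the method-based set operations on the two indices, with the results they produce noted in the comments (`common`, `combined`, `only_in_A` and `in_one_only` are illustrative names):

```python
import pandas

indA = pandas.Index([1, 3, 5, 7, 9, 100, 200])
indB = pandas.Index([2, 4, 6, 8, 10, 100, 200])

common = indA.intersection(indB)                # labels in both: 100, 200
combined = indA.union(indB)                     # labels in either index
only_in_A = indA.difference(indB)               # labels in indA only: 1, 3, 5, 7, 9
in_one_only = indA.symmetric_difference(indB)   # labels in exactly one index: 1 to 10
print(common, combined, only_in_A, in_one_only, sep='\n')
```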

## Done. That's all for the Pandas 101 crash course.

The appetizer is over. Time for you to move to the main course, i.e. a good Pandas course and/or tutorial.

## What we have learnt

Basics of Series, DataFrame, Index objects in Pandas - which form the foundation of data-oriented computing with Pandas. 

Very useful for data manipulation: just as understanding the effective use of NumPy arrays is fundamental to effective numerical computing in Python, understanding the effective use of Pandas structures is fundamental to the **data munging** required for data science in Python.

* "data munging" = the process of manual data cleansing prior to analysis

## Reading material

* Pandas documentation page (user guide), http://pandas.pydata.org/pandas-docs/stable/
* Pandas cookbook (many examples), http://pandas.pydata.org/pandas-docs/stable/cookbook.html
* Pandas API Reference, http://pandas.pydata.org/pandas-docs/stable/api.html