<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [Quiz](03.04-Quiz.ipynb) | [Contents](00.00-Index.ipynb) | [Advanced Processing](04.02-Pandas.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/python-official-statistics/04.01-Dataframes.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

<a id='top'></a>

# Pandas: Series & Dataframes
## Content  
- [Pandas Objects](#objects)
    - [Series](#series)
    - [Index](#index)
    - [DataFrame](#dataframe)
- [Constructing DataFrames](#construct)
- [DataFrame Information](#info)
- [Indexing & Selection (and Slicing)](#indexing)
- [Dropping Stuff](#dropping)
- [Sort & Rank](#sort)
- [Applying Functions](#apply)

Pandas is a free software library for the Python programming language, designed for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals; it is also a play on the phrase "Python data analysis".  
Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.

<a id='objects'></a>

## Pandas Objects
Pandas objects can be thought of as enhanced versions of NumPy ndarrays in which the rows and columns are identified with labels rather than simple integer indices.  
The three fundamental Pandas data structures are the ``Series``, ``DataFrame``, and ``Index``.
On top of the basic data structures Pandas provides a lot of useful tools, methods, and functionality.

<a id='series'></a>
### Series Object
A Pandas ``Series`` is a one-dimensional array of indexed data. It can be seen as a generalized NumPy array: an array with a specialized index.   
It can be created from a list or array as follows:

In [None]:
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

As we can see, the ``Series`` contains a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a NumPy array:

In [None]:
data.values

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
print(data[1])
print(data[2:])

<a id='index'></a>

### Index Objects
Index objects appear in conjunction with Series (and, as we will see later, with DataFrames).  
By default, the implicitly defined index is a `RangeIndex` object, which permits access to elements as in a Python list or NumPy array.  
An index is always created for a Pandas Series; its type depends on how the Series is defined:

In [None]:
# implicit index
print(data.index)
# explicit index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
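# the index type can be changed with astype() (the same method works on a Series too)
# 'string' requests the dedicated pandas string dtype
print(data.index.astype('string'))
print('length of first label:', len(data.index.astype('string')[0]))
# 'str' converts the labels to Python strings; the reported dtype depends on the pandas version
print(data.index.astype('str'))
# access a value by its explicit label
print(data['b'])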

A Series can be seen as a specialized dictionary, and it can be built directly from one (the keys become the index):

In [None]:
population = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135})
print(population)
print(population['California':'New York'])

When an explicit index is passed along with a dictionary, the index determines both which entries are kept and their order:

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
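
To illustrate how an explicit index interacts with a dictionary, the following minimal sketch adds a label (``4``, chosen here only for illustration) that is not among the dictionary keys. Assuming standard pandas behavior, that label is kept in the result with a missing value (NaN).

```python
import pandas as pd

# The explicit index keeps keys 3 and 2 (in that order), drops key 1,
# and adds label 4, which has no matching key in the dictionary.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])
print(s)

# Labels without a matching key hold a missing value
print(s.isna())
```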

<a id='dataframe'></a>

### DataFrame Object
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.  
Something like an Excel sheet.   

Let's create a DataFrame by combining the ``population`` Series created above with a new one, ``surface``:

In [None]:
surface_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
surface = pd.Series(surface_dict)
surface

And now put them together as a dataframe:

In [None]:
states = pd.DataFrame({'population': population,
                       'surface': surface})
# states.T.columns
# states.columns
states
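
The flexible row and column labels of a DataFrame can be inspected directly. The sketch below rebuilds a shortened version of ``states`` (two states only, to keep it self-contained) and shows that both sets of labels are ``Index`` objects, and that transposing swaps them.

```python
import pandas as pd

# Shortened version of the example data (two states only)
population = pd.Series({'California': 38332521, 'Texas': 26448193})
surface = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'surface': surface})

# Row labels and column labels are both Index objects
print(states.index)
print(states.columns)

# Transposing swaps the two: the row labels become the column labels
transposed = states.T
print(transposed.columns)
```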

<a id='construct'></a>

## Constructing DataFrame objects
Pandas offers multiple ways to create a DataFrame; here are some of them:

- ### From a Series object

In [None]:
pd.DataFrame(population, columns=['population'])

- ### From a list of dicts

In [None]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

- ### From a dictionary of Series objects
_already used for our first example_
- ### From a two-dimensional NumPy array

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

- ### From a NumPy structured array

In [None]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)
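
As a small addition to the list-of-dicts construction above, the sketch below uses records whose keys differ (the records are made up for illustration). Assuming standard pandas behavior, the columns are the union of all keys and the absent entries are filled with NaN.

```python
import pandas as pd

# Two illustrative records that share only the key 'b'
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

# Columns are the union of the keys; absent entries become NaN
print(df)
```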

<a id='info'></a>

## DataFrame Information

As with NumPy arrays, Pandas DataFrames have attributes and methods that help to understand the structure of the object and the information stored in it.

### About structure
``info()`` and ``count()`` also report on data quality (the null or non-null status of values). 

In [None]:
# Basic Information
# (rows, columns)
print('- shape:', states.shape)
# (rows * columns)
print('- size:', states.size)
# Describe index
print('- index:', states.index)
# Describe DataFrame columns
print('- columns:', states.columns)
# Info on DataFrame (info() prints its report itself and returns None)
print('- info():')
states.info()
# Number of non-NA values per column
print('- count():', states.count())
# Data type of each column
print('- dtypes:', states.dtypes)
# Value types count
print('- dtypes.value_counts():', states.dtypes.value_counts())
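
Since ``states`` has no missing values, the null-related part of ``info()`` and ``count()`` is easier to see on a small frame that does contain some. The data below is invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in each column
df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': ['a', 'b', None]})

# count() reports the non-null values per column
non_null = df.count()
print(non_null)

# isna().sum() reports the complementary number of missing values
missing = df.isna().sum()
print(missing)

# info() shows the non-null count next to each column's dtype
df.info()
```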

### About data inside
Some statistics describing the data stored in the object.

In [None]:
# Sum of values
print(states.sum())
# Cumulative sum of values
print(states.cumsum())
# Minimum/maximum values
print(states.min(), states.max())
# Index labels of the minimum/maximum values
print(states.idxmin(), states.idxmax())
# Summary statistics
print(states.describe())
# Mean of values
print(states.mean())
# Median of values
print(states.median())

<a id='indexing'></a>

## Data Indexing and Selection (and Slicing)
The individual ``Series`` (columns of the ``DataFrame``) can be accessed via dictionary-style indexing of the column name:

In [None]:
# individual series
print(states['population'])
print()
print(type(states['population']))
print()
# Not the same! A list of column names is fancy indexing and returns a DataFrame
print(states[['population']])
print()
print(type(states[['population']]))
print()


Attribute-style access with column names can also be used, but the column name must be a valid Python variable name.  
_Ex. a column name containing a space cannot be used as an attribute._

In [None]:
states.population
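
The limits of attribute-style access can be sketched with a small frame whose column names are chosen only for illustration: one valid name, one containing a space, and one (``count``) that collides with an existing DataFrame method.

```python
import pandas as pd

# Illustrative column names: valid, containing a space, and clashing with a method
df = pd.DataFrame({'population': [1, 2],
                   'surface area': [3, 4],
                   'count': [5, 6]})

# A valid Python name works as an attribute
print(df.population)

# A name with a space cannot be written with dot syntax; use brackets
print(df['surface area'])

# 'count' collides with the DataFrame.count() method:
# the attribute is the method, the brackets give the column
print(callable(df.count))
print(df['count'])
```

Bracket access works in every case, which makes it the safer choice in reusable code.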

### loc & iloc
To understand how the two indexers work, we must see a ``DataFrame as a two-dimensional array``: the first dimension selects rows, the second dimension selects columns.  

Using the ``iloc`` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index).  

Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names.  

Both indexers can be used for ``simple selections``, ``slicing``, ``masking``, and ``fancy indexing``:

In [None]:
def print_fmt(desc, obj):
    print(desc)
    print(obj)
    print()

states['density'] = states['population']/states['surface']

# iloc usage
print(states)
print('\niloc operations:\n')
# slicing iloc
print_fmt('slice both dimensions', states.iloc[1:3,:1])
# masking and fancy indexing as in Numpy (iloc)
print_fmt('masking and fancy indexing', states.iloc[(states['density'] > 100).values, [1, 2]])
# slicing with position integer for rows
print_fmt('slicing one dimension', states.iloc[0:3])
# indexing for both dimensions
print_fmt('selection', states.iloc[1,1])

In [None]:
# loc usage
print(states)
print('\nloc operations:\n')
# slicing with loc both dimensions
print_fmt('slicing both dimensions', states.loc[:'Texas', :'surface'])
# fancy indexing both dimensions
print_fmt('fancy indexing both dimensions', states.loc[['Texas', 'Florida'], ['surface']])
# masking and fancy indexing as in Numpy (loc)
print_fmt('masking and fancy indexing', states.loc[states['density'] > 100, ['population', 'density']])
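
The difference between the explicit index (``loc``) and the implicit position (``iloc``) is easiest to see when the row labels are themselves integers. The frame below is a made-up illustration with labels 1, 3 and 5.

```python
import pandas as pd

# Illustrative frame whose integer labels differ from the positions 0, 1, 2
df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[1, 3, 5])

by_label = df.loc[1, 'value']      # row labelled 1 (the first row)
by_position = df.iloc[1, 0]        # row at position 1 (the second row)
print(by_label, by_position)

label_slice = df.loc[1:3]          # labels 1 to 3, end label included
position_slice = df.iloc[1:3]      # positions 1 and 2, end position excluded
print(label_slice)
print(position_slice)
```

Note that slicing with ``loc`` includes the end label, while slicing with ``iloc`` follows the usual Python convention and excludes the end position.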

### at & iat
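Attention! These two are indexers, not functions, so they are used with square brackets rather than parentheses (older versions of Pandas provided the ``get_value()`` method for this purpose). Each one accesses a single value: ``at`` by label, ``iat`` by integer position: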


In [None]:
# following two lines are equivalent in states dataframe
print(states.at['New York', 'population'])
print(states.iat[2, 0])
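
Both indexers can also be used to assign a single value. A minimal sketch on a shortened copy of ``states`` follows; the new population figure is invented for illustration.

```python
import pandas as pd

# Shortened version of the example data (two states only)
states = pd.DataFrame({'population': [38332521, 26448193],
                       'surface': [423967, 695662]},
                      index=['California', 'Texas'])

# Write a single cell by label (illustrative value)...
states.at['Texas', 'population'] = 27000000

# ...and read the same cell back by position (row 1, column 0)
value = states.iat[1, 0]
print(value)
```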

### Additional indexing conventions

There is something more to be mentioned about DataFrame rows and columns:  
- While *indexing* refers to columns, *slicing* refers to rows.
- *Masking* is also interpreted row-wise rather than column-wise.

Implicit masking: equivalent to ``loc``

In [None]:
print_fmt('masking implicit', states[states.density > 100])
print_fmt('masking with loc', states.loc[states.density > 100])

Implicit slicing with integer positions: equivalent to ``iloc``

In [None]:
print_fmt('slicing rows implicit', states[0:3])
print_fmt('slicing rows with iloc', states.iloc[0:3])

**Masking at DataFrame level**  
This operation keeps all rows (of course, it makes no numeric sense here) and replaces the values where the condition is false with NaN (not a number). It has no equivalent with indexers.

In [None]:
print_fmt('masking at dataframe level', states[states > 100])
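
Although there is no indexer equivalent, the ``where()`` method should give the same result as DataFrame-level masking. The sketch below checks this on a shortened copy of ``states`` (two states only).

```python
import pandas as pd

# Shortened version of the example data (two states only)
states = pd.DataFrame({'population': [38332521, 19651127],
                       'surface': [423967, 141297]},
                      index=['California', 'New York'])
states['density'] = states['population'] / states['surface']

# Masking with a boolean DataFrame keeps the shape and blanks out False cells
masked = states[states > 100]

# where() expresses the same operation as a method call
with_where = states.where(states > 100)
print(with_where)
```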

**Pandas inherits operations from NumPy (ufuncs)**  
The result keeps the index and columns.

In [None]:
print(states/1000)

With a Series, the operation is broadcast row-wise by default, which is equivalent to ``axis=1``:

In [None]:
print_fmt('with operator /', states/states.loc['California'])
print_fmt('with function div', states.div(states.loc['California'], axis=1))

The same division, column-wise:

In [None]:
print(states.div(states['population'], axis=0))

Preserving the columns  
Here the divisor Series (``alaska``) has no ``density`` entry, so the result for that column is NaN.

In [None]:
alaska = pd.Series([100000, 1000000], name='Alaska', index=['population', 'surface'])
# a Series is one-dimensional, so transposing it (.T) has no effect
print(states/alaska)

Index and columns are aligned and sorted when the DataFrames involved in an operation do not share the same labels:

In [None]:
rng = np.random.RandomState(42)
X = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
Y = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
print_fmt('X', X)
print_fmt('Y', Y)
print_fmt('X + Y', X + Y)
print_fmt('Y + X', Y + X)

This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.
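
When the NaN values produced by alignment are not wanted, the method form of the operator accepts a ``fill_value``. The sketch below uses small fixed frames (invented for illustration, in place of the random ones above) so that the results are easy to follow.

```python
import pandas as pd

# Fixed illustrative data instead of random numbers
X = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
Y = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list('BAC'))

# Plain operator: cells present in only one frame become NaN
plain = X + Y
print(plain)

# Method form: a value missing on one side is treated as 0 before adding
filled = X.add(Y, fill_value=0)
print(filled)
```

A cell would still be NaN if it were missing in both frames; that does not happen here because ``Y`` covers every row and column of the result.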

<a id='dropping'></a>

## Dropping Stuff
The last simple manipulation on DataFrames is removing rows or columns. The method is ``drop()`` and it is used as follows (rows are the default):

In [None]:
print(states)
# Drop rows by label (axis=0, the default)
print(states.drop(['Texas', 'Florida']))
# Drop columns by label (axis=1)
states.drop('surface', axis=1)
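
Note that ``drop()`` returns a new object and leaves the original untouched. The sketch below, on a shortened copy of ``states``, also shows the ``index=`` and ``columns=`` keywords, which can be used instead of ``axis``.

```python
import pandas as pd

# Shortened version of the example data (two states only)
states = pd.DataFrame({'population': [38332521, 26448193],
                       'surface': [423967, 695662]},
                      index=['California', 'Texas'])

# columns= is an alternative to drop('surface', axis=1)
without_surface = states.drop(columns='surface')

# index= is an alternative to drop('Texas')
without_texas = states.drop(index='Texas')

# The original DataFrame still has all its rows and columns
print(states.shape)
print(without_surface.columns.tolist())
print(without_texas.index.tolist())
```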


<a id='sort'></a>

## Sort & Rank
The ``sort_...()`` methods are useful when you need to reorder rows. Use ``sort_index()`` to sort by the index and ``sort_values()`` to sort by the values of any other column.  
The ``rank()`` method creates a new DataFrame of ranks: each value is replaced by its rank within its column, starting from 1, or by a percentile rank when ``pct=True``.

In [None]:
print(states)

# Sort by labels along an axis
print(states.sort_index(ascending=False))

# Sort by the values along an axis
print(states.sort_values(by='population'))

# Assign ranks to entries
print(states.rank(ascending=False, pct=True))
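
As a further sketch, sorting can be done in descending order and ranking can be applied to a single column. The example uses a shortened copy of ``states`` with three states.

```python
import pandas as pd

# Shortened version of the example data (three states only)
states = pd.DataFrame({'population': [38332521, 26448193, 19651127],
                       'surface': [423967, 695662, 141297]},
                      index=['California', 'Texas', 'New York'])

# Largest surface first
by_surface = states.sort_values(by='surface', ascending=False)
print(by_surface)

# Rank a single column: the most populous state gets rank 1
ranks = states['population'].rank(ascending=False)
print(ranks)
```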

<a id='apply'></a>

## Applying Functions
During data processing it is often necessary to perform operations (such as statistical calculations, splitting, or substituting values) on a certain row or column to obtain new data. Writing a for-loop to iterate through a Pandas DataFrame or Series tends to require more lines of code, reduce readability, and run slower.  
Fortunately, Pandas has built-in methods for this: ``apply()`` and ``applymap()`` (since pandas 2.1, ``applymap()`` is deprecated in favor of the equivalent ``DataFrame.map()``).

In [None]:
print(states)

# Apply function
# Used to apply a function along an axis of the DataFrame or on values of Series.
print('\nApply:\n')
print(states.apply(lambda x: np.sum(x)))

# Applymap function
# Applies a function element-wise to every value of the DataFrame
print('\nApplyMap:\n')
print(states.applymap(lambda x: x*2))
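
By default ``apply()`` passes each column to the function; with ``axis=1`` it passes each row instead. The sketch below, on a shortened copy of ``states``, computes a per-row density this way and compares it with the vectorized expression used earlier.

```python
import numpy as np
import pandas as pd

# Shortened version of the example data (two states only)
states = pd.DataFrame({'population': [38332521, 26448193],
                       'surface': [423967, 695662]},
                      index=['California', 'Texas'])

# axis=0 (default): the function receives each column as a Series
column_totals = states.apply(lambda col: col.sum())
print(column_totals)

# axis=1: the function receives each row as a Series
density = states.apply(lambda row: row['population'] / row['surface'], axis=1)
print(density)

# The vectorized form gives the same values and is usually faster
vectorized = states['population'] / states['surface']
print(np.allclose(density, vectorized))
```

When a vectorized expression exists, it is generally preferable; row-wise ``apply()`` is most useful for logic that cannot be expressed column-wise.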



<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of the __Python for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/python-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>