In [None]:
import numpy as np
import pandas as pd

# Pandas
**Pandas** is a very popular data processing library for Python. Its main benefit is the **DataFrame**, which is inspired by the data frame of the **R** language (see [tibble](https://tibble.tidyverse.org/)). The pandas DataFrame is built on top of the NumPy array, but handles non-numeric and heterogeneous data much more easily. It also allows indexing and annotation of columns and rows with names. Pandas offers a rich set of functionality built around the DataFrame, such as plotting, statistics and data handling, with support for various data sources (SQL, HDF5, PyTables). We will start by introducing the DataFrame.

<img src="./pandas.jpg" alt="pandas" style="width:400px;"/>

# 1 DataFrame
A DataFrame resembles the way data is laid out in a spreadsheet application like Excel. We have data in rows and columns, annotated with names and indexes.

## create a DataFrame from a numpy array

In [None]:
N_COLS = 5
N_ROWS = 25
COLUMN_NAMES = ['A', 'B', 'C', 'D', 'E']

temp_array = np.random.rand(N_ROWS, N_COLS)

df = pd.DataFrame(data=temp_array, columns=COLUMN_NAMES)
df.head()  # show the first five rows of the DataFrame to get an idea of the format

**Notice** the column names at the top and the index on the left. In this case the index is a numeric series that numbers the samples, starting at 0. The DataFrame is therefore mostly used in a *samples x features* or *samples x variables* layout. The column names and index are not part of the data, but they enrich and structure the data.
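The labels and the data can be inspected separately. The following is a minimal sketch on a small, made-up DataFrame; the name `demo_df` and the column labels `x` and `y` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 2 DataFrame used only for illustration
demo_df = pd.DataFrame(data=np.zeros((3, 2)), columns=['x', 'y'])

print(demo_df.columns.tolist())  # the column labels
print(demo_df.index.tolist())    # the row labels (by default 0, 1, 2, ...)
print(demo_df.shape)             # (number of rows, number of columns)
```

The shape counts only the data itself, which matches the statement above that the labels are annotations rather than part of the data.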

### from a DataFrame to a Numpy array
We can extract the data as a NumPy array.

In [None]:
temp_array = df.to_numpy()  # recommended over the older `df.values` attribute
print(temp_array)

print(temp_array[:5, :])  # print first 5 rows

## Exercise
Make a DataFrame with three columns (height, age, IQ) with random integer values for 30 individuals/samples.

## the DataFrame object is really rich
Compare the number of methods defined on the DataFrame object with the number of methods supported by the NumPy array. Try to get an idea of what kind of operations the DataFrame supports.

In [None]:
print('Array methods:')
print(dir(temp_array))
print()
print('DataFrame methods:')
print(dir(df))
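The raw `dir()` output is long, so counting the entries can make the comparison easier. This is a rough sketch; the helper `public_names` is a hypothetical name introduced here, and it counts all public attributes (methods and properties alike), so the numbers are only an indication.

```python
import numpy as np
import pandas as pd

demo_array = np.zeros((2, 2))
demo_df = pd.DataFrame(demo_array)

def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

n_array = len(public_names(demo_array))
n_df = len(public_names(demo_df))
print('Array:', n_array, 'DataFrame:', n_df)
```

The exact counts depend on the installed NumPy and pandas versions.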

## 1.1 DataFrame indexing and selection
DataFrames support basic to advanced ways of indexing data. There are various methods defined on the DataFrame, see [this](https://pandas.pydata.org/pandas-docs/stable/indexing.html) page for an excellent reference.

### Select a column
A DataFrame is built up from [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) objects. A Series is a one-dimensional array with axis labels (including time series). If we select one column from a DataFrame we get a Series back. Notice that the column name is no longer shown as a header at the top; it is displayed as the `Name` of the Series instead.

In [None]:
df['A']
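To check what kind of object a selection returns, we can look at its type. The sketch below uses a small made-up DataFrame (`demo_df` is an illustrative name) and shows that the column label of a single selected column is kept in the `name` attribute of the Series.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
demo_df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})

col = demo_df['A']        # a string selects one column and returns a Series
print(type(col))
print(col.name)           # the column label is stored as the Series name

one_col_df = demo_df[['A']]   # a list of names returns a DataFrame, even for one column
print(type(one_col_df))
```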

### select multiple columns
To select multiple columns we pass a list of column names to the DataFrame. Because we select more than one column (Series) we get a DataFrame back. Notice that we indeed see the column names at the top.

In [None]:
df[['B', 'E']]

### use the index
The basic use of the *index* is to select certain rows (samples), for example by slicing. Pandas has many ways to interpret your selection, and getting used to them is part of the learning curve.

In [None]:
df[:2]  # select the first two rows
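Square brackets behave differently depending on what is passed: a slice selects rows, while a column name selects a column. A minimal sketch on a made-up DataFrame (the name `demo_df` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
demo_df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]})

first_two = demo_df[:2]      # slice: rows at positions 0 and 1
every_other = demo_df[::2]   # slice with a step: rows at positions 0 and 2
column_a = demo_df['A']      # string: the column named 'A'

print(first_two)
print(every_other)
print(column_a)
```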

### use the .loc method
The `.loc` method is an important data selection method defined on the DataFrame object (see [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label)). It lets you select from the index and the columns at the same time.

In [None]:
df.loc[:3, ['C', 'D']]  # select the first 4 elements from the 'C' and 'D' columns
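Note that `.loc` selects by label and that its slices include the end label, unlike ordinary Python slices. That is why `:3` above returns four rows. The sketch below makes this visible with string row labels; `demo_df` and the labels `r0` to `r4` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels, used only for illustration
demo_df = pd.DataFrame(
    {'C': [1, 2, 3, 4, 5], 'D': [6, 7, 8, 9, 10]},
    index=['r0', 'r1', 'r2', 'r3', 'r4'],
)

by_label = demo_df.loc['r0':'r3', ['C', 'D']]  # label slice: 'r3' is included
print(len(by_label))

py_slice = list(range(5))[0:3]                 # plain Python slice: the stop is excluded
print(len(py_slice))
```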

## Exercise
Look at the documentation of the `.iloc` method defined on the DataFrame [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position). 

Select the last row of the last column in the DataFrame `df`.

# 2 Series
A **Series** is a one-dimensional array with axis labels, which we have encountered above ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)). It often represents a time-series and is often encountered as one column in a **DataFrame**.
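A Series can also be created directly. As a small sketch of the time-series case, the example below uses made-up temperature values and dates; the name `temps` and all values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily measurements, indexed by date
dates = pd.date_range('2024-01-01', periods=3, freq='D')
temps = pd.Series([20.5, 21.0, 19.5], index=dates, name='temperature')

print(temps)
print(temps.loc[pd.Timestamp('2024-01-02')])  # look up a value by its label
```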

## Exercise
Select one column from the DataFrame `df` and compute its mean value. Is it what you expect?

# 3 Mutability when indexing
As we have seen with NumPy, the rules for selecting data and mutating these selections are different from MATLAB. Let's see how this is handled in Pandas.

In [None]:
N_COLS = 5
N_ROWS = 25
COLUMN_NAMES = ['A', 'B', 'C', 'D', 'E']
mutable_df = pd.DataFrame(data=np.random.rand(N_ROWS, N_COLS), columns=COLUMN_NAMES)

In [None]:
select_df = mutable_df[['B', 'E']]
select_df.iloc[0, 0] = 1  # do you get a warning message?
select_df

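If all went well you got a warning message (a `SettingWithCopyWarning`) which refers to [this](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) section about *indexing-view-versus-copy*. Thank you Pandas! Whether you see the warning depends on your pandas version: versions that use copy-on-write may no longer emit it.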

Now let's see what happened to our original DataFrame.

In [None]:
mutable_df

**Nothing** happened, which might surprise you. Refer to [this](http://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing) section on the page for more info.

Let's try again with the `.loc` method.

In [None]:
select_df = mutable_df.loc[:3, 'A']

In [None]:
select_df[0] = 10

In [None]:
mutable_df

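**Something** happened! Here the selection can be a view on the original data, so the assignment changes `mutable_df` as well. (With copy-on-write enabled, as in newer pandas versions, the original stays unchanged.) It will take some time for you to learn the behavior of Pandas in this respect. For now we leave you with the warning given on the Pandas website.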

_Warning_

_Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided. See [Returning a View versus Copy](https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy)_
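Two patterns avoid this ambiguity: take an explicit copy when you want an independent selection, and assign with a single `.loc` call on the original DataFrame when you want to change it. The following is a minimal sketch with made-up data; the names `original` and `subset` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
original = pd.DataFrame({'A': [0.1, 0.2, 0.3], 'B': [0.4, 0.5, 0.6]})

# 1) Independent selection: copy explicitly, then modify the copy
subset = original[['B']].copy()
subset.iloc[0, 0] = 1.0
print(original.loc[0, 'B'])  # the original value is untouched

# 2) Intentional change: one .loc assignment directly on the original
original.loc[0, 'A'] = 10.0
print(original.loc[0, 'A'])
```

Both patterns state the intent explicitly, so the result does not depend on whether a selection happens to be a view or a copy.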

## Exercise
Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html) of the `.copy` method. Copy a DataFrame to a new variable and explore the difference between `deep = False` and `deep = True` . What kind of differences do you expect?

# 4 Basic statistics

In [None]:
N_COLS = 5
N_ROWS = 25
COLUMN_NAMES = ['A', 'B', 'C', 'D', 'E']

temp_array = np.random.rand(N_ROWS, N_COLS)

df = pd.DataFrame(data=temp_array, columns=COLUMN_NAMES)

## Exercise
Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) for the `.sum` method. Compute the sum across rows (1 sum per column) and across columns (1 sum per row).

Do the same for the `.mean` method ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)).

## Exercise
Compute the difference between two columns and compute the absolute value of this difference for each row.

## Exercise
Compute the maximum of the last 10 values in the last two columns.