# Pandas - Introduction

This notebook is the first part of the collection devoted to the pandas library.
It presents the basic objects of the library.

In [None]:
import jupy_helpers
from IPython.display import display

In [None]:
# Start using pandas (default import convention)
import pandas as pd
import numpy as np

In [None]:
# Let pandas speak for itself
print(pd.__doc__)

Visit the official website: https://pandas.pydata.org

In [None]:
# Current version (should be 0.24 in 2019)
print(pd.__version__)

## Basic objects 

The **pandas** library has a vast API with many useful functions. However, most of it revolves
around two important classes:

* Series
* DataFrame

In this introduction, we will focus on them: what each of them does, and how they relate to each other
and to NumPy objects.

### Series

Series is a one-dimensional data structure, central to pandas. 

For a complete API, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [None]:
# My first series
series = pd.Series([1, 2, 3])
series

This looks a bit like a NumPy array, does it not?

Actually, in most cases the Series wraps a NumPy array...

In [None]:
series.values  # The result is a NumPy array

...and if we construct the series from a NumPy array, it wraps it directly.

In [None]:
zeros_array = np.zeros(10)

# We check object identity (the result may depend on the pandas version,
# since newer releases can copy the input array)
pd.Series(zeros_array).values is zeros_array

But there is something more. Alongside the values, we see that each item (or "row") has a certain label. The collection of labels is called the **index**.

In [None]:
series.index

This index (see below) can be used, as its name suggests, to index items of the series.

In [None]:
# Return an element from the series
series.loc[1]

In [None]:
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab
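As a minimal sketch of how labels work when a series is built from a dictionary: the keys become the index, and `loc` looks a value up by its label. The name `fruit_counts` and its data are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: the dict keys become the index labels
fruit_counts = pd.Series({"apple": 3, "banana": 5, "cherry": 7})

print(list(fruit_counts.index))     # the labels, in insertion order
print(fruit_counts.loc["banana"])   # look up a value by its label
```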

**Exercise**: Create a series with 5 elements.

In [None]:
%exercise

# result = ...

result = pd.Series([1, 2, 3, 4, 5])

In [None]:
%validate
assert len(result) == 5

### DataFrame

A **DataFrame** is pandas' answer to Excel sheets - it is a collection of columns (or, in our case, a collection of **Series**).
Quite often, we read data frames directly from an external source, but it is also possible to create them from:
* a dict of Series, NumPy arrays or other array-like objects
* an iterable of rows (where rows are Series, lists, ...)

In [None]:
# List of lists
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(table)
table_df

In [None]:
# Dict of Series
df = pd.DataFrame({
    'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
    'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df

In [None]:
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))

df = pd.DataFrame(data, columns=['a', 'b'])
df

In [None]:
# A DataFrame also has an index...
df.index

In [None]:
# ...that is shared by all columns
df.index is df["a"].index

## D(ata) types

Pandas builds upon the NumPy data types (mentioned earlier) and adds a few more.

In [None]:
typed_df = pd.DataFrame({
  "bool": np.arange(5) % 2 == 0,
  "int": range(5),
  "int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
  "float": np.arange(5) * 3.14,
  "object": [None, 1, "2", 3.0, 4 + 1j],
  "string?": ["a", "b", "c", "d", "e"],
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
  "timedelta": pd.timedelta_range(0, freq="1s", periods=5),
  "category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category")
})
typed_df

In [None]:
typed_df.dtypes
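The `int[nan]` column above relies on the nullable `Int64` type. The following sketch (with made-up variable names) shows why it is needed: without it, a missing value turns an integer column into `float64`.

```python
import numpy as np
import pandas as pd

# A missing value forces a plain integer column to become float64...
float_series = pd.Series([1, 2, np.nan])

# ...unless the nullable "Int64" extension dtype is requested explicitly
nullable_series = pd.Series([1, 2, np.nan], dtype="Int64")

print(float_series.dtype)
print(nullable_series.dtype)
```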

We will see some of these types used in practice in later analyses.

## Indices & indexing



In [None]:
abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series

In [None]:
abc_series.index

In [None]:
abc_series.index = ["c", "d", "e"]  # Changes the labels in-place!
abc_series.index.name = "letter"
abc_series

In [None]:
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(
    table,
    index=["first", "second", "third"],
    columns=["alpha", "beta"]
)
table_df

In [None]:
alpha = table_df["alpha"]  # Simple [] indexing in DataFrame returns Series
alpha

In [None]:
alpha["second"]             # Simple [] indexing in Series returns scalar values.

In [None]:
alpha.second   # This also works

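But be careful! Attribute access only works when the label does not collide with an existing attribute or method of the Series (in many pandas versions, `first` is a Series method):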

In [None]:
alpha.first
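A small sketch of the same pitfall with a label that collides with a method in any pandas version. The series `clash` and its labels are hypothetical; `mean` is chosen only because `Series.mean` always exists.

```python
import pandas as pd

# Hypothetical labels; "mean" collides with the Series.mean method
clash = pd.Series([10, 20], index=["mean", "other"])

print(clash.other)            # no collision: returns the element
print(callable(clash.mean))   # collision: we get the method, not the element
print(clash["mean"])          # square brackets always refer to the label
```

Square-bracket access is therefore the safer habit whenever labels are not known in advance.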

There are two proper ways to index rows & cells in the DataFrame:

- `loc` for label-based indexing
- `iloc` for order-based indexing (it does not use the **index** at all)

Note that both are used with square brackets, not parentheses: they are not methods
but special "indexer" objects. They accept one or two arguments specifying
the position along one or both axes.
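The difference between the two indexers is easiest to see when integer labels do not match the positions. This is a minimal sketch with made-up data (`shuffled` and `grid` are illustrative names):

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions
shuffled = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = shuffled.loc[0]      # the item labelled 0
by_position = shuffled.iloc[0]  # the first item, whatever its label
print(by_label, by_position)

# With two arguments, the first addresses rows and the second columns
grid = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
print(grid.loc["r2", "b"], grid.iloc[1, 1])
```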

**Exercise:** Create a `DataFrame` whose `x` column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, \ldots, 2\pi$, whose `y` column is `cos(x)`, and whose index consists of the `fractions` `0, 1/4, 1/2, ..., 2`.

In [None]:
%exercise

import fractions

# index = [fractions.Fraction(n, ___) for n in range(___)]
# x = np.___([___ for ___ in ___])
# y = ___

# df = pd.DataFrame(___, index = ___)

index = [fractions.Fraction(n, 4) for n in range(9)]
x = np.array([np.pi * i for i in index])
y = np.cos(x)
df = pd.DataFrame({"x": x, "y": y}, index = index)

# display
df

In [None]:
%validate

assert np.allclose(df.loc[fractions.Fraction(3, 2)], [fractions.Fraction(3, 2) * np.pi, 0])

#### loc


In [None]:
first = table_df.loc["first"]
first

In [None]:
table_df.loc["first", "beta"]            

In [None]:
table_df.loc["first":"second", "beta"]   # Use ranges (inclusive)

#### iloc

In [None]:
table_df.iloc[1]

In [None]:
table_df.iloc[0:4:2]   # Select every second row
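One detail worth checking by hand is how slices behave: `loc` slices include the end label, while `iloc` slices exclude the end position, like ordinary Python slices. A short sketch with a hypothetical series `letters`:

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = letters.loc["a":"c"]   # includes the end label "c"
by_position = letters.iloc[0:2]   # excludes position 2, like a Python list

print(list(by_label))
print(list(by_position))
```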

In [None]:
# `at` is a label-based accessor for a single scalar value
table_df.at["first", "beta"]

In [None]:
type(table_df.at)
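Besides `at`, pandas also offers `iat`, its position-based counterpart. Both address exactly one cell and can be used for assignment as well. The frame `cells` below is a made-up stand-in so that the sketch runs on its own.

```python
import pandas as pd

# Hypothetical two-row frame, similar in shape to table_df
cells = pd.DataFrame(
    {"alpha": ["a", "b"], "beta": [1, 3]},
    index=["first", "second"],
)

value_by_label = cells.at["second", "beta"]   # label-based scalar access
value_by_position = cells.iat[1, 1]           # position-based scalar access

cells.at["second", "beta"] = 30               # a single cell can also be set
print(value_by_label, value_by_position, cells.at["second", "beta"])
```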

## Modifying DataFrames

Adding a new column works like adding a key/value pair to a dict.
Note that this operation, unlike most others, modifies the DataFrame in place.

In [None]:
from datetime import datetime
table_df["now"] = datetime.now()
table_df

A non-destructive version, which returns a new DataFrame, uses the `assign` method:

In [None]:
table_df.assign(delta = [True, False, True])

In [None]:
# However, the original DataFrame is not changed
table_df

Deleting a column is very easy too.

In [None]:
del table_df["now"]
table_df

The **drop** method works with both rows and columns. It returns a new DataFrame and leaves the original unchanged.

In [None]:
table_df.drop("beta", axis=1)

In [None]:
table_df.drop("second", axis=0)
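To confirm that `drop` does not touch the original, here is a minimal sketch on a made-up frame `small_df`. It also uses the `index=` keyword, an alternative spelling of `axis=0`.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
small_df = pd.DataFrame(
    {"alpha": ["a", "b"], "beta": [1, 3]},
    index=["first", "second"],
)

without_beta = small_df.drop("beta", axis=1)    # drop a column
without_first = small_df.drop(index="first")    # drop a row, keyword form

print(list(without_beta.columns))
print(list(without_first.index))
print(list(small_df.columns))   # the original still has both columns
```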

**Exercise:** Use a combination of `reset_index`, `drop` and `set_index` to transform `table_df` into `pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])`

In [None]:
%exercise

# result = table_df.___.___.___

result = table_df.reset_index().drop(columns=['beta']).set_index('alpha')

result

In [None]:
%validate

pd.testing.assert_frame_equal(result, pd.DataFrame({'index': table_df.index}, index=table_df["alpha"]))

---
**Let's get some data!**