In [None]:
import pandas as pd
print(pd.__version__)

import numpy as np

# `Series`
`Series` is a data type implemented by the `pandas` package. It combines the numerical speed of `numpy` with dictionary-like indexing. A `Series` is always 1-dimensional and has two components: an `index` and `values`. Let's create a `Series` as an example. We can create it in the same way as we would a `numpy` array:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25, 1.5])
data

As it stands, it doesn't seem to be much different from a `numpy` array, and we can access the values (right column) in the same way as with a `numpy` array, using the `index` (the left column):

In [None]:
data[0]

In [None]:
data[3:]

We can also extract `values` and `index` separately:

In [None]:
# values are represented as numpy array
data.values

In [None]:
# index in this case is a range object
data.index

In [None]:
# we can change it to a list or array
np.array(data.index)

So far, it is unclear why we would use `Series` over arrays. First, the index doesn't need to be an integer. Here I create a `Series` where the index consists of `str` labels rather than numbers. To do it, I specify `index` explicitly when I create the `Series`:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25, 1.5],
                 index=['a', 'b', 'c', 'd', 'e', 'f'])
data

Now we can access the values in a way that is more reminiscent of `dict`:

In [None]:
data['b']

But unlike a `dict`, a `Series` can also be sliced:

In [None]:
data['b':'e']

We can think of a `Series` as an array which has its own index associated with it, and that index can be anything you want. You can access the values in the `Series` based on that index. This entails many convenient properties. For example, if you select a subset of the `Series`, the index will be preserved:

In [None]:
# create Series with values from 10 to 12 with step 0.1 using numpy function arange
data = pd.Series(np.arange(10,12,0.1))
data

In [None]:
# take every second value from that Series; note that the index is preserved
data[::2]

This is extremely useful when working with data, because you don't need to worry about which subset of data you've picked: indexes will always be consistent. In reality, `pandas` gives you a choice: you can access elements based on the index associated with these elements (*explicit* index) or based on the position of the element independent of the index (*implicit* index, same as with `numpy` arrays).
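A minimal sketch of the two access modes on a subset, using the `.loc` and `.iloc` accessors that are covered in detail later in this notebook. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# integer values keep the comparison exact
s = pd.Series(np.arange(10, 20))
subset = s[::2]               # index labels 0, 2, 4, 6, 8 are preserved

by_label = subset.loc[4]      # explicit: the element whose label is 4
by_position = subset.iloc[4]  # implicit: the fifth element, whatever its label

print(by_label, by_position)
```

The two results differ because label `4` sits at position 2 of the subset, while position 4 holds label `8`.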

`Series` are also extremely useful when you have a label for each value, like so:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Here we created the `Series` from `dict`, and you could say -- why would we even use `Series` here when we already have `dict` with the same content?

In [None]:
# you can access dict
population_dict['California']

In [None]:
# and you can access Series in the same way
population['California']

The answer is twofold. First, efficiency. `Series` give you the efficiency of `numpy` arrays when you use them for computation. A `dict` lacks this efficiency (it is optimized for a different purpose: fast lookup by key). Second, `Series` offer more flexibility in working with data, like doing slices:

In [None]:
population['California':'Illinois']
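To make the efficiency argument a bit more concrete, here is a small sketch comparing the two styles of computation. It uses a shortened copy of the population data; `millions` and `millions_dict` are names made up for this illustration.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

# with a dict, every value has to be processed in an explicit Python loop
millions_dict = {state: n / 1e6 for state, n in population_dict.items()}

# with a Series, one vectorized expression handles all values and keeps the labels
millions = population / 1e6

print(millions)
```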

## Ways to create `Series`
There are many ways you can create `Series`, here are some:

In [None]:
pd.Series([2, 4, 6])

In [None]:
pd.Series(5, index=[100, 200, 300])

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})
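One more variant, shown as a sketch: when the data is a `dict` and `index` is also given, only the listed keys are kept, in the order given.

```python
import pandas as pd

# only the keys listed in `index` are kept, in the order given
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(s)
```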

# <font color='DarkSeaGreen '>Exercise</font>
In the cell(s) below:
1. Create a `Series` with the following content: `index` are numbers from 1 to 7 (use `np.arange` to create that). Values are strings, containing names of the days of the week from Monday to Sunday. 

2. Create another `Series` with the same `index`, but values are 'Workday' for indexes 1 to 5 and 'Weekend' for 6 and 7.

>**Tip**: try this and see what it does: `['Workday']*5`

>**Tip**: try this and see what it does: `['a','b','c'] + ['d','e']`

Optional:
1. Create a `Series` with the following content: `index` are days of the month April 2017 (from 1 to 30), and values are days of the week as numbers from 1 to 7 (1st of April was a Saturday). 

2. Create another `Series` with the same index, and names of the days of the week. Try to use your previous `Series` with days of the week to populate the `Series` with names of the days.

3. Create another `Series` with the same index, but fill it with strings 'Workday' and 'Weekend'.

# `DataFrame`
`Series` are good and useful and all, but frequently we have not a single variable, but a collection of associated variables, for example, names of subjects, their age, their score on our task, etc. `DataFrame` is a two-dimensional extension of `Series`. It has *rows*, just like `Series`, but it also has *columns*. Let's create a `DataFrame` from the `Series` with state populations we had before, and add the area of each state as a second column.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

In a `DataFrame` the index is shared between all columns:

In [None]:
states.index

You can also get a list of all columns:

In [None]:
states.columns

And each column is a `Series`, which you can access with `dict`-like notation:

In [None]:
states['area']

In [None]:
# verify that a single column of a DataFrame is a Series
type(states['area'])

## Ways to construct `DataFrame`
As with `Series`, there are many ways of creating `DataFrames`, here are some:

In [None]:
pd.DataFrame(population, columns=['population'])

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
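The two dictionaries in the cell above do not share all of their keys. As a sketch of what to expect, a key that is missing from a row shows up as `NaN` in that row (`records` is a name chosen for this illustration):

```python
import pandas as pd

records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# columns are the union of all keys; absent entries become NaN
print(records)
print(records.isnull())
```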

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

# <font color='DarkSeaGreen '>Exercise</font>
In the cell(s) below:
1. Create a `DataFrame` from the 2 `Series` in the previous exercise with one column holding names of days of the week and another holding whether it is a workday or a weekend. 

Optional:
1. Do the same for your `Series` with days of the month (you'll have 3 columns: number of the day of the week, name of the day of the week and whether it is workday or a weekend).

# Indexing and selection

The point of `index` is, well... indexing. That is, `index` is basically just a column by which it is handy to *identify* individual rows. If you think about `index` in this way, it becomes apparent, for example, that `index` should not contain duplicates (it can, but it is usually not a good idea, because trying to get a single value will give you several values).
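A short sketch of the duplicate-label problem, with made-up data (`dup` is a hypothetical name):

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=['a', 'b', 'a'])

print(dup.index.is_unique)  # check whether all labels are distinct
print(dup['a'])             # a Series with two rows, not a single value
print(dup['b'])             # a single value, because 'b' is unique
```

Checking `index.is_unique` before relying on label lookups is a cheap way to catch this early.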

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

As I was saying, `Series` (and `DataFrames`) combine the functionality of `dict` and `numpy` arrays, and `Series` allow you to use many methods from both of these classes. For example, while you can access the index of the `Series` through the `Series.index` attribute, you can also use the method `keys()`, which comes straight from the `dict` class, and it will give the same output. After all, `index` for `Series` is like `key` for `dict`: these are the "hooks" which allow you to get certain values from the objects.

In [None]:
data.index

In [None]:
data.keys()

You can add new values to the `Series` just like you would to the `dict`:

In [None]:
data['e'] = 1.25
data

# Slicing and indexing
Slicing and indexing `Series` and `DataFrames` is one of the key operations you will do all the time, so it is worth understanding it clearly. Let's start with the `Series`. We already showed that the index can be non-integer, e.g. it can be composed of strings:

In [None]:
data

And you can slice with this (*explicit*) indexing:

In [None]:
# slicing by explicit index
data['b':'d']

But `Series` also allows you to slice based on the position of the elements, which is called *implicit* indexing:

In [None]:
# slicing by implicit integer index
data[1:4]

This can create confusion if your index is integer but not consecutive, e.g. [2,4,6,8,10]. This is resolved in the section on avoiding confusion between explicit and implicit indexing below.


## Boolean indexing
A particular type of indexing is *boolean* indexing, sometimes also called *masking*. It happens when instead of an index you pass a set of `bool` values (e.g. an array, or a `Series`), which has to be the same size as the object you're trying to index. In this case you get only those values which correspond to the `True` values in the boolean array. This comes up very frequently, and it is a very useful type of indexing, because it allows you to *filter* values based on certain criteria. In the following example I only want values which are larger than `0.6`:

In [None]:
data

In [None]:
# this creates a boolean *mask* which has True only when values > 0.6
data > 0.6

In [None]:
# I use this mask to get the corresponding values
data[data > 0.6]

>**Pro-tip**: You can also combine several conditions in the same line by using `&` operator, which implements an element-wise logical AND, e.g.:

In [None]:
(data > 0.3) & (data < 0.8)

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

> The vertical bar `|` implements an element-wise logical OR:

In [None]:
data[(data < 0.3) | (data > 0.8)]

> These 2 and other logical operations can also be called using `numpy` functions, e.g. `(data > 0.3) & (data < 0.8)` is equivalent to `np.logical_and(data > 0.3, data < 0.8)`.

In [None]:
# using logical OR as a function
np.logical_or(data < 0.3, data > 0.8)
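Masks work the same way on a `DataFrame`: a boolean `Series` built from one column can filter whole rows. The sketch below uses a small stand-in table (`states_demo` is a hypothetical name) so that it runs on its own.

```python
import pandas as pd

states_demo = pd.DataFrame(
    {'population': [38332521, 26448193, 12882135],
     'area': [423967, 695662, 149995]},
    index=['California', 'Texas', 'Illinois'])

# mask built from one column, applied to the whole DataFrame
big = states_demo[states_demo['population'] > 20_000_000]

# the same mask applied to a single column
big_areas = states_demo['area'][states_demo['population'] > 20_000_000]

print(big)
print(big_areas)
```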

## Avoiding confusion between explicit and implicit indexing
As was pointed out in the previous section, having both explicit and implicit indexing can create confusion, especially if you have an integer index. Let's see it in practice:

In [None]:
data = pd.Series(['a', 'b', 'c', 'e', 'f', 'g'], index=[1, 3, 5, 7, 9, 11])
data

When you retrieve elements, explicit index is used by default:

In [None]:
# explicit index is used by default when indexing 
data[3]

But when you do slicing, it uses implicit indexing, so `[:3]` returns the first 3 elements, instead of the elements up to `index` 3:

In [None]:
# implicit index by default when slicing
data[:3]

This is no good, because it can lead to errors. To avoid them and explicitly use either of the two types of indexes, `pandas` `Series` and `DataFrames` have attributes `.loc` (for explicit indexing) and `.iloc` (for implicit indexing). Let's see a couple of examples:

In [None]:
data.loc[1]

In [None]:
data.loc[:3]

In [None]:
data.iloc[1]

In [None]:
data.iloc[:3]
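One detail worth checking in the slices above: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, like ordinary Python slicing. A small sketch on the same data (`by_label` and `by_position` are illustrative names):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'e', 'f', 'g'], index=[1, 3, 5, 7, 9, 11])

by_label = data.loc[:3]      # labels up to and including 3 -> labels 1 and 3
by_position = data.iloc[:3]  # positions 0, 1, 2 -> labels 1, 3 and 5

print(by_label)
print(by_position)
```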

In [None]:
states

In [None]:
states.loc['California']

In [None]:
states.iloc[0]

You can also specify both dimensions and extract different elements:

In [None]:
states.loc['California','population']

In [None]:
states.iloc[0,1]

# In case of `DataFrames`

In [None]:
states

In [None]:
states['area']

In [None]:
states['density'] = states['population'] / states['area']
states

In [None]:
states.values

In [None]:
states.T

In [None]:
states.iloc[:3, :2]

In [None]:
states.loc[:'Illinois', :'population']
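Boolean masks and labels can also be combined in a single `.loc` call, and the same addressing can be used to assign values. This is a sketch on a stand-in table; `states_demo`, the threshold `50` and the new area value are arbitrary choices made for the illustration.

```python
import pandas as pd

states_demo = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127],
     'area': [423967, 695662, 141297]},
    index=['California', 'Texas', 'New York'])
states_demo['density'] = states_demo['population'] / states_demo['area']

# rows chosen by a boolean mask, columns chosen by label
dense = states_demo.loc[states_demo['density'] > 50, ['population', 'density']]

# the same (row, column) addressing works for assignment
states_demo.loc['Texas', 'area'] = 700000

print(dense)
print(states_demo)
```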

In the end the fact that you can use implicit indexing (`.iloc`) doesn't mean that you should. In reality, explicit indexing is the one you use most of the time.

# <font color='DarkSeaGreen '>Exercise</font>
In the cell(s) below, from the `DataFrame` with days of the week:
1. Extract row with index 3
2. Extract value `Saturday`
3. Extract column (as `Series`) with status of the days (workday or weekend)
4. It is the week after Easter (Pasqua)! Change work status of Monday to 'Holiday'
5. Make a new column which encodes the work status of the days as `bool` (`True` for workdays and `False` for weekends and holidays); set the name of this column to `come_to_SISSA`
6. Using `.iloc`, extract the last 3 values of the `come_to_SISSA` column
7. Using `boolean` indexing, extract names of the day of the week where `come_to_SISSA` is `True`
8. Same as previous, but when `come_to_SISSA` is `False`

Optional. In the `DataFrame` with days of the month:
1. Set values of the following days to 'Holiday': 14, 17, 24, 25
2. Make a new column which encodes the work status of the days as `bool` (`True` for workdays and `False` for weekends and holidays); set the name of this column to `come_to_SISSA`
3. Using `.iloc`, extract the last 3 rows of the `DataFrame`
4. Using `boolean` indexing, extract names of the day of the week where `come_to_SISSA` is `False` 

# Operations on `Series` and `DataFrame`
Operations on `DataFrames` and `Series` align values by `index`, so matching labels are combined even if their order differs. For example:

In [None]:
a = pd.Series({0:10, 1:20, 2: 30})
a

In [None]:
b = pd.Series({0:1, 1:2, 2: 3})[::-1]
b

In [None]:
a/b

If some of the indexes exist in one, but not the other `Series`, it will give you `NaN`, *Not a Number*, as a result:

In [None]:
area = pd.Series({'Alaska': 1723337, 
                  'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [None]:
population / area

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

This simplifies working on complex data A LOT, because you don't have to worry that all the values are in the same positions in all of your tables or columns, you just need to know that the labels (indexes) are consistent.
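If `NaN` is not what you want for labels that exist on only one side, the method form of the operator accepts a `fill_value` that stands in for the missing entry. A minimal sketch using `add` on two small `Series`:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

plain = A + B                   # labels 0 and 3 exist on one side only -> NaN
filled = A.add(B, fill_value=0) # a missing entry is treated as 0 before adding

print(plain)
print(filled)
```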

Another example:

In [None]:
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

In [None]:
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

In [None]:
A + B

Sometimes you might want to do operations not element-wise, but column-wise or row-wise, for example, if you want to subtract the values in one row from all rows. `pandas` can handle this automatically:

In [None]:
df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list('QRST'))
df

By default, it will work fine with rows:

In [None]:
df - df.loc[0]

But if you try columns, you'll get something weird:

In [None]:
df - df['R']

Of course, there is a logical explanation for why this happens: by default `pandas` matches the `index` of the `Series` against the `columns` of the `DataFrame`. Here it tries to match the `index` of `df['R']` with the `columns` of `df` and doesn't find any correspondence between the two. To avoid this, you need to specify the axis explicitly. In this case you'd have to use the method `subtract`, like so:

In [None]:
df.subtract(df['R'], axis=0)
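Because the cells above use random numbers, here is the same comparison as a sketch on fixed values, which makes the effect of `axis` easier to verify by eye (`fixed`, `mismatched` and `aligned` are names invented for this example):

```python
import pandas as pd

fixed = pd.DataFrame([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12]], columns=list('QRST'))

# default: the Series index (0, 1, 2) is matched against the columns (Q, R, S, T);
# nothing matches, so every cell of the result is NaN
mismatched = fixed - fixed['R']

# axis=0: the Series index is matched against the row index instead
aligned = fixed.subtract(fixed['R'], axis=0)

print(mismatched)
print(aligned)
```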

Keeping consistency with its matching principles, if you subtract a `Series` which contains only some of the indexes, it will only return values for the corresponding columns:

In [None]:
# getting the first row, and every second column
halfrow = df.iloc[0, ::2]
halfrow

In [None]:
df - halfrow

# Handling missing data
Missing data is a big problem and can be a pain to handle. In `pandas` there are special tools for dealing with missing data. First, what is considered a missing value? It is either `None` or `np.nan`. `None` is a built-in Python object which you can use in any context to mark a missing value, but the problem is that it is not optimized for numerical computations, because it doesn't belong to any numerical type. Therefore, if you put it in an array, the array automatically gets `dtype=object`, which is the most general `dtype` (it can hold anything) and doesn't offer the speed advantages of numerical arrays. In comparison, `np.nan` is an ordinary floating-point value: it fits in a `float64` array like any other number and does not slow down the computations. Demonstration:

In [None]:
vals = np.array([1, None, 3, 4])
vals.dtype # dtype('O') means `object`

In [None]:
vals = np.array([1, np.nan, 3, 4])
vals.dtype
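Speed is not the only issue with `None`. As a small sketch, arithmetic on an object array that contains `None` fails outright, because Python does not define addition between an integer and `None` (`none_sum_failed` is a flag introduced just for this illustration):

```python
import numpy as np

vals_none = np.array([1, None, 3, 4])  # dtype=object

try:
    vals_none.sum()
    none_sum_failed = False
except TypeError:
    # adding an int and None is not defined in Python
    none_sum_failed = True

print(none_sum_failed)
```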

Just to see how much `dtype=object` hurts us, let's create two arrays which contain the same values (integers from 0 up to 10^6), the first with `dtype=object` and the second with `dtype=int`, and time how long it takes to compute the sum of all elements.

>**Note**: `1E6` is a shortcut for `1 * 10**6`.

In [None]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

Same data, but with `dtype=object` it takes roughly 30 times longer to compute the sum (the exact factor depends on your machine). So we don't want to use `None` in any setting where we intend to do numerical computations; we always use `np.nan`.

In [None]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

If you try to do operations with `NaN`, it will give `NaN` as a result. Same as if you try to use standard aggregation functions:

In [None]:
1 + np.nan

In [None]:
vals2.sum(), vals2.min(), vals2.max()

To handle `nan` there are special `numpy` functions which ignore `nan`:

In [None]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

However, in `pandas` this is all simplified. First, both `None` and `np.nan` will become `NaN` in `pandas` (if you have numerical `Series`, otherwise see **side note** below), so you don't have to worry about having `None` somewhere:

In [None]:
x = pd.Series([1, np.nan, 2, None])
x

Second, missing values are ignored by default, because they are frequent in data analysis and we don't want to use special functions all the time:

In [None]:
x.mean()

In [None]:
x.sum()
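If you do want missing values to propagate, as they do in `numpy`, the aggregation methods take a `skipna` argument. A minimal sketch:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, np.nan, 2, None])

total = x.sum()                     # missing values are skipped
strict_total = x.sum(skipna=False)  # missing values propagate, the result is NaN

print(total, strict_total)
```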

>**Side note**: If you have numerical `Series`, `pandas` will change `None` to `np.nan`. However, if you have a `Series` with `dtype=object` (`Series` which can contain anything, but doesn't offer numerical advantage), it will not change by default:

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

> This is not a problem, because methods for working with missing values we will learn below will work with both `nan` and `None`. They do not discriminate.

There are also several methods to deal specifically with `NaN` values:

- `isnull()`: Generate a boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

In [None]:
data.isnull()

In [None]:
data[data.notnull()]

In [None]:
data.dropna()
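Since a mask is just a `Series` of `bool` values, summing it counts the `True` entries. As a sketch, this gives a quick way to count missing and present values (`n_missing` and `n_present` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

n_missing = data.isnull().sum()   # True counts as 1, False as 0
n_present = data.notnull().sum()

print(n_missing, n_present)
```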

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

If you `dropna()`, by default it will drop all rows which contain at least 1 `nan` value:

In [None]:
df.dropna()

But you can specify axis (`'rows'` or `'columns'`, which is the same as `0` and `1`):

In [None]:
df.dropna(axis='columns')

You can also specify to throw away columns or rows only if all values are `nan`:

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='columns', how='all')
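Between "any" and "all" there is a middle ground: the `thresh` parameter of `dropna()` sets the minimum number of non-missing values a row or column needs in order to be kept. A sketch on the same table, with `kept` as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# keep only the rows that have at least 3 non-missing values
kept = df.dropna(axis='rows', thresh=3)

print(kept)
```

Only the middle row survives here, because the other two rows have just two non-missing values each.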

### Filling null values
Sometimes you don't want to throw away missing values, but instead you want to fill them with some other values. There is a method `fillna()` for this, and it has parameters which help you specify the filling rule:

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

You can fill with a certain value (e.g., `0`):

In [None]:
data.fillna(0)
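The fill value does not have to be a constant you type in. As a sketch, a common choice is a statistic computed from the data itself, such as the mean of the non-missing values (`filled` is an illustrative name):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# data.mean() skips missing values, so it is the mean of 1, 2 and 3
filled = data.fillna(data.mean())

print(filled)
```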

Or you can fill each `nan` with the previous existing value (forward fill):

In [None]:
# forward-fill (`fillna(method='ffill')` is deprecated in recent pandas versions; `ffill()` is equivalent)
data.ffill()

Or with the next existing value (backward fill):

In [None]:
# back-fill (`fillna(method='bfill')` is deprecated in recent pandas versions; `bfill()` is equivalent)
data.bfill()

In [None]:
df

For `DataFrames` you can also specify which axis to do filling from (by default it is rows, meaning that it takes from the previous row in the same column):

In [None]:
df

In [None]:
# equivalent to the deprecated df.fillna(method='ffill')
df.ffill()

If you specify `axis=1` (or `axis='columns'` which is the same), it will take from the same row, but from the previous (`ffill`) or next (`bfill`) column:

In [None]:
# equivalent to the deprecated df.fillna(method='ffill', axis='columns')
df.ffill(axis='columns')

# <font color='DarkSeaGreen '>Exercise</font>
In the cell(s) below, in the `DataFrame` with days of the week:
1. Create a new column 'lesson_duration_min' with all values = np.nan
2. Set the following values in this column: for Tuesday -- 120, Wednesday -- 120+150, Thursday -- 120, Friday -- 120+150
3. Calculate the mean duration of the lessons next week.
4. Calculate the total number of class hours for the next week.
5. Using the new column, get the names of the days of the week when there is no lesson
6. Using `fillna`, change values of `nan` to `0`

Optional: 
1. Create 'lesson_duration_min' for the days of the month `DataFrame`, except don't put values of duration in every cell separately, but set all Wednesdays and Fridays to 120+150 and all Tuesdays and Thursdays to 120.
2. In every row which is not a 'Workday' set duration to 0.
3. Calculate the mean duration of the lessons for the month.
4. Calculate the total number of class hours for the month.