- handling missing data
- include example of loading and organizing iris dataset from visualization lesson

advantages of long data
- can handle different sample sizes for groups
- groupby
- need to use seaborn for plotting by group

advantages of wide data
- matplotlib graphics is simpler (e.g. boxplots)


# Resources

http://www.jstatsoft.org/v59/i10/paper

https://tomaugspurger.github.io/modern-5-tidy.html

In [1]:
import pandas as pd
print(pd.__version__)

import numpy as np

0.19.2


# `Series`
`Series` is a data type implemented by the `pandas` package. It combines the numerical speed of `numpy` with dictionary-like indexing. A `Series` is always 1-dimensional and has two components: an `index` and the `values`. Let's create a `Series` as an example. We can create it in the same way as we would a `numpy` array:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25, 1.5])
data

0    0.25
1    0.50
2    0.75
3    1.00
4    1.25
5    1.50
dtype: float64

As it stands, it doesn't seem much different from a `numpy` array, and we can access the values (right column) in the same way as with a `numpy` array, using the `index` (left column):

In [4]:
data[0]

0.25

In [5]:
data[3:]

3    1.00
4    1.25
5    1.50
dtype: float64

We can also extract `values` and `index` separately:

In [7]:
# values are represented as numpy array
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ])

In [10]:
# index in this case is a range object
data.index

RangeIndex(start=0, stop=6, step=1)

In [15]:
# we can change it to a list or array
np.array(data.index)

array([0, 1, 2, 3, 4, 5], dtype=int64)

So far, it is unclear why we would use a `Series` over an array. First, the index doesn't need to be an integer. Here I create a `Series` whose index consists of `str` labels rather than numbers. To do so, I specify `index` explicitly when I create the `Series`:

In [18]:
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25, 1.5],
                 index=['a', 'b', 'c', 'd', 'e', 'f'])
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
f    1.50
dtype: float64

Now we can access the values in a way that is more reminiscent of `dict`:

In [19]:
data['b']

0.5

But unlike a `dict`, a `Series` can also be sliced:

In [24]:
data['b':'e']

b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

We can think of a `Series` as an array with its own index associated with it, and that index can be anything you want. You can then access the values in the `Series` based on that index. This has many convenient consequences. For example, if you select a subset of the `Series`, the index is preserved:

In [27]:
# create Series with values from 10 to 12 with step 0.1 using numpy function arange
data = pd.Series(np.arange(10,12,0.1))
data

0     10.0
1     10.1
2     10.2
3     10.3
4     10.4
5     10.5
6     10.6
7     10.7
8     10.8
9     10.9
10    11.0
11    11.1
12    11.2
13    11.3
14    11.4
15    11.5
16    11.6
17    11.7
18    11.8
19    11.9
dtype: float64

In [29]:
# take every second value from that Series; note that the index is preserved
data[::2]

0     10.0
2     10.2
4     10.4
6     10.6
8     10.8
10    11.0
12    11.2
14    11.4
16    11.6
18    11.8
dtype: float64

This is extremely useful when working with data, because you don't need to worry about which subset of data you've picked: indexes will always be consistent. In fact, `pandas` gives you a choice: you can access elements based on the index associated with these elements (*explicit* index) or based on the position of the element independent of the index (*implicit* index, same as with `numpy` arrays).

`Series` are also extremely useful when you have a label for each value, like so:

In [30]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

Here we created the `Series` from a `dict`, and you could ask: why would we even use a `Series` here when we already have a `dict` with the same content?

In [31]:
# you can access dict
population_dict['California']

38332521

In [33]:
# and you can access Series in the same way
population['California']

38332521

The answer is twofold. First, efficiency. `Series` give you the efficiency of `numpy` arrays when you use them for computation. `dict`s lack this efficiency (they are optimized for a different purpose, fast lookup by key). Second, `Series` offer more flexibility in working with data, like doing slices:

In [34]:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
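
To make the efficiency argument a bit more concrete, here is a minimal sketch that performs the same computation on the `dict` and on the `Series`. The variable names `millions_dict`, `millions`, `largest_state` and `total` are introduced only for this illustration.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

# with a dict we have to loop over the items ourselves
millions_dict = {state: n / 1e6 for state, n in population_dict.items()}

# with a Series the same operation is a single vectorized expression
millions = population / 1e6

# built-in reductions work on the values and keep track of the labels
largest_state = population.idxmax()   # label of the largest value
total = population.sum()
```

On five values the speed difference is negligible; the point is that the `Series` version needs no explicit loop and keeps the labels attached to the results.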

## Ways to create `Series`
There are many ways to create a `Series`; here are some:

In [39]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [40]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [41]:
pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object
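
One more variant that may be handy: when a `dict` is combined with an explicit `index`, only the listed keys are used, in the order given. The names `letters`, `subset` and `with_missing` below are made up for this sketch.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# passing index= selects (and orders) the entries taken from the dict
subset = pd.Series(letters, index=[3, 2])

# labels that are missing from the dict are filled with NaN
with_missing = pd.Series(letters, index=[1, 4])
```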

# `DataFrame`
`Series` are useful, but frequently our data has not a single dimension but a collection of associated variables, for example the names of subjects, their ages, their scores on our task, etc. `DataFrame` is a multidimensional extension of `Series`. It has *rows*, just like a `Series`, but it also has *columns*. Let's create a `DataFrame` from the `Series` with state populations we had before, and add the area of each state as a second column.

In [43]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [44]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In a `DataFrame` the index is shared between all columns:

In [45]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

You can also get a list of all columns:

In [46]:
states.columns

Index(['area', 'population'], dtype='object')

And each column is a `Series`, which you can access with `dict`-like notation:

In [47]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [49]:
# verify that a single column of a DataFrame is a Series
type(states['area'])

pandas.core.series.Series

## Ways to construct `DataFrame`
As with `Series`, there are many ways of creating `DataFrames`; here are some:

In [54]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [55]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [135]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [58]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.057777  0.378066
b  0.044698   0.81313
c   0.57548  0.966393


# Indexing and selection

The point of an index is, well... indexing. That is, an index is basically just a column by which it is handy to *identify* individual rows. If you think about the index in this way, it becomes apparent, for example, that an index should not contain duplicates.
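
Note that `pandas` itself does not forbid duplicate labels; they just make lookups ambiguous. A minimal sketch, using a hypothetical `Series` called `dup` in which one label is repeated:

```python
import pandas as pd

# hypothetical Series in which the label 'a' appears twice
dup = pd.Series([0.25, 0.5, 0.75], index=['a', 'a', 'b'])

unique_labels = dup.index.is_unique   # False, because 'a' is repeated

# a unique label gives back a single value ...
single = dup['b']
# ... while a duplicated label gives back a Series with every match
several = dup['a']
```

Checking `index.is_unique` after loading data is a cheap way to find out whether the index really identifies individual rows.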

In [69]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [70]:
data['b']

0.5

As I was saying, `Series` (and `DataFrames`) combine the functionality of `dict` and `numpy` arrays, and a `Series` lets you use many methods familiar from both of these classes. For example, while you can access the index of a `Series` through the `Series.index` attribute, you can also use the `keys()` method, named after its `dict` counterpart, and it gives the same output. After all, the `index` of a `Series` is like the keys of a `dict`: these are the "hooks" which allow you to get certain values from the objects.

In [73]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [74]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

You can add new values to a `Series` just like you would to a `dict`:

In [76]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

# Slicing and indexing
Slicing and indexing `Series` and `DataFrames` are key operations you will perform all the time, so it pays to understand them clearly. Let's start with `Series`. We already showed that the index can be non-integer, e.g. it can be composed of strings:

In [77]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

And you can slice with this (*explicit*) indexing:

In [79]:
# slicing by explicit index
data['b':'d']

b    0.50
c    0.75
d    1.00
dtype: float64

But a `Series` also allows you to slice based on the position of the elements, which is called *implicit* indexing:

In [82]:
# slicing by implicit integer index
data[1:4]

b    0.50
c    0.75
d    1.00
dtype: float64
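
The two slices above return the same rows, but they treat their end points differently: a slice by label includes the end label, while a slice by position excludes the end position, as in plain Python. A small sketch of the difference (the names `by_label` and `by_position` are made up for this example):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

by_label = data['b':'d']    # label slice: the end label 'd' is included
by_position = data[1:3]     # positional slice: position 3 is excluded
```

Here `by_label` contains the labels `b`, `c` and `d`, while `by_position` contains only `b` and `c`.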

This can create confusion if your index is integer but not contiguous, e.g. [2,4,6,8,10]. This is resolved in the section on avoiding confusion between explicit and implicit indexing below.


## Boolean indexing
A particular type of indexing is *boolean* indexing, sometimes also called *masking*. Instead of an index you pass a collection of `bool` values (e.g. an array or a `Series`), which has to be the same size as the object you're indexing. You then get only those values that correspond to the `True` entries of the boolean array. This comes up very frequently, and it is a very useful type of indexing because it allows you to *filter* values based on certain criteria. In the following example I only want values which are larger than `0.6`:

In [87]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [88]:
# this creates a boolean *mask* which is True only where values > 0.6
data > 0.6

a    False
b    False
c     True
d     True
e     True
dtype: bool

In [90]:
# a mask can be used directly inside the brackets (note the lower threshold, 0.3, here)
data[data > 0.3]

b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

>**Pro-tip**: You can also combine several conditions in the same line by using the `&` operator, which implements an element-wise logical AND, e.g.:

In [100]:
(data > 0.3) & (data < 0.8)

a    False
b     True
c     True
d    False
e    False
dtype: bool

In [101]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

> The vertical bar `|` implements an element-wise logical OR:

In [102]:
data[(data < 0.3) | (data > 0.8)]

a    0.25
d    1.00
e    1.25
dtype: float64

> These two and other logical operations can also be called as `numpy` functions, e.g. `(data > 0.3) & (data < 0.8)` is equivalent to `np.logical_and(data > 0.3, data < 0.8)`.

In [113]:
# using logical OR as a function
np.logical_or(data < 0.3, data > 0.8)

a     True
b    False
c    False
d     True
e     True
dtype: bool
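
Two further details are worth a quick sketch: the `~` operator negates a mask element by element, and Python's own `and`/`or` keywords cannot be used in place of `&`/`|`. The parentheses around each condition are needed because `&` and `|` bind more tightly than `>` and `<`. The names `inside`, `outside` and `and_failed` are introduced only for this illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

inside = (data > 0.3) & (data < 0.8)

# ~ flips every value of the mask (element-wise NOT)
outside = data[~inside]

# Python's `and` asks for the truth value of a whole Series,
# which pandas refuses to guess, so this raises a ValueError
try:
    (data > 0.3) and (data < 0.8)
    and_failed = False
except ValueError:
    and_failed = True
```

`outside` holds the same values as the `|` example above, just expressed as the negation of the AND mask.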

## Avoiding confusion between explicit and implicit indexing
As was pointed out earlier, having both explicit and implicit indexing can create confusion, especially if you have an integer index. Let's see it in practice:

In [119]:
data = pd.Series(['a', 'b', 'c', 'e', 'f', 'g'], index=[1, 3, 5, 7, 9, 11])
data

1     a
3     b
5     c
7     e
9     f
11    g
dtype: object

When you retrieve a single element, the explicit index is used by default:

In [120]:
# explicit index is used by default when indexing 
data[3]

'b'

But when you slice, implicit indexing is used, so `[:3]` returns the first 3 elements instead of the elements up to the one with `index` 3:

In [121]:
# implicit index by default when slicing
data[:3]

1    a
3    b
5    c
dtype: object

This is no good, because it can lead to errors. To avoid them and to state explicitly which of the two types of index you mean, `pandas` `Series` and `DataFrames` have the attributes `.loc` (for explicit indexing) and `.iloc` (for implicit indexing). Let's see a couple of examples:

In [122]:
data.loc[1]

'a'

In [127]:
data.loc[:3]

1    a
3    b
dtype: object

In [132]:
data.iloc[1]

'b'

In [134]:
data.iloc[:3]

1    a
3    b
5    c
dtype: object

In [136]:
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [140]:
states.loc['California']

area            423967
population    38332521
Name: California, dtype: int64

In [141]:
states.iloc[0]

area            423967
population    38332521
Name: California, dtype: int64

You can also specify both dimensions and extract different elements:

In [142]:
states.loc['California','population']

38332521

In [144]:
states.iloc[0,1]

38332521

# In case of `DataFrames`

In [148]:
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [149]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [151]:
states['density'] = states['population'] / states['area']
states

              area  population     density
California  423967    38332521   90.413926
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763
New York    141297    19651127  139.076746
Texas       695662    26448193    38.01874


In [152]:
states.values

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

In [153]:
states.T

            California     Florida    Illinois    New York       Texas
area          423967.0    170312.0    149995.0    141297.0    695662.0
population  38332520.0  19552860.0  12882140.0  19651130.0  26448190.0
density       90.41393    114.8061    85.88376    139.0767    38.01874


In [154]:
states.iloc[:3, :2]

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135


In [157]:
states.loc[:'Illinois', :'population']

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135


In the end, the fact that you can use implicit indexing (`.iloc`) doesn't mean that you should. In practice, explicit indexing is the one you use most of the time.
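
Explicit indexing with `.loc` also combines with boolean masks, which is one reason it ends up being used so often. A minimal sketch, rebuilding the `states` table from the values shown above (the name `dense` is made up for this example):

```python
import pandas as pd

states = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'population': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
states['density'] = states['population'] / states['area']

# rows where density exceeds 100, and only two of the columns
dense = states.loc[states['density'] > 100, ['population', 'density']]
```

The first argument of `.loc` selects rows (here with a mask) and the second selects columns by label.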

# Operations on `Series` and `DataFrame`
Operations on `DataFrames` and `Series` align values by `index`, so labels are matched correctly even if their order differs. For example:

In [161]:
a = pd.Series({0:10, 1:20, 2: 30})
a

0    10
1    20
2    30
dtype: int64

In [165]:
b = pd.Series({0:1, 1:2, 2: 3})[::-1]
b

2    3
1    2
0    1
dtype: int64

In [166]:
a/b

0    10.0
1    10.0
2    10.0
dtype: float64

If a label exists in one `Series` but not in the other, the result for that label is `NaN` (*Not a Number*):

In [167]:
area = pd.Series({'Alaska': 1723337, 
                  'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [168]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [169]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

This simplifies working with complex data a lot, because you don't have to worry about whether all the values are in the same positions in all of your tables or columns; you just need to know that the labels (indexes) are consistent.
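
If `NaN` is not what you want for labels that are present on only one side, the method form of the operator accepts a `fill_value`. A short sketch with the same two `Series` as in the `A + B` example (the name `total` is introduced here):

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# treat a label missing from one Series as 0 instead of producing NaN
total = A.add(B, fill_value=0)
```

Labels 0 and 3 now keep the value from the one `Series` that has them, while labels 1 and 2 are summed as before.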

Another example:

In [170]:
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A  B
0  1  5
1  9  2


In [171]:
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  8  0  3
1  3  8  4
2  5  2  2


In [172]:
A + B

      A     B   C
0   1.0  13.0 NaN
1  17.0   5.0 NaN
2   NaN   NaN NaN


In [173]:
df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list('QRST'))
df

   Q  R  S  T
0  0  3  3  0
1  2  7  9  5
2  6  1  0  6


In [174]:
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  2  4  6  5
2  6 -2 -3  6


In [175]:
df.iloc[0]

Q    0
R    3
S    3
T    0
Name: 0, dtype: int32

In [176]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -3  0  0 -3
1 -5  0  2 -2
2  5  0 -1  5


In [177]:
halfrow = df.iloc[0, ::2]
halfrow

Q    0
S    3
Name: 0, dtype: int32

In [178]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  2.0 NaN  6.0 NaN
2  6.0 NaN -3.0 NaN


# Handling missing data
example: rat's choice on incomplete trials

In [179]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [180]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
10 loops, best of 3: 72.2 ms per loop

dtype = int
1000 loops, best of 3: 1.97 ms per loop



In [181]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [183]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

In [184]:
1 + np.nan

nan

In [185]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [186]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

In [187]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [188]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [189]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64
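
The upcast to `float64` above happens because `NaN` is itself a floating-point value. As an aside, `pandas` releases newer than the 0.19.2 used in this notebook offer an opt-in nullable integer dtype, spelled `'Int64'` with a capital I. The sketch below assumes such a newer version is installed; it will not run on 0.19.2.

```python
import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype; not available in 0.19.2
x = pd.Series([0, 1, None], dtype='Int64')

n_missing = x.isnull().sum()   # the None is stored as a missing value
total = x.sum()                # missing values are skipped by default
```

With this dtype the integers stay integers even though one entry is missing.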

- `isnull()`: Generate a boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

In [None]:
data.isnull()

In [None]:
data[data.notnull()]

In [None]:
data.dropna()

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis='columns')

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='columns', how='all')
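
Besides `how='any'` (the default) and `how='all'`, `dropna()` also takes a `thresh` argument that sets the minimum number of non-missing values a row or column must have in order to be kept. A minimal sketch on the same `df` as above (the name `kept` is made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# keep only the rows that have at least 3 non-missing values
kept = df.dropna(axis='rows', thresh=3)
```

Only the middle row has three valid entries, so it is the only one that survives.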

### Filling null values

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
data.fillna(0)

In [None]:
# forward-fill: propagate the last valid value forward
data.ffill()

In [None]:
# back-fill: propagate the next valid value backward
data.bfill()

In [None]:
df

In [None]:
df = df.ffill(axis=1)
df

In [None]:
df.bfill(axis=1)

In [194]:
pd.read_csv('data/brain_size.csv', delimiter=';').groupby('Gender')['MRI_Count'].mean()

Gender
Female    862654
Male      954855
Name: MRI_Count, dtype: int64
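
The same `groupby` pattern is what makes long-format data convenient when groups have different sample sizes. The sketch below uses a small, entirely hypothetical table (the column names `group` and `score` and all values are invented for illustration) rather than the `brain_size.csv` file.

```python
import pandas as pd

# hypothetical long-format data: one row per observation,
# with groups of different sizes (3 x 'a', 2 x 'b')
long = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b'],
                     'score': [1.0, 2.0, 3.0, 10.0, 20.0]})

means = long.groupby('group')['score'].mean()
counts = long.groupby('group')['score'].count()
```

Each group is summarized on its own, so nothing has to be padded to make the groups equally long.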