# Indexing

A big chunk of pandas' complexity (both in the library and for the user) revolves around indexing.
It's a complex task, since we want to support so many use cases:

- Like lists, you can index by location.
- Like dictionaries, you can index by label.
- Like NumPy arrays, you can index by boolean masks.
- Any of these indexers could be scalar indexes, or they could be arrays, or they could be slices.
- Any of these should work on the index (row labels) or columns of a DataFrame.
- And any of these should work on Hierarchical indexes.
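
The first few cases in the list above can be sketched on a tiny, made-up `Series` (the values and labels here are hypothetical, not taken from the beer data):

```python
import pandas as pd

# Hypothetical toy data, not the beer dataset
s = pd.Series([4.5, 5.0, 6.2], index=['a', 'b', 'c'])

by_position = s.iloc[0]      # like a list: the first element
by_label = s.loc['b']        # like a dict: the value stored under 'b'
by_mask = s[s > 4.8]         # like a NumPy array: keep rows where the mask is True
by_slice = s.loc['a':'b']    # label slices include both endpoints
```

Note that the label slice returns both `'a'` and `'b'`, unlike positional slices, which exclude the stop position.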

The data we'll work with is a subset of the data from beeradvocate.com, via [Stanford](https://snap.stanford.edu/data/web-RateBeer.html). The raw data is strangely formatted.

```
beer/name: Sausa Weizen
beer/beerId: 47986
beer/brewerId: 10325
beer/ABV: 5.00
beer/style: Hefeweizen
review/appearance: 2.5
review/aroma: 2
review/palate: 1.5
review/taste: 1.5
review/overall: 1.5
review/time: 1234817823
review/profileName: stcules
review/text: A lot of foam. But a lot.	In the smell some banana, and then lactic and tart. Not a good start.	Quite dark orange in color, with a lively carbonation (now visible, under the foam).	Again tending to lactic sourness.	Same for the taste. With some yeast and banana.		

beer/name: Red Moon
beer/beerId: 48213
beer/brewerId: 10325
beer/ABV: 6.20
 ...
```

The dataset was a bit large to process all at once:

```bash
$ wc -l beeradvocate.txt
 22212596 beeradvocate.txt
```

So I parsed it in chunks.

```python
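import os

import numpy as np
import pandas as pd
# partitionby and partition_all are assumed to come from toolz
from toolz import partition_all, partitionby


def format_review(review):
    # Each review is a group of "key: value" lines; split on the first ": " only
    return dict(map(lambda x: x.strip().split(": ", 1), review))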

def as_dataframe(reviews):
    col_names = {
        'beer/ABV': 'abv',
        'beer/beerId': 'beer_id',
        'beer/brewerId': 'brewer_id',
        'beer/name': 'beer_name',
        'beer/style': 'beer_style',
        'review/appearance': 'review_appearance',
        'review/aroma': 'review_aroma',
        'review/overall': 'review_overall',
        'review/palate': 'review_palate',
        'review/profileName': 'profile_name',
        'review/taste': 'review_taste',
        'review/text': 'text',
        'review/time': 'time'
    }
    df = pd.DataFrame(list(reviews))
    numeric = ['abv', 'review_appearance', 'review_aroma',
               'review_overall', 'review_palate', 'review_taste']
    df = (df.rename(columns=col_names)
            .replace('', np.nan))
    df[numeric] = df[numeric].astype(float)
    df['time'] = pd.to_datetime(df.time.astype(int), unit='s')
    return df

def main():
    with open('beeradvocate.txt') as f:
        reviews = filter(lambda x: x != ('\n',),
                         partitionby(lambda x: x == '\n', f))
        reviews = map(format_review, reviews)
        reviews = partition_all(100000, reviews)
        os.makedirs('beer_reviews', exist_ok=True)

        for i, subset in enumerate(reviews):
            print(i, end='\r')
            df = as_dataframe(subset)
            df.to_csv('beer_reviews/review_%s.csv' % i, index=False)
```
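
To see what the grouping step does, here is a minimal sketch on an in-memory sample. It swaps in the standard library's `itertools.groupby` for `partitionby`, and the two-record sample and the `parse_record` helper are hypothetical stand-ins for the real file and `format_review`:

```python
from itertools import groupby

# Hypothetical two-record sample in the same "key: value" layout
lines = [
    "beer/name: Sausa Weizen\n",
    "beer/ABV: 5.00\n",
    "\n",
    "beer/name: Red Moon\n",
    "beer/ABV: 6.20\n",
]

def parse_record(record_lines):
    # Split each line on the first ": " only, so values may contain colons
    return dict(line.strip().split(": ", 1) for line in record_lines)

# groupby yields runs of consecutive lines that share the same key;
# runs where the key is True are the blank separator lines, which we skip
records = [
    parse_record(group)
    for is_blank, group in groupby(lines, key=lambda line: line == "\n")
    if not is_blank
]
```

Each element of `records` is a plain dictionary of strings, which is the shape `pd.DataFrame` accepts as a list of rows.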

# Aside: dask

To select the subset we'll work with, about a 10th of the reviews, I used [`dask`](http://dask.readthedocs.org).
All of those files wouldn't fit in memory at once. But we can compute quantiles in chunks and aggregate those together. 

```python

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv('beer_reviews/*.csv', parse_dates=['time'])

In [3]: cutoffs = df.time.quantiles([.5, .6])

In [4]: %time cutoffs = cutoffs.compute()
CPU times: user 20.7 s, sys: 8.37 s, total: 29.1 s
Wall time: 28.2 s

In [5]: %time subset = df[(df.time >= cutoffs[0]) & (df.time <= cutoffs[1])].compute()
CPU times: user 20.9 s, sys: 7.68 s, total: 28.6 s
Wall time: 27.5 s

In [6]: subset.to_csv('../notebooks/data/beer_subset.csv', index=False)
```

Just writing `cutoffs = df.time.quantiles([.5, .6])` doesn't actually do the computation; instead, it builds a dask graph of what it needs to do when asked for the result.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10

In [None]:
df = pd.read_csv('data/beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = ['review_appearance', 'review_aroma', 'review_overall',
               'review_palate', 'review_taste']
df.head()

## Boolean indexing

Like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.

In [None]:
df.abv < 5

In [None]:
df[df.abv < 5].head()

Notice that we just used `[]` there. We can pass the boolean indexer in to `.loc` as well.

In [None]:
df.loc[df.abv < 5, ['beer_style', 'review_overall']].head()

Again, you can combine masks into more complicated expressions:

In [None]:
df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (df.review_overall >= 4.5)]

Be careful with the order of operations. In Python, `&` and `|` have higher precedence than comparison operators such as `>`, `==`, and `<`, so wrap each comparison in parentheses.

In [None]:
2 > 1 & 0

In [None]:
(2 > 1) & 0
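
The same rule applies when combining masks built from a `Series`. A small sketch, using a hypothetical `abv` Series rather than the real column:

```python
import pandas as pd

abv = pd.Series([4.2, 5.5, 7.0])   # hypothetical values

# & binds tighter than the comparison, so this parses as 2 > (1 & 0)
unparenthesized = 2 > 1 & 0        # 1 & 0 is 0, and 2 > 0
parenthesized = (2 > 1) & 0        # True & 0

# With Series, wrap every comparison in parentheses before combining
mask = (abv > 4.5) & (abv < 6.0)
```

Without the parentheses, pandas would try to evaluate `4.5 & abv` first, which is not what the expression is meant to say.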

<div class="alert alert-success">
    <b>Exercise</b>: Find the IPAs
</div>

Select just the rows where the `beer_style` contains `'IPA'`. 

Hint: `Series` containing strings have a bunch of [useful methods](http://pandas.pydata.org/pandas-docs/stable/text.html#method-summary) under the `DataFrame.<column>.str` namespace. Typically they correspond to regular Python string methods, but:

- They gracefully propagate missing values
- They're a bit more liberal about accepting regular expressions

We can't use `'IPA' in df['beer_style']`, since `in` on a `Series` checks membership in its index, not in the strings it holds. But `in` uses `__contains__`, so look for a string method like that.
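
A quick way to convince yourself of how `in` behaves, using a hypothetical two-element Series with the default integer index:

```python
import pandas as pd

styles = pd.Series(['American IPA', 'Pilsner'])   # hypothetical values

label_found = 0 in styles                          # 0 is an index label
value_found = 'Pilsner' in styles                  # a value, but not an index label
value_in_list = 'Pilsner' in styles.tolist()       # exact match against the values
substring_in_list = 'IPA' in styles.tolist()       # no element equals 'IPA' exactly
```

Even against the values, `in` only tests for equality with a whole element, so it can't answer "does this string contain `'IPA'`?".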

In [None]:
df.beer_style.str.  # <tab>

In [None]:
# Your solution
is_ipa = ...


In [None]:
%load -r 1:4 solutions/solutions_indexing.py

This is quite powerful. Any method that returns a boolean array is potentially an indexer.
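
For instance, `notna` and `between` both return boolean Series, so they can be used as indexers directly. This is a sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
beers = pd.DataFrame({
    'beer_name': ['A', 'B', 'C', 'D'],
    'abv': [4.2, np.nan, 6.5, 9.0],
})

has_abv = beers[beers.abv.notna()]             # drop rows with a missing abv
mid_strength = beers[beers.abv.between(5, 7)]  # both bounds are included by default
```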

<div class="alert alert-success">
    <b>Exercise</b>: Find a subset of beer styles
</div>

Find the rows where the beer style is either `'American IPA'` or `'Pilsner'`.
Remember that we use `&` and `|` to do elementwise logical `and`s and `or`s.

In [None]:
# Your code here

In [None]:
%load -r 5:7 solutions/solutions_indexing.py

# isin

Useful for seeing if a value is contained in a collection. We can rewrite the last exercise as:

In [None]:
df[df.beer_style.isin(['American IPA', 'Pilsner'])].head()

Let's collect the brewers and beers with the most reviews...

In [None]:
brewer = df[['brewer_id', 'beer_id']]
brewer.head()

In [None]:
brewer_ids = df.brewer_id.value_counts().index[:10]
beer_ids = df.beer_id.value_counts().index[:10]
brewer_ids

And filter to the reviews that were for the top items.

`DataFrame.isin()` can take a dictionary.

In [None]:
to_find = {
    'brewer_id': brewer_ids,
    'beer_id': beer_ids
}
brewer.isin(to_find)

Two dimensional boolean indexing is a bit different:

In [None]:
brewer[brewer.isin(to_find)]

What happened?

The result of `DataFrame.isin` is always the same shape as the input.
Use `.any` or `.all` if you intend to index with the result.
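
The difference is easier to see on a small frame. The sketch below uses a hypothetical three-row `toy` DataFrame with the same column names:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'brewer_id': [1, 2, 3], 'beer_id': [10, 20, 30]})
mask = toy.isin({'brewer_id': [1, 2], 'beer_id': [10, 30]})

masked = toy[mask]                       # same shape; cells where mask is False become NaN
both = toy[mask.all(axis='columns')]     # rows where every column matched
either = toy[mask.any(axis='columns')]   # rows where at least one column matched
```

Indexing with the 2-dimensional mask keeps every row and blanks out individual cells, while reducing it to one boolean per row filters rows.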

For example, to get reviews where *both* the `brewer_id` and the `beer_id` are in the top 10:

In [None]:
brewer[brewer.isin(to_find).all(axis='columns')].head()

In [None]:
brewer[brewer.isin(to_find).any(axis='columns')].head()

<div class="alert alert-success">
    <b>Exercise</b>: Find the rows where all of the reviews are at least 4
</div>

Select the rows where the scores of the 5 `review_cols` ('review_appearance', 'review_aroma', 'review_overall', 'review_palate', 'review_taste') are *all* at least 4.0.

Hint: Like NumPy arrays, DataFrames have `any` and `all` methods that check whether they contain `any` or `all` True values. These methods also take an `axis` argument for the dimension to remove.

- `0` or `'index'` removes (or aggregates over) the vertical dimension
- `1` or `'columns'` removes (aggregates over) the horizontal dimension.
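
A minimal sketch of the two `axis` choices, on a hypothetical frame of booleans:

```python
import pandas as pd

# Hypothetical toy data
flags = pd.DataFrame({'x': [True, True, True], 'y': [True, False, True]})

per_column = flags.all(axis='index')    # collapses the rows: one value per column
per_row = flags.all(axis='columns')     # collapses the columns: one value per row
any_per_row = flags.any(axis=1)         # 1 is the same as 'columns'
```

Only the per-row results have the same length as the frame, so those are the ones that can be used as a row indexer.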

In [None]:
review_cols

In [None]:
# your code goes here

In [None]:
%load -r 9:20 solutions/solutions_indexing.py

<div class="alert alert-success">
    <b>Exercise</b>: Find the rows where the average of the reviews is at least 4
</div>

Select rows where the average of the 5 `review_cols` ('review_appearance', 'review_aroma', 'review_overall', 'review_palate', 'review_taste') is at least 4.

In [None]:
# Your code here

In [None]:
%load -r 20:22 solutions/solutions_indexing.py

# Hierarchical Indexing

Hierarchical indexing is one of the most powerful and most complicated features of pandas.
It lets you represent high-dimensional datasets in a table.

In [None]:
reviews = df.set_index(['profile_name', 'beer_id', 'time'])
reviews.head()

You'll almost always want to sort your MultiIndex.

In [None]:
reviews = reviews.sort_index()
reviews.head()
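
One reason sorting matters is that label slicing on a `MultiIndex` generally expects the index to be sorted. The sketch below checks sortedness before and after `sort_index` on a hypothetical two-level frame:

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy = pd.DataFrame({
    'profile_name': ['b', 'a', 'b', 'a'],
    'beer_id': [2, 1, 1, 2],
    'score': [4.0, 3.5, 5.0, 2.0],
}).set_index(['profile_name', 'beer_id'])

sorted_before = toy.index.is_monotonic_increasing
toy = toy.sort_index()
sorted_after = toy.index.is_monotonic_increasing

# Label slicing across the outer level, done on the sorted index
subset = toy.loc['a':'b']
```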

### Top Reviewers

Let's select all the reviews by the top reviewers, by label.

In [None]:
top_reviewers = df['profile_name'].value_counts().head(5).index
top_reviewers

In [None]:
reviews.loc[top_reviewers, :, :].head()

The syntax is a bit trickier when you want to specify a row Indexer *and* a column Indexer.

In [None]:
# SyntaxError: a bare `:` isn't allowed inside a tuple
reviews.loc[(top_reviewers, 99, :), ['beer_name', 'brewer_id']]

In [None]:
reviews.loc[pd.IndexSlice[top_reviewers, 99, :], ['beer_name', 'brewer_id']]
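
`pd.IndexSlice` exists so that slice syntax can be written inside the row indexer. A minimal sketch on a hypothetical two-level frame, with the equivalent spelling using explicit `slice` objects:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'profile_name': ['a', 'a', 'b', 'b'],
    'beer_id': [1, 2, 1, 2],
    'beer_name': ['Red', 'Gold', 'Red', 'Gold'],
    'abv': [5.0, 6.2, 5.0, 6.2],
}).set_index(['profile_name', 'beer_id']).sort_index()

idx = pd.IndexSlice
# Rows: every profile_name, beer_id equal to 2. Columns: just beer_name.
picked = toy.loc[idx[:, 2], ['beer_name']]

# The same selection spelled with explicit slice objects
same = toy.loc[(slice(None), 2), ['beer_name']]
```

`idx[:, 2]` simply builds the tuple `(slice(None), 2)`, which is why the two selections match.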

<div class="alert alert-success">
    <b>Exercise</b>: Select the most-reviewed beers
</div>

Use `.loc` to select the `beer_name` and `beer_style` for the 10 most popular beers, as measured by number of reviews.

In [None]:
# Your solution goes here

In [None]:
%load -r 24:27 solutions/solutions_indexing.py

# Pitfalls


## Chained indexing

In [None]:
df.loc[df.beer_style.str.contains('IPA')]['beer_name'] = 'yummy'
df.loc[df.beer_style.str.contains('IPA')]['beer_name']

Anytime you see back-to-back square brackets, `][`, you're asking for trouble: the first indexing operation may return a copy, so the assignment can be silently lost.

In [None]:
good = df.copy()

In [None]:
good.loc[df.beer_style.str.contains('IPA'), 'beer_name']

In [None]:
good.loc[good.beer_style.str.contains('IPA'), 'beer_name'] = 'yummy'
good.loc[good.beer_style.str.contains('IPA'), 'beer_name']
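
The two patterns can be compared side by side on a small frame. This is a sketch with hypothetical data; the warning filter is only there to keep the output quiet, since pandas warns about the chained form:

```python
import warnings
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'beer_style': ['American IPA', 'Pilsner'],
    'beer_name': ['Hop One', 'Crisp'],
})
is_ipa = toy.beer_style.str.contains('IPA')

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # pandas warns about this pattern
    # Chained: toy.loc[is_ipa] builds a new object, and the assignment lands there
    toy.loc[is_ipa]['beer_name'] = 'yummy'
after_chained = toy.beer_name.tolist()

# Single .loc call: row and column selection happen together on toy itself
toy.loc[is_ipa, 'beer_name'] = 'yummy'
after_single = toy.beer_name.tolist()
```

After the chained assignment `toy` is unchanged, while the single `.loc` call updates it in place.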

# Recap

- Boolean masks should always be 1-dimensional and the same length as the object being indexed
- Sort your `MultiIndex`es
- Use `isin` with `.any()` or `.all()` for comparing to collections