# Pandas Reshaping

DataFrames come in all different shapes and sizes,
but when we work with several of them, we usually want
them in the same shape so that they line up.

This module is about reshaping pandas dataframes.
There are lots of different methods, some of which
are:

1. pivot
2. melt
3. stack
4. unstack

There are also functions for combining different 
DataFrames, like

1. concat
2. merge
3. join

The methods listed above are only a small fraction
of what is available in the Pandas API, and we might
not be able to cover everything we listed.

For documentation on all of the features provided by 
Pandas, check out:

* https://pandas.pydata.org/pandas-docs/stable/reshaping.html
* https://pandas.pydata.org/pandas-docs/stable/merging.html

### Disclaimer:
I'm a noob at this stuff. I just read docs and call functions
until the things get in the right shape.

In [None]:
import pandas as pd

## Pivoting
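Given two "categorical" columns and a "values" column,
``df.pivot`` generates a pivot table: one categorical column
becomes the row index, the other becomes the column labels,
and the values fill in the cells.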
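
Here is a tiny sketch of the same idea on made-up data, in case the real file isn't handy. The `toy` dataframe and its numbers are invented for illustration; only the column names are borrowed from the data used below.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one values column.
toy = pd.DataFrame({
    "year_id": [1990, 1990, 1995, 1995],
    "age_group_id": [2, 3, 2, 3],
    "mort_rate": [0.10, 0.20, 0.30, 0.40],
})

# One row per year_id, one column per age_group_id.
toy_wide = toy.pivot(index="year_id", columns="age_group_id", values="mort_rate")
print(toy_wide)
```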

In [None]:
data = pd.read_csv("../data.csv")
query = (
    "location_id == 6 "
    "and sex_id == 2")

# Two categorical columns, one values column.
small_data = data.query(query)[
    ["age_group_id", "year_id", "mort_rate"]]
small_data.head()

In [None]:
small_data_pivot = small_data.pivot(
    index="year_id",
    columns="age_group_id",
    values="mort_rate")
small_data_pivot.head()

In [None]:
small_data_pivot.reset_index().head()

In [None]:
small_data_pivot[5][1995]

In [None]:
small_data_pivot[[2,3]].head()

In [None]:
small_data_pivot[2:5].head()  # this doesn't do what we expect! It slices rows by position.

In [None]:
# We need to use pd.IndexSlice to properly slice the column index by label.
small_data_pivot.loc[pd.IndexSlice[:, 2:5]].head()
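
To make the difference concrete, here is a small sketch on a made-up wide table (the `wide` frame and its values are hypothetical). A bare slice in `[]` picks rows by position, while `.loc[:, 2:5]` picks columns by label and includes the end label.

```python
import pandas as pd

# Hypothetical wide table: rows are years, columns are age group ids 2..6.
wide = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4, 0.5]] * 6,
    index=pd.Index(range(1990, 1996), name="year_id"),
    columns=pd.Index([2, 3, 4, 5, 6], name="age_group_id"),
)

rows_by_position = wide[2:5]      # rows at positions 2, 3, 4 (end excluded)
cols_by_label = wide.loc[:, 2:5]  # columns labelled 2 through 5 (end included)

print(list(rows_by_position.index))
print(list(cols_by_label.columns))
```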

## Melting, the opposite of pivoting, kind of.

In [None]:
# start from a dataframe without a hierarchical index on the columns
flat_data = small_data_pivot.reset_index()
flat_data.head()

In [None]:
long_data = flat_data.melt(id_vars="year_id", value_name="mort_rate")
long_data.head()
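
The same melt on a made-up table, as a rough sketch. The `flat` frame is hypothetical, and `var_name` is spelled out here because this toy table has no name on its column index to fall back on.

```python
import pandas as pd

# Hypothetical wide table with year_id as a regular column.
flat = pd.DataFrame({
    "year_id": [1990, 1995],
    2: [0.10, 0.30],
    3: [0.20, 0.40],
})

# Every non-id column becomes a (variable, value) pair of its own row.
long_form = flat.melt(
    id_vars="year_id", var_name="age_group_id", value_name="mort_rate")
print(long_form)
```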

In [None]:
small_data.head()

In [None]:
small_data - long_data 
# this doesn't work because the index for long_data got reset
# by pretty much everything we did, and pandas does operations
# by aligning on the index.

In [None]:
# Just showing that the rows didn't line up: compare these lengths
# with the number of rows in the result of the subtraction above.
print(len(small_data))
print(len(long_data))
print(len(small_data) + len(long_data))
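
A minimal sketch of what index alignment does to arithmetic, using two made-up frames (`left` and `right` are hypothetical names). When the labels don't match, the result gets one row per label from either side and the values are all NaN.

```python
import pandas as pd

left = pd.DataFrame({"val": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
right = pd.DataFrame({"val": [1.0, 2.0, 3.0]})  # default index 0, 1, 2

# No labels in common, so the result has one row per label in the
# union of the two indexes, and every value is NaN.
misaligned = left - right
print(len(misaligned))

# Resetting both indexes lines the rows up by position instead.
aligned = left.reset_index(drop=True) - right.reset_index(drop=True)
print(aligned)
```

Resetting the index only gives a meaningful answer if both frames are sorted the same way first, so treat it as a sketch rather than a general fix.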

## Stacking and Unstacking

In [None]:
indexed_data = data.set_index(["location_id", "age_group_id", "sex_id", "year_id"])[["mort_rate"]]
indexed_data.head()

In [None]:
# unstack moves the innermost (fastest-changing) row index level,
# here year_id, to the innermost column index level.
indexed_data.unstack().head()

In [None]:
# stack moves the innermost (fastest-changing) column index level
# back to the innermost row index level.
indexed_data.unstack().stack().head()

In [None]:
# If the column index isn't hierarchical, then stack converts the
# dataframe into one long Series.
print(type(indexed_data.unstack()["mort_rate"].stack()))
print(indexed_data.unstack()["mort_rate"].stack().head())

In [None]:
# We can convert this series back into a dataframe with the same index.
indexed_data.unstack()["mort_rate"].stack().to_frame(name="mort_rate").head()

In general, stack and unstack don't work well when you don't have multi-indexes
on the columns or rows (respectively).
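
A small round trip on made-up data, as a sketch of the two moves. The `toy` frame is hypothetical and has only two index levels to keep the output readable.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex_id": [1, 1, 2, 2],
    "year_id": [1990, 1995, 1990, 1995],
    "mort_rate": [0.1, 0.2, 0.3, 0.4],
}).set_index(["sex_id", "year_id"])

wide = toy.unstack()       # year_id moves from the rows to the columns
round_trip = wide.stack()  # and back to the rows again

print(wide.columns.tolist())
print(round_trip)
```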

# Combining DataFrames
Merging, concatenating, and joining.

## Concat
This is used for adding a bunch of new rows to a dataframe.

In [None]:
import numpy as np

a_few_rows = pd.DataFrame(dict(
    x=[1,2,3,1,2,3],
    y=[1,1,1,2,2,2],
    val=np.random.rand(6)
    ))
a_few_rows

In [None]:
some_more_rows = pd.DataFrame(dict(
    x=[1,2,3,1,2,3],
    y=[3,3,3,4,4,4],
    val=np.random.rand(6)
    ))
some_more_rows

In [None]:
long_data_set = pd.concat([a_few_rows, some_more_rows])
long_data_set
# Notice how weird the index is: the labels 0-5 show up twice!

In [None]:
yet_another_set_of_rows = pd.DataFrame(dict(
    x=[1,2,3,1,2,3],
    y=[12,12,12,-12,-12,-12],
    val=100
    ))
yet_another_set_of_rows

In [None]:
long_data_set + yet_another_set_of_rows  # weird arithmetic because the index is weird.
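
One way to avoid the repeated labels is `ignore_index=True`, which gives the result a fresh index. A minimal sketch on two made-up frames (`first` and `second` are hypothetical):

```python
import pandas as pd

first = pd.DataFrame({"x": [1, 2, 3], "val": [0.1, 0.2, 0.3]})
second = pd.DataFrame({"x": [1, 2, 3], "val": [0.4, 0.5, 0.6]})

stacked = pd.concat([first, second])
print(stacked.index.tolist())     # each frame keeps its own labels

renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # labels are renumbered from 0
```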

When concatenating many dataframes in a loop, there are two ways to do it:
```
result = pd.DataFrame()
for path in filenames:
    df = pd.read_csv(path)
    result = pd.concat([result, df])
```
and
```
dataframes = []
for path in filenames:
    df = pd.read_csv(path)
    dataframes.append(df)
result = pd.concat(dataframes)
```

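The first way is super bad: each `pd.concat` call copies every
row accumulated so far, so the total work grows quadratically
with the number of files. The second way copies each row once.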
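
To put a rough number on it, here is a sketch that uses in-memory frames as hypothetical stand-ins for the files and counts how many rows the growing version has to copy. It starts from the first frame rather than an empty `pd.DataFrame()`, which is a small departure from the loop above.

```python
import pandas as pd

# Hypothetical stand-ins for the files: ten small in-memory frames.
frames = [pd.DataFrame({"part": [i] * 100, "val": range(100)}) for i in range(10)]

# Growing the result inside the loop: each concat copies everything so far.
grown = frames[0]
rows_copied = 0
for df in frames[1:]:
    grown = pd.concat([grown, df])
    rows_copied += len(grown)

# Collecting first and concatenating once copies each row a single time.
collected = pd.concat(frames)

print(rows_copied, len(collected))
```

With ten frames of 100 rows, the loop copies 5400 rows to build a 1000-row result, and the gap widens as the number of frames grows.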

## Merging

In [None]:
x_names = pd.DataFrame(dict(
    x=[1,2,3],
    names=["Kendrick", "Ken", "K-Dot"]
    ))
x_names

In [None]:
long_data_set.merge(x_names)

# WOW THAT WAS EASY

Other things:
* left vs right vs inner vs outer merges (the `how` argument; the default is inner)
* duplicate column names that aren't used for the merge get weird suffixes (`_x` and `_y` by default).
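
A quick sketch of both points on made-up frames (`left` and `right` are hypothetical). Both frames have a `val` column that isn't part of the merge key, so it picks up the default suffixes.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3], "val": [10, 20, 30]})
right = pd.DataFrame({"x": [2, 3, 4], "val": [200, 300, 400]})

inner = left.merge(right, on="x")                  # keys in both frames
outer = left.merge(right, on="x", how="outer")     # keys in either frame
left_only = left.merge(right, on="x", how="left")  # every key from left

print(inner.columns.tolist())
print(left_only)
```

Rows with no match on the other side are filled with NaN, which is why `val_y` is missing for `x == 1` in the left merge.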

In [None]:
help(pd.merge)  # or check out the merge api with a shift-tab inside `pd.merge(`.

## Joining
This is like merging, but doesn't work.

That's a joke. Join is used to join on the index, as opposed to values in a column.

In [None]:
data.head()

In [None]:
index = ["location_id", "age_group_id", "sex_id", "year_id"]
mort_rates = data.set_index(index)[["mort_rate"]]
pops = data.set_index(index)[["population"]]

In [None]:
mort_rates.head()

In [None]:
pops.head()

In [None]:
pops.join(mort_rates).head()
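
A small sketch showing that `join` matches on index labels rather than row position. The frames and numbers are made up; `pops_toy` deliberately stores the same keys in reverse order.

```python
import pandas as pd

keys = pd.MultiIndex.from_tuples(
    [(6, 1990), (6, 1995), (7, 1990)], names=["location_id", "year_id"])

rates = pd.DataFrame({"mort_rate": [0.1, 0.2, 0.3]}, index=keys)
# Same keys in reverse order: join matches on labels, not position.
pops_toy = pd.DataFrame({"population": [300.0, 100.0, 200.0]}, index=keys[::-1])

joined = rates.join(pops_toy)
print(joined)
```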

# SECRETS:
data prep for the exercises.

In [None]:
mort_pop_data = pd.read_csv("../data.csv")
mort_pop_data.head()

In [None]:
mort_only = mort_pop_data[["location_id", "age_group_id", "sex_id", "year_id", "mort_rate"]]
pop_only = mort_pop_data[["location_id", "age_group_id", "sex_id", "year_id", "population"]]

In [None]:
mort_wide = mort_only.set_index(
        ["location_id", "age_group_id", "sex_id", "year_id"]
    ).unstack()["mort_rate"]
logged_mort_wide = np.log(mort_wide).reset_index()
logged_mort_wide.head()

In [None]:
pop_wide = (
    pop_only
    .set_index(["location_id", "sex_id", "year_id", "age_group_id"])
    .unstack()["population"]
    .sort_index()
)
pop_wide.head()

In [None]:
logged_mort_wide.to_csv("mort.csv")
pop_wide.to_hdf("pop.hdf", key="data")
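
Since `to_csv` writes the index by default, a file saved this way comes back with an extra `Unnamed: 0` column when it is read in again. A sketch of that round trip, using an in-memory buffer and a made-up frame instead of the real file:

```python
import io
import pandas as pd

frame = pd.DataFrame({"location_id": [6, 7], "1990": [0.1, 0.2]})

# Default: the unnamed index is written as the first CSV column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves it out.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```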

In [None]:
import numpy as np

def deaths_and_death_rates_and_pops():
    """
    Returns number of deaths for all locations, ages, sexes, and years,
    along with mort_rate and population.
    
    number of deaths = mortality_rate * population
    
    The two files are provided, but they have different formats, and
    mortality is in log rate space. Use jupyter to explore the data 
    within each file and how you can reshape the files and compute 
    number of deaths.
    
    Make sure you convert the log mortality rates into mortality rates!
    
    Return:
        pd.DataFrame: a dataframe with columns location_id, sex_id,
            year_id, age_group_id, mort_rate, population, and num_deaths;
            and a simple index (just 0 to N).
    """
    log_mort_file = "mort.csv"
    pop_file = "pop.hdf"

    # Reshaping pops
    pop_wide = pd.read_hdf(pop_file)
    pop_long = pd.DataFrame(pop_wide.stack(), columns=["population"]).reset_index()

    # Reshaping log_mort
    mort_wide = pd.read_csv(log_mort_file).drop("Unnamed: 0", axis=1)
    mort_long = mort_wide.melt(
        id_vars=["location_id", "age_group_id", "sex_id"], 
        value_name="log_mort_rate",
        var_name="year_id",
        )
    mort_long["year_id"] = mort_long["year_id"].astype("int")
    
    # Converting from log to linear space.
    mort_long["mort_rate"] = np.exp(mort_long["log_mort_rate"])
    mort_long.drop("log_mort_rate", axis=1, inplace=True)
    
    data = mort_long.merge(pop_long)
    data["num_deaths"] = data["mort_rate"] * data["population"]
    return data

In [None]:
def test_deaths_and_death_rates_and_pops():
    res = deaths_and_death_rates_and_pops()
    assert set(res.columns) == set([
        "location_id", "sex_id", "year_id", "age_group_id", 
        "mort_rate", "population", "num_deaths"]), "Missing columns."
    
    assert len(res) == 51480, len(res)
    
    one_set_of_vals = res.query(
            "location_id == 34 "
            "and age_group_id == 19 "
            "and sex_id == 2 "
            "and year_id == 2009"
        )[["mort_rate", "population", "num_deaths"]].values
    
    expected_vals = [0.045623, 112054.0, 5112.225765]
    
    assert np.isclose(one_set_of_vals, expected_vals).all()

test_deaths_and_death_rates_and_pops()