Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

### Melting DataFrames

Melting is the transformation required when **column names** represent the values of a variable.  Examples from class include the year columns in the tuberculosis dataset, the month columns in the unemployment dataset, and the treatment columns in the clinical trials dataset.  Consider:

In [None]:
import pandas as pd

clinic_columns = ['First', 'Last', 'TreatmentA', 'TreatmentB']
clinic_data = [
    ['John', 'Smith', -1, 2],
    ['Jane', 'Doe', 16, 11],
    ['Mary', 'Johnson', 3, 1]
]
trials = pd.DataFrame(clinic_data, columns=clinic_columns)
trials

Here, we see the values in both the `TreatmentA` and `TreatmentB` columns represent the same "kind" of thing, the response of some metric to a particular treatment.  If we added a third treatment and incorporated it into the dataset, we would add another **column**, but would not be measuring anything new.  So we would be adding a column without adding an observational variable.

An observation is uniquely determined, then, by the triple of `First`, `Last`, and `Treatment`, our independent variables, and the only remaining dependent variable is the `Response`.

So we need to transform the column names `TreatmentA` and `TreatmentB` into values of a `Treatment` column, and the corresponding values in the existing columns are used to populate the `Response` column.

At a minimum, a melt operation has to partition the existing column names into the columns to be retained in the new data frame, and the columns that are values for a new variable.

In [None]:
trials2 = trials.melt(id_vars=['First', 'Last'])
trials2

By default, the new column that holds the previous column headers as values is given the generic name **variable**, and the values from under those headers are gathered into a column named **value**.  We can optionally supply better names for one or both of these new columns:

In [None]:
trials2 = trials.melt(id_vars=['First', 'Last'], 
                      value_name='Response', var_name='Treatment')
trials2
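
The partition of columns described earlier can also be spelled out in full. The following is a minimal sketch, assuming the same clinical trials data as above (rebuilt here so the block runs on its own). `value_vars` is the `melt` parameter that names the columns whose headers become values; when it is omitted, pandas melts every column not listed in `id_vars`. The names `explicit` and `implicit` are illustrative only.

```python
import pandas as pd

trials = pd.DataFrame(
    [['John', 'Smith', -1, 2],
     ['Jane', 'Doe', 16, 11],
     ['Mary', 'Johnson', 3, 1]],
    columns=['First', 'Last', 'TreatmentA', 'TreatmentB'])

# Name both halves of the partition explicitly
explicit = trials.melt(id_vars=['First', 'Last'],
                       value_vars=['TreatmentA', 'TreatmentB'],
                       var_name='Treatment', value_name='Response')

# Omitting value_vars melts every non-id column, so the result is the same
implicit = trials.melt(id_vars=['First', 'Last'],
                       var_name='Treatment', value_name='Response')

print(explicit.equals(implicit))
print(explicit.shape)  # 3 people x 2 treatments gives 6 rows
```

Listing `value_vars` is mostly useful when some columns should be dropped rather than kept as identifiers or melted.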

In [None]:
# Keep only the final character of each label: 'TreatmentA' -> 'A'
getlast = lambda s: s[-1]
treatment2 = trials2['Treatment'].apply(getlast)
trials2['Treatment'] = treatment2
trials2

The cell above maps `TreatmentA` to `A` and `TreatmentB` to `B`.  To complete this data curation, we would also map the missing observation (presumably the `-1` entered for John Smith under `TreatmentA`) to `np.nan`.  We may then want to drop the row with the NaN, which is easy to do with curated data.
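
The remaining curation steps might look like the following. This is a sketch only: it assumes that `-1` is the sentinel for the missing observation, which the data suggests but the notebook does not state, and it rebuilds the melted frame so the block is self-contained. The names `tidy` and `complete` are illustrative.

```python
import numpy as np
import pandas as pd

trials = pd.DataFrame(
    [['John', 'Smith', -1, 2],
     ['Jane', 'Doe', 16, 11],
     ['Mary', 'Johnson', 3, 1]],
    columns=['First', 'Last', 'TreatmentA', 'TreatmentB'])

tidy = trials.melt(id_vars=['First', 'Last'],
                   var_name='Treatment', value_name='Response')

# Map 'TreatmentA' -> 'A' and 'TreatmentB' -> 'B'
tidy['Treatment'] = tidy['Treatment'].str.replace('Treatment', '', regex=False)

# ASSUMPTION: -1 marks a missing response; replace it with NaN
tidy['Response'] = tidy['Response'].replace(-1, np.nan)

# Drop the observation that has no response
complete = tidy.dropna(subset=['Response'])
print(complete)
```

Introducing `NaN` converts the `Response` column from integer to floating point, since NumPy integers cannot represent missing values.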

In [None]:
tbcasescolumns = ["country", "year", "cases"]
tbcasesdata = [ ["Afghanistan",  1999,    745],
                ["Afghanistan",  2000,   2666],
                [     "Brazil",  1999,  37737],
                [     "Brazil",  2000,  80488],
                [      "China",  1999, 212258],
                [      "China",  2000, 213766] ]
tbcases = pd.DataFrame(tbcasesdata, columns=tbcasescolumns)
tbcases

In [None]:
tbcases.pivot(index='country', columns='year', values='cases')

In [None]:
table1columns = ["country",  "year",       "type",     "count"]
table1data =[ ["Afghanistan",  1999,      "cases",       745],
              ["Afghanistan",  1999, "population",  19987071],
              ["Afghanistan",  2000,      "cases",      2666],
              ["Afghanistan",  2000, "population",  20595360],
              [     "Brazil",  1999,      "cases",     37737],
              [     "Brazil",  1999, "population", 172006362],
              [     "Brazil",  2000,      "cases",     80488],
              [     "Brazil",  2000, "population", 174504898],
              [      "China",  1999,      "cases",    212258],
              [      "China",  1999, "population",1272915272],
              [      "China",  2000,      "cases",    213766],
              [      "China",  2000, "population",1280428583] ]
table1 = pd.DataFrame(table1data, columns=table1columns)

In [None]:
table1

In [None]:
table1.pivot_table(index=['country', 'year'], values='count', columns='type')
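
The cell above uses `pivot_table` where the earlier example used `pivot`. One difference worth keeping in mind: `pivot` raises an error when an index/column pair occurs more than once, whereas `pivot_table` aggregates the duplicates (by mean unless `aggfunc` says otherwise). A small sketch with a made-up frame, `dup`, that deliberately repeats one pair; the numbers are hypothetical and not from the TB data:

```python
import pandas as pd

# Hypothetical data: the pair ('Brazil', 'cases') appears twice
dup = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'type': ['cases', 'cases', 'population'],
    'count': [10, 30, 100],
})

# pivot cannot decide which of the two 'cases' values to place in the cell
try:
    dup.pivot(index='country', columns='type', values='count')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the conflict by aggregating
averaged = dup.pivot_table(index='country', columns='type', values='count')
summed = dup.pivot_table(index='country', columns='type', values='count',
                         aggfunc='sum')

print(pivot_failed)
print(averaged)
print(summed)
```

When the data are already unique per index/column pair, as in `table1`, the aggregation has nothing to combine and the two methods produce the same layout.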

In [None]:
table1['syear'] = table1.year.apply(str)

In [None]:
table1['country-year'] = table1.country + '-' + table1.syear
table1 = table1[['country-year', 'type', 'count']]
table1

In [None]:
table1.pivot(index='country-year', columns='type', values='count')

In [None]:
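# table1 was narrowed to three columns above, so rebuild it from the raw data
table1_indexed = pd.DataFrame(table1data, columns=table1columns).set_index(['country', 'year'])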
table1_indexed
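
A multi-level index also offers another route to the wide layout, without building a `country-year` string. A sketch, assuming a few rows of the same shape as `table1` (the names `long` and `wide` are illustrative): moving `type` into the index and calling `unstack` spreads its values into columns.

```python
import pandas as pd

rows = [["Afghanistan", 1999, "cases", 745],
        ["Afghanistan", 1999, "population", 19987071],
        ["Brazil", 1999, "cases", 37737],
        ["Brazil", 1999, "population", 172006362]]
long = pd.DataFrame(rows, columns=["country", "year", "type", "count"])

# Index by all three identifying columns, then move 'type' from rows to columns
wide = long.set_index(["country", "year", "type"])["count"].unstack("type")
print(wide)

# reset_index returns country and year to ordinary columns if needed
print(wide.reset_index())
```

This keeps `country` and `year` as separate variables, which the concatenated `country-year` label does not.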

### Pivoting DataFrames

#### Video Examples

In the first example, the data as presented are tidy.  There appears to be a single independent variable, `id`, that uniquely identifies each observation, so `id` determines `treatment`, `gender`, and `response`.  Note that if, for a given `id`, either treatment (or both) were possible, then the combination of `id` and `treatment` would determine `response`, and `id` alone would determine `gender`.  This continues to exercise the ideas from class about what constitutes tidy data based on functional dependency.
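
The functional-dependency reasoning can be checked mechanically against a sample. A rough sketch, using a hypothetical helper named `determines` (not part of pandas): a set of columns determines a target column if every combination of their values is paired with exactly one target value.

```python
import pandas as pd

def determines(df, keys, target):
    """Return True if each combination of `keys` maps to a single `target` value."""
    return bool((df.groupby(keys)[target].nunique() <= 1).all())

trials = pd.DataFrame(
    [[1, 'A', 'F', 5],
     [2, 'A', 'M', 3],
     [3, 'B', 'F', 8],
     [4, 'B', 'M', 9]],
    columns=['id', 'treatment', 'gender', 'response'])

print(determines(trials, ['id'], 'response'))         # True
print(determines(trials, ['treatment'], 'response'))  # False
```

A check like this only shows that a dependency holds in the rows at hand; whether it holds in general is a question about what the data mean.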

In [None]:
import pandas as pd

clinic_columns = ['id', 'treatment', 'gender', 'response']
clinic_data = [
    [1, 'A', 'F', 5],
    [2, 'A', 'M', 3],
    [3, 'B', 'F', 8],
    [4, 'B', 'M', 9]
]
trials = pd.DataFrame(clinic_data, columns=clinic_columns)
trials

Reshaping this data is **not**, in this case, meant to go from non-tidy data to tidy data.  Rather, for presentational purposes, we decide that a matrix presentation with treatment down one axis and gender across the other axis is preferred.

The transformation of **pivot**, in essence, takes **values** (of a categorical variable, like gender) and makes a **column** for each different value of the categorical.  When it does this, it requires that we specify what to use for the row labels and which column supplies the values that appear at the intersection of each row label and new column.  So we need three pieces of information for a pivot:

1. row labels (the "**index**" of the transformed dataframe)
2. which **column** in the preimage to use to find the possible values for the newly generated columns
3. the column to use for the **values** at the intersection of the new row labels and the new columns

This is what the names of the parameters in the `pandas` `pivot` method are intended to convey.

In [None]:
trials_presentation = trials.pivot(index='treatment', columns='gender', values='response')
trials_presentation
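
Because this pivot is purely presentational, it can be undone. A sketch, assuming the same four-row `trials` frame (the names `presentation` and `restored` are illustrative): `reset_index` turns the `treatment` row labels back into a column, and `melt` folds the gender columns back into values. The `id` column is not recovered, since the pivot never used it.

```python
import pandas as pd

trials = pd.DataFrame(
    [[1, 'A', 'F', 5],
     [2, 'A', 'M', 3],
     [3, 'B', 'F', 8],
     [4, 'B', 'M', 9]],
    columns=['id', 'treatment', 'gender', 'response'])

presentation = trials.pivot(index='treatment', columns='gender',
                            values='response')

# Row labels become a column again, then the F/M headers become values
restored = (presentation.reset_index()
            .melt(id_vars='treatment', var_name='gender',
                  value_name='response'))
print(restored)
```

The row order of `restored` differs from the original, which does not matter for tidy data, since each row is a self-contained observation.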