# Lecture 10 - Pivoting Tables

## Spark 010, Spring 2024

## February 26, 2024

Data can be formatted in different ways.  We often describe the shape of a dataset by its number of observations and the number of variables per observation.

- **Long table**: Lots of observations, few variables
- **Wide table**: Few observations, lots of variables
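
To make the two layouts concrete, here is a minimal sketch that stores the same six values both ways. The student names, quiz columns, and scores are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical data: two students, one column per quiz (wide layout).
wide_table = pd.DataFrame({
    'Student': ['Ana', 'Ben'],
    'Quiz1': [8, 6],
    'Quiz2': [7, 9],
    'Quiz3': [10, 5],
})

# The same six scores, one row per (student, quiz) pair (long layout).
long_table = pd.DataFrame({
    'Student': ['Ana', 'Ben', 'Ana', 'Ben', 'Ana', 'Ben'],
    'Quiz': ['Quiz1', 'Quiz1', 'Quiz2', 'Quiz2', 'Quiz3', 'Quiz3'],
    'Score': [8, 6, 7, 9, 10, 5],
})

print(wide_table.shape)  # (2, 4): few rows, more columns
print(long_table.shape)  # (6, 3): more rows, fewer columns
```

Both tables hold the same information; only the arrangement of rows and columns differs.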

Often, there are too many variables associated with one observation.  However, some of the variables might be **related** to each other.  If we notice their relationship, then we can use this information to reshape our table, either increasing or reducing the number of variables to consider.  In other words, we can use this information to convert our table from long to wide, or from wide to long.

The information we use to reformat the table is the set of variables of interest.

A **pivot** can be seen as a variable on which to aggregate observed values.  In the context of a table, a pivot variable is a column that collects the values directly related to that variable.  When we ungroup the values from each of these variables, we are "unpivoting" the data.  In pandas, the term for "unpivot" is "melt".

Here we will explore how pandas can be used to pivot and unpivot data for the purpose of reformatting as the need arises.

In [None]:
import numpy as np
import pandas as pd

Below is an example of a wide table containing just one observation and six variables.

In [None]:
# One observation (row) described by six variables (columns).
time_frame = pd.DataFrame(data={'Year': [2024], 'Month': [1], 'Day': [25],
                                'Hour': [11], 'Minute': [30], 'Second': [44]})
time_frame

The above is a single measurement of time itself, with each of the six columns representing one aspect of that point in time.

## Unpivoting a table: `melt()`

We can make the above table **long** by placing the variables and values into two respective columns.  This can be done by using the function `melt()`:

```
pd.melt(dataframe_to_melt,
         id_vars='Identifier Column',
         var_name='Variables to Unpivot',
         value_name='Values underneath unpivoted variables')
```
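
As a rough illustration of the template above, the sketch below melts a small made-up table. The `readings` DataFrame, its column names, and the labels `'Time of Day'` and `'Temperature'` are assumptions chosen for this example only.

```python
import pandas as pd

# Hypothetical wide table: one row per sensor, one column per time of day.
readings = pd.DataFrame({
    'Sensor': ['A', 'B'],
    'Morning': [20.5, 19.0],
    'Evening': [18.0, 17.5],
})

# Keep 'Sensor' as the identifier; every other column is unpivoted.
long_readings = pd.melt(readings,
                        id_vars='Sensor',
                        var_name='Time of Day',
                        value_name='Temperature')
print(long_readings)
```

The result has one row per sensor and time of day: the old column names `Morning` and `Evening` become entries in `Time of Day`, and their values land in `Temperature`.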

**Use `melt()` on the dataset `time_frame` above to unpivot the time info.  Set the parameter `id_vars` to be the `'Year'` column.**

In [None]:
melted_time_frame = pd.melt(..., id_vars = ...)
melted_time_frame

The pandas function `melt()` is one way to move the values under different variables into a single column.

In the example above, we took the values of the five variables after `Year` and placed them into a single column named `value`.  Furthermore, a new column named `variable` was created that contains the names of those five variables.

When `melt()` is used, the **names of the variables that acted as the pivots** are no longer variables, but values in a new column.  The values they once contained are now stacked in a single column, with each value sitting in the same row as the name of the variable it came from.

If these five columns all correspond to the **same type of measurement** and you would like to analyze **all the measurements**, then what we can do is place all the measurements from the five columns into one, while still preserving the original information.
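
One way to see the benefit is sketched below: once the measurements share a single column, a summary over all of them is a one-line operation. The rainfall table, its station names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: the month columns all hold the same kind of measurement.
rainfall = pd.DataFrame({
    'Station': ['North', 'South'],
    'Jan': [3.0, 1.0],
    'Feb': [2.0, 4.0],
    'Mar': [5.0, 3.0],
})

long_rainfall = pd.melt(rainfall, id_vars='Station',
                        var_name='Month', value_name='Rainfall')

# All six measurements now live in one column, so they can be summarized together.
overall_mean = long_rainfall['Rainfall'].mean()

# The station and month are preserved on the same row as each value.
wettest = long_rainfall.loc[long_rainfall['Rainfall'].idxmax()]

print(overall_mean)
print(wettest['Station'], wettest['Month'])
```

No information is lost by melting: each value still carries its station and month, just as row entries instead of column positions.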

## Pivoting the table: `pivot()`

We can also do the opposite: We can turn this long table above back into a **wide** table using the function `pivot()`.

```
pd.pivot(dataframe_to_pivot,
         index='Variable to Fix',
         columns='Variables to Pivot',
         values='Values to be pivoted to each column')
```
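
The template above can be sketched on a small long table as follows. The `long_readings` data and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical long table: one row per (sensor, time of day) pair.
long_readings = pd.DataFrame({
    'Sensor': ['A', 'B', 'A', 'B'],
    'Time of Day': ['Morning', 'Morning', 'Evening', 'Evening'],
    'Temperature': [20.5, 19.0, 18.0, 17.5],
})

# Each unique entry in 'Time of Day' becomes its own column.
wide_readings = pd.pivot(long_readings,
                         index='Sensor',
                         columns='Time of Day',
                         values='Temperature')
print(wide_readings)
```

In the result, `Sensor` is the row index, and the new columns `Evening` and `Morning` appear in sorted order under the column-axis label `Time of Day`.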

**Use `pivot()` on the dataset `melted_time_frame` above to pivot the time info back into columns.  Set the `index` of the table to be the `'Year'` column, the `columns` parameter to be the `variable` column, and the `values` parameter to be the `value` column.**

In [None]:
pivoted_time_frame = pd.pivot(..., index = ..., columns = ..., values = ...)
pivoted_time_frame

The pandas function `pivot()` does the opposite of `melt()`.  Here we reshape the table by using the entries in `variable` as pivots: each unique entry in `variable` becomes its own column.

When `pivot()` is used, the **values under the column representing all the variables are assigned their own columns.**  The values that once shared a row with each variable name are now the values under the corresponding column.

Note that the table above is not exactly the same as the original.  That's because `pivot()` also labeled the group of new columns with the name `variable` and placed `Year` in the index rather than in a regular column.  (The new columns also appear in sorted order rather than in their original order.)  We can fix the first two of these as follows:

In [None]:
pivoted_time_frame = pivoted_time_frame.reset_index(names='Year')
pivoted_time_frame.columns.name = None
pivoted_time_frame
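
The sketch below runs a full melt-and-pivot round trip on a made-up table and checks whether the original is recovered. The `original` data is hypothetical, and the final column-reordering step is an addition for this example. It is needed here because `pivot()` returns the new columns in sorted order.

```python
import pandas as pd

# Hypothetical wide table whose value columns are NOT in alphabetical order.
original = pd.DataFrame({
    'Sensor': ['A', 'B'],
    'Morning': [20.5, 19.0],
    'Evening': [18.0, 17.5],
})

melted = pd.melt(original, id_vars='Sensor')  # default names: 'variable', 'value'
restored = pd.pivot(melted, index='Sensor', columns='variable', values='value')

# Move 'Sensor' from the index back into a regular column.
# The index is already named 'Sensor', so no extra name is needed.
restored = restored.reset_index()

# Drop the 'variable' label that pivot() attached to the column axis.
restored.columns.name = None

# Put the columns back in their original order.
restored = restored[original.columns]

print(restored.equals(original))
```

If the rows of the original table were not already sorted by the identifier column, a similar reordering of the rows would be needed as well.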

## Application to Time-Series Data

A time series is a sequence of numerical data points ordered in time.  One example of a time series appears in weather reports: daily records of temperature.  For each day, we have a single measurement, and across a whole week we have seven measurements with a clear order of _when_ each data point was recorded.
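
As a small illustration of the weather example, the sketch below stores one week of temperatures in a wide table and melts it into a long one. The city name and temperature values are invented for this example.

```python
import pandas as pd

# Hypothetical daily high temperatures for one made-up city (wide layout).
weekly_temps = pd.DataFrame({
    'City': ['Example City'],
    'Mon': [61], 'Tue': [63], 'Wed': [58], 'Thu': [60],
    'Fri': [65], 'Sat': [67], 'Sun': [64],
})

# One row per day; melt() keeps the days in their original column order.
long_temps = pd.melt(weekly_temps, id_vars='City',
                     var_name='Day', value_name='Temperature')
print(long_temps)
```

The long table has seven rows, one per day, listed in the same order as the day columns of the wide table.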

Let's look at another example of a time series where each time point is a semester, and where the data point is a record of enrollment.

In [None]:
enrollment = pd.read_csv('eng_enrollment.csv')
enrollment

Let's use `melt()` on this dataset to place all of the enrollment values into a single column.  **Unpivot the values in the columns pertaining to the semesters, keeping the identifier variables as those in the `'Major'` column.**

In [None]:
pd.melt(enrollment, id_vars = ...)

Let's make the columns descriptive by naming them based on the values we are analyzing.  **Assign names to the new variable and value columns as `'Semester'` and `'Enrollment'` respectively.**

In [None]:
melted_enrollment = pd.melt(enrollment, id_vars = ..., var_name= ..., value_name = ...)
melted_enrollment

Now, let's try to undo the process with `pivot()`.  **Obtain the Fall semester columns by pivoting the values in the current `'Semester'` column and grouping the values in the `'Enrollment'` column under each new semester column.**  Make sure to set the `index` of this table to be the `'Major'` column.

In [None]:
pivoted_enrollment = melted_enrollment.pivot(index = ..., columns = ..., values = ...)
pivoted_enrollment

Finally, we will adjust the names of the columns to ensure the format matches.  **Reset the index of the pivoted table so that it becomes a column named `'Major'`, then remove the upper column name `'Semester'` by setting it to `None`.**  This places the names of the Fall semesters on a single level, restoring the original table.

In [None]:
fixed_enrollment = pivoted_enrollment.reset_index(names = ...)
fixed_enrollment.columns.name = ...
fixed_enrollment
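
For reference, the whole sequence can be sketched on a stand-in table. Everything in `enrollment_demo` (the majors, the semester labels, and the counts) is hypothetical and may not match the contents of `eng_enrollment.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the enrollment table; the real file may differ.
enrollment_demo = pd.DataFrame({
    'Major': ['Bioengineering', 'Civil Engineering', 'Mechanical Engineering'],
    'Fall 2021': [120, 95, 210],
    'Fall 2022': [130, 90, 225],
    'Fall 2023': [128, 101, 240],
})

# Wide to long: one row per (major, semester) pair.
melted_demo = pd.melt(enrollment_demo, id_vars='Major',
                      var_name='Semester', value_name='Enrollment')

# Long to wide: one column per semester again.
pivoted_demo = melted_demo.pivot(index='Major', columns='Semester',
                                 values='Enrollment')

# Move 'Major' back into a column and drop the 'Semester' axis label.
fixed_demo = pivoted_demo.reset_index()
fixed_demo.columns.name = None

print(fixed_demo.equals(enrollment_demo))
```

The stand-in table has its majors and semester labels already in sorted order, which is why no reordering step is needed here. Because `pivot()` returns sorted rows and columns, a table stored in a different order would need its rows or columns rearranged to match the original exactly.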