# Pivot values in rows into separate columns

It's often best to have [tidy data](https://r4ds.had.co.nz/tidy-data.html). However, when standardizing data that describes similar events across different sources, I sometimes find it preferable to have each row represent the same thing in every source. For example, when standardizing arrest records across jurisdictions, one jurisdiction might have one row per person arrested while another has one row per charge. In this case, I might want to pivot the charges into multiple columns like `charge`, `charge_2`, `charge_3`, etc. This isn't ideal, but sometimes it's the best option.

Here's how to do it. It's worth checking out the Pandas documentation on [reshaping and pivot tables](https://pandas.pydata.org/docs/user_guide/reshaping.html) first to get familiar with some of the methods that take data from wide to long and back again. 

In [1]:
# Setup

from io import StringIO

import numpy as np
import pandas as pd

Consider this extract from a city police department's arrest data. I've removed personally identifying details. As a result, there are some rows that end up being duplicates. In this example, `unique_id` identifies the person arrested.

In [2]:
# Set up some example data
example_csv = """unique_id,penal_law_code,penal_law_description
776187,2923.12,Carrying Concealed Weapons
776187,2923.12,Carrying Concealed Weapons
776187,2923.12,Carrying Concealed Weapons
776187,2923.12,Carrying Concealed Weapons
776187,2923.12,Carrying Concealed Weapons
776187,2923.12,Carrying Concealed Weapons
776187,2923.13,Having Weapons While Under Disability
776187,2923.13,Having Weapons While Under Disability
776187,2923.13,Having Weapons While Under Disability
776187,2923.13,Having Weapons While Under Disability
776187,2923.13,Having Weapons While Under Disability
776187,2923.13,Having Weapons While Under Disability
"""

df = pd.read_csv(StringIO(example_csv))

df 

Unnamed: 0,unique_id,penal_law_code,penal_law_description
0,776187,2923.12,Carrying Concealed Weapons
1,776187,2923.12,Carrying Concealed Weapons
2,776187,2923.12,Carrying Concealed Weapons
3,776187,2923.12,Carrying Concealed Weapons
4,776187,2923.12,Carrying Concealed Weapons
5,776187,2923.12,Carrying Concealed Weapons
6,776187,2923.13,Having Weapons While Under Disability
7,776187,2923.13,Having Weapons While Under Disability
8,776187,2923.13,Having Weapons While Under Disability
9,776187,2923.13,Having Weapons While Under Disability
10,776187,2923.13,Having Weapons While Under Disability
11,776187,2923.13,Having Weapons While Under Disability


Start out by removing the duplicates and adding a column that identifies each charge.

In [3]:
# Remove duplicates and add a column to count each charge number
df_with_nos = df.drop_duplicates().assign(charge_no=lambda df: np.arange(df.shape[0]))

df_with_nos

Unnamed: 0,unique_id,penal_law_code,penal_law_description,charge_no
0,776187,2923.12,Carrying Concealed Weapons,0
6,776187,2923.13,Having Weapons While Under Disability,1
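
The numbering above works because the extract only contains one person. With several people in the same frame, I'd want the count to restart for each `unique_id`. Here's a minimal sketch of one way to do that with `groupby().cumcount()`. The second person (`900001`) and the frame name `multi` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two people, the second with a single charge
multi = pd.DataFrame(
    {
        "unique_id": [776187, 776187, 776187, 900001],
        "penal_law_code": ["2923.12", "2923.12", "2923.13", "2923.12"],
    }
)

multi_with_nos = multi.drop_duplicates().assign(
    # cumcount() numbers the rows within each unique_id group, starting at 0
    charge_no=lambda d: d.groupby("unique_id").cumcount()
)

print(multi_with_nos)
```

Each person's first charge gets `charge_no` 0, so the later labeling steps should carry over unchanged.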


Then, melt the data so there's one row per observation: one row per charge for each column we eventually want to widen.

In [4]:
# Get one row per observation

df_tidy = df_with_nos.melt(id_vars=["unique_id", "charge_no"])

df_tidy

Unnamed: 0,unique_id,charge_no,variable,value
0,776187,0,penal_law_code,2923.12
1,776187,1,penal_law_code,2923.13
2,776187,0,penal_law_description,Carrying Concealed Weapons
3,776187,1,penal_law_description,Having Weapons While Under Disability


We could skip the next step, but let's make more human-friendly labels so that the first charge's statute is named `penal_law_code` and the second `penal_law_code_2`, instead of `penal_law_code_0` and `penal_law_code_1` respectively.

In [5]:
# Translate the charge numbers into more human-friendly labels

df_tidy["charge_label"] = np.where(
    df_tidy["charge_no"] == 0,
    # If the charge number is 0, make the label an empty string
    "",
    # Otherwise, increment the index by one
    (df_tidy["charge_no"] + 1).astype("str")
)

df_tidy

Unnamed: 0,unique_id,charge_no,variable,value,charge_label
0,776187,0,penal_law_code,2923.12,
1,776187,1,penal_law_code,2923.13,2
2,776187,0,penal_law_description,Carrying Concealed Weapons,
3,776187,1,penal_law_description,Having Weapons While Under Disability,2


Then, combine the original column name (in `variable`) with the numeric labels we created in the previous step. This will be the column name when we eventually make the data wide.

In [6]:
# Generate new column names

df_tidy["new_col_name"] = np.where(
    df_tidy["charge_label"] == "",
    # If there's not a number in the label, just use the original column name
    df_tidy["variable"],
    # If there is a numeric label, append it to the original column name
    df_tidy["variable"] + "_" + df_tidy["charge_label"]
)

df_tidy

Unnamed: 0,unique_id,charge_no,variable,value,charge_label,new_col_name
0,776187,0,penal_law_code,2923.12,,penal_law_code
1,776187,1,penal_law_code,2923.13,2,penal_law_code_2
2,776187,0,penal_law_description,Carrying Concealed Weapons,,penal_law_description
3,776187,1,penal_law_description,Having Weapons While Under Disability,2,penal_law_description_2
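
The two labeling steps can also be collapsed into one by building a suffix and appending it to the original column name. This is just a sketch of an alternative, not a replacement for the steps above. It rebuilds a small `tidy` frame by hand so it runs on its own, and uses `Series.where()` instead of `np.where()`.

```python
import pandas as pd

# A hand-built copy of the melted data, so this block is self-contained
tidy = pd.DataFrame(
    {
        "unique_id": [776187] * 4,
        "charge_no": [0, 1, 0, 1],
        "variable": [
            "penal_law_code",
            "penal_law_code",
            "penal_law_description",
            "penal_law_description",
        ],
        "value": [
            "2923.12",
            "2923.13",
            "Carrying Concealed Weapons",
            "Having Weapons While Under Disability",
        ],
    }
)

# Build "_2", "_3", ... and blank out the suffix for the first charge
suffix = ("_" + (tidy["charge_no"] + 1).astype(str)).where(
    tidy["charge_no"] != 0, ""
)

tidy["new_col_name"] = tidy["variable"] + suffix

print(tidy[["charge_no", "variable", "new_col_name"]])
```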


Now use [`DataFrame.pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) to go from long to wide, putting the values in columns with the new column names we just created.

In [7]:
# Pivot the data from long to wide

df_wide = df_tidy.pivot(index="unique_id", columns="new_col_name", values="value")

# `.pivot()` will make the name of the column index `new_col_name`.
# I think this is ugly. Get rid of it.
df_wide.columns.name = None

df_wide

unique_id,penal_law_code,penal_law_code_2,penal_law_description,penal_law_description_2
776187,2923.12,2923.13,Carrying Concealed Weapons,Having Weapons While Under Disability
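
One thing to keep in mind: `.pivot()` needs each combination of index and column name to appear only once, which is why removing duplicates and numbering the charges earlier matters. Here's a small sketch of what happens otherwise. The frame `dupes` is made up for illustration.

```python
import pandas as pd

# Two rows that would land in the same cell of the wide table
dupes = pd.DataFrame(
    {
        "unique_id": [776187, 776187],
        "new_col_name": ["penal_law_code", "penal_law_code"],
        "value": ["2923.12", "2923.12"],
    }
)

try:
    dupes.pivot(index="unique_id", columns="new_col_name", values="value")
    pivot_failed = False
except ValueError as err:
    # pandas refuses to reshape when an index/column pair repeats
    pivot_failed = True
    print(err)
```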


We're almost done, but I want to keep the original column order instead of having all the statutes come before all the descriptions. To do this, I'll use Python's [`sorted()`](https://docs.python.org/3/howto/sorting.html) function with a custom `key` function. Because `sorted()` is stable, columns that share the same number keep their existing relative order.

In [8]:
# Reorder the columns

# Start by defining a custom key function we'll use with sorted()

def get_number_from_colname(colname):
    """
    Returns the trailing number from a column name, or 0 if there isn't one
    """
    bits = colname.split("_")
    try:
        return int(bits[-1])
    except ValueError:
        # No numeric suffix, so this column belongs to the first charge
        return 0

# Re-order the columns
cols_ordered = sorted(df_wide.columns, key=get_number_from_colname)
df_wide = df_wide[cols_ordered]

df_wide

unique_id,penal_law_code,penal_law_description,penal_law_code_2,penal_law_description_2
776187,2923.12,Carrying Concealed Weapons,2923.13,Having Weapons While Under Disability
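
To pull the steps together, here's a rough sketch of a reusable helper that also handles more than one person. It's a more compact variant, not the same method as above: it swaps `melt()` and `pivot()` for `set_index()` and `unstack()`, and builds the column order explicitly instead of sorting. The function name `pivot_charges`, the frame `arrests`, and the second person (`900001`) are all made up for illustration.

```python
import pandas as pd


def pivot_charges(df, id_col, value_cols):
    """Hypothetical helper: one row per id, one set of columns per charge."""
    numbered = df.drop_duplicates().assign(
        # Number the charges within each person, starting at 0
        charge_no=lambda d: d.groupby(id_col).cumcount()
    )

    # Move charge_no from the row index into the columns
    wide = numbered.set_index([id_col, "charge_no"])[value_cols].unstack("charge_no")

    # Flatten (column, charge_no) pairs into names like penal_law_code_2
    wide.columns = [col if n == 0 else f"{col}_{n + 1}" for col, n in wide.columns]

    # Keep the original column order within each charge
    n_charges = int(numbered["charge_no"].max()) + 1
    ordered = [
        col if n == 0 else f"{col}_{n + 1}"
        for n in range(n_charges)
        for col in value_cols
    ]
    return wide[ordered]


# Hypothetical data: two people, the second with a single charge
arrests = pd.DataFrame(
    {
        "unique_id": [776187, 776187, 776187, 900001],
        "penal_law_code": ["2923.12", "2923.12", "2923.13", "2923.12"],
        "penal_law_description": [
            "Carrying Concealed Weapons",
            "Carrying Concealed Weapons",
            "Having Weapons While Under Disability",
            "Carrying Concealed Weapons",
        ],
    }
)

wide = pivot_charges(arrests, "unique_id", ["penal_law_code", "penal_law_description"])

print(wide)
```

People with fewer charges than the maximum end up with missing values in the extra columns.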
