In [1]:
import pandas as pd

df = pd.DataFrame({'Name': {0: 'Goof', 1: 'Mickey', 2: 'Donald'},
                   'Duckopolis': {0: 11, 1: 32, 2: 33},
                   'Mouseville': {0: 68, 1: 51, 2: 42},
                   'Prison': {0:71, 1:3, 2:22}
                  })

In [2]:
print(df.head())

     Name  Duckopolis  Mouseville  Prison
0    Goof          11          68      71
1  Mickey          32          51       3
2  Donald          33          42      22


The problem with the data above is that the column names are themselves values (the names of the locations) rather than variable names; this makes the data "messy." `pd.melt()` moves those names into a column of their own.

In [3]:
pd.melt(df, id_vars=["Name"], var_name="Location", value_name="years_spent")

     Name    Location  years_spent
0    Goof  Duckopolis           11
1  Mickey  Duckopolis           32
2  Donald  Duckopolis           33
3    Goof  Mouseville           68
4  Mickey  Mouseville           51
5  Donald  Mouseville           42
6    Goof      Prison           71
7  Mickey      Prison            3
8  Donald      Prison           22
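
As a quick check on the reshaping above, the sketch below rebuilds the same table, melts it, and then summarizes it by location. The variable names `long_df` and `total_per_location` are only illustrative and do not appear in the original cells.

```python
import pandas as pd

# Same data as in In [1], rebuilt here so the sketch runs on its own.
df = pd.DataFrame({
    "Name": ["Goof", "Mickey", "Donald"],
    "Duckopolis": [11, 32, 33],
    "Mouseville": [68, 51, 42],
    "Prison": [71, 3, 22],
})

# long_df is an illustrative name for the melted table.
long_df = pd.melt(df, id_vars=["Name"], var_name="Location", value_name="years_spent")

# 3 characters x 3 location columns give 9 rows, one per (Name, Location) pair.
print(long_df.shape)  # (9, 3)

# With the locations stored as values, a per-location summary is a plain groupby.
total_per_location = long_df.groupby("Location")["years_spent"].sum()
print(total_per_location)
```

In the wide table the same totals would require naming each location column explicitly; in the long table a new location adds rows rather than columns.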


Another example:

In [4]:
grades = pd.DataFrame({'Name': {0: "Goof", 1: "Mickey", 2: "Donald"},
                       'Algebra Test': {0: 15, 1: 73, 2: 91},
                       'Broccoli Test': {0: 100, 1: 100, 2: 61}                      
                      })

This is what we call a **wide-form** table: the values are spread out, and we read information off the intersections between columns and rows. This is convenient and readable for humans, but not as useful as it could be for computers. To turn this into a **long-form** table, where each row corresponds to a single observation (i.e., the single result for a single test taken by a single student), we use `pd.melt()`.

In [5]:
print(grades)

     Name  Algebra Test  Broccoli Test
0    Goof            15            100
1  Mickey            73            100
2  Donald            91             61


When we use `pd.melt()` we enter the following parameters:

1. `id_vars`: the *identifier variable* (or variables; a list of several column names can be passed). These columns are kept as they are and come first in the new table; they hold the essential identifier for each observation (in this case, the student's name) and are already correctly organized in the original table;
2. `var_name`: the name of the new column that `melt` creates to hold the old column headers. Those headers are really *values* (in this case, the type of test taken), and each one is repeated on the rows of the observations it belongs to;
3. `value_name`: the name of the new column that will hold the values currently spread out across the table (in this case, the score for each test).

Note that I will also rename the columns using `rename()`, to avoid redundancy (removing the "Test" from each column name).

In [6]:
grades_tidied = pd.melt(
    grades.rename(columns={"Algebra Test": "Algebra", "Broccoli Test": "Broccoli"}),
    id_vars="Name", var_name="Test", value_name="Score")
print(grades_tidied)

     Name      Test  Score
0    Goof   Algebra     15
1  Mickey   Algebra     73
2  Donald   Algebra     91
3    Goof  Broccoli    100
4  Mickey  Broccoli    100
5  Donald  Broccoli     61
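
The parameter list above notes that `id_vars` can take more than one identifier. A minimal sketch of that case follows; the `Class` column and its values are invented purely for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical extension of the grades table: "Class" is an invented column,
# added only to show two identifier variables at once.
grades_with_class = pd.DataFrame({
    "Name": ["Goof", "Mickey", "Donald"],
    "Class": ["A", "B", "A"],
    "Algebra": [15, 73, 91],
    "Broccoli": [100, 100, 61],
})

tidy = pd.melt(
    grades_with_class,
    id_vars=["Name", "Class"],  # both columns are repeated on every output row
    var_name="Test",
    value_name="Score",
)
print(tidy)
```

Every column not listed in `id_vars` is melted, so the result has one row per student per test, with `Name` and `Class` carried along unchanged.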


At times data can be stored in tables that are **too long**. This does not refer to the number of rows in a correctly tidied **long-form** table, but to tables in which one column holds variable names and another column mixes values of different kinds.

In [9]:
profiles = pd.DataFrame({"Name": ["Mickey", "Mickey", "Goof", "Goof", "Donald", "Donald"],
                         "Attributes": ["Age", "No. of partners", "Age", "No. of partners", "Age", "No. of partners"],
                         "Values": [31, 5, 78, 43, 42, 0]
                        })
print(profiles)

     Name       Attributes  Values
0  Mickey              Age      31
1  Mickey  No. of partners       5
2    Goof              Age      78
3    Goof  No. of partners      43
4  Donald              Age      42
5  Donald  No. of partners       0


The column "Attributes" contains variable names instead of values, and the column "Values" mixes different kinds of values (ages and numbers of partners); besides, a single observation (all the data concerning a single individual) is split across multiple rows.

To fix this, we use `pivot()`. (`pivot()` is the anti-melt; it spreads out a table instead of slimming it.) This function takes three parameters:

1. `index`: the column that identifies each observation; it becomes the index of the result;
2. `columns`: the name of the column whose values should become the new column names;
3. `values`: the name of the column that contains the values to be spread out under those new columns.

It is advisable to call `reset_index()` on the result, so that the identifier becomes an ordinary column again rather than the index, and then to set `.columns.name = None` in order to drop the leftover label on the column axis (here, "Attributes").

In [10]:
tidied_profiles = profiles.pivot(index = "Name", columns = "Attributes", values = "Values").reset_index()
tidied_profiles.columns.name = None
print(tidied_profiles)

     Name  Age  No. of partners
0  Donald   42                0
1    Goof   78               43
2  Mickey   31                5
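
To illustrate the remark that `pivot()` is the anti-melt, the sketch below pivots the same `profiles` data and then melts the result back. The names `wide`, `back`, `recovered`, and `original` are illustrative only. Because the two tables come out in different row orders, the rows are sorted before they are compared.

```python
import pandas as pd

# Same data as in In [9], rebuilt here so the sketch runs on its own.
profiles = pd.DataFrame({
    "Name": ["Mickey", "Mickey", "Goof", "Goof", "Donald", "Donald"],
    "Attributes": ["Age", "No. of partners"] * 3,
    "Values": [31, 5, 78, 43, 42, 0],
})

# Spread the table out, as in In [10].
wide = profiles.pivot(index="Name", columns="Attributes", values="Values").reset_index()
wide.columns.name = None

# Melt it back into the original three-column layout.
back = pd.melt(wide, id_vars="Name", var_name="Attributes", value_name="Values")

# Compare the two tables as sorted lists of (Name, Attributes, Values) rows.
recovered = sorted(back.itertuples(index=False, name=None))
original = sorted(profiles.itertuples(index=False, name=None))
print(recovered == original)
```

The comparison holds here because every name has exactly one row per attribute; the round trip is only shown for this small, complete table.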
