# Column headers are values, not variable names

*This is one of the more common data manipulations to get to a tidy form!*

For example, many government data sets are in a format which is **good for visual lookup, but not for analysis and exploration**, with measurements in various years spread across multiple columns.

### A toy example – years as column headers

Let's define a simple, small DataFrame with that structure:

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({'state':['Maine','Alaska','Ohio'],
                  '2009':[1,2,3],
                  '2010':[4,5,6],
                  '2011':[7,8,9]})
df

|   | state | 2009 | 2010 | 2011 |
|---|-------|------|------|------|
| 0 | Maine | 1 | 4 | 7 |
| 1 | Alaska | 2 | 5 | 8 |
| 2 | Ohio | 3 | 6 | 9 |


## Transforming from *wide* into *tall* format

The problems with this format are:

- **The column headers are really a Dimension that should have its own *Year* column**
- **The values that are spread across the multiple rows and columns in the body of the table are a Measure that should be in a single column.**

Confusingly, every tool seems to have its own term for this transformation:

- **In Pandas you do a "melt"**
- In the R `tidyr` package this is a "gather" (newer versions of `tidyr` recommend `pivot_longer()` instead)
- In OpenRefine it's a "Transpose->Transpose cells across columns into rows..." operation
- In Tableau this is called a "Pivot"
- Many call this process "un-pivoting", since a *Pivot Table* in Excel converts data in the opposite direction, from the *tall* format into *wide*. 
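
To make the two directions concrete, here is a minimal sketch of a round trip: a melt from *wide* to *tall*, followed by a pivot back to *wide*. The variable names `wide`, `tall` and `wide_again` are only illustrative, and the data is the same toy table used in this notebook.

```python
import pandas as pd

wide = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                     '2009': [1, 2, 3],
                     '2010': [4, 5, 6],
                     '2011': [7, 8, 9]})

# wide -> tall: the "melt" / "un-pivot" direction
tall = wide.melt(id_vars=['state'], var_name='year', value_name='number')

# tall -> wide: the "pivot" direction, one column per year again
wide_again = tall.pivot(index='state', columns='year', values='number')

print(tall.shape)   # 9 rows, one per (state, year) pair
print(wide_again)
```

Note that after the pivot, `state` becomes the index and the rows come back in alphabetical order, so the result holds the same values as the original but is not laid out identically.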


### Minimally, for a `melt()` you will want to specify

- a list of the columns which **don't** get "un-pivoted" – *these values will get repeated*.

*Notice that the column headers by default end up in a column called "variable", and the table body values end up in a column called "value".*

In [3]:
df2 = df.melt(['state'])
df2

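|   | state | variable | value |
|---|-------|----------|-------|
| 0 | Maine | 2009 | 1 |
| 1 | Alaska | 2009 | 2 |
| 2 | Ohio | 2009 | 3 |
| 3 | Maine | 2010 | 4 |
| 4 | Alaska | 2010 | 5 |
| 5 | Ohio | 2010 | 6 |
| 6 | Maine | 2011 | 7 |
| 7 | Alaska | 2011 | 8 |
| 8 | Ohio | 2011 | 9 |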
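
For contrast, pandas does accept a `melt()` call with no arguments at all, but then *every* column is un-pivoted, including `state`. The sketch below (with the illustrative name `everything`) shows why you will almost always want to list the columns to keep.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                   '2009': [1, 2, 3],
                   '2010': [4, 5, 6],
                   '2011': [7, 8, 9]})

# No id columns given: all four columns are melted, 'state' included
everything = df.melt()

print(everything.columns.tolist())      # only 'variable' and 'value' remain
print(everything['variable'].unique())  # 'state' is now just another variable
print(len(everything))                  # 4 columns x 3 rows = 12 rows
```

The state names end up mixed in with the numbers in the `value` column, and nothing is left to say which state each number belongs to.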


### Slightly more explicitly: id_vars

The argument name for the list of columns that don't get melted is `id_vars`. It is the first positional parameter of `melt()`, so a list passed without a name, as above, is taken as `id_vars`. Still, it's not a bad idea to spell this out explicitly in your code.

The `id_vars` "identify" the rest of the columns in each original row. They are the IDs, or the key fields, which uniquely identify the rest of the information.

In [4]:
df2 = df.melt(id_vars=['state'])
df2

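|   | state | variable | value |
|---|-------|----------|-------|
| 0 | Maine | 2009 | 1 |
| 1 | Alaska | 2009 | 2 |
| 2 | Ohio | 2009 | 3 |
| 3 | Maine | 2010 | 4 |
| 4 | Alaska | 2010 | 5 |
| 5 | Ohio | 2010 | 6 |
| 6 | Maine | 2011 | 7 |
| 7 | Alaska | 2011 | 8 |
| 8 | Ohio | 2011 | 9 |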
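
`id_vars` can hold more than one column. The sketch below adds a hypothetical `region` column (not part of the original example, invented purely for illustration) and keeps both `state` and `region` as id columns. It also checks the "key field" idea: each combination of the id columns and `variable` should appear exactly once.

```python
import pandas as pd

# 'region' is a hypothetical extra key column, added only for illustration
df_regions = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                           'region': ['Northeast', 'West', 'Midwest'],
                           '2009': [1, 2, 3],
                           '2010': [4, 5, 6],
                           '2011': [7, 8, 9]})

tall = df_regions.melt(id_vars=['state', 'region'])

# Both id columns are repeated once per melted column
print(tall[tall['state'] == 'Maine'])

# Together with 'variable', the id columns identify each value exactly once
has_duplicates = tall.duplicated(subset=['state', 'region', 'variable']).any()
print(has_duplicates)  # False
```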


## More complete `.melt()` statement

More fully, you can explicitly specify any combination of these (in practice you will almost always include an `id_vars` list):

- list of columns that don't get melted (and, thus, will get repeated): `id_vars=`
- list of columns that get melted from columns into rows: `value_vars=`
- name you want for the column that used to be column headers: `var_name=`
- name you want for the column that used to be the table body values: `value_name=`


In [5]:
df2 = df.melt(id_vars=['state'], 
              value_vars=['2009','2010','2011'], 
              var_name='year', value_name='number')
df2

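|   | state | year | number |
|---|-------|------|--------|
| 0 | Maine | 2009 | 1 |
| 1 | Alaska | 2009 | 2 |
| 2 | Ohio | 2009 | 3 |
| 3 | Maine | 2010 | 4 |
| 4 | Alaska | 2010 | 5 |
| 5 | Ohio | 2010 | 6 |
| 6 | Maine | 2011 | 7 |
| 7 | Alaska | 2011 | 8 |
| 8 | Ohio | 2011 | 9 |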
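
One thing worth knowing about `value_vars`: if you list only some of the columns, the ones you leave out (and that are not in `id_vars`) are dropped from the result rather than kept. A small sketch, using the illustrative name `recent`:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                   '2009': [1, 2, 3],
                   '2010': [4, 5, 6],
                   '2011': [7, 8, 9]})

# Only 2010 and 2011 are melted; the 2009 column does not appear in the result
recent = df.melt(id_vars=['state'],
                 value_vars=['2010', '2011'],
                 var_name='year', value_name='number')

print(recent)
print(recent['year'].unique())  # no '2009' here
```

This can be handy for melting just a subset of years, but it also means a column left out of `value_vars` by mistake disappears without any warning.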


## Sort the DataFrame

You can sort the rows of a DataFrame according to any column, or ordered combination of columns.

**The results won't be saved unless you do it with `inplace=True` or assign the results of the sort to a variable!**

Sometimes this is useful to just inspect a DataFrame in a certain sort order, when you don't really need to save the results.

In [6]:
df2.sort_values(['state'])

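|   | state | year | number |
|---|-------|------|--------|
| 1 | Alaska | 2009 | 2 |
| 4 | Alaska | 2010 | 5 |
| 7 | Alaska | 2011 | 8 |
| 0 | Maine | 2009 | 1 |
| 3 | Maine | 2010 | 4 |
| 6 | Maine | 2011 | 7 |
| 2 | Ohio | 2009 | 3 |
| 5 | Ohio | 2010 | 6 |
| 8 | Ohio | 2011 | 9 |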


#### See, it displayed, but didn't really get saved!

Notice that the default is to sort in "ascending" order, or `ascending=True`

In [7]:
df2

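|   | state | year | number |
|---|-------|------|--------|
| 0 | Maine | 2009 | 1 |
| 1 | Alaska | 2009 | 2 |
| 2 | Ohio | 2009 | 3 |
| 3 | Maine | 2010 | 4 |
| 4 | Alaska | 2010 | 5 |
| 5 | Ohio | 2010 | 6 |
| 6 | Maine | 2011 | 7 |
| 7 | Alaska | 2011 | 8 |
| 8 | Ohio | 2011 | 9 |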
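
The two ways of keeping a sorted result can be sketched side by side. The names `by_state`, `in_place` and `result` are only illustrative. Note that with `inplace=True` the call itself returns `None`, so you should not assign its result to anything.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                   '2009': [1, 2, 3],
                   '2010': [4, 5, 6],
                   '2011': [7, 8, 9]})
df2 = df.melt(id_vars=['state'], var_name='year', value_name='number')

# Option 1: assign the sorted copy to a name; df2 itself is left untouched
by_state = df2.sort_values(by=['state'])

# Option 2: sort an existing DataFrame in place (done on a copy here,
# so that df2 stays available for comparison)
in_place = df2.copy()
result = in_place.sort_values(by=['state'], inplace=True)

print(by_state['state'].tolist())
print(in_place['state'].tolist())
print(result)               # None: the in-place call returns nothing
print(df2.index.tolist())   # df2 is still in its original order
```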


### Sorting rows by multiple columns

Here we'll sort the rows by 'year' and then, within each year, by 'state'. Because `ascending=False` is a single value, it applies to both columns, so both are sorted in "descending" order. It's not necessary, but it's slightly more readable if you specify the argument name `by`.

In [8]:
df2 = df2.sort_values(by=['year','state'], ascending=False)
df2

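|   | state | year | number |
|---|-------|------|--------|
| 8 | Ohio | 2011 | 9 |
| 6 | Maine | 2011 | 7 |
| 7 | Alaska | 2011 | 8 |
| 5 | Ohio | 2010 | 6 |
| 3 | Maine | 2010 | 4 |
| 4 | Alaska | 2010 | 5 |
| 2 | Ohio | 2009 | 3 |
| 0 | Maine | 2009 | 1 |
| 1 | Alaska | 2009 | 2 |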


#### `ascending` can also take a list the same length as `by`

In [9]:
df2 = df2.sort_values(by=['year','state'], ascending=[False,True])
df2

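|   | state | year | number |
|---|-------|------|--------|
| 7 | Alaska | 2011 | 8 |
| 6 | Maine | 2011 | 7 |
| 8 | Ohio | 2011 | 9 |
| 4 | Alaska | 2010 | 5 |
| 3 | Maine | 2010 | 4 |
| 5 | Ohio | 2010 | 6 |
| 1 | Alaska | 2009 | 2 |
| 0 | Maine | 2009 | 1 |
| 2 | Ohio | 2009 | 3 |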
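
After sorting, the row labels in the index keep their old values, which is why they look shuffled in the output above. If you would rather have a clean 0 to 8 index, one option is to chain `reset_index(drop=True)` onto the sort, as in this sketch (the name `tidy` is illustrative). Recent pandas versions (1.0 and later) also accept `ignore_index=True` directly in `sort_values()` for the same effect.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Maine', 'Alaska', 'Ohio'],
                   '2009': [1, 2, 3],
                   '2010': [4, 5, 6],
                   '2011': [7, 8, 9]})
df2 = df.melt(id_vars=['state'], var_name='year', value_name='number')

# Newest year first, states A-Z within each year, then renumber the rows;
# drop=True discards the old index instead of keeping it as a new column
tidy = (df2.sort_values(by=['year', 'state'], ascending=[False, True])
           .reset_index(drop=True))

print(tidy)
```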
