# Long format, wide format, pivot tables, and melting

This lesson is all about data transformation in pandas. Data transformation is, in essence, reorganizing the rows and columns of your dataset into a different shape and format.

The main benefit of transforming your data is easier access and manipulation, whether that means simpler masking and conditional statements or being able to operate across columns rather than down rows (or vice versa).

Over time you will get a feel for which data formats are better for different tasks. This lesson, however, is focused in large part on the _functional application_ of data transformation. How do you do this to a dataset?

---

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline


---

## 1. "Wide" format data

**Wide** format data is the more common format of data for .csv type files. You are already familiar with wide format data: I believe all of the datasets we have been using thus far have been in wide format.

Wide format data has the following characteristics:

- There are multiple ID _and_ value columns. In other words, there is a column for every "variable" with its own unique values.
- It has both the conceptual simplicity of a single column of values per variable and a more compact matrix.
- It is less useful for SQL-style operations: it can make it much harder or even impossible to join tables together on a value.
- It can be more useful in pandas when you need to perform operations on variables **across columns**. For example, multiplying columns together.
- It is most commonly the format that you will put the data in when you are ready to perform modeling (with some exceptions). When we get into modeling next week I will explain why.

---

## 2. Load the "Nerdy Personality Attributes" dataset

This is a parsed and modified version of the full "Nerdy Personality Attributes" survey that asked subjects to self-rate on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Demographic information on the subjects was also collected.

In this modified version, for the sake of example, some of the subjects have only data for the survey and not the demographic variables. Because there are missing values and the data in general is "messy", this is also in part a data cleaning problem.

We will load the data in wide format first:


In [2]:
nerdy_wide_f = '~/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_wide_missing.csv'

nerdy_wide = pd.read_csv(nerdy_wide_f)
print(nerdy_wide.shape)

(1391, 57)


In [4]:
nerdy_wide.head(2)
print(nerdy_wide.columns)

Index([u'subject_id', u'academic_over_social', u'age', u'anxious', u'bookish',
       u'books_over_parties', u'calm', u'collect_books', u'conventional',
       u'critical', u'dependable', u'diagnosed_autistic', u'disorganized',
       u'education', u'engnat', u'enjoy_learning', u'excited_about_research',
       u'extraverted', u'familysize', u'gender', u'hand',
       u'hobbies_over_people', u'in_advanced_classes',
       u'intelligence_over_appearance', u'interested_science',
       u'introspective', u'libraries_over_publicspace', u'like_dry_topics',
       u'like_hard_material', u'like_science_fiction', u'like_superheroes',
       u'major', u'married', u'online_over_inperson', u'opennness',
       u'play_many_videogames', u'playes_rpgs', u'prefer_fictional_people',
       u'race_arab', u'race_asian', u'race_black', u'race_hispanic',
       u'race_native_american', u'race_native_austrailian', u'race_nerdy',
       u'race_white', u'read_tech_reports', u'religion', u'reserved',
       u

The dataset is in the familiar (rows, columns) format: each column is a variable, and each row contains the observed values of those variables for one distinct subject.

In [3]:
#nerdy_wide.head(3)

We can check to see how many null values there are per column with the convenient chained function pattern below:

In [5]:
nerdy_wide.isnull().sum()

subject_id                        0
academic_over_social              0
age                             691
anxious                           0
bookish                           0
books_over_parties                0
calm                              0
collect_books                     0
conventional                      0
critical                          0
dependable                        0
diagnosed_autistic                0
disorganized                      0
education                       691
engnat                          691
enjoy_learning                    0
excited_about_research            0
extraverted                       0
familysize                      691
gender                          691
hand                            691
hobbies_over_people               0
in_advanced_classes               0
intelligence_over_appearance      0
interested_science                0
introspective                     0
libraries_over_publicspace        0
like_dry_topics             

The 691 missing values in each demographic variable are intentional (I specifically set it up so that only 700 of the subjects have demographic information).

However, we can see that the `major` variable has 970 missing values. This was not intentional on my part.

If we were to just drop all the rows that have any null values at this point, we would lose 970 rows due to the commonly missing variable `major`.

With a numeric column, this would be hard to avoid without "imputing" some number to fill in the values. The simplest approach is to impute the mean or median for missing numeric values, though this is not a very good solution.

With a **categorical variable**, which `major` is, we have the luxury of replacing the missing values with another category. In this case, I will replace the values with "unknown".

In [6]:
nerdy_wide.loc[nerdy_wide.major.isnull(), 'major'] = 'unknown'
print(nerdy_wide.major.isnull().sum())

0
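
As a small illustration of the two strategies described above, here is a sketch on a made-up dataframe (`toy` and its values are hypothetical and do not come from the survey file). It fills a numeric column with its median and gives a categorical column its own "unknown" category using `.fillna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey; not the real file.
toy = pd.DataFrame({
    'age': [50.0, 22.0, np.nan, 18.0],
    'major': ['biology', None, 'Geology', None],
})

# Numeric column: fill missing values with the column median (a crude imputation).
toy['age'] = toy['age'].fillna(toy['age'].median())

# Categorical column: give the missing values their own category.
toy['major'] = toy['major'].fillna('unknown')

print(toy)
```

`.fillna()` is an alternative to the `.loc` mask used above; both should leave the column without nulls.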


In [7]:
nerdy_wide.head(2)

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0


## 3. "Long" format

Now we can load the same data in what's commonly referred to as "long format".

Long format data has the following characteristics:

- Potentially multiple "id" (identification) columns.
- Variable:value column pairs that match a variable key to a value (in the simple case, a single variable column and a single value column).
- The "variable" column corresponds to the multiple variable columns in your wide format data. Now, instead of a column for each variable, you have a row for each variable:value pair, per id. 
- This is a standard format in SQL databases because it is appropriate for joining different tables together by keys.
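
To make the contrast concrete, here is a minimal sketch with the same made-up data written out by hand in both shapes (`toy_wide` and `toy_long` are hypothetical, not the survey files). It also shows the kind of operation each shape tends to make easy: a column-by-column product in wide format, and a group-by on the variable key in long format.

```python
import pandas as pd

# Wide: one row per subject, one column per variable.
toy_wide = pd.DataFrame({
    'subject_id': [0, 1],
    'anxious': [1.0, 4.0],
    'calm': [7.0, 6.0],
})

# Long: one row per (subject, variable) pair, with a single value column.
toy_long = pd.DataFrame({
    'subject_id': [0, 1, 0, 1],
    'variable': ['anxious', 'anxious', 'calm', 'calm'],
    'value': [1.0, 4.0, 7.0, 6.0],
})

# Wide format: operating across columns is a one-liner.
product_wide = toy_wide['anxious'] * toy_wide['calm']

# Long format: summarizing by the variable key is a one-liner.
mean_by_variable = toy_long.groupby('variable')['value'].mean()

print(product_wide)
print(mean_by_variable)
```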

In [8]:
nerdy_long_f = '~/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_long_missing.csv'

nerdy_long = pd.read_csv(nerdy_long_f)
print(nerdy_long.shape)

(70295, 3)


In [15]:
nerdy_long.sort_values(['subject_id','value'], axis=0, ascending=False).head(3)

Unnamed: 0,subject_id,variable,value
7699,1390,major,Science
50820,1390,extraverted,7.0
57775,1390,sympathetic,7.0


You can see that the long data has far more rows, but only three columns.

Above you see the three columns: `subject_id`, `variable`, and `value`.

**`subject_id`**:
- This is the "key" or "id" column. Each subject id will have corresponding entries in the variable column, one for each row.

**`variable`**:
- This column indicates which variable the item in the value column corresponds to.

**`value`**:
- This contains all the values for all of the variables for all ids. Essentially, every cell in the wide dataset except the subject_id is listed in this column.

In [5]:
#nerdy_long.head(3)

You can see that the unique values in the variable column correspond to the column headers in the wide format data:

In [16]:
nerdy_long.variable.unique()

array(['education', 'urban', 'gender', 'engnat', 'age', 'hand', 'religion',
       'voted', 'married', 'familysize', 'major', 'race_white',
       'race_nerdy', 'race_native_american', 'writing_novel',
       'read_tech_reports', 'online_over_inperson', 'introspective',
       'hobbies_over_people', 'books_over_parties', 'bookish',
       'libraries_over_publicspace', 'race_native_austrailian',
       'like_hard_material', 'race_hispanic', 'diagnosed_autistic',
       'play_many_videogames', 'race_arab', 'race_asian',
       'interested_science', 'playes_rpgs', 'in_advanced_classes',
       'collect_books', 'intelligence_over_appearance',
       'watch_science_shows', 'academic_over_social',
       'like_science_fiction', 'like_dry_topics', 'race_black', 'calm',
       'disorganized', 'extraverted', 'dependable', 'critical',
       'opennness', 'anxious', 'sympathetic', 'reserved', 'conventional',
       'was_odd_child', 'prefer_fictional_people', 'enjoy_learning',
       'excited_abou

Let's again replace the `major` variables with 'unknown', but in a way that works with long format data:
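
One way this might look is sketched below on a made-up long dataframe (`toy_long` is hypothetical, not the survey file). The mask has to combine two conditions: the row is for the `major` variable, and its value is null. The sketch assumes a missing major appears as a row with a null value; where the row is absent from the long data altogether, there is nothing to replace until the data is pivoted back to wide.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with string values, as in a mixed-type value column.
toy_long = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['major', 'anxious', 'major', 'anxious'],
    'value': [np.nan, '1.0', 'biology', np.nan],
})

# Only rows that are BOTH for the 'major' variable AND null should be replaced.
is_missing_major = (toy_long['variable'] == 'major') & toy_long['value'].isnull()
toy_long.loc[is_missing_major, 'value'] = 'unknown'

print(toy_long)
```

Note that the null `anxious` value for subject 1 is left alone, because the mask is restricted to the `major` rows.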

## Pandas `pivot_table()`: long to wide format

The `pd.pivot_table()` function is a very powerful tool to both transform data from long to wide format and also to conveniently summarize data into a matrix with arbitrary functions.

First we'll look at how we transform this long format data back into the wide format data.

**Parameters to note in the function:**

- **`nerdy_long`**: the `pivot_table()` function takes the dataframe to pivot as its first argument
- **`columns`**: the column or list of columns in the long format data to spread out into columns in the wide format, with each unique value in the long format column becoming a header in the wide format
- **`values`**: a single column indicating the values to use when pivoting and filling in the new wide format columns
- **`index`**: columns in the long format data that are index variables. These are left as single columns, not spread out across columns by unique value as with the `columns` parameter
- **`aggfunc`**: often `pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function". It is optional and defaults to taking the mean. You can put your own function in, which I do below.
- **`fill_value`**: the value to fill in when a cell is missing in the wide format data

I am passing my own function, `select_item_or_nan()`, to the `aggfunc` keyword argument. Because there is a single value for each combination of `subject_id` and variable, I just want the single element in the long format value cell. My data is messy, so I have to write a function that checks for some places it can break.

Note: the `x` passed into my function is a Series object holding the value(s) for one cell, not a bare value. I pull out the first element with the `.iloc` indexer.
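
Before the real data, here is a minimal sketch of the same parameters on a made-up long dataframe (`toy_long` is hypothetical). Because this toy data is clean, it uses the built-in `'first'` aggregation instead of a custom function, which simply takes the first value in each cell.

```python
import pandas as pd

# Hypothetical long-format data: one row per (subject, variable) pair.
toy_long = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [1.0, 7.0, 4.0, 6.0],
})

# index stays as rows, each unique 'variable' becomes a column header,
# and 'value' fills in the cells.
toy_wide = pd.pivot_table(toy_long, index=['subject_id'], columns=['variable'],
                          values='value', aggfunc='first')

print(toy_wide)
```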

In [21]:
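def select_item_or_nan(x):
    # x is a Series holding the value(s) for one subject_id/variable cell
    if len(x) == 0:
        return np.nan
    item = x.iloc[0]
    # treat an empty string as missing; len() would fail on a float such as NaN
    if isinstance(item, str) and len(item) == 0:
        return np.nan
    return item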
    
new_wide = pd.pivot_table(nerdy_long, columns=['variable'], values='value', 
                          index=['subject_id'], aggfunc=select_item_or_nan,
                          fill_value=np.nan)

In [22]:
new_wide.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


In [23]:
new_wide.reset_index().head()

variable,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


In [30]:
nerdy_wide.head()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


### Multiindex/Hierarchical indexing pt. 1

In the header of the output below you can see that the format of the wide data is not the same as our original loaded wide format. Pandas implements something called **Multiindexing** or **Hierarchical indexing**, which allows for "tiered" row and column labels.

Right now it is not that bad, but this can get very complicated and annoying, as we will see further down in the lesson.

The main difference here is that we have a `variable` name in the top left corner, which is "labeling" our columns (and corresponds to the name of our original column in the long format data). The row indexer has become our single key/id variable `subject_id`. The columns are what we would expect here, each one a variable like in the original wide data.

In [31]:
new_wide.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


Let's drop the null values from our recreated wide data.

Remember our `subject_id` is now the **index**, and so we can access it with the `.index` attribute.

In [44]:
new_wide.dropna(inplace=True)




In [46]:
new_wide.head(3)

Unnamed: 0,level_0,index,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,1,1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
2,2,2,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0


We can use the dataframe function `.reset_index()` to move `subject_id` into a column and create a new index. Now we have the dataframe in the format we got when we loaded the original wide data in before. The only exception is that we still have that "variable" column label.

In [47]:
new_wide = nerdy_wide.copy()  # restart from the loaded wide data; copy so dropna below leaves nerdy_wide intact
new_wide.head()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


In [49]:
new_wide.dropna(inplace=True)

You can remove the column label (which I personally find confusing) by setting the `.columns.name` attribute to None.

In [37]:
new_wide.columns.name = None

## `pivot_table` for summarization

For those of you who are experienced with Excel, the pandas pivot table does the same thing as the pivot table in Excel. It's more powerful, but obviously harder to use than the user-friendly spreadsheet version.

Next we'll use pivot table to generate some summary statistics for `anxious`, `bookish`, and `calm` by `major`. 

We can do it two ways. First let's subset the data just to those columns and subject id.

In [50]:
wide_subset = new_wide[['subject_id','major','anxious','bookish','calm']]
wide_subset.head()

Unnamed: 0,subject_id,major,anxious,bookish,calm
1,1,biophysics,4.0,4.0,6.0
2,2,biology,7.0,5.0,2.0
5,5,Geology,5.0,3.0,4.0
6,6,unknown,1.0,4.0,6.0
7,7,unknown,7.0,3.0,1.0


### Going from wide to long with `.melt()`

**`.melt()`** is a function that essentially performs the inverse operation of `pivot_table` on dataframes.

Melt takes a dataframe as its first argument. Additional arguments typically used in the melt function are:

- **`id_vars`**: the column or columns that will be id variables. These stay as columns and are repeated on every row to identify which observation each variable:value pair belongs to.
- **`value_vars`**: a list that specifies which columns should be converted into a single value column and variable column. If omitted, every column not in `id_vars` is used.
- **`var_name`**: the header name of the variable column (default='variable')
- **`value_name`**: the header name of the value column (default='value')

Below I only specify the `id_vars` as subject_id and major. The variable and value columns are inferred.

In [55]:
subset_long = pd.melt(wide_subset, id_vars=['subject_id','major'])
subset_long.head(4)

Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,anxious,4.0
1,2,biology,anxious,7.0
2,5,Geology,anxious,5.0
3,6,unknown,anxious,1.0


You can get a similar result without having to subset the dataframe first by specifying `value_vars`, the columns to lengthen. The output dataframe will then only contain the columns named in the `id_vars` and `value_vars` arguments; here that leaves out `bookish`.

In [56]:
subset_long2 = pd.melt(new_wide, id_vars=['subject_id','major'], 
                       value_vars=['anxious','calm'])
subset_long2.head()

Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,anxious,4.0
1,2,biology,anxious,7.0
2,5,Geology,anxious,5.0
3,6,unknown,anxious,1.0
4,7,unknown,anxious,7.0
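
The `var_name` and `value_name` arguments are not used in the cells above, so here is a small sketch of them on a made-up dataframe (`toy_wide` and the names `trait` and `rating` are hypothetical choices for illustration):

```python
import pandas as pd

# Hypothetical wide data with two id columns and two value columns.
toy_wide = pd.DataFrame({
    'subject_id': [1, 2],
    'major': ['biophysics', 'biology'],
    'anxious': [4.0, 7.0],
    'calm': [6.0, 2.0],
})

# Rename the default 'variable' and 'value' headers while melting.
toy_long = pd.melt(toy_wide, id_vars=['subject_id', 'major'],
                   value_vars=['anxious', 'calm'],
                   var_name='trait', value_name='rating')

print(toy_long)
```

Giving the columns descriptive names at this point can save a rename later, particularly when two long dataframes will be merged and both would otherwise have `variable` and `value` columns.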


In [58]:
subset_long.dtypes

subject_id      int64
major          object
variable       object
value         float64
dtype: object

The dtypes above show that the value column is already `float64` here. If it had come through as strings (object dtype), we could convert it to float like this:

In [57]:
subset_long.value = subset_long.value.astype(float)

Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,anxious,4.0
1,2,biology,anxious,7.0
2,5,Geology,anxious,5.0


### Summarizing with aggregate functions

`pivot_table` can take the long format variable and value columns, plus an index to group by, and apply aggregate functions to summarize the data easily. Note that your index should not single out unique rows (for example, indexing on `subject_id` would leave only one value per variable to send into the aggregate functions).

The output dataframe has a "hierarchical" column index: the three variables repeated under each aggregate function. The row index holds the majors you divided the data up by.

If you add more index variables to split by, the row index will also become hierarchical! It can get complicated fast.

In [60]:
subset_summary = pd.pivot_table(subset_long, columns=['variable'], values='value',
                                index=['major'], aggfunc=[np.mean, np.median],
                               fill_value=np.nan)
subset_summary.head()

Unnamed: 0_level_0,mean,mean,mean,median,median,median
variable,anxious,bookish,calm,anxious,bookish,calm
major,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
None yet,7.0,3.0,3.0,7.0,3.0,3.0
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-,2.0,2.0,7.0,2.0,2.0,7.0
Aerospace Engineer,2.0,2.0,7.0,2.0,2.0,7.0
Aerospace Engineering,3.0,4.0,3.0,3.0,4.0,3.0
Agricultural Economics,2.0,2.0,6.0,2.0,2.0,6.0


In [67]:
subset_summary.loc[:,'mean'].loc['Art','anxious']

4.0

In [80]:
subset_summary.reset_index()

Unnamed: 0_level_0,major,mean,mean,mean,median,median,median
variable,Unnamed: 1_level_1,anxious,bookish,calm,anxious,bookish,calm
0,None yet,7.000000,3.000000,3.000000,7.0,3.0,3.0
1,+ACI-+ACIAIg-hotel and restaurant management+A...,2.000000,2.000000,7.000000,2.0,2.0,7.0
2,Aerospace Engineer,2.000000,2.000000,7.000000,2.0,2.0,7.0
3,Aerospace Engineering,3.000000,4.000000,3.000000,3.0,4.0,3.0
4,Agricultural Economics,2.000000,2.000000,6.000000,2.0,2.0,6.0
5,Anthropology,5.333333,3.666667,4.333333,5.0,4.0,4.0
6,Anthropology,5.000000,4.000000,3.000000,5.0,4.0,3.0
7,Architecture,3.000000,4.000000,5.666667,4.0,4.0,6.0
8,Architecture,5.000000,1.000000,5.000000,5.0,1.0,5.0
9,Art,4.000000,4.333333,5.333333,5.0,4.5,5.5


In [68]:
subset_summary.to_records()

rec.array([(' None yet', 7.0, 3.0, 3.0, 7.0, 3.0, 3.0),
 ('+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-', 2.0, 2.0, 7.0, 2.0, 2.0, 7.0),
 ('Aerospace Engineer ', 2.0, 2.0, 7.0, 2.0, 2.0, 7.0),
 ('Aerospace Engineering', 3.0, 4.0, 3.0, 3.0, 4.0, 3.0),
 ('Agricultural Economics', 2.0, 2.0, 6.0, 2.0, 2.0, 6.0),
 ('Anthropology', 5.333333333333333, 3.6666666666666665, 4.333333333333333, 5.0, 4.0, 4.0),
 ('Anthropology ', 5.0, 4.0, 3.0, 5.0, 4.0, 3.0),
 ('Architecture', 3.0, 4.0, 5.666666666666667, 4.0, 4.0, 6.0),
 ('Architecture ', 5.0, 1.0, 5.0, 5.0, 1.0, 5.0),
 ('Art', 4.0, 4.333333333333333, 5.333333333333333, 5.0, 4.5, 5.5),
 ('Art Education', 5.0, 4.0, 3.0, 5.0, 4.0, 3.0),
 ('Art history', 6.0, 4.0, 1.0, 6.0, 4.0, 1.0),
 ('Arts', 3.5, 2.0, 4.5, 3.5, 2.0, 4.5),
 ('Astronomy', 0.0, 3.0, 0.0, 0.0, 3.0, 0.0),
 ('Astrophysics', 5.0, 5.0, 7.0, 5.0, 5.0, 7.0),
 ('Biochemical Engineering', 4.0, 3.0, 4.0, 4.0, 3.0, 4.0),
 ('Biochemistry', 3.5, 3.0, 4.5, 3.5, 3.0, 4.5),
 ('Biochemi

In [71]:
subset_summary_flat = pd.DataFrame(subset_summary.to_records())
subset_summary_flat.head(1)

Unnamed: 0,major,"('mean', 'anxious')","('mean', 'bookish')","('mean', 'calm')","('median', 'anxious')","('median', 'bookish')","('median', 'calm')"
0,None yet,7.0,3.0,3.0,7.0,3.0,3.0


In [79]:
new_cols = ['major']+['_'.join(eval(col)) for col in subset_summary_flat.columns[1:]]
subset_summary_flat.columns = new_cols
subset_summary_flat.head()

Unnamed: 0,major,mean_anxious,mean_bookish,mean_calm,median_anxious,median_bookish,median_calm
0,None yet,7.0,3.0,3.0,7.0,3.0,3.0
1,+ACI-+ACIAIg-hotel and restaurant management+A...,2.0,2.0,7.0,2.0,2.0,7.0
2,Aerospace Engineer,2.0,2.0,7.0,2.0,2.0,7.0
3,Aerospace Engineering,3.0,4.0,3.0,3.0,4.0,3.0
4,Agricultural Economics,2.0,2.0,6.0,2.0,2.0,6.0


In [76]:
tempvar = 100

In [77]:
eval("tempvar")

100

The `.names` attribute on the index and the columns will show you the hierarchy of labels. The row index is "major", and the two column indices are None and 'variable' (the aggregate functions get no label from pivot table in this case). 

If you print out the columns, you can see it has become a pandas `MultiIndex` object that has levels, labels, and names. I won't go into too much detail on this; the pandas documentation on MultiIndexes has a lot more information.

Indexing along the hierarchical column headers can be done with chained bracket keys, with the top level column label in the first bracket down to the bottom level.

In some cases you can just split them up by comma within the brackets.
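
The label hierarchy and the indexing styles described here can be sketched on a made-up summary table (`toy_long` and `toy_summary` are hypothetical stand-ins for the real summary):

```python
import pandas as pd

# Hypothetical long-format data, summarized by major.
toy_long = pd.DataFrame({
    'major': ['Art', 'Art', 'Biology', 'Biology'],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [4.0, 5.0, 3.0, 6.0],
})
toy_summary = pd.pivot_table(toy_long, index=['major'], columns=['variable'],
                             values='value', aggfunc=['mean', 'median'])

# The hierarchy of labels on the rows and on the columns.
row_names = list(toy_summary.index.names)
column_names = list(toy_summary.columns.names)

# Chained bracket keys: top-level column label first, then the next level, then the row.
chained = toy_summary['mean']['anxious']['Art']

# A tuple of column labels selects the same column in one step.
by_tuple = toy_summary[('mean', 'anxious')]['Art']

# .loc takes the row label and the column tuple, split by a comma.
by_loc = toy_summary.loc['Art', ('mean', 'anxious')]

print(row_names, column_names, chained, by_tuple, by_loc)
```

All three lookups should return the same cell; which style to use is mostly a matter of readability.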

## Converting a MultiIndex dataframe to "flat"

Personally, while I see multiindex dataframes as potentially useful and a cool concept, I think the overhead and confusion of subsetting and masking them is annoying, especially when you have to start pulling data out of these DataFrames for modeling.

To "flatten" a multi-indexed dataframe, you can use the `.to_records()` function. To make this a new dataframe, it needs to be wrapped in `pd.DataFrame()`, as in the `subset_summary_flat` cell above.

You can see that the new column names are string versions of the tuples that make up the hierarchy of the multiindexed columns. You can convert these to new, more easily indexed column names with a list comprehension, such as the one that builds `new_cols` above.

The **eval** function takes a string and tries to evaluate it as if it were a Python expression, so it will run whatever code the string contains. Be careful with this function.
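
If you would rather avoid `eval` altogether, one alternative (a sketch, not the approach used in the cells above) is to join the column tuples directly on the MultiIndex before resetting the index. Iterating over MultiIndex columns yields real tuples, so no string evaluation is needed. The data here is made up for illustration.

```python
import pandas as pd

# Hypothetical summary table with hierarchical columns.
toy_long = pd.DataFrame({
    'major': ['Art', 'Art', 'Biology', 'Biology'],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [4.0, 5.0, 3.0, 6.0],
})
toy_summary = pd.pivot_table(toy_long, index=['major'], columns=['variable'],
                             values='value', aggfunc=['mean', 'median'])

flat = toy_summary.copy()
# Each column label is a tuple such as ('mean', 'anxious'); join it into one string.
flat.columns = ['_'.join(col) for col in flat.columns]
# Move 'major' out of the index and into an ordinary column.
flat = flat.reset_index()

print(flat.columns.tolist())
```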

---

## Preface to merging/joining: long and wide data

You will practice merging and joining much more tomorrow, but this section is a preview for what is to come with a focus on the difference between merging long and wide datasets together.

Load in the data we've been using above, but now split up with just the demographic variables in one dataset and the survey question answers in another. These datasets are in wide format, and they both contain `subject_id` to identify who the questions are for. 

As you may recall, the demographic responses have fewer observations.

In [6]:
n_demos_file = '~/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_demo_sample.csv'
n_survey_file = '~/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_survey.csv'

#demos_subset = pd.read_csv(n_demos_file)
#survey = pd.read_csv(n_survey_file)

### Pandas `.merge()` function

`.merge()` is a built-in method of a DataFrame. The first argument is the other DataFrame that you want to merge with, and the `on` keyword argument is the key or keys that you want the DataFrames to be "matched" on.

We are specifying `how='inner'` here, which means that a `subject_id` has to be present in both dataframes for its rows to be merged and returned. Because the demographics dataset has fewer subject_ids, only the rows of the survey dataset whose `subject_id` is also present in the demographics dataset are kept.
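
A minimal sketch of an inner merge, using two small made-up dataframes in place of the real files (`toy_demos` and `toy_survey` are hypothetical):

```python
import pandas as pd

# Hypothetical demographics: only subjects 1 and 2 have demographic data.
toy_demos = pd.DataFrame({
    'subject_id': [1, 2],
    'gender': [1.0, 2.0],
    'age': [50.0, 22.0],
})

# Hypothetical survey answers: all four subjects responded.
toy_survey = pd.DataFrame({
    'subject_id': [0, 1, 2, 3],
    'anxious': [1.0, 4.0, 7.0, 4.0],
    'calm': [7.0, 6.0, 2.0, 7.0],
})

# Inner merge: keep only subject_ids that appear in BOTH dataframes.
toy_merged = toy_survey.merge(toy_demos, on='subject_id', how='inner')

print(toy_merged)
```

Subjects 0 and 3 drop out of the result because they have no matching row in the demographics dataframe.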

### Make the demographic and survey data long format using melt

This is the same way we used melt in a previous section. 

- For the demographic dataframe, specify two id_vars, gender and subject_id.
- For the survey dataframe, only specify subject_id for id_vars

Merge the long format datasets together just like we did before with the wide format data.

Here we will still merge on 'subject_id' with `how='inner'`. Both dataframes have columns with the same names ('variable' and 'value'). We can specify `suffixes=('_survey','_demo')` so that the survey and demographic versions of those columns get distinct, appropriate names when they are joined together.
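
These steps might be sketched as follows on made-up data (the `toy_` dataframes are hypothetical stand-ins for the real demographic and survey files). The survey dataframe is on the left of the merge, so it receives the first suffix.

```python
import pandas as pd

# Hypothetical wide-format inputs.
toy_demos = pd.DataFrame({
    'subject_id': [1, 2],
    'gender': [1.0, 2.0],
    'age': [50.0, 22.0],
    'hand': [1.0, 2.0],
})
toy_survey = pd.DataFrame({
    'subject_id': [0, 1, 2],
    'anxious': [1.0, 4.0, 7.0],
    'calm': [7.0, 6.0, 2.0],
})

# Melt each to long format: two id_vars for demographics, one for the survey.
toy_demos_long = pd.melt(toy_demos, id_vars=['subject_id', 'gender'])
toy_survey_long = pd.melt(toy_survey, id_vars=['subject_id'])

# Both long frames have 'variable' and 'value' columns, so suffixes tell them apart.
toy_both_long = toy_survey_long.merge(toy_demos_long, on='subject_id', how='inner',
                                      suffixes=('_survey', '_demo'))

print(toy_both_long)
```

Every survey row for a subject is paired with every demographic row for that subject, so the merged long dataframe can grow quickly.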

### Pivot with the merged long dataframe

Now, use the `pivot_table` function on the merged long dataframe of demographics and survey data. For `columns`, pass both the survey variable column and the demographics variable column. Make `values` the survey value column and `index` gender. Set the aggregate function to the mean.

For example:

```python
demo_survey_means = pd.pivot_table(demos_survey_long, columns=['variable_survey', 'variable_demo'], 
                                   values='value_survey',
                                   index=['gender'], aggfunc=[np.mean],
                                   fill_value=np.nan)
```

You can see that if you specify multiple variable columns in the `columns` argument, it will stack them in a hierarchical column setup. So, for every variable in `variable_survey`, you get the mean for each gender for each variable in `variable_demo`.

A simpler version puts only `variable_demo` in the `columns` argument, in which case it calculates the mean across those variables for each gender in the dataframe.
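
A sketch of that simpler version on made-up data follows (the toy inputs are hypothetical; `demos_survey_long` is rebuilt here from them so that the example runs on its own):

```python
import pandas as pd

# Hypothetical wide-format inputs.
toy_demos = pd.DataFrame({
    'subject_id': [1, 2],
    'gender': [1.0, 2.0],
    'age': [50.0, 22.0],
    'hand': [1.0, 2.0],
})
toy_survey = pd.DataFrame({
    'subject_id': [1, 2],
    'anxious': [4.0, 7.0],
    'calm': [6.0, 2.0],
})

# Melt both to long format and merge them on subject_id.
demos_survey_long = pd.melt(toy_survey, id_vars=['subject_id']).merge(
    pd.melt(toy_demos, id_vars=['subject_id', 'gender']),
    on='subject_id', how='inner', suffixes=('_survey', '_demo'))

# Only variable_demo in the columns: one mean of the survey values
# per gender for each demographic variable.
simple_means = pd.pivot_table(demos_survey_long, columns=['variable_demo'],
                              values='value_survey', index=['gender'],
                              aggfunc='mean')

print(simple_means)
```

In this toy data every demographic variable is paired with all of a subject's survey values, so the mean is the same across the demographic columns within each gender row.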