<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## `pandas` Long Format, Wide Format, Pivot Tables, and Melting
_Instructor: Amine Mehablia_
___
<br>

This lesson is all about **transforming data** using `pandas`. Data transformation is the reorganization of your data set's rows and columns into a different, potentially **more useful shape and format**. 

The benefits of transforming your data include **better access to relevant information** and **streamlined data manipulation**. As you become more familiar with data sets and their associated operations, you will develop an intuition and appreciation for when it's better to **work row-wise or column-wise**.

Different data formats are better for different tasks. It takes time and experience to learn the distinctions. But, for now, we'll introduce the **common structures, transformations, and how to apply these transformations**.

### Learning Objectives
- Understand the differences between **long and wide format data**.
- Understand **pivot tables**.
- Practice transforming data between **long and wide** formats.
- Practice creating pivot tables.
- Learn how to avoid **common pitfalls and obstacles** in data transformation with `pandas`.


### Lesson Guide

- [Wide Format Data](#wide_format)
- [Load and Examine the NPAS Data](#load_nerdy)
- [Long Format Data](#long_format)
- [Using `pandas`' `.pivot_table()` Function: Long to Wide Format](#pivot_tables)
- [MultiIndex/Hierarchical Indices in `pandas`](#multiindex)
- [Using `pandas`' `.melt()` Function: Wide to Long Format](#melt)
- [Summarizing Data With `.pivot_table()` and Aggregate  Functions](#pivot_table_summarizing)
- [The Inner Workings of the MultiIndex](#examining_multiindex)
- [Getting Rid of the MultiIndex: "Flattening" Data](#multiindex_to_flat)
- [A Preface: Merging and Joining With Long and Wide Format Data](#merging_joining_preface)
- [`pandas`' `.merge()` function: Joining Long Format vs. Wide Format Data](#pandas_merge)


In [2]:
import numpy as np                 # numerical computing
import scipy.stats as stats        # statistical functions
import seaborn as sns              # statistical plotting
import matplotlib.pyplot as plt    # plotting
import pandas as pd                # DataFrames and data manipulation

<a id='wide_format'></a>

### Wide Format Data

---

Between "wide" and "long," **wide format data is the more intuitive**. It's also a common format for `.csv` files. You've already viewed multiple data sets in wide format throughout this course.

Wide format data is structured so that:

- Unique IDs, subjects, observations, etc. are represented as **rows**.
- Distinct information categories (**variables**) are represented as columns. In other words, there is a **column for every "variable"** with its own unique values.
- This format can often be a more compact matrix, particularly if little or no information is missing.
- It is **not as useful for SQL-style operations**: It can make it much harder or even impossible to **join tables together on a value**.
- It can be useful in `pandas` when you need to perform operations on variables **across columns**; for example, multiplying columns together to create a new column.
- It is the format that most statistical modeling tools expect (with few exceptions).
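
To make the list above concrete, here is a minimal sketch of wide format data. The DataFrame `toy_wide` and its values are made up for illustration and are not taken from the survey file.

```python
import pandas as pd

# Hypothetical toy data: one row per subject, one column per variable.
toy_wide = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "anxious": [4.0, 7.0, 5.0],
    "calm": [6.0, 2.0, 4.0],
})

# Operations across columns are simple in wide format:
# multiply two variable columns to create a new column.
toy_wide["anxious_x_calm"] = toy_wide["anxious"] * toy_wide["calm"]

print(toy_wide)
```

Each subject occupies exactly one row, so the new column lines up with the existing ones without any grouping or joining.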

<a id='load_nerdy'></a>

### Load and Examine the "Nerdy Personality Attributes" Data Set

---

This is a pre-cleaned and modified version of the full "Nerdy Personality Attributes" survey, which asked subjects to rate themselves based on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Researchers also collected demographic information from the subjects.

You can find the raw data [here](http://personality-testing.info/_rawdata/), along with many other sociological surveys.

In this modified version, for the sake of our example, some of the subjects provided data for the survey but not the demographic variables. Because there are missing values and the data is "messy," we have a data cleaning problem.

**Load the data (which is in wide format).** 

In [3]:
nerdy_wide_f = 'data/NPAS_parsed_trunc_wide_missing.csv'

# load the wide format data into a DataFrame
nerdy_wide = pd.read_csv(nerdy_wide_f)

In [5]:
nerdy_wide.head()  # not displayed: a notebook cell shows only its last expression
nerdy_wide.tail()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
1386,1386,4.0,18.0,5.0,2.0,4.0,3.0,3.0,2.0,6.0,...,8.0,6.0,4.0,4.0,5.0,3.0,2.0,3.0,5.0,5.0
1387,1387,4.0,,2.0,3.0,4.0,6.0,4.0,5.0,2.0,...,,5.0,5.0,4.0,5.0,,,5.0,3.0,2.0
1388,1388,3.0,17.0,2.0,1.0,4.0,5.0,5.0,1.0,2.0,...,1.0,7.0,3.0,5.0,3.0,2.0,2.0,4.0,5.0,3.0
1389,1389,2.0,21.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,...,2.0,6.0,4.0,4.0,4.0,2.0,2.0,5.0,3.0,1.0
1390,1390,5.0,28.0,4.0,1.0,5.0,6.0,5.0,3.0,2.0,...,2.0,1.0,5.0,5.0,7.0,2.0,2.0,5.0,5.0,1.0


This data set is in a familiar format in which each column is a variable and each row contains an observation for that variable, corresponding to a distinct subject.

*Wide format implies that all of the information for one distinct subject **will be represented in the columns corresponding to that row**. A single subject should not be represented in multiple rows of data.*

In [6]:
# First let's print the columns:
nerdy_wide.columns

Index(['subject_id', 'academic_over_social', 'age', 'anxious', 'bookish',
       'books_over_parties', 'calm', 'collect_books', 'conventional',
       'critical', 'dependable', 'diagnosed_autistic', 'disorganized',
       'education', 'engnat', 'enjoy_learning', 'excited_about_research',
       'extraverted', 'familysize', 'gender', 'hand', 'hobbies_over_people',
       'in_advanced_classes', 'intelligence_over_appearance',
       'interested_science', 'introspective', 'libraries_over_publicspace',
       'like_dry_topics', 'like_hard_material', 'like_science_fiction',
       'like_superheroes', 'major', 'married', 'online_over_inperson',
       'opennness', 'play_many_videogames', 'playes_rpgs',
       'prefer_fictional_people', 'race_arab', 'race_asian', 'race_black',
       'race_hispanic', 'race_native_american', 'race_native_austrailian',
       'race_nerdy', 'race_white', 'read_tech_reports', 'religion', 'reserved',
       'socially_awkward', 'strange_person', 'sympathetic', 'urb

In [7]:
#nerdy_wide.major.unique()

In [9]:
nerdy_wide.shape  # (number of rows, number of columns)

(1391, 57)

In [11]:
nerdy_wide.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subject_id,1391.0,695.0,401.691424,0.0,347.5,695.0,1042.5,1390.0
academic_over_social,1391.0,3.702372,1.170003,0.0,3.0,4.0,5.0,5.0
age,700.0,26.265714,12.84209,14.0,18.0,21.0,30.0,90.0
anxious,1391.0,4.464414,1.96882,0.0,3.0,5.0,6.0,7.0
bookish,1391.0,3.643422,1.29724,0.0,3.0,4.0,5.0,5.0
books_over_parties,1391.0,4.102804,1.054725,0.0,4.0,4.0,5.0,5.0
calm,1391.0,4.383178,1.853786,0.0,3.0,5.0,6.0,7.0
collect_books,1391.0,3.853343,1.185064,0.0,3.0,4.0,5.0,5.0
conventional,1391.0,2.560029,1.630363,0.0,1.0,2.0,4.0,7.0
critical,1391.0,4.209921,1.855713,0.0,3.0,5.0,6.0,7.0


**Check to see how many null values there are per column.**

*Tip:* An easy way is to chain the `.isnull()` method with `.sum()`.

In [12]:
# Now let's count the null values by column:
nerdy_wide.isnull().sum()

subject_id                        0
academic_over_social              0
age                             691
anxious                           0
bookish                           0
books_over_parties                0
calm                              0
collect_books                     0
conventional                      0
critical                          0
dependable                        0
diagnosed_autistic                0
disorganized                      0
education                       691
engnat                          691
enjoy_learning                    0
excited_about_research            0
extraverted                       0
familysize                      691
gender                          691
hand                            691
hobbies_over_people               0
in_advanced_classes               0
intelligence_over_appearance      0
interested_science                0
introspective                     0
libraries_over_publicspace        0
like_dry_topics             

The 691 missing demographic variables are intentional (I specifically enforced that only 700 of the subjects would have demographic information).

However, we can see that the `major` variable has 970 missing values. This was not an intentional change.

At this point, if we were to just **drop all the rows that have any null values, we would lose at least 970 rows** because of the missing `major` variable.

With a numeric column, this would be hard to avoid without "imputing" some number to fill in those values. In the simplest case, **imputing the mean or median for missing numeric values** is a common fix (but not ideal).
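
As a minimal sketch of the median imputation mentioned above, the example below fills one missing value in a made-up `age` column. The DataFrame `toy` and the column name `age_imputed` are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "age": [18.0, np.nan, 21.0, 30.0],
})

# .median() skips NaN by default, so this is the median of 18, 21 and 30.
age_median = toy["age"].median()

# Keep the original column and store the imputed version next to it.
toy["age_imputed"] = toy["age"].fillna(age_median)

print(toy)
```

Writing the result to a separate column keeps the original missingness visible, which makes it easier to check later how much the imputation affected an analysis.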

With a **categorical variable** like `major`, we have the luxury of replacing the missing values with a new category label that stands for "missing." 

**Replace the missing `major` column values with `unknown`.**

In [18]:
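# Fill the missing values in the major column with "unknown".
# Assigning the result back avoids a chained `inplace=True` call,
# which newer pandas versions warn about and may not apply to the original DataFrame.
nerdy_wide['major'] = nerdy_wide['major'].fillna('unknown')

# Equivalent approach with a boolean mask:
# null_mask = nerdy_wide['major'].isnull()
# nerdy_wide.loc[null_mask, 'major'] = 'unknown'
nerdy_wide['major'].head()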

0       unknown
1    biophysics
2       biology
3       unknown
4       unknown
Name: major, dtype: object

In [19]:
# if all goes right, you should not have any missing values left
print(nerdy_wide.major.isnull().sum())

0


<a id='long_format'></a>

### Long Format Data

---

Now, we can load the same data — this time in the format commonly called "long."

Long format data is structured so that:

- There are potentially multiple `ID` (identification) columns.
- There are pairs of columns such as `variable:value` that match a variable key to a value (In the simplest case, there would be a single `variable` column and a single `value` column).
- The `variable` column corresponds to the multiple variable columns in a wide format data set. Instead of a column for each variable, you have a row for each `variable:value` pair *per ID*. 
- This is a standard format for SQL databases because it makes it easier to join different tables together with keys.
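
The structure described above can be sketched with a small, made-up long format table. The DataFrame `toy_long` is hypothetical and not taken from the survey file.

```python
import pandas as pd

# Hypothetical toy data: one row per (subject, variable) pair.
toy_long = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3, 3],
    "variable": ["anxious", "calm", "anxious", "calm", "anxious", "calm"],
    "value": [4.0, 6.0, 7.0, 2.0, 5.0, 4.0],
})

# Three subjects, but six rows: each subject appears once per variable.
n_subjects = toy_long["subject_id"].nunique()
n_rows = len(toy_long)
print(n_subjects, n_rows)

# Selecting a single variable is a row filter rather than a column lookup.
calm_rows = toy_long[toy_long["variable"] == "calm"]
print(calm_rows)
```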

**Load the long format of the same data below.**

In [20]:
nerdy_long_f = 'data/NPAS_parsed_trunc_long_missing.csv'

# load the long format data into a DataFrame
nerdy_long = pd.read_csv(nerdy_long_f)

You can see that the long format data has far more rows than the wide data set but only three columns.

Below you can view the three columns: `subject_id`, `variable`, and `value`.

**`subject_id:`**
- This is the primary "key" or `ID` column. Each `subject_id` appears on multiple rows, one for each variable recorded for that subject.

**`variable:`**
- This column indicates the variable with which the item in the `value` column corresponds.

**`value:`**

- This contains all values for all variables for all IDs. Essentially, every cell in the wide data set except the `subject_id` is listed in this column.

In [25]:
# print the header:
nerdy_long.head(20)

Unnamed: 0,subject_id,variable,value
0,1,education,4.0
1,2,education,3.0
2,5,education,2.0
3,6,education,2.0
4,7,education,2.0
5,8,education,3.0
6,9,education,1.0
7,10,education,2.0
8,14,education,3.0
9,15,education,2.0


**Print out the unique values in the `variable` column.**

You can see that the unique values in the `variable` column correspond to the column headers in the wide format data.

*Tips: use the .unique() method*

In [29]:
# print the unique values in the variable column:
nerdy_long.variable.unique()

array(['education', 'urban', 'gender', 'engnat', 'age', 'hand',
       'religion', 'voted', 'married', 'familysize', 'major',
       'race_white', 'race_nerdy', 'race_native_american',
       'writing_novel', 'read_tech_reports', 'online_over_inperson',
       'introspective', 'hobbies_over_people', 'books_over_parties',
       'bookish', 'libraries_over_publicspace', 'race_native_austrailian',
       'like_hard_material', 'race_hispanic', 'diagnosed_autistic',
       'play_many_videogames', 'race_arab', 'race_asian',
       'interested_science', 'playes_rpgs', 'in_advanced_classes',
       'collect_books', 'intelligence_over_appearance',
       'watch_science_shows', 'academic_over_social',
       'like_science_fiction', 'like_dry_topics', 'race_black', 'calm',
       'disorganized', 'extraverted', 'dependable', 'critical',
       'opennness', 'anxious', 'sympathetic', 'reserved', 'conventional',
       'was_odd_child', 'prefer_fictional_people', 'enjoy_learning',
       'excited_abou

In [36]:
# count the unique subject ids:
len(nerdy_long.subject_id.unique())


1391

**Replace the missing values in `major` with `unknown` in the long format data set.**

The process for replacing data will be different because of the format. Using logical selection masks with `pandas`' `.loc` syntax is the preferable way to do this.

In [38]:
# Count the missing values in the value column (across all variables, not just major):
nerdy_long.value.isnull().sum()


279

In [39]:
# replace the missing values for major in the long dataset with "unknown":
major_mask = (nerdy_long.variable == 'major') & (nerdy_long.value.isnull())
nerdy_long.loc[major_mask, 'value'] = 'unknown'

In [40]:
# check that there are no missing values left:
print(nerdy_long[nerdy_long.variable == 'major'].isnull().sum())

# you should get only 0s

subject_id    0
variable      0
value         0
dtype: int64


<a id='pivot_tables'></a>

### Using `pandas`' `.pivot_table()` Function: Long to Wide Format

---

The `pd.pivot_table()` function is a powerful tool for both transforming data from long to wide format and summarizing data with user-supplied functions.

First, we'll look at transforming the long format data back into the wide format using the `.pivot_table()` function.

**Important parameters for the `.pivot_table()` function include:**

> The `pivot_table()` function takes a DataFrame to pivot as its first argument. 
    
- **`columns`**: This is the list of columns in the long format data to be transformed back into columns in the wide format. After pivoting, each unique value in the long format column becomes a header in the wide format.
- **`values`**: A single column indicating the values to use when pivoting and filling the new wide format columns.
- **`index`**: Columns in the long format data that are index variables. These will be left as single columns, not spread out by unique value like in the `columns` parameter.
- **`aggfunc`**: Often `.pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function." It is optional and defaults to the mean. You can also pass your own function, which we'll demonstrate below.
- **`fill_value`**: If a cell is missing for the wide format data, this value will fill it in.
    
One option is to pass a custom function, `select_item_or_nan()`, to the `aggfunc` keyword argument. Because each `subject_id` has a single value per variable, I just want that single element from the long format `value` cell. My data is messy, so the function has to check for places where it could break.

**Note:** Passed into my function, `x` will be a Series object. I pull out the first element of it using the `.iloc` indexer.
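
The lesson names `select_item_or_nan()` without showing its body. The following is a minimal sketch of what such a function might look like, assuming it only needs to guard against an empty group. It is not the original implementation, and the DataFrame `toy_long` is made up for the example.

```python
import numpy as np
import pandas as pd

def select_item_or_nan(x):
    """Return the first element of the Series x, or NaN if x is empty.

    Hypothetical helper: the name comes from the lesson text,
    but this body is an assumption.
    """
    if len(x) == 0:
        return np.nan
    return x.iloc[0]

# Hypothetical long format data with one value per subject and variable.
toy_long = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "variable": ["anxious", "calm", "anxious", "calm"],
    "value": [4.0, 6.0, 7.0, 2.0],
})

# Long to wide: one row per subject, one column per variable.
toy_wide = pd.pivot_table(
    toy_long,
    index="subject_id",
    columns="variable",
    values="value",
    aggfunc=select_item_or_nan,
)
print(toy_wide)
```

Because each group holds exactly one value here, taking the first element simply moves the value into its wide format cell without summarizing anything.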

### Let's make sure `value` contains only numeric values:

*Note: The lambda operator or lambda function is used for creating small, one-time, anonymous function objects in Python. It is not the focus of this lesson. We will cover it at a later stage. Do not worry about understanding it for now.*

In [46]:
nerdy_long.shape

(70295, 3)


In [47]:
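def is_float(s):
    """Return True if s can be converted to a float, otherwise False."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers inputs such as None that float() cannot accept
        return False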

In [48]:
# Boolean mask: True where the value can be converted to a number, False otherwise
mask = nerdy_long.value.map(lambda x: is_float(x))

In [49]:
# Example1 of Lambda function

x = lambda a: a + 10
print(x(5))

15


In [50]:
# Example2 of Lambda function

x = lambda a, b : a * b
print(x(5, 6))

30


#### Why use lambda functions?

The power of a lambda is better shown when you use it as an anonymous function inside another function.

Say you have a function definition that takes one argument, and that argument will be multiplied by a number that is not known in advance.
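
The idea described above can be sketched as follows. The name `make_multiplier` is hypothetical and chosen only for this illustration.

```python
def make_multiplier(n):
    # Return an anonymous function that multiplies its argument by n.
    # The value of n is fixed at the time make_multiplier is called.
    return lambda a: a * n

doubler = make_multiplier(2)
tripler = make_multiplier(3)

print(doubler(11))  # 22
print(tripler(11))  # 33
```

The same function definition produces two different functions, which is the pattern at work when a lambda is passed to `.map()` or `.apply()`.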

#### Now remove non-numeric values using the mask

In [73]:
nerdy_long_only_num = nerdy_long[mask].copy()
# nerdy_long_only_num.reset_index(drop=True, inplace=True)

In [74]:
nerdy_long_only_num

Unnamed: 0,subject_id,variable,value
0,1,education,4.0
1,2,education,3.0
2,5,education,2.0
3,6,education,2.0
4,7,education,2.0
5,8,education,3.0
6,9,education,1.0
7,10,education,2.0
8,14,education,3.0
9,15,education,2.0


In [75]:
nerdy_long_only_num.dtypes

subject_id     int64
variable      object
value         object
dtype: object

#### Convert the column `value`  from the dataframe `nerdy_long_only_num` to float

In [76]:
nerdy_long_only_num['value'] = nerdy_long_only_num['value'].astype(float)

In [77]:
nerdy_long_only_num.dtypes
#nerdy_long_only_num.shape

subject_id      int64
variable       object
value         float64
dtype: object

In [78]:
nerdy_long.dtypes


subject_id     int64
variable      object
value         object
dtype: object

#### Finally, pivot the data on `subject_id` and `variable` using `.pivot()`

In [81]:
nerdy_long_only_num.pivot(index='subject_id', columns='variable')

(Column index level 0 is `value`; level 1, `variable`, is shown in the header below.)
subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0
5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,5.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,2.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,5.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0
8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,5.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
9,3.0,17.0,6.0,4.0,5.0,5.0,4.0,2.0,6.0,3.0,...,1.0,7.0,5.0,5.0,4.0,2.0,2.0,5.0,5.0,5.0
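
`.pivot()` works above because each `subject_id` and `variable` pair occurs only once. The sketch below, on made-up data, shows what happens when a pair is duplicated: `.pivot()` raises a `ValueError`, while `.pivot_table()` aggregates the duplicates. The DataFrame `toy_dup` is hypothetical.

```python
import pandas as pd

# Hypothetical long data where subject 1 has two "calm" rows.
toy_dup = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "variable": ["calm", "calm", "anxious", "calm"],
    "value": [6.0, 4.0, 3.0, 2.0],
})

# .pivot() only reshapes, so duplicate (index, column) pairs are an error.
try:
    toy_dup.pivot(index="subject_id", columns="variable", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print("pivot failed:", pivot_failed)

# .pivot_table() aggregates duplicates, here with the mean.
toy_mean = toy_dup.pivot_table(
    index="subject_id", columns="variable", values="value", aggfunc="mean"
)
print(toy_mean)
```

Subject 2 has no `anxious` row, so that cell is `NaN` in the pivoted table.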


<a id='melt'></a>

### Using pandas' `.melt()` Function: Wide to Long Format

---
First, let's reload a fresh copy of the data:

In [82]:
nerdy_wide_flat = pd.read_csv('data/nerdy_wide_flat.csv')

In [83]:
nerdy_wide_flat.head()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
2,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
3,6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
4,7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0


**`.melt()`** is a function that essentially performs the inverse of `.pivot_table()` on DataFrames.

`.melt()` takes a DataFrame as its first argument. Additional arguments typically used with this function are:

- **`id_vars`**: The column or columns that will be ID variables. ID variables are kept as columns and repeated on every row; they identify which observation each `variable` and `value` pair belongs to.
- **`value_vars`**: A list that specifies which columns should be converted into single `value` and `variable` columns.
- **`var_name`**: The header name of the `variable` column (default='variable').
- **`value_name`**: The header name of the `value` column (default='value').
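
Here is a minimal sketch that uses all four arguments listed above on a made-up DataFrame. The names `toy_wide`, `question`, and `answer` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical wide data: two ID columns and two variable columns.
toy_wide = pd.DataFrame({
    "subject_id": [1, 2],
    "major": ["biology", "geology"],
    "anxious": [4.0, 7.0],
    "calm": [6.0, 2.0],
})

toy_long = pd.melt(
    toy_wide,
    id_vars=["subject_id", "major"],   # kept as columns, repeated per row
    value_vars=["anxious", "calm"],    # columns to stack into rows
    var_name="question",               # replaces the default name 'variable'
    value_name="answer",               # replaces the default name 'value'
)
print(toy_long)
```

Two subjects and two melted columns give four rows, one per subject and question.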

**First, subset the wide format data into just these columns, as an example: `['subject_id','anxious','bookish','calm','major']`.**

In [85]:
# subset the wide data:
nerdy_subset = nerdy_wide_flat[['subject_id','anxious','bookish','calm','major']]
nerdy_subset.head()

Unnamed: 0,subject_id,anxious,bookish,calm,major
0,1,4.0,4.0,6.0,biophysics
1,2,7.0,5.0,2.0,biology
2,5,5.0,3.0,4.0,Geology
3,6,1.0,4.0,6.0,missing
4,7,7.0,3.0,1.0,missing


**Use `.melt()` on the subset with `id_vars=['subject_id','major']`.**

Print out the shape of the data and the header. The non-ID columns and their values are now represented by the `variable:value` column pair.

##### **Note**: When you only specify the `id_vars`, the remaining columns become part of the `variable` and `value` columns.

In [86]:
nerdy_sub_long = pd.melt(nerdy_subset, id_vars=['subject_id','major'])
nerdy_sub_long[nerdy_sub_long['subject_id'] == 912].head(30)

Unnamed: 0,subject_id,major,variable,value
448,912,Molecular Biology,anxious,2.0
1148,912,Molecular Biology,bookish,4.0
1848,912,Molecular Biology,calm,6.0


If we don't specify `major` as an `id_var`, it will end up in the `variable` column.

In [87]:
# with only subject_id as an ID variable, the four remaining columns (including major) are melted
nerdy_sub_long = pd.melt(nerdy_subset, id_vars='subject_id')
print(nerdy_subset.shape, nerdy_sub_long.shape)
nerdy_sub_long.head(4)

(700, 5) (2800, 3)


Unnamed: 0,subject_id,variable,value
0,1,anxious,4
1,2,anxious,7
2,5,anxious,5
3,6,anxious,1


In [103]:
# with two ID variables (subject_id, major) and two melted columns (bookish, calm)
nerdy_sub_long = pd.melt(nerdy_wide_flat[['subject_id','major', 'bookish','calm']], id_vars=['subject_id','major'])
print(nerdy_wide_flat.shape, nerdy_sub_long.shape)
nerdy_sub_long.head(4)

(700, 57) (1400, 4)


Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,bookish,4.0
1,2,biology,bookish,5.0
2,5,Geology,bookish,3.0
3,6,missing,bookish,4.0


The more `id_vars` we specify, the fewer columns are melted, so the resulting DataFrame has fewer rows and more columns.

You can achieve the same result without having to subset the DataFrame first by simply specifying the `value_vars` keyword argument. The output DataFrame will then only contain the data specified in the `id_vars` and `value_vars` arguments.

**Create the same DataFrame with `.melt()` on the full wide data set, but select the columns to use with the `value_vars` argument.**

In [104]:
nerdy_sub_long = pd.melt(nerdy_wide_flat, id_vars=['subject_id','major'], 
                         value_vars=['anxious','bookish','calm'])

In [105]:
# print the datatypes
nerdy_sub_long.dtypes

subject_id      int64
major          object
variable       object
value         float64
dtype: object

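The `value` column is already `float64` here because every melted column was numeric. If a text column such as `major` had been melted as well, `value` would hold mixed types and would need converting.

In [106]:
# ensure the value is a float (a no-op here, since the column is already float64)
nerdy_sub_long['value'] = nerdy_sub_long['value'].astype(float)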

In [107]:
nerdy_sub_long.head()

Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,anxious,4.0
1,2,biology,anxious,7.0
2,5,Geology,anxious,5.0
3,6,missing,anxious,1.0
4,7,missing,anxious,7.0
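
When a text column such as `major` is melted together with numeric columns, the `value` column can no longer be purely numeric and is typically stored with the `object` dtype. A minimal sketch on made-up data, using `pd.to_numeric` with `errors="coerce"` to recover the numbers; the names `toy_wide` and `value_num` are hypothetical.

```python
import pandas as pd

# Hypothetical wide data with one text column and one numeric column.
toy_wide = pd.DataFrame({
    "subject_id": [1, 2],
    "major": ["biology", "geology"],
    "calm": [6.0, 2.0],
})

# Only subject_id is an ID variable, so major and calm are both melted.
mixed_long = pd.melt(toy_wide, id_vars="subject_id")
print(mixed_long.dtypes)

# Convert what can be converted; text entries become NaN.
mixed_long["value_num"] = pd.to_numeric(mixed_long["value"], errors="coerce")
print(mixed_long)
```

Coercing to `NaN` is only appropriate if the text rows are not needed as numbers. Otherwise, keep text columns in `id_vars` so they never enter the `value` column.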


<a id='pivot_table_summarizing'></a>

### Summarizing Your Data With  `.pivot_table()` and Aggregate Functions

---
First, let's reload a fresh copy of the data:

In [108]:
nerdy_sub_long = pd.read_csv('data/nerdy_sub_long.csv')

For those of you who have experience with Excel, `pandas`' `.pivot_table()` accomplishes the same thing as an Excel pivot table. It's more powerful but harder to use than the spreadsheet version.

`.pivot_table()` can take in a variable, value, and index to group by and apply aggregate functions to summarize the data. 

**Note**: Be careful that your index variable does not identify unique rows. For example, grouping by `subject_id` and `variable` would send only one value into each aggregate function.

Below, I am calling the `.pivot_table()` function with:

- The long format data as the first argument.
- `variable` specified as the columns that indicate the variable names (groups).
- `value` specified as the column that contains the data per variable.
- `major` as the index; the rows will be grouped by `major`.
- `np.mean`, `np.median`, `np.std`, and `len` as aggregate functions. These will be calculated for each `major-by-variable` group.
- A `fill_value` of `np.nan` for cells in the output table that have no data.

In [110]:
nerdy_major_summary = pd.pivot_table(nerdy_sub_long, columns=['variable'], values='value',
                                     index=['major'], aggfunc=[np.mean, np.median, np.std, len],
                                     fill_value=np.nan)

The output DataFrame gives you a "hierarchical" column index — the three variables for each aggregate function. The row index is the `major` groups.

If you apply more index variables, the row indices will also become hierarchical! However, this can quickly make for a bloated DataFrame.

In [111]:
# print the header of the pivot table
nerdy_major_summary.head()

Unnamed: 0_level_0,mean,mean,mean,median,median,median,std,std,std,len,len,len
variable,anxious,bookish,calm,anxious,bookish,calm,anxious,bookish,calm,anxious,bookish,calm
None yet,7.0,3.0,3.0,7.0,3.0,3.0,,,,1,1,1
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-,2.0,2.0,7.0,2.0,2.0,7.0,,,,1,1,1
Aerospace Engineer,2.0,2.0,7.0,2.0,2.0,7.0,,,,1,1,1
Aerospace Engineering,3.0,4.0,3.0,3.0,4.0,3.0,,,,1,1,1
Agricultural Economics,2.0,2.0,6.0,2.0,2.0,6.0,,,,1,1,1
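
A hierarchical column index like the one above can be awkward to work with. One common way to flatten it is to join the two levels into a single string per column. The sketch below does this on made-up data and assumes that joined names such as `mean_calm` are acceptable; the DataFrame `toy_long` is hypothetical.

```python
import pandas as pd

# Hypothetical long data grouped by major.
toy_long = pd.DataFrame({
    "major": ["biology", "biology", "geology", "geology"],
    "variable": ["anxious", "calm", "anxious", "calm"],
    "value": [4.0, 6.0, 7.0, 2.0],
})

# Two aggregate functions produce a two-level column index.
summary = pd.pivot_table(
    toy_long, index="major", columns="variable",
    values="value", aggfunc=["mean", "max"],
)
print(summary.columns.tolist())

# Flatten: ('mean', 'calm') becomes 'mean_calm', and so on.
summary.columns = ["_".join(col) for col in summary.columns]

# Move the row index (major) back into an ordinary column.
summary = summary.reset_index()
print(summary)
```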


In [None]:
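def more_than_five(x):
    """Return 35 if x is greater than 2, otherwise None.

    Note: despite the name, the threshold used here is 2.
    """
    try:
        if x > 2:
            return 5 * 7
        return None
    except TypeError:
        # x could not be compared with a number (for example, a string)
        return None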

In [120]:
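# .map() applies a function element by element to a Series
# .applymap() applies a function element by element to a whole DataFrame
#   (renamed DataFrame.map() in pandas 2.1)
# .apply() applies a function to each whole column or row of a DataFrame
# Run the cell that defines more_than_five first, or this raises a NameError.

nerdy_wide_flat['collect_books'] = nerdy_wide_flat['collect_books'].map(more_than_five)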

In [119]:
nerdy_wide_flat.iloc[:, -5:].apply(lambda x: x.mean())  # column by column (default axis=0)
nerdy_wide_flat.apply(lambda x: x["urban"] + x["voted"], axis=1)  # row by row
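
The difference between element-wise, column-wise, and row-wise application can be seen on a small made-up DataFrame. The column names mirror `urban` and `voted` from the survey, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"urban": [2.0, 1.0, 3.0], "voted": [1.0, 1.0, 2.0]})

# Series.map: the function receives one element at a time.
urban_doubled = toy["urban"].map(lambda v: v * 2)

# DataFrame.apply with the default axis=0: the function receives each column.
col_means = toy.apply(lambda col: col.mean())

# DataFrame.apply with axis=1: the function receives each row.
row_sums = toy.apply(lambda row: row["urban"] + row["voted"], axis=1)

print(urban_doubled.tolist())
print(col_means)
print(row_sums.tolist())
```

For simple arithmetic like the row sum, `toy["urban"] + toy["voted"]` gives the same result and is usually faster. `.apply()` with `axis=1` is mainly useful when the per-row logic cannot be expressed with column operations.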

<a id='merging_joining_preface'></a>

### Practice: Merging and Joining With Long and Wide Format Data

---

You will be merging and joining data sets extensively throughout this course and in your future careers. However, it is important to note the differences between merging long and wide data sets together.

**Load in the data used above, but now split it so that the demographic variables are in one data set and the survey question answers are in another.** 

These data sets are in a wide format, and they both contain a `subject_id` column that identifies the subject each row belongs to.

As you may recall, the demographic responses have fewer observations.

In [97]:
n_demos_file = 'data/NPAS_parsed_trunc_demo_sample.csv'
n_survey_file = 'data/NPAS_parsed_trunc_survey.csv'
# load the files
demos_subset = pd.read_csv(n_demos_file)
survey = pd.read_csv(n_survey_file)

In [None]:
# print the header of the demos and survey


<a id='pandas_merge'></a>

### Use `pandas`' `.merge()` Function: Joining Long Format vs. Wide Format Data

---

As we saw yesterday, `.merge()` is a built-in DataFrame method. The first argument is the other DataFrame you want to merge with, and the `on` keyword argument is the key(s) by which you want the DataFrames to be "matched."

We are specifying `how='inner'` here, which means that the key must be present in both DataFrames to have the corresponding rows included in the output. Because the demographics data set has fewer `subject_id`s, it will only merge the `subject_id` rows from the survey data set that are also present in the demographics data set.
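
The behavior of an inner merge on wide data can be sketched with two small made-up DataFrames. `toy_demos` and `toy_survey` are hypothetical stand-ins for the demographic and survey files, not the files themselves.

```python
import pandas as pd

# Hypothetical wide data: demographics exist for subjects 1 and 2 only.
toy_demos = pd.DataFrame({"subject_id": [1, 2], "gender": [1.0, 2.0]})
toy_survey = pd.DataFrame({"subject_id": [1, 2, 3], "calm": [6.0, 2.0, 4.0]})

# Inner merge: keep only subject_ids present in both DataFrames.
toy_merged = toy_demos.merge(toy_survey, on="subject_id", how="inner")
print(toy_merged)
```

Subject 3 has survey answers but no demographics, so it is dropped. The result still has one row per subject because both inputs are wide.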

**Combine the survey and demographic wide format data sets using `.merge()`.**

In [None]:
# demos_survey = demos_subset.merge(survey, on=..?


In [None]:
# print the merged data header


**Convert the demographic and survey data into long format using `.melt()`.**

- For the demographic DataFrame, specify two `id_vars` — `gender` and `subject_id`.
- For the survey DataFrame, only specify `subject_id` for `id_vars`.
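
The two melts described above might look like the following on made-up data. This is a sketch on hypothetical toy DataFrames, not the solution applied to the lesson's files.

```python
import pandas as pd

# Hypothetical wide data.
toy_demos = pd.DataFrame({
    "subject_id": [1, 2],
    "gender": [1.0, 2.0],
    "age": [50.0, 22.0],
})
toy_survey = pd.DataFrame({
    "subject_id": [1, 2],
    "anxious": [4.0, 7.0],
    "calm": [6.0, 2.0],
})

# Demographics: two ID variables, so only age is melted.
toy_demos_long = pd.melt(toy_demos, id_vars=["subject_id", "gender"])

# Survey: one ID variable, so anxious and calm are both melted.
toy_survey_long = pd.melt(toy_survey, id_vars="subject_id")

print(toy_demos_long)
print(toy_survey_long)
```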

In [None]:
# melt the demographic data


In [None]:
# melt the survey data


**Merge the long form data sets together, just like we did previously with the wide format data.**

Here, we will still merge on `subject_id`, using `'inner'` for the `how` parameter. Both DataFrames have columns with the same names (`variable` and `value`). We can specify `suffixes=('_survey','_demo')` so that the columns from the survey and demographic DataFrames get distinct names when they are joined together.
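
A minimal sketch of this long format merge on made-up data follows. The DataFrames are hypothetical; the point is to show how the suffixes are applied and how the row count grows.

```python
import pandas as pd

# Hypothetical long data: two survey rows and two demographic rows per subject.
toy_survey_long = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "variable": ["anxious", "calm", "anxious", "calm"],
    "value": [4.0, 6.0, 7.0, 2.0],
})
toy_demos_long = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "variable": ["age", "urban", "age", "urban"],
    "value": [50.0, 2.0, 22.0, 1.0],
})

# Left-hand columns get '_survey', right-hand columns get '_demo'.
toy_long_merged = toy_survey_long.merge(
    toy_demos_long, on="subject_id", how="inner",
    suffixes=("_survey", "_demo"),
)
print(toy_long_merged.shape)
print(toy_long_merged)
```

Every survey row for a subject is paired with every demographic row for that subject, so two rows on each side become four rows per subject. This multiplication is the main difference from merging the wide versions, where the result keeps one row per subject.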

In [None]:
# merge the survey and demo data
